Here’s a topic that’s near and dear to our hearts here at Appen: spelling standardization.

If you’re providing training data to a computer system to produce machine translation, speech recognition, or a computer voice, it’s important to spell each word the same way every time it comes up (otherwise, you’re watering down your training data and the language model gets confused).

Even if you’re not going that high-tech and you just want to have reliable search through your database of client questions or your fieldwork notes, spelling words consistently matters.

How hard can that be? Every word has a way it should be spelled, right? Just look it up in the dictionary if you’re not sure.

Oh boy. Follow us down the rabbit hole.

Dialects

Here’s a problem straight away: is it “standardisation” or “standardization”? This one’s region-based, so it’s not too hard to come up with the relevant spelling for your database. But some cases may be more complex than this – in Norwegian there are two entirely separate spelling systems (Bokmål and Nynorsk) intended to reflect different sets of dialects.

Usually this area isn’t too hard – you decide in advance which spelling convention to follow for your chosen language and dialect. Stray spellings from other systems can be identified through automated checks and post-editing.
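As a rough illustration of what such an automated check might look like, here is a minimal Python sketch. The `VARIANTS` table and the `find_stray_spellings` function are our own illustrative names, and the three entries are only a stand-in for the much longer list a real project would need. It assumes US spellings are the chosen convention.

```python
import re

# Hypothetical mapping from non-target spellings to the chosen convention
# (US English here). A real project would load a much longer list.
VARIANTS = {
    "standardisation": "standardization",
    "colour": "color",
    "realise": "realize",
}

# Runs of letters only (no digits, underscores, or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def find_stray_spellings(text, variants=VARIANTS):
    """Return (offset, found, suggested) for each word from the other convention."""
    hits = []
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        suggestion = variants.get(word.lower())
        if suggestion is not None:
            hits.append((match.start(), word, suggestion))
    return hits
```

The function only reports candidates. Leaving the actual correction to a post-editing pass keeps a human in the loop for cases like proper nouns and quoted text.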

Register

Is it “gonna”, “goin’ a”, “gon’ to”, or “going to”? This one’s more difficult: the last spelling is the formally correct option, but in some cases it can be a long way removed from the phonetic realization coming from a speaker. What if you need to search later for specific realizations of the phrase? How do you separate the pronunciations in your lexicon, if you’re producing a speech database?

In some cases, the difference may be minimal enough that you can standardize to the dictionary form. In others it may be more sensible to adopt an informal representation.
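One way to avoid choosing between the two is to store both: the form the transcriber heard and a standardized form for searching. The sketch below is only an assumption about how that could be laid out. `INFORMAL_TO_FORMAL` and `annotate` are hypothetical names, and the table has a single entry for illustration.

```python
# Hypothetical table of informal spellings and their dictionary forms.
INFORMAL_TO_FORMAL = {
    "gonna": "going to",
}


def annotate(tokens, table=INFORMAL_TO_FORMAL):
    """Pair each token with its standardized form, keeping the original too."""
    return [(tok, table.get(tok.lower(), tok)) for tok in tokens]


# Each pair is (as transcribed, standardized).
pairs = annotate(["I'm", "gonna", "go"])
```

Searching the second element finds every instance of the phrase, while the first element still records which realization the speaker used.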

Low-resource languages

It’s all well and good to refer to a dictionary, but some languages don’t have universal arbiters of spelling. We’ve worked with Australian and Papua New Guinean languages with no written tradition at all, with languages such as KiSwahili where many alternate spellings may be equally acceptable, and with languages where spelling reform is recent or incomplete. It can also be difficult to build a team for languages with fewer speakers, or in regions with less ready access to the Internet.

The key here is often working with university researchers and linguists. At the same time, it’s important to achieve consensus on acceptable spellings through consultation with speakers of the language living in their communities. You may find your database contributes to giving speakers of the language new access to writing resources!

Codepoints

Even when the spelling of the word is totally clear, we can run into trouble. Take a look at these two words:

café  саfé


How many letters do these have in common?
To a human, all of them. To a computer? Only the “f”! The “c” and the “a” on the right come from the Cyrillic alphabet, and the “é” on the right is made out of two characters instead of one.
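If you want to see this for yourself, Python’s standard `unicodedata` module can name each code point in a string. The two strings below are written with escapes so nothing is hidden. They are our reconstruction of the pair described above, and `describe` is just an illustrative helper name.

```python
import unicodedata


def describe(word):
    """Return (character, code point, Unicode name) for each code point."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in word
    ]


latin = "caf\u00e9"                  # Latin c, a, f and a precomposed é
lookalike = "\u0441\u0430fe\u0301"   # Cyrillic es, Cyrillic a, Latin f, e + combining acute

for row in describe(lookalike):
    print(row)

# The only code point the two strings share is the Latin "f".
shared = set(latin) & set(lookalike)
```

The two strings also differ in length: the first is four code points long and the second is five, because its accent is stored separately.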

Codepoint errors, as these are known, will look just fine when you’re reading them, but if you search your database, the text editor isn’t going to find all instances of the word you searched for. It’s even more trouble when your database is training data for automatic speech recognition or a speech synthesis program – the alternate spelling might not show up in your lexicon and the whole segment of audio could be discarded!

Okay, so it’s fairly unlikely that someone’s going to be entering Cyrillic characters in your Latin-alphabet database, but for some languages there really are ambiguous cases, identical to the human eye but distinct to a computer. That’s the case for “é” shown above, and it’s widespread in many other writing systems too. In Arabic, for example, most letters in the main Unicode Arabic block also have separate equivalent “presentation form” characters, so ‘beeh’ may appear as ٻ or as ﭒ, and the same invisible variation can occur for most other letters of the Arabic alphabet.
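Unicode normalization handles many of these cases. As a minimal sketch, Python’s `unicodedata.normalize` with the NFKC form recombines a letter and its separate accent, and maps Arabic presentation forms back to the ordinary letters. The wrapper name `normalize` is ours.

```python
import unicodedata


def normalize(text, form="NFKC"):
    """Apply a Unicode normalization form (NFKC by default) to text."""
    return unicodedata.normalize(form, text)


# "e" + combining acute accent becomes the single precomposed "é".
composed = normalize("cafe\u0301")

# ARABIC LETTER BEEH ISOLATED FORM (U+FB52) becomes ARABIC LETTER BEEH (U+067B).
beeh = normalize("\ufb52")

# ARABIC LETTER BEH ISOLATED FORM (U+FE8F) becomes ARABIC LETTER BEH (U+0628).
beh = normalize("\ufe8f")
```

Two caveats: NFKC also folds other compatibility characters, such as ligatures and full-width letters, which may or may not be what you want. It also does nothing about look-alikes from different alphabets, so the Cyrillic “с” stays Cyrillic.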

Unlike the other error types discussed above, codepoint errors don’t benefit from multiple passes from annotators or discussion with speakers of the language. Here you need a few post-editing scripts to identify characters outside the usual range and correct them to standard forms.
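A script of that kind can be quite short. The sketch below assumes you have agreed on a character inventory for the language in advance. `ALLOWED` is a hypothetical inventory for a mostly English database, and `flag_unexpected` is an illustrative name.

```python
import unicodedata

# Hypothetical inventory: lowercase letters plus the few extras this
# database is expected to contain.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz\u00e9'- ")


def flag_unexpected(text, allowed=ALLOWED):
    """Return (index, character, Unicode name) for characters outside the inventory."""
    # Compose letter + accent pairs first so "e" + U+0301 is checked as "é".
    text = unicodedata.normalize("NFC", text)
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.lower() not in allowed
    ]


# The look-alike word from earlier: two Cyrillic letters are reported.
report = flag_unexpected("\u0441\u0430fe\u0301")
```

The reported indexes refer to the normalized text. Flagged characters can then be mapped to their standard equivalents or sent back for review, depending on how confident you are in the mapping.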

Quite a bit to take in, right?

These are just a few of the challenges you’ll face when you’re working on transcripts and text databases. We hope you discovered some new things about the trials and tribulations of maintaining all this text. At Appen, we’ve helped clients all over the world tackle these issues. If you’d like to discuss how we can help you or your organization, we’d be happy to hear from you! Contact us here to get started.