Dialects
Here’s a problem straight away: is it “standardisation” or “standardization”? This one’s region-based, so it’s not too hard to come up with the relevant spelling for your database. But some cases may be more complex than this – in Norwegian there are two entirely separate spelling systems (Bokmål and Nynorsk) intended to reflect different sets of dialects. Usually this area isn’t too hard – you decide in advance which spelling convention to follow for your chosen language and dialect. Stray spellings from other systems can be identified through automated checks and post-editing.
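As a rough illustration of what such an automated check might look like, here is a minimal Python sketch. The `PREFERRED` table and the `find_stray_spellings` function are hypothetical names invented for this example, and the three entries stand in for what would be a much longer list drawn from a project's own style guide. It assumes a database that has settled on British spellings.

```python
import re

# Hypothetical table: off-convention spelling -> spelling chosen for the database.
# A real project would load a far larger list from its style guide.
PREFERRED = {
    "standardization": "standardisation",
    "standardize": "standardise",
    "color": "colour",
}

def find_stray_spellings(text, preferred=PREFERRED):
    """Return (offset, found_word, suggested_word) for each off-convention word."""
    hits = []
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as "å".
    for match in re.finditer(r"[^\W\d_]+", text):
        word = match.group(0)
        suggestion = preferred.get(word.lower())
        if suggestion is not None:
            hits.append((match.start(), word, suggestion))
    return hits
```

A check like this only flags candidates. Whether each hit really is a stray spelling is still a decision for post-editing.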
Register
Is it “gonna”, “goin’ a”, “gon’ to”, or “going to”? This one’s more difficult: the last spelling is the formally correct option, but in some cases it can be a long way removed from the sounds coming from a person speaking. What if you need to search later for one of the more diverse pronunciations of a phrase? How do you separate the pronunciations in your lexicon, if you’re producing a speech database? In some cases, the difference may be minimal enough that you can standardize to the dictionary form. In others it may be more sensible to adopt an informal representation. No matter how you choose to approach the subject, the conclusion is the same: standardization is vital.
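One possible way to standardize to the dictionary form without losing the informal variants is to record each original spelling alongside its replacement. The sketch below is only an illustration under that assumption: `VARIANTS` and `normalise` are hypothetical names, the table holds just the three forms mentioned above, and it expects plain ASCII apostrophes, so typographic ones would need converting first.

```python
import re

# Hypothetical variant table: informal spelling -> dictionary form.
# Keys use the plain ASCII apostrophe (').
VARIANTS = {
    "gonna": "going to",
    "goin' a": "going to",
    "gon' to": "going to",
}

# Try longer variants first, and only match whole words.
_ALTERNATIVES = "|".join(
    re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True)
)
_PATTERN = re.compile(r"(?<!\w)(?:" + _ALTERNATIVES + r")(?!\w)", re.IGNORECASE)

def normalise(transcript):
    """Return (normalised_text, [(surface_form, dictionary_form), ...])."""
    seen = []

    def swap(match):
        surface = match.group(0)
        canonical = VARIANTS[surface.lower()]
        seen.append((surface, canonical))  # keep the original form searchable
        return canonical

    return _PATTERN.sub(swap, transcript), seen
```

Storing the returned pairs next to the normalised transcript means a later search for “gonna” can still find the utterance, even though the text itself reads “going to”.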
Low-resource languages
It’s all well and good to refer to a dictionary, but some languages don’t have such handy arbiters of spelling. Appen has worked with Australian and Papua New Guinean languages with no written tradition at all, with languages such as KiSwahili where many alternate spellings may be equally acceptable, and with languages where spelling reform is recent or incomplete. It can be difficult building a team to work in regions with fewer speakers, or less ready access to the Internet. The key here is often working with university researchers and linguists. At the same time, it’s important to achieve consensus on acceptable spellings through consultation with speakers of the language living in their communities. You may find your database contributes to giving speakers of the language new access to writing resources!
Codepoints
Even when the spelling of the word is totally clear, we can run into trouble. Take a look at these two words: