Automatic speech recognition (ASR) is something we work with every day here at Appen. Around the world, more and more people are using ASR on their phones, on their computers, or around the house. We have digital personal assistants at our beck and call to set reminders, reply to texts or emails, or even to search the web for us and recommend somewhere to eat.

This is all great, but even the best speech recognition isn’t 100% accurate. And when things go wrong, the errors can be really glaring, if not occasionally entertaining.

What sort of errors happen?
A speech recognition device will almost always come up with a string of words based on what it heard – that’s what it’s designed to do. But deciding which string of words it heard is a tricky task, and there are a few things that can really throw users off.

Guessing the wrong word
This is, of course, the classic problem. Natural language software still isn’t great at forming whole plausible sentences. There are all sorts of potential mishearings which might sound similar, but don’t make a whole lot of sense as a complete sentence: the classic demonstration is “recognise speech” coming out as “wreck a nice beach”.
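As a rough sketch of why this happens: a recogniser typically chooses between candidate word strings by combining an acoustic score (“how much did the audio sound like these words?”) with a language-model score (“how plausible is this sentence?”). The toy Python below illustrates the idea – all the candidate strings and score values are invented for illustration, not taken from any real system:

```python
# Toy sketch: two candidate transcriptions that sound alike.
# A recogniser combines an acoustic score with a language-model
# score; log-probabilities add, and higher (closer to zero) wins.
# All numbers here are invented for illustration.

candidates = {
    "recognise speech":   {"acoustic": -5.0, "lm": -3.0},
    "wreck a nice beach": {"acoustic": -4.8, "lm": -7.5},
}

def total_score(scores):
    # Combined log-score: acoustic evidence plus sentence plausibility.
    return scores["acoustic"] + scores["lm"]

best = max(candidates, key=lambda c: total_score(candidates[c]))
print(best)  # "recognise speech" – the language model penalises the
             # similar-sounding but implausible string
```

When the language model has never seen sentences like yours, that second score stops helping, and the acoustically closest nonsense can win.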

Hearing things that aren’t what you were saying
If someone walks past talking loudly, or you cough halfway through a word, a computer often isn’t going to be able to tell which parts of the audio were you speaking and which came from somewhere else. This can lead to things like someone’s phone taking dictation when they were just practising the tuba.

So what’s going on here?
Why are these carefully trained algorithms making mistakes that any human listener would find completely laughable?

It’s all about the data we’ve used to train the software. Speech recognition algorithms learn by taking in hundreds of hours of audio and sometimes millions of words of text. If that audio or that text doesn’t match up with what you sound like or how you speak, that’s when problems will start to occur.

If all the computer knows is audio of people speaking in quiet recording booths, your attempted text message in a crowded restaurant is really going to push the system to its limits! If it’s only heard recordings of people who were born and raised within 5 miles of Buckingham Palace, your Canadian accent is going to lead to all sorts of confusion.
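The standard way to quantify how much a mismatch like this hurts is word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the recogniser’s output into the reference transcript, divided by the length of the reference. A minimal sketch in Python (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("send a text to mum", "send a text to mum"))  # 0.0
print(word_error_rate("send a text to mum", "sand a text to mom"))  # 0.4
```

Move the same system from a quiet booth to a crowded restaurant, or from one accent to another, and it’s this number that climbs.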

Humans are very good at focusing on just the voice of the person they’re talking to. We’re able to adapt and make allowances for where they’re from or where the conversation is taking place. Computers still have a ways to go when it comes to those things.

Lastly, voice recognition programs refer to lexicon files, which list all the words the system can expect to hear and how those words are pronounced. If you use a word that isn’t in the lexicon, the recognition algorithm is never going to be able to write it out for you. A person in a conversation might think “oh, that’s a name I haven’t heard before, maybe I’ll guess a spelling and look it up later.” A computer always takes its best guess, but it can only guess from the words it already knows; a word it has never seen will always come out as something else.
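A lexicon is, at heart, a mapping from words to pronunciations, and the out-of-vocabulary problem falls straight out of it. The sketch below uses a two-word toy lexicon with invented phone symbols; real lexicons map tens of thousands of words:

```python
# Minimal sketch of a pronunciation lexicon lookup. The entries and
# phone symbols are invented for illustration.

lexicon = {
    "hello":   ["HH", "AH", "L", "OW"],
    "weather": ["W", "EH", "DH", "ER"],
}

def pronunciation(word):
    # An out-of-vocabulary (OOV) word simply has no entry, so the
    # recogniser can never output it, however clearly it is spoken.
    return lexicon.get(word.lower())

print(pronunciation("hello"))  # ['HH', 'AH', 'L', 'OW']
print(pronunciation("Appen"))  # None – OOV, so it can never be recognised
```

This is why new product names, place names, and slang need explicit lexicon entries (and matching audio) before a system can transcribe them.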

What do people do when things break?
When things go wrong with speech recognition, they tend to keep going wrong. People are often a bit wary when talking to a virtual person at the best of times – it doesn’t take much to erode that fragile trust! And once an error has happened, people do all sorts of strange things to try to make themselves clearer.

Some folks will slow right down. Others might over-enunciate their words and make sure all their Ts and Ks are as crisp as can be. And other people will try to adopt the accent they think the computer will best understand, doing their finest impersonation of Queen Elizabeth II or of Ira Glass.

And here’s the thing – although those techniques might help if you’re speaking to a confused tourist or to someone over a bad phone line, they don’t help a computer at all! In fact, the further we stray from natural connected speech (the kind that was present in the recordings used to train the recogniser), the worse things will get, and the spiral will continue.

Appen can help
If you develop and maintain speech software, these limitations will sound very familiar. The solution is to ensure you have the breadth and scale of training data needed to cover all the people you expect to use your software! At Appen, we’ve produced speech databases in over 150 different languages, and we’re ready to offer advice on the dialects, demographics, and environments that will give your training data the best accuracy across all your users. Contact us to talk about your needs and how we can help!