If you have a mobile device, tablet, smart-home system, or any other device in your home that uses automatic speech recognition, you’ve probably experienced this: the software works fine for mom and dad, but not so well for the kids. Why? Because there are several nuances to training machines to understand child speech that are not always well understood.
Part of the reason for this is that children speak very differently to adults – and not all speech recognition devices are well equipped to deal with this.
How is child speech different from adult speech?
At a surface level, we’re all familiar with the idiosyncratic ways that children speak. Ask an adult to do ‘baby talk’, and they’ll give you their best impression of a high-pitched voice with poorly formed vowels, mixed-up consonants, and possibly some invented words or imaginative grammar. At their core, these intuitive observations about how children speak reflect many of the actual problems that machines have when dealing with child speech.
From a purely biological standpoint, the vocal tract of a child is less developed than that of an adult. The vocal tract is shorter in adult females than in adult males, which contributes to females’ higher-pitched voices, and it is shorter still in children. The vocal folds (commonly known as vocal cords) of children are also shorter than those of both adult males and females.
The result is that the fundamental frequency of sounds generated by children averages over 300Hz, compared to 210Hz in adult females, and 125Hz in adult males. Speech recognition devices that are trained to tune in to voices with lower frequencies will often miss much of what a child says.
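To make the pitch difference concrete, here is a minimal sketch (not production code) of fundamental frequency estimation by autocorrelation, the kind of low-level signal analysis that can sit at the front end of a speech system. The sample rate, search range and function names are our own illustrative assumptions, and real speech is far messier than the pure tones used here:

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation."""
    # Correlate the signal with itself; keep non-negative lags only.
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]
    # Search only lags corresponding to plausible speech pitch.
    lo = int(sample_rate / fmax)
    hi = int(sample_rate / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / lag

sr = 16000
t = np.arange(sr) / sr                # one second of audio
child = np.sin(2 * np.pi * 300 * t)   # ~300Hz: typical child pitch
adult = np.sin(2 * np.pi * 125 * t)   # ~125Hz: typical adult male pitch
print(estimate_f0(child, sr), estimate_f0(adult, sr))
```

On these synthetic tones the estimate lands near 300Hz for the ‘child’ signal and 125Hz for the ‘adult’ one. A recogniser whose pitch search range or acoustic models are tuned to the lower band can simply fail to track the higher voice.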
Learning to speak
The human vocal tract is complicated, and learning to use it takes time. Certain sounds require quite precise placement of articulators (active articulators such as the tongue, lips, teeth etc. relative to passive articulators like the palate and alveolar ridge), which young children have yet to master.
This results in the mispronunciation of words like ‘helicopter’ as ‘hewwicopter’ which, while admittedly cute, can cause chaos for speech recognition software that is trained to equate a set of pronunciations with a set of words in its lexicon – it’s not going to recognise that particular substitution of sounds.
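As a simplified illustration of why this breaks, the toy example below (our own sketch – the phone labels are rough, ARPAbet-style approximations, not a real ASR lexicon) shows exact pronunciation lookup failing on a single substituted sound, while a deliberately tolerant lookup that permits one mismatch recovers the word:

```python
# Toy pronunciation lexicon: phone sequences mapped to words.
LEXICON = {
    ("HH", "EH", "L", "IH", "K", "AA", "P", "T", "ER"): "helicopter",
    ("K", "AE", "T"): "cat",
}

def exact_lookup(phones):
    """Strict matching: any mispronounced phone means no match."""
    return LEXICON.get(tuple(phones), "<unknown>")

def tolerant_lookup(phones, max_mismatches=1):
    """Allow a small number of substituted phones (same length only)."""
    phones = tuple(phones)
    for pron, word in LEXICON.items():
        if len(pron) == len(phones):
            mismatches = sum(a != b for a, b in zip(pron, phones))
            if mismatches <= max_mismatches:
                return word
    return "<unknown>"

adult = ["HH", "EH", "L", "IH", "K", "AA", "P", "T", "ER"]
child = ["HH", "EH", "W", "IH", "K", "AA", "P", "T", "ER"]  # 'L' -> 'W', as in 'hewwicopter'

print(exact_lookup(adult))     # helicopter
print(exact_lookup(child))     # <unknown>
print(tolerant_lookup(child))  # helicopter
```

Real recognisers model pronunciation variation statistically rather than by counting mismatched phones, but the underlying problem is the same: a pronunciation the system has never been trained on simply does not match.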
As inexperienced speakers, children will also tend to stutter more, repeat themselves, or change direction mid-sentence – all things that automated speech recognition will struggle with when parsing input.
Part of learning to speak is experimenting and playing with words, and this is something that children do exceptionally well. Aside from genuine mistakes in pronouncing complex words, such as pronouncing ‘hospital’ as ‘hopspital’, children also engage in word play at both the word and sentence level.
Young children who are still familiarising themselves with English morphological and inflectional processes might say ‘brunged’ instead of ‘brought’ for the past tense of ‘bring’, or ‘sheepses’ for the plural of ‘sheep’. They might make up words for lack of a better one, like ‘take-home’ for a ‘takeaway’ that is brought home, or even invent words just for fun!
And in many cases, it is all about fun – to a child, a speech recognition device is a toy like any other, and more often than not, they will experiment and play with it just to see what it will do next.
Appen can help
As we mentioned in our previous blog post, When Speech Recognition Goes Wrong, it’s all about the data. Having the right data to ensure your systems are trained to deal with the challenges of child language is the key to developing a speech recognition device that caters to every member of the family, no matter how small. At Appen, we have experience in collecting both spontaneous and scripted child speech. We also work with transcribers who are familiar with child language, and use our knowledge of spelling standardisation to create the most accurate data possible. Contact us to talk about your needs and how we can help.