Where does the data come from?
IBM’s initial work in the voice recognition space was done as part of the U.S. government’s Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program, which led to significant advances in speech recognition technology. For broadcast news (BN), the EARS program produced about 140 hours of supervised training data and around 9,000 hours of very lightly supervised training data taken from closed captions of television shows. By contrast, EARS produced around 2,000 hours of highly supervised, human-transcribed training data for conversational telephone speech (CTS).
Lost in translation?
Because so much training data is available for CTS, the team from IBM and Appen set out to apply similar speech recognition strategies to BN to see how well those techniques translate across applications. To understand the challenge the team faced, it helps to call out some key differences between the two speech styles:
Broadcast news (BN)
- Clear, well-produced audio quality
- Wide variety of speakers with different speaking styles
- Varied background noise conditions — think of reporters in the field
- Wide variety of news topics
Conversational telephone speech (CTS)
- Often poor audio quality with sound artifacts
- Interspersed with moments where speech overlaps between participants
- Interruptions, sentence restarts, and background confirmations between participants, such as “okay”, “oh”, “yes”
How the team adapted speech recognition models from CTS to BN
The team adapted the speech recognition systems that were so successfully used for the EARS CTS research: multiple long short-term memory (LSTM) and ResNet acoustic models trained on a range of acoustic features, along with word and character LSTMs and convolutional WaveNet-style language models. This strategy had produced word error rates between 5.1% and 9.9% for CTS in a previous study, specifically the HUB5 2000 English Evaluation conducted by the Linguistic Data Consortium (LDC). The team tested a simplified version of this approach on the BN data set, which wasn’t human-annotated but was instead created from closed captions. Rather than adding all the available training data, the team carefully selected a reliable subset, then trained LSTM and residual network-based acoustic models with a combination of n-gram and neural network language models on that subset. In addition to automatic speech recognition testing, the team benchmarked the automatic system against an Appen-produced high-quality human transcription. The primary language model training text for all these models consisted of a total of 350 million words from different publicly available sources suitable for broadcast news.
Getting down to business
In the first set of experiments, the team separately tested the LSTM and ResNet acoustic models in conjunction with the n-gram language model and a feed-forward neural network language model (FF-NNLM), then combined scores from the two acoustic models and compared the outcome with the results obtained on the older CTS evaluation. Unlike in the original CTS testing, no significant reduction in the word error rate (WER) was achieved after scores from the LSTM and ResNet models were combined. The LSTM model with an n-gram LM performs quite well on its own, and its results improve further with the addition of the FF-NNLM. For the second set of experiments, word lattices were generated after decoding with the LSTM+ResNet+n-gram+FF-NNLM model. The team generated n-best lists from these lattices and rescored them with the LSTM1-LM. LSTM2-LM was also used to rescore word lattices independently. Significant WER gains were observed after using the LSTM LMs. This led the researchers to hypothesize that the secondary fine-tuning with BN-specific data is what allows LSTM2-LM to perform better than LSTM1-LM.
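To make the n-best rescoring step more concrete, here is a minimal sketch in Python. It is not the team's implementation: the `Hypothesis` fields, the `rescore_nbest` function, the log-linear interpolation of the two language model scores, the weights, and the toy scores are all assumptions chosen for illustration. The neural language model is represented by a plain callable that returns a log-probability for a sentence.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    acoustic_logprob: float  # score from the acoustic model(s)
    ngram_logprob: float     # score from the first-pass n-gram LM


def rescore_nbest(
    nbest: List[Hypothesis],
    neural_lm: Callable[[str], float],  # hypothetical stand-in for an LSTM LM
    lm_weight: float = 1.0,             # assumed LM scale factor
    interp: float = 0.5,                # assumed weight of the neural LM
) -> List[Hypothesis]:
    """Return the n-best list sorted by combined score, best first."""

    def total(h: Hypothesis) -> float:
        # Log-linear interpolation of the n-gram and neural LM scores.
        lm = (1.0 - interp) * h.ngram_logprob + interp * neural_lm(h.text)
        return h.acoustic_logprob + lm_weight * lm

    return sorted(nbest, key=total, reverse=True)


# Toy data: every number below is made up for the example.
TOY_LM_SCORES = {
    "the markets fell today": -6.0,
    "the market fell to day": -12.0,
}


def toy_neural_lm(text: str) -> float:
    return TOY_LM_SCORES[text]


nbest = [
    Hypothesis("the markets fell today", acoustic_logprob=-10.0, ngram_logprob=-8.0),
    Hypothesis("the market fell to day", acoustic_logprob=-9.0, ngram_logprob=-8.5),
]

best = rescore_nbest(nbest, toy_neural_lm)[0]
print(best.text)
```

In this toy list, the second hypothesis wins when only the acoustic and n-gram scores are used (`interp=0.0`), and the first one wins once the neural LM score is mixed in. That is the kind of reordering rescoring is meant to produce.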
The results
Our ASR results have clearly improved state-of-the-art performance, and significant progress has been made compared to systems developed over the last decade. Compared to the human performance results, the ASR WER is about 3% worse in absolute terms. Although the machine and human error rates are comparable, the ASR system has much higher substitution and deletion error rates. Looking at the different error types and rates, the research produced interesting takeaways:
- There’s a significant overlap in the words that ASR and humans delete, substitute, and insert.
- Humans seem to be careful about marking hesitations: %hesitation was the most inserted symbol in these experiments. Hesitations seem to be important in conveying meaning to the sentences in human transcriptions. The ASR systems, however, focus on blind recognition and were not successful in conveying the same meaning.
- Machines have trouble recognizing short function words: the, and, of, a, that and these get deleted the most. Humans, on the other hand, seem to catch most of them. It seems likely that these words aren’t fully articulated, so the machine fails to recognize them, while humans are able to infer these words naturally.
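The substitution, deletion, and insertion rates discussed above come from aligning each transcript against a reference. As a rough illustration (not the scoring tool used in the study), the sketch below computes those counts with a standard word-level edit-distance alignment and derives WER as (substitutions + deletions + insertions) / reference length. The function names and the example sentences are assumptions made for this example.

```python
from typing import Tuple


def error_counts(reference: str, hypothesis: str) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty ref: all insertions
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]  # empty hyp: all deletions
        for j in range(1, len(hyp) + 1):
            t, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, d, n)          # match, no cost
            else:
                best = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            if t + 1 < best[0]:
                best = (t + 1, s, d + 1, n)  # deletion: reference word missing
            t, s, d, n = cur[j - 1]
            if t + 1 < best[0]:
                best = (t + 1, s, d, n + 1)  # insertion: extra hypothesis word
            cur.append(best)
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def word_error_rate(reference: str, hypothesis: str) -> float:
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return sum(error_counts(reference, hypothesis)) / n_ref


# Made-up example: one function word is dropped and another is misrecognized.
subs, dels, ins = error_counts("the cat sat on the mat", "cat sat on a mat")
print(subs, dels, ins, word_error_rate("the cat sat on the mat", "cat sat on a mat"))
```

When several alignments have the same minimum cost, the split between error types can depend on tie-breaking, so counts from this sketch may differ slightly from those of a dedicated scoring tool.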