by William Meisel, Ph.D.
I spoke with Mark Brayan, Appen CEO, during a recent visit to Los Angeles. He summarized Appen’s place in advancing language technologies such as search, speech recognition, natural language interpretation (NLI), and translation as supplying the “fuel” for building and improving such technologies.
The comment struck me. It made me realize the key role of building and quality-checking the databases used to build these technologies.
Machine learning techniques are used in many of these advanced language technologies. They are, of course, statistical methods that depend on data. And the move to more advanced models such as Deep Neural Networks drives a need for increasing amounts of data.
Developing and improving speech recognition using machine learning requires data labelled with the text of words spoken. NLI development using machine learning requires data labelled with the intent of the user. Both approaches require quality data; the old adage “garbage in, garbage out” clearly applies. Appen uses skilled transcriptionists and labelers checked by experienced quality control personnel, with guidance on how to handle difficult cases consistently.
Further, advanced techniques often require more than raw data. For example, one can improve search techniques (for the web or a specific site) by correcting search terms that are often misspelled. (It is much less effective to try to get enough data to effectively “learn” that these misspelled terms are equivalent.) Appen develops such databases as linguistic resources.
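As a rough illustration of how such a linguistic resource might be applied, the sketch below normalizes a search query against a small table of known misspellings. The table entries and the `normalize_query` function are hypothetical, invented for this example; they are not Appen's data or method.

```python
from typing import Mapping

# Hypothetical correction table: common misspelling -> intended term.
# These entries are illustrative only and do not come from any real resource.
CORRECTIONS = {
    "recieve": "receive",
    "accomodation": "accommodation",
    "resturant": "restaurant",
}


def normalize_query(query: str, corrections: Mapping[str, str] = CORRECTIONS) -> str:
    """Lowercase a search query and replace known misspellings word by word."""
    words = query.lower().split()
    return " ".join(corrections.get(word, word) for word in words)


if __name__ == "__main__":
    print(normalize_query("Resturant near me"))
```

A curated table like this encodes the equivalence directly, so the search system does not have to infer it from a large volume of query logs.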
You can’t create fuel without the raw resource, and part of Appen’s success is developing the human resources that can help with the data creation process. The company has access to many contractors who can support data creation and cleanup tasks. Long experience has led to procedures and software that support this process efficiently. In its 21-year history, the company has supplied data and services to a range of technology companies, auto manufacturers, and governments to help them build and improve their natural language technologies.
With machine learning and similar “cognitive computing” technologies increasingly available as cloud services, it often isn’t a lack of access to the core technologies that dissuades companies from using them. Companies often find that their raw data isn’t labelled, has quality problems, or simply is in the wrong format for use with these cloud resources. Appen refines this raw resource into “fuel.”
William Meisel, Ph.D., is president of TMA Associates, editor of LUI News (a monthly newsletter on commercial applications of the Language User Interface), organizer of the Conversational Interaction Conference, author of the 2013 book The Software Society, and a consultant on market and product opportunities created by the maturing of speech and natural language technology. His experience in speech technology includes founding and running a speech recognition company. He began his career as a professor of Electrical Engineering and Computer Science at USC and published the first book on machine learning.