Training Conversational AI Agents on Noisy Data
Chatbots, virtual assistants, robots, and more: conversational artificial intelligence (AI) is already highly visible in our daily lives. Companies looking to increase engagement with customers while reducing costs are investing heavily in the space. The numbers are clear: the conversational AI agents industry is expected to grow 20% year over year through at least 2025. By that time, Gartner predicts that organizations that leverage AI in their customer engagement platform will increase operational efficiency by 25%. The global pandemic has only accelerated these expectations, as conversational AI agents have been critical to businesses navigating a virtual world while staying connected with customers.
Conversational AI helps companies overcome digital communication’s impersonal nature by providing a tailored, humanized experience for each customer. These changes redefine the way brands engage and will undoubtedly become the new normal, even post-pandemic, given the successful proof of concept.
Building conversational AI for real-world applications is still challenging, however. Mimicking the flow of human speech is extremely difficult. AI must account for different languages, accents, colloquialisms, pronunciations, turns of phrase, filler words, and other variability. This effort requires a vast collection of high-quality data. The problem is that this data is often noisy, filled with irrelevant entities that can cause models to misconstrue intent. Understanding the role data plays and the mitigation steps to manage noisy data will be essential to reducing errors and failure rates.
Data Collection and Annotation for Conversational AI Agents
To understand the complexities of creating a conversational agent, let’s walk through a typical process for building one with voice capabilities (such as Siri or Google Home); a rough sketch of these stages follows the list.
- Data Input. The human agent speaks a command, comment, or question, which the model captures as an audio file. Using speech recognition machine learning (ML), the computer converts this audio to text.
- Natural Language Understanding (NLU). The model uses entity extraction, intent recognition, and domain identification (all techniques for understanding human language) to interpret the text file.
- Dialogue Management. Because speech recognition can be noisy, statistical modeling is used to maintain a distribution over the human agent’s likely goals. This is known as dialogue state tracking.
- Natural Language Generation (NLG). Structured data is converted into natural language.
- Data Output. Text-to-speech synthesis converts the natural language text data from the NLG stage into the audio output. If accurate, the output addresses the human agent’s original request or comment.
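To make the flow concrete, here is a rough sketch of how these five stages might be wired together in code. Everything in it is a stand-in: the function names, the keyword-based NLU, and the simple belief update are illustrative assumptions, not a description of any production assistant.

```python
# Minimal sketch of a voice-assistant pipeline; every stage is stubbed out
# (no real speech recognition or text-to-speech engine is called).
from dataclasses import dataclass


@dataclass
class NLUResult:
    domain: str
    intent: str
    entities: dict


def speech_to_text(audio: bytes) -> str:
    """Data Input: convert captured audio into a text transcript (stubbed)."""
    return "find a store near me"  # stand-in for a real ASR transcript


def understand(text: str) -> NLUResult:
    """NLU: identify domain, intent, and entities with toy keyword rules."""
    if "store" in text:
        return NLUResult(domain="retail", intent="find_store",
                         entities={"location": "near me"})
    return NLUResult(domain="unknown", intent="fallback", entities={})


def track_dialogue_state(belief: dict, nlu: NLUResult) -> dict:
    """Dialogue Management: keep a distribution over the agent's likely goals,
    since the transcript (and therefore the NLU output) may be noisy."""
    belief = dict(belief)
    belief[nlu.intent] = belief.get(nlu.intent, 0.0) + 0.6
    total = sum(belief.values())
    return {goal: p / total for goal, p in belief.items()}


def generate_response(belief: dict) -> str:
    """NLG: turn the most likely goal into natural language."""
    goal = max(belief, key=belief.get)
    if goal == "find_store":
        return "The closest store is 2 miles away."
    return "Sorry, could you rephrase that?"


def text_to_speech(text: str) -> bytes:
    """Data Output: synthesize audio from the response text (stubbed)."""
    return text.encode("utf-8")


if __name__ == "__main__":
    transcript = speech_to_text(b"...raw audio...")  # Data Input
    nlu = understand(transcript)                     # NLU
    belief = track_dialogue_state({}, nlu)           # Dialogue Management
    reply = generate_response(belief)                # NLG
    audio_out = text_to_speech(reply)                # Data Output
    print(reply)
```

In a real system each stub would be replaced by a trained model (an ASR engine, an NLU classifier, a learned dialogue policy, and a neural TTS voice), but the handoffs between the stages stay the same.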
Training the NLU components requires carefully annotated data. That annotation typically involves the following steps:
- Define Intents. What is the human agent’s goal? For example, “Where is my order?”, “View lists,” and “Find store” all express distinct intents, or purposes.
- Utterance Collection. Different utterances working toward the same goal must be collected, mapped, and validated by data annotators. For example, “Where’s the closest store?” and “Find a store near me” have the same intent but are different utterances.
- Entity Extraction. This technique is applied to parse out the critical entities in the utterance. If you have a sentence like “Are there any vegetarian restaurants within 3 miles of my house?”, then “vegetarian” would be a type entity, “3 miles” would be a distance entity, and “my house” would be a reference entity (see the annotation sketch after this list).
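To picture the output of these annotation steps, the sketch below stores one annotated utterance as a plain record and locates its labeled entity spans. The field names and intent labels are hypothetical, chosen only to make the examples above concrete rather than to match any particular annotation tool’s schema.

```python
# Hypothetical annotation record for one utterance; field names are illustrative.
annotation = {
    "utterance": "Are there any vegetarian restaurants within 3 miles of my house?",
    "intent": "find_restaurant",
    "entities": [
        {"text": "vegetarian", "label": "type"},
        {"text": "3 miles", "label": "distance"},
        {"text": "my house", "label": "reference"},
    ],
}

# Utterance collection: different utterances mapped to the same intent.
utterance_map = {
    "find_store": ["Where's the closest store?", "Find a store near me"],
    "order_status": ["Where is my order?"],
}


def entity_spans(record: dict) -> list:
    """Locate each labeled entity inside the utterance as (start, end, label)."""
    spans = []
    for ent in record["entities"]:
        start = record["utterance"].find(ent["text"])
        spans.append((start, start + len(ent["text"]), ent["label"]))
    return spans


print(entity_spans(annotation))
```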
Designing Dialogues for Social Robots

Learning by Imitation for Social Robots
If we could crowdsource human behaviors, we could collect higher-quality data more passively and cost-efficiently: observe human interactions, abstract the typical behavior elements, and generate robot interactions from them. One research team explored the validity of this idea by setting up a camera shop scenario. Let’s walk through their methodology:
- Data Collection. The team collected data on the multimodal behaviors of human customers and shopkeepers across three critical categories: speech, locomotion, and proxemics formation.
- Speech: Using automatic speech recognition, the team captured typical utterances (for example, “How many megapixels does this camera have?” or “What is the resolution?”) and used hierarchical clustering to map these utterances to intents (a small clustering sketch follows this list).
- Locomotion: Sensors captured tracking data on typical locations where humans congregate, such as the service counter, and distinct trajectories, such as from the door to the camera display. Clustering was used to determine the frequency of each position and trajectory.
- Proxemics Formation: Sensors captured typical spatial formations of the customer and shopkeeper, for example, face-to-face or the shopkeeper presenting a product. In addition, whenever a customer spoke or moved, that interaction was discretized into customer-shopkeeper action pairs.
- Model Training. The team then trained the model on the customer action (including the utterance, motion, and proxemics) paired with labeled data for the shopkeeper’s expected response. For example, the customer action might include asking, “How much does this cost?” while facing the shopkeeper; the shopkeeper would then reply, “It’s $300.” Minimal sketches of the utterance clustering and of this action-pair mapping follow.
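To illustrate the speech step, the sketch below groups a handful of camera-shop utterances with hierarchical (agglomerative) clustering over TF-IDF vectors. The sample sentences, the vectorizer, and the cluster count are assumptions for illustration, not the team’s actual configuration.

```python
# Sketch: group similar customer utterances into intent-like clusters with
# hierarchical clustering. The example sentences are invented for illustration.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "How many megapixels does this camera have?",
    "What is the resolution of this camera?",
    "How much does this cost?",
    "What's the price on this one?",
    "Where is the service counter?",
]

# Represent each utterance as a TF-IDF vector, then merge the closest
# utterances until a small number of clusters remain.
vectors = TfidfVectorizer().fit_transform(utterances).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(vectors)

for cluster, text in sorted(zip(labels, utterances)):
    print(cluster, text)
```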
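And to illustrate the training step, the following sketch treats the problem as a classifier from a discretized customer action (utterance cluster, location, proxemics formation) to a shopkeeper response. The feature encoding and the nearest-neighbor model are assumptions for a minimal example, not the team’s published approach.

```python
# Sketch: learn a mapping from discretized customer actions to shopkeeper
# responses using labeled action pairs. Categories and labels are invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Each training example is one customer-shopkeeper action pair.
customer_actions = [
    {"utterance_cluster": "ask_price", "location": "camera_display",
     "formation": "face_to_face"},
    {"utterance_cluster": "ask_specs", "location": "camera_display",
     "formation": "present_product"},
    {"utterance_cluster": "greet", "location": "door",
     "formation": "none"},
]
shopkeeper_responses = ["state_price", "describe_specs", "greet_back"]

# One-hot encode the categorical features, then predict the closest known pair.
model = make_pipeline(DictVectorizer(sparse=False),
                      KNeighborsClassifier(n_neighbors=1))
model.fit(customer_actions, shopkeeper_responses)

new_action = {"utterance_cluster": "ask_price", "location": "camera_display",
              "formation": "face_to_face"}
print(model.predict([new_action])[0])  # expected: "state_price"
```

With more data, the same framing scales to richer models; the essential idea is that each labeled customer-shopkeeper pair becomes one training example.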