Training Conversational AI Agents on Noisy Data

Chatbots, virtual assistants, robots, and more: conversational artificial intelligence (AI) is already highly visible in our daily lives. Companies looking to increase customer engagement while reducing costs are investing heavily in the space. The numbers are clear: the conversational AI agents industry is expected to grow 20% year over year through at least 2025. By that time, Gartner predicts that organizations that leverage AI in their customer engagement platform will increase operational efficiency by 25%.

The global pandemic has only accelerated these expectations, as conversational AI agents have been critical to businesses that must navigate a virtual world while staying connected with customers. Conversational AI helps companies overcome the impersonal nature of digital communication by providing a tailored, humanized experience for each customer. These changes redefine the way brands engage and, given the successful proof of concept, will likely become the new normal even post-pandemic.

Building conversational AI for real-world applications is still challenging, however. Mimicking the flow of human speech is extremely difficult. AI must account for different languages, accents, colloquialisms, pronunciations, turns of phrase, filler words, and other variability. This effort requires a vast collection of high-quality data. The problem is that this data is often noisy, filled with irrelevant entities that can obscure the speaker’s intent. Understanding the role data plays, and the mitigation steps for managing noisy data, is essential to reducing errors and failure rates.
Data Collection and Annotation for Conversational AI Agents

To understand the complexities of creating a conversational agent, let’s walk through a typical process for building one with voice capabilities (such as Siri or Google Home).
- Data Input. The human agent speaks a command, comment, or question captured as an audio file by the model. Using speech recognition machine learning (ML), the computer converts this audio to text.
- Natural Language Understanding (NLU). The model uses entity extraction, intent recognition, and domain identification (all techniques for understanding human language) to interpret the text file.
- Dialogue Management. Because speech recognition can be noisy, statistical modeling is used to map out distributions over the human agent’s likely goal. This is known as dialogue state tracking.
- Natural Language Generation (NLG). Structured data is converted into natural language.
- Data Output. Text-to-speech synthesis converts the natural language text from the NLG stage into audio output. If accurate, the output addresses the human agent’s original request or comment.

Each of these stages depends on training data that must be collected and annotated. A typical annotation workflow includes the following steps:
- Define Intents. What is the human agent’s goal? For example, “Where is my order?”, “View lists”, and “Find store” are all examples of intents, or purposes.
- Utterance Collection. Different utterances working toward the same goal must be collected, mapped, and validated by data annotators. For example, “Where’s the closest store?” and “Find a store near me” have the same intent but are different utterances.
- Entity Extraction. This technique is applied to parse out critical entities in the utterance. If you have a sentence like, “Are there any vegetarian restaurants within 3 miles of my house?”, then “vegetarian” would be a type entity, “3 miles” would be a distance entity, and “my house” would be a reference entity.
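
To make the intent, utterance, and entity steps above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any production NLU system works: the intent names (`find_store`, `order_status`), the hand-written regular expressions, and the helper functions are all hypothetical, and real systems typically learn these mappings from annotated data rather than from fixed rules.

```python
import re

# Hypothetical intent catalog: each intent maps to utterances that
# annotators have validated as expressing the same goal.
INTENT_UTTERANCES = {
    "find_store": ["Where's the closest store?", "Find a store near me"],
    "order_status": ["Where is my order?"],
}

# Hypothetical, hand-written entity patterns (illustration only).
ENTITY_PATTERNS = {
    "type": re.compile(r"\b(?:vegetarian|vegan)\b", re.IGNORECASE),
    "distance": re.compile(r"\b\d+(?:\.\d+)?\s*(?:miles?|km)\b", re.IGNORECASE),
    "reference": re.compile(r"\bmy (?:house|home|office)\b", re.IGNORECASE),
}


def normalize(text):
    """Lowercase and strip punctuation so surface variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


# Reverse index: normalized utterance -> intent label.
UTTERANCE_TO_INTENT = {
    normalize(utterance): intent
    for intent, utterances in INTENT_UTTERANCES.items()
    for utterance in utterances
}


def lookup_intent(utterance):
    """Return the annotated intent for a known utterance, else None."""
    return UTTERANCE_TO_INTENT.get(normalize(utterance))


def extract_entities(utterance):
    """Return (label, matched_text) pairs in the order of ENTITY_PATTERNS."""
    entities = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(utterance):
            entities.append((label, match.group(0)))
    return entities


if __name__ == "__main__":
    print(lookup_intent("Find a store near me"))
    print(extract_entities(
        "Are there any vegetarian restaurants within 3 miles of my house?"
    ))
```

An exact-match lookup like this only recognizes utterances that annotators have already seen, which is one reason a wide and varied collection of utterances per intent matters.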
Designing Dialogues for Social Robots

In many cases, the goal is for a conversational agent to interact with humans as a peer, not as a device. This means communicating using speech and gesture, providing useful services, and leveraging natural language to maintain a natural conversation flow. How, then, do we develop social robots that can interact with people?

One way to approach creating a social robot with personality is through flowchart-based visual programming. Flowchart blocks represent back-end functions, such as talking, shaking hands, and moving to a point. They catalog the flow of interaction. Content authors can use the flowchart to easily combine speech, gesture, and emotion to build engaging interactions. Erica (the ERATO Intelligent Conversational Android) was built using this method. Her content authors iteratively added content over months to develop her as a character and not just a question-answering device. She can now complete over 2,000 behaviors and over 50 topic sequences.

Another approach to designing a social robot is teleoperation. The Nara Experiment employed a robot at the tourist center in Nara, Japan, to act as a tour guide for visitors. Human tour guides created offline content for the robot (for example, background information on the local Todaiji Temple), and engineers programmed the robot with the information ahead of time. The team contrasted this method with teleoperation. When a human-in-the-loop teleoperator controlled the robot remotely, results were more accurate than when the robot relied on offline data. The problem was that the method wasn’t very scalable, content entry was slow and error-prone, and it was challenging to control multimodal behaviors.

While these are interesting case studies, they prompt questions about more scalable alternatives to dialogue design. Would it not be more efficient to collect in-situ data from real human-to-human interactions?
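
As a rough illustration of the flowchart-based approach described above, the sketch below models an interaction as a chain of blocks, each wrapping a back-end function. Everything here is an assumption made for illustration: the `Block` class, the `say`, `gesture`, and `move_to` helpers, and the block names are hypothetical and are not taken from Erica’s authoring tool. The "robot" only appends to a log instead of driving hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Block:
    """One flowchart block: a back-end action plus a link to the next block."""
    action: Callable[[List[str]], None]
    next_block: Optional[str] = None


# Hypothetical back-end functions; each records what the robot would do.
def say(text: str) -> Callable[[List[str]], None]:
    return lambda log: log.append(f"say: {text}")


def gesture(kind: str) -> Callable[[List[str]], None]:
    return lambda log: log.append(f"gesture: {kind}")


def move_to(point: str) -> Callable[[List[str]], None]:
    return lambda log: log.append(f"move_to: {point}")


def run_flow(blocks: Dict[str, Block], start: str) -> List[str]:
    """Follow next_block links from `start`, running each block's action."""
    log: List[str] = []
    visited = set()
    current: Optional[str] = start
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at block {current!r}")
        visited.add(current)
        block = blocks[current]
        block.action(log)
        current = block.next_block
    return log


# A content author's greeting sequence combining motion, speech, and gesture.
GREETING_FLOW = {
    "approach": Block(move_to("entrance"), next_block="greet"),
    "greet": Block(say("Welcome!"), next_block="handshake"),
    "handshake": Block(gesture("shake_hands")),
}

if __name__ == "__main__":
    for step in run_flow(GREETING_FLOW, "approach"):
        print(step)
```

Because every block must be authored by hand, the amount of content grows with each new behavior and topic, which is the scalability concern raised above.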
Learning by Imitation for Social Robots

If we could crowdsource human behaviors, we could collect higher-quality data more passively and cost-efficiently. We could observe human interactions, abstract typical behavior elements, and generate robot interactions based on them. One team explored the validity of this idea by setting up a camera shop scenario. Let’s walk through their methodology:
- Data Collection. The team collected data on the multimodal behaviors of human customers and shopkeepers across three critical categories: speech, locomotion, and proxemics formation.
- Speech: Using automatic speech recognition, the model captured typical utterances (for example, “How many megapixels does this camera have?” or “What is the resolution?”) and used hierarchical clustering to map these utterances to intents.
- Locomotion: Sensors captured tracking data on typical locations where humans congregate, such as the service counter, and distinct trajectories, such as from the door to the camera display. Clustering was used to determine the frequencies of each position and trajectory.
- Proxemics Formation: Sensors captured typical formations of customer and shopkeeper; for example, face-to-face, or the shopkeeper presenting a product. In addition, when a customer spoke or moved, that interaction was discretized into customer-shopkeeper action pairs.
- Model Training. The team then trained the model using the customer action (including the utterance, motion, and proxemics) and labeled data of the shopkeeper’s expected response. For example, the customer action might include asking, “How much does this cost?” while facing the shopkeeper; the shopkeeper would then reply, “It’s $300.”
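
The idea of learning from customer-shopkeeper action pairs can be sketched in a few lines of Python. This is a toy stand-in, not the team’s method: the source does not detail their model, so this sketch swaps in a simple nearest-neighbor lookup using Jaccard word overlap plus a small bonus when the proxemics formation matches. The `ActionPair` class, the formation labels, the bonus weight, and every training example other than the “It’s $300.” exchange are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ActionPair:
    """One discretized observation: a customer action and the shopkeeper reply."""
    customer_utterance: str
    customer_formation: str  # hypothetical labels, e.g. "face_to_face"
    shopkeeper_response: str


# Toy training data; only the first pair comes from the article's example.
TRAINING_PAIRS = [
    ActionPair("How much does this cost?", "face_to_face", "It's $300."),
    ActionPair("How many megapixels does this camera have?",
               "present_product", "It has 24 megapixels."),
    ActionPair("What is the resolution?",
               "present_product", "It has 24 megapixels."),
]


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude proxy for utterance content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_response(utterance: str, formation: str,
                     pairs: List[ActionPair]) -> str:
    """Return the shopkeeper response of the most similar observed pair."""
    query = tokens(utterance)

    def score(pair: ActionPair) -> float:
        similarity = jaccard(query, tokens(pair.customer_utterance))
        # Assumed weighting: a matching formation nudges the score upward.
        bonus = 0.1 if pair.customer_formation == formation else 0.0
        return similarity + bonus

    return max(pairs, key=score).shopkeeper_response


if __name__ == "__main__":
    print(predict_response("How much is this?", "face_to_face", TRAINING_PAIRS))
```

Word overlap is brittle against speech recognition errors and paraphrases, which is why approaches like this depend on clustering many noisy observations of the same intent rather than matching individual sentences.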