Great Machine Learning Data: It’s Not About Quantity or Quality — It’s About Both

Artificial intelligence (AI) is now a household term for consumers the world over, and a field that’s captured the attention – and budgets – of business and government globally. The rate of AI adoption has accelerated in recent years as organizations seek to harness its potential to drive a competitive advantage. Yet every organization faces the same challenge: securing the right machine learning data for its initiatives.

Investment in AI in 2016 was in the range of $26 billion to $39 billion, according to McKinsey, while IDC predicts that figure could grow to more than $52 billion worldwide by 2021. Where is all this activity taking place? Organizations are using AI to build and enhance web-based or physical products, fix security problems, deliver better customer experiences, make operations more efficient, and more.

Yet, even with the huge advances made in AI solutions in the last decade, and the growing number of them on the market and in our lives, there is a simple fact that holds true: AI is only as good as the machine learning data that trained it. To build a successful solution, you need the right data – and a lot of it. As McKinsey states in a 2018 discussion paper, applying large amounts of audio, video, image and text data to problems is a key differentiator which underpins higher value AI potential.

The relationship between data and machine learning

Machine learning is a form of AI that allows computers to learn without being explicitly programmed. When machines are fed large volumes of training data, they can find patterns that help them identify the correct response to a range of situations.

In this respect, AI requires machine learning, and machine learning requires data – a lot of the right kind of data. But to be the most effective at interacting with and mimicking humans, AI requires not only large volumes of training data, but large volumes of quality training data.

Why quantity matters

Machine learning helps computers solve complex problems, and the complexity is due to inherent variation: There are often hundreds, thousands, or millions of variables for the resulting system, product, or application to cope with.

Think of machine learning data like survey data: the larger and more complete your sample, the more reliable your conclusions will be. If the data sample isn’t big enough, it won’t capture all the variation the system needs to account for, and your machine may reach inaccurate conclusions, learn patterns that don’t actually exist, or fail to recognize patterns that do.
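To make the survey analogy concrete, here is a minimal Python sketch. It is an illustration only, not a method described in this article: the function name `estimate_spread`, the 30% "true rate," and the number of trials are all assumptions chosen for the example. It repeatedly draws samples of a given size and measures how much the estimated rate varies from one sample to the next.

```python
import random
import statistics


def estimate_spread(sample_size, true_rate=0.3, trials=200, seed=0):
    """Return how much an estimated rate varies across repeated samples.

    true_rate, trials, and seed are illustrative assumptions, not values
    taken from the article.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Draw one sample and estimate the rate from it.
        hits = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(hits / sample_size)
    # A smaller standard deviation means more consistent estimates.
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"sample size {n}: spread of estimates = {estimate_spread(n):.3f}")
```

In this sketch, the spread of the estimates shrinks as the sample size grows, which is the same reason a larger training set tends to give a model a more reliable picture of real-world variation.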

So the more your machine learning data accounts for the variety an AI system will encounter in the real world, the better the end product will be. Need a sense of the volume? Some experts recommend at least 10,000 hours of audio speech data to get a system to begin working at modest levels of accuracy.

Why quality matters

In machine learning, quality is just as important as volume. This is primarily because an AI system can only perform correctly based on what it has learned from good quality data. In fact, in a recent study from Oxford Economics and ServiceNow, 51% of CIOs cite data quality as a substantial barrier to their company’s adoption of machine learning.

Even if an algorithm is appropriate for the task at hand, if the machine has been trained on poor quality data, it will learn the wrong lessons, come to the wrong conclusions, and not work as you (or your customers) expect. Many things can define “bad” in this context. The data may be unrelated to the problem at hand, inaccurately annotated, misleading, or incomplete.

For search engines, irrelevant results can be a problem when trying to effectively train a machine to return the best information to users. A computer can find the data, but doesn’t know which source is better unless it’s told. For speech and pattern recognition, “bad” data might be incomplete or inaccurate. For example, if a machine thinks the sound of someone saying the word “cat” corresponds to the text of the word “rat,” that’s going to create a frustrating user experience for someone trying to order cat food from a home assistant.
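One simple way to catch the "cat" versus "rat" kind of mistake is to audit a small set of annotations against a trusted reference. The sketch below is a hypothetical illustration in Python: the function `label_error_rate`, the clip IDs, and the transcripts are invented for the example and do not describe any particular tool or workflow.

```python
def label_error_rate(annotations, reference):
    """Compare annotated labels with a trusted reference for audited items.

    Both arguments map an item ID to its label. Returns the share of
    audited items that disagree, plus the IDs of the disagreeing items.
    """
    shared = annotations.keys() & reference.keys()
    if not shared:
        raise ValueError("no audited items in common")
    wrong = sorted(k for k in shared if annotations[k] != reference[k])
    return len(wrong) / len(shared), wrong


# Hypothetical audit data: clip_002 was transcribed as "rat" instead of "cat".
annotations = {"clip_001": "cat", "clip_002": "rat", "clip_003": "food"}
reference = {"clip_001": "cat", "clip_002": "cat", "clip_003": "food"}

rate, wrong_ids = label_error_rate(annotations, reference)
print(f"error rate: {rate:.0%}, items to re-check: {wrong_ids}")
```

A check like this only measures accuracy on the audited items, so it complements, rather than replaces, a broader review of whether the data is relevant and complete.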

Learn more – download our new whitepaper

Appen partners with many global organizations to help them create and improve products using high-quality data for machine learning.

This article is just a snapshot. For a deeper dive on the topic, download our whitepaper, which we created to help business executives who are embarking on, or looking to improve, their machine learning initiatives. It covers more about why machine learning requires a high volume of data, the importance of high-quality data, and which data sources should be considered.
