How to Remove Bias in Training Data for Machine Learning

Machine learning (ML) algorithms are generally only as good as the data they are trained on. Bias in ML training data can take many forms, but the end result is that it can cause an algorithm to miss the relevant relations between features and target outputs. Whether your organization is a small business, global enterprise, or governmental agency, it’s essential that you mitigate bias in your training data at every phase of your Artificial Intelligence (AI) initiatives.

Across industries, AI and ML technologies are enabling new levels of efficiency and profitability. According to McKinsey, AI has the potential to deliver additional global economic activity of around $13 trillion by 2030, or about 16 percent higher cumulative GDP compared with today. This amounts to 1.2 percent additional GDP growth per year. With deep learning technologies playing an ever-increasing role in business and everyday life, it’s essential that companies use high-quality training data for their AI initiatives.

“It is up to us to determine the path machine learning algorithms take. As engineers and data scientists, we should carefully consider the prejudices we inherently carry when creating these technologies — and correct for them.” — Wilson Pang, CTO, Appen

First things first: What is machine learning training data?

The process for training ML models for AI projects involves building a mathematical model from input data from multiple datasets. Data can come from a variety of sources, including real-world usage data, survey data, public data sets, and simulated data. Choosing the source for your data will depend on availability and what makes sense for your specific project.

The ML model is usually built in three phases: training, validation, and testing. In the training phase, a large amount of data is annotated (labelled by humans or another method) and input to a machine learning algorithm, with a specific result in mind. The algorithm looks for patterns in the training data that map the input data attributes to the target, then outputs a model that captures these patterns. For the model to be useful, it needs to be accurate, and that requires data that points to the requisite target or target attribute. Validation and testing help refine and prove the model.
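To make the three phases concrete, here is a minimal sketch of how an annotated data set might be divided into training, validation, and test portions. The function name `split_dataset`, the 70/15/15 proportions, and the made-up file names are assumptions chosen for illustration; real projects often use different ratios or stratify the split by label.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle annotated examples and split them into train/validation/test lists.

    The fractions are illustrative assumptions, not a recommendation.
    Whatever is left after the train and validation portions becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("train_frac and val_frac must leave room for a test set")

    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical annotated examples: (image file name, human-assigned label).
examples = [
    (f"image_{i}.jpg", "washer" if i % 2 == 0 else "dryer") for i in range(100)
]
train, validation, test = split_dataset(examples)
print(len(train), len(validation), len(test))
```

Shuffling before splitting matters here: if the examples were stored in collection order, an unshuffled split could put one source or one label almost entirely into a single portion.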

What is biased data?

Machines need massive volumes of data to learn, and accurately annotating training data is as critical as the learning algorithm itself. A common reason that ML models fall short in terms of accuracy is that they were created based on biased training data.

Bias of ML models – or machine bias – can be a result of unbalanced data. Imagine a data set that consists mostly of images of washers, with just a few images of dryers thrown in for good measure. If you use this data set to train an image classification model, the model is going to lean heavily toward identifying images as washers. This is bias in action. Straightforward to correct, but critical.
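The washer/dryer imbalance can be measured before any model is trained. The sketch below counts each label's share of the data set and derives inverse-frequency class weights, one common heuristic for compensating during training. The 95/5 split and the function names are assumptions made up for this example; collecting more dryer images or resampling are alternative remedies.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's fraction of the data set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def inverse_frequency_weights(labels):
    """Weight each class by total / (number_of_classes * class_count).

    Rare classes get weights above 1, common classes below 1, and a
    perfectly balanced data set gives every class a weight of 1.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical label list: mostly washers, a handful of dryers.
labels = ["washer"] * 95 + ["dryer"] * 5

shares = class_shares(labels)
weights = inverse_frequency_weights(labels)

for label in shares:
    print(f"{label}: share={shares[label]:.2f}, weight={weights[label]:.2f}")
```

A model that always answers "washer" would score 95 percent accuracy on this data, which is why the label shares are worth inspecting alongside any accuracy figure.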

As machine learning projects get more complex, with subtle variants to identify, it becomes crucial to have training data that is human-annotated in a completely unbiased way. Human bias in annotating training data can wreak havoc on the accuracy of your machine learning model. Using the washer/dryer model above, imagine creating an ML model with the intention of differentiating not only between washers and dryers, but also between the conditions of the appliances.

If you have a team of in-house personnel annotating the images used for training this model, it’s essential that they adhere to a completely unbiased approach to classifying the images. Let’s say they’ll be classifying the appliances as excellent, good, fair, or poor condition. Without a diverse approach to annotation, you risk creating a less-than-accurate machine learning model. If you are basing a mobile app, for example, on the ability to comb e-commerce sites for appliances in a particular condition within a specific price range, a biased, inaccurate ML model is not going to drive the adoption you need to succeed.

Without high-quality, unbiased data to train your machine learning model, money spent on AI initiatives is money wasted. A recent study from Oxford Economics and ServiceNow highlights the need for high-quality data, reporting that 51% of CIOs cite data quality as a substantial barrier to their company’s adoption of machine learning. But allowing your competitors to outpace you in the race to achieve AI-driven operational efficiencies is not a model for success.

How do we ensure that our training data isn’t biased?

To help ensure optimal results, it’s essential that organizations have tech teams with diverse members in charge of both building models and creating training data. In a recent interview, Wilson Pang, CTO at Appen, offered several pointers for creating accurate, unbiased ML models. Here are some key takeaways:

If training data comes from internal systems, try to find the most comprehensive data, and experiment with different datasets and metrics.
If training data is collected or processed by external partners, it is important to recruit diversified crowds so data can be more representative.
Design the tasks correctly and carefully communicate instructions so that the crowd is not biased when annotating data.
Once the training data is created, it’s important to check if the data has any bias.
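One way to act on the last two points is to have two annotators label the same items and measure how often they agree beyond what chance would produce. The sketch below computes Cohen's kappa for the appliance-condition labels discussed earlier. The two label lists are invented for illustration, and kappa is only one possible check; it flags inconsistent annotation but does not by itself show which annotator is biased.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # given how often each annotator uses each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical condition labels from two annotators for the same eight images.
annotator_a = ["excellent", "good", "good", "fair", "poor", "good", "fair", "excellent"]
annotator_b = ["excellent", "good", "fair", "fair", "poor", "good", "poor", "good"]

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.2f}")
```

A low kappa on a shared sample suggests that the task instructions are ambiguous or that annotators are applying different standards, which is a cue to revisit the guidelines before labeling the full data set.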

If you are hoping to crowdsource your training data initiative externally, Appen provides comprehensive annotation services, with decades of experience curating crowds to collect data and annotate your data sets efficiently and accurately. Appen assembles curated crowds from a network of over one million people globally. This way, diversity requirements are easily met, and unique viewpoints or contributions are maintained.

If you perform data annotation internally, you have likely discovered it can be difficult to visualize high-dimensional training data and check the balance. Appen has built powerful training data visualization and insight tools to help validate your machine learning model and test for bias. At the end of the day, it’s important to remember that machine learning algorithms will be as biased as the people who collected, contextualized, and fed them their training data.

At Appen, we’ve helped leaders in machine learning and AI scale their programs from proof of concept to production. Contact us to learn more.