“It is up to us to determine the path machine learning algorithms take. As engineers and data scientists, we should carefully consider the prejudices we inherently carry when creating these technologies — and correct for them.” — Wilson Pang, CTO, Appen
First things first: What is machine learning training data?
Training an ML model for an AI project means building a mathematical model from input data drawn from multiple datasets. Data can come from a variety of sources, including real-world usage data, survey data, public data sets, and simulated data. Which source you choose depends on availability and on what makes sense for your specific project. The ML model is usually built in three phases: training, validation, and testing. In the training phase, a large amount of data is annotated (labelled by humans or by another method) and fed to a machine learning algorithm with a specific result in mind. The algorithm looks for patterns in the training data that map the input data attributes to the target, then outputs a model that captures these patterns. For the model to be useful, it needs to be accurate, and that requires data that points to the requisite target or target attribute. Validation and testing help refine and prove the model.
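
The three phases described above each need their own slice of the annotated data. As a minimal sketch, assuming a simple random split suits your project, the following Python shows one way to partition examples. The function name `split_dataset`, the 70/15/15 proportions, and the sample records are illustrative assumptions, not something the article prescribes.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle examples and split them into training, validation, and test sets."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out for testing
    return train, val, test


# Hypothetical annotated examples: (input attributes, target label)
examples = [({"id": i}, "washer" if i % 2 else "dryer") for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

The model is fit on `train`, tuned against `val`, and only scored on `test` at the end, so the test score reflects data the model never saw during development.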
What is biased data?
Machines need massive volumes of data to learn, and accurately annotating training data is as critical as the learning algorithm itself. A common reason that ML models fall short in terms of accuracy is that they were built on biased training data. Bias in ML models, or machine bias, can be a result of unbalanced data. Imagine a data set that consists mostly of images of washers, with just a few images of dryers thrown in for good measure. If you use this data set to train an image classification model, the model will lean heavily toward identifying images as containing washers. This is bias in action. It is straightforward to correct, but critical to catch.
As machine learning projects get more complex, with subtle variants to identify, it becomes crucial to have training data that is human-annotated in an unbiased way. Human bias in annotating training data can wreak havoc on the accuracy of your machine learning model. Using the washer/dryer example above, imagine creating an ML model intended to differentiate not only between washers and dryers, but also between conditions of the appliances. If you have a team of in-house personnel annotating the images used for training this model, it’s essential that they take an unbiased approach to classifying the images. Let’s say they’ll be classifying the appliances as being in excellent, good, fair, or poor condition. Without a diverse approach to annotation, you risk creating a less-than-accurate machine learning model. If you are basing a mobile app, for example, on the ability to comb e-commerce sites for appliances in a particular condition within a specific price range, a biased, inaccurate ML model is not going to drive the adoption you need to succeed. Without high-quality, unbiased data to train your machine learning model, money spent on AI initiatives is money wasted.
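
The washer/dryer imbalance described above can be spotted by counting labels before any training happens. The sketch below is one possible check, not the article's own method. The helper names `class_shares` and `balanced_class_weights` and the 95/5 label list are hypothetical. Inverse-frequency weighting is shown as one common way to compensate; collecting more dryer images would be another.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the data set as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical label list: mostly washers, a few dryers
labels = ["washer"] * 95 + ["dryer"] * 5
shares = class_shares(labels)
weights = balanced_class_weights(labels)
print(shares)
print(weights)
```

A rare class gets a large weight and a common class a small one, so a training procedure that accepts per-class weights pays more attention to the few dryer images.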
A recent study from Oxford Economics and ServiceNow highlights the need for high-quality data, reporting that 51% of CIOs cite data quality as a substantial barrier to their company’s adoption of machine learning. But allowing your competitors to outpace you in the race to achieve AI-driven operational efficiencies is not a model for success.
How do we ensure that our training data isn’t biased?
To help ensure optimal results, it’s essential that organizations have tech teams with diverse members in charge of both building models and creating training data. In a recent interview, Wilson Pang, CTO at Appen, had several pointers for creating accurate, unbiased ML models. Here are some key takeaways:
- If training data comes from internal systems, try to find the most comprehensive data, and experiment with different datasets and metrics.
- If training data is collected or processed by external partners, it is important to recruit a diverse crowd so the data is more representative.
- Design the tasks correctly and carefully communicate instructions so that the crowd is not biased when annotating data.
- Once the training data is created, it’s important to check if the data has any bias.
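
One way to act on the last takeaway is to measure how consistently different annotators label the same items. The sketch below computes Cohen's kappa for two annotators using the excellent/good/fair/poor condition labels from the appliance example. It is offered as one common agreement statistic, not as the check Pang or Appen uses. The function name `cohens_kappa` and both label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, given how often each annotator uses each label
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical condition labels from two annotators for the same six images
annotator_a = ["good", "good", "fair", "poor", "excellent", "good"]
annotator_b = ["good", "fair", "fair", "poor", "good", "good"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A kappa near 1 means the annotators agree far more than chance would predict, while a value near 0 suggests the task instructions or the labels themselves need another look. Agreement alone does not prove the labels are unbiased, since annotators can share the same bias, so a check like this complements a diverse crowd and does not replace one.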