“It is up to us to determine the path machine learning algorithms take. As engineers and data scientists, we should carefully consider the prejudices we inherently carry when creating these technologies — and correct for them.” — Wilson Pang, CTO, Appen
First things first: What is machine learning training data?
The process for training ML models for AI projects involves building a mathematical model from input data from multiple datasets. Data can come from a variety of sources, including real-world usage data, survey data, public data sets, and simulated data. Choosing the source for your data will depend on availability and what makes sense for your specific project.
The ML model is usually built in three phases: training, validation, and testing. In the training phase, a large amount of data is annotated (labelled by humans or another method) and input to a machine learning algorithm, with a specific result in mind. The algorithm looks for patterns in the training data that map the input data attributes to the target, then outputs a model that captures these patterns. For the model to be useful, it needs to be accurate, and that requires data that points to the requisite target or target attribute. Validation and testing help refine and prove the model.
What is biased data?
Machines need massive volumes of data to learn, and accurately annotating training data is as critical as the learning algorithm itself. A common reason that ML models fall short in terms of accuracy is that they were trained on biased data. Bias in ML models, or machine bias, can be a result of unbalanced data. Imagine a data set that consists mostly of images of washers, with just a few images of dryers thrown in for good measure. If you use this data set to train an image classification model, the model will lean heavily toward labelling images as washers, even when they show dryers. This is bias in action. Straightforward to correct, but critical.
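To make the washer and dryer example concrete, here is a minimal sketch of one common way to spot and offset class imbalance: count the labels, then give each class a weight inversely proportional to its frequency. The 90/10 label counts and the `class_weights` helper are illustrative assumptions, not figures or code from the interview; the formula is the same inverse-frequency heuristic many libraries use for "balanced" class weights.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, deliberately unbalanced label set.
labels = ["washer"] * 90 + ["dryer"] * 10

counts = Counter(labels)
weights = class_weights(labels)

for label in counts:
    share = counts[label] / len(labels)
    print(f"{label}: {counts[label]} images ({share:.0%}), weight {weights[label]:.2f}")
```

The rare class gets the larger weight, so mistakes on dryers cost more during training. Weighting is only one option; collecting more dryer images or resampling the existing data are alternatives, and which one fits depends on the project.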
How do we ensure that our training data isn’t biased?
To help ensure optimal results, it’s essential that organizations have tech teams with diverse members in charge of both building models and creating training data. In a recent interview, Wilson Pang, CTO at Appen, had several pointers for creating accurate, unbiased ML models. Here are some key takeaways:
- If training data comes from internal systems, try to find the most comprehensive data, and experiment with different datasets and metrics.
- If training data is collected or processed by external partners, it is important to recruit diversified crowds so data can be more representative.
- Design the tasks correctly and carefully communicate instructions so that the crowd is not biased when annotating data.
- Once the training data is created, it’s important to check if the data has any bias.
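One simple way to act on the last takeaway is to compare how often each group appears in the finished data set against the share you expected it to have. The sketch below assumes each record carries a group tag and that you have a reference distribution to compare against; the group names, the reference shares, the 5-point tolerance, and both helper functions are hypothetical and would need to be chosen per project.

```python
from collections import Counter


def representation_gaps(groups, reference_shares):
    """Observed share minus reference share, per group in the reference."""
    if not groups:
        raise ValueError("groups must not be empty")
    counts = Counter(groups)
    total = len(groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


def flag_underrepresented(gaps, tolerance=0.05):
    """Groups whose observed share falls short of the reference by more than tolerance."""
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Hypothetical group tags on 100 annotated records.
sample_groups = ["group_a"] * 70 + ["group_b"] * 28 + ["group_c"] * 2

# Hypothetical target shares for the population the model should serve.
reference = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}

gaps = representation_gaps(sample_groups, reference)
flagged = flag_underrepresented(gaps)

for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f} vs. reference")
print("Under-represented:", flagged)
```

A check like this only catches representation gaps for attributes you have recorded. It says nothing about annotation bias within a group, which still calls for the task design and instructions described above.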
