When and Why You Should Select an Active Learning Approach
The key to training any machine learning (ML) model is data. Yet data remains one of the biggest barriers to success for organizations and teams investing in artificial intelligence (AI). For one, you need a lot of it to create high-performing models. What’s more, you need data that’s accurately labeled. While many teams start off by manually labeling their datasets, more are turning to time-saving methods that partially automate the process, such as active learning.
To understand active learning, you first need to know the difference between supervised and unsupervised machine learning. In supervised learning, there’s a ground truth. We give the machine correctly-labeled data and it learns from those examples how to predict the right answer for unlabeled data. Unsupervised learning, on the other hand, provides the model with unlabeled, unstructured data. The model identifies patterns and structure in the data on its own. Each approach has its benefits; for the purpose of this article, we’ll focus on the supervised learning spectrum in which active learning falls.
The Active Learning Approach
Active learning fits under what we call “semi-supervised learning”. While a fully supervised learning approach provides the model with a complete, labeled dataset, a semi-supervised active learning approach provides the model with only a labeled subset of the dataset, on the assumption that not all data is necessary or valuable for training purposes. Active learning prioritizes which data points in the dataset should be labeled for training the model. Essentially, the model gets to proactively choose which data it wants to learn from.
How it Works
There are three scenarios typically represented in active learning. The most popular one is known as pool-based sampling, and follows these five steps:
- A person (known in this process as the oracle) labels a small subset of the dataset and provides that labeled data to the model.
- The model (known as the active learner) processes that data and predicts the classes of the unlabeled data points with a certain confidence level.
- Assuming that initial prediction is below the desired accuracy and confidence levels, a sampling technique is then used to determine what the next subset of data to be labeled will be.
- People label the selected subset of data and send it back to the model for processing.
- The process continues until the model’s predictions are at the required confidence and accuracy levels.
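The five steps above can be sketched as a small loop. This is a toy illustration under stated assumptions, not a production pipeline: the `oracle` function stands in for the human labeler, the hidden rule (`x >= 62`), the one-dimensional threshold "model", and the held-out validation set are all invented for the demo, and "least certain" is approximated as "closest to the current decision boundary".

```python
def oracle(x):
    """Stand-in for the human labeler; the hidden rule (x >= 62) is invented for this demo."""
    return int(x >= 62)


def fit_threshold(labeled):
    """Toy 1-D 'model': place the boundary halfway between the largest 0 and the smallest 1."""
    highest_negative = max(x for x, y in labeled.items() if y == 0)
    lowest_positive = min(x for x, y in labeled.items() if y == 1)
    return (highest_negative + lowest_positive) / 2


def accuracy(threshold, examples):
    """Fraction of (x, label) pairs that the threshold model classifies correctly."""
    correct = sum(int(x >= threshold) == y for x, y in examples)
    return correct / len(examples)


def run_pool_based_loop(pool, seed_points, validation, target_accuracy=0.99, max_rounds=20):
    # Step 1: the oracle labels a small seed subset.
    labeled = {x: oracle(x) for x in seed_points}
    unlabeled = [x for x in pool if x not in labeled]

    # Step 2: the model trains on the seed set and is evaluated.
    threshold = fit_threshold(labeled)
    score = accuracy(threshold, validation)

    rounds = 0
    while score < target_accuracy and unlabeled and rounds < max_rounds:
        # Step 3: sampling. Pick the point nearest the boundary (ties go to the smaller x).
        query = min(unlabeled, key=lambda x: (abs(x - threshold), x))
        unlabeled.remove(query)
        # Step 4: the oracle labels the selected point and it joins the training set.
        labeled[query] = oracle(query)
        # Step 5: retrain, re-evaluate, and repeat until the target is met.
        threshold = fit_threshold(labeled)
        score = accuracy(threshold, validation)
        rounds += 1
    return labeled, threshold, score


pool = list(range(101))  # unlabeled pool: the integers 0..100
validation = [(x + 0.5, oracle(x + 0.5)) for x in range(100)]  # hypothetical held-out set
labeled, threshold, score = run_pool_based_loop(pool, seed_points=[0, 100], validation=validation)
print(f"labeled {len(labeled)} of {len(pool)} points, validation accuracy {score:.2f}")
```

In a real project the threshold model would be replaced by an actual classifier and the oracle by human annotators; the shape of the loop stays the same.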
Another active learning scenario is the stream-based selective scenario, in which the model is presented with an unlabeled data point and must immediately decide if it wants that data point to be labeled. In the third approach to active learning, the membership query synthesis scenario, the model constructs its own examples for labeling.
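For contrast with the pool-based loop, the stream-based scenario can be sketched as a per-item decision. This is a minimal sketch: the `should_query` helper, the 0.8 confidence cutoff, and the example probabilities are assumptions made for illustration only.

```python
def should_query(class_probabilities, confidence_threshold=0.8):
    """Ask for a label when the model's top class probability falls below the cutoff."""
    return max(class_probabilities) < confidence_threshold


# Hypothetical stream of (item id, predicted class probabilities) pairs.
stream = [
    ("x1", [0.95, 0.05]),
    ("x2", [0.55, 0.45]),
    ("x3", [0.70, 0.30]),
    ("x4", [0.85, 0.15]),
]

# Each item is seen once and the decision is made immediately, without ranking a pool.
sent_for_labeling = [item_id for item_id, probs in stream if should_query(probs)]
print(sent_for_labeling)
```

The design difference is that no sorting over a pool is involved: the cutoff alone decides, so choosing it well matters more than in the pool-based scenario.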
Sampling methods, also called querying strategies, are critical to the success of the active learning approach. A poor sampling method will lead to poor model predictions, and thus more iterations through the active learning cycle. Two of the most common sampling methods are uncertainty sampling and Query by Committee.
As its name implies, uncertainty sampling prioritizes data points for labeling that the model is least certain about. There are several techniques applied within this type of sampling:
- Least Confidence: The algorithm sorts its predictions from lowest confidence to highest. Those with the lowest confidence are selected for labeling.
- Smallest Margin: For each data point, the algorithm compares the highest probability class prediction to the second highest probability class prediction. The data points with the tightest margin will be prioritized for labeling, as the model is least certain which class they belong to.
- Entropy: The machine uses an equation to determine the data points with the highest uncertainty, also known as entropy, in class prediction. These data points are prioritized for labeling.
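The three techniques above can be written as short scoring functions. This is a sketch using only the Python standard library; the function names and the example probabilities are hypothetical, and the entropy shown is Shannon entropy in bits, one common choice for the "equation" mentioned above.

```python
import math


def least_confidence(probs):
    """1 minus the probability of the most likely class; higher means less certain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes; smaller means less certain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in bits; higher means less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical predicted class probabilities for three unlabeled data points.
predictions = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.50, 0.45, 0.05],
    "c": [0.40, 0.30, 0.30],
}

pick_least_confidence = max(predictions, key=lambda k: least_confidence(predictions[k]))
pick_smallest_margin = min(predictions, key=lambda k: margin(predictions[k]))
pick_entropy = max(predictions, key=lambda k: entropy(predictions[k]))
print(pick_least_confidence, pick_smallest_margin, pick_entropy)
```

On this example the techniques do not agree: least confidence and entropy both pick "c", while smallest margin picks "b", which is one reason the choice of technique can change which data gets labeled first.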
Query by Committee
This strategy uses multiple models trained on the same dataset to determine collectively which additional data points to label. The data points on which the models disagree most are selected for labeling.
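One common way to measure committee disagreement is vote entropy, sketched below. The article does not specify a disagreement measure, so vote entropy is an assumption here, as are the committee votes and point names.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the committee's predicted labels; higher means more disagreement."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical predictions from a three-model committee for three unlabeled points.
committee_votes = {
    "p1": ["cat", "cat", "cat"],   # full agreement
    "p2": ["cat", "dog", "cat"],   # partial disagreement
    "p3": ["cat", "dog", "bird"],  # full disagreement
}

# Points with the most disagreement come first in the labeling queue.
labeling_order = sorted(
    committee_votes,
    key=lambda point: vote_entropy(committee_votes[point]),
    reverse=True,
)
print(labeling_order)
```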
Other popular sampling methods include expected impact and density-weighted sampling, although these may be less widely used than those outlined above. In any case, the sampling method is an important determinant of how quickly the model will reach performance standards.
You may need to experiment with different methods to reach optimal performance, as there’s no single method that works best for every use case.
When to Choose an Active Learning Approach
Manually labeling a full dataset (as is the case in fully supervised learning) can be prohibitively expensive and time-consuming for some organizations, which is why teams are turning to semi-supervised and unsupervised ML approaches. An active learning approach specifically is best leveraged under some or all of the following conditions:
- Your AI solution requires a fast time-to-market and manually labeling data may put your project at risk.
- You don’t have the money to pay data scientists or SMEs to manually label all of your data.
- You don’t have sufficient people available to manually label all of your data.
- You have a large pool of unlabeled data.
Active learning can be faster and more cost-effective than traditional supervised learning, but you still need to account for the computing costs and iterations needed to get to a working model. When done well, it can achieve the same level of quality and accuracy as its traditional counterparts.
It’s key to have technical expertise in active learning on your data science team, as the sampling method chosen can make or break the effectiveness of the active learning approach as a whole. You may seek outside guidance in some cases; third-party data partners, for example, can assist you in creating an efficient active learning pipeline.
The Future of Active Learning in AI
Is active learning the future of AI? For now, it looks like a viable alternative to fully supervised forms of machine learning, and it can make sense for extremely large datasets: techniques like active learning allow data science teams to label smarter and faster. As data continues to be the foundation of great AI while also turning into the biggest roadblock if not well-handled, it’s no surprise that active learning is gaining popularity for the efficiency it offers.
Researchers are working on designing active learning sampling methods that improve on their predecessors, with the hope that we’ll be able to generalize those that perform best. While further research is needed (for example, it’s still difficult to tell ahead of time if a particular dataset would benefit from an active learning approach), the active learning cycle remains a powerful example of the human-in-the-loop process done right.
Expert Insight From Mehdi Iranmanesh – Research Scientist
Deep learning (DL) has strong learning ability, especially for high-dimensional data, and is capable of automatic feature extraction. Active learning (AL) has huge potential to speed up machine learning development and make new use cases available. An effective approach is therefore to combine DL and AL, which greatly expands their application potential by drawing on the complementary advantages of the two methods. In fact, combining a suitable active learning algorithm with a deep model would not only reduce the amount of labeled data needed but also improve model performance.
Before using active learning, we should consider that AL requires tuning, just like deep learning algorithms. Active learning is mainly about dynamic sampling, and there are many strategies for this sampling (e.g., considering confidence, forgetfulness, etc.). The key here is that the selection process itself can be treated as a machine learning problem; it need not be hard-coded or follow predetermined rules. This has its own difficulties, because machine learning models can easily slide into overfitting.
While one dimension of success with active learning is to optimize the amount of data that needs to be labeled, another dimension is the amount of computation that the approach requires. Usually there is a tradeoff between the number of labels and the computational cost and, in practice, it is only reasonable to use an active learning approach when the time and cost of labeling become unreasonable compared to the computational cost incurred by the AL technique. This is also the case for unsupervised and/or self-supervised approaches.
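The tradeoff can be made concrete with a back-of-the-envelope comparison. Every number below is hypothetical and chosen only to show the two regimes; the `total_cost` helper is likewise an illustration, not a costing model.

```python
def total_cost(num_labels, cost_per_label, rounds=0, compute_cost_per_round=0.0):
    """Labeling cost plus the compute cost of retraining and scoring in each AL round."""
    return num_labels * cost_per_label + rounds * compute_cost_per_round


# Hypothetical setup: 100,000 data points; AL reaches the target with 20,000 labels
# over 20 rounds, each round costing 150 in compute.
for cost_per_label in (0.25, 0.01):
    full_labeling = total_cost(100_000, cost_per_label)
    active_learning = total_cost(20_000, cost_per_label, rounds=20, compute_cost_per_round=150.0)
    cheaper = "active learning" if active_learning < full_labeling else "full labeling"
    print(f"cost per label {cost_per_label}: {cheaper} is cheaper")

# Kept as named values so the two regimes are easy to inspect.
expensive_labels_favor_al = total_cost(20_000, 0.25, 20, 150.0) < total_cost(100_000, 0.25)
cheap_labels_favor_al = total_cost(20_000, 0.01, 20, 150.0) < total_cost(100_000, 0.01)
```

With expensive labels the compute overhead is easily repaid; with very cheap labels it is not, which matches the point above that active learning only pays off once labeling cost dominates.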
In general, a machine learning product is not just about building a machine learning model. The models need to be deployed, versioned, and maintained in production. Active learning is no different. Thus, a proper active learning effort involves ML engineers in addition to ML scientists.
One common fear for beginners in the active learning domain is the possibility of introducing bias. This might be caused by the initial sample sets, for example by incorrect labels in the data. At Appen, using our human-in-the-loop process, we label the data meticulously and also try to identify problematic data used to train our machine learning model as part of the active learning process, so that we end up with an unbiased model and a more effective active learning process.
All in all, there is no single algorithm or approach for active learning. It is a process that requires solid, efficient labeling that you can trust. It needs to be tuned and developed properly, and it needs proper engineering processes as well as versioning and monitoring systems.
How Appen Can Help
At Appen, we understand how the data annotation process in machine learning can be the biggest determinant of cost and time savings. Our data annotation platform helps you create high-quality, labeled training data to power your machine learning projects. Included in our annotation tools is our Smart Labeling suite, which offers:
- Pre-labeling: ML provides an initial hypothesis on your data labels.
- Speed Labeling: ML helps contributors determine data labels quickly and accurately.
- Smart Validators: Our ML models verify contributor labels before submission.
Collectively, these tools will advance your projects to completion faster without forcing you to sacrifice on quality or model performance.