A Comparison of Two Popular Machine Learning Techniques
Machine learning (ML) has grown exponentially as a field, but a familiar roadblock remains for many businesses: data. Training ML algorithms traditionally requires enormous amounts of manually labeled data. Data at that scale is often unavailable or costly to acquire, not to mention the time and effort required to label it by hand, and the data that is readily available often falls short of desired quality standards. Labeling also requires human labelers – in many cases, subject matter experts (SMEs) who can use their domain knowledge to make accurate annotations – but SMEs are both limited in availability and expensive to employ. With all of these challenges in mind, teams launching artificial intelligence (AI) solutions are turning away from fully supervised learning (which requires complete, hand-labeled datasets for training ML models) toward two techniques they can leverage to overcome the data challenge: active learning and weak supervision. These techniques are generally faster and less labor-intensive while still capable of training models successfully. Understanding how they work and the benefits each offers will help you decide whether weak supervision, active learning, or a combination of both may be the right training solution for your model.

Active Learning vs Weak Supervision: How They Fit into Supervised Learning

Active Learning
Active learning is a form of semi-supervised learning. Unlike fully supervised learning, the ML algorithm is given only an initial subset of human-labeled data out of a larger, unlabeled dataset. The algorithm trains on that data and then makes predictions on the unlabeled pool, each with a confidence score. Predictions that fall below a chosen confidence threshold signal that more data is needed: those low-confidence data points are sent to a person to label and feed back to the algorithm. The cycle repeats until the model reaches the desired prediction accuracy. This iterative human-in-the-loop method is built on the idea that not all samples are equally valuable for learning, so the algorithm chooses the data it learns from. A key differentiator in active learning is the sampling method used to pick those data points, which significantly affects how the model performs; data scientists can test different sampling methods to select the one that produces the most accurate results. Overall, active learning relies less on human annotation than fully supervised learning because only the data points requested by the machine need labels.
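To make the loop concrete, here is a minimal sketch in Python using scikit-learn with uncertainty sampling, one common sampling method. The dataset, seed size, query batch size, and number of rounds are all illustrative assumptions, and the human annotator is simulated by revealing held-back true labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset; y_true plays the human annotator.
X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)

# Start with a small human-labeled seed set; the rest is the unlabeled pool.
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # each round is one human-in-the-loop cycle
    model.fit(X[labeled], y_true[labeled])

    # Score the unlabeled pool; a low max class probability = low confidence.
    pool = np.where(~labeled)[0]
    confidence = model.predict_proba(X[pool]).max(axis=1)

    # Send the 20 least-confident points to the "annotator" for labels.
    query = pool[np.argsort(confidence)[:20]]
    labeled[query] = True
```

In a real project the loop would stop once held-out accuracy reaches the target, and swapping in a different sampling method only changes how `confidence` is computed.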
Weak Supervision
Weak supervision is a learning technique that blends knowledge from various data sources, many of which are lower-quality or “weak.” These data sources could include:
- Low-quality labels from cheaper, non-expert annotators.
- Higher-level supervision from SMEs, for example, using heuristics (rules). A heuristic might say something like, “If datapoint = x, then label it as y.” Using a heuristic or set of heuristics can instantly label thousands, even millions, of data points (see the sketch after this list).
- Pre-trained, older models, which may be biased or noisy.
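As a concrete illustration of the second bullet, here is a minimal sketch of heuristic labeling functions in Python for a hypothetical spam-detection task. The task, rules, and label values are illustrative assumptions; libraries such as Snorkel formalize the same pattern at scale.

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text: str) -> int:
    # "If datapoint contains a URL, then label it as spam."
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_money_words(text: str) -> int:
    # Money-themed keywords are a weak signal for spam.
    return SPAM if any(w in text.lower() for w in ("free", "winner", "$$$")) else ABSTAIN

def lf_short_greeting(text: str) -> int:
    # Very short greetings are usually legitimate messages.
    return HAM if len(text.split()) < 4 else ABSTAIN

labeling_functions = [lf_contains_link, lf_money_words, lf_short_greeting]

# Applying the rules programmatically labels an entire corpus at once.
corpus = ["Hi there!", "You are a WINNER: claim $$$ at http://spam.example"]
votes = [[lf(doc) for lf in labeling_functions] for doc in corpus]
# votes -> [[-1, -1, 0], [1, 1, -1]]
```

Each function votes or abstains, so one pass over the data yields a matrix of weak labels rather than a single hand-made label per data point.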
What are the Differences Between Active Learning and Weak Supervision?
Both types of learning can produce high-performing models, but they’re notably different in several key ways:

Source of Labels
The labels required for each type of learning are sourced very differently:
Active Learning
- Humans (usually SMEs) label the dataset.
- The labels are assumed to be accurate.
- The labels come from one source.
Weak Supervision
- Sources are flexible and can come from any number of places.
- Labels aren’t necessarily accurate or complete.
- Multiple data sources must be used and reconciled (see the sketch below).
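Because weak labels come from multiple imperfect sources, they must be reconciled into one training label per data point. Here is a minimal sketch of the simplest reconciliation strategy, majority voting; it is an illustrative stand-in for the learned label models used in practice, which also weight each source by its estimated accuracy.

```python
from collections import Counter

ABSTAIN = -1  # a source may decline to vote on a data point

def majority_vote(votes: list[int]) -> int:
    # Ignore abstentions; ties and all-abstain cases stay unlabeled.
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, n), *rest = counts.most_common()
    return ABSTAIN if rest and rest[0][1] == n else top

print(majority_vote([1, 1, 0]))   # -> 1 (two sources outvote one)
print(majority_vote([1, 0, -1]))  # -> -1 (a tie after the abstention)
```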
Resources Required
The time, money, and people invested in each type of learning differ:
Active Learning
- Using SMEs for labeling is expensive, as they require payment and have limited availability.
- Active learning requires humans to spend time labeling at least a portion of the dataset.
Weak Supervision
- Labeling functions can be applied to millions of data points in seconds, saving tremendous amounts of labeling time.
- The time invested in weak supervision varies with the data sources but is generally less than what an active learning project needs.
Process Iteration
While machine learning is always an iterative process, the amount of iteration differs between active learning and weak supervision:
Active Learning
- Uses a human-in-the-loop iterative process over many cycles.
- The model is trained as data is labeled.
Weak Supervision
- The dataset is fully labeled before model training starts.
- There’s no human-in-the-loop baked into the training process (see the sketch below).
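To contrast with the active learning loop sketched earlier, here is a minimal sketch of the weak supervision flow: the entire dataset is labeled up front by a rule, and the model then trains in a single pass with no human-in-the-loop. The rule, data, and model choice are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["Hi there!", "WINNER: claim $$$ at http://spam.example",
          "Lunch tomorrow?", "free money at http://offers.example"]

# Step 1: programmatic labeling replaces hand annotation entirely.
weak_labels = [1 if "http" in doc.lower() else 0 for doc in corpus]

# Step 2: one training pass on the weakly labeled dataset; no further
# rounds with human annotators are required.
X = CountVectorizer().fit_transform(corpus)
model = LogisticRegression().fit(X, weak_labels)
```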