A Comparison of Two Popular Machine Learning Techniques

Machine learning (ML) has grown exponentially as a field, but a familiar roadblock remains for many businesses: data. Training ML algorithms traditionally requires enormous amounts of manually labeled data. Data at that scale is often unavailable or costly to acquire, not to mention the time and effort required to label it by hand, and the data that is readily available often falls short of desired quality standards. Labeling that data also requires human labelers – in many cases, subject matter experts (SMEs) who can use their domain knowledge to make accurate annotations – but SMEs are both limited in availability and expensive to employ. With all of these challenges in mind, teams launching artificial intelligence (AI) solutions are turning away from fully supervised learning (which requires complete, hand-labeled datasets for training ML models) toward active learning and weak supervision. These techniques are generally faster and less labor-intensive while still capable of training models successfully. Understanding how they work and the benefits each offers will help you decide whether weak supervision, active learning, or a combination of both is the right training approach for your model.
Active Learning vs Weak Supervision: How They Fit into Supervised Learning

It’s important to recognize that there are different types of learning in ML, and they all fall under one of two categories: supervised and unsupervised. With supervised learning, the machine receives data points labeled by humans and uses those to make predictions. Unsupervised learning, on the other hand, uses unlabeled data; the algorithm must extract structure and patterns from the data without human guidance. Under the supervised learning umbrella there is a spectrum of learning types, and on this spectrum we find active learning, a form of semi-supervised learning, and weak supervision.
Active Learning

Active learning is a form of semi-supervised learning. Unlike fully supervised learning, the ML algorithm is only given an initial subset of human-labeled data out of a larger, unlabeled dataset. The algorithm processes that data and makes predictions, each with a certain confidence level. Predictions that fall below a confidence threshold signal that more data is needed, and those low-confidence examples are sent to a person to label and return to the algorithm. The cycle repeats until the algorithm is trained and operating at the desired prediction accuracy. This iterative human-in-the-loop method is built on the idea that not all samples are equally valuable for learning, so the algorithm chooses the data it learns from. A key differentiator in active learning is the sampling method used, which significantly affects how the model performs; data scientists can test different sampling methods to select the one that produces the most accurate results. Overall, active learning relies less on human annotation than fully supervised learning because only the data points requested by the machine need labels, not the entire dataset.
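The query loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the 1-D synthetic data, the nearest-centroid model, the distance-from-boundary confidence score, and the oracle function standing in for a human labeler are all invented for this sketch.

```python
import random

random.seed(0)

def true_label(x):
    # Oracle standing in for the human annotator (SME) in the loop.
    return 1 if x >= 0.5 else 0

# A large unlabeled pool, with a small human-labeled seed subset.
pool = [random.random() for _ in range(200)]
labeled = {x: true_label(x) for x in pool[:10]}
unlabeled = pool[10:]

def predict(x, labeled):
    # Toy nearest-centroid classifier: the midpoint between the two
    # class means acts as the decision boundary.
    means = {}
    for c in (0, 1):
        pts = [v for v, lab in labeled.items() if lab == c]
        means[c] = sum(pts) / len(pts)
    pred = 1 if abs(x - means[1]) < abs(x - means[0]) else 0
    boundary = (means[0] + means[1]) / 2
    # Confidence here is simply distance from the boundary.
    return pred, abs(x - boundary)

for _ in range(20):
    # Query the pool point the model is least confident about
    # (uncertainty sampling) ...
    query = min(unlabeled, key=lambda x: predict(x, labeled)[1])
    unlabeled.remove(query)
    # ... and send it to the "human" for a label, then retrain.
    labeled[query] = true_label(query)

accuracy = sum(predict(x, labeled)[0] == true_label(x)
               for x in unlabeled) / len(unlabeled)
print(f"labels used: {len(labeled)}, pool accuracy: {accuracy:.2f}")
```

After only 30 labels (10 seed + 20 queried) the model's boundary settles near the true cutoff, which is the point of the technique: the sampling method (here, plain uncertainty sampling) decides which labels are worth a human's time.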
Weak Supervision

Weak supervision is a learning technique that blends knowledge from various data sources, many of which are lower-quality or “weak.” These data sources could include:
- Lower-quality labels from cheaper, non-expert annotators.
- Higher-level supervision from SMEs, for example, using heuristics (rules). A heuristic might say something like, “If datapoint = x, then label it as y.” Using a heuristic or set of heuristics can instantly label thousands, even millions, of data points.
- Older pre-trained models, whose predictions may be biased or noisy.
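To make the heuristic idea concrete, here is a minimal sketch of weak supervision via labeling functions. The function names, keyword rules, and the simple majority-vote combiner are all invented for illustration; production frameworks such as Snorkel learn to weight and denoise these weak sources rather than counting votes.

```python
# Sentinel values for the three possible outputs of a labeling function.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one heuristic of the form
# "if datapoint = x, then label it as y", abstaining otherwise.
def lf_contains_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_contains_urgent(text):
    return SPAM if "urgent" in text.lower() else ABSTAIN

def lf_reply_thread(text):
    return HAM if text.startswith("Re:") else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_prize, lf_contains_urgent, lf_reply_thread]

def weak_label(text):
    # Combine the weak sources by majority vote over non-abstaining LFs.
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

docs = ["URGENT: claim your prize now", "Re: meeting notes", "lunch today?"]
labels = [weak_label(d) for d in docs]
print(labels)  # noisy training labels for a downstream model
```

Because the heuristics are just functions, they can be applied to an arbitrarily large corpus at the cost of a single pass over the data, which is what lets a handful of SME-written rules label thousands or millions of points at once.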
What are the Differences Between Active Learning and Weak Supervision?

Both types of learning can produce high-performing models, but they differ in several key ways:
Source of Labels

The labels required for each type of learning are sourced very differently:

Active Learning
- Humans (usually SMEs) label the dataset.
- The labels are assumed to be accurate.
- The labels come from one source.
Weak Supervision

- Sources are flexible and can come from any number of places.
- Labels aren’t necessarily very accurate or complete.
- Multiple data sources must be used.
Resources Required

The ratio of time, money, and people invested for each type of learning differs:

Active Learning
- Using SMEs for labeling purposes is expensive, as they require payment and have limited availability.
- Active learning requires humans to spend time labeling at least a portion of the data in a dataset.
Weak Supervision

- Labeling functions can be applied to millions of data points in seconds, saving tremendous amounts of labeling time.
- The time invested in weak supervision training varies depending on the data sources but is generally less than what’s needed for an active learning project.
Process Iteration

While machine learning is always an iterative process, the amount of iteration varies between weak supervision and active learning:

Active Learning
- Uses a human-in-the-loop iterative process of many cycles.
- The model is trained as data is labeled.
Weak Supervision

- The datasets are fully labeled prior to the start of training the model.
- There’s no human-in-the-loop baked into the training process.