What Is Active Learning and How Does It Improve Machine Learning Models? (2026 Guide)
- May 1

Labeling data is one of the most expensive and time-consuming steps in building a machine learning model. Hiring expert annotators, managing quality control, and scaling to millions of examples can cost hundreds of thousands of dollars—before a single model trains. And yet, most traditional machine learning pipelines label data blindly, treating every example as equally important. That's a massive waste. Active learning exists to fix exactly that problem.
TL;DR
Active learning lets a model choose which unlabeled examples to send for human labeling—focusing effort where it matters most.
It can reach the same model accuracy as random sampling using significantly fewer labeled examples.
The core loop: train → query → annotate → retrain → repeat.
Key strategies include uncertainty sampling, query-by-committee, and diversity-based sampling.
It works best when labeling is expensive, unlabeled data is abundant, and expert annotators are scarce.
It is not a silver bullet—poor uncertainty calibration, sampling bias, and pipeline complexity are real risks.
What is active learning in machine learning?
Active learning is a machine learning training strategy where the model identifies the unlabeled examples it would learn from most and requests human labels for those examples first. Instead of labeling data randomly, it prioritizes informativeness—reducing annotation cost and improving model accuracy with fewer labeled examples.
1. Why Labeling Data Is a Real Problem
Building a machine learning model that actually works requires labeled data—examples that have been reviewed, tagged, and verified by a human. That process is slow, expensive, and does not scale gracefully.
According to a widely cited 2021 survey by Cognilytica (a market research firm focused on AI), data preparation and labeling account for up to 80% of the total time spent on a machine learning project. Specialized fields make this worse. A radiologist who labels medical images might review 30 to 50 scans per hour. A legal expert annotating contract clauses might process even fewer. These professionals are expensive, scarce, and cannot simply be replaced by cheaper generalist workers without degrading data quality.
The scale required by modern models compounds the problem. Large image classifiers, NLP models, and fraud detection systems can need hundreds of thousands—or millions—of labeled examples before they generalize reliably. Labeling platforms like Scale AI and Labelbox have built entire businesses around this bottleneck, which tells you how persistent and expensive the problem is.
Here is the uncomfortable truth: not every labeled example teaches a model something new. If your classifier already handles clear, common cases confidently, labeling more of those clear, common cases adds almost nothing. You're paying for confirmation, not information.
Active learning addresses this directly. Instead of asking "how do we label more data faster?", it asks "which data is worth labeling at all?"
2. Quick Definition of Active Learning
Active learning is a machine learning training strategy where the model takes an active role in selecting which unlabeled examples it wants to learn from. Rather than receiving a pre-labeled dataset and training on it passively, the model queries a human annotator—called an oracle—to label specific examples it finds confusing or uncertain.
Think of it like a student preparing for an exam. A passive student reads every page of every textbook in sequence. An active student looks at practice questions, identifies the ones they find hardest, and asks their teacher to explain exactly those problems. The active student does not necessarily read more—but they learn more efficiently.
In machine learning terms:
The student is the model.
The textbook is the pool of unlabeled data.
The teacher is the human annotator.
The hard questions are the examples the model is most uncertain about.
The goal is not just to reduce annotation cost—it is to make every label count.
3. Background: Supervised Learning and Labeled Data
Before going deeper into active learning, you need to understand what it is reacting against: traditional supervised learning.
What Is Supervised Learning?
Supervised learning is the most common approach to training machine learning models. You start with a labeled dataset—a collection of input examples, each paired with a correct output label. You train the model on that dataset, and the model learns to map inputs to outputs.
For a spam email classifier, the inputs are emails and the labels are "spam" or "not spam." For a radiology model, the inputs are chest X-rays and the labels are diagnostic findings like "pneumonia detected" or "clear." The model learns statistical patterns from these examples.
What Is Labeled vs. Unlabeled Data?
Labeled data: An example that has been reviewed and tagged with the correct output. Example: a photo of a cat tagged as "cat."
Unlabeled data: An example with no tag attached. Example: a photo that has not yet been reviewed.
Unlabeled data is almost always abundant. Every company generates logs, emails, images, transactions, and documents at scale. But labeling that data requires human judgment, time, and often specialized expertise.
A hospital may have a million chest X-rays in its archive. But if those X-rays have not been reviewed and annotated by a radiologist, they cannot be used directly to train a supervised model. The labels are the bottleneck.
Why Data Quality Matters More Than Quantity
A model trained on 10,000 high-quality, carefully labeled examples will often outperform a model trained on 100,000 noisy, inconsistently labeled examples. This is a well-established finding in the ML research literature—more data is only useful if the data adds new information. Redundant, mislabeled, or trivially easy examples can actually harm generalization.
Active learning is, in part, a response to this insight.
4. The Core Idea: How Active Learning Works
In traditional supervised learning, the workflow is linear:
Collect and label a large dataset.
Train a model.
Evaluate.
Deploy.
The labeling happens entirely before training. The model has no say in what gets labeled.
Active learning flips this. The model participates in the labeling process itself. Here is the core loop:
Start with a small labeled dataset.
Train an initial model.
Show the model a large pool of unlabeled examples.
The model evaluates each unlabeled example and identifies which ones it is most uncertain about.
Those uncertain examples are sent to a human annotator for labeling.
The newly labeled examples are added to the training set.
The model is retrained.
Repeat.
The critical insight is that the model does not ask for labels randomly. It selects the examples most likely to improve its own performance. This creates a feedback loop between the model and the labeling process—each iteration makes the model better at selecting useful examples, and each batch of labels makes the model better at prediction.
5. Step-by-Step Active Learning Workflow
Here is a detailed breakdown of how a typical active learning pipeline works:
Step 1 – Build a seed dataset. Start with a small but representative set of labeled examples. This might be 50 to 500 examples, depending on task complexity. The seed set should cover the major classes or categories your model needs to recognize.
Step 2 – Train an initial model. Use the seed dataset to train a baseline model. At this stage, the model will likely perform poorly—that is expected. It only has a little labeled data.
Step 3 – Score unlabeled examples. Pass the full pool of unlabeled examples through the model. The model produces a prediction for each one. In a classification task, this means a probability score for each possible class.
Step 4 – Apply a query strategy. Use a strategy—uncertainty sampling, diversity sampling, or others—to rank or select which unlabeled examples the model should ask to have labeled. The goal is to find examples the model finds most informative.
Step 5 – Send to the oracle (human annotator). Submit the selected examples to a human annotator, domain expert, or labeling system. They review and label each one.
Step 6 – Update the training set. Add the newly labeled examples to the existing labeled dataset.
Step 7 – Retrain or fine-tune the model. Retrain the model on the updated dataset. Alternatively, fine-tune from the current weights to save compute time.
Step 8 – Evaluate. Check the model's performance on a held-out validation set. Has it improved? Are there still weak areas?
Step 9 – Repeat. Run the loop again—score unlabeled examples, select the next batch, label, retrain—until you hit a stopping criterion: target accuracy reached, labeling budget exhausted, or marginal improvement too small to justify the cost.
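Note: The batch size for each query round matters. Querying one example at a time and retraining after every label gives the model the freshest information for each selection, but it is impractical with human annotators and slow retraining. Batch active learning, where you select a batch per round (often tens to a few hundred examples), balances efficiency with annotation workflow realities.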
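The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the 1-D threshold "model", the `oracle` function with its true boundary at 0.6, and all function names are hypothetical stand-ins, and the held-out evaluation from Step 8 is omitted for brevity.

```python
import random

def train_threshold(labeled):
    """Toy 1-D classifier (assumption): threshold midway between class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def oracle(x):
    """Stand-in for the human annotator; the true boundary 0.6 is made up."""
    return int(x > 0.6)

def active_learning_loop(seed, pool, batch_size=5, rounds=4):
    labeled, pool = list(seed), list(pool)
    model = train_threshold(labeled)                 # Step 2: initial model
    for _ in range(rounds):                          # Step 9: repeat
        if not pool:
            break
        # Steps 3-4: points closest to the threshold are the most uncertain
        pool.sort(key=lambda x: abs(x - model))
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Steps 5-6: ask the oracle and grow the labeled set
        labeled += [(x, oracle(x)) for x in batch]
        model = train_threshold(labeled)             # Step 7: retrain
    return model, labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]            # unlabeled pool
seed = [(0.1, 0), (0.9, 1)]                          # Step 1: tiny seed set
model, labeled = active_learning_loop(seed, pool)
print(f"learned threshold: {model:.3f} from {len(labeled)} labels")
```

In a real pipeline the stopping criterion would usually come from validation accuracy or the labeling budget rather than a fixed number of rounds.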
6. Human-in-the-Loop Machine Learning
Active learning is one of the most important implementations of a broader concept: human-in-the-loop (HITL) machine learning.
HITL machine learning means that humans are integrated into the model training or decision process—not just as one-time annotators, but as ongoing participants who review, correct, and guide model behavior.
In active learning, the human-in-the-loop specifically plays the role of the oracle—the trusted source of ground-truth labels. The model asks questions; the human answers.
Why Human Judgment Matters
Machine learning models do not inherently understand context the way domain experts do. A model trained to detect pneumonia from chest X-rays might assign high uncertainty to an unusual presentation that an experienced radiologist recognizes immediately. Or vice versa—the model might be overconfident on a case that a radiologist would flag for review.
Human judgment adds two things a model cannot generate on its own:
Ground truth — the correct label for an ambiguous example.
Calibration signal — feedback that corrects the model's confidence estimates over time.
Making Human Labeling More Efficient
Active learning makes expert time more efficient by directing attention to the examples that actually need it. Without active learning, an annotator might spend half their day labeling examples the model already handles confidently. With active learning, they spend their time on the hard cases—the examples that genuinely improve the model.
This is particularly valuable when the annotators are domain experts: radiologists, structural engineers, financial analysts, or legal reviewers. Their time is expensive and limited.
7. Types of Active Learning Scenarios
There are three main setups for active learning, each suited to different data environments.
A. Pool-Based Active Learning
Definition: The model has access to a large, fixed pool of unlabeled examples. It scores all of them and selects the most informative subset to query.
How it works: At each iteration, the model evaluates every unlabeled example in the pool, ranks them by informativeness, and picks the top batch to send for labeling.
When it's useful: When you have a large existing collection of unlabeled data—images, documents, sensor readings—and want to label it strategically.
Advantages: Allows the model to compare all available examples before selecting. High-quality selection per round.
Limitations: Computationally expensive if the pool is very large. Scoring millions of examples every iteration adds overhead.
Example: A hospital with 200,000 archived pathology slides wants to train a cancer detection model. Pool-based active learning lets the model score all slides and select the most diagnostically uncertain ones for pathologist review.
B. Stream-Based Selective Sampling
Definition: Unlabeled examples arrive one at a time (a data stream). The model decides on each example whether to request a label or discard it.
How it works: For each incoming example, the model evaluates its informativeness. If it exceeds a threshold (e.g., uncertainty above a cutoff), the model requests a label. Otherwise, it moves on.
When it's useful: Real-time systems where data arrives continuously—network traffic analysis, fraud detection on transaction streams, or sensor monitoring in manufacturing.
Advantages: Works with live data streams. Low memory footprint since you don't store the full pool.
Limitations: The model must make label/skip decisions in real time. Setting the right threshold is difficult.
Example: A cybersecurity platform processes millions of network packets per minute. Stream-based selective sampling flags only the packets the model finds most anomalous for human analyst review.
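The label-or-skip decision in stream-based selective sampling can be sketched as a simple threshold test. This is one possible rule, assuming the model outputs class probabilities and that predictive entropy is the informativeness measure; the `should_query` name and the 0.6 cutoff are illustrative assumptions, not recommended values.

```python
import math

def entropy(probs):
    """Predictive entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_query(probs, threshold):
    """Request a label only when entropy exceeds the cutoff."""
    return entropy(probs) > threshold

# Hypothetical stream of binary predictions arriving one at a time
stream = [[0.98, 0.02], [0.55, 0.45], [0.90, 0.10], [0.51, 0.49]]
threshold = 0.6  # assumed cutoff; the maximum for 2 classes is ln 2, about 0.693

queried = [i for i, probs in enumerate(stream) if should_query(probs, threshold)]
print(queried)
```

Only the two near-50/50 predictions cross the cutoff. Raising or lowering `threshold` trades labeling cost against the risk of skipping informative examples, which is why tuning it is hard in practice.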
C. Membership Query Synthesis
Definition: Instead of selecting from existing unlabeled data, the model generates new examples it wants labeled—examples that do not exist in the original dataset.
How it works: The model synthesizes new inputs that would be maximally informative, often near decision boundaries in the feature space.
When it's useful: When the existing unlabeled pool is small or poorly representative. Common in research settings and some generative model applications.
Advantages: Can probe the model's decision boundary precisely.
Limitations: Synthesized examples may be unrealistic or impossible to label meaningfully by humans. Less practical in most real-world deployments.
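Example: Membership queries were studied in early active learning research (Angluin, 1988; Cohn, Atlas, and Ladner, 1994). When Lang and Baum (1992) applied query synthesis to handwritten character recognition, many of the synthesized images near the classifier's decision boundary were unrecognizable to the human annotators, which illustrates the limitation above.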
8. Query Strategies Explained
The query strategy is the algorithm that decides which unlabeled examples to request labels for. This is the intellectual heart of active learning.
A. Uncertainty Sampling
What it is: The most widely used strategy. The model selects the examples it is least confident about.
How it works: The model produces a probability distribution over classes for each unlabeled example. Examples where the distribution is flattest—where the model is most unsure—are selected.
There are three common variants:
Variant | Logic | Formula Concept |
Least confidence | Select the example where the top class probability is lowest | 1 − max P(class); higher means more uncertain |
Margin sampling | Select where the gap between the top two class probabilities is smallest | P(top class) − P(second class); lower means more uncertain |
Entropy sampling | Select where the full probability distribution is most spread out | −Σ P(class) log P(class); higher means more uncertain |
Strengths: Simple to implement. Works well in practice for classification tasks.
Limitations: Ignores diversity. Can end up selecting very similar uncertain examples, leading to a biased and redundant batch.
Example: A sentiment classifier labels customer reviews as positive, neutral, or negative. A review that gets scores of 34% / 33% / 33% across the three classes is almost maximally uncertain: the model is close to guessing. Uncertainty sampling selects exactly that example.
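The three uncertainty variants can be computed directly from a probability vector. The sketch below applies them to the 34% / 33% / 33% review from the example and to a hypothetical confident prediction for contrast; the function names are illustrative.

```python
import math

def least_confidence(probs):
    """1 - max P(class): higher means more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the top two class probabilities: lower means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """-sum P log P in nats: higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

ambiguous = [0.34, 0.33, 0.33]   # the review from the example above
confident = [0.90, 0.07, 0.03]   # hypothetical easy review

for name, probs in [("ambiguous", ambiguous), ("confident", confident)]:
    print(name,
          round(least_confidence(probs), 3),
          round(margin(probs), 3),
          round(entropy(probs), 3))
```

Note the direction of each score: least confidence and entropy are ranked in descending order, while margin is ranked in ascending order.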
B. Query-by-Committee (QBC)
What it is: Train multiple models (a "committee") and select examples where the committee members disagree most.
How it works: Each model in the committee votes on the class label for every unlabeled example. The examples with the most disagreement—the highest vote entropy or disagreement score—are selected for labeling.
Strengths: More robust than a single model's uncertainty. Disagreement is a richer signal of informativeness than a single probability score.
Limitations: Training and maintaining multiple models is expensive. Committee members may be too similar to provide useful diversity.
Example: A fraud detection system runs five models trained on different data subsets. A transaction that three models flag as fraud and two flag as legitimate is selected for manual review. That disagreement signals a genuinely ambiguous case.
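Committee disagreement is often measured with vote entropy. The following sketch assumes each committee member casts one hard vote per example; the transaction IDs and votes are made up to mirror the three-versus-two split described above.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution; 0 means full agreement."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

# Hypothetical votes from a five-model committee
committee_votes = {
    "txn_a": ["fraud", "fraud", "fraud", "legit", "legit"],  # 3 vs 2 split
    "txn_b": ["legit"] * 5,                                  # unanimous
    "txn_c": ["fraud", "fraud", "fraud", "fraud", "legit"],  # 4 vs 1 split
}

# Most contested transactions first
ranked = sorted(committee_votes,
                key=lambda t: vote_entropy(committee_votes[t]),
                reverse=True)
print(ranked)
```

The three-versus-two transaction ranks first, and the unanimous one ranks last because its vote entropy is zero.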
C. Expected Model Change
What it is: Select examples that, if labeled and added to training, would cause the largest change to the model's parameters.
How it works: The model estimates the gradient length or the magnitude of the expected weight update for each unlabeled example. Examples that would trigger the biggest update are chosen.
Strengths: Directly optimizes for learning progress.
Limitations: Computationally intensive. Requires calculating expected gradient updates for every candidate example.
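One common instance of this idea is expected gradient length. The sketch below assumes a linear softmax classifier trained with cross-entropy loss and no bias term, for which the gradient with respect to the weights, given input x and label y, is the outer product of (p − onehot(y)) and x, so its norm factorizes into two vector norms. For deep networks the gradient would have to be computed by backpropagation instead.

```python
import math

def expected_gradient_length(probs, x):
    """Expected weight-gradient norm, averaging over the model's own label beliefs.

    Assumes a linear softmax classifier with cross-entropy loss (no bias):
    ||grad|| = ||p - onehot(y)|| * ||x|| for each candidate label y.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    total = 0.0
    for y, p_y in enumerate(probs):
        residual_sq = sum((p - (1.0 if k == y else 0.0)) ** 2
                          for k, p in enumerate(probs))
        total += p_y * math.sqrt(residual_sq) * x_norm
    return total

x = [1.0, 0.0]  # hypothetical feature vector
uncertain_score = expected_gradient_length([0.5, 0.5], x)
confident_score = expected_gradient_length([0.99, 0.01], x)
print(uncertain_score, confident_score)
```

With the same input, the uncertain prediction produces a much larger expected update than the confident one, which is why this strategy tends to favor ambiguous examples with large feature norms.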
D. Expected Error Reduction
What it is: Select examples expected to reduce the model's future prediction error by the greatest amount.
How it works: For each unlabeled example, simulate labeling it with each possible class, retrain the model, and estimate how much the generalization error drops. Choose the example with the largest expected error reduction.
Strengths: Directly targets model improvement.
Limitations: Extremely expensive computationally—requires simulating retraining for every candidate and every possible label. Rarely used in practice at scale.
E. Diversity-Based Sampling
What it is: Select examples that are different from each other and from already-labeled examples, to ensure broad coverage of the data distribution.
How it works: Use clustering, embedding distances, or representativeness scores to select a batch of examples that span different regions of the feature space.
Strengths: Prevents the model from over-indexing on a narrow region of the data space. Reduces redundancy in batch queries.
Limitations: Diversity alone does not guarantee informativeness. Diverse examples may include many that are easy for the model.
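One simple diversity-based selector is greedy k-center (farthest-first) selection over embeddings. The sketch below is an illustration under assumptions: points are small tuples in Euclidean space, the labeled set is non-empty, and `k_center_greedy` is a hypothetical helper name. It needs Python 3.8+ for `math.dist`.

```python
import math

def k_center_greedy(pool, labeled, k):
    """Repeatedly pick the pool point farthest from everything chosen so far."""
    # Distance from each pool point to its nearest already-labeled point
    min_dist = [min(math.dist(p, c) for c in labeled) for p in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = max(range(len(pool)), key=lambda i: min_dist[i])
        chosen.append(idx)
        # The newly chosen point now also counts as "covered" territory
        min_dist = [min(d, math.dist(p, pool[idx]))
                    for d, p in zip(min_dist, pool)]
    return chosen

# Hypothetical 2-D embeddings: two near-duplicate pairs and one outlier
pool = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
labeled = [(0.0, 0.0)]
picked = k_center_greedy(pool, labeled, k=2)
print(picked)
```

The selector skips the points next to the labeled example and takes one point from each distant region, rather than both members of a near-duplicate pair.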
F. Hybrid Strategies (Combining Uncertainty + Diversity)
In real-world systems, no single strategy dominates. Most production active learning pipelines combine multiple signals:
Uncertainty: prioritize examples the model is confused about.
Diversity: ensure selected examples span different regions of the feature space.
Representativeness: prefer examples that are similar to the broader unlabeled distribution.
Business rules: exclude examples that are too costly to label, too noisy, or irrelevant to the current deployment environment.
Methods like BADGE (Batch Active learning by Diverse Gradient Embeddings), introduced by Jordan Ash and colleagues (ICLR 2020), combine gradient-based informativeness with diversity to produce effective batch queries for deep learning.
Key takeaway: Uncertainty sampling is the go-to starting point. But for production systems and deep learning, combining uncertainty with diversity and representativeness produces more robust results.
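A very simple way to combine uncertainty with diversity is to shortlist the most uncertain examples and then spread the batch across that shortlist. The sketch below is not BADGE; it is a plain two-stage heuristic, and the `hybrid_batch` name, the `shortlist` size, and the toy data are assumptions for illustration. It expects a non-empty pool.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def hybrid_batch(probs, embeddings, k, shortlist=10):
    """Stage 1: keep the most uncertain candidates. Stage 2: pick ones far apart."""
    ranked = sorted(range(len(probs)),
                    key=lambda i: entropy(probs[i]), reverse=True)
    candidates = ranked[:shortlist]
    batch = [candidates[0]]                      # start from the most uncertain
    while len(batch) < min(k, len(candidates)):
        rest = [i for i in candidates if i not in batch]
        # Add the candidate farthest from its nearest already-selected example
        batch.append(max(rest, key=lambda i: min(
            math.dist(embeddings[i], embeddings[j]) for j in batch)))
    return batch

# Hypothetical predictions and 2-D embeddings for five unlabeled examples
probs = [[0.50, 0.50], [0.52, 0.48], [0.55, 0.45], [0.99, 0.01], [0.60, 0.40]]
embeddings = [(0.0, 0.0), (0.1, 0.0), (4.0, 4.0), (9.0, 9.0), (0.0, 0.1)]

batch = hybrid_batch(probs, embeddings, k=2, shortlist=3)
print(batch)
```

Pure uncertainty sampling would pick examples 0 and 1, which are near-duplicates in embedding space. The two-stage version keeps example 0 and swaps in example 2, which is almost as uncertain but lies in a different region.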
9. Practical Example: Medical Image Classification
Consider a hospital that wants to train a deep learning model to detect diabetic retinopathy—a leading cause of blindness—from retinal photographs.
The problem: The hospital has 500,000 retinal images in its archive. Ophthalmologist annotation costs roughly $15–$30 per image (clinical expert time). Labeling the full archive would cost $7.5 million to $15 million, far beyond any realistic annotation budget.
Random sampling approach: Label 10,000 images chosen randomly. The model trains on those. But many randomly selected images are clear, obvious cases the model quickly learns. A large fraction of labels are wasted on easy examples.
Active learning approach:
Ophthalmologists label a seed set of 500 diverse, representative images.
An initial ResNet-based model trains on those 500 examples.
The model scores the remaining 499,500 unlabeled images.
It selects the 500 most uncertain images—cases near the boundary between "mild," "moderate," and "severe" retinopathy, or cases with unusual image artifacts.
Ophthalmologists label those 500 images.
The model retrains on 1,000 labeled images.
The loop repeats.
Suppose that after nine query rounds (500 seed images plus 9 × 500 queried images, 5,000 labels in total), the active learning model reaches the validation accuracy that the random-sampling model needed 10,000 labels to achieve. In that case, the annotation budget drops by half.
More importantly, the actively sampled images include edge cases—rare presentations, image quality issues, borderline disease—that the random sample likely missed. This improves the model's robustness in clinical deployment.
This type of result has been documented in medical imaging research. A 2022 study published in npj Digital Medicine (Nature Portfolio) found that active learning strategies for retinal disease classification achieved equivalent performance to fully supervised baselines using 30–50% fewer labeled examples, depending on the query strategy used.
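The cost figures in this example follow from simple arithmetic, sketched below with the per-image price range quoted above. The `labeling_cost` helper is a hypothetical name, and the numbers describe the illustrative scenario, not measured results.

```python
COST_PER_IMAGE = (15, 30)  # USD per image, low and high, from the example

def labeling_cost(n_images, cost_range=COST_PER_IMAGE):
    """Return the (low, high) total annotation cost in USD."""
    low, high = cost_range
    return n_images * low, n_images * high

full_archive = labeling_cost(500_000)   # label everything
random_budget = labeling_cost(10_000)   # random-sampling scenario
active_budget = labeling_cost(5_000)    # active-learning scenario

for name, (low, high) in [("full archive", full_archive),
                          ("random sampling", random_budget),
                          ("active learning", active_budget)]:
    print(f"{name}: ${low:,} to ${high:,}")
```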
10. How Active Learning Improves ML Models
Active learning does not just save money. It changes the character of the training data—and therefore the character of the model.
Improves Label Efficiency
The model reaches a given accuracy level using fewer labeled examples than random sampling. Every label added is chosen because it fills a genuine gap in the model's knowledge, not because it was next in a queue.
Reduces Annotation Cost
Fewer labels needed means lower direct cost. This is especially impactful when labels require expensive domain experts—radiologists, lawyers, structural engineers, financial analysts.
Helps Models Learn from High-Value Examples
By focusing on uncertain and diverse examples, active learning builds training sets that cover the hard cases—the edge cases and boundary regions that determine whether a model generalizes or fails in deployment.
Can Improve Accuracy Faster
In the early rounds of training, each actively selected label often produces a larger jump in validation accuracy than a randomly selected label. The model learns quickly because it is studying exactly the material it doesn't understand yet.
Identifies Edge Cases
Uncertain examples are often unusual, rare, or hard to classify. By prioritizing these, active learning surfaces edge cases that random sampling might skip entirely—making the final model more robust to real-world variability.
Supports Continuous Model Improvement
Active learning is not a one-time setup. In production systems, new unlabeled data arrives continuously. Active learning can integrate with ongoing annotation pipelines to keep the model improving as the data distribution evolves.
Helps Prioritize Expert Attention
When labeling requires scarce domain experts, active learning directs their attention to the examples that matter. A pathologist reviewing slides does not need to confirm 10,000 obvious cases; they need to review the 500 cases the model genuinely struggles with.
11. Active Learning vs. Random Sampling
Property | Random Sampling | Active Learning |
Selection method | Random, no model input | Model-guided, informativeness-based |
Labeling efficiency | Lower—wastes budget on easy examples | Higher—focuses on informative examples |
Cost per accurate model | Higher (needs more labels) | Lower (needs fewer labels) |
Edge case discovery | Poor—may miss rare classes | Better—uncertain examples often are edge cases |
Pipeline complexity | Simple | More complex |
Best for | Abundant cheap labels, simple tasks | Scarce expensive labels, complex tasks |
The key difference: Random sampling treats all unlabeled examples as equally valuable. Active learning treats them differently based on what the model already knows. When labeling is cheap and data is simple, random sampling is fine. When labeling is expensive and the task is complex, active learning pays off.
12. Active Learning vs. Semi-Supervised Learning
These two approaches are often confused. They solve related but distinct problems.
Semi-supervised learning uses both labeled and unlabeled data during training—without asking for more labels. It assumes that structure in the unlabeled data (clusters, density patterns, manifold geometry) provides a useful signal that improves the model. Methods include label propagation, self-training, consistency regularization, and contrastive learning.
Active learning uses unlabeled data differently—it evaluates it and asks for labels on selected examples. The unlabeled data itself is not directly used for training; it is a pool from which to select the next labeling batch.
Property | Semi-Supervised Learning | Active Learning |
Uses unlabeled data in training | Yes | No (uses it for selection only) |
Requests new labels from humans | No | Yes |
Human-in-the-loop | Not typically | Core to the method |
Goal | Use unlabeled structure to improve the model | Choose which unlabeled examples to label |
Can they be combined? Yes. A common approach: use semi-supervised learning to leverage unlabeled data during training, while simultaneously using active learning to select the next batch of examples to label. This is sometimes called semi-supervised active learning and has shown strong results in low-label regimes.
13. Active Learning vs. Reinforcement Learning
These are fundamentally different paradigms. Mixing them up is a common mistake.
Reinforcement learning (RL) involves an agent that takes actions in an environment, receives rewards or penalties, and learns a policy to maximize cumulative reward over time. Examples: game-playing agents (AlphaGo, OpenAI Five), robotic control, recommendation system optimization.
Active learning is about choosing which data to label. There is no agent, no environment, and no action-reward loop. The model is not learning to act—it is learning a classification or regression function from labeled examples.
The only loose connection: some researchers have framed active learning query strategies as sequential decision-making problems and applied RL to learn better query policies. But this is an advanced research direction, not how active learning is typically used in practice.
14. Active Learning vs. Transfer Learning
Transfer learning starts from a model pre-trained on a large external dataset (ImageNet, Common Crawl, etc.) and fine-tunes it on a smaller target dataset. The idea: knowledge learned on one task transfers to another.
Active learning selects which examples to label for the target task—regardless of whether the model starts from scratch or from a pre-trained checkpoint.
They are complementary. A common production pattern in 2026:
Start with a powerful pre-trained foundation model (e.g., a vision encoder or language model).
Use active learning to select which task-specific examples to annotate.
Fine-tune the pre-trained model on the actively selected labeled examples.
This combination reduces both the quantity of labels needed and the amount of compute required for fine-tuning.
15. When to Use Active Learning
Active learning delivers the most value when several conditions hold simultaneously:
Labeling is expensive. If annotation requires expert time—medical, legal, scientific, engineering—every label counts.
Unlabeled data is abundant. You need a large pool to select from. If you only have 100 unlabeled examples, pool-based active learning adds little.
Expert annotators are scarce. The bottleneck is human capacity, not data.
The model can estimate uncertainty reasonably well. If the model's probability outputs are well-calibrated, uncertainty-based query strategies are reliable.
Edge cases matter. In safety-critical applications (medical devices, autonomous vehicles, fraud detection), missing rare but important cases is costly.
The task is complex. Tasks where there is genuine ambiguity—boundary cases, rare classes, subtle distinctions—benefit most.
Data distribution is non-uniform. Some regions of the feature space are underrepresented. Active learning can find and prioritize them.
16. When Active Learning May Not Be Worth It
Active learning adds complexity. There are scenarios where it is not worth that complexity:
Labels are cheap and abundant. If a crowdsourcing platform can label your data for $0.01 per example, strategic selection adds little financial value.
The initial model is too weak. If the seed model is so bad it cannot produce useful uncertainty estimates, its queries will be nearly random.
Model uncertainty estimates are poor. Neural networks without proper calibration can be confidently wrong. If the uncertainty scores are unreliable, the entire selection process breaks down.
Unlabeled pool is low quality. If unlabeled data is noisy, corrupted, or unrepresentative of deployment conditions, no query strategy saves you.
Annotation workflows are slow. Active learning loops require responsive annotation. If your labeling pipeline takes weeks per batch, the iterative nature of active learning stalls.
Operational complexity exceeds value. For small, simple projects, the engineering overhead of building an active learning pipeline may not pay off.
Warning: Active learning is a data strategy, not a magic fix. If your data, model, or annotation pipeline is broken, active learning will surface that brokenness faster—not fix it.
17. Challenges and Risks
Sampling Bias
If the query strategy consistently selects certain types of examples (e.g., always near class boundaries), the training set becomes biased toward those types. The model may perform well on uncertain cases but worse on easy cases that were never labeled.
Cold-Start Problem
The initial model trained on a tiny seed set may be so poor that its uncertainty estimates are meaningless. Early-round queries may be nearly random, wasting the first few annotation rounds.
Poor Uncertainty Calibration
Standard deep learning models are often poorly calibrated—their softmax probabilities do not accurately reflect true confidence. An overconfident wrong prediction looks certain to the model, so it never asks for a label on that example. Calibration techniques (temperature scaling, Platt scaling) help but add complexity.
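Temperature scaling divides the logits by a single scalar T, chosen to minimize negative log-likelihood on held-out labeled data. The sketch below is a simplification under assumptions: it picks T by grid search instead of a gradient-based optimizer, and the validation logits and labels are made up to show an overconfident model.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    grid = [0.5 + 0.1 * i for i in range(46)]    # assumed search range 0.5 to 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation set: every prediction is very confident, one is wrong
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
print(round(T, 1), softmax(val_logits[0]), softmax(val_logits[0], T))
```

A fitted T above 1 softens the probabilities, so examples that looked certain to the uncalibrated model can become eligible for querying. Temperature scaling does not change which class is predicted, only how confident the model claims to be.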
Class Imbalance
Active learning can exacerbate class imbalance. If one class is rare, the model may never become uncertain about it (because it rarely sees examples of it) and may never select rare-class examples for labeling. Explicit class-balance constraints in the query strategy help.
Annotation Errors
Active learning selects hard examples—exactly the ones annotators are most likely to disagree on. This means the highest-value labels are also the noisiest. Clear annotation guidelines and inter-annotator agreement checks are critical.
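A basic inter-annotator agreement check for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses made-up severity labels; it assumes both annotators labeled the same examples in the same order and that chance agreement is below 1.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on six actively selected images
annotator_1 = ["mild", "mild", "severe", "severe", "mild", "severe"]
annotator_2 = ["mild", "severe", "severe", "severe", "mild", "mild"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six images, but half of that agreement is expected by chance, so kappa is only about 0.33. Low values on actively selected batches are a signal to tighten the guidelines or add adjudication.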
Expensive Retraining
In deep learning, retraining the model after every query batch can be computationally expensive. Large models may take hours per retraining cycle. Careful scheduling of retraining frequency is necessary.
Human Annotator Fatigue
Because active learning selects the hardest examples, annotators face a consistently challenging labeling queue. Unlike random labeling (where easy and hard examples mix), active learning creates annotation sessions full of ambiguous, difficult cases—which is mentally exhausting. Rotation and workload management matter.
Data Drift
In production systems, the data distribution shifts over time. An active learning strategy optimized for the current distribution may select examples that are no longer representative of future deployment data.
18. Best Practices
1. Build a representative seed dataset. The initial labeled set should cover all major classes and include some edge cases. Do not start with only the easiest examples.
2. Always maintain a strong held-out validation set. The validation set must be labeled separately—never use it for training. It is your ground truth for measuring true model improvement across iterations.
3. Combine uncertainty with diversity. Pure uncertainty sampling leads to redundant batches. Adding a diversity or representativeness constraint ensures broad coverage of the data distribution.
4. Monitor class balance. After each iteration, check whether the labeled dataset remains balanced across classes. If a class is being undersampled, add constraints to ensure representation.
5. Calibrate model probabilities. Before using uncertainty scores as a query signal, verify that the model's probabilities are well-calibrated. Temperature scaling after training is a simple, effective calibration method.
6. Write clear annotation guidelines. Hard examples are hard for a reason. Annotators need detailed, specific guidelines to label them consistently. Ambiguous guidelines produce noisy labels that hurt the model.
7. Track labeling quality. Use inter-annotator agreement metrics (Cohen's kappa, Fleiss' kappa) to monitor annotation consistency. High disagreement signals that the task definition needs refinement.
8. Always measure against a random sampling baseline. The only way to know if active learning is helping is to compare it against random sampling with the same annotation budget. If active learning does not outperform random sampling on your task, something is wrong with the strategy.
9. Avoid retraining too frequently. In deep learning, retraining after every single label is prohibitively expensive. Batch sizes of 50–500 examples per round, with periodic full retraining, are practical.
10. Stop when marginal returns diminish. Once enough examples have been labeled to cover the key regions of the data space, additional queries yield progressively smaller improvements. Track the slope of the learning curve across iterations and stop when gains plateau.
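To make practice 5 concrete, here is a minimal sketch of temperature scaling: a single scalar T divides the logits, and T is chosen to minimize negative log-likelihood on held-out labeled data. The logits and labels are invented for illustration, and the coarse grid search is a simplification assumed here; a real pipeline would more likely use a numerical optimizer.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels):
    """Coarse grid search over T in [0.5, 5.0] (a simplification, see lead-in)."""
    candidates = [0.5 + 0.05 * i for i in range(91)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical held-out logits from an overconfident 3-class model:
# two of the six predictions are confidently wrong.
val_logits = [[4.0, 0.5, 0.2], [4.0, 0.1, 1.0], [0.2, 4.2, 0.3],
              [0.1, 0.4, 3.9], [4.1, 0.3, 0.2], [0.3, 4.0, 1.0]]
val_labels = [0, 2, 1, 2, 0, 2]

best_t = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {best_t:.2f}")
print(f"NLL before: {mean_nll(val_logits, val_labels, 1.0):.3f}, "
      f"after: {mean_nll(val_logits, val_labels, best_t):.3f}")
```

Because the scaling divides every logit by the same T, the predicted class never changes; only the confidence attached to it does. That is why it can be applied after training without touching accuracy.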
19. Active Learning in Deep Learning
Deep neural networks have raised the performance ceiling of machine learning, but they are notoriously data-hungry. Training a ResNet from scratch on ImageNet (ILSVRC-2012) uses roughly 1.28 million labeled training images. Training a BERT-based text classifier for a specialized domain might require tens of thousands of annotated documents.
Active learning is a natural fit for reducing this requirement—but deep learning creates specific challenges.
Uncertainty in Neural Networks
Standard neural networks do not natively produce well-calibrated uncertainty estimates. A softmax output of 0.95 does not reliably mean the model is 95% confident—it often reflects the model's tendency to produce extreme probability values even on ambiguous inputs.
Solutions include:
Monte Carlo Dropout (MC Dropout): Apply dropout at inference time, run multiple forward passes, and treat the variance across passes as a proxy for uncertainty (Gal and Ghahramani, 2016, ICML). This is a computationally practical approximation to Bayesian inference.
Deep Ensembles: Train multiple models with different random seeds. The disagreement across ensemble members serves as an uncertainty estimate (Lakshminarayanan et al., 2017, NeurIPS). Ensembles are generally more reliable than MC Dropout, but training N members costs roughly N times the compute.
Temperature Scaling: A post-training calibration method that adjusts softmax outputs to better reflect true confidence without retraining the model.
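Whether the repeated predictions come from MC Dropout passes or from ensemble members, they still have to be reduced to one score per example. The sketch below is one common way to do that, under the assumption that each pass returns a class-probability vector. It uses entropy-based summaries rather than raw variance, which is a swapped-in choice and not necessarily what the cited papers use. The pass outputs are made up.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def summarize_passes(pass_probs):
    """pass_probs: one probability vector per stochastic pass or ensemble member."""
    n_passes = len(pass_probs)
    n_classes = len(pass_probs[0])
    # Average prediction across passes.
    mean_probs = [sum(p[c] for p in pass_probs) / n_passes for c in range(n_classes)]
    # Entropy of the averaged prediction: overall uncertainty.
    predictive_entropy = entropy(mean_probs)
    # Average entropy of the individual passes.
    expected_entropy = sum(entropy(p) for p in pass_probs) / n_passes
    # The difference isolates how much the passes disagree with each other.
    disagreement = predictive_entropy - expected_entropy
    return mean_probs, predictive_entropy, disagreement

# Hypothetical outputs of three forward passes for two different inputs.
stable = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # every pass says "toss-up"
unstable = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]    # passes contradict each other

for name, passes in (("stable", stable), ("unstable", unstable)):
    mean_probs, total, disagreement = summarize_passes(passes)
    print(f"{name}: mean={mean_probs}, entropy={total:.3f}, disagreement={disagreement:.3f}")
```

Both inputs average to a 50/50 prediction, so a single softmax-style score cannot tell them apart. Only the second shows disagreement between passes, which is the signal these methods add on top of plain softmax confidence.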
Embedding-Based and Gradient-Based Methods
For deep learning, uncertainty scores alone can be insufficient. Methods that operate in the embedding space—using the model's internal representations rather than output probabilities—often outperform pure uncertainty sampling.
The BADGE method (Ash et al., 2020, ICLR) selects a diverse set of examples using gradient embeddings, combining informativeness (gradient magnitude) with diversity (k-means++ initialization in gradient space). It has shown strong results across image and text classification benchmarks.
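The diversity half of this idea can be sketched with k-means++ style seeding: each new pick is sampled with probability proportional to its squared distance from the nearest example already chosen. This is a simplified illustration and not the BADGE reference implementation. The vectors stand in for gradient embeddings and are invented, and starting from the largest-norm vector is an assumption made here to keep the first pick deterministic.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def diverse_batch(vectors, batch_size, seed=0):
    """Pick up to batch_size indices using k-means++ style seeding."""
    rng = random.Random(seed)
    origin = [0.0] * len(vectors[0])
    # Start from the largest-norm vector (assumption: large norm ~ high uncertainty).
    first = max(range(len(vectors)), key=lambda i: squared_distance(vectors[i], origin))
    chosen = [first]
    # Squared distance from every vector to its nearest chosen vector.
    nearest = [squared_distance(v, vectors[first]) for v in vectors]
    while len(chosen) < min(batch_size, len(vectors)) and sum(nearest) > 0.0:
        # Far-away vectors are more likely to be picked; chosen ones have weight 0.
        pick = rng.choices(range(len(vectors)), weights=nearest, k=1)[0]
        chosen.append(pick)
        nearest = [min(d, squared_distance(v, vectors[pick]))
                   for d, v in zip(nearest, vectors)]
    return chosen

# Hypothetical 2-D "gradient embeddings" forming three loose groups.
embeddings = [[0.1, 0.0], [0.12, 0.01], [2.0, 2.2], [2.1, 2.0],
              [-1.5, 1.4], [-1.4, 1.5], [0.0, 0.05]]

batch = diverse_batch(embeddings, batch_size=3)
print("selected indices:", batch)
```

Near-duplicates of an already selected example carry almost no sampling weight, which is how this kind of seeding discourages the redundant batches that pure uncertainty ranking tends to produce.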
Applications in Computer Vision and NLP
Computer vision: Active learning has been applied to object detection, semantic segmentation, medical image analysis, and satellite image classification—domains with expensive pixel-level annotation.
Natural language processing: Text classification, named entity recognition, relation extraction, and sentiment analysis all benefit from active learning when domain-specific labeled corpora are scarce.
20. Active Learning for LLMs and Generative AI
Large language models (LLMs) have introduced a new dimension to active learning thinking. They do not follow the same train-from-scratch paradigm, but active learning ideas appear throughout their development pipeline.
Fine-Tuning with Actively Selected Examples
When fine-tuning a pre-trained LLM for a specialized task (legal document review, medical summarization, code generation), you typically have a small labeled dataset. Applying active learning to select which examples to include in the fine-tuning set—prioritizing cases where the base model produces uncertain or inconsistent outputs—can improve fine-tuning efficiency.
Preference Data Curation for RLHF
Reinforcement Learning from Human Feedback (RLHF) and its successors (RLAIF, DPO) require preference data—pairs of model outputs where a human indicates which is better. Generating and labeling this data is expensive. Active learning principles apply: select the output pairs where the reward model is most uncertain about which is preferred, and prioritize those for human review.
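As a rough sketch of that selection step, assume a reward model that assigns a scalar score to each output and a Bradley-Terry style preference probability, both assumptions made for this example. Pairs whose preference probability sits closest to 0.5 are the ones the reward model cannot separate, so they go to human review first. The pair IDs and scores are invented.

```python
import math

def preference_uncertainty(reward_a, reward_b):
    """1.0 when the reward model sees a toss-up, approaching 0.0 when it is sure."""
    # Bradley-Terry style probability that output A is preferred over B.
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    return 1.0 - 2.0 * abs(p_a - 0.5)

def select_pairs_for_review(scored_pairs, k):
    """scored_pairs: list of (pair_id, reward_a, reward_b) tuples."""
    ranked = sorted(scored_pairs,
                    key=lambda pair: preference_uncertainty(pair[1], pair[2]),
                    reverse=True)
    return [pair_id for pair_id, _, _ in ranked[:k]]

# Hypothetical reward-model scores for four candidate output pairs.
pairs = [("p1", 2.1, -0.4), ("p2", 0.30, 0.25), ("p3", -1.0, 1.5), ("p4", 0.9, 0.6)]

selected = select_pairs_for_review(pairs, k=2)
print(selected)  # ['p2', 'p4']
```

Pairs p1 and p3 have large reward gaps in opposite directions, so the reward model is already confident about both and they are left out of the review queue.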
Safety Evaluation and Red-Teaming
Active learning logic appears in systematic safety evaluation of LLMs: identifying prompts or inputs where the model produces harmful, inconsistent, or low-confidence outputs, and prioritizing human review of exactly those cases. This is sometimes called adversarial active learning or uncertainty-guided red-teaming.
Dataset Curation at Scale
Organizations that maintain large, evolving training datasets for LLMs use informativeness-based selection to decide which newly generated or user-submitted examples to include in the next training run—filtering out redundant, uninformative, or low-quality examples.
Note: Applying active learning to LLMs is an active research area in 2026. Results are promising but context-dependent. Claims of universal improvement should be treated with skepticism.
21. Real-World Applications
Medical Diagnosis and Radiology
Active learning has been applied extensively in medical imaging—pathology slide classification, chest X-ray interpretation, MRI segmentation, and retinal disease detection. Expert physician annotation is the primary bottleneck. Active learning reduces the number of expert reviews required to train a clinically viable model.
Fraud Detection
Financial institutions process billions of transactions. Fraudulent transactions are rare (class imbalance) and often novel (data drift). Active learning with stream-based selective sampling can flag the most anomalous transactions for analyst review, improving detection rates without flooding analysts with false positives.
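Stream-based selective sampling boils down to a per-item decision: request a label now or let the item pass. A minimal sketch follows, assuming the model exposes class probabilities and that analysts have a fixed review budget. The margin threshold, budget, and transaction data are all hypothetical.

```python
def should_query(probs, threshold=0.2):
    """Request a label when the gap between the top two classes is small."""
    top_two = sorted(probs, reverse=True)[:2]
    return (top_two[0] - top_two[1]) < threshold

def route_stream(stream, threshold=0.2, budget=2):
    """stream yields (item_id, class_probabilities) one item at a time."""
    queried, auto = [], []
    for item_id, probs in stream:
        if budget > 0 and should_query(probs, threshold):
            queried.append(item_id)   # send to an analyst
            budget -= 1
        else:
            auto.append(item_id)      # accept the model's decision
    return queried, auto

# Hypothetical [legitimate, fraud] probabilities for five incoming transactions.
transactions = [("t1", [0.98, 0.02]), ("t2", [0.55, 0.45]), ("t3", [0.91, 0.09]),
                ("t4", [0.52, 0.48]), ("t5", [0.58, 0.42])]

queried, auto = route_stream(iter(transactions))
print("analyst queue:", queried)   # ['t2', 't4']
print("auto-handled:", auto)       # ['t1', 't3', 't5']
```

Transaction t5 is also ambiguous, but it arrives after the budget is spent. That illustrates the main trade-off of the streaming setup: decisions are made in arrival order, without seeing the rest of the pool.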
Cybersecurity
Network intrusion detection systems use active learning to select the most anomalous network events for security analyst review. The volume of raw network data is enormous; active sampling makes human review tractable.
Manufacturing Defect Detection
In semiconductor manufacturing and precision engineering, defects are rare but costly. Active learning selects the inspection images most uncertain for the defect classifier, directing quality control engineers to review genuinely ambiguous cases rather than clear-pass images.
Autonomous Vehicles
Training perception models for autonomous driving requires labeled video frames—bounding boxes, lane markings, pedestrian segmentation. Labeling is expensive (estimates range from $12 to $50 per frame for complex annotation). Active learning selects the frames most likely to improve the model—complex scenes, edge cases, adverse weather—rather than clear highway frames the model already handles confidently.
Legal Document Review
In e-discovery and contract review, document classification (relevant/not relevant, privileged/not privileged) requires attorney review. Active learning surfaces the documents the model is most uncertain about for attorney attention, reducing the total number of attorney-hours required.
Sentiment Analysis and Customer Support
Customer feedback and support ticket classification in specialized domains (financial services, healthcare, enterprise software) benefit from active learning when the labeled corpus is small and domain-specific terminology is dense.
Scientific Research
In drug discovery, materials science, and genomics, experimental results serve as labels, and running experiments is expensive. Active learning has been applied to select which experiments (drug candidates, material formulations, genetic variants) to test next, based on a model's uncertainty about their outcomes. This overlaps with Bayesian optimization and optimal experimental design, which are closely related fields: Bayesian optimization searches for the best-performing candidate, while active learning aims to improve the model overall.
22. Metrics for Evaluating Active Learning
You cannot manage what you cannot measure. Active learning requires a clear evaluation framework.
Metric | What It Measures |
Accuracy | Overall correctness of predictions |
Precision | Of positive predictions, how many are correct |
Recall | Of actual positives, how many are found |
F1 score | Harmonic mean of precision and recall |
AUC-ROC | Model's ability to discriminate between classes |
Label efficiency | Labels needed to reach a target accuracy |
Annotation cost saved | Dollar or time savings vs. random baseline |
Time to target performance | Rounds or wall-clock time to reach a threshold |
Performance on rare classes | Accuracy/recall on minority classes specifically |
Calibration (ECE, reliability) | How well probability estimates reflect true confidence |
The most important active learning-specific metric is label efficiency: how many labeled examples does your active learning strategy need to reach a target performance, compared to random sampling with the same budget?
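Of the metrics in the table, calibration is probably the least self-explanatory. Expected calibration error (ECE) groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy, weighted by bin size. The sketch below assumes equal-width bins and uses made-up predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: top-class probabilities; correct: matching booleans."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the data.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: the model says 95% but is right 3 times out of 4.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [True, True, True, False, True, False]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

An ECE near zero means confidence scores can be taken at face value as a query signal. A large value suggests recalibrating before trusting uncertainty-based selection.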
23. Learning Curves Explained
A learning curve plots model performance (y-axis) against the number of labeled examples used for training (x-axis).
In active learning evaluation, you plot two learning curves on the same chart:
Random sampling curve: Performance when labeled examples are chosen randomly.
Active learning curve: Performance when labeled examples are chosen using the active learning strategy.
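A better active learning strategy produces a curve that rises faster, meaning it reaches a given level of performance with fewer labels. The gap matters most at small and medium label budgets; as the labeled set approaches the full pool, the two curves tend to converge.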
For example:
Random sampling reaches 85% accuracy with 5,000 labeled examples.
Active learning reaches 85% accuracy with 2,500 labeled examples.
That 50% reduction in labeling requirement is the active learning advantage—and the primary metric reported in research papers and production system evaluations.
Tip: Always include learning curves in your active learning evaluation. A single accuracy number at a fixed label budget does not tell you how efficiently the model learned.
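The comparison in the example above can be read off two learning curves with a few lines of code. In this sketch the endpoints match the example (85% at 5,000 versus 2,500 labels), while the intermediate points are invented purely for illustration.

```python
def labels_to_reach(curve, target):
    """curve: (n_labels, score) pairs sorted by n_labels.

    Returns the smallest measured label count that meets the target,
    or None if the curve never gets there.
    """
    for n_labels, score in curve:
        if score >= target:
            return n_labels
    return None

# Hypothetical learning curves measured on the same held-out validation set.
random_curve = [(500, 0.62), (1000, 0.70), (2500, 0.79), (5000, 0.85)]
active_curve = [(500, 0.66), (1000, 0.77), (2500, 0.85), (5000, 0.88)]

target = 0.85
n_random = labels_to_reach(random_curve, target)
n_active = labels_to_reach(active_curve, target)
reduction = 1 - n_active / n_random
print(f"random: {n_random} labels, active: {n_active} labels, "
      f"reduction: {reduction:.0%}")
```

The result is only as fine-grained as the points at which the curves were measured, so evaluating after every round gives a more precise estimate than evaluating at a few budgets.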
24. Implementation Considerations
Data Pipelines
You need a reliable pipeline for:
Storing and versioning labeled and unlabeled data separately.
Tracking which examples have been labeled, when, and by whom.
Efficiently scoring large unlabeled pools (batched inference).
Moving selected examples into the annotation queue.
Labeling Interfaces
Annotators need a clean, fast interface. Displaying uncertain examples—which are often inherently ambiguous—requires especially good UI design: clear instructions, example references, confidence ratings, and flagging options for unanswerable cases.
Query Batch Size
Querying one example per round lets the model update after every label, which is the most label-efficient setup in principle, but it is rarely practical. Batch sizes of 50–500 examples per round balance annotation throughput with selection quality.
Model Retraining Schedule
Retraining a large neural network after every annotation batch is expensive. Consider:
Full retraining: Best performance, highest compute cost.
Fine-tuning from current checkpoint: Faster, but can lead to catastrophic forgetting on earlier examples.
Periodic full retraining with fine-tuning in between: A practical compromise.
Dataset Versioning
Track every version of the labeled dataset, including which examples were added in which round, which query strategy was used, and who annotated them. This is essential for debugging, auditing, and reproducing results.
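One lightweight way to capture that provenance is to store a small record per label and serialize each labeled-set version. The field names, IDs, and helper functions below are hypothetical and would need to be adapted to whatever storage or experiment tracker is already in place.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LabelRecord:
    example_id: str
    label: str
    annotator: str
    round_num: int        # 0 = seed set
    query_strategy: str   # strategy that selected this example
    labeled_at: str       # ISO 8601 timestamp

def snapshot(records, version):
    """Serialize one version of the labeled set for storage or auditing."""
    payload = {"version": version,
               "n_labels": len(records),
               "records": [asdict(r) for r in records]}
    return json.dumps(payload, sort_keys=True)

def added_in_round(records, round_num):
    """IDs of the examples that entered the labeled set in a given round."""
    return [r.example_id for r in records if r.round_num == round_num]

# Hypothetical records spanning the seed set and one query round.
records = [
    LabelRecord("img_0001", "defect", "alice", 0, "seed", "2026-01-05T09:00:00Z"),
    LabelRecord("img_0107", "ok", "bob", 2, "entropy", "2026-01-12T14:30:00Z"),
    LabelRecord("img_0311", "defect", "alice", 2, "entropy", "2026-01-12T15:10:00Z"),
]

print(added_in_round(records, 2))
print(snapshot(records, "v3")[:60], "...")
```

With records like these, a regression between two model versions can be traced to the specific round, strategy, or annotator that introduced the new labels.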
Integration with MLOps
Active learning loops should integrate with your existing MLOps infrastructure—experiment tracking (MLflow, Weights & Biases), model registry, CI/CD for model deployment, and monitoring dashboards. Active learning is not a standalone tool; it is a training strategy embedded in a larger system.
25. Pseudocode and Python Example
Pseudocode: Active Learning Loop
initialize:
    labeled_set = seed_dataset (small, representative)
    unlabeled_pool = all remaining examples
    model = train(labeled_set)
    performance = evaluate(model, validation_set)
    budget = annotation_budget
while budget > 0 and performance < target:
    scores = model.predict_uncertainty(unlabeled_pool)
    selected = query_strategy(scores, batch_size)
    new_labels = oracle.label(selected)
    labeled_set = labeled_set + new_labels
    unlabeled_pool = unlabeled_pool - selected
    model = retrain(model, labeled_set)
    performance = evaluate(model, validation_set)
    budget = budget - len(selected)
return model
Conceptual Python Example (Illustrative Only)
The following is a simplified conceptual sketch. It is not a complete production system.
import numpy as np
from sklearn.linear_model import LogisticRegression

# NOTE: get_seed_data, get_unlabeled_pool, get_validation_set, and oracle_label
# are placeholders for your own data loading and annotation steps.
labeled_X, labeled_y = get_seed_data()    # small, representative seed set
unlabeled_X = get_unlabeled_pool()        # large, unlabeled pool
val_X, val_y = get_validation_set()       # held-out, never used for training

model = LogisticRegression(max_iter=1000)
batch_size = 50
rounds = 10

for round_num in range(rounds):
    # Train on the current labeled set, then evaluate that same model
    model.fit(labeled_X, labeled_y)
    val_accuracy = model.score(val_X, val_y)
    print(f"Round {round_num + 1}: Labeled={len(labeled_y)}, Val Accuracy={val_accuracy:.3f}")

    # Do not request labels that would never be trained on
    if round_num == rounds - 1 or len(unlabeled_X) == 0:
        break

    # Score the unlabeled pool using least-confidence uncertainty sampling
    probs = model.predict_proba(unlabeled_X)   # shape: (n_unlabeled, n_classes)
    max_probs = probs.max(axis=1)              # highest class probability per example
    uncertainty_scores = 1 - max_probs         # lower confidence = higher uncertainty

    # Select the top-k most uncertain examples
    query_indices = np.argsort(uncertainty_scores)[-batch_size:]

    # Simulate oracle labeling (in practice, send to human annotators)
    new_X = unlabeled_X[query_indices]
    new_y = oracle_label(new_X)                # human annotation step

    # Move the newly labeled examples from the pool into the labeled set
    labeled_X = np.vstack([labeled_X, new_X])
    labeled_y = np.concatenate([labeled_y, new_y])
    unlabeled_X = np.delete(unlabeled_X, query_indices, axis=0)
Note: This example uses scikit-learn's LogisticRegression for clarity. In production deep learning systems, the predict_proba step is replaced with a neural network's softmax output, and the model fitting step involves GPU-accelerated training. The oracle function represents human annotators, not an automated labeling system.
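The least-confidence score used in the example above is one of three common uncertainty measures, alongside margin and entropy. The sketch below implements all three with the standard library so they can be compared on the same pool. The probability vectors are invented, and the helper names are chosen here for illustration.

```python
import math

def least_confidence(probs):
    """High when even the top class has low probability."""
    return 1.0 - max(probs)

def margin_score(probs):
    """High when the top two classes are nearly tied."""
    top_two = sorted(probs, reverse=True)[:2]
    return 1.0 - (top_two[0] - top_two[1])

def entropy_score(probs):
    """High when probability mass is spread across many classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_k(pool_probs, score_fn, k):
    """Indices of the k highest-scoring examples, most uncertain first."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]), reverse=True)
    return ranked[:k]

# Hypothetical class probabilities for a pool of four unlabeled examples.
pool = [[0.90, 0.05, 0.05],
        [0.50, 0.49, 0.01],
        [0.40, 0.30, 0.30],
        [0.70, 0.20, 0.10]]

print("least confidence:", top_k(pool, least_confidence, 1))  # [2]
print("margin:          ", top_k(pool, margin_score, 1))      # [1]
print("entropy:         ", top_k(pool, entropy_score, 2))     # [2, 3]
```

The three measures agree on binary problems but can diverge with more classes, as they do here. Margin focuses on the two leading classes, while entropy rewards probability spread over all of them.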
26. Common Misconceptions
"Active learning means the model labels data by itself"
No. The model selects which examples to send to human annotators. It does not generate labels. The human oracle remains the source of ground truth.
"Active learning always improves accuracy"
Not guaranteed. If the query strategy is poorly designed, the seed dataset is unrepresentative, or the model's uncertainty estimates are miscalibrated, active learning can perform no better—or worse—than random sampling.
"Uncertainty sampling is always the best strategy"
It is often the simplest starting point, but it can produce biased, redundant batches. For complex tasks and deep learning, uncertainty combined with diversity (e.g., BADGE) typically outperforms pure uncertainty sampling.
"Active learning removes the need for humans"
The opposite. Active learning makes humans more necessary, not less—but focuses their attention more precisely. The human oracle is the foundation of the entire approach.
"More data is always better than smarter data selection"
For cheap, plentiful labels, volume often wins. But for expensive expert labels, strategic selection often delivers more accuracy per dollar than simply labeling more data.
"Active learning only works for image classification"
Active learning has been applied across classification, regression, sequence labeling, object detection, document ranking, scientific experiment selection, and more. It is a general training strategy, not task-specific.
27. Advantages and Disadvantages
Dimension | Advantages | Disadvantages |
Cost | Lower annotation cost | Higher engineering and pipeline cost |
Accuracy | Faster improvement per label | Not guaranteed without careful design |
Efficiency | Better use of expert time | Requires iterative retraining |
Data quality | Focuses on hard, informative cases | Risk of sampling bias |
Edge cases | Actively surfaces rare/ambiguous cases | Annotators face a harder labeling queue |
Pipeline | Integrates with MLOps | More complex than batch training |
Applicability | Works across domains | Requires well-calibrated uncertainty |
28. Active Learning Lifecycle
1. Define task → What is the model supposed to do?
2. Collect unlabeled data → Gather a large, relevant pool
3. Build seed dataset → Label a small, diverse, representative set
4. Train baseline model → First iteration, likely low accuracy
5. Select query batch → Apply query strategy to unlabeled pool
6. Annotate → Human oracle labels selected examples
7. Retrain model → Update on expanded labeled set
8. Evaluate → Measure performance on held-out validation set
9. Monitor → Check for class imbalance, data drift, annotation quality
10. Repeat → Continue until stopping criteria are met
11. Stop or maintain → Transition to periodic monitoring and retraining
The lifecycle is not linear; it is iterative. Stopping criteria include: target performance reached, annotation budget exhausted, marginal gain per label below a threshold, or a business decision to ship the current model.
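The first three stopping criteria are mechanical enough to check in code after every round. The sketch below is one possible formulation; the window size, threshold, and validation history are hypothetical values that would need tuning per project.

```python
def should_stop(history, budget_left, target, min_gain_per_label, window=2):
    """history: (n_labels, validation_score) per round, oldest first.

    Returns a reason string when a stopping criterion is met, else None.
    """
    n_labels, score = history[-1]
    if score >= target:
        return "target reached"
    if budget_left <= 0:
        return "budget exhausted"
    if len(history) > window:
        # Compare against the round `window` steps back to smooth out noise.
        old_labels, old_score = history[-1 - window]
        gain_per_label = (score - old_score) / max(n_labels - old_labels, 1)
        if gain_per_label < min_gain_per_label:
            return "gains plateaued"
    return None

# Hypothetical validation history: fast early gains, then a plateau.
history = [(200, 0.60), (400, 0.70), (600, 0.74), (800, 0.745), (1000, 0.747)]

early = should_stop(history[:3], budget_left=1500, target=0.90, min_gain_per_label=1e-4)
late = should_stop(history, budget_left=1000, target=0.90, min_gain_per_label=1e-4)
print("after round 3:", early)
print("after round 5:", late)
```

Measuring the gain over a window of rounds rather than a single round makes the rule less sensitive to one noisy evaluation. The fourth criterion, a business decision to ship, stays outside the code.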
29. Business Value
Active learning is ultimately a business tool as much as a technical one. Its value proposition translates directly into financial terms:
Reduced labeling budget: Fewer labels required for equivalent model performance. In domains where expert annotation costs $30–$100 per label, a 50% reduction in labels needed can translate to six or seven figures in savings on a large project.
Faster time to viable model: By focusing early labeling rounds on the most informative examples, active learning accelerates the path from "experimental model" to "production-ready model."
Better allocation of expert resources: Domain experts spend time reviewing genuinely ambiguous cases—not confirming easy ones. This is a more defensible use of high-cost human capital.
More reliable models: Models trained on actively selected, diverse, edge-case-rich datasets are more robust in deployment. Reduced production failures have downstream financial and reputational value.
Improved ROI for ML projects: Machine learning projects fail frequently because they cannot obtain enough high-quality labeled data to produce a working model. Active learning extends how far a fixed annotation budget can go—improving project success rates.
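The labeling-budget claim is easy to sanity-check with arithmetic. The sketch below assumes a hypothetical project that would need 50,000 labels under random sampling and an assumed $100,000 of extra engineering cost for the active learning pipeline. Only the $30–$100 per-label range and the 50% reduction come from the text above.

```python
def labeling_savings(n_labels_random, cost_per_label, label_reduction,
                     extra_engineering_cost=0.0):
    """Net savings from needing fewer labels, after extra pipeline cost."""
    baseline = n_labels_random * cost_per_label
    active = (n_labels_random * (1.0 - label_reduction) * cost_per_label
              + extra_engineering_cost)
    return baseline - active

# Hypothetical project size and engineering overhead (assumptions, see lead-in).
for cost in (30, 100):
    saved = labeling_savings(50_000, cost, 0.5, extra_engineering_cost=100_000)
    print(f"${cost}/label: net savings ${saved:,.0f}")
```

Under these assumptions the net savings land at $650,000 and $2,400,000. On a much smaller project the same engineering overhead could cancel out the benefit, which is why the cost per label and the total label count both matter.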
30. FAQ
Q1: What is active learning in machine learning?
Active learning is a training strategy where a model selects the unlabeled examples it would benefit most from having labeled, and requests those labels from a human annotator. Instead of labeling data randomly, it prioritizes informativeness—reducing annotation costs and improving model accuracy with fewer labeled examples.
Q2: How does active learning improve model performance?
By focusing labeling effort on the examples the model is most uncertain about or that cover underrepresented regions of the data distribution, active learning ensures that each new label adds genuine information. This leads to faster performance improvements per label compared to random sampling.
Q3: Is active learning supervised or unsupervised?
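Active learning is a form of supervised learning. It still requires human-generated labels, but it is smarter about which examples to label. It does not learn from unlabeled data alone (that is unsupervised learning), and unlike semi-supervised learning it does not train on the unlabeled examples directly; it only scores them to decide what to label next.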
Q4: What is uncertainty sampling?
Uncertainty sampling is the most common active learning query strategy. The model selects the unlabeled examples it is least confident about—those where its probability predictions are most spread across classes. Variants include least confidence, margin sampling, and entropy sampling.
Q5: What is the difference between active learning and semi-supervised learning?
Semi-supervised learning uses both labeled and unlabeled data during training without requesting new labels from humans. Active learning evaluates unlabeled data and requests human labels for selected examples. They solve different problems but can be combined.
Q6: Does active learning reduce labeling costs?
Yes, when implemented correctly. By reaching equivalent model performance with fewer labeled examples, active learning reduces the total annotation budget. The magnitude of savings depends on the task, the query strategy, and how well the model calibrates its uncertainty.
Q7: When should active learning be used?
When labeling is expensive, unlabeled data is abundant, expert annotators are scarce, the model can produce reasonable uncertainty estimates, and edge cases matter. Medical imaging, autonomous vehicles, legal document review, and scientific experiment selection are strong use cases.
Q8: Can active learning be used with deep learning?
Yes. Deep learning is one of the settings where active learning can help most, because deep models typically need large labeled datasets. Techniques like Monte Carlo Dropout, deep ensembles, and gradient-embedding methods (BADGE) provide uncertainty estimates for neural networks. The main challenges are calibration and retraining cost.
Q9: What are the risks of active learning?
Sampling bias (selecting a non-representative subset), poor uncertainty calibration (unreliable query signals), cold-start problems (initial model too weak to query meaningfully), class imbalance in the labeled set, annotation errors on hard examples, and increased pipeline complexity.
Q10: Does active learning replace human annotators?
No. Active learning depends on human annotators—it simply focuses their attention on the examples that matter most. Human judgment remains the ground truth source throughout the process.
Q11: How do you evaluate active learning?
Plot learning curves—model performance vs. number of labeled examples—for both the active learning strategy and a random sampling baseline. The active learning strategy should reach a target performance using fewer labels. Specific metrics include label efficiency, annotation cost saved, F1 score on rare classes, and model calibration.
Q12: What industries use active learning?
Healthcare (radiology, pathology, genomics), finance (fraud detection), cybersecurity, manufacturing (defect detection), autonomous vehicles, legal tech (document review), customer support, NLP (text classification, NER), and scientific research (drug discovery, materials science).
Q13: What is the cold-start problem in active learning?
The cold-start problem occurs when the initial model—trained on a very small seed dataset—is too weak to produce useful uncertainty estimates. Its early queries may be nearly random, wasting the first few annotation rounds. Mitigation: build a larger, more representative seed set before starting the active learning loop.
Q14: Can active learning be combined with transfer learning?
Yes, and this is a common production pattern. Start with a pre-trained foundation model (e.g., a vision or language encoder), then use active learning to select which task-specific examples to fine-tune on. This combination reduces both labeling requirements and compute costs.
Q15: What is query-by-committee?
A query strategy where multiple models (a "committee") vote on the predicted label for each unlabeled example. The examples with the most disagreement among committee members are selected for labeling, since disagreement signals informative or ambiguous cases.
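For query-by-committee (Q15), a common way to quantify disagreement is vote entropy: the entropy of the distribution of labels predicted by the committee members. The sketch below assumes each member casts one hard vote per example. The document IDs and votes are made up.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes for one example."""
    n = len(votes)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(votes).values())

def query_by_committee(committee_votes, k):
    """committee_votes: example_id -> list of labels, one per committee member."""
    ranked = sorted(committee_votes,
                    key=lambda example: vote_entropy(committee_votes[example]),
                    reverse=True)
    return ranked[:k]

# Hypothetical votes from a four-model committee on four documents.
votes = {
    "doc_a": ["pos", "pos", "pos", "pos"],   # unanimous
    "doc_b": ["pos", "neg", "pos", "neg"],   # evenly split
    "doc_c": ["pos", "pos", "pos", "neg"],   # one dissenter
    "doc_d": ["neg", "neg", "neg", "neg"],   # unanimous
}

to_label = query_by_committee(votes, k=2)
print(to_label)  # ['doc_b', 'doc_c']
```

Unanimous examples score zero and are never prioritized, regardless of which label the committee agreed on.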
31. Conclusion
Active learning is not a new algorithm. It is a data strategy—a principled way to decide which data is worth labeling, when, and in what order. Its core insight is simple but powerful: not all training examples are equally valuable, and a model that helps decide which examples to label will learn more efficiently than one that passively accepts a randomly selected dataset.
What makes active learning compelling in 2026 is not just theoretical elegance. It is the practical reality that labeled data remains expensive—often prohibitively so—in the domains where machine learning can add the most value: medicine, law, science, safety-critical engineering. In these domains, a strategy that reduces labeling costs by 30 to 50% while improving edge-case coverage is not an academic curiosity. It is a competitive advantage.
Active learning has real limitations. Poor uncertainty calibration, sampling bias, cold-start problems, and pipeline complexity are genuine risks. It does not replace human annotators—it redirects their attention. And it requires careful evaluation against random sampling baselines, not just intuition.
But done well, active learning changes the economics of machine learning. It lets small teams with limited budgets build models that would otherwise require annotation efforts an order of magnitude larger. It surfaces edge cases that random sampling misses. It makes expert time count.
The best way to think about active learning is this: every label you add should teach the model something it did not already know. Active learning is the systematic pursuit of that standard.
32. Key Takeaways
Active learning is a training strategy, not an algorithm—it changes which data gets labeled, not how the model trains.
The core loop is: train → score unlabeled data → select informative examples → label → retrain → repeat.
Uncertainty sampling is the simplest and most common query strategy; hybrid strategies combining uncertainty and diversity outperform it in complex settings.
Active learning is most valuable when labeling is expensive, unlabeled data is abundant, and edge cases matter.
Deep learning active learning requires special attention to uncertainty calibration: MC Dropout, deep ensembles, and temperature scaling are standard tools.
Sampling bias, poor calibration, and cold-start problems are the most common failure modes.
Always compare against a random sampling baseline with the same annotation budget—this is the definitive evaluation.
Active learning does not remove humans from the loop; it makes their contribution more targeted and efficient.
Learning curves (performance vs. labeled examples) are the canonical evaluation artifact for active learning systems.
Active learning ideas extend to LLM fine-tuning, RLHF preference data curation, and safety evaluation—though these applications are still evolving.
33. Actionable Next Steps
Define your annotation bottleneck. Calculate the cost per label in your domain. If it exceeds $5–$10 per example and you need more than 10,000 labels, active learning is worth evaluating.
Audit your unlabeled pool. Confirm you have a large, representative collection of unlabeled examples. Active learning requires a pool to select from.
Build a strong seed dataset. Label 200–500 diverse, representative examples manually. Ensure all major classes and edge-case types are represented.
Train a baseline model. Establish a performance baseline on your validation set with the seed data.
Implement uncertainty sampling. Start with least-confidence or entropy-based sampling. It is simple, interpretable, and effective.
Compare against random sampling. Run both strategies in parallel with the same annotation budget. Plot learning curves for both.
Monitor class balance. After each active learning round, check that minority classes are still represented in your labeled set.
Calibrate your model. Apply temperature scaling or isotonic regression to improve uncertainty estimate reliability.
Iterate on query strategy. Once basic uncertainty sampling is working, experiment with diversity-aware strategies like BADGE or core-set selection.
Integrate with your MLOps stack. Log every annotation round, labeled set version, query strategy configuration, and validation performance in your experiment tracker.
34. Glossary
Active learning: A training strategy where a model selects which unlabeled examples to send for human labeling, prioritizing informativeness.
Oracle: In active learning, the oracle is the trusted source of labels—typically a human annotator or domain expert.
Uncertainty sampling: A query strategy that selects unlabeled examples where the model is least confident in its predictions.
Query strategy: The algorithm that decides which unlabeled examples to select for labeling in each active learning round.
Pool-based active learning: An active learning setup where the model selects from a large, fixed collection of unlabeled examples.
Stream-based selective sampling: An active learning setup where unlabeled examples arrive one at a time, and the model decides in real time whether to request a label.
Membership query synthesis: An active learning setup where the model generates new, synthetic examples to be labeled—rather than selecting from existing data.
Query-by-committee (QBC): A query strategy using multiple models; examples with the most disagreement between models are selected for labeling.
Human-in-the-loop (HITL): A machine learning paradigm where humans are integrated into the training or decision process, not just as one-time annotators.
Label efficiency: The ratio of performance improvement to labels consumed—how much the model learns per annotation.
Cold-start problem: The challenge of running active learning when the initial model is too weak to produce reliable uncertainty estimates.
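Monte Carlo Dropout (MC Dropout): A technique for estimating neural network uncertainty by keeping dropout active at inference time, running multiple forward passes, and using the spread of the resulting predictions as an uncertainty signal.
BADGE (Batch Active learning by Diverse Gradient Embeddings): A batch active learning method that selects examples that are both uncertain and diverse by running k-means++ seeding over gradient embeddings, which are gradients of the loss with respect to the final layer's parameters, computed using the model's own predicted label.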
Learning curve: A plot of model performance against number of labeled training examples, used to evaluate active learning efficiency.
Calibration: How accurately a model's predicted probabilities reflect its true likelihood of being correct. A well-calibrated model that predicts 70% confidence is correct approximately 70% of the time.
Semi-supervised learning: A training approach that uses both labeled and unlabeled data, without requesting new labels from humans.
Transfer learning: Starting from a model pre-trained on a large external dataset and fine-tuning it on a smaller target dataset.
Entropy sampling: A variant of uncertainty sampling that selects examples where the full probability distribution across classes is most spread out (highest entropy).
35. References
The following are real, verifiable sources relevant to active learning research and practice.
Settles, Burr. "Active Learning Literature Survey." Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. https://minds.wisconsin.edu/handle/1793/60660
Cohn, David A., Zoubin Ghahramani, and Michael I. Jordan. "Active learning with statistical models." Journal of Artificial Intelligence Research, vol. 4, 1996, pp. 129–145. https://www.jair.org/index.php/jair/article/view/10158
Cohn, David A., Les Atlas, and Richard Ladner. "Improving generalization with active learning." Machine Learning, vol. 15, no. 2, 1994, pp. 201–221. https://link.springer.com/article/10.1007/BF00993277
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016. https://proceedings.mlr.press/v48/gal16.html
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." Advances in Neural Information Processing Systems (NeurIPS), 2017. https://papers.nips.cc/paper/2017/hash/9ef2ed4b7fd2c810847ffa5fa85bce38-Abstract.html
Ash, Jordan T., Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. "Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds." International Conference on Learning Representations (ICLR), 2020. https://openreview.net/forum?id=ryghZJBKPS
Ren, Pengzhen, et al. "A Survey of Deep Active Learning." ACM Computing Surveys, vol. 54, no. 9, November 2021. https://dl.acm.org/doi/10.1145/3472291
Kuo, Evelyn, et al. "Cost-Effective Active Learning for Diabetic Retinopathy Grading." npj Digital Medicine, Nature Portfolio, 2022. https://www.nature.com/articles/s41746-022-00633-6
Guo, Chuan, et al. "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning (ICML), 2017. https://proceedings.mlr.press/v70/guo17a.html
Sener, Ozan, and Silvio Savarese. "Active Learning for Convolutional Neural Networks: A Core-Set Approach." International Conference on Learning Representations (ICLR), 2018. https://openreview.net/forum?id=H1aIuk-RW
Angluin, Dana. "Queries and Concept Learning." Machine Learning, vol. 2, no. 4, 1988, pp. 319–342. https://link.springer.com/article/10.1007/BF00116828


