What Is a Test Set?

Most machine learning models look brilliant on paper. They score high on the data they trained on, impress the team during development, and get approved for deployment. Then reality hits. The model underperforms on real users, real transactions, or real medical images. The reason is almost always the same: the team never properly evaluated the model on data it had never seen before. That is exactly what a test set is for.
TL;DR
A test set is a held-out portion of your dataset used exclusively to evaluate a fully trained model's real-world performance.
It must never be used during training or hyperparameter tuning—only once, at the very end.
The test set measures generalization: whether your model works on new, unseen data.
Common split ratios are 70/15/15 or 80/10/10 (train/validation/test).
Data leakage—when test information bleeds into training—is one of the most dangerous and common mistakes in ML.
Test metrics are estimates, not guarantees; statistical uncertainty always applies.
What is a test set?
A test set is a separate portion of a labeled dataset that is withheld from model training and used only at the end to measure final model performance. It simulates how the model will behave on new, unseen, real-world data. A test set should never influence training, feature engineering, or hyperparameter tuning decisions.
1. Why Model Evaluation Matters
Building a machine learning model is only half the job. The other half is knowing whether it actually works.
A model that scores 99% accuracy in your notebook can still fail spectacularly in production. It may have memorized the training data without learning the underlying patterns. It may never have been tested on data that reflects real-world variation. Without rigorous evaluation, you have no reliable way to know.
Model evaluation is the discipline that closes that gap. It gives teams an honest, evidence-based signal of how a model is likely to behave when deployed. At the center of that discipline sits a single, critical component: the test set.
2. What Is a Test Set?
A test set is a labeled portion of your dataset that is held out entirely from model training, used only at the very end to measure the final model's performance.
Think of it as a sealed envelope. You prepare it before training begins, lock it away, and open it only once—after all decisions about the model have been made. The model has never seen these examples. Its score on this data is your best estimate of how it will behave on new, real-world inputs.
Key properties of a test set:
It is separate from training data and validation data.
The model does not learn from it—not directly, not indirectly.
It is used once (or as few times as possible) to produce a final performance estimate.
It represents the conditions the model will face in deployment.
The test set does not make a model better. It simply tells you, as honestly as possible, how good it already is.
3. The Machine Learning Workflow
Understanding the test set requires understanding where it sits in the broader ML pipeline. Here is the typical workflow:
Collect data — Gather labeled examples relevant to your problem.
Clean data — Handle missing values, remove duplicates, fix errors.
Split data — Divide the dataset into training, validation, and test sets before any modeling.
Train model — Fit the model on the training set.
Validate and tune — Evaluate on the validation set, adjust hyperparameters, select the best model architecture.
Final test evaluation — Evaluate the chosen, finalized model on the test set exactly once.
Deploy model — Ship it. Monitor it in production.
The critical rule: the split happens before training. The test set must be invisible to the entire development process.
4. Training Set vs Validation Set vs Test Set
| Property | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Purpose | Teaching the model | Tuning the model | Final performance estimate |
| When used | During training iterations | During development and tuning | Once, after all decisions are made |
| Model learns from it? | Yes | No (but decisions are based on it) | No |
| Influences model? | Directly | Indirectly (via hyperparameter choices) | Should not influence anything |
| Typical size (of total data) | 60–80% | 10–20% | 10–20% |
| Example use | Fitting weights in a neural network | Choosing learning rate or depth | Reporting final accuracy |
The validation set is a development tool. The test set is a verdict.
5. Why Test Sets Are Important
Measuring Generalization
The purpose of machine learning is to learn patterns that generalize—that work on new inputs the model has never encountered. Training accuracy tells you how well a model fits historical data. Test accuracy tells you how well it generalizes.
Preventing Overconfidence
Without a proper test set, teams routinely overestimate model quality. A model achieving 95% accuracy on its own training data may score only 70% on truly new data—a gap that only test evaluation can reveal.
Detecting Overfitting
Overfitting happens when a model memorizes training data instead of learning underlying patterns. A test set exposes this immediately: overfitted models score high on training data and significantly lower on test data.
Comparing Models Fairly
When you have two competing models, evaluating both on the same test set gives you a fair, apples-to-apples comparison. Neither model saw the test data during development, so the comparison is unbiased.
Supporting Trustworthy Decision-Making
Regulatory frameworks and enterprise governance increasingly require documented evidence of model performance before deployment. A properly constructed test evaluation provides that evidence. In healthcare AI, financial services, and hiring tools, this is not optional—it is a compliance requirement.
6. Generalization: The Core Goal
Generalization is a model's ability to perform well on data it was not trained on.
A model that generalizes well has identified real, underlying patterns in the data—not quirks, noise, or memorized examples. This is the entire point of machine learning. If the model does not generalize, it provides no value in production.
Consider a student who memorizes every answer from last year's homework but has never worked through the reasoning behind each problem. They may score perfectly on a homework re-test. Give them a new problem set with the same concepts but different numbers, and they fail.
This is exactly what happens when a model overfits. The training set is the homework. The test set is the new exam. A model that aces training data but struggles on test data has memorized rather than learned.
Skipping the test set or compromising its integrity is the equivalent of letting the student peek at the answer key before sitting the exam. The result looks good. The reality is not.
7. Overfitting and Underfitting
Overfitting
Overfitting occurs when a model is too closely fit to the training data, capturing noise as if it were signal. Symptoms:
Very high training accuracy
Significantly lower validation and test accuracy
The gap between training and test performance widens as model complexity increases
Example: A decision tree with unlimited depth can perfectly classify every training example by essentially memorizing it. On test data, its performance drops sharply—sometimes little better than guessing.
Underfitting
Underfitting occurs when a model is too simple to capture the real patterns in the data. Symptoms:
Low training accuracy
Low test accuracy
No significant gap between training and test performance—both are poor
Example: Fitting a straight line (linear regression) to data with a strong non-linear relationship will underfit—the model cannot represent the true pattern no matter how much data it sees.
The test set does not fix overfitting or underfitting. It reveals them, enabling you to make corrections before deployment.
8. How to Split a Dataset
Common Split Ratios
| Ratio (Train/Val/Test) | When to Use |
|---|---|
| 70 / 15 / 15 | General-purpose; balanced datasets of moderate size |
| 80 / 10 / 10 | Larger datasets where more training data helps |
| 60 / 20 / 20 | Smaller datasets where reliable validation and test estimates are critical |
How Dataset Size Affects the Split
Large datasets (millions of records): Even 1–5% can produce a statistically reliable test set. More data in training usually improves model quality more than a larger test set would improve estimate precision.
Small datasets (thousands or fewer): A larger proportion in validation and test is necessary to get stable estimates, even if it reduces training size.
Why Random Splitting Is Common
For many tabular and image classification tasks, each example is assumed to be independent. Random shuffling before splitting prevents accidental ordering effects (e.g., all early records being in training, all late records in test).
When Random Splitting Is Not Appropriate
Time-series data: Future data must not appear in training (see Section 9).
Group-based data: Users, patients, or sessions must not be split across train and test—leakage can occur if the same entity appears in both.
Imbalanced data: Random splits may leave rare classes underrepresented in the test set (stratified splitting solves this).
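As a concrete illustration, here is a minimal sketch of a 70/15/15 random split using scikit-learn's train_test_split. The file name, DataFrame, and column names are placeholders for your own data.

```python
# Minimal sketch of a 70/15/15 train/validation/test split with scikit-learn.
# "dataset.csv" and the "target" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")
X, y = df.drop(columns=["target"]), df["target"]

# First carve off the 15% test set and seal it away.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Then split the remainder into train (70% of total) and validation (15% of total);
# 0.15 / 0.85 of the remaining rows gives 15% of the original dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=42
)
```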
9. Types of Test Set Splits
Random Split
The default. Shuffle the data and take the last N% as test. Works for independent, identically distributed (i.i.d.) data.
Stratified Split
Preserves the class distribution in each split. Essential when one class is rare—without stratification, the test set may contain very few or zero examples of that class.
Time-Based Split
All training data comes from earlier time periods; test data comes from later time periods. Required for any prediction problem where the past is used to forecast the future.
Group-Based Split
Ensures entire groups (e.g., all sessions from one user, all scans from one patient) stay within one split. This prevents patterns learned from a group during training from inflating measured performance on that same group in test.
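A minimal sketch of such a split using scikit-learn's GroupShuffleSplit; the arrays and patient IDs below are synthetic, made up only for illustration.

```python
# Group-based split: every patient's records land entirely in train or entirely in test.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))                    # synthetic features
y = rng.integers(0, 2, size=1_000)                 # synthetic labels
patient_ids = rng.integers(0, 100, size=1_000)     # ~10 records per synthetic patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# Confirm no patient appears in both splits.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```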
Geographic Split
Training on data from certain regions, testing on others. Used when a model must generalize across geographies not seen during training.
User-Based Split
A form of group split—used in recommendation systems, personalization, and fraud detection where user-level independence is required.
Out-of-Distribution (OOD) Test Set
A test set intentionally drawn from a different distribution than training data. Tests robustness and identifies failure modes that in-distribution testing misses.
Holdout Test Set
General term for any test set withheld from training. Often used synonymously with "test set."
10. Stratified Test Sets
Stratification means maintaining the proportional representation of classes across splits.
Consider a fraud detection dataset where 1% of transactions are fraudulent. A random 80/20 split might produce a test set with only 0.5% fraudulent examples—or even zero, in small datasets. An accuracy metric on this test set would be meaningless: a model that always predicts "not fraud" scores 99.5%.
Stratified splitting guarantees that each split contains approximately the same 1% fraud rate, making evaluation reliable. Tools like scikit-learn's train_test_split support this with the stratify parameter (Pedregosa et al., 2011).
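A short sketch of the difference, using synthetic data with roughly a 1% positive rate; the arrays below are placeholders, not real transactions.

```python
# Stratified vs. unstratified splitting on a ~1%-positive synthetic dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)    # ~1% positive labels
X = rng.normal(size=(10_000, 5))               # synthetic features

# Without stratification the test-set positive rate can drift away from 1%.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=1)

# With stratify=y each split preserves approximately the same positive rate.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print("unstratified positive rate:", y_test_plain.mean())
print("stratified positive rate:  ", y_test_strat.mean())
```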
The same logic applies to disease detection (rare positive cases), spam classification (skewed spam rates), and any multi-class problem with significant class imbalance.
11. Time-Based Test Sets
For time-series and temporal prediction problems, the test set must be drawn from a later time period than the training set.
The Problem With Random Splitting in Time Series
If you randomly shuffle a sales forecasting dataset and split it, your training set will contain data from December 2025 while your test set contains data from January 2025. The model is effectively predicting the past using the future—a phenomenon called future leakage or temporal leakage.
The resulting test scores will be artificially optimistic. When deployed, the model faces true future data it has never seen in the correct temporal order, and performance collapses.
Correct Approach
Set a cutoff date. All data before the cutoff goes into training (and validation). All data after the cutoff goes into the test set. The model is evaluated strictly on its ability to forecast forward in time.
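A minimal sketch of a cutoff-date split with pandas; the synthetic daily sales data and the cutoff value are assumptions for illustration.

```python
# Time-based split: everything before the cutoff is train/validation, everything after is test.
import pandas as pd

# Synthetic daily sales data standing in for a real dataset.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=730, freq="D"),
    "units_sold": range(730),
})

cutoff = pd.Timestamp("2025-07-01")
train_val = df[df["date"] < cutoff]    # earlier period: training + validation
test = df[df["date"] >= cutoff]        # later period: sealed test window

print(len(train_val), "rows before cutoff;", len(test), "rows on/after cutoff")
```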
Examples of problems requiring time-based splits:
Stock return forecasting
Weather prediction
E-commerce sales forecasting
Customer churn prediction
Demand planning
12. Data Leakage
Data leakage occurs when information from the test set—or from the future—is inadvertently included in the training process, causing the model to appear better than it actually is.
Leakage is one of the most dangerous and frequently overlooked problems in applied machine learning. A Kaggle survey of data scientists found that data leakage was consistently cited as a top source of unreliable competition results (Kaggle, 2023 Machine Learning & Data Science Survey).
Common Sources of Leakage
| Leakage Type | Example |
|---|---|
| Target leakage | Including a feature that is determined after the target is known (e.g., using a "claim paid" flag to predict whether a claim will be filed) |
| Train-test contamination | Fitting a scaler or imputer on the full dataset before splitting |
| Temporal leakage | Randomly splitting time-series data |
| Group leakage | The same user appears in both train and test |
| Label leakage | A feature directly encodes the label (e.g., a diagnosis code used to predict diagnosis) |
Why Leakage Makes Scores Look Great
When training data contains hints about the correct answers in the test set, the model exploits those hints. Evaluation scores spike. Teams celebrate. Deployment reveals the truth: the hints are gone in production, and performance tanks.
How to Avoid Leakage
Split before any preprocessing. Fit scalers, encoders, and imputers only on training data; apply (transform only) to validation and test data.
Audit features. Ask: could this feature be influenced by the target? If yes, investigate carefully.
Use temporal splits for time-dependent data.
Check for group overlaps. Ensure no user, patient, or session ID appears in both train and test.
Inspect suspiciously high scores. A model scoring 99%+ on a hard problem is a red flag worth investigating.
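One way to make the first rule hard to violate is to wrap preprocessing and model together in a scikit-learn Pipeline, so the scaler is only ever fit on training rows. A minimal sketch on synthetic data:

```python
# Leakage-safe preprocessing: the StandardScaler inside the Pipeline is fit on
# training data only; validation and test rows are merely transformed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),           # statistics computed from X_train alone
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```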
13. Test Set Contamination
Test set contamination is the gradual corruption of the test set's integrity through repeated use in decisions.
It happens like this: a team trains a model, checks test accuracy, adjusts a feature, retrains, checks test accuracy again—and keeps iterating. Each inspection doesn't technically train on the test set, but every decision that responds to test scores is indirectly guided by the test set. Over many iterations, the model is effectively tuned to the test data.
This is sometimes called overfitting to the test set or adaptive overfitting. Blumer et al. first formalized related concerns in statistical learning theory; the phenomenon in applied ML practice is well-documented in the literature on benchmark integrity (Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?", ICML 2019).
The solution: treat the test set as a one-time measurement instrument. Make all development decisions using the validation set. Open the test set once, record the result, and stop.
14. Validation Set vs Test Set
This is one of the most commonly confused distinctions in machine learning.
| Property | Validation Set | Test Set |
|---|---|---|
| When used | During development, repeatedly | Once, at the end |
| Purpose | Guide hyperparameter tuning and model selection | Final, unbiased performance estimate |
| Influences model? | Yes—indirectly through decisions | Should not influence anything |
| Can be used multiple times? | Yes | Ideally, only once |
Hyperparameter tuning is the process of choosing model settings (such as learning rate, tree depth, or regularization strength) that are not learned from data but set by the developer. These choices should be guided by validation performance—never by test performance.
Model selection is choosing between competing architectures or algorithms. Again, use validation performance.
Only once you have committed to a final model—with all hyperparameters locked, all features fixed, all preprocessing pipelines finalized—do you evaluate on the test set.
15. Cross-Validation and the Test Set
Cross-validation is a technique for more reliably estimating validation performance when data is limited.
In k-fold cross-validation, the training data is divided into k equal parts (folds). The model is trained k times, each time using k–1 folds for training and 1 fold for validation. The results are averaged across all k runs. Common values of k are 5 and 10.
Cross-validation reduces the variance in performance estimates compared to a single validation split. It is especially useful when the dataset is too small to spare a dedicated validation set.
Does Cross-Validation Replace the Test Set?
No. Cross-validation replaces the validation set. The test set remains separate and untouched.
Even when using cross-validation for model selection and hyperparameter tuning, a held-out test set is still needed to produce a final, unbiased performance estimate. The folds used in cross-validation have all participated in the tuning process and are no longer fully independent.
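A minimal sketch of this division of labor—cross-validation over the training portion for model selection, one final evaluation on a held-out test set—using scikit-learn. The dataset and parameter grid are illustrative only.

```python
# Cross-validation plays the role of the validation set; the test set stays sealed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# 5-fold CV over the training data guides hyperparameter selection.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [4, 8, None]},
    cv=5,
)
search.fit(X_train, y_train)

# One final evaluation of the chosen model on the untouched test set.
print("CV score (model selection):", search.best_score_)
print("test score (final report): ", search.score(X_test, y_test))
```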
When data is extremely scarce, nested cross-validation (an outer loop for test evaluation, an inner loop for model selection) provides an alternative, but it is computationally expensive and less common in practice.
16. Test Metrics
The metric you use to evaluate the test set must match the problem and the business objective.
Classification Metrics
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | Fraction of correct predictions | Balanced classes, equal error costs |
| Precision | Of positive predictions, how many are correct | When false positives are costly |
| Recall | Of actual positives, how many are caught | When false negatives are costly |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, balanced FP/FN concern |
| ROC-AUC | Ability to distinguish classes across all thresholds | General discrimination ability |
| PR-AUC | Precision–recall tradeoff across thresholds | Heavily imbalanced datasets |
| Log Loss | Confidence of probabilistic predictions | Probabilistic outputs |
| Confusion Matrix | Counts of TP, TN, FP, FN | Diagnosing specific error types |
Regression Metrics
| Metric | What It Measures |
|---|---|
| MAE (Mean Absolute Error) | Average absolute prediction error |
| MSE (Mean Squared Error) | Average squared error; penalizes large errors |
| RMSE (Root Mean Squared Error) | Square root of MSE; same units as the target |
| R² (R-Squared) | Proportion of variance explained by the model |
No metric is universally correct. Choosing the wrong metric can lead to deploying a model that optimizes the wrong thing—high accuracy on an imbalanced fraud dataset, for example, tells you almost nothing useful.
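A short sketch of reporting several classification metrics at once with scikit-learn; the labels and scores below are synthetic placeholders standing in for your own final-model outputs.

```python
# Computing several test metrics side by side; never rely on a single number.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=500)                                 # placeholder labels
y_prob = np.clip(0.6 * y_test + rng.normal(0.2, 0.2, size=500), 0, 1)  # placeholder scores
y_pred = (y_prob >= 0.5).astype(int)                                  # threshold at 0.5

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("roc_auc  :", roc_auc_score(y_test, y_prob))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```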
17. Examples by Problem Type
Classification
A loan default classifier uses ROC-AUC on the test set to measure discrimination ability, and F1 to balance recall (catching defaulters) against precision (avoiding false denials).
Regression
A house price estimator uses RMSE on the test set, which penalizes large errors more heavily than MAE—appropriate when one $100,000 miss is more costly than ten $10,000 misses.
Time Series Forecasting
A demand forecasting model uses MAPE (Mean Absolute Percentage Error) or RMSE on a future-period test set drawn from a genuine holdout time window.
Recommendation Systems
A product recommendation engine uses NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision) on a user-based test split, ensuring no user appears in both train and test.
Natural Language Processing
A sentiment classifier uses F1 score and accuracy on a stratified test set; a text generation model may use BLEU, ROUGE, or BERTScore on held-out reference texts.
Computer Vision
An object detection model uses mAP (mean Average Precision) across classes and IoU thresholds on a held-out image test set, with care taken that no training images overlap with test images.
Medical AI
A diagnostic imaging model uses sensitivity (recall), specificity, and AUC on a test set drawn from a separate patient cohort—ideally a different hospital or time period—to test true generalization.
Fraud Detection
A fraud detection model is evaluated using Precision@K and recall at a fixed false positive rate on a time-ordered test set (not random), since fraudsters adapt over time.
18. Practical Example: Spam Classifier
Problem: Build a model to classify emails as spam or not spam.
Dataset: 10,000 labeled emails (1,500 spam, 8,500 not spam).
Split: Stratified 70/15/15 → 7,000 training / 1,500 validation / 1,500 test, each preserving the 15% spam rate.
Training set role: The model learns word patterns, sender features, and structural signals associated with spam vs. legitimate email.
Validation set role: The team tunes the classification threshold, regularization strength, and feature selection. They compare a logistic regression against a gradient boosting model. All these decisions use validation F1 scores.
Test set role: After committing to the gradient boosting model with fixed hyperparameters, the team runs one final evaluation on the 1,500 test emails. Result: F1 = 0.93, precision = 0.91, recall = 0.95.
Interpretation: The model correctly flags 95% of spam (recall = 0.95), and 91% of the emails it flags really are spam (precision = 0.91). This score is reported as the model's expected production performance under similar email distributions.
Warning: If the test set had been checked after each tuning iteration, the 0.93 F1 would be an optimistic overestimate. Keeping the test set sealed ensures it reflects genuine generalization.
19. Practical Example: House Price Regression
Problem: Predict residential property sale prices.
Dataset: 20,000 home sales records with features including square footage, location, age, and condition.
Split: Random 80/10/10 → 16,000 training / 2,000 validation / 2,000 test.
Preprocessing: The StandardScaler is fit only on the 16,000 training records, then applied to validation and test. Fitting on all 20,000 would introduce leakage.
Training set role: A gradient boosting regressor learns relationships between features and sale prices.
Validation set role: The team tunes the number of trees, learning rate, and maximum tree depth. RMSE on the validation set guides each iteration.
Test set role: Final evaluation. RMSE = $22,400, R² = 0.88.
Interpretation: The model's typical prediction error is around $22,400 (RMSE weights large misses more heavily than a plain average error would), and it explains 88% of the variance in sale prices. This is the number the business uses when deciding whether to deploy the model in their pricing tool—not the validation RMSE, which was slightly lower after tuning.
20. Best Practices
Split before any preprocessing. Decide on train/validation/test boundaries before touching the data in any analytical way.
Fit preprocessing only on training data. Scalers, encoders, imputers—fit on training, transform only on validation and test.
Use stratified splits when class distributions are imbalanced.
Use time-based splits for any temporal or sequential data.
Avoid duplicate records across splits. Remove or deduplicate before splitting.
Check for group overlaps. Users, patients, devices, or sessions must not appear in multiple splits.
Audit for leakage. Inspect features for post-target information and temporal contamination.
Use a representative test set. It should mirror the real-world distribution the model will face.
Evaluate multiple metrics. No single metric captures all aspects of model quality.
Treat the test set as final. Evaluate once. Document the result. Do not iterate based on test scores.
Document the split procedure. Record the random seed, split ratios, stratification strategy, and cutoff dates for reproducibility.
Plan for a new test set if the deployment domain changes significantly.
21. Common Mistakes
Warning: These mistakes consistently inflate model performance estimates and lead to poor production outcomes.
Testing on training data. The most basic error. The model has seen this data; the score is meaningless as a generalization estimate.
Using the test set during development. Checking test scores before committing to a final model corrupts the test set's independence.
Tuning hyperparameters based on test results. This converts the test set into a second validation set, destroying its unbiased character.
Data leakage. Preprocessing before splitting, target-correlated features, temporal contamination.
Randomly splitting time series data. Future data in the training set artificially boosts test scores.
Ignoring class imbalance. Without stratification, rare classes may be underrepresented or missing in the test set.
Duplicate entities across splits. Same user or patient in train and test inflates scores.
Using a tiny test set. A test set with fewer than a few hundred examples produces high-variance, unreliable estimates.
Reporting only one metric. Accuracy alone on an imbalanced dataset is almost always misleading.
Confusing validation data with test data. Using validation performance as the final reported result overstates expected generalization.
22. What Makes a Good Test Set?
A high-quality test set is:
Representative. It reflects the real-world distribution of inputs the model will encounter after deployment—not a cherry-picked or convenient sample.
Sufficiently large. Large enough to produce statistically stable performance estimates. Hundreds of examples at minimum; thousands are better for reliable confidence intervals.
Independent. No example in the test set appears in the training set. No preprocessing fitted on test data. No group overlap.
Free from leakage. No future information, no target-correlated features unavailable at inference time.
Properly labeled. Labels must be accurate. A test set with mislabeled examples will understate good model performance and overstate poor model performance.
Reflective of deployment conditions. If the model will be used on 2026 data, the test set should come from the most recent period available—not from 2020.
Stable. For ongoing benchmark use, it should not change unless explicitly versioned and documented.
Aligned with the objective. Designed around what the model will actually be used for, not what was convenient to collect.
23. Test Set Size
There is no universal rule for test set size. The right size depends on the number of classes, the rarity of events, and the precision needed in the performance estimate.
General Guidelines
| Dataset Size | Typical Test Set Size | Note |
|---|---|---|
| < 1,000 examples | 20–30% | Use cross-validation where possible |
| 1,000–10,000 | 15–20% | Standard holdout or CV |
| 10,000–100,000 | 10–15% | A larger training set improves the model |
| > 1,000,000 | 1–5% | Even 1% is statistically large |
The Tradeoff
A larger test set produces more precise estimates but reduces training data. For very large datasets, the precision gained by a 20% test set over a 10% test set is marginal—while the extra 10% of data moved into training can meaningfully improve model quality.
For rare events (e.g., fraud, rare disease), the test set must contain enough positive examples to evaluate precision and recall reliably. If your test set has only 10 positive examples, precision and recall estimates will have very wide confidence intervals.
A practical rule: aim for at least several hundred—ideally 1,000 or more—test examples per class when evaluating multi-class classification problems.
24. Test Sets in Real-World Production
Offline Evaluation
The test set evaluation described throughout this article is an offline evaluation—conducted on historical, labeled data before deployment. It is the primary quality gate before a model goes live.
Online Evaluation
Once deployed, a model is evaluated in production through online evaluation methods, including:
A/B testing: Traffic is split between the old model and new model. Real user behavior (clicks, purchases, outcomes) determines which model performs better.
Shadow deployment: The new model runs in parallel with the existing system, its predictions logged but not acted on. Outcomes are compared retrospectively.
Monitoring: Continuous tracking of prediction distributions, feature distributions, and business metrics to detect degradation.
Concept Drift and Data Drift
After deployment, the data distribution often changes. User behavior shifts. Fraud patterns evolve. Language use changes. These shifts—called concept drift (change in the relationship between inputs and outputs) and data drift (change in input distributions)—mean that even a well-constructed test set becomes stale.
A model that scored 92% on a test set built from 2024 data may score 78% on 2026 real-world inputs. This is why production monitoring and periodic retraining—with new, up-to-date test sets—are essential for long-lived models.
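As an illustration, one simple (and by no means complete) way to flag data drift is to compare a feature's training-time distribution against its production distribution with a two-sample Kolmogorov–Smirnov test; the data below is synthetic and the threshold is a placeholder.

```python
# Simple data-drift check: compare one feature's training vs. production distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # distribution at training time
prod_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)    # shifted distribution in production

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:   # placeholder significance threshold
    print(f"drift detected (KS statistic {stat:.3f}); consider a new test set and retraining")
```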
25. Benchmark Test Sets
Benchmark datasets are standardized test sets used to compare model performance across the research community.
Examples include image classification benchmarks used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), natural language inference benchmarks, and reading comprehension benchmarks from leading NLP research groups.
Hidden Test Sets
In academic competitions and standardized evaluations, test set labels are hidden from participants. Participants submit predictions; an automated system computes the score. This prevents direct test set overfitting.
Leaderboard Overfitting
Despite hidden test sets, leaderboard overfitting occurs when teams submit thousands of predictions, use leaderboard scores as feedback, and effectively tune to the test set through trial and error. Research by Recht et al. (ICML 2019) documented this phenomenon with empirical evidence on ImageNet, showing that repeated evaluation on the same test distribution inflates reported scores over time.
This finding reinforces why test sets—even public benchmarks—should be treated as limited evaluation resources rather than optimization targets.
26. Test Sets in Deep Learning
Deep learning models operate at massive scale, often trained on millions of examples. The test set principles are identical, but several practical details deserve attention.
Early Stopping
Early stopping—halting training when validation loss stops improving—uses the validation set, not the test set. This prevents the model from overfitting the training data without contaminating the test evaluation.
Architecture Decisions
Choices about the number of layers, attention heads, or convolutional filters should be guided by validation performance, never test performance. Architecture exploration is a development activity.
Test-Time Evaluation
At test time, deep learning models should be set to evaluation mode (dropout disabled, batch normalization using its frozen running statistics). Test-time augmentation (averaging predictions over multiple augmented versions of each test input) is a legitimate practice but must be applied consistently and documented.
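A minimal PyTorch sketch of the evaluation-mode convention; the toy model and data stand in for a real trained network and test loader.

```python
# Evaluation mode in PyTorch: dropout off, batch-norm stats frozen, no gradients tracked.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy untrained model and synthetic data, purely for illustration.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2))
X = torch.randn(256, 8)
y = torch.randint(0, 2, (256,))
test_loader = DataLoader(TensorDataset(X, y), batch_size=64)

model.eval()                       # switch to evaluation mode before test evaluation
correct, total = 0, 0
with torch.no_grad():              # no gradient tracking needed at test time
    for inputs, labels in test_loader:
        preds = model(inputs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print("test accuracy:", correct / total)
```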
Large-Scale Test Sets
For large-scale tasks, the test set may contain hundreds of thousands of examples. Evaluation is usually parallelized across GPUs. The same integrity rules apply: a separate process evaluates on test data, results are logged, and the test set is not revisited for development purposes.
27. Small-Data Situations
When data is scarce, dedicating 15–20% to a test set may leave too little for training. Options:
Cross-Validation (Again)
Use k-fold cross-validation on the full dataset for model selection, then report average CV performance. If a single withheld test set is not feasible, document this clearly—CV performance is an optimistic estimate.
Nested Cross-Validation
The outer loop produces test-fold estimates. The inner loop selects hyperparameters. This provides an approximately unbiased performance estimate without a separate test set, though it is computationally expensive.
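A compact sketch of nested cross-validation with scikit-learn—an inner GridSearchCV for hyperparameter selection wrapped in an outer cross_val_score for performance estimation. The dataset and parameter grid are illustrative.

```python
# Nested CV: the inner loop tunes hyperparameters, the outer loop estimates performance
# on folds the tuning process never saw.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)   # model selection
outer_scores = cross_val_score(inner, X, y, cv=5)                    # performance estimate
print("nested CV estimate:", outer_scores.mean())
```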
Bootstrapping
Generate multiple training/test splits by sampling with replacement. Average performance across splits. This provides estimates of variability but is less common in practice than k-fold CV.
The Honest Caveat
With very small datasets, test performance estimates have wide confidence intervals. A model scoring 80% accuracy on 100 test examples has a 95% confidence interval of roughly [72%, 88%] under a normal approximation. This uncertainty must be communicated clearly to stakeholders.
28. Statistical Uncertainty
A test set score is a sample estimate, not an absolute truth.
If your model scores 85% accuracy on 1,000 test examples, that 85% carries statistical uncertainty. The true expected accuracy—the accuracy you would observe across infinite data from the same distribution—might be 83% or 87%.
Confidence Intervals
For a proportion estimate p on n test examples, the simple normal-approximation 95% interval is p ± 1.96·√(p(1−p)/n); Wilson or Agresti–Coull intervals are more reliable when p is close to 0 or 1 or when n is small. Either way, the interval gives a range of plausible values for the true performance.
Practical implication: Two models scoring 84.2% and 85.1% on the same 1,000-example test set are not necessarily meaningfully different. Statistical significance testing (e.g., McNemar's test for paired predictions) should be used when comparing close-scoring models.
Always report: The test score alongside the test set size. A score without a denominator is incomplete information.
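A short sketch of both interval calculations; the accuracy and test-set size are the illustrative numbers used above, and the Wilson interval relies on statsmodels if it is installed.

```python
# Confidence intervals for a test accuracy reported as a proportion.
import math
from statsmodels.stats.proportion import proportion_confint  # optional dependency

n, successes = 1000, 850           # test-set size and number of correct predictions
p = successes / n                  # observed accuracy: 0.85

# Normal-approximation 95% interval: p ± 1.96 * sqrt(p(1-p)/n)
se = math.sqrt(p * (1 - p) / n)
print(f"normal approx: [{p - 1.96 * se:.3f}, {p + 1.96 * se:.3f}]")

# Wilson interval: more reliable for p near 0 or 1, or for small n.
lo, hi = proportion_confint(successes, n, alpha=0.05, method="wilson")
print(f"Wilson:        [{lo:.3f}, {hi:.3f}]")
```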
29. Ethics and Fairness
A model that scores 90% overall may score 60% for a specific demographic group. Aggregate test metrics obscure this.
Subgroup Evaluation
Evaluate the test set across meaningful subgroups—by age, gender, race, geography, income level, or any other factor relevant to the deployment context. Identify performance disparities before deployment.
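A minimal sketch of disaggregated evaluation—computing the same metric separately per subgroup. The tiny DataFrame, labels, and group names below are hypothetical.

```python
# Per-subgroup recall on a toy test set; large gaps between groups warrant investigation.
import pandas as pd
from sklearn.metrics import recall_score

results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
})

for name, g in results.groupby("group"):
    print(name, "recall:", recall_score(g["y_true"], g["y_pred"]))
```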
Representative Test Sets
If your test set underrepresents a group that the model will be deployed on, your evaluation is incomplete. A medical AI test set drawn entirely from one hospital in one city may not generalize to patients at hospitals in other regions with different demographic compositions.
Real-World Consequences
Biased or unrepresentative test evaluation has led to documented failures:
The Optum health risk prediction algorithm, audited by Obermeyer et al. in Science (2019), was found to significantly underserve Black patients because the training and evaluation data used healthcare costs as a proxy for health needs—a proxy that reflected historical inequities, not actual illness burden.
Commercial facial recognition systems from multiple vendors showed significantly higher error rates on darker-skinned faces, as documented in the Gender Shades audit by Buolamwini and Gebru (2018), because test sets—and training sets—were not representative.
Implication for test set design: Explicitly plan for subgroup coverage. Document demographic composition. Report disaggregated performance metrics alongside aggregate scores.
30. Can You Use the Test Set More Than Once?
The textbook answer is: ideally, once.
In practice, most teams inspect test results after final evaluation, and that single inspection is fine. The problem arises when teams:
Tune the model based on what they observe in test results
Try multiple model variants and pick the one with the best test score
Revisit test evaluation repeatedly over weeks of iteration
Each additional use of the test set as a decision input reduces its independence. The more decisions it informs, the more the reported test score overestimates true generalization.
The rigorous recommendation: Treat the test set as a one-time measurement. If your deployment domain changes substantially, create a new, up-to-date test set rather than reusing the original.
In Kaggle-style competitions, a private test set is evaluated only once (upon submission deadline) precisely to prevent adaptive overfitting.
31. What If the Model Performs Poorly on the Test Set?
Poor test performance is valuable information. It tells you something is wrong before you deploy. Possible causes:
| Cause | How to Diagnose |
|---|---|
| Overfitting | Training accuracy much higher than test accuracy |
| Poor features | Feature importance analysis; domain expert review |
| Bad data quality | Audit labels; check for corruption or bias |
| Distribution mismatch | Compare feature distributions of train vs. test |
| Insufficient training data | Learning curves; add data if available |
| Wrong model choice | Try simpler or more complex alternatives on the validation set |
| Leakage in validation | Validation scores looked great; test scores diverged significantly |
| Misaligned metric | The metric does not reflect the actual problem |
What to Do
Diagnose the cause using the list above.
Fix the issue—add features, more data, better preprocessing, or choose a different model.
Evaluate the new model on the validation set during iteration.
Use the test set again only if you create a new holdout partition or if you commit that this evaluation is final.
Do not tune the model based on test error patterns. That converts your test set into a validation set, and you will no longer have an unbiased performance estimate.
32. Test Set vs Holdout Set vs Evaluation Set vs Dev Set
Test Set vs Holdout Set
These terms are largely interchangeable. "Holdout set" emphasizes that data is held out from training. Some teams use "holdout" for the validation set and "test set" for the final evaluation; others use both terms for the final evaluation. Context matters—always clarify within your team.
Test Set vs Evaluation Set
Also largely interchangeable. "Evaluation set" is sometimes preferred in NLP and deep learning contexts. The key property (unseen data used to assess final model quality) is the same.
Test Set vs Dev Set
A dev set (development set) typically means the validation set in the usage popularized by Andrew Ng's deep learning courses (Ng et al., Coursera Deep Learning Specialization). It is used during development for tuning decisions, not as a final evaluation benchmark. If a colleague says "dev set," assume they mean validation set unless they clarify otherwise.
33. Simple Checklist for Creating a Test Set
[ ] Split the data before any preprocessing or modeling.
[ ] Use stratified splitting for imbalanced classification tasks.
[ ] Use time-ordered splitting for time series or temporal prediction tasks.
[ ] Fit all preprocessing transformers (scalers, encoders) on training data only.
[ ] Check for duplicate records and remove them before splitting.
[ ] Check for group-level overlaps (same user/patient/entity in train and test).
[ ] Audit features for leakage (post-target information, temporal contamination).
[ ] Ensure the test set is large enough for reliable estimates.
[ ] Ensure the test set reflects the deployment distribution.
[ ] Document the split procedure (ratio, random seed, cutoff date, stratification).
[ ] Evaluate on the test set exactly once (or as few times as possible).
[ ] Report multiple metrics, not just accuracy.
[ ] Evaluate disaggregated performance across relevant subgroups.
[ ] Do not tune the model in response to test results.
34. Beginner-Friendly Summary
Imagine you are a teacher. You want to know how well your students have learned, not just memorized your lessons. You give them homework and study materials (the training set). You hold a practice quiz to help them prepare and adjust your teaching (the validation set). Then, at the end of the course, you give a final exam with questions they have never seen before. That final exam is the test set.
If students score well on the exam, they genuinely learned the material—they can handle new questions. If they score poorly, something in the learning process did not work. The exam gives you an honest verdict.
In machine learning, the model is the student. The training data is the study material. The validation set is the practice quiz. The test set is the final exam. Your job is to keep that exam sealed until the very end—because if the student gets a peek at the answers beforehand, the exam no longer tells you anything useful.
FAQ
What is a test set in machine learning?
A test set is a labeled portion of your dataset withheld from model training and used only at the end to evaluate final model performance. It estimates how the model will behave on new, unseen, real-world data.
Why do we need a test set?
Because training accuracy is not a reliable indicator of real-world performance. A model can memorize training data without learning generalizable patterns. The test set provides an honest, unbiased performance estimate before deployment.
Is a test set the same as a validation set?
No. The validation set is used repeatedly during development to guide hyperparameter tuning and model selection. The test set is used only once, after all development decisions have been locked in, to produce a final performance estimate.
Can a model train on the test set?
No. If a model trains on the test set—directly or indirectly—the test score becomes meaningless as a generalization estimate. The test set must remain completely isolated from the training process.
How big should a test set be?
It depends on total dataset size and the rarity of events you are predicting. Common ratios are 10–20% of total data. Aim for at least a few hundred examples per class; thousands are preferable for stable estimates.
What is the difference between training accuracy and test accuracy?
Training accuracy measures how well the model fits the data it learned from. Test accuracy measures how well it generalizes to new, unseen data. A large gap between the two usually indicates overfitting.
What does poor test performance mean?
It may indicate overfitting, poor features, data quality issues, distribution mismatch, leakage in the validation process, insufficient training data, or the wrong model choice. It is a diagnostic signal, not a failure—it prevents you from deploying a weak model.
Can cross-validation replace a test set?
Cross-validation replaces the validation set, not the test set. A final held-out test set, untouched during all cross-validation iterations, is still recommended for a fully unbiased final performance estimate.
What is test data leakage?
Leakage occurs when information from the test set or from the future is inadvertently included in the training process. It makes the model appear better than it truly is, because it is exploiting hints that will not exist in production.
How do I choose the right test metric?
Choose based on your problem type and business objective. For imbalanced classification, use F1, ROC-AUC, or PR-AUC rather than accuracy. For regression, use RMSE or MAE. For time series, use MAPE or RMSE on a held-out future window. Always consider what error type is most costly in your context.
Should test data be random?
For independent data (tabular, images), yes. For time-series or temporal data, no—use a time-based cutoff. For user or patient data, no—use a group-based split to prevent leakage.
What is an untouched test set?
A test set that has never been used to guide any modeling, feature engineering, hyperparameter tuning, or architectural decisions. Its integrity is fully preserved, making it a reliable final performance judge.
What is an independent test set?
A test set that shares no examples, entities, or preprocessing fitting with the training set. Independence is what makes the test set a valid generalization estimate.
How is a test set used in deep learning?
In deep learning, training and validation sets guide weight updates and early stopping. The test set is used once, after training is complete and all architecture decisions are finalized, to report the model's expected real-world performance.
What is a hidden test set?
A test set whose labels are not disclosed to model developers. Used in academic benchmarks and competitions to prevent overfitting to the test set through repeated submission and feedback cycles.
Key Takeaways
A test set is a sealed, held-out portion of data used only once to evaluate final model performance.
It must never be used during training, feature engineering, preprocessing fitting, or hyperparameter tuning.
The gap between training and test performance reveals overfitting; similar poor scores on both reveal underfitting.
Data leakage is one of the most dangerous ML pitfalls—it inflates scores and hides model weakness until production.
Use stratified splits for imbalanced data; time-based splits for temporal data; group-based splits when entities must not overlap.
Fit all preprocessing transformers only on training data; apply (transform) to validation and test.
Test metrics are estimates with statistical uncertainty—report test set size alongside performance numbers.
Evaluate disaggregated performance across subgroups to catch demographic disparities before deployment.
Treat the test set like a final exam: seal it, use it once, record the result, and stop.
Production performance may diverge from test performance due to concept drift and data drift—ongoing monitoring is required.
Actionable Next Steps
Audit your current project. Identify whether your test set was created before or after preprocessing. If after, redo the split.
Check for leakage. Review every feature for post-target information or temporal contamination. Confirm preprocessing is fit only on training data.
Verify your split type. Is your data temporal or grouped? Confirm you are using the correct split strategy.
Set the test set aside. Create a separate DataFrame, file, or data partition that is not touched until final evaluation.
Choose metrics before evaluating. Decide which metrics matter for your business objective before you open the test set.
Evaluate once. Run final evaluation, record all metrics, and document the result with test set size and split details.
Run subgroup analysis. Break down test performance by relevant demographic or operational subgroups and investigate disparities.
Plan for monitoring. Once deployed, set up data drift and performance monitoring so you know when a new test set and retraining cycle are needed.
Glossary
Benchmark dataset: A standardized, publicly shared dataset used to compare model performance across research teams and organizations.
Concept drift: A change in the statistical relationship between input features and the target variable over time, degrading model performance.
Cross-validation: A technique that divides training data into multiple folds, training and validating repeatedly to produce a more stable performance estimate.
Data drift: A change in the statistical distribution of input features over time, which can reduce model accuracy.
Data leakage: The introduction of information from the test set or from the future into the training process, artificially inflating model performance estimates.
Generalization: A model's ability to perform accurately on new, unseen data beyond its training examples.
Holdout set: Data withheld from training; often used synonymously with test set or validation set depending on context.
Hyperparameter: A model setting chosen by the developer rather than learned from data (e.g., learning rate, tree depth, regularization strength).
Overfitting: When a model fits training data too closely—including noise—and performs worse on new data.
Stratified split: A data splitting method that preserves the proportional class distribution of the original dataset in each split.
Test set: A labeled dataset partition held out from all training and tuning, used exactly once to produce a final, unbiased model performance estimate.
Training set: The portion of data used to fit a model's parameters.
Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data.
Validation set: A held-out subset used during model development to guide hyperparameter tuning and model selection, separate from the test set.
References
Pedregosa, F. et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, Vol. 12, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. "Do ImageNet Classifiers Generalize to ImageNet?" Proceedings of the 36th International Conference on Machine Learning (ICML). 2019. https://proceedings.mlr.press/v97/recht19a.html
Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science, Vol. 366, Issue 6464, pp. 447–453. October 25, 2019. https://www.science.org/doi/10.1126/science.aax2342
Buolamwini, J. and Gebru, T. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." Proceedings of the 1st Conference on Fairness, Accountability and Transparency (FAT* 2018). https://proceedings.mlr.press/v81/buolamwini18a.html
Kaggle. "2023 AI & Machine Learning Survey." Kaggle, 2023. https://www.kaggle.com/competitions/kaggle-survey-2023
Ng, A. et al. "Structuring Machine Learning Projects." Deep Learning Specialization, Coursera. DeepLearning.AI. https://www.coursera.org/learn/machine-learning-projects
Google Developers. "Machine Learning Crash Course: Training and Test Sets." Google, 2024. https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. "Learnability and the Vapnik-Chervonenkis Dimension." Journal of the ACM, Vol. 36, No. 4, 1989. https://dl.acm.org/doi/10.1145/76359.76371


