What Is a Test Set?


Most machine learning models look brilliant on paper. They score high on the data they trained on, impress the team during development, and get approved for deployment. Then reality hits. The model underperforms on real users, real transactions, or real medical images. The reason is almost always the same: the team never properly evaluated the model on data it had never seen before. That is exactly what a test set is for.



TL;DR

  • A test set is a held-out portion of your dataset used exclusively to evaluate a fully trained model's real-world performance.

  • It must never be used during training or hyperparameter tuning—only once, at the very end.

  • The test set measures generalization: whether your model works on new, unseen data.

  • Common split ratios are 70/15/15 or 80/10/10 (train/validation/test).

  • Data leakage—when test information bleeds into training—is one of the most dangerous and common mistakes in ML.

  • Test metrics are estimates, not guarantees; statistical uncertainty always applies.


What is a test set?

A test set is a separate portion of a labeled dataset that is withheld from model training and used only at the end to measure final model performance. It simulates how the model will behave on new, unseen, real-world data. A test set should never influence training, feature engineering, or hyperparameter tuning decisions.






Table of Contents

  1. Why Model Evaluation Matters

  2. What Is a Test Set?

  3. The Machine Learning Workflow

  4. Training Set vs Validation Set vs Test Set

  5. Why Test Sets Are Important

  6. Generalization: The Core Goal

  7. Overfitting and Underfitting

  8. How to Split a Dataset

  9. Types of Test Set Splits

  10. Stratified Test Sets

  11. Time-Based Test Sets

  12. Data Leakage

  13. Test Set Contamination

  14. Validation Set vs Test Set

  15. Cross-Validation and the Test Set

  16. Test Metrics

  17. Examples by Problem Type

  18. Practical Example: Spam Classifier

  19. Practical Example: House Price Regression

  20. Best Practices

  21. Common Mistakes

  22. What Makes a Good Test Set?

  23. Test Set Size

  24. Test Sets in Real-World Production

  25. Benchmark Test Sets

  26. Test Sets in Deep Learning

  27. Small-Data Situations

  28. Statistical Uncertainty

  29. Ethics and Fairness

  30. Can You Use the Test Set More Than Once?

  31. What If the Model Performs Poorly?

  32. Test Set vs Holdout Set vs Evaluation Set vs Dev Set

  33. Checklist for Creating a Test Set

  34. Beginner-Friendly Summary

  35. FAQ

  36. Key Takeaways

  37. Actionable Next Steps

  38. Glossary

  39. References


1. Why Model Evaluation Matters

Building a machine learning model is only half the job. The other half is knowing whether it actually works.


A model that scores 99% accuracy in your notebook can still fail spectacularly in production. It may have memorized the training data without learning the underlying patterns. It may never have been tested on data that reflects real-world variation. Without rigorous evaluation, you have no reliable way to know.


Model evaluation is the discipline that closes that gap. It gives teams an honest, evidence-based signal of how a model is likely to behave when deployed. At the center of that discipline sits a single, critical component: the test set.



2. What Is a Test Set?

A test set is a labeled portion of your dataset that is held out entirely from model training, used only at the very end to measure the final model's performance.


Think of it as a sealed envelope. You prepare it before training begins, lock it away, and open it only once—after all decisions about the model have been made. The model has never seen these examples. Its score on this data is your best estimate of how it will behave on new, real-world inputs.


Key properties of a test set:

  • It is separate from training data and validation data.

  • The model does not learn from it—not directly, not indirectly.

  • It is used once (or as few times as possible) to produce a final performance estimate.

  • It represents the conditions the model will face in deployment.


The test set does not make a model better. It simply tells you, as honestly as possible, how good it already is.



3. The Machine Learning Workflow

Understanding the test set requires understanding where it sits in the broader ML pipeline. Here is the typical workflow:

  1. Collect data — Gather labeled examples relevant to your problem.

  2. Clean data — Handle missing values, remove duplicates, fix errors.

  3. Split data — Divide the dataset into training, validation, and test sets before any modeling.

  4. Train model — Fit the model on the training set.

  5. Validate and tune — Evaluate on the validation set, adjust hyperparameters, select the best model architecture.

  6. Final test evaluation — Evaluate the chosen, finalized model on the test set exactly once.

  7. Deploy model — Ship it. Monitor it in production.


The critical rule: the split happens before training. The test set must be invisible to the entire development process.



4. Training Set vs Validation Set vs Test Set

| Property | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Purpose | Teaching the model | Tuning the model | Final performance estimate |
| When used | During training iterations | During development & tuning | Once, after all decisions are made |
| Model learns from it? | Yes | No (but decisions are based on it) | No |
| Influences model? | Directly | Indirectly (via hyperparameter choices) | Should not influence anything |
| Typical size (of total data) | 60–80% | 10–20% | 10–20% |
| Example use | Fitting weights in a neural network | Choosing learning rate or depth | Reporting final accuracy |

The validation set is a development tool. The test set is a verdict.



5. Why Test Sets Are Important


Measuring Generalization

The purpose of machine learning is to learn patterns that generalize—that work on new inputs the model has never encountered. Training accuracy tells you how well a model fits historical data. Test accuracy tells you how well it generalizes.


Preventing Overconfidence

Without a proper test set, teams routinely overestimate model quality. A model achieving 95% accuracy on its own training data may score only 70% on truly new data—a gap that only test evaluation can reveal.


Detecting Overfitting

Overfitting happens when a model memorizes training data instead of learning underlying patterns. A test set exposes this immediately: overfitted models score high on training data and significantly lower on test data.


Comparing Models Fairly

When you have two competing models, evaluating both on the same test set gives you a fair, apples-to-apples comparison. Neither model saw the test data during development, so the comparison is unbiased.


Supporting Trustworthy Decision-Making

Regulatory frameworks and enterprise governance increasingly require documented evidence of model performance before deployment. A properly constructed test evaluation provides that evidence. In healthcare AI, financial services, and hiring tools, this is not optional—it is a compliance requirement.



6. Generalization: The Core Goal

Generalization is a model's ability to perform well on data it was not trained on.


A model that generalizes well has identified real, underlying patterns in the data—not quirks, noise, or memorized examples. This is the entire point of machine learning. If the model does not generalize, it provides no value in production.


Consider a student who memorizes every answer from last year's homework but has never worked through the reasoning behind each problem. They may score perfectly on a homework re-test. Give them a new problem set with the same concepts but different numbers, and they fail.


This is exactly what happens when a model overfits. The training set is the homework. The test set is the new exam. A model that aces training data but struggles on test data has memorized rather than learned.


The test set is the exam. Skipping it or compromising its integrity is the equivalent of letting the student peek at the answer key before sitting the exam. The result looks good. The reality is not.



7. Overfitting and Underfitting


Overfitting

Overfitting occurs when a model is too closely fit to the training data, capturing noise as if it were signal. Symptoms:

  • Very high training accuracy

  • Significantly lower validation and test accuracy

  • The gap between training and test performance widens as model complexity increases


Example: A decision tree with unlimited depth can perfectly classify every training example by essentially memorizing it. On test data, its accuracy drops sharply, sometimes to little better than random guessing.


Underfitting

Underfitting occurs when a model is too simple to capture the real patterns in the data. Symptoms:

  • Low training accuracy

  • Low test accuracy

  • No significant gap between training and test performance—both are poor


Example: Fitting a straight line (linear regression) to data with a strong non-linear relationship will underfit: the model cannot represent the true pattern no matter how much data it is given.


The test set does not fix overfitting or underfitting. It reveals them, enabling you to make corrections before deployment.
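The overfitting symptoms above can be demonstrated in a few lines. This sketch uses synthetic data and illustrative parameters: flip_y randomizes roughly 20% of the labels, so an unconstrained decision tree can only reach a perfect training score by memorization, and the held-out test score exposes the gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomizes ~20% of labels, so a perfect
# training score can only come from memorization.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # 1.0: every training example memorized
test_acc = tree.score(X_test, y_test)     # markedly lower: poor generalization

print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

Capping max_depth (or pruning) narrows the gap, trading a little training accuracy for better generalization.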



8. How to Split a Dataset


Common Split Ratios

| Ratio (Train/Val/Test) | When to Use |
|---|---|
| 70 / 15 / 15 | General-purpose; balanced datasets of moderate size |
| 80 / 10 / 10 | Larger datasets where more training data helps |
| 60 / 20 / 20 | Smaller datasets where reliable validation and test estimates are critical |

How Dataset Size Affects the Split

  • Large datasets (millions of records): Even 1–5% can produce a statistically reliable test set. More data in training usually improves model quality more than a larger test set would improve estimate precision.

  • Small datasets (thousands or fewer): A larger proportion in validation and test is necessary to get stable estimates, even if it reduces training size.


Why Random Splitting Is Common

For many tabular and image classification tasks, each example is assumed to be independent. Random shuffling before splitting prevents accidental ordering effects (e.g., all early records being in training, all late records in test).


When Random Splitting Is Not Appropriate

  • Time-series data: Future data must not appear in training (see Section 9).

  • Group-based data: Users, patients, or sessions must not be split across train and test—leakage can occur if the same entity appears in both.

  • Imbalanced data: Random splits may leave rare classes underrepresented in the test set (stratified splitting solves this).
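As a concrete sketch, a 70/15/15 split can be produced with two calls to scikit-learn's train_test_split (which makes one cut at a time); the dataset, sizes, and seed here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First cut: seal off 150 examples (15%) as the test set.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=150, random_state=0)

# Second cut: carve another 150 (15% of the original) out as validation.
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=150, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Integer test_size values keep the arithmetic exact; with a fraction you would pass test_size=0.15/0.85 on the second cut.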



9. Types of Test Set Splits


Random Split

The default. Shuffle the data and set aside a random N% as the test set. Works for independent, identically distributed (i.i.d.) data.


Stratified Split

Preserves the class distribution in each split. Essential when one class is rare—without stratification, the test set may contain very few or zero examples of that class.


Time-Based Split

All training data comes from earlier time periods; test data comes from later time periods. Required for any prediction problem where the past is used to forecast the future.


Group-Based Split

Ensures entire groups (e.g., all sessions from one user, all scans from one patient) stay within one split. This prevents a group seen during training from inflating the model's apparent performance when the same group shows up in the test set.


Geographic Split

Training on data from certain regions, testing on others. Used when a model must generalize across geographies not seen during training.


User-Based Split

A form of group split—used in recommendation systems, personalization, and fraud detection where user-level independence is required.


Out-of-Distribution (OOD) Test Set

A test set intentionally drawn from a different distribution than training data. Tests robustness and identifies failure modes that in-distribution testing misses.


Holdout Test Set

General term for any test set withheld from training. Often used synonymously with "test set."
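The group- and user-based variants above can be implemented with scikit-learn's group-aware splitters. A sketch (user IDs and sizes are made up for illustration): GroupShuffleSplit keeps every row belonging to a given user on one side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 1000
users = rng.integers(0, 100, size=n_rows)  # 100 users, ~10 rows per user
X = rng.normal(size=(n_rows, 5))
y = rng.integers(0, 2, size=n_rows)

# Split ~20% of *users* (not rows) into the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=users))

train_users, test_users = set(users[train_idx]), set(users[test_idx])
print(len(train_users & test_users))  # 0 -- no user straddles the split
```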



10. Stratified Test Sets

Stratification means maintaining the proportional representation of classes across splits.


Consider a fraud detection dataset where 1% of transactions are fraudulent. A random 80/20 split might produce a test set with only 0.5% fraudulent examples—or even zero, in small datasets. An accuracy metric on this test set would be meaningless: a model that always predicts "not fraud" scores 99.5%.


Stratified splitting guarantees that each split contains approximately the same 1% fraud rate, making evaluation reliable. Tools like scikit-learn's train_test_split support this with the stratify parameter (Pedregosa et al., scikit-learn documentation, 2025).


The same logic applies to disease detection (rare positive cases), spam classification (skewed spam rates), and any multi-class problem with significant class imbalance.
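A minimal sketch of the fraud scenario above, using a synthetic 1%-positive label array, shows the stratify argument preserving the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 100 + [0] * 9900)   # 1% "fraud" labels
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the ~1% positive rate, so test metrics stay meaningful.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Without stratify=y, the 20 positives expected in the test set could easily become 12 or 28 by chance, shifting every rare-class metric.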



11. Time-Based Test Sets

For time-series and temporal prediction problems, the test set must be drawn from a later time period than the training set.


The Problem With Random Splitting in Time Series

If you randomly shuffle a sales forecasting dataset and split it, your training set will contain data from December 2025 while your test set contains data from January 2025. The model is effectively predicting the past using the future—a phenomenon called future leakage or temporal leakage.


The resulting test scores will be artificially optimistic. When deployed, the model faces true future data it has never seen in the correct temporal order, and performance collapses.


Correct Approach

Set a cutoff date. All data before the cutoff goes into training (and validation). All data after the cutoff goes into the test set. The model is evaluated strictly on its ability to forecast forward in time.


Examples of problems requiring time-based splits:

  • Stock return forecasting

  • Weather prediction

  • E-commerce sales forecasting

  • Customer churn prediction

  • Demand planning
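The cutoff-date approach can be sketched with pandas; the column names and cutoff below are illustrative.

```python
import pandas as pd

# One year of daily records; "date", "sales", and the cutoff are illustrative.
df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=365, freq="D"),
    "sales": range(365),
})

cutoff = pd.Timestamp("2025-10-01")
train = df[df["date"] < cutoff]   # everything before the cutoff
test = df[df["date"] >= cutoff]   # the model forecasts strictly forward

print(train["date"].max() < test["date"].min())  # True -- no future data in training
```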



12. Data Leakage

Data leakage occurs when information from the test set—or from the future—is inadvertently included in the training process, causing the model to appear better than it actually is.


Leakage is one of the most dangerous and frequently overlooked problems in applied machine learning. A Kaggle survey of data scientists found that data leakage was consistently cited as a top source of unreliable competition results (Kaggle, 2023 Machine Learning & Data Science Survey).


Common Sources of Leakage

| Leakage Type | Example |
|---|---|
| Target leakage | Including a feature that is determined after the target is known (e.g., using a "claim paid" flag to predict whether a claim will be filed) |
| Train-test contamination | Fitting a scaler or imputer on the full dataset before splitting |
| Temporal leakage | Randomly splitting time-series data |
| Group leakage | The same user appears in both train and test |
| Label leakage | A feature directly encodes the label (e.g., a diagnosis code used to predict diagnosis) |

Why Leakage Makes Scores Look Great

When training data contains hints about the correct answers in the test set, the model exploits those hints. Evaluation scores spike. Teams celebrate. Deployment reveals the truth: the hints are gone in production, and performance tanks.


How to Avoid Leakage

  1. Split before any preprocessing. Fit scalers, encoders, and imputers only on training data; apply (transform only) to validation and test data.

  2. Audit features. Ask: could this feature be influenced by the target? If yes, investigate carefully.

  3. Use temporal splits for time-dependent data.

  4. Check for group overlaps. Ensure no user, patient, or session ID appears in both train and test.

  5. Inspect suspiciously high scores. A model scoring 99%+ on a hard problem is a red flag worth investigating.
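Rule 1 in practice: fit preprocessing statistics on the training split only, then apply the frozen transform everywhere else. A sketch with StandardScaler on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=50, scale=10, size=(1000, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # statistics come from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # transform only -- never fit on test data

# The wrong order -- StandardScaler().fit(X) on all 1,000 rows before
# splitting -- would leak test-set means and variances into training.
```

Wrapping the scaler and model in a scikit-learn Pipeline enforces this ordering automatically during cross-validation.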



13. Test Set Contamination

Test set contamination is the gradual corruption of the test set's integrity through repeated use in decisions.


It happens like this: a team trains a model, checks test accuracy, adjusts a feature, retrains, checks test accuracy again—and keeps iterating. Each inspection doesn't technically train on the test set, but every decision that responds to test scores is indirectly guided by the test set. Over many iterations, the model is effectively tuned to the test data.


This is sometimes called overfitting to the test set or adaptive overfitting. Blumer et al. first formalized related concerns in statistical learning theory; the phenomenon in applied ML practice is well-documented in the literature on benchmark integrity (Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?", ICML 2019).


The solution: treat the test set as a one-time measurement instrument. Make all development decisions using the validation set. Open the test set once, record the result, and stop.



14. Validation Set vs Test Set

This is one of the most commonly confused distinctions in machine learning.


| | Validation Set | Test Set |
|---|---|---|
| When used | During development, repeatedly | Once, at the end |
| Purpose | Guide hyperparameter tuning, model selection | Final, unbiased performance estimate |
| Influences model? | Yes, indirectly through decisions | Should not influence anything |
| Can be used multiple times? | Yes | Ideally, only once |

Hyperparameter tuning is the process of choosing model settings (such as learning rate, tree depth, or regularization strength) that are not learned from data but set by the developer. These choices should be guided by validation performance—never by test performance.


Model selection is choosing between competing architectures or algorithms. Again, use validation performance.


Only once you have committed to a final model—with all hyperparameters locked, all features fixed, all preprocessing pipelines finalized—do you evaluate on the test set.



15. Cross-Validation and the Test Set

Cross-validation is a technique for more reliably estimating validation performance when data is limited.


In k-fold cross-validation, the training data is divided into k equal parts (folds). The model is trained k times, each time using k–1 folds for training and 1 fold for validation. The results are averaged across all k runs. Common values of k are 5 and 10.


Cross-validation reduces the variance in performance estimates compared to a single validation split. It is especially useful when the dataset is too small to spare a dedicated validation set.


Does Cross-Validation Replace the Test Set?

No. Cross-validation replaces the validation set. The test set remains separate and untouched.


Even when using cross-validation for model selection and hyperparameter tuning, a held-out test set is still needed to produce a final, unbiased performance estimate. The folds used in cross-validation have all participated in the tuning process and are no longer fully independent.


When data is extremely scarce, nested cross-validation (an outer loop for test evaluation, an inner loop for model selection) provides an alternative, but it is computationally expensive and less common in practice.
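A sketch of the division of labor (model, data, and seed are illustrative): cross-validation plays the validation role on the training portion, while the test set stays sealed for one final measurement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold CV on the training portion plays the *validation* role.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("mean CV accuracy:", cv_scores.mean())

# Only after all choices are final: one evaluation on the sealed test set.
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("final test accuracy:", test_acc)
```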



16. Test Metrics

The metric you use to evaluate the test set must match the problem and the business objective.


Classification Metrics

| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | Fraction of correct predictions | Balanced classes, equal error costs |
| Precision | Of positive predictions, how many are correct | When false positives are costly |
| Recall | Of actual positives, how many are caught | When false negatives are costly |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, balanced FP/FN concern |
| ROC-AUC | Model's ability to distinguish classes at all thresholds | General discrimination ability |
| PR-AUC | Precision-recall tradeoff across thresholds | Heavily imbalanced datasets |
| Log Loss | Confidence of predictions | Probabilistic outputs |
| Confusion Matrix | Counts of TP, TN, FP, FN | Diagnosing specific error types |

Regression Metrics

| Metric | What It Measures |
|---|---|
| MAE (Mean Absolute Error) | Average absolute prediction error |
| MSE (Mean Squared Error) | Average squared error; penalizes large errors |
| RMSE (Root Mean Squared Error) | Square root of MSE; same units as target |
| R² (R-Squared) | Proportion of variance explained by the model |

No metric is universally correct. Choosing the wrong metric can lead to deploying a model that optimizes the wrong thing—high accuracy on an imbalanced fraud dataset, for example, tells you almost nothing useful.
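The fraud example can be made concrete. On a synthetic 1%-positive test set, a classifier that always predicts the majority class looks excellent by accuracy and useless by recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)   # 1% positive ("fraud") class
y_pred = np.zeros(1000, dtype=int)        # always predict the majority class

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc)            # 0.99 -- looks excellent
print(prec, rec, f1)  # 0.0 0.0 0.0 -- catches no fraud at all
```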



17. Examples by Problem Type


Classification

A loan default classifier uses ROC-AUC on the test set to measure discrimination ability, and F1 to balance recall (catching defaulters) against precision (avoiding false denials).


Regression

A house price estimator uses RMSE on the test set, which penalizes large errors more heavily than MAE—appropriate when a $100,000 miss is far more costly than a $10,000 miss.


Time Series Forecasting

A demand forecasting model uses MAPE (Mean Absolute Percentage Error) or RMSE on a future-period test set drawn from a genuine holdout time window.


Recommendation Systems

A product recommendation engine uses NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision) on a user-based test split, ensuring no user appears in both train and test.


Natural Language Processing

A sentiment classifier uses F1 score and accuracy on a stratified test set; a text generation model may use BLEU, ROUGE, or BERTScore on held-out reference texts.


Computer Vision

An object detection model uses mAP (mean Average Precision) across classes and IoU thresholds on a held-out image test set, with care taken that no training images overlap with test images.


Medical AI

A diagnostic imaging model uses sensitivity (recall), specificity, and AUC on a test set drawn from a separate patient cohort—ideally a different hospital or time period—to test true generalization.


Fraud Detection

A fraud detection model is evaluated using Precision@K and recall at a fixed false positive rate on a time-ordered test set (not random), since fraudsters adapt over time.



18. Practical Example: Spam Classifier


Problem: Build a model to classify emails as spam or not spam.


Dataset: 10,000 labeled emails (1,500 spam, 8,500 not spam).


Split: Stratified 70/15/15 → 7,000 training / 1,500 validation / 1,500 test, each preserving the 15% spam rate.


Training set role: The model learns word patterns, sender features, and structural signals associated with spam vs. legitimate email.


Validation set role: The team tunes the classification threshold, regularization strength, and feature selection. They compare a logistic regression against a gradient boosting model. All these decisions use validation F1 scores.


Test set role: After committing to the gradient boosting model with fixed hyperparameters, the team runs one final evaluation on the 1,500 test emails. Result: F1 = 0.93, precision = 0.91, recall = 0.95.


Interpretation: The model correctly flags 95% of spam (recall) while keeping false positive rates low (precision = 0.91). This score is reported as the model's expected production performance under similar email distributions.


Warning: If the test set had been checked after each tuning iteration, the 0.93 F1 would be an optimistic overestimate. Keeping the test set sealed ensures it reflects genuine generalization.



19. Practical Example: House Price Regression


Problem: Predict residential property sale prices.


Dataset: 20,000 home sales records with features including square footage, location, age, and condition.


Split: Random 80/10/10 → 16,000 training / 2,000 validation / 2,000 test.


Preprocessing: The StandardScaler is fit only on the 16,000 training records, then applied to validation and test. Fitting on all 20,000 would introduce leakage.


Training set role: A gradient boosting regressor learns relationships between features and sale prices.


Validation set role: The team tunes the number of trees, learning rate, and maximum tree depth. RMSE on the validation set guides each iteration.


Test set role: Final evaluation. RMSE = $22,400, R² = 0.88.


Interpretation: The model's typical prediction error is roughly $22,400 (RMSE weights large misses more heavily than a plain average would), and it explains 88% of the variance in sale prices. This is the number the business uses when deciding whether to deploy the model in their pricing tool, not the validation RMSE, which was slightly lower after tuning.



20. Best Practices

  1. Split before any preprocessing. Decide on train/validation/test boundaries before touching the data in any analytical way.

  2. Fit preprocessing only on training data. Scalers, encoders, imputers—fit on training, transform only on validation and test.

  3. Use stratified splits when class distributions are imbalanced.

  4. Use time-based splits for any temporal or sequential data.

  5. Avoid duplicate records across splits. Remove or deduplicate before splitting.

  6. Check for group overlaps. Users, patients, devices, or sessions must not appear in multiple splits.

  7. Audit for leakage. Inspect features for post-target information and temporal contamination.

  8. Use a representative test set. It should mirror the real-world distribution the model will face.

  9. Evaluate multiple metrics. No single metric captures all aspects of model quality.

  10. Treat the test set as final. Evaluate once. Document the result. Do not iterate based on test scores.

  11. Document the split procedure. Record the random seed, split ratios, stratification strategy, and cutoff dates for reproducibility.

  12. Plan for a new test set if the deployment domain changes significantly.



21. Common Mistakes

Warning: These mistakes consistently inflate model performance estimates and lead to poor production outcomes.
  • Testing on training data. The most basic error. The model has seen this data; the score is meaningless as a generalization estimate.

  • Using the test set during development. Checking test scores before committing to a final model corrupts the test set's independence.

  • Tuning hyperparameters based on test results. This converts the test set into a second validation set, destroying its unbiased character.

  • Data leakage. Preprocessing before splitting, target-correlated features, temporal contamination.

  • Randomly splitting time series data. Future data in the training set artificially boosts test scores.

  • Ignoring class imbalance. Without stratification, rare classes may be underrepresented or missing in the test set.

  • Duplicate entities across splits. Same user or patient in train and test inflates scores.

• Using a tiny test set. Fewer than a few hundred examples produce high-variance, unreliable estimates.

  • Reporting only one metric. Accuracy alone on an imbalanced dataset is almost always misleading.

  • Confusing validation data with test data. Using validation performance as the final reported result overstates expected generalization.



22. What Makes a Good Test Set?

A high-quality test set is:

  • Representative. It reflects the real-world distribution of inputs the model will encounter after deployment—not a cherry-picked or convenient sample.

  • Sufficiently large. Large enough to produce statistically stable performance estimates. Hundreds of examples at minimum; thousands are better for reliable confidence intervals.

  • Independent. No example in the test set appears in the training set. No preprocessing fitted on test data. No group overlap.

  • Free from leakage. No future information, no target-correlated features unavailable at inference time.

• Properly labeled. Labels must be accurate. Mislabeled test examples distort the estimate: they penalize correct predictions and can reward models that reproduce the labeling errors.

  • Reflective of deployment conditions. If the model will be used on data from 2026, the test set should come from 2026—not 2020.

  • Stable. For ongoing benchmark use, it should not change unless explicitly versioned and documented.

  • Aligned with the objective. Designed around what the model will actually be used for, not what was convenient to collect.



23. Test Set Size

There is no universal rule for test set size. The right size depends on the number of classes, the rarity of events, and the precision needed in the performance estimate.


General Guidelines

| Dataset Size | Typical Test Set Size | Note |
|---|---|---|
| < 1,000 examples | 20–30% | Use cross-validation where possible |
| 1,000–10,000 | 15–20% | Standard holdout or CV |
| 10,000–100,000 | 10–15% | Larger training set improves model |
| > 1,000,000 | 1–5% | Even 1% is statistically large |

The Tradeoff

A larger test set produces more precise estimates but reduces training data. For very large datasets, precision gains from a 20% test set versus a 10% test set are marginal—while the 10% savings in training data can meaningfully improve model quality.


For rare events (e.g., fraud, rare disease), the test set must contain enough positive examples to evaluate precision and recall reliably. If your test set has only 10 positive examples, precision and recall estimates will have very wide confidence intervals.
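To see how wide those intervals get, here is a normal-approximation 95% confidence interval for recall (a rough sketch; a Wilson or exact binomial interval would be more precise at small counts):

```python
import math

def recall_ci(hits: int, positives: int, z: float = 1.96):
    """Normal-approximation 95% CI for recall = hits / positives."""
    p = hits / positives
    half = z * math.sqrt(p * (1 - p) / positives)
    return max(0.0, p - half), min(1.0, p + half)

print(recall_ci(8, 10))      # roughly (0.55, 1.00): 10 positives -> very wide
print(recall_ci(800, 1000))  # roughly (0.78, 0.82): 1,000 positives -> tight
```

Both test sets measure the same 80% recall; only the second measurement is precise enough to act on.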


A practical rule: aim for at least 1,000 examples per class in the test set when evaluating classification problems with multiple classes.



24. Test Sets in Real-World Production


Offline Evaluation

The test set evaluation described throughout this article is an offline evaluation—conducted on historical, labeled data before deployment. It is the primary quality gate before a model goes live.


Online Evaluation

Once deployed, a model is evaluated in production through online evaluation methods, including:

  • A/B testing: Traffic is split between the old model and new model. Real user behavior (clicks, purchases, outcomes) determines which model performs better.

  • Shadow deployment: The new model runs in parallel with the existing system, its predictions logged but not acted on. Outcomes are compared retrospectively.

  • Monitoring: Continuous tracking of prediction distributions, feature distributions, and business metrics to detect degradation.


Concept Drift and Data Drift

After deployment, the data distribution often changes. User behavior shifts. Fraud patterns evolve. Language use changes. These shifts—called concept drift (change in the relationship between inputs and outputs) and data drift (change in input distributions)—mean that even a well-constructed test set becomes stale.


A model that scored 92% on a test set built from 2024 data may score 78% on 2026 real-world inputs. This is why production monitoring and periodic retraining—with new, up-to-date test sets—are essential for long-lived models.
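Data drift on a single feature can be quantified with the Population Stability Index (PSI), a common monitoring metric. The sketch below uses hypothetical equal-width binning and the conventional 0.25 alert threshold; real monitoring systems vary in both choices:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample (e.g., the data
    the test set was built from) and current production values of one
    feature. PSI > 0.25 is a common rule-of-thumb drift threshold."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] += 1e-9  # include the maximum value in the last bin

    def freqs(values):
        counts = [0] * bins
        for v in values:
            for i in range(bins):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # small epsilon avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = freqs(expected), freqs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]     # same feature, shifted up
```

`psi(baseline, baseline)` is near zero; `psi(baseline, shifted)` far exceeds 0.25, signaling that a test set built from the baseline is going stale.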



25. Benchmark Test Sets

Benchmark datasets are standardized test sets used to compare model performance across the research community.


Examples include image classification benchmarks used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), natural language inference benchmarks, and reading comprehension benchmarks from leading NLP research groups.


Hidden Test Sets

In academic competitions and standardized evaluations, test set labels are hidden from participants. Participants submit predictions; an automated system computes the score. This prevents direct test set overfitting.


Leaderboard Overfitting

Despite hidden test sets, leaderboard overfitting occurs when teams submit thousands of predictions, use leaderboard scores as feedback, and effectively tune to the test set through trial and error. Research by Recht et al. (ICML 2019) documented this phenomenon with empirical evidence on ImageNet, showing that repeated evaluation on the same test distribution inflates reported scores over time.


This finding reinforces why test sets—even public benchmarks—should be treated as limited evaluation resources rather than optimization targets.



26. Test Sets in Deep Learning

Deep learning models operate at massive scale, often trained on millions of examples. The test set principles are identical, but several practical details deserve attention.


Early Stopping

Early stopping—halting training when validation loss stops improving—uses the validation set, not the test set. This prevents the model from overfitting the training data without contaminating the test evaluation.
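Early stopping reduces to a small bookkeeping loop. In this sketch, `train_step` and `val_loss_fn` are hypothetical placeholders for a real training epoch and validation pass:

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=5):
    """Stop when validation loss hasn't improved for `patience` epochs.
    `train_step()` runs one epoch of training; `val_loss_fn()` returns
    the current validation loss. Both are stand-ins for real code."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss_fn()
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
            # in a real setup, checkpoint the model weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_loss, epoch + 1

# Toy run: loss improves for 10 epochs, then plateaus at 0.2
losses = iter([1 / (e + 1) for e in range(10)] + [0.2] * 90)
best, stopped_at = train_with_early_stopping(lambda: None, lambda: next(losses))
```

Note that only the validation loss drives the stopping decision; the test set never enters this loop.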


Architecture Decisions

Choices about the number of layers, attention heads, or convolutional filters should be guided by validation performance, never test performance. Architecture exploration is a development activity.


Test-Time Evaluation

At test time, deep learning models should be set to evaluation mode (dropout disabled, batch normalization using its frozen running statistics). Test-time augmentation (averaging predictions over multiple augmented versions of each test input) is a legitimate practice but must be applied consistently and documented.


Large-Scale Test Sets

For large-scale tasks, the test set may contain hundreds of thousands of examples. Evaluation is usually parallelized across GPUs. The same integrity rules apply: a separate process evaluates on test data, results are logged, and the test set is not revisited for development purposes.



27. Small-Data Situations

When data is scarce, dedicating 15–20% to a test set may leave too little for training. Options:


Cross-Validation (Again)

Use k-fold cross-validation on the full dataset for model selection, then report average CV performance. If a single withheld test set is not feasible, document this clearly: because the same folds guided model selection, the reported CV performance is a somewhat optimistic estimate.


Nested Cross-Validation

The outer loop produces test-fold estimates. The inner loop selects hyperparameters. This provides an approximately unbiased performance estimate without a separate test set, though it is computationally expensive.
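The nested structure can be sketched with plain index bookkeeping; `fit_and_score` below is a hypothetical stand-in for actual model training and scoring:

```python
def k_folds(n, k):
    """Yield (train_indices, held_out_indices) pairs for k-fold CV."""
    fold_size, idx = n // k, list(range(n))
    for f in range(k):
        held = idx[f * fold_size:(f + 1) * fold_size] if f < k - 1 else idx[f * fold_size:]
        held_set = set(held)
        yield [i for i in idx if i not in held_set], held

def nested_cv_score(n, outer_k, inner_k, hyperparams, fit_and_score):
    """Outer folds estimate performance; inner folds pick hyperparameters.
    `fit_and_score(train_idx, eval_idx, hp)` is a placeholder for real training."""
    outer_scores = []
    for outer_train, outer_test in k_folds(n, outer_k):
        # inner loop: pick the hyperparameter with the best mean inner-CV score
        def inner_score(hp):
            return sum(fit_and_score([outer_train[i] for i in tr],
                                     [outer_train[i] for i in ev], hp)
                       for tr, ev in k_folds(len(outer_train), inner_k)) / inner_k
        best_hp = max(hyperparams, key=inner_score)
        # outer fold: score the chosen configuration on the untouched fold
        outer_scores.append(fit_and_score(outer_train, outer_test, best_hp))
    return sum(outer_scores) / outer_k
```

The key property: each outer test fold is never seen by the inner loop that selects hyperparameters, which is what keeps the outer estimate approximately unbiased.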


Bootstrapping

Generate multiple training/test splits by sampling with replacement. Average performance across splits. This provides estimates of variability but is less common in practice than k-fold CV.
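A related, simpler use of the bootstrap resamples a fixed test set to estimate score variability; this sketch assumes binary labels and uses hypothetical toy data:

```python
import random
import statistics

def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Resample the test set with replacement and recompute accuracy each
    time, yielding a distribution of plausible accuracy values."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    # 2.5th and 97.5th percentiles give a 95% bootstrap interval
    return statistics.mean(scores), (scores[int(0.025 * n_resamples)],
                                     scores[int(0.975 * n_resamples)])

y_true = [1] * 80 + [0] * 20
y_pred = [1] * 90 + [0] * 10          # 90% accurate on this toy test set
mean_acc, (lo, hi) = bootstrap_accuracy(y_true, y_pred)
```

The width of the resulting interval makes the small-sample uncertainty concrete for stakeholders.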


The Honest Caveat

With very small datasets, test performance estimates have wide confidence intervals. A model scoring 80% accuracy on 100 test examples has a 95% confidence interval of roughly [71%, 87%] (Wilson score interval). This uncertainty must be communicated clearly to stakeholders.



28. Statistical Uncertainty

A test set score is a sample estimate, not an absolute truth.


If your model scores 85% accuracy on 1,000 test examples, that 85% carries statistical uncertainty. The true expected accuracy—the accuracy you would observe across infinite data from the same distribution—might be 83% or 87%.


Confidence Intervals

For a proportion estimate p on n test examples, a 95% confidence interval gives a range of plausible true performance values. The Wilson or Agresti-Coull intervals are more reliable than the simple normal approximation, especially for small n or proportions near 0 or 1.
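The Wilson interval is a few lines of arithmetic; this sketch assumes a binary correct/incorrect outcome per test example:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion such as test accuracy.
    z = 1.96 corresponds to a 95% confidence level."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 850 correct out of 1,000 test examples:
lo, hi = wilson_interval(850, 1000)   # ≈ (0.827, 0.871)
```

On 100 examples the same 80% point estimate yields roughly [71%, 87%], which is why small test sets demand cautious reporting.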


Practical implication: Two models scoring 84.2% and 85.1% on the same 1,000-example test set are not necessarily meaningfully different. Statistical significance testing (e.g., McNemar's test for paired predictions) should be used when comparing close-scoring models.
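An exact version of McNemar's test needs only the counts of disagreements between the two models on the shared test set; the numbers below are hypothetical:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact McNemar test for two models scored on the same test set.
    b = examples model A got right and model B got wrong; c = the reverse.
    Under the null of equal accuracy, b ~ Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)          # two-sided p-value

# Two models disagree on 30 test examples: 18 favor model A, 12 favor model B
p = mcnemar_exact_p(18, 12)
```

Here p is well above 0.05, so an apparent accuracy gap of this size on this test set could easily be noise.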


Always report: The test score alongside the test set size. A score without a denominator is incomplete information.



29. Ethics and Fairness

A model that scores 90% overall may score 60% for a specific demographic group. Aggregate test metrics obscure this.


Subgroup Evaluation

Evaluate the test set across meaningful subgroups—by age, gender, race, geography, income level, or any other factor relevant to the deployment context. Identify performance disparities before deployment.


Representative Test Sets

If your test set underrepresents a group that the model will be deployed on, your evaluation is incomplete. A medical AI test set drawn entirely from one hospital in one city may not generalize to patients at hospitals in other regions with different demographic compositions.


Real-World Consequences

Biased or unrepresentative test evaluation has led to documented failures:

  • The Optum health risk prediction algorithm, audited by Obermeyer et al. in Science (2019), was found to significantly underserve Black patients because the training and evaluation data used healthcare costs as a proxy for health needs—a proxy that reflected historical inequities, not actual illness burden.

  • Commercial facial recognition systems from multiple vendors showed significantly higher error rates on darker-skinned faces, as documented in the Gender Shades audit by Buolamwini and Gebru (2018), because test sets—and training sets—were not representative.


Implication for test set design: Explicitly plan for subgroup coverage. Document demographic composition. Report disaggregated performance metrics alongside aggregate scores.



30. Can You Use the Test Set More Than Once?


The textbook answer is: ideally, once.


In practice, most teams inspect test results after final evaluation, and that single inspection is fine. The problem arises when teams:

  • Tune the model based on what they observe in test results

  • Try multiple model variants and pick the one with the best test score

  • Revisit test evaluation repeatedly over weeks of iteration


Each additional use of the test set as a decision input reduces its independence. The more decisions it informs, the more the reported test score overestimates true generalization.


The rigorous recommendation: Treat the test set as a one-time measurement. If your deployment domain changes substantially, create a new, up-to-date test set rather than reusing the original.


In Kaggle-style competitions, a private test set is evaluated only once (upon submission deadline) precisely to prevent adaptive overfitting.



31. What If the Model Performs Poorly on the Test Set?

Poor test performance is valuable information. It tells you something is wrong before you deploy. Possible causes:

Cause and how to diagnose it:

  • Overfitting: Training accuracy much higher than test accuracy.

  • Poor features: Feature importance analysis; domain expert review.

  • Bad data quality: Audit labels; check for corruption or bias.

  • Distribution mismatch: Compare feature distributions of train vs test.

  • Insufficient training data: Learning curves; add data if available.

  • Wrong model choice: Try simpler or more complex alternatives on the validation set.

  • Leakage in validation: Validation scores looked great; test scores diverged significantly.

  • Misaligned metric: The metric doesn't reflect the actual problem.

What to Do

  1. Diagnose the cause using the list above.

  2. Fix the issue—add features, more data, better preprocessing, or choose a different model.

  3. Evaluate the new model on the validation set during iteration.

  4. Use the test set again only if you create a new holdout partition, or if you commit in advance that this second evaluation is final.


Do not tune the model based on test error patterns. That converts your test set into a validation set, and you will no longer have an unbiased performance estimate.



32. Test Set vs Holdout Set vs Evaluation Set vs Dev Set


Test Set vs Holdout Set

These terms are largely interchangeable. "Holdout set" emphasizes that data is held out from training. Some teams use "holdout" for the validation set and "test set" for the final evaluation; others use both terms for the final evaluation. Context matters—always clarify within your team.


Test Set vs Evaluation Set

Also largely interchangeable. "Evaluation set" is sometimes preferred in NLP and deep learning contexts. The key property (unseen data used to assess final model quality) is the same.


Test Set vs Dev Set

A dev set (development set) typically means the validation set in the usage popularized by Andrew Ng's deep learning courses (Ng et al., Coursera Deep Learning Specialization). It is used during development for tuning decisions, not as a final evaluation benchmark. If a colleague says "dev set," assume they mean validation set unless they clarify otherwise.



33. Simple Checklist for Creating a Test Set

  • [ ] Split the data before any preprocessing or modeling.

  • [ ] Use stratified splitting for imbalanced classification tasks.

  • [ ] Use time-ordered splitting for time series or temporal prediction tasks.

  • [ ] Fit all preprocessing transformers (scalers, encoders) on training data only.

  • [ ] Check for duplicate records and remove them before splitting.

  • [ ] Check for group-level overlaps (same user/patient/entity in train and test).

  • [ ] Audit features for leakage (post-target information, temporal contamination).

  • [ ] Ensure the test set is large enough for reliable estimates.

  • [ ] Ensure the test set reflects the deployment distribution.

  • [ ] Document the split procedure (ratio, random seed, cutoff date, stratification).

  • [ ] Evaluate on the test set exactly once (or as few times as possible).

  • [ ] Report multiple metrics, not just accuracy.

  • [ ] Evaluate disaggregated performance across relevant subgroups.

  • [ ] Do not tune the model in response to test results.
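The preprocessing items above reduce to one pattern: fit on training data only, then transform every split. A minimal standardization sketch with hypothetical numbers (the class name is illustrative, not scikit-learn's):

```python
import statistics

class SimpleScaler:
    """Minimal scaler: learns mean/std on one dataset, applies to any."""
    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # guard constant features
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 20.0]

scaler = SimpleScaler().fit(train)        # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # test is transformed, never fitted
```

Fitting the scaler on train and test combined would leak test-set statistics into the pipeline, which is exactly the leakage the checklist guards against.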



34. Beginner-Friendly Summary

Imagine you are a teacher. You want to know how well your students have learned, not just memorized your lessons. You give them homework and study materials (the training set). You hold a practice quiz to help them prepare and adjust your teaching (the validation set). Then, at the end of the course, you give a final exam with questions they have never seen before. That final exam is the test set.


If students score well on the exam, they genuinely learned the material—they can handle new questions. If they score poorly, something in the learning process did not work. The exam gives you an honest verdict.


In machine learning, the model is the student. The training data is the study material. The validation set is the practice quiz. The test set is the final exam. Your job is to keep that exam sealed until the very end—because if the student gets a peek at the answers beforehand, the exam no longer tells you anything useful.



FAQ


What is a test set in machine learning?

A test set is a labeled portion of your dataset withheld from model training and used only at the end to evaluate final model performance. It estimates how the model will behave on new, unseen, real-world data.


Why do we need a test set?

Because training accuracy is not a reliable indicator of real-world performance. A model can memorize training data without learning generalizable patterns. The test set provides an honest, unbiased performance estimate before deployment.


Is a test set the same as a validation set?

No. The validation set is used repeatedly during development to guide hyperparameter tuning and model selection. The test set is used only once, after all development decisions have been locked in, to produce a final performance estimate.


Can a model train on the test set?

No. If a model trains on the test set—directly or indirectly—the test score becomes meaningless as a generalization estimate. The test set must remain completely isolated from the training process.


How big should a test set be?

It depends on total dataset size and the rarity of events you are predicting. Common ratios are 10–20% of total data. Aim for at least a few hundred examples per class; thousands are preferable for stable estimates.


What is the difference between training accuracy and test accuracy?

Training accuracy measures how well the model fits the data it learned from. Test accuracy measures how well it generalizes to new, unseen data. A large gap between the two usually indicates overfitting.


What does poor test performance mean?

It may indicate overfitting, poor features, data quality issues, distribution mismatch, leakage in the validation process, insufficient training data, or the wrong model choice. It is a diagnostic signal, not a failure—it prevents you from deploying a weak model.


Can cross-validation replace a test set?

Cross-validation replaces the validation set, not the test set. A final held-out test set, untouched during all cross-validation iterations, is still recommended for a fully unbiased final performance estimate.


What is test data leakage?

Leakage occurs when information from the test set or from the future is inadvertently included in the training process. It makes the model appear better than it truly is, because it is exploiting hints that will not exist in production.


How do I choose the right test metric?

Choose based on your problem type and business objective. For imbalanced classification, use F1, ROC-AUC, or PR-AUC rather than accuracy. For regression, use RMSE or MAE. For time series, use MAPE or RMSE on a held-out future window. Always consider what error type is most costly in your context.


Should test data be random?

For independent data (tabular, images), yes. For time-series or temporal data, no—use a time-based cutoff. For user or patient data, no—use a group-based split to prevent leakage.


What is an untouched test set?

A test set that has never been used to guide any modeling, feature engineering, hyperparameter tuning, or architectural decisions. Its integrity is fully preserved, making it a reliable final performance judge.


What is an independent test set?

A test set that shares no examples, entities, or preprocessing fitting with the training set. Independence is what makes the test set a valid generalization estimate.


How is a test set used in deep learning?

In deep learning, training and validation sets guide weight updates and early stopping. The test set is used once, after training is complete and all architecture decisions are finalized, to report the model's expected real-world performance.


What is a hidden test set?

A test set whose labels are not disclosed to model developers. Used in academic benchmarks and competitions to prevent overfitting to the test set through repeated submission and feedback cycles.



Key Takeaways

  • A test set is a sealed, held-out portion of data used only once to evaluate final model performance.

  • It must never be used during training, feature engineering, preprocessing fitting, or hyperparameter tuning.

  • The gap between training and test performance reveals overfitting; similar poor scores on both reveal underfitting.

  • Data leakage is one of the most dangerous ML pitfalls—it inflates scores and hides model weakness until production.

  • Use stratified splits for imbalanced data; time-based splits for temporal data; group-based splits when entities must not overlap.

  • Fit all preprocessing transformers only on training data; apply (transform) to validation and test.

  • Test metrics are estimates with statistical uncertainty—report test set size alongside performance numbers.

  • Evaluate disaggregated performance across subgroups to catch demographic disparities before deployment.

  • Treat the test set like a final exam: seal it, use it once, record the result, and stop.

  • Production performance may diverge from test performance due to concept drift and data drift—ongoing monitoring is required.



Actionable Next Steps

  1. Audit your current project. Identify whether your test set was created before or after preprocessing. If after, redo the split.

  2. Check for leakage. Review every feature for post-target information or temporal contamination. Confirm preprocessing is fit only on training data.

  3. Verify your split type. Is your data temporal or grouped? Confirm you are using the correct split strategy.

  4. Set the test set aside. Create a separate DataFrame, file, or data partition that is not touched until final evaluation.

  5. Choose metrics before evaluating. Decide which metrics matter for your business objective before you open the test set.

  6. Evaluate once. Run final evaluation, record all metrics, and document the result with test set size and split details.

  7. Run subgroup analysis. Break down test performance by relevant demographic or operational subgroups and investigate disparities.

  8. Plan for monitoring. Once deployed, set up data drift and performance monitoring so you know when a new test set and retraining cycle are needed.



Glossary

  1. Benchmark dataset: A standardized, publicly shared dataset used to compare model performance across research teams and organizations.

  2. Concept drift: A change in the statistical relationship between input features and the target variable over time, degrading model performance.

  3. Cross-validation: A technique that divides training data into multiple folds, training and validating repeatedly to produce a more stable performance estimate.

  4. Data drift: A change in the statistical distribution of input features over time, which can reduce model accuracy.

  5. Data leakage: The introduction of information from the test set or from the future into the training process, artificially inflating model performance estimates.

  6. Generalization: A model's ability to perform accurately on new, unseen data beyond its training examples.

  7. Holdout set: Data withheld from training; often used synonymously with test set or validation set depending on context.

  8. Hyperparameter: A model setting chosen by the developer rather than learned from data (e.g., learning rate, tree depth, regularization strength).

  9. Overfitting: When a model fits training data too closely—including noise—and performs worse on new data.

  10. Stratified split: A data splitting method that preserves the proportional class distribution of the original dataset in each split.

  11. Test set: A labeled dataset partition held out from all training and tuning, used exactly once to produce a final, unbiased model performance estimate.

  12. Training set: The portion of data used to fit a model's parameters.

  13. Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data.

  14. Validation set: A held-out subset used during model development to guide hyperparameter tuning and model selection, separate from the test set.



References

  1. Pedregosa, F. et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, Vol. 12, 2011. https://jmlr.org/papers/v12/pedregosa11a.html

  2. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. "Do ImageNet Classifiers Generalize to ImageNet?" Proceedings of the 36th International Conference on Machine Learning (ICML). 2019. https://proceedings.mlr.press/v97/recht19a.html

  3. Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science, Vol. 366, Issue 6464, pp. 447–453. October 25, 2019. https://www.science.org/doi/10.1126/science.aax2342

  4. Buolamwini, J. and Gebru, T. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." Proceedings of the 1st Conference on Fairness, Accountability and Transparency (FAT* 2018). https://proceedings.mlr.press/v81/buolamwini18a.html

  5. Kaggle. "2023 AI & Machine Learning Survey." Kaggle, 2023. https://www.kaggle.com/competitions/kaggle-survey-2023

  6. Ng, A. et al. "Structuring Machine Learning Projects." Deep Learning Specialization, Coursera. DeepLearning.AI. https://www.coursera.org/learn/machine-learning-projects

  7. Google Developers. "Machine Learning Crash Course: Training and Test Sets." Google, 2024. https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data

  8. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/

  9. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. "Learnability and the Vapnik-Chervonenkis Dimension." Journal of the ACM, Vol. 36, No. 4, 1989. https://dl.acm.org/doi/10.1145/76359.76371




 
 