The test set is the data held out before training begins, never seen by the model during training, and used only at the end to estimate generalization performance — how well the trained model will do on new data it hasn’t seen.
A trained model that has memorized its training set will report flatteringly low loss on it. That tells us essentially nothing about how it’ll behave on examples outside the training data. The test set’s role is to provide an honest answer.
The right way to handle it:
- Carve off the test set first, before training, before any preprocessing decisions, before any model tuning. Typical sizes: 20-30% of the total data.
- Lock it in a metaphorical drawer. Don’t look at it. Don’t compute summary statistics from it. Don’t fit any StandardScaler or imputer on it.
- Train and tune on the remaining 70-80%, using a validation set (or K-fold cross-validation) for hyperparameter tuning.
- Once the model is finalized, open the drawer, run the model on the test set, and report whatever number comes out. Don’t tune further. Don’t go back and try different things.
This last point is the discipline that’s hardest to maintain. The temptation, after seeing a disappointing test score, is to just tweak one thing and re-run. As soon as we do, the test set is no longer measuring generalization — it’s measuring how well we tuned to that particular test set. After a few rounds of this, the test number is meaningless.
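The four steps above can be sketched with scikit-learn. This is a minimal illustration rather than a prescribed recipe: the synthetic data from `make_classification`, the logistic-regression model, and the grid of `C` values are all assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1. Carve off the test set first and set it aside.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 2. Tune on the remaining 80% only. The scaler lives inside the pipeline,
#    so each cross-validation fold fits it on that fold's training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_dev, y_dev)

# 3. Open the drawer exactly once: score the finalized model and report it.
test_accuracy = search.score(X_test, y_test)
print(f"best C: {search.best_params_['logisticregression__C']}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Everything that involves a choice (here, the value of `C`) happens before the single call that touches `X_test`. Whatever accuracy that call prints is the number to report.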
Choosing the size
The right test size depends on how much data is available:
- With very little data, we want to give the model as much as possible to learn from, so we keep the test set small (e.g., 15%).
- With abundant data, we can afford a larger test set (e.g., 30%), which gives a more precise estimate of generalization.
The standard error of the test-set estimate scales roughly as $1/\sqrt{n}$, where $n$ is the number of test examples (the variance itself scales as $1/n$), so doubling the test size cuts the uncertainty by a factor of $\sqrt{2}$.
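For a classification accuracy, this scaling can be made concrete with the binomial standard error, sqrt(p(1 - p) / n), which assumes the test examples are independent. The 90% accuracy and the two test sizes below are hypothetical values picked for illustration.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Hypothetical numbers: a model that is 90% accurate, two test-set sizes.
se_1000 = accuracy_standard_error(0.90, 1000)
se_2000 = accuracy_standard_error(0.90, 2000)

print(f"n_test=1000: SE = {se_1000:.4f}")       # n_test=1000: SE = 0.0095
print(f"n_test=2000: SE = {se_2000:.4f}")       # n_test=2000: SE = 0.0067
print(f"ratio: {se_1000 / se_2000:.3f}")        # ratio: 1.414
```

With 1,000 test examples the reported accuracy carries an uncertainty of roughly one percentage point. Doubling to 2,000 shrinks that by a factor of about 1.414, not by half.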
Data leakage and the test set
The most common ways the test set gets contaminated:
- Preprocessing on the whole dataset before splitting. The mean and standard deviation used for normalization are computed from the entire dataset, including the test portion, so the test set has influenced the preprocessing. See Data leakage.
- The same example appearing in both training and test sets. This happens when the dataset contains duplicates, or when time-series data is split randomly even though consecutive samples are practically the same recording.
- Repeated tuning against the test set. Every time we look at the test number and adjust something, we leak a little bit.
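The second item in the list, shared examples, can be checked mechanically when the duplicates are exact. The helper below is a hypothetical sketch (the name `count_shared_rows` and the toy arrays are assumptions), and it only catches rows that match value for value.

```python
import numpy as np


def count_shared_rows(X_train: np.ndarray, X_test: np.ndarray) -> int:
    """Count test rows that also appear, value for value, in the training rows."""
    train_rows = {tuple(row) for row in X_train.tolist()}
    return sum(tuple(row) in train_rows for row in X_test.tolist())


# Toy arrays for illustration: the row [3.0, 4.0] is in both splits.
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[3.0, 4.0], [7.0, 8.0]])

n_shared = count_shared_rows(X_train, X_test)
print(f"{n_shared} of {len(X_test)} test rows also appear in training")
```

Near-duplicates, such as consecutive samples from one recording, will not be caught by an exact comparison. For those, the usual remedy is to split by group or by time instead of at random; scikit-learn provides `GroupShuffleSplit` and `TimeSeriesSplit` in `sklearn.model_selection` for this purpose.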
The fix for the first of these (preprocessing before splitting): split first, fit preprocessing on the training set only, then transform the test set using the training-fitted parameters. In scikit-learn, this is the difference between fit_transform (training only) and transform (test data).
```python
from sklearn.model_selection import train_test_split

# X and y are the full feature matrix and labels, loaded beforehand.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```
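A fuller sketch of the split-then-fit order, continuing with `StandardScaler`. The random data standing in for `X` and `y` is an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, before any statistic is computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print("train means after scaling:", X_train_scaled.mean(axis=0))
print("test means after scaling: ", X_test_scaled.mean(axis=0))
```

The scaled training columns have mean zero up to floating-point error, while the scaled test columns are close to zero but not exactly zero. That small offset is expected: it shows the test rows did not contribute to the fitted statistics.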