StandardScaler

StandardScaler is the scikit-learn class that implements Normalization in its standardization (z-score) form: it subtracts the mean and divides by the standard deviation, column by column, so each feature ends up with mean 0 and standard deviation 1.
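Written out for a single column, the transformation looks roughly like this. The symbols are notation chosen here for illustration, not taken from the scikit-learn docs: $x_{ij}$ is the value in row $i$ of column $j$, $N$ is the number of rows seen during fitting, and $\mu_j$ and $\sigma_j$ are that column's mean and standard deviation.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2}
$$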

One detail that bites people: StandardScaler divides by the population standard deviation (ddof=0, divisor $N$), not the sample one. The scikit-learn docs are explicit: “We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0).” This matters if you compare StandardScaler’s output to scaling done by hand in Pandas, whose .std() defaults to ddof=1 (divisor $N - 1$, Bessel’s correction). The two results differ by a constant factor of $\sqrt{(N - 1)/N}$, which is noticeable for small samples; for large $N$ the difference is negligible and is unlikely to affect model behaviour.

A separate point of confusion: StandardScaler scales to mean 0 and (population) standard deviation 1, not to the range $[0, 1]$. That is what MinMaxScaler does.
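A minimal sketch of the ddof mismatch, assuming a made-up four-row column named x (the data and variable names are hypothetical). It scales the same column three ways and checks which ones agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a single feature with N = 4 rows.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
n = len(df)

# StandardScaler divides by the population standard deviation (ddof=0).
sk_scaled = StandardScaler().fit_transform(df[["x"]]).ravel()

# Manual pandas scaling: .std() defaults to the sample standard deviation (ddof=1).
pd_scaled = ((df["x"] - df["x"].mean()) / df["x"].std()).to_numpy()

# Passing ddof=0 to pandas reproduces StandardScaler's result.
pd_scaled_ddof0 = ((df["x"] - df["x"].mean()) / df["x"].std(ddof=0)).to_numpy()

# The two conventions differ by a constant factor of sqrt((N - 1) / N).
ratio = pd_scaled / sk_scaled

print(np.allclose(sk_scaled, pd_scaled_ddof0))     # do the ddof=0 versions agree?
print(np.allclose(ratio, np.sqrt((n - 1) / n)))    # is the ratio the expected constant?
```

If a hand-rolled pandas pipeline has to match StandardScaler exactly, passing ddof=0 to .std() is the simplest fix.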

The standard usage:

from sklearn.preprocessing import StandardScaler
 
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

The .fit(X) call measures the per-column means and standard deviations from X and stores them on the scaler object. The .transform(X) call applies the stored parameters to scale X. The .fit_transform(X) call does both at once, fitting the scaler and transforming the same X. That is convenient for the data you fit on, but dangerous once you have separate training and test sets, because calling it on the wrong one silently re-fits the scaler.
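To make the fit/transform split concrete, here is a small sketch on a hypothetical 3×2 matrix. It reads the fitted parameters from the scaler's mean_ and scale_ attributes and checks that transform is the same as applying them by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 3 rows, 2 feature columns.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

sc = StandardScaler()
sc.fit(X)

# The learned parameters are stored on the scaler as attributes.
means = sc.mean_     # per-column means
scales = sc.scale_   # per-column population standard deviations

# transform applies the stored parameters: (X - mean_) / scale_
manual = (X - means) / scales
X_scaled = sc.transform(X)

# fit_transform on a fresh scaler gives the same result in one call.
X_scaled_one_call = StandardScaler().fit_transform(X)

print(np.allclose(X_scaled, manual))
print(np.allclose(X_scaled, X_scaled_one_call))
```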

The discipline that matters: fit_transform on the training set only, then transform on the test set. Never fit_transform on the test set (it would re-fit the scaler to the test data, so the test features would be scaled with statistics the model never saw during training) and never fit_transform on the entire dataset before the train/test split (that lets test rows influence the scaler's parameters, and it is harder to notice). This is the canonical example of Data leakage, and the reason scikit-learn keeps fit and transform as separate operations.
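The sketch below contrasts the correct order with the leaky one on synthetic data. The random data, the split size, and the variable names are all assumptions made for illustration; the point is only that a scaler fit on all rows stores different parameters from one fit on the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Correct: the parameters come from the training rows only.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Leaky: fitting on all rows lets test rows influence the stored mean and scale.
leaky = StandardScaler().fit(X)

# The correctly fit scaler matches the training statistics exactly...
train_only = np.allclose(sc.mean_, X_train.mean(axis=0))
# ...while the leaky one has absorbed information from the test rows.
mean_shift = leaky.mean_ - sc.mean_

print(train_only)
print(mean_shift)
```

With the correct order, the scaled test set will generally not have mean exactly 0 or standard deviation exactly 1. That is expected: it was scaled with the training set's parameters.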

Inside a scikit-learn pipeline, this happens automatically:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
 
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)             # scaler fits on X_train, then transforms it
y_pred = clf.predict(X_test)          # scaler transforms X_test with stored params

The pipeline’s .fit() fits the scaler on the training data and passes the scaled data to the classifier. The pipeline’s .predict() transforms the test data with the already-fit scaler and passes the result to the classifier. As long as you only ever pass training data to .fit(), test data cannot leak into the scaler’s parameters.
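One way to check this is to pull the fitted scaler back out of the pipeline. The sketch below assumes synthetic data from make_classification (hypothetical, for illustration only) and compares the scaler's stored means with the training-set means; it also runs the pipeline through cross_val_score, which re-fits the whole pipeline on each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
fitted_scaler = clf.named_steps["standardscaler"]

# The stored means match the training rows, not the full dataset.
uses_train_stats = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))

# In cross-validation the whole pipeline is re-fit per fold, so the scaler
# only ever sees that fold's training portion.
scores = cross_val_score(clf, X_train, y_train, cv=5)

print(uses_train_stats)
print(scores.mean())
```

Scaling the data once up front and then cross-validating only the classifier would reintroduce the leak on a smaller scale, since each fold's validation rows would already have shaped the scaler.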

For scaling beyond zero-mean/unit-variance, scikit-learn has MinMaxScaler (scales to a fixed range like $[0, 1]$ ), RobustScaler (uses median and interquartile range, holds up better against outliers), and MaxAbsScaler (scales by the maximum absolute value). StandardScaler is the right default for most cases; the alternatives address specific problems.
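As a rough comparison, the sketch below runs all four scalers with their default settings on a hypothetical single feature containing one large outlier. The data is invented purely to show how differently each scaler reacts.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical single feature with one large outlier (the last value).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(X).ravel(),
    "minmax": MinMaxScaler().fit_transform(X).ravel(),
    "robust": RobustScaler().fit_transform(X).ravel(),
    "maxabs": MaxAbsScaler().fit_transform(X).ravel(),
}

# Print each scaler's output side by side for comparison.
for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 2)}")
```

In the printed output, the four ordinary values should end up squeezed close together under StandardScaler, MinMaxScaler, and MaxAbsScaler, because the outlier dominates the mean, range, and maximum. RobustScaler keeps them spread out, since the median and interquartile range barely react to a single extreme value.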

Idriss Rami — Notes
