Variance is the mean squared deviation from the mean, a measure of how spread out a set of values is. For values $x_1, \dots, x_n$ with mean $\mu$:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

Each term in the sum is the squared deviation of one value from the mean. Squaring makes every deviation non-negative (so values above and below the mean don’t cancel) and amplifies large deviations more than small ones. Dividing by $n$, the number of values, takes the average.
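As a minimal sketch of the definition, the function below computes the population variance step by step. The name `population_variance` and the data are illustrative, not taken from any library.

```python
def population_variance(values):
    """Mean squared deviation from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    # Squared deviations are never negative, so values above and
    # below the mean cannot cancel each other out.
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / n


readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data, mean 5.0
variance = population_variance(readings)
print(variance)  # 4.0
```

The single value 9.0 contributes 16 of the 32 units of total squared deviation, which shows how squaring weights large deviations more heavily.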

The square root of the variance is the Standard deviation — same information, but in the original units of the data instead of squared units. If the data is voltages, the variance is in V², and the standard deviation is in V.
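A short sketch of the units point, using the standard library's `statistics.pvariance` and `statistics.pstdev` (both use the population divisor). The voltage readings are hypothetical.

```python
import math
import statistics

voltages = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical readings, in volts

variance_v2 = statistics.pvariance(voltages)  # population variance, in V^2
std_dev_v = statistics.pstdev(voltages)       # population standard deviation, in V

print(variance_v2)  # 2.0
print(math.isclose(std_dev_v, math.sqrt(variance_v2)))  # True
```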

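The formula above is the population variance, with divisor $n$. When the dataset is a sample used to estimate the variance of a larger population, the unbiased estimator divides by $n - 1$ instead and uses the sample mean $\bar{x}$ in place of $\mu$. This is Bessel’s correction:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$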

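The intuition: the sample mean $\bar{x}$ is itself computed from the sample and is closer (on average) to the sample values than the true population mean $\mu$ is, so the divisor-$n$ estimate $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ systematically underestimates $\sigma^2$, on average by a factor of $(n-1)/n$. Dividing by $n - 1$ instead of $n$ corrects the bias. The difference matters most for small samples; for large $n$ it’s negligible.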
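The bias can be checked exactly on a toy case. The sketch below assumes a hypothetical three-value population and enumerates every equally likely sample of size 2 drawn with replacement; exact fractions avoid rounding noise. The helper `variance` is illustrative, not a library function.

```python
from fractions import Fraction
from itertools import product


def variance(values, ddof):
    """Squared deviations from the values' own mean, divided by (n - ddof)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


population = [2, 4, 6]                 # hypothetical population
sigma2 = variance(population, ddof=0)  # true population variance

# Every equally likely sample of size 2, drawn with replacement (9 in total).
samples = list(product(population, repeat=2))

mean_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)
mean_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(sigma2, mean_biased, mean_unbiased)  # 8/3 4/3 8/3
```

Averaged over all samples, the divisor-$n$ estimate comes out at half the true variance, which is the factor $(n-1)/n$ with $n = 2$, while the divisor-$(n-1)$ estimate matches it exactly.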

The choice of divisor is one of the best-known gotchas in numerical Python, because the major libraries default to different conventions:

  • numpy.var(arr) defaults to ddof=0 (population, divisor $n$). Pass ddof=1 for the sample form.
  • pandas.Series.var() defaults to ddof=1 (sample, divisor $n - 1$). Pass ddof=0 to match NumPy’s default.
  • scikit-learn’s StandardScaler internally uses ddof=0.
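The differing defaults listed above are easy to see on a small array. This sketch assumes NumPy and pandas are installed and uses illustrative data; scikit-learn is left out only to keep the dependencies small.

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]  # illustrative values, mean 2.5

np_default = np.var(data)                    # ddof=0, divisor n
np_sample = np.var(data, ddof=1)             # divisor n - 1
pd_default = pd.Series(data).var()           # ddof=1, divisor n - 1
pd_population = pd.Series(data).var(ddof=0)  # divisor n

print(np_default, pd_default)
print(np.isclose(np_default, pd_population), np.isclose(np_sample, pd_default))
```

Here the sum of squared deviations is 5, so the two defaults give 5/4 and 5/3. Passing `ddof` explicitly in both libraries removes the ambiguity.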

Seemingly identical code can give different numbers depending on which library computes the variance or scales the data. For Introduction to Data Science the divisor-by-$n$ form is used throughout, but production code should be explicit about which convention is in use.

In machine learning and dimensionality reduction, variance is often interpreted as a proxy for information content. A dataset with high variance carries a lot of information — the points are spread out, and their positions are meaningfully distinct. A dataset with low variance carries little — the points are clumped, indistinguishable from one another. This is why Principal Component Analysis tries to project onto directions that maximize variance: those are the directions that preserve the most information.
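To make the variance-maximizing direction concrete, the sketch below finds it for a tiny hypothetical 2-D dataset by taking the eigenvectors of the covariance matrix. This is one common way to compute principal components, not necessarily what any particular library does internally.

```python
import numpy as np

# Hypothetical, already-centered data: wide along the diagonal (1, 1),
# narrow along the anti-diagonal (1, -1).
X = np.array([[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]])

cov = np.cov(X, rowvar=False, bias=True)         # bias=True uses divisor n
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

first_pc = eigenvectors[:, -1]       # direction with the largest eigenvalue
var_along_pc = np.var(X @ first_pc)  # variance of the data projected onto it
var_along_x = np.var(X[:, 0])        # variance along the original x-axis

print(eigenvalues)
print(var_along_pc, var_along_x)
```

The variance of the projection onto the first principal component equals the largest eigenvalue of the covariance matrix, and it exceeds the variance along either original axis.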

The intuition behind treating variance as information: two points are easier to tell apart along a direction in which the data is broadly spread than along one in which it is tightly concentrated. High variance makes distinctions visible.

For higher moments of the distribution beyond variance, see Skewness (third moment, measures asymmetry) and Kurtosis (fourth moment, measures tail heaviness). Mean, variance, skewness, and kurtosis together describe the first four moments — a remarkably compact summary of a distribution’s shape.