Zero-replacement imputation

Zero-replacement imputation is the simplest possible scheme for handling missing data: every missing value is replaced with $0$. It’s fast and trivial to implement, and almost always the wrong choice.

The problem is that 0 is rarely a plausible estimate of what the missing value would have been. If the surrounding samples are all in the 200-400 range, dropping a 0 in among them creates an artifact more disruptive than the original gap was: a downward spike where the data should have been smooth. Any downstream model has to either learn that 0 means missing (and treat it specially) or just be confused by the artifact.
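To get a feel for the size of the artifact, here is a minimal sketch using made-up readings in the 200-400 range (the values and variable names are illustrative assumptions, not from a real dataset). It zero-fills a single gap and compares simple statistics before and after. It also keeps a separate mask, which is one way a downstream model could still be told that the 0 means missing.

```python
import numpy as np
import pandas as pd

# Hypothetical readings in the 200-400 range with one missing sample.
signal = pd.Series([310.0, 295.0, np.nan, 305.0, 290.0])

# Record which samples were missing before that information is overwritten.
was_missing = signal.isna()

zero_filled = signal.fillna(0)

# Mean of the observed samples versus mean after zero-filling.
observed_mean = signal.mean()          # NaN is skipped by default
filled_mean = zero_filled.mean()

# Largest jump between neighboring samples, before and after filling.
max_step_observed = signal.dropna().diff().abs().max()
max_step_filled = zero_filled.diff().abs().max()

print(observed_mean, filled_mean)            # 300.0 240.0
print(max_step_observed, max_step_filled)    # 15.0 305.0
```

With these numbers, one zero pulls the mean from 300 down to 240 and turns a largest sample-to-sample step of 15 into one of 305.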

On a scatter plot, zero-replacement moves every missing point down onto the x-axis ($y = 0$). For a signal that’s already centered around zero (for example, one normalized to zero mean), this is less catastrophic, because 0 lies within the data’s natural range. For raw signals that aren’t centered, it creates a clearly artificial value.
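One way to see why centering helps: if the signal is standardized with the mean and standard deviation of its observed samples, filling with 0 in normalized units is equivalent to filling with the observed mean in raw units. The sketch below assumes z-score normalization and made-up values. The notes do not say which normalization is meant, so treat this as an illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw signal with one missing sample.
raw = pd.Series([310.0, 295.0, np.nan, 305.0, 290.0])

# Statistics are computed from the observed samples only (NaN is skipped).
mean = raw.mean()
std = raw.std()

# Standardize first, then zero-fill in normalized units.
normalized_filled = ((raw - mean) / std).fillna(0)

# Map the result back to raw units for comparison.
back_in_raw_units = normalized_filled * std + mean

# Filling the raw signal with its observed mean gives the same series.
mean_filled = raw.fillna(mean)

print(np.allclose(back_in_raw_units, mean_filled))  # True
```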

When zero-replacement is actually OK:

The missing rate is very low and any downstream model is robust to the occasional artifact.
The signal is already normalized to zero mean.
0 carries the same semantic meaning as missing in the application (a count of events that didn’t happen, for instance).
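The event-count case can be sketched as follows. The event log, column name, and date range are hypothetical. Days with no logged events are absent from the grouped result, and 0 is the true count for them, so filling with 0 is correct here rather than an approximation.

```python
import pandas as pd

# Hypothetical event log: one row per event, with the day it occurred.
events = pd.DataFrame(
    {"day": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-03"])}
)

# Count events per day; days without events do not appear at all.
daily_counts = events.groupby("day").size()

# Reindex onto the full calendar. A day with no events really had 0 events.
all_days = pd.date_range("2024-01-01", "2024-01-04", freq="D")
daily_counts = daily_counts.reindex(all_days, fill_value=0)

print(daily_counts.tolist())  # [2, 0, 1, 0]
```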

When it isn’t:

The signal has a meaningful baseline far from 0.
The downstream model is sensitive to outliers.
Multiple consecutive samples are missing, producing a long flat segment of zeros that looks like nothing in the real signal.
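Two of the conditions above, the missing rate and the length of consecutive gaps, are easy to measure before deciding. The helper below is a hypothetical sketch (its name is not part of pandas), and what counts as "low" or "long" is left to the application.

```python
import numpy as np
import pandas as pd

def longest_missing_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaNs (0 if none)."""
    missing = series.isna()
    # A new run id starts whenever the missing/not-missing state changes.
    run_id = (missing != missing.shift()).cumsum()
    # Summing the boolean mask per run gives the NaN count of each run
    # (runs of valid samples sum to 0).
    run_lengths = missing.groupby(run_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

# Hypothetical signal with a three-sample gap and a one-sample gap.
signal = pd.Series([310.0, np.nan, np.nan, np.nan, 295.0, np.nan, 305.0])

missing_rate = signal.isna().mean()      # fraction of samples missing
longest_gap = longest_missing_run(signal)

print(longest_gap)  # 3
```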

In pandas:

# df is an existing DataFrame; fillna returns a new object and leaves df unchanged
df.fillna(0)                         # all NaN → 0
df.fillna({'col1': 0, 'col2': 1.5})  # different fill value per column

Better alternatives: sample-and-hold imputation (repeats the previous valid value), linear interpolation (draws a straight line between the nearest valid neighbors, which for a single missing sample is the average of the two), or non-linear interpolation (fits a smooth curve through several neighbors).
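A short sketch of the first two alternatives in pandas, using a made-up signal. Non-linear interpolation is not shown: pandas exposes it through `interpolate` with methods such as `"polynomial"` or `"spline"`, which need an `order` argument and SciPy installed.

```python
import numpy as np
import pandas as pd

# Hypothetical signal with one missing sample.
signal = pd.Series([310.0, 295.0, np.nan, 305.0, 290.0])

# Sample-and-hold: carry the previous valid value forward.
held = signal.ffill()

# Linear interpolation: straight line between the valid neighbors.
linear = signal.interpolate(method="linear")

print(held.iloc[2])    # 295.0
print(linear.iloc[2])  # 300.0
```

Both filled values stay inside the range of the neighboring samples, unlike the 0 produced by zero-replacement.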

Idriss Rami — Notes
