Zero-replacement imputation is the simplest possible scheme for handling Missing data: every missing value gets replaced with 0. It’s fast and trivial to implement, and almost always the wrong choice.
The problem is that 0 is rarely a plausible estimate of what the missing value would have been. If the surrounding samples are all in the 200-400 range, dropping a 0 in among them creates an artifact more disruptive than the original gap was — a downward spike where the data should have been smooth. Any downstream model has to either learn that 0 means missing (and treat it specially) or just be confused by the artifact.
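A minimal sketch of that artifact, using made-up readings in the 200-400 range (the values and variable names are illustrative assumptions, not data from any real signal). It compares summary statistics of the observed samples with those of the zero-filled series:

```python
import numpy as np
import pandas as pd

# Hypothetical readings in the 200-400 range with one missing sample.
signal = pd.Series([250.0, 310.0, 290.0, np.nan, 330.0, 270.0])

zero_filled = signal.fillna(0)

# pandas skips NaN by default, so these describe the observed values only.
observed_mean = signal.mean()
observed_std = signal.std()

# After zero-filling, the artificial 0 is counted like any other sample.
filled_mean = zero_filled.mean()
filled_std = zero_filled.std()

print(f"observed:    mean={observed_mean:.1f}, std={observed_std:.1f}")
print(f"zero-filled: mean={filled_mean:.1f}, std={filled_std:.1f}")
```

A single zero pulls the mean down and inflates the standard deviation several times over, which is the kind of distortion a downstream model then has to absorb.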
Visually, on a scatter plot of the data, every point whose value is missing is moved down onto the x-axis. For a signal that’s already centered around zero (for example, one normalized to zero mean), zero-replacement is less catastrophic, because 0 lies within the data’s natural range. For raw signals that aren’t centered, zero-replacement creates a clearly artificial value.
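One way to see why centering helps: filling with 0 after standardization is the same as filling with the observed mean in the original units. The sketch below assumes a small made-up series and standardizes it with statistics of the observed values only; the names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings with one missing sample at position 2.
raw = pd.Series([250.0, 310.0, np.nan, 330.0, 270.0])

# Standardize using statistics of the observed values (NaN is skipped).
mean, std = raw.mean(), raw.std()
normalized = (raw - mean) / std

# Zero-fill in normalized space, then map back to the original units.
restored = normalized.fillna(0) * std + mean

# The filled sample lands on the observed mean rather than on 0.
print(restored.tolist())
```

In other words, zero-replacement on a zero-mean signal behaves like mean imputation, which is crude but at least stays inside the range of the data.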
When zero-replacement is actually OK:
- The missing rate is very low and any downstream model is robust to the occasional artifact.
- The signal is already normalized to zero mean.
- 0 carries the same semantic meaning as missing in the application (a count of events that didn’t happen, for instance).
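The last case can be sketched with event counts. The example assumes a hypothetical event log in which days without events simply have no rows, so a missing day really does mean a count of 0:

```python
import pandas as pd

# Hypothetical event timestamps: nothing was logged on Jan 2 or Jan 4.
events = pd.Series(
    pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-03"])
)

# Count events per calendar day; days without events are absent here.
counts = events.dt.floor("D").value_counts().sort_index()

# Reindex onto the full date range; an absent day is a true zero count.
all_days = pd.date_range("2024-01-01", "2024-01-04", freq="D")
daily = counts.reindex(all_days, fill_value=0)

print(daily)
```

Here the zero is not an estimate of an unknown value but the correct value, which is what makes zero-replacement legitimate.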
When it isn’t:
- The signal has a meaningful baseline far from 0.
- The downstream model is sensitive to outliers.
- Multiple consecutive samples are missing, producing a long flat segment of zeros that looks like nothing in the real signal.
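Before zero-filling, it can help to check how long the gaps actually are. The helper below is a hypothetical sketch (the function name is not part of pandas) that reports the longest run of consecutive missing values in a series:

```python
import numpy as np
import pandas as pd

def longest_missing_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaN values."""
    is_missing = series.isna()
    # The counter increases at every valid value, so consecutive NaNs
    # (and the valid value just before them) share one group id.
    group_id = (~is_missing).cumsum()
    # Summing the boolean mask per group counts the NaNs in each run.
    runs = is_missing.groupby(group_id).sum()
    return int(runs.max()) if len(runs) else 0

signal = pd.Series([310.0, np.nan, np.nan, 295.0, np.nan, 305.0])
print(longest_missing_run(signal))
```

A long run is a hint that zero-filling would produce the flat segment of zeros described above.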
In pandas:

    df.fillna(0)                          # all NaN → 0
    df.fillna({'col1': 0, 'col2': 1.5})   # different fill per column

For better alternatives, see Sample-and-hold imputation (repeats the previous valid value), Linear interpolation (draws a straight line between the two neighboring valid values), or Non-linear interpolation (fits a smooth curve through several neighbors). Imputation gives the bigger picture of when each method applies.
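For a rough side-by-side comparison, the sketch below applies zero-replacement, sample-and-hold, and linear interpolation to the same made-up series with a two-sample gap. It uses the standard pandas methods `fillna`, `ffill`, and `interpolate`; the data and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical signal with two consecutive missing samples.
signal = pd.Series([300.0, 320.0, np.nan, np.nan, 380.0])

comparison = pd.DataFrame({
    "zero": signal.fillna(0),                       # zero-replacement
    "hold": signal.ffill(),                         # sample-and-hold
    "linear": signal.interpolate(method="linear"),  # linear interpolation
})

print(comparison)
```

Only the zero-filled column leaves the range of the observed values; the other two stay between the neighboring valid samples.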