Missing data

Missing data are gaps in a dataset: places where a value should exist but doesn’t. For example, a sensor that’s supposed to produce one sample every 10 ms occasionally fails to deliver one, and the resulting record has a hole where a number should be.
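A minimal sketch of how such a hole can be made explicit in pandas, assuming the record carries timestamps. The start time, the values, and the names record, expected, and aligned are invented for illustration. The idea is to rebuild the 10 ms grid the sensor should have followed and align the record to it, so the undelivered sample shows up as NaN.

```python
import pandas as pd

# The grid a 10 ms sensor is supposed to follow (start time is arbitrary).
expected = pd.date_range("2024-01-01", periods=5, freq="10ms")

# Hypothetical record: the third sample (t = 20 ms) was never delivered.
record = pd.Series([0.51, 0.48, 0.55, 0.53], index=expected[[0, 1, 3, 4]])

# Aligning to the expected grid turns the silent hole into an explicit NaN.
aligned = record.reindex(expected)

# Timestamps at which a sample should exist but does not.
missing_times = aligned.index[aligned.isna()]
print(aligned)
print(missing_times)
```

Without the reindex step the record simply has four rows, and nothing in it says that a fifth was expected.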

Causes are mundane and many. The sensor itself becomes faulty (dirt under a MEMS spring, mechanical shock, hardware failure). The electronics that sample and digitize glitch. The memory module briefly fails to write. The Bluetooth chip drops a packet. The recording stops and starts again, leaving a gap. Each cause produces the same effect: a sample that should be there isn’t.

Three conventions are used to make missing values visible: leave the field blank, write NULL (or NaN, "not a number", in numerical contexts), or write a long dash (—). To a human reader the conventions are interchangeable; what matters is that the absence is recorded, so downstream code can see the gap rather than reading the next sample as if it were the missing one.
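When the data is loaded into code, the three markers usually need to be mapped onto a single representation. A small sketch with pandas, assuming the record arrives as CSV text; the column names t_ms and accel and the values are made up. pandas.read_csv already treats blank fields and NULL as missing by default, and the long dash can be added through the na_values parameter.

```python
import io
import pandas as pd

# Hypothetical CSV using all three conventions: blank, NULL, and a long dash.
raw = io.StringIO(
    "t_ms,accel\n"
    "0,0.51\n"
    "10,\n"
    "20,NULL\n"
    "30,\u2014\n"
    "40,0.53\n"
)

# Blank and NULL are recognised by default; the long dash is declared explicitly.
df = pd.read_csv(raw, na_values=["\u2014"])

print(df)
print(df["accel"].dtype)
```

If the dash is not declared, pandas has no way to know it means "missing" and reads the whole column as text rather than numbers.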

Two broad strategies for dealing with the gaps:

Deletion discards rows or columns containing missing values entirely. Fast, no computation beyond filtering. The cost is loss of data (sometimes a lot of it) and desynchronization when paired channels are involved: if we have an ECG channel and an EEG channel sampled in lockstep, deleting only the ECG samples that are missing leaves the two channels misaligned in time.
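A sketch of both forms of deletion, plus the desynchronization trap, using pandas. The ECG and EEG values and the t_ms index are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ECG and EEG channels sampled in lockstep every 10 ms.
df = pd.DataFrame(
    {"ecg": [0.9, np.nan, 1.1, 1.0], "eeg": [12.0, 11.5, 11.8, 12.2]},
    index=pd.Index([0, 10, 20, 30], name="t_ms"),
)

# Row deletion: the whole time step goes, so the channels stay aligned.
rows_kept = df.dropna(axis=0)

# Column deletion: any channel containing a gap is dropped entirely.
cols_kept = df.dropna(axis=1)

# The trap: deleting from one channel only and then pairing by position.
ecg_only = df["ecg"].dropna().to_numpy()
eeg_all = df["eeg"].to_numpy()
# ecg_only[1] was recorded at 20 ms, but eeg_all[1] was recorded at 10 ms.

print(rows_kept)
print(cols_kept)
print(len(ecg_only), len(eeg_all))
```

Row deletion keeps the pairing intact at the price of throwing away a valid EEG sample; the last two arrays show what happens when only the ECG gap is removed and the timestamps are ignored.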

Imputation replaces missing values with estimates. Preserves the structure of the dataset and keeps signals aligned. Downstream results are typically more accurate than with deletion, provided the imputation method suits the signal. The cost is computation: imputation uses CPU, memory, and battery, and can introduce latency in real-time systems.

For most datasets, imputation is the right default. The remaining question is how to impute: zero-replacement imputation, sample-and-hold imputation, linear interpolation, or, for smoother fills, non-linear interpolation. More sophisticated methods (expectation-maximization (EM), k-nearest neighbours (kNN), model-based imputation) exist but are usually deferred to later courses.
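A quick comparison of the first three methods on the same gap, as a sketch in pandas. The series s is a toy example, and the mapping of each method onto fillna, ffill, and interpolate is one reasonable choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Toy signal with a two-sample gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

# Zero replacement: every missing sample becomes 0.
zero_filled = s.fillna(0.0)               # 1, 0, 0, 4, 5

# Sample-and-hold: repeat the last valid sample until a new one arrives.
held = s.ffill()                          # 1, 1, 1, 4, 5

# Linear interpolation: straight line between the samples around the gap.
linear = s.interpolate(method="linear")   # 1, 2, 3, 4, 5

print(pd.DataFrame({"zero": zero_filled, "hold": held, "linear": linear}))
```

For non-linear interpolation, interpolate also accepts methods such as "polynomial" or "spline" together with an order argument; those rely on SciPy being installed, so they are left out of this sketch.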

In pandas, df.isna() produces a Boolean DataFrame of the same shape as the original, with True wherever a value is missing. df.isna().sum() gives a per-column count of missing values, which is much more useful than scrolling through a huge Boolean table.
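A short illustration of both calls on a made-up two-channel DataFrame. The column names and values are invented; the last line adds the per-column fraction of missing values, which is not in the notes above but uses the same Boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical two-channel recording with a few gaps.
df = pd.DataFrame(
    {"ecg": [0.9, np.nan, 1.1, np.nan], "eeg": [12.0, 11.5, np.nan, 12.2]}
)

mask = df.isna()              # Boolean DataFrame, same shape as df
per_column = df.isna().sum()  # number of missing values in each column
fraction = df.isna().mean()   # share of missing values in each column

print(mask)
print(per_column)
print(fraction)
```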

Idriss Rami — Notes
