Feature extraction

Feature extraction converts a long raw signal into a handful of summary measurements (features) that a model can actually consume. Instead of feeding 30,000 raw ECG samples per minute to a classifier, we extract a few numbers per heartbeat: the maximum, minimum, mean, standard deviation, skewness, and kurtosis.

Signals are long. A one-minute ECG at 500 Hz contains 30,000 samples. A one-second window of accelerometer data at 200 Hz from 12 sensors is 2,400 samples. Learning directly from this many numbers per example is possible (convolutional and recurrent networks do it) but it’s expensive, needs far more data than we usually have, and ignores domain knowledge about what’s actually informative.

For many problems we already know what to look at. In an ECG, the heartbeat has a characteristic shape with a peak (the R wave) and a trough; the maximum and minimum within each beat tell us most of what we need. In an accelerometer signal during walking, the rhythm of foot strikes produces regular peaks, and the average peak height plus the variability between peaks is diagnostic. Feature extraction encodes this domain knowledge directly into the data the model sees.

The window idea

We divide the signal into windows and compute features within each. If we choose a window size roughly the length of one event (one heartbeat, one footstep), each window contains roughly one event, and the features describe that event.
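
As a minimal sketch of the window idea, the snippet below cuts a signal into non-overlapping, fixed-length windows and computes the maximum, minimum, and mean of each. The function name window_features and the random test signal are illustrative assumptions, not part of these notes. Real events such as heartbeats vary in length, so fixed windows only roughly line up with them.

```python
import numpy as np

def window_features(signal, window_size):
    """Split a 1-D signal into non-overlapping windows and summarise each one.

    Hypothetical helper for illustration; returns one row per window with
    columns (max, min, mean).
    """
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window_size  # drop any incomplete trailing window
    windows = signal[: n_windows * window_size].reshape(n_windows, window_size)
    return np.column_stack(
        [windows.max(axis=1), windows.min(axis=1), windows.mean(axis=1)]
    )

# Synthetic stand-in for a real recording: 8 windows of 125 samples, plus 10 leftovers.
rng = np.random.default_rng(0)
signal = rng.normal(size=8 * 125 + 10)

features = window_features(signal, window_size=125)
print(features.shape)  # (8, 3)
```

Each row of features describes one window, so 1,010 raw samples become 8 rows of 3 numbers.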

For an ECG with eight heartbeats, draw eight pink rectangles over the signal, one per beat. Within each rectangle, mark the maximum value with a star at the top of the R wave, the minimum with a dot near the bottom, the mean with a triangle in the middle. Those three numbers per beat are a far more compact representation than the raw waveform, and they preserve the information most cardiac classifiers actually use.

Statistical features

A typical feature set includes four statistics derived from the first four moments of the signal’s distribution within each window:

Mean, average value.
Standard deviation, how spread out the values are.
Skewness, asymmetry of the distribution (third moment).
Kurtosis, tail heaviness (fourth moment).

The maximum and minimum are often added as well. For ECG, the maximum captures R-wave height and the minimum captures the deepest trough.
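
For reference, the four moment-based statistics can be written out for a window of $N$ samples $x_1, \dots, x_N$. These are the plain population definitions, given here as an assumption about what is meant; libraries often apply a bias correction (dividing by $N - 1$ for the standard deviation, for example) and many report excess kurtosis, which subtracts 3 so that a normal distribution scores 0.

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$

$$\text{skewness} = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^{3}, \qquad \text{kurtosis} = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^{4}$$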

Implementation in Pandas

The Pandas rolling method makes this a few lines:

import pandas as pd

# `data` is assumed to be an already-loaded DataFrame with a numeric 'signal' column.
window_size = 125  # samples per window

# Build the rolling window once and reuse it for every statistic.
roll = data['signal'].rolling(window_size)
features = pd.DataFrame({
    'mean':     roll.mean(),
    'std':      roll.std(),
    'max':      roll.max(),
    'kurtosis': roll.kurt(),  # pandas reports excess kurtosis (normal = 0)
    'skew':     roll.skew(),
})

Pick a column, call rolling(window_size), and chain on the statistical method. Each call returns a Series the same length as the source, with the first $N - 1$ values set to NaN, where $N$ is window_size, because those positions do not yet have a full window behind them. The resulting features DataFrame has one row per sample position and one column per feature.
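
The leading NaN values are easy to check on a toy example. The sketch below uses a synthetic random signal as a stand-in for data['signal'] (an assumption for illustration) and drops the incomplete windows with dropna before any modelling step.

```python
import numpy as np
import pandas as pd

window_size = 125

# Synthetic stand-in for the real recording: 1,000 random samples.
data = pd.DataFrame({'signal': np.random.default_rng(0).normal(size=1000)})

rolling_mean = data['signal'].rolling(window_size).mean()

print(len(rolling_mean))          # 1000, same length as the source
print(rolling_mean.isna().sum())  # 124, i.e. window_size - 1

# Keep only positions that have a full window behind them.
complete = rolling_mean.dropna()
print(len(complete))              # 876
```

The first valid value sits at position window_size - 1 and equals the mean of the first 125 samples.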

The five columns are what the classifier sees. It never sees the raw signal; each row summarizes a window of 125 samples in five numbers, and those rows are the input to logistic regression or any other downstream classifier.

Design choices

Three design choices remain: window size (how long each window is), window overlap (how many samples consecutive windows share), and how many features to include.
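
Window size and overlap can both be exposed as parameters. A plain rolling(window_size) call slides one sample at a time, so consecutive windows share all but one sample; keeping only every step-th row reduces the overlap. The sketch below is one possible way to do this, with extract_features, step, and the random signal all being illustrative assumptions rather than anything prescribed in these notes.

```python
import numpy as np
import pandas as pd

def extract_features(signal, window_size, step):
    """Rolling statistics, keeping one window every `step` samples.

    Hypothetical helper: step == window_size gives non-overlapping windows,
    smaller steps give more overlap.
    """
    roll = signal.rolling(window_size)
    feats = pd.DataFrame({'mean': roll.mean(), 'std': roll.std(), 'max': roll.max()})
    # Skip the incomplete leading windows, then keep every `step`-th row.
    return feats.iloc[window_size - 1 :: step]

# Synthetic stand-in for a real recording.
signal = pd.Series(np.random.default_rng(0).normal(size=1000))

no_overlap = extract_features(signal, window_size=125, step=125)
half_overlap = extract_features(signal, window_size=125, step=62)  # roughly 50% overlap
print(len(no_overlap), len(half_overlap))  # 8 15
```

More overlap yields more training rows from the same recording, but neighbouring rows become strongly correlated, so the extra rows carry less new information than their count suggests.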

Idriss Rami — Notes
