Lab vs in-the-wild data

Lab data is collected in a controlled environment (a lab, clinic, recording studio), while in-the-wild data is collected in the messy real-world setting where the model will eventually be used. The two pull in opposite directions: lab data gives up realism for cleanliness, and in-the-wild data gives up cleanliness for realism. The choice shapes every later decision in the pipeline.

Lab data is clean. Lighting is controlled, background noise is suppressed, and subjects know they’re being recorded and follow instructions. The CK+ facial-expression dataset was built this way: subjects sat in front of a camera and produced specific expressions on cue. The images are uniform and easy to label. The downside is that the world isn’t a lab. A model trained on CK+ alone tends to fail on faces photographed in poor lighting or at unusual poses, and on the subtle, mixed expressions that occur in real life.

In-the-wild data is the opposite. AffectNet, for example, is a messier dataset of facial expressions scraped from the web: uneven lighting, motion blur, occlusions, every kind of pose. A model trained on it generalizes better to real faces, but the data is harder to label and the model has to be more sophisticated to extract a clean signal.

The same trade-off recurs across domains. A film studio’s controlled mocap stage gives extremely accurate joint trajectories, but only for what the actors did on the stage; capturing how a real person walks down a real street means accepting all the noise of the street. Hospital ECG recordings (patient lying still, leads correctly placed, room quiet) are easier to model than ECG-like recordings from a wearable on someone going about their day, but the wearable data is what the deployed model will see.

The right answer is usually both. Collect lab data first to understand the signal cleanly, then collect smaller amounts of in-the-wild data to confirm the model doesn’t fall apart outside the lab. A model that works in the clinic but not in the home isn’t done.
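One way to make that confirmation step concrete is to score the same model on a lab set and an in-the-wild set and look at the difference. The sketch below is only an illustration: the `(feature, label)` example format, the `lab_wild_gap` helper, the threshold model, and the toy numbers are all made up for this note, not taken from CK+ or AffectNet.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy format: one numeric feature and a binary label.
Example = Tuple[float, int]


def accuracy(predict: Callable[[float], int], data: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    if not data:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for x, y in data if predict(x) == y)
    return correct / len(data)


def lab_wild_gap(
    predict: Callable[[float], int],
    lab: Sequence[Example],
    wild: Sequence[Example],
) -> Dict[str, float]:
    """Score one model on both sets and report the drop outside the lab."""
    lab_acc = accuracy(predict, lab)
    wild_acc = accuracy(predict, wild)
    return {"lab": lab_acc, "wild": wild_acc, "gap": lab_acc - wild_acc}


def threshold_model(x: float) -> int:
    """Stand-in for a trained model: predicts 1 when the feature exceeds 0.5."""
    return int(x > 0.5)


# Made-up data: the lab set is cleanly separated, the wild set is not.
lab_set = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
wild_set = [(0.1, 0), (0.6, 0), (0.4, 1), (0.9, 1)]

report = lab_wild_gap(threshold_model, lab_set, wild_set)
print(report)
```

On this toy data the model is perfect in the lab and right only half the time in the wild. A large gap like that is the signal that the model isn’t done, however good the lab number looks.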

Idriss Rami — Notes
