Data collection is the deliberate, planned acquisition of measurements that bear on a question we want to answer. The word “deliberate” is doing the work: sticking a microphone in a room and recording for an hour isn’t data collection in the sense we mean. Recording for an hour because we want to study a particular bird’s singing, with the microphone calibrated, the sample rate chosen for the relevant frequencies, and the location and time written down, is.
Everything downstream depends on it. A machine-learning model is, in effect, an extremely complicated summary of the data it was trained on. If the data faithfully reflects the world the model will be deployed in, the model has a chance of being useful there. If the data was collected sloppily, or under conditions that don’t resemble deployment conditions, the model is liable to be confidently wrong as soon as it meets real examples. Garbage in, garbage out.
The choice of how to collect splits along several axes: the setting (lab versus in-the-wild), the sensors used, the metadata recorded, the labelling strategy, and the ethical and legal constraints that govern who and what we may measure.
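To make the bird-recording example concrete, here is a minimal sketch of what "written down" might look like in practice. The `RecordingSession` class, its field names, and all the values are hypothetical, chosen only for illustration. The one technical check it performs is the Nyquist criterion: the sample rate should exceed twice the highest frequency we care about.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class RecordingSession:
    """Hypothetical metadata record for one field recording."""
    purpose: str                 # the question the recording is meant to answer
    sample_rate_hz: int          # sample rate chosen for the recorder
    max_frequency_hz: float      # highest frequency of interest
    latitude: float
    longitude: float
    started_at: datetime
    microphone_calibrated: bool

    def sample_rate_is_adequate(self) -> bool:
        # Nyquist criterion: sample rate must exceed twice the
        # highest frequency we want to capture.
        return self.sample_rate_hz > 2 * self.max_frequency_hz

    def to_json(self) -> str:
        # Serialise the record so it can be stored next to the audio file.
        record = asdict(self)
        record["started_at"] = self.started_at.isoformat()
        return json.dumps(record, sort_keys=True)


# All values below are illustrative placeholders, not measurements.
session = RecordingSession(
    purpose="study the song of one bird species",
    sample_rate_hz=44_100,
    max_frequency_hz=10_000.0,
    latitude=0.0,
    longitude=0.0,
    started_at=datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc),
    microphone_calibrated=True,
)

if not session.sample_rate_is_adequate():
    raise ValueError("sample rate too low for the frequencies of interest")
print(session.to_json())
```

Storing a record like this alongside each recording is one way to keep the purpose, sensor settings, and context attached to the data; the exact fields would depend on the question being asked.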