Crowdsourcing labels is the practice of distributing small labelling tasks to many anonymous workers, typically through an online platform, and aggregating their answers. The canonical platform is Amazon Mechanical Turk, where a requester posts a task (is this image a cat or a dog?) and pays a few cents per answer. The bet is statistical, in the spirit of the law of large numbers: any individual annotator might be wrong, distracted, or careless, but if twenty different annotators answer independently, each of them is right more often than not, and we take the majority, the result is usually correct. The bet fails when annotators share the same misconception, because correlated errors do not average out.
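The majority-vote bet can be made concrete with a short sketch. It assumes a binary task, annotators who answer independently, and a single shared per-annotator accuracy `p`; real crowds violate all three to some degree. The function names are illustrative, not part of any platform's API.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer, or None if the input is empty
    or two answers tie for first place."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no strict winner
    return counts[0][0]


def prob_majority_correct(n, p):
    """Probability that a strict majority of n annotators is correct.

    Assumes a binary task and independent annotators, each correct
    with the same probability p (a simplifying assumption)."""
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["cat", "dog", "cat"]))
    for n in (1, 3, 21):
        print(n, round(prob_majority_correct(n, 0.7), 3))
```

Under these assumptions, three annotators who are each right 70% of the time give a correct majority 78.4% of the time, and the figure keeps rising as annotators are added. An odd number of annotators avoids ties on binary tasks.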
Whether crowdsourcing works in practice depends heavily on the platform’s structure. Mechanical Turk has a reputation for reasonable quality because tasks are small, payment is per-task, and the platform has built-in qualifications, such as a minimum approval rate, that let requesters filter for workers with a good track record. A site like Quora doesn’t work as a labelling platform: questions and answers are too unstructured, and there’s no reliable way to combine multiple people’s responses. Community Notes on Twitter/X is an interesting middle ground: it’s crowdsourced labelling of whether posts are misleading, with a mechanism that favours notes rated helpful by contributors who usually disagree with one another, rather than notes backed by a simple majority.
Crowdsourcing is appropriate when the labels don’t require expert judgement and the volume needed is large. It’s inappropriate when the categories are specialized: identifying individual humpback whales from tail markings needs a marine biologist, not a Mechanical Turker. The pattern in many real projects is a mix: use crowdsourcing for the bulk of the labels, and use experts for the hard cases and for the gold-standard subset against which crowdsourced quality is measured.
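One way to picture the mixed workflow is sketched below: score each worker against the expert-labelled gold subset, and route items with low crowd agreement to experts. The data layout (tuples of worker, item, answer) and the `min_agreement` threshold of 0.8 are assumptions chosen for illustration, not a standard.

```python
from collections import Counter, defaultdict


def worker_accuracy(labels, gold):
    """Fraction of gold items each worker answered correctly.

    labels: list of (worker_id, item_id, answer) tuples.
    gold:   dict mapping item_id to the expert answer.
    Workers who saw no gold items are omitted from the result."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, item, answer in labels:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(answer == gold[item])
    return {w: correct[w] / seen[w] for w in seen}


def items_for_expert(labels, min_agreement=0.8):
    """Item ids whose most common answer holds less than
    min_agreement of the votes (hypothetical threshold)."""
    by_item = defaultdict(list)
    for _, item, answer in labels:
        by_item[item].append(answer)
    hard = []
    for item, answers in by_item.items():
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / len(answers) < min_agreement:
            hard.append(item)
    return sorted(hard)


if __name__ == "__main__":
    labels = [
        ("w1", "a", "cat"), ("w2", "a", "cat"), ("w3", "a", "cat"),
        ("w1", "b", "dog"), ("w2", "b", "cat"), ("w3", "b", "dog"),
    ]
    gold = {"a": "cat", "b": "dog"}
    print(worker_accuracy(labels, gold))
    print(items_for_expert(labels))
```

The gold subset needs to be large enough that a worker's score is not decided by one or two items; how large is a judgement call that depends on the task.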
The output of crowdsourcing inevitably contains label noise, which is typically handled with majority voting, confidence-weighted aggregation, or active learning.
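As a rough sketch of confidence-weighted aggregation, the example below weights each vote by the log-odds of the worker's estimated accuracy, so a reliable worker can outvote several unreliable ones. The weighting scheme, the `default_acc` value for unknown workers, and the clipping constant `eps` are assumptions made for illustration; "confidence" could equally mean a worker's self-reported certainty.

```python
from collections import defaultdict
from math import log


def weighted_vote(votes, accuracy, default_acc=0.6, eps=0.01):
    """Pick the answer with the highest total log-odds weight.

    votes:       list of (worker_id, answer) for a single item.
    accuracy:    dict of worker_id -> estimated accuracy in [0, 1].
    default_acc: accuracy assumed for unknown workers (hypothetical).
    eps:         clipping bound that keeps the log-odds finite.
    Returns None for an empty vote list; on an exact tie the answer
    seen first wins."""
    if not votes:
        return None
    scores = defaultdict(float)
    for worker, answer in votes:
        p = accuracy.get(worker, default_acc)
        p = min(max(p, eps), 1 - eps)  # avoid log(0) and division by zero
        scores[answer] += log(p / (1 - p))
    return max(scores, key=scores.get)


if __name__ == "__main__":
    votes = [("w1", "cat"), ("w2", "dog"), ("w3", "dog")]
    accuracy = {"w1": 0.95, "w2": 0.6, "w3": 0.6}
    # Plain majority voting would pick "dog"; the weighted vote
    # sides with the single high-accuracy worker.
    print(weighted_vote(votes, accuracy))
```

A worker whose estimated accuracy is below 0.5 gets a negative weight here, which counts their vote against the answer they gave. That is only sensible for binary tasks, so a multi-class version would need a different weighting.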