Active learning is a labelling strategy in which the model identifies the training examples it’s most uncertain about and requests human labels for those specifically, rather than for a random subset. Most of the labelling budget goes toward the examples where labels would teach the model the most.

The motivating observation is that not all training examples are equally useful. A model already confident that some image is a cat learns very little from a label confirming that. A model uncertain whether some image is a cat or a small dog learns a lot from being told. If labelling is expensive — domain experts, crowdsourced workers — focusing the budget on the uncertain examples gets more learning per dollar.

The typical active-learning loop is:

  1. Train an initial model on a small seed of labelled data.
  2. Use the model to score every unlabelled example by its uncertainty, for example by the entropy of the predicted class probabilities, the distance to the decision boundary, or the disagreement among an ensemble.
  3. Send the highest-uncertainty examples to a human for labelling.
  4. Add the newly labelled examples to the training set and retrain.
  5. Repeat.
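
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CentroidModel` is a hypothetical toy classifier for one-dimensional inputs, `oracle` stands in for the human labeller, and `rounds` and `batch_size` are assumed parameters. The seed set is assumed to contain at least one example of every class.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class CentroidModel:
    """Toy 1-D classifier: softmax over negative squared distance to each class mean."""

    def fit(self, xs, ys):
        self.centroids = {}
        for label in set(ys):
            members = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = sum(members) / len(members)
        return self

    def predict_proba(self, x):
        scores = [-(x - c) ** 2 for c in self.centroids.values()]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

def active_learning_loop(seed, pool, oracle, rounds=5, batch_size=2):
    labelled = list(seed)    # [(x, y), ...]
    unlabelled = list(pool)  # [x, ...]
    # Step 1: train an initial model on the seed data.
    model = CentroidModel().fit(*zip(*labelled))
    for _ in range(rounds):
        if not unlabelled:
            break
        # Step 2: score every unlabelled example by predictive entropy.
        ranked = sorted(
            unlabelled,
            key=lambda x: entropy(model.predict_proba(x)),
            reverse=True,
        )
        # Step 3: send the highest-uncertainty examples to the labeller.
        for x in ranked[:batch_size]:
            labelled.append((x, oracle(x)))
            unlabelled.remove(x)
        # Step 4: retrain on the enlarged training set; step 5 is the loop itself.
        model = CentroidModel().fit(*zip(*labelled))
    return model, labelled
```

With a seed of `[(0.0, 0), (1.0, 1)]` and an oracle that labels `x > 0.5` as class 1, the first round queries the pool points closest to 0.5, because those are the points where the two class probabilities are most nearly equal.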

Common query strategies include uncertainty sampling (pick the examples with the most ambiguous predicted probabilities, e.g. highest entropy), query-by-committee (train an ensemble and pick the examples where its members disagree most), and expected model change (pick the examples whose labels would shift the model’s parameters the most). Each is a different proxy for “how much would the model learn from this label.”
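
As a rough illustration, each of the three approaches can be reduced to a scoring function where a higher score means "label this one first". The function names below are hypothetical. Vote entropy is only one of several ways to measure committee disagreement, and the expected-model-change score shown is the expected gradient length, worked out under the assumption of a binary logistic-regression model with log loss.

```python
import math
from collections import Counter

def predictive_entropy(probs):
    """Uncertainty sampling: entropy of one model's predicted class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def vote_entropy(votes):
    """Query-by-committee: entropy of the labels predicted by the ensemble members."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def expected_gradient_length(p_positive, features):
    """Expected model change for binary logistic regression (an assumed model).

    The log-loss gradient for label y is (p - y) * x, so its norm is
    |p - y| * ||x||. Averaging over y ~ Bernoulli(p) gives 2 * p * (1 - p) * ||x||.
    """
    norm = math.sqrt(sum(f * f for f in features))
    return 2 * p_positive * (1 - p_positive) * norm
```

All three scores are largest when the model is torn between classes: a 50/50 prediction maximises the entropy and the gradient-length score, and an evenly split committee maximises the vote entropy. They can still rank examples differently, since the gradient-length score also grows with the size of the feature vector.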

Active learning is one of the standard responses to label noise and limited labelling budgets, alongside majority voting and confidence-weighted labelling. It’s also the conceptual basis for the hybrid pipelines that combine automated labelling with targeted expert review: the model picks the cases where it’s least confident, and the expert reviews exactly those.