Unsupervised learning

Unsupervised learning is the family of machine-learning algorithms that works with unlabelled data: inputs only, with no correct answers attached. The model’s job is to find structure in the data on its own, such as clusters, patterns, anomalies, or low-dimensional representations. No supervisor tells it what is right or wrong, so it has to make sense of the data using only the data itself.

Common unsupervised tasks:

Clustering. Partition examples into groups such that examples within a group are similar and examples in different groups are dissimilar. k-means is the canonical algorithm; hierarchical clustering, DBSCAN, and Gaussian mixture models are others.
Dimensionality reduction. Find a low-dimensional representation that preserves the structure that matters. PCA (linear, preserves as much variance as possible) and t-SNE (non-linear, preserves local neighborhoods) are the two methods this textbook covers.
Anomaly detection. Identify examples that don’t fit the typical pattern. Useful for fraud detection, fault monitoring, and novelty detection.
Density estimation. Estimate the probability distribution that generated the data.
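
To make the last two tasks concrete, here is a minimal sketch that fits a single multivariate Gaussian to unlabelled points (density estimation) and then flags the lowest-density points as anomalies. The synthetic data, the injected outlier at (8, 8), the 1% threshold, and the helper names fit_gaussian and log_density are all assumptions made for illustration; real data may need a richer density model than one Gaussian.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and covariance of a multivariate Gaussian from data."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return mean, cov

def log_density(X, mean, cov):
    """Log of the fitted Gaussian density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Solve cov @ y = diff for each row instead of inverting cov explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahalanobis_sq + logdet + d * np.log(2 * np.pi))

# Synthetic, unlabelled data: 500 typical points plus one injected outlier.
rng = np.random.default_rng(0)
typical_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([typical_points, outlier])

mean, cov = fit_gaussian(X)
scores = log_density(X, mean, cov)

# Assumed rule: anything below the 1st percentile of log-density is an anomaly.
threshold = np.percentile(scores, 1)
is_anomaly = scores < threshold

print("flagged indices:", np.flatnonzero(is_anomaly))
```

No labels are used anywhere: the notion of "typical" comes entirely from the estimated distribution.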

To get a feel for the distinction from Supervised learning, imagine someone shows us a stack of fruit photographs. If they’re labelled (orange ones say “orange”), we can learn what distinguishes the categories, which is supervised learning. If they’re unlabelled, we can still sort them into piles by visual similarity, ending up with, say, two piles (mostly orange-colored, mostly red-colored) without anyone telling us the categories. We’ve discovered structure without being told what to look for, which is unsupervised learning.
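
The fruit-sorting picture can be sketched in code with k-means. The example below is an illustration under stated assumptions, not a reference implementation: each "photo" is reduced to a single made-up average RGB color, the two color centers and the noise level are invented, and kmeans is a small hand-written version of Lloyd's algorithm rather than a library call.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Hypothetical data: one average RGB color (scaled to [0, 1]) per photo.
rng = np.random.default_rng(1)
orange_like = rng.normal(loc=[1.0, 0.65, 0.0], scale=0.05, size=(50, 3))
red_like = rng.normal(loc=[0.8, 0.1, 0.1], scale=0.05, size=(50, 3))
photos = np.clip(np.vstack([orange_like, red_like]), 0.0, 1.0)

# The algorithm never sees which rows were generated as "orange" or "red".
labels, centroids = kmeans(photos, k=2)

print("pile sizes:", np.bincount(labels, minlength=2))
print("pile centers (RGB):", centroids.round(2))
```

The pile indices 0 and 1 are arbitrary. The algorithm finds the two groups, but naming them "orange" and "red" is still up to us.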

The output of unsupervised learning is structural rather than predictive. After clustering, we have the data partitioned into groups. After dimensionality reduction, we have the data in a more compact form. Unlike in supervised learning, there is no labelled test set to evaluate against. The closest substitutes are intrinsic metrics (silhouette score for clustering, reconstruction error for dimensionality reduction) and downstream usefulness (do the discovered clusters correspond to something meaningful in the domain?).
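
Both intrinsic metrics mentioned above can be computed from the data alone. The sketch below hand-rolls them in NumPy to show what they measure. The two synthetic blobs, the random "bad" labelling used for comparison, and the function names are assumptions for illustration, and the silhouette follows the usual convention of scoring single-member clusters as 0.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette: (b - a) / max(a, b) averaged over all points.

    a = mean distance to the other points in the same cluster,
    b = smallest mean distance to the points of any other cluster.
    Assumes at least two clusters are present.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # single-member cluster: score stays 0
        a = dists[i, same].sum() / (n_same - 1)  # distance to self is 0
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def pca_reconstruction_error(X, n_components):
    """Mean squared error after projecting onto the top principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    X_hat = (Xc @ components.T) @ components + mean
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Synthetic data: two well-separated blobs in 3D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(40, 3)),
    rng.normal(loc=3.0, scale=0.3, size=(40, 3)),
])
sensible_labels = np.array([0] * 40 + [1] * 40)
random_labels = rng.integers(0, 2, size=80)

score_sensible = silhouette_score(X, sensible_labels)
score_random = silhouette_score(X, random_labels)
errors = [pca_reconstruction_error(X, k) for k in (1, 2, 3)]

print("silhouette (sensible vs random):", score_sensible, score_random)
print("reconstruction error for 1, 2, 3 components:", errors)
```

A silhouette near 1 indicates tight, well-separated clusters, and one near 0 indicates that the grouping is no better than arbitrary. Reconstruction error shrinks as components are added and reaches roughly zero once all dimensions are kept. Neither number says whether the structure is meaningful in the domain, which is why downstream usefulness still matters.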

The other two ML families are Supervised learning (with labels) and Reinforcement learning (with environmental feedback). Introduction to Data Science focuses primarily on supervised learning, with unsupervised methods appearing in Chapter 3 as visualization aids for high-dimensional data.

Idriss Rami — Notes
