Accuracy is the most intuitive classification metric: the fraction of predictions that are correct.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The numerator is the number of correct predictions (true positives plus true negatives, the main diagonal of the confusion matrix). The denominator is the total number of predictions. Multiplying by 100 gives a percentage, as in the familiar "the model got 87%" phrasing.
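As a minimal sketch of that ratio, the snippet below computes accuracy directly from the four confusion-matrix counts. The function name `accuracy_from_counts` and the counts themselves are hypothetical, chosen only so the result matches the 87% figure mentioned above.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Return correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    # Correct predictions are the main diagonal: true positives + true negatives.
    return (tp + tn) / total


# Hypothetical counts for 100 predictions (illustrative, not real data).
acc = accuracy_from_counts(tp=40, tn=47, fp=8, fn=5)
print(f"{acc:.0%}")  # 87%
```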
Accuracy is a useful summary — one number that captures how the classifier is doing overall. But it has a serious failure mode: it can be misleading for imbalanced datasets.
The imbalance problem
Suppose 99% of the examples are negative. A trivial classifier that always predicts negative gets 99% accuracy by doing nothing useful: it correctly classifies every negative and misses every positive. The accuracy number is high; the classifier is useless. A real-world example is cancer screening on the general population. Most people don’t have cancer, so a classifier that always says "no cancer" gets nearly perfect accuracy and catches zero cases.
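The always-negative scenario can be reproduced in a few lines of plain Python. This is an illustrative sketch on synthetic labels (10 positives among 1,000 examples, an assumed split that gives the 99% figure), not a real screening dataset.

```python
# Synthetic labels: 1 = positive, 0 = negative (assumed 1% positive rate).
y_true = [1] * 10 + [0] * 990

# The trivial classifier: predict negative for every example.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that were predicted positive.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```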
The fix is to look at finer-grained metrics:
- Recall asks: of the actual positives, how many did the classifier catch? It is sensitive to false negatives.
- Specificity asks: of the actual negatives, how many did the classifier correctly identify? It is sensitive to false positives.
- Precision asks: of the predicted positives, how many were actually positive? It is also sensitive to false positives.
- F1 score combines precision and recall into a single number.
- AUC (area under the ROC curve) measures the classifier’s ranking ability across all thresholds and is largely insensitive to class imbalance.
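To make the count-based metrics in this list concrete, here is a small sketch that derives recall, specificity, precision, and F1 from confusion-matrix counts. The helper name `rates` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here for simplicity. AUC is left out because it needs predicted scores rather than counts.

```python
def _safe_div(num, den):
    # Assumed convention: report 0.0 when the denominator is zero.
    return num / den if den else 0.0


def rates(tp, tn, fp, fn):
    """Derive finer-grained metrics from confusion-matrix counts."""
    recall = _safe_div(tp, tp + fn)        # actual positives caught
    specificity = _safe_div(tn, tn + fp)   # actual negatives identified
    precision = _safe_div(tp, tp + fp)     # predicted positives that are right
    f1 = _safe_div(2 * precision * recall, precision + recall)
    accuracy = _safe_div(tp + tn, tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical imbalanced problem: 10 positives among 1,000 examples.
m = rates(tp=8, tn=950, fp=40, fn=2)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With these made-up counts, accuracy is about 0.96 while precision is only 8/48, which is the kind of gap accuracy alone hides. In scikit-learn, `recall_score`, `precision_score`, `f1_score`, and `roc_auc_score` in `sklearn.metrics` cover the same ground.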
For balanced problems where the two classes are roughly equally common, accuracy is fine. For imbalanced problems, it’s misleading on its own and should be paired with one or more of the metrics above.
In scikit-learn
```python
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)
```
For multi-class classification, accuracy generalizes naturally: the fraction of predictions where the predicted class equals the true class. The imbalance pitfall generalizes too: accuracy on a multi-class problem dominated by one class can be high without the classifier being useful for the rarer classes.
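The multi-class version of the pitfall can be sketched in plain Python as well. The three class names and the 90/5/5 split below are invented for illustration; the classifier simply predicts the dominant class every time.

```python
from collections import Counter

# Synthetic three-class labels dominated by class "a" (assumed 90/5/5 split).
y_true = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
y_pred = ["a"] * len(y_true)  # always predict the majority class

# Overall accuracy: predicted class equals true class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: correct predictions for a class / true examples of it.
support = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
per_class_recall = {c: hits[c] / support[c] for c in sorted(support)}

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.90
print(per_class_recall)              # {'a': 1.0, 'b': 0.0, 'c': 0.0}
```

Overall accuracy looks respectable, yet the per-class breakdown shows that the two rare classes are never predicted correctly.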