Binary cross-entropy

Binary cross-entropy (BCE) is the standard loss function for binary classification. It goes by several equivalent names: log loss, negative log-likelihood of a Bernoulli model, and logistic loss all describe the same quantity in the binary case. Categorical cross-entropy is the multi-class generalization. For a dataset of $N$ examples with true labels $y_{i} \in \{0, 1\}$ and predicted probabilities $f(x_{i}) \in (0, 1)$:

$J(w) = -\frac{1}{N} \sum_{i = 1}^{N} \left[ y_{i} \log f(x_{i}) + (1 - y_{i}) \log (1 - f(x_{i})) \right]$

The structure is an average over the $N$ examples. For each example, exactly one of the two terms is nonzero: the first when $y_{i} = 1$, the second when $y_{i} = 0$. The $1/N$ averaging is conventional in most modern references (scikit-learn’s log_loss, PyTorch’s BCELoss with reduction='mean'); some textbooks drop it and use the un-normalized sum. The minimizer is the same either way, because scaling a loss by a positive constant doesn’t move its minimum.
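The formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any library implements it: the function name, the `average` flag, and the `eps` clipping (added so that a prediction of exactly 0 or 1 never hits $\log 0$) are assumptions of this sketch, and the labels and probabilities are made up.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12, average=True):
    """Binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that log(0) never occurs when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        # Only one of the two terms is nonzero, depending on the label.
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true) if average else total

# Made-up labels and predicted probabilities for class 1.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4]

mean_loss = binary_cross_entropy(y, p)                # with the 1/N factor
sum_loss = binary_cross_entropy(y, p, average=False)  # un-normalized sum
print(mean_loss, sum_loss)
```

The two variants differ only by the factor $N$, which is why they share the same minimizer.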

Why this form

It’s worth seeing why the expression has the form it does — consider the two cases for an individual example.

When $y_{i} = 1$ (the true class is 1), the second term vanishes because $1 - y_{i} = 0$, and what’s left is $-\log f(x_{i})$. We want this to be small. $-\log$ is large when its argument is near 0 and small when its argument is near 1. So $-\log f(x_{i})$ is small when $f(x_{i})$ is close to 1: the model correctly predicts a high probability for class 1. And it’s large when $f(x_{i})$ is close to 0: the model wrongly predicts class 0 with high confidence.

When $y_{i} = 0$ (the true class is 0), the first term vanishes, leaving $-\log (1 - f(x_{i}))$. By the same argument, this is small when $1 - f(x_{i})$ is close to 1, i.e., when $f(x_{i})$ is close to 0: the model correctly predicts low probability for class 1. And it’s large when $f(x_{i})$ is close to 1: confidently wrong in the other direction.

In both cases, the loss is small when the model is correctly confident and large when the model is wrongly confident. It penalizes confident mistakes much more harshly than wrong-but-uncertain predictions, which is intuitively right: a hedging model should suffer less than a model that loudly insists on the wrong answer.
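To put numbers on this, the snippet below evaluates the per-example loss $-\log p$ for a true label of 1 across a range of predicted probabilities $p$. The probabilities are arbitrary values chosen for illustration.

```python
import math

# Per-example loss for a true label y = 1 is -log(p), where p is the
# predicted probability of class 1.
predictions = [0.99, 0.9, 0.6, 0.4, 0.1, 0.01]
losses = [-math.log(p) for p in predictions]

for p, loss in zip(predictions, losses):
    print(f"p = {p:<5} loss = {loss:.3f}")
```

A hedged wrong prediction ($p = 0.4$) costs about 0.92, while a confidently wrong one ($p = 0.01$) costs about 4.61, and the cost keeps growing without bound as $p$ approaches 0.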

Why not MSE

Mean squared error is the standard for regression, but it is a poor fit for classification. Combined with the sigmoid of logistic regression, MSE produces a loss surface that is non-convex in the parameters: it has flat regions where the sigmoid saturates, so gradient descent can stall far from the optimum or settle in a local minimum that isn’t the global one. Binary cross-entropy with a sigmoid is convex in the parameters of logistic regression, so any minimum gradient descent reaches is a global one.
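The non-convexity can be checked numerically on the smallest possible case. The sketch below assumes a single example with $x = 1$, $y = 1$ and a one-parameter model $f = \sigma(w)$; the helper names and the two test points are choices made for this illustration. It uses the fact that a convex function never rises above the chord between two of its points.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(w):
    # Squared error for one example with x = 1, y = 1.
    return (sigmoid(w) - 1.0) ** 2

def bce_loss(w):
    # Cross-entropy for the same example: -log(sigmoid(w)).
    return -math.log(sigmoid(w))

def violates_convexity(loss, a, b):
    # A convex function satisfies loss(midpoint) <= average of the endpoints,
    # so a midpoint value above that average disproves convexity.
    return loss((a + b) / 2) > (loss(a) + loss(b)) / 2

a, b = -10.0, -2.0
mse_violation = violates_convexity(mse_loss, a, b)
bce_violation = violates_convexity(bce_loss, a, b)
print("MSE violates convexity:", mse_violation)
print("BCE violates convexity:", bce_violation)
```

On this pair of points the check flags MSE and not BCE. A single passing check does not prove convexity, of course; it only fails to find a counterexample. The MSE violation comes from the plateau at very negative $w$, where the model is confidently wrong yet the loss is almost flat.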

There’s also a probabilistic justification: cross-entropy is the negative log-likelihood under the Bernoulli model where $f (x_{i})$ is the predicted probability. Maximizing the likelihood of the data is the same as minimizing the cross-entropy, which makes the loss principled rather than arbitrary.
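Spelled out as a short derivation, assuming the examples are independent (the symbol $L(w)$ for the likelihood is introduced here and is not used elsewhere in these notes), the likelihood of the labels under the Bernoulli model is

$L(w) = \prod_{i = 1}^{N} f(x_{i})^{y_{i}} (1 - f(x_{i}))^{1 - y_{i}}$

Taking the logarithm turns the product into a sum:

$\log L(w) = \sum_{i = 1}^{N} \left[ y_{i} \log f(x_{i}) + (1 - y_{i}) \log (1 - f(x_{i})) \right]$

so that $J(w) = -\frac{1}{N} \log L(w)$, and maximizing $L(w)$ is the same as minimizing $J(w)$.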

Multi-class generalization

For more than two classes, categorical cross-entropy generalizes the same idea:

$J(w) = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} y_{ik} \log f_{k}(x_{i})$

where $y_{ik}$ is 1 if example $i$ belongs to class $k$ and 0 otherwise, and $f_{k}(x_{i})$ is the model’s predicted probability for class $k$ (typically the output of a softmax). The binary case ($K = 2$) reduces to the formula above: take $y_{i1} = y_{i}$ and $f_{1}(x_{i}) = f(x_{i})$, so that $y_{i2} = 1 - y_{i}$ and $f_{2}(x_{i}) = 1 - f(x_{i})$.
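The reduction to the binary formula can be checked numerically. In this sketch the function names and data are illustrative, and the two-class encoding (first column for class 1, second column for class 0) is an assumption made for the comparison.

```python
import math

def categorical_cross_entropy(y_onehot, p_pred):
    """Mean categorical cross-entropy; rows are examples, columns are classes."""
    total = 0.0
    for y_row, p_row in zip(y_onehot, p_pred):
        # Only the true class (y_k = 1) contributes to the inner sum.
        total -= sum(y_k * math.log(p_k) for y_k, p_k in zip(y_row, p_row) if y_k)
    return total / len(y_onehot)

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy, as in the formula at the top of the note."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Made-up binary labels and predicted probabilities for class 1.
y = [1, 0, 1]
p = [0.8, 0.3, 0.6]

# Two-class encoding: column 0 is class 1, column 1 is class 0.
y_onehot = [[yi, 1 - yi] for yi in y]
p_two = [[pi, 1.0 - pi] for pi in p]

cce = categorical_cross_entropy(y_onehot, p_two)
bce = binary_cross_entropy(y, p)
print(cce, bce)
```

Both calls return the same value, since each example picks out exactly one log term in either formulation.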

Training

The partial derivatives of cross-entropy with respect to the logistic regression parameters have a clean form: they look very much like the linear regression gradients, with $f(x_{i}) - y_{i}$ playing the role of the residual. Deriving them is a small exercise in the chain rule. In practice, scikit-learn handles the optimization for us: we call LogisticRegression() and fit it. Its default solver is L-BFGS rather than plain gradient descent, and it adds L2 regularization by default, so the fitted weights minimize a penalized version of the loss above.
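For a model $f(x) = \sigma(w x + b)$ with a single feature, the chain rule gives $\partial J / \partial w = \frac{1}{N} \sum_{i} (f(x_{i}) - y_{i}) x_{i}$ and $\partial J / \partial b = \frac{1}{N} \sum_{i} (f(x_{i}) - y_{i})$. The sketch below runs plain gradient descent with those gradients on a toy dataset. The data, learning rate, and iteration count are made up for illustration, and this is not how scikit-learn fits the model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(w, b, xs, ys):
    # Mean binary cross-entropy of the model sigmoid(w * x + b).
    return -sum(
        y * math.log(sigmoid(w * x + b)) + (1 - y) * math.log(1.0 - sigmoid(w * x + b))
        for x, y in zip(xs, ys)
    ) / len(xs)

def gradient_step(w, b, xs, ys, lr):
    n = len(xs)
    # Residuals f(x_i) - y_i, the same quantity that drives linear regression.
    residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(residuals) / n
    return w - lr * grad_w, b - lr * grad_b

# Toy 1-D data with overlapping classes (made up for illustration).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
initial_loss = bce(w, b, xs, ys)
for _ in range(500):
    w, b = gradient_step(w, b, xs, ys, lr=0.5)
final_loss = bce(w, b, xs, ys)
print(initial_loss, final_loss, w, b)
```

Starting from $w = b = 0$ the model predicts 0.5 everywhere, so the initial loss is $\log 2$. The classes overlap on purpose: with perfectly separable data the unregularized loss has no finite minimizer and the weights would keep growing.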

Idriss Rami — Notes
