Linear regression

Linear regression is the simplest Regression model: predict the output as a linear function of the inputs. For a single input feature $x$ :

$f (x, w) = w_{0} + w_{1} x$

$w_{0}$ is the intercept, the value of $f$ when $x = 0$ . $w_{1}$ is the slope, how much $f$ changes per unit increase in $x$ . The vector $w = (w_{0}, w_{1})$ contains everything the model has learned. For multiple input features $x_{1}, x_{2}, \dots, x_{m}$ , the model generalizes to a sum of weighted inputs:

$f (x, w) = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \dots + w_{m} x_{m}$

If the intercept is absorbed into $w$ by prepending a constant 1 to $x$, this is just $\hat{y} = w^{T} x$, the inner product of the weight vector and the input vector. The hat on $\hat{y}$ marks it as the model’s prediction, distinct from the true label $y$. Training the model means finding good values for the weights $w$.
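As a minimal sketch of the inner-product form, the snippet below prepends the constant 1 and sums the weighted inputs. The function name `predict` and the example weights are illustrative assumptions, not part of the notes.

```python
def predict(x, w):
    """Return the inner product w^T x after prepending the constant 1 to x."""
    x_aug = [1.0] + list(x)  # the constant 1 pairs with the intercept w[0]
    if len(x_aug) != len(w):
        raise ValueError("w needs one weight per feature plus the intercept")
    return sum(w_j * x_j for w_j, x_j in zip(w, x_aug))


# Hypothetical weights: intercept 1.0, then one weight per feature.
w = [1.0, 2.0, -0.5]
y_hat = predict([3.0, 4.0], w)  # 1.0 + 2.0*3.0 - 0.5*4.0
```

With these made-up numbers the three terms are 1.0, 6.0 and -2.0, so `y_hat` comes out to 5.0.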

What good fit means

We pick a Loss function that measures how badly the model fits the data, and minimize it. The standard regression loss is Mean squared error, written here with an extra factor of 1/2 that simplifies the gradient later:

$J (w) = \frac{1}{2 N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i})^{2}$

A small $J$ means a good fit. The training problem becomes: find $w$ that minimizes $J$ .
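To make the loss concrete, here is a small sketch that evaluates $J(w)$ for the single-feature model on toy data. The helper name `half_mse` and the data points are assumptions chosen for illustration.

```python
def half_mse(w0, w1, xs, ys):
    """J(w) = 1/(2N) * sum of squared errors for f(x) = w0 + w1 * x."""
    n = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Toy data generated exactly by y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = half_mse(1.0, 2.0, xs, ys)  # the true weights: every error is zero
worse = half_mse(0.0, 2.0, xs, ys)    # wrong intercept: every error is -1
```

The true weights give a loss of zero. Shifting the intercept by 1 makes every squared error equal to 1, so the loss is 4 / (2 * 4) = 0.5.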

Training

For linear regression with MSE loss, there’s a closed-form solution: take partial derivatives of $J$ with respect to each $w_{j}$ , set them to zero, solve the resulting system of equations. In matrix form, stacking the inputs (with a leading column of 1s for the intercept) into a design matrix $X$ and the targets into a vector $y$ , this gives the normal equation:

$w^{*} = (X^{T} X)^{- 1} X^{T} y$

This is the exact best $w$ in one calculation, no iteration. In practice we don’t form $(X^{T} X)^{- 1}$ explicitly; numerically stable implementations (including scikit-learn’s LinearRegression) solve the system via SVD or a least-squares routine.
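The sketch below compares two ways of getting $w^{*}$ with NumPy on made-up data. It solves the normal equations as a linear system (which still avoids forming the inverse) and calls the least-squares routine `np.linalg.lstsq`. The data values are assumptions for illustration only.

```python
import numpy as np

# Toy data, roughly y = 1 + 2x with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix: a leading column of 1s for the intercept, then the feature.
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X^T X) w = X^T y, solved as a linear system.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares routine, the numerically safer route.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimum, the gradient X^T (X w - y) / N should vanish.
grad_at_optimum = X.T @ (X @ w_lstsq - y) / len(y)
```

On a small, well-conditioned problem like this one the two results should agree to floating-point precision. The difference matters when the columns of $X$ are nearly collinear.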

For more complex models (neural networks with millions of parameters, or any model that isn’t linear in its parameters), there is generally no closed form: setting the gradient to zero yields nonlinear equations with no analytical solution. The fallback is Gradient descent, which works on linear regression too and generalizes to everything else.

For the single-feature model, the partial derivatives of $J$ with respect to $w_{0}$ and $w_{1}$ are:

$\frac{\partial J}{\partial w_{0}} = \frac{1}{N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i})$

$\frac{\partial J}{\partial w_{1}} = \frac{1}{N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i}) \cdot x_{i}$

The factor of 2 from differentiating the square cancels the 1/2 in the loss formula, which is the reason that 1/2 is there in the first place.
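Written out for $w_{1}$, the cancellation is one application of the chain rule, using $\frac{\partial f (x_{i}, w)}{\partial w_{1}} = x_{i}$ for the single-feature model:

$\frac{\partial J}{\partial w_{1}} = \frac{1}{2 N} \sum_{i = 1}^{N} 2 \, (f (x_{i}, w) - y_{i}) \cdot \frac{\partial f (x_{i}, w)}{\partial w_{1}} = \frac{1}{N} \sum_{i = 1}^{N} (f (x_{i}, w) - y_{i}) \cdot x_{i}$

The $w_{0}$ case is identical except that $\frac{\partial f (x_{i}, w)}{\partial w_{0}} = 1$, so the trailing factor $x_{i}$ disappears.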

The training loop:

1. Initialize $w_{0}, w_{1}$, often to zero.
2. Compute predictions $f (x_{i}, w) = w_{0} + w_{1} x_{i}$ for every training example.
3. Compute the loss $J (w)$.
4. Compute the two gradients above.
5. Update: $w_{0} \leftarrow w_{0} - \eta \frac{\partial J}{\partial w_{0}}$, $w_{1} \leftarrow w_{1} - \eta \frac{\partial J}{\partial w_{1}}$, where $\eta$ is the Learning rate.
6. Repeat until the loss stops improving.
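The loop above can be sketched in plain Python for the single-feature case. The function name `fit_gd`, the learning rate of 0.1, the step cap, and the stopping tolerance are all assumptions picked for this toy data, not recommended defaults.

```python
def fit_gd(xs, ys, lr=0.1, max_steps=10_000, tol=1e-12):
    """Batch gradient descent for f(x) = w0 + w1 * x with the 1/(2N) squared loss."""
    n = len(xs)
    w0, w1 = 0.0, 0.0                      # step 1: initialize to zero
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_steps):
        # step 2: prediction errors f(x_i) - y_i for every training example
        errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        # step 3: the loss J(w)
        loss = sum(e * e for e in errors) / (2 * n)
        # step 6: stop once the loss no longer improves by more than tol
        if prev_loss - loss < tol:
            break
        prev_loss = loss
        # step 4: the two gradients
        grad_w0 = sum(errors) / n
        grad_w1 = sum(e * x for e, x in zip(errors, xs)) / n
        # step 5: move against the gradient, scaled by the learning rate
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1, loss


# Toy data generated exactly by y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1, final_loss = fit_gd(xs, ys)
```

On this data the weights should approach the generating values $w_{0} = 1$ and $w_{1} = 2$. A learning rate that is too large makes the loss grow instead of shrink, in which case the same stopping check ends the loop early.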

Limitations

Linear regression assumes the relationship between input and output is linear. Many real relationships aren’t. For curvature, use Polynomial regression. For classification, the linear function gets wrapped in a sigmoid to become Logistic regression. For non-linear decision boundaries that can’t be expressed as polynomials, use neural networks, decision trees, or kernel methods.

In scikit-learn:

```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()       # fits the intercept by default
reg.fit(X_train, y_train)      # X_train: (n_samples, n_features)
y_pred = reg.predict(X_test)
```

This solves the least-squares problem directly, with no gradient descent involved. For very large datasets, SGDRegressor fits a linear model with stochastic gradient descent instead.

Idriss Rami — Notes
