Linear regression is the simplest Regression model: predict the output as a linear function of the inputs. For a single input feature $x$:
$$\hat{y} = w_0 + w_1 x$$
$w_0$ is the intercept, the value of $\hat{y}$ when $x = 0$. $w_1$ is the slope, how much $\hat{y}$ changes per unit increase in $x$. The vector $\mathbf{w} = (w_0, w_1)$ contains everything the model has learned. For multiple input features $x_1, \dots, x_d$, the model generalizes to a sum of weighted inputs:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$
If the intercept is absorbed into $\mathbf{w}$ by prepending a constant 1 to $\mathbf{x}$, this is just $\hat{y} = \mathbf{w}^\top \mathbf{x}$, the inner product of the weight vector and the input vector. The hat on $\hat{y}$ marks it as the model’s prediction, distinct from the true label $y$. Training the model means finding good values for the weights $\mathbf{w}$.
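As a minimal sketch of the inner-product form, assuming NumPy and made-up weights and inputs (the names `w`, `X`, and `X_aug` are illustrative and not from the text), prepending a column of 1s lets a single matrix product give the same predictions as the explicit sum:

```python
import numpy as np

# Hypothetical weights: intercept 1.0, then one weight per feature.
w = np.array([1.0, 2.0, -0.5])

# Two made-up examples with two features each.
X = np.array([[3.0, 4.0],
              [0.0, 2.0]])

# Explicit form: intercept plus the weighted sum of the features.
y_hat_explicit = w[0] + X @ w[1:]

# Absorbed form: prepend a constant 1 to every input, then take inner products.
X_aug = np.column_stack([np.ones(len(X)), X])
y_hat_inner = X_aug @ w

print(y_hat_explicit)  # [5. 0.]
print(y_hat_inner)     # [5. 0.]
```

Both forms compute the same numbers; the absorbed form is simply more convenient once the data is stored as a matrix.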
What good fit means
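We pick a Loss function that measures how badly the model fits the data, and minimize it. The standard regression loss is Mean squared error, written here with an extra factor of 1/2 over $n$ training examples $(x^{(i)}, y^{(i)})$:
$$J(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
A small $J$ means a good fit. The training problem becomes: find the $\mathbf{w}$ that minimizes $J(\mathbf{w})$.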
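A small sketch of the loss computation, assuming NumPy and the halved form of mean squared error that the gradient discussion relies on. The function name `half_mse` and the sample numbers are made up for illustration:

```python
import numpy as np

def half_mse(y_hat, y):
    """Mean squared error with the conventional extra factor of 1/2."""
    residuals = np.asarray(y_hat) - np.asarray(y)
    return 0.5 * np.mean(residuals ** 2)

y_true = np.array([1.0, 2.0, 3.0])
good_fit = np.array([1.1, 1.9, 3.0])   # predictions close to the labels
bad_fit = np.array([3.0, 0.0, 5.0])    # predictions far from the labels

print(half_mse(good_fit, y_true))  # a small value
print(half_mse(bad_fit, y_true))   # 2.0
```

The poorly fitting predictions are each off by 2, so the squared residuals average to 4 and the halved loss is 2.0.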
Training
For linear regression with MSE loss, there’s a closed-form solution: take partial derivatives of $J$ with respect to each weight $w_j$, set them to zero, and solve the resulting system of equations. In matrix form, stacking the inputs (with a leading column of 1s for the intercept) into a design matrix $X$ and the targets into a vector $\mathbf{y}$, this gives the normal equation:
$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$$
This is the exact best $\mathbf{w}$ in one calculation, with no iteration, provided $X^\top X$ is invertible. In practice we don’t form $(X^\top X)^{-1}$ explicitly; numerically stable implementations (including scikit-learn’s LinearRegression) solve the system via SVD or a least-squares routine.
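To make the closed-form route concrete, here is a minimal sketch assuming NumPy and synthetic data generated from a known line. The data and variable names are invented for illustration. It solves the normal equation as a linear system and compares the result with NumPy's SVD-based least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line, y = 2 + 3x, plus a little noise.
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Design matrix with a leading column of 1s for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Normal equation, solved as a linear system rather than by inverting X^T X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least-squares routine; usually the numerically safer choice.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)  # close to [2, 3]
print(w_lstsq)   # agrees with w_normal up to rounding
```

On a small, well-conditioned problem like this the two answers match. The difference matters when features are nearly collinear and the system becomes ill-conditioned.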
For more complex models — neural networks with millions of parameters, or any model that isn’t linear in its parameters — the closed form breaks down. Solving the equations becomes intractable. The fallback is Gradient descent, which works on linear regression too and generalizes to everything else.
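The gradient of $J$ with respect to $w_0$ and $w_1$, for the single-feature model with $n$ training examples $(x^{(i)}, y^{(i)})$:
$$\frac{\partial J}{\partial w_0} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right), \qquad \frac{\partial J}{\partial w_1} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}$$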
The factor of 2 from differentiating the square cancels the 1/2 in the loss formula — the reason that 1/2 is there in the first place.
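One way to sanity-check gradient formulas of this kind is to compare them with finite differences. The sketch below assumes the single-feature model, a loss equal to half the mean of the squared errors, and NumPy. The names `loss` and `gradients` and the data are hypothetical choices for illustration:

```python
import numpy as np

def loss(w0, w1, x, y):
    """Halved mean squared error for the model y_hat = w0 + w1 * x."""
    return 0.5 * np.mean((w0 + w1 * x - y) ** 2)

def gradients(w0, w1, x, y):
    """Analytic partial derivatives of the loss with respect to w0 and w1."""
    error = w0 + w1 * x - y
    return np.mean(error), np.mean(error * x)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w0, w1, eps = 0.5, 1.5, 1e-6

# Central finite differences approximate each partial derivative numerically.
num_g0 = (loss(w0 + eps, w1, x, y) - loss(w0 - eps, w1, x, y)) / (2 * eps)
num_g1 = (loss(w0, w1 + eps, x, y) - loss(w0, w1 - eps, x, y)) / (2 * eps)

g0, g1 = gradients(w0, w1, x, y)
print(g0, num_g0)  # the two values should agree closely
print(g1, num_g1)
```

Because the analytic gradients carry no stray factor of 2, they match the numerical estimates only when the loss includes the 1/2.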
The training loop:
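- Initialize $w_0$ and $w_1$, often to zero.
- Compute predictions $\hat{y}^{(i)}$ for every training example.
- Compute the loss $J$.
- Compute the two gradients above.
- Update: $w_0 \leftarrow w_0 - \alpha \frac{\partial J}{\partial w_0}$, $w_1 \leftarrow w_1 - \alpha \frac{\partial J}{\partial w_1}$, where $\alpha$ is the Learning rate.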
- Repeat until the loss stops improving.
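The steps above can be sketched as a short batch gradient descent routine. This is one possible implementation under stated assumptions: a single feature, the halved MSE loss, NumPy, and a hypothetical function name `fit_gd` whose learning rate and stopping tolerance were picked for this toy data rather than taken from the text.

```python
import numpy as np

def fit_gd(x, y, alpha=0.1, max_steps=10_000, tol=1e-12):
    """Batch gradient descent for y_hat = w0 + w1 * x with halved MSE loss."""
    w0, w1 = 0.0, 0.0                      # initialize to zero
    prev_loss = np.inf
    for _ in range(max_steps):
        y_hat = w0 + w1 * x                # predictions for every example
        error = y_hat - y
        loss = 0.5 * np.mean(error ** 2)   # current loss
        if prev_loss - loss < tol:         # stop once the loss stops improving
            break
        prev_loss = loss
        grad_w0 = np.mean(error)           # the two gradients
        grad_w1 = np.mean(error * x)
        w0 -= alpha * grad_w0              # update step
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                          # noise-free line: intercept 1, slope 2
w0, w1 = fit_gd(x, y)
print(w0, w1)  # approaches 1.0 and 2.0
```

A learning rate that is too large makes the loss grow instead of shrink. The stopping test here also exits in that case, since the loss is then no longer improving.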
Limitations
Linear regression assumes the relationship between input and output is linear. Many real relationships aren’t. For curvature, use Polynomial regression. For classification, the linear function gets wrapped in a sigmoid to become Logistic regression. For non-linear decision boundaries that can’t be expressed as polynomials, use neural networks, decision trees, or kernel methods.
In scikit-learn:
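from sklearn.linear_model import LinearRegression

# X_train, y_train, and X_test are assumed to be existing arrays (one row per example).
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

This uses a direct least-squares solver, with no iteration. For very large datasets, SGDRegressor fits the model with stochastic gradient descent instead.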
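For comparison, a hedged sketch that fits the same synthetic data with both estimators. The data, the pipeline with `StandardScaler`, and the `SGDRegressor` settings are illustrative choices and not recommendations from the text. Note that `SGDRegressor` applies a small L2 penalty by default, so its result is close to the least-squares fit but not identical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 4 + 2*x1 - 3*x2 + noise.
X = rng.normal(size=(1000, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Direct least-squares fit.
ols = LinearRegression().fit(X, y)

# SGD is sensitive to feature scale, so standardize the inputs first.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-6, random_state=0),
)
sgd.fit(X, y)

X_new = np.array([[1.0, 1.0], [0.0, -2.0]])
print(ols.predict(X_new))  # close to [3, 10]
print(sgd.predict(X_new))  # similar, but only approximately
```

On data this small the direct solver is the simpler choice. The stochastic version is aimed at data sets too large to solve in one pass.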