Gradient

For a scalar function $f (x_{1}, x_{2}, \dots, x_{n})$ — a function that takes $n$ inputs and produces a single output — the gradient vector $\nabla f$ is the vector of partial derivatives:

$\nabla f = (\frac{\partial f}{\partial x _{1}}, \frac{\partial f}{\partial x _{2}}, \dots, \frac{\partial f}{\partial x _{n}})$

The partial derivative $\partial f / \partial x_{i}$ measures how fast $f$ changes when we increase $x_{i}$ slightly while keeping the other inputs fixed. The gradient packages all of these rates of change, one per input, into a single vector.
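The "nudge one input, hold the rest fixed" reading of a partial derivative can be sketched numerically. The snippet below is a minimal illustration using central differences; the helper `numerical_gradient`, the step size `h`, and the test function `g` are assumptions made for this example, not part of the original notes.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at the point x with central differences."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h    # nudge only x_i up, other inputs stay fixed
        backward[i] -= h   # nudge only x_i down
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def g(x):
    # Hypothetical test function g(x1, x2) = x1^2 + 3*x2,
    # whose exact gradient is (2*x1, 3).
    return x[0] ** 2 + 3 * x[1]


approx = numerical_gradient(g, [1.0, 2.0])  # should be close to (2, 3)
print(approx)
```

Each entry of `approx` is one partial derivative, so the list as a whole plays the role of $\nabla g$ at the chosen point.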

Two facts about the gradient are important:

The gradient points in the direction of steepest ascent. If we’re standing at some point on a hilly landscape — where the height is given by $f$ — and we want to know which direction up the hill is steepest, the gradient points there. The magnitude of the gradient tells us how steep.

The negative of the gradient points in the direction of steepest descent. If we want to go downhill as fast as possible, we move in the direction $- \nabla f$ . This is the fact that makes Gradient descent work.

A worked example

Take

$f (x_{1}, x_{2}) = x_{1}^{2} + 2 x_{1} + 3 x_{1} x_{2}$

Partial derivative with respect to $x_{1}$ : the first term $x_{1}^{2}$ differentiates to $2 x_{1}$ , the second term $2 x_{1}$ differentiates to $2$ , the third term $3 x_{1} x_{2}$ differentiates to $3 x_{2}$ (treating $x_{2}$ as a constant). So

$\frac{\partial f}{\partial x _{1}} = 2 x_{1} + 2 + 3 x_{2}$

Partial derivative with respect to $x_{2}$ : the first two terms have no $x_{2}$ , so they differentiate to 0; the third differentiates to $3 x_{1}$ .

$\frac{\partial f}{\partial x _{2}} = 3 x_{1}$

The gradient is

$\nabla f = (2 x_{1} + 2 + 3 x_{2}, 3 x_{1})$

At any specific point $(x_{1}, x_{2})$ , plug in the coordinates to get a specific vector that points uphill from there. Negate it to point downhill.
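As a quick check of the worked example, the sketch below plugs in the point $(1, 2)$, which is an arbitrary choice for illustration, and compares the analytic gradient against a central-difference estimate. The function names are hypothetical.

```python
def f(x1, x2):
    return x1 ** 2 + 2 * x1 + 3 * x1 * x2


def grad_f(x1, x2):
    # Analytic gradient from the derivation above
    return (2 * x1 + 2 + 3 * x2, 3 * x1)


point = (1.0, 2.0)
uphill = grad_f(*point)                  # (10.0, 3.0)
downhill = tuple(-g for g in uphill)     # (-10.0, -3.0)

# Central-difference estimates of the two partial derivatives
h = 1e-6
fd_x1 = (f(point[0] + h, point[1]) - f(point[0] - h, point[1])) / (2 * h)
fd_x2 = (f(point[0], point[1] + h) - f(point[0], point[1] - h)) / (2 * h)

print(uphill, downhill, (fd_x1, fd_x2))
```

At $(1, 2)$ the gradient is $(10, 3)$: a small step in that direction increases $f$ fastest, and a small step along $(-10, -3)$ decreases it fastest.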

Why this matters for machine learning

In Supervised learning, we have a Loss function $J (w)$ that depends on the model’s parameter vector $w$ . The job of training is to find $w$ that minimizes $J$ . The negative gradient $- \nabla J$ points in the direction of steepest decrease of $J$ , so taking a step in that direction reduces the loss. Iterating this is Gradient descent.
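The "step along $-\nabla J$, then repeat" loop can be sketched as follows. The quadratic loss, the learning rate of 0.1, and the step count are all assumptions picked so the example converges; a real model's loss would come from data.

```python
def loss(w):
    # Hypothetical loss J(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1)
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def grad_loss(w):
    # Analytic gradient of the loss above
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=100):
    history = [loss(w)]
    for _ in range(steps):
        g = grad_loss(w)
        # Move a small distance in the direction of the negative gradient
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
        history.append(loss(w))
    return w, history


w_final, history = gradient_descent([0.0, 0.0])
print(w_final, history[0], history[-1])
```

With this loss and step size the iterates approach $(3, -1)$ and the recorded loss shrinks toward zero. A learning rate that is too large can overshoot and make the loss grow instead, so the step size is a tuning choice rather than something the gradient dictates.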

For a function with one input, the gradient reduces to the ordinary derivative. For a function with many inputs, like a neural-network loss with millions of weight parameters, the gradient has one component per parameter, so it is a vector with millions of entries. Computing it efficiently is the job of the backpropagation algorithm.

In vector calculus

Three geometric facts get used constantly in Vector Calculus and Complex Analysis.

Perpendicular to level surfaces. At every point, $\nabla f$ is perpendicular to the level surface $\{f = c\}$ passing through that point. Proof: parameterize any curve $r (t)$ on the level surface, so $f (r (t)) = c$ is constant. Differentiating with respect to $t$ by the chain rule gives $\nabla f (r (t)) \cdot r' (t) = 0$. Since $r'$ is tangent to the surface and the equation holds for any such curve, $\nabla f$ is perpendicular to every tangent direction, i.e., perpendicular to the surface.

This is what makes the gradient the normal vector to a level surface, used everywhere from defining surface normals in flux integrals to identifying the steepest-ascent direction (steepest ascent must be perpendicular to “no change,” which is the level surface).
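The perpendicularity argument can be checked numerically in two dimensions. The sketch below assumes $f(x, y) = x^{2} + y^{2}$, whose level curves are circles, and evaluates $\nabla f \cdot r'(t)$ at a few points on the level curve $f = 4$; the function and the radius are illustrative choices.

```python
import math


def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2
    return (2 * x, 2 * y)


radius = 2.0  # the level curve f = 4 is a circle of radius 2
dots = []
for k in range(8):
    t = 2 * math.pi * k / 8
    point = (radius * math.cos(t), radius * math.sin(t))       # r(t)
    tangent = (-radius * math.sin(t), radius * math.cos(t))    # r'(t)
    g = grad_f(*point)
    dots.append(g[0] * tangent[0] + g[1] * tangent[1])         # grad f . r'(t)

print(dots)
```

Every dot product comes out as zero up to rounding, matching $\nabla f (r) \cdot r'(t) = 0$ from the proof.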

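Directional derivative formula. The rate of change of $f$ in any unit direction $\hat{u}$ is $D_{\hat{u}} f = \nabla f \cdot \hat{u} = |\nabla f| \cos θ$, where $θ$ is the angle between $\nabla f$ and $\hat{u}$. It is maximized at $θ = 0$, when $\hat{u}$ aligns with $\nabla f$, and the maximum value is $|\nabla f|$. This confirms “steepest ascent direction” and “magnitude = how steep”. It is also zero for any $\hat{u}$ tangent to a level surface, which is why this formula is the calculus tool behind both Gradient descent and walking along level surfaces.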
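To see the formula at work, the sketch below scans unit directions in one-degree steps and picks the one with the largest directional derivative. It uses the gradient of the worked example evaluated at $(1, 2)$, which is $(10, 3)$; the evaluation point and the scan resolution are assumptions for illustration.

```python
import math

grad = (10.0, 3.0)            # gradient of the worked example at (1, 2)
norm = math.hypot(*grad)      # |grad f|


def directional_derivative(grad, angle):
    # Unit direction u = (cos(angle), sin(angle)), then D_u f = grad . u
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]


angles = [math.radians(d) for d in range(360)]
best_angle = max(angles, key=lambda a: directional_derivative(grad, a))
grad_angle = math.atan2(grad[1], grad[0])   # direction of the gradient itself

print(math.degrees(best_angle), math.degrees(grad_angle))
print(directional_derivative(grad, grad_angle), norm)
```

The best scanned direction lands within half a degree of the gradient's own direction, and the directional derivative along the gradient equals $|\nabla f|$ up to rounding.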

Conservative field structure. Vector fields of the form $F = \nabla ϕ$ are called conservative fields (or gradient fields). These are the special fields whose line integrals depend only on endpoints — see Fundamental theorem of line integrals.
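Path independence can be illustrated numerically. The sketch below assumes a potential $ϕ(x, y) = x^{2} y$, so $F = \nabla ϕ = (2xy, x^{2})$, and integrates $F$ along two different paths from $(0, 0)$ to $(1, 1)$ using a simple midpoint rule. The potential, the paths, and the helper `line_integral` are all illustrative choices.

```python
def phi(x, y):
    # Hypothetical potential function
    return x ** 2 * y


def F(x, y):
    # Gradient field F = grad(phi)
    return (2 * x * y, x ** 2)


def line_integral(path, n=1000):
    """Approximate the integral of F . dr along path(t), t in [0, 1]."""
    total = 0.0
    for k in range(n):
        x0, y0 = path(k / n)
        x1, y1 = path((k + 1) / n)
        fx, fy = F((x0 + x1) / 2, (y0 + y1) / 2)    # field at segment midpoint
        total += fx * (x1 - x0) + fy * (y1 - y0)    # F . (segment displacement)
    return total


straight = line_integral(lambda t: (t, t))        # straight line to (1, 1)
parabola = line_integral(lambda t: (t, t * t))    # parabolic arc to (1, 1)
expected = phi(1, 1) - phi(0, 0)                  # difference of potentials

print(straight, parabola, expected)
```

Both integrals agree with $ϕ(1, 1) - ϕ(0, 0) = 1$ to within the discretization error, even though the paths differ.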

Gradient in curvilinear coordinates

The coordinate expression depends on whether you’re working in Cartesian, cylindrical, or spherical coordinates. In the curvilinear systems the unit vectors depend on position, and each partial derivative is divided by the scale factor of its coordinate:

Cartesian $(x, y, z)$ :

$\nabla f = \frac{\partial f}{\partial x} \hat{i} + \frac{\partial f}{\partial y} \hat{j} + \frac{\partial f}{\partial z} \hat{k} .$

Cylindrical $(ρ, ϕ, z)$ :

$\nabla f = \frac{\partial f}{\partial ρ} \hat{ρ} + \frac{1}{ρ} \frac{\partial f}{\partial ϕ} \hat{ϕ} + \frac{\partial f}{\partial z} \hat{z} .$

Spherical $(r, θ, ϕ)$ :

$\nabla f = \frac{\partial f}{\partial r} \hat{r} + \frac{1}{r} \frac{\partial f}{\partial θ} \hat{θ} + \frac{1}{r \sin θ} \frac{\partial f}{\partial ϕ} \hat{ϕ} .$

The $1/ρ$, $1/r$, $1/(r \sin θ)$ factors are the scale-factor reciprocals, coming from the curvilinear differential length elements ($ρ \, d ϕ$ in cylindrical coordinates; $r \, d θ$ and $r \sin θ \, d ϕ$ in spherical coordinates). Same story as for Divergence and Curl in curvilinear coordinates: the scale factors propagate through every differential operator.
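One way to convince yourself of the $1/ρ$ factor is to compute the same gradient twice. The sketch below assumes $f = x y + z$, which in cylindrical coordinates is $ρ^{2} \cos ϕ \sin ϕ + z$, applies the cylindrical formula, rotates the components back to Cartesian axes, and compares with the Cartesian gradient $(y, x, 1)$. The function and sample point are illustrative choices.

```python
import math


def grad_cartesian(x, y, z):
    # f(x, y, z) = x*y + z, so grad f = (y, x, 1)
    return (y, x, 1.0)


def grad_cylindrical(rho, phi, z):
    # Same f written as rho^2 * cos(phi) * sin(phi) + z
    df_drho = 2 * rho * math.cos(phi) * math.sin(phi)
    df_dphi = rho ** 2 * (math.cos(phi) ** 2 - math.sin(phi) ** 2)
    df_dz = 1.0
    # Components along rho-hat, phi-hat, z-hat; note the 1/rho on the phi term
    return (df_drho, df_dphi / rho, df_dz)


def to_cartesian_components(components, phi):
    # rho-hat = (cos phi, sin phi, 0), phi-hat = (-sin phi, cos phi, 0)
    a_rho, a_phi, a_z = components
    return (
        a_rho * math.cos(phi) - a_phi * math.sin(phi),
        a_rho * math.sin(phi) + a_phi * math.cos(phi),
        a_z,
    )


rho, phi, z = 2.0, 0.7, 5.0
x, y = rho * math.cos(phi), rho * math.sin(phi)

from_cyl = to_cartesian_components(grad_cylindrical(rho, phi, z), phi)
from_cart = grad_cartesian(x, y, z)
print(from_cyl, from_cart)
```

The two results match up to rounding. Dropping the division by `rho` in `grad_cylindrical` breaks the agreement whenever $ρ \neq 1$, which is the scale factor doing its job.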

Idriss Rami — Notes
