Optima

Where root-finding asks what input drives the output to zero?, optimization asks the dual question: what input drives the output to its largest (or smallest) value? “Largest” and “smallest” only make sense for an ordered output, so the function we study is now a scalar field f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} — vector input, single real output — and the target has shifted from finding roots (inputs at which ff vanishes) to finding extrema (also called optima), inputs at which ff attains a largest or smallest value.

Two independent distinctions sharpen the idea:

  • maximum vs minimum — largest value vs smallest value.
  • global vs local — largest/smallest across the entire domain vs largest/smallest only within some neighborhood of the candidate point.

That gives four cases, each a different way a point x0D\mathbf{x}_0 \in D can be extremal for ff.

Global Extrema

A global extremum compares f(x0)f(\mathbf{x}_0) against ff‘s value at every other point of the domain.

Let f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} be a scalar field. A point x0D\mathbf{x}_0 \in D is a global maximum of ff if

f(x0)f(x)xDf(\mathbf{x}_0) \geq f(\mathbf{x}) \quad \forall \mathbf{x} \in D

and a global minimum of ff if

f(x0)f(x)xDf(\mathbf{x}_0) \leq f(\mathbf{x}) \quad \forall \mathbf{x} \in D

In either case f(x0)f(\mathbf{x}_0) is the corresponding global maximum value or global minimum value of ff, and x0\mathbf{x}_0 is a global extremum — collectively global extrema.

In words: nowhere in the domain does ff go higher than f(x0)f(\mathbf{x}_0) (for a maximum) or lower (for a minimum). The condition is strong precisely because the comparison ranges over all of DD — there is nowhere left for a counterexample to hide.

Local Extrema

A local extremum weakens the comparison: x0\mathbf{x}_0 only has to win on its own turf — some ε-ball around the point.

Let f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} be a scalar field. A point x0D\mathbf{x}_0 \in D is a local maximum of ff if there exists some ε>0\varepsilon > 0 such that

f(x0)f(x)xBε(x0)f(\mathbf{x}_0) \geq f(\mathbf{x}) \quad \forall \mathbf{x} \in B_\varepsilon(\mathbf{x}_0)

and a local minimum of ff if there exists some ε>0\varepsilon > 0 such that

f(x0)f(x)xBε(x0)f(\mathbf{x}_0) \leq f(\mathbf{x}) \quad \forall \mathbf{x} \in B_\varepsilon(\mathbf{x}_0)

In either case f(x0)f(\mathbf{x}_0) is the corresponding local maximum value or local minimum value, and x0\mathbf{x}_0 is a local extremum — collectively local extrema.

The existential ε>0\exists \varepsilon > 0 does the heavy lifting: it says some neighborhood, however small, witnesses the inequality. A point on a small hilltop with a much higher peak elsewhere in the domain still counts as a local maximum, because we are only asked to beat the immediate neighborhood.

Every global extremum is automatically a local extremum — any neighborhood works once the inequality already holds globally. The converse fails: a function can have many local maxima but only one global maximum (or none at all).

Strict Extrema

The four definitions above use non-strict inequalities (\geq, \leq), which allow ties: another point in the comparison region can attain the same extreme value as x0\mathbf{x}_0 and the definition still applies. Every point on a flat plateau, for instance, qualifies as a maximum together with the others.

If the inequalities are tightened to strict ones — >> in place of \geq, << in place of \leq — then x0\mathbf{x}_0 is the unique point in the comparison region at which the extreme value is attained. Such a point is called a strict local or global extremum:

  • A strict global maximum satisfies f(x0)>f(x)f(\mathbf{x}_0) > f(\mathbf{x}) for every xD\mathbf{x} \in D with xx0\mathbf{x} \neq \mathbf{x}_0, and analogously for a strict global minimum with <<.
  • A strict local maximum satisfies f(x0)>f(x)f(\mathbf{x}_0) > f(\mathbf{x}) for every xBε(x0)\mathbf{x} \in B_\varepsilon(\mathbf{x}_0) with xx0\mathbf{x} \neq \mathbf{x}_0, for some ε>0\varepsilon > 0, and analogously for a strict local minimum with <<.

Strict extrema are isolated in their comparison region — no other point comes within reach of the extreme value, so the maximum (or minimum) is genuinely a single peak (or trough) rather than the highest tier of several tied points.

Stationary Points

How do we actually find an extremum? In 1D, a smooth ff has every local extremum either at a point where ff' vanishes or on the boundary of the domain. The exact same dichotomy lifts to higher dimensions, with the gradient f\nabla f playing the role of ff'.

For a continuously partially differentiable scalar field f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R}, every local extremum at an inner point of DD occurs at a point where the gradient vanishes:

x0 is a local extremum of f at an inner point    f(x0)=0\mathbf{x}_0 \text{ is a local extremum of } f \text{ at an inner point} \implies \nabla f(\mathbf{x}_0) = \mathbf{0}

The only other candidates for a local extremum are points on the boundary of DD.

Why the gradient must vanish at inner extrema

Suppose x0\mathbf{x}_0 is a local extremum at an inner point of DD. Fix every coordinate of x0\mathbf{x}_0 except the ii-th and watch how ff varies along that one axis. Concretely, define

gi(t)=f(x0+tei)g_i(t) = f(\mathbf{x}_0 + t\,\mathbf{e}_i)

— a 1D function that takes a step of size tt along the ii-th coordinate axis and reads off the corresponding ff-value. For tt in a small interval (δ,δ)(-\delta, \delta) this is well-defined: x0\mathbf{x}_0 is an inner point, so a small ball around it stays inside DD.

Since x0\mathbf{x}_0 is a local extremum of ff, the 1D function gig_i inherits the property — it has a 1D local extremum at t=0t = 0. The 1D criterion then forces gi(0)=0g_i'(0) = 0, and by the definition of the partial derivative,

gi(0)=fxi(x0)=0g_i'(0) = f_{x_i}(\mathbf{x}_0) = 0

Repeating this for every axis i=1,,ni = 1, \ldots, n gives fx1(x0)==fxn(x0)=0f_{x_1}(\mathbf{x}_0) = \cdots = f_{x_n}(\mathbf{x}_0) = 0, which is exactly f(x0)=0\nabla f(\mathbf{x}_0) = \mathbf{0}.

The argument fails on the boundary because x0\mathbf{x}_0 may not have a full ball around it inside DD — moving by tei-t\,\mathbf{e}_i might step outside the domain. The 1D restriction gig_i may then only be defined on one side of t=0t = 0, which is too little for the 1D extremum criterion to apply.

So the points where the gradient vanishes deserve their own name — they are exactly the candidates for inner extrema.

A point aD\mathbf{a} \in D is a stationary point (also called a critical point) of a scalar field f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} if the gradient vanishes there:

f(a)=0\nabla f(\mathbf{a}) = \mathbf{0}

Practically, this is what makes the problem tractable. Instead of hunting for extrema across all of DD (an uncountable set), we hunt for the finitely many roots of f=0\nabla f = \mathbf{0} in the interior, and check the boundary of DD separately. The boundary side requires different machinery and is not pursued here; the rest of this chapter focuses on the inner, stationary-point side.

Saddle Points

The implication runs in only one direction. Every inner local extremum is a stationary point — but not every stationary point is an extremum. The gradient can vanish at x0\mathbf{x}_0 without the function attaining a local maximum or minimum there at all.

The smallest example sits in R2\mathbb{R}^2. Consider

f:R2R,f(x,y)=x3f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = x^3

The gradient f(x,y)=(3x2,0)\nabla f(x, y) = (3x^2, 0)^\top vanishes at the origin, so (0,0)(0, 0) is stationary. Yet it is not a local extremum: in any neighborhood of (0,0)(0, 0) the function takes both positive values (just to the right, where f(x,0)=x3>0f(x, 0) = x^3 > 0) and negative ones (just to the left, where f(x,0)=x3<0f(x, 0) = x^3 < 0), so f(0,0)=0f(0, 0) = 0 is neither smallest nor largest nearby.

f:R2R,f(x,y)=x3,f(x,y)=(3x20)f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = x^3, \quad \nabla f(x, y) = \begin{pmatrix} 3x^2 \\ 0 \end{pmatrix}

The surface is a ramp that does not depend on yy: flat along the yy-axis, ascending cubically along the xx-axis. The gradient vanishes not only at the origin but along the entire line x=0x = 0 — every point of the form (0,y0)(0, y_0) is stationary, since the formula f=(3x2,0)\nabla f = (3x^2, 0)^\top depends only on xx. Yet none of these points is a local extremum: moving in either ±x\pm x direction immediately wins or loses height, so the ramp briefly flattens but never peaks or troughs.

Such points get their own name.

Let f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} be a scalar field. A saddle point of ff is a stationary point aD\mathbf{a} \in D that is neither a local maximum nor a local minimum of ff.

The x3x^3 example above is a saddle point in this technical sense, but it does not look like one — it’s a flat ramp that happens to plateau along x=0x = 0. The geometric archetype, the surface that gives the name, is

f:R2R,f(x,y)=x2y2f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = x^2 - y^2
f:R2R,f(x,y)=x2y2,f(x,y)=(2x2y)f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = x^2 - y^2, \quad \nabla f(x, y) = \begin{pmatrix} 2x \\ -2y \end{pmatrix}

At the origin the gradient f(x,y)=(2x,2y)\nabla f(x, y) = (2x, -2y)^\top vanishes, so (0,0)(0, 0) is stationary. But the curvature pulls in opposite ways along the two axes:

  • Along the xx-axis (y=0y = 0), f(x,0)=x2f(x, 0) = x^2 has a local minimum at x=0x = 0.
  • Along the yy-axis (x=0x = 0), f(0,y)=y2f(0, y) = -y^2 has a local maximum at y=0y = 0.

Walking out from (0,0)(0, 0) along xx takes you uphill; walking out along yy takes you downhill. Neither side wins, and the surface twists into the horse-saddle shape every rider can picture — uphill front-and-back, downhill side-to-side. The name “saddle point” generalizes from this prototype to any stationary point with the same defining feature: zero gradient, no local extremum.

The takeaway is that the gradient alone cannot tell extrema apart from saddle points: f(x0)=0\nabla f(\mathbf{x}_0) = \mathbf{0} identifies a candidate, but distinguishing the cases requires looking at second-order information through the Hessian.

Definiteness of a Symmetric Matrix

To carry the 1D second-derivative test into nn dimensions, we first need a way to talk about the “sign” of a matrix. The 1D second derivative f(a)f''(a) is a single number — positive means the curve opens upward (local minimum), negative means it opens downward (local maximum), zero is inconclusive. In nn dimensions the second derivative is the whole Hessian matrix, and curvature now varies with direction: a saddle point literally curves upward along one axis and downward along another. The single-number “sign” picture has to be replaced with something richer.

The replacement is definiteness. For a symmetric matrix MM, the quadratic form xMx\mathbf{x}^\top M \mathbf{x} produces a single scalar for each input xRn\mathbf{x} \in \mathbb{R}^n, and the sign of that scalar — across all non-zero x\mathbf{x} — is exactly what classifies MM.

A symmetric matrix MRn×nM \in \mathbb{R}^{n \times n} is

  • positive definite if xMx>0\mathbf{x}^\top M \mathbf{x} > 0 for all xRn{0}\mathbf{x} \in \mathbb{R}^n \setminus \{\mathbf{0}\},
  • negative definite if xMx<0\mathbf{x}^\top M \mathbf{x} < 0 for all xRn{0}\mathbf{x} \in \mathbb{R}^n \setminus \{\mathbf{0}\},
  • positive semi-definite if xMx0\mathbf{x}^\top M \mathbf{x} \geq 0 for all xRn\mathbf{x} \in \mathbb{R}^n,
  • negative semi-definite if xMx0\mathbf{x}^\top M \mathbf{x} \leq 0 for all xRn\mathbf{x} \in \mathbb{R}^n.

If MM is neither positive nor negative semi-definite, it is indefinite.

The cases form a graded scale. Strict positivity of the form everywhere off the origin is positive definite; relaxing the strict inequality to allow zero values is positive semi-definite. The negative side mirrors this, and indefinite is what’s left: the form is positive in some directions and negative in others as x\mathbf{x} varies, so neither semi-definite condition can hold.

The exclusion of x=0\mathbf{x} = \mathbf{0} in the strict cases is mandatory, not cosmetic — 0M0=0\mathbf{0}^\top M \mathbf{0} = 0 for every MM, so without the exclusion the strict inequality >0> 0 would fail at the origin for every matrix and the condition would be unsatisfiable.

Eigenvalue Criterion

Checking the inequality xMx>0\mathbf{x}^\top M \mathbf{x} > 0 at every non-zero direction in Rn\mathbb{R}^n is, taken literally, an infinite test. The shortcut is to read the definiteness off the eigenvalues of MM. Let λi\lambda_i be an eigenvalue of MM — guaranteed to be real, since MM is symmetric. Then:

  • MM is positive definite     \iff all λi>0\lambda_i > 0,
  • MM is negative definite     \iff all λi<0\lambda_i < 0,
  • MM is positive semi-definite     \iff all λi0\lambda_i \geq 0,
  • MM is negative semi-definite     \iff all λi0\lambda_i \leq 0,
  • MM is indefinite     \iff some λi>0\lambda_i > 0 and some λj<0\lambda_j < 0.
Intuition: why eigenvalue signs decide the form’s sign

The picture becomes concrete once we view it in the right coordinate system. For a symmetric MM, the eigenvectors can be chosen as an orthonormal basis v1,,vn\mathbf{v}_1, \ldots, \mathbf{v}_nnn perpendicular unit axes — and along each axis MM does nothing but scale by its eigenvalue. Writing any x\mathbf{x} in this eigenbasis as x=c1v1++cnvn\mathbf{x} = c_1 \mathbf{v}_1 + \cdots + c_n \mathbf{v}_n, the quadratic form collapses into a clean weighted sum of squares:

xMx=λ1c12+λ2c22++λncn2\mathbf{x}^\top M \mathbf{x} = \lambda_1 c_1^2 + \lambda_2 c_2^2 + \cdots + \lambda_n c_n^2

The squared coefficients ci2c_i^2 are non-negative regardless of which x\mathbf{x} we picked, so only the eigenvalues control the sign of each term — and the form is just their sum. If every λi>0\lambda_i > 0, the form is a sum of non-negative pieces and comes out positive for every non-zero x\mathbf{x} (positive definite). If every λi<0\lambda_i < 0, the form is negative for every non-zero x\mathbf{x}. Mixed signs let some terms push the sum up and others pull it down: depending on which eigen-axes x\mathbf{x} leans toward, the form lands positive or negative — that is exactly the indefinite case. A zero eigenvalue erases its axis’s contribution entirely, so the form vanishes along that direction even when x0\mathbf{x} \neq \mathbf{0} — the loophole that separates semi-definite from definite.

For diagonal matrices the eigenvalues sit on the diagonal, so the criterion is immediate. For general symmetric matrices it reduces an infinite test to a finite computation: find the roots of the characteristic polynomial and inspect their signs.

Classifying Stationary Points

Both pieces are now in place. The Hessian of a C2C^2 scalar field is symmetric, so its definiteness is well-defined — and at a stationary point of ff, that definiteness decides which of the four cases (local minimum, local maximum, saddle, inconclusive) the point falls into.

Let f:DRnRf : D \subseteq \mathbb{R}^n \to \mathbb{R} with DD open be a twice continuously partially differentiable scalar field, and let x0D\mathbf{x}_0 \in D be a stationary point of ff. Then:

  • Hf(x0)H_f(\mathbf{x}_0) negative definite     \implies x0\mathbf{x}_0 is a local maximum of ff,
  • Hf(x0)H_f(\mathbf{x}_0) positive definite     \implies x0\mathbf{x}_0 is a local minimum of ff,
  • Hf(x0)H_f(\mathbf{x}_0) indefinite     \implies x0\mathbf{x}_0 is a saddle point of ff,
  • Hf(x0)H_f(\mathbf{x}_0) semi-definite but not definite     \implies the test is inconclusive.

In eigenvalue terms, the semi-definite but not definite case is exactly when at least one eigenvalue of Hf(x0)H_f(\mathbf{x}_0) is zero and the rest share a single sign — no mixed signs to push it indefinite, and a zero eigenvalue blocking strict definiteness.

Intuition from Taylor approximation

Each case maps cleanly onto the 1D intuition. The connection runs through the second-order Taylor approximation: near a stationary point, the expansion gives

f(x0+u)f(x0)+12uHf(x0)uf(\mathbf{x}_0 + \mathbf{u}) \approx f(\mathbf{x}_0) + \tfrac{1}{2}\, \mathbf{u}^\top H_f(\mathbf{x}_0)\, \mathbf{u}

— the gradient term has dropped out because f(x0)=0\nabla f(\mathbf{x}_0) = \mathbf{0}. So the change in ff as we step away from x0\mathbf{x}_0 is, to leading order, the quadratic form uHf(x0)u\mathbf{u}^\top H_f(\mathbf{x}_0)\, \mathbf{u}. If that form is positive in every direction (positive definite Hessian), ff goes up no matter which way we step — a local minimum. Negative in every direction is a local maximum. Mixed signs across directions (indefinite Hessian) means ff rises along some directions and drops along others — the saddle picture from the previous section. And when the form is non-negative or non-positive but vanishes along at least one direction (semi-definite but not definite), the second-order term cannot decide on its own, just as f(a)=0f''(a) = 0 leaves the 1D test inconclusive — a higher-order analysis would be needed to settle the case.

Putting it all together gives a complete recipe for finding the inner extrema of a C2C^2 scalar field on an open domain:

  1. Compute f\nabla f and solve f(x)=0\nabla f(\mathbf{x}) = \mathbf{0} to find all stationary points.
  2. At each stationary point x0\mathbf{x}_0, compute the Hessian Hf(x0)H_f(\mathbf{x}_0).
  3. Determine its definiteness — typically via the eigenvalue signs — and classify x0\mathbf{x}_0 as local minimum, local maximum, saddle point, or inconclusive.

2D saddle-point shortcut. The determinant of any square matrix equals the product of its eigenvalues. Specializing to a 2×22 \times 2 Hessian, det(Hf(x0))=λ1λ2\det(H_f(\mathbf{x}_0)) = \lambda_1 \lambda_2 — so a negative determinant forces the two eigenvalues to opposite signs, making the (symmetric) Hessian indefinite. At a stationary point x0R2\mathbf{x}_0 \in \mathbb{R}^2, the single check det(Hf(x0))<0\det(H_f(\mathbf{x}_0)) < 0 is therefore enough to conclude x0\mathbf{x}_0 is a saddle point — no eigenvalue computation needed.