Taylor Expansion

Recall from 1D calculus that the Taylor polynomial of order $k$ approximates a smooth function $f : \mathbb{R} \to \mathbb{R}$ near a base point $a$ using its first $k$ derivatives:

$$T_{k,f,a}(x) = \sum_{i=0}^{k} \frac{f^{(i)}(a)}{i!}(x - a)^i$$

The zeroth-order polynomial is just the constant value $f(a)$, a flat line at that height. The first-order polynomial adds the linear correction $f'(a)(x - a)$, producing the tangent line at $a$. The second-order polynomial adds the quadratic correction $\tfrac{1}{2}f''(a)(x - a)^2$, bending the line into a parabola whose curvature matches $f$ at $a$. Each step uses one more piece of local information to track $f$ better near the base point.

The same construction extends to scalar fields with the gradient playing the role of $f'$ and the Hessian playing the role of $f''$.

Multivariate Taylor Polynomials

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a twice (i.e. second-order) partially differentiable scalar field on an open and convex domain $D$, and let $\mathbf{a} \in D$ be a base point. The zeroth-, first-, and second-order Taylor polynomials of $f$ at $\mathbf{a}$ are the scalar fields $T_{k,f,\mathbf{a}} : \mathbb{R}^n \to \mathbb{R}$ defined by:

$$\begin{aligned} T_{0,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) \\ T_{1,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) \\ T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \end{aligned}$$

Each $T_{k,f,\mathbf{a}}$ takes a point $\mathbf{x} \in \mathbb{R}^n$ and returns a single real number, so every Taylor polynomial is itself a scalar field, an approximation of $f$ in the same shape as $f$. The subscripts encode three pieces of information at once: the order $k$ of the approximation, the function $f$ being approximated, and the base point $\mathbf{a}$ at which the approximation is centered. Only $\mathbf{x}$ varies as we evaluate $T_{k,f,\mathbf{a}}$ across the domain; $\mathbf{a}$ is fixed.
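The three definitions can be evaluated directly once the value, gradient, and Hessian at the base point are known. The sketch below is one possible way to do that in plain Python; the helper names `dot`, `matvec`, and `taylor`, and the example field $f(x, y) = e^x \cos y$ with base point $\mathbf{a} = (0, 0)$, are illustrative assumptions and not part of the text above.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(H, u):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, u) for row in H]

def taylor(k, f_a, grad_a, hess_a, a, x):
    """Evaluate T_k (k = 0, 1, 2) at x from the value, gradient, Hessian at a."""
    u = [xi - ai for xi, ai in zip(x, a)]          # displacement x - a
    value = f_a                                    # zeroth-order term
    if k >= 1:
        value += dot(grad_a, u)                    # linear correction
    if k >= 2:
        value += 0.5 * dot(u, matvec(hess_a, u))   # quadratic correction
    return value

# Hypothetical example: f(x, y) = exp(x) * cos(y) at a = (0, 0),
# where f(a) = 1, grad f(a) = (1, 0), H_f(a) = [[1, 0], [0, -1]].
f = lambda x, y: math.exp(x) * math.cos(y)
a = [0.0, 0.0]
f_a, grad_a, hess_a = 1.0, [1.0, 0.0], [[1.0, 0.0], [0.0, -1.0]]

x = [0.1, 0.2]
approx = [taylor(k, f_a, grad_a, hess_a, a, x) for k in range(3)]
errors = [abs(f(*x) - t) for t in approx]
print(approx, errors)
```

For this particular point the error drops with each order, which is the behaviour the definition is meant to deliver close to $\mathbf{a}$.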

The two domain conditions earn their place. Open ensures every base point $\mathbf{a}$ has breathing room, exactly the condition needed for $\nabla f(\mathbf{a})$ and $H_f(\mathbf{a})$ to be well-defined, since computing partial derivatives at $\mathbf{a}$ requires probing $f$ at points just to either side of $\mathbf{a}$ along every axis. Convex ensures that for any $\mathbf{x}, \mathbf{a} \in D$ the entire line segment between them stays inside $D$, so the approximation is well-defined on the whole segment along which the polynomial is meant to track $f$.

Before turning to the per-order analysis, it helps to see the construction at work in the simplest possible setting — 1D, where everything is directly drawable.

$$f(x) = \sin(x), \quad a = \tfrac{\pi}{4}$$

The example above is a 1D demo, not itself a multivariate Taylor polynomial. The function $f(x) = \sin(x)$ is a single-variable function $\mathbb{R} \to \mathbb{R}$, used here purely for visual intuition, since the multivariate case is hard to draw on a flat page. The blue curve is $f$ and the red dot marks the base point $(a, f(a))$ at $a = \tfrac{\pi}{4}$. The flat gray line is $T_0$: it sits at height $f(a)$ and ignores $x$ entirely. The orange line is $T_1$: it passes through $(a, f(a))$ with slope $f'(a)$, the tangent line to $f$ at $a$. The green parabola is $T_2$: it adds the quadratic correction $\tfrac{1}{2}f''(a)(x-a)^2$, bending the tangent line so its curvature matches $f$ at $a$. Notice how each higher order hugs the blue curve over a wider window around $a$ before drifting off.
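The same comparison can be made numerically. The following is a minimal sketch, assuming a helper named `taylor_sin` (introduced here for illustration) that evaluates the order-$k$ Taylor polynomial of $\sin$ by using the fact that its derivatives cycle through $\sin, \cos, -\sin, -\cos$. The offset $0.3$ from the base point is an arbitrary choice.

```python
import math

def taylor_sin(k, a, x):
    """Evaluate the order-k Taylor polynomial of sin at base point a."""
    # Derivatives of sin repeat with period 4: sin, cos, -sin, -cos.
    derivs = [math.sin(a), math.cos(a), -math.sin(a), -math.cos(a)]
    return sum(derivs[i % 4] / math.factorial(i) * (x - a) ** i
               for i in range(k + 1))

a = math.pi / 4
x = a + 0.3                      # a point a short step away from the base point
errors = [abs(math.sin(x) - taylor_sin(k, a, x)) for k in range(3)]
print(errors)                    # errors of T_0, T_1, T_2 at x
```

At this point the error shrinks from $T_0$ to $T_1$ to $T_2$, matching the picture of each order hugging the curve more closely near $a$.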

The multivariate Taylor polynomial $T_{k,f,\mathbf{a}}$ follows exactly the same principle; only the dimensions grow. The 1D slope $f'(a)$ is replaced by the gradient $\nabla f(\mathbf{a})$, and the 1D curvature $f''(a)$ by the Hessian $H_f(\mathbf{a})$. The three “shapes” carry over directly: the flat line becomes a flat hyperplane at height $f(\mathbf{a})$, the tangent line becomes a tangent hyperplane (a tangent plane in 2D), and the fitted parabola becomes a quadric surface. The picture in 1D is the picture in $n$ dimensions, just lifted into more dimensions.

Zeroth Order: A Flat Plateau

The zeroth-order Taylor polynomial

$$T_{0,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a})$$

is a constant scalar field: it does not depend on $\mathbf{x}$ at all. Its graph is a horizontal hyperplane sitting at height $f(\mathbf{a})$, like a flat plateau spread across the entire domain. It matches $f$ exactly at $\mathbf{a}$ and ignores how $f$ varies anywhere else. It is the crudest possible approximation, but the natural starting point.

First Order: The Tangent Hyperplane

The first-order Taylor polynomial

$$T_{1,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$

adds a linear correction on top of $T_0$. The inner product $\nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{x} - \mathbf{a} \rangle$ predicts how much $f$ would change under a displacement $\mathbf{x} - \mathbf{a}$ if its rate of change stayed perfectly constant: exactly the "directional derivative times step length" picture, scaled to whatever displacement $\mathbf{x} - \mathbf{a}$ we plug in.

Geometrically, the graph of $T_1$ is the tangent hyperplane to the graph of $f$ at $(\mathbf{a}, f(\mathbf{a}))$: for $n = 2$ the familiar tangent plane in $\mathbb{R}^3$, for $n = 1$ the tangent line. It is the scalar-field case of the affine approximation already used in the Jacobian discussion: $f(\mathbf{x}) \approx f(\mathbf{a}) + J_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$, with $J_f(\mathbf{a}) = \nabla f(\mathbf{a})^\top$ when $f$ is a scalar field.
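As a small worked instance, take the illustrative field $f(x, y) = x^2 y$ with base point $\mathbf{a} = (1, 2)$ (both chosen here for the example, not taken from the text above). The tangent plane and its prediction one small step away are:

$$\begin{aligned} f(\mathbf{a}) &= 1^2 \cdot 2 = 2, \qquad \nabla f(x, y) = \begin{pmatrix} 2xy \\ x^2 \end{pmatrix}, \qquad \nabla f(\mathbf{a}) = \begin{pmatrix} 4 \\ 1 \end{pmatrix} \\ T_{1,f,\mathbf{a}}(x, y) &= 2 + 4(x - 1) + (y - 2) \\ T_{1,f,\mathbf{a}}(1.1,\, 2.1) &= 2 + 0.4 + 0.1 = 2.5, \qquad f(1.1,\, 2.1) = 1.21 \cdot 2.1 = 2.541 \end{aligned}$$

The gap of $0.041$ is what the linear model cannot see: it comes from the bending of $f$, which is the job of the second-order term.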

Second Order: A Quadratic Bowl

The second-order Taylor polynomial

$$T_{2,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$$

adds a quadratic correction on top of the tangent hyperplane. The quadratic form $(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$ uses the Hessian to encode local curvature (how the slope itself changes as we step away from $\mathbf{a}$), and the factor $\tfrac{1}{2}$ matches the Taylor coefficient $\tfrac{1}{2!}$ from the 1D case.

$T_2$ is a polynomial of degree at most 2 in the coordinates of $\mathbf{x}$, so its graph is a quadric surface: a paraboloid, saddle, or trough depending on the signs of the eigenvalues of $H_f(\mathbf{a})$. Among polynomials of degree at most 2, it is the one that reproduces the value, the slope, and the bending of $f$ at the base point, which is what lets it hug $f$ near $\mathbf{a}$.
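For $n = 2$ the eigenvalue test can be carried out in closed form. The sketch below assumes a symmetric $2 \times 2$ Hessian given as nested lists; the function name `classify_quadratic`, the tolerance, and the returned labels are choices made for this illustration.

```python
import math

def classify_quadratic(H, tol=1e-12):
    """Classify the shape of u -> 0.5 * u^T H u for a symmetric 2x2 matrix H."""
    (h11, h12), (_, h22) = H
    # Closed-form eigenvalues of a symmetric 2x2 matrix: mean +/- radius.
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    lo, hi = mean - radius, mean + radius
    if lo > tol:
        return "paraboloid opening upward"      # both eigenvalues positive
    if hi < -tol:
        return "paraboloid opening downward"    # both eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle"                         # eigenvalues of opposite sign
    if abs(lo) <= tol and abs(hi) <= tol:
        return "flat"                           # no quadratic correction at all
    # Exactly one eigenvalue is (numerically) zero: flat along one direction.
    # If the nonzero eigenvalue is negative the trough is upside down.
    return "trough"

print(classify_quadratic([[2.0, 0.0], [0.0, 3.0]]))
print(classify_quadratic([[1.0, 2.0], [2.0, 1.0]]))
print(classify_quadratic([[1.0, 0.0], [0.0, 0.0]]))
```

The second matrix has eigenvalues $3$ and $-1$, so it lands in the saddle case even though all of its entries are positive: the classification depends on eigenvalues, not on individual entries.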

Matching at the Base Point

The formula for $T_{2,f,\mathbf{a}}$ did not fall out of the sky. It was reverse-engineered to satisfy a very specific design goal: at the base point $\mathbf{a}$, the polynomial should be indistinguishable from $f$ up to second order, with the same height, the same slope in every direction, and the same curvature in every direction. This section verifies that the formula in the definition actually meets that specification.

This is also what makes $T_2$ trustworthy as a local approximation. If the value, gradient, and Hessian of $T_2$ all agree with those of $f$ at $\mathbf{a}$, then near $\mathbf{a}$ (where the displacement $\mathbf{x} - \mathbf{a}$ is small) the two functions agree on everything a second-order expansion can capture; what remains is third order and beyond, which shrinks rapidly as $\mathbf{x} \to \mathbf{a}$. At the base point itself, $T_2$ is a perfect local clone of $f$; the further we move, the more the clone drifts.
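How quickly the leftover shrinks can be probed numerically. The sketch below uses an illustrative smooth field, $f(x, y) = e^x \cos y$ at $\mathbf{a} = (0, 0)$, which is an assumption of this example. Its second-order polynomial is written out by hand, and the step is halved along the diagonal direction. For a smooth field whose third-order term does not vanish in that direction, halving the step should cut the error by a factor close to $2^3 = 8$.

```python
import math

def f(x, y):
    # Hypothetical example field, smooth everywhere.
    return math.exp(x) * math.cos(y)

def T2(x, y):
    # At a = (0, 0): value 1, gradient (1, 0), Hessian [[1, 0], [0, -1]].
    return 1.0 + x + 0.5 * (x * x - y * y)

steps = [0.1, 0.05, 0.025]                       # each step is half the previous
errors = [abs(f(h, h) - T2(h, h)) for h in steps]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors)
print(ratios)                                    # expected to be close to 8
```

A ratio near 8 is the signature of a cubic leftover: the value, gradient, and Hessian have all been matched, so the first thing that can differ scales like the cube of the step.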

Each of the three checks below is a one-line substitution: plugging in $\mathbf{x} = \mathbf{a}$ collapses the displacement $\mathbf{x} - \mathbf{a}$ to zero, leaving only the term we care about. The algebra is mechanical; the point is that the formula was designed so each substitution recovers exactly what was prescribed.

Function value matches — the graphs touch.

$$f(\mathbf{a}) = T_{2,f,\mathbf{a}}(\mathbf{a})$$

Geometrically, the graphs of $f$ and $T_2$ pass through the same point $(\mathbf{a}, f(\mathbf{a}))$: they are pinned together at the base point.

Verify by substitution

Substitute $\mathbf{x} = \mathbf{a}$ into $T_2$. Each correction term carries a factor of $\mathbf{x} - \mathbf{a}$, which becomes $\mathbf{0}$, so both terms vanish:

$$\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ T_{2,f,\mathbf{a}}(\mathbf{a}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \mathbf{0} + \tfrac{1}{2}\mathbf{0}^\top H_f(\mathbf{a})\mathbf{0} \\ &= f(\mathbf{a}) \end{aligned}$$

This is the multivariate counterpart of the 1D identity $f(a) = T_{2,f,a}(a)$.

Gradient matches — the graphs tilt the same way.

$$\nabla f(\mathbf{a}) = \nabla T_{2,f,\mathbf{a}}(\mathbf{a})$$

The graphs don’t merely touch at $\mathbf{a}$, they touch tangentially: the tangent hyperplane to $T_2$ at $\mathbf{a}$ coincides with the tangent hyperplane to $f$ at $\mathbf{a}$. Concretely, $T_2$ has the same directional derivative as $f$ in every direction at the base point.

Verify by differentiating once

Differentiate $T_2$ once with respect to $\mathbf{x}$, then substitute:

$$\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{x}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{a}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})\mathbf{0} \\ &= \nabla f(\mathbf{a}) \end{aligned}$$

Each differentiation removes one "$\mathbf{x} - \mathbf{a}$" factor from any term that still has one, exactly like the 1D power rule sending $(x-a)^k$ to $k(x-a)^{k-1}$, dropping the polynomial degree by one.

The point worth pausing on: we differentiate with respect to $\mathbf{x}$, not with respect to $\mathbf{x} - \mathbf{a}$. But because $\mathbf{a}$ is a fixed constant, independent of $\mathbf{x}$, the shift contributes nothing under differentiation. Component-wise, $\frac{\partial}{\partial x_j}(x_i - a_i) = \frac{\partial x_i}{\partial x_j} - 0 = \delta_{ij}$, so the Jacobian of $\mathbf{x} - \mathbf{a}$ with respect to $\mathbf{x}$ is the identity matrix $I$. The 1D version is the same one-liner: $\frac{d}{dx}(x-a) = 1$, exactly as if we were differentiating $x$ on its own. So derivatives with respect to $\mathbf{x}$ behave identically to derivatives with respect to $\mathbf{x} - \mathbf{a}$, and the power-rule intuition transfers without modification.

Walking through the three terms of $T_2$:

  • Constant term $f(\mathbf{a})$: zero factors of $\mathbf{x} - \mathbf{a}$ to begin with, so differentiating gives $\mathbf{0}$ outright.
  • Linear term $\nabla f(\mathbf{a})^\top(\mathbf{x} - \mathbf{a})$: one factor, which differentiation consumes, leaving the constant $\nabla f(\mathbf{a})$ with no factor remaining. This mirrors $\frac{d}{dx}[c(x-a)] = c$ in 1D.
  • Quadratic term $\tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$: two factors, of which differentiation consumes one, leaving $H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$, which still carries one factor.

The third bullet hides the most computation, since the matrix syntax of the quadratic term above keeps its $\mathbf{x} - \mathbf{a}$ factors less visible than $(x-a)^2$ does in 1D. Writing $\mathbf{u} = \mathbf{x} - \mathbf{a}$ and $H = H_f(\mathbf{a})$ for compactness, the 1D pattern is the familiar $\frac{d}{du}[\tfrac{1}{2} c u^2] = c u$: the $\tfrac{1}{2}$ exists exactly to cancel the $2$ that the power rule pulls out of $u^2$. The matrix version is the same pattern applied to a quadratic form instead of a scalar quadratic. To see it concretely in $\mathbb{R}^2$, take a symmetric Hessian $H = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}$ and carry out the matrix multiplication step by step: first the inner $H\mathbf{u}$, then $\mathbf{u}^\top$ contracted against it, then expand and combine:

$$\begin{aligned} \mathbf{u}^\top H \mathbf{u} &= \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \\ &= \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} h_{11} u_1 + h_{12} u_2 \\ h_{12} u_1 + h_{22} u_2 \end{pmatrix} \\ &= u_1 (h_{11} u_1 + h_{12} u_2) + u_2 (h_{12} u_1 + h_{22} u_2) \\ &= h_{11} u_1^2 + h_{12} u_1 u_2 + h_{12} u_1 u_2 + h_{22} u_2^2 \\ &= h_{11} u_1^2 + 2 h_{12} u_1 u_2 + h_{22} u_2^2 \end{aligned}$$

The two $h_{12} u_1 u_2$ terms in the second-to-last line are where the cross-term doubles: the off-diagonal entry $h_{12}$ contributes one copy and $h_{21}$ contributes another, and they combine because $h_{21} = h_{12}$ by symmetry. Now take the partials of $\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}$; the leading $\tfrac{1}{2}$ absorbs each $2$ that comes out of differentiation:

$$\frac{\partial}{\partial u_1}\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = h_{11} u_1 + h_{12} u_2 \qquad \frac{\partial}{\partial u_2}\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = h_{12} u_1 + h_{22} u_2$$

Stacking these as a column reproduces $H \mathbf{u}$ exactly. The same structure holds in any dimension: for symmetric $H$,

$$\nabla\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = H \mathbf{u}$$

which is the multivariate analog of $\frac{d}{du}[\tfrac{1}{2} c u^2] = c u$. Translating back to the original variables:

$$\nabla\!\left[\tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})\right] = H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$$

This is exactly the term that appears on the second line of the gradient computation above.
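The identity $\nabla[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}] = H\mathbf{u}$ is easy to sanity-check with finite differences. The sketch below is one way to do it in plain Python; the helper names `quad` and `numeric_grad`, the step size, and the particular symmetric matrix and point are all illustrative assumptions.

```python
def quad(H, u):
    """Return 0.5 * u^T H u for a matrix H given as nested lists."""
    n = len(u)
    return 0.5 * sum(u[i] * H[i][j] * u[j] for i in range(n) for j in range(n))

def numeric_grad(g, u, h=1e-6):
    """Central-difference gradient of a scalar function g at the point u."""
    grad = []
    for i in range(len(u)):
        up = list(u); up[i] += h      # step forward along axis i
        dn = list(u); dn[i] -= h      # step backward along axis i
        grad.append((g(up) - g(dn)) / (2 * h))
    return grad

H = [[2.0, -1.0], [-1.0, 3.0]]        # hypothetical symmetric matrix
u = [0.5, -1.5]

Hu = [sum(H[i][j] * u[j] for j in range(2)) for i in range(2)]   # exact H u
fd = numeric_grad(lambda v: quad(H, v), u)                       # numerical gradient
print(Hu, fd)
```

Here $H\mathbf{u} = (2.5, -5)$, and the finite-difference gradient agrees with it up to rounding. Repeating the experiment with a non-symmetric matrix would instead recover $\tfrac{1}{2}(H + H^\top)\mathbf{u}$, which is why the symmetry assumption matters.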

So after one round of differentiation, the linear term has run out of $\mathbf{x} - \mathbf{a}$ factors and become a constant, while the quadratic term still carries one. Substituting $\mathbf{x} = \mathbf{a}$ then sends that remaining factor to $\mathbf{0}$. This is the counterpart of the 1D identity $f'(a) = T'_{2,f,a}(a)$.

Hessian matches — the graphs bend the same way.

$$H_f(\mathbf{a}) = H_{T_{2,f,\mathbf{a}}}(\mathbf{a})$$

Beyond touching and tilting, $T_2$ also curves like $f$ at $\mathbf{a}$: along every direction through the base point, the rate at which the slope itself changes is the same for both.

Verify by differentiating twice

Differentiate once more: the Hessian of $T_2$ is the Jacobian of its gradient, taken term by term:

$$\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{x}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ H_{T_{2,f,\mathbf{a}}}(\mathbf{x}) &= \mathbf{0} + \mathbf{0} + H_f(\mathbf{a}) \\ &= H_f(\mathbf{a}) \end{aligned}$$

This is where the contrast with the previous two checks shows up. There is no $\mathbf{x} - \mathbf{a}$ left to substitute: two differentiations have already stripped both factors from the quadratic term (one per round), and the lower-order terms differentiated to zero along the way. The result is the constant matrix $H_f(\mathbf{a})$, with no dependence on $\mathbf{x}$ at all, so $H_{T_{2,f,\mathbf{a}}}(\mathbf{x}) = H_f(\mathbf{a})$ holds at every point, not just at $\mathbf{a}$. This is the counterpart of the 1D identity $f''(a) = T''_{2,f,a}(a)$.
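The claim that the Hessian of $T_2$ is the same matrix everywhere can be checked numerically by estimating second partials at the base point and at a point far from it. The sketch below assumes hypothetical base-point data (value $1$, gradient $(1, 0)$, Hessian with diagonal $1, -1$ at $\mathbf{a} = (0, 0)$). The helpers `make_T2` and `numeric_hessian` are names introduced for this illustration.

```python
def make_T2(f_a, grad_a, hess_a, a):
    """Build the second-order Taylor polynomial as a Python function of x."""
    def T2(x):
        u = [xi - ai for xi, ai in zip(x, a)]
        lin = sum(g * ui for g, ui in zip(grad_a, u))
        quad = 0.5 * sum(u[i] * hess_a[i][j] * u[j]
                         for i in range(len(u)) for j in range(len(u)))
        return f_a + lin + quad
    return T2

def numeric_hessian(g, x, h=1e-3):
    """Estimate the Hessian of g at x with central second differences."""
    n = len(x)
    def shifted(i, si, j, sj):
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return g(y)
    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

# Hypothetical base-point data, chosen only for illustration.
hess_a = [[1.0, 0.0], [0.0, -1.0]]
T2 = make_T2(1.0, [1.0, 0.0], hess_a, [0.0, 0.0])

at_base = numeric_hessian(T2, [0.0, 0.0])
far_away = numeric_hessian(T2, [0.7, -0.4])
print(at_base)
print(far_away)
```

Both estimates reproduce the prescribed Hessian up to rounding, in line with the observation that no substitution of $\mathbf{x} = \mathbf{a}$ is needed for this third check.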

These three conditions are not just consequences of the formula; they are the formula, read backward. Demand a degree-2 polynomial whose value, gradient, and Hessian at $\mathbf{a}$ match those of $f$, solve for the coefficients, and the expression in the definition is the only one that works. That uniqueness is what singles out $T_{2,f,\mathbf{a}}$ among all degree-2 polynomials. Higher-order Taylor polynomials extend the same idea (pin the third-order partials too, then the fourth, and so on), with each new order locking down one more layer of local behavior at $\mathbf{a}$.
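The "read backward" step can be written out explicitly. Any polynomial of degree at most 2 can be put in the form below, centered at $\mathbf{a}$, with an unknown scalar $c$, vector $\mathbf{b}$, and symmetric matrix $C$ (symbols introduced here for the derivation). Running the same three checks on it gives:

$$\begin{aligned} q(\mathbf{x}) &= c + \mathbf{b}^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top C\, (\mathbf{x} - \mathbf{a}) \\ q(\mathbf{a}) &= c \\ \nabla q(\mathbf{a}) &= \mathbf{b} + C\,\mathbf{0} = \mathbf{b} \\ H_q(\mathbf{a}) &= C \end{aligned}$$

Requiring $q(\mathbf{a}) = f(\mathbf{a})$, $\nabla q(\mathbf{a}) = \nabla f(\mathbf{a})$, and $H_q(\mathbf{a}) = H_f(\mathbf{a})$ therefore forces $c = f(\mathbf{a})$, $\mathbf{b} = \nabla f(\mathbf{a})$, and $C = H_f(\mathbf{a})$, leaving no freedom: $q = T_{2,f,\mathbf{a}}$. The last identification uses the symmetry of $H_f(\mathbf{a})$, the same assumption made in the gradient check.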