Taylor Expansion

Recall from 1D calculus that the Taylor polynomial of order $k$ approximates a smooth function $f : \mathbb{R} \to \mathbb{R}$ near a base point $a$ using its first $k$ derivatives:

T_{k,f,a}(x) = \sum_{i=0}^{k} \frac{f^{(i)}(a)}{i!}(x - a)^i

The zeroth-order polynomial is just the constant value $f(a)$ — a flat line at that height. The first-order polynomial adds the linear correction $f'(a)(x - a)$ , producing the tangent line at $a$ . The second-order polynomial adds the quadratic correction $\tfrac{1}{2}f''(a)(x - a)^2$ , bending the line into a parabola whose curvature matches $f$ at $a$ . Each step uses one more piece of local information to track $f$ better near the base point.

The same construction extends to scalar fields with the gradient playing the role of $f'$ and the Hessian playing the role of $f''$ .

Multivariate Taylor Polynomials

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a twice partially differentiable scalar field (all second-order partial derivatives exist) on an open and convex domain $D$, and let $\mathbf{a} \in D$ be a base point. We take the Hessian $H_f(\mathbf{a})$ to be symmetric, which holds whenever the second-order partials are continuous at $\mathbf{a}$. The zeroth-, first-, and second-order Taylor polynomials of $f$ at $\mathbf{a}$ are the scalar fields $T_{k,f,\mathbf{a}} : \mathbb{R}^n \to \mathbb{R}$ defined by:

\begin{aligned} T_{0,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) \\ T_{1,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) \\ T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \end{aligned}

Each $T_{k,f,\mathbf{a}}$ takes a point $\mathbf{x} \in \mathbb{R}^n$ and returns a single real number, so every Taylor polynomial is itself a scalar field: an approximation of $f$ with the same input and output types as $f$. The subscripts encode three pieces of information at once: the order $k$ of the approximation, the function $f$ being approximated, and the base point $\mathbf{a}$ at which the approximation is centered. Only $\mathbf{x}$ varies as we evaluate $T_{k,f,\mathbf{a}}$ across the domain; $\mathbf{a}$ is fixed.
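As a minimal sketch of the three definitions, the following evaluates $T_0$, $T_1$, and $T_2$ from a given value, gradient, and Hessian at the base point. The example field $f(x, y) = e^x \cos y$, the base point $(0, 0)$, and all helper names are assumptions chosen for illustration; the gradient and Hessian are computed by hand.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def taylor_polys(f_a, grad_a, hess_a, a, x):
    """Return (T0, T1, T2) evaluated at x, given f(a), grad f(a), H_f(a)."""
    d = [xi - ai for xi, ai in zip(x, a)]          # displacement x - a
    t0 = f_a
    t1 = t0 + dot(grad_a, d)                       # add the linear correction
    t2 = t1 + 0.5 * dot(d, mat_vec(hess_a, d))     # add the quadratic correction
    return t0, t1, t2

# Hypothetical example field: f(x, y) = exp(x) * cos(y), base point a = (0, 0).
def f(p):
    x, y = p
    return math.exp(x) * math.cos(y)

a = [0.0, 0.0]
f_a = 1.0                      # f(0, 0)
grad_a = [1.0, 0.0]            # (e^x cos y, -e^x sin y) at (0, 0)
hess_a = [[1.0, 0.0],          # [[ e^x cos y, -e^x sin y],
          [0.0, -1.0]]         #  [-e^x sin y, -e^x cos y]] at (0, 0)

x = [0.1, 0.2]
t0, t1, t2 = taylor_polys(f_a, grad_a, hess_a, a, x)
errors = [abs(f(x) - t) for t in (t0, t1, t2)]
print(t0, t1, t2)
print(errors)
```

For this particular point the three errors decrease with the order, which is the behavior the definitions are designed to produce near $\mathbf{a}$.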

The two domain conditions earn their place. Openness ensures every base point $\mathbf{a}$ has breathing room, which is exactly the condition needed for $\nabla f(\mathbf{a})$ and $H_f(\mathbf{a})$ to be well-defined, since computing partial derivatives at $\mathbf{a}$ requires probing $f$ at points just to either side of $\mathbf{a}$ along every axis. Convexity ensures that for any $\mathbf{x}, \mathbf{a} \in D$ the entire line segment between them stays inside $D$, so $f$ is defined at every point of the segment along which the polynomial is meant to track it.

Before turning to the per-order analysis, it helps to see the construction at work in the simplest possible setting — 1D, where everything is directly drawable.

f(x) = \sin(x), \quad a = \tfrac{\pi}{4}

The example above is a 1D demo — not itself a multivariate Taylor polynomial. The function $f(x) = \sin(x)$ is a single-variable function $\mathbb{R} \to \mathbb{R}$ , used here purely for visual intuition, since the multivariate case is hard to draw on a flat page. The blue curve is $f$ and the red dot marks the base point $(a, f(a))$ at $a = \tfrac{\pi}{4}$ . The flat gray line is $T_0$ — it sits at height $f(a)$ and ignores $x$ entirely. The orange line is $T_1$ — it passes through $(a, f(a))$ with slope $f'(a)$ , the tangent line to $f$ at $a$ . The green parabola is $T_2$ — it adds the quadratic correction $\tfrac{1}{2}f''(a)(x-a)^2$ , bending the tangent line so its curvature matches $f$ at $a$ . Notice how each higher order hugs the blue curve over a wider window around $a$ before drifting off.
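The same 1D demo can be reproduced numerically. The sketch below evaluates the 1D formula for $f(x) = \sin(x)$ at $a = \tfrac{\pi}{4}$ and prints the errors of $T_0$, $T_1$, and $T_2$ at two displacements; the helper names `taylor_1d` and `errors_at` are made up for this illustration.

```python
import math

def taylor_1d(derivs_at_a, a, x):
    """Evaluate sum_i f^(i)(a) / i! * (x - a)^i, given [f(a), f'(a), ...]."""
    return sum(d / math.factorial(i) * (x - a) ** i
               for i, d in enumerate(derivs_at_a))

a = math.pi / 4
derivs = [math.sin(a), math.cos(a), -math.sin(a)]   # f, f', f'' of sin at a

def errors_at(h):
    """Absolute errors of T0, T1, T2 at the point x = a + h."""
    x = a + h
    return [abs(math.sin(x) - taylor_1d(derivs[:k + 1], a, x)) for k in range(3)]

for h in (0.1, 0.5):
    print(h, errors_at(h))
```

At these displacements each higher order gives a smaller error. Far enough from $a$ that ordering is not guaranteed, which matches the "drifting off" visible in the plot.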

The multivariate Taylor polynomial $T_{k,f,\mathbf{a}}$ follows exactly the same principle; only the dimensions grow. The 1D slope $f'(a)$ is replaced by the gradient $\nabla f(\mathbf{a})$, and the 1D curvature $f''(a)$ by the Hessian $H_f(\mathbf{a})$. The three “shapes” carry over directly: the flat line becomes a flat hyperplane at height $f(\mathbf{a})$, the tangent line becomes a tangent hyperplane (a tangent plane for $n = 2$), and the fitted parabola becomes a quadric surface. The picture in 1D is the picture in $n$ dimensions, just lifted into more dimensions.

Zeroth Order: A Flat Plateau

The zeroth-order Taylor polynomial

T_{0,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a})

is a constant scalar field: it does not depend on $\mathbf{x}$ at all. Its graph is a horizontal hyperplane sitting at height $f(\mathbf{a})$, like a flat plateau spread across the entire domain. It matches $f$ exactly at $\mathbf{a}$ and ignores how $f$ varies anywhere else. It is the crudest possible approximation, but the natural starting point.

First Order: The Tangent Hyperplane

The first-order Taylor polynomial

T_{1,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})

adds a linear correction on top of $T_0$. The inner product $\nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{x} - \mathbf{a} \rangle$ predicts how much $f$ would change under a displacement $\mathbf{x} - \mathbf{a}$ if its rate of change stayed perfectly constant. For $\mathbf{x} \neq \mathbf{a}$ it equals the directional derivative of $f$ at $\mathbf{a}$ along the unit vector $(\mathbf{x} - \mathbf{a}) / \|\mathbf{x} - \mathbf{a}\|$, multiplied by the step length $\|\mathbf{x} - \mathbf{a}\|$.

Geometrically, the graph of $T_1$ is the tangent hyperplane to the graph of $f$ at $(\mathbf{a}, f(\mathbf{a}))$: for $n = 2$ the familiar tangent plane in $\mathbb{R}^3$, for $n = 1$ the tangent line. It is the scalar-field case of the affine approximation already used in the Jacobian discussion, $f(\mathbf{x}) \approx f(\mathbf{a}) + J_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$, with $J_f(\mathbf{a}) = \nabla f(\mathbf{a})^\top$.
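The reading of the linear correction as "directional derivative times step length" can be checked numerically. This is only a sketch: the field $f(x, y) = x^2 y + \sin y$, the points `a` and `x`, and the finite-difference step are all assumptions made for illustration, and the directional derivative is estimated by a central difference and is not exact.

```python
import math

def f(p):
    # Hypothetical example field: f(x, y) = x^2 * y + sin(y)
    x, y = p
    return x * x * y + math.sin(y)

def grad_f(p):
    # Hand-computed gradient of the example field
    x, y = p
    return [2 * x * y, x * x + math.cos(y)]

a = [1.0, 0.5]
x = [1.3, 0.1]
d = [xi - ai for xi, ai in zip(x, a)]            # displacement x - a
step = math.hypot(*d)                            # step length ||x - a||
v = [di / step for di in d]                      # unit direction of the displacement

# Directional derivative of f at a along v, estimated by a central difference
h = 1e-6
plus = [ai + h * vi for ai, vi in zip(a, v)]
minus = [ai - h * vi for ai, vi in zip(a, v)]
dir_deriv = (f(plus) - f(minus)) / (2 * h)

linear_correction = sum(g * di for g, di in zip(grad_f(a), d))
print(linear_correction, dir_deriv * step)       # the two numbers should nearly agree
```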

Second Order: A Quadratic Bowl

The second-order Taylor polynomial

T_{2,f,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})

adds a quadratic correction on top of the tangent hyperplane. The quadratic form $(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$ uses the Hessian to encode local curvature — how the slope itself changes as we step away from $\mathbf{a}$ — and the factor $\tfrac{1}{2}$ matches the Taylor coefficient $\tfrac{1}{2!}$ from the 1D case.

$T_2$ is a polynomial of degree at most 2 in the coordinates of $\mathbf{x}$, so its graph is a quadric surface: a paraboloid, saddle, or trough depending on the signs of the eigenvalues of $H_f(\mathbf{a})$ (or a plane in the degenerate case $H_f(\mathbf{a}) = 0$). Among polynomials of degree at most 2, it is the one whose value, gradient, and Hessian at $\mathbf{a}$ agree with those of $f$, so it reproduces both the slope and the bending of $f$ at the base point.
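For $n = 2$ the dependence on the eigenvalues can be made concrete. The sketch below uses the closed-form eigenvalues of a symmetric $2 \times 2$ matrix and sorts the graph of $T_2$ into rough shape classes by their signs; the function names, the tolerance, and the wording of the labels are choices made here for illustration.

```python
import math

def eigenvalues_sym2(h11, h12, h22):
    """Eigenvalues of the symmetric matrix [[h11, h12], [h12, h22]], ascending."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius, mean + radius

def classify_quadric(h11, h12, h22, tol=1e-12):
    """Rough shape of the graph of T_2 for n = 2, from the eigenvalue signs."""
    lo, hi = eigenvalues_sym2(h11, h12, h22)
    if lo > tol:
        return "paraboloid opening upward"
    if hi < -tol:
        return "paraboloid opening downward"
    if lo < -tol and hi > tol:
        return "saddle"
    if abs(lo) <= tol and abs(hi) <= tol:
        return "plane (no quadratic correction)"
    return "trough (curved in one direction, flat in the other)"

print(classify_quadric(2.0, 0.0, 2.0))    # both eigenvalues positive
print(classify_quadric(1.0, 0.0, -1.0))   # eigenvalues of opposite sign
print(classify_quadric(1.0, 1.0, 1.0))    # eigenvalues 0 and 2
```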

Matching $f$ at the Base Point

The formula for $T_{2,f,\mathbf{a}}$ did not fall out of the sky. It was reverse-engineered to satisfy a very specific design goal: at the base point $\mathbf{a}$ , the polynomial should be indistinguishable from $f$ up to second order — same height, same slope in every direction, same curvature in every direction. This section verifies that the formula in the definition actually meets that specification.

This is also what makes $T_2$ trustworthy as a local approximation. If the value, gradient, and Hessian of $T_2$ all agree with those of $f$ at $\mathbf{a}$, then near $\mathbf{a}$ (where the displacement $\mathbf{x} - \mathbf{a}$ is small) the two functions agree on everything a second-order expansion can capture; what remains is third order and beyond, and for sufficiently smooth $f$ the resulting error shrinks faster than $\|\mathbf{x} - \mathbf{a}\|^2$ as $\mathbf{x} \to \mathbf{a}$. At the base point itself, $T_2$ is a perfect local clone of $f$; the further we move, the more the clone drifts.
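How quickly the leftover shrinks can be probed with a small experiment. The sketch below assumes the smooth example field $f(x, y) = e^x \cos y$ with base point $(0, 0)$, for which $T_2(x, y) = 1 + x + \tfrac{1}{2}(x^2 - y^2)$ by hand computation, and compares the error at a displacement with the error at half that displacement. The direction $(0.1, 0.2)$ is an arbitrary choice.

```python
import math

def f(x, y):
    # Hypothetical smooth example field
    return math.exp(x) * math.cos(y)

def t2(x, y):
    # T_2 of the example field at a = (0, 0):
    # f(a) = 1, gradient = (1, 0), Hessian = diag(1, -1)
    return 1.0 + x + 0.5 * (x * x - y * y)

def error(scale):
    # Error of T_2 at the point a + scale * (0.1, 0.2)
    x, y = 0.1 * scale, 0.2 * scale
    return abs(f(x, y) - t2(x, y))

# Ratio of the error at a displacement to the error at half that displacement
ratios = [error(s) / error(s / 2) for s in (1.0, 0.5, 0.25)]
print(ratios)
```

Ratios close to $8 = 2^3$ indicate that halving the displacement cuts the error by roughly a factor of eight, which is what a leading third-order remainder would produce for this example.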

Each of the three checks below is a one-line substitution: plugging $\mathbf{x} = \mathbf{a}$ collapses the displacement $\mathbf{x} - \mathbf{a}$ to zero, leaving only the term we care about. The algebra is mechanical — the point is that the formula was designed so each substitution recovers exactly what was prescribed.

Function value matches — the graphs touch.

f(\mathbf{a}) = T_{2,f,\mathbf{a}}(\mathbf{a})

Geometrically, the graphs of $f$ and $T_2$ pass through the same point $(\mathbf{a}, f(\mathbf{a}))$ — they are pinned together at the base point.

Verify by substitution

Substitute $\mathbf{x} = \mathbf{a}$ into $T_2$ . Each correction term carries a factor of $\mathbf{x} - \mathbf{a}$ which becomes $\mathbf{0}$ , so both terms vanish:

\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ T_{2,f,\mathbf{a}}(\mathbf{a}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \mathbf{0} + \tfrac{1}{2}\mathbf{0}^\top H_f(\mathbf{a})\mathbf{0} \\ &= f(\mathbf{a}) \end{aligned}

This is the multivariate counterpart of the 1D identity $f(a) = T_{2,f,a}(a)$ .

Gradient matches — the graphs tilt the same way.

\nabla f(\mathbf{a}) = \nabla T_{2,f,\mathbf{a}}(\mathbf{a})

The graphs don’t merely touch at $\mathbf{a}$ , they touch tangentially: the tangent hyperplane to $T_2$ at $\mathbf{a}$ coincides with the tangent hyperplane to $f$ at $\mathbf{a}$ . Concretely, $T_2$ has the same directional derivative as $f$ in every direction at the base point.

Verify by differentiating once

Differentiate $T_2$ once with respect to $\mathbf{x}$ , then substitute:

\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{x}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{a}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})\mathbf{0} \\ &= \nabla f(\mathbf{a}) \end{aligned}

Each differentiation removes one factor of $\mathbf{x} - \mathbf{a}$ from any term that still has one, exactly like the 1D power rule sending $(x-a)^k$ to $k(x-a)^{k-1}$ and dropping the polynomial degree by one.

The point worth pausing on: we differentiate with respect to $\mathbf{x}$ , not with respect to $\mathbf{x} - \mathbf{a}$ . But because $\mathbf{a}$ is a fixed constant — independent of $\mathbf{x}$ — the shift contributes nothing under differentiation. Component-wise, $\frac{\partial}{\partial x_j}(x_i - a_i) = \frac{\partial x_i}{\partial x_j} - 0 = \delta_{ij}$ , so the Jacobian of $\mathbf{x} - \mathbf{a}$ with respect to $\mathbf{x}$ is the identity matrix $I$ . The 1D version is the same one-liner: $\frac{d}{dx}(x-a) = 1$ , exactly as if we were differentiating $x$ on its own. So derivatives with respect to $\mathbf{x}$ behave identically to derivatives with respect to $\mathbf{x} - \mathbf{a}$ , and the power-rule intuition transfers without modification.

Walking through the three terms of $T_2$ :

Constant term $f(\mathbf{a})$ : zero factors of $\mathbf{x} - \mathbf{a}$ to begin with, so differentiating gives $\mathbf{0}$ outright.
Linear term $\nabla f(\mathbf{a})^\top(\mathbf{x} - \mathbf{a})$ : one factor, which differentiation consumes — leaving the constant $\nabla f(\mathbf{a})$ with no factor remaining. This mirrors $\frac{d}{dx}[c(x-a)] = c$ in 1D.
Quadratic term $\tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$ : two factors, of which differentiation consumes one — leaving $H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})$ , which still carries one factor.

The third bullet hides the most computation, since the matrix syntax of the quadratic term above keeps its $\mathbf{x} - \mathbf{a}$ factors less visible than $(x-a)^2$ does in 1D. Writing $\mathbf{u} = \mathbf{x} - \mathbf{a}$ and $H = H_f(\mathbf{a})$ for compactness, the 1D pattern is the familiar $\frac{d}{du}[\tfrac{1}{2} c u^2] = c u$ — the $\tfrac{1}{2}$ exists exactly to cancel the $2$ that the power rule pulls out of $u^2$ . The matrix version is the same pattern applied to a quadratic form instead of a scalar quadratic. To see it concretely in $\mathbb{R}^2$ , take a symmetric Hessian $H = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}$ and carry out the matrix multiplication step by step — first the inner $H\mathbf{u}$ , then $\mathbf{u}^\top$ contracted against it, then expand and combine:

\begin{aligned} \mathbf{u}^\top H \mathbf{u} &= \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \\ &= \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} h_{11} u_1 + h_{12} u_2 \\ h_{12} u_1 + h_{22} u_2 \end{pmatrix} \\ &= u_1 (h_{11} u_1 + h_{12} u_2) + u_2 (h_{12} u_1 + h_{22} u_2) \\ &= h_{11} u_1^2 + h_{12} u_1 u_2 + h_{12} u_1 u_2 + h_{22} u_2^2 \\ &= h_{11} u_1^2 + 2 h_{12} u_1 u_2 + h_{22} u_2^2 \end{aligned}

The two $h_{12} u_1 u_2$ terms in the second-to-last line are where the cross-term doubles: the off-diagonal entry $h_{12}$ contributes one copy and $h_{21}$ contributes another, and they combine because $h_{21} = h_{12}$ by symmetry. Now take the partials of $\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}$ — the leading $\tfrac{1}{2}$ absorbs each $2$ that comes out of differentiation:

\frac{\partial}{\partial u_1}\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = h_{11} u_1 + h_{12} u_2 \qquad \frac{\partial}{\partial u_2}\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = h_{12} u_1 + h_{22} u_2

Stacking these as a column reproduces $H \mathbf{u}$ exactly. The same structure holds in any dimension: for symmetric $H$ ,

\nabla\!\left[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}\right] = H \mathbf{u}

which is the multivariate analog of $\frac{d}{du}[\tfrac{1}{2} c u^2] = c u$ . Translating back to the original variables:

\nabla\!\left[\tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})\right] = H_f(\mathbf{a})(\mathbf{x} - \mathbf{a})

This is exactly the term that appears on the second line of the gradient computation above.
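The identity $\nabla[\tfrac{1}{2}\mathbf{u}^\top H \mathbf{u}] = H\mathbf{u}$ and its reliance on symmetry can be checked with finite differences. In this sketch the matrices, the test vector, and the helper names are arbitrary illustrative choices. For a matrix that is not symmetric, the gradient is $\tfrac{1}{2}(H + H^\top)\mathbf{u}$ and no longer equals $H\mathbf{u}$.

```python
def quad_form_half(H, u):
    """Evaluate (1/2) * u^T H u for H stored as a list of rows."""
    n = len(u)
    return 0.5 * sum(u[i] * H[i][j] * u[j] for i in range(n) for j in range(n))

def numeric_gradient(g, u, h=1e-6):
    """Central-difference gradient of a scalar function g at the point u."""
    grad = []
    for k in range(len(u)):
        up = list(u)
        up[k] += h
        dn = list(u)
        dn[k] -= h
        grad.append((g(up) - g(dn)) / (2 * h))
    return grad

def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

u = [0.7, -1.2]

# Symmetric case: the numeric gradient should agree with H u
H_sym = [[2.0, 0.5], [0.5, -1.0]]
grad_sym = numeric_gradient(lambda w: quad_form_half(H_sym, w), u)
print(grad_sym, mat_vec(H_sym, u))

# Non-symmetric case: the gradient matches the symmetric part, not H itself
H_asym = [[2.0, 3.0], [-1.0, -1.0]]
sym_part = [[0.5 * (H_asym[i][j] + H_asym[j][i]) for j in range(2)] for i in range(2)]
grad_asym = numeric_gradient(lambda w: quad_form_half(H_asym, w), u)
print(grad_asym, mat_vec(sym_part, u), mat_vec(H_asym, u))
```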

So after one round of differentiation, the linear term has run out of $\mathbf{x} - \mathbf{a}$ factors and become a constant, while the quadratic term still carries one. Substituting $\mathbf{x} = \mathbf{a}$ then sends that remaining factor to $\mathbf{0}$ . This is the counterpart of the 1D identity $f'(a) = T'_{2,f,a}(a)$ .

Hessian matches — the graphs bend the same way.

H_f(\mathbf{a}) = H_{T_{2,f,\mathbf{a}}}(\mathbf{a})

Beyond touching and tilting, $T_2$ also curves like $f$ at $\mathbf{a}$ : along every direction through the base point, the rate at which the slope itself changes is the same for both.

Verify by differentiating twice

Differentiate once more. The Hessian of $T_2$ is the Jacobian of its gradient, so we differentiate the gradient line term by term:

\begin{aligned} T_{2,f,\mathbf{a}}(\mathbf{x}) &= f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ \nabla T_{2,f,\mathbf{a}}(\mathbf{x}) &= \mathbf{0} + \nabla f(\mathbf{a}) + H_f(\mathbf{a})(\mathbf{x} - \mathbf{a}) \\ H_{T_{2,f,\mathbf{a}}}(\mathbf{x}) &= \mathbf{0} + \mathbf{0} + H_f(\mathbf{a}) \\ &= H_f(\mathbf{a}) \end{aligned}

This is where the contrast with the previous two checks shows up. There is no $\mathbf{x} - \mathbf{a}$ left to substitute: two differentiations have already stripped both factors from the quadratic term (one per round), and the lower-order terms differentiated to zero along the way. The result is the constant matrix $H_f(\mathbf{a})$ , with no dependence on $\mathbf{x}$ at all — so $H_{T_{2,f,\mathbf{a}}}(\mathbf{x}) = H_f(\mathbf{a})$ holds at every point, not just at $\mathbf{a}$ . This is the counterpart of the 1D identity $f''(a) = T''_{2,f,a}(a)$ .
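All three checks can also be confirmed numerically. The sketch below assumes the example field $f(x, y) = e^x \cos y$ and the base point $(0.3, -0.4)$, builds $T_2$ from hand-computed derivatives, and compares finite-difference estimates of the gradient and Hessian of $f$ and of $T_2$. The step sizes are ad hoc, so the gaps are small but not exactly zero.

```python
import math

a = (0.3, -0.4)
ex, c, s = math.exp(a[0]), math.cos(a[1]), math.sin(a[1])

def f(x, y):
    # Hypothetical example field
    return math.exp(x) * math.cos(y)

f_a = ex * c
grad_a = (ex * c, -ex * s)                            # hand-computed gradient at a
hess_a = ((ex * c, -ex * s), (-ex * s, -ex * c))      # hand-computed Hessian at a

def t2(x, y):
    dx, dy = x - a[0], y - a[1]
    linear = grad_a[0] * dx + grad_a[1] * dy
    quad = hess_a[0][0] * dx * dx + 2 * hess_a[0][1] * dx * dy + hess_a[1][1] * dy * dy
    return f_a + linear + 0.5 * quad

def fd_gradient(g, p, h=1e-5):
    """Central-difference gradient of g at p."""
    x, y = p
    return ((g(x + h, y) - g(x - h, y)) / (2 * h),
            (g(x, y + h) - g(x, y - h)) / (2 * h))

def fd_hessian(g, p, h=1e-4):
    """Central-difference Hessian of g at p."""
    x, y = p
    gxx = (g(x + h, y) - 2 * g(x, y) + g(x - h, y)) / h**2
    gyy = (g(x, y + h) - 2 * g(x, y) + g(x, y - h)) / h**2
    gxy = (g(x + h, y + h) - g(x + h, y - h)
           - g(x - h, y + h) + g(x - h, y - h)) / (4 * h**2)
    return ((gxx, gxy), (gxy, gyy))

def max_gap(A, B):
    """Largest entrywise difference between two 2x2 nested tuples."""
    return max(abs(u - v) for ra, rb in zip(A, B) for u, v in zip(ra, rb))

value_gap = abs(f(*a) - t2(*a))
grad_gap = max(abs(u - v) for u, v in zip(fd_gradient(f, a), fd_gradient(t2, a)))
hess_gap = max_gap(fd_hessian(f, a), fd_hessian(t2, a))
far_gap = max_gap(fd_hessian(t2, (2.0, 1.0)), hess_a)  # Hessian of T_2 away from a
print(value_gap, grad_gap, hess_gap, far_gap)
```

The last quantity illustrates the remark above: the Hessian of $T_2$ equals $H_f(\mathbf{a})$ even at a point far from $\mathbf{a}$, whereas the value and gradient agree with those of $f$ only at the base point.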

These three conditions are not just consequences of the formula — they are the formula, read backward. Demand a degree-2 polynomial whose value, gradient, and Hessian at $\mathbf{a}$ match those of $f$ , solve for the coefficients, and the expression in the definition is the only one that works. That uniqueness is what singles out $T_{2,f,\mathbf{a}}$ among all degree-2 polynomials. Higher-order Taylor polynomials extend the same idea — pin the third-order partials too, then the fourth, and so on — with each new order locking down one more layer of local behavior at $\mathbf{a}$ .
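The "solve for the coefficients" step can be sketched explicitly. The symbols $p$, $c$, $\mathbf{b}$, and $M$ below are introduced only for this illustration, and the argument assumes $H_f(\mathbf{a})$ is symmetric, as in the $\mathbb{R}^2$ computation above. Any polynomial of degree at most 2 can be written in powers of the displacement, with $M$ symmetric without loss of generality, since a quadratic form only sees the symmetric part of its matrix:

p(\mathbf{x}) = c + \mathbf{b}^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}(\mathbf{x} - \mathbf{a})^\top M (\mathbf{x} - \mathbf{a})

Repeating the three computations above for $p$ gives its value, gradient, and Hessian at the base point:

\begin{aligned} p(\mathbf{a}) &= c \\ \nabla p(\mathbf{a}) &= \mathbf{b} \\ H_p(\mathbf{a}) &= M \end{aligned}

Imposing the three matching conditions then leaves no freedom in the coefficients:

\begin{aligned} p(\mathbf{a}) = f(\mathbf{a}) &\implies c = f(\mathbf{a}) \\ \nabla p(\mathbf{a}) = \nabla f(\mathbf{a}) &\implies \mathbf{b} = \nabla f(\mathbf{a}) \\ H_p(\mathbf{a}) = H_f(\mathbf{a}) &\implies M = H_f(\mathbf{a}) \end{aligned}

Substituting these back into $p$ reproduces the formula for $T_{2,f,\mathbf{a}}$ term by term.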