Partial Differentiation

Consider a scalar field of the form:

$$f : D \subseteq \mathbb{R}^n \to \mathbb{R}, \quad \mathbf{x} = (x_1, \ldots, x_n)^\top \mapsto f(\mathbf{x}) = f(x_1, \ldots, x_n)$$

This is a function that takes a point in $n$-dimensional space and returns a single real number, like a temperature field in a room or an elevation map over terrain.

Directional Derivative

Recall from 1D calculus that the derivative of $f$ at a point $a$ is:

$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$

This is the slope of the function at $a$: how much does $f$ change when we take a tiny step of size $h$ away from $a$? Geometrically, it is the same intuition behind $\tan(\alpha) = \frac{\sin(\alpha)}{\cos(\alpha)}$: the slope of a line is the ratio of the vertical rise to the horizontal run, i.e., $\frac{\Delta y}{\Delta x} = \frac{f(a+h) - f(a)}{h}$.

In multiple dimensions, the same question becomes richer: we can step away from a point in infinitely many directions. This is where the directional derivative comes in: it tells us the rate of change of $f$ when we stand at a point $\mathbf{a}$ and look in a specific direction $\mathbf{v}$.

Here, $\mathbf{a} \in D$ is the base point, the location in the domain where we want to measure how fast $f$ is changing. Think of $\mathbf{a}$ as your exact current GPS coordinate while standing perfectly still; $\mathbf{v}$ is the direction you are facing. Crucially, the directional derivative is a local quantity: it captures the behavior of $f$ specifically at $\mathbf{a}$, not globally.

For each unit vector $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$), we take a tiny step of size $h$ from $\mathbf{a}$ in the direction $\mathbf{v}$, measure the change in $f$, and divide by the step size. This is the same slope formula as before, now generalized to any direction in $n$-dimensional space:

$$\frac{f(\mathbf{a} + h\mathbf{v}) - f(\mathbf{a})}{h}$$

As $h \to 0$, the step shrinks to zero and we recover the instantaneous rate of change in the chosen direction.

The directional derivative of a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ in the direction of a unit vector $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$) is:

$$\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \partial_\mathbf{v} f(\mathbf{a}) = f_\mathbf{v}(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{v}) - f(\mathbf{a})}{h}$$

All three notations, $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a})$, $\partial_\mathbf{v} f(\mathbf{a})$, and $f_\mathbf{v}(\mathbf{a})$, refer to the same quantity: the rate of change of $f$ at the base point $\mathbf{a}$ along the direction $\mathbf{v}$.

Requiring $\mathbf{v}$ to be a unit vector ensures that the step $h\mathbf{v}$ has length exactly $|h|$, so the derivative measures the rate of change per unit distance travelled. Without this normalization, the value would scale with $\|\mathbf{v}\|$.

Since $\mathbf{v}$ can point in any direction on the unit sphere in $\mathbb{R}^n$, there are infinitely many directional derivatives at any given point $\mathbf{a}$. In principle, this limit may not exist for every combination of $f$, $\mathbf{a}$, and $\mathbf{v}$. However, throughout this course we will work with sufficiently smooth functions and always assume the limit exists: the “happy scenario.”
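The limit above can be approximated numerically by choosing a small but finite $h$. The following is a minimal sketch, not part of the course material: the helper name `directional_derivative`, the test function $x^2 + y^2$, and the step size are illustrative assumptions, and it uses a central difference (stepping both forward and backward) rather than the one-sided quotient from the definition, because that is usually more accurate for smooth functions.

```python
import math

def directional_derivative(f, a, v, h=1e-6):
    """Central-difference estimate of the directional derivative of f at a along v.

    The direction is normalized first, so the step has length h.
    """
    norm = math.sqrt(sum(vi * vi for vi in v))
    u = [vi / norm for vi in v]
    forward = [ai + h * ui for ai, ui in zip(a, u)]
    backward = [ai - h * ui for ai, ui in zip(a, u)]
    return (f(forward) - f(backward)) / (2 * h)

def f(p):
    # Illustrative scalar field (an assumption for this sketch).
    x, y = p
    return x**2 + y**2

a = [1.0, 2.0]
d_axis = directional_derivative(f, a, [1.0, 0.0])  # along the first axis
d_diag = directional_derivative(f, a, [1.0, 1.0])  # along the diagonal
print(d_axis, d_diag)
```

Passing an unnormalized direction such as `[1.0, 1.0]` is safe here because the helper rescales it to unit length before stepping.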

Partial Derivatives

Having infinitely many directions to check is impractical. The saving insight is that we are working in a linear vector space: any direction $\mathbf{v}$ can be expressed as a linear combination of the $n$ vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the standard basis (also called the canonical basis) of $\mathbb{R}^n$. For smooth functions, this means the behavior of $f$ in every direction is fully determined by its behavior along the $n$ coordinate directions, one per argument of the function. We do not need to sweep through all directions; the $n$ coordinate directions are enough.

Substituting $\mathbf{v} = \mathbf{e}_i$ into the directional derivative formula means we step from $\mathbf{a}$ by a tiny amount $h$ purely along the $i$-th axis, holding all other coordinates fixed. The resulting limit is the rate of change of $f$ with respect to its $i$-th argument $x_i$.

The partial derivative of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ with respect to the variable $x_i$ is:

$$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \partial_i f(\mathbf{a}) = f_{x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a})}{h}, \quad i \in \{1, \ldots, n\}$$

As always, we assume this limit exists. Each partial derivative answers a single focused question: how fast does $f$ change when only the $i$-th coordinate of $\mathbf{a}$ is nudged, while all others stay fixed? For sufficiently smooth $f$ (continuous partial derivatives, as assumed throughout), the $n$ partial derivatives together determine how $f$ varies locally in every direction.

When all $n$ of them exist at a given point, this property gets its own name.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is partially differentiable at a point $\mathbf{a} \in D$ if all $n$ of its partial derivatives exist at $\mathbf{a}$. It is partially differentiable on $D$ if this holds at every point of $D$:

$$f \text{ partially differentiable at } \mathbf{a} \iff f_{x_1}(\mathbf{a}), \ldots, f_{x_n}(\mathbf{a}) \text{ all exist}$$
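As a worked instance of the limit definition of the partial derivative, take the illustrative field $f(x, y) = x^2 + y^2$ (chosen here only as an example) and step along the first axis while $y$ stays fixed:

$$\frac{\partial f}{\partial x}(x, y) = \lim_{h \to 0} \frac{(x + h)^2 + y^2 - (x^2 + y^2)}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$

By the same computation with the roles of $x$ and $y$ swapped, $\frac{\partial f}{\partial y}(x, y) = 2y$. Both limits exist at every point, so this $f$ is partially differentiable on all of $\mathbb{R}^2$.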

The Gradient

When $f$ is partially differentiable at $\mathbf{a}$, its $n$ partial derivatives can be collected into a single vector: the gradient.

The gradient of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ is the column vector of all partial derivatives:

$$\nabla f(\mathbf{a}) = \begin{pmatrix} f_{x_1}(\mathbf{a}) \\ \vdots \\ f_{x_n}(\mathbf{a}) \end{pmatrix} = \operatorname{grad} f(\mathbf{a})$$

Gradient as a reusable formula: In practice, you do not evaluate the gradient at a specific point $\mathbf{a}$ directly. Instead, you first derive the partial derivative formulas symbolically, keeping the variables $x, y, \dots$ free, and assemble them into $\nabla f(\mathbf{x})$. This gives you a general expression valid for any point in the domain. Evaluating it at a specific $\mathbf{a}$ is then just substitution.

The gradient $\nabla f(\mathbf{a}) \in \mathbb{R}^n$ lives in the same space as the input, not the output. So for a scalar field $f : \mathbb{R}^2 \to \mathbb{R}$, the gradient at any point is a 2D vector; for $f : \mathbb{R}^3 \to \mathbb{R}$, it’s a 3D vector.

This means we can evaluate $\nabla f$ at every point in the domain, producing a gradient field $\mathbf{a} \mapsto \nabla f(\mathbf{a})$, which is a vector field on $D$. Think of it as an arrow attached to each input point on the flat “floor” map, pointing in the direction of steepest ascent at that location.

The symbol $\nabla$ is called the nabla (or del) operator. Differential operators like $\nabla$ will be explored in more depth later in the course.

The gradient is the payoff for working in a linear vector space. When $f_{x_1}, \ldots, f_{x_n}$ are continuous, any directional derivative reduces to an inner product with the gradient:

$$\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle$$

This is the key trick: instead of computing a separate limit for every possible direction $\mathbf{v}$, we compute the gradient once (just $n$ partial derivatives) and recover any directional derivative for free via an inner product (dot product). The gradient packages all local directional information into a single vector.

Let $f(x, y, z) = e^{-z}\sin x + y^2$. Using standard differentiation rules, the three partial derivatives are:

$$\frac{\partial f}{\partial x} = e^{-z}\cos x, \qquad \frac{\partial f}{\partial y} = 2y, \qquad \frac{\partial f}{\partial z} = -e^{-z}\sin x$$

Assembling them into the gradient:

$$\nabla f(x, y, z) = \begin{pmatrix} e^{-z}\cos x \\ 2y \\ -e^{-z}\sin x \end{pmatrix}$$
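A symbolic gradient like this one is easy to sanity-check numerically. The sketch below compares the formula for $f(x, y, z) = e^{-z}\sin x + y^2$ against central-difference estimates, and then checks the inner-product formula for one direction. The helper `numerical_gradient`, the evaluation point, and the direction $\mathbf{v}$ are arbitrary illustrative choices.

```python
import math

def f(p):
    x, y, z = p
    return math.exp(-z) * math.sin(x) + y**2

def grad_f(p):
    # The symbolic gradient derived in the text.
    x, y, z = p
    return [math.exp(-z) * math.cos(x), 2 * y, -math.exp(-z) * math.sin(x)]

def numerical_gradient(func, p, h=1e-6):
    """Central-difference estimate of each partial derivative of func at p."""
    grad = []
    for i in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[i] += h
        minus[i] -= h
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad

a = [0.5, -1.0, 0.25]           # arbitrary evaluation point
analytic = grad_f(a)
numeric = numerical_gradient(f, a)

# Directional derivative along a unit vector, computed two ways.
v = [1 / 3, 2 / 3, 2 / 3]       # has length 1
via_gradient = sum(g * vi for g, vi in zip(analytic, v))
h = 1e-6
via_limit = (f([ai + h * vi for ai, vi in zip(a, v)])
             - f([ai - h * vi for ai, vi in zip(a, v)])) / (2 * h)

print(analytic, numeric)
print(via_gradient, via_limit)
```

The two printed gradients should agree to many digits, and so should the two directional derivatives, which is the inner-product formula at work.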

Steepest Ascent and Descent

The gradient does more than package the partial derivatives: it points in a very specific geometric direction. Suppose $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is a scalar field with continuous partial derivatives $f_{x_1}, \ldots, f_{x_n}$, and let $\mathbf{a} \in D$ be a point where $\nabla f(\mathbf{a}) \neq \mathbf{0}$. Then $f$ has its steepest ascent at $\mathbf{a}$ in the direction $\nabla f(\mathbf{a})$, and its steepest descent in the opposite direction $-\nabla f(\mathbf{a})$.

Intuitively: standing on a hillside, the gradient points straight uphill along the steepest route, and its negative points straight downhill (the way a ball would roll). Every other direction trades some uphill progress for sideways motion.

This follows directly from the inner-product formula $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle$. For a unit vector $\mathbf{v}$ at angle $\theta$ to the gradient, $\langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle = \|\nabla f(\mathbf{a})\| \cos\theta$, which is largest ($\cos\theta = 1$) when $\mathbf{v}$ is perfectly aligned with $\nabla f(\mathbf{a})$ and smallest ($\cos\theta = -1$) when it points the opposite way. The one strict condition is $\nabla f(\mathbf{a}) \neq \mathbf{0}$: at a point where the gradient vanishes, there is no preferred direction (the ground is flat).
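The steepest-ascent claim can also be checked by brute force in 2D: scan many unit vectors, estimate the slope along each, and see which one wins. This is a rough sketch under illustrative assumptions (the field $x^2 + y^2$, the base point, and a grid of 3600 angles); it finds the best direction only up to the grid resolution.

```python
import math

def f(p):
    x, y = p
    return x**2 + y**2

def directional_derivative(func, a, u, h=1e-6):
    """Central-difference estimate of the slope of func at a along unit vector u."""
    forward = [ai + h * ui for ai, ui in zip(a, u)]
    backward = [ai - h * ui for ai, ui in zip(a, u)]
    return (func(forward) - func(backward)) / (2 * h)

a = [1.0, 2.0]
gradient = [2 * a[0], 2 * a[1]]                   # analytic gradient of x^2 + y^2
gradient_angle = math.atan2(gradient[1], gradient[0])
gradient_norm = math.hypot(gradient[0], gradient[1])

# Scan unit vectors (cos t, sin t) over a grid of angles in [0, 2*pi).
num_angles = 3600
step = 2 * math.pi / num_angles
slopes = []
for k in range(num_angles):
    t = k * step
    slopes.append((directional_derivative(f, a, [math.cos(t), math.sin(t)]), t))

best_slope, best_angle = max(slopes)     # steepest ascent found by the scan
worst_slope, worst_angle = min(slopes)   # steepest descent found by the scan

print(best_angle, gradient_angle)
print(best_slope, gradient_norm)
```

The winning angle should match the gradient's angle to within the grid spacing, the winning slope should be close to $\|\nabla f(\mathbf{a})\|$, and the losing direction should point the opposite way.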

Isolines

Alongside the gradient, another geometric object captures the shape of a scalar field: the set of points where $f$ takes a constant value.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a scalar field. For a value $c \in f(D) \subseteq \mathbb{R}$, the isoline (also called contour line or level set) of $f$ at level $c$ is:

$$N_c = \{\, \mathbf{x} \in D \mid f(\mathbf{x}) = c \,\}$$

Think of a topographic map: each contour traces the points at the same elevation. On a weather map, each isotherm connects places with the same temperature. More generally, isolines slice the domain into level sets on which $f$ is constant. (Note: a notation like $N_{f(\mathbf{a})=c}$ is just a formal way of saying “the specific isoline that passes through our base point $\mathbf{a}$”.)

The gradient at a point $\mathbf{a}$ is always perpendicular to the isoline $N_{f(\mathbf{a}) = c}$ passing through $\mathbf{a}$.

To see why, pick any unit vector $\mathbf{v}$ that points along the isoline (a tangent vector). Elevation doesn’t change in that direction, so the directional derivative vanishes:

$$\langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = 0$$

Since a zero inner product means two vectors are orthogonal, the gradient $\nabla f(\mathbf{a})$, and with it the direction of steepest ascent, is forced to be perpendicular to the isoline.

The “tilted ramp” picture

Imagine standing on a tilted ramp. The isoline is a horizontal line painted across the ramp (where height never changes). If you take a step at an angle (diagonally across the ramp), you are “wasting” part of your step’s length moving sideways along the isoline, meaning you gain less vertical height. To gain the absolute maximum height possible in a single step (the direction of steepest ascent, which is the gradient), you must dedicate 100% of your step to moving forward, wasting zero energy on sideways movement. The only way to move with zero sideways drift is to walk exactly at a 90-degree angle to the horizontal isoline.

$$f: [-2,2] \times [-2,2] \to \mathbb{R}, \quad f(x,y) = x^2 + y^2, \quad \nabla f(x,y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}$$

In the example above, the 2D floor is the domain of $f$: the heatmap encodes $f(x,y)$ as color, the dark rings are isolines (level sets $f(x,y) = c$), and the red arrows are the gradient field $\nabla f$. This is where $f$, its isolines, and its gradient actually live. The 3D bowl is the graph $z = f(x,y)$, a visualization aid rather than a separate object. The same isolines and gradient arrows are lifted onto it: the rings become horizontal cross-sections of the bowl, and the arrows become tangent vectors pointing in the steepest-ascent direction along the surface.

Arrow length. The arrows are short near the center and long at the boundary, and not by coincidence. For $f(x,y) = x^2 + y^2$:

$$\|\nabla f\| = \sqrt{(2x)^2 + (2y)^2} = \sqrt{4x^2 + 4y^2} = 2\sqrt{x^2 + y^2} = 2r$$

The gradient magnitude is just twice the radial distance $r$. Geometrically: near the origin the bowl is nearly flat, so there’s barely any slope to point along; near the rim it’s steep, so the gradient is large. Exactly at the origin $\|\nabla f\| = 0$: you’re at the minimum, there’s no downhill direction, and the gradient has nothing to say.

Arrow direction. Every arrow points straight outward, perpendicular to the isoline it sits on. That’s not specific to this example; it’s always true: the gradient is orthogonal to the level set. If you stood on the 3D bowl and asked “which way is straight up the slope?”, the answer is exactly where the lifted arrow points.
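Both observations about the arrows can be verified directly for $f(x,y) = x^2 + y^2$, whose isolines are circles around the origin. In this small sketch, the tangent to the circle through $(x, y)$ is taken to be $(-y, x)$, which is a standard fact about circles rather than something stated in the text, and the sample radii and angles are arbitrary.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)

checks = []
for r in (0.5, 1.0, 2.0):            # three isolines (circles of radius r)
    for k in range(8):               # eight sample points on each circle
        t = 2 * math.pi * k / 8
        x, y = r * math.cos(t), r * math.sin(t)
        gx, gy = grad_f(x, y)
        tangent = (-y, x)            # direction along the circle at (x, y)
        dot = gx * tangent[0] + gy * tangent[1]
        norm = math.hypot(gx, gy)
        checks.append((r, dot, norm))

for r, dot, norm in checks:
    print(f"r={r}: gradient . tangent = {dot:.1e}, |gradient| = {norm:.6f}")
```

Every inner product with the tangent comes out as zero up to rounding (perpendicularity), and every gradient length equals $2r$.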

Second-Order Partial Derivatives

The partial derivatives $f_{x_1}, \ldots, f_{x_n}$ of a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ are themselves scalar fields on $D$: each one assigns to every point the rate of change of $f$ along the corresponding coordinate axis. We assume $D$ is an open set here, which simply means every point of $D$ has a bit of breathing room around it that still lies inside $D$; no point sits right on the edge. That matters because computing $f_{x_i}(\mathbf{a})$ means peeking at $f$ just to either side of $\mathbf{a}$ along the $i$-th axis; if $\mathbf{a}$ were on the boundary, some of those nearby probe points would fall outside $D$ where $f$ isn’t defined, and the limit couldn’t be formed at all. Openness rules that case out, so the difference quotient can be formed at every point of $D$. Whether the limit then exists is still a property of $f$, which we continue to assume.

When all $n$ of these derived scalar fields are continuous, we give the situation its own name.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is continuously partially differentiable on $D$ if all of its partial derivatives exist and are continuous on $D$:

$$f \text{ continuously partially differentiable on } D \iff f_{x_1}, \ldots, f_{x_n} \text{ continuous on } D$$

If these first-order partial derivatives are themselves partially differentiable, we can differentiate a second time, taking a partial derivative of a partial derivative.

The second-order partial derivative of $f$ with respect to $x_i$ and $x_j$ is the partial derivative of $f_{x_i}$ taken once more with respect to $x_j$:

$$\partial_{x_j} \partial_{x_i} f(\mathbf{x}) = \frac{\partial^2 f}{\partial x_j \, \partial x_i}(\mathbf{x}) = \partial_j \partial_i f(\mathbf{x}) = f_{x_j x_i}(\mathbf{x})$$

All four expressions denote exactly the same quantity; they are interchangeable notations for the same second-order partial derivative (read them right-to-left like peeling an onion: first differentiate with respect to $x_i$, then differentiate the result with respect to $x_j$).

The intuition mirrors the 1D case. In 1D, the second derivative $f''(a)$ measures curvature: whether the slope is growing or shrinking as we move along the axis. On a hill, $f''$ tells you whether the climb is getting steeper or starting to level out. The same picture carries over to higher dimensions: an unmixed derivative $f_{x_i x_i}$ describes how the slope along the $i$-th axis is itself changing as we step further along $x_i$, and a mixed derivative $f_{x_j x_i}$ describes how the slope along the $i$-th axis varies as we step along $x_j$.

Smoothness

The construction extends arbitrarily: differentiate a partial derivative once more to get a third-order partial derivative, again for fourth-order, and so on without limit. When all partial derivatives up to order $k$ exist and are continuous, we get a property worth naming.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is $k$-times continuously partially differentiable on $D$ if all of its partial derivatives up to order $k$ exist and are continuous on $D$.

These nested smoothness levels get their own family of names, the $C^k$ classes, each one packaging “$f$ has $k$ continuous orders of derivatives” into a single label.

For $k \in \mathbb{N}_0 \cup \{\infty\}$, a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is of class $C^k(D)$ as defined below:

$$\begin{aligned} C^0(D) &= \{\, f \mid f \text{ is continuous} \,\} \\ C^k(D) &= \{\, f \mid f \text{ is } k\text{-times continuously partially differentiable} \,\}, \quad k \in \mathbb{N} \\ C^\infty(D) &= \{\, f \mid f \text{ is continuously partially differentiable of any order} \,\} \end{aligned}$$

So $C^0(D)$ is just the continuous functions (no jumps) on $D$; $C^k(D)$ for $k \geq 1$ asks for $k$ continuous orders of partial derivatives; and $C^\infty(D)$ is the set of functions that remain differentiable (perfectly smooth) no matter how many times you differentiate them. Two milestones from this hierarchy matter most going forward: $f \in C^1(D)$ means all first-order partial derivatives are continuous, so we can assemble them into the gradient $\nabla f$ at every point of $D$; and $f \in C^2(D)$ means all second-order partial derivatives are continuous, which is exactly what is needed to assemble them into the Hessian.

In practice, we rarely care about the exact value of $k$; we just want enough continuous derivatives for whatever theorem or computation we’re doing. That looser notion gets its own name.

A function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is called a smooth function if it is of class $C^k(D)$ with $k$ high enough for the problem at hand:

$$f \text{ smooth on } D \iff f \in C^k(D) \text{ for sufficiently large } k$$

The required $k$ depends on context: if a result needs second-order partial derivatives to be continuous, “smooth enough” means $C^2(D)$; if third-order derivatives are needed, $C^3(D)$, and so on. In most practical settings one simply assumes $f \in C^\infty(D)$ to avoid having to track the exact order.

Once $f$ reaches $C^2$ smoothness, a useful property kicks in: the order in which we take partial derivatives stops mattering.

If $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is a scalar field with $f \in C^2(D)$, then for all $i, j \in \{1, \ldots, n\}$:

$$f_{x_i x_j} = f_{x_j x_i}$$

Differentiating with respect to $x_i$ first and then $x_j$ gives the same answer as differentiating in the opposite order. For any function smooth enough to land in $C^2$, mixed second-order partial derivatives commute, i.e. $\partial_i \partial_j f = \partial_j \partial_i f$.
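As a quick worked check, take the earlier example $f(x, y, z) = e^{-z}\sin x + y^2$, which has continuous partial derivatives of every order. Differentiating first with respect to $x$ and then $z$, and then in the opposite order:

$$f_{zx} = \partial_z \big( \partial_x f \big) = \partial_z \big( e^{-z}\cos x \big) = -e^{-z}\cos x, \qquad f_{xz} = \partial_x \big( \partial_z f \big) = \partial_x \big( -e^{-z}\sin x \big) = -e^{-z}\cos x$$

Both orders give the same function, as the symmetry statement predicts. The mixed derivatives involving $y$ also agree, since $f_{xy} = f_{yx} = 0$ and $f_{yz} = f_{zy} = 0$.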

Hessian Matrix

With the first-order partial derivatives, we organized the $n$ values $f_{x_1}, \ldots, f_{x_n}$ into a single vector: the gradient. Second-order partials are richer: for each pair $(x_i, x_j)$ there is one derivative $f_{x_j x_i}$, giving $n^2$ numbers in total. The natural way to organize them is as an $n \times n$ matrix. The pattern continues into higher orders (third-order partial derivatives need a three-index object with $n^3$ entries, a tensor; fourth-order ones need $n^4$ entries, and so on), but at second order, this $n \times n$ matrix has its own name.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a twice partially differentiable scalar field. The Hessian matrix of $f$ at a point $\mathbf{x} \in D$, denoted $H_f(\mathbf{x})$, is the $n \times n$ matrix of all second-order partial derivatives of $f$ at $\mathbf{x}$:

$$H_f(\mathbf{x}) = \begin{pmatrix} f_{x_1 x_1}(\mathbf{x}) & \cdots & f_{x_1 x_n}(\mathbf{x}) \\ \vdots & & \vdots \\ f_{x_n x_1}(\mathbf{x}) & \cdots & f_{x_n x_n}(\mathbf{x}) \end{pmatrix}$$

The entry in row $i$, column $j$ is $f_{x_i x_j}(\mathbf{x})$, the rate at which the $j$-th partial derivative of $f$ changes as we step along $x_i$. The diagonal entries $f_{x_i x_i}$ are the pure second derivatives along each axis; the off-diagonal entries $f_{x_i x_j}$ (for $i \neq j$) are the mixed partials.

If in addition $f \in C^2(D)$, the mixed partials satisfy $f_{x_i x_j} = f_{x_j x_i}$ for all $(i, j)$, so the Hessian is symmetric. Existence of the second-order partials alone does not guarantee this; their continuity is what makes the symmetry hold.
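One way to see the Hessian's row and column structure concretely is to build it by differencing the gradient: row $i$ records how every partial derivative changes under a small step along $x_i$. The sketch below does this for the earlier example $f(x, y, z) = e^{-z}\sin x + y^2$. The helper name `hessian_from_gradient`, the evaluation point, and the step size are illustrative assumptions, and the central difference is a numerical stand-in for the exact second derivatives.

```python
import math

def grad_f(p):
    # Analytic gradient of f(x, y, z) = exp(-z) * sin(x) + y^2.
    x, y, z = p
    return [math.exp(-z) * math.cos(x), 2 * y, -math.exp(-z) * math.sin(x)]

def hessian_from_gradient(grad, p, h=1e-6):
    """Estimate H[i][j] = d/dx_i of the j-th partial derivative by central differences."""
    n = len(p)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        plus = list(p)
        minus = list(p)
        plus[i] += h
        minus[i] -= h
        g_plus, g_minus = grad(plus), grad(minus)
        for j in range(n):
            H[i][j] = (g_plus[j] - g_minus[j]) / (2 * h)
    return H

a = [0.5, -1.0, 0.25]
H = hessian_from_gradient(grad_f, a)

# Hand-derived second-order partials at the same point, for comparison.
x, y, z = a
e = math.exp(-z)
H_exact = [
    [-e * math.sin(x), 0.0, -e * math.cos(x)],
    [0.0, 2.0, 0.0],
    [-e * math.cos(x), 0.0, e * math.sin(x)],
]

for row in H:
    print([round(entry, 6) for entry in row])
```

Rows and columns are filled independently here, so the near-equality of `H[i][j]` and `H[j][i]` in the output is a genuine numerical reflection of the symmetry, not something built into the helper.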

Stationary Points and Extrema

The gradient and Hessian are not brand-new objects: they are the $n$-dimensional counterparts of the first and second derivatives we already know from 1D calculus, and they carry over the same roles.

In 1D, finding the extrema of a smooth function $f : \mathbb{R} \to \mathbb{R}$ follows a two-step recipe: first solve $f'(a) = 0$ to locate candidate points (bottoms of valleys, tops of peaks), then evaluate $f''(a)$ at those candidates. A positive value means the curve opens upward (local minimum), a negative value means it opens downward (local maximum), and zero leaves the test inconclusive.

The exact same logic plays out in $n$ dimensions, with the gradient playing the role of $f'$ and the Hessian playing the role of $f''$. The “zero-slope” condition now becomes $\nabla f(\mathbf{a}) = \mathbf{0}$: every partial derivative vanishes at once. Points where this holds are called stationary points; they are defined formally in the optima chapter.

Geometrically, at a stationary point every directional derivative is zero: no matter which way you step, the slope of $f$ is flat at that point. Any actual change in $f$ as you move away only shows up through the curvature (how the surface bends), not through the slope itself. Stationary points are therefore exactly the candidates for local minima and maxima, the same way $f'(a) = 0$ produces the candidates in 1D. Not every stationary point is an extremum, however; some fall into neither the minimum nor the maximum category.

To tell the cases apart, we turn to second-order information: the Hessian. Informally, the Hessian describes the local curvature of $f$ in every direction at once, and at a stationary point it plays exactly the same classifying role that the sign of $f''$ plays in 1D: it decides whether the point is a minimum, a maximum, or something else. A precise criterion for reading this off the Hessian comes later in the optima chapter; for now it is enough to remember that gradient and Hessian together give us the multivariate extension of the 1D “set $f' = 0$, then check $f''$” recipe.

The partial derivatives of a scalar field should not be viewed as the actual derivative of $f$; they are not a direct extension of the real-function derivative $f'$. In multiple dimensions the notion of “a derivative” is genuinely more subtle: each partial derivative only captures how $f$ changes along a single coordinate axis, and on their own they miss the fact that the input $\mathbf{x}$ can move in any direction, not just along the axes. The partials do, however, assemble into a single linear object called the total differential, which is the proper $n$-dimensional analog of $f'$. We will not develop the total differential in this chapter (for our purposes the gradient, as a bundle of partial derivatives, is enough), but it is worth knowing that, strictly speaking, partial and total differentiation are different things.

Multivariate Functions

Every definition so far (partial derivatives, gradient, Hessian, continuous partial differentiability, class $C^k$) has been stated for scalar fields $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$. Extending them to multivariate functions is mechanical: a multivariate function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is nothing more than an $m$-tuple of scalar fields stacked vertically:

$$f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{x} \mapsto f(\mathbf{x}) = \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix}$$

Each component function $f_i : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is itself a scalar field, exactly the kind of object every earlier definition applied to. Every earlier property therefore lifts componentwise:

  • Partially differentiable: $f$ is partially differentiable at $\mathbf{a} \in D$ (or on $D$) if and only if every component $f_i$ is partially differentiable at $\mathbf{a}$ (or on $D$).
  • $k$-times continuously partially differentiable: $f$ is $k$-times continuously partially differentiable at $\mathbf{a}$ (or on $D$) if and only if every $f_i$ is.
  • Class $C^k(D)$: $f$ belongs to $C^k(D)$ if and only if every $f_i$ belongs to $C^k(D)$.

Nothing new to prove: each property is simply applied to each component in turn.

One term gets its own label in the multivariate setting: when all partial derivatives of $f$ not only exist but are also continuous, we say $f$ is differentiable.

In this course, a multivariate function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is called differentiable at a point $\mathbf{x} \in D$ (or on $D$) if all of its partial derivatives exist and are continuous at $\mathbf{x}$ (or throughout $D$):

$$f \text{ differentiable on } D \iff f \in C^1(D)$$

This is the condition more commonly called continuous differentiability. It is sufficient, though not necessary, for the total differentiability mentioned earlier.

Jacobian Matrix

Once a multivariate function is differentiable, with every component $f_i$ contributing $n$ continuous partial derivatives, the natural way to organize all of its first-order information is into a single matrix. For a scalar field we collected $n$ partials into the gradient vector. With $m$ components, each carrying $n$ partials, we now have $m \cdot n$ numbers in total, and they fit cleanly into an $m \times n$ matrix.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ be differentiable on $D$. The Jacobian matrix of $f$ at a point $\mathbf{x} \in D$, denoted $J_f(\mathbf{x})$, is the $m \times n$ matrix whose entry in row $i$ and column $j$ is the partial derivative of the $i$-th component with respect to the $j$-th variable:

$$J_f(\mathbf{x}) = \left( \frac{\partial f_i}{\partial x_j}(\mathbf{x}) \right)_{ij} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \nabla f_1(\mathbf{x})^\top \\ \vdots \\ \nabla f_m(\mathbf{x})^\top \end{pmatrix}$$

Reading the matrix row by row makes the structure obvious: the $i$-th row is exactly the gradient of the $i$-th component function $f_i$, written as a row vector. The Jacobian is therefore a vertical stack of component gradients, with each row carrying the full first-order story of one output coordinate, and each column carrying the response of all outputs to a nudge in one input variable.

An alternative notation often seen in the literature is $Df(\mathbf{x})$, used interchangeably with $J_f(\mathbf{x})$.

As a useful special case, when $m = 1$ the function reduces to a scalar field and the Jacobian has only a single row, which is exactly the transposed gradient:

$$J_f(\mathbf{x}) = \nabla f(\mathbf{x})^\top$$

So the gradient and the Jacobian are not separate ideas: the gradient is just the Jacobian’s only row (written as a column) when there is only one output to track.

The deeper purpose of the Jacobian is to provide the best linear approximation of $f$ near $\mathbf{x}$. If $f$ is differentiable at $\mathbf{x}$, then for a small step $\mathbf{h}$:

$$f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + J_f(\mathbf{x})\,\mathbf{h}$$

with the approximation getting sharper as $\|\mathbf{h}\| \to 0$. This is the multivariate version of the slope picture used earlier to introduce the directional derivative: in 1D, $f(a + h) \approx f(a) + f'(a)\,h$, which is the very same $\frac{\Delta y}{\Delta x} \approx f'(a)$ tangent line, just rearranged to predict the function value a small step past $a$. The scalar slope $f'(a)$ is replaced by the matrix $J_f(\mathbf{x})$, which acts on the displacement vector $\mathbf{h}$ to produce the predicted change in every output coordinate at once. A more detailed treatment follows in the discussion of coordinate transformations.

A particularly tidy special case: when $f$ is itself an affine function (often loosely called linear), $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ for some constant $m \times n$ matrix $A$ and constant vector $\mathbf{b} \in \mathbb{R}^m$, the Jacobian is just $A$ at every point of the domain. Each component $f_i$ is the combination $a_{i1}x_1 + \cdots + a_{in}x_n + b_i$, whose partial derivative with respect to $x_j$ is the constant $a_{ij}$, so the Jacobian’s $(i, j)$ entry is $a_{ij}$ everywhere. The linear approximation $f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + J_f(\mathbf{x})\,\mathbf{h}$ then stops being an approximation and becomes the exact equality $f(\mathbf{x} + \mathbf{h}) = f(\mathbf{x}) + A\,\mathbf{h}$: such a function coincides with its tangent approximation everywhere.
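The linear-approximation property can be made tangible with a small experiment: build the Jacobian, predict $f(\mathbf{x} + \mathbf{h})$ from it, and watch the error shrink as the step shrinks. The following is an illustrative sketch only; the map $f(x, y) = (x^2 y,\ \sin x + y)$, the base point, and the steps are arbitrary choices, and `numerical_jacobian` is a hypothetical helper based on central differences.

```python
import math

def f(p):
    x, y = p
    return [x**2 * y, math.sin(x) + y]

def jacobian_f(p):
    # Hand-derived Jacobian: row i holds the partials of component i.
    x, y = p
    return [[2 * x * y, x**2], [math.cos(x), 1.0]]

def numerical_jacobian(func, p, h=1e-6):
    """Central-difference estimate; column j is the response to a nudge in x_j."""
    m = len(func(p))
    J = [[0.0] * len(p) for _ in range(m)]
    for j in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

def linearization_error(p, step):
    """Euclidean distance between f(p + step) and the prediction f(p) + J step."""
    J = jacobian_f(p)
    base = f(p)
    exact = f([pi + si for pi, si in zip(p, step)])
    predicted = [base[i] + sum(J[i][j] * step[j] for j in range(len(step)))
                 for i in range(len(base))]
    return math.sqrt(sum((e - q) ** 2 for e, q in zip(exact, predicted)))

x0 = [1.0, 2.0]
J_exact = jacobian_f(x0)
J_numeric = numerical_jacobian(f, x0)
err_large = linearization_error(x0, [0.01, -0.02])
err_small = linearization_error(x0, [0.001, -0.002])
print(J_exact, J_numeric)
print(err_large, err_small)
```

Shrinking the step by a factor of 10 reduces the error by roughly a factor of 100, which is the behavior one expects when the leftover error is second order in $\|\mathbf{h}\|$.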

Calculation Rules

The Jacobian inherits a familiar set of computation rules, exactly the multivariate counterparts of the standard differentiation rules from 1D calculus. Sums differentiate term by term, scalars factor out, products obey a product rule, and compositions follow a chain rule. The only twist in higher dimensions is that the chain rule turns into a matrix multiplication of two Jacobians, so the order of multiplication now matters.

Let $f, g : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ be partially differentiable. Then for every $\mathbf{x} \in D$:

  • Additivity: $J_{f+g}(\mathbf{x}) = J_f(\mathbf{x}) + J_g(\mathbf{x})$
  • Homogeneity: $J_{\lambda f}(\mathbf{x}) = \lambda\, J_f(\mathbf{x})$ for all $\lambda \in \mathbb{R}$
  • Product rule: $J_{f^\top g}(\mathbf{x}) = f(\mathbf{x})^\top J_g(\mathbf{x}) + g(\mathbf{x})^\top J_f(\mathbf{x})$

Additivity and homogeneity together say that the Jacobian behaves as a linear operator on partially differentiable functions, directly mirroring the linearity of the 1D derivative, $(f+g)' = f'+g'$ and $(\lambda f)' = \lambda f'$. The product rule echoes the Leibniz pattern $(fg)' = f'g + fg'$, but since $f$ and $g$ are vector-valued the natural scalar product is the inner product $f^\top g : D \to \mathbb{R}$, and each factor carries a transpose so that both terms are $1 \times n$ rows, the shape of the Jacobian of a scalar field.
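The product rule is the least familiar of the three, so here is a small numerical check of it. Everything concrete in this sketch is an illustrative assumption: the two maps $f(x, y) = (xy,\ x + y)$ and $g(x, y) = (x^2,\ \sin y)$, the evaluation point, and the helper names. The left-hand side is estimated by central differences on the scalar field $f^\top g$; the right-hand side uses hand-derived Jacobians.

```python
import math

def f(p):
    x, y = p
    return [x * y, x + y]

def g(p):
    x, y = p
    return [x**2, math.sin(y)]

def jacobian_f(p):
    x, y = p
    return [[y, x], [1.0, 1.0]]

def jacobian_g(p):
    x, y = p
    return [[2 * x, 0.0], [0.0, math.cos(y)]]

def inner_product(p):
    # The scalar field f^T g.
    return sum(fi * gi for fi, gi in zip(f(p), g(p)))

def row_times_matrix(row, M):
    """Multiply a 1 x m row by an m x n matrix, giving a 1 x n row."""
    return [sum(row[k] * M[k][j] for k in range(len(row))) for j in range(len(M[0]))]

def numerical_row_jacobian(func, p, h=1e-6):
    """Central-difference Jacobian (a single row) of a scalar field."""
    row = []
    for j in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[j] += h
        minus[j] -= h
        row.append((func(plus) - func(minus)) / (2 * h))
    return row

x0 = [0.8, -0.4]
rule = [s + t for s, t in zip(row_times_matrix(f(x0), jacobian_g(x0)),
                              row_times_matrix(g(x0), jacobian_f(x0)))]
numeric = numerical_row_jacobian(inner_product, x0)
print(rule, numeric)
```

Both rows have two entries, one per input variable, which matches the $1 \times n$ shape the rule predicts for the Jacobian of $f^\top g$.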

The remaining rule needs two functions whose dimensions are compatible for composition. Let f:DRnRmf : D \subseteq \mathbb{R}^n \to \mathbb{R}^m and g:DRlRng : D' \subseteq \mathbb{R}^l \to \mathbb{R}^n with g(D)Dg(D') \subseteq D, so the composition h=fg:DRmh = f \circ g : D' \to \mathbb{R}^m is well-defined. Then for every xD\mathbf{x} \in D':

Jfg(x)=Jf(g(x))Jg(x)J_{f \circ g}(\mathbf{x}) = J_f(g(\mathbf{x}))\, J_g(\mathbf{x})

This is the composition rule, also widely known as the chain rule. The right-hand side is a matrix product: Jf(g(x))J_f(g(\mathbf{x})) is m×nm \times n, Jg(x)J_g(\mathbf{x}) is n×ln \times l, and the result is m×lm \times l — exactly the shape required for Jfg(x)J_{f \circ g}(\mathbf{x}). It is the direct generalization of the 1D chain rule (fg)(x)=f(g(x))g(x)(f \circ g)'(x) = f'(g(x))\, g'(x), with scalar multiplication replaced by matrix multiplication.

Reading off the entry in row $i$ and column $j$ of $J_{f \circ g}(\mathbf{x})$ recovers the familiar partial-derivative form:

$$\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{k=1}^{n} \frac{\partial f_i}{\partial x_k}(g(\mathbf{x}))\, \frac{\partial g_k}{\partial x_j}(\mathbf{x}), \quad 1 \leq i \leq m,\ 1 \leq j \leq l$$

Here $i$ is the row index (the output component $h_i$ of the composition) and $j$ is the column index (the input variable $x_j$). The sum over $k$ runs through the $n$ intermediate outputs of $g$, which is exactly the inner product of the $i$-th row of $J_f(g(\mathbf{x}))$ with the $j$-th column of $J_g(\mathbf{x})$: the row-times-column rule of matrix multiplication, written out one entry at a time.
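The shape bookkeeping of the chain rule can be illustrated with a small numerical experiment. This is a sketch under stated assumptions: the central-difference helper `jacobian`, the helper `matmul`, and the example maps `g` (from $\mathbb{R}^2$ to $\mathbb{R}^3$) and `f` (from $\mathbb{R}^3$ to $\mathbb{R}^2$) are hypothetical choices, so here $l = 2$, $n = 3$, $m = 2$.

```python
def jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def matmul(A, B):
    # Entry (i, j) is the inner product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical example maps (assumptions for this sketch).
def g(x):  # R^2 -> R^3
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


def f(y):  # R^3 -> R^2
    return [y[0] * y[1] + y[2], y[0] - y[2] * y[1]]


def composition(x):  # f o g : R^2 -> R^2
    return f(g(x))


x0 = [1.0, 2.0]

J_f = jacobian(f, g(x0))           # 2 x 3, evaluated at g(x0), not at x0
J_g = jacobian(g, x0)              # 3 x 2
J_direct = jacobian(composition, x0)  # 2 x 2, differentiating f o g directly
J_chain = matmul(J_f, J_g)         # 2 x 2, via the chain rule

print(J_direct)
print(J_chain)
```

Note that `J_f` is evaluated at `g(x0)`; evaluating it at `x0` instead is a common mistake and would not even be dimensionally meaningful here, since `f` expects three inputs.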

Differential Operators

A differential operator is an abstract mapping that takes a function as input and returns another function as output, with the (partial) derivatives of the input doing the structural work. The operators in this section all eat a function defined on a domain $D \subseteq \mathbb{R}^n$ (a scalar field or a vector field, depending on the operator) and produce a new function as output. We assume throughout that every partial derivative of the input function that an operator uses exists and is continuous on $D$, so the constructions below are well-defined at every point.

A useful organizing idea before diving in: most of the operators that follow can be expressed as a single, more primitive operator — the nabla operator — combined with the standard vector products (scalar multiplication, inner product, cross product). Setting up that primitive carefully first makes everything else mechanical.

Nabla Operator

The nabla operator is the most fundamental differential operator on $\mathbb{R}^n$, and one we have already been using implicitly: it is exactly the object that turned a scalar field into its gradient back in the gradient definition, where writing $\nabla f$ meant “stack the partial derivatives of $f$ into a column vector”. Pulling it out as a stand-alone operator simply makes that move explicit and lets the same primitive serve the divergence and the rotation below. It is best read as a formal column “vector” whose entries are the $n$ partial-derivative operators $\partial_1, \ldots, \partial_n$: a column of operators, not numbers, waiting to be applied to a function.

The nabla operator (also called the del operator) on $\mathbb{R}^n$ is the formal column vector of partial-derivative operators:

$$\nabla = \begin{pmatrix} \partial_1 \\ \vdots \\ \partial_n \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial x_1} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{pmatrix}$$

Input: a function on $D$, either a scalar field $f$ or a vector field $\boldsymbol{v}$, depending on which “vector multiplication” is used. Output: a new function whose shape depends on the multiplication.

By itself $\nabla$ is not a function and has no value at a point; it is purely a column of operators. It only produces a result once paired with an actual function via one of the standard vector products, and the product determines what the result looks like:

  • Applying to a scalar field: $\nabla : (f: D \subseteq \mathbb{R}^n \to \mathbb{R}) \to (D \to \mathbb{R}^n),\ f \mapsto (\partial_1 f, \ldots, \partial_n f)^\top$. Think of $\nabla = (\partial_1, \ldots, \partial_n)^\top$ as a formal column vector of operators; applying it to $f$ means each slot acts via partial differentiation, giving exactly the gradient.

  • Inner product with a vector field: $\nabla \cdot : (\boldsymbol{v}: D \subseteq \mathbb{R}^n \to \mathbb{R}^n) \to (D \to \mathbb{R}),\ \boldsymbol{v} \mapsto \sum_i \partial_i v_i$. Pair each $\partial_i$ with the matching component $v_i$, apply, then sum, exactly like an inner product but with application replacing multiplication; the result is the divergence.

  • Cross product with a vector field: $\nabla \times : (\boldsymbol{v}: D \subseteq \mathbb{R}^3 \to \mathbb{R}^3) \to (D \to \mathbb{R}^3),\ \boldsymbol{v} \mapsto (\partial_2 v_3 - \partial_3 v_2,\ \partial_3 v_1 - \partial_1 v_3,\ \partial_1 v_2 - \partial_2 v_1)^\top$. This is exactly like a cross product but with application replacing multiplication; the result is the rotation.

Laplace Operator

The Laplace operator $\Delta$ takes a scalar field $f$ and returns a scalar field of the same shape.

$$\Delta : (f : D \subseteq \mathbb{R}^n \to \mathbb{R}) \to (D \to \mathbb{R}),\ f \mapsto \sum_{i=1}^{n} \partial_i^2 f$$

At each point $\mathbf{x}$, it sums the pure second-order partial derivatives of $f$, one for each dimension, leaving the mixed partials $\partial_i \partial_j f$ for $i \neq j$ out.

$$\Delta f = \sum_{i=1}^{n} \partial_i^2 f = \partial_1^2 f + \cdots + \partial_n^2 f$$

Equivalently, $\Delta f(\mathbf{x})$ is the trace of the Hessian $H_f$ at $\mathbf{x}$: the diagonal entries of $H_f$ are exactly $\partial_1^2 f, \ldots, \partial_n^2 f$, and the trace adds them.
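To make the “pure second derivatives only” point concrete, here is a minimal numerical sketch. The function `laplacian` (a second-order central-difference approximation) and the two example fields are assumptions for illustration, not part of the lecture.

```python
def laplacian(f, x, eps=1e-4):
    """Approximate the sum of pure second partials of f at x by central differences."""
    total = 0.0
    fx = f(x)
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        # Second central difference along coordinate i only (a diagonal Hessian entry).
        total += (f(x_plus) - 2.0 * fx + f(x_minus)) / (eps * eps)
    return total


# Hypothetical scalar fields on R^2 (assumptions for this sketch).
def quadratic(x):
    # The mixed term x0 * x1 has a nonzero mixed partial but zero pure second partials.
    return x[0] ** 2 + 3.0 * x[1] ** 2 + x[0] * x[1]


def saddle(x):
    # The pure second partials are +2 and -2, so they cancel in the sum.
    return x[0] ** 2 - x[1] ** 2


lap_quadratic = laplacian(quadratic, [1.0, 2.0])
lap_saddle = laplacian(saddle, [1.0, 2.0])
print(lap_quadratic, lap_saddle)  # expected to be close to 8 and 0
```

For `quadratic`, the Hessian is constant with diagonal entries $2$ and $6$ and off-diagonal entries $1$; the trace ignores the off-diagonal entries and gives $8$.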

Divergence

The divergence takes a vector field $\boldsymbol{v}$ and returns a scalar field: the $n$ component functions collapse into a single number at each point.

$$\nabla \cdot : (\boldsymbol{v} : D \subseteq \mathbb{R}^n \to \mathbb{R}^n) \to (D \to \mathbb{R}),\ \boldsymbol{v} \mapsto \sum_{i=1}^{n} \partial_i v_i$$

Each $\partial_i$ acts on its matching component $v_i$, and the results are summed. This is the formal inner product of $\nabla$ and $\boldsymbol{v}$, with application replacing multiplication.

$$\operatorname{div} \boldsymbol{v} = \nabla \cdot \boldsymbol{v} = \sum_{i=1}^{n} \partial_i v_i = \partial_1 v_1 + \cdots + \partial_n v_n$$

Equivalently, $(\nabla \cdot \boldsymbol{v})(\mathbf{x})$ is the trace of the Jacobian $J_{\boldsymbol{v}}$ at $\mathbf{x}$: the diagonal entries of $J_{\boldsymbol{v}}$ are exactly $\partial_1 v_1, \ldots, \partial_n v_n$, and the trace adds them.
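The trace-of-the-Jacobian reading translates directly into a few lines of code. The sketch below is illustrative only: the function `divergence` (central differences on the diagonal Jacobian entries) and both example vector fields are assumptions.

```python
def divergence(v, x, eps=1e-6):
    """Approximate sum_i d(v_i)/d(x_i) at x, i.e. the trace of the Jacobian of v."""
    total = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        # Only component i is differentiated with respect to variable i.
        total += (v(x_plus)[i] - v(x_minus)[i]) / (2 * eps)
    return total


# Hypothetical vector fields on R^3 (assumptions for this sketch).
def field(x):
    # div = 2*x0 + x0 + x1 = 3*x0 + x1
    return [x[0] ** 2, x[0] * x[1], x[1] * x[2]]


def swirl(x):
    # A rigid rotation about the third axis: no diagonal Jacobian entries at all.
    return [-x[1], x[0], 0.0]


div_field = divergence(field, [1.0, 2.0, 3.0])
div_swirl = divergence(swirl, [1.0, 2.0, 3.0])
print(div_field, div_swirl)  # expected to be close to 5 and 0
```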

Rotation

The rotation (curl) takes a 3D vector field $\boldsymbol{v}$ and returns a 3D vector field of the same shape.

$$\nabla \times : (\boldsymbol{v} : D \subseteq \mathbb{R}^3 \to \mathbb{R}^3) \to (D \to \mathbb{R}^3),\ \boldsymbol{v} \mapsto \nabla \times \boldsymbol{v}$$

Entry $i$ is $\partial_j v_k - \partial_k v_j$, where $(i, j, k)$ runs through the cyclic permutations of $(1, 2, 3)$. This is the pattern of the cross product, with each partial-derivative operator applied to a component of $\boldsymbol{v}$ rather than multiplied by it.

$$\operatorname{rot} \boldsymbol{v} = \nabla \times \boldsymbol{v} = \begin{pmatrix} \partial_2 v_3 - \partial_3 v_2 \\ \partial_3 v_1 - \partial_1 v_3 \\ \partial_1 v_2 - \partial_2 v_1 \end{pmatrix}$$

The rotation is defined only in $\mathbb{R}^3$, since the cross product it is built on is specific to three dimensions.
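The three entries of the rotation can be computed numerically as follows. This is a sketch under stated assumptions: the helpers `partial` and `curl` and the two example fields are hypothetical, and indices are zero-based in the code, so $\partial_2 v_3$ becomes `d(comp=2, var=1)`.

```python
def partial(v, x, comp, var, eps=1e-6):
    """Central-difference approximation of d(v_comp)/d(x_var) at x (zero-based)."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[var] += eps
    x_minus[var] -= eps
    return (v(x_plus)[comp] - v(x_minus)[comp]) / (2 * eps)


def curl(v, x, eps=1e-6):
    """Approximate the rotation of a 3D vector field v at the point x."""
    def d(comp, var):
        return partial(v, x, comp, var, eps)

    return [
        d(2, 1) - d(1, 2),  # d_2 v_3 - d_3 v_2
        d(0, 2) - d(2, 0),  # d_3 v_1 - d_1 v_3
        d(1, 0) - d(0, 1),  # d_1 v_2 - d_2 v_1
    ]


# Hypothetical vector fields on R^3 (assumptions for this sketch).
def swirl(x):
    # Rigid rotation about the third axis: rotation is (0, 0, 2) everywhere.
    return [-x[1], x[0], 0.0]


def radial(x):
    # Points straight away from the origin: rotation is zero everywhere.
    return [x[0], x[1], x[2]]


rot_swirl = curl(swirl, [1.0, 2.0, 3.0])
rot_radial = curl(radial, [1.0, 2.0, 3.0])
print(rot_swirl, rot_radial)
```

The two fields are chosen to contrast a purely circulating flow with a purely outward one.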

A few identities relate these differential operators. They are outside the scope of this lecture, but a useful addition to your mathematical toolkit. Assume that $\boldsymbol{v}$ and $\boldsymbol{u}$ are $C^2$ vector fields and $g$ is a $C^2$ scalar field, all of the correct dimensionality (in particular $n = 3$ wherever $\operatorname{rot}$ appears). Write $\Delta \boldsymbol{v} = (\Delta v_1, \ldots, \Delta v_n)^\top$ for the componentwise Laplacian and $(\boldsymbol{u} \cdot \nabla)\boldsymbol{v} = \sum_i u_i\, \partial_i \boldsymbol{v}$. Then the following identities hold:

  • $\operatorname{div} \operatorname{rot} \boldsymbol{v} = 0$
  • $\operatorname{rot} \nabla g = \mathbf{0}$
  • $\operatorname{div} \nabla g = \Delta g$
  • $\operatorname{rot}(\boldsymbol{v} \times \boldsymbol{u}) = (\boldsymbol{u} \cdot \nabla)\boldsymbol{v} - (\boldsymbol{v} \cdot \nabla)\boldsymbol{u} + \boldsymbol{v} \operatorname{div} \boldsymbol{u} - \boldsymbol{u} \operatorname{div} \boldsymbol{v}$
  • $\operatorname{rot}(g\,\boldsymbol{v}) = g\operatorname{rot} \boldsymbol{v} - \boldsymbol{v} \times \nabla g$
  • $\operatorname{rot}(g \nabla g) = \mathbf{0}$
  • $\nabla(\operatorname{div} \boldsymbol{v}) = \operatorname{rot} \operatorname{rot} \boldsymbol{v} + \Delta \boldsymbol{v}$
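As an illustration of how such identities are verified, the first one can be checked by direct expansion. The derivation below is an addition for illustration, not part of the lecture's scope; it assumes only that $\boldsymbol{v}$ is $C^2$, so that second partial derivatives commute (Schwarz's theorem).

$$
\begin{aligned}
\operatorname{div} \operatorname{rot} \boldsymbol{v}
&= \partial_1(\partial_2 v_3 - \partial_3 v_2) + \partial_2(\partial_3 v_1 - \partial_1 v_3) + \partial_3(\partial_1 v_2 - \partial_2 v_1) \\
&= (\partial_2 \partial_3 - \partial_3 \partial_2)\, v_1 + (\partial_3 \partial_1 - \partial_1 \partial_3)\, v_2 + (\partial_1 \partial_2 - \partial_2 \partial_1)\, v_3 \\
&= 0
\end{aligned}
$$

Each bracket vanishes because the order of differentiation does not matter for a $C^2$ function. The identity $\operatorname{rot} \nabla g = \mathbf{0}$ follows from the same cancellation applied entry by entry.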