Partial Differentiation

Consider a scalar field of the form:

f : D \subseteq \mathbb{R}^n \to \mathbb{R}, \quad \mathbf{x} = (x_1, \ldots, x_n)^\top \mapsto f(\mathbf{x}) = f(x_1, \ldots, x_n)

This is a function that takes a point in $n$ -dimensional space and returns a single real number — like a temperature field in a room, or an elevation map over terrain.

Directional Derivative

Recall from 1D calculus that the derivative of $f$ at a point $a$ is:

f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}

This is the slope of the function at $a$ : how much does $f$ change when we take a tiny step of size $h$ away from $a$ ? Geometrically, it is the same intuition behind $\tan(\alpha) = \frac{\sin(\alpha)}{\cos(\alpha)}$ — the slope of a line is simply the ratio of the vertical rise to the horizontal run, i.e., $\frac{\Delta y}{\Delta x} = \frac{f(a+h) - f(a)}{h}$ .

In multiple dimensions, the same question becomes richer: we can step away from a point in infinitely many directions. This is where the directional derivative comes in — it tells us the rate of change of $f$ when we stand at a point $\mathbf{a}$ and look in a specific direction $\mathbf{v}$ .

Here, $\mathbf{a} \in D$ is the base point — the location in the domain where we want to measure how fast $f$ is changing. Think of $\mathbf{a}$ as your exact current GPS coordinate while standing perfectly still. $\mathbf{v}$ is the direction you are facing. Crucially, the directional derivative is a local quantity — it captures the behavior of $f$ specifically at $\mathbf{a}$ , not globally.

For each unit vector $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$ ), we take a tiny step of size $h$ from $\mathbf{a}$ in the direction $\mathbf{v}$ , measure the change in $f$ , and divide by the step size — the same slope formula as before, now generalized to any direction in $n$ -dimensional space:

\frac{f(\mathbf{a} + h\mathbf{v}) - f(\mathbf{a})}{h}

As $h \to 0$ , the step size becomes infinitesimally small, and we recover the instantaneous rate of change in the chosen direction.

The directional derivative of a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ in the direction of a unit vector $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$ ) is:

\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \partial_\mathbf{v} f(\mathbf{a}) = f_\mathbf{v}(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{v}) - f(\mathbf{a})}{h}

All three notations — $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a})$ , $\partial_\mathbf{v} f(\mathbf{a})$ , and $f_\mathbf{v}(\mathbf{a})$ — refer to the same quantity: the rate of change of $f$ at the base point $\mathbf{a}$ along the direction $\mathbf{v}$ .

Requiring $\mathbf{v}$ to be a unit vector ensures that the step $h\mathbf{v}$ has length exactly $|h|$ , making the derivative a pure measure of directional rate of change, independent of the magnitude of $\mathbf{v}$ .

Since $\mathbf{v}$ can point in any direction on the unit sphere in $\mathbb{R}^n$ , there are infinitely many directional derivatives at any given point $\mathbf{a}$ . In principle, this limit may not exist for every combination of $f$ , $\mathbf{a}$ , and $\mathbf{v}$ . However, throughout this course we will work with sufficiently smooth functions and always assume the limit exists — the “happy scenario.”
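
The limit in the definition can be approximated numerically by picking a small but finite $h$. The following is a minimal Python sketch, not part of the original notes: the helper name `directional_derivative`, the example field $f(x, y) = x^2 + 3y$, the base point, and the step size are all illustrative assumptions.

```python
import math

def directional_derivative(f, a, v, h=1e-6):
    """Approximate the directional derivative by the quotient (f(a + h v) - f(a)) / h."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if abs(norm - 1.0) > 1e-9:
        raise ValueError("v must be a unit vector")
    shifted = [ai + h * vi for ai, vi in zip(a, v)]
    return (f(shifted) - f(a)) / h

def f(p):
    # Hypothetical example field f(x, y) = x^2 + 3y
    x, y = p
    return x * x + 3.0 * y

a = [1.0, 2.0]                                    # base point
v = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]  # unit vector along the diagonal
approx = directional_derivative(f, a, v)

# Expanding f(a + h v) - f(a) by hand gives 5h/sqrt(2) + h^2/2,
# so the exact limit as h -> 0 is 5/sqrt(2).
exact = 5.0 / math.sqrt(2.0)
print(approx, exact)
```

The two printed values should agree to several digits; the remaining gap comes from the $h^2/2$ term and floating-point rounding.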

Partial Derivatives

Having infinitely many directions to check is impractical. The saving insight is that we are working in a linear vector space: any direction $\mathbf{v}$ can be expressed as a linear combination of the $n$ vectors of the standard basis (also called the canonical basis) of $\mathbb{R}^n$ . For smooth functions, this means the behavior of $f$ in every direction is fully determined by its behavior along the $n$ coordinate directions — one per argument of the function. We do not need to sweep through all directions; the $n$ coordinate directions are enough.

Substituting $\mathbf{v} = \mathbf{e}_i$ into the directional derivative formula means we step from $\mathbf{a}$ by a tiny amount $h$ purely along the $i$ -th axis, holding all other coordinates fixed. The resulting limit is the rate of change of $f$ with respect to its $i$ -th argument $x_i$ .

The partial derivative of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ with respect to the variable $x_i$ is:

\frac{\partial f}{\partial x_i}(\mathbf{a}) = \partial_i f(\mathbf{a}) = f_{x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a})}{h}, \quad i \in \{1, \ldots, n\}

As always, we assume this limit exists. Each partial derivative answers a single focused question: how fast does $f$ change when only the $i$-th coordinate of $\mathbf{a}$ is nudged, while all others stay fixed? For sufficiently smooth $f$ (continuous partial derivatives are enough), the $n$ partial derivatives together describe, to first order, how $f$ varies near $\mathbf{a}$ in every direction.
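
Stepping along a single axis is easy to mimic in code. The sketch below is an illustration under assumptions, not the notes' own method: the helper `partial_derivative` and the field $f(x, y) = x^2 y$ are hypothetical, and the code uses a symmetric difference quotient, a common numerical variant that has the same limit as the one-sided quotient in the definition whenever the partial derivative exists.

```python
def partial_derivative(f, a, i, h=1e-6):
    """Approximate the partial derivative of f at a along coordinate i (0-based)."""
    forward = list(a)
    backward = list(a)
    forward[i] += h    # step +h along e_i, other coordinates unchanged
    backward[i] -= h   # step -h along e_i
    return (f(forward) - f(backward)) / (2.0 * h)

def f(p):
    # Hypothetical example field f(x, y) = x^2 * y
    x, y = p
    return x * x * y

a = [3.0, 2.0]
df_dx = partial_derivative(f, a, 0)   # analytic value: 2xy = 12
df_dy = partial_derivative(f, a, 1)   # analytic value: x^2 = 9
print(df_dx, df_dy)
```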

When all $n$ of them exist at a given point, this property of $f$ gets its own name.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is partially differentiable at a point $\mathbf{a} \in D$ if all $n$ of its partial derivatives exist at $\mathbf{a}$ . It is partially differentiable on $D$ if this holds at every point of $D$ :

f \text{ partially differentiable at } \mathbf{a} \iff f_{x_1}(\mathbf{a}), \ldots, f_{x_n}(\mathbf{a}) \text{ all exist}

The Gradient

When $f$ is partially differentiable at $\mathbf{a}$ , its $n$ partial derivatives can be collected into a single vector — the gradient.

The gradient of $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{a} \in D$ is the column vector of all partial derivatives:

\nabla f(\mathbf{a}) = \begin{pmatrix} f_{x_1}(\mathbf{a}) \\ \vdots \\ f_{x_n}(\mathbf{a}) \end{pmatrix} = \operatorname{grad} f(\mathbf{a})

Gradient as a reusable formula: In practice, you do not evaluate the gradient at a specific point $\mathbf{a}$ directly. Instead, you first derive the partial derivative formulas symbolically — keeping the variables $x, y, \dots$ free — and assemble them into $\nabla f(\mathbf{x})$ . This gives you a general expression valid for any point in the domain. Evaluating it at a specific $\mathbf{a}$ is then just substitution.

The gradient $\nabla f(\mathbf{a}) \in \mathbb{R}^n$ lives in the same space as the input — not the output. So for a scalar field $f : \mathbb{R}^2 \to \mathbb{R}$ , the gradient at any point is a 2D vector; for $f : \mathbb{R}^3 \to \mathbb{R}$ , it’s a 3D vector.

This means we can evaluate $\nabla f$ at every point in the domain, producing a gradient field $\mathbf{a} \mapsto \nabla f(\mathbf{a})$ — a vector field on $D$ . Think of it as an arrow attached to each input point on the flat “floor” map, pointing in the direction of steepest ascent at that location.

The symbol $\nabla$ is called the nabla (or del) operator. Differential operators like $\nabla$ will be explored in more depth later in the course.

The gradient is the payoff for working in a linear vector space. When $f_{x_1}, \ldots, f_{x_n}$ are continuous, any directional derivative reduces to an inner product with the gradient:

\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle

This is the key trick: instead of computing a separate tortuous limit for every possible diagonal direction $\mathbf{v}$ , we compute the gradient once — just $n$ partial derivatives — and recover any directional derivative for free via an inner product (dot product). The gradient packages all local directional information into a single vector.
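
A quick numerical check of this identity, as a sketch under assumptions: the field $f(x, y) = x^2 y$, its hand-derived gradient, and the helper `difference_quotient` are illustrative choices. The gradient is computed once and then reused for eight directions, each compared against a symmetric difference quotient.

```python
import math

def f(p):
    # Hypothetical example field f(x, y) = x^2 * y
    x, y = p
    return x * x * y

def grad_f(p):
    # Gradient derived by hand: (2xy, x^2)
    x, y = p
    return [2.0 * x * y, x * x]

def difference_quotient(f, a, v, h=1e-6):
    """Symmetric difference quotient of f at a along the unit vector v."""
    plus = [ai + h * vi for ai, vi in zip(a, v)]
    minus = [ai - h * vi for ai, vi in zip(a, v)]
    return (f(plus) - f(minus)) / (2.0 * h)

a = [3.0, 2.0]
g = grad_f(a)   # computed once, reused for every direction

results = []
for k in range(8):
    theta = 2.0 * math.pi * k / 8
    v = [math.cos(theta), math.sin(theta)]               # unit vector at angle theta
    via_gradient = sum(gi * vi for gi, vi in zip(g, v))  # <grad f(a), v>
    via_limit = difference_quotient(f, a, v)
    results.append((via_gradient, via_limit))
    print(f"theta={theta:.3f}  inner product={via_gradient:.6f}  quotient={via_limit:.6f}")
```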

Let $f(x, y, z) = e^{-z}\sin x + y^2$ . Using standard differentiation rules, the three partial derivatives are:

\frac{\partial f}{\partial x} = e^{-z}\cos x, \qquad \frac{\partial f}{\partial y} = 2y, \qquad \frac{\partial f}{\partial z} = -e^{-z}\sin x

Assembling them into the gradient:

\nabla f(x, y, z) = \begin{pmatrix} e^{-z}\cos x \\ 2y \\ -e^{-z}\sin x \end{pmatrix}
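
To illustrate the "derive once, substitute later" workflow, here is a small Python sketch that evaluates this gradient at a point and compares it with a finite-difference estimate. The test point and the helper `numeric_grad` are assumptions made for the illustration.

```python
import math

def f(x, y, z):
    return math.exp(-z) * math.sin(x) + y * y

def grad_f(x, y, z):
    # The symbolic gradient from the text, ready for substitution
    return [math.exp(-z) * math.cos(x), 2.0 * y, -math.exp(-z) * math.sin(x)]

def numeric_grad(func, point, h=1e-6):
    """Symmetric difference quotient along each coordinate axis."""
    grad = []
    for i in range(len(point)):
        fwd = list(point)
        bwd = list(point)
        fwd[i] += h
        bwd[i] -= h
        grad.append((func(*fwd) - func(*bwd)) / (2.0 * h))
    return grad

a = (0.5, -1.0, 2.0)   # arbitrary test point
analytic = grad_f(*a)
numeric = numeric_grad(f, a)
print(analytic)
print(numeric)
```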

Steepest Ascent and Descent

The gradient does more than package the partial derivatives — it points in a very specific geometric direction. Suppose $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is a scalar field with continuous partial derivatives $f_{x_1}, \ldots, f_{x_n}$ , and let $\mathbf{a} \in D$ be a point where $\nabla f(\mathbf{a}) \neq \mathbf{0}$ . Then $f$ has its steepest ascent at $\mathbf{a}$ in the direction $\nabla f(\mathbf{a})$ , and its steepest descent in the opposite direction $-\nabla f(\mathbf{a})$ .

Intuitively: standing on a hillside, the gradient points straight uphill along the steepest route, and its negative points straight downhill (the way a ball would roll). Every other direction trades some uphill progress for sideways motion.

This follows directly from the inner-product formula $\frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = \langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle$ : among all unit vectors $\mathbf{v}$ , this inner product is largest when $\mathbf{v}$ is perfectly aligned with $\nabla f(\mathbf{a})$ and smallest when it points the opposite way. The one strict condition is $\nabla f(\mathbf{a}) \neq \mathbf{0}$ — at a point where the gradient vanishes, there is no preferred direction (the ground is flat).
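
The claim can be tested by brute force: scan many unit directions, measure the slope in each, and see which one wins. The sketch below assumes a hypothetical field $f(x, y) = x^2 + 3y^2$, a base point $(1, 1)$, and a grid of 3600 directions; none of these come from the notes.

```python
import math

def f(p):
    # Hypothetical example field f(x, y) = x^2 + 3 y^2
    x, y = p
    return x * x + 3.0 * y * y

def slope_along(f, a, theta, h=1e-6):
    """Symmetric difference quotient of f at a along the unit vector at angle theta."""
    v = (math.cos(theta), math.sin(theta))
    plus = [a[0] + h * v[0], a[1] + h * v[1]]
    minus = [a[0] - h * v[0], a[1] - h * v[1]]
    return (f(plus) - f(minus)) / (2.0 * h)

a = [1.0, 1.0]
grad = [2.0 * a[0], 6.0 * a[1]]              # gradient (2x, 6y) derived by hand
grad_angle = math.atan2(grad[1], grad[0])
grad_norm = math.hypot(grad[0], grad[1])

# Brute-force scan over directions on the unit circle
n_dirs = 3600
angles = [2.0 * math.pi * k / n_dirs for k in range(n_dirs)]
slopes = [slope_along(f, a, t) for t in angles]
best = max(range(n_dirs), key=lambda k: slopes[k])
worst = min(range(n_dirs), key=lambda k: slopes[k])

print("steepest ascent angle:", angles[best], "gradient angle:", grad_angle)
print("largest slope:", slopes[best], "gradient norm:", grad_norm)
print("steepest descent angle:", angles[worst])
```

Up to the grid resolution, the best direction matches the angle of the gradient, the largest slope matches $\|\nabla f(\mathbf{a})\|$, and the worst direction sits half a turn away.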

Isolines

Alongside the gradient, another geometric object captures the shape of a scalar field: the set of points where $f$ takes a constant value.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a scalar field. For a value $c \in f(D) \subseteq \mathbb{R}$ , the isoline (also called contour line or level set) of $f$ at level $c$ is:

N_c = \{\, \mathbf{x} \in D \mid f(\mathbf{x}) = c \,\}

Think of a topographic map: each contour traces the points at the same elevation. On a weather map, each isotherm connects places with the same temperature. More generally, isolines slice the domain into level sets on which $f$ is constant. (Note: A notation like $N_{f(\mathbf{a})=c}$ is just a formal way of saying “the specific isoline that passes through our base point $\mathbf{a}$ ”).

The gradient at a point $\mathbf{a}$ is always perpendicular to the isoline $N_{f(\mathbf{a}) = c}$ passing through $\mathbf{a}$ .

To see why, pick any unit vector $\mathbf{v}$ that points along the isoline (a tangent vector). Elevation doesn’t change in that direction, so the directional derivative vanishes:

\langle \nabla f(\mathbf{a}),\, \mathbf{v} \rangle = \frac{\partial f}{\partial \mathbf{v}}(\mathbf{a}) = 0

Since a zero inner product means two vectors are orthogonal, the steepest ascent $\nabla f(\mathbf{a})$ is forced to be perpendicular to the isoline.
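
The orthogonality can also be checked on a concrete isoline. This sketch assumes a hypothetical field $f(x, y) = x^2 + 4y^2$, whose isolines are ellipses that can be parametrized explicitly; the tangent vector is the derivative of that parametrization. The tangent is not normalized here, which does not affect whether the inner product is zero.

```python
import math

def grad_f(x, y):
    # Gradient of the hypothetical field f(x, y) = x^2 + 4 y^2, derived by hand
    return (2.0 * x, 8.0 * y)

c = 4.0   # level of the isoline f(x, y) = c, an ellipse
dots = []
for k in range(12):
    t = 2.0 * math.pi * k / 12
    # Point on the isoline: x = sqrt(c) cos t, y = (sqrt(c) / 2) sin t
    x = math.sqrt(c) * math.cos(t)
    y = 0.5 * math.sqrt(c) * math.sin(t)
    # Tangent vector: derivative of the parametrization with respect to t
    tangent = (-math.sqrt(c) * math.sin(t), 0.5 * math.sqrt(c) * math.cos(t))
    g = grad_f(x, y)
    dots.append(g[0] * tangent[0] + g[1] * tangent[1])

print(max(abs(d) for d in dots))   # zero up to floating-point rounding
```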

The “tilted ramp” picture

Imagine standing on a tilted ramp. The isoline is a horizontal line painted across the ramp (where height never changes). If you take a step at an angle (diagonally across the ramp), you are “wasting” part of your step’s length moving sideways along the isoline, meaning you gain less vertical height. To gain the absolute maximum height possible in a single step (the direction of steepest ascent, which is the gradient), you must dedicate 100% of your step to moving forward, wasting zero energy on sideways movement. The only way to move with zero sideways drift is to walk exactly at a 90-degree angle to the horizontal isoline.

f: [-2,2] \times [-2,2] \to \mathbb{R}, \quad f(x,y) = x^2 + y^2, \quad \nabla f(x,y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}

In the example above, the 2D floor is the domain of $f$ : the heatmap encodes $f(x,y)$ as color, the dark rings are isolines (level sets $f(x,y) = c$ ), and the red arrows are the gradient field $\nabla f$ . This is where $f$ , its isolines, and its gradient actually live. The 3D bowl is the graph $z = f(x,y)$ — a visualization aid, not a separate object. The same isolines and gradient arrows are lifted onto it: the rings become horizontal cross-sections of the bowl, and the arrows become tangent vectors pointing in the steepest-ascent direction along the surface.

Arrow length. The arrows are short near the center and long at the boundary — not by coincidence. For $f(x,y) = x^2 + y^2$ :

\|\nabla f\| = \sqrt{(2x)^2 + (2y)^2} = \sqrt{4x^2 + 4y^2} = 2\sqrt{x^2 + y^2} = 2r

The gradient magnitude is just twice the radial distance. Geometrically: near the origin the bowl is nearly flat, so there’s barely any slope to point along; near the rim it’s steep, so the gradient is large. Exactly at the origin $\|\nabla f\| = 0$ — you’re at the minimum, there’s no downhill direction, the gradient has nothing to say.

Arrow direction. Every arrow points straight outward, perpendicular to the isoline it sits on. That’s not specific to this example — it’s always true: the gradient is orthogonal to the level set. If you stood on the 3D bowl and asked “which way is straight up the slope?”, the answer is exactly where the lifted arrow points.

Second-Order Partial Derivatives

The partial derivatives $f_{x_1}, \ldots, f_{x_n}$ of a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ are themselves scalar fields on $D$: each one assigns to every point the rate of change of $f$ along the corresponding coordinate axis. We assume $D$ is an open set here, which simply means every point of $D$ has a bit of breathing room around it that still lies inside $D$, so no point sits right on the edge. That matters because computing $f_{x_i}(\mathbf{a})$ means peeking at $f$ just to either side of $\mathbf{a}$ along the $i$-th axis; if $\mathbf{a}$ were on the boundary, some of those nearby probe points would fall outside $D$ where $f$ isn’t defined, and the limit couldn’t be formed at all. Openness rules that case out, so the difference quotient is well-defined for all sufficiently small $h$ at every point of $D$. Whether the limit then exists is a separate question, which we settle by assumption as before.

When all $n$ of these derived scalar fields are continuous, we give the situation its own name.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is continuously partially differentiable on $D$ if all of its partial derivatives exist and are continuous on $D$ :

f \text{ continuously partially differentiable on } D \iff f_{x_1}, \ldots, f_{x_n} \text{ continuous on } D

If these first-order partial derivatives are themselves partially differentiable, we can differentiate a second time — taking a partial derivative of a partial derivative.

The second-order partial derivative of $f$ with respect to $x_i$ and $x_j$ is the partial derivative of $f_{x_i}$ taken once more with respect to $x_j$ :

\partial_{x_j} \partial_{x_i} f(\mathbf{x}) = \frac{\partial^2 f}{\partial x_j \, \partial x_i}(\mathbf{x}) = \partial_j \partial_i f(\mathbf{x}) = f_{x_j x_i}(\mathbf{x})

All four expressions denote exactly the same quantity; they are fully interchangeable notations for the same second-order partial derivative. Read the indices right-to-left, like peeling an onion: first differentiate with respect to $x_i$, then differentiate the result with respect to $x_j$.

The intuition mirrors the 1D case. In 1D, the second derivative $f''(a)$ measures curvature — whether the slope is growing or shrinking as we move along the axis. On a hill, $f''$ tells you whether the climb is getting steeper or starting to level out. The same picture carries over to higher dimensions: an unmixed derivative $f_{x_i x_i}$ describes how the slope along the $i$ -th axis is itself changing as we step further along $x_i$ , and a mixed derivative $f_{x_j x_i}$ describes how a slope along one axis varies as we step along another.

Smoothness

The construction extends arbitrarily: differentiate a partial derivative once more to get a third-order partial derivative, again for fourth-order, and so on without limit. When all partial derivatives up to order $k$ exist and are continuous, we get a property worth naming.

A scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is $k$ -times continuously partially differentiable on $D$ if all of its partial derivatives up to order $k$ exist and are continuous on $D$ .

These nested smoothness levels get their own family of names, the $C^k$ classes, each one packaging “$f$ has $k$ continuous orders of derivatives” into a single label.

For $k \in \mathbb{N}_0 \cup \{\infty\}$ , a scalar field $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ (with $D$ open) is of class $C^k(D)$ as defined below:

\begin{aligned} C^0(D) &= \{\, f \mid f \text{ is continuous} \,\} \\ C^k(D) &= \{\, f \mid f \text{ is } k\text{-times continuously partially differentiable} \,\}, \quad k \in \mathbb{N} \\ C^\infty(D) &= \{\, f \mid f \text{ is continuously partially differentiable of any order} \,\} \end{aligned}

So $C^0(D)$ is just the continuous functions (no jumps) on $D$; $C^k(D)$ for $k \geq 1$ asks for $k$ continuous orders of partial derivatives; and $C^\infty(D)$ is the set of functions whose partial derivatives of every order exist and are continuous (perfectly smooth, no matter how many times you differentiate). Two milestones from this hierarchy matter most going forward: $f \in C^1(D)$ means all first-order partial derivatives are continuous, so we can assemble them into the gradient $\nabla f$ at every point of $D$; and $f \in C^2(D)$ means all second-order partial derivatives are continuous, which lets us assemble them into the Hessian and guarantees that this matrix is symmetric.
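
The classes are genuinely different. A standard one-variable illustration (an example added here, not taken from the notes) is $f(x) = x|x|$ on $D = \mathbb{R}$: its derivative $f'(x) = 2|x|$ is continuous, so $f \in C^1$, but $f'$ has a kink at $0$, so $f \notin C^2$. The Python sketch below checks both facts numerically with difference quotients; the step size is an arbitrary choice.

```python
def f(x):
    # Hypothetical one-variable example f(x) = x * |x|
    return x * abs(x)

def f_prime(x):
    # Derivative computed by hand: f'(x) = 2 |x|, continuous everywhere
    return 2.0 * abs(x)

h = 1e-6

# Check the hand-derived f' against a symmetric difference quotient of f
points = [-0.5, 0.0, 0.5]
errors = [abs((f(x + h) - f(x - h)) / (2.0 * h) - f_prime(x)) for x in points]

# One-sided difference quotients of f' at 0: candidates for a second derivative
slope_right = (f_prime(0.0 + h) - f_prime(0.0)) / h
slope_left = (f_prime(0.0 - h) - f_prime(0.0)) / (-h)

print(errors)
print(slope_left, slope_right)   # the two one-sided slopes disagree
```

Because the left and right slopes of $f'$ at $0$ differ, the second derivative does not exist there.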

In practice, we rarely care about the exact value of $k$ — we just want enough continuous derivatives for whatever theorem or computation we’re doing. That looser notion gets its own name.

A function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is called a smooth function if it is of class $C^k(D)$ with $k$ high enough for the problem at hand:

f \text{ smooth on } D \iff f \in C^k(D) \text{ for sufficiently large } k

The required $k$ depends on context — if a result needs second-order partial derivatives to be continuous, “smooth enough” means $C^2(D)$ ; if third-order derivatives are needed, $C^3(D)$ , and so on. In most practical settings one simply assumes $f \in C^\infty(D)$ to avoid having to track the exact order.

Once $f$ reaches $C^2$ smoothness, a useful property kicks in — the order in which we take partial derivatives stops mattering.

If $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is a scalar field with $f \in C^2(D)$ , then for all $i, j \in \{1, \ldots, n\}$ :

f_{x_i x_j} = f_{x_j x_i}

Differentiating with respect to $x_i$ first and then $x_j$ gives the same answer as differentiating in the opposite order. For any function smooth enough to land in $C^2$, mixed second-order partial derivatives commute: $\partial_i \partial_j f = \partial_j \partial_i f$. This result is commonly known as Schwarz's theorem (or Clairaut's theorem).
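
As a sanity check, the two orders of differentiation can be compared on a concrete polynomial. The sketch assumes a hypothetical field $f(x, y) = x^2 y^3$ with hand-derived first partials, and differentiates each of them once more numerically.

```python
def f_x(x, y):
    # First partial of the hypothetical field f(x, y) = x^2 * y^3 with respect to x
    return 2.0 * x * y ** 3

def f_y(x, y):
    # First partial of the same field with respect to y
    return 3.0 * x ** 2 * y ** 2

h = 1e-6
x0, y0 = 1.5, -2.0

# f_{yx}: differentiate f_x once more with respect to y
f_yx = (f_x(x0, y0 + h) - f_x(x0, y0 - h)) / (2.0 * h)
# f_{xy}: differentiate f_y once more with respect to x
f_xy = (f_y(x0 + h, y0) - f_y(x0 - h, y0)) / (2.0 * h)

print(f_yx, f_xy)   # both approximate 6 x y^2 = 36 at (1.5, -2)
```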

Hessian Matrix

With the first-order partial derivatives, we organized the $n$ values $f_{x_1}, \ldots, f_{x_n}$ into a single vector — the gradient. Second-order partials are richer: for each pair $(x_i, x_j)$ there is one derivative $f_{x_j x_i}$ , giving $n^2$ numbers in total. The natural way to organize them is as an $n \times n$ matrix. The pattern continues into higher orders — third-order partial derivatives need a three-index object with $n^3$ entries (a tensor), fourth-order ones live in $n^4$ entries, and so on — but at second order, this $n \times n$ matrix has its own name.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ be a second-order partially differentiable scalar field. The Hessian matrix of $f$ at a point $\mathbf{x} \in D$ , denoted $H_f(\mathbf{x})$ , is the $n \times n$ matrix of all second-order partial derivatives of $f$ at $\mathbf{x}$ :

H_f(\mathbf{x}) = \begin{pmatrix} f_{x_1 x_1}(\mathbf{x}) & \cdots & f_{x_1 x_n}(\mathbf{x}) \\ \vdots & & \vdots \\ f_{x_n x_1}(\mathbf{x}) & \cdots & f_{x_n x_n}(\mathbf{x}) \end{pmatrix}

The entry in row $i$ , column $j$ is $f_{x_i x_j}(\mathbf{x})$ — the rate at which the $j$ -th partial derivative of $f$ changes as we step along $x_i$ . The diagonal entries $f_{x_i x_i}$ are the pure second derivatives along each axis; the off-diagonal entries $f_{x_i x_j}$ (for $i \neq j$ ) are the mixed partials.

If in addition $f \in C^2(D)$, the mixed partials satisfy $f_{x_i x_j} = f_{x_j x_i}$ for all $(i, j)$, so the Hessian is symmetric.
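
For the earlier example $f(x, y, z) = e^{-z}\sin x + y^2$, the Hessian can be written down by differentiating the gradient once more. The Python sketch below compares that hand-derived matrix with a finite-difference estimate; the helper `numeric_hessian`, the four-point difference formula, the step size, and the test point are assumptions made for illustration.

```python
import math

def f(p):
    # Example field from the gradient section: f(x, y, z) = exp(-z) sin(x) + y^2
    x, y, z = p
    return math.exp(-z) * math.sin(x) + y * y

def numeric_hessian(f, a, h=1e-4):
    """Approximate every second-order partial with a four-point difference formula."""
    n = len(a)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Evaluate f at a + si*h*e_i + sj*h*e_j
                p = list(a)
                p[i] += si * h
                p[j] += sj * h
                return f(p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

def analytic_hessian(p):
    # Second-order partials derived by hand from the gradient given earlier
    x, y, z = p
    e = math.exp(-z)
    return [[-e * math.sin(x), 0.0, -e * math.cos(x)],
            [0.0, 2.0, 0.0],
            [-e * math.cos(x), 0.0, e * math.sin(x)]]

a = [0.5, -1.0, 2.0]
H_num = numeric_hessian(f, a)
H_ref = analytic_hessian(a)
for row in H_num:
    print([round(entry, 5) for entry in row])
```

The printed matrix is symmetric up to rounding, as expected for a function whose second-order partials are all continuous.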

Stationary Points and Extrema

The gradient and Hessian are not brand-new objects — they are the $n$ -dimensional counterparts of the first and second derivatives we already know from 1D calculus, and they carry over the same roles.

In 1D, finding the extrema of a smooth function $f : \mathbb{R} \to \mathbb{R}$ follows a two-step recipe: first solve $f'(a) = 0$ to locate candidate points (bottoms of valleys, tops of peaks), then evaluate $f''(a)$ at those candidates — a positive value means the curve opens upward (local minimum), a negative value means it opens downward (local maximum), and zero leaves the test inconclusive.

The exact same logic plays out in $n$ dimensions, with the gradient playing the role of $f'$ and the Hessian playing the role of $f''$ . The “zero-slope” condition now becomes $\nabla f(\mathbf{a}) = \mathbf{0}$ — every partial derivative vanishes at once. Points where this holds are called stationary points — defined formally in the optima chapter.

Geometrically, at a stationary point every directional derivative is zero — no matter which way you step, the slope of $f$ is flat at that point. Any actual change in $f$ as you move away only shows up through the curvature (how the surface bends), not through the slope itself. Stationary points are therefore exactly the candidates for local minima and maxima, the same way $f'(a) = 0$ produces the candidates in 1D. Not every stationary point is an extremum, however — some fall into neither the minimum nor the maximum category.
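
A classic stationary point that is neither a minimum nor a maximum is the saddle of $f(x, y) = x^2 - y^2$ at the origin (an example chosen here for illustration). The short Python sketch below confirms that the gradient vanishes there while $f$ rises in one direction and falls in another; the probe distance `eps` is arbitrary.

```python
def f(x, y):
    # Hypothetical example field f(x, y) = x^2 - y^2 (a saddle)
    return x * x - y * y

def grad_f(x, y):
    # Gradient derived by hand: (2x, -2y)
    return (2.0 * x, -2.0 * y)

origin = (0.0, 0.0)
gradient_at_origin = grad_f(*origin)   # the zero vector: the origin is stationary

eps = 1e-3
along_x = f(eps, 0.0) - f(*origin)     # positive: f increases along the x-axis
along_y = f(0.0, eps) - f(*origin)     # negative: f decreases along the y-axis

print(gradient_at_origin, along_x, along_y)
```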

To tell the cases apart, we turn to second-order information: the Hessian. Informally, the Hessian describes the local curvature of $f$ in every direction at once, and at a stationary point it plays exactly the same classifying role that the sign of $f''$ plays in 1D — it decides whether the point is a minimum, a maximum, or something else. A precise criterion for reading this off the Hessian comes later in the optima chapter; for now it is enough to remember that gradient and Hessian together give us the multivariate extension of the 1D “set $f' = 0$ , then check $f''$ ” recipe.

The partial derivatives of a scalar field should not be viewed as the actual derivative of $f$ — they are not a direct extension of the real-function derivative $f'$ . In multiple dimensions the notion of “a derivative” is genuinely more subtle: each partial derivative only captures how $f$ changes along a single coordinate axis, and together they miss the fact that the input $\mathbf{x}$ can move in any direction, not just along the axes. The partials do, however, assemble into a single linear object called the total differential, which is the proper $n$ -dimensional analog of $f'$ . We will not develop the total differential in this chapter — for our purposes the gradient, as a bundle of partial derivatives, is enough — but it is worth knowing that strictly speaking, partial and total differentiation are different things.

Multivariate Functions

Every definition so far — partial derivatives, gradient, Hessian, continuous partial differentiability, class $C^k$ — has been stated for scalar fields $f : D \subseteq \mathbb{R}^n \to \mathbb{R}$ . Extending them to multivariate functions is mechanical: a multivariate function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is nothing more than an $m$ -tuple of scalar fields stacked vertically:

f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{x} \mapsto f(\mathbf{x}) = \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix}

Each component function $f_i : D \subseteq \mathbb{R}^n \to \mathbb{R}$ is itself a scalar field — exactly the kind of object every earlier definition applied to. Every earlier property therefore lifts componentwise:

Partially differentiable: $f$ is partially differentiable at $\mathbf{a} \in D$ (or on $D$ ) if and only if every component $f_i$ is partially differentiable at $\mathbf{a}$ (or on $D$ ).
$k$ -times continuously partially differentiable: $f$ is $k$ -times continuously partially differentiable at $\mathbf{a}$ (or on $D$ ) if and only if every $f_i$ is.
Class $C^k(D)$ : $f$ belongs to $C^k(D)$ if and only if every $f_i$ belongs to $C^k(D)$ .

Nothing new to prove — each property is simply applied to each component in turn.

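One term gets its own label in the multivariate setting: when all partial derivatives of $f$ not only exist but are also continuous, we say $f$ is differentiable. (This is the working definition used here; continuity of the partial derivatives is a sufficient condition for differentiability in the stricter, total sense, not an equivalent one.)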

A multivariate function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at a point $\mathbf{x} \in D$ (or on $D$ ) if all of its partial derivatives exist and are continuous at $\mathbf{x}$ (or throughout $D$ ):

f \text{ differentiable on } D \iff f \in C^1(D)

Jacobian Matrix

Once a multivariate function is differentiable — every component $f_i$ contributing $n$ continuous partial derivatives — the natural way to organize all of its first-order information is into a single matrix. For a scalar field we collected $n$ partials into the gradient vector. With $m$ components, each carrying $n$ partials, we now have $m \cdot n$ numbers in total, and they fit cleanly into an $m \times n$ matrix.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ be differentiable on $D$ . The Jacobian matrix of $f$ at a point $\mathbf{x} \in D$ , denoted $J_f(\mathbf{x})$ , is the $m \times n$ matrix whose entry in row $i$ and column $j$ is the partial derivative of the $i$ -th component with respect to the $j$ -th variable:

J_f(\mathbf{x}) = \left( \frac{\partial f_i}{\partial x_j}(\mathbf{x}) \right)_{ij} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \nabla f_1(\mathbf{x})^\top \\ \vdots \\ \nabla f_m(\mathbf{x})^\top \end{pmatrix}

Reading the matrix row by row makes the structure obvious: the $i$ -th row is exactly the gradient of the $i$ -th component function $f_i$ , written as a row vector. The Jacobian is therefore a vertical stack of component gradients — each row carrying the full first-order story of one output coordinate, and each column carrying the response of all outputs to a nudge in one input variable.

An alternative notation often seen in the literature is $Df(\mathbf{x})$ , used interchangeably with $J_f(\mathbf{x})$ .

As a useful special case, when $m = 1$ the function reduces to a scalar field and the Jacobian has only a single row — exactly the transposed gradient:

J_f(\mathbf{x}) = \nabla f(\mathbf{x})^\top

So the gradient and the Jacobian are not separate ideas: the gradient is just the Jacobian’s only row when there is only one output to track.
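
To make the row and column structure concrete, the following Python sketch builds a Jacobian numerically, one column per input variable, and compares it with the hand-derived matrix. The polar-to-Cartesian map and the helper names are illustrative assumptions, not taken from the notes.

```python
import math

def polar_to_cartesian(p):
    # Hypothetical example map f(r, phi) = (r cos(phi), r sin(phi)), so n = m = 2
    r, phi = p
    return [r * math.cos(phi), r * math.sin(phi)]

def numeric_jacobian(f, x, h=1e-6):
    """Build the m x n Jacobian column by column from symmetric difference quotients."""
    m = len(f(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(x)
        minus = list(x)
        plus[j] += h     # nudge only the j-th input variable
        minus[j] -= h
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J

def analytic_jacobian(p):
    # Row i is the gradient of component f_i, written as a row
    r, phi = p
    return [[math.cos(phi), -r * math.sin(phi)],
            [math.sin(phi), r * math.cos(phi)]]

x = [2.0, math.pi / 6]
J_num = numeric_jacobian(polar_to_cartesian, x)
J_ref = analytic_jacobian(x)
print(J_num)
print(J_ref)
```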

The deeper purpose of the Jacobian is to provide the best linear approximation of $f$ near $\mathbf{x}$ . If $f$ is differentiable at $\mathbf{x}$ , then for a small step $\mathbf{h}$ :

f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + J_f(\mathbf{x})\,\mathbf{h}

with the approximation getting sharper as $\|\mathbf{h}\| \to 0$ . This is the multivariate version of the slope picture used earlier to introduce the directional derivative: in 1D, $f(a + h) \approx f(a) + f'(a)\,h$ — the very same $\frac{\Delta y}{\Delta x} \approx f'(a)$ tangent line, just rearranged to predict the function value a small step past $a$ . The scalar slope $f'(a)$ is replaced by the matrix $J_f(\mathbf{x})$ , which acts on the displacement vector $\mathbf{h}$ to produce the predicted change in every output coordinate at once. A more detailed treatment follows in the discussion of coordinate transformations.
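
How quickly the approximation sharpens can be observed directly. The sketch below assumes a hypothetical map $f(x, y) = (x^2 y,\ \sin x + y)$ with a hand-derived Jacobian, and measures the gap between $f(\mathbf{x} + \mathbf{h})$ and the linear prediction for steps of shrinking length.

```python
import math

def f(p):
    # Hypothetical example map f(x, y) = (x^2 * y, sin(x) + y)
    x, y = p
    return [x * x * y, math.sin(x) + y]

def jacobian_f(p):
    # Jacobian derived by hand from the two components
    x, y = p
    return [[2.0 * x * y, x * x],
            [math.cos(x), 1.0]]

def linear_prediction(p, step):
    """Evaluate f(p) + J_f(p) step."""
    base = f(p)
    J = jacobian_f(p)
    return [base[i] + sum(J[i][j] * step[j] for j in range(len(step)))
            for i in range(len(base))]

x0 = [1.0, 2.0]
errors = []
for t in (1e-1, 1e-2, 1e-3):
    step = [t, t]
    actual = f([x0[0] + step[0], x0[1] + step[1]])
    predicted = linear_prediction(x0, step)
    err = math.sqrt(sum((u - v) ** 2 for u, v in zip(actual, predicted)))
    errors.append(err)
    print(f"t={t:g}  error={err:.3e}")
```

For this smooth example, shrinking the step by a factor of 10 shrinks the error by roughly a factor of 100, so the error vanishes faster than the step itself.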

A particularly tidy special case: when $f$ is itself an affine function (a linear map plus a constant shift), $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ for some constant $m \times n$ matrix $A$ and constant vector $\mathbf{b}$, the Jacobian is just $A$ at every point of the domain. Each component $f_i$ is $a_{i1}x_1 + \cdots + a_{in}x_n + b_i$, whose partial derivative with respect to $x_j$ is the constant $a_{ij}$, so the Jacobian’s $(i, j)$ entry is $a_{ij}$ everywhere. The linear approximation $f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + J_f(\mathbf{x})\,\mathbf{h}$ then stops being an approximation and becomes the exact equality $f(\mathbf{x} + \mathbf{h}) = f(\mathbf{x}) + A\,\mathbf{h}$: an affine function coincides with its own first-order approximation everywhere.

Calculation Rules

The Jacobian inherits a familiar set of computation rules — exactly the multivariate counterparts of the standard differentiation rules from 1D calculus. Sums differentiate term by term, scalars factor out, products obey a product rule, and compositions follow a chain rule. The only twist in higher dimensions is that the chain rule turns into a matrix multiplication of two Jacobians, so the order of multiplication now matters.

Let $f, g : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ be partially differentiable. Then for every $\mathbf{x} \in D$ :

Additivity: $J_{f+g}(\mathbf{x}) = J_f(\mathbf{x}) + J_g(\mathbf{x})$
Homogeneity: $J_{\lambda f}(\mathbf{x}) = \lambda\, J_f(\mathbf{x})$ for all $\lambda \in \mathbb{R}$
Product rule: $J_{f^\top g}(\mathbf{x}) = f(\mathbf{x})^\top J_g(\mathbf{x}) + g(\mathbf{x})^\top J_f(\mathbf{x})$

Additivity and homogeneity together say that the Jacobian behaves as a linear operator on partially differentiable functions — directly mirroring the linearity of the 1D derivative, $(f+g)' = f'+g'$ and $(\lambda f)' = \lambda f'$ . The product rule echoes the Leibniz pattern $(fg)' = f'g + fg'$ , but since $f$ and $g$ are vector-valued the natural scalar product is the inner product $f^\top g : D \to \mathbb{R}$ , and each factor carries a transpose to match the row-vector shapes the Jacobian expects.
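
The shapes in the product rule are easiest to trust after seeing them work once. This Python sketch assumes two hypothetical maps $f(x, y) = (xy,\ x + y)$ and $g(x, y) = (\sin x,\ y^2)$ with hand-derived Jacobians, and compares the right-hand side of the rule with a finite-difference Jacobian of the scalar field $f^\top g$.

```python
import math

def f(p):
    # Hypothetical example f(x, y) = (x*y, x + y)
    x, y = p
    return [x * y, x + y]

def g(p):
    # Hypothetical example g(x, y) = (sin(x), y^2)
    x, y = p
    return [math.sin(x), y * y]

def J_f(p):
    x, y = p
    return [[y, x], [1.0, 1.0]]

def J_g(p):
    x, y = p
    return [[math.cos(x), 0.0], [0.0, 2.0 * y]]

def inner(p):
    """The scalar field f^T g."""
    return sum(fi * gi for fi, gi in zip(f(p), g(p)))

def row_times_matrix(row, M):
    """Multiply a row vector (length m) by an m x n matrix."""
    return [sum(row[i] * M[i][j] for i in range(len(row))) for j in range(len(M[0]))]

a = [0.7, 1.3]
# Right-hand side of the product rule: f^T J_g + g^T J_f
rhs = [u + v for u, v in zip(row_times_matrix(f(a), J_g(a)),
                             row_times_matrix(g(a), J_f(a)))]

# Left-hand side: Jacobian (a single row) of f^T g from symmetric difference quotients
h = 1e-6
lhs = []
for j in range(2):
    plus, minus = list(a), list(a)
    plus[j] += h
    minus[j] -= h
    lhs.append((inner(plus) - inner(minus)) / (2.0 * h))

print(lhs)
print(rhs)
```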

The remaining rule needs two functions whose dimensions are compatible for composition. Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ and $g : D' \subseteq \mathbb{R}^l \to \mathbb{R}^n$ both be differentiable, with $g(D') \subseteq D$, so the composition $h = f \circ g : D' \to \mathbb{R}^m$ is well-defined. Then for every $\mathbf{x} \in D'$:

J_{f \circ g}(\mathbf{x}) = J_f(g(\mathbf{x}))\, J_g(\mathbf{x})

This is the composition rule, also widely known as the chain rule. The right-hand side is a matrix product: $J_f(g(\mathbf{x}))$ is $m \times n$ , $J_g(\mathbf{x})$ is $n \times l$ , and the result is $m \times l$ — exactly the shape required for $J_{f \circ g}(\mathbf{x})$ . It is the direct generalization of the 1D chain rule $(f \circ g)'(x) = f'(g(x))\, g'(x)$ , with scalar multiplication replaced by matrix multiplication.

Reading off the entry in row $i$ and column $j$ of $J_{f \circ g}(\mathbf{x})$ recovers the familiar partial-derivative form:

\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{k=1}^{n} \frac{\partial f_i}{\partial x_k}(g(\mathbf{x}))\, \frac{\partial g_k}{\partial x_j}(\mathbf{x}), \quad 1 \leq i \leq m,\ 1 \leq j \leq l

Here $i$ is the row index (the output component $h_i$ of the composition) and $j$ is the column index (the input variable $x_j$). Note that $\frac{\partial f_i}{\partial x_k}$ means the partial derivative of $f_i$ with respect to its $k$-th argument, evaluated at $g(\mathbf{x})$. The sum over $k$ runs through the $n$ intermediate outputs of $g$, which is exactly the inner product of the $i$-th row of $J_f(g(\mathbf{x}))$ with the $j$-th column of $J_g(\mathbf{x})$: the row-times-column rule of matrix multiplication, written out one entry at a time.
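
The chain rule can be verified numerically on a small example. The sketch below assumes a hypothetical inner map $g(r, \varphi) = (r\cos\varphi,\ r\sin\varphi)$ and outer map $f(u, v) = (uv,\ u + v^2)$, multiplies their hand-derived Jacobians in the order the rule prescribes, and compares the product with a finite-difference Jacobian of the composition.

```python
import math

def g(p):
    # Hypothetical inner map g(r, phi) = (r cos(phi), r sin(phi)), so l = n = 2
    r, phi = p
    return [r * math.cos(phi), r * math.sin(phi)]

def f(q):
    # Hypothetical outer map f(u, v) = (u*v, u + v^2), so m = 2
    u, v = q
    return [u * v, u + v * v]

def J_g(p):
    r, phi = p
    return [[math.cos(phi), -r * math.sin(phi)],
            [math.sin(phi), r * math.cos(phi)]]

def J_f(q):
    u, v = q
    return [[v, u], [1.0, 2.0 * v]]

def matmul(A, B):
    """Row-times-column product of an m x n and an n x l matrix."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def numeric_jacobian(func, x, h=1e-6):
    """Jacobian from symmetric difference quotients, one column per input variable."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        fp, fm = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J

x = [2.0, math.pi / 3]
chain = matmul(J_f(g(x)), J_g(x))                  # J_f(g(x)) J_g(x)
direct = numeric_jacobian(lambda p: f(g(p)), x)    # Jacobian of the composition
print(chain)
print(direct)
```

Note that $J_f$ is evaluated at $g(\mathbf{x})$, not at $\mathbf{x}$, and that swapping the two factors would in general give a different matrix.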

Differential Operators

A differential operator is an abstract mapping that takes a function as input and returns another function as output, with the (partial) derivatives of the input doing the structural work. The operators in this section all eat a function defined on a domain $D \subseteq \mathbb{R}^n$ — a scalar field or a vector field, depending on the operator — and produce a new function as output. We assume throughout that every partial derivative of the input function exists and is continuous on $D$ , so the constructions below are well-defined at every point.

A useful organizing idea before diving in: most of the operators that follow can be expressed as a single, more primitive operator — the nabla operator — combined with the standard vector products (scalar multiplication, inner product, cross product). Setting up that primitive carefully first makes everything else mechanical.

Nabla Operator

The nabla operator is the most fundamental differential operator on $\mathbb{R}^n$ — and one we have already been using implicitly: it is exactly the object that turned a scalar field into its gradient back in the gradient definition, where writing $\nabla f$ meant “stack the partial derivatives of $f$ into a column vector”. Pulling it out as a stand-alone operator simply makes that move explicit and lets the same primitive serve the divergence and the rotation below. It is best read as a formal column “vector” whose entries are the $n$ partial-derivative operators $\partial_1, \ldots, \partial_n$ — a column of operators, not numbers, waiting to be applied to a function.

The nabla operator (also called the del operator) on $\mathbb{R}^n$ is the formal column vector of partial-derivative operators:

\nabla = \begin{pmatrix} \partial_1 \\ \vdots \\ \partial_n \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial x_1} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{pmatrix}

Input: a function on $D$ — either a scalar field $f$ or a vector field $\boldsymbol{v}$ , depending on which “vector multiplication” is used. Output: a new function whose shape depends on the multiplication.

By itself $\nabla$ is not a function and has no value at a point — it is purely a column of operators. It only produces a result once paired with an actual function via one of the standard vector products, and the product determines what the result looks like:

Applying to a scalar field: $\nabla : (f: D \subseteq \mathbb{R}^n \to \mathbb{R}) \to (D \to \mathbb{R}^n),\ f \mapsto (\partial_1 f, \ldots, \partial_n f)^\top$. Each slot of the formal column vector $\nabla = (\partial_1, \ldots, \partial_n)^\top$ acts on $f$ by partial differentiation, giving exactly the gradient.
Inner product with a vector field: $\nabla \cdot : (\boldsymbol{v}: D \subseteq \mathbb{R}^n \to \mathbb{R}^n) \to (D \to \mathbb{R}),\ \boldsymbol{v} \mapsto \sum_i \partial_i v_i$. Pair each $\partial_i$ with the matching component $v_i$, apply, then sum, exactly like an inner product but with application replacing multiplication; the result is the divergence.
Cross product with a vector field: $\nabla \times : (\boldsymbol{v}: D \subseteq \mathbb{R}^3 \to \mathbb{R}^3) \to (D \to \mathbb{R}^3),\ \boldsymbol{v} \mapsto (\partial_2 v_3 - \partial_3 v_2,\ \partial_3 v_1 - \partial_1 v_3,\ \partial_1 v_2 - \partial_2 v_1)^\top$. This is exactly like a cross product but with application replacing multiplication; the result is the rotation.

Laplace Operator

The Laplace operator $\Delta$ takes a scalar field $f$ and returns a scalar field of the same shape.

$\Delta : (f : D \subseteq \mathbb{R}^n \to \mathbb{R}) \to (D \to \mathbb{R}),\ f \mapsto \sum_{i=1}^{n} \partial_i^2 f$

At each point $\mathbf{x}$ , it sums the pure second-order partial derivatives of $f$ — one for each dimension — leaving mixed partials $\partial_i \partial_j f$ for $i \neq j$ out.

\Delta f = \sum_{i=1}^{n} \partial_i^2 f = \partial_1^2 f + \cdots + \partial_n^2 f

Equivalently, $\Delta f(\mathbf{x})$ is the trace of the Hessian $H_f$ at $\mathbf{x}$ : the diagonal entries of $H_f$ are exactly $\partial_1^2 f, \ldots, \partial_n^2 f$ , and the trace adds them.
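
The trace-of-the-Hessian reading suggests a simple numerical approximation: add up central second differences along each coordinate axis and ignore all mixed terms. The sketch below assumes a scalar field given as a Python function of a list of coordinates; the names `laplacian`, `bowl`, `saddle`, and the step size `h` are illustrative choices.

```python
def laplacian(f, x, h=1e-3):
    """Approximate the sum of pure second partials of f at x by central differences."""
    fx = f(x)
    total = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        # Second difference along axis i only; mixed partials never enter.
        total += (f(xp) - 2.0 * fx + f(xm)) / (h * h)
    return total

def bowl(x):
    # x1^2 + x2^2 + x3^2: each pure second partial is 2, so the Laplacian is 6.
    return x[0] ** 2 + x[1] ** 2 + x[2] ** 2

def saddle(x):
    # x1^2 - x2^2: the second partials are 2 and -2, so the Laplacian is 0.
    return x[0] ** 2 - x[1] ** 2

lap_bowl = laplacian(bowl, [0.3, -1.2, 2.0])
lap_saddle = laplacian(saddle, [0.7, 0.4])
```

Both test fields are quadratic, so the central second difference has no truncation error and the results match the analytic values up to floating-point rounding. For general fields the approximation error depends on `h`.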

Divergence

The divergence takes a vector field $\boldsymbol{v}$ and returns a scalar field — the $n$ component functions collapse into a single number at each point.

$\nabla \cdot : (\boldsymbol{v} : D \subseteq \mathbb{R}^n \to \mathbb{R}^n) \to (D \to \mathbb{R}),\ \boldsymbol{v} \mapsto \sum_{i=1}^{n} \partial_i v_i$

Each $\partial_i$ acts on its matching component $v_i$ , and the results are summed — the formal inner product of $\nabla$ and $\boldsymbol{v}$ , with application replacing multiplication.

\operatorname{div} \boldsymbol{v} = \nabla \cdot \boldsymbol{v} = \sum_{i=1}^{n} \partial_i v_i = \partial_1 v_1 + \cdots + \partial_n v_n

Equivalently, $\nabla \cdot \boldsymbol{v}(\mathbf{x})$ is the trace of the Jacobian $J_{\boldsymbol{v}}$ at $\mathbf{x}$ : the diagonal entries of $J_{\boldsymbol{v}}$ are exactly $\partial_1 v_1, \ldots, \partial_n v_n$ , and the trace adds them.
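
Because only the diagonal of the Jacobian is needed, a numerical divergence never has to build the full matrix. The following is a minimal sketch under that observation; the function names `divergence`, `radial`, `swirl`, and the step size `h` are assumptions made for illustration.

```python
def divergence(v, x, h=1e-6):
    """Approximate sum_i d v_i / d x_i at x: the trace of the Jacobian of v."""
    total = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        # Only component i is differentiated along axis i (a diagonal entry).
        total += (v(xp)[i] - v(xm)[i]) / (2.0 * h)
    return total

def radial(x):
    # v(x) = x: every diagonal entry of the Jacobian is 1, so div v = n.
    return list(x)

def swirl(x):
    # v(x1, x2) = (-x2, x1): both diagonal entries are 0, so div v = 0.
    return [-x[1], x[0]]

div_radial = divergence(radial, [0.5, -1.0, 2.0])   # analytic value: 3
div_swirl = divergence(swirl, [0.3, 0.8])           # analytic value: 0
```

The two fields illustrate the usual reading: `radial` spreads outward from every point, while `swirl` only circulates and has no net outflow.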

Rotation

The rotation (curl) takes a 3D vector field $\boldsymbol{v}$ and returns a 3D vector field of the same shape.

$\nabla \times : (\boldsymbol{v} : D \subseteq \mathbb{R}^3 \to \mathbb{R}^3) \to (D \to \mathbb{R}^3),\ \boldsymbol{v} \mapsto \nabla \times \boldsymbol{v}$

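Entry $i$ follows the cyclic pattern $\partial_j v_k - \partial_k v_j$ of the cross product, where $(i, j, k)$ is a cyclic permutation of $(1, 2, 3)$, with each partial-derivative operator applied to a component of $\boldsymbol{v}$ rather than multiplied by it.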

\operatorname{rot} \boldsymbol{v} = \nabla \times \boldsymbol{v} = \begin{pmatrix} \partial_2 v_3 - \partial_3 v_2 \\ \partial_3 v_1 - \partial_1 v_3 \\ \partial_1 v_2 - \partial_2 v_1 \end{pmatrix}

The rotation is defined only in $\mathbb{R}^3$, since the cross product it is built on is defined for three-dimensional vectors.
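
The component formula translates almost line by line into code. The sketch below approximates each partial derivative by a central difference; the helper `partial`, the function `rot`, the example fields, and the step size `h` are illustrative assumptions. Indices are 0-based in the code, so $v_3$ is `comp=2` and $x_2$ is `axis=1`.

```python
def partial(v, x, comp, axis, h=1e-6):
    """Central-difference approximation of d v_comp / d x_axis at x (0-based indices)."""
    xp, xm = list(x), list(x)
    xp[axis] += h
    xm[axis] -= h
    return (v(xp)[comp] - v(xm)[comp]) / (2.0 * h)

def rot(v, x):
    """Rotation (curl) of a 3D vector field v at the point x."""
    return [
        partial(v, x, 2, 1) - partial(v, x, 1, 2),  # d2 v3 - d3 v2
        partial(v, x, 0, 2) - partial(v, x, 2, 0),  # d3 v1 - d1 v3
        partial(v, x, 1, 0) - partial(v, x, 0, 1),  # d1 v2 - d2 v1
    ]

def swirl(x):
    # v = (-x2, x1, 0): rigid rotation about the x3 axis, rot v = (0, 0, 2).
    return [-x[1], x[0], 0.0]

def radial(x):
    # v = x: purely outward flow, rot v = (0, 0, 0).
    return list(x)

rot_swirl = rot(swirl, [0.4, -0.7, 1.5])
rot_radial = rot(radial, [0.4, -0.7, 1.5])
```

For `swirl` the third entry is $\partial_1 v_2 - \partial_2 v_1 = 1 - (-1) = 2$, which is the only nonzero component.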

A few identities relate these differential operators. They are outside the scope of this lecture, but a useful addition to your mathematical toolkit. Assume that $\boldsymbol{v}$ and $\boldsymbol{u}$ are $C^2$ vector fields and $g$ is a $C^2$ scalar field, all of matching dimension (three-dimensional wherever $\operatorname{rot}$ appears), and use the convention $\Delta \boldsymbol{v} = (\Delta v_1, \ldots, \Delta v_n)^\top$. Then the following identities hold:

$\operatorname{div} \operatorname{rot} \boldsymbol{v} = 0$
$\operatorname{rot} \nabla g = \mathbf{0}$
$\operatorname{div} \nabla g = \Delta g$
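$\operatorname{rot}(\boldsymbol{v} \times \boldsymbol{u}) = (\boldsymbol{u} \cdot \nabla)\boldsymbol{v} - (\boldsymbol{v} \cdot \nabla)\boldsymbol{u} + \boldsymbol{v} \operatorname{div} \boldsymbol{u} - \boldsymbol{u} \operatorname{div} \boldsymbol{v}$, where $(\boldsymbol{u} \cdot \nabla) = \sum_i u_i \partial_i$ is applied to each component of the field that follows it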
$\operatorname{rot}(g\,\boldsymbol{v}) = g\operatorname{rot} \boldsymbol{v} - \boldsymbol{v} \times \nabla g$
$\operatorname{rot}(g \nabla g) = \mathbf{0}$
$\nabla(\operatorname{div} \boldsymbol{v}) = \operatorname{rot} \operatorname{rot} \boldsymbol{v} + \Delta \boldsymbol{v}$
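
As a sketch of how such identities are proved, the first one can be checked directly from the component formulas. Writing out the divergence of the rotation and then grouping the terms by component of $\boldsymbol{v}$ gives:

\operatorname{div} \operatorname{rot} \boldsymbol{v} = \partial_1(\partial_2 v_3 - \partial_3 v_2) + \partial_2(\partial_3 v_1 - \partial_1 v_3) + \partial_3(\partial_1 v_2 - \partial_2 v_1)

= (\partial_1 \partial_2 v_3 - \partial_2 \partial_1 v_3) + (\partial_2 \partial_3 v_1 - \partial_3 \partial_2 v_1) + (\partial_3 \partial_1 v_2 - \partial_1 \partial_3 v_2) = 0

Each bracket vanishes because the mixed second-order partial derivatives of a $C^2$ function commute (Schwarz's theorem), which is exactly where the $C^2$ assumption is used.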