Roots

This chapter and the next apply the multivariate calculus tools just developed — gradient, Hessian, Jacobian — to two of the most common numerical problems: finding roots (inputs that make a function output zero) and finding optima (inputs at which a function attains a local extremum). Both problems share the same recipe: write down the condition that characterizes a solution, then iterate toward it. We start with roots.

A root (also called a zero) of a function ff is an input value x\mathbf{x}^* at which ff vanishes:

f(x)=0f(\mathbf{x}^*) = 0

In 1D, x\mathbf{x}^* is a single number xRx^* \in \mathbb{R} where the graph of ff crosses the horizontal axis. In higher dimensions, x\mathbf{x}^* is a point in Rn\mathbb{R}^n — a specific combination of input coordinates — that makes ff output exactly zero.

So whenever you hear “finding a root,” the question is: what input do I have to plug into this machine to make it spit out a zero?

For toy functions like f(x)=x3f(x) = x - 3 the answer is immediate — x=3x^* = 3. But for anything more interesting (a polynomial of degree higher than four, a transcendental equation like x=cosxx = \cos x, or a system of nonlinear equations in several unknowns) there is no closed-form solution to read off. We have to find the root iteratively: start from a guess, refine it step by step, and stop once the guess is close enough to a real root.

The workhorse for this is Newton’s method. We start with the 1D version as a reminder, then lift the entire picture into dd dimensions in the next section.

The 1D Reminder

Let f:IRRf : I \subseteq \mathbb{R} \to \mathbb{R} be a twice continuously differentiable function (i.e. of class C2C^2) on an interval II, with a root xIx^* \in I we want to find. Starting from an initial guess x0x_0, Newton’s method generates the sequence

xk+1=xkf(xk)f(xk)for k=0,1,2,x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \quad \text{for } k = 0, 1, 2, \ldots

Provided x0x_0 lies in a neighborhood of xx^*, this sequence converges quadratically to xx^* — meaning that the error roughly squares at each step (a guess accurate to two digits becomes accurate to four, then eight, then sixteen).

The Geometric Idea

Each step replaces the (possibly nasty) curve ff by its tangent line at the current guess xkx_k — a far simpler, linear, easy-to-zero function — and then jumps to where that tangent line crosses the horizontal axis.

Concretely: at xkx_k, the function has value f(xk)f(x_k) and slope f(xk)f'(x_k). The tangent line through (xk,f(xk))(x_k, f(x_k)) with that slope is

(x)=f(xk)+f(xk)(xxk)\ell(x) = f(x_k) + f'(x_k)(x - x_k)

Setting (x)=0\ell(x) = 0 and solving for xx (assuming f(xk)0f'(x_k) \neq 0, so we can actually divide) gives exactly the Newton update:

x=xkf(xk)f(xk)=xk+1x = x_k - \frac{f(x_k)}{f'(x_k)} = x_{k+1}

So xk+1x_{k+1} is the root not of ff itself, but of the tangent-line approximation of ff at xkx_k. The hope is that near a real root the tangent is close enough to ff that its zero is also close to ff‘s zero — and that closeness sharpens each iteration, pulling the guess in fast.

The non-zero-slope condition f(xk)0f'(x_k) \neq 0 is the algebraic catch: if at any iterate the tangent is horizontal, it never crosses the axis and the formula divides by zero. In practice, we assume f(x)0f'(x^*) \neq 0 at the root we are after — a so-called simple root — and a small enough neighborhood inherits non-zero slope by continuity.

Why a Neighborhood, and Why Quadratic

The two strict conditions in the convergence claim each pull their weight.

  • Why C2C^2 and quadratic. The tangent line is a first-order Taylor approximation of ff at xkx_k. Its error grows like 12f(ξ)(xxk)2\tfrac{1}{2} f''(\xi)(x - x_k)^2 for some ξ\xi between xkx_k and xx — i.e. the error is second order in the displacement. Squaring the displacement is exactly what makes the iteration converge quadratically: each step shrinks the error to roughly the square of the previous error. We need ff'' to exist and be bounded near xx^* for that bound to hold.

  • Why a neighborhood of xx^*. Far from a real root the tangent line can point anywhere — it might send the next guess off to infinity, into the basin of a different root, or onto a near-horizontal patch where f(xk)0f'(x_k) \approx 0 and the update divides by something tiny. The neighborhood condition is the price of using a local linearization to chase a global problem: Newton’s method is a sharp tool, but it only finds the root in whose basin you start. For this reason Newton’s method is called a locally convergent method — convergence is guaranteed only once the iterate is close enough to xx^*, not from any starting point.

Lifting the Iteration to Dimensions

Now f:DRnRnf : D \subseteq \mathbb{R}^n \to \mathbb{R}^n is a class C2C^2 multivariate function on an open and convex domain DD. A root is a point xD\mathbf{x}^* \in D at which ff vanishes: f(x)=0f(\mathbf{x}^*) = \mathbf{0}.

Mechanically, almost nothing changes from the 1D recipe. The 1D update

xk+1=xkf(xk)f(xk)x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}

is just a special case of “step from xkx_k to where the tangent hits zero”. In dd dimensions the tangent line becomes the tangent hyperplane (the first-order Taylor approximation of ff at xk\mathbf{x}_k), the slope f(xk)f'(x_k) becomes the Jacobian matrix Jf(xk)Rn×nJ_f(\mathbf{x}_k) \in \mathbb{R}^{n \times n}, and “divide by the slope” becomes “multiply by the inverse Jacobian” — division is not a thing we can do with matrices, but multiplying by an inverse is. The literal lift of the formula reads

xk+1=xkJf(xk)1f(xk)\mathbf{x}_{k+1} = \mathbf{x}_k - J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

This is the right idea. But we never actually form Jf(xk)1J_f(\mathbf{x}_k)^{-1} to apply it.

Solve, Don’t Invert

The standard linear-algebra reformulation: whenever a derivation produces x=A1b\mathbf{x} = A^{-1}\mathbf{b}, do not compute the inverse A1A^{-1}. Multiply both sides by AA from the left — the product AA1A \cdot A^{-1} collapses to the identity — and the equation rearranges into the equivalent form Ax=bA\mathbf{x} = \mathbf{b}, with no inverse left on the page. Solve that system directly — by a direct linear-system solver such as Gaussian elimination or LU decomposition — and read off x\mathbf{x} from the result. The reason is twofold: forming and applying A1A^{-1} does strictly more work than one direct solve, and matrix inversion is also numerically less stable, especially when AA is ill-conditioned (close to non-invertible — its determinant is tiny, so small perturbations in b\mathbf{b} produce wildly different solutions).

Apply that trick here. The literal-lift formula above subtracts a specific correction from xk\mathbf{x}_k to produce xk+1\mathbf{x}_{k+1}. Give that correction a name — the Newton step, defined explicitly as

Δxk:=Jf(xk)1f(xk)\Delta\mathbf{x}_k := J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

so the update reads simply

xk+1=xkΔxk\mathbf{x}_{k+1} = \mathbf{x}_k - \Delta\mathbf{x}_k

To compute Δxk\Delta\mathbf{x}_k without ever forming Jf(xk)1J_f(\mathbf{x}_k)^{-1}, multiply the definition above by Jf(xk)J_f(\mathbf{x}_k) from the left — JfJf1J_f \cdot J_f^{-1} collapses to the identity — and what remains is the equivalent linear system

Jf(xk)Δxk=f(xk)J_f(\mathbf{x}_k) \, \Delta\mathbf{x}_k = f(\mathbf{x}_k)

This is the system we actually solve at each iteration. The two formulations agree algebraically; the point is operational — solving avoids the cost and instability of inverting. Geometrically, Δxk\Delta\mathbf{x}_k is the displacement from xk\mathbf{x}_k to where the tangent hyperplane of ff at xk\mathbf{x}_k hits zero — the multivariate analog of the 1D tangent-line intercept.

Spelling out the sizes makes the squareness visible:

Jf(xk)n×n  Δxkn×1  =  f(xk)n×1\underbrace{J_f(\mathbf{x}_k)}_{n \times n} \;\underbrace{\Delta\mathbf{x}_k}_{n \times 1} \;=\; \underbrace{f(\mathbf{x}_k)}_{n \times 1}

The Jacobian has one row per output of ff and one column per input variable, the unknown Δxk\Delta\mathbf{x}_k has one component per input variable, and the right-hand side has one component per output. So this is a square linear system: nn equations (one per output component of ff that has to be driven to zero) in nn unknowns (one per component of the step we are solving for).

Squareness is specific to the setup we have chosen — f:RnRnf : \mathbb{R}^n \to \mathbb{R}^n, with the same number of inputs as outputs. In a more general setup f:RnRmf : \mathbb{R}^n \to \mathbb{R}^m with mnm \neq n, the Jacobian would be rectangular (m×nm \times n) and the resulting system either over- or under-determined; Newton’s method as stated would not apply, and a different approach (e.g. least-squares) would be needed.

Squareness alone is not enough — Jf(xk)J_f(\mathbf{x}_k) must also be invertible (equivalently detJf(xk)0\det J_f(\mathbf{x}_k) \neq 0) for the system to have a unique solution. This is the multivariate analog of the 1D condition f(xk)0f'(x_k) \neq 0: without it, the step is undefined. As before we assume invertibility at the root itself (a simple root in the multivariate sense, detJf(x)0\det J_f(\mathbf{x}^*) \neq 0) and rely on continuity to inherit the property in a neighborhood.

Let f:DRnRnf : D \subseteq \mathbb{R}^n \to \mathbb{R}^n be a class C2C^2 function on an open and convex domain DD. To approximate a root xD\mathbf{x}^* \in D, fix a tolerance ε>0\varepsilon > 0 and choose a starting point x0D\mathbf{x}_0 \in D close to x\mathbf{x}^*. The Newton iteration then generates the sequence (xk)(\mathbf{x}_k) by, at each step k=0,1,2,k = 0, 1, 2, \ldots, solving the linear system

Jf(xk)Δxk=f(xk)J_f(\mathbf{x}_k) \, \Delta\mathbf{x}_k = f(\mathbf{x}_k)

for the Newton step Δxk\Delta\mathbf{x}_k and updating

xk+1=xkΔxk\mathbf{x}_{k+1} = \mathbf{x}_k - \Delta\mathbf{x}_k

The iteration continues as long as both xk+1xkε\|\mathbf{x}_{k+1} - \mathbf{x}_k\| \geq \varepsilon (the step is still larger than tolerance) and Jf(xk)1f(xk+1)Jf(xk)1f(xk)\|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_{k+1})\| \leq \|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)\| (the iteration is still making progress).

Stopping Conditions

The iteration is wrapped in two checks, both of which must hold for the loop to continue.

The tolerance check xk+1xkε\|\mathbf{x}_{k+1} - \mathbf{x}_k\| \geq \varepsilon asks whether the step we just took is larger than the chosen tolerance ε\varepsilon. Once consecutive iterates agree to within ε\varepsilon, we are no longer moving meaningfully and we accept xk+1\mathbf{x}_{k+1} as the approximate root.

The progress check Jf(xk)1f(xk+1)Jf(xk)1f(xk)\|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_{k+1})\| \leq \|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)\| is a divergence guard. The right-hand side is exactly Δxk\|\Delta\mathbf{x}_k\| — the size of the step we just took. The left-hand side is what the next step would be if we re-used the current Jacobian instead of recomputing one at xk+1\mathbf{x}_{k+1}: a cheap proxy for “how big a step is Newton about to take next?”. If that proxy is no larger than the current step, the iteration is still contracting and we keep going. If it grows, we are starting to overshoot — Newton has fallen out of its basin of convergence — and we stop rather than chase a divergent sequence.

Local Convergence Theorem

The informal “Newton converges fast if you start close enough” promise is now a precise theorem.

Let f:DRnRnf : D \subseteq \mathbb{R}^n \to \mathbb{R}^n be a class C2C^2 function on an open and convex domain DD, with a root xD\mathbf{x}^* \in D at which the Jacobian is invertible: detJf(x)0\det J_f(\mathbf{x}^*) \neq 0. Then there exists a neighborhood UU of x\mathbf{x}^* such that for every starting point x0U\mathbf{x}_0 \in U, the Newton iterates

xk+1=xkJf(xk)1f(xk)\mathbf{x}_{k+1} = \mathbf{x}_k - J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

converge to x\mathbf{x}^*, and the convergence rate is quadratic — there exists a constant CRC \in \mathbb{R} with

xk+1xCxkx2\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \leq C \, \|\mathbf{x}_k - \mathbf{x}^*\|^2

for every iterate.

The theorem packages every loose end from the previous subsections.

  • Invertible Jacobian at the root is the precondition for guaranteed convergence. If detJf(x)0\det J_f(\mathbf{x}^*) \neq 0 holds (the multivariate version of simple root) and we manage to start inside UU, convergence to x\mathbf{x}^* is not a hope but a certainty — every subsequent iterate stays in UU and the sequence settles on x\mathbf{x}^*. If instead detJf(x)=0\det J_f(\mathbf{x}^*) = 0, the theorem is silent: Newton may still converge, but slower (linear or sublinear) and the “sharp tool” property is lost.

  • Quadratic means correct digits roughly double per step. If xkx\|\mathbf{x}_k - \mathbf{x}^*\| is on the order of 10d10^{-d}, then xk+1x\|\mathbf{x}_{k+1} - \mathbf{x}^*\| is on the order of C102dC \cdot 10^{-2d} — the error squares, so the number of correct decimal places roughly doubles each step. In practice an iterate with a couple of correct digits often reaches machine precision in five or six iterations. The constant CC shifts the size of that “few” but not the doubling rate.

  • The neighborhood UU is unspecified. The theorem promises UU exists, but gives no formula for its size. This is the gap that makes Newton’s method locally convergent rather than globally convergent — and the gap that the run-time stopping conditions, next, are designed to cope with.