Roots

This chapter and the next apply the multivariate calculus tools just developed — gradient, Hessian, Jacobian — to two of the most common numerical problems: finding roots (inputs that make a function output zero) and finding optima (inputs at which a function attains a local extremum). Both problems share the same recipe: write down the condition that characterizes a solution, then iterate toward it. We start with roots.

A root (also called a zero) of a function $f$ is an input value $\mathbf{x}^*$ at which $f$ vanishes:

f(\mathbf{x}^*) = 0

In 1D, $\mathbf{x}^*$ is a single number $x^* \in \mathbb{R}$ where the graph of $f$ crosses the horizontal axis. In higher dimensions, $\mathbf{x}^*$ is a point in $\mathbb{R}^n$ — a specific combination of input coordinates — that makes $f$ output exactly zero.

So whenever you hear “finding a root,” the question is: what input do I have to plug into this machine to make it spit out a zero?

For toy functions like $f(x) = x - 3$ the answer is immediate: $x^* = 3$. But for anything more interesting (a general polynomial of degree five or higher, a transcendental equation like $x = \cos x$, or a system of nonlinear equations in several unknowns) there is in general no closed-form solution to read off. We have to find the root iteratively: start from a guess, refine it step by step, and stop once the guess is close enough to a real root.

The workhorse for this is Newton’s method. We start with the 1D version as a reminder, then lift the entire picture into $d$ dimensions in the next section.

The 1D Reminder

Let $f : I \subseteq \mathbb{R} \to \mathbb{R}$ be a twice continuously differentiable function (i.e. of class $C^2$ ) on an interval $I$ , with a root $x^* \in I$ we want to find. Starting from an initial guess $x_0$ , Newton’s method generates the sequence

x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \quad \text{for } k = 0, 1, 2, \ldots

Provided $f'(x^*) \neq 0$ and $x_0$ lies in a sufficiently small neighborhood of $x^*$, this sequence converges quadratically to $x^*$, meaning that the error roughly squares at each step (a guess accurate to two digits becomes accurate to four, then eight, then sixteen).
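The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not a production solver: the function name `newton_1d`, the tolerance `tol`, and the iteration cap `max_iter` are assumptions introduced here. The test problem is the equation $x = \cos x$ mentioned earlier, rewritten as $f(x) = x - \cos x$.

```python
import math

def newton_1d(f, df, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f(x_k) / f'(x_k) until the step is below tol.

    tol and max_iter are illustrative safeguards, not part of the formula.
    Returns the last iterate and the list of all iterates.
    """
    x = x0
    history = [x]
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0.0:
            raise ZeroDivisionError("horizontal tangent: Newton step undefined")
        x_next = x - f(x) / slope
        history.append(x_next)
        if abs(x_next - x) < tol:
            break
        x = x_next
    return history[-1], history

# f(x) = x - cos(x), f'(x) = 1 + sin(x)
root, history = newton_1d(lambda x: x - math.cos(x),
                          lambda x: 1.0 + math.sin(x),
                          x0=1.0)

for k, xk in enumerate(history):
    # The residual |f(x_k)| shrinks very quickly once x_k is near the root.
    print(k, xk, abs(xk - math.cos(xk)))
```

Printing the residual at each step makes the rapid error reduction visible without needing to know the root in advance.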

The Geometric Idea

Each step replaces the (possibly nasty) curve $f$ by its tangent line at the current guess $x_k$ — a far simpler, linear, easy-to-zero function — and then jumps to where that tangent line crosses the horizontal axis.

Concretely: at $x_k$ , the function has value $f(x_k)$ and slope $f'(x_k)$ . The tangent line through $(x_k, f(x_k))$ with that slope is

\ell(x) = f(x_k) + f'(x_k)(x - x_k)

Setting $\ell(x) = 0$ and solving for $x$ (assuming $f'(x_k) \neq 0$ , so we can actually divide) gives exactly the Newton update:

x = x_k - \frac{f(x_k)}{f'(x_k)} = x_{k+1}

So $x_{k+1}$ is the root not of $f$ itself, but of the tangent-line approximation of $f$ at $x_k$. The hope is that near a real root the tangent is close enough to $f$ that its zero is also close to the zero of $f$, and that this closeness sharpens with each iteration, pulling the guess in fast.
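A single step can be checked by hand or in code. The sketch below uses $f(x) = x^2 - 2$ and $x_k = 1$ as an illustrative choice (neither appears in the text above), builds the tangent line $\ell$, and confirms that the Newton update is the zero of $\ell$ rather than of $f$.

```python
def tangent_line(f, df, xk):
    """Return ell(x) = f(xk) + f'(xk) * (x - xk) as a callable."""
    fk, slope = f(xk), df(xk)
    return lambda x: fk + slope * (x - xk)

f = lambda x: x * x - 2.0      # illustrative function with root sqrt(2)
df = lambda x: 2.0 * x

xk = 1.0
ell = tangent_line(f, df, xk)
x_next = xk - f(xk) / df(xk)   # Newton update from xk

print(x_next)        # 1.5
print(ell(x_next))   # 0.0  (the tangent line vanishes at the update)
print(f(x_next))     # 0.25 (f itself is not yet zero, only closer than f(xk) = -1)
```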

The non-zero-slope condition $f'(x_k) \neq 0$ is the algebraic catch: if at any iterate the tangent is horizontal, it never crosses the axis and the formula divides by zero. In practice, we assume $f'(x^*) \neq 0$ at the root we are after — a so-called simple root — and a small enough neighborhood inherits non-zero slope by continuity.

Why a Neighborhood, and Why Quadratic

The two strict conditions in the convergence claim each pull their weight.

Why $C^2$ and quadratic. The tangent line is a first-order Taylor approximation of $f$ at $x_k$ . Its error grows like $\tfrac{1}{2} f''(\xi)(x - x_k)^2$ for some $\xi$ between $x_k$ and $x$ — i.e. the error is second order in the displacement. Squaring the displacement is exactly what makes the iteration converge quadratically: each step shrinks the error to roughly the square of the previous error. We need $f''$ to exist and be bounded near $x^*$ for that bound to hold.
Why a neighborhood of $x^*$ . Far from a real root the tangent line can point anywhere — it might send the next guess off to infinity, into the basin of a different root, or onto a near-horizontal patch where $f'(x_k) \approx 0$ and the update divides by something tiny. The neighborhood condition is the price of using a local linearization to chase a global problem: Newton’s method is a sharp tool, but it only finds the root in whose basin you start. For this reason Newton’s method is called a locally convergent method — convergence is guaranteed only once the iterate is close enough to $x^*$ , not from any starting point.
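The dependence on the starting point is easy to observe numerically. The sketch below uses $f(x) = \arctan x$, a standard illustration chosen here as an assumption (it is not taken from the text): its only root is $x^* = 0$, yet a start that is only slightly too far away makes the iterates grow instead of shrink.

```python
import math

def newton_iterates(x0, steps):
    """Newton iterates for f(x) = atan(x), whose only root is x* = 0."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        # f'(x) = 1 / (1 + x^2), so f(x) / f'(x) = atan(x) * (1 + x^2)
        xs.append(x - math.atan(x) * (1.0 + x * x))
    return xs

near = newton_iterates(1.0, 5)   # close enough: iterates collapse onto 0
far = newton_iterates(1.5, 5)    # too far: iterates alternate in sign and grow

print("start 1.0:", near)
print("start 1.5:", far)
```

Both runs use the same function and the same update formula; only the starting point differs.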

Lifting the Iteration to $d$ Dimensions

Now $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ is a class $C^2$ multivariate function on an open and convex domain $D$ . A root is a point $\mathbf{x}^* \in D$ at which $f$ vanishes: $f(\mathbf{x}^*) = \mathbf{0}$ .

Mechanically, almost nothing changes from the 1D recipe. The 1D update

x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}

is just a special case of “step from $x_k$ to where the tangent hits zero”. In $d$ dimensions the tangent line becomes the tangent hyperplane (the first-order Taylor approximation of $f$ at $\mathbf{x}_k$ ), the slope $f'(x_k)$ becomes the Jacobian matrix $J_f(\mathbf{x}_k) \in \mathbb{R}^{n \times n}$ , and “divide by the slope” becomes “multiply by the inverse Jacobian” — division is not a thing we can do with matrices, but multiplying by an inverse is. The literal lift of the formula reads

\mathbf{x}_{k+1} = \mathbf{x}_k - J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

This is the right idea. But we never actually form $J_f(\mathbf{x}_k)^{-1}$ to apply it.

Solve, Don’t Invert

The standard linear-algebra reformulation: whenever a derivation produces $\mathbf{x} = A^{-1}\mathbf{b}$, do not compute the inverse $A^{-1}$. Multiply both sides by $A$ from the left (the product $A \cdot A^{-1}$ collapses to the identity) and the equation rearranges into the equivalent form $A\mathbf{x} = \mathbf{b}$, with no inverse left on the page. Solve that system with a direct linear-system solver such as Gaussian elimination or LU decomposition, and read off $\mathbf{x}$ from the result. The reason is twofold: forming and applying $A^{-1}$ does strictly more work than one direct solve, and explicit inversion is also numerically less accurate, especially when $A$ is ill-conditioned (close to non-invertible in the sense of a large condition number, so that small perturbations in $\mathbf{b}$ can produce wildly different solutions).

Apply that trick here. The literal-lift formula above subtracts a specific correction from $\mathbf{x}_k$ to produce $\mathbf{x}_{k+1}$ . Give that correction a name — the Newton step, defined explicitly as

\Delta\mathbf{x}_k := J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

so the update reads simply

\mathbf{x}_{k+1} = \mathbf{x}_k - \Delta\mathbf{x}_k

To compute $\Delta\mathbf{x}_k$ without ever forming $J_f(\mathbf{x}_k)^{-1}$ , multiply the definition above by $J_f(\mathbf{x}_k)$ from the left — $J_f \cdot J_f^{-1}$ collapses to the identity — and what remains is the equivalent linear system

J_f(\mathbf{x}_k) \, \Delta\mathbf{x}_k = f(\mathbf{x}_k)

This is the system we actually solve at each iteration. The two formulations agree algebraically; the point is operational — solving avoids the cost and instability of inverting. Geometrically, $\Delta\mathbf{x}_k$ is the displacement from $\mathbf{x}_k$ to where the tangent hyperplane of $f$ at $\mathbf{x}_k$ hits zero — the multivariate analog of the 1D tangent-line intercept.
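As a concrete sketch of one such solve, the code below computes a Newton step by Gaussian elimination with partial pivoting, written in plain Python so that it is self-contained. The helper name `solve_linear` and the example system $f(x, y) = (x^2 + y^2 - 4,\; xy - 1)$ are assumptions made for illustration; in practice one would typically call a library routine for the linear solve.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a list of n rows (each a list of n numbers), b a list of n numbers.
    The inputs are copied, not modified. No inverse is ever formed.
    """
    n = len(b)
    # Augmented matrix [A | b]
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Choose the largest available pivot to limit round-off growth.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

# Illustrative system: f(x, y) = (x^2 + y^2 - 4, x*y - 1) at x_k = (2, 1)
xk = [2.0, 1.0]
f_xk = [xk[0] ** 2 + xk[1] ** 2 - 4.0, xk[0] * xk[1] - 1.0]   # (1, 1)
J_xk = [[2.0 * xk[0], 2.0 * xk[1]],                            # [[4, 2],
        [xk[1], xk[0]]]                                        #  [1, 2]]

step = solve_linear(J_xk, f_xk)                  # solves J dx = f
x_next = [xi - di for xi, di in zip(xk, step)]   # x_{k+1} = x_k - dx

print(step)     # [0.0, 0.5]
print(x_next)   # [2.0, 0.5]
```

At the new point the second component of $f$ is already zero and the first has dropped from $1$ to $0.25$.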

Spelling out the sizes makes the squareness visible:

\underbrace{J_f(\mathbf{x}_k)}_{n \times n} \;\underbrace{\Delta\mathbf{x}_k}_{n \times 1} \;=\; \underbrace{f(\mathbf{x}_k)}_{n \times 1}

The Jacobian has one row per output of $f$ and one column per input variable, the unknown $\Delta\mathbf{x}_k$ has one component per input variable, and the right-hand side has one component per output. So this is a square linear system: $n$ equations (one per output component of $f$ that has to be driven to zero) in $n$ unknowns (one per component of the step we are solving for).

Squareness is specific to the setup we have chosen — $f : \mathbb{R}^n \to \mathbb{R}^n$ , with the same number of inputs as outputs. In a more general setup $f : \mathbb{R}^n \to \mathbb{R}^m$ with $m \neq n$ , the Jacobian would be rectangular ( $m \times n$ ) and the resulting system either over- or under-determined; Newton’s method as stated would not apply, and a different approach (e.g. least-squares) would be needed.

Squareness alone is not enough — $J_f(\mathbf{x}_k)$ must also be invertible (equivalently $\det J_f(\mathbf{x}_k) \neq 0$ ) for the system to have a unique solution. This is the multivariate analog of the 1D condition $f'(x_k) \neq 0$ : without it, the step is undefined. As before we assume invertibility at the root itself (a simple root in the multivariate sense, $\det J_f(\mathbf{x}^*) \neq 0$ ) and rely on continuity to inherit the property in a neighborhood.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ be a class $C^2$ function on an open and convex domain $D$ . To approximate a root $\mathbf{x}^* \in D$ , fix a tolerance $\varepsilon > 0$ and choose a starting point $\mathbf{x}_0 \in D$ close to $\mathbf{x}^*$ . The Newton iteration then generates the sequence $(\mathbf{x}_k)$ by, at each step $k = 0, 1, 2, \ldots$ , solving the linear system

J_f(\mathbf{x}_k) \, \Delta\mathbf{x}_k = f(\mathbf{x}_k)

for the Newton step $\Delta\mathbf{x}_k$ and updating

\mathbf{x}_{k+1} = \mathbf{x}_k - \Delta\mathbf{x}_k

The iteration continues as long as both $\|\mathbf{x}_{k+1} - \mathbf{x}_k\| \geq \varepsilon$ (the step is still larger than tolerance) and $\|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_{k+1})\| \leq \|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)\|$ (the iteration is still making progress).

Stopping Conditions

The iteration is wrapped in two checks, both of which must hold for the loop to continue.

The tolerance check $\|\mathbf{x}_{k+1} - \mathbf{x}_k\| \geq \varepsilon$ asks whether the step we just took is larger than the chosen tolerance $\varepsilon$ . Once consecutive iterates agree to within $\varepsilon$ , we are no longer moving meaningfully and we accept $\mathbf{x}_{k+1}$ as the approximate root.

The progress check $\|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_{k+1})\| \leq \|J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)\|$ is a divergence guard. The right-hand side is exactly $\|\Delta\mathbf{x}_k\|$ — the size of the step we just took. The left-hand side is what the next step would be if we re-used the current Jacobian instead of recomputing one at $\mathbf{x}_{k+1}$ : a cheap proxy for “how big a step is Newton about to take next?”. If that proxy is no larger than the current step, the iteration is still contracting and we keep going. If it grows, we are starting to overshoot — Newton has fallen out of its basin of convergence — and we stop rather than chase a divergent sequence.
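Putting the iteration and both checks together, a minimal sketch of the loop might look as follows. Several details are assumptions made here rather than prescriptions from the text: the Euclidean norm, the iteration cap `max_iter`, the status strings, the choice to return the last computed iterate in every case, and the example system $f(x, y) = (x^2 + y^2 - 4,\; xy - 1)$. The small `solve_linear` helper stands in for a library linear solver.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("singular Jacobian: Newton step undefined")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def norm(v):
    """Euclidean norm (an assumed choice; the text does not fix a norm)."""
    return math.sqrt(sum(c * c for c in v))

def newton(f, jac, x0, eps=1e-10, max_iter=50):
    """Newton iteration with the tolerance check and the progress check.

    Returns (last iterate, status, number of steps taken).
    """
    x = list(x0)
    for k in range(1, max_iter + 1):
        J = jac(x)
        dx = solve_linear(J, f(x))                   # J(x_k) dx = f(x_k)
        x_next = [xi - di for xi, di in zip(x, dx)]  # x_{k+1} = x_k - dx
        if norm(dx) < eps:                           # tolerance check fails: stop
            return x_next, "converged", k
        # Proxy for the next step, re-using the current Jacobian J(x_k).
        proxy = solve_linear(J, f(x_next))
        if norm(proxy) > norm(dx):                   # progress check fails: stop
            return x_next, "diverging", k
        x = x_next
    return x, "max_iter", max_iter

def f(v):
    x, y = v
    return [x * x + y * y - 4.0, x * y - 1.0]

def jac(v):
    x, y = v
    return [[2.0 * x, 2.0 * y], [y, x]]

root, status, steps = newton(f, jac, [2.0, 1.0])
print(status, steps, root)
```

A real implementation would factor $J_f(\mathbf{x}_k)$ once (for example by LU decomposition) and reuse the factors for both solves, which is what makes the progress check cheap; the sketch simply solves twice for clarity.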

Local Convergence Theorem

The informal “Newton converges fast if you start close enough” promise is now a precise theorem.

Let $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ be a class $C^2$ function on an open and convex domain $D$ , with a root $\mathbf{x}^* \in D$ at which the Jacobian is invertible: $\det J_f(\mathbf{x}^*) \neq 0$ . Then there exists a neighborhood $U$ of $\mathbf{x}^*$ such that for every starting point $\mathbf{x}_0 \in U$ , the Newton iterates

\mathbf{x}_{k+1} = \mathbf{x}_k - J_f(\mathbf{x}_k)^{-1} f(\mathbf{x}_k)

converge to $\mathbf{x}^*$, and the convergence rate is quadratic: there exists a constant $C > 0$ with

\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \leq C \, \|\mathbf{x}_k - \mathbf{x}^*\|^2

for every iterate.

The theorem packages every loose end from the previous subsections.

Invertible Jacobian at the root is the precondition for guaranteed convergence. If $\det J_f(\mathbf{x}^*) \neq 0$ holds (the multivariate version of a simple root) and we manage to start inside $U$, convergence to $\mathbf{x}^*$ is not a hope but a certainty: every subsequent iterate stays in $U$ and the sequence settles on $\mathbf{x}^*$. If instead $\det J_f(\mathbf{x}^*) = 0$, the theorem is silent: Newton may still converge, but more slowly (typically only linearly), and the “sharp tool” property is lost.
Quadratic means correct digits roughly double per step. If $\|\mathbf{x}_k - \mathbf{x}^*\|$ is on the order of $10^{-d}$, then $\|\mathbf{x}_{k+1} - \mathbf{x}^*\|$ is on the order of $C \cdot 10^{-2d}$: the error squares, so the number of correct decimal places roughly doubles each step. In practice an iterate with a couple of correct digits often reaches machine precision within a few iterations. The constant $C$ shifts the size of that “few” but not the doubling rate.
The neighborhood $U$ is unspecified. The theorem promises that $U$ exists, but gives no formula for its size. This is the gap that makes Newton’s method locally convergent rather than globally convergent, and the gap that the run-time stopping conditions described above are designed to cope with.
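The quadratic bound and the role of the invertibility condition can both be observed in a small 1D experiment. The sketch below is illustrative only: the functions $x^2 - 2$ (simple root at $\sqrt{2}$) and $x^2$ (root at $0$ with zero derivative) and the helper name `newton_errors` are assumptions chosen for this example. For the simple root the ratio $e_{k+1} / e_k^2$ settles near a constant, which plays the role of $C$; for the non-simple root the error is merely halved each step.

```python
import math

def newton_errors(f, df, x0, root, steps):
    """Return |x_k - root| for k = 0..steps under the 1D Newton iteration."""
    x, errors = x0, [abs(x0 - root)]
    for _ in range(steps):
        x = x - f(x) / df(x)
        errors.append(abs(x - root))
    return errors

# Simple root: f(x) = x^2 - 2 at x* = sqrt(2), where f'(x*) != 0.
# Only three steps are taken so the errors stay above round-off level.
simple = newton_errors(lambda x: x * x - 2.0, lambda x: 2.0 * x,
                       1.5, math.sqrt(2.0), 3)
quadratic_ratios = [simple[k + 1] / simple[k] ** 2 for k in range(3)]

# Non-simple root: f(x) = x^2 at x* = 0, where f'(x*) = 0.
double = newton_errors(lambda x: x * x, lambda x: 2.0 * x, 1.0, 0.0, 6)
linear_ratios = [double[k + 1] / double[k] for k in range(6)]

print("e_{k+1} / e_k^2 (simple root):    ", quadratic_ratios)
print("e_{k+1} / e_k   (non-simple root):", linear_ratios)
```

For $f(x) = x^2 - 2$ the first list approaches $1 / (2\sqrt{2}) \approx 0.354$, while for $f(x) = x^2$ every entry of the second list is $0.5$: convergence survives but is only linear.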
