Roots
This chapter and the next apply the multivariate calculus tools just developed — gradient, Hessian, Jacobian — to two of the most common numerical problems: finding roots (inputs that make a function output zero) and finding optima (inputs at which a function attains a local extremum). Both problems share the same recipe: write down the condition that characterizes a solution, then iterate toward it. We start with roots.
A root (also called a zero) of a function is an input value at which vanishes:
In 1D, is a single number where the graph of crosses the horizontal axis. In higher dimensions, is a point in — a specific combination of input coordinates — that makes output exactly zero.
So whenever you hear “finding a root,” the question is: what input do I have to plug into this machine to make it spit out a zero?
For toy functions like the answer is immediate — . But for anything more interesting (a polynomial of degree higher than four, a transcendental equation like , or a system of nonlinear equations in several unknowns) there is no closed-form solution to read off. We have to find the root iteratively: start from a guess, refine it step by step, and stop once the guess is close enough to a real root.
The workhorse for this is Newton’s method. We start with the 1D version as a reminder, then lift the entire picture into dimensions in the next section.
The 1D Reminder
Let be a twice continuously differentiable function (i.e. of class ) on an interval , with a root we want to find. Starting from an initial guess , Newton’s method generates the sequence
Provided lies in a neighborhood of , this sequence converges quadratically to — meaning that the error roughly squares at each step (a guess accurate to two digits becomes accurate to four, then eight, then sixteen).
The Geometric Idea
Each step replaces the (possibly nasty) curve by its tangent line at the current guess — a far simpler, linear, easy-to-zero function — and then jumps to where that tangent line crosses the horizontal axis.
Concretely: at , the function has value and slope . The tangent line through with that slope is
Setting and solving for (assuming , so we can actually divide) gives exactly the Newton update:
So is the root not of itself, but of the tangent-line approximation of at . The hope is that near a real root the tangent is close enough to that its zero is also close to ‘s zero — and that closeness sharpens each iteration, pulling the guess in fast.
The non-zero-slope condition is the algebraic catch: if at any iterate the tangent is horizontal, it never crosses the axis and the formula divides by zero. In practice, we assume at the root we are after — a so-called simple root — and a small enough neighborhood inherits non-zero slope by continuity.
Why a Neighborhood, and Why Quadratic
The two strict conditions in the convergence claim each pull their weight.
-
Why and quadratic. The tangent line is a first-order Taylor approximation of at . Its error grows like for some between and — i.e. the error is second order in the displacement. Squaring the displacement is exactly what makes the iteration converge quadratically: each step shrinks the error to roughly the square of the previous error. We need to exist and be bounded near for that bound to hold.
-
Why a neighborhood of . Far from a real root the tangent line can point anywhere — it might send the next guess off to infinity, into the basin of a different root, or onto a near-horizontal patch where and the update divides by something tiny. The neighborhood condition is the price of using a local linearization to chase a global problem: Newton’s method is a sharp tool, but it only finds the root in whose basin you start. For this reason Newton’s method is called a locally convergent method — convergence is guaranteed only once the iterate is close enough to , not from any starting point.
Lifting the Iteration to Dimensions
Now is a class multivariate function on an open and convex domain . A root is a point at which vanishes: .
Mechanically, almost nothing changes from the 1D recipe. The 1D update
is just a special case of “step from to where the tangent hits zero”. In dimensions the tangent line becomes the tangent hyperplane (the first-order Taylor approximation of at ), the slope becomes the Jacobian matrix , and “divide by the slope” becomes “multiply by the inverse Jacobian” — division is not a thing we can do with matrices, but multiplying by an inverse is. The literal lift of the formula reads
This is the right idea. But we never actually form to apply it.
Solve, Don’t Invert
The standard linear-algebra reformulation: whenever a derivation produces , do not compute the inverse . Multiply both sides by from the left — the product collapses to the identity — and the equation rearranges into the equivalent form , with no inverse left on the page. Solve that system directly — by a direct linear-system solver such as Gaussian elimination or LU decomposition — and read off from the result. The reason is twofold: forming and applying does strictly more work than one direct solve, and matrix inversion is also numerically less stable, especially when is ill-conditioned (close to non-invertible — its determinant is tiny, so small perturbations in produce wildly different solutions).
Apply that trick here. The literal-lift formula above subtracts a specific correction from to produce . Give that correction a name — the Newton step, defined explicitly as
so the update reads simply
To compute without ever forming , multiply the definition above by from the left — collapses to the identity — and what remains is the equivalent linear system
This is the system we actually solve at each iteration. The two formulations agree algebraically; the point is operational — solving avoids the cost and instability of inverting. Geometrically, is the displacement from to where the tangent hyperplane of at hits zero — the multivariate analog of the 1D tangent-line intercept.
Spelling out the sizes makes the squareness visible:
The Jacobian has one row per output of and one column per input variable, the unknown has one component per input variable, and the right-hand side has one component per output. So this is a square linear system: equations (one per output component of that has to be driven to zero) in unknowns (one per component of the step we are solving for).
Squareness is specific to the setup we have chosen — , with the same number of inputs as outputs. In a more general setup with , the Jacobian would be rectangular () and the resulting system either over- or under-determined; Newton’s method as stated would not apply, and a different approach (e.g. least-squares) would be needed.
Squareness alone is not enough — must also be invertible (equivalently ) for the system to have a unique solution. This is the multivariate analog of the 1D condition : without it, the step is undefined. As before we assume invertibility at the root itself (a simple root in the multivariate sense, ) and rely on continuity to inherit the property in a neighborhood.
Let be a class function on an open and convex domain . To approximate a root , fix a tolerance and choose a starting point close to . The Newton iteration then generates the sequence by, at each step , solving the linear system
for the Newton step and updating
The iteration continues as long as both (the step is still larger than tolerance) and (the iteration is still making progress).
Stopping Conditions
The iteration is wrapped in two checks, both of which must hold for the loop to continue.
The tolerance check asks whether the step we just took is larger than the chosen tolerance . Once consecutive iterates agree to within , we are no longer moving meaningfully and we accept as the approximate root.
The progress check is a divergence guard. The right-hand side is exactly — the size of the step we just took. The left-hand side is what the next step would be if we re-used the current Jacobian instead of recomputing one at : a cheap proxy for “how big a step is Newton about to take next?”. If that proxy is no larger than the current step, the iteration is still contracting and we keep going. If it grows, we are starting to overshoot — Newton has fallen out of its basin of convergence — and we stop rather than chase a divergent sequence.
Local Convergence Theorem
The informal “Newton converges fast if you start close enough” promise is now a precise theorem.
Let be a class function on an open and convex domain , with a root at which the Jacobian is invertible: . Then there exists a neighborhood of such that for every starting point , the Newton iterates
converge to , and the convergence rate is quadratic — there exists a constant with
for every iterate.
The theorem packages every loose end from the previous subsections.
-
Invertible Jacobian at the root is the precondition for guaranteed convergence. If holds (the multivariate version of simple root) and we manage to start inside , convergence to is not a hope but a certainty — every subsequent iterate stays in and the sequence settles on . If instead , the theorem is silent: Newton may still converge, but slower (linear or sublinear) and the “sharp tool” property is lost.
-
Quadratic means correct digits roughly double per step. If is on the order of , then is on the order of — the error squares, so the number of correct decimal places roughly doubles each step. In practice an iterate with a couple of correct digits often reaches machine precision in five or six iterations. The constant shifts the size of that “few” but not the doubling rate.
-
The neighborhood is unspecified. The theorem promises exists, but gives no formula for its size. This is the gap that makes Newton’s method locally convergent rather than globally convergent — and the gap that the run-time stopping conditions, next, are designed to cope with.