Numerical Treatment of ODE

The population models built earlier each came with a closed-form solution: Malthusian growth gave a clean exponential, the logistic model an explicit S-curve, the saturation model an exponential approach to a limit. Being able to write the solution down as a formula is the comfortable case, and also the rare one. The overwhelming majority of ordinary differential equations have no closed-form solution at all — there is simply no finite combination of elementary functions that satisfies them. The moment a model grows past a textbook example, this is what happens, and the only way forward is to compute an approximate solution numerically, step by step. This page is about how that is done, and about the ways it can go wrong.

Initial value problems and boundary value problems

A differential equation on its own does not pin down a single solution — it describes a whole family of curves, one for each way of fixing the constants of integration. To single out the one solution we actually want, the equation has to be paired with side conditions. The population-dynamics models are exactly this: one or several ODEs together with enough extra conditions to make the solution unique. There are two standard ways to supply those conditions, and they split ODE problems into two families.

The first fixes the solution at the start of the time interval — we are told the state at the beginning and asked how the system develops from there.

An initial value problem (IVP) is an ODE together with the value of its solution at the start of the time interval. In the scalar case,

y˙(t)=f(t,y(t)),y(a)=ya,ta,\dot y(t) = f(t, y(t)), \qquad y(a) = y_a, \qquad t \ge a,

and for a system of nn coupled ODEs,

y˙i(t)=fi(t,y1(t),,yn(t)),yi(a)=yi,a,ta,i=1,,n.\dot y_i(t) = f_i(t, y_1(t), \dots, y_n(t)), \qquad y_i(a) = y_{i,a}, \qquad t \ge a, \quad i = 1, \dots, n.

The condition y(a)=yay(a) = y_a is the initial value: the known state at the starting time t=at = a, from which the solution is propagated forward.

This is the population-dynamics setting from before. The initial value yay_a plays exactly the role the starting population p0p_0 did — the headcount we measure at the outset and then watch evolve. So y(a)=yay(a) = y_a is not just a piece of notation; it is the concrete starting datum, the analog of p0p_0, that turns the whole family of solution curves into the single one we care about.

The second family fixes the solution at both ends of the interval — we are told where the trajectory starts and where it must end, and asked to find the path between.

A boundary value problem (BVP) is an ODE together with values of its solution at both endpoints of the interval, rather than at the start alone — for instance y(a)=yay(a) = y_a at the left end and y(b)=yby(b) = y_b at the right.

The optimal trajectory of a space shuttle is the standard example: the craft leaves a known launch point and must arrive at a prescribed target, and the problem is to find the flight path joining the two. Most of what follows concerns initial value problems, which are the natural shape for population dynamics; boundary value problems are taken up later, once the machinery for IVPs is in place.

Reading the prototype

A few features of the prototype y˙(t)=f(t,y(t))\dot y(t) = f(t, y(t)) are worth drawing out, because the notation hides some choices.

The right-hand side ff is not the ff of the population models. There, f(p,q)f(p, q) named the per-capita growth rate — the factor multiplying the population in p˙=f(p,q)p\dot p = f(p, q)\, p. Here ff is the entire right-hand side of the ODE: the whole expression for y˙\dot y, the rate of change itself. Same letter, a different job.

The unknown yy may be a single scalar function of time or a vector of several functions stacked together. The system form above is the vector case written out component by component: nn populations evolving at once, where — as is typical in these models — each species’ rate of change depends on all the others. The coupled predator–prey and competition systems are of exactly this shape, nn first-order equations sharing one set of unknowns.

And every equation here is first-order: only the first derivative y˙\dot y appears, never y¨\ddot y or anything higher. That is a genuine simplification, and it is not as restrictive as it looks. A higher-order equation — the damped oscillator’s p¨\ddot p, for instance — can be folded back into this first-order form by introducing the lower derivatives as extra unknowns, a reduction taken up later in this chapter. So the first-order IVP is not a special case; it is the standard form everything else is converted into.

One last case is worth flagging. If ff happens to depend only on tt and not on yy — that is, y˙=f(t)\dot y = f(t) — then the ODE is no longer really a differential equation to be solved but a plain integration, y(t)=ya+atf(s)dsy(t) = y_a + \int_a^t f(s)\,ds, with nothing to do beyond evaluating an integral. The interesting case, and the one that forces the numerical machinery of this page, is the usual one: ff depends on the unknown yy itself, tying the rate of change at each instant to the current state, so the solution cannot be had by integrating a known function of tt alone.

A short excursus: discretization

Before any method can run, there is a gap to bridge. A model lives in the continuum — real-valued quantities, smooth functions, an unbroken interval of time — but a computer is a finite machine that can hold only finitely many numbers and perform only finitely many operations. Every numerical method therefore begins with the same move.

Discretization is the transition from the continuum to the discrete and finite: replacing objects that take uncountably many values — the real numbers, a continuous function, an interval of time — with finite, machine-representable stand-ins.

This happens at several levels, and it is worth seeing each one, because the approximations they introduce are the ultimate source of the errors the rest of this page has to control.

Representing numbers

The first thing to discretize is the real number line itself. A machine cannot store an arbitrary real number — between any two reals lie uncountably many others, and only finitely many fit in memory. There are three standard schemes, differing in how they trade off range (how large and small the representable numbers go) against resolution (how finely spaced they are):

  • Integer arithmetic keeps only whole numbers, so the spacing is always exactly one. Nothing between two integers can be stored: a value like 2.52.5 has no slot, and a division such as 5/25 / 2 silently drops its remainder. For continuous quantities that is hopelessly coarse.
  • Fixed-point arithmetic narrows the spacing to let fractions in. It pins the radix point at a fixed position and keeps a fixed number of digits after it, so the machine really stores an integer such as 123123 but reads it as 1.231.23 (the point is just understood to sit two places in). Now fractions fit — but the spacing it buys is uniform across the whole range: choose a step of 0.010.01 and every representable number sits 0.010.01 from its neighbor, near zero and near a million alike. That is the wrong place to spend precision. Out among large values the fixed step is far finer than anyone needs, while near zero that same step is too coarse to resolve anything smaller than itself. What you actually want is fine steps for small numbers and coarse steps for large ones, and a single absolute spacing can never be both.
  • Floating-point arithmetic fixes that imbalance by letting the radix point “float” to wherever the digits are needed, so both the range and the resolution vary with the number’s size. The spacing now scales with magnitude — tight near zero, wide out among large values — but the relative precision stays the same everywhere: the same handful of significant digits whether the quantity is tiny or huge. This is what makes it possible to hold both very large and very small numbers sensibly, and it is what essentially all scientific computation uses.

The floating-point idea has a precise definition.

The floating-point numbers in base BB with tt mantissa digits form the set

FB,t={MBE  :  M=0  Bt1M<Bt,  M,EZ}.\mathbb{F}_{B,t} = \{\, M \cdot B^E \;:\; M = 0 \ \lor\ B^{t-1} \le |M| < B^t,\ \ M, E \in \mathbb{Z} \,\}.

The integer MM is the mantissa, carrying the significant digits; the integer EE is the exponent, which scales them; and BB is the base. The bound Bt1M<BtB^{t-1} \le |M| < B^t forces a nonzero mantissa to have exactly tt digits — no leading zeros — so every number has one normalized representation.

A real machine cannot store arbitrarily large or small exponents, so it keeps only the machine numbers, the floating-point numbers whose exponent lies in a fixed range aEba \le E \le b:

F=FB,t,a,b={fFB,t  :  aEb}.\mathbb{F} = \mathbb{F}_{B,t,a,b} = \{\, f \in \mathbb{F}_{B,t} \;:\; a \le E \le b \,\}.

Because the exponent is bounded, this set is finite and has a smallest and a largest element. The smallest positive machine number is the smallest mantissa at the smallest exponent, Bt1BaB^{t-1} \cdot B^{a}, and the largest machine number is the largest mantissa at the largest exponent, (Bt1)Bb(B^{t} - 1) \cdot B^{b}. A value beyond either bound cannot be stored: it underflows to zero if it is too small, or overflows if it is too large.

Take B=10B = 10 and t=3t = 3, so the mantissa always carries three decimal digits (100M<1000100 \le |M| < 1000 when nonzero) and the exponent slides the decimal point. Then 1.23=1231021.23 = 123 \cdot 10^{-2}, 123=123100123 = 123 \cdot 10^{0}, and 456,000=456103456{,}000 = 456 \cdot 10^{3} are all representable, each with the same three significant digits. With the exponent bounded by, say, 9E9-9 \le E \le 9, the smallest positive machine number is then Bt1Ba=100109B^{t-1} \cdot B^{a} = 100 \cdot 10^{-9} and the largest is (Bt1)Bb=999109(B^{t} - 1) \cdot B^{b} = 999 \cdot 10^{9}. Outside that window a value cannot be stored at all — it underflows to zero or overflows.

Resolution — accuracy is relative

The single most important property of a floating-point system is how finely it resolves numbers, and the key fact is that it does so relatively, not absolutely.

The resolution of a floating-point system is the maximal relative distance between neighboring representable numbers,

ρ=B1t.\rho = B^{1-t}.

Here is where it comes from. Take two neighboring machine numbers with the same exponent EE. Their mantissas differ by one, so the numbers themselves are MBEM \cdot B^E and (M+1)BE(M+1) \cdot B^E, and subtracting leaves an absolute gap of just BEB^E between them. To turn that into a relative gap, divide it by the value itself, MBEM \cdot B^E. This fraction is largest where the value is smallest, and the smallest a normalized mantissa is allowed to be is M=Bt1|M| = B^{t-1} (the lower bound built into the definition). Putting that worst case in, the BEB^E cancels top and bottom:

BEBt1BE=B1t=ρ.\frac{B^E}{B^{t-1} \cdot B^E} = B^{1-t} = \rho.

So the relative gap never exceeds ρ\rho. The absolute spacing BEB^E itself grows as the numbers grow, so neighbors far out on the line are spaced far apart while neighbors near zero are packed tightly together, but the relative spacing stays capped at ρ\rho everywhere. This is exactly the behavior one wants: precision is proportional to magnitude, the same handful of significant digits whether the quantity is tiny or huge.

With B=10B = 10 and t=3t = 3, the resolution is ρ=1013=102\rho = 10^{1-3} = 10^{-2}, one part in a hundred. Near 11, consecutive machine numbers 1.00, 1.01, 1.02,1.00,\ 1.01,\ 1.02,\dots sit an absolute 0.010.01 apart. Near 10001000 they are 1000, 1010, 1020,1000,\ 1010,\ 1020,\dots, an absolute gap of 1010, a thousand times wider. Yet the relative gap is 1%1\% in both places: three significant digits, wherever you look on the line.

Beyond numbers

Discretization does not stop at the numbers. The same move — replace an infinite or continuous object with a finite one — is applied to functions and to the operations of calculus:

  • An infinite series is cut off after finitely many terms, turning it into a polynomial. The sine, for example, is computed in practice not from its defining infinite series but from a truncated polynomial that agrees with it closely on the range of interest.
  • A continuous interval of time is replaced by a finite grid of discrete points t0,t1,t2,t_0, t_1, t_2, \dots, and the solution is sought only at those points rather than at every instant.
  • A derivative — defined as a limit, an inherently continuous notion — is replaced by a difference quotient taken over a small but finite step.

The difference quotient approximates the derivative of yy at tt by the slope of a chord over a small finite step hh, instead of the limiting slope of the tangent:

y˙(t)y(t+h)y(t)h.\dot y(t) \approx \frac{y(t+h) - y(t)}{h}.

The derivative is the limit of this ratio as h0h \to 0; keeping hh small but positive is what makes it computable. This last substitution is the seed of the whole numerical treatment of ODEs — it turns y˙=f(t,y)\dot y = f(t, y) from a statement about instantaneous rates into a rule that steps the solution forward one finite hh at a time. But every one of these discretizations — rounded numbers, truncated series, finite steps in place of limits — buys that computability at the price of some error. Keeping those errors identified and under control is what the next two sections are about, and what the rest of the page ultimately depends on.

Rounding and round-off error

A numerical algorithm picks up error from several sources. The most basic one is built into floating-point arithmetic itself: a real number that is not a machine number cannot be stored exactly and must be replaced by one that can.

Every real xx that is not itself a machine number falls strictly between two that are — its nearest representable neighbors below and above,

fl(x)=max{fF:fx},fr(x)=min{fF:fx}.f_l(x) = \max\{\, f \in \mathbb{F} : f \le x \,\}, \qquad f_r(x) = \min\{\, f \in \mathbb{F} : f \ge x \,\}.

If xx happens to be a machine number, both neighbors coincide with xx itself; otherwise xx sits between fl(x)f_l(x) and fr(x)f_r(x), and storing it means choosing one of the two. That choice is made by a rounding map.

Rounding is a map rd:RF\mathrm{rd} : \mathbb{R} \to \mathbb{F} that replaces each real number with a machine number. It has three defining properties:

  • surjective (every target value is hit) — each machine number is the rounding of some real, if only of itself, so no representable value is left unreachable;
  • idempotent (applying it twice changes nothing the second time) — a number that is already a machine number is left alone, rd(f)=f\mathrm{rd}(f) = f for every fFf \in \mathbb{F}, so rd(rd(x))=rd(x)\mathrm{rd}(\mathrm{rd}(x)) = \mathrm{rd}(x);
  • monotonic (order-preserving) — if xyx \le y then rd(x)rd(y)\mathrm{rd}(x) \le \mathrm{rd}(y), so rounding never flips the order of two numbers.

Which of the two neighbors rd(x)\mathrm{rd}(x) actually returns is fixed by the rounding mode:

  • rounding up always takes the upper neighbor, rd(x)=fr(x)\mathrm{rd}(x) = f_r(x), i.e. it rounds toward ++\infty;
  • rounding down always takes the lower neighbor, rd(x)=fl(x)\mathrm{rd}(x) = f_l(x), i.e. it rounds toward -\infty;
  • correct rounding (round to nearest) takes whichever neighbor is closer, with a fixed tie-breaking rule for the exact midpoint between them;
  • truncating (round toward zero) simply cuts the number off, keeping the leading digits and throwing away the rest. Since dropping digits only ever shrinks the magnitude, this lands on the neighbor nearer to zero. That matches rounding down for positive values, but for negative ones it rounds the opposite way: 2.7-2.7 truncates to 2-2, whereas rounding down would give 3-3.

Whatever the mode, the price of storing xx is the gap between it and the machine number chosen for it.

The round-off error is the discrepancy rd(x)x\mathrm{rd}(x) - x introduced when a real number is replaced by its machine representation. Measured relative to xx, it is bounded by the resolution: at most ρ\rho for the directed modes (up, down, truncation), and at most ρ/2\rho/2 for correct rounding to the nearest — which, by always taking the closer neighbor, can never be off by more than half the spacing to a neighbor.

Arithmetic on machine numbers

Rounding is not a one-time cost paid only on input. Even when both operands are machine numbers, the result of an arithmetic operation usually is not — multiply two three-digit numbers and the exact product can need six digits — so it has to be rounded back into F\mathbb{F} as well. The idealized model of this is clean:

Under ideal arithmetic, each elementary operation is carried out as if in exact arithmetic and only its result is rounded to a machine number:

a~b=rd(ab),{+,,,/},a,bF.a \mathbin{\tilde{*}} b = \mathrm{rd}(a * b), \qquad * \in \{+, -, \cdot, /\}, \quad a, b \in \mathbb{F}.

The tilde marks the machine operation ~\tilde{*} as distinct from the exact one * it approximates.

This is the behavior the IEEE floating-point standard prescribes — though, as a technological matter, not every computer implements it faithfully. For error analysis it is handier to weaken the exact equality into a bound: every machine operation returns the true result perturbed by a small relative amount,

a~b=(ab)(1+ε(a,b)),ε(a,b)<ε~=O(ρ).a \mathbin{\tilde{*}} b = (a * b)\,\bigl(1 + \varepsilon(a, b)\bigr), \qquad |\varepsilon(a, b)| < \tilde{\varepsilon} = O(\rho).

The perturbation ε(a,b)\varepsilon(a, b) depends on the operands, but its size is always controlled. The O(ρ)O(\rho) here is big-O notation, read “on the order of ρ\rho”: it means the bound ε~\tilde{\varepsilon} is at most some fixed constant times the resolution ρ\rho, so it stays proportional to ρ\rho and shrinks right along with it as the precision improves. A single rounding, then, is utterly harmless. The real question is what happens when millions of them compound over a long computation: do the individual errors stay independent and roughly cancel, or do they reinforce and grow? Estimating that accumulated influence is the job of round-off error analysis, and the standard worked example is Horner’s method for evaluating a polynomial — the classic case for tracing how round-off builds up across a long chain of multiply–add steps.

Further sources of errors

Round-off is the most basic source of error, but it is not the only one, and on modern hardware it is rarely the worst. Double-precision arithmetic carries so many significant digits that the resolution ρ\rho is minuscule, so the errors that actually dominate a real computation usually come not from the number format but from the deliberate approximations a method makes. Three of these are worth naming.

The discretization error is the error from solving a problem on a discrete set of points instead of on the underlying continuum — sampling a continuous function at grid points, or stepping an ODE forward in finite jumps rather than continuously.

This is the direct cost of the discretization that every method on this page relies on, and it is the one these methods spend most of their effort controlling. Once a concrete method is in hand, the discretization error splits into a local part (made in a single step) and a global part (accumulated across all steps), and the central questions become how fast each shrinks as the step size goes to zero.

The truncation error is the error from stopping an iterative process after finitely many steps, instead of running it to its (often infinite) completion.

This takes a few common shapes. A series computed term by term has to be cut off after NN terms, leaving out the infinite tail — the same series-to-polynomial truncation seen earlier, now viewed as a source of error. Root-finding by Newton’s method likewise produces only a sequence of ever-better approximations to a true root; the iteration has to be stopped somewhere, either after a fixed number NN of steps or once the change from one step to the next becomes insignificant, and whatever distance remains to the exact root is truncation error.

The data error is the error already present in the input: the data fed to a computation are often measurements, and measurements are inexact.

Data error is different in kind from the other two — it has nothing to do with the algorithm or the machine, but is baked into the problem before any computation starts. It cannot be reduced by a better method or finer precision, only carried along, and how much it grows or shrinks as it passes through a computation turns out to be a property of the problem itself, taken up later on this page.

Round-off, discretization, truncation, data — each is a separate way the computed answer can drift from the true one, and a sound numerical method has to keep an eye on all of them at once.