Numerical algorithms II: elementary functions

Powers

Integer powers x^n are computed by a fast algorithm of "repeated squaring", as described in Knuth's book, The art of computer programming. The algorithm is based on the following trick: if n is even, n=2*k, then x^n=x^k^2; and if n is odd, n=2*k+1, then x^n=x*x^k^2. Thus we can reduce the calculation of x^n to the calculation of x^k with k<=n/2, using at most two multiplications. This is one "step"; at each step n is divided by 2. This algorithm stops when n becomes 1, which is after m steps where m is the number of bits in n.

This procedure should be recursive, or else it requires to obtain the bits of the number n in reverse order. It is easier to implement this algorithm in a slightly different form: suppose we have the bits b[i] of the number n, so that n=b[0]+2*b[1]+...+b[m]*2^m. Then

x^n=x^b[0]*x^2^b[1]*...*x^2^m^b[m].

In other words, we evaluate x^2, x^4, ... by repeated squaring, and multiply those x^2^k for which the k-th bit b[k] of the number n is nonzero.

In Yacas script form, the algorithm looks like this:

power(x_IsPositiveInteger, n_IsPositiveInteger)<-- [ Local(result); result:=1; While(n != 0) [ if (IsOdd(n)) result := result*x; x := x*x; n := n>>1; ]; result; ];

power(m,n) calculates the result of m^n for n>0, m>0, integer n and integer m.

Since the "binary squaring" algorithm scans bits in the number n, the execution time is linear in the number of digits of n, i.e. logarithmic in n.

Roots

The square root is computed by using the bisection method, which works well for integers. The general approach is to scan each bit of the input number and to see if a certain bit should be set in the resulting integer. The time is linear in the number of decimals, or logarithmic in the input number. The method is very similar in approach to the repeated squaring method described above for raising numbers to a power.

For integer N, the following steps are performed:

Find the highest bit set, l2, in the number N.
1<<(l2/2) is at least a bit that is set in the end result. Start by setting that bit in the result, u=1<<l2 . It is also the highest bit set.
Now, traverse all the lower bits, one by one. For each lower bit, starting at lnext=l2-1, set v=1<<lnext. Now, (u+v)^2=u^2+2*u*v+v^2. If (u+v)^2<=N, then the bit set in v should also be set in the result, u, otherwise that bit should be cleared.
Set lnext=lnext-1, and repeat until all bits are tested, and lnext=0 and return the result found.

The intermediate results, u^2, v^2 and 2*u*v can be maintained easily too, due to the nature of the numbers involved ( v having only one bit set, and it being known which bit that is).

For floating point numbers, first the required number of decimals p after the decimal point is determined. Then the input number N is multiplied by a power of 10 until it has 2*p decimal. Then the integer square root calculation is performed, and the resulting number has p digits of precision.

Below is some Yacas script code to perform the calculation for integers.

//sqrt(1) = 1, sqrt(0) = 0 10 # BisectSqrt(0) <-- 0; 10 # BisectSqrt(1) <-- 1; 20 # BisectSqrt(N_IsPositiveInteger) <-- [ Local(l2,u,v,u2,v2,uv2,n); // Find highest set bit, l2 u := N; l2 := 0; While (u!=0) [ u:=u>>1; l2++; ]; l2--; // 1<<(l2/2) now would be a good under estimate // for the square root. 1<<(l2/2) is definitely // set in the result. Also it is the highest // set bit. l2 := l2>>1; // initialize u and u2 (u2==u^2). u := 1 << l2; u2 := u << l2; // Now for each lower bit: l2--; While( l2 != 0 ) [ // Get that bit in v, and v2 == v^2. v := 1<<l2; v2 := v<<l2; // uv2 == 2*u*v, where 2==1<<1, and // v==1<<l2, thus 2*u*v == // (1<<1)*u*(1<<l2) == u<<(l2+1) uv2 := u<<(l2 + 1); // n = (u+v)^2 = u^2 + 2*u*v + v^2 // = u2+uv2+v2 n := u2 + uv2 + v2; // if n (possible new best estimate for // sqrt(N)^2 is smaller than N, then the // bit l2 is set in the result, and // add v to u. if( n <= N ) [ u := u+v; // u <- u+v u2 := n; // u^2 <- u^2 + 2*u*v + v^2 ]; l2--; ]; u; // return result, accumulated in u. ];

BisectSqrt(N) computes the integer part of Sqrt(N) for integer N.

A separate function IntNthRoot is provided to compute the integer part of n^(1/s) for integer n and s. For a given s, it evaluates the integer part of n^(1/s) using only integer arithmetic with integers of size n^(1+1/s). This can be done by Halley's iteration method, solving the equation x^s=n. For this function, the Halley iteration sequence is monotonic. The initial guess is x[0]=2^(b(n)/s) where b(n) is the number of bits in n obtained by bit counting or using the integer logarithm function. It is clear that the initial guess is accurate to within a factor of 2. Since the relative error is squared at every iteration, we need as many iteration steps as bits in n^(1/s).

Since we only need the integer part of the root, it is enough to use integer division in the Halley iteration. The sequence x[k] will monotonically approximate the number n^(1/s) from below if we start from an initial guess that is less than the exact value. (We start from below so that we have to deal with smaller integers rather than with larger integers.) If n=p^s, then after enough iterations the floating-point value of x[k] would be slightly less than p; our value is the integer part of x[k]. Therefore, at each step we check whether 1+x[k] is a solution of x^s=n, in which case we are done; and we also check whether (1+x[k])^s>n, in which case the integer part of the root is x[k]. To speed up the Halley iteration in the worst case when s^s>n, it is combined with bisection. The root bracket interval x1<x<x2 is maintained and the next iteration x[k+1] is assigned to the midpoint of the interval if the Newton formula does not give sufficiently rapid convergence. The initial root bracket interval can be taken as x[0], 2*x[0].

Real powers (as opposed to integer powers and roots) are computed by using the exponential and logarithm functions, a^b=Exp(b*Ln(a)). This is also faster on integers than the integer root algorithm if b is very large.

Logarithm

There are two functions for the logarithm: one for the integer argument and one for the real argument.

The "integer logarithm", defined as the integer part of Ln(x)/Ln(b), where x and b are integers, is computed using a special routine IntLog(x, b) with purely integer math. This is much faster than evaluating the full logarithm when both arguments are integers and only the integer part of the logarithm is needed. The algorithm consists of (integer) dividing x by b repeatedly until x becomes 0 and counting the number of divisions. A speed-up for large x is achieved by first comparing x with b, then with b^2, b^4, etc., until the factor b^2^n is larger than x. At this point, x is divided by that power of b and the remaining value is iteratively compared with and divided by successively smaller powers of b.

The logarithm function Ln(x) for general (complex) x can be computed using its Taylor series,

Ln(1+x)=x-x/2+x^2/3-...

This series converges only for Abs(x)<1, so for all other values of x one first needs to bring the argument into this range by taking several square roots and then using the identity Ln(x)=2^k*Ln(x^2^(-k)). This is implemented in the Yacas core (for real x).

Currently the routine LnNum uses the Halley method for the equation Exp(x)=a to find x=Ln(a),

x'=x-2*(Exp(x)-a)/(Exp(x)+a).

This is currently quicker than other methods (with internal math).

A much faster algorithm based on the AGM sequence was given by Salamin (see R. P. Brent: Multiple-precision zero-finding methods and the complexity of elementary function evaluation, in Analytic Computational Complexity, ed. by J. F. Traub, Academic Press, 1975, p. 151; also available online from Oxford Computing Laboratory, as the paper rpb028). The formula is based on an asymptotic relation,

Ln(x)=Pi*x*(1+4*x^(-2)*(1-1/Ln(x))+O(x^(-4)))/(2*AGM(x,4)).

If x is large enough, the numerator is very close to 1 and can be disregarded. "Large enough" for a desired precision of P decimal digits means that 4*x^(-2)<10^(-P). The AGM algorithm gives P digits only for such large values of x, unlike the Taylor series which is only good for x close to 1.

The required number of AGM iterations is approximately 2*Ln(P)/Ln(2). For smaller values of x (but x>1), one can either raise x to a large integer power r (this is quick only if x is an integer or a rational) and compute 1/r*Ln(x^r), or multiply x by a large integer power of 2 (this is better for floating-point x) and compute Ln(2^s*x)-s*Ln(2). Here the required powers are

r=Ln(10^P*4)/(2*Ln(x)),

s=P*Ln(10)/(2*Ln(2))+1-Ln(x)/Ln(2).

These parameters can be found quickly by using the integer logarithm procedure IntLog, while constant values such as Ln(10)/Ln(2) can be simply approximated by rational numbers because r and s do not need to be very precise (but they do need to be large enough). For the second calculation, Ln(2^s*x)-s*Ln(2), we must precompute Ln(2) to the same precision. Also, the subtraction of a large number s*Ln(2) leads to a certain loss of precision, namely, about Ln(s)/Ln(10) decimal digits are lost, therefore the operating precision must be increased by this number of digits. (The quantity Ln(s)/Ln(10) is computed, of course, by the integer logarithm procedure.)

If x<1, then (-Ln(1/x)) is computed. Finally, there is a special case when x is very close to 1, where the Taylor series converges quickly but the AGM algorithm requires to multiply x by a large power of 2 and then subtract two almost equal numbers, leading to a great loss of precision. Suppose 1<x<1+10^(-M), where M is large (say of order P). The Taylor series for Ln(1+epsilon) needs about N= -P*Ln(10)/Ln(epsilon)=P/M terms. If we evaluate the Taylor series using the rectangular scheme, we need 2*Sqrt(N) multiplications and Sqrt(N) units of storage. On the other hand, the main slow operation for the AGM sequence is the geometric mean Sqrt(a*b). If Sqrt(a*b) takes an equivalent of c multiplications (Brent's estimate is c=13/2 but it may be more in practice), then the AGM sequence requires 2*c*Ln(P)/Ln(2) multiplications. Therefore the Taylor series method is more efficient for

M>1/c^2*P*(Ln(2)/Ln(P))^2.

In this case it requires at most c*Ln(P)/Ln(2) units of storage and 2*c*Ln(P)/Ln(2) multiplications.

For larger x>1+10^(-M), the AGM method is more efficient. It is necessary to increase the working precision to P+M*Ln(2)/Ln(10) but this does not decrease the asymptotic speed of the algorithm. To compute Ln(x) with P digits of precision for any x, only O(Ln(P)) long multiplications are required.

Exponential

The exponential function is computed using its Taylor series,

Exp(x)=1+x+x^2/2! +...

This series converges for all (complex) x, but if Abs(x) is large, it converges slowly. A speed-up trick used for large x is to divide the argument by some power of 2 and then square the result several times, i.e.

Exp(x)=Exp(2^(-k)*x)^2^k,

where k is chosen sufficiently large so that the Taylor series converges quickly at 2^(-k)*x. The threshold value for x is in the variable MathExpThreshold in stdfuncs. If x is large and negative, then it is easier to compute 1/ Exp(-x). If x is sufficiently small, e.g. Abs(x)<10^(-M) and M>Ln(P)/Ln(10), then it is enough to take about P/M terms in the Taylor series. If x is of order 1, one needs about P*Ln(10)/Ln(P) terms.

An alternative way to compute x=Exp(a) at large precision would be to solve the equation Ln(x)=a using a fast logarithm routine. A cubically convergent formula is obtained if we replace Ln(x)=a by an equivalent equation

(Ln(x)-a)/(Ln(x)-a-2)=0.

For this equation, Newton's method gives the iteration

x'=x*(1+(a+1-Ln(x))^2)/2.

This iteration converges for initial values 0<x<Exp(a+2) with a cubic convergence rate and requires only one more multiplication, compared with Newton's method for Ln(x)=a. A good initial guess can be found by raising 2 to the integer part of a/Ln(2) (the value Ln(2) can be approximated from above by a suitable rational number, e.g. 7050/10171).

Calculation of Pi

In Yacas, the constant pi is computed by the library routine Pi() which uses the internal routine MathPi to compute the value to current precision Precision(). The result is stored in a global variable as a list of the form {precision, value} where precision is the number of digits of pi that have already been found and value is the multiple-precision value. This is done to avoid recalculating pi if a precise enough value for it has already been found.

Efficient iterative algorithms for computing pi with arbitrary precision have been recently developed by Brent, Salamin, Borwein and others. However, limitations of the current multiple-precision implementation in Yacas (compiled with the "internal" math option) make these advanced algorithms run slower because they require many more arbitrary-precision multiplications at each iteration.

The file examples/pi.ys implements five different algorithms that duplicate the functionality of Pi(). See http://numbers.computation.free.fr/Constants/ for details of computations of pi and generalizations of Newton-Raphson iteration.

PiMethod0(), PiMethod1(), PiMethod2() are all based on a generalized Newton-Raphson method of solving equations.

Since pi is a solution of Sin(x)=0, one may start sufficiently close, e.g. at x0=3.14159265 and iterate x'=x-Tan(x). In fact it is faster to iterate x'=x+Sin(x) which solves a different equation for pi. PiMethod0() is the straightforward implementation of the latter iteration. A significant speed improvement is achieved by doing calculations at each iteration only with the precision of the root that we expect to get from that iteration. Any imprecision introduced by round-off will be automatically corrected at the next iteration.

If at some iteration x=pi+epsilon for small epsilon, then from the Taylor expansion of Sin(x) it follows that the value x' at the next iteration will differ from pi by O(epsilon^3). Therefore, the number of correct digits triples at each iteration. If we know the number of correct digits of pi in the initial approximation, we can decide in advance how many iterations to compute and what precision to use at each iteration.

The final speed-up in PiMethod0() is to avoid computing at unnecessarily high precision. This may happen if, for example, we need to evaluate 200 digits of pi starting with 20 correct digits. After 2 iterations we would be calculating with 180 digits; the next iteration would have given us 540 digits but we only need 200, so the third iteration would be wasteful. This can be avoided by first computing pi to just over 1/3 of the required precision, i.e. to 67 digits, and then executing the last iteration at full 200 digits. There is still a wasteful step when we would go from 60 digits to 67, but much less time would be wasted than in the calculation with 200 digits of precision.

Newton's method is based on approximating the function f(x) by a straight line. One can achieve better approximation and therefore faster convergence to the root if one approximates the function with a polynomial curve of higher order. The routine PiMethod1() uses the iteration

x'=x+Sin(x)+1/6*Sin(x)^3+3/40*Sin(x)^5+5/112*Sin(x)^7

which has a faster convergence, giving 9 times as many digits at every iteration. (The series is the Taylor series for ArcSin(y) cut at O(y^9).) The same speed-up tricks are used as in PiMethod0(). In addition, the last iteration, which must be done at full precision, is performed with the simpler iteration x'=x+Sin(x) to reduce the number of high-precision multiplications.

Both PiMethod0() and PiMethod1() require a computation of Sin(x) at every iteration. An industrial-strength arbitrary precision library such as gmp can multiply numbers much faster than it can evaluate a trigonometric function. Therefore, it would be good to have a method which does not require trigonometrics. PiMethod2() is a simple attempt to remedy the problem. It computes the Taylor series for ArcTan(x),

ArcTan(x)=x-x^3/3+x^5/5-x^7/7+...,

for the value of x obtained as the tangent of the initial guess for pi; in other words, if x=pi+epsilon where epsilon is small, then Tan(x)=Tan(epsilon), therefore epsilon=ArcTan(Tan(x)) and pi is found as pi=x-epsilon. If the initial guess is good (i.e. epsilon is very small), then the Taylor series for ArcTan(x) converges very quickly (although linearly, i.e. it gives a fixed number of digits of pi per term). Only a single full-precision evaluation of Tan(x) is necessary at the beginning of the algorithm. The complexity of this algorithm is proportional to the number of digits and to the time of a long multiplication.

The routines PiBrentSalamin() and PiBorwein() are based on much more advanced mathematics. (See papers of P. Borwein for review and explanations of the methods.) They do not require evaluations of trigonometric functions, but they do require taking a few square roots at each iteration, and all calculations must be done using full precision. Using modern algorithms, one can compute a square root roughly in the same time as a division; but Yacas's internal math is not yet up to it. Therefore, these two routines perform poorly compared to the more simple-minded PiMethod0().

Trigonometric functions

Trigonometric functions Sin(x), Cos(x) are computed by subtracting 2*Pi from x until it is in the range 0<x<2*Pi and then using Taylor series. Tangent is computed by dividing Sin(x)/Cos(x).

Inverse trigonometric functions are computed by Newton's method (for ArcSin) or by continued fraction expansion (for ArcTan),

ArcTan(x)=x/(1+x^2/(3+(2*x)^2/(5+(3*x)^2/(7+...)))).

The convergence of this expansion for large Abs(x) is improved by using the identities

ArcTan(x)=Pi/2*Sign(x)-ArcTan(1/x),

ArcTan(x)=2*ArcTan(x/(1+Sqrt(1+x^2))).

Thus, any value of x is reduced to Abs(x)<0.42. This is implemented in the standard library scripts.

By the identity ArcCos(x):=Pi/2-ArcSin(x), the inverse cosine is reduced to the inverse sine. Newton's method for ArcSin(x) consists of solving the equation Sin(y)=x for y. Implementation is similar to the calculation of pi in PiMethod0().

For x close to 1, Newton's method for ArcSin(x) converges very slowly. An identity

ArcSin(x)=Sign(x)*(Pi/2-ArcSin(Sqrt(1-x^2)))

can be used in this case. Another potentially useful identity is

ArcSin(x)=2*ArcSin(x/(Sqrt(2)*Sqrt(1+Sqrt(1-x^2)))).

Inverse tangent can also be related to inverse sine by

ArcTan(x)=ArcSin(x/Sqrt(1+x^2)),

ArcTan(1/x)=ArcSin(1/Sqrt(1+x^2)).

Hyperbolic and inverse hyperbolic functions are reduced to exponentials and logarithms: Cosh(x)=1/2*(Exp(x)+Exp(-x)), Sinh(x)=1/2*(Exp(x)-Exp(-x)), Tanh(x)=Sinh(x)/Cosh(x),

ArcCosh(x)=Ln(x+Sqrt(x^2+1)),

ArcSinh(x)=Ln(x+Sqrt(x^2-1)),

ArcTanh(x)=1/2*Ln((1+x)/(1-x)).

The idea to use continued fraction expansions for ArcTan comes from the book by Jack W. Crenshaw, MATH Toolkit for REAL-TIME Programming (CMP Media Inc., 2000). In that book the author explains how he got the idea to use continued fraction expansions to approximate ArcTan(x), given that the Taylor series converges slowly, and having a hunch that in that case the continued fraction expansion then converges rapidly. He then proceeds to show that in the case of ArcTan(x), this advantage is very significant. However, it might not be true for all slowly converging series.

The convergence of the continued fraction expansion of ArcTan(x) is indeed better than convergence of the Taylor series. Namely, the Taylor series converges only for Abs(x)<1 while the continued fraction converges for all x. However, the speed of its convergence is not uniform in x; the larger the value of x, the slower the convergence. The necessary number of terms of the continued fraction is in any case proportional to the required number of digits of precision, but the constant of proportionality depends on x.

This can be understood by the following elementary argument. The difference between two partial continued fractions that differ only by one extra last term can be estimated by

Abs(delta):=Abs(b[0]/(a[1]+b[1]/(...+b[n-1]/a[n]))-b[0]/(a[1]+b[1]/(...+b[n]/a[n+1])))<(b[0]*b[1]*...*b[n])/((a[1]*...*a[n])^2*a[n+1]).

(This is a conservative estimate that could be improved with more careful analysis. See also the section on numerical continued fractions.) For the above continued fraction for ArcTan(x), this directly gives the following estimate,

Abs(delta)<(x^(2*n+1)*(n!)^2)/((2*n+1)*((2*n-1)!!)^2)<=>Pi*(x/2)^(2*n+1).

This formula only gives a meaningful bound if x<2, but it is clear that the precision generally becomes worse when x grows. If we need P digits of precision, then, for a given x, the number of terms n has to be large enough so that the relative precision is sufficient, i.e.

delta/ArcTan(x)<10^(-P).

This gives n>(P*Ln(10))/(Ln(4)-2*Ln(x)) and for x=1, n>3/2*P. This estimate is very close for small x and only slightly suboptimal for larger x: numerical experimentation shows that for x<=1, the required number of terms for P decimal digits is only about 4/3*P, and for x<=0.42, n must be 3/4*P. If x<1 is very small then one needs a much smaller number of terms n>P*Ln(10)/(Ln(4)-2*Ln(x)). Roundoff errors may actually make the result less precise if a larger number of terms is used.

If we compare the rate of convergence of the continued fraction for ArcTan(x) with the Taylor series

ArcTan(x)=Sum(n,0,Infinity,((-1)^n*x^(2*n+1))/(2*n+1)),

we find that the number of terms of the Taylor series needed for P digits is about n>(P*Ln(10))/(2*Ln(x)). Since the range of x can be reduced to about [0, 0.42] by trigonometric identities, the difference between this and P*Ln(10)/(Ln(4)-2*Ln(x)) is never very large. At most twice as many terms n are needed in the Taylor series as in the continued fraction. However, a Taylor series can be evaluated efficiently using O(Sqrt(n)) long multiplications, while a continued fraction with n terms always requires n divisions. Therefore, at high enough precision the continued fraction method will be much less efficient than the Taylor series.

Factorials and binomial coefficients

The factorial is defined by n! :=n*(n-1)*...*1 for integer n and the binomial coefficient is defined as

Bin(n,m):=n! /(m! *(n-m)!).

A "double factorial" n!! :=n*(n-2)*(n-4)*... is also useful for some calculations.

There are two tasks related to the factorial: the exact integer calculation and an approximate calculation to some floating-point precision. Factorial of n has approximately n*Ln(n)/Ln(10) decimal digits, so an exact calculation is practical only for relatively small n. In the current implementation, exact factorials for n>65535 are not computed but print an error message advising the user to avoid exact computations. For example, LnGammaNum(n+1) is able to compute Ln(n!) for very large n to the desired floating-point precision.

Exact factorials

To compute factorials exactly, we use two direct methods. The first method is to multiply the numbers 1, 2, ..., n in a loop. This method requires n multiplications of short numbers with P-digit numbers, where P=O(n*Ln(n)) is the number of digits in n!. Therefore its complexity is O(n^2*Ln(n)). This factorial routine is implemented in the Yacas core with a small speedup: consecutive pairs of integers are first multiplied together using platform math and then multiplied by the accumulator product.

A second method uses a binary tree arrangement of the numbers 1, 2, ..., n similar to the recursive sorting routine ("merge-sort"). If we denote by a *** b the "partial factorial" product a*(a+1)*...(b-1)*b, then the tree-factorial algorithm consists of replacing n! by 1***n and recursively evaluating (1***m)*((m+1)***n) for some integer m near n/2. The partial factorials of nearby numbers such as m***(m+2) are evaluated explicitly. The binary tree algorithm requires one multiplication of P/2 digit integers at the last step, two P/4 digit multiplications at the last-but-one step and so on. There are O(Ln(n)) total steps of the recursion. If the cost of multiplication is M(P)=P^(1+a)*Ln(P)^b, then one can show that the total cost of the binary tree algorithm is O(M(P)) if a>0 and O(M(P)*Ln(n)) if a=0 (which is the best asymptotic multiplication algorithm).

Therefore, the tree method wins over the simple method if the cost of multiplication is lower than quadratic.

The tree method can also be used to compute "double factorials" ( n!!). This is faster than to use the identities (2*n)!! =2^n*n! and

(2*n-1)!! =(2*n)! /(2^n*n!).

Double factorials are used in the exact calculation of the Gamma function of half-integer argument.

Binomial coefficients Bin(n,m) are found by first selecting the smaller of m, n-m and using the identity Bin(n,m)=Bin(n,n-m). Then a partial factorial is used to compute

Bin(n,m)=((n-m+1)***n)/m!.

This is always much faster than computing the three factorials in the definition of Bin(n,m).

Approximate factorials

A floating-point computation of the factorial may proceed either via Euler's Gamma function, n! =Gamma(n+1), or by a direct method (multiplying the integers and maintaining a certain floating-point precision). If the required precision is much less than the number of digits in the exact factorial, then almost all multiplications will be truncated to the precision P and the tree method O(n*M(P)) is always slower than the simple method O(n*P).

Classical orthogonal polynomials: general case

A family of orthogonal polynomials is a sequence of polynomials q _n(x), n=0, 1, ... that satisfy the orthogonality condition on some interval [ a, b] with respect to some weight function rho(x):

Integrate(x,a,b)q _m(x)*q _n(x)*rho(x)=0.

The interval can be finite or infinite and the weight function must be real and non-negative on this interval.

In principle, one could choose any (non-negative) weight function rho(x) and any interval [a, b] and construct the corresponding family of orthogonal polynomials q _n(x). For example, take q _0=1, then take q _1=x+c with unknown c and find c such that q _0 and q _1 satisfy the orthogonality condition; this requires solving a linear equation. Then we can similarly find two unknown coefficients of q _2 and so on. (This is called the Gramm-Schmidt orthogonalization procedure.)

But of course not all weight functions rho(x) and not all intervals [a, b] are equally interesting. There are several "classical" families of orthogonal polynomials that have been of use to theoretical and applied science. The "classical" polynomials are always solutions of a simple second-order differential equation and are always a specific case of some hypergeometric function.

The construction of "classical" polynomials always follows the same scheme. The function rho(x) must satisfy the so-called Pearson's equation,

1/rho(x)*D(x)rho(x)=alpha(x)/beta(x),

where the functions alpha, beta are simple:

alpha(x)=alpha _0+alpha _1*x,

beta(x)=beta _0+beta _1*x+beta _2*x^2.

Also, the following boundary conditions must be satisfied on both ends a, b of the interval,

rho(a)*beta(a)=rho(b)*beta(b)=0.

If the function rho(x) and the interval [a, b] is chosen in this way, then the corresponding orthogonal polynomials q_n(x) are solutions of the differential equation

(Deriv(x)(beta(x)*rho(x)*(Deriv(x)q_n(x))))-n*(alpha_1+(n+1)*beta_2)*q[n]=0 .

The polynomials q_n(x) are also given by the Rodrigues formula,

q_n(x) = A[n]/rho(x)*D(x,n)rho(x)*beta(x)^n ,

where A[n] is a normalization constant. It is usual practice to normalize the polynomials so that

(Integrate(x,a,b)(q_n(x))^2*rho(x))=1 .

So the normalization constant can be chosen accordingly.

Finally, there is a formula for the generating function of the polynomials,

G(x,w)=1/rho(x)*rho(t(x,w))/Abs(1-w*(beta _1+2*beta _2*t(x,w))),

where t(x,w) is the root of t-x-w*beta(t)=0 which is nearest to t=x at small w. This function G(x,w) gives the non-normalized polynomials,

G(x,w) = Sum(n,0,Infinity, q_n(x)/n! *w^n) ,

where

q_n(x) = 1/rho(x)*D(x,n)rho(x)*beta(x)^n .

The classical families of (normalized) orthogonal polynomials are obtained in this framework with the following definitions:

The Legendre polynomials OrthoP(n,x): a= -1, b=1, rho(x)=1, alpha(x)=0, beta(x)=1-x^2, A[n]=(-1)^n/(2^n*n!).

The Laguerre polynomials (L^m)_n(x): a=0, b= +Infinity, rho(x)=x^m*Exp(-x), (here m> -1 or else the weight function is not integrable on the interval), alpha(x)=m-x, beta(x)=x, A[n]=1.

The Hermite polynomials OrthoH(n,x): a= -Infinity, b= +Infinity, rho(x)=Exp(-x^2), alpha(x)= -2*x, beta(x)=1, A[n]=(-1)^n.

The Chebyshev polynomials of the first kind OrthoT(n,x): a= -1, b=1, rho(x)=1/Sqrt(1-x^2), alpha(x)=x, beta(x)=1-x^2, A[n]=(-1)^n/(2*n)!!.

The Rodrigues formula or the generating function are not efficient ways to calculate the polynomials. A better way is to use linear recurrence relations connecting q[n+1] with q[n] and q[n-1]. (These recurrence relations can also be written out in full generality through alpha(x) and beta(x) but we shall save the space.)

There are three computational tasks related to orthogonal polynomials:

Compute the coefficients of a given polynomial q _n(x) exactly. The coefficients are usually rational numbers, but the numerators and denominators usually grow exponentially quickly with n, so that n-th polynomial has at least some coefficients as n-digit numbers. The array of coefficients can be obtained using recurrence relations n times. The required number of operations is proportional to n^2 (because of n coefficients, n recurrence relations for each of them) and to the multiplication time of n-digit integers, i.e. O(n^2*M(n)). Sometimes an exact formula for the coefficients is available (this is the case for the four "classical" families of polynomials above; see the next section). Then the computation time dramatically drops down to O(n^2) because no recurrences are needed.
Compute the numerical value of a given polynomial q _n(x) at given x, either exactly (at rational x) or approximately in floating point. This requires only O(n*M(n)) operations for exact computation and O(n*P) operations in P-digit floating point.
Compute a series of orthogonal polynomials with given coefficients f[n], i.e. f(x):=Sum(n,0,N,f[n]*q _n(x)), at a given x. This task does not actually require computing the polynomials first, if we use the so-called Clenshaw-Smith procedure which gives the value of f(x) directly in n iterations. (See the next section for more explanations.) The number of operations is again O(n*P) because only "short" multiplications and divisions are needed.

In the next section we shall give some formulae that allow to calculate particular polynomials more efficiently.

Classical orthogonal polynomials: special cases

The fastest algorithm available is for Tschebyscheff polynomials OrthoT(n,x), OrthoU(n,x). The following recurrence relations can be used:

OrthoT(2*n,x)=2*OrthoT(n,x)^2-1,

OrthoT(2*n+1,x)=2*OrthoT(n+1,x)*OrthoT(n,x)-x,

OrthoU(2*n,x)=2*OrthoT(n,x)*OrthoU(n,x)-1,

OrthoU(2*n+1,x)=2*OrthoT(n+1,x)*OrthoU(n,x).

This allows to compute OrthoT(n,x) and OrthoU(n,x) in time logarithmic in n.

There is a way to implement this method without recursion. The idea is to build the sequence of numbers n[1], n[2], ... that are needed to compute OrthoT(n,x).

For example, to compute OrthoT(19,x) using the second recurrence relation, we need OrthoT(10,x) and OrthoT(9,x). We can write this chain symbolically as 19<>c(9,10). For OrthoT(10,x) we need only OrthoT(5,x). This we can write as 10<>c(5). Similarly we find: 9<>c(4,5). Therefore, we can find both OrthoT(9,x) and OrthoT(10,x) if we know OrthoT(4,x) and OrthoT(5,x). Eventually we find the following chain of pairs:

19<>c(9,10)<>c(4,5)<>c(2,3)<>c(1,2)<>c(1).

Therefore, we find that OrthoT(19,x) requires to compute OrthoT(k,x) sequentially for all k that appear in this sequence (1,2,3,4,5,9,10).

There are about 2*Ln(n)/Ln(2) elements in the chain that leads to the number n. We can generate this chain in a straightforward way by examining the bits in the binary representation of n. Therefore, we find that this method requires no storage and time logarithmic in n. A recursive routine would also take logarithmic time and also require logarithmic storage space.

Note that using these recurrence relations we do not obtain any individual coefficients of the Tschebyscheff polynomials. This method does not seem very useful for symbolic calculations (with symbolic x), because the resulting expressions are rather complicated combinations of nested products. It is difficult to expand such an expression into powers of x or manipulate it in any other way, except compute a numerical value. However, these fast recurrences are numerically unstable, so numerical values need to be evaluated with extended working precision. Currently this method is not used in Yacas, despite its speed.

Coefficients for Legendre, Hermite, Laguerre, Chebyshev polynomials can be obtained by explicit formulae. This is faster than using recurrences if we need the entire polynomial symbolically, but still slower than the recurrences for numerical calculations.

Tschebyscheff polynomials of the first and of the second kind: If k=Floor(n/2), then
OrthoT(n,x)=Sum(i,0,k,(-1)^i*(2*x)^(n-2*i)/(n-i)*Bin(n-i,i)),

OrthoU(n,x)=Sum(i,0,k,(-1)^i*(2*x)^(n-2*i)*Bin(n-i,i)).
Here it is assumed that n>0 (the case n=0 must be done separately). The summation is over integer values of i such that 0<=2*i<=n, regardless of whether n is even or odd.

Hermite polynomials: For even n=2*k where k>=0,
OrthoH(n,x)=(-2)^k*(n-1)!! *Sum(i,0,k,(x^(2*i)*(-4)^i*k!)/((2*i)! *(k-i)!)),
and for odd n=2*k+1 where k>=0,
OrthoH(n,x)=2*(-2)^k*n!! *Sum(i,0,k,(x^(2*i+1)*(-4)^i*k!)/((2*i+1)! *(k-i)!)).

Legendre polynomials: If k=Floor(n/2), then
OrthoP(n,x)=2^(-n)*Sum(i,0,k,(-1)^i*x^(n-2*i)*Bin(n,i)*Bin(2*n-2*i,n)).
The summation is over integer values of i such that 0<=2*i<=n, regardless of whether n is even or odd.

Laguerre polynomials:
OrthoL(n,a,x)=Sum(i,0,n,(-x)^i/i! *Bin(n+a,n-i)).
Here the binomial coefficient must be defined for non-integer a through the Gamma function instead of factorials, which gives
Bin(n+a,n-i)=((n+a)*...*(i+1+a))/(n-i)!.
The result is a rational function of a.

In all formulae for the coefficients, there is no need to compute factorials every time: the next coefficient can be obtained from the previous one by a few short multiplications and divisions. Therefore this computation costs O(n^2) short operations.

Series of orthogonal polynomials

If we need to compute a series of orthogonal polynomials with given coefficients f[n], i.e.

f(x):=Sum(n,0,N,f[n]*q _n(x))

at a given x, we do not need to compute the orthogonal polynomials separately. The Clenshaw-Smith recurrence procedure allows to compute the value of the sum directly.

Suppose a family of functions q _n(x), n=0, 1, ... satisfies known recurrence relations of the form

q[n]=A _n(x)*q[n-1]+B _n(x)*q[n-2],

where A _n(x) and B _n(x) are some known functions and q _0(x) and q _1(x) are known.

The procedure goes as follows.

This is from the book by Yudell L. Luke, Mathematical functions and their approximations, Academic Press, N. Y., 1975.

First, for convenience, we define q[-1]:=0 and the coefficient A _1(x) so that q[1]=A[1]*q[0]. This allows us to use the above recurrence relation formally also at n=1. Then, we take the array of coefficients f[n] and define a backward recurrence relation

X[N+1]=X[N+2]=0,

X[n]=A[n+1]*X[n+1]+B[n+2]*X[n+2]+f[n],

where n=N, N-1, ..., 0. (Note that here we have used the artificially defined coefficient A[1].) Magically, the value we are looking for is given by

f(x)=X[0]*q[0].

This happens because we can express

f[n]=X[n]-A[n+1]*X[n+1]-B[n+2]*X[n+2],

for n=0, 1, ..., N, and regroup terms in the sum

f(x):=Sum(n,0,N,f[n]*q _n(x))

to collect X[n] and obtain

f(x)=Sum(n,0,N,X[n]*q[n])-Sum(n,1,N,X[n]*A[n]*q[n-1])-Sum(n,2,N,X[n]*B[n]*q[n-2]),

and finally

f(x)=X[0]*q[0]+X[1]*(q[1]-A[1]*q[1])

+Sum(n,2,N,X[n]*(q[n]-A[n]*q[n-1]-B[n]*q[n-2]))=X[0]*q[0].

Luke's book warns that the recurrence relation for X[n] is not always numerically stable.

Note that in the book there seems to be some confusion as to how the coefficient A[1] is defined. (It is not defined explicitly there.) Our final formula differs from the book for this reason.

The Clenshaw-Smith procedure is analogous to the Horner scheme of calculating polynomials. This procedure can also be generalized for linear recurrence relations having more than two terms. The functions q _0(x), q _1(x), A _n(x), and B _n(x) do not actually have to be polynomials for this to work.