
Numerical Analysis in MATLAB

Radu T. Trîmbiţaş
Preface

This work is an introductory course for undergraduate students. It is intended to achieve a balance between theory, algorithms, and practical implementation aspects.
Lloyd N. Trefethen [95] proposed the following definition of Numerical Analysis:

    Numerical Analysis is the study of algorithms for the problems of continuous mathematics.

The keyword here is algorithms: the main purpose of Numerical Analysis is devising and analyzing numerical algorithms to solve a certain class of problems.
These are the problems of continuous mathematics. “Continuous” means here that input data are real or complex variables; its opposite is discrete. In short, one could say that Numerical Analysis is continuous algorithmics, as opposed to classical algorithmics, that is, discrete algorithmics.
Since real and complex numbers cannot be represented exactly on computers, they must be approximated using a finite representation (floating-point representation and arithmetic). Moreover, most problems of continuous mathematics cannot be solved by so-called finite algorithms, even if we assume infinite precision arithmetic. A first example is the solution of nonlinear algebraic equations. This becomes even clearer in eigenvalue and eigenvector problems. The same conclusion extends to virtually any problem with a nonlinear term or a derivative in it – zero finding, numerical quadrature, differential equations, integral equations, optimization, and so on. Thus we introduce various types of errors; their study is an important task of numerical analysis. Chapter 3, Errors and Floating Point Arithmetic, is devoted to this topic.
As we mentioned previously, Numerical Analysis means finding and analyzing algorithms for constructive mathematical problems. Rapid convergence of approximations is the aim, and the pride of our field is that, for many problems, we have invented algorithms that converge exceedingly fast. The development of computer algebra software like Maple or Mathematica diminished the importance of rounding errors without diminishing the importance of the algorithms' speed of convergence. Chapters 4 to 10 study various classes of numerical algorithms, as follows:

• Chapter 4 – Numerical Solution of Linear Algebraic Systems;


• Chapter 5 – Function Approximation;
• Chapter 6 – Linear Functional Approximation;
• Chapter 7 – Numerical Solution of Nonlinear Equations;
• Chapter 8 – Eigenvalues and Eigenvectors;
• Chapter 9 – Numerical Solution of Ordinary Differential Equations;
• Chapter 10 – Multivariate Approximation.

There are other important aspects: that numerical algorithms are implemented on computers, whose architecture may be an important part of the problem; that reliability and efficiency are paramount goals; and, most important, that all this work is applied, applied daily and successfully to thousands of applications on millions of computers around the world. Without numerical methods, science and engineering as practiced today would cease to exist. Numerical analysts proudly consider themselves the heirs to the great traditions of Newton, Euler, Lagrange, Gauss, and other great mathematicians.
Also, in [94], Trefethen wrote:

A computational study is unlikely to lead to real scientific progress unless the


software environment is convenient enough to encourage one to vary parameters,
modify the problem, play around.

In this context, MATLAB is an excellent choice as software support. It allows us to focus on the essence of numerical algorithms, to prototype them rapidly, to obtain compact code and high accuracy, and to vary the problems and their parameters. The first two chapters are an introduction to MATLAB.
We give MATLAB implementations for the algorithms in the book and have tested them carefully. Each chapter, except the first, contains fully implemented applications of the notions presented within it. The sources in this book and solutions to problems can be downloaded by following the links from the author's web page: http://www.math.ubbcluj.ro/~tradu.

Radu Tiberiu Trîmbiţaş


Cluj-Napoca, July 2008
Contents

1 Introduction to MATLAB 1
1.1 Starting MATLAB and MATLAB Help . . . . . . . . . . . . . . . . . . . . 2
1.2 Calculator Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Matrix generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Indexing and colon operator . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Matrix and array operations . . . . . . . . . . . . . . . . . . . . . . 13
1.3.4 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.5 Relational and Logical Operators . . . . . . . . . . . . . . . . . . . 18
1.3.6 Sparse matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 MATLAB Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.1 Control flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5 M files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.1 Scripts and functions . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.2 Subfunctions, nested and anonymous functions . . . . . . . . . . . . 28
1.5.3 Passing a Function as an Argument . . . . . . . . . . . . . . . . . . 30
1.5.4 Advanced data structures . . . . . . . . . . . . . . . . . . . . . . . . 32
1.5.5 Variable Number of Arguments . . . . . . . . . . . . . . . . . . . . 37
1.5.6 Global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.5.7 Recursive functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.5.8 Error control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.6 Symbolic Math Toolboxes . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2 MATLAB Graphics 51
2.1 Two-Dimensional Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.1.1 Basic Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


2.1.2 Axes and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . 57


2.1.3 Multiple plots in a figure . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2 Three-Dimensional Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3 Handles and properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.4 Saving and Exporting Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.5 Application - Snail and Shell Surfaces . . . . . . . . . . . . . . . . . . . . . 70
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3 Errors and Floating Point Arithmetic 73


3.1 Numerical Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.2 Error Measuring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3 Propagated error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Floating-Point Representation . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.2 Cancelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5 IEEE Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5.1 Special Quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6 MATLAB Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . 81
3.7 The Condition of a Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8 The Condition of an algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.9 Overall error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.10 Ill-Conditioned Problems and Ill-Posed Problems . . . . . . . . . . . . . . . 89
3.11 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.11.1 Asymptotical notations . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.11.2 Accuracy and stability . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.11.3 Backward Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . 93
3.12 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.12.1 Inverting the hyperbolic cosine . . . . . . . . . . . . . . . . . . . . . 93
3.12.2 Conditioning of a root of a polynomial equation . . . . . . . . . . . . 95
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4 Numerical Solution of Linear Algebraic Systems 101


4.1 Notions of Matrix Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Condition of a linear system . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4 Factorization based methods . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.1 LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.2 LUP decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.4.3 Cholesky factorization . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.4.4 QR decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 Strassen’s algorithm for matrix multiplication . . . . . . . . . . . . . . . . . 124
4.6 Solution of Algebraic Linear Systems in MATLAB . . . . . . . . . . . . . . 126
4.6.1 Square systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.6.2 Overdetermined systems . . . . . . . . . . . . . . . . . . . . . . . . 127
4.6.3 Underdetermined systems . . . . . . . . . . . . . . . . . . . . . . . 128

4.6.4 LU and Cholesky factorizations . . . . . . . . . . . . . . . . . . . . 129


4.6.5 QR factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.6.6 The linsolve function . . . . . . . . . . . . . . . . . . . . . . . . 133
4.7 Iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.8 Iterative solution of Linear Algebraic Systems . . . . . . . . . . . . . . . . . 135
4.9 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.9.1 The finite difference method for the linear two-point boundary value problem . . 141
4.9.2 Computing a plane truss . . . . . . . . . . . . . . . . . . . . . . . . 145
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

5 Function Approximation 151


5.1 Least Squares approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.1.1 Inner products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.1.2 The normal equations . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.1.3 Least squares error; convergence . . . . . . . . . . . . . . . . . . . . 158
5.2 Examples of orthogonal systems . . . . . . . . . . . . . . . . . . . . . . . . 160
5.3 Examples of orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . 163
5.3.1 Legendre polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.3.2 First kind Chebyshev polynomials . . . . . . . . . . . . . . . . . . . 167
5.3.3 Second kind Chebyshev polynomials . . . . . . . . . . . . . . . . . 173
5.3.4 Laguerre polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.3.5 Hermite polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.3.6 Jacobi polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.3.7 A MATLAB example . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.4 Polynomials and data fitting in MATLAB . . . . . . . . . . . . . . . . . . . 177
5.4.1 An application — Census Data . . . . . . . . . . . . . . . . . . . . . 183
5.5 The Space H^n[a, b] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.6 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.6.1 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.6.2 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.6.3 Interpolation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.7 Efficient Computation of Interpolation Polynomials . . . . . . . . . . . . . . 199
5.7.1 Aitken-type methods . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.7.2 Divided difference method . . . . . . . . . . . . . . . . . . . . . . . 201
5.7.3 Barycentric Lagrange Interpolation . . . . . . . . . . . . . . . . . . 205
5.7.4 Multiple nodes divided differences . . . . . . . . . . . . . . . . . . . 210
5.8 Convergence of polynomial interpolation . . . . . . . . . . . . . . . . . . . 212
5.9 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.9.1 Interpolation by cubic splines . . . . . . . . . . . . . . . . . . . . . 217
5.9.2 Minimality properties of cubic spline interpolants . . . . . . . . . . . 222
5.10 Interpolation in MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5.11 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.11.1 Spline and sewing machines . . . . . . . . . . . . . . . . . . . . . . 227
5.11.2 A membrane deflection problem . . . . . . . . . . . . . . . . . . . . 229
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

6 Linear Functional Approximation 235


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.1.1 Method of interpolation . . . . . . . . . . . . . . . . . . . . . . . . 237
6.1.2 Method of undetermined coefficients . . . . . . . . . . . . . . . . . 238
6.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.3 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.3.1 The composite trapezoidal and Simpson’s rule . . . . . . . . . . . . 241
6.3.2 Weighted Newton-Cotes and Gauss formulae . . . . . . . . . . . . . 245
6.3.3 Properties of Gaussian quadrature rules . . . . . . . . . . . . . . . . 248
6.4 Adaptive Quadratures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.5 Iterated Quadratures. Romberg Method . . . . . . . . . . . . . . . . . . . . 255
6.6 Adaptive Quadratures II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.7 Numerical Integration in MATLAB . . . . . . . . . . . . . . . . . . . . . . 260
6.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.8.1 Computation of an ellipsoid surface . . . . . . . . . . . . . . . . . . 264
6.8.2 Computation of the wind action on a sailboat mast . . . . . . . . . . 266
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

7 Numerical Solution of Nonlinear Equations 271


7.1 Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.2 Iterations, Convergence, and Efficiency . . . . . . . . . . . . . . . . . . . . 271
7.3 Sturm Sequences Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.4 Method of False Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.5 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7.6 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.7 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
7.8 Newton’s Method for Multiple zeros . . . . . . . . . . . . . . . . . . . . . . 287
7.9 Algebraic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.10 Newton’s method for systems of nonlinear equations . . . . . . . . . . . . . 288
7.11 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.11.1 Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.11.2 Modification Method . . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.12 Nonlinear Equations in MATLAB . . . . . . . . . . . . . . . . . . . . . . . 295
7.13 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
7.13.1 Analysis of the state equation of a real gas . . . . . . . . . . . . . . . 298
7.13.2 Nonlinear heat transfer in a wire . . . . . . . . . . . . . . . . . . . . 300
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

8 Eigenvalues and Eigenvectors 305


8.1 Eigenvalues and Polynomial Roots . . . . . . . . . . . . . . . . . . . . . . . 305
8.2 Basic Terminology and Schur Decomposition . . . . . . . . . . . . . . . . . 306
8.3 Vector Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.4 QR Method – the Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.5 QR Method – the Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.5.1 Classical QR method . . . . . . . . . . . . . . . . . . . . . . . . . . 315

8.5.2 Spectral shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320


8.5.3 Double shift QR method . . . . . . . . . . . . . . . . . . . . . . . . 322
8.6 Eigenvalues and Eigenvectors in MATLAB . . . . . . . . . . . . . . . . . . 326
8.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.7.1 Solving mass-spring systems . . . . . . . . . . . . . . . . . . . . . . 330
8.7.2 Computing natural frequencies of a rectangular membrane . . . . . . 333
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

9 Numerical Solution of Ordinary Differential Equations 339


9.1 Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.2 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
9.3 Local Description of One-Step Methods . . . . . . . . . . . . . . . . . . . . 341
9.4 Examples of One-Step Methods . . . . . . . . . . . . . . . . . . . . . . . . 342
9.4.1 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.4.2 Method of Taylor expansion . . . . . . . . . . . . . . . . . . . . . . 344
9.4.3 Improved Euler methods . . . . . . . . . . . . . . . . . . . . . . . . 345
9.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
9.6 Global Description of One-Step Methods . . . . . . . . . . . . . . . . . . . 350
9.6.1 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
9.6.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
9.6.3 Asymptotics of global error . . . . . . . . . . . . . . . . . . . . . . 356
9.7 Error Monitoring and Step Control . . . . . . . . . . . . . . . . . . . . . . . 358
9.7.1 Estimation of global error . . . . . . . . . . . . . . . . . . . . . . . 358
9.7.2 Truncation error estimates . . . . . . . . . . . . . . . . . . . . . . . 360
9.7.3 Step control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
9.8 ODEs in MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9.8.1 Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9.8.2 Nonstiff examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.8.3 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
9.8.4 Stiff equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
9.8.5 Event handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
9.8.6 deval and odextend . . . . . . . . . . . . . . . . . . . . . . . . 391
9.8.7 Implicit equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
9.9 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
9.9.1 The restricted three-body problem . . . . . . . . . . . . . . . . . . . 392
9.9.2 Motion of a projectile . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

10 Multivariate Approximation 407


10.1 Interpolation in Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . 407
10.1.1 Interpolation problem . . . . . . . . . . . . . . . . . . . . . . . . . . 407
10.1.2 Cartesian product and grid . . . . . . . . . . . . . . . . . . . . . . . 407
10.1.3 Boolean sum and tensor product . . . . . . . . . . . . . . . . . . . . 409
10.1.4 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.1.5 A Newtonian scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 416

10.1.6 Shepard interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 417


10.1.7 Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
10.1.8 Moving least squares . . . . . . . . . . . . . . . . . . . . . . . . . . 424
10.1.9 Interpolation by radial basis functions . . . . . . . . . . . . . . . . . 426
10.2 Multivariate Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . 428
10.3 Multivariate Approximations in MATLAB . . . . . . . . . . . . . . . . . . . 433
10.3.1 Multivariate interpolation in MATLAB . . . . . . . . . . . . . . . . 433
10.3.2 Computing double integrals in MATLAB . . . . . . . . . . . . . . . 436
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439

Bibliography 441

Index 446
List of MATLAB sources

1.1 The first MATLAB script - playingcards . . . . . . . . . . . . . . . . . 26


1.2 The stat function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3 Function mysqrt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4 Function fd deriv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.5 Function companb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6 Function moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.7 Recursive gcd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.1 Epicycloid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.2 Snail/Shell surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1 Computation of eps - 1st variant . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2 Computation of eps - 2nd variant . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3 Conditioning of the roots of an algebraic equation . . . . . . . . . . . . . . . 96
4.1 Solve the system Ax = b by Gaussian elimination with scaled column pivoting . . 115
4.2 LUP Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.3 Forward substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.4 Back substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.5 Solution of a linear system by LUP decomposition . . . . . . . . . . . . . . 119
4.6 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.7 QR decomposition using Householder reflections . . . . . . . . . . . . . . . 123
4.8 Solution of Ax = b using QR method . . . . . . . . . . . . . . . . . . . . . 124
4.9 Strassen’s algorithm for matrix multiplication . . . . . . . . . . . . . . . . . 126
4.10 Jacobi method for linear systems . . . . . . . . . . . . . . . . . . . . . . . 138
4.11 Successive Overrelaxation method (SOR) . . . . . . . . . . . . . . . . . . . 140
4.12 Finding optimal value of relaxation parameter . . . . . . . . . . . . . . . . . 141
4.13 Two point boundary value problem – finite difference method . . . . . . . . . 143
4.14 Two point boundary value problem – finite difference method and symmetric matrix . . 144
5.1 Compute Legendre polynomials using recurrence relation . . . . . . . . . . . 165
5.2 Compute Legendre coefficients . . . . . . . . . . . . . . . . . . . . . . . . . 166


5.3 Least square approximation via Legendre polynomials . . . . . . . . . . . . 166


5.4 Least squares continuous approximation with Chebyshev # 1 polynomials . . . 171
5.5 Compute Chebyshev # 1 polynomials by mean of recurrence relation . . . . . 172
5.6 Least squares approximation with Chebyshev # 1 polynomials – continuation: computing Fourier coefficients
5.7 Discrete Chebyshev least squares approximation . . . . . . . . . . . . . . . . 172
5.8 The coefficients of discrete Chebyshev least squares approximation . . . . . . 173
5.9 Test for least squares approximations . . . . . . . . . . . . . . . . . . . . . . 176
5.10 An example of least squares approximation . . . . . . . . . . . . . . . . . . 185
5.11 Lagrange Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.12 Find basic Lagrange interpolation polynomials using MATLAB facilities. . . 190
5.13 Generate divided difference table . . . . . . . . . . . . . . . . . . . . . . . . 203
5.14 Compute Newton’s form of Lagrange interpolation polynomial . . . . . . . . 204
5.15 Barycentric Lagrange interpolation . . . . . . . . . . . . . . . . . . . . . . . 208
5.16 Barycentric weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
5.17 Barycentric Lagrange interpolation with Chebyshev nodes of first kind . . . . 209
5.18 Barycentric Lagrange interpolation with Chebyshev nodes of second kind . . 209
5.19 Generates a divided difference table with double nodes . . . . . . . . . . . . 213
5.20 Runge’s Counterexample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.1 Composite trapezoidal rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
6.2 Composite Simpson formula . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.3 Compute nodes and coefficients of a Gaussian quadrature rule . . . . . . . . 250
6.4 Approximation of an integral using a Gaussian formula . . . . . . . . . . . . 251
6.5 Generate a Gauss-Legendre formula . . . . . . . . . . . . . . . . . . . . . . 251
6.6 Generate a first kind Gauss-Chebyshev formula . . . . . . . . . . . . . . . . 251
6.7 Generate a second kind Gauss-Chebyshev formula . . . . . . . . . . . . . . 251
6.8 Generate a Gauss-Hermite formula . . . . . . . . . . . . . . . . . . . . . . . 252
6.9 Generate a Gauss-Laguerre formula . . . . . . . . . . . . . . . . . . . . . . 252
6.10 Generate a Gauss-Jacobi formula . . . . . . . . . . . . . . . . . . . . . . . . 252
6.11 Adaptive quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.12 Romberg method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.13 Adaptive quadrature, variant . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.14 Computation of the wind action on a sailboat mast . . . . . . . . . . . . . . . 267
7.1 False position method for nonlinear equation in R . . . . . . . . . . . . . . . 276
7.2 Secant method for nonlinear equations in R . . . . . . . . . . . . . . . . . . 281
7.3 Newton method for nonlinear equations in R . . . . . . . . . . . . . . . . . . 284
7.4 Newton method in R and Rn . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.5 Broyden method for nonlinear systems . . . . . . . . . . . . . . . . . . . . . 294
8.1 The RQ transform of a Hessenberg matrix . . . . . . . . . . . . . . . . . . . 316
8.2 Reduction to upper Hessenberg form . . . . . . . . . . . . . . . . . . . . . . 318
8.3 Pure (simple) QR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
8.4 QRSplit1a – QR method with partition and treatment of 2 × 2 matrices . . 320
8.5 Compute eigenvalues of a 2 × 2 matrix . . . . . . . . . . . . . . . . . . . . . 321
8.6 QR iterations on a Hessenberg matrix . . . . . . . . . . . . . . . . . . . . . 321
8.7 Spectral shift QR method, partition and treatment of complex eigenvalues . . 323
8.8 QR iteration and partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324

8.9 Double shift QR method with partition and treating 2 × 2 matrices . . . . . . 325
8.10 Double shift QR iterations and Hessenberg transformation . . . . . . . . . . 326
9.1 Classical 4th order Runge-Kutta method . . . . . . . . . . . . . . . . . . . . 349
9.2 Implementation of a Runge-Kutta method with constant step, given the Butcher table . . 351
9.3 Initialize Butcher table for RK4 . . . . . . . . . . . . . . . . . . . . . . . . 351
9.4 Rössler system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.5 Stiff problem with information about Jacobian . . . . . . . . . . . . . . . . . 386
9.6 Two body problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
10.1 Bivariate tensor product Lagrange interpolant . . . . . . . . . . . . . . . . . 412
10.2 Bivariate Boolean sum Lagrange interpolant . . . . . . . . . . . . . . . . . . 412
10.3 Shepard interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.4 Local Shepard interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.5 Interpolation with radial basis function . . . . . . . . . . . . . . . . . . . . . 428
10.6 Double integral approximation on a rectangle . . . . . . . . . . . . . . . . . 434
CHAPTER 1

Introduction to MATLAB

MATLAB¹ is a high-performance interactive software system for technical computing and


data visualization. It integrates programming, computation, and visualization in an easy-
to-use environment where problems and solutions are expressed in familiar mathematical
notation.
The name MATLAB stands for MATrix LABoratory. MATLAB was originally written
by Cleve Moler to provide easy access to matrix software developed by the LINPACK and
EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries,
embedding the state of the art in software for matrix computation.
In many university environments, MATLAB has become the standard instructional tool for
introductory and advanced courses in mathematics, engineering, and science. In industry
and research establishments, MATLAB is the tool of choice for high-productivity research,
development, and analysis.
MATLAB has several advantages over classical programming languages:

• It allows quick and easy coding in a very high-level language.

• Its interactive interface allows rapid experimentation and easy debugging.

• It is a versatile tool for modeling, simulation, and prototyping and also for analysis,
exploration, and visualization of data.

• High-quality graphics and visualization facilities are available. They allow you to plot
the results of computations with a few statements.
¹ MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.


• Application development, including implementation of GUIs (Graphical User Interfaces), can be done easily.

• MATLAB M-files are completely portable across a wide range of platforms.

• A wide range of user-contributed M-files is freely available on the Internet.

The basic data structure of the MATLAB language is an array that does not require dimensioning. This allows flexibility and ease in the solution of problems, especially those with matrix and vector formulations. MATLAB's capabilities save time in developing, testing, and prototyping algorithms. Moreover, the language has sophisticated data structures and object-oriented facilities, allowing development of large, complex applications. Being an interpreted language, MATLAB inevitably suffers some loss of efficiency compared with compiled languages. This can be partially remedied by using the MATLAB Compiler or by linking to compiled Fortran or C code using MEX files.
Besides the language and the development environment, MATLAB provides a set of add-
on application-specific solutions called toolboxes. Toolboxes allow the users to learn and
apply specialized technology. They are collections of MATLAB functions (M-files) that ex-
tend the MATLAB environment to solve particular classes of problems. Areas in which
toolboxes are available include signal processing, control systems, neural networks, fuzzy
logic, wavelets, simulation, statistics and many others.
For a gentle and detailed introduction see [58, 57, 44]. An excellent source for both
MATLAB and numerical computing is Moler’s book “Numerical Computing in MATLAB”
A very good short-to-medium-length introduction to MATLAB is [27].

1.1 Starting MATLAB and MATLAB Help


To start the MATLAB program on a Microsoft Windows platform, select the MATLAB entry from the Start menu, or double-click the shortcut icon for MATLAB on your Windows
desktop. To start the MATLAB program on UNIX platforms, type matlab at the operating
system prompt. When you start MATLAB you get a multipaneled desktop (Figure 1.1). The
layout and the behavior of the desktop and its components are highly customizable. The main
component is the Command Window. Here you can give MATLAB commands typed at the
prompt, >>.
At the top left you can see the Current Directory. In general MATLAB is aware only of
files in the current directory (folder) and on its path, which can be customized. Commands for working with the directory and path include cd, what, addpath, and editpath (or you can choose “File/Set Path...” from the menus).
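For instance (the folder names below are arbitrary illustrations, not part of a standard installation):

>> cd('c:\work')             % change the current directory
>> addpath('c:\work\tools')  % append a folder to the search path
>> what                      % list the MATLAB files in the current directory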
At top right is the Workspace window. The workspace shows you what variable names
are currently defined and some information about their contents. (At start-up it is, naturally,
empty.) This represents another break from compiled environments: variables created in the
workspace persist for you to examine and modify, even after code execution stops. Another
component is the Command History window. As you enter commands, they are recorded
here. This record persists across different MATLAB sessions, and commands or blocks of
commands can be copied from here or saved to files.
Figure 1.1: MATLAB desktop (panels: Command Window, Command History Window, Workspace Window, Current Directory)

MATLAB is a huge software system; nobody can know everything about it. To have success, it
is essential that you become familiar with the online help. There are two levels of help:

• If you need quick help on the syntax of a command, use help command. For example,
help plot shows right in the Command Window all the ways in which you can
use the plot command. Typing help by itself gives you a list of categories that
themselves yield lists of commands.

• Typing doc followed by a command name brings up more extensive help in a separate
window. The MATLAB help browser is a fully functional web browser. For example, doc
plot is better formatted and more informative than help plot. In the left panel
one sees a hierarchical, browsable display of all the online documentation. Typing
doc alone or selecting Help from the menu brings up the window at a “root” page.

A lot of useful documentation is available in printable form, by following the links from the
MathWorks Inc. web page (www.mathworks.com).

1.2 Calculator Mode


If you type in a valid expression and press Enter, MATLAB will immediately execute it and
return the result, just like a calculator.

>> 3+2
ans =
5

>> 3ˆ2
ans =
9
>> sin(pi/2)
ans =
1
The basic arithmetic operations are + - * / and ˆ (power). You can alter the default
precedence with parentheses.
MATLAB recognizes several number types:
• Integers, like 1362 or -217897;
• Reals, for example, 1.234, −10.76;

• Complex numbers, like 3.21 − 4.3i, where i = √−1;
• Inf, denotes infinity;
• NaN, Not a Number, result of an operation that is not mathematically defined or is
illegal, like 0/0, ∞/∞, ∞ − ∞, etc. Once generated, a NaN propagates through all
subsequent computations.
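For instance, one can produce these special values directly at the prompt:

>> 1/0
ans =
   Inf
>> 0/0
ans =
   NaN
>> Inf - Inf
ans =
   NaN
>> NaN + 1   % once generated, NaN propagates
ans =
   NaN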
The scientific notation is also used:

−1.3412e+03 = −1.3412 × 10³ = −1341.2
−1.3412e−01 = −1.3412 × 10⁻¹ = −0.13412

MATLAB carries out all its arithmetic computations in double precision floating point arith-
metic, conforming to the IEEE standard. The format command controls the way in which
numbers are displayed. Type help format for a complete list. The next table gives some
examples.

Command Output examples


format short 31.4162
format short e 3.1416e+01
format long e 3.141592653589793e+000
format short g 31.4162
format bank 31.42

The format compact command suppresses blank lines, allowing more information to be displayed. All MATLAB session examples in this work were generated using this command.
MATLAB variable names are case sensitive and consist of a letter followed by any combination of letters, digits, and underscores. Examples: x, y, z525, GeneralTotal. Variables must have values before they can be used. There are some special names:
- eps = 2.2204e-16 = 2^(-52) is the machine epsilon (see Chapter 3); it is the distance from 1.0 to the next larger floating point number, i.e., the smallest number with the property that 1+eps > 1 in floating point arithmetic;

- pi = π.
If we perform computations with complex numbers, the usage of variable names i and j is
not recommended, since they denote the imaginary unit. It is safer to use 1i or 1j instead.
We give some examples:
>> 5+2i-1+i
ans =
4.0000 + 3.0000i
>>eps
ans =
2.2204e-016
The special variable ans stores the value of the last evaluated expression. It can be used in expressions, like any other variable.
>>3-2ˆ4
ans =
-13
>>ans*5
ans =
-65
To suppress the display of the last evaluated expression, end it with a semicolon (;). Several expressions may be typed on a single line, separated by commas (in which case each value is displayed) or by semicolons (in which case the values are not displayed).
>> x=-13; y = 5*x, z = xˆ2+y, z2 = xˆ2-y;
y =
-65
z =
104

MATLAB has a rich set of elementary and mathematical functions. Type help elfun
and help matfun to obtain the full list. A selection is listed in Table 1.1.
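A few illustrative calls (shown here in the default short format):

>> exp(1)
ans =
    2.7183
>> atan2(1,1)   % pi/4
ans =
    0.7854
>> gcd(24,36)
ans =
    12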
Once variables have been defined, they exist in the workspace. You can see the list of all variables used in the current session by typing whos:
>>whos
Name Size Bytes Class
ans 1x1 8 double array
i 1x1 8 double array
v 1x3 24 double array
x 1x1 8 double array
y 1x1 8 double array
z 1x1 8 double array
z2 1x1 8 double array
Grand total is 7 elements using 72 bytes
An existing variable var can be removed from the workspace by typing clear var,
while clear clears all existing variables.
To print the value of a variable or expression without the name of the variable or ans
being displayed, you can use disp:

cos, sin, tan, csc, sec, cot Trigonometric


acos, asin, atan, atan2, asec, acsc, Inverse trigonometric
acot
cosh, sinh, tanh, sech, csch, coth Hyperbolic
acosh, asinh, atanh, asech, acsch, Inverse hyperbolic
acoth
log, log2, log10, exp, pow2, nextpow2 Exponential
ceil, fix, floor, round Rounding
abs, angle, conj, imag, real Complex
mod, rem, sign Remainder, sign
airy, bessel*, beta*, erf*, expint, Mathematical
gamma*, legendre
factor, gcd, isprime, lcm, primes, Number theoretic
nchoosek, perms, rat, rats
cart2sph, cart2pol, pol2cart, sph2cart Coordinate transforms

Table 1.1: Elementary and special mathematical functions (“fun*” indicates that more than
one function name begins “fun”).

>> X=ones(2); disp(X)


1 1
1 1

If we want to save our work, we can do it with


>>save file-name variable-list
where the variables in variable-list are separated by spaces. We can use wildcards, like a*, in variable-list. This saves the variables in variable-list to file-name.mat. If variable-list is missing, all variables in the workspace are saved to file-name.mat.
A .mat file is a binary file format specific to MATLAB. The command load
>>load file-name
loads variables from a .mat file. Saving and loading can also be done in ASCII format, in ASCII with double precision, or by appending to an existing file. For details, type help save and help load (or doc save and doc load).
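For instance, assuming two variables x and y already exist in the workspace (the file name session1 is arbitrary):

>> x = 1:5; y = x.^2;
>> save session1 x y   % writes x and y to session1.mat
>> clear               % the workspace is now empty
>> load session1       % x and y are restored from session1.mat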
Often you need to capture MATLAB output for incorporation into a report. The command
>>diary file-name
copies all commands and their results (except graphs) to the file file-name; diary off ends the diary process. You can later type diary on to cause subsequent output to be appended to the same diary file.

1.3 Matrices
MATLAB is an interactive system whose basic data element is an array that does not require
dimensioning. The matrix is the fundamental data type in MATLAB.

An array is a collection of elements or entries, referenced by one or more indices running


over different index sets. In MATLAB, the index sets are always sequential integers starting
with 1. The dimension of the array is the number of indices needed to specify an element.
The size of an array is a list of the sizes of the index sets. A matrix is a two-dimensional
array with special rules for specific operations. With few exceptions, its entries are double-
precision floating point numbers. A vector is a matrix for which one dimension has only the
index 1. A row vector has only one row and a column vector has only one column.
The terms matrix and array are often used interchangeably.

1.3.1 Matrix generation


The simplest way to construct a matrix is by enclosing its elements in square brackets. Elements are separated by spaces or commas, and rows by semicolons or new lines.
>> A = [5 7 9
1 -3 -7]
A =
5 7 9
1 -3 -7
>> B = [-1 2 5; 9 0 5]
B =
-1 2 5
9 0 5
>> C = [0, 1; 3, -2; 4, 2]
C =
0 1
3 -2
4 2
>> x=[7, 1, 3, 2]
x =
7 1 3 2
>>

Information about size and dimension is stored with the array. For this reason, array sizes are
not usually passed explicitly to functions as they are in C, Pascal or FORTRAN. You can use
size to obtain information about the size of a matrix. Other useful commands are length
that returns the size of longest dimension and ndims that gives the number of dimensions.
>> v = size(A)
v =
2 3
>> [r, c] = size(A)
r =
2
c =
3
>> ndims(A)
ans =
2

zeros Zeros array


ones Ones array
eye Identity matrix
repmat Replicate and tile array
rand Uniformly distributed random numbers
randn Normally distributed random numbers
linspace Vector of equally spaced elements
logspace Logarithmically spaced vector

Table 1.2: Matrix generation functions

>> length(C), length(x)


ans =
3
ans =
4
>> ndims(x)
ans =
2

One special array is the empty matrix entered as [].


Bracket constructions are not suitable for larger matrices. Many elementary matrices can
be generated with MATLAB functions; see Table 1.2. Examples:
>> zeros(2) %like zeros(2,2)
ans =
0 0
0 0
>> ones(2,3)
ans =
1 1 1
1 1 1
>> eye(3,2)
ans =
1 0
0 1
0 0

If we wish to generate a zeros, ones, or identity matrix having the same size as a given matrix A, we can use the size function as the argument, as in eye(size(A)).
The rand and randn functions generate matrices of (pseudo)random numbers from the uniform distribution on [0,1] and the standard normal distribution, respectively. Examples:
>> rand
ans =
0.4057
>> rand(3)
ans =
0.9355 0.8936 0.8132

0.9169 0.0579 0.0099


0.4103 0.3529 0.1389

In simulations or experiments with random numbers it is important to be able to reproduce the random numbers on a subsequent occasion. The generated random numbers depend on the state of the generator. The method and the state can be set with rand(method, s). This causes rand to use the generator determined by method, and initializes the state of that generator using the value of s. We give some examples.
Set rand to its default initial state:
rand(’twister’, 5489);

Initialize rand to a different state each time:


rand(’twister’, sum(100*clock));

Save the current state, generate 100 values, reset the state, and repeat the sequence:
s = rand(’twister’);
u1 = rand(100);
rand(’twister’,s);
u2 = rand(100); % contains exactly the same values as u1

Matrices can be built out of other arrays (in block form) as long as their sizes are compat-
ible. With B, defined by B=[1 2; 3 4], we may create
>> C=[B, zeros(2); ones(2), eye(2)]
C =
1 2 0 0
3 4 0 0
1 1 1 0
1 1 0 1

An example with incompatible sizes:


>> [A;B]
??? Error using ==> vertcat
CAT arguments dimensions are not consistent.

Several commands are available for manipulating matrices; see Table 1.3.
The diag, tril, and triu functions consider the diagonals of the matrix they act on to be numbered as follows: the main diagonal has number 0, the kth diagonal above the main diagonal has number k, and the kth diagonal below the main diagonal has number −k, k = 0, . . . , n. If A is a matrix, diag(A,k) extracts the diagonal numbered k into a vector. If A is a vector, diag(A,k) constructs a matrix whose kth diagonal consists of the elements of A.
>> diag([1,2],1)
ans =
0 1 0
0 0 2
0 0 0
>> diag([3 4],-2)

reshape Change size


diag Diagonal matrices and diagonals of matrix
blkdiag Block diagonal matrix
tril Extract lower triangular part
triu Extract upper triangular part
fliplr Flip matrix in left/right direction
flipud Flip matrix in up/down direction
rot90 Rotate matrix 90 degrees

Table 1.3: Matrix manipulation functions

ans =
0 0 0 0
0 0 0 0
3 0 0 0
0 4 0 0

If
A =
2 3 5
7 11 13
17 19 23

then
>> diag(A)
ans =
2
11
23
>> diag(A,-1)
ans =
7
19

The commands tril(A,k) and triu(A,k) extract the lower and the upper triangular part, respectively, of their argument, starting from diagonal k. For the A given above,
>>tril(A)
ans =
2 0 0
7 11 0
17 19 23
>>triu(A,1)
ans =
0 3 5
0 0 13
0 0 0

compan Companion matrix


gallery Large collection of test matrices
hadamard Hadamard matrix
hankel Hankel matrix
hilb Hilbert matrix
invhilb Inverse Hilbert matrix
magic Magic square
pascal Pascal matrix (binomial coefficients)
rosser Classic symmetric eigenvalue test problem
toeplitz Toeplitz Matrix
vander Vandermonde matrix
wilkinson Wilkinson’s eigenvalue test matrix

Table 1.4: Special matrices

>>triu(A,-1)
ans =
2 3 5
7 11 13
0 19 23

MATLAB provides a number of special matrices; see Table 1.4. These matrices have
interesting properties that make them useful for constructing examples and for testing al-
gorithms. We shall give examples with hilb and vander functions in Section 4.2. The
gallery function provides access to a large collection of test matrices created by Nicholas
J. Higham [46]. For details, see help gallery.
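For instance, two of these generators:

>> hilb(3)
ans =
    1.0000    0.5000    0.3333
    0.5000    0.3333    0.2500
    0.3333    0.2500    0.2000
>> magic(3)
ans =
     8     1     6
     3     5     7
     4     9     2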

1.3.2 Indexing and colon operator


To enable access and assignment to submatrices MATLAB has a powerful notation based
on the colon character, “:”. This is an important array constructor. The format is first:
step:last, and the result is a row vector (which may be empty). Examples:
>> 1:5
ans =
1 2 3 4 5
>> 4:-1:-2
ans =
4 3 2 1 0 -1 -2
>> 0:.75:3
ans =
0 0.7500 1.5000 2.2500 3.0000

Single elements of a matrix are accessed as A(i,j), where i ≥ 1 and j ≥ 1 (zero or nega-
tive subscripts are not supported in MATLAB). The submatrix comprising the intersection of
rows p to q and columns r to s is denoted by A(p:q,r:s). As a special case, a lone colon
as the row or column specifier covers all entries in that row or column; thus A(:,j) is the

jth column of A and A(i,:) the ith row. The keyword end used in this context denotes the
last index in the specified dimension; thus A(end,:) picks out the last row of A. Finally, an
arbitrary submatrix can be selected by specifying the individual row and column indices. For
example, A([i j k],[p q]) produces the submatrix given by the intersection of rows
i, j and k and columns p and q.
Here are some examples, working on the matrix
A =
1 2 3
7 5 9
17 15 21

>> A(2,1)
ans =
7
>> A(2:3,2:3)
ans =
5 9
15 21
>> A(:,1)
ans =
1
7
17
>> A(2,:)
ans =
7 5 9
>> A([1 3],[2 3])
ans =
2 3
15 21

The special case A(:) denotes a vector comprising all the elements of A taken down the
columns from first to last:
>> B=A(:)
B =
1
7
17
2
5
15
3
9
21

Any array can be accessed by a single subscript. Multidimensional arrays are actually stored
linearly in memory, varying over the first dimension, then the second, and so on. (Think of

the columns of a table being stacked on top of each other.) In this sense the array is equivalent
to a vector, and a single subscript will be interpreted in this context.
>> A(5)
ans =
5
Subscript referencing can be used on either side of assignments.
>> B=ones(2,3)
B =
1 1 1
1 1 1
>> B(1,:)=A(1,:)
B =
1 2 3
1 1 1
>> C=rand(2,5)
C =
0.1576 0.9572 0.8003 0.4218 0.7922
0.9706 0.4854 0.1419 0.9157 0.9595
>> C(2,:)=-1 %expand scalar into a submatrix
C =
0.1576 0.9572 0.8003 0.4218 0.7922
-1.0000 -1.0000 -1.0000 -1.0000 -1.0000
>> C(3,1) = 3 %create a new row to make space
C =
0.1576 0.9572 0.8003 0.4218 0.7922
-1.0000 -1.0000 -1.0000 -1.0000 -1.0000
3.0000 0 0 0 0
The empty matrix [] is useful for deleting rows or columns:
>> C(:,4)=[]
C =
0.1576 0.9572 0.8003 0.7922
-1.0000 -1.0000 -1.0000 -1.0000
3.0000 0 0 0
The empty matrix is also useful in a function call to indicate a missing argument, as we shall
see in §1.3.4.
An array is resized automatically if you delete elements or make assignments outside the current size. (Any new undefined elements are set to zero.) But beware: this can be a source of hard-to-find errors.
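A small illustration of this automatic growth:

>> v = [1 2 3];
>> v(6) = 10   % assignment beyond the current size; gaps are zero-filled
v =
     1     2     3     0     0    10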

1.3.3 Matrix and array operations


The arithmetic operators +, -, *, / are interpreted in a matrix sense. When appropriate, scalars are “expanded” to match a matrix. In addition to the usual division operator /, with the meaning A/B = AB⁻¹, MATLAB has a left division operator (backslash, \), with the meaning A\B = A⁻¹B. The backslash and the forward slash define solutions of linear systems: A\B is a solution X of A*X = B, while A/B is a solution X of X*B = A. Examples:

Operation Matrix sense Array sense


Addition + +
Subtraction - -
Multiplication * .*
Left division \ .\
Right division / ./
Exponentiation ˆ .ˆ

Table 1.5: Elementary matrix and array operations

>> A=[1 2; 3 4], B=ones(2)


A =
1 2
3 4
B =
1 1
1 1
>> A+B
ans =
2 3
4 5
>> A*B
ans =
3 3
7 7
>> A\B
ans =
-1 -1
1 1
>> A+2
ans =
3 4
5 6
>> Aˆ3
ans =
37 54
81 118

Array operations simply act identically on each element of an array. Multiplication, di-
vision and exponentiation in array or elementwise sense are specified by preceding the op-
erator with a period. If A and B are matrices of the same dimensions then C = A.*B sets
C(i,j)=A(i,j)*B(i,j) and C = B.\A sets C(i,j)=B(i,j)/A(i,j). With the
same A and B as in the previous example:

>>A.*B
ans =
1 2
3 4

>>B./A
ans =
1.0000 0.5000
0.3333 0.2500
>> A.ˆ2
ans =
1 4
9 16

The dot form of exponentiation allows the power to be an array when the dimensions of the base and the power agree, or when the base is a scalar:

>>x=[1 2 3]; y=[2 3 4]; Z=[1 2; 3 4];

>>x.ˆy
ans =
1 8 81

>>2.ˆx
ans =
2 4 8

>>2.ˆZ
ans =
2 4
8 16

To invert a nonsingular matrix one uses inv, and for the determinant of a square matrix
det.
Matrix exponentiation is defined for all powers, not just for positive integers. If n<0 is
an integer then Aˆn is defined as inv(A)ˆ(-n). For noninteger p, Aˆp is evaluated using
the eigensystem of A; results can be incorrect or inaccurate when A is not diagonalizable or
when A has an ill-conditioned eigensystem.
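For instance, with a symmetric matrix whose eigenvalues are positive (so that the square root is real):

>> A = [2 1; 1 2];
>> A^-1        % the same as inv(A)
ans =
    0.6667   -0.3333
   -0.3333    0.6667
>> X = A^0.5;  % computed via the eigensystem of A
>> X*X         % recovers A up to rounding error
ans =
    2.0000    1.0000
    1.0000    2.0000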
The conjugate transpose of the matrix A is obtained with A’. If A is real this is simply
the transpose. The transpose without conjugation is obtained with A.’. The functional
alternatives ctranspose(A) and transpose(A) are sometimes more convenient. Table
1.5 gives a list of matrix and array operations.
For the special case of column vectors x and y, x’*y is the inner or dot product, which
can also be obtained using the dot function as dot(x,y). The vector or cross product of
two 3-by-1 or 1-by-3 vectors (as used in mechanics) is produced by cross. Example:

>>x=[-1 0 1]’; y=[3 4 5]’;

>>x’*y
ans =
2

>>dot(x,y)

max Largest component


min Smallest component
mean Average or mean value
median Median value
std Standard deviation
var Variance
sort Sort
sum Sum of elements
prod Product of elements
cumsum Cumulative sum of elements
cumprod Cumulative product of elements
diff Difference of elements

Table 1.6: Basic data analysis functions

ans =
2

>>cross(x,y)
ans =
-4
8
-4

1.3.4 Data analysis


Table 1.6 lists functions for basic data analysis computation. The simplest usage is to apply
these functions to vectors. For example
>>x=[4 -8 -2 1 0]
x =
4 -8 -2 1 0

>>[min(x) max(x)]
ans =
-8 4

>>sort(x)
ans =
-8 -2 0 1 4

>>sum(x)
ans =
-5

The sort function sorts into ascending order. For complex vectors, sort sorts by absolute
value and so descending order must be obtained by explicitly reordering the output:

>>x=[1+i -3-4i 2i 1];

>>y=sort(x)

>>y=y(end:-1:1)
y =
-3.0000 - 4.0000i 0 + 2.0000i 1.0000 + 1.0000i 1.0000

For matrices the functions are defined columnwise. Thus max and min return a vector
containing the maximum and the minimum element, respectively, in each column, sum re-
turns a vector containing the column sums, and sort sorts the elements in each column of the matrix in ascending order. The functions min and max can return a second output argument that specifies in which components the minimum and the maximum elements are located. For example, if
A =
0 -1 2
1 2 -4
5 -3 -4

then
>>max(A)
ans =
5 2 2

>>[m,i]=min(A)
m =
0 -3 -4
i =
1 3 2

As this example shows, if there are two or more minimal elements in a column, then the index
of the first is returned. The smallest element in the matrix can be found by applying min
twice in succession:
>>min(min(A))
ans =
-4

An alternative way is typing


>>min(A(:))
ans =
-4

The diff function computes differences. Applied to a vector x of length n it produces the vector [x(2)-x(1) x(3)-x(2) ... x(n)-x(n-1)] of length n-1. Applied to a matrix, it acts columnwise. Example:
>>x=(1:8).ˆ2
x =

1 4 9 16 25 36 49 64

>>y=diff(x)
y =
3 5 7 9 11 13 15

For details such as sorting in descending order or acting rowwise, and for how the functions not described here work, see the corresponding helps or docs.
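For instance, two of the variants just mentioned:

>> x = [4 -8 -2 1 0];
>> sort(x,'descend')
ans =
     4     1     0    -2    -8
>> A = [0 -1 2; 1 2 -4];
>> sum(A,2)    % sum along rows instead of columns
ans =
     1
    -1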

1.3.5 Relational and Logical Operators


MATLAB relational operators are: == (equal), ˜= (not equal), < (less than), > (greater than),
<= (less than or equal) and >= (greater than or equal). Note that a single = denotes assign-
ment, not an equality test.
Comparisons between scalars produce 1 if the relation is true and 0 if it is false. Com-
parisons are defined between matrices of the same dimensions and between a matrix and a
scalar, the result being a matrix of 0s and 1s in both cases. For matrix-matrix comparisons
corresponding pairs of elements are compared, while for matrix-scalar comparisons the scalar
is compared with each matrix element. For example:
>> A=[1 2; 3 4]; B = 2*ones(2);

>> A == B
ans =
0 1
0 0

>>A > 2
ans =
0 0
1 1

To test whether matrices A and B are identical, the expression isequal(A,B) can be
used:
>>isequal(A,B)
ans =
0

There are many useful logical functions whose names begin with is; for a full list type doc is.
MATLAB's logical operators are & (and), | (or), ˜ (not), xor (exclusive or), all (true if all elements of a vector are nonzero), and any (true if any element of a vector is nonzero). They produce matrices of 0s and 1s when one of the arguments is a matrix. Examples:
>> x = [-1 1 1]; y = [1 2 -3];
>> x>0 & y>0
ans =
0 1 0
>> x>0 | y>0

Precedence level Operator


1 (highest) Transpose (.'), power (.^), complex conjugate transpose
('), matrix power (^)
2 Unary plus (+), unary minus (-), logical negation (~)
3 Multiplication (.*), right division (./), left division (.\),
matrix multiplication (*), matrix right division (/), matrix
left division (\)
4 Addition (+), subtraction (-)
5 Colon operator (:)
6 Less than (<), less than or equal to (<=), greater than (>),
greater than or equal to (>=), equal to (==), not equal to
(~=)
7 Logical and (&)
8 (lowest) Logical or (|)

Table 1.7: Operator precedence

ans =
1 1 1
>> xor(x>0,y>0)
ans =
1 0 1
>> any(x>0)
ans =
1
>>all(x>0)
ans =
0

The precedence of arithmetic, relational and logical operators is summarized in Table 1.7
(which is based on the information provided by help precedence). For operators of
equal precedence MATLAB evaluates from left to right. Precedence can be overridden by
using parentheses.
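As a small illustration (both results follow from Table 1.7: power binds more tightly than unary minus, and arithmetic binds more tightly than the colon operator):

>> -2^2
ans =
-4

>> n = 3; 1:n+1
ans =
1 2 3 4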
The find command returns the indices corresponding to the nonzero elements of a vec-
tor. For example,
>> x = [-3 1 0 -inf 0 NaN];
>> f = find(x)
f =
1 2 4 6

The result of find can then be used to extract just those elements of the vector:
>> x(f)
ans =
-3 1 -Inf NaN

With x as above, we can use find to obtain the finite, non-NaN elements of x,

>> x(find(isfinite(x) & ~isnan(x)))


ans =
-3 1 0 0

and to replace the negative components of x by zero:


>> x(find(x<0))=0
x =
0 1 0 0 0 NaN

When find is applied to a matrix A, the index vector corresponds to A(:), and this
vector can be used to index into A. In the following example we use find to set to zero those
elements of A that are less than the corresponding elements of B:
>> A = [4 2 16; 12 4 3], B = [12 3 1; 10 -1 7]
A =
4 2 16
12 4 3
B =
12 3 1
10 -1 7
>> f = find(A<B)
f =
1
3
6
>> A(f) = 0
A =
0 0 16
12 4 0

An alternative usage of find for matrices is [i,j] = find(A), which returns vectors i
and j containing the row and column indices of nonzero elements.
>> [i,j]=find(B<3); [i,j]
ans =
2 2
1 3

The results of MATLAB's logical operators and logical functions are arrays of 0s and
1s; these are examples of logical arrays. Logical arrays can also be created by applying the
function logical to a numeric array. Logical arrays can be used for subscripting. The
expression A(M), where M is a logical array of the same dimension as A, extracts the elements
of A corresponding to the elements of M with nonzero real part.
>> B = [-1 2 5; 9 0 5]
B =
-1 2 5
9 0 5
>> la=B>3
la =

0 0 1
1 0 1
>> B(la)
ans =
9
5
5
>> b=B(2,:)
b =
9 0 5
>> x=[1, 1, 1]; xl=logical(x);
>> b(x)
ans =
9 9 9
>> b(xl)
ans =
9 0 5
>> whos x*
Name Size Bytes Class Attributes

x 1x3 24 double
xl 1x3 3 logical
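Since comparisons already produce logical arrays, a mask can also be used directly on the left-hand side of an assignment, without find; a brief sketch reusing B from above:

>> B(B<0) = 0
B =
0 2 5
9 0 5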

1.3.6 Sparse matrices


You can create two-dimensional double and logical matrices using one of two storage for-
mats: full or sparse. For matrices with mostly zero-valued elements, a sparse matrix requires
a fraction of the storage space required for an equivalent full matrix. Sparse matrices invoke
methods especially tailored to solve sparse problems. Many real world matrices are both ex-
tremely large and very sparse, meaning that most entries are zero. In such cases it is wasteful
or downright impossible to store every entry. Instead one can keep a list of nonzero entries
and their locations. MATLAB has a sparse data type for this purpose. The sparse and
full commands convert back and forth and lay bare the storage difference.

>> A = vander(1:3);
>> sparse(A)
ans =
(1,1) 1
(2,1) 4
(3,1) 9
(1,2) 1
(2,2) 2
(3,2) 3
(1,3) 1
(2,3) 1
(3,3) 1
>> S = sparse(A);
>> full(S)

ans =
1 1 1
4 2 1
9 3 1

Sparsifying a standard full matrix is usually not the right way to create a sparse matrix;
you should avoid creating very large full matrices, even temporarily. One alternative is to
give sparse the raw data required by the format.

sparse(1:4,8:-2:2,[2 3 5 7])
ans =
(4,2) 7
(3,4) 5
(2,6) 3
(1,8) 2

Alternatively, you can create an empty sparse matrix with space to hold a specified number
of nonzeros, and then fill it in using standard subscript assignments; a short sketch with
spalloc follows the next example. Another useful sparse building command is spdiags,
which builds along the diagonals of the matrix.

>> M = ones(6,1)*[-20 Inf 10]


M =
-20 Inf 10
-20 Inf 10
-20 Inf 10
-20 Inf 10
-20 Inf 10
-20 Inf 10
>> S = spdiags( M,[-2 0 1],6,6 );
>> full(S)
ans =
Inf 10 0 0 0 0
0 Inf 10 0 0 0
-20 0 Inf 10 0 0
0 -20 0 Inf 10 0
0 0 -20 0 Inf 10
0 0 0 -20 0 Inf
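For the preallocation route mentioned above, here is a minimal sketch (spalloc is a standard MATLAB function; the sizes are illustrative):

S = spalloc(1000,1000,5);  % empty sparse matrix with room for 5 nonzeros
S(1,1) = 4; S(500,2) = -1; S(1000,1000) = 7;
nnz(S)                     % returns 3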

The nnz function tells how many nonzeros are in a given sparse matrix. Since it is
impractical to view directly all the entries (even just the nonzeros) of a realistically sized
sparse matrix, the spy command helps by producing a plot in which the locations of nonzeros
are shown. For instance, spy(bucky) shows the pattern of bonds among the 60 carbon
atoms in a buckyball.
MATLAB has a lot of ability to work intelligently with sparse matrices. The arithmetic
operators +, -, *, and ^ use sparse-aware algorithms and produce sparse results when applied
to sparse inputs. The backslash \ uses sparse-appropriate linear system algorithms automatically
as well. There are also functions for the iterative solution of linear equations, eigenvalues,
and singular values.
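As an illustrative sketch of this (a sparse tridiagonal system built with spdiags and solved with backslash; the size is arbitrary):

n = 1000; e = ones(n,1);
T = spdiags([e -2*e e],[-1 0 1],n,n);  % sparse tridiagonal matrix
b = T*e;                               % right-hand side with known solution e
x = T\b;                               % backslash selects a sparse solver
norm(x-e)                              % error should be tiny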

1.4 MATLAB Programming


1.4.1 Control flow
MATLAB has four control flow structures: the if statement, the for loop, the while loop
and the switch statement.
Here is an example illustrating most of the features of if.

if isinf(x) | ~isreal(x)
disp(’Bad input!’)
y = NaN;
elseif (x == round(x)) && (x > 0)
y = prod(1:x-1);
else
y = gamma(x);
end

Compound logic in if statements can be short-circuited. As a condition is evaluated from


left to right, it may become obvious before the end that truth or falsity is assured. At that
point, evaluation of the condition is halted. This makes it convenient to write things like
if (length(x) > 2) & (x(3)==1) ...
that otherwise could create errors or be awkward to write. If you want short-circuit behavior
for logical operations outside if and while statements, you must use the special operators ||
and &&. The if/elseif construct is fine when only a few options are present. When a
large number of options are possible, it is customary to use switch instead. For instance:

switch units
case ’length’
disp(’meters’)
case ’volume’
disp(’liters’)
case ’time’
disp(’seconds’)
otherwise
disp(’I give up’)
end

The switch expression can be a string or a number. The first matching case has its com-
mands executed. Execution does not “fall through” as in C. If otherwise is present, it
gives a default option if no case matches.
The for loop is one of the most useful MATLAB constructs, although code is often more
compact without it. The syntax is
for variable = expression
statements
end
Usually, expression is a vector of the form i:s:j. The statements are executed with variable
equal to each element of the expression in turn. For example, the first 10 terms of Fibonacci
sequence are computed by

f = [1 1];
for n = 3:10
f(n) = f(n-1) + f(n-2);
end

Another way to define expression is using the square bracket notation:

for x = [pi/6 pi/4 pi/3], disp([x, sin(x)]), end


0.5236 0.5000
0.7854 0.7071
1.0472 0.8660

Multiple for loops can of course be nested, in which case indentation helps to improve
readability. The MATLAB Editor/Debugger can perform automatic indentation. The following
code forms the 5-by-5 symmetric matrix A with (i, j) element i/j for j ≥ i:

n = 5; A = eye(n);
for j=2:n
for i = 1:j-1
A(i,j)=i/j;
A(j,i)=i/j;
end
end

The expression in the for loop can be a matrix, in which case variable is assigned the
columns of expression from first to last. For example, to set x to each of the unit vectors in
turn, we can write for x=eye(n), ..., end.
The while loop has the form
while expression
statements
end
The statements are executed as long as expression is true. The following example approxi-
mates the smallest nonzero floating point number:

x = 1; while x>0, xmin = x; x = x/2; end, xmin


xmin =
4.9407e-324

A while loop can be terminated with a break statement, which passes control to the first
statement after the corresponding end. An infinite loop can be constructed using while 1,
..., end, which is useful when it is not convenient to put the exit test at the top of the loop.
(Note that, unlike some other languages, MATLAB does not have a “repeat-until” loop.) We
can rewrite the previous example less concisely as

x = 1; while 1
xmin = x;
x = x/2;
if x == 0, break, end
end
xmin

The break statement can also be used to exit a for loop. In a nested loop, a break exits
to the loop at the next higher level.
The continue statement causes the execution of a for or a while loop to pass im-
mediately to the next iteration of the loop, skipping the remaining statements in the loop. As
a trivial example
for i=1:10
if i < 5, continue, end
disp(i)
end

displays the integers 5 to 10. In more complicated loops the continue statement can be
useful to avoid long-bodied if statements.

1.5 M files
1.5.1 Scripts and functions
MATLAB M-files are equivalents of programs, functions, subroutines and procedures in other
programming languages. Collecting together a sequence of commands into an M-file opens
up many possibilities including

• experimenting with an algorithm by editing a file, rather than retyping a long list of
commands,

• making a permanent record of a numerical experiment,

• building up utilities that can be reused at a later date,

• exchanging M-files with colleagues.

An M-file is a text file that has a .m filename extension and contains MATLAB com-
mands. There are two types:
Script M-files (or command files) have no input or output arguments and operate on
variables in the workspace.
Function M-files contain a function definition line and can accept input arguments
and return output arguments, and their internal variables are local to the function (unless
declared global).
A script enables you to store a sequence of commands that are used repeatedly or will be
needed at some future time.
The script given in MATLAB Source 1.1 uses random numbers to simulate a game. Consider
13 spade playing cards which are well shuffled. The probability of choosing any particular
card from the deck is 1/13. The action of drawing a card is implemented by generating a
random number. The game proceeds by putting the card back into the deck and reshuffling,
until the user presses a key other than r or the number of repetitions reaches 20.
The line
rand(’twister’,sum(100*clock));

MATLAB Source 1.1 The first MATLAB script - playingcards


%PLAYINGCARDS
%Simulating a card game

rand('twister',sum(100*clock));
for k=1:20
n=ceil(13*rand);
fprintf(’Selected card: %d\n’,n)
disp(’ ’)
disp(’Press r and Return to continue’)
r=input(’or any other key to finish: ’,’s’);
if r˜=’r’, break, end
end

resets the generator to a different state each time.


The first two lines of this script begin with the % symbol and hence are the comment
lines. Whenever MATLAB encounters a % it ignores the remainder of the line. This allows
you to insert text that makes the code easier for humans to understand. Starting with version
7, MATLAB allows block comments, that is, comments spanning more than one line. To
comment a contiguous group of lines, type %{ before the first line and %} after the last line
you want to comment. Example:

%{
Block comment
two lines
%}

If our script is stored in the file playingcards.m, typing playingcards produces:

>> playingcards
Selected card: 6

Press r and Return to continue


or any other key to finish: r
Selected card: 9

Press r and Return to continue


or any other key to finish: x
>>

The first two lines serve as documentation for the file and will be typed out in the command
window if help is used on the file.

>> help playingcards


PLAYINGCARDS
Simulating a card game

Function M-files enable you to extend the MATLAB language by writing your own functions
that accept and return arguments. They can be used in exactly the same way as existing
MATLAB functions such as sin, eye, size, etc.

MATLAB Source 1.2 The stat function


function [med,sd] = stat(x)
%STAT Mean and standard deviation of a sample
% [MED,SD] = STAT(X) computes the mean and standard
% deviation of the sample X

n = length(x);
med = sum(x)/n;
sd = sqrt(sum((x-med).^2)/n);

MATLAB Source 1.2 shows a simple function that computes the mean and standard devi-
ation of a sample (vector). This example illustrates a number of features. The first line begins
with the keyword function followed by the list of output arguments, [med, sd], and
the = symbol. On the right of the = comes the function name, stat, followed by the input
arguments, x, within parentheses. (In general, there can be any number of input and output
arguments.) The function name must be the same as the name of the .m file in which the
function is stored – in this case the file must be named stat.m.
The second line of a function file is called the H1 (help 1) line. It should be a comment
line of a special form: a line beginning with a % character, followed without any space by
the function name in capital letters, followed by one or more spaces and then a brief description.
The description should begin with a capital letter, end with a period, and omit the words
"the" and "a". All the comment lines from the first comment line up to the first noncomment
line (usually a blank line, for readability of the source code) are displayed when help
function_name is typed. Therefore these lines should describe the function and its arguments.
It is conventional to capitalize function and argument names in these comment lines.
For the stat.m example we have
>> help stat
STAT Mean and standard deviation of a sample
[MED,SD] = STAT(X) computes the mean and standard
deviation of the sample X

We strongly recommend documenting all your function files in this way, however short they
may be. It is often useful to record in comment lines the date when the function was first
written and to note all subsequent changes that have been made. The help command works
in a similar manner on script files, displaying the initial sequence of comment lines.
The function stat is called just like any other MATLAB function:
>> [m,s]=stat(1:10)
m =
5.5000
s =
2.8723

>> x=rand(1,10);
[m,s]=stat(x)
m =
0.5025
s =
0.1466

A more complicated example is mysqrt, shown in MATLAB Source 1.3. Given $a > 0$,
it computes $\sqrt{a}$ by Newton's method

$$x_{k+1} = \frac{1}{2}\left(x_k + \frac{a}{x_k}\right), \qquad x_1 = a.$$

Here are some examples of usage:


>> [x,it] = mysqrt(2)
x =
1.414213562373095
it =
6
>> [x,it]=mysqrt(2,1e-4)
x =
1.414213562374690
it =
4

This M-file illustrates the use of optional input arguments. The function nargin returns
the number of input arguments supplied when the function was called and enables default
values to be assigned to arguments that have not been specified. In this case, if the call
to mysqrt does not specify a value for tol, then eps is assigned to tol. Analogously,
nargout returns the number of output arguments requested.

1.5.2 Subfunctions, nested and anonymous functions


A single M-file may hold more than one function definition. The first function is called
the primary function. Two other types of functions can be in the same file: subfunctions and
nested functions.
The subfunctions begin after the end of the primary function. Every subfunction in the file
is available to be called by the primary function and the other subfunctions, but they have
private workspaces and otherwise behave like functions in separate files. The difference is
that only the primary function, not the subfunctions, can be called from sources outside the
file. The next example computes the roots of a quadratic equation. It can solve several
equations simultaneously, if the coefficients are given as vectors.
function x = quadeq(a,b,c)
% QUADEQ Find roots of a quadratic equation.
% X = QUADRATIC(A,B,C) returns the two roots of
% y = A*xˆ2 + B*x + C.
% The roots are contained in X = [X1 X2].

MATLAB Source 1.3 Function mysqrt


function [x,iter] = mysqrt(a,tol)
%MYSQRT Square root by Newton's method
% X = MYSQRT(A,TOL) computes the square root of
% A by Newton's (Heron's) method;
% assumes A >= 0.
% TOL is the tolerance (default EPS).
% [X,ITER] = MYSQRT(A,TOL) also returns the number
% of iterations ITER required.

if nargin < 2, tol = eps; end

x = a;

for k=1:50
xold = x;
x = (x + a/x)/2;
if abs(x-xold) <= tol*abs(x)
iter=k; return;
end
end
error(’Not converged after 50 iterations’)

denom = 2*a(:);
delta = discr(a,b,c); % Root of the discriminant
x1 = (-b(:) + delta)./denom;
x2 = (-b(:) - delta)./denom;
x = [x1(:), x2(:)];
end %quadeq
function d = discr(a,b,c)
d = sqrt(b(:).^2-4*a(:).*c(:));
end %discr

The end line is optional for single-function files, but it is a good idea when subfunctions
are involved and mandatory when using nested functions.
For further examples on subfunctions, see MATLAB Sources 5.2 and 5.6 in Chapter 5.
Nested functions are defined within the scope of another function, and they share access
to the containing function's workspace. For example, we can recast our quadratic formula yet
again:
function x = quadeqn(a,b,c)
function discr(a,b,c)
d = sqrt(b(:).^2-4*a(:).*c(:));
end %discr()
denom = 2*a(:);
discr(a,b,c); % Root of the discriminant

x1 = (-b(:) + d)./denom;
x2 = (-b(:) - d)./denom;
x = [x1(:), x2(:)];
end %quadeqn()

Sometimes you may need a quick, short function definition that does not seem to merit a
named file on disk, or even a named subfunction. The “old-fashioned” way to do this is by
using an inline function. Example:
f = inline('x*sin(x)','x');
Starting with MATLAB 7, a better alternative became available: the anonymous function.
A simple example of an anonymous function is
sincos = @(x) sin(x) + cos(x);
We can define multivariate functions:
w = @(x,y,z) cos(x-y*z);
More interestingly, anonymous functions can define functions depending on parameters
available at the time of the anonymous function's creation. As an illustration, consider the
function given by $f(x) = x^2 + a$, where a is a given parameter. The definition could be:
a = 2;
f = @(x) x.^2+a;
Nested functions and anonymous functions have similarities but are not quite identi-
cal. Nested functions always share scope with their enclosing parent; their behavior can be
changed by and create changes within that scope. An anonymous function can be influenced
by variables that were in scope at its creation time, but thereafter it acts autonomously.
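A short sketch of this capture-at-creation-time behavior:

>> a = 2; f = @(x) x.^2 + a;
>> a = 100;  % changing a afterwards does not affect f
>> f(1)
ans =
3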

1.5.3 Passing a Function as an Argument


Many problems in scientific computation, like finding roots, approximating integrals (quadra-
ture), and solving differential equations need to pass a function as an argument to another
function. MATLAB calls the functions operating on other functions function functions. (See
the help topic funfun for a complete list.) There are several ways to pass functions as
parameters, but the most important is the function handle. A function handle is a MATLAB
datatype that contains all the information necessary to evaluate a function. A function handle
can be created by putting the @ character before the function name. Example:
ezplot(@fun)
The parameter fun can be the name of an M-file, or a built-in function.
ezplot(@sin)
Anonymous function definitions and inline objects already yield callable objects, so they
do not require an @ character before their names.
Another way is to transmit the parameter as a string:
ezplot(’exp’)
but a function handle is more efficient and versatile.
Consider the function fd_deriv in MATLAB Source 1.4. This function evaluates the
finite difference approximation
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

MATLAB Source 1.4 Function fd deriv


function y = fd_deriv(f,x,h)
%FD_DERIV Approximates the derivative by a divided difference
% FD_DERIV(F,X,H) is the divided difference of F
% with nodes X and X+H.
% Default H: SQRT(EPS).

if nargin < 3, h = sqrt(eps); end


f=fcnchk(f);
y = (f(x+h) - f(x))/h;

to the function passed as its first argument. Our first example uses a built-in function

>> fd_deriv(@sqrt,0.1)
ans =
1.5811

We can use the mysqrt function (MATLAB Source 1.3) instead of the predefined sqrt:

fd_deriv(@mysqrt,0.1)
ans =
1.5811

In this example mysqrt uses its default tolerance, since fd_deriv passes it a single
argument. If we want to force the use of another tolerance, we can use an anonymous function
as argument:

>> format long


>> fd_deriv(@(x) mysqrt(x, 1e-3), 0.1)
ans =
1.581138722598553

We can pass an inline object or an anonymous function to fd_deriv:

>> fd_deriv(f,pi)
ans =
-0.0063
>> g = @(x) xˆ2*sin(x);
>> fd_deriv(g,pi)
ans =
-9.8696

The role of line


f=fcnchk(f);
in fd_deriv is to make this function accept a string expression as an input parameter. (This
is how ezplot and other MATLAB functions work with string expressions; see [66] for
examples.) The command fcnchk(f,'vectorized') vectorizes the string f. The function
vectorize vectorizes an inline object.
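Thus, thanks to fcnchk, a string expression can also be passed directly; a quick sketch (the derivative of the exponential at 0 is 1):

>> fd_deriv('exp(x)',0)
ans =
1.0000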

Figure 1.2: MATLAB class hierarchy

1.5.4 Advanced data structures


Other data types

There are 15 fundamental data types in MATLAB. Each of these data types is in the form of
a matrix or array. This matrix or array is a minimum of 0-by-0 in size and can grow to an
n-dimensional array of any size. All of the fundamental data types are shown in lowercase,
plain nonitalic text in Figure 1.2. The two data types shown in italic text are user-defined,
object-oriented user classes and Java classes.
The default data type is double. There is also a single data type, corresponding to
IEEE single precision arithmetic. The reasons for working with single precision are: to save
storage (a scalar single requires only 32 bits to be represented) and to explore the accuracy of
numerical algorithms.
The single type will be described in Section 3.6.
The types int* and uint*, where * is replaced by 8, 16, 32, or 64, are arrays of signed
(int) and unsigned (uint) integers. Some integer types require less storage space than single
or double. All integer types except for int64 and uint64 can be used in mathematical
operations. We can construct an object of the previously mentioned types by typing its type name
as a function with its value as argument. The command c = class(obj) returns the class
of the object obj. Examples:

>> s=single(pi)
s =
3.1416
>> x=pi
x =
3.1416
>> format long, x=pi, format short
x =
3.141592653589793
>> a=uint8(3)
a =
3
>> b=int32(2ˆ20+1)
b =
1048577
>> whos
Name Size Bytes Class Attributes

a 1x1 1 uint8
b 1x1 4 int32
s 1x1 4 single
x 1x1 8 double
>> class(s)
ans =
single
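One behavior worth knowing: MATLAB integer arithmetic saturates at the limits of the type instead of wrapping around. A small sketch:

>> int8(100)+int8(100)   % saturates at intmax('int8'), which is 127
ans =
127
>> uint8(5)-uint8(10)    % saturates at 0
ans =
0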

For details on nondouble types see [60, 61].

String and formatted output


A string in MATLAB is enclosed in single forward quotes. In fact a string is really just a row
vector of character ASCII codes. Because of this, strings can be concatenated using matrix
concatenation.
>> str = ’Hello world’;
>> str(1:5)
ans =
Hello
>> s1 = double(str)
s1 =
72 101 108 108 111 32 119 111 114 108 100
>> s2 = char(s1)
s2 =
Hello world
>> [’Hello’,’ ’,’world’]
ans =
Hello world
>> whos
Name Size Bytes Class Attributes

ans 1x11 22 char


s1 1x11 88 double
s2 1x11 22 char
str 1x11 22 char

You can convert a string such as '3.14' into its numerical meaning (not its character codes)
by using eval or str2num on it. Conversely, you can convert a number to its string
representation using num2str or the much more powerful sprintf (see below). If you want
a quote character within a string, use two quotes, as in 'It''s Donald''s birthday'.
Multiple strings can be stored as rows in an array using str2mat, which pads strings with
blanks so that they all have the same length. However, a better way to collect strings is to use
cell arrays. There are lots of string handling functions. See the help on strfun. Here are a
few:

>> strcat(’Hello’,’ world’)


ans =
Hello world
>> upper(str)
ans =
HELLO WORLD
>> strcmp(str,’Hello world’)
ans =
1
>> findstr(’world’,str)
ans =
7

For the conversion of internal data to strings, in memory or for output, use sprintf or
fprintf. These are closely based on the C function printf, with the important vectorization
enhancement that format specifiers are reused for all the elements of a vector or
matrix, traversed in columnwise order (hence the transpose X' in the example below).

x=0.25; n=1:6;
c=1./cumprod([1 n]);
for k=1:7, T(k)=polyval(c(k:-1:1),x); end
X=[(0:6)’,T’, abs(T-exp(x))’];
fprintf(’\n n | T_n(x) | |T_n(x)-exp(x)|\n’);
fprintf(’--------------------------------------\n’);
fprintf(’ %d | %15.12f | %8.3e\n’, X’ )
fprintf(’--------------------------------------\n’);

n | T_n(x) | |T_n(x)-exp(x)|
--------------------------------------
0 | 1.000000000000 | 2.840e-001
1 | 1.250000000000 | 3.403e-002
2 | 1.281250000000 | 2.775e-003
3 | 1.283854166667 | 1.713e-004
4 | 1.284016927083 | 8.490e-006
5 | 1.284025065104 | 3.516e-007

6 | 1.284025404188 | 1.250e-008
--------------------------------------
Use sprintf if you want to save the result as a string rather than have it output immediately.
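For example, a one-line sketch:

>> s = sprintf('%8.5f',pi)
s =
 3.14159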

Cell arrays
Cell arrays are used to gather dissimilar objects into one variable. They are like regular
numeric arrays, but their elements can be absolutely anything. A cell array is created using
curly braces rather than parentheses, or with the cell function. Cell array contents are
indexed using curly braces, and the colon notation can be used in the same way as for other
arrays.
>> vvv={’Humpty’ ’Dumpty’ ’sat’ ’on’ ’a’ ’wall’}
vvv =
’Humpty’ ’Dumpty’ ’sat’ ’on’ ’a’ ’wall’
>> vvv{3}
ans =
sat
The next example tabulates the coefficients of Chebyshev polynomials. In MATLAB
one expresses a polynomial as a vector (highest degree first) of its coefficients. The number
of coefficients needed grows with the degree of the polynomial. Although you can put all
the Chebyshev coefficients into a triangular array, this is an inconvenient complication. The
recurrence relation is
$$T_0(x) = 1, \quad T_1(x) = x, \qquad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \quad n = 1, 2, \ldots$$
Here is the code
function T=Chebyshevlc(deg)
%CHEBYSHEVLC - Chebyshev polynomials
% tabulate coefficients of Chebyshev polynomials
% T = CHEBYSHEVLC(DEG)
T = cell(1,deg+1);
T(1:2) = { [1], [1 0] };
for n = 2:deg
T{n+1} = [2*T{n} 0] - [0 0 T{n-1}];
end
and a call
>> T=Chebyshevlc(6)
T =
[1] [1x2 double] [1x3 double] [1x4 double]
[1x5 double] [1x6 double] [1x7 double]
>> for k=1:5, disp(T{k}), end
1
1 0
2 0 -1
4 0 -3 0
8 0 -8 0 1

Cell arrays are used by the varargin and varargout functions to work with a vari-
able number of arguments.

Structures
Structures are much like cell arrays, but they are indexed by names rather than by numbers.
Suppose we want to build a structure with probability distributions, containing the name, the
type and the probability density (or mass) function.
>> pdist(1).name=’binomial’;
>> pdist(1).type=’discrete’;
>> pdist(1).pdf=@binopdf;
>> pdist(2).name=’normal’;
>> pdist(2).type=’continuous’;
>> pdist(2).pdf=@normpdf;

Displaying the structure gives the field names but not the contents:
>> pdist
pdist =
1x2 struct array with fields:
name
type
pdf

We can access individual fields using a period:


>> pdist(2).name
ans =
normal

Another way to set up the pdist structure is using the struct command:
>> pdist = struct(’name’,{’binomial’,’normal’},...
’type’,{’discrete’,’continuous’}, ...
’pdf’,{@binopdf,@normpdf})

The arguments to the struct function are the field names, with each field name followed by
the field contents listed within curly braces (that is, the field contents are cell arrays, which
were described above). If the entire structure cannot be assigned with one struct statement
then it can be created with fields initialized to a particular value using repmat. For example,
we can set up a pdist structure for six distributions with empty fields
>> pdist = repmat(struct(’name’,{’’}, ’type’,{’’}, ...
’pdf’,{’’}),6,1)
pdist =
6x1 struct array with fields:
name
type
pdf

Struct arrays make it easy to extract data into cells:



MATLAB Source 1.5 Function companb


function C = companb(varargin)
%COMPANB Block companion matrix.
% C = COMPANB(A_1,A_2,...,A_m) is the block companion
% matrix corresponding to the n-by-n matrices
% A_1,A_2,...,A_m.

m = nargin;
n = length(varargin{1});

C = diag(ones(n*(m-1),1),-n);
for j = 1:m
Aj = varargin{j};
C(1:n,(j-1)*n+1:j*n) = -Aj;
end

>> [rlp{1:2}] = deal(pdist.name)


rlp =
’binomial’ ’normal’

In fact, deal is a very handy function for converting between cells and structs. See online
help for many examples.
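A tiny sketch of deal in the cell-to-variables direction:

>> c = {10, 'abc', eye(2)};
>> [a,b,d] = deal(c{:});   % a=10, b='abc', d=eye(2)
>> b
b =
abc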
Structures are used by spline, by solve, and to set options for the nonlinear equa-
tion and optimization solvers and the differential equation solvers. Structures also play an
important role in object-oriented programming in MATLAB (which is not discussed here).

1.5.5 Variable Number of Arguments


In certain situations a function must accept or return a variable, possibly unlimited, number
of input or output arguments. This can be achieved using the varargin and varargout
functions. Suppose we wish to write a function companb to form the $mn \times mn$ block
companion matrix corresponding to the $n \times n$ matrices $A_1, A_2, \ldots, A_m$:
$$C = \begin{pmatrix} -A_1 & -A_2 & \cdots & \cdots & -A_m \\ I & 0 & & & 0 \\ & I & \ddots & & \vdots \\ & & \ddots & \ddots & \vdots \\ & & & I & 0 \end{pmatrix}$$

The solution is to use varargin as shown in MATLAB Source 1.5. (This example is
given in [44].) When varargin is specified as the input argument list, the input arguments
supplied are copied into a cell array called varargin. Consider the call
>> X = ones(2); C = companb(X, 2*X, 3*X)
C =

-1 -1 -2 -2 -3 -3
-1 -1 -2 -2 -3 -3
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
If we insert the line
varargin
at the beginning of companb then the above call produces
varargin =
[2x2 double] [2x2 double] [2x2 double]
Thus varargin is a 1-by-3 cell array whose elements are the 2-by-2 matrices passed as
arguments to companb, and varargin{j} is the j-th input matrix, $A_j$.
It is not necessary for varargin to be the only input argument but it must be the last
one, appearing after any named input arguments.
An example using the analogous statement varargout for output arguments is shown
in MATLAB Source 1.6. Here we use nargout to determine how many output arguments
have been requested and then create a varargout cell array containing the required output.
To illustrate:
>> m1 = moments(1:4)
m1 =
2.5000

>> [m1,m2,m3] = moments(1:4)


m1 =
2.5000
m2 =
7.5000
m3 =
25

MATLAB Source 1.6 Function moments


function varargout = moments(x)
%MOMENTS Moments of a vector.
% [M1,M2,...,M_K] = MOMENTS(X) returns the first, second, ...,
% k'th moments of the vector X, where the j'th moment
% is SUM(X.^j)/LENGTH(X).

for j=1:nargout, varargout{j} = sum(x.^j)/length(x); end

1.5.6 Global variables


Variables within a function are local to that function's workspace. Occasionally it is convenient
to create variables that exist in more than one workspace including, possibly, the main workspace.

This can be done using the global statement.


As an example, we give the code for the tic and toc MATLAB functions (with shortened
comments). These functions measure elapsed time. The global variable TICTOC is visible to
both functions, but is invisible in the base workspace (command line or script level) and in any
other function that does not declare it with global.

function tic
% TIC Start a stopwatch timer.
% TIC; any stuff; TOC
% prints the time required.
% See also: TOC, CLOCK.
global TICTOC
TICTOC = clock;
function t = toc
% TOC Read the stopwatch timer.
% TOC prints the elapsed time since TIC was used.
% t = TOC; saves elapsed time in t, does not print.
% See also: TIC, ETIME.
global TICTOC
if nargout < 1
elapsed_time = etime(clock,TICTOC)
else
t = etime(clock,TICTOC);
end

Within a function, the global declaration should appear before the first occurrence of the
relevant variable, ideally at the top of the file. By convention, the names of global variables
consist of capital letters, and ideally the names are long, in order to reduce the chance
of clashes with other variables.

1.5.7 Recursive functions


Functions can be recursive, that is, they can call themselves. Recursion is a powerful tool,
but not all computations that are described recursively are best programmed this way.
Function mygcd in MATLAB Source 1.7 uses recursion to compute the greatest
common divisor, exploiting the property

$$\gcd(a, b) = \gcd(b,\; a \bmod b).$$

It accepts two numbers as input arguments and returns their greatest common divisor. Exam-
ple:

>> d=mygcd(5376, 98784)


d =
672

For other examples on recursion see quad and quadl MATLAB functions and sources
in Sections 6.4 and 6.6, used for numerical integration.

MATLAB Source 1.7 Recursive gcd


function d=mygcd(a,b)
%MYGCD - recursive greatest common divisor
% D = MYGCD(A,B) computes the greatest
% common divisor of A and B

if a==0 && b==0
    d = NaN;
elseif b==0
    d = a;
else
    d = mygcd(b, mod(a,b));
end

1.5.8 Error control

MATLAB functions may encounter statements that are impossible to execute (for example,
multiplication of incompatible matrices). In that case an error is thrown: execution of the
function halts, a message is displayed, and the output arguments of the function are ignored.
You can throw errors in your own functions with the error statement, called with a string
that is displayed as the message. Similar to an error is a warning, which displays a message
but allows execution to continue. You can create these using warning.
Sometimes you would like the ability to recover from an error in a subroutine and continue
with a contingency plan. This can be done using the try-catch construct. For example, the
following will continue asking for a statement until you give it one that executes successfully.

done = false;
while ~done
state = input(’Enter a valid statement: ’,’s’);
try
eval(state);
done = true;
catch
err=lasterror;
disp(’That was not a valid statement! Look:’)
disp(err.identifier)
disp(err.message)
end
end

Within the catch block you can find the most recent error message using lasterr, or de-
tailed information using lasterror. Also, error and warning allow a more sophisti-
cated form of call, with formatted string and message identifiers.
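For instance, a hedged sketch of the identifier/format form (the function name and the identifier strings are made up for illustration):

function y = safesqrt(x)
%SAFESQRT Square root with input checking (illustrative example).
if ~isnumeric(x)
    error('MyToolbox:safesqrt:badClass','X must be numeric.')
elseif any(x(:) < 0)
    warning('MyToolbox:safesqrt:negativeInput', ...
        'Negative inputs present (smallest: %g); results will be complex.', ...
        min(x(:)))
end
y = sqrt(x);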
For details, see [61] and the corresponding helps.

1.6 Symbolic Math Toolboxes


Symbolic Math Toolboxes incorporate symbolic computing facilities into the MATLAB
numeric environment. The toolbox is based upon the Maple kernel, which performs all the
symbolic and variable-precision computations. Starting with MATLAB R2008b there is also
a MuPAD-based Symbolic Math Toolbox; you can select either Maple or MuPAD as the
supporting engine. There are two toolboxes:
• Symbolic Math Toolbox, that provides access to Maple kernel and to Maple linear
algebra package using a style and a syntax which are natural extensions of MATLAB
language.
• Extended Symbolic Math Toolbox extends the above mentioned features to provide
access to nongraphical Maple packages, to Maple programming capabilities, and to
user-defined procedures.
The Symbolic Math Toolbox defines a new datatype: a symbolic object, denoted by sym.
Internally, a symbolic object is a data structure that stores a string representation of the sym-
bol. The Symbolic Math Toolbox uses symbolic objects to represent symbolic variables,
expressions and matrices. The default arithmetic for symbolic objects is rational arithmetic.
Symbolic objects can be created with the sym function. For example, the statement
x = sym(’x’);
produces a new symbolic variable, x. We can combine several such declarations using syms:
syms a b c x y f g

Examples in this section assume that we have already executed this command.
Differentiation. We can differentiate a symbolic expression with diff. Let us create a
symbolic expression:
>> f=exp(a*x)*sin(x);
Its derivative with respect to x can be computed using
>> diff_f=diff(f,x)
diff_f =
a*exp(a*x)*sin(x)+exp(a*x)*cos(x)
If n is a nonzero natural number, diff(f,x,n) computes the nth order derivative of f.
The next example computes the second derivative
>> diff(f,x,2)
ans = aˆ2*exp(a*x)*sin(x)+2*a*exp(a*x)*cos(x)-exp(a*x)*sin(x)
For details see help sym/diff or doc sym/diff.
Integration. To calculate the indefinite integral of a symbolic expression f we can use
int(f,x). For example, to find the indefinite integral of $g(x) = e^{-ax}\sin(cx)$, we do:
>> g = exp(-a*x)*sin(c*x);
>> int_g=int(g,x)
int_g =
-c/(aˆ2+cˆ2)*exp(-a*x)*cos(c*x)-a/(aˆ2+cˆ2)*exp(-a*x)*sin(c*x)

If we execute the command diff(int_g,x), we do not obtain g, but rather an equivalent
expression. After the execution of the command simple(diff(int_g,x)), we
obtain a sequence of messages that inform us about the rules used, and finally
ans = exp(-a*x)*sin(c*x)

To compute the definite integral $\int_{-\pi}^{\pi} g \,\mathrm{d}x$, we may try
int_def_g = int(g,x,-pi,pi). The result is not too elegant. Another example is the
computation of $\int_0^{\pi} x \sin x \,\mathrm{d}x$:
>> int(’x*sin(x)’,x,0,pi)
ans = pi
When MATLAB is unable to find an analytical (symbolic) integral, such as with
>> int(exp(sin(x)),x,0,1)
Warning: Explicit integral could not be found.
ans =
int(exp(sin(x)),x = 0 .. 1)
we can try to find a numerical approximation, like:
>> quad(’exp(sin(x))’,0,1)
ans =
1.6319

Functions int and diff can both be applied to matrices, in which case they operate ele-
mentwise.
For details see help sym/int or doc sym/int.
Substitutions and simplifications. The substitution of one value or parameter for another
in an expression is done with subs. (See help sym/subs or doc sym/subs).
For example, let us compute the previous definite integral of g for a=2 and c=4:
>> int_sub=subs(int_def_g,{a,c},{2,4})
int_sub =
107.0980
simplify is a powerful function that applies various types of identities and simplification
rules to bring an expression to a "simpler" form. Example:
>> syms h
>> h=(1-xˆ2)/(1-x);
>> simplify(h)
ans = x+1
simple is a “non-orthodox” function whose aim is to obtain an equivalent expression with
the smallest number of characters. We have already given an example, but we consider an-
other one:
>> [jj,how]=simple (cos(x)ˆ2+sin(x)ˆ2)
jj =
1
how =
simplify

The rôle of the second output parameter is to avoid long messages during the simplifica-
tion process.
See help sym/simplify and help sym/simple or doc sym/simplify and
doc sym/simple.
Taylor series. The taylor command is useful to generate symbolic Taylor expansions
about a given symbolic value of the argument. For example, to compute the 5th order expan-
sion of $e^x$ about x = 0, we may use:

>> clear, syms x, Tay_expx=taylor(exp(x),5,x,0)


Tay_expx =
1+x+1/2*xˆ2+1/6*xˆ3+1/24*xˆ4

The pretty command displays a sym object in a 2D form, close to the mathematical format:

>> pretty(Tay_expx)

2 3 4
1 + x + 1/2 x + 1/6 x + 1/24 x

Now, we shall compare the approximation for x = 2 to the exact value:

>> approx=subs(Tay_expx,x,2), exact=exp(2)


approx =
7
exact =
7.3891
>> frac_err=abs(1-approx/exact)
frac_err =
0.0527

We can compare the two approximations graphically with ezplot (see Chapter 2 for
MATLAB graphics capabilities):

>> ezplot(Tay_expx,[-3,3]), hold on


>> ezplot(exp(x),[-3,3]), hold off
>> legend(’Taylor’,’exp’,’Location’,’Best’)

The graph is given in Figure 1.3. The ezplot function is a convenient way to plot symbolic
expressions.
For additional information, see help sym/taylor or doc sym/taylor.
Limits. For the syntax and usage of the limit command see help sym/limit or doc
sym/limit. We give only two simple examples. The first computes $\lim_{x\to 0} \frac{\sin x}{x}$:

>> L=limit(sin(x)/x,x,0)
L =
1

The second example computes the one-sided limits of the tan function as x approaches $\pi/2$.

Figure 1.3: The exponential and its Taylor expansion

>> LS=limit(tan(x),x,pi/2,’left’)
LS =
Inf
>> LD=limit(tan(x),x,pi/2,’right’)
LD =
-Inf

Solving equations. Symbolic Math Toolbox is able to solve equations and linear and
nonlinear systems of equations. See help sym/solve or doc sym/solve. We shall
give a few examples. Before the solution we shall clear the memory and define the symbolic
variables and equations. (It is a good practice to clear the memory before the solution of a
new problem to avoid the side-effects of previous values of certain variables.)
We start with a generic quadratic equation, $ax^2 + bx + c = 0$.
>> clear, syms x a b c
>> eq=’a*xˆ2+b*x+c=0’;
>> x=solve(eq,x)
x =
1/2/a*(-b+(bˆ2-4*a*c)ˆ(1/2))
1/2/a*(-b-(bˆ2-4*a*c)ˆ(1/2))

If we have several solution we can select one of them using indexing, for example x(1).
Let us now solve the linear system $2x - 3y + 4z = 5$, $y + 4z + x = 10$, $-2z + 3x + 4y = 0$.
>> clear, syms x y z
>> eq1=’2*x-3*y+4*z=5’;
>> eq2=’y+4*z+x=10’;
>> eq3=’-2*z+3*x+4*y=0’;
>> [x,y,z]=solve(eq1,eq2,eq3,x,y,z)

x =
-5/37
y =
45/37
z =
165/74

Notice that the order of input variables is not important, while the order of output variables is
important.
The next example solves the nonlinear system
$$y = 2e^x, \qquad y = 3 - x^2.$$
>> clear, syms x y
>> eq1=’y=2*exp(x)’; eq2=’y=3-xˆ2’;
>> [x,y]=solve(eq1,eq2,x,y)
x =
.36104234240225080888501262630700
y =
2.8696484269926958876157155521484

Now, consider the solution of the trigonometric equation $\sin x = \frac{1}{2}$. This has an infinite
number of solutions. The sequence of commands
>> clear, syms x, eq=’sin(x)=1/2’;
>> x=solve(eq,x)

finds only the solution


x =
1/6*pi

To find the solutions within a given interval, say [2,3], we can use fsolve command:
>> clear, x=maple(’fsolve(sin(x)=1/2,x,2..3)’)
x =
2.6179938779914943653855361527329

The result is a character string that can be converted into a double with str2double:
>> z=str2double(x), whos
z =
2.6180
Name Size Bytes Class

ans 1x33 264 double array


x 1x33 66 char array
y 1x1 8 double array
z 1x1 8 double array

Grand total is 68 elements using 346 bytes



The Symbolic Math Toolbox can solve differential equations with the dsolve command. We give
two examples. The first solves a scalar second-order ODE: $y'' = 2y + 1$.
>> dsolve(’D2y = 2*y+1’)
ans =
exp(2^(1/2)*t)*_C2 + exp(-2^(1/2)*t)*_C1 - 1/2
The second solves the previous equation with initial conditions $y(0) = 1$, $y'(0) = 0$.
>> dsolve(’D2y = 2*y+1’,’y(0)=1’,’Dy(0)=0’)
ans =
3/4*exp(2^(1/2)*t) + 3/4*exp(-2^(1/2)*t) - 1/2
The maple command passes command sequences to the Maple kernel. See the appropriate
help and the documentation.
Variable precision arithmetic (vpa). There are three different kinds of arithmetic oper-
ations in this toolbox:
• Numeric – MATLAB floating-point arithmetic.
• Rational – Maple exact symbolic arithmetic;
• VPA – Maple’s variable-precision arithmetic.
Variable-precision arithmetic is done by calling the vpa function. The number of digits
is controlled by the Maple variable Digits. The digits function displays the value of
Digits, and digits(n), where n is an integer, sets Digits to n decimal digits. The function
vpa(a,d) returns the expression a evaluated with d digits of floating-point precision. If d
is not given, then d equals the default precision, digits. The result is of type sym.
For example, the MATLAB statements
>> clear
>> format long
1/2+1/3

use numeric computation to produce


ans =
0.83333333333333

With the Symbolic Math Toolbox, the statement


>> sym(1/2)+1/3

uses symbolic computation to yield


ans =
5/6

And, also with the toolbox, the statements



>> digits(25)
>> vpa(’1/2+1/3’)

use variable-precision arithmetic to return


ans =
.8333333333333333333333333

To convert a variable precision number to double one can use the double function.
The next example compute the golden section
>> digits(32)
>> clear, phi1=vpa((1+sqrt(5))/2)
phi1 =
1.6180339887498949025257388711907
>> phi2=vpa(’(1+sqrt(5))/2’), diff=phi1-phi2
phi2 =
1.6180339887498948482045868343656
diff =
.543211520368251e-16

The discrepancy between phi1 and phi2 is due to the fact that the first assignment performs
its computation in double precision and converts the result to vpa, while the second uses a
string and performs all computations in vpa.
For additional information about Symbolic Math Toolbox, we recommend [62].

Using MuPAD from MATLAB. Version 5 of Symbolic Math Toolbox is powered by the
MuPAD symbolic engine.
• Nearly all Symbolic Math Toolbox functions work the same way as in previous ver-
sions.
• MuPAD notebooks provide a new interface for performing symbolic and variable-
precision calculations, plotting, and animations.
• Symbolic Math Toolbox functions allow you to copy variables and expressions between
the MATLAB workspace and MuPAD notebooks.
• You can call MuPAD functions and procedures from the MATLAB environment.
The next example computes the mean and the variance of a random variable that obeys
an exponential distribution with parameter b > 0. The probability density function (pdf) is
$$f(x; b) = \begin{cases} \frac{1}{b} e^{-x/b}, & \text{if } x > 0; \\ 0, & \text{otherwise.} \end{cases}$$

To open a new MuPAD notebook, at MATLAB prompt type


mupad

The input area is demarcated by a left bracket ([). To set parameters and to input the pdf, type

Figure 1.4: A MuPAD notebook

reset();
assume(b>0);
f:=1/b*exp(-x/b)

Now, the sequence for the computation of mean and variance

meanvalue:=int(x*f, x=0..infinity)
variance:=simplify(int(xˆ2*f, x=0..infinity)-meanvalueˆ2)

Finally, the commands for plotting the pdf for b=2

b:=2: plot(f,x=0..10)

Figure 1.4 shows the MuPAD notebook for this example.

Problems
Problem 1.1. Compute the sum
$$S_n = \sum_{k=1}^{n} \frac{1}{k^2},$$
efficiently, for n = 20, 200. How well does $S_n$ approximate the sum of the series
$$S = \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}?$$

Problem 1.2. Write a function M-file to evaluate the MacLaurin expansion of ln(x + 1):
$$\ln(x+1) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots + (-1)^{n+1}\frac{x^n}{n} + \cdots$$
The convergence holds for $x \in [-1, 1]$. Test your function for $x \in [-0.5, 0.5]$ and check
what happens when x approaches -1 or 1.

Problem 1.3. Write a MATLAB script that reads an integer and converts it into a Roman
numeral.

Problem 1.4. Implement an iterative variant of Euclid’s algorithm for gcd in MATLAB.

Problem 1.5. Implement the binary search in an ordered array in MATLAB.

Problem 1.6. Write a MATLAB code that generates, for a given n, the tridiagonal matrix
$$B_n = \begin{pmatrix} 1 & n & & & & \\ -2 & 2 & n-1 & & & \\ & -3 & 3 & n-2 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -n+1 & n-1 & 2 \\ & & & & -n & n \end{pmatrix}.$$

Problem 1.7. What is the largest value of n such that
$$S_n = \sum_{k=1}^{n} k^2 < L,$$
where L is given? Solve by summation and by using a classical formula for $S_n$.

Problem 1.8. Generate the matrix $H_n = (h_{ij})$, where
$$h_{ij} = \frac{1}{i+j-1}, \qquad i, j = 1, \ldots, n,$$
using the Symbolic Math Toolbox.

Problem 1.9. Build the triangular matrix of binomial coefficients, for powers from 1 to a
given n ∈ N.

Problem 1.10. Develop a MATLAB function that accepts the coordinates of a triangle's
vertices and a level as input parameters and subdivides the triangle recursively into four triangles
using the midpoints of the edges. The subdivision proceeds until the given level is reached.

Problem 1.11. Write a MATLAB function that extracts from a given matrix A a block diagonal
part, where the size of each block is given, and the upper left corner of each block lies
on the main diagonal.

Problem 1.12. One way to compute the exponential function exp is to use its Taylor series
expansion around x = 0. Unfortunately, many terms are required if |x| is large. But a special
property of the exponential is that $e^{2x} = (e^x)^2$. This leads to a scaling and squaring method:
Divide x by 2 repeatedly until |x| < 1/2, use a Taylor series (16 terms should be more than
enough), and square the result repeatedly. Write a function expss(x) that does this. (The
function polyval can help with evaluating the Taylor expansion.) Test your function on x
values -30, -3, 3, 30.

Problem 1.13. Let x and y be column vectors describing the vertices of a polygon (given
in order). Write functions polyperim(x,y) and polyarea(x,y) that compute the
perimeter and area of the polygon. For the area, use a formula based on Green's theorem:
$$A = \frac{1}{2} \sum_{k=1}^{n} (x_k y_{k+1} - x_{k+1} y_k).$$
Here n is the number of vertices and it is understood that $x_{n+1} = x_1$ and $y_{n+1} = y_1$. Test
your functions on a square and an equilateral triangle.
CHAPTER 2

MATLAB Graphics

MATLAB has powerful and versatile graphics capabilities. You can easily generate a large
variety of highly customizable graphs and figures. We do not intend to be exhaustive, but
rather to introduce the reader to those capabilities of MATLAB which will be necessary in the
sequel. The figures and graphs are not only easy to generate, but they can also be easily modified
and annotated, interactively and with specialized functions (see help plotedit). For a deeper
insight into MATLAB graphics we recommend [44, 56, 59, 68, 55].

2.1 Two-Dimensional Graphics


2.1.1 Basic Plots
MATLAB’s plot function can be used for simple “join-the-dots” x-y plots. Typing
>>x=[1.5,2.2,3.1,4.6,5.7,6.3,9.4];
>>y=[2.3,3.9,4.3,7.2,4.5,3.8,1.1];
>>plot(x,y)

produces the left-hand picture in Figure 2.1(a), where the points x(i), y(i) are joined in
sequence. MATLAB opens a figure window (unless one has already opened as a result of a
previous command) in which to draw the picture. In this example, default values are used for
a number of features, including the ranges of the x- and y-axes, the spacing of the axis tick
marks, and the color and type of the line used for the plot.
More generally we could replace plot(x,y) with plot(x,y,string ), where string
combines up to three elements that control the color, marker and line style. For example,
plot(x,y,’r*--’) specifies that a red asterisk is to be placed at each point x(i), y(i)
and the points are to be joined by a red dashed line, whereas plot(x,y,’+y’) specifies
a yellow cross marker with no line joining the points. Table 2.1 lists the options available.


Figure 2.1: Simple x-y plots. (a) Default; (b) nondefault.

The right-hand picture in Figure 2.1 was produced with plot(x,y,’kd:’), which gives
a black dotted line with diamond marker. The three elements in string may appear in any
order, so, for example, plot(x,y,’ms--’) and plot(x,y,’s--m’) are equivalent.
You can exert further control by supplying more arguments to plot. The properties called
Linewidth (default 0.5 points) and MarkerSize (default 6 points) can be specified in
points, where a point is 1/72 inch. For example, the command
>>plot(x,y,'m--^','LineWidth',3,'MarkerSize',5)
produces a plot with a 3-point line width and 5-point marker size. For markers
that have a well-defined interior, the MarkerEdgeColor and MarkerFaceColor can
be set to one of the colors in Table 2.1. So, for example
plot(x,y,'o','MarkerEdgeColor','m')

gives magenta edges to the circles. The left-hand plot in Figure 2.2 was produced with
plot(x,y,'m--^','LineWidth',3,'MarkerSize',5)

and the right-hand plot with

plot(x,y,'--rs','MarkerSize',20,'MarkerFaceColor','g')

Note that more than one set of data can be passed to plot. For example,
plot(x,y,’g-’,b,c,’r--’)
superimposes plots of x(i), y(i) and b(i), c(i) with solid green and dashed red line
styles, respectively.
The plot command accepts matrices as input arguments. If x is an m-vector and Y is an
m × n matrix, then plot(x,Y) plots the graph obtained from x and each column of Y. The
next example plots sin and cos on the same graph. The output is shown in Figure 2.3.
x=(-pi:pi/50:2*pi)’;
Y=[sin(x),cos(x)];
plot(x,Y)
2.1. Two-Dimensional Graphics 53

Figure 2.2: Two nondefault x-y graphs

Color             Marker                     Line style
r   Red           o   Circle                 -    Solid line (default)
g   Green         *   Asterisk               --   Dashed line
b   Blue          .   Point                  :    Dotted line
c   Cyan          +   Plus                   -.   Dash-dot line
m   Magenta       x   Cross
y   Yellow        s   Square
k   Black         d   Diamond
w   White         ^   Upward triangle
                  v   Downward triangle
                  >   Right triangle
                  <   Left triangle
                  p   Five-point star
                  h   Six-point star

Table 2.1: Options for the plot command



xlabel(’x’);
ylabel(’y’);
title(’Sine and cosine’)
box off

Figure 2.3: Two plots on same graph

In this example we used title, xlabel and ylabel. These functions reproduce their
input string above the plot and beside the x- and y-axes, respectively. We also used the
command box off, which removes the box from the current plot, leaving just the x- and
y-axes.
Similarly, if X and Y are matrices of the same size, plot(X,Y) plots graphs obtained
from corresponding columns of X and Y. The example below generates a sawtooth graph (see
Figure 2.4 for output):
>> x = [ 0:3; 1:4 ]; y = [ zeros(1,4); ones(1,4) ];
>> plot(x,y,’b’), axis equal

If the arguments of plot are not real, their imaginary parts are ignored. There is, nevertheless,
an exception when plot has a single argument: if Y is complex, plot(Y) is
equivalent to plot(real(Y),imag(Y)). If Y is real, plot(Y) takes the subscripts
of Y as abscissas and Y itself as ordinates.
If one plotting command is later followed by another then the new picture will either
replace or be superimposed on the old picture, depending on the current hold state. Typing
hold on causes subsequent plots to be superimposed on the current one, whereas hold
off specifies that each new plot should start afresh. The default status correspond to hold
off.
The fplot is a more elaborate version of plot, useful for plotting mathematical func-
tions. It adaptively samples a function at enough points to produce a representative graph.
The general syntax of this function is
fplot(’fun’,lims,tol,N,’LineSpec’,p1,p2,...)

The argument list works as follows.



Figure 2.4: Plot of matrices

• fun is the function to be plotted.

• The x and/or y limits are given by lims.

• tol is a relative error tolerance; the default value of $2 \times 10^{-3}$ corresponds to
0.2% accuracy.

• At least N+1 points will be used to produce the plot.

• LineSpec determines the line type.

• p1, p2, ... are parameters that are passed to fun, which must have input argu-
ments x, p1, p2, . . . .

Here is an example:
fplot('exp(sqrt(x)*sin(12*x))',[0 2*pi])

We can represent polar curves using the polar(t,r) command, where t is the polar angle
and r is the polar radius. An additional string parameter s, if used, has the same meaning as
in plot. The graph of a polar curve called the cardioid, whose equation is

$$r = a(1 + \cos t), \qquad t \in [0, 2\pi],$$

where a is a given constant, is given in Figure 2.5 and is generated by

t=0:pi/50:2*pi;
a=2; r=a*(1+cos(t));
polar(t,r)
title(’cardioid’)

Figure 2.5: Polar graph — cardioid

The fill function works analogously to plot. Typing fill(x,y,c) shades a
polygon whose vertices are specified by the points x(i), y(i). The points are taken in
order, and the last vertex is joined to the first. The color of the shading can be given explicitly,
or as an [r g b] triple. The elements r, g and b, which must be scalars in the range [0,1],
determine the level of red, green and blue, respectively in the shading. So, fill(x,y,[0 1
0]) uses pure green and fill(x,y,[1 0 1]) uses magenta. Specifying equal amounts
of red, green and blue gives a grey shading that can be varied between black ([0 0 0]) and
white ([1 1 1]). The next example plots a gray regular heptagon:

n=7;
t=2*(0:n-1)*pi/n;
fill(cos(t),sin(t),[0.7,0.7,0.7])
axis square

Figure 2.6 gives the resulting picture.


The command clf clears the current figure window, while close closes it. It is possible
to have several figure windows on the screen. The simplest way to create a new figure window
is to type figure. The nth figure window (where n is displayed in the title bar) can be made
current by typing figure(n). The command close all causes all the figure windows
to be closed.
Note that many aspects of a figure can be changed interactively, after the figure has been
displayed, by using the items on the toolbar of the figure window or on the Tools pull-down
menu. In particular, it is possible to zoom on a particular region of the plot using mouse
clicks (see help zoom).
Figure 2.6: Example with fill

2.1.2 Axes and Annotation


Various aspects of the axes of a plot can be controlled with the axis command. The axes
are removed from a plot with axis off. The aspect ratio can be set to unity, so that, for
example, a circle appears circular rather than elliptical, by typing axis equal. The axis
box can be made square with axis square.
Setting axis([xmin xmax ymin ymax]) causes the x-axis to run from xmin to
xmax and the y-axis from ymin to ymax. To return to the default axis scaling, which
MATLAB chooses automatically based on the data being plotted, type axis auto. If you
want one of the limits to be chosen automatically by MATLAB, set it to -inf or inf; for
example, axis([-1 1 -inf 0]). The x-axis and y-axis limits can be set individually
with xlim([xmin xmax]) and ylim([ymin ymax]).
The next example plots the function 1/(x − 1)2 + 3/(x − 2)2 over the interval [0,3]:
x = linspace(0,3,500);
plot(x,1./(x-1).^2+3./(x-2).^2)
grid on

The grid on command produces a light grid of horizontal and vertical dashed lines that
extend from the axis ticks. You can see the result in Figure 2.7(a). Because of the singularities
at x = 1, 2 the plot is not very informative. However, by executing the additional command
ylim([0,50])
Figure 2.7(b) is produced, which focuses on the interesting part of the first plot.
In the following example we plot the epicycloid

x(t) = (a + b) cos(t) − b cos((a/b + 1)t),
y(t) = (a + b) sin(t) − b sin((a/b + 1)t),     0 ≤ t ≤ 10π,

for a = 12 and b = 5.
Figure 2.7: Use of ylim (right) to change automatic (left) y-axis limits.

MATLAB Source 2.1 Epicycloid


a = 12; b=5;
t=0:0.05:10*pi;
x = (a+b)*cos(t)-b*cos((a/b+1)*t);
y =(a+b)*sin(t)-b*sin((a/b+1)*t);
plot(x,y)
axis equal
axis([-25 25 -25 25])
grid on
title(’Epicicloid: $a=12$, $b=5$’,...
’Interpreter’,’LaTeX’,’FontSize’,16)
xlabel(’$x(t)$’,’Interpreter’,’LaTeX’,’FontSize’,14),
ylabel(’$y(t)$’,’Interpreter’,’LaTeX’,’FontSize’,14)

The resulting picture appears in Figure 2.8. The axis limits were chosen to put some space
around the epicycloid.
We can add text to our graphs by using text(x, y, s), where x and y are
the coordinates of the text, and s is a string or a string-type variable. (A related function
gtext allows the text location to be determined interactively via the mouse.) MATLAB
allows s to contain some constructions borrowed from TeX, such as _ for a subscript, ^
for a superscript, or Greek letters (\alpha, \beta, \gamma, etc.). Also, we can select certain
text attributes, like font, font size and so on.
The commands
plot(0:pi/20:2*pi,sin(0:pi/20:2*pi),pi,0,’o’)
text(pi,0,’ \leftarrow sin(\pi)’,’FontSize’,18)

annotate the point of coordinates (π, 0) with the string sin(π). The result is given in Figure
2.9. We may use these capabilities in titles, legends or axis labels, since these are text objects.
Starting from MATLAB 7, text primitives support a strong LATEX subset. The corresponding
Figure 2.8: Epicycloid

Figure 2.9: Using text example



property is called Interpreter and its possible values are TeX, LaTeX or none. For
examples of LaTeX constructions usage see the example with the epicycloid (MATLAB Source
2.1) or the script graphLegendre.m, page 164. See also doc text and the Text
Properties reference page for additional information.
Generally, typing legend('string1','string2',...,'stringn') will cre-
ate a legend box that puts 'stringi' next to the color/marker/line style information for the
corresponding plot. By default, the box appears in the top right-hand corner of the axis area.
The location of the box can be specified by adding an extra argument (see help legend
or doc legend). The next example adds a legend to a graph of the hyperbolic sine and
cosine (output in Figure 2.10):
x = -2:2/25:2;
plot(x,cosh(x),’-ro’,x,sinh(x),’-.b’)
h = legend(’cosh’,’sinh’,4);

Figure 2.10: Graph with a legend

2.1.3 Multiple plots in a figure


MATLAB's subplot allows you to place a number of plots in a grid pattern together on
the same figure. Typing subplot(mnp) or equivalently, subplot(m,n,p), splits the
figure window into an m-by-n array of regions, each having its own axis. The current plotting
commands will then apply to the pth of these regions, where the count moves along the first
row, and then along the second row, and so on. So, for example, subplot(425) splits the
figure window into a 4-by-2 matrix of regions and specifies that plotting commands apply to
the fifth region, that is, the first region in the third row. If subplot(427) appears later,
then the (4,1) position becomes active.
The following example generates the graphs in Figure 2.11.
t = 0:.1:2*pi;
subplot(2,2,1)
plot(cos(t),t.*sin(t))
subplot(2,2,2)
Figure 2.11: Example with subplot

plot(cos(t),sin(2*t))
subplot(2,2,3)
plot(cos(t),sin(3*t))
subplot(2,2,4)
plot(cos(t),sin(4*t))

To complete this section, we list in Table 2.2 the most popular 2D plotting functions in
MATLAB.

2.2 Three-Dimensional Graphics


The function plot3 is the three-dimensional analogue of plot. The following example
illustrates the simple usage: plot3(x,y,z) draws a “join the dots” curve by taking the
points x(i), y(i), z(i) in order. The result is shown in Figure 2.12.

t = 0:pi/50:6*pi;
expt = exp(-0.1*t);
xt = expt.*cos(t); yt = expt.*sin(t);
plot3(xt, yt, t), grid on
xlabel(’x(t)’), ylabel(’y(t)’), zlabel(’z(t)’)
title(’plot3 {\itexample}’,’FontSize’,14)

This example also uses the functions xlabel, ylabel and title, which were dis-
cussed in the previous section and the analogous zlabel. Note that we used the TEX no-
tation in the title command to produce the italic text. The color, marker and line styles for

plot Simple x-y plot


loglog Plot with logarithmically scaled axes
semilogx Plot with logarithmically scaled x-axis
semilogy Plot with logarithmically scaled y-axis
plotyy x-y plot with y-axes on left and right
polar Plot in polar coordinates
fplot Automatic function plot
ezplot Easy-to-use version of fplot
ezpolar Easy-to-use version of polar
fill Polygon fill
area Filled area graph
bar Bar graph
barh Horizontal bar graph
hist Histogram
pie Pie chart
comet Animated, comet-like x-y plot
errorbar Error bar plot
quiver Quiver (velocity vector) plot
scatter Scatter plot

Table 2.2: 2D plotting functions

Figure 2.12: 3D plot created with plot3



plot3 can be controlled in the same way as for plot. The axis limits in 3D are automatically
computed, but they can be changed by
axis([xmin, xmax, ymin, ymax, zmin, zmax])
In addition to xlim and ylim, there is also zlim, which changes the z-axis limits.
To plot a bivariate function z = f(x, y), we need to define it on a rectangular grid {x_i :
i = 1, ..., m} × {y_j : j = 1, ..., n}, z_{i,j} = f(x_i, y_j). All you need is some way to
get your data into a matrix format. If you have a vector of x values, and a vector of y
values, MATLAB provides a useful function called meshgrid that can be used to simplify
the generation of X and Y matrix arrays used in 3-D plots. It is invoked using the form
[X,Y] = meshgrid(x,y), where x and y are vectors that help specify the region in
which coordinates, defined by element pairs of the matrices X and Y, will lie. The matrix X
will contain replicated rows of the vector x, while Y will contain replicated columns of vector
y. The next example will show how meshgrid works:
x = [-1 0 1];
y = [9 10 11 12];
[X,Y] = meshgrid(x,y)

MATLAB returns
X =
-1 0 1
-1 0 1
-1 0 1
-1 0 1
Y =
9 9 9
10 10 10
11 11 11
12 12 12

The command meshgrid(x) is equivalent to meshgrid(x,x).


The mesh function generates a wire-frame plot of a surface. It creates many criss-
crossed lines that look like a net draped over the surface defined by your data. To understand
what the command is plotting, consider three M-by-N matrices, X, Y, and Z, that together
specify coordinates of some surface in a three-dimensional space. A mesh plot of these ma-
trices can be generated with the command mesh(X,Y,Z). Each (x(i,j), y(i,j),
z(i,j)) triplet, corresponding to the element in the ith row and jth column of each of
the X, Y, and Z matrices, is connected to the triplets defined by the elements in neighboring
columns and rows. Vertices defined by triplets created from elements that are not in either an
outer (i.e., first or last) row or column of the matrix will, therefore, be joined to four adjacent
vertices. Vertices on the edge of the surface will be joined to three adjacent ones. Finally, ver-
tices defining the corners of the surface will be joined only to the two adjacent ones. Consider
the following example which will produce the plot shown in Figure 2.13(a).
[X,Y] = meshgrid(linspace(0,2*pi,50),linspace(0,pi,50));
Z = sin(X).*cos(Y);
mesh(X,Y,Z)
xlabel(’x’); ylabel(’y’); zlabel(’z’);
axis([0 2*pi 0 pi -1 1])
Figure 2.13: Surface plotted with (a) mesh and (b) surf

The surf function produces plots similar to those of mesh, except it paints the inside of each
mesh cell with color, so an image of surfaces is created. If, in the previous example, we
replace mesh by surf, we obtain Figure 2.13(b).
The view parameters of a plot may be changed by view([Az,El]) or view(x,y,z).
Here Az is the azimuthal angle and El is the elevation angle (in degrees). The angles are
defined with respect to the axis origin, where the azimuth angle, Az, is in the xy-plane and
the elevation angle, El, is relative to the xy plane. Figure 2.14 depicts how to interpret the
azimuth and elevation angles relative to the plot coordinate system. The default values are

Figure 2.14: The point-of-view in a 3-D plot

Az = −37.5° and El = 30°. The forms of the function view(3) and view(2) will restore
the current plot to the default 3-D or 2-D views respectively. The form [Az,El]=view

returns the current view parameters. Figure 2.15 presents multiple views of the function
peaks (see help peaks), created with the following code.
azrange = -60:20:0;
elrange = 0:30:90;
spr = length(azrange);
spc = length(elrange);
pane = 0;
for az = azrange
    for el = elrange
        pane = 1+pane;
        subplot(spr,spc,pane);
        [x,y,z] = peaks(20);
        mesh(x,y,z);
        view(az,el);
        tstring = ['Az=',num2str(az),...
            ' El=',num2str(el)];
        title(tstring)
        axis off
    end
end

Figure 2.15: Various views (panels for Az = −60, −40, −20, 0 and El = 0, 30, 60, 90)

Contours of a function given on a two-dimensional grid can be plotted by contour. Its basic
synopsis is
contour(X,Y,Z,level)
Here X, Y, and Z have the same meaning as in mesh or surf, and level is a vector
containing the contour levels. level may be replaced by an integer, which will be inter-
preted as the number of contour levels. The next example produces contour plots for the function
Figure 2.16: Two plots generated with contour.

sin(3y − x^2 + 1) + cos(2y^2 − 2x) over the range −2 ≤ x ≤ 2 and −1 ≤ y ≤ 1; the result
can be seen in Figure 2.16.
x=-2:.01:2; y=-1:0.01:1;
[X,Y] = meshgrid(x,y);
Z = sin(3*Y-X.^2+1)+cos(2*Y.^2-2*X);
subplot(2,1,1)
contour(X,Y,Z,20);
subplot(2,1,2)
[C,h]=contour(X,Y,Z,[1.2,0.5,-0.4,-1.1]);
clabel(C,h)

The upper half was generated by choosing 20 contour levels automatically. The lower half
annotates the contours with clabel.
For applications in mechanics which use contour function see [51].
We can improve the appearance of our graphs by using the shading function, which controls
the color shading of surface and patch graphics objects. It has three options:
• shading flat – each mesh line segment and face has a constant color determined
by the color value at the endpoint of the segment or the corner of the face that has the
smallest index or indices.
• shading faceted – flat shading with superimposed black mesh lines. This is the
default shading mode.
• shading interp varies the color in each line segment and face by interpolating the
colors across the line or face.

Figure 2.17 shows the effects of shading on the hyperbolic paraboloid surface given
by the equation z = x^2 − y^2 over the domain [−2, 2] × [−2, 2]. Here is the code:
[X,Y]=meshgrid(-2:0.25:2);
Z=X.^2-Y.^2;
subplot(1,3,1)
surf(X,Y,Z);
axis square
shading flat
title(’shading flat’,’FontSize’,14)

subplot(1,3,2)
surf(X,Y,Z);
axis square
shading faceted
title(’shading faceted’,’FontSize’,14)

subplot(1,3,3)
surf(X,Y,Z);
axis square
shading interp
title(’shading interp’,’FontSize’,14)

Figure 2.17: The effect of shading

The camera toolbar is a set of tools which allow you to change interactively figure, view-

plot3∗ Simple x-y-z plot


contour∗ Contour plot
contourf∗ Filled contour plot
contour3 3D contour plot
mesh∗ Wireframe surface
meshc∗ Wireframe surface plus contours
meshz Wireframe surface with curtain
surf∗ Solid surface
surfc∗ Solid surface plus contours
waterfall Unidirectional wire frame
bar3 3D bar graph
bar3h 3D horizontal bar graph
pie3 3D pie chart
fill3 Polygon fill
comet3 3D animated, comet like plot
scatter3 3D scatter plot
stem3 Stem plot
∗ These functions fun have ezfun counterparts, too

Table 2.3: 3D plotting functions

ing, light and other parameters of your plot. It is available from the figure window.
Table 2.3 summarizes the most popular 3D plotting functions. As the table indicates,
several of the functions have "easy to use" alternative versions with names beginning with
ez.

2.3 Handles and properties


Every rendered object has an identifier or handle. The functions gcf, gca, and gco
return the handles to the active figure, axes, and object (usually the most recently drawn).
Properties can be accessed and changed by the functions get and set, or interactively in
graphical mode (see the Plot Edit Toolbar, Plot Browser, and Property Editor in the figures
View menu). Here is just a taste of what you can do:
>> h = plot(t,sin(t));
>> set(h,’Color’,’m’,’LineWidth’,2,’Marker’,’s’)
>> set(gca,’pos’,[0 0 1 1],’visible’,’off’)
Here is a way to make a “dynamic” graph or simple animation:
clf, axis([-2 2 -2 2]), axis equal
h = line(NaN,NaN,’marker’,’o’,’linestyle’,’-’,’erasemode’,’none’);
t = 6*pi*(0:0.02:1);
for n = 1:length(t)
    set(h,'XData',2*cos(t(1:n)),'YData',sin(t(1:n)))
    pause(0.05)
end

From this example you see that even the data displayed by an object are settable properties
(XData and YData). Among other things, this means you can extract the precise values used
to plot any object from the object itself.
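As a small illustration (a sketch, not from the book), the plotted values can be read back from a line object's YData property:
h = plot(1:10,(1:10).^2);
ydata = get(h,'YData');   % recovers exactly the values that were plotted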
Because of the way handles are used, plots in MATLAB are usually created first in a basic
form and then modified to look exactly as you want. However, it can be useful to change the
default property values that are first used to render an object. You can do this by resetting the
defaults at any level above the target object's type. For instance, to make sure that all future
Text objects in the current figure have font size 10, enter
>> set(gcf,'defaulttextfontsize',10)
All figures are also considered to be children of a root object that has handle 0, so this
can be used to create global defaults. Finally, you can make it a global default for all future
MATLAB sessions by placing the set command in a file called startup.m in a certain
directory (see the documentation).

2.4 Saving and Exporting Graphs


The print command allows you to send your graph to a printer or to save it on disk in a
graphical format or as an M-file. The syntax is:
print -ddevice -options filename
It has several options; type help print to see them. Among the devices, we mention here
deps for Encapsulated PostScript, dpng for the Portable Network Graphics format and
djpegnn for a JPEG image at quality level nn.
If you have set your printer properly, the command print will send the content of your
figure to it. The command
print -deps2 myfig.eps
creates an encapsulated level 2 black and white PostScript file myfig.eps that can sub-
sequently be printed on a PostScript printer or included in a document. This file can be
incorporated into a LATEX document, as in the following outline:

\documentclass{article}
\usepackage[dvips]{graphics}

...
\begin{document}
...
\begin{figure}
\begin{center}
\includegraphics[width=8cm]{myfig.eps}
\end{center}
\caption{...}
\end{figure}
...
\end{document}

The saveas command saves a figure to a file in a form that can be reloaded into MAT-
LAB. For example,

saveas(gcf,’myfig’,’fig’)

saves the current figure as a binary FIG-file, which can be reloaded into MATLAB with
open(’myfig.fig’).
It is also possible to save and print figures from the pulldown File menu in the figure
window.

2.5 Application - Snail and Shell Surfaces


Consider the family of parametric surfaces:

x(u, v) = e^{wu} (h + a cos v) cos(cu),
y(u, v) = e^{wu} (h + a cos v) sin(cu),        (2.5.1)
z(u, v) = e^{wu} (k + b sin v),

where u ∈ [u_min, u_max], v ∈ [0, 2π], and a, b, c, h, k, w, and R are given parameters. R is a
direction parameter and may take only the values ±1, and the endpoints of the u interval, u_min
and u_max, are given. These surfaces are used to model the shapes of some snail houses and
sea shells.
The MATLAB function SnailsandShells (see MATLAB Source 2.2) computes the
points of a surface given by equations (2.5.1).

MATLAB Source 2.2 Snail/Shell surface


function [X,Y,Z]=SnailsandShells(a,b,c,h,k,w,umin,umax,R,nu,nv)
%SNAILSANDSHELLS - plot snails & shell surface
%call [X,Y,Z]=SNAILSANDSHELLS(A,B,C,H,K,W,UMIN,UMAX,R,NU,NV)
%R =+/-1 direction
%a, b, c, h, k, w - shape parameters
%umin, umax - interval for u
%nu,nv - number of points

if nargin<11, nv=100; end


if nargin<10, nu=100; end
if nargin<9, R=1; end
v=linspace(0,2*pi,nv);
u=linspace(umin,umax,nu);
[U,V]=meshgrid(u,v);
ewu=exp(w*U);
X=(h+a*cos(V)).*ewu.*cos(c*U);
Y=R*(h+a*cos(V)).*ewu.*sin(c*U);
Z=(k+b*sin(V)).*ewu;

We illustrate the usage of this source to model the house of Pseudoheliceras subcatena-
tum:
%Pseudoheliceras subcatenatum

a=1.6; b=1.6;c=1; h=1.5; k=-7.0;


w=0.075; umin=-50; umax=-1;

[X,Y,Z]=SnailsandShells(a,b,c,h,k,w,umin,umax,1,512,512);
surf(X,Y,Z)
view(87,21)
shading interp
camlight right
light(’Position’,[3,-0.5,-4])
axis off

To obtain a pleasant-looking graph we invoked light and camera functions. The light
function adds a light object to the current axes. Here, only the Position property of the light
object is used. The camlight function creates a light and sets its position. For details on
lighting see the corresponding help entries or [59].
We recommend the user to try the following funny examples (we give the species and
parameters):

• Nautilus, with a = 1, b = 0.6, c = 1, h = 1, k = 0, w = 0.18, umin = −20, and


umax = 1.

• Natica stelata, with a = 2.6, b = 2.4, c = 1.0, h = 1.25, k = −2.8, w = 0.18,


umin = −20, and umax = 1.0.

• Mya arenaria, with a = 0.85, b = 1.6, c = 3.0, h = 0.9, k = 0, w = 2.5, umin = −1,
and umax = 0.52.

• Euhoplites, with a = 0.6, b = 0.4, c = 1.0, h = 0.9, k = 0.0, w = 0.1626,


umin = −40, and umax = −1.

• Bellerophina, with a = 0.85, b = 1.2, c = 1.0, h = 0.75, k = 0.0, w = 0.06,


umin = −10, and umax = −1.

• Astroceras, with a = 1.25, b = 1.25, c = 1.0, h = 3.5, k = 0, w = 0.12, umin = −40,


and umax = −1.

See the front and the back cover for a plot of some of them.
For details on geometry of this kind of surfaces see [36, 63].

Problems
Problem 2.1 (Lissajous curve (Bodwitch)). Plot the parametric curve

α(t) = (a sin(nt + c), b sin t), t ∈ [0, 2π],

for (a) a = 2, b = 3, c = 1, n = 2, (b) a = 5, b = 7, c = 9, n = 4, (c) a = b = c = 1,


n = 10.

Problem 2.2 (Butterfly curve). Plot the polar curve

r(t) = e^{cos t} − a cos(bt) + sin^5(ct),

for (a) a = 2, b = 4, c = 1/12, (b) a = 1, b = 2, c = 1/4, (c) a = 3, b = 1, c = 1/2.
Experiment with t ranging over various intervals of the form [0, 2kπ], k ∈ N.
Problem 2.3 (Spherical rodonea). Plot the three-dimensional curve

α(t) = a(sin(nt) cos t, sin(nt) sin t, cos(nt)),

for (a) a = n = 2, (b) a = 1/2, n = 1, (c) a = 3, n = 1/4, (d) a = 1/3, n = 5.
Problem 2.4. [27] Play the “chaos game”. Let P1 , P2 , and P3 be the vertices of an equilateral
triangle. Start with a point anywhere inside the triangle. At random, pick one of the three
vertices and move halfway toward it. Repeat indefinitely. If you plot all the points obtained,
a very clear pattern will emerge. (Hint: This is particularly easy to do if you use complex
numbers. If z is complex, then plot(z) is equivalent to plot(real(z),imag(z)).)
Problem 2.5. Make surface plots of the following functions over the given ranges.

(a) (x^2 + 3y^2) e^{−x^2−y^2}, −3 ≤ x ≤ 3, −3 ≤ y ≤ 3.

(b) −3y/(x^2 + y^2 + 1), |x| ≤ 2, |y| ≤ 4.

(c) |x| + |y|, |x| ≤ 1, |y| ≤ 1.
Problem 2.6. Make contour plots of the functions in the previous exercise.
Problem 2.7. [27] Make a contour plot of

f(x, y) = e^{−4x^2+2y^2} cos 8x + e^{−3((2x+1/2)^2+2y^2)}
for −1.5 < x < 1.5, −2.5 < y < 2.5, showing only the contour at the level f (x, y) = 0.001.
You should see a friendly message.
Problem 2.8 (Twisted sphere (Corkscrew), [70]). Plot the parametric surface
χ(u, v) = (a cos u cos v, a sin u cos v, a sin v + bu)
where (u, v) ∈ [0, 2π) × [−π, π), for (a) a = b = 1, (b) a = 3, b = 1, (c) a = 1, b = 0, (d)
a = 1, b = −3/2.
Problem 2.9 (Helicoid, [70]). Plot the parametric surface given by
χ(u, v) = (av cos u, bv sin u, cu + ev)
where (u, v) ∈ [0, 2π) × [−d, d), for (a) a = 2, b = c = 1, e = 0, (b) a = 3, b = 1, c = 2,
e = 1.
Problem 2.10. Plot the parametrical surface
χ(u, v) = (u(3 + cos(v)) cos(2u), u(3 + cos(v)) sin(2u), u sin(v) − 3u),
for 0 ≤ u ≤ 2π, 0 ≤ v ≤ 2π.
CHAPTER 3

Errors and Floating Point Arithmetic

The evaluation of computation errors is one of the main goals of Numerical Analysis. Several
types of error can affect the accuracy:

1. Input data error ;

2. Rounding error ;

3. Approximation error.

Input data errors are out of computation control. They are due, for example, to the inherent
imperfections of physical measures.
Rounding errors are caused by the fact that, as usual, we perform our computations using a
finite representation.
For the third error type, many methods do not provide the exact solution of a given prob-
lem P, even if the computation is carried out exactly (without rounding), but rather the so-
lution of a simpler problem P̃ which approximates P. As an example we consider the
summation of an infinite series:

e = 1 + 1/1! + 1/2! + 1/3! + ···,

which could be replaced with a simpler problem P̃ consisting of the summation of a finite
number of series terms. Such an error is called truncation error (nevertheless, this name
is also used for rounding errors obtained by removing the last digits of the representation
– chopping). Many approximation problems result by "discretizing" the original problem
P: definite integrals are approximated by finite sums, derivatives by differences, and so on.
Some authors extend the term "truncation error" to cover also the discretization error.


The aim of this chapter is to study the overall effect of input error and rounding error
on a computational result. The approximation errors will be discussed when we expose the
numerical methods individually.

3.1 Numerical Problems


A numerical problem is a combination of a constructive mathematical problem (MP) and a
precision specification (PS).

Example 3.1.1. Let f : R → R and x ∈ R. We wish to compute y = f(x). Generally,
x is not exactly representable inside the computer; for this reason we shall use an approximation
x* ≈ x. Also, it is possible that f cannot be computed exactly; we shall replace f with an
approximation f_A. The computed value will be f_A(x*). So our numerical problem is:

MP. Given x and f , compute f (x).

PS. |f (x) − fA (x∗ )| < ε, for a given ε. ♦

3.2 Error Measuring


Definition 3.2.1. Let X be a normed linear space, A ⊆ X and x ∈ X. An element x∗ ∈ A
is an approximation of x from A (notation x∗ ≈ x).

Definition 3.2.2. If x* is an approximation of x, the difference Δx = x − x* is called error,
and

‖Δx‖ = ‖x* − x‖        (3.2.1)

is the absolute error.

Definition 3.2.3. The quantity

δx = ‖Δx‖ / ‖x‖,   x ≠ 0,        (3.2.2)

is called a relative error.

Remark 3.2.4.

1. Since x is unknown in practice, one uses the approximation δx = ‖Δx‖/‖x*‖. If ‖Δx‖ is
small relative to x*, then the approximation is accurate.

2. If X = R, then it is customary to use δx = Δx/x and Δx = x* − x. ♦

3.3 Propagated error


Let f : R^n → R, x = (x_1, ..., x_n) and x* = (x_1^*, ..., x_n^*). We want to evaluate
the absolute error Δf and the relative error δf, respectively, when f(x) is approximated by
f(x*). These are propagated errors, because they describe how the initial error (absolute
or relative) is propagated during the computation of f. Suppose x = x* + Δx, where
Δx = (Δx_1, ..., Δx_n). For the absolute error, using Taylor's formula we obtain

Δf = f(x_1^* + Δx_1, ..., x_n^* + Δx_n) − f(x_1^*, ..., x_n^*)
   = Σ_{i=1}^{n} Δx_i (∂f/∂x_i^*)(x_1^*, ..., x_n^*)
     + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Δx_i Δx_j (∂²f/∂x_i^* ∂x_j^*)(θ),

where θ ∈ [(x_1^*, ..., x_n^*), (x_1^* + Δx_1, ..., x_n^* + Δx_n)].

If the quantities Δx_i are sufficiently small, then the products Δx_i Δx_j are negligible with
respect to Δx_i, and we have

Δf ≈ Σ_{i=1}^{n} Δx_i (∂f/∂x_i^*)(x_1^*, ..., x_n^*).        (3.3.1)

Analogously, for the relative error,

δf = Δf/f ≈ Σ_{i=1}^{n} Δx_i (∂f/∂x_i^*)(x*) / f(x*)
   = Σ_{i=1}^{n} Δx_i (∂/∂x_i^*) ln f(x*)
   = Σ_{i=1}^{n} x_i^* δx_i (∂/∂x_i^*) ln f(x*).

Thus

δf = Σ_{i=1}^{n} x_i^* (∂/∂x_i^*) ln f(x*) δx_i.        (3.3.2)

The inverse problem is also of great importance: what accuracy is needed for the input
data such that the result has a desired accuracy? That is, given ε > 0, how small must Δx_i or
δx_i, i = 1, ..., n, be such that Δf or δf < ε? A solution method is based on the equal
effects principle: one supposes that all terms which appear in (3.3.1) or (3.3.2) have the same
effect, i.e.,

(∂f/∂x_1^*)(x*) Δx_1 = ... = (∂f/∂x_n^*)(x*) Δx_n.

Then (3.3.1) implies

Δx_i ≈ Δf / ( n (∂f/∂x_i^*)(x*) ).        (3.3.3)

Analogously,

δx_i = δf / ( n x_i^* (∂/∂x_i^*) ln f(x*) ).        (3.3.4)

3.4 Floating-Point Representation


3.4.1 Parameters
Several different representations of real numbers have been proposed, but by far the most
widely used is the floating-point representation. The floating-point representation parameters
are a base β (which is always assumed to be even), a precision p, and a largest and a smallest
allowable exponent, emax and emin , all being natural numbers. In general, a floating-point
number will be represented as
x = ±d_0.d_1 d_2 ... d_{p−1} × β^e,   0 ≤ d_i < β,        (3.4.1)

where d_0.d_1 d_2 ... d_{p−1} is called the significand or mantissa, and e is the exponent. The value
of x is

±(d_0 + d_1 β^{−1} + d_2 β^{−2} + ··· + d_{p−1} β^{−(p−1)}) β^e.        (3.4.2)
In order to achieve uniqueness of representation, the floating-point numbers are normal-
ized, that is, we change the representation, not the value, such that d_0 ≠ 0. Zero is represented
as 1.0 × β^{e_min−1}. Thus, the numerical ordering of nonnegative real numbers corresponds to
the lexicographical ordering of their floating-point representations with the exponent stored to
the left of the significand.
The term floating-point number will be used to mean a real number that can be exactly
represented in this format. Each interval [β^e, β^{e+1}) in R contains exactly β^p floating-point

Figure 3.1: The distribution of normalized floating point numbers on the real axis without
denormalization

numbers (the number of all possible significands). The interval (0, β^{e_min}) is empty; for this
reason the denormalized numbers are introduced, i.e. numbers whose significand has the
form 0.d_1 d_2 ... d_{p−1} and whose exponent is e_min − 1. The availability of denormalization
is an additional parameter of the representation. The set of floating-point numbers for a fixed set
of representation parameters will be denoted

F(β, p, e_min, e_max, denorm),   denorm ∈ {true, false}.

This set is not equal to R because:

1. it is a finite subset of Q;

2. for x ∈ R it is possible to have |x| > β × β^{e_max} (overflow) or |x| < 1.0 × β^{e_min}
(underflow).

Figure 3.2: The distribution of normalized floating point numbers on the real axis with de-
normalization

The usual arithmetic operations on F(β, p, e_min, e_max, denorm) are denoted by ⊕, ⊖, ⊗,
⊘, and the names of the usual functions are capitalized: SIN, COS, EXP, LN, SQRT, and so on.
(F, ⊕, ⊗) is not a field, since

(x ⊕ y) ⊕ z ≠ x ⊕ (y ⊕ z),   (x ⊗ y) ⊗ z ≠ x ⊗ (y ⊗ z),
(x ⊕ y) ⊗ z ≠ x ⊗ z ⊕ y ⊗ z.

In order to measure the error one uses the relative error and ulps – units in the last place.
If the number z is represented as d_0.d_1 d_2 ... d_{p−1} × β^e, then the error is

|d_0.d_1 d_2 ... d_{p−1} − z/β^e| β^{p−1} ulps.

The relative error that corresponds to 1/2 ulp satisfies

(1/2) β^{−p} ≤ (1/2) ulp ≤ (β/2) β^{−p},

since 1/2 ulp = 0.0...0β′ × β^e, with p − 1 zeros after the point and β′ = β/2. The value
eps = (β/2) β^{−p} is referred to as machine epsilon.
p
The default rounding obeys the even digit rule: if x = d_0.d_1 ... d_{p−1} d_p ... and d_p > β/2
then the rounding is upward, if d_p < β/2 the rounding is downward, and if d_p = β/2 and among
the removed digits there exists a nonzero one, the rounding is upward; otherwise, the last
preserved digit is even. If fl denotes the rounding operation, we can define the floating-point
arithmetic operations by

x ⊚ y = fl(x ◦ y).        (3.4.3)
Another kind of rounding can be chosen: to −∞, to +∞, to 0 (chopping). During the
reasoning concerning the floating-point operations we shall use the following model

∀x, y ∈ F, ∃δ with |δ| < eps such that x ⊚ y = (x ◦ y)(1 + δ). (3.4.4)

Intuitively, each floating point arithmetic operation is exact within a boundary of at most eps
for the relative errors.
The formula (3.4.4) is called the fundamental axiom of floating-point arithmetic.
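A quick MATLAB check of this granularity (a sketch, not from the book): adding half of eps to 1 is lost to rounding, while adding eps is not.
>> (1 + eps/2) == 1
ans =
     1
>> (1 + eps) == 1
ans =
     0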

3.4.2 Cancelation
From formulae (3.3.2) for the relative error, if x ≈ x(1 + δx) and y ≈ y(1 + δy), we obtain
the relative errors for the floating-point arithmetic operations:

δ_{xy} = δx + δy        (3.4.5)
δ_{x/y} = δx − δy        (3.4.6)
δ_{x+y} = (x/(x + y)) δx + (y/(x + y)) δy        (3.4.7)

Only the subtraction of two nearby quantities x ≈ y is critical; in this case, δ_{x−y} → ∞. This phe-
nomenon is called cancelation and is depicted in Figure 3.3. There b, b′, b″ stand for binary
digits which are reliable, and the g's represent binary digits contaminated by errors (garbage
digits). Note that garbage − garbage = garbage but, more importantly, the normalization of
the result moves the first garbage digit from the 12th position to the 3rd.

x = 1 0 1 1 0 0 1 0 1 b b g g g g e
y = 1 0 1 1 0 0 1 0 1 b′ b′ g g g g e
x-y = 0 0 0 0 0 0 0 0 0 b′′ b′′ g g g g e
= b′′ b′′ g g g g ? ? ? ? ? ? ? ? ? e-9

Figure 3.3: The cancelation phenomenon

There are two kinds of cancelation: benign, when subtracting exactly known quantities, and
catastrophic, when the subtraction operands are subject to rounding errors. The programmer
must be aware of the possibility of its occurrence and must try to avoid it. The ex-
pressions which lead to cancelation must be rewritten, and a catastrophic cancelation must be
converted into a benign one. We shall give some examples in the sequel.

Example 3.4.1. If a ≈ b, then the expression a2 − b2 is rewritten into (a − b)(a + b). The
initial form is preferred when a ≫ b or b ≫ a. ♦
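A quick MATLAB illustration (a sketch, not from the book) with nearby a and b:
a = 1 + 1e-8; b = 1;
bad  = a^2 - b^2     % the subtraction cancels the leading digits
good = (a-b)*(a+b)   % rewritten form; here a-b is even computed exactly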

Example 3.4.2. If cancelation appears within an expression containing square roots, then we
rewrite:

√(x + δ) − √x = δ / (√(x + δ) + √x),   δ ≈ 0. ♦
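In MATLAB (a sketch, not from the book), the difference between the two forms is already visible for moderate values:
x = 1e8; delta = 1e-4;
bad  = sqrt(x+delta) - sqrt(x)          % suffers from cancelation
good = delta/(sqrt(x+delta) + sqrt(x))  % benign rewritten form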

Example 3.4.3. The difference of two values of the same function at nearby arguments is
rewritten using a Taylor expansion:

f(x + δ) − f(x) = δ f′(x) + (δ²/2) f″(x) + ···,   f ∈ C^n[a, b]. ♦

Example 3.4.4. The solution of a quadratic equation ax² + bx + c = 0 can involve catas-
trophic cancelation when b² ≫ 4ac. The usual formulae

x_1 = (−b + √(b² − 4ac)) / (2a)        (3.4.8)
x_2 = (−b − √(b² − 4ac)) / (2a)        (3.4.9)

can lead to cancelation as follows: for b > 0 the cancelation affects the computation of x_1;
for b < 0, that of x_2. We can correct the situation using the conjugate:

x_1 = 2c / (−b − √(b² − 4ac))        (3.4.10)
x_2 = 2c / (−b + √(b² − 4ac)).       (3.4.11)

For the first case we use formulae (3.4.10) and (3.4.9); for the second case, (3.4.8) and
(3.4.11). Consider the quadratic equation with a = 1, b = −100000000, and c = 1. Ap-
plying formulas (3.4.8) and (3.4.9) we obtain:

>>a=1; c=1; b=-100000000;
>>x1=(-b+sqrt(b^2-4*a*c))/(2*a)
x1 =
  100000000
>>x2=(-b-sqrt(b^2-4*a*c))/(2*a)
x2 =
  7.45058059692383e-009

If we amplify by the conjugate to compute x2 we have:

>>x1=(-b+sqrt(b^2-4*a*c))/(2*a)
x1 =
  100000000
>> x2a=2*c/(-b+sqrt(b^2-4*a*c))
x2a =
  1e-008

The function roots yields the same results. ♦



3.5 IEEE Standard


There are two different standards for floating point computation: IEEE 754 that require β = 2
and IEEE 854 that allows either β = 2 or β = 10, but it is more permissive concerning
representation.
We deal only with the first standard. Table 3.1 gives its parameters.

Format
Parameter Single Single Extended Double Double extended
p 24 ≥ 32 53 ≥ 64
emax +127 ≥ +1023 +1023 ≥ +16383
emin -126 ≤ −1022 -1022 ≤ −16382
Exponent width 8 ≥ 11 11 ≥ 15
Number width 32 ≥ 43 64 ≥ 79

Table 3.1: IEEE 754 Format Parameters

Why extended formats?

1. A better precision.

2. The conversion from binary to decimal and then back to binary needs 9 digits in single
precision and 17 digits in double precision.

The relation |e_min| < e_max is motivated by the fact that 1/2^{e_min} must not lead to over-
flow.
The operations ⊕, ⊖, ⊗, ⊘ must be exactly rounded. The accuracy is achieved using two
guard digits and a third, sticky bit.
The exponent is biased, i.e. instead of e the standard represents e + D, where D is fixed
when the format is chosen.
For IEEE 754 single precision, D = 127, and for double precision, D = 1023.

3.5.1 Special Quantities


The IEEE standard specifies the following special quantities:
Exponent               Significand    Represents
e = e_min − 1          f = 0          ±0
e = e_min − 1          f ≠ 0          0.f × 2^{e_min}
e_min ≤ e ≤ e_max      any f          1.f × 2^e
e = e_max + 1          f = 0          ±∞
e = e_max + 1          f ≠ 0          NaN
NaN. In fact we have a family of NaNs. The illegal and indeterminate operations lead to
NaN: ∞ + (−∞), 0 × ∞, 0/0, ∞/∞, x REM 0, ∞ REM y, √x for x < 0. If one operand
is a NaN, the result is a NaN too.

Infinity. 1/0 = ∞, −1/0 = −∞. The infinite values allow the continuation of computa-
tion when an overflow occurs. This is safer than aborting or returning the largest representable
number. For example, x/(1 + x²) for x = ∞ gives 0.
Signed Zero. We have two zeros: +0, −0; the relations +0 = −0 and −0 < +∞ hold.
Advantages: simpler treatment of underflow and discontinuity. We can make a distinction
between log 0 = −∞ and log x = NaN for x < 0. Without signed zero we cannot make
any distinction between the logarithm of a negative number, which leads to overflow, and the
logarithm of 0.

3.6 MATLAB Floating-Point Arithmetic


MATLAB mainly uses the IEEE double-precision format. Starting from MATLAB 7, there is
support for single-precision IEEE arithmetic. There is no distinction between integer and real
numbers. The command format hex is useful for displaying the floating-point representation.
For example, one obtains the representations of 1, −1, 0.1 and the golden ratio, φ = (1 + √5)/2,
respectively, by using
>> format hex
>> 1,-1
ans =
3ff0000000000000
ans =
bff0000000000000
>> 0.1
ans =
3fb999999999999a
>> phi=(1+sqrt(5))/2
phi =
3ff9e3779b97f4a8

One considers that the fraction f satisfies 0 ≤ f < 1, and the exponent −1022 ≤ e ≤ 1023.
We can characterize the MATLAB floating-point system by three constants: realmin,
realmax, and eps. realmin is the smallest normalized floating-point number. Any quan-
tity less than it either represents a denormalized number or produces an underflow. realmax
is the largest representable floating-point number. Anything larger than it produces an over-
flow. The values of these constants are

           binary               decimal                  hexadecimal
eps        2^{-52}              2.220446049250313e-016   3cb0000000000000
realmin    2^{-1022}            2.225073858507201e-308   0010000000000000
realmax    (2-eps)×2^{1023}     1.797693134862316e+308   7fefffffffffffff

Functions in sources 3.1 and 3.2 compute eps. It is instructive to type each line of the
second variant individually.
It is interesting to try:

MATLAB Source 3.1 Computation of eps - 1st variant


function eps=myeps1
eps = 1;
while (1+eps) > 1
    eps = eps/2;
end
eps = eps*2;

MATLAB Source 3.2 Computation of eps - 2nd variant


function z=myeps2
x=4/3-1;
y=3*x;
z=1-y;

>> format long


>> 2*realmax
ans =
Inf
>> realmin*eps
ans =
4.940656458412465e-324
>> realmin*eps/2
ans =
0

or in hexadecimal:
>> format hex
>> 2*realmax
ans =
7ff0000000000000
>> realmin*eps
ans =
0000000000000001
>> realmin*eps/2
ans =
0000000000000000

The denormalized numbers are within interval [eps*realmin , realmin]. They are rep-
resented taking e = −1023. The displacement is D = 1023. The infinity, Inf, is represented
taking e = 1024 and f = 0, and NaN with e = 1024 and f 6= 0. For example,
>> format short
>> Inf-Inf
ans =
NaN
>> Inf/Inf

ans =
NaN

Two MATLAB functions that take apart and put together floating-point numbers are
log2 and pow2. The expression [F,E]=log2(X) for a real array X, returns an array
F of real numbers, usually in the range 0.5 <= abs(F) < 1, and an array E of inte-
gers, so that X = F.*2.^E. Any zeros in X produce F = 0 and E = 0. The expression
X=pow2(F,E) for a real array F and an integer array E computes X = F.*2.^E. The re-
sult is computed quickly by simply adding E to the floating-point exponent of F. In IEEE
arithmetic, the command [F,E] = log2(X) yields:

F E X
1/2 1 1
pi/4 2 pi
-3/4 2 -3
1/2 -51 eps
1-eps/2 1024 realmax
1/2 -1021 realmin

and X = pow2(F,E) recovers X.
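For instance (a quick check, not from the book):
>> [F,E] = log2(10)
F =
    0.6250
E =
     4
>> pow2(F,E)     % reassembles 0.625*2^4 = 10
ans =
    10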


We have seen in §1.5.4 that in MATLAB 7 there exist support for single-precision floating-
point numbers (the function single). For example,

>> a = single(5);

assigns to a the single-precision floating-point representation of 5. We can compare this with


the corresponding double-precision representation:

>> b=5;
>> whos
Name Size Bytes Class

a 1x1 4 single array


b 1x1 8 double array

Grand total is 2 elements using 12 bytes

>> format hex


>> a,b
a =
40a00000
b =
4014000000000000

The conversion from double to single may affect the value, due to rounding:

>> format long


>> single(3.14)
ans =
3.1400001

The result of a binary operation on two single operands has type single. The
result of a binary operation between a single and a double also has type single, as the
examples below show:
>> x = single(2)*single(3)
x =
6
>> class(x)
ans =
single
>> x = single(8)+3
x =
11
>> class(x)
ans =
single
We may call eps, realmin, and realmax for single-precision too:
>> eps(’single’)
ans =
1.1921e-007
>> realmin(’single’),realmax(’single’)
ans =
1.1755e-038
ans =
3.4028e+038
Starting from MATLAB 7, the function eps gives the distance between floating-point
numbers. The command d=eps(x), where x is a single or double floating-point num-
ber, gives the distance from abs(x) to the closest floating-point number greater than abs(x)
and having the same precision as x. Example:
>> format long
>> eps
ans =
2.220446049250313e-016
>> eps(5)
ans =
8.881784197001252e-016
eps(’double’) is equivalent to eps or to eps(1.0), and eps(’single’) is equiv-
alent to eps(single(1.0)).
Clearly, the distance between single-precision floating-point numbers is greater than the
distance between double-precision floating-point numbers. For example,
>> x = single(5)
>> eps(x)
returns
ans =
4.7683716e-007
that is larger than eps(5).

3.7 The Condition of a Problem


We may think of a problem as a map

f : R^m → R^n,   y = f(x).        (3.7.1)

We are interested in the sensitivity of the map f at some given point x to a small per-
turbation of x, that is, how much bigger (or smaller) the perturbation in y is compared to
the perturbation in x. In particular, we wish to measure the degree of sensitivity by a single
number – the condition number of the map f at the point x. The function f is assumed to
be evaluated exactly, with infinite precision, as we perturb x. The condition of f , therefore,
is an inherent property of the map f and does not depend on any algorithmic considerations
concerning its implementation.
It does not mean that the knowledge of the condition of a problem is irrelevant to any
algorithmic solution of the problem. On the contrary! The reason is that quite often the com-
puted solution y ∗ of (3.7.1) (computed in floating point machine arithmetic, using a specific
algorithm) can be demonstrated to be the exact solution of a “nearby” problem; that is

y ∗ = f (x∗ ) (3.7.2)

where
x∗ = x + δ (3.7.3)
and moreover, the distance kδk = kx∗ − xk can be estimated in terms of the machine preci-
sion. Therefore, if we know how strongly or weakly the map f reacts to small perturbation,
such as δ in (3.7.3), we can say something about the error y ∗ − y in the solution caused by
the perturbation.
We can consider more general spaces for f , but for practical implementation the finite
dimensional spaces are sufficient.
Let

x = [x1 , . . . , xm ]T ∈ Rm , y = [y1 , . . . , yn ]T ∈ Rn ,
yν = fν (x1 , . . . , xm ), ν = 1, . . . , n.

We think of y_ν as a function of the single variable x_µ and define

γ_{νµ} = (cond_{νµ} f)(x) = | x_µ (∂f_ν/∂x_µ)(x) / f_ν(x) |.        (3.7.4)

These give us an n × m matrix of condition numbers

Γ(x) = [γ_{νµ}(x)],   ν = 1, ..., n,  µ = 1, ..., m,        (3.7.5)

whose (ν, µ) entry is x_µ (∂f_ν/∂x_µ)(x) / f_ν(x),

and we shall consider as condition number

(cond f)(x) = ‖Γ(x)‖.        (3.7.6)

Another approach. We consider the norm ‖·‖_∞. Since

Δy_ν ≈ Σ_{µ=1}^{m} (∂f_ν/∂x_µ) Δx_µ   (= f_ν(x + Δx) − f_ν(x)),

we have

|Δy_ν| ≤ Σ_{µ=1}^{m} |∂f_ν/∂x_µ| |Δx_µ| ≤ max_µ |Δx_µ| Σ_{µ=1}^{m} |∂f_ν/∂x_µ|
       ≤ max_µ |Δx_µ| · max_ν Σ_{µ=1}^{m} |∂f_ν/∂x_µ|.

Therefore

‖Δy‖_∞ ≤ ‖∂f/∂x‖_∞ ‖Δx‖_∞,        (3.7.7)

where

∂f/∂x = J(x) = [∂f_i/∂x_j],   i = 1, ..., n,  j = 1, ..., m,        (3.7.8)

is the n × m Jacobian matrix of f, and hence

‖Δy‖_∞ / ‖y‖_∞ ≤ ( ‖x‖_∞ ‖∂f/∂x‖_∞ / ‖f(x)‖_∞ ) · ( ‖Δx‖_∞ / ‖x‖_∞ ).        (3.7.9)

If m = n = 1, then both approaches lead to

(cond f)(x) = | x f′(x) / f(x) |,

for x ≠ 0, y ≠ 0.
If x = 0 ∧ y ≠ 0, then we take the absolute error for x and the relative error for y:

(cond f)(x) = | f′(x) / f(x) |.

For y = 0 ∧ x ≠ 0 we take the absolute error for y and the relative error for x. For
x = y = 0,

(cond f)(x) = |f′(x)|.

Example 3.7.1 (Systems of linear algebraic equations). Given a nonsingular square ma-
trix A ∈ Rn×n and a vector b ∈ Rn solve the system

Ax = b. (3.7.10)

Here the input data are the elements of A and b, and the result is the vector x. To simplify
matters let’s assume that A is a fixed matrix not subject to change, and only b is undergoing
perturbations. We have a map f : R^n → R^n given by

x = f(b) := A^{−1} b,

which is linear. Therefore ∂f/∂b = A^{−1} and, using (3.7.9),

(cond f)(b) = ‖b‖ ‖A^{−1}‖ / ‖A^{−1}b‖ = ‖Ax‖ ‖A^{−1}‖ / ‖x‖,
max_{b≠0} (cond f)(b) = max_{x≠0} ( ‖Ax‖ / ‖x‖ ) ‖A^{−1}‖ = ‖A‖ ‖A^{−1}‖.        (3.7.11) ♦

The number ‖A‖ ‖A^{−1}‖ is called the condition number of the matrix A and we denote it by
cond A:

cond A = ‖A‖ ‖A^{−1}‖.
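MATLAB computes this quantity with the function cond (by default in the 2-norm). A small illustration, not from the book:
A = hilb(8);   % a classically ill-conditioned (Hilbert) matrix
cond(A)        % = norm(A)*norm(inv(A)), of order 1.5e+10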

3.8 The Condition of an algorithm


Let us consider the problem

f : Rm → Rn , y = f (x). (3.8.1)

Along with the problem f , we are also given an algorithm A that solves the problem.
That is, given a vector x ∈ F^m(β, p, e_min, e_max, denorm), the algorithm A produces a vector
y_A (in floating-point arithmetic) that is supposed to approximate y = f(x). Thus we have
another map f_A describing how the problem f is solved by the algorithm A:

f_A : F^m(...) → F^n(...),   y_A = f_A(x).

In order to be able to analyze fA in this general terms, we must make a basic assumption,
namely, that
(BA) ∀ x ∈ Fm ∃ xA ∈ Fm : fA (x) = f (xA ). (3.8.2)
That is, the computed solution corresponding to some input x is the exact solution for some
different input xA (not necessarily a machine vector and not necessarily uniquely determined)
that we hope is close to x. The closer we can find an xA to x, the more confidence we should
place in the algorithm A.
We define the condition of A at x by comparing the relative error with eps:

(cond A)(x) = inf_{x_A} ( ‖x_A − x‖ / ‖x‖ ) / eps.

Motivation (in the scalar case):

δy = (f_A(x) − f(x)) / f(x) = (x_A − x) f′(ξ) / f(x)
   ≈ ( (x_A − x)/x ) · (1/eps) · ( x f′(x)/f(x) ) · eps.

The infimum is over all x_A satisfying y_A = f(x_A). In practice one can take any such x_A
and then obtain an upper bound for the condition number:

(cond A)(x) ≤ ( ‖x_A − x‖ / ‖x‖ ) / eps.        (3.8.3)

3.9 Overall error


The problem to be solved is again

f : Rm → Rn , y = f (x). (3.9.1)

This is the mathematical (idealized) problem, where the data are exact real numbers,
and the solution is the mathematically exact solution. When solving such a problem on a
computer, in floating-point arithmetic with precision eps, and using some algorithm A, one
first of all rounds the data, and then applies to these rounded data not f , but fA .
x* = x rounded,   ‖x* − x‖ / ‖x‖ = ε,   y*_A = f_A(x*).
Here ε represents the rounding error in the data. (It could also be due to sources other than
rounding, e.g., measurement.) The total error that we wish to estimate is

‖y*_A − y‖ / ‖y‖.
By the basic assumption (3.8.2, BA) made on the algorithm A, and choosing x_A opti-
mally, we have

f_A(x*) = f(x*_A),   ‖x*_A − x*‖ / ‖x*‖ = (cond A)(x*) eps.        (3.9.2)
Let y* = f(x*). Using the triangle inequality, we have

‖y*_A − y‖/‖y‖ ≤ ‖y*_A − y*‖/‖y‖ + ‖y* − y‖/‖y‖ ≈ ‖y*_A − y*‖/‖y*‖ + ‖y* − y‖/‖y*‖.
We supposed ‖y‖ ≈ ‖y*‖. By virtue of (3.9.2) we now have, for the first term on the right,

‖y*_A − y*‖/‖y*‖ = ‖f_A(x*) − f(x*)‖/‖f(x*)‖ = ‖f(x*_A) − f(x*)‖/‖f(x*)‖
                ≤ (cond f)(x*) ‖x*_A − x*‖/‖x*‖ = (cond f)(x*)(cond A)(x*) eps,

and for the second,

‖y* − y‖/‖y‖ = ‖f(x*) − f(x)‖/‖f(x)‖ ≤ (cond f)(x) ‖x* − x‖/‖x‖ = (cond f)(x) ε.

Assuming finally that (cond f)(x*) ≈ (cond f)(x), we get

‖y*_A − y‖ / ‖y‖ ≤ (cond f)(x) [ε + (cond A)(x*) eps].        (3.9.3)

Interpretation: The data error and eps contribute together towards the total error. Both are
amplified by the condition of the problem, but the latter is further amplified by the condition
of the algorithm.

3.10 Ill-Conditioned Problems and Ill-Posed Problems


If the condition number of a problem is large ((cond f)(x) ≫ 1), then even for small
(relative) input errors, huge errors in the output data can be expected. Such problems are called
ill-conditioned problems. It is not possible to draw a clear separation line between well-
conditioned and ill-conditioned problems. The classification depends on the precision specifica-
tions. If we wish

‖y* − y_A‖ / ‖y‖ ≤ τ

and in (3.9.3) (cond f)(x) ε ≥ τ, then the problem is surely ill-conditioned.


It is important to choose a reasonable boundary off error, since otherwise, even if we
increase the iteration number, we can not achieve the desired accuracy.
If the result of a mathematical problem depends discontinuously on continuous input
data, then it is impossible to obtain an accurate numerical solution in a neighborhood of the
discontinuity. In such cases the result is significantly perturbed, even if the input data are
accurate and the computation is performed using multiple precision. These problems are
called ill-posed problems. An ill-posed problem can appear if, for example, an integer result
is computed from real input data (which vary continuously). As examples we can cite the
number of real zeros of a polynomial and the rank of a matrix.

Example 3.10.1 (The number of real zeros of a polynomial). The equation

P_3(x, c_0) = c_0 + x − 2x^2 + x^3 = 0

can have one, two or three real zeros, depending on whether c_0 is strictly positive, zero or
strictly negative. Therefore, if c_0 is close to zero, the number of real zeros of P_3 is an ill-posed
problem. ♦
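A small sketch of this example (not from the book; the tolerance 1e-10 is an arbitrary choice) counts the real zeros as c_0 crosses zero:
for c0 = [-0.1, 0, 0.1]
    r = roots([1 -2 1 c0]);              % zeros of x^3 - 2x^2 + x + c0
    nreal = sum(abs(imag(r)) < 1e-10);   % count the (numerically) real ones
    fprintf('c0 = %5.2f: %d real zeros\n', c0, nreal)
end
Note that for c_0 = 0 the zero x = 1 is double, so only two of the three real zeros are distinct.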

Figure 3.4: An ill-posed problem

3.11 Stability
3.11.1 Asymptotical notations
We shall introduce here basic notations and some common abuses.
For a given function g(n), Θ(g(n)) will denote the set of functions
Θ(g(n)) = {f (n) : ∃c1 , c2 , n0 > 0 0 ≤ c1 g(n) ≤ f (n) ≤ c2 g(n) ∀n ≤ n0 } .
Although Θ(g(n)) is a set we write f (n) = Θ(g(n)) instead of f (n) ∈ Θ(g(n)). These
abuse has some advantages. g(n) will be called an asymptotically tight bound for f (n).
The Θ(g(n)) definition requires that every member of it to be asymptotically nonnegative,
that is, f (n) ≥ 0 for sufficiently large n.
For a given function g(n), O(g(n)) will denote the set
O(g(n)) = {f (n) : ∃c, n0 0 ≤ f (n) ≤ cg(n), ∀n ≥ n0 } .
Also for f (n) ∈ O(g(n)) we shall use f (n) = O(g(n)). Note that f (n) = Θ(g(n)) implies
f (n) = O(g(n), since the Θ notation is stronger than the O notation. In set theory terms,
Θ(g(n)) ⊆ O(g(n)). One of the funny properties of the O notation is n = O(n2 ). g(n) will
be called an asymptotically upper bound for f .
For a given function g(n), Ω(g(n)) is defined as the set of functions
Ω(g(n)) = {f (n) : ∃c, n0 0 ≤ cg(n) ≤ f (n), ∀n ≥ n0 } .
This notation provide an asymptotically lower bound. The definitions of asymptotic notations
imply immediately:
f (n) = Θ(g(n)) ⇐⇒ f (n) = O(g(n)) ∧ f (n) = Ω(g(n)).

The functions f and g : N → R are asymptotically equivalent (notation ∼) if

lim_{n→∞} f(n)/g(n) = 1.

The extension of the asymptotic notations to real numbers is obvious. For example, f(t) =
O(g(t)) means that there exists a positive constant C such that for all t sufficiently close to
an understood limit (e.g., t → 0 or t → ∞),

|f(t)| ≤ C g(t).        (3.11.1)

3.11.2 Accuracy and stability


In this section, we think of a problem as a map f : X → Y, where X and Y are normed
linear spaces (for our purpose finite-dimensional spaces are sufficient). We are interested in
the problem behavior at a particular point x ∈ X (the behavior may vary from one point
to another). A combination of a problem f with prescribed input data x might be called a
problem instance, but it is usual, though occasionally confusing, to use the term problem
for both notions.
Since complex numbers are represented as pairs of floating-point numbers, the axiom
(3.4.4) also holds for complex numbers, except that for ⊗ and ⊘ eps must be enlarged by
factors of the order 2^{3/2} and 2^{5/2}, respectively.
An algorithm can be viewed as another map fA : X −→ Y , where X and Y are as above.
Let us consider a problem f , a computer whose floating-point number system satisfies (3.4.4),
but not necessarily (3.4.3), an algorithm fA for f , an implementation of this algorithm as a
computer program, A, all fixed. Given input data x ∈ X, we round it to a floating point
number and then supply it to the program. The result is a collection of floating-point numbers
forming a vector from Y (since the algorithm was designed to solve f ). Let this computer
result be called fA (x).
Except in trivial cases, f_A cannot be continuous. One might say that an algorithm f_A for
the problem f is accurate if, for each x ∈ X, its relative error satisfies

‖f_A(x) − f(x)‖ / ‖f(x)‖ = O(eps).        (3.11.2)
If the problem f is ill-conditioned, the goal of accuracy as defined by (3.11.2) is unrea-
sonably ambitious. Rounding errors in the input data are unavoidable on a digital computer, and
even if all the subsequent computations could be carried out perfectly, this perturbation alone
might lead to a significant change in the result. Instead of aiming at accuracy in all cases,
it is most appropriate to aim at stability in general. We say that an algorithm f_A for a
problem f is stable if for each x ∈ X

‖f_A(x) − f(x̃)‖ / ‖f(x̃)‖ = O(eps),        (3.11.3)

for some x̃ with

‖x̃ − x‖ / ‖x‖ = O(eps).        (3.11.4)

In words,

A stable algorithm gives nearly the right answer to nearly the right question.

Many algorithms of Numerical Linear Algebra satisfy a condition that is both stronger
and simpler than stability. We say that an algorithm f_A for the problem f is backward stable
if

∀ x ∈ X  ∃ x̃ with ‖x̃ − x‖/‖x‖ = O(eps) such that f_A(x) = f(x̃).        (3.11.5)

This is a tightening of the definition of stability in that the O(eps) in (3.11.3) was replaced
by zero. In words

A backward stable algorithm gives exactly the right answer to nearly the right
question.

Remark 3.11.1. The notation

||computed quantity|| = O(eps) (3.11.6)

has the following meaning:

• ||computed quantity|| represents the norm of some number or collection of numbers


determined by an algorithm fA for a problem f , depending on both the input data
x ∈ X for f and eps. An example is the relative error.

• The implicit limit process is eps → 0 (i.e. eps corresponds to t in (3.11.1)).

• The O applies uniformly to all data x ∈ X. This uniformity is implicit in the statements
of stability results.

• In any particular machine arithmetic, eps is a fixed quantity. Speaking of the limit
eps → 0, we are considering an idealization of a computer or a family of computers.
Equation (3.11.6) means that if we were to run the algorithm in question on computers
satisfying (3.4.3) and (3.4.4) for a sequence of values of eps decreasing to zero, then
||computed quantity|| would be guaranteed to decrease in proportion to eps or faster.
These ideal computers are required to satisfy (3.4.3) and (3.4.4), but nothing else.

• The constant implicit in O can also depend on the size of the argument (e.g., for the solution
of a nonsingular system Ax = b, on the sizes of A and b). Generally, in practice, the error
growth due to the size of the argument is slow enough, but there are situations with factors
like 2^m; they make such bounds useless in practice. ♦

Due to the equivalence of norms on finite dimensional linear spaces, for problems f and
algorithms fA defined on such spaces, the properties of accuracy, stability and backward
stability all hold or fail to hold independently of the choice of norms in X and Y .

3.11.3 Backward Error Analysis


Backward stability and well-conditioning imply accuracy in relative sense.
Theorem 3.11.2. Suppose a backward stable algorithm f_A is applied to solve a problem
f : X → Y with condition number (cond f)(x) on a computer satisfying the axioms (3.4.3)
and (3.4.4). Then the relative error satisfies

‖f_A(x) − f(x)‖ / ‖f(x)‖ = O((cond f)(x) eps).        (3.11.7)

Proof. By the definition (3.11.5) of backward stability we have f_A(x) = f(x̃) for some
x̃ ∈ X satisfying

‖x̃ − x‖ / ‖x‖ = O(eps).

By the definitions (3.7.5) and (3.7.6) of (cond f)(x), this implies

‖f_A(x) − f(x)‖ / ‖f(x)‖ ≤ ((cond f)(x) + o(1)) ‖x̃ − x‖/‖x‖,        (3.11.8)

where o(1) denotes a quantity which converges to zero as eps → 0. Combining these bounds
gives (3.11.7). □
The process just carried out in proving Theorem 3.11.2 is known as backward error anal-
ysis. We obtained an accuracy estimate by two steps. One step is to investigate the condition
of the problem. The other is to investigate the stability of the algorithm. By Theorem 3.11.2,
if the algorithm is backward stable, then the final accuracy reflects that condition number.
There also exists a forward error analysis. Here, the rounding errors introduced at each
step of the calculation are estimated, and, somehow, a running total is maintained of how they
compound from step to step (Section 3.3).
Experience has shown that, for most of the algorithms of numerical linear algebra,
forward error analysis is harder to carry out than backward error analysis. The best
algorithms of linear algebra do no more, in general, than to compute exact solutions for
slightly perturbed data. Backward error analysis is a method of reasoning fitted neatly to this
backward reality.
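To see estimate (3.11.7) at work, here is a small experiment (a sketch of our own; the Hilbert test matrices and sizes are arbitrary choices), using MATLAB's backward stable \ operator on systems with known solution:

for n=4:2:12
    A=hilb(n);                  %increasingly ill-conditioned matrices
    xex=ones(n,1); b=A*xex;     %system with known exact solution
    relerr=norm(A\b-xex)/norm(xex);
    fprintf('n=%2d relerr=%9.2e cond*eps=%9.2e\n',n,relerr,cond(A)*eps)
end

The observed relative errors stay below a modest multiple of cond(A)*eps, as the theorem predicts.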

3.12 Applications
3.12.1 Inverting the hyperbolic cosine
The aim of this application is to compute the inverse of cosh in floating point arithmetic when
$x \gg 1$. From
$$x = \cosh(y) = \frac{e^y + e^{-y}}{2},$$
it follows, for $x > 1$,
$$y = \operatorname{arccosh} x = -\ln\left(x - \sqrt{x^2 - 1}\right).$$

This way of computing arccosh suffers from catastrophic cancellation. Because of
$$\sqrt{x^2 - 1} = x\sqrt{1 - \frac{1}{x^2}} \approx x\left(1 - \frac{1}{2x^2} + \cdots\right),$$
we obtain
$$y = -\ln\left(x - \sqrt{x^2 - 1}\right) \approx -\ln\frac{1}{2x} = \ln 2x.$$
The first difficulty occurs when we try to evaluate $x^2$ for large x; this could cause overflow.
The overflow is unnecessary because the quantity we are trying to compute is on scale. If
x is large, but not so large that $x^2$ causes overflow, $\mathrm{fl}(x^2-1) = \mathrm{fl}(x^2)$. We have a small
error in the subtraction and then in the square root. Cancellation occurs when we try to compute the
argument of the natural logarithm.
How can we avoid the difficulty? A little calculation shows that
$$-\ln\left(x - \sqrt{x^2 - 1}\right) = \ln\frac{1}{x - \sqrt{x^2 - 1}} = \ln\left(x + \sqrt{x^2 - 1}\right),$$
a form that avoids cancellation. A better way of handling the rest of the argument is
$$\sqrt{x^2 - 1} = x\sqrt{1 - \left(\frac{1}{x}\right)^2}.$$
Notice that we wrote $(1/x)^2$ instead of $1/x^2$, because we convert a possible overflow when forming
$x^2$ into a less harmful underflow. Finally, we see that the expression
$$y = \ln\left(x + x\sqrt{1 - \left(\frac{1}{x}\right)^2}\right)$$
avoids all the difficulties of the original expression for $\operatorname{arccosh}(x) = \cosh^{-1}(x)$. Indeed, it
is clear that for large x, evaluation of this expression in floating point arithmetic will lead to
an approximation of ln(2x), as it should.
The next MATLAB example inverts cosh y for y = 15, 16, . . . , 20.

>> y=15:20;
>> x=cosh(y);
>> -log(x-sqrt(x.^2-1))
ans =
Columns 1 through 6
14.9998 15.9986 17.0102 18.0218 Inf Inf
>> log(x+x.*sqrt(1-(1./x).^2))
ans =
15 16 17 18 19 20

3.12.2 Conditioning of a root of a polynomial equation


Consider the algebraic equation
$$p(x) = x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0 = 0, \quad a_0 \neq 0, \tag{3.12.1}$$
and one of its simple roots:
$$p(\xi) = 0, \quad p'(\xi) \neq 0.$$
Our problem is to find ξ, given p. The input data consist of the vector of p’s coefficients,
$$a = [a_0, a_1, \ldots, a_{n-1}]^T \in \mathbb{R}^n,$$
and the result is ξ, a real or complex number. So, our problem is
$$\xi : \mathbb{R}^n \to \mathbb{C}, \quad \xi = \xi(a_0, a_1, \ldots, a_{n-1}).$$

What is the condition of ξ?


We define
$$\gamma_\nu = (\operatorname{cond}_\nu \xi)(a) = \left|\frac{a_\nu}{\xi}\,\frac{\partial \xi}{\partial a_\nu}\right|, \quad \nu = 0, 1, \ldots, n-1. \tag{3.12.2}$$
Then we take a convenient norm of the vector $\gamma = [\gamma_0, \ldots, \gamma_{n-1}]^T$, for example
$$\|\gamma\|_1 := \sum_{\nu=0}^{n-1} |\gamma_\nu|,$$
to define
$$(\operatorname{cond} \xi)(a) = \sum_{\nu=0}^{n-1} (\operatorname{cond}_\nu \xi)(a). \tag{3.12.3}$$

To find the partial derivatives of ξ with respect to $a_\nu$, we start from the identity
$$[\xi(a_0, \ldots, a_{n-1})]^n + a_{n-1}[\xi(a_0, \ldots, a_{n-1})]^{n-1} + \cdots + a_\nu[\xi(a_0, \ldots, a_{n-1})]^\nu + \cdots + a_0 = 0.$$
Differentiating it with respect to $a_\nu$ we obtain
$$n[\xi]^{n-1}\frac{\partial \xi}{\partial a_\nu} + a_{n-1}(n-1)[\xi]^{n-2}\frac{\partial \xi}{\partial a_\nu} + \cdots + a_\nu \nu[\xi]^{\nu-1}\frac{\partial \xi}{\partial a_\nu} + \cdots + a_1\frac{\partial \xi}{\partial a_\nu} + [\xi]^\nu \equiv 0,$$
where the last term comes from differentiating the first factor of $a_\nu \xi^\nu$. The last identity
can be written as
$$p'(\xi)\frac{\partial \xi}{\partial a_\nu} + \xi^\nu = 0.$$

Since $p'(\xi) \neq 0$, we solve for $\frac{\partial \xi}{\partial a_\nu}$ and substitute the result in (3.12.2) and (3.12.3) to obtain
$$(\operatorname{cond} \xi)(a) = \frac{1}{|\xi p'(\xi)|}\sum_{\nu=0}^{n-1} |a_\nu| |\xi|^\nu. \tag{3.12.4}$$

We illustrate (3.12.4) by considering a famous example, due to Wilkinson [104],
$$p(x) = \prod_{\nu=1}^{n}(x - \nu) = x^n + a_{n-1}x^{n-1} + \cdots + a_0. \tag{3.12.5}$$
If we take $\xi_\mu = \mu$, $\mu = 1, 2, \ldots, n$, it can be shown that [32]
$$\min_\mu \operatorname{cond} \xi_\mu = \operatorname{cond} \xi_1 \sim n^2 \quad \text{when } n \to \infty,$$
$$\max_\mu \operatorname{cond} \xi_\mu \sim \frac{1}{(2-\sqrt{2})\pi n}\left(\frac{\sqrt{2}+1}{\sqrt{2}-1}\right)^{n} \quad \text{when } n \to \infty.$$
The worst conditioned root is $\xi_{\mu_0}$, with $\mu_0$ the integer closest to $n/\sqrt{2}$ when n is large. This
condition number grows exponentially fast in n. For example, when n = 20, then $\mu_0 = 14$
and $\operatorname{cond} \xi_{\mu_0} = 0.540 \times 10^{14}$.
The moral of this example is that the roots of an algebraic equation written in the form
(3.12.1) can be extremely sensitive to small changes in the coefficients (and thus severely
ill-conditioned). Hence, avoid expressing polynomials as sums of powers as in (3.12.1) and
(3.12.5). This is also true for characteristic polynomials of matrices. It is better to compute
their roots as solutions of an eigenvalue problem, which is better conditioned.
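This sensitivity is easy to observe directly (a small experiment of our own): merely forming the coefficients of (3.12.5) commits rounding errors, which the ill-conditioned roots amplify far above eps.

n=20;
p=poly(1:n);               %coefficients of prod(x-nu), already rounded
z=roots(p);                %roots recovered from the coefficients
E=abs(repmat(z,1,n)-repmat(1:n,n,1));
disp(max(min(E,[],2)))     %worst distance to the nearest exact root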
Now, we apply formula (3.12.4) to implement a MATLAB function that computes the
condition numbers of the roots of a polynomial equation (see MATLAB Source 3.3).

MATLAB Source 3.3 Conditioning of the roots of an algebraic equation


function nc=condpol(p,xi)
%CONDPOL - condition of the roots of an algebraic equation
%call NC=CONDPOL(P,XI)

if nargin<2
xi=roots(p);
end
n=length(p)-1;
dp=polyder(p); %derivative;
nc=1./(abs(xi.*polyval(dp,xi))).*(polyval(abs(p(2:end)),abs(xi)));

Let us apply this function to the Wilkinson example. The results are given in ascending
order of condition number, and they are in accordance with the theory.
>> format short g
>> xi=1:20;
>> nc=condpol(poly(xi),xi);
>> [ncs,i]=sort(nc’);

>> [ncs,i]
ans =
420 1
43890 2
2.0189e+006 3
5.1483e+007 4
8.2373e+008 5
8.9237e+009 6
6.8839e+010 7
1.378e+011 20
3.9156e+011 8
1.3781e+012 19
1.6816e+012 9
5.5566e+012 10
6.3797e+012 18
1.4194e+013 11
1.8121e+013 17
2.8523e+013 12
3.5438e+013 16
4.4307e+013 13
5.0243e+013 15
5.401e+013 14

We close the section with a graphical study of the influence of small perturbations on Wilkin-
son’s polynomial. The next MATLAB function perturbs each coefficient by a small relative
normal random amount (mean 0, standard deviation $10^{-10}$) and then plots the theoretical
roots and the perturbed roots.

function Wilkinson(n)
%WILKINSON - perturbations for Wilkinson example

p=poly(1:n); %exact coefficients of prod(x-k)
h=plot([1:n],zeros(1,n),'.'); %theoretical roots, on the real axis
set(h,'Markersize',15);
hold on
for k=1:1000 %1000 random relative perturbations
r=randn(1,n+1);
pr=p.*(1+1e-10*r); %perturbed coefficients
z=roots(pr);
h2=plot(z,'k.'); %perturbed roots in the complex plane
set(h2,'Markersize',4)
end
axis equal

Figure 3.5 shows the results of Wilkinson(20).


This treatment of the topic of conditioning of algebraic equations follows the Gautschi’s
book [33]. For multiple roots, see [91]. The graphical experiment is proposed in [96].

Figure 3.5: Results generated by execution of Wilkinson(20)

Problems
Problem 3.1. Code MATLAB functions to compute sin x and cos x using Taylor formula:

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + (-1)^n\frac{x^{2n+1}}{(2n+1)!} + \cdots$$
$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^n\frac{x^{2n}}{(2n)!} + \cdots$$

The following facts are well known (see any Calculus course):

- the absolute value of the error is less than the absolute value of the first neglected term;

- the convergence radius is $R = \infty$.

What happens for $x = 10\pi$ (and, in general, for $x = 2k\pi$, with k large)? Explain the
phenomenon and find a remedy.

Problem 3.2. Let


$$E_n = \int_0^1 x^n e^{x-1}\,dx.$$

We see that $E_1 = 1/e$ and $E_n = 1 - nE_{n-1}$, $n = 2, 3, \ldots$. It is possible to show that
$$0 < E_n < \frac{1}{n+1}$$
and that, if $E_1 = c$, then
$$\lim_{n\to\infty} E_n = \begin{cases} 0, & \text{for } c = 1/e, \\ \infty, & \text{otherwise.} \end{cases}$$

Explain the phenomenon, find a remedy, and compute e with a precision of eps.

Problem 3.3. Consider the MATLAB functions in Section 3.12.2. Study experimentally the
condition of the problem of finding the roots of the polynomial equation

$$x^n + a_1x^{n-1} + a_2x^{n-2} + \cdots + a_n = 0, \tag{3.12.6}$$

with $a_k = 2^{-k}$. Take n = 20 for practical tests. Modify the graphical experiment for the
case when the perturbation obeys the uniform law.

Problem 3.4. What is the index of the largest Fibonacci number that could be represented
exactly in MATLAB in double precision? What is the index of the largest Fibonacci number
that could be represented in MATLAB in double precision without overflow?

Problem 3.5. Let F be the set of all IEEE double precision floating-point numbers, excluding
NaN and Inf (biased exponent 7ff in hexadecimal) and the denormalized numbers (biased
exponent 000 in hexadecimal).

(a) What is the cardinal of F ?

(b) What proportion of F elements lies within interval [1, 2)?

(c) What proportion of F elements lies within interval [1/64, 1/32)?

(d) Find using a random selection the proportion of F elements satisfying the MATLAB
logical relation
x ∗ (1/x) == 1

Problem 3.6. What familiar real numbers are approximated by floating-point numbers, for
which, by using format hex, the following values are displayed:

4059000000000000
3f847ae147ae147b
3fe921fb54442d18

Problem 3.7. Explain the results displayed by

t = 0.1
n = 1:10
e = n/10 - n*t

Problem 3.8. What does each of the following programs do? How many lines of output does
each of them produce? What are the last two values of x displayed?

x=1; while 1+x>1, x=x/2, pause (.02), end

x=1; while x+x>x, x=2*x, pause (.02), end

x=1; while x+x>x, x=x/2, pause (.02), end

Problem 3.9. Write a MATLAB function that computes the area of a triangle with given edges,
using Hero’s formula:
$$S = \sqrt{p(p-a)(p-b)(p-c)},$$
where p is the half-perimeter. What happens when the triangle is almost degenerate? Pro-
pose a remedy, and give an error estimate (see [37]).

Problem 3.10. The sample variance is defined by
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2,$$
where
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
One can compute it alternatively by the formula
$$s^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right].$$
Which formula is more accurate from a numerical point of view? Give an example that
supports your answer.
CHAPTER 4

Numerical Solution of Linear Algebraic Systems

There are two classes of methods for the solution of algebraic linear systems (ALS):
• direct or exact methods – they provide a solution in a finite number of steps, under the
assumption that all computations are carried out exactly (Cramer, Gaussian elimination,
Cholesky);
• iterative methods – they approximate the solution by generating a sequence converging
to that solution (Jacobi, Gauss-Seidel, SOR).

4.1 Notions of Matrix Analysis


Definition 4.1.1. The p-norm of a vector $x \in \mathbb{K}^n$ is defined by
$$\|x\|_p = \left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}, \quad 1 \leq p < \infty.$$
For $p = \infty$ the norm is defined by
$$\|x\|_\infty = \max_{i=\overline{1,n}}|x_i|.$$

The norm $\|\cdot\|_2$ is called the Euclidean norm, $\|\cdot\|_1$ is called the Minkowski norm, and $\|\cdot\|_\infty$ is
called the Chebyshev norm.

The MATLAB function norm computes the p-norm of a vector. It is invoked as norm(x,p),
with default value p=2. As a special case, for p=-Inf, the quantity $\min_i |x_i|$ is computed.
Example:


>> x = 1:4;
>> [norm(x,1),norm(x,2),norm(x,inf),norm(x,-inf)]
ans =
10.0000 5.4772 4.0000 1.0000

Figure 4.1 shows pictures of the unit sphere in $\mathbb{R}^2$ for various p-norms. They were obtained
via the contour function.

Figure 4.1: The unit sphere in $\mathbb{R}^2$ for four p-norms (panels p = 1, p = 2, p = 3, p = ∞)
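A sketch of how such a picture can be produced with contour (grid size and figure layout are our own choices, not from the original script):

[X,Y]=meshgrid(linspace(-1.5,1.5,401));
ps=[1 2 3 inf];
for k=1:4
    p=ps(k);
    if isinf(p)
        Z=max(abs(X),abs(Y));            %infinity norm
    else
        Z=(abs(X).^p+abs(Y).^p).^(1/p);  %p-norm
    end
    subplot(1,4,k)
    contour(X,Y,Z,[1 1],'k')             %level curve ||x||_p = 1
    axis square, title(['p=',num2str(p)])
end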

Let $A \in \mathbb{K}^{n\times n}$.

• The polynomial $p(\lambda) = \det(A - \lambda I)$ – the characteristic polynomial of A;

• the zeros of p – the eigenvalues of A;

• if λ is an eigenvalue of A, a vector $x \neq 0$ such that $(A - \lambda I)x = 0$ is an eigenvector
of A corresponding to the eigenvalue λ;

• $\rho(A) = \max\{|\lambda| : \lambda \text{ eigenvalue of } A\}$ – the spectral radius of the matrix A.

AT – the transpose of A; A∗ the conjugate transpose (adjoint) of A.

Definition 4.1.2. A matrix A is called:

1. normal, if AA∗ = A∗ A;

2. unitary, if AA∗ = A∗ A = I;

3. hermitian, if A = A∗ ;

4. orthogonal, if AAT = AT A = I, A real;

5. symmetric, if A = AT , A real;

6. upper Hessenberg, if aij = 0, for i > j + 1

Definition 4.1.3. A matrix norm is a map $\|\cdot\| : \mathbb{K}^{m\times n} \to \mathbb{R}$ that for each $A, B \in \mathbb{K}^{m\times n}$
and $\alpha \in \mathbb{K}$ satisfies

(NM1) $\|A\| \geq 0$, $\|A\| = 0 \Leftrightarrow A = O_{m\times n}$;

(NM2) $\|\alpha A\| = |\alpha|\,\|A\|$;

(NM3) $\|A + B\| \leq \|A\| + \|B\|$;

(NM4) $\|AB\| \leq \|A\|\,\|B\|$.

A simple way to obtain a matrix norm is: given a vector norm $\|\cdot\|$ on $\mathbb{C}^n$, the map
$\|\cdot\| : \mathbb{C}^{n\times n} \to \mathbb{R}$,
$$\|A\| = \sup_{\substack{v \in \mathbb{C}^n \\ v \neq 0}} \frac{\|Av\|}{\|v\|} = \sup_{\substack{v \in \mathbb{C}^n \\ \|v\| \leq 1}} \|Av\| = \sup_{\substack{v \in \mathbb{C}^n \\ \|v\| = 1}} \|Av\|,$$
is a matrix norm called the subordinate matrix norm (to the given vector norm) or natural norm
(induced by the given vector norm).

Remark 4.1.4. A subordinate matrix norm verifies $\|I\| = 1$. ♦

The norms subordinate to the vector norms $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_\infty$ are given by the following
result.

Theorem 4.1.5. Let $A \in \mathbb{K}^{n\times n}$. Then
$$\|A\|_1 := \sup_{v \in \mathbb{C}^n\setminus\{0\}} \frac{\|Av\|_1}{\|v\|_1} = \max_j \sum_i |a_{ij}|,$$
$$\|A\|_\infty := \sup_{v \in \mathbb{C}^n\setminus\{0\}} \frac{\|Av\|_\infty}{\|v\|_\infty} = \max_i \sum_j |a_{ij}|,$$
$$\|A\|_2 := \sup_{v \in \mathbb{C}^n\setminus\{0\}} \frac{\|Av\|_2}{\|v\|_2} = \sqrt{\rho(A^*A)} = \sqrt{\rho(AA^*)} = \|A^*\|_2.$$
The norm $\|\cdot\|_2$ is invariant under unitary transforms,
$$UU^* = I \;\Rightarrow\; \|A\|_2 = \|AU\|_2 = \|UA\|_2 = \|U^*AU\|_2.$$
If A is normal, then
$$AA^* = A^*A \;\Rightarrow\; \|A\|_2 = \rho(A).$$

Proof. For any vector v we have
$$\|Av\|_1 = \sum_i \Big|\sum_j a_{ij}v_j\Big| \leq \sum_j |v_j| \sum_i |a_{ij}| \leq \left(\max_j \sum_i |a_{ij}|\right)\|v\|_1.$$
In order to show that $\max_j \sum_i |a_{ij}|$ is actually the smallest α having the property
$\|Av\|_1 \leq \alpha\|v\|_1$, $\forall v \in \mathbb{C}^n$, it is sufficient to construct a vector u (depending on A) such that
$$\|Au\|_1 = \left(\max_j \sum_i |a_{ij}|\right)\|u\|_1.$$
If $j_0$ is a subscript such that
$$\max_j \sum_i |a_{ij}| = \sum_i |a_{ij_0}|,$$
then the vector u with entries $u_i = 0$ for $i \neq j_0$, $u_{j_0} = 1$ does the job.


Similarly,
$$\|Av\|_\infty = \max_i \Big|\sum_j a_{ij}v_j\Big| \leq \left(\max_i \sum_j |a_{ij}|\right)\|v\|_\infty.$$
Let $i_0$ be a subscript such that
$$\max_i \sum_j |a_{ij}| = \sum_j |a_{i_0 j}|.$$
The vector u such that $u_j = \overline{a_{i_0 j}}/|a_{i_0 j}|$ for $a_{i_0 j} \neq 0$, $u_j = 1$ for $a_{i_0 j} = 0$, verifies
$$\|Au\|_\infty = \left(\max_i \sum_j |a_{ij}|\right)\|u\|_\infty.$$

Since $A^*A$ is a Hermitian matrix, there exists an eigendecomposition $A^*A = Q\Lambda Q^*$,
where Q is a unitary matrix whose columns are eigenvectors, and Λ is a diagonal matrix
whose entries are the eigenvalues of $A^*A$ (all of them must be real). They are also nonnegative:
if λ were a negative eigenvalue and q the corresponding eigenvector, then
$0 \leq \|Aq\|_2^2 = q^*A^*Aq = q^*\lambda q = \lambda\|q\|_2^2 < 0$, a contradiction. So,
$$\|A\|_2 = \max_{x\neq 0}\frac{\|Ax\|_2}{\|x\|_2} = \max_{x\neq 0}\frac{(x^*A^*Ax)^{1/2}}{\|x\|_2} = \max_{x\neq 0}\frac{(x^*Q\Lambda Q^*x)^{1/2}}{\|x\|_2}$$
$$= \max_{x\neq 0}\frac{((Q^*x)^*\Lambda(Q^*x))^{1/2}}{\|Q^*x\|_2} = \max_{y\neq 0}\frac{(y^*\Lambda y)^{1/2}}{\|y\|_2} = \max_{y\neq 0}\sqrt{\frac{\sum_i \lambda_i|y_i|^2}{\sum_i |y_i|^2}} \leq \sqrt{\lambda_{\max}};$$
the equality holds if y is a conveniently chosen column of the identity matrix.


Let us prove now that $\rho(A^*A) = \rho(AA^*)$. If $\rho(A^*A) > 0$, there exists p such that $p \neq 0$,
$A^*Ap = \rho(A^*A)p$ and $Ap \neq 0$ (since $\rho(A^*A) > 0$). Since $Ap \neq 0$ and $AA^*(Ap) = \rho(A^*A)Ap$,
it follows that $0 < \rho(A^*A) \leq \rho(AA^*)$, and therefore $\rho(AA^*) = \rho(A^*A)$ (because $(A^*)^* = A$).
If $\rho(A^*A) = 0$, then $\rho(AA^*) = 0$. Hence, in all cases, $\|A\|_2^2 = \rho(A^*A) = \rho(AA^*) = \|A^*\|_2^2$.
The invariance of the $\|\cdot\|_2$ norm under unitary transforms is a translation of the relations
$$\rho(A^*A) = \rho(U^*A^*AU) = \rho(A^*U^*UA) = \rho(U^*A^*UU^*AU).$$
Finally, if A is normal, there exists a unitary matrix U such that
$$U^*AU = \operatorname{diag}(\lambda_i(A)) \overset{\text{def}}{=} \Lambda.$$
In this case
$$A^*A = (U\Lambda U^*)^*U\Lambda U^* = U\Lambda^*\Lambda U^*,$$
which shows us that
$$\rho(A^*A) = \rho(\Lambda^*\Lambda) = \max_i |\lambda_i(A)|^2 = (\rho(A))^2. \qquad\blacksquare$$

Remark 4.1.6. 1) If A is Hermitian or symmetric (therefore normal),
$$\|A\|_2 = \rho(A).$$

2) If A is unitary or orthogonal (therefore normal),
$$\|A\|_2 = \sqrt{\rho(A^*A)} = \sqrt{\rho(I)} = 1.$$

3) Theorem 4.1.5 states that normal matrices and the $\|\cdot\|_2$ norm verify
$$\|A\|_2 = \rho(A). \quad ♦$$

An important matrix norm, which is not a subordinate matrix norm, is the Frobenius norm:
$$\|A\|_E = \left(\sum_i \sum_j |a_{ij}|^2\right)^{1/2} = \{\operatorname{tr}(A^*A)\}^{1/2}.$$
It is not a subordinate norm, since $\|I\|_E = \sqrt{n}$.

Theorem 4.1.7. (1) Let A be an arbitrary square matrix and $\|\cdot\|$ a certain matrix norm
(subordinate or not). Then
$$\rho(A) \leq \|A\|. \tag{4.1.1}$$

(2) Given a matrix A and a number ε > 0, there exists a subordinate matrix norm such
that
$$\|A\| \leq \rho(A) + \varepsilon. \tag{4.1.2}$$

Proof. (1) Let p be a vector verifying $p \neq 0$, $Ap = \lambda p$, $|\lambda| = \rho(A)$, and q a vector such that
$pq^T \neq 0$. Since
$$\rho(A)\|pq^T\| = \|\lambda pq^T\| = \|Apq^T\| \leq \|A\|\,\|pq^T\|,$$
(4.1.1) results immediately.
(2) Let A be a given matrix. There exists an invertible matrix U such that $U^{-1}AU$ is
upper-triangular (in fact, U is unitary):
$$U^{-1}AU = \begin{pmatrix} \lambda_1 & t_{12} & t_{13} & \ldots & t_{1,n} \\ & \lambda_2 & t_{23} & \ldots & t_{2,n} \\ & & \ddots & & \vdots \\ & & & \lambda_{n-1} & t_{n-1,n} \\ & & & & \lambda_n \end{pmatrix};$$
the scalars $\lambda_i$ are the eigenvalues of A. To any scalar $\delta \neq 0$ we associate the matrix
$$D_\delta = \operatorname{diag}(1, \delta, \delta^2, \ldots, \delta^{n-1}),$$
such that
$$(UD_\delta)^{-1}A(UD_\delta) = \begin{pmatrix} \lambda_1 & \delta t_{12} & \delta^2 t_{13} & \ldots & \delta^{n-1}t_{1n} \\ & \lambda_2 & \delta t_{23} & \ldots & \delta^{n-2}t_{2n} \\ & & \ddots & & \vdots \\ & & & \lambda_{n-1} & \delta t_{n-1,n} \\ & & & & \lambda_n \end{pmatrix}.$$
Given ε > 0, we fix δ such that
$$\sum_{j=i+1}^{n} |\delta^{j-i}t_{ij}| \leq \varepsilon, \quad 1 \leq i \leq n-1.$$
Then the map
$$\|\cdot\| : B \in \mathbb{K}^{n\times n} \to \|B\| = \|(UD_\delta)^{-1}B(UD_\delta)\|_\infty, \tag{4.1.3}$$
which depends on A and ε, solves the problem. Indeed,
$$\|A\| \leq \rho(A) + \varepsilon,$$
and according to the choice of δ and the definition of the $\|\cdot\|_\infty$ norm ($\|(c_{ij})\|_\infty = \max_i \sum_j |c_{ij}|$),
the norm given by (4.1.3) is a matrix norm subordinate to the vector norm
$$v \in \mathbb{K}^n \to \|(UD_\delta)^{-1}v\|_\infty. \qquad\blacksquare$$


Theorem 4.1.8. Let B be a square matrix. The following statements are equivalent:

(1) $\lim_{k\to\infty} B^k = 0$;

(2) $\lim_{k\to\infty} B^k v = 0$, $\forall v \in \mathbb{K}^n$;

(3) $\rho(B) < 1$;

(4) there exists a subordinate matrix norm such that $\|B\| < 1$.

Proof. (1) ⇒ (2): $\|B^k v\| \leq \|B^k\|\,\|v\| \Rightarrow \lim_{k\to\infty} B^k v = 0$.
(2) ⇒ (3): If $\rho(B) \geq 1$, we can find p such that $p \neq 0$, $Bp = \lambda p$, $|\lambda| \geq 1$. Then the vector
sequence $(B^k p)_{k\in\mathbb{N}}$ could not converge to 0.
(3) ⇒ (4): $\rho(B) < 1 \Rightarrow \exists \|\cdot\|$ such that $\|B\| \leq \rho(B) + \varepsilon$, $\forall \varepsilon > 0$; hence $\|B\| < 1$.
(4) ⇒ (1): It is sufficient to apply the inequality $\|B^k\| \leq \|B\|^k$. 
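A small numerical illustration of the theorem (the matrix below is an arbitrary example of our own): B has infinity norm larger than 1, yet ρ(B) < 1 forces $B^k \to 0$.

B=[0.5 2; 0 0.6];                     %eigenvalues 0.5 and 0.6
disp([max(abs(eig(B))) norm(B,inf)])  %rho(B)=0.6 < 1 < 2.5=||B||_inf
for k=[5 10 20 40]
    fprintf('k=%2d ||B^k||_inf=%e\n',k,norm(B^k,inf))
end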
For matrices, the norm function is invoked as norm(A,p), where A is a matrix,
and p=1,2,inf for a p-norm, or p='fro' for the Frobenius norm. Example:
>> A=[1:3;4:6;7:9]
A =
1 2 3
4 5 6
7 8 9
>> [norm(A,1) norm(A,2) norm(A,inf) norm(A,’fro’)]
ans =
18.0000 16.8481 24.0000 16.8819
When the computation of the 2-norm of a matrix is too expensive (it requires the singular
values of the matrix), the call normest(A,tol) can be used to obtain an estimate based
on the power method (see Chapter 8). The default is tol=1e-6.

4.2 Condition of a linear system


We are interested in the conditioning of the problem: given the matrix $A \in \mathbb{K}^{n\times n}$ and the
vector $b \in \mathbb{K}^{n\times 1}$, solve the system
$$Ax = b.$$
Consider the system (this example is due to Wilson)
$$\begin{pmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} =
\begin{pmatrix} 32 \\ 23 \\ 33 \\ 31 \end{pmatrix},$$
having the solution $(1, 1, 1, 1)^T$, and consider the perturbed system where the right-hand
side is slightly modified, the system matrix remaining unchanged,
$$\begin{pmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{pmatrix}
\begin{pmatrix} x_1 + \delta x_1 \\ x_2 + \delta x_2 \\ x_3 + \delta x_3 \\ x_4 + \delta x_4 \end{pmatrix} =
\begin{pmatrix} 32.1 \\ 22.9 \\ 33.1 \\ 30.9 \end{pmatrix},$$
having the solution $(9.2, -12.6, 4.5, -1.1)^T$. In other words, a relative error of about 1/200 in the
input data causes a relative error of about 10 in the result, i.e., the relative error grows roughly
2000 times!
Consider now the system with the perturbed matrix
$$\begin{pmatrix} 10 & 7 & 8.1 & 7.2 \\ 7.08 & 5.04 & 6 & 5 \\ 8 & 5.98 & 9.89 & 9 \\ 6.99 & 4.99 & 9 & 9.98 \end{pmatrix}
\begin{pmatrix} x_1 + \Delta x_1 \\ x_2 + \Delta x_2 \\ x_3 + \Delta x_3 \\ x_4 + \Delta x_4 \end{pmatrix} =
\begin{pmatrix} 32 \\ 23 \\ 33 \\ 31 \end{pmatrix},$$
having the solution $(-81, 137, -34, 22)^T$. Again, a small variation of the input data (here, the
matrix elements) modifies the result dramatically. The matrix has a “good” shape: it is
symmetric, its determinant is equal to 1, and its inverse,
$$\begin{pmatrix} 25 & -41 & 10 & -6 \\ -41 & 68 & -17 & 10 \\ 10 & -17 & 5 & -3 \\ -6 & 10 & -3 & 2 \end{pmatrix},$$
is also “nice”.
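The example is easy to reproduce (a quick check of our own; cond(A) turns out to be about 3·10³, consistent with the amplification observed above):

A=[10 7 8 7; 7 5 6 5; 8 6 10 9; 7 5 9 10];
b=[32; 23; 33; 31];
x=A\b                                %the exact solution (1,1,1,1)'
xp=A\[32.1; 22.9; 33.1; 30.9]        %perturbed right-hand side
relin=norm([.1;-.1;.1;-.1])/norm(b); %relative input perturbation
relout=norm(xp-x)/norm(x);           %relative output perturbation
[relout/relin cond(A)]               %amplification vs. condition number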
Let us now consider the system parameterized by t,
$$(A + t\Delta A)x(t) = b + t\Delta b, \quad x(0) = x.$$
A being nonsingular, the function x is differentiable at t = 0:
$$\dot{x}(0) = A^{-1}(\Delta b - \Delta A\,x).$$
The Taylor expansion of x(t) is given by
$$x(t) = x + t\dot{x}(0) + O(t^2).$$
It thus follows that the absolute error can be estimated using
$$\|\Delta x(t)\| = \|x(t) - x\| \leq |t|\,\|\dot{x}(0)\| + O(t^2) \leq |t|\,\|A^{-1}\|\left(\|\Delta b\| + \|\Delta A\|\,\|x\|\right) + O(t^2)$$
and (due to $\|b\| \leq \|A\|\,\|x\|$) we get for the relative error
$$\frac{\|\Delta x(t)\|}{\|x\|} \leq |t|\,\|A^{-1}\|\left(\frac{\|\Delta b\|}{\|x\|} + \|\Delta A\|\right) + O(t^2)
\leq \|A\|\,\|A^{-1}\|\left(|t|\frac{\|\Delta b\|}{\|b\|} + |t|\frac{\|\Delta A\|}{\|A\|}\right) + O(t^2). \tag{4.2.1}$$
By introducing the notations
$$\rho_A(t) := |t|\frac{\|\Delta A\|}{\|A\|}, \quad \rho_b(t) := |t|\frac{\|\Delta b\|}{\|b\|}$$
for the relative errors in A and b, the relative error estimate can be written as
$$\frac{\|\Delta x(t)\|}{\|x\|} \leq \|A\|\,\|A^{-1}\|(\rho_A + \rho_b) + O(t^2). \tag{4.2.2}$$

Definition 4.2.1. If A is nonsingular, the number
$$\operatorname{cond}(A) = \|A\|\,\|A^{-1}\| \tag{4.2.3}$$
is called the condition number of the matrix A.

The relation (4.2.2) can be rewritten as
$$\frac{\|\Delta x(t)\|}{\|x\|} \leq \operatorname{cond}(A)(\rho_A + \rho_b) + O(t^2). \tag{4.2.4}$$

MATLAB has several functions for the computation or estimation of condition number.

• cond(A,p), where p=1,2,inf,’fro’ are supported. The default is p=2. For


p=2 one uses svd, and for p=1 and Inf, inv.

• condest(A) estimates cond1 A. It uses lu and an algorithm due to Higham and


Tisseur [45]. Suitable for large sparse matrices.

• rcond(A) estimates 1/cond1 A. It uses lu(A) and an algorithm implemented in


LINPACK and LAPACK.

Example 4.2.2 (Ill-conditioned matrix). Consider the n-th order Hilbert¹ matrix, $H_n = (h_{ij})$,
given by
$$h_{ij} = \frac{1}{i+j-1}, \quad i, j = \overline{1,n}.$$
This is a symmetric positive definite matrix, so it is nonsingular. For various values of n one
gets in the Euclidean norm

n            | 10        | 20         | 40
cond₂(Hₙ)    | 1.6·10¹³  | 2.45·10²⁸  | 7.65·10⁵⁸

A system of order n = 10, for example, cannot be solved with any reliability in single
precision on a 14-decimal computer. Double precision will be “exhausted” by the time we
reach n = 20. The Hilbert matrix is thus a prototype of an ill-conditioned matrix. From a

¹David Hilbert (1862-1943) was the most prominent member of the Göttingen school of mathematics.
Hilbert’s fundamental contributions to almost all parts of mathematics — algebra, number theory,
geometry, integral equations, calculus of variations, and foundations — and in particular the 23 now
famous problems he proposed in 1900 at the International Congress of Mathematicians in Paris, gave a
new impetus, and new directions, to 20th-century mathematics.

result of G. Szegő² it can be seen that
$$\operatorname{cond}_2(H_n) \sim \frac{(\sqrt{2}+1)^{4n+4}}{2^{15/4}\sqrt{\pi n}}. \quad ♦$$

Example 4.2.3 (Ill-conditioned matrix). Vandermonde matrices are of the form
$$V_n = \begin{pmatrix} 1 & 1 & \ldots & 1 \\ t_1 & t_2 & \ldots & t_n \\ \vdots & \vdots & \ddots & \vdots \\ t_1^{n-1} & t_2^{n-1} & \ldots & t_n^{n-1} \end{pmatrix},$$
where the $t_i$ are real parameters. If the $t_i$’s are equally spaced in [−1, 1], then it holds that
$$\operatorname{cond}_\infty(V_n) \sim \frac{1}{\pi}\,e^{-\pi/4}\,e^{n\left(\frac{\pi}{4} + \frac{1}{2}\ln 2\right)}.$$
For $t_i = \frac{1}{i}$, $i = \overline{1,n}$,
$$\operatorname{cond}_\infty(V_n) > n^{n+1}. \quad ♦$$
Let us check practically the conditioning of the matrices in Examples 4.2.2 and 4.2.3. For
Hilbert matrix we used the sequence (file testcondhilb.m):
fprintf(' n cond_2 est. cond theoretical\n')
for n=[10:15,20,40]
H=hilb(n);
et=(sqrt(2)+1)^(4*n+4)/(2^(14/4)*sqrt(pi*n));
x=[n, norm(H)*norm(invhilb(n)), condest(H), et];
fprintf('%d %g %g %g\n',x)
end
We obtained the following results:
n cond_2 est. cond theoretical
10 1.60263e+013 3.53537e+013 1.09635e+015
11 5.23068e+014 1.23037e+015 3.55105e+016
12 1.71323e+016 3.79926e+016 1.15496e+018
13 5.62794e+017 4.25751e+017 3.76953e+019
14 1.85338e+019 7.09955e+018 1.23395e+021
15 6.11657e+020 7.73753e+017 4.04966e+022
20 2.45216e+028 4.95149e+018 1.58658e+030
40 7.65291e+058 7.02056e+019 4.69897e+060
²Gabor Szegő (1895-1985), Hungarian mathematician. Szegő’s most important work was in the area of
extremal problems and Toeplitz matrices. His Orthogonal Polynomials appeared in 1939 and was
published by the American Mathematical Society. It has proved highly successful, running to four
editions and many reprints over the years. He cooperated with Pólya in bringing out a joint problem
book, Aufgaben und Lehrsätze aus der Analysis, volumes I and II (Problems and Theorems in Analysis)
(1925), which has since gone through many editions and which has had an enormous impact on later
generations of mathematicians.

The sequence for Vandermonde matrix with equally spaced points in [-1,1] (MATLAB
script condvander2.m) is:

warning off
fprintf(' n cond_inf cond.estimate theoretical\n')
for n=[10,20,40,80]
t=linspace(-1,1,n);
V=vander(t);
et=1/pi*exp(-pi/4)*exp(n*(pi/4+1/2*log(2)));
x=[n, norm(V,inf)*norm(inv(V),inf), condest(V), et];
fprintf('%d %e %e %e\n',x)
end
warning on

We give the results:

n cond_inf cond.estimate theoretical


10 2.056171e+004 1.362524e+004 1.196319e+004
20 1.751063e+009 1.053490e+009 9.861382e+008
40 1.208386e+019 6.926936e+018 6.700689e+018
80 1.027994e+039 1.003485e+039 3.093734e+038

For the Vandermonde matrix with elements of the form 1/i we used the sequence (file
condvander.m)

warning off
fprintf(' n cond_inf cond.estimate theoretical\n')
for n=10:15
t=1./(1:n);
V=vander(t);
x=[n, norm(V,inf)*norm(inv(V),inf), condest(V), n^(n+1)];
fprintf('%d %e %e %e\n',x)
end
warning on

and we obtained:

n cond_inf cond.estimate theoretical


10 5.792417e+011 5.905580e+011 1.000000e+011
11 2.382382e+013 2.278265e+013 3.138428e+012
12 1.060780e+015 9.692982e+014 1.069932e+014
13 5.087470e+016 4.732000e+016 3.937376e+015
14 2.615990e+018 2.419007e+018 1.555681e+017
15 1.436206e+020 1.294190e+020 6.568408e+018

We used warning off to disable the display of warning messages like

Matrix is close to singular or badly scaled.


Results may be inaccurate. RCOND = ...

4.3 Gaussian Elimination


Let us consider the linear system having n equations and n unknowns,
$$Ax = b, \tag{4.3.1}$$
where $A \in \mathbb{K}^{n\times n}$ and $b \in \mathbb{K}^{n\times 1}$ are given, and $x \in \mathbb{K}^{n\times 1}$ must be determined, or, written in
detailed fashion,
$$\begin{cases} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1 & (E_1) \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2 & (E_2) \\ \qquad\qquad\vdots \\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n = b_n & (E_n) \end{cases} \tag{4.3.2}$$
The Gaussian elimination method³ has two stages:
e1) Transforming the given system into a triangular one.
e2) Solving the triangular system using back substitution.
During the solution of the system (4.3.1) or (4.3.2) the following transforms are allowed:
1. The equation Ei can be multiplied by λ ∈ K∗ . This operation will be denoted by
(λEi ) → (Ei ).
2. The equation Ej can be multiplied by λ ∈ K∗ and added to the equation Ei , the result
replacing Ei . Notation (Ei + λEj ) → (Ei ).
3. The equation Ei and Ej can be interchanged; notation (Ei ) ←→ (Ej ).
In order to express conveniently the transformation into a triangular system we shall use the
extended matrix:
$$A^e = [A, b] = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} & a_{1,n+1} \\ a_{21} & a_{22} & \ldots & a_{2n} & a_{2,n+1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} & a_{n,n+1} \end{pmatrix}$$
³Johann Carl Friedrich Gauss (1777-1855) was one of the greatest mathematicians of the 19th century
— and perhaps of all time. He spent almost his entire life in Göttingen, where he was the director of the
observatory for some 40 years. Already as a student in Göttingen, Gauss discovered that the 17-gon can
be constructed by compass and ruler, thereby settling a problem that had been open since antiquity. His
dissertation gave the first proof of the Fundamental Theorem of Algebra. He went on to make
fundamental contributions to number theory, differential and non-Euclidean geometry, elliptic and
hypergeometric functions, celestial mechanics and geodesy, and various branches of physics, notably
magnetism and optics. His computational efforts in celestial mechanics and geodesy, based on the
principle of least squares, required the solution (by hand) of large systems of linear equations, for which
he used what today are known as Gaussian elimination and relaxation methods. Gauss’s work on
quadrature builds upon the earlier work of Newton and Cotes.

with $a_{i,n+1} = b_i$.
Assuming $a_{11} \neq 0$, we eliminate the coefficients of $x_1$ from $E_j$, for $j = \overline{2,n}$, using the
operation $(E_j - (a_{j1}/a_{11})E_1) \to (E_j)$. We proceed similarly for the coefficients of $x_i$, for
$i = \overline{2,n-1}$, $j = \overline{i+1,n}$. This requires $a_{ii} \neq 0$.
The procedure can be described as follows: one builds the sequence of extended
matrices $\tilde{A}^{(1)}, \tilde{A}^{(2)}, \ldots, \tilde{A}^{(n)}$, where $\tilde{A}^{(1)} = \tilde{A}$ and the elements $a^{(k)}_{ij}$ of $\tilde{A}^{(k)}$ are given by
$$\left(E_i - \frac{a^{(k-1)}_{i,k-1}}{a^{(k-1)}_{k-1,k-1}}\,E_{k-1}\right) \longrightarrow (E_i).$$

Remark 4.3.1. $a^{(p)}_{ij}$ denotes the value of $a_{ij}$ at the p-th step. ♦

Thus
$$\tilde{A}^{(k)} = \begin{pmatrix}
a^{(1)}_{11} & a^{(1)}_{12} & a^{(1)}_{13} & \ldots & a^{(1)}_{1,k-1} & a^{(1)}_{1k} & \ldots & a^{(1)}_{1n} & a^{(1)}_{1,n+1} \\
0 & a^{(2)}_{22} & a^{(2)}_{23} & \ldots & a^{(2)}_{2,k-1} & a^{(2)}_{2k} & \ldots & a^{(2)}_{2n} & a^{(2)}_{2,n+1} \\
\vdots & & \ddots & & \vdots & \vdots & & \vdots & \vdots \\
 & & & & a^{(k-1)}_{k-1,k-1} & a^{(k-1)}_{k-1,k} & \ldots & a^{(k-1)}_{k-1,n} & a^{(k-1)}_{k-1,n+1} \\
\vdots & & & & 0 & a^{(k)}_{kk} & \ldots & a^{(k)}_{kn} & a^{(k)}_{k,n+1} \\
 & & & & \vdots & \vdots & & \vdots & \vdots \\
0 & \ldots & \ldots & \ldots & 0 & a^{(k)}_{nk} & \ldots & a^{(k)}_{nn} & a^{(k)}_{n,n+1}
\end{pmatrix}$$
represents an equivalent linear system in which the variable $x_{k-1}$ has been eliminated from the
equations $E_k, E_{k+1}, \ldots, E_n$. The system corresponding to the matrix $\tilde{A}^{(n)}$ is triangular and
equivalent to
$$\begin{cases}
a^{(1)}_{11}x_1 + a^{(1)}_{12}x_2 + \cdots + a^{(1)}_{1n}x_n = a^{(1)}_{1,n+1} \\
\qquad\quad\ a^{(2)}_{22}x_2 + \cdots + a^{(2)}_{2n}x_n = a^{(2)}_{2,n+1} \\
\qquad\qquad\qquad\qquad\vdots \\
\qquad\qquad\qquad\ \ a^{(n)}_{nn}x_n = a^{(n)}_{n,n+1}
\end{cases}$$
One obtains
$$x_n = \frac{a^{(n)}_{n,n+1}}{a^{(n)}_{nn}}$$
and, generally,
$$x_i = \frac{1}{a^{(i)}_{ii}}\left(a^{(i)}_{i,n+1} - \sum_{j=i+1}^{n} a^{(i)}_{ij}x_j\right), \quad i = n-1, \ldots, 1.$$

The procedure is applicable only if $a^{(i)}_{ii} \neq 0$, $i = \overline{1,n}$. The element $a^{(i)}_{ii}$ is called the pivot. If,
during the elimination process, at the kth step one obtains $a^{(k)}_{kk} = 0$, one can perform the line
interchange $(E_k) \leftrightarrow (E_p)$, where $k+1 \leq p \leq n$ is the smallest integer satisfying $a^{(k)}_{pk} \neq 0$.
In practice, such operations are necessary even if the pivot is nonzero. The reason is that
a small pivot causes large rounding errors and even cancellation. The remedy is to
choose for pivoting the subdiagonal element on the same column having the largest absolute
value. That is, we must find a p such that
$$|a^{(k)}_{pk}| = \max_{k\leq i\leq n}|a^{(k)}_{ik}|,$$
and then perform the interchange $(E_k) \leftrightarrow (E_p)$. This technique is called column maximal
pivoting or partial pivoting.
Another technique which decreases errors and prevents floating-point cancellation is
scaled column pivoting. One defines in a first step a scaling factor for each line,
$$s_i = \max_{j=\overline{1,n}}|a_{ij}| \quad \text{or} \quad s_i = \sum_{j=1}^{n}|a_{ij}|.$$
If an i such that $s_i = 0$ exists, the matrix is singular. The next steps establish which
interchange is to be done. At the i-th step one finds the smallest integer p, $i \leq p \leq n$, such that
$$\frac{|a_{pi}|}{s_p} = \max_{i\leq j\leq n}\frac{|a_{ji}|}{s_j},$$
and then $(E_i) \leftrightarrow (E_p)$. Scaling guarantees that the largest element in each column has
relative magnitude 1 before the comparisons needed for line interchange are done. Scaling
is performed only for comparison purposes, so the division by the scaling factor does not
introduce any rounding error. The third method is total pivoting or maximal pivoting. In this
method, at the kth step one finds
$$\max\{|a_{ij}|,\ i = \overline{k,n},\ j = \overline{k,n}\}$$
and line and column interchanges are carried out.


Pivoting was introduced for the first time by Goldstine and von Neumann, 1947 [38].
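The classical small-pivot example shows why pivoting matters even for nonzero pivots (a 2×2 sketch of our own):

A=[1e-17 1; 1 1]; b=[1; 2];    %exact solution very close to (1,1)
m=A(2,1)/A(1,1);               %huge multiplier, 1e17
u22=A(2,2)-m*A(1,2); c2=b(2)-m*b(1);
x2=c2/u22; x1=(b(1)-A(1,2)*x2)/A(1,1);
[x1 x2]                        %no pivoting: x1 is completely wrong
A\b                            %backslash pivots and returns ~(1,1)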

Remark 4.3.2. Some suggestions which speed up the running time.

1. Pivoting need not physically interchange rows or columns. One can manage one
(or two) permutation vector(s) p (q); p[i] (q[i]) means the line (column) that was
interchanged with the ith line (column). This is a good solution if matrices are stored row by
row or column by column; for other representations or memory hierarchies, physical
interchange could yield better results.

2. The subdiagonal elements (which vanish) need not be computed.

3. A matrix A can be inverted by solving the systems $Ax = e_k$, $k = \overline{1,n}$, where $e_k$ are the
vectors of the canonical basis of $\mathbb{K}^n$ – the simultaneous equations method. ♦
4.3. Gaussian Elimination 115

MATLAB Source 4.1 Solve the system Ax = b by Gaussian elimination with scaled column
pivoting.
function x=Gausselim(A,b)
%GAUSSELIM - Gaussian elimination with scaled column pivoting
%call x=Gausselim(A,b)
%A - matrix, b- right hand side vector
%x - solution

[l,n]=size(A);
x=zeros(size(b));
s=sum(abs(A),2);
A=[A,b]; %extended matrix
piv=1:n;
%Elimination
for i=1:n-1
[u,p]=max(abs(A(piv(i:n),i))./s(piv(i:n))); %scaled pivot search on the
p=p+i-1;                                    %logically permuted rows
if u==0, error('no unique solution'), end
if p~=i %line interchange via the permutation vector
piv([i,p])=piv([p,i]);
end
for j=i+1:n
m=A(piv(j),i)/A(piv(i),i);
A(piv(j),i+1:n+1)=A(piv(j),i+1:n+1)-m*A(piv(i),i+1:n+1);
end
end
%back substitution
x(n)=A(piv(n),n+1)/A(piv(n),n);
for i=n-1:-1:1
x(i)=(A(piv(i),n+1)-A(piv(i),i+1:n)*x(i+1:n))/A(piv(i),i);
end

Analysis of Gaussian elimination. The method is given in MATLAB Source 4.1, the
file Gausselim.m. Our complexity measure is the number of floating point operations or,
shortly, flops. Each pass through the innermost loop (the computation of the multiplier m
and the row update) takes $2n - 2i + 3$ flops, for a total of $(n-i)(2n-2i+3)$ flops at step i. For
the outer loop the total is
$$\sum_{i=1}^{n-1}(n-i)(2n-2i+3) \sim \frac{2n^3}{3}.$$
Back substitution takes $\Theta(n^2)$ flops. The overall total is $\Theta(n^3)$.
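A quick sanity check of Gausselim against the built-in solver (the test system below is an arbitrary, well conditioned one):

n=100;
A=rand(n)+n*eye(n);            %diagonally dominant, well conditioned
b=rand(n,1);
fprintf('||x-A\\b||=%e\n',norm(Gausselim(A,b)-A\b))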



4.4 Factorization based methods


4.4.1 LU decomposition
Theorem 4.4.1. If Gaussian elimination for the system Ax = b can be carried out without line
interchanges, then A can be factored as A = LU, where L is a lower triangular matrix and U is an
upper triangular matrix. The pair (L, U) is the LU decomposition of the matrix A.

Advantages. $Ax = b \Leftrightarrow LUx = b \Leftrightarrow Ly = b \wedge Ux = y$. If we have to solve several
systems $Ax = b_i$, $i = \overline{1,m}$, each takes $\Theta(n^3)$, for a total of $\Theta(mn^3)$; if we begin with an LU
decomposition, which takes $\Theta(n^3)$, and then solve each system in $\Theta(n^2)$, we need only
$\Theta(n^3 + mn^2)$ time.

Remark 4.4.2. U is the upper triangular matrix generated by Gaussian elimination, and L is
the matrix of the multipliers $m_{ij}$. ♦

If Gaussian elimination is carried out with line interchanges, it still holds that A = LU, but L is
no longer lower triangular.
The method is called LU factorization.
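In MATLAB the saving is obtained by factoring once with the built-in lu and reusing the factors (the sizes below are arbitrary):

n=500; m=50;
A=rand(n)+n*eye(n); B=rand(n,m);
[L,U,P]=lu(A);                 %Theta(n^3), done only once
X=U\(L\(P*B));                 %each column solved in Theta(n^2)
fprintf('residual=%e\n',norm(A*X-B,'fro'))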
We give two examples where Gaussian elimination can be carried out without interchanges:

- A is row diagonally dominant, that is,
$$|a_{ii}| > \sum_{\substack{j=1 \\ j\neq i}}^{n}|a_{ij}|, \quad i = \overline{1,n};$$

- A is positive definite ($\forall x \neq 0$: $x^*Ax > 0$).
Proof of Theorem 4.4.1. (sketch) For n > 1 we split A in the following way:
$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} = \begin{pmatrix} a_{11} & w^* \\ v & A' \end{pmatrix},$$
where v is an n−1 column vector and $w^*$ is an n−1 line vector. We can factor A as
$$A = \begin{pmatrix} a_{11} & w^* \\ v & A' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ v/a_{11} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{11} & w^* \\ 0 & A' - vw^*/a_{11} \end{pmatrix}.$$
The matrix $A' - vw^*/a_{11}$ is called the Schur complement of A with respect to $a_{11}$. Then we
proceed with the recursive decomposition of the Schur complement:
$$A' - vw^*/a_{11} = L'U'.$$
$$A = \begin{pmatrix} 1 & 0 \\ v/a_{11} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{11} & w^* \\ 0 & A' - vw^*/a_{11} \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ v/a_{11} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{11} & w^* \\ 0 & L'U' \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ v/a_{11} & L' \end{pmatrix}\begin{pmatrix} a_{11} & w^* \\ 0 & U' \end{pmatrix}.$$


We have several choices for $u_{ii}$ and $l_{ii}$, $i = \overline{1,n}$. For example, if $l_{ii} = 1$, we have the
Doolittle factorization, and if $u_{ii} = 1$, we have the Crout factorization.

4.4.2 LUP decomposition


The idea behind LUP decomposition is to find three square matrices L, U and P, where L
is lower triangular, U is upper triangular, and P is a permutation matrix, such that PA = LU.
The triple (L, U, P) is called the LUP decomposition of the matrix A.
The solution of the system Ax = b is equivalent to the solution of two triangular systems,
since
$$Ax = b \Leftrightarrow LUx = Pb \Leftrightarrow Ly = Pb \wedge Ux = y$$
and
$$Ax = P^{-1}LUx = P^{-1}Ly = P^{-1}Pb = b.$$
We shall choose as pivot $a_{k1}$ instead of $a_{11}$. The effect is a multiplication by a permutation
matrix Q:
$$QA = \begin{pmatrix} a_{k1} & w^* \\ v & A' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ v/a_{k1} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & A' - vw^*/a_{k1} \end{pmatrix}.$$
Then, we compute the LUP decomposition of the Schur complement,
$$P'(A' - vw^*/a_{k1}) = L'U'.$$
We define
$$P = \begin{pmatrix} 1 & 0 \\ 0 & P' \end{pmatrix}Q,$$
which is a permutation matrix too. We have now
$$PA = \begin{pmatrix} 1 & 0 \\ 0 & P' \end{pmatrix}QA
= \begin{pmatrix} 1 & 0 \\ 0 & P' \end{pmatrix}\begin{pmatrix} 1 & 0 \\ v/a_{k1} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & A' - vw^*/a_{k1} \end{pmatrix}$$
$$= \begin{pmatrix} 1 & 0 \\ P'v/a_{k1} & P' \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & A' - vw^*/a_{k1} \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ P'v/a_{k1} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & P'(A' - vw^*/a_{k1}) \end{pmatrix}$$
$$= \begin{pmatrix} 1 & 0 \\ P'v/a_{k1} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & L'U' \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ P'v/a_{k1} & L' \end{pmatrix}\begin{pmatrix} a_{k1} & w^* \\ 0 & U' \end{pmatrix}.$$
Note that in this reasoning, both the column vector and the Schur complement are multiplied
by the permutation matrix P'.
For an implementation, see MATLAB Source 4.2, file lup.m. The function lupsolve
implements the solution of a linear algebraic system using LUP decomposition (see MAT-
LAB source 4.5). It uses forward and backward substitutions. See MATLAB sources 4.3 and
4.4 respectively.
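Usage of these routines is straightforward (test data chosen arbitrarily):

A=magic(5); b=(1:5)';
[L,U,P]=lup(A);
fprintf('||PA-LU||=%e\n',norm(P*A-L*U))
fprintf('||x-A\\b||=%e\n',norm(lupsolve(A,b)-A\b))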

MATLAB Source 4.2 LUP Decomposition


function [L,U,P]=lup(A)
%LUP find LUP decomposition of matrix A
%call [L,U,P]=lup(A)
%permute effectively lines

[m,n]=size(A);
P=zeros(m,n);
piv=(1:m)’;
for i=1:m-1
%pivoting
[pm,kp]=max(abs(A(i:m,i)));
kp=kp+i-1;
%line interchange
if i~=kp
A([i,kp],:)=A([kp,i],:);
piv([i,kp])=piv([kp,i]);
end
%Schur complement
lin=i+1:m;
A(lin,i)=A(lin,i)/A(i,i);
A(lin,lin)=A(lin,lin)-A(lin,i)*A(i,lin);
end;
for i=1:m
P(i,piv(i))=1;
end;
U=triu(A);
L=tril(A,-1)+eye(m);

MATLAB Source 4.3 Forward substitution


function x=forwardsubst(L,b)
%FORWARDSUBST - forward substitution
%L - lower triangular matrix
%b - right hand side vector

x=zeros(size(b));
n=length(b);
for k=1:n
x(k)=(b(k)-L(k,1:k-1)*x(1:k-1))/L(k,k);
end

MATLAB Source 4.4 Back substitution


function x=backsubst(U,b)
%BACKSUBST - backward substitution
%U - upper triangular matrix
%b - right hand side vector

n=length(b);
x=zeros(size(b));
for k=n:-1:1
x(k)=(b(k)-U(k,k+1:n)*x(k+1:n))/U(k,k);
end

MATLAB Source 4.5 Solution of a linear system by LUP decomposition


function x=lupsolve(A,b)
%LUPSOLVE - solution of a linear system by LUP decomposition
%A - matrix
%b - right hand side vector

[L,U,P]=lup(A);
y=forwardsubst(L,P*b);
x=backsubst(U,y);

4.4.3 Cholesky factorization


Hermitian positive definite matrices can be decomposed into triangular factors twice as quickly
as general matrices. The standard algorithm for this, Cholesky⁴ factorization, is a variant
of Gaussian elimination which operates on both the left and the right of the matrix at once,
preserving and exploiting the symmetry.
Systems having hermitian positive definite matrices play an important role in Numeri-
cal Linear Algebra and its applications. Many matrices that arise in physical systems are
hermitian positive definite because of the fundamental physical laws.
Properties of Hermitian positive definite matrices. Let A be an m × m Hermitian positive
definite matrix.
1. If X is an m × n matrix of full column rank, then $X^*AX$ is Hermitian positive definite;
2. Any principal submatrix of A is positive definite;
3. Any diagonal element of A is a positive real number;
⁴André-Louis Cholesky (1875-1918) was a French military officer involved in geodesy and surveying in
Crete and North Africa just before the First World War. He developed the method now named after him
to compute solutions to the normal equations for some least squares data-fitting problems arising in
geodesy. He died in 1918, shortly before the end of the First World War. His work was published
posthumously on his behalf in 1924 by a fellow officer, commander Benoît, in the Bulletin Géodésique.

4. The eigenvalues of A are positive real numbers;

5. Eigenvectors corresponding to distinct eigenvalues of a hermitian matrix are orthogo-


nal.

A Cholesky factorization of a matrix A is a decomposition
$$A = R^*R, \quad r_{jj} > 0, \tag{4.4.1}$$
where R is an upper triangular matrix.

Theorem 4.4.3. Every hermitian positive definite matrix A ∈ Cm×m has a unique Cholesky
factorization (4.4.1).

Proof. (Existence) Since A is Hermitian and positive definite, $a_{11} > 0$ and we may set
$\alpha = \sqrt{a_{11}}$. Note that
$$A = \begin{pmatrix} a_{11} & w^* \\ w & K \end{pmatrix}
= \begin{pmatrix} \alpha & 0 \\ w/\alpha & I \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & K - ww^*/a_{11} \end{pmatrix}\begin{pmatrix} \alpha & w^*/\alpha \\ 0 & I \end{pmatrix} = R_1^* A_1 R_1. \tag{4.4.2}$$
This is the basic step that is repeated in Cholesky factorization. The matrix $K - ww^*/a_{11}$,
being an (m−1) × (m−1) principal submatrix of the positive definite matrix $A_1 = R_1^{-*}AR_1^{-1}$,
is positive definite and hence its upper left element is positive. By induction, all matrices
that appear during the factorization are positive definite and thus the process cannot break
down. We proceed to the factorization $A_1 = R_2^* A_2 R_2$, and thus $A = R_1^* R_2^* A_2 R_2 R_1$; the
process can be employed until the lower right corner is reached, getting
$$A = \underbrace{R_1^* R_2^* \cdots R_m^*}_{R^*}\underbrace{R_m \cdots R_2 R_1}_{R};$$
this decomposition has the desired form.

(Uniqueness) In fact, the above process also establishes uniqueness. At each step (4.4.2),
the value $\alpha = \sqrt{a_{11}}$ is determined by the form of the factorization $R^*R$, and once α is
determined, the first row of $R_1^*$ is also determined. Since the analogous quantities are
determined at each step, the entire factorization is unique. 

Since only half the matrix needs to be stored, it follows that half of the arithmetic opera-
tions can be avoided. Here is one of many variants of Cholesky decomposition (see MATLAB
Source 4.6). The input matrix A contains the main diagonal and the upper triangular part
of the Hermitian positive definite m × m matrix to be factorized. The output matrix is the upper
triangular factor in the decomposition $A = R^*R$. Each outer iteration corresponds to a single
elementary factorization: the upper triangular part of the submatrix $A_{k:m,k:m}$ is the part above
the diagonal of the Hermitian matrix to be factored at the kth step. The inner loop dominates
the work. A single execution of the line

MATLAB Source 4.6 Cholesky Decomposition


function R=Cholesky(A)
%CHOLESKY - Cholesky factorization
%call R=Cholesky(A)
%A - HPD matrix
%R - upper triangular matrix

[m,n]=size(A);
for k=1:m
if A(k,k)<=0
error('matrix is not HPD')
end
for j=k+1:m
A(j,j:m)=A(j,j:m)-A(k,j:m)*A(k,j)/A(k,k);
end
A(k,k:m)=A(k,k:m)/sqrt(A(k,k));
end
R=triu(A);

requires one division, m − j + 1 multiplications, and m − j + 1 subtractions, for a total of
∼ 2(m − j) flops. This calculation is repeated once for each j from k + 1 to m, and that loop
is repeated for each k from 1 to m. The sum is straightforward to evaluate:
$$\sum_{k=1}^{m}\sum_{j=k+1}^{m} 2(m-j) \sim 2\sum_{k=1}^{m}\sum_{j=1}^{k} j \sim \sum_{k=1}^{m} k^2 \sim \frac{1}{3}m^3 \text{ flops}.$$
Thus, Cholesky factorization involves half as many operations as Gaussian elimination.
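A quick check of the routine (the test matrix is made positive definite by construction):

B=rand(6); A=B'*B+6*eye(6);    %HPD by construction
R=Cholesky(A);
fprintf('||A-R''*R||=%e\n',norm(A-R'*R))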

4.4.4 QR decomposition

Theorem 4.4.4. Let $A \in \mathbb{R}^{m\times n}$, with m ≥ n, have full rank. Then there exist a unique
m × n matrix Q with orthonormal columns and a unique n × n upper triangular matrix R with
positive diagonal ($r_{ii} > 0$) such that A = QR.

Proof. It is a consequence of the algorithm in MATLAB Source 4.7 (given later in this section). 

Orthogonal and unitary matrices are desirable for numerical computation because they
preserve lengths, angles, and do not magnify errors.

Figure 4.2: A Householder reflector ($Px = \|x\|_2 e_1$)

A Householder⁵ transform (or reflection) is a matrix of the form $P = I - 2uu^T$, where
$\|u\|_2 = 1$. One easily checks that $P = P^T$ and
$$PP^T = (I - 2uu^T)(I - 2uu^T) = I - 4uu^T + 4uu^Tuu^T = I,$$
hence P is a symmetric orthogonal matrix. It is called a reflection since Px is the reflection
of x with respect to the hyperplane H which passes through the origin and is orthogonal to u
(Figure 4.2).
Given a vector x, it is easy to find a Householder reflection $P = I - 2uu^T$ that zeroes
out all but the first entry of x: $Px = [c, 0, \ldots, 0]^T = ce_1$. We do this as follows. Write
$Px = x - 2u(u^Tx) = ce_1$, so that $u = \frac{1}{2(u^Tx)}(x - ce_1)$, i.e., u is a linear combination of x
and $e_1$. Since $\|x\|_2 = \|Px\|_2 = |c|$, u must be parallel to the vector $\tilde{u} = x \pm \|x\|_2 e_1$, and so
$u = \tilde{u}/\|\tilde{u}\|_2$. One can verify that either choice of sign yields a u satisfying $Px = ce_1$, as long
as $\tilde{u} \neq 0$. We will use $\tilde{u} = x + \operatorname{sign}(x_1)\|x\|_2 e_1$, since this means that there is no cancellation
in computing the first component of $\tilde{u}$. If $x_1 = 0$, we conventionally choose $\operatorname{sign}(x_1) = 1$.

⁵Alston S. Householder (1904-1993), American mathematician. He made important contributions to
mathematical biology and mainly to numerical linear algebra. His well known book “The Theory of
Matrices in Numerical Analysis” had a great impact on the development of numerical analysis and
computer science.

In summary, we get
$$\tilde{u} = \begin{pmatrix} x_1 + \operatorname{sign}(x_1)\|x\|_2 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad \text{with } u = \frac{\tilde{u}}{\|\tilde{u}\|_2}.$$
In practice, we can store $\tilde{u}$ instead of u to save the work of computing u, and use the formula
$P = I - \frac{2}{\|\tilde{u}\|_2^2}\tilde{u}\tilde{u}^T$ instead of $P = I - 2uu^T$. The matrix $P_i'$ need not be built; rather, we
can apply it directly:
$$A_{i:m,i:n} = P_i' A_{i:m,i:n} = (I - 2u_iu_i^T)A_{i:m,i:n} = A_{i:m,i:n} - 2u_i(u_i^T A_{i:m,i:n}).$$

MATLAB Source 4.7 describes the process of computing QR decomposition using House-
holder reflections.

MATLAB Source 4.7 QR decomposition using Householder reflections


function [R,Q]=HouseQR(A)
%HouseQR - QR decomposition using Househoulder reflections
%call [R,Q]=HouseQR(A)
%A mxn matrix, R upper triangular, Q orthogonal

[m,n]=size(A);
u=zeros(m,n); %reflection vectors
%compute R
for k=1:n
x=A(k:m,k);
x(1)=mysign(x(1))*norm(x)+x(1);
u(k:m,k)=x/norm(x);
A(k:m,k:n)=A(k:m,k:n)-2*u(k:m,k)*(u(k:m,k)'*A(k:m,k:n));
end
R=triu(A(1:n,:));
if nargout==2 %Q wanted
Q=eye(m,n);
for j=1:n
for k=n:-1:1
Q(k:m,j)=Q(k:m,j)-2*u(k:m,k)*(u(k:m,k)'*Q(k:m,j));
end
end
end
%sign
function y=mysign(x)
if x>=0, y=1;
else, y=-1;
end

Starting from the relation
$$Ax = b \Leftrightarrow QRx = b \Leftrightarrow Rx = Q^Tb,$$
we can choose the following strategy for the solution of the linear system Ax = b:

1. Determine the factorization A = QR of A;

2. Compute $y = Q^Tb$;

3. Solve the upper triangular system Rx = y.

The computation of $Q^Tb$ can be performed as $Q^Tb = P_nP_{n-1}\cdots P_1b$, so we only need to
accumulate the products of b with $P_1, P_2, \ldots, P_n$ – see MATLAB Source 4.8.

MATLAB Source 4.8 Solution of Ax = b using QR method


function x=QRSolve(A,b)
%QRSolve - solutions os a system using QR decomposition

[m,n]=size(A);
u=zeros(m,n); %reflection vectors
%compute R and QˆT*b
for k=1:n
x=A(k:m,k);
x(1)=mysign(x(1))*norm(x)+x(1);
u(k:m,k)=x/norm(x);
A(k:m,k:n)=A(k:m,k:n)-2*u(k:m,k)*(u(k:m,k)'*A(k:m,k:n));
b(k:m)=b(k:m)-2*u(k:m,k)*(u(k:m,k)'*b(k:m));
end
R=triu(A(1:n,:));
x=R\b(1:n);

The cost of the QR decomposition A = QR is $2n^2m - \frac{2}{3}n^3$ flops, and the costs for $Q^Tb$ and Qx
are both O(mn).
If we wish to compute the matrix Q explicitly, we can build QI explicitly, by computing
the columns $Qe_1, Qe_2, \ldots, Qe_m$ of QI, as shown in MATLAB Source 4.7.
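A quick check of the Householder routines (test data chosen arbitrarily):

A=rand(6,4);
[R,Q]=HouseQR(A);
fprintf('||A-QR||=%e\n',norm(A-Q*R))
B=rand(5)+5*eye(5); b=rand(5,1);
fprintf('||x-B\\b||=%e\n',norm(QRSolve(B,b)-B\b))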

4.5 Strassen’s algorithm for matrix multiplication


Let $A, B \in \mathbb{R}^{n\times n}$. We wish to compute C = AB. Suppose $n = 2^k$. We split A and B into
2 × 2 block matrices,
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \quad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}.$$
The classical algorithm requires 8 multiplications and 4 additions per step; the running time
is $T(n) = \Theta(n^3)$, since $T(n) = 8T(n/2) + \Theta(n^2)$.

We are interested in reducing the number of multiplications. Volker Strassen [92] dis-
covered a method to reduce the number of multiplications to 7 per step. One computes the
following quantities
p1 = (a11 + a22 )(b11 + b22 )
p2 = (a21 + a22 )b11
p3 = a11 (b12 − b22 )
p4 = a22 (b21 − b11 )
p5 = (a11 + a12 )b22
p6 = (a21 − a11 )(b11 + b12 )
p7 = (a12 − a22 )(b21 + b22 )
c11 = p1 + p4 − p5 + p7
c12 = p3 + p5
c21 = p2 + p4
c22 = p1 + p3 − p2 + p6 .
Since we have 7 multiplications and 18 additions per step, the running time satisfies the
recurrence
$$T(n) = 7T(n/2) + \Theta(n^2).$$
The solution is
$$T(n) = \Theta(n^{\log_2 7}) \sim 28n^{\log_2 7}.$$
For an implementation see MATLAB Source 4.9.
The algorithm can be extended to matrices of n = m · 2k size. If n is odd, then the last
column of the result can be computed using standard method; then Strassen’s algorithm is
applied to n − 1 by n − 1 matrices: m · 2k+1 → m · 2k . The p-s can be computed in parallel;
the c-s, too.
The theoretical speed-up of matrix multiplication translates into a speed-up of matrix
inversion, hence of the solution of linear algebraic systems. If M(n) is the time for the
multiplication of two n × n matrices and I(n) is the inversion time for an n × n matrix, then
$M(n) = \Theta(I(n))$. This can be proven in two steps: we show that $M(n) = O(I(n))$ and then
$I(n) = O(M(n))$.

Theorem 4.5.1 (Multiplication is not harder than inversion). If we can invert an n × n ma-
trix in I(n) time, where I(n) = Ω(n²) satisfies the regularity condition I(3n) = O(I(n)),
then we can multiply two nth order matrices in O(I(n)) time.

Note that I(n) satisfies the regularity condition only if I(n) has not large jumps in its
values. For example, if I(n) = Θ(nc logd n), for any constants c > 0, d ≥ 0, then I(n)
satisfies the regularity conditions.

Theorem 4.5.2 (Inversion is not harder than multiplication). If we can multiply two real
n×n matrices in M (n) time, where M (n) = Ω(n2 ), M (n) satisfies the regularity conditions
M (n) = O(M (n + k)) for each k, 0 ≤ k ≤ n, and M (n/2) ≤ cM (n), for any constant
c < 1/2, then we can invert a real n × n nonsingular matrix in O(M (n)) time.

For proofs of the last two theorems, see [18].



MATLAB Source 4.9 Strassen’s algorithm for matrix multiplication


function C=strass(A,B,mmin)
%STRASS - Strassen algorithm for matrix multiplication
%size = 2ˆk
%A,B - square matrices
%C - product A*B
%mmin - minimum size

[m,n]=size(A);
if m<=mmin
%classical multiplication
C=A*B;
else
%subdivision and recursive call
n=m/2;
u=1:n; v=n+1:m;
P1=strass(A(u,u)+A(v,v),B(u,u)+B(v,v),mmin);
P2=strass(A(v,u)+A(v,v),B(u,u),mmin);
P3=strass(A(u,u),B(u,v)-B(v,v),mmin);
P4=strass(A(v,v),B(v,u)-B(u,u),mmin);
P5=strass(A(u,u)+A(u,v),B(v,v),mmin);
P6=strass(A(v,u)-A(u,u),B(u,u)+B(u,v),mmin);
P7=strass(A(u,v)-A(v,v),B(v,u)+B(v,v),mmin);
C(u,u)=P1+P4-P5+P7;
C(u,v)=P3+P5;
C(v,u)=P2+P4;
C(v,v)=P1+P3-P2+P6;
end
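Usage (the threshold mmin is an arbitrary choice; below it the routine falls back to classical multiplication):

A=rand(256); B=rand(256);      %size 2^k, as required
C=strass(A,B,32);              %recurse down to 32x32 blocks
fprintf('||C-A*B||=%e\n',norm(C-A*B,'fro'))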

4.6 Solution of Algebraic Linear Systems in MATLAB


Let m be the number of equations and n the number of unknowns. The fundamental tool for
solving a linear system of equations is the backslash operator, \ (see Section 1.3.3).
It handles three types of linear systems: square (m = n), overdetermined (m > n) and
underdetermined (m < n). We shall give more details on overdetermined systems in Chapter
5. More generally, the \ operator can be used to solve AX = B, where B is a matrix with
p columns; in this case MATLAB solves AX(:,j) = B(:,j) for j = 1:p. We can use
X = B/A to solve systems of the form XA = B.

4.6.1 Square systems


If A is an n × n nonsingular matrix, then A\b is the solution of the system Ax=b, computed
by LU factorization with partial pivoting. During the solution process MATLAB computes
rcond(A), and it prints a warning message if the result is smaller than about eps:

x = hilb(15)\ones(15,1)

Warning: Matrix is close to singular or badly scaled.


Results may be inaccurate. RCOND = 1.543404e-018.

MATLAB recognizes three special forms of square systems and takes advantage of them
to reduce the computation.

• Triangular matrix, or permutation of a triangular matrix. The system is solved by


substitution.

• Upper Hessenberg matrix. The system is solved by LU decomposition with partial


pivoting, taking the advantage of the upper Hessenberg form.

• Hermitian positive definite matrix. Cholesky factorization is used instead of LU fac-
torization. When \ is called with a Hermitian matrix that has positive diagonal elements,
MATLAB attempts to Cholesky factorize the matrix. How does MATLAB know the
matrix is positive definite? If the Cholesky factorization succeeds, the matrix is positive
definite and the factorization is used to solve the system; otherwise an LU factorization
is carried out.

4.6.2 Overdetermined systems


In general, if m > n, the system Ax = b has no solution. The MATLAB expression A\b gives a
least squares solution to the system, that is, it minimizes norm(A*x-b) (the 2-norm of the
residual) over all vectors x. If A has full rank n, there is a unique least squares solution. If A
has rank k less than n, then A\b is a basic solution — one with at most k nonzero elements (k is
determined, and x computed, using the QR factorization with column pivoting). In the latter
case MATLAB displays a warning message.
A least squares solution to Ax = b can also be computed as xmin = pinv(A)*b,
where the pinv function computes the pseudo-inverse. In the case where A is rank-deficient,
xmin is the unique solution of minimal 2-norm.
The (Moore-Penrose) pseudo-inverse generalizes the notion of inverse to rectangular and
rank-deficient matrices A and is written $A^+$. It is computed with pinv(A). The pseudo-
inverse $A^+$ of A can be characterized as the unique matrix $X = A^+$ satisfying the four
conditions AXA = A, XAX = X, $(XA)^* = XA$ and $(AX)^* = AX$. It can also be
written explicitly in terms of the singular value decomposition (SVD): if the SVD of A is
$A = U\Sigma V^*$, then $A^+ = V\Sigma^+U^*$, where $\Sigma^+$ is n × m diagonal with (i,i) entry $1/\sigma_i$ if
$\sigma_i > 0$ and 0 otherwise. To illustrate,
>> Y=pinv(ones(3))
Y =
0.1111 0.1111 0.1111
0.1111 0.1111 0.1111
0.1111 0.1111 0.1111
>> A=[0 0 0 0; 0 1 0 0; 0 0 2 0]
A =
0 0 0 0
0 1 0 0
0 0 2 0
>> pinv(A)

ans =
0 0 0
0 1.0000 0
0 0 0.5000
0 0 0

A vector that minimizes the 2-norm of Ax − b over all nonnegative vectors x, for real A
and b, is computed by lsqnonneg. The simplest usage is x = lsqnonneg(A,b), and
several other input and output arguments can be specified, including a starting vector for the
iterative algorithm that is used. Example
>> A = gallery('lauchli',2,0.2), b = [1 2 4]';
A =
1.0000 1.0000
0.2000 0
0 0.2000
>> x=A\b;
>> xn=lsqnonneg(A,b);
>> [x xn], [norm(A*x-b) norm(A*xn-b)]
ans =
-4.2157 0
5.7843 1.7308
ans =
4.0608 4.2290

4.6.3 Underdetermined systems


In the case of an underdetermined system, we can have either no solution or infinitely many.
In the latter case, A\b produces a basic solution, one with at most k nonzero elements, where
k is the rank of A. This solution is generally not the solution of minimal 2-norm, which can
be computed as pinv(A)*b. If the system has no solution (that is, it is inconsistent), then
A\b is a least squares solution. The next example illustrates the difference between the \ and
pinv solutions:
>> A = [1 1 1; 1 1 -1], b=[3; 1]
A =
1 1 1
1 1 -1
b =
3
1
>> x=A\b; y = pinv(A)*b;
>> [x y]
ans =
2.0000 1.0000
0 1.0000
1.0000 1.0000
>> [norm(x) norm(y)]
ans =
2.2361 1.7321

MATLAB uses QR factorization with column pivoting. Consider the example


>> R=fix(10*rand(2,4))
R =
9 6 8 4
2 4 7 0
>> b=fix(10*rand(2,1))
b =
8
4

The system has 2 equations and 4 unknowns. Since the coefficient matrix contains small
integers, it is appropriate to display the solution in rational format. One obtains a particular
solution with:
>> format rat
>> p=R\b
p =
24/47
0
20/47
0

One of the nonzero components is p(3), since R(:,3) is the column of R with the largest
norm. The other nonzero component is p(1), because R(:,1) dominates after R(:,3) is
eliminated.
The complete solution to the underdetermined system can be characterized by adding an
arbitrary vector from the null space, which can be found using the null function with an
option requesting a "rational" basis:
>> Z=null(R,'r')
Z =
5/12 -2/3
-47/24 1/3
1 0
0 1

It can be confirmed that R*Z is the zero matrix and that any vector x of the form x=p+Z*q,
for an arbitrary vector q, satisfies R*x=b.

4.6.4 LU and Cholesky factorizations


The lu function computes an LUP factorization with partial pivoting. The MATLAB call
[L,U,P]=lu(A) returns the triangular factors and the permutation matrix. The form
[L,U]=lu(A) returns $\bar{L} = P^TL$ as its first output, so $\bar{L}$ is a lower triangular matrix with
permuted rows. We illustrate with an example.
>> format short g
>> A = gallery('fiedler',3), [L,U]=lu(A)
A =

0 1 2
1 0 1
2 1 0
L =
0 1 0
0.5 -0.5 1
1 0 0
U =
2 1 0
0 1 2
0 0 2

The lu function works also for rectangular matrices. If A is m × n, the call [L,U]=lu(A)
produces an m × n L and an n × n U if m ≥ n and an m × m L and an m × n U if m < n.
The solution of Ax=b with square A by using x=A\b is equivalent to the MATLAB
sequence:
[L,U] = lu(A); x = U\(L\b);
Determinants and matrix inverses are computed also through LU factorization:

det(A)=det(L)*det(U)=+-prod(diag(U))
inv(A)=inv(U)*inv(L)

The command chol(A), where A is Hermitian positive definite, computes the Cholesky
factor R of A. Since the \ operator knows how to handle triangular matrices, the system can
be solved quickly with x=R\(R'\b). We give an example of Cholesky factorization:

>> A=pascal(4)
A =
1 1 1 1
1 2 3 4
1 3 6 10
1 4 10 20
>> R=chol(A)
R =
1 1 1 1
0 1 2 3
0 0 1 3
0 0 0 1

Note that chol looks only at the elements in the upper triangle of A (including the diag-
onal) — it factorizes the Hermitian matrix agreeing with the upper triangle of A. An error is
produced if A is not positive definite. The chol function can be used to test whether a matrix
is positive definite [R,p] = chol(A), where the integer p will be zero if the factorization
succeeds and positive otherwise; see help chol for more details about p.
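For instance, a small illustration of this test (pascal(4) is positive definite, while its negative is not):

>> [R,p] = chol(pascal(4)); p    % p = 0: factorization succeeded
>> [R,p] = chol(-pascal(4)); p   % p > 0: matrix not positive definite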
Function cholupdate modifies the Cholesky factorization when the original matrix is
subjected to a rank 1 perturbation (that is, a perturbation of the form either $+xx^*$ or $-xx^*$,
where x is a vector).

4.6.5 QR factorization
There are four variants of the QR factorization — full (complete) or economy (reduced) size,
and with or without column permutation.
The full QR factorization of an m × n matrix C, m > n, produces an m × m orthogonal
matrix Q and an m × n upper triangular matrix R. The call syntax is [Q,R]=qr(C). In
many cases, the last m − n columns of Q are not needed, because they are multiplied by the zeros
in the bottom portion of R. For example, for the matrix C given below:
C =
1 1
1 2
1 3
>> [Q,R]=qr(C)
Q =
-0.5774 0.7071 0.4082
-0.5774 0.0000 -0.8165
-0.5774 -0.7071 0.4082
R =
-1.7321 -3.4641
0 -1.4142
0 0

The economy size or reduced QR factorization produces an m × n rectangular matrix Q, with
orthonormal columns, and an n × n upper triangular matrix R. Example:
>> [Q,R]=qr(C,0)
Q =
-0.5774 0.7071
-0.5774 0.0000
-0.5774 -0.7071
R =
-1.7321 -3.4641
0 -1.4142
For larger, highly rectangular matrices, the savings in both time and memory can be
significant.
In contrast to the LU factorization, the standard QR factorization does not require any
pivoting or permutations. But an optional column permutation, triggered by the presence of
a third output argument, is useful for detecting singularity or rank deficiency. A QR factor-
ization with column pivoting has the form AP = QR, where P is a permutation matrix.
The permutation strategy that is used produces a factor R whose diagonal elements are non-
increasing: |r11 | ≥ |r22 | ≥ · · · ≥ |rnn |. Column pivoting is particularly appropriate when
A is suspected of being rank-deficient, as it helps to reveal near rank-deficiency. Roughly
speaking, if A is close to a matrix of rank r < n, then the last n − r diagonal elements of R
will be of order eps*norm(A). A third output argument forces the qr function to use column
pivoting and return the permutation matrix: [Q,R,P] = qr(A). Example:
>> [Q,R,P]=qr(C)
Q =

-0.2673 0.8729 0.4082


-0.5345 0.2182 -0.8165
-0.8018 -0.4364 0.4082
R =
-3.7417 -1.6036
0 0.6547
0 0
P =
0 1
1 0

If we combine pivoting with the economy form, the qr function returns a permutation
vector instead of a permutation matrix:

>> [Q,R,P]=qr(C,0)
Q =
-0.2673 0.8729
-0.5345 0.2182
-0.8018 -0.4364
R =
-3.7417 -1.6036
0 0.6547
P =
2 1

Functions qrdelete, qrinsert, and qrupdate modify the QR factorization when a
column of the original matrix is deleted or inserted or when a rank 1 perturbation is added.
Consider now a system Ax = b, where A is an n × n matrix of the form
$$A = (a_{ij}), \qquad a_{ij} = \begin{cases} 1, & \text{for } i = j \text{ or } j = n; \\ -1, & \text{for } i > j; \\ 0, & \text{otherwise.} \end{cases}$$
For example, if n = 6,
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 1 \\ -1 & 1 & 0 & 0 & 0 & 1 \\ -1 & -1 & 1 & 0 & 0 & 1 \\ -1 & -1 & -1 & 1 & 0 & 1 \\ -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & -1 & -1 & 1 \end{bmatrix}.$$

For a given n, we can generate such a matrix with the sequence

A=[-tril(ones(n,n-1),-1)+eye(n,n-1),ones(n,1)]

Suppose we set b using b=A*ones(n,1). The solution of our system is x = [1, 1, . . . , 1]T .
For n=100, the \ operator yields

>> x=A\b;
>> reshape(x,10,10)
ans =
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 0 0 0 0 1
>> norm(b-A*x)/norm(b)
ans =
0.3191
a wrong result, although A is well conditioned
>> cond(A)
ans =
44.8023
If we solve the system by the QR method, we obtain
>> [Q,R]=qr(A);
>> x2=R\(Q’*b);
>> x2’
ans =
Columns 1 through 6
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
...
Columns 97 through 100
1.0000 1.0000 1.0000 1.0000
>> norm(b-A*x2)/norm(b)
ans =
8.6949e-016

4.6.6 The linsolve function


The linsolve function allows a faster solution of a square or a rectangular system, by
indicating the properties of the system matrix. It can be an alternative to \ operator when
efficiency is important. The call
x = linsolve(A,b,opts)
solves the linear system A*x=b, using the solver that is most appropriate given the properties
of the matrix A, which you specify in the structure opts. If A does not have the properties
that you specify in opts, linsolve may return incorrect results, without an error
message. The following table lists all the fields of opts and their corresponding matrix
properties. The values of the fields of opts must be logical, and the default value for all
fields is false.

Field Name Matrix Property


LT Lower triangular
UT Upper triangular
UHESS Upper Hessenberg
SYM Real symmetric or complex Hermitian
POSDEF Positive definite
RECT General rectangular
TRANSA Conjugate transpose - specifies whether the function solves
A*x=b or A'*x=b

If opts is missing, linsolve solves the linear system using LU factorization with partial
pivoting when A is square and QR factorization with column pivoting otherwise. It returns a
warning if A is square and ill conditioned or if it is not square and rank deficient. Example:

>> A = triu(rand(5,3)); x = [1 1 1 0 0]';
>> b = A'*x;
>> y1 = (A')\b
y1 =
1.0000
1.0000
1.0000
0
0
>> opts.UT = true; opts.TRANSA = true;
>> y2 = linsolve(A,b,opts)
y2 =
1.0000
1.0000
1.0000
0
0

4.7 Iterative refinement


If the solution method for Ax = b is unstable, then $A\bar x \neq b$, where $\bar x$ is the computed solution.
We shall compute a correction $\Delta x_1$ such that
$$A(\bar x + \Delta x_1) = b \;\Rightarrow\; A\,\Delta x_1 = b - A\bar x.$$
We solve this system and obtain a new $\bar x$, $\bar x := \bar x + \Delta x_1$. If again $A\bar x \neq b$, we
compute a new correction, and repeat until
$$\|\Delta x_i - \Delta x_{i-1}\| < \varepsilon \quad \text{or} \quad \|A\bar x - b\| < \varepsilon.$$
The computation of the residual vector $r = b - A\bar x$ will be performed in double precision.
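The following function is a minimal sketch of this scheme (an illustration, not part of the book's code; the name itref and the stopping test are our choices). It factors A once and reuses the factors for each correction; for full benefit the residual r should be accumulated in higher precision, which plain MATLAB arithmetic does not do:

function x = itref(A,b,nitmax,tol)
%ITREF - iterative refinement for A*x=b (sketch)
if nargin < 4, tol = 1e-12; end
if nargin < 3, nitmax = 10; end
[L,U,P] = lu(A);          % factor once
x = U\(L\(P*b));          % initial solution
for i = 1:nitmax
    r = b - A*x;          % residual (ideally in extended precision)
    dx = U\(L\(P*r));     % correction from A*dx = r
    x = x + dx;
    if norm(dx) <= tol*norm(x), return, end
end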

4.8 Iterative solution of Linear Algebraic Systems


We wish to compute the solution
Ax = b, (4.8.1)
when A is invertible. Suppose we have found a matrix T and a vector c such that I − T is
invertible and the unique fixed point of the equation
$$x = Tx + c \qquad (4.8.2)$$
equals the solution of the system Ax = b. Let $x^*$ be the solution of (4.8.1) or, equivalently, of (4.8.2).
Iteration: $x^{(0)}$ given; one defines the sequence $(x^{(k)})$ by
$$x^{(k+1)} = Tx^{(k)} + c, \quad k \in \mathbb{N}. \qquad (4.8.3)$$

Lemma 4.8.1. If $\rho(X) < 1$, then $(I - X)^{-1}$ exists and
$$(I - X)^{-1} = I + X + X^2 + \cdots + X^k + \cdots$$

Proof. Let
$$S_k = I + X + \cdots + X^k.$$
Then
$$(I - X)S_k = I - X^{k+1},$$
so
$$\lim_{k\to\infty}(I - X)S_k = I \;\Rightarrow\; \lim_{k\to\infty} S_k = (I - X)^{-1},$$
since $X^{k+1} \to 0 \Leftrightarrow \rho(X) < 1$ (Theorem 4.1.8). $\square$

Theorem 4.8.2. The following statements are equivalent:

(1) method (4.8.3) is convergent;

(2) $\rho(T) < 1$;

(3) $\|T\| < 1$ for at least one matrix norm.

Proof.
$$x^{(k)} = Tx^{(k-1)} + c = T(Tx^{(k-2)} + c) + c = \cdots = T^k x^{(0)} + (I + T + \cdots + T^{k-1})c.$$
Hence (4.8.3) is convergent $\Leftrightarrow T^k \to 0 \Leftrightarrow \rho(T) < 1 \Leftrightarrow$ there exists $\|\cdot\|$ such that $\|T\| < 1$
(from Theorem 4.1.8). $\square$

Banach’s fixpoint theorem implies:



Theorem 4.8.3. If there exists $\|\cdot\|$ such that $\|T\| < 1$, the sequence $(x^{(k)})$ given by (4.8.3)
is convergent for any $x^{(0)} \in \mathbb{R}^n$ and the following estimates hold:
$$\|x^* - x^{(k)}\| \le \|T\|^k\,\|x^{(0)} - x^*\| \qquad (4.8.4)$$
$$\|x^* - x^{(k)}\| \le \frac{\|T\|^k}{1 - \|T\|}\,\|x^{(1)} - x^{(0)}\| \le \frac{\|T\|}{1 - \|T\|}\,\|x^{(1)} - x^{(0)}\|. \qquad (4.8.5)$$

An iterative method for the solution of a linear algebraic system Ax = b starts from
an initial approximation $x^{(0)} \in \mathbb{R}^n$ ($\mathbb{C}^n$) and generates a sequence of vectors $\{x^{(k)}\}$ that
converges to the solution $x^*$ of the system. Such techniques transform the initial system into
an equivalent system of the form $x = Tx + c$, $T \in \mathbb{K}^{n\times n}$, $c \in \mathbb{K}^n$. One generates the
sequence $x^{(k)} = Tx^{(k-1)} + c$.
The stopping criterion is
$$\|x^{(k)} - x^{(k-1)}\| \le \frac{1 - \|T\|}{\|T\|}\,\varepsilon. \qquad (4.8.6)$$

It is based on the following result:

Proposition 4.8.4. If $x^*$ is the solution of (4.8.2) and $\|T\| < 1$, then
$$\|x^* - x^{(k)}\| \le \frac{\|T\|}{1 - \|T\|}\,\|x^{(k)} - x^{(k-1)}\|. \qquad (4.8.7)$$

Proof. Let $p \in \mathbb{N}^*$. We have
$$\|x^{(k+p)} - x^{(k)}\| \le \|x^{(k+1)} - x^{(k)}\| + \cdots + \|x^{(k+p)} - x^{(k+p-1)}\|. \qquad (4.8.8)$$
On the other hand, (4.8.3) implies
$$\|x^{(m+1)} - x^{(m)}\| \le \|T\|\,\|x^{(m)} - x^{(m-1)}\|$$
or, for $m > k$,
$$\|x^{(m+1)} - x^{(m)}\| \le \|T\|^{m-k+1}\,\|x^{(k)} - x^{(k-1)}\|.$$
By applying these inequalities successively for $m = k, \dots, k+p-1$, relation (4.8.8) becomes
$$\|x^{(k+p)} - x^{(k)}\| \le (\|T\| + \cdots + \|T\|^p)\,\|x^{(k)} - x^{(k-1)}\| \le (\|T\| + \cdots + \|T\|^p + \cdots)\,\|x^{(k)} - x^{(k-1)}\|.$$
Since $\|T\| < 1$, we have
$$\|x^{(k+p)} - x^{(k)}\| \le \frac{\|T\|}{1 - \|T\|}\,\|x^{(k)} - x^{(k-1)}\|,$$
which, when passing to the limit with respect to p, yields (4.8.7). $\square$



If $\|T\| \le 1/2$, inequality (4.8.7) becomes
$$\|x^* - x^{(k)}\| \le \|x^{(k)} - x^{(k-1)}\|,$$
and the stopping criterion
$$\|x^{(k)} - x^{(k-1)}\| \le \varepsilon.$$

Iterative methods are seldom used for the solution of small systems, since the time required
to attain the desired accuracy exceeds that of Gaussian elimination. For large
sparse systems (i.e., systems whose matrix has many zeros), iterative methods are efficient
both in time and space.
Consider the system Ax = b and suppose we can split A as A = M − N. If M can be easily
inverted (diagonal, triangular, and so on), it is more convenient to carry out the computation
in the following manner:
$$Ax = b \Leftrightarrow Mx = Nx + b \Leftrightarrow x = M^{-1}Nx + M^{-1}b.$$
The last equation is of the form $x = Tx + c$, where $T = M^{-1}N = I - M^{-1}A$. One obtains
the sequence
$$x^{(k+1)} = M^{-1}Nx^{(k)} + M^{-1}b, \quad k \in \mathbb{N},$$
where $x^{(0)}$ is an arbitrary vector.
The first splitting we consider is $A = D - L - U$, where
$$(D)_{ij} = a_{ij}\delta_{ij}, \qquad (-L)_{ij} = \begin{cases} a_{ij}, & i > j \\ 0, & \text{otherwise,} \end{cases} \qquad (-U)_{ij} = \begin{cases} a_{ij}, & i < j \\ 0, & \text{otherwise.} \end{cases}$$
Taking M = D, N = L + U, one successively obtains
$$Ax = b \Leftrightarrow Dx = (L + U)x + b \Leftrightarrow x = D^{-1}(L + U)x + D^{-1}b.$$

Thus, $T = T_J = D^{-1}(L + U)$, $c = c_J = D^{-1}b$ — this is Jacobi's method (D is
invertible, why?), due to Carl Jacobi.⁶ [49]
Another decomposition is $A = D - L - U$, $M = D - L$, $N = U$, which yields
$T_{GS} = (D - L)^{-1}U$ and $c_{GS} = (D - L)^{-1}b$ — called the Gauss-Seidel method ($D - L$
invertible, why?).

⁶Carl Gustav Jacob Jacobi (1804–1851) was a contemporary of Gauss, and with him one of the most important
19th-century mathematicians in Germany. His name is connected with elliptic functions, partial differential
equations of dynamics, calculus of variations, and celestial mechanics; functional determinants also bear his name.
In his work on celestial mechanics he invented what is now called the Jacobi method for solving linear algebraic
systems.

MATLAB Source 4.10 Jacobi method for linear systems


function [x,ni]=Jacobi(A,b,x0,err,nitmax)
%JACOBI Jacobi method
%call [x,ni]=Jacobi(A,b,x0,err,nitmax)
%parameters
%A - system matrix
%b - right hand side vector
%x0 - starting vector
%err - tolerance (default 1e-3)
%nitmax - maximum number of iterations (default 50)
%x - solution
%ni -number of actual iterations

%parameter check
if nargin < 5, nitmax=50; end
if nargin < 4, err=1e-3; end
if nargin <3, x0=zeros(size(b)); end
[m,n]=size(A);
if (m~=n) | (n~=length(b))
error('illegal size')
end
%compute T and c (prepare iterations)
M=diag(diag(A));
N=M-A;
T=inv(M)*N;
c=inv(M)*b;
alfa=norm(T,inf);
x=x0(:);
for i=1:nitmax
x0=x;
x=T*x0+c;
if norm(x-x0,inf)<(1-alfa)/alfa*err
ni=i;
return
end
end
error('iteration number exceeded')

 
Let us examine the Jacobi iteration
$$x_i^{(k)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{\substack{j=1\\ j\neq i}}^{n} a_{ij}x_j^{(k-1)}\Bigr).$$
Computation of $x_i^{(k)}$ uses all components of $x^{(k-1)}$ (simultaneous substitution). Since for
$i > 1$ the components $x_1^{(k)}, \dots, x_{i-1}^{(k)}$ have already been computed, and we suppose they are better
approximations of the solution components than $x_1^{(k-1)}, \dots, x_{i-1}^{(k-1)}$, it seems reasonable to compute
$x_i^{(k)}$ using the most recent values, i.e.,
$$x_i^{(k)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k-1)}\Bigr).$$

One can state necessary and sufficient conditions for the convergence of the Jacobi and
Gauss-Seidel methods,
$$\rho(T_J) < 1, \qquad \rho(T_{GS}) < 1,$$
and sufficient conditions: there exists $\|\cdot\|$ such that
$$\|T_J\| < 1, \qquad \|T_{GS}\| < 1.$$
We can improve the Gauss-Seidel method by introducing a parameter $\omega$ and splitting
$$M = \frac{D}{\omega} - L.$$
We have
$$A = \left(\frac{D}{\omega} - L\right) - \left(\frac{1-\omega}{\omega}D + U\right),$$
and the iteration is
$$\left(\frac{D}{\omega} - L\right)x^{(k+1)} = \left(\frac{1-\omega}{\omega}D + U\right)x^{(k)} + b.$$
Finally, we obtain the matrix
$$T = T_\omega = \left(\frac{D}{\omega} - L\right)^{-1}\left(\frac{1-\omega}{\omega}D + U\right) = (D - \omega L)^{-1}\bigl((1-\omega)D + \omega U\bigr).$$

The method is called the relaxation method. We have the following variants:

– ω > 1 overrelaxation (SOR - Successive Over Relaxation)


– ω < 1 subrelaxation

MATLAB Source 4.11 Successive Overrelaxation method (SOR)


function [x,ni]=relax(A,b,omega,x0,err,nitmax)
%RELAX Successive overrelaxation (SOR) method
%call [x,ni]=relax(A,b,omega,x0,err,nitmax)
%parameters
%A - system matrix
%b - right hand side vector
%omega - relaxation parameter
%x0 - starting vector
%err - tolerance (default 1e-3)
%nitmax - maximum number of iterations (default 50)
%x - solution
%ni - actual number of iterations

%check parameters
if nargin < 6, nitmax=50; end
if nargin < 5, err=1e-3; end
if nargin < 4, x0=zeros(size(b)); end
if (omega<=0) | (omega>=2)
error('illegal relaxation parameter')
end
[m,n]=size(A);
if (m~=n) | (n~=length(b))
error('illegal size')
end
%compute T and c (prepare iterations)
M=1/omega*diag(diag(A))+tril(A,-1);
N=M-A;
T=M\N;
c=M\b;
alfa=norm(T,inf);
x=x0(:);
for i=1:nitmax
x0=x;
x=T*x0+c;
if norm(x-x0,inf)<(1-alfa)/alfa*err
ni=i;
return
end
end
error('iteration number exceeded')

– ω = 1 Gauss-Seidel

In the sequel, we state two theorems on the convergence of the relaxation method.


Theorem 4.8.5 (Kahan). If $a_{ii} \neq 0$, $i = 1,\dots,n$, then $\rho(T_\omega) \ge |\omega - 1|$. This implies the following
necessary condition: $\rho(T_\omega) < 1 \Rightarrow 0 < \omega < 2$.

Theorem 4.8.6 (Ostrowski-Reich). If A is a positive definite matrix, and 0 < ω < 2, then
SOR converges for any choice of the initial approximation x(0) .

Remark 4.8.7. For the Jacobi (and Gauss-Seidel) method a sufficient condition for convergence
is
$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^n |a_{ij}| \quad (A \text{ row diagonally dominant})$$
or
$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^n |a_{ji}| \quad (A \text{ column diagonally dominant}). \qquad \diamond$$

The optimal value for $\omega$ is
$$\omega_O = \frac{2}{1 + \sqrt{1 - \rho(T_J)^2}}.$$
For an implementation, see MATLAB Source 4.12. In practice, finding the optimal relaxation
parameter is inefficient. The function has only a didactic purpose.

MATLAB Source 4.12 Finding optimal value of relaxation parameter


function omega=relopt(A)
%RELOPT find optimal value of relaxation parameter
%call omega=relopt(A)
M=diag(diag(A)); %find Jacobi matrix
N=M-A;
T=M\N;
e=eig(T);
rt=max(abs(e)); %spectral radius of Jacobi matrix
omega=2/(1+sqrt(1-rt^2));
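A small experiment putting Jacobi, relax, and relopt together on a diagonally dominant tridiagonal system (illustrative only; the iteration counts depend on the tolerance):

>> n = 50; A = diag(4*ones(n,1))-diag(ones(n-1,1),1)-diag(ones(n-1,1),-1);
>> b = A*ones(n,1);             % the exact solution is all ones
>> [xj,nj] = Jacobi(A,b,zeros(n,1),1e-8,1000);
>> w = relopt(A)
>> [xs,ns] = relax(A,b,w,zeros(n,1),1e-8,1000);
>> [nj, ns]                     % SOR typically needs fewer iterations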

4.9 Applications
4.9.1 The finite difference method for a linear two-point boundary value
problem
Consider the two-point boundary value problem $y''(x) - p(x)y'(x) - q(x)y(x) = r(x)$ on the
interval [a,b] with boundary conditions $y(a) = \alpha$, $y(b) = \beta$. We also assume $q(x) \ge q > 0$.

This equation may be used to model the heat flow in a long, thin rod, for example. To solve
the differential equation numerically, we discretize it by seeking its solution only at the evenly
spaced mesh points $x_i = a + ih$, $i = 0,\dots,N+1$, where $h = (b-a)/(N+1)$ is the mesh
spacing. Define $p_i = p(x_i)$, $r_i = r(x_i)$, and $q_i = q(x_i)$. We need to derive equations to
solve for our desired approximations $y_i \approx y(x_i)$, where $y_0 = \alpha$ and $y_{N+1} = \beta$.
To derive these equations, we approximate the derivative $y'(x_i)$ by the finite
difference approximation
$$y'(x_i) \approx \frac{y_{i+1} - y_{i-1}}{2h}$$
and the second derivative by
$$y''(x_i) \approx \frac{y_{i+1} - 2y_i + y_{i-1}}{h^2}.$$
Inserting these approximations into the differential equation yields
$$\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} - p_i\,\frac{y_{i+1} - y_{i-1}}{2h} - q_i y_i = r_i, \quad 1 \le i \le N.$$
Rewriting this as a linear system we get $Ay = b$, where
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad b = -\frac{h^2}{2}\begin{bmatrix} r_1 \\ \vdots \\ r_N \end{bmatrix} + \begin{bmatrix} \left(\tfrac{1}{2} + \tfrac{h}{4}p_1\right)\alpha \\ 0 \\ \vdots \\ 0 \\ \left(\tfrac{1}{2} - \tfrac{h}{4}p_N\right)\beta \end{bmatrix},$$
and
$$A = \begin{bmatrix} a_1 & -c_1 & & \\ -b_2 & \ddots & \ddots & \\ & \ddots & \ddots & -c_{N-1} \\ & & -b_N & a_N \end{bmatrix}, \qquad \begin{aligned} a_i &= 1 + \tfrac{h^2}{2}q_i, \\ b_i &= \tfrac{1}{2}\bigl(1 + \tfrac{h}{2}p_i\bigr), \\ c_i &= \tfrac{1}{2}\bigl(1 - \tfrac{h}{2}p_i\bigr). \end{aligned}$$
Note that ai > 0, and also bi > 0 and ci > 0, if h is small enough.
This is a nonsymmetric tridiagonal system to solve for y. We will show how to change
it to a symmetric positive definite tridiagonal system, so that we may use band Cholesky to
solve it.
Choose
$$D = \mathrm{diag}\left(1, \sqrt{\frac{c_1}{b_2}}, \sqrt{\frac{c_1 c_2}{b_2 b_3}}, \dots, \sqrt{\frac{c_1 c_2 \cdots c_{N-1}}{b_2 b_3 \cdots b_N}}\right).$$
Then we may change $Ay = b$ to $(DAD^{-1})(Dy) = Db$, or $\tilde A\tilde y = \tilde b$, where
$$\tilde A = \begin{bmatrix} a_1 & -\sqrt{c_1 b_2} & & & \\ -\sqrt{c_1 b_2} & a_2 & -\sqrt{c_2 b_3} & & \\ & -\sqrt{c_2 b_3} & \ddots & \ddots & \\ & & \ddots & \ddots & -\sqrt{c_{N-1}b_N} \\ & & & -\sqrt{c_{N-1}b_N} & a_N \end{bmatrix}.$$

It is easy to see that $\tilde A$ is symmetric, and it has the same eigenvalues as A because $\tilde A$ and
A are similar. By Gershgorin's theorem⁷ the eigenvalues of A lie inside the disks centered at
$1 + \frac{h^2}{2}q_i \ge 1 + \frac{h^2}{2}q$ with radius at most 1; in particular, they must all have positive real parts.
Since $\tilde A$ is symmetric, its eigenvalues are real and hence positive, so $\tilde A$ is positive definite. Its
smallest eigenvalue is bounded below by $h^2 q/2$. Thus, the system can be solved by the Cholesky method.
For implementations of both methods, see MATLAB Sources 4.13 and 4.14, respectively.

MATLAB Source 4.13 Two point boundary value problem – finite difference method
function [x,y]=bilocal(p,q,r,a,b,alpha,beta,N)
%BILOCAL - two-point boundary value problem
%y''(x)-p(x)y'(x)-q(x)y(x)=r(x), x in [a,b]
%y(a)=alpha, y(b)=beta
%call [X,Y]=BILOCAL(P,Q,R,A,B,ALPHA,BETA,N)
%P,Q,R - functions
%[A,B] - interval
%alpha,beta - values at endpoints
%N - no. of points

h=(b-a)/(N+1); x=a+[1:N]'*h;
vp=p(x); vr=r(x); vq=q(x);
av=1+h^2/2*vq;
bv=1/2*(1+h/2*vp);
cv=1/2*(1-h/2*vp);
B=[[-bv(2:end);0],av,[0;-cv(1:end-1)]];
A=spdiags(B,[-1:1],N,N);
bb=-h^2/2*vr;
bb(1)=bb(1)+(1/2+h/4*vp(1))*alpha;
bb(N)=bb(N)+(1/2-h/4*vp(N))*beta;
y=A\bb;
x=[a;x;b];
y=[alpha;y;beta];

We test both methods for the problem
$$y'' = -\frac{2}{x}y' + \frac{2}{x^2}y + \frac{\sin(\ln x)}{x^2}, \quad x \in [1,2], \quad y(1) = 1, \; y(2) = 2,$$
with exact solution
$$y = c_1 x + \frac{c_2}{x^2} - \frac{3}{10}\sin(\ln x) - \frac{1}{10}\cos(\ln x),$$

⁷Gershgorin's theorem has the following statement: Let B be an arbitrary matrix. Then the eigenvalues $\lambda$ of B
are located in the union of the n disks
$$|\lambda - b_{kk}| \le \sum_{j\neq k} |b_{kj}|.$$

MATLAB Source 4.14 Two point boundary value problem – finite difference method and
symmetric matrix
function [x,y]=bilocalsim(p,q,r,a,b,alfa,beta,N)
%BILOCALSIM - two-point boundary value problem
%y''(x)-p(x)y'(x)-q(x)y(x)=r(x), x in [a,b]
%y(a)=alpha, y(b)=beta
%call [X,Y]=BILOCALSIM(P,Q,R,A,B,ALPHA,BETA,N)
%P,Q,R - functions
%[A,B] - interval
%alpha,beta - values at endpoints
%N - no. of points
%transforms the matrix into a symmetric positive definite one

h=(b-a)/(N+1); x=a+[1:N]'*h;
vp=p(x); vr=r(x); vq=q(x);
av=1+h^2/2*vq;
bv=1/2*(1+h/2*vp);
cv=1/2*(1-h/2*vp);
dd=-sqrt(cv(1:end-1).*bv(2:end));
B=[[dd;0],av,[0;dd]];
A=spdiags(B,[-1:1],N,N);
bb=-h^2/2*vr;
bb(1)=bb(1)+(1/2+h/4*vp(1))*alfa;
bb(N)=bb(N)+(1/2-h/4*vp(N))*beta;
%Cholesky method
R=chol(A);
D=diag(sqrt(cumprod([1;cv(1:end-1)./bv(2:end)])));
y=D\(R\(R'\(D*bb)));
x=[a;x;b];
y=[alfa;y;beta];

where
$$c_2 = \frac{1}{70}\bigl[8 - 12\sin(\ln 2) - 4\cos(\ln 2)\bigr], \qquad c_1 = \frac{11}{10} - c_2.$$
Here is the code. It calls both methods and tabulates the computed solutions together with the
exact solution.
p=@(x) -2./x;
q=@(x) 2./x.^2;
r=@(x) sin(log(x))./x.^2;
a=1; b=2;
alpha=1; bet=2;
N=9;
[x,y]=bilocal(p,q,r,a,b,alpha,bet,N);
[x2,y2]=bilocalsim(p,q,r,a,b,alpha,bet,N);
%Exact solution
c2=1/70*(8-12*sin(log(2))-4*cos(log(2)));
c1=11/10-c2;
ye=c1*x+c2./x.^2-3/10*sin(log(x))-1/10*cos(log(x));
disp([x,y,y2,ye])

and the output


1.0000 1.0000 1.0000 1.0000
1.1000 1.0926 1.0926 1.0926
1.2000 1.1870 1.1870 1.1871
1.3000 1.2833 1.2833 1.2834
1.4000 1.3814 1.3814 1.3814
1.5000 1.4811 1.4811 1.4812
1.6000 1.5824 1.5824 1.5824
1.7000 1.6850 1.6850 1.6850
1.8000 1.7889 1.7889 1.7889
1.9000 1.8939 1.8939 1.8939
2.0000 2.0000 2.0000 2.0000

See also Problem 4.7.

4.9.2 Computing a plane truss


This example is adapted from [66].
Figure 4.3 depicts a plane truss having 21 members (the numbered lines) connecting 12
joints (the numbered circles). The indicated loads, in tons, are applied at joints 2, 5, 6, 9,
and 10, and we want to determine the resulting force on each member of the truss.

Figure 4.3: A plane truss (members 1–21 connect joints 1–12; loads of 10, 15, 20, 25, and 30 tons act at the loaded joints)

For the truss to be in static equilibrium, there must be no net force, horizontally or vertically,
at any joint. Thus, we can determine the member forces by equating the horizontal
forces to the left and right at each joint, and similarly equating the vertical forces upward
and downward at each joint. For the 12 joints, this would give 24 equations, which is more
than the 21 unknown forces to be determined. For the truss to be statically determinate, that
is, for there to be a unique solution, we assume that joint 1 is rigidly fixed both horizontally
and vertically and that joint 12 is fixed vertically. Resolving the member forces into horizontal
and vertical components and defining $\alpha = 1/\sqrt{2}$, we obtain the following system of
equations for the member forces $f_i$:
Joint 2: $f_2 = f_6$, $\quad f_3 = 10$
Joint 3: $\alpha f_1 = f_4 + \alpha f_5$, $\quad \alpha f_1 + f_3 + \alpha f_5 = 0$
Joint 4: $f_4 = f_8$, $\quad f_7 = 0$
Joint 5: $\alpha f_5 + f_6 = \alpha f_9 + f_{10}$, $\quad \alpha f_5 + f_7 + \alpha f_9 = 15$
Joint 6: $f_{10} = f_{14}$, $\quad f_{11} = 20$
Joint 7: $f_8 + \alpha f_9 = f_{12} + \alpha f_{13}$, $\quad \alpha f_9 + f_{11} + \alpha f_{13} = 0$
Joint 8: $f_{12} = f_{16}$, $\quad f_{15} = 0$
Joint 9: $\alpha f_{13} + f_{14} = \alpha f_{17} + f_{18}$, $\quad \alpha f_{13} + f_{15} + \alpha f_{17} = 25$
Joint 10: $f_{18} = f_{21}$, $\quad f_{19} = 30$
Joint 11: $f_{16} + \alpha f_{17} = f_{20}$, $\quad \alpha f_{17} + f_{19} + \alpha f_{20} = 0$
Joint 12: $\alpha f_{20} + f_{21} = 0$
We give here only a fragment of the MATLAB code. You can download the whole code from
the author's web page (file mytruss2.m).
% MYTRUSS2 Solution to the truss problem.

n = 21;
A = zeros(n,n);
b = zeros(n,1);
alpha = 1/sqrt(2);

% Joint 2: f2 = f6
% f3 = 10
A(1,2) = 1;
A(1,6) = -1;
A(2,3) = 1;
b(2) = 10;

% Joint 3: alpha f1 = f4 + alpha f5


% alpha f1 + f3 + alpha f5 = 0
A(3,1) = alpha;
A(3,4) = -1;
A(3,5) = -alpha;
A(4,1) = alpha;
A(4,3) = 1;
A(4,5) = alpha;

% Joint 4: f4 = f8
% f7 = 0
A(5,4) = 1;
A(5,8) = -1;
A(6,7) = 1;
b(6) = 0;

%Joints 5,...,11

% Joint 12 alpha f20+f21=0;

A(21,20)=alpha;
A(21,21)=1;

x = A\b;

Figure 4.4 gives a plot of the truss with the force members.

−80.87 −80.87 −101.7 −101.7

−28.9
−64.3 10 50.1 0 20 0.613 0 30 −77.2
34.7

22.83 22.83 78.7 78.7 54.57 54.57

10 15 20 25 30

Figure 4.4: The solution of plane truss problem

Problems
Problem 4.1. For a system with a tridiagonal matrix implement the following methods:
(a) Gaussian elimination, with and without pivoting.
(b) LU decomposition.
(c) LUP decomposition.
(d) Cholesky decomposition for a symmetric, positive definite matrix.
(e) Jacobi method.
(f) Gauss-Seidel method.
(g) SOR method.

Problem 4.2. Implement Gaussian elimination with partial pivoting in two variants: with
logical line permutation (using a permutation vector) and with physical line permutation.
Compare execution times for various system matrix sizes. Do the same for the LUP
decomposition.

Problem 4.3. Modify the LUP decomposition function to return the determinant of the initial
matrix.

Problem 4.4. Consider the system
$$\begin{aligned} 2x_1 - x_2 &= 1 \\ -x_{j-1} + 2x_j - x_{j+1} &= j, \quad j = 2,\dots,n-1, \\ -x_{n-1} + 2x_n &= n. \end{aligned}$$

(a) Generate its matrix using diag.

(b) Solve it using LU decomposition.

(c) Solve it using a suitable function in Problem 4.1.

(d) Generate its matrix using spdiags, solve it using \, and compare the run time with
the required run time for the same system, but with a dense matrix.

(e) Estimate the condition number of the system using condest.

Problem 4.5. Modify Gaussian elimination and LUP decomposition to allow total pivoting.

Problem 4.6. Write a MATLAB function for the generation of random diagonally dominant
band matrices of given size. Test the Jacobi and SOR methods on systems having such matrices.

Problem 4.7. Apply the idea in Section 4.9.1 to solve the univariate Poisson equation
$$-\frac{d^2 v(x)}{dx^2} = f, \quad 0 < x < 1,$$
with boundary conditions $v(0) = v(1) = 0$. Solve the discretized system by the
Cholesky and SOR methods.

Problem 4.8. Find the Gauss-Seidel method matrix corresponding to the matrix
$$A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & \ddots & \ddots & \ddots \\ & & & -1 & 2 & -1 \\ & & & & -1 & 2 \end{bmatrix}.$$

Problem 4.9. A finite element analysis for the load of a structure yields the following system
$$\begin{bmatrix} \alpha & 0 & 0 & 0 & \beta & -\beta \\ 0 & \alpha & 0 & -\beta & 0 & -\beta \\ 0 & 0 & \alpha & \beta & \beta & 0 \\ 0 & -\beta & \beta & \gamma & 0 & 0 \\ \beta & 0 & \beta & 0 & \gamma & 0 \\ -\beta & -\beta & 0 & 0 & 0 & \gamma \end{bmatrix} x = \begin{bmatrix} 15 \\ 0 \\ -15 \\ 0 \\ 25 \\ 0 \end{bmatrix},$$
where $\alpha = 482317$, $\beta = 2196.05$, and $\gamma = 6708.43$. Here, $x_1, x_2, x_3$ are the side
displacements, and $x_4, x_5, x_6$ are the (three-dimensional) rotational displacements corresponding to the
applied force (the right-hand side).
(a) Find x.
(b) How accurate is the computed solution? Assume first exact input data, then a relative error in
the input data of $\|\Delta A\|/\|A\| = 5 \times 10^{-7}$.

Problem 4.10. Consider the system
$$\begin{aligned} x_1 + x_2 &= 2 \\ 10x_1 + 10^{18}x_2 &= 10 + 10^{18}. \end{aligned}$$

(a) Solve it by Gaussian elimination with partial pivoting.

(b) Divide each row by its element of maximum modulus, and then apply Gaussian
elimination.
(c) Solve the system using Symbolic Math Toolbox.
CHAPTER 5

Function Approximation

The function to be approximated can be defined:


• On a continuum (typically a finite interval) – special functions (elementary or transcen-
dental) that one wishes to evaluate as part of a subroutine.
• On a finite set of points – a situation frequently encountered in the physical sciences, when
measurements of a certain physical quantity are taken as a function of another physical
quantity (such as time).
In either case one wants to approximate the given function “as well as possible” in terms
of other simpler functions. Since such an evaluation must be reduced to a finite number of
arithmetical operations, the simpler functions should be polynomial or rational functions.
The general scheme of approximation can be described as:
• A given function f ∈ X to be approximated.
• A class Φ of “approximations”.
• A “norm” k · k measuring the overall magnitude of functions.
We are looking for an approximation $\hat\varphi \in \Phi$ of f such that
$$\|f - \hat\varphi\| \le \|f - \varphi\| \quad \text{for all } \varphi \in \Phi. \qquad (5.0.1)$$
This problem is called the best approximation problem for f from the class $\Phi$, and the function
$\hat\varphi$ is called a best approximation element of f, relative to the norm $\|\cdot\|$.
Given a basis $\{\pi_j\}_{j=1}^n$ of $\Phi$, we can express any $\varphi \in \Phi$, and $\Phi$ itself, as
$$\Phi = \Phi_n = \left\{\varphi : \varphi(t) = \sum_{j=1}^n c_j\pi_j(t),\; c_j \in \mathbb{R}\right\}. \qquad (5.0.2)$$


Φ is a finite-dimensional linear space or a proper subset of it.

Example 5.0.1. $\Phi = \mathbb{P}_m$ – the set of polynomials of degree at most m. A basis of this
space is $e_j(t) = t^j$, $j = 0,1,\dots,m$, so $\dim \mathbb{P}_m = m + 1$. Polynomials are the most
frequently used "general-purpose" approximations for dealing with functions on bounded
domains (finite intervals or finite sets of points). One reason is Weierstrass' theorem: any
function from C[a,b] can be approximated on a finite interval as closely as one wishes by a
polynomial of sufficiently high degree. ♦

Example 5.0.2. $\Phi = S_m^k(\Delta)$, the space of polynomial spline functions of degree m and
smoothness class k on the subdivision
$$\Delta: a = t_1 < t_2 < t_3 < \cdots < t_{N-1} < t_N = b$$
of the interval [a,b]. These are piecewise polynomials of degree $\le m$, pieced together at
the "joints" $t_1,\dots,t_{N-1}$, in such a way that all derivatives up to and including the kth are
continuous on the whole interval [a,b], including the joints. We assume $0 \le k < m$. For
k = m this space coincides with $\mathbb{P}_m$. We set k = −1 if we allow discontinuities at the joints. ♦

Example 5.0.3. $\Phi = \mathbb{T}_m[0,2\pi]$, the space of trigonometric polynomials of degree $\le m$ on
$[0,2\pi]$. These are linear combinations of the functions
$$\pi_k(t) = \cos((k-1)t), \quad k = 1,\dots,m+1, \qquad \pi_{m+1+k}(t) = \sin kt, \quad k = 1,\dots,m.$$
The dimension of this space is n = 2m + 1. Such approximations are natural choices when
the function f to be approximated is periodic with period $2\pi$. (If f has period p, one makes
a change of variables $t \mapsto tp/2\pi$.) ♦

The class of rational functions

Φ = Rr,s = {ϕ : ϕ = p/q, p ∈ Pr , q ∈ Ps },

is not a linear space.


Possible choice of norms – both for continuous and discrete functions – and the approx-
imation they generate are summarized in Table 5.1. The continuous case involve an interval
[a, b] and a weight function w(t) (possibly w(t) ≡ 1) defined on [a, b] and positive except for
isolate zeros. The discrete case involve a set of N distinct points t1 , t2 , . . . , tN along with
positive weight factors w1 , w2 , . . . , wN (possibly wi = 1, i = 1, N ). The interval [a, b] may
be unbounded if the weight function w is such that the improper integral extended over [a, b],
which defines the norm makes sense.
Hence, we may take any one of the norms in Table 5.1 and combine it with any of the
preceding linear spaces Φ to arrive at a meaningful best approximation problem (5.0.1). In
the continuous case, the given function f and the functions ϕ ∈ Φ must be defined on [a, b]
and such that the norm kf − ϕk makes sense. Likewise, f and ϕ must be defined at the points
ti in the discrete case.
continuous norm | type | discrete norm
$\|u\|_\infty = \max_{a\le t\le b}|u(t)|$ | $L^\infty$ | $\|u\|_\infty = \max_{1\le i\le N}|u(t_i)|$
$\|u\|_{1,w} = \int_a^b |u(t)|\,w(t)\,dt$ | $L^1_w$ | $\|u\|_{1,w} = \sum_{i=1}^N w_i|u(t_i)|$
$\|u\|_{2,w} = \left(\int_a^b |u(t)|^2 w(t)\,dt\right)^{1/2}$ | $L^2_w$ | $\|u\|_{2,w} = \left(\sum_{i=1}^N w_i|u(t_i)|^2\right)^{1/2}$

Table 5.1: Types of approximations and associated norms

Note that if the best approximant $\hat\varphi$ in the discrete case is such that $\|f - \hat\varphi\| = 0$, then
$\hat\varphi(t_i) = f(t_i)$ for $i = 1,2,\dots,N$. We then say that $\hat\varphi$ interpolates f at the points $t_i$, and we
refer to this kind of approximation as an interpolation problem.
The simplest approximation problems are the least squares problem and the interpolation
problem and the easiest space is the space of polynomials.
Before we start with the least square problem we introduce a notational device (as in [33])
that allows us to treat the continuous and the discrete case simultaneously. We define in the
continuous case
$$\lambda(t) = \begin{cases} 0, & \text{if } t < a \ (\text{when } -\infty < a), \\[2pt] \displaystyle\int_a^t w(\tau)\,d\tau, & \text{if } a \le t \le b, \\[2pt] \displaystyle\int_a^b w(\tau)\,d\tau, & \text{if } t > b \ (\text{when } b < \infty); \end{cases} \qquad (5.0.3)$$

then we can write, for any continuous function u,
$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \int_a^b u(t)w(t)\,dt, \qquad (5.0.4)$$
since $d\lambda(t) \equiv 0$ outside [a,b] and $d\lambda(t) = w(t)\,dt$ inside. We call $d\lambda$ a continuous
(positive) measure. The discrete measure (also called "Dirac measure") associated with the point
set $\{t_1, t_2, \dots, t_N\}$ is a measure $d\lambda$ that is nonzero only at the points $t_i$ and has the value $w_i$
there. Thus in this case
$$\int_{\mathbb{R}} u(t)\,d\lambda(t) = \sum_{i=1}^N w_i u(t_i). \qquad (5.0.5)$$

A more precise definition can be given in terms of Stieltjes integrals, if we define $\lambda(t)$ to be
a step function having the jump $w_i$ at $t_i$. In particular, we can define the $L^2$ norm as
$$\|u\|_{2,d\lambda} = \left(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\right)^{1/2} \qquad (5.0.6)$$
and obtain the continuous or the discrete norm depending on whether $\lambda$ is taken to be as in
(5.0.3) or a step function, as in (5.0.5).
We call the support of $d\lambda$ – denoted by $\mathrm{supp}\,d\lambda$ – the interval [a,b] in the continuous
case (assuming w is positive on [a,b] except for isolated zeros) and the set $\{t_1, t_2, \dots, t_N\}$
in the discrete case. We say that the set of functions $\pi_j$ in (5.0.2) is linearly independent on
$\mathrm{supp}\,d\lambda$ if
$$\sum_{j=1}^n c_j\pi_j(t) = 0 \;\;\forall\, t \in \mathrm{supp}\,d\lambda \;\Rightarrow\; c_1 = c_2 = \cdots = c_n = 0. \qquad (5.0.7)$$

5.1 Least Squares approximation


We specialize the best approximation problem (5.0.1) by taking as norm the $L^2$ norm
$$\|u\|_{2,d\lambda} = \left(\int_{\mathbb{R}} |u(t)|^2\,d\lambda(t)\right)^{1/2}, \qquad (5.1.1)$$
where $d\lambda$ is either a continuous measure (cf. (5.0.3)) or a discrete measure (cf. (5.0.5)), and
using approximants $\varphi$ from an n-dimensional linear space
$$\Phi = \Phi_n = \left\{\varphi : \varphi(t) = \sum_{j=1}^n c_j\pi_j(t),\; c_j \in \mathbb{R}\right\}, \qquad (5.1.2)$$
with the $\pi_j$ linearly independent on $\mathrm{supp}\,d\lambda$; the integral in (5.1.1) is meaningful whenever $u = \pi_j$,
$j = 1,\dots,n$, or $u = f$.
The specialized problem is called the least squares approximation problem or square mean
approximation problem. Its solution (at the beginning of the 19th century) is due to Gauss and
Legendre¹.

5.1.1 Inner products


Given a discrete or continuous measure $d\lambda$, and given any two functions u and v having a
finite norm (5.1.1), we can define the inner (scalar) product
$$(u,v) = \int_{\mathbb{R}} u(t)v(t)\,d\lambda(t). \qquad (5.1.3)$$
The Cauchy-Buniakowski-Schwarz inequality
$$|(u,v)| \le \|u\|_{2,d\lambda}\,\|v\|_{2,d\lambda}$$
tells us that the integral in (5.1.3) is well defined.
A real inner product has the following properties:

¹Adrien Marie Legendre (1752–1833) was a French mathematician active in Paris, best known for his treatise on
elliptic integrals, but also famous for his work in number theory and geometry. He is considered the originator (in
1805) of the method of least squares, although Gauss had already used it in 1794, but published it only in 1809.

(i) symmetry (u, v) = (v, u);

(ii) homogeneity (αu, v) = α(u, v), α ∈ R;

(iii) additivity (u + v, w) = (u, w) + (v, w);

(iv) positive definiteness (u, u) ≥ 0 and (u, u) = 0 ⇔ u ≡ 0 on supp dλ.

(i)+(ii) ⇒ linearity:
$$(\alpha_1 u_1 + \alpha_2 u_2, v) = \alpha_1(u_1,v) + \alpha_2(u_2,v), \qquad (5.1.4)$$
which extends to finite linear combinations. Also,
$$\|u\|_{2,d\lambda}^2 = (u,u). \qquad (5.1.5)$$

We say that u and v are orthogonal if

(u, v) = 0. (5.1.6)

More generally, we may consider an orthogonal system $\{u_k\}_{k=1}^n$:
$$(u_i,u_j) = 0 \;\text{ if } i \neq j, \qquad u_k \not\equiv 0 \text{ on } \mathrm{supp}\,d\lambda; \quad i,j,k = 1,\dots,n. \qquad (5.1.7)$$
For such a system we have the generalized theorem of Pythagoras,
$$\left\|\sum_{k=1}^n \alpha_k u_k\right\|^2 = \sum_{k=1}^n |\alpha_k|^2\|u_k\|^2. \qquad (5.1.8)$$
Relation (5.1.8) implies that every orthogonal system is linearly independent on $\mathrm{supp}\,d\lambda$. Indeed,
if the left-hand side of (5.1.8) vanishes, then so does the right-hand side, and this, since
$\|u_k\|^2 > 0$ by assumption, implies $\alpha_1 = \alpha_2 = \cdots = \alpha_n = 0$.

5.1.2 The normal equations


By (5.1.5) we can write the square of the $L^2$ error in the form
$$E^2[\varphi] := \|\varphi - f\|^2 = (\varphi - f, \varphi - f) = (\varphi,\varphi) - 2(\varphi,f) + (f,f).$$
Inserting $\varphi$ from (5.1.2) gives
$$E^2[\varphi] = \int_{\mathbb{R}}\left[\sum_{j=1}^n c_j\pi_j(t)\right]^2 d\lambda(t) - 2\int_{\mathbb{R}}\left[\sum_{j=1}^n c_j\pi_j(t)\right] f(t)\,d\lambda(t) + \int_{\mathbb{R}} f^2(t)\,d\lambda(t). \qquad (5.1.9)$$

The squared $L^2$ error is a quadratic function of the coefficients $c_1,\dots,c_n$ of $\varphi$. The
problem of best $L^2$ approximation thus amounts to minimizing this quadratic function; one
solves it by setting the partial derivatives to zero. One obtains
$$\frac{\partial}{\partial c_i}E^2[\varphi] = 2\int_{\mathbb{R}}\left[\sum_{j=1}^n c_j\pi_j(t)\right]\pi_i(t)\,d\lambda(t) - 2\int_{\mathbb{R}}\pi_i(t)f(t)\,d\lambda(t) = 0;$$
that is,
$$\sum_{j=1}^n (\pi_i,\pi_j)c_j = (\pi_i,f), \quad i = 1,2,\dots,n. \qquad (5.1.10)$$

These are called normal equations for the least squares problem. They form a system
having the form
Ac = b, (5.1.11)
where the matrix A and the vector b have elements

A = [aij ], aij = (πi , πj ), b = [bi ], bi = (πi , f ). (5.1.12)

By symmetry of the inner product, A is a symmetric matrix. Moreover, A is positive definite;
that is,
$$x^TAx = \sum_{i=1}^n\sum_{j=1}^n a_{ij}x_ix_j > 0 \quad \text{if } x \neq [0,0,\dots,0]^T. \qquad (5.1.13)$$
The quadratic function in (5.1.13) is called a quadratic form (since it is homogeneous of
degree 2). The positive definiteness of A says that the quadratic form whose coefficients are
the elements of A is always nonnegative, and is zero only if all variables $x_i$ vanish.
To prove (5.1.13), all we have to do is insert the definition of $a_{ij}$ and use properties
(i)–(iv) of the inner product:
$$x^TAx = \sum_{i=1}^n\sum_{j=1}^n x_ix_j(\pi_i,\pi_j) = \left(\sum_{i=1}^n x_i\pi_i,\, \sum_{j=1}^n x_j\pi_j\right) = \left\|\sum_{i=1}^n x_i\pi_i\right\|^2.$$
This is clearly nonnegative. It is zero only if $\sum_{i=1}^n x_i\pi_i \equiv 0$ on $\mathrm{supp}\,d\lambda$, which, by the
assumed linear independence of the $\pi_i$, implies $x_1 = x_2 = \cdots = x_n = 0$.
It is a well-known fact of linear algebra that a symmetric positive definite matrix A is
nonsingular. Indeed, its determinant, as well as its leading principal minors,
are strictly positive. It follows that the system (5.1.10) of normal equations has a unique
solution. Does this solution correspond to a minimum of $E[\varphi]$ in (5.1.9)? The Hessian matrix
$H = [\partial^2 E^2/\partial c_i\partial c_j]$ has to be positive definite. But $H = 2A$, since $E^2$ is a quadratic
function. Therefore H, with A, is indeed positive definite, and the solution of the normal
equations gives us the desired minimum. The least squares approximation problem thus has
a unique solution, given by
$$\hat\varphi(t) = \sum_{j=1}^n \hat c_j\pi_j(t), \qquad (5.1.14)$$

where $\hat c = [\hat c_1, \hat c_2, \dots, \hat c_n]^T$ is the solution of the normal equations (5.1.10).
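In the discrete case, assembling and solving the normal equations in MATLAB is straightforward. A sketch (hypothetical helper, not from the book) with the monomial basis $\pi_j(t) = t^{j-1}$, chosen here only for brevity — see the conditioning caveat below:

function c = lsqnormal(t,w,f,n)
%LSQNORMAL - discrete least squares coefficients via normal
%equations (sketch); t,w - nodes and positive weights,
%f - values f(t_i), n - dimension of Phi_n
P = zeros(length(t),n);
for j = 1:n, P(:,j) = t(:).^(j-1); end  % basis pi_j(t)=t^(j-1)
W = diag(w(:));
A = P'*W*P;            % a_ij = (pi_i,pi_j)
b = P'*W*f(:);         % b_i = (pi_i,f)
c = A\b;               % solve the normal equations (5.1.10)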
This completely settles the least squares approximation problem in theory. How about in
practice? For a general set of linearly independent basis functions, we can encounter the following
difficulties.
(1) The system (5.1.10) may be ill-conditioned. A classical example is provided by
$\mathrm{supp}\,d\lambda = [0,1]$, $d\lambda(t) = dt$ on [0,1], and $\pi_j(t) = t^{j-1}$, $j = 1,2,\dots,n$. Then
$$(\pi_i,\pi_j) = \int_0^1 t^{i+j-2}\,dt = \frac{1}{i+j-1}, \quad i,j = 1,2,\dots,n;$$
that is, A is precisely the Hilbert matrix. The resulting severe ill-conditioning of the normal
equations is entirely due to an unfortunate choice of the basis functions. These become almost
linearly dependent as the exponent grows. Another source of degradation lies in the elements
$b_j = \int_0^1 \pi_j(t)f(t)\,dt$ of the right-hand side vector. When j is large, $\pi_j(t) = t^{j-1}$ behaves on
[0,1] like a discontinuous function. A polynomial $\pi_j$ that oscillates rapidly on [0,1] would
seem preferable from this point of view, since it would "engage" the function f more vigorously
over the whole interval [0,1], in contrast to a canonical monomial, which shoots up from
almost zero to 1 at the right endpoint.
(2) The second disadvantage is that all the coefficients $\hat c_j$ in (5.1.14) depend on n, i.e.,
$\hat c_j = \hat c_j^{(n)}$, $j = 1,2,\dots,n$. Increasing n will give an enlarged system of normal equations
with a completely new solution vector. We refer to this as the nonpermanence of the coefficients
$\hat c_j$.
Both defects (1) and (2) can be eliminated (or at least attenuated) by choosing for the
basis functions $\pi_j$ an orthogonal system:
$$(\pi_i,\pi_j) = 0 \text{ if } i \neq j, \qquad (\pi_j,\pi_j) = \|\pi_j\|^2 > 0. \qquad (5.1.15)$$
Then the system of normal equations becomes diagonal and is solved immediately by
$$\hat c_j = \frac{(\pi_j,f)}{(\pi_j,\pi_j)}, \quad j = 1,2,\dots,n. \qquad (5.1.16)$$
Clearly, each of these coefficients $\hat c_j$ is independent of n and, once computed, remains
the same for any larger n. We now have permanence of the coefficients. We need not solve a
system of normal equations; instead we can use formula (5.1.16) directly.
Any system $\{\hat\pi_j\}$ that is linearly independent on $\mathrm{supp}\,d\lambda$ can be orthogonalized with
respect to the measure $d\lambda$ by the Gram-Schmidt procedure. One takes
$$\pi_1 = \hat\pi_1$$
and then, for $j = 2,3,\dots$, recursively computes
$$\pi_j = \hat\pi_j - \sum_{k=1}^{j-1} c_k\pi_k, \qquad c_k = \frac{(\hat\pi_j,\pi_k)}{(\pi_k,\pi_k)}, \quad k = 1,\dots,j-1.$$
Then each $\pi_j$ so determined is orthogonal to all preceding ones.
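For a discrete measure, this procedure is a few lines of MATLAB. In the sketch below (hypothetical helper, not from the book), the columns of P hold the values of the functions $\hat\pi_j$ at the nodes, and the discrete inner product (5.0.5) is used:

function Q = gramschmidt(P,w)
%GRAMSCHMIDT - orthogonalize the columns of P with respect to
%the discrete inner product (u,v) = sum(w.*u.*v) (sketch)
ip = @(u,v) sum(w(:).*u.*v);
Q = P;
for j = 2:size(P,2)
    for k = 1:j-1
        Q(:,j) = Q(:,j) - ip(P(:,j),Q(:,k))/ip(Q(:,k),Q(:,k))*Q(:,k);
    end
end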



5.1.3 Least squares error; convergence


We have seen that if $\Phi = \Phi_n$ consists of n functions $\pi_j$, $j = 1,2,\dots,n$, which are linearly
independent on $\mathrm{supp}\,d\lambda$, then the least squares problem for $d\lambda$,
$$\min_{\varphi\in\Phi_n} \|f - \varphi\|_{2,d\lambda} = \|f - \hat\varphi\|_{2,d\lambda}, \qquad (5.1.17)$$
has a unique solution $\hat\varphi = \hat\varphi_n$, given by (5.1.14). There are many ways to select a basis $\{\pi_j\}$
in $\Phi_n$ and, therefore, many ways the solution $\hat\varphi_n$ can be represented. Nevertheless, it is always
one and the same function. The least squares error – the quantity on the right of (5.1.17) –
is independent of the choice of basis functions (although the calculation of the least squares
solution, as mentioned previously, is not). In studying this error we may assume, without
restricting generality, that the $\pi_j$ form an orthogonal system. (Every linearly independent
system can be orthogonalized by the Gram-Schmidt procedure.) Then we have (cf. (5.1.16))
$$\hat\varphi_n(t) = \sum_{j=1}^n \hat c_j\pi_j(t), \qquad \hat c_j = \frac{(\pi_j,f)}{(\pi_j,\pi_j)}. \qquad (5.1.18)$$

We first note that the error $f - \hat\varphi_n$ is orthogonal to the space $\Phi_n$; that is,
$$(f - \hat\varphi_n, \varphi) = 0, \quad \forall\,\varphi \in \Phi_n, \qquad (5.1.19)$$
where the inner product is the one in (5.1.3). Since $\varphi$ is a linear combination of the $\pi_k$, it
suffices to show (5.1.19) for each $\varphi = \pi_k$, $k = 1,2,\dots,n$. Inserting $\hat\varphi_n$ from (5.1.18) into the
left-hand side of (5.1.19), we find indeed
$$(f - \hat\varphi_n, \pi_k) = \left(f - \sum_{j=1}^n \hat c_j\pi_j,\, \pi_k\right) = (f,\pi_k) - \hat c_k(\pi_k,\pi_k) = 0,$$
the last equality following from the formula for $\hat c_k$ in (5.1.18). The result (5.1.19) has a
simple geometric interpretation. If we picture functions as vectors, and the space $\Phi_n$ as a
plane, then for any function f that "sticks out" of the plane $\Phi_n$, the least squares approximant
$\hat\varphi_n$ is the orthogonal projection of f onto $\Phi_n$; see Figure 5.1.
In particular, choosing $\varphi = \hat\varphi_n$ in (5.1.19), we get
$$(f - \hat\varphi_n, \hat\varphi_n) = 0$$
and, therefore, since $f = (f - \hat\varphi_n) + \hat\varphi_n$, by the theorem of Pythagoras and its generalization
(5.1.8),
$$\|f\|^2 = \|f - \hat\varphi_n\|^2 + \|\hat\varphi_n\|^2 = \|f - \hat\varphi_n\|^2 + \left\|\sum_{j=1}^n \hat c_j\pi_j\right\|^2 = \|f - \hat\varphi_n\|^2 + \sum_{j=1}^n |\hat c_j|^2\|\pi_j\|^2.$$

Figure 5.1: Least squares approximation as orthogonal projection

Solving for the first term on the right, we get
$$\|f - \hat\varphi_n\| = \left\{\|f\|^2 - \sum_{j=1}^n |\hat c_j|^2\|\pi_j\|^2\right\}^{1/2}, \qquad \hat c_j = \frac{(\pi_j,f)}{(\pi_j,\pi_j)}. \qquad (5.1.20)$$
Note that the expression in braces must necessarily be nonnegative.


The formula (5.1.20) is interesting theoretically, but for limited practical use. Note, in-
deed, that as the error approaches the level of the machine precision eps, computing the error

from the right-hand side of (5.1.20) cannot produce anything smaller than eps because of
inevitable rounding errors committed during the subtraction in the radicand. (They may even
produce a negative result for the radicand.) Using instead the definition,
Z  21
2
bn k =
kf − ϕ bn (t)] dλ(t)
[f (t) − ϕ ,
R

along, perhaps, with a suitable (positive) quadrature rule, it is guaranteed to produce a non-
negative result that may potentially be as small as O(eps).
If now we are given a sequence of linear spaces $\Phi_n$, $n = 1,2,3,\dots$, then clearly
$$\|f - \hat\varphi_1\| \ge \|f - \hat\varphi_2\| \ge \|f - \hat\varphi_3\| \ge \cdots,$$
which follows not only from (5.1.20), but more directly from the fact that
$$\Phi_1 \subset \Phi_2 \subset \Phi_3 \subset \cdots.$$
If there are infinitely many such spaces, then the sequence of $L^2$ errors, being monotonically
decreasing, must converge to a limit. Is this limit zero? If so, we say that the least squares

approximation process converges (in the mean) as $n \to \infty$. It is obvious from (5.1.20) that a
necessary and sufficient condition for this is
$$\sum_{j=1}^{\infty} |\hat c_j|^2\|\pi_j\|^2 = \|f\|^2. \qquad (5.1.21)$$
An equivalent way of stating convergence is as follows: given any f with $\|f\| < \infty$, that
is, any $f \in L^2_{d\lambda}$, and given any $\varepsilon > 0$, no matter how small, there exists an integer $n_\varepsilon$ and a
function $\varphi^* \in \Phi_n$ such that $\|f - \varphi^*\| \le \varepsilon$ for all $n > n_\varepsilon$. A class of spaces $\Phi_n$ having this
property is said to be complete with respect to the norm $\|\cdot\| = \|\cdot\|_{2,d\lambda}$. One therefore calls
relation (5.1.21) the completeness relation or Parseval-Liapunov relation.

5.2 Examples of orthogonal systems


The prototype of all orthogonal systems is the system of trigonometric functions known from
Fourier analysis. Other widely used systems involve orthogonal algebraic polynomials.
(1) The trigonometric system consists of the functions
$$1, \cos t, \cos 2t, \cos 3t, \dots, \sin t, \sin 2t, \sin 3t, \dots$$
It is orthogonal on $[0,2\pi]$ with respect to the equally weighted measure
$$d\lambda(t) = \begin{cases} dt & \text{on } [0,2\pi], \\ 0 & \text{otherwise.} \end{cases}$$
We have
$$\int_0^{2\pi} \sin kt\,\sin\ell t\,dt = \begin{cases} 0, & \text{if } k \neq \ell \\ \pi, & \text{if } k = \ell \end{cases} \qquad k,\ell = 1,2,3,\dots$$
$$\int_0^{2\pi} \cos kt\,\cos\ell t\,dt = \begin{cases} 0, & k \neq \ell \\ 2\pi, & k = \ell = 0 \\ \pi, & k = \ell > 0 \end{cases} \qquad k,\ell = 0,1,2,\dots$$
$$\int_0^{2\pi} \sin kt\,\cos\ell t\,dt = 0, \qquad k = 1,2,3,\dots, \;\; \ell = 0,1,2,\dots$$

The form of the approximation is
$$f(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty}(a_k\cos kt + b_k\sin kt). \qquad (5.2.1)$$
Using (5.1.16) we get
$$a_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\cos kt\,dt, \quad k = 0,1,2,\dots$$

$$b_k = \frac{1}{\pi}\int_0^{2\pi} f(t)\sin kt\,dt, \quad k = 1,2,\dots, \qquad (5.2.2)$$
which are known as the Fourier coefficients of f. They are precisely the coefficients (5.1.16) for
the trigonometric system. By extension, the coefficients (5.1.16) for any orthogonal system
$(\pi_j)$ will be called the Fourier coefficients of f relative to this system. In particular, we recognize
in the Fourier series truncated at k = m the best approximation of f from the class of
trigonometric polynomials of degree $\le m$ relative to the norm
$$\|u\|_2 = \left(\int_0^{2\pi}|u(t)|^2\,dt\right)^{1/2}.$$

(2) Orthogonal polynomials. Given a measure $d\lambda$, we know that any finite number
of consecutive powers $1, t, t^2, \dots$ are linearly independent on [a,b], if $\mathrm{supp}\,d\lambda = [a,b]$,
whereas the finite set $1, t, \dots, t^{N-1}$ is linearly independent on $\mathrm{supp}\,d\lambda = \{t_1, t_2, \dots, t_N\}$.
Since a linearly independent set can be orthogonalized by the Gram-Schmidt procedure, any
measure $d\lambda$ of the type considered generates a unique set of monic² polynomials $\pi_j(t; d\lambda)$,
$j = 0,1,2,\dots$, satisfying
$$\deg \pi_j = j, \quad j = 0,1,2,\dots, \qquad \int_{\mathbb{R}} \pi_k(t)\pi_\ell(t)\,d\lambda(t) = 0 \;\text{ if } k \neq \ell. \qquad (5.2.3)$$
These are called orthogonal polynomials relative to the measure $d\lambda$. (Note that the index j starts
from zero.) The set $\{\pi_j\}$ is infinite if $\mathrm{supp}\,d\lambda = [a,b]$, and consists of exactly N polynomials
$\pi_0, \pi_1, \dots, \pi_{N-1}$ if $\mathrm{supp}\,d\lambda = \{t_1,\dots,t_N\}$. The latter are referred to as discrete orthogonal
polynomials.
Three consecutive orthogonal polynomials are linearly related. Specifically, there exist
real constants $\alpha_k = \alpha_k(d\lambda)$ and $\beta_k = \beta_k(d\lambda) > 0$ (depending on the measure $d\lambda$) such that
$$\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k\pi_{k-1}(t), \quad k = 0,1,2,\dots, \qquad \pi_{-1}(t) = 0, \;\; \pi_0(t) = 1. \qquad (5.2.4)$$
(It is understood that (5.2.4) holds for all $k \in \mathbb{N}$ if $\mathrm{supp}\,d\lambda = [a,b]$ and only for $k = 0,\dots,N-2$
if $\mathrm{supp}\,d\lambda = \{t_1, t_2, \dots, t_N\}$.)
To prove (5.2.4) and, at the same time, identify the coefficients $\alpha_k, \beta_k$, we note that
$$\pi_{k+1}(t) - t\pi_k(t)$$
is a polynomial of degree $\le k$, and it can be expressed as a linear combination of $\pi_0, \pi_1, \dots,
\pi_k$. We write this linear combination in the form
$$\pi_{k+1}(t) - t\pi_k(t) = -\alpha_k\pi_k(t) - \beta_k\pi_{k-1}(t) + \sum_{j=0}^{k-2}\gamma_{k,j}\pi_j(t) \qquad (5.2.5)$$

²A polynomial is called monic if its leading coefficient is equal to 1.



(with the understanding that empty sums are zero). Now, multiplying both sides of (5.2.5) by
$\pi_k$ in the sense of the inner product defined in (5.1.3), we get
$$(-t\pi_k,\pi_k) = -\alpha_k(\pi_k,\pi_k);$$
that is,
$$\alpha_k = \frac{(t\pi_k,\pi_k)}{(\pi_k,\pi_k)}, \quad k = 0,1,2,\dots \qquad (5.2.6)$$
Similarly, forming the inner product of (5.2.5) with $\pi_{k-1}$ gives
$$(-t\pi_k,\pi_{k-1}) = -\beta_k(\pi_{k-1},\pi_{k-1}).$$
Since $(t\pi_k,\pi_{k-1}) = (\pi_k, t\pi_{k-1})$ and $t\pi_{k-1}$ differs from $\pi_k$ by a polynomial of degree $< k$,
we obtain by orthogonality $(t\pi_k,\pi_{k-1}) = (\pi_k,\pi_k)$; hence
$$\beta_k = \frac{(\pi_k,\pi_k)}{(\pi_{k-1},\pi_{k-1})}, \quad k = 1,2,\dots \qquad (5.2.7)$$
Multiplication of (5.2.5) by $\pi_\ell$, $\ell < k - 1$, yields
$$\gamma_{k,\ell} = 0, \quad \ell = 0,1,\dots,k-2. \qquad (5.2.8)$$
The recursion (5.2.4) provides us with a practical scheme for generating orthogonal polynomials.
Since $\pi_0 = 1$, we can compute $\alpha_0$ by (5.2.6) with k = 0. This allows us to compute
$\pi_1$, using (5.2.4) with k = 0. Knowing $\pi_0, \pi_1$ we can go back to (5.2.6) and (5.2.7) and
compute $\alpha_1$ and $\beta_1$, respectively. This allows us to compute $\pi_2$ via (5.2.4) with k = 1.
Proceeding in this fashion, using alternately (5.2.6), (5.2.7), and (5.2.4), we can generate as
many orthogonal polynomials as desired. This procedure, called Stieltjes's³ procedure, is
particularly well suited for discrete orthogonal polynomials, since the inner product is then
a finite sum. In the continuous case, the computation of the inner product requires integration,
which complicates matters. Fortunately, for some important special measures $d\lambda(t) = w(t)\,dt$
the recursion coefficients are explicitly known.
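For a discrete measure, the Stieltjes procedure translates into a short MATLAB function. The sketch below (hypothetical helper, not from the book) returns $\alpha_0,\dots,\alpha_{n-1}$ and $\beta_0,\dots,\beta_{n-1}$, with the common convention $\beta_0 = \int_{\mathbb{R}} d\lambda(t)$:

function [alpha,beta] = stieltjes(t,w,n)
%STIELTJES - recursion coefficients of discrete orthogonal
%polynomials (sketch); t,w - nodes and weights, n - how many pairs
t = t(:); w = w(:);
alpha = zeros(n,1); beta = zeros(n,1);
pkm1 = zeros(size(t));      % pi_{-1} at the nodes
pk = ones(size(t));         % pi_0 at the nodes
nk = sum(w);                % (pi_0,pi_0)
beta(1) = nk;               % convention for beta_0
for k = 0:n-1
    alpha(k+1) = sum(w.*t.*pk.^2)/nk;             % (5.2.6)
    if k > 0, beta(k+1) = nk/nkm1; end            % (5.2.7)
    pkp1 = (t-alpha(k+1)).*pk - beta(k+1)*pkm1;   % (5.2.4)
    nkm1 = nk; nk = sum(w.*pkp1.^2);
    pkm1 = pk; pk = pkp1;
end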
The special case of symmetry (i.e., $d\lambda(t) = w(t)\,dt$ with $w(-t) = w(t)$ and $\mathrm{supp}\,d\lambda$
symmetric with respect to the origin) deserves special attention. In this case $\alpha_k = 0$ for all $k \in \mathbb{N}$,
due to (5.2.6), since
$$(t\pi_k,\pi_k) = \int_{\mathbb{R}} t\pi_k^2(t)\,d\lambda(t) = \int_a^b w(t)\,t\pi_k^2(t)\,dt = 0,$$
³Thomas Joannes Stieltjes (1856–1894), born in the Netherlands, studied at the Technical Institute of Delft, but
never finished to get his degree because of a deep-seated aversion to examinations. He nevertheless got a job at the
Observatory of Leiden as a "computer assistant for astronomical calculation". His early publications caught the
attention of Hermite, who was able to eventually secure a university position for Stieltjes in Toulouse. A life-long
friendship evolved between these two great men, of which two volumes of their correspondence give vivid testimony
(and still make fascinating reading). Stieltjes is best known for his work on continued fractions and the moment
problem, which, among other things, led him to invent a new concept of integral, which now bears his name. He
died very young, of tuberculosis, at the age of 38.

because the integrand is an odd function and the domain is symmetric with respect to the
origin.

5.3 Examples of orthogonal polynomials


5.3.1 Legendre polynomials
They are defined by means of the so-called Rodrigues formula
$$\pi_k(t) = \frac{k!}{(2k)!}\,\frac{d^k}{dt^k}(t^2-1)^k. \qquad (5.3.1)$$
Let us check first the orthogonality on [−1,1] relative to the measure $d\lambda(t) = dt$. For
any $0 \le \ell < k$, repeated integration by parts gives
$$\int_{-1}^1 t^\ell\,\frac{d^k}{dt^k}(t^2-1)^k\,dt = \sum_{m=0}^{\ell}(-1)^m\,\ell(\ell-1)\cdots(\ell-m+1)\left[t^{\ell-m}\,\frac{d^{k-m-1}}{dt^{k-m-1}}(t^2-1)^k\right]_{-1}^{1} = 0, \qquad (5.3.2)$$
the last relation holding since $0 \le k - m - 1 < k$, so all the derivatives involved vanish at $\pm 1$. Thus,
$$(\pi_k, p) = 0, \quad \forall\, p \in \mathbb{P}_{k-1},$$
proving orthogonality. Writing (by symmetry)
$$\pi_k(t) = t^k + \mu_k t^{k-2} + \cdots, \quad k \ge 2,$$
and noting (again by symmetry) that the recurrence relation has the form
$$\pi_{k+1}(t) = t\pi_k(t) - \beta_k\pi_{k-1}(t),$$
we obtain
$$\beta_k = \frac{t\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)},$$
which is valid for all t. In particular, as $t \to \infty$,
$$\beta_k = \lim_{t\to\infty}\frac{t\pi_k(t) - \pi_{k+1}(t)}{\pi_{k-1}(t)} = \lim_{t\to\infty}\frac{(\mu_k - \mu_{k+1})t^{k-1} + \cdots}{t^{k-1} + \cdots} = \mu_k - \mu_{k+1}.$$
(If k = 1, set $\mu_1 = 0$.)
From Rodrigues’s formula we find

k! dk 2k 
πk (t) = k
t − kt2k−2 + . . .
(2k)! dt
k!
= (2k(2k − 1) . . . (k + 1)tk − k(2k − 2)(2k − 3) . . . (k − 1)tk−1 + . . . )
(2k)!
k(k − 1) k−2
= tk − t + ...,
2(2k − 1)

so that
$$\mu_k = -\frac{k(k-1)}{2(2k-1)}, \quad k \ge 2.$$
Therefore,
$$\beta_k = \mu_k - \mu_{k+1} = \frac{k^2}{(2k-1)(2k+1)};$$
that is, since $\mu_1 = 0$,
$$\beta_k = \frac{1}{4 - k^{-2}}, \quad k \ge 1. \qquad (5.3.3)$$

Figure 5.2: Legendre polynomials $L_k$, $k = 1,\dots,4$, on [−1,1], annotated with the recurrence $L_{k+1}(t) = tL_k(t) - \frac{1}{4-k^{-2}}L_{k-1}(t)$

MATLAB Source 5.3 computes the nth degree least squares Legendre approximation.
It computes the coefficients using formula (5.1.16) and then evaluates the approximation. The
function vLegendre (MATLAB Source 5.1) computes the values of the Legendre polynomial
of a given degree on a given set of points. The inner products are computed via the MATLAB
function quadl.
Figure 5.2 gives the graphs of the kth Legendre polynomials, $k = 1,\dots,4$. They were generated
using the MATLAB script graphLegendre.m:
%graphs for Legendre polynomials
n=4; clf
t=(-1:0.01:1)';
s=[];
ls={':','-','--','-.'};
lw=[1.5,0.5,0.5,0.5];
for k=1:n
    y=vLegendre(t,k);
    s=[s;strcat('\itn=',int2str(k))];
    plot(t,y,'LineStyle',ls{k},'Linewidth',lw(k),'Color','k');
    hold on
end
legend(s,4)
xlabel('t','FontSize',12,'FontAngle','italic')
ylabel('L_k','FontSize',12,'FontAngle','italic')
title('Legendre polynomials','Fontsize',14);
text(-0.65,0.8,...
    '$L_{k+1}(t)=tL_k(t)-\frac{1}{4-k^{-2}}L_{k-1}(t)$',...
    'FontSize',14,'FontAngle','italic','Interpreter','LaTeX')

We used LaTeX commands within text for a more pleasing rendering of the recurrence relation.

MATLAB Source 5.1 Compute Legendre polynomials using recurrence relation


function vl=vLegendre(x,n)
%VLEGENDRE - value of Legendre polynomial
%call vl=vLegendre(x,n)
%x - points
%n - degree
%vl - value

pnm1 = ones(size(x));
if n==0, vl=pnm1; return; end
pn = x;
if n==1, vl=pn; return; end
for k=2:n
vl=x.*pn-1/(4-(k-1)^(-2)).*pnm1;
pnm1=pn; pn=vl;
end

MATLAB Source 5.2 Compute Legendre coefficients


function c=Legendrecoeff(f,n)
%LEGENDRECOEFF - coefficients of least squares Legendre
% approximation
%call c=Legendrecoeff(f,n)
%f - function
%n - degree

n3=2;
for k=0:n
if k>0, n3=n3*k^2/(2*k-1)/(2*k+1); end
c(k+1)=quadl(@fleg,-1,1,1e-12,0,f,k)/n3;
end
%subfunction
function y=fleg(x,f,k)
y=f(x).*vLegendre(x,k);

MATLAB Source 5.3 Least square approximation via Legendre polynomials


function y=Legendreapprox(f,x,n)
%LEGENDREAPPROX - continuous least squares Legendre
% approximation
%call y=Legendreapprox(f,x,n)
%f - function
%x - points
%n - degree

c=Legendrecoeff(f,n);
y=evalLegendreapprox(c,x);

function y=evalLegendreapprox(c,x)
%EVALLEGENDREAPPROX - evaluate least squares Legendre
% approximation

y=zeros(size(x));
for k=1:length(c)
y=y+c(k)*vLegendre(x,k-1);
end

5.3.2 First kind Chebyshev polynomials

The Chebyshev⁴ polynomials of the first kind can be defined by the formula
$$T_n(x) = \cos(n\arccos x), \quad n \in \mathbb{N}. \qquad (5.3.4)$$

The trigonometric identity
$$\cos(k+1)\theta + \cos(k-1)\theta = 2\cos\theta\cos k\theta$$
and (5.3.4), by setting $\theta = \arccos x$, give us
$$T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x), \quad k = 1,2,3,\dots, \qquad T_0(x) = 1, \;\; T_1(x) = x. \qquad (5.3.5)$$
For example,

T2 (x) = 2x2 − 1,
T3 (x) = 4x3 − 3x,
T4 (x) = 8x4 − 8x2 + 1,

and so on.
It is evident from (5.3.5) that the leading coefficient of $T_n$ is $2^{n-1}$ (if $n \ge 1$); the first
kind monic Chebyshev polynomial is
$$\mathring{T}_n(x) = \frac{1}{2^{n-1}}T_n(x), \quad n \ge 1, \qquad \mathring{T}_0 = T_0. \qquad (5.3.6)$$
From (5.3.4) we obtain immediately the zeros of $T_n$:
$$x_k^{(n)} = \cos\theta_k^{(n)}, \qquad \theta_k^{(n)} = \frac{2k-1}{2n}\pi, \quad k = 1,\dots,n. \qquad (5.3.7)$$
They are the projections onto the real line of equally spaced points on the unit circle; see
Figure 5.3 for n = 4.
On [−1,1], $T_n$ oscillates between +1 and −1, attaining these extreme values at
$$y_k^{(n)} = \cos\eta_k^{(n)}, \qquad \eta_k^{(n)} = \frac{k\pi}{n}, \quad k = 0,\dots,n.$$
Figure 5.4 gives the graphs of some first kind Chebyshev polynomials.

⁴Pafnuty Lvovich Chebyshev (1821–1894) was the most prominent member of the St. Petersburg school of
mathematics. He made pioneering contributions to number theory, probability theory, and approximation theory.
He is regarded as the founder of constructive function theory, but also worked in mechanics, notably the theory
of mechanisms, and in ballistics.

Figure 5.3: The Chebyshev polynomial $T_4$ and its roots (the angles $\theta_1,\dots,\theta_4$ on the unit circle project onto the zeros)

Figure 5.4: The Chebyshev #1 polynomials $T_3$, $T_4$, $T_7$, $T_8$ on [−1,1]



First kind Chebyshev polynomials are orthogonal relative to the measure
$$d\lambda(x) = \frac{dx}{\sqrt{1-x^2}}, \quad \text{on } [-1,1].$$
One easily checks from (5.3.4) that
$$\int_{-1}^1 T_k(x)T_\ell(x)\,\frac{dx}{\sqrt{1-x^2}} = \int_0^\pi \cos k\theta\,\cos\ell\theta\,d\theta = \begin{cases} 0 & \text{if } k \neq \ell \\ \pi & \text{if } k = \ell = 0 \\ \pi/2 & \text{if } k = \ell \neq 0 \end{cases} \qquad (5.3.8)$$

The Fourier expansion in Chebyshev polynomials (essentially the Fourier cosine expansion)
is given by
$$f(x) = \sum_{j=0}^{\infty}{}' c_j T_j(x) := \frac{1}{2}c_0 + \sum_{j=1}^{\infty} c_j T_j(x), \qquad (5.3.9)$$
where
$$c_j = \frac{2}{\pi}\int_{-1}^1 f(x)T_j(x)\,\frac{dx}{\sqrt{1-x^2}}, \quad j \in \mathbb{N}.$$
Truncating (5.3.9) at the term of degree n gives a useful polynomial approximation of
degree n,
$$\tau_n(x) = \sum_{j=0}^{n}{}' c_j T_j(x) := \frac{c_0}{2} + \sum_{j=1}^{n} c_j T_j(x), \qquad (5.3.10)$$
having the error
$$f(x) - \tau_n(x) = \sum_{j=n+1}^{\infty} c_j T_j(x) \approx c_{n+1}T_{n+1}(x). \qquad (5.3.11)$$

The approximation on the far right is the better, the faster the Fourier coefficients $c_j$ tend to zero.
The error (5.3.11) essentially oscillates between $+c_{n+1}$ and $-c_{n+1}$ and thus is of "uniform"
size. This is in stark contrast to Taylor's expansion at x = 0, where the nth degree polynomial
partial sum has an error proportional to $x^{n+1}$ on [−1,1].
With respect to the inner product
$$(f,g)_T := \sum_{k=1}^{n+1} f(\xi_k)g(\xi_k), \qquad (5.3.12)$$
where $\{\xi_1,\dots,\xi_{n+1}\}$ is the set of zeros of $T_{n+1}$, the following discrete orthogonality property
holds:
$$(T_i,T_j)_T = \begin{cases} 0, & i \neq j \\ \frac{n+1}{2}, & i = j \neq 0 \\ n+1, & i = j = 0. \end{cases}$$

Indeed, we have $\arccos\xi_k = \frac{2k-1}{2n+2}\pi$, $k = 1,\dots,n+1$. Let us compute now the inner product:
$$(T_i,T_j)_T = \sum_{k=1}^{n+1}\cos(i\arccos\xi_k)\cos(j\arccos\xi_k) = \sum_{k=1}^{n+1}\cos\!\left(i\,\frac{2k-1}{2(n+1)}\pi\right)\cos\!\left(j\,\frac{2k-1}{2(n+1)}\pi\right)$$
$$= \frac{1}{2}\sum_{k=1}^{n+1}\cos\!\left((i+j)\frac{2k-1}{2(n+1)}\pi\right) + \frac{1}{2}\sum_{k=1}^{n+1}\cos\!\left((i-j)\frac{2k-1}{2(n+1)}\pi\right).$$
One introduces the notations $\alpha := \frac{i+j}{2(n+1)}\pi$, $\beta := \frac{i-j}{2(n+1)}\pi$ and
$$S_1 := \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\alpha, \qquad S_2 := \frac{1}{2}\sum_{k=1}^{n+1}\cos(2k-1)\beta.$$
Since
$$4\sin\alpha\,S_1 = \sin 2(n+1)\alpha = \sin(i+j)\pi = 0, \qquad 4\sin\beta\,S_2 = \sin 2(n+1)\beta = \sin(i-j)\pi = 0,$$
one obtains $S_1 = 0$ and $S_2 = 0$ whenever $\sin\alpha \neq 0$ and $\sin\beta \neq 0$, i.e., for $i \neq j$, $0 \le i,j \le n$.
With respect to the inner product
1 1
(f, g)U := f (η0 )g(η0 ) + f (η1 )g(η1 ) + · · · + f (ηn−1 )g(ηn−1 ) + f (ηn )g(ηn )
2 2
n
X (5.3.13)
′′
= f (ηk )g(ηk ),
k=0

where {η0 , . . . , ηn } is the set of extremal points of Tn , a similar property holds



 0, i 6= j
n
(Ti , Tj )U = , i = j 6= 0 .
 2
n, i = j = 0
MATLAB Source 5.4 computes continuous nth degree least squares approximation based
on first kind Chebyshev polynomials. The method is analogous to Legendre approximation.
In order to avoid the computation of improper integrals of the form
Z
2 1 1
ck = √ f (x) cos(k arccos x) dx,
π −1 1 − x2
5.3. Examples of orthogonal polynomials 171

for the coefficients ck = (f, Tk ), one performed the change of variable u = arccos x. Thus,
the formula for ck becomes
Z
2 π
ck = f (cos u) cos ku du.
π 0
MATLAB Source 5.5 computes the value of first kind Chebyshev polynomials of degree n

MATLAB Source 5.4 Leat squares continuous approximation with Chebyshev # 1 polyno-
mials
function y=Chebyshevapprox(f,x,n)
%CHEBYSHEVAPPROX - continuous least square Cebisev #1 approx
%call y=Chebyshevapprox(f,x,n)
%f - function
%x - points
%n - degree
%y - approximation value

c=Chebyshevcoeff(f,n);
y=evalChebyshev(c,x);

function y=evalChebyshev(c,x)
%EVALCHEBYSHEV - evaluate least square Chebyshev aproximation

y=c(1)/2*ones(size(x));
for k=1:length(c)-1
y=y+c(k+1)*vChebyshev(x,k);
end

on a set of given points, and MATLAB Source 5.6 find Fourier coefficients of a Fourier series
of Chebyshev polynomials.
The function that computes the discrete approximation corresponding to the inner product
5.3.12 is given in MATLAB Source 5.7. Such an approximation is useful in practical prob-
lems, and the precision is not much smaller than that of continuous approximation; moreover
is simpler to compute, since it does not require to approximate integrals (the inner products
are finite sums, see MATLAB Source 5.8).

The polynomial Tn has the least uniform norm in the set of n-th monic polynomials.

Theorem 5.3.1 (Chebyshev). For an arbitrary monic polynomial pn of degree n, there holds

◦ ◦ 1

max pn (x) ≥ max Tn (x) = n−1 , n ≥ 1, (5.3.14)
−1≤x≤1 −1≤x≤1 2

where Tn (x) is the monic Chebyshev polynomial (5.3.6) of degree n.
Proof. (by contradiction) Assume contrary to (5.3.14), that
1

max pn (x) < n−1 . (5.3.15)
−1≤x≤1 2
172 Function Approximation

MATLAB Source 5.5 Compute Chebyshev # 1 polynomials by mean of recurrence relation


function y=vChebyshev(x,n)
%VCHEBYSHEV - values of Chebyshev #1 polynomial
%call y=vChebyshev(x,n)
%x - points
%n - degree
%y - values of Chebyshev polynomial

pnm1=ones(size(x));
if n==0, y=pnm1; return; end
pn=x;
if n==1, y=pn; return; end
for k=2:n
y=2*x.*pn-pnm1;
pnm1=pn;
pn=y;
end

MATLAB Source 5.6 Least squares approximation with Chebyshev # 1 polynomials – con-
tinuation: computing Fourier coefficients
function c=Chebyshevcoeff(f,n)
%CHEBYSHEVCOEFF - least square Cebyshev coefficients.
%call c=Chebyshevcoeff(f,n)
%f - function
%n - degree
%c - coefficients

for k=0:n
c(k+1)=2/pi*quadl(@fceb,0,pi,1e-12,0,f,k);
end
%subfunction
function y=fceb(x,f,k)
y=cos(k*x).*feval(f,cos(x));

MATLAB Source 5.7 Discrete Chebyshev least squares approximation


function y=discrChebyshevapprox(f,x,n)
%DISCRCHEBYSHEVAPPROX - discrete least square Chebyshev #1
%call y=discrChebyshevapprox(f,x,n)
%f - function
%x - points
%n - degree

c=discrChebyshevcoeff(f,n);
y=evalChebyshev(c,x);
5.3. Examples of orthogonal polynomials 173

MATLAB Source 5.8 The coefficients of discrete Chebyshev least squares approximation
function c=discrChebyshevcoeff(f,n)
%DISCRCHEBYSHEVCOEFF - discrete least squares Chebyshev
% coefficients
%call c=discrChebyshevcoeff(f,n)
%f - function
%n - degree

xi=cos((2*[1:n+1]-1)*pi/(2*n+2));
y=f(xi)’;
for k=1:n+1
c(k)=2/(n+1)*vChebyshev(xi,k-1)*y;
end

◦ ◦
Then the polynomial dn (x) = Tn (x) − pn (x) (of degree ≤ n − 1) satisfies
       
(n) (n) (n)
dn y0 > 0, dn y1 < 0, dn y2 > 0, . . . , (−1)n dn yn(n) > 0. (5.3.16)

Since dn change sign at least n times, it must vanish identically; this contradicts (5.3.16);
thus (5.3.15) cannot be true. 

The result (5.3.14) can be given the following interesting interpretation: the best uniform

approximation on [−1, 1] to f (x) = xn from Pn−1 is given by xn − Tn (x), that is, by the

aggregate of terms of degree ≤ n − 1 in Tn taken with the minus sign. From the theory of
uniform polynomial approximation it is known that the best approximant is unique. Therefore
◦ ◦
equality in (5.3.14) can hold only if pn (x) = Tn (x).

5.3.3 Second kind Chebyshev polynomials


Chebyshev #2 polynomials are defined by

sin[(n + 1) arccos t]
Qn (t) = √ , t ∈ [−1, 1]
1 − t2


They are orthogonal on [−1, 1] relative to the measure dλ(t) = w(t)dt, w(t) = 1 − t2 .
The recurrence relation is

Qn+1 (t) = 2tQn (t) − Qn−1 (t), Q0 (t) = 1, Q1 (t) = 2t.


174 Function Approximation

5.3.4 Laguerre polynomials


This Laguerre 5 polynomials are orthogonal on [0, ∞) with respect to the weight w(t) =
tα e−t . They are defined by

et t−α dn n+α −t
lnα (t) = (t e ) for α > 1
n! dtn

The recurrence relation for monic polynomials l̃nα is


α
l̃n+1 (t) = (t − αn )l̃nα (t) − (2n + α + 1)l̃n−1
α
(t),

where α0 = Γ(1 + α) and αk = k(k + α), for k > 0.

5.3.5 Hermite polynomials


Hermite polynomials are defined by
2 dn −t2
Hn (t) = (−1)n et (e ).
dtn
2
They are orthogonal on (−∞, ∞) with respect to the weight w(t) = e−t and the recur-
rence relation is for monic polynomials H̃n (t) is

H̃n+1 (t) = tH̃n (t) − βn H̃n−1 (t),



where β0 = π and βk = k/2, for k > 0.

5.3.6 Jacobi polynomials


They are orthogonal on [−1, 1] relative to the weight

w(t) = (1 − t)α (1 + t)β .

Jacobi polynomials are generalizations of other orthogonal polynomials:


• For α = β = 0 we obtain Legendre polynomials.
• For α = β = −1/2 we obtain Chebyshev #1 polynomials.

Edmond Laguerre (1834-1886) was a French mathematician active


5 in
Paris, who made essential contributions to geometry, algebra, and
analysis.
5.3. Examples of orthogonal polynomials 175

• For α = β = 1/2 we obtain Chebyshev #2 polynomials.

Remark 5.3.2. For Jacobi polynomials we have

β 2 − α2
αk =
(2k + α + β)(2k + α + β + 2)

and

β0 =2α+β+1 B(α + 1, β + 1),


4k(k + α)(k + α + β)(k + β)
βk = , k > 0. ♦
(2k + α + β − 1)(2k + α + β)2 (2k + α + β + 1)

We conclude this section with a table of some classical weight functions, their correspond-
ing orthogonal polynomials, and the recursion coefficients αk , βk for generating orthogonal
polynomials (see Table 5.2).

Polynomials Notation Weight interval αk βk


Legendre Pn (ln ) 1 [-1,1] 0 2 (k=0)
(4−k−2 )−1 (k>0)
1
2 −2
Chebyshev #1 Tn (1−t ) [−1,1] 0 π (k=0)
1
2 π (k=1)
1
4 (k>0)
1
Chebyshev #2 un (Qn ) (1−t2 ) 2 [−1,1] 0 1
2 π (k=0)
1
4 (k>0)
(α)
Laguerre Ln tα e−t α>−1 [0,∞) 2k+α+1 Γ(1+α) (k=0)
k(k+α) (k>0)
2 √
Hermite Hn e−t R 0 π (k=0)
1
2k (k>0)
Jacobi Pn(α,β) α
(1−t) (1−t) β
[−1,1] See Remark 5.3.2
α>−1, β>−1 page 175

Table 5.2: Orthogonal Polynomials

5.3.7 A MATLAB example


Consider the function f : [−1, 1] → R, f (x) = x + sin πx2 . We shall study experimentally
the following least squares approximations: Legendre, continuous Chebyshev #1 and discrete
Chebyshev #1. First, we shall try to find the degree of approximation such that the error stays
within a given tolerance. The idea is as follows: we shall evaluate the function and the
approximation on a large number of point and we shall see if the Chebyshev norm, k · k∞ ,
of the difference vector is lower than the prescribed error. If true, one returns the degree,
otherwise the computing proceeds with a larger n. MATLAB Source 5.9 returns the degree
and the actual error. For example, for a tolerance equal to 10−3 , one obtains the following
results
176 Function Approximation

MATLAB Source 5.9 Test for least squares approximations


function [n,erref]=excebc(f,err,proc)
%f - function
%err - error
%proc - approximation method (Legendre, continuous
% Chebyshev, discrete Chebyshev

x=linspace(-1,1,100); %abscissas
y=f(x); %function values
n=1;
while 1
ycc=proc(f,x,n); %approximation values
erref=norm(y-ycc,inf); %error
if norm(y-ycc,inf)<err %success
return
end
n=n+1;
end

>> fp=@(x)x+sin(pi*x.ˆ2);
>> [n,er]=excebc(fp,1e-3,@approxLegendre)
n =
8
er =
9.5931e-004
>> [n,er]=excebc(fp,1e-3,@approxChebyshev)
n =
8
er =
5.9801e-004
>> [n,er]=excebc(fp,1e-3,@approxChebyshevdiscr)
n =
11
er =
6.0161e-004

The next program (exorthpol.m) computes the coefficients and plots the three types
of approximations for a given degree:
k=input(’k=’);
fp=inline(’x+sin(pi*x.ˆ2)’);
x=linspace(-1,1,100);
y=fp(x);
yle=Legendreapprox(fp,x,k);
ycc=Chebyshevapprox(fp,x,k);
ycd=discrChebyshevapprox(fp,x,k);
plot(x,y,’:’, x,yle,’--’, x,ycc,’-.’,x,ycd,’-’);
legend(’f’,’Legendre’, ’Continuous Cebyshev’, ’Discrete Chebyshev’,4)
5.4. Polynomials and data fitting in MATLAB 177

title([’k=’,int2str(k)],’Fontsize’,14);
cl=Legendrecoeff(fp,k)
ccc=Chebyshevcoeff(fp,k)
ccd=discrChebyshevcoeff(fp,k)

For k=3 and k=4, one obtains the following values for the coefficients:
k=3 cl =
Columns 1 through 3
0.50485459411369 1.00000000000000 0.56690206826580
Column 4
0.00000000000000
ccc =
Columns 1 through 3
0.94400243153647 1.00000000000114 0.00000000000000
Column 4
-0.00000000000000
ccd =
Columns 1 through 3
0.88803168065243 1.00000000000000 0
Column 4
-0.00000000000000
k=4 cl =
Columns 1 through 3
0.50485459411369 1.00000000000000 0.56690206826580
Columns 4 through 5
0.00000000000000 -4.02634086092250
ccc =
Columns 1 through 3
0.94400243153647 1.00000000000114 0.00000000000000
Columns 4 through 5
-0.00000000000000 -0.49940325827041
ccd =
Columns 1 through 3
0.94400233847188 1.00000000000000 -0.02739538442025
Columns 4 through 5
-0.00000000000000 -0.49939655365619

The graphs are given in figure 5.5.

5.4 Polynomials and data fitting in MATLAB


MATLAB represents a polynomial

p(x) = p1 xn + p2 xn−1 + pn x + pn+1

by a row vector p=[p(1) p(2) ...p(n+1)] of coefficients, ordered in decreasing or-


der of variable powers.
We shall consider three problems related to polynomials:
178 Function Approximation

k=3 k=4
2 2

1.5
1.5

1
1

0.5

0.5

0
−0.5

−0.5
f −1 f
Legendre Legendre
Continuous Cebyshev Continuous Cebyshev
Discrete Chebyshev Discrete Chebyshev
−1 −1.5
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

(a) k = 3 (b) k = 4

Figure 5.5: Least squares approximation of degree k for the function f : [−1, 1] → R,
f (x) = x + sin πx2

• Evaluation — given the coefficients compute the value of the polynomial at one or
more points.

• Root finding — given the coefficients find the roots.

• Data fitting — given a data set (xi , yi )m


i=1 , find a polynomial (or another combination
of basic functions) that “fits” the data (i.e. a least squares approximation).

The evaluation uses Horner’s scheme, implemented in MATLAB by polyval function.


In the command y=polyval(p,x) x can be a matrix, in which case the polynomial is eval-
uated at each element of the matrix (that is, in the array sense). Evaluation of the polynomial
in matrix sense, that is the computation of the matrix

p(X) = p1 X n + p2 Xn−1 + pn X + pn+1 ,

where X is a square matrix is carried out through the command Y = polyvalm(p,X).


The command z = roots(p) computes the (real and complex) roots of p. The func-
tion poly carries out the converse operation, that is it constructs the polynomial given the
roots. Thus if z is an n-vector then p = poly(z) gives the coefficients of the polyno-
mial (x − z1 )(x − z2 ) . . . (x − zn ). It also accepts a square matrix argument: in this case
p=poly(A) computes the coefficients of the characteristic polynomial of A, det(xI − A).
The polyder function computes the coefficients of the derivative of a polynomial.
As an example, consider the quadratic polynomial p(x) = x2 + x + 1. First, we find its
roots:
>> p = [1 1 1]; z = roots(p)
z =
-0.5000 + 0.8660i
-0.5000 - 0.8660i

We check them, up to roundoff:


5.4. Polynomials and data fitting in MATLAB 179

>> polyval(p,z)
ans =
1.0e-015 *
0.3331
0.3331

This is the characteristic polynomial of a certain 2 × 2 matrix


>> A=[0, 1; -1,-1]; chp = poly(A)
chp =
1.0000 1.0000 1.0000

The Cayley-Hamilton theorem says that every matrix satisfies its own characteristic polyno-
mial. We check this, modulo roundoff, for our matrix:
>> polyvalm(chp,A)
ans =
1.0e-015 *
-0.3331 0
0 -0.3331

For polynomial multiplication and division we can use conv and deconv, respectively.
The deconv syntax is [q,r]=deconv(g,h), where g is the dividend, h is the divisor,
q the quotient and r the remainder. In the following example we divide x3 − 2x2 − x + 2
by x − 2, obtaining quotient x2 − 1 and zero remainder. Then we reproduce the original
polynomial using conv.
>> g = [1 -2 -1 2]; h=[1 -2];
>> [q,r] = deconv(g,h)
q =
1 0 -1
r =
0 0 0 0
>> conv(h,q)+r
ans =
1 -2 -1 2

Consider now the problem of data fitting. Data fitting is a very common source of least
squares problems. Let t be the independent variable and let y(t) denote an unknown func-
tion of t that we want to approximate. Assume there are m observations, (yi ), measured at
specified values (ti ):
yi = y(ti ), i = 1, m.
Our model for y is a combination of n basis functions (πi )

y(t) ≈ c1 π1 (t, α) + · · · + cn πn (t, α).

The design matrix A(α) is a rectangular m × n matrix with elements

ai,j = πj (ti , α).


180 Function Approximation

They may depend upon α. In matrix notation, we can express our model as:

y ≈ A(α)c.

The residuals are the differences between the observations and the model:
n
X
ri = yi − cj πj (ti , α)
j=1

or in matrix notation
r = y − A(α)c.
We wish to minimize a certain norm of the residuals. The most frequent choices are
m
X
krk22 = ri2
i=1

or
m
X
krk22,w = wi ri2 .
i=1

A physical intuitive explanation of the second choice is: If some observations are more im-
portant or more accurate than others, then we might associate different weights, wi , with
different observations. For example, if the error in the ith observation is approximately ei ,
then choose wi = 1/ei . Thus, we have a discrete least-squares approximation problem. This
problem is linear if it does not depend on α and nonlinear otherwise.
Any algorithm for solving an unweighted least squares problem can be used to solve a
weighted problem by scaling the observations and design matrix. We simply multiply both
yi and the ith row of A by wi . In MATLAB, this can be accomplished with
A=diag(w)*A
y=diag(w)*y

The MATLAB Optimization and Curve Fitting Toolboxes include functions for one-norm
and infinity-norm problems. We will limit ourselves here to least squares.
If the problem is linear and we have more observation than base functions, we obtain an
overdetermined system (see Section 4.6.2)

Ac ≈ y.

We solve it in least squares sense


c = A\y.
The theoretical approach is based on normal equations

AT Ac = AT y.

If the base functions are linear independent (and hence, AT A nonsingular), the solution is

c = (AT A)−1 AT y,
5.4. Polynomials and data fitting in MATLAB 181

or
c = A+ y,
where A+ is the pseudoinverse of A. MATLAB function pinv computes it.
Let Ax = b be an arbitrary system. If A is an m × n with m > n and its rank is n, then
any of the three following MATLAB instructions
x=A\b
x=pinv(A)*b
x=inv(A’*A)*A’*b

calculates the same least squares solution. Nevertheless, \ operator does it faster.
If A is not full rank, the least squares solution is not unique. There exist several vectors
which minimize the norm kAx− bk2 . The solution computed with x=A\b is a basic solution;
it has at most r nonzero components, where r is the rank of A. The solution computed with
x=pinv(A)*b is the minimum norm solution (it minimizes norm(x)). The attempt to
find the solution by using x=inv(A’*A)*A’*b fails, since A’*A is singular. Here is an
example that illustrates various solutions. The matrix
A=[1,2,3; 4,5,6; 7,8,9; 10,11,12];
is rank deficient. If b=A(:,2), then an obvious solution of A*x=b is x=[0,1,0]’. None
of the previous approaches compute x. The \ operator yields
>> x=A\b
Warning: Rank deficient, rank = 2 tol = 1.4594e-014.
x =
0.5000
0
0.5000
This solution has two nonzero components. The variant with pseudoinverse gives us
>> y=pinv(A)*b
y =
0.3333
0.3333
0.3333
We see that norm(y)=0.5774<norm(x)=0.7071. The third variant fails:
>> z=inv(A’*A)*A’*b
Warning: Matrix is singular to working precision.
z =
Inf
Inf
Inf

The normal equation approach has several drawbacks. The corresponding matrix is al-
ways more ill-conditioned than the initial overdetermined system. The condition number is
in fact squared6 :
cond(AT A) = cond(A)2 .
6 For a rectangular matrix X, the condition number can be defined by cond(X) = kXkkX + k
182 Function Approximation

In floating point representation, even if the columns of A are linear independent, (AT A)−1
can be almost singular (very close to a singular matrix).
MATLAB avoids normal equations. \ operator uses internally QR factorization. We can
find the solution using c=R\(Q’*y).
If our base is 1, t, . . . , tn , one can use MATLAB polyfit function. The command
p=polyfit(x,y,n) finds the coefficients of the degree n discrete least squares approxi-
mation polynomial for data x and y. If n ≥ m, it returns the coefficients of the interpolation
polynomial.
Example 5.4.1. A quantity y is measured at various time moments, t, to produce the obser-
vations given in Table 5.3. We introduce it in MATLAB using

t y
0.0 0.82
0.3 0.72
0.8 0.63
1.1 0.60
1.6 0.55
2.3 0.50

Table 5.3: Input data for example 5.4.1

t=[0,0.3,0.8,1.1,1.6,2.3]’;
y=[0.82,0.72,0.63,0.60,0.55,0.50]’;

We shall try to model these data by the function

y(t) = c1 + c2 e−t .

One computes the unknown coefficients with least squares methods. We have 6 equations
and two unknowns, represented by a 6 × 2 matrix
>> E=[ones(size(t)),exp(-t)]
E =
1.0000 1.0000
1.0000 0.7408
1.0000 0.4493
1.0000 0.3329
1.0000 0.2019
1.0000 0.1003

Using \ operator, we find:


c=E\y
c =
0.4760
0.3413

We plot on the same graph the function and the original data:
5.4. Polynomials and data fitting in MATLAB 183

T=[0:0.1:2.5]’;
Y=[ones(size(T)),exp(-T)]*c;
plot(T,Y,’-’,t,y,’o’)
xlabel(’t’); ylabel(’y’);

We observe that E · c 6= y, but the Euclidian norm of the residual is minimized (Figure 5.6).

0.9

0.85

0.8

0.75

0.7
y

0.65

0.6

0.55

0.5
0 0.5 1 1.5 2 2.5
t

Figure 5.6: Illustration of data fitting

If A is rank deficient (i.e it has not linear independent columns), the operator \ gives a
warning message and produce a solution with a minimum number of nonzero elements.

5.4.1 An application — Census Data

In this example, the data are the total population of the United States, as determined by the
U.S. Census, for the years 1900 to 2000. The units are millions of people [66]:
184 Function Approximation

t y
1900 75.995
1910 91.972
1920 105.711
1930 123.203
1940 131.669
1950 150.697
1960 179.323
1970 203.212
1980 226.505
1990 249.633
2000 281.422
The task is to model the population growth by a third order polynomial
y(t) = c1 t3 + c2 t2 + c3 t + c4
and predict the population when t = 2010.
If we try to fit our data with c=polyfit(t,y,3), the design matrix will be ill-
conditioned and badly scaled and its columns are nearly linearly dependent. We shall obtain
the message
Warning: Polynomial is badly conditioned. Remove repeated
data points or try centering and scaling as
described in HELP POLYFIT.
A much better basis is provided by powers of a translated and scaled t
s = (t − 1950)/50.
This new variable is in the interval [−1, 1] and the resulting design matrix is well conditioned.
The MATLAB script 5.10, census.m, finds the coefficients, plot the data and the polyno-
mial, and estimates the population in 2010. The estimation is given explicitly and marked by
an asterisk (see Figure 5.7).
The reader is strongly encouraged to try other models.

5.5 The Space H n [a, b]


For n ∈ N∗ , we define
H n [a, b] = {f : [a, b] → R : f ∈ C n−1 [a, b], f (n−1) absolute continuous on [a, b]}.
(5.5.1)
Each function f ∈ H n [a, b] admits a Taylor-type representation with the remainder in
integral form
n−1
X (x − a)k Z x
(k) (x − t)n−1 (n)
f (x) = f (a) + f (t)dt. (5.5.2)
k! a (n − 1)!
k=0
n
H [a, b] is a linear space.
5.5. The Space H n [a, b] 185

MATLAB Source 5.10 An example of least squares approximation


%CENSUS - example with polynomial fit

%data
y = [ 75.995 91.972 105.711 123.203 131.669 150.697 ...
179.323 203.212 226.505 249.633 281.422]’;
t = (1900:10:2000)’; % census years
x = (1890:1:2019)’; % evaluation years
w = 2010; % prediction year

s=(t-1950)/50;
xs=(x-1950)/50;
cs=polyfit(s,y,3);
zs=polyval(cs,xs);
est=polyval(cs,(2010-1950)/50);
plot(t,y,’o’,x,zs,’-’,w,est,’*’)
text(1990,est,num2str(est))
title(’U.S. Population’, ’FontSize’, 14)
xlabel(’year’, ’FontSize’, 12)
ylabel(’Millions’, ’FontSize’, 12)

U.S. Population
350

312.6914
300

250
Millions

200

150

100

50
1880 1900 1920 1940 1960 1980 2000 2020
year

Figure 5.7: An illustration of census example


186 Function Approximation

Remark 5.5.1. A function f : I → R, I interval, is called absolute continuous on I if


∀ ε > 0 ∃ δ > 0 such that for each finite system of disjoint subinterval in I {(ak , bk )}k=1,n
Pn
having the property k=1 (bk − ak ) < δ it holds
n
X
|f (bk ) − f (ak )| < ε.
k=1 ♦

The next theorem, due to Peano 7 , extremely important for Numerical Analysis, gives a
representation of real linear functionals, defined on H n [a, b].
Theorem 5.5.2 (Peano). Let L be a real continuous linear functional, defined on H n [a, b].
If KerL = Pn−1 then
Z b
Lf = K(t)f (n) (t)dt, (5.5.3)
a
where
1 n−1
K(t) = L[(· − t)+ ] (Peano kernel). (5.5.4)
(n − 1)!
Remark 5.5.3. The function 
z, z ≥ 0
z+ =
0, z < 0
n
is called positive part, and z+ is called truncated power. ♦

Proof. f admits a Taylor representation with the remainder in integral form

f (x) = Tn−1 (x) + Rn−1 (x)

where
Z x Z b
(x − t)n−1 (n) 1 n−1 (n)
Rn−1 (x) = f (t)dt = (x − t)+ f (t)dt
a (n − 1)! (n − 1)! a

By applying L to both sides we get


Z !
b
1 n−1 (n)
Lf = LTn−1 +LRn−1 ⇒ Lf = L (· − t)+ f (t)dt =
| {z } (n − 1)! a
0

Giuseppe Peano (1858-1932), an Italian mathematician active in


Turin, made fundamental contributions to mathematical logic, set the-
ory, and the foundations of mathematics. General existence theorems
7 in ordinary differential equations also bear his name. He created his
own mathematical language, using symbols of the algebra and logic,
and even promoted (and used) a simplified Latin (his “latino”) as a
world language for scientific publication.
5.6. Polynomial Interpolation 187

Z b
cont 1 n−1 (n)
= L(· − t)+ f (t)dt.
(n − 1)! a

Remark 5.5.4. The conclusion of the theorem remains valid if L is not continuous, but it has
the form
XZ b
n−1
Lf = f (i) (x)dµi (x), µi ∈ BV [a, b].
i=0 a ♦

Corollary 5.5.5. If K does not change sign on [a, b] and f (n) is continuous on [a, b], then
there exists ξ ∈ [a, b] such that

1 (n)
Lf = f (ξ)Len , (5.5.5)
n!

where ek (x) = xk , k ∈ N.

Proof. Since K does not change sign we may apply in (5.5.3) the second mean value theorem
of integral calculus
Z b
Lf = f (n) (ξ) Kn (t)dt, ξ ∈ [a, b].
a

Setting f = en we get precise (5.5.5). 

5.6 Polynomial Interpolation


We now wish to approximate functions by matching their values at given points.

Problem 5.1. Given m + 1 distinct points x0 , x1 , . . . , xm and values fi = f (xi ) of some


function f ∈ X at this points, find a function ϕ ∈ Φ such that

ϕ(xi ) = fi , i = 1, m.

Suppose Φ is a (m + 1)-dimensional linear space. Since we have to satisfy m + 1 conditions,


and have at our disposal m + 1 degrees of freedom – the coefficients of ϕ relative to a base of
Φ – we expect the problem to have a unique solution. Other question of interest, in addition
to existence and uniqueness, are different ways of representing and computing ϕ, what can
be said about the error e(x) = f (x) − p(x) when x 6= xi , i = 1, m and the quality of approx-
imation f (x) ≈ ϕ(x) when the number of points, and hence the “degree” of ϕ, is allowed
to increase indefinitely. Although these question are not of the utmost interest in themselves,
the result discussed in the sequel are widely used in the development of approximate methods
for important practical tasks (numerical integration, equation solving and so on).
188 Function Approximation

Interpolation to function values is referred to as Lagrange-type interpolation. More gen-


erally, we may wish to interpolate to function and derivative values of some function. This is
called Hermite-type interpolation.
When Φ = Pn we have to deal with polynomial interpolation. In this case interpola-
tion problem is called Lagrange interpolation and Hermite interpolation, respectively. For
example, the Lagrange interpolation problem is stated as follows.

Problem 5.2. Given m + 1 distinct points x0 , x1 , . . . , xm and values fi = f (xi ) of some


function f ∈ X at this points, find a polynomial ϕ of minimum degree such that

ϕ(xi ) = fi , i = 1, m.

5.6.1 Lagrange interpolation


Let [a, b] ⊂ R a closed interval, a set of m + 1 distinct points {x0 , x1 , . . . , xm } ⊂ [a, b] and a
function f : [a, b] 7→ R. We wish to determine a polynomial of minimum degree reproducing
the values of i f at xk , k = 0, m.

Theorem 5.6.1. There exists one polynomial and only one Lm f ∈ Pm such that

∀ i = 0, 1, . . . , m, (Lm f )(xi ) = f (xi ); (5.6.1)

this polynomial can be written in the form


m
X
(Lm f )(x) = f (xi )ℓi (x), (5.6.2)
i=0

where
Ym
x − xj
ℓi (x) = . (5.6.3)
j=0
xi − xj
j6=i

Definition 5.6.2. The polynomial Lm f defined in Theorem 5.6.1 is called Lagrange 8 in-
terpolation polynomial of f relative to the points x0 , x1 , . . . , xm , and the functions ℓi (x),
i = 0, m, are called elementary (fundamental, basic) Lagrange polynomials associated to
those points.
Joseph Louis Lagrange (1736-1813), born in Turin, became, through
correspondence with Euler, his protégé. In 1766 he indeed succeeded
Euler in Berlin. He returned to Paris in 1787. Clairaut wrote of the
young Lagrange: “... a Young man, no less remarkable for his tal-
ents than for his modesty; his temperament is mild and melancholic;
8 he knows no other pleasure than study”. Lagrange made fundamen-
tal contributions to the calculus of variations and to number theory,
but worked also on many problems in analysis. He is widely known
for his representation of the remainder term in Taylor’s formula. The
interpolation formula appeared in 1794. His Mécanique Analytique,
published in 1788, made him one of the founders of analytic mechan-
ics.
5.6. Polynomial Interpolation 189

Proof. One proves immediately that ℓi ∈ Pi and that ℓi (xj ) = δij (Krönecker’s symbol);
it results that the polynomial Lm f defined by (5.6.1) is of degree at most m and it satisfies
(5.6.2). Suppose that there is another polynomial p∗m ∈ Pm which also verifies (5.6.2) and
we set qm = Lm − p∗m ; we have qm ∈ Pm and ∀ i = 0, m, qm (xi ) = 0; so qm , having
(m + 1) distinct roots vanishes identically, therefore the uniqueness result. 

The M-file lagr.m (MATLAB Source 5.11) gives the code for Lagrange interpolation
using the formulas (5.6.2) and (5.6.3). The lagr function works also for symbolic variables.

MATLAB Source 5.11 Lagrange Interpolation


function fi=lagr(x,y,xi)
%LAGR - computes Lagrange interpolation polynomial
% x,y - coordinates of nodes
% xi - evaluation points

if nargin ˜=3
error(’illegal no. of arguments’)
end
[mu,nu]=size(xi);
fi=zeros(mu,nu);
np1=length(y);
for i=1:np1
z=ones(mu,nu);
for j=[1:i-1,i+1:np1]
z=z.*(xi-x(j))/(x(i)-x(j));
end;
fi=fi+z*y(i);
end

Example:
>> a = 1:3; b = (a-2).ˆ3; syms x;
>> P = lagr(a,b,x); pretty(P);
1/2 (-x + 2) (x - 3) + (1/2 x - 1/2) (x - 2)

Remark 5.6.3. The basic polynomial ℓi is thus the unique polynomial satisfying

ℓi ∈ Pm and ∀ j = 0, 1, . . . , m, ℓi (xj ) = δij

Setting
m
Y
u(x) = (x − xj )
j=0

u(x)
from (5.6.3) we obtain that ∀ x 6= xi , ℓi (x) = (x−xi )u′ (xi ) . ♦

The Figure 5.8 shows the graphs of third degree basic Lagrange polynomials for the nodes
xk = k, k = 0, 3.
190 Function Approximation

1.2

ℓ1
1

0.8

0.6

ℓ0 ℓ3
0.4

0.2

−0.2
ℓ2
−0.4
0 0.5 1 1.5 2 2.5 3

Figure 5.8: Basic Lagrange polynomials for the nodes x = 0, 1, 2, 3

MATLAB Source 5.12 Find basic Lagrange interpolation polynomials using MATLAB fa-
cilities.
function Z=pfl2(x,t)
%PFL2 - computes basic Lagrange polynomials
%call Z=pfl2(x,t)
%x - interpolation nodes
%t - evaluation points

%Return the result in a matrix: each line corresponds to a basic


%polynomial, and columns correspond to evaluation points
m=length(x);
n=length(t);
[T,X]=meshgrid(t,x);
TT=T-X;
Z=zeros(m,n);
TX=zeros(m,m);
[U,V]=meshgrid(x,x);
XX=U-V;
for i=1:m
TX(i)=prod(XX([1:i-1,i+1:m],i));
Z(i,:)=prod(TT([1:i-1,i+1:m],:))/TX(i);
end
5.6. Polynomial Interpolation 191

The M-file pfl2b.m (MATLAB Source 5.12) computes basic Lagrange polynomials for
a given set of nodes and a given set of evaluation points.
The proof of 5.6.1 proves in fact the existence and the uniqueness of the solution of
general Lagrange interpolation problem:

(PGIL) Given the data b0 , b1 , . . . , bm ∈ R, determine

pm ∈ Pm such that ∀ i = 0, 1, . . . , n, pm (xi ) = bi . (5.6.4)

Problem (5.6.4) leads us to a linear system of (m + 1) equations with (m + 1) unknowns


(the coefficients of pm ).
It is a well-known result from linear algebra

{Existence of a solution ∀ b0 , b1 , . . . , bm } ⇔ {uniqueness of the solution} ⇔

{(b0 = b1 = · · · = bm = 0) ⇒ pm ≡ 0}

We set pm = a0 + a1 x + · · · + am xm

a = (a0 , a1 , . . . , am )T , b = (b0 , b1 , . . . , bm )T

and let V = (vij ) be the m + 1 by m + 1 square matrix with elements vij = xji . The equation
(5.6.4) can be rewritten in the form
Va = b

a Vandermonde matrix); one can prove that V −1 = U T


The matrix V is invertible (it isP
m k
where U = (uij ) with ℓi (x) = k=0 uik x ; in this way we obtain a not so expensive
procedure to invert the Vandermonde matrix and thus to solve the system (5.6.4).

Example 5.6.4. The Lagrange interpolation polynomial of a function f relative to the nodes
x0 and x1 is
x − x1 x − x0
(L1 f ) (x) = f (x0 ) + f (x1 ),
x0 − x1 x1 − x0
that is, the line passing through the points (x0 , f (x0 )) and (x1 , f (x1 )). Analogously, the
Lagrange interpolation polynomial of a function f relative to the nodes x0 , x1 and x2 is

(x − x1 )(x − x2 ) (x − x0 )(x − x2 )
(L2 f ) (x) = f (x0 ) + f (x1 )+
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 )
(x − x0 )(x − x1 )
f (x2 ),
(x2 − x0 )(x2 − x1 )

that is, the parabola passing through the points of coordinates (x0 , f (x0 )), (x1 , f (x1 )) and
(x2 , f (x2 )). Their geometric interpretation is given in Figure 5.9. ♦
192 Function Approximation

f
f
(L1 f)(x)
(L2 f)(x)

(a) (L1 f ) (b) (L2 f )

Figure 5.9: Geometric interpretation of L1 f (left) and L2 f

5.6.2 Hermite Interpolation


Instead of making f and the interpolation polynomial to agree at points xi in [a, b], we could
make that f and the interpolation polynomial to agree together with their derivatives up to
the order ri at points xi . One obtains:
Theorem 5.6.5. Given (m + 1) distinct points x0 , x1 , . . . , xm in [a, b] and (m + 1) natural
numbers r0 , r1 , . . . , rm , we set n = m + r0 + r1 + · · · + rm . Then, given a function f ,
defined on [a, b] and having ri th order derivative at point xi , there exists one polynomial and
only one Hn f of degree ≤ n such that

∀ (i, ℓ), 0 ≤ i ≤ m, 0 ≤ ℓ ≤ ri (Hn f )(ℓ) (xi ) = f (ℓ) (xi ), (5.6.5)

where f (ℓ) (xi ) is the ℓth order derivative of f at xi .

Definition 5.6.6. The polynomial defined as above is called Hermite 9 interpolation poly-
nomial of the function f relative to the points x0 , x1 , . . . , xm and integers r0 , r1 , . . . , rm .

Proof. Equation (5.6.5) leads us to a linear system having (n + 1) equations and (n + 1)


unknowns (the coefficients of Hn f ), so it is sufficient to show that the corresponding homo-
geneous system has only the null solution, that is, the relations

Hn f ∈ Pn and ∀ (i, ℓ), 0 ≤ i ≤ k, 0 ≤ ℓ ≤ ri , (Hn f )(ℓ) (xi ) = 0

Charles Hermite (1822-1901) was a leading French mathematicians,


9 Academician in Paris, known for his extensive work in number the-
ory, algebra, and analysis. He is famous for his proof in 1873 of the
transcendental nature of the number e.
5.6. Polynomial Interpolation 193

guarantee us that for each i = 0, 1, . . . , m xi is a (ri + 1)th order multiple root of Hn f ;


therefore Hn f has the form
m
Y
(Hn f )(x) = q(x) (x − xi )ri +1 ,
i=0
Pm
where q is a polynomial. Since i=0 (αi + 1) = n + 1, the above relation is incompatible to
the membership of Hn to Pn , excepting the situation when q ≡ 0, hence Hn ≡ 0. 
Remark 5.6.7. 1) Given the real numbers biℓ , for each pair (i, ℓ) such that 0 ≤ i ≤ k and
0 ≤ ℓ ≤ ri , we proved that the general Hermite interpolation problem
determine pn ∈ Pn such that ∀ (i, ℓ) 0 ≤ i ≤ m and
(ℓ) (5.6.6)
0 ≤ ℓ ≤ ri , pn (xi ) = biℓ
has a solution and only one. In particular, if we choose a given pair (i, ℓ), biℓ = 1
and bjn = 0, ∀ (j, m) 6= (i, ℓ) one obtains a basic (fundamental) Hermite interpola-
tion polynomial relative to the points x0 , x1 , . . . , xm and integers r0 , r1 , . . . , rm . The
Hermite interpolation polynomial defined by (5.6.5) can be obtained using the basic
polynomials
Xm X ri
(Hn f )(x) = f (ℓ) (x)hiℓ (x). (5.6.7)
i=0 l=0
Setting
k 
Y r
x − xj j+1
qi (x) =
j=0
xi − xj
j6=i

one checks easily that the basic polynomials hiℓ are defined by the recurrences
(x − xi )ri
hiri (x) = qi (x)
ri !
and for ℓ = ri−1 , ri−2 , . . . , 1, 0
ri  
X
(x − xi )ℓ j (j−ℓ)
hiℓ (x) = qi (x) − q (xi )hij (x).
ℓ! ℓ i
j=ℓ+1

2) The matrix V of the linear system (5.6.6) is called generalized Vandermonde matrix; it
is invertible and the elements of its inverse are the coefficients of polynomials hil .
3) Lagrange interpolation is a particular case of Hermite interpolation (for ri = 0, i =
0, 1, . . . , m); Taylor’s polynomial is a particular case for m = 0 and r0 = n. ♦
We shall give a more convenient expression for Hermite basic polynomials due to Dimi-
trie D. Stancu [86]. They verify
(p)
hkj (xν ) = 0, ν 6= k, p = 0, rν (5.6.8)
(p)
hkj (xk ) = δjp , p = 0, rk
194 Function Approximation

for j = 0, rk and ν, k = 0, m. Setting


m
Y
u(x) = (x − xk )rk +1
k=0

and
u(x)
uk (x) = ,
(x − xk )rk +1
it results from (5.6.8) that hkj is of the form

hkj (x) = uk (x)(x − xk )j gkj (x), gkj ∈ Prk −j . (5.6.9)

Applying Taylor’s formula, we get


rX
k −j
(x − xk )ν ν
gkj (x) = gkj (xk ); (5.6.10)
ν=0
ν!

ν
now we must determine the values gkj (xk ), ν = 0, rk − j. Rewriting (5.6.9) in the form

1
(x − xk )j gkj (x) = hkj (x) ,
uk (x)

and applying Leibnitz’s formula for the (j + ν)th order derivative of the product one gets

X
j+ν 
j+ν h j
i(j+ν−s)
(s)
j+ν 
X 
j + ν (j+ν−s)

1
(s)
(x − xk ) gkj (x) = hkj (x) .
s=0
s s=0
s uk (x)

Taking x = xk , all terms in both sides, excepting those corresponding to s = ν will vanish.
Thus, we have
    (ν)
j+ν (ν) j+ν 1
j!gkj (xk ) = , ν = 0, rk − j.
ν ν uk (x) x=xk

We got
 (ν)
(ν) 1 1
gkj (xk ) = ,
j! uk (x) x=xk
and from (5.6.10) and (5.6.9) we finally have
rX
k −j  (ν)
(x − xk )j (x − xk )ν 1
hkj (x) = uk (x) .
j! ν=0
ν! uk (x) x=xk

Proposition 5.6.8. The operator Hn is a projector, i.e.

• it is linear (Hn (αf + βg) = αHn f + βHn g);


5.6. Polynomial Interpolation 195

• it is idempotent (Hn ◦ Hn = Hn ).

Proof. Linearity results from (5.6.7). Due to the uniqueness of Hermite interpolation poly-
nomial, Hn (Hn f ) − Hn f vanishes identically, hence Hn (Hn f ) = Hn f , that is, it is idem-
potent. 

Example 5.6.9. The Hermite interpolation polynomial corresponding to a function f and


double nodes 0 and 1 is

(H3 f ) (x) = h00 (x)f (0) + h10 (x)f (1) + h01 (x)f ′ (0) + h11 (x)f ′ (1),

where

h00 (x) = (x − 1)2 (2x + 1),


h01 (x) = x(x − 1)2 ,
h10 (x) = x2 (3 − 2x),
h11 (x) = x2 (x − 1).

If we add the node x = 12 , then the quality of approximation increases (see Figure 5.10). ♦

1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(a) (H3 f ) (b) (H3 f )

Figure 5.10: Hermite interpolation polynomial (H3 f ) (black) of the function f : [0, 1] → R ,
f (x) = sin πx (dotted) and double nodes x0 = 0 and x1 = 1 (left) and (H5 f ) of the function
f : [0, 1] → R , f (x) = sin πx (dotted) and double nodes x0 = 0, x1 = 12 and x2 = 1

5.6.3 Interpolation error


Recall that the norm of a linear operator, Pn , can be defined by

kPn f k
kPn k = max , (5.6.11)
f ∈C[a,b] kf k
196 Function Approximation

where in the right-hand side one chooses a convenient norm for functions. Taking the norm
L∞ , from Lagrange formula one obtains
m
X

k(Lm f )(.)k∞ = max f (xi )ℓi (x)
a≤x≤b
i=0
m
(5.6.12)
X
≤ kf k∞ max |ℓi (x)|.
a≤x≤b
i=0

Let kλm k∞ = λm (x∞ ). The equality holds for a function ϕ ∈ C[a, b], piecewise linear, that
verifies ϕ(xi ) = sgnℓi (x∞ ), i = 0, m. So,

kPn k∞ = Λm , (5.6.13)

where
m
X
Λm = kλm k∞ , λm (x) = |ℓi (x)|. (5.6.14)
i=0

The function λm (x) and its maximum Λm is called Lebesgue function 10 and Lebesgue
constant for Lagrange interpolation. They provide a first estimation of interpolation error: let
Em (f ) be the error in best approximation of f by polynomials of degree ≤ m,

Em (f ) = min kf − pk∞ = kf − p̂n k∞ , (5.6.15)


p∈Pm

where p̂n is the m-th degree best approximation polynomial of f . Using the fact that Lm is a
projector and formulas (5.6.12) and (5.6.14), one finds

kf − Lm f k = kf − p̂m − Lm (f − p̂m )k∞


≤ kf − p̂m k∞ + Λm kf − p̂m k∞ ;

that is,
kf − Lm f k∞ ≤ (1 + Λm )Em (f ). (5.6.16)
Thus, better is the approximation of f by polynomials of degree ≤ m, smaller is the inter-
polation error. Unfortunately, Λm is not uniformly bounded: for any choice of the nodes
(m)
xi = xi , i = 0, m, it can show that Λm > O(log m), when m → ∞. Nevertheless, it is

Henry Lebesgue (1875-1941) was a French mathematician known for


10 his fundamental work on the theory of real function, notably the con-
cepts of measure and integral that now bear his name.
5.6. Polynomial Interpolation 197

not possible, based on Weierstrass’ approximation theorem (that is, from Em → 0, m → ∞),
to draw the conclusion that Lagrange interpolation converges uniformly for any function f ,
not even for judiciously chosen nodes; in fact, it is known that the convergence does not hold.
If we wish to use Lagrange or Hermite interpolation polynomial to approximate a function
f at a point x ∈ [a, b], x 6= xk , k = 0, m, we need to estimate the error (Rn f )(x) =
f (x) − (Hn f )(x). If we have not any information about f excepting the values at xi , we can
say nothing about (Rn f )(x); we can change f everywhere excepting the points xi without
modify (Hn f ) (x). We need some supplementary assumptions (regularity conditions) on f .
Let C m [a, b] be the space of real functions m times continuous-differentiable on [a, b]. We
have the following theorem about error in Hermite interpolation.

Theorem 5.6.10. Suppose f ∈ C n [α, β] and there exists f (n+1) on (α, β), where α =
min{x, x0 , . . . , xm } and β = max{x, x0 , . . . , xm }; then, for each x ∈ [α, β], there exists a
ξx ∈ (α, β) such that

1
(Rn f )(x) = un (x)f (n+1) (ξx ), (5.6.17)
(n + 1)!

where
m
Y
un (x) = (x − xi )ri+1 .
i=0

Proof. If x = xi , (Rn f )(x) = 0 and (5.6.17) holds trivially. Suppose x 6= xi , i = 0, m and


for a fixed x, we introduce the auxiliary function

u (z) (Rn f )(z)
F (z) = n .
un (x) (Rn f )(x)

Note that F ∈ C n [α, β], ∃ F (n+1) on (α, β), F (x) = 0 and F (j) (xk ) = 0 for k = 0, m,
j = 0, rk . Thus, F has (n + 2) zeros, considering their multiplicities. Applying successively
generalized Rolle’s Theorem, it results that there exists at least one ξ ∈ (α, β) such that
F (n+1) (ξ) = 0, i.e.
(n + 1)! f (n+1) (ξ)
F (m+1)
(ξ) = = 0, (5.6.18)
un (x) (Rn f )(x)
where we used the relation (Rn f )(n+1) = f (n+1) − (Hn f )(n+1) = f (n+1) . Expressing
(Rn f )(x) from (5.6.18) one obtains (5.6.17). 

Corollary 5.6.11. We set Mn+1 = max |f (n+1) (x)|; an upper bound of interpolation error
x∈[a,b]
(Rn f )(x) = f (x) − (Hn f )(x) is given by

Mn+1
|(Rn f )(x)| ≤ |un (x)|.
(n + 1)!

Since Hn is a projector, Rn is also a projector and additionally KerRn = Pn , because


Rn f = f − Hn f = f − f = 0, ∀f ∈ Pn . Thus, we can apply Peano’s Theorem to Rn .
198 Function Approximation

Theorem 5.6.12. If f ∈ C n+1 [a, b], then


Z b
(Rn f ) (x) = Kn (x; t)f (n+1) (t)dt, (5.6.19)
a

where
 
1  
m X
X rk
 (j)
Kn (x; t) = (x − t)n+ − hkj (x) (xk − t)n+ . (5.6.20)
n!  j=0

k=0

Proof. Applying Peano’s Theorem, we have


Z b
(Rn f ) (x) = Kn (x; t)f (n+1) (t)dt
a

and taking into account that


   
(x − t)n+ (x − t)n+ (x − t)n+
Kn (x; t) = Rn = − Hn ,
n! n! n!

the theorem follows immediately. 

Since Lagrange interpolation is a particular case of Hermite interpolation for ri = 0,


i = 0, 1, . . . , m we have from Theorem 5.6.10:

Corollary 5.6.13. Suppose f ∈ C m [α, β] and there exists f (m+1) on (α, β), where α =
min{x, x0 , . . . , xm } and β = max{x, x0 , . . . , xm }; then, for each x ∈ [α, β], there exists a
ξx ∈ (α, β) such that

1
(Rm f )(x) = um (x)f (m+1) (ξx ), (5.6.21)
(n + 1)!

where
m
Y
um (x) = (x − xi ).
i=0

Also, it follows from Peano’s Theorem 5.6.12:

Corollary 5.6.14. If f ∈ C m+1 [a, b], then


Z b
(Rm f ) (x) = Km (x; t)f (m+1) (t)dt (5.6.22)
a

where " #
m
X
1 m m
Km (x; t) = (x − t)+ − lk (x)(xk − t)+ . (5.6.23)
m!
k=0
5.7. Efficient Computation of Interpolation Polynomials 199

Example 5.6.15. For interpolation polynomials in example 5.6.4 the corresponding remain-
ders are
(x − x0 )(x − x1 ) ′′
(R1 f )(x) = f (ξ),
2
and
(x − x0 )(x − x1 )(x − x2 ) ′′′
(R2 f )(x) = f (ξ),
6
respectively. ♦

Example 5.6.16. The remainder for the Hermite interpolation formula with double nodes 0
and 1, for f ∈ C 4 [α, β], is

x2 (x − 1)2 (4)
(R3 f )(x) = f (ξ). ♦
6!

Example 5.6.17. Let f (x) = ex . For x ∈ [a, b], we have Mn+1 = eb and for every choice
of the points xi , |un (x)| ≤ (b − a)n+1 , which implies

(b − a)n+1 b
max |(Rn f )(x)| ≤ e .
x∈[a,b] (n + 1)!

One gets  
lim max |(Rn f )(x)| = lim k(Rn f )(x)k = 0,
n→∞ x∈[a,b] n→∞

that is, Hn f converges uniformly to f on [a, b], when n tends to ∞. In fact, we can prove
an analogous result for any function which can be developed into a Taylor series in a disk
centered in x = a+b 3
2 and with the radius of convergence r > 2 (b − a). ♦

5.7 Efficient Computation of Interpolation Polynomials


5.7.1 Aitken-type methods
Many times, the degree required to attain the desired accuracy in polynomial interpolation is
not known. It can be obtained from the remainder expression, but this require kf (m+1) k∞
to be known. Pm1 ,m2 ,...,mk will denote the Lagrange interpolation polynomial with nodes
xm1 , . . . , xmk .

Proposition 5.7.1. If f is defined at x0 , . . . , xk , xj 6= xi , 0 ≤ i, j ≤ k, then

(x − xj )P0,1,...,j−1,j+1,...,k (x) − (x − xi )P0,1,...,i−1,i+1,...,k (x)


P0,1,...,k = =
xi − xj

1 x − xj P0,1,...,i−1,i+1,...,k (x)
= . (5.7.1)
xi − xj x − xi P0,1,...,j−1,j+1,...,k (x)
200 Function Approximation

b = P0,1,...,j−1,j+1,k
Proof. Q = P0,1,...,i−1,i+1,...,k , Q
b
(x − xj )Q(x) − (x − xi )Q(x)
P (x) =
xi − xj

b r ) − (xr − xi )Q(xr )
(xr − xj )Q(x xi − xj
P (xr ) = = f (xr ) = f (xr )
xi − xj xi − xj

b r ) = f (xr ). But,
for r 6= i ∧ r 6= j, since Q(xr ) = Q(x
b i ) − (xi − xj )Q(xi )
(xi − xj )Q(x
P (xi ) = = f (xi )
xi − xj
and
b j ) − (xj − xi )Q(xj )
(xj − xi )Q(x
P (xj ) = = f (xj ),
xi − xj
hence P = P0,1,...,k . 
Thus we established a recurrence relation between a Lagrange interpolation polynomial
of degree k and two Lagrange interpolation polynomials of degree k − 1. The computation
could be organized in a tabular fashion

x0 P0
x1 P1 P0,1
x2 P2 P1,2 P0,1,2
x3 P3 P2,3 P1,2,3 P0,1,2,3
x4 P4 P3,4 P2,3,4 P1,2,3,4 P0,1,2,3,4

And now, suppose P0,1,2,3,4 does not provide the desired accuracy. One can select a new
node and add a new line to the table

x5 P5 P4,5 P3,4,5 P2,3,4,5 P1,2,3,4,5 P0,1,2,3,4,5

and neighbor elements on row, column and diagonal could be compared to check if the desired
accuracy was achieved.
The method is called Neville method.
We can simplify the notations

Qi,j := Pi−j,i−j+1,...,i−1,i ,

Qi,j−1 = Pi−j+1,...,i−1,i ,
Qi−1,j−1 := Pi−j,i−j+1,...,i−1 .
Formula (5.7.1) implies
(x − xi−j )Qi,j−1 − (x − xi )Qi−1,j−1
Qi,j = ,
xi − xi−j
5.7. Efficient Computation of Interpolation Polynomials 201

for j = 1, 2, 3, . . . , i = j + 1, j + 2, . . .
Moreover, Qi,0 = f (xi ). We obtain

x0 Q0,0
x1 Q1,0 Q1,1
x2 Q2,0 Q2,1 Q2,2
x3 Q3,0 Q3,1 Q3,2 Q3,3

If the interpolation procedure converges, then the sequence Qi,i also converges and a
stopping criterion could be
|Qi,i − Qi−1,i−1 | < ε.

The algorithm speeds-up by sorting the nodes on ascending order over |xi − x|.
Aitken methods is similar to Neville method. It builds the table

x0 P0
x1 P1 P0,1
x2 P2 P0,2 P0,1,2
x3 P3 P0,3 P0,1,3 P0,1,2,3
x4 P4 P0,4 P0,1,4 P0,1,2,4 P0,1,2,3,4

To compute a new value one takes the value in top of the preceding column and the value
from the current line and the preceding column.

5.7.2 Divided difference method


Let Lk f denotes the Lagrange interpolation polynomial with nodes x0 , x1 , . . . , xk for k =
0, 1, . . . , n. We shall construct Lm by recurrence. We have

(L0 f )(x) = f (x0 )

for k ≥ 1, the polynomial Lk − Lk−1 is of degree k, vanish at x0 , x1 , . . . , xk−1 , so its form


is

(Lk f )(x) − (Lk−1 f )(x) = f [x0 , x1 , . . . , xk ](x − x0 )(x − x1 ) . . . (x − xk−1 ), (5.7.2)

where f [x0 , x1 , . . . , xk ] denotes the coefficient of xk in (Lk f )(x). One derives the expres-
sion of the interpolation polynomial Lm f with nodes x0 , x1 , . . . , xn

m
X
(Lm f )(x) = f (x0 ) + f [x0 , x1 , . . . , xk ](x − x0 )(x − x1 ) . . . (x − xk−1 ), (5.7.3)
k=1
202 Function Approximation

called Newton’s 11 form of Lagrange interpolation polynomial.


Formula (5.7.3) reduces the computation by recurrence of Lm f to that of the coefficients
f [x0 , x1 , . . . , xk ], k = 0, m.
It holds
Lemma 5.7.2.
f [x1 , x2 , . . . , xk ] − f [x0 , x1 , . . . , xk−1 ]
∀k≥1 f [x0 , x1 , . . . , xk ] = (5.7.4)
xk − x0
and
f [xi ] = f (xi ), i = 0, 1, . . . , k.
Proof. For k ≥ 1 let L∗k−1 fbe the interpolation polynomial for f , having the degree k − 1
and the nodes x1 , x2 , . . . , xk ; the coefficient of xk−1 is f [x1 , x2 , . . . , xk ]. The polynomial
qk of degree k defined by
(x − x0 )(L∗k−1 f )(x) − (x − xk )(Lk−1 f )(x)
qk (x) =
xk − x0
equates f at points x0 , x1 , . . . , xk , hence qk (x) ≡ (Lk f )(x). Formula (5.7.4) is obtaining by
identification of xk coefficients in both sides. 
Definition 5.7.3. The quantity f [x0 , x1 , . . . , xk ] is called kth divided difference of f relative
to the nodes x0 , x1 , . . . , xk .
An alternative notation is [x0 , . . . , xk ; f ].
The definition implies that f [x0 , x1 , . . . , xk ] is independent of x’s order and it could be
computed as a function of f (x0 ), . . . , f (xm ). Indeed, the Lagrange interpolation polynomial
of degree ≤ m relative to the nodes x0 , . . . , xm can be written as
m
X
(Lm f )(x) = li f (xi )
i=0

and the coefficient of xm is


m
X f (xi )
f [x0 , . . . , xm ] = m . (5.7.5)
Y
i=0
(xi − xj )
j=0
j6=i

Sir Isaac Newton (1643 - 1727) was an eminent figure of the 17th
century mathematics and physics. Not only did he lay the founda-
tions of modern physics, but he was also one of the co-inventors of
11 the differential calculus. Another was Leibniz, with whom he became
entangled in a biter and life-long priority dispute. His most influential
work was the Principia, which not only contains his ideas on interpo-
lation, but also his suggestion to use the interpolating polynomial for
purposes of integration.
5.7. Efficient Computation of Interpolation Polynomials 203

The formula (5.7.4) can be used to generate the table of divided differences

x0 f [x0 ] ✲ f [x0 , x1 ] ✲ f [x0 , x1 , x2 ] ✲ f [x0 , x1 , x2 , x3 ]



✟ ✟
✯ ✯

✟✟ ✟✟ ✟✟
✟✟ ✟✟ ✟✟
✟ ✟ ✟
x1 f [x1 ] ✲ f [x1 , x2 ] ✲ f [x1 , x2 , x3 ]
✟✯
✟ ✟✟

✟✟ ✟✟
✟✟ ✟✟
x2 f [x2 ] ✲ f [x2 , x3 ]


✟✟
✟✟

x3 f [x3 ]

The first column contains the values of function f , the second contains the 1st order divided
difference and so on; we pass from a column to the next using formula (5.7.4): each entry is
the difference of the entry to the left and below it and the one immediately to the left, divided
by the difference of the x-value found by going diagonally down and the x-value horizontally
to the left. The divided differences that occur in the Newton formula (5.7.3) are precisely the
m + 1 entries in the first line of the table of divided differences. Their computation requires
n(n + 1) additions and 12 n(n + 1) divisions. Adding another data point (xm+1 , f [xm+1 ])
requires the generation of the next diagonal. Lm+1 f can be obtained from Lm f by adding to
it the term f [x0 , . . . , xm+1 ](x − x0 ) . . . (x − xm+1 ).
MATLAB Source 5.13 generates the divided difference table, and MATLAB Source 5.14
computes Newton’s form of Lagrange interpolation polynomial.

MATLAB Source 5.13 Generate divided difference table


function td=divdiff(x,f)
%DIVDIFF - compute divided difference table
%call td=divdiff(x,f)
%x - nodes
%f- function value
%td - divided difference table

lx=length(x);
td=zeros(lx,lx);
td(:,1)=f’;
for j=2:lx
td(1:lx-j+1,j)=diff(td(1:lx-j+2,j-1))./...
(x(j:lx)-x(1:lx-j+1))’;
end

Remark 5.7.4. The interpolation error is given by


f (x) − (Lm f )(x) = um (x)f [x0 , x1 , . . . , xm , x]. (5.7.6)
204 Function Approximation

MATLAB Source 5.14 Compute Newton’s form of Lagrange interpolation polynomial


function z=Newtonpol(td,x,t)
%NEWTONPOL - computes Newton interpolation polynomial
%call z=Newtonpol(td,x,t)
%td - divided difference table
%x - interpolation nodes
%t - evaluation points
%z - values of interpolation polynomial

lt=length(t); lx=length(x);
for j=1:lt
d=t(j)-x;
z(j)=[1,cumprod(d(1:lx-1))]*td(1,:)’;
end

Indeed, it is sufficient to note that

(Lm f )(t) + um (t)f [x0 , . . . , xm ; x]

is, according to (5.7.3) the interpolation polynomial (in t) of f relative to the points x0 , x1 ,
. . . , xm , x. The theorem on the remainder of Lagrange interpolation formula (5.6.17) implies
the existence of a ξ ∈ (a, b) such that
1 (m)
f [x0 , x1 , . . . , xm ] = f (ξ) (5.7.7)
m!
(mean formula for divided differences). ♦

A divided difference could be written as the quotient of two determinants.


Theorem 5.7.5. It holds
(W f )(x0 , . . . , xm )
f [x0 , . . . , xm ] = (5.7.8)
V (x0 , . . . , xm )
where
1 x0 x20 . . . xm−1 f (x0 )
0
1 x1 x21 . . . xm−1 f (x1 )
1
(W f )(x0 , . . . , xn ) = .. .. .. .. .. .. , (5.7.9)
. . . . . .

1 xm x2m . . . xm−1 f (xm )
m

and V (x0 , . . . , xm ) is the Vandermonde determinant.

Proof. One expands (W f )(x0 , . . . , xm ) over the last columns; taking into account that every
algebraic complement is a Vandermonde determinant, one gets
Xm
1
f [x0 , . . . , xm ] = V (x0 , . . . , xi−1 , xi+1 , . . . , xm )f (xi ) =
V (x0 , . . . , xm ) i=0
5.7. Efficient Computation of Interpolation Polynomials 205

m
X f (xi )
= (−1)m−i ,
i=0
(xi − x0 ) . . . (xi − xi−1 )(xi − xi+1 ) . . . (xn − xi )
that after the sign changing of the last m − i terms implies (5.7.5). 

5.7.3 Barycentric Lagrange Interpolation


We rewrite (5.6.2), (5.6.3) such that the Lagrange interpolation polynomial can be evaluated
and updated in O(m) operations,in a numerically stable way. We have

um (x) 1
ℓj (x) = Q · , (5.7.10)
(xj − xk ) x − xj
k6=j

where
um (x) = (x − x0 )(x − x1 ) · · · (x − xm ). (5.7.11)
If one defines the barycentric weights by
1
wj = Q , j = 0, . . . , m, (5.7.12)
(xj − xk )
k6=j

that is, wj = 1/u′m (xj ), we can thus write the basic Lagrange polynomials ℓj as
wj
ℓj (x) = um (x) . (5.7.13)
x − xj

Now, (setting fj := f (xj )) we can write the Lagrange interpolation polynomial as


m
X wj
(Lm f )(x) = um (x) fj . (5.7.14)
j=0
x − xj

The formula (5.7.14) is called the first barycentric formula.


By interpolating the constant function 1, whos interpolant is itself, we obtain
m
X m
X wj
1= ℓj (x) = um (x) (5.7.15)
j=0 j=0
x − xj .

Dividing (5.7.14) by the above expression and simplifying with the factor um (x), we
obtain the formula
Xm
wj
fj
j=0
x − xj
p(x) = m , (5.7.16)
X wj

j=0
x − xj

called the second barycentric formula [77].


206 Function Approximation

Remarkable point distributions

For certain particular distributions of nodes there exist explicit formulas for the barycentric
weights wj , starting from the formula

1
wj = , (5.7.17)
u′m (xj )

• For equispaced nodes, we have


 
m j
wj = (−1) . (5.7.18)
j

• The Chebyshev nodes family could be obtained by projecting equally spaced points laid
on the unit circle on the interval [−1, 1].

– The Chebyshev points of the first kind of the first kind on [−1, 1] are given by

(2j + 1) π
xj = cos , j = 0, . . . , m.
2m + 2

Simplifying factors independents of j we find

j (2j + 1) π
wj = (−1) sin . (5.7.19)
2m + 2

– The Chebyshev points of the second kind or Gauss-Lobatto points on [−1, 1] ar


given by
πj
xj = cos , j = 0, . . . , n (5.7.20)
n
and the corresponding barycentric weights are

2n−1 (−1)j /2 dacă j = 0 sau j = n,
wj = (5.7.21)
n (−1)j altfel.

The common factor 2n−1 /n could be eliminated, since it appears both in numer-
ator and denominator of the barycentric formula (5.7.16):

j 1/2, j = 0 sau j = m,
wj = (−1) δj , δj = (5.7.22)
1, altfel.

• For an arbitrary interval [a, b] we can use the change of variable

2x − b − a
t= .
b−a
5.7. Efficient Computation of Interpolation Polynomials 207

Interpolation at Chebyshev nodes


We can avoid the difficulties generated by higher degree polynomial interpolation by using
points set which are clustered at the ends of the interval, for example Chebyshev points of
second kind.
These nodes are utilized in chebfun MATLAB package, implemented at University of
Oxford by a group led by professor L. N. Trefethen [26].
If (xj )nj=0 are Chebyshev points, the polynomial of nodes satisfies

n
Y
(x − xj ) ≤ 2−n+1

j=0

Theorem 5.7.6 gives several properties of interpolation at Chebyshev points of second


kind.:

Theorem 5.7.6. Let f ∈ C[−1, 1], pn its interpolation polynomial at Chebyshev points
(5.7.20) and p∗n its best approximation polynomial with respect to the uniform norm k·k∞ .
Then

1. kf − pn k∞ ≤ 2 + π2 log n kf − p∗n k∞

2. If ∃k ∈ N∗ such that f (k) is a bounded variation function on [−1, 1], then kf − pn k∞ =
O n−k , as n → ∞.
3. If f is analytic in a complex neighbourhood of [−1, 1], the ∃C < 1 such that kf − pn k∞ =
O (C n ); in particular, if f is analytic in a closes ellipse with foci ±1 and semi-axis
M ≥ 1 and m ≥ 0, we can choose C = 1/(M + m).

We can implement the barycentric formula as follows:


• If the evaluation point x is among the nodes, say x = xi , one returns fi = f (xi ).

• Otherwise, we apply formula (5.7.16).


We give a MATLAB implementation of barycentric method (function barycentricInterpolation),
see MATLAB source 5.15, inspired from [5].
The function barycentricweigths, MATLAB source 5.16, computes the barycen-
tric weights. For barycentric interpolation at Chebyshev nodes of first kind and second kind,
we give the functions ChebLagrangek1 (MATLAB source 5.17) and ChebLagrange
(MATLAB source 5.18), respectively.
As an example we will interpolate the function f : [−1, 1] → R, f (x) = |x| + 1/2x − x2
for m = 20 and m = 100 Chebyshev nodes of second kind.
m = 20;
fun = @(x) abs(x)+.5*x-x.ˆ2;
x = sort(cos(pi*(0:m)/m));
f = fun(x);
xx = linspace(-1,1,5000);
208 Function Approximation

MATLAB Source 5.15 Barycentric Lagrange interpolation


function ff=barycentricInterpolation(x,y,xx,c)
%BARYCENTRICINTERPOLATION - barycentric Lagrange interpolation
%call ff=barycentricInterpolation(x,y,xx,c)
%x - nodes
%y - function values
%xx - interpolation points
%c - barycentric weights
%ff - values of interpolation polynomial

numer = zeros(size(xx));
denom = zeros(size(xx));
exact = zeros(size(xx)); %test if xx=nodes
for j=1:n+1
xdiff = xx-x(j);
temp = c(j)./xdiff;
numer = numer+temp*y(j);
denom = denom+temp;
exact(xdiff==0) = j;
end
ff = numer ./ denom;
jj = find(exact);
ff(jj) = y(exact(jj));

MATLAB Source 5.16 Bayicentric weights


function c = barycentricweigths( x )
%BARYCENTRICWEIGHTS - compute barycentric weights(coefficient)
%call c = barycentricweigths( x )
%x - nodes
%c - weights

n=length(x)-1;
c=ones(1,n+1);
for j=1:n+1
c(j)=prod(x(j)-x([1:j-1,j+1:n+1]));
end
c=1./c;
end
5.7. Efficient Computation of Interpolation Polynomials 209

MATLAB Source 5.17 Barycentric Lagrange interpolation with Chebyshev nodes of first
kind
function ff=ChebLagrangek1(y,xx,a,b)
%CHEBLAGRANGEK1 - Lagrange interpolation for Chebyshev #1 points
% - barycentric method
%call ff=ChebLagrangek1(y,xx,a,b)
%y - function values;
%xx - evaluation points
%a,b - interval
%ff - values of Lagrange interpolation polynomial

n = length(y)-1;
if nargin==2
a=-1; b=1;
end
c = sin((2*(0:n)’+1)*pi/(2*n+2)).*(-1).ˆ((0:n)’);
x = sort(cos((2*(0:n)’+1)*pi/(2*n+2))*(b-a)/2+(a+b)/2);
ff=barycentricInterpolation(x,y,xx,c);
end

MATLAB Source 5.18 Barycentric Lagrange interpolation with Chebyshev nodes of second
kind
function ff=ChebLagrange(y,xx,a,b)
%CHEBLAGRANGE - Lagrange interpolation for Chebyshev #2 points
%- barycentric
%call ff=ChebLagrange(y,xx,a,b)
%y - function values;
%xx - evaluation points
%a,b - interval
%ff - values of Lagrange interpolation polynomial

n = length(y)-1;
if nargin==2
a=-1; b=1;
end
c = [1/2; ones(n-1,1); 1/2].*(-1).ˆ((0:n)’);
x = sort(cos((0:n)’*pi/n))*(b-a)/2+(a+b)/2;
ff=barycentricInterpolation(x,y,xx,c);
end
210 Function Approximation

ff=ChebLagrange(f,xx);
subplot(2,1,1)
plot(x,f,’.’,xx,ff,’-’)
m = 100;
x = sort(cos(pi*(0:m)/m));
f = fun(x);
xx = linspace(-1,1,5000);
ff=ChebLagrange(f,xx);
subplot(2,1,2)
plot(x,f,’.’,xx,ff,’-’)

The graph is given in Figure 5.11.

0.5

-0.5
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

0.5

-0.5
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

Figure 5.11: Barycentric Lagrange interpolation

5.7.4 Multiple nodes divided differences


Formula (5.7.8) allows us to introduce the notion of a multiple nodes divided difference: if f ∈ C m [a, b]
and α ∈ [a, b], then

f m (ξ) f (m) (α)


lim [x0 , . . . , xn ; f ] = lim =
x0 ,...,xn →α ξ→α m! m!

This suggests the relation


1 (m)
[α, . . . , α; f ] = f (α).
| {z } m!
m+1
5.7. Efficient Computation of Interpolation Polynomials 211

Expressing this as a quotient of two determinants one obtains


  1 α α2 . . . αm−1 f (α)



  0 1 2α . . . (m − 1)αm−2 f ′ (α)
(W f ) α, . . . , α =

| {z } ... ... ... ... ... ...
m+1 0 0 0 ... (m − 1)! f (m−1) (α)

and
  1 α α2 ... αm

  0 1 2α ... mαm−1
,
V α, . . . , α =
| {z } ... ... ... ... ...
m+1 0 0 0 ... m!
that is, the two determinants are built from the line of the node α and its successive derivatives with
respect to α up to the mth order.
The generalization to several nodes is:

Definition 5.7.7. Let rk ∈ N, k = 0, m, n = r0 + · · · + rm . Suppose that f (j) (xk ), k = 0, m,


j = 0, rk − 1 exist. The quantity
(Wf )(x0 , . . . , x0 , . . . , xm , . . . , xm )
[x0 , . . . , x0 , x1 , . . . , x1 , . . . , xm , . . . , xm ;f ] =
| {z } | {z } | {z } V (x0 , . . . , x0 , . . . , xm , . . . , xm )
r0 r1 rm

where
(W f )(x0 , . . . , x0 , . . . , xm , . . . , xm ) =


1 x0 ... x0r0 −1 ... xn−1
0 f (x0 )


0 1 ... (r0 − 1)x0r0 −2 ... f ′ (x0 )

.. .. .. .. .. .. ..
. . . . . . .
Qr0−1
n−r0 (r0 −1)
0 0 ... (r0 − 1)! ... p=1 (n − p)x0 f (x0 )
=
1 xm ... xrmm −1 ... xn−1
m f (xm )

0 1 ... (rm − 1)xrmm −2 ... (n − 1)xn−2 m

f (xm )

.. .. .. .. .. .. ..
. . . . . . .
Qrm−1
(rm − 1)! n−rm (rn −1)
(xn )
0 0 ... ... p=1 (n − p)xm f
and V (x0 , . . . , x0 , . . . , xm , . . . , xm ) is as above, excepting the last column which is
r0 −2 rm −2
Y Y
(xn n−1
0 , nx0 ,..., (n − p)xn−r
0
0 +1
, . . . , xn n−1
m , nxm , . . . , xn−r
m
m +1 T
)
p=0 p=0

is called divided difference with multiple nodes xk , k = 0, m and orders of multiplicity rk , k = 0, m.

By generalization of Newton’s form for Lagrange interpolation polynomial one obtains a method
for computing Hermite interpolation polynomial based on multiple nodes divided difference.
Suppose nodes xi , i = 0, m and values f (xi ), f ′ (xi ) are given. We define the sequence of nodes
z0 , z1 , . . . , z2n+1 by z2i = z2i+1 = xi , i = 0, m. We build the divided difference table relative to the
nodes zi , i = 0, 2m + 1. Since z2i = z2i+1 = xi for every i, f [x2i , x2i+1 ] is a divided difference
with a double node and it equates f ′ (xi ); therefore we shall use f ′ (x0 ), f ′ (x1 ), . . . , f ′ (xm ) instead of
first order divided differences

f [z0 , z1 ], f [z2 , z3 ], . . . , f [z2m , z2m+1 ].


212 Function Approximation

z0 = x0 f [z0 ]
f [z0 , z1 ] = f ′ (x0 )
f [z1 ,z2 ]−f [z0 ,z1 ]
z1 = x0 f [z1 ] f [z0 , z1 , z2 ] = z2 −z0
f (z2 )−f (z1 )
f [z1 , z2 ] = z2 −z1
f [z3 ,z2 ]−f [z2 ,z1 ]
z2 = x1 f [z2 ] f [x1 , z2 , z3 ] = z3 −z1

f [z2 , z3 ] = f (x1 )
f [z4 ,z3 ]−f [z3 ,z2 ]
z3 = x1 f [z3 ] f [z2 , z3 , z4 ] = z4 −z2
f (z4 )−f (z3 )
f [z3 , z4 ] = z4 −z3
f [z5 ,z4 ]−f [z4 ,z3 ]
z4 = x2 f [z4 ] f [z3 , z4 , z5 ] = z5 −z3

f [z4 , z5 ] = f (x2 )
z5 = x2 f [z5 ]

Table 5.4: A divided difference table with double nodes

The other divided differences are obtained as usual, as the Table 5.4 shows. This idea could be extended
to another Hermite interpolation problems. The method is due to Powell.
MATLAB Source 5.19 contains a function for the computation of a divided difference table with
double nodes, divdiffdn. It returns the nodes, taking into account their multiplicities, and the di-
vided difference table. The Hermite interpolation polynomial can be computed using the function
Newtonpol (see MATLAB Source 5.14, page 204) with nodes and table returned by divdiffdn.

5.8 Convergence of polynomial interpolation


Let’s explain first what we mean by “convergence”. We assume that we are given a triangular array of
(m)
interpolation nodes xi = xi , exactly m + 1 distinct nodes for each m = 0, 1, 2, . . . .
(0)
x0
(1) (1)
x0 x1
(2) (2) (2)
x0 x1 x2
.. .. .. .. (5.8.1)
. . . .
(m) (m) (m) (m)
x0 x1 x2 ... xm
.. .. .. ..
. . . .
We assume further that all nodes are contained in some finite interval [a, b]. Then, for each m we
define
(m) (m)
Pm (x) = Lm (f ; x0 , x1 , . . . , x(m)
m ; x), x ∈ [a, b]. (5.8.2)
We say that Lagrange interpolation based on the triangular array of nodes (5.8.1) converges if
pm (x) ⇒ f (x), when n → ∞ pe [a, b]. (5.8.3)
Example 5.8.1 (Runge’s example).
1
f (x) = , x ∈ [−5, 5],
1 + x2
(m) k
xk = −5 + 10 , k = 0, m. (5.8.4)
m
5.8. Convergence of polynomial interpolation 213

MATLAB Source 5.19 Generates a divided difference table with double nodes
function [z,td]=divdiffdn(x,f,fd)
%DIVDIFFDN - compute divide difference table for double nodes
%call [z,td]=divdiffdn(x,f,fd)
%x -nodes
%f - function values at nodes
%fd - derivative values in nodes
%z - doubled nodes
%td - divided difference table

z=zeros(1,2*length(x));
lz=length(z);
z(1:2:lz-1)=x;
z(2:2:lz)=x;
td=zeros(lz,lz);
td(1:2:lz-1,1)=f’;
td(2:2:lz,1)=f’;
td(1:2:lz-1,2)=fd’;
td(2:2:lz-2,2)=(diff(f)./diff(x))’;
for j=3:lz
td(1:lz-j+1,j)=diff(td(1:lz-j+2,j-1))./...
(z(j:lz)-z(1:lz-j+1))’;
end

Here the nodes are equally spaced in [−5, 5]. Note that f has two poles at z = ±i. It has been shown,
indeed, that 
0 if |x| < 3.633 . . .
lim |f (x) − pm (f ; x)| = (5.8.5)
m→∞ ∞ if |x| > 3.633 . . .
The graph for m = 10, 13, 16 is given in Figure 5.12. It was generated with MATLAB Source 5.20, by
command
>>runge3([10,13,17],[-5,5,-2,2])
using MATLAB graph’s annotation facilities. ♦

Example 5.8.2 (Bernstein’s example). Let us consider the function


f (x) = |x|, x ∈ [−1, 1],
and the nodes
(m) 2k
xk = −1 + , k = 0, 1, 2, . . . , m. (5.8.6)
m
Here analyticity of f is completely gone, f being not differentiable at x = 0. One finds that
lim |f (x) − Lm (f ; x)| = ∞ ∀x ∈ [−1, 1]
m→∞

excepting the points x = −1, x = 0 and x = 1. See figure 5.13(a), for m = 20. The convergence in
x = ±1 is trivial, since they are interpolation nodes, where the error is zero. The same is true for x = 0
when m is even, but not if m is odd. The failure of the convergence in the last two examples can only
in part be blamed on insufficient regularity of f . Another culprit is the equidistribution of nodes. There
are better distributions such as Chebyshev nodes. Figure 5.13(b) gives the graph for m = 17. ♦
214 Function Approximation

MATLAB Source 5.20 Runge’s Counterexample


function runge3(n,w)
%n- a vector of degrees, w- window
clf
xg=-5:0.1:5; yg=1./(1+xg.ˆ2);
plot(xg,yg,’k-’,’Linewidth’,2);
hold on
nl=length(n);
ta=5*[-1:0.001:-0.36,-0.35:0.01:0.35, 0.36:0.001:1]’;
ya=zeros(length(ta),nl);
leg=cell(1,nl+1); leg{1}=’f’;
for l=1:nl
xn=5*[-1:2/n(l):1]; yn=1./(1+xn.ˆ2);
ya(:,l)=lagr(xn,yn,ta);
leg{l+1}=strcat(’L_{’,int2str(n(l)),’}’);
end
plot(ta,ya); axis(w)
legend(leg,-1)

m=10
1.5

0.5

−0.5

m=13
−1

−1.5
m=16

−2
−5 −4 −3 −2 −1 0 1 2 3 4 5

Figure 5.12: A graphical illustration of Runge’s counterexample


5.9. Spline Interpolation 215

2 1

1.8 0.9

1.6 0.8

1.4
0.7

1.2
0.6

1
0.5

0.8
0.4

0.6
0.3

0.4
0.2

0.2
0.1
0
0
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

(a) Equispaced nodes, m = 20 (b) Chebyshev nodes, m = 17

Figure 5.13: Behavior of Lagrange interpolation for f : [−1, 1] → R, f (x) = |x|.

The problem of convergence was solved for the general case by Faber and Bernstein during 1914
and 1916. Faber has proved that for each triangular array of nodes of type 5.8.1 in [a, b] there exists a
function f ∈ C[a, b] such that the sequence of Lagrange interpolation polynomials Lm f for the nodes
(m)
xi (row wise) does not converge uniformly to f on [a, b].
Bernstein 12 has proved that for any array of nodes as before there exists a function f ∈ C[a, b]
such that the corresponding sequence (Lm f ) is divergent.
Remedies:
• Local approach – the interval [a, b] is taken very small – the approach used to numerical solution
of differential equations;
• Spline interpolation – the interpolant is piecewise polynomial.

5.9 Spline Interpolation


Let ∆ be a subdivision upon the interval [a, b]
∆ : a = x1 < x2 < · · · < xn−1 < xn = b (5.9.1)
We shall use low-degree polynomials on each subinterval [xi , xi+1 ], i = 1, n − 1. The rationale
behind this is the recognition that on a sufficiently small interval, functions can be approximated arbi-
trarily by polynomials of low degree, even degree 1, or 0 for that matter.

Sergi Natanovitch Bernstein (1880-1968) made major contribution to


polynomial approximation, continuing the tradition of Chebyshev. In
1911 he introduced what are now called the Bernstein polynomials
12 to give a constructive proof of Weierstrass’s theorem (1885), namely
that a continuous function on a finite subinterval of the real line can be
uniformly approximated as closely as we wish by a polynomial. He
is also known for his works on differential equations and probability
theory.
216 Function Approximation

………

a=x
1
x
2
x
3
x
4 ……… x
n−1
x =b
n

Figure 5.14: Piecewise linear interpolation

We have already introduced the space

Skm (∆) = {s : s ∈ C k [a, b], s|[xi ,xi+1 ] ∈ Pm , i = 1, 2, . . . , n − 1} (5.9.2)

m ≥ 0, k ∈ N ∪ {−1}, of spline functions of degree m and smoothness class k relative to the


subdivision ∆. If k = m, then functions s ∈ Sm
m (∆) are polynomials.
For m = 1 and k = 0 one obtains linear splines.
We wish to find s ∈ S01 (∆) such that

s(xi ) = fi , where fi = f (xi ), i = 1, 2, . . . , n.

The solution is trivial, see Figure 5.14. On the interval [xi , xi+1 ]

s(f ; x) = fi + (x − xi )f [xi , xi+1 ], (5.9.3)


and
(∆xi )2
|f (x) − s(f (x))| ≤ max |f ′′ (x)|. (5.9.4)
8 x∈[xi ,xi+1 ]

It follows that
1
kf (·) − s(f, ·)k∞ ≤ |∆|2 kf ′′ k∞ . (5.9.5)
8
The dimension of S01 (∆) can be computed in the following way: since we have n − 1 subintervals,
each linear piece has 2 coefficients (2 degrees of freedom) and each continuity condition reduces the
degree of freedom by 1, we have finally

dim S01 (∆) = 2n − 2 − (n − 2) = n.

A basis of this space is given by the so-called B-spline functions:


5.9. Spline Interpolation 217

We let x0 = x1 , xn+1 = xn , for i = 1, n



 x − xi−1

 , for xi−1 ≤ x ≤ xi
 xi − xi−1
Bi (x) = xi+1 − x (5.9.6)
 , for xi ≤ x ≤ xi+1
 xi+1 − xi


0, otherwise

Note that the first equation, when i = 1, and the second when i = n are to be ignored. The
functions Bi may be referred to as “hat functions” (Chinese hats), but note that the first and the last hat
is cut in half. The functions Bi are depicted in Figure 5.15.

B1 B2 B3 Bn−1 Bn

a=x =x x x x x x =x =b
0 1 2 3 4 n−1 n n+1

Figure 5.15: First degree B-spline functions

They have the property


Bi (xj ) = δij ,
are linear independent, since
n
X
s(x) = ci Bi (x) = 0 ∧ x 6= xj ⇒ cj = 0.
i=1

and
hBi ii=1,n = S10 (∆),
Bi plays the same role as elementary Lagrange polynomials ℓi .

5.9.1 Interpolation by cubic splines


Cubic spline are the most widely used. We first discuss the interpolation problem for s ∈ S13 (∆).
Continuity of the first derivative of any cubic spline interpolant s3 (f ; ·) can be enforced by prescribing
218 Function Approximation

the values of the first derivative at each point xi , i = 1, 2, . . . , n. Thus, let m1 , m2 , . . . , mn be arbitrary
given numbers, and denote

s3 (f ; ·)|[xi ,xi+1 ] = pi (x), i = 1, 2, . . . , n − 1 (5.9.7)

Then we enforce s3 (f ; xi ) = mi , i = 1, n, by selecting each piece pi of s3 (f, ·) to be the (unique)


solution of a Hermite interpolation problem, namely,

pi (xi ) = fi , pi (xi+1 ) = fi+1 , i = 1, n − 1, (5.9.8)


p′i (xi ) = mi , p′i (xi+1 ) = mi+1

We solve the problem by Newton’s interpolation formula. The required divided differences are
f [xi ,xi+1 ]−mi mi+1 +mi −2f [xi ,xi+1 ]
xi fi mi ∆xi (∆xi )2
mi+1 −f [xi ,xi+1 ]
xi fi f [xi , xi+1 ] ∆xi
xi+1 fi+1 mi+1
xi+1 fi+1

and the Hermite interpolation polynomial (in Newton form) is

f [xi , xi+1 ] − mi
pi (x) = fi + (x − xi )mi + (x − xi )2 +
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
+ (x − xi )2 (x − xi+1 ) .
(∆xi )2

Alternatively, in Taylor’s form, we can write for xi ≤ x ≤ xi+1

pi (x) = ci,0 + ci,1 (x − xi ) + ci,2 (x − xi )2 + ci,3 (x − xi )3 (5.9.9)

and since x − xi+1 = x − xi − ∆xi , by identification we have

ci,0 = fi
ci,1 = mi
f [xi , xi+1 ] − mi
ci,2 = − ci,3 ∆xi (5.9.10)
∆xi
mi+1 + mi − 2f [xi , xi+1 ]
ci,3 =
(∆xi )2

Thus to compute s3 (f ; x) for any given x ∈ [a, b] that is not an interpolation node, one first locates
the interval [xi , xi+1 ] ∋ x and then computes the corresponding piece (5.9.7) by (5.9.9) and (5.9.10).
We now discuss some possible choices of the parameters m1 , m2 , . . . , mn .

Piecewise cubic Hermite interpolation


Here one selects mi = f ′ (xi ) (assuming that these derivative values are known). This gives rise to a
strictly local scheme, in that each piece pi can be determined independently from the others. Further-
more, the interpolation error is, for
 4
1 |f (4) (x)|
|f (x) − pi (x)| ≤ ∆xi max , xi ≤ x ≤ xi+1 . (5.9.11)
2 x∈[xi ,xi+1 ] 4!
5.9. Spline Interpolation 219

Hence,
1
kf (·) − s3 (f ; ·)k∞ ≤ |∆|4 kf (4) k∞ . (5.9.12)
384
For equally spaced points xi , one has |∆| = (b − a)/(n − 1) and, therefore
kf (·) − s3 (f ; ·)k∞ = O(n−4 ), n → ∞. (5.9.13)

Cubic spline interpolation


Here we require s3 (f ; ·) ∈ S23 (∆), that is, continuity of the second derivatives. In terms of pieces
(5.9.7) of s3 (f ; ·), this means that
p′′i−1 (xi ) = p′′i (xi ), i = 2, n − 1, (5.9.14)
and translates into a condition for the Taylor coefficients in (5.9.9), namely
2ci−1,2 + 6ci−1,3 ∆xi−1 = 2ci,2 , i = 2, n − 1.
Plugging in explicit values (5.9.10) for these coefficients, we arrived at the linear system
∆xi mi−1 + 2(∆xi−1 + ∆xi )mi + (∆xi−1 )mi+1 = bi , i = 2, n − 1 (5.9.15)
where
bi = 3{∆xi f [xi−1 , xi ] + ∆xi−1 f [xi , xi+1 ]} (5.9.16)
These are n − 2 linear equations in the n unknowns m1 , m2 , . . . , mn . Once m1 and mn have been
chosen in some way, the system becomes again tridiagonal in the remaining unknowns and hence is
readily solved by Gaussian elimination, by factorization or by an iterative method.
Here are some possible choices of m1 and mn .
Complete (clamped) spline. We take m1 = f ′ (a), mn = f ′ (b). It is known that for this spline, if
f ∈ C 4 [a, b],
kf (r) (·) − s(r) (f ; ·)k∞ ≤ cr |∆|4−r kf (n) k∞ , r = 0, 1, 2, 3 (5.9.17)
where c0 = 384 5 1
, c1 = 24 , c2 = 83 , and c3 is a constant depending on the ratio min|∆| i ∆xi
.
Matching of the second derivatives at the endpoints. For this type of spline, we enforce the
conditions s′′3 (f ; a) = f ′′ (a); s′′3 (f ; b) = f ′′ (b). Each of these conditions gives rise to an additional
equation, namely,
2m1 + m2 = 3f [x1 , x2 ] − 21 f ′′ (a)∆x1
(5.9.18)
mn−1 + 2mn = 3f [xn−1 , xn ] + 21 f ′′ (b)∆xn−1
One conveniently adjoins the first equation to the top of the system (5.9.15), and the second to the
bottom, thereby preserving the tridiagonal structure of the system.
Natural cubic spline. Enforcing s′′ (f ; a) = s′′ (f ; b) = 0, one obtains two additional equations,
which can be obtained from (5.9.18) by putting there f ′′ (a) = f ′′ (b) = 0. The nice thing about
this spline is that it requires only function values of f – no derivatives – but the price one pays is a
degradation of the accuracy to O(|∆|2 ) near the endpoints (unless indeed f ′′ (a) = f ′′ (b) = 0).
”Not-a-knot spline”. (C. deBoor). Here we impose the conditions p1 (x) ≡ p2 (x) and pn−2 (x) ≡
pn−1 (x); that is, the first two pieces and the last two should be the same polynomial. In effect, this
means that the first interior knot x2 , and the last one xn−1 both are inactive. This again gives rise
to two supplementary equations expressing the continuity of s′′′ 3 (f ; x) in x = x2 and x = xn−1 .
The continuity condition of s3 (f, .) at x2 and xn−1 implies the equality of the leading coefficients
c1,3 = c2,3 and cn−2,3 = cn−1,3 . This gives rise to the equations
(∆x2 )2 m1 + [(∆x2 )2 − (∆x1 )2 ]m2 − (∆x1 )2 m3 = β1
2 2 2 2
(∆x2 ) mn−2 + [(∆x2 ) − (∆x1 ) ]mn−1 − (∆x1 ) mn = β2 ,
220 Function Approximation

where

β1 = 2{(∆x2 )2 f [x1 , x2 ] − (∆x1 )2 f [x2 , x3 ]}


β2 = 2{(∆xn−1 )2 f [xn−2 , xn−1 ] − (∆xn−2 )2 f [xn−1 , xn ]}.

The first equation adjoins to the top of the system n − 2 equations in n unknowns given by (5.9.15)
and (5.9.16) and the second to the bottom. The system is no more tridiagonal, but it can be turn
into a tridiagonal one, by combining equations 1 and 2, and n − 1 and n, respectively. After this
transformations, the first and the last equations become

∆x2 m1 + (∆x2 + ∆x1 )m2 = γ1 (5.9.19)


(∆xn−1 + ∆xn−2 )mn−1 + ∆xn−2 mn = γ2 , (5.9.20)

where
1 
γ1 = f [x1 , x2 ]∆x2 [∆x1 + 2(∆x1 + ∆x2 )] + (∆x1 )2 f [x2 , x3 ]
∆x2 + ∆x1
1  2
γ2 = ∆xn−1 f [xn−2 , xn−1 ] +
∆xn−1 + ∆xn−2

[2(∆xn−1 + ∆xn−2 ) + ∆xn−1 ]∆xn−2 f [xn−1 , xn ] .

The function CubicSpline computes the coefficients of all four spline types presented. It builds
the tridiagonal system with sparse matrices and solve it using the \operator. The differences between
types occur in first and last equations, implemented via a switch instruction.

function [a,b,c,d]=CubicSpline(x,f,type,der)
%CUBICSPLINE - find coefficients of a cubic spline
%call [a,b,c,d]=Splinecubic(x,f,type,der)
%x - abscissas
%f - ordinates
%type - 0 complete (clamped)
% 1 match second derivatives
% 2 natural
% 3 not a knot (deBoor)
%der - values of derivatives
% [f’(a),f’(b)] for type 0
% [f’’(a), f’’(b)] for type 1

if (nargin<4) | (type==2), der=[0,0]; end

n=length(x);
%sort nodes
if any(diff(x)<0), [x,ind]=sort(x); else, ind=1:n; end
y=f(ind); x=x(:); y=y(:);
%find equations 2 ... n-1
dx=diff(x); ddiv=diff(y)./dx;
ds=dx(1:end-1); dd=dx(2:end);
dp=2*(ds+dd);
md=3*(dd.*ddiv(1:end-1)+ds.*ddiv(2:end));
5.9. Spline Interpolation 221

%treat separately over type - equations 1,n


switch type
case 0 %complete
dp1=1; dpn=1; vd1=0; vdn=0;
md1=der(1); mdn=der(2);
case 1,2 %d2 si natural
dp1=2; dpn=2; vd1=1; vdn=1;
md1=3*ddiv(1)-0.5*dx(1)*der(1);
mdn=3*ddiv(end)+0.5*dx(end)*der(2);
case 3 %deBoor
x31=x(3)-x(1);xn=x(n)-x(n-2);
dp1=dx(2); dpn=dx(end-1);
vd1=x31; vdn=xn;
md1=((dx(1)+2*x31)*dx(2)*ddiv(1)+dx(1)ˆ2*ddiv(2))/x31;
mdn=(dx(end)ˆ2*ddiv(end-1)+(2*xn+dx(end))*dx(end-1)*...
ddiv(end))/xn;
end
%build sparse system
dp=[dp1;dp;dpn]; dp1=[0;vd1;dd];
dm1=[ds;vdn;0]; md=[md1;md;mdn];
A=spdiags([dm1,dp,dp1],-1:1,n,n);
m=A \ md;
d=y(1:end-1);
c=m(1:end-1);
a=[(m(2:end)+m(1:end-1)-2*ddiv)./(dx.ˆ2)];
b=[(ddiv-m(1:end-1))./dx-dx.*a];

The values of spline functions are computed using a single function for all types:

function z=evalspline(x,a,b,c,d,t)
%EVALSPLINE - compute cubic spline value
%call z=evalspline(x,a,b,c,d,t)
%z - values
%t - evaluation points
%x - nodes (knots)
%a,b,c,d - coefficients
n=length(x);
x=x(:); t=t(:);
k = ones(size(t));
for j = 2:n-1
k(x(j) <= t) = j;
end
% interpolant evaluation
s = t - x(k);
z = d(k) + s.*(c(k) + s.*(b(k) + s.*a(k)));
222 Function Approximation

5.9.2 Minimality properties of cubic spline interpolants


The complete and natural splines have interesting optimality properties. To formulate them, it is conve-
nient to consider not only the subdivision ∆ in (5.9.1), but also the subdivision
∆′ : a = x0 = x1 < x2 < x3 < · · · < xn−1 < xn = xn+1 = b, (5.9.21)
in which the endpoints are double knots. This means that whenever we interpolate on ∆′ , we interpolate
to function values at all interior points, but to the functions as well as first derivative values at the
endpoints. The first of the two theorems relates to the complete cubic spline interpolant, scompl (f ; ·).
Theorem 5.9.1. For any function g ∈ C 2 [a, b] that interpolates f on ∆′ , there holds
Z b Z b
[g ′′ (x)]2 dx ≥ [s′′compl (f ; x)]2 dx, (5.9.22)
a a

with equality if and only if g(·) = scompl (f ; ·).

Remark 5.9.2. scompl (f ; ·) in Theorem 5.9.1 also interpolates f on ∆′ and among all such interpolants
its second derivative has the smallest L2 norm. ♦

Proof. We write (for short) scompl = s. The theorem follows once we have shown that
Z b Z b Z b
[g ′′ (x)]2 dx = [g ′′ (x) − s′′ (x)]2 dx + [s′′ (x)]2 dx. (5.9.23)
a a a

Indeed, this immediately implies (5.9.22), and equality in (5.9.22) holds if and only if g ′′ (x) − s′′ (x) ≡
0, which, integrating twice from a to x and using the interpolation properties of s and g at x = a gives
g(x) ≡ s(x).
To complete the proof, note that the relation (5.9.23) is equivalent to
Z b
s′′ (x)[g ′′ (x) − s′′ (x)]dx = 0. (5.9.24)
a

Integrating by parts and taking into account that s′ (b) = g ′ (b) = f ′ (b) and s′ (a) = g ′ (a) = f ′ (a), we
get
Z b
s′′ (x)[g ′′ (x) − s′′ (x)]dx =
a
b Z b

= s (x)[g (x) − s′ (x)] −
′′ ′
s′′′ (x)[g ′ (x) − s′ (x)]dx (5.9.25)
a a
Z b
=− s′′′ (x)[g ′ (x) − s′ (x)]dx.
a

But s′′′ is piecewise constant, so


Z b n−1
X Z xν+1
s′′′ (x)[g ′ (x) − s′ (x)]dx = s′′′ (xν + 0) [g ′ (x) − s′ (x)]dx =
a ν−1 xν

n−1
X
= s′′′ (xν+0 )[g(xν+1 ) − s(xν+1 ) − (g(xν ) − s(xν ))] = 0
ν=1

since both s and g interpolate f on ∆. This proves (5.9.24) and hence the theorem. 
5.10. Interpolation in MATLAB 223

For interpolation on ∆, the distinction of being optimal goes to the natural cubic spline interpolant
snat (f ; ·).
Theorem 5.9.3. For any function g ∈ C 2 [a, b] that interpolates f on ∆ (not ∆′ ), there holds
Z b Z b
[g ′′ (x)]2 dx ≥ [s′′nat (f ; x)]2 dx (5.9.26)
a a

with equality if and only if g(·) = snat (f ; ·).

The proof of Theorem 5.9.3 is virtually the same as that of Theorem 5.9.1, since (5.9.23) holds
again, this time because s′′ (b) = s′′ (a) = 0.
Putting g(·) = scompl (f ; ·) in Theorem 5.9.3 immediately gives
Z b Z b
[s′′compl (f ; x)]2 dx ≥ [s′′nat (f ; x)]2 dx. (5.9.27)
a a

Therefore, in a sense, the natural cubic spline is the “smoothest” interpolant.


The property expressed in Theorem 5.9.3 is the origin of the name “spline”. A spline is a flexible
strip of wood used in drawing curves. If its shape is given by the equation y = g(x), x ∈ [a, b] and if
the spline is constrained to pass through the points (xi , gi ), then it assumes a form that minimizes the
bending energy
Z b
[g ′′ (x)]2 dx
,
a (1 + [g ′ (x)]2 )3
over all functions g similarly constrained. For slowly varying g (kg ′ k∞ ≪ 1) this is nearly the same
as the minimum property of Theorem 5.9.3.

5.10 Interpolation in MATLAB


MATLAB has functions for interpolation in one, two and more dimensions. The polyfit function
returns the coefficients of Lagrange interpolation polynomial if the degree n equates the number of
observations minus 1.
Function interp1 accepts x(i), y(i) data pairs and a vector xi as input parameters. It fits an
interpolant to the data and then returns the values of the interpolant at the points in xi:
yi = interp1(x,y,xi,method)
The elements of x must be in increasing order. Four types of interpolant are supported, as specified
by a fourth input parameter, method which is one of
• ’nearest’ – nearest neighbor interpolation
• ’linear’ – linear interpolation
• ’spline’ – piecewise cubic spline interpolation (SPLINE)
• ’pchip’ – shape-preserving piecewise cubic interpolation
• ’cubic’ – same as ’pchip’
• ’v5cubic’ – the cubic interpolation from MATLAB 5, which does not extrapolate and uses
’spline’ if X is not equally spaced.
The example bellow illustrates interp1 (file exinterp1.m). It chooses six points on graph of
f (x) = x + sin πx2 and computes the interpolant using nearest, linear and spline methods.
The cubic interpolant was omitted, since it it close to spline interpolant and it would complicate
the picture. The graph is given in Figure 5.16.
224 Function Approximation

x=[-1,-3/4, -1/3, 0, 1/2, 1]; y=x+sin(pi*x.ˆ2);


xi=linspace(-1,1,60); yi=xi+sin(pi*xi.ˆ2);
yn=interp1(x,y,xi,’nearest’);
yl=interp1(x,y,xi,’linear’);
ys=interp1(x,y,xi,’spline’);
%yc=interp1(x,y,xi,’pchip’);
plot(xi,yi,’:’,x,y,’o’,’MarkerSize’,12); hold on
plot(xi,yl,’--’,xi,ys,’-’)
stairs(xi,yn,’-.’)
set(gca,’XTick’,x);
set(gca,’XTickLabel’,’-1|-3/4|-1/3|0|1/2|1’)
set(gca,’XGrid’,’on’)
axis([-1.1, 1.1, -1.1, 2.1])
legend(’f’,’data’,’linear’, ’spline’, ’nearest’,4)
hold off

1.5

0.5

−0.5
f
data
linear
spline
−1 nearest

−1 −3/4 −1/3 0 1/2 1

Figure 5.16: Interpolating a curve using interp1

The smoothest interpolation is spline, but piecewise Hermite interpolation preserves the shape.
The next example illustrates the differences between spline and piecewise Hermite interpolation (file
expsli cub.m).

x =[-0.99, -0.76, -0.48, -0.18, 0.07, 0.2, ...


0.46, 0.7, 0.84, 1.09, 1.45];
y = [0.39, 1.1, 0.61, -0.02, -0.33, 0.65, ...
1.13, 1.46, 1.07, 1.2, 0.3];
plot(x,y,’o’); hold on
xi=linspace(min(x),max(x),100);
ys=interp1(x,y,xi,’spline’);
5.10. Interpolation in MATLAB 225

yc=interp1(x,y,xi,’cubic’);
h=plot(xi,ys,’-’,xi,yc,’-.’);
legend(h,’spline’,’cubic’,4)
axis([-1.1,1.6,-0.8,1.6])
Figure 5.17 gives the corresponding graph.

1.5

0.5

−0.5

spline
cubic

−1 −0.5 0 0.5 1 1.5

Figure 5.17: Cubic spline and piecewise Hermite interpolation

We can perform cubic spline and piecewise Hermite interpolation directly, by calling spline and
pchip. function, respectively.
Given the vectors x and y, the command yy = spline(x,y,xx) returns in yy vector the
values of cubic spline interpolant at points in xx. If y is a matrix, we have vector values interpolation
and interpolation is done after column of y; the size of yy is length(xx) over size(y,2). The
spline function computes a deBoor-type spline interpolant. If y is a vector that contains two more
values than x has entries, the first and last value in y are used as the endslopes for the cubic spline. (For
terminology on spline type see Section 5.9.1).
The next example takes six points on the graph of y = sin(x), determines and plots the graph of
deBoor spline and complete spline (see Figure 5.18).
x = 0:2:10;
y = sin(x); yc=[cos(0),y,cos(10)];
xx = 0:.01:10;
yy = spline(x,y,xx);
yc = spline(x,yc,xx);
plot(x,y,’o’,xx,sin(xx),’-’,xx,yy,’--’,xx,yc,’-.’)
axis([-0.5,10.5,-1.3,1.3])
legend(’nodes’,’sine’,’deBoor’,’complete’,4)
There are situations which require the handling of spline coefficients (for example, if nodes are
preserved and xx changes). The command pp=spline(x,y) stores the coefficients in a pp (piece-
226 Function Approximation

0.5

−0.5

nodes
sine
−1 deBoor
complete

0 2 4 6 8 10

Figure 5.18: deBoor and complete spline

wise polynomial) structure, that contains the shape, coefficient matrix (a four column matrix for cubic
splines), number of subintervals, order (degree plus one) and the size. The ppval function evalu-
ates the spline using such a structure. Other low-level operations are possible via mkpp (make a pp-
structure) and unmkpp (details about the components of a pp-structure). For example, the command
yy = spline(x,y,xx) can be replaced by

pp = spline(x,y);
yy = ppval(pp,xx);

We shall give now an example of vector spline interpolation and use of ppval. We wish to in-
troduce interactively in graphical mode a set of data points from the screen and to plot the parametric
spline that passes through these points, with two different resolutions (say 20 and 150 intermediary
points on the curve). Here is the source (M-file splinevect.m):

axis([0,1,0,1]);
hold on
[x,y]=ginput;
data=[x’;y’];
t=linspace(0,1,length(x));
tt1=linspace(0,1,20);
tt2=linspace(0,1,150);
pp=spline(t,data);
yy1=ppval(pp,tt1);
yy2=ppval(pp,tt2);
plot(x,y,’o’,yy1(1,:),yy1(2,:),yy2(1,:),yy2(2,:));
hold off

Interactive input is performed by ginput. The spline coefficients are computed only once. For
the evaluation we use ppval. We recommend the user to try this example.
5.11. Applications 227

We have two forms of call syntax for pchip function:

yi = pchip(x,y,xx)
pp = pchip(x,y)

The first form returns the values of interpolant at xx in yi, while the second returns a pp-structure.
The meaning of parameters is the same as for spline function. The next example computes the
deBoor and piecewise Hermite cubic interpolant for the same data set (script expchip.m):

x = -3:3;
y = [-1 -1 -1 0 1 1 1];
t = -3:.01:3;
p = pchip(x,y,t);
s = spline(x,y,t);
plot(x,y,’o’,t,p,’-’,t,s,’-.’)
legend({’data’,’pchip’,’spline’},4)

The graph appears in Figure 5.19. The spline interpolant is smoother, but the piecewise Hermite inter-
polant preserves the shape.

1.5

0.5

−0.5

−1

data
pchip
spline
−1.5
−3 −2 −1 0 1 2 3

Figure 5.19: pchip and spline example

5.11 Applications
5.11.1 Spline and sewing machines
The basic idea of this application [83] is to use a parametric representation (x(s), y(s)) for the curve and
approximate independently the coordinate functions x(s) and y(s). The parameter s can be anything,
but a proper choice is the arc length. Having chosen the nodes si , i = 1, . . . , n, we can interpolate
228 Function Approximation

the data xi = x(si ) by a spline Sx (s) and likewise the data yi = y(si ) by a spline Sy (s). The curve
(x(s), y(s)) is approximated by (Sx (s), Sy (s)). This yields a curve in the plane that passes through all
the points (x(si ), y(si )) in order.
A problem is the computation of parameter s. We take s1 = 0 and
p
si+1 = si + (xi+1 − xi )2 + (yi+1 − yi )2 . (5.11.1)

An alternative is
si+1 = si + |xi+1 − xi | + |yi+1 − yi |. (5.11.2)
An interesting example of the technique is furnished by a need to approximate curves for automatic
control of sewing machines. Arc length is a natural parameter because a constant increment in arc
length corresponds to a constant stich length.
An example taken from [83] fits the data ( 2.5, -2.5), ( 3.5, -0.5), ( 5.0, 2.0), ( 7.5, 4.0), ( 9.5, 4.5),
(11.8, 3.5), (13.0, 0.5), (11.5, -2.0), ( 9.0, -3.0), ( 6.0, -3.3), ( 2.5, -2.5), ( 0.0, 0.0), (-1.5, 2.0), (-3.0,
5.0), (-3.5, 9.0), (-2.0, 11.0), ( 0.0, 11.5), ( 2.0, 11.0), ( 3.5, 9.0), ( 3.0, 5.0), ( 1.5, 2.0), ( 0.0, 0.0), (-2.5,
-2.5), (-6.0, -3.3), (-9.0, -3.0), (-11.5, -2.0), (-13.0, 0.5), (-11.8, 3.5), (-9.5, 4.5), (-7.5, 4.0), (-5.0, 2.0),
(-3.5, -0.5), (-2.5, -2.5). The resulting spline curve is shown in Figure 5.20

Figure 5.20: Curve fit for a sewing machine pattern

The MATLAB code to generate Figure 5.20 is

load sm
s=zeros(length(sm),1);
s(2:end)=sqrt(diff(sm(:,1)).ˆ2+diff(sm(:,2)).ˆ2);
s=cumsum(s);
t=linspace(s(1),s(end),300);
xg=spline(s,sm(:,1),t);
yg=spline(s,sm(:,2),t);
plot(sm(:,1),sm(:,2),’o’,xg,yg)
5.11. Applications 229

axis off

The file sm.mat contains the points to be fitted. If we want an arc increment given by (5.11.2), line 3
can be replaced by

s(2:end)=abs(diff(sm(:,1)))+abs(diff(sm(:,2)));

The result with this modification is very close to the result obtained using (5.11.1).

5.11.2 A membrane deflection problem


Least squares approximation has numerous applications when we are dealing with a large number of
equations involving a smaller number of parameters used to closely fit some constraints. Let us illustrate
how least squares approximation can be used to compute the transverse deflection of a membrane sub-
jected to uniform pressure [105]. The transverse deflection u for a membrane which has zero deflection
on a boundary L satisfies the differential equation

∂2u ∂2u
2
+ = −γ,
∂x ∂x2
where (x, y) is inside L and γ is a physical constant. The differential equation is satisfied by a series of
the form " #
−|z|2 X
n  
j−1
u=γ + cj real z ,
4 j=1

where z = x + iy and constants cj are chosen to minimize the boundary deflection in the least squares
sense. As example, we analyze a membrane consisting of a rectangular part on the left joined with a
semicircular part on the right. The surface plot in Figure 5.21(a) and the contour plot in Figure 5.21(b)
were produced by the function membran listed below. The function generates boundary data, solves
the series coefficients, and constructs plots depicting the deflection pattern. The results obtained using
twenty-term series satisfy the boundary condition quite well.

Membrane Deflection Membrane Surface Contour Lines


1
0.3

0.25 0.8

0.2 0.6

0.15
deflection

0.4
0.1
0.2
0.05
y axis

0
0
−0.2
−0.05
1 −0.4
0.5
−0.6
0
−0.8
−0.5
0.5 1 −1
y axis −1 −0.5 0 −1 −0.5 0 0.5 1
−1
x axis x axis

(a) Surface plot of membrane (b) Contour plot of membrane

Figure 5.21: Membrane plot

function [dfl,cof]=membran(h,np,ns,nx,ny)
%MEMBRAN Computes transverse deflection of a
% uniformly tensioned membrane
230 Function Approximation

% [DFL,COF]=membran(H,NP,NS,NX,NY)
% Example use: membran(0.75,100,50,40,40);
% h - width of the rectangular part
% np - number of least square points (about 3.5*np)
% ns - number of terms
% nx,ny - the number of x points and y points
% dfl - computed array of deflection values
% cof - coefficients in the series

if nargin==0
h=.75; np=100; ns=50; nx=40; ny=40;
end

% Generate boundary points for least square


% approximation
z=[exp(1i*linspace(0,pi/2,round(1.5*np))),...
linspace(1i,-h+1i,np),...
linspace(-h+1i,-h,round(np/2))];
z=z(:); xb=real(z); xb=[xb;xb(end:-1:1)];
yb=imag(z); yb=[yb;-yb(end:-1:1)];

% Form the least square equations and solve


% for series coefficients
a=ones(length(z),ns);
for j=2:ns, a(:,j)=a(:,j-1).*z; end
cof=real(a)\(z.*conj(z))/4;

% Generate a rectangular grid for evaluation


% of deflections
xv=linspace(-h,1,nx); yv=linspace(-1,1,ny);
[X,Y]=meshgrid(xv,yv); Z=X+1i*Y;

% Evaluate the deflection series on the grid


dfl=-Z.*conj(Z)/4+ ...
real(polyval(cof(ns:-1:1),Z));

% Set values outside the physical region of


% interest to zero
dfl=real(dfl).*(1-((abs(Z)>=1)&(real(Z)>=0)));

% Make surface and contour plots


hold off; close; surf(X,Y,dfl);
xlabel(’x axis’); ylabel(’y axis’);
zlabel(’deflection’); view(-10,30);
title(’Membrane Deflection’); colormap([1 1 1]);
shg
figure(2)
contour(X,Y,dfl,15,’k’); hold on
plot(xb,yb,’k-’); axis(’equal’), hold off
5.11. Applications 231

xlabel(’x axis’); ylabel(’y axis’);


title(’Membrane Surface Contour Lines’), shg

Problems
Problem 5.3. (a) Given a function f ∈ C[a, b], determine ŝ1 (f ; ·) ∈ S01 (∆) such that
Z b
[f (x) − ŝ1 (f ; x)]2 dx
a

be minimal.
(b) Write a MATLAB function that builds and solves the system of normal equation in (a).
(c) Test the above MATLAB function for a function and a division at your choice.
π

Problem 5.4. Compute discrete least squares approximations for the function f (t) = sin 2
t on
0 ≤ t ≤ 1 of the form
n
X
ϕn (t) = t + t(1 − t) cj tj−1 , n = 1(1)5,
j=1

using N abscissas tk = k/(N + 1), k = 1, N . Note that ϕn (0) = 0 and ϕn (1) = 1 are the exact
values of f at t = 0, and t = 1, respectively.
(Hint: Approximate f (t) − t by a linear combination of polynomials πj (t) = t(1 − t)tj−1 , j =
1, n.) The system of normal equations has the form Ac = b, A = [(πi , πj )], b = [(πi , f − t)], c = [cj ].
Output (for n = 1, 2, . . . , 5) :
• The condition number of the system.
• The values of the coefficients.
• Maximum and minimum of the error:

emin = min |ϕn (tk ) − f (tk )|, emax = max |ϕn (tk ) − f (tk )|.
1≤k≤N 1≤k≤N

Run twice for


(a) N = 5, 10, 20,
(b) N = 4.
Comment the results. Plot the function and the approximations on the same graph.

Problem 5.5. Plot on the same graph for [a, b] = [0, 1], n = 11, the function, Lagrange interpolation
polynomial, and Hermite interpolation polynomial with double nodes when:
(a) xi = i−1
n−1
i = 1, n, f (x) = e−x and f (x) = x5/2 ;
,
 2
i−1
(b) xi = n−1 , i = 1, n, f (x) = x5/2 .

Problem 5.6. The same problem as above, but for the four types of cubic interpolation splines.

Problem 5.7. Choosing several values of n, for each chosen n plot the Lebesgue function for n equally
spaced nodes and n Chebyshev nodes in interval [0, 1].
232 Function Approximation

Problem 5.8. Consider the points Pi ∈ R2 , i = 0, n. Write:


(a) A MATLAB function to find a n-th degree parametric polynomial that passes through the given
points.
(b) A MATLAB function to find a cubic parametric spline curve that passes through the given points,
using the function for natural cubic spline or that for deBoor spline, given in this chapter.
Test these two function, by introducing the points interactively with ginput and then plotting the
points and the curves determined in this way.

Problem 5.9. Find a parametric cubic curve, that passes through two given points and has given tangent
vectors at those given points.

Problem 5.10. Write a MATLAB function that computes the coefficients and the values of Hermite
cubic spline, that is cubic spline of class C 1 [a, b] which verifies

s3 (f, xi ) = f (xi ), s′3 (f, xi) = f ′ (xi ), i = 1, n.


2
Plot on the same graph the function f (x) = e−x and the interpolant of f for 5 equally-spaced nodes
and 5 Chebyshev nodes on [0,1].

Problem 5.11. Implement a MATLAB function for the inversion of Vandermode matrix, using the
results from pages 188–191.

Problem 5.12. [66] Write a MATLAB function for the computing of a periodic cubic spline of class
C 2 [a, b]. This means that our input data must check fn = f1 , and the interpolating spline must be
periodic, with period xn − x1 . The endpoint periodicity condition can be forced easier by considering
two additional points x0 = x1 − ∆xn−1 and xn+1 = xn + ∆x1 , where the function would have the
values f0 = fn−1 and fn+1 = f2 , respectively.

Problem 5.13. Consider the input data

x = -5:5;
y = [0,0,0,1,1,1,0,0,0,0,0];

Find the coefficients of 7th degree least squares polynomial approximation for this data and plot on the
same graph the corresponding least squares approximation and the Lagrange interpolation polynomial.

Problem 5.14. The density of sodium (in kg/m3 ) for three temperatures (in ◦ C) is given in the table

Temperature Ti 94 205 371


Density ρi 929 902 860

(a) Find the Lagrange interpolation polynomial for this data, using the Symbolic toolbox.
(b) Find the density for T = 251◦ using Lagrange interpolation.

Problem 5.15. Approximate


1+x
y=
1 + 2x + 3x2
for x ∈ [0, 5] using Lagrange, Hermite and spline interpolation. Chose five nodes and plot on the same
graph the function and the interpolants. Then plot the approximation errors.
5.11. Applications 233

T 605 645 685 725 765 795 825


C(T ) 0.622 0.639 0.655 0.668 0.679 0.694 0.730
T 845 855 865 875 885 895 905
C(T ) 0.812 0.907 1.044 1.336 1.181 2.169 2.075
T 915 925 935 955 975 1015 1065
C(T ) 1.598 1.211 0.916 0.672 0.615 0.603 0.601

Table 5.5: A property of titanium as a function of temperature

Problem 5.16. Table 5.5 gives the values for a property of titanium as a function of temperature, T .
Find and plot a cubic interpolating spline for this data using 15 nodes. How well does the spline the
data at the other 6 points?
Problem 5.17. Determine a discrete least squares approximation of the form
y = α exp(βx)
for data
x y
0.0129 9.5600
0.0247 8.1845
0.0530 5.2616
0.1550 2.7917
0.3010 2.2611
0.4710 1.7340
0.8020 1.2370
1.2700 1.0674
1.4300 1.1171
2.4600 0.7620
Plot the points and the approximation.
Hint: apply logarithm.
Problem 5.18. Implement a MATLAB function for discrete least squares approximation based on
Chebyshev #1 polynomials and the inner product (5.3.13).
Problem 5.19. Determine a discrete least squares approximation of the form
y = c1 + c2 x + c3 sin(πx) + c4 sin(2πx)
for data
i xi yi
1 0.1 0.0000
2 0.2 2.1220
3 0.3 3.0244
4 0.4 3.2568
5 0.5 3.1399
6 0.6 2.8579
7 0.7 2.5140
8 0.8 2.1639
9 0.9 1.8358
234 Function Approximation

Plot data and the approximation.


CHAPTER 6

Linear Functional Approximation

6.1 Introduction
Let X be a linear space, L1 , . . . , Lm real linear functional, that are linear independent, defined on X
and L : X → R be a real linear functional such that L, L1 , . . . , Lm are linear independent.
Definition 6.1.1. A formula for approximation of a linear functional L with respect to linear function-
als L1 , . . . , Lm is a formula having the form
m
X
L(f ) = Ai Li (f ) + R(f ), f ∈ X. (6.1.1)
i=1

Real parameters Ai are called coefficients (weights) of the formula, and R(f ) is the remainder term.

For a formula of form (6.1.1), given Li , we wish to determine the weights Ai and to study the
remainder term corresponding to these coefficients.
Remark 6.1.2. The form of Li depends on information on f available (they really express this infor-
mation), but also on the nature of the approximation problem, that is, the form of L. ♦

Example 6.1.3. If X = {f | f : [a, b] → R}, Li (f ) = f (xi ), i = 0, m, xi ∈ [a, b], and L(f ) =


f (α), α ∈ [a, b], the Lagrange interpolation formula
m
X
f (α) = li (α)f (xi ) + (Rf )α
i=0

provides us an example of type (6.1.1), having the coefficients Ai = li (α), and a possible representation
of the remainder is
u(α)
(Rf )(α) = f (m+1) (ξ), ξ ∈ [a, b],
(m + 1)!
if f (m+1) exists [a, b]. ♦

235
236 Linear Functional Approximation

Example 6.1.4. If X and Li are like in Example 6.1.3 and f (k) (α) exists, α ∈ [a, b], k ∈ N∗ , and
L(f ) = f (k) (α) one obtains a formula for the approximation of the kth derivative of f at α

m
X
f (k) (α) = Ai f (xi ) + R(f ),
i=0

called numerical differentiation formula . ♦

Example 6.1.5. If X is a space of functions which are defined on [a, b], integrable on [a, b] and there
exists f (j) (xk ), k = 0, m, j ∈ Ik , with xk ∈ [a, b] and Ik are given sets of indices

Lkj (f ) = f (j) (xk ), k = 0, m, j ∈ Ik ,

and
Z b
L(f ) = f (x)dx,
a

one obtains a formula


Z b m X
X
f (x)dx = Akj f (j) (xk ) + R(f ),
a k=0 j∈Ik

called numerical integration formula. ♦

Definition 6.1.6. If Pr ⊂ X, then the number r ∈ N such that Ker(R) = Pr is called degree of
exactness of the approximation formula (6.1.1).

Remark 6.1.7. Since R is a linear functional, the property Ker(R) = Pr is equivalent to R(ek ) = 0,
k = 0, r and R(er+1 ) 6= 0, where ek (x) = xk . ♦

We are now ready to formulate the general approximation problem: given a linear functional L
on X, m linear functional L1 , L2 , . . . , Lm on X and their values (the “data”) li = Li f , i = 1, m
applied to some function f and given a linear subspace Φ ⊂ X with dim Φ = m, we want to find an
approximation formula of the type
Xm
Lf ≈ ai Li f (6.1.2)
i=1

that is exact (i.e., holds with equality), whenever f ∈ Φ.


It is natural (since we want to interpolate) to make the following
Assumption: the “interpolation problem”
find ϕ ∈ Φ such that
Li ϕ = si , i = 1, m (6.1.3)

has a unique solution ϕ(·) = ϕ(s, ·), for arbitrary s = [s1 , . . . , sm ]T .


We can express our assumption more explicitly in terms of a given basis ϕ1 , ϕ2 , . . . , ϕm of Φ and
6.1. Introduction 237

1
the associated Gram matrix

L1 ϕ1 L1 ϕ2 ... L1 ϕm

L2 ϕ1 L2 ϕ2 ... L2 ϕm
G = [Li ϕj ] = ∈ Rm×m . (6.1.4)

... ... ... ...
Lm ϕ1 Lm ϕ2 ... Lm ϕm

What we require is that


det G 6= 0. (6.1.5)
It is easily seen that this condition is independent of the particular choice of basis.
To show that unique solvability of (6.1.3) and (6.1.5) are equivalent, we express ϕ in (6.1.3) as a
linear combination of the basis functions
nm
X
ϕ= cj ϕj (6.1.6)
j=1

and note that the interpolation conditions


m
!
X
Li cj ϕj = si , i = 1, m
j=1

by the linearity of the functionals Li , can be written in the form


m
X
cj Li ϕj = si , i = 1, m,
j=1

that is,
Gc = s, c = [c1 , c2 , . . . , cm ]T , s = [s1 , s2 , . . . , sm ]T . (6.1.7)
This has a unique solution for arbitrary s if and only if (6.1.5) holds.
We have two approaches for the solution of this problem.

6.1.1 Method of interpolation


We solve the general approximation problem by interpolation

Lf ≈ Lϕ(ℓ; ·), ℓ = [ℓ1 , ℓ2 , . . . , ℓm ]T , ℓi = Li f (6.1.8)

In other words, we apply L not to f , but to ϕ(l; ·) — the solution of the interpolation problem (6.1.3)
in which s = ℓ, the given “data”. Our assumption guarantees that ϕ(ℓ; ·) is uniquely determined.

Jórgen Pedersen Gram (1850-1916), Danish mathematician who stud-


ied at the University of Copenhagen. After graduation, he entered an
insurance company as computer assistant and, moving up the ranks,
1 eventually became its director. He was interested in series expan-
sions of special functions and also contributed to Chebyshev and least
squares approximation. The “Gram determinant” was introduced by
him in connection with his study of linear independence.
238 Linear Functional Approximation

In particular, if f ∈ Φ, then (6.1.8) holds with equality, since trivially ϕ(l; ·) = f (·). Thus, our
approximation (6.1.8) already satisfies the exactness condition required for (6.1.2). It remains only to
show that (6.1.8) produces indeed an approximation of the form (6.1.2).
To do so, observe that the interpolant in (6.1.8) is
m
X
ϕ(ℓ; ·) = cj ϕj (·)
j=1

where the vector c = [c1 , c2 , . . . , cm ]T satisfies (6.1.7) with s = ℓ

Gc = ℓ, ℓ = [L1 f, L2 f, . . . , Lm f ]T .

Writing
λj = Lϕj , j = 1, m, λ = [λ1 , λ2 , . . . , λm ]T , (6.1.9)
we have by the linearity of L
m
X
Lϕ(ℓ; ·) = cj Lϕj = λT c = λT G−1 ℓ = [(GT )−1 λ]T ℓ,
j=1

that is,
m
X
Lϕ(ℓ; ·) = ai Li f, a = [a1 , a2 , . . . , am ]T = (GT )−1 λ. (6.1.10)
i=1

6.1.2 Method of undetermined coefficients


Here we determined the coefficients ai in (6.1.3) such that the equality holds ∀ f ∈ Φ, which, by the
linearity of L and Li is equivalent to equality for f = ϕ1 , f = ϕ2 , . . . , f = ϕm ; that is

m
!
X
aj Lj ϕi = Lϕi , i = 1, m,
j=1

or by (6.1.8)
m
X
aj Lj ϕi = λi , i = 1, m.
j=1

Evidently, the matrix of this system is GT , so that

a = [a1 , a2 , . . . , am ]T = (GT )−1 λ,

in agreement with (6.1.10). Thus, the method of interpolation and the method of undetermined coeffi-
cients are mathematically equivalent — they produce exactly the same approximation.
It seems that, at least in the case of polynomial (i.e. Φ = Pd ), that the method of interpolation
is more powerful than the method of undetermined coefficients, because it also yields an expression
for the error term (if we carry along the remainder term of interpolation). But, the method of undeter-
mined coefficients is allowed, using the condition of exactness to find the remainder term by the Peano
Theorem.
6.2. Numerical Differentiation 239

6.2 Numerical Differentiation


For simplicity we consider only the first derivative; analogous techniques apply to higher order deriva-
tives. We solve the problem by means of interpolation: instead to differentiate f ∈ C m+1 [a, b], we
shall differentiate its interpolation polynomial:

f (x) = (Lm f )(x) + (Rm f )(x). (6.2.1)

We write the interpolation polynomial in Newton form

(Lm f )(x) = (Nm f )(x) = f0 + (x − x0 )f [x0 , x1 ] + · · · +

+ (x − x0 ) . . . (x − xm−1 )f [x0 , x1 , . . . , xm ] (6.2.2)


and the remainder in the form
f (m+1) (ξ(x))
(Rm f )(x) = (x − x0 ) . . . (x − xm ) . (6.2.3)
(m + 1)!
Differentiating (6.2.2) with respect to x and then putting x = x0 gives

(Lm f )(x0 ) = f [x0 , x1 ] + (x0 − x1 )f [x0 , x1 , x2 ] + · · · +

+ (x0 − x1 )(x0 − x2 ) . . . (x0 − xm−1 )f [x0 , x1 , . . . , xm ]. (6.2.4)


Assuming that f is has n + 2 continuous derivatives in an appropriate interval we get

f (m+1) (ξ(x0 ))
(Rm f )′ (x0 ) = (x0 − x1 ) . . . (x0 − xm ) . (6.2.5)
(m + 1)!
Therefore, differentiating (6.2.4), we find

f ′ (x0 ) = (Lm f )′ (x0 ) + (Rm f )′ (x0 ) . (6.2.6)


| {z }
em

If H = max |x0 − xi |, the error has the form em = O(H m ), when H → 0.


i
We can thus get approximation formulae of arbitrarily high order, but those with large order are of
limited practical use.

Remark 6.2.1. Numerical differentiation is a critical operation; for this reason it must be avoided as
much as possible, since even good approximation lead to poor approximation of the derivative (see
Figure 6.1). This also follows from Example 6.2.2. ♦

Example 6.2.2. Let the function


1
f (x) = g(x) + sin n2 (x − a), g ∈ C 1 [a, b].
n
We see that d(f, g) → 0 (n → ∞), but d(f ′ , g ′ ) = n 9 0. ♦

The most important uses of differentiation formulae are made in the discretization of differential
equations — ordinary or partial. In these applications, the spacing of the points is usually uniform,
but unequally distribution points arise when partial differential operators are to be discretized near the
boundary of the domain of interest.
We can also use another interpolation procedures such as: Taylor, Hermite, spline, least squares.
240 Linear Functional Approximation

a
n

a′
n

f′

Figure 6.1: The drawbacks of numerical differentiation

6.3 Numerical Integration


The basic problem is to calculate the definite integral of a given function f over a finite interval [a, b].
If f is well behaved, this is a routine problem, for which the simplest integration rules, such as the
composite trapezoidal or Simpson’s rule will be quite adequate, the former having an edge over the
later if f is periodic with period b − a.
Complications arise if f has an integrable singularity, or the interval of integration extends to infin-
ity (which is just other manifestation of the singular behavior). By breaking up the integral, if necessary,
into several pieces, it can be assumed that the singularity, if its location is known, is at one (or both)
ends of the interval [a, b]. Such improper integrals can usually be treated by weighted quadrature; that is
one incorporates the singularity into a weight function, which then becomes one factor of the integrand,
leaving the other factor well behaved. The most important example of this is Gaussian quadrature
relative to such a weight function. Finally, it is possible to accelerate the convergence of quadrature
schemes by suitable recombinations. The best-known example of this is Romberg integration.
Let f : [a, b] → R be a function integrable on [a, b], Fk (f ), k = 0, m information on f (usually
linear functionals) and w : [a, b] → R+ is a weight function, integrable over [a, b].

Definition 6.3.1. A formula of the form


Z b
w(x)f (x)dx = Q(f ) + R(f ), (6.3.1)
a

where
m
X
Q(f ) = Aj Fj (f ),
j=0

is called a numerical integration formula for the function f or a quadrature formula. Parameters Aj ,
j = 0, m are called weights or coefficients of the formula, and R(f ) is its remainder term. Q is called
quadrature functional.

Definition 6.3.2. The natural number d = d(Q) having the property ∀ f ∈ Pd, R(f ) = 0 and
∃ g ∈ Pd + 1 such that R(g) 6= 0 is called degree of exactness of the quadrature formula..
6.3. Numerical Integration 241

Since R is linear, a quadrature formula has the degree of exactness d if and only if R(ej ) = 0,
j = 0, d and R(ed+1 ) 6= 0.
If the degree of exactness of a quadrature formula is known, the remainder could be determined
using Peano theorem.

6.3.1 The composite trapezoidal and Simpson’s rule


These formulae are called by Gautschi in [33] “the workhorses of numerical integration”. They will do
the job when the interval is finite and the integrand is unproblematic. The trapezoidal rule is sometimes
surprisingly effective on infinite intervals.
Both rules are obtained by applying the simplest kind of interpolation on subintervals of the de-
composition
b−a
a = x0 < x1 < x2 < · · · < xn−1 < xn = b, xk = a + kh, h= . (6.3.2)
n
of the interval [a, b]. In the trapezoidal rule, one interpolates linearly on each subinterval [xk , xk+1 ],
and obtains
Z xk+1 Z xk+1 Z xk+1
f (x)dx = (L1 f )(x)dx + (R1 f )(x)dx, f ∈ C 1 [a, b], (6.3.3)
xk xk xk

where
(L1 f )(x) = fk + (x − xk )f [xk , xk+1 ].
Integrating, we have
Z xk+1
h
f (x)dx = (fk + fk+1 ) + R1 (f ),
xk 2
where (using Peano Theorem)
Z xk+1
R1 (f ) = K1 (t)f ′′ (t)dt,
xk

and
(xk+1 − t)2 h
K1 (t) = − [(xk − t)+ + (xk+1 − t)+ ]
2 2
(xk − t)2 h(xk+1 − t)
= −
2 2
1
= (xk+1 − t)(xk − t) ≤ 0.
2
So
h3 ′′
R1 (f ) = − f (ξk ), ξk ∈ (xk , xk+1 )
12
and Z xk+1
h 1 3 ′′
f (x)dx = (fk + fk+1 ) − h f (ξk ). (6.3.4)
xk 2 12
This is the elementary trapezoidal rule. Summing over all subinterval gives the trapezes rule or the
composite trapezoidal rule.
Z b   n−1
1 1 1 3 X ′′
f (x)dx = h f0 + f1 + · · · + fn−1 + fn − h f (ξk ).
a 2 2 12 k=0
242 Linear Functional Approximation

Since f ′′ is continuous on [a, b], the remainder term could be written as

(b − a)h2 ′′ (b − a)3 ′′
R1,n (f ) = − f (ξ) = − f (ξ). (6.3.5)
12 12n2

Since f ′′ is bounded in absolute value on [a, b] we have

R1,n (f ) = O(h2 ),

when h → 0 and so the composite trapezoidal rule converges when h → 0 (or equivalently, n → ∞),
provided that f ∈ C 2 [a, b].
MATLAB Source 6.1 gives an implementation of trapezes rule.

MATLAB Source 6.1 Composite trapezoidal rule


function I=trapezes(f,a,b,n);
%TRAPEZES trapezes formula
%call I=trapezes(f,a,b,n);

h=(b-a)/n;
I=(f(a)+f(b)+2*sum(f([1:n-1]*h+a)))*h/2;

If instead of linear interpolation one uses quadratic interpolation over two consecutive intervals,
one gives rise to the composite Simpson’s formula. Its “elementary” version, called Simpson’s rule or
Simpson formula is
Z xk+1
h 1 5 (h)
f (x)dx = (fk + 4fk+1 + fk+2 ) − h f (ξk ), xk ≤ ξk ≤ xk+1 , (6.3.6)
xk 3 90

where it has been assumed that f ∈ C 4 [a, b].


Let us prove the formula for the remainder of Simpson rule. Since de degree of exactness is 3,
Peano theorem yields to Z xk+2
R2 (f ) = K2 (t)f (4) (t) dt.
xk

where
 
1 (xk+1 − t)4 h 
K2 (t) = − (xk − t)3+ + 4 (xk+1 − t)3+ + (xk+2 − t)3+ ,
3! 4 3

that is,
( (xk+2 −t)4  
1 4
− h
3
4 (xk+1 − t)3 + (xk+2 − t)3 , t ∈ [xk , xk+1 ] ,
K2 (t) = (xk+2 −t)4
6 − h
(xk+2 − t)3 , t ∈ [xk+1 , xk+2 ] .
4 3

One easily checks that for t ∈ [a, b], K2 (t) ≤ 0, so we can apply Peano’s Theorem corollary.

1 (4)
R2 (f ) = f (ξk )R2 (e4 ),
4!
6.3. Numerical Integration 243

x5k+2 − x5k h 4 
R2 (e4 ) = − x + 4x4k+1 + x4k+1
5 3 k
 4
x + x3k+2 xk + x2k+2 x2k + xk+2 x3k + x4k
= h 2 k+2
5

5x4k + 4x3k xk+2 + 6x2k x2k+2 + 4xk x3k+2 + 5x4k+2

12
h 
= −xk + 4xk xk+2 + 6x2k x2k+2 + 4xk x3k+2 − x4k+2
4 3
60
h h5
= − (xk+2 − xk )4 = −4 .
60 15
Thus,
h5 (4)
R2 (f ) = − f (ξk ).
90
For the composite Simpson 2 rule we get
Z b
h
f (x)dx = (f0 + 4f1 + 2f2 + 4f3 + 2f4 + · · · + 4fn−1 + fn ) + R2,n (f ) (6.3.7)
a 3

with
1 (b − a)5 (4)
R2,n (f ) = − (b − a)h4 f (4) (ξ) = − f (ξ), ξ ∈ (a, b). (6.3.8)
180 2880n4
One notes that R2,n (f ) = O(h4 ), which assures convergence when n → ∞. We have also a gain
with 1 in the order of accuracy. This is the reason why Simpson’s rule has long been, and continues to
be, one of the most popular general-purpose integration methods. For an implementation, see MATLAB
Source 6.2.

MATLAB Source 6.2 Composite Simpson formula


function I=Simpson(f,a,b,n);
%SIMPSON composite Simpson formula
%call I=Simpson(f,a,b,n);

h=(b-a)/n;
x2=[1:n-1]*h+a;
x4=[0:n-1]*h+a+h/2;
I=h/6*(f(a)+f(b)+2*sum(f(x2))+4*sum(f(x4)));

Thomas Simpson (1710-1761) was an English mathematician, self-


2 educated,and author of many textbooks popular at the time. Simpson
published his formula in 1743, but it was already known to Cavalieri
(1639), Gregory (1668), and Cotes (1722), among others.
244 Linear Functional Approximation

The composite trapezoidal rule works well for trigonometric polynomials. Suppose, without loss
of generality that [a, b] = [0, 2π] and let

Tm [0, 2π] = {t(x) : t(x) = a0 + a1 cos x + a2 cos 2x + · · · + am cos mx


+ b1 sin x + b2 sin 2x + · · · + bm sin mx}.

Then
Rn,1 (f ) = 0, ∀ f ∈ Tn−1 [0, 2π]. (6.3.9)
iνx
We can easily check this by taking f (x) = eν (x) := e = cos νx + i sin νx, ν = 0, 1, 2, · · · :
" #
Z 2π
2π 1 X  2kπ  1
n−1
Rn,1 (eν ) = eν (x)dx − eν (0) + eν + eν (2π)
0 n 2 k=1
n 2
Z 2π n−1
2π X 2πikν
= eiνx dx − e n .
0 n k=0

When $\nu = 0$, this is zero; otherwise, since $\int_0^{2\pi} e^{i\nu x}\,dx = (i\nu)^{-1}e^{i\nu x}\big|_0^{2\pi} = 0$,
$$R_{n,1}(e_\nu) = \begin{cases} -2\pi, & \text{if } \nu \equiv 0 \pmod n,\ \nu > 0,\\[1mm] -\dfrac{2\pi}{n}\,\dfrac{1 - e^{i\nu n\cdot 2\pi/n}}{1 - e^{i\nu\cdot 2\pi/n}} = 0, & \text{if } \nu \not\equiv 0 \pmod n. \end{cases} \qquad (6.3.10)$$
In particular, $R_{n,1}(e_\nu) = 0$ for $\nu = 0, 1, \dots, n-1$, which proves (6.3.9). Taking the real and imaginary parts in (6.3.10) one obtains
$$R_{n,1}(\cos\nu\,\cdot) = \begin{cases}-2\pi, & \nu \equiv 0 \pmod n,\ \nu \ne 0,\\ 0, & \text{otherwise},\end{cases}\qquad R_{n,1}(\sin\nu\,\cdot) = 0.$$

Therefore, if $f$ is $2\pi$-periodic and has a uniformly convergent Fourier expansion
$$f(x) = \sum_{\nu=0}^{\infty}\left[a_\nu(f)\cos\nu x + b_\nu(f)\sin\nu x\right], \qquad (6.3.11)$$
where $a_\nu(f)$, $b_\nu(f)$ are the Fourier coefficients of $f$, then
$$R_{n,1}(f) = \sum_{\nu=0}^{\infty}\left[a_\nu(f)R_{n,1}(\cos\nu\,\cdot) + b_\nu(f)R_{n,1}(\sin\nu\,\cdot)\right] = -2\pi\sum_{l=1}^{\infty} a_{ln}(f). \qquad (6.3.12)$$

From the theory of Fourier series it is known that the Fourier coefficients of $f$ go to zero the faster, the smoother $f$ is. More precisely, if $f \in C^r[\mathbb{R}]$, then $a_\nu(f) = O(\nu^{-r})$ as $\nu \to \infty$ (and similarly for $b_\nu(f)$). Since by (6.3.12)
$$R_{n,1}(f) \simeq -2\pi a_n(f),$$
it follows that
$$R_{n,1}(f) = O(n^{-r}) \ \text{ as } n \to \infty, \quad f \in C^r[\mathbb{R}],\ 2\pi\text{-periodic}, \qquad (6.3.13)$$
which for $r > 2$ is better than $R_{n,1}(f) = O(n^{-2})$, valid for nonperiodic functions $f$. In particular, if $r = \infty$, the trapezoidal rule converges faster than any power of $n^{-1}$. It should be noted, however, that $f$

must be smooth on the whole real line $\mathbb{R}$. Starting with a function $f \in C^r[0,2\pi]$ and extending it by periodicity to $\mathbb{R}$ will not, in general, produce a function $f \in C^r[\mathbb{R}]$.
Another instance in which the trapezoidal rule works very well is for functions $f$ defined on $\mathbb{R}$ which have the following property: for some $r \ge 1$,
$$f \in C^{2r+1}[\mathbb{R}], \qquad \int_{\mathbb{R}}\left|f^{(2r+1)}(x)\right|dx < \infty,$$
$$\lim_{x\to-\infty} f^{(2\rho-1)}(x) = \lim_{x\to+\infty} f^{(2\rho-1)}(x) = 0, \quad \rho = 1, 2, \dots, r. \qquad (6.3.14)$$

In this case, one can show that
$$\int_{\mathbb{R}} f(x)\,dx = h\sum_{k=-\infty}^{\infty} f(kh) + R(f;h) \qquad (6.3.15)$$
has an error $R(f;h)$ satisfying $R(f;h) = O(h^{2r+1})$, $h \to 0$. Hence, if (6.3.14) holds for each $r \in \mathbb{N}$, the error tends to zero faster than any power of $h$.
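The following short experiment (our own illustration, using the trapezes function from MATLAB Source 6.1) shows this behavior on the $2\pi$-periodic integrand $f(x) = 1/(2+\sin x)$, whose exact integral over one period is $2\pi/\sqrt{3}$:

f = @(x) 1./(2+sin(x));
Iex = 2*pi/sqrt(3);          %exact value of the integral
for n = [4 8 16 32]
    fprintf('n = %2d   error = %.3e\n', n, abs(trapezes(f,0,2*pi,n)-Iex))
end

The error decays geometrically with n, much faster than the $O(n^{-2})$ rate valid for nonperiodic integrands.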

6.3.2 Weighted Newton-Cotes and Gauss formulae


A weighted quadrature formula is a formula of the type
$$\int_a^b f(t)w(t)\,dt = \sum_{k=1}^n w_k f(t_k) + R_n(f), \qquad (6.3.16)$$

where the weight function w is nonnegative and integrable on (a, b). The interval (a, b) may be finite
or infinite. If it is infinite, we must make sure that the integral in (6.3.16) is well defined, at least when
f is a polynomial. We achieve this by requiring that all moments of the weight function,
$$\mu_s = \int_a^b t^s w(t)\,dt, \quad s = 0, 1, 2, \dots, \qquad (6.3.17)$$

exist and are finite.


We call (6.3.16) interpolatory, if it has the degree of exactness d = n − 1. Interpolatory formulae
are precisely those “obtained by interpolation”, that is, for which
$$\sum_{k=1}^n w_k f(t_k) = \int_a^b L_{n-1}(f; t_1, \dots, t_n; t)\,w(t)\,dt, \qquad (6.3.18)$$
or equivalently,
$$w_k = \int_a^b l_k(t)\,w(t)\,dt, \quad k = 1, 2, \dots, n, \qquad (6.3.19)$$
where
$$l_k(t) = \prod_{\substack{l=1\\ l\ne k}}^{n}\frac{t - t_l}{t_k - t_l} \qquad (6.3.20)$$

are the elementary Lagrange interpolation polynomials associated with the nodes $t_1, t_2, \dots, t_n$. The fact that (6.3.16), with $w_k$ given by (6.3.19), has the degree of exactness $d = n-1$ is evident, since for any $f \in P_{n-1}$ we have $L_{n-1}(f;\cdot) \equiv f(\cdot)$ in (6.3.18). Conversely, if (6.3.16) has the degree of exactness $d = n-1$, then putting $f(t) = l_r(t)$ in (6.3.16) gives
$$\int_a^b l_r(t)w(t)\,dt = \sum_{k=1}^n w_k l_r(t_k) = w_r, \quad r = 1, 2, \dots, n,$$
that is, (6.3.19).


We see that given any n distinct nodes t1 , . . . , tn it is always possible to construct a formula of
type (6.3.16) which is exact for all polynomials of degree ≤ n − 1. In the case w(t) ≡ 1 on [−1, 1] and
tk equally spaced on [−1, 1], the feasibility of such a construction was already alluded to by Newton
in 1687 and implemented in detail by Cotes 3 around 1712. By extension, we call the formula (6.3.16),
with the tk prescribed and the wk given by (6.3.19) a Newton-Cotes formula.
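As a small illustration (our own sketch, not one of the numbered sources), for $w(t) \equiv 1$ the Newton-Cotes weights (6.3.19) for arbitrary prescribed nodes can be computed by integrating each elementary Lagrange polynomial (6.3.20) exactly:

function w=nc_weights(t)
%NC_WEIGHTS - weights (6.3.19) for w(t)=1 and prescribed nodes t
%(illustrative helper, not used elsewhere in the text)
n=length(t); w=zeros(1,n);
for k=1:n
    c=poly(t([1:k-1,k+1:n]));   %monic numerator of l_k
    c=c/polyval(c,t(k));        %normalize so that l_k(t_k)=1
    P=polyint(c);               %antiderivative of l_k
    w(k)=polyval(P,t(end))-polyval(P,t(1));
end

For three equally spaced nodes on $[-1,1]$, nc_weights(linspace(-1,1,3)) returns [1/3, 4/3, 1/3], i.e., Simpson's rule on that interval.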
The question naturally arises whether we can do better, that is, whether we can achieve the degree
of exactness d > n − 1 by a judicious choice of the nodes tk (the weights wk being necessarily given
by (6.3.19)). The answer is surprisingly simple and direct. To formulate it we introduce the node
polynomial
$$u_n(t) = \prod_{k=1}^n (t - t_k). \qquad (6.3.21)$$

Theorem 6.3.3. Given an integer k, with 0 ≤ k ≤ n, the quadrature formula (6.3.16) has the degree
of exactness d = n − 1 + k if and only if both of the following conditions are satisfied.
(a) The formula (6.3.16) is interpolatory;
(b) The node polynomial $u_n$ in (6.3.21) satisfies
$$\int_a^b u_n(t)\,p(t)\,w(t)\,dt = 0, \quad \forall p \in P_{k-1}.$$

The condition in (b) imposes k conditions on the nodes t1 , t2 , . . . , tn of (6.3.16). (If k = 0, there is
no restriction since, as we know, we can always get d = n − 1). In effect, un must be orthogonal to
Pk−1 relative to the weight function w. Since w(t) ≥ 0, we have necessarily k ≤ n; otherwise, un
would have to be orthogonal to Pn , in particular, orthogonal to itself, which is impossible. Thus k = n
is optimal, giving rise to a quadrature rule of maximum degree of exactness dmax = 2n − 1. Condition
(b) then amounts to orthogonality of un to all polynomials of lower degree; that is un (·) = πn (·, w)
is precisely the nth-degree orthogonal polynomial with respect to the weight function w. This optimal
formula is called the Gaussian quadrature formula associated with the weight function w. Its nodes,
therefore, are the roots of πn (·, w), and the weights (coefficients) wk are given as in (6.3.19); thus

$$\pi_n(t_k; w) = 0, \qquad w_k = \int_a^b \frac{\pi_n(t; w)}{(t - t_k)\,\pi_n'(t_k; w)}\,w(t)\,dt, \quad k = 1, 2, \dots, n. \qquad (6.3.22)$$

³ Roger Cotes (1682-1716), precocious son of an English country pastor, was entrusted with the preparation of the second edition of Newton's Principia. He worked out in detail Newton's idea of numerical integration and published the coefficients (now known as Cotes numbers) of the n-point formula for all n < 11. Upon his death at the early age of 33, Newton said of him: "If he had lived, we might have known something."

The formula was developed in 1814 by Gauss for the special case $w(t) \equiv 1$ on $[-1,1]$, and extended to more general weight functions by Christoffel⁴ in 1877. It is, therefore, also referred to as the Gauss-Christoffel quadrature formula.

Proof of Theorem 6.3.3. Necessity. Since the degree of exactness is $d = n-1+k \ge n-1$, condition (a) is trivial. Condition (b) also follows immediately, since for any $p \in P_{k-1}$ we have $u_n p \in P_{n-1+k}$. Hence,
$$\int_a^b u_n(t)p(t)w(t)\,dt = \sum_{k=1}^n w_k u_n(t_k)p(t_k),$$
which vanishes, since $u_n(t_k) = 0$ for $k = 1, 2, \dots, n$.


Sufficiency. We must show that for any $p \in P_{n-1+k}$ we have $R_n(p) = 0$ in (6.3.16). Given any such $p$, divide it by $u_n$, so that
$$p = qu_n + r, \quad q \in P_{k-1},\ r \in P_{n-1},$$
where $q$ is the quotient and $r$ the remainder. There follows
$$\int_a^b p(t)w(t)\,dt = \int_a^b q(t)u_n(t)w(t)\,dt + \int_a^b r(t)w(t)\,dt.$$
The first integral on the right, by (b), is zero, since $q \in P_{k-1}$, whereas the second, by (a), since $r \in P_{n-1}$, equals
$$\sum_{k=1}^n w_k r(t_k) = \sum_{k=1}^n w_k\left[p(t_k) - q(t_k)u_n(t_k)\right] = \sum_{k=1}^n w_k p(t_k),$$
the last equality following again from $u_n(t_k) = 0$, $k = 1, 2, \dots, n$, which completes the proof. $\square$

The case $k = n$ is discussed further in §6.3.3. Here we still mention two special cases with $k < n$, which are of some practical interest. The first is the Gauss-Radau quadrature formula, in which one endpoint, say $a$, is finite and serves as a quadrature node, say $t_1 = a$. The maximum degree of exactness attainable then is $d = 2n-2$ and corresponds to $k = n-1$ in Theorem 6.3.3. Part (b) of that theorem tells us that the remaining nodes $t_2, \dots, t_n$ must be the zeros of $\pi_{n-1}(\cdot; w_a)$, where $w_a(t) = (t-a)w(t)$.
Similarly, in the Gauss-Lobatto formula both endpoints are finite and serve as nodes, say $t_1 = a$, $t_n = b$, and the remaining nodes $t_2, \dots, t_{n-1}$ are taken to be the zeros of $\pi_{n-2}(\cdot; w_{a,b})$, $w_{a,b}(t) = (t-a)(b-t)w(t)$, thus achieving the maximum degree of exactness $d = 2n-3$.

⁴ Elwin Bruno Christoffel (1829-1900) was active for a short period of time in Berlin and Zurich and, for the rest of his life, in Strasbourg. He is best known for his work in geometry, in particular tensor analysis, which became important in Einstein's theory of relativity.

6.3.3 Properties of Gaussian quadrature rules


The Gaussian quadrature rule (6.3.16), (6.3.22), in addition to being optimal (i.e., having maximum degree of exactness), has some interesting and useful properties.
(i) All nodes tk are real, distinct, and contained in the open interval (a, b). This is a well-known
property satisfied by the zeros of orthogonal polynomials.
(ii) All the weights (coefficients) $w_k$ are positive. An ingenious observation of Stieltjes proves it almost immediately. Indeed,
$$0 < \int_a^b l_j^2(t)w(t)\,dt = \sum_{k=1}^n w_k l_j^2(t_k) = w_j, \quad j = 1, 2, \dots, n,$$
the first equality following since $l_j^2 \in P_{2n-2}$ and the degree of exactness is $d = 2n-1$.
(iii) If $[a,b]$ is a finite interval, then the Gauss formula converges for any continuous function; that is, $R_n(f) \to 0$ as $n \to \infty$, for any $f \in C[a,b]$. This is basically a consequence of the Weierstrass Approximation Theorem, which implies that, if $\hat p_{2n-1}(f;\cdot)$ denotes the best polynomial approximation of degree $2n-1$ to $f$ on $[a,b]$ in the uniform norm, then
$$\lim_{n\to\infty}\left\|f(\cdot) - \hat p_{2n-1}(f;\cdot)\right\|_\infty = 0.$$

Since $R_n(\hat p_{2n-1}) = 0$ (because $d = 2n-1$), it follows that
$$\begin{aligned}|R_n(f)| &= |R_n(f - \hat p_{2n-1})|\\ &= \left|\int_a^b \left[f(t) - \hat p_{2n-1}(f;t)\right]w(t)\,dt - \sum_{k=1}^n w_k\left[f(t_k) - \hat p_{2n-1}(f;t_k)\right]\right|\\ &\le \int_a^b \left|f(t) - \hat p_{2n-1}(f;t)\right|w(t)\,dt + \sum_{k=1}^n w_k\left|f(t_k) - \hat p_{2n-1}(f;t_k)\right|\\ &\le \left\|f(\cdot) - \hat p_{2n-1}(f;\cdot)\right\|_\infty\left[\int_a^b w(t)\,dt + \sum_{k=1}^n w_k\right].\end{aligned}$$

Here the positivity of the weights $w_k$ has been used crucially. Noting that
$$\sum_{k=1}^n w_k = \int_a^b w(t)\,dt = \mu_0,$$
we conclude
$$|R_n(f)| \le 2\mu_0\left\|f - \hat p_{2n-1}\right\|_\infty \to 0, \quad \text{as } n \to \infty.$$
The next property is the background for an efficient algorithm for computing Gaussian quadrature
formulae.
(iv) Let $\alpha_k = \alpha_k(w)$ and $\beta_k = \beta_k(w)$ be the recursion coefficients for the orthogonal polynomials $\pi_k(\cdot) = \pi_k(\cdot; w)$, that is,
$$\pi_{k+1}(t) = (t - \alpha_k)\pi_k(t) - \beta_k\pi_{k-1}(t), \quad k = 0, 1, 2, \dots,\qquad \pi_0(t) = 1, \quad \pi_{-1}(t) = 0, \qquad (6.3.23)$$
where
$$\alpha_k = \frac{(t\pi_k, \pi_k)}{(\pi_k, \pi_k)}, \qquad \beta_k = \frac{(\pi_k, \pi_k)}{(\pi_{k-1}, \pi_{k-1})}, \qquad (6.3.24)$$

with $\beta_0$ defined (as is customary) by
$$\beta_0 = \int_a^b w(t)\,dt\ (= \mu_0).$$
The $n$th order Jacobi matrix for the weight function $w$ is the symmetric tridiagonal matrix
$$J_n(w) = \begin{pmatrix}\alpha_0 & \sqrt{\beta_1} & & & 0\\ \sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & &\\ & \sqrt{\beta_2} & \alpha_2 & \ddots &\\ & & \ddots & \ddots & \sqrt{\beta_{n-1}}\\ 0 & & & \sqrt{\beta_{n-1}} & \alpha_{n-1}\end{pmatrix}.$$
Theorem 6.3.4. The nodes $t_k$ of a Gauss-type quadrature formula are the eigenvalues of $J_n$,
$$J_n v_k = t_k v_k, \quad v_k^T v_k = 1, \quad k = 1, 2, \dots, n, \qquad (6.3.25)$$
and the weights $w_k$ are expressible in terms of the first components $v_{k,1}$ of the corresponding (normalized) eigenvectors by
$$w_k = \beta_0 v_{k,1}^2, \quad k = 1, 2, \dots, n. \qquad (6.3.26)$$

Thus, to compute the Gauss formula, we must solve an eigenvalue/eigenvector problem for a sym-
metric tridiagonal matrix. This is a routine problem in linear algebra, and very efficient methods (e.g.
the QR algorithm) are known for solving it. Thus, the eigenvalue-based approach is more efficient
than the classical one. Moreover, the classical approach is based on two ill-conditioned problems: the
solution of a polynomial equation and the solution of a Vandermonde system of linear equations.

Proof of Theorem 6.3.4. Let $\tilde\pi_k(\cdot) = \tilde\pi_k(\cdot; w)$ denote the normalized orthogonal polynomials, so that $\pi_k = \sqrt{(\pi_k, \pi_k)}\,\tilde\pi_k$. Inserting this into (6.3.23), dividing by $\sqrt{(\pi_{k+1}, \pi_{k+1})}$, and using (6.3.24), we obtain
$$\tilde\pi_{k+1}(t) = (t - \alpha_k)\frac{\tilde\pi_k(t)}{\sqrt{\beta_{k+1}}} - \beta_k\frac{\tilde\pi_{k-1}(t)}{\sqrt{\beta_{k+1}}\sqrt{\beta_k}},$$
or, multiplying through by $\sqrt{\beta_{k+1}}$ and rearranging,
$$t\tilde\pi_k(t) = \alpha_k\tilde\pi_k(t) + \sqrt{\beta_k}\,\tilde\pi_{k-1}(t) + \sqrt{\beta_{k+1}}\,\tilde\pi_{k+1}(t), \quad k = 0, 1, \dots, n-1. \qquad (6.3.27)$$

In terms of the Jacobi matrix $J_n$ we can write these relations in vector form as
$$t\tilde\pi(t) = J_n\tilde\pi(t) + \sqrt{\beta_n}\,\tilde\pi_n(t)\,e_n, \qquad (6.3.28)$$
where $\tilde\pi(t) = [\tilde\pi_0(t), \tilde\pi_1(t), \dots, \tilde\pi_{n-1}(t)]^T$ and $e_n = [0, 0, \dots, 0, 1]^T$ are vectors in $\mathbb{R}^n$. Since $t_k$ is a zero of $\tilde\pi_n$, it follows from (6.3.28) that
$$t_k\tilde\pi(t_k) = J_n\tilde\pi(t_k), \quad k = 1, 2, \dots, n. \qquad (6.3.29)$$


This proves the first relation in Theorem 6.3.4, since π̃ is a nonzero vector, its first component being
−1/2
π̃0 = β0 . (6.3.30)

To prove the second relation, note from (6.3.29) that the normalized eigenvector $v_k$ is
$$v_k = \frac{\tilde\pi(t_k)}{\left[\tilde\pi(t_k)^T\tilde\pi(t_k)\right]^{1/2}} = \left(\sum_{\mu=1}^n\tilde\pi_{\mu-1}^2(t_k)\right)^{-1/2}\tilde\pi(t_k).$$

Comparing the first components on the far left and right, and squaring, gives, by virtue of (6.3.30),
$$\frac{1}{\displaystyle\sum_{\mu=1}^n\tilde\pi_{\mu-1}^2(t_k)} = \beta_0 v_{k,1}^2, \quad k = 1, 2, \dots, n. \qquad (6.3.31)$$

On the other hand, letting $f(t) = \tilde\pi_{\mu-1}(t)$ in (6.3.16), one gets, by orthogonality, using (6.3.30) again, that
$$\beta_0^{1/2}\delta_{\mu-1,0} = \sum_{k=1}^n w_k\tilde\pi_{\mu-1}(t_k),$$

or in matrix form
$$Pw = \beta_0^{1/2}e_1, \qquad (6.3.32)$$
where $\delta_{\mu-1,0}$ is Kronecker's delta, $P \in \mathbb{R}^{n\times n}$ with $P_{\mu k} = \tilde\pi_{\mu-1}(t_k)$ is the matrix of eigenvectors, $w \in \mathbb{R}^n$ is the vector of Gaussian weights, and $e_1 = [1, 0, \dots, 0]^T \in \mathbb{R}^n$. Since the columns of $P$ are orthogonal, we have
$$P^TP = D, \quad D = \operatorname{diag}(d_1, d_2, \dots, d_n), \quad d_k = \sum_{\mu=1}^n\tilde\pi_{\mu-1}^2(t_k).$$

Now multiply (6.3.32) from the left by $P^T$ to obtain
$$Dw = \beta_0^{1/2}P^Te_1 = \beta_0^{1/2}\cdot\beta_0^{-1/2}e = e, \quad e = [1, 1, \dots, 1]^T.$$
Therefore, $w = D^{-1}e$, that is,
$$w_k = \frac{1}{\displaystyle\sum_{\mu=1}^n\tilde\pi_{\mu-1}^2(t_k)}, \quad k = 1, 2, \dots, n.$$

Comparing this with (6.3.31) establishes the desired result. 

For details on algorithmic aspects concerning orthogonal polynomials and Gaussian quadratures see
[34].
MATLAB Source 6.3 computes the nodes and the coefficients of a Gaussian quadrature formula by means of the eigenvalues and eigenvectors of the Jacobi matrix. The input parameters are the coefficients α and β in the recurrence relation. It uses the MATLAB function eig.

MATLAB Source 6.3 Compute nodes and coefficients of a Gaussian quadrature rule
function [g_nodes,g_coeff]=Gaussquad(alpha,beta)
%GAUSSQUAD - generate Gaussian quadrature formula
%computes nodes and coefficients for
%Gauss rules given alpha and beta
%method - Jacobi matrix
n=length(alpha); rb=sqrt(beta(2:n));
J=diag(alpha)+diag(rb,-1)+diag(rb,1);
[v,d]=eig(J);
g_nodes=diag(d);
g_coeff=beta(1)*v(1,:).^2;

MATLAB Source 6.4 Approximation of an integral using a Gaussian formula


function I=vquad(g_nodes,g_coeff,f)
I=g_coeff*f(g_nodes);

For the computation of the approximate value of an integral via a Gaussian rule, with nodes and coefficients computed by Gaussquad, we need a single line of code – see MATLAB Source 6.4. In the sequel we give MATLAB functions which compute the coefficients and the nodes of various types of Gaussian rules. They call the function Gaussquad. MATLAB Source 6.5 computes Gauss-Legendre nodes and coefficients.

MATLAB Source 6.5 Generate a Gauss-Legendre formula


function [g_nodes,g_coeff]=Gauss_Legendre(n)
%GAUSS-LEGENDRE - Gauss-Legendre nodes and coefficients

beta=[2,(4-([1:n-1]).^(-2)).^(-1)];
alpha=zeros(n,1);
[g_nodes,g_coeff]=Gaussquad(alpha,beta);
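A quick usage example (our own): combined with vquad from MATLAB Source 6.4, already five Gauss-Legendre nodes integrate a smooth function on [−1, 1] almost to machine accuracy:

% check: int_{-1}^{1} exp(x) dx = e - 1/e
[t,w]=Gauss_Legendre(5);
I=vquad(t,w,@exp);
err=abs(I-(exp(1)-exp(-1)))   %of order 1e-12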

Since a first kind Gauss-Chebyshev rule has equal coefficients, and its nodes are the roots of the first kind Chebyshev polynomial, MATLAB Source 6.6 does not use the Jacobi matrix. We also give sources for the second kind Gauss-Chebyshev rule (MATLAB Source 6.7), the Gauss-Hermite rule (MATLAB Source 6.8), the Gauss-Laguerre rule (MATLAB Source 6.9) and the Gauss-Jacobi rule (MATLAB Source 6.10).

MATLAB Source 6.6 Generate a first kind Gauss-Chebyshev formula


function [g_nodes,g_coeff]=Gauss_Cheb1(n)
%GAUSS_CHEB1 - Gauss-Cebisev #1 nodes and coefficients

g_coeff=pi/n*ones(1,n);
g_nodes=cos(pi*([1:n]'-0.5)/n);

MATLAB Source 6.7 Generate a second kind Gauss-Chebyshev formula


function [g_nodes,g_coeff]=Gauss_Cheb2(n)
%GAUSS_CHEB2 - Gauss-Chebyshev #2 nodes and coefficients

beta=[pi/2,1/4*ones(1,n-1)]; alpha=zeros(n,1);
[g_nodes,g_coeff]=Gaussquad(alpha,beta);

MATLAB Source 6.8 Generate a Gauss-Hermite formula


function [g_nodes,g_coeff]=Gauss_Hermite(n)
%GAUSS_HERMITE - Gauss-Hermite nodes and coefficients

beta=[sqrt(pi),[1:n-1]/2]; alpha=zeros(n,1);
[g_nodes,g_coeff]=Gaussquad(alpha,beta);

MATLAB Source 6.9 Generate a Gauss-Laguerre formula


function [g_nodes,g_coeff]=Gauss_Laguerre(n,a)
%GAUSS_LAGUERRE - Gauss-Laguerre nodes and coefficients

k=1:n-1;
alpha=[a+1, 2*k+a+1];
beta=[gamma(1+a),k.*(k+a)];
[g_nodes,g_coeff]=Gaussquad(alpha,beta);

MATLAB Source 6.10 Generate a Gauss-Jacobi formula


function [g_nodes,g_coeff]=Gauss_Jacobi(n,a,b)
%Gauss-Jacobi - Gauss-Jacobi nodes and coefficients

k=0:n-1;
k2=2:n-1;
%rec. relation coeffs
bet1=4*(1+a)*(1+b)/((2+a+b)^2)/(3+a+b);
bet=[2^(a+b+1)*beta(a+1,b+1), bet1, 4*k2.*(k2+a+b).*(k2+a).*...
(k2+b)./(2*k2+a+b-1)./(2*k2+a+b).^2./(2*k2+a+b+1)];
if a==b
alpha=zeros(1,n);
else
alpha=(b^2-a^2)./(2*k+a+b)./(2*k+a+b+2);
end
[g_nodes,g_coeff]=Gaussquad(alpha,bet);

(v) Markov⁵ observed in 1885 that the Gauss quadrature formula can also be obtained by Hermite interpolation at the nodes $t_k$, each counted as a double node. Indeed, writing (with $x_k$ in place of $t_k$)
$$f(x) = (H_{2n-1}f)(x) + u_n^2(x)\,f[x, x_1, x_1, \dots, x_n, x_n],$$
we get
$$\int_a^b w(x)f(x)\,dx = \int_a^b w(x)(H_{2n-1}f)(x)\,dx + \int_a^b w(x)u_n^2(x)\,f[x, x_1, x_1, \dots, x_n, x_n]\,dx.$$
But the degree of exactness $2n-1$ implies
$$\int_a^b w(x)(H_{2n-1}f)(x)\,dx = \sum_{i=1}^n w_i(H_{2n-1}f)(x_i) = \sum_{i=1}^n w_i f(x_i),$$
so
$$R_n(f) = \int_a^b w(x)u_n^2(x)\,f[x, x_1, x_1, \dots, x_n, x_n]\,dx.$$
Since $w(x)u_n^2(x) \ge 0$, applying the Mean Value Theorem for integrals and the Mean Value Theorem for divided differences, we get
$$R_n(f) = f[\eta, x_1, x_1, \dots, x_n, x_n]\int_a^b w(x)u_n^2(x)\,dx = \frac{f^{(2n)}(\xi)}{(2n)!}\int_a^b w(x)\left[\pi_n(x; w)\right]^2dx, \quad \xi \in [a,b].$$

For orthogonal polynomials and their recursion coefficients αk , βk see Table 5.2, page 175.

6.4 Adaptive Quadratures


In a numerical integration method errors depend not only on the size of the interval, but also on values
of certain higher order derivatives of the function to be integrated. This implies that the methods do
not work well for functions having large values of higher order derivatives — especially for functions

⁵ Andrei Andrejevich Markov (1856-1922) was a Russian mathematician active in St. Petersburg who made important contributions to probability theory, number theory, and constructive approximation theory.

having large oscillations on the whole interval or on some subintervals. It is reasonable to use small
subintervals where the derivatives have large values and large subintervals where the derivatives have
small values. A method which does this systematically is called an adaptive quadrature.
The general approach in an adaptive quadrature is to use two different methods on each interval, to compare the results, and to divide the interval when the difference is large. There are situations when both methods are bad and the results are poor, yet their difference is small. We can avoid this situation by taking one method which overestimates the result and another which underestimates it. We shall give an example of the general structure of a recursive adaptive quadrature (MATLAB Source 6.11). The

MATLAB Source 6.11 Adaptive quadrature


function I=adaptquad(f,a,b,eps,g)
%ADAPTQUAD adaptive quadrature
%call I=ADAPTQUAD(F,A,B,EPS,G)
%F - integrand
%A,B - endpoints
%EPS -tolerance
%G - composed rule used on subintervals

m=4;
I1=g(f,a,b,m);
I2=g(f,a,b,2*m);
if abs(I1-I2) < eps %success
I=I2;
return
else %recursive subdivision
I=adaptquad(f,a,(a+b)/2,eps,g)+adaptquad(f,(a+b)/2,b,eps,g);
end

parameter g is a function that implements a composite quadrature rule, such as the composite trapezoidal or the composite Simpson rule. The algorithm has a divide-and-conquer structure.
In contrast to other methods, which decide in advance what amount of work is needed to achieve the desired accuracy, an adaptive quadrature computes only as much as is necessary. This means that the absolute tolerance ε must be chosen so as to avoid an infinite loop when one tries to achieve an accuracy which cannot be attained. The number of steps depends on the behavior of the function to be integrated.
A possible improvement: the tolerance can be scaled by the ratio of the current interval length to the whole interval length, as sketched below.
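For instance, the recursive branch of adaptquad could be modified as follows (a sketch of this idea, not the code used elsewhere in this book):

else %recursive subdivision with scaled tolerance
    I=adaptquad(f,a,(a+b)/2,eps/2,g)+adaptquad(f,(a+b)/2,b,eps/2,g);
end

Each half receives half of the tolerance, so that the errors committed on the two subintervals can add up to at most the prescribed eps.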

Example 6.4.1. Approximate the length of a sinusoid on an interval of length equal to a period.
We have to approximate
$$I = \int_0^{2\pi}\sqrt{1 + \cos^2 x}\,dx.$$

The integrand in MATLAB is

function y=lsin(x)
y=sqrt(1+cos(x).^2);

Here are two examples of using adaptquad



>> format long


>> I=adaptquad(@lsin,0,2*pi,1e-8,@Simpson)
I =
7.64039557805542
>> I=adaptquad(@lsin,0,2*pi,1e-8,@trapezes)
I =
7.64039557011458

We recommend the reader to compare the running times. ♦

For supplementary details on design and implementation of adaptive quadrature formulas see [31].

6.5 Iterated Quadratures. Romberg Method


A drawback of the previous adaptive quadrature is that it repeatedly computes function values at the same nodes; a time penalty also appears at run time, due to recursion or to stack management in an iterative implementation. Iterated quadratures remove these drawbacks. They apply at the first step a composite quadrature rule and then successively halve the subintervals, reusing at each step the previously computed approximations. We shall illustrate this technique using a method that starts from the composite trapezoidal rule and then improves the convergence using Richardson extrapolation.
The first step involves applying the composite trapezoidal rule for $n_1 = 1$, $n_2 = 2$, ..., $n_p = 2^{p-1}$, where $p \in \mathbb{N}^*$. The step size $h_k$ corresponding to $n_k$ is
$$h_k = \frac{b-a}{n_k} = \frac{b-a}{2^{k-1}}.$$
Using these notations, the trapezoidal rule becomes
$$\int_a^b f(x)\,dx = \frac{h_k}{2}\left[f(a) + f(b) + 2\sum_{i=1}^{2^{k-1}-1}f(a + ih_k)\right] - \frac{b-a}{12}h_k^2 f''(\mu_k), \qquad (6.5.1)$$
$\mu_k \in (a,b)$.
Let $R_{k,1}$ denote the result of the approximation in accordance with (6.5.1). Then
$$R_{1,1} = \frac{h_1}{2}\left[f(a) + f(b)\right] = \frac{b-a}{2}\left[f(a) + f(b)\right], \qquad (6.5.2)$$
$$R_{2,1} = \frac{h_2}{2}\left[f(a) + f(b) + 2f(a + h_2)\right] = \frac{b-a}{4}\left[f(a) + f(b) + 2f\!\left(a + \frac{b-a}{2}\right)\right] = \frac{1}{2}\left[R_{1,1} + h_1 f\!\left(a + \frac{1}{2}h_1\right)\right],$$
and generally
$$R_{k,1} = \frac{1}{2}\left[R_{k-1,1} + h_{k-1}\sum_{i=1}^{2^{k-2}}f\!\left(a + \left(i - \frac{1}{2}\right)h_{k-1}\right)\right], \quad k = 2, \dots, n. \qquad (6.5.3)$$

Now follows the improvement by Richardson⁶ extrapolation (an idea which, in fact, goes back to Archimedes⁷). By (6.5.1),
$$I = \int_a^b f(x)\,dx = R_{k-1,1} - \frac{b-a}{12}h_{k-1}^2 f''(\mu) + O(h_{k-1}^4).$$
We shall eliminate the $h_{k-1}^2$ term by combining the two equations
$$I = R_{k-1,1} - \frac{b-a}{12}h_{k-1}^2 f''(\mu) + O(h_{k-1}^4),$$
$$I = R_{k,1} - \frac{b-a}{12}h_k^2 f''(\mu) + O(h_k^4) = R_{k,1} - \frac{b-a}{48}h_{k-1}^2 f''(\mu) + O(h_{k-1}^4),$$
where, on account of the expansion (6.5.5) below, the coefficient of the $h^2$ term may be treated as the same constant in both equations.

We get
$$I = \frac{4R_{k,1} - R_{k-1,1}}{3} + O(h^4).$$
We define
$$R_{k,2} = \frac{4R_{k,1} - R_{k-1,1}}{3}. \qquad (6.5.4)$$
One now applies Richardson extrapolation to these values. If $f \in C^{2n+2}[a,b]$, then for $k = 1, \dots, n$ we may write
$$\int_a^b f(x)\,dx = \frac{h_k}{2}\left[f(a) + f(b) + 2\sum_{i=1}^{2^{k-1}-1}f(a + ih_k)\right] + \sum_{i=1}^k K_i h_k^{2i} + O(h_k^{2k+2}), \qquad (6.5.5)$$
where $K_i$ does not depend on $h_k$.

⁶ Lewis Fry Richardson (1881-1953), born, educated, and active in England, did pioneering work in numerical weather prediction, proposing to solve the hydrodynamical and thermodynamical equations of meteorology by finite difference methods. He also did a penetrating study of atmospheric turbulence, where a nondimensional quantity introduced by him is now called "Richardson's number". At the age of 50 he earned a degree in psychology and began to develop a scientific theory of international relations. He was elected fellow of the Royal Society in 1926.

⁷ Archimedes (287 B.C. - 212 B.C.), Greek mathematician from Syracuse. The achievements of Archimedes are quite outstanding. He is considered by most historians of mathematics as one of the greatest mathematicians of all time. He perfected a method of integration which allowed him to find areas, volumes and surface areas of many bodies. Archimedes was able to apply the method of exhaustion, which is the early form of integration. He also gave an accurate approximation to π and showed that he could approximate square roots accurately. He invented a system for expressing large numbers. In mechanics Archimedes discovered fundamental theorems concerning the centre of gravity of plane figures and solids. His most famous theorem gives the weight of a body immersed in a liquid, called Archimedes' principle. He defended his town during the Romans' siege.

Formula (6.5.5) can be justified as follows. Let $a_0 = \int_a^b f(x)\,dx$ and
$$A(h) = \frac{h}{2}\left[f(a) + 2\sum_{k=1}^{n-1}f(a + kh) + f(b)\right], \quad h = \frac{b-a}{n}$$
(the approximation obtained using the trapezoidal rule).


If f ∈ C 2k+1 [a, b], k ∈ N∗ , then the following formula holds

A(h) = a0 + a1 h2 + a2 h4 + · · · + ak h2k + O(h2k+1 ), h→0 (6.5.6)

where
B2k (2k−1)
ak = [f (b) − f (2k−1) (a)], k = 1, 2, . . . , K.
(2k)!
The quantities Bk are the coefficients in the expansion

X Bk k∞
z
z
= z , |z| < 2π;
e −1 k=0
k!

they are called Bernoulli 8 numbers.
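The first Bernoulli numbers are easily generated from the equivalent recursion $\sum_{j=0}^{k}\binom{k+1}{j}B_j = 0$ for $k \ge 1$, with $B_0 = 1$; a small sketch (our own illustration):

K=8; B=zeros(1,K+1); B(1)=1;    %B(1) stores B_0
for k=1:K
    s=0;
    for j=0:k-1
        s=s+nchoosek(k+1,j)*B(j+1);
    end
    B(k+1)=-s/(k+1);
end
B   %-> 1, -1/2, 1/6, 0, -1/30, 0, 1/42, 0, -1/30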


By eliminating successively the powers of $h$ in (6.5.5) one obtains
$$R_{k,j} = \frac{4^{j-1}R_{k,j-1} - R_{k-1,j-1}}{4^{j-1} - 1}, \quad k = 2, \dots, n, \quad j = 2, \dots, k.$$
The computation can be performed in a tabular fashion:

R1,1
R2,1  R2,2
R3,1  R3,2  R3,3
...   ...   ...   ...
Rn,1  Rn,2  Rn,3  ...  Rn,n

Since (Rn,1 ) is convergent, (Rn,n ) is also convergent, but faster than (Rn,1 ). One may choose as
stopping criterion |Rn−1,n−1 − Rn,n | ≤ ε.
We give an implementation of Romberg’s method (see MATLAB Source 6.12, the M-file Romberg.m).

Example 6.5.1. One can solve the problem in Example 6.4.1 as follows:

⁸ Jacob Bernoulli (1654-1705), the elder brother of Johann Bernoulli, was active in Basel. He was one of the first to appreciate Leibniz's and Newton's differential and integral calculus and enriched it by many original contributions of his own, often in (not always amicable) competition with his younger brother. He is also known in probability theory for his "law of large numbers".

MATLAB Source 6.12 Romberg method


function [I,nfe]=Romberg(f,a,b,epsi,nmax)
%ROMBERG - approximate an integral using Romberg's method
%call [I,NFE]=ROMBERG(F,A,B,EPSI,NMAX)
%f - integrand
%a,b - integration limits
%epsi - tolerance
%nmax - maximum number of iterations
%nfe - number of function evaluations
if nargin < 5
    nmax=10;
end
if nargin < 4
    epsi=1e-3;
end
R=zeros(nmax,nmax);
h=b-a; nfe=2;
% first iteration
R(1,1)=h/2*(sum(f([a,b])));
for k=2:nmax
    %trapezoidal formula on the halved grid
    x=a+([1:2^(k-2)]-0.5)*h;
    R(k,1)=0.5*(R(k-1,1)+h*sum(f(x)));
    nfe=nfe+length(x);
    %extrapolation
    plj=4;
    for j=2:k
        R(k,j)=(plj*R(k,j-1)-R(k-1,j-1))/(plj-1);
        plj=plj*4;
    end
    if (abs(R(k,k)-R(k-1,k-1))<epsi)&&(k>3)
        I=R(k,k);
        return
    end
    %halving step
    h=h/2;
end
error('iteration number exceeded')

>> I=Romberg(@lsin,0,2*pi,1e-8)

I =
7.64039557805609

Formula (6.5.6) is called the Euler⁹-Maclaurin¹⁰ formula.

6.6 Adaptive Quadratures II


The second column of Romberg's table corresponds to composite Simpson approximations. We introduce the notation
$$S_{k,1} = R_{k,2}.$$
The third column is a combination of two Simpson approximations:
$$S_{k,2} = S_{k,1} + \frac{S_{k,1} - S_{k-1,1}}{15} = R_{k,2} + \frac{R_{k,2} - R_{k-1,2}}{15}.$$
We shall use the relation
$$S_{k,2} = S_{k,1} + \frac{S_{k,1} - S_{k-1,1}}{15} \qquad (6.6.1)$$
to devise an adaptive quadrature.
Let $c = (a+b)/2$ and $h = b-a$. The elementary Simpson formula is
$$S = \frac{h}{6}\left(f(a) + 4f(c) + f(b)\right).$$
⁹ Leonhard Euler (1707-1783) was the son of a minister interested in mathematics, who followed lectures of Jakob Bernoulli at the University of Basel. Euler himself was allowed to see Johann Bernoulli on Saturday afternoons for private tutoring. At the age of 20, after he was unsuccessful in obtaining a professorship in physics at the University of Basel, because of a lottery system then in use (Euler lost), he emigrated to St. Petersburg; later, he moved on to Berlin, and then back to St. Petersburg. Euler unquestionably was the most prolific mathematician of the 18th century, working in virtually all branches of the differential and integral calculus and, in particular, being one of the founders of the calculus of variations. He also did pioneering work in the applied sciences, notably hydrodynamics, mechanics of deformable materials and rigid bodies, optics, astronomy and the theory of the spinning top. Not even his blindness at the age of 59 managed to break his phenomenal productivity. Euler's collected works are still being edited, 71 volumes having already been published.

¹⁰ Colin Maclaurin (1698-1768) was a Scottish mathematician who applied the new infinitesimal calculus to various problems in geometry. He is best known for his power series expansion, but also contributed to the theory of equations.

For two subintervals one obtains
$$S_2 = \frac{h}{12}\left(f(a) + 4f(d) + 2f(c) + 4f(e) + f(b)\right),$$
where $d = (a+c)/2$ and $e = (c+b)/2$. Applying (6.6.1) to $S$ and $S_2$ yields
$$Q = S_2 + (S_2 - S)/15.$$
Now we are able to give a recursive algorithm for the approximation of our integral. The function adquad evaluates the integral by applying Simpson's rule. It calls quadstep recursively and applies extrapolation. The implementation appears in MATLAB Source 6.13.

Example 6.6.1. Problem in Example 6.4.1 can be solved using adquad:

>> I=adquad(@lsin,0,2*pi,1e-8)

I =
7.64039557801944

6.7 Numerical Integration in MATLAB


MATLAB has two main functions for numerical integration, quad and quadl. Both require a and b
to be finite and the integrand to have no singularities on [a, b]. For infinite integrals and integrals with
singularities a variety of approaches can be used in order to produce an integral that can be handled by
quad and quadl; these include change of variable, integration by parts, and analytic treatment of the
integral over part of the range. See numerical analysis textbooks for details, for example [88, 33, 101,
22, 83].
The most frequent usage is q = quad(fun,a,b,tol) (and similarly for quadl), where fun
specifies the function to be integrated. It can be given as a string, an inline object or a function handle.
The argument tol is an absolute error tolerance, which defaults to 1e-6. A recommended value is a
small multiple of eps times an estimate of the integral.
The form q = quad(fun,a,b,tol,trace) with a nonzero trace shows the values of
[fcount a b-a Q] calculated during the recursion.
[q,fcount] = quad(...) returns also the number of function evaluations.
Suppose we want to approximate $\int_0^\pi x\sin x\,dx$. We can store the integrand in an M-file, say xsin.m:

function y=xsin(x)
y=x.*sin(x);

The approximate value is computed by:

>> quad(@xsin,0,pi)
ans =
3.1416

The quad function is an implementation of a Simpson-type quadrature, as described in Section 6.6 or in [66]. quadl is more accurate and is based on a 4-point Gauss-Lobatto formula (with degree of exactness 5) and a 7-point Kronrod extension (with degree of exactness 9), described in [31]. Both routines use adaptive quadrature. They break the range of integration into subintervals and apply the basic

MATLAB Source 6.13 Adaptive quadrature, variant


function [Q,fcount] = adquad(F,a,b,tol,varargin)
%ADQUAD adaptive quadrature
%call [Q,fcount] = adquad(F,a,b,tol,varargin)
% F - integrand
% a,b - interval endpoints
% tol - tolerance, default 1.e-6.
% other arguments are passed to the integrand, F(x,p1,p2,..).

% make F callable by feval.


if ischar(F) & exist(F)~=2
F = inline(F);
elseif isa(F,'sym')
F = inline(char(F));
end
if nargin < 4 | isempty(tol), tol = 1.e-6; end

% Initialization
c = (a + b)/2;
fa = feval(F,a,varargin{:}); fc = feval(F,c,varargin{:});
fb = feval(F,b,varargin{:});

% Recursive call
[Q,k] = quadstep(F, a, b, tol, fa, fc, fb, varargin{:});
fcount = k + 3;

% ---------------------------------------------------------

function [Q,fcount] = quadstep(F,a,b,tol,fa,fc,fb,varargin)


% Recursive subfunction called by adquad

h = b - a;
c = (a + b)/2;
fd = feval(F,(a+c)/2,varargin{:});
fe = feval(F,(c+b)/2,varargin{:});
Q1 = h/6 * (fa + 4*fc + fb);
Q2 = h/12 * (fa + 4*fd + 2*fc + 4*fe + fb);
if abs(Q2 - Q1) <= tol
Q = Q2 + (Q2 - Q1)/15;
fcount = 2;
else
[Qa,ka] = quadstep(F, a, c, tol, fa, fd, fc, varargin{:});
[Qb,kb] = quadstep(F, c, b, tol, fc, fe, fb, varargin{:});
Q = Qa + Qb;
fcount = ka + kb + 2;
end

integration rule over each subinterval. They choose the subintervals according to the local behavior
of the integrand, placing the smallest ones where the integrand is changing most rapidly. Warning
messages are produced if the subintervals become very small or if an excessive number of function
evaluations is used, either of which could indicate that the integrand has a singularity.
To illustrate how quad and quadl work, we shall approximate
$$\int_0^1\left[\frac{1}{(x-0.3)^2 + 0.01} + \frac{1}{(x-0.9)^2 + 0.04} - 6\right]dx.$$
The integrand is the MATLAB function humps, used to test numerical integration functions and in MATLAB demos. We shall apply quad to this function with tol=1e-4. Figure 6.2 plots the integrand and shows with tick marks on the x-axis where the integrand was evaluated; circles mark the corresponding values of the integrand. The figure shows that the subintervals are smallest where the integrand is most rapidly varying. We obtained it by modifying the quad MATLAB function.

Figure 6.2: Numerical integration of the humps function by quad; the computed value of the integral is 29.8583

The next example approximates the Fresnel integrals
$$x(t) = \int_0^t\cos(u^2)\,du, \qquad y(t) = \int_0^t\sin(u^2)\,du.$$

These are the parametric equations of a curve called the Fresnel spiral. Figure 6.3 shows its graphical representation, produced by sampling 1000 equally spaced points on the interval $[-4\pi, 4\pi]$. For efficiency reasons, we exploit the symmetry and avoid repeatedly integrating over $[0,t]$ by integrating over each subinterval and then evaluating the cumulative sums using cumsum. Here is the code:

n = 1000; x = zeros(1,n); y = x;
i1 = inline('cos(x.^2)'); i2 = inline('sin(x.^2)');
t=linspace(0,4*pi,n);

for i=1:n-1
x(i) = quadl(i1,t(i),t(i+1),1e-3);
y(i) = quadl(i2,t(i),t(i+1),1e-3);
end
x = cumsum(x); y = cumsum(y);
plot([-x(end:-1:1),0,x], [-y(end:-1:1),0,y])
axis equal

Figure 6.3: Fresnel spiral

To integrate functions given by values rather than by an analytical expression, one uses the trapz function. It implements the composite trapezoidal rule. The nodes need not be equally spaced. As we have already seen in Section 6.3.1, it works well when integrating periodic functions over intervals whose length is an integer multiple of the period. Example:
>> x=linspace(0,2*pi,10);
>> y=1./(2+sin(x));
>> trapz(x,y)
ans =
3.62759872810065
>> 2*pi*sqrt(3)/3-ans
ans =
3.677835813675756e-010

The exact value of the integral is $2\pi\sqrt{3}/3$, so the error is less than $10^{-9}$.
The function quadv accepts an array-valued integrand and returns the corresponding array of integrals.
A newer function is quadgk. It implements adaptive quadrature based on a Gauss-Kronrod pair
(15th and 7th order formulas). Besides the approximate value of the integral, it can return an error
bound and it accepts several options which control the integration process (for example we can specify
the singularities). If the integrand is complex-valued or the limits are complex, it integrates on a straight
line within the complex plane. We give two examples. The first computes
Z 1
sin x
dx.
−1 x

Figure 6.4: Section of the ellipsoid: the radius ρ(x) for −1/β ≤ x ≤ 1/β

>> format long


>> ff=@(x) sin(x)./x;
>> quadgk(ff,-1,1,'RelTol',1e-8,'AbsTol',1e-12)
ans =
1.892166140734366

Notice that quad and quadl fail; they return NaN. Nevertheless, they succeed if we compute the integral as
$$\int_{-1}^1\frac{\sin x}{x}\,dx = \int_{-1}^0\frac{\sin x}{x}\,dx + \int_0^1\frac{\sin x}{x}\,dx.$$
The second example uses Waypoints to integrate around a pole using a piecewise linear contour:

>> Q = quadgk(@(z)1./(2*z - 1),-1-i,-1-i,'Waypoints',[1-i,1+i,-1+i])


Q =
0.0000 + 3.1416i

See doc quadgk and [82] for further information.

6.8 Applications
We present here two applications adapted after [72].

6.8.1 Computation of an ellipsoid surface


Consider an ellipsoid obtained by rotating the ellipse in Figure 6.4 around the x axis.
The radius ρ is described as a function of the axial coordinate by the equation
$$\rho^2(x) = \alpha^2(1 - \beta^2x^2), \quad -\frac{1}{\beta} \le x \le \frac{1}{\beta},$$

where α and β are such that $\alpha^2\beta^2 < 1$. For tests we chose the following values of the parameters: $\alpha = (\sqrt{2}-1)/10$, $\beta = 10$. The surface is given by
$$I(f) = 4\pi\alpha\int_0^{1/\beta}\sqrt{1 - K^2x^2}\,dx,$$
where $K^2 = \beta^2\sqrt{1 - \alpha^2\beta^2}$. Since $|f'|$ is large near $x = 1/\beta$, an adaptive quadrature seems to be appropriate. We can compute the exact value and its floating point approximation using the Symbolic Math Toolbox:
We can compute the exact value and its floating point approximation using Symbolic Math Toolbox:
clear
syms alpha beta K2 s2 x f vI
s2=sqrt(sym(2));
alpha=(s2-1)/10;
beta=sym(10);
K2=beta^2*sqrt(1-alpha^2*beta^2);
f=sqrt(1-K2*x^2);
vI=4*sym(pi)*alpha*int(f,0,1/beta)
vpa(vI,16)
The results are:
vI =
1/100*pi*(-2*(1-(-2+2*2^(1/2))^(1/2))^(1/2) + 2*(1-(-2+2*2^(1/2))^(1/2))^(1/2)*2^(1/2)
          + (-2+2*2^(1/2))^(3/4)*asin((-2+2*2^(1/2))^(1/4)))
ans =
0.04234752094082434
The next script approximates the surface with a tolerance of 1e-8 using the following functions:
Romberg, adquad, and MATLAB quad and quadl.
err=1e-8;
beta=10;
alpha=(sqrt(2)-1)/10;
alpha2=alpha^2;
beta2=beta^2;
K2=beta2*sqrt(1-alpha2*beta2);
f=@(x) sqrt(1-K2*x.^2);
fpa=4*pi*alpha;
[vi(1),nfe(1)]=Romberg(f,0,1/beta,err,100);
[vi(2),nfe(2)]=adquad(f,0,1/beta,err);
[vi(3),nfe(3)]=quad(f,0,1/beta,err);
[vi(4),nfe(4)]=quadl(f,0,1/beta,err);
vi=fpa*vi;
meth={'Romberg','adquad','quad','quadl'};
for i=1:4
fprintf('%8s %18.16f %3d\n',meth{i},vi(i),nfe(i))
end
Here is the output:

Romberg 0.0423475209214685 129


adquad 0.0423475209189811 65
quad 0.0423475203088494 37
quadl 0.0423475209279265 48

Romberg method is inferior to adaptive quadratures. Surprisingly, quad beats quadl.

6.8.2 Computation of the wind action on a sailboat mast


The sailboat schematically drawn in Figure 6.5(a) is subject to the action of wind force.
The straight line AB represent the mast, of length L, and BO is one of the two shrouds (strings for
the side stiffening of the mast).
Any infinitesimal element of the sail transmits to the corresponding element of length dx of the
mast a force of magnitude equal to f (x)dx. The change of f along with the height x, measured from
the point A (basis of the mast), is expressed by the following law
αx −γx
f (x) = e ,
x+β
where α, β and γ are given constants.
The resultant R of the force f is defined as

ZL
R = I(f ) ≡ f (x)dx
0

and is applied at distance equal to b (to be determined ) from the basis of the mast. The formula for b is
b = I(xf )/I(f ).

Figure 6.5: Schematic representation of a sailboat (left); forces acting on the mast (right)

Computing R and b is crucial for the structural design of the mast and shroud section. Once the values of R and b are known, it is possible to analyze the hyperstatic structure mast-shroud. This analysis yields the reactions V, H, and M at the base of the mast and the traction T transmitted by the shroud (see Figure 6.5, right). In a next step, the internal actions in the structure can be found, as well as the maximum stresses arising in the mast AB and in the shroud BO; from these, assuming that the safety verifications are satisfied, one can finally design the geometric parameters of the sections AB and BO.

We approximate R and b by adquad and the MATLAB quad and quadl functions. The function sailboat in MATLAB Source 6.14 computes the approximations and plots the number of function evaluations versus the negative decimal logarithm of the error tolerance. We test the function for α = 50, β = 5/3, γ = 1/4,

MATLAB Source 6.14 Computation of the wind action on a sailboat mast


function sailboat(alpha, beta, gamma, L)
%SAILBOAT - computation of wind action on
% a sailboat mast
% Alfio Quarteroni, Riccardo Sacco, Fausto Saleri
% Numerical Mathematics
% Springer 2000

f = @(x) alpha*x./(x+beta).*exp(-gamma*x);
xf = @(x) x.*f(x);
x=1:9;
err=10.^(-x);
for k=1:9
[R1,ne1(k)]= adquad(f,0,L,err(k));
[b1,neb1(k)]= adquad(xf,0,L,err(k)); b1=b1/R1;
[R2,ne2(k)]= quad(f,0,L,err(k));
[b2,neb2(k)]= quad(xf,0,L,err(k)); b2=b2/R2;
[R3,ne3(k)]=quadl(f,0,L,err(k));
[b3,neb3(k)]=quadl(xf,0,L,err(k)); b3=b3/R3;
end
subplot(1,2,1)
plot(x,ne1,'b-x',x,ne2,'r-+',x,ne3,'g--d')
xlabel('-log_{10}(err)','FontSize',14); ylabel('n','FontSize',14)
legend('adquad','quad','quadl',0)
title('Computation of \it{R}','FontSize',14)
subplot(1,2,2)
plot(x,neb1,'b-x',x,neb2,'r-+',x,neb3,'g--d')
xlabel('-log_{10}(err)','FontSize',14); ylabel('n','FontSize',14)
legend('adquad','quad','quadl',0)
title('Computation of \it{b}','FontSize',14)
R1,R2,R3
b1,b2,b3

and L = 10. Required tolerances are in the range 10−1 to 10−9 . The calling command and the results
are given below:

>> sailboat(50, 5/3, 1/4, 10)


R1 =
100.061368317961
R2 =
100.061368317941
R3 =
100.061368317962
b1 =

4.03145652950332
b2 =
4.03145652950425
b3 =
4.03145652950326

See Figure 6.6 for the number of function evaluations required to attain the desired tolerances.

Figure 6.6: Number of function evaluations vs. tolerance for R (left) and b (right)

Problems
Problem 6.1. Approximate
$$\int_0^1\frac{\sin x}{x}\,dx,$$
using an adaptive quadrature and Romberg method. What kind of problems could occur? Compare
your result to that provided by quad or quadl.

Problem 6.2. Starting from a convenient integral, approximate π with 8 exact decimal digits, using
Romberg method and an adaptive quadrature.

Problem 6.3. Approximate
$$\int_{-1}^1\frac{2}{1 + x^2}\,dx$$
using the composite trapezoidal rule and the composite Simpson rule, for various values of n. How does the accuracy vary with n? Give a graphical representation.

Problem 6.4. The error function, erf, is defined by
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt.$$

Tabulate its values for x = 0.1, 0.2, . . . , 1, using adquad function. Compare the results with those
provided by MATLAB functions quad and erf.

Problem 6.5. (a) Use adquad and the MATLAB function quad to approximate
$$\int_{-1}^2\frac{1}{\sqrt{\sin|t|}}\,dt.$$
(b) Why do problems like "division by zero" not occur at t = 0?

Problem 6.6. For a number $p \in \mathbb{N}$, consider the integral
$$I_p = \int_0^1(1-t)^p f(t)\,dt.$$
Compare the composite trapezoidal rule for n subintervals with the Gauss-Jacobi formula with n nodes and parameters α = p and β = 0. Take, for example, $f(t) = \tan t$, p = 5(5)20, and n = 10(10)50 for the composite trapezoidal rule, n = 1(1)5 for the Gauss-Jacobi formula.

Problem 6.7. Let


f (x) = ln(1 + x) ln(1 − x).
(a) Use ezplot to plot f (x) for x ∈ [−1, 1].
(b) Use Maple or Symbolic Math toolbox to obtain the exact value of the integral
$$\int_{-1}^1 f(x)\,dx.$$

(c) Find the numerical value of the expression in (b).


(d) What problem occurs if we try to approximate the integral by

adquad('log(1+x).*log(1-x)', -1, 1)?

(e) How can you avoid the difficulty? Argue your solution.
(f) Use adquad with various accuracies (tolerances). Plot the error and the number of function
evaluations as functions of tolerance.

Problem 6.8. A sphere of radius R floats half submerged in a liquid. If it is pushed down until the diametral plane is a distance p (0 < p ≤ R) below the surface and is then released, the period of the resulting vibration is
$$T = 8R\sqrt{\frac{R}{g(6R^2 - p^2)}}\int_0^{2\pi}\frac{dt}{\sqrt{1 - k^2\sin^2 t}},$$
where $k^2 = p^2/(6R^2 - p^2)$ and $g = 10\,\mathrm{m/s^2}$. For R = 1 and p = 0.5, 0.75, 1.0 find T.
CHAPTER 7

Numerical Solution of Nonlinear Equations

7.1 Nonlinear Equations


The problems discussed in this chapter may be written generically in the form

f (x) = 0, (7.1.1)

but allow different interpretations depending on the meaning of x and f. The simplest case is a single equation in a single unknown, in which case f is a given function of a real or complex variable, and we are trying to find values of this variable for which f vanishes. Such values are called roots of the equation (7.1.1) or zeros of the function f. If x in (7.1.1) is a vector, say $x = [x_1, x_2, \dots, x_d]^T \in \mathbb{R}^d$, and f is also a vector, each component of which is a function of the d variables $x_1, x_2, \dots, x_d$, then (7.1.1) represents a system of equations. The system is nonlinear if at least one component of f depends nonlinearly on at least one of the variables $x_1, x_2, \dots, x_d$. If all components of f are linear functions of $x_1, \dots, x_d$, we call (7.1.1) a system of linear algebraic equations. Still more generally, (7.1.1) could
represent a functional equation, if x is an element of some function space and f a (linear or nonlinear)
operator acting on this space. In each of these interpretations, the zero on the right of (7.1.1) has a
different meaning: the number zero in the first case, the zero vector in the second, and the function
identically equal to zero in the last case.
Much of this chapter is devoted to single nonlinear equations. Such equations are often encountered
in the analysis of vibrating systems, where the roots correspond to critical frequencies (resonance).
The special case of algebraic equations, where f in (7.1.1) is a polynomial, is also of considerable
importance and deserves a special treatment.

7.2 Iterations, Convergence, and Efficiency


Even the simplest of nonlinear equations, for example algebraic equations, are known not to admit solutions that are expressible rationally in terms of the data. It is therefore impossible, in general, to

compute roots of nonlinear equations in a finite number of arithmetic operations. What is required is an iterative method, that is, a procedure that generates an infinite sequence of approximations $\{x_n\}_{n\in\mathbb{N}}$ such that
$$\lim_{n\to\infty}x_n = \alpha \qquad (7.2.1)$$
for some root α of the equation. In the case of a system of equations, both $x_n$ and α are vectors of appropriate dimension, and convergence is to be understood in the sense of componentwise convergence.
Although convergence of an iterative process is certainly desirable, it takes more than just convergence to make it practical. What one wants is fast convergence. A basic concept to measure the speed of convergence is the order of convergence.

Definition 7.2.1. One says that $x_n$ converges to α (at least) linearly if
$$|x_n - \alpha| \le e_n, \qquad (7.2.2)$$
where $\{e_n\}$ is a positive sequence satisfying
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n} = c, \quad 0 < c < 1. \qquad (7.2.3)$$
If (7.2.2) and (7.2.3) hold with equality in (7.2.2), then c is called the asymptotic error constant.

The phrase “at least” in this definition relates to the fact that we have only inequality in (7.2.2),
which in practice is all we can usually ascertain. So, strictly speaking, it is the bounds en that converge
linearly, meaning that (e.g. for n large enough) each of these error bounds is approximately a constant
fraction of the preceding one.

Definition 7.2.2. One says that $x_n$ converges to α with (at least) order $p \ge 1$ if (7.2.2) holds with
$$\lim_{n\to\infty}\frac{e_{n+1}}{e_n^p} = c, \quad c > 0. \qquad (7.2.4)$$
Thus, convergence of order 1 is the same as linear convergence, whereas convergence of order
p > 1 is faster. Note that in this latter case there is no restriction on the constant c: once en is small
enough, it will be the exponent p that takes care of the convergence. If we have equality in (7.2.2), c is
again referred to as the asymptotic error constant.
The same definitions apply also to vector-valued sequences; one only needs to replace absolute
values in (7.2.2) by (any) vector norm.
The classification of convergence with respect to order is still rather crude, as there are types of
convergence that “fall between the cracks”. Thus, a sequence {en } may converge to zero slower than
linearly, for example such that c = 1 in (7.2.3). We call this type of convergence sublinear. Likewise,
c = 0 in (7.2.3) gives rise to superlinear convergence, if (7.2.4) does not hold for any p > 1.
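In practice, the order p can be estimated from three consecutive errors, since (7.2.4) gives $p \approx \log(e_{n+1}/e_n)/\log(e_n/e_{n-1})$ for n large. A small sketch (our own illustration, with a made-up error sequence):

e=[1e-1 5e-3 1.3e-5 8.5e-11];   %hypothetical errors of an iteration
p=log(e(3:end)./e(2:end-1))./log(e(2:end-1)./e(1:end-2))
%values close to 2 indicate quadratic convergence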
It is instructive to examine the behavior of $e_n$ if, instead of the limit relations (7.2.3) and (7.2.4), we had strict equality from some n on, say,
$$\frac{e_{n+1}}{e_n^p} = c, \quad n = n_0, n_0+1, n_0+2, \dots \qquad (7.2.5)$$
For $n_0$ large enough, this is almost true. A simple induction argument then shows that
$$e_{n_0+k} = c^{\frac{p^k-1}{p-1}}e_{n_0}^{p^k}, \qquad (7.2.6)$$
which certainly holds for $p > 1$, but also for $p = 1$ in the limit as $p \downarrow 1$:
$$e_{n_0+k} = c^k e_{n_0}, \quad k = 0, 1, 2, \dots \quad (p = 1). \qquad (7.2.7)$$



Assuming then $e_{n_0}$ sufficiently small, so that the approximation $x_{n_0}$ has several correct decimal digits, we write $e_{n_0+k} = 10^{-\delta_k}e_{n_0}$. Then $\delta_k$, according to (7.2.2), approximately represents the number of additional correct digits in the approximation $x_{n_0+k}$ (as opposed to $x_{n_0}$). Taking logarithms in (7.2.6) and (7.2.7) gives
$$\delta_k = \begin{cases}k\log\dfrac{1}{c}, & \text{if } p = 1,\\[2mm] p^k\left[\dfrac{1-p^{-k}}{p-1}\log\dfrac{1}{c} + (1-p^{-k})\log\dfrac{1}{e_{n_0}}\right], & \text{if } p > 1,\end{cases}$$
hence, as $k \to \infty$,
$$\delta_k \sim c_1k \quad (p = 1), \qquad \delta_k \sim c_pp^k \quad (p > 1), \qquad (7.2.8)$$
where $c_1 = \log\frac{1}{c} > 0$ if $p = 1$, and
$$c_p = \frac{1}{p-1}\log\frac{1}{c} + \log\frac{1}{e_{n_0}}.$$

(We assume here that $n_0$ is large enough, and hence $e_{n_0}$ small enough, to have $c_p > 0$.) This shows that the number of correct decimal digits increases linearly with k when p = 1, but exponentially when p > 1. In the latter case, $\delta_{k+1}/\delta_k \sim p$, meaning that ultimately (for large k) the number of correct decimal digits increases, per iteration step, by a factor of p.
If each iteration requires m units of work (a "unit of work" typically is the work involved in computing a function value or a value of one of its derivatives), then the efficiency index of the iteration may be defined by
$$\lim_{k\to\infty}\left[\delta_{k+1}/\delta_k\right]^{1/m} = p^{1/m}.$$

It provides a common basis on which to compare different iterative methods with one another. Methods
that converge linearly have efficiency index 1.
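For example (our own illustration), a quadratically convergent method that needs m = 2 evaluations per step has a smaller efficiency index than a method of order $(1+\sqrt{5})/2 \approx 1.618$ needing a single evaluation per step (the latter pair anticipates the secant method of Section 7.5):

p=[2,(1+sqrt(5))/2]; m=[2,1];
p.^(1./m)   %approximately [1.4142, 1.6180]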
Practical computation requires the employment of a stopping rule that terminates the iteration once the desired accuracy is (or is believed to be) obtained. Ideally, one stops as soon as $\|x_n - \alpha\| < tol$, where $tol$ is a prescribed accuracy. Since α is not known, one commonly replaces $x_n - \alpha$ by $x_n - x_{n-1}$ and requires
$$\|x_n - x_{n-1}\| \le tol, \qquad (7.2.9)$$
where
$$tol = \|x_n\|\varepsilon_r + \varepsilon_a \qquad (7.2.10)$$
with $\varepsilon_r$, $\varepsilon_a$ prescribed tolerances. As a safety measure, one might require (7.2.9) not just for one, but for a few consecutive values of n. Choosing $\varepsilon_r = 0$ or $\varepsilon_a = 0$ makes (7.2.10) an absolute (resp., relative) error tolerance. It is prudent, however, to use a "mixed error tolerance", say $\varepsilon_r = \varepsilon_a = \varepsilon$. Then, if $\|x_n\|$ is small or moderately large, one effectively controls the absolute error, whereas for $\|x_n\|$ very large, it is in effect the relative error that is controlled. One can combine the above tests with $\|f(x_n)\| \le \varepsilon$, as sketched below.
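A sketch of such a combined test (our own illustration; onestep stands for a hypothetical function performing one iteration step, and x0, er, ea, nmax are assumed given):

xold=x0;
for n=1:nmax
    x=onestep(xold);
    %mixed test (7.2.9)-(7.2.10) combined with a residual check
    if norm(x-xold)<=norm(x)*er+ea && norm(f(x))<=ea
        break
    end
    xold=x;
end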

7.3 Sturm Sequences Method


There are situations in which it is desirable to be able to select one particular root among many others
and have the iterative scheme converge to it. This is the case, for example, in orthogonal polynomials,
where we know that all zeros are real and distinct. It may well be that we are interested in the second-
largest or third-largest zero, and should be able to compute it without computing any of the others. This
274 Numerical Solution of Nonlinear Equations

Figure 7.1: Sturm's theorem: σ(x) takes the values d, d−1, ..., r, r−1, ..., ℓ, ℓ−1, ... as x increases past the zeros ξd, ..., ξr, ..., ξℓ of πd

is indeed possible if we combine the bisection method with the theorem of Sturm¹.
Thus, consider
f (x) := πd (x) = 0, (7.3.1)
where πd is a polynomial of degree d, orthogonal with respect to some positive measure. We know that
πd is the characteristic polynomial of a symmetric tridiagonal matrix and can be computed recursively
by a three term recurrence relation

$$\pi_0(x) = 1, \quad \pi_1(x) = x - \alpha_0, \qquad \pi_{k+1}(x) = (x - \alpha_k)\pi_k(x) - \beta_k\pi_{k-1}(x), \quad k = 1, 2, \dots, d-1, \qquad (7.3.2)$$

with all βk positive. The recursion (7.3.2) is not only useful to compute πd (x), but has also the property
due to Sturm.

Theorem 7.3.1 (Sturm). Let σ(x) be the number of sign changes (zeros do not count) in the sequence
of numbers
πd (x), πd−1 (x), . . . , π1 (x), π0 (x). (7.3.3)
Then, for any two numbers a, b with a < b, the number of real zeros of πd in the interval a < x ≤ b is
equal to σ(a) − σ(b).

Since $\pi_k(x) = x^k + \cdots$, it is clear that $\sigma(-\infty) = d$, $\sigma(+\infty) = 0$, so that indeed the number of real zeros of $\pi_d$ is $\sigma(-\infty) - \sigma(+\infty) = d$. Moreover, if $\xi_1 > \xi_2 > \cdots > \xi_d$ denote the zeros of $\pi_d$ in decreasing order, we have the behavior of $\sigma(x)$ as shown in Figure 7.1.
It is now easy to see that
$$\sigma(x) \le r - 1 \iff x \ge \xi_r. \qquad (7.3.4)$$
Indeed, suppose that $x \ge \xi_r$. Then $\{\#\text{zeros} \le x\} \ge d + 1 - r$; hence, by Sturm's theorem, $d - \sigma(x) = \sigma(-\infty) - \sigma(x) = \{\#\text{zeros} \le x\} \ge d + 1 - r$, that is, $\sigma(x) \le r - 1$. Conversely, if $\sigma(x) \le r - 1$, then, again by Sturm's theorem, $\{\#\text{zeros} \le x\} = d - \sigma(x) \ge d + 1 - r$, which implies $x \ge \xi_r$ (see Figure 7.1).
The basic idea is to control the bisection process, not as in the bisection procedure, by checking the
sign of πd (x), but rather, by checking the inequality (7.3.4) to see whether we are on the right or left

¹ Jacques Charles François Sturm (1803-1855), a Swiss analyst and theoretical physicist, is best known for his theorem on Sturm sequences, discovered in 1829, and his theory of the Sturm-Liouville differential equation. He also contributed significantly to differential and projective geometry.

side of the zero $\xi_r$. In order to initialize the procedure, we need two values $a_1 = a$, $b_1 = b$ such that $a < \xi_d$ and $b > \xi_1$. These are trivially obtained as the endpoints of the interval of orthogonality for $\pi_d$, if it is finite. More generally, one can apply Gershgorin's theorem to the Jacobi matrix $J_d$ associated with the polynomial (7.3.2),
$$J_d = \begin{pmatrix}\alpha_0 & \sqrt{\beta_1} & & & 0\\ \sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & &\\ & \sqrt{\beta_2} & \alpha_2 & \ddots &\\ & & \ddots & \ddots & \sqrt{\beta_{d-1}}\\ 0 & & & \sqrt{\beta_{d-1}} & \alpha_{d-1}\end{pmatrix},$$
taking into account that the zeros of $\pi_d$ are precisely the eigenvalues of $J_d$.
Gershgorin's theorem states that the eigenvalues of a matrix $A = [a_{ij}]$ of order d are located in the union of the disks
$$\left\{z \in \mathbb{C} : |z - a_{ii}| \le r_i\right\}, \quad r_i = \sum_{j\ne i}|a_{ij}|, \quad i = 1, \dots, d.$$
In this way, a can be chosen as the smallest of the d numbers $\alpha_0 - \sqrt{\beta_1}$, $\alpha_1 - \sqrt{\beta_1} - \sqrt{\beta_2}$, ..., $\alpha_{d-2} - \sqrt{\beta_{d-2}} - \sqrt{\beta_{d-1}}$, $\alpha_{d-1} - \sqrt{\beta_{d-1}}$, and b as the largest of the d numbers $\alpha_0 + \sqrt{\beta_1}$, $\alpha_1 + \sqrt{\beta_1} + \sqrt{\beta_2}$, ..., $\alpha_{d-2} + \sqrt{\beta_{d-2}} + \sqrt{\beta_{d-1}}$, $\alpha_{d-1} + \sqrt{\beta_{d-1}}$. The method of Sturm sequences then proceeds as follows, for any given r with $1 \le r \le d$:
then proceeds as follows, for any given r with 1 ≤ r ≤ d:
for n := 1, 2, 3, . . . do
xn := 21 (an + bn );
if σ(xn ) > r − 1 then
an+1 := xn ; bn+1 := bn ;
else
an+1 := an ; bn+1 := xn ;
end if
end for
Since initially σ(a) = d > r − 1, σ(b) = 0 ≤ r − 1, it follows by construction that

σ(an ) > r − 1, σ(bn ) ≤ r − 1, n = 1, 2, 3, . . .

meaning that $\xi_r \in [a_n, b_n]$ for all $n = 1, 2, 3, \dots$. Moreover, as in the bisection method, $b_n - a_n = 2^{-(n-1)}(b-a)$, so that $|x_n - \xi_r| \le \varepsilon_n$ with $\varepsilon_n = 2^{-n}(b-a)$. The method converges (at least) linearly to the root $\xi_r$. A computer implementation can be obtained by modifying the if-else statement of the bisection method appropriately; a sketch is given below.
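Here is such a sketch (our own implementation; alpha and beta store the recursion coefficients of (7.3.2) with the same convention as in MATLAB Source 6.3):

function x=sturmroot(alpha,beta,r,a,b,tol)
%STURMROOT - sketch of Sturm sequence bisection for the
%r-th largest zero of pi_d
d=length(alpha);
while b-a>tol
    x=(a+b)/2;
    if sigma(x,alpha,beta,d)>r-1
        a=x;
    else
        b=x;
    end
end
x=(a+b)/2;

function s=sigma(x,alpha,beta,d)
%number of sign changes in the sequence (7.3.3) (order is irrelevant)
p=zeros(1,d+1); p(1)=1; p(2)=x-alpha(1);
for k=2:d
    p(k+1)=(x-alpha(k))*p(k)-beta(k)*p(k-1);
end
q=p(p~=0);
s=sum(q(1:end-1).*q(2:end)<0);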

7.4 Method of False Position


As in the method of bisection, we assume two numbers a < b such that

f ∈ C[a, b], f (a)f (b) < 0 (7.4.1)

and generate a sequence of nested intervals [an , bn ], n = 1, 2, 3, . . . with a1 = a, b1 = b such that


f (an )f (bn ) < 0. Unlike the bisection method, however, we are not taking the midpoint of [an , bn ] to
determine the next interval, but rather the solution x = xn of the linear equation

(L1 f )(x; an , bn ) = 0.
276 Numerical Solution of Nonlinear Equations

This would appear to be more flexible than bisection, as xn will come to lie closer to the endpoint at
which |f | is smaller.
More explicitly, the method proceeds as follows: define a1 = a, b1 = b. Then
for n := 1, 2, ... do
    $x_n := a_n - \dfrac{a_n - b_n}{f(a_n) - f(b_n)}\,f(a_n)$;
    if $f(a_n)f(x_n) > 0$ then
        $a_{n+1} := x_n$; $b_{n+1} := b_n$;
    else
        $a_{n+1} := a_n$; $b_{n+1} := x_n$;
    end if
end for
One may terminate the iteration as soon as min(xn − an , bn − xn ) ≤ tol, where tol is a prescribed
error tolerance, although this is not entirely fool-proof.
For an implementation see MATLAB Source 7.1.

MATLAB Source 7.1 False position method for nonlinear equation in R


function [x,nit]=falseposition(f,a,b,er,nmax)
%FALSEPOSITION - false position method
%call [X,NIT]=FALSEPOSITION(F,A,B,ER,NMAX)
%F - function
%A,B - endpoints
%ER - tolerance
%NMAX - maximum number of iterations

if nargin < 5, nmax=100; end


if nargin < 4, er=1e-3; end
nit=0; fa=f(a); fb=f(b);
for k=1:nmax
x=a-(a-b)*fa/(fa-fb);
if (x-a<er*(b-a))||(b-x<er*(b-a))
nit=k; return
else
fx=f(x);
if sign(fx)==sign(fa)
a=x; fa=fx;
else
b=x; fb=fx;
end %if
end %if
end %for
error('iteration number exceeded')

The convergence behavior is most easily analyzed if we assume that f is convex or concave on
[a, b]. To fix ideas, suppose f is convex, say

f ′′ (x) > 0, x ∈ [a, b], f (a) < 0, f (b) > 0. (7.4.2)



Then f has exactly one zero, α, in [a, b]. Moreover, the secant connecting f (a) and f (b) lies entirely
above the graph of $y = f(x)$, and hence intersects the real line to the left of $\alpha$. This will be the case for all subsequent secants, which means that the point $x = b$ remains fixed while the other endpoint $a$ gets continuously updated, producing a monotonically increasing sequence of approximations. The sequence defined by
$$x_{n+1} = x_n - \frac{x_n - b}{f(x_n) - f(b)}\, f(x_n), \qquad n \in \mathbb{N}^*, \quad x_1 = a \tag{7.4.3}$$
is a monotonically increasing sequence bounded above by $\alpha$, therefore convergent to a limit $\bar{x}$, with $f(\bar{x}) = 0$ (see Figure 7.2).

Figure 7.2: Method of false position

To determine the speed of convergence, we subtract $\alpha$ from both sides of (7.4.3) and use the fact that $f(\alpha) = 0$:
$$x_{n+1} - \alpha = x_n - \alpha - \frac{x_n - b}{f(x_n) - f(b)}\,[f(x_n) - f(\alpha)].$$
Now divide by $x_n - \alpha$ to get
$$\frac{x_{n+1} - \alpha}{x_n - \alpha} = 1 - \frac{x_n - b}{f(x_n) - f(b)} \cdot \frac{f(x_n) - f(\alpha)}{x_n - \alpha}.$$
Letting here $n \to \infty$ and using the fact that $x_n \to \alpha$, we obtain
$$\lim_{n\to\infty} \frac{x_{n+1} - \alpha}{x_n - \alpha} = 1 - (b - \alpha)\,\frac{f'(\alpha)}{f(b)}. \tag{7.4.4}$$
Thus, we have linear convergence with asymptotic error constant equal to
$$c = 1 - (b - \alpha)\,\frac{f'(\alpha)}{f(b)}.$$
Due to the assumption of convexity, $c \in (0, 1)$. The proof when $f$ is concave is analogous. If $f$ is neither convex nor concave on $[a,b]$, but $f \in C^2[a,b]$ and $f''(\alpha) \ne 0$, then $f''$ has a constant sign in a

neighborhood of α and for n large enough xn will eventually come to lie in this neighborhood, and we
can proceed as above.
As an example, we consider the equation $\cos x \cosh x = 1$ on the interval $\left[\frac{3\pi}{2}, 2\pi\right]$. Here is the code

ff=@(x) cos(x).*cosh(x)-1;
a=3/2*pi; b=2*pi;
[x,n]=falseposition(ff,a,b,eps)

The output is

x =
4.730040744862703
n =
75

Drawbacks. (i) Slow convergence; (ii) the fact that one of the endpoints remains fixed. If $f$ is very flat near $\alpha$, the point $a$ is nearby and $b$ farther away, the convergence is exceptionally slow.

7.5 Secant Method


The secant method is a simple variant of the method of false position in which it is no longer required
that the function f have opposite signs at the endpoints of each interval generated, not even the initial
interval. One starts with two arbitrary initial approximations $x_0$, $x_1$ and continues with
$$x_{n+1} = x_n - \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\, f(x_n), \qquad n \in \mathbb{N}^*. \tag{7.5.1}$$
This precludes the formation of a fixed false position, as in the method of false position, and hence suggests potentially faster convergence. Unfortunately, the "global convergence" no longer holds; the method converges only "locally", that is, only if the initial approximations $x_0$ and $x_1$ are sufficiently close to a root.
We need a relation between three consecutive errors:
$$\begin{aligned}
x_{n+1} - \alpha &= x_n - \alpha - \frac{f(x_n)}{f[x_{n-1}, x_n]} = (x_n - \alpha)\left[1 - \frac{f(x_n) - f(\alpha)}{(x_n - \alpha)\, f[x_{n-1}, x_n]}\right]\\
&= (x_n - \alpha)\left[1 - \frac{f[x_n, \alpha]}{f[x_{n-1}, x_n]}\right] = (x_n - \alpha)\,\frac{f[x_{n-1}, x_n] - f[x_n, \alpha]}{f[x_{n-1}, x_n]}\\
&= (x_n - \alpha)(x_{n-1} - \alpha)\,\frac{f[x_n, x_{n-1}, \alpha]}{f[x_{n-1}, x_n]}.
\end{aligned}$$
Hence,
$$x_{n+1} - \alpha = (x_n - \alpha)(x_{n-1} - \alpha)\,\frac{f[x_n, x_{n-1}, \alpha]}{f[x_{n-1}, x_n]}, \qquad n \in \mathbb{N}^*. \tag{7.5.2}$$
From (7.5.2) it follows that if $\alpha$ is a simple root ($f(\alpha) = 0$, $f'(\alpha) \ne 0$) and if $x_n \to \alpha$, then convergence is faster than linear, at least if $f \in C^2$ near $\alpha$. How fast is the convergence?
We replace the ratio of divided differences in (7.5.2) by a constant, which is almost true when $n$ is large. Letting then $e_k = |x_k - \alpha|$, we have
$$e_{n+1} = C e_n e_{n-1}, \qquad C > 0.$$



Multiplying both sides by $C$ and defining $E_n = C e_n$ gives
$$E_{n+1} = E_n E_{n-1}, \qquad E_n \to 0.$$
Taking logarithms on both sides, and defining $y_n = \log\frac{1}{E_n}$, we obtain
$$y_{n+1} = y_n + y_{n-1}, \tag{7.5.3}$$
the well-known difference equation for the Fibonacci sequence.
The solution is
$$y_n = c_1 t_1^n + c_2 t_2^n,$$
where $c_1$, $c_2$ are constants and
$$t_1 = \frac{1}{2}(1 + \sqrt{5}), \qquad t_2 = \frac{1}{2}(1 - \sqrt{5}).$$
Since $y_n \to \infty$, we have $c_1 \ne 0$ and $y_n \sim c_1 t_1^n$ as $n \to \infty$, since $|t_2| < 1$. Putting them back, $\frac{1}{E_n} \sim e^{c_1 t_1^n}$, $\frac{1}{e_n} \sim C e^{c_1 t_1^n}$, so
$$\frac{e_{n+1}}{e_n^{t_1}} \sim \frac{C^{t_1} e^{-c_1 t_1^{n+1}}}{C\, e^{-c_1 t_1^{n+1}}} = C^{t_1 - 1}, \qquad n \to \infty.$$
The order of convergence, therefore, is $t_1 = \frac{1 + \sqrt{5}}{2} \approx 1.61803\dots$ (the golden ratio).
2
Theorem 7.5.1. Let $\alpha$ be a simple zero of $f$. Let $I_\varepsilon = \{x \in \mathbb{R} : |x - \alpha| < \varepsilon\}$ and assume $f \in C^2[I_\varepsilon]$. Define, for sufficiently small $\varepsilon$,
$$M(\varepsilon) = \max_{s,\,t \in I_\varepsilon} \left| \frac{f''(s)}{2 f'(t)} \right|. \tag{7.5.4}$$
Assume $\varepsilon$ so small that
$$\varepsilon M(\varepsilon) < 1. \tag{7.5.5}$$
Then the secant method converges to the unique root $\alpha \in I_\varepsilon$ for any starting values $x_0 \ne x_1$ with $x_0 \in I_\varepsilon$, $x_1 \in I_\varepsilon$.

Remark 7.5.2. Note that $\lim_{\varepsilon \to 0} M(\varepsilon) = \left|\frac{f''(\alpha)}{2 f'(\alpha)}\right| < \infty$, so that (7.5.5) can certainly be satisfied for $\varepsilon$ small enough. The local nature of convergence is thus quantified by the requirement $x_0, x_1 \in I_\varepsilon$. ♦

Proof. First of all, observe that $\alpha$ is the only zero of $f$ in $I_\varepsilon$. This follows from Taylor's formula applied at $x = \alpha$:
$$f(x) = f(\alpha) + (x - \alpha) f'(\alpha) + \frac{(x - \alpha)^2}{2} f''(\xi),$$
where $f(\alpha) = 0$ and $\xi \in (x, \alpha)$ (or $(\alpha, x)$). Thus, if $x \in I_\varepsilon$, then also $\xi \in I_\varepsilon$, and we have
$$f(x) = (x - \alpha) f'(\alpha) \left[ 1 + \frac{x - \alpha}{2}\, \frac{f''(\xi)}{f'(\alpha)} \right].$$
Here, if $x \ne \alpha$, all three factors are different from zero, the last one since by assumption
$$\left| \frac{x - \alpha}{2}\, \frac{f''(\xi)}{f'(\alpha)} \right| \le \varepsilon M(\varepsilon) < 1.$$

Thus, f on Iε can only vanish at x = α.


Next we show that, for all $n$, $x_n \in I_\varepsilon$ and two consecutive iterates are distinct, unless $f(x_n) = 0$ for some $n$, in which case $x_n = \alpha$ and the method converges in a finite number of steps. We prove this by induction: assume that $x_{n-1}, x_n \in I_\varepsilon$ and $x_n \ne x_{n-1}$. (By assumption this is true for $n = 1$.) Then, from known properties of divided differences, and by our assumption that $f \in C^2[I_\varepsilon]$, we have
$$f[x_{n-1}, x_n] = f'(\xi_1), \qquad f[x_{n-1}, x_n, \alpha] = \frac{1}{2} f''(\xi_2), \qquad \xi_i \in I_\varepsilon, \ i = 1, 2.$$
Therefore, by (7.5.2),
$$|x_{n+1} - \alpha| \le \varepsilon^2 \left| \frac{f''(\xi_2)}{2 f'(\xi_1)} \right| \le \varepsilon \cdot \varepsilon M(\varepsilon) < \varepsilon,$$
that is, $x_{n+1} \in I_\varepsilon$. Furthermore, by the relation between three consecutive errors (7.5.2), $x_{n+1} \ne x_n$ unless $f(x_n) = 0$, hence $x_n = \alpha$.
Finally, using again (7.5.2), we have
$$|x_{n+1} - \alpha| \le |x_n - \alpha|\, \varepsilon M(\varepsilon),$$
which, applied repeatedly, yields
$$|x_{n+1} - \alpha| \le |x_n - \alpha|\, \varepsilon M(\varepsilon) \le \dots \le [\varepsilon M(\varepsilon)]^{n-1} |x_1 - \alpha|.$$
Since $\varepsilon M(\varepsilon) < 1$, it follows that the method converges and $x_n \to \alpha$ as $n \to \infty$.

Since only one evaluation of $f$ is required in each iteration step, the secant method has the efficiency index $p = \frac{1+\sqrt{5}}{2} \approx 1.61803\dots$. An implementation of this method is given in MATLAB Source 7.2.

7.6 Newton’s method


Newton's method can be thought of as a limit case of the secant method, when $x_{n-1} \to x_n$. The result is
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \tag{7.6.1}$$
where $x_0$ is some appropriate initial approximation. Another, more fruitful interpretation is that of linearization of the equation $f(x) = 0$ at $x = x_n$:
$$f(x) \approx f(x_n) + (x - x_n) f'(x_n) = 0.$$
Viewed in this manner, Newton's method can be vastly generalized to nonlinear equations of all kinds (systems of nonlinear equations, functional equations, in which case the derivative $f'$ is to be understood as a Fréchet derivative), and the iteration is
$$x_{n+1} = x_n - [f'(x_n)]^{-1} f(x_n). \tag{7.6.2}$$
The study of the error in Newton's method is virtually the same as the one for the secant method:
$$\begin{aligned}
x_{n+1} - \alpha &= x_n - \alpha - \frac{f(x_n)}{f'(x_n)}\\
&= (x_n - \alpha)\left[1 - \frac{f(x_n) - f(\alpha)}{(x_n - \alpha) f'(x_n)}\right]\\
&= (x_n - \alpha)\left[1 - \frac{f[x_n, \alpha]}{f[x_n, x_n]}\right]\\
&= (x_n - \alpha)^2\, \frac{f[x_n, x_n, \alpha]}{f[x_n, x_n]}.
\end{aligned} \tag{7.6.3}$$

MATLAB Source 7.2 Secant method for nonlinear equations in R


function [z,ni]=secant(f,x0,x1,ea,er,Nmax)
%SECANT - secant method in R
%input
%f - function
%x0,x1 - starting values
%ea,er - absolute and relative error, respectively
%Nmax - maximum number of iterations
%output
%z - approximate root
%ni - actual no. of iterations

if nargin<6, Nmax=50; end


if nargin<5, er=0; end
if nargin<4, ea=1e-3; end
xv=x0; fv=f(xv); xc=x1; fc=f(xc);
for k=1:Nmax
xn=xc-fc*(xc-xv)/(fc-fv);
if abs(xn-xc)<ea+er*abs(xn) %success
z=xn;
ni=k;
return
end
%prepare next iteration
xv=xc; fv=fc; xc=xn; fc=feval(f,xn);
end
%failure
error('maximum iteration number exceeded')
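As a quick illustration of MATLAB Source 7.2 (the call below simply mirrors the interface documented in the header), the secant method applied to $\cos x - x = 0$ with starting values $0$ and $1$ returns the root $z \approx 0.7391$:

f=@(x) cos(x)-x;            %the root is approximately 0.7391
[z,ni]=secant(f,0,1,0,eps)  %absolute tolerance 0, relative tolerance eps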

Therefore, if $x_n \to \alpha$, then
$$\lim_{n\to\infty} \frac{x_{n+1} - \alpha}{(x_n - \alpha)^2} = \frac{f''(\alpha)}{2 f'(\alpha)},$$
that is, Newton's method has the order of convergence $p = 2$ if $f''(\alpha) \ne 0$. The efficiency index of Newton's method is $\sqrt{2} = 1.41421\dots$, because it requires two function evaluations at each step (the function and its derivative).
For the convergence of Newton’s method we have the following result.

Theorem 7.6.1. Let $\alpha$ be a simple root of the equation $f(x) = 0$ and $I_\varepsilon = \{x \in \mathbb{R} : |x - \alpha| \le \varepsilon\}$. Assume that $f \in C^2[I_\varepsilon]$. Define
$$M(\varepsilon) = \max_{s,\,t \in I_\varepsilon} \left| \frac{f''(s)}{2 f'(t)} \right|. \tag{7.6.4}$$
If $\varepsilon$ is so small that
$$2\varepsilon M(\varepsilon) < 1, \tag{7.6.5}$$
then for every $x_0 \in I_\varepsilon$, Newton's method is well defined and converges quadratically to the only root $\alpha \in I_\varepsilon$.

The extra factor 2 in (7.6.5) comes from the requirement that $f'(x) \ne 0$ for $x \in I_\varepsilon$.
The stopping criterion for Newton’s method

|xn − xn−1 | < ε

is based on the following result.

Proposition 7.6.2. Let (xn ) be the sequence of approximations generated by Newton’s method. If α is
a simple root in [a, b], f ∈ C 2 [a, b] and the method is convergent, then there exists an n0 ∈ N such that

|xn − α| ≤ |xn − xn−1 |, n > n0 .

Proof. We shall first show that
$$|x_n - \alpha| \le \frac{1}{m_1} |f(x_n)|, \qquad m_1 \le \inf_{x \in [a,b]} |f'(x)|. \tag{7.6.6}$$
Using Lagrange's theorem, $f(\alpha) - f(x_n) = f'(\xi)(\alpha - x_n)$, where $\xi \in (\alpha, x_n)$ (or $(x_n, \alpha)$). The relations $f(\alpha) = 0$ and $|f'(x)| \ge m_1$, for $x \in (a,b)$, imply that $|f(x_n)| \ge m_1 |\alpha - x_n|$, that is, (7.6.6).
Based on Taylor's formula, we have
$$f(x_n) = f(x_{n-1}) + (x_n - x_{n-1}) f'(x_{n-1}) + \frac{1}{2}(x_n - x_{n-1})^2 f''(\mu), \tag{7.6.7}$$
where $\mu \in (x_{n-1}, x_n)$ or $\mu \in (x_n, x_{n-1})$. Due to the way in which an approximation is obtained in Newton's method, we have $f(x_{n-1}) + (x_n - x_{n-1}) f'(x_{n-1}) = 0$, and from (7.6.7) we obtain
$$|f(x_n)| = \frac{1}{2}(x_n - x_{n-1})^2 |f''(\mu)| \le \frac{1}{2}(x_n - x_{n-1})^2 \|f''\|_\infty,$$
and based on (7.6.6) it follows that
$$|\alpha - x_n| \le \frac{\|f''\|_\infty}{2 m_1}(x_n - x_{n-1})^2.$$
Since we assumed the convergence of the method, there exists an $n_0 \in \mathbb{N}$ such that
$$\frac{\|f''\|_\infty}{2 m_1}\, |x_n - x_{n-1}| < 1, \qquad n > n_0,$$
and hence
$$|x_n - \alpha| \le |x_n - x_{n-1}|, \qquad n > n_0.$$


The geometric interpretation of Newton’s method is given in Figure 7.3, and an implementation is
given in MATLAB Source 7.3.
The choice of starting value is, in general, a difficult task. In practice, one chooses a value, and if after a fixed maximum number of iterations the desired accuracy, tested by a usual stopping criterion, is not attained, another starting value is chosen. For example, if the root is isolated in a certain interval $[a,b]$, and $f''(x) \ne 0$, $x \in (a,b)$, a choice criterion is $f(x_0) f''(x_0) > 0$. Another criterion is: let $f \in C^2[a,b]$ be such that
- $f$ is convex (or concave) on $[a,b]$;
- $f(a) f(b) < 0$;
- the tangents at the endpoints of $[a,b]$ intersect the real line within $[a,b]$.

Figure 7.3: Newton's method

Then, Newton’s method converges globally, i.e. for any x0 ∈ [a, b].

Example 7.6.3. We wish to compute $\alpha = \sqrt{a}$, $a > 0$. Starting from the equation
$$f(x) = x^2 - a = 0,$$
(7.6.1) becomes
$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right), \qquad n = 0, 1, 2, \dots \tag{7.6.8}$$
This method was used by the Babylonians long before Newton. Because of the convexity of $f$, it is clear that the iteration (7.6.8) converges to the positive square root for each $x_0 > 0$ and is monotonically decreasing (except for the first step in the case $0 < x_0 < \alpha$). This is an elementary example of global convergence. ♦
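A minimal sketch of (7.6.8) for $a = 2$; five or six iterations started from $x_0 = 1$ already reproduce $\sqrt{2}$ to machine precision:

a=2; x=1;           %any starting value x0>0 works (global convergence)
for n=1:6
    x=(x+a/x)/2;    %Babylonian iteration (7.6.8)
end
x                   %approximately sqrt(2)=1.414213562373095...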

Example 7.6.4 (Cycle). Let $f(x) = \sin x$, $|x| < \frac{\pi}{2}$. There is exactly one root in this interval, $\alpha = 0$. Newton's method becomes
$$x_{n+1} = x_n - \tan x_n, \qquad n = 0, 1, 2, \dots \tag{7.6.9}$$
It exhibits a strange behavior (see Figure 7.4). If $x_0 = x^*$, where $x^*$ is the smallest positive root of
$$\tan x^* = 2x^*, \tag{7.6.10}$$



MATLAB Source 7.3 Newton method for nonlinear equations in R


function [z,ni]=Newtons(f,fd,x0,ea,er,Nmax)
%NEWTONS - Newton method in R
%Input
%f - function
%fd - derivative
%x0 - starting value
%ea,er - absolute and relative error, respectively
%Nmax - maximum number of iterations
%Output
%z - approximate solution
%ni - actual no. of iterations

if nargin<6, Nmax=50; end


if nargin<5, er=0; end
if nargin<4, ea=1e-3; end
xv=x0;
for k=1:Nmax
xc=xv-f(xv)/fd(xv);
if abs(xc-xv)<ea+er*abs(xc) %success
z=xc;
ni=k;
return
end
xv=xc; %prepare next iteration
end
%failure
error('maximum iteration number exceeded')

then $x_1 = -x^*$, $x_2 = x^*$; that is, Newton's method cycles forever. This is called a cycle. For this starting value, Newton's method does not converge, let alone to $\alpha = 0$. It does converge, however, for any starting value $x_0$ with $|x_0| < x^*$, generating a sequence of alternately increasing and decreasing approximations $x_n$ converging necessarily to $\alpha = 0$. The value of the critical number $x^*$ can be computed by Newton's method applied to (7.6.10). The result is $x^* = 1.165561185207211\dots$. Here is an example of local convergence, since convergence does not hold for all $x_0 \in \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$. (If $x_0 = \frac{\pi}{2}$, the first derivative vanishes.) Interestingly, in double precision floating-point arithmetic the cycle does not appear. The example (M-file cycletest.m):

%cycle test
f1=@(x) tan(x)-2*x;
f1d=@(x) cos(x).^(-2)-2;
[x0,ni]=Newtons(f1,f1d,3*pi/7,0,eps,200)
[x,n2]=Newtons(@sin,@cos,x0,0,eps,200)
[x1,n3]=Newtons(@sin,@cos,x0-eps,eps,eps,200)

provides the following results:

x0 =

1.165561185207211
ni =
7
x =
21.991148575128552
n2 =
27
x1 =
0
n3 =
26
Can you explain why? (Hint: compute f1(x0) and f1(x0-eps).) ♦

Figure 7.4: A cycle in Newton's method

For another example on cycles see Problem 7.2.


Example 7.6.5. $f(x) = x^{20} - 1$, $x > 0$. There is exactly one positive simple root, $\alpha = 1$. Newton's method yields the iteration
$$x_{n+1} = \frac{19}{20} x_n + \frac{1}{20 x_n^{19}}, \qquad n = 0, 1, 2, \dots \tag{7.6.11}$$
which provides a good example to illustrate that unless one starts sufficiently close to the desired root, it may take a long time to approach it. If we take $x_0 = \frac{1}{2}$, then $x_1 \approx \frac{2^{19}}{20} = 2.62144 \times 10^4$, a very large number. It takes a long time to arrive in a small vicinity of $\alpha = 1$, since for $x_n$ large, one has
$$x_{n+1} \approx \frac{19}{20} x_n, \qquad x_n \gg 1.$$
At each step the approximation is reduced only by a fraction $\frac{19}{20} = 0.95$. It takes about 200 steps to get back to near the desired root. But once we come close to $\alpha = 1$, the iteration speeds up dramatically and converges to the root quadratically. Since $f$ is convex, we actually have global convergence on $(0, \infty)$, but as we have seen, this is not very useful. ♦

7.7 Fixed Point Iteration


Often, in applications, a nonlinear equation presents itself in the form of a fixed point problem: find x
such that
x = ϕ(x). (7.7.1)
A number $\alpha$ satisfying this equation is called a fixed point of $\varphi$. Any equation $f(x) = 0$ can, in fact, be written equivalently (in many different ways) in the form (7.7.1). For example, if $f'(x) \ne 0$ in the interval of interest, we can take
$$\varphi(x) = x - \frac{f(x)}{f'(x)}. \tag{7.7.2}$$
If $x_0$ is an initial approximation of a fixed point $\alpha$ of (7.7.1), the fixed point iteration generates a sequence of approximations by
$$x_{n+1} = \varphi(x_n). \tag{7.7.3}$$
If it converges, it clearly converges to a fixed point of $\varphi$ if $\varphi$ is continuous. Note that (7.7.3) is precisely Newton's method for solving $f(x) = 0$ if $\varphi$ is defined by (7.7.2). So Newton's method can be viewed as a fixed point iteration, but not the secant method.
For any iteration of the form (7.7.3), assuming that $x_n \to \alpha$ when $n \to \infty$, it is straightforward to determine the order of convergence. Suppose indeed that at the fixed point $\alpha$ we have
$$\varphi'(\alpha) = \varphi''(\alpha) = \dots = \varphi^{(p-1)}(\alpha) = 0, \qquad \varphi^{(p)}(\alpha) \ne 0. \tag{7.7.4}$$

We assume that $\varphi \in C^p$ on a neighborhood of $\alpha$. This defines the integer $p \ge 1$. We then have, by Taylor's theorem,
$$\varphi(x_n) = \varphi(\alpha) + (x_n - \alpha)\varphi'(\alpha) + \dots + \frac{(x_n - \alpha)^{p-1}}{(p-1)!}\varphi^{(p-1)}(\alpha) + \frac{(x_n - \alpha)^p}{p!}\varphi^{(p)}(\xi_n) = \varphi(\alpha) + \frac{(x_n - \alpha)^p}{p!}\varphi^{(p)}(\xi_n),$$
where $\xi_n$ is between $\alpha$ and $x_n$. Since $\varphi(x_n) = x_{n+1}$ and $\varphi(\alpha) = \alpha$, we get
$$\frac{x_{n+1} - \alpha}{(x_n - \alpha)^p} = \frac{1}{p!}\varphi^{(p)}(\xi_n).$$
As $x_n \to \alpha$, since $\xi_n$ is trapped between $x_n$ and $\alpha$, we conclude by the continuity of $\varphi^{(p)}$ at $\alpha$ that
$$\lim_{n\to\infty} \frac{x_{n+1} - \alpha}{(x_n - \alpha)^p} = \frac{1}{p!}\varphi^{(p)}(\alpha) \ne 0. \tag{7.7.5}$$
This shows that convergence is exactly of order $p$, and the asymptotic error constant is
$$c = \frac{1}{p!}\varphi^{(p)}(\alpha). \tag{7.7.6}$$
Combining this with the usual local convergence argument, we obtain the following result.

Theorem 7.7.1. Let $\alpha$ be a fixed point of $\varphi$ and $I_\varepsilon = \{x \in \mathbb{R} : |x - \alpha| \le \varepsilon\}$. Assume $\varphi \in C^p[I_\varepsilon]$ satisfies (7.7.4). If
$$M(\varepsilon) := \max_{t \in I_\varepsilon} |\varphi'(t)| < 1, \tag{7.7.7}$$
then the fixed point iteration converges to $\alpha$, for any $x_0 \in I_\varepsilon$. The order of convergence is $p$, and the asymptotic error constant is given by (7.7.6).
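A minimal sketch of the iteration (7.7.3) with the stopping criterion $|x_{n+1} - x_n| < \text{tol}$ might read as follows (function and argument names are ours):

function [x,n]=fixedpoint(phi,x0,tol,nmax)
%FIXEDPOINT - fixed point iteration (7.7.3), illustrative sketch
if nargin<4, nmax=100; end
x=x0;
for n=1:nmax
    xnew=phi(x);
    if abs(xnew-x)<tol
        x=xnew; return      %success
    end
    x=xnew;
end
error('maximum iteration number exceeded')

For example, phi=@(x) cos(x) satisfies $|\varphi'| < 1$ near its fixed point, so fixedpoint(@cos,1,1e-8) converges (linearly, $p = 1$) to $0.7391\dots$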

7.8 Newton’s Method for Multiple zeros


If $\alpha$ is a zero of multiplicity $m$, then the convergence order of Newton's method is only one. Indeed, let
$$\varphi(x) = x - \frac{f(x)}{f'(x)}.$$
Since
$$\varphi'(x) = \frac{f(x) f''(x)}{[f'(x)]^2},$$
the process is still (linearly) convergent, since $\varphi'(\alpha) = 1 - 1/m < 1$.
One way to avoid multiple zeros is to solve the modified equation
$$u(x) := \frac{f(x)}{f'(x)} = 0,$$
which has the same roots as $f$, but simple. Newton's method for the modified problem has the form
$$x_{k+1} = x_k - \frac{u(x_k)}{u'(x_k)} = x_k - \frac{f(x_k) f'(x_k)}{[f'(x_k)]^2 - f(x_k) f''(x_k)}. \tag{7.8.1}$$
Since $\alpha$ is a simple zero of $u$, the convergence of (7.8.1) is always quadratic. The only theoretical disadvantage of (7.8.1) is the additionally required second derivative of $f$ and the slightly higher cost of the determination of $x_{k+1}$ from $x_k$. In practice, this is a weakness, since the denominator of (7.8.1) could be very small in a neighborhood of $\alpha$ when $x_k \to \alpha$.
Quadratic convergence for zeros of higher multiplicities can be achieved not only by modifying the problem, but also by modifying the method. In a neighborhood of a zero $\alpha$ with multiplicity $m$, the relation
$$f(x) = (x - \alpha)^m \varphi(x) \approx (x - \alpha)^m \cdot c \tag{7.8.2}$$
holds. This leads to
$$\frac{f(x)}{f'(x)} \approx \frac{x - \alpha}{m} \quad\Rightarrow\quad \alpha \approx x - m\,\frac{f(x)}{f'(x)}.$$
The accordingly modified sequence
$$x_{k+1} := x_k - m\,\frac{f(x_k)}{f'(x_k)}, \qquad k = 0, 1, 2, \dots \tag{7.8.3}$$
converges quadratically even at multiple zeros, provided the correct value of the multiplicity $m$ is used in (7.8.3).
The efficiency of the Newton variant (7.8.3) depends critically on the correctness of the value $m$ used. If this value cannot be determined analytically, then a good estimate should at least be used. Provided that
$$|x_k - \alpha| < |x_{k-1} - \alpha| \quad\wedge\quad |x_k - \alpha| < |x_{k-2} - \alpha|,$$
$x_k$ can be substituted for $\alpha$ in (7.8.2):
$$f(x_{k-1}) \approx (x_{k-1} - x_k)^m \cdot c, \qquad f(x_{k-2}) \approx (x_{k-2} - x_k)^m \cdot c.$$
Then this system is solved with respect to $m$:
$$m \approx \frac{\log\left[f(x_{k-1})/f(x_{k-2})\right]}{\log\left[(x_{k-1} - x_k)/(x_{k-2} - x_k)\right]}.$$
This estimate of the multiplicity can be used, for example, in (7.8.3); a sketch of an implementation is given below.
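The sketch below implements (7.8.3) for a given multiplicity m; when m is unknown, the step can be preceded by the estimate above, applied to the last three iterates. All names are illustrative.

function [x,k]=newtonmult(f,fd,m,x0,tol,nmax)
%NEWTONMULT - modified Newton iteration (7.8.3), illustrative sketch
%f,fd - function and derivative; m - (estimated) multiplicity
if nargin<6, nmax=100; end
x=x0;
for k=1:nmax
    fx=f(x);
    if fx==0, return, end       %landed exactly on the zero
    xnew=x-m*fx/fd(x);          %modified Newton step (7.8.3)
    if abs(xnew-x)<tol
        x=xnew; return
    end
    x=xnew;
end
error('maximum iteration number exceeded')

For instance, for $f(x) = (x-1)^3 e^x$ the call newtonmult(@(x)(x-1).^3.*exp(x),@(x)(x-1).^2.*exp(x).*(x+2),3,2,1e-12) converges quadratically, whereas plain Newton ($m = 1$) is only linear here.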

7.9 Algebraic Equations


There are many iterative methods specifically designed to solve algebraic equations. Here we only
describe how Newton’s method applies to this context, essentially confining ourselves to a discussion
of an efficient way to evaluate simultaneously the value of a polynomial and its first derivative. In the
special case where all zeros of the polynomial are known to be real and simple, we describe an improved
variant of Newton’s method.
Newton’s method applied to algebraic equations. We consider an algebraic equation of degree
d,
f (x) = 0, f (x) = xd + ad−1 xd−1 + · · · + a0 , (7.9.1)
where the leading coefficient is assumed (without restricting generality) to be 1 and where we may also
assume a0 6= 0 without loss of generality. For simplicity we assume all coefficients to be real.
To apply Newton’s method to (7.9.1), one needs good methods for evaluating a polynomial and its
derivative.
Horner’s scheme is good for this purpose:
bd := 1; cd := 1;
for k = d − 1 downto 1 do
bk := tbk+1 + ak ;
ck := tck+1 + bk ;
end for
b0 := tb1 + a0 ;
Then f (t) = b0 , f ′ (t) = c1 .
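In MATLAB, the scheme might be coded as follows (a sketch; we assume the coefficients $a_0, \dots, a_{d-1}$ of the monic polynomial (7.9.1) are stored as a(1), ..., a(d)):

function [p,pd]=hornereval(a,t)
%HORNEREVAL - value and derivative of x^d+a(d)*x^(d-1)+...+a(1) at t
%implements the double Horner scheme above (illustrative sketch)
d=length(a);
b=1; c=1;                   %b_d and c_d
for k=d:-1:2
    b=t*b+a(k);             %b_{k-1}=t*b_k+a_{k-1}
    c=t*c+b;                %c_{k-1}=t*c_k+b_{k-1}
end
p=t*b+a(1);                 %f(t)=b_0
pd=c;                       %f'(t)=c_1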
We proceed as follows: one applies Newton's method, computing simultaneously $f(x_n)$ and $f'(x_n)$,
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$
Once a root $\alpha$ has been found, we apply Newton's method to the deflated polynomial $\frac{f(x)}{x - \alpha}$. For complex roots, one begins with $x_0$ complex and all computations are done in complex arithmetic. It is possible to divide by quadratic factors and to compute entirely in real arithmetic – Bairstow's method. This method of decreasing the degree could lead to large errors. A way of improvement is to use the approximated roots as starting values for Newton's method applied to the original polynomial.

7.10 Newton’s method for systems of nonlinear equations


Newton’s method can be easily adapted to deal with systems of nonlinear equations

F (x) = 0, (7.10.1)

where $F : \Omega \subset \mathbb{R}^n \to \mathbb{R}^n$, and $x, F(x) \in \mathbb{R}^n$. The system (7.10.1) can be written explicitly
$$\begin{cases}
F_1(x_1, \dots, x_n) = 0\\
\qquad\vdots\\
F_n(x_1, \dots, x_n) = 0
\end{cases}$$

Let $F'(x^{(k)})$ be the Jacobian matrix of $F$ at $x^{(k)}$:
$$J := F'(x^{(k)}) = \begin{pmatrix}
\frac{\partial F_1}{\partial x_1}(x^{(k)}) & \dots & \frac{\partial F_1}{\partial x_n}(x^{(k)})\\
\vdots & \ddots & \vdots\\
\frac{\partial F_n}{\partial x_1}(x^{(k)}) & \dots & \frac{\partial F_n}{\partial x_n}(x^{(k)})
\end{pmatrix}. \tag{7.10.2}$$
The quantity $1/f'(x)$ is replaced by the inverse of the Jacobian at $x^{(k)}$:
$$x^{(k+1)} = x^{(k)} - [F'(x^{(k)})]^{-1} F(x^{(k)}). \tag{7.10.3}$$
We write the iteration under the form
$$x^{(k+1)} = x^{(k)} + w^{(k)}. \tag{7.10.4}$$
Note that $w^{(k)}$ is the solution of the linear system of $n$ equations in $n$ unknowns
$$F'(x^{(k)})\, w^{(k)} = -F(x^{(k)}). \tag{7.10.5}$$
It is more efficient and convenient, instead of computing the inverse of the Jacobian, to solve the system (7.10.5) and to use the form (7.10.4) of the iteration.

Theorem 7.10.1. Let $\alpha$ be a solution of the equation $F(x) = 0$ and suppose that in the closed ball $B(\delta) \equiv \{x : \|x - \alpha\|_\infty \le \delta\}$ the Jacobian matrix of $F : \mathbb{R}^n \to \mathbb{R}^n$ exists, is nonsingular, and satisfies a Lipschitz condition
$$\|F'(x) - F'(y)\|_\infty \le c\, \|x - y\|_\infty, \qquad \forall\, x, y \in B(\delta), \quad c > 0.$$
We set $\gamma = c \max\left\{ \|[F'(x)]^{-1}\|_\infty : \|\alpha - x\|_\infty \le \delta \right\}$ and $0 < \varepsilon < \min\{\delta, \gamma^{-1}\}$. Then for any initial approximation $x^{(0)} \in B(\varepsilon) := \{x : \|x - \alpha\|_\infty \le \varepsilon\}$ Newton's method is convergent, and the vectors $e^{(k)} := \alpha - x^{(k)}$ satisfy the following inequalities:
(a) $\|e^{(k+1)}\|_\infty \le \gamma \|e^{(k)}\|_\infty^2$;
(b) $\|e^{(k)}\|_\infty \le \gamma^{-1} \left(\gamma \|e^{(0)}\|_\infty\right)^{2^k}$.

Proof. If $F'$ is continuous on the segment joining the points $x, y \in \mathbb{R}^n$, Lagrange's theorem applied componentwise implies
$$F(x) - F(y) = J_k(x - y),$$
where
$$J_k = \begin{pmatrix}
\frac{\partial F_1}{\partial x_1}(\xi_1) & \dots & \frac{\partial F_1}{\partial x_n}(\xi_1)\\
\vdots & \ddots & \vdots\\
\frac{\partial F_n}{\partial x_1}(\xi_n) & \dots & \frac{\partial F_n}{\partial x_n}(\xi_n)
\end{pmatrix},$$
with the $\xi_j$ on the segment joining $x$ and $y$. Hence
$$e^{(k+1)} = e^{(k)} - [F'(x^{(k)})]^{-1}(F(\alpha) - F(x^{(k)})) = e^{(k)} - [F'(x^{(k)})]^{-1} J_k e^{(k)} = [F'(x^{(k)})]^{-1}(F'(x^{(k)}) - J_k)\, e^{(k)}.$$
From the Lipschitz condition one gets
$$\|F'(x^{(k)}) - J_k\|_\infty \le c \max_{j=\overline{1,n}} \|x^{(k)} - \xi^{(j)}\|_\infty \le c\, \|x^{(k)} - \alpha\|_\infty,$$
and (a) follows. Thus, if $\|\alpha - x^{(k)}\|_\infty \le \varepsilon$, then $\|\alpha - x^{(k+1)}\|_\infty \le (\gamma\varepsilon)\varepsilon \le \varepsilon$. Since (a) holds for any $k$, (b) follows immediately.

MATLAB Source 7.4 Newton method in R and Rn


function [z,ni]=Newton(f,fd,x0,ea,er,nmax)
%NEWTON - Newton method for nonlinear equations in R and Rˆn
%call [z,ni]=Newton(f,fd,x0,ea,er,nmax)
%Input
%f - function
%fd - derivative
%x0 - starting value
%ea,er - absolute and relative error, respectively
%nmax - maximum number of iterations
%Output
%z - approximate solution
%ni - actual no. of iterations

if nargin < 6, nmax=50; end


if nargin < 5, er=0; end
if nargin < 4, ea=1e-3; end
xp=x0(:); %previous x
for k=1:nmax
xc=xp-fd(xp)\f(xp);
if norm(xc-xp,inf)<ea+er*norm(xc,inf)
z=xc; %success
ni=k;
return
end
xp=xc;
end
error('maximum iteration number exceeded')

MATLAB Source 7.4 gives an implementation of Newton's method that works both for scalar equations and for systems.
Example 7.10.2. Consider the nonlinear system:
$$\begin{aligned}
3x_1 - \cos(x_2 x_3) - \frac{1}{2} &= 0,\\
x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 &= 0,\\
e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} &= 0.
\end{aligned}$$
Here are the function and the Jacobian in MATLAB:
Here are the function and the jacobian in MATLAB:
function y=fs3(x)
y=[3*x(1)-cos(x(2)*x(3))-1/2;...
   x(1)^2-81*(x(2)+0.1)^2+sin(x(3))+1.06;...
   exp(-x(1)*x(2))+20*x(3)+(10*pi-3)/3];

function y=fs3d(x)
y=[3, x(3)*sin(x(2)*x(3)), x(2)*sin(x(2)*x(3));...
   2*x(1), -162*(x(2)+0.1), cos(x(3));...
   -x(2)*exp(-x(1)*x(2)), -x(1)*exp(-x(1)*x(2)), 20];

Finally, we give the call example:

>> x0=[0.1;0.1;-0.1];
>> [z,ni]=Newton(@fs3,@fs3d,x0,1e-9);

Applying Newton's method with starting value $x^{(0)} = [0.1, 0.1, -0.1]^T$ one obtains the values given in Table 7.1. The desired accuracy, $10^{-9}$, is attained after 5 iterations. ♦

$k$    $x_1^{(k)}$    $x_2^{(k)}$    $x_3^{(k)}$    $\|x^{(k)} - x^{(k-1)}\|_\infty$
0 0.1000000000 0.1000000000 -0.1000000000 —
1 0.4998696729 0.0194668485 -0.5215204719 0.42152
2 0.5000142402 0.0015885914 -0.5235569643 0.0178783
3 0.5000001135 0.0000124448 -0.5235984501 0.00157615
4 0.5000000000 0.0000000008 -0.5235987756 1.2444e-005
5 0.5000000000 0.0000000000 -0.5235987756 7.75786e-010
6 0.5000000000 -0.0000000000 -0.5235987756 1.11022e-016

Table 7.1: Results for Example 7.10.2

7.11 Quasi-Newton Methods


An important weakness of Newton's method for the solution of systems of nonlinear equations is the necessity to compute the Jacobian matrix and to solve a linear $n \times n$ system having this matrix. To illustrate the size of such a weakness, let us evaluate the amount of computation associated with one iteration of Newton's method. The Jacobian matrix associated with a system of $n$ nonlinear equations $F(x) = 0$ requires the evaluation of the $n^2$ partial derivatives of the $n$ components of $F$. In most situations, evaluation of partial derivatives is inconvenient and often impossible. The computational effort required by an iteration of Newton's method is at least $n^2 + n$ scalar function evaluations ($n^2$ for the Jacobian and $n$ for $F$) and $O(n^3)$ flops for the solution of the linear system. This amount of computation is prohibitive, except for small values of $n$ and scalar functions which can be evaluated easily. So, it is natural to focus our attention on reducing the number of evaluations and on avoiding the solution of a linear system at each step.
With the scalar secant method, the next iterate, $x^{(k+1)}$, is obtained as the solution of the linear equation
$$\bar{l}_k(x) = f(x^{(k)}) + \frac{f(x^{(k)} + h_k) - f(x^{(k)})}{h_k}(x - x^{(k)}) = 0.$$
Here the linear function $\bar{l}_k$ can be interpreted in two ways:
1. $\bar{l}_k$ is an approximation of the tangent equation
$$l_k(x) = f(x^{(k)}) + (x - x^{(k)}) f'(x^{(k)});$$
2. $\bar{l}_k$ is the linear interpolant of $f$ between the points $x^{(k)}$ and $x^{(k+1)}$.
By extending the scalar secant method to $n$ dimensions, different generalizations of the secant method are obtained, depending on the interpretation of $\bar{l}_k$. The first interpretation leads to the discrete Newton method, and the second one to interpolation methods.

The discrete Newton method is obtained by replacing $F'(x)$ in Newton's method (7.10.3) by a discrete approximation $A(x, h)$. The partial derivatives in the Jacobian matrix (7.10.2) are replaced by divided differences
$$A(x, h)\, e_i := [F(x + h_i e_i) - F(x)] / h_i, \qquad i = \overline{1,n}, \tag{7.11.1}$$
where $e_i \in \mathbb{R}^n$ is the $i$-th unit vector and $h_i = h_i(x)$ is the step length of the discretization. A possible choice of step length is
$$h_i := \begin{cases} \varepsilon |x_i|, & \text{if } x_i \ne 0;\\ \varepsilon, & \text{otherwise,} \end{cases}$$
with $\varepsilon := \text{eps}$, where eps is the machine epsilon.
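A sketch of the resulting Jacobian approximation (7.11.1); note that we take $\varepsilon = \sqrt{\text{eps}}$ here, a common practical choice for forward differences, so this detail is our assumption rather than the text's:

function A=numjac(F,x)
%NUMJAC - forward difference approximation of the Jacobian (sketch)
n=length(x); Fx=F(x); A=zeros(n);
for i=1:n
    if x(i)~=0, h=sqrt(eps)*abs(x(i)); else h=sqrt(eps); end
    e=zeros(n,1); e(i)=1;           %i-th unit vector
    A(:,i)=(F(x+h*e)-Fx)/h;         %divided difference (7.11.1)
end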

7.11.1 Linear Interpolation


In linear interpolation, each of the tangent planes is replaced by a (hyper-)plane which interpolates the
component function Fi of F at n + 1 given points xk,j , j = 0, n, located in a neighborhood of x(k) ,
that is one determines vectors a(i) and scalars αi , in such a way that for

Li (x) = αi + a(i)T x, i = 1, n (7.11.2)

the following relations hold

Li (xk,j ) = Fi (xk,j ), i = 1, n, j = 0, n.

The next iterate x(k+1) is obtained by intersecting the n hyperplanes (7.11.2) in Rn+1 and the hyper-
plane y = 0. x(k+1) is the solution of the linear system of equations

Li (x) = 0, i = 1, n. (7.11.3)

Depending on the selection of interpolation points xk,j , numerous different methods are derived, and
among them the best known are Brown’s method and Brent’s method. Brown’s method combines
the processes of approximating F ′ and that of solving the system of linear equations (7.11.3) using
Gaussian elimination. Brent’s method uses a QR factorization for solving (7.11.3). Both methods
are members of a class of methods quadratically convergent (like Newton’s method) but require only
(n2 + 3n)/2 function evaluations per iteration.
In a comparative study Moré and Cosnard [67] found that Brent’s method is often better than
Brown’s method, and that the discrete Newton method is usually the most efficient if the F-evaluations
do not require too much computational effort.

7.11.2 Modification Method


Iterative methods of exceptional efficiency can be constructed by using an approximation $A_k$ of $F'(x^{(k)})$, which is derived from $A_{k-1}$ by a rank 1 modification, i.e., by adding a matrix of rank 1:
$$A_{k+1} := A_k + u^{(k)} \left[v^{(k)}\right]^T, \qquad u^{(k)}, v^{(k)} \in \mathbb{R}^n, \quad k = 0, 1, 2, \dots$$
According to the Sherman-Morrison formula (see [24]),
$$\left(A + u v^T\right)^{-1} = A^{-1} - \frac{1}{1 + v^T A^{-1} u}\, A^{-1} u v^T A^{-1},$$

for $B_{k+1} := A_{k+1}^{-1}$ the recursion
$$B_{k+1} = B_k - \frac{B_k u^{(k)} \left[v^{(k)}\right]^T B_k}{1 + \left[v^{(k)}\right]^T B_k u^{(k)}}, \qquad k = 0, 1, 2, \dots,$$
holds, provided that $1 + \left[v^{(k)}\right]^T B_k u^{(k)} \ne 0$. Thus, the necessity of solving a system of linear equations in every iteration is avoided; a matrix-vector operation suffices. Accordingly, a reduction of the computational effort from $O(n^3)$ to $O(n^2)$ occurs. There is, however, a major drawback: the convergence is no longer quadratic (as it is in Newton's, Brown's or Brent's method); it is only superlinear:
$$\lim_{k\to\infty} \frac{\|x^{(k+1)} - \alpha\|}{\|x^{(k)} - \alpha\|} = 0. \tag{7.11.4}$$

Broyden chooses the vectors $u^{(k)}$ and $v^{(k)}$ using the principle of secant approximation. For the scalar case, the approximation $a_k \approx f'(x^{(k)})$ is uniquely defined by
$$a_{k+1}(x^{(k+1)} - x^{(k)}) = f(x^{(k+1)}) - f(x^{(k)}).$$
However, for $n > 1$, the approximation
$$A_{k+1}(x^{(k+1)} - x^{(k)}) = F(x^{(k+1)}) - F(x^{(k)}) \tag{7.11.5}$$
(called the quasi-Newton equation) no longer defines $A_{k+1}$ uniquely; any other matrix of the form
$$\bar{A}_{k+1} := A_{k+1} + p q^T,$$
where $p, q \in \mathbb{R}^n$ and $q^T(x^{(k+1)} - x^{(k)}) = 0$, also verifies equation (7.11.5). On the other hand,
$$y_k := F(x^{(k)}) - F(x^{(k-1)}) \qquad\text{and}\qquad s_k := x^{(k)} - x^{(k-1)}$$
only contain information about the partial derivatives of $F$ in the direction of $s_k$, not about the partial derivatives in directions orthogonal to $s_k$. In those directions, the effect of $A_{k+1}$ and $A_k$ should be the same:
$$A_{k+1} q = A_k q, \qquad \forall\, q \in \{v : v \ne 0,\ v^T s_k = 0\}. \tag{7.11.6}$$
Starting from an initial approximation $A_0 \approx F'(x^{(0)})$ (the difference quotient given by (7.11.1) could be such an example), one generates the sequence $A_1, A_2, \dots$, uniquely determined by formulas (7.11.5) and (7.11.6) (Broyden [9], Dennis and Moré [24]).
For the corresponding sequence $B_0 = A_0^{-1} \approx [F'(x^{(0)})]^{-1}$, $B_1, B_2, \dots$, the Sherman-Morrison formula can be used to obtain the recursion
$$B_{k+1} := B_k + \frac{(s_{k+1} - B_k y_{k+1})\, s_{k+1}^T B_k}{s_{k+1}^T B_k y_{k+1}}, \qquad k = 0, 1, 2, \dots,$$
which requires only matrix-vector multiplication operations and thus only $O(n^2)$ computational work. With the matrices $B_k$ one can define Broyden's method by the iteration
$$x^{(k+1)} := x^{(k)} - B_k F(x^{(k)}), \qquad k = 0, 1, 2, \dots$$
This method converges superlinearly in the sense of (7.11.4) if the update vectors $s_k$ converge (as $k \to \infty$) to the update vectors of Newton's method. This is a good illustration of the importance of the local linearization principle for the solution of nonlinear equations.
MATLAB Source 7.5 gives an implementation of Broyden method.

MATLAB Source 7.5 Broyden method for nonlinear systems


function [z,ni]=Broyden1(f,fd,x0,ea,er,nmax)
%Broyden1 - Broyden method for nonlinear systems
%call [z,ni]=Broyden1(f,fd,x0,ea,er,nmax)
%Input
%f - function
%fd - derivative (Jacobian, used only to start the iteration)
%x0 - starting approximation
%ea - absolute error
%er - relative error
%nmax - maximum number of iterations
%Output
%z - approximate solution
%ni - actual no. of iterations

if nargin < 6, nmax=50; end
if nargin < 5, er=0; end
if nargin < 4, ea=1e-3; end
x=zeros(length(x0),nmax+1);
F=x;
x(:,1)=x0(:);
F(:,1)=f(x(:,1));
B=inv(fd(x(:,1)));
x(:,2)=x(:,1)+B*F(:,1);
for k=2:nmax
    F(:,k)=f(x(:,k));
    y=F(:,k)-F(:,k-1); s=x(:,k)-x(:,k-1);
    B=B+((s-B*y)*s'*B)/(s'*B*y);    %Broyden update of the inverse
    x(:,k+1)=x(:,k)-B*F(:,k);
    if norm(x(:,k+1)-x(:,k),inf)<ea+er*norm(x(:,k+1),inf)
        z=x(:,k+1); %success
        ni=k;
        return
    end
end
error('maximum iteration number exceeded')

Example 7.11.1. The nonlinear system
$$\begin{aligned}
3x_1 - \cos(x_2 x_3) - \frac{1}{2} &= 0,\\
x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 &= 0,\\
e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} &= 0
\end{aligned}$$
was solved in Example 7.10.2 by using Newton's method. Applying Broyden's method with the same function and Jacobian as in Example 7.10.2 and the starting value $x^{(0)} = [0.1, 0.1, -0.1]^T$, one obtains the results in Table 7.2. The solution in MATLAB is
>> [z,ni]=Broyden1(@fs3,@fs3d,x0,1e-9);
The desired accuracy, $10^{-9}$, is achieved after 8 iterations. ♦

$k$    $x_1^{(k)}$    $x_2^{(k)}$    $x_3^{(k)}$    $\|x^{(k)} - x^{(k-1)}\|_\infty$
0 0.1000000000 0.1000000000 -0.1000000000 —
1 -0.2998696729 0.1805331515 0.3215204719 0.42152
2 0.5005221618 0.0308676159 -0.5197695490 0.84129
3 0.4999648231 0.0130913099 -0.5229213211 0.0177763
4 0.5000140840 0.0018065260 -0.5235454326 0.0112848
5 0.5000009766 0.0001164840 -0.5235956998 0.00169004
6 0.5000000094 0.0000011019 -0.5235987460 0.000115382
7 0.5000000000 0.0000000006 -0.5235987756 1.10133e-006
8 0.5000000000 0.0000000000 -0.5235987756 5.69232e-010
9 0.5000000000 -0.0000000000 -0.5235987756 4.29834e-013

Table 7.2: Results in Example 7.11.1

7.12 Nonlinear Equations in MATLAB


MATLAB has a few functions for finding the zeros of a univariate function. We have already seen that if the function is a polynomial, then roots(p) returns all the roots of p, where p is the coefficient vector, ordered by decreasing powers of the variable.
The fzero function finds a root of a univariate function. The algorithm used by fzero is a combination of several methods: bisection, secant and inverse quadratic interpolation [66]. The simplest invocation of fzero is x = fzero(fun,x0), with x0 a scalar, which attempts to find a zero of fun near x0. For example,

>> fzero(@(x) cos(x)-x,0)


ans =
0.7391

The convergence tolerance and the display of output in fzero are controlled by a third input argument,
the structure options, which is best defined using the optimset function. Four of the fields of the
options structure are used:

• Display specifies the level of reporting, with values off for no output, iter for output at
each iteration, final for just the final output, and notify to display the result only if the
method does not converge (default);
• TolX is a convergence tolerance;
• FunValCheck determines whether function values are checked for complex or NaN values;
• OutputFcn specifies a user-defined function that is called at each iteration.
For the previous example, using Display with value final, we obtain:

>> fzero(@(x) cos(x)-x,0,optimset('Display','final'))


Zero found in the interval [-0.905097, 0.905097]
ans =
0.7391

The input argument x0 can be a 2-vector; at its components the function must have values of opposite signs. Such an argument is useful when the function has singularities.
We shall set Display to final using
os=optimset('Display','final');
Consider the example (message is omitted):

>> [x,fval]=fzero('tan(x)-x',1,os)
...
x =
1.5708
fval =
-1.2093e+015

The second output argument is the function value at x; the huge value shows that the computed x is in fact a singularity, not a zero. Since $f(x) = \tan x - x$ has a singularity at $\pi/2$ (see Figure 7.5), we can pass to fzero a starting interval that encloses a zero but not a singularity:

>> [x,fval]=fzero('tan(x)-x',[-1,1],os)
Zero found in the interval: [-1, 1].
x =
0
fval =
0

We can pass additional parameters p1,p2,... to the function f using calls such as

x = fzero(f,x0,options,p1,p2,...)

but it is more convenient to use anonymous functions.
MATLAB has no built-in function for the solution of nonlinear systems. However, an attempt at solving such a system could be made by minimizing the sum of squares of the residual. The Optimization Toolbox contains a nonlinear equation solver.
Function fminsearch searches for a local minimum of a real function of n real variables. A
possible call is x=fminsearch(f,x0,options). The fields in options structure are those
supported by fzero plus MaxFunEvals (the maximum number of function evaluations allowed),
MaxIter (the maximum number of iterations allowed), TolFun (a termination tolerance on the func-
tion value). Both TolX and TolFun default to 1e-4.

Figure 7.5: Singularity of $f(x) = \tan x - x$, obtained with ezplot('tan(x)-x',[-pi,pi]),grid

Example 7.12.1. The nonlinear system in Examples 7.10.2 and 7.11.1, that is,
$$\begin{aligned}
f_1(x_1, x_2, x_3) &:= 3x_1 - \cos(x_2 x_3) - \frac{1}{2} = 0,\\
f_2(x_1, x_2, x_3) &:= x_1^2 - 81(x_2 + 0.1)^2 + \sin x_3 + 1.06 = 0,\\
f_3(x_1, x_2, x_3) &:= e^{-x_1 x_2} + 20 x_3 + \frac{10\pi - 3}{3} = 0,
\end{aligned}$$
could be solved by trying to minimize the sum of squares of the left-hand sides:
$$F(x_1, x_2, x_3) = [f_1(x_1, x_2, x_3)]^2 + [f_2(x_1, x_2, x_3)]^2 + [f_3(x_1, x_2, x_3)]^2.$$

The function to be minimized is given in M-file fminob.m:

function y = fminob(x)
y=(3*x(1)-cos(x(2)*x(3))-1/2)^2+(x(1)^2-81*(x(2)+0.1)^2+...
  sin(x(3))+1.06)^2+(exp(-x(1)*x(2))+20*x(3)+(10*pi-3)/3)^2;

We chose the starting vector x(0) = [0.5, 0.5, 0.5]T and the tolerance 10−9 for x and for the
function. We give below MATLAB commands and their result

>> os=optimset('Display','final','TolX',1e-9,'TolFun',1e-9);
>> [xm,fval]=fminsearch(@fminob,[0.5,0.5,0.5]’,os)
Optimization terminated:
the current x satisfies the termination criteria using
OPTIONS.TolX of 1.000000e-009 and F(X) satisfies the
convergence criteria using OPTIONS.TolFun of 1.000000e-009
xm =
0.49999999959246
298 Numerical Solution of Nonlinear Equations

0.00000000001815
-0.52359877559440
fval =
1.987081116616629e-018

Because of the complicated objective function, the running time is larger than that of Newton's or Broyden's method, and the precision is worse. The computed approximation can be used as a starting vector for Newton's method. ♦

Function fminsearch is based on the Nelder-Mead simplex algorithm, a direct search method
that uses function values but not derivatives. The method can be very slow to converge, or may fail to
converge to a local minimum. However, it has the advantage of being insensitive to discontinuities in
the function. More sophisticated minimization functions can be found in the Optimization Toolbox.
For the minimization of a univariate function MATLAB provides the fminbnd function. Example

>> [x,fval] = fminbnd(@(x) sin(x)-cos(x),-pi,pi)


x =
-0.7854
fval =
-1.4142

7.13 Applications
7.13.1 Analysis of the state equation of a real gas
The mathematical model and implementation hints for this application are from [72]. For a mole of a perfect gas, the state equation $Pv = RT$ establishes a relation between the pressure $P$ of the gas (in pascals $[Pa]$), the specific volume $v$ (in cubic meters per kilogram $[m^3 Kg^{-1}]$) and its temperature $T$ (in kelvins $[K]$), $R$ being the universal gas constant, expressed in $[J Kg^{-1} K^{-1}]$ (joules per kilogram per kelvin).
For a real gas, the deviation from the state equation of perfect gases is due to van der Waals and
takes into account the intermolecular interaction and the space occupied by molecules of finite size.
Denoting by $\alpha$ and $\beta$ the gas constants according to the van der Waals model, in order to determine the specific volume $v$ of the gas, once $P$ and $T$ are known, we must solve the nonlinear equation

f (v) = (P + α/v 2 )(v − β) − RT = 0. (7.13.1)

For this purpose we shall apply Newton’s method in the case of carbon dioxide (CO2 ), at the
pressure of P = 10[atm] (that is 1013250[P a]) and at the temperature of T = 300[K].
The constants are α = 188.33[P am6 Kg −2 ] and β = 9.77 · 10−4 [m3 Kg −1 ]. Assuming the gas
is perfect, the computed solution is v ≈ 0.056[m3 Kg −1 ].

To approximate the solution we shall use Newton's method and the MATLAB function fzero. Here is a MATLAB script that plots $f$ and solves the equation approximately.

%set parameters
alpha=188.33;
beta_p=9.77e-4;
P=1013250;
R=8.3144721515/0.044;

%R=11.0163766063776
T=300;
%plot f
f = @(v) (P+alpha./v.^2).*(v-beta_p)-R*T;
fd= @(v) P+alpha./v.^2-2*(v-beta_p)*alpha./v.^3;
v=linspace(1e-3,0.1,5000);
plot([0,0.1],[0,0],’k-’,v,f(v),’k-’)
%Newton method
for v0=[1e-4,1e-3,1e-2,1e-1]
[vv,nit]=Newton(f,fd,v0,0,eps,200);
v0,vv,nit
end
%MATLAB fzero
opt=optimset('TolX',eps,'Display','iter');
[xv,fv]=fzero(f,[1e-4,0.1],opt)

Figure 7.6: Graph of the function $f$ in (7.13.1)

Newton's method, for a relative error equal to the machine epsilon and starting values $v_0$ given below, requires a number of iterations denoted by $Nit$:

$v_0$:    $10^{-4}$    $10^{-3}$    $10^{-2}$    $10^{-1}$
$Nit$:    30           36           7            5

The computed approximation of $v$ is $v^* = 0.053515529144585$. To analyze the causes of the strong dependence of $Nit$ on $v_0$, let us examine the derivative $f'(v) = P - \alpha v^{-2} + 2\alpha\beta v^{-3}$. For $v > 0$, $f'(v) = 0$ at $v_M \approx 1.99 \cdot 10^{-3}\,[m^3 Kg^{-1}]$ (a relative maximum) and at $v_m \approx 1.25 \cdot 10^{-2}\,[m^3 Kg^{-1}]$

(a relative minimum) (see the graph of f in Figure 7.6). The choice of v0 in the interval (0, vm ) leads
to a slow convergence.
The results for each iteration of fzero are

Func-count x f(x) Procedure


2 0.1 45510.4 initial
3 0.0997264 45238.3 interpolation
4 0.0543725 814.9 interpolation
5 0.0543725 814.9 bisection
6 0.0534599 -52.8251 interpolation
7 0.0535155 -0.0545071 interpolation
8 0.0535155 6.49088e-008 interpolation
9 0.0535155 0 interpolation

Zero found in the interval [0.0001, 0.1]


xv =
0.053515529144585
fv =
0

7.13.2 Nonlinear heat transfer in a wire


Consider a thin wire of length L and radius r. Let the ends of the wire have a fixed temperature of
900 and let the surrounding region be usur = 300. Suppose the surface of the wire is being cooled via
radiation.
The continuous model for the heat transfer is [102]
$$-(K u_x)_x = c(u_{sur}^4 - u^4), \qquad u(0) = u(L) = 900,$$

where c = 2εσ/r, ε is the emissivity of the surface and σ is the Stefan-Boltzmann constant. For
simplicity assume K is a constant and will be incorporated into c. Consider the nonlinear differential
equation −uxx = f (u). Applying the finite difference discretization, as in Section 4.9.1, for h =
L/(n + 1) and ui ≈ u(ih) with u0 = un+1 = 900 we obtain

−ui−1 + 2ui − ui+1 = h2 f (ui ), i = 1, ..., n.

Thus, this discrete model has n unknowns, ui , and n equations

Fi (u) ≡ h2 f (ui ) + ui−1 − 2ui + ui+1 = 0.

The Jacobian matrix is easily computed and must be tridiagonal because each $F_i(u)$ only depends on $u_{i-1}$, $u_i$ and $u_{i+1}$:
$$F'(u) = \begin{pmatrix}
h^2 f'(u_1) - 2 & 1 & & \\
1 & h^2 f'(u_2) - 2 & \ddots & \\
& \ddots & \ddots & 1\\
& & 1 & h^2 f'(u_n) - 2
\end{pmatrix}.$$
7.13. Applications 301

For the Stefan cooling model where the absolute temperature is positive f ′ (u) < 0. Thus, the Jacobian
matrix is strictly diagonally dominant and must be nonsingular so that the solve step can be done in
Newton’s method. The MATLAB function HeatTransfer contains the code for the solution based
on Newton’s method.

function [x,y]=HeatTransfer(L,c,n)

h = L/(n+1);
x =(0:n+1)*h;
f = @(u) c*(300^4 - u.^4);
df = @(u) -4*c*u.^3;
u0 = 900*ones(n,1);
[z,ni] = Newton(@F,@Jac,u0,1e-4,0,150);
y=[900;z;900];

function y=F(u)
ux=[900;u;900];
y=h^2*f(u)+ux(1:end-2)-2*ux(2:end-1)+ux(3:end);
end
function y=Jac(u)
y=diag(h^2*df(u)-2)+diag(ones(n-1,1),1)+diag(ones(n-1,1),-1);
end
end

We have experimented with $c = 10^{-8}$, $10^{-7}$ and $10^{-6}$. The curves in Figure 7.7 indicate that the larger $c$ is, the stronger the cooling, that is, the lower the temperature. We obtained the figure with the code

%THT - test heat transfer


c = [1e-8, 1e-7, 1e-6];
n=30;
x=zeros(n+2,3);
y=x;
L=1;

for k=1:length(c)
[x(:,k),y(:,k)]=HeatTransfer(L,c(k),n);
end
plot(x(:,1),y(:,1),’k-’,x(:,2),y(:,2),’k--’,...
x(:,3),y(:,3),’k:’);
legend('1e-8','1e-7','1e-6',0)

Problems
Problem 7.1. Find the first 10 positive values $x$ for which $x = \tan x$.

Problem 7.2. Investigate the behavior of Newton and secant method for the function
p
f (x) = sign(x − a) |x − a|.

Figure 7.7: Temperatures for variable $c$

Problem 7.3 (Adapted after [66]). Consider the polynomial
$$x^3 - 2x - 5.$$
Wallis used this example to present Newton's method to the French Academy. It has a real root within the interval $(2, 3)$ and a pair of complex conjugate roots.
(a) Use Maple or Symbolic Math toolbox to compute the roots. The results are ugly. Convert them
to numerical values.
(b) Determine all the roots using MATLAB function roots.
(c) Find the real root with MATLAB function fzero.
(d) Find all the roots using Newton method (for complex roots use complex starting values).
(e) Can you use bisection or false position method to find a complex root? Why or why not?

Problem 7.4. Solve the systems numerically:
$$\begin{aligned}
f_1(x, y) &= 1 - 4x + 2x^2 - 2y^3 = 0\\
f_2(x, y) &= -4 + x^4 + 4y + 4y^4 = 0,
\end{aligned}$$
$$\begin{aligned}
f_1(x_1, x_2) &= x_1^2 - x_2 + 0.25 = 0\\
f_2(x_1, x_2) &= -x_1 + x_2^2 + 0.25 = 0,
\end{aligned}$$
$$\begin{aligned}
f_1(x_1, x_2) &= 2x_1 + x_2 - x_1 x_2/2 - 2 = 0\\
f_2(x_1, x_2) &= x_1 + 2x_2^2 - \cos(x_2)/2 - \frac{3}{2} = 0.
\end{aligned}$$

Problem 7.5. Solve the system numerically

9x2 + 36y 2 + 4z 2 − 36 = 0,
x2 − 2y 2 − 20z = 0,
x2 − y 2 + z 2 = 0

Hint. There are four solutions. Good starting values: [±1, ±1, 0]T .

Problem 7.6. Consider the system, inspired from the chemical industry,
$$f_i := \beta a_i^2 + a_i - a_{i-1} = 0.$$
The system has $n$ equations and $n + 2$ unknowns. We shall set $a_0 = 5$, $\beta = a_n = 0.5$ mol/liter. Solve the system for $n = 10$ and the starting value x=[1:-0.1:0.1]'.
CHAPTER 8

Eigenvalues and Eigenvectors

In the following we study the problem of determining eigenvalues (and eigenvectors) of a square matrix
A ∈ Rn×n , that is, to find the numbers λ ∈ C and the vectors x ∈ Cn such that

Ax = λx. (8.0.1)

Definition 8.0.1. A number λ ∈ C is called an eigenvalue of the matrix A ∈ Rn×n , if there is a vector
x ∈ Cn \ {0} called an eigenvector (or a right eigenvector) such that Ax = λx.

Remark 8.0.2. 1. The requirement $x \ne 0$ is important, since $Ax = \lambda x$ holds trivially for the null vector and any $\lambda$.
2. Even if $A$ is real, some of its eigenvalues may be complex. For real matrices, these always occur in conjugate pairs. ♦

8.1 Eigenvalues and Polynomial Roots


Any eigenvalue problem can be reduced to the problem of finding the zeros of a polynomial: the eigenvalues of a matrix $A \in \mathbb{R}^{n\times n}$ are the roots of its characteristic polynomial
$$p_A(\lambda) = \det(A - \lambda I), \qquad \lambda \in \mathbb{C},$$
since the above determinant is zero if and only if the system $(A - \lambda I)x = 0$ has a nontrivial solution, that is, $\lambda$ is an eigenvalue.
A naive method for the solution of eigenproblems is the computation of the characteristic polynomial followed by the computation of its roots. But the computation of a determinant is, in general, a complicated and unstable process, so matrix transformation is more appropriate. Conversely, the problem of computing

polynomial roots can be formulated as an eigenvalue problem. Let $p \in \mathcal{P}_n$ be a polynomial with real coefficients, which can be written (by means of its roots $z_1, \dots, z_n$, which could be complex) as
$$p(x) = a_n x^n + \dots + a_0 = a_n(x - z_1)\dots(x - z_n), \qquad a_n \in \mathbb{R}, \quad a_n \ne 0.$$
In the vector space $\mathcal{P}_{n-1}$, "multiplication by $x$ modulo $p$",
$$\mathcal{P}_{n-1} \ni q \mapsto r, \qquad x q(x) = \alpha p(x) + r(x), \quad r \in \mathcal{P}_{n-1}, \tag{8.1.1}$$
is a linear transform, and since
$$x^n = \frac{1}{a_n} p(x) - \sum_{j=0}^{n-1} \frac{a_j}{a_n} x^j, \qquad x \in \mathbb{R},$$
we shall represent this map in the basis $1, x, \dots, x^{n-1}$ by means of the so-called Frobenius companion matrix (of size $n \times n$)
$$M = \begin{pmatrix}
0 & & & & -\frac{a_0}{a_n}\\
1 & 0 & & & -\frac{a_1}{a_n}\\
& \ddots & \ddots & & \vdots\\
& & 1 & 0 & -\frac{a_{n-2}}{a_n}\\
& & & 1 & -\frac{a_{n-1}}{a_n}
\end{pmatrix}. \tag{8.1.2}$$
Let $v_j = (v_{jk} : k = \overline{1,n}) \in \mathbb{C}^n$, $j = \overline{1,n}$, be chosen so that
$$\ell_j(x) = \frac{p(x)}{x - z_j} = a_n \prod_{k \ne j}(x - z_k) = \sum_{k=1}^n v_{jk} x^{k-1}, \qquad j = \overline{1,n};$$
then
$$\sum_{k=1}^n (M v_j - z_j v_j)_k\, x^{k-1} = x \ell_j(x) - p(x) - z_j \ell_j(x) = (x - z_j)\ell_j(x) - p(x) = 0,$$
and thus $M v_j = z_j v_j$, $j = \overline{1,n}$.
Hence, the eigenvalues of $M$ are the roots of $p$.
The Frobenius matrix given by (8.1.2) is only one way (there exist many others) to represent the "multiplication" in (8.1.1); any other basis of $\mathcal{P}_{n-1}$ provides a matrix $M$ whose eigenvalues are the roots of $p$. The only device needed for polynomial handling is division with remainder. A small illustration is given below.
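As a quick illustration in MATLAB (variable names are ours), the roots of $p(x) = x^3 - 2x - 5$ (see also Problem 7.3) can be obtained as the eigenvalues of the companion matrix (8.1.2):

a=[-5;-2;0];                      %a_0,a_1,a_2; leading coefficient 1
n=length(a);
M=[[zeros(1,n-1);eye(n-1)],-a];   %Frobenius companion matrix (8.1.2)
lambda=eig(M)                     %agrees with roots([1 0 -2 -5])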

8.2 Basic Terminology and Schur Decomposition


As the following example shows,
$$A = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix}, \qquad p_A(\lambda) = \lambda^2 + 1 = (\lambda + i)(\lambda - i),$$

a real matrix may have complex eigenvalues. Therefore (at least from theoretical point of view) it is
convenient to deal with complex matrices A ∈ Cn×n .

Definition 8.2.1. Two matrices, A, B ∈ Cn×n are called similar if there exists a nonsingular matrix
T ∈ Cn×n , such that
A = T BT −1 .

Lemma 8.2.2. If A, B ∈ Cn×n are similar, their eigenvalues are the same.

Proof. Let λ ∈ C be an eigenvalue of A = T BT −1 and x ∈ Cn its eigenvector. Then

B(T −1 x) = T −1 AT T −1 x = T −1 Ax = λT −1 x

and hence, λ is also an eigenvalue of B. 

The following important result from linear algebra holds, which we state without proof. (For a
proof see [41].)

Theorem 8.2.3 (Jordan normal form). Any matrix $A \in \mathbb{C}^{n\times n}$ is similar to a matrix
$$J = \begin{pmatrix} J_1 & & \\ & \ddots & \\ & & J_k \end{pmatrix}, \qquad J_\ell = \begin{pmatrix} \lambda_\ell & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1\\ & & & \lambda_\ell \end{pmatrix} \in \mathbb{C}^{n_\ell \times n_\ell}, \qquad \sum_{\ell=1}^k n_\ell = n,$$
called the Jordan normal form of $A$.

Definition 8.2.4. A matrix is called diagonalizable if all its Jordan blocks $J_\ell$ have dimension one, that is, $n_\ell = 1$, $\ell = \overline{1,k}$. A matrix is called nonderogatory if each eigenvalue $\lambda_\ell$ appears in exactly one Jordan block on the diagonal.

Remark 8.2.5. If a matrix A ∈ Rn×n has n simple eigenvalues, then it is diagonalizable and also
nonderogatory, and suitable for numerical treatment. ♦

Theorem 8.2.6 (Schur decomposition). For every matrix $A \in \mathbb{C}^{n\times n}$ there exist a unitary matrix $U \in \mathbb{C}^{n\times n}$ and an upper triangular matrix
$$R = \begin{pmatrix} \lambda_1 & * & \dots & *\\ & \ddots & \ddots & \vdots\\ & & \ddots & *\\ & & & \lambda_n \end{pmatrix} \in \mathbb{C}^{n\times n},$$
such that $A = U R U^*$.

Remark 8.2.7. 1. The diagonal elements of R are eigenvalues of A. Since A and R are similar,
they have the same eigenvalues.
2. We have a stronger form of similarity between A and R: they are unitary-similar. ♦

Proof of Theorem 8.2.6. By induction on $n$. The case $n = 1$ is trivial. Suppose the theorem is true for $n \in \mathbb{N}$ and let $A \in \mathbb{C}^{(n+1)\times(n+1)}$. Let $\lambda \in \mathbb{C}$ be an eigenvalue of $A$ and $x \in \mathbb{C}^{n+1}$, $\|x\|_2 = 1$, a corresponding eigenvector. We take $u_1 = x$ and we choose $u_2, \dots, u_{n+1}$ such that $u_1, \dots, u_{n+1}$ form an orthonormal basis of $\mathbb{C}^{n+1}$, or equivalently, the matrix $U = [u_1, \dots, u_{n+1}]$ is unitary. Thus,
$$U^* A U e_1 = U^* A u_1 = U^* A x = \lambda U^* x = \lambda e_1,$$

that is,
$$U^* A U = \begin{pmatrix} \lambda & *\\ 0 & B \end{pmatrix}, \qquad B \in \mathbb{C}^{n\times n}.$$
By the induction hypothesis, there exists a unitary matrix $V \in \mathbb{C}^{n\times n}$, such that $B = V S V^*$, where $S \in \mathbb{C}^{n\times n}$ is an upper triangular matrix. Therefore,
$$A = U \begin{pmatrix} \lambda & *\\ 0 & V S V^* \end{pmatrix} U^* = \underbrace{U \begin{pmatrix} 1 & 0\\ 0 & V \end{pmatrix}}_{=:\tilde{U}} \underbrace{\begin{pmatrix} \lambda & *\\ 0 & S \end{pmatrix}}_{=:R} \underbrace{\begin{pmatrix} 1 & 0\\ 0 & V^* \end{pmatrix} U^*}_{=\tilde{U}^*},$$
which completes the proof.

Let us now give two direct consequences of the Schur decomposition.

Corollary 8.2.8. To each Hermitian matrix $A \in \mathbb{C}^{n\times n}$ there corresponds a unitary matrix $U \in \mathbb{C}^{n\times n}$ such that
$$A = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^*, \qquad \lambda_j \in \mathbb{R}, \quad j = \overline{1,n}.$$
Proof. The matrix $R$ in Theorem 8.2.6 verifies $R = U^* A U$. Since
$$R^* = (U^* A U)^* = U^* A^* U = U^* A U = R,$$
$R$ must be diagonal, and its diagonal elements are real ($R$ being Hermitian).

In other words, Corollary 8.2.8 guarantees that any Hermitian matrix is unitary-diagonalizable and has a basis which consists of orthonormal eigenvectors. Moreover, all eigenvalues of a Hermitian matrix are real. It is interesting, not only from a theoretical point of view, which matrices are unitary-diagonalizable.
Theorem 8.2.9. A matrix $A \in \mathbb{C}^{n\times n}$ is unitary-diagonalizable, that is, there exists a unitary matrix $U \in \mathbb{C}^{n\times n}$ such that
$$A = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^*, \qquad \lambda_j \in \mathbb{C}, \quad j = \overline{1,n}, \tag{8.2.1}$$
if and only if $A$ is normal, i.e.,
$$A A^* = A^* A. \tag{8.2.2}$$

Proof. We set $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. By (8.2.1), $A$ has the form $A = U \Lambda U^*$, so
$$A A^* = U \Lambda U^* U \Lambda^* U^* = U |\Lambda|^2 U^* \qquad\text{and}\qquad A^* A = U \Lambda^* U^* U \Lambda U^* = U |\Lambda|^2 U^*,$$
that is, (8.2.2). For the converse, we use the Schur decomposition of $A$ in the form $R = U^* A U$. Then
$$|\lambda_1|^2 = (R^* R)_{11} = (R R^*)_{11} = |\lambda_1|^2 + \sum_{k=2}^n |r_{1k}|^2,$$
which implies $r_{12} = \dots = r_{1n} = 0$. By induction, for $j = \overline{2,n}$,
$$(R^* R)_{jj} = |\lambda_j|^2 + \sum_{k=1}^{j-1} |r_{kj}|^2 = (R R^*)_{jj} = |\lambda_j|^2 + \sum_{k=j+1}^n |r_{jk}|^2,$$
and for this reason $R$ must be diagonal.



A Schur decomposition for real matrices, the so-called real Schur decomposition, is a little more complicated.
Theorem 8.2.10. For any matrix $A \in \mathbb{R}^{n\times n}$ there exists an orthogonal matrix $U \in \mathbb{R}^{n\times n}$ such that
$$A = U \begin{pmatrix} R_1 & * & \dots & *\\ & \ddots & \ddots & \vdots\\ & & \ddots & *\\ & & & R_k \end{pmatrix} U^T, \tag{8.2.3}$$
where either $R_j \in \mathbb{R}^{1\times1}$, or $R_j \in \mathbb{R}^{2\times2}$ with two complex conjugate eigenvalues, $j = \overline{1,k}$.


A real Schur decomposition transforms $A$ into an upper Hessenberg matrix
$$U^T A U = \begin{pmatrix} * & \dots & \dots & *\\ * & \ddots & & \vdots\\ & \ddots & \ddots & \vdots\\ & & * & * \end{pmatrix}.$$
Proof. If all eigenvalues of $A$ are real, then we proceed the same way as for the complex Schur decomposition. Thus, let $\lambda = \alpha + i\beta$, $\beta \ne 0$, be a complex eigenvalue of $A$ and $x + iy$ its eigenvector. Then
$$A(x + iy) = \lambda(x + iy) = (\alpha + i\beta)(x + iy) = (\alpha x - \beta y) + i(\beta x + \alpha y),$$
or in matrix form
$$A \underbrace{[x\ y]}_{\in \mathbb{R}^{n\times2}} = [x\ y] \underbrace{\begin{pmatrix} \alpha & \beta\\ -\beta & \alpha \end{pmatrix}}_{:=R}.$$
Since $\det R = \alpha^2 + \beta^2 > 0$ ($\beta \ne 0$), $\operatorname{span}\{x, y\}$ is an $A$-invariant two-dimensional subspace of $\mathbb{R}^n$. Then we choose $u_1$, $u_2$ such that they form a basis of this space, completed by $u_3, \dots, u_n$ such that all these vectors form an orthonormal basis of $\mathbb{R}^n$, and after a reasoning analogous to that of the complex case we get
$$U^T A U = \begin{pmatrix} R & *\\ 0 & B \end{pmatrix},$$
and the induction proceeds as for the complex Schur decomposition.

8.3 Vector Iteration


Vector iteration (also called power method) is the simplest method when an eigenvalue and its eigen-
vector is needed.
Starting with an arbitrary $y^{(0)} \in \mathbb{C}^n$, one constructs the sequence $y^{(k)}$, $k \in \mathbb{N}$, based on the following iteration:
$$z^{(k)} = A y^{(k-1)}, \qquad y^{(k)} = \frac{z^{(k)}}{z_{j^*}^{(k)}}, \qquad j^* = \min\left\{ 1 \le j \le n : \frac{|z_j^{(k)}|}{\|z^{(k)}\|_\infty} \ge 1 - \frac{1}{k} \right\}. \tag{8.3.1}$$
Under certain conditions, this sequence converges to the eigenvector corresponding to the dominant eigenvalue. A simplified sketch is given below.
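The following minimal sketch scales by the largest-modulus component instead of applying the index selection rule of (8.3.1); names are illustrative and the input y is assumed to be a suitable starting vector.

function [lambda,y]=powermethod(A,y,nmax)
%POWERMETHOD - vector iteration, illustrative sketch
if nargin<3, nmax=100; end
for k=1:nmax
    z=A*y;
    [~,j]=max(abs(z));    %component of largest modulus
    lambda=z(j)/y(j);     %estimate of the dominant eigenvalue
    y=z/z(j);             %scale so that y(j)=1
end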

Proposition 8.3.1. Let $A \in \mathbb{C}^{n\times n}$ be a diagonalizable matrix whose eigenvalues verify the condition
$$|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|.$$

Then the sequence y (k) , k ∈ N, converges to a multiple of the normed eigenvector corresponding to the
eigenvalue λ1 , for almost every starting vector y (0) .

Proof. Let $x_1, \dots, x_n$ be the orthonormal eigenvectors of $A$ corresponding to the eigenvalues $\lambda_1, \dots, \lambda_n$ – they exist, since $A$ is diagonalizable. We express $y^{(0)}$ as
$$y^{(0)} = \sum_{j=1}^n \alpha_j x_j, \qquad \alpha_j \in \mathbb{C}, \quad j = \overline{1,n},$$
and we state that
$$A^k y^{(0)} = \sum_{j=1}^n \alpha_j A^k x_j = \sum_{j=1}^n \alpha_j \lambda_j^k x_j = \lambda_1^k \sum_{j=1}^n \alpha_j \left(\frac{\lambda_j}{\lambda_1}\right)^k x_j.$$
Since $|\lambda_1| > |\lambda_j|$, $j = \overline{2,n}$, this implies
$$\lim_{k\to\infty} \lambda_1^{-k} A^k y^{(0)} = \alpha_1 x_1 + \lim_{k\to\infty} \sum_{j=2}^n \alpha_j \left(\frac{\lambda_j}{\lambda_1}\right)^k x_j = \alpha_1 x_1,$$
and also
$$\lim_{k\to\infty} \left\| |\lambda_1|^{-k} A^k y^{(0)} \right\|_\infty = |\alpha_1|\, \|x_1\|_\infty.$$
If $\alpha_1 = 0$, and thus $y^{(0)}$ is in the hyperplane
$$x_1^\perp = \{x \in \mathbb{C}^n : x^* x_1 = 0\},$$
then both limits are zero, and we cannot derive any conclusion on the convergence of $y^{(k)}$, $k \in \mathbb{N}$; this hyperplane has measure zero, so in the sequel we shall suppose $\alpha_1 \ne 0$.
Equation (8.3.1) implies $y^{(k)} = \gamma_k A^k y^{(0)}$, $\gamma_k \in \mathbb{C}$, and moreover $\|y^{(k)}\|_\infty = 1$, so
$$\lim_{k\to\infty} |\lambda_1|^k |\gamma_k| = \lim_{k\to\infty} \frac{1}{\| |\lambda_1|^{-k} A^k y^{(0)} \|_\infty} = \frac{1}{|\alpha_1|\, \|x_1\|_\infty}.$$

Thus
$$y^{(k)} = \gamma_k A^k y^{(0)} = \underbrace{\frac{\gamma_k \lambda_1^k}{|\gamma_k \lambda_1^k|}}_{=:e^{-2\pi i \theta_k}}\; \underbrace{\frac{\alpha_1 x_1}{|\alpha_1|\, \|x_1\|_\infty}}_{=:\alpha x_1} + O\left(\frac{|\lambda_2|^k}{|\lambda_1|^k}\right), \qquad k \in \mathbb{N}, \tag{8.3.2}$$
where $\theta_k \in [0, 1]$. Now it is time to use the "strange" relation (8.3.1): let $j$ be the least subscript such that $|(\alpha x_1)_j| = \|\alpha x_1\|_\infty$; then, by (8.3.2), for a sufficiently large $k$, $j^* = j$ holds in (8.3.1) too. Therefore,
$$\lim_{k\to\infty} y_j^{(k)} = 1 \quad\Rightarrow\quad \lim_{k\to\infty} e^{-2\pi i \theta_k} = \lim_{k\to\infty} \frac{y_j^{(k)}}{(\alpha x_1)_j} = \frac{1}{(\alpha x_1)_j}.$$

Substituting this in (8.3.2) we conclude the convergence of y (k) , k ∈ N. 



We could also apply vector iteration to compute all eigenvalues and eigenvectors, provided that the eigenvalues of A have distinct moduli. For this purpose we find the largest modulus eigenvalue $\lambda_1$ of A and the corresponding eigenvector $x_1$, and we proceed to
$$A^{(1)} = A - \lambda_1x_1x_1^T.$$
The matrix $A^{(1)}$ is diagonalizable and has the same orthonormal eigenvectors as A, except that $x_1$ is now the eigenvector corresponding to the eigenvalue 0 and plays no role in the iteration, provided that the starting vector is not a multiple of $x_1$. By applying vector iteration once more, to $A^{(1)}$, one obtains the second largest modulus eigenvalue $\lambda_2$ and the corresponding eigenvector; the iteration
$$A^{(j)} = A^{(j-1)} - \lambda_jx_jx_j^T, \quad j = \overline{1,n}, \qquad A^{(0)} = A,$$
computes successively all the eigenvalues and eigenvectors of A, if its eigenvalues have distinct moduli.
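A hedged sketch of this deflation loop, built on the powermethod sketch above (both function names are ours; like the text, it assumes eigenvalues of distinct moduli and orthonormal eigenvectors, and a real dominant eigenvalue at each stage):

function lambda=powerdeflate(A,nit)
%POWERDEFLATE - all eigenvalues by power method plus deflation (sketch)
%assumes real eigenvalues of distinct moduli and orthonormal eigenvectors
n=size(A,1); lambda=zeros(n,1);
for j=1:n
    [x,lambda(j)]=powermethod(A,rand(n,1),nit);
    x=x/norm(x);              %normalize the computed eigenvector
    A=A-lambda(j)*(x*x');     %A^(j) = A^(j-1) - lambda_j x_j x_j^T
end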
Remark 8.3.2 (Drawbacks of vector iteration). 1. The method works only if there exists a dominant eigenvector, that is, only when there exists a unique eigenvector corresponding to the dominant eigenvalue. For example, the matrix
$$A = \begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}$$
transforms the vector $[x_1\ x_2]^T$ into the vector $[x_2\ x_1]^T$, and convergence holds only if the starting vector is an eigenvector.
2. The method works well only for "suitable" starting vectors. It sounds appealing that all vectors outside a certain hyperplane are good, but things are more complicated. If the dominant eigenvalue of a real matrix is complex and the starting values are real, then the iteration runs indefinitely, without finding an eigenvector.
3. We could perform all computations in complex arithmetic, but this seriously increases the computational cost (by a factor of two for additions and six for multiplications, respectively).
4. The speed of convergence depends on the ratio
$$\frac{|\lambda_2|}{|\lambda_1|} < 1,$$
which may be arbitrarily close to 1. If the dominant eigenvalue is not sufficiently dominant, the convergence is very slow. ♦

Taking into account the above remarks, we conclude that the vector iteration method is not good enough.

8.4 QR Method – the Theory


The practical method for eigenproblems is the QR method, due to Francis [29] and Kublanovskaya [53], a unitary extension of Rutishauser's LR method [75]. We begin with the complex case.
The iterative method is very simple: one starts with $A^{(0)} = A$ and computes iteratively, using the QR decomposition,
$$A^{(k)} = Q_kR_k, \qquad A^{(k+1)} = R_kQ_k, \qquad k\in\mathbb{N}_0. \tag{8.4.1}$$
With a bit of luck, or as mathematicians say, under certain hypotheses, this sequence will converge to a matrix whose diagonal elements are the eigenvalues of A.
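Iteration (8.4.1) can be transcribed almost verbatim in MATLAB using the built-in qr; the test matrix and the fixed number of iterations below are our own choices (the matrix anticipates Example 8.5.1):

A=[1 1 1; 1 2 3; 1 2 1];   %the matrix of Example 8.5.1
for k=1:40
    [Q,R]=qr(A);           %A^(k) = Q_k R_k
    A=R*Q;                 %A^(k+1) = R_k Q_k
end
diag(A)                    %approximations of the eigenvalues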

Lemma 8.4.1. The matrices $A^{(k)}$ built by (8.4.1), $k\in\mathbb{N}$, are orthogonally similar to A (and obviously have the same eigenvalues as A).

Proof. The following identity holds:
$$A^{(k+1)} = Q_k^*Q_kR_kQ_k = Q_k^*A^{(k)}Q_k = \cdots = \underbrace{Q_k^*\cdots Q_0^*}_{=:U_k^*}\,A\,\underbrace{Q_0\cdots Q_k}_{=:U_k}. \qquad\square$$

In order to prove convergence, we shall interpret the QR iteration as a generalization of vector iteration (8.3.1) (without the strange norming process) to vector subspaces. For this purpose, we shall write an orthonormal basis $u_1,\ldots,u_m\in\mathbb{C}^n$ of an m-dimensional subspace $\mathcal{U}\subset\mathbb{C}^n$, $m\le n$, as the column vectors of a unitary matrix $U\in\mathbb{C}^{n\times m}$ and we shall iterate the subspaces (i.e., the matrices) via the QR decomposition
$$U_{k+1}R_k = AU_k, \qquad k\in\mathbb{N}_0, \quad U_0\in\mathbb{C}^{n\times m}. \tag{8.4.2}$$
This immediately implies
$$U_{k+1}(R_k\cdots R_0) = AU_k(R_{k-1}\cdots R_0) = A^2U_{k-1}(R_{k-2}\cdots R_0) = \cdots = A^{k+1}U_0. \tag{8.4.3}$$
If we define, for m = n, $A^{(k)} = U_k^*AU_k$, then by (8.4.2) the following relations hold:
$$A^{(k)} = U_k^*AU_k = U_k^*U_{k+1}R_k,$$
$$A^{(k+1)} = U_{k+1}^*AU_{k+1} = U_{k+1}^*AU_kU_k^*U_{k+1},$$
and setting $Q_k := U_k^*U_{k+1}$, we obtain the iteration rule (8.4.1). We choose $U_0 = I$ as starting matrix.

Definition 8.4.2. A phase matrix $\Theta\in\mathbb{C}^{n\times n}$ is a diagonal matrix of the form
$$\Theta = \begin{pmatrix}e^{-i\theta_1} & & \\ & \ddots & \\ & & e^{-i\theta_n}\end{pmatrix}, \qquad \theta_j\in[0,2\pi),\ j=\overline{1,n}.$$

Proposition 8.4.3. Suppose $A\in\mathbb{C}^{n\times n}$ has eigenvalues with distinct moduli, $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0$. If the matrix $X^{-1}$ in the Jordan normal form $A = X\Lambda X^{-1}$ of A has an LU decomposition
$$X^{-1} = ST, \qquad S = \begin{pmatrix}1 & & & \\ * & 1 & & \\ \vdots & \ddots & \ddots & \\ * & \cdots & * & 1\end{pmatrix}, \qquad T = \begin{pmatrix}* & \cdots & *\\ & \ddots & \vdots\\ & & *\end{pmatrix},$$
then there exist phase matrices $\Theta_k$, $k\in\mathbb{N}_0$, such that the matrix sequence $(\Theta_kU_k)$, $k\in\mathbb{N}$, is convergent.

Remark 8.4.4 (On Proposition 8.4.3). 1. The convergence of the matrix sequence $(\Theta_kU_k)$ means that the corresponding orthonormal bases converge to an orthonormal basis of $\mathbb{C}^n$; hence we also have convergence of the corresponding vector subspaces.

2. The existence of an LU decomposition for $X^{-1}$ introduces no essential additional constraint: since $X^{-1}$ is invertible, there exists a permutation P such that
$$X^{-1}P^T = (PX)^{-1} = LU$$
and PX is invertible. This means that the matrix $\hat A = P^TAP$, which results from row and column permutations of A, has the same eigenvalues as A and fulfills the hypothesis of Proposition 8.4.3.
3. The proof of Proposition 8.4.3 is a modification of the proof in [90, pp. 54–56] for the convergence of LR, whose origin can be found in Wilkinson's book¹ [103]. What is the LR method? It is analogous to the QR method, but the QR decomposition of $A^{(k)}$ is replaced by an LU decomposition, $A^{(k)} = L_kR_k$, and then one builds $A^{(k+1)} = R_kL_k$. Under certain conditions, this method converges to an upper triangular matrix. ♦

Before the proof of Proposition 8.4.3, let us see why the convergence of the sequence $(U_k)$ implies the convergence of the QR method. Namely, if we have $\|U_{k+1} - U_k\|_2 \le \varepsilon$, or equivalently
$$U_{k+1} = U_k + E, \qquad \|E\|_2\le\varepsilon,$$
then
$$Q_k = U_k^*U_{k+1} = U_k^*(U_k + E) = I + U_k^*E =: I + F, \qquad \|F\|_2 \le \underbrace{\|U_k\|_2}_{=1}\|E\|_2 \le \varepsilon,$$

and simultaneously
$$A^{(k+1)} = R_kQ_k = R_k(I + F) = R_k + G, \qquad \|G\|_2\le\varepsilon\|R_k\|_2,$$
hence $A^{(k)}$, $k\in\mathbb{N}$, also converges to an upper triangular matrix, provided that the norms of $R_k$, $k\in\mathbb{N}$, are uniformly bounded. This is the case, since
$$\|R_k\|_2 = \|Q_k^*A^{(k)}\|_2 = \|A^{(k)}\|_2 = \|Q_{k-1}^*\cdots Q_0^*AQ_0\cdots Q_{k-1}\|_2 = \|A\|_2.$$

We also need an auxiliary result on the "uniqueness" of the QR decomposition.

Lemma 8.4.5. Let $U, V\in\mathbb{C}^{n\times n}$ be unitary matrices and $R, S\in\mathbb{C}^{n\times n}$ be invertible upper triangular matrices. Then $UR = VS$ if and only if there exists a phase matrix
$$\Theta = \begin{pmatrix}e^{-i\theta_1} & & \\ & \ddots & \\ & & e^{-i\theta_n}\end{pmatrix}, \qquad \theta_j\in[0,2\pi),\ j=\overline{1,n},$$
such that $U = V\Theta^*$, $R = \Theta S$.
¹James Hardy Wilkinson (1919–1986), English mathematician. Contributed to numerical analysis, numerical linear algebra and computer science. He received many awards for his outstanding work. He was elected a Fellow of the Royal Society in 1969. He received the A. M. Turing Award from the Association for Computing Machinery and the J. von Neumann Award from the Society for Industrial and Applied Mathematics, both in 1970. Besides the large number of papers on his theoretical work in numerical analysis, Wilkinson developed computer software, working on the production of libraries of numerical routines. The NAG (Numerical Algorithms Group) began work in 1970 and much of its linear algebra routines were due to Wilkinson.

Proof. Since $UR = V\Theta^*\Theta S = VS$, the sufficiency is trivial. For the necessity, from $UR = VS$ it follows that $V^*U = SR^{-1}$ must be an upper triangular matrix such that $(V^*U)^* = U^*V = RS^{-1}$ is upper triangular as well. Hence $\Theta = V^*U$ is a unitary diagonal matrix and it holds $U = VV^*U = V\Theta$. $\square$

Proof of Proposition 8.4.3. Let $A = X\Lambda X^{-1}$ be the Jordan normal form of A, where $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$. For $U_0 = I$ and $k\in\mathbb{N}_0$,
$$U_k\left(\prod_{j=k-1}^{0}R_j\right) = A^k = X\Lambda^kX^{-1} = X\Lambda^kST = X\underbrace{(\Lambda^kS\Lambda^{-k})}_{=:L_k}\Lambda^kT,$$
where $L_k$ is a lower triangular matrix with entries
$$(L_k)_{jm} = s_{jm}\left(\frac{\lambda_j}{\lambda_m}\right)^k, \qquad 1\le m\le j\le n, \tag{8.4.4}$$

such that
$$|L_k - I| \le \max_{1\le m<j\le n}|s_{jm}|\,\max_{1\le m<j\le n}\left|\frac{\lambda_j}{\lambda_m}\right|^k\begin{pmatrix}0 & & & \\ 1 & \ddots & & \\ \vdots & \ddots & \ddots & \\ 1 & \cdots & 1 & 0\end{pmatrix}, \qquad k\in\mathbb{N}. \tag{8.4.5}$$

Let $\hat U_k\hat R_k = XL_k$ be the QR decomposition of $XL_k$, which, due to (8.4.5) and Lemma 8.4.5, converges, up to a phase matrix, to a QR decomposition $X = UR$ of X. Now we apply Lemma 8.4.5 to the identity
$$U_k\left(\prod_{j=k-1}^{0}R_j\right) = \hat U_k\hat R_k\Lambda^kT;$$
there exist phase matrices $\Theta_k$ such that
$$U_k = \hat U_k\Theta_k^* \qquad\text{and}\qquad \prod_{j=k-1}^{0}R_j = \Theta_k\hat R_k\Lambda^kT,$$
hence there exist phase matrices $\hat\Theta_k$ such that $U_k\hat\Theta_k\to U$ as $k\to\infty$. $\square$

Let us examine briefly the "error term" in (8.4.4), whose sub-diagonal entries satisfy
$$|(L_k)_{jm}| \le |s_{jm}|\left(\frac{|\lambda_j|}{|\lambda_m|}\right)^k, \qquad 1\le m<j\le n.$$
Therefore:

The farther the sub-diagonal element is from the diagonal, the faster that element converges to zero.

8.5 QR Method – the Practice


8.5.1 Classical QR method
We have seen that the QR method generates a matrix sequence $A^{(k)}$ that, under certain conditions, converges to an upper triangular matrix with the eigenvalues of A on the diagonal. We may apply this method to real matrices.

Example 8.5.1. Let
$$A = \begin{pmatrix}1 & 1 & 1\\ 1 & 2 & 3\\ 1 & 2 & 1\end{pmatrix}.$$
Its eigenvalues are
$$\lambda_1\approx 4.56155, \qquad \lambda_2 = -1, \qquad \lambda_3\approx 0.43845.$$
Using a rough MATLAB implementation of the QR method we obtain the values in Table 8.1 for the subdiagonal entries. Note that after k iterative steps, the entries $a_{m\ell}^{(k)}$, $\ell < m$, approach 0 like $|\lambda_m/\lambda_\ell|^k$ does.

#iterations a21 a31 a32


10 6.64251e-007 -2.26011e-009 0.00339953
20 1.70342e-013 -1.52207e-019 8.9354e-007
30 4.36711e-020 -1.02443e-029 2.34578e-010
40 1.11961e-026 -6.89489e-040 6.15829e-014

Table 8.1: Results for Example 8.5.1

♦

Example 8.5.2. The matrix
$$\begin{pmatrix}1 & 5 & 7\\ 3 & 0 & 6\\ 4 & 3 & 1\end{pmatrix}$$
has the eigenvalues
$$\lambda_1\approx 9.7407, \qquad \lambda_2\approx -3.8703 + 0.6480i, \qquad \lambda_3\approx -3.8703 - 0.6480i.$$
In this case, the QR method does not converge to an upper triangular matrix. After 100 iterations we obtain the matrix
$$A^{(100)}\approx\begin{pmatrix}9.7407 & -4.3355 & 0.94726\\ 8.552\cdot 10^{-39} & -4.2645 & 0.7236\\ 3.3746\cdot 10^{-39} & -0.79491 & -3.4762\end{pmatrix},$$
which correctly provides the real eigenvalue. Additionally, the lower right $2\times 2$ block provides the complex eigenvalues $-3.8703\pm 0.6480i$. ♦

The second example leads us to the following strategy: if the sub-diagonal entries do not vanish, it is advisable to examine the corresponding $2\times 2$ block.

Definition 8.5.3. If $A\in\mathbb{R}^{n\times n}$ has the QR decomposition $A = QR$, then the RQ transform of A is defined by $A_* = RQ$.

What problems appear in practical usage of the QR method? Since the complexity of the QR decomposition is $\Theta(n^3)$, it is not advisable to use a method based directly on such an iterative step. In order to avoid the problem, we first convert the initial matrix into a matrix whose QR decomposition can be computed faster. Such examples are upper Hessenberg matrices, whose QR decomposition can be computed using $n-1$ Givens rotations, a total of $O(n^2)$ flops: since only the subdiagonal entries $h_{j,j-1}$, $j=\overline{2,n}$, must be eliminated, we shall find the angles $\varphi_2,\ldots,\varphi_n$ such that
$$G(n-1,n;\varphi_n)\cdots G(1,2;\varphi_2)H = R,$$
and it holds
$$H_* = RG^T(1,2;\varphi_2)\cdots G^T(n-1,n;\varphi_n). \tag{8.5.1}$$
This is the idea implemented in MATLAB Source 8.1.

MATLAB Source 8.1 The RQ transform of a Hessenberg matrix


function HH=HessenRQ(H)
%HESSENRQ - computes the RQ transform of a Hessenberg matrix
%using Givens rotations
%input H - Hessenberg matrix
%output HH - RQ transform of H

[m,n]=size(H);
Q=eye(m,n);

for k=2:n
    a=H(k-1:k,k-1);
    an=sqrt(a'*a);              %Euclidean norm
    c=sign(a(2))*abs(a(1))/an;  %cosine
    s=sign(a(1))*abs(a(2))/an;  %sine
    Jm=eye(n);
    Jm(k-1,k-1)=c; Jm(k,k)=c;
    Jm(k-1,k)=s; Jm(k,k-1)=-s;
    H=Jm*H;                     %annihilate H(k,k-1)
    Q=Q*Jm';                    %accumulate the orthogonal factor
end
HH=H*Q;                         %the RQ transform R*Q

Lemma 8.5.4. If $H\in\mathbb{R}^{n\times n}$ is an upper Hessenberg matrix, then the matrix $H_*$ is upper Hessenberg, too.

Proof. The conclusion is a direct consequence of the representation (8.5.1). Right multiplication by a Givens matrix $G^T(j, j+1; \varphi_{j+1})$, $j=\overline{1,n-1}$, means a combination of the jth and (j+1)th columns, which creates nonzero values only in the first sub-diagonal, since R is upper triangular. $\square$
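A quick sanity check of HessenRQ (our own experiment, not one of the book's sources) can confirm both properties at once: the RQ transform preserves the spectrum and, by Lemma 8.5.4, the Hessenberg structure:

H=hess(rand(5));                            %a random upper Hessenberg matrix
HH=HessenRQ(H);
norm(sort(abs(eig(HH)))-sort(abs(eig(H))))  %~roundoff: HH is similar to H
norm(tril(HH,-2))                           %~0: HH is upper Hessenberg again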

Let us see how to convert the initial matrix to Hessenberg form. For this purpose we shall use (for variety) Householder transformations. Let us suppose we have already found a matrix $Q_k$ such that

the first k columns of the transformed matrix are already in Hessenberg form, that is,
$$Q_k^TAQ_k = \begin{pmatrix} * & \cdots & * & * & * & \cdots & *\\ * & \cdots & * & * & * & \cdots & *\\ & \ddots & \vdots & \vdots & \vdots & & \vdots\\ & & * & * & * & \cdots & *\\ & & & a_1^{(k)} & * & \cdots & *\\ & & & \vdots & \vdots & & \vdots\\ & & & a_{n-k-1}^{(k)} & * & \cdots & * \end{pmatrix}.$$

Then we determine $\hat y\in\mathbb{R}^{n-k-1}$ and $\alpha\in\mathbb{R}$ (which results automatically) such that
$$H(\hat y)\begin{pmatrix}a_1^{(k)}\\ \vdots\\ a_{n-k-1}^{(k)}\end{pmatrix} = \begin{pmatrix}\alpha\\ 0\\ \vdots\\ 0\end{pmatrix} \quad\Rightarrow\quad U_{k+1} := \begin{pmatrix}I_{k+1} & \\ & H(\hat y)\end{pmatrix}$$

and we get
$$\underbrace{U_{k+1}Q_k^T}_{=:Q_{k+1}^T}\,A\,\underbrace{Q_kU_{k+1}}_{=:Q_{k+1}} = \begin{pmatrix} * & \cdots & * & * & * & \cdots & *\\ * & \cdots & * & * & * & \cdots & *\\ & \ddots & \vdots & \vdots & \vdots & & \vdots\\ & & * & * & * & \cdots & *\\ & & & \alpha & * & \cdots & *\\ & & & 0 & * & \cdots & *\\ & & & \vdots & \vdots & & \vdots\\ & & & 0 & * & \cdots & * \end{pmatrix}U_{k+1};$$
the upper left unit matrix $I_{k+1}$ in $U_{k+1}$ takes care that the first $k+1$ columns keep the Hessenberg structure. MATLAB Source 8.2 gives a method for the conversion of a matrix into upper Hessenberg form.
To conclude, our QR method will be a two-step method:
1. Convert A into Hessenberg form using an orthogonal transformation:
$$H^{(0)} = QAQ^T, \qquad Q^TQ = QQ^T = I.$$
2. Do QR iterations
$$H^{(k+1)} = H_*^{(k)}, \qquad k\in\mathbb{N}_0,$$
hoping that all the sub-diagonal elements converge to zero.
Since the sub-diagonal entries converge slowest, we can use the maximum of their moduli as stopping criterion. This leads us to the simple QR method, see MATLAB Source 8.3, M-file QRMethod1.m. Of course, for complex eigenvalues this method iterates indefinitely.

Example 8.5.5. We apply the new method to the matrix in Example 8.5.1. For various given tolerances ε, we get the results given in Table 8.2. Note that one gains a new decimal digit in the sub-diagonal entries every three iterations. ♦

MATLAB Source 8.2 Reduction to upper Hessenberg form


function [A,Q]=hessen_h(A)
%HESSEN_H - Householder reduction to upper Hessenberg form
%uses the auxiliary function mysign (defined elsewhere in this book)

[m,n]=size(A);
v=zeros(m,m);                 %columns store the Householder vectors
Q=eye(m,m);
for k=1:m-2
    x=A(k+1:m,k);
    %Householder vector that annihilates A(k+2:m,k)
    vk=mysign(x(1))*norm(x,2)*[1;zeros(length(x)-1,1)]+x;
    vk=vk/norm(vk);
    %apply the reflection from the left and from the right
    A(k+1:m,k:m)=A(k+1:m,k:m)-2*vk*(vk'*A(k+1:m,k:m));
    A(1:m,k+1:m)=A(1:m,k+1:m)-2*(A(1:m,k+1:m)*vk)*vk';
    v(k+1:m,k)=vk;
end
if nargout==2
    %accumulate the orthogonal transformation Q
    Q=eye(m,m);
    for j=1:m
        for k=m:-1:1
            Q(k:m,j)=Q(k:m,j)-2*v(k:m,k)*(v(k:m,k)'*Q(k:m,j));
        end
    end
end

MATLAB Source 8.3 Pure (simple) QR Method


function [lambda,it]=QRMethod1(A,t)
%QRMETHOD1 - Computes eigenvalues of a real matrix
%naive implementation
%Input
% A - matrix
% t - tolerance
%Output
% lambda - eigenvalues - diagonal of R
% it - no. of iterations

H=hessen_h(A);
it=0;
while norm(diag(H,-1),inf) > t
H=HessenRQ(H);
it=it+1;
end
lambda=diag(H);

ε #iterations λ1 λ2 λ3
10−3 11 4.56155 -0.999834 0.438281
10−4 14 4.56155 -1.00001 0.438461
10−5 17 4.56155 -0.999999 0.438446
10−10 31 4.56155 -1 0.438447

Table 8.2: Results for Example 8.5.5

We can try to speed up our method by decomposing the problem into subproblems. If we have a Hessenberg matrix with a vanishing sub-diagonal entry, it has the block form
$$H = \begin{pmatrix}H_1 & *\\ & H_2\end{pmatrix},$$
where $H_1$ and $H_2$ are themselves upper Hessenberg; then the eigenvalue problem for H may be decomposed into an eigenvalue problem for $H_1$ and one for $H_2$.
According to [39], a sub-diagonal entry $h_{j+1,j}$ is considered to be "small enough" if
$$|h_{j+1,j}| \le \text{eps}\,(|h_{jj}| + |h_{j+1,j+1}|). \tag{8.5.2}$$
We shall do something simpler, namely, we shall decompose a matrix if its least modulus sub-diagonal entry is less than a given tolerance. The procedure is as follows: the function for computing eigenvalues using QR iterations finds a decomposition into two matrices $H_1$ and $H_2$ and calls itself recursively for each of these matrices.
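In MATLAB, the candidate split positions according to (8.5.2) can be located in a few lines; the fragment below is our own illustration (variable names are ours), not one of the book's sources:

d=abs(diag(H,-1));                           %moduli of subdiagonal entries
hd=abs(diag(H));                             %moduli of diagonal entries
splits=find(d<=eps*(hd(1:end-1)+hd(2:end)))  %admissible split rows, per (8.5.2)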
If one of these matrices is $1\times 1$, the eigenvalue is trivially computed, and if it is $2\times 2$, then its characteristic polynomial is
$$p_A(x) = \det(A - xI) = x^2 - \operatorname{trace}(A)\,x + \det(A) = x^2 - \underbrace{(a_{11}+a_{22})}_{=:b}x + \underbrace{(a_{11}a_{22}-a_{12}a_{21})}_{=:c}.$$
If its discriminant $b^2 - 4c$ is positive, then A has two real and distinct eigenvalues
$$x_1 = \frac{1}{2}\left(b + \operatorname{sgn}(b)\sqrt{b^2-4c}\right) \qquad\text{and}\qquad x_2 = \frac{c}{x_1},$$
otherwise its eigenvalues are complex, namely
$$\frac{1}{2}\left(b \pm i\sqrt{4c-b^2}\right);$$
thus we can deal with complex eigenvalues. The function Eigen2x2 returns the eigenvalues of a $2\times 2$ matrix. The idea is implemented in the M-file QRSplit1a.m, MATLAB Source 8.4. The M-file QRIter.m contains the QR iterations, and the file Eigen2x2.m deals with the complex case.

MATLAB Source 8.4 QRSplit1a – QR method with partition and treatment of 2 × 2 matrices
function [lambda,It]=QRSplit1a(A,t)
%QRSPLIT1A eigenvalues with partition and special treatment
% of 2x2 matrices
%Input
%A - matrix
%t - tolerance
%Output
%lambda - eigenvalues
%It - no. of iterations

[m,n]=size(A);
if n==1
It=0;
lambda=A;
return
elseif n==2
It=0;
lambda=Eigen2x2(A);
return
else
H=hessen_h(A); %convert to Hessenberg form
[H1,H2,It]=QRIter(H,t); %decomposition H->H1,H2
%recursive call
[l1,It1]=QRSplit1a(H1,t);
[l2,It2]=QRSplit1a(H2,t);
It=It+It1+It2;
lambda=[l1;l2];
end

Example 8.5.6. Let us consider again the matrix in Example 8.5.2; we apply the algorithm imple-
mented in MATLAB Source 8.4 to it. The results are given in Table 8.3. ♦

8.5.2 Spectral shift

Hessenberg matrices allow us to execute each iteration in a shorter time. We shall now try to reduce the number of iterations, that is, to increase the speed of convergence, since:

The convergence rate of the sub-diagonal entries $h_{j+1,j}$ has order of growth
$$\left(\frac{\lambda_{j+1}}{\lambda_j}\right)^k, \qquad j=\overline{1,n-1}.$$

The keyword here is spectral shift. One observes that for $\mu\in\mathbb{R}$ the matrix $A - \mu I$ has the eigenvalues $\lambda_1-\mu,\ldots,\lambda_n-\mu$. For an arbitrary invertible matrix B the matrix $B(A-\mu I)B^{-1} + \mu I$ has the

MATLAB Source 8.5 Compute eigenvalues of a 2 × 2 matrix


function lambda=Eigen2x2(A)
%EIGEN2X2 - Compute eigenvalues of a 2x2 matrix
%A - 2x2 matrix
%lambda - eigenvalues

b=trace(A); c=det(A);
d=bˆ2/4-c;
if d > 0
if b == 0
lambda = [sqrt(c); -sqrt(c)];
else
x = (b/2+sign(b)*sqrt(d));
lambda=[x; c/x];
end
else
lambda=[b/2+i*sqrt(-d); b/2-i*sqrt(-d)];
end

MATLAB Source 8.6 QR iterations on a Hessenberg matrix


function [H1,H2,It]=QRIter(H,t)
%QRITER - perform QR iteration on Hessenberg matrix
%until the least subdiagonal element is < t
%Input
% H - Hessenberg matrix
% t - tolerance
%Output
%H1, H2 - decomposition over least element
%It - no. of iterations

It=0; [m,n]=size(H);
[m,j]=min(abs(diag(H,-1)));
while m>t
It=It+1;
H=HessenRQ(H);
[m,j]=min(abs(diag(H,-1)));
end
H1=H(1:j,1:j);
H2=H(j+1:n,j+1:n);

ε #iterations λ1 λ2 λ3
10−3 12 9.7406 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−4 14 9.7407 −3.8703 + 0.6479i −3.8703 − 0.6479i
10−5 17 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 19 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i
10−5 22 9.7407 −3.8703 + 0.6480i −3.8703 − 0.6480i

Table 8.3: Results for Example 8.5.6

eigenvalues $\lambda_1,\ldots,\lambda_n$ – one may shift the spectrum of a matrix forward and backward by means of a similarity transformation. One sorts the eigenvalues $\mu_1,\ldots,\mu_n$ such that
$$|\mu_1-\mu| > |\mu_2-\mu| > \cdots > |\mu_n-\mu|, \qquad \{\mu_1,\ldots,\mu_n\} = \{\lambda_1,\ldots,\lambda_n\},$$
and if $\mu$ is close to $\mu_n$, then if the QR method starts with $H^{(0)} = A - \mu I$, the subdiagonal entry $h_{n,n-1}^{(k)}$ converges very fast to zero. It is better if the spectral shift is performed at each step individually. In addition, we may choose heuristically as approximation for $\mu$ the value $h_{nn}^{(k)}$. One gets the following iterative scheme
$$H^{(k+1)} = \left(H^{(k)} - \mu_kI\right)_* + \mu_kI, \qquad \mu_k := h_{nn}^{(k)}, \quad k\in\mathbb{N}_0,$$
with the starting matrix $H^{(0)} = QAQ^T$. The M-file QRSplit2.m, MATLAB Source 8.7, gives a variant of the method which treats complex eigenvalues. It calls QRIter2 (see MATLAB Source 8.8).
Remark 8.5.7. If the shift value µ is sufficiently close to an eigenvalue λ, then the matrix could be
decomposed in a single iterative step. ♦
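As a small experiment (our own, not from the text), one can compare the iteration counts of the unshifted and the shifted variants on the matrix of Example 8.5.1:

A=[1 1 1; 1 2 3; 1 2 1];        %the matrix of Example 8.5.1
[l1,it1]=QRSplit1a(A,1e-10);    %partition, no shift
[l2,it2]=QRSplit2(A,1e-10);     %partition + spectral shift mu_k = h_nn^(k)
[it1, it2]                      %compare iteration counts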

8.5.3 Double shift QR method

It can be shown that the spectral shift QR method converges quadratically, that is, the error is, for $\rho < 1$,
$$O(\rho^{2^k}) \quad\text{instead of}\quad O(\rho^k).$$
This nice idea works only for real eigenvalues; for complex eigenvalues it is problematic. Nevertheless, we can exploit the fact that complex eigenvalues of a real matrix appear in conjugate pairs. This leads us to "double shift methods":

Instead of shifting the spectrum with one eigenvalue, approximated heuristically by $h_{n,n}^{(k)}$, we could rather perform two shifts in one step, namely with the eigenvalues of
$$B = \begin{pmatrix}h_{n-1,n-1}^{(k)} & h_{n-1,n}^{(k)}\\ h_{n,n-1}^{(k)} & h_{n,n}^{(k)}\end{pmatrix}.$$

There are two possibilities: either both eigenvalues $\mu$ and $\mu'$ of B are real and we proceed as above, or we have a pair of complex conjugate eigenvalues, $\mu$ and $\bar\mu$. As we shall see, the second case can also be treated in real arithmetic. Let $Q_k, Q_k'\in\mathbb{C}^{n\times n}$ and $R_k, R_k'\in\mathbb{C}^{n\times n}$ be the matrices of the complex QR decompositions
$$Q_kR_k = H^{(k)} - \mu I, \qquad Q_k'R_k' = R_kQ_k + (\mu-\bar\mu)I.$$

MATLAB Source 8.7 Spectral shift QR method, partition and treatment of complex eigen-
values
function [lambda,It]=QRSplit2(A,t)
%QRSPLIT2 eigenvalues with partition and special treatment
% of 2x2 matrices
%Input
%A - matrix
%t - tolerance
%Output
%lambda - eigenvalues
%It - no. of iterations

[m,n]=size(A);
if n==1
It=0;
lambda=A;
return
elseif n==2
It=0;
lambda=Eigen2x2(A);
return
else
H=hessen_h(A); %convert to Hessenberg
[H1,H2,It]=QRIter2(H,t); %decomposition H->H1,H2
%recursive call
[l1,It1]=QRSplit2(H1,t);
[l2,It2]=QRSplit2(H2,t);
It=It+It1+It2;
lambda=[l1;l2];
end

Then it holds
$$\begin{aligned} H^{(k+1)} &:= R_k'Q_k' + \bar\mu I = (Q_k')^*(R_kQ_k + (\mu-\bar\mu)I)Q_k' + \bar\mu I\\ &= (Q_k')^*R_kQ_kQ_k' + \mu I = (Q_k')^*Q_k^*(H^{(k)} - \mu I)Q_kQ_k' + \mu I\\ &= \underbrace{(Q_kQ_k')^*}_{=U^*}\,H^{(k)}\,\underbrace{Q_kQ_k'}_{=U}. \end{aligned}$$
Using the matrix $S = R_k'R_k$ we have
$$\begin{aligned} US &= Q_kQ_k'R_k'R_k = Q_k(R_kQ_k + (\mu-\bar\mu)I)R_k\\ &= Q_kR_kQ_kR_k + (\mu-\bar\mu)Q_kR_k = (H^{(k)}-\mu I)^2 + (\mu-\bar\mu)(H^{(k)}-\mu I)\\ &= (H^{(k)})^2 - 2\mu H^{(k)} + \mu^2I + (\mu-\bar\mu)H^{(k)} - (\mu^2-\mu\bar\mu)I\\ &= (H^{(k)})^2 - (\mu+\bar\mu)H^{(k)} + \mu\bar\mu I =: X. \end{aligned} \tag{8.5.3}$$

If $\mu = \alpha + i\beta$, then $\mu + \bar\mu = 2\alpha$ and $\mu\bar\mu = |\mu|^2 = \alpha^2+\beta^2$, hence the matrix X on the right-hand side of (8.5.3) is real, so it has a real QR decomposition $X = QR$, and by Lemma 8.4.5 there exists a phase

MATLAB Source 8.8 QR iteration and partition


function [H1,H2,It]=QRIter2(H,t)
%QRITER2 - perform spectral shift QR iteration on a Hessenberg matrix
%until the least subdiagonal element is < t
%Input
% H - Hessenberg matrix
% t - tolerance
%Output
%H1, H2 - decomposition over least element
%It - no. of iterations

It=0; [m,n]=size(H);
II=eye(n);
[m,j]=min(abs(diag(H,-1)));
while m > t
It=It+1;
H=HessenRQ(H-H(n,n)*II)+H(n,n)*II;
[m,j]=min(abs(diag(H,-1)));
end
H1=H(1:j,1:j);
H2=H(j+1:n,j+1:n);

matrix $\Theta\in\mathbb{C}^{n\times n}$ such that $U = \Theta Q$. If we perform the real iteration further, we obtain the double shift QR method
$$\begin{aligned} Q_kR_k &= (H^{(k)})^2 - \left(h_{n-1,n-1}^{(k)} + h_{n,n}^{(k)}\right)H^{(k)} + \left(h_{n-1,n-1}^{(k)}h_{n,n}^{(k)} - h_{n-1,n}^{(k)}h_{n,n-1}^{(k)}\right)I,\\ H^{(k+1)} &= Q_k^TH^{(k)}Q_k. \end{aligned} \tag{8.5.4}$$

Remark 8.5.8 (Double shift QR method). 1. The matrix X in (8.5.3) is no longer a Hessenberg matrix, since it has an additional subdiagonal. Nevertheless, one can easily compute the QR decomposition of X, using only $2n-3$ Jacobi rotations, instead of the $n-1$ required by a Hessenberg matrix.
2. Because of its high complexity, the multiplication $Q_k^TH^{(k)}Q_k$ is no longer an efficient way to perform the iteration; this drawback can be fixed, see for example [29] or [81, pages 272–278].
3. Naturally, $H^{(k+1)}$ can be converted back to Hessenberg form.
4. The double shift QR method is useful only when A has complex eigenvalues; for symmetric matrices it brings no advantage. ♦

The double shift QR method with partitioning and treatment of $2\times 2$ matrices is given in MATLAB Source 8.9, file QRSplit3.m. It calls QRDouble.

Example 8.5.9. We apply sources 8.7 and 8.9 to matrices in Examples 8.5.1 and 8.5.2. One gets the
results in Table 8.4. The good behavior of double shift QR method can be explained by the idea to
obtain two eigenvalues simultaneously. ♦

MATLAB Source 8.9 Double shift QR method with partition and treating 2 × 2 matrices
function [lambda,It]=QRSplit3(A,t)
%QRSPLIT3 compute eigenvalues with QR method, partition, shift
%and special treatment of 2x2 matrices
%Input
%A - matrix
%t - tolerance
%Output
%lambda - eigenvalues
%It - no. of iterations

[m,n]=size(A);
if n==1
It=0;
lambda=A;
return
elseif n==2
It=0;
lambda=Eigen2x2(A);
return
else
H=hessen_h(A); %convert to Hessenberg
[H1,H2,It]=QRDouble(H,t); %decomposition H->H1,H2
%recursive call
[l1,It1]=QRSplit3(H1,t);
[l2,It2]=QRSplit3(H2,t);
It=It+It1+It2;
lambda=[l1;l2];
end

#iterations in R #iterations in C
ε alg. 8.7 alg. 8.9 alg. 8.7 alg. 8.9
1e-010 1 1 9 4
1e-020 9 2 17 5
1e-030 26 3 45 5

Table 8.4: Comparisons in Example 8.5.9



MATLAB Source 8.10 Double shift QR iterations and Hessenberg transformation


function [H1,H2,It]=QRDouble(H,t)
%QRDOUBLE - perform double step QR iteration and inverse
%transform on Hessenberg matrix until the least
%subdiagonal element is < t
%Input
% H - Hessenberg matrix
% t - tolerance
%Output
%H1, H2 - decomposition over least element
%It - no. of iterations

It=0; [m,n]=size(H);
II=eye(n);
[m,j]=min(abs(diag(H,-1)));
while m>t
It=It+1;
X = H*H ... % X matrix
- (H(n-1,n-1) + H(n,n)) * H ...
+ (H(n-1,n-1)*H(n,n) - H(n,n-1)*H(n-1,n))*II;
[Q,R]=qr(X);
H=hessen_h(Q’*H*Q);
[m,j]=min(abs(diag(H,-1)));
end
H1=H(1:j,1:j);
H2=H(j+1:n,j+1:n);

8.6 Eigenvalues and Eigenvectors in MATLAB


MATLAB uses LAPACK routines to compute eigenvalues and eigenvectors. The eigenvalues of A
are computed with the eig function: e = eig(A) assigns the eigenvalues to the vector e. More
generally, after the call [V,D]=eig(A), the diagonal n-by-n matrix D contains eigenvalues on the
diagonal and the columns of the n-by-n matrix V are eigenvectors. It holds A*V=V*D. Not every
matrix has n linearly independent eigenvectors, so the matrix V returned by eig may be singular (or,
because of roundoff, nonsingular but very ill conditioned). The matrix in the following example has a
double eigenvalue 1 and only one linearly independent eigenvector:

>> [V,D]=eig([2, -1; 1,0])


V =
0.7071 0.7071
0.7071 0.7071
D =
1 0
0 1

MATLAB normalizes so that each column of V has unit 2-norm (this is possible, since if x is an eigen-
vector then so is any nonzero multiple of x).

For Hermitian matrices MATLAB returns eigenvalues sorted in increasing order and the matrix of
eigenvectors is unitary to working precision:

>> [V,D]=eig([2,-1;-1,1])
V =
-0.5257 -0.8507
-0.8507 0.5257
D =
0.3820 0
0 2.6180
>> norm(V’*V-eye(2))
ans =
2.2204e-016

The following example computes the eigenvalues of the (non-Hermitian) Frank matrix:

>> F = gallery(’frank’,5)
F =
5 4 3 2 1
4 4 3 2 1
0 3 3 2 1
0 0 2 2 1
0 0 0 1 1
>> e = eig(F)’
e =
10.0629 3.5566 1.0000 0.0994 0.2812

If λ is an eigenvalue of F, then so is 1/λ:

>> 1./e
ans =
0.0994 0.2812 1.0000 10.0629 3.5566

The reason is that the characteristic polynomial is anti-palindromic:

>> poly(F)
ans =
1.0000 -15.0000 55.0000 -55.0000 15.0000 -1.0000

Thus, det(F − λI) = −λ5 det(F − λ−1 I).


If λ is an eigenvalue of A, a nonzero vector y such that $y^*A = \lambda y^*$ is a left eigenvector. Use [W,D] = eig(A.'); W = conj(W) to compute the left eigenvectors of A. If λ is a simple eigenvalue of A with a right eigenvector x and a left eigenvector y such that $\|x\|_2 = \|y\|_2 = 1$, then the condition number of the eigenvalue λ is $\kappa(\lambda, A) = \frac{1}{|y^*x|}$. The condition number of the eigenvector matrix is an upper bound for the individual eigenvalue condition numbers.
Function condeig computes condition numbers for the eigenvalues. The command c=condeig(A)
returns a vector of condition numbers for the eigenvalue of A. The call [V,D,s] = condeig(A)
is equivalent to: [V,D] = eig(A), s = condeig(A). A large condition number indicates an
eigenvalue that is sensitive to perturbations in the matrix. The following example displays eigenvalues
of the sixth order Frank matrix in the first row and their condition numbers in the second:

>> A = gallery(’frank’,6);
>> [V,D,s] = condeig(A);
>> [diag(D)’; s’]
ans =
12.9736 5.3832 1.8355 0.5448 0.0771 0.1858
1.3059 1.3561 2.0412 15.3255 43.5212 56.6954
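The formula $\kappa(\lambda, A) = 1/|y^*x|$ can be cross-checked against condeig; the fragment below is our own illustration, and it assumes that eig returns the eigenvalues of A and A.' in the same order (in general the columns of W must be matched to those of V by comparing the eigenvalues):

A=gallery('frank',6);
[V,D]=eig(A);                   %right eigenvectors
[W,D2]=eig(A.'); W=conj(W);     %left eigenvectors
%caution: match columns of W to V via diag(D), diag(D2) if orders differ
k=1;                            %pick an eigenvalue
x=V(:,k)/norm(V(:,k)); y=W(:,k)/norm(W(:,k));
1/abs(y'*x)                     %compare with the k-th entry of condeig(A)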

Let us explain briefly how the eig function works. It proceeds in several stages. First, when A is nonsymmetric, it balances the matrix, that is, it carries out a similarity transformation $A \leftarrow Y^{-1}AY$, where Y is a permutation of a diagonal matrix chosen to give A rows and columns of approximately equal norm. The motivation for balancing is that it can lead to a more accurate computed eigensystem. However, balancing can worsen rather than improve the accuracy (see doc eig for an example), so it may be necessary to turn balancing off with eig(A,'nobalance'). Also note that if A is symmetric, eig(A,'nobalance') ignores the nobalance option since A is already balanced.
After balancing, eig reduces A to Hessenberg form, then uses the QR algorithm to reach Schur form, after which eigenvectors are computed by substitution if required. The Hessenberg factorization is computed by H = hess(A) or [Q,H] = hess(A). The latter form also returns the transformation matrix Q. The commands T = schur(A) or [Q,T] = schur(A) produce the real Schur decomposition if A is real and the complex Schur decomposition if A is complex. The complex Schur form of a real matrix can be obtained with schur(A,'complex').
If A is real and symmetric (complex Hermitian), [V,D] = eig(A) reduces initially to symmetric
(Hermitian) tridiagonal form then iterates to produce a diagonal Schur form, resulting in an orthogonal
(unitary) V and a real, diagonal D.
MATLAB can solve generalized eigenvalue problems: given two square matrices of order n, A
and B, find the scalars λ and vectors x 6= 0 such that Ax = λBx. The generalized eigenvalues are
computed by e = eig(A,B), while [V,D] = eig(A,B) computes an n-by-n diagonal matrix
D and an n-by-n matrix V of eigenvectors such that A*V = B*V*D. The theory of the generalized
eigenproblem is more complicated than that of the standard eigenproblem, with the possibility of zero,
finitely many or infinitely many eigenvalues and of eigenvalues that are infinitely large. When B is
singular eig may return computed eigenvalues containing NaNs. We illustrate with
>> A = gallery(’triw’,3), B = magic(3)
A =
1 -1 -1
0 1 -1
0 0 1
B =
8 1 6
3 5 7
4 9 2
>> [V,D]=eig(A,B); V, eigvals = diag(D)’
V =
-1.0000 -1.0000 0.3526
0.4844 -0.4574 0.3867
0.2199 -0.2516 -1.0000
eigvals =
0.2751 0.0292 -0.3459
Function polyeig solves the polynomial eigenvalue problem $(\lambda^pA_p + \lambda^{p-1}A_{p-1} + \cdots + \lambda A_1 + A_0)x = 0$, where the $A_i$ are given square coefficient matrices. The generalized eigenproblem is

obtained for p = 1; if we take further A0 = I we get the standard eigenproblem. The quadratic
eigenproblem (λ2 A + λB + C)x = 0 corresponds to p = 2. If Ap is n-by-n and nonsingular
then there are pn eigenvalues. MATLAB’s syntax is e = polyeig(A0,A1,..,Ap) or [X,e]
= polyeig(A0,A1,..,Ap), with e a pn-vector of eigenvalues and X an n-by-pn matrix whose
columns are the corresponding eigenvectors. Example:

>> A = eye(2); B = [20 -10; -10 20]; C = [15 -5; -5 15];


>> [X,e] = polyeig(C,B,A)
X =
0.7071 0.7071 0.7071 0.7071
-0.7071 0.7071 -0.7071 0.7071
e =
-29.3178
-8.8730
-0.6822
-1.1270

The singular value decomposition (SVD) of an m-by-n matrix A is the factorization
$$A = U\Sigma V^*, \tag{8.6.1}$$
where U is $m\times m$ and unitary, V is $n\times n$ and unitary, and Σ is $m\times n$ and diagonal with nonnegative real entries $\sigma_{ii}$ such that $\sigma_{11}\ge\sigma_{22}\ge\cdots\ge\sigma_{\min(m,n)}\ge 0$.
There exist two kinds of SVD: the full or complete SVD, given by (8.6.1), and the reduced or economical SVD, which, for a rectangular $m\times n$ matrix with $m\ge n$, returns an $m\times n$ matrix U, the diagonal $n\times n$ matrix Σ and the unitary $n\times n$ matrix V.
The SVD is a useful instrument for the analysis of mappings defined on one space and having values in a different space, possibly of a different dimension. The rank, null space and range space of a matrix can be computed via the SVD. The SVD has many applications in statistics and image processing and helps toward a better understanding of linear algebra concepts. If A is symmetric and positive definite, the SVD (8.6.1) and the eigenvalue decomposition agree. In contrast to the eigenvalue decomposition, the SVD always exists.
Consider the matrix

A =
9 4
6 8
2 7

Its full (complete) SVD is

>> [U,S,V]=svd(A)
U =
-0.6105 0.7174 0.3355
-0.6646 -0.2336 -0.7098
-0.4308 -0.6563 0.6194

S =
14.9359 0
0 5.1883
0 0

V =
-0.6925 0.7214
-0.7214 -0.6925

and the reduced SVD

>> [U,S,V]=svd(A,0)
U =
-0.6105 0.7174
-0.6646 -0.2336
-0.4308 -0.6563

S =
14.9359 0
0 5.1883

V =
-0.6925 0.7214
-0.7214 -0.6925

In both cases U*S*V’ is equal to A, modulo roundoff.
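As a side note (our own sketch, not from the text), the numerical rank of A can be read off these singular values; the tolerance below mimics, but is not guaranteed to match exactly, the one used by MATLAB's rank:

s=svd(A);                        %A = [9 4; 6 8; 2 7] as above
tol=max(size(A))*eps(max(s));    %tolerance in the spirit of rank
r=sum(s>tol)                     %numerical rank (here 2)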


The generalized singular value decomposition of an m × p matrix A and an n × p matrix B can be
written
A = U CX ∗ , B = V SX ∗ , C ∗ C + S ∗ S = I,
where U and V are unitary and C and S are real diagonal matrices with nonnegative diagonal elements.
The numbers C(i, i)/S(i, i) are the generalized singular values. This decomposition is computed by
[U,V,X,C,S] = gsvd(A,B). See help gsvd for more details about the dimensions of the fac-
tors. For details on generalized SVD, see [39].

8.7 Applications
8.7.1 Solving mass-spring systems
Demmel presents in [23] an application of the concepts of eigenvalue and eigenvector to a problem of mechanical vibrations. Consider the damped mass-spring system in Figure 8.1. Newton's law F = ma leads us to the system
$$m_i\ddot x_i(t) = k_i(x_{i-1}(t) - x_i(t)) + k_{i+1}(x_{i+1}(t) - x_i(t)) - b_i\dot x_i(t).$$
The first term is the force on mass i from spring i, the second is the force on mass i from spring i + 1, and the last term is the force on mass i from damper i. In matrix form, our equation is
$$M\ddot x(t) = -B\dot x(t) - Kx(t), \tag{8.7.1}$$



Figure 8.1: Damped, vibrating mass-spring system. Here, $x_i$ is the position of the ith mass, $m_i$ is the ith mass, $k_i$ is the spring constant of the ith spring and $b_i$ is the damping constant of the ith damper.

where $M = \operatorname{diag}(m_1,\ldots,m_n)$, $B = \operatorname{diag}(b_1,\ldots,b_n)$, and
$$K = \begin{pmatrix} k_1+k_2 & -k_2 & & & \\ -k_2 & k_2+k_3 & -k_3 & & \\ & \ddots & \ddots & \ddots & \\ & & -k_{n-1} & k_{n-1}+k_n & -k_n\\ & & & -k_n & k_n \end{pmatrix}.$$

We assume that all the masses $m_i$ are positive. M is called the mass matrix, B is the damping matrix, and K is the stiffness matrix. First, we convert this second-order differential equation to a first-order one. If we introduce
$$y(t) = \begin{pmatrix}\dot x(t)\\ x(t)\end{pmatrix},$$
our equation becomes
$$\dot y(t) = \begin{pmatrix}\ddot x(t)\\ \dot x(t)\end{pmatrix} = \begin{pmatrix}-M^{-1}B\dot x(t) - M^{-1}Kx(t)\\ \dot x(t)\end{pmatrix} = \begin{pmatrix}-M^{-1}B & -M^{-1}K\\ I & 0\end{pmatrix}\begin{pmatrix}\dot x(t)\\ x(t)\end{pmatrix} \equiv Ay(t). \tag{8.7.2}$$

To solve $\dot y(t) = Ay(t)$, we assume that y(0) is given (i.e., the initial positions x(0) and velocities $\dot x(0)$ are given).
One way to express the solution of this differential equation is $y(t) = e^{At}y(0)$, where $e^{At}$ is the matrix exponential. We shall give the solution in the special case where A is diagonalizable; this will be true for almost all choices of $m_i$, $k_i$, and $b_i$.
When A is diagonalizable, we can write $A = S\Lambda S^{-1}$, where $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$. Then $\dot y(t) = Ay(t)$ is equivalent to $\dot y(t) = S\Lambda S^{-1}y(t)$, or $S^{-1}\dot y(t) = \Lambda S^{-1}y(t)$, or $\dot z(t) = \Lambda z(t)$, where $z(t) = S^{-1}y(t)$. This diagonal system of differential equations $\dot z_i(t) = \lambda_iz_i(t)$ has the solutions $z_i(t) = e^{\lambda_it}z_i(0)$, so $y(t) = S\operatorname{diag}(e^{\lambda_1t},\ldots,e^{\lambda_nt})S^{-1}y(0) = Se^{\Lambda t}S^{-1}y(0)$.

function [T,Locations,Velocities]=massspring(n,m,b,k,x0,x,v,dt,tfinal)
% MASSSPRING Solve vibrating mass-spring system using eigenvalues
%
% Inputs
% N = number of bodies

% M = column vector of n masses


% B = column vector of n damping constants
% K = column vector of n spring constants
% X0= column vector of n rest positions of bodies
% X = column vector of n initial displacements from rest
% V = column vector of n initial velocities of bodies
% DT = time step
% TFINAL = final time to integrate system
%
% Outputs
% graph body positions, T, LOCATIONS, VELOCITIES
%

% Compute mass matrix


M = diag(m);
% Compute damping matrix
B = diag(b);
% Compute stiffness matrix
if n>1,
K = diag(k)+diag([k(2:n);k(n)])-diag(k(2:n),1)-diag(k(2:n),-1);
else
K = k;
end
% Compute matrix governing motion, find eigenvalues and eigenvectors
A = [[-inv(M)*B, -inv(M)*K];[eye(n),zeros(n)]];
[V,D]=eig(A);
dD = diag(D);
iV = inv(V);
i=1;
Y = [v;x];
T = 0;
steps = round(tfinal/dt);
% Compute positions and velocities
for i = 2:steps
t = (i-1)*dt;
T = [T,t];
Y(1:2*n,i) = V * ( diag(exp(t*dD)) * ( iV * [v;x] ) );
end
Y = real(Y);
hold off, clf
subplot(2,1,1)
Locations = Y(n+1:2*n,:)+x0*ones(1,i);
attr={’k-’,’r--’,’g-.’,’b:’};
for j=1:n,
color=attr{rem(j,4)+1};
plot(T,Locations(j,:),color),
hold on
axis([0,tfinal,min(min(Locations)),max(max(Locations))])
end

title(’Positions’)
xlabel(’Time’)
grid
subplot(2,1,2)
Velocities = Y(1:n,:);
for j=1:n,
color=attr{rem(j,4)+1};
plot(T,Velocities(j,:),color),
hold on
axis([0,tfinal,min(min(Velocities)),max(max(Velocities))])
end
title(’Velocities’)
xlabel(’Time’)
grid

The MATLAB function massspring finds the time moments, positions and velocities of the system components, and plots the graph. The calling sequence is
n=4;
b=0.4*ones(4,1);
m=[2,1,1,2]’;
k=ones(4,1);
x0=(1:4)’;
x=[-0.25,0,0,0.25]’;
v=[-1,0,0,1]’;
dt=0.1;
tfinal=20;
[T,P,V]=massspring(n,m,b,k,x0,x,v,dt,tfinal);

See Figure 8.2 for output.

8.7.2 Computing natural frequencies of a rectangular membrane

One of the most useful applications of eigenvalue problems occurs in natural frequency calculations for linear systems. In [105] one considers the finite difference approximation for the natural frequencies of a rectangular membrane, and the approximate results are compared to exact values. Consider a tightly stretched elastic membrane occupying a region $R\subset\mathbb{R}^2$ bounded by a curve L on which the transverse deflection is zero. The PDE and boundary conditions governing the transverse motion U(x, y, t) are
$$T(U_{xx} + U_{yy}) = \rho U_{tt}, \quad (x,y)\in R,$$
$$U(x,y,t) = 0, \quad (x,y)\in L,$$
where T is the membrane tension and ρ is the density. The natural vibration modes are motion states where all points of the system simultaneously move with the same frequency, which means $U(x,y,t) = u(x,y)\sin\Omega t$. It follows that u(x, y) satisfies
$$u_{xx} + u_{yy} = -\omega^2u, \quad (x,y)\in R,$$
$$u(x,y) = 0, \quad (x,y)\in L,$$


Figure 8.2: Positions and velocities of a mass-spring system

where $\omega = \sqrt{\rho/T}\,\Omega$. In the simple case of a rectangular membrane lying in the region $0\le x\le a$, $0\le y\le b$, the natural frequencies and mode shapes are given by
$$\omega_{nm} = \sqrt{\left(\frac{n\pi}{a}\right)^2 + \left(\frac{m\pi}{b}\right)^2}, \qquad u_{nm} = \sin\frac{n\pi x}{a}\sin\frac{m\pi y}{b},$$
where $n, m\in\mathbb{N}^*$. How closely can these values be reproduced when the partial differential equation is replaced by a second-order finite difference approximation defined on a rectangular grid? We introduce the grid points
$$x(i) = (i-1)\Delta_x, \quad i = 1,\ldots,N, \qquad y(j) = (j-1)\Delta_y, \quad j = 1,\ldots,M,$$
where $\Delta_x = a/(N-1)$, $\Delta_y = b/(M-1)$, and let u(i, j) be the value of u at x(i), y(j). The Helmholtz equation is replaced by an algebraic eigenvalue problem of the form
$$-\Delta_y^2\,[u(i-1,j) - 2u(i,j) + u(i+1,j)] - \Delta_x^2\,[u(i,j-1) - 2u(i,j) + u(i,j+1)] = \lambda u(i,j),$$
where
$$\lambda = (\Delta_x\Delta_y\,\omega)^2,$$
and associated homogeneous boundary conditions
$$u(1,j) = u(N,j) = u(i,1) = u(i,M) = 0.$$
This combination of equations can be written in matrix form as
$$Au = \lambda u, \qquad Bu = 0.$$

We used the MATLAB function null to solve the boundary conditions equations. If Q is the null
space of B (with orthonormal columns), we write u = Qz and substitute into the eigenvalue equations.
Multiplying both sides by QT , we obtain a standard eigenvalue problem of the form Cz = λz, where
C = QT AQ. The eigenvector matrix of the original problem is obtained as u = QV , where V is the
eigenvector matrix of C, and the eigenvalues of the original matrix are the eigenvalues of C (C and A
are similar).
The function recmemnfr forms and solves the algebraic equations just discussed. To avoid tedious indexing into $n^2$-dimensional matrices, the MATLAB functions ind2sub and sub2ind are useful.

function [w,wex,modes,x,y,nx,ny,ax,by]=recmemnfr(...
ax,by,nx,ny)
%RECMEMNFR - natural frequencies of a rectangular membrane
% [w,wex,modes,x,y,nx,ny,ax,by]=recmemfr(a,b,nx,ny,noplt)
% ˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜˜
% ax, by - membrane side lengths
% nx,ny - number of points
% w - vector of (nx-2)*(ny-2) computed frequencies
% wex - vector of exact frequencies
% modes - three dimensional array containing the mode
% shapes for various frequencies. The array
% size is [nx,ny,(nx-2)*(nx-2)] denoting
% the x direction, y direction, and the
% frequency numbers matching components of the
% w vector. The i’th mode shape is obtained
% as reshape(vecs(:,i),n,m)
% x,y - vectors defining the finite difference grid

if nargin==0; ax=2; nx=20; by=1; ny=10; end


dx=ax/(nx-1); dy=by/(ny-1);
na=(1:nx-1)’/ax; nb=(1:ny-1)/by;

% Compute exact frequencies for comparison


wex=pi*sqrt(repmat(na.ˆ2,1,ny-1)+repmat(nb.ˆ2,nx-1,1));
wex=sort(wex(:)’); x=linspace(0,ax,nx);
y=linspace(0,by,ny); neig=(nx-2)*(ny-2); nvar=nx*ny;
% Form equations to fix membrane edges
k=0; s=[nx,ny]; c=zeros(2*(nx+ny),nvar);
for j=1:nx
m=sub2ind(s,[j,j],[1,ny]); k=k+1;
c(k,m(1))=1; k=k+1; c(k,m(2))=1;
end
for j=1:ny
m=sub2ind(s,[1,nx],[j,j]); k=k+1;
c(k,m(1))=1; k=k+1; c(k,m(2))=1;
end

% Form frequency equations at interior points


k=0; a=zeros(neig,nvar); b=a;

phi=(dx/dy)ˆ2; psi=2*(1+phi);
for i=2:nx-1
for j=2:ny-1
m=sub2ind(s,[i-1,i,i+1,i,i],[j,j,j,j-1,j+1]);
k=k+1; a(k,m(1))=-1; a(k,m(2))=psi; a(k,m(3))=-1;
a(k,m(4))=-phi; a(k,m(5))=-phi; b(k,m(2))=1;
end
end

% Compute frequencies and mode shapes


q=null(c); A=a*q; B=b*q; [modes,lam]=eig(B\A);
[lam,k]=sort(diag(lam)); w=sqrt(lam)’/dx;
modes=q*modes(:,k); modes=reshape(modes(:),nx,ny,neig);
The calling sequence and the plotting code for the first 50 frequencies are given below:
% Plot first fifty approximate and exact frequencies
[w,wex,modes,x,y,nx,ny,ax,by]=recmemnfr;

m=1:min([50,length(w),length(wex)]);
pcter=100*(wex(m)-w(m))./wex(m);

clf; plot(m,wex(m),’k-’,m,w(m),’k.’,m,pcter,’k--’)
xlabel(’frequency number’);
ylabel(’frequency and % error’)
legend(’exact frequency’,’approx. frequency’,...
’percent error’,2)
s=[’MEMBRANE FREQUENCIES FOR AX / BY = ’,...
num2str(ax/by,5),’ AND ’,num2str(nx*ny),...
’ GRID POINTS’];
title(s), grid on, shg
See Figure 8.3 for the graph of exact and approximate frequencies and the error.

Problems
Problem 8.1. Compute the eigenvalues of the Hilbert matrix for n = 10, 11, . . . , 20 and the corresponding condition numbers.

Problem 8.2. The matrices

P=gallery('pascal',12)
F=gallery('frank',12)

have the property that if λ is an eigenvalue, then 1/λ is an eigenvalue too. How well do the computed eigenvalues preserve this property? Use condeig to explain the different behavior of these two matrices.

Problem 8.3. What is the largest eigenvalue of magic(n)? Why?

Problem 8.4. Try the following command sequence:




Figure 8.3: Approximate and exact frequencies for a rectangular membrane

n=100;
d=ones(n,1);
A=diag(d,1)+diag(d,-1);
e=eig(A);
plot(-n/2:n/2,e,’.’)
Do you recognize the resulting curve? Can you find a formula for the eigenvalues of this matrix?
Problem 8.5. Let $T_N$ be the matrix obtained by the finite-difference discretization of the univariate Poisson equation (Problem 4.7). Its eigenvalues are
$$\lambda_j = 2\left(1 - \cos\frac{\pi j}{N+1}\right),$$
and its eigenvectors $z_j$ have the components
$$z_j(k) = \sqrt{\frac{2}{N+1}}\,\sin\frac{jk\pi}{N+1}.$$
Give a graphical representation of the eigenvalues and eigenvectors of $T_{21}$.
Problem 8.6. (a) Implement the power method (vector iteration).
(b) Test the function in (a) for the matrix and the starting vector
$$A = \begin{pmatrix}6 & 5 & -5\\ 2 & 6 & -2\\ 2 & 5 & -1\end{pmatrix}, \qquad x = \begin{pmatrix}-1\\ 1\\ 1\end{pmatrix}.$$

(c) Approximate the spectral radius ρ(A) of the matrix
$$A = \begin{pmatrix}2 & 0 & -1\\ -2 & -10 & 0\\ -1 & -1 & 4\end{pmatrix},$$
using the power method and the starting vector $[1, 1, 1]^T$.

Problem 8.7. Find the eigenvalues of the matrix
$$\begin{pmatrix}190 & 66 & -84 & 30\\ 66 & 303 & 42 & -36\\ 336 & -168 & 147 & -112\\ 30 & -36 & 28 & 291\end{pmatrix}$$
by using the double shift QR method. Compare your result to that provided by eig.

Problem 8.8. Find the SVD of the following matrices:
$$\begin{pmatrix}4 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 7\\ 0 & 0 & 0\end{pmatrix}, \qquad \begin{pmatrix}2 & 5\\ 4 & 1\end{pmatrix}.$$
CHAPTER 9

Numerical Solution of Ordinary Differential Equations

9.1 Differential Equations


Let us consider the initial value (Cauchy¹) problem: determine a vector-valued function $y\in C^1[a,b]$, $y : [a,b]\to\mathbb{R}^d$, such that
$$(CP)\qquad\begin{cases}\dfrac{dy}{dx} = f(x,y), & x\in[a,b],\\ y(a) = y_0.\end{cases} \tag{9.1.1}$$
We shall emphasize two important classes of such problems:
(i) for d = 1 we have a single first-order differential equation
$$y' = f(x,y), \qquad y(a) = y_0;$$
(ii) for d > 1 we have a system of first-order differential equations
$$\frac{dy^i}{dx} = f^i(x, y^1, y^2, \ldots, y^d), \qquad y^i(a) = y_0^i, \qquad i=\overline{1,d}.$$

¹Augustin Louis Cauchy (1789–1857), French mathematician, active in Paris, is considered to be the father of modern analysis. He provided a firm foundation for analysis by basing it on a rigorous concept of limit. He is also the creator of complex analysis, where "Cauchy's formula" plays a central role. In addition, Cauchy's name is attached to pioneering contributions to the theory of ordinary and partial differential equations, mainly in existence and uniqueness problems. Like other great mathematicians of the 18th and 19th centuries, his work encompasses geometry, algebra, number theory, mechanics and theoretical physics.


Remark 9.1.1. Let us consider a single d-th order differential equation,
$$u^{(d)} = g(x, u, u', \ldots, u^{(d-1)}),$$
with the initial conditions $u^{(i)}(a) = u_0^i$, $i=\overline{0,d-1}$. This problem is easily brought into the form (9.1.1) by defining
$$y^i = u^{(i-1)}, \qquad i=\overline{1,d}.$$
Then
$$\begin{aligned} \frac{dy^1}{dx} &= y^2, & y^1(a) &= u_0^0,\\ \frac{dy^2}{dx} &= y^3, & y^2(a) &= u_0^1,\\ &\ \ \vdots & &\\ \frac{dy^{d-1}}{dx} &= y^d, & y^{d-1}(a) &= u_0^{d-2},\\ \frac{dy^d}{dx} &= g(x, y^1, y^2, \ldots, y^d), & y^d(a) &= u_0^{d-1}, \end{aligned} \tag{9.1.2}$$
which has the form (9.1.1) with very special (linear) functions $f^1, f^2, \ldots, f^{d-1}$, and $f^d(x,y) = g(x,y)$. ♦

We recall from the theory of differential equations the following basic existence and uniqueness result.

Theorem 9.1.2. Assume that f(x, y) is continuous in the first variable for $x\in[a,b]$ and with respect to the second satisfies a uniform Lipschitz condition
$$\|f(x,y) - f(x,y^*)\| \le L\|y - y^*\|, \qquad y, y^*\in\mathbb{R}^d, \tag{9.1.3}$$
where $\|\cdot\|$ is some vector norm. Then the initial value problem (CP) has a unique solution y(x), $a\le x\le b$, for every $y_0\in\mathbb{R}^d$. Moreover, y(x) depends continuously on a and $y_0$.

The Lipschitz condition (9.1.3) certainly holds if all the functions $\frac{\partial f^i}{\partial y^j}(x,y)$, $i, j=\overline{1,d}$, are continuous in the y-variables and bounded on $[a,b]\times\mathbb{R}^d$. This is the case for linear systems of differential equations, where
$$f^i(x,y) = \sum_{j=1}^d a_{ij}(x)y^j + b_i(x), \qquad i=\overline{1,d},$$
and $a_{ij}(x)$, $b_i(x)$ are continuous functions on [a, b].
Often the Lipschitz condition (9.1.3) holds only locally, on some compact domain D in which the solution y(x) remains.

9.2 Numerical Methods


One can distinguish between analytic approximation methods and discrete-variable methods. In the former, one tries to find approximations $y_a(x)\approx y(x)$ to the exact solution, valid for all $x\in[a,b]$. This usually takes the form of a truncated series expansion, either in powers of x, in Chebyshev polynomials, or in some other system of basis functions. In discrete-variable methods, one attempts to find approximations $u_n\in\mathbb{R}^d$ of $y(x_n)$ only at discrete points $x_n\in[a,b]$. The abscissas $x_n$ may be predetermined

(e.g., equally spaced on [a, b]) or, more likely, are generated dynamically as a part of the integration process.
If desired, one can then obtain from these discrete approximations $\{u_n\}$ an approximation $y_a(x)$ defined for all $x\in[a,b]$, either by interpolation or by a continuation mechanism built into the approximation method itself. We are concerned only with discrete one-step methods, that is, methods in which $u_{n+1}$ is determined solely from a knowledge of $x_n$, $u_n$ and the step h to proceed from $x_n$ to $x_{n+1} = x_n + h$. In a k-step method (k > 1), knowledge of k − 1 additional points $(x_{n-j}, u_{n-j})$, $j = 1, 2, \ldots, k-1$, is required to advance the solution.
When describing a single step of a one-step method, it suffices to show how one proceeds from a
generic point (x, y), x ∈ [a, b], y ∈ Rd to the “next” point (x + h, ynext ). We refer to this as the
local description of the one-step method. This also includes a discussion of the local accuracy, that is
how closely ynext agrees at x + h with the solution of the differential equation. A one-step method
solving the initial value problem (9.1.1) effectively generates a grid function {un }, un ∈ Rd , on a grid
a = x0 < x1 < x2 < · · · < xN−1 < xN = b covering the interval [a, b], whereby un is intended
to approximate the exact solution y(x) at x = xn . The point (xn+1 , un+1 ) is obtained from the point
(xn , un ) by applying a one-step method with an appropriate step hn = xn+1 − xn . This is referred to
as the global description of a one-step method. Questions of interest here are the behavior of the global
error un − y(xn ), in particular stability and convergence, and the choice of hn to proceed from one
grid point xn to the next, xn+1 = xn + hn .

9.3 Local Description of One-Step Methods


Given a generic point $x\in[a,b]$, $y\in\mathbb{R}^d$, we define a single step of a one-step method by
$$y_{\text{next}} = y + h\Phi(x,y;h), \qquad h > 0. \tag{9.3.1}$$
The function $\Phi : [a,b]\times\mathbb{R}^d\times\mathbb{R}_+\to\mathbb{R}^d$ may be thought of as the approximate increment per unit step, or the approximate difference quotient, and it defines the method. Along with (9.3.1), we consider the solution u(t) of the differential equation (9.1.1) passing through the point (x, y), that is, the local initial value problem
$$\begin{cases}\dfrac{du}{dt} = f(t,u),\\ u(x) = y,\end{cases} \qquad t\in[x, x+h]. \tag{9.3.2}$$
We call u(t) the reference solution. The vector $y_{\text{next}}$ in (9.3.1) is intended to approximate u(x + h). How successfully this is done is measured by the truncation error, defined as follows.

Definition 9.3.1. The truncation error of the method Φ at the point (x, y) is defined by
$$T(x,y;h) = \frac{1}{h}\left[y_{\text{next}} - u(x+h)\right]. \tag{9.3.3}$$

Thus the truncation error is a vector-valued function of d + 2 variables. Using (9.3.1) and (9.3.2), we can write it alternatively as
$$T(x,y;h) = \Phi(x,y;h) - \frac{1}{h}\left[u(x+h) - u(x)\right], \tag{9.3.4}$$
showing that T is the difference between the approximate and exact increment per unit step.
Definition 9.3.2. The method Φ is called consistent if
T (x, y; h) → 0 as h → 0, (9.3.5)
uniformly for (x, y) ∈ [a, b] × Rd .

By (9.3.4) and (9.3.3)) we have consistency if and only if

Φ(x, y; 0) = f (x, y), x ∈ [a, b], y ∈ Rd . (9.3.6)

A finer description of local accuracy is provided by the next definition based on the notion of a local
truncation error.

Definition 9.3.3. The method Φ is said to have order p if for some vector norm k · k,

kT (x, y; h)k ≤ Chp , (9.3.7)

uniformly on [a, b] × Rd , with a constant C not depending on x, y and h.

We express briefly this property as

T (x, y; h) = O(hp ), h → 0. (9.3.8)

Note that p > 0 implies consistency. Usually, p ∈ N∗ . It is called the exact order, if (9.3.7) does
not hold for any larger p.

Definition 9.3.4. A function τ : [a, b] × Rd → Rd that satisfies τ (x, y) 6≡ 0 and

T (x, y; h) = τ (x, y)hp + O(hp+1 ), h→0 (9.3.9)

is called the principal error function.

The principal error function determines the leading term in the truncation error. The number p in
(9.3.9) is the exact order of the method since τ 6≡ 0.
All the preceding definitions are made with the idea in mind that h > 0 is a small number. Then
the larger is p, the more accurate is the method.

9.4 Examples of One-Step Methods


Some of the oldest methods are motivated by simple geometric considerations based on the slope field
defined by the right-hand side of the differential equation. This include the Euler and modified Euler
methods. More accurate and sophisticated methods are based on Taylor expansion.

9.4.1 Euler’s method


Euler proposed his method in 1768, in the early days of calculus. It consists of simply following the slope at the generic point (x, y) over an interval of length h:

ynext = y + hf (x, y). (9.4.1)

(See Figure 9.1).


Thus, Φ(x, y; h) = f(x, y) does not depend on h and by (9.3.6) the method is consistent. For the truncation error we have, by (9.3.3),
$$T(x,y;h) = f(x,y) - \frac{1}{h}\left[u(x+h) - u(x)\right], \tag{9.4.2}$$


Figure 9.1: Euler’s method – the exact solution (continuous line) and the approximate solu-
tion (dashed line)

where u(t) is the reference solution defined in (9.3.2). Since $u'(x) = f(x, u(x)) = f(x,y)$, we can write, using Taylor's theorem,
$$T(x,y;h) = u'(x) - \frac{1}{h}\left[u(x+h) - u(x)\right] = u'(x) - \frac{1}{h}\left[u(x) + hu'(x) + \frac{1}{2}h^2u''(\xi) - u(x)\right] = -\frac{1}{2}hu''(\xi), \quad \xi\in(x,x+h), \tag{9.4.3}$$
assuming $u\in C^2[x, x+h]$. This is certainly true if $f\in C^1([a,b]\times\mathbb{R}^d)$. Now differentiating (9.3.2) totally with respect to t and then setting $t = \xi$ yields
$$T(x,y;h) = -\frac{1}{2}h\,[f_x + f_yf](\xi, u(\xi)), \tag{9.4.4}$$
where $f_x$ is the partial derivative of f with respect to x and $f_y$ the Jacobian of f with respect to the y-variables. If, in the spirit of Theorem 9.1.2, we assume that f and all its first partial derivatives are uniformly bounded on $[a,b]\times\mathbb{R}^d$, there exists a constant C independent of x, y and h such that
$$\|T(x,y;h)\| \le Ch. \tag{9.4.5}$$
Thus, Euler's method has order p = 1. If we make the same assumption about all second-order partial derivatives of f, we have $u''(\xi) = u''(x) + O(h)$ and therefore, from (9.4.3),
$$T(x,y;h) = -\frac{1}{2}h\,[f_x + f_yf](x,y) + O(h^2), \qquad h\to 0, \tag{9.4.6}$$

showing that the principal error function is given by
$$\tau(x,y) = -\frac{1}{2}\,[f_x + f_yf](x,y). \tag{9.4.7}$$
Unless $f_x + f_yf\equiv 0$, the order of Euler's method is exactly p = 1.
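A minimal MATLAB sketch of Euler's method, assuming the right-hand side is given as a function handle (the function name and interface are our own choices):

function [x,u]=eulerode(f,a,b,y0,N)
%EULERODE - sketch of Euler's method (9.4.1); name and interface are ours
%f - handle f(x,y); [a,b] - interval; y0 - initial value; N - no. of steps
h=(b-a)/N;
x=a+(0:N)*h;
u=zeros(length(y0),N+1); u(:,1)=y0(:);
for n=1:N
    u(:,n+1)=u(:,n)+h*f(x(n),u(:,n));   %y_next = y + h f(x,y)
end

For example, [x,u]=eulerode(@(x,y) -2*x*y,0,1,1,100) approximates the solution of $y' = -2xy$, $y(0) = 1$, whose exact solution is $e^{-x^2}$.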

9.4.2 Method of Taylor expansion


We have seen that Euler's method basically amounts to truncating the Taylor expansion of the reference solution after its second term. It is a natural idea, already proposed by Euler, to use more terms of the Taylor expansion. This requires the computation of successive "total derivatives" of f,
$$f^{[0]}(x,y) = f(x,y), \qquad f^{[k+1]}(x,y) = f_x^{[k]}(x,y) + f_y^{[k]}(x,y)f(x,y), \quad k = 0, 1, 2, \ldots, \tag{9.4.8}$$
which determine the successive derivatives of the reference solution u(t) of (9.3.2) by virtue of
$$u^{(k+1)}(t) = f^{[k]}(t, u(t)), \qquad k = 0, 1, 2, \ldots \tag{9.4.9}$$
These, for t = x, become
$$u^{(k+1)}(x) = f^{[k]}(x,y), \qquad k = 0, 1, 2, \ldots \tag{9.4.10}$$
and are used to form the Taylor series approximation according to
$$y_{\text{next}} = y + h\left(f^{[0]}(x,y) + \frac{1}{2}hf^{[1]}(x,y) + \cdots + \frac{1}{p!}h^{p-1}f^{[p-1]}(x,y)\right), \tag{9.4.11}$$
that is,
$$\Phi(x,y;h) = f^{[0]}(x,y) + \frac{1}{2}hf^{[1]}(x,y) + \cdots + \frac{1}{p!}h^{p-1}f^{[p-1]}(x,y). \tag{9.4.12}$$
For the truncation error, using (9.4.10) and (9.4.12) and assuming $f\in C^p([a,b]\times\mathbb{R}^d)$, we obtain from Taylor's theorem
$$T(x,y;h) = \Phi(x,y;h) - \frac{1}{h}\left[u(x+h) - u(x)\right] = \Phi(x,y;h) - \sum_{k=0}^{p-1}u^{(k+1)}(x)\frac{h^k}{(k+1)!} - u^{(p+1)}(\xi)\frac{h^p}{(p+1)!} = -u^{(p+1)}(\xi)\frac{h^p}{(p+1)!}, \quad \xi\in(x,x+h),$$
so that
$$\|T(x,y;h)\| \le \frac{C_p}{(p+1)!}h^p,$$
where $C_p$ is a bound on the pth total derivative of f. Thus, the method has exact order p (unless $f^{[p]}(x,y)\equiv 0$), and the principal error function is
$$\tau(x,y) = -\frac{1}{(p+1)!}f^{[p]}(x,y). \tag{9.4.13}$$
The necessity of computing many partial derivatives in (9.4.8) was a discouraging factor in the past, when this had to be done by hand. Nowadays this task can be delegated to the computer, so the method has become a viable option again.
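As an illustration (the problem and its exact solution are our own choices), here is method (9.4.11) with p = 2 for the scalar equation $y' = x + y$, $y(0) = 1$, whose total derivative $f^{[1]} = f_x + f_yf = 1 + x + y$ is computed by hand; the exact solution is $y = 2e^x - x - 1$:

% second-order Taylor method (p = 2), a sketch; all names are ours
f =@(x,y) x+y;           %f(x,y)
f1=@(x,y) 1+x+y;         %f^[1] = f_x + f_y*f, computed by hand
a=0; b=1; N=50; h=(b-a)/N;
x=a+(0:N)*h; u=zeros(1,N+1); u(1)=1;     %y(0)=1
for n=1:N
    u(n+1)=u(n)+h*(f(x(n),u(n))+h/2*f1(x(n),u(n)));  %(9.4.11) with p=2
end
max(abs(u-(2*exp(x)-x-1)))               %error vs. the exact solution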

9.4.3 Improved Euler methods


There is too much inertia in Euler’s method: one should not follow the same initial slope over the whole
interval of length h, since along this line segment the slope defined by the slope field of the differential
equation changes. This suggests several alternatives. For example, we may wish to reevaluate the slope
halfway through the line segment — retake the pulse of the differential equation, as it were — and then
follow this revised slope over the whole interval (cf. Figure 9.2). In formula,


Figure 9.2: Modified Euler method

 
$$y_{\text{next}} = y + hf\left(x + \frac{1}{2}h,\ y + \frac{1}{2}hf(x,y)\right) \tag{9.4.14}$$
or
$$\Phi(x,y;h) = f\left(x + \frac{1}{2}h,\ y + \frac{1}{2}hf(x,y)\right). \tag{9.4.15}$$
Note the characteristic "nesting" of f that is required here. For programming purposes it may be desirable to undo the nesting and write
$$\begin{aligned} K_1(x,y) &= f(x,y),\\ K_2(x,y;h) &= f\left(x + \tfrac{1}{2}h,\ y + \tfrac{1}{2}hK_1\right),\\ y_{\text{next}} &= y + hK_2. \end{aligned} \tag{9.4.16}$$
In other words, we are taking two trial slopes, $K_1$ and $K_2$, one at the initial point and the other nearby, and then taking the latter as the final slope. The method is called the modified Euler method.
We could equally well take the second trial slope at $(x + h,\ y + hf(x,y))$, but then, having waited too long before reevaluating the slope, take as the final slope the average of the two slopes:
$$\begin{aligned} K_1(x,y) &= f(x,y),\\ K_2(x,y;h) &= f(x + h,\ y + hK_1),\\ y_{\text{next}} &= y + \frac{h}{2}(K_1 + K_2). \end{aligned} \tag{9.4.17}$$
2

This is sometimes referred to as the Heun method. The effect of both modifications is to raise the order
by 1, as shown in the sequel.
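As a sketch, Heun's method (9.4.17) can be coded in the style of the RK4 function listed in the next
section (same calling convention; the function name heun is ours):

function [t,w]=heun(f,tspan,alpha,N)
%HEUN - sketch of Heun's method (9.4.17) on N equal steps
tc=tspan(1); wc=alpha(:);
h=(tspan(end)-tspan(1))/N;
t=tc; w=wc';
for k=1:N
    K1=f(tc,wc);
    K2=f(tc+h,wc+h*K1);
    wc=wc+h/2*(K1+K2);   %average of the two trial slopes
    tc=tc+h;
    t=[t;tc]; w=[w;wc'];
end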

9.5 Runge-Kutta Methods


We look for Φ of the form:

Φ(x, y; h) = Σ_{s=1}^{r} α_s K_s
K_1(x, y) = f(x, y)    (9.5.1)
K_s(x, y; h) = f(x + µ_s h, y + h Σ_{j=1}^{s−1} λ_{sj} K_j),  s = 2, 3, . . . , r

It is natural in (9.5.1) to impose the conditions

µ_s = Σ_{j=1}^{s−1} λ_{sj},  s = 2, 3, . . . , r,    Σ_{s=1}^{r} α_s = 1,    (9.5.2)

where the first set is equivalent to

K_s(x, y; h) = u′(x + µ_s h) + O(h^2),  s ≥ 2,

and the second is nothing but the consistency condition (cf. (9.3.6)), i.e., Φ(x, y; 0) = f(x, y).
We call (9.5.1) an explicit r-stage Runge-Kutta method; it requires r evaluations of the right-hand
side f of the differential equation. Conditions (9.5.2) lead to a nonlinear system. Let p*(r) be the
maximum attainable order (for arbitrary sufficiently smooth f) of an explicit r-stage Runge-Kutta method.
Kutta² has shown in 1901 that

p*(r) = r,  r = 1, 4.
We can consider implicit r-stage Runge-Kutta methods

Φ(x, y; h) = Σ_{s=1}^{r} α_s K_s(x, y; h),
K_s = f(x + µ_s h, y + h Σ_{j=1}^{r} λ_{sj} K_j),  s = 1, r,    (9.5.3)

in which the last r equations form a system of (in general nonlinear) equations in the unknowns K_1,
K_2, . . . , K_r. Since each of these is a vector in R^d, before we can form the approximate increment Φ
we must solve a system of rd equations in rd unknowns. Semi-implicit Runge-Kutta methods, where
the summation in the formula for K_s extends from j = 1 to j = s, require less work. This yields r
systems of equations, each having only d unknowns, the components of K_s.

² Wilhelm Martin Kutta (1867-1944) was a German applied mathematician, well known for his work on
the numerical solution of ODEs. He also made important contributions to the application of conformal
mapping to hydro- and aerodynamical problems (the Kutta-Joukowski formula for the lift exerted on an
airfoil).
Already in the case of explicit Runge-Kutta methods, and even more so in implicit methods, we have
at our disposal a large number of parameters which we can choose to achieve the maximum possible
order for all sufficiently smooth f . The considerable computational expenses involved in implicit and
semi-implicit methods can only be justified in special circumstances, for example, stiff problems. The
reason is that implicit methods can be made not only to have higher order than explicit methods, but
also to have better stability properties.
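A standard example (mentioned here only for illustration; it is not discussed further in this text) is the
one-stage implicit midpoint rule, with r = 1, µ_1 = λ_11 = 1/2, α_1 = 1:

K_1 = f(x + (1/2)h, y + (1/2)h K_1),  y_next = y + h K_1.

It has order p = 2 — more than any explicit one-stage method can achieve — and it is A-stable.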
Example 9.5.1. Let
Φ(x, y; h) = α_1 K_1 + α_2 K_2,

where

K_1(x, y) = f(x, y),
K_2(x, y; h) = f(x + µ_2 h, y + λ_21 h K_1),  λ_21 = µ_2.

We now have three parameters, α_1, α_2, and µ (:= µ_2 = λ_21). A systematic way of determining the
maximum order p is to expand both Φ(x, y; h) and h^{−1}[u(x + h) − u(x)] in powers of h and to
match as many terms as we can, without imposing constraints on f.
To expand Φ, we need Taylor's expansion for (vector-valued) functions of several variables

f(x + Δx, y + Δy) = f + f_x Δx + f_y Δy
  + (1/2)[f_xx (Δx)^2 + 2 f_xy Δx Δy + (Δy)^T f_yy (Δy)] + · · · ,    (9.5.4)

where f_y denotes the Jacobian of f, and f_yy = [f^i_yy] is the vector of Hessian matrices of f. In
(9.5.4), all functions and partial derivatives are understood to be evaluated at (x, y). Letting Δx = µh,
Δy = µhf then gives

K_2(x, y; h) = f + µh(f_x + f_y f)
  + (1/2) µ^2 h^2 (f_xx + 2 f_xy f + f^T f_yy f) + O(h^3),    (9.5.5)

(1/h)[u(x + h) − u(x)] = u′(x) + (1/2) h u′′(x) + (1/6) h^2 u′′′(x) + O(h^3),    (9.5.6)

where

u′(x) = f
u′′(x) = f^{[1]} = f_x + f_y f
u′′′(x) = f^{[2]} = f^{[1]}_x + f^{[1]}_y f = f_xx + f_xy f + f_y f_x + (f_xy + (f_y f)_y) f
        = f_xx + 2 f_xy f + f^T f_yy f + f_y (f_x + f_y f),

and where in the last equation we have used

(f_y f)_y f = f^T f_yy f + f_y^2 f.
Now,

T(x, y; h) = α_1 K_1 + α_2 K_2 − (1/h)[u(x + h) − u(x)],

wherein we substitute the expansions (9.5.5) and (9.5.6). We find

T(x, y; h) = (α_1 + α_2 − 1) f + (α_2 µ − 1/2) h (f_x + f_y f)
  + (1/2) h^2 [ (α_2 µ^2 − 1/3)(f_xx + 2 f_xy f + f^T f_yy f) − (1/3) f_y (f_x + f_y f) ] + O(h^3).    (9.5.7)

We cannot enforce the condition that the h^2 coefficient be zero without imposing severe restrictions
on f. Thus, the maximum order is 2 and we obtain it for

α_1 + α_2 = 1,
α_2 µ = 1/2.

The solution

α_1 = 1 − α_2,
µ = 1/(2α_2),

depends upon an arbitrary parameter, α_2 ≠ 0.
For α_2 = 1 we obtain the modified Euler method, and for α_2 = 1/2 Heun's method. ♦
We shall mention the classical Runge-Kutta formula of order p = 4:

Φ(x, y; h) = (1/6)(K_1 + 2K_2 + 2K_3 + K_4)
K_1(x, y; h) = f(x, y)
K_2(x, y; h) = f(x + (1/2)h, y + (1/2)h K_1)    (9.5.8)
K_3(x, y; h) = f(x + (1/2)h, y + (1/2)h K_2)
K_4(x, y; h) = f(x + h, y + h K_3)

When f does not depend on y, (9.5.8) reduces to Simpson's quadrature formula. Runge's³ idea was to
generalize Simpson's quadrature formula to ordinary differential equations. He succeeded only partially;
his formula had r = 4 and p = 3. The method (9.5.8) was discovered by Kutta in 1901 through a
systematic search.
The classical 4th order Runge-Kutta method for a grid of N + 1 equally spaced points is given by
MATLAB Source 9.1.

Example 9.5.2. Using the 4th order Runge-Kutta method for the initial value problem

y′ = −y + t + 1,  t ∈ [0, 1],
y(0) = 1,

³ Carle David Tolmé Runge (1856-1927) was active in the famous Göttingen school of mathematics and
is one of the pioneers of numerical mathematics. He is best known for the Runge-Kutta formula in
ordinary differential equations, for which he provided the basic idea. He also made notable contributions
to approximation theory in the complex plane.

MATLAB Source 9.1 Classical 4th order Runge-Kutta method


function [t,w]=RK4(f,tspan,alpha,N)
%RK4 - classical Runge-Kutta method for equispaced nodes
%call [t,w]=RK4(f,tspan,alpha,N)
%f - right hand side function
%tspan - interval
%alpha - starting value(s)
%N - number of subintervals
%t - abscissas of solution
%w - ordinates of solution

tc=tspan(1); wc=alpha(:);
h=(tspan(end)-tspan(1))/N;
t=tc; w=wc';
for k=1:N
    K1=f(tc,wc);
    K2=f(tc+1/2*h,wc+1/2*h*K1);
    K3=f(tc+1/2*h,wc+1/2*h*K2);
    K4=f(tc+h, wc+h*K3);
    wc=wc+h/6*(K1+2*K2+2*K3+K4);
    tc=tc+h;
    t=[t;tc]; w=[w;wc'];
end

with h = 0.1, N = 10, and t_i = 0.1i we obtain the results given in Table 9.1. The exact solution is
y(t) = e^{−t} + t, and the calling sequence is
[t,w]=RK4(@edex1,[0,1],1,10);
The MATLAB function

function df=edex1(t,y)
df=-y+t+1;

defines the right-hand side. ♦

It is usual to associate with each r-stage Runge-Kutta method (9.5.3) the tableau

µ_1 | λ_11  λ_12  ...  λ_1r
µ_2 | λ_21  λ_22  ...  λ_2r
 .. |  ..    ..   ...   ..
µ_r | λ_r1  λ_r2  ...  λ_rr
    | α_1   α_2   ...  α_r

or, in matrix form, (µ | Λ) over α^T, called a Butcher table. For an explicit method µ_1 = 0 and Λ is
strictly lower triangular (λ_sj = 0 for j ≥ s, so the main diagonal is null). To the first r lines of a
Butcher table we can associate the quadrature formulas

∫_0^{µ_s} u(t) dt ≈ Σ_{j=1}^{r} λ_{sj} u(µ_j),  s = 1, r,

and to the last line the quadrature formula

∫_0^1 u(t) dt ≈ Σ_{s=1}^{r} α_s u(µ_s).
If the corresponding degrees of exactness are d_s = q_s − 1, 1 ≤ s ≤ r + 1 (d_s = ∞ if µ_s = 0 and all
λ_sj = 0), then Peano's theorem implies that the q_s-th derivative of u occurs in the representation of
the remainder, and setting u(t) = y′(x + th) one obtains

(1/h)[y(x + µ_s h) − y(x)] − Σ_{j=1}^{r} λ_{sj} y′(x + µ_j h) = O(h^{q_s}),  s = 1, r,

and

(1/h)[y(x + h) − y(x)] − Σ_{s=1}^{r} α_s y′(x + µ_s h) = O(h^{q_{r+1}}).

t_i     Approximations    Exact values      Error
0.0     1                 1                 0
0.1     1.00483750000     1.00483741804     8.19640e-008
0.2     1.01873090141     1.01873075308     1.48328e-007
0.3     1.04081842200     1.04081822068     2.01319e-007
0.4     1.07032028892     1.07032004604     2.42882e-007
0.5     1.10653093442     1.10653065971     2.74711e-007
0.6     1.14881193438     1.14881163609     2.98282e-007
0.7     1.19658561867     1.19658530379     3.14880e-007
0.8     1.24932928973     1.24932896412     3.25617e-007
0.9     1.30656999120     1.30656965974     3.31459e-007
1.0     1.36787977441     1.36787944117     3.33241e-007

Table 9.1: Numerical results for Example 9.5.2

For the classical 4th order Runge-Kutta method (9.5.8) the Butcher table is:

 0   |
1/2  |  1/2
1/2  |   0    1/2
 1   |   0     0    1
     |  1/6   2/6  2/6  1/6
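Note that the last line of this table, read as a quadrature formula as above, gives
∫_0^1 u(t) dt ≈ (1/6) u(0) + (2/3) u(1/2) + (1/6) u(1), which is precisely Simpson's rule; this makes
concrete the earlier remark that (9.5.8) reduces to Simpson's formula when f does not depend on y.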

MATLAB Source 9.2 is an implementation example of a Runge-Kutta method with constant step,
given the Butcher table. The last parameter of the Runge_Kutta function is a function handle; the
corresponding function returns the entries µ, λ and α of the table and the number r of stages of the
method. For the initialization of the Butcher table of the classical fourth-order Runge-Kutta method see
MATLAB Source 9.3.
The MATLAB code below solves the problem in Example 9.5.2.

>> [t2,w2]=Runge_Kutta(@edex1,[0,1],1,10,@RK4tab);

We emphasize that the functions RK4 and Runge_Kutta work both for scalar ordinary differential
equations and for systems.

9.6 Global Description of One-Step Methods


Global description of one-step methods is best done in terms of grids and grid functions.

MATLAB Source 9.2 Implementation of a Runge-Kutta method with constant step given
Butcher table
function [x,y,nfev]=Runge_Kutta(f,tspan,y0,N,BT)
%RUNGE_KUTTA - Runge-Kutta method with constant step
%call [t,y,nfev]=Runge_Kutta(f,tspan,y0,N,BT)
%f -right-hand side function
%tspan - interval [a,b]
%y0 - starting value(s)
%N - number of steps
%BT - function that provides the Butcher table; call syntax
% [lambda,alfa,mu,s]=BT(), where s is the number of stages
%t -abscissas of solution
%y - ordinates of solution components
%nfev - number of function evaluation

[lambda,alfa,mu,r]=BT(); %initialize Butcher table


h=(tspan(end)-tspan(1))/N; %step length
xc=tspan(1); yc=y0(:);
x=xc; y=yc';
K=zeros(length(y0),r);
for k=1:N %RK iteration
    K(:,1)=f(xc,yc);
    for i=2:r
        K(:,i)=f(xc+mu(i)*h,yc+h*(K(:,1:i-1)*lambda(i,1:i-1)'));
    end
    yc=yc+h*(K*alfa);
    xc=xc+h; %prepare next iteration
    x=[x;xc]; y=[y;yc'];
end
if nargout==3
    nfev=r*N;
end
end

MATLAB Source 9.3 Initialize Butcher table for RK4


function [a,b,c,s]=RK4tab
%RK4TAB - Butcher table for classical RK4
s=4;
a=zeros(s,s-1);
a(2:s,1:s-1)=[1/2,0,0; 0,1/2,0; 0,0,1];
b=[1,2,2,1]'/6;
c=sum(a');
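Other methods can be supplied to Runge_Kutta in the same format. For instance, a possible table
function for Heun's method (9.4.17) (our own helper, not part of the original sources) is:

function [a,b,c,s]=Heuntab
%HEUNTAB - sketch: Butcher table for Heun's method (9.4.17),
%same output convention as RK4tab
s=2;
a=zeros(s,s-1);
a(2,1)=1;           %K2 evaluated at (x+h, y+h*K1)
b=[1/2,1/2]';       %average of the two slopes
c=sum(a,2)';        %mu = row sums of lambda (sum(a') would fail for s=2)

A call such as [t,w]=Runge_Kutta(@edex1,[0,1],1,10,@Heuntab) then integrates the problem of
Example 9.5.2 with Heun's method.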

A grid on the interval [a, b] is a set of points {x_n}_{n=0}^{N} such that

a = x_0 < x_1 < x_2 < · · · < x_{N−1} < x_N = b,    (9.6.1)

with grid lengths h_n defined by

h_n = x_{n+1} − x_n,  n = 0, 1, . . . , N − 1.    (9.6.2)

The fineness of the grid is measured by

|h| = max_{0≤n≤N−1} h_n.    (9.6.3)

We shall use the letter h to denote the collection of lengths h = {h_n}. If h_0 = h_1 = · · · = h_{N−1} =
(b − a)/N, we call (9.6.1) a uniform grid, otherwise a nonuniform grid. The letter h is also used to
designate the common grid length h = (b − a)/N. A vector-valued function v = {v_n}, v_n ∈ R^d, defined
on the grid (9.6.1) is called a grid function. Thus, v_n is the value of v at the gridpoint x_n. Every function
v(x) defined on [a, b] induces a grid function by restriction. We denote the set of grid functions on [a, b]
by Γ_h[a, b], and for each grid function v = {v_n} define its norm by

‖v‖_∞ = max_{0≤n≤N} ‖v_n‖,  v ∈ Γ_h[a, b].    (9.6.4)

A one-step method – indeed, any discrete-variable method – is a method producing a grid function
u = {u_n} such that u ≈ y, where y = {y_n} is the grid function induced by the exact solution y(x) of
the initial value problem (9.1.1). The method reads

x_{n+1} = x_n + h_n
u_{n+1} = u_n + h_n Φ(x_n, u_n; h_n),    (9.6.5)

where x_0 = a, u_0 = y_0.
To bring up the analogy between (9.1.1) and (9.6.5), we introduce operators R and R_h acting on
C^1[a, b] and Γ_h[a, b], respectively. These are the residual operators

(Rv)(x) := v′(x) − f(x, v(x)),  v ∈ C^1[a, b]    (9.6.6)

(R_h v)_n := (1/h_n)(v_{n+1} − v_n) − Φ(x_n, v_n; h_n),  n = 0, 1, . . . , N − 1,    (9.6.7)

where v = {vn } ∈ Γh [a, b]. (The grid function {(Rh v)n } is not defined for n = N , but we may
arbitrarily set (Rh v)N = (Rh v)N−1 ). Then the initial value problem (9.1.1) and its discrete analogue
(9.6.5) can be written transparently as

Ry = 0 on [a, b], y(a) = y0 (9.6.8)


Rh u = 0 on [a, b], u0 = y0 (9.6.9)

Note that the discrete residual operator (9.6.7) is closely related to the truncation error (9.3.3) when
we apply the operator at a point (xn , y(xn )) on the exact solution trajectory. Then indeed the reference
solution u(t) coincides with the solution y(t) and

(R_h y)_n = (1/h_n)[y(x_{n+1}) − y(x_n)] − Φ(x_n, y(x_n); h_n)
          = −T(x_n, y(x_n); h_n).    (9.6.10)

9.6.1 Stability
Stability is a property of the numerical scheme (9.6.5) alone and has nothing to do with its approxi-
mation power. It characterizes the robustness of the scheme with respect to small perturbations. Nev-
ertheless, stability combined with consistency yields convergence of the numerical solution to the true
solution.
We define stability in terms of the discrete residual operators Rh in (9.6.7). As usual we assume
Φ(x, y; h) to be defined on [a, b] × Rd × [0, h0 ], where h0 > 0 is some suitable positive number.

Definition 9.6.1. The method (9.6.5) is called stable on [a, b] if there exists a constant K > 0 not
depending on h such that for an arbitrary grid h on [a, b], and for two arbitrary grid functions v, w ∈
Γ_h[a, b], there holds

‖v − w‖_∞ ≤ K (‖v_0 − w_0‖ + ‖R_h v − R_h w‖_∞),  v, w ∈ Γ_h[a, b],    (9.6.11)

for all h with |h| sufficiently small. In (9.6.11) the norm is defined by (9.6.4).

We refer to (9.6.11) as the stability inequality. The motivation for it is as follows. Suppose we have
two grid functions u, w satisfying

R_h u = 0,  u_0 = y_0    (9.6.12)
R_h w = ε,  w_0 = y_0 + η_0,    (9.6.13)

where ε = {ε_n} ∈ Γ_h[a, b] is a grid function with small ‖ε_n‖, and ‖η_0‖ is also small. We may
interpret u ∈ Γ_h[a, b] as the result of applying the numerical scheme (9.6.5) in infinite precision,
whereas w ∈ Γ_h[a, b] could be the solution of (9.6.5) in floating-point arithmetic. Then, if stability
holds, we have

‖u − w‖_∞ ≤ K(‖η_0‖ + ‖ε‖_∞),    (9.6.14)

that is, the global change in u is of the same order of magnitude as the local residual errors {ε_n} and
the initial error η_0. It should be appreciated, however, that the first equation in (9.6.13) says

w_{n+1} − w_n − h_n Φ(x_n, w_n; h_n) = h_n ε_n,

meaning that the rounding errors must go to zero as |h| → 0.


Interestingly enough, a Lipschitz condition on Φ is all that is required for stability.

Theorem 9.6.2. If Φ(x, y; h) satisfies a Lipschitz condition with respect to the y-variables,

‖Φ(x, y; h) − Φ(x, y*; h)‖ ≤ M ‖y − y*‖  on [a, b] × R^d × [0, h_0],    (9.6.15)

then the method (9.6.5) is stable.

For the proof we need the following lemma.

Lemma 9.6.3. Let {e_n} be a sequence of numbers e_n ∈ R satisfying

e_{n+1} ≤ a_n e_n + b_n,  n = 0, 1, . . . , N − 1,    (9.6.16)

where a_n > 0 and b_n ∈ R. Then

e_n ≤ E_n,  E_n = (Π_{k=0}^{n−1} a_k) e_0 + Σ_{k=0}^{n−1} (Π_{l=k+1}^{n−1} a_l) b_k,  n = 0, 1, . . . , N.    (9.6.17)

We adopt here the usual convention that an empty product has the value 1 and an empty sum has
the value 0.

Proof of lemma 9.6.3. It is readily verified that

En+1 = an En + bn , n = 0, 1, . . . , N − 1, E 0 = e0 .

Subtracting this from (9.6.16), we get

en+1 − En+1 ≤ an (en − En ), n = 0, 1, . . . , N − 1.

Now, e_0 − E_0 = 0, so that e_1 − E_1 ≤ 0, since a_0 > 0. By induction, more generally, e_n − E_n ≤ 0,
since a_{n−1} > 0. □

Proof of Theorem 9.6.2. Let h = {h_n} be an arbitrary grid on [a, b] and v, w ∈ Γ_h[a, b] two arbitrary
(vector-valued) grid functions. By the definition of R_h, we can write

v_{n+1} = v_n + h_n Φ(x_n, v_n; h_n) + h_n (R_h v)_n,  n = 0, 1, . . . , N − 1,

and similarly for w_{n+1}. Subtracting then gives

v_{n+1} − w_{n+1} = v_n − w_n + h_n [Φ(x_n, v_n; h_n) − Φ(x_n, w_n; h_n)]
  + h_n [(R_h v)_n − (R_h w)_n],  n = 0, 1, . . . , N − 1.    (9.6.18)

Define now

e_n = ‖v_n − w_n‖,  d_n = ‖(R_h v)_n − (R_h w)_n‖,  δ = max_n d_n.    (9.6.19)

Then, using the triangle inequality in (9.6.18) and the Lipschitz condition (9.6.15) for Φ, we obtain

e_{n+1} ≤ (1 + h_n M) e_n + h_n δ,  n = 0, 1, . . . , N − 1.    (9.6.20)

This is inequality (9.6.16) with a_n = 1 + h_n M, b_n = h_n δ. Since for k = 0, 1, . . . , n − 1, n ≤ N we
have

Π_{ℓ=k+1}^{n−1} a_ℓ ≤ Π_{ℓ=0}^{N−1} a_ℓ = Π_{ℓ=0}^{N−1} (1 + h_ℓ M) ≤ Π_{ℓ=0}^{N−1} e^{h_ℓ M}
  = e^{(h_0 + h_1 + · · · + h_{N−1}) M} = e^{(b−a)M},

where the classical inequality 1 + x ≤ e^x has been used in the second inequality, we obtain from Lemma
9.6.3 that

e_n ≤ e^{(b−a)M} e_0 + e^{(b−a)M} Σ_{k=0}^{n−1} h_k δ ≤ e^{(b−a)M} (e_0 + (b − a) δ),  n = 0, 1, . . . , N.

Therefore,

‖e‖_∞ = ‖v − w‖_∞ ≤ e^{(b−a)M} (‖v_0 − w_0‖ + (b − a) ‖R_h v − R_h w‖_∞),

which is (9.6.11) with K = e^{(b−a)M} max{1, b − a}. □


We have actually proved stability for all |h| ≤ h_0, not only for |h| sufficiently small.
All one-step methods used in practice satisfy a Lipschitz condition if f does, and the constant M
for Φ can be expressed in terms of the Lipschitz constant L for f. This is obvious for Euler's method,
and not difficult to prove for the others. It is useful to note that Φ need not be continuous in x;
piecewise continuity suffices, as long as (9.6.15) holds for all x ∈ [a, b], taking one-sided limits at points
of discontinuity.
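For example, for the modified Euler method (9.4.15), applying the Lipschitz condition on f twice gives

‖Φ(x, y; h) − Φ(x, y*; h)‖ ≤ L ‖(y − y*) + (1/2)h [f(x, y) − f(x, y*)]‖ ≤ L (1 + (1/2) h_0 L) ‖y − y*‖,

so that (9.6.15) holds with M = L + (1/2) h_0 L^2.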
The following application of Lemma 9.6.3, relative to a grid function v ∈ Γ_h[a, b] satisfying

v_{n+1} = v_n + h_n (A_n v_n + b_n),  n = 0, 1, . . . , N − 1,    (9.6.21)

where A_n ∈ R^{d×d}, b_n ∈ R^d, and {h_n} is an arbitrary grid on [a, b], is also useful.

Lemma 9.6.4. Suppose in (9.6.21) that

‖A_n‖ ≤ M,  ‖b_n‖ ≤ δ,  n = 0, 1, . . . , N − 1,    (9.6.22)

where the constants M, δ do not depend on h. Then there exists a constant K > 0 independent of h,
but depending on ‖v_0‖, such that

‖v‖_∞ ≤ K.    (9.6.23)

Proof. The lemma follows by observing that

‖v_{n+1}‖ ≤ (1 + h_n M) ‖v_n‖ + h_n δ,  n = 0, 1, . . . , N − 1,

which is precisely the inequality (9.6.20) in the proof of Theorem 9.6.2, hence

‖v_n‖ ≤ e^{(b−a)M} {‖v_0‖ + (b − a) δ}.    (9.6.24)

9.6.2 Convergence

Stability is a powerful concept. It implies convergence almost immediately, and it is also instrumental
in deriving asymptotic global error estimates. We begin by defining precisely what we mean by
convergence.

Definition 9.6.5. Let a = x_0 < x_1 < x_2 < · · · < x_N = b be a grid on [a, b] with grid fineness
|h| = max_{1≤n≤N} (x_n − x_{n−1}). Let u = {u_n} be the grid function defined by applying the method
(9.6.5) on [a, b] and y = {y_n} the grid function induced by the exact solution of the initial value problem
(9.1.1). The method (9.6.5) is said to converge on [a, b] if

‖u − y‖_∞ → 0  as |h| → 0.    (9.6.25)

Theorem 9.6.6. If the method (9.6.5) is consistent and stable on [a, b], then it converges. Moreover, if
Φ has order p, then

‖u − y‖_∞ = O(|h|^p)  as |h| → 0.    (9.6.26)

Proof. By the stability inequality (9.6.11) applied to the grid functions v = u and w = y of Definition
9.6.5, we have for |h| sufficiently small

‖u − y‖_∞ ≤ K(‖u_0 − y(x_0)‖ + ‖R_h u − R_h y‖_∞) = K ‖R_h y‖_∞,    (9.6.27)

since u_0 = y(x_0) and R_h u = 0 by (9.6.5). But, by (9.6.10),

‖R_h y‖_∞ = ‖T(·, y; h)‖_∞,    (9.6.28)

where T is the truncation error of the method Φ. By the definition of consistency,

‖T(·, y; h)‖_∞ → 0  as |h| → 0,

which proves the first part of the theorem. The second part follows immediately from (9.6.27) and
(9.6.28), since order p means, by definition, that

‖T(·, y; h)‖_∞ = O(|h|^p)  as |h| → 0.    (9.6.29)
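The order statement (9.6.26) is easy to check experimentally. The following sketch (our own test script;
it assumes the functions RK4 of MATLAB Source 9.1 and edex1 of Example 9.5.2 are available) halves
the step repeatedly and prints the observed order, which should approach 4:

% observed convergence order of RK4 on y' = -y + t + 1, y(0) = 1
errold = NaN;                              %first order entry will be NaN
for N = [10, 20, 40, 80, 160]
    [t,w] = RK4(@edex1,[0,1],1,N);
    err = norm(w-(exp(-t)+t),inf);         %global error on the grid
    fprintf('N = %4d  err = %.3e  order = %.2f\n',N,err,log2(errold/err));
    errold = err;
end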

9.6.3 Asymptotics of global error

Since the principal error function describes the leading contribution to the local truncation error, it is
of interest to identify the leading term in the global error u_n − y(x_n). To simplify matters, we assume
a constant grid length h, although it is not difficult to deal with a variable grid length of the form
h_n = ϑ(x_n) h, where ϑ(x) is piecewise continuous and 0 < ϑ(x) < θ for a ≤ x ≤ b. Thus, we
consider our one-step method to have the form

x_{n+1} = x_n + h
u_{n+1} = u_n + h Φ(x_n, u_n; h),  n = 0, 1, . . . , N − 1    (9.6.30)
x_0 = a,  u_0 = y_0,

defining a grid function u = {u_n} on a uniform grid on [a, b]. We are interested in the asymptotic
behavior of u_n − y(x_n) as h → 0, where y(x) is the exact solution of the initial value problem

dy/dx = f(x, y),  x ∈ [a, b],
y(a) = y_0.    (9.6.31)
Theorem 9.6.7. Assume that

(1) Φ(x, y; h) ∈ C^2([a, b] × R^d × [0, h_0]);
(2) Φ is a method of order p ≥ 1 admitting a principal error function τ(x, y) continuous on [a, b] × R^d;
(3) e(x) is the solution of the linear initial value problem

de/dx = f_y(x, y(x)) e + τ(x, y(x)),  a ≤ x ≤ b,    (9.6.32)
e(a) = 0.

Then, for n = 0, N,

u_n − y(x_n) = e(x_n) h^p + O(h^{p+1})  as h → 0.    (9.6.33)

Before we prove the theorem, we make the following remarks:

1. The precise meaning of (9.6.33) is

‖u − y − h^p e‖_∞ = O(h^{p+1}),

where u, y, e are the grid functions u = {u_n}, y = {y(x_n)} and e = {e(x_n)}.

2. Since by consistency Φ(x, y; 0) = f (x, y), assumption (1) implies f is of class C 2 on ([a, b] ×
Rd ), which is more than enough to guarantee the existence and uniqueness of the solution e(x)
of (9.6.32) on the whole interval [a, b].
3. The fact that some, but not all, components of τ (x, y) may vanish identically does not imply
that the corresponding components of e(x) also vanish, since (9.6.32) is a coupled system of
differential equations.

Proof of Theorem 9.6.7. We begin with an auxiliary computation, an estimate for

Φ(x_n, u_n; h) − Φ(x_n, y(x_n); h).    (9.6.34)

By Taylor's theorem (for functions of several variables), applied to the ith component of (9.6.34), we have

Φ^i(x_n, u_n; h) − Φ^i(x_n, y(x_n); h) = Σ_{j=1}^{d} Φ^i_{y^j}(x_n, y(x_n); h) [u^j_n − y^j(x_n)]
  + (1/2) Σ_{j,k=1}^{d} Φ^i_{y^j y^k}(x_n, ū_n; h) [u^j_n − y^j(x_n)][u^k_n − y^k(x_n)],    (9.6.35)

where ū_n is on the line segment connecting u_n and y(x_n). Using Taylor's theorem once more, in the
variable h, we can write

Φ^i_{y^j}(x_n, y(x_n); h) = Φ^i_{y^j}(x_n, y(x_n); 0) + h Φ^i_{y^j h}(x_n, y(x_n); h̄),

where 0 < h̄ < h. Since, by consistency, Φ(x, y; 0) ≡ f(x, y) on [a, b] × R^d, we have

Φ^i_{y^j}(x, y; 0) = f^i_{y^j}(x, y),  x ∈ [a, b], y ∈ R^d,

and assumption (1) allows us to write

Φ^i_{y^j}(x_n, y(x_n); h) = f^i_{y^j}(x_n, y(x_n)) + O(h),  h → 0.    (9.6.36)

Now observing that u_n − y(x_n) = O(h^p), by virtue of Theorem 9.6.6, and using (9.6.36) in (9.6.35),
we get, again by assumption (1),

Φ^i(x_n, u_n; h) − Φ^i(x_n, y(x_n); h) = Σ_{j=1}^{d} f^i_{y^j}(x_n, y(x_n)) [u^j_n − y^j(x_n)]
  + O(h^{p+1}) + O(h^{2p}).

But O(h^{2p}) is also of order O(h^{p+1}), since p ≥ 1. Thus, in vector notation,

Φ(x_n, u_n; h) − Φ(x_n, y(x_n); h) = f_y(x_n, y(x_n)) [u_n − y(x_n)] + O(h^{p+1}).    (9.6.37)

Now, to highlight the leading term in the global error, we define the grid function r = {r_n} by

r = h^{−p} (u − y).    (9.6.38)

Then

(1/h)(r_{n+1} − r_n) = (1/h) [h^{−p}(u_{n+1} − y(x_{n+1})) − h^{−p}(u_n − y(x_n))]
  = h^{−p} [(1/h)(u_{n+1} − u_n) − (1/h)(y(x_{n+1}) − y(x_n))]
  = h^{−p} {Φ(x_n, u_n; h) − [Φ(x_n, y(x_n); h) − T(x_n, y(x_n); h)]},

where we have used (9.6.30) and the relation (9.6.10) for the truncation error T. Therefore, expressing
T in terms of the principal error function τ, we get

(1/h)(r_{n+1} − r_n) = h^{−p} {Φ(x_n, u_n; h) − Φ(x_n, y(x_n); h) + τ(x_n, y(x_n)) h^p + O(h^{p+1})}.

For the first two terms in braces we use (9.6.37) and the definition of r in (9.6.38) to obtain

(1/h)(r_{n+1} − r_n) = f_y(x_n, y(x_n)) r_n + τ(x_n, y(x_n)) + O(h),  n = 0, N − 1,    (9.6.39)
r_0 = 0.

Now letting

g(x, y) := f_y(x, y(x)) y + τ(x, y(x)),    (9.6.40)

we can interpret (9.6.39) by writing

(R_h^{Euler,g} r)_n = ε_n,  n = 0, N − 1,  ε_n = O(h),

where R_h^{Euler,g} is the discrete residual operator (9.6.7) that goes with Euler's method applied to
e′ = g(x, e), e(a) = 0. Since Euler's method is stable on [a, b] and g, being linear in y, satisfies a uniform
Lipschitz condition, we have by the stability inequality (9.6.11)

‖r − e‖_∞ = O(h),

and hence, by (9.6.38),

‖u − y − h^p e‖_∞ = O(h^{p+1}),

as was to be shown. □

9.7 Error Monitoring and Step Control


Most production codes currently available for solving ODEs monitor local truncation errors and control
the step length on the basis of estimates for these errors. Here we attempt to monitor global error, at
least asymptotically, by implementing the asymptotic result of Theorem 9.6.7. This necessitates the
evaluation of the Jacobian matrix fy (x, y) along or near the solution trajectory; but this is only natural,
since fy , in a first approximation, governs the effect of perturbations via the variational differential
equation (9.6.32). This equation is driven by the principal error function evaluated along the trajectory,
so that estimates of local truncation errors (more precisely, of the principal error function) are needed
also in this approach. For simplicity we again assume constant grid length.

9.7.1 Estimation of global error

The idea of our estimation is to integrate the "variational equation" (9.6.32) along with the main
equation (9.6.31). Since we need e(x) in (9.6.33) only to within an accuracy of O(h) (any O(h) error term
in e(x_n), multiplied by h^p, being absorbed by the O(h^{p+1}) term), we can use Euler's method for that
purpose, which will provide the desired approximation v_n ≈ e(x_n).

Theorem 9.7.1. Assume that

(1) Φ(x, y; h) ∈ C^2([a, b] × R^d × [0, h_0]);
(2) Φ is a method of order p ≥ 1 admitting a principal error function τ(x, y) of class C^1([a, b] × R^d);
(3) an estimate r(x, y; h) is available for the principal error function that satisfies

r(x, y; h) = τ(x, y) + O(h),  h → 0,    (9.7.1)

uniformly on [a, b] × R^d;
(4) along with the grid function u = {u_n} we generate the grid function v = {v_n} in the following
manner:

x_{n+1} = x_n + h,
u_{n+1} = u_n + h Φ(x_n, u_n; h),
v_{n+1} = v_n + h [f_y(x_n, u_n) v_n + r(x_n, u_n; h)],    (9.7.2)
x_0 = a,  u_0 = y_0,  v_0 = 0.

Then, for n = 0, N − 1,

u_n − y(x_n) = v_n h^p + O(h^{p+1})  as h → 0.    (9.7.3)

Proof. The proof begins by establishing the estimates

f_y(x_n, u_n) = f_y(x_n, y(x_n)) + O(h),    (9.7.4)
r(x_n, u_n; h) = τ(x_n, y(x_n)) + O(h).    (9.7.5)

From assumption (1) we note, by consistency f(x, y) = Φ(x, y; 0), that f(x, y) is in C^2([a, b] × R^d).
Taking into account Theorem 9.6.6, we have u_n = y(x_n) + O(h^p), and therefore

f_y(x_n, u_n) = f_y(x_n, y(x_n)) + O(h^p),

which implies (9.7.4), since p ≥ 1. Next, since τ(x, y) ∈ C^1([a, b] × R^d) by assumption (2), we have

τ(x_n, u_n) = τ(x_n, y(x_n)) + τ_y(x_n, ū_n)(u_n − y(x_n))
            = τ(x_n, y(x_n)) + O(h^p),

so that by assumption (3),

r(x_n, u_n; h) = τ(x_n, u_n) + O(h) = τ(x_n, y(x_n)) + O(h^p) + O(h),

which implies (9.7.5) immediately.
Let (cf. (9.6.40))

g(x, y) = f_y(x, y(x)) y + τ(x, y(x)).    (9.7.6)

The equation for v_{n+1} in (9.7.2) has the form

v_{n+1} = v_n + h(A_n v_n + b_n),

where the A_n are bounded matrices and the b_n bounded vectors. By Lemma 9.6.4 we have boundedness
of v_n,

v_n = O(1),  h → 0.    (9.7.7)

Substituting (9.7.4) and (9.7.5) into the equation for v_{n+1} and noting (9.7.7), we obtain

v_{n+1} = v_n + h [f_y(x_n, y(x_n)) v_n + τ(x_n, y(x_n)) + O(h)]
        = v_n + h g(x_n, v_n) + O(h^2).

Thus, in the notation used in the proof of Theorem 9.6.7,

(R_h^{Euler,g} v)_n = O(h),  v_0 = 0.

Since Euler's method is stable, we conclude

v_n − e(x_n) = O(h),

where e(x) is, as before, the solution of

e′ = g(x, e),
e(a) = 0.

Therefore, by (9.6.33), and since e(x_n) h^p = v_n h^p + O(h^{p+1}),

u_n − y(x_n) = e(x_n) h^p + O(h^{p+1}) = v_n h^p + O(h^{p+1}). □


9.7.2 Truncation error estimates

In order to apply Theorem 9.7.1 we need estimates r(x, y; h) of the principal error function τ(x, y)
that are accurate to O(h). We shall describe two of them, in increasing order of efficiency.

Local Richardson extrapolation to zero

This works for any one-step method Φ, but is usually considered too expensive. If Φ has order p,
the procedure is as follows:

y_h = y + h Φ(x, y; h),
y_{h/2} = y + (1/2) h Φ(x, y; (1/2)h),
y*_h = y_{h/2} + (1/2) h Φ(x + (1/2)h, y_{h/2}; (1/2)h),    (9.7.8)
r(x, y; h) = (1/(1 − 2^{−p})) (y_h − y*_h)/h^{p+1}.

Note that y*_h is the result of applying Φ over two consecutive steps of length h/2 each, whereas y_h is
the result of one application over the whole step length h.
We now verify that r(x, y; h) in (9.7.8) is an acceptable error estimator. To do this, we need to
assume that τ(x, y) ∈ C^1([a, b] × R^d). In terms of the reference solution u(t) through (x, y) we have
(cf. (9.3.4) and (9.3.8))

Φ(x, y; h) = (1/h)[u(x + h) − u(x)] + τ(x, y) h^p + O(h^{p+1}).    (9.7.9)

Furthermore,

(1/h)(y_h − y*_h) = (1/h)(y_h − y_{h/2}) − (1/2) Φ(x + (1/2)h, y_{h/2}; (1/2)h)
  = Φ(x, y; h) − (1/2) Φ(x, y; (1/2)h) − (1/2) Φ(x + (1/2)h, y_{h/2}; (1/2)h).

Applying (9.7.9) to each of the three terms on the right, we find

(1/h)(y_h − y*_h) = (1/h)[u(x + h) − u(x)] + τ(x, y) h^p + O(h^{p+1})
  − (1/2) { (2/h)[u(x + (1/2)h) − u(x)] + τ(x, y) ((1/2)h)^p + O(h^{p+1}) }
  − (1/2) { (2/h)[u(x + h) − u(x + (1/2)h)] + [τ(x + (1/2)h, y_{h/2}) + O(h)] ((1/2)h)^p }
  + O(h^{p+1}) = τ(x, y)(1 − 2^{−p}) h^p + O(h^{p+1}).

Consequently,

(1/(1 − 2^{−p})) (1/h)(y_h − y*_h) = τ(x, y) h^p + O(h^{p+1}),    (9.7.10)

as required.
Subtracting (9.7.10) from (9.7.9) shows, incidentally, that

Φ*(x, y; h) := Φ(x, y; h) − (1/(1 − 2^{−p})) (1/h)(y_h − y*_h)    (9.7.11)

defines a one-step method of order p + 1.
defines a one-step method of order p + 1.
Procedure (9.7.8) is rather expensive. For a fourth-order Runge-Kutta process, it requires a total
of 11 evaluations of f per step, almost three times the effort of a single Runge-Kutta step. Therefore,
Richardson extrapolation is normally used only after two steps of Φ; that is, one proceeds according to

y_h = y + h Φ(x, y; h),
y_{2h} = y_h + h Φ(x + h, y_h; h),    (9.7.12)
y*_{2h} = y + 2h Φ(x, y; 2h).

Then (9.7.10) gives

(1/(2(2^p − 1))) (1/h^{p+1}) (y_{2h} − y*_{2h}) = τ(x, y) + O(h),    (9.7.13)

so that the expression on the left is an acceptable estimator r(x, y; h). If the two steps in (9.7.12) yield
acceptable accuracy (cf. §9.7.3), then, again for a fourth-order Runge-Kutta process, the procedure
requires only three additional evaluations of f, since y_h and y_{2h} would have to be computed anyhow.
There are still more efficient schemes, as we shall see.
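In MATLAB, the two-step estimator (9.7.12)-(9.7.13) is a few lines; the following fragment is a sketch
only, assuming x, y, h and an increment function handle Phi(x,y,h) of order p are already defined (for
RK4, p = 4):

yh   = y + h*Phi(x,y,h);                    %first step of length h
y2h  = yh + h*Phi(x+h,yh,h);                %second step of length h
y2hs = y + 2*h*Phi(x,y,2*h);                %one double step of length 2h
r = (y2h-y2hs)/(2*(2^p-1)*h^(p+1));         %r = tau(x,y) + O(h), cf. (9.7.13)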

Embedded methods

The basic idea of this approach is very simple: if the given method Φ has order p, take any other one-
step method Φ* of order p* = p + 1 and define

r(x, y; h) = (1/h^p) [Φ(x, y; h) − Φ*(x, y; h)].    (9.7.14)

This is indeed an acceptable estimator, as follows by subtracting the two relations

Φ(x, y; h) − (1/h)[u(x + h) − u(x)] = τ(x, y) h^p + O(h^{p+1}),
Φ*(x, y; h) − (1/h)[u(x + h) − u(x)] = O(h^{p+1}),

and dividing the result by h^p.

The tricky part is to make this procedure efficient. Following an idea of Fehlberg, one can try to do
this by embedding one Runge-Kutta process (of order p) into another (of order p + 1). Specifically, let
Φ be some explicit r-stage Runge-Kutta method,

K_1(x, y) = f(x, y),
K_s(x, y; h) = f(x + µ_s h, y + h Σ_{j=1}^{s−1} λ_{sj} K_j),  s = 2, 3, . . . , r,
Φ(x, y; h) = Σ_{s=1}^{r} α_s K_s.

Then for Φ* choose a similar r*-stage process, with r* > r, in such a way that

µ*_s = µ_s,  λ*_sj = λ_sj,  for s = 2, 3, . . . , r.

The estimate (9.7.14) then costs only r* − r extra evaluations of f. If r* = r + 1 one might even
attempt to save the additional evaluation by selecting (if possible)

µ*_{r*} = 1,  λ*_{r*,j} = α_j  for j = 1, r* − 1  (r* = r + 1).    (9.7.15)

Then, indeed, K*_{r*} will be identical with K_1 for the next step.
Pairs of such embedded (p, p + 1) Runge-Kutta formulae have been developed in the late 1960’s
by E. Fehlberg. There is a considerable degree of freedom in choosing the parameters. Fehlberg’s
choices were guided by an attempt to reduce the magnitude of the coefficients of all the partial derivative
aggregates that enter into the principal error function τ (x, y) of Φ. He succeeded in obtaining pairs with
the following values of parameters p, r, r ∗ , given in Table 9.2.

p    3   4   5   6   7    8
r    4   5   6   8   11   15
r*   5   6   8   10  13   17

Table 9.2: Embedded Runge-Kutta formulae

For the third-order process (and only for that one) one can choose the parameters for (9.7.15) to
hold.

9.7.3 Step control


Any estimate r(x, y; h) of the principal error function τ (x, y) implies an estimate

hp r(x, y; h) = T (x, y; h) + O(hp+1 ) (9.7.16)

for the truncation error, which can be used to monitor the local truncation error during the integration
process. However, one has to keep in mind that the local truncation error is quite different from the
global error, which is what one really wants to control. To get more insight into the relationship between
these two errors, we recall the following theorem, which quantifies the continuous dependence of the
solution of an initial value problem on the initial values.
9.7. Error Monitoring and Step Control 363

Theorem 9.7.2. Let f(x, y) be continuous in x ∈ [a, b] and satisfy a Lipschitz condition uniformly on
[a, b] × R^d, with Lipschitz constant L, that is,

‖f(x, y) − f(x, y*)‖ ≤ L ‖y − y*‖.

Then the initial value problem

dy/dx = f(x, y),  x ∈ [a, b],
y(c) = y_c    (9.7.17)

has a unique solution on [a, b] for any c ∈ [a, b] and for any y_c ∈ R^d. Let y(x; s) and y(x; s*) be the
solutions of (9.7.17) corresponding to y_c = s and y_c = s*, respectively. Then for any vector norm ‖·‖,

‖y(x; s) − y(x; s*)‖ ≤ e^{L|x−c|} ‖s − s*‖.    (9.7.18)

"Solving the given initial value problem (9.6.31) numerically by a one-step method (not necessarily
with constant step) means in reality that one follows a sequence of 'solution tracks', whereby at each
grid point x_n one jumps from one track to the next by an amount determined by the truncation error
at x_n" [33] (see Figure 9.3). This results from the definition of the truncation error, the reference solution
being one of the solution tracks. Specifically, the nth track, n = 0, N, is given by the solution of the
initial value problem

dv_n/dx = f(x, v_n),  x ∈ [x_n, b],
v_n(x_n) = u_n,    (9.7.19)

and

u_{n+1} = v_n(x_{n+1}) + h_n T(x_n, u_n; h_n),  n = 0, N − 1.    (9.7.20)

Since by (9.7.19) we have u_{n+1} = v_{n+1}(x_{n+1}), we can apply Theorem 9.7.2 to the solutions v_{n+1}
and v_n, letting c = x_{n+1}, s = u_{n+1}, s* = u_{n+1} − h_n T(x_n, u_n; h_n) (by (9.7.20)), and thus obtain

‖v_{n+1}(x) − v_n(x)‖ ≤ h_n e^{L|x−x_{n+1}|} ‖T(x_n, u_n; h_n)‖,  n = 0, N − 1.    (9.7.21)

Now

Σ_{n=0}^{N−1} [v_{n+1}(x) − v_n(x)] = v_N(x) − v_0(x) = v_N(x) − y(x),    (9.7.22)

and since v_N(x_N) = u_N, letting x = x_N, we get from (9.7.21) and (9.7.22) that

‖u_N − y(x_N)‖ ≤ Σ_{n=0}^{N−1} ‖v_{n+1}(x_N) − v_n(x_N)‖
  ≤ Σ_{n=0}^{N−1} h_n e^{L|x_N − x_{n+1}|} ‖T(x_n, u_n; h_n)‖.

Therefore, if we make sure that

‖T(x_n, u_n; h_n)‖ ≤ ε_T,  n = 0, N − 1,    (9.7.23)

then

‖u_N − y(x_N)‖ ≤ ε_T Σ_{n=0}^{N−1} (x_{n+1} − x_n) e^{L|x_N − x_{n+1}|}.

Figure 9.3: Error accumulation in a one-step method

Interpreting the sum on the right as a Riemann sum for a definite integral, we finally obtain,
approximately,

‖u_N − y(x_N)‖ ≤ ε_T ∫_a^b e^{L(b−x)} dx = (ε_T/L) (e^{L(b−a)} − 1).

Thus, knowing an estimate for L would allow us to set an appropriate ε_T, namely

ε_T = (L/(e^{L(b−a)} − 1)) ε,    (9.7.24)
to guarantee an error ‖u_N − y(x_N)‖ ≤ ε. What holds for the whole grid on [a, b], of course, holds
for any grid on a subinterval [a, x], a ≤ x ≤ b. So, in principle, given the desired accuracy ε for
the solution y(x), we can determine a "local tolerance level" ε_T (cf. (9.7.24)) and achieve the desired
accuracy by keeping the local truncation error below ε_T (cf. (9.7.23)). Note that as L → 0, we have
ε_T → ε/(b − a). This limit value of ε_T would be appropriate for a quadrature problem, but definitely
not for a true differential equation problem, where ε_T, in general, has to be chosen considerably smaller
than the target error tolerance ε.
Considerations such as these motivate the following step control mechanism: each integration step
(from x_n to x_{n+1} = x_n + h_n) consists of three parts:

1. Estimate h_n.
2. Compute u_{n+1} = u_n + h_n Φ(x_n, u_n; h_n) and r(x_n, u_n; h_n).
3. Test h_n^p ‖r(x_n, u_n; h_n)‖ ≤ ε_T (cf. (9.7.16) and (9.7.23)). If the test passes, proceed with the
next step; if not, repeat the step with a smaller h_n, say, half as large, until the test passes.

To estimate h_n, assume first that n ≥ 1, so that the estimator from the previous step,
r(x_{n−1}, u_{n−1}; h_{n−1}) (or at least its norm), is available. Then, neglecting terms of O(h),

‖τ(x_{n−1}, u_{n−1})‖ ≈ ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖,

and since τ(x_n, u_n) ≈ τ(x_{n−1}, u_{n−1}), likewise

‖τ(x_n, u_n)‖ ≈ ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖.

What we want is

‖τ(x_n, u_n)‖ h_n^p ≈ θ ε_T,

where θ is a "safety factor", say, θ = 0.8. Eliminating τ(x_n, u_n), we find

h_n ≈ (θ ε_T / ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖)^{1/p}.

Note that from the previous step we have

h_{n−1}^p ‖r(x_{n−1}, u_{n−1}; h_{n−1})‖ ≤ ε_T,

so that

h_n ≥ θ^{1/p} h_{n−1},

and the tendency is to increase the step.
If n = 0, we proceed similarly, using some initial guess h_0^{(0)} for h_0 and the associated
r(x_0, y_0; h_0^{(0)}) to obtain

h_0^{(1)} = (θ ε_T / ‖r(x_0, y_0; h_0^{(0)})‖)^{1/p}.

The process may be repeated once or twice to get the final estimates of both quantities h_0 and r(x_0, y_0; h_0).
For a synthetic description of variable-step Runge-Kutta methods, the Butcher table is completed by a
supplementary line, used for the computation of Φ* (and thus of r(x, y; h)):

µ_1 | λ_11  λ_12  ...  λ_1r
µ_2 | λ_21  λ_22  ...  λ_2r
 .. |  ..    ..   ...   ..
µ_r | λ_r1  λ_r2  ...  λ_rr
    | α_1   α_2   ...  α_r
    | α*_1  α*_2  ...  α*_r  α*_{r+1}

As an example, Table 9.3 is the Butcher table for a 2-3 method. For the derivation of this table see
[91, pages 451–452].
Table 9.4 is the Butcher table for the Bogacki-Shampine method [8]. It is the basis for the MATLAB
ode23 solver.
Another important example is DOPRI5 or RK5(4)7FM, a pair of orders 4-5 with 7 stages (Table 9.5).
This is a very efficient pair; it is the basis for the MATLAB ode45 solver and for other important solvers.
We shall give, in the sequel, an implementation example for a variable-step Runge-Kutta method.
Following the ideas of [25], we implemented a more general MATLAB function, oderk, that uses a
Butcher table. The error handling and general framework are inspired by the MATLAB functions ode23
and ode45, but also by the ode23tx function in [66].

 µ_j  |  λ_ij
  0   |
 1/4  |  1/4
27/40 |  −189/800   729/800
  1   |  214/891    1/33       650/891
------+---------------------------------------------
 α_i  |  214/891    1/33       650/891    0
 α*_i |  533/2106   0          800/1053   −1/78

Table 9.3: A 2-3 pair

 µ_j  |  λ_ij
  0   |
 1/2  |  1/2
 3/4  |  0      3/4
  1   |  2/9    3/9    4/9
------+------------------------------
 α_i  |  2/9    3/9    4/9    0
 α*_i |  7/24   1/4    1/3    1/8

Table 9.4: The Butcher table for the Bogacki-Shampine method

 µ_j  |  λ_ij
  0   |
 1/5  |  1/5
 3/10 |  3/40         9/40
 4/5  |  44/45        −56/15        32/9
 8/9  |  19372/6561   −25360/2187   64448/6561   −212/729
  1   |  9017/3168    −355/33       46732/5247   49/176        −5103/18656
  1   |  35/384       0             500/1113     125/192       −2187/6784     11/84
------+------------------------------------------------------------------------------------
 α_i  |  35/384       0             500/1113     125/192       −2187/6784     11/84      0
 α*_i |  5179/57600   0             7571/16695   393/640       −92097/339200  187/2100   1/40

Table 9.5: RK5(4)7FM (DOPRI5) embedded pair



The first argument of oderk specifies the right-hand side function, f(t, y). It can be a function
handle, a character string or an inline function. The second argument is a two-component vector,
tspan, containing the initial and the final value, t0 and tfinal, respectively. It gives the integration
interval. The third argument, y0, provides the starting values y0 = y(t0). The length of y0 gives the
number of differential equations in the system. The fourth argument is a function handle that indicates
a function for the initialization of the Butcher table. If missing, the default method is the Bogacki-
Shampine method (Table 9.4, function BS23). The fifth argument, opts, contains the options of the
solver. We can initialize it using the MATLAB function odeset. The oderk function takes into account
only the following options: RelTol (relative error, default 1e-3), AbsTol (absolute error, default 1e-6),
OutputFcn (the output function, default odeplot), and Stats (with value on or off, specifying
whether one desires statistics). The statement

opts=odeset('RelTol', 1e-5, 'AbsTol', 1e-8, 'OutputFcn',...
            @myodeplot)

sets the relative error to 10^−5, the absolute error to 10^−8, and myodeplot as the output function.
The oderk output can be either numeric or graphical. Without any output argument, oderk
produces a graph of all solution components. With two output arguments, the statement

[tout,yout] = oderk(F,tspan,y0)

yields tables of abscissas and ordinates of the solution.
Let us examine the code of this function.

function [tout,yout] = oderk(F,tspan,y0,BT,opts,varargin)
%ODERK nonstiff ODE solver
% ODERK uses two embedded methods given by Butcher table
%
% ODERK(F,TSPAN,Y0) with TSPAN = [T0 TFINAL] integrates
% the system of differential equations y’ = f(t,y)
% from t=T0 to t=TFINAL. Initial condition is y(T0)=Y0.
% F is an M-file name, an inline function or a
% character string defining f(t,y).
% This function must have two arguments, t and y and must
% return a column vector of derivatives, yprime.
%
% With two output arguments, [T,Y] = ODERK(...) return
% a column vector T and an array Y, where Y(:,k) is the
% solution at point T(k).
%
% Without output arguments, ODERK plots the solution.
%
% ODERK(F,TSPAN,Y0,RTOL) uses the relative error RTOL,
% instead of default 1.e-3.
%
% ODERK(F,TSPAN,Y0,BT) uses a Butcher table, BT. If BT is
% empty or missing, one uses BS23 (Bogacki-Shampine)
%
% ODERK(F,TSPAN,Y0,BT,OPTS) where OPTS=ODESET(’reltol’,...
% RTOL,’abstol’,ATOL,’outputfcn’,@PLOTFUN) uses the
% relative error RTOL instead the default 1.e-3, the
368 Numerical Solution of Ordinary Differential Equations

% absolute error ATOL instead of the default 1.e-6 and


% call PLOTFUN instead of ODEPLOT after each
% successful step
%
% If the call has more than 5 input arguments,
% ODERK(F,TSPAN,Y0,BT,RTOL,P1,P2,...), the additional
% arguments are passed to F, F(T,Y,P1,P2,...).
%
% Stats set to ’on’ provides statistics
%
% Example
% tspan = [0 2*pi];
% y0 = [1 0]’;
% F = ’[0 1; -1 0]*y’;
% oderk(F,tspan,y0);

We start with variable initialization and option processing.


% Init variables

rtol = 1.e-3;
atol = 1.e-6;
plotfun = @odeplot;
statflag = 0;
if (nargin >= 4) & ~isempty(BT) %Butcher table
    [lambda,alfa,alfas,mu,s,oop,fsal]=BT();
else
    [lambda,alfa,alfas,mu,s,oop,fsal]=BS23();
end

if nargin >= 5 & isnumeric(opts)
    rtol = opts;
elseif nargin >= 5 & isstruct(opts)
    statflag=strcmp(opts.Stats,'on');
    if ~isempty(opts.RelTol), rtol = opts.RelTol; end
    if ~isempty(opts.AbsTol), atol = opts.AbsTol; end
    if ~isempty(opts.OutputFcn)
        plotfun = opts.OutputFcn;
    end
end
if statflag %statistics
    stat=struct('ns',0,'nrej',0,'nfunc',0);
end

t0 = tspan(1);
tfinal = tspan(2);

tdir = sign(tfinal - t0);


plotit = (nargout == 0);
threshold = atol / rtol;
hmax = abs(0.1*(tfinal-t0));
t = t0;
y = y0(:);

% Make F callable

if ischar(F) & exist(F)~=2
    F = inline(F,'t','y');
elseif isa(F,'sym')
    F = inline(char(F),'t','y');
end

% Init outputs

if plotit
    plotfun(tspan,y,'init');
else
    tout = t;
    yout = y.';
end

The computation of the step length is a delicate question, since it requires some knowledge about the
global scale of the problem.

% Compute initial stepsize

K=zeros(length(y0),s);
K(:,1)=F(t,y,varargin{:}); %first evaluation
if statflag, stat.nfunc=stat.nfunc+1; end
r = norm(K(:,1)./max(abs(y),threshold),inf) + realmin;
h = tdir*0.8*rtol^(oop)/r;

The main loop follows. The integration process starts at t = t0 and increments t until it reaches
tfinal. It is possible to go backward, if tfinal < t0.

% Main loop

while t ˜= tfinal

hmin = 16*eps*abs(t);
if abs(h) > hmax, h = tdir*hmax; end
if abs(h) < hmin, h = tdir*hmin; end

% correct final stepsize



if 1.1*abs(h) >= abs(tfinal - t)


h = tfinal - t;
end

Here is the actual computing. The first slope, K(:,1), is already computed. Now, s-1 slope evalua-
tions follow, where s is the number of stages.

% compute step attempt

for i=2:s
K(:,i)=F(t+mu(i)*h,y+h*K(:,1:i-1)*...
(lambda(i,1:i-1)’));
end
if statflag, stat.nfunc=stat.nfunc+s-1; end
tnew=t+h;
ynew=y+h*K*alfas;

Then one estimates the error. The error vector norm is scaled using the ratio of the absolute and relative
errors. Adding the smallest positive floating-point number, realmin, prevents err from being zero.

% Estimate error

e = h*K*(alfa-alfas);
err = norm(e./max(max(abs(y),abs(ynew)),threshold),...
inf) + realmin;

One tests whether the step is successful. If it is, the result is displayed or appended to the output arrays.
Otherwise, if statistics are required, the unsuccessful step is counted. If the method is of FSAL type
(First Same As Last), that is, the last stage of the previous step coincides with the first stage of the next
step, then the last function value is reused.

% Accept solution if estimated error < tolerance

if err <= rtol %accepted step
    t = tnew;
    y = ynew;
    if plotit
        if plotfun(t,y,'')
            break
        end
    else
        tout(end+1,1) = t;
        yout(end+1,:) = y.';
        if statflag
            stat.ns=stat.ns+1;
        end
    end
    if fsal % Reuse final value if required
        K(:,1)=K(:,s);
    else
        K(:,1)=F(t,y);
        if statflag, stat.nfunc=stat.nfunc+1; end
    end
else %rejected step
    if statflag, stat.nrej=stat.nrej+1; end
end

We use the error estimate to compute a new step size. The ratio rtol/err is greater than one if the
current step is successful and less than one if it fails. The safety factors 0.8 and 5 prevent excessive
step-length changes.

% Compute new step


h = h*min(5,0.8*(rtol/err)^(oop));

We can detect here the occurrence of a singularity.

% Exit if stepsize too small

if abs(h) <= hmin
    warning(sprintf('step size %e too small at t = %e.\n',h,t));
    t = tfinal;
end
end

The main loop finishes here. The plot function must end its work.

if plotit
    plotfun([],[],'done');
end
if statflag
    fprintf('%d successful steps\n',stat.ns)
    fprintf('%d failed attempts\n', stat.nrej)
    fprintf('%d function evaluations\n', stat.nfunc)
end

For applications of numerical solution of differential equations and other numerical methods in
mechanics see [52].

Solver    Problem type    Type of algorithm
ode45     Nonstiff        Explicit Runge-Kutta pair, orders 4 and 5
ode23     Nonstiff        Explicit Runge-Kutta pair, orders 2 and 3
ode113    Nonstiff        Explicit multistep method, variable order, orders 1 to 13
ode15s    Stiff           Implicit multistep method, variable order, orders 1 to 15
ode23s    Stiff           Modified Rosenbrock pair (one step), orders 2 and 3
ode23t    Stiff           Implicit trapezoidal rule, orders 2 and 3
ode23tb   Stiff           Implicit Runge-Kutta-type algorithm, orders 2 and 3
ode15i    Fully implicit  BDF

Table 9.6: MATLAB ODE solvers

9.8 ODEs in MATLAB

9.8.1 Solvers

MATLAB has powerful facilities for solving initial value problems for ordinary differential equations:

dy(t)/dt = f(t, y(t)),  y(t_0) = y_0.

The simplest way to solve such a problem is to code a function that evaluates f and then to call one
of the MATLAB’s ODE solvers. The minimal information to be provided to a solver is the function
name, the range of t values over which the solution is required and the initial value y0 . MATLAB’s
ODE solvers allow for extra (optional) input and output arguments that make it possible to specify
more about the mathematical problem and how it is to be solved. Each of MATLAB’s ODE solvers is
designed to be efficient in specific circumstances, but all are essentially interchangeable. All solvers
have the same syntax, and this fact allows us to try various methods when we do not know which is the
most adequate. The simplest syntax, common to all the solver functions is

[t,y]=solver(@fun,tspan,y0,options)

where solver is one of the ODE solver functions given in Table 9.6.

The basic input arguments are:



fun Handle to a function that evaluates the system of ODEs. The function has the form

dydt = odefun(t, y),

where t is a scalar, and dydt and y are column vectors.


tspan    Vector specifying the interval of integration. The solver imposes the initial con-
         ditions at tspan(1), and integrates from tspan(1) to tspan(end). If it has
         more than two elements, the solver returns the solution at those points. These
         elements (abscissas) must be in increasing or decreasing order. The solver does
         not choose its steps from the values in tspan, but rather computes continuous
         extensions (dense output) having the same order of accuracy as the solution at
         the points generated by the solver.
y0 Vector of initial conditions for the problem.
options Structure of optional parameters that change the default integration properties.

The output parameters are:

t Column vector of abscissas.


y Solution array. Each row in y corresponds to the solution at an abscissa returned in the
corresponding row of t.

9.8.2 Nonstiff examples


Consider the scalar ODE

y ′ (t) = −y(t) + 5e−t cos 5t, y(0) = 0,

for t ∈ [0, 3]. The right-hand side is given in the M file f1scal.m:

function yder=f1scal(t,y)
%F1SCAL Example of scalar ODE
yder = -y+5*exp(-t).*cos(5*t);

We shall use ode45 solver. The MATLAB command sequence

>> tspan = [0,3]; yzero=0;


>> [t,y]=ode45(@f1scal,tspan,yzero);
>> plot(t,y,'k--*')
>> xlabel('t'), ylabel('y(t)')

will produce the graph in Figure 9.4. The exact solution is y(t) = e−t sin 5t. We may check the
maximum error in the ode45 approximation:

>> norm(y-exp(-t).*sin(5*t),inf)
ans =
3.8416e-004

Consider the simple pendulum equation [25, section 1.4]:

d2 g
θ(t) = − sin θ(t),
dt2 L
374 Numerical Solution of Ordinary Differential Equations

0.8

0.6

0.4
y(t)

0.2

−0.2

−0.4
0 0.5 1 1.5 2 2.5 3
t

Figure 9.4: Scalar ODE example

where θ is the angular displacement of the pendulum, g is the acceleration due to gravity and L is the
length of the pendulum. Introducing the unknowns y1 (t) = θ(t) and y2 (t) = dθ(t)/dt, we may rewrite
this equation as the two first-order equations:
d
y1 (t) = y2 (t),
dt
d g
y2 (t) = − sin y1 (t).
dt L
These equations are coded in the file pend.m, given in the sequel:

function yp=pend(t,y,g,L)
%PEND - simple pendulum
%g - acceleration due to gravity, L - length
yp=[y(2); -g/L*sin(y(1))];

Here, g and L are additional parameters to be passed to pend by the solver. We shall compute the
solution for t ∈ [0, 10] and three different initial conditions.

g=10; L=10;
tspan = [0,10];
yazero = [1; 1]; ybzero = [-5; 2];
yczero = [5; -2];
[ta,ya] = ode45(@pend,tspan,yazero,[],g,L);
[tb,yb] = ode45(@pend,tspan,ybzero,[],g,L);
[tc,yc] = ode45(@pend,tspan,yczero,[],g,L);

In the calls of the form


[ta,ya] = ode45(@pend,tspan,yazero,[],g,L);

[] is an empty vector of options. To produce phase plane plots, that is, plots of y_1(t) against y_2(t),
we simply plot the first column of the numerical solution against the second. In this context, it is often
informative to superimpose a vector field using quiver. The arrows produced by quiver point in
the direction of [y_2, −sin y_1] and have length proportional to the 2-norm of this vector. The resulting
picture is shown in Figure 9.5.
[y1,y2] = meshgrid(-5:0.5:5,-3:0.5:3);
Dy1Dt = y2; Dy2Dt = -sin(y1);
quiver(y1,y2,Dy1Dt,Dy2Dt)
hold on
plot(ya(:,1),ya(:,2),yb(:,1),yb(:,2),yc(:,1),yc(:,2))
axis equal, axis([-5,5,-3,3])
xlabel y_1(t), ylabel y_2(t), hold off
Any solution of the pendulum ODE preserves the energy: the quantity (1/2) y_2(t)^2 − cos y_1(t) is
constant (here g/L = 1). We can check that this is approximately true using

>> Ec = 0.5*yc(:,2).^2-cos(yc(:,1));
>> max(abs(Ec(1)-Ec))
ans =
    0.0263

Figure 9.5: Pendulum phase plane solutions

9.8.3 Options

The odeset function creates an options structure that you can pass as an argument to any of the ODE
solvers. The odeset arguments are property name/property value pairs. The syntax is

options = odeset('name1', value1, 'name2', value2, ...)

In the resulting structure, the named properties have the specified values; any unspecified property
contains its default value. For all properties, it is sufficient to type only the leading characters that
uniquely identify the property name. With no input arguments, odeset displays all property names
and their possible values; the default values are enclosed in braces:

>> odeset
AbsTol: [ positive scalar or vector {1e-6} ]
RelTol: [ positive scalar {1e-3} ]
NormControl: [ on | {off} ]
NonNegative: [ vector of integers ]
OutputFcn: [ function_handle ]
OutputSel: [ vector of integers ]
Refine: [ positive integer ]
Stats: [ on | {off} ]
InitialStep: [ positive scalar ]
MaxStep: [ positive scalar ]
BDF: [ on | {off} ]
MaxOrder: [ 1 | 2 | 3 | 4 | {5} ]
Jacobian: [ matrix | function_handle ]
JPattern: [ sparse matrix ]
Vectorized: [ on | {off} ]
Mass: [ matrix | function_handle ]
MStateDependence: [ none | {weak} | strong ]
MvPattern: [ sparse matrix ]
MassSingular: [ yes | no | {maybe} ]
InitialSlope: [ vector ]
Events: [ function_handle ]

To modify an existing structure, oldopts, use

options=odeset(oldopts,'name1', value1,...)

This sets options to the existing structure oldopts, overwrites any values in oldopts that are
respecified using name/value pairs, and adds any new pairs to the structure. The command

options=odeset(oldopts, newopts)

combines the structures oldopts and newopts. In the output argument, any new options not equal
to the empty matrix overwrite the corresponding options in oldopts. An options structure created with
odeset can be queried with

o=odeget(options,'name')

This function returns the value of the specified property, or an empty matrix [], if the property value is
unspecified in the options structure.
Table 9.7 gives property types and property names.
Our example below solves the Rössler system [44, Section 12.2],

dy_1(t)/dt = −y_2(t) − y_3(t),
dy_2(t)/dt = y_1(t) + a y_2(t),
dy_3(t)/dt = b + y_3(t)(y_1(t) − c),

where a, b and c are real parameters. The function that defines the differential equation is:

function yd=Roessler(t,y,a,b,c)
%ROESSLER parametrized Roessler system

yd = [-y(2)-y(3); y(1)+a*y(2); b+y(3)*(y(1)-c)];



Category                      Property name
Error control                 RelTol, AbsTol, NormControl
Solver output                 OutputFcn, OutputSel, NonNegative, Refine, Stats
Jacobian matrix               Jacobian, JPattern, Vectorized
Step control                  InitialStep, MaxStep
Mass matrix and DAE           Mass, MStateDependence, MvPattern, MassSingular, InitialSlope
Events                        Events
ode15s and ode15i specific    MaxOrder, BDF

Table 9.7: ODE solver properties

We modify the absolute and relative errors with

options = odeset('AbsTol',1e-7,'RelTol',1e-4);

The script Roessler.m (MATLAB Source 9.4) solves the Rössler system over the interval
[0, 100] with initial condition y(0) = [1, 1, 1]^T and the parameter sets (a, b, c) = (0.2, 0.2, 2.5) and
(a, b, c) = (0.2, 0.2, 5). Figure 9.6 shows the results. The 221 subplot gives the 3D phase

MATLAB Source 9.4 Rössler system


tspan = [0,100]; y0 = [1;1;1];
options = odeset('AbsTol',1e-7,'RelTol',1e-4);
a=0.2; b=0.2; c1=2.5; c2=5;
[t,y] = ode45(@Roessler,tspan,y0,options,a,b,c1);
[t2,y2] = ode45(@Roessler,tspan,y0,options,a,b,c2);
subplot(2,2,1), plot3(y(:,1),y(:,2),y(:,3))
title('c=2.5'), grid
xlabel('y_1(t)'), ylabel('y_2(t)'), zlabel('y_3(t)');
subplot(2,2,2), plot3(y2(:,1),y2(:,2),y2(:,3))
title('c=5'), grid
xlabel('y_1(t)'), ylabel('y_2(t)'), zlabel('y_3(t)');
subplot(2,2,3); plot(y(:,1),y(:,2))
title('c=2.5')
xlabel('y_1(t)'), ylabel('y_2(t)')
subplot(2,2,4); plot(y2(:,1),y2(:,2))
title('c=5')
xlabel('y_1(t)'), ylabel('y_2(t)')

space space solution for c = 2.5 and the 223 subplot gives the 2D projection onto the y_1 − y_2 plane. The
222 and 224 subplots give the corresponding pictures for c = 5. We shall discuss properties and give
examples in the next sections. For details see help odeset or doc odeset.
Figure 9.6: Rössler system phase space solutions

9.8.4 Stiff equations


Stiffness is a subtle, difficult, and important concept in the numerical solution of ordinary differential
equations. It depends on the differential equation, the initial conditions, and the numerical method.
Dictionary definitions of the word “stiff” involve terms like “not easily bent”, “rigid”, and “stubborn”.
We are concerned with a computational version of these properties. Moler [66] characterizes the term
computationally:
“A problem is stiff if the solution being sought varies slowly, but there are nearby solutions
that vary rapidly, so the numerical method must take small steps to obtain satisfactory
results.”
Stiffness is an efficiency issue. Nonstiff methods can solve stiff problems; they just take a long time to
do it.
The next example, due to Shampine, is from [66] and is a model of flame propagation. If you light a
match, the ball of flame grows rapidly until it reaches a critical size. Then it remains at that size because
the amount of oxygen being consumed by the combustion in the interior of the ball balances the amount
available through the surface. A simple model is given by the initial value problem:

    y' = y^2 − y^3,    y(0) = δ,    0 ≤ t ≤ 2/δ.        (9.8.1)

The real-valued function y(t) represents the radius of the ball. The y^2 and y^3 terms come from the surface area and the volume. The critical parameter is the initial radius, δ, which is "small". We seek the solution over a length of time that is inversely proportional to δ. We shall try to solve the problem with ode45, for δ = 0.01 and relative error 10^{-4} (this is not a very stiff problem).

delta = 0.01;
F = @(t,y) y^2 - y^3;
opts = odeset('RelTol',1e-4);
ode45(F,[0,2/delta],delta,opts);

With no output arguments, ode45 automatically plots the solution as it is computed. You should get a plot of a solution that starts at y = 0.01, grows at a modestly increasing rate until t approaches 100, which is 1/δ, then grows rapidly until it reaches a value close to 1, where it remains (it reaches a steady state). The solver uses 185 points. If we decrease δ, say to 0.0001, the stiff character becomes

[Plot: the computed flame radius grows from 0.01 and settles at a steady state near 1 around t = 100.]

Figure 9.7: Flame propagation, δ = 0.01

stronger. The solver generates 12161 points. The graph is given in the upper part of Figure 9.8. It takes a lot of time to complete the plot. The lower part is a zoom in the neighborhood of the steady state. The figure was generated with the script flamematch2.m, given below. The Stats option, set to on, displays the solver's statistics.

delta = 1e-4; er = 1e-4;
F = @(t,y) y^2 - y^3;
opts = odeset('RelTol',er,'Stats','on');
[t,y] = ode45(F,[0,2/delta],delta,opts);
subplot(2,1,1)
plot(t,y,'c-'); hold on
h = plot(t,y,'bo');
set(h,'MarkerFaceColor','b','Markersize',4);
hold off
title ode45
subplot(2,1,2)
plot(t,y,'c-'); hold on
h = plot(t,y,'bo');
set(h,'MarkerFaceColor','b','Markersize',4);
axis([0.99e4,1.12e4,0.9999,1.0001])
hold off

Notice that the solver keeps the solution within the required accuracy, but it must work hard to do so. The situation becomes more dramatic for smaller relative errors, such as 10^{-5} or 10^{-6}. If you try, for example,

delta = 0.00001;
ode45(F,[0 2/delta],delta,opts);

and the plotting process is too slow, you can click the stop button in the lower left corner of the window.

[Two-panel plot titled ode45: the solution on [0, 2·10^4] (top) and a zoom near y = 1 showing densely clustered output points (bottom).]

Figure 9.8: Flame propagation, δ = 0.0001, relative error 1e-4 (top) and a zoom on the solution (bottom)

The problem is not stiff initially. It only becomes stiff as the solution approaches the steady state. This is because the steady state solution is so "rigid". Any solution near y(t) = 1 increases or decreases rapidly toward that solution. (We should point out that "rapidly" here is with respect to an unusually long time scale.)
We shall use a solver for stiff problems. These solvers are based on implicit methods. At each step they use matrix operations to solve a system of simultaneous linear equations that helps predict the evolution of the solution. For our flame example, the matrix is only 1 by 1 (this is a scalar problem), but even here, stiff methods do more work per step than nonstiff methods. Let us solve our example with a stiff solver, ode23s. We have only to modify the solver name: ode23s instead of ode45. Figure 9.9 shows the computed solution and the zoom detail. This time, the solver's effort is much smaller, as can be seen by examining the statistics generated by the solver. In this case, the statistics for ode23s are:

99 successful steps
7 failed attempts
412 function evaluations
99 partial derivatives
106 LU decompositions
318 solutions of linear systems

Compare with that generated by ode45:

3040 successful steps



[Two-panel plot titled ode23s: the solution on [0, 2·10^4] (top) and a zoom near y = 1 showing far fewer output points (bottom).]

Figure 9.9: Flame propagation, δ = 0.0001, relative error 1e-4 (top) and a zoom on the solution (bottom). (Solver: ode23s.)

323 failed attempts


20179 function evaluations
0 partial derivatives
0 LU decompositions
0 solutions of linear systems

It is possible to compute the exact solution of problem (9.8.1). The differential equation is separable. Integrating once gives an implicit equation for y as a function of t:

    1/y + ln(1/y − 1) = 1/δ + ln(1/δ − 1) − t.

The exact analytical solution to the flame model is

    y(t) = 1/(W(a e^{a−t}) + 1),

where a = 1/δ − 1 and W is the Lambert W function, defined as the solution of the functional equation

    W(z) e^{W(z)} = z.
With MATLAB and the Symbolic Math Toolbox, the statements

y = dsolve('Dy = y^2 - y^3','y(0) = 1/100');
y = simplify(y);
pretty(y)
ezplot(y,[0,200])

produces

1
----------------------------
lambertw(99 exp(99 - t)) + 1

and the plot of the exact solution shown in Figure 9.10. If the initial value 1/100 is decreased and
the time span 0 ≤ t ≤ 200 increased, the transition region becomes narrower.
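As a quick accuracy check, the numerical and exact solutions can be compared directly. This is a sketch of ours, not from the text; it assumes lambertw (Symbolic Math Toolbox) accepts double arguments, and uses δ = 0.01 so that e^a stays within double precision range:

delta = 0.01; a = 1/delta - 1;             % a = 99, safe in double arithmetic
yex = @(t) 1./(lambertw(a*exp(a-t)) + 1);  % exact solution
F = @(t,y) y^2 - y^3;
[t,y] = ode23s(F,[0,2/delta],delta,odeset('RelTol',1e-4));
max(abs(y - yex(t)))                       % global error at the returned points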

[Plot of the exact solution on [0, 200]: a smooth transition from near 0 to 1 around t = 100.]

Figure 9.10: Exact solution for the flame example.

Cleve Moler [66] gives a very suggestive comparison between explicit and implicit solvers.
“Imagine you are returning from a hike in the mountains. You are in a narrow canyon
with steep slopes on either side. An explicit algorithm would sample the local gradient to
find the descent direction. But following the gradient on either side of the trail will send
you bouncing back and forth across the canyon, as with ode45. You will eventually get
home, but it will be long after dark before you arrive. An implicit algorithm would have
you keep your eyes on the trail and anticipate where each step is taking you. It is well
worth the extra concentration.”

Consider now an example due to Robertson and discussed in detail in [44, 25]. The Robertson ODE system

    (d/dt) y1(t) = −α y1(t) + β y2(t) y3(t),
    (d/dt) y2(t) = α y1(t) − β y2(t) y3(t) − γ y2(t)^2,
    (d/dt) y3(t) = γ y2(t)^2

models a reaction between three chemicals. We set the system up as the function chem:

function yprime=chem(t,y,alpha,beta,gamma)
%CHEM - Robertson chemical reaction model

yprime = [-alpha*y(1)+beta*y(2)*y(3);
          alpha*y(1)-beta*y(2)*y(3)-gamma*y(2)^2;
          gamma*y(2)^2];

The script robertson.m solves the ODE for α = 0.04, β = 10^4, γ = 3 × 10^7, t ∈ [0, 3] and initial condition y(0) = [1, 0, 0]^T. It uses the ode45 and ode15s solvers, and generates statistics.

alpha = 0.04; beta = 1e4; gamma = 3e7;


tspan = [0,3]; y0 = [1;0;0];
opts=odeset(’Stats’,’on’);
[ta,ya] = ode45(@chem,tspan,y0,opts,alpha,beta,gamma);
subplot(1,2,1), plot(ta,ya(:,2),’-*’)
ax = axis; ax(1) = -0.2; axis(ax);
xlabel(’t’), ylabel(’y_2(t)’)
title(’ode45’,’FontSize’,14)
[tb,yb] = ode15s(@chem,tspan,y0,opts,alpha,beta,gamma);
subplot(1,2,2), plot(tb,yb(:,2),’-*’)
axis(ax)
xlabel(’t’), ylabel(’y_2(t)’)
title(’ode15s’,’FontSize’,14)

For scaling reasons, only the graph of y2 is plotted in Figure 9.11. Figure 9.12 contains a zoom on the solution computed by ode45. The solver statistics are

[Two-panel plot of y2(t) on the 10^{-5} scale: the ode45 solution (left) with many densely packed output points, and the ode15s solution (right) with far fewer points.]

Figure 9.11: Robertson's system y2 solution. Left: ode45. Right: ode15s.

2052 successful steps
440 failed attempts
14953 function evaluations
0 partial derivatives
0 LU decompositions
0 solutions of linear systems

[Plot: zoom on the y2 component computed by ode45, for t between 2 and 2.1.]

Figure 9.12: Zoom on y2 solution of Robertson's system, computed by ode45

for ode45 and

33 successful steps
5 failed attempts
73 function evaluations
2 partial derivatives
13 LU decompositions
63 solutions of linear systems

for ode15s, respectively. Note that in the computation above, we have

disp([length(ta),length(tb)])
8209 34

showing that ode45 returned output at almost 250 times as many points as ode15s. However, the statistics show that ode45 took 2052 steps, only about 62 times as many as ode15s. The explanation is that by default ode45 uses interpolation to return four solution values at equally spaced points over each "natural" step. The default interpolation level can be overridden via the Refine property with odeset.
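A small check (a sketch, not in the original script; the variable names follow robertson.m above) is to set Refine to 1 and count the output points again:

opts1 = odeset(opts,'Refine',1);   % merge: keep Stats, override Refine
[tc,yc] = ode45(@chem,tspan,y0,opts1,alpha,beta,gamma);
disp(length(tc))                   % now close to the number of successful steps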
The stiff solvers use information about the Jacobian matrix, ∂fi/∂yj, at various points along the solution. By default, they automatically generate approximate Jacobians using finite differences. However, the reliability and efficiency of the solvers is generally improved if a function that evaluates the Jacobian is supplied. Options are available for specifying a function that evaluates the Jacobian, or a constant matrix if the Jacobian is constant (Jacobian), for giving the sparsity pattern of a sparse Jacobian (JPattern), and for declaring that the ODE function is vectorized (Vectorized). To illustrate how Jacobian information can be encoded, we consider the system of ODEs
    (d/dt) y(t) = A y(t) + y(t).*(1 − y(t)) + v,        (9.8.2)

where A is the N × N tridiagonal matrix

    A = r1 · tridiag(−1, 0, 1) + r2 · tridiag(1, −2, 1)

(that is, r1 times the matrix with −1 on the subdiagonal and 1 on the superdiagonal, plus r2 times the matrix with 1 on the off-diagonals and −2 on the diagonal), and v is the N × 1 vector

    v = [r2 − r1, 0, . . . , 0, r2 + r1]^T,

with r1 = −a/(2∆x) and r2 = b/∆x^2. Here, a, b and ∆x are parameters with values a = 1, b = 5 × 10^{-2} and ∆x = 1/(N + 1). This ODE system arises when the method of lines based on central differences is used to semi-discretize the partial differential equation (PDE)

    ∂u/∂t + a ∂u/∂x = b ∂²u/∂x² + u(1 − u),    0 ≤ x ≤ 1,
with Dirichlet boundary conditions u(0, t) = u(1, t) = 1. This PDE is of reaction-convection-diffusion type (and could be solved directly with pdepe). The ODE solution component yj(t) approximates u(j∆x, t). We suppose that the PDE comes with the initial data u(x, 0) = (1 + cos 2πx)/2, for which it can be shown that u(x, t) tends to the steady state u(x, t) ≡ 1 as t → ∞. The corresponding ODE initial condition is (y0)j = (1 + cos(2πj/(N + 1)))/2. The Jacobian for this ODE has the form A + I − 2 diag(y(t)), where I is the identity matrix. MATLAB Source 9.5 contains a function that implements and solves (9.8.2) using ode15s. The use of subfunctions and function handles allows the whole code to be contained in a single file, rcd.m. We have set N = 40 and t ∈ [0, 2]. We specify via the Jacobian property of odeset the subfunction jacobian that evaluates the Jacobian, and the sparsity pattern of the Jacobian, encoded as a sparse matrix of 0s and 1s, is assigned to the JPattern property. The jth column of the output matrix y contains the approximation to yj(t), and we have created U by appending an extra column ones(size(t)) at each end of y to account for the PDE boundary conditions. The plot produced by rcd is shown in Figure 9.13.
The ODE solvers can be applied to problems of the form

    M(t, y(t)) (d/dt) y(t) = f(t, y(t)),    y(t0) = y0,
where the mass matrix, M (t, y(t)), is square and nonsingular. (The ode23s solver applies only when
M is independent of t and y(t).) Mass matrices arise naturally when semi-discretization is performed
with a finite element method. A mass matrix can be specified in a similar manner to a Jacobian, via
odeset. The ode15s and ode23t functions can solve certain problems where M is singular but
does not depend on y(t); more precisely, they can be used if the resulting differential-algebraic equation is of index 1 and y0 is close to being consistent.
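As a minimal sketch (our example, not from the text) of supplying a constant, nonsingular mass matrix:

% Solve M*y' = A*y with a constant mass matrix M
M = [2 1; 1 2];
A = [-1 0; 0 -10];
f = @(t,y) A*y;
opts = odeset('Mass',M,'MStateDependence','none');
[t,y] = ode15s(f,[0,5],[1;1],opts);
plot(t,y)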

9.8.5 Event handling


In many situations, the determination of the last value tfinal of tspan is an important aspect of the problem. One example is a body falling under the force of gravity and encountering air resistance.

MATLAB Source 9.5 Stiff problem with information about Jacobian


function rcd
%RCD Stiff ODE for reaction-convection-diffusion problem
% obtained from method of lines

N = 40; a = 1; b = 5e-2;
tspan = [0;2]; space = [1:N]/(N+1);

y0 = 0.5*(1+cos(2*pi*space));
y0 = y0(:);
options = odeset(’Jacobian’,@jacobian,’Jpattern’,...
jpattern(N),’RelTol’,1e-3,’AbsTol’,1e-3);

[t,y] = ode15s(@f,tspan,y0,options,N,a,b);
e = ones(size(t)); U = [e y e];
waterfall([0:1/(N+1):1],t,U)
xlabel(’space’,’FontSize’,16,’Interpreter’,’LaTeX’)
ylabel(’time’,’FontSize’,16,’Interpreter’,’LaTeX’)

% ---------------------------------------------------------
% Subfunctions.
% ---------------------------------------------------------
function dydt = f(t,y,N,a,b)
%F Differential equation

r1 = -a*(N+1)/2;
r2 = b*(N+1)^2;
up = [y(2:N);0]; down = [0;y(1:N-1)];
e1 = [1;zeros(N-1,1)]; eN = [zeros(N-1,1);1];

dydt = r1*(up-down) + r2*(-2*y+up+down) + (r2-r1)*e1 +...


(r2+r1)*eN + y.*(1-y);

% ---------------------------------------------------------
function dfdy = jacobian(t,y,N,a,b)
%JACOBIAN Jacobian matrix

r1 = -a*(N+1)/2;
r2 = b*(N+1)^2;
u = (r2-r1)*ones(N,1);
v = (-2*r2+1)*ones(N,1) - 2*y;
w = (r2+r1)*ones(N,1);

dfdy = spdiags([u v w],[-1 0 1],N,N);

% ---------------------------------------------------------
function S = jpattern(N)
%JPATTERN Sparsity pattern of Jacobian matrix

e = ones(N,1);
S = spdiags([e e e],[-1 0 1],N,N);

[Waterfall plot of the computed solution surface over space ∈ [0, 1] and time ∈ [0, 2], rising toward the steady state u ≡ 1.]

Figure 9.13: Stiff example, with information about Jacobian

When does it hit the ground? Another example is the two-body problem, the orbit of one body under
the gravitational attraction of a much heavier body. What is the period of the orbit? The events feature
of the MATLAB ordinary differential equation solvers provides answers to such questions.
Events detection in ordinary differential equations involves two functions, f (t, y) and g(t, y), and
an initial condition, (t0 , y0 ). The problem is to find a function y(t) and a final value t∗ so that

    y' = f(t, y),    y(t0) = y0,

and

    g(t*, y(t*)) = 0.

A simple model for the falling body is

    y'' = −1 + (y')^2,

with initial conditions that provide values for y(0) and y'(0). The question is, for what value of t do we have y(t) = 0? The source code for f(t, y) is

function ydot=f(t,y)
ydot = [y(2); -1+y(2)^2];
The equation was rewritten as a system of two first order equations, so g(t, y) = y1 . The code for
g(t, y) is
function [gstop,isterminal,direction] = g(t,y)
gstop = y(1);
isterminal = 1;
direction = 0;

The first output argument, gstop, is the value that we want to vanish. If the second output, isterminal, is set to one, the solver terminates when gstop is zero. If isterminal = 0, then the event is recorded and the solution process proceeds. The direction parameter may be 0 (all zeros are to be located, the default), 1 (only zeros where the event function is increasing), or -1 (only zeros where the event function is decreasing). With these two functions available, the following statements compute and plot the trajectory.

function falling_body(y0)
opts = odeset(’events’,@g);
[t,y,tfinal] = ode45(@f,[0,Inf],y0,opts);
tfinal
plot(t,y(:,1),’-’,[0,tfinal],[1,0],’o’)
axis([-0.1, tfinal+0.1, -0.1, max(y(:,1)+0.1)]);
xlabel t
ylabel y
title(’Falling body’)
text(tfinal-0.8, 0, [’tfinal = ’ num2str(tfinal)])

For the initial condition y0=[1; 0], one obtains

>> falling_body([1;0])
tfinal =
1.65745691995813

and the graph in Figure 9.14.

[Plot: the falling body height y(t) decreasing from 1 to 0, with the event point marked and labeled tfinal = 1.6575.]

Figure 9.14: Event handling for falling object.

Events detection is particularly useful in problems involving periodic phenomena. The two-body problem is a good example. It describes the orbit of a body subject to the gravitational force of a much heavier body. Using Cartesian coordinates, u(t) and v(t), with the origin at the position of the heavier

MATLAB Source 9.6 Two body problem


function orbit(reltol)
y0 = [1; 0; 0; 0.3];
opts = odeset(’events’, @gstop,’RelTol’,reltol);
[t,y,te,ye] = ode45(@twobody,[0,2*pi], y0, opts, y0);
tfinal = te(end)
yfinal = ye(end,1:2)
plot(y(:,1),y(:,2),’-’,0,0,’ro’)
axis([-0.1 1.05 -0.35 0.35])

%----------
function ydot = twobody(t,y,y0)
r = sqrt(y(1)^2 + y(2)^2);
ydot = [y(3); y(4); -y(1)/r^3; -y(2)/r^3];

%--------
function [val,isterm,dir] = gstop(t,y,y0)
d = y(1:2)-y0(1:2);
v = y(3:4);
val = d’*v;
isterm = 1;
dir = 1;

body, our equations are

    u''(t) = −u(t)/r(t)^3,
    v''(t) = −v(t)/r(t)^3,

where r(t) = sqrt(u(t)^2 + v(t)^2). The complete source of the solution is contained in a single function
M file, orbit.m (MATLAB Source 9.6). The input parameter, reltol, is the desired local relative
tolerance. The code for the problem, twobody, and the event handling function, gstop, are given
as subfunctions, but they can be kept in separate M-files. The function ode45 is used to compute the orbit. The input argument y0 is a 4-vector that provides the initial position and velocity. The light body starts at (1, 0), a point at distance 1 from the heavy body, and has initial velocity (0, 0.3), which is perpendicular to the initial position vector. The input argument opts is an options structure created by odeset that overrides the default value for reltol and that specifies a function gstop that defines the events we want to locate. The last input argument of ode45 is a copy of y0, passed on to both twobody and gstop. The 2-vector d is the difference between the current position
and the starting point. The 2-vector v is the velocity at the current position. The quantity val is the
inner product between these two vectors. If you specify an events function and events are detected, the
solver returns three additional outputs:
• A column vector of times (abscissas) at which events occur.
• Solution values corresponding to these times (abscissas).
• Indices into the vector returned by the events function. The values indicate which event the
solver detected.

If you call the solver as


[T,Y,TE,YE,IE] = solver(odefun,tspan,y0,options)
the solver returns these outputs as TE, YE, and IE respectively.
The expression for the stopping function is

    g(t, y) = d(t)^T d'(t),

where

    d(t) = (y1(t) − y1(0), y2(t) − y2(0))^T.

Points where g(t, y(t)) = 0 are the local extrema of the distance ||d(t)||. By setting dir = 1, we indicate that only zeros where g(t, y) is increasing are of interest; these correspond to minima. By setting isterm = 1, we indicate that computation of the solution should be terminated at the first minimum. If the orbit is truly periodic, then the minima of ||d(t)|| occur when the body returns to its starting point.
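If instead we wanted to record every such minimum without stopping, we could make the event non-terminal; a sketch (a hypothetical variant of gstop, not in the original source):

function [val,isterm,dir] = gstop_all(t,y,y0)
d = y(1:2)-y0(1:2);   % displacement from the starting point
v = y(3:4);           % current velocity
val = d'*v;           % zero at extrema of the distance
isterm = 0;           % record the event, keep integrating
dir = 1;              % increasing zeros only, i.e. minima

Successive entries of the returned event times TE would then approximate integer multiples of the orbital period.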
Calling orbit with a very loose tolerance
orbit(2e-3)
produces
tfinal =
2.35087197761946
yfinal =
0.98107659901112 -0.00012519138558
and plots Figure 9.15(a). You can see from both the value of yfinal and the graph that the orbit does
not quite return to the starting point. We need to request more accuracy.
orbit(1e-6)
yields
tfinal =
2.38025846171798
yfinal =
0.99998593905520 0.00000000032239
Now the value of yfinal is close enough to y0, and the graph looks much better (Figure 9.15(b)).

[Two orbit plots, (a) tolerance 2e-3: the orbit visibly fails to close; (b) tolerance 1e-6: the orbit returns to its starting point.]

Figure 9.15: Orbit for the tolerances 2e-3 (left) and 1e-6 (right)

[Plot of the six solution curves of (9.8.3) on [0, 3], for initial values y(0) = −5, . . . , 0.]

Figure 9.16: Solutions computed with deval

9.8.6 deval and odextend


If you call the solver as

sol = solver(odefun,tspan,y0,options)

the solver returns a structure sol that can be used to evaluate the solution with deval. Consider the example: solve the ODE

    y' = y^2 − t,    t ∈ [0, 3],        (9.8.3)

for y(0) = −5, −4, . . . , 0. We shall solve this problem using deval:

fd=@(t,y) y^2-t;
t=linspace(0,3,150);
y=zeros(150,6);
y0=-5:0;
for k=1:length(y0)
sol=ode45(fd,[0,8],y0(k));
y(:,k)=deval(sol,t);
end
plot(t,y)

Figure 9.16 gives the graphs of the solution. For details on deval see help deval or doc deval.

The odextend function extends the solution of an initial value problem for an ODE. The calling syntax in its simplest form is

solext = odextend(sol, odefun, tfinal)

The next example extends the solution of (9.8.3) with initial condition y(0) = 0, from [0,3] to [0,8] and
plots the solution.

fd=@(t,y) y^2-t;
sol=ode45(fd,[0,3],0);
sol=odextend(sol,fd,8);
t=linspace(0,8,150);
y=deval(sol,t);
plot(t,y)
For additional information see help odextend or doc odextend.

9.8.7 Implicit equations


The solver ode15i solves implicit ODEs of the form

    f(t, y, y') = 0

using a variable order BDF method. The minimal syntax is

    [T,Y] = ode15i(odefun,tspan,y0,yp0)

but, in general, it supports all the features of the other solvers. The difference is the parameter yp0, which contains the value y'(t0). The initial values must fulfill the consistency condition f(t0, y(t0), y'(t0)) = 0. You can use the function decic to compute consistent initial conditions close to guessed values. To illustrate, let us solve the equation

    (y')^2 + y^2 − 1 = 0,    t ∈ [π/6, π/4],

with initial condition y(π/6) = 1/2. The exact solution is y(t) = sin(t), and the starting value for the derivative is y'(π/6) = √3/2. Here is the MATLAB code for the solution
tspan = [pi/6,pi/4];
[T,Y] = ode15i(@implic,tspan,1/2,sqrt(3)/2);
and for the implicit ODE
function z=implic(t,y,yp)
z=yp^2+y^2-1;
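For instance, a consistent initial slope could be recovered from a rough guess with decic; a small sketch (the guess 0.8 and the call below are ours, not from the text):

% hold y(t0) = 1/2 fixed, let decic correct the guessed slope 0.8
[y0c,yp0c] = decic(@implic,pi/6,1/2,1,0.8,0);
% yp0c is close to sqrt(3)/2, so that f(t0,y0c,yp0c) = 0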

9.9 Applications
9.9.1 The restricted three-body problem
Consider the motion of a space probe in the gravitational field of two bodies (such as the earth and the
moon). Both bodies impose a force on the spacecraft according to the gravitational law, but the mass
of the spacecraft is too small to significantly affect the motion of the bodies. We therefore neglect the
influence of the spacecraft on the two stellar bodies. In the field of celestial mechanics this problem
is known as the restricted three-body problem. It is of great advantage to describe the motion of the
spacecraft in a (rotating) coordinate system that has its origin in the center of gravity of the earth and
the moon. In this coordinate system the earth is located at (−M, 0) and the moon at (E, 0) and the
governing equations are then given as [25, 51]

    ẍ = 2ẏ + x − E(x + M)/r1^3 − M(x − E)/r2^3,
    ÿ = −2ẋ + y − Ey/r1^3 − My/r2^3,

[Diagram of the rotating coordinate system: the Earth at (−M, 0) and the Moon at (E, 0).]

Figure 9.17: Restricted 3 body problem coordinate system

where r1 = sqrt((x + M)^2 + y^2), r2 = sqrt((x − E)^2 + y^2), and E = 1 − M.
This problem has a periodic solution: the spacecraft must return to its initial coordinates at the end of the given interval. Since the system is not stable, small errors would destroy the periodicity, so higher accuracy is needed for the numerical solution.
We convert the system of second order differential equations into a first-order system of four equa-
tions. If we introduce the unknowns

X1 (t) = x(t), X2 (t) = ẋ(t), Y1 (t) = y(t), Y2 (t) = ẏ(t),

our system becomes

    X1' = X2,
    X2' = 2Y2 + X1 − E(X1 + M)/r1^3 − M(X1 − E)/r2^3,
    Y1' = Y2,
    Y2' = −2X2 + Y1 − EY1/r1^3 − MY1/r2^3.

The ODE function (M-file r3body.m) is

function yp = r3body(t,y)
% R3BODY Function defining the restricted three-body ODEs

M = 0.012277471;
E = 1 - M;
%
r1 = sqrt((y(1)+M)^2 + y(3)^2);
r2 = sqrt((y(1)-E)^2 + y(3)^2);
%
yp2 = 2*y(4) + y(1) - E*(y(1)+M)/r1^3 - M*(y(1)-E)/r2^3;
yp4 = -2*y(2) + y(3) - E*y(3)/r1^3 - M*y(3)/r2^3;
yp = [y(2); yp2; y(4); yp4];

We shall solve the problem for two different data sets. First we consider the interval [0, 6.192169331319640],

M = 1/82.45, E = 1 − M , and the initial conditions

x(0) = 1.2, ẋ(0) = 0, y(0) = 0,


ẏ(0) = −1.049357509830320.

The code is listed below.

tspan = [0,6.192169331319640];
M = 1/82.45; E = 1-M;
%
x0 = 1.2; xdot0 = 0; y0 = 0;
ydot0 = -1.049357509830320;
%
vec0 = [x0 xdot0 y0 ydot0];
%
options = odeset(’RelTol’,1e-6,’AbsTol’,[1e-6 1e-6 1e-6 1e-6]);
%
[t,y] = ode45(@r3body,tspan,vec0,options);
plot(y(:,1),y(:,3))
axis([-1.5 1.5 -.8 .8]);grid on;
hold on
plot(-M,0,’o’)
plot(E,0,’o’);
hold off
xlabel(’x’);
ylabel(’y’)
text(0,0.1,’Earth’,’FontSize’,16)
text(0.9,-0.1,’Moon’,’FontSize’,16)

The results are shown in Figure 9.18.


For the second data set, we take t ∈ [0, 29.4602], M = 0.012277471, E = 1 − M , and the initial
conditions

x(0) = 1.15, ẋ(0) = 0, y(0) = 0,


ẏ(0) = 0.0086882909.

Here is the code:

clear; close all;


tspan = [0 29.4602]; %experiment
%
M = 0.012277471; E = 1-M;
%
x0 = 1.15;
xdot0 = 0;
y0 = 0;
ydot0 = 0.0086882909;
vec0 = [x0 xdot0 y0 ydot0];
scal=1e-1; %1e0, 1e1, 1e2, 1e3,....
options = odeset(’RelTol’,1e-6*scal,’AbsTol’,...

[Plot of the computed periodic orbit in the rotating frame, with the Earth and Moon positions marked.]

Figure 9.18: Restricted 3 body problem solution

[1e-6 1e-6 1e-6 1e-6]*scal);


[t,y] = ode45(@r3body,tspan,vec0,options);
figure(1);plot(y(:,1),y(:,3))
axis([-.8 1.2 -.8 .8]);grid on;
hold on
plot(-M,0,’o’)
plot(E,0,’o’);
hold off
xlabel('x'); ylabel('y')
text(0,0.1,’Earth’,’FontSize’,16)
text(0.9,-0.15,’Moon’,’FontSize’,16)

See Figure 9.19 for the output.


We can add to our main code the following MATLAB sequence to implement a simple animation based on the comet function

figure(2);
shg
axis([-.8 1.2 -.8 .8]);grid on;
hold on
plot(-M,0,’o’)
plot(E,0,’o’);

xlabel('x'); ylabel('y')
text(0,0.1,’Earth’,’FontSize’,16)
text(0.9,-0.15,’Moon’,’FontSize’,16)

[Plot of the second orbit in the rotating frame, with the Earth and Moon positions marked.]

Figure 9.19: The Three-Body Problem in a rotating coordinate system

comet(y(:,1),y(:,3))
hold off

9.9.2 Motion of a projectile


The problem of aiming a projectile to strike a distant target involves integrating a system of differential
equations governing the motion and adjusting the initial inclination angle to achieve the desired hit
[105]. Assuming an atmospheric drag proportional to the square of velocity, the motion is governed by
the equations
v̇x = −cvvx , v̇y = −g − cvvy , ẋ = vx , ẏ = vy ,
where g is the gravity constant and c is a ballistic coefficient depending on such physical properties
as the projectile shape and air density. Because the target will be located at a distant point (xf , yf )
relative to the initial position (0, 0) where the projectile is launched, our independent variable will be
the horizontal position x, rather than the time. In order to reformulate the equations in terms of x, we
use the relationship
    dx = vx dt,  or  dt/dx = 1/vx.

Then

    dy/dx = vy/vx,    dvy/dt = vx (dvy/dx),    dvx/dt = vx (dvx/dx),

and the equations of motion become

    dy/dx = vy/vx,    dt/dx = 1/vx,    dvx/dx = −cv,    dvy/dx = −(g + c v vy)/vx.

Taking a vector z defined by z = [vx, vy, y, t]^T leads to the first order matrix differential equation

    dz/dx = [−c v vx, −(g + c v vy), vy, 1]^T / vx,

where v = sqrt(vx^2 + vy^2).
The problem becomes ill-posed if the initial velocity of the projectile is not large enough, so that vx is reduced to zero by atmospheric drag before the maximum desired value of x is reached. Consequently, error checking is needed to handle such a circumstance. The function protraject uses ode45 to compute the projectile trajectory. The equations of motion use variables shared with the main function, so we chose to implement them as a nested function (projcteq). Error checking is implemented via the event handling facility. Graphical results for default data are given in Figure 9.20.

[Plot titled "Projectile Trajectory for Velocity Squared Drag": the trajectory peaks near y = 600 and ends at x ≈ 1000, with t_final = 20.52 marked at the endpoint.]

Figure 9.20: Projectile trajectory for v^2 drag condition

function [y,x,t]=protraject(angle,vinit,grav,dragc,xfinl)
%PROTRAJECT - trajectory of a projectile
% angle - initial angle in degrees
% vinit - initial velocity of the projectile
% grav  - the gravitational constant
% dragc - drag coefficient
% xfinl - largest x value for which the
%         solution is computed
% y,x,t - the y, x and time vectors produced
%         by integrating the equations of motion

% Default data case generated


if nargin < 5, xfinl=1000; end
398 Numerical Solution of Ordinary Differential Equations

if nargin < 4, dragc=0.002; end


if nargin < 3, grav=9.81; end
if nargin < 2, vinit=340; end
if nargin < 1, angle=45; end

% Evaluate initial velocity


ang=pi/180*angle;
vtol=vinit/1e6;
%initial conditions
z0=[vinit*cos(ang); vinit*sin(ang); 0; 0];

% Integrate the equations of motion defined


% in function projcteq
deoptn=odeset(’RelTol’,1e-6,’Events’,@eventsh);
[x,z,te,ze,ie]=ode45(@projcteq,[0,xfinl],z0,deoptn);

% Plot the trajectory curve


y=z(:,3); t=z(:,4);
plot(x,y,’-’,x(end),y(end),’o’);
ss=sprintf(’{t_{final} = %5.2f\\rightarrow }’,t(end));
text(x(end),y(end),ss,’HorizontalAlignment’,’Right’,...
’FontSize’,14);
xlabel(’x’,’FontSize’,14); ylabel(’y’,’FontSize’,14);
title([’Projectile Trajectory for ’, ...
’Velocity Squared Drag’],’FontSize’,14);
axis(’equal’); grid on;
%error
if ~isempty(te)
error(’initial velocity too small’)
end

%-----------------------
function zp=projcteq(t,y)
%PROJCTEQ - DE of projectile
v=sqrt(y(1)^2+y(2)^2);
zp=[-dragc*v; -(grav+dragc*v*y(2))/y(1); ...
y(2)/y(1); 1/y(1)];
end
%------------------------
function [val,isterm,dir] = eventsh(t,y)
%EVENTSH - event handling function
val = abs(y(1)) - vtol;
dir = -1;
isterm =1;
end
end

Another interesting problem on projectile motion is Problem 9.14.



Problems
Problem 9.1. Solve the problem

    y' = 1 − y^2,    y(0) = 0,

using various methods whose Butcher tables were given in this chapter, and also ode23 and ode45. Compute the global error, given that the exact solution is

    y(x) = (e^{2x} − 1)/(e^{2x} + 1),

and check that it is O(h^p).

Problem 9.2. Solve the equations and compare to the exact solution:
(a)

    y' = (1/4) y (1 − y/20),    x ∈ [0, 20],    y(0) = 1,

with exact solution

    y(x) = 20/(1 + 19 e^{−x/4});

(b)

    y'' = 0.032 − 0.4 (y')^2,    x ∈ [0, 20],    y(0) = 30,    y'(0) = 0,

with exact solution

    y(x) = (5/2) log(cosh(2√2 x/25)) + 30,
    y'(x) = (√2/5) tanh(2√2 x/25).

Problem 9.3. The equations of the Lorenz attractor,

    dx/dt = −ax + ay,
    dy/dt = bx − y − xz,
    dz/dt = −cz + xy,

have chaotic solutions which are very sensitive to changes in the initial conditions. Solve them numerically for a = 5, b = 15, c = 1 with initial conditions

    x(0) = 2,    y(0) = 6,    z(0) = 4,    t ∈ [0, 20],

and the tolerance T = 10^{-4}. Repeat for


(a) T = 10^{-5};
(b) x(0) = 2.1.
Compare both results. Plot them in each case.

Problem 9.4. The progress of an influenza epidemic in a population of N individuals is modeled by the system of differential equations

    dx/dt = −βxy + γ,
    dy/dt = βxy − αy,
    dz/dt = αy − γ,

where x is the number of people susceptible to the disease, y is the number of infected, and z is the number of immunes, which includes those recovered from the disease, at time t. The parameters α, β, γ are the rates of recovery, transmission and replenishment (per day), respectively. It is assumed that the population is fixed, so that new births are balanced by deaths.
Use the oderk and ode45 functions to solve the equations with initial conditions x(0) = 980, y(0) = 20, z(0) = 0, given the parameters α = 0.05, β = 0.0002, γ = 0. You should terminate the simulation when y(t) > 0.9N. Determine approximately the maximum number of people infected and when it occurs.
Investigate the effect of (a) varying the initial number of infected individuals on the progress of the epidemic, and (b) introducing a nonzero replenishment factor.

Problem 9.5. [25] Captain Kirk⁴ and his crew, aboard the starship Enterprise, are stranded without power in an orbit around the earth-like planet Capella III, at an altitude of 127 km. Atmospheric drag is causing the orbit to decay, and if the ship reaches denser layers of the atmosphere, excessive deceleration and frictional heating will cause irreparable damage to the life-support system. The science officer, Mr. Spock, estimates that the temporary repairs to the impulse engines will take 29 minutes, provided that they can be completed before the deceleration rises to 5g (1g = 9.81 m/s^2). Since Mr. Spock is a mathematical genius, he decides to simulate the orbital decay by solving the equations of motion numerically with the DORPRI5 embedded pair. The equations of motion of the starship, subject to atmospheric drag, are given by

    dv/dt = (GM sin γ)/r^2 − cρv^2,
    dγ/dt = (GM/(rv) − v/r) cos γ,
    dz/dt = −v sin γ,
    dθ/dt = (v cos γ)/r,

where
v is the tangential velocity (m/s);
γ is the re-entry angle (between the velocity vector and the horizontal);
z is the altitude (m);
M is the planetary mass (6 × 10^24 kg);
G is the constant of gravitation (6.67 × 10^{-11} SI);
c is a drag constant (c = 0.004);

r is the distance to the center of the planet (z + 6.37 × 10^6 m);
ρ is the atmospheric density (1.3 exp(−z/7600));
θ is the longitude;
t is the time (s).
At time t = 0, the initial values are γ = 0, θ = 0 and v = sqrt(GM/r). Mr. Spock solved the equations numerically to find the deceleration history and the time and the place of the impact of the Enterprise, should its orbital decay not be prevented. Repeat his simulation using a variable step Runge-Kutta method, and also estimate approximately the maximum deceleration experienced during the descent and the height at which this occurs. Should Captain Kirk give the order to abandon ship?

⁴ Characters from Star Trek, 1st series
Problem 9.6. Halley's comet last reached perihelion (its closest approach to the Sun) on February 9th, 1986. Its position and velocity components at this time were

    (x, y, z) = (0.325514, −0.459460, 0.166229),
    (dx/dt, dy/dt, dz/dt) = (−9.096111, −6.916686, −1.305721).

Position is measured in astronomical units (the Earth's mean distance to the Sun), and time in years. The equations of motion are

    d²x/dt² = −µx/r^3,
    d²y/dt² = −µy/r^3,
    d²z/dt² = −µz/r^3,

where r = sqrt(x^2 + y^2 + z^2), µ = 4π^2, and the planetary perturbations have been neglected. Solve
these equations numerically to determine approximately the time of the next perihelion.
Problem 9.7. Consider the gravitational two-body orbit problem, written here as a system of four first order equations

    y1' = y3,
    y2' = y4,
    y3' = −y1/r^3,
    y4' = −y2/r^3,

with initial conditions

    y(0) = [1 − e, 0, 0, sqrt((1 + e)/(1 − e))]^T,

where r = sqrt(y1^2 + y2^2). The true solution represents the motion on an elliptic orbit with eccentricity e ∈ (0, 1) and period 2π.
(a) Show that the solutions can be written as

    y1 = cos E − e,
    y2 = sqrt(1 − e^2) sin E,
    y3 = sin E/(e cos E − 1),
    y4 = sqrt(1 − e^2) cos E/(1 − e cos E),

where E is the solution of Kepler’s equation

E − e sin E = x.

(b) Solve the problem using oderk and ode45, and plot the solution and the variation of the step length for x ∈ [0, 20] and a precision of 10^{-5}.
(c) Find the global errors, the number of function evaluations and the number of step rejections for tolerances of 10^{-4}, 10^{-5}, . . . , 10^{-11}.

Problem 9.8. [66] In the 1968 Olympic games in Mexico City, Bob Beamon established a world record
with a long jump of 8.90 m. This was 0.80m longer than the previous world record. Since 1968,
Beamon’s jump has been exceeded only once in competition, by Mike Powell’s jump of 8.95m in
Tokyo in 1991. After Beamon’s remarkable jump, some people suggested that the lower air resistance
at Mexico City’s 2250m altitude was a contributing factor. This problem examines that possibility. The
fixed Cartesian coordinate system has a horizontal x-axis, a vertical y-axis, and an origin at the takeoff
board. The jumper’s initial velocity has magnitude v0 and makes an angle with respect to the x-axis
of θ0 radians. The only forces acting after takeoff are gravity and the aerodynamic drag, D, which is
proportional to the square of the magnitude of the velocity. There is no wind. The equations describing
the jumper’s motion are

    x' = v cos θ,    y' = v sin θ,
    θ' = −(g/v) cos θ,    v' = −D/m − g sin θ.

The drag is

    D = (cρs/2) (x'^2 + y'^2).
Constants for this exercise are the acceleration of gravity, g = 9.81 m/s2 , the mass, m = 80 kg,
the drag coefficient, c = 0.72, the jumper’s cross-sectional area, s = 0.50 m2 , and the takeoff angle,
θ0 = 22.5◦ = π/8 radians. Compute four different jumps, with different values for initial velocity,
v0 , and air density, ρ. The length of each jump is x(tf ), where the air time, tf , is determined by the
condition y(tf ) = 0.
(a) "Nominal" jump at high altitude. v0 = 10 m/s and ρ = 0.94 kg/m^3.
(b) "Nominal" jump at sea level. v0 = 10 m/s and ρ = 1.29 kg/m^3.
(c) Sprinter's approach at high altitude. ρ = 0.94 kg/m^3. Determine v0 so that the length of the jump is Beamon's record, 8.90 m.
(d) Sprinter's approach at sea level. ρ = 1.29 kg/m^3 and v0 is the value determined in (c).
Present your results by completing the following table.

    v0     θ0     ρ      distance
    10     22.5   0.94   ???
    10     22.5   1.29   ???
    ???    22.5   0.94   8.90
    ???    22.5   1.29   ???

Which is more important, the air density or the jumper’s initial velocity?

Problem 9.9. Solve the stiff problem

    y1' = 1/y1 − x^2 − 2/x^3,
    y2' = y1/y2^2 − 1/x − 1/(2x^{3/2}),

x ∈ [1, 10], with initial conditions y1(1) = 1, y2(1) = 1, using a nonstiff solver and then a stiff solver. The exact solutions are y1 = 1/x^2, y2 = 1/√x.

Problem 9.10. The Van der Pol equation has the form

    y1'' − µ(1 − y1^2) y1' + y1 = 0,        (9.9.1)

where µ > 0 is a scalar parameter.
1. Solve the equation for µ = 1 on [0, 20] with initial conditions y(0) = 2 and y'(0) = 0 (nonstiff). Plot y and y'.
2. Solve the equation for µ = 1000 (stiff), on the time interval [0, 3000], with the vector of initial values [2; 0]. Plot y.

Problem 9.11. A cork of length L is on the point of being ejected from a bottle containing a fermenting liquid. The equations of motion of the cork may be written

    dv/dt = g(1 + q) [ (1 + x/d)^{-γ} + Rt/100 − (1 + qx/(L(1 + q))) ],   for x < L,
    dv/dt = 0,   for x ≥ L,
    dx/dt = v,
where
g is the acceleration due to gravity,
q is the friction-weight ratio of the cork,
x is the cork displacement in the neck of the bottle,
t is the time,
d is the length of the bottle neck,
R is the percentage rate at which the pressure is increasing,
γ is the adiabatic constant for the gas in the bottle (γ = 1.4).
The initial condition is x(0) = x′ (0) = 0. While x < L the cork is still in the bottle but it leaves the
bottle at x = L. Integrate the equations of motion with DOPRI5 (Table 9.5) and tolerance 0.000001 to
find the time at which the cork is ejected. Also find the velocity of ejection when
q = 20, L = 3.75cm, d = 5cm, R = 4.

Problem 9.12. A simple model of the human heartbeat gives

    ε x' = −(x^3 − Ax + c),
    c' = x,
where x(t) is the displacement from the equilibrium of the muscle fiber, c(t) is the concentration of a
chemical control, and ε and A are positive constants. Solutions are expected to be periodic. This can be
seen by plotting the solution in the phase plane (x on horizontal axis, c on the vertical), which should
produce a closed curve. Assume that ε = 1 and A = 3.

(a) Calculate x(t) and c(t), for 0 ≤ t ≤ 12 starting with x(0) = 0.1, c(0) = 0.1. Sketch the output
in the phase plane. What does the period appear to be?
(b) Repeat (a) with x(0) = 0.87, c(0) = 2.1.

Problem 9.13. Design and implement a step control strategy for Euler method with an error estimator
based on Heun method. Test on two problems in this chapter.

Problem 9.14. [66] Determine the trajectory of a spherical cannonball in a stationary Cartesian coordinate system that has a horizontal x-axis, a vertical y-axis, and an origin at the launch point. The initial velocity of the projectile in this coordinate system has magnitude v0 and makes an angle of θ0 radians with respect to the x-axis. The only forces acting on the projectile are gravity and the aerodynamic drag, D, which depends on the projectile's speed relative to any wind that might be present. The equations describing the motion of the projectile are

    x' = v cos θ,    y' = v sin θ,
    θ' = −(g/v) cos θ,    v' = −D/m − g sin θ.
Constants for this problem are the acceleration of gravity, g = 9.81 m/s^2, the mass, m = 15 kg, and the initial speed, v0 = 50 m/s. The wind is assumed to be horizontal and its speed is a specified function of time, w(t). The aerodynamic drag is proportional to the square of the projectile's velocity relative to the wind:

    D(t) = (cρs/2) ((x' − w(t))^2 + y'^2),

where c = 0.2 is the drag coefficient, ρ = 1.29 kg/m^3 is the density of air, and s = 0.25 m^2 is the projectile's cross-sectional area.
Consider four different wind conditions.
• No wind. w(t) = 0 for each t.
• Steady headwind. w(t) = −10m/s for all t.
• Intermittent tailwind. w(t) = 10m/s if the integer part of t is even, zero otherwise.
• Gusty wind. w(t) is a Gaussian random variable with mean zero and standard deviation 10 m/s.
The integer part of a real number t is denoted by ⌊t⌋ and is computed in MATLAB by floor(t). A Gaussian random variable with mean 0 and standard deviation σ is generated by sigma*randn.
For each of these four wind conditions, carry out the following computations. Find the 17 trajectories whose initial angles are multiples of 5 degrees, that is, θ0 = kπ/36 radians, k = 1, . . . , 17. Plot all 17 trajectories on one figure. Determine which of these trajectories has the greatest downrange distance.
For that trajectory, report the initial angle in degrees, the flight time, the downrange distance, the impact
velocity, and the number of steps required by the ordinary differential equation solver.
Which of the four wind conditions requires the most computation? Why?

Problem 9.15. [25] A parachutist jumps from an aeroplane traveling with speed v m/s at an altitude of y m. After a period of free fall, the parachute is opened at height yp. The equations of motion of the skydiver are

    x' = v cos θ,
    y' = v sin θ,
    v' = −D/M − g sin θ,    D = (1/2) ρ CD A v^2,
    θ' = −(g cos θ)/v,

where x is the horizontal coordinate, θ is the angle of descent, D is the drag force, and A is the reference area for the drag force, given by

    A = σ, if y ≥ yp;    A = S, if y < yp.
The constants are

    g = 9.81 m/s^2, M = 80 kg, ρ = 1.2 kg/m^3,
    σ = 0.5 m^2, S = 30 m^2, CD = 1.

Use an appropriate solver to simulate the descent of the parachutist. Use the interpolation facilities to determine the critical height yp and the time of impact. Also estimate the minimum velocity of the descent prior to parachute deployment.

Problem 9.16. In [44], the authors solve the following entertaining pursuit problem. Suppose that a rabbit follows a predefined path (r1(t), r2(t)) in the plane, and that a fox chases the rabbit in such a way that (a) at each moment the tangent of the fox's path points towards the rabbit and (b) the speed of the fox is some constant k times the speed of the rabbit. Then the path (y1(t), y2(t)) of the fox is given by the system of ODEs

    (d/dt) y1(t) = s(t)(r1(t) − y1(t)),
    (d/dt) y2(t) = s(t)(r2(t) − y2(t)),

where

    s(t) = k sqrt(r1'(t)^2 + r2'(t)^2) / sqrt((r1(t) − y1(t))^2 + (r2(t) − y2(t))^2).
Note that this ODE system becomes ill-defined if the fox approaches the rabbit. We let the rabbit follow an outward spiral,

    (r1(t), r2(t)) = sqrt(1 + t) (cos t, sin t),

and start the fox at y1(0) = 3, y2(0) = 0.
(a) Integrate the ODE system for k < 1 (the rabbit faster than the fox).
(b) Integrate the ODE system for k > 1 (the fox faster than the rabbit), using event handling facili-
ties.
(c) In each case, plot the trajectories of the animals and for the case (b) display the moment of
capture.
CHAPTER 10

Multivariate Approximation

10.1 Interpolation in Higher Dimensions


The problem of interpolation in several dimensions is more difficult than its univariate analogue. It exhibits unusual aspects which do not appear in the univariate case, and these aspects are apparent even in the bivariate case. The problem of multivariate interpolation has attracted the attention of researchers, both in the past and currently. Nevertheless, few classical books on numerical analysis treat it ([50, 11]).

10.1.1 Interpolation problem


Let us state the bivariate interpolation problem. Given a set of n distinct interpolation points (or nodes)

N = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ⊂ D ⊆ R2 (10.1.1)

and n real numbers c1 , . . . , cn , find a smooth and easily computed function F such that

F (xi , yi ) = ci , i = 1, . . . , n.

The terms smooth and easily computed have only informal or intuitive meaning. The set D is some
large domain that includes all nodes in (10.1.1).

10.1.2 Cartesian product and grid


In the particular case when the set of nodes N is a Cartesian product or a Cartesian grid

N = {x1 , x2 , . . . , xp } × {y1 , y2 , . . . , yq },

or
N = {(xi , yj ) : i = 1, . . . , p, j = 1, . . . , q} , (10.1.2)

407
408 Multivariate Approximation

the interpolation problem just described can be solved by a tensor product of univariate interpolation methods. An example of a Cartesian grid, in which p = 4 and q = 3, is shown in Figure 10.1. For convenience, we have numbered the x-points from left to right, and the y-points from bottom to top, although this is not necessary.

[Diagram: a 4 × 3 Cartesian grid of nodes, with abscissas x1, . . . , x4 and ordinates y1, y2, y3.]

Figure 10.1: A Cartesian grid of nodes

Suppose that we possess a linear interpolation scheme for the nodes x1, x2, . . . , xp. This will be a univariate process. We want to think of this as a linear operator P of the form

    (Pf)(x) = Σ_{i=1}^{p} f(xi) ui(x)        (10.1.3)

in which the functions ui have the cardinal property

    ui(xj) = δij,    i, j = 1, . . . , p.        (10.1.4)

In the case of ordinary univariate interpolation, these functions ui are given by the basic Lagrange polynomials

    ui(x) = ℓi(x) = ∏_{j=1, j≠i}^{p} (x − xj)/(xi − xj),    i = 1, . . . , p.        (10.1.5)

Notice that the operator P can be extended in a trivial manner to operate on functions of two or more variables. Thus, if f is a function of (x, y), we can write

    (Pf)(x, y) = Σ_{i=1}^{p} f(xi, y) ui(x).        (10.1.6)

One can see immediately that Pf is a function of two variables that interpolates f on the vertical lines

    Li = {(xi, y) : −∞ < y < ∞},    i = 1, . . . , p.        (10.1.7)
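The cardinal basis values ui(x) in (10.1.3) can be tabulated once for all evaluation points. In the implementations at the end of this section this is done by a routine pfl2b defined earlier in the book; a minimal sketch of such a routine (our version, under a hypothetical name) could be:

function L = lagbasis(x,u)
%LAGBASIS - values of the basic Lagrange polynomials (10.1.5)
% x - interpolation nodes (length p), u - evaluation points (length m)
% L - m x p matrix with L(k,i) = ell_i(u(k))
p = length(x); m = length(u); L = ones(m,p);
for i = 1:p
    for j = [1:i-1, i+1:p]
        L(:,i) = L(:,i).*(u(:)-x(j))/(x(i)-x(j));
    end
end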



Suppose that another operator is available for interpolation at the nodes y1, y2, . . . , yq. We write

    (Qf)(y) = Σ_{i=1}^{q} f(yi) vi(y),        (10.1.8)

where the functions vi are any convenient ones having the cardinal property

    vi(yj) = δij,    1 ≤ i, j ≤ q.        (10.1.9)

Again, Q can be extended to operate on bivariate functions by use of the equation

    (Qf)(x, y) = Σ_{i=1}^{q} f(x, yi) vi(y).        (10.1.10)

The function Qf interpolates f on all the horizontal lines

    Li = {(x, yi) : −∞ < x < ∞},    i = 1, . . . , q.        (10.1.11)

10.1.3 Boolean sum and tensor product


There are two useful bivariate interpolation operators that can now be constructed from P and Q; they are the product PQ, and the Boolean sum P ⊕ Q, defined by

    P ⊕ Q = P + Q − PQ.        (10.1.12)

More detailed formulae for these operators are easily derived from the definitions of P and Q. Thus

    (PQf)(x, y) = (P(Qf))(x, y) = Σ_{i=1}^{p} (Qf)(xi, y) ui(x)
                = Σ_{i=1}^{p} Σ_{j=1}^{q} f(xi, yj) vj(y) ui(x).        (10.1.13)

Since vj(yk) ui(xℓ) = δjk δiℓ, we see without difficulty that PQf is a function that interpolates f at all nodes (xi, yj). The tensor product notation P ⊗ Q is also used for the operator PQ.
In the same way, a formula for P ⊕ Q is

    ((P ⊕ Q)f)(x, y) = (Pf)(x, y) + (Qf)(x, y) − (PQf)(x, y)
                     = Σ_{i=1}^{p} f(xi, y) ui(x) + Σ_{j=1}^{q} f(x, yj) vj(y)
                       − Σ_{i=1}^{p} Σ_{j=1}^{q} f(xi, yj) ui(x) vj(y).        (10.1.14)

It is left as a problem to prove that the function (P ⊕ Q)f interpolates f on all the vertical lines Li, i = 1, . . . , p, and all the horizontal lines Lj, j = 1, . . . , q.
If RP and RQ are the rests (errors) of the univariate interpolants, then the rests for the tensor product and the Boolean sum are

    R_{PQ} = RP ⊕ RQ    and    R_{P⊕Q} = RP RQ,

respectively (see [17, 89, 98]).

Example 10.1.1. Give a formula for a polynomial in two variables for the following data set:

    (x, y)     (1,1)  (2,1)  (4,1)  (5,1)  (1,3)  (2,3)
    f(x, y)     1.7   -4.1   -3.2    4.9    6.1   -4.2
    (x, y)     (4,3)  (5,3)  (1,4)  (2,4)  (4,4)  (5,4)
    f(x, y)     2.3    7.5   -5.9    3.8   -1.7    2.5

Observe first that the nodes form a Cartesian grid, and the tensor product method is applicable. The functions ui and vj are given by Equation (10.1.5). In this example they are as follows:

    u1(x) = (x−2)/(1−2) · (x−4)/(1−4) · (x−5)/(1−5) = −(1/12)(x − 2)(x − 4)(x − 5),
    u2(x) = (1/6)(x − 1)(x − 4)(x − 5),
    u3(x) = −(1/6)(x − 1)(x − 2)(x − 5),
    u4(x) = (1/12)(x − 1)(x − 2)(x − 4),
    v1(y) = (y−3)/(1−3) · (y−4)/(1−4) = (1/6)(y − 3)(y − 4),
    v2(y) = −(1/2)(y − 1)(y − 4),
    v3(y) = (1/3)(y − 1)(y − 3).

The polynomial interpolant is then

    F(x, y) = u1(x) [1.7 v1(y) + 6.1 v2(y) − 5.9 v3(y)]
            + u2(x) [−4.1 v1(y) − 4.2 v2(y) + 3.8 v3(y)]
            + u3(x) [−3.2 v1(y) + 2.3 v2(y) − 1.7 v3(y)]
            + u4(x) [4.9 v1(y) + 7.5 v2(y) + 2.5 v3(y)].        (10.1.15)

If the function F in the preceding example is written as a sum of terms x^i y^j, the following 12 terms appear:

    1, x, x^2, x^3, y, xy, x^2 y, x^3 y, y^2, x y^2, x^2 y^2, x^3 y^2.        (10.1.16)

Thus, we are interpolating by means of a 12-dimensional subspace of bivariate polynomials. The proper notation for this subspace is Π3 ⊗ Π2. This is the tensor product of two linear spaces, and consists of all functions of the form

    (x, y) ↦ Σ_{i=1}^{m} ai(x) bi(y),

in which ai ∈ Π3 and bi ∈ Π2. (The sum can have any number of terms.) It is not difficult to prove that a basis for this space consists of the functions in (10.1.16).
It is to be emphasized that the theory just outlined applies to general functions ui and vi, not just to polynomials. All that is needed is the cardinality property. (In an abstract theory, one works directly with the operators P and Q; their detailed structure does not enter the analysis.)

In the tensor product method of polynomial interpolation, the general case will involve bivariate polynomials from the space Πp−1 ⊗ Πq−1, where p and q are the numbers of points that figure in Equation (10.1.2). A basis for this space is given by the functions

    (x, y) ↦ x^i y^j,    i = 0, . . . , p − 1,    j = 0, . . . , q − 1.        (10.1.17)

A generic element of the space is then of the form

    (x, y) ↦ Σ_{i=0}^{p−1} Σ_{j=0}^{q−1} cij x^i y^j.

The degree of a term x^i y^j is defined to be i + j. Thus the space Πp−1 ⊗ Πq−1 will contain one basis element of degree p + q − 2, namely x^{p−1} y^{q−1}, but it will not contain all terms of degree p + q − 2. For example, a term such as x^p y^{q−2} will not be present. The degree of a polynomial in (x, y) is defined to be the largest degree of the terms present in the polynomial. The space of all bivariate polynomials of degree at most k will be denoted here by Πk(R^2). A typical element of Πk(R^2) is a function of the form

    (x, y) ↦ Σ_{0≤i+j≤k} cij x^i y^j.        (10.1.18)

Theorem 10.1.2. A basis for Πk(R^2) is the set of functions

    (x, y) ↦ x^i y^j,    0 ≤ i + j ≤ k.

Proof. It is clear that this set spans Πk(R^2), and it is only necessary to prove its linear independence. Suppose therefore that the function in equation (10.1.18) is 0. If y is assigned a fixed value, say y = y0, then the equation

    Σ_{i=0}^{k} ( Σ_{j=0}^{k−i} cij y0^j ) x^i = 0

exhibits an apparent linear dependence among the functions x ↦ x^i. Since this set of functions is linearly independent, we conclude that

    Σ_{j=0}^{k−i} cij y0^j = 0,    i = 0, . . . , k.

In this equation y0 can be any point. By the linear independence of the set of functions

    y ↦ y^j,    j = 0, . . . , k,

we conclude that cij = 0 for all i and j.



Corollary 10.1.3. The dimension of Πk(R^2) is (1/2)(k + 1)(k + 2).

Proof. The basis elements of Πk(R^2) given in Theorem 10.1.2 can be arrayed as follows:

    x^k
    x^{k−1}   x^{k−1} y
    x^{k−2}   x^{k−2} y   x^{k−2} y^2
    ...       ...         ...          ...
    x^0       x^0 y       x^0 y^2      · · ·   x^0 y^k

The number of basis elements is thus

    1 + 2 + · · · + (k + 1) = (1/2)(k + 1)(k + 2).


MATLAB Sources 10.1 and 10.2 give the implementations of the bivariate Lagrange tensor product interpolant and of the bivariate Lagrange Boolean sum interpolant, respectively.

MATLAB Source 10.1 Bivariate tensor product Lagrange interpolant

function Z=tensprod(u,v,x,y,f)
%TENSPROD - tensor product Lagrange interpolant
%call Z=TENSPROD(U,V,X,Y,F)
%U - evaluation abscissas
%V - evaluation ordinates
%X - node abscissas
%Y - node ordinates
%F - function
[X,Y]=meshgrid(x,y);
F=f(X,Y);
lu=pfl2b(x,u)';
lv=pfl2b(y,v);
Z=lu*F*lv;

MATLAB Source 10.2 Bivariate Boolean sum Lagrange interpolant

function Z=boolsum(u,v,x,y,f)
%BOOLSUM - Boolean sum Lagrange interpolant
%call Z=BOOLSUM(U,V,X,Y,F)
%U - evaluation abscissas
%V - evaluation ordinates
%X - node abscissas
%Y - node ordinates
%F - function
[X,Y]=meshgrid(x,y);
F=f(X,Y);
[X1,V1]=meshgrid(x,v);
F1=f(X1,V1);
[U2,Y2]=meshgrid(u,y);
F2=f(U2,Y2);
lu=pfl2b(x,u);
lv=pfl2b(y,v);
Z=F1*lu+lv'*F2-lu'*F*lv;

Example 10.1.4. Consider the function f : [−2, 2] × [−2, 2] → R, f(x, y) = x e^{−x^2 − y^2}. Its graph, and the graphs of the tensor product and Boolean sum interpolants and of their rests, are given in Figure 10.2. We considered five equally spaced nodes on each axis, xk = yk = −2 + k, k = 0, . . . , 4. ♦
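The surfaces in Figure 10.2 can be generated along the following lines (a sketch of ours; surf is called without grid arguments since the orientation of Z depends on the convention of pfl2b):

f = @(x,y) x.*exp(-x.^2-y.^2);
x = -2:2; y = -2:2;            % five equally spaced nodes per axis
u = linspace(-2,2,41); v = u;  % evaluation grid
Zt = tensprod(u,v,x,y,f);      % tensor product interpolant
Zb = boolsum(u,v,x,y,f);       % Boolean sum interpolant
subplot(1,2,1), surf(Zt), title('tensor product')
subplot(1,2,2), surf(Zb), title('Boolean sum')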

[Five surface plots: (a) the function; (b) the tensor product interpolant; (c) the Boolean sum interpolant; (d) the rest for the tensor product; (e) the rest for the Boolean sum.]

Figure 10.2: Graph of f : [−2, 2] × [−2, 2] → R, f(x, y) = x e^{−x^2 − y^2} in Example 10.1.4 (Figure 10.2(a)), tensor product approximation (Figure 10.2(b)), Boolean sum approximation (Figure 10.2(c)) and the corresponding rest terms (Figures 10.2(d) and 10.2(e), respectively)

10.1.4 Geometry
Recall that in the one-variable case, Πk can be used for interpolation at any set of k + 1 nodes in R. It is natural to expect that for two variables, Πk(R^2) can be used to interpolate at any set of n ≡ (1/2)(k + 1)(k + 2) nodes. This expectation is not fulfilled, however, and a simple example will show this. Suppose that k = 1, so that n = 3. A generic element of Π1(R^2) has the form

    c0 + c1 x + c2 y.

If we attempt to solve an interpolation problem with three nodes (xi, yi), we are led to a linear system whose coefficient determinant is

    | 1  x1  y1 |
    | 1  x2  y2 |
    | 1  x3  y3 |.
This determinant will be zero when the nodes are collinear. In that case the interpolation problem will
be (in general) insoluble.
The preceding considerations indicate that the geometry of the node set N will determine whether interpolation by Πk(R^2) is possible on N. Of course, the number of nodes should be n = (1/2)(k + 1)(k + 2). Some theorems concerning this question will be given to illustrate what is known. The first is due to Micchelli [64]. Its proof uses Bézout's Theorem. The Theorem of Bézout states that if p ∈ Πk(R^2), if q ∈ Πm(R^2), and if p^2 + q^2 has more than km zeros, then p and q must have a common nonconstant factor.

Theorem 10.1.5 (Micchelli, 1986). Interpolation of arbitrary data by the subspace Πk(R^2) is possible on a set of (1/2)(k + 1)(k + 2) nodes if the nodes lie on lines L0, L1, . . . , Lk in such a way that (for each i) Li contains exactly i + 1 nodes.

Proof. Let N denote the set of nodes. Assuming the hypothesis of the theorem, we have card(N ∩ Li) = i + 1. The sets N ∩ Li must be pairwise disjoint, for if they were not, the following contradiction would arise:

    card(N) < Σ_{i=0}^{k} card(N ∩ Li) = Σ_{i=0}^{k} (i + 1) = (1/2)(k + 1)(k + 2).

Since the number of nodes is equal to the dimension of the space Πk(R^2), it suffices to show that the homogeneous interpolation problem has only the 0 solution. Accordingly, let p ∈ Πk(R^2), and suppose that p(x, y) = 0 for each point (x, y) in N. For each i, let ℓi be a linear function describing Li:

    Li = {(x, y) : ℓi(x, y) = 0},    0 ≤ i ≤ k.

Notice that p^2 + ℓk^2 has at least k + 1 zeros, namely the points in N ∩ Lk. Bézout's Theorem allows us to conclude that ℓk is a divisor of p. This argument can be repeated, for (p/ℓk)^2 + ℓ_{k−1}^2 has at least k zeros, and ℓ_{k−1} must be a divisor of p/ℓk. After k steps in this argument, the conclusion is drawn that p is divisible by ℓ1 ℓ2 · · · ℓk. Thus p is a scalar multiple of ℓ1 ℓ2 · · · ℓk, because p is of degree at most k. Since p vanishes on N ∩ L0 while ℓ1 ℓ2 · · · ℓk does not, p must be 0.

Because Bézout’s Theorem is limited to R2 , the same is true for Theorem 10.1.5. An algorithmic
proof of Theorem 10.1.5, not requiring Bézout’s Theorem, is given in Section 10.1.5.

Example 10.1.6. The sets of nodes in Figure 10.3 are suitable for interpolation by the space Π2(R²). ♦
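A numerical illustration of the theorem (a sketch, not from the text): place six nodes on three horizontal lines carrying 1, 2 and 3 nodes, respectively, and check that the collocation matrix of the basis 1, x, y, x², xy, y² of Π2(R²) is nonsingular:

% Sketch: checking unisolvence of Pi_2(R^2) as in Theorem 10.1.5
% L0 = {y=2} carries 1 node, L1 = {y=1} carries 2, L2 = {y=0} carries 3
nodes = [0 2; 0 1; 1 1; 0 0; 1 0; 2 0];    % hypothetical node choice
x = nodes(:,1); y = nodes(:,2);
V = [ones(6,1), x, y, x.^2, x.*y, y.^2];   % basis values at the nodes
fprintf('det(V) = %g\n', det(V))           % nonzero => interpolation possible
c = V \ [1 0 0 0 0 0]';                    % e.g. a cardinal polynomial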

A theorem closely related to Theorem 10.1.5 can be given in higher dimensional spaces, Rᵈ. The dimension of Πk(Rᵈ) is $\binom{d+k}{k}$. The following result is from Chung and Yao [13].

Figure 10.3: Node sets for interpolation by Π2 (R2 )


Theorem 10.1.7. Let k and d be given and set n = $\binom{d+k}{k}$. Let a set of n nodes z1, z2, . . . , zn be given in Rᵈ. If there exist hyperplanes Hij in Rᵈ, with 1 ≤ i ≤ n and 1 ≤ j ≤ k, such that
$$z_j \in \bigcup_{\nu=1}^{k} H_{i\nu} \iff j \neq i, \quad (1 \le i, j \le n), \tag{10.1.19}$$
then arbitrary data on the node set can be interpolated by polynomials in Πk(Rᵈ).

Proof. Each hyperplane is the kernel of a nonzero linear function, and we write
$$H_{ij} = \{ z \in \mathbb{R}^d : \ell_{ij}(z) = 0 \},$$
where ℓij ∈ Π1(Rᵈ). Define the functions
$$q_i(z) = \prod_{j=1}^{k} \ell_{ij}(z), \quad i = 1, \dots, n.$$
Now zi does not belong to any of the hyperplanes Hi1, Hi2, . . . , Hik, by Condition (10.1.19), and therefore ℓij(zi) ≠ 0 for j = 1, . . . , k. This proves that qi(zi) ≠ 0.
Again, by Condition (10.1.19), if j ≠ i, then zj ∈ Hiν for some ν, and consequently ℓiν(zj) = 0 and qi(zj) = 0. We define pi(z) = qi(z)/qi(zi); it holds that pi(zj) = δij. Since pi ∈ Πk(Rᵈ), we have a Lagrangian formula for interpolating a function f at the nodes by a polynomial of degree k:
$$P(z) = \sum_{i=1}^{n} f(z_i)\, p_i(z).$$

A node configuration satisfying the hypotheses of Theorem 10.1.7 is shown in Figure 10.4. Here
the dimension is d = 2, the degree is k = 2, and the number of nodes is n = 6.
A very weak result concerning polynomial interpolation at an arbitrary set of nodes is the following.

Theorem 10.1.8. The space Πk(R²) is capable of interpolating arbitrary data on any set of k + 1 distinct nodes in R².

Proof. If the nodes are (xi, yi), with i = 1, . . . , k + 1, we select a linear function

ℓ(x, y) = ax + by + c

Figure 10.4: Illustrating Theorem 10.1.7 (d = 2, k = 2, n = 6: six nodes and the lines Lij, several of which coincide)

such that the k + 1 numbers ti = ℓ(xi, yi) are all different (show that this is possible). If f is the function to be interpolated, we find p ∈ Πk(R) such that p(ti) = f(xi, yi). Then p ∘ ℓ ∈ Πk(R²) and

(p ∘ ℓ)(xi, yi) = p(ℓ(xi, yi)) = p(ti) = f(xi, yi). □

A function of the form f ∘ ℓ, where ℓ ∈ Π1(R²), is called a ridge function. Its graph is a ruled surface, since f ∘ ℓ remains constant on each line ℓ(x, y) = λ. Theorem 10.1.8 can be easily extended to Rᵈ.

10.1.5 A Newtonian scheme


For the practical implementation of any interpolation method, it is advantageous to have an algorithm
like Newton’s procedure in univariate polynomial interpolation. Recall that one feature of the Newton
scheme is that from a polynomial p interpolating f at nodes x1 , x2 , . . . , xn , we can easily obtain a
polynomial p∗ interpolating f at nodes x1 , x2 , . . . , xn , xn+1 by adding one term to p. Indeed, we put

q(x) = (x − x1 )(x − x2 ) · · · (x − xn )
p∗ (x) = p(x) + cq(x)
c = [f (xn+1 ) − p(xn+1 )]/q(xn+1 ).

The advantage of this algorithm is that an interpolating polynomial can be constructed step by step,
adding one new interpolation node and one term to p in each stage.
The abstract form of this procedure is as follows. Let X be a set and f a real-valued function defined on X. Let N be a set of nodes. If p is any function that interpolates f on N, and if q is any function that vanishes on N, then a function p∗ interpolating f on N ∪ {ξ} can be obtained in the form p∗ = p + cq, provided that q(ξ) ≠ 0.
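In MATLAB, one step of this update for univariate polynomials in coefficient (polyval) form can be sketched as follows (the function name is illustrative, not from the text):

% Sketch: adding the node xnew to a polynomial p interpolating f at xnodes
function pstar = newtonstep(p, xnodes, xnew, f)
%P - coefficients of the current interpolant (polyval ordering)
q = 1;
for k = 1:length(xnodes)
    q = conv(q, [1 -xnodes(k)]);     % q(x) = (x-x_1)...(x-x_n)
end
c = (f(xnew) - polyval(p, xnew)) / polyval(q, xnew);
pstar = [zeros(1,length(q)-length(p)), p] + c*q;   % p* = p + c q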

A more general version of this strategy deals with sets of nodes. Let q be a function from X to R and let Z be its zero set. If p interpolates f on N ∩ Z and r interpolates (f − p)/q on N\Z, then p + qr interpolates f on N.
The procedure just outlined can be used to give an algorithmic proof of Theorem 10.1.5. To begin,
select pk ∈ Πk (R2 ) that interpolates f on N ∩ Lk . (Use Theorem 10.1.8). Proceeding inductively
downward, suppose that pi has been found in Πk (R2 ) and interpolates f on all the nodes in Lk ∪
Lk−1 ∪ · · · ∪ Li . We shall attempt to construct pi−1 in Newton form

pi−1 = pi + rℓk ℓk−1 · · · ℓi .

It is clear that pi−1 will still interpolate f at the nodes in Lk ∪ Lk−1 ∪ · · · ∪ Li , since the term added
to pi vanishes on this set. In order to make pi−1 interpolate f on the nodes in Li−1 , we write

f (x) = pi (x) + r(x)(ℓk ℓk−1 · · · ℓi )(x), x ∈ N ∩ Li−1

from which we infer that r should interpolate (f − pi)/(ℓkℓk−1 · · · ℓi) on N ∩ Li−1. By Theorem 10.1.8, there is an r ∈ Πi−1(R²) that does so. Finally, observe that pi−1 ∈ Πk(R²), because r is of degree at most i − 1 and ℓkℓk−1 · · · ℓi has degree k − i + 1. This algorithm was given by Micchelli [64].
It is an interesting fact that no n-dimensional subspace in C(R²) can serve for interpolation at arbitrary sets of n nodes (except in the trivial case n = 1). This was probably first noticed by Haar in 1918, and his argument goes like this. Suppose that n functions u1, u2, . . . , un are given in C(R²). Let n nodes in R² be given, say pi = (xi, yi). If we wish to interpolate at these nodes using the base functions ui, we shall have to solve a linear system whose determinant is

$$D = \begin{vmatrix} u_1(p_1) & u_2(p_1) & \dots & u_n(p_1) \\ u_1(p_2) & u_2(p_2) & \dots & u_n(p_2) \\ \vdots & \vdots & \ddots & \vdots \\ u_1(p_n) & u_2(p_n) & \dots & u_n(p_n) \end{vmatrix}.$$

This determinant may be nonzero for the given set of nodes. However, let the first two of the nodes undergo a continuous motion in R² in such a way that during the motion these two points never coincide, nor do they coincide with any of the other nodes, yet at the end of the motion they have exchanged their original positions. By the rules of determinants, the determinant D will have changed sign (because rows 1 and 2 are interchanged). By continuity, D assumed the value zero during the continuous motion described. Hence, D will sometimes be zero, even for distinct nodes. The fact that two nodes can move in R² and exchange places without ever being coincident is characteristic of R², R³, . . . but not of R. This explains why interpolation in R², R³, . . . must be approached somewhat differently from R¹. What is usually done is to fix the nodes first and then to ask what subspaces of interpolating functions are suitable.

10.1.6 Shepard interpolation


A very general method of this type (in which the subspace depends on the nodes) is known as Shepard interpolation, after its originator, Shepard [85]. Let the (distinct) nodes be listed as
$$p_i = (x_i, y_i), \quad i = 1, \dots, n. \tag{10.1.20}$$
We shall use p and q to denote generic points in R², as this will make extensions to higher-dimensional spaces conceptually transparent. Next, we select a real-valued function φ on R² × R² subject to the sole condition that
$$\varphi(p, q) = 0 \iff p = q. \tag{10.1.21}$$

Examples that come to mind are φ(p, q) = ‖p − q‖ and φ(p, q) = ‖p − q‖². Next, we set up some cardinal functions in exact analogy with the Lagrange formulae in univariate approximation. This is done as follows:
$$u_i(p) = \prod_{\substack{j=1 \\ j\neq i}}^{n} \frac{\varphi(p, p_j)}{\varphi(p_i, p_j)}, \quad i = 1, \dots, n. \tag{10.1.22}$$

It is easy to see that these functions have the cardinality property

ui (pj ) = δij , i, j = 1, . . . , n.

This is a consequence of the hypothesis in (10.1.21). It follows that an interpolant to f at the given nodes is provided by the function
$$F = \sum_{i=1}^{n} f(p_i)\, u_i. \tag{10.1.23}$$

Example 10.1.9. Find the formulae for Shepard interpolation when ‖p − q‖² is used for φ(p, q).
Let pi = (xi, yi), p = (x, y) and
$$\varphi(p, p_j) = \|p - p_j\|^2 = (x - x_j)^2 + (y - y_j)^2.$$
Then
$$F(x, y) = \sum_{i=1}^{n} f(x_i, y_i) \prod_{\substack{j=1 \\ j\neq i}}^{n} \frac{(x - x_j)^2 + (y - y_j)^2}{(x_i - x_j)^2 + (y_i - y_j)^2}. \qquad \diamond$$
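A direct transcription of this formula into MATLAB might look as follows (a sketch; the function name is illustrative):

% Sketch: first-version Shepard value at one point, phi(p,q) = ||p-q||^2
function F = shep2(xp, yp, x, y, f)
%X,Y - node coordinate vectors, F - function values at the nodes
n = length(x);
F = 0;
for i = 1:n
    j = [1:i-1, i+1:n];                       % all indices except i
    num = (xp - x(j)).^2 + (yp - y(j)).^2;
    den = (x(i) - x(j)).^2 + (y(i) - y(j)).^2;
    F = F + f(i) * prod(num ./ den);          % f_i times the cardinal u_i(p)
end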

Another version of Shepard's method starts with the additional assumption that φ is a nonnegative function. Next, let
$$v_i(p) = \prod_{\substack{j=1 \\ j\neq i}}^{n} \varphi(p, p_j), \qquad v(p) = \sum_{i=1}^{n} v_i(p), \qquad w_i(p) = v_i(p)/v(p). \tag{10.1.24}$$

By our assumptions on φ, we have vi(pj) = 0 if i ≠ j and vi(p) > 0 for all points except p1, . . . , pi−1, pi+1, . . . , pn. It follows that v(p) > 0 and that wi is well defined. By construction, wi(pj) = δij and 0 ≤ wi(p) ≤ 1. Furthermore, Σⁿᵢ₌₁ wi(p) = 1. The interpolation process is given by the equation
$$F = \sum_{i=1}^{n} f(p_i)\, w_i = \sum_{i=1}^{n} f(p_i)\, v_i / v(p). \tag{10.1.25}$$
This process has two favorable properties not possessed by the previous version; namely, if the data are nonnegative, then the interpolant F will be a nonnegative function, and if f is a constant function, then F = f (F reproduces the constants). These two properties give evidence that the interpolant F inherits certain characteristics of the function being interpolated. On the other hand, if φ is differentiable, then F will exhibit a flat spot at each node. This is because 0 ≤ wi ≤ 1 and wi(pj) = δij, so that the nodes are extrema (maximum points or minimum points) of each wi. Thus the partial derivatives of wi are zero at each node, and consequently the same is true of F.
An important case of Shepard interpolation arises when the function φ is a power of the Euclidean distance:
$$\varphi(x, y) = \|x - y\|^{\mu} \quad (\mu > 0).$$
Here x and y can be points in Rˢ. It suffices to examine the simpler function g(x) = ‖x‖^µ at the questionable point x = 0. The directional derivative of g at zero is obtained by differentiating the function G(t) = ‖tu‖^µ, where u is a unit vector defining the direction. Since G(t) = |t|^µ, the derivative at t = 0 does not exist when 0 < µ ≤ 1, but for µ > 1, G′(0) = 0.
The formula for wi can be given in two ways:
$$w_i(x) = \prod_{\substack{j=1 \\ j\neq i}}^{n} \|x - x_j\|^{\mu} \Big/ \sum_{k=1}^{n} \prod_{\substack{j=1 \\ j\neq k}}^{n} \|x - x_j\|^{\mu}, \tag{10.1.26}$$
$$w_i(x) = \|x - x_i\|^{-\mu} \Big/ \sum_{j=1}^{n} \|x - x_j\|^{-\mu}. \tag{10.1.27}$$
The second equation must be used with care, since its right-hand side assumes the indeterminate form ∞/∞ at xi.
MATLAB Source 10.3 gives an implementation of Shepard interpolation, based on formula (10.1.26).

MATLAB Source 10.3 Shepard interpolation


function z=Shepgrid(xp,yp,x,y,f,mu)
%SHEPGRID - computes Shepard interpolant values on a grid
%call Z=SHEPGRID(XP,YP,X,Y,F,MU)
% XP,YP - points
% X,Y - node coordinates
% F - function values at nodes
% MU - exponent
if ~isequal(size(xp),size(yp))
    error('xp and yp have not the same size')
end
[m,n]=size(xp);
z=zeros(m,n);                        %preallocate the result
for i=1:m
    for j=1:n
        z(i,j)=Shep1pt(xp(i,j),yp(i,j),x,y,f,mu);
    end
end

function z=Shep1pt(xp,yp,x,y,f,mu)
%SHEP1PT - computes Shepard interpolant value at one point
%call Z=SHEP1PT(XP,YP,X,Y,F,MU)
% XP,YP - the point
% X,Y - node coordinates
% F - function values at nodes
% MU - exponent
d=(sqrt((xp-x).^2+(yp-y).^2)).^mu;   %distances to the nodes, raised to mu
n=length(x);
A=zeros(size(f));
for i=1:n
    A(i)=prod(d([1:i-1,i+1:n]));     %numerator products in (10.1.26)
end
z=sum(A.*f)/sum(A);

Example 10.1.10. Plot the Shepard interpolant for the function in Example 10.1.4, f : [−2, 2] × [−2, 2] → R, f(x, y) = x e^{−x²−y²}. The graph of the Shepard interpolant, for µ = 2 and µ = 3 and 100 random nodes, is given in Figure 10.5. The MATLAB sequence for the left figure is

>> P=4*rand(2,100)-2;
>> f=P(1,:).*exp(-P(1,:).^2-P(2,:).^2);
>> [X,Y]=meshgrid(linspace(-2,2,50));
>> z2=Shepgrid(X,Y,P(1,:),P(2,:),f,2);
>> surf(X,Y,z2)

and analogously for the right figure. ♦

Figure 10.5: Graph of the Shepard interpolant for the function in Example 10.1.4, for µ = 2 (left) and µ = 3 (right)

A local multivariate interpolation method of Franke and Little is designed so that the datum at one
node will have a very small influence on the interpolation function at points far from that node. Given
nodes (xi , yi ), i = 1, . . . , n, we introduce functions
$$g_i(x, y) = \Big(1 - r_i^{-1}\sqrt{(x - x_i)^2 + (y - y_i)^2}\,\Big)_{+}^{\mu}.$$
The subscript '+' indicates that when the quantity inside the parentheses is negative, it is replaced by 0. This occurs if (x, y) is far from the node (xi, yi).
If ri is chosen to be the distance from (xi, yi) to the nearest neighboring node, then gi(xj, yj) = δij. In this case, we interpolate an arbitrary function f by means of the function
$$\sum_{i=1}^{n} f(x_i, y_i)\, g_i(x, y).$$

In the sequel we consider a slightly modified variant of the Franke–Little weights. The basic functions are given by
$$\bar{w}_k(x) = \frac{\dfrac{(R - \|x - x_k\|)_+^{\mu}}{R^{\mu}\|x - x_k\|^{\mu}}}{\displaystyle\sum_{i=0}^{n} \frac{(R - \|x - x_i\|)_+^{\mu}}{R^{\mu}\|x - x_i\|^{\mu}}}, \tag{10.1.28}$$
where R > 0 is a given constant, and the interpolant has the form
$$S(f, x) = \sum_{k=0}^{n} \bar{w}_k(x)\, f(x_k). \tag{10.1.29}$$
The function Shepgridbloc, given in MATLAB Source 10.4, computes the interpolant given by (10.1.29) and (10.1.28).

MATLAB Source 10.4 Local Shepard interpolation


function z=Shepgridbloc(xp,yp,x,y,f,mu,R)
%SHEPGRIDBLOC - computes local Shepard interpolant values on a grid
% using the barycentric form and Franke-Little weights
% call z=Shepgridbloc(xp,yp,x,y,f,mu,R)
% xp,yp - the points
% x,y - node coordinates
% f - function values at nodes
% mu - exponent
% R - radius
if ~isequal(size(xp),size(yp))
    error('xp and yp have not the same size')
end
[m,n]=size(xp);
z=zeros(m,n);
for i=1:m
    for j=1:n
        z(i,j)=Shep1barloc(xp(i,j),yp(i,j),x,y,f,mu,R);
    end
end

function z=Shep1barloc(xp,yp,x,y,f,mu,R)
%SHEP1BARLOC - computes local Shepard interpolant value at one point
d=sqrt((xp-x).^2+(yp-y).^2);       %distances from the point to the nodes
n=length(x);
w=zeros(1,n);
ix=(d<R);                          %only nodes within radius R contribute
w(ix)=((R-d(ix))./(R*d(ix))).^mu;  %Franke-Little weights, cf. (10.1.28)
z=sum(w*f)/sum(w);

Example 10.1.11. We test Shepgridbloc for the function in Example 10.1.4, f : [−2, 2] × [−2, 2] → R, f(x, y) = x e^{−x²−y²}, with 201 randomly generated nodes, µ = 2, 3 and R = 0.4:

ftest=@(x,y) x.*exp(-x.^2-y.^2);
nX=4*rand(201,1)-2;
nY=4*rand(201,1)-2;
nX=nX(:); nY=nY(:); f=ftest(nX,nY);
[X,Y]=meshgrid(linspace(-2,2,113));
Z2a=Shepgridbloc(X,Y,nX,nY,f,2,0.4);
Z3a=Shepgridbloc(X,Y,nX,nY,f,3,0.4);
surf(X,Y,Z2a)

Figure 10.6: Local Shepard interpolants in Example 10.1.11: µ = 2, R = 0.4 (left) and µ = 3, R = 0.4 (right)

title('\mu=2, R=0.4','FontSize',14)
shading interp; camlight headlight
figure(2)
surf(X,Y,Z3a)
title('\mu=3, R=0.4','FontSize',14)
shading interp; camlight headlight

See Figure 10.6 for output. ♦

For details on local Shepard interpolation, including MATLAB implementation, see [97].

10.1.7 Triangulation
Another general strategy for interpolating functions given on R2 begins by creating a triangulation.
Informally, this means that triangles are drawn by joining nodes. In the end, we shall have a family
of triangles, T1 , T2 , . . . , Tm . We consider this collection of triangles to be the triangulation. The
following rules must be satisfied:
1. Each interpolation node must be the vertex of some triangle Ti .
2. Each vertex of a triangle in the collection must be a node.
3. If a node belongs to a triangle, it must be a vertex of that triangle.
The effect of Rule 3 is to disallow the construction shown in Figure 10.7.
The simplest type of interpolation on a triangulation is the piecewise linear function that interpolates
a function f at all the vertices of all triangles. In any triangle, Ti , a linear function will be prescribed:

ℓi (x, y) = ai x + bi y + ci , (x, y) ∈ Ti .

The coefficients in ℓi are uniquely determined by the prescribed function values at the vertices of Ti .
This can be seen as an application of Theorem 10.1.5, for L1 in that theorem can be taken to be one
side of the triangle, and L0 can be a line parallel to L1 containing the vertex not on L1 . Let us consider
the situation shown in Figure 10.8.
The line segment joining (x2 , y2 ) to (x3 , y3 ) is common to both triangles. This line segment can
be represented as
{t(x2 , y2 ) + (1 − t)(x3 , y3 ) : 0 ≤ t ≤ 1} .

Figure 10.7: Illegal triangulations

Figure 10.8: Continuity in a triangulation: triangles T1, with vertices (x1, y1), (x2, y2), (x3, y3), and T2, with vertices (x2, y2), (x3, y3), (x4, y4), sharing an edge

The variable t can be considered to be the coordinate for the points on the line segment. The linear
function ℓ1 , when restricted to this line segment, will be a linear function of the single variable t,
namely
a1 (tx2 + (1 − t)x3 ) + b1 (ty2 + (1 − t)y3 ) + c1
or
(a1 x2 − a1 x3 + b1 y2 − b1 y3 ) t + (a1 x3 + b1 y3 + c1 ).
This linear function of t is completely determined by the interpolation conditions at (x2 , y2 ) and
(x3 , y3 ). The same remarks pertain to the linear function ℓ2 . Thus ℓ1 and ℓ2 agree on this line segment,
and the piecewise linear function defined on T1 ∪ T2 is continuous. This proves the following result.

Theorem 10.1.12. Let {T1 , T2 , . . . , Tm } be a triangulation in the plane. The piecewise linear function
taking prescribed values at all the vertices of all the triangles Ti is continuous.
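On a single triangle the coefficients ai, bi, ci are obtained from a 3 × 3 linear system; a minimal sketch (the function name is illustrative):

% Sketch: linear interpolant on one triangle with vertices (xv,yv), values fv
function fl = trilin(xv, yv, fv, x, y)
A = [xv(:), yv(:), ones(3,1)];      % rows [x_i y_i 1]
abc = A \ fv(:);                    % solve for the coefficients a, b, c
fl = abc(1)*x + abc(2)*y + abc(3);  % l(x,y) = a*x + b*y + c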

Consider next the use of piecewise quadratic functions on a triangulation. In each triangle, Ti, a quadratic polynomial will be prescribed:
$$q_i(x, y) = a_1 x^2 + a_2 xy + a_3 y^2 + a_4 x + a_5 y + a_6.$$
Six conditions will be needed to determine the six coefficients. One such set of conditions consists of
values at the vertices of the triangle and the midpoints of the sides. Again, an application of Theorem
10.1.5 shows that this interpolation is always uniquely possible. Indeed, in that theorem, L2 can be one
side of the triangle, L1 can be the line passing through two midpoints not on L2 , and L0 can be a line
containing the remaining vertex but no other node. (See Figure 10.9). Reasoning as before, we see that
the global piecewise quadratic function will be continuous because the three prescribed function values
on the side of a triangle determine the quadratic function of one variable on that side.

Figure 10.9: Applying Theorem 10.1.5 to quadratic interpolation on a triangle (the lines L0, L1, L2)

10.1.8 Moving least squares


Another versatile method of smoothing and interpolating multivariate functions is called moving least
squares. First it is explained in a general setting, and then some specific examples will be given.
We start with a set X that is the domain of the functions involved. For example, X can be R, or R2 ,
or a subset of either. Next, a set of nodes {x1 , x2 , . . . , xn } is given. These are the points at which a
certain function f has been sampled. Thus the values f (xi ) are known, for i = 1, . . . , n. For purposes
of approximation, we select a set of functions u1 , u2 , . . . , um . These are real-valued functions defined
on X. The number m will usually be very small relative to n.
In the familiar least-squares method, a set of nonnegative weights wi ≥ 0 is given. We try to find coefficients c1, c2, . . . , cm to minimize the expression
$$\sum_{i=1}^{n} \left[ f(x_i) - \sum_{j=1}^{m} c_j u_j(x_i) \right]^2 w_i.$$

This is the sum of squares of the residuals. If we choose the discrete scalar product
$$\langle f, g \rangle = \sum_{i=1}^{n} w_i f(x_i) g(x_i),$$
the solution is characterized by the orthogonality condition
$$f - \sum_{j=1}^{m} c_j u_j \perp u_i, \quad i = 1, \dots, m.$$

This leads to the normal equations
$$\sum_{j=1}^{m} c_j \langle u_j, u_i \rangle = \langle f, u_i \rangle, \quad i = 1, \dots, m.$$

How does the moving least squares method differ from the previous procedure? The weights wi
are now allowed to be functions of x. The formalism of the usual least squares method can be retained,
although the following notation may be better:
$$\langle f, g \rangle_x = \sum_{i=1}^{n} f(x_i) g(x_i) w_i(x).$$

The normal equations now should be written in the form
$$\sum_{j=1}^{m} c_j(x) \langle u_j, u_i \rangle_x = \langle f, u_i \rangle_x, \quad i = 1, \dots, m,$$
and the final approximation will be
$$g(x) = \sum_{j=1}^{m} c_j(x) u_j(x).$$

The computation necessary to produce this function will be quite formidable if m is large, for the normal equations change with x. For this reason, m is usually no greater than 10.
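To make the procedure concrete, here is a univariate sketch with the basis {1, x, x²} (m = 3) and weights wi(x) = |x − xi|⁻²; the function name and the eps regularization are illustrative choices, not from the text:

% Sketch: moving least squares value at one point t (univariate, m = 3)
function g = mls(t, x, f)
%X - sample points (column vector), F - sampled values (column vector)
U = [ones(size(x)), x, x.^2];    % basis u_1=1, u_2=x, u_3=x^2 at the nodes
w = 1 ./ ((t - x).^2 + eps);     % weights w_i(t); eps avoids division by 0
G = U' * (w .* U);               % Gram matrix <u_j,u_i>_t (implicit expansion)
b = U' * (w .* f);               % right-hand side <f,u_i>_t
c = G \ b;                       % normal equations, solved anew for each t
g = [1, t, t.^2] * c;            % g(t) = sum_j c_j(t) u_j(t)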
The weight function can be used to achieve several desirable effects. First, if wi (x) is “strong” at xi ,
the function g will nearly interpolate f at xi . In the limiting case, wi (xi ) = +∞, and g(xi ) = f (xi ).
If wi (x) decreases rapidly to zero when x moves away from xi , then the nodes far from xi will have
little effect on g(xi ).
A choice for wi that achieves these two objectives in a space Rᵈ is
$$w_i(x) = \|x - x_i\|^{-2},$$
where any norm can be employed, although the Euclidean norm is usual.
If the moving least squares procedure is used with a single function, u1(x) ≡ 1, and with weight functions like the one just mentioned, then Shepard's method will result. To see that this is so, write the normal equation for this case, with c1(x) = c(x), u1(x) = u(x) = 1:
$$c(x)\, \langle u, u \rangle_x = \langle f, u \rangle_x.$$

The approximating function will be
$$g(x) = c(x) u(x) = c(x) = \frac{\langle f, u \rangle_x}{\langle u, u \rangle_x} = \sum_{i=1}^{n} f(x_i) w_i(x) \Big/ \sum_{j=1}^{n} w_j(x).$$
If wi(x) = ‖x − xi‖⁻², then, after removing the singularities, wi/Σⁿⱼ₌₁ wj has the cardinal property: it takes the value 1 at xi and the value zero at all other nodes.

10.1.9 Interpolation by radial basis functions


Consider a given function f ∈ C(Ω), Ω ⊆ Rⁿ, and a very large data set that consists of two parts: a finite set X = {x1, . . . , xM} of M scattered points (nodes) in Ω and real numbers {f1, . . . , fM} (approximate values of f at the given points). We want to interpolate f. We choose the following basis:
$$\varphi : \mathbb{R}_+ \to \mathbb{R}, \qquad \Phi(x, y) = \varphi(\|x - y\|_2).$$
These functions must be invariant to translation and rotation, and they are called radial basis functions. Reconstruction by interpolation on X requires solving the linear system
$$\sum_{j=1}^{M} \alpha_j \Phi(x_k, x_j) = f_k, \quad k = 1, \dots, M, \tag{10.1.30}$$
for α1, . . . , αM, or, in matrix form,
$$A\alpha = f,$$
where A = (Φ(xk, xj))₁≤j,k≤M. To ensure the uniqueness of the solution of (10.1.30), A must be nonsingular.

Definition 10.1.13. A function f is completely monotone on [0, ∞) if

(a) f ∈ C[0, ∞);

(b) f ∈ C∞(0, ∞);

(c) (−1)ᵏ f⁽ᵏ⁾(t) ≥ 0 for t > 0 and k = 0, 1, . . . .

Theorem 10.1.14 (Schoenberg [11]). If a function f is completely monotone and not constant on [0, ∞), then for any distinct points x1, . . . , xn the matrix A = (f(‖xi − xj‖²)) is positive definite (and therefore nonsingular).

Thus, Schoenberg's theorem provides a large class of functions for which interpolation is possible by expressions of the following form:
$$\sum_{j=1}^{n} c_j f(\|x - x_j\|).$$

Here are some examples:
$$\varphi(r) = e^{-\beta r}, \ \beta > 0 \quad \text{(Gaussian)};$$
$$\varphi(r) = (c^2 + r^2)^{\beta/2}, \ \beta < 0 \quad \text{(inverse multiquadric)};$$
$$\varphi(r) = (1 - r)_+^4 (1 + 4r) \quad \text{(Wendland)}.$$
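For instance, a quick numerical illustration of Schoenberg's theorem for the Gaussian (a sketch, not from the text):

% Sketch: the Gaussian kernel matrix is positive definite (Schoenberg)
beta = 2; n = 50;
x = rand(n,1); y = rand(n,1);      % n random distinct points in the plane
D2 = (x - x').^2 + (y - y').^2;    % squared Euclidean distances (implicit expansion)
A = exp(-beta*D2);                 % A_ij = f(||x_i-x_j||^2), f(t) = e^{-beta t}
min(eig(A))                        % positive, as the theorem predicts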

Another related result is

Theorem 10.1.15 (Micchelli, [65]). Suppose F′ is completely monotone but not constant on (0, ∞), F is continuous on [0, ∞) and positive on (0, ∞). Then for any distinct points x1, . . . , xn from Rⁿ it holds that
$$(-1)^{n-1} \det\big(F(\|x_i - x_j\|^2)\big) > 0.$$


Hence the matrix Aij = F(‖xi − xj‖²) is nonsingular. An example of a function that fulfills the conditions of Micchelli's Theorem is
$$F(t) = (1 + t)^{1/2}. \tag{10.1.31}$$

We look for an interpolant
$$s(x) = \sum_{j=1}^{n} \alpha_j \varphi(\|x - x_j\|),$$
by solving the linear system
$$\sum_{j=1}^{n} \alpha_j \varphi(\|x_i - x_j\|) = f(x_i), \quad i = 1, \dots, n.$$
The function given by (10.1.31) leads us to a multivariate interpolation process called interpolation by multiquadrics.
A variant of this process is the one proposed by R. Hardy [42], which uses as its basic functions
$$z_i(p) = \big(\|p - p_i\|^2 + c^2\big)^{1/2}, \quad i = 1, \dots, n.$$
Here the norm is Euclidean, and c is a parameter that Hardy suggested be set equal to 0.8 times the average distance between nodes. The nonsingularity of the coefficient matrix is due to Micchelli [65].
The MATLAB Source 10.5 computes the values of the interpolant at a given set of points for a
given set of nodes, a given set of function values at nodes and a given basic function φ. By default, φ is
the function given by (10.1.31).
Example 10.1.16. Plot the Gaussian and multiquadric radial basis function interpolants for the function in Example 10.1.4, f : [−2, 2] × [−2, 2] → R, f(x, y) = x e^{−x²−y²}. The graphs are given in the left column of Figure 10.10. The right column plots the errors. The MATLAB sequence for the figure is

ftest=@(x,y) x.*exp(-x.^2-y.^2);
phi=@(r) exp(-2*r);
nX=[4*rand(100,1)-2;-2;-2;2;2];
nY=[4*rand(100,1)-2;-2;2;-2;2];
[X,Y]=meshgrid(linspace(-2,2,40));
nX=nX(:); nY=nY(:); f=ftest(nX,nY);
Z1=RBF(X(:),Y(:),nX,nY,f,phi);
Z2=RBF(X(:),Y(:),nX,nY,f);
ZE=ftest(X,Y); T=delaunay(X(:),Y(:));
G1=del2(Z1); G2=del2(Z2);
figure(1); trisurf(T,X,Y,Z1,G1)
figure(2); trisurf(T,X,Y,abs(Z1-ZE(:)),G1)
figure(3); trisurf(T,X,Y,Z2,G2)
figure(4); trisurf(T,X,Y,abs(Z2-ZE(:)),G2)

We used the delaunay function and trisurf to obtain a better representation. ♦

For other examples, see [20].


Further references on multivariate interpolation are Kincaid and Cheney [50], Chui [12], Hartley
[43], Micchelli [64], Franke [30], Cătinaş [20], and Lancaster and Salkauskas [54].
References on Shepard interpolation are Shepard [85], Gordon and Wixom [40], Newman and
Rivlin [69], Barnhill, Dube and Little [3], Farwig [28], and Coman and Trı̂mbiţaş [15, 16].

MATLAB Source 10.5 Interpolation with radial basis function


function Z=RBF(X,Y,nX,nY,f,phi)
%RBF - radial basis function interpolant
%call Z=RBF(X,Y,nX,nY,f,phi)
%X,Y - points for evaluation
%nX,nY - nodes
%f - function values at nodes
%phi - radial basis function

if nargin<6
    phi=@(x) (x+1).^(1/2);
end
D=sqdist(nX,nY); %compute square of distances
%find coefficients
A=phi(D);
a=A\f;
%compute interpolant values
n=length(nX);
Z=zeros(size(X));
for j=1:n
    Z=Z+a(j)*phi((X-nX(j)).^2+(Y-nY(j)).^2);
end

function M=sqdist(X,Y)
%compute squares of distances
[rX1,rX2]=meshgrid(X);
[rY1,rY2]=meshgrid(Y);
M=(rX1-rX2).^2+(rY1-rY2).^2;

10.2 Multivariate Numerical Integration


Let D ⊆ Rⁿ, f : D → R an integrable function, Pi, i = 0, . . . , m, points in D, and w a nonnegative weight function defined on D. A formula of the form
$$\int \cdots \int_D w(x_1,\dots,x_n) f(x_1,\dots,x_n)\, dx_1 \dots dx_n = \sum_{i=0}^{m} A_i f(P_i) + R_m f \tag{10.2.1}$$
is called a numerical integration formula for f, or a cubature formula. The parameters Ai are called the weights or coefficients of the formula, the points Pi are its nodes, and Rm f is the remainder term.
An efficient method for the construction of cubature formulas when D is a rectangular domain consists of expressing its coefficients and nodes in terms of the coefficients and nodes of a univariate formula. We take into account only the bivariate case.
Let D = [a, b] × [c, d] be a rectangle, ∆x the grid a = x0 < x1 < · · · < xm = b, ∆y the grid c = y0 < y1 < · · · < yn = d, and w(x, y) = 1, ∀(x, y) ∈ D. Formula (10.2.1) becomes, in this case,
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \sum_{i=0}^{m} \sum_{j=0}^{n} A_{ij} f(x_i, y_j) + R_{m,n}(f). \tag{10.2.2}$$

Figure 10.10: Graphs of the radial basis interpolants for the function in Example 10.1.16: Gaussian (a) and its error (b); multiquadrics (c) and its error (d)

We can generate such a formula starting from the bivariate Lagrange interpolation formula
$$f = L_m^x L_n^y f + (R_m^x \oplus R_n^y) f.$$

If f ∈ C^{m+1,n+1}(D), by integrating the previous formula term by term, one obtains
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \sum_{i=0}^{m} \sum_{j=0}^{n} A_i B_j f(x_i, y_j) + R_{mn}(f), \tag{10.2.3}$$
where
$$A_i = \int_a^b \ell_i(x)\, dx, \qquad B_j = \int_c^d \tilde{\ell}_j(y)\, dy,$$
and
$$\begin{aligned}
R_{m,n}(f) &= \int_a^b\!\int_c^d (R_m^x \oplus R_n^y) f(x,y)\, dx\, dy \\
&= \frac{1}{(m+1)!} \int_a^b\!\int_c^d u_m(x)\, f^{(m+1,0)}(\xi_x, y)\, dx\, dy \\
&\quad + \frac{1}{(n+1)!} \int_a^b\!\int_c^d u_n(y)\, f^{(0,n+1)}(x, \eta_y)\, dx\, dy \\
&\quad - \frac{1}{(m+1)!}\,\frac{1}{(n+1)!} \int_a^b\!\int_c^d u_m(x) u_n(y)\, f^{(m+1,n+1)}(\tilde{\xi}_x, \tilde{\eta}_y)\, dx\, dy.
\end{aligned}$$

If ∆x and ∆y are uniform grids of the intervals [a, b] and [c, d], respectively, then cubature formula
(10.2.3) is called a Newton-Cotes cubature formula.

Particular cases. For m = n = 1 one obtains the trapezoidal cubature formula
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \frac{(b-a)(d-c)}{4}\, [f(a,c) + f(a,d) + f(b,c) + f(b,d)] + R_{11}(f),$$
where
$$R_{11}(f) = -\frac{(b-a)^3(d-c)}{12} f^{(2,0)}(\xi_1,\eta_1) - \frac{(b-a)(d-c)^3}{12} f^{(0,2)}(\xi_2,\eta_2) - \frac{(b-a)^3(d-c)^3}{144} f^{(2,2)}(\xi_3,\eta_3).$$
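As a quick sanity check (a sketch, not from the text), the formula is exact for the bilinear function f(x, y) = xy, since all the second partial derivatives appearing in R11 vanish for it:

% Sketch: trapezoidal cubature on [0,1]x[0,1] for f(x,y) = x*y
f = @(x,y) x.*y;
a = 0; b = 1; c = 0; d = 1;
Q = (b-a)*(d-c)/4*(f(a,c)+f(a,d)+f(b,c)+f(b,d))
% Q = 0.25, which equals the exact integral 1/4; R11(f) = 0 here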
For m = n = 2 one obtains the Simpson cubature formula
$$\begin{aligned}
\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \frac{(b-a)(d-c)}{36} \Big[ & f(a,c) + f(a,d) + f(b,c) + f(b,d) \\
& + 4\Big( f\big(\tfrac{a+b}{2}, c\big) + f\big(\tfrac{a+b}{2}, d\big) + f\big(a, \tfrac{c+d}{2}\big) + f\big(b, \tfrac{c+d}{2}\big) \Big) \\
& + 16\, f\big(\tfrac{a+b}{2}, \tfrac{c+d}{2}\big) \Big] + R_{22}(f),
\end{aligned}$$
where
$$R_{22}(f) = -\frac{(b-a)^5(d-c)}{2880} f^{(4,0)}(\xi_1,\eta_1) - \frac{(b-a)(d-c)^5}{2880} f^{(0,4)}(\xi_2,\eta_2) - \frac{(b-a)^5(d-c)^5}{2880^2} f^{(4,4)}(\xi_3,\eta_3).$$
By partitioning the intervals [a, b] and [c, d] we can obtain iterated cubature formulae. We illustrate this for Simpson's formula. Suppose [a, b] is divided into m equal-length subintervals and [c, d] into n equal-length subintervals (we have a grid with mn rectangles). We will divide each rectangle into four smaller rectangles, as in Figure 10.11 (the vertices of the rectangles are indicated by black circles, and the intermediate points by lighter circles).
Let
$$h = \frac{b-a}{2m}, \qquad k = \frac{d-c}{2n}.$$

Figure 10.11: Subdivision for the iterated Simpson formula

The coordinates of the subdivision points will be
$$x_i = x_0 + ih, \quad x_0 = a, \quad i = 0, \dots, 2m, \qquad y_j = y_0 + jk, \quad y_0 = c, \quad j = 0, \dots, 2n.$$
We introduce the notation fij := f(xi, yj). Applying the elementary Simpson formula to each rectangle of the grid, we have
$$\begin{aligned}
\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \frac{hk}{9} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big[ & f_{2i,2j} + f_{2i+2,2j} + f_{2i+2,2j+2} + f_{2i,2j+2} \\
& + 4\,(f_{2i+1,2j} + f_{2i+2,2j+1} + f_{2i+1,2j+2} + f_{2i,2j+1}) \\
& + 16\, f_{2i+1,2j+1} \big] + R_{m,n}(f).
\end{aligned}$$

After simplification, one obtains
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \frac{hk}{9} \sum_{i=0}^{2m} \sum_{j=0}^{2n} \lambda_{ij} f_{ij} + R_{m,n}(f),$$
where the λij are the entries of the matrix
$$\Lambda = \begin{pmatrix}
1 & 4 & 2 & 4 & 2 & \dots & 4 & 2 & 4 & 1 \\
4 & 16 & 8 & 16 & 8 & \dots & 16 & 8 & 16 & 4 \\
2 & 8 & 4 & 8 & 4 & \dots & 8 & 4 & 8 & 2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
2 & 8 & 4 & 8 & 4 & \dots & 8 & 4 & 8 & 2 \\
4 & 16 & 8 & 16 & 8 & \dots & 16 & 8 & 16 & 4 \\
1 & 4 & 2 & 4 & 2 & \dots & 4 & 2 & 4 & 1
\end{pmatrix}.$$

When f ∈ C^{4,4}(D), [10, 88] give the following expression for the remainder:
$$R_{mn}(f) = -\frac{(b-a)(d-c)}{180} \big[ h^4 f^{(4,0)}(\xi_1,\eta_1) + k^4 f^{(0,4)}(\xi_2,\eta_2) \big],$$
for some ξ1, η1, ξ2, η2 ∈ D.
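Note that Λ is just the outer product of the univariate composite Simpson weight vectors [1 4 2 ... 2 4 1], which suggests a compact MATLAB sketch of the iterated formula (the function name is illustrative; f must accept matrix arguments):

% Sketch: iterated Simpson cubature on [a,b]x[c,d] with the weight matrix Lambda
function Q = simpson2d(f, a, b, c, d, m, n)
h = (b-a)/(2*m); k = (d-c)/(2*n);
x = a + (0:2*m)*h; y = c + (0:2*n)*k;
wx = ones(1,2*m+1); wx(2:2:end) = 4; wx(3:2:end-1) = 2;  % 1 4 2 ... 2 4 1
wy = ones(1,2*n+1); wy(2:2:end) = 4; wy(3:2:end-1) = 2;
[X,Y] = meshgrid(x,y);
Q = h*k/9 * (wy * f(X,Y) * wx');    % sum of lambda_ij*f_ij, Lambda = wy'*wx

For instance, simpson2d(@(x,y) y.*sin(x)+x.*cos(y), pi, 2*pi, 0, pi, 8, 8) should approach the exact value −π² of the integral computed in Section 10.3.2.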
It is possible to construct bivariate Gauss formulae. For example, if xi, i = 0, . . . , m, and yj, j = 0, . . . , n, are the roots of the Legendre polynomials with respect to the intervals [a, b] and [c, d], respectively, one obtains the Gauss–Legendre cubature formula, with coefficients
$$A_i = \frac{[(m+1)!]^4 (b-a)^{2m+3}}{[(2m+2)!]^2 (x_i-a)(b-x_i)[u'(x_i)]^2}, \quad i = 0, \dots, m,$$
$$B_j = \frac{[(n+1)!]^4 (d-c)^{2n+3}}{[(2n+2)!]^2 (y_j-c)(d-y_j)[u'(y_j)]^2}, \quad j = 0, \dots, n,$$
and, if f ∈ C^{2m+2,2n+2}(D),
$$R_{mn}(f) = (d-c)\lambda_m f^{(2m+2,0)}(\xi_1,\eta_1) + (b-a)\tilde{\lambda}_n f^{(0,2n+2)}(\xi_2,\eta_2) - \lambda_m \tilde{\lambda}_n f^{(2m+2,2n+2)}(\xi_3,\eta_3),$$
where
$$\lambda_m = \frac{[(m+1)!]^4 (b-a)^{2m+3}}{[(2m+2)!]^3 (2m+3)}, \qquad \tilde{\lambda}_n = \frac{[(n+1)!]^4 (d-c)^{2n+3}}{[(2n+2)!]^3 (2n+3)}.$$
For further details, see [17].
If m = n = 0, one obtains a one-node cubature formula, analogous to the midpoint (rectangle) rule,
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = (b-a)(d-c)\, f\Big(\frac{a+b}{2}, \frac{c+d}{2}\Big) + R_{00}(f),$$
where
$$R_{00}(f) = \frac{(b-a)^3(d-c)}{24} f^{(2,0)}(\xi_1,\eta_1) + \frac{(b-a)(d-c)^3}{24} f^{(0,2)}(\xi_2,\eta_2) - \frac{(b-a)^3(d-c)^3}{576} f^{(2,2)}(\xi_3,\eta_3).$$
The utility of the previous methods is not limited to rectangular domains. For example, we can modify Simpson's cubature formula so that it can be applied to integrals of the form
$$\int_a^b\!\int_{c(x)}^{d(x)} f(x,y)\, dy\, dx \quad\text{or}\quad \int_c^d\!\int_{a(y)}^{b(y)} f(x,y)\, dx\, dy,$$
provided that the domain is simple with respect to x or y, respectively.

For an integral of the first type the step for x will be h = (b − a)/2, but the step for y will vary as a function of x (see Figure 10.12):
$$k(x) = \frac{d(x) - c(x)}{2}.$$
One obtains
$$\begin{aligned}
\int_a^b\!\int_{c(x)}^{d(x)} f(x,y)\, dy\, dx &\approx \int_a^b \frac{k(x)}{3}\, [f(x,c(x)) + 4 f(x, c(x)+k(x)) + f(x,d(x))]\, dx \\
&\approx \frac{h}{3}\Big\{ \frac{k(a)}{3}\, [f(a,c(a)) + 4 f(a, c(a)+k(a)) + f(a,d(a))] \\
&\qquad + 4\, \frac{k(a+h)}{3}\, [f(a+h, c(a+h)) + 4 f(a+h, c(a+h)+k(a+h)) + f(a+h, d(a+h))] \\
&\qquad + \frac{k(b)}{3}\, [f(b,c(b)) + 4 f(b, c(b)+k(b)) + f(b,d(b))] \Big\}.
\end{aligned}$$
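A rough sketch of this computation in MATLAB, using composite Simpson rules in both directions (a generalization of the derivation above; the helper name and argument list are illustrative, and Problem 10.3 asks for an adaptive version):

% Sketch: int_a^b int_{c(x)}^{d(x)} f dy dx, Simpson in y inside Simpson in x
function Q = simpsonsimple(f, a, b, cfun, dfun, m, n)
h = (b-a)/(2*m); x = a + (0:2*m)*h;
wx = ones(1,2*m+1); wx(2:2:end) = 4; wx(3:2:end-1) = 2;
F = zeros(1,2*m+1);
for i = 1:2*m+1
    k = (dfun(x(i)) - cfun(x(i)))/(2*n);   % y-step varies with x
    y = cfun(x(i)) + (0:2*n)*k;
    wy = ones(1,2*n+1); wy(2:2:end) = 4; wy(3:2:end-1) = 2;
    F(i) = k/3 * sum(wy .* f(x(i), y));    % inner Simpson in y (f vectorized in y)
end
Q = h/3 * sum(wx .* F);                    % outer Simpson in x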

Figure 10.12: Simpson formula for a domain simple with respect to x

Implementation hints. Consider the domain
$$D = [a, b] \times [c, d].$$
The integral to be approximated can be written as
$$\int_a^b\!\int_c^d f(x,y)\, dx\, dy = \int_a^b \left( \int_c^d f(x,y)\, dy \right) dx = \int_a^b F(x)\, dx,$$
where
$$F(x) = \int_c^d f(x,y)\, dy.$$
Suppose adquad is a univariate adaptive routine. The idea is to use this routine to compute values of the function F defined above and then to reuse the routine to integrate F itself. An implementation example is given in MATLAB Source 10.6.

10.3 Multivariate Approximations in MATLAB


10.3.1 Multivariate interpolation in MATLAB
MATLAB has two functions for bivariate interpolation: interp2 and griddata. The most general
calling syntax of interp2 is
ZI = interp2(x,y,z,XI,YI,method)
Here x and y contain the coordinates of interpolation nodes, z contains the values at nodes, and XI and
YI are matrices containing the coordinates of points at which we wish to interpolate. ZI contains the
values of interpolant at XI, YI. method can be one of the following values:
• 'linear' – Bilinear interpolation (default);
• 'spline' – Cubic spline interpolation;

MATLAB Source 10.6 Double integral approximation on a rectangle


function Q = quaddbl(F,xmin,xmax,ymin,ymax,tol,...
quadm,varargin)
%QUADDBL - approximates a double integral on a rectangle
%Parameters
%F - Integrand
%XMIN, XMAX, YMIN, YMAX - rectangle limits
%TOL -tolerance, default 1e-6
%QUADM - integration method, default adquad
if nargin < 5, error('Required at least 5 arguments'); end
if nargin < 6 || isempty(tol), tol = 1.e-6; end
if nargin < 7 || isempty(quadm), quadm = @adquad; end
F = fcnchk(F);

Q = quadm(@innerint, ymin, ymax, tol, [], F, ...
    xmin, xmax, tol, quadm, varargin{:});

%---------
function Q = innerint(y, F, xmin, xmax, tol, quadm, varargin)
%INNERINT - used by QUADDBL for inner integral.
%
% QUADM specifies quadrature to be used
% Evaluates inner integral for each value of outer variable

Q = zeros(size(y));
for i = 1:length(y)
Q(i) = quadm(F, xmin, xmax, tol, [], y(i), varargin{:});
end

• 'nearest' – Nearest neighbor interpolation;

• 'cubic' – Cubic interpolation, as long as the data are uniformly spaced. Otherwise, this method is the same as 'spline'.

All interpolation methods require that X and Y be monotonic, and have the same format ("plaid") as if they were produced by meshgrid. If you provide two monotonic vectors, interp2 changes them to a plaid internally. Variable spacing is handled by mapping the given values in X, Y, XI, and YI to an equally spaced domain before interpolating. For faster interpolation when X and Y are equally spaced and monotonic, use the methods '*linear', '*cubic', '*spline', or '*nearest'.
Our example interpolates the MATLAB peaks function sampled on a 7-by-7 grid. We generate the grid, compute the function values and plot the function with the MATLAB sequence

[X,Y]=meshgrid(-3:1:3);
Z=peaks(X,Y);
surf(X,Y,Z)

The graph is given in Figure 10.13.

Figure 10.13: Graph of peaks on a coarse grid

Then we compute the interpolants on a finer grid and plot them:

[XI,YI]=meshgrid(-3:0.25:3);

ZI1=interp2(X,Y,Z,XI,YI,’nearest’);
ZI2=interp2(X,Y,Z,XI,YI,’linear’);
ZI3=interp2(X,Y,Z,XI,YI,’cubic’);
ZI4=interp2(X,Y,Z,XI,YI,’spline’);
subplot(2,2,1), surf(XI,YI,ZI1)
title(’nearest’)
subplot(2,2,2), surf(XI,YI,ZI2)
title(’linear’)
subplot(2,2,3), surf(XI,YI,ZI3)
title(’cubic’)
subplot(2,2,4), surf(XI,YI,ZI4)
title(’spline’)

See Figure 10.14 for their graphs. If we replace everywhere surf by contour we obtain the graphs
in Figure 10.15.
The griddata function has the same syntax as interp2. The input data are the node coordinates x and y, which need not be monotone, and the values at the nodes, z. The function computes the values ZI of the interpolant at the points XI and YI, generated via meshgrid. The method parameter may be 'linear', 'cubic', 'nearest' and 'v4', the latter being a method peculiar to MATLAB 4. All methods, excepting v4, are based on a Delaunay triangulation (a triangulation of the node set that maximizes the minimum angle). The method is useful to interpolate values on a surface. The next example interpolates random points on the surface z = sin(√(x² + y²))/√(x² + y²) ("Mexican hat"). To avoid problems at the origin we add eps to the denominator.

x=rand(100,1)*16-8; y=rand(100,1)*16-8;
R=sqrt(x.^2+y.^2)+eps;
z=sin(R)./R;
xp=-8:0.5:8;
[XI,YI]=meshgrid(xp,xp);
ZI=griddata(x,y,z,XI,YI);
mesh(XI,YI,ZI); hold on
plot3(x,y,z,’ko’); hold off

Figure 10.14: interp2 example: nearest, linear, cubic and spline interpolants of peaks

See Figure 10.16 for the result. The random points are represented as circles, and the interpolant is
plotted with mesh.

10.3.2 Computing double integrals in MATLAB


We can approximate double integrals on rectangles using dblquad. To illustrate, we shall approximate
the integral
$$\int_0^{\pi}\!\int_{\pi}^{2\pi} (y \sin x + x \cos y)\, dx\, dy.$$

Its exact value is −π², as can be checked using the Symbolic Math Toolbox:

>> syms x y
>> Pi=sym(pi);
>> z=y*sin(x)+x*cos(y);
>> int(int(z,x,Pi,2*Pi),y,0,Pi)
ans =
-pi^2

Figure 10.15: Contours generated by interp2 (nearest, linear, cubic, spline)

Figure 10.16: griddata interpolation



We can define the integrand as an inline object, M-file, character string, anonymous function or
function handle. Suppose we gave it in the M-file integrand.m:

function z = integrand(x, y)
z = y*sin(x)+x*cos(y);

We shall use dblquad and check it:

>> Q = dblquad(@integrand, pi, 2*pi, 0, pi)


Q =
-9.8696
>> -piˆ2
ans =
-9.8696

Also, we compute the integral with quaddbl (MATLAB source 10.6):

>> Q2=quaddbl(@integrand,pi,2*pi,0,pi)
Q2 =
-9.8696

The integrand for dblquad and quaddbl must accept a vector x and a scalar y and return a vector of
values of the integrand. We can pass additional arguments to specify the accuracy (tolerance) and the
univariate integration method (default quad for dblquad). Suppose we want to compute
$$\int_4^6\!\int_0^1 (y^2 e^x + x \cos y)\, dx\, dy$$

with a tolerance of 1e-8 and to use quadl instead of quad:

>> fi = @(x,y) y.^2.*exp(x)+x.*cos(y);


>> dblquad(fi,0,1,4,6,1e-8,@quadl)
ans =
87.2983

The exact value, provided by Maple or the Symbolic Math Toolbox, is
$$\frac{152}{3}(e - 1) + \frac{1}{2}(\sin 6 - \sin 4).$$

Let us check this:

>> syms x y
>> z=y^2*exp(x)+x*cos(y);
>> int(int(z,x,0,1),y,4,6)
ans =
- 152/3 + 152/3 exp(1) - 1/2 sin(4) + 1/2 sin(6)
>> double(ans)
ans =
87.2983

Problems
Problem 10.1. Code a MATLAB function that plots a surface f(x, y), f : [0, 1] × [0, 1] → R, which fulfills the conditions
$$f(0, y) = g_1(y), \quad f(1, y) = g_2(y), \quad f(x, 0) = g_3(x), \quad f(x, 1) = g_4(x),$$
where gi, i = 1, . . . , 4, are functions defined on [0, 1].

Problem 10.2. Find the bivariate tensor product and Boolean sum corresponding to a univariate Hermite interpolant with double nodes 0 and 1. Plot this interpolant for the function f(x, y) = x exp(−x² − y²).

Problem 10.3. Adapt the quaddbl function to approximate double integrals of the form
$$\int_a^b\!\int_{c(x)}^{d(x)} f(x,y)\, dy\, dx \quad\text{or}\quad \int_c^d\!\int_{a(y)}^{b(y)} f(x,y)\, dx\, dy,$$
when the integration domain is simple with respect to x or y.

Problem 10.4. Consider the double integral of the function f(x, y) = x² + y² on the elliptical domain R given by −5 < x < 5, y² < (3/5)(25 − x²).
(a) Plot the function on R.
(b) Find the exact value of the integral using Maple or the Symbolic Math Toolbox.
(c) Approximate the value of the integral by transforming the ellipse into a rectangle.
(d) Approximate the integral using the functions in Problem 10.3.

Problem 10.5. Consider the function f(x, y) = y cos x², the triangular domain T = {x ≥ 0, y ≥ 0, x + y ≤ 1}, and
$$I = \iint_T f(x, y)\, dx\, dy.$$
(a) Plot the function on T using trimesh or trisurf.
(b) Approximate the value of I by transforming the integral into an integral of a function defined on the unit square that is null outside of T.
(c) Approximate the integral using the functions in Problem 10.3.

Bibliography
[1] Octavian Agratini, Ioana Chiorean, Gheorghe Coman, and Radu Trı̂mbiţaş, Numerical Analysis
and Approximation Theory, vol. III, Cluj University Press, 2002, D. D. Stancu, Gh. Coman
(coords), (in Romanian).
[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide, third ed., SIAM, Philadelphia, 1999, http://www.netlib.org/lapack.
[3] R. Barnhill, R. P. Dube, and F. F. Little, Properties of Shepard’s surfaces, Rocky Mtn. J. Math.
13 (1983), 365–382.
[4] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,
C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods, 2nd ed., SIAM, Philadelphia, PA, 1994, available via www,
http://www.netlib.org/templates.
[5] Jean-Paul Berrut and Lloyd N. Trefethen, Barycentric Lagrange Interpolation, SIAM Review 46
(2004), no. 3, 501–517.
[6] Å. Björk, Numerical Methods for Least Squares Problem, SIAM, Philadelphia, 1996.
[7] E. Blum, Numerical Computing: Theory and Practice, Addison-Wesley, 1972.
[8] P. Bogacki and L. F. Shampine, A 3(2) pair of Runge-Kutta formulas, Appl. Math. Lett. 2 (1989),
no. 4, 321–325.
[9] C. G. Broyden, A Class of Methods for Solving Nonlinear Simultaneous Equations, Math. Comp.
19 (1965), 577–593.
[10] L. Burden and J. D. Faires, Numerical Analysis, PWS Kent, Boston, 1986.
[11] W. Cheney and W. Light, A Course in Approximation Theory, Brooks/Cole, Pacific Grove, 2000.
[12] C. K. Chui, Multivariate splines, SIAM Regional Conference Series in Mathematics, 1988.
[13] K. C. Chung and T. H. Yao, On lattices admitting unique Lagrange interpolation, SIAM J. Numer. Anal. 14 (1977), 735–743.
[14] P. G. Ciarlet, Introduction à l’analyse numérique matricielle et à l’optimisation, Masson, Paris,
Milan, Barcelone, Mexico, 1990.
[15] Gh. Coman and R. T. Trı̂mbiţaş, Bivariate shepard interpolation, Seminar on Numerical and
Statistical Calculus, preprint (1999), no. 1, 41–83.
[16] , Bivariate Shepard interpolation in MATLAB, Seminarul itinerant ”Tiberiu Popoviciu”
de Ecuaţii funcţionale, aproximare şi convexitate (Cluj-Napoca, Romania), 2000, pp. 41–56.
[17] Gheorghe Coman, Numerical Analysis, Libris, Cluj-Napoca, 1995, (in Romanian).
[18] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1994.
[19] M. Crouzeix and A. L. Mignot, Analyse numerique des équations differentielles, Masson, Paris,
Milan, Barcelone, Mexico, 1989.
[20] Teodora Cătinaş, Interpolation of Scattered Data, Casa Cărţii de Ştiinţă, 2007.
[21] I. Cuculescu, Numerical Analysis, Editura Tehnică, Bucureşti, 1967, (in Romanian).
[22] P. J. Davis and P. Rabinowitz, Numerical Integration, Blaisdell, Waltham, Massachusetts, 1967.
[23] James Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.

[24] J. E. Dennis and J. J. Moré, Quasi-Newton Metods, Motivation and Theory, SIAM Review 19
(1977), 46–89.
[25] J. Dormand, Numerical Methods for Differential Equations. A Computational Approach, CRC
Press, Boca Raton New York, 1996.
[26] T. A. Driscoll, N. Hale, and L. N. Trefethen, Chebfun Guide, Pafnuty Publications, Oxford, 2014.
[27] Tobin A. Driscoll, Crash course in MATLAB, www, 2006,
math.udel.edu/~driscoll/MATLABCrash.pdf.
[28] R. Farwig, Rate of convergence of Shepard’s global interpolation formula, Math. of Comp. 46
(1986), 577–590.
[29] J. G. F. Francis, The QR transformation: A unitary analogue to the LR transformation, Computer
J. 4 (1961), 256–272, 332–345, parts I and II.
[30] R. Franke, Scattered data interpolation, Math. of Comp. 38 (1982), 181–200.
[31] W. Gander and W. Gautschi, Adaptive quadrature - revisited, BIT 40 (2000), 84–101.
[32] W. Gautschi, On the condition of algebraic equations, Numer. Math. 21 (1973), 405–424.
[33] , Numerical Analysis, an Introduction, Birkhäuser, Basel, 1997.
[34] Walther Gautschi, Orthogonal polynomials: applications and computation, Acta Numerica 5
(1996), 45–119.
[35] J. Gilbert, C. Moler, and R. Schreiber, Sparse matrices in M ATLAB: Design and implementation.,
SIAM J. Matrix Anal. Appl. 13 (1992), no. 1, 333–356, available in M ATLAB kit.
[36] G. Glaeser and H. Stachel, Open Geometry: OpenGLr + Advanced Geometry, Springer, 1999.
[37] D. Goldberg, What every computer scientist should know about floating-point arithmetic, Com-
puting Surveys 23 (1991), no. 1, 5–48.
[38] H. H. Goldstine and J. von Neumann, Numerical inverting of matrices of high order, Amer. Math.
Soc. Bull. 53 (1947), 1021–1099.
[39] Gene H. Golub and Charles van Loan, Matrix Computations, 3rd ed., John Hopkins University
Press, Baltimore and London, 1996.
[40] W. J. Gordon and J. A. Wixom, Shepard’s method of ‘metric interpolation’ to bivariate and
multivariate interpolation, Math. Comp. 32 (1978), 253–264.
[41] P. R. Halmos, Finite-Dimensional Vector Spaces, Springer Verlag, New York, 1958.
[42] R. L. Hardy, Multiquadric equations of topography and other irregular surfaces, Journal of Geophysical Research 76 (1971), 1905–1915.
[43] P. H. Hartley, Tensor product approximations to data defined on rectangular meshes in n-space, Computer Journal 19 (1976), 348–352.
[44] D. J. Higham and N. J. Higham, MATLAB Guide, second ed., SIAM, Philadelphia, 2005.
[45] N. J. Higham and F. Tisseur, A Block Algorithm for Matrix 1-Norm Estimation, with an Applica-
tion to 1-Norm Pseudospectra, SIAM Journal Matrix Anal. Appl. 21 (2000), no. 4, 1185–1201.
[46] Nicholas J. Higham, The Test Matrix Toolbox for MATLAB, Tech. report, Manch-
ester Centre for Computational Mathematics, 1995, available via WWW, address
http://www.ma.man.ac.uk/MCCM/MCCM.html.
[47] Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[48] E. Isaacson and H. B. Keller, Analysis of Numerical Methods, John Wiley, New York, 1966.

[49] C. G. J. Jacobi, Über eine neue Auflösungsart der bei der Methode der kleinsten Quadrate vork-
ommenden linearen Gleichungen, Astronomische Nachrichten 22 (1845), 9–12, Issue no. 523.
[50] D. Kincaid and W. Cheney, Numerical Analysis: Mathematics of Scientific Computing,
Brooks/Cole Publishing Company, Belmont, CA, 1991.
[51] Mirela Kohr, Special Chapter of Mechanics, Cluj University Press, 2005, in Romanian.
[52] Mirela Kohr and Ioan Pop, Viscous Incompressible Flow for Low Reynolds Numbers, WIT Press,
Southampton(UK) - Boston, 2004.
[53] V. N. Kublanovskaya, On some algorithms for the solution of the complete eigenvalue problem,
USSR Comp. Math. Phys. 3 (1961), 637–657.
[54] P. Lancaster and K. Salkauskas, Curve and Surface Fitting, Academic Press, New York, 1986.
[55] P. Marchand and O. T. Holland, Graphics and GUIs with MATLAB, third ed., CHAPMAN &
HALL/CRC, Boca Raton, London, New York, Washington, D.C., 2003.
[56] The Mathworks Inc., Natick, Ma, Using MATLAB, 2002.
[57] The Mathworks Inc., Learning MATLAB 7, 2005, Version 7.
[58] The Mathworks Inc., MATLAB 7. Getting Started Guide, 2008, Minor revision for MATLAB 7.7
(Release 2008b).
[59] The Mathworks Inc., MATLAB 7 Graphics, 2008, Revised for MATLAB 7.7 (Release 2008b).
[60] The Mathworks Inc., MATLAB 7. Mathematics, 2008, Revised for MATLAB 7.7 (Release
2008b).
[61] The Mathworks Inc., MATLAB 7. Programming Fundamentals, 2008, Revised for Version 7.7
(Release 2008b).
[62] The Mathworks Inc., Natick, Ma, MATLAB. Symbolic Math Toolbox 5, 2008, Revised for Version
5.1 (Release 2008b).
[63] J. Meier, Parametrische Flächen, 2000, available via www, address
http://www.3d-meier.de/tut3/Seite0.html.
[64] C. A. Micchelli, Algebraic aspects of interpolation, Approximation Theory (Providence R.I.)
(C. deBoor, ed.), Proceedings of Symposia in Applied Mathematics, vol. 36, AMS, 1986, pp. 81–
102.
[65] , Interpolation of scattered data: Distance matrices and conditionally positive definite
functions, Constructive Approximation 2 (1986), 11–22.
[66] Cleve Moler, Numerical Computing with MATLAB, SIAM, 2004, available via www at http://www.mathworks.com/moler.
[67] J. J. Moré and M. Y. Cosnard, Numerical Solutions of Nonlinear Equations, ACM Trans. Math.
Softw. 5 (1979), 64–85.
[68] Shoichiro Nakamura, Numerical Computing and Graphic Visualization in MATLAB, Prentice Hall, Englewood Cliffs, NJ, 1996.
[69] D. J. Newman and T. J. Rivlin, Optimal universally stable interpolation, Analysis 3 (1983),
355–367.
[70] Dana Petcu, Computer Assisted Mathematics, Eubeea, Timişoara, 2000, (in Romanian).
[71] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C,
Cambridge University Press, Cambridge, New York, Port Chester, Melbourne, Sidney, 1996,
available via www, http://www.nr.com/.

[72] Alfio Quarteroni, Riccardo Sacco, and Fausto Saleri, Numerical Mathematics, Springer, New
York, Berlin, Heidelberg, 2000.
[73] I. A. Rus, Differential Equations, Integral Equations and Dynamical Systems, Transilvania Press,
Cluj-Napoca, 1996, (in Romanian).
[74] I. A. Rus and P. Pavel, Differential Equations, 2nd ed., Editura Didactică şi Pedagogică,
Bucureşti, 1982, (in Romanian).
[75] H. Rutishauser, Solution of the eigenvalue problems with the LR transformation, Nat. Bur. Stand.
App. Math. Ser. 49 (1958), 47–81.
[76] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, 1996, available
via www, http://www-users.cs.umn.edu/~saad/books.html.
[77] H. E. Salzer, Lagrangian interpolation at the Chebyshev points xn,ν = cos(νπ/n), ν = 0(1)n;
some unnoted advantages, Computer Journal 15 (1974), 156–159.
[78] A. Sard, Linear Approximation, American Mathematical Society, Providence, RI, 1963.
[79] Thomas Sauer, Numerische Mathematik I, Universität Erlangen-Nurnberg, Erlangen, 2000, Vor-
lesungskript.
[80] , Numerische Mathematik II, Universität Erlangen-Nurnberg, Erlangen, 2000, Vor-
lesungskript.
[81] H. R. Schwarz, Numerische Mathematik, B. G. Teubner, Stuttgart, 1988.
[82] L. F. Shampine, Vectorized adaptive quadrature in MATLAB, Journal of Computational and Ap-
plied Mathematics 211 (2008), 131–140.
[83] L. F. Shampine, R. C. Allen, and S. Pruess, Fundamentals of Numerical Computing, John Wiley
& Sons, Inc, 1997.
[84] L. F. Shampine, I. Gladwell, and S. Thompson, Solving ODEs with MATLAB, Cambridge Uni-
versity Press, Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo,
2003.
[85] D. Shepard, A two-dimensional interpolation function for irregularly spaced data, Proceedings 23rd National Conference ACM, 1968, pp. 517–524.
[86] D. D. Stancu, On Hermite’s interpolation formula and some of its applications, Acad. R. P. Rom.
Studii şi Cercetări Matematice 8 (1957), 339–355, (in Romanian).
[87] D. D. Stancu, Numerical Analysis - Lecture Notes and Problem Book, Lito UBB, Cluj-Napoca,
1977, (in Romanian).
[88] D. D. Stancu, G. Coman, and P. Blaga, Numerical Analysis and Approximation Theory, vol. II,
Cluj University Press, Cluj-Napoca, 2002, D. D. Stancu, Gh. Coman, (coord.) (in Romanian).
[89] D. D. Stancu, Gh. Coman, O. Agratini, and R. Trı̂mbiţaş, Numerical Analysis and Approximation
Theory, vol. I, Cluj University Press, Cluj-Napoca, 2001, D. D. Stancu, Gh. Coman, (coord.) (in
Romanian).
[90] J. Stoer and R. Bulirsch, Einführung in die Numerische Mathematik, vol. II, Springer Verlag, Berlin, Heidelberg, 1978.
[91] , Introduction to Numerical Analysis, 2nd ed., Springer Verlag, 1992.
[92] Volker Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969), 354–356.
[93] A. H. Stroud, Approximate Calculation of Multiple Integrals, Prentice Hall Inc., Englewood
Cliffs, NJ, 1971.

[94] L. N. Trefethen, Maxims About Numerical Mathematics, Computers, Science and Life, SIAM
News 31 (1998), no. 1, 1.
[95] Lloyd N. Trefethen, The Definition of Numerical Analysis, SIAM News (1992), no. 3, 1–5.
[96] Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, Philadelphia, 1996.
[97] R. T. Trı̂mbiţaş, Local bivariate Shepard interpolation, Rendiconti del Circolo matematico di
Palermo 68 (2002), 701–710, Serie II, Suppl.
[98] , Numerical Analysis. An Introduction Based on MATLAB, Cluj University Press, Cluj-
Napoca, 2005, (in Romanian).
[99] E. E. Tyrtyshnikov, A Brief Introduction to Numerical Analysis, Birkhäuser, Boston, Basel,
Berlin, 1997.
[100] C. Überhuber, Computer-Numerik, vol. 1, 2, Springer Verlag, Berlin, Heidelberg, New-York,
1995.
[101] C. Ueberhuber, Numerical Computation. Methods, Software and Analysis, vol. I, II, Springer
Verlag, Berlin, Heidelberg, New York, 1997.
[102] R. E. White, Computational Mathematics. Models, Methods, and Analysis with MATLAB and
MPI, Chapman & Hall/CRC, Boca Raton, London, New York, Washington, D.C., 2004.
[103] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.
[104] J. H. Wilkinson, The perfidious polynomial, Studies in Numerical Analysis (Gene H. Golub, ed.),
MAA Stud. Math., vol. 24, Math. Assoc. America, Washington, DC, 1984, pp. 1–28.
[105] H. B. Wilson, L. H. Turcotte, and D. Halpern, Advanced Mathematics and Mechanics Appli-
cations Using MATLAB, third ed., Chapmann & Hall/CRC, Boca Raton, London, New York,
Washington, D.C., 2003.
Index

p-norm, 101
\ (operator), 13, 126
: (operator), 11
% (comment), 26

adaptive quadratures, 253
anonymous function, 30
asymptotic error, 272
axis, 57

blkdiag, 10
Boolean sum, 409
box, 54
break, 24
Butcher table, 349, 365

camlight, 71
Cartesian grid, 407
cell, 35
cell array, 35
chol, 130
Cholesky factorization, 120
class, 32
composite Simpson formula, 242
composite trapezoidal rule, 241
cond, 109
condest, 109
condition number, 86
continue, 25
contour, 65
conv, 179
convergence
  linear, 272
  order of ∼, 272
  sublinear, 272
  superlinear, 272
cubature formula, 428
  Newton-Cotes ∼, 430
  Simpson ∼, 430
  trapezoidal ∼, 430
cumprod, 16
cumsum, 16

dblquad, 436
deal, 37
decic, 392
deconv, 179
degree of exactness, 236, 240
delaunay, 427
det, 15
deval, 391
diag, 10
diary, 6
diff (Symbolic), 41
digits, 46
disp, 5
divided difference, 202
double, 32
double, 47
dsolve, 46

efficiency index, 273
eig, 326
eigenvalue
  condition number, 327
eps, see machine epsilon
eps, 4
eps, 81, 84
error, 40
eval, 34
eyes, 8

fcnchk, 31
fill, 56
find, 19
fliplr, 10
flipud, 10
fminbnd, 298
fminsearch, 296
for, 23
format, 4
formula
  Euler-MacLaurin ∼, 259
  Simpson ∼, 242
fplot, 54
fsolve, 45
full, 21
fzero, 295

gallery, 11
Gauss-Christoffel quadrature formula, see Gaussian quadrature formula
Gaussian quadrature formula, 246
generalized eigenvalues, 328
global, 39
grid, 352
grid, 57
grid function, 352
griddata, 433, 435
gsvd, 330

handle graphics, 68
hess, 328

if, 23
Inf, 4
inline, 30
int, 32
int, 41
interp1, 223
interp2, 433
interpolation
  cardinal property, 408
inv, 15

Lagrange interpolation
  Aitken method, 201
  Neville method, 200
lasterr, 40
lasterror, 40
Lebesgue
  constant, 196
  function, 196
left eigenvector, 327
length, 7
light, 71
limit, 43
linsolve, 133
linspace, 8
load, 6
log2, 83
logical, 20
logspace, 8
lsqnonneg, 128
lu, 129

M file, 25
M-file, 25
  function, 25, 27
  script, 25
machine epsilon, 77
maple, 46
matrices
  similar, 306
matrix
  characteristic polynomial of a ∼, 305
  companion, 306
  condition number of a ∼, 109, 181
  diagonalisable ∼, 307
  eigenvalue of a ∼, 305
  eigenvector of a ∼, 305
  hermitian, 102
  Jordan normal form of a ∼, 307
  nonderogatory ∼, 307
  normal, 102
  orthogonal, 102
  real Schur decomposition of a ∼, 309
  RQ transformation of a ∼, 315
  Schur decomposition of a ∼, 307
  singular value decomposition of a ∼, 329
  symmetric, 102
  unitary, 102
  upper Hessenberg, 102, 309
matrix norm, 102
  subordinate, 103
max, 16
maximal pivoting, see total pivoting
mean, 16
median, 16
mesh, 63
meshgrid, 63
method
  Broyden’s ∼, 293
  Euler ∼, 342
  false position ∼, 275
  fixed point iteration ∼, 286
  Heun ∼, 346, 348
  modified Euler ∼, 345, 348
  Newton ∼, 280
  power ∼, see vector iteration
  QR ∼, 311
    double shift, 324
    simple, 317
    spectral shift, 320
  quasi-Newton ∼, 291
  Romberg ∼, 255
  Runge-Kutta ∼, 346
  secant ∼, 278
  semi-implicit Runge-Kutta ∼, 347
  SOR, 139
  Sturm ∼, 273
  Taylor expansion ∼, 344
min, 16
moving least squares, 424
multiquadric, 427

NaN, 4
nargin, 28
nargout, 28
ndims, 7
nested function, 28
Newton-Cotes formulae, 246
nnz, 22
norm
  Chebyshev, 101
  Euclidean, 101
  Frobenius, 105
  Minkowski, 101
norm, 101, 107
normest, 107
notation
  Ω, 90
null, 129
num2str, 34
numerical differentiation formula, 236
numerical integration formula, 240
numerical quadrature formula, see numerical integration formula
numerical solution of differential equations
  one-step methods, 341

ode113, 372
ode15i, 392
ode15s, 372
ode23, 372
ode23s, 372
ode23t, 372
ode23tb, 372
ode45, 372
odeset, 385
odextend, 391
one-step method
  consistent ∼, 341
  convergent ∼, 355
  exact order, 342
  order, 342
  principal error function, 342
  stable ∼, 353
ones, 8
optimset, 295

pchip, 225, 227
pi, 5
pinv, 127, 181
plot, 51
plot3, 61
polar, 55
poly, 178
polyder, 178
polyfit, 182, 223
polyval, 178
polyvalm, 178
ppval, 226
print, 69
prod, 16

qr, 131
quad, 260
quadgk, 263
quadl, 260
quadv, 263

radial basis functions, 426
rand, 8
randn, 8
rcond, 109
realmax, 81
realmin, 81
repmat, 8, 36
reshape, 10
roots, 178
rot90, 10
Runge-Kutta method
  implicit ∼, 346

save, 6
schur, 328
shading, 66
Shepard interpolation, 417
simple, 42
simplify, 42
single, 32
single, 83
size, 7
solve, 44
sort, 16
sparse, 21
spdiags, 22
spline
  complete, 219
  Not-a-knot, 219
spline, 225
sprintf, 34
spy, 22
stability inequality, 353
std, 16
str2num, 34
struct, 36
structure, 36
subfunction, 28
subplot, 60
subs, 42
sum, 16
surf, 64
svd, 329
switch, 23
sym, 41
syms, 41

taylor, 43
tensor product, 408–410
text, 58
theorem
  Peano, 186
tic, 39
title, 54
toc, 39
total pivoting, 114
transform
  Householder, 122
trapezes rule, see composite trapezoidal rule
trapezoidal formula, see trapezoidal rule
trapezoidal rule, 241
trapz, 263
tril, 10
trisurf, 427
triu, 10
truncation error, 341
try-catch, 40

uint*, 32

var, 16
varargin, 36, 37
varargout, 36, 37
vector iteration, 309
vectorize, 31
view, 64
vpa, 46

warning, 40
while, 23, 24
whos, 5

xlabel, 54
xlim, 57

ylabel, 54
ylim, 57

zeros, 8
zlim, 63