CSC 161  Grinnell College  Spring, 2009 
Imperative Problem Solving and Data Structures  
Since computers have finite storage, real numbers can be stored to only a finite number of digits of accuracy. The lab on floating-point numbers provided details of this representation of real numbers. This reading on numerical errors explores some practical consequences of the storage of floating-point numbers.
Most of this reading is an edited version of Henry M. Walker, Computer science 2: Principles of Software Engineering, Data Types, and Algorithms, Little, Brown/Scott, Foresman, 1989, Section 12.3, with programming examples translated from Pascal to C. This material is used with permission from the copyright holder.
Computers may not store real numbers exactly because of various practical limitations on the storage of those numbers. In particular, a computer can store only a certain number of significant digits of a real number, and when real numbers require more than that degree of accuracy, roundoff error results. The amount of roundoff error may vary from one machine to another, but such an error is always potentially present. In contrast, integers are stored exactly, so such numbers need not involve such roundoff error.
Storage of real numbers is further complicated by the way computers are built. Electrical devices depend on closed and open circuits (current flowing or not), and this circuitry is then used to represent numbers. For example, a computer might interpret 0 as current flowing or high voltage and 1 as no current flowing or low voltage. Numbers, therefore, are usually represented using only two digits, 0 and 1, rather than in decimal form. The resulting numbers are called binary numbers. For this reading, the main concern is that not all decimals translate exactly into a few binary digits.
A familiar decimal analogy: 1/3 does not translate into an exact decimal with only a finite number of digits. If we store eight significant decimal digits, then 1/3 is stored as 0.33333333, and all digits beyond those eight places are lost.
The same situation arises when various decimal numbers are stored in binary form. For example, 0.1 cannot be stored exactly in a computer in binary form. This leads to the following basic principle that has direct impact on programming: Whenever we work with real numbers in a computer, we cannot assume that the numbers are exact; there is always the potential for numerical error.
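As a small illustration of this principle, the following sketch stores 0.1 as both a float and a double and compares the two stored values. The function names are hypothetical, and the behavior described in the comments assumes IEEE 754 binary floating point, which this reading does not specify.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the float closest to 0.1 has the same
   value as the double closest to 0.1, and 0 otherwise.  Assuming IEEE 754
   binary storage, neither type holds 0.1 exactly, and the two
   approximations differ. */
int float_tenth_matches_double_tenth (void)
{
  float  f = 0.1f;
  double d = 0.1;
  return (double) f == d;
}

/* Print both stored values with more digits than either type holds
   accurately, so that the approximation becomes visible. */
void print_stored_tenths (void)
{
  float  f = 0.1f;
  double d = 0.1;
  printf ("float  0.1 is stored as %.20f\n", f);
  printf ("double 0.1 is stored as %.20f\n", d);
}
```

Calling print_stored_tenths from main should show digits that drift away from 0.1 after the first several places; the exact digits depend on the machine.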
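This potential for numerical error has several practical consequences in programming. Here, we consider comparisons of real numbers and loop control, the order of arithmetic operations, and the subtraction of nearly equal numbers.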
A few illustrations dramatize these consequences particularly well.
Consider the following program:
/* Program longloop */
#include <stdio.h>

#define inc 0.1   /* increment added each time through the loop */

int main ()
{
  float sum;    /* the result of our additions */
  float diff;   /* difference between sum and 1.0 */

  /* printing headers */
  printf ("program successively adds 0.1, starting at 0.0 until it reaches 1.0\n\n");
  printf ("    sum     difference from 1.0\n");

  sum = 0.0;
  while (sum != 1.0)
    {
      diff = 1.0 - sum;
      printf ("%10.8f %12.8f\n", sum, diff);
      sum += inc;
    }

  printf ("program done\n");
  return 0;
}
When this program ran on one particular machine, the output began as follows:
program successively adds 0.1, starting at 0.0 until it reaches 1.0

    sum     difference from 1.0
0.00000000   1.00000000
0.10000000   0.89999998
0.20000000   0.80000001
0.30000001   0.69999999
0.40000001   0.60000002
0.50000000   0.50000000
0.60000002   0.39999998
0.70000005   0.29999995
0.80000007   0.19999993
0.90000010   0.09999990
1.00000012  -0.00000012
1.10000014  -0.10000014
1.20000017  -0.20000017
1.30000019  -0.30000019
...
The program starts at 0 and adds 0.1 until the sum reaches 1.0, so we expect the loop to stop after ten additions. The computer, however, continues to produce output beyond that point.
The reason is that 0.1 is not stored exactly. Each time we add 0.1, this inaccuracy grows, and the sum never actually equals 1.0. The sum does equal 1.0 to 6 decimal places, but the result contains a small numerical error, so the test sum != 1.0 remains true and "program done" is never printed.
This example shows that when we compare real numbers, we may want to allow for possible error. In this program, instead of continuing until sum == 1.0, we might substitute a test for proximity using the absolute value functions, fabsf and fabs in the math.h library, for float and double numbers, respectively:
while (fabsf(sum - 1.0) > 0.001)
It is worthwhile to note that some high-level computer languages provide such an operation to test whether two real numbers are close to each other.
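The proximity test can be packaged as a small helper. The following is a minimal sketch, not part of the original programs: the names nearly_equal and additions_until_near_one are hypothetical, the tolerance 0.001 is taken from the test above, and the cap of 1000 steps is an added safety net.

```c
#include <math.h>

/* Hypothetical helper: true when a and b differ by at most tolerance. */
int nearly_equal (float a, float b, float tolerance)
{
  return fabsf (a - b) <= tolerance;
}

/* Adds 0.1 to a float, starting at 0.0, until the sum is within 0.001
   of 1.0, and returns the number of additions performed.  The limit of
   1000 steps only guards against an endless loop. */
int additions_until_near_one (void)
{
  float sum = 0.0f;
  int steps = 0;

  while (!nearly_equal (sum, 1.0f, 0.001f) && steps < 1000)
    {
      sum += 0.1f;
      steps++;
    }
  return steps;
}
```

With this test the loop stops after ten additions even though the stored sum is not exactly 1.0. The tolerance should be chosen to be much smaller than the increment but larger than the accumulated roundoff error.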
A modification of the above program illustrates a related problem with Boolean expressions as exit conditions.
/* Program shortloop */
#include <stdio.h>

#define inc 0.1   /* increment added each time through the loop */

int main ()
{
  float sum;    /* the result of our additions */

  /* printing headers */
  printf ("program successively adds 0.1, starting at 0.0 until it reaches 1.0\n\n");

  sum = 0.0;
  while (sum <= 1.0)
    {
      printf ("%6.1f", sum);
      sum += inc;
    }

  printf ("\n\nprogram done\n");
  return 0;
}
We want to continue this loop while we do not exceed 1.0. The output, however, appears to skip the final case in which sum == 1.0. The actual output is:
program successively adds 0.1, starting at 0.0 until it reaches 1.0

   0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9

program done
When we look at the output from the longloop program, we see that the difficulty is that the numerical error gives us results that are slightly too large. In particular, after the tenth addition the machine holds 1.00000012 instead of 1.0, so the test sum <= 1.0 fails and the loop exits before its eleventh pass. Although we expected a sum of 1.0 to be printed at the end of the loop, this case was skipped. Here, numerical errors have shortened the loop by one iteration.
This example illustrates the following:
We cannot depend on real variables to count loop iterations; numerical errors may cause steps to be skipped when real variables are incremented and tested in exit conditions.
This potential for numerical error is precisely the reason why some languages do not allow real variables as control variables. C and Java do not have this restriction, so in these languages there is always the potential for unexpected results when a particular iteration is skipped.
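To see the skipped iteration without reading printed output, one can count the passes through a loop shaped like the one in shortloop. This is a sketch under the assumption of IEEE 754 single-precision arithmetic, as on the machine that produced the output above; the function name is hypothetical.

```c
/* Counts the passes through a loop shaped like shortloop's: a float
   starts at 0.0 and grows by 0.1 for as long as it is <= 1.0.
   Exact arithmetic would give 11 passes (0.0, 0.1, ..., 1.0). */
int count_float_loop_passes (void)
{
  float sum = 0.0f;
  int passes = 0;

  while (sum <= 1.0f)
    {
      passes++;
      sum += 0.1f;
    }
  return passes;
}
```

Under the stated assumption the function returns 10 rather than 11, because the tenth addition leaves a sum slightly above 1.0.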
One solution to some numerical errors in loop control variables involves using integer variables for counting and then converting from integer to real for processing. For example, if processing should go from 0 to 1 in increments of 0.1, the above examples illustrate that a loop with real variables is problematic:
double x = 0.0;
while (x <= 1.0)    /* subject to numerical errors - careful */
  {
    ...
    x += 0.1;
  }
The previous examples in this reading illustrate that such loops may stop earlier or later than intended. Changing to an integer variable, however, resolves the problem:
int i = 0;
double x;
while (i <= 10)     /* integer mode is reliable */
  {
    x = i / 10.0;   /* convert to real for processing */
    ...
    i++;
  }
Here x goes through the desired series of values, but integer variable i goes through exact values 0, 1, ..., 10. Also, since x is recomputed from an exact number at each iteration, any inaccuracies found in one value of x are not compounded in the next iteration.
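The integer-controlled pattern can be wrapped in a function that records each value of x. This is only a sketch: the name fill_tenths and the constant STEPS are hypothetical, chosen for this illustration.

```c
#define STEPS 10   /* number of increments of 0.1 between 0.0 and 1.0 */

/* Fills xs[0..STEPS] with 0.0, 0.1, ..., 1.0 and returns the number of
   values written.  The loop is controlled by the integer i, so the
   number of passes is exact; each x is recomputed from i, so roundoff
   in one value is not carried into the next. */
int fill_tenths (double xs[STEPS + 1])
{
  int i;

  for (i = 0; i <= STEPS; i++)
    xs[i] = i / 10.0;   /* convert to real for processing */
  return i;
}
```

The function always writes 11 values, and the last one is exactly 1.0 because it is computed as 10 / 10.0 rather than as a running sum.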
Another consequence of real number storage is that arithmetic no longer follows the familiar rules that we depend on in much of our traditional thinking about numbers. In particular, addition is not associative. In other words, we cannot assume that
(a + b) + c = a + (b + c)
for all real numbers, a, b, c. Instead, when computing a + b + c, it may matter if we perform a + b or b + c first.
As an example, suppose that a computer stores exactly eight digits of accuracy and that it rounds to those eight digits after each operation. Now, suppose we add
1.0000000 + 0.00000004 + 0.00000004
in two ways.
If we add the first two numbers, we get
1.0000000 + 0.00000004 = (1.00000004)
                       = 1.0000000      rounding to eight significant digits (including the 1)
Thus
(1.0000000 + 0.00000004) + 0.00000004 = (1.0000000) + 0.00000004     first addition with rounding
                                      = 1.0000000                    second addition with rounding
If we add the second two numbers first, we get
0.00000004 + 0.00000004 = 0.00000008
and
1.0000000 + (0.00000004 + 0.00000004) = 1.0000000 + (0.00000008)     first addition
                                      = (1.00000008)                 second addition before rounding
                                      = 1.0000001                    second addition after rounding
These three numbers demonstrate that the order of addition matters. When we add small numbers to large numbers, the small numbers can be lost completely (as in the first computation above). On the other hand, if we add small numbers first, then the small pieces can accumulate enough to affect the large number (as in the second computation above).
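The same effect can be reproduced in C with float values. The following sketch assumes IEEE 754 single precision, in which a float carries roughly seven decimal digits; the function name and the test values 1.0 and 3.0e-8 are choices made for this illustration, not values from the discussion above.

```c
/* Returns 1 when (a + b) + c and a + (b + c) produce the same float,
   and 0 otherwise.  Each partial sum is stored in a float variable so
   that it is rounded to single precision before the next addition. */
int float_addition_associates (float a, float b, float c)
{
  float left_partial  = a + b;
  float left          = left_partial + c;
  float right_partial = b + c;
  float right         = a + right_partial;

  return left == right;
}
```

Under the stated assumption, float_addition_associates (1.0f, 3.0e-8f, 3.0e-8f) returns 0: each 3.0e-8 is lost when added to 1.0 on its own, but the two together are large enough to change the sum. For values that add exactly, such as 1.0, 2.0, and 3.0, the function returns 1.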
When we are adding many such numbers, the cumulative effect of these errors can be quite noticeable. For example, consider the following problem.
It can be shown that
π^{2} = 6 + 6/2^{2} + 6/3^{2} + 6/4^{2} + ...
Use this series to approximate the value of π.
This formula indicates that we can approximate π^{2} by adding more and more terms of this series. In other words,
π^{2} ≈ 6 + 6/(2*2) + 6/(3*3) + 6/(4*4) + ... + 6/(n*n)
where n is a large integer. When we compute the right-hand side of this equation, we get an approximate value of π^{2}. Then by taking the square root, we can approximate π.
When we look at this series carefully, we see that the terms get smaller continually as the denominators get bigger. We must be careful when we add up the terms. If we start with the first term 6, then we will have the large numbers first, and the later small terms may not affect these large results. If we start with the small terms, then the small values can accumulate. This difference is illustrated in the following program.
/* Program to approximate Pi via the series
   sqr(pi) = 6 + 6/(2*2) + 6/(3*3) + ... */
#include <stdio.h>
#include <math.h>

int main ()
{
  int trial;
  int number_terms = 1000000;
  int index;
  double sum_up, sum_down, i_real;

  /* print headings */
  printf ("  Number of        Approximations to Pi\n");
  printf ("      Terms     Biggest First    Smallest First\n");

  for (trial = 1; trial <= 12; trial++)
    {
      /* add terms with index ascending: biggest terms first */
      sum_up = 0.0;
      for (index = 1; index <= number_terms; index++)
        {
          i_real = index;   /* convert index to real */
          sum_up += 6 / (i_real * i_real);
        }

      /* add terms with index descending: smallest terms first */
      sum_down = 0.0;
      for (index = number_terms; index >= 1; index--)
        {
          i_real = index;   /* convert index to real */
          sum_down += 6 / (i_real * i_real);
        }

      /* print results */
      printf ("%11d %17.10lf %17.10lf\n",
              number_terms, sqrt(sum_up), sqrt(sum_down));

      /* double number of terms for next time; skipped after the last
         trial so that number_terms cannot overflow a 32-bit int */
      if (trial < 12)
        number_terms *= 2;
    }

  return 0;
}
When this program is run for various values of n, we get:
  Number of        Approximations to Pi
      Terms     Biggest First    Smallest First
    1000000      3.1415916987      3.1415916987
    2000000      3.1415921761      3.1415921761
    4000000      3.1415924149      3.1415924149
    8000000      3.1415925342      3.1415925342
   16000000      3.1415925939      3.1415925939
   32000000      3.1415926237      3.1415926237
   64000000      3.1415926385      3.1415926387
  128000000      3.1415926436      3.1415926461
  256000000      3.1415926436      3.1415926499
  512000000      3.1415926436      3.1415926517
 1024000000      3.1415926436      3.1415926527
 2048000000      3.1415926436      3.1415926531
The correct value of π is 3.1415926535... This output illustrates several points:
When we add a series in one order, we can get different results from when we add in another order.
When we add terms in descending order of size (biggest first), the large terms dominate and small terms can be lost. In this program, the approximation for π does not change after about 128,000,000 terms in the series, when we add the large terms first. After this point, the additional terms are too small to affect the already large sum.
When we add terms in ascending order of size (smallest first), the small terms can contribute. In the program, the approximation for π continually improves as we add more terms when we add the small terms first.
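The loss of small terms shows up much sooner when the sums are kept in float rather than double. The following sketch assumes IEEE 754 single precision and uses hypothetical function names. To stay short, it returns the sums themselves, which approximate π^{2} (about 9.8696), rather than their square roots.

```c
/* Sums 6/(i*i) for i = 1, 2, ..., n in float, biggest terms first. */
float sum_largest_first (int n)
{
  float sum = 0.0f;
  int i;

  for (i = 1; i <= n; i++)
    sum += 6.0f / ((float) i * (float) i);
  return sum;
}

/* Sums the same terms in float, smallest terms first. */
float sum_smallest_first (int n)
{
  float sum = 0.0f;
  int i;

  for (i = n; i >= 1; i--)
    sum += 6.0f / ((float) i * (float) i);
  return sum;
}
```

With n = 1000000, the smallest-first sum should land close to π^{2}, while the biggest-first sum stops growing after a few thousand terms and falls visibly short.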
The numerical errors resulting from arithmetic operations just described can be particularly significant when two approximately equal numbers are subtracted. To understand why subtraction is so vulnerable to numerical error, suppose that the computer stores exactly eight digits of accuracy, as we did earlier. Next, consider the subtraction
12345675 - 12345674 = 1
In this example, if each of the first two numbers is correct to eight digits of accuracy, the result still has only one digit of accuracy. The first seven accurate digits were subtracted and only the eighth digit remains.
However, if the initial numbers were correct to only six or seven digits, then all accuracy is lost in the subtraction. For example, in the preceding subtraction, suppose each number was incorrect by two units in the eighth digit. In such a situation, the exact numbers might be 12345677 and 12345672, respectively, and the correct subtraction should yield
12345677 - 12345672 = 5
Although each of the original numbers was correct to seven digits, the result of the subtraction, 1, is off by a factor of five from the correct value, 5. Such a result may be meaningless if it is used in further processing.
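One way to quantify this loss is to compare relative errors before and after the subtraction. The following sketch uses the numbers from the example above; the helper names are hypothetical, and the relative error is measured against the exact value.

```c
#include <math.h>

/* Hypothetical helper: relative error of approx, measured against a
   nonzero exact value. */
double relative_error (double approx, double exact)
{
  return fabs (approx - exact) / fabs (exact);
}

/* Relative error of the difference a - b when a and b are used in
   place of the exact values exact_a and exact_b. */
double difference_relative_error (double a, double b,
                                  double exact_a, double exact_b)
{
  return relative_error (a - b, exact_a - exact_b);
}
```

For the stored value 12345675 and exact value 12345677, relative_error gives roughly 2 parts in 10 million. For the difference, difference_relative_error (12345675.0, 12345674.0, 12345677.0, 12345672.0) gives 0.8, since the computed difference 1 misses the correct difference 5 by 4.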
The previous examples also illustrate that when we put real numbers together, the size of the numerical errors can increase. Each real number may be off by a small amount. When we combine these numbers, we may combine these errors; and when we subtract these numbers, the errors may be particularly significant.
This document is available on the World Wide Web as
http://www.walker.cs.grinnell.edu/courses/161.sp09/readings/readingnumerrors.shtml
created 4 May last revised 31 January 2009 

For more information, please contact Henry M. Walker at walker@cs.grinnell.edu. 