Floating point numbers are represented in terms of a base $\beta$, a precision $p$, and an exponent $e$. If $\beta=10$ and $p=2$, then the number $0.034$ can be represented as $3.4\times 10^{-2}$, so its exponent is $e=-2$. Without further restrictions, this representation is not unique, as the same number can also be written as $34\times 10^{-3}$ or $0.34\times 10^{-1}$. Thus, we typically take a floating point number to have the specific form $$ \pm(d_0+d_1\beta^{-1}+\cdots+d_{p-1}\beta^{-(p-1)})\beta^e, $$ where each $d_i$ is an integer in $[0,\beta)$ and $d_0\neq0$. Such a representation is called a normalized floating point number.
Computers work more naturally in binary (or base 2) and most are hardwired this way. Thus, in the equation above, we have $\beta=2$ and each digit $d_i$ is zero or one. Taking this into account, here are some easily representable numbers with $p=5$ and $-5\leq e \leq 5$:
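For instance, $1 = (1.0000)_2 \times 2^{0}$, $3.5 = (1.1100)_2 \times 2^{1}$, and $13/32 = (1.1010)_2 \times 2^{-2}$; in each case, five binary digits and an exponent between $-5$ and $5$ suffice.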
It turns out that normalized floating point numbers are not distributed uniformly. Let's plot the binary floating point numbers with precision 4 and maximum exponent size 3.
# Define the parameters
digit_length = 4
exponent_size = 3
# Generate all binary tuples of length 4
from itertools import product
a = [range(2)]*digit_length
tuples = tuple(product(*a))
tuples
# All the numbers of the form 1.dddd
import numpy as np
bins = [2**(-n) for n in range(1,digit_length+1)]
lead_ones = [1+np.dot(tup, bins) for tup in tuples]
lead_ones
# Multiply each of the above by 2**(-n) over -3<=n<=3
# Include the negatives as well.
bins = [2**(-n) for n in range(-exponent_size,exponent_size+1)]
shifty = product(lead_ones,bins)
positive_numbers = [x*y for x, y in shifty]
negative_numbers = [-x for x in positive_numbers]
numbers = positive_numbers + negative_numbers
len(numbers)
# Plot them!
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(numbers, np.zeros(len(numbers)),'.')
plt.plot(lead_ones, np.zeros(len(lead_ones)),'r.')
ax = plt.gca()
ax.set_yticks([])
ax.set_xlim(-9,9)
fig = plt.gcf()
fig.set_figwidth(18)
fig.set_figheight(2)
plt.savefig('floating_point_distribution.eps', bbox_inches='tight')
The numbers of the form $1.dddd$ (called lead_ones in the code) are shown in red. The other positive numbers have the form $2^n x$ for some $x$ with a leading 1; that is what produces the scaling effect we see.
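To quantify that effect, we can examine the gaps between consecutive positive numbers in this toy system; the smallest gaps occur just above $2^{-3}$ and the largest just below the top of the range, and they should differ by a factor of $2^{6}$ here.
# Gaps between consecutive representable positive numbers
gaps = np.diff(np.unique(positive_numbers))
gaps.min(), gaps.max()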
Note, also, the hole near zero. That's because the smallest positive representable number in this system is $2^{-3}$. The next smallest is $(1+1/16)\times2^{-3}$, and the difference between these is quite a bit smaller than the smallest positive number itself.
smallest = 2**(-3)
next_smallest = (1+1/16)*2**(-3)
[smallest, next_smallest-smallest]
The IEEE standard for double precision floating point numbers specifies that 64 bits be used to represent the number. Specifically, it requires 1 bit for the sign $s$, 11 bits for the exponent $e$, and 52 bits for the fractional part of the significand $c$. Since the leading digit must be a $1$, we only need to store 52 digits. Given those choices for $s$, $e$, and $c$, our number is $$(-1)^s\times c \times \beta^e.$$ Since the exponent $e$ is expressed in binary with no sign, it might be more natural to think of it in the range $0\leq e < 2048$. In this case, we might write the number $$(-1)^s\times c \times \beta^{e-1023}.$$ Note that the standard leaves quite a bit of freedom, allowing differences and optimizations between implementations. A few bit strings are reserved for special numbers and, in some implementations, a few extra bits are carried along.
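For instance, the bit pattern with $s=0$, exponent field $01111111111$ (that is, $e=1023$), and all 52 fraction bits equal to zero encodes $(-1)^0\times 1\times 2^{1023-1023}=1$.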
As an example, suppose we'd like to compute the number with bit string $$1\,\,10100000000000000000000000000000000000000000000000101\,\,10100000101$$ We might do the following:
dd = np.zeros(52, dtype=int)
dd[0]=dd[-1]=dd[2]=dd[-3]=1
ee = np.zeros(11, dtype=int)
ee[0]=ee[2]=ee[-1]=ee[-3]=1
s=1
bins = [2**(-n) for n in range(1,53)]
c = 1+np.dot(dd,bins)
bins2 = [2**(n) for n in range(11)]
e = int(np.dot(ee,bins2)-1023)
number_from_bin = (-1)**s * c * 2**e
number_from_bin
e
c
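We can also read the bit string straight back out of number_from_bin with the struct module; the sketch below packs the float into 8 big-endian bytes and prints the sign, exponent, and fraction fields, which should match the strings we started with.
import struct
# Pack the double as 8 big-endian bytes, then write each byte as 8 bits
bits = ''.join(format(byte, '08b') for byte in struct.pack('>d', number_from_bin))
bits[0], bits[1:12], bits[12:]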
Python provides some nice tools to explore numbers and their expansions. For example, sys.float_info stores information on how floating point numbers are stored on your machine.
import sys
sys.float_info
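The individual fields are also available as attributes; for example, eps, max, min, and mant_dig give machine epsilon, the largest and smallest positive normalized doubles, and the number of significand digits.
sys.float_info.eps, sys.float_info.max, sys.float_info.min, sys.float_info.mant_dig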
The full exponent range isn't quite used because some bit patterns are reserved for special numbers, like 0 and NaN: the all-zeros exponent field is set aside for zero and the subnormals, and the all-ones field for infinities and NaN. That is why the smallest positive normalized double is $2^{-1022}$.
2**(-1022)
These limits correspond to the constants described in section 5.2.4.2.2 of the C standard, which spells out the characteristics of floating types in float.h.
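Indeed, this value should agree with the smallest normalized double reported by sys.float_info.
sys.float_info.min == 2**(-1022)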
To get the largest possible value, we take every digit to be $d_k=1$ and the exponent as large as possible, namely $1023$. Equivalently, this is $(2^{-1}+2^{-2}+\cdots+2^{-53})\times 2^{1024}$, which is how the code below computes it, doubling repeatedly so that nothing overflows along the way.
s = sum([2**(-k) for k in range(1,54)])
for i in range(1024):
    s = 2*s
s
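The result should agree with the largest double reported by sys.float_info.
s == sys.float_info.max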
Another useful tool is float.hex, which displays the hexadecimal expansion of a number.
float.hex(0.1)
This indicates that the internal representation of $0.1$ is $$2^{-4}\left(1+\sum_{k=1}^{12} \frac{9}{16^k} + \frac{10}{16^{13}}\right) \approx 0.10000000000000000555.$$
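If we'd like to see every digit of the stored value, the decimal module displays it exactly, since converting a float to a Decimal involves no rounding.
from decimal import Decimal
# The exact decimal value of the double closest to 0.1
Decimal(0.1)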
We can use this information to fully understand the discrepancy between $(0.1+0.2)+0.3$ and $0.1+(0.2+0.3)$. Not surprisingly, the hexadecimal representation of $0.2$ is 0x1.999999999999ap-3. The hexadecimal representation of $0.3$ is 0x1.3333333333333p-2, which to 17 digits is $0.29999999999999999$, or exactly $0.3$ when rounded. Summing the three numbers in the first grouping, $(0.1+0.2)+0.3$, yields $0.6000000000000001$, which is slightly larger than the desired value of $0.6$. By contrast, the computer's computation of $0.2+0.3$ yields exactly $0.5$, which is not surprising since $0.5$ is easily representable in binary. Then, $0.1+0.5$ yields exactly the computer's representation of $0.6$ after rounding.
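These claims are easy to check directly; the sketch below prints the hexadecimal expansions and compares the two groupings of the sum.
print(float.hex(0.2), float.hex(0.3))
print((0.1 + 0.2) + 0.3)    # grouped as (0.1+0.2)+0.3
print(0.1 + (0.2 + 0.3))    # grouped as 0.1+(0.2+0.3)
print(0.2 + 0.3 == 0.5, 0.1 + 0.5 == 0.6)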
We can also use this to double check our number_from_bin.
point_1 = (1+sum([9/16**k for k in range(1,13)]) + 10/16**13)*2**(-4)
point_1
float.hex(number_from_bin)
Suppose we want to estimate the harmonic number $$\sum_{k=1}^{n} \frac{1}{k}$$ for $n=1000$. It turns out that the order in which we add the terms matters.
terms = [1/k for k in range(1,1001)]
sum1 = sum(terms)
terms.reverse()
sum2 = sum(terms)
sum1-sum2
The estimates might appear to be close, but their difference is about 10 times machine epsilon. Which is better?
When we add two machine numbers $a$ and $b$, the result $\text{fl}(a+b)$ generally looks like $$\text{fl}(a+b) = (a+b)(1+\varepsilon).$$ We might think of $\varepsilon$ as a pseudo-random number whose magnitude is at most machine epsilon. The point is that, due to the distribution of machine numbers, the potential size of the error is proportional to the magnitude of the sum itself.
Now, if we add three numbers together, we get $$\text{fl}(\text{fl}(a+b)+c) = ((a+b)(1+\varepsilon)+c)(1+\varepsilon) = (a+b+c) + (2a+2b+c)\varepsilon +O(\varepsilon^2).$$ Note that the numbers added first contribute twice to the error, so we incur less error if we add the smaller numbers first. The problem is compounded if we add four numbers: $$\text{fl}(\text{fl}(\text{fl}(a+b)+c)+d) = (a+b+c+d)+(3 a+3 b+2 c+d) \varepsilon + O(\varepsilon^2).$$
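This analysis predicts that summing the harmonic terms from smallest to largest should be the more accurate of the two orders. One way to check is to compare both against math.fsum, which returns a correctly rounded sum, as sketched below.
import math
terms = [1/k for k in range(1, 1001)]
reference = math.fsum(terms)            # correctly rounded sum of the terms
forward = sum(terms)                    # largest terms first
backward = sum(reversed(terms))         # smallest terms first
abs(forward - reference), abs(backward - reference)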