mht.technology

A blog about computer science, programming, and whatnot.

WTFs in Floating Point Math

We all know that floating point numbers have a limited precision. With small numbers, say between zero and one, we can describe the fractional part of the number in great detail. However, if the number is large, there may not be room for any fractional part. So how large does a number have to be before a == a + 1?

We’ll find our answer by simply checking:

#include <stdio.h>

void simple(void) {
  float num = 1.0f;
  float inc = num + 1;

  while (num != inc) {
    num *= 10;
    inc = num + 1;
  }
  printf("%f\n", num);
}

which outputs 100000000.000000. If we multiply by 2 instead of 10, we get a more intuitive number: 16777216 = 2^24. If we look at the IEEE 754 single-precision floating-point format, we see that the significand provides only 24 bits of precision (23 stored bits plus an implicit leading 1); representing 2^24 + 1 would need 25 bits, so adding one has no effect: the last bit is simply lost to rounding.

How high can we go if we are using double instead of float? As IEEE 754 double-precision has a 53-bit significand, we expect to get to 2^53 = 9007199254740992, or 10000000000000000 if we multiply by 10. And that is what we get.

Adding more than one

How much can we add before getting a new number? One might think the answer is simply until the addition carries over into the first 24 bits. However, the answer is not that straightforward. Instead, IEEE 754 defines rounding rules for deciding what to do when the number we would like to represent doesn’t quite fit in 24 bits. The default rule is «round to nearest, ties to even»: if we are exactly halfway between two representable numbers, round to the one whose least significant bit is zero.

Let’s try this out with the totally random number 314159265:

314159265 = 10010101110011011000010100001
            |------- 24 bits ------||---|
// The last 5 bits will be truncated
// .. so the 16 numbers from
314159264 = 10010101110011011000010100000
// to
314159279 = 10010101110011011000010101111
// are all the same.

This is fairly easy to confirm:

void confirm_suspicions(void) {
  double first = 314159264;
  double number = first;
  while (((float) first) == ((float) number)) {
    number++;
  }
  printf("%f != %f\n", number, first);
  // 314159280 != 314159264
}

If, however, we were to set the fifth bit from the bottom (the bit worth 16), the rounding rule would round all the way up to 314159296:

314159280 = 10010101110011011000010110000
            |------- 24 bits ------||---|
// Round up, so the least significant bit of the 24 bits becomes 0
314159296 = 10010101110011011000011000000
// This will be the same number all the way up to
314159312 = 10010101110011011000011010000
// .. where the rounding rule allows us to include ..10000, as the lsb is 0.
// This makes 33 consecutive integers, all represented by the same floating point number.

Conclusion

What can we take home from this? First of all, floating point numbers don’t behave like real numbers, and it’s easy to forget this:

float a,b;
// ...
a = b;
a++;
a == b // true ???

or what about this?

float a;
// ...
for (; a < large_float; a++) {
  // ...
}

which is an infinite loop if a grows large enough that a == a + 1 before reaching large_float. Lastly, consider this:

void oops(void) {
  float a = 314159264;
  float c = a + 10 + 10;
  printf("%d\n", c == 314159284.f); // 0
  float d = a + (10 + 10);
  printf("%d\n", d == 314159284.f); // 1
}

And you thought addition was associative? Not in the world of floating point numbers!
