Nat! bio photo

Nat!

Senior Mull

Twitter Github Twitch

My C code works with -O3 but not with -O0

  • v1.2 - added comments from HN
  • v1.3 - added “What does fits mean ?”
  • v1.4 - added the conclusion

When something very basic goes wrong, I have this hierarchy of potential culprits:

  • the compiler
  • buggy hardware
  • OS vendor
  • last and least me, because I don’t make mistakes :)

So what doesn’t work ? I am doing a simple cast from a floating point value to an integer value in C. This happens in a routine, that checks, if a double is integral and if it would fit into a long.

What does “fits” mean ?

If the value is stored into a long instead of a double, it should have the same numerical value. I.e. it shouldn’t matter if the number is stored as a double or a long, even when casting back and forth.

int   fits_long( double value)
{
   long   l_val;
   int    fits;

   if( value < (double) LONG_MIN || value > (double) LONG_MAX)
      return( 0);

   l_val = (long) value;
   fits  = ((double) l_val == value);
   return( fits);
}

First it checks, if the double value is within the limits of long. If yes, then the the double value is converted to a long, and then checked if both values as double still match.

This works for LONG_MIN, INT_MIN, INT_MAX but it doesn’t work for LONG_MAX.

Blame the compiler

I wrote a few tests to see what’s going on. There is a repository up on github that has the source. Let’s look at one file long.c:


#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <assert.h>
#include <float.h>


int   fits_long( double value)
{
   long   l_val;
   int    fits;

   if( value < (double) LONG_MIN || value > (double) LONG_MAX)
      return( 0);

   l_val = (long) value;
   fits  = (double) l_val == value;
   if( ! fits)
      fprintf( stderr, "Mismatch %f with %ld\n", value, l_val);

   return( fits);
}


int   main( int argc, char *argv[])
{
   double  value;

   fprintf( stderr, "long (%d bytes): %ld - %ld\n", (int) sizeof( long), LONG_MIN, LONG_MAX);
   fprintf( stderr, "double (%d bytes): %g - %g\n", (int) sizeof( double), DBL_MIN, DBL_MAX);

   assert( (double) LONG_MAX <= DBL_MAX);

   // keep my code optimiza
   if( argc == 1)
      value = (double) LONG_MAX;
   else
      value = (double) (LONG_MAX - atoi( argv[ 1]));

   if( fits_long( value))
      printf( "FITS\n");
   else
      printf( "NOPE\n");
   return( 0);
}

Let’s build it with cc -o long-O0.exe long.c and run it. cc is gcc-9. This is the output on my Intel Xeon machine:

long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE

But with optimized code cc -O3 -o long-O3.exe long.c I get:

long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS

Suddenly it works! This has got to be a compiler problem. So lets try another compiler clang-9. But it gets the same result!

Or is it the CPU ?

Suspect the CPU

Intel CPUs have a history of bugs. Did I hit one of those ? Let’s fire up the Raspberry Pi and see if results differ on ARM. On ARM the long-long.c test exhibits the same behavior (long.c is fine, because long is only 4 bytes large on the PI 4.)

./long-long-O0.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE

./long-long-O3.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS

So it must be a compiler bug! Or ?

Do another experiment

You can give long.exe a number that will be subtracted from LONG_MAX. A manual binary search brought me to the conclusion,that the problem manifests only with numbers that are larger than LONG_MAX - 512.

And 512 are seven bits. Which is peculiar, but does it matter ?

Gather some more data

The problem happens when long or long-long match the size of double a floating point number. Both are 8 bytes large. How can a 64 bit integer fit into a 64 bit double ?

What does C say ?

If you look at the “New C Standard” we have:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero).

Fine long is not _Bool.

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

Well we checked against LONG_MIN and LONG_MAX and both are well within DBL_MIN and DBL_MAX. So the value seems cool.

Wikipedia

A floating point in 64 bit C ought to be an IEEE 754 which looks like this:

IEEE 754 Double Floating Point Format

The exponent takes up 11 bit, and the fraction is 52 bit.

Wikipedia says:

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2−53 ≈ 1.11 × 10−16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string.

15 is also what DBL_DIG gives us. But our LONG_MAX is 9,223,372,036,854,775,807 so it has 19 digits. If we subtract 512 it becomes 9,223,372,036,854,775,295 and it fits. That doesn’t match up.

Godbolt

Looking at the -O3 assembler output on GodBolt of a simplified version of fits_llong:

#include <limits.h>

int fits_llong( double value)
{
    long long  l_val;

    l_val = (long long) value;
    return( l_val == value);
}

we see this assembler output:

cvttsd2si   rax, xmm0
cvtsi2sd    xmm1, rax
cmpeqsd     xmm1, xmm0
movq        rax, xmm1
and         eax, 1
ret

The generated code uses floating point (FP) xmm registers which are 128 bits wide. The conversion and the comparison are made in xmm and the flag is then moved out of xmm into integer eax and returned.

When we look at the -O0 code

push        rbp
mov         rbp, rsp
movsd       qword ptr [rbp - 8], xmm0
cvttsd2si   rax, qword ptr [rbp - 8]
mov         qword ptr [rbp - 16], rax
cvtsi2sd    xmm0, qword ptr [rbp - 16]
ucomisd     xmm0, qword ptr [rbp - 8]
sete        cl
setnp       dl
and         cl, dl
and         cl, 1
movzx       eax, cl
pop         rbp
ret

we see that the comparison is done with ucomisd which is 64 bits wide.

So it’s clear, why -O0 and -O3 differ. In -O3 the computation is done in higher precision registers and the outcome is different, than if it were done in -O0. It’s a little surprising to me, that ARM shows the same effect. But I know very little about that CPU.

Making it fail predictably

If the double is not passed in a xmm register but rather via a true double the function fails in -O3 and -O0. To achieve sending a true double I pass it via a pointer, which is obviously less efficient:

int   fits_long( double *value)
{
   long   l_val;
   int    fits;

   if( *value < (double) LONG_MIN || *value > (double) LONG_MAX)
      return( 0);

   l_val = (long) *value;
   fits  = (double) l_val == *value;
   if( ! fits)
      fprintf( stderr, "Mismatch %g with %ld\n", *value, l_val);

   return( fits);
}

But will this always work ? Sometimes I feel like the compiler isn’t my friend anymore, but rather an enemy. How can I be sure, that fits_long won’t get inlined and consequently “improved” to use a double instead of a double * ?

Since it is not a static, the compiler should keep it as is, so the symbol is available for linking. But I suspect nothing prevents the compiler from inlining it anyway in a local calling function. I could move the function to a different .o file, so inlining can’t happen. But then I live in fear of link-time optimizers becoming more aggressive…

I will adorn it with __attribute__(( noinline)) and hope that all compilers will understand and respect that.

Form an opinion

If you pass a value as double into a function, it may be using higher precision registers. The following calculations can be done, and in -O3 in my case will be done, with those registers exclusively. The results of the casting operation differ in the 128 bit and 64 bit floating point (FP) domain.

Only a true double size variable (via a pointer) gives a reliable 64 bit FP domain result:

__attribute__(( noinline))
int   fits_long( double *value)
{
   long   l_val;
   int    fits;

   l_val = (long) *value;
   fits  = (double) l_val == *value;
   return( fits);
}

So who is to blame ?

In terms of the C-Standard, I am hitting the clause

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

The value LONG_MAX if 64 bit long can not be represented in a double also 64 bit long. The mistake is basically on the side of the caller of the function casting a LONG_MAX to a double and passing that along.

There is no warning from the compiler though. So you have to know which numbers are kosher and which are not (which I don’t).

Comments from Hacker News

So I linked this article on HackerNews, were it gained some traction. Some of the comments on HackerNews were very illuminating:

Use -fsanitize=undefined

hannob pointed out that this could have been caught more easily, if one had used the -fsanitize=undefined option. I didn’t think of that, but I will remember it now.

Check FP registers with fetestexcept

dfranke mentioned use of the floating point status registers. I didn’t know these were accessible in C, since I don’t do FP computations a lot. Reading up on this topic a bit more, it looks like a cast from double to long, might raise an exception if the processor is configured that way. So this is something I’d like to avoid.

SSE registers are 128 bits wide, but precision is 64 bit

heftig notices, that what I thought were 128 bits operations are really 64 bit after all.

The best comment was deleted

I am sure I read this, but I can’t find it anymore. So I don’t know who wrote it. Why was it best ? Because it just showed how I am tripping over my own feet again and again :)

   // keep my code optimiza
   if( argc == 1)
      value = (double) LONG_MAX;
   else
      value = (double) (LONG_MAX - atoi( argv[ 1]));

The use of some command-line argument to keep the optimizer from optimizing code away is valid, because the input is unpredictable. Yet because I am lazy, I gave a default value, if there are no arguments…

The optimizer snatches up the opportunity to forego the fits_long call when argc == 1 (1 is a constant) and does all the calculation at compile time rather than execution time, with different results.

Fixing this to

   // keep my code optimiza
   value = (double) (LONG_MAX - atoi( argv[ 1]));

(and supplying 0 on the command line), produces NOPE output with -O0 and -O3.

Conclusion

Let’s build it back up from the ground.

What is problematic about this code ?

int   fits_long( double value)
{
   long     l_val;
   double   d_val;
   int      fits;

   l_val = (long) value;      // 1
   d_val = (double) l_val;    // 2
   fits  = d_val == value;    // 3
   return( fits);
}

The conversion in 1 (long) value may produce an undefined value for l_val, if the value is out of bounds for long. The same goes for the conversion in 2. Lets assume the undefined value returned from 1 is by chance LONG_MAX, which we know to not be a double. Again we would be running into a case were the value is undefined in 2. The outcome of the comparison in 3 of an undefined d_val with value is therefore unpredictable. This is no good.

It would be nice to check beforehand, if a value is undefined and then not attempt the comparison. It’s known, that 64 bit doubles can represent all integers in the range of -1*2^53+1 to 1*2^53+1. Checking for that we get code that doesn’t produce undefined values:

int   fits_long( double value)
{
   long     l_val;
   double   d_val;
   int      fits;

   if( value < (double) -((1L<<52)+1) && value > (double) ((1L<<52)+1))
      return( 0);

   l_val = (long) value;
   d_val = (double) l_val;
   fits  = d_val == value;
   return( fits);
}

But these are by far not all the integers that a double can “cleanly” represent. For instance you can store without loss the number 1000000000000000000 into a double and a long. 1*2^53 is only 4503599627370496. Three digits less.

To check for all values, one needs to use the math library function lrint in conjunction with fetestexcept of the fenv library. With the fenv library, you get access to the floating point status registers.

int   fits_long( double d)
{
   long     l_val;
   double   d_val;

// may be needed ?
// #pragma STDC FENV_ACCESS ON

   feclearexcept( FE_INVALID);      // 1
   l_val = lrint( d);               // 2
   d_val = (double) l_val;          // 3
   if( fetestexcept( FE_INVALID))   // 4
      return( 0);

   return( d_val == d);
}

What this does is in 1, clear the FP exception flag FE_INVALID with feclearexcept. Then it will do the conversion in 2. If the returned value is undefined, the FE_INVALID flag will be set. The returned value is now casted back into a double in 3. The devils advocate, tells me that this could go wrong too and, since the FP registers are sticky, there is no harm and possible gain in doing the cast before the check. Then it is tested in 4 with fetestexcept, if undefined values have been produced. If yes, we skip the comparison.

Now we know that d_val and d are valid. So we can compare without fear of comparing undefined values.

Caveat

Use of fenv seems to be fairly tricky though. Perusing clang and python discussions, one gets the impression that the optimizer could get in the way. There is a #pragma STDC FENV_ACCESS ON for this, but… it’s a pragma. Not everyone supports it.

Open end

I am still looking for an easier way without having to link -lm to check, if a double can be represented by an integer of the same size or not. I suspect there isn’t any.


8 Comments

A photo of CatoCatoCato

From: CatoCatoCato

When you work with floating point, you need to remember you work with a tolerance to some epsilon for comparisons because you are rounding to 1/n^2 precision and different floating point units perform the conversion in different ways.

You must abandon the idea of '==' for floats (except for 0.0).

This is why your code is unpredictable, because you cannot guarantee the conversion of any integer to and from float is the same number. Period. The LSBs of the mantissa can and do change, which is why we mask to a precision or use signal-to-noise comparisons when evaluation bit drift between FP computations.

You has the first part correct, < and > are your friend with FP. But to get past the '==' hurdle, you need to define tolerance, the code should be something like:

if (fabs(f1 - f2) > TOLERANCE) ... fits = true.

That's how you work with FP because different hardware will always report different LSBs.

A photo of Nat!

From: Nat!

While this holds some truth in a general sense, it's not really helpful in this case. What this is testing is, if a `double` could be stored in a `long` without loss of information. If you use an epsilon here, you're doing it wrong.

A photo of Macmade

From: Macmade

Still undefined behavior...

A photo of Steinar H. Gunderson

From: Steinar H. Gunderson

bool fits_long(double d)
{
feclearexcept(FE_INVALID);
lrint(d);
return !fetestexcept(FE_INVALID);
}

A photo of Nat!

From: Nat!

@Macmade Yes, the problem that `LONG_MAX` isn't `double` compatible remains.

A photo of Nat!

From: Nat!

@Steinar That looks like the best most general solution (I haven't tested it yet). Alas I'd like to do it, without having to link to `-lm`.

A photo of mnd

From: mnd

> What this is testing is, if a `double` could be stored in a `long` without loss of information.

Then your code still wrong in this case: You convert long to double assuming that long can be stored to double without loss of information, but it's wrong. You can easily check that ((double)(LONG_MAX-1) == (double)(LONG_MAX-10)). It's simple task to find minimal long value when ((double)value==(double)(value+1)) becomes true.

Knowing this what the purpose of the storing double (at least double bigger than ((1L<<52)-1)) in long? This numbers for sure can't be converted back to doubles without loss of information.

A photo of szc

From: szc

Post to hackerNew, but the thread is dropping.... so rephrasing and posting here.

I think the problem is poorly posed and algorithm chosen to implement it is "backwards". StehpenCannon also notes the following in a comment, but it is not quite as explicit as this...

The problem is poorly posed because the question is, which mantissas, exponents and sign bits fit into a long.

The algorithm is backwards because simple comparisons in integer space cannot compute this; but I think the algorithm should be,

int fits_long(double value) {
unsigned int trailingZeros = countOfTrailingZeros(mantissaOf(value));
bool fits = (bitsInAMantissa + 2 - trailingZeros + exponentOf(value) < bitsInALong);
return fits;
}

[Notes:
if double is always > 0 and going to an unsigned long, the 2 above would be 1. The 2 represents the sign bit in the double. Given that sizeof(double) == sizeof(long) all bits in the two representations are accounted for, so there is no information loss.

checking:

mantissa = 0 (with a hidden msb of 1) means, trailingZeros = bitsInAMantissa -> fits will be true when the exponent value can be 0..62. So this represents each of the +/- 2^exponent values

mantissa = 1 x bitsInAMantissa means, trailingZeros = 0 -> fits will be true when the exponent value can be 0..(bitsInALong-BitsInAMantissa-2). So, +/- (all ones) * 2^exponent value.
The representation for going from double to long is not "smooth".
]

countOfTrailingZeros() is a favorite of bit twiddlers. "Hackers Delight" or "bithacks" https://graphics.stanford.edu/~seander/bithacks.html

Post a comment

All comments are held for moderation; basic HTML formatting accepted.

Name:
E-mail: (not published)
Website: