My C code works with -O3 but not with -O0
- v1.2 - added comments from HN
- v1.3 - added “What does fits mean ?”
- v1.4 - added the conclusion
When something very basic goes wrong, I have this hierarchy of potential culprits:
- the compiler
- buggy hardware
- OS vendor
- last and least me, because I don’t make mistakes :)
So what doesn’t work ? I am doing a simple cast from a floating point value
to an integer value in C. This happens in a routine that checks whether a double
is integral and whether it would fit into a long.
What does “fits” mean ?
If the value is stored into a long instead of a double, it should have the same numerical value. I.e. it shouldn’t matter if the number is stored as a double or a long, even when casting back and forth.
int fits_long( double value)
{
long l_val;
int fits;
if( value < (double) LONG_MIN || value > (double) LONG_MAX)
return( 0);
l_val = (long) value;
fits = ((double) l_val == value);
return( fits);
}
First it checks if the double value is within the limits of long.
If yes, the double value is converted to a long, and then it is checked
whether both values, as doubles, still match.
This works for LONG_MIN, INT_MIN and INT_MAX, but it doesn’t
work for LONG_MAX.
Blame the compiler
I wrote a few tests to see what’s going on. There is a repository up
on github that has the
source. Let’s look at one file long.c:
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <assert.h>
#include <float.h>
int fits_long( double value)
{
long l_val;
int fits;
if( value < (double) LONG_MIN || value > (double) LONG_MAX)
return( 0);
l_val = (long) value;
fits = (double) l_val == value;
if( ! fits)
fprintf( stderr, "Mismatch %f with %ld\n", value, l_val);
return( fits);
}
int main( int argc, char *argv[])
{
double value;
fprintf( stderr, "long (%d bytes): %ld - %ld\n", (int) sizeof( long), LONG_MIN, LONG_MAX);
fprintf( stderr, "double (%d bytes): %g - %g\n", (int) sizeof( double), DBL_MIN, DBL_MAX);
assert( (double) LONG_MAX <= DBL_MAX);
// keep my code optimiza
if( argc == 1)
value = (double) LONG_MAX;
else
value = (double) (LONG_MAX - atoi( argv[ 1]));
if( fits_long( value))
printf( "FITS\n");
else
printf( "NOPE\n");
return( 0);
}
Let’s build it with cc -o long-O0.exe long.c and run it. cc is gcc-9. This is
the output on my Intel Xeon machine:
long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE
But with optimized code cc -O3 -o long-O3.exe long.c I get:
long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS
Suddenly it works! This has got to be a compiler problem. So let’s try another compiler, clang-9. But it gets the same result!
Or is it the CPU ?
Suspect the CPU
Intel CPUs have a history of bugs. Did I hit one of those ? Let’s fire up
the Raspberry Pi and see if results differ on ARM.
On ARM the long-long.c test exhibits the same behavior (long.c is fine,
because long is only 4 bytes large on the Pi 4):
./long-long-O0.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE
./long-long-O3.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS
So it must be a compiler bug! Or ?
Do another experiment
You can give long.exe a number that will be
subtracted from LONG_MAX. A manual binary search brought me to the
conclusion that the problem manifests only with numbers that are larger than
LONG_MAX - 512.
And 512 is 2^9, nine bits. Which is peculiar, but does it matter ?
Gather some more data
The problem happens when long or long long
matches the size of double, a floating point number. Both are 8 bytes large.
How can a 64 bit integer fit into a 64 bit double ?
What does C say ?
If you look at the “New C Standard” we have:
When a finite value of real floating type is converted to an integer type other than
_Bool, the fractional part is discarded (i.e., the value is truncated toward zero).
Fine, long is not _Bool.
If the value being converted is outside the range of values that can be represented, the behavior is undefined.
Well, we checked against LONG_MIN and LONG_MAX, and both are well within
DBL_MIN and DBL_MAX. So the value seems cool.
Wikipedia
A 64 bit floating point in C ought to be an IEEE 754 double. The sign takes up 1 bit, the exponent 11 bits, and the fraction 52 bits.
Wikipedia says:
The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^-53 ≈ 1.11 × 10^-16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string.
15 is also what DBL_DIG gives
us. But our LONG_MAX is 9,223,372,036,854,775,807 so it has 19 digits. If we
subtract 512 it becomes 9,223,372,036,854,775,295 and it fits. That doesn’t
match up.
Godbolt
Looking at the -O3 assembler output on GodBolt of
a simplified version of fits_llong:
#include <limits.h>
int fits_llong( double value)
{
long long l_val;
l_val = (long long) value;
return( l_val == value);
}
we see this assembler output:
cvttsd2si rax, xmm0
cvtsi2sd xmm1, rax
cmpeqsd xmm1, xmm0
movq rax, xmm1
and eax, 1
ret
The generated code uses floating point (FP) xmm registers, which are 128 bits wide. The
conversion and the comparison are
made in xmm, and the flag is then moved out of xmm into the integer register eax and
returned.
When we look at the -O0 code
push rbp
mov rbp, rsp
movsd qword ptr [rbp - 8], xmm0
cvttsd2si rax, qword ptr [rbp - 8]
mov qword ptr [rbp - 16], rax
cvtsi2sd xmm0, qword ptr [rbp - 16]
ucomisd xmm0, qword ptr [rbp - 8]
sete cl
setnp dl
and cl, dl
and cl, 1
movzx eax, cl
pop rbp
ret
we see that the comparison is done with ucomisd, which is 64 bits wide.
So it’s clear why -O0 and -O3 differ. In -O3 the computation is done
in higher precision registers and the outcome is different than if it were
done in -O0. It’s a little surprising to me that ARM shows the same
effect. But I know very little about that CPU.
Making it fail predictably
If the double is not passed in an xmm register but rather read as a true
in-memory double, the function fails in -O3 and -O0 alike. To get a true double, I
pass it via a pointer, which is obviously less efficient:
int fits_long( double *value)
{
long l_val;
int fits;
if( *value < (double) LONG_MIN || *value > (double) LONG_MAX)
return( 0);
l_val = (long) *value;
fits = (double) l_val == *value;
if( ! fits)
fprintf( stderr, "Mismatch %g with %ld\n", *value, l_val);
return( fits);
}
But will this always work ? Sometimes I feel like the compiler isn’t my friend
anymore, but rather an enemy. How can I be sure that fits_long won’t get
inlined and consequently “improved” to use a double instead of a double * ?
Since it is not static, the compiler should keep it as is, so the symbol is
available for linking. But I suspect nothing prevents the compiler from
inlining it anyway in a local calling function. I could move the function to
a different .o file, so inlining can’t happen. But then I live in fear of
link-time optimizers becoming more aggressive…
I will adorn it with __attribute__(( noinline)) and hope that all compilers
will understand and respect it.
Form an opinion
If you pass a value as a double into a function, it may be held in higher
precision registers. The following calculations can be done, and in -O3 in my case
will be done, exclusively with those registers. The result of the cast
operation differs between the 128 bit and the 64 bit floating point (FP) domain.
Only a true double sized variable (via a pointer) gives a reliable 64 bit
FP domain result:
__attribute__(( noinline))
int fits_long( double *value)
{
long l_val;
int fits;
l_val = (long) *value;
fits = (double) l_val == *value;
return( fits);
}
So who is to blame ?
In terms of the C-Standard, I am hitting the clause
If the value being converted is outside the range of values that can be represented, the behavior is undefined.
The value LONG_MAX of a 64 bit long cannot be represented in a double, which is also
64 bits large. The mistake is basically on the side of the caller of the function,
casting LONG_MAX to a double and passing that along.
There is no warning from the compiler though. So you have to know which numbers are kosher and which are not (which I don’t).
Comments from Hacker News
So I linked this article on Hacker News, where it gained some traction. Some of the comments were very illuminating:
Use -fsanitize=undefined
hannob pointed out
that this could have been caught more easily if one had used the
-fsanitize=undefined option. I didn’t think of that, but I will remember it
now.
Check FP registers with fetestexcept
dfranke mentioned the use of the
floating point status registers. I didn’t know these were accessible in C,
since I don’t do FP computations a lot.
Reading up on this topic a bit more, it looks like a cast from double to
long might raise an exception if the processor is configured that way.
So this is something I’d like to avoid.
SSE registers are 128 bits wide, but precision is 64 bit
heftig noticed that what I thought were 128 bit operations are really 64 bit after all.
The best comment was deleted
I am sure I read this, but I can’t find it anymore, so I don’t know who wrote it. Why was it best ? Because it just showed how I am tripping over my own feet again and again :)
// keep my code optimiza
if( argc == 1)
value = (double) LONG_MAX;
else
value = (double) (LONG_MAX - atoi( argv[ 1]));
Using a command-line argument to keep the optimizer from optimizing code away is valid, because the input is unpredictable. Yet because I am lazy, I gave a default value if there are no arguments…
The optimizer snatches up the opportunity to forego the fits_long call
when argc == 1 (1 is a constant) and does all the calculation at compile time
rather than at execution time, with different results.
Fixing this to
// keep my code optimiza
value = (double) (LONG_MAX - atoi( argv[ 1]));
(and supplying 0 on the command line) produces NOPE output with both -O0 and -O3.
Conclusion
Let’s build it back up from the ground.
What is problematic about this code ?
int fits_long( double value)
{
long l_val;
double d_val;
int fits;
l_val = (long) value; // 1
d_val = (double) l_val; // 2
fits = d_val == value; // 3
return( fits);
}
The conversion in 1, (long) value, may produce an undefined value for l_val
if the value is out of bounds for long.
The same goes for the conversion in 2. Let’s assume the undefined value returned from
1 is by chance LONG_MAX, which we know not to be representable as a double. Again
we would be running into a case where the value is undefined in 2.
The outcome of the comparison in 3 of an undefined d_val with value is
therefore unpredictable. This is no good.
It would be nice to check beforehand whether a value is out of bounds
and then not attempt the conversion. It’s known
that 64 bit doubles can represent all integers in the range of -2^53 to 2^53.
Checking for that, we get code that doesn’t produce undefined values:
int fits_long( double value)
{
long l_val;
double d_val;
int fits;
if( value < (double) -(1L << 53) || value > (double) (1L << 53))
return( 0);
l_val = (long) value;
d_val = (double) l_val;
fits = d_val == value;
return( fits);
}
But these are by far not all the integers that a double can “cleanly” represent.
For instance, you can store the number 1000000000000000000 without loss into
a double and a long. 2^53 is only 9007199254740992. Three digits less.
To check for all values, one needs to use the math library function lrint in conjunction with fetestexcept of the fenv library. With the fenv library, you get access to the floating point status registers.
int fits_long( double d)
{
long l_val;
double d_val;
// may be needed ?
// #pragma STDC FENV_ACCESS ON
feclearexcept( FE_INVALID); // 1
l_val = lrint( d); // 2
d_val = (double) l_val; // 3
if( fetestexcept( FE_INVALID)) // 4
return( 0);
return( d_val == d);
}
What this does: in 1, it clears the FP exception flag FE_INVALID with
feclearexcept. Then it does the conversion in 2. If the value cannot be
represented, the FE_INVALID flag will be set. The returned value is then
cast back into a double in 3. The devil’s advocate tells me that this
could go wrong too, but since the FP exception flags are sticky, there is no harm, and possibly some gain, in
doing the cast before the check. Then it is tested in 4 with
fetestexcept whether undefined values have been produced.
If yes, we skip the comparison.
Now we know that d_val and d are valid, so we can compare without fear of
comparing undefined values.
Caveat
Use of fenv seems to be fairly tricky though. Perusing clang and python discussions, one gets the impression that the optimizer could get in the way. There is a
#pragma STDC FENV_ACCESS ON for this, but… it’s a pragma. Not everyone supports it.
Open end
I am still looking for an easier way, without having to link -lm, to check
whether a double can be represented by an integer of the same size or not.
I suspect there isn’t any.
8 Comments
From: CatoCatoCato
When you work with floating point, you need to remember you work with a tolerance to some epsilon for comparisons because you are rounding to 1/n^2 precision and different floating point units perform the conversion in different ways.
You must abandon the idea of '==' for floats (except for 0.0).
This is why your code is unpredictable, because you cannot guarantee the conversion of any integer to and from float is the same number. Period. The LSBs of the mantissa can and do change, which is why we mask to a precision or use signal-to-noise comparisons when evaluation bit drift between FP computations.
You has the first part correct, < and > are your friend with FP. But to get past the '==' hurdle, you need to define tolerance, the code should be something like:
if (fabs(f1 - f2) > TOLERANCE) ... fits = true.
That's how you work with FP because different hardware will always report different LSBs.
From: Nat!
While this holds some truth in a general sense, it's not really helpful in this case. What this is testing is, if a `double` could be stored in a `long` without loss of information. If you use an epsilon here, you're doing it wrong.
From: Macmade
Still undefined behavior...
From: Steinar H. Gunderson
bool fits_long(double d)
{
feclearexcept(FE_INVALID);
lrint(d);
return !fetestexcept(FE_INVALID);
}
From: Nat!
@Macmade Yes, the problem that `LONG_MAX` isn't `double` compatible remains.
From: Nat!
@Steinar That looks like the best most general solution (I haven't tested it yet). Alas I'd like to do it, without having to link to `-lm`.
From: mnd
> What this is testing is, if a `double` could be stored in a `long` without loss of information.
Then your code still wrong in this case: You convert long to double assuming that long can be stored to double without loss of information, but it's wrong. You can easily check that ((double)(LONG_MAX-1) == (double)(LONG_MAX-10)). It's simple task to find minimal long value when ((double)value==(double)(value+1)) becomes true.
Knowing this what the purpose of the storing double (at least double bigger than ((1L<<52)-1)) in long? This numbers for sure can't be converted back to doubles without loss of information.
From: szc
Post to hackerNew, but the thread is dropping.... so rephrasing and posting here.
I think the problem is poorly posed and algorithm chosen to implement it is "backwards". StehpenCannon also notes the following in a comment, but it is not quite as explicit as this...
The problem is poorly posed because the question is, which mantissas, exponents and sign bits fit into a long.
The algorithm is backwards because simple comparisons in integer space cannot compute this; but I think the algorithm should be,
int fits_long(double value) {
unsigned int trailingZeros = countOfTrailingZeros(mantissaOf(value));
bool fits = (bitsInAMantissa + 2 - trailingZeros + exponentOf(value) < bitsInALong);
return fits;
}
[Notes:
if double is always > 0 and going to an unsigned long, the 2 above would be 1. The 2 represents the sign bit in the double. Given that sizeof(double) == sizeof(long) all bits in the two representations are accounted for, so there is no information loss.
checking:
mantissa = 0 (with a hidden msb of 1) means, trailingZeros = bitsInAMantissa -> fits will be true when the exponent value can be 0..62. So this represents each of the +/- 2^exponent values
mantissa = 1 x bitsInAMantissa means, trailingZeros = 0 -> fits will be true when the exponent value can be 0..(bitsInALong-BitsInAMantissa-2). So, +/- (all ones) * 2^exponent value.
The representation for going from double to long is not "smooth".
]
countOfTrailingZeros() is a favorite of bit twiddlers. "Hackers Delight" or "bithacks" https://graphics.stanford.edu/~seander/bithacks.html