# My C code works with -O3 but not with -O0

- v1.2 - added comments from HN
- v1.3 - added “What does fits mean ?”
- v1.4 - added the conclusion

When something very basic goes wrong, I have this hierarchy of potential culprits:

- the compiler
- buggy hardware
- OS vendor
- last and least me, because I don’t make mistakes :)

So what doesn’t work ? I am doing a simple cast from a floating point value
to an integer value in C. This happens in a routine, that checks, if a `double`

is integral and if it would fit into a long.

## What does “fits” mean ?

If the value is stored into a long instead of a double, it should have the same numerical value. I.e. it shouldn’t matter if the number is stored as a double or a long, even when casting back and forth.

```
int fits_long( double value)
{
long l_val;
int fits;
if( value < (double) LONG_MIN || value > (double) LONG_MAX)
return( 0);
l_val = (long) value;
fits = ((double) l_val == value);
return( fits);
}
```

First it checks, if the `double`

value is within the limits of `long`

.
If yes, then the the `double`

value is converted to a `long`

, and then checked
if both values as `double`

still match.

This works for `LONG_MIN`

, `INT_MIN`

, `INT_MAX`

but it doesn’t
work for `LONG_MAX`

.

## Blame the compiler

I wrote a few tests to see what’s going on. There is a repository up
on github that has the
source. Let’s look at one file `long.c`

:

```
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <assert.h>
#include <float.h>
int fits_long( double value)
{
long l_val;
int fits;
if( value < (double) LONG_MIN || value > (double) LONG_MAX)
return( 0);
l_val = (long) value;
fits = (double) l_val == value;
if( ! fits)
fprintf( stderr, "Mismatch %f with %ld\n", value, l_val);
return( fits);
}
int main( int argc, char *argv[])
{
double value;
fprintf( stderr, "long (%d bytes): %ld - %ld\n", (int) sizeof( long), LONG_MIN, LONG_MAX);
fprintf( stderr, "double (%d bytes): %g - %g\n", (int) sizeof( double), DBL_MIN, DBL_MAX);
assert( (double) LONG_MAX <= DBL_MAX);
// keep my code optimiza
if( argc == 1)
value = (double) LONG_MAX;
else
value = (double) (LONG_MAX - atoi( argv[ 1]));
if( fits_long( value))
printf( "FITS\n");
else
printf( "NOPE\n");
return( 0);
}
```

Let’s build it with `cc -o long-O0.exe long.c`

and run it. `cc`

is **gcc-9**. This is
the output on my Intel Xeon machine:

```
long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE
```

But with optimized code `cc -O3 -o long-O3.exe long.c`

I get:

```
long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS
```

Suddenly it works! This has got to be a compiler problem. So lets try another
compiler **clang-9**. But it gets the same result!

Or is it the CPU ?

## Suspect the CPU

Intel CPUs have a history of bugs. Did I hit one of those ? Let’s fire up
the Raspberry Pi and see if results differ on ARM.
On ARM the `long-long.c`

test exhibits the same behavior (`long.c`

is fine,
because `long`

is only 4 bytes large on the PI 4.)

```
./long-long-O0.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
Mismatch 9223372036854775808.000000 with -9223372036854775808
NOPE
./long-long-O3.exe
long-long (8 bytes): -9223372036854775808 - 9223372036854775807
double (8 bytes): 2.22507e-308 - 1.79769e+308
FITS
```

So it must be a compiler bug! Or ?

## Do another experiment

You can give `long.exe`

a number that will be
subtracted from `LONG_MAX`

. A manual binary search brought me to the
conclusion,that the problem manifests only with numbers that are larger than
`LONG_MAX - 512`

.

And 512 are seven bits. Which is peculiar, but does it matter ?

## Gather some more data

The problem happens when `long`

or `long-long`

match the size of `double`

a floating point number. Both are 8 bytes large.
How can a 64 bit integer fit into a 64 bit double ?

### What does C say ?

If you look at the “New C Standard” we have:

When a finite value of real floating type is converted to an integer type other than

`_Bool`

, the fractional part is discarded (i.e., the value is truncated toward zero).

Fine `long`

is not `_Bool`

.

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

Well we checked against `LONG_MIN`

and `LONG_MAX`

and both are well within
`DBL_MIN`

and `DBL_MAX`

. So the value seems cool.

### Wikipedia

A floating point in 64 bit C ought to be an IEEE 754 which looks like this:

The exponent takes up 11 bit, and the fraction is 52 bit.

Wikipedia says:

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2−53 ≈ 1.11 × 10−16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string.

15 is also what `DBL_DIG`

gives
us. But our `LONG_MAX`

is 9,223,372,036,854,775,807 so it has 19 digits. If we
subtract 512 it becomes 9,223,372,036,854,775,295 and it fits. That doesn’t
match up.

### Godbolt

Looking at the `-O3`

assembler output on GodBolt of
a simplified version of `fits_llong`

:

```
#include <limits.h>
int fits_llong( double value)
{
long long l_val;
l_val = (long long) value;
return( l_val == value);
}
```

we see this assembler output:

```
cvttsd2si rax, xmm0
cvtsi2sd xmm1, rax
cmpeqsd xmm1, xmm0
movq rax, xmm1
and eax, 1
ret
```

The generated code uses floating point (FP) `xmm`

registers which are 128 bits wide. The
conversion and the comparison are
made in `xmm`

and the flag is then moved out of `xmm`

into integer `eax`

and
returned.

When we look at the `-O0`

code

```
push rbp
mov rbp, rsp
movsd qword ptr [rbp - 8], xmm0
cvttsd2si rax, qword ptr [rbp - 8]
mov qword ptr [rbp - 16], rax
cvtsi2sd xmm0, qword ptr [rbp - 16]
ucomisd xmm0, qword ptr [rbp - 8]
sete cl
setnp dl
and cl, dl
and cl, 1
movzx eax, cl
pop rbp
ret
```

we see that the comparison is done with `ucomisd`

which is 64 bits wide.

So it’s clear, why `-O0`

and `-O3`

differ. In `-O3`

the computation is done
in higher precision registers and the outcome is different, than if it were
done in `-O0`

. It’s a little surprising to me, that ARM shows the same
effect. But I know very little about that CPU.

## Making it fail predictably

If the `double`

is not passed in a `xmm`

register but rather via a true `double`

the function fails in `-O3`

and `-O0`

. To achieve sending a true double I
pass it via a pointer, which is obviously less efficient:

```
int fits_long( double *value)
{
long l_val;
int fits;
if( *value < (double) LONG_MIN || *value > (double) LONG_MAX)
return( 0);
l_val = (long) *value;
fits = (double) l_val == *value;
if( ! fits)
fprintf( stderr, "Mismatch %g with %ld\n", *value, l_val);
return( fits);
}
```

But will this always work ? Sometimes I feel like the compiler isn’t my friend
anymore, but rather an enemy. How can I be sure, that `fits_long`

won’t get
inlined and consequently “improved” to use a `double`

instead of a `double *`

?

Since it is not a static, the compiler should keep it as is, so the symbol is
available for linking. But I suspect nothing prevents the compiler from
inlining it anyway in a local calling function. I could move the function to
a different `.o`

file, so inlining can’t happen. But then I live in fear of
link-time optimizers becoming more aggressive…

I will adorn it with `__attribute__(( noinline))`

and hope that all compilers
will understand and respect that.

## Form an opinion

If you pass a value as `double`

into a function, it may be using higher
precision registers. The following calculations can be done, and in `-O3`

in my case
will be done, with those registers exclusively. The results of the casting
operation differ in the 128 bit and 64 bit floating point (FP) domain.

Only a true `double`

size variable (via a pointer) gives a reliable 64 bit
FP domain result:

```
__attribute__(( noinline))
int fits_long( double *value)
{
long l_val;
int fits;
l_val = (long) *value;
fits = (double) l_val == *value;
return( fits);
}
```

So who is to blame ?

In terms of the C-Standard, I am hitting the clause

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

The value `LONG_MAX`

if 64 bit long can not be represented in a `double`

also
64 bit long. The mistake is basically on the side of the caller of the function
casting a `LONG_MAX`

to a double and passing that along.

There is no warning from the compiler though. So you have to know which numbers are kosher and which are not (which I don’t).

## Comments from Hacker News

So I linked this article on HackerNews, were it gained some traction. Some of the comments on HackerNews were very illuminating:

### Use -fsanitize=undefined

hannob pointed out
that this could have been caught more easily, if one had used the
`-fsanitize=undefined`

option. I didn’t think of that, but I will remember it
now.

### Check FP registers with `fetestexcept`

dfranke mentioned use of the
floating point status registers. I didn’t know these were accessible in C,
since I don’t do FP computations a lot.
Reading up on this topic a bit more, it looks like a cast from `double`

to
`long`

, might raise an exception if the processor is configured that way.
So this is something I’d like to avoid.

### SSE registers are 128 bits wide, but precision is 64 bit

heftig notices, that what I thought were 128 bits operations are really 64 bit after all.

### The best comment was deleted

I am sure I read this, but I can’t find it anymore. So I don’t know who wrote it. Why was it best ? Because it just showed how I am tripping over my own feet again and again :)

```
// keep my code optimiza
if( argc == 1)
value = (double) LONG_MAX;
else
value = (double) (LONG_MAX - atoi( argv[ 1]));
```

The use of some command-line argument to keep the optimizer from optimizing code away is valid, because the input is unpredictable. Yet because I am lazy, I gave a default value, if there are no arguments…

The optimizer snatches up the opportunity to forego the `fits_long`

call
when argc == 1 (1 is a constant) and does all the calculation at compile time
rather than execution time, with different results.

Fixing this to

```
// keep my code optimiza
value = (double) (LONG_MAX - atoi( argv[ 1]));
```

(and supplying 0 on the command line), produces `NOPE`

output with `-O0`

*and* `-O3`

.

## Conclusion

Let’s build it back up from the ground.

What is problematic about this code ?

```
int fits_long( double value)
{
long l_val;
double d_val;
int fits;
l_val = (long) value; // 1
d_val = (double) l_val; // 2
fits = d_val == value; // 3
return( fits);
}
```

The conversion in **1** `(long) value`

may produce an *undefined* value for `l_val`

,
if the value is out of bounds for `long`

.
The same goes for the conversion in **2**. Lets assume the *undefined* value returned from
**1** is by chance `LONG_MAX`

, which we know to not be a `double`

. Again
we would be running into a case were the value is *undefined* in **2**.
The outcome of the comparison in **3** of an *undefined* `d_val`

with `value`

is
therefore unpredictable. This is no good.

It would be nice to check beforehand, if a value is *undefined*
and then not attempt the comparison. It’s known,
that 64 bit doubles can represent all integers in the range of `-1*2^53+1`

to `1*2^53+1`

.
Checking for that we get code that doesn’t produce *undefined* values:

```
int fits_long( double value)
{
long l_val;
double d_val;
int fits;
if( value < (double) -((1L<<52)+1) && value > (double) ((1L<<52)+1))
return( 0);
l_val = (long) value;
d_val = (double) l_val;
fits = d_val == value;
return( fits);
}
```

But these are by far not all the integers that a double can “cleanly” represent.
For instance you can store without loss the number `1000000000000000000`

into
a `double`

and a `long`

. `1*2^53`

is only `4503599627370496`

. Three digits less.

To check for all values, one needs to use the **math** library function
lrint in conjunction with
fetestexcept of the **fenv**
library. With the fenv library, you get access to the floating point
status registers.

```
int fits_long( double d)
{
long l_val;
double d_val;
// may be needed ?
// #pragma STDC FENV_ACCESS ON
feclearexcept( FE_INVALID); // 1
l_val = lrint( d); // 2
d_val = (double) l_val; // 3
if( fetestexcept( FE_INVALID)) // 4
return( 0);
return( d_val == d);
}
```

What this does is in **1**, clear the FP exception flag `FE_INVALID`

with
`feclearexcept`

. Then it will do the conversion in **2**. If the returned value
is *undefined*, the `FE_INVALID`

flag will be set. The returned value is now
casted back into a `double`

in **3**. The devils advocate, tells me that this
could go wrong too and, since the FP registers are sticky, there is no harm and possible gain in
doing the cast before the check. Then it is tested in **4** with
`fetestexcept`

, if *undefined* values have been produced.
If yes, we skip the comparison.

Now we know that `d_val`

and `d`

are valid. So we can compare without fear of
comparing *undefined* values.

## Caveat

Use of

fenvseems to be fairly tricky though. Perusing clang and python discussions, one gets the impression that the optimizer could get in the way. There is a`#pragma STDC FENV_ACCESS ON`

for this, but… it’s a pragma. Not everyone supports it.

## Open end

I am still looking for an easier way without having to link `-lm`

to check,
if a `double`

can be represented by an integer of the same size or not.
I suspect there isn’t any.

## 8 Comments

## From: CatoCatoCato

When you work with floating point, you need to remember you work with a tolerance to some epsilon for comparisons because you are rounding to 1/n^2 precision and different floating point units perform the conversion in different ways.

You must abandon the idea of '==' for floats (except for 0.0).

This is why your code is unpredictable, because you cannot guarantee the conversion of any integer to and from float is the same number. Period. The LSBs of the mantissa can and do change, which is why we mask to a precision or use signal-to-noise comparisons when evaluation bit drift between FP computations.

You has the first part correct, < and > are your friend with FP. But to get past the '==' hurdle, you need to define tolerance, the code should be something like:

if (fabs(f1 - f2) > TOLERANCE) ... fits = true.

That's how you work with FP because different hardware will always report different LSBs.

Written on .

## From: Nat!

While this holds some truth in a general sense, it's not really helpful in this case. What this is testing is, if a `double` could be stored in a `long` without loss of information. If you use an epsilon here, you're doing it wrong.

Written on .

## From: Macmade

Still undefined behavior...

Written on .

## From: Steinar H. Gunderson

bool fits_long(double d)

{

feclearexcept(FE_INVALID);

lrint(d);

return !fetestexcept(FE_INVALID);

}

Written on .

## From: Nat!

@Macmade Yes, the problem that `LONG_MAX` isn't `double` compatible remains.

Written on .

## From: Nat!

@Steinar That looks like the best most general solution (I haven't tested it yet). Alas I'd like to do it, without having to link to `-lm`.

Written on .

## From: mnd

> What this is testing is, if a `double` could be stored in a `long` without loss of information.

Then your code still wrong in this case: You convert long to double assuming that long can be stored to double without loss of information, but it's wrong. You can easily check that ((double)(LONG_MAX-1) == (double)(LONG_MAX-10)). It's simple task to find minimal long value when ((double)value==(double)(value+1)) becomes true.

Knowing this what the purpose of the storing double (at least double bigger than ((1L<<52)-1)) in long? This numbers for sure can't be converted back to doubles without loss of information.

Written on .

## From: szc

Post to hackerNew, but the thread is dropping.... so rephrasing and posting here.

I think the problem is poorly posed and algorithm chosen to implement it is "backwards". StehpenCannon also notes the following in a comment, but it is not quite as explicit as this...

The problem is poorly posed because the question is, which mantissas, exponents and sign bits fit into a long.

The algorithm is backwards because simple comparisons in integer space cannot compute this; but I think the algorithm should be,

int fits_long(double value) {

unsigned int trailingZeros = countOfTrailingZeros(mantissaOf(value));

bool fits = (bitsInAMantissa + 2 - trailingZeros + exponentOf(value) < bitsInALong);

return fits;

}

[Notes:

if double is always > 0 and going to an unsigned long, the 2 above would be 1. The 2 represents the sign bit in the double. Given that sizeof(double) == sizeof(long) all bits in the two representations are accounted for, so there is no information loss.

checking:

mantissa = 0 (with a hidden msb of 1) means, trailingZeros = bitsInAMantissa -> fits will be true when the exponent value can be 0..62. So this represents each of the +/- 2^exponent values

mantissa = 1 x bitsInAMantissa means, trailingZeros = 0 -> fits will be true when the exponent value can be 0..(bitsInALong-BitsInAMantissa-2). So, +/- (all ones) * 2^exponent value.

The representation for going from double to long is not "smooth".

]

countOfTrailingZeros() is a favorite of bit twiddlers. "Hackers Delight" or "bithacks" https://graphics.stanford.edu/~seander/bithacks.html

Written on .

## Post a comment

All comments are held for moderation; basic HTML formatting accepted.