Nat!

Part 1 - i7, Quad Core and Dual Core -- Compilation Times

Motivation

As a rule of thumb, I buy a new system when I estimate that it is about four times faster than the old one. Four times faster at compiling stuff, that is. The price also needs to be OK: I am going to spend at most 3000 EUR on such a system, and preferably much less. Coming from a Q9650 system, I figured the i7-980x, with its improved memory system, its generally better instruction handling and its two added cores, should be able to deliver. But how could I be sure?

One problem I have with almost all the benchmark articles on the web is that they center heavily on tasks like movies, encoding or games. Some of these tasks, like gaming, don't parallelize well, whereas compilation does. Others, like movies and encoding, mainly use floating point arithmetic, whereas compilation doesn't. So they are not telling me very much.

Of course all this testing is too late, since I already bought the new system. :)


The Test Scenario

I use a shell script to build about 20 frameworks in succession. Some of the frameworks are made up of many object files, some frameworks contain only a few. All in all there are about 2000 source files with a total of 250K lines. The file with the most lines contains a little more than 2000 lines. I don't use precompiled headers at all.

I measured the time with time(1). As its man page says:

DESCRIPTION

The time utility executes and times utility. After the utility finishes, time writes the total time elapsed, the time consumed by system overhead, and the time used to execute utility to the standard error stream. Times are reported in seconds.

So a time result will look like this [1]:

real       7m13.895s 
user    10m19.728s
sys     2m58.034s

total time elapsed  is 'real', this is like a stopwatch time
time consumed by system overhead  is 'sys'
time used to execute  is 'user'

The 'user' time was always higher than the 'real' time, because the build process is heavily parallelized and 'user' adds up the CPU time spent on all cores.
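To make the relation between the three figures concrete, here is a minimal Python sketch that parses output in the format of the sample above and computes 'user' + 'sys' and its ratio to 'real'. The function names parse_duration and summarize are made up for this illustration, and the sketch assumes the "XmY.YYYs" format shown in the sample; other shells print time results differently.

```python
import re

# Sample output copied from the example above ("XmY.YYYs" format).
SAMPLE = """\
real    7m13.895s
user    10m19.728s
sys     2m58.034s
"""


def parse_duration(text):
    """Convert a string like '7m13.895s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def summarize(output):
    """Return real, user, sys, cpu (user + sys) and cpu/real as a dict."""
    times = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("real", "user", "sys"):
            times[fields[0]] = parse_duration(fields[1])
    times["cpu"] = times["user"] + times["sys"]
    # A ratio above 1 means that, on average, more than one core was busy.
    times["cpu_per_real"] = times["cpu"] / times["real"]
    return times


if __name__ == "__main__":
    for name, value in summarize(SAMPLE).items():
        print(f"{name}: {value:.3f}")
```

For the sample, 'user' + 'sys' comes to about 798s against about 434s of 'real' time, a ratio of roughly 1.8.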


The Tested Systems

T7300 A white MacBook with a 2 GHz Intel Core 2 Duo and 2 modules for 4 GB total of some aftermarket DDR2@667 MHz RAMs (5.33 GB/s) running 10.6.4 in 32 bit from a 160 GB Intel X25-M SSD. I assume this is dual channel memory, but I am not sure. Estimated memory bandwidth: 10.66 GB/s

Q9650 A Gigabyte X48-DS5 with a 3 GHz Intel Q9650 Quad Core and 2 modules for 4 GB total of 5-5-5-15 DDR2@800 MHz RAMs (6.4 GB/s) on two channels (12.8 GB/s) running 10.6.4 in 32 bit from a 256 GB Crucial RealSSD

i7-980x A Gigabyte X58A-UD7 rev2 with a 3.33 GHz Intel i7 980 Extreme and 3 modules for 6 GB total of 7-7-7-20 DDR3@1600MHz RAMs (12.8 GB/s) on three channels (38.4 GB/s) running 10.6.4 in 64 bit from a Crucial RealSSD 256 GB


Overview of Tested Configurations

The Gigabyte X58A-UD7 BIOS allows for a lot of interesting setup choices, so I changed a few parameters around for different configurations.


Configuration          GHz   Cores  Threads  L1     L2      L3     RAM   Type  MHz   Channels  Timing
T7300                  2     2      2        64 KB  4 MB    -      4 GB  DDR2  667   ?         ?
Q9650                  3     4      4        64 KB  12 MB   -      4 GB  DDR2  800   2         5-5-5-15
i7-980x                3.33  6      12       64 KB  256 KB  12 MB  6 GB  DDR3  1600  3         7-7-7-20
i7-980x 6 GB 3CH-800   3.33  6      12       64 KB  256 KB  12 MB  6 GB  DDR3  800   3         7-7-7-20
i7-980x 4 GB 2CH-1600  3.33  6      12       64 KB  256 KB  12 MB  4 GB  DDR3  1600  2         7-7-7-20
i7-980x 3C/6TH         3.33  3      6        64 KB  256 KB  12 MB  6 GB  DDR3  1600  3         7-7-7-20
i7-980x 3C/3TH         3.33  3      3        64 KB  256 KB  12 MB  6 GB  DDR3  1600  3         7-7-7-20
i7-980x 6C/6TH         3.33  6      6        64 KB  256 KB  12 MB  6 GB  DDR3  1600  3         7-7-7-20
i7-980x 3.6 GHz        3.6   6      12       64 KB  256 KB  12 MB  6 GB  DDR3  1600  3         7-7-7-20


The Measurements

And this is what I measured. When run multiple times the 'real' part typically varied by a second. So for my analysis, I assume that 47s is not really any faster than 48s, but that 49s is a little slower than 47s.

Machine                COMPILE 1  COMPILE 2
T7300                        243        441
Q9650                        172        531
i7-980x 3C/3TH                81        152
i7-980x 3C/6TH                72        336
i7-980x 4 GB 2CH-1600         47        357
i7-980x 6 GB 3CH-800          48        365
i7-980x                       47        356
i7-980x 3.6 GHz               44        338
i7-980x 6C/6TH                49        160

COMPILE 1 : real in seconds
COMPILE 2 : user + sys in seconds

As can be seen from the table, the times for COMPILE 2, which is 'user' plus 'sys', cannot be taken seriously for every configuration. I'd hate to judge which ones are correct and which are wrong. So I will ignore COMPILE 2 completely, though the Q9650 number really looks like an underlying problem waving its hand there.


Analysis

Indeed, the i7-980x system is (just about) four times faster than my Q9650 system, so I haven't made a costly mistake there. :)
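The "four times" figure can be checked directly against the COMPILE 1 column. The following is only a small sketch of that arithmetic; the dictionary and the function name speedup are names chosen for this illustration.

```python
# 'real' compile times in seconds, copied from the COMPILE 1 column above.
REAL_SECONDS = {
    "T7300": 243,
    "Q9650": 172,
    "i7-980x 3C/3TH": 81,
    "i7-980x 3C/6TH": 72,
    "i7-980x 4 GB 2CH-1600": 47,
    "i7-980x 6 GB 3CH-800": 48,
    "i7-980x": 47,
    "i7-980x 3.6 GHz": 44,
    "i7-980x 6C/6TH": 49,
}


def speedup(baseline, candidate, times=REAL_SECONDS):
    """How many times faster 'candidate' finishes the build than 'baseline'."""
    return times[baseline] / times[candidate]


if __name__ == "__main__":
    for machine in REAL_SECONDS:
        print(f"{machine:24s} {speedup('Q9650', machine):5.2f}x vs Q9650")
```

With these numbers the stock i7-980x comes out at roughly 3.7 times the Q9650, and at roughly 3.9 times when running at 3.6 GHz.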

Compile time results for various configurations

Taking out hyper-threading made very little difference to compilation performance with all six cores enabled (i7-980x 47s vs i7-980x 6C/6TH 49s). But it had a pronounced impact with only three cores enabled (i7-980x 3C/6TH 72s vs i7-980x 3C/3TH 81s). So the extra threads aren't necessarily useless, but something limits their effectiveness in higher numbers. This could indicate that the CPU is memory starved with twelve threads competing. It could also indicate that the overhead of twelve threads is comparatively much higher than the overhead of six threads. It could also indicate that my projects didn't have enough files to compile in parallel. But I tested another single Xcode project with 250 files, and the best 6/12 time was 25s vs 28s for 6/6.

An interesting observation is contained in this Ars Technica article, where the author compiles WebKit on an 8 core, 16 thread machine in 574s and on a 12 core, 24 thread machine in 429s. In that test, adding 50% more cores and threads cut the build time by 25%, which is a 34% gain in throughput. Doubling the number of cores and threads on the i7-980x cuts the build time by 35%, which is a 50%+ gain in throughput (i7-980x 47s vs. i7-980x 3C/6TH 72s). So this is comparable and doesn't indicate drastically diminishing returns for many threads.
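Percentages like these can be read in two ways, as time saved or as extra work done per second, and the two give quite different numbers for the same pair of measurements. Here is a short sketch that computes both for the two comparisons; the function names are invented for this example.

```python
def time_reduction(before, after):
    """Fraction of the run time that was saved (0.25 means 25% less time)."""
    return 1 - after / before


def throughput_gain(before, after):
    """Fractional increase in work done per second (0.5 means 50% more)."""
    return before / after - 1


# (seconds before, seconds after), taken from the paragraph above.
CASES = {
    "Ars Technica, 8C/16TH -> 12C/24TH": (574, 429),
    "i7-980x, 3C/6TH -> 6C/12TH": (72, 47),
}

if __name__ == "__main__":
    for label, (before, after) in CASES.items():
        print(f"{label}: "
              f"{time_reduction(before, after):.0%} less time, "
              f"{throughput_gain(before, after):.0%} more throughput")
```

The Ars Technica pair works out to about 25% less time or 34% more throughput, the i7-980x pair to about 35% less time or 53% more throughput.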

I reduced the memory bandwidth in two ways: in one configuration by removing one memory module, thereby taking out one memory channel (i7-980x 4 GB 2CH-1600, down to 25.6 GB/s from 38.4 GB/s), and in another configuration by halving the memory speed in the BIOS from 1600 MHz to 800 MHz (i7-980x 6 GB 3CH-800, 19.2 GB/s). To my complete surprise, that did almost nothing to the compile times. So this would rule out the memory starvation theory.
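The bandwidth figures quoted for the different configurations are theoretical peak values. As a rough sketch of where they come from: a DDR2 or DDR3 channel is 64 bits (8 bytes) wide, and the "MHz" ratings used here are really transfers per second, so the peak is rating times 8 bytes times the number of channels. The function name and the CONFIGS table are made up for this illustration.

```python
BYTES_PER_TRANSFER = 8  # one DDR2/DDR3 memory channel is 64 bits wide


def peak_bandwidth_gb_s(mega_transfers_per_s, channels):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000


# (memory rating as used in the text, number of channels)
CONFIGS = {
    "Q9650, DDR2-800, 2 channels": (800, 2),
    "i7-980x, DDR3-1600, 3 channels": (1600, 3),
    "i7-980x 4 GB 2CH-1600": (1600, 2),
    "i7-980x 6 GB 3CH-800": (800, 3),
}

if __name__ == "__main__":
    for label, (rating, channels) in CONFIGS.items():
        print(f"{label}: {peak_bandwidth_gb_s(rating, channels):.1f} GB/s")
```

This gives 12.8 GB/s for the Q9650 and 38.4, 25.6 and 19.2 GB/s for the three i7-980x memory setups. Real sustained bandwidth is lower than these peak values.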

I don't believe that the SSD is the bottleneck, because of the following observations:

  • When I place the intermediate build products on a ramdisk, I see no change in compile times at all.
  • The OS should be caching all the headers and compiler binaries in RAM anyway.
  • According to "Activity Monitor", the actual amount of I/O for a run is 40 MB read and 50 MB written, which is almost nothing for an SSD.
  • By visual inspection, the read/written data per second values are in the low KB/s range most of the time, which means that the OS file cache seems to work as expected.

What was also surprising to me is that the Q9650 is not even twice as fast as the T7300 (172s vs 243s, about 1.4 times faster), although it has twice the cores, three times the cache and a 50% higher clock speed. The memory bandwidth of the two systems, though, I assume to be comparable. Barring any software problem or other misconfiguration, it would appear that the memory system is the bottleneck for the Q9650 in this benchmark.

But on the other hand, the i7-980x apparently isn't bottlenecked at all, even with a reduced bandwidth that is "only" about twice that of the Q9650. So this isn't conclusive to me, and it's something I might want to look into. With what I know now, I really would expect to see a 120s figure for the Q9650.


Coming up next

Next I take a look at the GeekBench results and try to extract some more (... or less) interesting deductions from them and from what I have covered here so far.


[1] Actual result of my PPC 2 GHz running 10.5.