Part 1 - i7, Quad Core and Dual Core -- Compilation Times
Motivation
As a rule of thumb, I buy a new system when I estimate that the newer system is about four times faster than the old one. Four times faster at compiling stuff, that is. The price also needs to be OK: I am going to spend at most 3000 EUR on such a system and preferably much less. Coming from a Q9650 system, I figured the i7-980x with its improved memory system, its generally better instruction handling and the two added cores should be able to deliver. But how could I be sure?
One problem I have with almost all the benchmark articles on the web is that they heavily center on tasks like movies, encoding or games. Some of these tasks, like gaming, don't parallelize well, whereas compilation does. Others, like movies and encoding, mainly use floating point arithmetic, whereas compilation doesn't. So they are not telling me very much.
Of course all this testing is too late, since I already bought the new system. :)
The Test Scenario
I use a shell script to build about 20 frameworks in succession. Some of the frameworks are made up of many object files, some frameworks contain only a few. All in all there are about 2000 source files with a total of 250K lines. The file with the most lines contains a little more than 2000 lines. I don't use precompiled headers at all.
I measured the time with time(1). As its man page says:
DESCRIPTION
The time utility executes and times utility. After the utility finishes, time writes the total time elapsed, the time consumed by system overhead, and the time used to execute utility to the standard error stream. Times are reported in seconds.
So a time result will look like this [1]:
real 7m13.895s
user 10m19.728s
sys 2m58.034s
- total time elapsed is 'real', this is like a stopwatch time
- time consumed by system overhead is 'sys'
- time used to execute is 'user'
The 'user' time was always larger than the 'real' time, because the build process is heavily parallelized.
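To compare such results numerically, the minutes-and-seconds stamps need to be converted to plain seconds first. The following is a minimal sketch, assuming the bash-style `7m13.895s` format of the sample result; the helper names `to_seconds` and `parse_time_output` are made up for this illustration and are not part of the build script.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a duration such as '7m13.895s' (or '13.895s') to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised duration: {stamp!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def parse_time_output(text: str) -> dict:
    """Map 'real', 'user' and 'sys' to seconds.

    Assumes the output alternates label and value, separated by whitespace.
    """
    fields = text.split()
    return {label: to_seconds(value)
            for label, value in zip(fields[::2], fields[1::2])}


# The sample result quoted above
sample = "real 7m13.895s user 10m19.728s sys 2m58.034s"
times = parse_time_output(sample)
cpu_time = times["user"] + times["sys"]
print(f"real {times['real']:.1f}s, user+sys {cpu_time:.1f}s, "
      f"ratio {cpu_time / times['real']:.2f}")
```

For the sample result the ratio of 'user' plus 'sys' to 'real' comes out at about 1.84, meaning that on average a little less than two processors were kept busy during that build.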
The Tested Systems
Overview of Tested Configurations
The Gigabyte X58A-UD7 BIOS allows for a lot of interesting setup choices, so I changed a few parameters around for different configurations.
Machine | GHz | Cores | Threads | L1 | L2 | L3 | RAM | Type | MHz | Channels | Timing
T7300 | 2 | 2 | 2 | 64 KB | 4 MB | - | 4 GB | DDR2 | 667 | ? | ?
Q9650 | 3 | 4 | 4 | 64 KB | 12 MB | - | 4 GB | DDR2 | 800 | 2 | 5-5-5-15
i7-980x | 3.33 | 6 | 12 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 1600 | 3 | 7-7-7-20
i7-980x 6 GB 3CH-800 | 3.33 | 6 | 12 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 800 | 3 | 7-7-7-20
i7-980x 4 GB 2CH-1600 | 3.33 | 6 | 12 | 64 KB | 256 KB | 12 MB | 4 GB | DDR3 | 1600 | 2 | 7-7-7-20
i7-980x 3C/6TH | 3.33 | 3 | 6 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 1600 | 3 | 7-7-7-20
i7-980x 3C/3TH | 3.33 | 3 | 3 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 1600 | 3 | 7-7-7-20
i7-980x 6C/6TH | 3.33 | 6 | 6 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 1600 | 3 | 7-7-7-20
i7-980x 3.6 GHz | 3.6 | 6 | 12 | 64 KB | 256 KB | 12 MB | 6 GB | DDR3 | 1600 | 3 | 7-7-7-20
The Measurements
And this is what I measured. When run multiple times the 'real' part typically varied by a second. So for my analysis, I assume that 47s is not really any faster than 48s, but that 49s is a little slower than 47s.
Machine | COMPILE 1 | COMPILE 2 |
T7300 | 243 | 441 |
Q9650 | 172 | 531 |
i7-980x 3C/3TH | 81 | 152 |
i7-980x 3C/6TH | 72 | 336 |
i7-980x 4 GB 2CH-1600 | 47 | 357 |
i7-980x 6 GB 3CH-800 | 48 | 365 |
i7-980x | 47 | 356 |
i7-980x 3.6 GHz | 44 | 338 |
i7-980x 6C/6TH | 49 | 160 |
COMPILE 1 : real in seconds
COMPILE 2 : user + sys in seconds
As can be seen from the output, the times for COMPILE 2, which is 'user' plus 'sys', cannot be taken seriously on all configurations. I'd hate to make a judgement about which is correct and which is wrong. So I will ignore COMPILE 2 completely, though the Q9650 number really looks like an underlying problem waving its hand there.
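One way to eyeball how odd the COMPILE 2 column is: divide it by COMPILE 1, which gives the average number of processors (cores or hardware threads) that were accounted as busy during the build. This is only a rough sketch over the numbers from the table; the name `busy_processors` is made up for this illustration, and the ratio is an interpretation aid, not something the measurements were designed for.

```python
# (real, user + sys) in seconds, copied from the measurements table
measurements = {
    "T7300": (243, 441),
    "Q9650": (172, 531),
    "i7-980x 3C/3TH": (81, 152),
    "i7-980x 3C/6TH": (72, 336),
    "i7-980x 4 GB 2CH-1600": (47, 357),
    "i7-980x 6 GB 3CH-800": (48, 365),
    "i7-980x": (47, 356),
    "i7-980x 3.6 GHz": (44, 338),
    "i7-980x 6C/6TH": (49, 160),
}


def busy_processors(real_seconds, cpu_seconds):
    """Average number of processors accounted as busy during the build."""
    return cpu_seconds / real_seconds


for machine, (real, cpu) in measurements.items():
    print(f"{machine:<24} {busy_processors(real, cpu):5.2f}")
```

With these numbers the stock i7-980x comes out at roughly 7.6 of its 12 threads, while the 6C/6TH configuration shows only about 3.3 of 6 cores for nearly the same 'real' time, which illustrates why the column is hard to compare across configurations.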
Analysis
Indeed, the i7-980x system is (just about) four times faster than my Q9650 system, so I haven't made a costly mistake there. :)
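The "four times faster" check is just a ratio of the 'real' times. A small sketch, using COMPILE 1 values from the measurements table and a hypothetical helper named `speedup`:

```python
# 'real' build times in seconds (COMPILE 1 in the measurements table)
real_seconds = {
    "T7300": 243,
    "Q9650": 172,
    "i7-980x": 47,
    "i7-980x 3.6 GHz": 44,
}


def speedup(old_seconds, new_seconds):
    """How many times faster the new build is compared to the old one."""
    return old_seconds / new_seconds


baseline = real_seconds["Q9650"]
for machine, seconds in real_seconds.items():
    print(f"{machine:<16} {speedup(baseline, seconds):.2f}x vs Q9650")
```

The stock i7-980x lands at about 3.66 times the Q9650, and the 3.6 GHz setting at about 3.91 times, so "just about four times" is a fair reading of the numbers.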
Taking out hyper-threading, I saw very little difference in compilation performance with all six cores enabled (i7-980x 47s vs i7-980x 6C/6TH 49s). But it did have a pronounced impact with only three cores enabled (i7-980x 3C/6TH 72s vs i7-980x 3C/3TH 81s). So the extra threads aren't necessarily useless, but something limits their effectiveness in higher numbers. This could indicate that the CPU is memory starved with twelve threads competing. It could also indicate that the overhead of twelve threads is comparatively much higher than the overhead of six threads. It could also indicate that my projects didn't have enough files to compile in parallel. But I tested another single Xcode project with 250 files, and the best 6/12 time was 25s vs 28s for 6/6.
An interesting observation is contained in this Ars Technica article, where the author compiles WebKit on an 8 core, 16 thread machine in 574s and on a 12 core, 24 thread machine in 429s. In that test, adding 50% more cores and threads cut the build time by 25%, which is a speedup of about 1.34x. Doubling the number of cores and threads on the i7-980x cuts the build time by about 35%, which is a speedup of about 1.53x (i7-980x 47s vs. i7-980x 3C/6TH 72s). So this is comparable and doesn't indicate drastically diminishing returns for many threads.
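Percentages like these are easy to mix up, because "time saved" and "speedup" are two different measures of the same pair of numbers. A short sketch that computes both for the two comparisons; the function name `scaling` and the labels are made up for this illustration.

```python
def scaling(slow_seconds, fast_seconds):
    """Return (speedup factor, fraction of build time saved)."""
    speedup = slow_seconds / fast_seconds
    time_saved = 1 - fast_seconds / slow_seconds
    return speedup, time_saved


# (label, slower time, faster time) in seconds, as quoted in the text
cases = [
    ("WebKit, 8C/16TH -> 12C/24TH", 574, 429),
    ("i7-980x, 3C/6TH -> 6C/12TH", 72, 47),
]

for label, slow, fast in cases:
    factor, saved = scaling(slow, fast)
    print(f"{label:<30} speedup {factor:.2f}x, time saved {saved:.0%}")
```

Measured the same way, the WebKit build gains about 1.34x from 50% more cores and the i7-980x gains about 1.53x from twice the cores.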
I reduced the memory bandwidth in two ways: in one configuration by removing one memory module, thereby taking out one memory channel (i7-980x 4 GB 2CH-1600, down to 25.6 GB/s from 38.4 GB/s), and in another configuration by halving the memory speed in the BIOS from 1600MHz to 800MHz (i7-980x 6 GB 3CH-800, 19.2 GB/s). To my complete surprise, that did almost nothing to the compile times. So this would rule out the memory starvation theory.
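The bandwidth figures are theoretical peaks and can be reproduced with a little arithmetic. This sketch assumes that each memory channel is 64 bits (8 bytes) wide and that the "MHz" column of the configuration table is the effective transfer rate in millions of transfers per second; the function name is made up for this illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_bytes=8 assumes a 64-bit wide memory channel.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


configurations = {
    "i7-980x (3 channels, 1600)": (1600, 3),
    "i7-980x 4 GB 2CH-1600": (1600, 2),
    "i7-980x 6 GB 3CH-800": (800, 3),
    "Q9650 (2 channels, 800)": (800, 2),
}

for name, (rate, channels) in configurations.items():
    print(f"{name:<28} {peak_bandwidth_gb_s(rate, channels):.1f} GB/s")
```

Under the same assumptions the Q9650 memory comes out at 12.8 GB/s, so the reduced i7-980x configurations still have 1.5 to 2 times its theoretical peak.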
I don't believe that the SSD is the bottleneck, because of the following observations:
- When I place the intermediate build products on a ramdisk, I see no change in compile times at all.
- The OS should be caching all the headers and compiler binaries in RAM anyway.
- According to "Activity Monitor", the actual amount of I/O for a run is 40 MB read and 50 MB written, which is almost nothing for an SSD.
- By visual inspection, the read/written data per second values are in the low KB/s range most of the time, which means that the OS file cache seems to work as expected.
What was also surprising to me is that the Q9650 is not even twice as fast as the T7300, although it has twice the cores, three times the cache and a 50% higher clock speed. I assume the memory bandwidth of the two systems to be comparable, though. Barring any software problem or other misconfiguration, it would appear that the memory system is the bottleneck for the Q9650 in this benchmark.
But on the other hand, the i7-980x apparently isn't bottlenecked at all, even with a reduced bandwidth that is "only" about twice that of the Q9650. So this isn't conclusive to me, and it's something I might want to look into. With what I know now, I really would expect to see a 120s figure for the Q9650.
Coming up next
Next I take a look at the GeekBench results and try to extract some more (... or less) interesting deductions from them and from what I have covered here so far.
[1] Actual result of my PPC 2 GHz running 10.5.