Zerofilling - Part III - Doing the Math
Let's use the 1 GHz result as a basis to make some assumptions and work out what it all means. Hopefully you have downloaded the new version of the MulleZerofillCalculator by now and are playing around with it a little bit.
Download | |
---|---|
Binary | MulleZerofillCalc.app.tgz |
Source | MulleZerofillCalc.src.tgz |
These were the measurements taken; enter them into the first four fields of the MulleZerofillCalculator:
Operation | Time | Comment |
---|---|---|
1p alloc | 11.2 | vm_allocate( 1p), vm_deallocate( 1p) |
1p alloc + fault | 24.0 | vm_allocate( 1p), page touched - therefore faulted in, vm_deallocate( 1p) |
2p alloc | 11.5 | vm_allocate( 2p), vm_deallocate( 2p) |
2p alloc + fault | 33.6 | vm_allocate( 2p), 2 pages touched - therefore faulted in, vm_deallocate( 2p) |
Deriving the Other Fields
Doing 1 million vm_allocate and vm_deallocate calls to allocate and free one page of memory takes 11.2. Doing the same number of calls for a two-page memory block takes 11.5. There is apparently a 0.3 overhead incurred per page (the page extra cost):
page extra cost = 2p alloc - 1p alloc or 11.5 - 11.2 = 0.3
Let's assume that allocation and deallocation take pretty much the same time. Then calls to either vm_allocate or vm_deallocate would each be responsible for half the time measured. With the page extra cost taken into account, we arrive at this formula for the allocation cost:
allocation cost = (1p alloc - page extra cost) / 2 or (11.2 - 0.3) / 2 = 5.45
For each page allocated in one call, the formula would be allocation cost + page extra cost / 2.
The actual mapping of the physical page into userspace and the zerofilling are done during the page fault. Since we measured allocation, page fault and deallocation together, the time added by actually using the memory page is the page fault extra cost:
page fault extra cost = 1p alloc + fault - 1p alloc or 24.0 - 11.2 = 12.8
The time spent in allocation can be expected to have been unchanged, since the OS can't very well know in advance whether a page will actually be mapped in or not. Deallocation time can be expected to increase, since now physical memory pages must be mapped out. Here another assumption is made, namely that the extra time spent for mapping in and mapping out is the same, so this cost is divided equally between page faulting and deallocation. This object cost is computed as
object cost = 2 * (1p alloc + fault - 1p alloc) - (2p alloc + fault - 2p alloc) or 2 * (24.0 - 11.2) - (33.6 - 11.5) = 25.6 - 22.1 = 3.5
This is shared evenly between the page fault and the deallocation. So the time spent for one page fault is estimated as
fault + zerofill cost = (2p alloc + fault - 2p alloc) / 2 - object cost / 2 or (33.6 - 11.5) / 2 - 3.5 / 2 = 9.3
and the time for deallocation rises to
deallocation cost = allocation cost + page extra cost / 2 * pages + object cost / 2 or 5.45 + 0.3 / 2 * 1 + 3.5 / 2 = 7.35
To get an estimate of the time spent zerofilling, I assumed that approximately a quarter of the time used for allocation is spent on the actual page fault handling. And I compute the zerofill cost like this:
zerofill cost = fault + zerofill cost - allocation cost / 4 - object cost / 2 or 9.3 - 5.45 / 4 - 3.5 / 2 ≈ 6.19
This in effect weighs zerofilling in at around 50% to 80%, a range that, judging from what I observed on my machine using Shark, seems rather realistic.
Some Observations
Although the computations are in some areas "pi mal Daumen" (German for a rough rule-of-thumb estimate), a few safe observations can nevertheless be made:
- on all machines observed, in an alloc/page fault/dealloc loop the page fault time is the largest of the three.
- the time for allocation and deallocation diminishes greatly in significance as the number of pages allocated in one call grows. Already with only 4 pages (16 KB) allocated in one call, faulting in and zerofilling the pages takes 80% of the time.
And a little more speculatively:
- of the time spent for allocating and faulting in 4 pages, at least 50% will be spent zero filling
Apparently Mac OS X 10.3 has some optimizations in memory management, as the G4 500 running 10.3 does the allocation loops much faster than some faster processors running 10.2.6. The fact that it is still slower overall is more likely than not a direct result of zerofilling.
- zero filling is likely to be even more of a bottleneck in 10.3
How to Optimize This
An easy optimization would be to batch fault in more pages when a page fault occurs. For very small allocations, it would probably be advantageous to map in the one or two pages immediately, rather than wait for the page fault to occur and then take the punishment of the fault. I would hazard a guess that in the vast majority of cases a call to vm_allocate is followed immediately by an access to the first page.
If faulting optimizations are in effect, then the drag of zerofilling will be even more pronounced. You can optimize zerofilling by not zerofilling :)
Acknowledgements
This article was sparked off by a private email from Jim Magee concerning my original 1 page benchmark. Wherever some of the above comments and calculations in Deriving the Other Fields make sense, you can attribute that to Jim; where there are mistakes, oversights and errata, put the blame on me.