Nat!

Zerofilling - Part III - Doing the Math

Let's use the 1 GHz result as a basis to make some assumptions and work out what it all means. Hopefully you have downloaded the new version of the MulleZerofillCalculator by now and are playing around with it a little bit.

Download
Binary MulleZerofillCalc.app.tgz
Source MulleZerofillCalc.src.tgz

These were the measurements taken; enter them in the first four fields of the MulleZerofillCalculator:

Operation          Time   Comment
1p alloc           11.2   vm_allocate( 1p), vm_deallocate( 1p)
1p alloc + fault   24.0   vm_allocate( 1p), page touched - therefore faulted in, vm_deallocate( 1p)
2p alloc           11.5   vm_allocate( 2p), vm_deallocate( 2p)
2p alloc + fault   33.6   vm_allocate( 2p), 2 pages touched - therefore faulted in, vm_deallocate( 2p)


Deriving the Other Fields

Doing 1 million vm_allocate and vm_deallocate calls to allocate and free one page of memory takes 11.2. Doing the same number of calls for a 2 page memory block takes 11.5. There is apparently a 0.3 overhead incurred per page (page extra cost).

   page extra cost = 2p alloc - 1p alloc
or
        11.5 - 11.2 = 0.3

Let's assume that allocation and deallocation take pretty much the same time. Then calls to either vm_allocate or vm_deallocate would be responsible for half the time measured. With the page extra cost taken into account, we arrive at this formula for allocation cost:

   allocation cost = (1p alloc - page extra cost) / 2
or
        (11.2 - 0.3) / 2 = 5.45

For multiple pages allocated in one call, the formula would be allocation cost + page extra cost / 2 * pages.

The actual mapping of the physical page into userspace and the zerofilling is done during the page fault. Since we measured allocation, page fault and deallocation, the time added because of the actual use of the memory page is the page fault extra cost:

   page fault extra cost = 1p alloc + fault - 1p alloc
or
        24.0 - 11.2 = 12.8

The time spent in allocation can be expected to be unchanged, since the OS can't very well know in advance whether a page will actually be faulted in or not. Deallocation time can be expected to increase, since now physical memory pages must be mapped out. Here another assumption is made, namely that the extra time spent for mapping in and mapping out is the same, so the cost is divided equally between page faulting and deallocation. This per-call overhead, the object cost, is computed as

   object cost = 2 * (1p alloc + fault - 1p alloc) - (2p alloc + fault - 2p alloc)
or
        2 * (24.0 - 11.2) - (33.6 - 11.5) = 25.6 - 22.1 = 3.5

This will be shared evenly between the page fault and the deallocation. So the time spent for one page fault is estimated as

   fault + zerofill cost = (2p alloc + fault - 2p alloc) / 2 - object cost / 2
or
        (33.6 - 11.5) / 2 - 3.5 / 2 = 9.3

and the time for deallocation rises to

   deallocation cost = allocation cost + page extra cost / 2 * pages + object cost / 2
or
        5.45 + 0.3 / 2 * 1 + 3.5 / 2 = 7.35

To get an estimate of the time spent for zerofilling, I assumed that approximately a quarter of the allocation cost is spent on the actual page fault machinery. So I compute the zerofill cost like this:

   zerofill cost = fault + zerofill cost - allocation cost / 4 - object cost / 2
or
        9.3 - 5.45 / 4 - 3.5 / 2 ≈ 6.19

This in effect weighs zerofilling in at around 50% to 80%, which, judging from what Shark showed on my machine, seems rather realistic.

Some Observations

Although the computations are in some areas rough "rule of thumb" estimates, a few safe observations can nevertheless be made:

  • on all machines observed in an alloc/page fault/dealloc loop, the page fault time is the largest of the three.
  • the time for allocation and deallocation diminishes greatly in significance as the number of pages allocated in one call grows. Already with only 4 pages (16 KB) allocated in one call, faulting in and zerofilling the pages takes 80% of the time.

And a little more speculative:

  • of the time spent allocating and faulting in 4 pages, at least 50% will be spent zerofilling

Apparently Mac OS X 10.3 has some optimizations in memory management, as the G4 500 running 10.3 does the allocation loops much faster than some faster processors on 10.2.6. The fact that it is still slower overall is more likely than not a direct result of zerofilling.

  • zerofilling is likely to be even more of a bottleneck in 10.3

How to Optimize This

An easy optimization would be to batch-fault more pages in when a page fault occurs. For very small allocations, it would probably be advantageous to immediately map in the one or two pages rather than to wait for the page fault to occur and then take the punishment of the fault. I would hazard a guess that in the vast majority of cases a call to vm_allocate is followed immediately by an access to the first page.

If faulting optimizations are in effect, then the drag of zerofilling will be even more pronounced. You can optimize zerofilling by not zerofilling :)

Acknowledgements

This article was sparked off by a private email from Jim Magee concerning my original 1 page benchmark. Wherever the comments and calculations above in Deriving the Other Fields make sense, you can attribute that to Jim; where there are mistakes, oversights and errata, put the blame on me.