| Download | |
|---|---|
| Binary | MulleZerofillCalc.app.tgz |
| Source | MulleZerofillCalc.src.tgz |
These were the measurements taken, enter them in the first four fields of the MulleZerofillCalculator
| Operation | Time | Comment |
|---|---|---|
| 1p alloc | 11.2 | vm_allocate( 1p), vm_deallocate( 1p) |
| 1p alloc + fault | 24.0 | vm_allocate( 1p), page touched - therefore faulted in, vm_deallocate( 1p) |
| 2p alloc | 11.5 | vm_allocate( 2p), vm_deallocate( 2p) |
| 2p alloc + fault | 33.6 | vm_allocate( 2p), 2 pages touched - therefore faulted in, vm_deallocate( 2p) |
Deriving the Other Fields
Doing 1 millionvm_allocate and vm_deallocate calls to allocate and free one page of memory takes 11.2. Doing the same amount of calls for a 2 page memory block takes 11.5. There is apparently a 0.3 overhead incurred per page (page extra cost).
page extra cost = 2p alloc - 1p alloc or 11.5 - 11.2 = 0.3Lets assume that allocation and deallocation takes pretty much the same time. Then calls to either
vm_allocate or vm_deallocate would be responsible for half the time measured, with the page extra cost taken into account, we arrive at this formula for allocation cost:
allocation cost = 1p alloc - page extra cost / 2 or (11.2 - 0.3) / 2 = 5.45For each page allocated in one call, the formula would be
allocation cost + page extra cost / 2
The actual mapping of the physical page into userspace and the zerofilling is done during the page fault. Since we measured allocation, page fault and deallocation, the time added because of the actual use of the memory page is the page fault extra cost
page fault extra cost = 1p alloc + fault - 1p alloc or 24.0 - 11.2 = 12.8The time spent in allocation can be expected to have been unchanged, since the OS can't very well know in advance, if a page is actually mapped in or not. Deallocation time can be expected to increase, since now physical memory pages must be mapped out. Here another assumption is made, namely that the extra time spent for mapping in and mapping out is the same, so the cost is divided equally on page faulting and deallocation. This object cost is computed as
object cost = 2 * (1p alloc + fault - 1p alloc) - (2p alloc + fault - 2p alloc) or 2 * (24.0 - 11.2) - (33.6 - 11.5) = 25.6 - 22.1 = 3.5This will be evenly shared with the page fault and the deallocation. So the time spent for one page fault is estimated as
fault + zerofill cost = (2p alloc + fault - 2p alloc) / 2 - object cost / 2 or (33.6 - 11.5) / 2 - 3.5 / 2 = 9.3and the time for deallocation rises to
allocation cost + page extra cost / 2 * pages + object cost / 2 or 5.45 + 0.3 / 2 * 1 + 3.5 / 2 = 7.5To get an estimate of the time spent for zerofilling, i assumed that approximately a quarter of the time used for allocation is spent for the actual page faulting. And I compute zerofill cost like this:
fault + zerofill cost - allocation cost / 4 - object cost / 2 or 9.3 - 5.45 / 4 - 3.5 / 2 = 6.15This in effect weighs zerofilling in at around 50% to 80%, something I observed on my machine using Shark as rather realistic.
Some Observations
Although the computations are in some areas "pi mal Daumen", a few safe observations can nevertheless be made:- on all machines observed in a alloc/page fault/dealloc loop, the page fault time is the largest of the three.
- the time for allocation and deallocation diminishes in significance greatly, with the number of pages allocated in one call. Already with only 4 pages (16 KB) allocated in one call, the time for faulting in and zerofilling the pages takes 80% of the time.
- of the time spent for allocating and faulting in 4 pages, at least 50% will be spent zero filling
- zero filling is likely to be a bottleneck even more so in 10.3
How to Optimize This
An easy optimization would be to batch fault more pages in when a page fault occurs. For very small allocations, it would probably be advantageous to immediately map in the one or two pages than instead to wait for the page fault to occur and then take the punishment of the fault. I would hazard a guess that in the vast majority of cases a call tovm_allocate will be followed by an access to the first page immediately afterwards.
If faulting optimizations are in effect, then the drag of zerofilling will be even more pronounced. You can optimize zerofilling by not zerofilling :)