
August 2003 Archives

August 5, 2003

Nothing done

Bochum Total and Terminator 3 ran nonstop over the weekend, so no coding got done.

August 13, 2003

Zerofilling - The results Part I

As I've been slacking (probably due to the heat wave here), this is the first real update in quite some time. I asked you to send in some benchmark results, and indeed I got some nice feedback. In this first part I will just quickly describe what has been measured and give the reported data; the second part will interpret the data.

What the code measures

The code is doing the following. First it will do a little warm up, doing some allocations untimed. This gives the memory system a chance to prepare for what is coming :)

The program will then do a million vm_allocate and vm_deallocate operations to factor out the overhead of allocation and deallocation, since the primary interest of this benchmark is the page faulting (with the zero filling). The reason 1 million operations are done is so that the runtime can easily be converted into µs per operation: a runtime of 10 s means 10 µs per iteration.

The first loop just does vm_allocate() and vm_deallocate() on a single page. The second does the same thing, except that it touches the page within the allocation. This triggers a fault, a vm_object allocation, and a zero-fill. Then, when we call vm_deallocate(), there is additional work to free the page and the object. The third loop allocates and frees two pages without touching them, akin to the first loop. Finally, the fourth loop allocates two pages and touches both, isolating the per-page costs (fault, zero-fill, and free).
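The four loops can be sketched roughly as follows. This is only an illustration, not the code that produced the numbers: it uses Python's mmap module with an anonymous mapping as a stand-in for vm_allocate/vm_deallocate, the names time_loop and run_benchmark are made up for this sketch, and interpreter overhead will inflate every figure, so the absolute times are not comparable with the table below.

```python
import mmap
import time

PAGE = mmap.PAGESIZE  # page size as reported by the OS


def time_loop(pages, touch, iterations):
    """Map and unmap `pages` anonymous pages `iterations` times; return seconds."""
    size = pages * PAGE
    start = time.perf_counter()
    for _ in range(iterations):
        region = mmap.mmap(-1, size)  # anonymous mapping, nothing faulted in yet
        if touch:
            for offset in range(0, size, PAGE):
                region[offset:offset + 1] = b"\x01"  # first write faults the page in
        region.close()  # unmap again
    return time.perf_counter() - start


def run_benchmark(iterations=1_000_000, warmup=1_000):
    """Run an untimed warm-up, then the four timed loops; return seconds per loop."""
    time_loop(1, True, warmup)  # warm-up, result discarded
    return {
        "1p alloc": time_loop(1, False, iterations),
        "1p alloc + fault": time_loop(1, True, iterations),
        "2p alloc": time_loop(2, False, iterations),
        "2p alloc + fault": time_loop(2, True, iterations),
    }


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name:18s} {seconds:6.2f}s")
```

With the default of one million iterations, the seconds reported for a loop can be read directly as µs per iteration.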

 

The measured results

I had to name one entry Anonymous because it was submitted with 10.3 and AFAIK that's already problematic with Apple's touchy-feely lawyers...
CPU / Machine                    | OS      | Submitter | 1p alloc | 1p alloc + fault | 2p alloc | 2p alloc + fault
---------------------------------|---------|-----------|----------|------------------|----------|-----------------
1.25 GHz G4 (167 MHz system bus) | unknown | Jim Magee | 9.03s    | 18.13s           | N/A      | 24.58s
1 GHz G4                         | 10.2.6  | F. Inci   | 11.2s    | 24.0s            | 11.5s    | 33.6s
867 MHz G4 (PowerBook)           | 10.2.6  | F. Inci   | 16.5s    | 34.5s            | 17.0s    | 47.7s
500 MHz G4                       | 10.3    | Anonymous | 14.9s    | 39.4s            | 14.9s    | 50.5s
800 MHz G4 (iMac)                | 10.2.6  | H. Hess   | 16.4s    | 47.7s            | 17.8s    | 71.4s
450 MHz G4 (Cube)                | 10.2.6  | Nat!      | 18.1s    | 51.2s            | 18.5s    | 75.1s
500 MHz G3 (iBook)               | unknown | R. Lutz   | 21.2s    | 59.8s            | 22.1s    | 84.4s

August 15, 2003

Offline two days, due to technical difficulties

This journal was offline two days, because Movable Type wouldn't run.

The filesystem decided to mess up an important directory in the Perl library hierarchy. Thanks to ZNeK everything's back to normal again.

Zerofilling - Part II

Hopefully tomorrow :)

August 16, 2003

Zerofilling - Part II

I have - for your utmost convenience - created a little application that does the benchmarking and the analysis in one step. It is ready for download.

Download
Binary: MulleZerofillCalc.app.tgz
Source: MulleZerofillCalc.src.tgz

Well the bad news is there is going to be a part III. Unfortunately the application took longer than expected, so you will have to wait for the third part, where I will tell you what all the numbers mean. :) The third part might manifest as late as next Wednesday.

 

August 17, 2003

Zerofilling - Part III - Doing the Math

Let's use the 1 GHz result as a basis to make some assumptions and work out what it all means. Hopefully you have downloaded the new version of the MulleZerofillCalculator by now, and are playing around with it a little bit.
Download
Binary: MulleZerofillCalc.app.tgz
Source: MulleZerofillCalc.src.tgz

These were the measurements taken; enter them in the first four fields of the MulleZerofillCalculator:

Operation        | Time (s) | Comment
-----------------|----------|--------------------------------------------------------------------------------
1p alloc         | 11.2     | vm_allocate(1p), vm_deallocate(1p)
1p alloc + fault | 24.0     | vm_allocate(1p), page touched - therefore faulted in, vm_deallocate(1p)
2p alloc         | 11.5     | vm_allocate(2p), vm_deallocate(2p)
2p alloc + fault | 33.6     | vm_allocate(2p), 2 pages touched - therefore faulted in, vm_deallocate(2p)


Deriving the Other Fields

Doing 1 million vm_allocate and vm_deallocate calls to allocate and free one page of memory takes 11.2 s. Doing the same number of calls for a 2 page memory block takes 11.5 s. There is apparently a 0.3 s overhead incurred per additional page (page extra cost).
	page extra cost = 2p alloc - 1p alloc
or
	11.5 - 11.2 = 0.3
Let's assume that allocation and deallocation take pretty much the same time. Then calls to either vm_allocate or vm_deallocate would be responsible for half the time measured. With the page extra cost taken into account, we arrive at this formula for allocation cost:
	allocation cost = (1p alloc - page extra cost) / 2
or
	(11.2 - 0.3) / 2 = 5.45
For several pages allocated in one call, the formula would be allocation cost + page extra cost / 2 * pages

The actual mapping of the physical page into userspace and the zerofilling is done during the page fault. Since we measured allocation, page fault and deallocation together, the time added by actually using the memory page is the page fault extra cost:

	page fault extra cost = 1p alloc + fault - 1p alloc
or
	24.0 - 11.2 = 12.8
The time spent in allocation can be expected to have been unchanged, since the OS can't very well know in advance, if a page is actually mapped in or not. Deallocation time can be expected to increase, since now physical memory pages must be mapped out. Here another assumption is made, namely that the extra time spent for mapping in and mapping out is the same, so the cost is divided equally on page faulting and deallocation. This object cost is computed as
	object cost = 2 * (1p alloc + fault - 1p alloc) - (2p alloc + fault - 2p alloc)
or
	2 * (24.0 - 11.2) - (33.6 - 11.5) = 25.6 - 22.1 = 3.5
This will be shared evenly between the page fault and the deallocation. So the time spent for one page fault is estimated as
	fault + zerofill cost = (2p alloc + fault - 2p alloc) / 2 - object cost / 2
or
	(33.6 - 11.5) / 2 - 3.5 / 2 = 9.3
and the time for deallocation rises to
	deallocation cost = allocation cost + page extra cost / 2 * pages + object cost / 2
or
	5.45 + 0.3 / 2 * 1 + 3.5 / 2 = 7.35
To get an estimate of the time spent zerofilling, I assumed that approximately a quarter of the time used for allocation is spent on the actual page faulting. I compute zerofill cost like this:
	zerofill cost = fault + zerofill cost - allocation cost / 4 - object cost / 2
or
	9.3 - 5.45 / 4 - 3.5 / 2 ≈ 6.19
This in effect weighs zerofilling in at around 50% to 80%, something I observed on my machine using Shark as rather realistic.
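For anyone who wants to redo the arithmetic without the application, here is a minimal sketch of the formulas in this section. The function name derive and the dictionary keys are made up for this sketch; the formulas and the four input values are the ones given above, and the results inherit all the assumptions made there.

```python
def derive(one_p, one_p_fault, two_p, two_p_fault, pages=1):
    """Derive the other fields from the four measured times (in seconds)."""
    page_extra = two_p - one_p
    alloc = (one_p - page_extra) / 2
    fault_extra_1p = one_p_fault - one_p
    fault_extra_2p = two_p_fault - two_p
    obj = 2 * fault_extra_1p - fault_extra_2p
    fault_zerofill = fault_extra_2p / 2 - obj / 2
    dealloc = alloc + page_extra / 2 * pages + obj / 2
    # Assumption from the text: about a quarter of the allocation cost
    # is spent on the page faulting itself.
    zerofill = fault_zerofill - alloc / 4 - obj / 2
    return {
        "page extra cost": page_extra,
        "allocation cost": alloc,
        "page fault extra cost": fault_extra_1p,
        "object cost": obj,
        "fault + zerofill cost": fault_zerofill,
        "deallocation cost": dealloc,
        "zerofill cost": zerofill,
    }


if __name__ == "__main__":
    # The 1 GHz G4 measurements used throughout this part.
    for name, value in derive(11.2, 24.0, 11.5, 33.6).items():
        print(f"{name:24s} {value:6.2f}")
```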

Some Observations

Although the computations are in some areas rough rule-of-thumb estimates ("pi mal Daumen"), a few safe observations can nevertheless be made:
  • on all machines observed, the page fault time is the largest of the three in an alloc/page fault/dealloc loop.
  • the time for allocation and deallocation diminishes greatly in significance with the number of pages allocated in one call. Already with only 4 pages (16 KB) allocated in one call, faulting in and zerofilling the pages takes 80% of the time.
And a little more guessy:
  • of the time spent for allocating and faulting in 4 pages, at least 50% will be spent zero filling
Apparently Mac OS X 10.3 has some optimizations in memory management, as the G4 500 with 10.3 runs the allocation loops much faster than some faster processors on 10.2.6. The fact that it is still slower overall is more likely than not a direct result of zerofilling.
  • zero filling is likely to be a bottleneck even more so in 10.3
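To see how the shares shift with the number of pages per call, the model from Deriving the Other Fields can be extrapolated. This is a sketch under an extra assumption of mine: that the per-page costs stay linear beyond the two pages actually measured. The function name time_shares is made up for this sketch.

```python
def time_shares(one_p, one_p_fault, two_p, two_p_fault, pages):
    """Estimate the share of allocation, page fault and deallocation for `pages` pages."""
    page_extra = two_p - one_p
    alloc = (one_p - page_extra) / 2
    fault_extra_2p = two_p_fault - two_p
    obj = 2 * (one_p_fault - one_p) - fault_extra_2p
    per_page_fault = (fault_extra_2p - obj) / 2  # fault + zerofill cost per page

    allocation = alloc + page_extra / 2 * pages
    fault = per_page_fault * pages + obj / 2     # object cost split with dealloc
    deallocation = alloc + page_extra / 2 * pages + obj / 2
    total = allocation + fault + deallocation
    return {
        "allocation": allocation / total,
        "page fault": fault / total,
        "deallocation": deallocation / total,
    }


if __name__ == "__main__":
    # Cube 450 measurements from Part I, 4 pages per call.
    for name, share in time_shares(18.1, 51.2, 18.5, 75.1, pages=4).items():
        print(f"{name:12s} {share:5.1%}")
```

Fed with the Cube 450 numbers from Part I and 4 pages, this model puts the page fault share at roughly 80%; other machines in the table come out somewhat differently.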

How to Optimize This

An easy optimization would be to batch fault more pages in when a page fault occurs. For very small allocations, it would probably be advantageous to map in the one or two pages immediately, rather than to wait for the page fault to occur and then take the punishment of the fault. I would hazard a guess that in the vast majority of cases a call to vm_allocate will be followed by an access to the first page immediately afterwards.

If faulting optimizations are in effect, then the drag of zerofilling will be even more pronounced. You can optimize zerofilling by not zerofilling :)

Acknowledgements

This article was sparked off by a private email from Jim Magee concerning my original 1 page benchmark. Wherever the above comments and calculations in Deriving the Other Fields make sense, you can attribute it to Jim; where there are mistakes, oversights and errata, put the blame on me.

August 20, 2003

Zerofilling - Part IV - New and Improved Application

Here's a new version, quite a bit nicer than the previous version. Here are the highlights:
  • Use mmap/munmap instead of vm_allocate/vm_deallocate for memory allocation (Thanks to Jim Magee for this)
  • It is more apparent which fields are editable and which aren't.
  • There is an option to loop the benchmark to stabilize the values (Apple-L)
  • You can have more than one window, but you can't close them :)
Download
Binary: MulleZerofillCalc.app.tgz
Source: MulleZerofillCalc.src.tgz

The mmap feature is the biggy in this release. Using mmap to allocate memory is much faster than the vm_allocate as you will find out. On my Cube 450 use of mmap is almost twice as fast as the use of vm_allocate. That is for pure allocation and deallocation. The actual page fault and zerofill costs are - from my observations - virtually :) unaffected. So while this certainly is of big interest, it in itself doesn't change the comments made in part III.

Operation        | mmap | vm_allocate
-----------------|------|------------
1p alloc         | 9.7  | 17.6
1p alloc + fault | 41.0 | 51.5
2p alloc         | 11.1 | 18.9
2p alloc + fault | 65.9 | 77.3
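The two claims about this table (pure allocation is almost twice as fast, the fault cost is nearly unchanged) can be checked with a few lines. A small sketch; the helper names alloc_speedup and fault_extra are made up here, the numbers are the ones from the table.

```python
# Measured times from the table above (Cube 450).
mmap_times = {"1p alloc": 9.7, "1p alloc + fault": 41.0,
              "2p alloc": 11.1, "2p alloc + fault": 65.9}
vm_times = {"1p alloc": 17.6, "1p alloc + fault": 51.5,
            "2p alloc": 18.9, "2p alloc + fault": 77.3}


def alloc_speedup(pages):
    """How many times faster mmap is than vm_allocate for pure alloc/dealloc."""
    key = f"{pages}p alloc"
    return vm_times[key] / mmap_times[key]


def fault_extra(times, pages):
    """Time added by touching the pages, i.e. fault, zerofill and extra free work."""
    return times[f"{pages}p alloc + fault"] - times[f"{pages}p alloc"]


if __name__ == "__main__":
    for pages in (1, 2):
        print(f"{pages}p: alloc speedup {alloc_speedup(pages):.2f}x, "
              f"fault extra mmap {fault_extra(mmap_times, pages):.1f} "
              f"vs vm_allocate {fault_extra(vm_times, pages):.1f}")
```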

If you want to know why mmap is faster read on...


August 21, 2003

Zerofilling - Part V - Bug fix

There was no apparent reason why the previous version should have worked. But it did! This new version does the same job, but it's clear why :)

Download
Binary: MulleZerofillCalc.app.tgz
Source: MulleZerofillCalc.src.tgz

August 25, 2003

Nothing going on

And nothing planned. This is going to be a Slow Tuesday Night.

August 28, 2003

Zero Filling - Part VI - A Reader's Suggestion

I got an email from Maynard Handley with a good suggestion:
There is another way to look at your zero page-filling issue, and one you might try to get Apple to adopt. Rather than zeroing a page on demand (as is done currently) the kernel maintains a list of pre-zeroed pages. This list is refreshed by some kernel-thread that's running every so often, and that takes the pages it wants to pre-zero from say the free-list (or a new free-but-dirty list). Now the work still has to be done, but with luck it'll be done during some dead time while we're waiting on a disk IO or whatever.

Maynard
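The idea can be modeled with a toy in user space. This is only a sketch of the bookkeeping Maynard describes, not kernel code: the class PagePool, its method names and the 4 KB page size constant are made up for illustration, and refill() stands in for whatever the kernel thread would do in idle time.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size for this toy model


class PagePool:
    """Toy model: a free-but-dirty list plus a list of pre-zeroed pages."""

    def __init__(self, page_count):
        # All pages start out dirty (filled with a non-zero pattern).
        self.dirty = deque(bytearray(b"\xff" * PAGE_SIZE) for _ in range(page_count))
        self.zeroed = deque()
        self.zeroed_on_demand = 0  # faults that had to pay for the zerofill

    def refill(self, max_pages):
        """Background work: pre-zero up to max_pages dirty pages."""
        count = min(max_pages, len(self.dirty))
        for _ in range(count):
            page = self.dirty.popleft()
            page[:] = bytes(PAGE_SIZE)
            self.zeroed.append(page)
        return count

    def fault_in(self):
        """Service a page fault: prefer a pre-zeroed page, else zero on demand."""
        if self.zeroed:
            return self.zeroed.popleft()
        page = self.dirty.popleft()  # raises IndexError when no page is left
        page[:] = bytes(PAGE_SIZE)
        self.zeroed_on_demand += 1
        return page

    def free(self, page):
        """Return a used page to the free-but-dirty list."""
        self.dirty.append(page)


if __name__ == "__main__":
    pool = PagePool(4)
    pool.refill(2)  # idle time: two pages get pre-zeroed
    pages = [pool.fault_in() for _ in range(3)]
    print("zeroed on demand:", pool.zeroed_on_demand)
```

As long as refill() keeps up, fault_in() never pays for the zerofill; once the pre-zeroed list runs dry, every fault is back to zeroing on demand.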

A good suggestion, and one that would work. I had asked Jim Magee the same thing off list:
In the back of my mind lurks the factoid that Mac OS X has a queue of zeroed RAM pages. If the queue is full, then a page fault can probably be serviced in 4-5 us. When it becomes exhausted, as my test program will make it, it will take 15 us. Correct? How big is that queue? And at what rate does it refill?

The answer was:
Nope. Nothing is done ahead in this regard. Other Mach implementations experimented with this (having the idle loop zero pages ahead instead of doing "nothing"). But it tended to play hell with caches and/or power-management (we _really_ put the processor to sleep in idle).

I would suspect the cache is more of an issue, since the system oughta quiet down after enough pages have been zeroed. Since cleaning a page does invalidate a sizeable part of the cache, this would be quite detrimental to running code I would think.
If I get a little more idle time, I might wrap this up and make Parts I to VI a bona fide "Mulle" article. :)

August 29, 2003

Filler

Thought this was an interesting read, found it while checking up on a Lafferty quote:

August 31, 2003

VfL-Bochum iCal Online

For VfL Bochum fans with Mac OS X: webcal://muller.mulle-kybernetik.com/calendar/VfL-Bochum.ics. Drop me an email if something doesn't work.

About August 2003

This page contains all entries posted to Nat!'s Web Journal in August 2003. They are listed from oldest to newest.
