The faster objc_msgSend

You can download my test case. This one comes precompiled, with all sources included. It requires emacs and make this time :).

This article is a byproduct of the original, still unpublished part IMP Caching Deluxe. At the end of that article I knew that my NSAutoreleasePool class was much faster than the original, but I couldn't quite explain why. So I had to dig a little deeper and came up with some explanations, which immediately led to two follow-up articles: namely this one and its predecessor, Giving [] a boost.

objc_msgSend analyzed

Sending a message to an object in Objective-C works like this: the object's class (reached through the isa pointer) and the selector _cmd (the method name) are used to find the method's implementation in the class, and then execution jumps to that implementation.

For better performance the implementation and its selector are noted in a private cache that is part of every class. This cache is very simple and very fast.

In this example, only a few methods have been memorized so far in the cache of a class Foo. The cache is therefore small; it holds only 8 entries. (An NSTextView, just initialized from a NIB and displaying, has 256 slots right from the start.) 5 slots are filled. Each filled slot is annotated with the preferred slot of the method stored there. Our target method's optimal slot is slot #3, but since that is occupied, it resides in slot #4.

A part of the selector is cookie cut with the cache mask to quickly index into the cache; this avoids a linear search. In the given example the selector indexes into slot #3, which is occupied by a different method. After one cache miss the matching selector and its implementation are found.

The graphic also tries to visualize that the cache is arranged in a circular fashion. If the last entry doesn't match, the search continues at the start of the cache. (Notice that slot #0 is occupied by a method that would preferably be stored in slot #7.)

For those who read C more easily than PPC assembler, here's the entry part of objc_msgSend in almost functional - but not actually compilable or usable - C:

#include <objc/objc-runtime.h>

id  c_objc_msgSend( struct objc_class /* ahem */ *self, SEL _cmd, ...)
{
   struct objc_class    *cls;
   struct objc_cache    *cache;
   unsigned int         hash;
   struct objc_method   *method;   
   unsigned int         index;
   
   if( self)
   {
      cls   = self->isa;                        /* the receiver's class        */
      cache = cls->cache;                       /* its private method cache    */
      hash  = cache->mask;
      index = (unsigned int) _cmd & hash;       /* preferred bucket            */
      
      do
      {
         method = cache->buckets[ index];
         if( ! method)
            goto recache;                       /* empty bucket: cache miss    */
         index = (index + 1) & cache->mask;     /* linear probe, wraps around  */
      }
      while( method->method_name != _cmd);
      return( (*method->method_imp)( (id) self, _cmd));   /* can't forward varargs in real C */
   }
   return( (id) self);                          /* messages to nil return nil  */

recache:
   /* ... */
   return( 0);
}
Some more background information: the cache, and with it its mask, is grown dynamically. The cache is never more than 75% full. Methods that are recorded later have a statistical speed advantage over those entered earlier.
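
To make the fill policy a bit more concrete, here is a minimal sketch of how such a cache fill might look, using the struct layout of the public objc-runtime headers. The 75% threshold test and the grow helper are assumptions based on the description above, not Apple's actual code:

#include <objc/objc-runtime.h>

/* hypothetical helper that reallocates the bucket array; not shown here */
static struct objc_cache   *sketch_cache_grow( struct objc_class *cls);

static void   sketch_cache_fill( struct objc_class *cls, Method method)
{
   struct objc_cache   *cache;
   unsigned int        index;

   cache = cls->cache;

   /* keep the cache less than 75% full, growing it if necessary */
   if( 4 * (cache->occupied + 1) > 3 * (cache->mask + 1))
      cache = sketch_cache_grow( cls);

   /* preferred slot is SEL & mask; probe linearly (wrapping) for a free bucket */
   index = (unsigned int) method->method_name & cache->mask;
   while( cache->buckets[ index])
      index = (index + 1) & cache->mask;

   cache->buckets[ index] = method;
   cache->occupied++;
}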

objc_msgSend rewritten

The following code is the entry part of Apple's objc_msgSend. It contains all the code executed when there is a cached implementation. I have noted the instruction cycles on the left, as I think they are spent. An L means latency: there is a stall because the next instruction depends on the previous load. (This is for a G3. Latencies only increase with later processors, due to their longer pipelines.)
01	cmplwi   r3,0		; is message to nil
02	beq-     aleave		; ok, just return
    
03      lwz      r12,0(r3)	; get ->isa in r12
04 L				; wait for value
05	lwz      r12,32(r12)	; get ->Cache from objc_class
06	stw      r9,48(r1)	; save r9 on stack 
07	lwz      r11,0(r12)	; load ->cache_mask from objc_cache
08	addi     r9,r12,8	; dial r9 to ->buckets
09	rlwinm	 r11,r11,2,0,29	; shift mask up by 2
10	and      r12,r4,r11	; SEL AND mask, r12: index into buckets

aloop:
11	lwzx     r2,r9,r12	; load objc_method pointer from bucket
12	addi     r12,r12,4	; move to next bucket
13	cmplwi   r2,0		; bucket empty ?
14	beq-     afill		; cache needs to be filled ->

15	lwz      r0,0(r2)       ; get SEL from objc_method
16	and      r12,r12,r11	; buckets, circular wrap code
17	cmplw    r0,r4		; same SEL ?
18	bne+     aloop		; no, try next ->

19	lwz      r12,8(r2)	; get IMP from objc_method
20 L				; wait for value
21	mtctr    r12		; stuff it into ctr
22	lwz      r9,48(r1)	; pop r9 from stack
23 	bctr			; jump to IMP

aleave:
    blr

afill:
    nop		                  ; more code but snipped for simplicity
    blr

Change the branch prediction

The first idea is simple yet non-obvious. At cycle 18 the assumption of bne+ is that the cache lookup has failed and that another loop iteration must be taken. This is the standard way to write loops; it optimizes for the looping case. Let's see how this pans out:

Method Cache Misses    Predictions Detail    Predictions Total
0 Cache misses         -                     (-1)
1 Cache misses         + -                   (0)
2 Cache misses         + + -                 (+1)
3 Cache misses         + + + -               (+2)

It penalizes the 0 Cache misses path and favors the paths with 2 Cache misses and worse. Let's assume we code it as bne- aloop instead; then the table turns:

Method Cache Misses    Predictions Detail    Predictions Total
0 Cache misses         +                     (+1)
1 Cache misses         - +                   (0)
2 Cache misses         - - +                 (-1)
3 Cache misses         - - - +               (-2)

Now it's a matter of the likelihood of cache misses. Looking at actual cache contents, the spread of cache hits and misses will not be even. In fact, 0 Cache misses is a much more likely event than 2 Cache misses or more.

So, depending on the branch prediction architecture and the instruction cache contents, this could potentially pay off.
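
The same bias can be expressed at the C level with GCC's __builtin_expect, purely as an analogy to the bne- change above; the lookup function and the likely macro below are mine, not runtime code:

#include <objc/objc-runtime.h>

#define likely( x)   __builtin_expect( !! (x), 1)

static IMP   sketch_cache_lookup( struct objc_cache *cache, SEL _cmd)
{
   struct objc_method   *method;
   unsigned int         index;

   index = (unsigned int) _cmd & cache->mask;
   for( ;;)
   {
      method = cache->buckets[ index];
      if( ! method)
         return( (IMP) 0);                   /* empty bucket: let the fill code run */
      if( likely( method->method_name == _cmd))
         return( method->method_imp);        /* the predicted, 0-cache-misses path */
      index = (index + 1) & cache->mask;     /* mispredicted: probe the next bucket */
   }
}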

The superfluous rotation

09	rlwinm	r11,r11,2,0,29	; shift mask up by 2
This shift is always applied to the mask after it has been fetched. It would be easy to pre-compute the shift in the cache-filling routine instead of having the dispatcher do the work; that definitely saves a cycle.
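
A sketch of what the cache-filling side could compute instead, assuming a 32-bit runtime where one bucket is a single pointer; the function name is mine, the real cache code just stores the plain mask:

static unsigned int   sketch_preshifted_mask( unsigned int nbuckets)
{
   unsigned int   mask;

   mask = nbuckets - 1;        /* nbuckets is a power of two                 */
   return( mask << 2);         /* one bucket is 4 bytes (a pointer) on PPC32 */
}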

A different cache algorithm

Currently the buckets are organized in circular buffer fashion: if the search hits the end of the bucket array, it continues at the beginning. This is only slightly more complicated than extending the end of the array far enough to accommodate the overflowing methods.
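
In C, such a sentinel-terminated lookup might look like the following sketch. It assumes the fill routine over-allocates a few buckets past the masked region and always leaves a trailing NULL; that layout is my assumption, not how the real cache is organized:

#include <objc/objc-runtime.h>

static IMP   sketch_linear_lookup( struct objc_cache *cache, SEL _cmd)
{
   struct objc_method   **bucket;
   struct objc_method   *method;

   /* start at the preferred bucket and probe forward; no wrap-around,
      a NULL sentinel is guaranteed to stop the loop */
   bucket = &cache->buckets[ (unsigned int) _cmd & cache->mask];
   while( (method = *bucket++))
      if( method->method_name == _cmd)
         return( method->method_imp);
   return( (IMP) 0);   /* empty bucket reached: fall through to the fill code */
}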

Let's assume the cache fill algorithm has been rewritten so that the buckets are not in a circular buffer but extend until a sentinel (NULL) value is reached. At first the algorithm changes only slightly, with the above two optimizations already coded in (here and below, instructions shown without a cycle number are the ones that drop out):

03      lwz      r12,0(r3)	; get ->isa in r12
04 L				; wait for value
05	lwz      r12,32(r12)	; get ->Cache from objc_class
06	stw      r9,48(r1)	; save r9 on stack 
07	lwz      r11,0(r12)	; -> preshifted_cache_mask from objc_cache
08	addi     r9,r12,8	; dial to ->buckets
   	rlwinm	r11,r11,2,0,29	; shift mask up by 2
09	and      r12,r4,r11	; SEL AND mask, r12: index into buckets

mloop:
10	lwzx     r2,r9,r12	; load objc_method pointer from bucket
11	addi     r12,r12,4	; move to next bucket
12	cmplwi   r2,0		; bucket empty ?
13	beq-     mfill		; cache needs to be filled ->

14	lwz      r0,0(r2)       ; get SEL from objc_method
  	and      r12,r12,r11	; buckets, circular wrap code
15 L
16	cmplw    r0,r4		; same SEL ?
17	bne-     mloop		; no, try next

18	lwz      r12,8(r2)	; get IMP from objc_method
19 L				; wait for value
20	mtctr    r12		; stuff it into ctr
21	lwz      r9,48(r1)	; pop r9 from stack
22 	bctr			; jump to IMP
But as r11 is no longer needed past cycle 09, we can now get rid of register r9, like this:
03      lwz      r12,0(r3)	; get ->isa in r12
04 L				; wait for value
05	lwz      r12,32(r12)	; get ->cache from objc_class
  	stw      r9,48(r1)	; save r9 on stack
06 L
07	lwz      r11,0(r12)	; -> preshifted_cache_mask from objc_cache
08	addi     r12,r12,8	; dial to ->buckets
09	and      r11,r4,r11	; SEL AND mask, r11: index into buckets

mloop:
10	lwzx     r2,r12,r11	; load objc_method pointer from bucket
11	addi     r11,r11,4	; move to next bucket
12	cmplwi   r2,0		; bucket empty ?
13	beq-     mfill		; cache needs to be filled ->

14	lwz      r0,0(r2)       ; get SEL from objc_method
15 L
16	cmplw    r0,r4		; same SEL ?
17	bne-     mloop		; no, try next

18	lwz      r12,8(r2)	; get IMP from objc_method
19 L				; wait for value
20	mtctr    r12		; stuff it into ctr
  	lwz      r9,48(r1)	; pop r9 from stack
21 L
22 	bctr			; jump to IMP
Hmm, so far this hasn't gained anything, because the freed-up slots just turn into latencies.

The latency between cycles 05 and 06 can be avoided by shadowing the mask of the objc_cache structure in the objc_class structure. Then the CPU no longer has to wait for the cache pointer to be fetched. This is also kind of neat, because the original mask format can be kept in the cache structure, while the pre-shifted mask becomes available in the class structure. Here's how it could look, with a reshuffling of the registers:
03      lwz      r12,0(r3)	; get ->isa in r12
04 L				; wait for value
05	lwz      r11,40(r12)	; preshifted ->cache_mask from objc_class
06	lwz      r12,32(r12)	; get ->Cache from objc_class
07	and      r11,r4,r11	; SEL AND mask, r11: index into buckets
08	addi     r12,r12,8	; dial to ->buckets
So one latency is gone and another cycle is saved.
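
In C, the shadowing could be sketched like this. The ten fields of struct objc_class are 4 bytes each on the 32-bit runtime, adding up to 40 bytes, so appending the shadow at the end lands it at the offset 40 read by the lwz above; the extra field and the update helper are my own invention, the real objc_class has no such member:

#include <objc/objc-runtime.h>

struct sketch_objc_class
{
   struct objc_class   base;              /* isa ... cache, protocols    */
   unsigned int        preshifted_mask;   /* shadow of cache->mask << 2  */
};

/* keep the shadow in sync whenever the method cache is (re)allocated */
static void   sketch_update_mask_shadow( struct sketch_objc_class *cls)
{
   cls->preshifted_mask = cls->base.cache->mask << 2;
}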

The next idea is a bit tricky. This should pay off in the majority of cases. See what happens when:

14	lwz      r0,0(r2)       ; get SEL from objc_method
15 L
16	cmplw    r0,r4		; same SEL ?
17	bne-     mloop		; no, try next

18	lwz      r12,8(r2)	; get IMP from objc_method
19 L				; wait for value
20	mtctr    r12		; stuff it into ctr
21 L
22 	bctr			; jump to IMP
is changed to:
14	lwz      r0,0(r2)       ; get SEL from objc_method
15	lwz      r12,8(r2)	; get IMP from objc_method
16	cmplw    r0,r4		; same SEL ?
17	bne-     mloop		; no, try next

18	mtctr    r12		; stuff it into ctr
  	lwz      r12,8(r2)	; get IMP from objc_method
19 L
20 	bctr			; jump to IMP
This looks good on paper because we lost two latencies.

Of course this doesn't work immediately, because we are clobbering r12, which we still need in the loop. But that can be fixed.

Also there are now potentially extra memory accesses in the loop, which may punish us in the cases with more than 0 cache misses. But how much will it punish us? How likely is the CPU data cache to miss, one wonders... Given that a cache line is 16 bytes on older processors, once 0(r2) has been fetched there is a 50:50 chance that 8(r2) is already in the cache. On a G5 with its larger cache lines the chance is even better.
So given the likelihood of

  1. a data cache hit
  2. a method cache hit

I think it's a worthwhile endeavour.
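
As a C sketch, the speculative load amounts to fetching the IMP together with the SEL and only then comparing. The function below is my own illustration of the idea (using the non-circular lookup from above), not the actual dispatcher:

#include <objc/objc-runtime.h>

static IMP   sketch_speculative_lookup( struct objc_cache *cache, SEL _cmd)
{
   struct objc_method   **bucket;
   struct objc_method   *method;
   SEL                  sel;
   IMP                  imp;

   bucket = &cache->buckets[ (unsigned int) _cmd & cache->mask];
   while( (method = *bucket++))
   {
      sel = method->method_name;
      imp = method->method_imp;   /* loaded before we know the SEL matches;
                                     on a hit its latency is already hidden,
                                     on a miss it likely sits in the same
                                     data cache line anyway */
      if( sel == _cmd)
         return( imp);
   }
   return( (IMP) 0);              /* empty bucket: go fill the cache */
}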

Now r12 needs to be freed up somehow. This can be done by another slight tweaking of the code:

07	and      r11,r4,r11	; SEL AND mask, r11: index into buckets
08	add      r11,r11,r12	; add index to Cache

mloop:
09	lwz      r2,8(r11)	; load objc_method pointer from bucket
10	addi     r11,r11,4	; move to next bucket
In the end, a conveniently placed nop in the latency slot at cycle 04 pushes the mloop start onto a 16-byte boundary, which is what the G5 likes:
01	cmplwi   r3,0		; is message to nil
02	beq-     mleave		; ok, just return
    
03      lwz      r12,0(r3)	; get ->isa in r12
04      nop			; wait for value
05	lwz      r11,40(r12)	; ->preshifted_cache_mask from objc_class
06	lwz      r12,32(r12)	; get ->cache from objc_class
07	and      r11,r4,r11	; SEL AND mask, r11: index into buckets
08	add      r11,r11,r12	; add index to Cache

mloop:
09	lwz      r2,8(r11)	; load objc_method pointer from bucket
10	addi     r11,r11,4	; move to next bucket
11	cmplwi   r2,0		; bucket empty ?
12	beq-     mfill		; cache needs to be filled ->

13	lwz      r0,0(r2)       ; get SEL from objc_method
14	lwz      r12,8(r2)	; get IMP from objc_method
15	cmplw    r0,r4		; same SEL ?
16	bne-     mloop		; no, try next

17	mtctr    r12		; stuff it into ctr
18 L
19	bctr			; jump to IMP

mleave:
    blr

mfill:
    nop		                  ; more code but snipped for simplicity
    blr

objc_msgSend optimized

That is 19 cycles vs. 22 cycles, plus an unknown statistical gain from the improved branch prediction: around 14% improvement on paper... This is offset by the potential loss from the extra memory access, if the method cache incurs lots of misses. The actual advantage has to be measured. (Try the test case!)

So that's what I did on my old 400 MHz Lombard PowerBook. Here are the results of a typical run:

2003-12-27 12:05:15.885 test[901] 0 cache misses
2003-12-27 12:05:15.888 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:21.646 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:26.628 test[901] 1 cache misses
2003-12-27 12:05:26.630 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:34.321 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:41.183 test[901] 2 cache misses
2003-12-27 12:05:41.185 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:49.995 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:57.515 test[901] 3 cache misses
2003-12-27 12:05:57.517 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:06:07.639 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:06:16.200 test[901] 4 cache misses
2003-12-27 12:06:16.202 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:06:27.581 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:06:37.159 test[901] Stop
That is 5.0 s vs. 5.8 s, or something like a 13% gain on a G3 processor.

On my dual 2 GHz G5 the results are even more pleasant: I measured an almost 20% boost!


If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.