The faster objc_msgSend
objc_msgSend analyzed
Sending a message to an object in Objective-C works like this: the object's class (via the isa pointer) and the selector _cmd (the method name) are used to find the method's implementation in the class, and then that implementation is jumped to. For better performance the implementation and its selector are noted in a private cache that is part of every class. This cache is very simple and very fast.
In this example, only a few methods of a class Foo have been memorized in its cache so far. Therefore the cache is small; it holds only 8 entries. (An NSTextView, just initialized from a NIB and displaying, has 256 slots right from the start.) Five slots are filled. Each slot shows the preferred slot for the method stored there. Our target method's optimal slot is slot #3, but since that one is occupied, it resides in slot #4.
Part of the selector is cookie-cut with the cache mask to quickly index into the cache; this avoids a linear search. In the given example the selector indexes into slot #3, which is occupied by a different method. After one cache miss the matching selector and its implementation are found.
The graphic also tries to visualize that the cache is arranged in a circular fashion: if the last entry doesn't match, the search continues at the start of the cache. (Notice that slot #0 is occupied by a method that would preferably be stored in slot #7.)
For those who can read C more easily than PPC assembler, here's the objc_msgSend entry part in almost functional, yet unfixable and unusable, C:
```c
#include <objc/objc-runtime.h>

id   c_objc_msgSend( struct objc_class /* ahem */ *self, SEL _cmd, ...)
{
   struct objc_class    *cls;
   struct objc_cache    *cache;
   unsigned int         hash;
   struct objc_method   *method;
   unsigned int         index;

   if( self)
   {
      cls   = self->isa;
      cache = cls->cache;
      hash  = cache->mask;
      index = (unsigned int) _cmd & hash;

      do
      {
         method = cache->buckets[ index];
         if( ! method)
            goto recache;
         index = (index + 1) & cache->mask;
      }
      while( method->method_name != _cmd);

      return( (*method->method_imp)( (id) self, _cmd));
   }
   return( (id) self);

recache:
   /* ... */
   return( 0);
}
```

Some more background information: The cache with its mask is dynamically grown. The cache is never more than 75% full. Methods that are recorded later have a statistical speed advantage over those entered earlier.
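For illustration, here is a minimal sketch of what a fill routine obeying those rules could look like. Everything in it (the names cache_fill and cache_grow, the exact growth test) is just an illustration of the rules above, not Apple's actual code:

```c
#include <objc/objc-runtime.h>

/* hypothetical helper: allocates a larger power-of-two bucket array,
   rehashes the old entries and installs it in the class; not shown */
static struct objc_cache   *cache_grow( struct objc_class *cls);

/* hypothetical sketch of a cache fill, reusing the structures from the
   C version of objc_msgSend above */
static void   cache_fill( struct objc_class *cls, struct objc_method *method)
{
   struct objc_cache   *cache;
   unsigned int        index;

   cache = cls->cache;

   /* keep the cache at most 75% full (capacity is mask + 1) */
   if( 4 * (cache->occupied + 1) > 3 * (cache->mask + 1))
      cache = cache_grow( cls);

   /* linear probe from the selector's preferred slot to the first free
      bucket, wrapping around circularly with the mask */
   index = (unsigned int) method->method_name & cache->mask;
   while( cache->buckets[ index])
      index = (index + 1) & cache->mask;

   cache->buckets[ index] = method;
   cache->occupied++;
}
```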
objc_msgSend rewritten
The following code is the entry part of Apple's objc_msgSend. It contains all the code executed when there is a cached implementation. I have noted the instruction cycles on the left, as I think they are spent. An L means latency: there is a stall, as the next instruction depends on the previous load. (This is for a G3. Latencies only increase with later processors, due to their longer pipelines.)

```
01   cmplwi   r3,0             ; is message to nil
02   beq-     aleave           ; ok, just return
03   lwz      r12,0(r3)        ; get ->isa in r12
04   L                         ; wait for value
05   lwz      r12,32(r12)      ; get ->cache from objc_class
06   stw      r9,48(r1)        ; save r9 on stack
07   lwz      r11,0(r12)       ; load ->cache_mask from objc_cache
08   addi     r9,r12,8         ; dial r9 to ->buckets
09   rlwinm   r11,r11,2,0,29   ; shift mask up by 2
10   and      r12,r4,r11       ; SEL AND mask, r12: index into buckets
aloop:
11   lwzx     r2,r9,r12        ; load objc_method pointer from bucket
12   addi     r12,r12,4        ; move to next bucket
13   cmplwi   r2,0             ; bucket empty ?
14   beq-     afill            ; cache needs to be filled ->
15   lwz      r0,0(r2)         ; get SEL from objc_method
16   and      r12,r12,r11      ; buckets, circular wrap code
17   cmplw    r0,r4            ; same SEL ?
18   bne+     aloop            ; no, try next ->
19   lwz      r12,8(r2)        ; get IMP from objc_method
20   L                         ; wait for value
21   mtctr    r12              ; stuff it into ctr
22   lwz      r9,48(r1)        ; pop r9 from stack
23   bctr                      ; jump to IMP

aleave:
     blr

afill:
     nop                       ; more code but snipped for simplicity
     blr
```
Change the branch prediction
The first idea is simple yet non-obvious. At cycle 18 the assumption of bne+ is that the cache probe has failed and that another loop iteration must be taken. This is the standard way to write loops; it optimizes for the looping case. Let's see how this will pan out:
| Method cache misses | Predictions detail | Predictions total |
| --- | --- | --- |
| 0 cache misses | - | (-1) |
| 1 cache miss | + - | (0) |
| 2 cache misses | + + - | (+1) |
| 3 cache misses | + + + - | (+2) |
It penalizes the 0 cache miss path and favors the paths with 2 or more cache misses. Let's assume we code it as bne- aloop; then the table turns:
| Method cache misses | Predictions detail | Predictions total |
| --- | --- | --- |
| 0 cache misses | + | (+1) |
| 1 cache miss | - + | (0) |
| 2 cache misses | - - + | (-1) |
| 3 cache misses | - - - + | (-2) |
Now it's a matter of the likelihood of cache misses. Looking at actual cache contents, the spread of cache hits and misses will not be even. In fact, 0 cache misses is a much more likely event than 2 cache misses or more.
So, depending on the branch prediction architecture and the instruction cache contents, this could potentially pay off.
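To put a rough number on it, here is a small sketch that computes the expected mispredictions per dispatch for both hints. The miss distribution is made up purely for illustration:

```c
#include <stdio.h>

int   main( void)
{
   /* assumed distribution: p[ k] is the probability that a dispatch
      needs k extra probes, i.e. k method cache misses (made-up numbers) */
   double   p[ 4] = { 0.90, 0.07, 0.02, 0.01 };
   double   mispredict_plus  = 0.0;   /* bne+ aloop: predicts "loop again"   */
   double   mispredict_minus = 0.0;   /* bne- aloop: predicts "fall through" */
   int      k;

   for( k = 0; k < 4; k++)
   {
      mispredict_plus  += p[ k] * 1.0;          /* wrong once, on the final hit */
      mispredict_minus += p[ k] * (double) k;   /* wrong on every extra loop    */
   }
   printf( "expected mispredictions per dispatch: bne+ %.2f  bne- %.2f\n",
           mispredict_plus, mispredict_minus);
   return( 0);
}
```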
The superfluous rotation
```
09   rlwinm   r11,r11,2,0,29   ; shift mask up by 2
```

This is always done to the mask being fetched. It'd be easy to just pre-compute the shift in the cache filling routine instead of having the dispatcher do the work; this definitely saves a cycle.
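In C terms the change belongs in the cache fill code: store the mask already shifted. A tiny sketch, where the function name is just made up and, in this picture, the stored mask then indexes bytes rather than array slots:

```c
#include <objc/objc-runtime.h>

/* hypothetical sketch: store the mask pre-shifted by 2 (buckets are 4-byte
   pointers) when the cache is (re)built, so the dispatcher can AND it
   directly against the selector and skip the rlwinm at cycle 09 */
static void   cache_store_preshifted_mask( struct objc_cache *cache,
                                           unsigned int n_buckets)  /* power of two */
{
   cache->mask = (n_buckets - 1) << 2;
}
```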
A different cache algorithm
Currently the buckets are organized in a circular buffer fashion: if the search hits the end of the bucket array, it continues at the beginning. This is only slightly more complicated than extending the end of the array so that it can accommodate enough methods. Let's assume the cache fill algorithm has been rewritten so that the buckets are not in a circular buffer but extend until a sentinel (NULL) value is reached. The algorithm at first changes only slightly, with the above two optimizations already coded in:
```
03   lwz      r12,0(r3)        ; get ->isa in r12
04   L                         ; wait for value
05   lwz      r12,32(r12)      ; get ->cache from objc_class
06   stw      r9,48(r1)        ; save r9 on stack
07   lwz      r11,0(r12)       ; ->preshifted_cache_mask from objc_cache
08   addi     r9,r12,8         ; dial to ->buckets
;    rlwinm   r11,r11,2,0,29   ; shift mask up by 2        (removed: mask is pre-shifted)
09   and      r12,r4,r11       ; SEL AND mask, r12: index into buckets
mloop:
10   lwzx     r2,r9,r12        ; load objc_method pointer from bucket
11   addi     r12,r12,4        ; move to next bucket
12   cmplwi   r2,0             ; bucket empty ?
13   beq-     mfill            ; cache needs to be filled ->
14   lwz      r0,0(r2)         ; get SEL from objc_method
;    and      r12,r12,r11      ; buckets, circular wrap code (removed: sentinel, no wrap)
15   L
16   cmplw    r0,r4            ; same SEL ?
17   bne-     mloop            ; no, try next
18   lwz      r12,8(r2)        ; get IMP from objc_method
19   L                         ; wait for value
20   mtctr    r12              ; stuff it into ctr
21   lwz      r9,48(r1)        ; pop r9 from stack
22   bctr                      ; jump to IMP
```
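Expressed in C, the sentinel-terminated lookup could look roughly like this sketch (structure names as in the C version above; making sure that a NULL bucket terminates every probe before the end of the array is the job of the rewritten cache fill code):

```c
#include <objc/objc-runtime.h>

/* hypothetical sketch of the non-circular lookup: the bucket array is
   assumed to have enough slack past the masked range that a probe starting
   at any index always runs into a NULL sentinel */
static IMP   cache_lookup_sentinel( struct objc_class *cls, SEL _cmd)
{
   struct objc_cache    *cache;
   struct objc_method   *method;
   unsigned int         index;

   cache = cls->cache;
   index = (unsigned int) _cmd & cache->mask;

   for( method = cache->buckets[ index]; method; method = cache->buckets[ ++index])
      if( method->method_name == _cmd)
         return( method->method_imp);

   return( NULL);   /* empty bucket: the caller runs the fill routine */
}
```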
Hmm, so far the rewritten assembly hasn't gained anything, because the cleared-up slots are incurring latencies. But as r11 is no longer needed past cycle 09, it can serve as the running index, which frees r12 for the bucket pointer, and we can now get rid of register r9 (and its save/restore) thusly:

```
03   lwz      r12,0(r3)        ; get ->isa in r12
04   L                         ; wait for value
05   lwz      r12,32(r12)      ; get ->cache from objc_class
;    stw      r9,48(r1)        ; save r9 on stack           (removed)
06   L
07   lwz      r11,0(r12)       ; ->preshifted_cache_mask from objc_cache
08   addi     r12,r12,8        ; dial to ->buckets
09   and      r11,r4,r11       ; SEL AND mask, r11: index into buckets
mloop:
10   lwzx     r2,r12,r11       ; load objc_method pointer from bucket
11   addi     r11,r11,4        ; move to next bucket
12   cmplwi   r2,0             ; bucket empty ?
13   beq-     mfill            ; cache needs to be filled ->
14   lwz      r0,0(r2)         ; get SEL from objc_method
15   L
16   cmplw    r0,r4            ; same SEL ?
17   bne-     mloop            ; no, try next
18   lwz      r12,8(r2)        ; get IMP from objc_method
19   L                         ; wait for value
20   mtctr    r12              ; stuff it into ctr
;    lwz      r9,48(r1)        ; pop r9 from stack          (removed)
21   L
22   bctr                      ; jump to IMP
```
One can get around the latency between cycles 05 and 06 by shadowing the mask of the objc_cache structure in the objc_class structure. Then the CPU no longer has to wait for the cache pointer to be fetched. This is also kind of neat, because the original mask format can be kept in the cache structure while the pre-shifted mask is now available in the class structure. Here's how it could look, with a reshuffling of the registers:
```
03   lwz      r12,0(r3)        ; get ->isa in r12
04   L                         ; wait for value
05   lwz      r11,40(r12)      ; preshifted ->cache_mask from objc_class
06   lwz      r12,32(r12)      ; get ->cache from objc_class
07   and      r11,r4,r11       ; SEL AND mask, r11: index into buckets
08   addi     r12,r12,8        ; dial to ->buckets
```

So one latency is gone and another cycle saved.
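As a data-structure sketch, the shadowing could look like the following. The layout is the classic 32-bit objc_class with an assumed extra field appended, placed so that it matches the offsets 32 and 40 used in the assembly above; this is an illustration, not the actual header:

```c
/* hypothetical layout sketch, not the real runtime header */
struct objc_class
{
   struct objc_class           *isa;                    /* offset  0 */
   struct objc_class           *super_class;            /* offset  4 */
   const char                  *name;                   /* offset  8 */
   long                        version;                 /* offset 12 */
   long                        info;                    /* offset 16 */
   long                        instance_size;           /* offset 20 */
   struct objc_ivar_list       *ivars;                  /* offset 24 */
   struct objc_method_list     **methodLists;           /* offset 28 */
   struct objc_cache           *cache;                  /* offset 32, keeps the
                                                           original mask format */
   struct objc_protocol_list   *protocols;              /* offset 36 */
   unsigned int                preshifted_cache_mask;   /* offset 40, the new
                                                           shadow copy          */
};
```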
The next idea is a bit tricky. This should pay off in the majority of cases. See what happens when:
```
14   lwz      r0,0(r2)         ; get SEL from objc_method
15   L
16   cmplw    r0,r4            ; same SEL ?
17   bne-     mloop            ; no, try next
18   lwz      r12,8(r2)        ; get IMP from objc_method
19   L                         ; wait for value
20   mtctr    r12              ; stuff it into ctr
21   L
22   bctr                      ; jump to IMP
```

is changed to:
```
14   lwz      r0,0(r2)         ; get SEL from objc_method
15   lwz      r12,8(r2)        ; get IMP from objc_method
16   cmplw    r0,r4            ; same SEL ?
17   bne-     mloop            ; no, try next
18   mtctr    r12              ; stuff it into ctr
;    lwz      r12,8(r2)        ; get IMP from objc_method   (removed: hoisted to 15)
19   L
20   bctr                      ; jump to IMP
```

This looks good on paper because we lost two latencies.
Of course this doesn't work immediately, because we are clobbering r12, which we still need in the loop. But that can be fixed.
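Before worrying about the register, here is the idea as a C sketch: fetch the SEL and the IMP together and only use the IMP if the SEL compares equal. Names again follow the earlier C version; this is an illustration, not the actual dispatcher:

```c
#include <objc/objc-runtime.h>

/* hypothetical C analogue of the speculative IMP load: both fields of the
   bucket's objc_method are fetched before the selector is compared, so a
   hit never waits on a second, dependent load */
static IMP   cache_lookup_speculative( struct objc_class *cls, SEL _cmd)
{
   struct objc_cache    *cache;
   struct objc_method   *method;
   unsigned int         index;
   SEL                  sel;
   IMP                  imp;

   cache = cls->cache;
   index = (unsigned int) _cmd & cache->mask;

   /* sentinel-terminated probing as before, no circular wrap */
   while( (method = cache->buckets[ index++]) != NULL)
   {
      sel = method->method_name;   /* load the SEL ...                  */
      imp = method->method_imp;    /* ... and the IMP before comparing  */
      if( sel == _cmd)
         return( imp);             /* hit: the IMP is already at hand   */
   }
   return( NULL);                  /* miss: fall through to cache fill  */
}
```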
Also, there are now potentially extra memory accesses in the loop, which may punish us in the cases with more than 0 cache misses. But how much will it punish us?
How likely is the CPU data cache to miss, one wonders... Given that a cache line is 16 bytes on older processors, when 0(r2) is fetched there is a 50:50 chance that 8(r2) is already in the cache. On a G5 with its larger cache lines the chance is even better.
So, given the likelihood of

- a data cache hit
- a method cache hit

the extra load should rarely cost anything, and the change should pay off in the majority of cases; a rough model follows below.
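Here is the rough model, with made-up probabilities purely for illustration:

```c
#include <stdio.h>

int   main( void)
{
   /* assumed numbers, not measurements */
   double   p_failing_probe = 0.10;   /* chance that a probe loops again      */
   double   p_same_line     = 0.50;   /* 16-byte lines: SEL and IMP share one */

   /* the speculative IMP load only costs an extra line fill when the probe
      fails the SEL compare AND the IMP word sits on a different, cold line */
   printf( "expected extra line fills per dispatch: %.3f\n",
           p_failing_probe * (1.0 - p_same_line));
   return( 0);
}
```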
Now r12 needs to be freed up somehow. This can be done by another slight tweaking of the code:
```
07   and      r11,r4,r11       ; SEL AND mask, r11: index into buckets
08   add      r11,r11,r12      ; add index to cache
mloop:
09   lwz      r2,8(r11)        ; load objc_method pointer from bucket
10   addi     r11,r11,4        ; move to next bucket
```

In the end, a conveniently placed nop in the latency slot at cycle 04 pushes the mloop start onto a 16-byte boundary, which is what the G5 likes:
```
01   cmplwi   r3,0             ; is message to nil
02   beq-     mleave           ; ok, just return
03   lwz      r12,0(r3)        ; get ->isa in r12
04   nop                       ; wait for value
05   lwz      r11,40(r12)      ; ->preshifted_cache_mask from objc_class
06   lwz      r12,32(r12)      ; get ->cache from objc_class
07   and      r11,r4,r11       ; SEL AND mask, r11: index into buckets
08   add      r11,r11,r12      ; add index to cache
mloop:
09   lwz      r2,8(r11)        ; load objc_method pointer from bucket
10   addi     r11,r11,4        ; move to next bucket
11   cmplwi   r2,0             ; bucket empty ?
12   beq-     mfill            ; cache needs to be filled ->
13   lwz      r0,0(r2)         ; get SEL from objc_method
14   lwz      r12,8(r2)        ; get IMP from objc_method
15   cmplw    r0,r4            ; same SEL ?
16   bne-     mloop            ; no, try next
17   mtctr    r12              ; stuff it into ctr
18   L
19   bctr                      ; jump to IMP

mleave:
     blr

mfill:
     nop                       ; more code but snipped for simplicity
     blr
```
objc_msgSend optimized
That is 19 cycles vs. 22 cycles, plus an unknown statistical gain because of the improved branch prediction. Around 14% improvement on paper... This is offset by the potential loss due to the extra memory hit if the method cache incurs lots of misses. The actual advantage has to be measured. (Try the test case!)
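The test case itself is not reproduced here, but a measurement harness along the following lines would do. The class name and the setup are just a sketch, not the actual test code; provoking a given number of cache misses would additionally require messaging colliding selectors first:

```objc
#import <Foundation/Foundation.h>

/* hypothetical benchmark sketch: time 100 million dispatches of a trivial
   method, run once against the stock dispatcher and once against the patch */
@interface Probe : NSObject
- (void) doNothing;
@end

@implementation Probe
- (void) doNothing   { }
@end

int   main( void)
{
   NSAutoreleasePool   *pool  = [[NSAutoreleasePool alloc] init];
   Probe               *probe = [[Probe alloc] init];
   NSDate              *start;
   unsigned long       i;

   start = [NSDate date];
   for( i = 0; i < 100000000UL; i++)
      [probe doNothing];
   NSLog( @"100 Mio message calls: %.3f s", -[start timeIntervalSinceNow]);

   [probe release];
   [pool release];
   return( 0);
}
```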
So that's what I did on my old Lombard PowerBook 400 MHz. Here are the results of a typical run:
```
2003-12-27 12:05:15.885 test[901] 0 cache misses
2003-12-27 12:05:15.888 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:21.646 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:26.628 test[901] 1 cache misses
2003-12-27 12:05:26.630 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:34.321 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:41.183 test[901] 2 cache misses
2003-12-27 12:05:41.185 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:05:49.995 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:05:57.515 test[901] 3 cache misses
2003-12-27 12:05:57.517 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:06:07.639 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:06:16.200 test[901] 4 cache misses
2003-12-27 12:06:16.202 test[901] Start 100 Mio message calls with Apple Code
2003-12-27 12:06:27.581 test[901] Start 100 Mio message calls with Mulle Code
2003-12-27 12:06:37.159 test[901] Stop
```

5.0 vs. 5.8 seconds, or something like a 13% gain on a G3 processor.
On my dual 2 GHz G5 the results are even more pleasant: I measured an almost 20% boost!
If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.