Mulle kybernetiK - Tech Info: v0.5
Obj-C Optimization: Method and function call innards
This third installment of the ongoing series "How to optimize in Objective-C" lays some more ground work in the area of method calls and regular function calls.
(c) 2000 Mulle kybernetiK - text by Nat!

Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.

Method and function call innards

Ok, I lied last time when I said that we'd be covering different allocation strategies this time. While writing some more articles (which I unfortunately accidentally lost, thanks to Mac OS X beta and my stupidity) I noticed that it became awkward to justify many a decision, because the basics of calls hadn't been covered. (Though not necessary, it doesn't hurt to have read the previous articles.)


Inside calling C-Functions and Objective-C methods

There will be a lot of disassembled code in this article. Although it can't hurt to understand what is really happening in the code, for further discussions it is only necessary that you get an impression of the effort involved in doing "certain things". Also, this article is very much geared to coding on a PPC machine under Mach (i.e. your Mac OS X (Server) box). Things on an Intel machine may very well be different in certain areas, for instance in the way shared libraries are invoked. Nevertheless, most of the comments should also be valid for Intel machinery.
O tempora O mores - or how to read disassembled code
Since presumably some people have never seen disassembled code, let's just quickly examine a typical output of gdb and see what it means:
gdb>x/1i $pc
0x1ddc <call_test+12>: stw r0,8(r1)

The first number, 0x1ddc, is the memory address(1) that the particular code (a C function or an Obj-C method, for example) happened to be loaded to. If there is a symbol (a function or a variable) associated with this address, or with an address shortly below it, gdb will print the symbol's name and the offset from it in angle brackets following the address: <call_test+12>. The data (usually 4 bytes) encountered at this address is then disassembled into the mnemonic assembler format: stw r0,8(r1).
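
The symbol-plus-offset notation is plain address arithmetic. Here is a minimal C sketch of it; the helper name format_location is made up for this illustration and has nothing to do with how gdb is actually implemented:

```c
#include <stdio.h>

/* Hypothetical helper: writes "0xADDR <symbol+offset>" into buf, the way
   gdb labels an address that lies inside a known symbol. */
int format_location(char *buf, size_t size, unsigned long address,
                    const char *symbol, unsigned long symbol_start)
{
    unsigned long offset = address - symbol_start;

    if (offset == 0)
        return snprintf(buf, size, "0x%lx <%s>", address, symbol);
    return snprintf(buf, size, "0x%lx <%s+%lu>", address, symbol, offset);
}
```

With call_test starting at 0x1dd0, the address 0x1ddc yields the label <call_test+12>. Since a PPC instruction is 4 bytes long, offset 12 is the fourth instruction of the function.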


C calls

Since C is the basis of Obj-C, let's review the options we have available for invoking a C function.

If you have a function defined as inline(0) that is short and not too complicated, the compiler (given the -O option) will happily avoid the actual function call and inline the code into the caller's code. This is the fastest way to invoke a C function. It is on par with the more old-fashioned method of using macros (aka #define). If you have set your compiler to a more aggressive optimization level, it may even inline some of your small functions automatically.

The following example shows the generated code for two C functions. Function inline_foo is inlined by the compiler, whereas plain foo is called the "usual" way as a subroutine.

The disassembly of the object code generated by the compiler is listed first, followed by the source code it was compiled from.

In the disassembled code listing the inlined code is marked green (<call_test+32> - <call_test+44>), it is just four instructions. The functional equivalent code wrapped in the normal C function header and footer is marked purple (<foo+16> - <foo+28>). All instructions that comprise the overhead of a regular C call over an inlined C call have been marked blue.

0x1dac <foo>: mflr r0 
0x1db0 <foo+4>:        bcl        20,4*cr7+so,0x1db4 <foo+8>  
0x1db4 <foo+8>:        mflr        r12  
0x1db8 <foo+12>:        mtlr        r0   

0x1dbc <foo+16>:        addis        r9,r12,0    
0x1dc0 <foo+20>:        addi        r9,r9,568  
0x1dc4 <foo+24>:        lfd        f0,0(r9)    
0x1dc8 <foo+28>:        fmul        f1,f1,f0   

0x1dcc <foo+32>:        blr                 


0x1dd0 <call_test>:    mflr        r0
0x1dd4 <call_test+4>:        stfd        f31,-8(r1)
0x1dd8 <call_test+8>:        stw        r31,-12(r1)
0x1ddc <call_test+12>:        stw        r0,8(r1)
0x1de0 <call_test+16>:        stwu        r1,-80(r1)
0x1de4 <call_test+20>:        bcl        20,4*cr7+so,0x1de8
0x1de8 <call_test+24>:        mflr        r31

0x1dec <call_test+28>:        fmr        f31,f1

0x1df0 <call_test+32>:        addis        r9,r31,0
0x1df4 <call_test+36>:        addi        r9,r9,524
0x1df8 <call_test+40>:        lfd        f0,0(r9)
0x1dfc <call_test+44>:        fmul        f31,f31,f0

0x1e00 <call_test+48>:        bl        0x1dac <foo>

0x1e04 <call_test+52>:        fadd        f1,f31,f1

0x1e08 <call_test+56>:        addi        r1,r1,80
0x1e0c <call_test+60>:        lwz        r0,8(r1)
0x1e10 <call_test+64>:        mtlr        r0
0x1e14 <call_test+68>:        lwz        r31,-12(r1)
0x1e18 <call_test+72>:        lfd        f31,-8(r1)
0x1e1c <call_test+76>:        blr
static inline double  inline_foo( float x)
{
   return( x * 2.1);
}


static double  foo( float x)
{
   return( x * 2.1);
}


double call_test( float x)
{
   return( inline_foo( x) + foo( x));
}

There is no need to wax on over this trivial issue. Inlining rocks!
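
For comparison, here is a small sketch that puts the macro variant next to the inline function. FOO_MACRO and call_both are names invented for this example. With optimization on, both typically end up as the same few instructions in the caller, but the inline function keeps type checking and evaluates its argument exactly once.

```c
/* Old-fashioned macro version: pure text substitution, no type checking.
   FOO_MACRO is a made-up name for this sketch. */
#define FOO_MACRO(x)  ((x) * 2.1)

/* Inline function version, same body as inline_foo in the article. */
static inline double inline_foo(float x)
{
    return x * 2.1;
}

/* Uses both variants; neither needs a real subroutine call when the
   compiler decides to inline. */
double call_both(float x)
{
    return inline_foo(x) + FOO_MACRO(x);
}
```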


Calling a function in a shared library

You very often call functions that are provided by a shared library. Any function that resides in a framework is in a shared library. As an example, let's see how a call to malloc works:

0x1dd4 <call_test2>: mflr        r0
0x1dd8 <call_test2+4>:        stw        r0,8(r1)
0x1ddc <call_test2+8>:        stwu        r1,-64(r1)

0x1de0 <call_test2+12>:        li        r3,128
0x1de4 <call_test2+16>:        bl        0x1fa4 <dyld_stub_malloc>

0x1de8 <call_test2+20>:        addi        r1,r1,64
0x1dec <call_test2+24>:        lwz        r0,8(r1)
0x1df0 <call_test2+28>:        mtlr        r0
0x1df4 <call_test2+32>:        blr


0x1fa4 <dyld_stub_malloc>:        mflr        r0
0x1fa8 <dyld_stub_malloc+4>:        bcl        20,4*cr7+so,0x1fac
0x1fac <dyld_stub_malloc+8>:        mflr        r11
0x1fb0 <dyld_stub_malloc+12>:        addis        r11,r11,0
0x1fb4 <dyld_stub_malloc+16>:        mtlr        r0
0x1fb8 <dyld_stub_malloc+20>:        lwz        r12,112(r11)
0x1fbc <dyld_stub_malloc+24>:        mtctr        r12
0x1fc0 <dyld_stub_malloc+28>:        addi        r11,r11,112
0x1fc4 <dyld_stub_malloc+32>:        bctr


0x5ace3800 <malloc>:    mflr        r0
0x5ace3804 <malloc+4>:        stmw        r30,-8(r1)
0x5ace3808 <malloc+8>:        stw        r0,8(r1)
0x5ace380c <malloc+12>:        stwu        r1,-80(r1)
0x5ace3810 <malloc+16>:        bcl        20,4*cr7+so,0x5ace3814
0x5ace3814 <malloc+20>:        mflr        r31
0x5ace3818 <malloc+24>:        mr        r30,r3
0x5ace381c <malloc+28>:        addis        r9,r31,14
0x5ace3820 <malloc+32>:        lwz        r9,3020(r9)
0x5ace3824 <malloc+36>:        cmpwi        r9,0
0x5ace3828 <malloc+40>:        bne        0x5ace3830 <malloc+48>
0x5ace382c <malloc+44>:        bl        0x5ace74a0
0x5ace3830 <malloc+48>:        addis        r9,r31,14
0x5ace3834 <malloc+52>:        lwz        r9,3024(r9)
0x5ace3838 <malloc+56>:        lwz        r3,0(r9)
0x5ace383c <malloc+60>:        mr        r4,r30
0x5ace3840 <malloc+64>:        bl        0x5ace3860
0x5ace3844 <malloc+68>:        addi        r1,r1,80
0x5ace3848 <malloc+72>:        lwz        r0,8(r1)
0x5ace384c <malloc+76>:        mtlr        r0
0x5ace3850 <malloc+80>:        lmw        r30,-8(r1)
0x5ace3854 <malloc+84>:        blr
void  call_test2()
{
   void  *p;

   p = malloc( 128);
}
The code generated for the malloc call in call_test2 is just the two blue lines at addresses 0x1de0 and 0x1de4. Consider the other code wrappage :)

Instead of directly calling malloc, another piece of code named dyld_stub_malloc gets called. This stub code, which gets statically linked to your code, provides the interface to the shared library function malloc. (I have no idea why it needs to be done this way.)

Therefore a few extra instructions (red) are executed before we are in malloc(2).

From the disassembled output even the layman can easily deduce that this is a more involved procedure and therefore slower than calling a statically linked C function like "foo" above.
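
What the stub does can be modelled in C as a call through a pointer slot. The following is only a sketch under assumptions: the names library_malloc, malloc_slot, bind_then_call and stub_malloc are invented, and the idea that the slot gets patched on the first call is my guess at the purpose of the stub, not a description of dyld internals.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the function living in the shared library. */
static void *library_malloc(size_t size)
{
    return malloc(size);
}

static void *bind_then_call(size_t size);

/* One pointer-sized slot per imported function. The stub loads the target
   address from such a slot (compare "lwz r12,112(r11)" in dyld_stub_malloc). */
static void *(*malloc_slot)(size_t) = bind_then_call;
static int bind_count = 0;

/* Assumed first-call behaviour: find the real function, patch the slot. */
static void *bind_then_call(size_t size)
{
    bind_count++;
    malloc_slot = library_malloc;
    return malloc_slot(size);
}

/* The stub itself: load the slot and jump through it ("mtctr" / "bctr"). */
void *stub_malloc(size_t size)
{
    return (*malloc_slot)(size);
}
```

In this model every call pays for one extra load and an indirect jump, and only the very first call pays for the binding.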


Anatomy of an Obj-C method call

Given an object and a selector, the Objective-C runtime system somehow determines the address of the code that it should call, and then calls it with the appropriate parameters. Now how does that work in detail? The first thing to note is that writing

[p callWith:x and:y];

is really just the same as writing

objc_msgSend( p, @selector( callWith:and:), x, y);

The objc_msgSend function must determine from the object and the selector what code should be executed. Since the method callWith:and: could be implemented by any number of different objects (or rather their classes), just examining the selector cannot be sufficient for objc_msgSend. What it needs to do is examine the class of the object and try to find an implementation for this selector in the definition of this class or of one of its superclasses.

The actual mechanics are nicely explained in the Objective-C book, so I will forego duplicating this. See your local copy on your harddisk or http://www.toodarkpark.org/computers/objc/coreobjc.html#1522.
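
To give a rough idea of the search, here is a simplified model in plain C. All struct and function names are invented for this sketch, and the real runtime data structures look different; in particular the real runtime compares selector addresses instead of calling strcmp.

```c
#include <stddef.h>
#include <string.h>

typedef double (*method_fn)(void *self, const char *sel, double arg);

struct method {
    const char *name;
    method_fn   imp;
};

struct class_info {
    const struct class_info *super_class;   /* NULL for the root class */
    const struct method     *methods;
    size_t                   method_count;
};

/* Search the class itself first, then each superclass in turn. */
method_fn find_method(const struct class_info *cls, const char *sel)
{
    size_t i;

    for (; cls != NULL; cls = cls->super_class)
        for (i = 0; i < cls->method_count; i++)
            if (strcmp(cls->methods[i].name, sel) == 0)
                return cls->methods[i].imp;
    return NULL;   /* nobody in the hierarchy implements it */
}

/* Two demo classes: Derived overrides fooMethod: and inherits barMethod:. */
static double base_foo(void *s, const char *c, double x)    { (void) s; (void) c; return x * 2.1; }
static double base_bar(void *s, const char *c, double x)    { (void) s; (void) c; return x; }
static double derived_foo(void *s, const char *c, double x) { (void) s; (void) c; return x * 4.2; }

static const struct method base_methods[]    = { { "fooMethod:", base_foo }, { "barMethod:", base_bar } };
static const struct method derived_methods[] = { { "fooMethod:", derived_foo } };

static const struct class_info base_class    = { NULL, base_methods, 2 };
static const struct class_info derived_class = { &base_class, derived_methods, 1 };
```

Asking derived_class for fooMethod: finds the override right away, while barMethod: is only found after stepping up to base_class.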

What is a selector anyway?
A selector in the current Apple implementation is just the address of a C string that contains the name of the selector. So if @selector( callWith:and:) yields 0x10210, you would find at address 0x10210: 'c', 'a', 'l', 'l', 'W', 'i', 't', 'h', ':', 'a', 'n', 'd', ':', 0

These selector strings are uniqued by the mach runtime during loading. Therefore all selectors of the various frameworks and your main code that have the same name share the same selector address.
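
Uniquing means that equal names are mapped to one shared address, so that later on two selectors can be compared by comparing pointers. A minimal sketch of the idea follows; unique_selector and its fixed-size table are assumptions made for this example and not the runtime's actual code.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_SELECTORS 64

static const char *unique_names[MAX_SELECTORS];
static int         n_unique_names;

/* Returns the one shared address for name, registering a copy on first use.
   Returns NULL if the table is full or memory runs out. */
const char *unique_selector(const char *name)
{
    char *copy;
    int   i;

    for (i = 0; i < n_unique_names; i++)
        if (strcmp(unique_names[i], name) == 0)
            return unique_names[i];

    if (n_unique_names == MAX_SELECTORS)
        return NULL;
    copy = malloc(strlen(name) + 1);
    if (copy == NULL)
        return NULL;
    strcpy(copy, name);
    unique_names[n_unique_names++] = copy;
    return copy;
}
```

Two different buffers that both contain "callWith:and:" come back as the same pointer, so a single pointer compare is enough to tell selectors apart afterwards.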

So let's check out a simple Objective-C method and step through the instructions executed when calling it.

0x2d18 <-[CallTest3 fooMethod:]>:    addis  r9,r12,0
0x2d1c <-[CallTest3 fooMethod:]+4>:  addi   r9,r9,732
0x2d20 <-[CallTest3 fooMethod:]+8>:  lfd   f0,0(r9)
0x2d24 <-[CallTest3 fooMethod:]+12>: fmul  f1,f1,f0
0x2d28 <-[CallTest3 fooMethod:]+16>: blr
- (double) fooMethod:(float) x
{
   return( x * 2.1);
}

It is interesting that fooMethod: itself looks very slender compared to the C function we saw in the previous example. Except for the return instruction blr, there is no overhead for stack maintenance and "environmental" register setup, because all this is taken care of in the stub and objc_msgSend code.
0x2d2c <call_test3>:    mflr  r0
0x2d30 <call_test3+4>:  stw   r31,-4(r1)
0x2d34 <call_test3+8>:  stw   r0,8(r1)
0x2d38 <call_test3+12>: stwu  r1,-80(r1)
0x2d3c <call_test3+16>: bcl   20,4*cr7+so,0x2d40
0x2d40 <call_test3+20>: mflr  r31
0x2d44 <call_test3+24>: addis  r4,r31,0
0x2d48 <call_test3+28>: lwz    r4,4800(r4)
0x2d4c <call_test3+32>: stfs   f1,56(r1)
0x2d50 <call_test3+36>: lwz    r5,56(r1)
0x2d54 <call_test3+40>: bl     <objc_msgSend>
0x2d58 <call_test3+44>: addi   r1,r1,80
0x2d5c <call_test3+48>: lwz    r0,8(r1)
0x2d60 <call_test3+52>: mtlr   r0
0x2d64 <call_test3+56>: lwz    r31,-4(r1)
0x2d68 <call_test3+60>: blr


double   call_test3( CallTest3 *p, float x)
{
   return( [p fooMethod:x]);
}

The blue code sets up the parameter, the selector and the object (it puts them into registers, r5, r4 and r3 respectively) and calls the stub function.
0x2fc0 <dyld_stub_objc_msgSend>
0x2fc0:               mflr   r0
0x2fc4:               bcl    20,4*cr7+so,0x2fc8
0x2fc8:               mflr   r11
0x2fcc:               addis  r11,r11,0
0x2fd0:               mtlr   r0
0x2fd4:               lwz    r12,88(r11)
0x2fd8:               mtctr  r12
0x2fdc:               addi   r11,r11,88
0x2fe0:               bctr

Since objc_msgSend resides in a shared library, the processor has to plod through the stub code (red) to jump to the objc_msgSend implementation.
0x720bb088 <objc_msgSend>:      cmplwi      r3,0
0x720bb08c <objc_msgSend+4>:     beq        0x720bb1f4
0x720bb090 <objc_msgSend+8>:     stw        r8,44(r1)
0x720bb094 <objc_msgSend+12>:    stw        r9,48(r1)
0x720bb098 <objc_msgSend+16>:    stw        r10,52(r1)
0x720bb09c <objc_msgSend+20>:    lwz        r12,0(r3)
0x720bb0a0 <objc_msgSend+24>:    lwz        r12,32(r12)
0x720bb0a4 <objc_msgSend+28>:    lwz        r11,0(r12)
0x720bb0a8 <objc_msgSend+32>:    addi       r9,r12,8
0x720bb0ac <objc_msgSend+36>:    and        r12,r4,r11
0x720bb0b0 <objc_msgSend+40>:    rlwinm     r0,r12,2,0
0x720bb0b4 <objc_msgSend+44>:    lwzx       r10,r9,r0
0x720bb0b8 <objc_msgSend+48>:    cmplwi     r10,0
0x720bb0bc <objc_msgSend+52>:    beq        0x720bb0f4
0x720bb0c0 <objc_msgSend+56>:    addi       r12,r12,1
0x720bb0c4 <objc_msgSend+60>:    lwz        r8,0(r10)
0x720bb0c8 <objc_msgSend+64>:    and        r12,r12,r11
0x720bb0cc <objc_msgSend+68>:    lwz        r10,8(r10)
0x720bb0d0 <objc_msgSend+72>:    cmplw      r8,r4
0x720bb0d4 <objc_msgSend+76>:    bne-       0x720bb0b0
0x720bb0d8 <objc_msgSend+80>:    mr         r12,r10
0x720bb0dc <objc_msgSend+84>:    mtctr      r10
0x720bb0e0 <objc_msgSend+88>:    lwz        r8,44(r1)
0x720bb0e4 <objc_msgSend+92>:    lwz        r9,48(r1)
0x720bb0e8 <objc_msgSend+96>:    lwz        r10,52(r1)
0x720bb0ec <objc_msgSend+100>:   li         r11,0
0x720bb0f0 <objc_msgSend+104>:   bctr
0x720bb0f4 <objc_msgSend+108>:   stw        r3,24(r1)
0x720bb0f8 <objc_msgSend+112>:   stw        r4,28(r1)
0x720bb0fc <objc_msgSend+116>:   stw        r5,32(r1)
0x720bb100 <objc_msgSend+120>:   stw        r6,36(r1)
0x720bb104 <objc_msgSend+124>:   stw        r7,40(r1)
0x720bb108 <objc_msgSend+128>:   mflr       r0
0x720bb10c <objc_msgSend+132>:   stw        r0,8(r1)
0x720bb110 <objc_msgSend+136>:   stfd       f13,-8(r1)
0x720bb114 <objc_msgSend+140>:   stfd       f12,-16(r1)
0x720bb118 <objc_msgSend+144>:   stfd       f11,-24(r1)
0x720bb11c <objc_msgSend+148>:   stfd       f10,-32(r1)
0x720bb120 <objc_msgSend+152>:   stfd       f9,-40(r1)
0x720bb124 <objc_msgSend+156>:   stfd       f8,-48(r1)
0x720bb128 <objc_msgSend+160>:   stfd       f7,-56(r1)
0x720bb12c <objc_msgSend+164>:   stfd       f6,-64(r1)
0x720bb130 <objc_msgSend+168>:   stfd       f5,-72(r1)
0x720bb134 <objc_msgSend+172>:   stfd       f4,-80(r1)
0x720bb138 <objc_msgSend+176>:   stfd       f3,-88(r1)
0x720bb13c <objc_msgSend+180>:   stfd       f2,-96(r1)
0x720bb140 <objc_msgSend+184>:   stfd       f1,-104(r1)
0x720bb144 <objc_msgSend+188>:   stwu       r1,-160(r1)
0x720bb148 <objc_msgSend+192>:   lwz        r3,0(r3)
0x720bb14c <objc_msgSend+196>:   mflr       r0
0x720bb150 <objc_msgSend+200>:   bl         0x720bb154
0x720bb154 <objc_msgSend+204>:   mflr       r12
0x720bb158 <objc_msgSend+208>:   mtlr       r0
0x720bb15c <objc_msgSend+212>:   addis      r12,r12,5
0x720bb160 <objc_msgSend+216>:   lwz        r12,-1156(r12)
0x720bb164 <objc_msgSend+220>:   mtctr      r12
0x720bb168 <objc_msgSend+224>:   mflr       r0
0x720bb16c <objc_msgSend+228>:   stw        r0,8(r1)
0x720bb170 <objc_msgSend+232>:   stwu       r1,-56(r1)
0x720bb174 <objc_msgSend+236>:   bctrl
0x720bb178 <objc_msgSend+240>:   addic      r1,r1,56
0x720bb17c <objc_msgSend+244>:   lwz        r0,8(r1)
0x720bb180 <objc_msgSend+248>:   mtlr       r0
0x720bb184 <objc_msgSend+252>:   mr         r12,r3
0x720bb188 <objc_msgSend+256>:   mtctr      r3
0x720bb18c <objc_msgSend+260>:   lwz        r1,0(r1)
0x720bb190 <objc_msgSend+264>:   lwz        r0,8(r1)
0x720bb194 <objc_msgSend+268>:   mtlr       r0
0x720bb198 <objc_msgSend+272>:   lfd        f13,-8(r1)
0x720bb19c <objc_msgSend+276>:   lfd        f12,-16(r1)
0x720bb1a0 <objc_msgSend+280>:   lfd        f11,-24(r1)
0x720bb1a4 <objc_msgSend+284>:   lfd        f10,-32(r1)
0x720bb1a8 <objc_msgSend+288>:   lfd        f9,-40(r1)
0x720bb1ac <objc_msgSend+292>:   lfd        f8,-48(r1)
0x720bb1b0 <objc_msgSend+296>:   lfd        f7,-56(r1)
0x720bb1b4 <objc_msgSend+300>:   lfd        f6,-64(r1)
0x720bb1b8 <objc_msgSend+304>:   lfd        f5,-72(r1)
0x720bb1bc <objc_msgSend+308>:   lfd        f4,-80(r1)
0x720bb1c0 <objc_msgSend+312>:   lfd        f3,-88(r1)
0x720bb1c4 <objc_msgSend+316>:   lfd        f2,-96(r1)
0x720bb1c8 <objc_msgSend+320>:   lfd        f1,-104(r1)
0x720bb1cc <objc_msgSend+324>:   lwz        r3,24(r1)
0x720bb1d0 <objc_msgSend+328>:   lwz        r4,28(r1)
0x720bb1d4 <objc_msgSend+332>:   lwz        r5,32(r1)
0x720bb1d8 <objc_msgSend+336>:   lwz        r6,36(r1)
0x720bb1dc <objc_msgSend+340>:   lwz        r7,40(r1)
0x720bb1e0 <objc_msgSend+344>:   lwz        r8,44(r1)
0x720bb1e4 <objc_msgSend+348>:   lwz        r9,48(r1)
0x720bb1e8 <objc_msgSend+352>:   lwz        r10,52(r1)
0x720bb1ec <objc_msgSend+356>:   li         r11,0
0x720bb1f0 <objc_msgSend+360>:   bctr

0x720bb1f4 <objc_msgSend+364>:   mflr     r0
0x720bb1f8 <objc_msgSend+368>:   bl       0x720bb1fc
0x720bb1fc <objc_msgSend+372>:   mflr     r11
0x720bb200 <objc_msgSend+376>:   mtlr     r0
0x720bb204 <objc_msgSend+380>:   addis    r11,r11,5
0x720bb208 <objc_msgSend+384>:   lwz      r11,-1328(r11)
0x720bb20c <objc_msgSend+388>:   lwz      r11,0(r11)
0x720bb210 <objc_msgSend+392>:   cmplwi   r11,0
0x720bb214 <objc_msgSend+396>:   beqlr
0x720bb218 <objc_msgSend+400>:   mflr     r0
0x720bb21c <objc_msgSend+404>:   stw      r0,8(r1)
0x720bb220 <objc_msgSend+408>:   addi     r1,r1,-64
0x720bb224 <objc_msgSend+412>:   mtctr    r11
0x720bb228 <objc_msgSend+416>:   bctrl
0x720bb22c <objc_msgSend+420>:   addi     r1,r1,64
0x720bb230 <objc_msgSend+424>:   lwz      r0,8(r1)
0x720bb234 <objc_msgSend+428>:   mtlr     r0
0x720bb238 <objc_msgSend+432>:   li       r3,0
0x720bb23c <objc_msgSend+436>:   blr
Let's not go too deeply into the objc_msgSend implementation itself. The first two lines handle messaging to nil (orange). The code branches to objc_msgSend+364 if the object is nil; we will ignore this part of the routine.

The brownish part (objc_msgSend+8 to +104) is the code that searches for and jumps to cached methods.

Whenever a method is called for the first time for a class, objc_msgSend will not find an entry for it in its cache. In this case the other, "black" part (objc_msgSend+108 ff.) is used to traverse the class hierarchy, locate the proper implementation and fill the method cache with this information.

This lookup code is very slow. It loops and branches a lot into various subroutines and it can take 500 or more instructions to execute.

When a method has been found and put into the cache, lookup times are very fast! Only very few iterations of the loop in the khaki colored code (objc_msgSend+40 ff.) need to be executed to find the cache entry then.

The minimum overhead of objc_msgSend is therefore at least 30 instructions per call, sometimes a little more.
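
The cached lookup in the listing (mask the selector address, index into a bucket array, compare, step to the next bucket) can be written down in C roughly as follows. This is my reading of the disassembly, not the runtime source; the struct layouts, CACHE_SLOTS and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*imp_t)(void);

struct cache_entry {
    const char *sel;   /* uniqued selector address */
    imp_t       imp;   /* method implementation to jump to */
};

#define CACHE_SLOTS 8  /* must be a power of two */

struct method_cache {
    uintptr_t           mask;        /* CACHE_SLOTS - 1 */
    uintptr_t           occupied;
    struct cache_entry *buckets[CACHE_SLOTS];
};

/* Fast path: hash by masking the selector address, then probe linearly.
   Terminates because cache_insert always leaves one bucket empty. */
imp_t cache_lookup(const struct method_cache *cache, const char *sel)
{
    uintptr_t index = (uintptr_t) sel & cache->mask;

    for (;;) {
        const struct cache_entry *entry = cache->buckets[index];

        if (entry == NULL)
            return NULL;             /* miss: the slow lookup has to run */
        if (entry->sel == sel)
            return entry->imp;       /* hit: pointer compare only */
        index = (index + 1) & cache->mask;
    }
}

/* What the slow path does at its end: remember the result for next time. */
int cache_insert(struct method_cache *cache, struct cache_entry *entry)
{
    uintptr_t index;

    if (cache->occupied >= cache->mask)
        return 0;                    /* keep at least one bucket empty */
    index = (uintptr_t) entry->sel & cache->mask;
    while (cache->buckets[index] != NULL)
        index = (index + 1) & cache->mask;
    cache->buckets[index] = entry;
    cache->occupied++;
    return 1;
}

/* Placeholder implementations so the cache has something to point at. */
static void demo_operate(void)    { }
static void demo_annihilate(void) { }
```

A hit costs a mask, a load and a pointer compare per probed bucket, which is why the cached path is so much cheaper than walking the class hierarchy.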


Calling Obj-C methods directly, avoiding objc_msgSend

Objective-C gives you the possibility to resolve a method's address for an object or a class at runtime and to call that method using this address directly. Since this address depends on the class of the object (or the class itself), you have to make sure that you do not erroneously use this method address for another object of a different class. For example, it would be dangerous to assume that every NSArray subclass uses the same objectAtIndex: method; most probably this will not be the case! Therefore caching objectAtIndex: by asking the NSArray class directly is wrong. You need to ask the specific object, so that its actual class returns the proper address, and you should only use that address on this particular object and on objects you know are of the identical class.

You determine a method's address with the methodForSelector: method and get a so-called IMP returned.

IMP  imp = [anObject methodForSelector:@selector( fooMethod:)];

An IMP is a type defined by the Objective-C runtime headers. It is a pointer to a function that returns an id and takes a variable number of arguments, of which the first two are the object and the selector (just like objc_msgSend). Here is its definition, copied from /System/Library/Frameworks/System.framework/Headers/objc/objc.h

/*      objc.h
 *      Copyright 1988-1996, NeXT Software, Inc.
 */

#ifndef _OBJC_OBJC_H_
#define _OBJC_OBJC_H_

#import <objc/objc-api.h>               // for OBJC_EXPORT

typedef struct objc_class *Class;

typedef struct objc_object {
        Class isa;
} *id;

typedef struct objc_selector    *SEL;    

typedef id                      (*IMP)(id, SEL, ...); 

Here is a little function that uses an IMP and its object to call the method with one parameter. Note that for the many methods that, like our fooMethod:, do not return an int or an id, it is necessary(3) that the IMP is cast to the correct type.


double   call_test3b( CallTest3 *p, double (*f)( id, SEL, ...), float x)
{
   return( (*f)( p, @selector( fooMethod:), x));
}

And you would call call_test3b like this; notice (again) the cast:

call_test3b( p,
 (double (*)( id, SEL, ...)) /* cast */
 [p methodForSelector:@selector( fooMethod:)],
 2.1);

This is our now well-known fooMethod: (green).

0x2d24 <-[CallTest4 fooMethod:]>:      addis  r9,r12,0
0x2d28 <-[CallTest4 fooMethod:]+4>:    addi   r9,r9,732
0x2d2c <-[CallTest4 fooMethod:]+8>:    lfd    f0,0(r9)
0x2d30 <-[CallTest4 fooMethod:]+12>:   fmul   f1,f1,f0
0x2d34 <-[CallTest4 fooMethod:]+16>:   blr

As the disassembly of call_test3b shows, three more instructions are introduced by using a function pointer call (blue). What you do not see disassembled is the stub code, because it and the objc_msgSend code are now circumvented.

0x2cd4 <call_test3b>:      mflr    r0
0x2cd8 <call_test3b+4>:    stw     r31,-4(r1)
0x2cdc <call_test3b+8>:    stw     r0,8(r1)
0x2ce0 <call_test3b+12>:   stwu    r1,-80(r1)

0x2ce4 <call_test3b+16>:   bcl     20,4*cr7+so,0x2ce8
0x2ce8 <call_test3b+20>:   mflr    r31
0x2cec <call_test3b+24>:   mr      r0,r4

0x2cf0 <call_test3b+28>:   addis   r4,r31,0
0x2cf4 <call_test3b+32>:   lwz     r4,4888(r4)
0x2cf8 <call_test3b+36>:   stfd    f1,56(r1)
0x2cfc <call_test3b+40>:   lwz     r5,56(r1)
0x2d00 <call_test3b+44>:   lwz     r6,60(r1)
0x2d04 <call_test3b+48>:   mtlr    r0

0x2d08 <call_test3b+52>:   mr      r12,r0
0x2d0c <call_test3b+56>:   blrl

0x2d10 <call_test3b+60>:   addi    r1,r1,80
0x2d14 <call_test3b+64>:   lwz     r0,8(r1)
0x2d18 <call_test3b+68>:   mtlr    r0
0x2d1c <call_test3b+72>:   lwz     r31,-4(r1)
0x2d20 <call_test3b+76>:   blr

Using the method address can only pay off from the second call of the method onwards, because getting the address itself costs one objc_msgSend.

What you gain: Speed.
What you lose: Possibly generality. Hidden pitfalls (see next).


Caching shared lib C functions
Interestingly enough, you can also cache shared library functions using C function pointers and call them directly, circumventing the stub code. An example to speed up a read loop (marginally):

   ssize_t  (*f_read)( int fd, void *buf, size_t len);
   f_read = read;
   while( (*f_read)( fd, buf, 1) == 1)
    ...
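
The read loop from "Caching shared lib C functions" can be written out as a complete function. This is just a sketch: count_bytes is a name I made up, and whether the compiler really bypasses the stub this way depends on your toolchain.

```c
#include <sys/types.h>
#include <unistd.h>

/* Counts the bytes that can be read from fd, one byte at a time, calling
   read through a function pointer that is fetched once before the loop. */
long count_bytes(int fd)
{
    ssize_t (*f_read)(int, void *, size_t) = read;
    char    c;
    long    n = 0;

    while ((*f_read)(fd, &c, 1) == 1)
        n++;
    return n;
}
```

Reading one byte at a time is slow no matter how read gets called; it is used here only because it makes the call overhead the dominant cost.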


Hidden pitfalls when using IMPs

With methodForSelector: you are getting the address of a method implementation for that class. Which particular method implementation is used (remember method overriding) is determined at runtime. The method implementation to be used for a class can possibly change during runtime.

Whenever a class is posed away with (+poseAsClass:), the implementation of a method is likely to change. Whenever a category for this class is added, for example by loading an NSBundle, that implementation could possibly change. It is therefore in theory only safe to use the direct call, when you can be certain that the class system remains fixed for the duration of the use of this IMP. For the fainthearted that would mean "never".

But let's check with reality. Whenever you are using an NSNotificationCenter, for example, you are potentially running into the same pitfall, since NSNotificationCenter happens to store the implementation address and not the selector. So if Foundation can do it, so can you. Rationalizing this a bit further, poser classes and categories in NSBundles should not be surprised if their code doesn't work as expected when they are loaded into a nicely running system (via an NSBundle, for example). So we should pass the "Schwarzer Peter" (the blame) to them, not to methodForSelector:

An often implemented compromise is to cache the method implementation during the lifetime of the caller, as in the following example.

_array is an instance variable of an NSArray subclass. It contains objects that implement the methods operate and annihilate.

operateAnnihilate's task is to loop through all objects of the array and call both methods in succession. The code hoists the lookup of NSArray's objectAtIndex: method out of the loop by resolving the implementation address once beforehand.

If it were ensured that only objects of one and the same class are ever stored in _array, it would be possible to optimize this method further by manually resolving not only the objectAtIndex: method address, but also the operate and annihilate method addresses.

By NOT doing this we keep the implementation more general and versatile.

- (void) operateAnnihilate
{
  SEL  sel;
  IMP  f;
  int  n;
  int  i;
  id   p;

  sel = @selector( objectAtIndex:);
  f = [_array methodForSelector:sel];
  n = [_array count];
  for( i = 0; i < n; i++)
  {
    p = (f)( _array, sel, i);
    [p operate];
    [p annihilate];
  }
}
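
The effect of resolving the address once can be shown with a plain C analogy that counts the lookups. Everything here is invented for the illustration: resolve_element_fn stands in for methodForSelector: (or for the lookup that objc_msgSend does on every call), and element_at stands in for objectAtIndex:.

```c
#include <stddef.h>

typedef int (*element_fn)(const int *values, size_t index);

static long lookup_count;   /* how often the "expensive" lookup ran */

static int element_at(const int *values, size_t index)
{
    return values[index];
}

/* Hypothetical stand-in for the method lookup. */
element_fn resolve_element_fn(void)
{
    lookup_count++;
    return element_at;
}

/* One lookup per element, like sending a message inside the loop. */
long sum_resolving_every_time(const int *values, size_t n)
{
    long   sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += resolve_element_fn()(values, i);
    return sum;
}

/* One lookup in total, like caching the IMP before the loop. */
long sum_with_cached_pointer(const int *values, size_t n)
{
    element_fn f = resolve_element_fn();
    long       sum = 0;
    size_t     i;

    for (i = 0; i < n; i++)
        sum += (*f)(values, i);
    return sum;
}
```

Both functions return the same sum; the difference is that the first performs n lookups and the second exactly one.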


Using inline functions instead of method calls

An inline function can only access instance variables declared private or protected if the function is defined within the @implementation of the class. Unfortunately you cannot define functions inside the @interface clause. So this does not work:

@interface Foo : NSObject
{
  int someVar;
}
// it ain't compiling folks
static inline int  someComputedValue( id obj)
{
   return( ((Foo *) obj)->someVar * STRANGE_CONSTANT);
}
@end

But using @defs (thanks to ZNeK for the suggestion) it can be written like this:

@interface Foo : NSObject
{
   int   someVar;
}
@end


static inline int   someComputedValue( id obj)
{
   return( ((struct { @defs( Foo) } *) obj)->someVar * STRANGE_CONSTANT);
}

What you gain: speed
What you lose: run time binding of methods.

The lossage is obvious: since we are not calling a method, we are just calling/inlining a C function.
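
In plain C terms the @defs trick boils down to casting the object pointer to a struct that has the same layout as the object's instance variables. A sketch under assumptions: struct foo_ivars is my hand-written guess at what struct { @defs( Foo) } amounts to for the Foo class above (the inherited isa pointer followed by someVar), and the value of STRANGE_CONSTANT is a placeholder.

```c
#include <stddef.h>

#define STRANGE_CONSTANT 3   /* placeholder; the article never defines it */

/* Assumed layout of a Foo instance: inherited isa pointer, then someVar. */
struct foo_ivars {
    void *isa;
    int   someVar;
};

/* Reads the instance variable directly, without any method call. */
static inline int some_computed_value(void *obj)
{
    return ((struct foo_ivars *) obj)->someVar * STRANGE_CONSTANT;
}
```

The function trusts the caller completely: if obj is not really a Foo (or a subclass instance), it reads whatever happens to be stored at that offset.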


Wrap Up (The Executive Version)

<Insert the usual optimization disclaimer here, that I am just too tired of writing AND reading>

Obj-C method calls aren't slow, but they are slower than plain C calls. Outside of inline code, the fastest invocations are static C functions (or any function that doesn't cross shared library boundaries) and, equally fast if not even a smidgen faster, calls through method implementation addresses (IMPs).

In times of desperation when you feel that you must abandon objects in favor of performance, think again. Using a mixture of C-technology for performance and Objective-C methods for convenience oughta sound more attractive than dropping Objective-C altogether.

Optimize shared library C calls by using function pointers.

Optimize Objective-C message calls by caching the instance or class method address.


If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.

(0) inline is a common compiler extension, though not properly part of the C language (yet).

(1) This address is, by the way, usually not as random as one might think. Most shared libraries, although in principle relocatable, have a certain default address they usually load to. The position of the application code is determined by the linker. If your binary doesn't change, it is always loaded to the same virtual memory address.

(2) Some of the code in the stub would have to be inlined into the caller if it weren't in the stub. So strictly speaking the overhead is not 9 instructions.

(3) Necessary as in necessary for correct object code, rather than necessary to avoid warnings or to please your coding style fascist colleagues :).