Nat! bio photo

Nat!

Senior Mull.

Twitter RSS

Github

mulle-objc: investigating the pros and cons of inlining mulle_objc_object_call

Continued from mulle-objc: divining the guts of message sending.

As was shown in a previous article the optimal case for the method-cache lookup occurs, when the selector indexes directly to its matching cache entry. In a warmed up cache the statistical expectation of such a first stage cache hit is ~75%. [1]

To test out the inlining potential of this first stage, I wrote a stripped down version of mulle_objc_object_call: code.cpp.

This is the most interesting part from that file (slightly edited for readability):

static inline void  *call( void *obj,
                           uintptr_t methodid,
                           void *parameter)
{
   int                     index;
   void                    *(*f)( void *, uintptr_t, void *);
   struct cache            *cache;
   struct cacheentry       *entries;
   struct cacheentry       *entry;
   struct xclass           *cls;
   uintptr_t               offset;

   // self == nil check
   if( __builtin_expect( ! obj, 0))
      return( obj);

   // get class from object (isa)
   cls = ((xclass **) obj)[ -1];

   // get cache from class via cache-pivot
   entries = cls->entries;
   cache   = (cache *) &((char *) entries)[ -(int) offsetof( struct cache, entries)]);

   // get index into cache entries by masking the selector
   offset  = methodid & cache->mask;
   entry   = (cacheentry *) &((char *) entries)[ offset];

   // use default functionpointer if no cache hit (else path)
   if( __builtin_expect( (entry->uniqueid == methodid), 1))
      f = entry->functionpointer;
   else
      f = cls->functionpointer;

   return( (*f)( obj, methodid, parameter));
}

As you can see the first-stage "hit" case exits mulle_objc_object_call via entry->functionpointer. In the "miss" case method lookup and execution is deferred to a second stage called via cls->functionpointer. So the second stage of mulle_objc_object_call isn't inlined anymore.

Compiler produced assembler code

Here is the assembler output produced by a motley crew of compilers.

See mulle-inline-cache-lookup README.md how this was done.

All of the compiler used -O3, except where it is noted that they used -Os:

arm-gcc-5.4.png
arm64-gcc-5.4.png
clang-3.2.png
clang-3.9-Os.png
clang-3.9.png
gcc-7.0-Os.png
gcc-7.0.png
icc-17.0.png
mips64-gcc-5.4.png
powerpc-gcc-4.8.png

You can click on an image to get the text file.

Benchmarking

I then benchmarked the assembler codes of the x86_64 compilers and I also benchmarked some handwritten code. The project is available on mulle-inline-cache-lookup so you can reproduce my findings:

Source Size (bytes) Best Hit Run Best Miss Run Weighted Average
nat-call-4.s 46 1,03 1,03 1,0197
nat-call-3.s 51 1,03 1,16 1,0509
gcc-7.0 65 1,03 1,41 1,1109
clang-3.2 66 1,03 1,42 1,1133
clang-3.9-Os 63 1,16 1,16 1,1484
clang-3.9 66 1,16 1,16 1,1484
gcc-7.0-Os 65 1,16 1,16 1,1484
nat-call-1.s 66 1,16 1,16 1,1484
nat-call-2.s 57 1,16 1,16 1,1484
icc-17.0 64 1,16 1,54 1,2396

The weighted average is 0.75 * hit + 0.25 * miss. That fits the proposition, that the hit case is executed 75% of the time.

So the speed difference between human assembler and compiler assembler is 10%. The size improvement by human assembler is even better, it's 30%.

10%, that's about the speed improvement you'd get, if you spend an additional €240 on an iMac to get the faster processor. I'm just sayin...

Machine vs Human

It is actually quite disappointing, that the compiler can't come close here, since mulle-objc is C-only, and does not use assembler code. The piece of C-code to optimize is in my opinion very simple, what could possibly go wrong ?

Let's have a look at the best runner-up "gcc-7.0" to see where the problems are:

        testq   %rdi, %rdi
        je      .L2

        movq    -8(%rdi), %rsi
        movq    (%rsi), %rax
        movq    -8(%rax), %rcx
        andl    $407377992, %ecx    # 32 bit constant loaded
        addq    %rcx, %rax
        cmpq    $407377992, (%rax)  # same 32 bit constant used again
        jne     .L3                 # little jump
        movq    8(%rax), %rax
.L4:
        movl    $407377992, %esi    # and same 32 bit constant used yet again
        jmp     *%rax
.L2:
        xorl    %eax, %eax
        ret
.L3:
        movq    8(%rsi), %rax       # isn't that very similiar to 8(%rax) ?
        jmp     .L4                 # another little jump

Background Information

The AMD Call convention - used by Unices and OS X - lists the available registers for function call parameters. I've annotated them as they are used by mulle-objc:

Available registers

  1. %rdi # self
  2. %rsi # _cmd
  3. %rdx # _param
  4. %rcx # free
  5. %r8 # free
  6. %r9 # free

Why doesn't the compiler store the constant $407377992 (which is the selector 0x18481848) in an available register like $r9 for example ? That should save around 8 bytes of instruction space and it isn't slower.

The second problem is, that I carefully designed the offsets so that the function to call is either in 8(%rsi) (the class) or in 8(%rax) (the cache entry). This opportunity is unfortunately also not exploited.

Human vs Machine

So here is nat-call-4.s it hasn't proven to be slower, quite the opposite. It folds the three immediate constants into an available register $r9 and replaces the conditional jump instructions with a cmove.

0000  testq %rdi, %rdi
0003  je .L2
0009  movq  -0x8(%rdi), %r9
000d  movq  (%r9), %rax
0010  movq  -0x8(%rax), %rcx
0014  movl  $0x18481848, %esi
0019  andl  %esi, %ecx
001b  addq  %rcx, %rax
001e  cmpq  %rsi, (%rax)
0021  cmoveq %rax, %r9
0025  movq  0x8(%r9), %rax
0029  jmpq  *%rax
.L2:
002b  xorl  %eax, %eax
002d  retq

Calculating the overhead for mulle_objc_object_call

These 46 bytes worth of instructions would actually reduce even further in "real" life. Why ? In any method until reassigned, self is known to be non nil [2]. Also everytime you write code like this

   while( obj = [rover nextObject])
   {
      [obj call];
   }

It is known by the compiler that at [obj call] obj can not be nil. This means that in those cases the inlining function can ignore the nil test case. Furthermore the constant assignment in this code is really part of the user method call and not overhead, compared to a conventional method call.

So removing lines 0000,0003,002b,002d for the nil check and 0014 for the constant, the code gets reduced to:

0000  movq  -0x8(%rdi), %r9
0004  movq  (%r9), %rax
0007  movq  -0x8(%rax), %rcx
000b  andl  %esi, %ecx
000d  addq  %rcx, %rax
0010  cmpq  %rsi, (%rax)
0013  cmoveq   %rax, %r9
0017  movq  0x8(%r9), %rax
001b  jmpq  *%rax

Which is 29 bytes. That could be, in a perfect compiler world, the best-case overhead for inlining the first stage of mulle_objc_object_call on x86_64.


[1] This is a function related to the relative emptiness of the cache, which is 25% empty entries minimum and the bit diffusion in the selectors.

[2] Why ? Because the method is never called in the nil case.


Continue to mulle-objc: fast methods make mulle_objc_object_call even faster.


Post a comment

All comments are held for moderation; basic HTML formatting accepted.

Name:
E-mail: (not published)
Website: