Mulle kybernetiK - Tech Info: v0.1
Obj-C Optimization: Atomic Operations
This sixth installment of the ongoing series "How to optimize in Objective-C" ties up some loose ends from chapter four. It shows how to make the retain/release code of PrivateString thread safe with the help of assembler code. Thread safe, but still fast.
(c) 2002 Mulle kybernetiK - text by Nat!
Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.

An oversight: NSString is mutable!

An oversight in Optimizing Foundation Classes was that NSString is not really a completely immutable class. It is only the string contents that can't be mutated. In fact you can mutate an NSString's retain count very easily - and obviously - with retain/release. Since an NSString may very well be shared between threads, the retain and release counting must be thread safe. Correcting this problem with NSLock is certainly possible, but it would also rob much or all of the performance gain that was had by inline reference counting: NSLock would add two method calls of overhead to each retain and each release call.
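
For comparison, a thread safe retain/release built on NSLock could look roughly like the sketch below (the refCountLock_ ivar is invented for this illustration, refCountMinusOne_ is the inline retain count from chapter four); the -lock and -unlock calls are the two extra message sends per operation:


- (id) retain
{
   [refCountLock_ lock];          // first extra message send
   refCountMinusOne_++;
   [refCountLock_ unlock];        // second extra message send
   return( self);
}


- (void) release
{
   BOOL   shouldDealloc;

   [refCountLock_ lock];
   shouldDealloc = (--refCountMinusOne_ == -1);
   [refCountLock_ unlock];
   if( shouldDealloc)
      [self dealloc];
}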

If you aren't quite certain why incrementing the inline retain count is inherently thread unsafe on your PowerPC computer, then read this refresher on concurrent programming.

Atomic Operations when there isn't much to do

Since the retain and release code just increments or decrements a counter, there is luckily an alternative. I can leave much of the theory aside, because there is an in-depth and whimsical treatment of the subject in the MacHack 99 paper by Jonathan 'Wolf' Rentzsch. The executive version of that paper - given here - is that there is some assembler code you can use that guarantees that a read, modify and write operation sequence is executed without any interference by a different process(or) or thread. In the case of the PPC this means that the code loops as often as necessary until the read/modify/write operation has succeeded (1). The code for an atomic increment is:


loop:	
   lwarx  r3,0,r4    ;; fetch memory pointed to by r4 into r3
   addi   r3,r3,1    ;; increment r3
   stwcx. r3,0,r4    ;; try to store r3 back
   bne-   loop       ;; someone modified memory? Try again

The immediate and visible advantage is that there are only four instructions to be executed (best case), whereas locking and unlocking an NSLock typically involves a few dozen instructions. These few instructions are perfect candidates for a little inline C function.

The main limitation of atomic operations is that you can only do one read/modify/write cycle. If you need to access more than just one word of data safely, you must fall back on locking, as the sketch below illustrates.
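
Here is a minimal sketch (the struct, the lock and the function are invented for this example) of a multi-word update that a single atomic operation cannot cover and that therefore still needs a lock:


typedef struct
{
   int   width;     // two words that must always change together
   int   height;
} Size2D;

static Size2D   sharedSize;
static NSLock   *sizeLock;       // assumed to be created at startup

static void   SetSharedSize( int width, int height)
{
   // an atomic operation covers only one word, so both fields
   // have to be updated under a lock to stay consistent
   [sizeLock lock];
   sharedSize.width  = width;
   sharedSize.height = height;
   [sizeLock unlock];
}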

    lwarx/stwcx
These two PPC instructions are the ticket to atomicity. If you are familiar with 68K assembler you might know the TAS instruction, and if you know x86 assembler you might be familiar with the XCHG instruction. Both of these instructions perform a read/modify/write on a memory cell in one uninterruptible bus operation.

The way the PowerPC does it is different. The PowerPC reads from a memory location with lwarx, which tells the CPU's memory system to watch the bus for any modification of that location - by the CPU itself, by another CPU or by another device.

If the location has been modified in the meantime, the reservation is lost and a subsequent stwcx. will fail, clearing the EQ bit in the condition register. The bne- instruction, which tests this bit, then branches backwards. What the atomicity code will do, therefore, is keep refetching, modifying and writing until it knows that the location has not been modified in between.


Atomic Increment and Decrement

Luckily, retain and release only need to increment and decrement a variable safely. An increment or decrement is exactly such a read/modify/write operation and therefore a perfect match for an atomic operation.


- (id) retain
{
   MulleAtomicIncrement( &refCountMinusOne_);
   return( self);
}


- (void) release
{
   if( MulleAtomicDecrement( &refCountMinusOne_) == -1)
      [self dealloc];
}
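
For completeness, retainCount then only has to compensate for the off-by-one storage of the counter; a minimal sketch (this method is an assumption and not part of the code above):


- (unsigned int) retainCount
{
   // refCountMinusOne_ stores the retain count minus one, so a
   // freshly allocated object with refCountMinusOne_ == 0 reports 1
   return( (unsigned int) refCountMinusOne_ + 1);
}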

What's left to do is to give the code for the two atomic increment and decrement functions and to run a test to see how it influences performance.


//
// This decrements atomically. It will return the value
// after decrement
//
static inline int   MulleAtomicDecrement( int *p)
{
   int    tmp;

   // code lifted from linux
   asm volatile(
                "1:     lwarx   %0,0,%1\n"
                "	addic   %0,%0,-1\n"  // addic allows r0, addi doesn't 
                "	stwcx.  %0,0,%1\n"
                "	bne-    1b"
                : "=&r" (tmp)
                : "r" (p)
                : "cc", "memory");

   return( tmp);
}


//
// This increments atomically. It will return the value
// after increment
//
static inline int   MulleAtomicIncrement( int *p)
{
   int    tmp;

   // code lifted from linux
   asm volatile(
                "1:     lwarx   %0,0,%1\n"
                "	addic   %0,%0,1\n"
                "	stwcx.  %0,0,%1\n"
                "	bne-    1b"
                : "=&r" (tmp)
                : "r" (p)
                : "cc", "memory");

   return( tmp);
}

Here are the results of some test code that calls retain and release in a loop. The following was measured:

Thread safe with NSLock (using IMPs):   12.3 s
Thread safe with Atomic operations:      2.4 s
Not thread safe (original code):         2.1 s

Compared with the NSLock code, the savings from using atomic operations are huge.
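
The timing loop was roughly of the following shape (a sketch only; the iteration count and the initializer are assumptions, the real code is in the download below):


   PrivateString   *s;
   unsigned int    i;

   s = [[PrivateString alloc] init];    // initializer assumed for this sketch
   for( i = 0; i < 10000000; i++)
   {
      [s retain];
      [s release];
   }
   [s release];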

You can download my test code. Since this is a ProjectBuilderWO project, you will have to import it into the new Project Builder if you don't have ProjectBuilderWO installed. You can install ProjectBuilderWO from the developer CD as a custom install option.


Portability

If you use atomic operations, your code becomes platform specific. If you want to keep the performance advantage on another platform, for instance Intel x86, you need to define the atomic operations in i386 assembler.
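
On i386, for example, the lock prefix together with the xadd instruction gives the same read/modify/write guarantee. Here is a sketch of an equivalent increment, modelled on the linux kernel's atomic code (this function is an assumption and not part of the article's test project):


//
// i386 counterpart to MulleAtomicIncrement. It will return the
// value after increment.
//
static inline int   MulleAtomicIncrement_x86( int *p)
{
   int   tmp = 1;

   // lock/xadd atomically adds tmp to *p and leaves the old
   // value of *p in tmp
   asm volatile( "lock; xaddl %0, %1"
                 : "+r" (tmp), "+m" (*p)
                 :
                 : "memory", "cc");

   return( tmp + 1);
}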

As an alternative you can always fall back on NSLock or pthreads to supply thread safety at some additional cost. Using the preprocessor, you can make your code portable, albeit not particularly beautiful, like this:


@interface AtomicIncrementCounter : NSObject
{
#if ! defined( __ppc__)
   NSLock   *syncLock;
#endif
}

@end


@implementation AtomicIncrementCounter

static int  someCounter;

- (id) init
{
   self = [super init];
#if ! defined( __ppc__)
   syncLock = [[NSLock alloc] init];
#endif
   return( self);
}


- (void) dealloc
{
#if ! defined( __ppc__)
   [syncLock release];
#endif
   [super dealloc];
}


- (void) incrementSharedCounter
{
#if ! defined( __ppc__)
   [syncLock lock];

   someCounter++;

   [syncLock unlock];
#else
   MulleAtomicIncrement( &someCounter);	  // our inline increment function
#endif
}

@end


Apple's View

There is this notorious but fairly old Technote 1137, where Apple cautions against using lwarx/stwcx. Why is that ?

  • Well, it turns out that a good number of PowerPC processors have bugs, especially in this tricky area where possibly multiple processors must be synchronized. [Fact]
  • Over the years a lot of bug fixes and voodoo were inserted to make Apple's own lwarx/stwcx. code compatible with the broadest range of processors. [Stipulation]
  • For the PowerPC models supported by Mac OS X none of those historical restrictions apply. This is a good thing, because a lot of those older fixes/voodoo sucked up a lot of performance. [Fact]
  • There are still some recommendations from Motorola on how to implement lwarx/stwcx., which aren't obvious. [Fact]
  • Apple engineers now "allow" you to implement atomic operations yourself but caution you against implementing your own mutexes/semaphores with lwarx/stwcx., for compatibility reasons. [Fact]
  • Apple's later machines no longer support load with reservation when the memory being addressed is on the other side of a PCI bridge. [Rumor]
The gist of all this is:
  • Use of atomic operations in regular applications is fine and can prove very beneficial to your performance.
  • If you write drivers, be very careful!
  • You should not write your own semaphores.


Summary

The remaining hole of Optimizing Foundation Classes has been plugged. Using atomic operations it is possible to implement thread safe counters very efficiently on the PowerPC. Similar facilities exist on other CPU architectures. A good place to look for such code is the linux kernel source. The atomicity code for i386, for example, can be found in linux/include/asm-i386/atomic.h.


If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.
(1) lwarx/stwcx. is also commonly known as "load and store with reservation".