Mulle kybernetiK - Tech Info: v0.5
Obj-C Optimization: Optimizing Foundation Classes
This fourth installment of the ongoing series "How to optimize in Objective-C" now jumps from theory to practice. This article chronicles the optimization of some Obj-C Foundation based code. Let's see how far one can go...
(c) 2001 Mulle kybernetiK - text by Nat!
Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.

Optimizing Foundation Classes

Enough of the theory from the previous articles, let's get down to some practical optimizations. This article covers only Obj-C/Foundation specific optimizations this time. Generic optimizations like "improving the algorithm" or loop unrolling will not be mentioned. The previous articles are prerequisite knowledge!


Into the Code

First let's write a little test program and time it.

At this point you should download the Example. Compile it and run it. Since it is a project made with ProjectBuilderWO, you will have to import it into the new Project Builder if you don't have ProjectBuilderWO installed. Run each test multiple times to let the system "warm up"; the lowest timing returned counts. Turn off iTunes and other apps with some - maybe just periodic - activity (like Mail.app polling for email).

The program reads a UNIX text file, breaks the contents up into an array of lines and then reverses the order of the lines. To do that the code uses a reverse enumerator to access the lines in the array from back to front and stores them in another array, which is returned. This is the part that will be optimized.


   result = [NSMutableArray arrayWithCapacity:[lines count]];
   rover  = [lines reverseObjectEnumerator];
   
   while( s = [rover nextObject])
      [result addObject:s];
   
   return( result);

Running this over 4233000 lines on a Cube 450 MHz took 9.5 seconds. The task: reduce these 9.5 seconds as much as possible.
[ExampleApp: Unoptimized]


A First Blind Stab

Taking the information from the previous article, one will consider replacing the Obj-C dynamic dispatching [] with direct calls to the method implementation. To refresh your memory: you can determine at run time the address of the method implementation that will be called for an object and use this address directly to jump to that subroutine. It's done like this:

   IMP   address;

   address = [someObject methodForSelector:@selector( someMethod:someOtherArgument:)];
   address( someObject, @selector( someMethod:someOtherArgument:), 
            argument1, argument2);

If you do this for one call only, it will obviously be even SLOWER than calling [someObject someMethod:argument1 someOtherArgument:argument2], but the payoff comes if you call this method repeatedly, e.g. in a loop.

So in the code above there are two method calls in the loop, which will be replaced with IMP function calls.

The code changes like this:


   IMP   impNextObject;
   IMP   impAddObject;

   result = [NSMutableArray arrayWithCapacity:[lines count]];
   rover  = [lines reverseObjectEnumerator];

   impAddObject  = [result methodForSelector:@selector( addObject:)];
   impNextObject = [rover methodForSelector:@selector( nextObject)];

   while( s = impNextObject( rover, @selector( nextObject)))
      impAddObject( result, @selector( addObject:), s);
   
   return( result);
The result is 8.5 seconds. That's an improvement of more than 10%, which is nice but not really crushing.
[ExampleApp: IMP]

In terms of Objective-C there is nothing left to be done at this stage. Now one has to consider whether there isn't something one can do to make Foundation move faster.


How Good is 8.5s Really?

All the code really does is read from one array and insert into another. The corresponding C code would be:

   for( i = n, j = 0; --i >= 0; j++)
      result[ j] = lines[ i];
Coding up a test case for this, the result is fairly exciting. The test runs in a fraction of a second!
[ExampleApp: C]

There are some quality differences between what Foundation does and what the simple C version does. These are the enhancements of the Foundation version that you get for free and that the C code does not provide:

  • NSObject inheritance (Foundation compatibility)
  • reference counting (retain/release)
  • automatic array growth
  • boundary checking
  • thread safety? No, check out this footnote (1).
For this article the assumption will be that this list is in order of importance. If you don't need NSObject compatibility but you do need prime performance, by all means stay with the C code. The reference counting goes hand in hand with NSObject, so one can't do without it. Removing the array growth would violate the NSMutableArray specification, so we should keep that. The boundary checking one might live without. And thread safety is not needed in this case, nor was it supplied by the original implementation.


Analysis with Sampler.app

Running the "8.5s" code under the eyes of Sampler.app produces a result somewhat like this.

Of the 1481 samples spent in imped - the optimized reordering routine - over 90% are spent in NSMutableArray's addObject:! Surprisingly, only 4% are spent in the enumerator's nextObject method. Obviously NSMutableArray's addObject: method will be the focus of the next optimization step.


Homebrewed NSMutableArray

The general belief is that the Foundation classes are heavily optimized and that there is only performance to be lost in writing your own classes. Let's try nevertheless to write a custom NSMutableArray subclass and run some tests with it. To avoid clutter, only an incomplete implementation is done that provides sufficient functionality for the test case.

@interface PrivateMutableArray : NSMutableArray
{
   id             *pointers_;
   unsigned int   size_;
   unsigned int   nPointers_;
}
@end


@implementation PrivateMutableArray


- (id) initWithCapacity:(unsigned int) n
{
   self = [super init];

   pointers_  = malloc( sizeof( id) * n);
   nPointers_ = 0;
   size_      = n;
   return( self);
}


- (void) dealloc
{
   unsigned int   i;
   
   for( i = 0; i < nPointers_; i++)
      [pointers_[ i] release];

   free( pointers_);
   [super dealloc];
}


- (void) grow
{
    unsigned int   newsize;
    id             *newPointers;

    /* double the capacity, but grow by at least 4 entries */
    if( (newsize = size_ + size_) < size_ + 4)
       newsize = size_ + 4;
    if( ! (newPointers = realloc( pointers_, newsize * sizeof( id))))
        [NSException raise:NSMallocException
                    format:@"%@ can't grow from %u to %u entries", self, size_, newsize];
    pointers_ = newPointers;
    size_     = newsize;
}


- (void) addObject:(id) p
{
   if( ! p)
      [NSException raise:NSInvalidArgumentException
                  format:@"null pointer"];
   if( nPointers_ >= size_)
       [self grow];
   self->pointers_[ self->nPointers_++] = [p retain];
}


- (unsigned int) count
{
   return( nPointers_);
}


- (id) objectAtIndex:(unsigned int) i
{
   if( i >= nPointers_)
      [NSException raise:NSInvalidArgumentException
                  format:@"index %d out of bounds %d", i, nPointers_];
   return( pointers_[ i]);
}

@end
The reader will notice that NSObject and retain/release compatibility has been kept. Boundary checking and the dynamic growth of the array are also implemented.

There are a few methods lacking for a proper NSMutableArray subclass, namely insertObject:atIndex:, removeLastObject, removeObjectAtIndex: and replaceObjectAtIndex:withObject:. They have not been implemented for simplicity; here we are optimizing strictly for addObject: performance. It would be interesting and tricky to optimize for both insertObject:atIndex: and addObject: performance, maybe a topic for another article. Straightforward implementations of the left-out methods, as sketched below, would result in a fully functional NSMutableArray subclass that is much faster in addObject:, replaceObjectAtIndex:withObject: and removeLastObject performance, and probably much slower in executing removeObjectAtIndex: or insertObject:atIndex:.
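As an aside, here is a rough sketch of what straightforward implementations of some of the left-out methods could look like, were they added to the PrivateMutableArray shown above. This sketch is not part of the example project; the category name is arbitrary and it assumes the grow method is visible to the category (e.g. declared in the header).

#import "PrivateMutableArray.h"
#include <string.h>


@implementation PrivateMutableArray( Sketch)

- (void) removeLastObject
{
   if( ! nPointers_)
      [NSException raise:NSRangeException
                  format:@"array is empty"];
   [pointers_[ --nPointers_] release];
}


- (void) replaceObjectAtIndex:(unsigned int) i
                   withObject:(id) p
{
   if( i >= nPointers_)
      [NSException raise:NSRangeException
                  format:@"index %d out of bounds %d", i, nPointers_];
   [p retain];
   [pointers_[ i] release];
   pointers_[ i] = p;
}


- (void) insertObject:(id) p
              atIndex:(unsigned int) i
{
   if( ! p)
      [NSException raise:NSInvalidArgumentException
                  format:@"null pointer"];
   if( i > nPointers_)
      [NSException raise:NSRangeException
                  format:@"index %d out of bounds %d", i, nPointers_];
   if( nPointers_ >= size_)
      [self grow];
   /* shift the tail up by one slot, then drop the new object in */
   memmove( &pointers_[ i + 1], &pointers_[ i],
            (nPointers_ - i) * sizeof( id));
   pointers_[ i] = [p retain];
   nPointers_++;
}

@end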

Putting this through its paces, it finishes in 3s. Not a bad improvement over 9 seconds, and it wasn't a very tricky subclass that was used, but rather a straightforward implementation. No reason to rest on laurels, since it is still about 100 times slower than the C code.(3)
[ExampleApp: PrivateMutableArray]


Second Analysis with Sampler.app

The time spent in addObject: has been reduced to 62%. The enumeration still takes up less than 10% of the loop. The rest of the time is spent in the creation of the enumerator (16%, surprisingly large), the creation of the array and the code of the actual function.

addObject: in turn spends 60% of its time in retain and 20% dispatching it through objc_msgSend.


The Blockade

One cannot optimize the objc_msgSend away, because there is no way to know what kind of object is to be retained. The actual retain implementation can vary from class to class, so one must not assume that the object's retain method is NSObject's implementation and cache that. An idea would be to use CFRetain instead of retain, which would improve performance for bridged Foundation types but lose performance for non-bridged types. In this case this is a valid trade-off, because the code is dealing exclusively with strings, which are bridged. This enhancement yields another 0.5 seconds. (No test case for that.)
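For the curious, a minimal sketch of what such a CFRetain based addObject: could look like in PrivateMutableArray; again, this variant is not in the example project, and the cast to CFTypeRef merely keeps the compiler quiet.

#import <CoreFoundation/CoreFoundation.h>


- (void) addObject:(id) p
{
   if( ! p)
      [NSException raise:NSInvalidArgumentException
                  format:@"null pointer"];
   if( nPointers_ >= size_)
      [self grow];
   /* fast for bridged classes like NSString, falls back to an
      Obj-C retain call for everything else */
   CFRetain( (CFTypeRef) p);
   pointers_[ nPointers_++] = p;
}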

One should not completely do away with the retain, because that would break the NSMutableArray specification (2).

There is no way for the normal developer to enhance the retain routines of all existing and future classes. The only power that exists that could do this is Apple. One might pursue the political route and ask Apple to optimize their library code. In a perfect world that could work...


Homebrewed NSString

In this special case all array items the code deals with are NSStrings. And PrivateMutableArray was already quite a timesaver, so why not write a custom NSString class that implements the needed bits and pieces and does non-Foundation retain/release management?
Internal reference counting will be used: an instance variable remembers the retain count. The code to do this is very simple.

This implementation is only safe in single-threaded applications. Another article explains how to do this in a fast and thread-safe way.


@interface PrivateInlineRefCountingString : PrivateString
{
   unsigned int   refCountMinusOne_;   /* zero after +alloc means a retain count of 1 */
}
@end


@implementation PrivateInlineRefCountingString

- (id) retain
{
   refCountMinusOne_++;
   return( self);
}


- (void) release
{
   if( ! refCountMinusOne_)
   {
      [self dealloc];
      return;
   }
   refCountMinusOne_--;
}


- (unsigned int) retainCount
{
   return( refCountMinusOne_ + 1);
}

@end
The problem with this approach is that one has to recode all the NSString functionality one needs, because care must be taken that one is only dealing with instances of PrivateInlineRefCountingString. Any method that returns an NSString is likely to return a Foundation NSString and not an instance of PrivateInlineRefCountingString. This recoding is done in the class PrivateString, not shown here but of course available in the example code.

Writing an NSString subclass properly means a huge amount of effort, because one loses a lot of the optimizations that the built-in Foundation subclasses used to provide. An NSString subclass only needs to implement length and characterAtIndex:. Therefore, if you for example do not implement your own substringWithRange:, there will be a lot of characterAtIndex: calls from the generic implementation and your NSString subclass will be very, very slow. An optimized variant should use the internal representation and generate the substring directly, as sketched below. Taking a look at the NSString specification shows that there are a lot of routines to implement...
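To make this concrete, here is a minimal sketch (this is not the actual PrivateString code) of a hypothetical subclass with an assumed unichar buffer as its internal representation. It implements the two required primitives and overrides substringWithRange: to read the buffer directly instead of going through characterAtIndex: for every character.

@interface SketchString : NSString
{
   unichar        *buffer_;   /* assumed internal representation */
   unsigned int   length_;
}
@end


@implementation SketchString

- (unsigned int) length
{
   return( length_);
}


- (unichar) characterAtIndex:(unsigned int) i
{
   if( i >= length_)
      [NSException raise:NSRangeException
                  format:@"index %d out of bounds %d", i, length_];
   return( buffer_[ i]);
}


/* without this override, the generic NSString implementation would
   call characterAtIndex: once per character of the substring */
- (NSString *) substringWithRange:(NSRange) range
{
   if( NSMaxRange( range) > length_)
      [NSException raise:NSRangeException
                  format:@"range out of bounds %d", length_];
   return( [[[NSString alloc] initWithCharacters:&buffer_[ range.location]
                                          length:range.length] autorelease]);
}

@end

A real PrivateString would additionally want the returned substring to be an instance of its own class again, which is exactly the recoding effort described above.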

Anyway, putting the subclass to work, the payoff of the custom NSString subclass is pretty good. With this optimization another second could be knocked off, for a total of less than 2 seconds. [ExampleApp: PrivateInlineRefCountingString]

Since PrivateInlineRefCountingString does its own reference counting, using CFRetain would actually hurt performance here: CFRetain would have to call its retain method with an Objective-C method call, so the call of CFRetain is unneeded overhead.


Final Effort

In the "final effort" the boundary checks have been sacrificed. Inline C functions are used to access the array elements and the enumerator has been replaced by a conventional for loop.

   n      = [lines count];
   result = [PrivateUnsafeMutableArray arrayWithCapacity:n];

   for( i = n; --i >= 0;)
   {
      s = PrivateUnsafeMutableArrayObjectAtIndex( lines, i);
      PrivateUnsafeMutableArrayAddObject( result, s);
   }
PrivateUnsafeMutableArray is a subclass of PrivateMutableArray that provides two inline C functions for instance variable access. Check out the previous article if you are unfamiliar with the ((struct { @defs( Class) } *) self) "pattern".

#import "PrivateMutableArray.h"


@interface PrivateUnsafeMutableArray : PrivateMutableArray
{
}

@end


#define PrivateUnsafeMutableArrayCast( self)  \
   ((struct { @defs( PrivateUnsafeMutableArray) } *) self)


static inline void  PrivateUnsafeMutableArrayAddObject(
   PrivateUnsafeMutableArray *self,
   id p)
{
   if( PrivateUnsafeMutableArrayCast( self)->nPointers_ 
       >= PrivateUnsafeMutableArrayCast( self)->size_)
      [self grow];
   PrivateUnsafeMutableArrayCast( self)->
      pointers_[ PrivateUnsafeMutableArrayCast( self)->nPointers_++] = 
         [p retain];
}


static inline id  PrivateUnsafeMutableArrayObjectAtIndex( 
   PrivateUnsafeMutableArray *self,
   unsigned int index)
{
   return( PrivateUnsafeMutableArrayCast( self)->pointers_[ index]);
}
With this the running time is halved again.
[ExampleApp: Final Effort]


Final Analysis with Sampler.app

The final sampler output shows that the majority of the time is spent in the Obj-C message dispatch, the one dynamic retain call one can't optimize away. The reference counting percentage has decreased to a mere 10%!

So in the end a tenfold speed increase was achieved by rolling custom Foundation subclasses, which is much more than I was expecting when I started writing this article.


Summary

There are a few conclusions to be drawn from this article. Caching method implementation addresses (IMPs) only pays off when the work done by the method is so small that the time used for the dynamic message dispatch becomes noticeable; in other words, caching IMPs is most valuable when the method to be called executes fast, like a simple accessor or computation method.

Writing custom subclasses of Foundation classes is a good way to gain plenty of speed. Obvious candidates are the "non-technical" Foundation classes like NSNumber, NSDictionary, NSString, NSArray etc., especially when you need only a certain part of the functionality provided by Foundation and can optimize for that. Subclassing NSTask for performance reasons is most certainly a waste of time. Be careful though that you aren't deoptimizing other functionality with your subclass: "hidden" Foundation subclasses may provide optimizations that your subclass loses when it falls back on the generic implementations.


If performance is a premium concern, code in C. It would be madness, for example, to have a Pixel class derived from NSObject to store a megapixel's worth of image information. Wrap your optimized C code in an Obj-C class for convenience; NSBitmapImageRep does this, for example. There is no NSPixel class.

If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.
(1) See ThreadSupport.html in the ReleaseNotes in your Developer Documentation.

(2) If one were to create an NSMutableNonRetainingArray that does not retain its objects, one should at least declare an interface where objects have to be retained before adding and will be released upon destruction of the array. Otherwise the risk of leaks would be very high.
This would only give a benefit if the objects to be added have been freshly created and not autoreleased.
Check out PrivateMutableNonRetainingArray in the Example code.
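As a rough sketch (ivar names borrowed from PrivateMutableArray above; the real implementation is in the example code), its addObject: could look like this:

- (void) addObject:(id) p
{
   if( ! p)
      [NSException raise:NSInvalidArgumentException
                  format:@"null pointer"];
   if( nPointers_ >= size_)
      [self grow];
   pointers_[ nPointers_++] = p;   /* no retain: the caller already owns p */
}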

(3) Another possible reason why addObject: performs comparatively poorly could be that Foundation ignores the capacity parameter in the arrayWithCapacity: call and then needs to reallocate its internal array often while the array is growing. According to the class specification it is free to do that.
There is no way to influence this behaviour though.