Mulle kybernetiK - Tech Info: v0.2
Obj-C Optimization: Optimized Object Generation
This fifth installment of the ongoing series "How to optimize in Objective-C" turns to the topic of allocating and deallocating objects. Allocation speed: the final frontier!
(c) 2002 Mulle kybernetiK - text by Nat!
Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.

Optimized Object Generation

One barrier you are likely to hit when optimizing your program is the allocation speed, or rather slowness, of Objective-C. As you know from the previous article Allocation Basics & Foundation, the unholy trinity alloc/init/autorelease is used to allocate and initialize an autoreleased object. Here are some collected ideas on how to make object allocations 300% faster than what Foundation can give you.

You can download my test cases. Since these are projects made with ProjectBuilderWO, you will have to import them into the new Project Builder(1) if you don't have ProjectBuilderWO installed. You can install ProjectBuilderWO from the developer CD as a custom install option.


Avoid autorelease

One win is to avoid autoreleasing the object whenever possible. For instance, in a situation where you are adding newly created objects to an NSMutableArray


	NSMutableArray  *array;
	NSString        *s;
	
	s = [NSString stringWithCString:"bla bla bla"];
	[array addObject:s];

it is more efficient to write


	NSMutableArray  *array;
	NSString        *s;
	
	s = [[NSString alloc] initWithCString:"bla bla bla"];
	[array addObject:s];
	[s release];

Though you actually issue two more Objective-C calls yourself, the "behind the scenes" work is reduced.

On my machine a test case took 11 s for the autorelease version and 7 s for the init/release version. That was measured without freeing the autorelease pool, so the improvement will be even larger when the time to free the objects is taken into account.
[Example: allocautorelease]


Use allocWithZone: instead of alloc

The reason is that (currently) alloc on NSObject is implemented like this:


+ (id) alloc 
{ 
   return( [self allocWithZone:NULL]); 
} 

If you call allocWithZone: directly you can save yourself a very small amount of time, because the extra call in NSObject's code is avoided.
[Example: alloczone]


Code new as well as init in your own classes

Using new, you can save yourself one call by combining initialization and allocation.


static inline void   _init( MyObject *self)
{
   self->myArray_   = [NSMutableArray new];
   self->myCounter_ = 2;
}


- (id) init
{
   self = [super init];

   _init( self);

   return( self);
}


+ (id) new
{
   MyObject *obj;

   obj = NSAllocateObject( self, 0, NULL);
   _init( obj);
   return( obj);
}

For completeness, you should also code alloc and allocWithZone: to ensure that all your objects use the same (default) allocation scheme:


+ (id) alloc
{
   return( NSAllocateObject( self, 0, NULL));
}


+ (id) allocWithZone:(NSZone *) zone
{
   return( NSAllocateObject( self, 0, zone));
}

- (void) dealloc
{
   [myArray_ release];
   NSDeallocateObject( self);
}

The gain is not very large: alloc/init runs in 7.4 s, whereas the new code runs in 6.7 s. Using new doesn't pay off at all with Foundation classes, because NSObject's new implementation just calls alloc/init.
[Example: allocnew]


Pooling objects

An interesting idea is to save deallocated objects in a pool for reuse. Objects in the pool are used in preference to freshly allocated objects, and time is gained on the saved malloc call. Here is a simple implementation of a base class that uses a custom allocWithZone: and a custom dealloc to do the pooling.

The pool is implemented as a circular buffer of fixed size. If the buffer is empty, objects are allocated from memory as usual. If the ring buffer is full, the object's memory is really freed. In normal operation, objects are allocated from and freed into the ring buffer.


#import <Foundation/Foundation.h>


typedef struct
{
   Class          poolClass;
   unsigned int   low;
   unsigned int   high;
   unsigned int   poolSize;
   id             *pool;
} Pool;



@interface PoolObject : NSObject
{

}


+ (Pool *) instancePool;

@end


#import "PoolObject.h"
#import <objc/objc.h>
#import <objc/objc-class.h>


static NSString  *COMPLAIN_MISSING_IMP = @"Class %@ needs this code:\n\
+ (Pool *) instancePool\n\
{\n\
   static Pool  myPool;\n\
\n\
   return( &myPool);\n\
}";


@implementation PoolObject

+ (id) allocWithZone:(NSZone *) zone
{
   Pool  *pool;
   id    obj;

   
   pool = [self instancePool];   // get this class' pool
   if( ! pool->poolClass)        // if first time alloc
   {
      pool->poolClass = self;    // init pool structure
      pool->poolSize = 10000;    // pool 10000 objects
      pool->pool = malloc( sizeof( id) * pool->poolSize); 
   }
   else
      if( pool->poolClass != self)  // sanity check
         [NSException raise:NSGenericException
                     format:COMPLAIN_MISSING_IMP, self];

   if( pool->low == pool->high)  // if pool empty, allocate
   {
      obj = NSAllocateObject( self, 0, NULL);
      return( obj);
   }

   //
   // reuse and clear
   //
   obj       = pool->pool[ pool->low];  
   pool->low = (pool->low + 1) % pool->poolSize;
   // zero everything but the isa pointer (4 bytes on this 32-bit runtime)
   memset( obj + 1, 0, ((struct objc_class *) self)->instance_size - 4);
   return( obj);
}



- (void) dealloc
{
   Pool          *pool;
   unsigned int   next;

   pool = [isa instancePool];
   next = (pool->high + 1) % pool->poolSize;

   if( next == pool->low)  // pool full ?
   {
      NSDeallocateObject( self);
      return;
   }

   //
   // add object to pool
   //
   pool->pool[ pool->high] = self;
   pool->high = next;
}


+ (Pool *) instancePool
{
   [NSException raise:NSGenericException
               format:COMPLAIN_MISSING_IMP, self];
   return( 0);
}


@end
A subclass of this PoolObject class would then implement the following method to provide a pool for its instances.

+ (Pool *) instancePool
{
   static Pool  myPool;

   return( &myPool);
}

Please note that this implementation is not thread safe!

If you can tune the pool size so that in the end most allocations come from the pool and not from memory, you will see a very nice increase in allocation speed. In my measurements I could decrease the time for alloc/init/release cycles from 9.5 s to 3.6 s when the pool was used optimally.

  • Advantage: Allocation/freeing speed is potentially fast.
  • Disadvantage: Increases memory footprint. Allocation slows down when the pool becomes exhausted. Slow until the first objects get deallocated.(2) Difficult to get the pool size right. Depending on the alloc/release pattern of the application, a pool might slowly destroy locality of reference, as pooled objects' addresses tend to get randomized over time. The same can be said for malloc, though.

An idea would be to extend this scheme to handle objects of various sizes. Since the malloc granularity is 16 bytes, one could handle object sizes from zero to 512 bytes with "only" 32 different pools. Obviously the memory footprint of your application will be larger than with the regular mechanism, because you are keeping a lot of objects around.

Another idea would be to make the pool size flexible, so that all freed objects can always be kept in the pool. Because of the virtual memory architecture of Mac OS X this may not be as wasteful as it sounds, since pooled objects that go unused for a long time would eventually be swapped out.
[Example: allocpool]


Allocating Objects in Bunches

An alternative to pooling is to reduce the number of actual memory allocations and deallocations by reserving memory for a number of objects at once and placing the objects into this common region. The payoff in speed depends on the number of objects sharing such a region: if eight objects share a region, the number of malloc/free calls is cut to one eighth.

Here is an implementation of that idea. The start of each malloced memory block contains the Bunch information. This information structure keeps track of the number of objects allocated and released. Just ahead of each object address there is an offset field that makes it possible to calculate the start of the Bunch from the address of the object.

The whole bunch is freed only after all available object storage has been used up and all those allocated objects have subsequently been deallocated again.

Because of the custom memory layout the code is a little bit more involved than the pool code.


#import <Foundation/Foundation.h>


typedef struct
{
   void   *currentBunch;	// need this be volatile ?
} BunchInfo;


@interface BunchObject : NSObject
{
   unsigned int   retainCountMinusOne_;
}

+ (BunchInfo *) bunchInfo;


@end

#import "BunchObject.h"
#import <objc/objc.h>
#import <objc/objc-class.h>

//
// The structure at the beginning of 
// every Bunch
//
typedef struct
{
   long            instance_size_;
   unsigned int    freed_;
   unsigned int    allocated_;		// allocated by "user"
   unsigned int    reserved_;		// reserved from malloc
} Bunch;

#define ALIGNMENT           8
#define OBJECTS_PER_MALLOC  16

//
// the size needed for the bunch
// with proper alignment for objects
//
#define S_Bunch             ((sizeof( Bunch) + (ALIGNMENT - 1)) & ~(ALIGNMENT - 1))


@implementation BunchObject


static Bunch  *newBunch( long bunchInstanceSize)
{
   unsigned int    len;
   unsigned int    nBunches;
   unsigned long   size;
   Bunch           *p;

   bunchInstanceSize = (bunchInstanceSize + (ALIGNMENT - 1)) & ~(ALIGNMENT - 1);
   size = bunchInstanceSize + sizeof( int);
   
   nBunches = OBJECTS_PER_MALLOC;
   len = size * nBunches + S_Bunch;
   if( ! (p = calloc( len, 1)))   // calloc: objects must start out zeroed
      return( NULL);
   
   p->instance_size_ = bunchInstanceSize;
   return( p);
}


static inline void   freeBunch( Bunch *p)
{
   free( p);
}


static inline BOOL   canBunchHandleSize( Bunch *p, long size)
{
   //
   // We can't deal with subclasses that are larger than what we
   // first allocated. 
   //
   return( p && size <= p->instance_size_); 
}


static inline unsigned int nObjectsABunch( Bunch *p)
{
   return( OBJECTS_PER_MALLOC);
}


static inline BOOL  isBunchExhausted( Bunch *p)
{
   return( p->allocated_ == nObjectsABunch( p));
}


static inline id   newBunchObject( Bunch *p)
{
   id             obj;
   unsigned int   offset;

   //
   // Build an object:
   // put the offset back to the bunch structure just ahead of the isa pointer.
   //
   offset = S_Bunch + (sizeof( int) + p->instance_size_) * p->allocated_
            + sizeof( int);
   obj    = (id) ((char *) p + offset);
   ((unsigned int *) obj)[ -1] = offset;

   //
   // up the allocation count
   //
   p->allocated_++;
   
   return( obj);
}


//
// determine Bunch address from object address
//
static inline Bunch  *bunchForObject( id self)
{
   int     offset;
   Bunch   *p;
   
   offset = ((int *) self)[ -1];
   p = (Bunch *) &((char *) self)[ -offset];
   return( p);   
}


+ (BunchInfo *) bunchInfo
{
   static BunchInfo   bunchInfo;

   return( &bunchInfo);
}


/*
##
##   override alloc, dealloc, retain and release
##
*/
static inline BunchObject   *alloc_object( struct objc_class *self)
{
   BunchObject   *obj;
   BOOL          flag;
   BunchInfo     *p;
   
   obj = nil;

   p = [self bunchInfo];  // this hurts a little, because we call it every time

   //
   // first time ? malloc and initialize a new bunch
   //
   if( ! p->currentBunch)
      p->currentBunch = newBunch( ((struct objc_class *) self)->instance_size);
   
   if( canBunchHandleSize( p->currentBunch, self->instance_size))
   {
      //
      // grab an object from the current bunch
      // and place isa pointer there
      //
      obj = newBunchObject( p->currentBunch);
      
      obj->isa = self; 
   }
   
   //
   // bunch full ? then make a new one for next time
   //
   if( isBunchExhausted( p->currentBunch))
      p->currentBunch = newBunch( ((struct objc_class *) self)->instance_size);
         
   //
   // Failed means, some subclass is calling...
   //
   if( ! obj)
      [NSException raise:NSGenericException
                  format:@"To be able to allocate an instance,\
 your class %@ needs this code:\n\
\n\
+ (BunchInfo *) bunchInfo\n\
{\n\
   static BunchInfo   bunchInfo;\n\
\n\
   return( &bunchInfo);\n\
}\
", self];

   return( obj);
}


+ (id) allocWithZone:(NSZone *) zone
{
   BunchObject  *obj;

   obj = alloc_object( self);
   return( obj);
}


//
// Only free a bunch, if all objects are 
// allocated(!) and freed
//
- (void) dealloc
{
   Bunch  *p;

   p = bunchForObject( self);
   if( ++p->freed_ == nObjectsABunch( p))
      freeBunch( p);
}


- (id) retain
{
   ++retainCountMinusOne_;
   return( self);
}


- (void) release
{
   if( ! retainCountMinusOne_--)
      [self dealloc];
}

@end

Please note that this implementation is not thread safe!

In my measurements I could decrease the time for alloc/init/release cycles from 9.5 s to 2.9 s. The difference in speed compared to the pool in this example is due to the lower absolute number of malloc calls. A fully warmed-up pool that always has a surplus of objects should then be expected to perform slightly better than bunched objects.

  • Advantage: Allocation/freeing speed is very fast. Possibly improved locality of reference, as a number of objects of the "same fate" are stored contiguously.
  • Disadvantage: Might leak memory, if bunches can not be reclaimed.
Bunching objects together is the more complicated option, but it does not suffer from the "tuning" problems of an object pool. To what extent the memory leaking may really become a problem in real-world applications is unclear.

My experiments suggest that even a small number of objects per bunch (like four) is sufficient for a nice speed increase. It is probably not wise to bunch more than 64 objects together, due to the greater likelihood that some bunches are never deallocated.
[Example: allocbunch]


Summary

With a little bit of clever coding you can overcome the limitations of object allocation, at the risk of enlarging your memory footprint or incurring memory fragmentation. The proposed schemes work for your own classes, but not for Foundation classes like NSString.

An adventurous hacker might attempt to write an NSObject posing subclass to replace the default alloc/dealloc routines with handwritten ones. That way Foundation objects would benefit from the speed increase as well.

The next article will focus on making all this stuff thread safe without losing (too much) performance.


If you want to discuss this article, please do so in this thread in the Mulle kybernetiK Optimization Forum.
(1) At the time of writing the new Project Builder is unfortunately too crash prone to be useful to me.

(2) One could get around this initial "lag" by pre-filling all pools with objects in the respective initialize methods.