Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.
Optimized Object Generation

One barrier you are likely to hit when optimizing your program is the allocation speed, or rather slowness, of Objective-C. As you know from the previous article Allocation Basics & Foundation, the unholy trinity alloc/init/autorelease is used to allocate and initialize an autoreleased object. Here are some collected ideas on how to make object allocations 300% faster than what Foundation can give you.
Avoid autorelease
Instead of writing
NSMutableArray   *array;
NSString         *s;

s = [NSString stringWithCString:"bla bla bla"];
[array addObject:s];
it is more efficient to write
NSMutableArray   *array;
NSString         *s;

s = [[NSString alloc] initWithCString:"bla bla bla"];
[array addObject:s];
[s release];
Though you actually issue two more Objective-C calls yourself, the "behind the scenes" work has been reduced.
On my machine the result in a test case was 11 s for the autorelease version and 7 s for the init/release version. That was measured without freeing the autorelease pool, so there will be even more of an improvement, when the time to free the object is also taken into account.
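For intuition, here is a rough plain-C analogy (a sketch with made-up names, not Foundation's actual implementation): an autoreleased object must first be recorded in a table, and the real release work is merely postponed until the pool is drained.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_CAPACITY   128

typedef struct
{
   void           *objects[ POOL_CAPACITY];
   unsigned int   count;
} DeferredPool;

/* "autorelease": extra bookkeeping now, the real work happens later */
static void   pool_defer( DeferredPool *pool, void *obj)
{
   assert( pool->count < POOL_CAPACITY);
   pool->objects[ pool->count++] = obj;
}

/* draining the pool walks every recorded object; returns how many */
static unsigned int   pool_drain( DeferredPool *pool)
{
   unsigned int   i;
   unsigned int   n;

   n = pool->count;
   for( i = 0; i < n; i++)
      free( pool->objects[ i]);
   pool->count = 0;
   return( n);
}
```

An immediate free touches the allocator once and is done; the deferred variant pays for the table insert now and for the table walk later.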
[Example: allocautorelease]
Use allocWithZone:

NSObject's own implementation of alloc is just a forward:

+ (id) alloc
{
   return( [self allocWithZone:NULL]);
}
If you call allocWithZone: directly you can save yourself a very small amount of time, because the extra call in NSObject's code is avoided.
[Example: alloczone]
Use new

Allocation and initialization can also be folded into a custom new, which calls NSAllocateObject directly and skips the alloc/init dispatch round trip:

static inline void   _init( MyObject *self)
{
   self->myArray_   = [NSMutableArray new];
   self->myCounter_ = 2;
}

- (id) init
{
   [super init];
   _init( self);
   return( self);
}

+ (id) new
{
   MyObject   *obj;

   obj = NSAllocateObject( self, 0, NULL);
   _init( obj);
   return( obj);
}
For completeness you should also code alloc and allocWithZone:, to be sure that all your objects use the same (default) allocation scheme:
+ (id) alloc
{
   return( NSAllocateObject( self, 0, NULL));
}

+ (id) allocWithZone:(NSZone *) zone
{
   return( NSAllocateObject( self, 0, zone));
}

- (void) dealloc
{
   [myArray_ release];
   NSDeallocateObject( self);
}
The gain is not very large. alloc/init runs in 7.4 s, whereas the new code runs in 6.7 s. Using new doesn't pay off at all when using Foundation classes, because NSObject's new implementation just calls alloc/init.
[Example: allocnew]
Use an object pool

Instead of returning freed objects to the memory system, they can be kept in a pool for reuse. The pool is implemented as a circular buffer of fixed size. If the buffer is empty, objects will be allocated from memory. If the ring buffer is full, the object's memory will really be freed. In normal operation objects are allocated from and freed to the ring buffer.
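The ring-buffer bookkeeping can be sketched in plain C (hypothetical names; the Objective-C implementation below stores ids and adds class checks on top of this):

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE   8   /* small for illustration; the article uses 10000 */

typedef struct
{
   void           *slots[ POOL_SIZE];
   unsigned int   low;    /* next object to hand out */
   unsigned int   high;   /* next free slot to fill  */
} RingPool;

/* pool empty: the caller must really allocate */
static int   pool_is_empty( RingPool *p)
{
   return( p->low == p->high);
}

/* take an object out of the pool; NULL if empty */
static void   *pool_get( RingPool *p)
{
   void   *obj;

   if( pool_is_empty( p))
      return( NULL);
   obj    = p->slots[ p->low];
   p->low = (p->low + 1) % POOL_SIZE;
   return( obj);
}

/* put an object back; returns 0 if the pool is full and the
   caller must really free the object */
static int   pool_put( RingPool *p, void *obj)
{
   unsigned int   next;

   next = (p->high + 1) % POOL_SIZE;
   if( next == p->low)       /* full: one slot stays unused, so   */
      return( 0);            /* that full and empty are distinct  */
   p->slots[ p->high] = obj;
   p->high            = next;
   return( 1);
}
```

Note that one slot is sacrificed to distinguish a full buffer from an empty one, the same convention the next == low test in the Objective-C code uses.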
#import <Foundation/Foundation.h>

typedef struct
{
   Class          poolClass;
   unsigned int   low;
   unsigned int   high;
   unsigned int   poolSize;
   id             *pool;
} Pool;

@interface PoolObject : NSObject
{
}

+ (Pool *) instancePool;

@end

#import "PoolObject.h"
#import <objc/objc.h>
#import <objc/objc-class.h>

static NSString   *COMPLAIN_MISSING_IMP = @"Class %@ needs this code:\n\
+ (Pool *) instancePool\n\
{\n\
   static Pool   myPool;\n\
\n\
   return( &myPool);\n\
}";

@implementation PoolObject

+ (id) allocWithZone:(NSZone *) zone
{
   Pool   *pool;
   id     obj;

   pool = [self instancePool];            // get this class' pool
   if( ! pool->poolClass)                 // if first time alloc
   {
      pool->poolClass = self;             // init pool structure
      pool->poolSize  = 10000;            // pool 10000 objects
      pool->pool      = malloc( sizeof( id) * pool->poolSize);
   }
   else
      if( pool->poolClass != self)        // sanity check
         [NSException raise:NSGenericException
                     format:COMPLAIN_MISSING_IMP, self];

   if( pool->low == pool->high)           // if pool empty, allocate
   {
      obj = NSAllocateObject( self, 0, NULL);
      return( obj);
   }

   //
   // reuse and clear (everything but isa)
   //
   obj       = pool->pool[ pool->low];
   pool->low = (pool->low + 1) % pool->poolSize;
   memset( obj + 1, 0, ((struct objc_class *) self)->instance_size - 4);

   return( obj);
}

- (void) dealloc
{
   Pool           *pool;
   unsigned int   next;

   pool = [isa instancePool];
   next = (pool->high + 1) % pool->poolSize;
   if( next == pool->low)                 // pool full ?
   {
      NSDeallocateObject( self);
      return;
   }

   //
   // add object to pool
   //
   pool->pool[ pool->high] = self;
   pool->high              = next;
}

+ (Pool *) instancePool
{
   [NSException raise:NSGenericException
               format:COMPLAIN_MISSING_IMP, self];
   return( 0);
}

@end
A subclass that wants to be pooled just implements instancePool, so that it gets its own pool storage:

+ (Pool *) instancePool
{
   static Pool   myPool;

   return( &myPool);
}
Please note that this implementation is not thread safe!
If you can tune the pool size so that in the end most allocations come from the pool and not from memory you can see a very nice increase in allocation speed. In my measurements I could decrease allocation times for alloc/init/release cycles from 9.5 s to 3.6 s, when the pool was used optimally.
An idea would be to extend this scheme to handle objects of various sizes. Since the malloc granularity is 16 bytes, one could handle object sizes from zero to 512 bytes with "only" 32 different pools. Obviously the memory footprint of your application will be larger than using the regular mechanism, because you are keeping a lot of objects around.
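Assuming that 16-byte granularity, the lookup from instance size to pool index is a one-liner (the helper name is made up):

```c
#include <assert.h>

#define GRANULARITY   16
#define N_POOLS       32   /* covers sizes 0..512 */

/* round the instance size up to the next multiple of the malloc
   granularity and use that multiple as the pool index */
static unsigned int   pool_index_for_size( unsigned int size)
{
   assert( size <= GRANULARITY * N_POOLS);
   return( size == 0 ? 0 : (size - 1) / GRANULARITY);
}
```

Sizes 1 through 16 share pool 0, 17 through 32 share pool 1, and so on up to pool 31 for 512 bytes.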
Another idea would be to make the pool size flexible, so that all freed objects can always be kept in the pool. Because of the virtual memory architecture of Mac OS X this may not be as wasteful as it sounds, since many of the long-unused pooled objects would eventually be swapped out.
[Example: allocpool]
Bunch your objects

An alternative to pooling is to reduce the number of actual memory allocations and deallocations by reserving memory for a number of objects and placing these objects into this common region. The payoff in speed will depend on the number of objects sharing such a region. If eight objects share a region, the number of malloc/free calls is reduced to one eighth.
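The bookkeeping behind this can be sketched in plain C (names invented for illustration): one malloc serves a whole bunch of fixed-size slots, and the block goes back to the allocator only after every slot has been handed out and freed again.

```c
#include <assert.h>
#include <stdlib.h>

#define OBJECTS_PER_BUNCH   8
#define SLOT_SIZE           32   /* example instance size */

typedef struct
{
   unsigned int   allocated;   /* slots handed out so far */
   unsigned int   freed;       /* slots returned so far   */
   char           storage[ OBJECTS_PER_BUNCH * SLOT_SIZE];
} DemoBunch;

static unsigned int   n_mallocs;   /* counter, for illustration only */

/* one malloc covers OBJECTS_PER_BUNCH objects */
static DemoBunch   *bunch_new( void)
{
   n_mallocs++;
   return( calloc( 1, sizeof( DemoBunch)));
}

/* hand out the next slot; NULL when the bunch is exhausted */
static void   *bunch_alloc( DemoBunch *p)
{
   if( p->allocated == OBJECTS_PER_BUNCH)
      return( NULL);
   return( &p->storage[ SLOT_SIZE * p->allocated++]);
}

/* return a slot; the whole block is freed once all slots came back */
static void   bunch_free_slot( DemoBunch *p)
{
   if( ++p->freed == OBJECTS_PER_BUNCH)
      free( p);
}
```

Eight allocations and eight deallocations here cost exactly one malloc and one free.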
Here is an implementation of that idea. The start of each malloced memory block contains the Bunch information. This structure keeps track of the number of objects allocated and released. Just ahead of each object's address there is an offset field, which makes it possible to calculate the start of the Bunch given the address of the object. The whole bunch will be freed after all available object storage has been used up and all those allocated objects have subsequently been deallocated again.
[Figure: memory layout of a Bunch — the Bunch header at the start of the malloced block, followed by the per-object offset fields and object storage]
Because of the custom memory layout the code is a little bit more involved than the pool code.
#import <Foundation/Foundation.h>

typedef struct
{
   void   *currentBunch;   // need this be volatile ?
} BunchInfo;

@interface BunchObject : NSObject
{
   unsigned int   retainCountMinusOne_;
}

+ (BunchInfo *) bunchInfo;

@end

#import "BunchObject.h"
#import <objc/objc.h>
#import <objc/objc-class.h>

//
// The structure at the beginning of
// every Bunch
//
typedef struct
{
   long           instance_size_;
   unsigned int   freed_;
   unsigned int   allocated_;   // allocated by "user"
   unsigned int   reserved_;    // reserved from malloc
} Bunch;

#define ALIGNMENT            8
#define OBJECTS_PER_MALLOC   16

//
// the size needed for the bunch
// with proper alignment for objects
//
#define S_Bunch   ((sizeof( Bunch) + (ALIGNMENT - 1)) & ~(ALIGNMENT - 1))

@implementation BunchObject

static Bunch   *newBunch( long bunchInstanceSize)
{
   unsigned int    len;
   unsigned int    nBunches;
   unsigned long   size;
   Bunch           *p;

   bunchInstanceSize = (bunchInstanceSize + (ALIGNMENT - 1)) & ~(ALIGNMENT - 1);
   size     = bunchInstanceSize + sizeof( int);
   nBunches = OBJECTS_PER_MALLOC;
   len      = size * nBunches + S_Bunch;

   if( ! (p = calloc( len, 1)))   // calloc, for compatibility
      return( NULL);

   p->instance_size_ = bunchInstanceSize;
   return( p);
}

static inline void   freeBunch( Bunch *p)
{
   free( p);
}

static inline BOOL   canBunchHandleSize( Bunch *p, long size)
{
   //
   // We can't deal with subclasses that are larger than what we
   // first allocated.
   //
   return( p && size <= p->instance_size_);
}

static inline unsigned int   nObjectsABunch( Bunch *p)
{
   return( OBJECTS_PER_MALLOC);
}

static inline BOOL   isBunchExhausted( Bunch *p)
{
   return( p->allocated_ == nObjectsABunch( p));
}

static inline id   newBunchObject( Bunch *p)
{
   id             obj;
   unsigned int   offset;

   //
   // Build an object,
   // put offset to the bunch structure ahead of the isa pointer.
   //
   offset = S_Bunch + (sizeof( int) + p->instance_size_) * p->allocated_;
   obj    = (id) ((char *) p + offset);
   *(unsigned int *) obj = offset + sizeof( int);
   obj    = (id) ((unsigned int *) obj + 1);

   //
   // up the allocation count
   //
   p->allocated_++;
   return( obj);
}

//
// determine Bunch address from object address
//
static inline Bunch   *bunchForObject( id self)
{
   int     offset;
   Bunch   *p;

   offset = ((int *) self)[ -1];
   p      = (Bunch *) &((char *) self)[ -offset];
   return( p);
}

+ (BunchInfo *) bunchInfo
{
   static BunchInfo   bunchInfo;

   return( &bunchInfo);
}

/*
## override alloc, dealloc, retain and release
*/
static inline BunchObject   *alloc_object( struct objc_class *self)
{
   BunchObject   *obj;
   BunchInfo     *p;

   obj = nil;
   p   = [(Class) self bunchInfo];   // this hurts a little, because we call it every time

   //
   // first time ? malloc and initialize a new bunch
   //
   if( ! p->currentBunch)
      p->currentBunch = newBunch( self->instance_size);

   if( canBunchHandleSize( p->currentBunch, self->instance_size))
   {
      //
      // grab an object from the current bunch
      // and place isa pointer there
      //
      obj      = newBunchObject( p->currentBunch);
      obj->isa = (Class) self;
   }

   //
   // bunch full ? then make a new one for next time
   //
   if( isBunchExhausted( p->currentBunch))
      p->currentBunch = newBunch( self->instance_size);

   //
   // Failed means, some subclass is calling...
   //
   if( ! obj)
      [NSException raise:NSGenericException
                  format:@"To be able to allocate an instance,\
your class %@ needs this code:\n\
\n\
+ (BunchInfo *) bunchInfo\n\
{\n\
   static BunchInfo   bunchInfo;\n\
\n\
   return( &bunchInfo);\n\
}\
", self];

   return( obj);
}

+ (id) allocWithZone:(NSZone *) zone
{
   BunchObject   *obj;

   obj = alloc_object( self);
   return( obj);
}

//
// Only free a bunch, if all objects are
// allocated(!) and freed
//
- (void) dealloc
{
   Bunch   *p;

   p = bunchForObject( self);
   if( ++p->freed_ == nObjectsABunch( p))
      freeBunch( p);
}

- (id) retain
{
   ++retainCountMinusOne_;
   return( self);
}

- (void) release
{
   if( ! retainCountMinusOne_--)
      [self dealloc];
}

@end
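The two bits of pointer arithmetic used above — rounding a size up to the alignment, and recovering the block start from the offset stored ahead of the object — can be checked in plain C (helper names invented):

```c
#include <assert.h>
#include <stddef.h>

#define ALIGNMENT   8

/* round n up to the next multiple of ALIGNMENT, as the
   S_Bunch macro and newBunch do */
static unsigned long   align_up( unsigned long n)
{
   return( (n + (ALIGNMENT - 1)) & ~(unsigned long) (ALIGNMENT - 1));
}

/* store the distance back to the block start just ahead of the
   returned pointer, as newBunchObject does ahead of isa */
static void   *give_out( char *block, unsigned int offset)
{
   *(unsigned int *) &block[ offset] = offset + sizeof( unsigned int);
   return( &block[ offset + sizeof( unsigned int)]);
}

/* recover the block start from the stored offset, as
   bunchForObject does */
static void   *block_for( void *obj)
{
   unsigned int   offset;

   offset = ((unsigned int *) obj)[ -1];
   return( &((char *) obj)[ -(long) offset]);
}
```

The masking trick works because ALIGNMENT is a power of two: adding ALIGNMENT - 1 and clearing the low bits rounds up without a division.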
Please note that this implementation is not thread safe!
In my measurements I could decrease allocation times for alloc/init/release cycles from 9.5 s to 2.9 s. The difference in speed compared to the pool in this example is due to the lower absolute number of malloc calls. A fully warmed-up pool that always has a surplus of objects should then be expected to perform slightly better than bunched objects.
My experiments suggest that even a small number of objects per bunch (like four) is sufficient for a nice speed increase. It is probably not wise to bunch more than 64 objects together, due to the greater likelihood of bunches never being deallocated.
[Example: allocbunch]
An adventurous hacker might attempt to write an NSObject posing subclass to replace the default alloc/dealloc routines with handwritten ones. This way Foundation objects would benefit from the speed increase as well.
The next article will focus on making all this stuff thread safe without losing (too much) performance.
(2) One could get around this initial "lag" by pre-filling all pools with objects in the respective initialize methods.