Mulle kybernetiK - Tech Info: v0.3
Profiling C and Objective-C code with Sampler.app
The first task in optimization is finding out what needs to be optimized. So the first installment (logically) covers using Apple's profiler.
(c) 2002 Mulle kybernetiK - text by Nat!

Step 1 - Getting your code in shape for profiling

This article was written for Mac OS X Server 1.2 (aka Rhapsody). It needs some revising to cover the new and improved Mac OS X Sampler.app. Unfortunately, Sampler.app is a little broken right now, so an update to this article is not to be expected soon.
Profiling is a good way to find out problematic parts of your code, when something needs to run just a bit quicker. In all likelihood the program that you want to profile, is made up of some frameworks and the application itself. Here is the first caveat, the application to profile must be an AppKit application. You can not profile WebObjects applications or command line tool projects. If you do want to profile code for one of these you should create a new AppKit application and copy/paste the code to profile in there. [Might not be true anymore]

To usefully profile your application it should not be stripped of it's symbol table. Otherwise there are no requirements.

Step 2 - Getting your test application launched and profiled with Sampler.app

If you're really lazy you can download the code for SimpleProfile and as a bonus an example profile framework SimpleProfileFramework (some modification of SimpleProfile required).

The code on the right side will be the test code to be used for the following demonstration. Apart from some niceties for sampling - the NSRunAlertPanel and sleep calls - the -[Controller awakeFromNib] method does nothing more than call the methods -[Controller foo] and -[Controller bar] repeatedly. Both methods do an equal amount of memcpy calls, -[Controller foo] copying 5120000 bytes and -[Controller bar] copying 2560000 bytes.

An instance of Controller is embedded in the NIB file, that is loaded when the application starts up. The load of the NIB file in turn triggers the -[Controller awakeFromNib] messages.

Two simple predictions

  • -[Controller foo] will be slower than -[Controller bar], because -[Controller foo] has to do more work
  • most of the time will be spent in the memcpy routine, because looping and function calling should be relatively inexpensive


Sampler.app
Now it is time to launch the profiler called Sampler.app, which you should be able to locate in /System/Developer/Applications or some such path.

When the profiling windows opens (see below) Set Executable to your applications "profile binary" and ensure that Arguments reads -NSUseRunningCopy NO (red arrow).

After that just press Launch and Sample. When you think you have collected enough data, press Stop Sampling and examine the output in the browser views. You can press Stop Sampling at any time, because your application running and Sampler.app are two different processes. You can also launch the application first without sampling and start sampling when its most convenient and meaningful (yellow).

@implementation Controller

static char   dst[ 512];
static char   src[ 512];

#define LOOPS   10000


- (void)  foo
{
   unsigned int  i;

   for( i = 0; i < LOOPS; i++)
      memcpy( dst, src, sizeof( dst));
}


- (void)  bar
{
   unsigned int  i;

   for( i = 0; i < LOOPS; i++)
      memcpy( dst, src, sizeof( dst) / 2);
}


- (void) awakeFromNib
{
   unsigned int  i;
   
   NSRunAlertPanel( @"START", @"When you press \
OK there will be a 3 seconds delay, then a beep \
will be heard and the test will run.\nStart \
sampling at that moment.\nPress Stop sampling \
when you hear the second beep (or preferrably a \
bit earlier :))\n", @"OK", nil, nil);
   sleep( 3);
   NSBeep();
   
   for( i = 0; i < 1000; i++)
   {
      [self foo];
      [self bar];
   }

   NSBeep();
   sleep( 1);
   NSRunAlertPanel( @"END", @"The test is thru \
   ", @"OK", nil, nil);
}

@end


Step 3 - Getting the needed info from the output

A first look should go to the number of samples collected Samples (Displayed/Total) (blue ). A reasonably meaningful sample collection is 200 (IMO), which translates to 10 seconds sampling. For a finer granularity you can reduce the sampling period (I usually use 5ms (on a G4)).

Now examine the browser views (green). The first thing to note is that the numbers besides the symbolic names (f.ex. _memcpy, or -[Controller foo]) means "number of samples". You must relate this number to the total number of samples taken, it is not an absolute value.

Sampler.app does something very nifty when sampling the running application, it does not only examine the current position of the CPU program counter at the time of the sample, it also takes a snapshot of the stack. With this added stack information Sampler.app can not only tell you how much time was spent in a routine, but also who called it.

Coming back to our example, this means that we can not only find out of much time was spent in memcpy but also how much time was spent in memcpy when being called by -[Controller foo] and how much when being called by -[Controller bar].

On the right side Sampler.app by default shows you the stack frame, which was responsible for the most CPU time spent. From below you can follow the call chain up to memcpy. Of particular interest of course is our test routine -[Controller awakeFromNib], which was encountered on the stack during all (283 of 283) taken samples. This does not mean, that the CPU spent all the time executing that one routine, but rather that this routine and the routines called, and the routines that were called by those etc. pp. used up all that time.

That weird underscore before C functions
The UNIX C-compiler since it's early pre-ANSI Ritchie days prepends an underscore in front of all output symbols. Since the ld linker used to be able to only handle 8 characters, that meant that all external names needed to be significant in the first 7 characters, since the underscore did count.

The prefix apparently was added to distinguish "user" functions from predefined/system functions.

Climbing up the stack you notice that the number of samples 187 besides -[Controller foo] is much smaller than that of -[Controller awakeFromNib]. Since by the main loop of the test code
   for( i = 0; i < 1000; i++)
   {
      [self foo];
      [self bar];
   }

-[Controller bar] and -[Controller foo] were called equal amount of times, but -[Controller foo] on the whole took longer to execute, you don't see -[Controller bar] in this stack trace.

If we were to browse to the stack frame, where -[Controller bar] is shown we would notice 96 samples recorded for -[Controller bar]. Since -[Controller bar] copies half the amount of bytes than -[Controller foo]. This matches nicely our assumptions.

Going up the stack once more we can see that time spent in memcpy is 181 samples. This leads to the conclusion, that the loop and C-function call overhead is responsible. for approx. 3% of the runtime. The upper stackframes reveal implementation details of memcpy (or some such) and do not interest us, since the code we wrote is shown between the memcpy and -[NSConcreteMutableSet make...

In a situation like this you could feel pretty good about yourself, because the time is wasted in code not written by you. On the other hand you should consider, that maybe your code conceptually is inefficient or has some bad algorithmic mistake.


If you want to discuss this articles, please do so in this thread in the Mulle kybernetiK Optimization Forum.