Nat! bio photo

Nat!

Senior Mull

Twitter Github Twitch

The Mullocator - Part I - Testing Boundaries

Ok, let's ignore the plan, and rather start with the technical stuff.

At the beginning of a project like the Mullocator, I will try to figure out what the optimum result could be and what the current state of affairs is. Yes, that's what the old folks call "back of the envelope calculations" and yes I do it too, because I am old... :)

The goal for the Mullocator was to write a malloc implementation that could allocate memory as fast as possible. With no regard for the complicated task of managing freed memory, the most simple and fastest malloc routine would appear to be a routine like this:

/* No, this is not the Mullocator :) */
/* some .c file */
char   memory[ 0x1000000];    /* 4K pages of 4K size */
char   *buf = memory;

/* malloc.h */
static inline void  *malloc( size_t size)
{
   extern char   memory[];
   extern char   *buf;
   void          *p;
   
   p   = buf;
   buf = &buf[ size];   /* one memory write, inlined code, hard to beat ... */
   return( p);
}

Although the optimal malloc is so small, there are already a few problems with this code, one of which being: this may not be the fastest way to allocate memory!

Problems

  1. Alignment: the buffer is a char buffer and the code doesn't ensure that blocks are properly aligned to hold ints, longs or doubles.
  2. No checks for memory exhaustion.
  3. There are static variables, that aren't protected by locks, so this code is not thread safe.
  4. We are allocating a huge memory buffer and this memory buffer is by C convention guaranteed to be zeroed out. So the Virtual Memory System ( could be bringing in about 4000 4K pages of memory at the beginning of the program. Small programs, that only malloc a few bytes, might suffer greatly under the VM load (see The irony of having to optimize echo)


Ok so let's fix #1 and #2 first: we'll assume that a long alignment is good enough.

long   memory[ 0x1000000 / sizeof( long)];
long   *sentinel = &memory[ 0x1000000 / sizeof( long)];
long   *buf = memory;

static inline void  *malloc( size_t size)
{
   extern long  memory[];
   extern long  *sentinel;
   extern long  *buf;
   void         *p;
   long         *q;
   
   /* assume the compiler does reduce that to shifting, it will... */
   q = &buf[ (size + sizeof( long) - 1) / sizeof( long)];  
   if( q > sentinel)
      return( NULL);

   p   = buf;
   buf = q;
   return( p);
}

Note how the code itself got a little slower: one add, one shift, a comparison with a not-taken branch.

Alignment creates an interesting dilemma

  • Programs are likely to run faster. CPU load and store instructions will perform better.
  • Programs are likely to run slower. The CPU cache is likely to miss more often.

Fixes for #3 and #4 next time. Any questions ? Use the comments.