7 posts
• Page **1** of **1**

**Posts:**42**Joined:**Fri Aug 06, 2004 9:20 am**Location:**Bochum**Website:**http://www.mulle-kybernetik.com/weblog

well, in my project we are using a lots of <tt>NSDictionary</tt> and <tt>NSSet</tt> that basically store "records". Each of our records has a <tt>kGCMKUIDProperty</tt> (a <tt>NSString*</tt>) (inspired by the AddressBook <tt>kABUIDProperty</tt> ) that is created using the corefoundation function <tt>CFUUIDCreate()</tt> to create a "unique id" for this record :

Code: Select all

`+ (NSString *) uniqueString`

{

CFUUIDRef uuid = CFUUIDCreate(NULL);

NSString *uString = (NSString *)CFUUIDCreateString(NULL, uuid);

CFRelease(uuid);

return [uString autorelease];

}

now, to test equality of 2 records, i implemented the hash method like this :

Code: Select all

`-(unsigned int)hash {`

return [[self valueForProperty:kGCMKUIDProperty] hash];

}

as you probably know, <b>CoreFoundation</b> uuid create string that looks like this :

<pre>CF0A031B-2136-11D9-8B70-000B95C34268

E3F45F04-2967-11D9-8092-000B95C34268

E9A382E0-2967-11D9-8092-000B95C34268

EF24F24B-2967-11D9-8092-000B95C34268</pre>

<tt>000B95C34268</tt> is the MAC address of my machine, <tt>-8092</tt> is IIRC the unique id of the current session (computer login time...) , <tt>-11D9</tt> i dont know and <tt>EF24F24B-2967</tt> is about the current time.

my question is basically this : you said that in <tt>NSString</tt>, lowest and highest bytes are really important, and in this case, higher bytes are strictly identical, and only lower bytes are different...

do you think it is "enough" ? i fear not, leading to slow <tt>NSDictionary objectForKey:</tt> calls....

can you advice me a new uuid ?

do you think i should use your "magic" hash method replacement? :

Code: Select all

`- (unsigned int) hash`

{

return( ((unsigned int) self >> 4) |

(unsigned int) self << (32 - 4));

}

(i always fear using category on <tt>NSObject</tt> (Or even just on our records!) that touch a so low level problem)

**Posts:**42**Joined:**Fri Aug 06, 2004 9:20 am**Location:**Bochum**Website:**http://www.mulle-kybernetik.com/weblog

OK and now my reply.

First you have to think about your objects. What does**isEqual:** mean with respect to them ? As you are putting a UUID - of which not two can be the same - into each of them does that mean, that no two objects will ever be equal ? Or can two objects with the same UUID exist (possibly because they where fetched twice from some storage ?

If no two objects with the same UUID can exist, then you can implement the "magic" hash just for your subclass, you don't need to put it on**NSObject**. As your equality tests can be reduced to pointer equality testing.

This ought to be fairly optimal, as the spread is pretty good and the hash speed rules.

Otherwise you can rather easily test for yourself how good the hash is you are producing. I coded up this little tool:

The UUID hash you used is not too bad, it only gets around a 25% spread (by visual inspection of the output...) , but that shouldn't give pathetic performance. It'd be interesting to know the real life consequences, if you find a better hash.

First you have to think about your objects. What does

If no two objects with the same UUID can exist, then you can implement the "magic" hash just for your subclass, you don't need to put it on

This ought to be fairly optimal, as the spread is pretty good and the hash speed rules.

Otherwise you can rather easily test for yourself how good the hash is you are producing. I coded up this little tool:

Code: Select all

`#import <Foundation/Foundation.h>`

#import <CoreFoundation/CoreFoundation.h>

NSString *newUniqueString()

{

CFUUIDRef uuid;

NSString *s;

uuid = CFUUIDCreate(NULL);

s = (NSString *) CFUUIDCreateString(NULL, uuid);

CFRelease( uuid);

return( s);

}

static unsigned int linear_test( unsigned int index)

{

return( index);

}

static unsigned int random_test( unsigned int index)

{

return( rand());

}

static unsigned int uuid_test( unsigned int index)

{

unsigned int hash;

NSString *s;

s = newUniqueString();

hash = [s hash];

[s release];

return( hash);

}

#define RUNS 1000000

#define MASK 0xFF

void test( unsigned int (*hashGenerator)( unsigned int index))

{

unsigned int spread[ MASK + 1];

unsigned int i;

unsigned long total;

unsigned int mean;

NSString *s;

unsigned int hash;

long diff;

memset( spread, 0, sizeof( spread));

for( i = 0; i < RUNS; i++)

{

hash = (*hashGenerator)( i);

spread[ hash & MASK]++;

}

mean = (RUNS + MASK) / (MASK + 1);

total = 0;

for( i = 0; i <= MASK; i++)

{

diff = (long) spread[ i] - (long) mean;

printf( "% 3d : %d (%d)\n", i, spread[ i], diff);

total += diff < 0 ? -diff : diff;

}

printf( "Deviation %d total, %d average\n\n", total, (total + MASK) / (MASK + 1));

}

int main (int argc, const char * argv[])

{

NSAutoreleasePool *pool;

pool = [[NSAutoreleasePool alloc] init];

test( linear_test);

srand( 0x1848);

test( random_test);

test( uuid_test);

[pool release];

return 0;

}

The UUID hash you used is not too bad, it only gets around a 25% spread (by visual inspection of the output...) , but that shouldn't give pathetic performance. It'd be interesting to know the real life consequences, if you find a better hash.

no i'm not dead, just in short vacation

thanks to have put my email, nicely formatted, in your forum

my objects are unique.

currently 2 objects are equals if they have the same UUID. but as you said i could use address equality in that case

But i noticed, using shark, that inserting my objects in a NSDictionary (using the UUID of the object as a key) is "slow" (i thought that inserting objects in a dictionary should be fast if the copying of the key is fast (it is in my case for small string like UUIDs). )

Reading your has optimization article make me think that inserting a lot of (about 10 000 items) object in a dictionary can be slow if the hash method does not produce a large range of values.

ok, i then conclude that the current hash method i use is good and not responsible for the "slow" NSDictionary insertion of my objects...

thanks to have confirmed that, i would not have been able to conclude that myself without your tool !

thanks to have put my email, nicely formatted, in your forum

First you have to think about your objects. What does isEqual: mean with respect to them ? As you are putting a UUID - of which not two can be the same - into each of them does that mean, that no two objects will ever be equal ? Or can two objects with the same UUID exist (possibly because they where fetched twice from some storage ?

my objects are unique.

currently 2 objects are equals if they have the same UUID. but as you said i could use address equality in that case

But i noticed, using shark, that inserting my objects in a NSDictionary (using the UUID of the object as a key) is "slow" (i thought that inserting objects in a dictionary should be fast if the copying of the key is fast (it is in my case for small string like UUIDs). )

Reading your has optimization article make me think that inserting a lot of (about 10 000 items) object in a dictionary can be slow if the hash method does not produce a large range of values.

The UUID hash you used is not too bad, it only gets around a 25% spread (by visual inspection of the output...) , but that shouldn't give pathetic performance. It'd be interesting to know the real life consequences, if you find a better hash

ok, i then conclude that the current hash method i use is good and not responsible for the "slow" NSDictionary insertion of my objects...

thanks to have confirmed that, i would not have been able to conclude that myself without your tool !

**Posts:**42**Joined:**Fri Aug 06, 2004 9:20 am**Location:**Bochum**Website:**http://www.mulle-kybernetik.com/weblog

altimac wrote:ok, i then conclude that the current hash method i use is good and not responsible for the "slow" NSDictionary insertion of my objects...

thanks to have confirmed that, i would not have been able to conclude that myself without your tool !

I wouldn't go that far, this is after all just my opinion and not gospel. You could for a test replace your hash temporarily with something like the linear hash or the "magic" hash and see if insertion speed differs a lot.

You could also post your Shark output - usefully trimmed here - for us to have a look.

altimac wrote:But i noticed, using shark, that inserting my objects in a NSDictionary (using the UUID of the object as a key) is "slow"

Hmm don't you mean insertion into a

**Posts:**42**Joined:**Fri Aug 06, 2004 9:20 am**Location:**Bochum**Website:**http://www.mulle-kybernetik.com/weblog

A little bird told me that in the new CoreFoundation the hash bucket number is indeed prime as coincidentally suggested by the article. I will work this into the article eventually as lazyness (or lack thereof) permits. But here's the news for now.

**Posts:**42**Joined:**Fri Aug 06, 2004 9:20 am**Location:**Bochum**Website:**http://www.mulle-kybernetik.com/weblog

Here's something else that I should work into the article just as a background. That is an elaboration why modulo prime hashing is usually far superior to binary AND hashing (masking).

If you have a number of buckets to address, lets say three or four, and you do an AND of 0x3, you only consider the entropy in the lowest two bits of your hash value.

If you do a modulo it's basically the same as a division, just with a different part of the result used. When you divide by a prime number, the modulo/division algorithms works all the bits into one value. 0x1000 % 0x3 gives 0x1 but 0x2000 gives 0x2. 0x10 % 0x3 gives 0x1 and 0x20 % 0x3 gives 0x2. And so on, you see that each bit is important.

Whereas 0x1000 AND 0x3 gives 0x0 and 0x2000 AND 0x3 gives 0x0 and it's no different for 0x10 and 0x20 and so on.

Now if your hashes have perfect entropy, that means that the bits in each position have the same likelyhood to be either 0 or 1, it doesn't make a difference if you use AND or modulo prime. There are only two bits to be used for addressing and if those bits are random, then the randomness of the other bits or lack thereof doesn't improve or worsen your AND masking hash addressing.

So if you're hashes are truely random AND would be a better choice, because the masking operation is faster than the modulo prime operation.

In real life, hash algorithms produce far from perfect hashes. That's why modulo prime wins.

If you have a number of buckets to address, lets say three or four, and you do an AND of 0x3, you only consider the entropy in the lowest two bits of your hash value.

If you do a modulo it's basically the same as a division, just with a different part of the result used. When you divide by a prime number, the modulo/division algorithms works all the bits into one value. 0x1000 % 0x3 gives 0x1 but 0x2000 gives 0x2. 0x10 % 0x3 gives 0x1 and 0x20 % 0x3 gives 0x2. And so on, you see that each bit is important.

Whereas 0x1000 AND 0x3 gives 0x0 and 0x2000 AND 0x3 gives 0x0 and it's no different for 0x10 and 0x20 and so on.

Now if your hashes have perfect entropy, that means that the bits in each position have the same likelyhood to be either 0 or 1, it doesn't make a difference if you use AND or modulo prime. There are only two bits to be used for addressing and if those bits are random, then the randomness of the other bits or lack thereof doesn't improve or worsen your AND masking hash addressing.

So if you're hashes are truely random AND would be a better choice, because the masking operation is faster than the modulo prime operation.

In real life, hash algorithms produce far from perfect hashes. That's why modulo prime wins.

7 posts
• Page **1** of **1**

Return to “Article Discussion”

Users browsing this forum: Yahoo [Bot] and 1 guest