Optimized Hashing in Mac OS X
This articles starts with an intro about hashing for the budding coder. Seasoned veterans amble down to "Unfortunately on Mac OS X, it's not so simple"
- (unsigned int) hash
Every instance of NSObject or any of its subclasses (or more exact, every class that implements the NSObject protocol) responds to the message -hash. To really drive this home, 99% of all classes respond to hash.One usage of the number returned by hash is as a shortcut for the isEqual: message. If a comparison between two objects yields YES then also both their hash values are equal. The other way around is not true though, two disequal objects may very well by chance return the same hash value. This means that comparing hash values is no substitute for true equality tests. But it may very well be used as a quick test to ensure that two objects are NOT equal. You would code it like this in a loop for example:
If computing the hash value of an object is likely to be much faster then comparing two objects with isEqual: then this is likely to be faster than the equivalent code with out the hash comparison.
But the hash is usually used in a much more clever way, and that's where NSDictionary, NSSets and friends come into play. Instead of sequentially comparing all objects with the hash value, a portion or all of the hash value is used as an index to most likely candidates. Here is an excerpt of a hypothetical class implementing a simple dictionary :
This class maintains two arrays of buckets, one for the keys and one for the objects. Each bucket is an NSArray that contains keys or objects. | |
The graphic shows a very small dictionary where only two buckets are used each for key and object storage. The key and corresponding object are stored at the same index in their buckets. The hash is now used to preselect the bucket to insert or search. This will make searching much faster because only a small portion of all keys must be searched. It is clear that the number of buckets should grow proportionally with the number of objects kept in the dictionary so that performance does not degrade. |
![]() |
Good hash, bad hash
What happens if the hash method is bad, and for example by programmers error the hash method always returns 0 ? All objects now reside in a single bucket. That would mean that the loop has always to go through the equality check and that it always has to perform the (now useless) hash call. In the end that would be slower than just testing for equality with isEqual:.
Here is the charted output of a small demo program, that shows how a really bad hash (X=0) is actually worse than no hash (X=-1). Any halfassed hash (X=2) is already very helpful, and things get better as the hash gets better (X=32). The demo program creates a number of random strings of equal size with a very restricted character set (a and b). |
|
The hash X=2 has four possible hash values, the X=32 hash uses the full unsigned int range.
It is therefore important that a hash function has the following properties
- it must be fast, very fast.
- the value returned for an instance, should be different from other objects, that are not "equal" to the instance. If this would violate the "speed" requirement the hash function should try to minimize the likelyhood of disequal objects with the same hash value.
Unfortunately on Mac OS X, it's not so simple
Hash values are used in many places in Foundation and CoreFoundation. Most notably of course the NSDictionary, where the hash is used to quickly search for the key of a dictionary for the subsequent lookup of its associated value. Then there is also the NSSet, the NSHashTable, the NSMapTable, the CFDictionary and probably a bunch of other facilities that serve as collections, lookup tables etc.During the development of CoreFoundation, which is now the real brains behind most of Foundation, hashing and its usage were apparently not very well thought through. As becomes apparent in the following example.
Assume that we build a class, where every object is disequal from the other. An easy way (not the best, but certainly a perfectly valid way) of doing this is to store a unique number during init as an instance variable, like this:
Another perfectly valid way (without the benefit of hindsight) would be to generate the hash like this:
- (unsigned int) _hash2
{
static unsigned int counter;
++counter;
return( ((counter >> 24) & 0xFF) |
((counter >> 16) & 0xFF) << 8 |
((counter >> 8) & 0xFF) << 16 |
((counter) & 0xFF) << 24);
}
which reverses the byte order in the hash value.
Both methods create unique unsigned int number, that should be equally serviceable as hash values. Alas it is not so! (This is true for Mac OS X versions 10.0 till 10.2). Running a test program that inserts 5000 objects into a NSHashTable the runtime for using _hash1 is 0.03s, the runtime for _hash2 is 7.4s. That's 250 times slower! It is very apparent that CoreFoundation favors the lower bit positions over the higher bit positions, or rather the higher bit positions aren't even used at all. Presumably the code to select a hash bucket is just the same as was used in the little example dictionary above. A look into the relevant CFDictionary code reveals, that indeed - as feared - only the lower bits are used to index the bucket:
|
|
So, for now a hash function should also obey the following rule
- The randomness or entropy of the lowest bits should be maximized. Avoid creating hash values that differ only or mostly in the upper bits. The _hash1 method is in this respect perfect, because entropy is guaranteed to be hightest in the lower bits.
A better NSObject hash ? A better CFHash ?
Disassembling NSObject hash method we find that the address of the object is used as its hash value:
which retranslates back to
That's a good idea, because isEqual: on NSObject tests for adress equality. This obviates the need to call isEqual: in all cases, when hash values are compared before. Also there will be perfect disequality of hash values between different objects, since no two objects share the same memory. The entropy of the lower bits is pretty good (but not without optimization possibilities) as objects are typically allocated close to each other in memory. This can be seen in a random sample run of code that did 16 objects allocations outputting their memory addresses:
4f600 4f520 4fc20 4f580
51040 5bb40 5bb50 5bb60
5bb70 5bb80 5bb90 5bba0
5bbb0 5bbc0 5bbd0 5bbe0
The reason for the shift
The author of the -[NSObject hash] method was undoubtedly aware that objects need to be at least 4 byte aligned on the PPC (because of the leading isa pointer). Therefore any object by that reasoning will have the lower two bits cleared. Given an inclination towards preferable entropy in the lower bits, this makes for a better hash value.Since objects are - in general - allocated with malloc, and malloc in Mac OS X guarantees 16 byte alignment on the PPC the shift count should actually be 4 nowadays (compare this with the example allocations output above)
Running a little test case, this shows the following improvement: 0.9 s over 1.4 s with stock -[NSObject hash].
A very interesting experiment is to not do any shifting at all. With that the speed drops down 3.7s!
The bad touch
Now lets have a peek into CoreFoundation to see the code for CFHash
As you notice, there is no shifting performed whatsoever. CoreFoundation objects that do not implement their own hash function, therefore will suffer. For those objects, the performance on dictionary operations is only 25% of what it could be. Fortunately most CFRuntimeClasses, the CoreFoundation equivalent of the Objective-C classes, do implement their own hash functions (1)
CFHash's implementation should be of concern to you, if you implement your own class based on CFRuntimeClass - outside of the NSObject hierarchy. If you subclass from NSObject though, you will inherit NSObject's hash method.
A further (academic) step
The implementation of hash that shifts the address value is suboptimal, because the shift zeroes top bits. Every shift operation therefore destroys a little the uniqueness of the hash value. It is more appropriate to code the hash function as a rotation instead of a shift as:
This code is the best of both worlds, increased entropy in the lower bits and full use of all address bits. (This exercise is largely academic, because due to the expected 16 byte alignment of the memory block only zero bits will be rotated up.)
I was very pleasantly surprised to see, that the gcc compiler was able to compile this (with -O3) into the optimal machine code:
Put it in a category on NSObject, and get a little Cocoa speedup for free! It remains to be seen if this does anything for overall speed. Please let me know.
[Example nsobjecthashtest]
Some Foundation hash implementations discussed
Lets look at the hash implementations of a few selected but often used and usually numerously instantiated Foundation classes. As these are all class clusters, avoid the temptation of optimizing their hash method. Each subclasses' hash needs to stay return value compatible (for isEqual:) with all other subclasses of the same class cluster. The implementations are those of Mac OS X 10.3. Later versions may have different implementations.NSString/CFString/NSURL
The hash is a convolution of the first and last eight bytes plus the length of the string basically the byte values are shifted and added to the string length.
This is a sensible hash for many computer language and natural language strings.
NSURL uses the same hash as NSString. This is a very bad choice. As the first eight and last eight characters of URLs are likely to be very similiar if not identical, as in these not too contrived examples:
- http://www.mulle-kybernetik.com/index.html
- http://www.google.de/index.html
- ftp://ftp.apple.com/pub/share/Resource.tgz
- ftp://ftp.mulle-kybernetik.com/pub/software/MulleEOInterface/MulleEOInterface.source.tgz
NSNumber/CFNumber
The hash for integer is the absolute value. For doubles the absolute value of the integral part plus the absolute value of the fractional part scaled to integer (e.g. 0.5 * 0xFFFFFFFF)Sounds good.
NSData/CFData
The hash is the ELF hash over the first 16 bytesELF isn't bad, but the implementation could be better(1). In certain cases where much is known about the to-be-expected data, a self-written data class can be fruitful. Especially if the first 16 bytes are known to not vary much.
NSDate/CFDate/NSCalendarDate
The hash used is floor( [self timeIntervalSinceReferenceDate])This gives a value in seconds. Pretty much perfect for a hash value.
NSDictionary/CFDictionary
The hash implementation is [self count].It's not a good idea to put a lot of equal sized dictionaries into NSSets. An optimization that readily comes to mind, would be to use a "primary key". To create a subclass of NSDictionary and write hash like this:
You might try to put this code into a category on NSDictionary. This would work properly only, if no NSDictionary subclasses implement their own hash methods. If that is the case, it is still likely to be dependent on the Foundation version and therefore dangerous.
The most robust but also most expensive solution appears to be a wrapper class, subclassed from NSObject, where you store a NSDictionary as an instance variable.
NSArray/CFArray
The implementation is [self count].It's not a good idea to put lots of equal sized NSArrays into a NSSet, as the hash will always the same.
Wrap Up (The Executive Version)
With Mac OS X in it's current state it is double- and triple important to check your hash method.Keep entropy (randomness, distribution) high in the lowest and lower bits.
Put a category on NSObject to override hash and enjoy faster performance on all NSObject derived classes, that do not implement their own isEqual:. If you need a good hash, but computing it is expensive, consider using an instance variable to cache it there.
Do not add mutable objects to hashing collections.
If you want to discuss this articles, please do so in this thread in the Mulle kybernetiK Optimization Forum.
(1) As of 10.2 the CoreFoundation classes not implementing hash are: CFUserNotification, CFNull, CFAllocator, CFTree, CFBoolean, CFXMLParser, CFBundle, CFPlugInInstance, CFRunLoop, CFSocket.
A little bonus anecdote for the thorough reader, that also reads the footnotes. The hash function of CFData uses the ELF hash algorithm. That algorithm is unfortunately not implemented in a way that is beneficial to the PPC architecture.
Also the algorithm itself has a bug, because the upper nybble of the returned hash value is needlessly always zeroed. Here is a modified ELF hash code, that's 30% faster on my Cube with gcc -O3: and preserves the upper nybble:
The example source contains another hash function, which I believe is none worse and is about 40% faster than the original.