The reason for creating my ".NET Undocumented" blog was to learn more about the internals of .NET and to post my new discoveries from playing around with the .NET libraries, Anakrino, and the Rotor source files to the world. I have a few lengthy articles on CodeProject that detail the internals of strings and arrays, but sometimes I don't want to write a full article about a given topic. I just want to write notes, so posting to a blog seems like a good idea. If I eventually post enough blog entries about a given topic, I may compile those entries into one article. And if I compile enough articles, I may then integrate those articles into a book.
The Hashtables that Microsoft provides in .NET are fairly well done. Hashtables are implemented using a flat array of buckets, each containing three members (a key, a value, and a stored hash code), with probing and double hashing for collisions. The stored hash code member of each bucket uses one bit to indicate whether a collision has occurred. The array initially starts out with a length of 11 and then grows to the next prime number greater than twice the current length. This means that, on average, an Add operation is a constant-time operation for an object with a well-distributed hash function.
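To make the bucket layout and the growth rule concrete, here is a quick Python sketch. The names (Bucket, next_table_size) are my own, not .NET's, and I believe the real Hashtable consults a precomputed table of primes rather than testing primality at runtime, so treat this as an illustration of the rule, not the implementation:

```python
class Bucket:
    """One slot in the flat bucket array: a key, a value, and the stored
    hash code (whose high bit can flag that a collision occurred here)."""
    __slots__ = ("key", "value", "hash_code")
    def __init__(self, key=None, value=None, hash_code=0):
        self.key, self.value, self.hash_code = key, value, hash_code

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_table_size(current):
    """Grow to the next prime greater than twice the current length."""
    candidate = current * 2 + 1  # start odd, just past 2 * current
    while not is_prime(candidate):
        candidate += 2
    return candidate
```

Starting from the initial 11 buckets, successive growths go 11, 23, 47, 97, and so on.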
Hashtables have a load factor from 0.10 to 1.0, which trades off the size of the hashtable against performance. By default, a load factor of 1.0 uses a bucket array that is about 39% larger than the item count to alleviate the impact of collisions. For two keys that collide in the same bucket, a new hash value is computed for the second key and is used as an offset from the current bucket position.
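Here is a Python sketch of that double-hashing probe, loosely modeled on my reading of the Rotor sources (the exact increment formula there may differ in detail). Because the increment is in the range 1 to table_size - 1 and the table size is prime, the probe sequence is guaranteed to visit every bucket:

```python
def probe_sequence(hash_code, table_size, max_probes):
    """Yield the buckets visited when resolving a collision by
    double hashing: the offset is derived from the hash code itself."""
    seed = hash_code & 0x7FFFFFFF              # mask off the collision bit
    incr = 1 + ((seed >> 5) + 1) % (table_size - 1)
    bucket = seed % table_size
    for _ in range(max_probes):
        yield bucket
        bucket = (bucket + incr) % table_size  # step by the second hash
```

With a prime table size, running the generator for table_size steps touches each bucket exactly once, which is what keeps lookups terminating even on a nearly full table.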
Hashtables offer two methods of customizing the hash code function and the key comparison function. One is the documented route of providing IHashCodeProvider and IComparer implementations to the constructor. The second uses subclassing to override two appropriately named member functions.
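The two routes can be sketched in Python like this. The class and method names here (SimpleHashtable, get_hash, key_equals) are hypothetical stand-ins for the .NET virtuals, and plain callables stand in for IHashCodeProvider and IComparer:

```python
class SimpleHashtable:
    def __init__(self, hash_provider=None, comparer=None):
        # Route 1: constructor injection, like passing IHashCodeProvider
        # and IComparer implementations to Hashtable's constructor.
        self._hash_provider = hash_provider
        self._comparer = comparer

    def get_hash(self, key):
        # Route 2: subclasses may override this member instead.
        if self._hash_provider is not None:
            return self._hash_provider(key)
        return hash(key)

    def key_equals(self, item, key):
        if self._comparer is not None:
            return self._comparer(item, key) == 0
        return item == key

class CaseInsensitiveTable(SimpleHashtable):
    """Route 2 in action: override the two member functions."""
    def get_hash(self, key):
        return hash(key.lower())
    def key_equals(self, item, key):
        return item.lower() == key.lower()
```

Either route works; the subclassing route keeps the policy baked into the type, while the constructor route lets callers mix and match at creation time.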
I have two issues with Hashtables.
First, they aren't lightweight enough to be used freely in millions of objects, for example if you want to include "dynamic properties" in every object. In addition to a bucket array, each hashtable carries around a load factor, a load size, the hashtable size, a HashCodeProvider, a Comparer, an object to hold serialization information, a version number, and pointers to its Keys and Values collections. That's at least 48 bytes off the bat directly on the object, if you count the object header and vtable.
The overhead of the bucket array's object information is about 12 bytes. Each bucket entry, which consists of three items (a key, a value, and a hash code), adds another 12 bytes. The minimum size of a hashtable is 11 buckets. So, all told, a hashtable object starts out at 48 + 12 + 12 * 11, or 192 bytes. Wow! That's a little over 5 hashtables per kilobyte, or about 5,000 per megabyte.
In practice, the variable overhead per key in a standard hashtable averages about 25 bytes, which is 12 (the entry size) * 139% (for the load factor) * 150% (for the growth factor, midway between 100% and 200%). For a custom load factor, the load-factor contribution is determined by dividing 139% by the new load factor.
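The back-of-the-envelope arithmetic above can be reproduced in a few lines (the constants are the estimates from this post, not measured values):

```python
OBJECT_OVERHEAD = 48    # Hashtable's own fields plus object header/vtable
ARRAY_OVERHEAD = 12     # the bucket array's object information
ENTRY_SIZE = 12         # key + value + stored hash code per bucket
INITIAL_BUCKETS = 11    # minimum table size

# Fixed cost of an empty hashtable.
minimum_size = OBJECT_OVERHEAD + ARRAY_OVERHEAD + ENTRY_SIZE * INITIAL_BUCKETS
# 48 + 12 + 12 * 11 = 192 bytes

LOAD_PADDING = 1.39     # table kept ~39% larger than the item count
AVG_GROWTH_SLACK = 1.5  # on average midway between just-grown and full
per_key_overhead = ENTRY_SIZE * LOAD_PADDING * AVG_GROWTH_SLACK
# roughly 25 bytes of variable overhead per key
```

For a custom load factor, replace LOAD_PADDING with 1.39 divided by that factor.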
For hashtables with low counts, .NET provides ListDictionary (which uses a linked list) and HybridDictionary (which switches from a ListDictionary to a Hashtable as the count grows). These implementations may save some memory, but they are not that compact either.
Second, I believe there is a simpler approach to hashtables, using chaining, which would produce both superior performance and memory savings over Microsoft's implementation. Each bucket would consist of two items (a key and a value, but no stored hash code). The chaining hashtable would also eliminate the need to maintain extra unused bucket entries for load; because of chaining, other buckets would not be impacted. Each bucket would either store the actual key/value pair directly or point to a dynamically allocated array of key/value pairs in the event of a collision.
This should be faster, because only the items that map to the same bucket would need to be searched during lookup, minimizing collision tests. It also reduces space requirements, because a full chaining hashtable has no wasted bucket slots and does not store hash code values internally.
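A minimal Python sketch of the chaining design proposed above: each bucket is either empty, a single inline key/value pair, or a list of pairs once a collision occurs. No hash codes are stored and no slack buckets are reserved. All names here are mine, and the list stands in for the dynamically allocated array of pairs:

```python
_EMPTY = object()  # sentinel so None can be a valid key or value

class ChainedHashtable:
    def __init__(self, size=11):
        self._buckets = [_EMPTY] * size

    def _index(self, key):
        return (hash(key) & 0x7FFFFFFF) % len(self._buckets)

    def add(self, key, value):
        i = self._index(key)
        slot = self._buckets[i]
        if slot is _EMPTY:
            self._buckets[i] = (key, value)          # common case: inline pair
        elif isinstance(slot, list):
            slot.append((key, value))                # already chained
        else:
            self._buckets[i] = [slot, (key, value)]  # first collision: spill

    def lookup(self, key):
        slot = self._buckets[self._index(key)]
        if slot is _EMPTY:
            raise KeyError(key)
        pairs = slot if isinstance(slot, list) else [slot]
        for k, v in pairs:   # only same-bucket items are ever compared
            if k == key:
                return v
        raise KeyError(key)
```

Note the design choice: the common no-collision case pays nothing for chaining, since the pair lives directly in the bucket and the spill array is only allocated when two keys actually collide.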