Data structure internals: Hash Tables22 Aug 2017
Hash tables: everybody’s favourite way of avoiding the need to search through a list of items. But what goes on inside one? How do we achieve amortized constant time access and insertion and deletion? Well I’m so glad you asked…
Mapping keys to indices: Hash functions
Fun fact: Hash tables are actually arrays in disguise!
Computers can only store more than one thing by giving you the memory to store a bunch of those things and letting you choose where to put each one. This means that whether you use integers or strings or arbitrary objects as the keys into your hash table, at some point you’ll still need a way to map that key to an index internally.
Enter hash functions. A hash function is simply any function which takes input of arbitrary size, and returns output of a fixed size. A function that takes a string and returns the first 3 characters (padded with empty space as necessary) is a hash function, just not a very good one. The same goes for a function which takes a string and returns a 32-bit integer containing the string’s length, or one that takes a list of integers and returns their sum. As long as the input can be arbitrarily large and the output is a fixed size.
Exactly what hash function you use really depends on your problem. If security is important then you’ll want to use a cryptographically secure hash function (like SHA-2, bcrypt, or Argon2). On the other hand if you’re just hashing strings that aren’t user input you could use a faster string-hashing algorithm like MurmurHash2. It may be that a generic hash function is good enough for your purposes, but chances are that you can find (or construct) a hash function which works well for your particular data set.
Once you have a hash function you can compute the hash of any input data to give a sort of numerical “fingerprint” of it. You can then take the modulus of this number with the size of an array to get the index that we’d like to store the object at.
Collisions (and what to do about them)
Let’s say we’re constructing a hash table to store data about different countries and we’re using the country’s name as the key.
Our hash might map “Japan” to 7, and so we store all sorts of wonderful information about Japan at index 7. This is great until we try to insert data with another key (say, “Antarctica”) that also maps to 7. This is called a collision, where multiple keys map to the same index in our array. This scenario is guaranteed to happen at some point because you have a finite number of array indices that you can map to, and an infinite number of possible keys you map (because our hash function can, by definition, theoretically handle infinitely large input).
Clearly we need a way to resolve collisions.
All of the approaches to collision handling (that I know about) can be categorized broadly into two categories: Separate chaining and Open addressing. Obviously both have their pros and cons, but the goal is generally to minimize the amount of computation that we need to do when inserting and retrieving items from our hash table.
With separate chaining we store an array of linked lists to the entries we want to store in the hashtable, rather than directly keeping an array of those entries. Every time we insert an item, we hash it to find the index of the linked list to add it to and then add it to the end of that. One benefit of using linked lists in this way, is that we can store many items that hash to index 6 (for example), without affecting the work required to insert or retrieve items that hash to 7 (because we only have to look through the list at index 6).
Open addressing avoids the complication of using linked lists by storing the objects in the array directly, and then handling collisions by searching (or probing) the array when a collision happens. The simplest of these approaches is called linear probing. In a hashtable with linear probing, if we insert “Japan” into index 7, and then try to insert “Antarctica”, we’ll see that index 7 is already occupied, and try index 8. If that is also full we’ll try index 9 and then 10 and 11 and so on until we’ve searched the entire array or we’ve found an empty spot.
One downside of this approach is that we are likely end up with large clumps of occupied indices. This results in other (potentially never-used-before) indices needing many probes to store at or retrieve from. For example if our table has 4 elements that map to index 6, then they’re get stored at 6, 7, 8 and 9. Now if we add a new value that hashes to 7, it needs to be stored at index 10. This obviously becomes a problem when you have a large table that gets very full and each operation needs to search through nearly the entire array to find the correct item.
One way of trying to address this problem is to use quadratic probing. This is the same as linear probing except that we probe at offsets that increase quadratically (so if we map to index h, we’ll probe h, h+1, h+4, h+9 etc) which reduces the chances of growing large clumps of occupied indicies in the underlying array.
Another variant which I find particularly interesting is Robin Hood hashing. In short Robin Hood hashing involves probing linearly, but as you probe, you move the values around so as to keep the average probing distance to a minimum. There is a very good article explaining Robin Hood hashing in more detail here.
Obviously we don’t always know ahead of time how many values we’ll need to store in our hash table, so we’d like to be able to dynamically grow to accommodate more and more data.
The problem with this is that if you recall, when we decided which index (or list, in the case of separate chaining) to put a value into, we took the modulus of the hash function with respect to the length of the underlying array. This means that if we change the length of the array then we must re-hash every single value in the hash table to find its new place.
Obviously re-hashing every entry is expensive and is something we’d like to avoid - so how often should we do it? This is a trade-off between spending lots of extra time every now and then, and spending a little bit of extra time very often. On the one hand we can decide to only expand the table when it gets completely full, but then accessing an element (which is meant to be a fast operation for hash tables) becomes a linear-time operation because when there is only 1 open slot, inserting a new element most likely involves probing a large portion of the array. On the other hand if we expand the table every time we add an element, then access is going to be fast but insertion will be slow because it will incur the cost of a complete re-hash every time.
We can boil this trade-off down to a single value called load factor. Load factor refers to the fraction of available array indices which need to be filled before we expand the hash table. A load factor of 0.9 indicates that 90% of all the available slots have been filled. Having a single value allows us to simply set a threshold value, and then expand our table when the load factor becomes larger than that.
It is worth noting that load factor threshold does not necessarily need to be less than 1 - if we’re using separate chaining then it makes far more sense to have a threshold that is greater than one, because we intend to store multiple entries in each list in our array. As an example of what sorts of values you might use for load factor, the Java 8 Hashtable has a default load factor of 0.75.
Selecting a size
When we initially create the hash table and when we resize it, we are faced with a problem: How big do we want the table to be? The answer (somewhat surprisingly) depends on how good your hash function is (where “good” pretty much means that it’s suitable for cryptographic applications).
With a good hash function you would be fine with just about any size. If you don’t know what hash function you’ll have (maybe you’re developing a library that lets users define what hash function they want to use), or you know you have a bad hash function (and can’t/won’t change it for some reason), then you’ll want the length of the array to be a prime number.
The reason you want to length to be prime is that if many items hash to multiples of some factor of the size of your array, then when you take the modulus of those multiples, the large multiples will land up with the same array indices as the small multiples. This results in the entries clumping up around those indices in the array rather than spreading out nicely, which in turn slows everything down.
So there you have it: Hash tables aren’t magic, they’re just a clever application of hash functions on top of plain old arrays. They give you fast access, fast insertion and fast deletion (but without addition work, you get slow iteration).