How to Implement a Hash Table in Python
Originally published at pagekey.io.
This tutorial will show you how to implement a hash table with separate chaining. It’s not the most efficient method, but it is the simplest way to get started and create a fully functioning hash table.
Background
Hash tables are indispensable tools for solving a wide assortment for so many interesting programming problems. I always love to incorporate a hash table into a problem; they can provide a clean solution to an issue that would be a mess otherwise.
For the longest time, I wondered how hash tables were created. I wanted to make my own, but I had no clue as to how they worked. Luckily, I found James Routley’s awesome post detailing how to implement one in C. For anyone interested, I highly recommend it.
Using this knowledge, I ported the hash table to Python. By the end of this tutorial, you will understand the basic ideas behind the hash table. Perhaps more importantly, you will have implemented your very own!
The Basics
If you’ve ever used a dictionary in Python or an associative array in a language like PHP, You’ve probably used a hash table before. Features such as the dictionary in Python or the associative array in PHP are often implemented using a hash table. Even more straightforward is the HashTable class available in Java.
Why would we need such a structure? Well, sometimes a flat area just isn’t enough. To make sense of the problem at hand, you may need to store and access your data by a key, a definite step up from the rudimentary integer index provided by flat arrays.
The hash table we build will be used like this:
# Create a new HashTable
ht = HashTable()
# Create some data to be stored
phone_numbers = ["555-555-5555", "444-444-4444"]
# Insert our data under the key "phoneDirectory"
ht.insert("phoneDirectory", phone_numbers)
# Do whatever we need with the phone_numbers variable
phone_numbers = None
... # Later on...
# Retrieve the data we stored in the HashTable
phone_numbers = ht.find("phoneDirectory")
# find() retrieved our list object
# phone_numbers is now equal to ["555-555-5555", "444-444-4444"]
How does this really work under the hood? As it turns out, your key (phoneDirectory
in this example) is converted into an index. This index is used for storing and retrieving the data value from the hash table’s internal array. All those messy details are hidden from the user - they just have to worry about insert()
, find()
, and remove()
.
Fields
Our hash table will need a few fields to keep it together. It needs a size
, which will be the number of elements that have been inserted. It needs a capacity
, which will determine the size of our internal array. Last, it needs buckets
- this is the internal array, storing each inserted value in a “bucket” based on the provided key.
class HashTable:
def __init__(self):
self.capacity = INITIAL_CAPACITY
self.size = 0
self.buckets = [None] * self.capacity
Note the INITIAL_CAPACITY
variable, arbitrarily set to 50 in my example class. This defines the size of our internal array. In a more complex hash table implementation (i.e. an open-addressed, double-hashed hash table), it’s important that the capacity is prime, and that it can be changed. On the other hand, our separate chaining hash table sets the capacity once and never changes it, regardless of how many elements are stored. This is good for simplicity, but bad for scalability.
HashTable Node
If you thought you were getting a break from the internal Node structure, you were wrong! Our hash table will need its own version of a Node:
class Node:
def __init__(self, key, value):
self.key = key
self.value = value
self.next = None
Look familiar? Node has a next
field because it’s actually part of a LinkedList. Because the hash table uses separate chaining, each bucket will actually contain a LinkedList of nodes containing the objects stored at that index. This is one method of collision resolution.
Collisions
Whenever two keys have the same hash value, it is considered a collision. What should our hash table do? If it just wrote the data into the location anyway, we would be losing the object that is already stored under a different key.
With separate chaining, we create a Linked List at each index of our buckets
array, containing all keys for a given index. When we need to look up one of those items, we iterate the list until we find the Node matching the requested key.
There are other, far more efficient ways of handling collisions, but separate chaining is likely the simplest method.
Methods
Now we can really get started. Let’s jump into our hash table’s methods.
Hash
Our hash method needs to take our key, which will be a string of any length, and produce an index for our internal buckets
array.
We will be creating a hash function to convert the string to an index. There are many properties of a good hash function, but for our purposes the most important characteristic for our function to have is uniformity. We want our hash values to be as evenly distributed among our buckets as possible, to take full advantage of each bucket and avoid collisions. The ideal case is pictured below:
On the other hand, an uneven distribution will defeat the purpose of the hash table altogether, yielding nothing more than a bloated LinkedList.
Consider an extreme case: Our hash function will be h(x) = 1
. That’s right, each input produces the same constant value. So, what happens? Every time we hash a key, the output is 1, meaning that we assign that node to bucket 1. The result would look something like this:
Not pretty! We’ll just have to make sure we avoid this bottleneck at all costs.
Here’s the code for our hash function:
def hash(self, key):
hashsum = 0
# For each character in the key
for idx, c in enumerate(key):
# Add (index + length of key) ^ (current char code)
hashsum += (idx + len(key)) ** ord(c)
# Perform modulus to keep hashsum in range [0, self.capacity - 1]
hashsum = hashsum % self.capacity
return hashsum
While fairly arbitrary, this function will provide an acceptable degree of uniformity for our purposes.
Insert
To insert a key/value pair into our hash table, we will follow these steps:
- Increment size of hash table.
- Compute
index
of key using hash function. - If the bucket at
index
is empty, create a new node and add it there. - Otherwise, a collision occurred: there is already a linked list of at least one node at this index. Iterate to the end of the list and add a new node there.
This is reflected in the following code:
def insert(self, key, value):
# 1. Increment size
self.size += 1
# 2. Compute index of key
index = self.hash(key)
# Go to the node corresponding to the hash
node = self.buckets[index]
# 3. If bucket is empty:
if node is None:
# Create node, add it, return
self.buckets[index] = Node(key, value)
return
# 4. Collision! Iterate to the end of the linked list at provided index
prev = node
while node is not None:
prev = node
node = node.next
# Add a new node at the end of the list with provided key/value
prev.next = Node(key, value)
Find
After storing data in our hash table, we will surely need to retrieve it at some point. To do this, we’ll perform the following steps:
- Compute the
index
for the provided key using the hash function. - Go to the bucket for that
index
. - Iterate the nodes in that linked list until the key is found, or the end of the list is reached.
- Return the value of the found node, or None if not found.
This idea would be expressed in code like this:
def find(self, key):
# 1. Compute hash
index = self.hash(key)
# 2. Go to first node in list at bucket
node = self.buckets[index]
# 3. Traverse the linked list at this node
while node is not None and node.key != key:
node = node.next
# 4. Now, node is the requested key/value pair or None
if node is None:
# Not found
return None
else:
# Found - return the data value
return node.value
Removing an element from a hash table is similar to removing an element from a linked list. This method will return the data value removed, or None if the requested node was not found.
- Compute hash for the key to determine
index
. - Iterate linked list of nodes. Continue until end of list or until key is found.
- If the key is not found, return None.
- Otherwise, remove the node from the linked list and return the node value.
This would be reflected in code as such:
def remove(self, key):
# 1. Compute hash
index = self.hash(key)
node = self.buckets[index]
prev = None
# 2. Iterate to the requested node
while node is not None and node.key != key:
prev = node
node = node.next
# Now, node is either the requested node or none
if node is None:
# 3. Key not found
return None
else:
# 4. The key was found.
self.size -= 1
result = node.value
# Delete this element in linked list
if prev is None:
node = None
else:
prev.next = prev.next.next
# Return the deleted language
return result
For more information about removing a node from a linked list, see my LinkedList article.
Applications
Hash tables can be useful in a wide variety of computer science applications. Once you learn how to use them, you won’t be able to stop! It seems at every turn there is a new application for the hash table.
Below are a few problems you can attempt to solve using your new hash table:
- Write a function to determine whether a string contains repeated characters.
- Given a string of any length, find the most-used character in the string.
- Write a function to determine whether two strings are anagrams.
Source
Thank you for reading. Check out the full source code for what we did today below!
Do you enjoy learning about programming and computer engineering? If so, PageKey is the place for you! Click here to visit.
Originally published at pagekey.io.