Hash Table Data Structure, Implementing Set and Map ADTs

7 minute read

CS 66: Introduction to Computer Science II

References for this lecture

Problem Solving with Algorithms and Data Structures using Python

Section 6.5 https://runestone.academy/ns/books/published/pythonds/SortSearch/Hashing.html

Review: Hash Tables

Hash tables work like the above approach, except it allows for values outside of indices of the table using a hash function.

Hash functions can be any function that transforms a value into a suitable index for the list.

For example:

hash_table_size = 10
hash_table = [None] * 10 #initialize all 10 spots to None

def simple_hash(value,num_items):
    return value % num_items

print("Initialized table;",hash_table)

#putting the value 87 into the hash table
hash_table[ simple_hash(87,hash_table_size) ] = 87
#putting the value 1004 into the hash table
hash_table[ simple_hash(1004,hash_table_size) ] = 1004

print("The hash table with some data in it")
print(hash_table)
Initialized table; [None, None, None, None, None, None, None, None, None, None]
The hash table with some data in it
[None, None, None, None, 1004, None, None, 87, None, None]

Group Activity Problem 1 (Review)

What line of code would I use to check if a number like 87 or 63 or 37 is in the hash table? What is the Big O of this?

Review: Collisions

When two different items end up with the same hash value, it is called a collision.

This is a problem since both items can’t occupy the same memory location.

Collision Resolution is the process of finding a new location for an item when there is a collision.

Linear Probing

Linear probing: collision resolution strategy - put the item in the next available spot.

Let’s see what would happen if we put the following items into our hash table of size 10:

13, 56, 193, 84, 73, 264

Group Activity Problem 2 - a bigger example

Make a copy of this spreadsheet and put the following numbers into the hash table:

https://docs.google.com/spreadsheets/d/11hRzwujbkeBGmGGDXgaZrExLDNBN7_dM5Y1X25nmJcU/edit?usp=sharing

157, 345, 206, 721, 300, 111, 897, 66, 578, 416, 126

Answer the following question:

  • What are some ways we make collisions less likely?
  • What problem do you run into if you try to remove items from the table?

Rehashing

Rehashing: use another function to find a new spot when a collision happens

  • linear probing is a case of rehashinig

  • skip n - instead of jumping to next slot, skip down $n$ slots and try it there.

    skip 3: if the original hash value was $h$, then try $h+3$, then $h+6$, then $h+9$, then $h+12$ (wrap around the table as necessary)

    Let’s try skip 3 with 13, 56, 193, 84, 73, 264

    Problem: hash table size should be a prime number so that you eventually get to all slots in the table, still results in some clustering

  • quadratic probing - increase the skip value each time by the next square number

    if the original hash value was $h$, then try $h+1$, then $h+4$, then $h+9$, then $h+16$ (wrap around the table as necessary)

  • PRNG probing - use a pseudo-random number generator to generate the probing sequence (this is what Python uses for sets and dicionaries)

ALl of these techniques that try to find open slot somewhere else in the table are called open addressing

Chaining

Chaining: A collision resolution strategy in which values are placed into a linked list referenced by each spot in the hash table

Eliminates clustering!

Let’s try chaining with 13, 56, 193, 84, 73, 264

Hash Tables as a Data Structure

Hash Tables

  • don’t guarantee any particular ordering - it all depends on what the hash function does
  • bad for ADTs that have order or indexes: UnorderedList, OrderedList, Queue, Stack, Deque
  • good for ADTs where order doesn’t matter: Map and Set

Set ADT

The above examples consider using hash tables as the underlying data structure to implement a set abstract data type - similar to Python’s set type.

  • A set is a collection in which the ordering of the values is not important
  • The only thing that matters with a set is whether the item is in the set or not
  • Duplicates are not allowed
  • Python has a built-in set type

The book does not formally define the Set ADT, but we could define it like this:

  • Set() Create a new, empty set. It returns an empty set collection.
  • add(val) Add a new value to the set. If the value is already in the set, do nothing.
  • in Return True for a statement of the form val in set, if the given value is in the set, False otherwise.
  • remove(val) Remove the given value from the set.
  • len() Return the number of values stored in the set.
items_in_my_fridge = {"milk","eggs","leftovers","pop"}
items_in_my_fridge.add("butter")
items_in_my_fridge.add("butter")
items_in_my_fridge.remove("pop")
print( type(items_in_my_fridge) )
print(items_in_my_fridge)
print( "butter" in items_in_my_fridge )
print( "pop" in items_in_my_fridge )
<class 'set'>
{'butter', 'eggs', 'leftovers', 'milk'}
True
False

Implementing the Set ADT with a Chained Hash Table as the underlying data structure

Coding demo


# Example code for your notes

class ChainedHashSet:
    
    def __init__(self,table_size=10):
        self.table = []
        self.table_size = table_size
        for slot in range(self.table_size):
            self.table.append([])
            
    def __repr__(self):
        display_str = ""
        for slot in range(self.table_size):
            display_str += str(slot)+":"+str(self.table[slot])+"\n"
        return display_str
    
    def hash_function(self,item):
        return item%self.table_size
    
    def add(self,item):
        hashed_val = self.hash_function(item)
        list_at_slot = self.table[ hashed_val ]
        if not item in list_at_slot:
            list_at_slot.append(item)
    
    
my_set = ChainedHashSet()

my_set.add(13)
my_set.add(56)
my_set.add(193)
my_set.add(84)
my_set.add(73)
my_set.add(264)


print(my_set)        
0:[]
1:[]
2:[]
3:[13, 193, 73]
4:[84, 264]
5:[]
6:[56]
7:[]
8:[]
9:[]

Group Activity Problem 3

Implement the __contains__ method - which will allow you to use the in operator to check if an item is in the set.

Note: it should return True or False

Group Activity Problem 4

Our hash table will still only store integers.

Can you think of a strategy that will allow us to put strings in a table?

0:[]
1:[]
2:[]
3:['Dr. Mario']
4:[]
5:['Pikachu', 'Kirby']
6:['Sheik', 'Mr. Game and Watch']
7:['Dr. Carlson', 'Jiggly Puff']
8:['Captain Falcon']
9:['Mew Two']

these are the names my kids gave our chicks that just hatched


Map ADT

Hash tables are also often used to implement the map abstract data type

A map abstract data type stores key-value pairs and allows you to use a key to look up its associated value.

A Python dictionary is a map.

There are other data structures you could use to implement a map, such as a list of tuples.

The following is the book’s definition of the Map ADT:

  • Map() Create a new, empty map. It returns an empty map collection.
  • put(key,val) Add a new key-value pair to the map. If the key is already in the map then replace the old value with the new value.
  • get(key) Given a key, return the value stored in the map or None otherwise.
  • del Delete the key-value pair from the map using a statement of the form del map[key].
  • len() Return the number of key-value pairs stored in the map.
  • in Return True for a statement of the form key in map, if the given key is in the map, False otherwise.

Chained Hash Map

We could use a similar strategy that we used for set, but store (key,value) tuples

0:[(20, 'chicken')]
1:[(31, 'cow')]
2:[]
3:[(93, 'lion')]
4:[(54, 'cat'), (44, 'goat')]
5:[(55, 'pig')]
6:[(26, 'dog')]
7:[(17, 'tiger'), (77, 'bird')]
8:[]
9:[]

(maybe this is a map that a zoo uses to look up which animal has each id)

Group Activity Problem 5

With our set, we did something like this

    def add(self,item):
        hashed_val = self.hash_function(item)
        list_at_slot = self.table[ hashed_val ]
        if not item in list_at_slot:
            list_at_slot.append(item)

For a map, it would look like this

    def put(self,key,value):
        hashed_key = self.hash_function(key)
        list_at_slot = self.table[ hashed_key ]
        if not (key,value) in list_at_slot:
            list_at_slot.append((key,value))

What would happend if you tried to change the value associate with a key?

How can you fix it?

my_map = ChainedHashMap()

my_map.put(20,"Turkey")
my_map.put(20,"Chicken") #should overwrite Turkey

Built-in Hash Function

Python contains a built-in hash function that you can use.

Be careful, the hash is somewhat randomized and values change every time you re-start your code (this brings issues with saving hashed values to a file, etc.)

print( hash("Star Wars: Episode VII - The Force Awakens (2015)") )
1538015278275964211

Textbook’s Linear-Probed-Hash-Table Map

The following code shows the start of the book’s approach to using a Hash Table with linear probing to implement a map.

self.slots list stores the keys

self.data stores the associated values

class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size
        self.data = [None] * self.size
        
    def put(self,key,data):
        hashvalue = self.hashfunction(key,len(self.slots))

        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  #replace
            else:
                nextslot = self.rehash(hashvalue,len(self.slots))
            while self.slots[nextslot] != None and self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot,len(self.slots))

            if self.slots[nextslot] == None:
                self.slots[nextslot]=key
                self.data[nextslot]=data
            else:
                self.data[nextslot] = data #replace

    def hashfunction(self,key,size):
         return key%size

    def rehash(self,oldhash,size):
        return (oldhash+1)%size