Improving Search

4 minute read

CS 66: Introduction to Computer Science II

References for this lecture

Problem Solving with Algorithms and Data Structures using Python

Sections 6.1-6.5 https://runestone.academy/ns/books/published/pythonds/SortSearch/toctree.html

As long as you can iterate through items in a data structure, you can write a sequential search or linear search - a search that runs in $O(n)$ time.

We’ve been doing this all semester long: 1st week activity, search in OrderedList class, etc.

  • for each item in the data structure
    • check if the item is the one you’re looking for

linear search of Python list

15 in [3,5,2,4,1]
False
def sequentialSearch(alist, item):
    pos = 0
    found = False
    
    while pos < len(alist) and not found:
        if alist[pos] == item:
            found = True
        else:
            pos = pos+1
    return found

testlist = [1, 2, 32, 8, 17, 19, 42, 13, 0]
print(sequentialSearch(testlist, 3))
print(sequentialSearch(testlist, 13))
False
True

If the item is actually in the list

  • best case: 1 comparison
  • worst case: $n$ comparison
  • average case: $\frac{n}{2}$ comparisons

If the item is not actually in the list

  • best case: $n$ comparison
  • worst case: $n$ comparison
  • average case: $n$ comparisons

Group Activity Problem 1

If we knew that the list was in sorted order is it possible to make the sequential search end early? How?

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(sequentialSearch(testlist, 3))
print(sequentialSearch(testlist, 13))
False
True

Strategy for Guess-My-Number Game

clockgame.jpg

Image credit: https://www.buzzerblog.com/2014/09/23/watch-the-price-is-right-revamps-clock-game/

Binary Search works by looking at the middle item within a given range and throwing out at least half the items on each iteration.

only works on a sorted data structure

Let’s draw some pictures on the board to see how this works.

def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False

    while first<=last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1

    return found

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42,]
print(binarySearch(testlist, 3))
print(binarySearch(testlist, 13))
False
True

Group Activity Problem 2:

Write down how many items you have to look at in the worst case for lists of the following size:

  • 100
  • 200
  • 400
  • 1000
  • 10000

Things to notice

  • in each iteration, you can eliminate half the list
  • you can double the number of items in a list and only need to check one extra item
  • it does grow - so it isn’t $O(1)$
  • it’s much less than $O(n)$

Binary Search is $O(\log n)$ - inverse of exponential growth

This is fantastic!

The Ordered List Abstract Data Type

The book implemented the Ordered List Abstract Data Type as a Linked List: https://runestone.academy/ns/books/published/pythonds/BasicDS/ImplementinganOrderedList.html

  • Search was $O(n)$

If we instead implement Ordered List as an Array (Python list)

  • Search is $O(\log n)$
  • Both add and remove still $O(n)$ because items are shifted (but finding the spot is a search

Sequential Search is $O(n)$

Binary Search is $O(\log n)$

Is it possible to do search in $O(1)$ time?

An Idea

Let’s say we have a collection of values to store like 1, 5, 6, 7, 9, 10, 13, 15, 18.

We could represent this in a list like this:

my_collection = [None,1,None,None,None,5,6,7,None,9,10,None,None,13,None,15,None,None,18,None]

Now, we can search if a value is in the collection by looking it up at its index. If it is not in the collection, None will be returned.

print( my_collection[5] )
print( my_collection[18] )
print( my_collection[4] )
print( my_collection[17] )
5
18
None
None

Group Activity Problem 3:

Because list access is $O(1)$, this does search in $O(1)$. So, are the downsides to this approach?

Could we make an UnorderedList this way? How about an OrderedList? How about a Set?

Hash Tables

Hash tables work like the above approach, except it allows for values outside of indices of the table using a hash function.

Hash functions can be any function that transforms a value into a suitable index for the list.

For example:

hash_table_size = 10
hash_table = [None] * 10 #initialize all 10 spots to None

def simple_hash(value,num_items):
    return value % num_items

print(hash_table)
[None, None, None, None, None, None, None, None, None, None]

Now let’s put a value into the hash table.

val = 87
val_hash = simple_hash(val,hash_table_size)
hash_table[val_hash] = val
print(hash_table)
[None, None, None, None, None, None, None, 87, None, None]

Doing a few more:

hash_table[simple_hash(33,hash_table_size)] = 33
hash_table[simple_hash(112,hash_table_size)] = 112
hash_table[simple_hash(19,hash_table_size)] = 19
print(hash_table)
[None, None, 112, 33, None, None, None, 87, None, 19]

Now we can search the hash table like this:

hash_table[simple_hash(33,hash_table_size)] == 33 #is 33 at the spot is should be in the hash table?
True
hash_table[simple_hash(12,hash_table_size)] == 12 #is 12 at the spot is should be in the hash table?
False

Group Activity Problem 4

There’s still a problem with this approach. What is it?

Collisions

When two different items end up with the same hash value, it is called a collision.

This is a problem since both items can’t occupy the same memory location.

Collision Resolution is the process of finding a new location for an item when there is a collision - we’ll talk about some approaches next time.

Hash functions for strings

If you want to put strings in your hash table, you have to come up with a way to convert them into numbers.

One approach is to use the ord function for each character, which returns the unicode value used to represent the character in memory.

ord('c')
99
ord('a')
97
ord('t')
116
def string_hash(astring, tablesize):
    sum = 0
    for pos in range(len(astring)):
        sum = sum + ord(astring[pos])

    return sum%tablesize

print( string_hash("cat",10) )
print( string_hash("dog",10) )
print( string_hash("chicken",10) )
2
4
5