Insertion Sort and Merge Sort¶

CS 66: Introduction to Computer Science II

References for this lecture¶

Problem Solving with Algorithms and Data Structures using Python

Sections 6.6 - 6.12: https://runestone.academy/ns/books/published/pythonds/SortSearch/toctree.html

The book covers 6 sorting algorithms: Bubble Sort, Selection Sort, Insertion Sort, Shell Sort, Merge Sort, and Quick Sort.

We did Bubble Sort last time. Now we'll do Insertion and Merge.

In class: we will go through examples on the white board.¶

Static images are copied from the book for those who are not in class - though I recommend reading those sections and walking through the algorithms on paper yourself.

Insertion Sort¶

Idea: Think of the list as having two parts

  • Front part: sorted part
  • Back part: unsorted part

Algorithm

  • For current_index in the range of 1 to $n-1$
    • Insert the item from current_index into where it goes within the sorted part of the list (which goes from index 0 to current_index-1)
In [2]:
def insertionSort(alist):
    for index in range(1,len(alist)):

        currentvalue = alist[index]
        position = index

        while position>0 and alist[position-1]>currentvalue:
            alist[position]=alist[position-1]
            position = position-1

        alist[position]=currentvalue

alist = [54,26,93,17,77,31,44,55,20]
insertionSort(alist)
print(alist)
[17, 20, 26, 31, 44, 54, 55, 77, 93]

An Experiment comparing Insertion and Bubble Sort¶

The following code

  1. Creates a list of 10,000 integers, each integer is between 1 and 1,000,000
  2. Make two copies of the list to be sorted by two different algorithms
  3. Time how long it takes to execute Bubble Sort on a copy of the list
  4. Time how long it takes to execute Selection Sort on a copy of the list
In [3]:
import time
import random

N = 10000
random_list = [random.randint(1,1000000) for i in range(N)]
random_list_copy1 = random_list[:]
random_list_copy2 = random_list[:]

start = time.time()
bubbleSort(random_list_copy1)
end = time.time()
print("Bubble sort:",(end-start),"seconds")


start = time.time()
insertionSort(random_list_copy2)
end = time.time()
print("Selection sort:",(end-start),"seconds")
Bubble sort: 7.338683843612671 seconds
Selection sort: 3.95986008644104 seconds

Group Activity Problem 1¶

Have each person in your group repeat this experiment and compare your results.

What happens on a list of size 50,000?

Group Activity Problem 2¶

What is the Big O of Insertion Sort?

How does it compare experimentally?

Merge Sort¶

Idea: Split the list in half. Sort both halves (recursively!) and then merege the two sorted halves together.

Merge Sort Computational Complexity¶

Merge Step: Merging together two sorted lists take $O(n)$ time - you keep track of an index into each smaller list and one into the new list you're copying them into. At each step, you just copy the next item from one of the smaller lists.

How many times do you have to do that?

  • At the "bottom" level of recursion, you have $n/2$ lists each with $\leq 2$ items, so on the order of 2 steps done $n/2$ times which is $O(n)$
  • At the next level, you have $n/4$ lists, each with $\leq 4$ items, so on the order of 4 steps done $(n/4)$ times, again $O(n)$.
  • ... and so on

There are $\log_2(n)$ levels in the recursion since we split it in half each time (like binary search), so the entire algorithm takes $O(n \log(n))$

In [4]:
def mergeSort(alist):
    print("Splitting ",alist)
    if len(alist)>1:
        mid = len(alist)//2
        lefthalf = alist[:mid]
        righthalf = alist[mid:]

        mergeSort(lefthalf)
        mergeSort(righthalf)

        i=0
        j=0
        k=0
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] <= righthalf[j]:
                alist[k]=lefthalf[i]
                i=i+1
            else:
                alist[k]=righthalf[j]
                j=j+1
            k=k+1

        while i < len(lefthalf):
            alist[k]=lefthalf[i]
            i=i+1
            k=k+1

        while j < len(righthalf):
            alist[k]=righthalf[j]
            j=j+1
            k=k+1
    print("Merging ",alist)
    
alist = [54,26,93,17,77,31,44,55,20]
mergeSort(alist)
print(alist)
Splitting  [54, 26, 93, 17, 77, 31, 44, 55, 20]
Splitting  [54, 26, 93, 17]
Splitting  [54, 26]
Splitting  [54]
Merging  [54]
Splitting  [26]
Merging  [26]
Merging  [26, 54]
Splitting  [93, 17]
Splitting  [93]
Merging  [93]
Splitting  [17]
Merging  [17]
Merging  [17, 93]
Merging  [17, 26, 54, 93]
Splitting  [77, 31, 44, 55, 20]
Splitting  [77, 31]
Splitting  [77]
Merging  [77]
Splitting  [31]
Merging  [31]
Merging  [31, 77]
Splitting  [44, 55, 20]
Splitting  [44]
Merging  [44]
Splitting  [55, 20]
Splitting  [55]
Merging  [55]
Splitting  [20]
Merging  [20]
Merging  [20, 55]
Merging  [20, 44, 55]
Merging  [20, 31, 44, 55, 77]
Merging  [17, 20, 26, 31, 44, 54, 55, 77, 93]
[17, 20, 26, 31, 44, 54, 55, 77, 93]

What about Python's sort method?¶

Python's sort method is a hybrid of Merge Sort and Insertion Sort called Timsort. You can read more here: https://en.wikipedia.org/wiki/Timsort

In [5]:
alist = [54,26,93,17,77,31,44,55,20]
alist.sort() #built-in sort uses Timsort algorithm
print(alist)
[17, 20, 26, 31, 44, 54, 55, 77, 93]

Group Activity Problem 3¶

Comment out the print statements and add mergeSort to your timing experiment. How does it compare? Then, add in the built-in Timsort - which is the best?

Group Activity Problem 4¶

When sorting, we usually don't just have a big list of numbers - often we have objects that need to be sorted by a variety of criteria. Re-write the mergeSort function so that we could sort our movie data by the World Sales field, and sort it from largest to smallest.

Here's an example of what the results should look like if you did it right (printing only the first 10 items in the sorted list). This code shows you how you can use the the built-in sort method with data like this (you give it a function that returns the value of the thing you want to sort by), but you will actually change some of the code inside the mergeSort function to make that one work.

In [6]:
import json
from pprint import pprint #for displaying the data in an easier-to-read way

def get_world_sales(x):
    return x["World Sales"]

with open("HighestGrossingMovies.json") as moviefile:
    movies = json.load(moviefile)
    
    
movies.sort(key=get_world_sales,reverse=True)
pprint(movies[:10])
[{'Distributor': 'Twentieth Century Fox',
  'Domestic Sales': 760507625,
  'Genre': "['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",
  'International Sales': 2086738578,
  'MPAA Rating': 'PG-13',
  'Release Date': 'December 16, 2009',
  'Runtime': '2 hr 42 min',
  'Summary': 'A paraplegic Marine dispatched to the moon Pandora on a unique '
             'mission becomes torn between following his orders and protecting '
             'the world he feels is his home.',
  'Title': 'Avatar (2009)',
  'World Sales': 2847246203},
 {'Distributor': 'Walt Disney Studios Motion Pictures',
  'Domestic Sales': 858373000,
  'Genre': "['Action', 'Adventure', 'Drama', 'Sci-Fi']",
  'International Sales': 1939128328,
  'MPAA Rating': 'PG-13',
  'Release Date': 'April 24, 2019',
  'Runtime': '3 hr 1 min',
  'Summary': 'After the devastating events of Avengers: Infinity War, the '
             'universe is in ruins. With the help of remaining allies, the '
             "Avengers assemble once more in order to reverse Thanos' actions "
             'and restore balance to the universe.',
  'Title': 'Avengers: Endgame (2019)',
  'World Sales': 2797501328},
 {'Distributor': 'Paramount Pictures',
  'Domestic Sales': 659363944,
  'Genre': "['Drama', 'Romance']",
  'International Sales': 1542283320,
  'MPAA Rating': 'PG-13',
  'Release Date': 'December 19, 1997',
  'Runtime': '3 hr 14 min',
  'Summary': 'A seventeen-year-old aristocrat falls in love with a kind but '
             'poor artist aboard the luxurious, ill-fated R.M.S. Titanic.',
  'Title': 'Titanic (1997)',
  'World Sales': 2201647264},
 {'Distributor': 'Walt Disney Studios Motion Pictures',
  'Domestic Sales': 936662225,
  'Genre': "['Action', 'Adventure', 'Sci-Fi']",
  'International Sales': 1132859475,
  'MPAA Rating': 'PG-13',
  'Release Date': 'December 16, 2015',
  'Runtime': '2 hr 18 min',
  'Summary': 'As a new threat to the galaxy rises, Rey, a desert scavenger, '
             'and Finn, an ex-stormtrooper, must join Han Solo and Chewbacca '
             'to search for the one hope of restoring peace.',
  'Title': 'Star Wars: Episode VII - The Force Awakens (2015)',
  'World Sales': 2069521700},
 {'Distributor': 'Walt Disney Studios Motion Pictures',
  'Domestic Sales': 678815482,
  'Genre': "['Action', 'Adventure', 'Sci-Fi']",
  'International Sales': 1369544272,
  'MPAA Rating': None,
  'Release Date': None,
  'Runtime': '2 hr 29 min',
  'Summary': 'The Avengers and their allies must be willing to sacrifice all '
             'in an attempt to defeat the powerful Thanos before his blitz of '
             'devastation and ruin puts an end to the universe.',
  'Title': 'Avengers: Infinity War (2018)',
  'World Sales': 2048359754},
 {'Distributor': 'Universal Pictures',
  'Domestic Sales': 652385625,
  'Genre': "['Action', 'Adventure', 'Sci-Fi']",
  'International Sales': 1018130819,
  'MPAA Rating': 'PG-13',
  'Release Date': 'June 10, 2015',
  'Runtime': '2 hr 4 min',
  'Summary': 'A new theme park, built on the original site of Jurassic Park, '
             'creates a genetically modified hybrid dinosaur, the Indominus '
             'Rex, which escapes containment and goes on a killing spree.',
  'Title': 'Jurassic World (2015)',
  'World Sales': 1670516444},
 {'Distributor': 'Walt Disney Studios Motion Pictures',
  'Domestic Sales': 543638043,
  'Genre': "['Adventure', 'Animation', 'Drama', 'Family', 'Musical']",
  'International Sales': 1119261396,
  'MPAA Rating': 'PG',
  'Release Date': 'July 11, 2019',
  'Runtime': '1 hr 58 min',
  'Summary': 'After the murder of his father, a young lion prince flees his '
             'kingdom only to learn the true meaning of responsibility and '
             'bravery.',
  'Title': 'The Lion King (2019)',
  'World Sales': 1662899439},
 {'Distributor': 'Sony Pictures Entertainment (SPE)',
  'Domestic Sales': 675813257,
  'Genre': "['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",
  'International Sales': 868642706,
  'MPAA Rating': None,
  'Release Date': None,
  'Runtime': '2 hr 28 min',
  'Summary': "With Spider-Man's identity now revealed, Peter asks Doctor "
             'Strange for help. When a spell goes wrong, dangerous foes from '
             'other worlds start to appear, forcing Peter to discover what it '
             'truly means to be Spider-Man.',
  'Title': 'Spider-Man: No Way Home (2021)',
  'World Sales': 1544455963},
 {'Distributor': 'Walt Disney Studios Motion Pictures',
  'Domestic Sales': 623357910,
  'Genre': "['Action', 'Adventure', 'Sci-Fi']",
  'International Sales': 895457605,
  'MPAA Rating': 'PG-13',
  'Release Date': 'April 25, 2012',
  'Runtime': '2 hr 23 min',
  'Summary': "Earth's mightiest heroes must come together and learn to fight "
             'as a team if they are going to stop the mischievous Loki and his '
             'alien army from enslaving humanity.',
  'Title': 'The Avengers (2012)',
  'World Sales': 1518815515},
 {'Distributor': 'Universal Pictures',
  'Domestic Sales': 353007020,
  'Genre': "['Action', 'Thriller']",
  'International Sales': 1162334379,
  'MPAA Rating': 'PG-13',
  'Release Date': 'April 1, 2015',
  'Runtime': '2 hr 17 min',
  'Summary': 'Deckard Shaw seeks revenge against Dominic Toretto and his '
             'family for his comatose brother.',
  'Title': 'Furious 7 (2015)',
  'World Sales': 1515341399}]

Next Topic: Trees¶

A tree data structure is a collection of nodes that express a hierarchical relationship.

You can think of it like a linked list, but each node may be linked to more than one other node.

Example: A tree representing an animal taxonomy¶

Vocabulary¶

Node: the circles - may contain a key and other information

Edge: arrow representing relationship between two nodes

Parent/Child: the edge goes out from the parent node and into the child node. (Mammal is the parent of Carnivora, Carnivora is the child of Mammal)

Root: node with no parent (Animalia)

Leaf: nodee with no children (Human, Chimpanzee, etc.)

Example: A tree representing the folder structure on a computer¶

Example: Natural langauge parse tree¶

Parse tree for the sentence "Homer hit Bart"

Example: Expression Parse Tree¶

Parse tre for the expression $((7+3)*(5−2))$

Binary Search Tree¶

A Binary Search Tree is a tree that allows a search operation to be done similar to a binary search.

Like hash tables, they are commonly used to implement the set and map abstract data types.

The book's code: BST for implementing a map (https://runestone.academy/ns/books/published/pythonds/Trees/SearchTreeImplementation.html )

Remindeer: Set operations: add, contains/__contains__, remove, size/__len__

Binary Search Tree Nodes¶

BST Nodes have at most 2 children, called left child and right child

Nodes in left subtree have value less than this node

Nodes in right subtree have value more than this node

This is true for every node!