Lab 10: CSV Files and Text Analysis

In this lab, you are going to practice reading data from files into lists (one- and two-dimensional) and then writing programs that analyze the data.

Exercise 1: Consider the following program from the lecture, which creates a 2D list of national park info from a CSV file and then counts how many parks there are in a given state.

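import csv

# Read the whole file into a 2D list: one inner list per row.
with open("nationalparks.csv") as npfile:
    parks = csv.reader(npfile)
    parks = list(parks)

state = input("Enter a state: ")

park_counter = 1  # start at index 1, so row 0 is never examined
parks_in_state = 0

while park_counter < len(parks):

    # Column 1 of each row is compared against the state the user typed.
    if parks[park_counter][1] == state:
        parks_in_state += 1

    park_counter += 1

print("There are", parks_in_state, "national parks in", state)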

Write a similar program that asks the user for a state name and then prints the total size in acres of all national parks in that state. Hint: Remember that all values read from files come in as strings, so you will need to convert each size to a number before you can total the values.
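As a minimal sketch of the conversion the hint describes, the snippet below totals a numeric column from a small made-up 2D list. The rows, the column names, and the column position are all hypothetical and are not taken from nationalparks.csv; check the real file to find which column holds the size.

```python
# Hypothetical rows standing in for data read with csv.reader.
# Every value is a string, just as it would be when read from a file.
rows = [
    ["name", "size"],      # made-up header row
    ["Alpha", "100.5"],
    ["Beta", "200"],
]

total = 0.0
row_index = 1  # start at 1 to skip the header row

while row_index < len(rows):
    # float() converts the string "100.5" into the number 100.5
    total += float(rows[row_index][1])
    row_index += 1

print("Total:", total)
```

Without the call to float(), Python would raise a TypeError, because a string cannot be added to a number.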

Exercise 2: Write a program that asks the user to enter a year. Print the number of national parks that have been established since that year.

Movie Reviews Data

We're now going to start working with a data set of movie reviews from the Rotten Tomatoes website (data sourced from here: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) that we will use over the course of several labs. Eventually, we will develop a much larger program than any we have written so far, and it will have a neat effect.

Take a look at the movie_reviews.csv file. I suggest opening it in a text editor (even Thonny) so that you can see what the raw data looks like.

Note that it is a CSV file with two columns. The first column is the number of stars a reviewer gave the movie (0 means very negative, 1 means negative, 2 means neutral, 3 means positive, 4 means very positive). The second column is the text of the review. You'll see that the file puts quotes around the values in the second column because some reviews contain commas; the quotes keep those commas from being mistaken for the commas that separate columns. The Python csv module can read this without you doing anything special.
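To see how the csv module treats quoted commas, here is a small sketch that parses two made-up lines shaped like the ones described above. The sample text is invented for illustration and is not taken from movie_reviews.csv; io.StringIO is used only so the example runs without a file.

```python
import csv
import io

# Two made-up lines: a star rating, then review text wrapped in quotes.
sample = '3,"Funny, warm, and well acted ."\n1,"Slow , but short ."\n'

# io.StringIO lets csv.reader treat the string like an open file.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    # Each row has exactly two items, even though the review has commas.
    print(len(row), row)
```

Each row comes back as a list of two strings, with the surrounding quotes removed and the commas inside the review left in place. Note that the star rating is also a string at this point.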

Exercise 3: Write a program that can search the reviews for a given word, print out any reviews that contain that word, and then report how many reviews matched the search term. Your output could look like this:

Hint: You can use the in operator to check whether one string is contained in another:

word = "amuses"
review = "A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story ."
word in review

True
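Two details of the in operator are worth keeping in mind when searching text: it is case-sensitive, and it matches any run of characters, not just whole words. The short sketch below shows both, using a shortened version of the review above; the variable names are only for illustration.

```python
review = "Some of which occasionally amuses but none of which amounts to much ."

exact_match = "amuses" in review           # True: exact substring
wrong_case = "Amuses" in review            # False: capitalization differs
partial_word = "amuse" in review           # True: matches inside "amuses"
lowered = "AMUSES".lower() in review.lower()  # True: compare in lowercase

print(exact_match, wrong_case, partial_word, lowered)
```

Converting both strings to lowercase before comparing is one common way to make a search ignore capitalization. Whether your program should do this depends on the output the exercise expects.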

Challenge Exercise 4: Write a program that will ask the user for a word and then print out the average number of stars of reviews that contain that word (if a single review contains the word more than once, you only need to include it once in the average). Make sure to account for words that don't appear in the reviews so that you don't end up dividing by 0 when computing the average. The output should look like this:

Hint: You can set this up a lot like the previous exercise, but instead of printing each review that contains the search term, add that review's score to an accumulator variable.
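The divide-by-zero guard can be sketched on its own, separately from the file reading. The helper name average_or_none and the sample scores below are hypothetical; this shows one possible shape for the check, not the required solution or output format.

```python
def average_or_none(scores):
    """Return the average of the scores, or None if there are none."""
    total = 0
    count = 0
    for score in scores:
        total += int(score)  # scores arrive as strings, so convert first
        count += 1
    if count == 0:
        # No matching reviews: avoid dividing by zero.
        return None
    return total / count

print(average_or_none(["4", "2"]))  # two made-up matching scores
print(average_or_none([]))          # a word that matched nothing
```

Checking the count before dividing lets the program print a sensible message for a word that never appears instead of crashing with a ZeroDivisionError.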

When you are finished, submit your solution to the Lab 10: Word score assignment on codePost. You only need to submit your .py file - the movie_reviews.csv file will already be uploaded to codePost for you. It will be autograded, and it'll be looking for the output to match the examples above.