Lab 13: Improving the Sentiment Analysis Program

You might be wondering if the sentiment analysis program you wrote for Lab 12 is really how sentiment analysis is done. The answer is yes and no. The algorithm you implemented is similar to some common sentiment analysis approaches, though there are more sophisticated algorithms as well. They all, however, rely on the same basic assumption to work: they use pre-scored examples to come up with scores for new, unknown examples. At Drake, we have an upper-level course called Machine Learning where we explore different algorithms for text analysis problems like this.

One of the things you may have noticed from your testing is that longer reviews tend to get scores closer to 2 whereas shorter ones might have more extreme values. One reason for this is that long reviews might have lots of near-neutral words that bring the average closer to the middle. One way to deal with this is to ignore stop words. Stop words are words like I, you, it, who, is, and, etc. that are very common but don't carry much meaning.
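
To see the averaging effect concretely, here is a small sketch. It assumes the text score is the plain average of the per-word scores, with 2 as the neutral midpoint. The word scores in `example_scores` are made up for illustration and do not come from movie_reviews.csv, and `average_score` is a hypothetical helper name.

```python
# Made-up word scores for illustration only (2 is treated as neutral).
example_scores = {"the": 2.0, "movie": 2.0, "was": 2.0, "terrible": 0.5}


def average_score(words, scores):
    """Return the mean score of the words in the list."""
    total = 0
    for word in words:
        total += scores[word]
    return total / len(words)


short_review = ["terrible"]
long_review = ["the", "movie", "was", "terrible"]

print(average_score(short_review, example_scores))  # 0.5
print(average_score(long_review, example_scores))   # 1.625
```

The three neutral words pull the longer review from 0.5 up to 1.625, even though both reviews express the same opinion.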

I have provided a file for you called stopwords.txt, which contains a list of stop words (I got it from a Python text analytics package here: https://gist.github.com/sebleier/554280). We're going to change the sentiment analysis program so that it ignores these words when calculating the sentiment score.

Exercise 1: Write a function called remove_stopwords which takes a string as an argument, removes any word from the stopwords.txt file, and returns the resulting string. You could unit test it with examples like this:
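
A few checks could be sketched as follows. The expected strings assume that the, was, and it appear in stopwords.txt while movie, great, and best do not. `check_remove_stopwords` is a hypothetical helper name, not something the lab requires; it takes the function under test as a parameter only so that the block runs on its own.

```python
def check_remove_stopwords(remove_stopwords):
    """Run a few example checks against a remove_stopwords function."""
    # Assumes "the", "was", and "it" are listed in stopwords.txt.
    assert remove_stopwords("the movie was great") == "movie great"
    assert remove_stopwords("it was the best") == "best"
    # A string with no stop words should come back unchanged.
    assert remove_stopwords("great movie") == "great movie"
    print("All checks passed")
```

Once your function is written, you could run `check_remove_stopwords(remove_stopwords)` and add more cases of your own.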

This problem will require you to do a lot of the things you've learned to do so far in this course, including reading from files, writing functions with parameters and return values, and working with lists. You are welcome/encouraged to split this into more than one function.

Hint: I suggest that you convert the parameter string into a list using .split(" ") like you did in Lab 12. Then, loop through the list of words from stopwords.txt, and check whether each one is in your list. If it is, remove it (you may need to keep removing it until there are no more copies left). Finally, create a new string by concatenating the words that remain in the list, separated by spaces.
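
One way the hint could translate into code is sketched below. It is only one possible approach, and it rests on some assumptions: stopwords.txt holds one stop word per line, `load_stopwords` is a hypothetical helper name, and the optional `filename` parameter is added here only so the function can be pointed at a different file when testing.

```python
def load_stopwords(filename="stopwords.txt"):
    """Read the stop words file into a list (assumes one word per line)."""
    stopwords = []
    with open(filename) as stopword_file:
        for line in stopword_file:
            word = line.strip()
            if word != "":
                stopwords.append(word)
    return stopwords


def remove_stopwords(text, filename="stopwords.txt"):
    """Return text with every word found in the stop words file removed."""
    words = text.split(" ")
    for stopword in load_stopwords(filename):
        # list.remove only deletes the first match, so repeat until none remain.
        while stopword in words:
            words.remove(stopword)
    return " ".join(words)
```

The comparison is case-sensitive, so "The" would not match a lowercase entry "the". If the reviews you score may contain capital letters, you may want to handle that before removing words.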

Exercise 2: Change your user_interaction function so that it calls your remove_stopwords function on the input you get from the user, and then gets the sentiment score of the resulting string. You should find that most of your test cases have more extreme values, which should make you more confident when you conclude that the sentiment is negative or positive.
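
The order of operations can be sketched as below. `score_review` is a hypothetical helper name; it takes the two functions as parameters only so the sketch runs on its own, whereas in your file user_interaction would call your own remove_stopwords and text_score directly.

```python
def score_review(review, remove_stopwords, text_score):
    """Strip stop words from the review, then score what is left."""
    cleaned = remove_stopwords(review)
    return text_score(cleaned)
```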

The end result might look something like this:

Turning it in

You are now going to submit your code with all of your functions. Make sure your file is named sentiment_analysis.py, and make sure you have functions named word_score, text_score, and remove_stopwords as described in Labs 12 and 13.

Important: comment out the call to user_interaction() and any other testing code that you wrote outside of your functions. My automated tests are not set up to provide user input; they will only unit-test your functions.

Submit your file to the Lab 12-13: Sentiment Analysis assignment on codePost. You do not need to submit stopwords.txt or movie_reviews.csv; I already set up codePost with a copy of each of those files.

I have written automated tests for each of word_score, text_score, and remove_stopwords. Since this is the product of two lab assignments, it will be worth a total of 8 points.