Assignment 10

Due: by the end of the calendar day on Monday, April 17, 2023

What are we doing?

As we get ready to do our course projects, I wonder if we can do some cool things with data collected from websites. This will require you to experiment and think about the best way to organize the data for the task you have in mind. Before jumping in, I want us to try a smaller version of this first. I strongly encourage you to use ChatGPT or a similar tool (maybe Google Bard) and fill out the AI Assisted Learning Reflection. As you do this, don’t just blindly copy code - ask follow-up questions and try to understand how it works.

Web Scrapers

Web scrapers are programs that collect data from web pages. Web scraping can be a great way to get data that is displayed on the web when it isn’t available via some other more convenient source like a web API. The downside to web scraping is that to get the information you need, you often have to sift through lots of HTML code - the markup-language instructions that tell web browsers how to display the information.

An example

As we’ve seen, you can access data over the web within your Python programs using the requests module (if you didn’t do this earlier in the course, you’ll need to install requests using pip). Try this example, which shows what the source code looks like for the URL https://en.wikipedia.org/wiki/Mars:

import requests

url = "https://en.wikipedia.org/wiki/Mars"
response = requests.get(url)

print(response.text)

Try this out and see what the data looks like - there’s a lot of it, and unless you are familiar with HTML, it can be overwhelming.

Whenever an HTML page links to another one, you’ll see tags of the form <a href="some_other_page">some text</a>. For example, if you change your code above to something like

import requests

url = "https://en.wikipedia.org/wiki/Mars"
response = requests.get(url)

print(response.text[135000:136000])  # look at a slice of 1000 characters in the middle of the page

you’ll see output like this

></tr></tbody></table>
<p><b>Mars</b> is the fourth <a href="/wiki/Planet" title="Planet">planet</a> from the <a href="/wiki/Sun" title="Sun">Sun</a> and the second-smallest planet in the <a href="/wiki/Solar_System" title="Solar System">Solar System</a>, larger only than <a href="/wiki/Mercury_(planet)" title="Mercury (planet)">Mercury</a>. In the <a href="/wiki/English_language" title="English language">English language</a>, Mars is named for the <a href="/wiki/Mars_(mythology)" title="Mars (mythology)">Roman god of war</a>. Mars is a <a href="/wiki/Terrestrial_planet" title="Terrestrial planet">terrestrial planet</a> with <a href="/wiki/Atmosphere_of_Mars" title="Atmosphere of Mars">a thin atmosphere</a> and has a crust primarily composed of elements similar to Earth's crust, as well as a core made of iron and nickel. Mars has surface features such as <a href="/wiki/Impact_crater" title="Impact crater">impact craters</a>, <a href="/wiki/Valley" title="Valley">valleys</a>, <a href="/

Notice that there are links to a bunch of pages, and they match what you see at https://en.wikipedia.org/wiki/Mars.
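
One thing to notice in that output is that the href values are relative (they start with /wiki/ rather than https://). Here is a small sketch of one way to turn them into full URLs using urljoin from Python's standard urllib.parse module. The variable names (base_url, hrefs, absolute_links) are just for illustration, and the hrefs are copied by hand from the sample output above.

```python
from urllib.parse import urljoin

# The page the links were found on (from the example above).
base_url = "https://en.wikipedia.org/wiki/Mars"

# A few href values copied by hand from the sample output above.
hrefs = ["/wiki/Planet", "/wiki/Sun", "/wiki/Mercury_(planet)"]

# urljoin resolves each relative href against the URL of the page it came from.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

Whether you want relative or full URLs in your results probably depends on what you plan to do with the links afterwards.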

Assignment Requirements

I would like you to write a function that can take a URL as an argument and return a list (or some other collection data structure) of all the links that appear in that page.

Rather than trying to do this completely from scratch, try using one of the AI tools discussed above to help you - there are probably some Python modules that you can learn to use that will do much of the work for you.
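
To give a sense of what such a module might look like in use, here is a rough sketch built on HTMLParser from Python's standard library. It is one possible starting point, not the required approach, and third-party packages such as Beautiful Soup are another common choice. The names LinkCollector and extract_links are made up for this example, and the function works on a string of HTML (such as response.text from the requests example above) rather than on a URL, so fetching the page is left to you.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: remembers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return a list of the href values found in a string of HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# A shortened piece of the sample output shown earlier.
sample = (
    '<p><b>Mars</b> is the fourth '
    '<a href="/wiki/Planet" title="Planet">planet</a> from the '
    '<a href="/wiki/Sun" title="Sun">Sun</a>.</p>'
)
print(extract_links(sample))
```

For the assignment you would still need to wrap something like this in a function that takes a URL, and you should make sure you understand each piece before turning it in.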

What to turn in

Turn in the following:

  • a .py file with your code
  • in your .py file, include a comment that shows a sample run - what happened when you tested your code
  • if you use an AI tool, fill out and submit the AI Assisted Learning Reflection via the link on the course Blackboard page

Submit your code to Assignment 10: Web Scraping on codePost. There is no automated testing.

Grading

This is a 4-point assignment. See the rubric from the syllabus. Even if you do not get everything working correctly, if you attempt to use the AI tool to assist you, and your AI Assisted Learning Reflection shows you made an effort, you will get full credit.