# Scraping the web

## 1. The State of the Union

[The American Presidency Project](http://www.presidency.ucsb.edu) is a treasure trove of official presidential speeches, press releases, news conferences, and other documents from US presidential history. The site does not offer an API, though, so if we want to download its data, we'll have to scrape it. Thankfully, the HTML code underlying the site is well structured, so we can use web-scraping tools to extract and parse data without too much trouble.

Let's start with the [2018 State of the Union address](http://www.presidency.ucsb.edu/ws/index.php?pid=128921). You can view it in a browser, but we can do a lot more with the raw HTML data in Python.

Import the requisite libraries, and download the 2018 State of the Union speech.

- Import the `requests` package.
- Use the `get()` function of `requests` to download the full HTML code for the 2018 SOTU and store the resulting response object in `sotu`. URL: `http://www.presidency.ucsb.edu/ws/index.php?pid=128921`
- Use the `text` attribute (not method!) of `sotu` to print the HTML code.
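
As a rough sketch of the download pattern described above, the helper below wraps `requests.get()` and returns the page's HTML. The function name `fetch_html`, the `timeout` value, and the `raise_for_status()` check are illustrative assumptions, not part of the exercise.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise an exception on 4xx/5xx responses instead of silently continuing
    response.raise_for_status()
    # `text` is an attribute, so there are no parentheses after it
    return response.text


# Example call (needs network access):
# html = fetch_html("http://www.presidency.ucsb.edu/ws/index.php?pid=128921")
# print(html[:500])
```

Note that `requests.get()` returns a response object rather than a string; the HTML itself lives in that object's `text` attribute.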

<hr>

Be sure to check your notes (or the course slides) from Importing Data in Python, Parts 1 & 2, if necessary.

## Good to know

You can find (and print out -- maybe even laminate!) a helpful [pandas cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) to remind yourself of some of the basic commands and workflows we'll be exploring here.

This project builds on skills learned in the DataCamp course, [Importing Data in Python, Part 2](https://campus.datacamp.com/courses/importing-data-in-python-part-2).

In [None]:
# importing requests
...

# getting the SOTU and assigning the response object to `sotu`
sotu = ...

# printing the text attribute of `sotu`
print(...)

## 2. Trump's text

The HTML code is useful, especially for documents containing links -- but we're interested in the text of the speech.

Use `BeautifulSoup()` to extract the speech text from the larger HTML file.

- Import the `BeautifulSoup` class from the `bs4` package.
- Use `BeautifulSoup` to parse the text from `sotu` and make a new object `soup`.
- Use the `find_all()` method to extract only paragraphs from the page (using the HTML tag `p`) and store the results in `speech_paragraphs`.
- Print the results.

<hr/>

The reason we are using `find_all()` instead of `get_text()` is that there is a lot of text on the page that doesn't belong to the speech. However, none of the other text is contained in paragraph tags (`<p>`), so by pulling only the paragraphs, we'll get the text of the speech.
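
To see the difference on a small scale, here is a minimal sketch using a made-up HTML snippet. The `sample_html` string and its `menu`/`footer` elements are assumptions for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<html>
  <body>
    <div class="menu">Home | Documents | Search</div>
    <p>Mr. Speaker, Mr. Vice President, Members of Congress:</p>
    <p>Thank you.</p>
    <div class="footer">Citation information</div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text() on the whole document returns everything, menu and footer included
all_text = sample_soup.get_text()

# find_all("p") returns only the <p> tags, each still wrapped in its HTML
sample_paragraphs = sample_soup.find_all("p")

print("Home" in all_text)
print(len(sample_paragraphs))
for tag in sample_paragraphs:
    print(tag)
```

The second argument, `"html.parser"`, selects Python's built-in parser so that no extra parsing library is needed.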

In [None]:
# importing BeautifulSoup
...

# parsing the SOTU HTML text with BeautifulSoup
soup = ...

# extracting the speech text (paragraphs)
speech_paragraphs = ...

# printing the speech text
print(speech_paragraphs)

## 3. Finalizing the speech text

We've successfully homed in on the speech, but `find_all()` gave us a list of `BeautifulSoup` tag objects, not an actual block of text. Let's clean that up.

Loop through the paragraphs in `speech_paragraphs`, extract the text from each one, and store the results in the string `speech`.

- Initialize an empty string, `speech`.
- Loop through each `paragraph` in `speech_paragraphs`.
- For each paragraph, use `get_text()` to extract the text from the HTML object. Concatenate the resulting text to `speech` (note the use of `+=`).
- Print the results.
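
The accumulation pattern in these steps might look like the following sketch, run on a tiny made-up snippet rather than the real speech. The names `sample_html`, `sample_paragraphs`, and `text` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
sample_html = "<p>First <b>bold</b> point.</p><p>Second point.</p>"
sample_paragraphs = BeautifulSoup(sample_html, "html.parser").find_all("p")

# start with an empty string and grow it one paragraph at a time
text = ""
for tag in sample_paragraphs:
    # get_text() drops the tags (including nested ones like <b>) and keeps the words
    text += tag.get_text()
    # blank line between paragraphs
    text += "\n\n"

print(text)
```

Here `x += y` is shorthand for `x = x + y`, which for strings means appending `y` to the end of `x`.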

In [None]:
# initializing the string that will contain the speech
speech = ...

# looping through the paragraphs to extract the speech text
for ... in ...:
    # extracting the paragraph text
    speech += ...
    # adding some white space between paragraphs
    speech += '\n\n'

# printing the speech text
print(speech)

## 4. Historical states of the union

That's better! Much more readable, and in shape for us to do some _natural language processing_ on the text in the future.

But The American Presidency Project has more than just the most recent State of the Union. Let's pull all of Barack Obama's SOTU speeches, too!

Take the code from the previous three tasks and tweak it to define a new function, `scrape_sotu()`, that will download and pre-process any SOTU speech from The American Presidency Project.

- Define a new function `scrape_sotu()` that takes a single argument: `url`.
- Use the appropriate code from the above three tasks to perform the same scraping and pre-processing on _any_ SOTU speech passed through it.
- Replace the `print()` function above with a `return` statement that will return the speech text rather than print it.
- Loop through all of Obama's SOTU speeches and print the text of each one.
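
The key change in these steps is moving from a script that prints to a function that returns. A minimal sketch of that idea, assuming a hypothetical helper `extract_paragraph_text()` that takes an HTML string instead of a URL so it can run without a network connection:

```python
from bs4 import BeautifulSoup


def extract_paragraph_text(html):
    """Return the text of every <p> tag in `html`, separated by blank lines."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n\n".join(p.get_text() for p in soup.find_all("p"))


# Hypothetical pages standing in for downloaded speeches
pages = [
    "<h1>Page one</h1><p>First speech.</p>",
    "<h1>Page two</h1><p>Second speech.</p><p>Thank you.</p>",
]

# because the function returns a value, the caller decides what to do with it
results = []
for page in pages:
    results.append(extract_paragraph_text(page))

for result in results:
    print(result)
```

This sketch uses `str.join()` as an alternative to the `+=` loop; either approach works, though `join()` leaves no trailing blank line after the last paragraph.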

<hr/>

You can find the URLs for President Obama's speeches here: [http://www.presidency.ucsb.edu/sou.php](http://www.presidency.ucsb.edu/sou.php).

In [None]:
def scrape_sotu(url):
    """Download a file from a URL, extract text only from paragraph tags, and export the results as a single string."""
    
    # downloading the raw HTML file
    ...
    
    # parsing the SOTU HTML text with BeautifulSoup
    ...
    
    # extracting the speech text (paragraphs)
    ...
    
    # initializing the string that will contain the speech
    ...

    # looping through the paragraphs to extract the speech text
    ...
        # extracting the paragraph text
        ...
        # adding some white space between paragraphs
        ...

    # returning the speech text
    ...

# creating a list of URLs for Obama's SOTU speeches
obama = ...

# looping through Obama's SOTUs and printing the resulting speech text
for speech_url in obama:
    print(...)

## 5. Saving the data

Most of the time we won't just want to print our data; we'll at least want to save it. And as we learn more text-analysis tools, we'll have more data-science-y things to do with the text. But for now, let's save it.

Use a `for` loop to download each of Barack Obama's SOTU speeches and save it to its own text file.

- Establish a variable `year` with the numeric value of 2009, representing Obama's first speech.
- Loop through all of Obama's SOTU speeches in the same list `obama` as before.
- Inside the for loop, convert `year` to a string and combine it with the string `.txt` to create a filename for each speech.
- Inside the `with` context, scrape the speech text using the function you already created. Instead of printing the results, use the `.write()` method to write them to the text file.
- At the end of the loop, use `+=` to add 1 to the value of `year` so it is ready for the next speech.
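
The file-writing pattern can be tried out on its own before any scraping is involved. The sketch below is an illustration under stated assumptions: `sample_texts` is made-up data, and the files go into a temporary folder so nothing is left behind.

```python
import os
import tempfile

# Hypothetical stand-ins for three scraped speeches
sample_texts = ["first text", "second text", "third text"]

with tempfile.TemporaryDirectory() as folder:
    year = 2009
    for text in sample_texts:
        # str() is needed because a number and a string cannot be joined with +
        filename = os.path.join(folder, str(year) + ".txt")
        # the with block closes the file automatically when it ends
        with open(filename, "w", encoding="utf-8") as f:
            f.write(text)
        # move on to the next year
        year += 1
    saved = sorted(os.listdir(folder))

print(saved)
```

Opening a file in `'w'` mode overwrites any existing file of the same name, so rerunning the loop replaces earlier output rather than appending to it.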

In [None]:
# year of first speech in list
year = ...

# iterating through speeches in obama
for ... in ...:
    
    # establishing filename
    filename = ...
    
    # establish a context for writing the file
    with open(filename, 'w') as f:
        
        # scraping the speech text and writing to file
        ...
    
    # adding 1 to the year to prep for the next speech in the list
    year ...

## 6. Find your own data! (a.k.a., getting ready for midterm projects!!!)

I've assembled a collection of online sources for free, public datasets. Peruse some of these, and choose a dataset to explore on your own.

Once you've chosen a dataset, use `requests` or `pandas.read_csv()` to load that dataset directly into your Python script. (I.e., don't download it in your browser and read it in locally -- scrape or fetch it via Python code.)
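
As a rough sketch of loading CSV data without saving a file first, the example below parses CSV text held in memory. The `csv_text` contents and column names are invented for illustration; in practice the text would come from a downloaded response.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for the `text` attribute of a response
csv_text = """state,year,rate
A,2015,91.2
B,2015,88.5
A,2016,92.0
"""

# StringIO makes a string behave like an open file, which read_csv() accepts
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)

# a first data-driven question: how does the average rate differ by state?
mean_rate = df.groupby("state")["rate"].mean()
print(mean_rate)
```

`pandas.read_csv()` also accepts a URL string directly in place of a file path, which is often the shortest route when the dataset is published as a plain CSV file.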

In the cells that follow, perform some of the basic exploration on that dataset that we've already done in this course. Extract _at least one insight_ from that data. It doesn't need to be earth-shattering, but it _does_ need to be data-driven. This could be something like the following:

- Using immunization data, find a difference in vaccination schedules for children in different states, with parents from different economic statuses, etc.
- Plot the distribution of life expectancy to show differences by gender or nation.
- Find evidence of changing average weather patterns due to climate change.
- Establish the actor(s) most likely to appear in movies alongside your favorite star.
- Demonstrate which president had the longest/shortest SOTU speeches.
- etc.

You should be able to load the file in a single code cell, and then use 3-5 additional code cells to process, analyze, and visualize your insight.

**Be sure to include some (but not a lot of) text, either in text/Markdown cells or in code comments, to explain what you're finding.**

<hr/>

Note: some datasets are more usable than others. If you're having trouble with one dataset, try another. Just because a dataset is public doesn't mean it's ready to be used by the public! Play around with a few until you find one that suits both your needs and your interests.