The Steven Quill Blog: Building a Corpus Using Python Beautiful Soup, Visualizing in R with Word Cloud

Hey Everyone,

I just wrapped up a project that breaks down Barack Obama's speeches over time, and thought I'd share the code. My background is in Python (2.7), but the finished product had to use data visualizations from R, so for this project we'll be splitting time between the two languages, using Jupyter Notebook as the editor in a Windows environment.

Beautiful Soup is a powerful library for web scraping in Python, which allows us to select certain elements of a website based on values in HTML tags.

For this walkthrough, I'll show how to leverage Beautiful Soup to create a corpus from online html documents, then use specific libraries in R to create a compelling visualization.

Set-Up:

- I used Jupyter Notebook as my IDE. It's a versatile tool that supports over 40 languages, and is ideal for the kind of step-by-step exploratory work that data analysis calls for. Jupyter operates cleanest when it's installed as part of Anaconda.

- Download and install R. To make the R kernel available to Jupyter, follow the directions posted here.

- Install the BeautifulSoup Python library, which does not come with Anaconda. Basic instructions for retrieving BeautifulSoup can be found here.

Data Collection using Beautiful Soup:

Let's begin by importing the necessary libraries for this analysis in Python:
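A minimal set of imports for the steps that follow might look like the sketch below. I'm assuming Python 3 module names here (on Python 2.7, urlopen lives in urllib2 and urljoin in urlparse), and urljoin is only needed if the links on the page turn out to be relative.

```python
# Assumed imports for the scraping steps (Python 3 names).
from urllib.request import urlopen  # downloads a page; urllib2.urlopen on Python 2.7
from urllib.parse import urljoin    # resolves relative links against a base URL
from bs4 import BeautifulSoup       # parses HTML into a searchable tree
```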

Next, we'll select the website with the information we want. For this exercise, we'll be accessing AmericanRhetoric.com's log of Barack Obama Speeches.

As you can see, we have a list of links to speeches, and dates, saved in a web table. We'll save the webpage's data in a Soup object. If no parser is named, Beautiful Soup picks the best HTML parser installed, and soup.prettify() returns an indented rendering of the page's HTML, so we can see which tags are available in our soup object.
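Here's a rough sketch of that step, written for Python 3. The fetch_soup helper and the sample_html string are my own stand-ins (the sample is not the real page), and I name "html.parser" explicitly so the result doesn't depend on which parsers happen to be installed.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page and parse it (not called below, so nothing hits the network)."""
    html = urlopen(url).read()
    return BeautifulSoup(html, "html.parser")


# Hypothetical stand-in for the downloaded speech index page.
sample_html = (
    "<html><body><table><tr>"
    "<td width='114'><font>sample date</font></td>"
    "</tr></table></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

pretty = soup.prettify()  # indented rendering of the whole parse tree
print(pretty)
```

With a live page, you would call fetch_soup with the address of the speech index instead of parsing sample_html.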

For the project, I wanted to get a chronological breakdown of Obama's speeches, so getting dates was critical. This is where we begin to see the power of Beautiful Soup. The html code for the dates followed a standard format:

In more modern websites, similar data fields would typically be identified with an 'id' property. For this older website design, however, we had to find different criteria. After a bit of digging around in the site's code (using Developer Tools in Chrome (F12)), I discovered that the tags containing dates were exclusively set to a width of 114.

In Beautiful Soup, you can search by tag, e.g. soup.find_all('p'), by attribute value, e.g. soup.find_all(font="Verdana"), or by a combination of the two, e.g. soup.find_all('p', font="Verdana").
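To make the three kinds of search concrete, here's a small sketch on a made-up snippet of HTML (Python 3). The markup and the 'face' attribute are assumptions chosen for the demo, not taken from the speech site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just to exercise the three search styles.
sample_html = """
<p><font face="Verdana">first</font></p>
<p><font face="Arial">second</font></p>
<span face="Verdana">third</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = soup.find_all("p")                        # by tag name
verdana_any = soup.find_all(face="Verdana")            # by attribute value, any tag
verdana_fonts = soup.find_all("font", face="Verdana")  # tag name and attribute together

print(len(paragraphs), len(verdana_any), len(verdana_fonts))
```

The combined search is the narrowest: it skips the span even though its attribute matches.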

This unique width allows us to build a query in Beautiful Soup that only returns date cells, and collect an array from our results.
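A sketch of that query, assuming Python 3, could look like the following. The table rows and the date format in sample_html are invented for the example. Only the width of 114 and the font child come from the page as I described it.

```python
from bs4 import BeautifulSoup

# Hypothetical rows imitating the layout described above.
sample_html = """
<table>
  <tr><td width="114"><font size="2"> 01 January 2008 </font></td>
      <td width="329"><a href="speech1.htm">First Speech</a></td></tr>
  <tr><td width="114"><font size="2">15 March 2009</font></td>
      <td width="329"><a href="speech2.htm">Second Speech</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

dates = []
for date_cell in soup.find_all("td", width="114"):
    raw = date_cell.font.contents[0]  # first child of the cell's <font> tag
    dates.append(str(raw).strip())    # plain str with surrounding whitespace removed

print(dates)
```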

The main piece of code here is 'date_cell.font.contents[0]'. Our 'font' tag is a child of 'td' (date_cell), and thanks to Beautiful Soup, we can access it with a simple period ('.'). 'contents[0]' then returns the first child of the font tag, which here is the date string.

Beautiful Soup converts every document it parses to Unicode, so in Python 2.7 the strings it returns are unicode objects rather than plain byte strings. When we set the 'date' variable, the various transformations ensure that we have a correctly parsed string variable, so we don't have to worry about any unicode parsing issues later on.

Next, we need to access the hyperlinks listed on the page, and retrieve the contents of the speech behind each link. We'll accomplish this in two steps. First, we capture the URL of each referenced speech, contained in the 'href' attribute of the 'a' tag in each cell of the 'Links' column.

Similar to the date tags, width=329 ended up being the defining attribute of the link's parent tags. Now we collect the links we'll need:
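The link collection can be sketched along the same lines (Python 3). BASE_URL and the sample rows are hypothetical, and joining each href against a base address is my assumption for the case where the page uses relative links.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://example.com/speeches/"  # hypothetical base address

# Hypothetical rows; the last link cell has no <a> tag on purpose.
sample_html = """
<table>
  <tr><td width="329"><a href="speech1.htm">First Speech</a></td></tr>
  <tr><td width="329"><a href="speech2.htm">Second Speech</a></td></tr>
  <tr><td width="329">No link in this cell</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

speech_htmls = []
for link_cell in soup.find_all("td", width="329"):
    anchor = link_cell.find("a", href=True)  # first <a> that has an href attribute
    if anchor is None:
        continue  # skip cells without a usable link
    speech_htmls.append(urljoin(BASE_URL, anchor["href"]))

print(speech_htmls)
```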

To validate our work so far, we test whether the date and link arrays are stored as expected, as strings.
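One way to run that check is sketched below, with short made-up lists standing in for the real results. I've also added a length comparison, which is my own extra check, since each date is later paired with a link. On Python 2.7 you would test against basestring rather than str.

```python
# Hypothetical stand-ins for the arrays collected above.
dates = ["01 January 2008", "15 March 2009"]
speech_htmls = [
    "http://example.com/speeches/speech1.htm",
    "http://example.com/speeches/speech2.htm",
]


def all_strings(items):
    """Return True when every element of items is a str."""
    return all(isinstance(item, str) for item in items)


dates_ok = all_strings(dates)
links_ok = all_strings(speech_htmls)
same_length = len(dates) == len(speech_htmls)  # one date per link

print(dates_ok, links_ok, same_length)
```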

Looks like that's all worked correctly.

Next is our final step of the collection phase, where we use our collected links to access the pages, and retrieve the body of the text. For this step, we'll use a technique similar to the one we used earlier to pull in paragraphs, and save them to an array, which will then be saved to a file. The actual speech pages we'll be parsing appear as follows:

The defining feature of a speech paragraph was a 'font' tag, with a size of '2', that also contained a string.
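As a sketch of that filter (Python 3), I read "contained a string" as "the tag's .string is not None", which is an assumption on my part. The sample markup is invented to show which tags get kept and which get dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <font> tag should pass the filter.
sample_html = """
<font size="2">Thank you, everybody.</font>
<font size="2"><b>Nested</b> markup, so .string is None</font>
<font size="3">A heading in a larger size</font>
<font size="2"></font>
"""
soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = [
    str(font_tag.string)
    for font_tag in soup.find_all("font", size="2")
    if font_tag.string is not None  # keep tags whose only child is a string
]

print(paragraphs)
```

If the real pages mix bold or italic text into paragraphs, get_text() on the tag would be a more forgiving choice than .string.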

To retrieve the text of each speech, we will:

- Retrieve the needed html page, using the url from our 'speech_htmls' array

- Save it as a soup object

- Retrieve each paragraph and save it to an array as a string

- Write that array of strings to a file, named with the date the speech was given.
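The four steps above can be sketched as follows, assuming Python 3. The build_corpus and extract_paragraphs names, the fetch argument, and the fake page are all hypothetical. Passing the downloader in as an argument lets the demo run without touching the network.

```python
import os
import tempfile
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Parse one speech page and return its paragraph strings."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        str(tag.string).strip()
        for tag in soup.find_all("font", size="2")
        if tag.string is not None
    ]


def build_corpus(speech_htmls, dates, out_dir, fetch):
    """Write one text file per speech, named after its date.

    fetch is any callable that maps a URL to that page's HTML.
    """
    written = []
    for url, date in zip(speech_htmls, dates):
        paragraphs = extract_paragraphs(fetch(url))
        path = os.path.join(out_dir, date + ".txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(paragraphs))
        written.append(path)
    return written


# Demo with a fake page in place of a real download.
fake_pages = {
    "http://example.com/a.htm":
        '<font size="2">Yes we can.</font><font size="2">Thank you.</font>',
}
out_dir = tempfile.mkdtemp()
paths = build_corpus(list(fake_pages), ["04 November 2008"], out_dir, fake_pages.get)
print(paths)
```

For real pages, fetch could be a small function that calls urlopen(url).read().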

And there you have it. You now have your corpus.

Making the Word Cloud in R

To make the word cloud, we will leverage the following packages in R:

- tm: for easy scrubbing of our text files

- wordcloud: to automate the creation of the word cloud

- RColorBrewer: to access palettes of impactful, nice-looking colors

In Windows File Explorer, I separated each speech in the corpFile folder by year, by searching for *[speech year].txt, and putting the results in individualized folders. Each of these will be a separate corpus to generate a word cloud from:
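If you'd rather not sort the files by hand, the same split could be scripted. This is only a sketch (Python 3), assuming every file name ends in the four-digit year followed by .txt. The split_by_year name and the demo file names are made up.

```python
import glob
import os
import shutil
import tempfile


def split_by_year(corpus_dir, years):
    """Move each '*<year>.txt' file in corpus_dir into a subfolder named after the year."""
    moved = {}
    for year in years:
        year_dir = os.path.join(corpus_dir, str(year))
        os.makedirs(year_dir, exist_ok=True)
        pattern = os.path.join(corpus_dir, "*{}.txt".format(year))
        matches = sorted(glob.glob(pattern))
        for path in matches:
            shutil.move(path, os.path.join(year_dir, os.path.basename(path)))
        moved[year] = [os.path.basename(path) for path in matches]
    return moved


# Demo in a throwaway folder standing in for corpFile.
corpus_dir = tempfile.mkdtemp()
for name in ["04 November 2008.txt", "28 August 2008.txt", "20 January 2009.txt"]:
    with open(os.path.join(corpus_dir, name), "w") as handle:
        handle.write("placeholder text")

moved = split_by_year(corpus_dir, [2008, 2009])
print(moved)
```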

Next we need to clean the corpus. We will do this by using the 'tm_map' command from the 'tm' library. We will do the following:

- stripWhitespace: remove white space

- tolower: transform all text to lowercase

- removeWords, stopwords("english"): remove common English words such as 'a', 'the', 'it', etc.

- stemDocument: Break down similar words (jump, jumped, jumping) to their root word (jump).

- removePunctuation: remove punctuation

- PlainTextDocument: the other transformations change the datatype of your .txt files, so this transformation returns the files back to plain text, which is the data type accepted by the cloud creation function we'll use next.
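To get a feel for what these transformations do to the text, here is a rough Python analogue. It is not the tm code, only an illustration of the effect of most of the steps on one sentence. The stop word list is a tiny made-up one rather than tm's, stemming is left out, and punctuation is removed before the stop words so that a word like 'it!' is still caught.

```python
import re
import string

STOPWORDS = {"a", "the", "it", "and", "of"}  # tiny hypothetical list, not tm's


def clean(text):
    """Apply rough equivalents of the cleaning steps to one string."""
    text = re.sub(r"\s+", " ", text).strip()  # like stripWhitespace
    text = text.lower()                       # like tolower
    text = text.translate(str.maketrans("", "", string.punctuation))  # like removePunctuation
    words = [word for word in text.split() if word not in STOPWORDS]  # like removeWords
    return " ".join(words)


cleaned = clean("The  economy, and the   people of it!")
print(cleaned)
```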

Finally, use the 'wordcloud' function to generate our word cloud from the corpus, using ColorBrewer to build our color palette. (For a list of arguments that can be passed to the 'wordcloud' function, click here.)

And there you have it! You've now created a word cloud for a corpus of President Obama's speeches in 2008. Since we broke it down by year, I created and screen captured the clouds for each year's corpus. Here's the final result:

I hope you enjoyed the tutorial!

If you have any questions, feel free to reach out in the comments below.

Until next time,

- Greg Lewis

The Steven Quill Blog

Wednesday, April 27, 2016
