Wednesday, April 27, 2016

Building a Corpus Using Python Beautiful Soup, Visualizing in R with Word Cloud

Hey Everyone,

I just wrapped up a project that breaks down Barack Obama's speeches over time, and thought I'd share the code. My background is in Python (2.7), but the finished product had to use data visualizations from R. So for this project, we'll be splitting time between the two languages, using the Jupyter Notebook as an editor, in a Windows environment.

Beautiful Soup is a powerful library for web scraping in Python, which allows us to select certain elements of a website based on values in HTML tags.

For this walkthrough, I'll show how to use Beautiful Soup to create a corpus from online HTML documents, then use specific libraries in R to create a compelling visualization.

Set-Up:

- I used Jupyter Notebook as my IDE. It's a versatile tool that supports over 40 languages, and is ideal for the kind of step-by-step exploration needed in data analysis. Jupyter operates cleanest when it's installed as part of Anaconda.

- Download and install R. To make the R kernel available to Jupyter, follow the directions posted here.

- Install the BeautifulSoup Python library, which does not come with Anaconda. Basic instructions for retrieving BeautifulSoup can be found here.

Data Collection using Beautiful Soup:


Let's begin by importing the necessary libraries for this analysis in Python:





Next, we'll select the website with the information we want. For this exercise, we'll be accessing AmericanRhetoric.com's log of Barack Obama Speeches.


As you can see, we have a list of links to speeches, and dates, saved in a web table. We'll save the webpage's data in a Soup object. If we don't name a parser, Beautiful Soup will pick the best HTML parser installed, and soup.prettify() will give us an indented view of the HTML tags available in our soup object.
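As a minimal sketch of this step, the snippet below builds a Soup object from a small inline string instead of the live page. The sample markup, date, and link are invented stand-ins, and the parser is named explicitly ("html.parser", which ships with Python) as an assumption to keep results consistent across machines.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real page is far larger.
sample_html = """
<html><body><table>
<tr>
  <td width="114"><font face="Verdana" size="2">01 January 2009</font></td>
  <td width="329"><a href="speeches/example1.htm">Example Speech One</a></td>
</tr>
</table></body></html>
"""

# Build the Soup object, naming the parser rather than letting bs4 choose one.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parsed document as an indented string.
pretty = soup.prettify()
print(pretty)
```

For the real page, the same call would take the downloaded HTML in place of sample_html.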


For the project, I wanted to get a chronological breakdown of Obama's speeches, so getting dates was critical. This is where we begin to see the power of Beautiful Soup. The HTML code for the dates followed a standard format:





In more modern websites, similar data fields would typically be identified with an 'id' attribute. For this older website design, however, we had to find different criteria. After a bit of digging around in the site's code (using Chrome's Developer Tools, opened with F12), I discovered that the tags containing dates were exclusively set to a width of 114.

In Beautiful Soup, you can search by tag, e.g. soup.find_all('p'), by attribute value, e.g. soup.find_all(font="Verdana"), or by a combination of the two, e.g. soup.find_all('p', font="Verdana").
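Here is a small sketch of those three kinds of search, run against invented sample markup rather than the real page. It uses the width attribute instead of font so that it lines up with the date cells discussed in this post.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two table cells plus a paragraph sharing one width.
sample_html = """
<table><tr>
  <td width="114"><font face="Verdana">01 January 2009</font></td>
  <td width="329"><font face="Verdana">Example Speech</font></td>
</tr></table>
<p width="114">A paragraph that happens to share the width</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_tag = soup.find_all("td")                 # every <td> tag
by_value = soup.find_all(width="114")        # any tag whose width is "114"
combined = soup.find_all("td", width="114")  # only <td> tags with width="114"

print(len(by_tag), len(by_value), len(combined))
```

Searching by value alone also picks up the stray paragraph, which is why combining the tag name with the attribute gives the tightest match.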

This unique width allows us to build a query in Beautiful Soup that only returns date cells, and collect an array from our results.









The main piece of code here is 'date_cell.font.contents[0]'. Our 'font' tag is a child of 'td' (date_cell), and thanks to Beautiful Soup, we can access it with a simple period ('.').  'contents[0]' then returns the string value of the font tag, which is the date string.
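The child access described above can be sketched as follows. This is a simplified illustration against invented sample rows, not the exact code from the project. Only the name 'date_cell' comes from the post, and the str() and strip() calls are an assumed way of getting a plain string.

```python
from bs4 import BeautifulSoup

# Invented rows that mimic the layout described in the post.
sample_html = """
<table>
<tr><td width="114"><font face="Verdana" size="2">01 January 2009</font></td>
    <td width="329"><a href="speeches/example1.htm">Example Speech One</a></td></tr>
<tr><td width="114"><font face="Verdana" size="2">02 February 2010</font></td>
    <td width="329"><a href="speeches/example2.htm">Example Speech Two</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

dates = []
for date_cell in soup.find_all("td", width="114"):
    # .font steps down to the first <font> child of the cell;
    # contents[0] is the first child node of that tag (here, the date text).
    date = date_cell.font.contents[0]
    dates.append(str(date).strip())

print(dates)
```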

Beautiful Soup converts the documents it parses to Unicode, so in Python 2.7 the values it returns are unicode objects rather than plain strings. When we set the 'date' variable, the various transformations ensure that we have a correctly parsed string variable, so we don't have to worry about any unicode parsing issues later on.

Next, we need to access the hyperlinks listed on the page, and retrieve the contents of the speech behind each link. We'll accomplish this in two steps. First, we capture the URL of each referenced speech, contained in the 'href' attribute of the 'a' tag of the cells in the 'Links' column.


Similar to the date tags, width=329 ended up being the defining attribute of the link's parent tags. Now we collect the links we'll need:
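A minimal sketch of the link collection, again against invented sample rows. The 'speech_htmls' name is the one used later in this post. The guard for cells without a link is an added assumption, since the real page may not need it.

```python
from bs4 import BeautifulSoup

# Invented rows; the last link cell deliberately has no <a> tag.
sample_html = """
<table>
<tr><td width="114"><font size="2">01 January 2009</font></td>
    <td width="329"><a href="speeches/example1.htm">Example Speech One</a></td></tr>
<tr><td width="114"><font size="2">02 February 2010</font></td>
    <td width="329"><a href="speeches/example2.htm">Example Speech Two</a></td></tr>
<tr><td width="114"><font size="2">03 March 2011</font></td>
    <td width="329">No link available</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

speech_htmls = []
for link_cell in soup.find_all("td", width="329"):
    anchor = link_cell.a  # first <a> tag inside the cell, or None
    if anchor is not None and anchor.has_attr("href"):
        speech_htmls.append(str(anchor["href"]))

print(speech_htmls)
```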

To validate our work so far, we test whether the date and link arrays are stored as expected, as strings.

Looks like that's all worked correctly.

Next is our final step of the collection phase, where we use our collected links to access the pages, and retrieve the body of the text. For this step, we'll use a technique similar to the one we used earlier to pull in paragraphs, and save them to an array, which will then be saved to a file. The actual speech pages we'll be parsing appear as follows:


The defining feature of a speech paragraph was a 'font' tag, with a size of '2', that also contained a string. 

To retrieve the text of each speech, we will:
- Retrieve the needed html page, using the url from our 'speech_htmls' array
- Save it as a soup object
- Retrieve each paragraph and save it to an array as a string
- Write that array of strings to a file, named with the date the speech was given.
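The four steps above can be sketched as below. This is an illustration under assumptions, not the project's original code: the function names, the sample markup, the date string, and the UTF-8 output encoding are all hypothetical. The download helper is defined but not called, so the demo runs without a network connection.

```python
import io
import os
import tempfile

from bs4 import BeautifulSoup

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_html(url):
    """Step 1: download a page and return its raw bytes."""
    return urlopen(url).read()


def extract_paragraphs(html):
    """Steps 2 and 3: parse the page and keep non-empty size-2 font text."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for font_tag in soup.find_all("font", size="2"):
        text = font_tag.get_text().strip()
        if text:  # skip font tags that hold no text
            paragraphs.append(text)
    return paragraphs


def save_speech(paragraphs, date, folder):
    """Step 4: write the paragraphs to '<date>.txt' and return the path."""
    path = os.path.join(folder, date + ".txt")
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(u"\n".join(paragraphs))
    return path


# Demo on invented markup instead of a live speech page.
sample_speech_html = """
<p><font size="2">Thank you, everybody.</font></p>
<p><font size="2">   </font></p>
<p><font size="3">Site navigation text</font></p>
<p><font size="2">It is good to be here.</font></p>
"""
output_folder = tempfile.mkdtemp()
paragraphs = extract_paragraphs(sample_speech_html)
saved_path = save_speech(paragraphs, "01 January 2009", output_folder)
print(saved_path)
```

In a full run, each entry of 'speech_htmls' would be passed to fetch_html, with the matching date used as the file name.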


And there you have it. You now have your corpus.

Making the Word Cloud in R

To make the word cloud, we will leverage the following packages in R:
- tm: for easy scrubbing of our text files
- wordcloud: to automate the creation of the word cloud
- RColorBrewer: for access to a palette of impactful, nice-looking colors






In Windows File Explorer, I separated each speech in the corpFile folder by year, by searching for *[speech year].txt, and putting the results in individualized folders. Each of these will be a separate corpus to generate a word cloud from:
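I did this sorting by hand, but the same separation could be scripted. Here is a minimal Python sketch, assuming every file name ends with the four-digit year followed by .txt. The function name and the year list are hypothetical.

```python
import glob
import os
import shutil


def sort_speeches_by_year(corpus_folder, years):
    """Move files matching '*<year>.txt' into a subfolder named for the year."""
    for year in years:
        pattern = os.path.join(corpus_folder, "*%s.txt" % year)
        matches = glob.glob(pattern)
        if not matches:
            continue  # no speeches for this year, so no folder is created
        year_folder = os.path.join(corpus_folder, str(year))
        if not os.path.isdir(year_folder):
            os.makedirs(year_folder)
        for path in matches:
            shutil.move(path, os.path.join(year_folder, os.path.basename(path)))


# Example call (the year range here is an assumption):
# sort_speeches_by_year("corpFile", range(2004, 2017))
```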





Next we need to clean the corpus. We will do this by using the 'tm_map' command from the 'tm' library. We will do the following:

- stripWhitespace: remove white space
- tolower: transform all text to lowercase
- removeWords, stopwords("english"): remove common English words such as 'a', 'the', 'it', etc.
- stemDocument: Break down similar words (jump, jumped, jumping) to their root word (jump).
- removePunctuation: remove punctuation
- PlainTextDocument: the other transformations change the data type of your .txt files, so this transformation converts them back to plain text, which is the data type accepted by the word cloud function we'll use next.
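To make the effect of these transformations concrete, here is a rough plain-Python illustration of what most of them do to a piece of text. This is not the 'tm' API and not the R code used for the project. The stop-word list is a tiny invented sample, and stemming is left out because it needs a dedicated library.

```python
import re
import string

# A tiny illustrative stop-word list; tm's English list is much longer.
STOP_WORDS = {"a", "an", "the", "it", "is", "and", "of", "to"}


def clean_text(text):
    """Lowercase, drop punctuation, drop stop words, collapse whitespace."""
    text = text.lower()  # similar in spirit to tolower
    # Similar in spirit to removePunctuation: delete ASCII punctuation marks.
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Similar in spirit to removeWords with a stop-word list.
    words = [word for word in text.split() if word not in STOP_WORDS]
    # split() plus join() also collapses runs of whitespace (stripWhitespace).
    return " ".join(words)


print(clean_text("The  change we seek, it is HERE."))
```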




Finally, use the 'wordcloud' function to generate our word cloud from the corpus, using RColorBrewer to build our color palette. (For a list of arguments that can be passed to the 'wordcloud' function, click here.)





And there you have it! You've now created a word cloud for a corpus of President Obama's speeches in 2008. Since we broke it down by year, I created and screen captured the clouds for each year's corpus. Here's the final result:


I hope you enjoyed the tutorial!

If you have any questions, feel free to reach out in the comments below.

Until next time,

- Greg Lewis

Monday, January 11, 2016

Sports: Predicting the CFP Championship



Some time ago, I built a simple, Excel-based prediction engine for the inaugural "College Football Pick 'Em" game with my family.

I was surprised by how accurate it ended up being. In the bowl games, it was more accurate than the picks of Mike Slabach (ESPN College Football), an aggregate of CBS college football analysts, and Mike Norris of Bleacher Report (blew him out of the water). It bested every expert I could find.

It predicted the upsets by TCU and Baylor. It accurately predicted the first round of the playoff.

And it has Clemson winning (by the slimmest of margins) in the Playoff Championship.

Final Score: 32-30, Clemson.

Now the one thing the model has failed to do well is account for teams in the SEC. It's only at 50% when the SEC is involved, so we'll see what happens. Either way, it should be a great game.

Until next time,

-Greg Lewis

Friday, January 8, 2016

On Bell Curves and Human Investment


Characteristics of a Bell Curve

In statistics, a bell curve represents the simple understanding that many things with similar inputs can have different results. Bell curves show us that despite that difference, we can still come to an accurate understanding of how things will generally turn out.

Statistics does not concern itself with finding root causes for every result. Instead, it makes mathematical sense of the range of results and pulls out the distinctive threads that drive overall trends. In statistics, major impacts, turning points, and general trends that affect the entire curve are what really matter.

The most important part of the bell curve is the median. Located at the center of the curve, it coincides with the mean and the mode in a symmetric bell curve, so it also marks the most frequent results of a given situation. Things near the median are boring and expected, but critically important.
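For the curious, here is a quick Python 3 sketch of that central point, using simulated data. The center of 100, the spread of 15, and the sample size are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Draw simulated results from a bell curve centered on 100.
samples = [random.gauss(100, 15) for _ in range(10000)]

mean = statistics.mean(samples)
median = statistics.median(samples)

# For a symmetric bell curve the two land very close together.
print(round(mean, 2), round(median, 2))
```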

Outliers exist on the edges of every bell curve. Outliers represent unexpected outcomes, and we often hear about them on the news. The biggest fraud, the strongest man, the biggest storms. All outliers.

What statistics seeks to understand is not just the outliers, but also the 99 other cases of fraud, the 49 slightly weaker men, the 149 other minor storms. In other words, it asks "What moves the median? What changes the most frequent outcome (and the rest of the curve with it)?", as opposed to just understanding the outliers.

How Human Investment Works


Now some of you may be asking what this has to do with Human Investment (since it's mentioned in the title). Human Investment is a basic principle of Labor Economics. It goes something like this: The more you invest in a person, the better chance they have of doing well.

It sounds heartless, but has proven to be remarkably accurate in explaining how labor markets work. For example, if you invest in the best paint, suspension, engine, and transmission for your car, your car will have increased value. In a similar fashion, if a person obtains valuable skills like electrical repair, programming, teaching, or welding, they have increased value in the job market.

Cars with the same parts and repair histories don't all sell for the same price, and people with the same skills don't all get paid the same wage. There are differences based on location, popular opinion, even the weather, but generally, you can get an idea of how much you'd expect to sell the car for, or get paid, thanks to a bell curve.

The Human Investment Bell Curve


Now, bell curves become important here for highlighting two seemingly contradictory truths. You may hear people say "You are what you make of yourself. Hard work determines your life", and that is right. You may also hear people say "You are a result of your situation. Your environment determines your outcomes", and that is also right. Both are true.

The whole truth is that your economic inputs (social class, race, family situation, neighborhood, school district, etc.) put you on a bell curve, and the results of that curve vary widely. As an example, you might consider classmates who lived in your neighborhood, went to the same schools, and had a similar socio-economic situation, but now have a vastly different quality of life than you.

The Influence of Outliers


A prime example of an outlier is Republican presidential candidate Ben Carson. He grew up in the ghetto, but has become one of the most respected pediatric neurosurgeons in the world. He is a Human Investment outlier of the greatest kind.

Let's say Dr. Carson graduated with 300 other students. What has come of the 299 other kids in his graduating class? Their lives and economic opportunities represent the rest of his high school's bell curve. For every Ben Carson on the top end of performers, there is likely a counter-balancing person serving a life sentence in prison, who received the same opportunities, but did different things with them.

That leaves us with 298 students. If we are going to make the greatest impact on the greatest number of people, our public policy decisions should not focus on the Ben Carson or the prison lifer, but reflect the outcomes of the bulk of the population, the other 298 of Ben Carson's classmates. It should attempt to move the median.

Based on what we know of his environment, we can reasonably assume that the curve looks something like this: a few in prison, some unemployed, many underemployed, many employed at minimum wage, some employed above minimum wage, and a few successful businessmen. Each and every one of these people is as important as Ben Carson, and public policy should try to improve their median outcomes.

Takeaways


All of this should inform our perspective on the "1%" debate. When we look at business creators, we understand that they have likely performed above average on their bell curve.

We can understand that children born into wealth have a good shot at being wealthy, because of the Human Investments they receive. The median of their economic bell curve is high.

We also understand that children born into poverty have a good shot at remaining in poverty, because their family can't provide the same kind of Human Investments. The median of their economic bell curve is situated very low.

Their 'good shot' is simply a matter of where their median lies.

All of them have dictated the direction of their lives, and yet, all of them are also a result of their given opportunities. Both ideas are true, and complementary.

As we make decisions on policies we intend to support this upcoming election, I might suggest that we remember the principle of the Bell Curve in Human Investment, and move toward those policies that will help nudge the median (and therefore, the entire Bell Curve) up a little higher for our neighbors, ourselves, and our children.

Until Later,

Greg Lewis