Category Archives: R

Lies, Damned Lies, and Politicians

I like politics; I don’t like all of the lying involved.  If you ask me, I think that there should be “Ethics Committee” investigations into all of the lying.  Sure, tweeting a picture of your junk is probably not the best idea, but neither is lying.  And nearly all politicians are guilty [1].  Fortunately, the St. Petersburg Times started a website called Politifact [2] with the hopes of keeping some of these people honest.  I’m not sure it’s helping.

In any case, I wrote an R script to scrape the data from Politifact so that I could do some analysis.  I only got about as far as the following figure related to some of the Republican candidates and their propensity to lie.  The figure displays the number of statements made by each candidate that can be categorized as “True”, “Mostly True”, …, “False”, or “Pants on Fire” according to Politifact.

What can we take from this?  Well, Michele Bachmann is a big-time liar; ain’t no denying that.  She’s also a freaking nutjob.  Ron Paul probably lies the least, but nobody seems to care about him in the media.  Tim Pawlenty doesn’t lie too much.  Then again, he’s a wuss and dropped out anyway.  Mitt Romney seems pretty good when it comes to speaking the truth; it’s gotta be the Mormon background.  I suspect that he’ll lie a bit more in the upcoming months.  And Rick Perry…well, he’s just bat-shit crazy, so I’ll ignore him.

If I had the time, I would try to randomly select some Republicans and Democrats from both the House and Senate and analyze whether statement category (truth through pants on fire) is independent of political party and/or chamber of Congress.  If you’re interested in doing this, I would be happy to help you get started.  Have a look at my github repo for this project and give it a go!
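
For anyone who wants to pick this up, one way to check independence of party and statement category is a chi-squared test on a party-by-category table.  The sketch below is only an illustration: the counts are made up (they are NOT Politifact numbers), and the ratings between “Mostly True” and “False” are lumped into a single placeholder column.

```r
# Made-up counts for illustration only; these are NOT real Politifact numbers.
ratings <- c("True", "Mostly True", "(middle ratings)", "False", "Pants on Fire")
counts <- matrix(
  c(20, 25, 45, 30, 10,   # hypothetical Republican statements
    25, 30, 40, 20,  8),  # hypothetical Democratic statements
  nrow = 2, byrow = TRUE,
  dimnames = list(party = c("Republican", "Democrat"), rating = ratings)
)

# Pearson chi-squared test of independence between party and rating
result <- chisq.test(counts)
print(result)

# Expected counts under independence; very small values (say, below 5)
# would make the chi-squared approximation shaky
print(round(result$expected, 1))
```

Bringing chamber of Congress into it would mean a three-way table, where something like a log-linear model is probably a better fit than a single chi-squared test.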


[1] – Note that I said ‘nearly all’ because Dennis Kucinich doesn’t have a single statement classified as “Pants on Fire” on Politifact.

[2] – Winner of a Pulitzer Prize in 2009.


Filed under Data Mining, Politics, R, Scraping Data

Slides from Rocky Mtn SABR Meeting

Last Saturday I had the good fortune to present a talk on finding, gathering, and analyzing some sports-related data on the web at the local SABR group meeting.  In case you’re not familiar with the “SABR” acronym, it stands for “Society for American Baseball Research”; here’s a link to the national organization.  The talk was light on tech and heavy on graphs (predominantly made in R and in particular ggplot2).  Good times were had by all.  The slides from the talk are given below.  Most of the slides and code are recycled from previous talks, so I apologize in advance if you’re already familiar with the content.  It was, however, new to the SABR people.

Filed under Data Mining, Data Science, ggplot2, R, Scraping Data, Sports

Google Trends, R, and the NFL

A week or so ago I saw a tweet about how the NFL lockout was affecting the search traffic for “fantasy football” on Google (using Google Trends).  Basically, the search traffic (properly normalized on Google Trends) was down prior to the end of the lockout.  I decided to explore a bit with this myself and chose to look into the RGoogleTrends package for R.  The RGoogleTrends package is basically a convenient way to interact with the Google Trends API using R.  Very cool stuff.  All of my code can be found at my GoogleTrends repo on github.

My first query was to pull the Google Trends data for the past seven or so years using the search term “fantasy football”.  The following figure shows the results over time.  It’s immediately obvious that the normalized search scores for “fantasy football” were on the decline (2011 compared with previous years) prior to the end of the lockout; however, it appears that interest has since recovered.

I then decided to look at the trends for “NFL”.  There isn’t a dramatic decrease in the (normalized) searches for “NFL” prior to the lockout’s end, but you do see a huge spike in searches after the announcement.

A few notes:

  • It would be interesting to align the curves in the last plot by the day of week.  That is, it would be nice to compare the Trends scores for, say, the 7th Wednesday prior to the start of the NFL preseason in each year.
  • The RGoogleTrends package needs you to be signed in to Google.  If you just log into Gmail (or another Google service) in Firefox, you can use the Firefox cookies to pass your username and password.
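
The alignment idea in the first note could be sketched roughly as follows.  Each date gets tagged with its weekday and with how many of that weekday fall between it and the start of the preseason, so that curves from different years can be matched on those two keys.  The function name and the start date are hypothetical placeholders, not something from my repo.

```r
# Tag each date with its weekday and how many of that weekday remain
# before a season start date, so that different years can be lined up.
align_to_start <- function(dates, start) {
  dates <- as.Date(dates)
  start <- as.Date(start)
  days_before <- as.integer(start - dates)
  data.frame(
    date = dates,
    wday = as.POSIXlt(dates)$wday,           # 0 = Sunday, ..., 6 = Saturday
    weeks_before = ceiling(days_before / 7)  # 1 = last such weekday before start
  )
}

# Hypothetical preseason start date; check the real one before using this.
start_2011 <- as.Date("2011-08-11")
aligned <- align_to_start(
  seq(start_2011 - 49, start_2011 - 1, by = "day"),
  start_2011
)
head(aligned)
```

With one such data frame per year (and the Trends score attached as an extra column), merging on `wday` and `weeks_before` would put, for example, every year’s 7th Wednesday before the preseason on the same row.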


Filed under Data Viz, ggplot2, R, Sports

The RDSTK Presentation at Denver R Users Group

Last night I presented a talk at the DRUG introducing the R wrapper for the Data Science Toolkit.  Lots of good questions, good forking, and good beer afterwards at Freshcraft.  The slides are given below.

Filed under Data Science, Data Viz, R, Web 2.0

R and the Data Science Toolkit

I recently decided to present a talk to the Denver R Users Group (DRUG) on how to make an R package (May 17). There were only two problems: (1) I’ve never made a package and (2) I had nothing in mind to package up.  At about this same time, Pete Warden and others were blogging about the iPhone tracking issue [1]. How are these two events related? Well, I remembered that a few of my favorite Twitter ‘friends’ posted some things related to Pete Warden’s “The Data Science Toolkit (DSTK)” [2] a while back. And? And at the time I thought that it would be cool to have an R package/wrapper for accessing the DSTK’s API, similar to Drew Conway’s R wrapper for the infochimps API.

So I’m happy to announce that after spending a little time on this project in the past week, Version 0.1 of the RDSTK package is available on github. I haven’t submitted this package to CRAN and, hence, you need to install it from source (RDSTK_0.1.tar.gz). In order to do this, use the install.packages() function within R or R CMD INSTALL from the shell prompt. Note that the package depends on the RCurl, plyr, and rjson packages.
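
As a minimal sketch of the install step from within R, assuming the tarball has already been downloaded into the current working directory (that location is my assumption; adjust the path as needed):

```r
# Dependencies named above; all three are on CRAN.  Uncomment to install.
# install.packages(c("RCurl", "plyr", "rjson"))

# Install RDSTK from the downloaded source tarball.
# Assumes RDSTK_0.1.tar.gz sits in the current working directory.
tarball <- "RDSTK_0.1.tar.gz"
if (file.exists(tarball)) {
  install.packages(tarball, repos = NULL, type = "source")
} else {
  message("Download ", tarball, " from the github repo first.")
}
```

The shell equivalent is `R CMD INSTALL RDSTK_0.1.tar.gz`, run from the directory that holds the file.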

The following functions are included in the package:

  • street2coordinates
  • ip2coordinates
  • coordinates2politics
  • text2sentences
  • text2people
  • html2text
  • text2times

They should be easy to use if you are familiar with the DSTK API. If not, RTFM! 🙂

Let me know if you have any comments and/or suggestions. Happy hacking.


I wanted to mention that I received a bit of help with the RCurl package from “Noah” on stackoverflow, Andy Gayton on stackoverflow, and Duncan Temple Lang on the R-Help list.  Thanks!


  1. To borrow a joke from Asi Behar, “Right after word leaks that the iPhone has been tracking your location at all times, we find Osama. Coincidence? Thanks Apple!”
  2. You may recall that a while back, I tweeted about disliking the phrase “data science”.  My feelings have not changed.


Filed under Data Science, R

NBA, Logistic Regression, and Mean Substitution

[Note: I wrote this on a flight and didn’t proofread it at all. You’ve been warned of possible incoherencies!]

I’m currently sitting at about 32K feet above sea level on my way from Tampa International to DIA and my options are (1) watch a shitty romantic comedy starring Reese Witherspoon, Owen Wilson, et al. or (2) finish my blog post about the NBA data.  With a chance to also catch up on some previously downloaded podcasts, I decided on option (2).

So where was I related to the NBA analysis?  I downloaded some data and I was in the process of developing a predictive model.  I’m not going to get into the specifics of this model because it was an incredibly stupid model.  The original plan was to build a logistic regression model relating several team-based metrics (e.g., shots, assists, and blocks per game, field-goal and free throw percentage, etc.) to a team’s overall winning percentage.  I was hoping to use this model as the basis of a model for an individual player’s worth.  How?  Not sure.  In any case, I got about half-way through this exercise and realized that this was an incredibly stupid endeavor.  Why?  I’m glad you asked.

Suppose that you gave a survey to roughly 1K males and asked them several questions.  One of the questions happened to be “How tall are you (in inches)?”  The respondents were incredibly sensitive and only about half responded to this particular question.  There were other questions with various levels of missingness as well.  A histogram of the 500 answers to the aforementioned question is given in Figure 1.

Figure 1: A hypothetical sampling of 500 male heights.

One of the goals of the hypothetical survey/study is to classify these males using all of the available data (and then some).  What do I mean by the parenthetical back there?  Well, a buddy of mine suggests that we just substitute the average height for the missing heights in order to make our data set more complete.  Obviously, this isn’t going to change the average height in the data.  Are there any repercussions for doing this?  Consider the variance of the heights.  If we need to estimate the population variance of male heights, we will severely underestimate this parameter.  The imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations, yet they double the sample size; the sample variance is cut roughly in half.  See Figure 2 for the density estimates of the original 500 heights and the original plus the imputed data.

Figure 2: Density estimates of the original 500 heights + 500 imputed (mean substitution) heights.
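
If you want to see the effect for yourself, here is a minimal sketch.  The heights are simulated from a normal distribution with a mean of 70 inches and a standard deviation of 3; those numbers are assumptions for illustration and are not the data behind the figures.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# 500 simulated "observed" heights in inches; the distribution and its
# parameters are assumptions, not the data shown in Figures 1 and 2.
observed <- rnorm(500, mean = 70, sd = 3)

# Mean substitution: fill the 500 missing answers with the observed mean.
imputed <- c(observed, rep(mean(observed), 500))

mean(observed); mean(imputed)  # the means agree
var(observed);  var(imputed)   # the variance shrinks

# The ratio works out to (500 - 1) / (1000 - 1), i.e. about one half.
var(imputed) / var(observed)
```

The ratio does not depend on the simulated values at all: the numerator of the sample variance is unchanged and only the denominator grows, from 499 to 999.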

(Alter Ego:  Yo Ryan, WTF are you talking about here?  You’re supposed to be talking about the NBA and building an advanced player-evaluation metric!

Me:  I’m getting to that!)

OK, so how does this relate to what I was doing prior to the mean-substitution tangent?  Well, my model based on team metrics related to overall winning percentage was an exercise in mean substitution!  The summarized data (e.g., blocks per game or free throw percentage) are averaged over all games and I’m trying to relate those numbers to n1 wins and n2 losses out of N = n1 + n2 games.  Essentially I would have N replicates of the averaged data (my predictor variables) and n1 successes and n2 failures, respectively, in the logistic regression model.  I was ignoring any variation in the individual game statistics that contributed to the individual wins and losses.
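
A small simulation makes the point concrete.  Everything in it is invented for illustration (30 teams, 82 games, a single predictor called `fg_pct`, and the coefficients used to generate wins); it is not my NBA data or my original model.  Fitting the grouped model on season totals gives the same coefficients as fitting one row per game with the season average copied onto every row, so no game-level information ever enters the model.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n_teams <- 30
n_games <- 82

# Hypothetical season averages: one field-goal percentage per team.
teams <- data.frame(fg_pct = rnorm(n_teams, mean = 0.46, sd = 0.015))

# Invented relationship between fg_pct and the chance of winning a game.
p_win <- plogis(-18.4 + 40 * teams$fg_pct)
teams$wins <- rbinom(n_teams, size = n_games, prob = p_win)
teams$losses <- n_games - teams$wins

# Grouped logistic regression: n1 wins and n2 losses per team.
grouped_fit <- glm(cbind(wins, losses) ~ fg_pct, family = binomial, data = teams)

# "N replicates of the averaged data": one row per game, with the same
# season average repeated for every game a team played.
games <- data.frame(
  fg_pct = rep(teams$fg_pct, each = n_games),
  win = unlist(lapply(teams$wins, function(w) {
    rep(c(1, 0), times = c(w, n_games - w))
  }))
)
expanded_fit <- glm(win ~ fg_pct, family = binomial, data = games)

# The two sets of coefficients match (up to numerical tolerance).
rbind(grouped = coef(grouped_fit), expanded = coef(expanded_fit))
```

A model that actually used game-by-game statistics would have different predictor values on each of those rows, which is exactly the variation I was throwing away.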

Why didn’t I just do a better job and ignore this mistake?  Basically, I felt compelled to offer up this little caveat related to data analysis.  Just because you can write a script to gather data and perhaps even build a model in something like R does not guarantee that your results are meaningful or that you know something about statistics!  If you are doing a data analysis, think HARD about any questions that you want to address, study what your particular methods are doing and any subsequent implications of using said methods, and for *$@%’s sake interpret your conclusions in the context of any assumptions.  This isn’t an exhaustive list of good data-analytic practice, but it’s not going to hurt anything.  Happy analyzing, folks!

As usual, all of the code and work related to this project is available at the github repo.


Filed under Basic Statistics, Imputation, R, Scraping Data, Sports

NBA Analysis: Coming Soon!

I decided to spend a few hours this weekend writing the R code to scrape the individual statistics of NBA players (2010-11 only).  I originally planned to write up a few NBA-related analyses, but a friend was visiting from out of town and, of course, that means less time sitting in front of my computer…which is a good thing!  So in between an in-house concert at my place (video posted soon), the Rapids’ first game (a win, 3-1 over Portland), brunch, and trivia at Jonesy’s (3rd place), I did write some code.  The git repo can be found here on github.

Note that this code is having a little trouble at the moment.  I have no idea why, but it’s throwing an error when it tries to scrape the Bulls’ and the Raptors’ pages.  I’m pretty sure it’s NOT because the Bulls are awesome and the Raptors suck…though I haven’t confirmed that assertion.
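
One way to keep a couple of bad pages from stopping the whole run, and to get a list of which teams failed, is to wrap each call in tryCatch().  This is only a sketch: `scrape_team()` below is a hypothetical stand-in that fails on purpose for two teams, not the scraper in the repo, and the team list is just an example.

```r
# Hypothetical stand-in for the real scraper: it fails for two teams so the
# error handling below has something to catch.
scrape_team <- function(team) {
  if (team %in% c("Bulls", "Raptors")) stop("could not parse page for ", team)
  data.frame(team = team, players = NA_integer_)
}

teams <- c("Bulls", "Celtics", "Nuggets", "Raptors")

# Wrap each call so one bad page is recorded instead of stopping the loop.
results <- lapply(teams, function(team) {
  tryCatch(
    scrape_team(team),
    error = function(e) {
      message("Skipping ", team, ": ", conditionMessage(e))
      NULL
    }
  )
})
names(results) <- teams

# Teams whose pages failed, and the combined data for the rest.
failed <- names(results)[vapply(results, is.null, logical(1))]
scraped <- do.call(rbind, results[!names(results) %in% failed])
```

The messages printed along the way carry the original error text, which is usually the quickest clue as to what is different about the failing pages.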

In any case, let me know if you have any ideas about what I should do with this data.  Some of the concepts that I’m toying with at the moment include:

  • Comparing the before and after performances of players who were traded at or near the trading deadline, and/or
  • Examining some of the more holistic player-evaluation metrics w/r/t win-loss records for various teams.

Question:  Why didn’t you use BeautifulSoup for your scraping?  You seem to be a big proponent of Python; what’s up?

Answer:  I wrote about scraping with R vs Python in a previous post.  That little test was pretty conclusive in terms of speed and R won.  I am not totally convinced that I like the R syntax for xml/html parsing, but it is fast.  And me not liking the syntax is probably a result of me not being an XML expert rather than a shortcoming of the XML package itself.


Filed under R, Scraping Data, Sports