Correlation, Causation, and LeBron

The Miami Heat beat the Indiana Pacers 90 – 79 on May 30 [1] and @ESPNStatsInfo tweeted the following bit of information.





Why does this interest me? If you recall, the book Moneyball came out in 2003 and several nerds like myself were starting to get interested in the quantitative analysis of sport. And, in particular, it reminds me of a conversation I had with Joe Banner back in 2006 [2] about a statistician’s role in a pro football organization. First, let me say that Joe comes across as a fucking smart dude — he’s not some meathead football player who happened to stay in the game based on inertia. Anyway, he said something along the lines of the following:

“It’s well known that 75 – 80% (#s made up, but it was high) of winning football  teams (in all games played) have a 100 yard rusher. Some coaches will do everything in their power to get their RB going in the first quarter by pounding the running game. I think that’s stupid. Why?”

The obvious response is that the coaches (e.g., Andy Reid) were confusing correlation [3] or, more generally, some association with causation. The teams were not winning because they had a 100+ yard rusher! They likely had a 100-yard rusher because they were winning in the second half and were trying to run out the clock, or maybe the fact that their running game was working opened things up for the offense, or any number of things. Who the hell knows?! My point here is that you should not confuse association with causation.
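To make the clock-killing story concrete, here is a toy simulation in Python. It is only a sketch under made-up assumptions: every number in it (the spread of halftime leads, the baseline rushing yards, the 35-yard bump) is invented, and `simulate` is a hypothetical helper, not anything from real NFL data. In this model winning depends only on the halftime lead, yet winners still end up with a 100-yard rusher far more often than losers, and forcing the run changes nothing about who wins.

```python
import random

def simulate(n_games=20000, force_run=False, seed=1):
    """Toy model: the halftime lead drives both winning and late-game rushing."""
    rng = random.Random(seed)
    wins = losses = 0
    winners_100 = losers_100 = 0
    for _ in range(n_games):
        lead = rng.gauss(0, 10)      # halftime point margin (made-up scale)
        yards = rng.gauss(85, 20)    # baseline rushing yards (made-up scale)
        if lead > 0:
            yards += 35              # teams that lead run out the clock
        if force_run:
            yards += 35              # coach pounds the run no matter what
        # Winning depends ONLY on the lead, never on rushing yards.
        won = lead + rng.gauss(0, 7) > 0
        has_100 = yards >= 100
        if won:
            wins += 1
            winners_100 += has_100
        else:
            losses += 1
            losers_100 += has_100
    return wins / n_games, winners_100 / wins, losers_100 / losses

natural = simulate()
forced = simulate(force_run=True)
for label, (win_rate, winners, losers) in (("natural", natural), ("forced", forced)):
    print(f"{label}: win rate {win_rate:.2f}, "
          f"100-yard rusher among winners {winners:.2f}, among losers {losers:.2f}")
```

Both runs use the same seed, so the leads and outcomes are identical game for game. The forced-run version piles up more 100-yard rushers without adding a single win.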

So back to LeBron and the Heat. In the first four games of the Heat vs Pacers series, LeBron averaged 73 touches and the Heat won two and lost two. He then gets 86 touches in game five and they win. Obviously, Spoelstra should make it a point to get LeBron 20+ touches in the first quarter of game six, right? Right? RIGHT?! No! If the Heat are playing well, LeBron is likely to get a lot of touches and vice versa. However, trying to force touches outside of the normal flow of a good Heat game would be (I claim) idiotic. Winning and LeBron’s touches of the basketball are likely to evolve as an organic process and forcing the issue won’t help [4].

Here are a few takeaways from this post:

  1. Don’t confuse association and causation. This happens a lot in the big data world — correlations reign supreme.
  2. I’m pretty sure that the Eagles still regret their decision to pass on my services.
  3. If you are a pro sports team in the Denver area, please send me an email. If your team name is the “Nuggets”, I would likely work for one bag of popcorn per game.



[1] – Yeah, I know that I should’ve written this a while back, but I only get around to this stuff when I’m flying. And right now I’m flying to Cincinnati [5].

[2] – Let me brag for a second. I interviewed with the Philadelphia Eagles back in the summer of 2006 and one of my meetings was with Joe Banner, then the Executive Vice President of the Eagles, now CEO with the Cleveland Browns. They decided not to hire me and instead went with a graduate student at UPenn — and haven’t won shit since. I like to think it’s because they didn’t hire me. Fuck ’em.

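[3] – Strictly speaking, correlation (in the Pearson sense) only measures the linear part of an association. For example, y = x² over a range of x symmetric about zero has zero correlation with x even though y is completely determined by x.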

[4] – See Monsanto. And this dog <insert pic> from the protest in Denver.

[5] – Mmmmmmmm, Skyline Chili!


Filed under Uncategorized

I fought the law and the law …

I got a ticket for parking in my driveway last weekend.  Weird, eh?  Here is my email correspondence to the Denver parking violations office:

date: Mon, Aug 29, 2011 at 10:00 AM
subject: Citation #: 137105264

Dear Parking Violation Office,

On Saturday, August 27, 2011, I received a parking citation at my house on ??? St. As with any Broncos home game, I know to move my car to my driveway in order to not receive a ticket [1]. Therefore, I did exactly that on Saturday morning in order to prevent said violation. Upon going to my car at approximately 8:30 pm that night, I was shocked to see that the parking violations officer had placed a ticket on my windshield with the Officer’s Comments being “In Area HD SB Not Unsable Drive” [2]. I have no idea what this means given that (1) this is a clear driveway for my house, (2) my car doesn’t interfere with the sidewalk or the street when in the driveway, and (3) I parked there for every Broncos game last year without incident. A particularly troubling aspect of this incident is that the officer had to walk into my yard/driveway in order to give me a ticket.

I am happy to send you pictures of exactly where my car was in order to clear this up. I have included two pictures from Google Maps using the ‘street view’ in order to show shots of the driveway. Note that my car is a 2010 Jetta and the car in the picture is a Subaru Forester (much bigger than the Jetta). I think that the Forester could be blocking the sidewalk by about 1 foot (in these pictures); however, there is plenty of space behind the car and our car is much smaller than this Forester. Therefore, there is no need to block either the sidewalk or the street when using our parking space.

Thanks for your time,

[1] – I received a parking citation for parking in the street during a Broncos game during the first pre-season game of 2010 (shortly after moving to 2749 Decatur St). Since that initial transgression, I have used my parking lot without incident.

[2] – After reading this email and looking at the pictures, can you please send me an explicit explanation of what the quoted statement means in terms of the Denver City Law so that I can correctly assess whether or not my driveway is in violation of this law/ordinance?

I forgot to tell them what I wanted in the initial email, so I had to follow up:

date: Mon, Aug 29, 2011 at 2:00 PM
subject: Re: Citation #: 137105264

To be clear, I am asking for this violation to be revoked due to a mistake in the Parking Officer’s judgement; I would appreciate a response in a timely manner.


Their lame-ass response:

date: Wed, Aug 31, 2011 at 9:22 AM
subject: RE: Citation #: 137105264

We have submitted your claim to the Parking Magistrates for review, we will notify you by mail of their determination.

Here are the pictures that I referenced in the initial email:


Filed under Uncategorized

Lies, Damned Lies, and Politicians

I like politics; I don’t like all of the lying involved. If you ask me, I think that there should be “Ethics Committee” investigations into all of the lying. Sure, tweeting a picture of your junk is probably not the best idea, but neither is lying. And nearly all politicians are guilty [1]. Fortunately, the St. Petersburg Times started a website called Politifact [2] with the hopes of keeping some of these people honest. I’m not sure it’s helping.

In any case, I wrote an R script to scrape the data from Politifact so that I could do some analysis.  I only got about as far as the following figure related to some of the Republican candidates and their propensity to lie.  The figure displays the number of statements made by each candidate that can be categorized as “True”, “Mostly True”, …, “False”, or “Pants on Fire” according to Politifact.

What can we take from this? Well, Michele Bachmann is a big-time liar; ain’t no denying that. She’s also a freaking nutjob. Ron Paul probably lies the least, but nobody seems to care about him in the media. Tim Pawlenty doesn’t lie too much. Then again, he’s a wuss and dropped out anyway. Mitt Romney seems pretty good when it comes to speaking the truth; it’s gotta be the Mormon background. I suspect that he’ll lie a bit more in the upcoming months. And Rick Perry…well, he’s just bat-shit crazy, so I’ll ignore him.

If I had the time, I would randomly select some Republicans and Democrats from both the House and Senate and analyze whether statement category (truth through pants on fire) is independent of political party and/or chamber of Congress. If you’re interested in doing this, I would be happy to help you get started. Have a look at my github repo for this project and give it a go!
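For anyone who wants to try, the independence question boils down to a Pearson chi-square test on a party-by-rating table of counts. Here is a minimal Python sketch of that calculation. It assumes every row and column has a nonzero total, `chi_square_independence` is a hypothetical helper name, and the counts are placeholders I made up for illustration, NOT scraped Politifact data. Only the four ratings named above are shown; the in-between ratings would simply add more columns.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rating were independent of party.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# PLACEHOLDER counts, invented for illustration only (not real Politifact data).
ratings = ["True", "Mostly True", "False", "Pants on Fire"]
counts = [
    [20, 18, 17, 8],   # hypothetical sample of statements from party A
    [18, 20, 18, 7],   # hypothetical sample of statements from party B
]

stat, dof = chi_square_independence(counts)
print(f"chi-square = {stat:.2f} on {dof} degrees of freedom")
```

A large statistic relative to a chi-square distribution with `dof` degrees of freedom is evidence against independence. With 3 degrees of freedom the usual 5% cutoff is about 7.81. The same function works for chamber versus rating by swapping what the rows represent.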


[1] – Note that I said ‘nearly all’ because Dennis Kucinich doesn’t have a single statement classified as “Pants on Fire” on Politifact.

[2] – Winner of a Pulitzer Prize in 2009.


Filed under Data Mining, Politics, R, Scraping Data

Slides from Rocky Mtn SABR Meeting

Last Saturday I had the good fortune to present a talk on finding, gathering, and analyzing some sports-related data on the web at the local SABR group meeting.  In case you’re not familiar with the “SABR” acronym, it stands for “Society for American Baseball Research”; here’s a link to the national organization.  The talk was light on tech and heavy on graphs (predominately made in R and in particular ggplot2).  Good times were had by all.  The slides from the talk are given below.  Most of the slides and code are recycled from previous talks, so I apologize in advance if you’re already familiar with the content.  It was, however, new to the SABR people.

Filed under Data Mining, Data Science, ggplot2, R, Scraping Data, Sports

Google Trends, R, and the NFL

A week or so ago I saw a tweet about how the NFL lockout was affecting the search traffic for “fantasy football” on Google (using Google Trends). Basically, the search traffic (properly normalized on Google Trends) was down prior to the end of the lockout. I decided to explore this a bit myself and chose to look into the RGoogleTrends package for R. The RGoogleTrends package is basically a convenient way to interact with the Google Trends API using R. Very cool stuff. All of my code can be found at my GoogleTrends repo on github.

My first query was to pull the Google Trends data for the past seven or so years using the search term “fantasy football”. The following figure shows the results over time. It’s immediately obvious that the normalized search scores for “fantasy football” were down in 2011 relative to previous years prior to the end of the lockout; however, it appears that interest has since recovered.

I then decided to look at the trends for “NFL”.  There isn’t a dramatic decrease in the (normalized) searches for “NFL” prior to the lockout’s end, but you do see a huge spike in searches after the announcement.

A few notes:

  • It would be interesting to align the curves in the last plot by the day of week.  That is, it would be nice to compare the trends scores, as an example, for the 7th Wednesday prior to the start of the NFL preseason or something.
  • In order to use the RGoogleTrends package, you can use the cookies in Firefox to pass username and password if you just log into gmail (or another google service).
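The day-of-week alignment in the first note could be sketched along these lines in Python. This is a rough illustration only: `alignment_key` is a hypothetical helper, the season start dates are assumed placeholders, and the scores are made up rather than pulled from Google Trends. The idea is to re-index each observation as "the k-th such weekday before the start of the preseason" so that different years line up.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def alignment_key(day, season_start):
    """Return (k, weekday), meaning 'the k-th <weekday> before season_start'."""
    days_before = (season_start - day).days
    if days_before <= 0:
        raise ValueError("day must fall before season_start")
    k = (days_before - 1) // 7 + 1
    return k, WEEKDAYS[day.weekday()]

# ASSUMED placeholder preseason start dates and MADE-UP scores, for illustration.
season_start = {2010: date(2010, 8, 8), 2011: date(2011, 8, 11)}
scores = [
    (date(2010, 6, 23), 40.0),
    (date(2011, 6, 29), 31.0),
]

# Group scores by alignment key so the same slot can be compared across years.
aligned = {}
for day, score in scores:
    key = alignment_key(day, season_start[day.year])
    aligned.setdefault(key, {})[day.year] = score

for key, by_year in sorted(aligned.items()):
    print(key, by_year)
```

Both sample dates land on the same key, so their scores sit side by side and can be plotted against k instead of against the calendar date.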


Filed under Data Viz, ggplot2, R, Sports

The RDSTK Presentation at Denver R Users Group

Last night I presented a talk at the DRUG introducing the R wrapper for the Data Science Toolkit.  Lots of good questions, good forking, and good beer afterwards at Freshcraft.  The slides are given below.

Filed under Data Science, Data Viz, R, Web 2.0

R and the Data Science Toolkit

I recently decided to present a talk to the Denver R Users Group (DRUG) on how to make an R package (May 17). There were only two problems: (1) I’ve never made a package and (2) I had nothing in mind to package up.  At about this same time, Pete Warden and others were blogging about the iPhone tracking issue [1]. How are these two events related? Well, I remembered that a few of my favorite Twitter ‘friends’ posted some things related to Pete Warden’s “The Data Science Toolkit (DSTK)” [2] a while back. And? And at the time I thought that it would be cool to have an R package/wrapper for accessing the DSTK’s API, similar to Drew Conway’s  R wrapper  for the infochimps API.

So I’m happy to announce that after spending a little time on this project in the past week, Version 0.1 of the RDSTK package is available on github. I haven’t submitted this package to CRAN and, hence, you need to install it from source (RDSTK_0.1.tar.gz). To do this, either call install.packages() on the tarball from within R (with repos = NULL and type = "source") or run R CMD INSTALL RDSTK_0.1.tar.gz from the shell prompt. Note that the package depends on the RCurl, plyr, and rjson packages, so install those first.

The following functions are included in the package:

  • street2coordinates
  • ip2coordinates
  • coordinates2politics
  • text2sentences
  • text2people
  • html2text
  • text2times

They should be easy to use if you are familiar with the DSTK API. If not, RTFM! 🙂

Let me know if you have any comments and/or suggestions. Happy hacking.


I wanted to mention that I received a bit of help with the RCurl package from “Noah” on stackoverflow, Andy Gayton on stackoverflow, and Duncan Temple Lang on the R-Help list.  Thanks!


[1] – To borrow a joke from Asi Behar, “Right after word leaks that the iPhone has been tracking your location at all times, we find Osama. Coincidence? Thanks Apple!”

[2] – You may recall that a while back, I tweeted about disliking the phrase “data science”. My feelings have not changed.


Filed under Data Science, R