NBA, Logistic Regression, and Mean Substitution

[Note: I wrote this on a flight and didn't proofread it at all.  You've been warned of possible incoherencies!]

I’m currently sitting at about 32K feet above sea level on my way from Tampa International to DIA and my options are (1) watch a shitty romantic comedy starring Reese Witherspoon, Owen Wilson, et al. or (2) finish my blog post about the NBA data.  With a chance to also catch up on some previously downloaded podcasts, I decided on option (2).

So where was I with the NBA analysis?  I downloaded some data and I was in the process of developing a predictive model.  I'm not going to get into the specifics of this model because it was incredibly stupid.  The original plan was to build a logistic regression model relating several team-based metrics (e.g., shots, assists, and blocks per game, field-goal and free-throw percentage, etc.) to a team's overall winning percentage.  I was hoping to use this model as the basis of a model for an individual player's worth.  How?  Not sure.  In any case, I got about halfway through this exercise and realized that it was an incredibly stupid endeavor.  Why?  I'm glad you asked.

Suppose that you gave a survey to roughly 1K males and asked them several questions.  One of the questions happened to be "How tall are you (in inches)?"  The respondents were incredibly sensitive and only about half responded to this particular question.  There were other questions with various levels of missingness as well.  A histogram of the 500 answers to the aforementioned question is given in Figure 1.

Figure 1: A hypothetical sampling of 500 male heights.

One of the goals of the hypothetical survey/study is to classify these males using all of the available data (and then some).  What do I mean by the parenthetical back there?  Well, a buddy of mine suggests that we just substitute the average height for the missing heights in order to make our data set more complete.  Obviously, this isn't going to change the average height in the data.  Are there any repercussions for doing this?  Consider the variance of the heights.  Every imputed value sits exactly at the mean and contributes nothing to the spread, so if we need to estimate the population variance of male heights, we will severely underestimate this parameter.  See Figure 2 for the density estimates of the original 500 heights and the original plus the imputed data.

Figure 2: Density estimates of the original 500 heights + 500 imputed (mean substitution) heights.
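Here's a quick R sketch of what mean substitution does to the spread.  The numbers are made up, of course:

set.seed(42)
heights  <- rnorm(500, mean = 70, sd = 3)   # the 500 observed heights (inches)
imputed  <- rep(mean(heights), 500)         # mean substitution for the 500 non-responders
combined <- c(heights, imputed)

mean(heights); mean(combined)   # the means are essentially identical
var(heights);  var(combined)    # the "completed" data's variance is roughly half

The 500 imputed values add nothing to the sum of squares but they do inflate the denominator, which is exactly why the variance gets crushed.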

(Alter Ego:  Yo Ryan — WTF are you talking about here?  You're supposed to be talking about the NBA and building an advanced player-evaluation metric!

Me:  I’m getting to that!)

OK, so how does this relate to what I was doing prior to the mean-substitution tangent?  Well, my model based on team metrics related to overall winning percentage was an exercise in mean substitution!  The summarized data (e.g., blocks per game or free throw percentage) are averaged over all games, and I'm trying to relate those numbers to n1 wins and n2 losses out of N = n1 + n2 games.  Essentially, I would have N replicates of the averaged data (my predictor variables) and n1 successes and n2 failures (respectively) in the logistic regression model.  I was ignoring any variation in the individual game statistics that contributed to the individual wins and losses.
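Just to make the mistake concrete, the model I was fitting amounted to something like the sketch below.  The data frame and column names are hypothetical stand-ins, not the ones in my script:

# One row per team: season-average statistics plus season win/loss totals.
fit <- glm(cbind(wins, losses) ~ blocks_pg + assists_pg + fg_pct + ft_pct,
           family = binomial, data = team_stats)
summary(fit)

Each team's season-average predictors are effectively replicated across all N = n1 + n2 games, so the game-to-game variation in those statistics never enters the model.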

Why didn't I just quietly fix the mistake and move on?  Basically, I felt compelled to offer up this little caveat related to data analysis.  Just because you can write a script to gather data and perhaps even build a model in something like R does not guarantee that your results are meaningful or that you know something about statistics!  If you are doing a data analysis, think HARD about any questions that you want to address, study what your particular methods are doing and any subsequent implications of using said methods, and for *$@%'s sake interpret your conclusions in the context of any assumptions.  This isn't an exhaustive list of good data-analytic practice, but it's not going to hurt anything.  Happy analyzing, folks!

As usual, all of the code and work related to this project is available at the github repo.


NBA Analysis: Coming Soon!

I decided to spend a few hours this weekend writing the R code to scrape the individual statistics of NBA players (2010-11 only).  I originally planned to write up a few NBA-related analyses, but a friend was visiting from out of town and, of course, that means less time sitting in front of my computer…which is a good thing!  So in between an in-house concert at my place (video posted soon), the Rapids' first game (a win, 3-1 over Portland), brunch, and trivia at Jonesy's (3rd place), I did write some code.  The git repo can be found here on github.

Note that this code is having a little trouble at the moment.  I have no idea why, but it’s throwing an error when it tries to scrape the Bulls’ and the Raptors’ pages.  I’m pretty sure it’s NOT because the Bulls are awesome and the Raptors suck…though I haven’t confirmed that assertion.
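If you want to tinker with the scraper in the meantime, one way to keep a single bad page from killing the whole run is to wrap each team's scrape in tryCatch().  This is only a sketch; the URL pattern and team abbreviations below are placeholders, not the ones used in the repo:

library(XML)

scrape_team <- function(abbr) {
  # Placeholder URL pattern; swap in the real stats page for each team.
  url <- sprintf("http://www.example.com/teams/%s/stats", abbr)
  tryCatch(readHTMLTable(url)[[1]],
           error = function(e) {
             message("Failed to scrape ", abbr, ": ", conditionMessage(e))
             NULL   # keep going instead of dying on, say, the Bulls' page
           })
}

teams   <- c("CHI", "TOR", "DEN")    # placeholder team abbreviations
results <- lapply(teams, scrape_team)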

In any case, let me know if you have any ideas about what I should do with this data.  Some of the concepts that I’m toying with at the moment include:

  • Comparing the before and after performances of players who were traded at or near the trading deadline, and/or
  • Examining some of the more holistic player-evaluation metrics w/r/t win-loss records for various teams.

Question:  Why didn’t you use BeautifulSoup for your scraping?  You seem to be a big proponent of python — what’s up?

Answer:  I wrote about scraping with R vs python in a previous post.  That little test was pretty conclusive in terms of speed, and R won.  I am not totally convinced that I like the R syntax for xml/html parsing, but it is fast.  And me not liking the syntax is probably a result of me not being an XML expert rather than a shortcoming of the XML package itself.


Napkin Calculations

I ride the bus to work and ride my bike home.  I really enjoy the 8-mile ride on the way home — except when it's freezing like yesterday!  I haven't decided whether or not it's because (1) I'm cheap and don't want to buy another car, (2) I work at the National Renewable Energy Lab, or (3) I like the evening workout.  To be honest, it's probably a combination of all three.

Anyway, there are a few things that piss me off about the 28 route in Denver.  However, nothing, and I mean nothing, pisses me off more than the little detour that the bus takes when we get to Yates and 26th.  As you can see in the link, we go south to Byron Pl, over to Sheridan, and then back north to 26th.  Why does this little detour piss me off, you ask?  Because nobody ever uses the Byron Pl stop!  OK, there are a few people, but they should walk the 1.5 blocks to either 26th and Sheridan or 26th and Yates!

Here’s my back of the envelope calculation for how much this side trip costs RTD on its weekday routes.

Assumptions/facts:

  1. A bus gets 5 mpg.  Is this a good assumption?  Who knows.  I really don’t care.  I’m just bored and want to blog about this.
  2. Google maps puts this side trip at 0.4 miles.
  3. There are 36 eastbound and 40 westbound trips per day that utilize this ridiculous Byron Pl stop.  (Note: There could be more, but I’m not dealing with the routes that start at Byron Pl.)
  4. To keep things simple, let’s say that there are 250 ‘weekdays’ for the 28 route.

What does this all mean?  Using these figures, the bus burns about 0.08 gallons of fuel for each trip down to Byron Pl.  Maybe that's not entirely fair, because the bus would still travel about 0.1 miles if it skipped the stupid detour.  So, adjusting point 2 above, let's say that the detour costs 0.3 miles and, hence, uses 0.06 gallons of fuel per trip.  With 76 trips per weekday, that's about 4.6 gallons per day or roughly 1,140 gallons per year!  Assuming $2.50 per gallon of fuel, RTD spends close to $3,000 a year on this unnecessary trip!  Holy shit, that doesn't even include the weekends!
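If you'd rather check the arithmetic in R than on a napkin, here's the same calculation using the assumptions above:

mpg           <- 5           # assumed bus fuel economy
extra_miles   <- 0.3         # net extra distance per trip down to Byron Pl
trips_per_day <- 36 + 40     # eastbound + westbound weekday trips
weekdays      <- 250
price         <- 2.50        # dollars per gallon

gal_per_trip <- extra_miles / mpg             # 0.06 gallons
gal_per_day  <- gal_per_trip * trips_per_day  # about 4.6 gallons
gal_per_year <- gal_per_day * weekdays        # about 1,140 gallons
cost         <- gal_per_year * price          # roughly $2,850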

Any thoughts?


GPUs vs CPUs

I am going to start working on some benchmarks for GPUs vs CPUs.  Hopefully I can write something about that soon; however, I don’t have anything at the moment.  Nevertheless, I can give you a pretty sweet video illustrating the GPU vs CPU concept courtesy of Mythbusters.  Here ya go.


My Crappy Fantasy Football Draft

I compared the results of my fantasy football draft with the results of more than 1,500 mock drafts at the Fantasy Football Calculator (FFC).  I looked at where player X was drafted in our league, subtracted off the average draft position on FFC, and divided by the standard deviation of that player's draft position on FFC.  In other words, I've computed a 'standardized' draft position for the given player.
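In R, the whole thing is a one-liner.  The data frame and column names here are hypothetical stand-ins for whatever is in the ffdraft code:

# our_pick: where we drafted the player; ffc_avg / ffc_sd: FFC's average and
# standard deviation of that player's draft position across the mock drafts.
draft$std_pos <- (draft$our_pick - draft$ffc_avg) / draft$ffc_sd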

How do we interpret this standardized draft position?  Obviously, if we have a positive score, then a player was drafted later in our draft than the average position on FFC.  This would mean that a team owner in our league got a pretty good deal on that player.  Understand?  Dividing by the standard deviation just places all of the draft positions in a standardized unit for comparison purposes.  Here are the results of our draft.

What do we see from this?  Well, my draft sucked.  Most of my boxes in the heat map are negative!  So I drafted my players a little earlier than their average draft positions on FFC.  In particular, it looks like I picked Pierre Thomas way too early.

Some positives:  Yurcy picked Randy Moss with the 18th pick and his average draft position on this website was 8.8.  Possibly the biggest winner was Rob’s 6th round pick of Wes Welker…good value there.

I’ll do the same for my league with the boys in Vermont.  Hopefully the results are a little better than what I did with the Princeton gang.

The code is published at github under ffdraft.


Using XML package vs. BeautifulSoup

A while back I posted something about scraping a webpage using the BeautifulSoup module in Python.  One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R.  Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.

I decided to replicate my work in python for scraping the MLS data using the XML package for R.  OK, I didn't replicate it exactly because I only scraped five years' worth of data.  I figured that five years would be a sufficient amount of time for comparison purposes.  The only major criterion that I enforced was that both scripts had to export nearly identical .csv files.  I say "nearly" because R seemed to want to wrap everything in quotes and it also exported the row names (1, 2, 3, etc.) as default options in write.table().  Neither of these defaults is an issue, so I didn't bother changing them.  I wrote a few print statements into both scripts to show where a difference (if any) in timing might exist.  The code can be found in my scraping repository on github.
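For what it's worth, both of those defaults are easy to switch off if they ever do become an issue.  The data frame name here is hypothetical:

# Drop the quoting and the row names that write.table() adds by default.
write.table(mls_df, file = "mls_2005.csv", sep = ",",
            quote = FALSE, row.names = FALSE)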

I don’t really know much about how system.time() works in R to be honest.  However, I used this function as the basis of my comparison.  Of course, I was source’ing an R file and using the system() function in R to run the python script, i.e., system(“path/to/py_script.py”).  The results can be summarized in the following graph.

As you can see in the figure, there is about a 3x speedup using the XML package relative to BeautifulSoup!  This is not what I was expecting.  Further, it appears that the overall "user" speedup is approximately 5x.  In fact, the only place where python seems to beat the R package is in the user.self portion of the time…whatever the hell that means.
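For reference, the timing harness amounted to something like the sketch below; the R file name is a placeholder:

r_time  <- system.time(source("scrape_mls_xml.R"))
py_time <- system.time(system("python path/to/py_script.py"))

rbind(R = r_time, python = py_time)

One thing worth knowing about system.time(): when the python script is launched via system(), its CPU time gets charged to the user.child/sys.child columns rather than user.self, so user.self mostly reflects R's own bookkeeping.  The elapsed (wall-clock) column is probably the fairest single number to compare.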

As I said before, I decided to print out some system times within each script because scraping this data is iterative.  That is, I scrape and process the data for each year within the loop (over years).  So I was curious to see whether each option was scraping and processing at about the same speed.  It turns out that XML beat BeautifulSoup here as well.

Results:

## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished :)

and in R:

[1] "2010-08-29 18:10:29 -- Starting"
[1] "2010-08-29 18:10:29 -- Year: 2005"
[1] "2010-08-29 18:10:29 -- Year: 2006"
[1] "2010-08-29 18:10:29 -- Year: 2007"
[1] "2010-08-29 18:10:30 -- Year: 2008"
[1] "2010-08-29 18:10:30 -- Year: 2009"
[1] "2010-08-29 18:10:30 -- Finished :)"

What do I conclude from this?  Well, use R, damnit!  The XML package is super easy to use and it's fast.  Will I still use python?  Of course!  I would bet that python/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data — which will happen sooner rather than later.

My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v3.1.0.1.

Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts that are posted online!  Exciting stuff…you know, if you're a nerd and like sports.


A Rule Change in Major League Soccer?

I have to admit that working with my Major League Soccer data set has been slow going.  There are a few reasons:  (1) I have a full-time job at the National Renewable Energy Lab and (2) the data isn’t quite as “rich” as I initially thought.  As an example, the MLS site doesn’t list the wins and losses for each team by year.  That seems to be a fundamental piece of “sports”-type data, right?  In any case, I did come across something that I can’t seem to answer.  If you know somebody that works with MLS, send ’em my email address and tell them that I want answers, damnit!

So, following up on my previous MLS-related post, I wanted to see if I could pinpoint why goals per game has been decreasing in recent years.  My first thought was that, with MLS expanding, more US-based players transferring overseas, etc., the overall talent level in MLS has suffered a bit in the more recent years.  One way that this might manifest itself in the data is by having fewer shots "on target" or "on goal".  Therefore, I looked at the number of shots on goal vs the number of shots and also vs the number of goals over the years.  The two figures are given next.

Based on the first figure, one could argue that the shooters are becoming a little less accurate.  That is, the number of shots on target per shot has decreased by about 10% over the course of the league’s lifetime.  Shots on goal per goal seems relatively steady over this same time period.  This might suggest that the league’s strikers are getting slightly worse whereas the quality of the keepers is holding steady.  That, of course, could contribute to the decline of goals per game.

I also decided to look at the number of assists per goal.  Why?  Well, my logic is that if there are more assists per goal, then there might be better overall team play.  Conversely, a decrease in this number might be a result of teams having one or more stars (hence, more individual brilliance) and fewer of the quality, build-up-type goals.  Make sense?  C'mon, I'm trying here!  Anyway, here is the resulting graph.

Whoa, what in the hell happened there?  The data look a bit suspicious.  Specifically, there seems to be a serious change between the 2002 and 2003 seasons.  So I made a similar graph, but I separated by the different time periods.  Here ya go.

What does this mean?  My hypothesis is that there was a fundamental change to the rules in how assists were recorded between the 2002 and 2003 seasons.  Unfortunately, I can’t confirm this.  I’ve searched the web, read the MLS Wikipedia page, read a good amount of the MLS website, and can’t seem to find anything related to a rules change that might result in this sort of phenomenon.  Sooooo, if you have any ideas, send ’em my way!
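If you want to poke at this yourself, a plot like the ones above can be built with a few lines of ggplot2.  The data frame and its column names (year, assists, goals) are my stand-ins, not the actual scraped object:

library(ggplot2)

# Split the seasons at the suspected 2002/2003 break and plot assists per goal.
mls$period <- ifelse(mls$year <= 2002, "through 2002", "2003 onward")

ggplot(mls, aes(x = year, y = assists / goals, colour = period)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Season", y = "Assists per goal")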

This will likely be the last MLS-specific post for a while.  Unless I can find some more data, I’m giving up — their data is just not that interesting.  Notice that I didn’t say that this would be my last soccer post.  Hopefully I can scrape some EPL (England’s Premiership) data.  Given that their league has been around for more than 15 years, it should be a bit more interesting than mine.

If you’re interested in taking a look at the data and/or code yourself, I’ve created a github repository for your perusal.  Feel free to pass along your comments and/or questions regarding any code — I have thick skin.

So what's next?  I am thinking about comparing my current workflow of (a) scraping with Python and (b) analyzing with R to just doing everything in R (e.g., using the XML package).  Hopefully, I can post some time comparisons soon!

Addendum:  According to at least one blogger, the recording of "secondary assists" was changed after the 2002 season.  I'm not sure why they record secondary assists in the first place — I guess MLS wanted to appeal to the hockey people in the early years.  Here is the blogger's take on secondary assists:
