Category Archives: Basic Statistics

NBA, Logistic Regression, and Mean Substitution

[Note: I wrote this on a flight and didn’t proofread it at all. You’ve been warned of possible incoherencies!]

I’m currently sitting at about 32K feet above sea level on my way from Tampa International to DIA and my options are (1) watch a shitty romantic comedy starring Reese Witherspoon, Owen Wilson, et al. or (2) finish my blog post about the NBA data.  With a chance to also catch up on some previously downloaded podcasts, I decided on option (2).

So where was I with the NBA analysis?  I had downloaded some data and was in the process of developing a predictive model.  I’m not going to get into the specifics of this model because it was an incredibly stupid model.  The original plan was to build a logistic regression model relating several team-based metrics (e.g., shots, assists, and blocks per game, field-goal and free-throw percentage, etc.) to a team’s overall winning percentage.  I was hoping to use this model as the basis of a model for an individual player’s worth.  How?  Not sure.  In any case, I got about half-way through this exercise and realized that it was an incredibly stupid endeavor.  Why?  I’m glad you asked.

Suppose that you gave a survey to roughly 1K males and asked them several questions.  One of the questions happened to be “How tall are you (in inches)?”  The respondents were incredibly sensitive and only about half responded to this particular question.  There were other questions with various levels of missingness as well.  A histogram of the 500 answers to the aforementioned question is given in Figure 1.

Figure 1: A hypothetical sampling of 500 male heights.

One of the goals of the hypothetical survey/study is to classify these males using all of the available data (and then some).  What do I mean by the parenthetical back there?  Well, a buddy of mine suggests that we just substitute the average height for the missing heights in order to make our data set more complete.  Obviously, this isn’t going to change the average height in the data.  Are there any repercussions for doing this?  Consider the variance of the heights.  If we need to estimate the population variance of male heights, we will severely underestimate this parameter.  See Figure 2 for the density estimates of the original 500 heights and the original plus the imputed data.

Figure 2: Density estimates of the original 500 heights + 500 imputed (mean substitution) heights.
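
If you’d rather see it than take my word for it, here’s a minimal R sketch using simulated heights (not the survey data behind the figures):

set.seed(42)
heights <- rnorm(1000, mean = 70, sd = 3)  ## pretend these are the 1000 males
obs <- heights[1:500]                      ## only 500 actually answered

imputed <- c(obs, rep(mean(obs), 500))     ## mean substitution for the other 500

mean(obs); mean(imputed)  ## the average doesn't budge
var(obs); var(imputed)    ## the variance gets cut roughly in half

The substituted values sit exactly at the mean, so they add nothing to the sum of squares while doubling the sample size, and the estimated variance gets cut roughly in half.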

(Alter Ego:  Yo Ryan — WTF are you talking about here?  You’re supposed to be talking about the NBA and building an advanced player-evaluation metric!

Me:  I’m getting to that!)

OK, so how does this relate to what I was doing prior to the mean-substitution tangent?  Well, my model relating team metrics to overall winning percentage was an exercise in mean substitution!  The summarized data (e.g., blocks per game or free throw percentage) are averaged over all games, and I’m trying to relate those numbers to n1 wins and n2 losses out of N = n1 + n2 games.  Essentially I would have N replicates of the averaged data (my predictor variables) and n1 and n2 successes and failures (respectively) in the logistic regression model.  I was ignoring any variation in the individual game statistics that contributed to the individual wins and losses.
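
If you like seeing the mistake spelled out, the setup I’m describing looks something like the sketch below.  The data frame and column names are hypothetical, and this is the model I’m arguing against, not one I’m recommending.

## team.dat: one row per team-season with season-averaged stats plus win/loss totals
## (hypothetical names, and exactly the aggregation problem described above)
fit <- glm(cbind(wins, losses) ~ blocks_pg + assists_pg + fg_pct + ft_pct,
           family = binomial, data = team.dat)
summary(fit)

Every one of a team’s N games gets treated as an identical Bernoulli trial sharing the same season-averaged covariates, so all of the game-to-game variation is thrown away.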

Why didn’t I just do a better job and ignore this mistake?  Basically, I felt compelled to offer up this little caveat related to data analysis.  Just because you can write a script to gather data and perhaps even build a model in something like R does not guarantee that your results are meaningful or that you know something about statistics!  If you are doing a data analysis, think HARD about any questions that you want to address, study what your particular methods are doing and any subsequent implications of using said methods, and for *$@%’s sake interpret your conclusions in the context of any assumptions.  This isn’t an exhaustive list of good data-analytic practices, but following it certainly won’t hurt.  Happy analyzing, folks!

As usual, all of the code and work related to this project is available at the github repo.


Filed under Basic Statistics, Imputation, R, Scraping Data, Sports

Napkin Calculations

I ride the bus to work and ride my bike home.  I really enjoy the 8-mile ride on the way home — except when it’s freezing like yesterday!  I haven’t decided whether or not it’s because (1) I’m cheap and don’t want to buy another car, (2) I work at the National Renewable Energy Lab, or (3) I like the evening workout.  To be honest, it’s probably a combination of all three.

Anyway, there are a few things that piss me off about the 28 route in Denver.  However, nothing, and I mean nothing, pisses me off more than the little side journey that the bus takes when we get to Yates and 26th.  As you can see in the link, we go south to Byron Pl, over to Sheridan, and then back north to 26th.  Why does this little sojourn piss me off, you ask?  Because nobody ever uses the Byron Pl stop!  OK, there are a few people, but they should walk the 1.5 blocks to either 26th and Sheridan or 26th and Yates!

Here’s my back-of-the-envelope calculation of how much this side trip costs RTD on its weekday runs.

Assumptions/facts:

  1. A bus gets 5 mpg.  Is this a good assumption?  Who knows.  I really don’t care.  I’m just bored and want to blog about this.
  2. Google maps puts this side trip at 0.4 miles.
  3. There are 36 eastbound and 40 westbound trips per day that utilize this ridiculous Byron Pl stop.  (Note: There could be more, but I’m not dealing with the routes that start at Byron Pl.)
  4. To keep things simple, let’s say that there are 250 ‘weekdays’ for the 28 route.

What does this all mean?  Using these figures, the bus burns about 0.08 gallons of fuel each time it goes down to Byron Pl.  Maybe that’s not entirely fair, because the bus would still travel 0.1 miles if it skipped the stupid detour.  So, adjusting point 2 above, let’s say the detour costs 0.3 miles and, hence, about 0.06 gallons of fuel per trip.  Across the 76 weekday trips, that’s roughly 4.6 gallons per day, or about 1,140 gallons per year!  Assuming $2.50 per gallon of fuel, RTD spends roughly $2,850 a year on this unnecessary trip!  Holy shit, that doesn’t even include the weekends!
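
If you want to check the napkin, here it is in a few lines of R:

mpg       <- 5            ## assumed bus fuel economy
extra_mi  <- 0.3          ## extra distance per detour (0.4 mi minus the 0.1 mi it drives anyway)
trips_day <- 36 + 40      ## eastbound plus westbound weekday trips
wkdays    <- 250          ## 'weekdays' per year
price_gal <- 2.50         ## assumed price per gallon of fuel

gal_trip <- extra_mi / mpg         ## 0.06 gallons per detour
gal_day  <- gal_trip * trips_day   ## about 4.6 gallons per weekday
gal_year <- gal_day * wkdays       ## about 1,140 gallons per year
gal_year * price_gal               ## roughly $2,850 per year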

Any thoughts?


Filed under Basic Statistics, Rambling, Uncategorized

Are MLB Games Getting Longer?

On July 29, 2010, I had a flight from Denver to Cincinnati.  About an hour before boarding, I went to ESPN’s website and found a new article by Bill Simmons, a.k.a. The Sports Guy (@sportsguy33 on Twitter).  The basic premise of the article is that a core group of fans is losing interest in Red Sox games this season.  So he decides to assign percentages to his reasons why people are losing interest (he’s a writer, not a statistician).  Anyway, he states that the “biggie”, the “killer”, etc. is the length of games (his 55%), i.e., baseball games are too damn long.  He gives some data from baseball-reference.com to back up his claim.

So what does this have to do with me flying to Cincinnati from Denver?  Well, being the nerd that I am, I immediately went to baseball-reference.com to see if I could download more data!  As an aside, I’ve been obsessing over learning how to use ggplot2 since my return from the useR2010 conference about two weeks ago.  This seems like a good time to start learning.  Ah shit…it appears that this project is going to be a little harder than I expected.  I could download the data for one team and one season before I boarded, but I wanted the past 30 seasons for all teams that played every season.  I suppose that I would have to write a scraper to collect the data from about 750 web pages.  Sweet, now I have something to do on the flight rather than obsess over whether or not the person next to me will spill over into my seat.

Now I’m on the plane.  I’ve downloaded the html of one webpage so that I can test my python scraper (using BeautifulSoup) and I have the data that Simmons used in his article.  The first thing that I do is make a few graphs using his data.  Here is the first.

He essentially discretized the length of each game into five bins:  A – two hours or less, B – more than two hours but no more than two and a half, and so on.  It certainly looks like the relative size of the blue and purple rectangles together is increasing and the golden rectangle seems to be shrinking.  Maybe Simmons is onto something.  Note that this isn’t the only figure that I made.  There were quite a few others, but I want to get to the good stuff.
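
For what it’s worth, that sort of binning is a one-liner in R with cut(); game_min below is a hypothetical vector of game lengths in minutes, and the breaks are my reading of his categories.

## Bin game lengths (in minutes) into the five length categories
game.cat <- cut(game_min,
                breaks = c(0, 120, 150, 180, 240, Inf),
                labels = c("<= 2:00", "2:00-2:30", "2:30-3:00", "3:00-4:00", "> 4:00"))
table(game.cat)  ## counts of games in each length category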

(Back onboard the plane.  I have two or three terminals open, Textmate with some python code open, R is open, and the dude sitting next to me is sufficiently freaked out by what’s going on.)

OK, this is getting too long.  It turns out that I didn’t finish the scraper on the flight to Cincinnati.  Fortunately, there was the return flight and a few ‘down’ hours to work on this project while visiting KY.  Success!  Upon landing at DIA on Monday, I had the scraper working (for the test page that I downloaded on Friday).  Now I needed to write a script to iterate across all teams and every year since 1970.  I wrote the script, processed the ‘soup’ for all 700+ team/season combos, and I give you the following figure!

So are baseball games getting longer?  Well, the preceding figure gives the median length of games (in minutes) for all teams over each season since 1970.  The colors shift from blue to red over time, suggesting an increase in the median length of games.  However, you might also notice that I’ve added two vertical lines from 1994 through 2004.  This roughly corresponds to the “Steroids Era” in MLB.  It looks like the game times have been decreasing a bit since about the middle of this era.  Can we look at the data in another way?  Of course, I give you Figure 3!

Here’s a different way of displaying the same data as given in the second figure.  I didn’t separate by team in this figure, however.  I’m just looking at the median length of games across all teams for each season, with a smooth curve added to show the trend in the data.  Note that the peak on the curve corresponds to roughly 1998.  If you recall, Mark McGwire and Sammy Sosa were hitting HRs at record rates and teams were scoring a lot of runs.  And it looked as if their heads might suddenly explode from all of the ‘roids that they were taking.

So what do I take from this?  Well, overall, I would say that the length of baseball games in general seems to be on the decline in the most recent years.  Is the same true for just the Red Sox?  Ah ha, I can make that figure.  Here you go.

Interestingly enough, it looks like the smooth trend line for Red Sox games is increasing in recent years whereas it’s decreasing for all other teams.  So maybe this Simmons character is onto something for his beloved Red Sox.

What do I think?  I don’t really care.  I just wanted an interesting data set to use so that I can learn a bit more about ggplot2, learn a bit more about scraping data using python/beautifulsoup, and kill some time on a flight.  So that’s my story.

Note that all of the data was obtained from baseball-reference.com.  Figures were made using the ggplot2 package in the R software.   Further, I know that I could do more with this data set from a statistical perspective.  That’s what I do, I’m a statistician.  But I have a full-time job and just wanted to learn some stuff!

Some final thoughts.  I’m happy to share the R and/or Python code for this project.  Just send me a message on Twitter or Gmail (Twitter name @ gmail) and I’ll send it your way.  Also, you’ll notice some missing data for the Red Sox, NYY, LAD, and CHC.  Why?  I’m not sure.  Apparently the HTML is laid out a bit differently than on the rest of the pages, and the scraper was returning missing values.  I should also say that I scraped the appropriate pages when teams switched cities.  For example, I was scraping Montreal prior to 2005 for the Washington Nationals data.

Ryan

@rtelmore on Twitter!

Here is the R code:

library(ggplot2)

## His Data
sim.dat <- read.table("baseball_rs_game_times.txt",header=T)
## Stacked bars: number of games in each length category by year
ggplot(data = sim.dat, aes(x = Year, y = Count, fill = Cat)) +
  geom_bar(width = 1, stat = 'identity') +
  scale_fill_discrete(name = "length category") +
  scale_x_continuous("year") + scale_y_continuous("games")

## Same idea, but with years as factors and nicer legend labels
ggplot(data = sim.dat, aes(x = as.factor(Year), y = Count, fill = Cat)) +
  geom_bar(stat = 'identity') +
  scale_fill_discrete(name = "length", breaks = unique(sim.dat$Cat),
                      labels = c("(,2]", "(2,2.30]", "(2.30,3]", "(3,4]", "(4,)")) +
  scale_x_discrete("year") + scale_y_continuous("games")

ggsave("~/Sports/Simmons/mlb_length_1.png", height = 7, width = 7)

## Bars of game counts by length category, one panel per year
ggplot(sim.dat, aes(x = Cat, y = Count, col = Count, fill = Count)) +
  geom_bar(stat = 'identity') + facet_wrap(~ Year)

last_plot() + scale_y_continuous("number of games") +
  scale_colour_continuous("games") + scale_fill_continuous("games") +
  scale_x_discrete("length of games")

ggsave("~/Sports/Simmons/mlb_length_2.png", height = 7, width = 7)

## My analysis

rs.dat <- read.csv("~/Sports/Simmons/length_of_baseball_games_20100802.csv", header=F)
names(rs.dat) <- c("team","year","mean_len","med_len","std_len","league","TNAT")

## Heat map of median game length (minutes) by team and year
ggplot(data = rs.dat, aes(x = year, y = team, fill = med_len)) +
  geom_tile() + scale_fill_continuous("minutes")

# + ggtitle("Median Length of MLB Games by Team (in minutes)")

## Dotted vertical lines roughly bracketing the 'Steroids Era'
last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 3)
ggsave("~/Sports/Simmons/mlb_length_3.png", height = 7, width = 7)

## Flag the Red Sox so their trend can be compared with everyone else's
rs.dat$bs <- rs.dat$team == 'BOS'

## Points plus a loess smooth, coloured by Red Sox vs. not
qplot(x = year, y = med_len, data = rs.dat, geom = c("point", "smooth"),
      span = .5, colour = bs, ylab = "length of game in minutes")

last_plot() + scale_colour_discrete(name = "Boston?")
last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 2, col = "black")
ggsave("~/Sports/Simmons/mlb_length_5.png", height = 7, width = 7)

## Same plot, all teams pooled together
qplot(x = year, y = med_len, data = rs.dat, geom = c("point", "smooth"),
      span = .5, ylab = "length of game in minutes")

last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 2, col = "red")

ggsave("~/Sports/Simmons/mlb_length_4.png", height = 7, width = 7)


Filed under Basic Statistics, ggplot2, R, Sports

(a)typical

Give a soapbox (blog) to a nonparametric statistician and you’re going to get this post or something very similar.  Specifically, the latest edition of The Log Cabin concerns two summary statistics:  the mean (average) and the median.  Anybody who has taken an introductory statistics course knows the difference between the two statistics, so why waste an entire post on this subject?  Glad you asked!

I just started reading The Tipping Point by Malcolm Gladwell and I’ve noticed that he likes to throw around statistics like they’re frisbees.  Yes, I know it was published in 2000, it was a NY Times best seller, etc.  If you like the book, don’t tune me out because I’m not here to rip the book — I really like this book.  It’s just that my blood begins to boil when I read something like “The average score in that class was 20.96, meaning that the average person in the class knew 21 people …” on page 40 of my copy.  To me, this doesn’t tell a particularly compelling story.  Why not?

Here’s another example.  In the e-commerce world, you will often hear people refer to the average revenue per user (ARPU) as a summary of the company’s paying customers.  The logic is that knowing ARPU is sufficient for describing how much money a typical user will spend on the site.  Does this seem reasonable?

Let’s consider the preceding example in a bit more detail.  Consider the following figure of revenue per user for a fictitious company.  ARPU for this example is $13.50.

RPU for Elmo’s Widget Factory

Simply reporting an ARPU of $13.50 can be misinterpreted in a situation like this.  I would argue that reporting the median RPU (= $7) is more meaningful because it has an explicit interpretation: approximately 50% of the users will spend at least $7.  We know immediately what half of the users are likely to do; we can’t make a similar statement from ARPU alone.
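
Here’s a quick simulated example in R (not the data behind the figure) showing how a few big spenders drag the mean well above what a typical user spends:

set.seed(7)
## 950 modest spenders and 50 whales
revenue <- c(rexp(950, rate = 1/6), rexp(50, rate = 1/150))

mean(revenue)    ## 'ARPU', pulled way up by the whales
median(revenue)  ## what the typical user spends, much smaller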

How does this relate to The Tipping Point?  Without knowing the shape of the distribution of scores, I am not sure what an average score of 21 people really means.  It could be that the median score is 5 people and a few scores near 100 are severely inflating the mean.  On the other hand, it’s certainly possible that the median score is around 21 people as well.  I just don’t know.  In any case, I would have preferred to see the median, the quartiles (future post), or the entire distribution.

What’s the take-away message here, folks?  Don’t just give me an average anything and expect me to be impressed with your findings!  If you choose to report a single measure (which should be avoided at all costs!), use the median.  Better yet, show me the entire distribution of values.


Filed under Basic Statistics, Rambling