[Note: I wrote this on a flight and didn’t proofread it at all. You’ve been warned of possible incoherencies!]

I’m currently sitting at about 32K feet above sea level on my way from Tampa International to DIA and my options are (1) watch a shitty romantic comedy starring Reese Witherspoon, Owen Wilson, et al. or (2) finish my blog post about the NBA data. With a chance to also catch up on some previously downloaded podcasts, I decided on option (2).

So where was I with the NBA analysis? I had downloaded some data and was in the process of developing a predictive model. I’m not going to get into the specifics of this model because it was an incredibly stupid model. The original plan was to build a logistic regression model relating several team-based metrics (e.g., shots, assists, and blocks per game, field-goal and free-throw percentage, etc.) to a team’s overall winning percentage. I was hoping to use this model as the basis of a model for an individual player’s worth. How? Not sure. In any case, I got about halfway through this exercise and realized that it was an incredibly stupid endeavor. Why? I’m glad you asked.

Suppose that you gave a survey to roughly 1K males and asked them several questions. One of the questions happened to be “How tall are you (in inches)?” The respondents were incredibly sensitive and only about half responded to this particular question. There were other questions with various levels of missingness as well. A histogram of the 500 answers to the aforementioned question is given in Figure 1.

Figure 1: A hypothetical sampling of 500 male heights.

One of the goals of the hypothetical survey/study is to classify these males using all of the available data (and then some). What do I mean by the parenthetical back there? Well, a buddy of mine suggests that we just substitute the average height for the missing heights in order to make our data set more complete. Obviously, this isn’t going to change the average height in the data. Are there any repercussions for doing this? Consider the variance of the heights. If we need to estimate the population variance of male heights, we will severely underestimate this parameter. See Figure 2 for the density estimates of the original 500 heights and the original plus the imputed data.

Figure 2: Density estimates of the original 500 heights + 500 imputed (mean substitution) heights.
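To make the variance point concrete, here’s a quick sketch (in Python rather than R, with made-up heights and a made-up seed standing in for the hypothetical survey data):

```python
import random
import statistics

random.seed(42)

# Hypothetical data: 500 observed male heights (inches), roughly N(70, 3),
# standing in for the survey responses above.
observed = [random.gauss(70, 3) for _ in range(500)]
mean_height = statistics.mean(observed)

# Mean substitution: fill the 500 missing responses with the observed mean.
imputed = observed + [mean_height] * 500

print(statistics.variance(observed))  # sample variance of the real data
print(statistics.variance(imputed))   # about half as large -- the imputed
                                      # points add no spread around the mean
```

The mean is untouched, but the 500 imputed points sit exactly at the mean and contribute nothing to the spread, so the variance estimate drops by roughly half.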

(Alter Ego: Yo Ryan — WTF are you talking about here? You’re supposed to be talking about the NBA and building an advanced player-evaluation metric!

Me: I’m getting to that!)

OK, so how does this relate to what I was doing prior to the mean-substitution tangent? Well, my model relating team metrics to overall winning percentage was an exercise in mean substitution! The summarized data (e.g., blocks per game or free-throw percentage) are averaged over all games, and I’m trying to relate those numbers to n1 wins and n2 losses out of N = n1 + n2 games. Essentially I would have N replicates of the averaged data (my predictor variables) and n1 and n2 successes and failures (respectively) in the logistic regression model. I was ignoring any variation in the individual game statistics that contributed to the individual wins and losses.
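A small sketch of the problem (made-up per-game numbers, not actual NBA data): replacing each game’s statistic with the season average leaves you with N replicates that have zero game-to-game variance — mean substitution in disguise:

```python
import random
import statistics

random.seed(1)

# Hypothetical season: per-game blocks for one team over an 82-game season.
blocks_per_game = [random.gauss(5.0, 2.0) for _ in range(82)]

season_avg = statistics.mean(blocks_per_game)

# The model above effectively uses 82 copies of the season average as the
# predictor -- one per win/loss outcome.
replicated_avg = [season_avg] * 82

print(statistics.variance(blocks_per_game))  # real game-to-game variation
print(statistics.variance(replicated_avg))   # 0.0 -- all of it is gone
```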

Why didn’t I just do a better job and ignore this mistake? Basically, I felt compelled to offer up this little caveat related to data analysis. Just because you can write a script to gather data and perhaps even build a model in something like R does not guarantee that your results are meaningful or that you know something about statistics! If you are doing a data analysis, think HARD about any questions that you want to address, study what your particular methods are doing and any subsequent implications of using said methods, and for *$@%’s sake interpret your conclusions in the context of any assumptions. This isn’t an exhaustive list of good data-analytic practice, but it’s not going to hurt anything. Happy analyzing, folks!

As usual, all of the code and work related to this project is available at the github repo.

Do you need help with missing data?

Thanks for the offer, Ofer. But, no, I don’t need help. It was more of a realization on my part that I was essentially doing a large-scale mean substitution if I collapsed all of the season data into stuff like (average) blocks per game, (average) assists per game, etc., so that my X vector has lost all of its original game-by-game variability. Does that make sense?

I always question (but have no answers) whether it is wiser to use season/career data or data from just the last 5/10/whatever number of games. Even if the ppg for the season has relatively small variation, is it a better predictor for the next game than the most recent game(s)?

Could you use the variability of just the known data assuming that with a large enough sample size it would approach the variation for all the data?

I guess the point was that there are better ways to impute data, e.g. multiple imputation. Of course, I neglected to mention that in the actual post.

You can always try to fill in gaps using the first moment (mean), but then you realize you didn’t take into consideration the second moment (variance). Once you take this into consideration, you realize you forgot the third moment, etc. This of course is the problem with missing data. Doesn’t this relate to sufficient statistics?

Hmmm, I’m not sure about the sufficiency bit — I’ll have to take a look at my book on Missing Data.

I thought about it a little today. Sufficient statistics relate to a particular distribution, which I thought would imply that any distributional assumption made with missing data could be very costly.