Category Archives: Rambling

Napkin Calculations

I ride the bus to work and ride my bike home.  I really enjoy the 8 mile ride on the way home, except when it’s freezing like yesterday!  I haven’t decided whether or not it’s because (1) I’m cheap and don’t want to buy another car, (2) I work at the National Renewable Energy Lab, or (3) I like the evening workout.  To be honest, it’s probably a combination of all three.

Anyway, there are a few things that piss me off about the 28 route in Denver.  However, nothing, and I mean nothing, pisses me off more than the little side journey that the bus takes when we get to Yates and 26th.  As you can see in the link, we go south to Byron Pl, over to Sheridan, and then back north to 26th.  Why does this little sojourn piss me off you ask?  Because nobody ever uses the Byron Pl stop!  OK, there are a few people, but they should walk the 1.5 blocks to either 26th and Sheridan or 26th and Yates!

Here’s my back of the envelope calculation for how much this side trip costs RTD on its weekday routes.


  1. A bus gets 5 mpg.  Is this a good assumption?  Who knows.  I really don’t care.  I’m just bored and want to blog about this.
  2. Google Maps puts this side trip at 0.4 miles.
  3. There are 36 eastbound and 40 westbound trips per day that utilize this ridiculous Byron Pl stop.  (Note: There could be more, but I’m not dealing with the routes that start at Byron Pl.)
  4. To keep things simple, let’s say that there are 250 ‘weekdays’ for the 28 route.

What does this all mean?  Using these figures, the bus uses about 0.08 gallons of fuel for each trip down to Byron Pl.  Maybe that’s not entirely fair, because the bus would still go 0.1 miles if it didn’t take the stupid trip.  So adjusting point 2 above, let’s say that the trip costs 0.3 miles and, hence, uses 0.06 gallons of fuel.  With 36 + 40 = 76 trips per weekday, that’s about 4.56 gallons per day, or about 1,140 gallons per year.  Assuming $2.50 per gallon of fuel, RTD spends about $2,850 a year on this unnecessary trip!  Holy shit, that doesn’t even include the weekends!
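For anyone who wants to check the napkin, here is a rough Python sketch of the arithmetic.  Every input is one of the assumptions listed above (5 mpg, 0.3 extra miles, 36 eastbound and 40 westbound trips, 250 weekdays, $2.50 per gallon), and the function name is just something I made up for illustration.  Note that the eastbound and westbound trips are added together to get trips per day.

```python
def detour_fuel_cost(mpg=5.0, extra_miles=0.3, eastbound=36, westbound=40,
                     weekdays=250, price_per_gallon=2.50):
    """Return (gallons per day, gallons per year, dollars per year).

    All defaults are the post's back-of-the-envelope assumptions,
    not measured RTD data.
    """
    trips_per_day = eastbound + westbound      # trips add up: 36 + 40 = 76
    gallons_per_trip = extra_miles / mpg       # extra fuel burned per detour
    gallons_per_day = trips_per_day * gallons_per_trip
    gallons_per_year = gallons_per_day * weekdays
    dollars_per_year = gallons_per_year * price_per_gallon
    return gallons_per_day, gallons_per_year, dollars_per_year


per_day, per_year, dollars = detour_fuel_cost()
print(f"{per_day:.2f} gallons/day, {per_year:,.0f} gallons/year, ${dollars:,.0f}/year")
```

Change any default (say, a different mpg guess) to see how sensitive the yearly figure is to that assumption.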

Any thoughts?


Filed under Basic Statistics, Rambling, Uncategorized

GPUs vs CPUs

I am going to start working on some benchmarks for GPUs vs CPUs.  Hopefully I can write something about that soon; however, I don’t have anything at the moment.  Nevertheless, I can give you a pretty sweet video illustrating the GPU vs CPU concept courtesy of Mythbusters.  Here ya go.


Filed under HPC, Rambling, Uncategorized

Apologies and Style Guides

I have to say that it’s pretty exciting to watch your blog go from a few hits over its lifetime to getting almost 200 in a single day.  I am currently negotiating with Google over the purchase of this blog.  Or maybe not.  Again, thanks be to @revodavid for posting to the Revolution Analytics Blog.  Anyway, I just wanted to apologize for the format of the code snippets.  Reading the Python script and the R code again was painful even to me.  I’m going to try to follow the Google style guides for Python and R, and hopefully the code will be a bit more readable in future posts.  Further, I will try to host my projects on GitHub so that I might find some collaborators.

I’ve added a few interesting links to the Blogroll on the right.  I would recommend subscribing to the RSS feed for  I see some good stuff on there almost daily.

Next up:  An analysis of some MLS data!  All I’ve found so far is that the teams like to change their names.  Hopefully something useful will come of this project.


Filed under Python, R, Rambling

The Wave is Dead.

Wow, on the day that I decide to write another blog post, Google announces that they are discontinuing their development on Google Wave.  Have a look at my last post.  Back?  Yeah, well I loved the Wave!  There I said it…I really did love Wave.  Unfortunately, nobody else even liked Wave.  I’ll be back; I’m heading outside to pour my forty on the Wave’s grave.

So my next blog post will involve R, ggplot2, baseball, and the Sports Guy’s latest column.  It should be a good one.


Filed under Rambling, Uncategorized


Give a soapbox (blog) to a nonparametric statistician and you’re going to get this post or something very similar.  Specifically, the latest edition of The Log Cabin concerns two summary statistics:  the mean (average) and the median.  Anybody who has taken an introductory statistics course knows the difference between the two statistics, so why waste an entire post about this subject?  Glad you asked!

I just started reading The Tipping Point by Malcolm Gladwell and I’ve noticed that he likes to throw around statistics like they’re frisbees.  Yes, I know it was published in 2000, it was a NY Times best seller, etc.  If you like the book, don’t tune me out because I’m not here to rip the book — I really like this book.  It’s just that my blood begins to boil when I read something like “The average score in that class was 20.96, meaning that the average person in the class knew 21 people …” on page 40 of my copy.  To me, this doesn’t tell me a particularly compelling story.  Why not?

Here’s another example.  In the e-commerce world, you will often hear people refer to the average revenue per user (ARPU) as a summary of the company’s paying customers.  The logic is that knowing ARPU will be sufficient for describing how much money a typical user will spend on the website.  Does this seem reasonable?

Let’s consider the preceding example in a bit more detail.  Consider the following figure of revenue per user for a fictitious company.  ARPU for this example is $13.50.

RPU for Elmo’s Widget Factory

Simply reporting ARPU of $13.50 can be misinterpreted in a situation like this.  I would argue that reporting the median RPU (= $7) is more meaningful because there is an explicit interpretation: approximately 50% of the users will spend at least $7.  We know immediately what half of the users are likely to do; conversely, we can’t make a similar statement when reporting ARPU.
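To make the mean vs. median gap concrete, here is a small Python sketch.  The ten revenue values are made up by hand so that they reproduce the two summary numbers quoted above (mean $13.50, median $7); they are not the data behind the figure.

```python
import statistics

# Hypothetical revenue per user, in dollars.  Hand-picked to match the
# post's summary numbers; NOT the actual data behind the figure.
revenue_per_user = [1, 2, 3, 5, 7, 7, 9, 12, 24, 65]

arpu = statistics.mean(revenue_per_user)          # pulled up by the $65 user
median_rpu = statistics.median(revenue_per_user)  # the middle of the pack

# Count how many users actually spend at least the "average" amount.
at_or_above_arpu = sum(1 for r in revenue_per_user if r >= arpu)

print(f"ARPU:       ${arpu:.2f}")
print(f"Median RPU: ${median_rpu:.2f}")
print(f"Users spending at least ARPU: {at_or_above_arpu} of {len(revenue_per_user)}")
```

In this toy data set only two of the ten users spend as much as the "average" user, which is exactly the kind of thing ARPU alone hides.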

How does this relate to The Tipping Point?  Without knowing the shape of the distribution of scores, I am not sure what an average score of 21 people really means.  It could be that the median score is 5 people and a few scores were near 100, thus severely inflating the mean score.  On the other hand, it’s certainly possible that the median score is around 21 people as well.  I just don’t know.  In any case, I would have preferred to see the median, the quartiles (future post), or the entire distribution.
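Here is a quick sketch of that "could be" scenario in Python.  The scores are invented purely for illustration (I have no idea what the real class scores were): most people know about 5 others, two people know close to 100, and the mean still lands near 21.

```python
import statistics

# Invented scores for illustration only; not the data from the book.
scores = [2, 3, 4, 4, 5, 5, 6, 7, 85, 90]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
# Three cut points that split the sorted scores into four equal groups.
quartiles = statistics.quantiles(scores, n=4)

print(f"mean:      {mean_score:.1f}")
print(f"median:    {median_score}")
print(f"quartiles: {quartiles}")
```

A mean of roughly 21 is compatible with this lopsided class and with a class where nearly everyone scores around 21, which is why the median and quartiles are worth reporting alongside it.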

What’s the takeaway message here, folks?  Don’t just give me an average anything and expect me to be impressed with your findings!  If you choose to report a single measurement (should be avoided at all costs!), use the median.  Better yet, show me the entire distribution of values.


Filed under Basic Statistics, Rambling

What’s in a name?

So why the name “The Log Cabin”?  Well, the reason is two-fold.  (1) I am constantly reminded about the general public’s lack of knowledge of, and paranoia surrounding, logs, or more specifically, logarithms.  For example, I’ve had the following exchange on more than one occasion:

Random acquaintance: “What do you study, Ryan?”

Me: “Mathematics and Statistics.”

RA:  “Wow; you must know a lot about logarithms.”

Me:  “Um, I guess I know a bit about them.  Probably nothing more than some other ‘special’ functions, e.g. sine, exponential, etc.”

RA: (shrieks and/or laughs nervously) “Yeah, did you see the score of the game?”

My best guess is that a person in RA’s shoes remembers “logarithm” because they think it’s a funny word or something.  I have no idea.  Any other hypotheses?

To hammer home the point, I recently read a NY Times article about using statistics to answer questions related to injury likelihood in professional baseball.  The author mentioned that the analysts “build logarithm formulas and computer codes that test Conte’s hypotheses…”  No shit!  The article is referenced at the bottom of this post.  As I said when I posted this on Facebook, I would bet that they did build a logit model (seems like a natural starting point), but this sounds like a gratuitous use of the word “logarithm” in order to make them sound like mathematical geniuses or just plain nerds.

Um, Ryan, you said that there were two reasons.  Right.  Anyway, I needed something clever that can easily be coupled with log — naturally, a cabin.  And the cabin evokes memories of a youngster walking to a little red school house (a cabin) with their books tethered together with a leather strap slung over their shoulder.  OK, that may only work for the readers over 100 in age.

Nevertheless, this blog is going to be all about statistics and data in general.  I’ll touch on topics ranging from beginning statistics to my favorite problems and solutions from graduate school.  I hope to present solutions to classic problems as well as data mining topics (summarization, visualization, etc.) from the contemporary analytics world of Web 2.0.  The point here is that I am going to attempt to demystify statistics, hopefully educate some, and remind others why they fell in love with statistics in the first place.




Filed under Rambling