On July 29, 2010, I had a flight from Denver to Cincinnati. About an hour before boarding, I went to ESPN’s website and found a new article by Bill Simmons, a.k.a. The Sports Guy (@sportsguy33 on Twitter). The basic premise of the article is that a core group of fans is losing interest in Red Sox games this season. He assigns percentages to the reasons why people are losing interest (he’s a writer, not a statistician). Anyway, he states that the “biggie”, the “killer”, etc. is the time of games (his 55%), *i.e.*, baseball games are too damn long. He gives some data from baseball-reference.com to back up his claim.

So what does this have to do with me flying to Cincinnati from Denver? Well, being the nerd that I am, I immediately went to baseball-reference.com to see if I could download more data! As an aside, I’ve been obsessing over learning how to use ggplot2 since my return from the useR2010 conference about two weeks ago. This seems like a good time to start learning. Ah shit…it appears that this project is going to be a little harder than I expected. I could download the data for one team and one season before I boarded, but I wanted the past 30 seasons for all teams that played every season. I suppose that I would have to write a scraper to collect the data from about 750 web pages. Sweet, now I have something to do on the flight rather than obsess over whether or not the person next to me will spill over into my seat.

Now I’m on the plane. I’ve downloaded the html of one webpage so that I can test my python scraper (using BeautifulSoup) and I have the data that Simmons used in his article. The first thing that I do is make a few graphs using his data. Here is the first.

He essentially discretized the length of each game into five bins: A – less than or equal to two hours, B – more than two but less than 2 hours and 30 min, and so on. It certainly looks like the relative size of the blue and purple rectangles together is increasing and the golden rectangle seems to be decreasing. Maybe Simmons is onto something. Note that this isn’t the only figure that I made. There were quite a few others, but I want to get to the good stuff.
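For anyone who wants to reproduce the binning, here is a minimal Python sketch. It is an illustration only, not the code behind the figure: the cut points are taken from the interval labels in the R code further down (right-closed, so a game of exactly two hours lands in A), the letters for the last three bins are assumed to continue as C, D, and E, and the sample durations are made up.

```python
from bisect import bisect_left

# Upper edges (in minutes) of categories A-D; anything longer falls in E.
# Intervals are right-closed, following the "(2,2.30]"-style labels.
UPPER_EDGES = [120, 150, 180, 240]
CATEGORIES = "ABCDE"


def length_category(minutes):
    """Return the category letter for a game lasting `minutes` minutes."""
    return CATEGORIES[bisect_left(UPPER_EDGES, minutes)]


games = [118, 120, 121, 150, 185, 241]  # made-up durations in minutes
print([length_category(m) for m in games])
```

Using `bisect_left` keeps the edges in one list, so changing a cut point means editing a single number.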

(Back onboard the plane. I have two or three terminals open, Textmate with some python code open, R is open, and the dude sitting next to me is sufficiently freaked out by what’s going on.)

OK, this is getting too long. It turns out that I didn’t finish the scraper on the flight to Cincinnati. Fortunately there is the return flight and a few ‘down’ hours to work on this project while visiting KY. Success! Upon landing at DIA on Monday, I had the scraper (for the test page that I downloaded on Friday) working. Now I needed to write a script to iterate across all teams and every year since 1970. I wrote the script, processed the ‘soup’ for all 700+ team/season combos and I give you the following figure!

So are baseball games getting longer? Well, the preceding figure gives the median length of games (in minutes) for all teams over each season since 1970. It looks like it is going from blue to red, suggesting an increase in the median length of games. However, you might also notice that I’ve added two vertical lines bracketing 1994 through 2004. This roughly corresponds to the “Steroids Era” in MLB. It looks like game times have been decreasing a bit since the middle of this era or so. Can we look at the data in another way? Of course, I give you Figure 3!

Here’s a different way of displaying the same data as given in the second figure. I didn’t separate by team in this figure, however. I’m just looking at the median length of games across all teams for each season, with a smooth curve added to show the trend in the data. Note that the peak on the curve corresponds to roughly 1998. If you recall, Mark McGwire and Sammy Sosa were hitting HRs at record rates and teams were scoring a lot of runs. And it looked as if their heads might suddenly explode from all of the ‘roids that they were taking.
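The medians plotted in these figures are easy to compute once the game times have been scraped. Here is a minimal Python sketch using only the standard library. It assumes each game's duration arrives as an 'H:MM' string and that failed lookups come back as `None` or an empty string; both are assumptions about the scraped data, and the rows below are made up.

```python
from collections import defaultdict
from statistics import median


def to_minutes(hmm):
    """Convert an 'H:MM' game time such as '3:05' to minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def median_lengths(rows):
    """Map (team, year) to the median game length in minutes.

    `rows` is an iterable of (team, year, 'H:MM') tuples; rows with a
    missing time (None or empty string) are skipped.
    """
    grouped = defaultdict(list)
    for team, year, hmm in rows:
        if hmm:
            grouped[(team, year)].append(to_minutes(hmm))
    return {key: median(values) for key, values in grouped.items()}


# Made-up rows, including one missing value.
rows = [
    ("BOS", 2009, "3:05"), ("BOS", 2009, "2:50"), ("BOS", 2009, "3:20"),
    ("COL", 2009, "2:45"), ("COL", 2009, None), ("COL", 2009, "2:55"),
]
print(median_lengths(rows))
```

The median is less sensitive than the mean to the occasional extra-inning marathon, which is one reason to prefer it for a summary like this.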

So what do I take from this? Well, overall, I would say that the length of baseball games in general seems to be on the decline in the most recent years. Is the same true for just the Red Sox? Ah ha, I can make that figure. Here you go.

Interestingly enough, it looks like the smooth trend line for Red Sox games is increasing in recent years whereas it’s decreasing for all other teams. So maybe this Simmons character is onto something for his beloved Red Sox.

What do I think? I don’t really care. I just wanted an interesting data set to use so that I can learn a bit more about ggplot2, learn a bit more about scraping data using python/beautifulsoup, and kill some time on a flight. So that’s my story.

Note that all of the data was obtained from baseball-reference.com. Figures were made using the ggplot2 package in the R software. Further, I know that I could do more with this data set from a statistical perspective. That’s what I do, I’m a statistician. But I have a full-time job and just wanted to learn some stuff!

Some final thoughts. I’m happy to share the R and/or Python code for this project. Just send me a message on twitter or gmail (twitter name @ gmail) and I’ll send it your way. Also, you’ll notice some missing data for the Red Sox, NYY, LAD, and CHC. Why? I’m not sure. Apparently the HTML layout of some of those pages is a bit different from the rest, and the scraper was returning missing values. I should also say that I scraped the appropriate pages when teams switched cities. For example, I was scraping Montreal prior to 2005 for the Washington Nationals data.
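To illustrate the relocation handling, here is a minimal Python sketch of how the list of pages to scrape could be built. It is not the scraper itself: the URL template and the team codes (`WSN`, `MON`) are assumptions for illustration, only the Montreal/Washington move mentioned above is listed, and nothing is downloaded.

```python
from itertools import product

# Hypothetical URL template; the real page layout may differ.
URL_TEMPLATE = (
    "https://www.baseball-reference.com/teams/{code}/{year}-schedule-scores.shtml"
)

# Franchise -> (first season under the current code, code used before that).
# Codes are assumed for illustration.
RELOCATIONS = {"WSN": (2005, "MON")}


def page_code(franchise, year):
    """Return the team code whose page should be scraped for `year`."""
    if franchise in RELOCATIONS:
        first_year, old_code = RELOCATIONS[franchise]
        if year < first_year:
            return old_code
    return franchise


def season_pages(franchises, first_year, last_year):
    """Yield (franchise, year, url) for every franchise/season combination."""
    for franchise, year in product(franchises, range(first_year, last_year + 1)):
        code = page_code(franchise, year)
        yield franchise, year, URL_TEMPLATE.format(code=code, year=year)


pages = list(season_pages(["BOS", "WSN"], 1970, 2010))
print(len(pages))  # 2 franchises x 41 seasons = 82
```

Keeping the franchise name separate from the page code means the Montreal seasons still end up under one team label in the final data set.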

Ryan

@rtelmore on Twitter!

Here is the R code:

    library(ggplot2)

    ## His Data
    sim.dat <- read.table("baseball_rs_game_times.txt", header = TRUE)

    ## Stacked bars: games per year, filled by length category
    ggplot(data = sim.dat, aes(x = Year, y = Count, fill = Cat)) +
      geom_bar(width = 1, stat = 'identity') +
      scale_fill_discrete(name = "length category") +
      scale_x_continuous("year") +
      scale_y_continuous("games")

    ## Same plot with year as a factor and interval labels in the legend
    ggplot(data = sim.dat, aes(x = as.factor(Year), y = Count, fill = Cat)) +
      geom_bar(stat = 'identity') +
      scale_fill_discrete(name = "length", breaks = unique(sim.dat$Cat),
                          labels = c("(,2]", "(2,2.30]", "(2.30,3]", "(3,4]", "(4,)")) +
      scale_x_discrete("year") +
      scale_y_continuous("games")
    ggsave("~/Sports/Simmons/mlb_length_1.png", height = 7, width = 7)

    ## One panel per year; stat = 'identity' because Count is already tallied
    ggplot(sim.dat, aes(Cat, y = Count, col = Count, fill = Count)) +
      geom_bar(stat = 'identity') +
      facet_wrap(~ Year)
    last_plot() +
      scale_y_continuous("number of games") +
      scale_colour_continuous("games") +
      scale_fill_continuous("games") +
      scale_x_discrete("length of games")
    ggsave("~/Sports/Simmons/mlb_length_2.png", height = 7, width = 7)

    ## My analysis
    rs.dat <- read.csv("~/Sports/Simmons/length_of_baseball_games_20100802.csv",
                       header = FALSE)
    names(rs.dat) <- c("team", "year", "mean_len", "med_len", "std_len",
                       "league", "TNAT")

    ## Heat map of median game length by team and season
    ggplot(data = rs.dat, aes(x = year, y = team, fill = med_len)) +
      geom_tile() +
      scale_fill_continuous("minutes")
      #+ opts(title = "Median Length of MLB Games by Team (in minutes)")
    last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 3)
    ggsave("~/Sports/Simmons/mlb_length_3.png", height = 7, width = 7)

    ## Boston versus everyone else
    rs.dat$bs <- rs.dat$team == 'BOS'
    qplot(x = year, y = med_len, data = rs.dat, geom = c("point", "smooth"),
          span = .5, colour = bs, ylab = "length of game in minutes")
    last_plot() + scale_colour_discrete(name = "Boston?")
    last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 2, col = "black")
    ggsave("~/Sports/Simmons/mlb_length_5.png", height = 7, width = 7)

    ## All teams together
    qplot(x = year, y = med_len, data = rs.dat, geom = c("point", "smooth"),
          span = .5, ylab = "length of game in minutes")
    last_plot() + geom_vline(xintercept = c(1993, 2004), lty = 2, col = "red")
    ggsave("~/Sports/Simmons/mlb_length_4.png", height = 7, width = 7)

Very nice. I wonder if ’roids played any part in the increase in the length of games. I’m sure they were much more popular than when the public was first informed. The decline may be related to better drug testing practices in the majors.

I could certainly test it with these data that I’ve scraped. However, there are only so many hours in the day! :) The datasets on baseball-reference.com are so rich, there are a million things that one could explore.

Love the post! I found you through Matt Parker’s DRUG email list.

Share the python scrape code -> this would be a very interesting first DRUG user session, I think. I’m a beginner R user (err, a reformed “SAS”er) and learning fast and furious.

Also, it would be interesting to ‘regress’ on total field area (the Red Sox have the smallest park of anybody in the majors in terms of total area), since a smaller park means more home runs, fewer foul outs, etc. Then we can see if it’s the Sox “style” or an artifact of a small park.

Thanks!

Thanks, Scott. I’ll post the code this weekend. If you know of python developers in the area, I would be interested in hearing how they might optimize the code. I’m new to beautifulsoup and I’m sure that I’m not using it in the most efficient way possible.

I posted the code: https://thelogcabin.wordpress.com/

I just used

`....`

and it apparently doesn’t preserve the indentations. So it won’t run as is. If you know some python, it shouldn’t be hard to figure out where you need to put a tab.

Cool analysis! How did you scrape the data from the website?

Thanks, David. I’ll post it this weekend.

See my reply to Scott for the link to the code.

nice post! I’ve also been itching to learn ggplot2 after useR! 2010, and hopefully I’ll get to it soon.

Thanks Eric. It’s a pretty sweet package for putting together nice figures in a relatively short amount of time. The book is great; however, a lot of the information is available on Hadley Wickham’s website.

You can find game info using the Retrosheet game logs: http://www.retrosheet.org/gamelogs/index.html There’s an index of baseball datasets available on the web at http://infochimps.org/search?query=baseball — please contribute yours!

The Red Sox games are most likely longer because of their emphasis on on-base percentage, which implies in turn a high walk rate and thus a high number of pitches per plate appearance.

As a high-profile team, they also play more games on national TV — such games have more ads and thus more delays.

If you REALLY want to blow your mind data-wise, check out the MLB gameday data, as put to use here: http://labs.dataspora.com/gameday/pitcher/josh-beckett/277417 — it’s the game state, play-by-play, and **trajectory of every pitch**, for every major league baseball game since 2008. Raw XML data available at http://gdx.mlb.com/components/game/mlb/ Scrape carefully, it’s a lot lot lot of files.

Thanks Flip. I’ll upload the data when I get a chance.

I’m not sure if the OBP theory holds in this case. Other teams subscribe to the same philosophy (e.g., Oakland) and I don’t find a similar trend. Of course, they aren’t as successful as the Red Sox and aren’t on TV as much. I don’t really see a similar trend in Yankees or Cubs games either, and they are on TV as often as the Red Sox, if not more.

It would be nice to really investigate this phenomenon, but I’m on to my next project with MLS data and, of course, I have work to do!

Oh, I’ve seen the post at Dataspora. I used to live in SF and caught this analysis at a Bay Area R User Group meeting.