Last Saturday I had the good fortune to present a talk on finding, gathering, and analyzing some sports-related data on the web at the local SABR group meeting. In case you’re not familiar with the “SABR” acronym, it stands for “Society for American Baseball Research”; here’s a link to the national organization. The talk was light on tech and heavy on graphs (predominantly made in R and in particular ggplot2). Good times were had by all. The slides from the talk are given below. Most of the slides and code are recycled from previous talks, so I apologize in advance if you’re already familiar with the content. It was, however, new to the SABR people.
A week or so ago I saw a tweet about how the NFL lockout was affecting the search traffic for “fantasy football” on Google (using Google Trends). Basically, the search traffic (properly normalized on Google Trends) was down prior to the end of the lockout. I decided to explore this a bit myself and chose to look into the RGoogleTrends package for R. The RGoogleTrends package is basically a convenient way to interact with the Google Trends API using R. Very cool stuff. All of my code can be found at my GoogleTrends repo on GitHub.
My first query was to pull the Google Trends data for the past seven or so years using the search term “fantasy football”. The following figure shows the results over time. It’s immediately obvious that the normalized search scores for “fantasy football” were on the decline (2011 over previous years) prior to the end of the lockout; however, it appears that interest has since recovered.
I then decided to look at the trends for “NFL”. There isn’t a dramatic decrease in the (normalized) searches for “NFL” prior to the lockout’s end, but you do see a huge spike in searches after the announcement.
- It would be interesting to align the curves in the last plot by the day of week. That is, it would be nice to compare the trends scores, as an example, for the 7th Wednesday prior to the start of the NFL preseason or something.
- To use the RGoogleTrends package, log into Gmail (or another Google service) in Firefox first; the package can then use the Firefox cookies to pass your username and password.
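The day-of-week alignment idea from the first bullet could be sketched roughly as follows. This is a Python sketch (the language I use for scraping), not code from the repo. `align_to_event` is a made-up helper, and the August 11, 2011 date is only a placeholder for whatever day you treat as the start of the preseason.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def align_to_event(day, event):
    """Return (n, weekday): `day` is the n-th such weekday before `event`."""
    days_before = (event - day).days
    if days_before < 0:
        raise ValueError("day falls after the event")
    # The event day itself gets n = 0; 1-7 days before gets n = 1, and so on.
    n = (days_before + 6) // 7
    return n, WEEKDAYS[day.weekday()]


preseason_start = date(2011, 8, 11)  # placeholder date, an assumption
key = align_to_event(date(2011, 6, 29), preseason_start)
print(key)
```

Trends scores from different years that share the same key (say, the 7th Wednesday before the start) could then be plotted on top of each other.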
I have to admit that working with my Major League Soccer data set has been slow going. There are a few reasons: (1) I have a full-time job at the National Renewable Energy Lab and (2) the data isn’t quite as “rich” as I initially thought. As an example, the MLS site doesn’t list the wins and losses for each team by year. That seems to be a fundamental piece of “sports”-type data, right? In any case, I did come across something that I can’t seem to answer. If you know somebody that works with MLS, send ’em my email address and tell them that I want answers, dammit!
So following up on my previous MLS-related post, I wanted to see if I could pinpoint why goals per game has been decreasing in recent years. My first thought was that with MLS expanding, more US-based players transferring overseas, etc., the overall talent level in MLS has suffered a bit in more recent years. One way that this might manifest itself in the data is by having fewer shots “on target” or “on goal”. Therefore, I looked at the number of shots on goal vs. the number of shots and also vs. the number of goals over the years. The two figures are given next.
Based on the first figure, one could argue that the shooters are becoming a little less accurate. That is, the number of shots on target per shot has decreased by about 10% over the course of the league’s lifetime. Shots on goal per goal seems relatively steady over this same time period. This might suggest that the league’s strikers are getting slightly worse whereas the quality of the keepers is holding steady. That, of course, could contribute to the decline of goals per game.
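To make the two ratios concrete, here is a small Python sketch. The season totals are placeholders chosen to mimic the pattern described above (accuracy down about 10%, shots on goal per goal flat); they are not real MLS numbers, and the helper names are invented for illustration.

```python
def ratio(numerator, denominator):
    """Safe division: returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def relative_change(old, new):
    """Fractional change from old to new (e.g. -0.10 for a 10% drop)."""
    return (new - old) / old


# Placeholder season totals, NOT real MLS numbers.
seasons = {
    "early": {"shots": 1000, "shots_on_goal": 500, "goals": 150},
    "late": {"shots": 1000, "shots_on_goal": 450, "goals": 135},
}

# Shots on goal per shot: a rough measure of shooter accuracy.
accuracy = {k: ratio(v["shots_on_goal"], v["shots"]) for k, v in seasons.items()}
# Shots on goal per goal: a rough measure of how hard keepers are to beat.
sog_per_goal = {k: ratio(v["shots_on_goal"], v["goals"]) for k, v in seasons.items()}

print(relative_change(accuracy["early"], accuracy["late"]))
print(sog_per_goal)
```

With these made-up totals the first ratio falls while the second stays put, which is the combination that points at the strikers rather than the keepers.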
I also decided to look at the number of assists per goal. Why? Well, my logic is that if there are more assists per goal, then there might be better overall team play. Conversely, a decrease in this number might be a result of teams having one or more stars (hence, more individual brilliance) and less of the quality, build-up-type goals. Make sense? C’mon, I’m trying here! Anyway, here is the resulting graph.
Whoa, what in the hell happened there? The data look a bit suspicious. Specifically, there seems to be a serious change between the 2002 and 2003 seasons. So I made a similar graph, but I separated by the different time periods. Here ya go.
What does this mean? My hypothesis is that there was a fundamental change to the rules in how assists were recorded between the 2002 and 2003 seasons. Unfortunately, I can’t confirm this. I’ve searched the web, read the MLS Wikipedia page, read a good amount of the MLS website, and can’t seem to find anything related to a rules change that might result in this sort of phenomenon. Sooooo, if you have any ideas, send ’em my way!
This will likely be the last MLS-specific post for a while. Unless I can find some more data, I’m giving up — their data is just not that interesting. Notice that I didn’t say that this would be my last soccer post. Hopefully I can scrape some EPL (England’s Premiership) data. Given that their league has been around for more than 15 years, it should be a bit more interesting than mine.
If you’re interested in taking a look at the data and/or code yourself, I’ve created a GitHub repository for your perusal. Feel free to pass along your comments and/or questions regarding any code; I have thick skin.
So what’s next? I am thinking about comparing my current workflow of (a) scrape with Python and (b) analyze with R to just doing everything in R (e.g., using the XML package). Hopefully, I can post some time comparisons soon!
Addendum: According to at least one blogger, the recording of “secondary assists” was changed after the 2002 season. I’m not sure why they record secondary assists in the first place; I guess MLS wanted to appeal to the hockey people in the early years. Here is the blogger’s take on secondary assists:
I promised something related to Major League Soccer and here it is. Caveat: It’s not much. Why so sparse? (1) The data is a bit messy due to teams folding, expansion, name changes, etc. (2) I was backpacking all weekend and didn’t have time to work on this side project. Yes, I have a real job and working during the work week is a bit difficult.
My first step was to scrape the “stats” section of the MLS site to get all of their public data. Or at least all of the data that is relatively easy to find and easy to scrape. I’ll post the code soon once I set up the master repository on GitHub. Needless to say, I think it looks a bit better than my initial foray into BeautifulSoup as posted here.
I decided to look at goals per game by team and year. Most people who like soccer like goals, so this seems like a good starting point. Here is the initial figure.
As you can see, there are a lot of blank spaces. The reason is that a lot of teams changed their name and/or relocated (e.g., San Jose), some teams folded (e.g., Tampa Bay), and MLS added teams over the years (e.g., Chivas USA). The bottom line is that it makes for an ugly graph. In an attempt to clean it up a bit, I tried to consolidate some of the names. Here is the new figure.
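The name consolidation could look something like the Python sketch below. The alias table is illustrative only (three well-known MLS renamings); the mapping actually used for the figure isn't shown in this post, and `canonical_name` and `consolidate` are hypothetical helpers.

```python
from collections import defaultdict

# Illustrative aliases only; not the mapping used for the figure.
CANONICAL = {
    "San Jose Clash": "San Jose Earthquakes",
    "Kansas City Wiz": "Kansas City Wizards",
    "Dallas Burn": "FC Dallas",
}


def canonical_name(team):
    """Map an old team name to its current one; unknown names pass through."""
    team = team.strip()
    return CANONICAL.get(team, team)


def consolidate(rows):
    """rows: (team, year, goals_per_game) -> {canonical team: {year: gpg}}."""
    table = defaultdict(dict)
    for team, year, gpg in rows:
        table[canonical_name(team)][year] = gpg
    return dict(table)


# Made-up goals-per-game values, just to show the rows merging.
rows = [("San Jose Clash", 1996, 1.5), ("San Jose Earthquakes", 2001, 1.8)]
print(consolidate(rows))
```

Teams that folded or joined late still leave gaps, of course; this only merges rows that belong to the same club under different names.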
It still doesn’t look great, but I do think that you can learn a bit from this figure. Overall, I would say that the goals per game for each team is decreasing over time. Is it a statistically significant decline? I dunno. I’m not writing a paper here — it’s a freaking blog, i.e., speculation reigns supreme! In any case, this raises more questions. For example,
- Does this apparent decline affect attendance numbers?
- What is the cause of this decline? Better defenders coming into the league? Um, I doubt it. I would imagine that quality strikers are being added at about the same rate.
I would hypothesize that it’s just that the quality of the league has improved significantly over the years. Hence, the teams are holding possession more and not just firing shots whenever they get a chance. As a result, I will look into attendance numbers, shots, shots on goal, etc. in the upcoming days or possibly weeks. I believe that some interesting questions can be answered with these data. However, I am still trying to discover what these questions might be. If you have any ideas for questions, let me hear about them in the comments section.
The R code for this project isn’t too interesting, so I won’t post it below; it will be on the GitHub repository in time, though. One thing that I did learn about R is that reading in numeric data measured in the thousands (e.g., attendance figures) can be problematic if the numbers have commas, because the values come in as text rather than numbers. It took me a while to find the workaround, which strips the commas with gsub() and then converts the result with as.numeric(). It’s given below.
mls.reg.dat$h_tot <- as.numeric(gsub(",", "", mls.reg.dat$h_tot))
mls.reg.dat$h_avg <- as.numeric(gsub(",", "", mls.reg.dat$h_avg))
mls.reg.dat$a_tot <- as.numeric(gsub(",", "", mls.reg.dat$a_tot))
mls.reg.dat$a_avg <- as.numeric(gsub(",", "", mls.reg.dat$a_avg))
On July 29, 2010, I had a flight from Denver to Cincinnati. About an hour before boarding, I went to ESPN’s website and found a new article by Bill Simmons, a.k.a. The Sports Guy (@sportsguy33 on Twitter). The basic premise of this article is that a core group of fans is losing interest in Red Sox games this season. So he decides to assign percentages to his reasons why people are losing interest (he’s a writer, not a statistician). Anyway, he states that the “biggie”, the “killer”, etc. is the time of games (his 55%), i.e., baseball games are too damn long. He gives some data from baseball-reference.com to back up his claim.
So what does this have to do with me flying to Cincinnati from Denver? Well, being the nerd that I am, I immediately went to baseball-reference.com to see if I could download more data! As an aside, I’ve been obsessing over learning how to use ggplot2 since my return from the useR2010 conference about two weeks ago. This seems like a good time to start learning. Ah shit…it appears that this project is going to be a little harder than I expected. I could download the data for one team and one season before I boarded, but I wanted the past 30 seasons for all teams that played every season. I suppose that I would have to write a scraper to collect the data from about 750 web pages. Sweet, now I have something to do on the flight rather than obsess over whether or not the person next to me will spill over into my seat.
Now I’m on the plane. I’ve downloaded the HTML of one webpage so that I can test my Python scraper (using BeautifulSoup) and I have the data that Simmons used in his article. The first thing that I do is make a few graphs using his data. Here is the first.
He essentially discretized the length of each game into five bins: A – less than or equal to two hours, B – more than two but less than 2 hours and 30 min, and so on. It certainly looks like the relative size of the blue and purple rectangles together is increasing and the golden rectangle seems to be decreasing. Maybe Simmons is onto something. Note that this isn’t the only figure that I made. There were quite a few others, but I want to get to the good stuff.
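For the curious, the five-bin discretization can be sketched in a few lines of Python. The bin edges here are an assumption read off the legend labels in the R code at the end of this post ("(,2]", "(2,2.30]", "(2.30,3]", "(3,4]", "(4,)"), with each bin closed on the right; Simmons' exact cutoffs may differ, and `length_category` is a made-up name.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges in minutes (right-closed); assumed from the legend labels.
UPPER_EDGES = [120, 150, 180, 240]
LABELS = ["A", "B", "C", "D", "E"]


def length_category(minutes):
    """Return the bin label (A through E) for a game length in minutes."""
    return LABELS[bisect_left(UPPER_EDGES, minutes)]


# Made-up game lengths, only to show the tally.
lengths = [118, 120, 149, 175, 181, 245]
counts = Counter(length_category(m) for m in lengths)
print(sorted(counts.items()))
```

Tallying the labels per season gives exactly the kind of Year/Cat/Count table that the stacked bar chart is drawn from.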
(Back onboard the plane. I have two or three terminals open, TextMate with some Python code open, R is open, and the dude sitting next to me is sufficiently freaked out by what’s going on.)
OK, this is getting too long. It turns out that I didn’t finish the scraper on the flight to Cincinnati. Fortunately there is the return flight and a few ‘down’ hours to work on this project while visiting KY. Success! Upon landing at DIA on Monday, I had the scraper (for the test page that I downloaded on Friday) working. Now I needed to write a script to iterate across all teams and every year since 1970. I wrote the script, processed the ‘soup’ for all 700+ team/season combos and I give you the following figure!
So are baseball games getting longer? Well, the preceding figure gives the median length of games (in minutes) for all teams over each season since 1970. It looks like it is going from blue to red, suggesting an increase in the median length of games. However, you might also notice that I’ve added two vertical lines from 1994 through 2004. This roughly corresponds to the “Steroids Era” in MLB. It looks like the game times have been decreasing a bit since the middle of this era or so. Can we look at the data in another way? Of course, I give you Figure 3!
Here’s a different way of displaying the same data as given in the second figure. I didn’t separate by team in this figure, however. I’m just looking at the median length of games across all teams for each season and added a smooth curve to show a trend to the data. Note that the peak on the curve corresponds to roughly 1998. If you recall, Mark McGwire and Sammy Sosa were hitting HRs at record rates and teams were scoring a lot of runs. And it looked as if their heads might suddenly explode from all of the ‘roids that they were taking.
So what do I take from this? Well, overall, I would say that the length of baseball games in general seems to be on the decline in the most recent years. Is the same true for just the Red Sox? Ah ha, I can make that figure. Here you go.
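If you want to see what "median length by team and season" versus "median length by season" means in code, here is a small Python sketch. It assumes one record per game with a length in minutes, which may not match how my scraper actually stored things; the game lengths are invented and `median_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def median_by(games, key):
    """Group game lengths by key(game) and take the median of each group."""
    groups = defaultdict(list)
    for game in games:
        groups[key(game)].append(game["minutes"])
    return {k: median(v) for k, v in groups.items()}


# Invented game lengths, not scraped data.
games = [
    {"team": "BOS", "year": 2009, "minutes": 185},
    {"team": "BOS", "year": 2009, "minutes": 200},
    {"team": "BOS", "year": 2009, "minutes": 170},
    {"team": "NYY", "year": 2009, "minutes": 190},
    {"team": "NYY", "year": 2009, "minutes": 210},
]

by_team_season = median_by(games, key=lambda g: (g["team"], g["year"]))
by_season = median_by(games, key=lambda g: g["year"])
print(by_team_season)
print(by_season)
```

The first dictionary corresponds to the tiles in the heat map (one value per team and year); the second pools every game in a season before taking the median.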
Interestingly enough, it looks like the smooth trend line for Red Sox games is increasing in recent years whereas it’s decreasing for all other teams. So maybe this Simmons character is onto something for his beloved Red Sox.
What do I think? I don’t really care. I just wanted an interesting data set to use so that I can learn a bit more about ggplot2, learn a bit more about scraping data using python/beautifulsoup, and kill some time on a flight. So that’s my story.
Note that all of the data was obtained from baseball-reference.com. Figures were made using the ggplot2 package in the R software. Further, I know that I could do more with this data set from a statistical perspective. That’s what I do, I’m a statistician. But I have a full-time job and just wanted to learn some stuff!
Some final thoughts. I’m happy to share the R and/or Python code for this project. Just send me a message on Twitter or Gmail (twitter name @ gmail) and I’ll send it your way. Also, you’ll notice some missing data for the Red Sox, NYY, LAD, and CHC. Why? I’m not sure. Apparently the layout of the HTML is a bit different from the rest of the pages and the scraper was returning missing values. I should also say that I scraped the appropriate pages when teams switched cities. For example, I was scraping Montreal prior to 2005 for the Washington Nationals data.
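The "iterate over every team and season, but follow franchises when they move" logic can be sketched like this in Python. It is not the script I actually ran: the team codes and the `FRANCHISES` table are assumptions for illustration, and turning each (code, year) pair into a real page address is left out on purpose.

```python
# Each franchise maps to (first season, page code) entries; codes are assumed.
FRANCHISES = {
    "BOS": [(1901, "BOS")],
    "WSN": [(1969, "MON"), (2005, "WSN")],  # Montreal before the 2005 move
}


def code_for_year(history, year):
    """Pick the page code in effect for `year`, or None if there is none yet."""
    code = None
    for start_year, page_code in sorted(history):
        if start_year <= year:
            code = page_code
    return code


def pages_to_scrape(franchises, first_year, last_year):
    """Yield (franchise, year, page code) for every team/season combo."""
    for franchise, history in franchises.items():
        for year in range(first_year, last_year + 1):
            code = code_for_year(history, year)
            if code is not None:
                yield franchise, year, code


pages = list(pages_to_scrape(FRANCHISES, 1970, 2010))
print(len(pages), pages[:2])
```

Keeping the relocation history in one table means the Nationals' rows come out under a single franchise label even though the pages scraped before 2005 belong to Montreal.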
@rtelmore on Twitter!
Here is the R code:
library(ggplot2)

## His Data
sim.dat <- read.table("baseball_rs_game_times.txt", header=TRUE)

## Stacked bars: games per year, filled by length category
ggplot(data = sim.dat, aes(x=Year, y=Count, fill=Cat)) +
  geom_bar(width=1, stat='identity') +
  scale_fill_discrete(name="length category") +
  scale_x_continuous("year") +
  scale_y_continuous("games")

## Same plot with year as a factor and readable legend labels
ggplot(data = sim.dat, aes(x=as.factor(Year), y=Count, fill=Cat)) +
  geom_bar(stat='identity') +
  scale_fill_discrete(name="length", breaks=unique(sim.dat$Cat),
                      labels=c("(,2]","(2,2.30]","(2.30,3]","(3,4]","(4,)")) +
  scale_x_discrete("year") +
  scale_y_continuous("games")
ggsave("~/Sports/Simmons/mlb_length_1.png", height=7, width=7)

## One panel per year; Count is already a tally, so use stat='identity'
ggplot(sim.dat, aes(Cat, y=Count, col=Count, fill=Count)) +
  geom_bar(stat='identity') +
  facet_wrap(~ Year)
last_plot() +
  scale_y_continuous("number of games") +
  scale_colour_continuous("games") +
  scale_fill_continuous("games") +
  scale_x_discrete("length of games")
ggsave("~/Sports/Simmons/mlb_length_2.png", height=7, width=7)

## My analysis
rs.dat <- read.csv("~/Sports/Simmons/length_of_baseball_games_20100802.csv",
                   header=FALSE)
names(rs.dat) <- c("team","year","mean_len","med_len","std_len","league","TNAT")

## Heat map: median game length by team and season
ggplot(data=rs.dat, aes(x = year, y = team, fill = med_len)) +
  geom_tile() +
  scale_fill_continuous("minutes")
#+ opts(title = "Median Length of MLB Games by Team (in minutes)")
## geom_vline takes xintercept rather than x
last_plot() + geom_vline(xintercept=c(1993,2004), lty=3)
ggsave("~/Sports/Simmons/mlb_length_3.png", height=7, width=7)

## Boston vs. everyone else, with smooth trend lines
rs.dat$bs <- rs.dat$team=='BOS'
qplot(x=year, y=med_len, data=rs.dat, geom = c("point","smooth"), span = .5,
      colour=bs, ylab="length of game in minutes")
last_plot() + scale_colour_discrete(name="Boston?")
last_plot() + geom_vline(xintercept=c(1993,2004), lty=2, col="black")
ggsave("~/Sports/Simmons/mlb_length_5.png", height=7, width=7)

## All teams together
qplot(x=year, y=med_len, data=rs.dat, geom = c("point","smooth"), span = .5,
      ylab="length of game in minutes")
last_plot() + geom_vline(xintercept=c(1993,2004), lty=2, col="red")
ggsave("~/Sports/Simmons/mlb_length_4.png", height=7, width=7)