Last night I presented a talk at the DRUG introducing the R wrapper for the Data Science Toolkit. Lots of good questions, good forking, and good beer afterwards at Freshcraft. The slides are given below.
A while back I posted something about scraping a webpage using the BeautifulSoup module in Python. One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R. Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.
I decided to replicate my work in python for scraping the MLS data using the XML package for R. OK, I didn’t replicate it exactly because I only scraped five years’ worth of data. I figured that five years would be a sufficient amount of time for comparison purposes. The only major criterion that I enforced was that they both had to export nearly identical .csv files. I say “nearly” because R seemed to want to wrap everything in quotation marks and it also exported the row names (1, 2, 3, etc.) as default options in write.table(). Neither of these defaults is an issue, so I didn’t bother changing them. I wrote in a few print statements for commenting purposes in both scripts to show where a difference (if any) in timing might exist. The code can be found in my scraping repository on github.
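For readers who have not seen the scripts, here is a rough sketch of the scrape-a-table-and-write-a-.csv idea. It is not the code from the repository: it swaps in Python's standard-library `html.parser` for BeautifulSoup so that it runs without extra installs, and the sample page, class name, and function name are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Parse the table rows out of an HTML string and return them as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample page, not real MLS data.
SAMPLE = """
<table>
  <tr><th>Club</th><th>Year</th><th>Points</th></tr>
  <tr><td>Team A</td><td>2005</td><td>10</td></tr>
  <tr><td>Team B</td><td>2005</td><td>7</td></tr>
</table>
"""

print(table_to_csv(SAMPLE))
```

Note that Python's `csv.writer` only quotes fields when it has to and writes no row names, which is the kind of small difference from R's write.table() defaults mentioned above.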
I don’t really know much about how system.time() works in R to be honest. However, I used this function as the basis of my comparison. Of course, I was source’ing an R file and using the system() function in R to run the python script, i.e., system(“path/to/py_script.py”). The results can be summarized in the following graph.
As you can see in the figure, there is about a 3x speedup in using the XML package relative to using BeautifulSoup! This is not what I was expecting. Further, it appears that the overall “user” speedup is approximately 5x. In fact, the only place where python seems to beat the R package is in the user.self portion of the time….whatever the hell this means.
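On the user.self puzzle: as far as I can tell, system.time() reports CPU time used by the R process itself (user.self, sys.self) separately from CPU time used by any child processes it launches (user.child, sys.child). A script started through system() is a child process, so most of its work would land in the child columns, which may be why python looks so cheap in user.self. Below is a minimal Python sketch of the same split, assuming a Unix-like system (the child fields of `os.times()` are zero on Windows). The function name and the stand-in command are hypothetical.

```python
import os
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd and return wall-clock time plus CPU time split into self and child."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # waits for the child to finish
    elapsed = time.perf_counter() - start
    after = os.times()
    return {
        "elapsed": elapsed,
        "user_self": after.user - before.user,
        "sys_self": after.system - before.system,
        "user_child": after.children_user - before.children_user,
        "sys_child": after.children_system - before.children_system,
    }


# Stand-in for the real scraper: a child Python process that burns a little CPU.
timings = time_command([sys.executable, "-c", "sum(i * i for i in range(200000))"])
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.3f}")
```

The parent process does almost nothing except wait, so its own user time should be small while the child columns pick up the actual work.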
As I said before, I decided to print out some system times within each script because scraping this data is iterative. That is, I scrape and process the data for each year within the loop (over years). So I was curious to see whether each option was scraping and processing at about the same speed. It turns out that XML beat BeautifulSoup here as well.
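The per-year timing described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the original script: `scrape_year` is a hypothetical placeholder for the real scrape-and-process step, and I am assuming the timestamps came from something like `time.ctime()`.

```python
import time


def log(message):
    """Print a timestamped message and return the line that was printed."""
    line = f"{time.ctime()} -- {message}"
    print(line)
    return line


def scrape_year(year):
    """Hypothetical placeholder for scraping and processing one year of data."""
    return []


def run_scrape(years):
    """Scrape each year in turn, logging a timestamp after each one finishes."""
    lines = [log("Starting")]
    for year in years:
        scrape_year(year)
        lines.append(log(f"Year: {year}"))
    lines.append(log("Finished :)"))
    return lines


run_scrape(range(2005, 2010))
```

Because each timestamp is printed after a year has been processed, the gap between consecutive lines is roughly the time spent on that year.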
## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished :)
and in R:
 "2010-08-29 18:10:29 -- Starting"
 "2010-08-29 18:10:29 -- Year: 2005"
 "2010-08-29 18:10:29 -- Year: 2006"
 "2010-08-29 18:10:29 -- Year: 2007"
 "2010-08-29 18:10:30 -- Year: 2008"
 "2010-08-29 18:10:30 -- Year: 2009"
 "2010-08-29 18:10:30 -- Finished :)"
What do I conclude from this? Well, use R damnit! The XML package is super easy to use and it’s fast. Will I still use python? Of course! I would bet that python/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data, which will happen sooner rather than later.
My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v18.104.22.168.
Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts that are posted online! Exciting stuff…you know, if you’re a nerd and like sports.
So why the name “The Log Cabin”? Well, the reason is two-fold. (1) I am constantly reminded about the general public’s lack-of-knowledge of and paranoia surrounding logs, or more specifically, logarithms. For example, I’ve had the following exchange on more than one occasion:
Random acquaintance: “What do you study, Ryan?”
Me: “Mathematics and Statistics.”
RA: “Wow; you must know a lot about logarithms.”
Me: “Um, I guess I know a bit about them. Probably nothing more than some other ‘special’ functions, e.g. sine, exponential, etc.”
RA: (shrieks and/or laughs nervously) “Yeah, did you see the score of the game?”
My best guess is that a person in RA’s shoes remembers “logarithm” because they think it’s a funny word or something. I have no idea. Any other hypotheses?
To hammer home the point, I recently read a NY Times article about using statistics to answer questions related to injury likelihood in professional baseball. The author mentioned that the analysts “build logarithm formulas and computer codes that test Conte’s hypotheses…” No shit! The article is referenced at the bottom of this post. As I said when I posted this on Facebook, I would bet that they did build a logit model (seems like a natural starting point), but this sounds like a gratuitous use of the word “logarithm” in order to make them sound like mathematical geniuses or just plain nerds.
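For anyone wondering where a logarithm would even show up in a logit model: the logit of a probability p is just the log of the odds, log(p / (1 - p)). Here is a tiny sketch of that function and its inverse. The probabilities are arbitrary examples, not numbers from the article, and I have no idea what the analysts actually fit.

```python
import math


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))


# Arbitrary example probabilities.
for p in (0.1, 0.5, 0.9):
    print(p, logit(p), inv_logit(logit(p)))
```

A logit model says the log-odds are a linear function of the predictors, so that one log is about all the “logarithm formulas” there would be.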
Um, Ryan, you said that there were two reasons. Right. Anyway, (2) I needed something clever that can easily be coupled with log: naturally, a cabin. And the cabin evokes memories of a youngster walking to a little red school house (a cabin) with their books tethered together with a leather strap slung over their shoulder. OK, that may only work for the readers over 100 in age.
Nevertheless, this blog is going to be all about statistics and data in general. I’ll touch on topics ranging from beginning statistics to my favorite problems and solutions from graduate school. I hope to present solutions to classic problems as well as data mining topics (summarization, visualization, etc.) from the contemporary analytics world of Web 2.0. The point here is that I am going to attempt to demystify statistics, hopefully educate some, and remind others why they fell in love with statistics in the first place.