Using XML package vs. BeautifulSoup

A while back I posted something about scraping a webpage using the BeautifulSoup module in Python.  One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R.  Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.

I decided to replicate my Python work scraping the MLS data using the XML package for R.  OK, I didn’t replicate it exactly, because I only scraped five years’ worth of data — I figured that five years would be a sufficient amount of time for comparison purposes.  The only major criterion that I enforced was that both scripts had to export nearly identical .csv files.  I say “nearly” because R’s write.table() wraps character fields in quotes and also exports the row names (1, 2, 3, etc.) as default options.  Neither of these defaults is an issue, so I didn’t bother changing them.  I wrote in a few print statements in both scripts to show where a difference (if any) in timing might exist.  The code can be found in my scraping repository on github.
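For a sense of the Python side of that export, here is a minimal stdlib-only sketch of writing scraped rows to CSV (the column names and rows are made up, not the actual MLS fields). Unlike write.table()’s defaults, csv.writer neither quotes plain fields nor emits row names — hence the “nearly” identical files:

```python
import csv
import io

# Hypothetical scraped MLS rows -- the real fields and values differ.
rows = [(2005, "Galaxy", 13), (2006, "Dynamo", 11)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["year", "team", "wins"])  # header row
writer.writerows(rows)                     # one line per scraped record

print(buf.getvalue())
```

In a real script the buffer would be an open file handle instead of a StringIO.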

I don’t really know much about how system.time() works in R, to be honest, but I used it as the basis of my comparison.  Of course, I was source’ing an R file and using the system() function in R to run the python script, i.e., system(“path/to/py_script.py”).  The results are summarized in the following graph.
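system.time() reports user, system, and elapsed times for whatever R expression it wraps. A rough stdlib-only analogue of the same measurement from the Python side (the inline "pass" is a stand-in for the real scraping script, whose path is not shown) would be:

```python
import subprocess
import sys
import time

start = time.perf_counter()
# Stand-in for the real scraping script; in R, system.time(system("..."))
# plays the same role around the external call.
subprocess.call([sys.executable, "-c", "pass"])
elapsed = time.perf_counter() - start
print("elapsed: %.3f seconds" % elapsed)
```

Note this captures only wall-clock (elapsed) time; splitting out user/system CPU time the way system.time() does would require something like resource.getrusage.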

As you can see in the figure, there is about a 3x speedup in using the XML package relative to using BeautifulSoup!  This is not what I was expecting.  Further, it appears that the overall “user” speedup is approximately 5x.  In fact, the only place where python seems to beat the R package is in the user.self portion of the time — which, if I understand it correctly, counts only the CPU time used by the R process itself, with time spent in spawned child processes landing in the .child columns.

As I said before, I decided to print out some system times within each script because scraping this data is iterative.  That is, I scrape and process the data for each year within the loop (over years).  So I was curious to see whether each option was scraping and processing at about the same speed.  It turns out that XML beat BeautifulSoup here as well.
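Those print statements amount to a timestamped loop over years; a minimal sketch of the Python version (scrape_year is a hypothetical stand-in for the actual scrape-and-process step) looks like:

```python
import time

def stamp(msg):
    # Matches the log format in the Results section, e.g.
    # "Sun Aug 29 18:07:57 2010 -- Starting"
    line = time.asctime() + " -- " + msg
    print(line)
    return line

stamp("Starting")
for year in range(2005, 2010):
    # scrape_year(year)  # hypothetical: fetch and process one season
    stamp("Year: %d" % year)
stamp("Finished :)")
```

The gap between consecutive timestamps is then the per-year scrape-and-process cost for each implementation.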

Results:

## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished :)

and in R:

[1] "2010-08-29 18:10:29 -- Starting"
[1] "2010-08-29 18:10:29 -- Year: 2005"
[1] "2010-08-29 18:10:29 -- Year: 2006"
[1] "2010-08-29 18:10:29 -- Year: 2007"
[1] "2010-08-29 18:10:30 -- Year: 2008"
[1] "2010-08-29 18:10:30 -- Year: 2009"
[1] "2010-08-29 18:10:30 -- Finished :)"

What do I conclude from this?  Well, use R damnit!  The XML package is super easy to use and it’s fast.  Will I still use python?  Of course!  I would bet that python/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data — which will happen sooner rather than later.

My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v3.1.0.1.

Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts posted online!  Exciting stuff…you know, if you’re a nerd and like sports.

17 Comments

Filed under Python, R


  1. Now you’ve got me excited about your fantasy football draft results. I’m going to be in a draft next week! I’ve done some R work with fantasy football myself for trying to predict performance. I might have to post about that as well.

  2. IMO, it’s not news that Python is relatively slow. The Python advantage is that it’s extremely simple and clean to code. I’m curious to know how much time each scraper took to script and debug?

    Basically, are the performance gains worth the investment time to switch from a well-documented (and for me, well-understood) library like BeautifulSoup to R XML?

    • Ryan

      Python is much faster than R when it comes to certain tasks — that’s why I was a little surprised. I agree that it is easy to use and well-documented. I tend to work primarily in R, and keeping everything in there seems like a reasonable option. I would say that I found the docs related to XML to be much easier to read than the BeautifulSoup docs; however, keep in mind that my primary “stats” workload is done in R.

      A buddy just sent me a command-line option for scraping and I’ll compare that vs these two options soon. Of course, programming via the command line can be very non-intuitive at times, so I’m not sure if this is the best option. We’ll see.

  3. Master Jimmy

    Geez man, maybe you should change the name of the blog to “The Sports Log”, or “All About Sports Statistics” or something descriptive like that. I think you need to mix it up a little on the subject matter.

  4. Master Jimmy

    I can’t argue that I have a (strong) bias against fantasy “sports”. It’s probably because I live in the real world where real teams play real games – and where I’m neither on, nor do I “manage”, any of them.

  5. Interesting write up. There was a similar blog yesterday from a statistics professor at Columbia University. I think he has one of the most popular statistical blogs on the net. I posted a link to your blog on his and referred to your graph. Here’s a link to the blog I’m referring to:
    http://www.stat.columbia.edu/~gelman/blog/

  6. I have been surprised at how many things I normally would have done in Perl, I wind up doing in R now. Built-in hashtables would be nice (though I see someone has put a friendly face onto the Environment workaround).

    I recently tried out BeautifulSoup for the first time and was impressed. But when I needed to scrape some airfares from a search engine, I tried it in R with the XML and rjson packages. About 3/4 of a page of code later and I have what I want where I want it — in a data.frame. Can’t beat it!

    • Ryan

      I vaguely remember someone talking about a hash package at the last UseR! conference. That could be the environment workaround that you are talking about though — I have no idea.

  7. Pingback: NBA Analysis: Coming Soon! | The Log Cabin

  8. Pingback: Tennis Graph Masterpiece | Stats in the Wild
