Using XML package vs. BeautifulSoup

A while back I posted something about scraping a webpage using the BeautifulSoup module in Python.  One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R.  Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.

I decided to replicate my Python work scraping the MLS data using the XML package for R.  OK, I didn’t replicate it exactly, because I only scraped five years’ worth of data — I figured that five years would be a sufficient amount of time for comparison purposes.  The only major criterion that I enforced was that both scripts had to export nearly identical .csv files.  I say “nearly” because R’s write.table() wraps character fields in quotes and also exports the row names (1, 2, 3, etc.) as default options.  Neither of these defaults is an issue, so I didn’t bother changing them.  I wrote in a few print statements in both scripts to show where a difference (if any) in timing might exist.  The code can be found in my scraping repository on github.
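For a sense of the Python side of that export, here is a minimal stdlib-only sketch of writing scraped rows to CSV (the column names and rows are made up, not the actual MLS fields). Unlike write.table()’s defaults, csv.writer neither quotes plain fields nor emits row names — hence the “nearly” identical files:

```python
import csv
import io

# Hypothetical scraped MLS rows -- the real fields and values differ.
rows = [(2005, "Galaxy", 13), (2006, "Dynamo", 11)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["year", "team", "wins"])  # header row
writer.writerows(rows)                     # one line per scraped record

print(buf.getvalue())
```

In a real script the buffer would be an open file handle instead of a StringIO.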

I don’t really know much about how system.time() works in R, to be honest, but I used it as the basis of my comparison.  Of course, I was source’ing an R file and using the system() function in R to run the python script, i.e., system(“path/to/py_script.py”).  The results are summarized in the following graph.
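system.time() reports user, system, and elapsed times for whatever R expression it wraps. A rough stdlib-only analogue of the same measurement from the Python side (the inline "pass" is a stand-in for the real scraping script, whose path is not shown) would be:

```python
import subprocess
import sys
import time

start = time.perf_counter()
# Stand-in for the real scraping script; in R, system.time(system("..."))
# plays the same role around the external call.
subprocess.call([sys.executable, "-c", "pass"])
elapsed = time.perf_counter() - start
print("elapsed: %.3f seconds" % elapsed)
```

Note this captures only wall-clock (elapsed) time; splitting out user/system CPU time the way system.time() does would require something like resource.getrusage.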

As you can see in the figure, there is about a 3x speedup in using the XML package relative to using BeautifulSoup!  This is not what I was expecting.  Further, it appears that the overall “user” speedup is approximately 5x.  In fact, the only place where python seems to beat the R package is in the user.self portion of the time — which, if I understand it correctly, counts only the CPU time used by the R process itself, with time spent in spawned child processes landing in the .child columns.

As I said before, I decided to print out some system times within each script because scraping this data is iterative.  That is, I scrape and process the data for each year within the loop (over years).  So I was curious to see whether each option was scraping and processing at about the same speed.  It turns out that XML beat BeautifulSoup here as well.
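Those print statements amount to a timestamped loop over years; a minimal sketch of the Python version (scrape_year is a hypothetical stand-in for the actual scrape-and-process step) looks like:

```python
import time

def stamp(msg):
    # Matches the log format in the Results section, e.g.
    # "Sun Aug 29 18:07:57 2010 -- Starting"
    line = time.asctime() + " -- " + msg
    print(line)
    return line

stamp("Starting")
for year in range(2005, 2010):
    # scrape_year(year)  # hypothetical: fetch and process one season
    stamp("Year: %d" % year)
stamp("Finished :)")
```

The gap between consecutive timestamps is then the per-year scrape-and-process cost for each implementation.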

Results:

## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished :)

and in R:

[1] "2010-08-29 18:10:29 -- Starting"
[1] "2010-08-29 18:10:29 -- Year: 2005"
[1] "2010-08-29 18:10:29 -- Year: 2006"
[1] "2010-08-29 18:10:29 -- Year: 2007"
[1] "2010-08-29 18:10:30 -- Year: 2008"
[1] "2010-08-29 18:10:30 -- Year: 2009"
[1] "2010-08-29 18:10:30 -- Finished :)"

What do I conclude from this?  Well, use R damnit!  The XML package is super easy to use and it’s fast.  Will I still use python?  Of course!  I would bet that python/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data — which will happen sooner rather than later.

My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v3.1.0.1.

Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts posted online!  Exciting stuff…you know, if you’re a nerd and like sports.

17 Comments

Filed under Python, R


  1. Now you’ve got me excited about your fantasy football draft results. I’m going to be in a draft next week! I’ve done some R work with fantasy football myself for trying to predict performance. I might have to post about that as well.

  2. IMO, it’s not news that Python is relatively slow. The Python advantage is that it’s extremely simple and clean to code. I’m curious to know how much time each scraper took to script and debug?

    Basically, are the performance gains worth the investment time to switch from a well-documented (and for me, well-understood) library like BeautifulSoup to R XML?

    • Ryan

      Python is much faster than R when it comes to certain tasks — that’s why I was a little surprised. I agree that it is easy to use and well-documented. I tend to work primarily in R, and keeping everything in there seems like a reasonable option. I would say that I found the docs related to XML to be much easier to read than the BeautifulSoup docs; however, keep in mind that my primary “stats” workload is done in R.

      A buddy just sent me a command-line option for scraping and I’ll compare that vs these two options soon. Of course, programming via the command line can be very non-intuitive at times, so I’m not sure if this is the best option. We’ll see.

  3. Master Jimmy

    Geez man, maybe you should change the name of the blog to “The Sports Log”, or “All About Sports Statistics” or something descriptive like that. I think you need to mix it up a little on the subject matter.

  4. Master Jimmy

    I can’t argue that I have a (strong) bias against fantasy “sports”. It’s probably because I live in the real world where real teams play real games – and where I’m neither on, nor do I “manage”, any of them.

  5. Interesting write up. There was a similar blog yesterday from a statistics professor at Columbia University. I think he has one of the most popular statistical blogs on the net. I posted a link to your blog on his and referred to your graph. Here’s a link to the blog I’m referring to:
    http://www.stat.columbia.edu/~gelman/blog/

  6. I have been surprised at how many things I normally would have done in Perl, I wind up doing in R now. Built-in hashtables would be nice (though I see someone has put a friendly face onto the Environment workaround).

    I recently tried out BeautifulSoup for the first time and was impressed. But when I needed to scrape some airfares from a search engine, I tried it in R with the XML and rjson packages. About 3/4 of a page of code later and I have what I want where I want it — in a data.frame. Can’t beat it!

    • Ryan

      I vaguely remember someone talking about a hash package at the last UseR! conference. That could be the environment workaround that you are talking about though — I have no idea.

  7. Pingback: NBA Analysis: Coming Soon! | The Log Cabin

  8. Pingback: Tennis Graph Masterpiece | Stats in the Wild
