Category Archives: Python

Using XML package vs. BeautifulSoup

A while back I posted something about scraping a webpage using the BeautifulSoup module in Python.  One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R.  Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.

I decided to replicate my work in python for scraping the MLS data using the XML package for R.  OK, I didn’t replicate it exactly because I only scraped five years’ worth of data.  I figured that five years would be a sufficient amount of time for comparison purposes.  The only major criterion that I enforced was that they both had to export nearly identical .csv files.  I say “nearly” because R seemed to want to wrap everything in quotation marks and it also exported the row names (1, 2, 3, etc.) as default options in write.table().  Neither of these defaults is an issue, so I didn’t bother changing them.  I wrote in a few print statements in both scripts to show where a difference (if any) in timing might exist.  The code can be found in my scraping repository on github.
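For reference, here is a minimal sketch of what the XML-package side looks like.  The URL and the table index below are placeholders, not the actual MLS pages (the real script is in the github repo), and note the two write.table() defaults mentioned above:

library(XML)

## Placeholder URL -- the real script loops over the actual MLS stats pages.
url <- "http://www.example.com/mls/stats/2009"
tables <- readHTMLTable(url)
stats <- tables[[1]]  ## assumes the table we want is the first one on the page

## write.table() defaults: quote = TRUE wraps character fields in quotation
## marks, and row.names = TRUE writes the row names 1, 2, 3, ...
write.table(stats, file = "mls_2009.csv", sep = ",")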

To be honest, I don’t really know much about how system.time() works in R.  However, I used this function as the basis of my comparison.  Of course, I was source()’ing an R file and using R’s system() function to run the python script, i.e., system("path/to/py_script.py").  The results can be summarized in the following graph.
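Concretely, the timing calls looked something like this sketch (the script names here are placeholders):

## Time the R version.
r_time <- system.time(source("scrape_mls_xml.R"))

## Time the python version by shelling out.  The CPU time used by the child
## python process is reported under user.child/sys.child, while
## user.self/sys.self only measure work done in the R process itself.
py_time <- system.time(system("python scrape_mls_bs.py"))

r_time
py_time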

As you can see in the figure, there is about a 3x speedup in using the XML package relative to using BeautifulSoup!  This is not what I was expecting.  Further, it appears that the overall “user” speedup is approximately 5x.  In fact, the only place where python seems to beat the R package is in the user.self portion of the time….whatever the hell this means.

As I said before, I decided to print out some system times within each script because scraping this data is iterative.  That is, I scrape and process the data for each year within the loop (over years).  So I was curious to see whether each option was scraping and processing at about the same speed.  It turns out that XML beat BeautifulSoup here as well.
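The timestamps in the R log below come from a print() call at the top of each iteration, along the lines of this sketch (the python script does the same thing with time.ctime()):

for (year in 2005:2009) {
  print(paste(Sys.time(), "-- Year:", year))
  ## ... scrape and process this year's page ...
}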

Results:

## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished :)

and in R:

[1] "2010-08-29 18:10:29 -- Starting"
[1] "2010-08-29 18:10:29 -- Year: 2005"
[1] "2010-08-29 18:10:29 -- Year: 2006"
[1] "2010-08-29 18:10:29 -- Year: 2007"
[1] "2010-08-29 18:10:30 -- Year: 2008"
[1] "2010-08-29 18:10:30 -- Year: 2009"
[1] "2010-08-29 18:10:30 -- Finished :)"

What do I conclude from this?  Well, use R damnit!  The XML package is super easy to use and it’s fast.  Will I still use python?  Of course!  I would bet thatpython/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data — which will happen sooner rather than later.

My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v3.1.0.1.

Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts that are posted online!  Exciting stuff…you know, if you’re a nerd and like sports.



Filed under Python, R

Apologies and Style Guides

I have to say that it’s pretty exciting to watch your blog go from a few hits over its lifetime to getting almost 200 in a single day.  I am currently negotiating with Google over the purchase of this blog.  Or maybe not.  Again, thanks be to @revodavid for posting to the Revolution Analytics Blog.  Anyway, I just wanted to apologize for the format of the code snippets.  Reading the python script and the R code again was painful even to me.  I’m going to try to follow the Google style guides for python and R, and hopefully the code will be a bit more readable in future posts.  Further, I will try to host my projects on github so that I might find some collaborators.

I’ve added a few interesting links to the Blogroll on the right.  I would recommend subscribing to the RSS feed for r-bloggers.com.  I see some good stuff on there almost daily.

Next up:  An analysis of some MLS data!  All I’ve found so far is that the teams like to change their names.  Hopefully something useful will come of this project.


Filed under Python, R, Rambling

Scrapin’ with Python

Here is the code that I used for the previous post.  A few notes:

  1. I know a little about Python, but I’m by no means a python programmer.
  2. I’ve attempted to use BeautifulSoup once prior to this project.  There must be a more efficient way of processing the soup than what I’ve done below.
  3. I have a bad habit of using single-letter variable names.  @cablelounger and @raycourtney tried to break me of this habit, but it’s a holdover from my math/stats days.  I mean, I would never declare a variable name in a math paper to be something like ‘result’.  It makes programmatic sense, however.  You’ll see a few holdovers in the code, e.g. the line beginning with ‘ll = ‘.  What the hell is ‘ll’?  Well, I used to use ‘l’ for a list and people didn’t like the single-letter name.  So I changed it to ‘ll’ mostly out of spite.  Then it just became something akin to a trash collector in my code.  Don’t judge me!

In any case, I would love to hear your comments/questions about this code because I will be using some variant of this script to get more data from baseball-reference.com in the near future.

Ryan

@rtelmore on Twitter!

#!/usr/bin/python

import re
import time
import urllib2
from datetime import date

import numpy
from BeautifulSoup import BeautifulSoup

def get_baseball_soup(team, year):
    """Fetch the schedule/scores page for a given team and year as soup."""
    url = "http://www.baseball-reference.com/teams/%s/%s-schedule-scores.shtml" % (team, year)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    the_page = response.read()
    try:
        soup = BeautifulSoup(the_page)
    except Exception:
        # The original re.sub() call was missing its replacement argument;
        # stripping the troublesome '</scr' fragments with '' is assumed here.
        the_page = re.sub(r'</scr', '', the_page)
        soup = BeautifulSoup(the_page)
    return soup

def process_soup(soup):
    """Pull each game's length (in minutes) out of the schedule table."""
    data = soup.findAll('a', href=True)
    data_new = [link for link in data if
                link['href'].startswith('/games/standings')]
    mins = []
    for stew in data_new:
        # Walk up from the standings link to the table row for this game.
        ll = stew.findParent('td').findParent('tr')
        # The 18th cell (index 17) holds the game time as 'H:MM'.
        hm = [str(td.string) for td in ll.findAll('td')][17].split(':')
        try:
            mins.append(int(hm[0])*60 + int(hm[1]))
        except (ValueError, IndexError):
            pass
    return mins

if __name__ == '__main__':
    # League affiliation ('N' or 'A') for each baseball-reference team code.
    teams = {
        'ATL': 'N', 'CHC': 'N', 'CIN': 'N', 'HOU': 'N', 'NYM': 'N', 'PHI': 'N',
        'PIT': 'N', 'SDP': 'N', 'SFG': 'N', 'STL': 'N', 'WSN': 'N', 'MIL': 'N',
        'BAL': 'A', 'BOS': 'A', 'CHW': 'A', 'CLE': 'A', 'DET': 'A', 'KCR': 'A',
        'ANA': 'A', 'LAD': 'A', 'MIN': 'A', 'NYY': 'A', 'OAK': 'A', 'TEX': 'A'}
    outfile = '/Users/relmore/Sports/Simmons/length_of_baseball_games_%s.csv' % \
        date.today().strftime('%Y%m%d')
    f = open(outfile, 'a')

    for team in teams:
        out_list = []
        print time.ctime() + ' -- Getting data for the %s' % (team)
        for year in xrange(1970, 2010):
            league = teams[team]
            team_2 = team
            # print time.ctime() + ' -- Getting data for %s in %s' % (team, year)
            # Account for franchise moves and renames on baseball-reference.
            if int(year) < 1997 and team == 'ANA':
                team_2 = 'CAL'
            if int(year) < 1998 and team == 'MIL':
                league = 'A'
            if int(year) < 2005 and team == 'WSN':
                team_2 = 'MON'
            if int(year) < 1972 and team == 'TEX':
                team_2 = 'WSA'
            soup = get_baseball_soup(team_2, year)
            mins = process_soup(soup)
            # print (team, year, numpy.mean(mins), numpy.median(mins), numpy.std(mins))
            out_list.append('%s, %s, %s, %s, %s, %s, %s' %
                            (team, year, numpy.mean(mins), numpy.median(mins),
                             numpy.std(mins), league, team_2))
        f.write('\n'.join(out_list) + '\n')
    f.close()
    print time.ctime() + ' -- Finished! :) '


Filed under Python, Web 2.0