Scrapin’ with Python

Here is the code that I used for the previous post.  A few notes:

  1. I know a little about Python, but I’m by no means a Python programmer.
  2. I’ve attempted to use BeautifulSoup once prior to this project.  There must be a more efficient way of processing the soup than what I’ve done below.
  3. I have a bad habit of using single-letter variable names.  @cablelounger and @raycourtney tried to break me of this habit, but it’s a holdover from my math/stats days.  I mean, I would never name a variable in a math paper something like ‘result’.  It makes programmatic sense, however.  You’ll see a few holdovers in the code, e.g. the line beginning with ‘ll = ‘.  What the hell is ‘ll’?  Well, I used to use ‘l’ for a list and people didn’t like the single-letter name.  So I changed it to ‘ll’ mostly out of spite.  Then it just became something akin to a trash collector in my code.  Don’t judge me!

In any case, I would love to hear your comments/questions about this code because I will be using some variant of this script to get more data from baseball-reference.com in the near future.


@rtelmore on Twitter!


import urllib
import urllib2

from BeautifulSoup import BeautifulSoup
import re

from datetime import date, datetime
import time

import numpy

def get_baseball_soup(team, year):
    # Build the schedule-and-scores URL for one team-season.
    url = "http://www.baseball-reference.com/teams/%s/%s-schedule-scores.shtml" % (team, year)
    request = urllib2.Request(url)

    response = urllib2.urlopen(request)
    the_page = response.read()
    # re.sub() needs a replacement argument; an empty string is assumed here.
    the_page = re.sub('<\/scr', '', the_page)
    soup = BeautifulSoup(the_page)

    return soup
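
urllib2 only exists in Python 2. For anyone trying this on Python 3, here is a minimal sketch of the same fetch step using the standard library's urllib.request. The names URL_TEMPLATE, build_schedule_url, and fetch_schedule_page are made up for illustration, the URL pattern is copied from get_baseball_soup, and the UTF-8 decoding is an assumption about the page encoding. Whether the site still serves pages at this address is not something this sketch checks.

```python
import urllib.request

# URL pattern copied from get_baseball_soup above.
URL_TEMPLATE = "http://www.baseball-reference.com/teams/%s/%s-schedule-scores.shtml"


def build_schedule_url(team, year):
    """Fill the team code and season into the URL pattern."""
    return URL_TEMPLATE % (team, year)


def fetch_schedule_page(team, year, timeout=30):
    """Download the raw HTML for one team-season (hypothetical helper)."""
    url = build_schedule_url(team, year)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: decode as UTF-8 and replace any undecodable bytes.
        return response.read().decode("utf-8", errors="replace")
```

Keeping the URL construction separate from the download makes it possible to check the URLs for every team and year without hitting the site.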

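def process_soup(soup):
    # Collect every link on the page, then keep only the ones of interest.
    data = soup.findAll('a', href=True)
    data_new = [link for link in data if \
    mins = []
    for stew in data_new:
        # Walk up from the link to the table row that contains it.
        ll = stew.findParent('td').findParent('tr')
        # Cell 17 of the row holds the game length as H:MM.
        hm = [str(td.string) for td in ll.findAll('td')][17].split(':')
        mins.append(int(hm[0])*60 + int(hm[1]))
    return mins
    # for min in mins:
    #     print min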
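
Two steps in process_soup are easy to pull out and test on their own: picking the links to keep, and turning an H:MM string into minutes. The sketch below is one way to do that, not the code from the script. The filter condition after the `if` in the listing is cut off, so is_boxscore_link is a pure guess that box-score URLs contain "/boxes/". to_minutes is also a hypothetical helper; it returns None for anything that is not H:MM, since str(td.string) produces the text 'None' for a cell with no single string in it, and int('None') would raise a ValueError.

```python
def is_boxscore_link(href):
    """Hypothetical filter: assumes box-score URLs contain '/boxes/'."""
    return "/boxes/" in href


def to_minutes(game_length):
    """Convert an 'H:MM' string to minutes; return None if it is malformed."""
    parts = game_length.strip().split(":")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return None
    hours, minutes = int(parts[0]), int(parts[1])
    return hours * 60 + minutes


# Example: keep only the well-formed game lengths.
cells = ["2:35", "None", "3:04"]
mins = [m for m in (to_minutes(c) for c in cells) if m is not None]
```

Skipping malformed cells means one odd row does not stop a whole team-season from being processed.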

if __name__ == '__main__':
    # Current team code -> current league ('N' = National, 'A' = American).
    # LAD (the Dodgers) is a National League team.
    teams = {
        'ATL':'N', 'CHC':'N', 'CIN':'N', 'HOU':'N', 'NYM':'N', 'PHI':'N',
        'PIT':'N', 'SDP':'N', 'SFG':'N', 'STL':'N', 'WSN':'N', 'MIL':'N',
        'BAL':'A', 'BOS':'A', 'CHW':'A', 'CLE':'A', 'DET':'A', 'KCR':'A',
        'ANA':'A', 'LAD':'N', 'MIN':'A', 'NYY':'A', 'OAK':'A', 'TEX':'A'}
    outfile = '/Users/relmore/Sports/Simmons/length_of_baseball_games_%s.csv'%date.today().strftime('%Y%m%d')
    f = open(outfile,'a')

    for team in teams:
        out_list = []
        print time.ctime() + ' -- Getting data for the %s'% (team)
        for year in xrange(1970, 2010):
            league = teams[team]
            team_2 = team
            # print time.ctime() + ' -- Getting data for %s in %s'% (team, year)
            # Use the code (and league) the franchise had in earlier seasons.
            if (int(year) < 1997 and team == 'ANA'):
                team_2 = 'CAL'
            if (int(year) < 1998 and team == 'MIL'):
                league = 'A'
            if (int(year) < 2005 and team == 'WSN'):
                team_2 = 'MON'
            if (int(year) < 1972 and team == 'TEX'):
                team_2 = 'WSA'
            soup = get_baseball_soup(team_2, year)
            mins = process_soup(soup)
            # print (team, year, numpy.mean(mins), numpy.median(mins), numpy.std(mins))

            out_list.append('%s, %s, %s, %s, %s, %s, %s' % \
                (team, year, numpy.mean(mins), numpy.median(mins), numpy.std(mins), league, team_2))
        # Write one team's rows at a time.
        f.write('\n'.join(out_list) + '\n')
    f.close()
    print time.ctime() + ' -- Finished! :) '
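
The chain of if statements in the main loop maps a current team code to the code (or league) the franchise had in earlier seasons. One way to keep those rules in one place is a small lookup table. This is only a sketch: the cutoff years and codes are copied from the script, while RENAMES, franchise_code, and league_for are hypothetical names that do not appear in it.

```python
# Current code -> (first season under the current code, earlier code).
# Years and codes are copied from the if statements in the main loop.
RENAMES = {
    "ANA": (1997, "CAL"),
    "WSN": (2005, "MON"),
    "TEX": (1972, "WSA"),
}


def franchise_code(team, year):
    """Return the code to use in the URL for this team in this season."""
    if team in RENAMES:
        first_year, old_code = RENAMES[team]
        if year < first_year:
            return old_code
    return team


def league_for(team, year, current_league):
    """Return the league for this season; MIL is 'A' before 1998."""
    if team == "MIL" and year < 1998:
        return "A"
    return current_league
```

With these in place the loop body would reduce to two calls, e.g. team_2 = franchise_code(team, year) and league = league_for(team, year, teams[team]), and adding another renamed franchise only means adding a row to the table.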



Filed under Python, Web 2.0