Category Archives: Web 2.0

The RDSTK Presentation at Denver R Users Group

Last night I presented a talk at the DRUG introducing the R wrapper for the Data Science Toolkit.  Lots of good questions, good forking, and good beer afterwards at Freshcraft.  The slides are given below.


Filed under Data Science, Data Viz, R, Web 2.0

Scrapin’ with Python

Here is the code that I used for the previous post.  A few notes:

  1. I know a little about Python, but I’m by no means a Python programmer.
  2. I’ve attempted to use BeautifulSoup only once prior to this project.  There must be a more efficient way of processing the soup than what I’ve done below; a rough alternative is sketched just after this list.
  3. I have a bad habit of using single-letter variable names.  @cablelounger and @raycourtney tried to break me of this habit, but it’s a holdover from my math/stats days.  I mean, I would never give a variable in a math paper a name like ‘result’.  It makes programmatic sense, however.  You’ll see a few holdovers in the code, e.g. the line beginning with ‘ll = ‘.  What the hell is ‘ll’?  Well, I used to use ‘l’ for a list and people didn’t like the single-letter name.  So I changed it to ‘ll’ mostly out of spite.  Then it just became something akin to a trash collector in my code.  Don’t judge me!
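Regarding point 2, here is a minimal sketch of a more direct way to process the soup, assuming (as the script below does) that the game time sits in the 18th cell of each schedule-table row; the name process_soup_direct is purely illustrative:

def process_soup_direct(soup):
    """Pull game lengths (in minutes) straight from the schedule-table rows."""
    mins = []
    for row in soup.findAll('tr'):
        cells = row.findAll('td')
        if len(cells) > 17:
            # The 18th cell holds the game time as 'H:MM'.
            parts = str(cells[17].string).split(':')
            try:
                mins.append(int(parts[0]) * 60 + int(parts[1]))
            except (ValueError, IndexError):
                pass
    return mins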

In any case, I would love to hear your comments/questions about this code because I will be using some variant of this script to get more data from baseball-reference.com in the near future.

Ryan

@rtelmore on Twitter!

#!/usr/bin/python

import re
import time
import urllib2
from datetime import date

from BeautifulSoup import BeautifulSoup
import numpy


def get_baseball_soup(team, year):
    """Fetch a team's schedule/scores page and return it as a BeautifulSoup object."""
    url = "http://www.baseball-reference.com/teams/%s/%s-schedule-scores.shtml" % (team, year)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    the_page = response.read()
    try:
        soup = BeautifulSoup(the_page)
    except:
        # BeautifulSoup sometimes chokes on a malformed closing script tag;
        # strip the offending fragment and try again.
        the_page = re.sub('<\/scr', '', the_page)
        soup = BeautifulSoup(the_page)
    return soup


def process_soup(soup):
    """Extract game lengths (in minutes) from a schedule/scores page."""
    # Each game row contains a link into /games/standings; use those links
    # to locate the rows of the schedule table.
    data = soup.findAll('a', href=True)
    data_new = [link for link in data if
                link['href'].startswith('/games/standings')]
    mins = []
    for stew in data_new:
        ll = stew.findParent('td').findParent('tr')
        # The 18th cell holds the game time as 'H:MM'; convert it to minutes.
        hm = [str(td.string) for td in ll.findAll('td')][17].split(':')
        try:
            mins.append(int(hm[0]) * 60 + int(hm[1]))
        except:
            pass
    return mins


if __name__ == '__main__':
    teams = {
        'ATL': 'N', 'CHC': 'N', 'CIN': 'N', 'HOU': 'N', 'NYM': 'N', 'PHI': 'N',
        'PIT': 'N', 'SDP': 'N', 'SFG': 'N', 'STL': 'N', 'WSN': 'N', 'MIL': 'N',
        'BAL': 'A', 'BOS': 'A', 'CHW': 'A', 'CLE': 'A', 'DET': 'A', 'KCR': 'A',
        'ANA': 'A', 'LAD': 'A', 'MIN': 'A', 'NYY': 'A', 'OAK': 'A', 'TEX': 'A'}
    outfile = '/Users/relmore/Sports/Simmons/length_of_baseball_games_%s.csv' % date.today().strftime('%Y%m%d')
    f = open(outfile, 'a')

    for team in teams:
        out_list = []
        print time.ctime() + ' -- Getting data for the %s' % (team)
        for year in xrange(1970, 2010):
            league = teams[team]
            team_2 = team
            # Account for franchise moves, renames, and league switches.
            if int(year) < 1997 and team == 'ANA':
                team_2 = 'CAL'
            if int(year) < 1998 and team == 'MIL':
                league = 'A'
            if int(year) < 2005 and team == 'WSN':
                team_2 = 'MON'
            if int(year) < 1972 and team == 'TEX':
                team_2 = 'WSA'
            soup = get_baseball_soup(team_2, year)
            mins = process_soup(soup)
            out_list.append('%s, %s, %s, %s, %s, %s, %s' %
                            (team, year, numpy.mean(mins), numpy.median(mins),
                             numpy.std(mins), league, team_2))
        f.write('\n'.join(out_list) + '\n')
    f.close()
    print time.ctime() + ' -- Finished! :) '


Filed under Python, Web 2.0

A Whole New World of Data

The word Twitter is about as ubiquitous as health-care reform in today’s media.  Not everybody knows what Twitter is or what you can do with so-called “tweets”, but that does not stop them from posting their thoughts on Michael Jackson’s death in 140 characters or less.  What does this have to do with statistics, you ask?  Well, there is a ton of information floating around in the Twitter databases, and we’re going to look at how you can access it.

Think of a popular website and it’s almost a guarantee that your choice will have an Application Programming Interface (API).  The Facebook, Google Maps, NY Times, and, of course, Twitter APIs are some of the more popular ones that a data junkie might choose to use.  It is through the API that you have access (albeit limited in some cases) to the wealth of information that the particular website chooses to share.

As a simple example, let’s look at my friend Michael Twardos’ website, a Twitter-based surf report.  Michael uses the Twitter search API to mine information about various surfing locations.  To see an API in action, consider the following sentence on Twardos’ website: “Click here to view the latest surf reports”.  Clicking that link asks the site to make a request to the Twitter API, which sends back everything it has related to surf reports; the site then summarizes that information and presents it in an easily digestible way for the user.  In other words, there is a lot going on before that page is rendered in your web browser!
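For the curious, here is a rough sketch of what a query against the public Twitter search API looks like in Python.  The endpoint, parameters, and JSON field names are my assumptions about the unauthenticated search API, not anything pulled from Michael’s site:

import json
import urllib
import urllib2

# Ask the search API for recent tweets matching a query (rpp = results per page).
query = urllib.quote('surf report')
url = 'http://search.twitter.com/search.json?q=%s&rpp=20' % query
results = json.loads(urllib2.urlopen(url).read())

# Each result carries the author and the 140-character tweet text.
for tweet in results['results']:
    print tweet['from_user'], '--', tweet['text']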

You might think that this sounds cool and all, but where is the statistical connection?  The surf reports page is an example of a text-based data mining tool and is very much about summarizing qualitative information.  In my next blog post, I will consider an application of summarizing Twitter data related to a thunderstorm and/or an application related to football games.  I will also show how you would go about getting the data used in those examples.


Filed under Web 2.0