Getting Started With Python (Part 2)

By Travis K. Jansen | Team Lead, Software Development

Welcome to Part 2 of my “Getting Started With Python” series. In Part 1, I looked at Python from an outsider’s perspective. I knew that most programming languages share common constructs for looping, conditionals, and other flow control, but I also knew that Python’s syntax differs considerably from the curly-brace languages I had used for my entire career. I wasn’t sure what kind of learning curve I was jumping into, but I had a plethora of baseball statistics at my fingertips and I wanted to write something that could massage them into a format I could use. So I carved out some time and got my hands dirty.

Beyond the Basics

One of my favorite websites on the Internet is Baseball-Reference.com. It’s filled with tons of baseball stats (including some very obscure ones you can only get by manually parsing game logs) dating back to the 1800s. Unfortunately, they don’t offer an API to consume all of this wonderful data, and I wanted it! Furthermore, a lot of the information is available but spread out over team pages, individual player pages, and league-level pages. To pull it all together for a single team, the basic steps were as follows:

  1. Scrape the main roster page to get the individual player page URLs
  2. For each batter and pitcher, scrape their stats and store them
  3. For each batter, scrape their fielding data from a completely different section of their page
  4. Compare individual player fielding stats to league-wide fielding stats to determine a defensive grade
  5. Print the whole shebang
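Step 4 is the one step that needs a bit of logic beyond scraping. As a rough illustration only, a grade could be derived from the gap between a player’s fielding percentage and the league’s. The function name, the choice of fielding percentage as the stat, and the cutoffs below are all hypothetical; the actual script may grade quite differently.

```python
def defensive_grade(player_fielding_pct, league_fielding_pct):
    """Map the gap between a player and the league average to a letter grade."""
    # Hypothetical cutoffs (in fielding-percentage points), for illustration only.
    cutoffs = [(0.010, "A"), (0.000, "B"), (-0.010, "C"), (-0.020, "D")]
    gap = player_fielding_pct - league_fielding_pct
    for minimum_gap, grade in cutoffs:
        if gap >= minimum_gap:
            return grade
    # Anything further below the league average than the last cutoff.
    return "F"


print(defensive_grade(0.990, 0.975))
print(defensive_grade(0.940, 0.975))
```

Keeping the cutoffs in a single list means the grading scale can be tuned without touching the comparison logic.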

I knew it was going to take more than loops and “if” statements to get the job done, so I skipped the obligatory “Hello World” test and went straight to hitting the tubes for info on scraping libraries and web requests. I was basically attempting to answer two questions: “Does Python have something like cURL or WebClient?” and “How does RegEx work in Python?”

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

~ Jamie Zawinski

It turns out that, a couple of import statements later, I had some very powerful tools at my fingertips, with no need for RegEx at all. Enter the BeautifulSoup library. Consider this snippet:

import urllib2  # Python 2; in Python 3 the equivalent module is urllib.request
from bs4 import BeautifulSoup

# query the website and return the html to the variable `team_html`
# example: https://www.baseball-reference.com/teams/MIL/2017.shtml
team_html = urllib2.urlopen(team_url)

# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(team_html, 'html.parser')

This retrieves the markup for a given team page on Baseball Reference and sets up the BeautifulSoup library for querying and parsing that markup.

This then allows me to do things like:

# Find the h1 element with the attribute itemprop="name", then collect the span
# elements underneath it. find_all returns a list of matching spans; their text
# contains a team name like "Milwaukee Brewers."
team_name = soup.find('h1', attrs={'itemprop': 'name'}).find_all('span')
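To see what that call hands back, here is a small self-contained sketch that runs against an inline snippet instead of the live site. The markup is a simplified, hypothetical stand-in for the real page, and the variable names are mine. It assumes Python 3 with the bs4 package installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified for illustration; the real page is much
# larger and its structure may differ.
sample_html = """
<h1 itemprop="name">
  <span>2017</span>
  <span>Milwaukee Brewers</span>
  <span>Statistics</span>
</h1>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element (or None if nothing matches).
heading = soup.find("h1", attrs={"itemprop": "name"})

# find_all() returns a list-like collection of every matching descendant.
spans = heading.find_all("span")

# Pull the visible text out of each span, trimming surrounding whitespace.
span_texts = [span.get_text(strip=True) for span in spans]
print(span_texts)
```

Because find_all returns a collection rather than a single element, getting at one particular value means indexing or looping over the result.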

The beauty of the soup was that simply including this library reduced most of my problem scope to learning Python constructs and syntax (along with a few BeautifulSoup nuances, of course). I was giddy. I was hooked! It had been a long time, likely since before I had kids, since I was this excited to sit down in my home office and nerd out into the wee hours of the night. I was coding, and I was stress-free! I’m told by other Python nerds that this feeling is normal, and I should not be alarmed.

The Code

My goal for this first pass was to get things working functionally. Now that I have achieved that goal, my next task is to clean the script up and break it into reusable classes. It’s fairly long, but I made heavy use of comments. If you want to check it out, you can view it on my Bitbucket repo.

It essentially turns the command br-team.py 2017 MIL into this:

Batting
------------
Manny,Pina,C,R,330,359,0.278787878788,0.326815642458,92,21,0,9,2,79,,
Eric,Thames,1B/LF/RF,L,469,551,0.247334754797,0.359346642468,116,26,4,31,4,163,,
Jonathan,Villar,2B/CF,S,403,436,0.240694789082,0.292626728111,97,18,1,11,23,132,F-,
Orlando,Arcia,SS,R,506,548,0.276679841897,0.324175824176,140,17,2,15,14,100,,
Travis,Shaw,3B/1B,L,538,606,0.273234200743,0.348760330579,147,34,1,31,10,138,,
...

Pitching
------------
Zach,Davies,R,191.1,3.90894819466,33,33,124,55,0,0,.150
Jimmy,Nelson,R,175.1,3.49514563107,29,29,199,48,0,0,.150
Chase,Anderson,R,141.1,2.74273564848,25,25,133,41,0,0,.150
Matt,Garza,R,114.2,4.9649737303,24,22,79,45,0,0,.150
Brent,Suter,L,81.2,3.43596059113,22,14,64,22,0,0,.150
...
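The year and team abbreviation on the command line line up with the example team page URL shown earlier. As a minimal sketch of how those two arguments could become that URL (the function name and the validation are my own assumptions, not necessarily how the script does it):

```python
def team_url_from_args(args):
    """Build a team page URL from arguments like ['2017', 'MIL']."""
    if len(args) != 2:
        raise ValueError("expected a year and a team abbreviation")
    year, team = args
    # int() rejects a non-numeric year early; upper() tolerates 'mil'.
    return "https://www.baseball-reference.com/teams/{0}/{1}.shtml".format(
        team.upper(), int(year)
    )


print(team_url_from_args(["2017", "MIL"]))
# prints https://www.baseball-reference.com/teams/MIL/2017.shtml
```

In a real script, the list passed in would typically be sys.argv[1:].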

I’m pretty happy with the way it turned out. In my first ever Python script, I learned how to import libraries and picked up the basic language constructs (loops, conditionals, functions), lambda expressions, lists, casting and type conversion, built-in functions, web scraping, string manipulation, and much more. It was a lot to bite off, but a lot of fun. I’m looking forward to abstracting more of this into Python classes in the next update.
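A couple of those pieces fit together nicely in a few lines. Scraped table cells arrive as text, so they need converting before any math, and a lambda makes a handy sort key. This is a sketch rather than an excerpt from the script: the function name is hypothetical, and I’m assuming the hits and at-bats values correspond to the columns in the batting output above.

```python
def batting_average(hits_text, at_bats_text):
    """Convert scraped text to numbers and compute hits / at-bats."""
    hits = int(hits_text)
    at_bats = int(at_bats_text)
    if at_bats == 0:
        return 0.0
    # float() keeps this a true division on Python 2 as well as Python 3.
    return float(hits) / at_bats


# (last name, hits, at-bats) as scraped text; the column meanings are assumed.
rows = [("Pina", "92", "330"), ("Thames", "116", "469"), ("Arcia", "140", "506")]

# Sort from highest to lowest average, using a lambda as the sort key.
ranked = sorted(rows, key=lambda row: batting_average(row[1], row[2]), reverse=True)
print([row[0] for row in ranked])
# prints ['Pina', 'Arcia', 'Thames']
```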

Written on July 11, 2017