Tuesday, August 3, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting I

The web crawling piece I talked about previously seems a little easier, so I will start there. It actually involves two parts: the first is to crawl the website and download the relevant webpages; the second is to extract information from the saved HTML pages.

I will start with the second half and show, step by step, how I process sample webpages from yelp.com in Python using BeautifulSoup.

OK, let's get started. First, to give you an idea of what yelp.com looks like, below is a small screenshot of the restaurant search results for the Kansas City metro area. What I really want to extract from this page is the business unit name, address, zip code, category, number of reviews, and rating.



This is a typical search result from the Yelp website: it includes something totally unrelated to what we need (like the sponsored ad for "The Well"), then the stuff we care about, and finally more irrelevant stuff. Not that bad, right? But when you look at the underlying HTML code, you will see this:



Really messy!!! How can you even see the stuff you want, let alone extract anything from such an HTML page? This is when you should think about using BeautifulSoup, a terrific, flexible HTML/XML parser for Python.

Now it's time to get our hands dirty with some Python code.
**** step 1 ****
inside Python (under Windows), load the modules that will be used

import re # regular expression module
import os # operating system module
import csv
from BeautifulSoup import BeautifulSoup

**** step 2 ****
point Python to the HTML file I saved and open it

os.chdir('C:\\Documents and Settings\\v\\Desktop\\playground\\yelp data')
f=open('rest3.htm','r')
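
As a variation on step 2, here is a minimal sketch that opens the saved page by its full path instead of changing the working directory, and lets a with-block close the file. It assumes Python 3; the temporary folder and the placeholder page are stand-ins invented for illustration so the snippet runs anywhere, and only the file name rest3.htm comes from the step above.

```python
import os
import tempfile

# Hypothetical stand-in for the folder that holds the saved Yelp pages.
data_dir = tempfile.mkdtemp()
page_path = os.path.join(data_dir, "rest3.htm")

# Write a tiny placeholder page so this sketch is self-contained.
with open(page_path, "w", encoding="utf-8") as out:
    out.write("<html><body><p>placeholder</p></body></html>")

# Open the file by full path; the with-block closes it automatically.
with open(page_path, "r", encoding="utf-8") as f:
    html_text = f.read()

print(len(html_text), "characters read from", os.path.basename(page_path))
```

Building the path with os.path.join avoids hand-typing doubled backslashes, and reading the text inside the with-block means the file handle does not stay open for the rest of the session.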

**** step 3 ****
create a soup object and extract the business units

soup=BeautifulSoup(f)

results=soup.findAll('div', attrs={'class':'businessresult clearfix'})

After the "BeautifulSoup" command is applied to file f, the output "soup" is a soup object: a parsed tree of the page that can be searched by tag and attribute, rather than plain text. Looking at the original HTML, you will find that Yelp organizes the search results using a div with class="businessresult clearfix". So the second line is really saying: give me all the div sections that have class="businessresult clearfix". In this case, "results" is a list of 40 elements (each a parsed div that prints as its HTML text), and each one contains the information we want about a single restaurant. Next I will loop through each element in the list ("for tt in results:") and extract the "gems" one by one.
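
To make step 3 easier to try without a saved Yelp page, here is a minimal sketch on a tiny stand-in document. It assumes the newer bs4 package (imported as "from bs4 import BeautifulSoup", where find_all is the current spelling of findAll) running on Python 3. Only the div class "businessresult clearfix" comes from the page described above; the markup inside each div and the restaurant names are invented for illustration and are not Yelp's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a saved search page. Only the div class
# "businessresult clearfix" is taken from the post; the rest is made up.
sample_html = """
<html><body>
  <div class="sponsored">The Well (ad)</div>
  <div class="businessresult clearfix"><a href="#">Restaurant A</a></div>
  <div class="businessresult clearfix"><a href="#">Restaurant B</a></div>
  <div class="footer">unrelated stuff</div>
</body></html>
"""

# Parse the text into a searchable tree using Python's built-in parser.
soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the divs whose class attribute is exactly this string.
results = soup.find_all("div", attrs={"class": "businessresult clearfix"})

# Loop over the matches; each one is a parsed element, not a plain string.
names = []
for tt in results:
    link = tt.find("a")  # assumed location of the name in this fake markup
    names.append(link.get_text(strip=True))

print(len(results), names)
```

The ad and footer divs are skipped because their class does not match, which is the same filtering that separates the restaurant entries from the unrelated parts of the real page.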
