Thursday, August 12, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting II

**** step 4 ****
Extract the business address (suppose tt=results[0]).
The address itself is enclosed in a tag called "address", for example

<address>
3002 W 47th Ave <br/>Kansas City, KS 66103 <br/>
</address>

Since I want the address and the zip code stored separately, I used

# address
>> address_tmp=tt.find('address')
>> address_tmp=BeautifulSoup(re.sub('(<.*?/>)',' ',str(address_tmp))).text
>> address=address_tmp.rstrip('1234567890 ')
>> zipcode=re.search(r'\d+$',address_tmp).group()

The first line says "find the tag named address". Because of the <br/> tags in the middle of the string, I use a regular expression to replace each of them with a single space. Then I turn the string back into a BeautifulSoup object and finally get the text between the "address" tags:

u'3002 W 47th Ave Kansas City, KS 66103'

The third line trims off the zip code and any spaces after the state abbreviation by stripping trailing digits and spaces. The fourth line, in contrast, uses a regular expression to extract the zip code (the run of digits at the end of the string).
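As a minimal sketch, the two lines can be tried on the plain string shown above, without BeautifulSoup. It is written for Python 3. The helper split_address is a hypothetical name, and its single-regex approach is an alternative for comparison, not what the post uses.

```python
import re

address_tmp = '3002 W 47th Ave Kansas City, KS 66103'

# Approach from the post: strip trailing digits and spaces,
# then grab the run of digits at the end of the string.
address = address_tmp.rstrip('1234567890 ')
zipcode = re.search(r'\d+$', address_tmp).group()


def split_address(text):
    """Hypothetical helper: split the text into (address, zipcode) in one pass."""
    match = re.match(r'^(.*?)\s*(\d+)$', text.strip())
    if match is None:
        # No trailing digits: return the text unchanged and no zip code.
        return text.strip(), None
    return match.group(1), match.group(2)


print(address)
print(zipcode)
print(split_address(address_tmp))
```

One difference worth noting: re.search(...).group() raises an AttributeError when the string has no trailing digits, while the helper returns None for the zip code instead.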

One thing to pay attention to is the "?" in the pattern, which makes the match non-greedy: ".*?" matches as few characters as possible, so each tag is replaced on its own instead of everything from the first "<" to the last "/>" being swallowed by a single match. See Google's Python Class for more details.
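To see the difference, here is a small sketch (Python 3, standard library only) that runs the greedy and the non-greedy pattern on the inner text of the address tag from the example above. The variable names are chosen for illustration.

```python
import re

html = '3002 W 47th Ave <br/>Kansas City, KS 66103 <br/>'

# Greedy: '.*' runs to the LAST '/>', so one match swallows the city and zip code.
greedy_matches = re.findall(r'<.*/>', html)

# Non-greedy: '.*?' stops at the FIRST '/>', so each <br/> is matched on its own.
lazy_matches = re.findall(r'<.*?/>', html)

# Replace the matches with a space, then collapse repeated whitespace.
greedy_text = ' '.join(re.sub(r'<.*/>', ' ', html).split())
lazy_text = ' '.join(re.sub(r'<.*?/>', ' ', html).split())

print(greedy_matches)  # ['<br/>Kansas City, KS 66103 <br/>']
print(lazy_matches)    # ['<br/>', '<br/>']
print(greedy_text)     # 3002 W 47th Ave
print(lazy_text)       # 3002 W 47th Ave Kansas City, KS 66103
```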

**** step 5 ****
Oops, did I forget to get the business name first? All right: once you notice that the name sits in a div tag whose class attribute is "itemheading", you can get the whole heading and then the name part easily through

>> name_rank=tt.find('div', attrs={'class':'itemheading'}).text
>> name_rank
u"1.\n \tOklahoma Joe's BBQ & Catering"
>> names=name_rank.lstrip('1234567890. ').replace('\n \t', '')
>> names
"Oklahoma Joe's BBQ & Catering"

This is very similar to the address part; the new piece is the string replace function, used here to remove '\n \t'.
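The heading text holds both the rank and the name, so a single regular expression can capture the two at once. The sketch below (Python 3) shows the post's approach next to that alternative. The names rank and name are hypothetical, and the pattern assumes the heading always looks like "number, dot, whitespace, name".

```python
import re

name_rank = "1.\n \tOklahoma Joe's BBQ & Catering"

# Approach from the post: strip the leading rank characters,
# then remove the exact whitespace run '\n \t'.
names = name_rank.lstrip('1234567890. ').replace('\n \t', '')

# Alternative sketch: capture rank and name in one pattern.
# \s* absorbs any mix of spaces, tabs and newlines after the dot.
match = re.match(r'\s*(\d+)\.\s*(.*\S)', name_rank)
rank, name = match.group(1), match.group(2)

print(names)
print(rank)
print(name)
```

The regular expression does not depend on the whitespace being exactly '\n \t', which may help if the page layout varies between entries.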

**** step 6 ****
Next, I'd like to know the category or categories of each restaurant. These are stored in the text of 'a' tags with the class "category". The 2nd line below joins all category labels with ',' to make one string, and the 3rd line removes '\n' from that string.

# multiple categories so
>> cats=tt.findAll('a',attrs={'class':'category'})
>> category=','.join([x.text for x in cats])
>> category=category.replace('\n', '')
>> category
u'Barbeque'
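For an entry with several categories, the join step can be sketched on plain strings (Python 3). The list labels is a hypothetical stand-in for [x.text for x in cats], with made-up category names and stray newlines.

```python
# Hypothetical tag texts, standing in for [x.text for x in cats].
labels = ['Barbeque\n', 'Caterers', '\nSandwiches']

# Strip each label before joining, so no separate replace('\n', '') pass is needed.
category = ','.join(label.strip() for label in labels)

print(category)  # Barbeque,Caterers,Sandwiches
```

With a single label the result is just that label, since join only inserts the separator between items.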


**** step 7 ****
The same idea applies to getting the number of reviews and the rating, as follows

# reviews '217 reviews', have to parse the number part using regular expression
>> review_tmp=tt.find('a',attrs={'class':'reviews'}).text
>> rev_count=re.search(r'^\d+', review_tmp).group()
>> rev_count
u'217'
# rating scale
>> rating_tmp=tt.find('div',attrs={'class':'rating'})
>> rating=re.search(r'^\w+\.*\w*',rating_tmp('img')[0]['title']).group()
>> rating
u'4.5'
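Both values are "a number at the start of a string", so the two searches can share one pattern. The sketch below (Python 3) assumes the img title reads something like '4.5 star rating'; that wording is a guess, and leading_number is a hypothetical helper name.

```python
import re

review_tmp = '217 reviews'
rating_title = '4.5 star rating'  # assumed wording of the img title attribute


def leading_number(text):
    """Hypothetical helper: return the number at the start of text, or None."""
    match = re.search(r'^\d+(?:\.\d+)?', text)
    return match.group() if match else None


rev_count = leading_number(review_tmp)
rating = leading_number(rating_title)

# The pattern from the post gives the same rating on this input.
rating_post = re.search(r'^\w+\.*\w*', rating_title).group()

print(rev_count)  # 217
print(rating)     # 4.5
```

The stricter pattern only accepts digits, whereas \w also matches letters, so the post's pattern would return a word if the title happened to start with one.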



After all 7 steps, I write the results to a csv file through

import csv
out = open('some.csv','a') # append to the file, creating it if needed
TWriter = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
TWriter.writerow([ranks, names, category, address, zipcode, rev_count,rating])
out.close()
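To check what such a row looks like, here is a minimal sketch (Python 3) that writes one row with the same delimiter and quoting settings and reads it back. The row values come from the steps above, except the rank '1', which is an assumed value. io.StringIO stands in for the real file so nothing is written to disk.

```python
import csv
import io

# Example values from the steps above; the rank '1' is an assumed value.
row = ['1', "Oklahoma Joe's BBQ & Catering", 'Barbeque',
       '3002 W 47th Ave Kansas City, KS', '66103', '217', '4.5']

# io.StringIO stands in for a real file so the sketch leaves nothing on disk.
# With a real file in Python 3: open('some.csv', 'a', newline='') in a with-statement.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)

line = buffer.getvalue()
print(line)

# Reading the line back with the same delimiter recovers the original fields.
parsed = next(csv.reader(io.StringIO(line), delimiter='|'))
print(parsed)
```

Because the delimiter is '|', the commas inside the address and category fields need no quoting under QUOTE_MINIMAL.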
