
Tuesday, April 17, 2012

Some Python

The other day, I was trying to flag a recommendation (posted to the Redis server for quick lookup) in a dictionary of recommendations, so that in the next round I could do weighted random sampling among the unposted ones without redoing the entire recommendation calculation. It turns out to be a little tricky to neatly flag a 'used' recommendation in a Python dictionary. Fortunately, someone has already provided a solution, which I'd like to borrow here for quick reference.
>>> x = {'a':1, 'b': 2}
>>> y = {'b':10, 'c': 11}
>>> z = dict(x.items() + y.items())
>>> z
{'a': 1, 'c': 11, 'b': 10}

b's value is properly overwritten by the value in the second dictionary. In Python 3, this is the suggested form:
>>> z = dict(list(x.items()) + list(y.items()))
>>> z
{'a': 1, 'c': 11, 'b': 10}
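
For what it's worth, a copy-and-update version works the same way in both Python 2 and 3 (a minimal sketch of the alternative, not from the original post):

>>> z = x.copy()   # shallow copy of x
>>> z.update(y)    # y's values win on key collisions
>>> z
{'a': 1, 'c': 11, 'b': 10}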


And I saw a cool post about applying Python's map, reduce, and filter functions to a dictionary. Once again, something can be accomplished cleanly and flexibly. Python is such a beautiful language.
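
As a quick illustration of that idea (my own sketch with made-up names, not code from that post), filter can pick out the unposted recommendations in one line:

>>> recs = {'item1': 0.9, 'item2': 0.5, 'item3': 0.7}   # hypothetical scores
>>> posted = set(['item2'])                             # already posted
>>> dict(filter(lambda kv: kv[0] not in posted, recs.items()))
{'item1': 0.9, 'item3': 0.7}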

Monday, March 14, 2011

Filesystem Traversing with Python

Lately I have accumulated a lot of files, most of which are papers in PDF format. Although they are filed into folders properly, it's still hard to track down a single file, or even to check whether I have already collected a given paper. So I needed to do a simple filesystem traversal and record four basic things about my documents: directory, file name, file size, and last-modified date. With a summary file containing that information, I can do some kind of basic search, which is at least faster than going through folders and digging for files.

Also I wanted my results to satisfy two requirements: (1) only output file information if the file is in pdf, ppt, or doc format; (2) the file size should be in a human-readable format. I chose Python for this little task. Here is the code I used:


import os
import stat
import time
from datetime import datetime

def sizeof_fmt(num):
    # convert a size in bytes to a human-readable string
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f%s" % (num, x)
        num /= 1024.0

f = open('output.txt', 'w')

# raw string so the backslash in the Windows path is taken literally
for root, dirs, files in os.walk(r'E:\my_papers'):
    for file in files:
        if file.split('.')[-1] in ('pdf', 'ppt', 'doc'):
            st = os.stat(os.path.join(root, file))
            sz = st[stat.ST_SIZE]
            tm = time.ctime(st[stat.ST_MTIME])
            tm_tmp = datetime.strptime(tm, '%a %b %d %H:%M:%S %Y')
            tm = tm_tmp.strftime('%Y-%m-%d')
            sz2 = sizeof_fmt(sz)
            strg = root + '\t' + file + '\t' + tm + '\t' + sz2 + '\n'
            f.write(strg)

f.close()


The modules used here are "os", "time", "stat", and "datetime".

First there is a user-defined function that returns the file size in human-readable format. I found this interesting function here.

Next, the os.walk function walks through the given directory tree and yields the files it finds along the way. For each file, the extension is obtained by splitting the file name on '.'. Once a file meets my format requirement, its size in bytes and its last-modified time are collected. Unfortunately the time comes back as a rather long string, much of which is not relevant at all. So I created a "datetime" object "tm_tmp" using "datetime.strptime()", and then created a string "tm" that only keeps the year, month, and day of the file. Next the size function is called and the human-readable file size is returned. Finally the directory, filename, time, and size information are written to the file.
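
In hindsight, the detour through time.ctime() and strptime() is not necessary: datetime.fromtimestamp() produces the date directly. A minimal sketch of that shortcut (the path is just a placeholder):

import os
from datetime import datetime

path = 'output.txt'  # placeholder: any existing file
mtime = os.path.getmtime(path)  # last-modified time in seconds since the epoch
print(datetime.fromtimestamp(mtime).strftime('%Y-%m-%d'))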

Thursday, August 12, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting II

**** step 4 ****
extract the business address (suppose tt=results[0]).
The address itself is enclosed in an "address" tag, for example

<address>
3002 W 47th Ave <br/>Kansas City, KS 66103 <br/>
</address>

Since I want the address and zip code separate, I used

# address
>> address_tmp=tt.find('address')
>> address_tmp=BeautifulSoup(re.sub('(<.*?/>)',' ',str(address_tmp))).text
>> address=address_tmp.rstrip('1234567890 ')
>> zipcode=re.search(r'\d+$',address_tmp).group()

The first line says "find the tag with name address". Because of the <br/> tags in the middle of the string, I have to use a regular expression to replace them with a single space. Then I change the string back to a BeautifulSoup object and finally get the text between the "address" tags.

u'3002 W 47th Ave Kansas City, KS 66103'

The third line trims off the zipcode and any space after the state abbreviation, while the fourth line uses a regular expression to extract the zipcode (the run of digits at the end of the string).

One thing that needs attention is the "?" that makes the search non-greedy, meaning the search returns a match as soon as it finds the pattern instead of consuming as much as possible. See the Google Python Class here for more details.
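
A quick toy example of the difference (my own illustration, not from that class):

>> import re
>> re.findall('<.*>', '<a><br/><b>')   # greedy: runs from the first < to the last >
['<a><br/><b>']
>> re.findall('<.*?>', '<a><br/><b>')  # non-greedy: stops at each first >
['<a>', '<br/>', '<b>']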

**** step 5 ****
Oops, did I forget to get the business name first? All right, if you notice that the name part is a heading in a div tag whose class attribute is "itemheading", you will get the entire piece, and the name part, easily through

>> name_rank=tt.find('div', attrs={'class':'itemheading'}).text
>> name_rank
u"1.\n \tOklahoma Joe's BBQ & Catering"
>> names=name_rank.lstrip('1234567890. ').replace('\n \t', '')
>> names
"Oklahoma Joe's BBQ & Catering"

Very similar to the address part; the new piece added is the replace method on strings to remove '\n \t'.

**** step 6 ****
Next, I'd like to know the category or categories for those restaurants. These are stored in the text of 'a' tags with a class label of "category". The second line below joins all category labels using ',' to make them one string, and the third line removes '\n' from the string.

# multiple categories so
>> cats=tt.findAll('a',attrs={'class':'category'})
>> category=','.join([x.text for x in cats])
>> category=category.replace('\n', '')
>> category
u'Barbeque'


**** step 7 ****
A similar idea applies to getting the number of reviews and the rating, as follows

# reviews '217 reviews', have to parse the number part using regular expression
>> review_tmp=tt.find('a',attrs={'class':'reviews'}).text
>> rev_count=re.search(r'^\d+', review_tmp).group()
>> rev_count
u'217'
# rating scale
>> rating_tmp=tt.find('div',attrs={'class':'rating'})
>> rating=re.search(r'^\w+\.*\w*',rating_tmp('img')[0]['title']).group()
>> rating
u'4.5'



After all seven steps, I write the results into a csv file through

out = open('some.csv','a') # append and write
TWriter = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
TWriter.writerow([ranks, names, category, address, zipcode, rev_count,rating])
out.close()

Tuesday, August 3, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting I

The web crawling piece I talked about previously seems a little easier, so I will start from there. There are actually two pieces involved: the first is to crawl the website and download the related webpages; the second is to extract information out of the saved HTML pages.

I will start from the second half of the work and show step by step how I take sample webpages from yelp.com and process them in Python using BeautifulSoup.

Ok, let's get started. First of all, to give you an idea of what yelp.com looks like: below is a small screenshot of the restaurant search results for the Kansas City metro area. What I really want to extract from this page is the business name, address, zip code, category, number of reviews, and rating.

[Screenshot: Yelp restaurant search results for the Kansas City metro area]

This is a typical search result from the Yelp website. It includes things totally unrelated to what we need (like the sponsored ad for "The Well"), then the stuff we care about, and finally irrelevant stuff again. Not that bad, right? But when you look at the underlying HTML code, you will see this

[Screenshot: the raw HTML source behind the search results page]

Really messy! How can you clearly see the stuff you want, let alone extract anything, from such an HTML page? This is when you should think about using BeautifulSoup, which is a terrific, flexible HTML/XML parser for Python.

Now, it's time to get our hands dirty with some Python code.
**** step 1 ****
inside Python (under Windows), load the modules that will be used

import re # regular expression module
import os # operation system module
import csv
from BeautifulSoup import BeautifulSoup

**** step 2 ****
point Python to the HTML file I saved, and open it of course

os.chdir('C:\\Documents and Settings\\v\\Desktop\\playground\\yelp data')
f=open('rest3.htm','r')

**** step 3 ****
create a soup object and extract the business units

soup=BeautifulSoup(f)

results=soup.findAll('div', attrs={'class':'businessresult clearfix'})

After the "BeautifulSoup" command is applied to file f, the output "soup" becomes a soup object (essentially a parsed tree of the document, not just a text file). Observing the original HTML pages, you will find that Yelp organizes each search result in a div with class="businessresult clearfix". So the second line is really saying: give me all the div sections which have class="businessresult clearfix". In this case, "results" is a list of 40 such sections, each containing the information we want about a single restaurant. Next I will loop through each element in the list ("for tt in results:") and extract the "gems" one by one, as sketched below.
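
To make the flow concrete, here is a minimal skeleton of that loop (the individual fields are pulled out in steps 4 through 7 of the follow-up post):

for tt in results:
    # each tt is one <div class="businessresult clearfix"> block
    name_rank = tt.find('div', attrs={'class': 'itemheading'}).text
    print(name_rank)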

Extracting "gems" from web pages using BeautifulSoup and Python - BI part

Lately, I am trying to find a way to answer this question: what are the best local restaurants or spas? In other words, we are trying to figure out which local businesses will be worth the most to go after.

Ok, that could be a supervised or an unsupervised learning problem. If I can find some data that indicates brand awareness, and use some metrics as predictors, then it's a supervised problem. If I cannot find that kind of response indicator, then I have to figure out another way to build such an index using whatever predictive metrics I can find online for free.

The first thing came to my mind was yelp.com, which is a relatively comprehensive rating website on local businesses. Naturally what I want to do is to crawl that website and get some useful information out, like location of business (in order to calculate distance of that business to a particular user), number of ratings (one indicator of brand awareness), rating itself (brand quality), etc.

Secondly, I want to see if I can get some Twitter data. If I happen to know the IP address of a Twitter user, I can figure out his/her location and see how many tweets a local business can get. Similarly, Google Analytics might be helpful too, but I am not sure.

Monday, November 30, 2009

byte of python 2

Continuing to learn Python after the Thanksgiving break.

Some notes about flow control:
* It's very interesting to see that Python has an if-elif-else statement. A colon has to be placed after each clause (if, elif, else) to indicate the beginning of the block.
* An optional else clause can be attached to a for or while loop; it runs when the loop finishes without hitting a break (see the snippet after this list).
* range(1,5,2) means [1, 3] (start from 1, step by 2, up to but not including 5).
* use break or continue statements for more control.
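
A small snippet tying the last three notes together (my own example, not from the book):

for n in range(1, 5, 2):  # yields 1 and 3: start at 1, step 2, stop before 5
    print(n)
else:
    # runs because the loop finished without a break
    print('done')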

Some notes about function:
* "def foo(a,b=4):" defines a function named foo with inputs a and optional input b equals to 4.
* inside functions, use global var to make var global so that one can change the value of var inside function.
* Docstrings are cool. The author suggested using it this way - "The convention followed for a docstring is a multi-line string where the first line starts with a capital letter and ends with a dot. Then the second line is blank followed by any detailed explanation starting from the third line."
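
Here is that illustration (my own example, not from the book):

def add(a, b=4):
    """Return the sum of a and b.

    b is optional and defaults to 4, so add(3) gives 7.
    """
    return a + b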

Some notes about Python data structure:
* So far, I feel that Python is a very flexible language. Its data structures (list, tuple, dictionary, set) are very interesting. Slicing each structure is also fun.
* One thing I want to mention is that Python binds a name to an object. The name only refers to the object; it does not represent the object itself. So "if you want to make a copy of a list or such kinds of sequences or complex objects (not simple objects such as integers), then you have to use the slicing operation to make a copy."
shoplist = ['apple', 'mango']
mylist = shoplist[:]

Wednesday, November 25, 2009

byte of python

After searching for a good tutorial for Python beginners, I found this one here. Of that 118-page PDF file, I went through the first 20 pages very quickly and then finally began to learn some Python. So here are some of my class notes. I have a Google Notebook, but don't quite like it though.

* Ctrl-D or Ctrl-Z to quit Python (I really wish I had known this one before.)
* comment symbol is #
* ' and " are interchangeable; their escape sequences are \' and \"
* \n new line, \t tab
* add "r" before expression to indicate raw string
* implicit line join using \
* Python is a very flexible language: 'a'+'b' gives 'ab' (first time I have seen something like this)
* Boolean NOT, AND, OR are written not, and, or (different from R); a few of these notes appear in the snippet below
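
A few of these in one quick snippet (my own illustration):

print('a' + 'b')         # string concatenation gives 'ab'
print(r'C:\new\table')   # raw string: \n and \t stay literal
print(not True)          # False; Boolean NOT is the keyword 'not'
print(True and False)    # False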