Friday, May 27, 2011

One-liner R: remove data frame columns by names

Sometimes I want to remove certain columns from my data. What I used to do was figure out the unwanted column ids and drop them by hand. It's just a couple of lines in R, but why not write a function that does the job and adds some error handling? That way, next time I can just copy and paste!
rm.col.by.name <- function(dat, colname.list){

    l1 = length(colname.list)
    l2 = which(colnames(dat) %in% colname.list)

    if(length(l2) == 0) {
        warning('Did not find any matching columns. Check input column names, please.')
        return(dat)    # nothing matched: return the data unchanged instead of nothing
    }

    if(length(l2) < l1) {
        warning('Did not find all matching columns. Only those found are removed.')
    }

    dat[, -l2, drop=FALSE]    # drop=FALSE keeps a data frame even if only one column is left
}
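
A quick sanity check with a toy data frame (my own example, not part of the original post):

df <- data.frame(a=1:3, b=4:6, c=7:9)
rm.col.by.name(df, c('b','c'))     # drops b and c
rm.col.by.name(df, c('b','zzz'))   # warns, drops b only
rm.col.by.name(df, 'zzz')          # warns, returns df unchanged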

Wednesday, May 25, 2011

Canopy Clustering

Lately I was introduced to a clustering algorithm called "canopy clustering". In plain English, like all other clustering techniques, it is designed to divide observations into segments (subsets, clusters, groups, buckets, whatever you call them), so that observations inside each segment are "similar" in terms of certain measurement criteria. Those criteria are distances or dissimilarities.

There are a lot of algorithms that cluster observations. Some are more computation-intensive, like k-medoids (compared to k-means). Some yield hard assignment, i.e., one observation can only belong to one cluster. Some generate soft assignment, i.e., one observation can belong to multiple clusters, like LSH (locality sensitive hashing) and canopy clustering. Some scale better than others.

The canopy clustering procedure is relatively straightforward. The key ingredients are two distance thresholds, T1 and T2, where T1 > T2. Going through a list of data points, if a point is within distance T2 of any of the existing canopy centers, then that point cannot become a new canopy center; otherwise, it is added to the set of canopy centers. With this rule, a set of canopy centers is obtained. Then any input data point that is within distance T1 of an existing center belongs to the canopy formed by that center. As one can see, a data point can belong to multiple canopies. The following plot illustrates the concept.



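To make the procedure concrete, here is a minimal R sketch of the center-generation and membership steps described above. It is my own toy code (the function names canopy.centers and canopy.members are made up), using plain Euclidean distance as the "cheap" metric:

canopy.centers <- function(x, T1, T2){
    # x is a numeric matrix, one observation per row; T1 > T2
    centers <- x[1, , drop=FALSE]                         # the first point starts the first canopy
    for(i in seq_len(nrow(x))[-1]){
        d <- sqrt(colSums((t(centers) - x[i,])^2))        # distance from point i to every center
        if(all(d > T2)) centers <- rbind(centers, x[i,])  # not within T2 of any center => new center
    }
    centers
}

canopy.members <- function(x, centers, T1){
    # a point belongs to every canopy whose center is within T1 of it
    lapply(seq_len(nrow(centers)), function(j){
        d <- sqrt(colSums((t(x) - centers[j,])^2))
        which(d <= T1)
    })
}

A point index can show up in several of the returned canopy member lists, which is exactly the soft assignment behavior mentioned earlier.
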
Usually, canopy clustering uses a "cheap" distance metric (easy and fast to calculate), so the process takes less time than k-means clustering. This is a typical trade-off between precision and speed. Because of it, people combine k-means and canopy clustering to cluster huge amounts of data. The way the two work together is:
1. Canopy clustering uses the cheaper distance metric (one that points in the same direction as the k-means distance, meaning that if two points are close in terms of the k-means distance, they should be close in terms of the canopy distance too), and partitions the data into canopies.
2. K-means still tries to cluster all the data points; however, the distance between an input data point and a k-means centroid is calculated if and only if the two are in the same canopy. In this way, a lot of "unnecessary" calculations are avoided (a small sketch of this pruning follows below).
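
To illustrate point 2, here is a rough sketch of the pruning, again my own code with made-up names: given the canopy memberships of each data point and of the current k-means centroids, the distance from a point is computed only to the centroids that share a canopy with it.

nearest.centroid <- function(i, x, centroids, point.canopies, centroid.canopies){
    # point.canopies[[i]]: ids of the canopies containing point i
    # centroid.canopies[[j]]: ids of the canopies containing centroid j
    shared <- which(sapply(centroid.canopies,
                           function(cc) length(intersect(cc, point.canopies[[i]])) > 0))
    d <- sqrt(colSums((t(centroids[shared, , drop=FALSE]) - x[i,])^2))
    shared[which.min(d)]    # closest centroid among those sharing a canopy with point i;
                            # empty if the point shares no canopy with any centroid
}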

There are some interesting variations around combining the two, or simply using canopy clustering to generate the initial k-means centroids. If canopy clustering serves as a k-means centroid feed, one can perform only the center-generation part, or perform both the center generation and the membership assignment and then re-calculate the center of each canopy. If canopy clustering is used to partition the data for k-means, keeping the ratio of k to the number of canopies between 1 and 10 has been suggested. That means you always want the number of canopies to be smaller than k, so that each point is likely to share a canopy with at least one k-means centroid. But how do you find out how many canopies you are going to get? The trick is that you have to play with T1 and T2.
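
For the centroid-feed variation, the sketch above can be wired straight into base R's kmeans(), which accepts a matrix of starting centroids (the T1/T2 values here are arbitrary placeholders):

centers <- canopy.centers(x, T1 = 3, T2 = 1)    # canopy centers from the sketch above
fit <- kmeans(x, centers = centers)             # canopy centers used as initial k-means centroids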

When it comes to implementing this in a map-reduce framework, this blog post is very clear. One thing worth pointing out is that after the canopy centers are generated by one map-reduce job, the next map-reduce job assigns all of the input data points to those centers. At that point, some data may no longer be within T1 of any canopy center; solutions in this case include forming a new canopy with such a point as its center, assigning the point to the nearest center, or simply discarding it.

Thursday, March 31, 2011

A few things I learnt in 2010

2010 was an "interesting" year. Here I write down the lessons that I learned and felt the most.

1. "Be confident and assertive."
Fine, call yourself a "data analyst", a "data scientist", or even a "data ninja". Whatever you call yourself really does not matter. What matters is that, as a professional, you work in an environment that requires you (and hopefully you would also like) to share your findings and convince others to see the potential value of your analysis the way you see it. At the end of the day, you are a consultant, and as a consultant, being confident and assertive about your expertise and findings helps "tons" in the process.

2. Ask a lot of questions. Especially ask the "dumb" ones at the beginning. Getting clear on concepts and terms as early as possible reduces confusion and shortens project development time.

3. Follow up. Sometimes people forget what they said or asked for a few days ago. Reasonably constant and consistent contact is the key to keeping projects running and on track! Plus, it makes your clients feel needed, which theoretically is a good feeling. :P

Monday, March 14, 2011

Filesystem Traversing with Python

Lately I have accumulated a lot of files, most of which are papers in PDF format. Although they are filed into folders properly, it's still hard to track down a single file, or even to check whether I have already collected a particular paper. So I needed to do a simple filesystem traversal and figure out 4 basic things about my documents: directory, file name, file size, and last-modified date. With a summary file containing that information, I can do some kind of basic search, which is at least faster than going through folders and digging out files by hand.

I also wanted my results to satisfy two requirements: (1) only output file information if the file is in pdf, ppt, or doc format; (2) the file size should be in a human-readable format. I chose Python to finish this little task. Here is the code I used:


import os
import time, stat
from datetime import datetime

# return a file size in human-readable form, e.g. '1.2MB'
def sizeof_fmt(num):
    for x in ['bytes','KB','MB','GB','TB']:
        if num < 1024.0:
            return "%3.1f%s" % (num, x)
        num /= 1024.0

f = open('output.txt', 'w')

# a raw string keeps the backslash in the Windows path from being read as an escape
for root, dirs, files in os.walk(r'E:\my_papers'):
    for file in files:
        if file.split('.')[-1] in ('pdf','ppt','doc'):
            st = os.stat(os.path.join(root, file))
            sz = st[stat.ST_SIZE]                     # size in bytes
            tm = time.ctime(st[stat.ST_MTIME])        # last-modified time as a long string
            tm_tmp = datetime.strptime(tm, '%a %b %d %H:%M:%S %Y')
            tm = tm_tmp.strftime('%Y-%m-%d')          # keep only year-month-day
            sz2 = sizeof_fmt(sz)
            strg = root + '\t' + file + '\t' + tm + '\t' + sz2 + '\n'
            f.write(strg)

f.close()


The modules that are used here are "os", "time", "stat" and "datetime".

First there is a user-defined function that returns the file size in human-readable format. I found this interesting function here.

Next, the os.walk function walks through the given directory tree and yields the files found in each folder. Then for each file, the extension is obtained by splitting the file name on '.'. Once a file's format meets my requirement, its size in bytes and its last-modified time are collected. Unfortunately the time is a very long string, part of which is not relevant at all. So I created a "datetime" object "tm_tmp" using "datetime.strptime()", and then created a string "tm" that only keeps the year, month and day of the file. Next the size function is called and the human-readable file size is returned. Finally the directory, file name, time and size information are written to the output file.

Friday, January 28, 2011

What's next for virtual goods?

Nowadays, the virtual goods market is very hot! Experts predict that total sales will reach 2.5 billion by 2013.

And there is a ton of information out there about virtual goods themselves and what the days ahead will look like. In particular, I found that the post by Maria Korolov, a few posts by Avril Korman, and the "Inside Virtual Goods" report by Justin Smith from Inside Network helped me better understand the market mechanism and its future.

I also studied IMVU, a site that lets you register an account and start socializing with others in 3D in just a minute. I found that their (or, in general, this type of company's) business model is very self-sufficient: users generate content (virtual goods) and sell that content to other users, and the company collects the money. Then the company's job is to reach more users, retain existing users, and reach even more users. And that's no easy job!

Based on the limited information and experience I have so far, what I would like to see more of in the next few years is (1) new ways to partner with real-life brand owners (by which I mean retailers and manufacturing companies), (2) grabbing the non-English-speaking market, and (3) having something to offer for users with different tastes and expectations.

Could brand-name companies produce virtual goods that are both suitable for selling on these sites and related to the products they sell in real life, and sell them at a cheaper price than user-created goods? That way they could use social sites as a lab for product development and nurture their potential customers at the same time. Users would probably feel more comfortable with brand names in the virtual world, just as they do in everyday life.

I found Avril's post on fashion very interesting. Could companies start being more aggressive in this direction, meaning create more products just to improve the user experience in certain categories? IMVU now has a daily outfit contest; could we have a fashion show and let users decide what they want to see? And could we work with some young designers as partners, sharing revenue with them, to see if we can make their labels better known and boost their sales? By marketing more deeply in certain categories, users will get more involved and more hooked. It's a win-win for both sides.

How can anybody forget the other side of the world, the more populated continent and the more conservative group of people? Conservative here only means that their culture tends to teach them to value the group more than themselves. These people must have a stronger wish to express themselves, and the virtual world seems to be an ideal choice for them - people don't know who you are, and you can be whatever you want without worrying about the consequences. So I would like to see companies like IMVU grabbing international users as fast as they can. It's going to be an asset with unpredictable value.

Lastly, say I have a user who has no interest in outfits or in digital sex, but who really wants to climb up a cliff and jump down. It does not mean that user wants to kill himself; it simply means he wants to have another type of experience. Will that type of user find something they want in the virtual world too?

Tuesday, January 18, 2011

Visual That


Yesterday I opened my Businessweek magazine and saw this picture in an article about CES in Las Vegas this year. The picture compares the size and weight of Panasonic's new garage-size 3D TV to things people are very familiar with, like Kobe Bryant's height, the length of a Mini Cooper, and the weight of a cow, so that readers of the article get a rough idea of how big the TV is.


Precision of measurement is not very important in this case; the familiarity and approximation of the measurements are the key. This is an idea that has sat in my mind for a month now. How desperately I want to realize it by building my own website! I actually already have a name for the website: "visualthat.com". Isn't that a cool name? Maybe at this moment I don't have the ability to build it, but at the very least I can write my thoughts down here. :)

What I have been thinking of is a website that lets users input the metric they are interested in (for example, height, weight, volume, currency, etc.) and the amount of that metric in numbers, and the website spits out a visualization comparing that amount against something average people are familiar with. For example, say I am interested in 1 yard. The website spits out a picture of an SUV and visualizes how much 1 yard is compared to its height. The visualization could chop off part of the SUV (supposing the SUV is taller than 1 yard), or put a ruler next to the picture indicating the approximate position of 1 yard. Another example: I am interested in how much 1 dollar is worth in Chinese currency. I input 1 dollar, and it shows me a McDonald's burger from the dollar menu, and then probably 3/4 of the same thing over a map of China, or 2 apples. To make it more fun, users would be allowed to choose the baseline they'd like to compare against. And an app for mobile devices could be created too.

The closest thing I can find on the internet is Wolfram Alpha. However, I don't quite like that (1) there are no pictures, and (2) too many query results are crowded onto one page. The Scale of Universe from Primax Studio is also cool, but it does not (1) allow user input or (2) let you choose the baseline.

Anyway, this is going to be a very fun project: doing measurement conversion, allowing user-input queries, and visualizing the query results. And it's going to be a challenging one! Hmmm, maybe I should start learning some HTML and JavaScript.

Friday, January 14, 2011

Market Basket Analysis using R

R has a package that deals with association rule mining tasks. Its name is "arules". It implements the Apriori and Eclat algorithms by calling C code in the back end.

I found a small transaction dataset (2k rows) online, which has 2 columns, 'order_id' and 'item'. The first few rows look like this:

/*
10248 item042
10248 item072
10249 item014
10249 item051
10250 item041
10250 item051
...
*/

Then I ran the following R code to analyze this data:

## call the 'arules' library [if you have not installed it, run "install.packages('arules')" in R.]
library(arules)

## import the data
# read.transactions is a very nice function to read
# in this type of transaction data very quickly.
# I have seen others reading the data using read.table
# and then transforming it, which was too much manipulation.
# The "format" argument tells R whether each line has just one
# item (single) or multiple items (basket) separated by "sep=".
# The "cols" argument is a numeric vector of length 2 giving
# the numbers of the columns with the transaction and item ids
# for single format. It can be a numeric scalar
# giving the number of the column with the transaction ids
# for basket format.
trans_data <- read.transactions(file='trans_data.csv', rm.duplicates=F, format='single', sep=',', cols=c(1,2))

# take a peek at what's in each basket
inspect(trans_data)

## after reading the data in, there are some trivial
#functions you can use on it

class(trans_data)
summary(trans_data)

## for each item, you can count the frequency of appearance
# or ratio of appearance relative to total number of transactions
itemFrequency(trans_data, type='absolute')
itemFrequency(trans_data, type='relative')
## item_freq/total_num_of_orders

## there are some visualization functions one can use
# to "see" the data
image(trans_data)
itemFrequencyPlot(trans_data)

## apply Apriori algorithms to mine association rules
# there are a few parameters that you can tweak with,
# see the package manual instruction on 'APparameter-class'
# in this example, I specifically tell R I want 1-way rules that
# have a minimum support of .002 and a minimum confidence of .2 by setting
# the parameters in the following way

rules <- apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2, target = "rules", maxlen=2, minlen=1))

parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target
0.2 0.1 1 none FALSE TRUE 0.002 1 2 rules
ext
FALSE

algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

Warning in apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2, :
You chose a very low absolute support count of 1. You might run out of memory! Increase minimum support.

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[77 item(s), 830 transaction(s)] done [0.01s].
sorting and recoding items ... [77 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [30 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

## check rules
inspect(rules[30])
# lhs rhs support confidence lift
# 1 {item021} => {item061} 0.009638554 0.2051282 7.094017


## extract measurements or other stuff you want
str(rules)
attributes(rules)$quality
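
As a possible follow-up (not part of the original analysis), the rule set can be ranked or filtered before inspecting it; sort() and subset() both have methods for rule sets in 'arules':

## rank the rules by lift and look at the top five
inspect(sort(rules, by = 'lift')[1:5])

## or keep only the rules whose lift is above 3
inspect(subset(rules, subset = lift > 3))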

The same rule showed up in my SQL query mining too. The support, confidence and lift metrics match up between R and SQL.

-[ RECORD 1 ]-----+-------------------------
rules | item021 => item061
lhs | item021
rhs | item061
num_lhs_rhs | 8.0
num_basket | 830.0
num_lhs | 39.0
num_rhs | 24.0
num_no_lhs | 791.0
num_no_rhs | 806.0
num_lhs_no_rhs | 31.0
num_no_lhs_rhs | 16.0
num_no_lhs_no_rhs | 775.0
support | 0.0096385542168
confidence | 0.2051282051282
lift | 7.0940170940170
chi_square | 45.253247622206
laplace | 0.2195121951219
conviction | 1.2216867469879
added_value | 0.1762125424776
certainty_factor | 0.1814595660749
j_measure | 0.0114057917394
gini_index | 0.0030619052969
jaccard | 0.1454545454545
shapiro | 0.0082798664537
cosine | 0.2614881801842
correlation | 0.2334994327337
odds_ratio | 12.500000000000