Thursday, June 16, 2011

One-liner R: multiple ggplots on one page

First, par(mfrow=c(nrow,ncol)) does not work with ggplot2 plots. Too bad, the simplest way I know of just wouldn't do.

I found two solutions. The first is posted here. It's a smart function that performs the par(mfrow) task for grid graphics. I just copied the code here:
# Source: http://gettinggeneticsdone.blogspot.com/2010/03/arrange-multiple-ggplot2-plots-in-same.html
# ggplot2 draws with grid graphics, so the 'grid' package is needed for viewports
library(grid)

# Helper used below in 'arrange': a viewport at a given (row, col) of the layout
vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)

# Function to plot multiple ggplots together, par(mfrow)-style
arrange <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
  dots <- list(...)
  n <- length(dots)
  # fill in missing layout dimensions
  if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2); ncol = ceiling(n/nrow) }
  if(is.null(nrow)) { nrow = ceiling(n/ncol) }
  if(is.null(ncol)) { ncol = ceiling(n/nrow) }

  ## NOTE see n2mfrow in grDevices for a possible alternative
  grid.newpage()
  pushViewport(viewport(layout=grid.layout(nrow, ncol)))
  ii.p <- 1
  for(ii.row in seq(1, nrow)) {
    ii.table.row <- ii.row
    if(as.table) { ii.table.row <- nrow - ii.table.row + 1 }

    for(ii.col in seq(1, ncol)) {
      ii.table <- ii.p
      if(ii.p > n) break
      # print each ggplot into its own cell of the grid layout
      print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
      ii.p <- ii.p + 1
    }
  }
}

Another solution is the 'grid.arrange' function in the 'gridExtra' package. This function works not only with plots but also with tables ('tableGrob'), so a page can contain both plots and text tables. It also gives more control over the title and subtitles. Very slick!

# Test
library(ggplot2)
library(gridExtra)
x <- qplot(mpg, wt, data=mtcars)
y <- qplot(1:10, letters[1:10])
arrange(x, y, nrow=1)        # the custom function above
grid.arrange(x, y, ncol=1)   # gridExtra's equivalent
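To show the table and title features, here is a minimal sketch; note the title argument name depends on the gridExtra version ('top' in recent releases, 'main' in older ones), so check your installed version.

# Sketch: a plot and a text table on the same page, with a shared title.
# Assumes a recent gridExtra where the title argument is 'top'
# (older versions used 'main').
library(ggplot2)
library(gridExtra)
p   <- qplot(mpg, wt, data=mtcars)
tbl <- tableGrob(head(mtcars[, c("mpg", "wt", "cyl")]))
grid.arrange(p, tbl, ncol=2, top="Plot and table side by side")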

Wednesday, June 15, 2011

Use LDA Models on User Behaviors

LDA (Latent Dirichlet Allocation) is a Bayesian statistical technique designed to find hidden (latent) groups in data. For example, LDA helps to find topics (latent groups of individual words) in large document corpora. LDA is a probabilistic generative model, i.e., it assumes a generating mechanism underneath the observations and then uses the observations to infer the parameters of that mechanism.

When trying to find the topics among documents, LDA makes the following assumptions:
  • ignoring the order of words, documents are simply bags of words, and a document exhibits multiple topics.
  • each topic is a distribution over words, so a word can appear in every topic, though with different probabilities. Say "transmission" has a higher probability in the topic about auto repair and a lower probability in the topic about child education.
  • each document is generated by first choosing topic proportions, say {(topic A: .55), (topic B: .20), (topic C: .25)}.
  • then, for each word in the document, a topic is chosen from those proportions (say you randomly draw the number .6 between 0 and 1, which lands on topic B), and a word is drawn from that topic's distribution over words (say word X).

Only the actual words are visible to us, so inference has to be made on the per-word topic assignments, the per-document topic proportions, and the per-corpus topic distributions. Dirichlet distributions are chosen as the priors for the latter two. Various algorithms are used for the inference, such as Gibbs sampling (implemented in the R package 'topicmodels').
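As a quick illustration, here is a minimal sketch of fitting an LDA model via Gibbs sampling with 'topicmodels', using the AssociatedPress document-term matrix that ships with the package; the subset size, number of topics, and control settings are arbitrary choices for the example.

# Minimal LDA fit with the 'topicmodels' package (Gibbs sampling)
library(tm)
library(topicmodels)
data("AssociatedPress")                      # a DocumentTermMatrix
fit <- LDA(AssociatedPress[1:200, ], k = 5, method = "Gibbs",
           control = list(seed = 2011, iter = 500))
terms(fit, 5)                                # top 5 words per topic
round(posterior(fit)$topics[1:3, ], 2)       # per-document topic proportions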

LDA assumes a fixed number of topics in a document corpus and assigns soft membership of topics to documents (i.e., a topic can be exhibited in multiple documents, and a document can exhibit multiple topics). Indeed, LDA falls under a bigger umbrella called the "mixed membership model" framework (see source [1] for more).

The soft assignment and mixed-membership nature of LDA makes it a reasonable candidate for user behavior analysis, particularly user behavior over a period of time. Say one has a log of user behavior at different stages of the user life cycle (1st day, 2nd day, etc. of a user's life). Then one can treat behaviors as words and each day's aggregated behavior log as a document, and let LDA figure out the topics (groups of behaviors that tend to appear together, 'co-appearance') and their proportions. That gives some idea of how the behavior of the group evolves over time; a sketch of this setup follows the figure. An R visualization could look like this (each tick on the y axis represents a topic).



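Here is a minimal sketch of that setup with a made-up behavior log (the day counts, behavior names, and number of topics are all hypothetical); it builds a day-by-behavior count matrix and hands it to 'topicmodels', which accepts anything coercible to a simple triplet matrix.

# Hypothetical behavior log: one row per observed (day-of-life, behavior) event
library(topicmodels)
library(slam)
set.seed(1)
behav.log <- data.frame(
  day      = rep(1:30, each = 20),
  behavior = sample(c("login", "search", "purchase", "invite", "review"),
                    600, replace = TRUE)
)
counts <- xtabs(~ day + behavior, data = behav.log)  # "documents" x "words"
dtm    <- as.simple_triplet_matrix(unclass(counts))  # format LDA() accepts
fit    <- LDA(dtm, k = 3, method = "Gibbs", control = list(seed = 1))
# per-day topic proportions -- the quantity one would plot over the life cycle
round(posterior(fit)$topics, 2)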
However, I do find that sometimes LDA tries too hard to discover topics around niche words (even after filtering out stop words and rare words). Depending on the application, this can be a blessing or a curse.

sources:
[1] http://www.cs.cmu.edu/~lafferty/pub/efl.pdf
[2] http://videolectures.net/mlss09uk_blei_tm/
[3] http://cran.r-project.org/web/packages/topicmodels/index.html
[4] http://www.cs.princeton.edu/~blei/

Tuesday, June 14, 2011

One-liner R: neat way to slice the data

Recently I found out that the 'caret' library has a few handy functions for the following tasks (see the sketch below):
  • create a series of test/training partitions (the createDataPartition function)
  • create one or more bootstrap samples (the createResample function)
  • split the data into k groups (the createFolds function)
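A quick sketch of the three helpers on the iris data (the split proportion and counts are arbitrary):

library(caret)
set.seed(2011)
# 80/20 stratified train/test split (returns row indices)
train.idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training  <- iris[train.idx, ]
testing   <- iris[-train.idx, ]
# three bootstrap samples of the rows
boots <- createResample(iris$Species, times = 3)
# split the rows into 5 folds
folds <- createFolds(iris$Species, k = 5)
str(folds)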

'Caret' is a super wrapper package: one can run many different types of models by calling functions from this package alone.

Friday, May 27, 2011

One-liner R: remove data frame columns by names

Sometimes I want to remove some columns from my data. What I did in that case was find the ids of the unwanted columns and remove those. It's just a couple of lines in R. But why not write a function that does the job and adds some error handling? Next time I can just copy and paste!
rm.col.by.name <- function(dat, colname.list){

  l1 <- length(colname.list)
  # positions of the requested columns that actually exist in 'dat'
  l2 <- which(colnames(dat) %in% colname.list)

  if(length(l2) == 0) {
    warning('Did not find any matching columns. Check input column names, please.')
    return(dat)   # nothing to remove, return the data unchanged
  }

  if(length(l2) < l1) {
    warning('Did not find all matching columns. Only those found are removed.')
  }

  dat[, -l2]
}
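For example (assuming the function above is defined):

# drop two existing columns
head(rm.col.by.name(mtcars, c("wt", "qsec")))
# one of the names does not exist: warns and removes only 'wt'
head(rm.col.by.name(mtcars, c("wt", "not_a_column")))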

Wednesday, May 25, 2011

Canopy Clustering

Lately I was introduced to a clustering algorithm called "canopy clustering". In plain English, like all other clustering techniques, it is designed to divide observations into segments (subsets, clusters, groups, buckets, whatever you call them), so that observations inside each segment are "similar" in terms of certain measurement criteria. Those criteria are distances or dissimilarities.

There are a lot of algorithms that cluster observations. Some are more computation-intensive, like k-medoids (compared to k-means). Some yield hard assignments, i.e., one observation can belong to only one cluster. Some generate soft assignments, i.e., one observation can belong to multiple clusters, like LSH (locality-sensitive hashing) and canopy clustering. Some scale better than others.

The canopy clustering procedure is relatively straightforward. The key ingredients are two distance thresholds T1 and T2, where T1 > T2. Going through the list of data points: if a point is within distance T2 of any existing canopy center, it cannot become a new canopy center; otherwise it is added to the set of canopy centers. This rule yields a set of canopy centers. Then any input data point within distance T1 of a center belongs to the canopy formed by that center. As one can see, a data point can belong to multiple canopies. The following plot illustrates the concept.



Usually, canopy clustering uses a "cheap" distance metric (easy and fast to calculate), so the process takes less time than k-means clustering. This is a typical trade-off between precision and speed. Thus people combine k-means and canopy clustering to cluster huge amounts of data (a sketch follows the list). The two work together like this:
1. Canopy clustering uses the cheaper distance metric (one that agrees with the k-means distance: if two points are close in the k-means distance, they should be close in the canopy distance too) and partitions the data into canopies.
2. K-means still clusters all the data points, but the distance between an input data point and a k-means centroid is calculated if and only if they are in the same canopy. In this way, a lot of "unnecessary" calculations are avoided.
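Here is a minimal sketch in R of the canopy step described above, with Euclidean distance standing in for the "cheap" metric (the threshold values and data are made up); it is only meant to make the T1/T2 rules concrete, not to be an efficient implementation.

# Select canopy centers: a point within T2 of an existing center cannot
# become a new center; every other point starts a new canopy.
canopy.centers <- function(x, T1, T2) {
  centers <- NULL
  for (i in seq_len(nrow(x))) {
    p <- x[i, ]
    d <- if (is.null(centers)) numeric(0)
         else sqrt(rowSums((centers - matrix(p, nrow(centers), ncol(x), byrow=TRUE))^2))
    if (all(d > T2)) centers <- rbind(centers, p)
  }
  centers
}

# Assign membership: every point within T1 of a center belongs to that
# canopy, so a point can end up in several canopies.
canopy.members <- function(x, centers, T1) {
  lapply(seq_len(nrow(centers)), function(j) {
    d <- sqrt(colSums((t(x) - centers[j, ])^2))
    which(d <= T1)
  })
}

# toy usage with made-up thresholds
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
ctr <- canopy.centers(x, T1 = 1.5, T2 = 0.8)
str(canopy.members(x, ctr, T1 = 1.5))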

There are some interesting variations on combining the two, or on simply using canopy clustering to generate the initial k-means centroids. If canopy clustering serves as a k-means centroid feed, one can perform only the center-generation part, or perform both center generation and membership assignment and then re-calculate the center of each canopy. If canopy clustering is used to partition the data for k-means, it has been suggested to keep the ratio of k to the number of canopies between 1 and 10; that is, you always want the number of canopies to be smaller than k, so that each point's canopy is likely to contain at least one k-means centroid. But how do you find out how many canopies you are going to get? The trick is that you have to play with T1 and T2.

When it comes to implementing this in a map-reduce framework, this blog post is very clear. One thing worth pointing out: after the canopy centers are generated by one map-reduce job, the next map-reduce job assigns every input data point to the canopy centers. At that time, some points may no longer be within T1 of any canopy center; solutions in this case include forming a new canopy with such a point as its center, assigning it to the nearest center, or simply discarding it.

Thursday, March 31, 2011

A few things I learnt in 2010

2010 was an "interesting" year. Here are the lessons that I learned and felt the most.

1. "Be confident and assertive."
Fine, call yourself a "data analyst", a "data scientist" or even a "data ninja". Whatever you call yourself really does not matter. What matters is that, as a professional, you work in an environment that requires you (and hopefully you also want) to share your findings and convince others to see the potential value of your analysis the way you see it. At the end of the day, you are a consultant. And as a consultant, being confident and assertive about your expertise and findings helps tons in the process.

2. Ask a lot of questions. Especially ask the "dumb" ones at the beginning. Getting clear on concepts and terms as early as possible helps reduce confusion and shortens project development time.

3. Follow up. Sometimes people forget what they said or asked for a few days ago. Reasonably constant and consistent contact is the key to keeping projects running and on track! Plus, it makes your clients feel needed, which theoretically is a good feeling. :P

Monday, March 14, 2011

Filesystem Traversing with Python

Lately I have accumulated a lot of files, most of which are papers in pdf format. Although they are filed into folders properly, it's still hard to track down a single file, or even to check whether I have already collected a particular paper. So I needed to do a simple filesystem traversal and record 4 basic things about my documents: directory, file name, file size, and last modified date. With a summary file containing that information, I can do some basic searching, which is at least faster than going through folders and digging for files.

I also wanted my results to satisfy two requirements: (1) only output file information if the file is in pdf, ppt, or doc format; (2) the file size should be in a human-readable format. I chose Python to finish this little task. Here is the code I used:


import os
import time, stat
from datetime import datetime

# return the file size in a human-readable format
def sizeof_fmt(num):
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f%s" % (num, x)
        num /= 1024.0

f = open('output.txt', 'w')

# walk the document tree and record pdf/ppt/doc files only
for root, dirs, files in os.walk(r'E:\my_papers'):
    for file in files:
        if file.split('.')[-1] in ('pdf', 'ppt', 'doc'):
            st = os.stat(os.path.join(root, file))
            sz = st[stat.ST_SIZE]
            # keep only the year-month-day part of the last-modified time
            tm = time.ctime(st[stat.ST_MTIME])
            tm_tmp = datetime.strptime(tm, '%a %b %d %H:%M:%S %Y')
            tm = tm_tmp.strftime('%Y-%m-%d')
            sz2 = sizeof_fmt(sz)
            strg = root + '\t' + file + '\t' + tm + '\t' + sz2 + '\n'
            f.write(strg)

f.close()


The modules used here are "os", "time", "stat" and "datetime".

First there is a user-defined function that returns the file size in a human-readable format. I found this interesting function here.

Next, the os.walk function walks through the given directory tree, yielding the files in each directory. For each file, the extension is obtained by splitting the file name on '.'. If the file format meets my requirement, its size in bytes and its last-modified time are collected. Unfortunately the time is a very long string, part of which is not relevant at all. So I created a "datetime" object "tm_tmp" using "datetime.strptime()", and then created a string "tm" that keeps only the year, month and day of the file. Next, the size function is called and the human-readable file size is returned. Finally, the directory, filename, time and size are written to the output file.
Next os.walk function will walk through the given directory and stop until it finds files. Then for each file, the format is obtained by splitting the file names by '.'. Once I have the file format meets my requirement, its size in bytes format and its last modified-time is collected. Unfortunately the time is a very long string, some of which is not relevant at all. So I created a "datetime" object "tm_tmp" using "datetime.striptime()", and then created a string "tm" that only keeps year, month and date information of the file. Next the size function is called and the human readable file size is returned. Finally the directory, filename, time and size information are written to file.