Showing posts with label data mining. Show all posts

Wednesday, June 15, 2011

Use LDA Models on User Behaviors

LDA (Latent Dirichlet Allocation) is a Bayesian statistical technique designed to find hidden (latent) groups in data. For example, LDA helps find topics (latent groups of individual words) in large document corpora. LDA is a probabilistic generative model, i.e., it assumes a generating mechanism underneath the observations and then uses the observations to infer the parameters of that mechanism.

When trying to find the topics among documents, LDA makes the following assumptions:
  • ignoring the order of words, documents are simply bags of words, and each document exhibits multiple topics.
  • topics are distributions over words, so each word can appear in every topic, though with different probabilities. Say "transmission" has a higher probability in a topic about auto repair, but a lower probability in a topic about child education.
  • each document is generated by first choosing topic proportions, say {(topic A: .55), (topic B: .20), (topic C: .25)}.
  • then, for each word in the document, choosing a topic from those proportions (say you randomly draw .6 between 0 and 1, so topic B is chosen), looking up that topic's distribution over words, and drawing a word from it (say word X).
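The generative steps above can be sketched directly in code. A minimal sketch in Python/NumPy follows; the toy vocabulary and topic-word probabilities are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topics (illustrative only).
vocab = ["transmission", "engine", "brake", "teacher", "homework", "grade"]
# Each topic is a distribution over the vocabulary.
topics = np.array([
    [0.40, 0.35, 0.20, 0.02, 0.02, 0.01],  # an "auto repair" topic
    [0.02, 0.01, 0.02, 0.35, 0.35, 0.25],  # a "child education" topic
])

def generate_document(n_words, alpha=(1.0, 1.0)):
    """Generate one document following LDA's assumed mechanism."""
    # 1. Choose the document's topic proportions from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. For each word, choose a topic from those proportions...
        z = rng.choice(len(topics), p=theta)
        # 3. ...then draw the word from that topic's distribution over words.
        words.append(rng.choice(vocab, p=topics[z]))
    return words

doc = generate_document(10)
```

Running the generator forward like this is exactly the mechanism LDA assumes; fitting LDA is the reverse problem of inferring `theta` and `topics` from observed documents.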

Only the actual words are visible to us, so inference has to be made on the per-word topic assignments, the per-document topic proportions, and the per-corpus topic distributions. Dirichlet distributions are chosen as priors for the latter two. Various algorithms are used for the inference, such as Gibbs sampling (implemented in the R package 'topicmodels').
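To give a feel for that inference, here is a minimal collapsed Gibbs sampler sketched in Python/NumPy (the post's actual tool is R's 'topicmodels'; this is an unoptimized illustration of the standard collapsed Gibbs update, and the tiny corpus of word ids at the bottom is made up):

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Count tables: doc-topic counts, topic-word counts, topic totals.
    ndk = np.zeros((len(docs), n_topics))
    nkw = np.zeros((n_topics, vocab_size))
    nk = np.zeros(n_topics)
    z = []  # per-word topic assignments, initialized at random
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this word's current assignment from the counts...
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # ...and resample its topic from the full conditional.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Two tiny documents as lists of word ids (made up for illustration).
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

The returned count tables, smoothed by `alpha` and `beta`, estimate the per-document topic proportions and per-topic word distributions. In practice one would of course reach for a tested library rather than this sketch.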

LDA assumes a fixed number of topics in a document corpus and assigns soft membership of topics to documents (i.e., a topic can be exhibited in multiple documents). Indeed, LDA falls under a bigger umbrella called the "mixed membership model framework" (refer to source [1] for more).

The soft assignment and mixed membership nature of LDA makes it a reasonable candidate for user behavior analysis, particularly user behavior over a period of time. Say one has a log of user behavior at different stages of the user life cycle (the 1st day, 2nd day, etc. of a user's life). Then one can treat behaviors as words and each day's aggregated behavior log as a document, and let LDA figure out the topics (groups of behaviors that tend to appear together, 'co-appearance') and the proportions of those topics. That gives some idea of how the group's behavior evolves over time. An R visualization could look like this (each tick on the y axis represents a topic).
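The "behaviors as words, days as documents" setup can be prepared with a few lines of code. A sketch in Python follows; the event names and log schema are hypothetical:

```python
from collections import Counter

# Hypothetical per-user event logs keyed by day of the user's life (assumed schema).
logs = [
    {"day": 1, "events": ["signup", "profile_edit", "search", "search"]},
    {"day": 1, "events": ["signup", "search"]},
    {"day": 2, "events": ["search", "purchase", "review"]},
    {"day": 2, "events": ["purchase", "review"]},
]

# Treat each behavior as a "word" and each day's aggregated log as a "document".
day_docs = {}
for rec in logs:
    day_docs.setdefault(rec["day"], []).extend(rec["events"])

# Bag-of-words counts per day-document, ready to feed into an LDA fitter
# (e.g. R's 'topicmodels', or any other implementation).
bow = {day: Counter(words) for day, words in day_docs.items()}
```

From here the fitted topic proportions per day are what the stacked visualization below would plot.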



However, I do find that sometimes LDA tries too hard to discover topics built on niche words (even after filtering out stop words and rare words). Depending on the application, this can be a blessing or a curse.

Sources:
[1] http://www.cs.cmu.edu/~lafferty/pub/efl.pdf
[2] http://videolectures.net/mlss09uk_blei_tm/
[3] http://cran.r-project.org/web/packages/topicmodels/index.html
[4] http://www.cs.princeton.edu/~blei/

Tuesday, January 11, 2011

Association Rule Mining (Market Basket Analysis)

Association rule mining is a well researched classical data mining technique. It's designed to find hidden patterns or relationships in large data sets, a pattern here being the frequent co-appearance of items given the presence of some other item. It's particularly well suited for analyzing commercial transaction data. In that scenario (where it's usually called "market basket analysis"), the presence of an item can be expressed as a binary variable taking values in {0,1}. For example, transaction data from the checkout counters (each customer holding a basket of one or more items) may show that customers who buy A also purchase B. By examining huge amounts of transaction data, interesting relationships/rules can be revealed, like people who buy diapers also buy wipes (a trivial, not very interesting rule), or people who buy diapers also buy beer (a lot more interesting than the previous one). Rules like these can be very helpful for cross-marketing, catalog design, shelf organization, product recommendations on the place-order page, etc. Of course, besides this kind of consumer preference analysis, association rule mining works for other types of problems too, e.g., human resource management (employees who said positive things about initiative A also frequently complain about issue B) or the history of language.

Because association rule mining does not require users to provide much prior information, and it deals with single values, words, or items, it's well suited for data or text mining in large databases. In upcoming posts, I will demonstrate how to do market basket analysis using SQL and R.

A rule can be denoted as A => B, meaning A implies B. The most commonly used measurements in association rule mining are support and confidence, where support = count(A & B together) / count(total transactions) (the joint probability of A and B), confidence = count(A & B together) / count(A) (the conditional probability of B given A), and count(A) counts the number of transactions containing A. Many algorithms use support and confidence to prune uninteresting, less confident rules. For example, the well-known Apriori algorithm rapidly processes data based on chosen "threshold" values.
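Both measures are easy to compute directly. A small sketch in Python follows; the basket data is made up to mirror the diaper/beer example:

```python
# Hypothetical transactions: each basket is a set of purchased items.
baskets = [
    {"diaper", "wipes", "beer"},
    {"diaper", "beer"},
    {"diaper", "wipes"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
]

def support(itemset):
    """support(X) = count(transactions containing X) / count(total transactions)"""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """confidence(A => B) = support(A & B together) / support(A)"""
    return support(antecedent | consequent) / support(antecedent)

support({"diaper", "beer"})        # 3/5 = 0.6
confidence({"diaper"}, {"beer"})   # 3/4 = 0.75
```

This brute-force scan is fine for a toy example; Apriori's contribution is pruning the candidate itemsets so the same measures can be computed on huge transaction data.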

Here Professor Kumar shares some sample chapters of his book "Introduction to Data Mining", including the chapter on association rule mining.