捡芝麻拾谷子: Use LDA Models on User Behaviors

LDA (Latent Dirichlet Allocation) model is a Bayesian statistical technique that is designed to find hidden (latent) groups of the data. For example, LDA helps to find topics (latent groups of individual words) among large document corpora. LDA is a probabilistic generative model, i.e., it assumes a generating mechanism underneath the observations and then use the observations to infer the parameters for the underlying mechanism.

When trying to find the topics among documents, LDA makes the following assumptions:

ignoring the order of words, documents are simply bags of words and documents exhibits multiple topics.

topics have distributions over words. So each word could appear in all the topics with different probabilities though. Say "transmission" will have a higher probability in the topic about auto repair, but lower probability in the topic about child education.

each document is obtained by choosing some topics proportions, say {(topic A: .55), (topic B: .20), (topic C: 25)}.

and for each word inside a document, choosing a topic from those proportions (say you randomly got number .6 (between 0 and 1 of course), the topic got chosen is B) then looking at the topic over words distribution and drawing a word from that topic (say word X).

Only the actual words are visible to us, so inference has to be made on per-word topic assignment, per-document topic proportion and per-corpus topic distribution. Dirichlet distributions are chosen to be the prior distributions for the latter two distributions. And various algorithms are used to make the inference, like Gibbs sampling (implemented in R package 'topicmodels').

LDA assumes a fixed number of topics existing in a document corpora and assigns soft membership of the topics to documents (i.e, a topic could be exhibited in multiple documents). Indeed, LDA is under a bigger umbrella called "mixed membership model framework" [refer to source [1] for more) .

The soft assignment and mixed membership nature of LDA models make it a reasonable candidate for user behavior analysis, particularly user behavior through a certain period of time. Say one has a log of user behavior at different stage of user life cycle (1st day, 2nd day, etc of a user's life). Then one can treat behavior as words, each day's users aggregated behavior log as document, and let LDA figure out the topics (group of behaviors that tend to appear together, 'co-appearance') and the proportions of the topics. And that will give some idea on how behavior of the group evolves over time. A R visualization could look like this (each tick on the y axis represents a topic).

However, I do find that sometime LDA tries too hard on discovering topics with niche words (after filtering out stop words and rare words). Depending on application, this could be a bless or a curse.

sources:
[1] http://www.cs.cmu.edu/~lafferty/pub/efl.pdf
[2] http://videolectures.net/mlss09uk_blei_tm/
[3] http://cran.r-project.org/web/packages/topicmodels/index.html
[4] http://www.cs.princeton.edu/~blei/

捡芝麻拾谷子

Wednesday, June 15, 2011

Use LDA Models on User Behaviors

No comments:

Post a Comment

Total Pageviews

About Me