Tuesday, January 11, 2011

Association Rule Mining (Market Basket Analysis)

Association rule mining is a well-researched, classical data mining technique. It is designed to find hidden patterns or relationships in large data sets, where a pattern means that certain items frequently appear together with some given item. It is particularly well suited to analyzing commercial transaction data. In that scenario (where it is usually called "market basket analysis"), the presence of each item in a transaction can be expressed as a binary variable taking a value in {0,1}. For example, transaction data from checkout counters (each customer holding a basket of one or more items) may show that customers who buy A also tend to buy B. By examining a huge amount of transaction data, interesting relationships/rules can be revealed, like people who buy diapers also buy wipes (a trivial rule, not very interesting), or people who buy diapers also buy beer (well, this is a lot more interesting than the previous rule). Rules like these can be very helpful for cross-marketing, catalog design, shelf organization, product recommendations on the order page, etc. Of course, besides this kind of consumer preference analysis, association rule mining works for other types of problems too, e.g., human resource management (employees who said positive things about initiative A also frequently complain about issue B), and the history of language.
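To make the binary representation concrete, here is a minimal Python sketch. The baskets and item names are made up for illustration (they are not real data), and `encode` is just a hypothetical helper name. Each transaction becomes a row of 0/1 values, one column per item.

```python
# Hypothetical toy baskets; the item names are illustrative only.
transactions = [
    {"diaper", "wipes", "beer"},
    {"diaper", "wipes"},
    {"beer", "chips"},
    {"diaper", "beer"},
]

# Fix a column order so every row lines up with the same items.
items = sorted(set().union(*transactions))


def encode(basket, items):
    """Return a 0/1 row: 1 if the item is in the basket, otherwise 0."""
    return [1 if item in basket else 0 for item in items]


matrix = [encode(basket, items) for basket in transactions]

print(items)
for row in matrix:
    print(row)
```

A set per basket is enough for small examples; the 0/1 matrix form is mainly useful when the data has to go into a table or a tool that expects one column per item.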

Because association rule mining does not require users to provide much prior information and it deals with single values, words, or items, it is well suited for data or text mining in large databases. In upcoming posts, I will demonstrate how to do market basket analysis using SQL and R.

The rules can be denoted as A => B, meaning A implies B. The most commonly used measures for association rule mining are support and confidence, where support = count(A & B together) / count(total transactions) (the joint probability of A and B), confidence = count(A & B together) / count(A) (the conditional probability of B given A), and count(A) is the number of transactions that contain A. Many algorithms use support and confidence to prune uninteresting rules, i.e., those with low support or low confidence. For example, the well-known Apriori algorithm processes data quickly by discarding candidates that fall below user-chosen minimum "threshold" values.
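The two formulas above can be sketched in a few lines of Python. This is only a brute-force illustration over pairs of single items, not the Apriori algorithm itself; the toy baskets, the function names, and the threshold values 0.5 and 0.6 are all assumptions chosen for the example.

```python
from itertools import permutations

# Hypothetical toy baskets; the item names are illustrative only.
transactions = [
    {"diaper", "wipes", "beer"},
    {"diaper", "wipes"},
    {"beer", "chips"},
    {"diaper", "beer"},
]


def support(a, b, transactions):
    """count(A & B together) / count(total transactions)."""
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)


def confidence(a, b, transactions):
    """count(A & B together) / count(A); 0.0 if A never occurs."""
    count_a = sum(1 for t in transactions if a in t)
    both = sum(1 for t in transactions if a in t and b in t)
    return both / count_a if count_a else 0.0


def find_rules(transactions, min_support, min_confidence):
    """Keep every single-item rule A => B that meets both thresholds."""
    items = sorted(set().union(*transactions))
    kept = []
    for a, b in permutations(items, 2):
        s = support(a, b, transactions)
        c = confidence(a, b, transactions)
        if s >= min_support and c >= min_confidence:
            kept.append((a, b, s, c))
    return kept


# Threshold values are arbitrary choices for this toy data set.
for a, b, s, c in find_rules(transactions, min_support=0.5, min_confidence=0.6):
    print(f"{a} => {b}: support={s:.2f}, confidence={c:.2f}")
```

Note that support is symmetric in A and B while confidence is not, so A => B and B => A can pass or fail the confidence threshold independently. Checking every pair like this gets expensive quickly as the number of items grows, which is the problem that threshold-based pruning is meant to address.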

Professor Kumar has made some sample chapters of his book "Introduction to Data Mining" available online, including the chapter on association rule mining.
