Essentially, association rule mining is all about counting, so it fits the database environment very well, especially big, scalable databases. With a few GROUP BYs and JOINs, the job can be accomplished easily in SQL.

Here I use the Apriori algorithm to generate all 0-way (i.e., {} => item B) and 1-way (i.e., item A => item B) rules. The code first generates the frequent itemsets that contain a single item and have a support count greater than minsup. It then generates all 1-way rules from the 2-item frequent itemsets whose support is greater than minsup and whose confidence is greater than some minconf. Rules with more than one item on the left-hand side are more complicated and require passing over the data iteratively until no more frequent itemsets are generated. The following queries are just meant to show how market basket analysis works and what kinds of metrics are available to evaluate the rules.

/* Assuming your transaction data looks like this, as (order_id, item) pairs:

order 1, item 1
order 1, item 2
order 1, item 32
order 2, item 3
order 2, item 2
order 2, item 5
order 3, item 3
...

and it's already stored in the database in a table named trans_data.
*/

-- Generate 0-way frequent itemsets using minsup = 5
CREATE TABLE arule_0w AS
SELECT item, SUM(1) AS cnt
FROM trans_data
GROUP BY item
HAVING SUM(1) > 5
DISTRIBUTED BY (item)
;
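As a quick sanity check of the query shape, here is a sketch against SQLite from Python. The rows and the lower minsup = 2 are toy assumptions to fit the tiny data set, and SQLite has no DISTRIBUTED BY clause, so that part is dropped:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE trans_data (order_id INTEGER, item INTEGER)")

# Toy baskets: item 2 appears in 3 orders, item 3 in 2, the rest in 1
rows = [(1, 1), (1, 2), (1, 32), (2, 3), (2, 2), (2, 5), (3, 3), (3, 2)]
cur.executemany("INSERT INTO trans_data VALUES (?, ?)", rows)

# Same shape as arule_0w, with minsup = 2 for this tiny data set
cur.execute("""
    CREATE TABLE arule_0w AS
    SELECT item, SUM(1) AS cnt
    FROM trans_data
    GROUP BY item
    HAVING SUM(1) > 2
""")
frequent = cur.execute("SELECT item, cnt FROM arule_0w").fetchall()
# frequent == [(2, 3)] : only item 2 clears the threshold
```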

-- Generate 1-way frequent itemsets
CREATE TABLE arule_1w AS
SELECT lhs, rhs, COUNT(1) AS num_lhs_rhs
FROM (
    SELECT item_lhs.order_id, lhs, rhs
    FROM (SELECT a.order_id, a.item AS lhs
          FROM trans_data a
          JOIN arule_0w USING (item)
          GROUP BY order_id, item) item_lhs
    JOIN (SELECT b.order_id, b.item AS rhs
          FROM trans_data b
          JOIN arule_0w USING (item)
          GROUP BY order_id, item) item_rhs
    ON (item_lhs.order_id = item_rhs.order_id
        AND item_lhs.lhs <> item_rhs.rhs)
) rules
GROUP BY lhs, rhs
HAVING COUNT(1) > 5
DISTRIBUTED BY (lhs, rhs)
;
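The self-join pattern can be tried the same way. This sketch uses toy data with minsup = 2 and, for brevity, drops the pruning join against arule_0w (everything in this tiny data set is frequent anyway):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE trans_data (order_id INTEGER, item INTEGER)")
rows = [(1, 2), (1, 5), (2, 2), (2, 5), (3, 2), (3, 5), (3, 7)]
cur.executemany("INSERT INTO trans_data VALUES (?, ?)", rows)

# Self-join each order's items against themselves to enumerate ordered
# (lhs, rhs) pairs, then count how often each pair co-occurs; minsup = 2
pairs = cur.execute("""
    SELECT a.item AS lhs, b.item AS rhs, COUNT(1) AS num_lhs_rhs
    FROM trans_data a
    JOIN trans_data b
      ON a.order_id = b.order_id AND a.item <> b.item
    GROUP BY lhs, rhs
    HAVING COUNT(1) > 2
    ORDER BY lhs, rhs
""").fetchall()
# pairs == [(2, 5, 3), (5, 2, 3)] : items 2 and 5 co-occur in all 3 orders,
# while pairs involving item 7 appear only once and are filtered out
```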

You might have noticed that trans_data is joined with the 0-way itemset table to filter out items that were not frequent at the 0-way step. This is exactly how the Apriori algorithm works: at each iteration, items that were not frequent in the previous run are filtered out. As we search deeper in the data, the candidate sets become smaller and smaller, because items that were not frequent at a previous iteration cannot appear in any frequent itemset at a later one.
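That pruning principle, the anti-monotone property of support, can be sketched in a few lines of Python; the transactions below are made up for illustration:

```python
from itertools import combinations

# Toy baskets, made up for illustration (each set is one order's items)
transactions = [{1, 2, 32}, {2, 3, 5}, {2, 3}, {1, 2}]

def support_count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Anti-monotone property: a superset is never more frequent than any of its
# subsets, so an itemset that misses minsup can be dropped from all later
# iterations without losing any frequent itemset.
items = {1, 2, 3, 5, 32}
for pair in combinations(items, 2):
    for single in pair:
        assert support_count(set(pair)) <= support_count({single})
```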

All right, the queries above show how to generate the 0-way and 1-way itemsets. The next chunk of queries shows how to generate rules from the itemsets, along with the measures used to evaluate those rules. Besides support and confidence, there are other measures too, such as cosine, kappa, Jaccard, Laplace, the Gini index, and the J-measure, to name a few (for more details, see the sample chapter 6 on the "Introduction to Data Mining" book website). Most of the interest measures are calculated from the following 2x2 contingency table. f11 is the number of transactions in which A and B appear together; f10 is the number of transactions that contain A but not B. The column sum f+1 is the support count of B, the row sum f1+ is the support count of A, and N is the total number of transactions.

     |  B  | !B  |
-----|-----|-----|-----
  A  | f11 | f10 | f1+
 !A  | f01 | f00 | f0+
-----|-----|-----|-----
     | f+1 | f+0 |  N
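Plugging made-up numbers into that table shows how the cells and the basic measures relate; the counts below are arbitrary, chosen only to make the arithmetic visible, and the comments map each cell to the column names used in the query further down:

```python
# Hypothetical counts for one rule A => B (arbitrary numbers for illustration)
N   = 1000   # num_basket: total transactions
f1p = 120    # f1+ = num_lhs: transactions containing A
fp1 = 200    # f+1 = num_rhs: transactions containing B
f11 = 60     # num_lhs_rhs: transactions containing both A and B

# The remaining cells follow from the margins
f10 = f1p - f11            # A but not B   (num_lhs_no_rhs)
f01 = fp1 - f11            # B but not A   (num_no_lhs_rhs)
f00 = N - f1p - fp1 + f11  # neither       (num_no_lhs_no_rhs)

support    = f11 / N                # 0.06
confidence = f11 / f1p              # 0.5
lift       = f11 * N / (f1p * fp1)  # 2.5 (> 1: A and B positively related)
```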

Next, I show how to generate the rules and calculate some of those interest measures in one query.

CREATE TABLE arule_1w_with_measure AS
SELECT a.*
     , (power(exp_lhs_rhs - num_lhs_rhs, 2)/exp_lhs_rhs
        + power(exp_lhs_no_rhs - num_lhs_no_rhs, 2)/exp_lhs_no_rhs
        + power(exp_no_lhs_rhs - num_no_lhs_rhs, 2)/exp_no_lhs_rhs
        + power(exp_no_lhs_no_rhs - num_no_lhs_no_rhs, 2)/exp_no_lhs_no_rhs) as Chi_Square
     , (num_lhs_rhs + 1)/(num_lhs + 2) as Laplace
     , (num_lhs*num_no_rhs)/(num_basket*num_lhs_no_rhs) as Conviction
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket) as Added_Value
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket)/(1 - num_rhs/num_basket) as Certainty_Factor
     , (num_lhs_rhs/num_basket * ln((num_basket*num_lhs_rhs)/(num_lhs*num_rhs))
        + num_lhs_no_rhs/num_basket * ln((num_basket*num_lhs_no_rhs)/(num_lhs*num_no_rhs))) as J_Measure
     , (num_lhs/num_basket*(power(num_lhs_rhs, 2) + power(num_lhs_no_rhs, 2))/power(num_lhs, 2)
        - power(num_rhs/num_basket, 2)
        + num_no_lhs/num_basket*(power(num_no_lhs_rhs, 2) + power(num_no_lhs_no_rhs, 2))/power(num_no_lhs, 2)
        - power(num_no_rhs/num_basket, 2)) as Gini_Index
     , num_lhs_rhs/(num_lhs + num_rhs - num_lhs_rhs) as Jaccard
     , (num_lhs_rhs/num_basket - num_rhs*num_lhs/power(num_basket, 2)) as Shapiro
     , num_lhs_rhs/sqrt(num_lhs*num_rhs) as Cosine
     , (num_lhs_rhs*num_basket - num_lhs*num_rhs)/sqrt(num_lhs*num_rhs*num_no_lhs*num_no_rhs) as Correlation
     , num_lhs_rhs*num_no_lhs_no_rhs/(num_lhs_no_rhs*num_no_lhs_rhs) as Odds_Ratio
FROM (SELECT lhs_rhs.lhs || ' => ' || lhs_rhs.rhs as rules
           , lhs_rhs.lhs
           , lhs_rhs.rhs
           , num_lhs_rhs*1.0 as num_lhs_rhs
           , num_basket*1.0 as num_basket
           , num_lhs*1.0 as num_lhs
           , num_rhs*1.0 as num_rhs
           , (num_basket - num_lhs)*1.0 as num_no_lhs
           , (num_basket - num_rhs)*1.0 as num_no_rhs
           , (num_lhs - num_lhs_rhs)*1.0 as num_lhs_no_rhs
           , (num_rhs - num_lhs_rhs)*1.0 as num_no_lhs_rhs
           , (num_basket - num_lhs - num_rhs + num_lhs_rhs)*1.0 as num_no_lhs_no_rhs
           , num_lhs*num_rhs*1.0/num_basket as exp_lhs_rhs
           , num_lhs*(1.0*num_basket - num_rhs)*1.0/num_basket as exp_lhs_no_rhs
           , (1.0*num_basket - num_lhs)*num_rhs/num_basket as exp_no_lhs_rhs
           , (1.0*num_basket - num_lhs)*(1.0*num_basket - num_rhs)/num_basket as exp_no_lhs_no_rhs
           , num_lhs_rhs*1.0/num_basket as support
           , num_lhs_rhs*1.0/num_lhs as confidence
           , num_lhs_rhs*num_basket*1.0/(num_lhs*num_rhs) as lift
      FROM arule_1w as lhs_rhs
      JOIN (SELECT item as lhs, count(1) as num_lhs
            FROM trans_data
            GROUP BY lhs) sum_lhs
        ON lhs_rhs.lhs = sum_lhs.lhs
      JOIN (SELECT item as rhs, COUNT(1) as num_rhs
            FROM trans_data
            GROUP BY rhs) sum_rhs
        ON lhs_rhs.rhs = sum_rhs.rhs
      CROSS JOIN (SELECT COUNT(1) as num_basket
                  FROM (SELECT order_id FROM trans_data GROUP BY order_id) foo) total
     ) a
WHERE lift > 1 -- only the good rules
  AND confidence > .0001
DISTRIBUTED BY (lhs, rhs)
;

Notice that the WHERE clause at the end ensures we only generate rules with confidence greater than a threshold of .0001 and with lift > 1, meaning positively related rules. One trick I used is to multiply the integer counts by 1.0, so that later on, when you calculate the measures, SQL performs floating-point arithmetic instead of truncating integer division (and the large products won't overflow the integer type).
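A quick illustration of that trick, using SQLite from Python here (SQLite truncates integer division much like Postgres-family databases do):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Pure integer division truncates; multiplying by 1.0 first forces float math
int_div, float_div = cur.execute("SELECT 3 / 5, 3 * 1.0 / 5").fetchone()
# int_div == 0 (truncated), float_div == 0.6
```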
