Essentially, association rule mining is all about counting, so it fits the database environment very well, especially big, scalable databases. With a few GROUP BYs and JOINs, the job can be accomplished easily in SQL.

Here I use the Apriori algorithm to generate all 0-way (i.e., {} => item B) and 1-way (i.e., item A => item B) rules. The code first generates the frequent itemsets that contain a single item and have a support count greater than minsup. It then generates all 1-way rules from the 2-item frequent itemsets whose support is greater than minsup and whose confidence is greater than some minconf. Rules with more than one item on the left-hand side are more complicated and require passing over the data iteratively until no more frequent itemsets are generated. The following queries are just meant to show how market basket analysis works and what kinds of metrics are available to evaluate the rules.

/* Assuming your transaction data looks like this, as (order_id, item) pairs:

order 1, item 1
order 1, item 2
order 1, item 32
order 2, item 3
order 2, item 2
order 2, item 5
order 3, item 3
...

and it's already stored in the database in a table named trans_data.
*/

-- Generate 0-way frequent itemsets using minsup = 5
CREATE TABLE arule_0w AS
SELECT item, SUM(1) AS cnt
FROM trans_data
GROUP BY item
HAVING SUM(1) > 5
DISTRIBUTED BY (item)
;
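As a quick sanity check of the query shape, here is a sketch against SQLite from Python. The rows and the lower minsup = 2 are toy assumptions to fit the tiny data set, and SQLite has no DISTRIBUTED BY clause, so that part is dropped:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE trans_data (order_id INTEGER, item INTEGER)")

# Toy baskets: item 2 appears in 3 orders, item 3 in 2, the rest in 1
rows = [(1, 1), (1, 2), (1, 32), (2, 3), (2, 2), (2, 5), (3, 3), (3, 2)]
cur.executemany("INSERT INTO trans_data VALUES (?, ?)", rows)

# Same shape as arule_0w, with minsup = 2 for this tiny data set
cur.execute("""
    CREATE TABLE arule_0w AS
    SELECT item, SUM(1) AS cnt
    FROM trans_data
    GROUP BY item
    HAVING SUM(1) > 2
""")
frequent = cur.execute("SELECT item, cnt FROM arule_0w").fetchall()
# frequent == [(2, 3)] : only item 2 clears the threshold
```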

-- Generate 1-way frequent itemsets
CREATE TABLE arule_1w AS
SELECT lhs, rhs, COUNT(1) AS num_lhs_rhs
FROM (
    SELECT item_lhs.order_id, lhs, rhs
    FROM (SELECT a.order_id, a.item AS lhs
          FROM trans_data a
          JOIN arule_0w USING (item)
          GROUP BY order_id, item) item_lhs
    JOIN (SELECT b.order_id, b.item AS rhs
          FROM trans_data b
          JOIN arule_0w USING (item)
          GROUP BY order_id, item) item_rhs
    ON (item_lhs.order_id = item_rhs.order_id
        AND item_lhs.lhs <> item_rhs.rhs)
) rules
GROUP BY lhs, rhs
HAVING COUNT(1) > 5
DISTRIBUTED BY (lhs, rhs)
;
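The self-join pattern can be tried the same way. This sketch uses toy data with minsup = 2 and, for brevity, drops the pruning join against arule_0w (everything in this tiny data set is frequent anyway):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE trans_data (order_id INTEGER, item INTEGER)")
rows = [(1, 2), (1, 5), (2, 2), (2, 5), (3, 2), (3, 5), (3, 7)]
cur.executemany("INSERT INTO trans_data VALUES (?, ?)", rows)

# Self-join each order's items against themselves to enumerate ordered
# (lhs, rhs) pairs, then count how often each pair co-occurs; minsup = 2
pairs = cur.execute("""
    SELECT a.item AS lhs, b.item AS rhs, COUNT(1) AS num_lhs_rhs
    FROM trans_data a
    JOIN trans_data b
      ON a.order_id = b.order_id AND a.item <> b.item
    GROUP BY lhs, rhs
    HAVING COUNT(1) > 2
    ORDER BY lhs, rhs
""").fetchall()
# pairs == [(2, 5, 3), (5, 2, 3)] : items 2 and 5 co-occur in all 3 orders,
# while pairs involving item 7 appear only once and are filtered out
```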

You might have noticed that trans_data is joined with the 0-way itemset table to filter out items that were not frequent at the 0-way step. This is exactly how the Apriori algorithm works: at each iteration, items that were not frequent in the previous run are filtered out. As we search deeper in the data, the candidate sets become smaller and smaller, because items that were not frequent at a previous iteration cannot appear in any frequent itemset at a later one.
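That pruning principle, the anti-monotone property of support, can be sketched in a few lines of Python; the transactions below are made up for illustration:

```python
from itertools import combinations

# Toy baskets, made up for illustration (each set is one order's items)
transactions = [{1, 2, 32}, {2, 3, 5}, {2, 3}, {1, 2}]

def support_count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Anti-monotone property: a superset is never more frequent than any of its
# subsets, so an itemset that misses minsup can be dropped from all later
# iterations without losing any frequent itemset.
items = {1, 2, 3, 5, 32}
for pair in combinations(items, 2):
    for single in pair:
        assert support_count(set(pair)) <= support_count({single})
```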

All right, the queries above show how to generate the 0-way and 1-way itemsets. The next chunk of queries shows how to generate rules from the itemsets, along with the measures used to evaluate those rules. Besides support and confidence, there are other measures too, such as cosine, kappa, Jaccard, Laplace, the Gini index, and the J-measure, to name a few (for more details, see the sample chapter 6 on the "Introduction to Data Mining" book website). Most of the interest measures are calculated from the following 2x2 contingency table. f11 is the number of transactions in which A and B appear together; f10 is the number of transactions that contain A but not B. The column sum f+1 is the support count of B, the row sum f1+ is the support count of A, and N is the total number of transactions.

     |  B  | !B  |
-----|-----|-----|-----
  A  | f11 | f10 | f1+
 !A  | f01 | f00 | f0+
-----|-----|-----|-----
     | f+1 | f+0 |  N
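Plugging made-up numbers into that table shows how the cells and the basic measures relate; the counts below are arbitrary, chosen only to make the arithmetic visible, and the comments map each cell to the column names used in the query further down:

```python
# Hypothetical counts for one rule A => B (arbitrary numbers for illustration)
N   = 1000   # num_basket: total transactions
f1p = 120    # f1+ = num_lhs: transactions containing A
fp1 = 200    # f+1 = num_rhs: transactions containing B
f11 = 60     # num_lhs_rhs: transactions containing both A and B

# The remaining cells follow from the margins
f10 = f1p - f11            # A but not B   (num_lhs_no_rhs)
f01 = fp1 - f11            # B but not A   (num_no_lhs_rhs)
f00 = N - f1p - fp1 + f11  # neither       (num_no_lhs_no_rhs)

support    = f11 / N                # 0.06
confidence = f11 / f1p              # 0.5
lift       = f11 * N / (f1p * fp1)  # 2.5 (> 1: A and B positively related)
```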

Next, I show how to generate the rules and calculate some of those interest measures in one query.

CREATE TABLE arule_1w_with_measure AS
SELECT a.*
     , (power(exp_lhs_rhs - num_lhs_rhs, 2)/exp_lhs_rhs
        + power(exp_lhs_no_rhs - num_lhs_no_rhs, 2)/exp_lhs_no_rhs
        + power(exp_no_lhs_rhs - num_no_lhs_rhs, 2)/exp_no_lhs_rhs
        + power(exp_no_lhs_no_rhs - num_no_lhs_no_rhs, 2)/exp_no_lhs_no_rhs) as Chi_Square
     , (num_lhs_rhs + 1)/(num_lhs + 2) as Laplace
     , (num_lhs*num_no_rhs)/(num_basket*num_lhs_no_rhs) as Conviction
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket) as Added_Value
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket)/(1 - num_rhs/num_basket) as Certainty_Factor
     , (num_lhs_rhs/num_basket * ln((num_basket*num_lhs_rhs)/(num_lhs*num_rhs))
        + num_lhs_no_rhs/num_basket * ln((num_basket*num_lhs_no_rhs)/(num_lhs*num_no_rhs))) as J_Measure
     , (num_lhs/num_basket*(power(num_lhs_rhs, 2) + power(num_lhs_no_rhs, 2))/power(num_lhs, 2)
        - power(num_rhs/num_basket, 2)
        + num_no_lhs/num_basket*(power(num_no_lhs_rhs, 2) + power(num_no_lhs_no_rhs, 2))/power(num_no_lhs, 2)
        - power(num_no_rhs/num_basket, 2)) as Gini_Index
     , num_lhs_rhs/(num_lhs + num_rhs - num_lhs_rhs) as Jaccard
     , (num_lhs_rhs/num_basket - num_rhs*num_lhs/power(num_basket, 2)) as Shapiro
     , num_lhs_rhs/sqrt(num_lhs*num_rhs) as Cosine
     , (num_lhs_rhs*num_basket - num_lhs*num_rhs)/sqrt(num_lhs*num_rhs*num_no_lhs*num_no_rhs) as Correlation
     , num_lhs_rhs*num_no_lhs_no_rhs/(num_lhs_no_rhs*num_no_lhs_rhs) as Odds_Ratio
FROM (SELECT lhs_rhs.lhs || ' => ' || lhs_rhs.rhs as rules
           , lhs_rhs.lhs
           , lhs_rhs.rhs
           , num_lhs_rhs*1.0 as num_lhs_rhs
           , num_basket*1.0 as num_basket
           , num_lhs*1.0 as num_lhs
           , num_rhs*1.0 as num_rhs
           , (num_basket - num_lhs)*1.0 as num_no_lhs
           , (num_basket - num_rhs)*1.0 as num_no_rhs
           , (num_lhs - num_lhs_rhs)*1.0 as num_lhs_no_rhs
           , (num_rhs - num_lhs_rhs)*1.0 as num_no_lhs_rhs
           , (num_basket - num_lhs - num_rhs + num_lhs_rhs)*1.0 as num_no_lhs_no_rhs
           , num_lhs*num_rhs*1.0/num_basket as exp_lhs_rhs
           , num_lhs*(1.0*num_basket - num_rhs)*1.0/num_basket as exp_lhs_no_rhs
           , (1.0*num_basket - num_lhs)*num_rhs/num_basket as exp_no_lhs_rhs
           , (1.0*num_basket - num_lhs)*(1.0*num_basket - num_rhs)/num_basket as exp_no_lhs_no_rhs
           , num_lhs_rhs*1.0/num_basket as support
           , num_lhs_rhs*1.0/num_lhs as confidence
           , num_lhs_rhs*num_basket*1.0/(num_lhs*num_rhs) as lift
      FROM arule_1w as lhs_rhs
      JOIN (SELECT item as lhs, count(1) as num_lhs
            FROM trans_data
            GROUP BY lhs) sum_lhs
        ON lhs_rhs.lhs = sum_lhs.lhs
      JOIN (SELECT item as rhs, COUNT(1) as num_rhs
            FROM trans_data
            GROUP BY rhs) sum_rhs
        ON lhs_rhs.rhs = sum_rhs.rhs
      CROSS JOIN (SELECT COUNT(1) as num_basket
                  FROM (SELECT order_id FROM trans_data GROUP BY order_id) foo) total
     ) a
WHERE lift > 1 -- only the good rules
  AND confidence > .0001
DISTRIBUTED BY (lhs, rhs)
;

Notice that the WHERE clause at the end ensures we only generate rules with confidence greater than a threshold of .0001 and with lift > 1, meaning positively related rules. One trick I used is to multiply the integer counts by 1.0, so that later on, when you calculate the measures, SQL performs floating-point arithmetic instead of truncating integer division (and the large products won't overflow the integer type).
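A quick illustration of that trick, using SQLite from Python here (SQLite truncates integer division much like Postgres-family databases do):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Pure integer division truncates; multiplying by 1.0 first forces float math
int_div, float_div = cur.execute("SELECT 3 / 5, 3 * 1.0 / 5").fetchone()
# int_div == 0 (truncated), float_div == 0.6
```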
