
Wednesday, January 12, 2011

Market Basket Analysis using SQL

Essentially, association rule mining is all about counting, so it fits the database environment very well, especially a big, scalable database. With a few GROUP BYs and JOINs, the job can be easily accomplished in SQL.

Here I use the Apriori algorithm to generate all 0-way (i.e., {} => item B) and 1-way (i.e., item A => item B) rules. The code first generates the frequent itemsets that contain a single item and have a support count greater than minsup. It then generates all 1-way rules from the 2-item itemsets, keeping those with support greater than minsup and confidence greater than some minconf. Rules beyond 1-way are more complicated and require iterating over the data until no more frequent itemsets are generated. The following queries are just meant to show how market basket analysis works and what kinds of metrics are available to evaluate the rules.

/* Assuming your transaction data looks like these (order_id, item) pairs
order 1, item 1
order 1, item 2
order 1, item 32
order 2, item 3
order 2, item 2
order 2, item 5
order 3, item 3
...
and it's already stored in the database with table name trans_data.
*/

-- Generating 0-way frequent itemsets using minsup=5
CREATE TABLE arule_0w AS
SELECT item, SUM(1) AS cnt
FROM trans_data
GROUP BY item
HAVING SUM(1) > 5
DISTRIBUTED BY (item)
;

-- Generate 1-way frequent itemsets
CREATE TABLE arule_1w AS
SELECT lhs, rhs, COUNT(1) AS num_lhs_rhs
FROM (
    -- pair up every two distinct frequent items that appear in the same order;
    -- the join to arule_0w prunes items that are not frequent on their own
    SELECT item_lhs.order_id, lhs, rhs
    FROM (SELECT a.order_id, a.item AS lhs
          FROM trans_data a
          JOIN arule_0w USING(item)
          GROUP BY order_id, item) item_lhs
    JOIN (SELECT b.order_id, b.item AS rhs
          FROM trans_data b
          JOIN arule_0w USING(item)
          GROUP BY order_id, item) item_rhs
    ON (item_lhs.order_id = item_rhs.order_id
        AND item_lhs.lhs <> item_rhs.rhs)
) rules
GROUP BY lhs, rhs
HAVING COUNT(1) > 5
DISTRIBUTED BY (lhs, rhs)
;

You might have noticed that trans_data is joined with the 0-way itemset table to filter out items that are not frequent on their own. This is exactly how the Apriori algorithm works: at each iteration, items that were not frequent in the previous run are filtered out. As we search deeper into the data, the candidate sets get smaller and smaller, because an itemset can only be frequent if all of its subsets were frequent in the earlier iterations.
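To give a flavor of the next iteration, here is a sketch of how 2-item antecedents could be counted (arule_2w is a hypothetical table name; a full Apriori implementation would prune on all three item pairs, not just one, and then split each frequent triple into rules):

-- Sketch: extend frequent pairs to frequent triples, minsup=5 as before
CREATE TABLE arule_2w AS
SELECT t1.item AS item1, t2.item AS item2, t3.item AS item3, COUNT(1) AS cnt
FROM (SELECT order_id, item FROM trans_data
      JOIN arule_0w USING(item) GROUP BY order_id, item) t1
JOIN (SELECT order_id, item FROM trans_data
      JOIN arule_0w USING(item) GROUP BY order_id, item) t2
ON (t1.order_id = t2.order_id AND t1.item < t2.item)
JOIN (SELECT order_id, item FROM trans_data
      JOIN arule_0w USING(item) GROUP BY order_id, item) t3
ON (t2.order_id = t3.order_id AND t2.item < t3.item)
-- Apriori pruning: the pair (item1, item2) must itself be frequent
JOIN arule_1w p ON (p.lhs = t1.item AND p.rhs = t2.item)
GROUP BY t1.item, t2.item, t3.item
HAVING COUNT(1) > 5
DISTRIBUTED BY (item1, item2, item3)
;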

All right, the queries above show how to generate the 0-way and 1-way itemsets. The next chunk of queries shows how to generate rules from the itemsets, along with the measures used to evaluate those rules. Besides support and confidence, there are other measures too: cosine, Kappa, Jaccard, Laplace, Gini index, and J-measure, to name a few (for more details, see the sample chapter 6 on the "Introduction to Data Mining" book website). Most of these interest measures are calculated from the following 2x2 contingency table. f11 is the number of transactions in which A and B appear together, and f10 is the number of transactions that contain A but not B. The column sum f+1 is the support count for B, the row sum f1+ is the support count for A, and N is the total number of transactions.

     |  B  | !B  |
-----|-----|-----|-----
  A  | f11 | f10 | f1+
 !A  | f01 | f00 | f0+
-----|-----|-----|-----
     | f+1 | f+0 |  N
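In terms of these counts, support(A => B) = f11/N, confidence(A => B) = f11/f1+, and lift(A => B) = N*f11/(f1+ * f+1). Every measure computed in the query below is just some combination of these cells and margins.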

Next, I show how to generate the rules and calculate some of those interest measures in a single query.

CREATE TABLE arule_1w_with_measure AS
SELECT a.*
, ( power(exp_lhs_rhs - num_lhs_rhs, 2)/exp_lhs_rhs
  + power(exp_lhs_no_rhs - num_lhs_no_rhs, 2)/exp_lhs_no_rhs
  + power(exp_no_lhs_rhs - num_no_lhs_rhs, 2)/exp_no_lhs_rhs
  + power(exp_no_lhs_no_rhs - num_no_lhs_no_rhs, 2)/exp_no_lhs_no_rhs ) as Chi_Square
, (num_lhs_rhs+1)/(num_lhs+2) as Laplace
, (num_lhs*num_no_rhs)/(num_basket * num_lhs_no_rhs) as Conviction
, (num_lhs_rhs/num_lhs - num_rhs/num_basket) as Added_Value
, (num_lhs_rhs/num_lhs - num_rhs/num_basket)/(1 - num_rhs/num_basket) as Certainty_Factor
, ( num_lhs_rhs/num_basket * ln((num_basket*num_lhs_rhs)/(num_lhs*num_rhs))
  + num_lhs_no_rhs/num_basket * ln((num_basket*num_lhs_no_rhs)/(num_lhs*num_no_rhs)) ) as J_Measure
, ( num_lhs/num_basket * (power(num_lhs_rhs, 2) + power(num_lhs_no_rhs, 2))/power(num_lhs, 2)
  - power(num_rhs/num_basket, 2)
  + num_no_lhs/num_basket * (power(num_no_lhs_rhs, 2) + power(num_no_lhs_no_rhs, 2))/power(num_no_lhs, 2)
  - power(num_no_rhs/num_basket, 2) ) as Gini_Index
, num_lhs_rhs/(num_lhs + num_rhs - num_lhs_rhs) as Jaccard
, (num_lhs_rhs/num_basket - num_rhs*num_lhs/power(num_basket, 2)) as Shapiro
, num_lhs_rhs/sqrt(num_lhs*num_rhs) as Cosine
, (num_lhs_rhs*num_basket - num_lhs*num_rhs)/sqrt(num_lhs*num_rhs*num_no_lhs*num_no_rhs) as Correlation
, num_lhs_rhs*num_no_lhs_no_rhs/(num_lhs_no_rhs*num_no_lhs_rhs) as Odds_Ratio
FROM (SELECT lhs_rhs.lhs||' => '|| lhs_rhs.rhs as rules
, lhs_rhs.lhs
, lhs_rhs.rhs
, num_lhs_rhs*1.0 as num_lhs_rhs
, num_basket*1.0 as num_basket
, num_lhs*1.0 as num_lhs
, num_rhs*1.0 as num_rhs
, (num_basket - num_lhs)*1.0 as num_no_lhs
, (num_basket - num_rhs)*1.0 as num_no_rhs
, (num_lhs - num_lhs_rhs)*1.0 as num_lhs_no_rhs
, (num_rhs - num_lhs_rhs)*1.0 as num_no_lhs_rhs
, (num_basket - num_lhs - num_rhs + num_lhs_rhs)*1.0 as num_no_lhs_no_rhs
, num_lhs * num_rhs * 1.0 /num_basket as exp_lhs_rhs
, num_lhs * (1.0 * num_basket - num_rhs) * 1.0 /num_basket as exp_lhs_no_rhs
, (1.0 * num_basket - num_lhs) * num_rhs /num_basket as exp_no_lhs_rhs
, (1.0 * num_basket - num_lhs) * (1.0 * num_basket - num_rhs) /num_basket as exp_no_lhs_no_rhs
, num_lhs_rhs * 1.0 /num_basket as support
, num_lhs_rhs * 1.0 /num_lhs as confidence
, num_lhs_rhs * num_basket * 1.0/(num_lhs * num_rhs) as lift
FROM arule_1w as lhs_rhs
JOIN (SELECT item as lhs
, count(1) as num_lhs
FROM trans_data
GROUP BY lhs) sum_lhs
ON lhs_rhs.lhs=sum_lhs.lhs
JOIN (SELECT item as rhs
, COUNT(1) as num_rhs
FROM trans_data
GROUP BY rhs) sum_rhs
ON lhs_rhs.rhs=sum_rhs.rhs
CROSS JOIN (SELECT COUNT(1) as num_basket
FROM (SELECT order_id FROM trans_data GROUP BY order_id) foo) total
) a
WHERE lift > 1 -- only the good rules
AND confidence > .0001
DISTRIBUTED BY (lhs, rhs)
;

Notice the WHERE clause at the end ensures that we only keep rules with confidence greater than a threshold of .0001 and with lift > 1, meaning positively related rules. One trick I used is to multiply the integer counts by 1.0, so that later on, when the measures are calculated, SQL does floating-point arithmetic instead of truncating integer division.
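Once the table is built, picking out interesting rules is a simple query (the thresholds here are arbitrary):

-- Top 20 rules by lift among the reasonably confident ones
SELECT rules, support, confidence, lift
FROM arule_1w_with_measure
WHERE confidence > 0.1
ORDER BY lift DESC
LIMIT 20;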

Thursday, October 21, 2010

Thinking about self-join? Well, think again!

This week, the Greenplum people came on-site and talked about a few tricks that were very new to me. The ones that made the biggest impression are

(1) about function overloading

So PostgreSQL allows you to have functions with exactly the same name but different inputs or outputs, even a different number of inputs. This is called function overloading. It's a feature of various programming languages like C++ and Java, and since Greenplum is built on PostgreSQL, it inherits the feature. This is exactly WHY you have to specify the input types when dropping a function in Greenplum!
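For example (add_tax is a made-up function), these two can coexist, so the DROP has to spell out which one you mean:

-- two functions, same name, different signatures
CREATE FUNCTION add_tax(numeric) RETURNS numeric
AS 'SELECT $1 * 1.08' LANGUAGE SQL;

CREATE FUNCTION add_tax(numeric, numeric) RETURNS numeric
AS 'SELECT $1 * (1 + $2)' LANGUAGE SQL;

-- ambiguous without the argument types; this is the required form:
DROP FUNCTION add_tax(numeric);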

I know, this can be very annoying! They should really have a switch, so you can choose whichever behavior you want: follow the overloading rule or not. If you opt out of overloading and try to give a new function the same name as an existing one, Greenplum would then throw an error at you. Overloading is fine in a language like C++, but Greenplum could have given users the option.

(2) about "STRICT"

When writing user-defined functions, adding the STRICT option at the end tells PostgreSQL to return NULL automatically whenever any input is NULL. The function body will not even run with NULL inputs; it returns NULL immediately. This can be a nice option to speed things up.
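A minimal sketch (safe_ratio is a made-up name):

CREATE FUNCTION safe_ratio(numeric, numeric) RETURNS numeric
AS 'SELECT $1 / $2' LANGUAGE SQL STRICT;

SELECT safe_ratio(10, NULL); -- returns NULL immediately, body never runs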

(3) about windowing function

Window functions are not a new concept to me. But the instructor's comment that any self-join can be replaced by a window function is quite new, and kind of surprising. Of course, I believe it's not hard, it just takes time to get used to, particularly when one has been so used to table self-joins.

A window function lets you access a set of rows associated with the current row, and it can perform various calculations over them: sum, rank, average, max/min, etc. PostgreSQL has a very good tutorial on window functions. There are two things I like particularly: running totals and named windows.

So now the running-total revenue calculation for campaigns is

SELECT campaign_id
, revenue
, SUM(revenue) OVER (w) AS running_total
, SUM(revenue) OVER () AS overall_total
, (SUM(revenue) OVER (w)) / (SUM(revenue) OVER ()) AS pct
FROM rev_table
WINDOW w AS (ORDER BY revenue DESC);

Very nice and clean, right?
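For contrast, here is the same running total written as a self-join, the old way (a sketch; it assumes campaign_id is unique, and it scans the table twice):

-- for each row, sum the revenue of every row at least as large
SELECT a.campaign_id
, a.revenue
, SUM(b.revenue) AS running_total
FROM rev_table a
JOIN rev_table b ON (b.revenue >= a.revenue)
GROUP BY a.campaign_id, a.revenue;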

Last but not least, first_value, last_value, or nth_value can be extremely useful sometimes. For example, I'd like to know the last campaign each user clicked on in the past week.

SELECT DISTINCT enc_email
, FIRST_VALUE(campaign_id) OVER (PARTITION BY enc_email ORDER BY click_date DESC) AS last_campaign
FROM user_history;

The "limit 1" at the end is very important. Without it, the end result will has as many rows as the original table filling with exact same rows.

Friday, July 30, 2010

Exists and Correlated Subquery are tricky

Today is the second day of our SQL training. The material was too trivial yesterday, but it's starting to get interesting now.

A subquery is a SELECT statement nested inside another SELECT query; that nesting is where SQL gets its structure. The nested query is usually called the "subquery", "inner query", or "lower query", as opposed to the "main query", "outer query", or "upper query".

There is a lot going on with subqueries. They are flexible; however, many subqueries can be replaced by joins, which are usually more efficient (see the join rewrite at the end of this post).

A tricky example I found during the class is this one:


SELECT empno, bonus, comm
FROM emp
WHERE EXISTS (SELECT 'x' FROM dept WHERE mgrno=EMP.empno);


This query is designed to return the employee number, bonus, and commission for any employee who is listed as a manager in the dept table.

There are 4 things that drew my attention:
(1) The emp and dept tables are correlated in the sub-select query.
(2) EXISTS is used. This is an operator I barely use.
(3) A constant 'x' is selected. It can be replaced with any constant, like 1 or 'y'. It simply makes no difference, since EXISTS only checks whether the subquery returns any row at all, serving as TRUE/FALSE.
(4) The way those queries become correlated is quite weird: you just reference the emp table's column directly in the subquery. However, if you try to run the subquery (SELECT 'x' FROM dept WHERE mgrno=EMP.empno) by itself, you will get an error, because EMP is not in scope.

One thing worth pointing out: if the two tables are not correlated in the subquery, EXISTS returns TRUE for every row (as long as dept is not empty), so you end up selecting every row of the emp table.
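As promised, here is the same query rewritten with a join instead of the correlated EXISTS. The DISTINCT guards against duplicate rows in case an employee manages more than one department:

SELECT DISTINCT e.empno, e.bonus, e.comm
FROM emp e
JOIN dept d ON (d.mgrno = e.empno);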