Friday, January 14, 2011

Market Basket Analysis using R

R has a package for association rule mining called "arules". It implements the Apriori and Eclat algorithms by calling C code in the back end.

I found a small transaction dataset (2k rows) online with two columns, 'order_id' and 'item'. The first few rows look like this:

/*
10248 item042
10248 item072
10249 item014
10249 item051
10250 item041
10250 item051
...
*/

Then I ran the following R code to analyze the data:

## call the 'arules' library [if you have not installed it, run "install.packages('arules')" in R.]
library(arules)

## import the data
# read.transactions is a very nice function that reads
# this type of transaction data in very quickly.
# I have seen others read the data with read.table
# and then transform it, which is more manipulation than needed.
# The "format" argument tells R whether each line holds just one item
# (single) or multiple items (basket) separated by "sep".
# For single format, "cols" is a numeric vector of length 2 giving
# the numbers of the columns with the transaction ids and item ids.
# For basket format, it can be a numeric scalar
# giving the number of the column with the transaction ids.
trans_data <- read.transactions(file='trans_data.csv', rm.duplicates=FALSE, format='single', sep=',', cols=c(1,2))
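To make the single format concrete, here is a minimal sketch that writes the six sample rows shown earlier to a temporary file and reads them back. The comma-separated layout is an assumption based on the sep=',' argument above, and the names sample_rows, tmp_file and small_trans are made up for this illustration.

```r
library(arules)

# the six sample rows from the post, written as "order_id,item"
# (comma separation is assumed from the sep=',' argument)
sample_rows <- c(
  "10248,item042",
  "10248,item072",
  "10249,item014",
  "10249,item051",
  "10250,item041",
  "10250,item051"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(sample_rows, tmp_file)

# one line per (order, item) pair, so format is 'single'
small_trans <- read.transactions(file = tmp_file, format = "single",
                                 sep = ",", cols = c(1, 2))

length(small_trans)   # number of baskets (distinct order ids)
size(small_trans)     # number of items in each basket
inspect(small_trans)  # the items grouped by order id
```

Each distinct order id becomes one basket, so the six lines collapse into three transactions of two items each.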

# take a peek at what's in each basket
inspect(trans_data)

## after reading the data in, there are some basic
# functions you can use on it

class(trans_data)
summary(trans_data)

## for each item, you can count the number of transactions it appears in
# or its share of the total number of transactions
itemFrequency(trans_data, type='absolute')
itemFrequency(trans_data, type='relative')
## relative frequency = item_freq/total_num_of_orders
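As a cross-check on what the two types mean, the same numbers can be worked out by hand in base R. This is only a sketch on the six sample rows shown earlier; the data frame toy and its column names are made up for the illustration.

```r
# the six sample rows from the post as a plain data frame
toy <- data.frame(
  order_id = c(10248, 10248, 10249, 10249, 10250, 10250),
  item = c("item042", "item072", "item014", "item051", "item041", "item051"),
  stringsAsFactors = FALSE
)

# drop repeated (order, item) pairs so each item counts once per basket
toy <- unique(toy)

n_orders <- length(unique(toy$order_id))  # total number of baskets
abs_freq <- table(toy$item)               # counterpart of type='absolute'
rel_freq <- abs_freq / n_orders           # counterpart of type='relative'

# item051 appears in 2 of the 3 baskets
abs_freq[["item051"]]
rel_freq[["item051"]]
```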

## there are some visualization functions one can use
# to "see" the data
image(trans_data)
itemFrequencyPlot(trans_data)

## apply the Apriori algorithm to mine association rules
# there are a few parameters that you can tweak,
# see 'APparameter-class' in the package manual.
# in this example, I specifically tell R I want 1-way rules
# (one item on each side, since maxlen=2 caps a rule at two items)
# that have a minimum support of .002 and a minimum confidence of .2.
# note that minlen=1 also allows rules with an empty left-hand side
# when a single item is frequent enough to meet the confidence threshold

rules <- apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2, target = "rules", maxlen=2, minlen=1))

parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target
0.2 0.1 1 none FALSE TRUE 0.002 1 2 rules
ext
FALSE

algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

Warning in apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2, :
You chose a very low absolute support count of 1. You might run out of memory! Increase minimum support.

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[77 item(s), 830 transaction(s)] done [0.01s].
sorting and recoding items ... [77 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [30 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
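The warning follows from simple arithmetic: with 830 transactions, a relative support of 0.002 amounts to fewer than two transactions. A rough sketch of the numbers is below. The variable names are mine, and the truncation to a whole count is my reading of the "1" in the warning, not something checked against the package source.

```r
n_trans  <- 830    # from the "set transactions" line of the log
min_supp <- 0.002  # the supp value passed to apriori()

min_supp * n_trans         # 1.66 transactions
trunc(min_supp * n_trans)  # 1, the count quoted in the warning

# relative support needed for a rule to cover at least 5 baskets
# (5 is an arbitrary example threshold, not a recommendation)
5 / n_trans
```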

## check rules
inspect(rules[30])
# lhs rhs support confidence lift
# 1 {item021} => {item061} 0.009638554 0.2051282 7.094017


## extract measurements or other stuff you want
str(rules)
attributes(rules)$quality

The same rule showed up in the query mining too. The support, confidence and lift metrics match between R and SQL.

-[ RECORD 1 ]-----+-------------------------
rules | item021 => item061
lhs | item021
rhs | item061
num_lhs_rhs | 8.0
num_basket | 830.0
num_lhs | 39.0
num_rhs | 24.0
num_no_lhs | 791.0
num_no_rhs | 806.0
num_lhs_no_rhs | 31.0
num_no_lhs_rhs | 16.0
num_no_lhs_no_rhs | 775.0
support | 0.0096385542168
confidence | 0.2051282051282
lift | 7.0940170940170
chi_square | 45.253247622206
laplace | 0.2195121951219
conviction | 1.2216867469879
added_value | 0.1762125424776
certainty_factor | 0.1814595660749
j_measure | 0.0114057917394
gini_index | 0.0030619052969
jaccard | 0.1454545454545
shapiro | 0.0082798664537
cosine | 0.2614881801842
correlation | 0.2334994327337
odds_ratio | 12.500000000000
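Several of these measures can be reproduced from the four basic counts in the record. The sketch below uses the usual textbook definitions, which I am assuming are the ones behind the SQL output; the short variable names are mine.

```r
# counts copied from the record above
n    <- 830  # num_basket
n_ab <- 8    # num_lhs_rhs: baskets with both item021 and item061
n_a  <- 39   # num_lhs: baskets with item021
n_b  <- 24   # num_rhs: baskets with item061

support    <- n_ab / n
confidence <- n_ab / n_a
lift       <- confidence / (n_b / n)

# a few of the other measures, using their usual definitions
conviction  <- (1 - n_b / n) / (1 - confidence)
jaccard     <- n_ab / (n_a + n_b - n_ab)
cosine      <- n_ab / sqrt(n_a * n_b)
added_value <- confidence - n_b / n
odds_ratio  <- (n_ab * (n - n_a - n_b + n_ab)) /
               ((n_a - n_ab) * (n_b - n_ab))

round(c(support = support, confidence = confidence, lift = lift,
        conviction = conviction, jaccard = jaccard, cosine = cosine,
        added_value = added_value, odds_ratio = odds_ratio), 7)
```

The rounded values can be compared line by line with the record, and the first three with the inspect(rules[30]) output from R.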
