Association Rules Mining

 

 

 



 

Done by KURAKULA AJAY BABU

Student ID: 13472219

 

 

 

                         

 

Contents:

1. Introduction
2. What is Association Rules Mining?
3. Apriori Algorithm
4. Algorithms
5. Types of Association Rules
   5.1. Multidimensional Association Rules
   5.2. Quantitative Association Rules
   5.3. Sequential Pattern Mining
6. Applications
7. Conclusion
8. References

 

 

 

Abstract:

Association rules mining is one of the most widely used techniques in data mining. It searches for interesting relationships among items in a given data set, especially in transactional databases. This report investigates what association rules mining is, its application areas, its variants, and related topics. The problem of discovering association rules has received considerable research attention, and several fast algorithms for mining association rules have been developed. In practice, users are often interested in only a subset of the association rules; for example, they may only want rules that contain a specific item, or rules that contain the children of a specific item in a hierarchy. While such constraints can be applied as a post-processing step, integrating them into the mining algorithm can dramatically reduce the execution time.

 

 

 

 

 

 

1. Introduction

 

 

Data mining, also called knowledge discovery in databases, emerged as a new area of database research. It is used to find interesting rules in large sets of data.

Given a set of transactions, where each transaction is a set of items, an association rule is an expression X => Y, where X and Y are sets of items. Its intuitive meaning is that transactions which contain the items in X tend to also contain the items in Y. For example, "98% of customers who buy tires and auto accessories also buy some automotive services"; the 98% here is called the confidence of the rule. The percentage of transactions that contain both X and Y is called the support of the rule X => Y. The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence. Applications linked with association rules include attached mailing, catalog design, cross-marketing, store layout, loss-leader analysis, and customer segmentation based on buying patterns.
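To make these definitions concrete, here is a minimal Python sketch of support and confidence; the transactions and item names are hypothetical illustration data, not figures from the source.

# Support of X => Y: fraction of transactions containing both X and Y.
# Confidence of X => Y: support(X union Y) / support(X).
transactions = [
    {"tires", "auto accessories", "automotive services"},
    {"tires", "auto accessories", "automotive services"},
    {"tires", "auto accessories"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # How often Y appears in transactions that already contain X.
    return support(X | Y, transactions) / support(X, transactions)

X = {"tires", "auto accessories"}
Y = {"automotive services"}
print(support(X | Y, transactions))    # 0.5   (support of the rule)
print(confidence(X, Y, transactions))  # 0.66… (confidence of the rule)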

 

In most cases, taxonomies (is-a hierarchies) over the items are available. As a simple example under the taxonomy below, we may find that people who buy outerwear tend to buy hiking boots, because many people bought ski pants together with hiking boots and jackets together with hiking boots. "Outerwear => Hiking Boots" may then be a valid rule, even if "Jackets => Hiking Boots" and "Clothes => Hiking Boots" are not: the former may not have minimum support, and the latter may not have minimum confidence.

 

                         
                Clothes                        Footwear
               /       \                      /        \
        Outerwear      Shirts             Shoes    Hiking Boots
        /        \
   Jackets    Ski Pants

 

 

The taxonomies mostly apply at the leaf-level nodes rather than at the parent nodes. However, finding rules across different levels of a taxonomy is valuable because:

1.) Taxonomies can be used to prune uninteresting or redundant rules.

2.) Rules at lower levels may not have minimum support. This does not mean the taxonomy is limited to leaf-level comparisons; on the contrary, we cannot find many association rules if we restrict ourselves to the leaf level. Taking a supermarket into consideration, hundreds of products are available there, and discounts are offered on pairs of items precisely because many people buy those items together.
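One common way to put such a taxonomy to work (a sketch; the parent map and helper names are invented here to match the figure above) is to extend each transaction with the ancestors of its items, so that rules over interior nodes such as Outerwear can be counted like rules over leaf items.

# Taxonomy from the figure, encoded as a child -> parent map.
parent = {
    "Jackets": "Outerwear", "Ski Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirts": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def ancestors(item):
    # All ancestors of `item`, walking up the taxonomy.
    found = []
    while item in parent:
        item = parent[item]
        found.append(item)
    return found

def extend(transaction):
    # Add every ancestor of every item, so interior-node rules get support.
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

print(extend({"Ski Pants", "Hiking Boots"}))
# {'Ski Pants', 'Outerwear', 'Clothes', 'Hiking Boots', 'Footwear'}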

 

2. What is Association Rules Mining?

Association rule mining is a technique used to find frequent patterns, correlations, associations, or causal structures in data sets held in different kinds of databases, such as relational databases, transactional databases, and other forms of data repositories. Given a set of transactions, association rule mining aims to find the rules that enable us to predict the occurrence of a specific item based on the occurrences of the other items in the transaction.

Association rule mining is the data mining process of finding the rules that may govern associations and causal relations between sets of items. In a given transaction with multiple items, it tries to find the rules that govern how or why such items are often bought together. For example, peanut butter and jelly are often bought together because a lot of people like to make PB&J sandwiches. More surprisingly, diapers and beer are bought in combination because, as it turns out, dads are often tasked with buying the groceries while the moms stay with the baby.

The main applications of association rule mining:

•   Basket data analysis – analysing the co-occurrence of purchased items in a single basket or single purchase.

•   Cross-marketing – working with other organizations that complement your own, not competitors. For example, vehicle dealerships and manufacturers run cross-marketing campaigns with oil and gas companies for obvious reasons.

•   Catalog design – the items in a business' catalog are often chosen to complement each other, so that shopping for one item will lead to buying another; these items are often complements or closely related. (techopedia, n.d.)

 

3. Apriori Algorithm

Mining for associations among items in a large database of sales transactions is an important database mining function. For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented in the association rule below:

    Keyboard => Mouse  [support = 6%, confidence = 70%]

 

•   Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated or tested!

•   Method (sketched in code below):

    –   Initially, scan the DB once to get the frequent 1-itemsets.
    –   Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
    –   Test the candidates against the DB.
    –   Terminate when no frequent or candidate set can be generated.
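The following is a minimal sketch of this level-wise method in Python, assuming transactions are given as sets of items and min_sup is an absolute count; it illustrates the four steps rather than an optimized implementation. Run on the toy database of the example below, it finds exactly the frequent itemsets derived there.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # Step 1: scan the DB once for the frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_sup}
    frequent, k = set(L), 1
    while L:
        # Step 2: generate length-(k+1) candidates from length-k frequent
        # itemsets, pruning any candidate with an infrequent k-subset.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Step 3: test the candidates against the DB.
        L = {c for c in candidates
             if sum(c <= t for t in transactions) >= min_sup}
        # Step 4: the loop terminates when no candidate survives.
        frequent |= L
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(map(sorted, apriori(db, min_sup=2)), key=lambda s: (len(s), s)))
# The L1, L2 and L3 of the worked example below:
# [['A'], ['B'], ['C'], ['E'], ['A','C'], ['B','C'], ['B','E'], ['C','E'], ['B','C','E']]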

 

Example (minimum support = 2):

Transactional database:

    TID    Items
    10     A, C, D
    20     B, C, E
    30     A, B, C, E
    40     B, E

1st scan – candidate 1-itemsets C1 with their support counts:

    Itemset    sup
    {A}        2
    {B}        3
    {C}        3
    {D}        1
    {E}        3

Frequent 1-itemsets L1 ({D} is pruned for lack of support):

    Itemset    sup
    {A}        2
    {B}        3
    {C}        3
    {E}        3

Candidate 2-itemsets C2, generated from L1:

    {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan – support counts for C2:

    Itemset    sup
    {A, B}     1
    {A, C}     2
    {A, E}     1
    {B, C}     2
    {B, E}     3
    {C, E}     2

Frequent 2-itemsets L2:

    Itemset    sup
    {A, C}     2
    {B, C}     2
    {B, E}     3
    {C, E}     2

Candidate 3-itemset C3, generated from L2:

    Itemset
    {B, C, E}

3rd scan – frequent 3-itemset L3:

    Itemset      sup
    {B, C, E}    2

DETAILS OF APRIORI

•   Generate candidates:

    –   Step 1: self-join Lk with itself.
    –   Step 2: prune candidates with infrequent subsets.

•   Count the supports of the candidates.

•   Example of candidate generation (sketched in code below):

    –   L3 = {abc, abd, acd, ace, bcd}
    –   Self-joining L3 * L3:
        •   abcd from abc and abd
        •   acde from acd and ace
    –   Pruning: acde is removed because ade is not in L3
    –   C4 = {abcd}
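Here is a sketch of these two steps in Python, keeping itemsets as sorted tuples so the self-join can match on the first k-1 items; the generate_candidates name is invented for illustration, and the example reproduces the L3 above.

from itertools import combinations

def generate_candidates(Lk):
    k = len(next(iter(Lk)))
    # Step 1: self-join -- merge two k-itemsets sharing their first k-1 items.
    joined = {a + (b[-1],)
              for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Step 2: prune -- drop any candidate with an infrequent k-subset.
    return {c for c in joined if all(s in Lk for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(generate_candidates(L3))
# {('a', 'b', 'c', 'd')} -- acde was joined but pruned, since ade is not in L3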

                                                                                                                                                          

BOTTLENECK OF APRIORI

•   Challenges:

    –   Multiple scans of the transaction database
    –   Huge number of candidates
    –   Tedious workload of support counting for candidates

•   Improving Apriori – general ideas:

    –   Reduce the number of transaction-database scans
    –   Shrink the number of candidates
    –   Facilitate the support counting of candidates

Possible ways of improving the performance of the algorithms:

•   Implementation techniques:

    –   Use of good data structures
    –   Fast implementation of basic operations

•   Algorithm improvement:

    –   Finding algorithms that are more efficient

•   Use of parallel processing

•   Sampling the transaction databases

Interactive Discovery

In ARM, the user plays an important role in the process:

•   The user is responsible for setting the initial minimum support and confidence thresholds.

•   During the discovery, the user may decide to further fine-tune the thresholds.

•   The user can specify which items are to appear on either or both sides of the resulting rules, for different purposes (a post-filter sketch follows this list).

    (e.g.)  {X} -> {nappies}  or  {nappies} -> {X}  or  {Beer} -> {nappies}

•   The user can exploit a category hierarchy of some kind among the items.

    (e.g.)  Instead of "Bread -> Coke", "Bakery products" -> "Soft drinks"
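As a sketch of such item constraints applied as a post-filter (the rule tuples and the with_item helper are hypothetical illustration data; integrating constraints into the miner itself is usually faster, as the abstract notes):

rules = [
    ({"bread"}, {"coke"}, 0.06, 0.70),
    ({"beer"}, {"nappies"}, 0.04, 0.65),
    ({"nappies"}, {"beer"}, 0.04, 0.55),
]

def with_item(rules, item, side="both"):
    # Keep rules where `item` appears on the requested side.
    def keep(rule):
        antecedent, consequent, _sup, _conf = rule
        if side == "antecedent":
            return item in antecedent
        if side == "consequent":
            return item in consequent
        return item in antecedent or item in consequent
    return [r for r in rules if keep(r)]

print(with_item(rules, "nappies", side="consequent"))
# [({'beer'}, {'nappies'}, 0.04, 0.65)]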

 

Visualization of Association: Plane Graph

[Figure: Visualization of Association Rules (SGI/MineSet 3.0)]

 

Measures of Interestingness

•   play basketball => eat cereal [40%, 66.7%] is misleading:

    –   the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

•   play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

•   Another measure of interestingness: lift.

                  Basketball    Not basketball    Sum (row)
    Cereal        2000          1750              3750
    Not cereal    1000          250               1250
    Sum (col.)    3000          2000              5000
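Working through the table confirms this, using lift(A => B) = P(A and B) / (P(A) * P(B)); a lift below 1 indicates negative correlation.

N = 5000
p_basketball = 3000 / N            # 0.60
p_cereal = 3750 / N                # 0.75
p_not_cereal = 1250 / N            # 0.25

lift_cereal = (2000 / N) / (p_basketball * p_cereal)          # 0.40 / 0.45
lift_not_cereal = (1000 / N) / (p_basketball * p_not_cereal)  # 0.20 / 0.15
print(round(lift_cereal, 2), round(lift_not_cereal, 2))
# 0.89 1.33 -- basketball is negatively correlated with eating cereal and
# positively correlated with not eating cereal, matching the discussion above.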