Chapter 3 Flashcards

1
Q

What is association rule mining?

A

An unsupervised data mining technique that finds interesting associations and/or correlation relationships among a large set of data items.

2
Q

What do association rules show?

A

Attribute-value conditions that occur frequently together in a given dataset

3
Q

What is a use of association rule mining?

A

Market basket analysis

4
Q

What is a transaction?

A

A single visit

Each transaction is associated with a purchase date and items purchased.

5
Q

What does basket data refer to?

A

Transaction data

6
Q

What does the market-basket problem assume?

A

We have some large number of items (eg all items available in store). Customers fill their baskets with some subset of the items and we get to know what items people buy together, even if we don’t know who they are.

7
Q

What can retail organisations gain by analysing basket data?

A

They can extract information to drive their market strategy. They would want to put effort into items that are frequently purchased.

  • Targeted marketing
  • Plan store layouts - use the information to position items and control the way a typical customer traverses the store. May place items that are commonly purchased together in close proximity to encourage the sale of such items together, or may place them at opposite ends of the store to entice customers to pick up other items on the way.
  • Help retailers plan which items to put on sale at reduced prices
8
Q

What is a boolean vector in market basket analysis?

A

Each basket can be replaced by a Boolean vector if we give each item a Boolean variable representing the presence or absence of that item in the basket.

The boolean vector can be analysed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
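
A minimal sketch of this encoding, assuming a small item catalogue and three baskets invented purely for illustration (the names `items`, `baskets` and `to_boolean_vector` are hypothetical):

```python
# Hypothetical catalogue and baskets, invented for illustration only.
items = ["bread", "butter", "milk", "beer"]
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer"},
]

def to_boolean_vector(basket, items):
    """Return one Boolean per catalogue item: True if the item is in the basket."""
    return [item in basket for item in items]

vectors = [to_boolean_vector(b, items) for b in baskets]
for v in vectors:
    print(v)
```

Each position in a vector corresponds to one item in the catalogue, so columns that are frequently True together point to items that are frequently bought together.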

9
Q

Aside from the marketing application, where else can Association rules be used?

A
  • Baskets = documents, items = words
    Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering, predictive text etc.
  • Baskets = documents, items = sentences
    Two documents with many of the same sentences could represent plagiarism or mirror sites on the web
10
Q

What form do association rules take?

A

Association rules are statements of the form

{X1, X2, …, Xn} => Y

Meaning if we find all of X1, X2, …, Xn in the market basket, then we have a good chance of finding Y.

11
Q

What is the confidence of the association rule and how can it be determined?

A

The strength of implication in the rule

Determine the probability of finding Y given X1, X2, …, Xn occur.
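
As a rough sketch, this conditional probability can be estimated by counting baskets. The baskets and the helper names `support_count` and `confidence` below are assumptions made for illustration:

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"milk", "bread"},
    {"butter", "bread"},
    {"milk", "butter", "bread"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as a ratio of basket counts."""
    both = support_count(antecedent | consequent, baskets)
    ante = support_count(antecedent, baskets)
    return both / ante if ante else 0.0

conf = confidence({"milk", "butter"}, {"bread"}, baskets)
print(f"confidence = {conf:.2f}")
```

Here three baskets contain both milk and butter, and two of those also contain bread, so the estimate is 2/3.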

12
Q

Which rules are we interested in?

A

Rules with a confidence above a certain threshold.

We may also ask that the confidence be significantly higher than it would be if items were placed at random into baskets.
eg a rule like {milk, butter} => bread may have high confidence simply because a lot of people buy bread, not because buying milk and butter makes bread more likely.
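
One way to check this, sketched under the assumption of ten invented baskets, is to compare the rule's confidence with the fraction of all baskets that contain bread (the helper `count` and the data are hypothetical):

```python
# Hypothetical baskets, invented for illustration only.
baskets = (
    [{"milk", "butter", "bread"}] * 4
    + [{"milk", "butter"}]
    + [{"bread"}] * 2
    + [{"bread", "eggs"}] * 2
    + [{"eggs"}]
)

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)

# Confidence of {milk, butter} => bread.
conf = count({"milk", "butter", "bread"}, baskets) / count({"milk", "butter"}, baskets)
# Fraction of all baskets that contain bread, regardless of anything else.
baseline = count({"bread"}, baskets) / len(baskets)

print(f"confidence = {conf:.2f}, baseline = {baseline:.2f}")
```

In this invented data the confidence equals the baseline, so the rule says nothing beyond "most people buy bread".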

13
Q

What are the properties of market-basket analysis?

A
  • Association rules
  • Causality
  • Frequent itemsets
14
Q

Why do we want to consider causality?

A

Correlation does not equal causation. From the marketing strategy perspective, it is important to understand where the causation is coming from.

We want to know that in an association rule, the presence of X1, X2, …, Xn actually “causes” Y to be bought. eg diapers and beer

15
Q

Why do we want to consider frequent itemsets?

A

In most situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets.

We can’t run a good marketing strategy involving items that no one buys.

Data mining starts with the assumption that we only care about sets of items with high support ie they appear together in many baskets. Sets of items must appear in at least a certain percentage of the baskets, called the support threshold.

16
Q

What is the support threshold?

A

The minimum percentage of baskets in which a set of items must appear to be considered.

17
Q

What is the symbolised version of “transactions can be considered to be a subset of the set of all possible items”?

A

T ⊆ I

18
Q

What is the formula for the support s?

A

[See flashcard]
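
For reference, a commonly used definition for a rule A => B is given below. It is assumed here rather than taken from the card, and the symbol D for the set of all transactions is an assumption:

```latex
s(A \Rightarrow B) = P(A \cup B)
  = \frac{\lvert \{\, T \in D : A \cup B \subseteq T \,\} \rvert}{\lvert D \rvert}
```

ie the fraction of transactions that contain every item of both A and B.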

19
Q

What is the formula for the confidence c?

A

[See flashcard]
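
For reference, a commonly used definition for a rule A => B is given below. It is assumed here rather than taken from the card, and support_count(X) denotes the number of transactions containing every item of X:

```latex
c(A \Rightarrow B) = P(B \mid A)
  = \frac{\text{support\_count}(A \cup B)}{\text{support\_count}(A)}
```

ie of the transactions that contain A, the fraction that also contain B.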

20
Q

What does a higher confidence of the association rule A => B mean?

A

The greater the probability that if a customer buys product A, they will also buy product B.

21
Q

What way do we write support and confidence values?

A

As percentages between 0% and 100% (rather than as values between 0 and 1)

22
Q

What are confidence and support values compared to when association rules are created?

A

Confidence and support thresholds

23
Q

What is an itemset?

A

A set of items

24
Q

What is a k-itemset?

A

An itemset containing k items

25
Q

What is the occurrence frequency or support count of an itemset?

A

The number of transactions that contain the itemset

26
Q

What is a frequent itemset?

A

A set of items that appears in at least fraction s of the baskets, where s, the support threshold, is some pre-defined constant.

27
Q

What does Lk denote?

A

The set of frequent k-itemsets

28
Q

What can we reduce the problem of mining association rules to?

A

That of mining frequent itemsets.

29
Q

What is the two-step process of association rule mining?

A
  • Find all frequent itemsets
    Each of these itemsets will occur at least as frequently as a predetermined support count, min_sup
  • Generate strong association rules from the frequent itemsets
    Of all the association rules that can be generated from the frequent itemsets, only those above a minimum level of confidence are retained as strong association rules.
30
Q

What is a strong association rule?

A

One that satisfies minimum support and minimum confidence.

31
Q

Which of the two steps is less costly?

A

Generating strong association rules is much less costly than finding all frequent itemsets. Hence, the overall performance of mining association rules is determined by the first step.

32
Q

How can frequent pattern mining be classified?

A

There are three criteria

  • Based on levels of abstraction
  • Based on the number of data dimensions involved in the rule
  • Based on the type of values handled in the rule
33
Q

What are rules at different levels of abstraction?

A

The items bought are referenced at different levels of abstraction (eg computer is a higher-level abstraction than laptop).

We refer to the rule set mined as consisting of multi-level association rules.

34
Q

What are rules based on the number of data dimensions involved in the rule?

A

If the items or variables in an association rule reference only one dimension, then it is a single dimensional association rule.

A rule that references two or more dimensions, such as age and income, is a multi-dimensional association rule.

35
Q

What are rules based on the types of values handled in the rule?

A

If a rule involves associations between the presence or absence of items, it is a Boolean association rule.

If the rule involves associations between quantitative items or variables, these are discretised and the rule is referred to as a quantitative association rule.

36
Q

Considering mining single dimensional Boolean itemsets, what is the first step in the frequent itemset mining process?

A

Finding all frequent itemsets

37
Q

What is the Naive algorithm?

A

In this algorithm, one considers all possible subsets of I (items in the shop) and in each case, the support is calculated. Only subsets above the minimum support threshold are considered to be frequent itemsets.
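
A minimal sketch of this brute-force search, assuming four invented baskets and a support threshold of 50% (the function names and data are hypothetical):

```python
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def naive_frequent_itemsets(baskets, min_support):
    """Check every non-empty subset of the items and keep the frequent ones."""
    items = sorted(set().union(*baskets))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), baskets)
            if s >= min_support:
                frequent[frozenset(combo)] = s
    return frequent

frequent = naive_frequent_itemsets(baskets, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Every one of the 15 non-empty subsets of the four items is checked against every basket, even though only five of them turn out to be frequent.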

38
Q

What is the issue with the Naive algorithm?

A

This algorithm requires the evaluation of all subsets of the items I - a very large number. There are 2^m subsets of I, and calculating the support of each subset over n transactions takes on the order of 2^m * n operations - the computational effort grows exponentially with the number of items m.

39
Q

What algorithm is used to reduce the computational effort seen in the naive algorithm?

A

The Apriori algorithm, which utilises the Apriori property.

40
Q

What does the Apriori property state?

A

That if a (k-1)-itemset A is not a frequent itemset, then any superset B of A (ie A is a subset of B) will also not be a frequent itemset.

It also indicates that all non-empty subsets of a frequent itemset must also be frequent.

41
Q

What kind of approach does the Apriori algorithm employ?

A

An iterative approach, known as level-wise search, where frequent (k-1)-itemsets are used to obtain potential (or candidate) frequent (k)-itemsets.

42
Q

Briefly describe the apriori algorithm

A
  • Given a specified support threshold s, the first pass finds the items that appear in at least fraction s of the baskets. These are the frequent 1-itemsets, called L1
  • Pairs of items in L1 become the candidate pairs C2 for the second pass. The pairs in C2 whose support reaches s are the frequent 2-itemsets, called L2. Candidate pairs that do not meet the minimum support requirement are pruned.
  • The itemsets in L2 are then used to create C3, which in turn yields L3 etc.
  • This iterative process continues until no more frequent itemsets can be found
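
The passes above can be sketched as follows. This is a simplified illustration, not a tuned implementation: it assumes invented baskets, and it builds candidates by taking unions of frequent (k-1)-itemsets rather than the prefix-based join.

```python
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer"},
]

def apriori(baskets, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    n = len(baskets)

    def frequent_among(candidates):
        # One scan of the baskets per level: keep candidates meeting min_support.
        result = {}
        for c in candidates:
            s = sum(1 for b in baskets if c <= b) / n
            if s >= min_support:
                result[c] = s
        return result

    # First pass: frequent 1-itemsets, L1.
    current = frequent_among({frozenset([item]) for b in baskets for item in b})
    all_frequent = dict(current)
    k = 2
    while current:
        # Candidates Ck: unions of two frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = frequent_among(candidates)  # Lk
        all_frequent.update(current)
        k += 1
    return all_frequent

frequent = apriori(baskets, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

With this data the candidate {bread, butter, milk} is discarded without scanning the baskets, because its subset {butter, milk} is not frequent.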
43
Q

What is the general action of the Apriori method?

A

L(k-1) is used to define Lk for k >= 2

44
Q

What are the two steps of the Apriori method?

A

The join step and the prune step

45
Q

Define the join step of the apriori method

A
  • To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
  • This set of candidates is denoted Ck
  • Members of Lk-1 are joinable if their first (k-2) items are in common
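
A minimal sketch of the join, assuming each itemset is stored as a tuple with its items in sorted order (the function name `join` and the example L2 are hypothetical):

```python
def join(prev_frequent):
    """Join L(k-1) with itself: merge pairs whose first k-2 items agree."""
    prev = sorted(prev_frequent)  # itemsets are sorted tuples
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:  # first k-2 items in common
                candidates.append(a + (b[-1],))
    return candidates

# Hypothetical L2, invented for illustration only.
L2 = [("bread", "butter"), ("bread", "milk"), ("butter", "milk")]
C3 = join(L2)
print(C3)
```

Only ("bread", "butter") and ("bread", "milk") share their first item, so they are the only pair joined. Keeping items in a fixed order means each candidate is produced once.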
46
Q

Define the prune step of the Apriori method

A
  • Ck is a superset of Lk - its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck
  • A scan of the database to determine the count of each candidate in Ck results in the determination of Lk
  • Ck can be reduced before the scan by using the Apriori property - if any (k-1)-subset of a member of Ck is not in Lk-1 (ie it is infrequent), then that member of Ck cannot be frequent and can be eliminated straight away
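
A minimal sketch of the subset check that removes candidates before the database scan, assuming itemsets are stored as sorted tuples (the function name `prune` and the example data are hypothetical):

```python
from itertools import combinations

def prune(candidates, prev_frequent):
    """Keep only candidates whose (k-1)-subsets are all in L(k-1)."""
    prev = set(prev_frequent)
    kept = []
    for c in candidates:
        if all(sub in prev for sub in combinations(c, len(c) - 1)):
            kept.append(c)
    return kept

# Hypothetical data: ("butter", "milk") is missing from L2, ie it is infrequent.
L2 = [("bread", "butter"), ("bread", "milk")]
C3 = [("bread", "butter", "milk")]
pruned = prune(C3, L2)
print(pruned)
```

The candidate is removed because one of its 2-item subsets is not frequent. The candidates that survive still need a scan of the database to determine their counts.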
47
Q

When does the join/prune cycle stop?

A

When Lk is empty (Lk = ∅), ie no frequent k-itemsets are found

48
Q

What can you do once the frequent itemsets from transactions in a database have been found?

A

Generate strong association rules from them

Strong association rules satisfy minimum support and minimum confidence.

49
Q

What are the equations associated with generating association rules from frequent itemsets?

A

[See flashcard]
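
A common formulation, assumed here rather than taken from the card, computes confidence(A => B) = support_count(A ∪ B) / support_count(A). For each frequent itemset l and each non-empty proper subset s of l, the rule s => (l - s) is output if its confidence meets the minimum. A rough sketch using invented support values (a ratio of supports gives the same result as a ratio of counts):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent maps itemset -> support; returns (antecedent, consequent, confidence)."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                # Every subset of a frequent itemset is frequent, so it is in the map.
                conf = sup / frequent[ante]
                if conf >= min_conf:
                    rules.append((ante, itemset - ante, conf))
    return rules

# Hypothetical frequent itemsets with their supports, invented for illustration.
frequent = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.5,
    frozenset({"bread", "milk"}): 0.5,
}
rules = generate_rules(frequent, min_conf=0.8)
for ante, cons, conf in rules:
    print(sorted(ante), "=>", sorted(cons), conf)
```

Only milk => bread is kept here (confidence 0.5 / 0.5), while bread => milk falls below the threshold (0.5 / 0.75).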

50
Q

Why does each rule automatically satisfy minimum support?

A

The rules are generated from frequent itemsets

51
Q

What are the advantages of association rules?

A
  • Algorithm is very scalable ie capable of working with large amounts of transactional data
  • Resulting rules are very easy to understand
  • Useful for data mining and discovering unexpected knowledge in databases
52
Q

What are the disadvantages of association rules?

A
  • Not very helpful for small datasets
  • Requires effort to separate the true insight from common sense
  • Easy to draw spurious conclusions from random patterns