Data Mining: Association Rules Flashcards

1
Q

Knowledge Discovery Process

A

Process of extracting useful knowledge from data. Main steps: data selection, preprocessing and cleaning, transformation, data mining, interpretation and evaluation of the extracted patterns.

2
Q

Analysis techniques, methods

A

Descriptive methods:

Extract interpretable models describing data, for example client segmentation.

Predictive methods:

Exploit some known variables to predict unknown or future values of other variables, for example spam email detection.

3
Q

Attribute types

A

Nominal: ID, eye color, zip codes

Ordinal: Rankings, grades, size in {tall, medium, short}

Interval: calendar dates, temperatures in Celsius

Ratio: temperature in Kelvin, length, time, counts

4
Q

Nominal attribute possesses

A

Distinctness

5
Q

Ordinal attribute possesses

A

Distinctness, order

6
Q

Interval attribute possesses

A

Distinctness, order, addition

7
Q

Ratio attribute possesses

A

Distinctness, order, addition, multiplication

8
Q

Data quality problems

A

Noise, outliers, missing values, duplicate data

9
Q

Important characteristics of structured data

A

Dimensionality: curse of dimensionality

Sparsity: Only presence counts

Resolution: Patterns depend on the scale

10
Q

Aggregation

A

Combining two or more attributes (or objects) into a single one.

Purpose:

Data reduction: reduce the number of attributes or objects (other data reduction techniques: sampling, feature selection, discretization)

Change of scale: from regions into states

Stability: aggregated data tends to be more stable (less variability)
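
As a quick illustration of the change of scale, a minimal pandas sketch (the region/city/sales columns and values are illustrative assumptions, not from the card):

```python
import pandas as pd

# Daily sales per city: fine-grained, relatively noisy data
daily = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "city":   ["Turin", "Milan", "Rome", "Naples", "Bari"],
    "sales":  [120, 95, 210, 80, 60],
})

# Aggregate cities into regions: fewer objects, more stable aggregated values
by_region = daily.groupby("region")["sales"].agg(["sum", "mean"])
print(by_region)
```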

11
Q

Sampling

A

Sampling is necessary because processing the entire data set is too expensive.

Sampling works if the sample set is representative of the entire dataset.

A sample is representative if it has approximately the same properties as the original data set.

Types:

Simple Random: Randomly selected

Without replacement: An object can be taken only once

With replacement: The same object can be taken more than once.

Stratified: Split data into several partitions, take random samples from each partition
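
A minimal sketch of the three sampling types with NumPy and pandas (the column name "label", the sample sizes and the 10% sampling fraction are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value": rng.normal(size=1000),
    "label": rng.choice(["a", "b", "c"], size=1000, p=[0.7, 0.2, 0.1]),
})

# Simple random sampling without replacement: each object appears at most once
without_repl = data.sample(n=100, replace=False, random_state=42)

# Simple random sampling with replacement: the same object can be drawn again
with_repl = data.sample(n=100, replace=True, random_state=42)

# Stratified sampling: partition by 'label', draw the same fraction from each partition
stratified = data.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)
```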

12
Q

Dimensionality reduction

A

When dimensionality increases, data becomes increasingly sparse in the space it occupies, and definitions of distance and density between points become less meaningful (curse of dimensionality). To counter this we use dimensionality reduction:

Principal Component Analysis (PCA): find the projection that captures the largest amount of variation in the data.

Singular Value Decomposition (SVD).

Feature subset selection: remove redundant and irrelevant features.
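
A minimal PCA sketch with scikit-learn (the 2-component choice and the random data are illustrative assumptions, not part of the card):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)        # shape (200, 2)

print(pca.explained_variance_ratio_)    # fraction of variance captured by each component
```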

13
Q

Feature subset selection techniques

A

Brute force: try all possible feature subsets as input to the data mining algorithm

Embedded: features are selected naturally by the data mining algorithm

Filter: features are selected before the data mining algorithm is run

Wrapper: use the data mining algorithm as a black box to find the best feature subset
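
A minimal sketch of the filter and wrapper approaches with scikit-learn (the synthetic dataset and the choice of keeping 5 features are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter: score features independently of the mining algorithm, keep the best 5
X_filtered = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: use a classifier as a black box and recursively eliminate features
wrapper = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = wrapper.fit_transform(X, y)
```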

14
Q

Feature Creation

A

Create new attributes that better represent the information in the data set.

Feature Extraction: domain specific

Mapping Data to new space: for example Fourier Transform

Feature Construction: combine features

15
Q

Discretization

A

Split a continuous attribute domain into discrete intervals.

Reduces cardinality of attribute domain.

Techniques:

  • N intervals with the same width (incremental, easy to implement, can be badly affected by outliers and sparse data)
  • N intervals with approximately the same cardinality (non-incremental, better for sparse data and outliers)
  • Clustering (fits well sparse data)
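
A minimal sketch of equal-width and equal-frequency binning with pandas (the 4-bin choice and the skewed sample data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10, size=1000))   # skewed continuous attribute

# Equal-width: 4 intervals of the same width (sensitive to outliers)
equal_width = pd.cut(values, bins=4)

# Equal-frequency: 4 intervals with approximately the same number of objects
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```
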
16
Q

Attribute transformation

A

Function that maps attribute values to a new set of values.

Example: normalization (min-max, z-score, decimal scaling)
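
A minimal sketch of the three normalization schemes with NumPy (the sample vector is an illustrative assumption):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# Min-max normalization: rescale values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by the smallest power of 10 that brings |values| below 1
decimal_scaled = x / (10 ** np.ceil(np.log10(np.abs(x).max())))
```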

17
Q

Similarity/Dissimilarity for simple attributes

A

Nominal: s = 1 if p = q, s = 0 otherwise (d = 1 - s)

Ordinal: map values to integers 0..n-1, d = |p - q| / (n - 1), s = 1 - d

Interval or Ratio: d = |p - q|, s = -d or s = 1 / (1 + d)

18
Q

Minkowski distance

A

dist(p, q) = (sum_k |p_k - q_k|^r)^(1/r)

r = 1: city block / Manhattan distance (L1 norm); Hamming distance when applied to binary vectors

r = 2: Euclidean distance

r -> ∞: supremum (L∞) distance, the maximum difference between any component of the vectors
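
A minimal NumPy sketch of the three special cases (the two sample vectors are illustrative assumptions):

```python
import numpy as np

def minkowski(p, q, r):
    """Minkowski distance of order r between two vectors."""
    if np.isinf(r):
        return np.abs(p - q).max()              # supremum (L-infinity) distance
    return (np.abs(p - q) ** r).sum() ** (1 / r)

p = np.array([0.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

print(minkowski(p, q, 1))        # city block / Manhattan: 6.0
print(minkowski(p, q, 2))        # Euclidean: ~4.47
print(minkowski(p, q, np.inf))   # supremum: 4.0
```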

19
Q

Properties of metrics

A

d(p, q) >= 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)

d(p, q) = d(q, p) for all p and q (symmetry)

d(p, r) <= d(p, q) + d(q, r) for all p, q and r (triangle inequality)

20
Q

Common properties of similarities

A

s(p,q) = 1 only if p = q

s(p, q) = s(q, p) for all p and q.

21
Q

Similarity between binary vectors

A

Simple Matching Coefficient:

SMC = (number of matches) / (number of attributes) = (M00 + M11) / (M00 + M01 + M10 + M11)

Jaccard coefficient:

J = (number of 11 matches) / (number of attributes not both zero) = M11 / (M01 + M10 + M11)
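
A minimal sketch computing both measures for two binary vectors (the example vectors are illustrative assumptions):

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))   # attributes where both vectors are 1
m00 = np.sum((p == 0) & (q == 0))   # attributes where both vectors are 0
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m11 + m00 + m10 + m01)   # counts 0-0 matches too
jaccard = m11 / (m11 + m10 + m01)             # ignores 0-0 matches

print(smc, jaccard)   # 0.9 and ~0.667 for these vectors
```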

22
Q

Cosine similarity

A

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), the cosine of the angle between the two vectors; commonly used for document (word frequency) vectors.

23
Q

Combining similarities and combining weighted similarities

24
Q

Goal of association rules

A

Exploratory technique.

Extraction of frequent correlations or patterns from a transactional database.

Example:

diapers => beer

  • 2% of transactions contain both items
  • 30% of the transactions containing diapers also contain beer
25
Transactional formats
A transaction can be a set of: - Market basket data - Textual data - Structured data
26
Association rule: Itemset
Set including one or more items
27
Association rule: k-itemset
Itemset with cardinality of k
28
Association rule: Support count(#)
Frequency of occurrence of an itemset. Example: #{Diapers, Beer} = 2
29
Association rule: support
The fraction of transactions that contain an itemset. Example: sup({Beer, Diapers}) = 2/5
30
Association rule, Frequent itemset
An itemset whose support is greater than or equal to a minsup threshold.
31
Association rule, rule quality metrics
Given a rule A => B: Support := #{A, B} / |T|, where |T| is the cardinality of the transactional database. Confidence := sup(A, B) / sup(A), i.e. the conditional probability of finding B having found A.
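
A minimal sketch computing support and confidence of a rule on a toy transactional database (the five transactions and the rule {Diapers} => {Beer} are illustrative assumptions):

```python
# Toy transactional database: each transaction is a set of items
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset, db):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in db) / len(db)

A, B = {"Diapers"}, {"Beer"}
rule_support = support(A | B, transactions)                      # sup(A, B) = #{A, B} / |T|
rule_confidence = support(A | B, transactions) / support(A, transactions)

print(rule_support)     # 3/5 = 0.6
print(rule_confidence)  # 3/4 = 0.75
```
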
32
Association rule extraction
Given a set of transactions, association rule mining is the extraction of the rules satisfying the constraints: support >= minsup threshold and confidence >= minconf threshold. The result is complete (all the rules satisfying both constraints) and correct (only the rules satisfying both constraints).
33
Association rule, frequent itemset generation and extraction of association rules
Frequent itemset generation: many different techniques - level-wise approaches (Apriori), approaches without candidate generation (FP-growth), others. This is the most computationally expensive step.
Extraction of association rules: generation of all possible binary partitionings of each frequent itemset, possibly enforcing a confidence threshold.
34
Association rule, Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent: the support of an itemset can never exceed the support of any of its subsets. This reduces the number of candidates to be considered.
35
Association rule, apriori algo
1. Candidate generation
   - Join step: generate candidates of length k+1 by joining frequent itemsets of length k
   - Prune step: apply the Apriori principle, i.e. prune length-(k+1) candidates that contain at least one k-itemset that is not frequent
2. Frequent itemset generation
   - Scan the DB to count the support of the length-(k+1) candidates
   - Prune candidates below minsup
Counting the support of candidates: candidate itemsets are stored in a hash tree; a subset function finds all the candidates contained in each transaction.
Performance issues: candidate sets may be huge (2-itemset candidate generation is the most critical step; extracting long frequent itemsets requires generating all their frequent subsets). Multiple database scans: n+1 scans when the length of the longest frequent pattern is n.
Factors affecting performance: minimum support threshold, dimensionality.
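
A minimal, simplified Apriori sketch in Python (the toy transactions and minsup = 3 are illustrative assumptions; for brevity the join step enumerates all (k+1)-combinations of frequent items and relies on the prune step, whereas real implementations join k-itemsets sharing a common (k-1)-prefix and count supports with a hash tree):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for all frequent itemsets (level-wise)."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {iset: c for iset, c in counts.items() if c >= minsup}
    frequent = dict(current)

    k = 1
    while current:
        # Join step (simplified): candidate (k+1)-itemsets from the frequent items
        items = sorted({i for iset in current for i in iset})
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        # Prune step (Apriori principle): every k-subset must be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))]
        # Scan the DB to count candidate supports, keep those above minsup
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset, count in apriori(transactions, minsup=3).items():
    print(set(itemset), count)
```
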
36
Association rule, FP-growth algo
Exploits a compressed representation of the database, the FP-tree: high compression for dense data distributions, complete representation for frequent pattern mining.
Frequent pattern mining: recursive visits of the FP-tree, with a divide-and-conquer approach.
Only 2 DB scans: count item supports + build the FP-tree.
The FP-tree is essentially a trie (prefix tree) whose nodes store how many transactions share the prefix ending at that node. Each node with item k also points to the next node with item k and to the corresponding entry in the header table; the header table stores the support of each single item.
Algorithm: scan the header table starting from the lowest-support item. For each item i in the header table, extract the frequent itemsets including item i and the items preceding it in the header table.
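
For comparison with the sketch above, a minimal end-to-end example assuming the mlxtend library is available (the toy transactions and the 0.6/0.7 thresholds are illustrative assumptions):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemset mining without candidate generation (FP-growth), then rule extraction
frequent_itemsets = fpgrowth(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```
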
37
Association rule, vertical data layout
38
Association rule, Frequent Itemset
39
Association rule, closed itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset.
40
Association rule, maximal, closed itemsets set
41
Association rule, effect of support threshold
minsup too high: itemsets including rare but interesting items may be lost. minsup too low: too many frequent itemsets are extracted -> computationally expensive.
42
Confidence
Objective measure. Not always reliable: it ignores the support of the rule head (the consequent), so its value is affected by how frequent the consequent is in the data.
43
Correlation
Objective measure. For a rule r: A => B, correlation (lift) = P(A, B) / (P(A)P(B)) = conf(r) / sup(B). correlation = 1 => statistical independence; correlation > 1 => positive correlation; correlation < 1 => negative correlation.
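
A small worked example for the rule {Diapers} => {Beer} on the 5-transaction toy database used in the sketches above (the support values are the illustrative assumptions):

```python
# Rule r: {Diapers} => {Beer}
sup_A_B = 3 / 5      # sup(Diapers, Beer)
sup_A = 4 / 5        # sup(Diapers)
sup_B = 3 / 5        # sup(Beer)

confidence = sup_A_B / sup_A            # 0.75
correlation = confidence / sup_B        # = sup(A, B) / (sup(A) * sup(B)) = 1.25

print(correlation)   # > 1: Diapers and Beer are positively correlated
```
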
44
Weighted association rules
Consider item/transaction weights during association rule extraction. Extend the rule quality measures -> weighted support, weighted confidence. Apply ad-hoc weight aggregation functions -> min, max, avg.
45
Hierarchical association rules
Enable aggregation over attributes in a dataset. Typically user provided. Examples: time hierarchy, product category, location hierarchy.
46
Generalized itemsets association rules
Sets of items at different generalization levels. A generalized itemset covers a transaction when: all its generalized items are ancestors of items included in the transaction, and its data items are included in the transaction. Generalized itemset support: ratio between the number of covered transactions and the dataset cardinality.
47
High level, cross level, low level rules
High-level rules -> only generalized itemsets (high-level info). Cross-level rules -> generalized items and data items combined. Low-level rules -> only data items.