C4 Flashcards

1
Q

pattern

A

describes local structure in the data

2
Q

pattern mining

A

given a database D, a pattern language and a set of constraints C, find the set of patterns S in the pattern language such that each pattern in S satisfies each constraint in C on D and S is maximal
=> find all patterns satisfying the constraints

3
Q

problems with pattern mining

A
  • high thresholds: few, already well-known patterns
  • low thresholds: pattern explosion
  • many similar patterns
4
Q

pattern set mining

A

find the set of patterns S in the pattern language such that each pattern in S satisfies each constraint on D and S is optimal with respect to optimisation criterion G

find a global model of local patterns

5
Q

MDL-based pattern set mining

A

find the set of patterns S in the pattern language such that each pattern in S satisfies each constraint on D and S is MDL optimal

find a model that compresses the data

6
Q

compressing with code tables

A
  • lossless encoding/decoding
  • code table consisting of itemsets and their codes
  • cover algorithm: encodes a single transaction by replacing each occurrence of a code table itemset with its code (see the sketch below)
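A minimal Python sketch of such a cover step; the greedy longest-first order and the frozenset representation are assumptions for illustration, not fixed by the card:

def cover(code_table, transaction):
    # code_table: list of frozensets, assumed ordered (e.g. longest first) and
    # assumed to contain all singletons so that every item can be covered
    # transaction: a set of items; returns the itemsets whose codes replace it
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used  # lossless: the union of the returned itemsets is the transaction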
7
Q

optimal code length when compressing with code tables

A

for each x in the code table CT:
usage(x) = |{t in D : x in Cover(CT, t)}|
P(x) = usage(x) / sum_{y in CT} usage(y)

according to Shannon, the optimal code for distribution P has length L(code_CT(x)) = -log(P(x))
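A small worked sketch of these code lengths in Python; the usage counts are made up purely for illustration:

from math import log2

usage = {frozenset('AB'): 6, frozenset('C'): 3, frozenset('D'): 1}   # hypothetical counts

total = sum(usage.values())                        # sum_{y in CT} usage(y)
code_length = {x: -log2(u / total) for x, u in usage.items()}
# e.g. P({A,B}) = 6/10, so its code is about 0.74 bits long,
# while the rarely used {D} gets a code of about 3.32 bits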

8
Q

total encoded size of data set D with code table CT

A

L(D, CT) = L(D|CT) + L(CT)
where L(D|CT) is the size of D encoded with CT and L(CT) the size of the code table
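A sketch of how both terms could be computed, reusing the cover function sketched above; st_length (standard code lengths for single items) is a hypothetical helper dictionary:

from math import log2

def compressed_size(database, code_table, st_length):
    covers = [cover(code_table, t) for t in database]   # cover every transaction
    usage = {x: 0 for x in code_table}
    for c in covers:
        for x in c:
            usage[x] += 1
    total = sum(usage.values())
    length = {x: -log2(usage[x] / total) for x in code_table if usage[x] > 0}
    # L(D|CT): the codes emitted while covering all of D
    data_bits = sum(length[x] for c in covers for x in c)
    # L(CT): each used itemset's code plus its items written in standard codes
    table_bits = sum(length[x] + sum(st_length[i] for i in x)
                     for x in code_table if usage[x] > 0)
    return data_bits + table_bits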

9
Q

KRIMP

A

iteratively consider candidate patterns, add a pattern if it results in a total encoding with a smaller compressed size, otherwise discard it
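A rough sketch of that accept/reject loop, reusing the compressed_size sketch above; the real KRIMP considers candidates in a fixed order and prunes the code table, both of which are omitted here:

def krimp(database, candidates, code_table, st_length):
    # code_table is assumed to start with all singleton itemsets (lossless baseline)
    best = compressed_size(database, code_table, st_length)
    for pattern in candidates:                     # e.g. frequent itemsets
        trial = [frozenset(pattern)] + code_table  # tentatively add the candidate
        size = compressed_size(database, trial, st_length)
        if size < best:                            # keep it only if D compresses better
            code_table, best = trial, size
    return code_table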

10
Q

SLIM

A

KRIMP is computationally expensive (many candidates, and a database scan to construct a cover for each candidate)
SLIM combines pairs of patterns already in the code table and estimates the improvement in compression
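A simplified sketch of one such step; note that SLIM itself ranks candidate pairs by an estimated gain rather than evaluating every union exactly, which is what makes it cheaper than this naive version (compressed_size as sketched above):

def slim_step(database, code_table, st_length):
    best_table = None
    best_size = compressed_size(database, code_table, st_length)
    for x in code_table:
        for y in code_table:
            if x != y:
                trial = [x | y] + code_table       # candidate = union of two patterns
                size = compressed_size(database, trial, st_length)
                if size < best_size:
                    best_table, best_size = trial, size
    return best_table                              # None when no pair improves compression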

11
Q

what does it mean if a pattern-based model is characteristic?

A

different data distributions get different models and different models imply different data distributions

optimal MDL results are characteristic

12
Q

compression based classification

A

Split the dataset by class label, apply KRIMP to obtain a code table per class. Then encode an unseen transaction with each code table and assign it to the class whose code table gives the shortest code length
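A minimal sketch, assuming a code table and code-length dictionary have already been induced per class (e.g. with KRIMP) and reusing the cover sketch above:

def classify(transaction, class_models):
    # class_models: {label: (code_table, code_length)} -- hypothetical setup
    def bits(code_table, length):
        return sum(length[x] for x in cover(code_table, transaction))
    return min(class_models, key=lambda label: bits(*class_models[label]))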

13
Q

data-driven clustering algorithm

A

inspired by k-means (sketched below)
1. start with k equal-sized random clusters D_i
2. construct code table CT_i for each D_i
3. use the Bayes-optimal choice to re-assign transactions, i.e. assign each transaction to the cluster whose code table encodes it shortest
4. repeat 2 and 3 until clusters are stable
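A sketch of the loop under stated assumptions: build_code_table stands in for KRIMP-style induction and encoded_bits for the per-transaction code length used in step 3; both are placeholders:

import random

def mdl_cluster(database, k, build_code_table, encoded_bits, max_iter=25):
    data = list(database)
    random.shuffle(data)
    clusters = [data[i::k] for i in range(k)]           # k equal-sized random clusters
    for _ in range(max_iter):
        models = [build_code_table(c) for c in clusters]
        new_clusters = [[] for _ in range(k)]
        for t in data:
            best = min(range(k), key=lambda i: encoded_bits(models[i], t))
            new_clusters[best].append(t)                # Bayes-optimal re-assignment
        if new_clusters == clusters:                    # clusters are stable: stop
            return new_clusters
        clusters = new_clusters
    return clusters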

14
Q

dissimilarity measure

A

MDL says: the optimal compressor induced from a database x compresses x better than a compressor induced from a database y

C_x(y) is the size of database y as compressed by the compressor induced from database x

compressor dissimilarity DS(x,y) = max{ (C_y(x) - C_x(x)) / C_x(x), (C_x(y) - C_y(y)) / C_y(y) }

the higher the dissimilarity, the higher the classification accuracy tends to be
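A direct transcription of the measure, where induce builds a compressor (e.g. a KRIMP code table) from a database and compressed_bits returns the size of a database under a given compressor; both are placeholders:

def dissimilarity(db_x, db_y, induce, compressed_bits):
    cx, cy = induce(db_x), induce(db_y)                 # compressors C_x and C_y
    ds_x = (compressed_bits(cy, db_x) - compressed_bits(cx, db_x)) / compressed_bits(cx, db_x)
    ds_y = (compressed_bits(cx, db_y) - compressed_bits(cy, db_y)) / compressed_bits(cy, db_y)
    return max(ds_x, ds_y)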
