C4 Flashcards
pattern
describes local structure in the data
pattern mining
given a database D, a pattern language and a set of constraints C, find the set of patterns S in the pattern language such that each pattern satisfies each constraint on D and S is maximal
=> find all patterns satisfying the constraints
problems with pattern mining
- high thresholds: few, already well-known patterns
- low thresholds: pattern explosion
- many similar patterns
pattern set mining
find the set of patterns S in the pattern language such that each pattern in S satisfies each constraint on D and S is optimal with respect to optimisation criterion G
find a global model of local patterns
MDL-based pattern set mining
find the set of patterns S in the pattern language such that each pattern in S satisfies each constraint on D and S is MDL optimal
find a model that compresses the data
compressing with code tables
- lossless encoding/decoding
- code table consisting of itemsets and their codes
cover algorithm: encodes a single transaction by replacing each occurrence of a code table itemset with its code
optimal code length when compressing with code tables
for each x in code table CT:
usage(x) = |{t in D : x in cover(CT, t)}|
P(x) = usage(x) / sum_{y in CT} usage(y)
according to Shannon, the optimal code for distribution P has length L(code_CT(x)) = -log(P(x))
total encoded size of data set D with code table CT
L(D, CT) = L(D|CT) + L(CT)
where L(D|CT) is the size of D encoded with CT and L(CT) the size of the code table
KRIMP
iteratively consider candidate patterns, add a pattern if it results in a total encoding with a smaller compressed size, otherwise discard it
SLIM
KRIMP is computationally expensive (many candidates, and database scans to construct a cover for each)
SLIM instead combines pairs of patterns already in the code table and estimates the resulting improvement in compression
what does it mean if a pattern-based model is characteristic?
different data distributions get different models and different models imply different data distributions
optimal MDL results are characteristic
compression based classification
Split the dataset by class, apply KRIMP to obtain a code table per class. Then encode an unseen transaction with each code table and assign it to the class whose code table yields the shortest code length
data-driven clustering algorithm
inspired by k-means
1. start with k equal-sized random clusters D_i
2. construct code table CT_i for each D_i
3. use Bayes optimal choice to re-assign transactions
4. repeat 2 and 3 until clusters are stable
dissimilarity measure
MDL says: the optimal compressor induced from a database x compresses x better than a compressor induced from a database y
C_x(y) is the size of database y as compressed by the compressor induced from database x
compressor dissimilarity DS(x,y) = max{ (C_y(x) - C_x(x)) / C_x(x), (C_x(y) - C_y(y)) / C_y(y) }
the higher the dissimilarity, the higher the classification accuracy tends to be