Data Mining Flashcards
Data Mining
Extraction of implicit, previously unknown, and potentially useful information from data.
KDD
Knowledge Discovery from Data
Descriptive vs. Predictive
Descriptive methods extract interpretable models that describe the data, while predictive methods predict unknown or future values.
Types of attributes
Nominal: Not ordered data (eye color, ID numbers)
Ordinal: Ordered data (grades, rankings)
Interval: Differences are meaningful, no true zero (calendar dates, temperature in Celsius)
Ratio: True zero, so ratios are meaningful (temperature in Kelvin, length, time)
Properties of attributes
Distinctness (D)
Order (O)
Addition (A)
Multiplication (M)
Nominal -> D
Ordinal -> D, O
Interval -> D, O, A
Ratio -> D, O, A, M
Discrete vs. Continuous
Discrete attributes take values from a finite (or countably infinite) set, while continuous attributes can take infinitely many real values.
Data Quality Problems
- Noise: Modification of original values.
- Outliers: Data objects considerably different from most of the other data.
- Missing Values: Data not collected or attributes not applicable for all objects.
- Duplicate data: Duplicate or near-duplicate objects, often introduced when merging data from heterogeneous sources.
Data Reduction Types
- Sampling
- Feature Selection / Dimensionality reduction
Discretization
Supervised or unsupervised conversion of continuous values into a finite set of discrete values.
Binarization
One hot encoding
Similarity and Dissimilarity
Two measures of how alike and different two data objects are.
[0-1] -> 1 refers to maximum _______ (similarity or dissimilarity)
Types of distance measures
- Euclidean Distance
- Minkowski Distance
- Mahalanobis Distance
- Cosine Similarity
- SMC (Simple Matching Coefficient) Similarity
- Combined similarities
Correlation
Measure between [-1, 1] of the linear relationship between two data objects.
corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
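A minimal numeric sketch of this formula, assuming numpy is available (x and y are illustrative arrays):

    import numpy as np

    # illustrative data with a strong linear relationship
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 10.1])

    # corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    corr = cov_xy / (x.std() * y.std())
    print(corr)                      # close to 1 -> strong positive linear relationship
    print(np.corrcoef(x, y)[0, 1])   # same value using numpy's built-in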
Association Rules
Extraction of frequent correlations or patterns from a transactional database.
AR: Itemset
A set of items.
Example: {Beer, Diapers}
AR: Support
Fraction of transactions that contain an itemset.
Example: sup({beer, diapers})=2/5
AR: Frequent Itemset
An itemset whose support is greater than or equal to a minimum support (minsup) threshold.
AR: Confidence
For a rule A => B, the fraction of transactions containing A that also contain B.
Example: conf=sup(A,B)/sup(A)
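A small sketch of support and confidence on a toy transactional database (item names and transactions are illustrative):

    # toy transactional database (illustrative)
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]

    def support(itemset):
        # fraction of transactions containing every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # conf(A => B) = sup(A, B) / sup(A)
        return support(antecedent | consequent) / support(antecedent)

    print(support({"beer", "diapers"}))        # 2/5 = 0.4
    print(confidence({"beer"}, {"diapers"}))   # (2/5) / (3/5) ≈ 0.67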
Association Rule Extraction Steps
- Extract frequent item sets
- Extract association rules
AR: Brute-force approach
Generate all possible itemsets and extract association rules from them. Computationally infeasible.
AR: Apriori Principle
If an itemset is frequent, then all of its subsets must also be frequent.
Example: {A,B} -> Frequent
{A} and {B} must also be frequent.
AR: Apriori Algorithm
- Extract 1-element itemsets
- Prune itemsets below minsup
- Generate new candidate itemsets from the remaining ones
- Loop until no new frequent itemsets are found
Draw it or create an example.
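A compact sketch of the loop above on a toy database (minsup is an illustrative threshold; the full algorithm also prunes candidates whose subsets are infrequent before counting):

    # toy transactional database and illustrative minimum support
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]
    minsup = 0.4

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # extract 1-element itemsets and prune those below minsup
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    all_frequent = list(frequent)

    k = 2
    while frequent:
        # generate candidate k-itemsets by joining frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune candidates below minsup
        frequent = [c for c in candidates if support(c) >= minsup]
        all_frequent += frequent
        k += 1

    print(all_frequent)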
Association Rule Extraction Algorithms
- Brute-force approach
- Apriori Algorithm
- FP-growth algorithm
AR: FP-growth Algorithm
- Build the header table by sorting items (NOT itemsets) in decreasing support order.
- Add the header-table items with support greater than or equal to minsup to the frequent itemsets.
- Construct the FP-tree by inserting each transaction with its items in header-table order.
- Select the lowest item in the header table.
- Build its conditional pattern base and the corresponding conditional FP-tree, and mine it.
- Repeat, exploring all items in the header table.
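A minimal sketch of only the first step (the header table) on toy data; minsup_count is an illustrative absolute threshold:

    from collections import Counter

    # toy transactional database (illustrative)
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]
    minsup_count = 2  # illustrative absolute minimum support count

    # count single-item supports
    counts = Counter(item for t in transactions for item in t)

    # header table: frequent items sorted by decreasing support
    header_table = [item for item, c in counts.most_common() if c >= minsup_count]
    print(header_table)

    # each transaction is then rewritten with items in header-table order
    # before being inserted into the FP-tree
    ordered = [[i for i in header_table if i in t] for t in transactions]
    print(ordered)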
AR: Correlation
r: A => B
correlation(r) = conf(r) / sup(B)   (also called lift)
Statistical independence -> correlation = 1
Positive correlation -> correlation > 1
Negative correlation -> correlation < 1
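A tiny numeric check of this definition, using values from the toy database above as assumptions:

    # r: beer => diapers, with sup(beer, diapers) = 0.4, sup(beer) = 0.6, sup(diapers) = 0.6
    sup_A_B, sup_A, sup_B = 0.4, 0.6, 0.6

    conf = sup_A_B / sup_A         # confidence of the rule
    correlation = conf / sup_B     # conf(r) / sup(B)
    print(correlation)             # ≈ 1.11 > 1 -> positive correlation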
Accuracy
Quality of the prediction
Interpretability
How easily the model can be understood by a human analyst
Incrementality
Model updating in presence of new labelled records
Efficiency
Building and prediction times
Scalability
Training set size and attribute number
Robustness
How it reacts to noise and missing data
Gini Index Formula
NODES:
Gini(node) = 1 - SUM[P(Ck)^2]
Example:
C1: 1 record
C2: 5 records
P1 = 1/6
P2 = 5/6
Gini = 1 - P1^2 - P2^2 = 1 - (1/6)^2 - (5/6)^2 ≈ 0.278
SPLITS:
Gini(split) = SUM[(#NodeKRecords / #SplitRecords) * Gini(Nk)]
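A short sketch computing the node and split Gini indices defined above (class counts follow the example; the second child node is illustrative):

    def gini_node(class_counts):
        # Gini(node) = 1 - SUM[P(Ck)^2]
        n = sum(class_counts)
        return 1 - sum((c / n) ** 2 for c in class_counts)

    def gini_split(children):
        # weighted average of the children's Gini indices
        total = sum(sum(counts) for counts in children)
        return sum(sum(counts) / total * gini_node(counts) for counts in children)

    print(gini_node([1, 5]))              # 1 - (1/6)^2 - (5/6)^2 ≈ 0.278
    print(gini_split([[1, 5], [4, 2]]))   # Gini of a split with two child nodes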
Decision Tree Elements
Leaf node: final node, holds the predicted class label
Split node: internal node that tests an attribute and routes records to its children
Gini Index
Measures impurity of a node or split.
0: lowest impurity (all records in one class)
1 - 1/#classes: highest impurity (records equally distributed among the classes)
Gini Index for continuous values
- Sort the values and take a candidate split point between each pair of adjacent values
- Compute the Gini index of the resulting binary split for each candidate and keep the best one
Entropy Index
Entropy = - SUM[P(Ck) log2 P(Ck)]
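The same example node measured with entropy instead of Gini (a minimal sketch):

    from math import log2

    def entropy(class_counts):
        # Entropy = - SUM[P(Ck) * log2 P(Ck)]
        n = sum(class_counts)
        return -sum(c / n * log2(c / n) for c in class_counts if c > 0)

    print(entropy([1, 5]))   # ≈ 0.65
    print(entropy([3, 3]))   # 1.0 -> maximum impurity for two classes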
Decision Tree Evaluation
Accuracy: Average
Interpretability: For small trees
Incrementality: Not incremental
Efficiency: Fast
Scalability: Scalable
Robustness: Difficult management of missing data
Random Forest
Divides the original training data into random subsets, builds a decision tree on each subset, and decides the final label by majority voting among the trees.
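A minimal usage sketch, assuming scikit-learn is available (dataset and parameters are illustrative, not tuned):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # illustrative dataset
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each built on a random sample of the training data
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # accuracy of the majority-voted predictions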
Random Forest Evaluation
Accuracy: Better than decision trees
Interpretability: No
Incrementality: No
Efficiency: Fast
Scalability: Scalable
Robustness: Robust
Rule Based Classifiers
Classify records by using a collection of “if…then…” rules.
(Condition) → y
Rule Based Classifier Evaluation
Accuracy: Better than decision trees
Interpretable
Not incremental
Efficient
Scalable
Robust
Instance Based Classifiers
Store training records and use them to predict the class label of unseen cases.
K-nearest Neighbors
Assigns the class label by majority vote among the k nearest training records, according to a distance measure.
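A minimal from-scratch sketch of the idea, with Euclidean distance and majority voting (training data is illustrative):

    from collections import Counter
    import numpy as np

    # illustrative labelled training records
    train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    train_y = np.array(["a", "a", "b", "b"])

    def knn_predict(x, k=3):
        # distances from the query point to every training record
        dists = np.linalg.norm(train_X - x, axis=1)
        # labels of the k closest records, then majority vote
        nearest = train_y[np.argsort(dists)[:k]]
        return Counter(nearest).most_common(1)[0][0]

    print(knn_predict(np.array([1.1, 0.9])))   # prints: a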
K-nearest Neighbor Evaluation
Accuracy: Average
Not interpretable
Incremental
Efficient building but slow classification
Scalable
Distances might affect robustness
Bayesian Classification
Computes the probability that a given record belongs to class C using Bayes' theorem, assuming (naively) that attributes are conditionally independent given the class.
X: Record
C1: class 1
C2: class 2
For C1:
P(C1|X) is proportional to P(X1|C1) * P(X2|C1) * ... * P(C1)
P(Xi|C1) = #of C1 records with value Xi / #ofC1s
P(C1) = #ofC1s / #ofRecords
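A small sketch of these formulas on toy categorical records (attribute names and values are illustrative):

    # toy labelled records: (outlook, windy) -> play
    records = [
        (("sunny", "no"), "yes"),
        (("sunny", "yes"), "no"),
        (("rainy", "no"), "yes"),
        (("rainy", "yes"), "no"),
        (("sunny", "no"), "yes"),
    ]

    def bayes_score(x, c):
        in_class = [r for r, label in records if label == c]
        # P(C) = #ofCs / #ofRecords
        score = len(in_class) / len(records)
        # P(Xi|C) = #of class-C records with value Xi / #ofCs
        for i, value in enumerate(x):
            score *= sum(r[i] == value for r in in_class) / len(in_class)
        return score

    x = ("sunny", "no")
    scores = {c: bayes_score(x, c) for c in ("yes", "no")}
    print(max(scores, key=scores.get), scores)   # predicted class and raw scores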
Bayes Classifier Evaluation
Accuracy: Average
Not interpretable
Incremental
Efficient
Scalable
Robust for statistically independent attributes
SVM Evaluation
Good accuracy
Not interpretable
Not incremental
Average efficiency
Medium scalability
Robust
Cross validation
- Partition the data into K subsets (folds)
- Train on K-1 folds and test on the remaining one
- Repeat for all folds
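A minimal sketch of this loop, assuming scikit-learn for the folds and a decision tree as the classifier (K = 5 is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)   # illustrative dataset
    accuracies = []

    # partition the data into K = 5 folds
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # train on K-1 folds, test on the remaining one
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))

    print(sum(accuracies) / len(accuracies))   # average accuracy over the folds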
Confusion Matrix
Table comparing predicted and actual class labels: true positives, false negatives, false positives, and true negatives.
Accuracy
Number of correctly classified objects / Number of classified objects
Not appropriate for unbalanced class label distributions or classes with different relevances.
Recall
Number of objects correctly assigned to C / Number of objects belonging to C
Precision
Number of objects correctly assigned to C / Number of objects assigned to C
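A small numeric sketch of accuracy, recall and precision from illustrative confusion-matrix counts for a class C:

    # illustrative counts: true/false positives and negatives for class C
    tp, fp, fn, tn = 40, 10, 20, 130

    accuracy  = (tp + tn) / (tp + fp + fn + tn)   # correct / classified objects
    recall    = tp / (tp + fn)                    # correctly assigned to C / belonging to C
    precision = tp / (tp + fp)                    # correctly assigned to C / assigned to C
    print(accuracy, recall, precision)            # 0.85, ~0.67, 0.8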
ROC
Receiver Operating Characteristic
True positive rate vs. false positive rate curve
K-means clustering
- Select K random points as initial centroids
- Assign each point to the closest centroid
- Recompute each centroid, normally as the mean of its points
- Iterate until the centroids no longer change
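A compact numpy sketch of those steps (random 2-D data and K = 2 are illustrative; empty clusters are not handled):

    import numpy as np

    # illustrative data: two well-separated 2-D blobs
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    K = 2

    # select K random points as initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)]

    for _ in range(100):
        # assign each point to the closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    print(centroids)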
Bisecting K-means Clustering
- Start with all points in a single cluster
- Split the cluster with the highest SSE using 2-means
- Repeat until K clusters are obtained
Hierarchical Clustering
Constructs a hierarchical tree (dendrogram) of nested clusters.
DBSCAN
Density-based clustering: groups points in dense regions and labels points in low-density regions as noise.
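A minimal usage sketch, assuming scikit-learn (eps and min_samples are illustrative, not tuned):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # two crescent-shaped clusters that centroid-based methods handle poorly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # points in low-density regions are labelled -1 (noise)
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(set(labels))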