Data Mining Flashcards

1
Q

Data Mining

A

Extraction of implicit, previously unknown, and potentially useful information from data

2
Q

KDD

A

Knowledge Discovery from Data

3
Q

Descriptive vs. Predictive

A

Descriptive methods extract interpretable models that describe the data, while predictive methods predict unknown or future values.

4
Q

Types of attributes

A

Nominal: Unordered categories (eye color, ID numbers)
Ordinal: Ordered categories (grades, rankings, height classes such as short/medium/tall)
Interval: Ordered values whose differences are meaningful, but with no true zero (calendar dates)
Ratio: Values with a true zero, so ratios are meaningful (temperature in Kelvin, length, time)

5
Q

Properties of attributes

A

Distinctness (D)
Order (O)
Addition (A)
Multiplication (M)

Nominal -> D
Ordinal -> D, O
Interval -> D, O, A
Ratio -> D, O, A, M

6
Q

Discrete vs. Continuous

A

Discrete attributes take values from a finite (or countably infinite) set, while continuous attributes can take infinitely many real values.

7
Q

Data Quality Problems

A
  • Noise: Modification of original values.
  • Outliers: Data objects considerably different than most of the other data.
  • Missing Values: Data not collected or attributes not applicable for all objects.
  • Duplicate data: The same object appears more than once, typically as a result of merging heterogeneous sources.
8
Q

Data Reduction Types

A
  • Sampling
  • Feature Selection / Dimensionality reduction
10
Q

Discretization

A

Supervised or unsupervised conversion of continuous values into a finite set of discrete values.

11
Q

Binarization

A

Conversion of attributes into one or more binary attributes, e.g. one-hot encoding of categorical values.
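A minimal one-hot encoding sketch with pandas (the eye_color column is an assumed toy example, not from the cards):

  import pandas as pd

  # one binary (0/1) column per distinct category
  df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "blue"]})
  print(pd.get_dummies(df["eye_color"]))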

12
Q

Similarity and Dissimilarity

A

Two measures of how alike and different two data objects are.
[0-1] -> 1 refers to maximum _______ (similarity or dissimilarity)

13
Q

Types of distance measures

A
  • Euclidean Distance
  • Minkowski Distance
  • Mahalanobis Distance
  • Cosine Similarity
  • SMC (Simple Matching Coefficient) Similarity
  • Combined similarities
14
Q

Correlation

A

Measure between [-1, 1] of the linear relationship between two data objects.

corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
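A minimal numpy sketch of the formula (x and y are assumed toy vectors); it should agree with numpy's built-in corrcoef:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0])
  y = np.array([2.0, 4.0, 6.0, 8.1])

  # covariance divided by the product of the standard deviations
  corr = np.cov(x, y, bias=True)[0, 1] / (x.std() * y.std())
  print(corr, np.corrcoef(x, y)[0, 1])  # the two values match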

15
Q

Association Rules

A

Extraction of frequent correlations or patterns from a transactional database.

16
Q

AR: Itemset

A

A set of items.
Example: {Beer, Diapers}

17
Q

AR: Support

A

Fraction of transactions that contain an itemset.
Example: sup({beer, diapers})=2/5

18
Q

AR: Frequent Itemset

A

An itemset whose support is greater than or equal to a minimum support threshold (minsup).

19
Q

AR: Confidence

A

For a rule A => B: the frequency of {A, B} in the transactions containing A.
conf(A => B) = sup(A, B) / sup(A)
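A minimal sketch of support and confidence on an assumed toy transaction database (chosen so that sup({beer, diapers}) = 2/5, as in the earlier example):

  transactions = [
      {"beer", "diapers", "bread"},
      {"beer", "diapers"},
      {"bread", "milk"},
      {"beer", "milk"},
      {"diapers", "milk"},
  ]

  def sup(itemset):
      # fraction of transactions containing the itemset
      return sum(itemset <= t for t in transactions) / len(transactions)

  # rule {beer} => {diapers}
  print(sup({"beer", "diapers"}))                  # support    = 2/5
  print(sup({"beer", "diapers"}) / sup({"beer"}))  # confidence = 2/3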

20
Q

Association Rule Extraction Steps

A
  1. Extract frequent item sets
  2. Extract association rules
21
Q

AR: Brute-force approach

A

Generate all possible itemsets and extract association rules from them. Computationally infeasible.

22
Q

AR: Apriori Principle

A

If an itemset is frequent, then all of its subsets must also be frequent.
Example: {A,B} -> Frequent
{A} and {B} must also be frequent.

23
Q

AR: Apriori Algorithm

A
  1. Extract 1-element itemsets
  2. Prune itemsets whose support is below minsup
  3. Generate new candidate itemsets of the next size
  4. Loop until no new candidates remain (see the sketch below)
    Draw it or create an example.
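A minimal Apriori sketch over the same assumed toy transactions used in the confidence example (naive candidate generation, just to illustrate the prune/generate loop):

  transactions = [
      {"beer", "diapers", "bread"},
      {"beer", "diapers"},
      {"bread", "milk"},
      {"beer", "milk"},
      {"diapers", "milk"},
  ]
  minsup = 2 / 5

  def sup(itemset):
      return sum(itemset <= t for t in transactions) / len(transactions)

  # 1.-2. frequent 1-element itemsets
  items = {i for t in transactions for i in t}
  frequent = [frozenset({i}) for i in items if sup({i}) >= minsup]
  all_frequent = list(frequent)

  # 3.-4. generate larger candidates from the frequent itemsets, prune, loop
  k = 2
  while frequent:
      candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
      frequent = [c for c in candidates if sup(c) >= minsup]
      all_frequent.extend(frequent)
      k += 1

  print(sorted(all_frequent, key=len))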
24
Q

Association Rule Extraction Algorithms

A
  • Brute-force approach
  • Apriori Algorithm
  • FP-growth algorithm
25
Q

AR: FP-growth Algorithm

A
  1. Build the header table by sorting items (NOT itemsets) in decreasing support order.
  2. Add the header-table items whose support is at least minsup to the frequent set.
  3. Build the FP-tree from the transactions.
  4. Select the lowest item in the header table.
  5. Create new (conditional) transactions for that item from the FP-tree paths.
  6. Repeat, exploring all items in the latest header table.
26
Q

AR: Correlation

A

For a rule r: A => B
correlation(r) = conf(r) / sup(B)
Statistical independence -> correlation = 1
Positive correlation -> correlation > 1
Negative correlation -> correlation < 1

27
Q

Accuracy

A

Quality of the prediction

28
Q

Interpretability

A

How easily a human can interpret the model and its predictions.

29
Q

Incrementality

A

Model updating in presence of new labelled records

30
Q

Efficiency

A

Building and prediction times

31
Q

Scalability

A

Training set size and attribute number

32
Q

Robustness

A

How it reacts to noise and missing data

33
Q

Gini Index Formula

A

NODES:
Gini(N) = 1 - SUM_k[ P(Ck)^2 ]
Example:
C1 1
C2 5

P1 = 1/6
P2 = 5/6
Gini = 1 - P1^2 - P2^2

SPLITS:
Gini(split) = SUM_k[ (#NodeKRecords / #SplitRecords) * Gini(Nk) ]
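A minimal sketch of both computations (the class counts are toy values; the first call reproduces the C1 = 1, C2 = 5 example):

  def gini_node(counts):
      # counts = number of records of each class in the node
      total = sum(counts)
      return 1 - sum((c / total) ** 2 for c in counts)

  def gini_split(nodes):
      # nodes = list of per-node class counts produced by the split
      total = sum(sum(counts) for counts in nodes)
      return sum(sum(counts) / total * gini_node(counts) for counts in nodes)

  print(gini_node([1, 5]))             # 1 - (1/6)^2 - (5/6)^2 ≈ 0.278
  print(gini_split([[1, 5], [4, 2]]))  # weighted Gini of a two-node split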

34
Q

Decision Tree Elements

A

Node (leaf): final node carrying the predicted class label
Split: internal node that tests an attribute and partitions the records

35
Q

Gini Index

A

Measures the impurity of a node or split.
0: minimum impurity (all records in one class)
1 - 1/#classes: maximum impurity (records equally distributed among the classes)

36
Q

Gini Index for continuous values

A
  1. Sort the values and take a candidate split point between each pair of consecutive values
  2. Compute the Gini index of the resulting binary split for each candidate and keep the best one
37
Q

Entropy Index

A

Entropy = - SUM[P(Ck) log2 P(Ck)]
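A minimal entropy sketch in the same class-count style as the Gini example (toy counts):

  from math import log2

  def entropy(counts):
      total = sum(counts)
      probs = [c / total for c in counts if c > 0]
      return -sum(p * log2(p) for p in probs)

  print(entropy([1, 5]))  # ≈ 0.650
  print(entropy([3, 3]))  # 1.0, the maximum for two classes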

38
Q

Decision Tree Evaluation

A

Accuracy: Average
Interpretability: For small trees
Incrementality: Not incremental
Efficiency: Fast
Scalability: Scalable
Robustness: Difficult management of missing data

39
Q

Random Forest

A

Divides the original training data into random subsets, builds a decision tree on each subset, and decides the final label by majority voting.

40
Q

Random Forest Evaluation

A

Accuracy: Better than decision trees
Interpretability: No
Incrementality: No
Efficiency: Fast
Scalability: Scalable
Robustness: Robust

41
Q

Rule Based Classifiers

A

Classify records by using a collection of “if…then…” rules.
(Condition) → y

42
Q

Rule Based Classifier Evaluation

A

Accuracy: Better than decision trees
Interpretable
Not incremental
Efficient
Scalable
Robust

43
Q

Instance Based Classifiers

A

Store training records and use them to predict the class label of unseen cases.

44
Q

K-nearest Neighbors

A

Assigns the label held by the majority of the k nearest training records.
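A minimal k-NN sketch (the 2-D training points are assumed toy data):

  from collections import Counter
  import math

  train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
           ((4.0, 4.2), "B"), ((4.5, 4.0), "B")]

  def knn_predict(x, k=3):
      # take the k closest training records and vote on their labels
      neighbors = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
      return Counter(label for _, label in neighbors).most_common(1)[0][0]

  print(knn_predict((1.1, 0.9)))  # "A"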

45
Q

K-nearest Neighbor Evaluation

A

Accuracy: Average
Not interpretable
Incremental
Efficient building but slow classification
Scalable
Distances might affect robustness

46
Q

Bayesian Classification

A

Computes the probability that a given record belongs to class C using Bayes' theorem; the naïve version assumes the attributes are conditionally independent given the class.

X = (X1, ..., Xn): record
C1: class 1
C2: class 2
For C1:
P(X1|C1) * P(X2|C1) * ... * P(Xn|C1) * P(C1)
P(Xi|C1) = # of C1 records with value Xi / # of C1 records
P(C1) = # of C1 records / # of records
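A minimal categorical naïve-Bayes scoring sketch (records, attributes, and class labels are assumed toy data):

  records = [({"outlook": "sunny", "windy": "no"},  "play"),
             ({"outlook": "sunny", "windy": "yes"}, "stay"),
             ({"outlook": "rain",  "windy": "no"},  "play"),
             ({"outlook": "rain",  "windy": "yes"}, "stay")]

  def score(x, c):
      in_c = [r for r, label in records if label == c]
      p_c = len(in_c) / len(records)                # P(C)
      p_x_given_c = 1.0
      for attr, value in x.items():                 # product of the P(Xi | C)
          p_x_given_c *= sum(r[attr] == value for r in in_c) / len(in_c)
      return p_x_given_c * p_c

  x = {"outlook": "sunny", "windy": "no"}
  print(max(["play", "stay"], key=lambda c: score(x, c)))  # "play"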

47
Q

Bayes Classifier Evaluation

A

Accuracy: Average
Not interpretable
Incremental
Efficient
Scalable
Robust for statistically independent attributes

48
Q

SVM Evaluation

A

Good accuracy
Not interpretable
Not incremental
Average efficiency
Medium scalability
Robust

49
Q

Cross validation

A
  1. Partition the data into K subsets (folds)
  2. Train on K-1 folds and test on the remaining one
  3. Repeat for all folds (see the sketch below)
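A minimal K-fold index-generation sketch (the round-robin fold assignment is an assumption of this example; the model training and testing themselves are left out):

  def k_fold_indices(n, k):
      folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
      for i in range(k):
          test = folds[i]
          train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
          yield train, test

  for train_idx, test_idx in k_fold_indices(n=10, k=5):
      print(sorted(test_idx))  # each record appears in exactly one test fold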
50
Q

Confusion Matrix

A

Table comparing actual and predicted class labels: true positives, false negatives, false positives, and true negatives.

51
Q

Accuracy

A

correct/number of classified objects

Not appropriate for unbalanced class label distributions or classes with different relevances.

52
Q

Recall

A

Number of objects correctly assigned to C / Number of objects belonging to C

53
Q

Precision

A

Number of objects correctly assigned to C / Number of objects assigned to C
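A minimal sketch computing accuracy, recall, and precision for class C from assumed toy labels:

  actual    = ["C", "C", "C", "D", "D", "D"]
  predicted = ["C", "C", "D", "C", "D", "D"]

  tp        = sum(a == "C" and p == "C" for a, p in zip(actual, predicted))
  accuracy  = sum(a == p for a, p in zip(actual, predicted)) / len(actual)  # 4/6
  recall    = tp / sum(a == "C" for a in actual)                            # 2/3
  precision = tp / sum(p == "C" for p in predicted)                         # 2/3
  print(accuracy, recall, precision)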

54
Q

ROC

A

Receiver Operating Characteristic

Curve of the true positive rate vs. the false positive rate.

55
Q

K-means clustering

A
  1. Select K random points as the initial centroids
  2. Assign each point to the closest centroid
  3. Recompute each centroid, normally as the mean of its points
  4. Iterate until the centroids no longer change (see the sketch below)
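A minimal K-means sketch with numpy (the 2-D points are assumed toy data):

  import numpy as np

  X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.5, 4.0], [4.2, 3.9]])
  k = 2
  rng = np.random.default_rng(0)
  centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. random initial centroids

  while True:
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = dists.argmin(axis=1)                         # 2. closest centroid
      new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. means
      if np.allclose(new_centroids, centroids):             # 4. stop when centroids are fixed
          break
      centroids = new_centroids

  print(labels, centroids)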
56
Q

Bisecting K-means Clustering

A
  1. Start with all the points in one cluster and split it in two (e.g. with 2-means)
  2. Split the cluster with the highest SSE
  3. Repeat until K clusters are obtained
57
Q

Hierarchical Clustering

A

Constructs a hierarchical tree (dendrogram) of nested clusters.

58
Q

DBSCAN

A

Density-based clustering: clusters are high-density regions of points separated by low-density regions.