Data Mining Flashcards
Data Mining
Extraction of implicit, previously unknown, and potentially useful information from data.
KDD
Knowledge Discovery from Data
Descriptive vs. Predictive
Descriptive methods extract interpretable models that describe the data, while predictive methods predict unknown or future values.
Types of attributes
Nominal: Not ordered data (eye color, ID numbers)
Ordinal: Ordered data (grades, rankings)
Interval: Differences are meaningful, no true zero (calendar dates, temperature in Celsius)
Ratio: True zero, so ratios are meaningful (temperature in Kelvin, length, time)
Properties of attributes
Distinctness (D)
Order (O)
Addition (A)
Multiplication (M)
Nominal -> D
Ordinal -> D, O
Interval -> D, O, A
Ratio -> D, O, A, M
Discrete vs. Continuous
Discrete attributes take values from a finite (or countably infinite) set, while continuous attributes can take infinitely many real values.
Data Quality Problems
- Noise: Modification of original values.
- Outliers: Data objects considerably different from most of the other data.
- Missing Values: Data not collected or attributes not applicable for all objects.
- Duplicate data: Duplicate or near-duplicate objects, often introduced when merging data from heterogeneous sources.
Data Reduction Types
- Sampling
- Feature Selection / Dimensionality reduction
Discretization
Supervised or unsupervised conversion of continuous values into a finite set of discrete values.
Binarization
One hot encoding
Similarity and Dissimilarity
Two measures of how alike and different two data objects are.
[0-1] -> 1 refers to maximum _______ (similarity or dissimilarity)
Types of distance measures
- Euclidean Distance
- Minkowski Distance
- Mahalanobis Distance
- Cosine Similarity
- SMC (Simple Matching Coefficient) Similarity
- Combined similarities
Correlation
Measure between [-1, 1] of the linear relationship between two data objects.
corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
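A minimal numeric sketch of this formula, assuming numpy is available (x and y are illustrative arrays):

    import numpy as np

    # illustrative data with a strong linear relationship
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 10.1])

    # corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    corr = cov_xy / (x.std() * y.std())
    print(corr)                      # close to 1 -> strong positive linear relationship
    print(np.corrcoef(x, y)[0, 1])   # same value using numpy's built-in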
Association Rules
Extraction of frequent correlations or patterns from a transactional database.
AR: Itemset
A set of items.
Example: {Beer, Diapers}
AR: Support
Fraction of transactions that contain an itemset.
Example: sup({beer, diapers})=2/5
AR: Frequent Itemset
An itemset whose support is greater than or equal to a minimum support (minsup) threshold.
AR: Confidence
For a rule A => B, the fraction of transactions containing A that also contain B.
Example: conf=sup(A,B)/sup(A)
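A small sketch of support and confidence on a toy transactional database (item names and transactions are illustrative):

    # toy transactional database (illustrative)
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]

    def support(itemset):
        # fraction of transactions containing every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # conf(A => B) = sup(A, B) / sup(A)
        return support(antecedent | consequent) / support(antecedent)

    print(support({"beer", "diapers"}))        # 2/5 = 0.4
    print(confidence({"beer"}, {"diapers"}))   # (2/5) / (3/5) ≈ 0.67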
Association Rule Extraction Steps
- Extract frequent item sets
- Extract association rules
AR: Brute-force approach
Generate all possible itemsets and extract association rules from them. Computationally infeasible.
AR: Apriori Principle
If an itemset is frequent, then all of its subsets must also be frequent.
Example: {A,B} -> Frequent
{A} and {B} must also be frequent.
AR: Apriori Algorithm
- Extract 1-element itemsets
- Prune itemsets below minsup
- Generate new candidate itemsets from the remaining ones
- Loop until no new frequent itemsets are found
Draw it or create an example.
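A compact sketch of the loop above on a toy database (minsup is an illustrative threshold; the full algorithm also prunes candidates whose subsets are infrequent before counting):

    # toy transactional database and illustrative minimum support
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]
    minsup = 0.4

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    # extract 1-element itemsets and prune those below minsup
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    all_frequent = list(frequent)

    k = 2
    while frequent:
        # generate candidate k-itemsets by joining frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune candidates below minsup
        frequent = [c for c in candidates if support(c) >= minsup]
        all_frequent += frequent
        k += 1

    print(all_frequent)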
Association Rule Extraction Algorithms
- Brute-force approach
- Apriori Algorithm
- FP-growth algorithm
AR: FP-growth Algorithm
- Build the header table by sorting items (NOT itemsets) in decreasing support order.
- Add the header-table items with support greater than or equal to minsup to the frequent itemsets.
- Construct the FP-tree by inserting each transaction with its items in header-table order.
- Select the lowest item in the header table.
- Build its conditional pattern base and the corresponding conditional FP-tree, and mine it.
- Repeat, exploring all items in the header table.
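A minimal sketch of only the first step (the header table) on toy data; minsup_count is an illustrative absolute threshold:

    from collections import Counter

    # toy transactional database (illustrative)
    transactions = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "bread"},
        {"diapers", "milk"},
    ]
    minsup_count = 2  # illustrative absolute minimum support count

    # count single-item supports
    counts = Counter(item for t in transactions for item in t)

    # header table: frequent items sorted by decreasing support
    header_table = [item for item, c in counts.most_common() if c >= minsup_count]
    print(header_table)

    # each transaction is then rewritten with items in header-table order
    # before being inserted into the FP-tree
    ordered = [[i for i in header_table if i in t] for t in transactions]
    print(ordered)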
AR: Correlation
r: A => B
correlation(r) = conf(r) / sup(B)   (also called lift)
Statistical independence -> correlation = 1
Positive correlation -> correlation > 1
Negative correlation -> correlation < 1
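A tiny numeric check of this definition, using values from the toy database above as assumptions:

    # r: beer => diapers, with sup(beer, diapers) = 0.4, sup(beer) = 0.6, sup(diapers) = 0.6
    sup_A_B, sup_A, sup_B = 0.4, 0.6, 0.6

    conf = sup_A_B / sup_A         # confidence of the rule
    correlation = conf / sup_B     # conf(r) / sup(B)
    print(correlation)             # ≈ 1.11 > 1 -> positive correlation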
Accuracy
Quality of the prediction
Interpretability
How easily the model can be understood by a human analyst
Incrementality
Model updating in presence of new labelled records
Efficiency
Building and prediction times
Scalability
Training set size and attribute number
Robustness
How it reacts to noise and missing data
Gini Index Formula
NODES:
Gini(node) = 1 - SUM[P(Ck)^2]
Example:
C1: 1 record
C2: 5 records
P1 = 1/6
P2 = 5/6
Gini = 1 - P1^2 - P2^2 = 1 - (1/6)^2 - (5/6)^2 ≈ 0.278
SPLITS:
Gini(split) = SUM[(#NodeKRecords / #SplitRecords) * Gini(Nk)]
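A short sketch computing the node and split Gini indices defined above (class counts follow the example; the second child node is illustrative):

    def gini_node(class_counts):
        # Gini(node) = 1 - SUM[P(Ck)^2]
        n = sum(class_counts)
        return 1 - sum((c / n) ** 2 for c in class_counts)

    def gini_split(children):
        # weighted average of the children's Gini indices
        total = sum(sum(counts) for counts in children)
        return sum(sum(counts) / total * gini_node(counts) for counts in children)

    print(gini_node([1, 5]))              # 1 - (1/6)^2 - (5/6)^2 ≈ 0.278
    print(gini_split([[1, 5], [4, 2]]))   # Gini of a split with two child nodes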
Decision Tree Elements
Leaf node: final node, holds the predicted class label
Split node: internal node that tests an attribute and routes records to its children
Gini Index
Measures impurity of a node or split.
0: lowest impurity (all records in one class)
1 - 1/#classes: highest impurity (records equally distributed among the classes)
Gini Index for continuous values
- Sort the values and take a candidate split point between each pair of adjacent values
- Compute the Gini index of the resulting binary split for each candidate and keep the best one
Entropy Index
Entropy = - SUM[P(Ck) log2 P(Ck)]
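The same example node measured with entropy instead of Gini (a minimal sketch):

    from math import log2

    def entropy(class_counts):
        # Entropy = - SUM[P(Ck) * log2 P(Ck)]
        n = sum(class_counts)
        return -sum(c / n * log2(c / n) for c in class_counts if c > 0)

    print(entropy([1, 5]))   # ≈ 0.65
    print(entropy([3, 3]))   # 1.0 -> maximum impurity for two classes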
Decision Tree Evaluation
Accuracy: Average
Interpretability: For small trees
Incrementality: Not incremental
Efficiency: Fast
Scalability: Scalable
Robustness: Difficult management of missing data
Random Forest
Divides the original training data into random subsets, builds a decision tree on each subset, and decides the final label by majority voting among the trees.
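A minimal usage sketch, assuming scikit-learn is available (dataset and parameters are illustrative, not tuned):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # illustrative dataset
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each built on a random sample of the training data
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # accuracy of the majority-voted predictions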
Random Forest Evaluation
Accuracy: Better than decision trees
Interpretability: No
Incrementality: No
Efficiency: Fast
Scalability: Scalable
Robustness: Robust
Rule Based Classifiers
Classify records by using a collection of “if…then…” rules.
(Condition) → y
Rule Based Classifier Evaluation
Accuracy: Better than decision trees
Interpretable
Not incremental
Efficient
Scalable
Robust
Instance Based Classifiers
Store training records and use them to predict the class label of unseen cases.
K-nearest Neighbors
Assigns the class label by majority vote among the k nearest training records, according to a distance measure.
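A minimal from-scratch sketch of the idea, with Euclidean distance and majority voting (training data is illustrative):

    from collections import Counter
    import numpy as np

    # illustrative labelled training records
    train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    train_y = np.array(["a", "a", "b", "b"])

    def knn_predict(x, k=3):
        # distances from the query point to every training record
        dists = np.linalg.norm(train_X - x, axis=1)
        # labels of the k closest records, then majority vote
        nearest = train_y[np.argsort(dists)[:k]]
        return Counter(nearest).most_common(1)[0][0]

    print(knn_predict(np.array([1.1, 0.9])))   # prints: a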
K-nearest Neighbor Evaluation
Accuracy: Average
Not interpretable
Incremental
Efficient building but slow classification
Scalable
Distances might affect robustness
Bayesian Classification
Computes the probability that a given record belongs to class C using Bayes' theorem, assuming (naively) that attributes are conditionally independent given the class.
X: Record
C1: class 1
C2: class 2
For C1:
P(C1|X) is proportional to P(X1|C1) * P(X2|C1) * ... * P(C1)
P(Xi|C1) = #of C1 records with value Xi / #ofC1s
P(C1) = #ofC1s / #ofRecords
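A small sketch of these formulas on toy categorical records (attribute names and values are illustrative):

    # toy labelled records: (outlook, windy) -> play
    records = [
        (("sunny", "no"), "yes"),
        (("sunny", "yes"), "no"),
        (("rainy", "no"), "yes"),
        (("rainy", "yes"), "no"),
        (("sunny", "no"), "yes"),
    ]

    def bayes_score(x, c):
        in_class = [r for r, label in records if label == c]
        # P(C) = #ofCs / #ofRecords
        score = len(in_class) / len(records)
        # P(Xi|C) = #of class-C records with value Xi / #ofCs
        for i, value in enumerate(x):
            score *= sum(r[i] == value for r in in_class) / len(in_class)
        return score

    x = ("sunny", "no")
    scores = {c: bayes_score(x, c) for c in ("yes", "no")}
    print(max(scores, key=scores.get), scores)   # predicted class and raw scores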
Bayes Classifier Evaluation
Accuracy: Average
Not interpretable
Incremental
Efficient
Scalable
Robust for statistically independent attributes
SVM Evaluation
Good accuracy
Not interpretable
Not incremental
Average efficiency
Medium scalability
Robust
Cross validation
- Partition the data into K subsets (folds)
- Train on K-1 folds and test on the remaining one
- Repeat for all folds
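A minimal sketch of this loop, assuming scikit-learn for the folds and a decision tree as the classifier (K = 5 is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)   # illustrative dataset
    accuracies = []

    # partition the data into K = 5 folds
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # train on K-1 folds, test on the remaining one
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))

    print(sum(accuracies) / len(accuracies))   # average accuracy over the folds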
Confusion Matrix
Table comparing predicted and actual class labels: true positives, false negatives, false positives, and true negatives.
Accuracy
Number of correctly classified objects / Number of classified objects
Not appropriate for unbalanced class label distributions or classes with different relevances.
Recall
Number of objects correctly assigned to C / Number of objects belonging to C
Precision
Number of objects correctly assigned to C / Number of objects assigned to C
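A small numeric sketch of accuracy, recall and precision from illustrative confusion-matrix counts for a class C:

    # illustrative counts: true/false positives and negatives for class C
    tp, fp, fn, tn = 40, 10, 20, 130

    accuracy  = (tp + tn) / (tp + fp + fn + tn)   # correct / classified objects
    recall    = tp / (tp + fn)                    # correctly assigned to C / belonging to C
    precision = tp / (tp + fp)                    # correctly assigned to C / assigned to C
    print(accuracy, recall, precision)            # 0.85, ~0.67, 0.8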
ROC
Receiver Operating Characteristic
True positive rate vs. false positive rate curve
K-means clustering
- Select K random points as initial centroids
- Assign each point to the closest centroid
- Recompute each centroid, normally as the mean of its points
- Iterate until the centroids no longer change
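A compact numpy sketch of those steps (random 2-D data and K = 2 are illustrative; empty clusters are not handled):

    import numpy as np

    # illustrative data: two well-separated 2-D blobs
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    K = 2

    # select K random points as initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)]

    for _ in range(100):
        # assign each point to the closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    print(centroids)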
Bisecting K-means Clustering
- Start with all points in a single cluster
- Split the cluster with the highest SSE using 2-means
- Repeat until K clusters are obtained
Hierarchical Clustering
Constructs a hierarchical tree (dendrogram) of nested clusters.
DBSCAN
Density-based clustering: groups points in dense regions and labels points in low-density regions as noise.
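A minimal usage sketch, assuming scikit-learn (eps and min_samples are illustrative, not tuned):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # two crescent-shaped clusters that centroid-based methods handle poorly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # points in low-density regions are labelled -1 (noise)
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(set(labels))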