Data Mining Flashcards
Data Mining
Extraction of implicit, previously unknown, and potentially useful information from data.
KDD
Knowledge Discovery from Data
Descriptive vs. Predictive
Descriptive methods extract interpretable models that describe the data, whereas predictive methods predict unknown or future values.
Types of attributes
Nominal: Unordered categorical data (eye color, ID numbers)
Ordinal: Ordered data (grades, rankings)
Interval: Ordered data with meaningful differences but no true zero (calendar dates, temperature in Celsius)
Ratio: Data with a true zero, so ratios are meaningful (temperature in Kelvin, length, time)
Properties of attributes
Distinctness (D)
Order (O)
Addition (A)
Multiplication (M)
Nominal -> D
Ordinal -> D, O
Interval -> D, O, A
Ratio -> D, O, A, M
Discrete vs. Continuous
Discrete attributes have a finite (or countably infinite) set of values, whereas continuous attributes can take infinitely many real values.
Data Quality Problems
- Noise: Modification of original values.
- Outliers: Data objects considerably different than most of the other data.
- Missing Values: Data not collected or attributes not applicable for all objects.
- Duplicate data: Data objects repeated in the data set, often caused by merging heterogeneous sources.
Data Reduction Types
- Sampling
- Feature Selection / Dimensionality reduction
Discretization
Supervised or unsupervised conversion of continuous values into a finite set of discrete values.
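A minimal sketch of unsupervised discretization via equal-width binning with NumPy; the sample values and bin count are illustrative assumptions.

```python
import numpy as np

values = np.array([1.2, 3.7, 5.5, 7.1, 9.8, 2.4])  # example continuous attribute

# Equal-width binning: split the value range into 3 intervals of equal width
n_bins = 3
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# np.digitize assigns each value the index of the bin it falls into
labels = np.digitize(values, edges[1:-1])
print(labels)  # [0 0 1 2 2 0]
```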
Binarization
Conversion of attributes into one or more binary (0/1) attributes, e.g. one-hot encoding.
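A small sketch of binarization via one-hot encoding in plain Python; the category values are an illustrative assumption.

```python
# One-hot encoding: each distinct category becomes its own 0/1 attribute
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```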
Similarity and Dissimilarity
Two measures of how alike and different two data objects are.
[0-1] -> 1 refers to maximum _______ (similarity or dissimilarity)
Types of distance and similarity measures
- Euclidean Distance
- Minkowski Distance
- Mahalanobis Distance
- Cosine Similarity
- SMC Similarity
- Combined similarities
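A brief sketch of how some of these measures are computed with NumPy, assuming two small example vectors (Mahalanobis is omitted because it needs a covariance matrix estimated from a whole data set).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])

# Euclidean distance: Minkowski distance with p = 2
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Minkowski distance with a generic p (p = 3 chosen as an example)
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)

# Cosine similarity: cosine of the angle between the two vectors
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Simple Matching Coefficient (SMC) on binary vectors: matching attributes / all attributes
a = np.array([1, 0, 0, 1, 1])
b = np.array([1, 0, 1, 1, 0])
smc = np.mean(a == b)

print(euclidean, minkowski, cosine, smc)
```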
Correlation
Measure between [-1, 1] of the linear relationship between two data objects.
corr(x, y) = covariance(x, y) / (std_deviation(x) * std_deviation(y))
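A quick NumPy sketch showing that the formula above matches Pearson correlation; the sample vectors are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# corr(x, y) = cov(x, y) / (std(x) * std(y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / (x.std() * y.std())

print(corr)                     # correlation from the formula
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy's built-in
```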
Association Rules
Extraction of frequent correlations or patterns from a transactional database.
AR: Itemset
A set of items.
Example: {Beer, Diapers}
AR: Support
Fraction of transactions that contain an itemset.
Example: if 2 of 5 transactions contain the itemset, sup({Beer, Diapers}) = 2/5
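A minimal sketch of counting support, assuming a hypothetical five-transaction database:

```python
# Hypothetical transactional database with five transactions
transactions = [
    {"Beer", "Diapers", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Diapers"},
    {"Milk", "Bread"},
    {"Diapers", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Beer", "Diapers"}, transactions))  # 2/5 = 0.4
```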
AR: Frequent Itemset
An itemset whose support is greater than or equal to a minimum support (minsup) threshold.
AR: Confidence
For a rule A => B, the fraction of transactions containing A that also contain B.
Example: conf(A => B) = sup(A, B) / sup(A)
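A short sketch extending the hypothetical support function above to rule confidence:

```python
def confidence(antecedent, consequent, transactions):
    """conf(A => B) = sup(A and B together) / sup(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Using the five-transaction example above
print(confidence({"Beer"}, {"Diapers"}, transactions))  # (2/5) / (3/5) = 2/3
```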
Association Rule Extraction Steps
- Extract frequent itemsets
- Extract association rules from them
AR: Brute-force approach
Generate all possible itemsets and extract association rules from each one. Computationally infeasible.
AR: Apriori Principle
If an itemset is frequent, then all of its subsets must also be frequent.
Example: {A,B} -> Frequent
{A} and {B} must also be frequent.
AR: Apriori Algorithm
- Extract 1-element itemsets
- Prune itemsets below the minsup threshold
- Generate new candidate itemsets from the surviving ones
- Loop until no new frequent itemsets are found
Draw it or create an example (see the sketch below).
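A compact sketch of the level-wise Apriori loop in plain Python, reusing the hypothetical transactions and support function from the cards above; minsup = 0.4 is an arbitrary assumption.

```python
def apriori(transactions, minsup=0.4):
    items = sorted({item for t in transactions for item in t})

    # Level 1: frequent 1-element itemsets
    frequent = [frozenset([i]) for i in items
                if support({i}, transactions) >= minsup]
    all_frequent = list(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune candidates below minsup (the Apriori principle justifies this level-wise search)
        frequent = [c for c in candidates if support(c, transactions) >= minsup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

print(apriori(transactions))
```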