Data Preparation Flashcards
Preprocessing
Data Cleaning: missing values, noisy data, outliers, inconsistencies, duplicates.
Data Integration: multiple databases, data cubes, or files.
Data Reduction: dimensionality, numerosity, compression
Data Transformation and Discretization: normalization, concept hierarchies.
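The normalization mentioned above can be sketched as min-max scaling (a minimal sketch using NumPy; the function name is my own, not from the source):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    # Linearly rescale the values of x into [new_min, new_max].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

ages = np.array([20, 30, 40, 50])
norm = min_max_normalize(ages)  # smallest value maps to 0, largest to 1
```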
Steps
Aggregation: combining two or more attributes (or objects) into one. Aggregated data has less variability.
Sampling: used when processing all the data is too expensive or time consuming.
Types: simple random, without replacement, with replacement, stratified.
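The sampling types above can be illustrated with NumPy (a sketch; the data, labels, and sampling fraction are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)

# Simple random sample without replacement (no item appears twice)
s1 = rng.choice(data, size=10, replace=False)
# Simple random sample with replacement (items may repeat)
s2 = rng.choice(data, size=10, replace=True)

# Stratified sampling: draw the same fraction from each stratum (class)
labels = np.array([0] * 80 + [1] * 20)
frac = 0.1
strat = np.concatenate([
    rng.choice(data[labels == c], size=int(frac * (labels == c).sum()), replace=False)
    for c in np.unique(labels)
])
# strat keeps the class proportions: 8 items from class 0, 2 from class 1
```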
Dimensionality Reduction: avoids the curse of dimensionality; reduces the amount of time and memory required.
May reduce noise. Techniques: PCA (numeric data), SVD.
Feature subset selection: another way to reduce dimensionality; removes redundant and irrelevant features.
Feature creation: deriving new features from existing ones, e.g., age from birthday.
Discretization and Binarization: divide the range of a continuous attribute into intervals. Supervised (uses class labels) or unsupervised.
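Unsupervised discretization can be sketched as equal-width binning (a minimal example with NumPy; the function name and sample values are mine):

```python
import numpy as np

def equal_width_bins(x, k):
    # Unsupervised discretization: split the range of x into k equal-width
    # intervals and assign each value a bin index 0..k-1.
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Digitize against the internal edges only, so indices stay in 0..k-1.
    return np.digitize(x, edges[1:-1])

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
bins = equal_width_bins(x, 3)  # intervals [1,4), [4,7), [7,10]
```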
PCA
Goal: find a projection that captures the largest amount of variation in the data.
Normalize the input data.
Compute k orthonormal vectors (the principal components).
Express the input data as a linear combination of the k principal component vectors.
Principal components are sorted in order of decreasing significance (variance).
Weak components (low variance) can be eliminated.
SVD
M = U Σ V^T
M: original data (n x m)
U: n examples expressed using r new concepts (n x r)
Σ: strength of each concept (r x r, diagonal)
V: m terms, r concepts (m x r)
For dimensionality reduction: set the smallest singular values to zero, keeping only the strongest k.
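The truncation described above can be sketched with NumPy's SVD (a sketch; the matrix and k are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))           # original data, n x m

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                  # keep the k strongest concepts
s_trunc = s.copy()
s_trunc[k:] = 0.0                      # zero out the smallest singular values
M_approx = U @ np.diag(s_trunc) @ Vt   # rank-k approximation of M
```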
Feature Subset Selection
Brute Force: try all possible feature subsets as input to the data mining algorithm.
Embedded approaches: selection occurs naturally as part of the data mining algorithm.
Filter approaches: features are selected before, and independently of, the data mining algorithm.
Wrapper approaches: use the data mining algorithm as a black box to find the best subset.
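A filter approach can be sketched by ranking features on a score computed independently of any learner, e.g. absolute correlation with the target (a sketch; the function name and synthetic data are mine):

```python
import numpy as np

def filter_select(X, y, k):
    # Filter approach: score each feature by |correlation with the target|,
    # with no data mining algorithm in the loop, then keep the top k.
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)  # only feature 2 is relevant
selected = filter_select(X, y, 2)
```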