2. Data Pre-processing Flashcards
1
Q
Types/Techniques of Data Preprocessing
A
- Aggregation
- Sampling (Simple random, with replacement and stratified)
- Dimensionality reduction
- Discretization + Binarization
- Attribute Transformation (normalization, standardization)
- Attribute Subset Selection
- Attribute Creation
(Data visualization, for exploring relationships between attributes)
2
Q
Aggregation purpose
A
- change of scale
- data reduction
- data variability reduction
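A minimal sketch of aggregation, assuming hypothetical daily sales records that are rolled up into monthly totals. The data and the name aggregate_by_month are illustrative only.

```python
from collections import defaultdict

# Hypothetical daily sales records: (date as "YYYY-MM-DD", amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
    ("2024-02-20", 50.0),
    ("2024-02-28", 50.0),
]

def aggregate_by_month(records):
    """Sum amounts per month (change of scale: days -> months)."""
    totals = defaultdict(float)
    for date, amount in records:
        month = date[:7]  # keep only "YYYY-MM"
        totals[month] += amount
    return dict(totals)

monthly_sales = aggregate_by_month(daily_sales)
print(monthly_sales)
```

Five daily records collapse into two monthly ones, which is the data reduction; the monthly totals also fluctuate less than individual days tend to.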
3
Q
Sampling purpose
A
- Reduce the time and cost of obtaining or processing the full data set
== data reduction
4
Q
Sampling techniques
A
- Simple random sampling (without replacement)
- Sampling with replacement
- Stratified sampling (split the data into partitions, then sample from each)
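The three techniques can be sketched with the standard library. This is an illustration under assumptions: the function names and the toy data are hypothetical, and the stratified variant draws an equal number of objects per partition (drawing proportionally to partition size is another common choice).

```python
import random
from collections import defaultdict

def simple_random_sample(items, n, rng):
    """Without replacement: each object is picked at most once."""
    return rng.sample(items, n)

def sample_with_replacement(items, n, rng):
    """With replacement: the same object may be picked several times."""
    return rng.choices(items, k=n)

def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key, then draw up to per_stratum from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy data: class "a" is common (90 objects), class "b" is rare (10 objects).
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

srs = simple_random_sample(data, 10, rng)
swr = sample_with_replacement(data, 10, rng)
strat = stratified_sample(data, key=lambda r: r[0], per_stratum=5, rng=rng)
```

Stratified sampling guarantees that the rare class "b" is represented, which a simple random sample of the same size does not.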
5
Q
Discretization purpose
A
Change of scale: map a continuous attribute to a small number of intervals (categories)
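One simple way to discretize is equal-width binning, sketched below. The function name and the sample ages are assumptions for illustration; other schemes (equal-frequency, clustering-based) exist.

```python
def equal_width_bins(values, k):
    """Map each value to a bin index in 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    # The maximum lies exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 17, 25, 38, 41, 59, 63]  # hypothetical continuous attribute
age_bins = equal_width_bins(ages, 3)
print(age_bins)
```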
6
Q
Binarization purpose
A
Change of scale: map an attribute to one or more binary (0/1) attributes
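A minimal sketch of one common binarization scheme, assuming a categorical attribute that is turned into one 0/1 attribute per distinct value (often called one-hot encoding). The function name and data are hypothetical.

```python
def binarize(values):
    """Create one 0/1 attribute per distinct category value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

quality = ["poor", "good", "ok", "good"]  # hypothetical categorical attribute
categories, encoded = binarize(quality)
print(categories)
print(encoded)
```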
7
Q
Attribute transformation purpose
A
Change of scale
Maps the complete set of an attribute's values to a new set of values
e.g. simple functions: log(x), |x|, e^x, x^k; normalization; standardization
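A short sketch of two such transformations, assuming "normalization" means min-max scaling to [0, 1] and "standardization" means z-scores (zero mean, unit standard deviation); usage of these terms varies between sources. Both functions assume the attribute is not constant.

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

x = [2.0, 4.0, 6.0, 8.0]  # hypothetical attribute values
x_norm = min_max_normalize(x)
x_std = standardize(x)
```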
8
Q
Dimensionality reduction purpose
A
- Reduce time and memory costs for data mining algorithms
- Remove irrelevant features (noise)
- Make visualization easier
Example: PCA
9
Q
Dimensionality reduction techniques
A
- Principal Component Analysis (PCA)
- Attribute Subset Selection (if data set contains irrelevant or redundant/duplicate information)
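To make the PCA idea concrete, here is a sketch restricted to two attributes, where the first principal component can be found in closed form from the 2x2 covariance matrix. The function name and data are hypothetical; for more attributes one would typically use an eigen-decomposition or SVD routine from a numerical library instead.

```python
import math

def pca_2d_first_component(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix (population form).
    sxx = sum(cx * cx for cx, _ in centered) / n
    syy = sum(cy * cy for _, cy in centered) / n
    sxy = sum(cx * cy for cx, cy in centered) / n
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # One score per object: two attributes reduced to one.
    scores = [cx * direction[0] + cy * direction[1] for cx, cy in centered]
    return direction, scores

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # lies on y = 2x
direction, scores = pca_2d_first_component(points)
```

Because the toy points lie exactly on a line, the single component keeps all of their variance; with real data some variance is lost in the projection.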
10
Q
Attribute creation purpose
A
Capture important information better
(Map the original attributes onto newly created ones)
11
Q
Data visualization techniques
A
- Box plot: Percentiles + outliers
- Scatter plot: relationship analysis (2D, 3D)
- Mean graph: relationship analysis (categorical)
- Matrix plot: relationship analysis (attributes are often normalized to prevent one attribute from dominating the plot)
- Parallel Coordinates (each object is a line): relationship analysis (suited to higher-dimensional data)
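As an illustration of what a box plot summarizes, the sketch below computes the quartiles and flags outliers. It assumes the common 1.5 x IQR convention for outliers, which is a choice rather than the only definition; the function name and data are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Return quartiles and outliers using the 1.5 * IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

measurements = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an obvious outlier
stats = box_plot_stats(measurements)
print(stats)
```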