Data Preprocessing Flashcards
8 Steps of Data Preprocessing
Data cleaning
Sampling
Aggregation
Discretization and Binarization
Feature Transformation & Scaling
Dimensionality Reduction
Feature subset selection
Feature Creation
5 Types of Data Errors
Missing data
Structural errors - Typographical errors and other inconsistencies.
Duplicate data
Irrelevant data
Outliers
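As a rough illustration of data cleaning, the sketch below handles three of these error types (structural errors, duplicate data, missing data) on a tiny made-up table. The records, the `clean` helper, and the choice of median imputation are assumptions for the example, not a prescribed method.

```python
import statistics

# Hypothetical records: (name, city, age); None marks a missing value.
records = [
    ("Ann", "london ", 34),
    ("Ann", "london ", 34),   # duplicate row
    ("Bob", "London", None),  # missing age
    ("Cy", "LONDON", 29),     # inconsistent capitalisation
    ("Dee", "Paris", 31),
]

def clean(rows):
    """Fix structural errors, drop duplicates, and fill missing ages."""
    # Structural errors: normalise whitespace and capitalisation of the city.
    normalised = [(name, city.strip().title(), age) for name, city, age in rows]
    # Duplicate data: keep the first occurrence of each row, preserving order.
    unique = list(dict.fromkeys(normalised))
    # Missing data: fill missing ages with the median of the known ages.
    known = [age for _, _, age in unique if age is not None]
    fill = statistics.median(known)
    return [(n, c, fill if a is None else a) for n, c, a in unique]

cleaned = clean(records)
```

Irrelevant data and outliers usually need domain knowledge to judge, so they are left out of this sketch.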
Sampling
The process of selecting a representative subset of individuals, items, or events from a larger population, in order to estimate or infer information about the population as a whole.
2 Types of Sampling
Simple Random Sampling
Stratified Sampling
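The difference between the two sampling types can be sketched with the standard library. The population, the function names, and the 10% sampling fraction below are illustrative assumptions.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, seed=0):
    """Draw k items, each item equally likely, without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def stratified_sample(items, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical imbalanced population: 90 items of class "A", 10 of class "B".
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
srs = simple_random_sample(population, 10)
strat = stratified_sample(population, key=lambda row: row[0], fraction=0.1)
```

With simple random sampling the rare class "B" may be missing from the sample by chance; the stratified version always keeps the 9:1 class ratio here.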
4 Motivations for Aggregation
A change of scope or scale
Data reduction
Noise reduction
Computation
1 Disadvantage of Aggregation
Potential loss of interesting details.
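A minimal sketch of aggregation, assuming made-up daily sales that are rolled up to monthly summaries. The data and the `aggregate_by_month` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily sales: (ISO date, amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
    ("2024-02-20", 100.0),
    ("2024-02-28", 300.0),
]

def aggregate_by_month(rows):
    """Change the scale from days to months and summarise each month."""
    groups = defaultdict(list)
    for date, amount in rows:
        groups[date[:7]].append(amount)  # "YYYY-MM" is the month key
    return {
        month: {"total": sum(v), "mean": mean(v), "count": len(v)}
        for month, v in groups.items()
    }

monthly = aggregate_by_month(daily_sales)
```

Five rows become two, which shows the data reduction; the day-level spike on 2024-02-28 is no longer visible, which shows the loss of detail.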
3 Types of Unsupervised Discretization
Equal Interval Width
Equal Frequency
K-means - Cluster the values into discrete groups.
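The first two methods can be sketched as follows. The function names and the sample ages are assumptions, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # The maximum would fall just past the last interval, so clamp it in.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so each holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]
by_width = equal_width_bins(ages, 3)
by_frequency = equal_frequency_bins(ages, 3)
```

With the outlier 100 in the data, equal width puts the first eight values into a single bin, while equal frequency still yields three bins of three values each. In this sketch, tied values may be split across neighbouring bins.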
1 Type of Supervised Discretization
Decision Tree
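Supervised discretization uses the class labels to choose the cut points. The sketch below finds a single cut point by minimising weighted entropy, which is what a one-level decision tree does; a full tree would repeat the search inside each resulting interval. The data and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the threshold that gives the purest left/right groups."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold

# Hypothetical data: customer age and whether they bought a product.
ages = [18, 22, 25, 40, 45, 60]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold = best_split(ages, bought)
```

Here the chosen threshold is 32.5, the midpoint between 25 and 40, because it separates the two classes perfectly.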
Feature Transformation & Scaling
A function that maps the entire set of values of a given attribute to a new set of replacement values.
2 Types of Algorithms Sensitive to Feature Scaling
Gradient Descent Based Algorithms
Distance-Based Algorithms
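To see why distance-based algorithms are sensitive to scale, the sketch below compares nearest neighbours before and after rescaling. The customers and the assumed feature ranges are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max(value, lo, hi):
    """Rescale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

# Hypothetical customers: (age in years, income in dollars).
p, q, r = (25, 50_000), (60, 50_500), (26, 52_000)

# Assumed feature ranges, chosen for illustration only.
AGE_RANGE, INCOME_RANGE = (18, 80), (0, 100_000)

def scale(point):
    age, income = point
    return (min_max(age, *AGE_RANGE), min_max(income, *INCOME_RANGE))

# Which of q and r is the nearest neighbour of p?
raw_nearest = min((q, r), key=lambda other: euclidean(p, other))
scaled_nearest = min((q, r), key=lambda other: euclidean(scale(p), scale(other)))
```

On the raw values, income dominates the distance, so q (35 years older) looks closest to p. After scaling, both features contribute comparably and r becomes the nearest neighbour.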
1 Type of Algorithm Insensitive to Feature Scaling
Tree-Based Algorithms
3 Types of Feature Transformation
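|x| Transformation - Takes the absolute value; suited to a variable that is symmetric around zero.
1/x Transformation - Reverses the order of positive values, moving the upper end of the distribution to the lower end.
Log Transform - Converts a right-skewed distribution to a less-skewed distribution that is often closer to normal.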
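Each transformation can be tried on a small list. The sample values below are invented, and the log transform assumes strictly positive inputs.

```python
import math

# Hypothetical values symmetric around zero.
symmetric = [-3, -1, 0, 1, 3]
abs_values = [abs(x) for x in symmetric]

# Reciprocal of positive values: the largest input becomes the smallest output.
positives = [1, 2, 4]
reciprocals = [1 / x for x in positives]

# Hypothetical right-skewed incomes with one very large value.
incomes = [20, 25, 30, 35, 40, 1000]
logged = [math.log(x) for x in incomes]  # requires x > 0

# Ratio of largest to smallest value, before and after the log transform.
spread_before = max(incomes) / min(incomes)
spread_after = max(logged) / min(logged)
```

The log transform compresses the long upper tail: the largest income is 50 times the smallest before the transform, and a little over twice the smallest afterwards.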
4 Types of Feature Scaling
Max Abs Scaling (Between -1 and +1)
Min-Max Scaling (Between 0 and +1)
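Standardisation (Z-Score Normalisation) (Mean 0 and standard deviation 1; values are unbounded, but for normally distributed data about 99.7% fall between -3 and +3)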
Normal distribution / Gaussian distribution (Bell-shaped curve)
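The three scaling formulas named above can be sketched in a few lines. The function names and sample data are assumptions, and each function assumes the values are not all identical.

```python
from statistics import mean, pstdev

def max_abs_scale(values):
    """Divide by the largest absolute value; results lie in [-1, 1]."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]

def min_max_scale(values):
    """Shift and rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [-4.0, 0.0, 2.0, 6.0]
max_abs = max_abs_scale(data)
min_max = min_max_scale(data)
z_scores = standardise(data)
```

Unlike the first two methods, standardisation does not guarantee a fixed range; it only fixes the mean at 0 and the standard deviation at 1.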
Normal distribution
A continuous probability distribution of a random variable that is symmetric about its mean and forms a bell-shaped curve, fully described by its mean and standard deviation.
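As a quick check of how much probability lies near the mean, the sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)
within_2 = share_within(2)
within_3 = share_within(3)
```

The three results are roughly 0.68, 0.95, and 0.997, the familiar 68-95-99.7 rule.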
Dimensionality Reduction
The process of reducing the number of input features in a dataset while preserving the key information or patterns.
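One common technique is principal component analysis (PCA), named here as an example rather than taken from the cards. The sketch below reduces two features to one by projecting onto the direction of greatest variance, using the closed-form angle for a 2x2 covariance matrix. The points and the function name are assumptions.

```python
import math
from statistics import mean

def project_to_one_dimension(points):
    """Project centred 2-D points onto their direction of greatest variance."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    n = len(points)
    # Entries of the 2x2 population covariance matrix.
    sxx = sum(x * x for x in cx) / n
    syy = sum(y * y for y in cy) / n
    sxy = sum(x * y for x, y in zip(cx, cy)) / n
    # Angle of the principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [x * ux + y * uy for x, y in zip(cx, cy)]

# Hypothetical points lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
reduced = project_to_one_dimension(points)
kept_variance = sum(z * z for z in reduced) / len(reduced)
```

Because these points lie on a single line, the one remaining feature keeps all of the original variance (6.25 here); with real data some variance is lost, and the aim is to lose as little as possible.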