Data Preprocessing Flashcards
8 Steps of Data Preprocessing
Data cleaning
Sampling
Aggregation
Discretization and Binarization
Feature Transformation & Scaling
Dimensionality Reduction
Feature subset selection
Feature Creation
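Several of these steps chain naturally in scikit-learn; below is a minimal sketch, assuming a purely numeric feature matrix (the values and the choice of steps are illustrative, not part of the cards):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical numeric feature matrix with one missing value.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 260.0],
              [4.0, 310.0]])

# Cleaning (imputation) -> feature scaling -> dimensionality reduction.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=1)),
])
print(pipeline.fit_transform(X))
```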
5 Types of Data Errors
Missing data
Structural errors - Typographical errors and other inconsistencies.
Duplicate data
Irrelevant data
Outliers
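A pandas sketch of how each error type might be detected or repaired; the records, column names, and the 1.5 * IQR outlier rule are illustrative assumptions:

```python
import pandas as pd

# Hypothetical records exhibiting all five error types.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],           # irrelevant to analysis
    "city": ["London", "london ", "Paris", "Paris", "Oslo", None],
    "age":  [34, 34, 29, 29, 420, 31],          # 420 is an implausible outlier
})

print(df.isna().sum())                            # missing data
df["city"] = df["city"].str.strip().str.title()   # structural errors (case, stray spaces)
df = df.drop_duplicates(subset=["city", "age"])   # duplicate data
df = df.drop(columns=["record_id"])               # irrelevant data

# Outliers: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])
```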
Sampling
The process of selecting a representative subset of individuals, items, or events from a larger population, in order to estimate or infer information about the population as a whole.
2 Types of Sampling
Simple Random Sampling
Stratified Sampling
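Both can be done in a couple of lines of pandas; the population and the 20% sampling fraction below are invented for illustration:

```python
import pandas as pd

# Hypothetical population with an imbalanced class label.
df = pd.DataFrame({
    "value": range(100),
    "label": ["A"] * 90 + ["B"] * 10,
})

# Simple random sampling: every row has the same chance of selection.
simple = df.sample(n=20, random_state=0)

# Stratified sampling: sample within each label group, preserving proportions.
stratified = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
print(stratified["label"].value_counts())  # ~18 A, 2 B, mirroring the population
```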
4 Motivations of Aggregation
A change of scope or scale
Data reduction
Noise reduction
Cheaper computation (aggregated data is smaller and faster to process)
1 Disadvantage of Aggregation
Potential loss of interesting details.
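For instance, aggregating hypothetical daily sales into monthly figures with a pandas groupby (the data is made up):

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "store": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 120, 130],
})

# Aggregating to months changes the scale, shrinks the data, and smooths
# day-to-day noise -- at the cost of losing per-day detail.
monthly = daily.groupby("month", sort=False)["sales"].agg(["sum", "mean"])
print(monthly)
```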
3 Types of Unsupervised Discretization
Equal Interval Width
Equal Frequency
K-means - Divides values into discrete groups (clusters)
1 Type of Supervised Discretization
Decision Tree
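scikit-learn's KBinsDiscretizer covers the three unsupervised strategies directly: strategy='uniform' gives equal interval width, 'quantile' gives equal frequency, and 'kmeans' clusters the values. The supervised variant would instead take its bin edges from a decision tree's split thresholds. The toy values below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [6.0], [7.0], [8.0], [20.0]])

for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(X).ravel())
```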
Feature Transformation & Scaling
A function that maps the entire set of values of a given attribute to a new set of replacement values.
2 Algorithms Sensitive to Feature Scaling
Gradient Descent Based Algorithms
Distance-Based Algorithms
1 Algorithm insensitive to the scale of the features
Tree-Based Algorithms
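A small numeric illustration of why: with raw features, Euclidean distance is dominated by whichever feature has the largest scale, which misleads distance-based methods, while a tree is unaffected because its splits depend only on the ordering of values. The income/age values here are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income (large scale) and age (small scale) for three people.
X = np.array([[30_000.0, 25.0],
              [32_000.0, 60.0],
              [30_500.0, 26.0]])

print(np.linalg.norm(X[0] - X[1]))  # ~2000.3: the 35-year age gap barely registers
print(np.linalg.norm(X[0] - X[2]))  # ~500.0: driven almost entirely by income

X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # age now contributes comparably
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```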
3 Types of Feature Transformation
|x| Transformation - For a variable symmetric around zero; folds both tails onto one side.
1/x Transformation - Moves values at the upper end of the distribution to the lower end (and reverses their order).
Log Transform - Converts a skewed distribution into a less-skewed or approximately normal distribution.
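Each transform is one line of NumPy; the sample values are illustrative:

```python
import numpy as np

# |x|: folds a variable that is symmetric around zero onto one side.
y = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.abs(y))

# 1/x: moves the upper end of the distribution to the lower end (order reverses).
x = np.array([1.0, 2.0, 10.0, 100.0])
print(1.0 / x)

# log: compresses the long right tail of a skewed distribution.
print(np.log(x))
```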
4 Types of Feature Scaling
Max Abs Scaling (Between -1 and +1)
Min-Max Scaling (Between 0 and +1)
Standardisation (Z-Score Normalisation) (Typically between -3 and +3)
Normal distribution / Gaussian distribution (Bell-shaped curve)
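The first three scalers are available directly in scikit-learn. A minimal sketch on made-up values; note that standardisation is not strictly bounded, values merely tend to fall within about -3 to +3 for roughly normal data:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

X = np.array([[-4.0], [0.0], [2.0], [8.0]])

print(MaxAbsScaler().fit_transform(X).ravel())    # within [-1, +1]
print(MinMaxScaler().fit_transform(X).ravel())    # within [0, +1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
```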
Normal distribution
A probability distribution of a random variable that is symmetric about its mean, producing the characteristic bell-shaped curve.
Dimensionality Reduction
The process of reducing the number of input features in a dataset while preserving the key information or patterns.
Principal Components / Latent Variables
Features that are combinations of the original features
Principal Component Analysis (PCA)
A technique to transform a high-dimensional dataset into a lower dimensional space, while retaining as much of the information in the original data as possible.
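A minimal PCA sketch on synthetic data, assuming three features of which the first two are nearly redundant (everything below is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 3-D data that mostly varies along a single direction.
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(scale=0.1, size=(100, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains
```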
5 Motivations of Dimensionality Reduction
Reduced noise
Enhanced interpretability
To visualise the data and gain insights
To speed up a subsequent training algorithm
To save space (compression)
4 Drawbacks of Dimensionality Reduction
Some information is lost
Transformed features are often hard to interpret
Can be computationally intensive
Adds some complexity to pipelines
3 Approaches of Feature Subset Selection
Embedded approaches - Selection occurs naturally as part of the data mining algorithm, e.g. a decision tree
Filter approaches - Features are selected before the data mining algorithm runs, e.g. keeping attributes with low correlation to one another
Wrapper approaches - Search for the best subset of attributes using the model itself (see the sketch after this list)
Forward selection - Adds features one by one
Backward selection - Removes features one by one
Bi-directional elimination (Stepwise Selection)
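scikit-learn's SequentialFeatureSelector implements the wrapper approach: direction='forward' adds features one by one, direction='backward' removes them one by one. A sketch on the built-in iris data; the choice of k-NN as the wrapped estimator and of 2 features is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: greedily add whichever feature most improves
# cross-validated accuracy of the wrapped model.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```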