Data Preprocessing Flashcards
Name 4 potential data issues
- Incompleteness (NAs)
- Inconsistency (age = 52, DoB = 29/06/92)
- Duplicates
- Noise (random errors in recorded values; often shows up as outliers)
Name 4 methods used in Data Inspection (aka Exploratory Data Analysis)
- Scatter plots - correlations for bivariate
- Histograms - distributions
- Boxplots - 5 number summary - outliers
- Attribute combination experiments - e.g. ratios or differences of existing attributes
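The boxplot card can be made concrete with a small sketch. It computes the 5 number summary and flags outliers with the common 1.5 x IQR whisker rule. The `ages` list, the function names, and the choice of the "inclusive" quantile method are assumptions made for illustration; plotting libraries may compute quartiles slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions give slightly different cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data: one implausible age among otherwise similar values.
ages = [21, 23, 24, 25, 26, 27, 29, 30, 95]
print(five_number_summary(ages))
print(iqr_outliers(ages))
```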
Name 3 ways to deal with missing data
- Elimination
- Identification/inspection
- Imputation
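A minimal sketch of the three options above, using plain Python dictionaries as rows. The records, column names, and function names are hypothetical, and mean imputation is only one of several possible imputation strategies.

```python
import statistics

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 30, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 50, "income": None},
    {"age": 40, "income": 61000},
]

def count_missing(rows):
    """Identification/inspection: count missing values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def drop_incomplete(rows):
    """Elimination: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, column):
    """Imputation: replace missing values in one column with the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

print(count_missing(rows))
print(drop_incomplete(rows))
print(impute_mean(rows, "age"))
```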
Name 3 ways to encode categorical variables
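- Integer encoding - one integer per category; communicates that there is ordinality to the categories, so it suits ordinal data
- One-hot encoding - one new binary column per category. Eliminates ordinality implications, but adds many columns when there are many categories (storing it as a sparse matrix keeps it efficient)
- Binary encoding - writes each category's integer code in binary, one column per bit. Needs far fewer columns than one-hot and weakens the implied ordinality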
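The three encodings can be compared side by side in a short sketch. The `colours` list and function names are hypothetical, and categories are assigned integers in alphabetical order purely for reproducibility.

```python
def integer_encode(values):
    """Map each category to an integer (alphabetical order, an arbitrary choice)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """One 0/1 column per category, with exactly one 1 in each row."""
    codes, categories = integer_encode(values)
    return [[1 if code == i else 0 for i in range(len(categories))] for code in codes]

def binary_encode(values):
    """Write each integer code in binary, one column per bit (most significant first)."""
    codes, categories = integer_encode(values)
    width = max(1, (len(categories) - 1).bit_length())
    return [[(code >> bit) & 1 for bit in reversed(range(width))] for code in codes]

colours = ["red", "green", "blue", "green"]
print(integer_encode(colours)[0])
print(one_hot_encode(colours))
print(binary_encode(colours))
```

With three categories, one-hot needs three columns while binary encoding needs two; the gap grows with the number of categories.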
What is the goal of data transformation?
To provide data to the ML model in a format that is more accurately interpretable.
Name Youfar’s 5 ‘SSAAG’ ways for transforming data
- Scaling
- Smoothing
- Attribute construction
- Aggregation
- Generalization
Name the additional data transformation described in ‘Hands-On Machine Learning’
Transforming distributions
- Making asymmetrical distributions symmetrical
- Making multimodal distributions more interpretable for regression models
3 reasons for scaling data
- Distance/similarity measuring techniques will be skewed towards features with larger ranges
- Faster convergence for gradient descent learning algorithms
- Appropriately penalizing coefficients in loss functions involving regularisation
Name 2 methods for scaling
- MinMaxScaling (normalisation) - subtracts the minimum and divides by the range (max - min); normally puts values in a range between 0 and 1.
- Z-score scaling (standardisation) - subtracts the mean and divides by the standard deviation; outputs values without a specific range, but their mean will be 0 and SD will be 1.
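Both scalers are short enough to sketch directly. The function names and sample values are illustrative, and the population standard deviation is an assumed choice (the sample standard deviation would give slightly different results).

```python
import statistics

def min_max_scale(values, low=0.0, high=1.0):
    """Normalisation: map the minimum to `low` and the maximum to `high`."""
    lo, hi = min(values), max(values)
    span = hi - lo  # zero for a constant column, which would need special handling
    return [low + (v - lo) / span * (high - low) for v in values]

def z_score_scale(values):
    """Standardisation: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    return [(v - mean) / sd for v in values]

values = [10, 20, 30, 40, 50]
print(min_max_scale(values))
print(z_score_scale(values))
```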
Name the con of MinMaxScaling and pro of Z-score scaling
Con (MinMaxScaling) = sensitive to outliers, since a single extreme value sets the range
Pro (Z-score scaling) = less affected by outliers
How might you transform an asymmetrical distribution?
- Logarithm
- Bin into percentiles
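Both options can be sketched with the standard library. The sample values and function names are hypothetical; `log1p` (log of 1 + x) is an assumed variant chosen so that zeros are handled, and quartiles stand in for percentiles to keep the example small.

```python
import math
import statistics
from bisect import bisect_left

def log_transform(values):
    """Compress a long right tail; log1p(x) = log(1 + x) is defined at zero."""
    return [math.log1p(v) for v in values]

def percentile_bin(values, n_bins=4):
    """Replace each value with the index of the quantile bin it falls into."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    # A value equal to a cut point goes into the lower bin.
    return [bisect_left(cuts, v) for v in values]

# Hypothetical right-skewed feature with one very large value.
skewed = [1, 2, 3, 4, 5, 6, 7, 1000]
print(log_transform(skewed))
print(percentile_bin(skewed))
```

After binning, the extreme value lands in the top bin alongside its nearest neighbour, so it no longer dominates the feature's scale.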
How might you transform a multimodal distribution?
- Binning and then categorising
- Create a new feature for each mode that measures similarity to that mode, using a radial basis function (RBF)
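A minimal sketch of an RBF similarity feature, assuming the Gaussian form exp(-gamma * (x - mode)^2). The mode value 35, the `gamma` setting, and the function name are illustrative assumptions; `gamma` controls how quickly similarity decays with distance from the mode and would normally be tuned.

```python
import math

def rbf_similarity(values, mode, gamma=0.1):
    """Gaussian RBF: 1.0 at the mode, decaying towards 0 as values move away from it."""
    return [math.exp(-gamma * (v - mode) ** 2) for v in values]

# Hypothetical feature with a mode around 35; one such feature would be built per mode.
ages = [35, 30, 10]
print(rbf_similarity(ages, mode=35))
```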
When and why is imbalanced data a problem?
When: classification problems where the target classes are not evenly represented
Why:
1. Could lead to poorer model performance for the minority class
2. The minority class is often the one we’re most interested in classifying correctly
Name 2 ways of correcting imbalanced data
- Oversampling, e.g. using SMOTE (Synthetic Minority Over-sampling Technique)
- Undersampling by removing data points from the majority class
Describe SMOTE (Synthetic Minority Over-sampling Technique)
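Random undersampling is simple enough to sketch in a few lines. This version assumes every class is cut down to the size of the smallest one; the function name, the seed parameter, and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical data: 8 majority rows, 2 minority rows.
rows = list(range(10))
labels = ["maj"] * 8 + ["min"] * 2
new_rows, new_labels = random_undersample(rows, labels)
print(Counter(new_labels))
```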
- Randomly select a minority class data point
- Find its k nearest neighbours (typically k = 5) from within the minority class
- Select one at random and connect with a line
- Generate a new data point at a random position along that line
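The SMOTE steps can be sketched as below. This is a simplified illustration rather than a library implementation: it assumes numeric features, Euclidean distance, and neighbours drawn only from the minority class. The function name, parameters, and sample points are hypothetical.

```python
import math
import random

def smote_sample(minority, k=5, n_new=1, seed=0):
    """Generate n_new synthetic points from a list of minority-class points (tuples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Randomly select a minority class data point.
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbours (needs at least two minority points).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Select one neighbour at random.
        neighbour = rng.choice(neighbours)
        # 4. Pick a random position on the line segment between the two points.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority class points.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.5, 1.5), (3.0, 3.0)]
new_points = smote_sample(minority, k=5, n_new=10)
print(new_points[:3])
```

Because each new point is an interpolation between two existing minority points, it always falls inside the region the minority class already occupies.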