L6 - Preprocessing (Cleaning, Transformation, Visualisation) Flashcards
What are the 3 issues that require data cleaning…?
Missing Values
Outliers
Errors
What are missing values in the context of data preprocessing?
Data that we expect to have but is missing.
This can be due to human error, bugs, etc.
What are the 3 types of missing values? Define each…
MCAR (Missing Completely At Random) - A value that is missing purely by chance. The model can often account for this.
MAR (Missing At Random) - Certain data values are more likely to be missing. The reason for the missing data is related to other observed data, but not to the missing value itself. For example, high wind speed breaks an air quality sensor.
MNAR (Missing Not At Random) - The reason for the missing value is related to the missing data itself, so we can predict which values will be missing. E.g. an air quality sensor can't measure because the air quality is too poor.
When solving missing values, what are the 2 principles we need to keep in mind?
- Prioritise data information preservation
- Minimise bias introduction
What are the 4 solutions to missing values?
- Keep as is
- Remove rows
- Remove columns
- Impute values
When should we use each of the missing value solutions?
Keep as is - When sharing data so a collective decision can be made regarding what to do.
Remove rows - Use as a last resort when dealing with MCAR. Don’t use with MAR or MNAR due to bias introduction.
Remove columns - If the missing-value rate of a column exceeds 25%, the column can be removed.
Impute values - Replace the missing values with a calculated value, e.g. the mean of the column.
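A minimal sketch of three of these strategies on a toy dataset (plain Python; the column names, values, and where the 25% threshold lands are invented for illustration):

```python
# Toy dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 31, "income": None},
    {"age": 40, "income": None},
]

# Remove rows: drop any row containing a missing value (MCAR only).
complete_rows = [r for r in rows if None not in r.values()]

# Remove columns: drop a column whose missing rate exceeds 25%.
def missing_rate(col):
    return sum(r[col] is None for r in rows) / len(rows)

kept_cols = [c for c in rows[0] if missing_rate(c) <= 0.25]  # "income" is dropped

# Impute values: replace missing ages with the column mean.
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [{**r, "age": r["age"] if r["age"] is not None else mean_age}
           for r in rows]
```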
What are the 3 methods for imputed values?
Average (mean, mode, median)
Regression - Use regression to predict missing values.
Interpolation
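Linear interpolation fills a gap using the values on either side of it. A minimal sketch for a series with single missing values flanked by known neighbours (the data is invented):

```python
def interpolate_linear(values):
    """Fill each None with the midpoint of its two neighbours.

    Assumes every missing value is flanked by known values, as in an
    evenly spaced time series with isolated gaps.
    """
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            left = filled[i - 1]
            right = values[i + 1]
            filled[i] = (left + right) / 2
    return filled

series = [10.0, 12.0, None, 16.0, 18.0]
filled = interpolate_linear(series)   # the gap becomes 14.0
```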
What is the purpose of imputing values?
To use predicted values that minimise the introduction of bias.
In data cleaning, what are outliers? How are they detected?
Anomalies in the dataset
Detected using quartiles, e.g. the 1.5 × IQR rule: the interquartile range establishes the central spread of the data, and values falling outside the fences are considered outliers.
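A sketch of quartile-based detection using the common 1.5 × IQR fences (the data is invented; `statistics.quantiles` computes Q1–Q3):

```python
import statistics

data = [4, 5, 5, 6, 6, 7, 7, 8, 30]          # 30 is the obvious outlier

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2, Q3
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```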
What are the 4 possible responses to outliers? When should each be used?
Do nothing - Use when model is robust against outliers.
Replace outliers with upper or lower cap - Use when all data objects are needed.
Log transformation - Use when the data is skewed, i.e. some values are abnormally large relative to the rest.
Remove data objects with outliers - Worst option due to loss of information. Done if other methods aren’t possible.
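The capping and log-transformation responses can be sketched as follows (the data and cap values are invented; in practice the caps would come from something like the IQR fences):

```python
import math

data = [4, 5, 6, 7, 8, 30]
lower, upper = 2, 10

# Capping: replace outliers with the nearest cap so every object is kept.
capped = [min(max(x, lower), upper) for x in data]

# Log transform: compresses large values while preserving the ordering
# (and the ratios between values, on the log scale).
logged = [math.log(x) for x in data]
```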
What are the 2 types of errors in Data Cleaning?
Random Errors - Due to inconsistency in data.
Systematic Errors - Repeatable errors that can be tracked to a source.
What is the purpose of Data Transformation?
Ensure that data is compatible for input into the model. It must have the correct Encoding and Data Ranges.
What are the 3 methods to bring data ranges to the same scale?
Standardisation - Rescale data to have a mean of 0 and a std. deviation of 1. For each feature, subtract the mean and divide by the std. deviation.
Normalisation - Rescale data to be within a range, usually 0 to 1. For each feature, subtract the minimum and divide by the range (max − min).
Log - Addresses skewed data from extreme values. Simply apply log function to data. Useful when we want to keep ratio of data whilst scaling it down.
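Standardisation and normalisation side by side on a toy feature (a sketch; the values are invented):

```python
import statistics

feature = [2.0, 4.0, 6.0, 8.0]

# Standardisation: subtract the mean, divide by the standard deviation.
mean = statistics.mean(feature)
std = statistics.pstdev(feature)              # population std. deviation
standardised = [(x - mean) / std for x in feature]

# Normalisation (min-max): subtract the minimum, divide by the range.
lo, hi = min(feature), max(feature)
normalised = [(x - lo) / (hi - lo) for x in feature]
```

After standardisation the feature has mean 0 and standard deviation 1; after normalisation it lies exactly in [0, 1].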
Why and how do we perform Data Encoding?
To transform data to numerical form. Used for categorical data.
Two common approaches: one-hot encoding transforms each category into its own binary column; ordinal encoding gives each category a rank and constructs a single numeric attribute column from the ranks.
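Both encodings sketched on a toy categorical column (the category names are invented; the rank order here is just alphabetical):

```python
colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))             # ["blue", "green", "red"]

# Ordinal encoding: assign each category a rank.
ranks = {c: i for i, c in enumerate(categories)}
ordinal = [ranks[c] for c in colours]

# One-hot encoding: one binary column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colours]
```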
What is Smoothing? When is it used?
Eliminates noise and fluctuations in data by replacing each point with the average of its neighbours.
Used with time-series data.
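A minimal moving-average smoother (window of 3, shrinking at the edges; the series is invented):

```python
def moving_average(values, window=3):
    """Replace each point with the mean of its neighbourhood."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

noisy = [1.0, 9.0, 2.0, 8.0, 3.0]
smooth = moving_average(noisy)   # fluctuations are damped
```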