L6 - Preprocessing (Cleaning, Transformation, Visualisation) Flashcards
What are the 3 issues that require data cleaning…?
Missing Values
Outliers
Errors
What are missing values in the context of data preprocessing?
Data that we expect to have but is missing.
This can be due to human error, bugs, etc.
What are the 3 types of missing values? Define each…
MCAR (Missing Completely At Random) - A value that is missing purely by chance. The model can often account for this.
MAR (Missing At Random) - Certain data values are more likely to be missing. The reason for the missing data is related to other observed data, but not to the missing value itself. For example, high wind speed breaks an air quality sensor.
MNAR (Missing Not At Random) - The reason for the missing value is related to the missing data itself, so we can predict which values will be missing. E.g. an air quality sensor can't measure because the air quality is too poor.
When solving missing values, what are the 2 principles we need to keep in mind?
- Prioritise data information preservation
- Minimise bias introduction
What are the 4 solutions to missing values?
- Keep as is
- Remove rows
- Remove columns
- Impute values
When should we use each of the missing value solutions?
Keep as is - When sharing data so a collective decision can be made regarding what to do.
Remove rows - Use as a last resort when dealing with MCAR. Don’t use with MAR or MNAR due to bias introduction.
Remove columns - If the missing-value rate of a column exceeds 25%, the column can be removed.
Impute values - Replace the missing values with a calculated value, e.g. the mean of the column.
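A minimal sketch of three of these strategies on a toy dataset (plain Python; the column names, values, and where the 25% threshold lands are invented for illustration):

```python
# Toy dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 31, "income": None},
    {"age": 40, "income": None},
]

# Remove rows: drop any row containing a missing value (MCAR only).
complete_rows = [r for r in rows if None not in r.values()]

# Remove columns: drop a column whose missing rate exceeds 25%.
def missing_rate(col):
    return sum(r[col] is None for r in rows) / len(rows)

kept_cols = [c for c in rows[0] if missing_rate(c) <= 0.25]  # "income" is dropped

# Impute values: replace missing ages with the column mean.
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [{**r, "age": r["age"] if r["age"] is not None else mean_age}
           for r in rows]
```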
What are the 3 methods for imputed values?
Average (mean, mode, median)
Regression - Use regression to predict missing values.
Interpolation
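Linear interpolation fills a gap using the values on either side of it. A minimal sketch for a series with single missing values flanked by known neighbours (the data is invented):

```python
def interpolate_linear(values):
    """Fill each None with the midpoint of its two neighbours.

    Assumes every missing value is flanked by known values, as in an
    evenly spaced time series with isolated gaps.
    """
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            left = filled[i - 1]
            right = values[i + 1]
            filled[i] = (left + right) / 2
    return filled

series = [10.0, 12.0, None, 16.0, 18.0]
filled = interpolate_linear(series)   # the gap becomes 14.0
```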
What is the purpose of imputing values?
To use predicted values that minimise the introduction of bias.
In data cleaning, what are outliers? How are they detected?
Anomalies in the dataset
Detected using quartiles, e.g. the 1.5 × IQR rule: the interquartile range establishes the central spread of the data, and values falling outside the fences are considered outliers.
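A sketch of quartile-based detection using the common 1.5 × IQR fences (the data is invented; `statistics.quantiles` computes Q1–Q3):

```python
import statistics

data = [4, 5, 5, 6, 6, 7, 7, 8, 30]          # 30 is the obvious outlier

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2, Q3
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```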
What are the 4 possible responses to outliers? When should each be used?
Do nothing - Use when model is robust against outliers.
Replace outliers with upper or lower cap - Use when all data objects are needed.
Log transformation - Use when the data is skewed, i.e. some values are abnormally large relative to the rest.
Remove data objects with outliers - Worst option due to loss of information. Done if other methods aren’t possible.
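The capping and log-transformation responses can be sketched as follows (the data and cap values are invented; in practice the caps would come from something like the IQR fences):

```python
import math

data = [4, 5, 6, 7, 8, 30]
lower, upper = 2, 10

# Capping: replace outliers with the nearest cap so every object is kept.
capped = [min(max(x, lower), upper) for x in data]

# Log transform: compresses large values while preserving the ordering
# (and the ratios between values, on the log scale).
logged = [math.log(x) for x in data]
```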
What are the 2 types of errors in Data Cleaning?
Random Errors - Due to inconsistency in data.
Systematic Errors - Repeatable errors that can be tracked to a source.
What is the purpose of Data Transformation?
Ensure that data is compatible for input into the model. It must have the correct Encoding and Data Ranges.
What are the 3 methods to bring data ranges to the same scale?
Standardisation - Rescale data to have a mean of 0 and a std. deviation of 1. For each feature, subtract the mean and divide by the std. deviation.
Normalisation - Rescale data to be within a range, usually 0 to 1. For each feature, subtract the minimum and divide by the range (max − min).
Log - Addresses skewed data from extreme values. Simply apply log function to data. Useful when we want to keep ratio of data whilst scaling it down.
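Standardisation and normalisation side by side on a toy feature (a sketch; the values are invented):

```python
import statistics

feature = [2.0, 4.0, 6.0, 8.0]

# Standardisation: subtract the mean, divide by the standard deviation.
mean = statistics.mean(feature)
std = statistics.pstdev(feature)              # population std. deviation
standardised = [(x - mean) / std for x in feature]

# Normalisation (min-max): subtract the minimum, divide by the range.
lo, hi = min(feature), max(feature)
normalised = [(x - lo) / (hi - lo) for x in feature]
```

After standardisation the feature has mean 0 and standard deviation 1; after normalisation it lies exactly in [0, 1].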
Why and how do we perform Data Encoding?
To transform data to numerical form. Used for categorical data.
Two common approaches: one-hot encoding transforms each category into its own binary column; ordinal encoding gives each category a rank and constructs a single numeric attribute column from the ranks.
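Both encodings sketched on a toy categorical column (the category names are invented; the rank order here is just alphabetical):

```python
colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))             # ["blue", "green", "red"]

# Ordinal encoding: assign each category a rank.
ranks = {c: i for i, c in enumerate(categories)}
ordinal = [ranks[c] for c in colours]

# One-hot encoding: one binary column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colours]
```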
What is Smoothing? When is it used?
Eliminates noise and fluctuations in data by replacing each point with the average of its neighbours.
Used with time-series data.
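A minimal moving-average smoother (window of 3, shrinking at the edges; the series is invented):

```python
def moving_average(values, window=3):
    """Replace each point with the mean of its neighbourhood."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

noisy = [1.0, 9.0, 2.0, 8.0, 3.0]
smooth = moving_average(noisy)   # fluctuations are damped
```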