Data Preprocessing Flashcards

Question 1

Q

Name 4 potential data issues

Answer

A

Incompleteness (NAs)
Inconsistency (age = 52, DoB = 29/06/92)
Duplicates
Noise (e.g. outliers)
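
A minimal sketch of how the first three issues might be flagged in plain Python. The records, the field names (`age`, `dob`) and the fixed reference date are hypothetical, chosen only to mirror the age/DoB example above.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 52, "dob": date(1992, 6, 29)},    # age disagrees with DoB
    {"id": 2, "age": None, "dob": date(1985, 1, 15)},  # missing age
    {"id": 3, "age": 30, "dob": date(1994, 3, 2)},
    {"id": 3, "age": 30, "dob": date(1994, 3, 2)},     # exact duplicate
]

REFERENCE_DATE = date(2024, 6, 30)  # assumed "today", fixed for reproducibility


def age_on(dob, today):
    """Age in whole years on the given date."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


# Incompleteness: any field is missing.
incomplete = [r["id"] for r in records if any(v is None for v in r.values())]

# Inconsistency: the stated age does not match the age implied by the DoB.
inconsistent = [
    r["id"] for r in records
    if r["age"] is not None and r["age"] != age_on(r["dob"], REFERENCE_DATE)
]

# Duplicates: a record identical to one already seen.
seen, duplicates = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)
```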

Question 2

Q

Name 4 methods used in Data Inspection (aka Exploratory Data Analysis)

Answer

A

Scatter plots - correlations between pairs of variables (bivariate)
Histograms - distributions
Boxplots - 5 number summary - outliers
Attribute combination experiments - ratios, differences
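
A rough sketch of what a boxplot summarises: the five-number summary, plus the common 1.5 x IQR rule for flagging outliers. The sample values are made up, and the "inclusive" quartile method is just one of several conventions.

```python
import statistics

values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]  # hypothetical sample; 40 stands out

# Quartiles (one convention of several) and the five-number summary
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
five_number = (min(values), q1, median, q3, max(values))

# Boxplot whisker rule: points beyond 1.5 * IQR from the quartiles are outliers
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```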

Question 3

Q

Name 3 ways to deal with missing data

Answer

A

Elimination
Identification/inspection
Imputation
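
A small sketch of the three options on a single made-up column. Reading "identification" as flagging which values were missing is an assumption, and median imputation is only one of many possible fill strategies.

```python
import statistics

ages = [25, None, 31, 40, None, 28]  # hypothetical column; None marks a gap

# Elimination: drop the incomplete entries.
observed = [a for a in ages if a is not None]

# Identification: record where the gaps were, in case missingness is informative.
was_missing = [a is None for a in ages]

# Imputation: fill each gap with the median of the observed values.
fill_value = statistics.median(observed)
imputed = [fill_value if a is None else a for a in ages]
```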

Question 4

Q

Name 3 ways to encode categorical variables

Answer

A

Integer encoding - one integer per category; implies that the categories are ordered
One-hot encoding - one new 0/1 column per category. Removes the implied ordering and can be stored efficiently as a sparse matrix
Binary encoding - writes each integer code in binary digits, one column per bit; weakens the implied ordering while using fewer columns than one-hot encoding
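
The three encodings can be sketched side by side on a toy column. The colour values are hypothetical, and assigning integer codes in alphabetical order is an arbitrary choice for the example.

```python
categories = ["red", "green", "blue", "green"]  # hypothetical column
levels = sorted(set(categories))                # ['blue', 'green', 'red']

# Integer encoding: one integer per category.
to_int = {level: i for i, level in enumerate(levels)}
integer_encoded = [to_int[c] for c in categories]

# One-hot encoding: one 0/1 column per category.
one_hot = [
    [1 if to_int[c] == i else 0 for i in range(len(levels))]
    for c in categories
]

# Binary encoding: the integer code written in binary, one column per bit.
n_bits = max(1, (len(levels) - 1).bit_length())
binary_encoded = [
    [(to_int[c] >> b) & 1 for b in reversed(range(n_bits))]
    for c in categories
]
```

With three categories, one-hot needs three columns while binary encoding needs two; the gap widens as the number of categories grows.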

Question 5

Q

What is the goal of data transformation?

Answer

A

To provide data to the ML model in a format that is more accurately interpretable.

Question 6

Q

Name Youfar’s 5 ‘SSAAG’ ways for transforming data

Answer

A

Scaling
Smoothing
Attribute construction
Aggregation
Generalization

Question 7

Q

Name the additional data transformation described in ‘Hands-On Machine Learning’

Answer

A

Transforming distributions

Making asymmetrical distributions symmetrical
Making multimodal distributions more interpretable for regression models

Question 8

Q

3 reasons for scaling data

Answer

A

Distance/similarity measures will be dominated by features with larger ranges
Faster convergence for gradient descent learning algorithms
Appropriately penalizing coefficients in loss functions involving regularisation
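
A quick illustration of the first reason, using Euclidean distance. The people, the features (age, income) and the ranges used for rescaling are all hypothetical.

```python
import math

# Hypothetical people described by (age in years, income in dollars)
a = (25, 50_000)
b = (65, 50_500)  # very different age, similar income
c = (26, 52_000)  # similar age, slightly different income

# On raw values, income dominates the distance because its range is larger.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)


def rescale(p):
    """Divide each feature by an assumed range: age 0-100, income 0-100,000."""
    return (p[0] / 100, p[1] / 100_000)


# After rescaling, both features contribute on a comparable footing.
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))
```

On raw values `a` looks closer to `b`; after rescaling it is closer to `c`, which matches intuition.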

Question 9

Q

Name 2 methods for scaling

Answer

A

MinMaxScaling (normalisation) - uses min-max-range, normally puts values in a range between 0 and 1.
Z-score scaling (standardisation) - uses mean and standard deviation, outputs values without a specific range, but their mean will be 0 and SD will be 1.
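
Both methods can be sketched in a few lines of plain Python. The values are made up, and the population standard deviation is used here as one possible convention.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score scaling: (x - mean) / SD, giving mean 0 and SD 1
mean = statistics.fmean(values)
sd = statistics.pstdev(values)
z_scores = [(v - mean) / sd for v in values]
```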

Question 10

Q

Name the con of MinMaxScaling and pro of Z-score scaling

Answer

A

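Con of MinMaxScaling = sensitive to outliers (one extreme value squeezes the other values into a narrow part of the range)
Pro of Z-score scaling = less affected by outliers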

Question 11

Q

How might you transform an asymmetrical distribution?

Answer

A

Logarithm
Bin into percentiles
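
A sketch of both options on a made-up right-skewed feature. The `skewness` helper is written here for illustration (it is not a library function), and quartile bins stand in for percentile bins.

```python
import math
import statistics

# Hypothetical right-skewed feature with a long tail of large values
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100, 1000]


def skewness(xs):
    """Population skewness: mean cubed z-score (0 for a symmetric sample)."""
    mean, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in xs])


# Logarithm: compresses the long tail (requires strictly positive values)
logged = [math.log(v) for v in values]

# Binning: replace each value with the index of its quartile bin (0..3)
cuts = statistics.quantiles(values, n=4)
binned = [sum(v > c for c in cuts) for v in values]
```

Comparing `skewness(values)` with `skewness(logged)` shows the log-transformed feature is less skewed, though not perfectly symmetrical.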

Question 12

Q

How might you transform a multimodal distribution?

Answer

A

Binning and then categorising
Create a new feature for each mode that measures similarity to that mode, using a radial basis function (RBF)
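
A minimal sketch of the RBF idea, assuming a Gaussian RBF. The ages, the mode locations and the `GAMMA` width parameter are hypothetical values picked for the example.

```python
import math


def rbf_similarity(x, centre, gamma):
    """Gaussian RBF: 1 at the centre, decaying towards 0 with distance."""
    return math.exp(-gamma * (x - centre) ** 2)


# Hypothetical feature with modes near 20 and 60
ages = [18, 20, 22, 40, 58, 60, 63]
modes = [20, 60]
GAMMA = 0.01  # assumed width parameter; larger values give narrower peaks

# One new feature per mode: how similar each value is to that mode
rbf_features = [[rbf_similarity(a, m, GAMMA) for m in modes] for a in ages]
```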

Question 13

Q

When and why is imbalanced data a problem?

Answer

A

When: classification problems where the target classes are not equally represented

Why:
1. It can lead to poorer model performance on the minority class
2. The minority class is often the one we are most interested in classifying correctly

Question 14

Q

Name 2 ways of correcting imbalanced data

Answer

A

Oversampling, e.g. using SMOTE (Synthetic Minority Over-sampling Technique)
Undersampling by removing data points from the majority class
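
Random undersampling might look like this; the 90/10 class split and the choice to balance the classes exactly are assumptions made for the example.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the result is reproducible

# Hypothetical labelled data: 90 majority (label 0) and 10 minority (label 1)
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Keep only as many majority points as there are minority points.
balanced = rng.sample(majority, k=len(minority)) + minority
counts = Counter(label for _, label in balanced)
```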

Question 15

Q

Describe SMOTE (Synthetic Minority Over-sampling Technique)

Answer

A

Randomly select a minority class data point
Find its 5 nearest neighbours (from the minority class)
Select one of them at random and connect the two points with a line
Generate a new data point at a random position along that line
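
The steps can be sketched in plain Python as follows. This is an illustrative simplification rather than a reference implementation: `smote_sample` is a hypothetical helper, the points are made up, and neighbours are drawn from the minority class as in the original SMOTE formulation.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Generate one synthetic point from a list of minority-class points."""
    # Step 1: pick a minority point at random.
    point = rng.choice(minority)
    # Step 2: find its k nearest neighbours among the other minority points.
    others = [p for p in minority if p is not point]
    neighbours = sorted(others, key=lambda p: math.dist(p, point))[:k]
    # Step 3: pick one neighbour at random.
    neighbour = rng.choice(neighbours)
    # Step 4: take a random position on the segment between the two points.
    t = rng.random()
    return tuple(a + t * (b - a) for a, b in zip(point, neighbour))


rng = random.Random(42)  # fixed seed for reproducibility
minority_points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0),
]
synthetic = [smote_sample(minority_points, k=5, rng=rng) for _ in range(4)]
```

Because each synthetic point lies between two existing minority points, it always falls inside the region the minority class already occupies.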

Question 16

Q

What is the purpose of data reduction?

Answer

A

Efficiency - computational and storage
Simplicity - interpretability
Accuracy - removing noisy features can improve accuracy, or at least maintain it with fewer features

Question 17

Q

3 methods for data reduction

Answer

A

Feature selection
Instance/pattern selection
Data transformation

Question 18

Q

Name 4 types of supervised feature selection methods (FERW)

Answer

A

Filter methods - statistical scores for each feature
Embedded - learn the most relevant features as the model is created
Regularisation - LASSO, elastic net, etc.
Wrapper - search for the best features
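
As an illustration of a filter method, each feature can be scored by its absolute Pearson correlation with the target, independently of any model. The dataset and feature names are hypothetical, and correlation is only one of several possible statistical scores.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset: three features and a numeric target
features = {
    "rooms": [2, 3, 3, 4, 5, 6],
    "age": [30, 5, 22, 14, 40, 9],
    "income": [20, 31, 29, 42, 50, 61],
}
target = [100, 150, 145, 200, 250, 300]

# Score every feature on its own, then rank from most to least relevant.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
top_two = ranked[:2]
```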

Question 19

Q

When should you create the test data set?

Answer

A

Before any significant exploratory analysis, to avoid ‘data snooping’ bias
Before any transformations, to avoid leaking test data into them, e.g. into the statistics used for scaling
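
A sketch of the leakage point for scaling: the mean and standard deviation are learned from the training set only and then reused unchanged on the test set. The values are made up.

```python
import statistics

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 25.0]  # held-out values, never used to compute the statistics

# Learn the scaling parameters from the training set only...
train_mean = statistics.fmean(train)
train_sd = statistics.pstdev(train)

# ...then apply the same parameters to both sets.
train_scaled = [(v - train_mean) / train_sd for v in train]
test_scaled = [(v - train_mean) / train_sd for v in test]
```

Scaled test values may fall outside the range seen in training; that is expected, and it reflects what the model will face on genuinely new data.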

Question 20

Q

How can you create a test data set (2 methods)?

Answer

A

Random sampling with a fixed seed
Stratified sampling, if certain factors are very important predictors and/or are skewed, so that the test set keeps their proportions.
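
Both methods can be sketched with the standard library. `random_split` and `stratified_split` are hypothetical helpers written for this example, and the 80/20 "low"/"high" strata are made up.

```python
import random
from collections import Counter, defaultdict


def random_split(rows, test_fraction, seed):
    """Shuffle a copy with a fixed seed, then cut off the test portion."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def stratified_split(rows, key, test_fraction, seed):
    """Split each stratum separately so the test set keeps its proportions."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for i, group in enumerate(strata.values()):
        group_train, group_test = random_split(group, test_fraction, seed + i)
        train += group_train
        test += group_test
    return train, test


# Hypothetical rows of (id, stratum): 80 "low" and 20 "high"
rows = [(i, "low") for i in range(80)] + [(i, "high") for i in range(80, 100)]
train, test = stratified_split(
    rows, key=lambda r: r[1], test_fraction=0.2, seed=42
)
test_counts = Counter(stratum for _, stratum in test)
```

The fixed seed means the same rows land in the test set on every run, so the model never gets to see them across repeated experiments.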

(20 cards)