Data Preprocessing Flashcards

1
Q

Name 4 potential data issues

A
  1. Incompleteness (NAs)
  2. Inconsistency (age = 52, DoB = 29/06/92)
  3. Duplicates
  4. Noise (aka outliers)
2
Q

Name 4 methods used in Data Inspection (aka Exploratory Data Analysis)

A
  1. Scatter plots - correlations for bivariate
  2. Histograms - distributions
  3. Boxplots - 5 number summary - outliers
  4. Attribute combination experiments - ratios, minus
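A minimal numpy sketch of the boxplot idea above: the 5-number summary plus the common 1.5 × IQR outlier rule (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample: a small feature with one extreme value
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], dtype=float)

# The boxplot's 5-number summary: min, Q1, median, Q3, max
five_num = np.percentile(values, [0, 25, 50, 75, 100])
print(five_num)

# A common outlier rule read off the boxplot: beyond Q1/Q3 +/- 1.5 * IQR
q1, q3 = five_num[1], five_num[3]
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # the extreme value 40 is flagged
```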
3
Q

Name 3 ways to deal with missing data

A
  1. Elimination
  2. Identification/inspection
  3. Imputation
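The three approaches sketched with pandas on a made-up frame (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing ages
df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40],
                   "city": ["A", "B", "A", "B", "A"]})

# 1. Identification/inspection: count NAs per column
print(df.isna().sum())

# 2. Elimination: drop any row containing an NA
dropped = df.dropna()

# 3. Imputation: fill NAs with the column mean
imputed = df.assign(age=df["age"].fillna(df["age"].mean()))
print(imputed["age"].tolist())  # mean of 25, 31, 40 = 32.0 fills the gaps
```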
4
Q

Name 3 ways to encode categorical variables

A
  1. Integer encoding - communicates that there is ordinality to the categories
  2. One hot encoding - one new column per category. Eliminates ordinality implications; sparse, so can be stored efficiently
  3. Binary encoding - transformation of integer encoding to remove implied ordinality
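A short pandas sketch of the first two encodings (the `size` categories are invented for illustration; binary encoding is usually done via a library such as category_encoders, omitted here):

```python
import pandas as pd

# Hypothetical ordinal-looking feature
sizes = pd.Series(["small", "large", "medium", "small"])

# 1. Integer encoding: implies an ordering, so declare it explicitly
order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
int_codes = sizes.astype(order).cat.codes
print(int_codes.tolist())  # [0, 2, 1, 0]

# 2. One-hot encoding: one new column per category, no implied order
one_hot = pd.get_dummies(sizes, prefix="size")
print(one_hot.columns.tolist())
```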
5
Q

What is the goal of data transformation?

A

To provide data to the ML model in a format that the model can interpret more accurately.

6
Q

Name Youfar’s 5 ‘SSAAG’ ways for transforming data

A
  1. Scaling
  2. Smoothing
  3. Attribute construction
  4. Aggregation
  5. Generalization
7
Q

Name the additional data transformation described in ‘Hands-On Machine Learning’

A

Transforming distributions

  1. Making asymmetrical distributions symmetrical
  2. Making multimodal distributions more interpretable for regression models
8
Q

3 reasons for scaling data

A
  1. Distance/similarity measuring techniques will be skewed towards values with larger ranges
  2. Faster convergence for gradient descent learning algorithms
  3. Appropriately penalizing coefficients in loss functions involving regularisation
9
Q

Name 2 methods for scaling

A
  1. MinMaxScaling (normalisation) - uses the feature’s min and max; normally puts values in a range between 0 and 1.
  2. Z-score scaling (standardisation) - uses the mean and standard deviation; outputs values without a fixed range, but their mean will be 0 and SD will be 1.
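Both formulas in a few lines of numpy (the example values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical feature values

# Min-max scaling (normalisation): (x - min) / (max - min) -> range [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)  # 0, 0.25, 0.5, 0.75, 1

# Z-score scaling (standardisation): (x - mean) / SD -> mean 0, SD 1
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())
```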
10
Q

Name the con of MinMaxScaling and the pro of Z-score scaling

A

Con = sensitive to outliers
Pro = less affected by outliers

11
Q

How might you transform an asymmetrical distribution?

A
  1. Logarithm
  2. Bin into percentiles
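A numpy sketch of both options on a synthetic right-skewed (lognormal) sample; the skewness helper is an assumption for illustration, not part of the cards:

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed sample

# 1. Logarithm: the log of a lognormal sample is (approximately) normal
logged = np.log(skewed)

# 2. Bin into percentiles: map each value to its quantile rank in [0, 1]
ranks = skewed.argsort().argsort() / (len(skewed) - 1)

def skewness(a):
    """Population skewness: third standardised moment."""
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

# Skew shrinks dramatically after the log transform
print(skewness(skewed), skewness(logged))
```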
12
Q

How might you transform a multimodal distribution?

A
  1. Binning and then categorising
  2. Create a new feature for each mode using a radial basis function (RBF) centred on that mode
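A minimal sketch of the RBF idea: one new similarity feature per mode, with invented mode centres and an assumed width parameter `gamma`:

```python
import numpy as np

# Hypothetical bimodal feature (e.g. two clusters of house ages)
x = np.array([4.0, 5.0, 6.0, 34.0, 35.0, 36.0])
modes = [5.0, 35.0]  # assumed mode centres, e.g. read off a histogram

# One new RBF feature per mode: similarity to that mode; gamma controls width
gamma = 0.1
rbf_features = np.column_stack([np.exp(-gamma * (x - m) ** 2) for m in modes])
print(rbf_features.round(3))
```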
13
Q

When and why is imbalanced data a problem?

A

When: classification problems where the target class distribution is not uniform

Why:
1. It can lead to poorer model performance on the minority class
2. It matters most when we’re more interested in correctly classifying the minority class

14
Q

Name 2 ways of correcting imbalanced data

A
  1. Oversampling, e.g. using SMOTE (Synthetic Minority Over-sampling Technique)
  2. Undersampling by removing data points from the majority class
15
Q

Describe SMOTE (Synthetic Minority Over-sampling Technique)

A
  1. Randomly select a minority class data point
  2. Find its 5 nearest neighbours from the minority class
  3. Select one at random and connect with a line
  4. Generate a new data point at a random position along that line
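The steps above as a minimal numpy sketch, generating one synthetic point from toy 2-D minority data (in practice a library implementation such as imbalanced-learn's SMOTE would be used):

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic minority point (minimal SMOTE sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1. Randomly select a minority class data point
    i = rng.integers(len(minority))
    point = minority[i]
    # 2. Find its k nearest minority-class neighbours (excluding itself)
    dists = np.linalg.norm(minority - point, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    # 3. Select one neighbour at random
    nb = minority[rng.choice(neighbours)]
    # 4. Generate a new point at a random position along the connecting line
    return point + rng.random() * (nb - point)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
new_point = smote_sample(minority)
print(new_point)  # lies between two existing minority points
```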
16
Q

What is the purpose of data reduction?

A
  1. Efficiency - computational and storage
  2. Simplicity - interpretability
  3. Accuracy - removing noise, or at least preserving accuracy while using fewer features
17
Q

3 methods for data reduction

A
  1. Feature selection
  2. Instance/pattern selection
  3. Data transformation
18
Q

Name 4 types of supervised feature selection methods (FERW)

A
  1. Filter methods - statistical scores for each feature
  2. Embedded - learn the most relevant features as the model is created
  3. Regularisation - LASSO, elastic net, etc.
  4. Wrapper - search for the best features
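A tiny sketch of a filter method: score each feature independently against the target and rank. Here the score is |Pearson correlation| on synthetic data where, by construction, `x0` drives the target, `x1` is weakly related, and `x2` is pure noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x0 = rng.normal(size=n)  # strongly related to the target
x1 = rng.normal(size=n)  # weakly related
x2 = rng.normal(size=n)  # pure noise
y = 3 * x0 + 1.0 * x1 + rng.normal(size=n)

X = np.column_stack([x0, x1, x2])

# Filter method: statistical score per feature, computed independently
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]
print(scores.round(3), ranking)  # x0 scores highest, x2 lowest
```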
19
Q

When should you create the test data set?

A
  1. Before any significant exploratory analysis to avoid ‘data snooping’ bias
  2. Before any transformations, to avoid data leakage into your transformations, e.g. the scaling parameters you fit.
20
Q

How can you create a test data set? (2 methods)

A

  1. Random sampling with a fixed seed
  2. Stratified sampling if certain factors are very important predictors and/or are skewed
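Both methods sketched with pandas; the `group` column and its 10/90 split are invented, and stratification keeps the rare category's proportion in the test set:

```python
import pandas as pd

# Hypothetical data with a skewed but important category
df = pd.DataFrame({"x": range(100), "group": ["rare"] * 10 + ["common"] * 90})

# 1. Random sampling with a fixed seed -> reproducible 20% test set
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# 2. Stratified sampling: take 20% from within each group, so the rare
#    category keeps its 10% share of the test set
strat_test = df.groupby("group", group_keys=False).sample(frac=0.2, random_state=42)
print(strat_test["group"].value_counts().to_dict())  # {'common': 18, 'rare': 2}
```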