! Session 2: Pre-Processing & EDA Flashcards

1
Q

Cleaning - examples

A

  • Convert fields with text into numeric form
  • Remove rows with missing / faulty data
  • Fix rows with faulty datapoints
  • Drop unnecessary variables (columns)
  • Scale or normalize data
  • Create new variables
  • Rename variables (remove capitalization, spaces)
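Several of the cleaning steps listed above can be sketched with pandas. This is a minimal illustration only: the data frame, its column names, and the rule that a body mass of zero or less counts as faulty are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; column names and values are invented for illustration.
raw = pd.DataFrame({
    "Body Mass G": ["3750", "3800", None, "-1"],
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Notes": ["", "", "", ""],
})

# Rename variables: lower case, underscores instead of spaces.
df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Drop an unnecessary column.
df = df.drop(columns=["notes"])

# Convert a text field to numeric form; unparseable values become NaN.
df["body_mass_g"] = pd.to_numeric(df["body_mass_g"], errors="coerce")

# Treat impossible values as missing (assumed rule), then remove those rows.
df.loc[df["body_mass_g"] <= 0, "body_mass_g"] = np.nan
df = df.dropna(subset=["body_mass_g"]).reset_index(drop=True)

# Create a new variable from an existing one.
df["body_mass_kg"] = df["body_mass_g"] / 1000

print(df)
```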

2
Q

Methods for converting categorical (& ordinal) values to numerical values

A
  • one-hot encoding
  • label encoding
  • ordinal encoding
3
Q

One-hot encoding

A

= transforming each possible value of a categorical variable into its own new binary (dummy-variable) column (1 = row has that value, 0 = it does not)
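A minimal sketch of one-hot encoding with `pandas.get_dummies`. The `island` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# One new 0/1 column per distinct category value.
encoded = pd.get_dummies(df, columns=["island"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new columns, because every row belongs to exactly one category.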

4
Q

Label encoding

A

= transforming each possible value of a categorical variable into a unique integer value
! Issue: Algorithms might wrongly infer an order / relationship between the integers that does not exist in the data
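A minimal sketch using scikit-learn's `LabelEncoder`; the color values are hypothetical. The encoder assigns integers by sorted class name, which illustrates the issue above: the codes carry an order that means nothing for colors.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal values with no natural order.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Classes are sorted alphabetically: blue=0, green=1, red=2.
print(list(encoder.classes_))
print(codes.tolist())
```

scikit-learn documents `LabelEncoder` as intended for target labels rather than input features, so using it here is for illustration only.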

5
Q

Ordinal encoding

A

= transforming each possible value of an ordered categorical variable into a unique integer value such that the integers preserve the order (e.g. size categories derived from body mass in g)
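A minimal sketch with scikit-learn's `OrdinalEncoder`. The size categories and their order (small < medium < large) are assumptions made for the example; the order is passed explicitly so the integers follow it rather than alphabetical order.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories, one feature column.
sizes = [["small"], ["large"], ["medium"], ["small"]]

# Pass the category order explicitly so the codes preserve it.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes)

# small -> 0, medium -> 1, large -> 2
print(codes.ravel().tolist())
```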

6
Q

EDA

A
  • Exploratory Data Analysis
  • approach to data analysis that employs a variety of techniques (mostly graphical) to:
      • uncover underlying structure
      • extract important variables
      • detect outliers and anomalies
      • test underlying assumptions
7
Q

2-D Histogram

A

= histogram over two variables: the points are divided among 2-D bins and the count per bin is shown
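The binning itself can be sketched with `numpy.histogram2d`. The points and the 2 x 2 grid over the unit square are made up for illustration.

```python
import numpy as np

# Hypothetical points in the unit square.
x = np.array([0.1, 0.4, 0.6, 0.9, 0.2])
y = np.array([0.2, 0.3, 0.8, 0.7, 0.9])

# Divide the points among a 2 x 2 grid of bins.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=2, range=[[0, 1], [0, 1]]
)

# counts[i, j] is the number of points in x-bin i and y-bin j.
print(counts)
```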

8
Q

Skewness

A

= measure of the asymmetry of a variable's probability distribution about its mean
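One common way to quantify this is the moment coefficient of skewness, the mean of the cubed standardized deviations. The sketch below assumes that definition (with the population standard deviation); the card itself does not name a formula, and the sample values are invented.

```python
import numpy as np

def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

symmetric = [1, 2, 3, 4, 5]        # balanced around the mean
right_skewed = [1, 2, 2, 3, 10]    # long tail to the right

print(skewness(symmetric))
print(skewness(right_skewed))
```

A value near zero indicates a roughly symmetric sample; a positive value indicates a longer right tail.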

9
Q

Boxplot

A

= method for graphically depicting groups of numerical data through their quartiles
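The numbers behind a boxplot can be sketched with NumPy. The data are invented, and the 1.5 x IQR rule for flagging outliers is a common plotting convention assumed here, not something stated on the card.

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Quartiles define the box and the line inside it.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)
print(outliers)
```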

10
Q

Scaling & Normalizing - Relevance

A
  • Some learning algorithms are sensitive to differences in scale between variables
  • e.g. distance-based algorithms: k-NN & SVM
  • e.g. 1 € != 1 DKK (same number, different units)
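A small sketch of why distance-based methods care about scale. The three points (age in years, income in EUR) and the feature ranges used for rescaling are hypothetical.

```python
import numpy as np

def euclidean(p, q):
    return float(np.linalg.norm(p - q))

# Hypothetical people: [age in years, income in EUR].
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_100.0])   # very different age, similar income
c = np.array([26.0, 51_000.0])   # similar age, somewhat different income

# Raw distances: income dominates because its numbers are larger.
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Assumed feature ranges, used to rescale both features to [0, 1].
lows = np.array([18.0, 20_000.0])
highs = np.array([80.0, 120_000.0])

def rescale(p):
    return (p - lows) / (highs - lows)

scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(raw_ab < raw_ac)        # before scaling, b looks nearer to a
print(scaled_ab > scaled_ac)  # after scaling, c is nearer to a
```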
11
Q

Mathematical Transformation

A
  • scaling & normalizing
  • log, exp
  • standardization
12
Q

Standardization

A

Data points are expressed as the number of standard deviations (SD) from the mean.
The result has mean 0 and variance 1.
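A minimal NumPy sketch of the z-score computation, z = (x - mean) / SD, using the population standard deviation. The sample values are invented so that the mean is 5 and the SD is 2.

```python
import numpy as np

# Hypothetical sample with mean 5 and population SD 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Express each point as a number of standard deviations from the mean.
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.var())
```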

13
Q

Scaling

A
  • adjusts the range of feature values to a specific range
  • e.g. between 0 and 1
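Min-max scaling is one common way to map a feature onto the range 0 to 1; the card does not name a method, so treat this as one possible sketch with invented values.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: smallest value -> 0, largest value -> 1.
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled)
```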
14
Q

Pipelines

A
  • Data transformation steps need to be executed multiple times and in the right order
  • Method: Pipeline class from Scikit-Learn
  • Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps
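A minimal sketch of a scikit-learn `Pipeline` built from name/estimator pairs. The choice of steps (standardize, then k-NN), the step names, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 1100.0], [8.0, 5000.0], [9.0, 5200.0]])
y = np.array([0, 0, 1, 1])

# List of (name, estimator) pairs; the steps run in this order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])

# fit() fits the scaler, transforms X, then fits the classifier.
pipe.fit(X, y)

# predict() reapplies the already fitted scaler before classifying.
pred = pipe.predict(np.array([[1.5, 1050.0], [8.5, 5100.0]]))
print(pred)
```

Bundling the steps this way means the same transformations are applied, in the same order, to training data and to any later data.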
15
Q

Splitting of data

A
  • Training set: train the model
  • Test set: confirm that the model works on unseen data
  • Validation set (optional): tune the hyperparameters
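A three-way split can be sketched by calling scikit-learn's `train_test_split` twice. The 60/20/20 proportions and the placeholder data are assumptions, not values from the card.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 2 features.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```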
16
Q

Validation set - purpose

A
  • ML is an iterative process of fine-tuning
  • Validation set: used during hyperparameter tuning to check that the model can generalize
  • Test set: used only for the final evaluation
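The division of labor between the sets can be sketched as a small tuning loop. The dataset (scikit-learn's bundled iris data), the k-NN model, and the candidate values of `k` are all choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out the test set first, then carve a validation set from the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Tune the hyperparameter k using the validation set only.
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# The test set is touched once, for the final evaluation.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print(best_k, best_val_acc, test_acc)
```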
17
Q

Normalizing

A
  • transforms features to a common scale by adjusting the distribution of the data (standardization is one form of normalizing)
  • centers the data around zero & adjusts the spread of values