! Session 2: Pre-Processing & EDA Flashcards

Question 1

Q

Cleaning - examples

Answer

A

o Convert fields with text into numeric form
o Remove rows with missing / faulty data
o Fix rows with faulty datapoints
o Drop unnecessary variables (columns)
o Scaling or normalizing data
o Creating new variables
o Renaming variables (remove capitalization, spaces)
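
A minimal pandas sketch of several of these cleaning steps. The DataFrame, its column names, and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical raw data; names and values are made up for this sketch.
raw = pd.DataFrame({
    "Body Mass": ["3750", "3800", None, "oops"],
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Notes": ["", "", "", ""],
})

# Rename variables: strip, lowercase, replace spaces with underscores.
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
# Drop an unnecessary column.
clean = clean.drop(columns=["notes"])
# Convert text to numeric; entries that cannot be parsed become NaN.
clean["body_mass"] = pd.to_numeric(clean["body_mass"], errors="coerce")
# Remove rows with missing / faulty data.
clean = clean.dropna(subset=["body_mass"]).reset_index(drop=True)
# Create a new variable.
clean["body_mass_kg"] = clean["body_mass"] / 1000
```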

Question 2

Q

Methods for converting categorical (& ordinal) values to numerical values

Answer

A

one hot encoding
label encoding
ordinal encoding

Question 3

Q

one hot encoding

Answer

A

= transforming each possible value of a categorical variable into a new binary (dummy-variable) column (e.g. true = 1, false = 0)
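
One way to sketch this is with `pandas.get_dummies`. The `island` column and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# One new 0/1 column per distinct category value.
one_hot = pd.get_dummies(df, columns=["island"], dtype=int)
```

Each row has exactly one 1 across the new `island_*` columns.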

Question 4

Q

label encoding

Answer

A

= transforming each possible value of a categorical variable into a unique integer value
! Issue: algorithms might misjudge the order / relationship in the data, because the integers imply a ranking that does not exist
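
A small sketch using pandas category codes, which is one of several ways to label-encode. The species values are illustrative.

```python
import pandas as pd

# Hypothetical nominal (unordered) variable.
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])

as_category = species.astype("category")
# Categories are sorted alphabetically, then numbered 0, 1, 2, ...
codes = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))
```

The resulting integers suggest Adelie < Chinstrap < Gentoo, an order that has no meaning for species. This is the issue noted above.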

Question 5

Q

Ordinal encoding

Answer

A

= transforming each possible value of a categorical variable into a unique integer value such that the integers preserve the order of the categories (e.g. size classes based on body mass in g)
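
A sketch with an explicitly ordered pandas `Categorical`. The size labels and their order (small < medium < large) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical ordinal variable.
sizes = ["medium", "small", "large", "small"]

# The order is stated explicitly so the integer codes respect it.
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
codes = ordered.codes  # small -> 0, medium -> 1, large -> 2
```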

Question 6

Q

EDA

Answer

A

Exploratory Data Analysis
approach for data analysis that employs a variety of techniques (mostly graphical) to:
o uncover underlying structure
o extract important variables
o detect outliers and anomalies
o test underlying assumptions

Question 7

Q

2-D Histogram

Answer

A

= dividing points among 2D bins
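
A minimal sketch of 2-D binning with `numpy.histogram2d`. The points and the 2 x 2 grid are chosen only for illustration.

```python
import numpy as np

# Five hypothetical points in the unit square.
x = np.array([0.1, 0.2, 0.8, 0.9, 0.85])
y = np.array([0.1, 0.3, 0.7, 0.9, 0.95])

# Divide the square into a 2 x 2 grid and count points per cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=2, range=[[0, 1], [0, 1]])
```

`counts[i, j]` holds the number of points whose x falls in bin i and whose y falls in bin j.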

Question 8

Q

Skewness

Answer

A

= measure of the asymmetry of a variable's probability distribution about its mean
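
Several definitions of skewness exist. The sketch below assumes the moment coefficient (third central moment divided by the second central moment to the power 1.5), computed with population moments.

```python
import numpy as np

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = skewness([1, 2, 3, 4, 5])      # symmetric about the mean
right_skewed = skewness([1, 2, 3, 4, 10])  # long tail to the right
```

A symmetric sample gives a value of zero; a long right tail gives a positive value.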

Question 9

Q

Boxplot

Answer

A

= method for graphically depicting groups of numerical data through their quartiles
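
The quartiles behind a boxplot can be sketched with NumPy. The data are invented, and the 1.5 x IQR fence for flagging outliers is a common convention assumed here, not something stated on the card.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
```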

Question 10

Q

Scaling & Normalizing - Relevance

Answer

A

Some learning algorithms are sensitive to the scale differences in variables
e.g. distance-based algorithms: k-NN & SVM
e.g. 1 € != 1 DKK (the same amount gives different numbers in different units)
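
A rough illustration of why scale matters for distance-based methods. The three customers and their features (age in years, income in DKK) are hypothetical.

```python
import numpy as np

# Hypothetical customers: [age in years, income in DKK].
a = np.array([25.0, 300_000.0])
b = np.array([60.0, 301_000.0])  # very different age, similar income
c = np.array([26.0, 310_000.0])  # similar age, different income

# Euclidean distances on the raw, unscaled features.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
```

On the raw numbers, `b` is far closer to `a` than `c` is, even though `b` differs by 35 years in age: the income column dominates purely because of its larger numeric range.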

Question 11

Q

Mathematical Transformation

Answer

A

scaling & normalizing
log, exp
standardization

Question 12

Q

Standardization

Answer

A

data points expressed as the number of standard deviations (SD) from the mean
Makes the mean 0 and the variance 1
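
As a worked sketch, the z-score z = (x - mean) / SD can be computed directly with NumPy. The sample values are invented, and the population SD (`ddof=0`, NumPy's default) is assumed.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()  # 5.0
sd = x.std()     # population SD (ddof=0): 2.0
z = (x - mean) / sd
```

After the transformation, `z` has mean 0 and variance 1.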

Question 13

Q

Scaling

Answer

A

adjusts the range of feature values to a specific range
e.g. between 0 and 1
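
A minimal sketch of one common choice, min-max scaling to [0, 1], using x' = (x - min) / (max - min). The values are illustrative.

```python
import numpy as np

x = np.array([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: smallest value maps to 0, largest to 1.
scaled = (x - x.min()) / (x.max() - x.min())
```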

Question 14

Q

Pipelines

Answer

A

Data transformation steps need to be executed multiple times in the right order
Method: Pipeline class from Scikit-Learn
Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps
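
A small sketch of a two-step pipeline. The choice of `StandardScaler` followed by `KNeighborsClassifier`, the step names, and the toy data are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two well-separated classes on very different feature scales.
X = np.array([[1.0, 100.0], [2.0, 110.0], [8.0, 900.0], [9.0, 950.0]])
y = np.array([0, 0, 1, 1])

# List of (name, estimator) pairs; steps run in this order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])

pipe.fit(X, y)  # fits the scaler, transforms X, then fits the classifier
pred = pipe.predict(np.array([[1.5, 105.0], [8.5, 920.0]]))
```

Calling `predict` reapplies the already fitted scaler before classifying, so the same transformations run in the same order every time.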

Question 15

Q

Splitting of data

Answer

A

Training set: train the model
Test set: confirm that the model works
Validation set: tune the hyperparameters (optional)
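
A sketch of a three-way split by shuffling row indices. The 60 / 20 / 20 proportions and the sample count are assumptions, not a rule.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100  # hypothetical dataset size

# Shuffle row indices, then cut at 60% and 80%.
indices = rng.permutation(n_samples)
train_idx, val_idx, test_idx = np.split(indices, [60, 80])
```

Each row ends up in exactly one of the three sets.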

Question 16

Q

Validation set - purpose

Answer

A

ML = iterative process; fine tuning
Used to ensure model can generalize, used during hyperparameter tuning
Test set: just for final evaluation
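
The workflow can be sketched as follows: pick a hyperparameter on the validation set, then touch the test set once at the end. The synthetic data, the k-NN model, and the candidate values of k are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune the hyperparameter k using the validation set only.
val_scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)
best_k = max(val_scores, key=val_scores.get)

# Final evaluation: the test set is used once, after tuning is finished.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
```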

Question 17

Q

Normalizing

Answer

A

transforms features to a common scale by adjusting the distribution of the data (standardization is a form of normalizing)
e.g. centers the data around zero & adjusts the spread of values

(17 cards)