! Session 2: Pre-Processing & EDA Flashcards
Cleaning - examples
o Convert fields with text into numeric form
o Remove rows with missing / faulty data
o Fix rows with faulty datapoints
o Drop unnecessary variables (columns)
o Scaling or normalizing data
o Creating new variables
o Renaming variables (remove capitalization, spaces)
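A minimal pandas sketch of several of the cleaning steps above. The data frame, its column names, and the rule that a body mass of zero or less counts as faulty are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; names and values are illustrative only.
raw = pd.DataFrame({
    "Body Mass": ["3750", "3800", None, "-1"],
    "Sex": ["male", "female", "female", "male"],
    "Row ID": [1, 2, 3, 4],
})

# Rename variables: lower case, underscores instead of spaces.
df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Drop an unnecessary variable (column).
df = df.drop(columns=["row_id"])

# Convert a text field into numeric form; unparseable entries become NaN.
df["body_mass"] = pd.to_numeric(df["body_mass"], errors="coerce")

# Assumed rule: a non-positive body mass is faulty, so mark it as missing.
df.loc[df["body_mass"] <= 0, "body_mass"] = np.nan

# Remove rows with missing / faulty data.
df = df.dropna(subset=["body_mass"]).reset_index(drop=True)

# Create a new variable from an existing one.
df["body_mass_kg"] = df["body_mass"] / 1000
```

Whether faulty rows are dropped or fixed depends on the dataset; dropping is only the simplest option.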
Methods for converting categorical (& ordinal) values to numerical values
- one hot encoding
- label encoding
- ordinal encoding
one hot encoding
= transforming each possible value of a categorical variable into its own binary (dummy-variable) column (e.g. true = 1, false = 0)
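One way to sketch one hot encoding is with `pandas.get_dummies`. The `island` column and its values are illustrative assumptions, not part of the notes.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# One new 0/1 column per distinct value; dtype=int gives 1/0 instead of True/False.
dummies = pd.get_dummies(df["island"], prefix="island", dtype=int)

print(dummies)
```

Each row has exactly one 1 across the new columns, so no order between the categories is implied.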
label encoding
= transforming each possible value of a categorical variable into a unique integer value
! Issue: Algorithms might wrongly infer an order / relationship from the integers
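A small sketch of label encoding using scikit-learn's `LabelEncoder`; the colour values are made up. Note that scikit-learn documents this class as intended for target labels (y), so using it here only illustrates the idea.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal values with no natural order.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Classes are sorted alphabetically before numbering.
print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]
```

The integers suggest blue < green < red, which is exactly the spurious order the issue above warns about.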
Ordinal encoding
= transforming each possible value of a categorical variable into a unique integer value such that the integers preserve the order of the categories (e.g. body mass in g grouped into ordered size classes)
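A sketch of ordinal encoding with scikit-learn's `OrdinalEncoder`, where the order is passed in explicitly. The size classes `small`, `medium`, `large` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories, one feature column.
sizes = np.array([["medium"], ["small"], ["large"], ["small"]])

# The categories argument fixes the order: small=0, medium=1, large=2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```

Without the `categories` argument the encoder would sort the strings alphabetically, which would not match the intended order.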
EDA
- Exploratory Data Analysis
- approach for data analysis that employs a variety of techniques (mostly graphical) to
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
2-D Histogram
= dividing points among 2D bins and counting how many points fall into each bin
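The binning can be sketched with `numpy.histogram2d`; the five points and the 2 x 2 grid are arbitrary choices for illustration.

```python
import numpy as np

# Five made-up points in the unit square.
x = np.array([0.1, 0.2, 0.8, 0.9, 0.7])
y = np.array([0.1, 0.3, 0.9, 0.6, 0.2])

# 2 bins per axis over [0, 1]; counts[i, j] is the number of points
# in x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=2, range=[[0, 1], [0, 1]]
)

print(counts.tolist())  # [[2.0, 0.0], [1.0, 2.0]]
```

A plotted 2-D histogram colours each bin by this count.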
Skewness
= measure of the asymmetry of a variable's probability distribution about its mean
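As a rough sketch, skewness can be computed as the third central moment divided by the variance raised to the power 1.5. This is the uncorrected (population) version; library functions such as pandas `Series.skew` apply a sample-size correction, so their numbers can differ slightly. The helper name `skewness` and the sample values are made up.

```python
import numpy as np

def skewness(values):
    """Uncorrected skewness: m3 / m2**1.5 with central moments m2, m3."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # variance (population form)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]       # mirror image around the mean
right_skewed = [1, 1, 1, 2, 10]   # long tail to the right

print(skewness(symmetric))     # 0 for a symmetric sample
print(skewness(right_skewed))  # positive for a right tail
```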
Boxplot
= method for graphically depicting groups of numerical data through their quartiles
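The quartiles a boxplot is built from can be sketched with `numpy.percentile`. The data are made up, and the 1.5 x IQR rule for flagging outliers is a common plotting convention assumed here, not something stated in these notes.

```python
import numpy as np

# Made-up sample with one unusually large value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())  # [30]
```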
Scaling & Normalizing - Relevance
- Some learning algorithms are sensitive to the scale differences in variables
- e.g. distance-based algorithms: kNN & SVM
- e.g. 1 € != 1 DKK
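A small sketch of why scale matters for distance-based algorithms. The three customers, their ages, and their incomes in DKK are invented for illustration.

```python
import numpy as np

# Hypothetical customers: [age in years, income in DKK].
a = np.array([25.0, 300_000.0])
b = np.array([60.0, 301_000.0])  # very different age, similar income
c = np.array([26.0, 320_000.0])  # similar age, somewhat higher income

# Euclidean distance on the raw values is dominated by the income column,
# simply because its numbers are much larger.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

print(dist_ab < dist_ac)  # True: b looks "closer" to a despite the 35-year age gap
```

After scaling both features to a comparable range, the age difference would count as well.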
Mathematical Transformation
- scaling & normalizing
- log, exp
- standardization
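As an illustration of the log / exp pair, a log transform compresses values that span several orders of magnitude, and exp undoes it. The income values are made up.

```python
import numpy as np

# Made-up values, each 10 times larger than the previous one.
incomes = np.array([10.0, 100.0, 1_000.0, 10_000.0])

# After the log transform the values are evenly spaced (steps of ln 10).
logged = np.log(incomes)

# exp is the inverse transformation and recovers the original values.
restored = np.exp(logged)

print(np.diff(logged))
print(np.allclose(restored, incomes))  # True
```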
Standardization
- data points expressed as the number of standard deviations (SD) from the mean
- rescales each variable to mean 0 and variance 1
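A sketch of standardization with scikit-learn's `StandardScaler`, checked against the manual formula (x - mean) / SD. The single-feature data are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, five samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Same result by hand: subtract the mean, divide by the (population) SD.
X_manual = (X - X.mean()) / X.std()

print(X_std.mean(), X_std.std())
print(np.allclose(X_std, X_manual))  # True
```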
Scaling
- adjusts the range of feature values to a specific range
- e.g. between 0 and 1
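Scaling to the range 0 to 1 can be sketched with scikit-learn's `MinMaxScaler`, which maps the smallest value to 0 and the largest to 1. The input values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one feature, three samples.
X = np.array([[10.0], [20.0], [40.0]])

# (x - min) / (max - min): 10 -> 0, 20 -> 1/3, 40 -> 1.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(X_scaled.ravel())
```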
Pipelines
- Data transformation steps often need to be executed multiple times and in the right order
- Method: Pipeline class from Scikit-Learn
- Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps
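A minimal sketch of a scikit-learn `Pipeline` that standardizes the features and then fits a kNN classifier. The step names `"scaler"` and `"knn"`, the toy data, and the choice of `n_neighbors=1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales, two classes.
X = np.array([[1.0, 1000.0], [2.0, 1100.0], [9.0, 5000.0], [10.0, 5200.0]])
y = np.array([0, 0, 1, 1])

# List of (name, estimator) pairs; steps run in this order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])

# fit() fits the scaler, transforms X, then fits the classifier.
pipe.fit(X, y)

# predict() reuses the fitted scaler before classifying new points.
pred = pipe.predict(np.array([[1.5, 1050.0], [9.5, 5100.0]]))
print(pred.tolist())  # [0, 1]
```

Bundling the steps means the same transformations are applied, in the same order, to training data and to any later data.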
Splitting of data
- Training set: train the model
- Test set: confirm that the model works on data it has not seen during training
- Validation set (optional): tune the hyperparameters
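One way to sketch the three-way split is to call scikit-learn's `train_test_split` twice. The 60 / 20 / 20 proportions, the toy data, and `random_state=0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 2 features, binary labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First split off the test set (20% of all samples).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remainder into training and validation sets
# (25% of the remaining 80% is 20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```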