Preprocess Flashcards
The 6 steps of CRISP-DM
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
EDA GOAL
Frame hypotheses, visualise, discover patterns, spot anomalies
3 steps of EDA
- Generate questions
- Search for answers by visualising, transforming and modelling the data
- Use the answers to refine the questions or generate new ones
One-hot encoding
For each distinct value of a categorical feature, a new binary feature is added. Its value is 1 if the row has that value and 0 otherwise.
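A minimal sketch of one-hot encoding in plain Python. The helper name one_hot_encode and the colour data are made up for illustration, and the column order (alphabetical) is an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    categories = sorted(set(values))  # one new feature per distinct value
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


colours = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colours)
print(categories)
print(encoded)
```

Every encoded row sums to 1, because each original value belongs to exactly one category.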
Normalisation
KNN, SVM and gradient descent are affected by the scale of the data. Normalisation (min-max scaling) rescales each feature to a fixed range, usually [0, 1]. It changes the scale only, not the shape of the distribution.
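A minimal sketch of min-max scaling, one common meaning of normalisation. The function name and the sample ages are illustrative assumptions, and mapping a constant feature to 0.0 is a choice made here.

```python
def min_max_normalise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 60]
scaled = min_max_normalise(ages)
print(scaled)
```

The relative spacing between values is preserved; only the scale changes.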
Standardisation
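Rescales each feature to mean 0 and standard deviation 1 (z-score). Works best when the distribution is roughly normal. It does not make the data normally distributed and does not bound it to a fixed range.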
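A minimal sketch of z-score standardisation using only the standard library. The function name and the height data are illustrative, and using the population standard deviation (pstdev) rather than the sample one is an assumption of this sketch.

```python
import statistics


def standardise(values):
    """Subtract the mean and divide by the standard deviation (z-score)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature has no spread; map it to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


heights = [150.0, 160.0, 170.0, 180.0]
z = standardise(heights)
print(z)
```

After the transform the values have mean 0 and standard deviation 1, but they are not confined to a fixed interval.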
Discretisation
Convert continuous values into discrete bins. E.g. some decision tree algorithms (such as ID3) need discrete values.
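A minimal sketch of equal-width binning, one simple way to discretise. The function name, the temperature data and the choice of 3 bins are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


temperatures = [0.0, 4.0, 9.0, 15.0, 22.0, 30.0]
bins = equal_width_bins(temperatures, 3)
print(bins)
```

Here the range 0 to 30 is split into three intervals of width 10, so each temperature is replaced by the index of its interval.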
MAR
Missing at random: the missingness depends on other observed variables but not on the missing value itself. E.g. whether income is missing depends on education level but not on the income itself.
MNAR
Missing not at random: the missingness depends on the missing values themselves. E.g. high-income individuals are less likely to disclose their income.
DELETION
- When less than 5% of the data contains missing values
- Or when the data is MCAR (missing completely at random)
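A minimal sketch of row deletion guarded by the 5% rule of thumb. The function name, the use of None as the missing-value marker and the toy data are assumptions. The sketch does not check for MCAR, which cannot be verified from the data alone.

```python
def drop_incomplete_rows(rows, max_missing_fraction=0.05):
    """Drop rows containing None, but only when they are a small share of the data."""
    if not rows:
        return []
    incomplete = sum(1 for r in rows if any(v is None for v in r))
    fraction = incomplete / len(rows)
    if fraction >= max_missing_fraction:
        raise ValueError(
            f"{fraction:.0%} of rows have missing values; consider imputation instead"
        )
    return [r for r in rows if all(v is not None for v in r)]


rows = [[i, i * 2] for i in range(25)]
rows[3][1] = None  # 1 of 25 rows (4%) is incomplete
cleaned = drop_incomplete_rows(rows)
print(len(cleaned))
```

Raising an error above the threshold is a design choice of this sketch: it forces an explicit decision instead of silently discarding a large part of the data.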
The curse of dimensionality
High-dimensional data often contains irrelevant and redundant features, e.g. features that are highly correlated with each other.
Filter FS
Scores features with existing measures (e.g. correlation). No learning algorithm is involved.
Wrapper FS
Trains ML models many times on different feature subsets and uses their learning performance to evaluate each subset.
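A minimal sketch of a wrapper method: greedy forward selection that re-evaluates a model for every candidate subset. The choice of a 1-nearest-neighbour classifier, leave-one-out accuracy as the score, and the toy data are all assumptions for illustration.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given feature indices."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # hold out the point being classified
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def forward_selection(X, y):
    """Greedily add the feature that most improves the score; stop when nothing helps."""
    remaining = set(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        scores = {f: loo_accuracy(X, y, selected + [f]) for f in sorted(remaining)}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break
        selected.append(f_best)
        best_score = scores[f_best]
        remaining.remove(f_best)
    return selected, best_score


# Feature 0 separates the classes; feature 1 is unrelated to the class.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 9.0], [3.0, 5.2], [3.1, 0.8], [2.9, 9.1]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y)
print(selected, score)
```

The model is evaluated once per candidate feature in every round, which is why wrapper methods are more expensive than the alternatives.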
Embedded FS
Trains the ML model once and selects features based on the learned model.
Feature Ranking
Evaluate features individually
Rank features and select the top-ranked features
Simple, efficient
But ignores feature interactions (it can select redundant features)
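A minimal sketch of feature ranking, assuming absolute Pearson correlation with the target as the score. The function names and the housing data are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation; score it 0
    return cov / math.sqrt(vx * vy)


def rank_features(columns, target, k):
    """Score each feature on its own, then keep the k highest-scoring names."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


columns = {
    "size_m2": [50, 60, 80, 100, 120],
    "size_ft2": [538, 646, 861, 1076, 1292],  # same information in other units
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 180, 240, 300, 360]
top2 = rank_features(columns, price, 2)
print(top2)
```

Both size columns end up in the top two even though they carry the same information, because each feature is scored without looking at the others.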