Preprocessing Flashcards
The 6 steps of CRISP-DM
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
EDA GOAL
Frame hypotheses, visualise the data, discover patterns, spot anomalies
3 steps of EDA
- Generate questions
- Search for answers by visualising, transforming, and modelling the data.
- Refine existing questions and generate new ones based on the answers
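A minimal sketch of this loop in Python with pandas and matplotlib; the houses.csv file and its price/location columns are hypothetical stand-ins for any tabular dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: any CSV with numeric "price" and categorical "location"
df = pd.read_csv("houses.csv")

# Step 1: generate a question, e.g. "how is price distributed?"
print(df["price"].describe())

# Step 2: search for the answer by visualising
df["price"].hist(bins=30)
plt.xlabel("price")
plt.show()

# Step 3: refine, e.g. "is the long tail explained by location?"
print(df.groupby("location")["price"].median())
```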
One-hot encoding
For each distinct value, a new feature is added, and the value is either 1 or 0 for that feature.
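A minimal sketch with pandas; the colour column is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

# One new column per distinct value; 1 where the row has that value, else 0
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
print(encoded)
```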
Normalisation
Algorithms such as KNN, SVM, and gradient-descent-based methods are affected by the scale of the data. Normalisation rescales features to a common range, typically [0, 1]; it does not change the shape of the distribution.
Standardisation
Use if the distribution is approximately normal. Rescales each feature to zero mean and unit variance (z-scores).
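A minimal sketch of both transforms using scikit-learn's MinMaxScaler and StandardScaler on a toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Normalisation: rescale to a fixed range, here [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardisation: zero mean, unit variance (z-scores)
print(StandardScaler().fit_transform(X).ravel())
```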
Discretisation
Converts continuous values into discrete bins. E.g. the classic ID3 decision-tree algorithm requires discrete values.
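A minimal sketch using scikit-learn's KBinsDiscretizer; the age values and bin count are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3], [17], [25], [42], [68]])

# Equal-width binning into 3 ordinal categories (0, 1, 2)
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())
```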
MAR (Missing At Random)
The probability that a value is missing depends on other observed variables but not on the missing value itself; e.g. whether income is missing depends on education level, but not on the income itself.
MNAR (Missing Not At Random)
The probability that a value is missing depends on the missing value itself; e.g. high-income individuals are less likely to disclose their income.
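A minimal simulation contrasting the two mechanisms, assuming hypothetical education_years and income columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "education_years": rng.integers(8, 20, 1000),
    "income": rng.normal(50_000, 15_000, 1000),
})

# MAR: the chance income is missing depends on education (observed),
# not on income itself
mar = df.copy()
mar.loc[rng.random(1000) < (mar["education_years"] < 12) * 0.5, "income"] = np.nan

# MNAR: the chance income is missing depends on income itself
mnar = df.copy()
mnar.loc[rng.random(1000) < (mnar["income"] > 70_000) * 0.5, "income"] = np.nan
```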
DELETION
- When less than 5% of the data contains missing values
- Or the data is MCAR (Missing Completely At Random)
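A minimal sketch of listwise deletion with pandas; the 5% threshold follows the rule of thumb above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan, 4], "b": [5, 6, 7, 8]})

# Listwise deletion: only defensible for a small missing fraction (or MCAR)
missing_fraction = df["a"].isna().mean()
if missing_fraction < 0.05:
    df = df.dropna()
```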
The curse of dimensionality:
High-dimensional data tends to contain many irrelevant and redundant features, with high correlation between features.
Filter FS
Uses existing statistical measures of the data; no learning algorithm is involved.
Wrapper FS
Uses the learning performance of an ML model to evaluate features, so the model is trained many times.
Embedded FS
Train the ML model once; select features based on the learned model.
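A minimal sketch contrasting the three categories with scikit-learn; the breast-cancer dataset and the particular estimators are illustrative choices, not the only options:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: scores each feature with a statistical test, no model training
filt = SelectKBest(f_classif, k=10).fit(X, y)

# Wrapper: repeatedly trains the model, dropping the weakest feature each round
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: trains the model once, keeps features with non-zero learned weights
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum(), emb.get_support().sum())
```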
Feature Ranking
Evaluate features individually
Rank features and select the top-ranked features
Simple, efficient
But ignores feature interactions (it can select redundant features)
Feature Subset Selection
Iteratively evaluates whole feature subsets. Considers feature interactions and has better performance, but is more expensive/complicated.
SFS (Sequential Forward Selection) steps
- Start with an empty feature set
- Sequentially add the feature X that yields the highest objective value when combined with the current set.
- Stop when a pre-defined number N of features is selected.
(Works best when the optimal subset has a small number of features.)
SBFS steps
- Start from the full feature set
- Sequentially remove the feature X whose removal yields the highest objective value
- Stop when a pre-defined number of features is selected.
(Works best when the optimal number N of features is large.)
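A minimal sketch of both procedures using scikit-learn's SequentialFeatureSelector; the wine dataset, KNN estimator, and N = 5 are illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start empty, add the best feature each step
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)

# Backward selection: start full, remove the worst feature each step
sbs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="backward").fit(X, y)

print(sfs.get_support())
print(sbs.get_support())
```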
Filter FS comparison
Classification accuracy: low; computational cost: low; generality: high
Embedded FS comparison
Classification accuracy: medium; computational cost: medium; generality: medium
Wrapper FS comparison
Classification accuracy: high; computational cost: high; generality: low
PCA
A mathematical procedure that linearly transforms (possibly) correlated features into a (smaller) number of uncorrelated features called principal components. The goal is to retain as much of the data variance as possible.
What is data variance? Why higher variance?
Variance: the average squared deviation from the mean; it measures the spread of the dataset. We want higher variance because we want to retain as much information as possible.
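A minimal PCA sketch with scikit-learn on the iris dataset; explained_variance_ratio_ shows how much of the total variance each component retains:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 correlated features onto 2 uncorrelated principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance each component retains
print(pca.explained_variance_ratio_)
```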