Process simulation - Tatu Flashcards
What are the different types of data preprocessing?
- Normalization
Data between [0, 1]
- Standardization
Data mean 0, standard deviation 1
- Combining data from different sources (concatenation, appending, joining)
- Missing values imputation
Previous value (time series), average, linear interpolation
- Calculated variables, such as moving averages, variances, etc.
- General filtering
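A few of the steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and the sample values are made up for the example, and a real project would more likely use a data library.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling to [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def fill_previous(values):
    """Replace each None with the most recent known value (time series).
    A leading None stays None because there is no previous value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def moving_average(values, window):
    """Calculated variable: mean over each full window of consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

# Hypothetical measurement series with two missing values.
raw = [2.0, None, 4.0, 6.0, None, 8.0]
filled = fill_previous(raw)      # [2.0, 2.0, 4.0, 6.0, 6.0, 8.0]
scaled = normalize(filled)       # smallest value -> 0.0, largest -> 1.0
zscores = standardize(filled)    # mean 0, standard deviation 1
smoothed = moving_average(filled, 3)
```

Imputation is done first here so that the scaling steps never see a missing value.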
Describe Naive Bayes algorithms?
Purely probabilistic classification algorithm
- Assumes the variables are independent of each other within each class (a stronger condition than merely uncorrelated)
- ”Probability that Y has happened when we know X has happened”
Each variable and value is evaluated separately
- Mathematically simple
- Gives probability as the result
Binning potentially required with continuous variables
E.g. limits where the variable is low, medium, or high
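The points above can be illustrated with a small categorical Naive Bayes written from scratch. This is a sketch under assumptions: the variable names, bin limits, and training rows are invented for the example, and Laplace (+1) smoothing is added so that an unseen value does not force a probability of zero.

```python
from collections import Counter, defaultdict

def to_bin(value, low_limit, high_limit):
    """Binning: map a continuous value to 'low', 'medium', or 'high'."""
    if value < low_limit:
        return "low"
    if value < high_limit:
        return "medium"
    return "high"

def train(rows, labels):
    """Count class frequencies and, per class, the value frequencies of each variable."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (variable index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_proba(row, class_counts, value_counts, n_values=3):
    """Return P(class | row). Each variable contributes one separate factor."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace smoothing; n_values is the number of bins per variable.
            score *= (value_counts[(i, label)][value] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Hypothetical (temperature, pressure) measurements and their outcomes.
raw = [(60, 1.0), (62, 1.1), (75, 1.0), (88, 1.9), (91, 2.1), (78, 2.0)]
labels = ["OK", "OK", "OK", "REJECT", "REJECT", "REJECT"]

def bin_row(temperature, pressure):
    # Bin limits are assumptions chosen for this example.
    return (to_bin(temperature, 70, 85), to_bin(pressure, 1.2, 1.8))

rows = [bin_row(t, p) for t, p in raw]
class_counts, value_counts = train(rows, labels)
probs = predict_proba(bin_row(90, 2.0), class_counts, value_counts)
print(probs)  # probability for each class; REJECT dominates for this sample
```

The result is a probability per class rather than only a label, which is what makes the later threshold-based scoring possible.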
Describe Scoring / Confusion Matrix?
Confusion Matrix can be built from the TN/TP/FN/FP values (true negatives, true positives, false negatives, false positives)
Goal is to find ”REJECT”, so ”REJECT” is treated as the positive class
Top left (TN) and bottom right (TP) are the ”good” predictions
Different statistics can be calculated based on these
No single metric tells the absolute truth
High accuracy doesn’t necessarily mean anything with imbalanced data (unless 100 %)
High precision with low recall may not be very useful
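The warnings about accuracy and precision can be checked with a small worked example. This is an illustrative sketch: the function names and the 95/5 data set are made up, and ”REJECT” is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive="REJECT"):
    """Return (TN, FP, FN, TP) for a two-class problem."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tn, fp, fn, tp

def metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall calculated from the four counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Imbalanced data: 95 OK items and 5 REJECT items.
actual = ["OK"] * 95 + ["REJECT"] * 5
# A weak model that finds only one of the five rejects.
predicted = ["OK"] * 95 + ["REJECT"] + ["OK"] * 4

tn, fp, fn, tp = confusion_counts(actual, predicted)
accuracy, precision, recall = metrics(tn, fp, fn, tp)

# Matrix layout: rows = actual, columns = predicted, TN top left, TP bottom right.
print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall)
```

Accuracy is 96 % and precision is 100 %, yet recall is only 20 %: four of the five rejects are missed.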
Describe Scoring / Receiver Operating Characteristic (ROC) curve?
Compares True Positive rate to False Positive rate at different classifier sensitivities (prediction confidence)
- Operating point can be chosen
Area Under Curve is the measured metric
- 1 = perfect classification
- 0.5 = as good as random guessing
Not perfect for every case
- Highly unbalanced data problematic
- Precision/Recall curve better
Only available if the model produces prediction confidence
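How the curve and its area come out of prediction confidences can be sketched as follows. This is a minimal version under assumptions: labels are 1 (positive) and 0 (negative), both classes are present, the names and sample scores are invented, and the area is approximated with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group with equal scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical prediction confidences for six samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)  # 8/9 here: one negative outranks one positive

perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
no_skill = auc(roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

Each point on the curve corresponds to one possible threshold, so choosing an operating point means choosing which of these (FPR, TPR) pairs to accept.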