Process simulation - Tatu Flashcards

Question 1

Q

What are the different types of data preprocessing?

Answer

A

Normalization
- Data scaled to the range [0, 1]
Standardization
- Data rescaled to mean 0 and standard deviation 1
Combining data from different sources (concatenation, appending, joining)
Missing value imputation
- Previous value (time series), average, or linear interpolation
Calculated variables, such as moving averages, variances, etc.
General filtering
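
A minimal sketch of four of the steps above (normalization, standardization, previous-value imputation, moving average), using only the Python standard library. The function names and the sensor readings are made up for illustration, and the code assumes the data is not constant and does not start with a missing value.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma != 0


def fill_previous(values):
    """Replace each None with the previous known value (time-series imputation)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # a leading None would stay None
    return filled


def moving_average(values, window):
    """Mean of every full window of consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


readings = [2.0, None, 4.0, 6.0, None, 8.0]  # hypothetical sensor data
filled = fill_previous(readings)             # [2.0, 2.0, 4.0, 6.0, 6.0, 8.0]
scaled = normalize(filled)                   # first value 0.0, last value 1.0
zscores = standardize(filled)
smoothed = moving_average(filled, window=3)
```

Imputation is done first here because the scaling functions cannot handle missing values.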

Question 2

Q

Describe Naive Bayes algorithms?

Answer

A

Purely probabilistic classification algorithm
- Assumes the variables are conditionally independent given the class (a stronger condition than being uncorrelated)
- "Probability that Y has happened when we know X has happened", i.e. P(Y | X)
Each variable and value is evaluated separately
- Mathematically simple
- Gives probability as the result
Binning potentially required with continuous variables
E.g. limits where the variable is low, medium, or high
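
A minimal sketch of the idea, assuming two continuous process variables that are binned into low/medium/high before counting. The bin limits, the data, and the function names are hypothetical, and the add-one (Laplace) smoothing is an addition not mentioned on the card; it only keeps unseen values from producing a zero probability.

```python
from collections import Counter, defaultdict


def bin_value(x, low_limit, high_limit):
    """Turn a continuous value into one of three bins."""
    if x < low_limit:
        return "low"
    if x < high_limit:
        return "medium"
    return "high"


def train(rows, labels):
    """Count classes, and count each variable's values separately per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (variable index, class)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict_proba(row, class_counts, value_counts, n_bins=3):
    """Return P(class | row) for every class."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Y)
        for i, value in enumerate(row):
            # P(X_i = value | Y), with add-one smoothing (an assumption)
            score *= (value_counts[(i, label)][value] + 1) / (count + n_bins)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical training data: (temperature, pressure) and the outcome.
raw = [(60, 1.0), (65, 1.1), (70, 1.2), (85, 2.0), (90, 2.1), (95, 1.1)]
labels = ["OK", "OK", "OK", "REJECT", "REJECT", "REJECT"]


def to_bins(sample):
    temperature, pressure = sample
    return (bin_value(temperature, 68, 80), bin_value(pressure, 1.15, 1.8))


rows = [to_bins(s) for s in raw]
class_counts, value_counts = train(rows, labels)
proba = predict_proba(to_bins((88, 1.9)), class_counts, value_counts)
```

The result is a probability per class rather than only a label, which is what makes threshold-based scoring possible later.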

Question 3

Q

Describe Scoring / Confusion Matrix?

Answer

A

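The confusion matrix is built from the TN/TP/FN/FP counts (true negatives, true positives, false negatives, false positives)
Goal is to find "REJECT" (treated here as the positive class)
Top left (TN) and bottom right (TP) are the "good" (correct) predictions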
Different statistics can be calculated based on these
No single metric tells the absolute truth
High accuracy doesn’t necessarily mean anything with imbalanced data (unless 100 %)
High precision with low recall may not be very useful
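
A small sketch of the accuracy pitfall, assuming "REJECT" is the positive class and using made-up data with 5 rejects out of 100 samples. The function names are hypothetical, and returning 0.0 for an undefined precision or recall is a convention chosen here.

```python
def confusion_counts(actual, predicted, positive="REJECT"):
    """Return (TN, FP, FN, TP) for a binary classification."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp


def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from the four counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Undefined when nothing was predicted positive; 0.0 by convention here.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Undefined when there are no actual positives; 0.0 by convention here.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Imbalanced data: 5 rejects, 95 good parts, and a model that never says REJECT.
actual = ["REJECT"] * 5 + ["OK"] * 95
always_ok = ["OK"] * 100

tn, fp, fn, tp = confusion_counts(actual, always_ok)
matrix = [[tn, fp],   # top left = TN
          [fn, tp]]   # bottom right = TP
scores = metrics(tn, fp, fn, tp)
```

Here accuracy comes out at 0.95 even though recall is 0.0, so the model finds none of the rejects.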

Question 4

Q

Describe Scoring / Receiver Operating Characteristic (ROC) curve?

Answer

A

Plots the true positive rate against the false positive rate at different classification thresholds (prediction confidence cut-offs)
Operating point can be chosen
Area Under Curve (AUC) is the measured metric
- 1 = perfect classification
- 0.5 = as good as random guessing
Not perfect for every case
- Highly unbalanced data is problematic
- A Precision/Recall curve is better in that case
Only available if the model produces prediction confidence
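
A minimal sketch of how the curve and its area can be computed by hand, assuming labels coded as 1 (positive) and 0 (negative) and a confidence score per sample. The data and function names are made up, and the area uses the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predictions: one positive (0.6) is ranked below one negative (0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)  # 8/9, about 0.89

# A model that ranks every positive above every negative reaches an area of 1.
perfect_area = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
```

Each point on the curve corresponds to one threshold, so choosing an operating point means choosing which threshold to use in production.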

(4 cards)