Process simulation - Tatu Flashcards

1
Q

What are the different types of data preprocessing?

A
  • Normalization
    Scales the data to the range [0, 1]
  • Standardization
    Rescales the data to mean 0 and standard deviation 1
  • Combining data from different sources (concatenation, appending, joining)
  • Missing value imputation
    Previous value (time series), average, or linear interpolation
  • Calculated variables, such as moving averages and variances
  • General filtering
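
A minimal sketch of a few of the steps above, using only the Python standard library. The function names and the `readings` series are made up for illustration; real pipelines would typically use a data library instead, and the min-max and standard deviation steps assume the values are not all identical.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scale values into [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def forward_fill(values):
    """Replace each missing value (None) with the previous known value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def moving_average(values, window):
    """Mean of each full window of consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

readings = [2.0, 4.0, None, 8.0, 10.0]   # hypothetical time series with a gap
filled = forward_fill(readings)           # imputation with the previous value
scaled = normalize(filled)                # normalization
zscores = standardize(filled)             # standardization
smoothed = moving_average(filled, 2)      # calculated variable
```

Imputation runs first here because the scaling functions cannot handle missing values.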
2
Q

Describe the Naive Bayes algorithm.

A

Purely probabilistic classification algorithm
- Assumes the variables are independent of each other given the class (a stronger condition than being uncorrelated)
- "Probability that Y has happened when we know X has happened", i.e. P(Y | X)
Each variable and value is evaluated separately
- Mathematically simple
- Gives a probability as the result
Binning may be required for continuous variables
E.g. limits that define where the variable is low, medium, or high
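
The idea can be sketched as a small categorical Naive Bayes over binned variables. Everything here is an illustrative assumption: the bin limits, the temperature/pressure samples, the "OK"/"REJECT" labels, and the use of add-one (Laplace) smoothing to avoid zero probabilities.

```python
from collections import Counter, defaultdict

def bin_value(x, low_limit, high_limit):
    """Bin a continuous value into low / medium / high."""
    if x < low_limit:
        return "low"
    if x < high_limit:
        return "medium"
    return "high"

def train(rows, labels):
    """Count class frequencies and, per class, the values of each variable."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (variable index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_proba(row, class_counts, value_counts, n_bins=3):
    """Return P(class | row), treating each variable separately."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one smoothing: unseen values still get a small probability.
            score *= (value_counts[(i, label)][value] + 1) / (count + n_bins)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

def to_bins(temperature, pressure):
    # Hypothetical limits for low / medium / high.
    return (bin_value(temperature, 50, 80), bin_value(pressure, 1.0, 2.0))

# Made-up (temperature, pressure) samples.
samples = [(45, 0.8), (60, 1.5), (55, 0.9), (85, 2.5), (90, 1.8), (70, 2.2)]
labels = ["OK", "OK", "OK", "REJECT", "REJECT", "REJECT"]

rows = [to_bins(t, p) for t, p in samples]
class_counts, value_counts = train(rows, labels)
proba = predict_proba(to_bins(88, 2.4), class_counts, value_counts)
```

For the new sample, both variables fall into the "high" bin, which is common among the "REJECT" samples, so `proba["REJECT"]` comes out much larger than `proba["OK"]`.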

3
Q

Describe scoring with a confusion matrix.

A

A confusion matrix is built from the TN/TP/FN/FP counts (true/false negatives and positives)
The goal is to find the positive class, here "REJECT"
Top left (TN) and bottom right (TP) are the "good" (correct) predictions
Different statistics, such as accuracy, precision, and recall, can be calculated from these counts
No single metric tells the whole truth
High accuracy doesn't necessarily mean anything with imbalanced data (unless it is 100 %)
High precision with low recall may not be very useful
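
A small sketch with made-up data shows why no single metric is enough. It assumes "REJECT" is the positive class and lays the matrix out with actual classes as rows and predicted classes as columns, so TN is top left and TP is bottom right.

```python
def confusion_counts(actual, predicted, positive="REJECT"):
    """Count TN, FP, FN, TP for a binary problem."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

# Hypothetical imbalanced data: 95 OK items and 5 REJECT items.
# The classifier flags only one of the five REJECT items.
actual = ["OK"] * 95 + ["REJECT"] * 5
predicted = ["OK"] * 95 + ["REJECT"] + ["OK"] * 4

tn, fp, fn, tp = confusion_counts(actual, predicted)
matrix = [[tn, fp],   # actual OK
          [fn, tp]]   # actual REJECT

accuracy = safe_div(tp + tn, len(actual))   # 0.96
precision = safe_div(tp, tp + fp)           # 1.0
recall = safe_div(tp, tp + fn)              # 0.2
```

Accuracy is 96 % and precision is 100 %, yet the classifier misses four of the five "REJECT" items, which only the recall of 20 % reveals.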

4
Q

Describe scoring with a Receiver Operating Characteristic (ROC) curve.

A
Compares the true positive rate to the false positive rate at different classifier sensitivities (thresholds on the prediction confidence)
An operating point can be chosen along the curve
Area Under the Curve (AUC) is the measured metric
1 = perfect classification
0.5 = as good as random guessing
Not suitable for every case
Highly imbalanced data is problematic
A precision/recall curve is a better choice there
Only available if the model produces a prediction confidence
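
As a rough sketch, the curve can be traced by sweeping the threshold from the highest confidence score downwards, and the AUC approximated with the trapezoidal rule. The labels and scores below are made up (1 = positive class), and the code assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold one score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all items sharing the same score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                 # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # hypothetical prediction confidences

points = roc_points(labels, scores)
area = auc(points)   # 8/9, between random guessing (0.5) and perfect (1)
```

Each point in `points` is one possible operating point; choosing a threshold means choosing where on the curve the classifier runs.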