Data Mining - Formulas Flashcards
How do you compute accuracy in classification?
(True Positives + True Negatives) / n
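A minimal Python sketch of the formula; the confusion-matrix counts are made-up values for illustration:

```python
# Hypothetical confusion-matrix counts (assumed for illustration).
tp, tn, fp, fn = 40, 45, 5, 10
n = tp + tn + fp + fn      # total number of predictions
accuracy = (tp + tn) / n   # note the parentheses around TP + TN
```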
How do you compute total cost?
(TP * cost TP) + (FP * cost FP) + (TN * cost TN) + (FN * cost FN)
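The same weighted sum in Python; both the counts and the per-outcome costs are assumptions:

```python
# Hypothetical counts and per-outcome costs (assumed values).
counts = {"TP": 40, "FP": 5, "TN": 45, "FN": 10}
costs = {"TP": 0, "FP": 1, "TN": 0, "FN": 5}  # e.g. a missed positive is expensive
total_cost = sum(counts[k] * costs[k] for k in counts)
```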
What is the Kappa statistic in multiclass prediction?
You can use it if you have an actual predictor confusion matrix and a random predictor confusion matrix.
-> It measures the improvement compared to the random predictor
(success rate actual predictor - success rate random predictor) / (1 - success rate random predictor)
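A sketch of the Kappa formula; the two success rates are assumed example values:

```python
# Kappa from success rates; the rates below are assumptions for illustration.
success_actual = 0.85   # accuracy of the actual predictor
success_random = 0.50   # expected accuracy of a random predictor
kappa = (success_actual - success_random) / (1 - success_random)
```

Kappa is 0 when the model is no better than random, and 1 when it is perfect.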
What is Recall?
The ability of the model to find all of the items of the class.
(True Positive) / (True Positive + False Negative)
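In Python, with assumed counts:

```python
# Assumed confusion-matrix counts.
tp, fn = 40, 10
recall = tp / (tp + fn)   # fraction of actual positives the model found
```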
What is Precision?
The ability of the model to return only relevant items: of everything predicted as the class, the fraction that truly belongs to it.
(True Positive) / (True Positive + False Positives)
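In Python, with assumed counts:

```python
# Assumed confusion-matrix counts.
tp, fp = 40, 5
precision = tp / (tp + fp)   # fraction of positive predictions that were correct
```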
What is the F-Measure?
Takes into account both Recall and Precision: it is their harmonic mean.
F = 2 / ((1/R) + (1/P))
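A sketch with assumed recall and precision values; the second line shows the equivalent product form:

```python
# Assumed example values.
recall, precision = 0.8, 0.9
f_measure = 2 / ((1 / recall) + (1 / precision))
# Equivalent form: 2 * precision * recall / (precision + recall)
```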
How do you calculate P(outlook = sunny | Play tennis = yes)?
You check how many of the days on which you played tennis were sunny.
You divide that by the total number of days on which you played tennis.
So you compute a conditional probability: P(sunny and yes) / P(yes)
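The counting steps above, sketched in Python; the toy Play-Tennis records are made up for illustration:

```python
# Toy (outlook, play) records, assumed for illustration.
records = [
    ("sunny", "yes"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("sunny", "yes"), ("rain", "no"),
]
yes_days = [r for r in records if r[1] == "yes"]          # days you played
sunny_and_yes = [r for r in yes_days if r[0] == "sunny"]  # sunny among those
p_sunny_given_yes = len(sunny_and_yes) / len(yes_days)
```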
How do we use Laplace Smoothing?
For every probability, you add 1 to the numerator and the number of possible values of that attribute to the denominator.
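A sketch of smoothing one conditional probability; the counts and the number of outlook values (3: sunny, overcast, rain) are assumptions:

```python
# Laplace-smoothed estimate of P(outlook = sunny | play = yes).
# Counts and the number of outlook values are assumed for illustration.
sunny_yes_count = 2
yes_count = 4
n_outlook_values = 3   # sunny, overcast, rain
p_smoothed = (sunny_yes_count + 1) / (yes_count + n_outlook_values)
```

Smoothing keeps any probability from becoming exactly zero when a count is zero.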
How do you calculate P(Play Tennis = yes)
You look at the number of outcomes that are yes and divide that by the total number of outcomes.
This extends to the other classes, of course.
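In Python, with an assumed list of outcomes:

```python
# Assumed class labels for illustration.
outcomes = ["yes", "no", "yes", "yes", "no"]
p_yes = outcomes.count("yes") / len(outcomes)   # prior probability of "yes"
```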
How do you calculate the P that a record is classified in a class given 3 variables?
P(class) * P(x1 | class) * P(x2 | class) * P(x3 | class)
You compute this score for every class and predict the class with the largest score.
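A sketch of the product for one class; the prior and the three likelihoods are assumed values:

```python
# Unnormalized naive Bayes score for one class; all probabilities assumed.
p_class = 0.6
likelihoods = [0.5, 0.4, 0.7]   # P(x1|class), P(x2|class), P(x3|class)
score = p_class
for p in likelihoods:
    score *= p
```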
What is the Euclidean distance and how does it work?
Most popular distance measure for numerical values.
dij = √ ((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 +…+ (Xip -Xjp)^2)
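The formula as a small Python function:

```python
import math

def euclidean(xi, xj):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
```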
What is the Manhattan distance?
More robust to outliers than the Euclidean distance: it sums absolute differences instead of squared differences.
dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
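The same formula in Python:

```python
def manhattan(xi, xj):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(xi, xj))
```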
How do you normalize values?
(Value - average) / standard deviation
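A sketch of z-score normalization on an assumed sample; note this uses the sample standard deviation (`stdev`), while some courses use the population version (`pstdev`):

```python
import statistics

# Assumed sample values.
values = [2, 4, 6, 8]
mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample standard deviation
z_scores = [(v - mean) / sd for v in values]
```

After normalization the values have mean 0 and standard deviation 1.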
How do you calculate the OLS?
You calculate the error for every Yi that you have.
Yi is the actual observation for X.
The error (residual) for Yi is calculated by: Yi - Yhati
Yhati is the fitted value, i.e. the model's estimate for Xi.
OLS minimizes SUM (Yi - Yhati)^2, so you pick the model with the smallest sum of squared errors.
Remember to compute and square each error before you add them up.
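The sum of squared errors as a small Python function; the example vectors are made up:

```python
def sse(y_actual, y_predicted):
    """Sum of squared errors: square each residual, then add them up."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))
```

To compare models, compute `sse` for each one and keep the model with the smallest value.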
How do you compute the GINI index?
If you have a split, you have two (or more) classes.
For each class, you divide the #records in that class by the #records of that node level. This way you have the proportion per class.
You square those proportions and subtract them all from 1. That is your GINI.
Example: Class 1 has 2 and Class 2 has 4. Total of node level = 6.
1 - (2/6)^2 - (4/6)^2 = GINI
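The example above as a Python function that works for any number of classes:

```python
def gini(class_counts):
    """GINI index of a node from its per-class record counts."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)
```

For the example node with counts 2 and 4, this gives 1 - (2/6)^2 - (4/6)^2 = 4/9.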