Data Mining - Formulas Flashcards
How do you compute accuracy in classification?
(True Positives + True Negatives) / n
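A minimal Python sketch of the formula; the confusion-matrix counts are made-up values for illustration:

```python
# Hypothetical confusion-matrix counts (assumed for illustration).
tp, tn, fp, fn = 40, 45, 5, 10
n = tp + tn + fp + fn      # total number of predictions
accuracy = (tp + tn) / n   # note the parentheses around TP + TN
```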
How do you compute total cost?
(TP * cost TP) + (FP * cost FP) + (TN * cost TN) + (FN * cost FN)
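The same weighted sum in Python; both the counts and the per-outcome costs are assumptions:

```python
# Hypothetical counts and per-outcome costs (assumed values).
counts = {"TP": 40, "FP": 5, "TN": 45, "FN": 10}
costs = {"TP": 0, "FP": 1, "TN": 0, "FN": 5}  # e.g. a missed positive is expensive
total_cost = sum(counts[k] * costs[k] for k in counts)
```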
What is the Kappa statistic in multiclass prediction?
You can use it if you have an actual predictor confusion matrix and a random predictor confusion matrix.
-> It measures the improvement compared to the random predictor
(success rate actual predictor - success rate random predictor) / (1 - success rate random predictor)
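A sketch of the Kappa formula; the two success rates are assumed example values:

```python
# Kappa from success rates; the rates below are assumptions for illustration.
success_actual = 0.85   # accuracy of the actual predictor
success_random = 0.50   # expected accuracy of a random predictor
kappa = (success_actual - success_random) / (1 - success_random)
```

Kappa is 0 when the model is no better than random, and 1 when it is perfect.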
What is Recall?
The ability of the model to find all of the items of the class.
(True Positive) / (True Positive + False Negative)
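In Python, with assumed counts:

```python
# Assumed confusion-matrix counts.
tp, fn = 40, 10
recall = tp / (tp + fn)   # fraction of actual positives the model found
```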
What is Precision?
The ability of the model to return only relevant items: of everything predicted as the class, the fraction that truly belongs to it.
(True Positive) / (True Positive + False Positives)
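In Python, with assumed counts:

```python
# Assumed confusion-matrix counts.
tp, fp = 40, 5
precision = tp / (tp + fp)   # fraction of positive predictions that were correct
```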
What is the F-Measure?
Takes into account both Recall and Precision: it is their harmonic mean.
F = 2 / ((1/R) + (1/P))
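A sketch with assumed recall and precision values; the second line shows the equivalent product form:

```python
# Assumed example values.
recall, precision = 0.8, 0.9
f_measure = 2 / ((1 / recall) + (1 / precision))
# Equivalent form: 2 * precision * recall / (precision + recall)
```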
How do you calculate P(outlook = sunny | Play tennis = yes)?
You check how many of the days on which you played tennis were sunny.
You divide that by the total number of days on which you played tennis.
So you compute a conditional probability: P(sunny and yes) / P(yes)
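The counting steps above, sketched in Python; the toy Play-Tennis records are made up for illustration:

```python
# Toy (outlook, play) records, assumed for illustration.
records = [
    ("sunny", "yes"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("sunny", "yes"), ("rain", "no"),
]
yes_days = [r for r in records if r[1] == "yes"]          # days you played
sunny_and_yes = [r for r in yes_days if r[0] == "sunny"]  # sunny among those
p_sunny_given_yes = len(sunny_and_yes) / len(yes_days)
```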
How do we use Laplace Smoothing?
For every probability, you add 1 to the numerator and the number of possible values of that attribute to the denominator.
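A sketch of smoothing one conditional probability; the counts and the number of outlook values (3: sunny, overcast, rain) are assumptions:

```python
# Laplace-smoothed estimate of P(outlook = sunny | play = yes).
# Counts and the number of outlook values are assumed for illustration.
sunny_yes_count = 2
yes_count = 4
n_outlook_values = 3   # sunny, overcast, rain
p_smoothed = (sunny_yes_count + 1) / (yes_count + n_outlook_values)
```

Smoothing keeps any probability from becoming exactly zero when a count is zero.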
How do you calculate P(Play Tennis = yes)
You look at the number of outcomes that are yes and divide that by the total number of outcomes.
This extends to the other classes, of course.
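In Python, with an assumed list of outcomes:

```python
# Assumed class labels for illustration.
outcomes = ["yes", "no", "yes", "yes", "no"]
p_yes = outcomes.count("yes") / len(outcomes)   # prior probability of "yes"
```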
How do you calculate the P that a record is classified in a class given 3 variables?
P(class) * P(x1 | class) * P(x2 | class) * P(x3 | class)
You compute this score for every class and predict the class with the largest score.
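A sketch of the product for one class; the prior and the three likelihoods are assumed values:

```python
# Unnormalized naive Bayes score for one class; all probabilities assumed.
p_class = 0.6
likelihoods = [0.5, 0.4, 0.7]   # P(x1|class), P(x2|class), P(x3|class)
score = p_class
for p in likelihoods:
    score *= p
```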
What is the Euclidean distance and how does it work?
Most popular distance measure for numerical values.
dij = √ ((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 +…+ (Xip -Xjp)^2)
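The formula as a small Python function:

```python
import math

def euclidean(xi, xj):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
```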
What is the Manhattan distance?
More robust to outliers than the Euclidean distance: it sums absolute differences instead of squared differences.
dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
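The same formula in Python:

```python
def manhattan(xi, xj):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(xi, xj))
```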
How do you normalize values?
(Value - average) / standard deviation
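A sketch of z-score normalization on an assumed sample; note this uses the sample standard deviation (`stdev`), while some courses use the population version (`pstdev`):

```python
import statistics

# Assumed sample values.
values = [2, 4, 6, 8]
mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample standard deviation
z_scores = [(v - mean) / sd for v in values]
```

After normalization the values have mean 0 and standard deviation 1.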
How do you calculate the OLS?
You calculate the error for every Yi that you have.
Yi is the actual observation for X.
The error (residual) for Yi is calculated by: Yi - Yhati
Yhati is the fitted value, i.e. the model's estimate for Xi.
OLS minimizes SUM (Yi - Yhati)^2, so you pick the model with the smallest sum of squared errors.
Remember to compute and square each error before you add them up.
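The sum of squared errors as a small Python function; the example vectors are made up:

```python
def sse(y_actual, y_predicted):
    """Sum of squared errors: square each residual, then add them up."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))
```

To compare models, compute `sse` for each one and keep the model with the smallest value.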
How do you compute the GINI index?
If you have a split, you have two (or more) classes.
For each class, you divide the #records in that class by the #records of that node level. This way you have the proportion per class.
You square those proportions and subtract them all from 1. That is your GINI.
Example: Class 1 has 2 and Class 2 has 4. Total of node level = 6.
1 - (2/6)^2 - (4/6)^2 = GINI
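The example above as a Python function that works for any number of classes:

```python
def gini(class_counts):
    """GINI index of a node from its per-class record counts."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)
```

For the example node with counts 2 and 4, this gives 1 - (2/6)^2 - (4/6)^2 = 4/9.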