Machine Learning Flashcards
Task
What we want to obtain given a set of data.
Features
The properties of the data that are used as input by the model.
Model
Gets the data as input and returns an output.
Learning algorithm
The algorithm that generates the model, given a specific set of data.
Classification
Task: assign a label Ci, from a set of labels C, to a given input.
Learning task: find a function c*, called a classifier, that approximates the real classification function c as well as possible.
Evaluation
We need to evaluate a model to know how well it works for the task. There are several measures we can use to evaluate a model, for example:
- accuracy
- recall
Decision tree
A model for CLASSIFICATION TASKS that uses a tree in which the internal nodes are questions about the features, the answers label the arcs to follow, and the leaves are the class labels.
Accuracy
It’s an evaluation measure calculated as:
Number of correctly labeled examples / Number of labeled examples
= (TRUE POSITIVES + TRUE NEGATIVES) / (ALL POSITIVES + ALL NEGATIVES)
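A minimal sketch in Python of the accuracy computation; the counts below are made up, purely for illustration:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correctly labeled examples over all labeled examples."""
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical counts for a test set of 100 examples
print(accuracy(tp=40, tn=35, fp=5, fn=20))  # 0.75
```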
Recall
In a BINARY CLASSIFICATION we define:
- POSITIVE RECALL = TRUE POSITIVES / ALL POSITIVES
- NEGATIVE RECALL (SPECIFICITY) = TRUE NEGATIVES / ALL NEGATIVES
- AVERAGE RECALL = (POSITIVE RECALL + NEGATIVE RECALL) / 2
Contingency table
It’s a tool used for BINARY CLASSIFICATION: the columns correspond to the predicted classes and the rows to the actual (true) classes, and each cell holds the number of examples that fall in that combination.

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |
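A sketch in Python that builds the contingency table from true and predicted labels and then reads the recall measures off it (the labels below are hypothetical):

```python
def contingency(y_true, y_pred, positive="pos"):
    """Return the four cells of the 2x2 contingency table."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

y_true = ["pos", "pos", "neg", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]
tp, fn, fp, tn = contingency(y_true, y_pred)

pos_recall = tp / (tp + fn)                # TRUE POSITIVES / ALL POSITIVES
neg_recall = tn / (tn + fp)                # TRUE NEGATIVES / ALL NEGATIVES (specificity)
avg_recall = (pos_recall + neg_recall) / 2
print(tp, fn, fp, tn, pos_recall, neg_recall, avg_recall)
```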
Coverage plot
It’s a plot whose x axis (width) counts the FALSE POSITIVES and whose y axis (height) counts the TRUE POSITIVES.
It can be used to COMPARE MODEL PERFORMANCE: each model is plotted at the coordinates given by its numbers of FALSE and TRUE POSITIVES.
Classifiers with the same accuracy are connected by lines of slope 1 (DEMONSTRATION)
Classifiers with the same average recall are connected by lines parallel to the main diagonal (slope Pos/Neg) (DEMONSTRATION)
NEEDS NORMALIZATION!
ROC plot
It’s the coverage plot, normalized: both axes range from 0 to 1 (x = FALSE POSITIVE rate, y = TRUE POSITIVE rate).
It lets us compare models whose performance was measured on different datasets, with different numbers of positives and negatives.
Classifiers with the same accuracy are connected by lines of slope Neg/Pos (DEMONSTRATION)
Classifiers with the same average recall are connected by lines of slope 1 (DEMONSTRATION)
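A minimal sketch of the two representations, with made-up counts; the coverage-plot coordinates are the raw counts, the ROC coordinates are the same counts normalized by the class sizes:

```python
def coverage_point(tp, fp):
    """Coordinates of a model in the coverage plot (raw counts)."""
    return (fp, tp)

def roc_point(tp, fp, pos, neg):
    """Coordinates of the same model in the ROC plot (rates)."""
    return (fp / neg, tp / pos)

pos, neg = 50, 100        # hypothetical class sizes in the test set
tp, fp = 40, 20           # hypothetical results of one model
print(coverage_point(tp, fp))       # (20, 40)
print(roc_point(tp, fp, pos, neg))  # (0.2, 0.8)
```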
Scoring classification
This is a task similar to classification, but the model is a SCORING CLASSIFIER s*, which takes an input and returns an N-vector of scores, where N is the number of classes.
So it does not return the class associated with the input, but a SCORE for each class.
The TRAINING SET is the same as the one for the CLASSIFICATION.
Margin
For a SCORING BINARY CLASSIFIER the margin is a function that takes an input and returns a positive value if the input is classified correctly, and a negative value otherwise.
It can be written as:
margin(x) = c(x) * s(x) =
1. +|s(x)| if the classification is correct
2. -|s(x)| if the classification is not correct
where
margin(x): is the margin function
c(x): is the true class of x (+1 for positive, -1 for negative class)
s(x): is the score for x given by the classifier
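A minimal sketch, assuming the true label c(x) is encoded as +1/-1 and s(x) is the classifier’s score (both values below are made up):

```python
def margin(c_x: int, s_x: float) -> float:
    """Positive when the sign of the score matches the true class."""
    return c_x * s_x

print(margin(+1, 2.5))   #  2.5 -> correctly classified
print(margin(-1, 2.5))   # -2.5 -> misclassified
```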
Loss function
To reward large positive margins and penalize large negative margins, we define a LOSS FUNCTION.
A loss function is a function L:
L: R –> [0,+inf)
that maps each example’s margin z(x) to an associated LOSS L(x).
We usually require L(x) to satisfy:
L(x) >= 1 for z(x) < 0
L(x) = 1 for z(x) = 0
0 <= L(x) < 1 for z(x) > 0
So the value of L is less than 1 for each correctly classified example, and at least 1 otherwise.
0-1 Loss
A loss function
L(x) = 1 for z(x) <= 0
L(x) = 0 for z(x) > 0
Has some problems:
- it is not continuous, so it is not differentiable everywhere
- its derivative is 0 wherever it is defined (the function is made of two flat pieces), so it gives no gradient to follow
- once the examples of the training set are correctly classified, the function can no longer be used to improve the model
Hinge Loss
A Loss function
L(x) = (1 - z(x)) for z(x) < 1
L(x) = 0 for z(x) >= 1
Solves the discontinuity problem of the 0-1 Loss (it is continuous everywhere), but it is still not differentiable at z(x) = 1.
Logistic Loss
A continuous loss function
L(x) = log2( 1 + exp( -z(x) ) )
It has a similar shape to the Hinge loss, but it is smooth everywhere.
Exponential Loss
A continuous loss function
L(x) = exp( -z(x) )
Quadratic Loss
A continuous loss function
L(x) = (1 - z(x))^2
This is the only one of these simple loss functions that goes to +inf as z(x) goes to +inf (large positive margins are penalized as well), which can be a problem in optimization.
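A sketch of these loss functions as plain Python functions of the margin z(x), matching the definitions above:

```python
import math

def zero_one_loss(z):    return 1.0 if z <= 0 else 0.0
def hinge_loss(z):       return max(0.0, 1.0 - z)
def logistic_loss(z):    return math.log2(1.0 + math.exp(-z))
def exponential_loss(z): return math.exp(-z)
def quadratic_loss(z):   return (1.0 - z) ** 2

# all of them equal 1 at z = 0; note that the quadratic loss grows again for z > 1
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, zero_one_loss(z), hinge_loss(z), logistic_loss(z),
          exponential_loss(z), quadratic_loss(z))
```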
Ranking
When using a scoring classifier, we can use the scores to sort the examples. This defines a RANKING of the examples.
Ranking error
When using a RANKING, some (hopefully few) NEGATIVE examples may receive a higher score than some POSITIVE examples, so they end up ranked ahead of them. Each such wrongly ordered pair is a RANKING ERROR.
The RANKING ERROR RATE is calculated by taking every possible pair of 1 POS and 1 NEG example and counting how many of these pairs are ordered wrongly, i.e. the NEG is ranked before the POS (ties count as half an error):
RANKING ERROR RATE =
SUM( Numberof(s(x) < s(x*)) + 0.5 * Numberof(s(x) = s(x*)) )
/
( Numberof(POS) * Numberof(NEG) )
Where s(x) is the score of a POSITIVE and s(x*) is the score of a NEGATIVE.
Example:
++-++---
Here 1 NEG is ranked in front of 2 POS, so there are 1 * 2 = 2 wrongly ordered pairs. We assume every example has a different score, so there are no 0.5 terms. We have a total of 4 POS and 4 NEG, so:
RANKING ERROR RATE = 2 / (4 * 4) = 2/16 = 1/8
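A sketch of the computation, assuming the ranking is given as (score, label) pairs; the scores below are invented, only their order matters:

```python
def ranking_error_rate(scored):
    """Fraction of POS/NEG pairs ordered wrongly (ties count as half)."""
    pos = [s for s, lab in scored if lab == "+"]
    neg = [s for s, lab in scored if lab == "-"]
    errors = sum(1.0 if sp < sn else 0.5 if sp == sn else 0.0
                 for sp in pos for sn in neg)
    return errors / (len(pos) * len(neg))

# the ranking ++-++--- from the example, read from highest to lowest score
scored = list(zip(range(8, 0, -1), "++-++---"))
print(ranking_error_rate(scored))   # 0.125 = 1/8
```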
Binary classifiers and Ranking functions
If N examples are ordered by their SCORE, we can define N+1 BINARY CLASSIFIERS that separate the ordered list into 2 sets, one for each possible position of the decision threshold. We can evaluate those classifiers, as well as the RANKING ERROR RATE, directly from the ordered list.
Class probability estimation
This is a scoring classifier with the peculiarity that the score is an ESTIMATE OF THE PROBABILITY that the example belongs to each class.
p*: X –> [0,1]^k
where
p* is the probability estimation function
X is the set of examples
k is the number of classes
so p*(x) is a vector of probability estimates, one per class, for the given example x.
For a binary classification we omit the probability for the NEGATIVE class, since it’s derivable from the POS probability.
Mean squared probability error
This is one way to estimate the error in the probabilities assigned to an example x, knowing its true class c(x). It is calculated as:
MSE(x) = 0.5 * || p*(x) - I_c(x) ||^2 =
= 0.5 * sum_i ( p*_i(x) - I_c(x)_i )^2
where:
- p*(x) is the vector of class probabilities estimated for the example x
- I_c(x) is the vector that has 1 in the position of the true class, and 0 elsewhere
- ||A||^2 is the squared Euclidean norm of the vector A, i.e. sum_i (A_i)^2
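A minimal sketch with a made-up probability vector; `true_class` is the index of the correct class:

```python
def squared_prob_error(p_hat, true_class):
    """0.5 * squared Euclidean distance between p_hat and the one-hot target."""
    target = [1.0 if i == true_class else 0.0 for i in range(len(p_hat))]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(p_hat, target))

# hypothetical estimate over 3 classes, true class is the first one
print(squared_prob_error([0.7, 0.2, 0.1], true_class=0))  # 0.07
```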
Empirical probabilities
We can use empirical probabilities to define a reference for a class probability estimation. We can define the EMPIRICAL PROBABILITY VECTOR as follows
p^(S) = (n1/|S|, n2/|S|, …., nk/|S|)
where
S is the set of labeled examples
ni is the number of examples labeled with the class Ci
Laplace correction
It’s the most common way to smooth the relative frequencies of classes in the calculation of EMPIRICAL PROBABILITIES.
It can be applied uniformly, by adding 1 to each of the k class counts, so the empirical probability becomes:
p^i(S) = (ni + 1) / (|S| + k)
It can also be done with a specific weight for each class:
p^i(S) = (ni + wi*m) / (|S| + m)
where
wi is the a priori probability of the class i
m > k is any number
k is the number of classes
sum(wi) is 1
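A small sketch of the uniform case, assuming `labels` is the list of class labels observed in S:

```python
from collections import Counter

def empirical_probs(labels, classes, smoothing=1):
    """Relative class frequencies with uniform Laplace correction."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(classes)
    return {c: (counts[c] + smoothing) / total for c in classes}

labels = ["a", "a", "a", "b"]               # n_a = 3, n_b = 1, |S| = 4, k = 2
print(empirical_probs(labels, ["a", "b"]))  # {'a': 4/6 ≈ 0.667, 'b': 2/6 ≈ 0.333}
```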
Binary classifier Multiclass handling
A binary classifier can be used to generate a multiclass labeling. We have various approaches:
- one vs rest
- unordered learning
- fixed-order learning
- one vs one
- symmetric
- asymmetric
1 vs rest - unordered
The classifier is made of n binary classifiers, where n is the number of classes. Each binary classifier returns +1 if it thinks the input belongs to its class, and -1 if it thinks it belongs to any other class.
We can define an OUTPUT CODE MATRIX for this classifier (with n=3):
+1 -1 -1
-1 +1 -1
-1 -1 +1
where the rows correspond to the classes C1, C2, C3
and each column contains the expected outputs of one of the binary classifiers on those classes
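A minimal sketch of how the matrix can be used at prediction time, with three hypothetical binary classifiers whose outputs are already given:

```python
# rows = classes C1..C3, columns = the expected output of each binary classifier
CODE_MATRIX = [
    [+1, -1, -1],
    [-1, +1, -1],
    [-1, -1, +1],
]

def predict(binary_outputs):
    """Pick the class whose row of the code matrix matches the outputs best."""
    agreements = [sum(o == w for o, w in zip(binary_outputs, row))
                  for row in CODE_MATRIX]
    return agreements.index(max(agreements))

print(predict([+1, -1, -1]))  # 0 -> C1
print(predict([-1, -1, +1]))  # 2 -> C3
```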