Machine Learning Flashcards

1
Q

Task

A

What we want to obtain given a set of data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Features

A

The properties of the data that are used as input by the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Model

A

Gets the data as input and returns an output.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Learning algorithm

A

The algorithm that generates the model, given a specific set of data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Classification

A

Task: give a label C1 of a set of labels Cs, to a given input

Learning task: find a function c*, called classifier, that approximate at best the real classification function c.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Evaluation

A

We need to evaluate a model, to know how well it works for the task. There are a lot of measures to use to evaluate a model:

  • accuracy
  • recall
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Decision tree

A

Is a model for CLASSIFICATION TASKS that uses a tree in which the nodes are feature-wise question to answer, the answers label the arcs to follow, and the leaves are the labels

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Accuracy

A

It’s an evaluation measure that calculates:

Number of correctly labeled examples
/
Number of labeled examples

(TRUE POSITIVES + TRUE NEGATIVES) / (ALL POSITIVES + ALL NEGATIEVS)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Recall

A

In a BINARY CLASSIFICATION we define:

  • POSITIVE RECALL = TRUE POSITIVE / ALL POSITIVES
  • NEGATIVE RECALL (SPECIFICITY) = TRUE NEGATIVE / ALL NEGATIVES
  • AVERAGE RECALL = (POSITIVE RECALL + NEGATIVE RECALL) / 2
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Contingency table

A

It’s a tool used for BINARY CLASSIFICATION, we put on the columns the number of the predicted classes, and on the rows the number of the true classes.

                  Predicted Class
             |  Positive  |  Negative  |
Actual Class |------------|------------|
  Positive   |     TP     |     FN     |
  Negative   |     FP     |     TN     |
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Coverage plot

A

It’s a graph large N and height M.
N: number of FALSE POSITIVES
M: number of TRUE POSITIVES

Can be used to MODELS PERFORMANCES by finding the representing coordinates for each model in the graph, using the number of the model’s TRUE and FALSE POSITIVES.

Classifiers with the same accuracy are connected by lines of slope 1 (DEMONSTRATION)

Classifiers with the same average recall are connected by lines parallel to the main diagonal (Neg/Pos slope) (DEMONSTRATION)

NEEDS NORMALIZATION!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

ROC plot

A

It’s the coverage plot, normalized. So the graph is large 1 and height 1.
Let confront models that have performances calculated on different datasets and numbers.

Classifiers with the same accuray are connected by lines of slope Neg/Pos (DEMONSTRATION)

Classifiers with the same average recall are connected by lines of slope 1 (DEMONSTRATION)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Scoring classification

A

This is a task similar to the classification, but the model is a SCORING CLASSIFIER s*, that takes an input and returns a Nth-vector of scores where N is the number of classes.

So it does not tell the class associated to the input, but the SCORE for each class associated to the input.

The TRAINING SET is the same as the one for the CLASSIFICATION.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Margin

A

For a SCORING BINARY CLASSIFIER the margin is a function that takes an input and returns a positive value if the input is classified correcty, negative otherwise.
Can be written as:

margin(x) = c(x)*s(x) =
1. +|s(x)| if s is correct
2. - |s(x)| if s is not correct

where
margin(x): is the margin function
c(x): is the true class of x (+1 for positive, -1 for negative class)
s(x): is the score for x given by the classifier

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Loss function

A

For the purpose of rewarding large positive margins and penalize large negative margins, we define a LOSS FUNCTION.

Loss function is a function L that:

L: R –> [0,+inf)
that maps each example’s margin z(x) to an associated LOSS L(x).

We bound, or assume, L(x) to be:

L(x) > 1 for z(x) < 0
L(x) = 1 for z(x) = 0
L(x) > 0 and L(x) < 1 for z(x) > 0

So the value of L is less than 1 for each correctly classified example, and more than 1 otherwise.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

0-1 Loss

A

A loss function

L(x) = 1 for z(x) <= 0
L(x) = 0 for z(x) > 0

Has some problems:
- discrete, not continuous so not entirely derivable
- has really bad derivatives (it is composed of 2 lines of slope 0)
- once the examples from the train set are correctly classified we cannot use the function to improve the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Hinge Loss

A

A Loss function

L(x) = (1 - z(x)) for z(x) < 1
L(x) = 0 for z(x) >= 1

Solves the discontinuity problems of the 0-1 Loss, but still has a discontinuity for z(x)=1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Logistic Loss

A

A continuous loss function

L(x) = log2( 1 + exp( -z(x) ) )

Has a similar slope than Hinge loss, but all continuous

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Exponential Loss

A

A continuous loss function

L(x) = exp( -z(x) )

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Quadratic Loss

A

A continuous loss function

L(x) = (1 - z)2

this is the only one in the simple loss functions that goes to +inf for z(x) going to +inf, this could be a problem in optimization.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Ranking

A

When using a scoring classifier, we can use the score associated to an example to sort the example by its score. This is a RANKING function on the scores.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Ranking error

A

When using a RANKING we will put the (hopefully few) FALSE POSITIVES and FALSE NEGATIVES that has been classified with high score ahead of TRUE POSITIVES and TRUE NEGATIVES that have been classified with a lesser score. This is a RANKING ERROR.

The RANKING ERROR RATE can be calculated by taking each possible couple of 1 POS and 1 NEG example, and count how many of these couples are ordered right in the ranking, so the POS is ranked before the NEG:

RANKING ERROR RATE =
SUM( Numberof(s(x) < s(x)) + 0.5 * Numberof(s(x) = s(x))
/
Numberof(POS) * Numberof(NEG)

Where s(x) is the score of a POSITIVE and s(x*) is the score of a NEGATIVE.

Example:
++-++—

we have 1 - in front of 2 +, so it’s 12 = 2 wrong couples. We assume each example having a different score, so we do not have any 0.5 values. We have a total of 4 POS and 4 NEG, so:
RANKING ERROR RATE = 1
2 / 4*4 = 2/16 = 1/8

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Binary classifiers and Ranking functions

A

If N examples are ordered by their SCORE, we can define N+1 BINARY CLASSIFIERS to separate the ordered list into 2 sets. We can determine the value of those classifiers calculating the RANKING ERROR RATE from the ordered list.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Class probabiliy estimation

A

This is a scoring classifier with the peculiarity that the score is the PROBABILITY ESTIMATION for the example to be correctly associated to the class.

p: X –> [0,1]k
where p
is the probability estimation function
X is the example
K is the number of the classes

so p*(x) is a vector of probability estimations for the given example x to be correctly labeled in the k classes.

For a binary classification we omit the probability for the NEGATIVE class, since it’s derivable from the POS probability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

Mean squared probability error

A

This is one way to estimate the error in the probability assigned to an example x, knowing the true class probability c(x). It’s valued as the following

MSE = 0.5* ||p*(x) - Ic(x)||2,2 =
= 0.5 * sum ( pi(x) - Ic(xi) )2

where:
- p*(x) is the probability vector for the classes for the example x
- Ic(x) is the vector that has 1 for the right class position, and 0 otherwise
- the ||A||2,2 operation is the euclidean length for vector A, calculated as squared(squareRoot( sumOnEach_i( Ai * Ai) ))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Empirical probabilities

A

We can use empirical probabilities to define a reference for a class probability estimation. We can define the EMPIRICAL PROBABILITY VECTOR as follows

p^(S) = (n1/|S|, n2/|S|, …., nk/|S|)

where
S is the set of labeled examples
ni is the number of the examples labeled on the class Ci

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Laplace correction

A

It’s the most common way to smooth the relative frequencies of classes for the caluclation of EMPIRICAL PROBABILITIES.
Can occour uniformly by adding 1 to each one of the k class occurrencies, so the empirical probability becomes:

p^i(S) = (ni + 1) / |S| + k

Can also be done with specific weights for each class,:

p^i(S) = (ni + wi*m) / |S| + m

where
wi is the a priori probability of the class i
m > k is any number
k is the number of classes
sum(wi) is 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Binary classifier Multiclass handling

A

A binary classifier can be used to generate a multiclass labeling. We have various approaches:

  • one vs rest
    • unordered learning
    • fixed-order learning
  • one vs one
    • symmetric
    • asymmentric
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

1 vs rest - unordered

A

The classifier is made of n classifiers where n is the number of classes. Each classifier will be a binary classifier that returns 1 if the class is found, -1 if it thinks its another any class.

We can define an OUTPUT CODE MATRIX for this classifier (with n=3):

+1 -1 -1
-1 +1 -1
-1 -1 +1

where the columns are the results of each classifier on classes C1,C2,C3
and the rows are the classes C1,C2,C3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

1 vs rest - fixed order

A

The classifier is made of n-1 classifiers, where n is the number of classes. Each classifier is a binary classifier, but we use the classifiers with a fixed order ASSUMING that each non classified example at k-th classifier cannot be of the k classes checked by the previous classifiers.

So for example:
3 classes.
Checks are:

  1. C1 vs C2,C3 -> is C1 or go to step 2.
  2. C2 vs C3 -> C2 or C3

For this example the OUTPUT CODE MATRIX will be:

+1 0
-1 +1
-1 -1

NB:
We see that there is a THIRD VALUE, the 0 returned by the binary classifier, it’s not really a value returned since we do not accept that as a prediction.

31
Q

1 vs 1 - simmetric

A

The classifier is made of combination of n classifiers where n is the number of classes. So one classifier for each possible couple of classes.
Each classifier is a binary classifier, each classifier determines if the example is inside one of two classes.

32
Q

1 vs 1 - asimmetric

A

The classifier is made of combination of two classifier for each possible couple of classes.
Each classifier is a binary classifier, each classifier determines if the example is inside one of two classes.

33
Q

Binary classifier Multiclass Decoding

A

When the classifier is trained and we have the OUTPUT CODE MATRIX available, we can use it to define the classification for a new example given to the model.

Given n classes, we receive a WORD wn, a vector of dimension n. We can choose the class for the given example as the one that MINIMIZES the distance between wn and the corresponding row on the OUTPUT CODE MATRIX.

Distance on Cx = D[Cx] = SUMi / 2
where ci are the values of the selected row, and wi are the values of the vector w.

For example:
+1 -1 -1
-1 +1 -1
-1 -1 +1

w = (-1, -1, +1)
D[C1] = 1-(-1) + 1-1 + 1-(-1) = 2
D[C2] = 1-(1) + 1-(-1) + 1-(1) = 2
D[C3] = 1-(1) + 1-(1) + 1-(1) = 0

So the row with less Distance is 3, so the data has to be labeled as C3

34
Q

Binary classifier Multiclass Decoding With Same Distance

A

If in the decoding part, with the task of classifing new data, i get more rows with the same distance i can choose:

  1. decide randomly between the rows with same minimum distance
  2. lessen the possibility of same score by adding new classifier by merging more type of multiclass binary classifiers (ex. i add asimetric columns)
  3. instead of classifiers, that for definition returns 1 if the right class and -1 (or 0) if not, i can use a scoring classifier or a probabilistic classifier. This way the distance is less probable to be the same.
35
Q

Regression

A

This is a task in which the input is a data in the example set, and the output is a number in R.

We do not want to understand a class, but we want to find a point on a function.

So given the train set (xi, f(xi)) we want to define a function f* that is the more similar to f(xi), given f*(xi)

ATTENTION
To solve the task during the training the easiest way is to define f* as a polynome with n-1 parameters, where n is the number of example in the train set. This way f* and f would match perfectly, but this would be OVERFITTING, since f* would be exessively defined on the data from the example.
So rule of thumb is to have a smaller number of parameters than the example set dimension (this does not apply on neural networks)

36
Q

Bias-Variance Dilemma

A

If we underestimate the number of parameters we won’t be able to decrease the LOSS FUNCTION no matter how many examples we use in training.

If we overestimate the parameters the model will be more dependent on the training set, being less flexible to new cases.

37
Q

What is Concept Learning in Machine Learning?

A

Concept learning is the task of inferring a boolean(?)-valued function from training examples of its input and output. It involves identifying a general category or concept from specific instances.

38
Q

What is a Hypothesis Space?

A

The hypothesis space is the set of all hypotheses that can be formulated given the learning algorithm and the data.

39
Q

Define Version Space.

A

The version space is the subset of the hypothesis space that is consistent with the training examples. It includes all hypotheses that correctly classify the training data.

40
Q

What is the Candidate Elimination Algorithm?

A

The Candidate Elimination Algorithm is an algorithm that iteratively refines the version space by eliminating hypotheses that are inconsistent with each new training example.

41
Q

What is the General Boundary (G) in Candidate Elimination?

A

The General Boundary (G) is the set of the most general hypotheses in the version space that are consistent with the training data.

42
Q

What is the Specific Boundary (S) in Candidate Elimination?

A

The Specific Boundary (S) is the set of the most specific hypotheses in the version space that are consistent with the training data.

43
Q

What is Inductive Bias in Concept Learning?

A

Inductive bias refers to the set of assumptions a learning algorithm makes to predict outputs given inputs that it has not encountered. These assumptions guide the learning process and influence the hypothesis chosen.

44
Q

What role do positive and negative examples play in concept learning?

A

Positive examples provide instances that belong to the concept being learned, while negative examples provide instances that do not belong to the concept. Both are crucial for defining and refining the concept.

45
Q

Define Overfitting in Concept Learning.

A

Overfitting occurs when a hypothesis fits the training data too closely, capturing noise or specific details, and thus performs poorly on new, unseen data.

46
Q

What is the difference between a consistent hypothesis and a correct hypothesis?

A

A consistent hypothesis correctly classifies all training examples, while a correct hypothesis accurately classifies both seen and unseen examples, generalizing well to new data.

47
Q

Explain the concept of Generalization in Concept Learning.

A

Generalization is the ability of a model to apply what it has learned from the training examples to new, unseen examples, effectively predicting the output.

48
Q

What is the role of a hypothesis in concept learning?

A

A hypothesis is a proposed explanation or model that represents the concept being learned, used to predict the classification of new examples.

49
Q

What is the Find-S Algorithm?

A

The Find-S Algorithm is a simple method in concept learning that finds the most specific hypothesis that fits all positive examples.

50
Q

What is the primary limitation of the Find-S Algorithm?

A

The primary limitation of the Find-S Algorithm is that it only considers positive examples, ignoring negative examples, which can lead to incomplete or incorrect hypotheses.

51
Q

Describe the term “Hypothesis Testing” in the context of concept learning.

A

Hypothesis testing in concept learning involves evaluating hypotheses against a set of examples to determine their consistency and accuracy.

52
Q

What is a Feature Tree in Machine Learning?

A

A feature tree is a hierarchical representation of features used to describe data, where features are organized in a tree structure, typically to capture relationships and dependencies among them.

53
Q

What is the purpose of using a Feature Tree?

A

Feature trees help in organizing and managing features systematically, improving feature selection, engineering, and interpretation by capturing the relationships and dependencies between features.

54
Q

How are Feature Trees related to Decision Trees?

A

While decision trees are used for making predictions based on feature splits, feature trees specifically focus on the hierarchical organization and relationship of features themselves, rather than directly making predictions.

55
Q

What is a Leaf Node in a Feature Tree?

A

A leaf node in a feature tree represents a basic or primitive feature that does not have any further sub-features.

56
Q

What is an Internal Node in a Feature Tree?

A

An internal node in a feature tree represents a feature that has one or more sub-features, capturing a higher-level concept derived from its children.

57
Q

Give an example of a domain where Feature Trees can be useful.

A

Feature trees are useful in domains like natural language processing (NLP), where features such as syntactic structures, semantic roles, and word embeddings can be hierarchically organized.

58
Q

What is Feature Hierarchy?

A

Feature hierarchy is the organization of features in a multi-level structure where higher-level features are derived from combinations or transformations of lower-level features.

59
Q

How do Feature Trees aid in Feature Selection?

A

Feature trees aid in feature selection by highlighting important features and their relationships, helping to choose relevant features and avoid redundancy.

60
Q

Explain the concept of Feature Granularity in Feature Trees.

A

Feature granularity refers to the level of detail represented by features in the tree, with finer granularity at the leaves and coarser granularity at higher levels.

61
Q

Can Feature Trees be used in combination with other models?

A

Yes, feature trees can be used to organize and select features, which can then be used as input to other machine learning models like neural networks, SVMs, or ensemble methods.

62
Q

What is the Grow Tree Algorithm in Machine Learning?

A

The Grow Tree algorithm is a method used to construct decision trees by recursively splitting the data into subsets based on the feature that provides the best split according to a chosen impurity measure.

63
Q

Describe the basic steps of the Grow Tree Algorithm.

A
  1. Select the Best Feature: Choose the feature that best separates the data according to an impurity measure (e.g., Gini Index, Entropy).
  2. Split the Data: Partition the data into subsets based on the selected feature.
  3. Create Decision Nodes: Assign a decision node that represents the chosen feature and its splitting criterion.
  4. Repeat Recursively: Apply the same process to each subset, creating further splits and nodes.
  5. Stop Criteria: Continue until a stopping criterion is met (e.g., maximum depth, minimum samples per node, or no further improvement in impurity).
64
Q

What is the primary goal of the Grow Tree Algorithm?

A

The primary goal of the Grow Tree algorithm is to create a tree that accurately predicts the target variable by partitioning the data into homogenous subsets that are as pure as possible.

65
Q

What are Impurity Measures in Decision Trees?

A

Impurity measures are metrics used to evaluate how well a split separates the data into distinct classes, guiding the tree-growing process by selecting the best features to split on.

66
Q

What is the Minimum Error Impurity Measure?

A

The Minimum Error impurity measure evaluates the misclassification rate of a node, aiming to minimize the error by selecting splits that lead to the fewest incorrect classifications.

67
Q

How is the Minimum Error Impurity Measure calculated?

A

The Minimum Error impurity measure is calculated as the proportion of the most common class subtracted from one:
Error = 1 - max(p1,p2,…pk)
where pi is the proportion of class i in the node, and k is the number of classes

68
Q

What is the Gini Index?

A

The Gini Index is an impurity measure that calculates the likelihood of a random sample being incorrectly classified if it was randomly labeled according to the distribution of labels in the node.

Gini = 1 - Sum1,k
where pi is the proportion of class i in the node, and k is the number of classes

68
Q

What is Entropy in the context of decision trees?

A

Entropy is an impurity measure that quantifies the amount of uncertainty or disorder within a node, used to determine the best feature to split the data by reducing uncertainty.

Entropy = -Sum1,k
where pi is the proportion of class i in the node, and k is the number of classes

69
Q

What is the purpose of using impurity measures in the Grow Tree Algorithm?

A

The purpose of using impurity measures is to evaluate and select the best feature splits, aiming to create nodes that are as pure as possible, thereby improving the accuracy of the decision tree.

70
Q

How does the Grow Tree Algorithm handle continuous features?

A

The Grow Tree algorithm handles continuous features by finding the optimal threshold value that best splits the data into distinct classes, often using impurity measures to evaluate different threshold values.

71
Q

What is Pruning in Decision Trees?

A

Pruning is the process of removing parts of the tree that do not provide significant power to predict target variables, helping to prevent overfitting and improve generalization.

72
Q

Compare the Gini Index and Entropy as impurity measures.

A

Both Gini Index and Entropy measure impurity, but Gini Index is computationally simpler and often faster to compute, while Entropy can be more sensitive to changes in class distribution. They often lead to similar splits in decision trees.

73
Q
A