Final Review Flashcards
Who developed the VC dimension?
Dr. Vladimir Vapnik and Dr. Alexey Chervonenkis
What does VC stand for?
Vapnik-Chervonenkis
Why was the VC dimension developed?
To give scientists a way to measure the capacity of a hypothesis space, helping them develop machine learning models that are better at classifying data
What is the VC dimension?
The Vapnik-Chervonenkis dimension of a hypothesis space is the maximum number of points that can be “shattered” by hypotheses from that space
In terms of VC dimension, what does “shattering” mean?
“Shattering” a set of points means that for every possible way of labeling those points (with binary labels), there is a hypothesis in the space that correctly classifies the points according to that labeling
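Shattering can be checked by brute force for small cases. Below is a minimal sketch (my example, not from the deck), assuming a toy hypothesis space of 1D threshold classifiers; thresholds have VC dimension 1, so one point can be shattered but two cannot (the labeling “left point 1, right point 0” is unrealizable):

```python
from itertools import product

def shatters(points, hypotheses):
    """Check whether some hypothesis realizes every binary labeling of the points."""
    for labeling in product([0, 1], repeat=len(points)):
        if not any(tuple(h(p) for p in points) == labeling for h in hypotheses):
            return False  # this labeling cannot be realized by any hypothesis
    return True

# Toy hypothesis space: 1D threshold classifiers h_t(x) = 1 if x >= t else 0
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.5, -0.5, 0.5, 1.5, 2.5]]

print(shatters([0.0], thresholds))       # True: 1 point can be shattered
print(shatters([0.0, 1.0], thresholds))  # False: the labeling (1, 0) is impossible
```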
Is either of these true? Why or why not?
The VC dimension of the hypothesis class of circles is lower than that of squares.
The VC dimension of the hypothesis class of squares is lower than that of circles.
No, because both have a VC dimension of 3.
Is this true? Why or why not?
The VC dimension of the hypothesis class of rings is higher than that of circles.
Yes, because rings have a VC dimension of 4 while circles have a VC dimension of 3.
What are 2 important notes when discussing the VC dimension of a hypothesis space?
- Knowing whether the hypotheses are constrained to be centered on the origin, since that constraint can change the VC dimension
- Knowing that the VC dimension is NOT always equal to 1 + (number of dimensions)
Is either of these true? Why or why not?
The VC dimension of circles is higher than that of squares.
The VC dimension of squares is higher than that of circles.
No, because both have a VC dimension of 3.
What was the goal of Bayes’ Rule?
To solve the problem of inverse probability - inferring the probability of causes (hypotheses) from observed effects (data)
What is Bayes’ Rule?
A fundamental theorem in probability theory for updating the probability of a hypothesis based on new evidence
What is the formula for Bayes’ Rule?
P(h|D) = [P(D|h) * P(h)] / P(D)
In Bayes’ Rule, what is P(h|D) and what does it represent?
The posterior probability of the hypothesis given the data.
It represents our updated belief in the hypothesis after seeing the data.
In Bayes’ Rule, what is P(D|h) and what does it represent?
P(D|h) is the likelihood, which is the probability of observing the data assuming the hypothesis is true.
It tells us how likely the observed data is under the assumption of the hypothesis.
In Bayes’ Rule, what is P(h) and what does it represent?
P(h) is the prior probability of the hypothesis before observing any data.
It represents our belief in the hypothesis based on prior knowledge or assumptions.
In Bayes’ Rule, what is P(D) and what does it represent?
P(D) is the marginal likelihood (evidence), which normalizes the posterior distribution by ensuring the total probability sums to 1.
It represents the total probability of observing the data across all possible hypotheses.
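With all four terms defined, here is a small worked example applying the formula. The numbers are hypothetical (my illustration, not from the deck): a rare condition with a 1% prior and a test that is positive 95% of the time when the condition is present and 5% of the time when it is not.

```python
p_h = 0.01              # P(h): prior probability the hypothesis is true
p_d_given_h = 0.95      # P(D|h): likelihood of the data if h is true
p_d_given_not_h = 0.05  # P(D|not h): likelihood of the data otherwise

# P(D): marginal likelihood, summing over both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# P(h|D): posterior via Bayes' Rule
p_h_given_d = (p_d_given_h * p_h) / p_d
print(f"P(h|D) = {p_h_given_d:.3f}")  # ~0.161: the data updates the 1% prior upward
```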
For Bayes’ Rule, when would we ignore P(D)?
We can ignore P(D), the probability of the data, when we’re only interested in finding the hypothesis that maximizes the posterior probability P(h|D): since P(D) is the same for every hypothesis, it does not change which hypothesis wins (this is maximum a posteriori, or MAP, estimation)
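A minimal sketch of that idea, using hypothetical priors and likelihoods (my numbers, not from the deck): the argmax is taken over P(D|h) * P(h) alone, with no normalization by P(D).

```python
# Each hypothesis maps to (P(h), P(D|h)) for one observed dataset D (hypothetical values)
hypotheses = {"h1": (0.5, 0.10), "h2": (0.3, 0.40), "h3": (0.2, 0.35)}

# MAP estimate: argmax of P(D|h) * P(h); P(D) is constant across h, so it cancels
h_map = max(hypotheses, key=lambda h: hypotheses[h][1] * hypotheses[h][0])
print(h_map)  # "h2": 0.40 * 0.3 = 0.12 beats 0.05 (h1) and 0.07 (h3)
```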
What is MLE?
Maximum Likelihood Estimation is a way to estimate the parameters of a model by finding the values that make the observed data most likely
What does MLE mean in the context of binary classification?
This means finding the parameters of the hypothesis h(x) that make the observed outcomes y most likely.
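A minimal sketch of MLE, assuming the simplest binary model, a Bernoulli distribution with a single parameter p = P(y=1) (my example, not from the deck). A grid search over candidate values of p recovers the maximizer of the likelihood, which for this model is the sample mean.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0])  # hypothetical observed binary outcomes

def neg_log_likelihood(p, y):
    """Negative log-likelihood of Bernoulli(p) for observed labels y."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Grid search over candidate parameters; the minimizer matches the sample mean
candidates = np.linspace(0.001, 0.999, 999)
p_mle = candidates[np.argmin([neg_log_likelihood(p, y) for p in candidates])]
print(p_mle, y.mean())  # both ~0.625
```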
What is MSE? How is it written?
Mean Squared Error
MSE = (1/n) * Σ(y - h(x))^2
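That formula translates directly to code; a minimal sketch with hypothetical values:

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y - y_pred) ** 2)

print(mse(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.6])))  # 0.07
```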
Why is MSE unsuitable for binary classification?
- Binary classification treats targets as discrete (0 or 1), while MSE assumes targets are continuous values.
- MSE doesn’t penalize incorrect predictions made with high confidence as effectively as cross-entropy loss does
- MSE here can result in poor performance because it focuses on minimizing the distance between predicted and actual values, ignoring the probability estimates that classification models produce (see the sketch after this list)
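A small sketch of the penalty difference on a single confidently wrong prediction (hypothetical values: true label 1, predicted probability 0.01). MSE is bounded near 1, while cross-entropy grows without bound as the prediction gets worse:

```python
import math

y_true, y_pred = 1.0, 0.01  # true label 1, predicted probability 0.01 (confidently wrong)

mse = (y_true - y_pred) ** 2  # 0.9801: capped near 1, a mild penalty
cel = -math.log(y_pred)       # ~4.61: unbounded as the predicted probability approaches 0
print(mse, cel)
```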
What function is typically used for a binary classification neural network’s output layer? Why?
A sigmoid activation function is typically used because it outputs a value between 0 and 1, which can be interpreted as a probability and evaluated with cross-entropy loss against the actual binary label.
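A minimal sketch of the sigmoid, mapping any real-valued logit into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued logit into (0, 1), interpretable as P(y=1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.119, 0.5, 0.881]
```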
Describe Cross-Entropy Loss (CEL)
- Used in binary classification
- Measures the difference between the predicted probabilities and the true binary labels
- Directly handles probabilistic predictions
- Strongly penalizes wrong predictions made with high confidence (see the sketch after this list)
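A minimal sketch of binary cross-entropy, assuming hypothetical labels and predicted probabilities (my values, not from the deck). Note it is the same Bernoulli negative log-likelihood as in the MLE card, averaged over examples:

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Average negative log-likelihood of true labels y under predicted probabilities p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, p))  # ~0.28
```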
Describe Mean Squared Error (MSE)
- Typically used in regression tasks where the target variable is continuous
- Measures the squared difference between predicted values and true values
- Treats output as continuous, making it LESS suitable for binary classification