Final Review Flashcards
Who developed the VC dimension?
Vladimir Vapnik and Alexey Chervonenkis
What does VC stand for?
Vapnik-Chervonenkis
Why was the VC dimension developed?
To help scientists develop machine learning models that would be better at classifying data
What is the VC dimension?
The Vapnik-Chervonenkis dimension of a hypothesis space is the maximum number of points that can be “shattered” by hypotheses from that space
In terms of VC dimension, what does “shattering” mean?
“Shattering” means that for every possible way of labeling these points (with binary labels), there is a hypothesis in the space that correctly classifies the points according to that labeling
Is either of these true? Why or why not?
The VC dimension of the hypothesis class of circles is lower than that of squares.
The VC dimension of the hypothesis class of squares is lower than that of circles.
No, because both have a VC dimension of 3.
Is this true? Why or why not?
The VC dimension of the hypothesis class of rings is higher than that of circles.
Yes, because rings have a VC dimension of 4 while circles have a VC dimension of 3.
What are 2 important notes when discussing the VC dimension of a hypothesis space?
- Knowing if the hypothesis space is centered on the origin
- Knowing that the VC dimension is NOT always equivalent to 1 + (number of dimensions)
Is either of these true? Why or why not?
The VC dimension of circles is higher than that of squares.
The VC dimension of squares is higher than that of circles.
No, because both have a VC dimension of 3.
What was the goal of Bayes’ Rule?
To solve the problem of inverse probability - inferring the probability of causes (hypotheses) from observed effects (data)
What is Bayes’ Rule?
A fundamental theorem in probability theory for updating the probability of a hypothesis based on new evidence
What is the formula for Bayes’ Rule?
P(h|D) = [P(D|h) * P(h)] / P(D)
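A quick numeric sketch of how the pieces combine (the disease-test numbers below are hypothetical, not from the cards):

```python
# Hypothetical example: h = "has disease", D = "test came back positive"
p_h = 0.01              # prior P(h)
p_D_given_h = 0.95      # likelihood P(D|h): test sensitivity
p_D_given_not_h = 0.05  # false-positive rate P(D|not h)

# Evidence P(D): total probability of the data over both hypotheses
p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)

# Posterior P(h|D) via Bayes' Rule
p_h_given_D = (p_D_given_h * p_h) / p_D
print(round(p_h_given_D, 3))  # ~0.161: a positive test raises belief from 1% to ~16%
```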
In Bayes’ Rule, what is P(h|D) and what does it represent?
The posterior probability of the hypothesis given the data.
It represents our updated belief in the hypothesis after seeing the data.
In Bayes’ Rule, what is P(D|h) and what does it represent?
P(D|h) is the likelihood, which is the probability of observing the data assuming the hypothesis is true.
It tells us how likely the observed data is under the assumption of the hypothesis.
In Bayes’ Rule, what is P(h) and what does it represent?
P(h) is the prior probability of the hypothesis before observing any data.
It represents our belief in the hypothesis based on prior knowledge or assumptions.
In Bayes’ Rule, what is P(D) and what does it represent?
P(D) is the marginal likelihood (evidence), which normalizes the posterior distribution by ensuring the total probability sums to 1.
It represents the total probability of observing the data across all possible hypotheses.
For Bayes’ Rule, when would we ignore P(D)?
We ignore P(D) when we only want the hypothesis that maximizes the posterior P(h|D), because P(D) is the same constant for every hypothesis and doesn’t change which one wins the argmax
What is MLE?
Maximum Likelihood Estimation is a way to estimate the parameters of a model by finding the values that make the observed data most likely
What does MLE mean in the context of binary classification?
This means finding the parameters of the hypothesis h(x) that make the observed outcomes y most likely.
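A minimal sketch of MLE in the simplest case, a Bernoulli (coin-flip) parameter, where the likelihood-maximizing value is just the sample mean (the data below is hypothetical):

```python
# Hypothetical observed binary outcomes y
y = [1, 0, 1, 1, 0, 1, 1, 1]

# The likelihood p^k * (1 - p)^(n - k) is maximized at p_hat = k / n
p_hat = sum(y) / len(y)
print(p_hat)  # 0.75: the parameter value that makes this data most likely
```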
What is MSE? How is it written?
Mean Squared Error
MSE = (1/n) * Σ(y - h(x))^2
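In code, a direct transcription of the formula:

```python
def mse(y_true, y_pred):
    # Mean of squared residuals: (1/n) * Σ(y - h(x))^2
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # ~0.02
```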
Why is MSE unsuitable for binary classification?
- Binary classification treats targets as discrete (0 or 1), while MSE assumes targets are continuous values.
- MSE doesn’t penalize incorrect predictions made with high confidence as effectively as cross-entropy loss does
- MSE here can result in poor performance because it focuses on shortening distance between predicted and actual values, ignoring the probability estimates produced by models in classification tasks
What function is typically used for a binary classification neural network’s output layer? Why?
A sigmoid activation function is typically used because it outputs a value between 0 and 1 which can be evaluated with cross-entropy loss to predict how well it matches the actual binary label.
Describe Cross-Entropy Loss (CEL)
- Used in binary classification
- Measures the difference between the predicted probabilities and the true binary labels
- Directly handles probabilistic predictions
- Strongly penalizes wrong predictions made with high confidence
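A small sketch contrasting the two losses on one confidently wrong prediction (true label 1, predicted probability 0.01):

```python
import math

def bce(y, p):
    # Binary cross-entropy for a single example
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

print(bce(1, 0.01))            # ~4.61: a large penalty for confident wrongness
print(squared_error(1, 0.01))  # ~0.98: capped near 1, a much weaker signal
```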
Describe Mean Squared Error (MSE)
- Typically used in regression tasks where the target variable is continuous
- Measures the squared difference between predicted values and true values
- Treats output as continuous, making it LESS suitable for binary classification
Describe Neural Networks with Sum of Squared Error (SSE)
- If it is trained for a binary classification task, it may treat the output as continuous rather than probabilistic
- For binary classification, it should instead use CEL (Cross-Entropy Loss) with a sigmoid output layer
What is a limitation of K-Means?
- K-Means struggles with non-spherical clusters
- K-Means requires a pre-defined number of clusters
- K-Means may converge to local minima
What are the steps of spectral clustering?
- Construct a similarity graph
- Compute the graph Laplacian
- Compute the eigenvalues and eigenvectors
- Perform clustering in the reduced eigenspace
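A minimal sketch of these steps via scikit-learn (assuming sklearn is available; SpectralClustering performs the Laplacian and eigenvector steps internally):

```python
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

# Two concentric circles: a classic case k-means cannot separate
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# affinity='nearest_neighbors' builds the similarity graph (step 1);
# steps 2-4 happen inside fit_predict
labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                            n_neighbors=10, random_state=0).fit_predict(X)
```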
What is the kernel trick?
This trick maps data into a higher-dimensional space where linear separation is possible
Why does the kernel trick matter for concentric circles?
This trick allows spectral clustering to transform the data out of the original space, enabling the separation of non-linearly separable points, such as inner and outer circles
What is the primary advantage of Isomap over traditional MDS?
Isomap approximates geodesic distances, capturing global structure
Advantages and Limitations of Isomap
Advantage:
- Captures global structure effectively, especially for data with a natural manifold shape
Limitations:
- Computationally intensive on large datasets
- Struggles with non-manifold high-dimensional data
What is the core concept of Laplacian Eigenmaps?
Laplacian eigenmaps use spectral graph theory to focus on local neighborhood relationships, preserving local structure
What is the process for Laplacian Eigenmaps?
- Build a weighted similarity graph of the data
- Compute the graph’s Laplacian matrix
- Perform eigenvalue decomposition to generate a low-dimensional embedding
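A compact numpy sketch of those three steps (RBF weights for the graph, the unnormalized Laplacian, and the smallest nontrivial eigenvectors as the embedding; sigma and dim are illustrative choices):

```python
import numpy as np

def laplacian_eigenmap(X, dim=2, sigma=1.0):
    # Step 1: weighted similarity graph (dense RBF weights for simplicity)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)

    # Step 2: graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W

    # Step 3: eigendecomposition; drop the trivial constant eigenvector
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]  # low-dimensional, locality-preserving embedding

embedding = laplacian_eigenmap(np.random.rand(50, 5))
```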
Laplacian Eigenmaps are best suited for preserving what type of structure?
Local neighborhood structure
Name applications of Laplacian Eigenmaps
- Image Processing
- Clustering in High-Dimensional Data
- Sensor Data Analysis
- Gene Expression Data
Advantages and Limitations of Laplacian Eigenmaps
Advantage: Efficient for preserving local structures, which is useful for clustering tasks
Limitation: May lose global relationships as it focuses solely on the local level
Why might pruning be applied to a decision tree?
- To simplify the model and improve interpretability by removing less significant branches
- To reduce overfitting by removing branches that were fit to noise or other training data fluctuations
- To remove branches that provide little to no predictive power
Why are the following NOT reasons to prune a decision tree?
- To ensure the tree is balanced
- To always achieve the best accuracy
- To increase tree depth
- The goal of pruning is not to balance a tree
- Pruning doesn’t guarantee the BEST accuracy
- Pruning does the opposite: it reduces tree depth by removing branches
Why are the following statements around cross-validation true?
- In k-fold cross-validation, the dataset is divided into k subsets, and each subset is used as a validation set
- Cross-validation helps in estimating the generalization performance of a regression model
- The dataset is divided into k subsets (folds), and each fold is used as a validation set once, while the remaining k - 1 folds are used for training
- Testing on multiple validation sets helps to identify overfitting or underfitting
Is the following statement incorrect? Why?
- Cross-validation is only applicable to regression tasks and not classification
Cross-validation is a general technique applicable to both regression and classification tasks
In the context of training neural networks, why is direct interpretability a challenge?
Neural networks are often referred to as “black-box” models because it is difficult to directly interpret how they arrive at predictions, especially as the number of layers and neurons increases
In the context of training neural networks, why is overfitting a challenge?
Overfitting occurs when a neural network learns the training data too well, including noise and irrelevant patterns, leading to poor generalization on unseen data
In the context of training neural networks, why is getting stuck in local minima a challenge?
While modern optimizers like stochastic gradient descent (SGD) with momentum or Adam help mitigate this, getting stuck in poor local minima or saddle points can still be a problem, especially for non-convex loss functions
In the context of training neural networks, why is the vanishing gradient problem a challenge?
The vanishing gradient problem occurs when gradients become very small during backpropagation, particularly in deep networks with activation functions like sigmoid or tanh. This slows down learning or even halts it in early layers
In the context of training neural networks, why is inefficient support vector calculations not a challenge?
Support vector calculations are specific to Support Vector Machines (SVMs), a different type of machine learning algorithm
In the context of training neural networks, why is collinearity among features not a challenge?
While collinearity among features can affect simpler models like linear regression, neural networks can often learn to handle correlated inputs effectively due to their ability to model complex relationships
Which of the following are characteristics of instance-based learning?
- It always uses a probabilistic model for predictions
- It’s often sensitive to irrelevant or redundant features
- It is always faster than model-based learning
- New instances are classified based on similarity measures
- Training is typically computationally intensive
- The model “memorizes” the training instances
- It’s often sensitive to irrelevant or redundant features because instance-based learning relies on similarity measures, which such features (and outliers) can skew
- New instances are classified based on similarity measures
- The model “memorizes” the training instances
Why are the following invalid reasons to use ensemble learning techniques?
- To handle missing values in the data
- To speed up training times for large datasets
- To provide a more interpretable model
- Ensemble learning does not focus on handling missing values
- Ensemble training may actually increase training times because multiple models need to be trained
- Ensemble models are often less interpretable because they combine multiple models, making them very complex
Why is finding a hyperplane that maximizes the margin between classes in the transformed space a reason for using kernel methods in SVMs?
Kernel methods enable SVMs to find a separating hyperplane in a higher-dimensional transformed space when the data is not linearly separable in the original space
Why is handling missing values in data NOT a reason for using kernel methods in SVMs?
Kernel methods don’t address missing data, but preprocessing can do so
Why is decreasing the number of support vectors in the model NOT a reason for using kernel methods in SVMs?
Kernel methods don’t affect the support vectors. That depends on the data and the model’s parameters
Why is tackling non-linearly separable data a reason for using kernel methods in SVMs?
Kernel methods allow SVMs to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space where separation IS possible
Why is reducing the computational complexity of the algorithm NOT a reason for using kernel methods in SVMs?
Kernel methods actually often increase the computational complexity because they require pairwise comparisons of ALL data points through the kernel function
Why is a fixed set of features to represent all possible inputs NOT an essential component of the PAC learning framework?
PAC is agnostic to the feature space and focuses on the hypothesis space, error bounds, and learning guarantees
Why is a confidence parameter an essential component in the PAC learning framework?
The confidence parameter represents the probability that the chosen hypothesis will not meet the desired accuracy
What does PAC stand for?
Probably Approximately Correct
What is the definition of PAC-learnable?
A concept class C is PAC-learnable by learner L using hypothesis space H if, for any distribution over the inputs and any ε and δ, L will, with probability at least 1 - δ, output a hypothesis whose true error is at most ε, using time and samples polynomial in 1/ε, 1/δ, and n
What makes up the “Approximately Correct” in PAC?
Requiring that the true error need not be zero, only bounded by some constant ε that can be made arbitrarily small
What makes up the “Probably” in PAC?
Requiring that the learner’s probability of failure be bounded by some constant that can be made arbitrarily small
Why is a hypothesis space essential to the PAC learning framework?
PAC assumes there exists some hypothesis space from which all possible models/functions can be selected by the learner to approximate the target
Why is the VC Dimension for all linear classifiers in a 2D space not 1?
Linear classifiers in 2D can shatter certain 3-point configurations (e.g., the corners of a triangle) but no configuration of four points. This makes the VC Dimension for all linear classifiers in a 2D space 3.
Why is it not true that the number of features in the dataset is equal to the VC Dimension?
They aren’t equivalent. Some models with nonlinear hypothesis spaces can have a VC Dimension greater than the number of features.
Why is a high VC Dimension an indicator of potentially overfitting?
A higher VC Dimension means the hypothesis class can represent more complex patterns, which means it can fit noise in, and overfit to, the given training data.
In the context of Bayesian learning, why is a higher likelihood NOT always indicative of a more probable hypothesis?
A higher likelihood means that the data is more probable given the hypothesis, but the overall probability of the hypothesis depends on the prior
In the context of Bayesian learning, why is the likelihood NOT always uniform across all hypotheses?
Some hypotheses explain the data better than others, resulting in different likelihood values
In the context of Bayesian learning, why is it NOT true that the likelihood of a hypothesis is the same as the prior probability of a hypothesis?
The likelihood P(Data | Hypothesis) is different from the prior P(Hypothesis).
The prior represents the probability of the hypothesis before seeing the data, while the likelihood measures how well the data supports the hypothesis.
In Bayesian learning, what is MAP and its formula?
MAP is the maximum a posteriori hypothesis, representing the most probable hypothesis given the observed data.
h_MAP = argmax_(h in H) (Pr(h | D))
In Bayesian learning, what is ML and its formula?
ML is the maximum likelihood hypothesis, which equals the MAP hypothesis when all hypotheses are equally likely a priori.
h_ML = argmax_(h in H) (Pr(D | h))
How do we measure information gain?
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v), summed over every value v of attribute A
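A small sketch computing entropy and gain on a toy split (the labels and feature values are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, feature_values):
    # Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v)
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# A feature that perfectly predicts the label gains a full bit
print(gain([1, 1, 0, 0], ['a', 'a', 'b', 'b']))  # 1.0
```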
What is restriction bias?
Bias that automatically occurs when we decide our hypothesis set
What is preference bias?
Bias that tells us what sort of hypotheses from our hypothesis space we prefer
What types of decision trees are preferred?
Ones with good splits at the top because it means less traversal on average
In terms of Ensemble Learning, what is bagging?
Bagging is when we make choosing data uniformly random, then combine the results of various weak learners.
Bagging is the shortened name for Bootstrap Aggregation.
When Bagging in Ensemble Learning, why does taking the average of a set of weak learners trained on different subsets outperform a single learner trained on the entire dataset?
Overfitting. A learner that overfits its subset does not overfit the overall dataset, and the average will “smooth out” the specifics of each individual learner.
In Ensemble Learning, what is Boosting?
Boosting is focused on data that previous learners struggled with in order to form a cohesive picture of the entire dataset.
In the context of Ensemble Learning, describe the process of Boosting
- Initialize all training examples to be equally weighted
- Each round, train a weak learner on the currently weighted data
- Raise the weights of every example that weak learner misclassified
- Combine all the weak learners into a final learner composed of their weighted average, as in the sketch below
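A minimal AdaBoost-style sketch of those steps (decision stumps as the weak learners, with scikit-learn as an assumed dependency):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, rounds=10):  # y must be in {-1, +1}
    n = len(y)
    w = np.full(n, 1 / n)    # all examples start equally weighted
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # this learner's vote
        w *= np.exp(-alpha * y * pred)  # raise weights on misclassified examples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    # Final learner: sign of the weighted average of the weak learners
    return lambda Xq: np.sign(sum(a * l.predict(Xq) for a, l in zip(alphas, learners)))
```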
What is a weak learner?
A weak learner is a model that performs better than chance for any distribution of data
When can you not create a weak learner?
If there is any distribution for which a set of hypotheses can’t do better than random chance, then there is no way to create a weak learner.
What are two rules designed for training the weights on perceptron units/networks?
- Perceptron Rule (threshold)
- Gradient Descent / Delta Rule (unthresholded)
Give the Perceptron Rule formula
activation = Σ(w_i * x_i) + b, thresholded to produce the output y_hat
Δw_i = η * (y - y_hat) * x_i, the weight update
where b is some bias constant and η is the learning rate
Why does the Perceptron Rule care about linearly separable data?
If the given data is linearly separable, the Perceptron Rule will find a separating hyperplane in a finite number of iterations
Describe the 4 main components of the Perceptron Rule
- The target y
- The output y_hat
- The learning rate η (how much we adjust)
- The input x
In the Perceptron Rule, what happens when the target and the output match?
When both match, we see no change in the weight because there is no error
In the Perceptron Rule, what happens when the difference between the target and output is negative (y - y_hat = -1)? When it is positive (y - y_hat = 1)?
When the difference is positive (y - y_hat = 1), our output was too small, so we increase the weight’s size according to the learning rate and given input.
When the difference is negative (y - y_hat = -1), our weighted sum was too large, so we decrease the weight’s size according to the learning rate and given input.
What is the criteria to stop running the Perceptron Rule?
We stop the Perceptron Rule once the weight changes are all 0 for a full pass over the data (this means we’ve found a hyperplane that separates it)
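A minimal sketch tying the last few cards together (thresholded activation, the update rule, and the stop-when-no-change criterion):

```python
import numpy as np

def perceptron_rule(X, y, lr=0.1, max_epochs=100):  # y in {0, 1}
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        changed = False
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0  # thresholded activation
            delta = lr * (y_i - y_hat)  # zero whenever target and output match
            if delta != 0:
                w += delta * x_i        # nudge the weights toward the target
                b += delta
                changed = True
        if not changed:  # a full pass with no updates: the data is separated
            break
    return w, b

# AND is linearly separable, so this converges in finitely many iterations
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w, b = perceptron_rule(X, np.array([0, 0, 0, 1]))
```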
What decides when to use the Perceptron Rule vs. Gradient Descent?
Whether the data is/is not linearly separable.
If it is, use the Perceptron Rule. If not, use Gradient Descent.
Why not do Gradient Descent on y_hat (the thresholded activation) instead of a (the pure activation value)?
It’s not differentiable: there is no way to take the derivative at a discontinuous point.
Essentially, the function jumps in value the moment we cross the activation threshold, which makes it discontinuous.
Describe the behavior of the Sigmoid function
As x approaches infinity, e^(-x) becomes infinitely small (x = 100 -> e^(-100)), so the denominator approaches 1 and the function approaches 1.
As x approaches negative infinity, e^(-x) becomes infinitely large (x = -100 -> e^(100)), so the denominator is essentially infinite and the function approaches 0.
Give the formula of the Sigmoid function
a(x) = 1 / (1 + e^(-x))
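A quick check of the formula and the limiting behavior described above:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))     # 0.5
print(sigmoid(100))   # ~1.0: e^(-100) is ~0, so the denominator is ~1
print(sigmoid(-100))  # ~0.0: e^(100) is huge, so the denominator is huge
```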
Backpropagation is dependent on what?
On ALL of the activation functions being differentiable. We cannot propagate the error information back up the layers if the functions cannot be differentiated
What is backpropagation?
It is the process of calculating the error between the output and the actual value, then adjusting weights from the “bottom” of the network back up to the top
Backpropagation is also called what?
Error backpropagation
What are the 2 main issues with gradient descent?
- Activation functions need to be differentiable
- Can get stuck in local optima
Why does gradient descent have an issue with getting stuck in local optima?
As we combine more and more functions, the “landscape” of that combined error function becomes more complex and introduces more local optima besides the best answer, the global optimum
With neural networks, what are the 3 ways to cause overfitting?
- Increasing the number of nodes
- Increasing the number of layers
- Increasing the size of the weights
What is the restriction bias of neural networks?
With a sufficiently complex network, there is no hypothesis we will not consider
What is the preference bias of neural networks?
Simpler, more correct network configurations
What is the restriction bias of decision trees?
Only hypotheses expressible as a tree of single-attribute splits are considered (for discrete inputs this still covers every boolean function)
What is the preference bias of decision trees?
Shorter, correct trees with good splits at the top
What is the restriction bias of K-NN?
Very little: K-NN is nonparametric, so any labeling reachable by voting/averaging over stored neighbors can be represented
What is the preference bias of K-NN?
- Locality: points closer to each other are more similar to each other
- Smoothness: we are expecting functions to act smoothly
- Equal relevance: we assume ALL features have equal importance
What is the Curse of Dimensionality?
As we add more features to our data, the amount of data needed to cover the feature space sufficiently grows exponentially
What is Ensemble Learning?
When you train multiple instances of a model on different subsets of the training data and then combine all of those resulting models into a single model
What is the relationship between SVMs and the margin(s) of our data?
The bigger the margin, the less overfitting done by the SVM because it commits less to the specific data it was trained on
What actually is a support vector?
The points from the input data that are necessary for defining the largest margin
What is the kernel trick?
Projecting our currently given, non-linearly separable data into a higher dimensional space so that it becomes linearly separable
In SVMs, how is the kernel trick accomplished? What does it do?
The kernel trick is accomplished by turning the dot product into some other similarity metric.
This gives us a way to embed domain knowledge.
What is the only condition for a kernel trick to be successful?
The Mercer condition; that a kernel function is valid if and only if:
- It is symmetric (K(x, y) = K(y, x))
- It is positive semi-definite (c^T * K * c >= 0; essentially, any vector we put in gives back zero or a positive number)
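A numeric sanity check of both conditions on an RBF kernel matrix (a known-valid kernel), using numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# RBF kernel: K(x, y) = exp(-||x - y||^2 / 2)
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2)

print(np.allclose(K, K.T))                    # symmetric: K(x, y) = K(y, x)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # all eigenvalues >= 0: PSD
```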
What happens to Boosting over time in SVMs?
Over time, Boosting becomes more confident in its classifications (the margins grow), so we actually don’t end up seeing issues with overfitting
What is a consistent learner?
A model that produces a candidate hypothesis which is equal to the true hypothesis for every data point in the training set
When is a version space epsilon-exhausted (ε)?
A version space is ε-exhausted if and only if (iff) for every candidate hypothesis in the version space, the error of those hypotheses is less than or equal to ε where 0 <= ε <= 1/2
Describe how exploration is accomplished in Simulated Annealing
With temperature. As temperature increases, simulated annealing is willing to take initially worse steps, hoping that they will eventually lead to a better, or the global, optimum
Describe how exploitation is accomplished in Simulated Annealing
With hill climbing. If we are solely focused on exploitation, then we reduce the temperature as close to 0 as possible so that we’re only attempting steps that will improve our fitness function score
What is the inspiration behind Simulated Annealing?
That the same process used to forge metals, repeatedly heating and cooling them so their molecules settle into a viable arrangement, can be used for randomized optimization
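A compact sketch of the exploration/exploitation loop from the two cards above, maximizing a hypothetical 1D fitness function:

```python
import math
import random

def fitness(x):  # hypothetical multi-modal function to maximize
    return math.sin(x) + math.sin(3 * x) / 3

def simulated_annealing(x=0.0, T=2.0, cooling=0.99, steps=5000):
    for _ in range(steps):
        x_new = x + random.uniform(-0.5, 0.5)  # propose a neighboring step
        delta = fitness(x_new) - fitness(x)
        # Exploitation: always accept improvements (delta > 0).
        # Exploration: accept worse steps with probability e^(delta / T),
        # which shrinks toward pure hill climbing as T approaches 0.
        if delta > 0 or random.random() < math.exp(delta / T):
            x = x_new
        T = max(T * cooling, 1e-6)
    return x
```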
How does single link clustering work?
- Initialize each data point on its own cluster
- Find the two clusters with the smallest distance between any two points belonging to those clusters
- Merge those two clusters into a single cluster
- Repeat steps 2 and 3 until only some defined number of clusters remains
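These steps are what scipy’s hierarchical clustering performs with method='single' (a sketch assuming scipy is available; the data is random placeholder points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)

# method='single': merge the two clusters whose *closest* points are nearest
Z = linkage(X, method='single')

# Cut the merge tree once the desired number of clusters remains
labels = fcluster(Z, t=3, criterion='maxclust')
```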
How does k-means clustering work?
- Randomly initialize k centers
- Each center “claims” its closest points
- Re-compute the center of each cluster by averaging the clustered points
- Repeat steps 2 and 3 until the centers (and therefore the assignments) stop changing
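A minimal numpy sketch of that loop (ignoring empty-cluster edge cases for brevity):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Each center "claims" its closest points
        labels = ((X[:, None] - centers[None, :]) ** 2).sum(-1).argmin(axis=1)
        # Re-compute each center by averaging its clustered points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # nothing moved: converged
            break
        centers = new_centers
    return labels, centers
```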
What kinds of shapes does single link clustering produce? Why?
Long, stringy, chain-like clusters, because merges are driven by the single closest pair of points, letting clusters “chain” across the space
What kinds of shapes does k-means clustering produce? Why?
Compact, roughly spherical (convex) clusters, because each point joins its nearest center, carving the space into Voronoi cells
How does soft clustering work?
Instead of committing each point to exactly one cluster, every point is given a probability of belonging to each cluster (e.g., model the data as a mixture of Gaussians and compute the probability that each point came from each one)
What kinds of shapes does soft clustering produce? Why?
Overlapping, elliptical clusters with fuzzy boundaries (for Gaussian mixtures), because borderline points are shared probabilistically rather than forced into a single cluster
How does the EM (Expectation Maximization) algorithm work?
- E-step (expectation): using the current parameters, compute the expected (soft) cluster assignments for every point
- M-step (maximization): re-estimate the parameters to maximize the likelihood under those soft assignments
- Alternate the two steps until the likelihood stops improving (see the sketch after the next card)
What are the properties of the EM (Expectation Maximization) algorithm?
- Monotonically non-decreasing likelihood (creates scores that never get lower)
- Does not converge (but gets close)
- Will not diverge
- Can get stuck in local optima
- Works with any distribution (so long as you can define “expectation” and “maximization”)
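A minimal sketch of EM for a two-component 1D Gaussian mixture, showing the alternating expectation (soft assignments) and maximization (parameter re-estimation) steps:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    mu = np.array([x.min(), x.max()])  # crude initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])          # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
                  / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the expected log-likelihood
        n_k = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        pi = n_k / len(x)
    return mu, var, pi
```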
Why does k-means clustering eventually converge?
There is only a finite number of possible cluster configurations. As long as you have a means of always breaking ties, you never revisit a configuration, so you eventually converge on one
What are the 3 clustering properties?
- Richness: the idea that we could achieve ANY cluster configuration available
- Scale Invariance: the idea that scaling all distances by a positive constant doesn’t change the clustering, since the relative positions stay the same
- Consistency: the idea that shrinking intracluster distances and expanding intercluster distances doesn’t change the clustering
What is the Kleinberg Impossibility Theorem?
There is no clustering algorithm that can achieve all three properties: richness, scale-invariance, and consistency
What are the two types of methods for feature selection?
- Filtering: we take the given features and narrow the list to a subset before handing that subset to some learning algorithm
- Wrapping: the search for features is wrapped around our learning algorithm as we continuously give it a subset, assess its performance, and adjust which features it’s given the next time
What are the properties of Feature Selection via Filtering?
+ Generally increased speed
- Ignores the learning problem
What are the properties of Feature Selection via Wrapping?
+ Takes into account model bias
+ Takes into account the learning problem
- SOOOOOO slow
What are the 2 linear transformation algorithms we reviewed?
- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)
What does PCA do?
- It finds the direction of maximum variance for some given set of training data
- It finds directions that are mutually orthogonal (perpendicular to each other)
What are the characteristics of PCA?
- It is an eigen problem
- It gives you the best reconstruction (lowest reconstruction error) of the data
- Each principal component has an associated eigenvalue (a nonnegative value)
What does it mean if the eigenvalue of a principal component from PCA is 0?
If the eigenvalue is 0, then we can ignore that component (feature)
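A numpy sketch of PCA as an eigenproblem on the covariance matrix, with components sorted by eigenvalue (near-zero eigenvalues mark directions safe to drop):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)           # center the data
    cov = np.cov(Xc, rowvar=False)    # covariance matrix
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues are real and >= 0
    order = np.argsort(vals)[::-1]    # sort by variance captured
    vals, vecs = vals[order], vecs[:, order]
    return Xc @ vecs[:, :n_components], vals  # projection + all eigenvalues
```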
What are the characteristics of ICA?
- Constructs new features that are mutually independent of each other
- Aims for the maximum mutual information between the original features and the constructed features
What is the Central Limit Theorem?
The sum (or average) of many independent random variables tends toward a Gaussian (normal) distribution, regardless of the variables’ own distributions (this is why ICA hunts for non-Gaussian directions: mixtures of independent sources look more Gaussian than the sources themselves)
What is the main difference between PCA and ICA?
ICA is highly directional. PCA doesn’t care about being given the original or transposed features because they’re all just rotations of data
What are some alternatives to PCA and ICA?
- RCA (Random Component Analysis)
- LDA (Linear Discriminant Analysis)
What is the main advantage of RCA?
It is FAST, so we can run many, many iterations of it
How does RCA work?
RCA generates a random direction for projection, and works quite well if used for classification
How does LDA work?
LDA finds a projection that discriminates based on the given labels
What does Entropy capture?
Entropy captures the amount of information contained in a random variable
What is Joint Entropy?
Joint Entropy is a measure of randomness contained in two variables together
What is Mutual Information?
A measure of the reduction of randomness of a variable, given knowledge of another variable
Why is Mutual Information a particular case of Kullback-Leibler Divergence?
Mutual Information is the KL Divergence between the joint distribution P(X, Y) and the product of the marginals P(X)P(Y); it quantifies how far the two variables are from independent, i.e., how much information they share
What is the Kullback-Leibler Divergence?
KL Divergence is a measure of the difference between two distributions, quantifying the information lost when one distribution is used to approximate another
What is Conditional Entropy?
Conditional Entropy is a measure of the randomness of one variable given another variable
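A small sketch computing all of these quantities from one joint distribution table (hypothetical numbers), verifying the identities I(X;Y) = H(X) - H(X|Y) = KL(P(X,Y) || P(X)P(Y)):

```python
import numpy as np

joint = np.array([[0.30, 0.10],   # hypothetical P(X, Y): rows are X, cols are Y
                  [0.10, 0.50]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

def H(p):  # entropy in bits
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_xy = H(joint.ravel())     # joint entropy H(X, Y)
H_x_given_y = H_xy - H(py)  # conditional entropy H(X|Y) = H(X, Y) - H(Y)
mi = H(px) - H_x_given_y    # mutual information: reduction in randomness of X
kl = (joint * np.log2(joint / np.outer(px, py))).sum()  # KL(joint || independence)
print(round(mi, 4) == round(kl, 4))  # True: MI is a special case of KL Divergence
```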