Final Review Flashcards

1
Q

Who developed the VC dimension?

A

Dr. Vladimir Vapnik and Dr. Chervonenkis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What does VC stand for?

A

Vapnik-Chervonekis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Why was the VC dimension developed?

A

To help scientists develop better machine learning models that would be better at classifying data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the VC dimension?

A

The Vapnik-Chervonenkis dimension of a hypothesis space is the maximum number of points that can be “shattered” by the hypothesis of that space

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

In terms of VC dimension, what does “shattering” mean?

A

“Shattering” means that for every possible way of labeling these points (with binary labels), there is a hypothesis in the space that correctly classifies the points according to that labeling

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Is either of these true? Why or why not?

The VC dimension of the hypothesis class of circles is lower than that of squares.

The VC dimension of the hypothesis class of squares is lower than that of circles.

A

No, because both have a VC dimension of 3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Is this true? Why or why not?

The VC dimension of the hypothesis class of rings is higher that that of circles.

A

Yes, because rings have a VC dimension of 4 while circles have a VC dimension of 3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are 2 important notes when discussing the VC dimension of a hypothesis space?

A
  • Knowing if the hypothesis space is centered on the origin
  • Knowing that the VC dimension is NOT always equivalent to 1 + (number of dimensions)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Is either of these true? Why or why not?

The VC dimension of circles is higher than that of squares.

The VC dimension of squares is higher that that of circles.

A

No, because both have a VC dimension of 3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What was the goal of Bayes’ Rule?

A

To solve the problem of inverse probability - inferring the probability of causes (hypotheses) from observed effects (data)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is Bayes’ Rule?

A

A fundamental theorem in probability theory for updating the probability of a hypothesis based on new evidence

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is the formula for Bayes’ Rule?

A

P(h|D) = [P(D|h) * P(h)] / P(D)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

In Bayes’ Rule, what is P(h|D) and what does it represent?

A

The posterior probability of the hypothesis given the data.

It represents our updated belief in the hypothesis after seeing the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

In Bayes’ Rule, what is P(D|h) and what does it represent?

A

P(D|h) is the likelihood, which is the probability of observing the data assuming the hypothesis is true.

It tells us how likely the observed data is under the assumption of the hypothesis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

In Bayes’ Rule, what is P(h) and what does it represent?

A

P(h) is the prior probability of the hypothesis before observing any data.

It represents our belief in the hypothesis based on prior knowledge or assumptions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

In Bayes’ Rule, what is P(D) and what does it represent?

A

P(D) is the marginal likelihood (evidence), which normalizes the posterior distribution by ensuring the total probability sums to 1.

It represents the total probability of observing the data across all possible hypotheses.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

For Bayes’ Rule, when would we ignore P(D)?

A

We ignore the probability of the data given any hypothesis when we’re interested in finding the hypothesis that maximizes the posterior probability - P(h|D)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is MLE?

A

Maximum Likelihood Estimation is a way to estimate the parameters of a model by finding the values that make the observed data most likely

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What does MLE mean in the context of binary classification?

A

This means finding the parameters of the hypothesis h(x) that make the observed outcomes y more likely.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is MSE? How is it written?

A

Mean Squared Error
MSE = (1/n) * (Σ(y - h(x))^2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Why is MSE unsuitable for binary classification?

A
  • Binary classification treats targets as discrete (0 or 1), while MSE assumes targets are continuous values.
  • MSE doesn’t penalize incorrect predictions made with high confidence as effectively as cross-entropy loss does
  • MSE here can result in poor performance because it focuses on shortening distance between predicted and actual values, ignoring the probability estimates produced by models in classification tasks
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What function is typically used for a binary classification neural network’s output layer? Why?

A

A sigmoid activation function is typically used because it outputs a value between 0 and 1 which can be evaluated with cross-entropy loss to predict how well it matches the actual binary label.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Describe Cross-Entropy Loss (CEL)

A
  • Used in binary classification
  • Measures the difference between the predicted probabilities and the true binary labels
  • Directly handles probabilistic predictions
  • Strongly penalizes wrong predictions made with high confidence
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Describe Mean Squared Error (MSE)

A
  • Typically used in regression tasks where the target variable is continuous
  • Measures the squared difference between predicted values and true values
  • Treats output as continuous, making it LESS suitable for binary classification
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Describe Neural Networks with Sum of Squared Error (SSE)

A
  • If it is trained for a binary classification task, it may treat the output as continuous rather than probabilistic
  • For binary classification, it uses CEL (Cross-Entropy Loss) with a sigmoid output layer
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is a limitation of K-Means?

A
  • K-Means struggles with non-spherical clusters
  • K-Means requires a pre-defined number of clusters
  • K-Means may converge to local minima
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What are the steps of spectral clustering?

A
  • Construct a similarity graph
  • Compute the graph Laplacian
  • Compute the eigenvalues and eigenvectors
  • Perform clustering in the reduced eigenspace
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What is the kernel trick?

A

This trick maps data into a higher-dimensional space where linear separation is possible

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Why does the kernel trick matter for concentric circles?

A

This trick allows spectral clustering to transform into the original space, enabling the separation of non-linearly separable points, such as inner and outer circles

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What is the primary advantage of Isomap over traditional MDS?

A

Isomaps approximate geodesic distances, capturing global structure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Advantages and Limitations of Isomap

A

Advantage:
- Captures global structure effectively, especially for data with a natural manifold shape

Limitations:
- Computationally intensive on large datasets
- Struggles with non-manifold high-dimensional data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What is the core concept of Laplacian Eigenmaps?

A

Laplacian eigenmaps use spectral graph theory to focus on local neighborhood relationships, preserving local structure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What is the process for Laplacian Eigenmaps?

A
  • Build a weighted similarity graph of the data
  • Compute the graph’s Laplacian matrix
  • Perform eigenvalue decomposition to generate a low-dimensional embedding
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Laplacian Eigenmaps are best suited for preserving what type of structure?

A

Local neighborhood structure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Name applications of Laplacian Eigenmaps

A
  • Image Processing
  • Clustering in High-Dimensional Data
  • Sensor Data Analysis
  • Gene Expression Data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Advantages and Limitations of Laplacian Eigenmaps

A

Advantage: Efficient for preserving local structures, which is useful for clustering tasks

Limitation: May lose global relationships as it focuses solely at the local level

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Why might pruning be applied to a decision tree?

A
  • To simplify the model and improve interpretability by removing less significant branches
  • To reduce overfitting by removing branches that were fit to noise or other training data fluctuations
  • To remove branches that provide little to no predictive power
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

Why are the following NOT reasons to prune a decision tree?

  • To ensure the tree is balanced
  • To always achieve the best accuracy
  • To increase tree depth
A
  • The goal of pruning is not to balance a tree
  • Pruning doesn’t guarantee the BEST accuracy
  • Pruning does the opposite of increase tree depth
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Why are the following statements around cross-validation true?

  • In k-fold cross-validation, the dataset is divided into k subsets, and each subset is used as a validation set
  • Cross-validation helps in estimating the generalization performance of a regression model
A
  • The dataset is divided into k subsets (folds), and each fold is used as a validation set once, while the remaining k - 1 folds are used for training
  • Testing on multiple validation sets helps to identify overfitting or underfitting
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

Is the following statement incorrect? Why?

  • Cross-validation is only applicable to regression tasks and not classification
A

Cross-validation is a general technique applicable to both regression and classification tasks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

In the context of training neural networks, why is direct interpretability a challenge?

A

Neural networks are often referred to as “black-box” models because it is difficult to directly interpret how they arrive at predictions, especially as the number of layers and neurons increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

In the context of training neural networks, why is overfitting a challenge?

A

Overfitting occurs when a neural network learns the training data too well, including noise and irrelevant patterns, leading to poor generalization on unseen data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

In the context of training neural networks, why is getting stuck in local minima a challenge?

A

While modern optimizers like stochastic gradient descent (SGD) with momentum or Adam help mitigate this, getting stuck in poor local minima or saddle points can still be a problem, especially for non-convex loss functions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

In the context of training neural networks, why is the vanishing gradient problem a challenge?

A

The vanishing gradient problem occurs when gradients become very small during backpropagation, particularly in deep networks with activation functions like sigmoid or tanh. This slows down learning or even halts it in early layers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

In the context of training neural networks, why is inefficient support vector calculations not a challenge?

A

Support vector calculations are specific to Support Vector Machines (SVMs), a different type of machine learning algorithm

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

In the context of training neural networks, why is collinearity among features not a challenge?

A

While collinearity among features can affect simpler models like linear regression, neural networks can often learn to handle correlated inputs effectively due to their ability to model complex relationships

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

Which of the following are characteristics of instance-based learning?

  • It always uses a probabilistic model for predictions
  • It’s often sensitive to irrelevant or redundant features
  • It is always faster than model-based learning
  • New instances are classified based on similarity measures
  • Training is typically computationally intensive
  • The model ”memorizes” the training instances
A
  • It’s often sensitive to irrelevant or redundant features because instance-based learning relies on similarity measures, which can be skewed by outliers
  • New instances are classified based on similarity measures
  • The model ”memorizes” the training instances
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

Why are the following invalid reasons to use ensemble learning techniques?

  • To handle missing values in the data
  • To speed up training times for large datasets
  • To provide a more interpretive model
A
  • Ensemble learning does not focus on handling missing values
  • Ensemble training may actually increase training times because multiple models need to be trained
  • Ensemble models are often less interpretable because they combine multiple models, making them very complex
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

Why is finding a hyperplane that maximizes the margin between classes in the transformed space a reason for using kernel methods in SVMs?

A

Kernel methods enable SVMs to find a separating hyperplane in a higher-dimensional transformed space when the data is not linearly separable in the original space

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

Why is handling missing values in data NOT a reason for using kernel methods in SVMs?

A

Kernel methods don’t address missing data, but preprocessing can do so

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

Why is decreasing the number of support vectors in the model NOT a reason for using kernel methods in SVMs?

A

Kernel methods don’t affect the support vectors. That depends on the data and the model’s parameters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

Why is tackling non-linearly separable data a reason for using kernel methods in SVMs?

A

Kernel methods allow SVMs to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space where separation IS possible

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

Why is reducing the computational complexity of the algorithm NOT a reason for using kernel methods in SVMs?

A

Kernel methods actually often increase the computational complexity because they require pairwise comparisons of ALL data points through the kernel function

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

Why is a fixed set of features to represent all possible inputs NOT an essential component of the PAC learning framework?

A

PAC is agnostic to the feature space and focuses on the hypothesis space, error bounds, and learning guarantees

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

Why is a confidence parameter an essential component in the PAC learning framework?

A

The confidence parameter represents the probability that the chosen hypothesis will not meet the desired accuracy

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

What does PAC stand for?

A

Probably Approximately Correct

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

What is the definition of PAC-learnable?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

What makes up the “Approximately Correct” in PAC?

A

Requiring the true error not be zero, but bounded only by some constant that can be arbitrarily small

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

What make the “Probably” in PAC?

A

Requiring that the learner’s probability of failure be bounded by some constant that can be made arbitrarily small

59
Q

Why is a hypothesis space essential to the PAC learning framework?

A

PAC assumes there exists some hypothesis space from which all possible models/functions can be selected by the learner to approximate the target

60
Q

Why is the VC Dimension for all linear classifiers in a 2D space not 1?

A

Certain configurations (a triangle) can shatter 3 points, but not four. This makes the VC Dimension for all linear classifiers in a 2D space 3.

61
Q

Why is it not true that the number of features in the dataset is equal to the VC Dimension?

A

They aren’t equivalent. Some models with nonlinear hypothesis spaces can have a VC Dimension greater than the number of features.

62
Q

Why is a high VC Dimension an indicator of potentially overfitting?

A

A higher VC Dimension means it can create more complex patterns, which means it can overfit to the given training data.

63
Q

In the context of Bayesian learning, why is a higher likelihood NOT always indicative of a more probable hypothesis?

A

A higher likelihood means that the data is more probable given the hypothesis, but the overall probability of the hypothesis depends on the prior

64
Q

In the context of Bayesian learning, why is probability NOT always uniform across all hypotheses?

A

Some hypotheses explain the data better than others, resulting in different likelihood values

65
Q

In the context of Bayesian learning, why is it NOT true that the likelihood of a hypothesis is the same as the prior possibility of a hypothesis?

A

The likelihood P(Data | Hypothesis) is different from the prior P(Hypothesis).

The prior represents the probability of the hypothesis before seeing the data, while the likelihood measures how well the data supports the hypothesis.

66
Q

In Bayesian learning, what is MAP and its formula?

A

MAP is the maximum a posterior, representing an estimate of the most likely value of a variable.

h_MAP = argmax_(h in H) (Pr(h | D))

67
Q

In Bayesian learning, what is ML and its formula?

A

ML is the maximum likelihood, which represents when all hypotheses are equally likely.

h_ML = argmax_(h in H) (Pr(D | h))

68
Q

How do we measure information gain?

A

Gain(S, A) = Entropy(S) - (summation(|S_v|/S) * Entropy(s_v))

69
Q

What is restriction bias?

A

Bias that automatically occurs when we decide our hypothesis set

70
Q

What is preference bias?

A

Bias that tells us what sort of hypotheses from our hypothesis space we prefer

71
Q

What types of decision trees are preferred?

A

Ones with good splits at the top because it means less traversal on average

72
Q

In terms of Ensemble Learning, what is bagging?

A

Bagging is when we make choosing data uniformly random, then combine the results of various weak learners.

Bagging is the shortened named for Bootstrap Aggregation.

73
Q

When Bagging in Ensemble Learning, why does taking the average of a set of weak learners trained on different subsets outperform a single learner trained on the entire dataset?

A

Overfitting. Overfitting a subset will not overfit the overall dataset, and the average will “smooth out” the specifics of each individual learner.

74
Q

In Ensemble Learning, what is Boosting?

A

Boosting is focused on data that previous learners struggled with in order to form a cohesive picture of the entire dataset.

75
Q

In the context of Ensemble Learning, describe the process of Boosting

A
  • Initialize all training examples to be equally weighted
  • After each round, identify the weakest learner
  • Raise the weights of every example that weaker learner misclassified
  • Combine all the weak learners into a final learner composed of their weighted average
76
Q

What is a weak learner?

A

A weak learner is a model that performs better than chance for any distribution of data

77
Q

When can you not create a weak learner?

A

If there is any distribution for which a set of hypotheses can’t do better than random chance, then there is no way to create a weak learner.

78
Q

What are two rules designed for training the weights on perceptron units/networks?

A
  • Perceptron Rule (threshold)
  • Gradient Descent / Delta Rule (unthresholded)
79
Q

Give the Perceptron Rule formula

A

activation = Σ(w_i * x_i) + b

where b is some bias constant

80
Q

Why does the Perceptron Rule care about linearly separable data?

A

If the given data is linearly separable, the Perceptron Rule will find the dividing halfplane in a finite number of iterations

81
Q

Describe the 4 main components of the Perceptron Rule

A
  • The target y
  • The output y
  • The learning rate (how much we adjust)
  • The input
82
Q

In the Perceptron Rule, what happens when the target and the output match?

A

When both match, we see no change in the weight because there is no error

83
Q

In the Perceptron Rule, what happens when the difference between the target and output is negative (y - y_hat = -1)? When it is positive (y - y_hat = 1)?

A

When the output is positive, that means our output is too small so we need to increase the weight’s size according to the learning rate and given input.

When the output is negative, that means our sum is too large so we need to decrease the weight’s size according to the learning rate and given input.

84
Q

What is the criteria to stop running the Perceptron Rule?

A

We stop the Perceptron Rule once the change in the given weight return 0 (this means we’ve found a plane that separates the data)

85
Q

What decides when to use the Perceptron Rule vs. Gradient Descent?

A

Whether the data is/is not linearly separable.

If it is, use the Perceptron Rule. If not, use Gradient Descent.

86
Q

Why not do Gradient Descent on y_hat (the thresholded activation) instead of a (the pure activation value)?

A

It’s not differentiable. There is no way to take the derivative at a discontinuous point.

Essentially, the function spikes in value after we meet the activation threshold and causes the function to be discontinuous.

87
Q

Describe the behavior of the Sigmoid function

A

As x approaches infinity, e becomes infinitely small (x = -100 –> e^100) so the function approaches 0 because the denominator is basically infinite.

As x approaches negative infinity, e becomes infinitely large (x = 100 –> e^-100) so the function approaches 1 because the denominator is basically 1.

88
Q

Give the formula of the Sigmoid function

A

a(x) = 1 / (1 + e^(-x))

89
Q

Backpropagation is dependent on what?

A

On ALL of the activation functions being differentiable. We cannot propagate the error information back up the layers if the functions cannot be differentiated

90
Q

What is backpropagation?

A

It is the process of calculating the error between output and actual, then adjusting weights form the “bottom” of the network back up to the top

91
Q

Backpropagation is also called what?

A

Error back propagation

92
Q

What are the 2 main issues with gradient descent?

A
  • Activation functions need to be differentiable
  • Can get stuck in local optima
93
Q

Why does gradient descent have an issue with getting stuck in local optima?

A

As we combine more and more functions, the “landscape” of that combined error function becomes more complex and introduces more answers other than the best answer, the global maxima

94
Q

With neural networks, what are the 3 ways to cause overfitting?

A
  • Increasing the number of nodes
  • Increasing the number of layers
  • Increasing the size of the weights
95
Q

What is the reference bias of neural networks?

A

With a sufficiently complex network, there is no hypothesis we will not consider

96
Q

What is the preference bias of neural networks?

A

Simpler, more correct network configurations

97
Q

What is the reference bias of decision trees?

98
Q

What is the preference bias of decision trees?

99
Q

What is the reference bias of K-NN?

A
  • Points closer to each other are more similar to each other
  • We are expecting functions to act smoothly
  • We assume ALL features have equal importance
100
Q

What is the preference bias of K-NN?

101
Q

What is the Curse of Dimensionality?

A

As we add more features to our data, the amount of data needed to consider it sufficiently test grows exponentially

102
Q

What is Ensemble Learning?

A

When you test multiple instances of the model against different subsets of the training data and then compile all of those resulting models into a singular model

103
Q

What is the relationship between SVMs and the margin(s) of our data?

A

The bigger the margin, the less overfitting done by the SVM because it commits less to the specific data it was trained on

104
Q

What actually is a support vector?

A

The points from the input data that are necessary for defining the largest margin

105
Q

What is the kernel trick?

A

Projecting our currently given, non-linearly separable data into a higher dimensional space so that it becomes linearly separable

106
Q

In SVMs, how is the kernel trick accomplished? What does it do?

A

The kernel trick is accomplished by turning the dot product into some other similarity metric.

This gives us a way to imbed domain knowledge.

107
Q

What is the only condition for a kernel trick to be successful?

A

The Mercer condition; that a kernel function is valid if and only if:
- It is symmetric (K(x, y) = K(y, x))
- It is positive semi-definite (c^T * K * c >= 0, essentially that any vector we put it gives out a positive number or zero as the answer)

108
Q

What happens to Boosting over time in SVMs?

A

Overtime, Boosting results in the SVMs becoming more confident in their classifications so we actually don’t end up seeing any issues with overfitting

109
Q

What is a consistent learner?

A

A model that produces a candidate hypothesis which is equal to the true hypothesis for every data point in the training set

110
Q

When is a version space epsilon-exhausted (ε)

A

A version space is ε-exhausted if and only if (iff) for every candidate hypothesis in the version space, the error of those hypotheses is less than or equal to ε where 0 <= ε <= 1/2

111
Q

Describe how exploration is accomplished in Simulated Annealing

A

With temperature. As temperature increases, simulated annealing is will to take initially worse steps hoping that they will eventually lead to a better, or global, optima

112
Q

Describe how exploitation is accomplished in Simulated Annealing

A

With hill climbing. If we are solely focused on exploitation, then we reduced the temperature as close to 0 as possible so that we’re only attempting steps that will improve our fitness function score

113
Q

What is the inspiration behind Simulated Annealing?

A

That the same process for forging metals by repeatedly changing the ordering of molecules via heating and pressing until a viable combination is achieved can be used for randomized optimization

114
Q

How does single link clustering work?

A
  • Initialize each data point on its own cluster
  • Find the two clusters with the smallest distance between any two points belonging to those clusters
  • Merge those two clusters into a single cluster
  • Repeat steps 2 and 3 until all points are assigned to some defined number of clusters
115
Q

How does k-means clustering work?

A
  • Randomly initialize k centers
  • Each center “claims” its closest point
  • Re-compute the center of each cluster by averaging the clustered points
  • Repeat steps 2 and 3 until all points are assigned to some defined number of clusters
116
Q

What kinds of shapes does single link clustering produce? Why?

117
Q

What kinds of shapes does k-means clustering produce? Why?

118
Q

How does soft clustering work?

119
Q

What kinds of shapes does soft clustering produce? Why?

120
Q

How does the EM (Expectation Maximization) algorithm work?

121
Q

What are the properties of the EM (Expectation Maximization) algorithm?

A
  • Monotonically non-decreasing likelihood (creates scores that never get lower)
  • Does not converge (but gets close)
  • Will not diverge
  • Can get stuck in local optima
  • Works with any distribution (so long as you can define “expectation” and “maximization”)
122
Q

Why does k-means clustering eventually converge?

A

There is some finite number of neighbors. As long as you have a means of always breaking ties, you will eventually converge on some configuration

123
Q

What are the 3 clustering properties?

A
  • Richness: the idea that we could achieve ANY cluster configuration available
  • Scale Invariance: the idea that making values more positive doesn’t change the clustering, assuming that the relative positions stays the same
  • Consistency: the idea that shrinking intracluster distances and expanding intercluster distances doesn’t change the clustering
124
Q

What is the Kleinberg Impossibility Theorem?

A

There is no clustering algorithm that can achieve all three properties: richness, scale-invariance, and consistency

125
Q

What are the two types of methods for feature selection?

A
  • Filtering: we take the given features and narrow the list to a subset before handing that subset to some learning algorithm
  • Wrapping: the search for features is wrapped around our learning algorithm as we continuously give it a subset, assess its performance, and adjust which features its given the next time
126
Q

What are the properties of Feature Selection via Filtering?

A

+ Generally increased speed
- Ignores the learning problem

127
Q

What are the properties of Feature Selection via Wrapping?

A

+ Takes into account model bias
+ Takes into account the learning problem
- SOOOOOO slow

128
Q

What are the 2 linear transformation algorithms we reviewed?

A
  • Principal Component Analysis (PCA)
  • Independent Component Analysis (ICA)
129
Q

What does PCA do?

A
  • It finds the direction of maximum variance for some given set of training data
  • It finds directions that are mutually orthogonal (perpendicular to each other)
130
Q

What are the characteristics of PCA?

A
  • It is an eigen problem
  • It gives you the best reconstruction (lowest reconstruction error) of the data
  • Each principal component returns it’s eigenvalue (positive value)
131
Q

What does it mean if the eigenvalue of a principal component from PCA is 0?

A

If the eigenvalue is 0, then we can ignore that component (feature)

132
Q

What are the characteristics of ICA?

A
  • Constructs new features that are mutually independent of each other
  • Aims for the maximum mutual information between the original features and the constructed features
133
Q

What is the Central Limit Theorem?

134
Q

What is the main difference between PCA and ICA?

A

ICA is highly directional. PCA doesn’t care about being given the original or transposed features because they’re all just rotations of data

135
Q

What are some alternatives to PCA and ICA?

A
  • RCA (Random Component Analysis)
  • LDA (Linear Discriminant Analysis)
136
Q

What is the main advantage of RCA?

A

It is FAST, so we can run many, many iterations of it

137
Q

How does RCA work?

A

RCA generates a random direction for projection, and works quite well if used for classification

138
Q

How does LDA work?

A

LDA finds a projection that discriminates based on the given labels

139
Q

What does Entropy capture?

A

Entropy captures the amount of information contained in a random variable

140
Q

What is Joint Entropy?

A

Joint Entropy is a measure of randomness contained in two variables together

141
Q

What is Mutual Information?

A

A measure of the reduction of randomness of a variable, given knowledge of another variable

142
Q

Why is Mutual Information a particular case of Kullback-Leibler Divergence?

A

Mutual Information is a case of KL Divergence because minimizing KL Divergence between two distributions maximizes the information shared between them, resulting in information gain between the two distributions

143
Q

What is the Kullback-Liebler Diverence?

A

KL Divergence is a measure of the difference between two distributions, quantifying the information lost when one distribution is used to approximate another

144
Q

What is Conditional Entropy?

A

Conditional Entropy is a measure of the randomness of one variable given another variable