Final Review Flashcards

Question

Describe Neural Networks with Sum of Squared Error (SSE)

Answer 1

- If it is trained for a binary classification task, SSE may treat the output as continuous rather than probabilistic
- For binary classification, the network would instead typically use Cross-Entropy Loss (CEL) with a sigmoid output layer
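
To make the contrast concrete, here is a minimal sketch comparing the two losses on a single sigmoid output. The function names `sse_loss` and `cross_entropy_loss` and the example score are illustrative assumptions, not taken from any library or from the course material.

```python
import math

def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def sse_loss(targets, outputs):
    """Sum of squared errors: treats outputs as continuous values."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def cross_entropy_loss(targets, outputs):
    """Binary cross-entropy: treats outputs as probabilities of class 1."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

# A confident wrong prediction: the target is 1 but the output is about 0.01.
targets = [1]
outputs = [sigmoid(-4.6)]
print(sse_loss(targets, outputs))            # can never exceed 1 per example
print(cross_entropy_loss(targets, outputs))  # grows without bound as output -> 0
```

The point of the comparison is that cross-entropy penalizes a confident mistake much more heavily than SSE does.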

Answer 2

- K-Means struggles with non-spherical clusters
- K-Means requires a pre-defined number of clusters
- K-Means may converge to local minima

Answer 3

- Construct a similarity graph
- Compute the graph Laplacian
- Compute the eigenvalues and eigenvectors
- Perform clustering in the reduced eigenspace

Answer 4

This trick maps data into a higher-dimensional space where linear separation is possible

Answer 5

This trick allows spectral clustering to transform the original space, enabling the separation of points that are not linearly separable, such as inner and outer circles

Answer 6

Isomaps approximate geodesic distances, capturing global structure

Answer 7

Advantage:
- Captures global structure effectively, especially for data with a natural manifold shape
Limitations:
- Computationally intensive on large datasets
- Struggles with non-manifold high-dimensional data

Answer 8

Laplacian eigenmaps use spectral graph theory to focus on local neighborhood relationships, preserving local structure

Answer 9

- Build a weighted similarity graph of the data
- Compute the graph's Laplacian matrix
- Perform eigenvalue decomposition to generate a low-dimensional embedding

Answer 10

Local neighborhood structure

Answer 11

- Image Processing
- Clustering in High-Dimensional Data
- Sensor Data Analysis
- Gene Expression Data

Answer 12

Advantage: Efficient for preserving local structures, which is useful for clustering tasks
Limitation: May lose global relationships as it focuses solely on the local level

Answer 13

- To simplify the model and improve interpretability by removing less significant branches
- To reduce overfitting by removing branches that were fit to noise or other training data fluctuations
- To remove branches that provide little to no predictive power

Answer 14

- The goal of pruning is not to balance a tree
- Pruning doesn't guarantee the BEST accuracy
- Pruning does the opposite of increasing tree depth

Answer 15

- The dataset is divided into k subsets (folds), and each fold is used as a validation set once, while the remaining k - 1 folds are used for training
- Testing on multiple validation sets helps to identify overfitting or underfitting
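
The fold bookkeeping can be sketched as follows. This is a minimal illustration that assumes no shuffling and contiguous folds; `k_fold_splits` is a hypothetical helper name, not a library function.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Each index lands in exactly one validation fold and in k - 1 training sets.
for train, validation in k_fold_splits(6, 3):
    print("train:", train, "validation:", validation)
```

In practice the indices would usually be shuffled first; that step is left out here to keep the split logic visible.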

Answer 16

Cross-validation is a general technique applicable to both regression and classification tasks

Answer 17

Neural networks are often referred to as "black-box" models because it is difficult to directly interpret how they arrive at predictions, especially as the number of layers and neurons increases

Answer 18

Overfitting occurs when a neural network learns the training data too well, including noise and irrelevant patterns, leading to poor generalization on unseen data

Answer 19

While modern optimizers like stochastic gradient descent (SGD) with momentum or Adam help mitigate this, getting stuck in poor local minima or saddle points can still be a problem, especially for non-convex loss functions

Answer 20

The vanishing gradient problem occurs when gradients become very small during backpropagation, particularly in deep networks with activation functions like sigmoid or tanh. This slows down learning or even halts it in early layers

Answer 21

Support vector calculations are specific to Support Vector Machines (SVMs), a different type of machine learning algorithm

Answer 22

While collinearity among features can affect simpler models like linear regression, neural networks can often learn to handle correlated inputs effectively due to their ability to model complex relationships

Answer 23

- It’s often sensitive to irrelevant or redundant features because instance-based learning relies on similarity measures, which can be skewed by outliers
- New instances are classified based on similarity measures
- The model "memorizes" the training instances

Answer 24

- Ensemble learning does not focus on handling missing values
- Ensemble training may actually increase training times because multiple models need to be trained
- Ensemble models are often less interpretable because they combine multiple models, making them very complex

Answer 25

Kernel methods enable SVMs to find a separating hyperplane in a higher-dimensional transformed space when the data is not linearly separable in the original space

Answer 26

Kernel methods don't address missing data, but preprocessing can do so

Answer 27

Kernel methods don't affect the support vectors. That depends on the data and the model's parameters

Answer 28

Kernel methods allow SVMs to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space where separation IS possible

Answer 29

Kernel methods actually often increase the computational complexity because they require pairwise comparisons of ALL data points through the kernel function

Answer 30

PAC is agnostic to the feature space and focuses on the hypothesis space, error bounds, and learning guarantees

Answer 31

The confidence parameter represents the probability that the chosen hypothesis will not meet the desired accuracy

Answer 32

Probably Approximately Correct

Answer 33

Not requiring the true error to be zero, only that it be bounded by some constant that can be made arbitrarily small

Answer 34

Requiring that the learner's probability of failure be bounded by some constant that can be made arbitrarily small

Answer 35

PAC assumes there exists some hypothesis space from which all possible models/functions can be selected by the learner to approximate the target

Answer 36

Linear classifiers in 2D can shatter certain configurations of 3 points (such as a triangle), but no configuration of 4 points. This makes the VC Dimension of linear classifiers in a 2D space 3.

Answer 37

They aren't equivalent. Some models with nonlinear hypothesis spaces can have a VC Dimension greater than the number of features.

Answer 38

A higher VC Dimension means it can create more complex patterns, which means it can overfit to the given training data.

Answer 39

A higher likelihood means that the data is more probable given the hypothesis, but the overall probability of the hypothesis depends on the prior

Answer 40

Some hypotheses explain the data better than others, resulting in different likelihood values

Answer 41

The likelihood P(Data | Hypothesis) is different from the prior P(Hypothesis). The prior represents the probability of the hypothesis before seeing the data, while the likelihood measures how well the data supports the hypothesis.

Answer 42

MAP is the maximum a posteriori hypothesis: the most probable hypothesis given the data. h_MAP = argmax_(h in H) (Pr(h | D))

Answer 43

ML is the maximum likelihood hypothesis, which equals the MAP hypothesis when all hypotheses are equally likely a priori. h_ML = argmax_(h in H) (Pr(D | h))
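
A small sketch of how the two estimates can disagree. The hypothesis names and probabilities are made-up numbers for illustration only, and `h_map` / `h_ml` are hypothetical helper names.

```python
def h_map(priors, likelihoods):
    """Pick the h maximizing Pr(D | h) * Pr(h), which is proportional to Pr(h | D)."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

def h_ml(likelihoods):
    """Pick the h maximizing Pr(D | h) alone."""
    return max(likelihoods, key=lambda h: likelihoods[h])

# Hypothetical numbers chosen only for illustration.
likelihoods = {"h1": 0.6, "h2": 0.9}
skewed_priors = {"h1": 0.8, "h2": 0.2}
uniform_priors = {"h1": 0.5, "h2": 0.5}

print(h_ml(likelihoods))                   # h2
print(h_map(skewed_priors, likelihoods))   # h1
print(h_map(uniform_priors, likelihoods))  # h2
```

With the skewed prior, 0.6 * 0.8 beats 0.9 * 0.2, so MAP prefers h1 even though h2 has the higher likelihood. With a uniform prior the two estimates coincide.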

Answer 44

Gain(S, A) = Entropy(S) - Σ_(v in Values(A)) ((|S_v| / |S|) * Entropy(S_v))
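
The gain formula can be checked numerically with a short sketch. The toy labels and attribute values below are invented for illustration, and entropy is measured in bits (log base 2), which is an assumption about the intended units.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(information_gain(labels, ["a", "a", "b", "b"]))  # attribute separates the classes
print(information_gain(labels, ["a", "b", "a", "b"]))  # attribute tells us nothing
```

An attribute that splits the labels perfectly recovers the full 1 bit of entropy, while an uninformative attribute yields a gain of 0.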

Answer 45

Bias that automatically occurs when we decide our hypothesis set

Answer 46

Bias that tells us what sort of hypotheses from our hypothesis space we prefer

Answer 47

Ones with good splits at the top because it means less traversal on average

Answer 48

Bagging is when we sample subsets of the data uniformly at random (with replacement), then combine the results of the various weak learners trained on them. Bagging is the shortened name for Bootstrap Aggregation.
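
A minimal sketch of the idea, assuming sampling with replacement and averaging as the combination step. The "learner" here is a deliberately trivial stand-in that predicts the mean of its sample; all names are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_mean(sample):
    """Trivial stand-in learner: always predicts the mean of its sample."""
    mean = sum(sample) / len(sample)
    return lambda x: mean

def bagged_predict(learners, x):
    """Combine the learners by averaging their predictions."""
    return sum(learner(x) for learner in learners) / len(learners)

rng = random.Random(0)
data = [1.0, 2.0, 3.0, 4.0, 100.0]
learners = [fit_mean(bootstrap_sample(data, rng)) for _ in range(50)]
print(bagged_predict(learners, x=None))
```

Each learner sees a different resampling of the data, so an outlier such as 100.0 influences some learners more than others, and the average smooths those differences.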

Answer 49

Overfitting. Overfitting a subset will not overfit the overall dataset, and the average will "smooth out" the specifics of each individual learner.

Answer 50

Boosting is focused on data that previous learners struggled with in order to form a cohesive picture of the entire dataset.

Answer 51

- Initialize all training examples to be equally weighted
- In each round, fit a weak learner to the currently weighted examples
- Raise the weights of every example that learner misclassified
- Combine all the weak learners into a final learner composed of their weighted average

Answer 52

A weak learner is a model that performs better than chance for any distribution of data

Answer 53

If there is any distribution for which a set of hypotheses can't do better than random chance, then there is no way to create a weak learner.

Answer 54

- Perceptron Rule (threshold)
- Gradient Descent / Delta Rule (unthresholded)

Answer 55

activation = Σ(w_i * x_i) + b where b is some bias constant

Answer 56

If the given data is linearly separable, the Perceptron Rule will find the dividing halfplane in a finite number of iterations

Answer 57

- The target y
- The output ŷ
- The learning rate (how much we adjust)
- The input

Answer 58

When both match, we see no change in the weight because there is no error

Answer 59

When the error (target minus output) is positive, that means our output is too small, so we need to increase the weight according to the learning rate and given input. When the error is negative, that means our output is too large, so we need to decrease the weight according to the learning rate and given input.
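
The update can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the threshold convention (output 1 when the activation is at least 0), the learning rate of 1.0, and the AND-gate data are all choices made for the example.

```python
def predict(weights, bias, x):
    """Thresholded output: 1 if the activation is non-negative, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Apply the perceptron rule until a full pass makes no weight change."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(max_epochs):
        changed = False
        for x, target in samples:
            error = target - predict(weights, bias, x)  # positive: output too small
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                changed = True
        if not changed:  # zero change everywhere: the data has been separated
            break
    return weights, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop stops after a finite number of passes. On data that is not linearly separable it would run until `max_epochs`.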

Answer 60

We stop the Perceptron Rule once the change in every weight returns 0 for all training examples (this means we've found a plane that separates the data)

Answer 61

Whether the data is/is not linearly separable. If it is, use the Perceptron Rule. If not, use Gradient Descent.

Answer 62

It's not differentiable. There is no way to take the derivative at a discontinuous point. Essentially, the function spikes in value after we meet the activation threshold and causes the function to be discontinuous.

Answer 63

As x approaches negative infinity, e^(-x) becomes infinitely large (x = -100 --> e^100), so the function approaches 0 because the denominator is basically infinite. As x approaches infinity, e^(-x) becomes infinitely small (x = 100 --> e^-100), so the function approaches 1 because the denominator is basically 1.

Answer 64

a(x) = 1 / (1 + e^(-x))
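
A short sketch of this function and its limiting behavior. The two-branch form is a numerical-stability choice made for this example (it avoids overflow in `math.exp` for very negative x), and the derivative a(x) * (1 - a(x)) is the standard identity for the sigmoid.

```python
import math

def sigmoid(x):
    """a(x) = 1 / (1 + e^(-x)), arranged to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # equivalent form: e^x / (1 + e^x)
    return z / (1.0 + z)

def sigmoid_derivative(x):
    """a(x) * (1 - a(x)): defined for every x, unlike a hard threshold."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-100, 0, 100):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The output tends toward 0 for large negative x and toward 1 for large positive x, and the derivative is largest at x = 0.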

Answer 65

On ALL of the activation functions being differentiable. We cannot propagate the error information back up the layers if the functions cannot be differentiated

Answer 66

It is the process of calculating the error between output and actual, then adjusting weights from the "bottom" of the network back up to the top

Answer 67

Error back propagation

Answer 68

- Activation functions need to be differentiable
- Can get stuck in local optima

Answer 69

As we combine more and more functions, the "landscape" of that combined error function becomes more complex and introduces more optima other than the best answer, the global minimum

Answer 70

- Increasing the number of nodes
- Increasing the number of layers
- Increasing the size of the weights

Answer 71

With a sufficiently complex network, there is no hypothesis we will not consider

Answer 72

Simpler, more correct network configurations

Answer 73

- Points closer to each other are more similar to each other
- We are expecting functions to act smoothly
- We assume ALL features have equal importance

Answer 74

As we add more features to our data, the amount of data needed to generalize accurately grows exponentially

Answer 75

When you train multiple instances of the model on different subsets of the training data and then combine all of those resulting models into a single model

Answer 76

The bigger the margin, the less overfitting done by the SVM because it commits less to the specific data it was trained on

Answer 77

The points from the input data that are necessary for defining the largest margin

Answer 78

Projecting our currently given, non-linearly separable data into a higher dimensional space so that it becomes linearly separable

Answer 79

The kernel trick is accomplished by turning the dot product into some other similarity metric. This gives us a way to embed domain knowledge.
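
One way to see this is with a quadratic kernel on 2-D points. The sketch below, with hypothetical helper names, checks that squaring the dot product in the original space gives the same number as an ordinary dot product after an explicit map into 3-D.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit map from 2-D to 3-D: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Quadratic kernel computed entirely in the original 2-D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y))    # similarity without ever building phi
print(dot(phi(x), phi(y)))  # same value (up to rounding) in the 3-D space
```

The kernel never constructs the higher-dimensional vectors, which is what makes the projection "implicit".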

Answer 80

The Mercer condition: a kernel function is valid if and only if:
- It is symmetric (K(x, y) = K(y, x))
- It is positive semi-definite (c^T * K * c >= 0; essentially, any vector we put in gives out a positive number or zero as the answer)

Answer 81

Over time, Boosting results in the SVMs becoming more confident in their classifications so we actually don't end up seeing any issues with overfitting

Answer 82

A model that produces a candidate hypothesis which is equal to the true hypothesis for every data point in the training set

Answer 83

A version space is ε-exhausted if and only if (iff) for every candidate hypothesis in the version space, the error of those hypotheses is less than or equal to ε where 0 <= ε <= 1/2

Answer 84

With temperature. As temperature increases, simulated annealing is willing to take initially worse steps, hoping that they will eventually lead to a better, or global, optimum

Answer 85

With hill climbing. If we are solely focused on exploitation, then we reduce the temperature as close to 0 as possible so that we're only attempting steps that will improve our fitness function score
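
The role of temperature can be sketched with the usual acceptance rule for a maximization problem: always accept an improvement, and accept a worse step with probability exp((candidate - current) / T). The function name is hypothetical and the fitness values are arbitrary.

```python
import math

def acceptance_probability(current_fitness, candidate_fitness, temperature):
    """Probability of moving to the candidate when maximizing fitness."""
    if candidate_fitness >= current_fitness:
        return 1.0  # improvements are always taken
    return math.exp((candidate_fitness - current_fitness) / temperature)

# The same worse step (fitness 5 -> 4) at decreasing temperatures.
for temperature in (100.0, 1.0, 0.01):
    print(temperature, acceptance_probability(5.0, 4.0, temperature))
```

At high temperature the worse step is accepted almost always (exploration). As the temperature nears 0 the probability collapses toward 0, which is the hill-climbing behavior described above.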

Answer 86

That the same process for forging metals by repeatedly changing the ordering of molecules via heating and pressing until a viable combination is achieved can be used for randomized optimization

Answer 87

- Initialize each data point as its own cluster
- Find the two clusters with the smallest distance between any two points belonging to those clusters
- Merge those two clusters into a single cluster
- Repeat steps 2 and 3 until only the defined number of clusters remains
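
These steps can be sketched directly. This is a brute-force illustration that assumes Euclidean distance and a target cluster count `k`; it is not tuned for large datasets.

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # step 1: every point is its own cluster

    def cluster_distance(a, b):
        # Distance between the closest pair of points across the two clusters.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(single_linkage(points, 2))
```

Ties are broken by taking the first closest pair found, which is one possible convention among several.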

Answer 88

- Randomly initialize k centers
- Each center "claims" its closest points
- Re-compute the center of each cluster by averaging the clustered points
- Repeat steps 2 and 3 until the assignments stop changing (convergence)
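
A compact sketch of this loop, assuming Euclidean distance, centers initialized from randomly chosen data points, and a stop condition of unchanged assignments. The function name and the `rng` parameter are choices made for this example.

```python
import math
import random

def k_means(points, k, rng, max_iterations=100):
    """Alternate between assigning points and re-averaging centers."""
    centers = rng.sample(points, k)  # step 1: random initial centers
    assignment = None
    for _ in range(max_iterations):
        # Step 2: each point goes to its closest center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # nothing moved, so the centers would not change either
        assignment = new_assignment
        # Step 3: move each center to the mean of the points it claimed.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers, assignment = k_means(points, 2, random.Random(0))
print(centers, assignment)
```

The result depends on the initial centers, so on less cleanly separated data different seeds can settle into different local optima.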

Answer 89

- Monotonically non-decreasing likelihood (creates scores that never get lower)
- Does not converge (but gets close)
- Will not diverge
- Can get stuck in local optima
- Works with any distribution (so long as you can define "expectation" and "maximization")

Answer 90

There is some finite number of neighbors. As long as you have a means of always breaking ties, you will eventually converge on some configuration

Answer 91

- Richness: the idea that we could achieve ANY cluster configuration available
- Scale Invariance: the idea that rescaling all distances by a positive constant doesn't change the clustering, since the relative positions stay the same
- Consistency: the idea that shrinking intracluster distances and expanding intercluster distances doesn't change the clustering

Answer 92

There is no clustering algorithm that can achieve all three properties: richness, scale-invariance, and consistency

Answer 93

- Filtering: we take the given features and narrow the list to a subset before handing that subset to some learning algorithm
- Wrapping: the search for features is wrapped around our learning algorithm as we continuously give it a subset, assess its performance, and adjust which features it's given the next time

Answer 94

+ Generally increased speed
- Ignores the learning problem

Answer 95

+ Takes into account model bias
+ Takes into account the learning problem
- SOOOOOO slow

Answer 96

- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)

Answer 97

- It finds the direction of maximum variance for some given set of training data
- It finds directions that are mutually orthogonal (perpendicular to each other)

Answer 98

- It is an eigen problem
- It gives you the best reconstruction (lowest reconstruction error) of the data
- Each principal component returns its eigenvalue (a non-negative value)

Answer 99

If the eigenvalue is 0, then we can ignore that component (feature)

Answer 100

- Constructs new features that are mutually independent of each other
- Aims for the maximum mutual information between the original features and the constructed features

Answer 101

ICA is highly directional. PCA doesn't care about being given the original or transposed features because they're all just rotations of data

Answer 102

- RCA (Random Component Analysis)
- LDA (Linear Discriminant Analysis)

Answer 103

It is FAST, so we can run many, many iterations of it

Answer 104

RCA generates a random direction for projection, and works quite well if used for classification

Answer 105

LDA finds a projection that discriminates based on the given labels

Answer 106

Entropy captures the amount of information contained in a random variable

Answer 107

Joint Entropy is a measure of randomness contained in two variables together

Answer 108

A measure of the reduction of randomness of a variable, given knowledge of another variable

Answer 109

Mutual Information is a special case of KL Divergence: it is the KL Divergence between the joint distribution of two variables and the product of their marginal distributions, so it measures how much information the two variables share

Answer 110

KL Divergence is a measure of the difference between two distributions, quantifying the information lost when one distribution is used to approximate another
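
A brief sketch of KL divergence for discrete distributions, with mutual information computed as the KL divergence between a joint distribution and the product of its marginals. Values are in bits (log base 2), which is an assumption about units, and the example distributions are invented.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X; Y) as the KL divergence between the joint and the product of marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]
    return kl_divergence(flat_joint, flat_product)

p, q = [0.5, 0.5], [0.25, 0.75]
print(kl_divergence(p, q), kl_divergence(q, p))         # not symmetric
print(mutual_information([[0.25, 0.25], [0.25, 0.25]])) # independent variables
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))     # fully dependent variables
```

Independent variables share no information, so their mutual information is 0, while two perfectly coupled fair coin flips share 1 bit.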

Answer 111

Conditional Entropy is a measure of the randomness of one variable given another variable