Midterm Exam Flashcards

1
Q

What is the definition of ML that Dr. Isbell prefers?

A

ML is about using math, computation, engineering (among other things) to build computational artifacts that learn over time.

2
Q

What is inductive reasoning?

A

Reasoning that goes from specifics –> generalities

3
Q

What is deductive reasoning?

A

Applying general rules to draw specific (logical) conclusions

4
Q

At a high-level, what is supervised learning considered an example of?

A

Function approximation. It’s the process of inductively learning a general rule from observations.

5
Q

What is unsupervised learning all about?

A

It’s about DESCRIBING data. In unsupervised ML we’re only given “inputs”, so the objective is to learn whether there is some latent structure in the data that we can use to describe the data in a more efficient (i.e. more concise) way. Clustering is a common type of unsupervised ML.

6
Q

What are two types of supervised machine learning?

A
  1. Classification (Discrete output values)
  2. Regression (Continuous output values)
7
Q

In a Decision Tree (DT) model, nodes are _______ and edges are _____?

A

Attributes, Values

8
Q

What is the decision tree algorithm?

A
  1. Pick the best attribute (where best splits the data roughly in half)
  2. Ask the question
  3. Follow the answer path
  4. Go to 1 and repeat until we arrive at an answer (a minimal sketch of this loop follows below)
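
A rough ID3-style sketch of that loop in Python. This is my own illustration, not the lecture’s code: it assumes discrete attributes and takes “best attribute” to mean highest information gain.

```python
# Illustrative only. `rows` is a list of dicts mapping attribute name -> value;
# `labels` is the matching list of class labels.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    def gain(a):
        # Expected entropy remaining after splitting on attribute a.
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:            # everything has the same label: leaf
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)   # 1. pick the best attribute
    tree = {a: {}}                                 # node = attribute
    for v in set(r[a] for r in rows):              # edges = values
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        tree[a][v] = id3([rows[i] for i in idx],   # 4. recurse on the subset
                         [labels[i] for i in idx],
                         [b for b in attributes if b != a])
    return tree
```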
9
Q

What is the complexity of XOR?

A

O(2^n) [i.e. exponential, NP-Hard]

10
Q

What is the complexity of OR?

A

O(n) [i.e. linear]

11
Q

What are the two types of biases we worry about when searching through a hypothesis space?

A
  1. Restriction bias: how restrictive the function space we’ve chosen is. Given that there are an infinite number of ways of representing a function, the restriction bias of a decision tree algorithm limits it to the space of boolean functions.
  2. Preference bias: which hypotheses within the space we prefer. THIS IS AT THE HEART OF INDUCTIVE BIAS. (Because we have a preference for hypotheses that fit the data well.)
12
Q

What are some of the inductive biases of the ID3 algorithm for decision trees?

A
  1. Preference for good splits at the top of the tree as opposed to the bottom of the tree (because we build from the top of the tree down)
  2. Prefers models that fit the data well (i.e. prefers correct over incorrect)
  3. Tends to prefer shorter trees over taller trees (this is just a natural corollary that stems from the fact that we prefer trees that perform better splits at the top)
13
Q

According to Mitchell, what three features must be present in order to have a well-posed learning problem?

A
  1. The class of tasks
  2. The measure of performance to be improved
  3. The source of experience
14
Q

Historically, how did we end up with the term “regression”?

A

The idea of regression to the mean, e.g. taller- or shorter-than-average parents tend to have children whose heights ‘regress’ back toward the mean.

15
Q

What are some examples of where error comes from?

A
  1. Sensor Error
  2. Malicious/adversarial data
  3. Transcription error
  4. Un-modeled influences
16
Q

What is one of the fundamental assumptions that we make in most supervised learning algorithms?

A

That the data in the training set are REPRESENTATIVE of what we expect in the future. If this isn’t the case, then our model won’t GENERALIZE, which is what we really care about as ML practitioners. More formally, the general assumption is that the data are I.I.D. (Independent and Identically Distributed); that is, the process that generated the training data is the same process that is generating the test data (in fact, any future data!).

17
Q

In the context of regression, the best constant in terms of the squared error is the _____?

A

mean

18
Q

Describe cross-validation and why we use it?

A

We split the training data into k folds. Then we train on k-1 of the folds and use the remaining fold as a validation set (i.e. a stand-in for the test set). We repeat this for each of the k choices of held-out fold and average the validation error across all of them. The best model is then the one with the lowest average error.

We use cross-validation to avoid overfitting the model. Since we should have no access to the test set when developing our model, using cross-validation improves generalization.
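
A minimal sketch of k-fold cross-validation in Python. Here `train(X, y)` and `error(model, X, y)` are assumed stand-ins for whatever fitting and error functions you’re using; X and y are numpy arrays.

```python
import numpy as np

def cross_val_error(X, y, train, error, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[trn], y[trn])                    # fit on k-1 folds
        errs.append(error(model, X[val], y[val]))        # validate on the rest
    return np.mean(errs)                                 # average validation error
```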

19
Q

Logical AND is expressible as a perceptron unit? (True/False)

A

True
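
To make this concrete, here is one illustrative choice of weights and thresholds (many others work), expressed as a tiny Python check; `perceptron`, `w`, and `theta` are my own names, not the lecture’s.

```python
# A perceptron fires iff the weighted sum of its inputs crosses the threshold.
def perceptron(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert perceptron(x, w=(1, 1), theta=1.5) == (x[0] and x[1])  # AND
    assert perceptron(x, w=(1, 1), theta=0.5) == (x[0] or x[1])   # OR
assert perceptron((0,), w=(-1,), theta=-0.5) == 1                 # NOT 0
assert perceptron((1,), w=(-1,), theta=-0.5) == 0                 # NOT 1
```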

20
Q

Logical OR is not expressible as a perceptron unit? (True/False)

A

False

21
Q

Logical NOT is expressible as a perceptron unit? (True/False)

A

True

22
Q

Logical XOR is not expressible as a perceptron unit? (True/False)

A

False

23
Q

For perceptron network training, what is the difference between the “perceptron rule” and the “gradient descent” rule?

A

The perceptron rule uses thresholded output values while gradient descent uses the UNthresholded values.
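
In symbols (following Mitchell’s presentation, with learning rate η, target y, and input x):

```latex
% Perceptron rule: uses the thresholded output \hat{y}
\Delta w_i = \eta\,(y - \hat{y})\,x_i,
\qquad
\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} \ge 0 \\ 0 & \text{otherwise} \end{cases}

% Gradient descent / delta rule: uses the unthresholded activation a
\Delta w_i = \eta\,(y - a)\,x_i,
\qquad
a = \mathbf{w}\cdot\mathbf{x}
```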

24
Q

If the data are linearly separable, will the perceptron rule find the hyperplane that separates them in a finite number of iterations?

A

Yes

25
Q

Why can we not a priori know whether a dataset is linearly separable (at least in non-trivial cases)?

A

Because while the perceptron rule is guaranteed to converge on a separating hyperplane in a finite number of iterations if the data are linearly separable, we have no way of knowing how long “finite” is. Maybe the error finally goes to zero, but maybe it doesn’t. It’s really a halting problem.

26
Q

The perceptron training rule is more robust against non-linearly separable data than gradient descent is? (True/False)

A

False.

27
Q

The perceptron rule does not have the theoretical guarantee of finite convergence? (True/False)

A

False

28
Q

What assumption is made for the theoretical guarantee of finite convergence for the perceptron rule?

A

That the data are linearly separable.

29
Q

What convergence guarantee does gradient descent offer?

A

Convergence to a local optimum. This is in contrast to the perceptron rule which does offer the guarantee of convergence to the global optimum in finite time, but only in the case of linearly separable data. Gradient descent is more ROBUST to non-linearly separable data than the perceptron rule.

30
Q

Since gradient descent is more robust than the perceptron rule, why not just use GD on the weight update for the perceptron rule also?

A

Because the perceptron rule uses the THRESHOLDED output values, and the thresholding operation isn’t smooth (it has a discontinuity at the point of thresholding), hence it is not differentiable, which means we can’t use the calculus required for GD.

31
Q

What is the domain for the sigmoid function? How does the output behave at the limits?

A

The domain is from -infinity to +infinity. As the input goes to -infinity, the output goes to 0; conversely, as the input goes to +infinity, the output asymptotically approaches 1.

32
Q

What is the derivative of the sigmoid function?

A

Let ‘a’ := the activation and S(a) := the sigmoid function. Then the derivative dS(a)/da = S(a)*(1 - S(a))
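
Written out, with the intermediate step of the derivative:

```latex
S(a) = \frac{1}{1 + e^{-a}},
\qquad
\frac{dS(a)}{da} = \frac{e^{-a}}{\left(1 + e^{-a}\right)^2} = S(a)\,\bigl(1 - S(a)\bigr)
```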

33
Q

What is backpropagation?

A

It’s a computationally beneficial organization of the chain rule from calculus that allows us to calculate how the weights should move to minimize the error of the output from a neural network.

34
Q

What are some of the more advanced methods that are sometimes beneficial to help mitigate the tendency for gradient descent to get stuck in local minima?

A
  1. Momentum terms (one common form is sketched below)
  2. Using higher-order derivatives (the analogy would be adding error “velocity/acceleration” terms, to borrow from the physics of motion)
  3. Randomized optimization
  4. A penalty for “complexity”, where complexity might come from too-large network STRUCTURE (in terms of breadth and/or depth) or from weights that are excessively large (in the latter case we typically refer to this complexity penalty as “regularization”)
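
As a sketch of the first idea, one common form of the momentum update (α is the momentum coefficient, η the learning rate):

```latex
v_{t+1} = \alpha\, v_t - \eta\, \nabla E(w_t),
\qquad
w_{t+1} = w_t + v_{t+1}
```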
35
Q

What is the ‘Restriction Bias’

A

It relates to the representational power of a learner by limiting the set of hypotheses that we will consider as valid.

36
Q

What is the hypothesis space of a single perceptron unit?

A

Half-spaces (i.e. a hyperplane that perfectly separates data [although only in the case where the data is linearly separable!])

37
Q

What is the restriction bias of a simple network of threshold-like units?

A

Boolean functions

38
Q

If we add a single fully connected hidden layer that includes sigmoided activation functions, what is the restriction bias of the network?

A

Continuous functions

39
Q

If we stitch together two or more fully connected hidden layers that include smooth (i.e. differentiable) activation functions, what is the restriction bias of the network?

A

Arbitrary functions

40
Q

A network with two hidden layers containing 50 neurons each can represent any arbitrary function? (True/False)

A

False. While a finite neural net with at least two hidden layers can theoretically model an arbitrary function, we don’t necessarily know (in fact, in almost every case we don’t) how big a “sufficiently big” network must be to model the function of interest.

41
Q

If the error on the validation set increases while the error on the training set continues to decrease, what situation has occurred?

A

We’ve OVERFIT the training data. The model is no longer capable of generalizing.

42
Q

What is ‘Preference Bias’?

A

When a given algorithm is choosing between one representation and another, its preference bias determines which of those representations it is more likely to choose.

43
Q

Say while training a NN, a single weight is growing exponentially, becoming much larger than all the other weights in the network. What situation is likely occurring?

A

We’re likely overfitting the dataset. When one weight grows much larger than all the others, the model has probably homed in on a single feature of the data that is highly correlated with the outputs in the training dataset, but it is unlikely to generalize to new data.

44
Q

What are some of the preference biases of a Neural Network?

A

Initialization of weights to small random values results in a preference for - ceteris paribus - models with lower complexity (i.e. if most of the weights can stay small and still model the problem sufficiently well, then the model is going to prefer this situation)

45
Q

Which is preferable: a restriction bias or preference bias?

A

Preference bias. At least in that case, we can be assured that the hypothesis space contains our target function of interest.

46
Q

Where does the inductive bias of the ID3 algorithm for decision trees come from?

A

Its search strategy. (This makes it prefer shorter trees over taller trees, and tends to place attributes with the highest information gain towards the top of the tree.)

47
Q

In terms of computational complexity, learning for linear regression is _______ and querying ______?

A

Expensive | Cheap

48
Q

In terms of computational complexity, learning for K-NN is _______ and querying ______?

A

Cheap | Expensive

49
Q

What is the computational complexity (run time and space) for KNN vs linear regression?

A

Roughly (for n sorted data points): k-NN does essentially no work at learning time beyond storing the data (O(n) space), but pays at query time (e.g. O(log n + k) per query). Linear regression pays O(n) at learning time to fit its parameters, but then queries are O(1) in both time and space.

50
Q

What are some of the preference biases of the KNN algorithm?

A
  1. Locality: nearer points are similar
  2. Smoothness: expectation that averaging data points works (i.e. the underlying function generating the data is smooth)
  3. Equality: all features matter equally
51
Q

What is the “Curse of Dimensionality”?

A

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

52
Q

The curse of dimensionality only applies to the KNN algorithm? (True/False)

A

False. It’s true for any ML algorithm: as the number of features grows, we need exponentially more data to cover the space in order to generalize.

53
Q

What is the difference between ‘eager’ and ‘lazy’ learners? What are some examples of each?

A

A lazy learner puts off learning as long as possible; eager learners solve the problem as soon as it is posed and save the result as some sort of parametric function that can be used later to quickly evaluate new data.

KNN is an example of a lazy learner. Linear regression is an example of an eager learner.

54
Q

KNN can only handle classification problems?

A

False, it can handle both classification and regression. In the case of regression, it could (as one example) simply average the k nearest points to get a continuous output.
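
As an illustrative sketch of that averaging idea (the names are my own; assumes numpy arrays):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to each point
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest
    return y_train[nearest].mean()                     # average -> continuous output
```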

55
Q

How does locally weighted KNN work and what does it allow us to do?

A

Locally weighted KNN allows us to represent more complex non-linear functions. It does this by approximating data locally as piece-wise functions. By stacking up these local functions we can generate non-linear outputs.

56
Q

How does ensemble learning work?

A
  1. Take a subset of the data and create individual learners that learn a rule for that subset
  2. Combine all the individual learners into an ensemble
57
Q

What are the two important ideas in boosting?

A
  1. Select data subsets based on their difficulty
  2. Use a weighted mean instead of simple average
58
Q

What is the definition of a “weak learner”?

A

It’s a learner that, no matter what distribution the data are drawn from, outputs a prediction that is better than random chance by at least some small amount ε.
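
Formally (for some ε > 0, over every distribution D):

```latex
\forall \mathcal{D}: \quad \Pr_{\mathcal{D}}\bigl[\, h(x) \neq c(x) \,\bigr] \;\le\; \frac{1}{2} - \epsilon
```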

59
Q

Define ‘Inductive Learning’ (per the computational learning theory lecture)

A

It’s the process of “Learning from Examples”. The quantities that matter when analyzing an inductive learning problem include:

  1. Probability of successful training (i.e. 1 - δ, where δ is the probability of failure)
  2. Number of examples to train on (i.e. m)
  3. Complexity of the hypothesis class H
  4. Accuracy to which the target concept is approximated (often denoted ε)
  5. Manner in which training examples are presented (batch vs. online)
  6. Manner in which training examples are selected
60
Q

What are the two main ways that examples can be presented to a learning algorithm?

A
  1. Batch: we get a set of training data, and it’s handed over to the learner all at once to learn from
  2. Online: The learner is presented with samples one at a time, predicts a label, and then gets feedback from the algorithm/oracle about whether that prediction was correct or not
61
Q

Describe the relationship between Learners and Teachers. How do they interact with one another?

A
  1. Learner asks questions of the Teacher - i.e. the Learner chooses some x and the Teacher answers with c(x)
  2. Teacher gives examples to help the Learner (i.e. the Teacher chooses x and tells c(x) to the Learner)
  3. Fixed distribution, i.e. x is chosen from D by nature (Dr. Isbell notes that this is mostly what we’ve talked about up until this point)
  4. Evil distribution - i.e. some sort of adversarial approach
62
Q

Random thoughts to consider: in the lectures discussing computational learning theory, one of the quizzes asks about a game like ‘20 Questions’, where we have a hypothesis class H containing the set of possible people and X questions we can ask. How many questions does the learner need to ask (if the learner chooses the questions) to figure out the correct h? Dr. Isbell says log2(|H|), because we can, on average, eliminate half the hypotheses with each question we ask. Makes sense from an information-theoretic perspective. But this is predicated on the assumption that we can ask “good” questions, and it doesn’t really say anything about what makes a question “good”. What distinguishes good questions from bad questions?

A

A naive response would be that it’s one that eliminates roughly half the answers. But that’s tautological. It seems to me that domain knowledge is really implicit here. In the case where H is a set of people, we naturally have background knowledge that admits a lot of binary questions (at least from an information-theoretic perspective, if not exactly politically correct): is the person male/female? Are they older than 40? Etc. Not only that, we also have really strong priors we can derive from this background knowledge: we know that the world population is roughly 50/50 male/female, and human life expectancy gives us a good idea of how to split into roughly equal groups.

But what about the case where we’re much less certain (or have no information at all) about what H contains? I guess, since we’re human, the most reasonable place to start would be asking questions based on the prevalence of some attribute that H is most likely to contain - questions like “Is h in the space of people?”, etc. It’s tough to swallow as things become more abstract, though. What if H is the set of all atoms in the universe, and the answer we’re looking for is something ridiculous like “the 3rd atom in Saturn’s rings”? (Interesting to note that the first example that jumped to my mind was Saturn: I’ve also just shown my own preference bias for hypotheses in the set of things in our solar system.)

I don’t know exactly where I’m going with this; I guess it comes down to the problem of a potentially infinite hypothesis space, which I’m sure we’ll discuss in the lectures. Anyway, it’s interesting stuff to think about.

63
Q

Why do constrained queries make things so hard for learners?

A

Because we can only ask things that amount to data points, and not the thing we’re actually interested in. Think about the case of bit strings and a hypothesis class consisting of conjunctions of literals (possibly negated). What we’d really like to ask is things like: is x1 in the formula? But the constraint forces us to ask only questions where the inputs take on specific binary values. This means it could take a huge number of samples before we finally get a positive result; the negative results simply don’t tell us much. In the worst case it could take 2^k queries to find the actual answer.

64
Q

What is ‘Version Space’?

A

Let S be a training set of data drawn from X. The version space is simply the set of all candidate hypotheses h such that h is consistent w.r.t. the data S that it has seen thus far. More succinctly, it’s the “set of hypotheses consistent with the examples”

65
Q

In the context of PAC learning, what are the two types of error?

A

Training error and True Error

66
Q

What is training error?

A

It is the fraction of training examples misclassified by hypothesis h

67
Q

What is the ‘True Error’?

A

The fraction of examples that would be misclassified on samples drawn from D, in the limit of infinitely many samples.

68
Q

What does this formula represent:

A

The True Error, i.e. the fraction of examples that would be misclassified by some hypothesis h(x) compared to the true hypothesis (i.e. concept) c(x)
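
In standard notation, the formula being described is:

```latex
\mathrm{error}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}\bigl[\, c(x) \neq h(x) \,\bigr]
```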

69
Q

What is the definition of PAC Learning (“Probably Approximately Correct”)?

A

Concept class C is learnable by learner L using hypothesis space H if and only if the learner will, with probability 1 - δ, output a hypothesis h in H such that the error(h) <= ε in time and samples polynomial in 1/ε, 1/δ, and n (where n is the size of the hypothesis space).

70
Q

Mathematically, what do the words “Probably Approximately Correct” represent?

A

Probably := 1 - δ [i.e. the certainty goal]

Approximately := ε [i.e. the error goal]

Correct := h(x)=c(x) [i.e. the chosen hypothesis is equal to the actual concept hypothesis]

71
Q

At a high-level, describe what PAC Learning is?

A

Something is considered PAC learnable if the learner L can find a hypothesis that has low true error and can be learned in polynomial time as a function of 1/ε, 1/δ and n (the size of the hypothesis space).

72
Q

Is the concept in the image below PAC learnable?

A

Yes. We just keep track of the version space (the hypotheses that are consistent with the data) and then - since we don’t have any additional information to suggest otherwise - we should simply pick uniformly from that version space.

My original guess on the answer (it’s wrong per the lectures)

I’m going to go with ‘No’. So first let’s remember what it means for something to be PAC learnable: it means the learner has to be able to find a hypothesis h in H that has a low, bounded true error and that can be learned in polynomial time as a function of the error rate, certainty goal, and hypothesis space size.

Well first off, we can see right away that our hypothesis space is going to grow as a function of k. But that’s constant for a fixed k, so we at least tick that box. The error bounds seem more of an issue to me. Think about the case where we increased the length of k for each new sample:

0

01

010

1000

73
Q

What is ε-exhaustion of a Version Space (VS)?

A

A Version Space is ε-exhausted when every hypothesis you might possibly choose from the VS has error less than ε. In that case we can return any of the hypotheses in the VS, chosen uniformly, because we have no prior information that compels us to choose one over another, and they all provide the same error guarantee! [Note: This is mentioned as a key concept!]

74
Q

If we know the size of our hypothesis space and desire to find a hypothesis that’s bounded by some error goal ε and certainty goal δ, then the number of samples required to achieve that is polynomial? (True/False)

A

True. This comes from the Haussler Theorem, which bounds the number of samples needed to ε-exhaust the version space.
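
In symbols, the number of samples m needed to ε-exhaust the version space with probability at least 1 - δ satisfies:

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```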

75
Q

One way to think about why the distribution need not be uniform (for the sample bounds of PAC learnability to hold) is that the distribution, in some sense, cancels out between the training error and the true error? (True/False)

A

True. See the second to last video in the computational learning theory lectures for a really good example of this.

76
Q

What is the ‘agnostic’ situation in the context of PAC learnability?

A

It occurs when the target concept is not in our hypothesis space. In that case, the learner doesn’t match the true concept, but just chooses the best one. The error bounds change a bit, but are more or less the same, and the learning sample complexity is still polynomial in all of the terms we’ve been discussing previously (size of the hypothesis space, error goal, failure rate)

77
Q

Which of these hypothesis spaces are infinite?

A

[yes] Linear separators - as an example, think of an SVM using a linear kernel to create a hyperplane between two classes. Could you rotate it 0.0000000001 degrees in one direction or the other and still split the classes? There’s a good chance.

[yes] Artificial Neural Networks - think about GPT-3 with its billions of parameters. Each one of those parameters is a floating-point number that could take on any real value. Definitely an infinite space.

[no] Decision Trees (discrete inputs) - there are only so many ways we can split discrete data, so the hypothesis space for discrete inputs is not infinite.

[yes] Decision Trees (continuous inputs) - a continuous input could be split at any arbitrary point along the number line, at infinite precision, so the hypothesis space is definitely infinite.

78
Q

Is the version space for a KNN classifier finite or infinite?

A

Trick question, and it really depends on your definition. Dr. Littman argues that it’s finite because if your dataset is fixed, then there’s only one configuration. But if the data is not fixed, then each new dataset yields a different hypothesis, so you could argue it is infinite. It all comes down to the fact that KNN is a non-parametric model; the data is what matters for a KNN.

79
Q

What is the difference between a ‘syntactic’ and ‘semantic’ hypothesis space?

A

A syntactic space is one that consists of everything we could possibly say. A semantic space just considers the space that is meaningfully different. Intuitively, think about a set of characters like {‘h’, ‘e’, ‘r’}. The syntactic space might consist of all the 3-letter strings you could form, but the semantic space would only care about things that have meaning in some sense (perhaps actual words like ‘her’, etc).

80
Q

What is the “Power of a hypothesis space”?

A

It is the largest set of inputs that the hypothesis class can label in all possible ways. Said another way, it is the largest set of inputs that the hypothesis class can shatter. It gives us an idea of how expressive a certain hypothesis class is.

81
Q

What does VC stand for in the context of VC Dimensions?

A

Vapnik-Chervonenkis

82
Q

What is shattering?

A

A hypothesis class shatters a set of inputs if it can label that set in all possible ways (i.e. it can realize every possible labeling of those inputs).

83
Q

What is the VC dimension?

A

It is the size of the largest set of inputs that a hypothesis class can shatter.

84
Q

What does the VC dimension allow us to do?

A

Basically it allows us to obtain a bound on the amount of data needed to learn for some hypothesis class, even if the hypothesis space is infinite. It does this using the notion of shattering (i.e. the largest set of inputs that the hypothesis class can label in all possible ways).

85
Q

A linear separator cannot model the XOR function? (True/False)

A

True.

86
Q

The VC dimension is often the same as the number of parameters? (True/False)

A

True (See the example for d-dimensional linear separators)

87
Q

What is the connection between the sample complexity and VC dimension?

A

The sample-complexity bound can be written in terms of the VC dimension instead of ln|H|, which works even for infinite hypothesis spaces. The form given in the lectures (following Mitchell, as best I recall) is m >= (1/ε)(8·VC(H)·log2(13/ε) + 4·log2(2/δ)).
88
Q

If something is PAC learnable, then its VC dimension is finite? (True/False)

A

True

89
Q

If something is of finite VC dimension, then it is PAC learnable? (True/False)

A

True.

90
Q

VC dimension is related to the number of “true” hypothesis space parameters? (True/False)

A

True. See the example in the lectures about repeating a feature in a decision tree. When you do that though, you’re not actually changing what the hypothesis can represent, so the VC dimension stays the same in that case.

91
Q

What is a ‘concept’?

A

It’s a function or mapping between an object and an item’s membership in a set. Example: mapping the pixels in an image to whether it’s an image of a dog or a cat.

92
Q

What is the ‘target concept’?

A

It’s the concept (i.e. function or mapping) that we actually want to find (i.e. mapping from pixels in an image to whether that image contains a car or not)

93
Q

What is a ‘hypothesis’ as defined in the lectures?

A

It’s the set of functions that we’re willing to entertain as possibly containing the target concept we’re interested in. (Note that it could be ALL functions!)

94
Q

What is a ‘sample’ as defined in the lectures?

A

It’s just the training set of instances, each containing data and its associated label or value.

95
Q

What is a ‘candidate hypothesis’ as defined in the lectures?

A

It’s a specific hypothesis from the hypothesis class that we pose as possibly being the target concept, i.e. asking “h =? target concept”.

96
Q

If we use a decision tree to represent n-or statements, how does the tree grow as a function of n?

A

It’s linear in n.

97
Q

If we use a decision tree to represent n-xor statements, how does the tree grow as a function of n? (Assume odd parity)

A

It’s exponential in n, i.e. O(2^n)

98
Q

How big is the search space for a decision tree as a function of the number of attributes n? (exclusive of the outputs)

A

2^n - the number of possible input combinations (i.e. the rows of the truth table over n binary attributes).

99
Q

How big is the search space for a decision tree as a function of the number of attributes n? (inclusive of the outputs)

A

2^(2^n)

100
Q

Why are decision trees so expressive?

A

Because the hypothesis space grows according to 2^(2^n), where n is the number of attributes (this is assuming binary attributes)

101
Q

Does it ever make sense to repeat an attribute along a path in a decision tree for discrete valued attributes? (by same path meaning along the same branch)

A

My intuition says no (this is also the answer they give in the lectures). My reasoning is that if we’re splitting based on information gain, then once we’ve split on an attribute, all the information it contains has been extracted; that information can never “reappear”, so it never makes sense to repeat the attribute. See the lecture videos for additional explanation.

102
Q

Does it ever make sense to repeat an attribute along a path in a decision tree for continuous valued attributes? (by same path meaning along the same branch)

A

Yes, it makes sense, as long as the question isn’t exactly repeated. Think about an attribute that looks at whether a person’s age is between 20 and 30. There wouldn’t be any reason to repeat that exact question later in the tree (for the same information theoretic reasons as in the discrete valued case), but it’s perfectly reasonable to ask a related but different question, like “is the person younger than 13?”

103
Q

What are some of the stopping criteria we could use for the decision tree algorithm?

A
  1. Everything is classified correctly (probably not likely to happen in real-world cases)
  2. No more attributes remaining (this works in the case of discrete-valued trees, but not in the case of continuous variables, since we could essentially ask an infinite number of questions)
  3. Avoid overfitting (having a tree that is too big), e.g. expand the tree only until we meet some threshold of performance on a cross-validation set
  4. Pruning - overfit the tree, then prune until we reach some balance between complexity and performance.
104
Q

For linear regression, the best constant in terms of squared error is the _____?

A

mean (the lecture videos prove this using simple calculus)
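
The one-line version of that calculus, for data points y_1, …, y_n:

```latex
E(c) = \sum_{i=1}^{n} (y_i - c)^2
\quad\Rightarrow\quad
\frac{dE}{dc} = -2\sum_{i=1}^{n} (y_i - c) = 0
\quad\Rightarrow\quad
c = \frac{1}{n}\sum_{i=1}^{n} y_i
```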

105
Q

What is a perceptron?

A

A computational unit that “activates” if the input signal x multiplied by some arbitrary weights w is above some threshold θ, i.e. it’s some linear function of the inputs thresholded by some value

106
Q

Let’s say you’re using a simple network of perceptrons to build a binary classifier. When training the model, after several iterations you find your weights are no longer changing. What could we conclude about the dataset?

A

The data must be linearly separable. The perceptron rule is guaranteed to converge in finite time to the correct rule in the event that the data are in fact linearly separable.

107
Q

In an ANN, _______ weight values increase the model __________? What is one way of combatting this?

A

larger, complexity. We can use regularization techniques to try to minimize this problem.

108
Q

What is the ‘maximum likelihood’ hypothesis?

A

It’s the hypothesis that maximizes the probability of the data given the hypothesis, i.e. the product of the individual probabilities of each data point given the hypothesis.
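
In symbols (assuming i.i.d. data points d_i):

```latex
h_{ML} = \operatorname*{argmax}_{h \in H} P(D \mid h)
       = \operatorname*{argmax}_{h \in H} \prod_{i} P(d_i \mid h)
```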

109
Q

What conditions are required for Bayesian learning to be equivalent to minimizing the sum of squared error?

A
  1. Deterministic function f(x)
  2. Purely Gaussian noise (0 mean, unit variance)
110
Q

What is the “maximum a posteriori”? What does it do?

A

MAP says that the optimal hypothesis is the one that jointly minimizes the misclassification (i.e. error) and the size of the hypothesis (i.e. the complexity of the model). It’s essentially a mathematical/information-theoretic recapitulation of Occam’s Razor in Bayesian terms.

111
Q

The complexity of a model is the number of parameters it contains? (True/False)

A

False. It’s related to the number of parameters, but it’s really about the cost of encoding those parameters, not the raw count. Think about neural networks: you can overfit because you have too many neurons (i.e. weights), but you can also overfit if the weights become really large. This makes sense from an information-theoretic perspective: if the weight values become large, it requires more “bits” of information to encode the model, hence a higher complexity. This is what is meant by the “size of hypothesis” in the lectures.

112
Q

Bayes Rule effectively allows you to swap _______ and ________?

A

Causes, Effects

113
Q

Generally speaking, which is easier to compute, Pr(h|D) or Pr(D|h)?

A

Pr(D|h)

114
Q

What is the connection between MAP and ML (Maximum Likelihood)?

A

Maximum likelihood is what you get when the prior over the hypotheses is uniform

115
Q

To make a Bayesian Optimal Classifier, what should you do?

A

Since what we’re actually interested in is getting the label of the data correct, what we should do is allow the hypotheses to vote on the label (which makes it sort of connected to ensemble learning). Calculating Pr(h|D) is really just a means to an end; the label is what we care about, i.e. Pr(label|D).

116
Q

What is the definition of conditional independence?

A

X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z. That is:

P(X|Y,Z) = P(X|Z)

Basically if we can predict X given Z alone, then Y provides no information to us, hence X is conditionally independent of Y.

117
Q

For a Bayesian Belief Network, the size of the graph grows exponentially with more variables? (True/False)

A

True

118
Q

We can infer causality by looking at the directional relationships in a Bayesian Belief Network? (True/False)

A

False. What we’re actually looking at is only dependence in a statistical sense; causality can’t necessarily be inferred.

119
Q

When sampling from the joint distribution in a Bayesian net, the graph must be _______ ______?

A

directed acyclic. This is because a Bayesian net is essentially trying to encode conditional independencies, so if there were cycles, that would be tantamount to saying there are none.

121
Q

In a Bayesian network, arrows between variables indicate dependencies? (True/False)

A

False. They indicate conditional independencies

122
Q

What’s one of the key takeaways from Naive Bayes? What is the MAP estimate for the class?

A

The takeaway is that in the real world, we often get examples of some class V, say spam mail, and we observe that the class has certain attributes a. But for classification, what we really want is NOT a model that goes from V –> a, but rather a model that maps from a –> V. Bayes Rule helps us do this.

The MAP estimate for the class is simply the argmax, over class values, of the prior probability of the class, P(V), multiplied by the product of all the conditional probabilities of the attributes given the class.
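
In standard Naive Bayes notation (v ranges over class values, a_i over the observed attributes):

```latex
v_{MAP} = \operatorname*{argmax}_{v} \; P(v) \prod_{i} P(a_i \mid v)
```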

123
Q

What are some of the advantages of Naive Bayes?

A
  1. Inference is cheap (we can also perform inference in any direction, e.g. to generate missing attributes)
  2. Few parameters
  3. Parameters can be estimated from labeled data (essentially by counting)
  4. Connects inference and classification
  5. Empirically successful
124
Q

What are some of the challenges with Naive Bayes?

A
  1. Doesn’t model interrelationships between attributes (it assumes independence between attributes)
  2. Unseen attributes - since the calculation requires the product of all the conditional probabilities of the attributes, if an attribute value that matters for classification never appears in the dataset (perhaps the dataset is too small, etc.), its estimated conditional probability is 0 and the whole product ends up being 0!
125
Q

As the annealing temperature T –> 0 in simulated annealing, what does the algorithm behave like?

A

Regular hill climbing

126
Q

As the annealing temperature T–>infinity in simulated annealing, what does the algorithm behave like?

A

Random walk
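
Both limits fall out of the acceptance probability used in simulated annealing (written here for maximizing a fitness function f): as T –> 0 the exponential goes to 0, so only improving moves are accepted (hill climbing); as T –> infinity it goes to 1, so every move is accepted (random walk).

```latex
P(\text{accept } x') =
\begin{cases}
1 & \text{if } f(x') \ge f(x) \\
e^{\,(f(x') - f(x))/T} & \text{otherwise}
\end{cases}
```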

127
Q

Compare and contrast genetic algorithms with hill climbing with random restarts?

A

In some ways you can think of them as the same. We’re still sampling initially to get a fitness score, and then restarting to ensure we don’t get stuck in local optima. The difference is that with genetic algorithms, we’re making the assumption that the population distribution contains information that we can exploit to generate new combinations that are better than the individual values.

128
Q

What are some of the ways you can select the “most fit” individuals in a genetic algorithm.

A

Truncation selection, roulette wheel (i.e. weighted probability), etc. The latter reminds me of particle filters and the weighted resampling that we do there.

129
Q

What’s one of the assumptions we make when performing crossover in genetic algorithms?

A
  1. Locality of the bits (assuming a discrete problem)
  2. More generally, that there are independent subspaces to optimize. That is, that there is some structure to the problem such that we can optimize parts of it without impacting other parts (or in practice, at least minimally so). Think of the bit-string example in the lectures: swapping the first half of the 8-bit sequence in “Charles” with the corresponding 4-bit sequence from “Sheila” is only a sensible thing to do if there’s some structure in place such that those positions exist in the same topological space.
130
Q

What is uniform crossover in genetic algorithms?

A

Rather than swapping contiguous chunks (as in one-point crossover), each bit of the offspring is chosen at random from one of the two parents.

131
Q

When is random optimization useful?

A

If there is no gradient pointing the way (e.g. function is too complex, unknown, not differentiable, etc.)

132
Q

What’s one big problem that all 4 of the first random optimization algorithms discussed in the lectures suffer from?

A

They’re “amnesiacs”. That is, even though they sample lots of points along the way, they don’t really use that history to discover and communicate structure in the problem that they might be able to exploit to find a better solution (or to find it more quickly).

133
Q

MIMIC does not require locality in the same way that genetic algorithms do? (True/False)

A

True.

134
Q

What is the MIMIC algorithm trying to represent?

A

Structure.

135
Q

What price do you pay for using MIMIC? What do you get in return for this price?

A

Time. MIMIC often needs orders of magnitude fewer iterations than things like simulated annealing or GAs, but each of those iterations takes a lot more time, because we’re estimating the parameters of the distribution at each step.

The bonus you get is the information contained in the structure of the problem.

137
Q

What is the connection between MAP (Maximum A Posteriori) and Bayesian Learning?

A

The connection is between the MAP formula, h_MAP = argmax_h [P(D|h) * P(h)], and information theory. If we manipulate the math a bit, this works out to be the same as minimizing the misclassification (i.e. error) plus the size of the hypothesis. It’s basically a recapitulation of Occam’s Razor: we should prefer the simplest (i.e. lowest-complexity) hypothesis that best explains (i.e. most accurately fits) the data.
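
Writing that manipulation out (taking lg is monotonic, so the argmax is unchanged; the two terms are the information-theoretic “error” and “size of hypothesis”):

```latex
h_{MAP} = \operatorname*{argmax}_{h} P(D \mid h)\,P(h)
        = \operatorname*{argmax}_{h} \bigl[\lg P(D \mid h) + \lg P(h)\bigr]
        = \operatorname*{argmin}_{h} \bigl[-\lg P(D \mid h) - \lg P(h)\bigr]
```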

138
Q

If you were to take a KNN and set k=n, (where n is the number of datapoints), what value would you get out? What’s a way you could change that value? What other algorithm is this situation similar to?

A

Since KNN is instance-based learning, if k=n then you’re simply using all the datapoints to generate your output, in which case you’ll get a constant value: the mean. One way of changing that value would be to use a weighted average.

This is similar to boosting (i.e. ensemble learning): with k=n, boosting would output the same value (the mean) in the case where each datapoint forms its own subset that a 0th-order polynomial learner is trained on. Each 0th-order polynomial simply learns the value of that specific point, the ensemble averages over all of them, and hence the overall average of the dataset is returned.

139
Q

Similarity captures _______ _________?

A

Domain Knowledge (see the lectures on distance metrics and K-NN)

140
Q

One of the weaknesses of K-NN is that it assumes smoothness of the underlying function. What’s one way of getting around this?

A

You could use locally weighted regression. In fact, you could stuff any learning algorithm you want into small piecewise regions, and then stitch those together to form a non-linear hypothesis space.

141
Q

If I want to use more features, how much more data do I need?

A

It grows exponentially in the number of dimensions: O(2^d)

142
Q

What is the connection between SVMs and K-NN?

A

K-NN is all about finding which datapoints actually matter, i.e. the k nearest points to the query point according to some distance metric. SVM is similar, except that instead of taking the lazy approach of K-NN, it performs a calculation (eagerly) to figure out which datapoints it can throw away. It can do this because only the points near the boundary actually matter in terms of constructing the margin of the hyperplane.

143
Q

How do we inject domain knowledge into the K-NN algorithm? How do we inject it into an SVM?

A

For K-NNs we do it using the distance (similarity) metric. For SVMs, the kernel plays the same role.

144
Q

What is the Mercer Condition?

A

It’s the requirement that an SVM kernel act as a well-defined distance or similarity metric.

As a side note, I was wondering about using KL divergence as a kernel function, but I think the Mercer Condition probably precludes that, since KLD isn’t symmetric.

145
Q

What is the connection between SVMs and Boosting?

A

Boosting is, in a sense, increasing the confidence of its predictions by increasing the margin of its decision boundary. This is the same thing an SVM does: maximizing the margin of the hyperplane that separates the classes.

146
Q

What type of noise can cause boosting to overfit?

A

Pink noise (i.e. uniform noise)
