Midterm Exam Flashcards
What is the definition of ML that Dr. Isbell prefers?
ML is about using math, computation, engineering (among other things) to build computational artifacts that learn over time.
What is inductive reasoning?
Reasoning that goes from specifics –> generalities
What is deductive reasoning?
Applying general rules to draw specific (logical) conclusions
At a high-level, what is supervised learning considered an example of?
Function approximation. It’s the process of inductively learning a general rule from observations.
What is unsupervised learning all about?
It’s about DESCRIBING data. In unsupervised learning, we’re only given “inputs”, so the objective is to learn whether there is some latent structure in the data that we can use to describe the data in a more efficient (i.e. more concise) way. Clustering is a common type of unsupervised ML.
What are two types of supervised machine learning?
- Classification (Discrete output values)
- Regression (Continuous output values)
In a Decision Tree (DT) model, nodes are _______ and edges are _____?
Attributes, Values
What is the decision tree algorithm?
1. Pick the best attribute (where “best” means the attribute that best splits the data, e.g. roughly in half)
2. Ask the question
3. Follow the answer path
4. Go to 1 (until you arrive at an answer; see the sketch below)
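A minimal ID3-style sketch of that loop in Python. The toy rows, attribute names, and the entropy/information-gain helpers are illustrative assumptions, not code from the lectures:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on `attribute` (higher = better split)."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Top-down induction: pick the best attribute, split, recurse."""
    if len(set(labels)) == 1 or not attributes:      # pure node, or nothing left to ask
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):     # follow each answer path
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        node[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return node

# Toy usage: the attribute with the highest information gain ends up at the root.
rows = [{"outlook": "sun", "wind": "weak"}, {"outlook": "sun", "wind": "strong"},
        {"outlook": "rain", "wind": "weak"}]
labels = ["yes", "no", "yes"]
print(build_tree(rows, labels, ["outlook", "wind"]))
```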
What is the complexity of XOR?
O(2^n) [i.e. exponential; an n-way XOR needs a decision tree with a node for essentially every combination of attribute values, which is why it’s considered a “hard” function]
What is the complexity of OR?
O(n) [i.e. linear]
What are the two types of biases we worry about when searching through a hypothesis space?
- Restriction bias: This is how restrictive the function space is we’ve chosen. So given that there are an infinite number of ways of representing a function, the restriction bias of a decision tree algorithm limits it to the space of boolean functions.
- Preference bias: This tells us what hypotheses within the space that we prefer. THIS IS AT THE HEART OF INDUCTIVE BIAS. (Because we have a preference for hypotheses that fit the data well.)
What are some of the inductive biases of the ID3 algorithm for decision trees?
- Preference for good splits at the top of the tree as opposed to the bottom of the tree (because we build from the top of the tree down)
- Prefers models that fit the data well (i.e. prefers correct over incorrect)
- Tends to prefer shorter trees over taller trees (this is just a natural corollary that stems from the fact that we prefer trees that perform better splits at the top)
According to Mitchell, what three features must be present in order to have a well-posed learning problem?
- The class of tasks
- The measure of performance to be improved
- The source of experience
Historically, how did we end up with the term “regression”?
The idea of regression to the mean, e.g. the children of taller- or shorter-than-average parents tend to have heights that ‘regress’ back toward the mean.
What are some examples of where error comes from?
- Sensor Error
- Malicious/adversarial data
- Transcription error
- Un-modeled influences
What is one of the fundamental assumptions that we make in most supervised learning algorithms?
That the data from the training set are REPRESENTATIVE of what we expect in the future. If this isn’t the case, then our model won’t GENERALIZE, which is what we really care about as ML practitioners. More formally, the general assumption is that data are i.i.d. (Independent and Identically Distributed); that is, that the process that generated the training data is the same process that is generating the test data (in fact, any future data!).
In the context of regression, the best constant in terms of the squared error is the _____?
mean
Describe cross-validation and why we use it?
We split the training data into k folds. Then we use k-1 of the folds for training and the remaining fold as a validation set (i.e. a stand-in for the test set). We repeat this with each of the k folds taking a turn as the validation set, and then average the validation error across all of them. The best model is then the one with the lowest average validation error.
We use cross validation to avoid overfitting the model. Since we should have no access to the test set when developing our model, using cross validation improves generalization.
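A minimal sketch of k-fold cross-validation in Python. The `fit`/`error` callables and the toy data are placeholders for whatever learner and loss you are actually tuning (here the “model” is just the training-set mean, the best constant under squared error):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, error, k=5, seed=0):
    """Average validation error over k folds: train on k-1 folds,
    validate on the held-out fold, repeat for every fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))

# Toy usage: predict the training mean, score with mean squared error.
fit = lambda X, y: y.mean()
mse = lambda m, X, y: float(np.mean((y - m) ** 2))
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + np.random.default_rng(1).normal(size=20)
print(k_fold_cv_error(X, y, fit, mse, k=5))
```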
Logical AND is expressible as a perceptron unit? (True/False)
True
Logical OR is not expressible as a perceptron unit? (True/False)
False
Logical NOT is expressible as a perceptron unit? (True/False)
True
Logical XOR is not expressible as a perceptron unit? (True/False)
True. XOR is not linearly separable, so a single perceptron unit cannot express it; it requires a network of at least two units.
For perceptron network training, what is the difference between the “perceptron rule” and the “gradient descent” rule?
The perceptron rule uses thresholded output values while gradient descent uses the UNthresholded values.
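A sketch of the two weight updates side by side, assuming the threshold/bias has been folded into the weight vector via a constant-1 input feature (illustrative, not lecture code):

```python
import numpy as np

def perceptron_rule_update(w, x, y, eta=0.1):
    """Perceptron rule: the error term uses the THRESHOLDED output y_hat."""
    y_hat = 1.0 if np.dot(w, x) >= 0 else 0.0
    return w + eta * (y - y_hat) * x

def gradient_descent_update(w, x, y, eta=0.1):
    """Gradient descent (delta rule): the error term uses the UNthresholded
    activation a = w.x, which is smooth and therefore differentiable."""
    a = np.dot(w, x)
    return w + eta * (y - a) * x
```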
If the data are linearly separable, will the perceptron rule find the hyperplane that separates them in a finite number of iterations?
Yes
Why can we not a priori know whether a dataset is linearly separable (at least in non-trivial cases)?
Because while the perceptron rule is guaranteed to find a separating hyperplane in a finite number of iterations if the data are linearly separable, we have no way of knowing how long “finite” is. Maybe the error finally goes to zero, but maybe it never does. It’s essentially a halting problem.
The perceptron training rule is more robust against non-linearly separable data than gradient descent is? (True/False)
False.
The perceptron rule does not have the theoretical guarantee of finite convergence? (True/False)
False
What assumption is made for the theoretical guarantee of finite convergence for the perceptron rule?
That the data are linearly separable.
What convergence guarantee does gradient descent offer?
Convergence to a local optimum. This is in contrast to the perceptron rule which does offer the guarantee of convergence to the global optimum in finite time, but only in the case of linearly separable data. Gradient descent is more ROBUST to non-linearly separable data than the perceptron rule.
Since gradient descent is more robust than the perceptron rule, why not just use GD on the weight update for the perceptron rule also?
Because the perceptron rule uses the THRESHOLDED output values, and the thresholding operation isn’t smooth (it has a discontinuity at the point of thresholding), hence it is not differentiable, which means we can’t use the calculus required for GD.
What is the domain for the sigmoid function? How does the output behave at the limits?
The domain is from -infinity to +infinity. As the input goes to -infinity, the output asymptotically approaches 0; conversely, as the input goes to +infinity, the output asymptotically approaches +1.
What is the derivative of the sigmoid function?
Let ‘a’ := the activation and S(a) := the sigmoid function. Then the derivative dS(a)/da = S(a)*(1 - S(a))
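A quick sketch of the function and its derivative (plain NumPy, just to make the identity concrete):

```python
import numpy as np

def sigmoid(a):
    """S(a) = 1 / (1 + e^(-a)); -> 0 as a -> -inf, -> 1 as a -> +inf."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """dS(a)/da = S(a) * (1 - S(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))   # 0.5, 0.25
```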
What is backpropagation?
It’s a computationally beneficial organization of the chain rule from calculus that allows us to calculate how the weights should move to minimize the error of the output from a neural network.
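A minimal sketch of one backprop step for a single-hidden-layer sigmoid network trained with squared error. The layer sizes, learning rate, and toy input are arbitrary assumptions; the point is just to show how the chain rule is organized from the output backwards:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W1, W2, eta=0.5):
    """One gradient-descent step, chaining derivatives from the output back."""
    # Forward pass
    h = sigmoid(W1 @ x)          # hidden activations
    y_hat = sigmoid(W2 @ h)      # network output
    # Backward pass (reusing S'(a) = S(a)(1 - S(a)))
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # pushed back through W2
    # Weight updates
    W2 = W2 - eta * np.outer(delta_out, h)
    W1 = W1 - eta * np.outer(delta_hid, x)
    return W1, W2

# Toy usage (random weights; this only demonstrates the mechanics).
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```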
What are some of the more advanced methods that are sometimes beneficial to help mitigate the tendency for gradient descent to get stuck in local minima?
- Momentum terms
- Using higher-order derivatives (the analogy would be adding error “velocity/acceleration” terms, to borrow from the equations of physical motion)
- Randomized optimization
- Penalty for “complexity”, where complexity might be the result of too large a network STRUCTURE (in terms of breadth and/or depth) or weights that are excessively large (in the latter case we typically refer to this complexity penalty as “regularization”; see the sketch after this list)
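A sketch of two of these ideas in isolation; the hyperparameter values are arbitrary placeholders, shown as a momentum-style weight update and a squared-error loss with an L2 “complexity” penalty on the weights:

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.1, beta=0.9):
    """Momentum term: blend the previous step direction into the new one,
    which can help roll through shallow local minima."""
    velocity = beta * velocity - eta * grad
    return w + velocity, velocity

def l2_penalized_loss(y, y_hat, w, lam=0.01):
    """Squared error plus a penalty on large weights ('regularization')."""
    return float(np.mean((y - y_hat) ** 2) + lam * np.sum(w ** 2))
```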
What is the ‘Restriction Bias’?
It relates to the representational power of a learner by limiting the set of hypotheses that we will consider as valid.
What is the hypothesis space of a single perceptron unit?
Half-spaces (i.e. a hyperplane that perfectly separates data [although only in the case where the data is linearly separable!])
What is the restriction bias of a simple network of threshold-like units?
Boolean functions
If we add a single fully connected hidden layer that includes sigmoided activation functions, what is the restriction bias of the network?
Continuous functions
If we stitch together two or more fully connected hidden layers that include smooth (i.e. differentiable) activation functions, what is the restriction bias of the network?
Arbitrary functions
A network with two hidden layers containing 50 neurons each can represent any arbitrary function? (True/False)
False. While a finite neural net with at least two hidden layers can theoretically model an arbitrary function, we don’t necessarily (in fact, in almost every case we don’t) know how big a network has to be to model the function of interest.
If the error on the validation set increases while the error on the training set continues to decrease, what situation has occurred?
We’ve OVERFIT the training data. The model is no longer capable of generalizing.
What is ‘Preference Bias’?
When a given algorithm is making a selection between one representation or another, its preference bias determines which one of those representations it is more likely to choose.
Say while training a NN, a single weight is growing exponentially, becoming much larger than all the other weights in the network. What situation is likely occurring?
We’re likely overfitting the dataset. In the case where one weight is growing much larger than all the others, the model has probably homed in on a single feature of the data that is highly correlated with the outputs in the training dataset, but it is unlikely to generalize to new data.
What are some of the preference biases of a Neural Network?
Initialization of weights to small random values results in a preference (ceteris paribus) for models with lower complexity; i.e. if most of the weights can stay small and still model the problem sufficiently well, then the model is going to prefer that situation.
Which is preferable: a restriction bias or preference bias?
Preference bias. At least in that case, we can be assured that the hypothesis space contains our target function of interest.
Where does the inductive bias of the ID3 algorithm for decision trees come from?
Its search strategy, which makes it prefer shorter trees over taller trees and tends to place attributes with the highest information gain toward the top of the tree.
In terms of computational complexity, learning for linear regression is _______ and querying ______?
Expensive | Cheap
In terms of computational complexity, learning for K-NN is _______ and querying ______?
Cheap | Expensive
What is the computational complexity (runtime and space) for KNN vs linear regression?
For KNN, learning is essentially free in time (we just store the data) but costs O(n) space; querying is the expensive part, since we have to search the stored points for the nearest neighbors. For linear regression, learning is the expensive part (we do the work up front, scanning all n points to fit the parameters), but querying is constant time and space once the parameters are learned.
What are some of the preference biases of the KNN algorithm?
- Locality: nearer points are similar
- Smoothness: expectation that averaging data points works (i.e. the underlying function generating the data is smooth)
- Equality: all features matter equally
What is the “Curse of Dimensionality”?
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
The curse of dimensionality only applies to the KNN algorithm? (True/False)
False. It’s true for any ML algorithm that as the number of features grows, we need exponentially more data to cover the space in order to generalize.
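A back-of-the-envelope illustration (a simple grid-covering argument, not a formula from the lectures): if covering one dimension at some fixed resolution takes 10 points, covering d dimensions at the same resolution takes roughly 10^d points.

```python
# Points needed to cover [0, 1]^d with a grid of 10 points per dimension.
points_per_dim = 10
for d in (1, 2, 3, 5, 10):
    print(f"d={d}: {points_per_dim ** d} points")
# d=1: 10, d=2: 100, d=3: 1000, d=5: 100000, d=10: 10000000000
```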
What is the difference between ‘eager’ and ‘lazy’ learners? What are some examples of each?
A lazy learner puts off learning as long as possible; eager learners solve the problem as soon as it is posed, and save the result as some sort of parametric function that can be used later to quickly evaluate new data.
KNN is an example of a lazy learner. Linear regression is an example of an eager learner.
KNN can only handle classification problems? (True/False)
False, it can handle classification and regression. In the case of regression, it could (as one example) simply take the average of the points to get a continuous output.
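A minimal sketch of KNN regression by averaging (plain NumPy; k and the toy arrays are arbitrary):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """KNN for regression: average the targets of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(knn_regress(X, y, np.array([1.2])))   # averages the 3 nearest targets
```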
How does locally weighted KNN work and what does it allow us to do?
Locally weighted KNN allows us to represent more complex non-linear functions. It does this by approximating data locally as piece-wise functions. By stacking up these local functions we can generate non-linear outputs.
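One way to sketch the “locally weighted” idea is inverse-distance weighting of the k neighbors. This is just one illustrative weighting choice; the lectures discuss locally weighted regression more generally:

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k=3, eps=1e-12):
    """Distance-weighted KNN: closer neighbors get more say, so the
    stitched-together local predictions can trace a non-linear function."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)       # inverse-distance weighting
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(X, y, np.array([1.9])))   # dominated by the closest point
```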
How does ensemble learning work?
- Take a subset of the data and create individual learners that learn a rule for that subset
- Combine all the individual learners into an ensemble
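A minimal sketch of that recipe; the stand-in “weak” learner here is a 1-NN rule fit on each random subset (any simple rule learned on the subset would do), and the ensemble combines the individual predictions by averaging:

```python
import numpy as np

def subset_learner(X_sub, y_sub):
    """Stand-in individual learner: 1-NN fit on the subset."""
    def predict(x):
        return y_sub[np.argmin(np.linalg.norm(X_sub - x, axis=1))]
    return predict

def ensemble_predict(X_train, y_train, x_query, n_learners=10, seed=0):
    """Each learner sees a random subset of the data; combine their outputs."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_learners):
        idx = rng.choice(len(y_train), size=len(y_train) // 2, replace=False)
        learner = subset_learner(X_train[idx], y_train[idx])
        preds.append(learner(x_query))
    return float(np.mean(preds))

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
print(ensemble_predict(X, y, np.array([4.5]), n_learners=5))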
What are the two important ideas in boosting?
- Select data subsets based on their difficulty
- Use a weighted mean instead of simple average
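An AdaBoost-style sketch of both ideas: hard (misclassified) examples get their weights increased each round, and the final prediction is a weighted vote. The decision-stump weak learner and the specific alpha/reweighting formulas are the standard AdaBoost choices, given as an illustration rather than as quoted lecture code:

```python
import numpy as np

def stump(X, y, w):
    """Weak learner: best single-feature threshold under weights w (labels are +/-1)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= t, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=5):
    """Reweight 'hard' examples each round; keep (alpha, stump) pairs for the vote."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(rounds):
        err, j, t, sign = stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # weight of this weak hypothesis
        pred = sign * np.where(X[:, j] >= t, 1, -1)
        w = w * np.exp(-alpha * y * pred)              # up-weight misclassified points
        w = w / w.sum()
        model.append((alpha, j, t, sign))
    return model

def boosted_predict(model, x):
    """Weighted vote (weighted 'mean') of the weak hypotheses."""
    score = sum(alpha * sign * (1 if x[j] >= t else -1) for alpha, j, t, sign in model)
    return 1 if score >= 0 else -1

X = np.array([[0.0], [1.0], [2.0], [3.0]]); y = np.array([-1, -1, 1, 1])
model = adaboost(X, y, rounds=3)
print(boosted_predict(model, np.array([2.5])))   # -> 1
```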
What is the definition of a “weak learner”?
It’s a learner that, no matter what distribution the training data are drawn from, outputs a prediction that does better than random chance (i.e. its error is at most 1/2 - ε for some ε > 0).
Define ‘Inductive Learning’ (per the computational learning theory lecture)
It’s the process of “Learning from Examples”. The lecture characterizes an inductive learning problem in terms of:
- Probability of successful training (i.e. 1 - δ, where δ is the probability of failure)
- Number of examples to train on (i.e. m)
- Complexity of the hypothesis class H
- Accuracy to which the target concept is approximated (often denoted ε)
- Manner in which training examples are presented (batch vs online)
- Manner in which training examples are selected
What are the two main ways that examples can be presented to a learning algorithm?
- Batch: we get a set of training data, and it’s handed over to the learner all at once to learn from
- Online: The learner is presented with samples one at a time, predicts a label, and then gets feedback from the algorithm/oracle about whether that prediction was correct or not
Describe the relationship between Learners and Teachers. How do they interact with one another?
- Learner asks questions of the teacher - so for example, the learner chooses some data/observation x and asks the teacher for its label c(x)
- Teacher gives examples to help learner (i.e. Teacher chooses x, tells c(x) to Learner)
- Fixed Distribution, i.e. x is chosen from D by nature (Dr. Isbell notes that this is mostly what we’ve talked about up until this point)
- Evil distribution - i.e. some sort of adversarial approach
Random thoughts to consider: in the lectures discussing computational learning theory, one of the quizzes asks about a game like ‘20 Questions’, where we have a set H of hypotheses (the possible people) and X questions we can ask; how many questions does the learner need to ask, if the learner chooses x, to figure out the correct h? Dr. Isbell says log2(|H|), because we can, on average, eliminate half the hypotheses with each question we ask. Makes sense from an information theoretic perspective. This is predicated on the assumption that we can ask “good” questions, but doesn’t really say anything about what makes a question “good”. What distinguishes good questions from bad questions?
A naive response would be that it’s one that eliminates roughly half the answers. But that’s tautological. It seems to me that domain knowledge is really implicit here. In the case where H is a set of people, we naturally have background knowledge that admits a lot of binary questions (at least from an information theoretic perspective, if not exactly politically correct): is the person male/female? Are they older than 40? Etc., etc. Not only that, we also have really strong priors we can derive from this background knowledge: we know that the world population is roughly 50/50 male/female. Human life expectancy gives us a good idea of how to split into roughly equal groups.
But what about the case where we’re much less certain (or have no information at all) about what H contains? I guess since we’re human, the most reasonable place to start would be asking questions based on the prevalence of some attribute that H is most likely to contain? So questions like “Is h in the space of people?”, etc. It’s tough to swallow as things become more abstract, though. What if H is the set of all atoms in the universe, and the answer we’re looking for is something ridiculous like: “the 3rd atom in Saturn’s rings”. (Interesting to note that the first example that jumped to my mind was Saturn: I’ve also just shown my own preference bias for hypotheses that are in the set of things in our solar system.)
I don’t know exactly where I’m going with this, and I guess it comes down to the problem of a potentially infinite hypothesis space, which I’m sure we’ll discuss in the lectures. Anyway, it’s interesting stuff to think about.
Why do constrained queries make things so hard for learners?
Because we can only ask things that amount to data points, and not the thing we’re actually interested in itself. Think about the case of bit strings and a hypothesis class that consists of conjunctions of literals or their negations. What we’d really like to ask are things like: is x1 in the formula? But the constraint forces us to only ask questions where the x’s take on binary values. This means it could take a huge number of samples before we finally get a positive result; the negative results simply don’t tell us much. So in the worst case it could take on the order of 2^k queries to find the actual answer.
What is ‘Version Space’?
Let S be a training set of data drawn from X. The version space is simply the set of all candidate hypotheses h such that h is consistent w.r.t. the data S that it has seen thus far. More succinctly, it’s the “set of hypotheses consistent with the examples”
In the context of PAC learning, what are the two types of error?
Training error and True Error
What is training error?
It is the fraction of training examples misclassified by hypothesis h
What is the ‘True Error’?
The fraction of examples that would be misclassified on samples drawn from D, in the limit of infinitely many samples (i.e. the probability of misclassification under D)
What does this formula represent: error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]?
The True Error, i.e. the fraction of examples that would be misclassified by some hypothesis h(x) compared to the true hypothesis (i.e. concept) c(x)
What is the definition of PAC Learning (“Probably Approximately Correct”)?
Concept class C is learnable by learner L using hypothesis space H if and only if the learner will, with probability 1 - δ, output a hypothesis h in H such that the error(h) <= ε in time and samples polynomial in 1/ε, 1/δ, and n (where n is the size of the hypothesis space).
Mathematically, what do the words “Probably Approximately Correct” represent?
Probably := 1 - δ [i.e. the certainty goal]
Approximately := ε [i.e. the error goal]
Correct := h(x)=c(x) [i.e. the chosen hypothesis is equal to the actual concept hypothesis]
At a high-level, describe what PAC Learning is?
Something is considered PAC learnable if the learner L can find a hypothesis that has low true error and can be learned in polynomial time as a function of 1/ε, 1/δ and n (the size of the hypothesis space).
Is the concept in the image below PAC learnable?
Yes. We just keep track of the version space (the hypotheses that are consistent with the data) and then - since we don’t have any additional information to suggest otherwise - we should simply pick uniformly from that version space.
My original guess on the answer (it’s wrong per the lectures)
I’m going to go with ‘No’. So first let’s remember what it means for something to be PAC learnable: it means the learner has to be able to find a hypothesis h in H that has a low, bounded true error and that can be learned in polynomial time as a function of the error rate, certainty goal, and hypothesis space size.
Well, first off, we can see right away that our hypothesis space is going to grow as a function of k. But that’s constant, so we at least tick that box. The error bounds seem more of an issue to me. Think about the case where we increased the length k for each new sample:
0
01
010
1000
What is ε-exhaustion of a Version Space (VS)?
A Version Space is ε-exhausted when every hypothesis you might possibly choose from the VS has an error of at most ε. In that case we can return any of the hypotheses in the VS by choosing uniformly, because we have no prior information that compels us to choose one over the other, and they all provide the same error guarantees! [Note: This is mentioned as a key concept!]
If we know the size of our hypothesis space and desire to find a hypothesis that’s bounded by some error goal ε and certainty goal δ, then the number of samples required to achieve that is polynomial? (True/False)
True. This comes from the Haussler Theorem, which bounds the number of samples needed to ε-exhaust the version space: m ≥ (1/ε)(ln|H| + ln(1/δ)), which is polynomial in 1/ε, 1/δ, and ln|H|.
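A quick sketch of plugging numbers into that bound (the |H|, ε, and δ values are arbitrary examples):

```python
import math

def haussler_sample_bound(H_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples suffice to
    epsilon-exhaust the version space with probability at least 1 - delta."""
    return math.ceil((1.0 / epsilon) * (math.log(H_size) + math.log(1.0 / delta)))

print(haussler_sample_bound(H_size=10**6, epsilon=0.05, delta=0.01))   # -> 369
```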
One way to think of why the distribution need not be uniform for the number of samples needed for something to be PAC learnable is that the distribution in some sense sort of cancels out between the training and true error? (True/False)
True. See the second to last video in the computational learning theory lectures for a really good example of this.
What is the ‘agnostic’ situation in the context of PAC learnability?
It occurs when the target concept is not in our hypothesis space. In that case, the learner doesn’t match the true concept, but just chooses the best one. The error bounds change a bit, but are more or less the same, and the learning sample complexity is still polynomial in all of the terms we’ve been discussing previously (size of the hypothesis space, error goal, failure rate)
Which of these hypothesis spaces are infinite?
[yes] Linear separators - As an example, think of an SVM using a linear kernel to create a hyperplane between two classes. Could you rotate it 0.0000000001 degrees one direction or the other and still split the classes? There’s a good chance.
[yes] Artificial Neural Networks - think about GPT-3 with its billions of parameters. Each one of those parameters is a floating point number that could take on any real value. Definitely an infinite space.
[no] Decision Trees (discrete inputs) - there are only so many ways we can split discrete data, so the hypothesis space for discrete inputs is not infinite.
[yes] Decision Trees (continuous inputs) - a continuous input could be split at any arbitrary point along the number line and at infinite precision, so the hypothesis space is definitely infinite.