Midterm Prep Flashcards
How do decision trees work?
Data is split recursively down a tree. At each node you choose the split that maximizes information gain (reduction in randomness), i.e., the split that separates the data into branches that are as pure as possible with respect to the class labels.
What’s the big O notation of XOR vs OR?
As a decision tree, n-input XOR (parity) is exponential: it needs O(2^n) nodes, while n-input OR needs only a linear number, O(n).
What is restriction bias?
When the set of possible hypotheses considered becomes restricted to a smaller subset.
What is entropy?
A measure of randomness. When you split a decision tree you’re trying to reduce the amount of randomness as much as possible. If after the split you end up with 1/2 of each class down each branch then there was no change in entropy (50x, 50y @ node1 -> node2: 25x, 25y; node3: 25x, 25y). If a split completely separates the classes, you get the maximum entropy reduction.
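The node1/node2/node3 example can be checked numerically. A small sketch (the function names are my own):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    return entropy(parent) - sum(sum(ch) / total * entropy(ch) for ch in children)

# 50x/50y split into 25/25 and 25/25: no change in entropy, zero gain
print(information_gain([50, 50], [[25, 25], [25, 25]]))  # 0.0
# A split that completely separates the classes: maximum gain (1 bit)
print(information_gain([50, 50], [[50, 0], [0, 50]]))    # 1.0
```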
What is preference bias?
What sort of hypotheses from your hypothesis set do you prefer? This is the heart of inductive bias.
What’s the inductive bias of DTs?
- Good splits at the top
- Prefers correct trees to incorrect trees
- Prefers shorter trees to longer trees - this comes naturally from doing good splits at the top. If you do bad splits, the trees would be longer
How do DTs treat continuous attributes?
Pose it as a question: 20 <= age < 30, then split on the answer. You can split multiple times on the same continuous variable along a path, but you can’t ask the exact same question twice. Information gain should rule that out anyway, because repeating a question gives no reduction in entropy.
What is pruning?
Collapse leaves back up into a single node and label it via e.g. the average, or a vote, etc. If collapsing doesn’t increase the error too much then do it; if it does, don’t. This helps prevent overfitting.
Could you use DTs for regression?
Sure - outputs could be something like average or local linear fit. When you split you could look at the variance.
What is cross validation?
Split your training data into subsets (k folds of them). Then use all but one of those sets as your training data and perform validation against the one remaining subset. Repeat this so that each subset becomes the validation set and then you will average the test scores.
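A minimal sketch of that procedure; `train_and_score` is a hypothetical callback that fits on the training folds and returns a score on the held-out fold:

```python
def k_fold_scores(data, k, train_and_score):
    """Split `data` into k folds; each fold serves once as the validation
    set while the other k-1 folds form the training set. Returns the
    average of the k validation scores."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, val))
    return sum(scores) / k
```

With 10 data points and k=5, each validation fold holds 2 points and the training set holds the other 8.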
What is IID?
Independent and Identically Distributed. All data is coming from the same distribution/source. Each pull from the distribution is independent of each other and they are all distributed the same.
How do you get a ‘not’ behavior from a perceptron?
Set the weight to negative (or the threshold)!
What boolean functions can be represented by a single perceptron?
AND, OR, NOT, NAND, NOR - you cannot represent XOR with a single perceptron.
What is the perceptron training rule?
w_i += delta_w_i
delta_w_i = eta(y - y_hat)x_i
y = target, y_hat = output, eta = learning rate
This is a thresholded training rule, i.e., the outputs will be on or off.
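A sketch of the rule on an AND gate (a function a single perceptron can represent). The names are my own; eta is left at 1 since, starting from zero weights, its scale doesn’t change which side of the threshold the sums land on:

```python
def perceptron_train(X, y, eta=1, epochs=10):
    """Perceptron rule: w_i += eta * (y - y_hat) * x_i, where y_hat is the
    THRESHOLDED output. Each row of X carries a leading 1 as a bias input."""
    w = [0] * len(X[0])
    for _ in range(epochs):
        for xs, target in zip(X, y):
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0
            for i, xi in enumerate(xs):
                w[i] += eta * (target - y_hat) * xi
    return w

# AND gate: linearly separable, so the rule converges in finitely many steps
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]  # leading 1 = bias
y = [0, 0, 0, 1]
w = perceptron_train(X, y)
print([1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0 for xs in X])
# [0, 0, 0, 1]
```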
What does linear separability mean to the perceptron training rule?
If the data IS linearly separable, then the perceptron training rule WILL find that separation in a finite number of iterations.
What does linearly separable mean?
All data can be subdivided by half planes with no misclassification. Technically you can make any N points linearly separable using at most N - 1 planes.
What is gradient descent?
Learning rule for NNs for when the data is not linearly separable. It will trend toward the target concept over time but is only guaranteed to converge to a local optimum.
delta_w_i = eta(y-a)x_i
eta = learning rate, y = target output, a = sum(w_i * x_i)
The outputs of these are not thresholded.
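A sketch of one epoch of this rule (the names are my own); note the update uses the raw activation a, not a thresholded output:

```python
def delta_rule_epoch(w, X, y, eta=0.05):
    """One pass of the (stochastic) delta rule: delta_w_i = eta*(y - a)*x_i
    with a = sum(w_i * x_i), the UNTHRESHOLDED activation."""
    for xs, target in zip(X, y):
        a = sum(wi * xi for wi, xi in zip(w, xs))
        for i, xi in enumerate(xs):
            w[i] += eta * (target - a) * xi
    return w

# Fitting y = 2x: the weight drifts toward 2 over repeated epochs
w = [0.0]
for _ in range(100):
    w = delta_rule_epoch(w, [[1], [2], [3]], [2, 4, 6])
print(w)  # close to [2.0]
```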
What’s the difference b/t the perceptron rule and the gradient descent rule?
perceptron: eta(y-y_hat)x_i
gradient descent: eta(y-sum(w_ix_i)*x_i
perceptron learning is done on the thresholded output while gradient descent is done on the non-thresholded output.
What is the sigmoid activation function?
S shaped threshold:
sigma(a) = 1 / (1+e^-a)
as a -> -inf, sigma(a) -> 0
as a -> inf, sigma(a) -> 1
This is differentiable.
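A quick sketch of the function along with the standard identity for its derivative, sigma'(a) = sigma(a)(1 - sigma(a)), which is what makes the differentiability so convenient:

```python
import math

def sigmoid(a):
    """sigma(a) = 1 / (1 + e^-a): an S-shaped, differentiable threshold."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_deriv(a):
    """Its derivative has the tidy form sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0))    # 0.5, the midpoint
print(sigmoid(10))   # near 1  (a -> inf)
print(sigmoid(-10))  # near 0  (a -> -inf)
```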
What is the restriction bias of NNs (what set of hypotheses will we consider)?
perceptron: half spaces
sigmoids: much more complex
Not much restriction. There is a real danger of overfitting though with too complex of a network. Use CV to help avoid this.
What kind of functions can you represent with a NN of various layers?
Boolean: networks of threshold like units
Continuous (no discontinuities): 1 hidden layer with enough hidden units. Each hidden unit can worry about one little patch of the function and they get patched together at the output.
Arbitrary - with discontinuities: 2 hidden layers
What is the preference bias of NNs (which hypotheses you would prefer given the options)?
Prefer small weights, because larger weights mean more complexity (Occam’s razor).
What is instance based learning?
Lazy learner - store all of the training data. Don’t fit or do anything like that. Computation is done during prediction instead of fitting.
For KNN, what is distance?
It’s really a ‘similarity’ metric, but it could be defined as something like Euclidean (‘as the crow flies’) or Manhattan distance.
Which components in KNN represent your domain knowledge?
The distance metric and the number of neighbors.
How does classification work for KNN?
Find the k nearest neighbors as determined by your distance metric. Then you could use voting or a weighted vote.
How does regression work for KNN?
Find the k nearest neighbors as determined by your distance metric. Then take the mean of the y values for the nearest neighbors (or weighted mean, etc).
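Both cards can be sketched with one helper (Euclidean distance and the names are my choices):

```python
from collections import Counter

def knn_predict(train, query, k, mode="classify"):
    """Lazy learner: all the work happens at query time.
    `train` is a list of (point, label) pairs; squared Euclidean distance
    is enough since only the ordering of neighbors matters."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    neighbors = sorted(train, key=lambda pl: dist2(pl[0], query))[:k]
    labels = [label for _, label in neighbors]
    if mode == "classify":
        return Counter(labels).most_common(1)[0][0]  # plurality vote
    return sum(labels) / k                           # mean for regression

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((5, 5), 1), ((5, 6), 1)]
print(knn_predict(train, (0.2, 0.2), k=3))                # 0 (vote)
print(knn_predict(train, (5, 5.5), k=2, mode="regress"))  # 1.0 (mean)
```

Weighted voting or a weighted mean would scale each neighbor’s contribution by inverse distance instead.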
What’s the time complexity of KNN for learning and queries?
For learning it’s always constant - you just store the data.
Query time is O(log n + k) (given sorted 1-D data): you can find the query point in log_2(n) time with binary search, then bounce back and forth to collect your k points.
If k is on the order of n/2, the k term dominates the log_2(n) and queries become O(n) instead of O(log n).
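For 1-D data kept in sorted order, the find-then-bounce idea looks like this (a sketch; the names are mine):

```python
import bisect

def knn_1d(sorted_xs, q, k):
    """Find the k values nearest to q in a sorted list in O(log n + k):
    binary-search the insertion point, then expand outward, taking
    whichever side is closer at each step."""
    i = bisect.bisect_left(sorted_xs, q)   # O(log n)
    lo, hi = i - 1, i
    out = []
    for _ in range(k):                     # O(k) bounces
        if lo < 0:
            out.append(sorted_xs[hi]); hi += 1
        elif hi >= len(sorted_xs):
            out.append(sorted_xs[lo]); lo -= 1
        elif q - sorted_xs[lo] <= sorted_xs[hi] - q:
            out.append(sorted_xs[lo]); lo -= 1
        else:
            out.append(sorted_xs[hi]); hi += 1
    return out

print(knn_1d([1, 3, 4, 8, 10, 12], 5, k=3))  # [4, 3, 8]
```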
What is the preference bias for KNN?
- Locality: assumes near points are similar
- Smoothness: averages
- All features matter equally (because of the distance metric). Example: if your real function is x1^2 + x2, then in your distance function you could square the difference in x1 instead of just using the linear difference.
What’s the curse of dimensionality?
As the number of features (dimensions) grows, you need exponentially more data to generalize accurately.
This applies to all ML problems. The idea being that you want your data points to represent the same ‘amount’ of the dimensional space.
1D:
| x x x | must go to:
2D:
| x x x |
| x x x |
| x x x |
When you add another feature.
What is locally weighted regression (KNN)?
Performing a local regression on the nearest neighbors where the closest ones have the most influence. So you’re building concepts around the local neighbors.
What’s the theorem of no free lunch?
The No Free Lunch theorem says that, averaged over all possible problems, no method will do better than random.
What is ensemble learning?
Use weak learners to create a bunch of small rules over small parts of the dataset (random small subset) and then combine them all to create a complex rule.
What is bagging/bootstrapping?
Create weak learners, train each on a random subset of the data (a bootstrap sample - drawn with replacement), then average the hypotheses learned by each.
What is a weak learner?
No matter what the distribution over the data is, a weak learner will do better than chance: its error rate is always strictly < 0.5.
What is boosting?
Start with a distribution of 1/n.
Train a weak learner on the dataset, then redistribute the likelihood of drawing samples from D so that the next weak learner has a higher probability of drawing the samples the previous learner got wrong. Repeat this process for many weak learners. In the end, make a weighted combination of the weak learners to form the final hypothesis.
Boosting says that your classification labels are {-1, +1}
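A minimal AdaBoost-style sketch on 1-D data with threshold ‘stumps’ as the weak learners. The structure follows the card: D starts at 1/n, mistakes get upweighted each round, and the final hypothesis is a weighted combination passed through sign. All names are mine:

```python
import math

def boost(xs, ys, rounds):
    """Labels are {-1, +1}. D starts uniform at 1/n; each round we fit the
    best threshold stump under D, then reweight D toward the mistakes."""
    n = len(xs)
    D = [1.0 / n] * n
    ensemble = []  # (alpha, threshold, polarity)
    for _ in range(rounds):
        # weak learner: h(x) = polarity if x >= threshold else -polarity,
        # chosen to minimize the weighted error under D
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(d for d, x, y in zip(D, xs, ys)
                          if (pol if x >= t else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, t, pol))
        # upweight what the stump got wrong, downweight what it got right
        D = [d * math.exp(-alpha * y * (pol if x >= t else -pol))
             for d, x, y in zip(D, xs, ys)]
        z = sum(D)
        D = [d / z for d in D]
    return ensemble

def predict(ensemble, x):
    s = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if s >= 0 else -1  # the final sign is what makes this non-linear

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates this
print([predict(boost(xs, ys, 5), x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

No single stump can classify this labeling, but the weighted combination of a few can.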
How is boosting like other algorithms?
Like KNN, because it takes simple hypotheses and builds a more complex one (as locally weighted regression does in KNN).
Weighted combination of things is like NN’s.
Why is Boosting non-linear?
Because you pass it through the sgn function at the end which is non-linear.
Why does boosting tend to do well? What’s the intuition?
You’re using weak learners that must each get strictly more than 50% of the classifications correct under the current distribution. At each step you redistribute so that the samples the previous learner got wrong are more likely to be drawn. So the next weak learner focuses on the problems the previous learner got wrong, and it must still get more than 50% of them correct. This means you should always be improving.