Midterm Flashcards
When would boosting overfit?
If the weak learners overfit! A bunch of things that overfit, when combined, can only overfit as well.
The other case is when there is pink noise (i.e., uniform noise).
How is SVM like KNN?
SVM is like KNN, except you've already done the work of figuring out which points matter: only the local points (the support vectors) matter, so you don't need all of the points.
KNN -> points that are close together ought to have similar labels
What is a notion of similarity in SVM?
In the quadratic programming formulation, there is a term x_i^T * x_j, the dot product of the two vectors. This tells you whether they are orthogonal (0), pointing in the same direction (large positive), or pointing in opposite directions (large negative). So it is a form of similarity.
What is the kernel trick?
K(x_i, x_j): your similarity metric. It's where you inject domain knowledge. It is a dot product that implicitly projects your data into a higher dimensional space where it can (could?) be linearly separable.
You don't actually have to transform your data into that higher dimensional space, though, because of the kernel trick.
The kernel should really capture your domain knowledge.
The datapoints (x_i, x_j) can be continuous or discrete!
What is an SVM kernel?
The kernel function is a function that can be represented as the dot product of a mapping function applied to the two inputs, K(x_i, x_j) = phi(x_i) · phi(x_j) (the kernel method).
What is the Mercer Condition?
“It acts like a distance or a similarity,” which means it's a well-behaved distance function.
How is boosting related to SVM?
As you add more learners, your error doesn’t necessarily change but you increase your confidence. This is like the margin in SVM.
Why do you want large margins in SVM?
Large margins mean the ability to generalize better and avoid overfitting.
What is Version space in ML?
Set of all hypotheses that are consistent with the data.
What are mistake bounds?
The number of misclassifications a learner can make over an infinite run.
Example: learning a conjunction of literals.
1) Assume each variable appears both positive and negated (A AND NOT A AND B AND NOT B AND …)
2) Given an input, compute the output; guess FALSE
3) The first time you're wrong, keep the bit pattern in your positive/negated table
4) Every time you're wrong thereafter, set all positive variables that were 0 to absent and all negated variables that were 1 to absent; go to 2
The idea is that whenever you see a true response, any bits that are flipped compared to the previous true response must be absent, because they don't contribute to the output.
You will make at most k + 1 mistakes (k being the number of variables).
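A minimal Python sketch of this learner (function and variable names are mine; only the algorithm above comes from the notes):

```python
def learn_conjunction(examples):
    """Mistake-bounded learner for a conjunction of literals.

    examples: iterable of (bits, label) pairs, bits a tuple of 0/1.
    Guess FALSE until the first positive example; keep that bit pattern
    as literals; on each later mistake, drop literals that flipped.
    """
    literals = None   # None = "always guess FALSE" (all 2k literals present)
    mistakes = 0
    for bits, label in examples:
        # Predict TRUE iff every remaining literal is satisfied.
        guess = literals is not None and all(bits[i] == v for i, v in literals.items())
        if guess == label:
            continue
        mistakes += 1
        if literals is None:
            # First mistake: keep the observed bit pattern (x_i if 1, NOT x_i if 0).
            literals = dict(enumerate(bits))
        else:
            # Literals that disagree with this positive example must be absent.
            literals = {i: v for i, v in literals.items() if bits[i] == v}
    return literals, mistakes
```

For k variables this makes at most k + 1 mistakes: one to fix the initial bit pattern, and each later mistake removes at least one literal.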
What is computational complexity?
How much computational effort is needed for a learner to converge.
Why is boosting resistant to overfitting?
1) It keeps re-weighting hard examples, driving low-confidence predictions toward high confidence even after training error stops improving
2) It maximizes the margin, and a large margin means it will be able to generalize well rather than overfit
What is the training error vs True error of a hypothesis?
Training error -> the fraction of training examples misclassified by h.
True error -> the fraction of examples that would be misclassified on a sample drawn from the distribution D.
What is Haussler’s theorem?
A way of bounding the true error as a function of the number of training examples:
m >= (1/epsilon) * (ln |H| + ln(1/delta))
(delta is the failure probability)
(epsilon is the error parameter)
There is a concern: the number of samples depends on the size of the hypothesis space. (What happens if you have a large, or infinite, hypothesis space?)
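A quick worked instance of the bound (the |H|, epsilon, and delta values are invented for illustration):

```python
from math import ceil, log

def haussler_samples(h_size, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((1 / epsilon) * (log(h_size) + log(1 / delta)))

# |H| = 1000, epsilon = 0.1, delta = 0.05:
# (1/0.1) * (ln 1000 + ln 20) ≈ 10 * 9.9 -> 100 samples suffice.
print(haussler_samples(1000, 0.1, 0.05))  # 100
```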
What are VC dimensions?
“What kind of data can this classifier classify?”. This fundamental question can be answered using a classifier’s VC dimension.
“the cardinality (size) of the largest set of points that the classification algorithm can shatter”
How do you find the VC dimension?
You put some number of points down, and then no matter how you label those points, you must be able to separate (shatter) them.
The VC dimension is often the number of parameters! For a d-dimensional hyperplane, VC = d + 1:
- a 1-D threshold (theta) has VC = 1
- an interval (a, b) has VC = 2
- two dimensions (w in R^2, theta) has VC = 3
- three dimensions has VC = 4
How do VC dimensions relate to PAC?
H is PAC-learnable if and only if its VC dimension is finite. You can't PAC-learn it if the VC dimension is infinite.
What is the VC of finite H space?
There is a log relation between the size of a finite hypothesis class and the VC dimension: d = VC(H) <= log_2 |H|
How does the VC dimension relate to sample complexity?
m >= (1/epsilon) * (8 * VC(H) * log_2(13/epsilon) + 4 * log_2(2/delta)) -> the infinite-H case
Like Haussler's theorem, except instead of ln |H| you use VC(H)… and there are several other changes.
The point is that you can calculate a minimum number of samples to be epsilon-exhausted if VC(H) is finite.
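The same kind of computation for the infinite-H case, transcribing the bound above (log base 2; VC(H) = 3 is just the 2-D halfplane example):

```python
from math import ceil, log2

def vc_samples(vc_dim, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (8*VC(H)*log2(13/epsilon) + 4*log2(2/delta))."""
    return ceil((1 / epsilon) * (8 * vc_dim * log2(13 / epsilon) + 4 * log2(2 / delta)))

print(vc_samples(3, 0.1, 0.05))  # about 1.9k samples for halfplanes in R^2
```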
What’s Bayes Rule?
Pr(h|D) = Pr(D|h) · Pr(h) / Pr(D)
Pr(D) -> prior on the data (a normalizing term)
Pr(D|h) -> easy to think about (how likely the data is under h)
Pr(h) -> prior on h (our domain knowledge)
It allows you to swap what's on the two sides of the bar.
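A toy numeric check of the rule (all numbers invented for illustration):

```python
# Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D), expanding Pr(D) over h and not-h.
p_h = 0.01             # prior on the hypothesis
p_d_given_h = 0.9      # how likely the data is if h is true
p_d_given_not_h = 0.1  # how likely the data is otherwise
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
print(p_d_given_h * p_h / p_d)  # ≈ 0.083: strong data, weak prior, modest posterior
```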
Describe Bayesian learning
Learn the best (most probable) hypothesis given the data and some domain knowledge.
For each h ∈ H: calculate Pr(h|D)
Output: h = argmax_h Pr(h|D)
What’s the MAP hypothesis?
Maximum a posteriori: h_MAP = argmax_h Pr(h|D)
What is the ML hypothesis?
h_ML = argmax_h Pr(D|h). We don't care about Pr(h) because all hypotheses are assumed equally likely (a uniform prior).
How does Bayesian learning and information theory fit together?
Transform h_MAP = argmax_h P(D|h) * P(h) using lg:
argmin_h [-lg P(D|h) - lg P(h)]
where each of these terms represents a length, à la information theory.
So -lg P(h) is like the length of your hypothesis (the 'size' of h), and -lg P(D|h) is a notion of misclassification, or 'error'.
I.e., it minimizes error using the simplest hypothesis. There is usually a tradeoff, though.
This is called minimum description length.
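Written out as a math sketch, the same steps (lg is monotone, so it preserves the argmax; negating flips argmax to argmin):

```latex
h_{\mathrm{MAP}} = \arg\max_h P(D \mid h)\,P(h)
                 = \arg\max_h \left[ \lg P(D \mid h) + \lg P(h) \right]
                 = \arg\min_h \left[ -\lg P(D \mid h) - \lg P(h) \right]
```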
How does Bayesian classification work?
A weighted vote over every h ∈ H, weighted by Pr(h|D).
Finding the most likely value (label), rather than the most likely hypothesis, is what we care about.
What is conditional independence?
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
What is normal independence?
Pr(x, y) = Pr(x) · Pr(y)
“The joint distribution between two variables is equal to the product of their marginals” - CL
What are belief networks?
They graphically represent the conditional independence relationships between all the variables in the joint distribution. Nodes -> variables. Edges -> dependencies.
How do you recover the joint distribution from a belief network?
You multiply the probabilities of the entire graph back together:
Pr(A,B,C,D,E) = Pr(A) · Pr(B) · Pr(C|A,B) · Pr(D|B,C) · Pr(E|C,D)
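A minimal sketch of that factorization with made-up conditional probability tables (every number is hypothetical):

```python
# Each node contributes P(node | parents), following
# Pr(A,B,C,D,E) = Pr(A) Pr(B) Pr(C|A,B) Pr(D|B,C) Pr(E|C,D).
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c = lambda a, b: 0.9 if (a and b) else 0.2           # Pr(C=True | A, B)
p_d = lambda b, c: 0.5 if b else (0.8 if c else 0.1)   # Pr(D=True | B, C)
p_e = lambda c, d: 0.7 if (c or d) else 0.05           # Pr(E=True | C, D)

def bern(p_true, value):
    """Turn Pr(X=True | parents) into Pr(X=value | parents)."""
    return p_true if value else 1 - p_true

def joint(a, b, c, d, e):
    return (p_a[a] * p_b[b] * bern(p_c(a, b), c)
            * bern(p_d(b, c), d) * bern(p_e(c, d), e))

print(joint(True, True, True, False, True))  # 0.3*0.6*0.9*0.5*0.7 ≈ 0.0567
```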
Why sample in Bayesian Learning?
Why have a distribution at all?
1) You can compute the probability of a value
2) You can generate values according to that distribution
Why sample in Bayesian Learning?
1) Simulation of a complex process
2) Approximate inference (not exact, because exact inference is hard)
3) Visualization
What is marginalization (stats)?
p(x) = sum_y p(x, y)
What is Naïve Bayes?
Naïve Bayes classifiers are probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features:
P(V | a_1, a_2, …, a_n) = (1/Z) · P(V) · prod_i P(a_i | V)
e.g., spam email classification.
A special case of a belief network where you've got a root node (the label) producing many different attribute values. Each attribute has no children and only has the root node as its single parent.
MAP class = argmax_V P(V) · prod_i P(a_i | V), i.e. the most likely class given the data that we've seen.
Naïve Bayes can handle missing attributes.
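A minimal counting-based Naïve Bayes sketch (toy data, no smoothing; all names are mine):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attributes, label). Returns class counts and
    per-attribute counts for estimating P(a_i | V)."""
    label_counts = Counter(label for _, label in examples)
    cond = defaultdict(Counter)  # (attr index, label) -> counts of values
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond[(i, label)][a] += 1
    return label_counts, cond

def map_class(attrs, label_counts, cond):
    """argmax_V P(V) * prod_i P(a_i | V); Z is shared, so we drop it."""
    total = sum(label_counts.values())
    def score(label):
        n = label_counts[label]
        s = n / total
        for i, a in enumerate(attrs):
            s *= cond[(i, label)][a] / n
        return s
    return max(label_counts, key=score)

data = [(("sunny", "hot"), "no"), (("rainy", "cool"), "yes"),
        (("sunny", "cool"), "yes"), (("rainy", "hot"), "no")]
print(map_class(("sunny", "cool"), *train(data)))  # -> "yes"
```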
Why is Naïve Bayes powerful?
1) When you have the naïve structure, inference is cheap
2) Few parameters (the number of parameters you need is linear in the number of attributes)
3) Easy to estimate the parameters with labeled data
4) Connects inference and classification
5) Empirically successful
6) Can do inference in both directions
Why is Naïve Bayes… naïve?
No free lunch.
The network doesn't represent the world most of the time, because it says all the attributes are conditionally independent given that you know the label (root node -> attributes). It doesn't model interrelationships between attributes.
It's OK because we don't care about the absolute probabilities that come out of it, only that we get the correct label (so the ordering of the probabilities is what matters).
Describe random restart hill climbing
- Guess some value, x
- Check both sides of your current position
- Move to the highest of your neighboring points
- Repeat until both of your neighbors are lower than you
- Once you’ve reached your local maximum, restart in some other random position and repeat
- Do this some number of iterations and output the best result
- In general this is an exploitation algorithm because it’s always going uphill. Maybe we could call the random restarts the exploration.
It's a way of dealing with local optima: once a local optimum is reached, try again starting from a randomly chosen x (see the sketch below).
1) Not much more expensive
2) Multiple tries to find a good starting place
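A minimal 1-D sketch over integers (the objective and the restart count are placeholders):

```python
import math
import random

def hill_climb(f, x, lo, hi):
    """Move to the better neighbor until both neighbors are no better."""
    while True:
        best = max((n for n in (x - 1, x + 1) if lo <= n <= hi), key=f)
        if f(best) <= f(x):
            return x            # local maximum reached
        x = best

def random_restart_hill_climb(f, lo, hi, restarts=20):
    """Climb from several random starting points; keep the best peak found."""
    peaks = [hill_climb(f, random.randint(lo, hi), lo, hi) for _ in range(restarts)]
    return max(peaks, key=f)

# A bumpy placeholder objective with several local maxima.
f = lambda x: 40 * math.sin(x / 7.0) - 0.5 * abs(x - 70)
print(random_restart_hill_climb(f, 0, 100))
```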
When might random hill climbing have trouble finding the solution?
If you've got a wide basin of attraction around a local optimum (one that is not the global optimum), then the majority of the time you won't reach the global optimum, especially if the global optimum has a narrow basin of attraction. It may take a very large number of restarts to find it.
Describe simulated annealing
Don’t always improve - sometimes you need to search.
A random search algorithm that trades off exploration and exploitation by starting off with more exploration and moving toward exploitation over time.
Algorithm:
• Sample a new point, x_t in neighborhood N(x)
• Compute probability of moving to that point as:
P(x, x_t, T):
• 1 if f(x_t) >= f(x) (neighbor is >= to you, make the move)
• e^((f(x_t) - f(x))/T) otherwise (neighbor is worse; make the move with this probability)
• Decrease Temperature, T
In that branch f(x_t) - f(x) is negative, so you've really got 1/e^(|difference|/T).
- When T is large, the exponent goes toward 0 and e^0 = 1, so it will move downhill.
- As the temperature decreases, the exponent becomes large in magnitude, making the probability very small.
So you start out with more exploration, accepting many downhill neighbors when T is large, and accept fewer downhill neighbors, choosing mostly to move uphill, as T gets smaller (see the sketch below).
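A compact sketch of that acceptance rule (neighborhood, cooling schedule, and objective are placeholders):

```python
import math
import random

def simulated_annealing(f, x, steps=10_000, t0=10.0, cooling=0.999):
    t = t0
    for _ in range(steps):
        x_t = x + random.choice((-1, 1))                  # sample x_t in N(x)
        if f(x_t) >= f(x):
            x = x_t                                       # uphill: always move
        elif random.random() < math.exp((f(x_t) - f(x)) / t):
            x = x_t                                       # downhill: move w.p. e^(diff/T)
        t = max(t * cooling, 1e-6)                        # decrease temperature T
    return x

f = lambda x: 40 * math.sin(x / 7.0) - 0.5 * abs(x - 70)
print(simulated_annealing(f, 0))
```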
What are the properties of simulated annealing?
T ->0 : like hill climbing
T ->infinity : random walk
Decrease T slowly
Related to the Boltzmann distribution
Describe Genetic Algorithms
A population is like parallel random restarts.
Start with an initial population of size K.
Repeat until converged (a sketch follows below):
* compute the fitness of all x ∈ P_t
* select the most fit individuals
* pair up individuals, replacing the least fit individuals via crossover/mutation
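A minimal bit-string GA following those steps (fitness, rates, and sizes are placeholders; crossover here is one-point, described on the next card):

```python
import random

def ga(fitness, n_bits=20, pop_size=30, generations=100, mut_rate=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # most fit first
        survivors = pop[: pop_size // 2]             # cull the least fit half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)     # pair up individuals
            cut = random.randrange(1, n_bits)        # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (random.random() < mut_rate) for b in child]  # mutate
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

print(ga(sum))  # placeholder fitness: count of 1 bits ("one-max")
```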
What is one point crossover (GA’s)?
1) The locality of the bits matters
2) It assumes there are subparts of the space that can be independently optimized
A method of generating 'children' from 'parents' where one child has the first n bits from one parent and the remaining (length - n) bits from the other parent. Each child has the opposite set of bits from each parent.
This does imply some kind of bias about the locality of the bits: that some groups of bits should stay together.
What is Uniform Crossover (GA’s)?
Randomize at each bit position: each child independently takes that bit from one parent or the other (contrast with one-point crossover in the sketch below).
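The two crossover operators side by side, assuming parents are equal-length bit lists:

```python
import random

def one_point_crossover(p1, p2):
    """Split at one cut point; preserves locality of bit groups."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform_crossover(p1, p2):
    """Each position independently picks a parent; no locality bias."""
    picks = [random.random() < 0.5 for _ in p1]
    c1 = [a if take else b for a, b, take in zip(p1, p2, picks)]
    c2 = [b if take else a for a, b, take in zip(p1, p2, picks)]
    return c1, c2
```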
Describe the MIMIC algorithm
- Directly model the probability distribution
- Successively refine the model over time
- Generate samples from the current distribution P_theta_t(x) (starts out uniform)
- Set theta_{t+1} to the n-th percentile (the fittest samples)
- Retain only those samples such that f(x) >= theta_{t+1}
- Estimate the new distribution P_theta_{t+1}(x)
- Repeat
You're taking theta_min toward theta_max over time.
It's like the water rising around a bunch of mountainous islands: the peaks remain visible, and the water keeps rising until the peak(s) become single point(s) (if there are equally high peaks).
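A heavily simplified sketch of that loop. Real MIMIC estimates a dependency-tree distribution; this uses an independent-bit model instead (closer to PBIL), so it only shows the rising-threshold idea:

```python
import random

def mimic_like(fitness, n_bits=20, n_samples=200, keep_frac=0.5, iters=30):
    probs = [0.5] * n_bits                # theta_0: uniform distribution
    for _ in range(iters):
        # Generate samples from the current model P_theta_t(x).
        samples = [[int(random.random() < p) for p in probs]
                   for _ in range(n_samples)]
        samples.sort(key=fitness, reverse=True)
        fittest = samples[: int(n_samples * keep_frac)]   # f(x) >= theta_{t+1}
        # Re-estimate the distribution from the retained samples.
        probs = [sum(s[i] for s in fittest) / len(fittest) for i in range(n_bits)]
    return max(samples, key=fitness)

print(mimic_like(sum))  # placeholder fitness: one-max
```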
How is MIMIC like genetic algorithms?
- Generating samples from P_theta_t(x): this is like your population
- Keeping the n-th percentile of samples with f(x) >= theta_{t+1}: this is like culling the population in a GA