Midterm Flashcards
What is the Bayes Optimal Classifier?
A system that classifies new instances by weighting each hypothesis's prediction by that hypothesis's posterior probability.
v* = argmax_{vj in V} sum_{hi in H} P(vj|hi) P(hi|D)
The optimal classifier may not equal the MAP hypothesis.
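A minimal Python sketch of that last point, with made-up posteriors and per-hypothesis predictions: the MAP hypothesis alone says "+", but the posterior-weighted vote says "-".

```python
# Toy example: Bayes optimal classification can disagree with the MAP hypothesis.
# Posteriors P(h|D) and each hypothesis's predicted label are made up for illustration.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

# MAP hypothesis: the single most probable hypothesis.
h_map = max(posteriors, key=posteriors.get)
print("MAP hypothesis predicts:", predictions[h_map])  # "+"

# Bayes optimal: weight every hypothesis's vote by its posterior.
votes = {}
for h, p in posteriors.items():
    votes[predictions[h]] = votes.get(predictions[h], 0.0) + p
print("Bayes optimal predicts:", max(votes, key=votes.get))  # "-" (0.6 > 0.4)
```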
How does an ensemble learn a dataset well?
By training on random subsets and combining the results, it reduces overfitting and can outperform learning directly from the full dataset.
Why is boosting really good?
agnostic to the learner
combines simple learners into something complex
minimizes overfitting - in practice.
What is Hill Climbing algorithm?
Guess x in X
Repeat:
Let n* = argmax_n in N(x) f(n)
if f(n*) > f(x): x = n*
else : stop
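A minimal Python sketch of this loop; the integer objective and the step-by-one neighborhood are made up for illustration.

```python
import random

def hill_climb(f, neighbors, x):
    """Greedy hill climbing: move to the best neighbor until no neighbor improves f."""
    while True:
        best = max(neighbors(x), key=f)   # n* = argmax over the neighborhood
        if f(best) > f(x):
            x = best
        else:
            return x                      # local optimum: no improving neighbor

# Example: maximize f(x) = -(x - 7)^2 over the integers, stepping by +/- 1.
f = lambda x: -(x - 7) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climb(f, neighbors, random.randint(0, 100)))  # 7
```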
What is the bias of Decision Trees?
ID3 performs a simple-to-complex hill-climbing search
Inductive Bias - prefers good splits at the top, prefers correct over incorrect, prefers shorter trees over longer trees (a consequence of good splits at the top)
What is MIMIC?
Directly model probability distribution
Search through space and refine estimate of distribution
Structure can be conveyed by refinement of distribution
Generate samples from P_theta_t(x) (population)
Set theta_t+1 to the nth percentile of fitness
retain only those samples s.t. f(x) >= theta_t+1 (fittest)
estimate P_theta_t+1(x) (structure)
repeat
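A heavily simplified Python sketch of that loop. Real MIMIC estimates a dependency-tree distribution; here P(x) is assumed to be a product of independent per-bit marginals (closer to PBIL / the cross-entropy method), which keeps the sample / retain-fittest / re-estimate structure without the tree step.

```python
import numpy as np

def mimic_like(f, n_bits, pop=200, keep=0.5, iters=30, rng=np.random.default_rng(0)):
    """Simplified MIMIC-style loop with independent per-bit marginals (an assumption;
    real MIMIC models dependencies between bits)."""
    p = np.full(n_bits, 0.5)                            # initial uniform distribution
    for _ in range(iters):
        X = (rng.random((pop, n_bits)) < p).astype(int) # sample a population
        scores = np.array([f(x) for x in X])
        theta = np.quantile(scores, 1 - keep)           # nth-percentile fitness cutoff
        X = X[scores >= theta]                          # retain only the fittest
        p = X.mean(axis=0).clip(0.05, 0.95)             # re-estimate the distribution
    return p

# Example: one-max (count of 1 bits); the marginals should drift toward 1.
print(mimic_like(lambda x: x.sum(), n_bits=10).round(2))
```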
Perceptron and Boolean representation
AND
OR
NOT (one variable)
XOR - not a single perceptron, but a network (OR - AND in Quiz example)
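A quick Python sketch with one workable choice of weights and thresholds (many other choices work), including XOR built as a small network of units.

```python
def perceptron(weights, theta, x):
    """Threshold unit: fire (1) iff the weighted sum reaches theta."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= theta)

# One common choice of weights/thresholds:
AND = lambda x: perceptron([1, 1], 2, x)    # fires only when both inputs are 1
OR  = lambda x: perceptron([1, 1], 1, x)    # fires when at least one input is 1
NOT = lambda x: perceptron([-1], 0, x)      # single input, negative weight

# XOR is not a single perceptron, but a network: OR(a, b) AND NOT(AND(a, b)).
XOR = lambda x: AND([OR(x), 1 - AND(x)])
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR([a, b]))      # 0, 1, 1, 0
```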
What are approaches to optimization?
Generate and test - small input space, complex function
Calculus - function has a derivative that can be set to zero and solved
Newton’s Method - function has derivative, iteratively improve, single optimum
What is the relevance of the value C in SVM?
C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error term. If C is too small, then insufficient stress is placed on fitting the training data. If C is too large, then the algorithm will overfit the training data. (C is the penalty for errors.)
What is an instance-based learning algorithm?
KNN - parameters K and distance (similarity)
What is Naive Bayes?
Probability of a value given attributes
P(V|A1, A2, A3,…,An) = product_i P(Ai|V) * P(V) / Z (Z - normalize)
MAP Class = argmax_v P(v) product_i P(Ai|V)
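A minimal Python sketch of the MAP-class computation; the class names and fitted probabilities are made up for illustration.

```python
from math import prod  # Python 3.8+

# Hypothetical fitted parameters: priors P(v) and per-attribute likelihoods P(ai=1|v).
priors = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.2, 0.3, 0.5],
}

def map_class(attrs):
    """argmax_v P(v) * prod_i P(ai|v); Z is omitted since it is the same for every v."""
    def score(v):
        return priors[v] * prod(p if a else 1 - p for p, a in zip(likelihood[v], attrs))
    return max(priors, key=score)

print(map_class([1, 1, 0]))  # "spam" under these made-up numbers
```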
What is Radial Basis Function?
RBF networks are a type of ANN constructed from spatially localized kernel functions. Blend of instance based and ANN approaches.
What is the KNN algorithm?
Given: Training Data D = {xi, yi}
distance metric d(q,x)
number of neighbors k
query point q
NN = {i : d(q, xi) is among the k smallest}
Return
For classification: Vote of the yi (plurality, mode)
For regression: mean, or weighted mean
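A minimal Python sketch, assuming Euclidean distance for d; it covers both the plurality-vote and mean cases.

```python
import numpy as np

def knn(D_x, D_y, q, k, d=lambda q, x: np.linalg.norm(q - x), classify=True):
    """Return the k-NN prediction for query q given training data (D_x, D_y)."""
    dists = np.array([d(q, x) for x in D_x])
    nn = np.argsort(dists)[:k]                 # indices of the k smallest distances
    if classify:
        vals, counts = np.unique(D_y[nn], return_counts=True)
        return vals[np.argmax(counts)]         # classification: plurality vote
    return D_y[nn].mean()                      # regression: mean of the neighbors

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0, 0, 0, 1])
print(knn(X, y, np.array([1.5]), k=3))         # 0
```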
What is conditional entropy?
H(y|x) = - sum_{x,y} P(x,y) log P(y|x)
If x is independent of y:
H(y|x) = H(y)
H(x,y) = H(x) + H(y)
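A small Python sketch checking the independence property numerically from a joint table (base-2 logs, so entropies are in bits).

```python
import numpy as np

def cond_entropy(P_xy):
    """H(Y|X) = -sum_{x,y} P(x,y) log2 P(y|x), from a joint table P_xy[x, y]."""
    P_x = P_xy.sum(axis=1, keepdims=True)      # marginal P(x)
    P_y_given_x = P_xy / P_x
    nz = P_xy > 0                               # skip zero-probability cells
    return -np.sum(P_xy[nz] * np.log2(P_y_given_x[nz]))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Independent joint: P(x,y) = P(x)P(y), so H(Y|X) should equal H(Y).
P = np.outer([0.5, 0.5], [0.9, 0.1])
print(cond_entropy(P), entropy(np.array([0.9, 0.1])))  # both ~0.469
```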
What is the VC dimension of convex polygons?
number of parameters = infinite
VC dimension is infinite.
Place the points on a circle: for any labeling, a convex polygon through the positive points separates them from the negatives, so any number of points can be shattered. VC dim = # of points, i.e., infinite.
What is a Belief Network?
aka Bayes Net, aka Bayesian Network, aka Graphical Model
Draw the conditional independencies between variables
How powerful is a perceptron unit?
Returns 0 or 1 as a function of inputs
Weights matter
The perceptron computes a linear function of its inputs: it separates the input space with a (hyper)plane.
gradient descent - on the unthresholded output
robust to non-linear separability
The delta rule is the preferred choice for non-linearly separable data
Classification
Map input to label
Decision Trees
Ex - At a restaurant think of whether you should enter or not given input about the restaurant (cost, food type, etc.)
Representation - Leaves (answers), edges (values), nodes (attribute)
What is epsilon-exhaustion?
VS(S) is epsilon-exhausted iff error_D(h) <= epsilon for all h in VS(S)
Choose any h in VS(s) and you will have low error. We need to make sure the only things left in the version space have very low error.
Key points of SVMs
Want to maximize the margin between the boundaries
Want to find the decision boundary (between the margins)
y = wTx + b (label, weights, parameters of the plane)
0 = wTx + b (on the boundary) (+1/-1 on the margins)
Want perpendicular line between margins to be a maximum.
What is the learning/query runtime and space for 1-NN, K-NN and linear regression for a sorted data set?
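Assuming 1-D data kept sorted (so a query point can be located by binary search), the commonly quoted answers are:
1-NN: learning O(1), query O(log n), space O(n)
K-NN: learning O(1), query O(log n + k), space O(n)
Linear regression: learning O(n), query O(1), space O(1)
Lazy learners (NN) defer the work to query time; regression pays once up front and keeps only the fitted parameters.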
What is the sum rule for probabilities?
P(A u B) = P(A) + P(B) - P(A n B)
How are rules combined in ensemble learning?
draw subsets uniformly at random from the data
apply a learner to each subset
combine the learners' outputs, e.g., with the mean (see the sketch below)
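A minimal bagging-style Python sketch; the constant-predictor "learner" is made up just to have something to train.

```python
import random
import statistics

def bag(train, data, n_learners=10, rng=random.Random(0)):
    """Train each learner on a uniformly random subset, combine outputs with the mean."""
    learners = []
    for _ in range(n_learners):
        subset = rng.choices(data, k=len(data))   # sample uniformly with replacement
        learners.append(train(subset))
    return lambda x: statistics.mean(h(x) for h in learners)

# Toy "learner" (an assumption for illustration): predict the mean y of its subset.
train_const = lambda subset: (lambda x, c=statistics.mean(y for _, y in subset): c)
data = [(x, 2 * x) for x in range(10)]
print(bag(train_const, data)(5))
```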
Mistake Bounds
How many misclassifications can a learner make over an infinite run?
When do we stop in a decision tree?
When everything is classified correctly, unless noise.
When no more attributes.
What do we do if noise?
Don’t trust the data completely.
Too big a tree can overfit.
What is the maximum likelihood hypothesis?
h_ml = argmax_h in H of P(D | h)
The MAP Hypothesis with P(h) dropped because we consider all P(h) priors equal.
What are the biases of Genetic Algorithms?
locality of the bits matter (first 4 bits are related or last 4 bits are related)
sub parts of the space can be independently optimized.
What is Bayes Rule?
P(A|B) = P(B|A)P(A) / P(B)
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
What are properties of simulated annealing?
T -> 0: like hill climbing
T -> inf: like random walk
decrease temperature slowly
Boltzmann distribution - likely to be in a location of high fitness
Why does boosting tend to do well?
Boosting makes samples that aren’t working well be re-weighted to do better on.
After renormalization, the examples the hypothesis got wrong carry half the total weight, so the next weak learner (being better than chance) must get some of them right.
Errors go down and alphas go up, overtime.
Hypotheses that are more right have more vote.
Information gain: each round must pick up new information as it goes along, i.e., get right some of the things it got wrong in the past.
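A Python sketch of one round of the AdaBoost weight update (labels and predictions in {-1, +1}); note how the single mistake ends up carrying half the weight.

```python
import numpy as np

def adaboost_step(D, y, h_x):
    """One AdaBoost round: weighted error, vote alpha, and renormalized weights.

    D   : current weights over examples (sums to 1)
    y   : true labels in {-1, +1}
    h_x : weak hypothesis's predictions in {-1, +1}
    """
    eps = np.sum(D[y != h_x])                 # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - eps) / eps)     # more accurate -> bigger vote
    D = D * np.exp(-alpha * y * h_x)          # up-weight mistakes, down-weight hits
    return D / D.sum(), alpha                 # renormalize so weights sum to 1

D = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
h = np.array([1, 1, 1, -1])                   # gets example 3 wrong
D, alpha = adaboost_step(D, y, h)
print(D.round(3), round(alpha, 3))            # the mistake now carries weight 0.5
```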
Reinforcement Learning
Learning from delayed reward
What is a perceptron ANN?
Attributes and weights flow in
Activation: a = sum_{i=1}^{k} xi * wi
a >= theta (firing threshold)?
yes - y = 1
no - y = 0
Testing Set
Just like a training set. Look at candidate and see if it does the job. Train =/= Test
What are the biases of neural networks?
Restriction bias:
perceptron - only considers planes, half spaces
sigmoids - much more complex. not much restriction
Representations:
boolean - network of threshold like units
continuous - connected, no jumps - one hidden layer
arbitrary - stitch pieces together - two hidden layers
Preference bias:
initial weights (small, random values to minimize bias and give variability between runs)
prefer simpler explanations (occam’s razor)
What is unique to ANN regarding errors and cross-validation?
ANN is iterative
As the iterations continue, at risk of weights becoming too large and overfit
Error over iterations plot will show variance between train and validation data
Recommend stopping training before the train/validation data diverges
Instances
inputs. Vectors of values. The set of things you are looking for
What is a weak learner?
A weak learner is a learner that will always do better than chance
For all D, Pr_D[h(x) != c(x)] <= 1/2 - epsilon
epsilon - a really really small number <<< 1
What is backpropagation?
Computationally beneficial organization of the chain rule.
The errors flow backwards.
Error function can have many local optima. (Weights cannot change without making error worse, so stuck)
What is Entropy?
H(S) = - sum_s P(s) log P(s)
What are the advantages of random restart hill climbing?
Multiple tries to find a good starting place
Not much more expensive (constant factor)
What is the theorem of total probability?
If events are mutually exclusive with sum from i= 1 to n of P(Ai) = 1, then
P(B) = sum_ i = 1 to n of P(B|Ai)P(Ai)
What is the optimization problem for SVM?
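Maximize the margin subject to classifying the training data correctly:
minimize (1/2)||w||^2 s.t. yi(wT xi + b) >= 1 for all i
Equivalently, in the dual form (a quadratic program):
maximize W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j yi yj (xiT xj)
s.t. alpha_i >= 0 and sum_i alpha_i yi = 0
Then w = sum_i alpha_i yi xi; the xi with alpha_i > 0 are the support vectors.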
How do you deal with the threshold issue and avoid undifferentiable activation?
Sigmoid - Differentiable threshold
sigma(a) = 1 / (1 + e ^ -a)
goes to 0 when a -> -inf
goes to 1 when a -> inf
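As a quick sketch, the derivative sigma'(a) = sigma(a)(1 - sigma(a)) is what makes this differentiable threshold convenient for backpropagation.

```python
import numpy as np

def sigmoid(a):
    """Differentiable threshold: ~0 for very negative a, ~1 for very positive a."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative used in backprop: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0, 0.5, 1]
print(sigmoid_prime(0.0))                      # 0.25, steepest at the threshold
```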
What if target concept not in Hypothesis Space?
Learning scenario called “agnostic”
The learner isn't guaranteed a hypothesis in H that matches the target concept; it has to find the h in H that best matches the true concept.
m >= (1/(2 epsilon^2)) (ln|H| + ln(1/delta))
Like the equation for the non-agnostic case, but the 1/epsilon factor becomes 1/(2 epsilon^2).
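A worked example with made-up numbers: for |H| = 1000, epsilon = 0.1, delta = 0.05, m >= (1/(2 * 0.1^2)) (ln 1000 + ln 20) = 50 * (6.91 + 3.00) ≈ 496 samples.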
How expressive is a decision tree?
XOR with n boolean attributes. How many trees/rows in the truth table? How big is the truth table?
The tree needs exponentially many nodes (parity has no compact split structure)
2^n rows in the truth table
Each of the 2^n rows can independently output 0 or 1, so there are 2^(2^n) possible bit patterns, i.e., 2^(2^n) distinct functions the tree may need to represent.
What is the equation for the margin between decision boundaries in SVM?
M = 2 / ||W||
(from wT x1 + b = 1 and wT x2 + b = -1: wT(x1 - x2) = 2, so (wT/||w||)(x1 - x2) = 2/||w||)
What is the VC dimension of an interval on the real line?
2.
With 3 points on the line, no interval can label them:
+ - +
What is the relationship between PAC-learnability and VC dimension?
H is PAC-learnable iff VC dimension is finite.
What is simulated annealing?
Don’t always improve (exploit), sometimes you need to search (explore). Tradeoff
For finite set of iterations:
sample new point Xt in N(x)
jump to new sample with probability given by an acceptance probability function (hill climb or use Temperature to decide if to jump)
decrease temperature
Acceptance probability for a downward step is e^((f(x_new) - f(x))/T); at high temperature the exponent goes to 0, so e^0 = 1 and downward steps are taken. See the sketch below.
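A minimal Python sketch of the loop; the objective, neighborhood, starting temperature, and cooling rate are all made-up choices.

```python
import math
import random

def simulated_annealing(f, neighbor, x, T=10.0, cooling=0.99, iters=5000,
                        rng=random.Random(0)):
    """Maximize f: always accept uphill moves, accept downhill with prob e^(df/T)."""
    for _ in range(iters):
        x_new = neighbor(x, rng)
        df = f(x_new) - f(x)
        if df > 0 or rng.random() < math.exp(df / T):
            x = x_new                      # exploit (uphill) or explore (downhill)
        T = max(T * cooling, 1e-6)         # decrease temperature slowly
    return x

# Example: maximize a bumpy 1-D function; the global peak is at x = 0.
f = lambda x: -(x ** 2) + 10 * math.cos(x)
neighbor = lambda x, rng: x + rng.uniform(-1, 1)
print(round(simulated_annealing(f, neighbor, 50.0), 2))  # near 0
```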
Distance metrics for KNN
Euclidean: sqrt((x0 - x1)^2 + (y0 - y1)^2)
Manhattan: |x0 - x1| + |y0 - y1| (taxicab)
What are genetic algorithms?
Population of individuals
mutation - local search N(x)
cross over - the population as a whole holds information; combine attributes of different individuals to produce better ones
generations - iterations of improvement
P0 = initial population of size K
Repeat until converged
Compute fitness of all x in Pt
Select “most fit” individuals (top half - truncated selection, weighted prob - roulette wheel)
Pair up individuals, replacing “least fit” individuals via crossover/mutation
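A minimal Python sketch of this loop on one-max (count the 1 bits), using truncation selection, one-point crossover, and single-bit mutation; the population size and generation count are arbitrary.

```python
import random

def ga(f, n_bits=20, k=40, generations=50, rng=random.Random(0)):
    """GA sketch: truncation selection + one-point crossover + one-bit mutation."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(k)]
    for _ in range(generations):
        pop.sort(key=f, reverse=True)
        survivors = pop[: k // 2]                 # truncation: keep the top half
        children = []
        while len(survivors) + len(children) < k:
            a, b = rng.sample(survivors, 2)       # pair up fit individuals
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_bits)             # mutate one bit (local search)
            child[i] ^= 1
            children.append(child)
        pop = survivors + children                # replace the least fit
    return max(pop, key=f)

print(sum(ga(sum)))  # one-max: the best individual should be near all ones (20)
```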
How to find the line of best fit in polynomial regression
Co + C1X + C2X^2 + C3X^3 = y
Use matrix multiplication:
Xw = y
X^TXw = X^Ty
(X^T X)^-1 X^T X w = (X^T X)^-1 X^T y
w = (X^T X)^-1 X^T y
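A quick numpy sketch of the normal equations on made-up cubic data (in practice np.polyfit or np.linalg.lstsq is more numerically stable than the explicit inverse).

```python
import numpy as np

# Fit y = c0 + c1*x + c2*x^2 + c3*x^3 via the normal equations.
x = np.linspace(-2, 2, 30)
y = 1 + 2 * x - x ** 3 + np.random.default_rng(0).normal(0, 0.1, x.size)

X = np.vander(x, 4, increasing=True)     # columns: 1, x, x^2, x^3
w = np.linalg.inv(X.T @ X) @ X.T @ y     # w = (X^T X)^-1 X^T y
print(w.round(2))                        # ~[1, 2, 0, -1]
```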
Input spaces of regression
Scalar input, continuous out
Vector input, continuous out - include more input features: (size, distance from zoo). hyperplanes
discrete vector or scalar inputs - discrete quantities can be enumerated and mapped to numbers.