Midterm Flashcards
What is the Bayes Optimal Classifier?
A system that classifies new instances by weighting each possible value by the posterior probabilities of all hypotheses.
v_OB = argmax_{vj in V} sum_{hi in H} P(vj | hi) P(hi | D)
The optimal classification may differ from the classification given by the MAP hypothesis.
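A minimal sketch of that weighted vote in Python, with hypothetical toy numbers for P(h|D) and P(v|h); these numbers also illustrate the point above, since the MAP hypothesis h1 predicts "+" but the optimal classification is "-":

# Bayes optimal classification: weight each value's probability by each
# hypothesis's posterior, then take the argmax over values.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}            # P(h | D), toy values
predictions = {"h1": {"+": 1.0, "-": 0.0},                # P(v | h) for each hypothesis
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(values, posteriors, predictions):
    score = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
             for v in values}
    return max(score, key=score.get)

print(bayes_optimal(["+", "-"], posteriors, predictions))  # "-" wins 0.6 vs 0.4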
How does an ensemble learn a dataset well?
By learning on subsets of the data and combining the resulting simple hypotheses, it reduces overfitting and can outperform learning a single hypothesis on the full dataset directly.
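A minimal bagging-style sketch of "learn on subsets, then combine", assuming scikit-learn decision trees as the base learner (dataset and number of trees are arbitrary):

# Bagging sketch: train one tree per bootstrap subset, then take a majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))              # bootstrap subset
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])              # (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)       # plurality for 0/1 labels
print((ensemble_pred == y).mean())                           # combined training accuracy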
Why is boosting really good?
Agnostic to the underlying weak learner
Combining simple hypotheses yields a complex hypothesis
Tends not to overfit in practice
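A minimal AdaBoost-style sketch, assuming scikit-learn decision stumps as the weak learner (dataset and number of rounds are arbitrary):

# AdaBoost-style loop: reweight examples so later stumps focus on earlier mistakes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
y = 2 * y - 1                                    # use labels in {-1, +1}
w = np.full(len(X), 1 / len(X))                  # uniform example weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                     # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)               # up-weight misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

final = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print((final == y).mean())                       # accuracy of the combined hypothesis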
What is Hill Climbing algorithm?
Guess x in X
Repeat:
Let n* = argmax_n in N(x) f(n)
if f(n*) > f(x): x = n*
else : stop
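A runnable version of the loop above, assuming f is the function to maximize and neighbors(x) returns the neighborhood N(x) (both names are illustrative):

# Hill climbing: move to the best neighbor until no neighbor improves.
def hill_climb(x, f, neighbors):
    while True:
        best = max(neighbors(x), key=f)    # n* = argmax over N(x)
        if f(best) > f(x):
            x = best                       # climb
        else:
            return x                       # local optimum reached

# Example: maximize -(x - 3)^2 over integers, stepping by 1.
print(hill_climb(0, lambda x: -(x - 3) ** 2, lambda x: [x - 1, x + 1]))  # -> 3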
What is the bias of Decision Trees?
ID3 performs a simple-to-complex hill-climbing search through the space of trees
Inductive bias - prefers good splits (high information gain) near the top, prefers correct over incorrect trees, and prefers shorter trees to longer trees (a consequence of putting good splits at the top)
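A small sketch of the information-gain computation behind "good splits", written directly from the entropy definition (toy Boolean data assumed):

# Information gain = H(S) - sum_v |S_v|/|S| * H(S_v), the quantity ID3 maximizes per node.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [y for a, y in zip(attribute_values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy example: an attribute that perfectly splits the labels has maximal gain.
print(information_gain([0, 0, 1, 1], ["-", "-", "+", "+"]))  # 1.0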
What is MIMIC?
Directly model probability distribution
Search through space and refine estimate of distribution
Structure can be conveyed by refinement of distribution
Generate samples from P_theta_t(x) (population)
Set theta_t+1 to the nth percentile of the samples' fitness
Retain only those samples s.t. f(x) >= theta_t+1 (the fittest)
Estimate P_theta_t+1(x) from the retained samples (structure)
repeat
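A simplified, runnable sketch of this loop, with numpy assumed. Real MIMIC estimates a dependency-tree distribution; the sketch below uses independent per-bit probabilities (closer to a univariate EDA) just to show the sample -> threshold -> re-estimate cycle:

# Simplified MIMIC-style loop over bit strings.
import numpy as np

def onemax(x):                          # fitness: number of ones (to maximize)
    return x.sum(axis=1)

rng = np.random.default_rng(0)
n_bits, n_samples = 20, 200
p = np.full(n_bits, 0.5)                # initial distribution P_theta(x)

for _ in range(30):
    samples = (rng.random((n_samples, n_bits)) < p).astype(int)   # population
    fitness = onemax(samples)
    theta = np.percentile(fitness, 50)                            # nth percentile
    fittest = samples[fitness >= theta]                           # keep the fittest
    p = fittest.mean(axis=0).clip(0.05, 0.95)                     # re-estimate distribution

print(p.round(2))    # per-bit probabilities drift toward 1 as the loop converges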
Perceptron and Boolean representation
AND
OR
NOT (one variable)
XOR - not representable by a single perceptron, but by a network (OR minus AND in the quiz example); see the sketch below
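A small sketch with one common choice of weights and thresholds (other choices work too):

# Single threshold units for AND, OR, NOT; XOR built as OR(x1, x2) minus AND(x1, x2).
def unit(weights, threshold):
    return lambda *x: int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

AND = unit([1, 1], 2)        # fires only when both inputs are 1
OR  = unit([1, 1], 1)        # fires when at least one input is 1
NOT = unit([-1], 0)          # fires when the single input is 0

def XOR(x1, x2):             # two-layer network: OR output minus AND output, thresholded at 1
    return unit([1, -1], 1)(OR(x1, x2), AND(x1, x2))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))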
What are approaches to optimization?
Generate and test - small input space, complex function
Calculus - function has a derivative and setting the derivative equal to 0 is solvable
Newton’s Method - function has a derivative, iteratively improve the guess; works when there is a single optimum
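A short sketch of the Newton's Method idea for a 1-D maximum, assuming the first and second derivatives are available as functions:

# Newton's method for optimization: step x <- x - f'(x) / f''(x) until the update is tiny.
def newton_optimize(x, df, d2f, tol=1e-8, max_iter=100):
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: maximize f(x) = -(x - 2)^2, so f'(x) = -2(x - 2) and f''(x) = -2.
print(newton_optimize(10.0, lambda x: -2 * (x - 2), lambda x: -2.0))  # -> 2.0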
What is the relevance of the value C in SVM?
C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error term. If C is too small, then insufficient stress is placed on fitting the training data. If C is too large, then the algorithm will overfit the training data. (C is the penalty for errors.)
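A quick illustration of the effect of C, assuming scikit-learn is available (the dataset and C values are arbitrary):

# Small C -> wider margin, more training mistakes tolerated; large C -> fits the training data harder.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, flip_y=0.1, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y), len(clf.support_))   # training accuracy and number of support vectors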
What is an instance based learning algorithm?
KNN - parameters K and distance (similarity)
What is Naive Bayes?
Probability of a value given attributes
P(V | A1, A2, A3, ..., An) = P(V) * product_i P(Ai | V) / Z (Z - normalizing constant)
MAP Class = argmax_v P(v) product_i P(Ai|V)
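A minimal sketch of the MAP computation with hypothetical, pre-estimated probabilities:

# Naive Bayes MAP classification from toy prior and likelihood tables.
priors = {"spam": 0.3, "ham": 0.7}                          # P(v)
likelihoods = {"spam": {"free": 0.6, "meeting": 0.1},       # P(attribute | v)
               "ham":  {"free": 0.1, "meeting": 0.5}}

def map_class(attributes, priors, likelihoods):
    score = {v: priors[v] for v in priors}
    for v in priors:
        for a in attributes:
            score[v] *= likelihoods[v][a]                   # naive (independence) assumption
    return max(score, key=score.get)                        # argmax_v P(v) * prod_i P(ai | v)

print(map_class(["free"], priors, likelihoods))             # -> "spam" (0.18 vs 0.07)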
What is Radial Basis Function?
RBF networks are a type of ANN constructed from spatially localized kernel functions. Blend of instance based and ANN approaches.
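A tiny 1-D sketch of that blend: Gaussian kernels at fixed centers form the hidden layer, and the linear output weights are fit by least squares (all centers, widths, and data here are illustrative):

# RBF network sketch: spatially localized Gaussian features plus a linear output layer.
import numpy as np

def rbf_features(x, centers, width=0.5):
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)       # noisy target

centers = np.linspace(0, 2 * np.pi, 8)              # kernel centers (like stored instances)
Phi = rbf_features(x, centers)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # fit the linear output layer

print(np.abs(Phi @ w - np.sin(x)).max())            # approximation error stays small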
What is the KNN algorithm?
Given: Training Data D = {xi, yi}
distance metric d(q, x)
number of neighbors k
query point q
NN = {i : d(q, xi) is among the k smallest}
Return
For classification: Vote of the yi (plurality, mode)
For regression: mean, or weighted mean
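A plain implementation of this, assuming numeric feature vectors and Euclidean distance:

# k-NN: find the k nearest training points, then vote (classification) or average (regression).
import math
from collections import Counter

def euclidean(q, x):
    return math.dist(q, x)

def knn(D, q, k, d=euclidean, regression=False):
    neighbors = sorted(D, key=lambda xy: d(q, xy[0]))[:k]     # k smallest distances
    ys = [y for _, y in neighbors]
    if regression:
        return sum(ys) / len(ys)                              # mean of neighbor targets
    return Counter(ys).most_common(1)[0][0]                   # plurality vote

D = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
print(knn(D, (0.2, 0.1), k=3))   # -> "a"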
What is conditional entropy?
H(y|x) = - sum_{x,y} P(x,y) log P(y|x)
If x is independent of y:
H(y|x) = H(y)
H(x,y) = H(x) + H(y)
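A small check of both cases from a joint distribution table (toy distributions assumed):

# Conditional entropy computed directly from a joint table P(x, y).
import math

def conditional_entropy(joint):                     # joint[(x, y)] = P(x, y)
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p                  # marginal P(x)
    return -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items() if p > 0)

independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent   = {(0, 0): 0.5, (1, 1): 0.5}            # y is determined by x

print(conditional_entropy(independent))  # 1.0 = H(y), since x tells us nothing about y
print(conditional_entropy(dependent))    # 0.0, since x fully determines y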
What is the VC dimension of convex polygons?
Number of parameters is unbounded (a polygon can have arbitrarily many vertices)
VC dimension is infinite.
Proof sketch: place any number of points on a circle; for any labeling, the convex polygon whose vertices are exactly the positive points separates positives from negatives, so any number of points can be shattered.
What is a Belief Network?
aka Bayes Net, aka Bayesian Network, aka Graphical Model
A graph (nodes = variables, edges = direct dependencies) that encodes the conditional independence relationships between variables
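Worked factorization for a hypothetical two-child graph: with edges A -> B and A -> C, the joint distribution factors as P(A, B, C) = P(A) P(B|A) P(C|A), so B and C are conditionally independent given A.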
How powerful is a perceptron unit?
Returns 0 or 1 as a function of inputs
The weights and threshold determine the output
Because the activation is a linear combination of the inputs, a perceptron computes a linear decision boundary (a hyperplane; half-planes in 2D)
The perceptron training rule converges when the data are linearly separable
Gradient descent on the unthresholded output (the delta rule) is robust to non-linear separability
The delta rule is therefore the preferred choice for non-linearly separable data
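A minimal sketch of the delta rule (batch gradient descent on the unthresholded output), with an illustrative learning rate and AND as the target function:

# Delta rule: gradient descent on squared error of the unthresholded output w . x.
import numpy as np

def delta_rule(X, y, eta=0.1, epochs=1000):
    X = np.hstack([X, np.ones((len(X), 1))])         # fold the bias into the weights
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        error = y - X @ w                            # no thresholding during training
        w += eta * X.T @ error / len(X)              # gradient step on squared error
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)              # AND targets
w = delta_rule(X, y)
preds = np.hstack([X, np.ones((4, 1))]) @ w
print((preds > 0.5).astype(int))                      # -> [0 0 0 1]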
Classification
Map input to label
Decision Trees
Ex - deciding whether to enter a restaurant or not given inputs about the restaurant (cost, food type, etc.)
Representation - leaves (answers), edges (attribute values), nodes (attributes)
What is epsilon - exhaustion?
VS(S) is epsilon-exhausted iff for all h in VS(S), error_D(h) <= epsilon
Choose any h in VS(s) and you will have low error. We need to make sure the only things left in the version space have very low error.
Key points of SVMs
Want to maximize the margin between the boundaries
Want to find the decision boundary (between the margins)
y = wTx + b (label, weights, parameters of the plane)
0 = wTx + b (on the decision boundary); wTx + b = +1/-1 on the margins
Want the perpendicular distance between the margins, 2/||w||, to be a maximum.
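Worked step for where 2/||w|| comes from: take x1 on the +1 margin and x2 on the -1 margin, so wTx1 + b = 1 and wTx2 + b = -1. Subtracting gives wT(x1 - x2) = 2, and projecting x1 - x2 onto the unit normal w/||w|| gives a perpendicular distance of 2/||w||. Maximizing the margin is therefore equivalent to minimizing ||w||^2 / 2.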
What is the learning/query runtime and space for 1-NN, K-NN and linear regression for a sorted data set?
What is the sum rule for probabilities?
P(A u B) = P(A) + P(B) - P(A n B)
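Worked check with hypothetical numbers: P(A) = 0.5, P(B) = 0.4, P(A n B) = 0.2 gives P(A u B) = 0.5 + 0.4 - 0.2 = 0.7.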