Midterm Flashcards
What is the Bayes Optimal Classifier?
A system that classified new instances by weighing the probabilities of values based on hypotheses.
argmax of vj in V is the sum_hi in H of P(vj|hi) P(hi | D)
The optimal classifier may not equal the MAP hypothesis.
How does an ensemble learn a dataset well?
Taking subsets it gets rid of overfitting and may outperform direct calculation
Why is boosting really good
agnostic to the learner
combining simple -> complex
minimizes overfitting - in practice.
What is Hill Climbing algorithm?
Guess x in X
Repeat:
Let n* = argmax_n in N(x) f(n)
if f(n*) > f(x): x = n
else : stop
What is the bias of Decision Trees?
ID3 performs simple-to-complex hill climbing searhc
Inductive Bias - Prefers goods splits at the top, prefers correct over incorrect, prefers shorter trees than longer trees (from good splits)
What is MIMIC?
Directly model probability distribution
Search through space and refine estimate of distribution
Structure can be conveyed by refinement of distribution
Generate sample from Ptheta_t(x) (population)
Set theta_t+1 to nth percentile
retain only those samples s.t. f(x) <= theta_t+1 (fittest)
estimate Ptheta_t+1(x) (structure)
repeat
Perceptron and Boolean representation
AND
OR
NOT (one variable)
XOR - not a single perceptron, but a network (OR - AND in Quiz example)
What are approaches to optimization?
Generate and test - small input space, complex function
Calculus - function has derivative, solvable = 0
Newton’s Method - function has derivative, iteratively improve, single optimum
What is the relevance of the value C in SVM?
C is a regularisation parameter that controls the trade-off between maximizing the margin and minimizing the training error term. If C is too small than insufficient stress will be placed on fitting the training data. If C is too large then the algorithm will overfit the training data. (C is penalty for errors)
What is an instance based learning algorithm
KNN - parameters K and distance (similarity)
What is Naive Bayes
Probability of a value given attributes
P(V|A1, A2, A3,…,An) = product_i P(A1|V) * P(V) / Z (Z - normalize)
MAP Class = argmax_v P(v) product_i P(Ai|V)
What is Radial Basis Function?
RBF networks are a type of ANN constructed from spatially localized kernel functions. Blend of instance based and ANN approaches.
What is the KNN algorithm
Given: Training Data D = {xi. yi}
distance metric d(q,x)
number of neighbors k
query point q
NN = {i: d(q,xi) k smallest}
Return
For classification: Vote of the yi (plurality, mode)
For regression: mean, or weighted mean
What is conditional entropy?
H(y | x ) = - sum ( P(x,y) log P(y|x) )
If x is independent of y:
H(y|x) = H(y)
H(x,y) = H(x) + H(y)
What is the VC dimension of convex polgyons
number of parameters = infinite
VC dimension is infinite.
Draw points on a circle (polygon) VC dim = # points on circle for polygon. VC dim = # of points.
What is a Belief Network?
aka Bayes Net, aka Bayesian Network, aka Graphical Model
Draw the conditional independencies between variables
How powerful is a perceptron unit?
Returns 0 or 1 as a function of inputs
Weights matter
Linear relationship means perceptron is a linear function that computes linear planes.
gradient descent
robust to non linear separability
(unthresholded output)
The delta rule is the preferred choice for non-linearly separable
Classification
Map input to label
Decision Trees
Ex - At a restaurant think of whether you should enter or not given input about the restaurant (cost, food type, etc.)
Representation - Leaves (answers), edges (values), nodes (attribute)
What is epsilon - exhaustion?
VS(s) is epsilon-exhausted iff for all h in VS(s) the error_D(h) <= epsilon
Choose any h in VS(s) and you will have low error. We need to make sure the only things left in the version space have very low error.
Key points of SVMs
Want to maximize the margin between the boundaries
Want to find decision boundary (betwen margins)
y = wTx + b (label, weights, parameters of the plane)
0 = wTx + b (on boundary) (+1/-2 on margins)
Want perpendicular line between margins to be a maximum.
What is the learning/query runtime and space for 1-NN, K-NN and linear regression for a sorted data set?
What is the sum rule for probabilities?
P(A u B = P(A) + P(B) - P(A n B)
How are rules combined in ensemble learning
take sets uniformly randomly to get subset
apply a learner
combine with mean
Mistake Bounds
how many misclassifications can a learner make over an infinite run
When do we stop in a decision tree?
When everything is classified correctly, unless noise.
When no more attributes.
What do we do if noise?
Don’t trust the data completely.
Can overfit with a tree too big
What is the maximum likelihood hypothesis?
h_ml = argmax_h in H of P(D | h)
The MAP Hypothesis with P(h) dropped because we consider all P(h) priors equal.
What are the bias’ of Genetic Algorithms?
locality of the bits matter (first 4 bits are related or last 4 bits are related)
sub parts of the space can be independently optimized.
What is Bayes Rule
P(A|B) = P(B|A)P(A) / P(B)
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
What are properties of simulated annealing?
T -> 0: like hill climbing
T -> inf: like random walk
decrease temperatore slowly
Boltzmann distribution - likely to be in a location of high fitness
Why does boosting tend to do well?
Boosting makes samples that aren’t working well be re-weighted to do better on.
The number of things getting wrong will have to be half right as the process is renormalized.
Errors go down and alphas go up, overtime.
Hypotheses that are more right have more vote.
Information gain, must pickup information as you go along. You have to pick up things that you got wrong in the past.
Reinforcement Learning
Learning from delayed reward
What is a perceptron ANN?
Attributes and weights flow in
Activation: Sum_i=1 to k_ Xi * Wi
>= theta (firing threshold)
yes - y = 1
no - y = 0
Testing Set
Just like a training set. Look at candidate and see if it does the job. Train =/= Test
What are the bias’ of neural networks
Restriction bias:
perceptron - only considers planes, half spaces
sigmoids - much more complex. not much restriction
Representations:
boolean - network of threshold like units
continuous - connected no jumps - hidden
arbitrary - stitch together - 2 hidden
Preference bias:
initial weights (small, random values to minimize bias and give variability between runs)
prefer simpler explanations (occam’s razor)
What is unique to ANN regarding errors and cross-validation?
ANN is iterative
As the iterations continue, at risk of weights becoming too large and overfit
Error over iterations plot will show variance between train and validation data
Recommend stopping training before the train/validation data diverges
Instances
inputs. Vectors of values. The set of things you are looking for
What is a weak learner
A weak learner is a learner that will always do better than chance
For all D, PrD[.] <= 1/2 - epsilon
epsilon - a really really small number <<< 1
What is backpropagation?
Computationally beneficial organization of the chain rule.
The errors flow backwards.
Error function can have many local optima. (Weights cannot change without making error worse, so stuck)
What is Entropy?
- sum P(s) log (P(S))
What are the advantages of random restart hill climbing?
Multiple tries to find a good starting place
Not much more expensive (constant factor)
What is the theorem of total probability?
If events are mutually exclusive with sum from i= 1 to n of P(Ai) = 1, then
P(B) = sum_ i = 1 to n of P(B|Ai)P(Ai)
What is the optimization problem for SVM?
How do you deal with the threshold issue and avoid undifferentiable activation?
Sigmoid - Differentiable threshold
sigma(a) = 1 / (1 + e ^ -a)
goes to 0 when a -> -inf
goes to 1 when a -> inf
What if target concept not in Hypothesis Space?
Learning scenario called “agnostic”
Learner doesnt have to have hypothesis in target space, but has to find the best one at matching true concept.
m > = (1/2epsilon^2)* ( ln(|H| + ln(1\delta) )
Like the equation for non-agnostic case but epsilon is a
How expressive is a decision tree?
XOR with n boolean attributes. How many trees/rows in the truth table? How big is the truth table?
O(n!) nodes (exponential)
output exponential
2^n rows (truth table)
2^(2^n) size truth table. How many different bit patterns? (2^(# positions)). 2 ^ (2^n)
What is the equation for the margin between decision boundaries in SVM?
M = 2 / ||W||
(wT/||W|| (x1 - x2) = 2)
VC dimension of example?
2.
With 3 points on the line, cannot create a range such that points are :
+ - +
What is the relationship between PAC-learnability and VC dimension?
H is PAC-learnable iff VC dimension is finite.
What is simulated annealing?
Don’t always improve (exploit), sometimes you need to search (explore). Tradeoff
For finite set of iterations:
sample new point Xt in N(x)
jump to new sample with probability given by an acceptance probability function (hill climb or use Temperature to decide if to jump)
decrease temperature
Higher temperature - e^0 willing to take downward steps
Distance metrics for K
Euclidean sqrt( x0 - x1)^2 + (y0 - y1)^2
Manhattan |(x0 - x1)| + |(y0 - y1)| (taxicab)
What are genetic algorithms?
Population of individuals
mutation - local search N(x)
cross over - population holds information (different**). Combine attributes to be better
generations - iterations of improvement
Po = initial population of size K
Repeat until converged
Compute fitness of all x in Pt
Select “most fit” individuals (top half - truncated selection, weighted prob - roulette wheel)
Pair up individuals, replacing “least fit” individuals via crossover/mutation
How to find the line of best fit in polynomial regression
Co + C1X + C2X^2 + C3X^3 = y
Use matrix multiplicate
Xw = y
X^TXw = X^Ty
(X^TX)^-1 X^TXw = (X^TX)^-1XTy
w = (X^TX)^-1XTy
Input spaces of regression
Scalar input, continuous out
Vector input, continuous out - include more input features: (size, distance from zoo). hyperplanes
discrete vector or scalar inputs - for discrete quantities can enumerate them into numbers.
Training Set
Set of all inputs paired with correct outputs (this is tall, this is not tall).
What is a dependency tree?
Tree where every node (minus root) has one parent
Every node depends on exactly one node.
Simplest inter-relationship
Concept
The function that we care about that will map the inputs to outputs. (instances to outputs).
Why should we sample?
Get distributions for probabiliy of values, generate values
simulates a complex process
approximate (faster than exact) inference (machine)
visualization, getting a feel (human)
What is the VC dimension of a d-dimensional hyperplane?
d + 1
Note: VC dimension is often number of parameters
What assumptions are required to derive that minimizing the sum of squared errors will output the maximum likelihood hypothesis?
training examples are pairs of where di = f(xi) + epsilon
epsilon error is iid with normal distribution and zero mean.
In practice, it may not always make sense that di contains error but the xs do not.
Version Space
True Hypothesis - c in H
Candidate Hypothesis - h in H
consistent learner - produces c(x) = h(x) for x in S
version space - VS(S) = {h s.t. h in H consistent wrt S}. set of hypothesis consistant with examples.
Optimizing Weights
To avoid local minima:
advanced optimization methods
momentum - use momentum to “bounce out”
higher order derivatives
randomized optimization
penalty for complexity (more nodes, layers penalty, larger number weights)
What is required for a Kernel?
Mercer Condition - Acts like a distance or acts like a similarity .
What is Ensemble Learning?
Come up with many simple rules, instead of one complicated rule.
Learn over a subset of data, combine for simple rule
repeat
combine all rules, complex rule
What is Cross Validation
Take some of the training data and act like it is test data. The left out set of data is the cross validation set.
Divide data into K - folds
Ex:
4 folds
train on 1,2,3 test on 4
train on 1,3,4 test on 2
train on 2,3,4 test on 1
train on 1,2,4 test on 3
Average across the folds. Model that performs the best on cross-validation would likely be best model for train data.
Deduction vs. Induction
Deduction - General rule to specifics
Induction - Examples to generate rule
What is a Brute Force Bayes Concept Learner?
Learner with
P(D) - |VS of H,D | / |H |
P(h | D) = 1/ |VS of H,D| if h is consistent with D, else 0
* This learner requires uniform priors, noise free data, and target concept c is in H
Which hypothesis spaces are infinite:
- linear separators
- artifical neural networks
- decision trees (discrete inputs)
- decision trees (continuous inputs)
- linear separators - inf number of lines
- artifical neural networks - inf number of weights
- decision trees (continuous inputs) - inf number of questions
Hypothesis Class
Set of all functions that you are willing to entertain. Could be all possible functions in the world (unlikely)
Regression considerations for decision trees?
Continuous outputs (not inputs). How do you measure information on continuous values?
Look at variance.
How to split? On leaves do a standard fitting algorithm like the average or linear fit or do a vote.
Target Concept
The thing we are trying to find. The answer. (ex: determine if something is car or not, person is M or F)
Regression
Map input space to real number
What is the specific algorithm for Decision tree
Loop
A
What is the MAP hypothesis?
maximum a posteriori
h_map ~= argmax_h in H P(h | D)
= argmax_h in H P(D | h) P(h) / P(D) (drop D, independent of H)
= argmax_h in H P(D | h) P(h)
What is Preference Bias
Only the Hypothesis we prefer from those we consider
Candidate
Concept that you think is the target concept
When is ANN useful?
Training data corresponse to noisy, complex sensor data. When the learned target function does not need to be understood by a person
What is the hypothesis space
H = {hi(x) = xi}
x = 10 bits
|H| = 10
input space = 2^10
Does it make sense to repeat an attribute along a path in the tree?
Yes, if the attribute is continuous you can use a different range or a discrete value from the attribute
What is the equation for the number of datapoints for an infinite hypothesis space?
What are the certainty and error goals?
epsilon : error goal between 0 and 1/2 “approximately”
delta: certainty goal between 0 and 1/2 “probably”
What is the curse of dimensionality
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially
How does boosting differ from baggin
Bagging learns over a subset of data (drawn randomly) and applies the learner then combines with a mean.
Boosting takes a subset of the data (hardest examples) and applies the learner. Finally, taking a weighted mean
Why is Naive Bayes cool?
Inference is cheap (in this case)
Few parameters
Estimate parameters with labeled data
Connects inference and classification
Empirically successful
Handles missing attributes pretty well
How can we define error?
accuracy - misatches / total (implies each error is equal)
Error : PrD[h(x) =/= c(x)]
probability given the underlying distribution that I will disagree on the true concept
Why is it called “Naive” Bayes
Does not model relationship between attributes
How can this work in practice? Weak relationships, enough attributes, may still get the correct label even if the probabilities are wrong.
Naive Bayes believes answer too much but doesn’t matter if it is right.
Why is boosting more robust to overfitting?
Margins are maximized as boosting continues. There is a reduction in the fraction of training examples with small margin which yields improvements in test error. The process of learning on the hardest examples continues even after the training error has reached zero.
Boosting Pseudocode
epsilon would be less than half (not even a half)
IID
Independent and identically distributed
Assumption of algorithms
Data is representative in test/train of the real world distribution
Unsupervised Learning
Derive structure from inputs
(Description)
What is the kernel in SVM?
A function that is used to represent similarity
Mechanism by which we inject domain knowledge into the SVM
How do you find the best dependency tree?
Find the maximum spanning tree using Prim.
MST over mutual information
What is the product rule of probabilities?
P (A n B) = p(A|B)P(B) = p(B|A)P(A)
How to determine best attribute for decision trees
information gain
Gain (S,A) = Entropy S - Sumv ( | Sv | / S ) Entropy (Sv)
(entropy of original collection - expected entropy after S is partitioned using A)
Maximum entropy on even split
Key points from the Matas and Sochman slides?
Adaboost
- converges to log likelihood ratio
- linear classifier
- many mods (TC AB
Advantages
simple
feature selection on large sets of features
generalization
disadvantages
suboptimal
can overfit in presence of noise
Instance Based Learning
Instead of training on data to produce a function then throwing the data out, store the data and perform a lookup against new values. Get back exactly what you put in.
Pros:
remembers, fast, simple
Cons:
no generalization, overfitting, affected by noise (2 inputs different outputs)
How to find the line of best fit in constant regression?
Minimize the sum of squared errors of the points
E(C) = Sum_i=1 to n_ (yi - c)^2
Take the derivative of E(C) and get c = sum of yi / n (mean)
What is joint entropy?
H(x,y) = - sum( P(x,y) log P(x,y) )
What is mutual information?
Measures relationship between x and y.
Measure reduction of randomness in a variable given some other variable.
I (x, y) = H(y) - H(y | x)
What is bagging?
Taking random subsets and combining by the mean
Regression
mapping continuous inputs to outputs
word comes from fitting a regression line (averages regress to the mean, ie. tall people having medium height children)
Locally weighted regression
K = n weighted average
Create local linear regression for different chunks of data. Can make a more complicated space based off of building simple hypothesis spaces .
perceptron rule
perceptron rule (threshold) -> single unit
While error:
Wi = Wi + deltaWi
deltaWi = n (y - ^y)Xi (n is learning rate)
^y = Sum_i_WiXi >= 0 (subtract theta from both sides, to become weight, bias term added so threshold is folded into weights (bias term = -1 theta) )
** perceptron rule will find the answer in finite iterations if linearly separable
What is bayesian learning?
Learn the best hypothesis given data and some domain knowledge
(best = most probable)
P(h|D)
D - data
h - particular hypothesis
argmax h in H [ P(h |D)
Difference between lazy and eager learners
Lazy - low complexity at learn, more at query
Eager - more complexity at learn, less at query
With equal hypothesis spaces, lazy learners can have a richer hypothesis space as it doesn’t commit to a particular space. Lazy learners allow for local approximations of the hypothesis space.
Lazy model complex target functions with lesss complex local approximations.
What is the VC dimension
The largest set of inputs that the hypothesis class can label in all possible ways (“shatter”)
Vapnik-Chervonenkis
The VC dimension helps to create a definition for the number of data points needed when a hypothesis class is infinite
m >= 1/epsilon(8 * VC(H) * log_2(13/epsilon) + 4log_2(2/delta)
What is a perceptron
How do deal with a decision tree that’s overfit?
Cross-validation : try different trees and see which has lowest error on validation set
More efficient : hold out a set and if error gets worse stop.
Pruning : when full tree is built, see if collapsing nodes up will lower or raise error. If raise, stop. Update output with a vote.
What is Pac Learning
Training err - frac of training misclassified by h
true err - frac of examples that would be misclassified on sample drawn from D
error_D(h) = Pr_x~D[c(x) =/= h(x)] - err on examples we will never see is ok.
Probably - Approximately - Correct
C is PAC-learnable by learner using Hyp space iff learner L will, with probability 1 - delta (certainty goal), output a hypothesis h in H such that error_D(h) <= epsilon (error goal) in time and samples polynomial (1/ epsilon, 1/delta, n)
What are the drawbacks of SA,GA, and RHC?
No memory of where you’ve been at where you are
Decision Tree and Errors or Missing data
Decision Trees are robust to errors (all training used at each step and termination criteria can be updated to accept hypotheses that imperfectly fit data) or missing data
Computational Complexity
computational effort for confergence
In P(h|D) = P(D|h)P(h) / P(D)
what do the terms mean?
P(D|h) - the probably of seeing the data given that the hypothesis true. Assume the X’s are given, the labels are what we are trying to assign probability to. Given set of X’s, whats the probability I would see a particular label. Easier to compute the probability of seeing a label.
P(D) - prior on the data
P(h) = prior on the hypothesis drawn from the hypothesis space.
prior is the domain knowledge
What is the role of the learning rate in ANN ?
Moderate the degree to which weights are changed at each step. May decay as iterations increase.
What is entropy
A measure of randomness
A lot of entropy (1 bit) for a fair coin. Unfair coin has no entropy (0 bit)
- Sumv [p(v) log p(v)] (sum over all possible values)
What is computational learning theory?
Defining learning problems
showing specific algorithms work
show these problems are fundamentally hard
What is the final hypothesis of Boosting?
Hfinal (X) = sgn(Sum alpha T ht(x))
alpha T - more weight if doing well
Sample Complexity
batch. how many training examples are needed for a learner to create a successful hypothesis
What is conditional independence?
X is conditionally indpendent of Y given Z if the probability distribution governing X is independent of the value of y given the value of Z, that is if
For all X,Y,Z P(X=x|Y=y,Z=z) = P(X=x|Z=z)
in short: P(X|Y,Z) = P(X|Z)
What is the restrictive assumption built into the PAC-learnable definition?
The definition implicitly assumes that the learner’s hypothesis space contains a hypothesis with arbitrarily small error for every target concept in C.
Need to consider a second mesure of the complexity of H, VC dimension. We can state bounds on sample complexity that use VC(H) rather than |H|.
What is KL Divergence?
Measures the difference between two distributions. Distance measure.
Always non-negative and zero (when p = q)
D( p || q) = integral p(x) log (p(x)/q(x))
How do decision trees handle boolean functions?
Boolean Functions
- AND 2 nodes, commutative (A or B order doesn’t matter)
- OR 2 nodes, commutative, linear in n
- XOR 3 nodes, need to split twice, 2^n - 1 nodes, exponential (HARD)
What is the Minimum Description Length (MDL)?
Start with hMAP = argmax P (D|h) P(h)
= argmax [lg P (D|h) + lg P(h)]
= argmin [- lg P (D|h) - lg P(h)]
First arg -lg P(D|h) => length D|h). Bits needed to describe “size of miscalculation/err”
Second arg -lg P(h) => length(h). Bits needed to describe “size of h”
This provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis.
Helps with overfitting.
Practical Matters of Mimic
Mimic does well with structure
Representing P_theta for all theta
Local optima problem (restarts free, probability theory)
takes a long time - orders of magnitude fewer iterations, but longer per iteration
more information per iteration
works well when cost of evaluating fit function is high
What is the preference bias for KNN
Locality - near points are simillar
smoothness - averaging
All features matter equally
What is case-based reasoning?
An instance-based learning method in which instances (cases) may be rich relational descriptions and in which the retrieval and combination of cases to solve the current query may rely on knowledge based reasoning and search-intensive problem-solving methods. (ex: Cadet device design)
What is the distribution for Boosting as the process goes?
D1 (i) = 1/n (uniform at first)
Dt+1(i) = Dt(i) * e ^ (-alphaT yi ht (xt) / zt (normalization)
where alphaT = 1/2 ln (1 - err t)/err t (positive)
ht = -1/+1
yi = -1/+1
y*h = +1 if agree, -1 if disagree
Total product: - if they agree
Total product: + if they disagree
when does boosting tend to overfit?
If a weak learner uses ANN with many layers and nodes
pink noise - uniform noise
underlying weak learner that overfits
What resources matter in computational learning theory
time, space, data
What are ways of selecting training examples
learner/teacher - learner asks Qs.
Teacher gives exs.
Fixed distribution (nature)
Evil - worst distribution
What is the Restriction Bias
Only the Hypothesis we consider
Decision Tree Algorithm
- Pick best attribute (split data)
- Asked Question
- Follow answer path
- Go to 1, repeat to answer
VC dimension of example?
VC dimension is 3
Collinear case is tricky + - + for points on y = 0. But make use of the second dimension. So 3 points can be shattered.
Four points cannot. See example.
What is Haussler Theorem
This is a theorem for bounding the true error as a function of the number of examples that are drawn. It provides a genral bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H
m >= (1/epsilon) * (ln|H| + ln(1/delta) )
What are types of learner queries?
Learners with constrained bounds (X and Y and Not Z)
Learner with mistake bounds (X and Not X):
1 - asume possible each variable is positive and negated
2 - given input, compute output
3 - if wrong, set + to - and - to +. Go to 2
(k + 1) mistakes
Sample
Training Set
What are key facts to the optimizatio problem for SVM
most alphas are 0,
most w’s don’t matter.
most data points are not support vectors