Midterm Flashcards

1
Q

What is the Bayes Optimal Classifier?

A

A system that classifies new instances by weighing the probability of each value under every hypothesis:

v* = argmax_{vj in V} sum_{hi in H} P(vj | hi) P(hi | D)

The classification of the optimal classifier may not equal the classification of the MAP hypothesis.
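A minimal sketch of this weighted vote in code, with a made-up three-hypothesis space and invented posteriors; it also shows how the optimal label can disagree with the MAP hypothesis:

```python
# Hypothetical hypotheses with posteriors P(h|D) and label votes P(v|h).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # h1 is the MAP hypothesis
votes = {"h1": {"+": 1.0, "-": 0.0},
         "h2": {"+": 0.0, "-": 1.0},
         "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(labels=("+", "-")):
    # argmax_v sum_h P(v|h) P(h|D)
    scores = {v: sum(votes[h][v] * posteriors[h] for h in posteriors) for v in labels}
    return max(scores, key=scores.get)

print(bayes_optimal())   # '-': the MAP hypothesis h1 says '+', but the weighted vote says '-'
```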

2
Q

How does an ensemble learn a dataset well?

A

By learning over subsets of the data and combining the results, it reduces overfitting and may outperform learning a single rule over the whole data set directly.

3
Q

Why is boosting really good?

A

agnostic to the learner

combining simple -> complex

minimizes overfitting - in practice.

4
Q

What is Hill Climbing algorithm?

A

Guess x in X

Repeat:

Let n* = argmax_{n in N(x)} f(n)

if f(n*) > f(x): x = n*

else: stop
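A runnable version of this pseudocode (a sketch; the fitness f, the neighborhood N(x), and the toy maximization problem below are all made up):

```python
import random

def hill_climb(f, neighbors, x):
    """Move to the best neighbor until no neighbor improves f, then stop."""
    while True:
        n_star = max(neighbors(x), key=f, default=x)   # n* = argmax_{n in N(x)} f(n)
        if f(n_star) > f(x):
            x = n_star
        else:
            return x

# Toy problem: maximize f over the integers 0..100 with a +/-1 neighborhood.
f = lambda x: -(x - 37) ** 2
neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 100]
print(hill_climb(f, neighbors, random.randint(0, 100)))   # -> 37
```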

5
Q

What is the bias of Decision Trees?

A

ID3 performs a simple-to-complex hill climbing search
Inductive bias - prefers good splits at the top, prefers correct over incorrect, and prefers shorter trees over longer trees (a consequence of preferring good splits)

6
Q

What is MIMIC?

A

Directly model probability distribution

Search through space and refine estimate of distribution

Structure can be conveyed by refinement of distribution

Generate samples from P^{theta_t}(x) (population)

Set theta_{t+1} to the nth percentile of their fitness

Retain only those samples s.t. f(x) >= theta_{t+1} (the fittest)

Estimate P^{theta_{t+1}}(x) (structure)

Repeat
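A heavily simplified sketch of this loop. Real MIMIC estimates the distribution with dependency trees; here the structure step is replaced by independent per-bit marginals, and the fitness function and parameters are invented, so this only illustrates the sample / retain-the-fittest / re-estimate cycle:

```python
import numpy as np

def mimic_sketch(f, n_bits=20, pop=200, keep=0.5, iters=30):
    rng = np.random.default_rng(0)
    p = np.full(n_bits, 0.5)                        # P_theta(x): independent bit marginals
    for _ in range(iters):
        samples = rng.random((pop, n_bits)) < p     # generate samples from P_theta_t(x)
        fits = np.array([f(s) for s in samples])
        theta = np.percentile(fits, 100 * (1 - keep))   # theta_{t+1} = nth percentile
        fittest = samples[fits >= theta]            # retain samples with f(x) >= theta_{t+1}
        p = fittest.mean(axis=0).clip(0.05, 0.95)   # re-estimate the distribution
    return p

# Toy fitness: count of ones; the marginals drift toward 1.
print(mimic_sketch(lambda s: s.sum()).round(2))
```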

7
Q

Perceptron and Boolean representation

A

AND

OR

NOT (one variable)

XOR - not a single perceptron, but a network (OR - AND in Quiz example)

8
Q

What are approaches to optimization?

A

Generate and test - small input space, complex function

Calculus - the function has a derivative; set the derivative to zero and solve

Newton's Method - the function has a derivative; iteratively improve; works for a single optimum

9
Q

What is the relevance of the value C in SVM?

A

C is a regularisation parameter that controls the trade-off between maximizing the margin and minimizing the training error term. If C is too small, then insufficient stress is placed on fitting the training data. If C is too large, then the algorithm will overfit the training data. (C is the penalty for errors.)

10
Q

What is an instance based learning algorithm

A

KNN - parameters K and distance (similarity)

11
Q

What is Naive Bayes

A

Probability of a value given attributes

P(V | A1, A2, A3, …, An) = P(V) product_i P(Ai | V) / Z (Z normalizes)

MAP class = argmax_v P(v) product_i P(Ai | v)
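A small sketch of the MAP-class computation from counted frequencies, on an invented two-attribute data set (no smoothing, so unseen attribute values would zero out a class):

```python
from collections import Counter, defaultdict

# Made-up rows: (attribute values, label).
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "yes"),
        (("rain", "mild"), "yes"), (("rain", "hot"), "no"),
        (("sunny", "mild"), "yes")]

label_counts = Counter(v for _, v in data)          # counts for P(v)
attr_counts = defaultdict(Counter)                  # counts for P(a_i | v)
for attrs, v in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, v)][a] += 1

def map_class(attrs):
    # argmax_v P(v) * prod_i P(a_i | v), all estimated from counts
    def score(v):
        p = label_counts[v] / len(data)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, v)][a] / label_counts[v]
        return p
    return max(label_counts, key=score)

print(map_class(("rain", "hot")))   # -> 'no'
```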

12
Q

What is Radial Basis Function?

A

RBF networks are a type of ANN constructed from spatially localized kernel functions. Blend of instance based and ANN approaches.

13
Q

What is the KNN algorithm

A

Given: Training Data D = {(xi, yi)}

distance metric d(q,x)

number of neighbors k

query point q

NN = {i : d(q, xi) is among the k smallest}

Return

For classification: Vote of the yi (plurality, mode)

For regression: mean, or weighted mean
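A compact sketch of the algorithm above, assuming Euclidean distance and a made-up toy data set:

```python
import math
from collections import Counter

def knn(D, q, k=3, d=math.dist, classify=True):
    """D is a list of (x, y) pairs; return the plurality vote (classification) or the mean (regression)."""
    nearest = sorted(D, key=lambda xy: d(q, xy[0]))[:k]   # the k smallest distances to q
    ys = [y for _, y in nearest]
    return Counter(ys).most_common(1)[0][0] if classify else sum(ys) / k

# Toy data: label is whether x + y > 1.
D = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
print(knn(D, (1.2, 0.9)))                    # -> 1 (classification by vote)
print(knn(D, (1.2, 0.9), classify=False))    # regression: mean of the k neighbor labels
```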

14
Q

What is conditional entropy?

A

H(y | x) = - sum_{x,y} P(x, y) log P(y | x)

If x is independent of y:

H(y|x) = H(y)

H(x,y) = H(x) + H(y)

15
Q

What is the VC dimension of convex polygons

A

The number of parameters is unbounded, so the VC dimension is infinite.

To see it, draw the points on a circle: a convex polygon can shatter any number of points placed on a circle (take the positive points as the polygon's vertices), so VC dim = number of points, for any number of points.

16
Q

What is a Belief Network?

A

aka Bayes Net, aka Bayesian Network, aka Graphical Model

A graph that captures the conditional independence relationships between variables

17
Q

How powerful is a perceptron unit?

A

Returns 0 or 1 as a function of inputs

Weights matter

The output is a linear threshold function of the inputs, so a perceptron computes a half-space: a hyperplane decision boundary.

18
Q

gradient descent

A

More robust when the data are not linearly separable (it trains on the unthresholded output).

The delta rule / gradient descent is the preferred choice for data that are not linearly separable.

19
Q

Classification

A

Map input to label

20
Q

Decision Trees

A

Ex - At a restaurant think of whether you should enter or not given input about the restaurant (cost, food type, etc.)
Representation - leaves (answers), edges (values), nodes (attributes)

21
Q

What is epsilon - exhaustion?

A

VS(S) is epsilon-exhausted iff error_D(h) <= epsilon for every h in VS(S)

Choose any h in VS(s) and you will have low error. We need to make sure the only things left in the version space have very low error.

22
Q

Key points of SVMs

A

Want to maximize the margin between the boundaries

Want to find the decision boundary (between the margins)

y = wTx + b (label, weights, parameters of the plane)

0 = wTx + b (on the boundary) (+1/-1 on the margins)

Want perpendicular line between margins to be a maximum.

23
Q

What is the learning/query runtime and space for 1-NN, K-NN and linear regression for a sorted data set?

A

1-NN: learning O(1), query O(log n), space O(n)

k-NN: learning O(1), query O(log n + k), space O(n)

Linear regression: learning O(n), query O(1), space O(1)
24
Q

What is the sum rule for probabilities?

A

P(A u B) = P(A) + P(B) - P(A n B)

25
How are rules combined in ensemble learning
Take examples uniformly at random to get a subset, apply a learner to it, and combine the resulting rules with the mean.
26
Mistake Bounds
how many misclassifications can a learner make over an infinite run
27
When do we stop in a decision tree?
When everything is classified correctly (unless there is noise), or when there are no more attributes to split on. What do we do with noise? Don't trust the data completely: a tree that grows too big can overfit.
28
What is the maximum likelihood hypothesis?
h_ML = argmax_{h in H} P(D | h). This is the MAP hypothesis with P(h) dropped, because we consider all priors P(h) equal.
29
What are the biases of Genetic Algorithms?
Locality of the bits matters (e.g., the first 4 bits are related, or the last 4 bits are related); sub-parts of the space can be independently optimized.
30
What is Bayes Rule
P(A|B) = P(B|A) P(A) / P(B); equivalently, P(A, B) = P(A|B) P(B) = P(B|A) P(A)
31
What are properties of simulated annealing?
T -> 0: like hill climbing. T -> inf: like a random walk. Decrease the temperature slowly. At equilibrium the samples follow a Boltzmann distribution, so the search is likely to be in a location of high fitness.
32
Why does boosting tend to do well?
Boosting re-weights the examples it is getting wrong so the next round does better on them. Because the distribution is renormalized, each new weak hypothesis must be better than chance on the examples that were previously wrong. Over time the errors go down and the alphas go up; hypotheses that are more right get more vote. By information gain, each round has to pick up information as it goes along - in particular about the things it got wrong in the past.
33
Reinforcement Learning
Learning from delayed reward
34
What is a perceptron ANN?
Attributes and weights flow in. Activation: sum_{i=1..k} x_i w_i >= theta (the firing threshold). If yes, y = 1; if no, y = 0.
35
Testing Set
Just like a training set. Look at candidate and see if it does the job. Train =/= Test
36
What are the bias' of neural networks
Restriction bias: a perceptron only considers planes / half-spaces; networks of sigmoid units are much more complex, so there is not much restriction. Representation: boolean functions - a network of threshold-like units; continuous functions (connected, no jumps) - one hidden layer; arbitrary functions - two hidden layers, stitched together. Preference bias: initial weights are small, random values (to minimize bias and give variability between runs); prefer simpler explanations (Occam's razor).
37
What is unique to ANN regarding errors and cross-validation?
ANN training is iterative. As the iterations continue, the weights risk becoming too large and overfitting. A plot of error over iterations will show the training and validation errors diverging; stop training before they diverge.
38
Instances
inputs. Vectors of values. The set of things you are looking for
39
What is a weak learner
A weak learner is a learner that always does better than chance: for all distributions D, the error Pr_D[h(x) != c(x)] <= 1/2 - epsilon, where epsilon is a very small positive number.
40
What is backpropagation?
A computationally beneficial organization of the chain rule: the errors flow backwards. The error function can have many local optima (at a local optimum the weights cannot change without making the error worse, so the search gets stuck).
41
What is Entropy?
H(S) = - sum_s P(s) log P(s)
42
What are the advantages of random restart hill climbing?
Multiple tries to find a good starting place; not much more expensive (a constant factor).
43
What is the theorem of total probability?
If events A1..An are mutually exclusive with sum_{i=1..n} P(Ai) = 1, then P(B) = sum_{i=1..n} P(B|Ai) P(Ai)
44
What is the optimization problem for SVM?
45
How do you deal with the threshold issue and avoid undifferentiable activation?
Sigmoid - a differentiable threshold: sigma(a) = 1 / (1 + e^{-a}); it goes to 0 as a -> -inf and to 1 as a -> +inf.
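A tiny sketch of the sigmoid and the derivative form that makes backpropagation convenient:

```python
import math

def sigmoid(a):
    """sigma(a) = 1 / (1 + e^-a): -> 0 as a -> -inf, -> 1 as a -> +inf."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1 - s)           # derivative is sigma(a) * (1 - sigma(a))

print(round(sigmoid(-10), 4), sigmoid(0), round(sigmoid(10), 4))   # ~0, 0.5, ~1
```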
46
What if target concept not in Hypothesis Space?
This learning scenario is called "agnostic": the learner's hypothesis space need not contain the target concept, but the learner has to find the hypothesis that best matches the true concept. m >= (1 / (2 epsilon^2)) (ln|H| + ln(1/delta)). Like the equation for the non-agnostic case, but with 2 epsilon^2 in place of epsilon.
47
How expressive is a decision tree? XOR with n boolean attributes. How many trees/rows in the truth table? How big is the truth table?
Exponentially many nodes: the output size is exponential in n (XOR/parity needs O(2^n) nodes). The truth table has 2^n rows, and there are 2^(2^n) possible truth tables (and hence trees): how many different bit patterns? 2^(# positions) = 2^(2^n).
48
What is the equation for the margin between decision boundaries in SVM?
M = 2 / ||w||. With w^T x1 + b = 1 and w^T x2 + b = -1 on the two margins, w^T (x1 - x2) = 2, so projecting onto the unit normal gives (w^T / ||w||)(x1 - x2) = 2 / ||w||.
49
VC dimension of example?
2. With three points on a line, you cannot create an interval (range) that labels them + - +.
50
What is the relationship between PAC-learnability and VC dimension?
H is PAC-learnable iff VC dimension is finite.
51
What is simulated annealing?
Don't always improve (exploit); sometimes you need to search (explore) - a tradeoff. For a finite set of iterations: sample a new point x_t in N(x); jump to it with probability given by an acceptance probability function (hill climb if it is better, otherwise use the temperature T to decide whether to jump); decrease the temperature. At higher temperature the exponent in the acceptance probability approaches e^0 = 1, so the search is willing to take downward steps.
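A sketch of the loop with a geometric cooling schedule; the fitness function, neighborhood, starting temperature, and cooling rate are all invented for illustration:

```python
import math, random

def simulated_annealing(f, neighbor, x, t=10.0, cooling=0.99, iters=3000):
    """Maximize f: always accept improvements; accept a worse point with prob exp((f(x')-f(x))/T)."""
    for _ in range(iters):
        xp = neighbor(x)
        if f(xp) >= f(x) or random.random() < math.exp((f(xp) - f(x)) / t):
            x = xp                        # hill-climb step, or a downward "explore" step
        t = max(t * cooling, 1e-6)        # decrease the temperature slowly
    return x

# Toy: a bumpy 1-D function with local optima around the peak near x = 50.
f = lambda x: -(x - 50) ** 2 + 20 * math.sin(x)
x = simulated_annealing(f, lambda x: x + random.choice((-1, 1)), 0)
print(x, f(x))   # should land on a high-fitness bump near 50 (the global max is not guaranteed)
```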
52
Distance metrics for KNN
Euclidean: sqrt((x0 - x1)^2 + (y0 - y1)^2). Manhattan (taxicab): |x0 - x1| + |y0 - y1|.
53
What are genetic algorithms?
A population of individuals. Mutation - local search, like N(x). Crossover - the population holds information; combine attributes to make better individuals. Generations - iterations of improvement. P_0 = initial population of size K. Repeat until converged: compute the fitness of all x in P_t; select the "most fit" individuals (top half - truncation selection, or weighted probability - roulette wheel); pair up individuals, replacing the "least fit" individuals via crossover / mutation.
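A sketch of this generation loop using truncation selection, one-point crossover, and bit-flip mutation on a counting-ones fitness (all parameter values are made up):

```python
import random

def ga(f, n_bits=20, pop_size=50, gens=40, mut_rate=0.02):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=f, reverse=True)                # compute fitness of all x in P_t
        survivors = pop[: pop_size // 2]             # truncation selection: keep the top half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)       # pair up individuals
            cut = random.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut_rate) for bit in child]   # mutation
            children.append(child)
        pop = survivors + children                   # replace the least fit
    return max(pop, key=f)

print(sum(ga(sum)))   # counting-ones fitness; should be at or near n_bits
```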
54
How to find the line of best fit in polynomial regression
c0 + c1 x + c2 x^2 + c3 x^3 = y. Use matrix multiplication: Xw = y; X^T X w = X^T y; (X^T X)^{-1} X^T X w = (X^T X)^{-1} X^T y; so w = (X^T X)^{-1} X^T y.
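A numpy sketch of this normal-equation fit on made-up noisy cubic data (solving the linear system rather than forming the explicit inverse, which is numerically safer but gives the same w):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 1 + 2 * x - 0.5 * x ** 3 + rng.normal(0, 0.1, x.size)   # invented noisy cubic

X = np.vander(x, N=4, increasing=True)       # columns: 1, x, x^2, x^3
w = np.linalg.solve(X.T @ X, X.T @ y)        # solve (X^T X) w = X^T y
print(w.round(2))                            # approximately [1, 2, 0, -0.5]
```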
55
Input spaces of regression
Scalar input, continuous output. Vector input, continuous output - include more input features (e.g., size, distance from the zoo) and fit hyperplanes. Discrete vector or scalar inputs - discrete quantities can be enumerated into numbers.
56
Training Set
Set of all inputs paired with correct outputs (this is tall, this is not tall).
57
What is a dependency tree?
Tree where every node (minus root) has one parent Every node depends on exactly one node. Simplest inter-relationship
58
Concept
The function that we care about that will map the inputs to outputs. (instances to outputs).
59
Why should we sample?
Get the distribution for the probability of values and generate values; simulate a complex process; approximate inference is faster than exact inference (for the machine); visualization, getting a feel for the data (for the human).
60
What is the VC dimension of a d-dimensional hyperplane?
d + 1 Note: VC dimension is often number of parameters
61
What assumptions are required to derive that minimizing the sum of squared errors will output the maximum likelihood hypothesis?
Training examples are pairs (xi, di) where di = f(xi) + epsilon_i, and the errors epsilon_i are i.i.d. with a normal distribution and zero mean. In practice it may not always make sense that the di contain error but the xi do not.
62
Version Space
True hypothesis: c in H. Candidate hypothesis: h in H. Consistent learner: produces h with c(x) = h(x) for all x in S. Version space: VS(S) = {h in H : h consistent w.r.t. S} - the set of hypotheses consistent with the examples.
63
Optimizing Weights
To avoid local minima: advanced optimization methods; momentum (to "bounce out" of local minima); higher-order derivatives; randomized optimization; a penalty for complexity (more nodes or layers, larger weights).
64
What is required for a Kernel?
The Mercer condition - the kernel acts like a distance or a similarity.
65
What is Ensemble Learning?
Come up with many simple rules instead of one complicated rule: learn a simple rule over a subset of the data, repeat, then combine all the rules into a complex rule.
66
What is Cross Validation
Take some of the training data and act as if it were test data; the left-out data is the cross-validation set. Divide the data into k folds. Example with 4 folds: train on 1,2,3 and test on 4; train on 1,3,4 and test on 2; train on 2,3,4 and test on 1; train on 1,2,4 and test on 3. Average the error across the folds. The model that performs best on cross-validation is likely the best model to use.
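A sketch of k-fold scoring used to compare models; the polynomial-degree comparison and the sine data are made up for illustration:

```python
import numpy as np

def kfold_error(fit, err, X, y, k=4, seed=0):
    """Average validation error over k folds (lower is better when comparing models)."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                          # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(err(fit(X[train], y[train]), X[val], y[val]))
    return float(np.mean(errs))

# Compare polynomial degrees by mean squared error on noisy sine data.
X = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * X) + np.random.default_rng(1).normal(0, 0.1, X.size)
fit = lambda d: (lambda Xt, yt: np.polyfit(Xt, yt, d))
err = lambda w, Xv, yv: float(np.mean((np.polyval(w, Xv) - yv) ** 2))
for d in (1, 3, 9):
    print(d, round(kfold_error(fit(d), err, X, y), 4))
```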
67
Deduction vs. Induction
Deduction - General rule to specifics Induction - Examples to generate rule
68
What is a Brute Force Bayes Concept Learner?
A learner with P(D) = |VS_{H,D}| / |H| and P(h | D) = 1 / |VS_{H,D}| if h is consistent with D, else 0. This learner requires a uniform prior, noise-free data, and that the target concept c is in H.
69
Which hypothesis spaces are infinite: - linear separators - artificial neural networks - decision trees (discrete inputs) - decision trees (continuous inputs)
- linear separators - infinite number of lines - artificial neural networks - infinite number of weights - decision trees (continuous inputs) - infinite number of questions
70
Hypothesis Class
Set of all functions that you are willing to entertain. Could be all possible functions in the world (unlikely)
71
Regression considerations for decision trees?
Continuous outputs (not just inputs). How do you measure information on continuous values? Look at the variance. What do you do at the leaves? A standard fitting step such as the average, a linear fit, or a vote.
72
Target Concept
The thing we are trying to find. The answer. (ex: determine if something is car or not, person is M or F)
73
Regression
Map input space to real number
74
What is the specific algorithm for Decision tree
Loop: A <- the best attribute; assign A as the decision attribute for the node; for each value of A, create a descendant node; sort the training examples to the leaves; if the examples are perfectly classified, stop; otherwise iterate over the leaves.
75
What is the MAP hypothesis?
Maximum a posteriori: h_MAP = argmax_{h in H} P(h | D) = argmax_{h in H} P(D | h) P(h) / P(D) = argmax_{h in H} P(D | h) P(h) (P(D) is dropped because it is independent of h).
76
What is Preference Bias
Only the Hypothesis we prefer from those we consider
77
Candidate
Concept that you think is the target concept
78
When is ANN useful?
When the training data corresponds to noisy, complex sensor data, and when the learned target function does not need to be understood by a person.
79
What is the hypothesis space H = {hi(x) = xi} x = 10 bits
|H| = 10 input space = 2^10
80
Does it make sense to repeat an attribute along a path in the tree?
Yes, if the attribute is continuous you can use a different range or a discrete value from the attribute
81
What is the equation for the number of datapoints for an infinite hypothesis space?
82
What are the certainty and error goals?
epsilon : error goal between 0 and 1/2 "approximately" delta: certainty goal between 0 and 1/2 "probably"
83
What is the curse of dimensionality
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially
84
How does boosting differ from bagging
Bagging learns over a subset of the data (drawn randomly), applies the learner, then combines with a mean. Boosting takes a subset of the data (the hardest examples), applies the learner, and finally combines with a weighted mean.
85
Why is Naive Bayes cool?
Inference is cheap (in this case) Few parameters Estimate parameters with labeled data Connects inference and classification Empirically successful Handles missing attributes pretty well
86
How can we define error?
Mismatches / total (which implicitly treats every error as equal). Error: Pr_D[h(x) != c(x)] - the probability, given the underlying distribution D, of disagreeing with the true concept.
87
Why is it called "Naive" Bayes
It does not model relationships between attributes. How can this work in practice? Relationships are often weak, there are enough attributes, and it may still get the correct label even if the probabilities are wrong: Naive Bayes is overconfident in its answer, but that doesn't matter as long as the label is right.
88
Why is boosting more robust to overfitting?
Margins are maximized as boosting continues. There is a reduction in the fraction of training examples with small margin which yields improvements in test error. The process of learning on the hardest examples continues even after the training error has reached zero.
89
Boosting Pseudocode
The weak learner's error epsilon_t must be strictly less than one half (not even exactly a half).
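A sketch of the boosting loop, using threshold stumps on made-up 1-D data as an assumed weak learner; it combines the distribution update and the alpha formula from the later cards:

```python
import numpy as np

def stump(s, t, X):
    """Weak hypothesis: predict s for x >= t, -s otherwise."""
    return np.where(X >= t, s, -s)

def adaboost(X, y, T=10):
    n = len(X)
    D = np.full(n, 1.0 / n)                              # D_1(i) = 1/n (uniform)
    ensemble = []
    for _ in range(T):
        # Weak learner: the stump with the lowest D-weighted error.
        s, t = min(((s, t) for s in (-1, 1) for t in X),
                   key=lambda st: D[stump(*st, X) != y].sum())
        h = stump(s, t, X)
        eps = D[h != y].sum()
        if eps >= 0.5:                                       # no longer better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))    # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        D = D * np.exp(-alpha * y * h)                       # up-weight mistakes, down-weight hits
        D /= D.sum()                                         # normalize by Z_t
        ensemble.append((alpha, s, t))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump(s, t, X) for a, s, t in ensemble))   # H_final = sgn(sum alpha_t h_t)

X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([1, 1, -1, -1, 1, 1])                       # not separable by any single stump
print(predict(adaboost(X, y), X), y)
```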
90
IID
Independent and identically distributed. An assumption of the algorithms: the train and test data are representative of the real-world distribution.
91
92
Unsupervised Learning
Derive structure from inputs | (Description)
93
What is the kernel in SVM?
A function that is used to represent similarity Mechanism by which we inject domain knowledge into the SVM
94
How do you find the best dependency tree?
Find the maximum spanning tree (e.g., with Prim's algorithm) over the pairwise mutual information.
95
What is the product rule of probabilities?
P(A n B) = P(A|B) P(B) = P(B|A) P(A)
96
How to determine best attribute for decision trees
Information gain: Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v) (the entropy of the original collection minus the expected entropy after S is partitioned using A). Entropy is maximal on an even split.
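A short sketch computing this on a made-up four-row data set; the attribute that splits the labels cleanly gets the larger gain:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_v p(v) log2 p(v)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    subsets = {}
    for x, y in rows:
        subsets.setdefault(x[attr], []).append(y)
    remainder = sum(len(sv) / len(rows) * entropy(sv) for sv in subsets.values())
    return entropy([y for _, y in rows]) - remainder

rows = [({"outlook": "sun", "wind": "weak"}, "play"),
        ({"outlook": "sun", "wind": "strong"}, "play"),
        ({"outlook": "rain", "wind": "weak"}, "stay"),
        ({"outlook": "rain", "wind": "strong"}, "stay")]
print(gain(rows, "outlook"), gain(rows, "wind"))   # 1.0 vs 0.0 -> split on outlook
```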
97
Key points from the Matas and Sochman slides?
AdaBoost converges to the log likelihood ratio; it is a linear classifier; many modifications exist (e.g., totally corrective AdaBoost). Advantages: simple, performs feature selection on large sets of features, good generalization. Disadvantages: suboptimal, can overfit in the presence of noise.
98
Instance Based Learning
Instead of training on data to produce a function and then throwing the data out, store the data and perform a lookup against new values: you get back exactly what you put in. Pros: remembers, fast, simple. Cons: no generalization, overfitting, affected by noise (two identical inputs with different outputs).
99
How to find the line of best fit in constant regression?
Minimize the sum of squared errors of the points: E(c) = sum_{i=1..n} (yi - c)^2. Take the derivative of E(c), set it to zero, and get c = (sum_i yi) / n (the mean).
100
What is joint entropy?
H(x,y) = - sum( P(x,y) log P(x,y) )
101
What is mutual information?
Measures the relationship between x and y: the reduction in randomness of a variable given some other variable. I(x, y) = H(y) - H(y | x)
102
What is bagging?
Taking random subsets and combining by the mean
103
Regression
Mapping continuous inputs to outputs. The word comes from fitting a regression line: averages regress to the mean (e.g., tall people having medium-height children).
104
Locally weighted regression
k = n with a weighted average. Create a local linear regression for different chunks of the data. You can build a more complicated hypothesis space out of simple local hypothesis spaces.
105
perceptron rule
Perceptron rule (thresholded output) -> a single unit. While there is error: w_i = w_i + Δw_i, where Δw_i = η (y - ŷ) x_i (η is the learning rate) and ŷ = [sum_i w_i x_i >= 0]. Subtract theta from both sides so the threshold is folded into the weights as a bias term (a constant input paired with weight -theta). The perceptron rule finds an answer in a finite number of iterations if the data are linearly separable.
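A sketch of this training loop with the threshold folded into the weights via a constant bias input; boolean AND is used as the linearly separable example from the earlier perceptron card:

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, max_epochs=100):
    """w_i <- w_i + eta * (y - y_hat) * x_i, with a constant input of 1 carrying the bias weight."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # fold the threshold into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            y_hat = 1 if xi @ w >= 0 else 0          # thresholded output
            w += eta * (yi - y_hat) * xi
            errors += int(y_hat != yi)
        if errors == 0:                              # finite iterations on separable data
            break
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                           # AND is linearly separable
w = perceptron_train(X, y)
print([1 if np.append(x, 1) @ w >= 0 else 0 for x in X])   # -> [0, 0, 0, 1]
```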
106
What is bayesian learning?
Learn the best (most probable) hypothesis given the data and some domain knowledge: argmax_{h in H} P(h | D), where D is the data and h a particular hypothesis.
107
Difference between lazy and eager learners
Lazy - low complexity at learning time, more at query time. Eager - more complexity at learning time, less at query time. With equal hypothesis spaces, a lazy learner can have a richer effective hypothesis space because it doesn't commit to a single global hypothesis: it allows local approximations, modeling complex target functions with less complex local approximations.
108
What is the VC dimension
The size of the largest set of inputs that the hypothesis class can label in all possible ways ("shatter"). Vapnik-Chervonenkis. The VC dimension gives a definition for the number of data points needed when the hypothesis class is infinite: m >= (1/epsilon) (8 VC(H) log_2(13/epsilon) + 4 log_2(2/delta)).
109
What is a perceptron
A linear threshold unit: it computes the weighted sum of its inputs, sum_i w_i x_i, and outputs 1 if the sum meets the firing threshold theta, otherwise 0.
110
How do you deal with a decision tree that's overfit?
Cross-validation: try different trees and see which has the lowest error on the validation set. More efficient: hold out a set and stop growing when its error gets worse. Pruning: once the full tree is built, see whether collapsing nodes lowers or raises the held-out error; if it raises it, stop, and set the collapsed node's output by a vote.
111
What is Pac Learning
Training error - the fraction of the training set misclassified by h. True error - the fraction of examples that would be misclassified on a sample drawn from D: error_D(h) = Pr_{x~D}[c(x) != h(x)]; error on examples we will never see is OK. Probably Approximately Correct: C is PAC-learnable by learner L using hypothesis space H iff L will, with probability 1 - delta (certainty goal), output a hypothesis h in H such that error_D(h) <= epsilon (error goal), in time and samples polynomial in 1/epsilon, 1/delta, and n.
112
What are the drawbacks of SA,GA, and RHC?
No memory of where you've been; only of where you are now.
113
Decision Tree and Errors or Missing data
Decision trees are robust to errors (all training data is used at each step, and the termination criterion can be relaxed to accept hypotheses that imperfectly fit the data) and to missing data.
114
Computational Complexity
The computational effort needed for convergence.
115
In P(h|D) = P(D|h)P(h) / P(D) what do the terms mean?
P(D|h) - the probability of seeing the data given that the hypothesis is true. Assume the x's are given; the labels are what we assign probability to: given a set of x's, what is the probability I would see a particular set of labels? It is easier to compute the probability of seeing a label. P(D) - prior on the data. P(h) - prior on the hypothesis, drawn from the hypothesis space; the prior is the domain knowledge.
116
What is the role of the learning rate in ANN ?
Moderate the degree to which weights are changed at each step. May decay as iterations increase.
117
What is entropy
A measure of randomness. A fair coin has a lot of entropy (1 bit); a completely biased coin has none (0 bits). H = -sum_v p(v) log p(v) (summed over all possible values).
118
What is computational learning theory?
Defining learning problems, showing that specific algorithms work, and showing that some problems are fundamentally hard.
119
What is the final hypothesis of Boosting?
H_final(x) = sgn(sum_t alpha_t h_t(x)); alpha_t gives more weight to hypotheses that are doing well.
120
Sample Complexity
Batch setting: how many training examples are needed for a learner to create a successful hypothesis.
121
What is conditional independence?
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if for all x, y, z: P(X=x | Y=y, Z=z) = P(X=x | Z=z). In short: P(X|Y,Z) = P(X|Z).
122
What is the restrictive assumption built into the PAC-learnable definition?
The definition implicitly assumes that the learner's hypothesis space contains a hypothesis with arbitrarily small error for every target concept in C. We need a second measure of the complexity of H, the VC dimension; we can then state bounds on sample complexity that use VC(H) rather than |H|.
123
What is KL Divergence?
Measures the difference between two distributions (a distance-like measure, though not a true metric). Always non-negative, and zero exactly when p = q. D(p || q) = integral p(x) log(p(x)/q(x)) dx.
124
How do decision trees handle boolean functions?
Boolean functions: AND - 2 nodes, commutative (the order of A and B doesn't matter); OR - 2 nodes, commutative, linear in n; XOR - 3 nodes (need to split twice), and n-way parity needs 2^n - 1 nodes, exponential (hard).
125
What is the Minimum Description Length (MDL)?
Start with h_MAP = argmax P(D|h) P(h) = argmax [lg P(D|h) + lg P(h)] = argmin [-lg P(D|h) - lg P(h)]. The first term, -lg P(D|h), is length(D|h): the bits needed to describe the errors/misclassifications. The second term, -lg P(h), is length(h): the bits needed to describe h. This trades off hypothesis complexity against the number of errors the hypothesis commits, which helps with overfitting.
126
Practical Matters of Mimic
MIMIC does well with structure. Practical matters: representing P^theta for all theta; local optima (restarts are essentially free, and probability theory helps); it takes a long time - orders of magnitude fewer iterations, but each iteration is longer. It gets more information per iteration, so it works well when the cost of evaluating the fitness function is high.
127
What is the preference bias for KNN
Locality - near points are similar. Smoothness - averaging. All features matter equally.
128
What is case-based reasoning?
An instance-based learning method in which instances (cases) may be rich relational descriptions and in which the retrieval and combination of cases to solve the current query may rely on knowledge based reasoning and search-intensive problem-solving methods. (ex: Cadet device design)
129
What is the distribution for Boosting as the process goes?
D_1(i) = 1/n (uniform at first). D_{t+1}(i) = D_t(i) e^{-alpha_t y_i h_t(x_i)} / Z_t (Z_t normalizes), where alpha_t = 1/2 ln((1 - err_t)/err_t) (positive). h_t = -1/+1 and y_i = -1/+1, so y_i h_t(x_i) = +1 if they agree and -1 if they disagree; the exponent -alpha_t y_i h_t(x_i) is therefore negative when they agree (weight shrinks) and positive when they disagree (weight grows).
130
when does boosting tend to overfit?
If the weak learner uses an ANN with many layers and nodes; with pink noise (uniform noise); or, generally, when the underlying weak learner overfits.
131
What resources matter in computational learning theory
time, space, data
132
What are ways of selecting training examples
Learner/teacher - the learner asks questions and the teacher gives examples; a fixed distribution (nature); an adversarial ("evil") worst-case distribution.
133
What is the Restriction Bias
Only the Hypothesis we consider
134
Decision Tree Algorithm
1. Pick the best attribute (splits the data). 2. Ask the question. 3. Follow the answer path. 4. Go to 1; repeat until you reach an answer.
135
VC dimension of example?
The VC dimension is 3. The collinear case (+ - + for points on y = 0) is tricky, but you can make use of the second dimension, so 3 points can be shattered; four points cannot. See the example.
136
What is Haussler Theorem
A theorem bounding the true error as a function of the number of examples drawn. It provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H: m >= (1/epsilon) (ln|H| + ln(1/delta)).
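A one-liner to evaluate the bound; the |H| = 10 example reuses the hypothesis space from the earlier card, and the epsilon/delta values are arbitrary:

```python
import math

def haussler_bound(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) for a finite hypothesis space."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

print(haussler_bound(10, epsilon=0.1, delta=0.05))   # 53 examples suffice
```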
137
What are types of learner queries?
Learner with constrained queries (e.g., X and Y and not Z). Learner with mistake bounds (X and not X): 1 - assume each variable could be both positive and negated; 2 - given an input, compute the output; 3 - if wrong, flip the contradicted literals (+ to -, - to +) and go to 2. At most k + 1 mistakes.
138
Sample
Training Set
139
What are key facts about the optimization problem for SVM
Most alphas are 0, so most data points do not contribute to w: most data points are not support vectors.