Midterm Flashcards

1
Q

What is the Bayes Optimal Classifier?

A

A system that classified new instances by weighing the probabilities of values based on hypotheses.

argmax of vj in V is the sum_hi in H of P(vj|hi) P(hi | D)

The optimal classifier may not equal the MAP hypothesis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How does an ensemble learn a dataset well?

A

Taking subsets it gets rid of overfitting and may outperform direct calculation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Why is boosting really good

A

agnostic to the learner

combining simple -> complex

minimizes overfitting - in practice.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is Hill Climbing algorithm?

A

Guess x in X

Repeat:

Let n* = argmax_n in N(x) f(n)

if f(n*) > f(x): x = n

else : stop

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the bias of Decision Trees?

A

ID3 performs simple-to-complex hill climbing searhc
Inductive Bias - Prefers goods splits at the top, prefers correct over incorrect, prefers shorter trees than longer trees (from good splits)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is MIMIC?

A

Directly model probability distribution

Search through space and refine estimate of distribution

Structure can be conveyed by refinement of distribution

Generate sample from Ptheta_t(x) (population)

Set theta_t+1 to nth percentile

retain only those samples s.t. f(x) <= theta_t+1 (fittest)

estimate Ptheta_t+1(x) (structure)

repeat

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Perceptron and Boolean representation

A

AND

OR

NOT (one variable)

XOR - not a single perceptron, but a network (OR - AND in Quiz example)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are approaches to optimization?

A

Generate and test - small input space, complex function

Calculus - function has derivative, solvable = 0

Newton’s Method - function has derivative, iteratively improve, single optimum

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the relevance of the value C in SVM?

A

C is a regularisation parameter that controls the trade-off between maximizing the margin and minimizing the training error term. If C is too small than insufficient stress will be placed on fitting the training data. If C is too large then the algorithm will overfit the training data. (C is penalty for errors)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is an instance based learning algorithm

A

KNN - parameters K and distance (similarity)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is Naive Bayes

A

Probability of a value given attributes

P(V|A1, A2, A3,…,An) = product_i P(A1|V) * P(V) / Z (Z - normalize)

MAP Class = argmax_v P(v) product_i P(Ai|V)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is Radial Basis Function?

A

RBF networks are a type of ANN constructed from spatially localized kernel functions. Blend of instance based and ANN approaches.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the KNN algorithm

A

Given: Training Data D = {xi. yi}

distance metric d(q,x)

number of neighbors k

query point q

NN = {i: d(q,xi) k smallest}

Return

For classification: Vote of the yi (plurality, mode)

For regression: mean, or weighted mean

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is conditional entropy?

A

H(y | x ) = - sum ( P(x,y) log P(y|x) )

If x is independent of y:

H(y|x) = H(y)

H(x,y) = H(x) + H(y)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the VC dimension of convex polgyons

A

number of parameters = infinite

VC dimension is infinite.

Draw points on a circle (polygon) VC dim = # points on circle for polygon. VC dim = # of points.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a Belief Network?

A

aka Bayes Net, aka Bayesian Network, aka Graphical Model

Draw the conditional independencies between variables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How powerful is a perceptron unit?

A

Returns 0 or 1 as a function of inputs

Weights matter

Linear relationship means perceptron is a linear function that computes linear planes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

gradient descent

A

robust to non linear separability

(unthresholded output)

The delta rule is the preferred choice for non-linearly separable

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Classification

A

Map input to label

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Decision Trees

A

Ex - At a restaurant think of whether you should enter or not given input about the restaurant (cost, food type, etc.)
Representation - Leaves (answers), edges (values), nodes (attribute)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is epsilon - exhaustion?

A

VS(s) is epsilon-exhausted iff for all h in VS(s) the error_D(h) <= epsilon

Choose any h in VS(s) and you will have low error. We need to make sure the only things left in the version space have very low error.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Key points of SVMs

A

Want to maximize the margin between the boundaries

Want to find decision boundary (betwen margins)

y = wTx + b (label, weights, parameters of the plane)

0 = wTx + b (on boundary) (+1/-2 on margins)

Want perpendicular line between margins to be a maximum.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What is the learning/query runtime and space for 1-NN, K-NN and linear regression for a sorted data set?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is the sum rule for probabilities?

A

P(A u B = P(A) + P(B) - P(A n B)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

How are rules combined in ensemble learning

A

take sets uniformly randomly to get subset

apply a learner

combine with mean

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Mistake Bounds

A

how many misclassifications can a learner make over an infinite run

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

When do we stop in a decision tree?

A

When everything is classified correctly, unless noise.
When no more attributes.

What do we do if noise?
Don’t trust the data completely.
Can overfit with a tree too big

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is the maximum likelihood hypothesis?

A

h_ml = argmax_h in H of P(D | h)

The MAP Hypothesis with P(h) dropped because we consider all P(h) priors equal.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What are the bias’ of Genetic Algorithms?

A

locality of the bits matter (first 4 bits are related or last 4 bits are related)

sub parts of the space can be independently optimized.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

What is Bayes Rule

A

P(A|B) = P(B|A)P(A) / P(B)

P(A,B) = P(A|B)P(B) = P(B|A)P(A)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What are properties of simulated annealing?

A

T -> 0: like hill climbing

T -> inf: like random walk

decrease temperatore slowly

Boltzmann distribution - likely to be in a location of high fitness

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Why does boosting tend to do well?

A

Boosting makes samples that aren’t working well be re-weighted to do better on.

The number of things getting wrong will have to be half right as the process is renormalized.

Errors go down and alphas go up, overtime.

Hypotheses that are more right have more vote.

Information gain, must pickup information as you go along. You have to pick up things that you got wrong in the past.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Reinforcement Learning

A

Learning from delayed reward

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

What is a perceptron ANN?

A

Attributes and weights flow in

Activation: Sum_i=1 to k_ Xi * Wi

>= theta (firing threshold)

yes - y = 1

no - y = 0

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Testing Set

A

Just like a training set. Look at candidate and see if it does the job. Train =/= Test

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

What are the bias’ of neural networks

A

Restriction bias:

perceptron - only considers planes, half spaces

sigmoids - much more complex. not much restriction

Representations:

boolean - network of threshold like units

continuous - connected no jumps - hidden

arbitrary - stitch together - 2 hidden

Preference bias:

initial weights (small, random values to minimize bias and give variability between runs)

prefer simpler explanations (occam’s razor)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

What is unique to ANN regarding errors and cross-validation?

A

ANN is iterative

As the iterations continue, at risk of weights becoming too large and overfit

Error over iterations plot will show variance between train and validation data

Recommend stopping training before the train/validation data diverges

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Instances

A

inputs. Vectors of values. The set of things you are looking for

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

What is a weak learner

A

A weak learner is a learner that will always do better than chance

For all D, PrD[.] <= 1/2 - epsilon

epsilon - a really really small number <<< 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

What is backpropagation?

A

Computationally beneficial organization of the chain rule.

The errors flow backwards.

Error function can have many local optima. (Weights cannot change without making error worse, so stuck)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What is Entropy?

A
  • sum P(s) log (P(S))
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

What are the advantages of random restart hill climbing?

A

Multiple tries to find a good starting place

Not much more expensive (constant factor)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

What is the theorem of total probability?

A

If events are mutually exclusive with sum from i= 1 to n of P(Ai) = 1, then

P(B) = sum_ i = 1 to n of P(B|Ai)P(Ai)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

What is the optimization problem for SVM?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

How do you deal with the threshold issue and avoid undifferentiable activation?

A

Sigmoid - Differentiable threshold

sigma(a) = 1 / (1 + e ^ -a)

goes to 0 when a -> -inf

goes to 1 when a -> inf

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

What if target concept not in Hypothesis Space?

A

Learning scenario called “agnostic”

Learner doesnt have to have hypothesis in target space, but has to find the best one at matching true concept.

m > = (1/2epsilon^2)* ( ln(|H| + ln(1\delta) )

Like the equation for non-agnostic case but epsilon is a

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

How expressive is a decision tree?
XOR with n boolean attributes. How many trees/rows in the truth table? How big is the truth table?

A

O(n!) nodes (exponential)
output exponential

2^n rows (truth table)

2^(2^n) size truth table. How many different bit patterns? (2^(# positions)). 2 ^ (2^n)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

What is the equation for the margin between decision boundaries in SVM?

A

M = 2 / ||W||

(wT/||W|| (x1 - x2) = 2)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

VC dimension of example?

A

2.

With 3 points on the line, cannot create a range such that points are :

+ - +

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

What is the relationship between PAC-learnability and VC dimension?

A

H is PAC-learnable iff VC dimension is finite.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

What is simulated annealing?

A

Don’t always improve (exploit), sometimes you need to search (explore). Tradeoff

For finite set of iterations:

sample new point Xt in N(x)

jump to new sample with probability given by an acceptance probability function (hill climb or use Temperature to decide if to jump)

decrease temperature

Higher temperature - e^0 willing to take downward steps

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

Distance metrics for K

A

Euclidean sqrt( x0 - x1)^2 + (y0 - y1)^2

Manhattan |(x0 - x1)| + |(y0 - y1)| (taxicab)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

What are genetic algorithms?

A

Population of individuals

mutation - local search N(x)

cross over - population holds information (different**). Combine attributes to be better

generations - iterations of improvement

Po = initial population of size K

Repeat until converged

Compute fitness of all x in Pt

Select “most fit” individuals (top half - truncated selection, weighted prob - roulette wheel)

Pair up individuals, replacing “least fit” individuals via crossover/mutation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

How to find the line of best fit in polynomial regression

A

Co + C1X + C2X^2 + C3X^3 = y
Use matrix multiplicate
Xw = y
X^TXw = X^Ty
(X^TX)^-1 X^TXw = (X^TX)^-1XTy
w = (X^TX)^-1XTy

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

Input spaces of regression

A

Scalar input, continuous out
Vector input, continuous out - include more input features: (size, distance from zoo). hyperplanes
discrete vector or scalar inputs - for discrete quantities can enumerate them into numbers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

Training Set

A

Set of all inputs paired with correct outputs (this is tall, this is not tall).

57
Q

What is a dependency tree?

A

Tree where every node (minus root) has one parent

Every node depends on exactly one node.

Simplest inter-relationship

58
Q

Concept

A

The function that we care about that will map the inputs to outputs. (instances to outputs).

59
Q

Why should we sample?

A

Get distributions for probabiliy of values, generate values

simulates a complex process

approximate (faster than exact) inference (machine)

visualization, getting a feel (human)

60
Q

What is the VC dimension of a d-dimensional hyperplane?

A

d + 1

Note: VC dimension is often number of parameters

61
Q

What assumptions are required to derive that minimizing the sum of squared errors will output the maximum likelihood hypothesis?

A

training examples are pairs of where di = f(xi) + epsilon

epsilon error is iid with normal distribution and zero mean.

In practice, it may not always make sense that di contains error but the xs do not.

62
Q

Version Space

A

True Hypothesis - c in H

Candidate Hypothesis - h in H

consistent learner - produces c(x) = h(x) for x in S

version space - VS(S) = {h s.t. h in H consistent wrt S}. set of hypothesis consistant with examples.

63
Q

Optimizing Weights

A

To avoid local minima:

advanced optimization methods

momentum - use momentum to “bounce out”

higher order derivatives

randomized optimization

penalty for complexity (more nodes, layers penalty, larger number weights)

64
Q

What is required for a Kernel?

A

Mercer Condition - Acts like a distance or acts like a similarity .

65
Q

What is Ensemble Learning?

A

Come up with many simple rules, instead of one complicated rule.

Learn over a subset of data, combine for simple rule

repeat

combine all rules, complex rule

66
Q

What is Cross Validation

A

Take some of the training data and act like it is test data. The left out set of data is the cross validation set.

Divide data into K - folds
Ex:
4 folds
train on 1,2,3 test on 4
train on 1,3,4 test on 2
train on 2,3,4 test on 1
train on 1,2,4 test on 3
Average across the folds. Model that performs the best on cross-validation would likely be best model for train data.

67
Q

Deduction vs. Induction

A

Deduction - General rule to specifics
Induction - Examples to generate rule

68
Q

What is a Brute Force Bayes Concept Learner?

A

Learner with

P(D) - |VS of H,D | / |H |

P(h | D) = 1/ |VS of H,D| if h is consistent with D, else 0

* This learner requires uniform priors, noise free data, and target concept c is in H

69
Q

Which hypothesis spaces are infinite:

  • linear separators
  • artifical neural networks
  • decision trees (discrete inputs)
  • decision trees (continuous inputs)
A
  • linear separators - inf number of lines
  • artifical neural networks - inf number of weights
  • decision trees (continuous inputs) - inf number of questions
70
Q

Hypothesis Class

A

Set of all functions that you are willing to entertain. Could be all possible functions in the world (unlikely)

71
Q

Regression considerations for decision trees?

A

Continuous outputs (not inputs). How do you measure information on continuous values?

Look at variance.

How to split? On leaves do a standard fitting algorithm like the average or linear fit or do a vote.

72
Q

Target Concept

A

The thing we are trying to find. The answer. (ex: determine if something is car or not, person is M or F)

73
Q

Regression

A

Map input space to real number

74
Q

What is the specific algorithm for Decision tree

A

Loop
A

75
Q

What is the MAP hypothesis?

A

maximum a posteriori

h_map ~= argmax_h in H P(h | D)

= argmax_h in H P(D | h) P(h) / P(D) (drop D, independent of H)

= argmax_h in H P(D | h) P(h)

76
Q

What is Preference Bias

A

Only the Hypothesis we prefer from those we consider

77
Q

Candidate

A

Concept that you think is the target concept

78
Q

When is ANN useful?

A

Training data corresponse to noisy, complex sensor data. When the learned target function does not need to be understood by a person

79
Q

What is the hypothesis space

H = {hi(x) = xi}

x = 10 bits

A

|H| = 10

input space = 2^10

80
Q

Does it make sense to repeat an attribute along a path in the tree?

A

Yes, if the attribute is continuous you can use a different range or a discrete value from the attribute

81
Q

What is the equation for the number of datapoints for an infinite hypothesis space?

A
82
Q

What are the certainty and error goals?

A

epsilon : error goal between 0 and 1/2 “approximately”

delta: certainty goal between 0 and 1/2 “probably”

83
Q

What is the curse of dimensionality

A

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially

84
Q

How does boosting differ from baggin

A

Bagging learns over a subset of data (drawn randomly) and applies the learner then combines with a mean.

Boosting takes a subset of the data (hardest examples) and applies the learner. Finally, taking a weighted mean

85
Q

Why is Naive Bayes cool?

A

Inference is cheap (in this case)

Few parameters

Estimate parameters with labeled data

Connects inference and classification

Empirically successful

Handles missing attributes pretty well

86
Q

How can we define error?

A

accuracy - misatches / total (implies each error is equal)

Error : PrD[h(x) =/= c(x)]

probability given the underlying distribution that I will disagree on the true concept

87
Q

Why is it called “Naive” Bayes

A

Does not model relationship between attributes

How can this work in practice? Weak relationships, enough attributes, may still get the correct label even if the probabilities are wrong.

Naive Bayes believes answer too much but doesn’t matter if it is right.

88
Q

Why is boosting more robust to overfitting?

A

Margins are maximized as boosting continues. There is a reduction in the fraction of training examples with small margin which yields improvements in test error. The process of learning on the hardest examples continues even after the training error has reached zero.

89
Q

Boosting Pseudocode

A

epsilon would be less than half (not even a half)

90
Q

IID

A

Independent and identically distributed
Assumption of algorithms
Data is representative in test/train of the real world distribution

91
Q
A
92
Q

Unsupervised Learning

A

Derive structure from inputs

(Description)

93
Q

What is the kernel in SVM?

A

A function that is used to represent similarity

Mechanism by which we inject domain knowledge into the SVM

94
Q

How do you find the best dependency tree?

A

Find the maximum spanning tree using Prim.

MST over mutual information

95
Q

What is the product rule of probabilities?

A

P (A n B) = p(A|B)P(B) = p(B|A)P(A)

96
Q

How to determine best attribute for decision trees

A

information gain
Gain (S,A) = Entropy S - Sumv ( | Sv | / S ) Entropy (Sv)
(entropy of original collection - expected entropy after S is partitioned using A)
Maximum entropy on even split

97
Q

Key points from the Matas and Sochman slides?

A

Adaboost

  • converges to log likelihood ratio
  • linear classifier
  • many mods (TC AB

Advantages

simple

feature selection on large sets of features

generalization

disadvantages

suboptimal

can overfit in presence of noise

98
Q

Instance Based Learning

A

Instead of training on data to produce a function then throwing the data out, store the data and perform a lookup against new values. Get back exactly what you put in.

Pros:

remembers, fast, simple

Cons:

no generalization, overfitting, affected by noise (2 inputs different outputs)

99
Q

How to find the line of best fit in constant regression?

A

Minimize the sum of squared errors of the points
E(C) = Sum_i=1 to n_ (yi - c)^2
Take the derivative of E(C) and get c = sum of yi / n (mean)

100
Q

What is joint entropy?

A

H(x,y) = - sum( P(x,y) log P(x,y) )

101
Q

What is mutual information?

A

Measures relationship between x and y.

Measure reduction of randomness in a variable given some other variable.

I (x, y) = H(y) - H(y | x)

102
Q

What is bagging?

A

Taking random subsets and combining by the mean

103
Q

Regression

A

mapping continuous inputs to outputs

word comes from fitting a regression line (averages regress to the mean, ie. tall people having medium height children)

104
Q

Locally weighted regression

A

K = n weighted average

Create local linear regression for different chunks of data. Can make a more complicated space based off of building simple hypothesis spaces .

105
Q

perceptron rule

A

perceptron rule (threshold) -> single unit

While error:

Wi = Wi + deltaWi

deltaWi = n (y - ^y)Xi (n is learning rate)

^y = Sum_i_WiXi >= 0 (subtract theta from both sides, to become weight, bias term added so threshold is folded into weights (bias term = -1 theta) )

** perceptron rule will find the answer in finite iterations if linearly separable

106
Q

What is bayesian learning?

A

Learn the best hypothesis given data and some domain knowledge

(best = most probable)

P(h|D)

D - data

h - particular hypothesis

argmax h in H [ P(h |D)

107
Q

Difference between lazy and eager learners

A

Lazy - low complexity at learn, more at query

Eager - more complexity at learn, less at query

With equal hypothesis spaces, lazy learners can have a richer hypothesis space as it doesn’t commit to a particular space. Lazy learners allow for local approximations of the hypothesis space.

Lazy model complex target functions with lesss complex local approximations.

108
Q

What is the VC dimension

A

The largest set of inputs that the hypothesis class can label in all possible ways (“shatter”)

Vapnik-Chervonenkis

The VC dimension helps to create a definition for the number of data points needed when a hypothesis class is infinite

m >= 1/epsilon(8 * VC(H) * log_2(13/epsilon) + 4log_2(2/delta)

109
Q

What is a perceptron

A

110
Q

How do deal with a decision tree that’s overfit?

A

Cross-validation : try different trees and see which has lowest error on validation set
More efficient : hold out a set and if error gets worse stop.
Pruning : when full tree is built, see if collapsing nodes up will lower or raise error. If raise, stop. Update output with a vote.

111
Q

What is Pac Learning

A

Training err - frac of training misclassified by h

true err - frac of examples that would be misclassified on sample drawn from D

error_D(h) = Pr_x~D[c(x) =/= h(x)] - err on examples we will never see is ok.

Probably - Approximately - Correct

C is PAC-learnable by learner using Hyp space iff learner L will, with probability 1 - delta (certainty goal), output a hypothesis h in H such that error_D(h) <= epsilon (error goal) in time and samples polynomial (1/ epsilon, 1/delta, n)

112
Q

What are the drawbacks of SA,GA, and RHC?

A

No memory of where you’ve been at where you are

113
Q

Decision Tree and Errors or Missing data

A

Decision Trees are robust to errors (all training used at each step and termination criteria can be updated to accept hypotheses that imperfectly fit data) or missing data

114
Q

Computational Complexity

A

computational effort for confergence

115
Q

In P(h|D) = P(D|h)P(h) / P(D)

what do the terms mean?

A

P(D|h) - the probably of seeing the data given that the hypothesis true. Assume the X’s are given, the labels are what we are trying to assign probability to. Given set of X’s, whats the probability I would see a particular label. Easier to compute the probability of seeing a label.

P(D) - prior on the data

P(h) = prior on the hypothesis drawn from the hypothesis space.

prior is the domain knowledge

116
Q

What is the role of the learning rate in ANN ?

A

Moderate the degree to which weights are changed at each step. May decay as iterations increase.

117
Q

What is entropy

A

A measure of randomness
A lot of entropy (1 bit) for a fair coin. Unfair coin has no entropy (0 bit)

  • Sumv [p(v) log p(v)] (sum over all possible values)
118
Q

What is computational learning theory?

A

Defining learning problems

showing specific algorithms work

show these problems are fundamentally hard

119
Q

What is the final hypothesis of Boosting?

A

Hfinal (X) = sgn(Sum alpha T ht(x))

alpha T - more weight if doing well

120
Q

Sample Complexity

A

batch. how many training examples are needed for a learner to create a successful hypothesis

121
Q

What is conditional independence?

A

X is conditionally indpendent of Y given Z if the probability distribution governing X is independent of the value of y given the value of Z, that is if

For all X,Y,Z P(X=x|Y=y,Z=z) = P(X=x|Z=z)

in short: P(X|Y,Z) = P(X|Z)

122
Q

What is the restrictive assumption built into the PAC-learnable definition?

A

The definition implicitly assumes that the learner’s hypothesis space contains a hypothesis with arbitrarily small error for every target concept in C.

Need to consider a second mesure of the complexity of H, VC dimension. We can state bounds on sample complexity that use VC(H) rather than |H|.

123
Q

What is KL Divergence?

A

Measures the difference between two distributions. Distance measure.

Always non-negative and zero (when p = q)

D( p || q) = integral p(x) log (p(x)/q(x))

124
Q

How do decision trees handle boolean functions?

A

Boolean Functions

  • AND 2 nodes, commutative (A or B order doesn’t matter)
  • OR 2 nodes, commutative, linear in n
  • XOR 3 nodes, need to split twice, 2^n - 1 nodes, exponential (HARD)
125
Q

What is the Minimum Description Length (MDL)?

A

Start with hMAP = argmax P (D|h) P(h)

= argmax [lg P (D|h) + lg P(h)]

= argmin [- lg P (D|h) - lg P(h)]

First arg -lg P(D|h) => length D|h). Bits needed to describe “size of miscalculation/err”

Second arg -lg P(h) => length(h). Bits needed to describe “size of h”

This provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis.

Helps with overfitting.

126
Q

Practical Matters of Mimic

A

Mimic does well with structure

Representing P_theta for all theta

Local optima problem (restarts free, probability theory)

takes a long time - orders of magnitude fewer iterations, but longer per iteration

more information per iteration

works well when cost of evaluating fit function is high

127
Q

What is the preference bias for KNN

A

Locality - near points are simillar

smoothness - averaging

All features matter equally

128
Q

What is case-based reasoning?

A

An instance-based learning method in which instances (cases) may be rich relational descriptions and in which the retrieval and combination of cases to solve the current query may rely on knowledge based reasoning and search-intensive problem-solving methods. (ex: Cadet device design)

129
Q

What is the distribution for Boosting as the process goes?

A

D1 (i) = 1/n (uniform at first)

Dt+1(i) = Dt(i) * e ^ (-alphaT yi ht (xt) / zt (normalization)

where alphaT = 1/2 ln (1 - err t)/err t (positive)

ht = -1/+1

yi = -1/+1

y*h = +1 if agree, -1 if disagree

Total product: - if they agree

Total product: + if they disagree

130
Q

when does boosting tend to overfit?

A

If a weak learner uses ANN with many layers and nodes

pink noise - uniform noise

underlying weak learner that overfits

131
Q

What resources matter in computational learning theory

A

time, space, data

132
Q

What are ways of selecting training examples

A

learner/teacher - learner asks Qs.

Teacher gives exs.

Fixed distribution (nature)

Evil - worst distribution

133
Q

What is the Restriction Bias

A

Only the Hypothesis we consider

134
Q

Decision Tree Algorithm

A
  1. Pick best attribute (split data)
  2. Asked Question
  3. Follow answer path
  4. Go to 1, repeat to answer
135
Q

VC dimension of example?

A

VC dimension is 3

Collinear case is tricky + - + for points on y = 0. But make use of the second dimension. So 3 points can be shattered.

Four points cannot. See example.

136
Q

What is Haussler Theorem

A

This is a theorem for bounding the true error as a function of the number of examples that are drawn. It provides a genral bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H

m >= (1/epsilon) * (ln|H| + ln(1/delta) )

137
Q

What are types of learner queries?

A

Learners with constrained bounds (X and Y and Not Z)

Learner with mistake bounds (X and Not X):

1 - asume possible each variable is positive and negated

2 - given input, compute output

3 - if wrong, set + to - and - to +. Go to 2

(k + 1) mistakes

138
Q

Sample

A

Training Set

139
Q

What are key facts to the optimizatio problem for SVM

A

most alphas are 0,

most w’s don’t matter.

most data points are not support vectors