Midterm Flashcards

Question

How are rules combined in ensemble learning?

Answer 1

Draw subsets of the data uniformly at random, apply a learner to each subset, and combine the results with a mean.

Answer 2

how many misclassifications can a learner make over an infinite run

Answer 3

When everything is classified correctly, unless noise. When no more attributes. What do we do if noise? Don't trust the data completely. Can overfit with a tree too big

Answer 4

h\_ml = argmax\_h in H of P(D | h) The MAP Hypothesis with P(h) dropped because we consider all P(h) priors equal.

Answer 5

locality of the bits matter (first 4 bits are related or last 4 bits are related) sub parts of the space can be independently optimized.

Answer 6

P(A|B) = P(B|A)P(A) / P(B) P(A,B) = P(A|B)P(B) = P(B|A)P(A)
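
A quick numeric check of Bayes' rule and the two factorizations of the joint. The probabilities below are made-up values chosen only for illustration; nothing here comes from the cards beyond the formulas themselves.

```python
# Hypothetical numbers, for illustration only.
p_a = 0.01               # P(A): prior
p_b_given_a = 0.9        # P(B|A)
p_b_given_not_a = 0.05   # P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Both factorizations of the joint P(A,B) should agree.
joint_from_a_given_b = p_a_given_b * p_b
joint_from_b_given_a = p_b_given_a * p_a
```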

Answer 7

T -\> 0: like hill climbing T -\> inf: like random walk Decrease temperature slowly. Boltzmann distribution - likely to be in a location of high fitness

Answer 8

Boosting re-weights the samples that aren't being classified well so that later learners do better on them. After the distribution is renormalized, the previous hypothesis is right on only half of the weighted examples, so the next learner has to do something new. Errors go down and alphas go up over time. Hypotheses that are more right have more vote. Information gain: you must pick up information as you go along, in particular on the things you got wrong in the past.

Answer 9

Learning from delayed reward

Answer 10

Attributes and weights flow in Activation: Sum\_i=1 to k\_ Xi \* Wi \>= theta (firing threshold) yes - y = 1 no - y = 0

Answer 11

Just like a training set. Look at candidate and see if it does the job. Train =/= Test

Answer 12

Restriction bias: perceptron - only considers planes, half spaces sigmoids - much more complex. not much restriction Representations: boolean - network of threshold like units continuous - connected no jumps - hidden arbitrary - stitch together - 2 hidden Preference bias: initial weights (small, random values to minimize bias and give variability between runs) prefer simpler explanations (occam's razor)

Answer 13

ANN is iterative As the iterations continue, at risk of weights becoming too large and overfit Error over iterations plot will show variance between train and validation data Recommend stopping training before the train/validation data diverges

Answer 14

inputs. Vectors of values. The set of things you are looking for

Answer 15

A weak learner is a learner that will always do better than chance For all D, PrD[.] \<= 1/2 - epsilon epsilon - a really really small number \<\<\< 1

Answer 16

Computationally beneficial organization of the chain rule. The errors flow backwards. Error function can have many local optima. (Weights cannot change without making error worse, so stuck)

Answer 17

- sum P(s) log P(s)

Answer 18

Multiple tries to find a good starting place Not much more expensive (constant factor)

Answer 19

If events are mutually exclusive with sum from i= 1 to n of P(Ai) = 1, then P(B) = sum\_ i = 1 to n of P(B|Ai)P(Ai)

Answer 20

Sigmoid - Differentiable threshold sigma(a) = 1 / (1 + e ^ -a) goes to 0 when a -\> -inf goes to 1 when a -\> inf
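
A minimal sketch of the sigmoid as code. The derivative helper is an addition (it is the standard identity sigma'(a) = sigma(a)(1 - sigma(a)), which is what makes the unit usable with gradient methods). Note that `math.exp` overflows for very large negative `a`; this sketch does not guard against that.

```python
import math

def sigmoid(a: float) -> float:
    """Differentiable threshold: 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_derivative(a: float) -> float:
    """d/da sigma(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```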

Answer 21

Learning scenario called "agnostic" Learner doesn't have to have the target concept in its hypothesis space, but has to find the hypothesis that best matches the true concept. m \>= (1 / (2 epsilon^2)) \* (ln|H| + ln(1/delta)) Like the equation for non-agnostic case but epsilon is a

Answer 22

O(n!) nodes (exponential) output exponential 2^n rows (truth table) 2^(2^n) size truth table. How many different bit patterns? (2^(# positions)). 2 ^ (2^n)

Answer 23

M = 2 / ||W|| From wT (x1 - x2) = 2, divide both sides by ||W||: (wT / ||W||) (x1 - x2) = 2 / ||W||

Answer 24

2. With 3 points on the line, cannot create a range such that points are : + - +

Answer 25

H is PAC-learnable iff VC dimension is finite.

Answer 26

Don't always improve (exploit), sometimes you need to search (explore). Tradeoff ## Footnote For finite set of iterations: sample new point Xt in N(x) jump to new sample with probability given by an acceptance probability function (hill climb or use Temperature to decide if to jump) decrease temperature Higher temperature - e^0 willing to take downward steps
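
The loop in the footnote can be sketched as follows. This is an illustrative version, not the course's reference code: the acceptance rule e^(delta / T), the geometric cooling schedule, and all names and defaults (`t0`, `decay`, `steps`, the toy fitness function) are assumptions.

```python
import math
import random

def simulated_annealing(fitness, neighbor, x0, t0=10.0, decay=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    x, t = x0, t0
    best = x
    for _ in range(steps):
        xt = neighbor(x, rng)                  # sample a new point in N(x)
        delta = fitness(xt) - fitness(x)
        # Always accept improvements; accept a worse point with probability e^(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            x = xt
        if fitness(x) > fitness(best):
            best = x
        t *= decay                             # decrease temperature slowly
    return best

# Hypothetical 1-D problem: maximize -(x - 3)^2 over the integers.
fitness = lambda x: -(x - 3) ** 2
neighbor = lambda x, rng: x + rng.choice([-1, 1])
result = simulated_annealing(fitness, neighbor, x0=-20)
```

At high temperature the exponent is close to 0, so downward steps are accepted often (random walk); as T shrinks the exponent goes to minus infinity and only improvements are accepted (hill climbing).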

Answer 27

Euclidean sqrt( (x0 - x1)^2 + (y0 - y1)^2 ) Manhattan |x0 - x1| + |y0 - y1| (taxicab)
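
The two distances written as functions, generalized from two coordinates to points of any equal length (that generalization is an assumption beyond the card).

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Taxicab distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```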

Answer 28

Population of individuals mutation - local search N(x) cross over - population holds information (different\*\*). Combine attributes to be better generations - iterations of improvement P0 = initial population of size K Repeat until converged: Compute fitness of all x in Pt Select "most fit" individuals (top half - truncated selection, weighted prob - roulette wheel) Pair up individuals, replacing "least fit" individuals via crossover/mutation
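
A compact sketch of that loop on bit strings, using truncation selection and one-point crossover. Several choices are assumptions made for the example: a fixed number of generations stands in for "repeat until converged", and the names, defaults, and the "one-max" fitness function are hypothetical.

```python
import random

def genetic_algorithm(fitness, n_bits=12, k=20, generations=30, p_mutate=0.05, seed=0):
    rng = random.Random(seed)
    # P0: initial population of k random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(k)]
    for _ in range(generations):
        # Truncation selection: keep the most-fit half.
        population.sort(key=fitness, reverse=True)
        survivors = population[: k // 2]
        children = []
        while len(survivors) + len(children) < k:
            mom, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with small probability.
            child = [b ^ 1 if rng.random() < p_mutate else b for b in child]
            children.append(child)
        # Children replace the least-fit half.
        population = survivors + children
    return max(population, key=fitness)

best = genetic_algorithm(sum)  # "one-max": fitness = number of 1 bits
```

Because the top half always survives, the best fitness in the population never decreases from one generation to the next.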

Answer 29

C0 + C1X + C2X^2 + C3X^3 = y Use matrix multiplication: Xw = y X^TXw = X^Ty (X^TX)^-1 X^TXw = (X^TX)^-1 X^Ty w = (X^TX)^-1 X^Ty

Answer 30

Scalar input, continuous out Vector input, continuous out - include more input features: (size, distance from zoo). hyperplanes discrete vector or scalar inputs - for discrete quantities can enumerate them into numbers.

Answer 31

Set of all inputs paired with correct outputs (this is tall, this is not tall).

Answer 32

Tree where every node (minus root) has one parent Every node depends on exactly one node. Simplest inter-relationship

Answer 33

The function that we care about that will map the inputs to outputs. (instances to outputs).

Answer 34

Get distributions for probability of values, generate values simulates a complex process approximate (faster than exact) inference (machine) visualization, getting a feel (human)

Answer 35

d + 1 Note: VC dimension is often number of parameters

Answer 36

training examples are pairs (xi, di) where di = f(xi) + epsilon epsilon error is iid with normal distribution and zero mean. In practice, it may not always make sense that di contains error but the xs do not.

Answer 37

True Hypothesis - c in H Candidate Hypothesis - h in H consistent learner - produces c(x) = h(x) for x in S version space - VS(S) = {h s.t. h in H consistent wrt S}. set of hypotheses consistent with examples.

Answer 38

To avoid local minima: advanced optimization methods momentum - use momentum to "bounce out" higher order derivatives randomized optimization penalty for complexity (more nodes, layers penalty, larger number weights)

Answer 39

Mercer Condition - Acts like a distance or acts like a similarity .

Answer 40

Come up with many simple rules, instead of one complicated rule. Learn over a subset of data, combine for simple rule repeat combine all rules, complex rule

Answer 41

Take some of the training data and act like it is test data. The left out set of data is the cross validation set. Divide data into K - folds Ex: 4 folds train on 1,2,3 test on 4 train on 1,3,4 test on 2 train on 2,3,4 test on 1 train on 1,2,4 test on 3 Average across the folds. Model that performs the best on cross-validation would likely be the best model on unseen data.
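
The fold bookkeeping can be sketched in a few lines. `fit` and `error` are hypothetical callables supplied by the caller (train a model, then score it on held-out data); the contiguous, unshuffled folds are a simplifying assumption.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_val_score(fit, error, xs, ys, k=4):
    """Average held-out error across the k folds."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(error(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)
```

To compare candidate models, compute `cross_val_score` for each and prefer the one with the lowest average error.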

Answer 42

Deduction - General rule to specifics Induction - Examples to generate rule

Answer 43

Learner with P(D) = |VS of H,D| / |H| P(h | D) = 1 / |VS of H,D| if h is consistent with D, else 0 \* This learner requires uniform priors, noise-free data, and a target concept c that is in H

Answer 44

- linear separators - inf number of lines - artificial neural networks - inf number of weights - decision trees (continuous inputs) - inf number of questions

Answer 45

Set of all functions that you are willing to entertain. Could be all possible functions in the world (unlikely)

Answer 46

Continuous outputs (not inputs). How do you measure information on continuous values? Look at variance. How to split? On leaves do a standard fitting algorithm like the average or linear fit or do a vote.

Answer 47

The thing we are trying to find. The answer. (ex: determine if something is car or not, person is M or F)

Answer 48

Map input space to real number

Answer 49

maximum a posteriori h\_map = argmax\_h in H P(h | D) = argmax\_h in H P(D | h) P(h) / P(D) (drop P(D), independent of h) = argmax\_h in H P(D | h) P(h)

Answer 50

Only the Hypothesis we prefer from those we consider

Answer 51

Concept that you think is the target concept

Answer 52

Training data corresponds to noisy, complex sensor data. When the learned target function does not need to be understood by a person

Answer 53

|H| = 10 input space = 2^10

Answer 54

Yes, if the attribute is continuous you can use a different range or a discrete value from the attribute

Answer 55

epsilon : error goal between 0 and 1/2 "approximately" delta: certainty goal between 0 and 1/2 "probably"

Answer 56

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially

Answer 57

Bagging learns over a subset of data (drawn randomly) and applies the learner then combines with a mean. Boosting takes a subset of the data (hardest examples) and applies the learner. Finally, taking a weighted mean

Answer 58

Inference is cheap (in this case) Few parameters Estimate parameters with labeled data Connects inference and classification Empirically successful Handles missing attributes pretty well

Answer 59

Training error: mismatches / total (implies each error is equal) True error: PrD[h(x) =/= c(x)], the probability under the underlying distribution that I will disagree with the true concept

Answer 60

Does not model relationship between attributes How can this work in practice? Weak relationships, enough attributes, may still get the correct label even if the probabilities are wrong. Naive Bayes believes answer too much but doesn't matter if it is right.

Answer 61

Margins are maximized as boosting continues. There is a reduction in the fraction of training examples with small margin which yields improvements in test error. The process of learning on the hardest examples continues even after the training error has reached zero.

Answer 62

epsilon would be less than half (not even a half)

Answer 63

Independent and identically distributed Assumption of algorithms Data is representative in test/train of the real world distribution

Answer 64

Derive structure from inputs | (Description)

Answer 65

A function that is used to represent similarity Mechanism by which we inject domain knowledge into the SVM

Answer 66

Find the maximum spanning tree using Prim. MST over mutual information

Answer 67

P(A n B) = P(A|B)P(B) = P(B|A)P(A)

Answer 68

information gain Gain(S, A) = Entropy(S) - Sum\_v ( |Sv| / |S| ) Entropy(Sv) (entropy of original collection - expected entropy after S is partitioned using A) Maximum entropy on even split
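
A small sketch of the gain computation for a single categorical attribute. The function names and the list-based inputs (`values` holds the attribute's value for each example, `labels` the class) are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A): entropy of S minus the expected entropy after splitting on A."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)     # Sv for each value v of A
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly gains the full entropy of S; one whose groups each keep the original class mix gains nothing.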

Answer 69

Adaboost - converges to log likelihood ratio - linear classifier - many mods (TC AB Advantages simple feature selection on large sets of features generalization disadvantages suboptimal can overfit in presence of noise

Answer 70

Instead of training on data to produce a function then throwing the data out, store the data and perform a lookup against new values. Get back exactly what you put in. ## Footnote Pros: remembers, fast, simple Cons: no generalization, overfitting, affected by noise (2 inputs different outputs)

Answer 71

Minimize the sum of squared errors of the points E(c) = Sum\_i=1 to n\_ (yi - c)^2 Take the derivative of E(c), set it to zero, and get c = sum of yi / n (mean)
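
A numeric sanity check of that result: the mean should give a lower sum of squared errors than nearby constants. The data points are hypothetical.

```python
def sse(ys, c):
    """Sum of squared errors of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys)

ys = [2.0, 4.0, 9.0]        # hypothetical data
mean = sum(ys) / len(ys)    # c = sum(yi) / n

# Compare the mean against constants slightly below and above it.
candidates = [mean - 0.5, mean, mean + 0.5]
best = min(candidates, key=lambda c: sse(ys, c))
```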

Answer 72

H(x,y) = - sum( P(x,y) log P(x,y) )

Answer 73

Measures relationship between x and y. Measure reduction of randomness in a variable given some other variable. I (x, y) = H(y) - H(y | x)
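
A sketch of I(x, y) = H(y) - H(y | x) for discrete variables, computed from a joint probability table. The table layout (`joint[i][j]` = P(x=i, y=j)) and the function names are assumptions for the example.

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def mutual_information(joint):
    """joint[i][j] = P(x=i, y=j). Returns H(y) - H(y | x) in bits."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # H(y | x) = sum over x of P(x) * H(y | x = that value)
    h_y_given_x = sum(
        px * entropy([pxy / px for pxy in row])
        for px, row in zip(p_x, joint) if px > 0
    )
    return entropy(p_y) - h_y_given_x
```

Independent variables give 0 bits (knowing x does not reduce the randomness of y); two perfectly coupled fair bits give 1 bit.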

Answer 74

Taking random subsets and combining by the mean

Answer 75

mapping continuous inputs to outputs word comes from fitting a regression line (averages regress to the mean, ie. tall people having medium height children)

Answer 76

K = n weighted average Create local linear regression for different chunks of data. Can make a more complicated space based off of building simple hypothesis spaces .

Answer 77

perceptron rule (threshold) -\> single unit While error: Wi = Wi + deltaWi deltaWi = n (y - ^y)Xi (n is learning rate) ^y = Sum\_i\_WiXi \>= 0 (subtract theta from both sides, to become weight, bias term added so threshold is folded into weights (bias term = -1 theta) ) \*\* perceptron rule will find the answer in finite iterations if linearly separable
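
The rule above as a runnable sketch, trained on AND (which is linearly separable, so the loop terminates). Here the threshold is folded in as a weight on a constant input of 1.0, so that weight plays the role of -theta; the `max_epochs` cap and the default learning rate of 1.0 (chosen to keep the arithmetic exact) are assumptions.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """samples: list of (x, y) with x a tuple of features and y in {0, 1}."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                        # last weight pairs with the bias input
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            xb = list(x) + [1.0]               # constant bias input
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y_hat != y:
                errors += 1
                # deltaWi = eta * (y - y_hat) * Xi
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xb)]
        if errors == 0:                        # "while error": stop once all are right
            break
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) >= 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
```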

Answer 78

Learn the best hypothesis given data and some domain knowledge (best = most probable) P(h|D) D - data h - particular hypothesis argmax\_h in H P(h | D)

Answer 79

Lazy - low complexity at learn, more at query Eager - more complexity at learn, less at query Even with the same hypothesis space, a lazy learner effectively has a richer hypothesis space because it doesn't commit to a single global approximation. Lazy learners allow for local approximations of the target function. Lazy learners model complex target functions with less complex local approximations.

Answer 80

The size of the largest set of inputs that the hypothesis class can label in all possible ways ("shatter") Vapnik-Chervonenkis The VC dimension helps to create a definition for the number of data points needed when a hypothesis class is infinite m \>= (1/epsilon) \* (8 \* VC(H) \* log\_2(13/epsilon) + 4 \* log\_2(2/delta))

Answer 81


Answer 82

Cross-validation : try different trees and see which has lowest error on validation set More efficient : hold out a set and if error gets worse stop. Pruning : when full tree is built, see if collapsing nodes up will lower or raise error. If raise, stop. Update output with a vote.

Answer 83

Training err - frac of training misclassified by h true err - frac of examples that would be misclassified on sample drawn from D error\_D(h) = Pr\_x~D[c(x) =/= h(x)] - err on examples we will never see is ok. Probably - Approximately - Correct C is PAC-learnable by learner using Hyp space iff learner L will, with probability 1 - delta (certainty goal), output a hypothesis h in H such that error\_D(h) \<= epsilon (error goal) in time and samples polynomial (1/ epsilon, 1/delta, n)

Answer 84

No memory of where you've been at where you are

Answer 85

Decision Trees are robust to errors (all training used at each step and termination criteria can be updated to accept hypotheses that imperfectly fit data) or missing data

Answer 86

computational effort for convergence

Answer 87

P(D|h) - the probability of seeing the data given that the hypothesis is true. Assume the X's are given, the labels are what we are trying to assign probability to. Given a set of X's, what's the probability I would see a particular label. Easier to compute the probability of seeing a label. P(D) - prior on the data P(h) = prior on the hypothesis drawn from the hypothesis space. prior is the domain knowledge

Answer 88

Moderate the degree to which weights are changed at each step. May decay as iterations increase.

Answer 89

A measure of randomness A lot of entropy (1 bit) for a fair coin. A coin that always lands the same way has no entropy (0 bits) - Sum\_v [p(v) log p(v)] (sum over all possible values)

Answer 90

Defining learning problems showing specific algorithms work show these problems are fundamentally hard

Answer 91

Hfinal (X) = sgn(Sum alpha T ht(x)) alpha T - more weight if doing well

Answer 92

batch. how many training examples are needed for a learner to create a successful hypothesis

Answer 93

X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is if For all x,y,z P(X=x|Y=y,Z=z) = P(X=x|Z=z) in short: P(X|Y,Z) = P(X|Z)

Answer 94

The definition implicitly assumes that the learner's hypothesis space contains a hypothesis with arbitrarily small error for every target concept in C. Need to consider a second measure of the complexity of H, VC dimension. We can state bounds on sample complexity that use VC(H) rather than |H|.

Answer 95

Measures the difference between two distributions. Used like a distance measure. Always non-negative, and zero when p = q. D(p || q) = integral p(x) log (p(x)/q(x)) dx
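
A discrete version of the formula, with a sum in place of the integral (an assumption made so the example is runnable). It also assumes `q` is nonzero wherever `p` is nonzero.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for discrete distributions given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

Comparing `d_pq` with `d_qp` shows that the measure is not symmetric, so it behaves like a distance only loosely.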

Answer 96

Boolean Functions - AND 2 nodes, commutative (A or B order doesn't matter) - OR 2 nodes, commutative, linear in n - XOR 3 nodes, need to split twice, 2^n - 1 nodes, exponential (HARD)

Answer 97

Start with hMAP = argmax P(D|h) P(h) = argmax [lg P(D|h) + lg P(h)] = argmin [- lg P(D|h) - lg P(h)] First arg -lg P(D|h) =\> length(D|h). Bits needed to describe "size of miscalculation/err" Second arg -lg P(h) =\> length(h). Bits needed to describe "size of h" This provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis. Helps with overfitting.

Answer 98

Mimic does well with structure Representing P\_theta for all theta Local optima problem (restarts free, probability theory) takes a long time - orders of magnitude fewer iterations, but longer per iteration more information per iteration works well when cost of evaluating fit function is high

Answer 99

Locality - near points are similar smoothness - averaging All features matter equally

Answer 100

An instance-based learning method in which instances (cases) may be rich relational descriptions and in which the retrieval and combination of cases to solve the current query may rely on knowledge based reasoning and search-intensive problem-solving methods. (ex: Cadet device design)

Answer 101

D1(i) = 1/n (uniform at first) Dt+1(i) = Dt(i) \* e^(-alphaT yi ht(xi)) / zt (zt is the normalization) where alphaT = 1/2 ln((1 - err t) / err t) (positive) ht = -1/+1 yi = -1/+1 y\*h = +1 if agree, -1 if disagree Exponent -alphaT yi ht(xi): - if they agree (weight shrinks) + if they disagree (weight grows)
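
One round of this reweighting as a sketch. The function name and the toy labels and predictions are hypothetical, and the code assumes the weighted error is strictly between 0 and 1 so that the logarithm is defined.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One reweighting step. labels and predictions are lists of -1/+1."""
    # Weighted error of the weak hypothesis under the current distribution.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights where y*h = +1 (agree), grow them where y*h = -1 (disagree).
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, labels, predictions)]
    z = sum(unnormalized)                      # normalization constant
    return alpha, [u / z for u in unnormalized]

n = 4
d1 = [1 / n] * n                               # uniform at first
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]                    # hypothetical weak learner, one mistake
alpha, d2 = adaboost_round(d1, labels, predictions)
```

After renormalizing, the examples the weak learner got wrong hold half of the total weight, which forces the next learner to pay attention to them.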

Answer 102

If a weak learner uses ANN with many layers and nodes pink noise - uniform noise underlying weak learner that overfits

Answer 103

time, space, data

Answer 104

learner/teacher - learner asks Qs. Teacher gives exs. Fixed distribution (nature) Evil - worst distribution

Answer 105

Only the Hypothesis we consider

Answer 106

1. Pick best attribute (split data) 2. Asked Question 3. Follow answer path 4. Go to 1, repeat to answer

Answer 107

VC dimension is 3 Collinear case is tricky + - + for points on y = 0. But make use of the second dimension. So 3 points can be shattered. Four points cannot. See example.

Answer 108

This is a theorem for bounding the true error as a function of the number of examples that are drawn. It provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H m \>= (1/epsilon) \* (ln|H| + ln(1/delta) )
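
Plugging numbers into the bound. The function name is hypothetical, and the inputs (|H| = 10, epsilon = 0.1, delta = 0.2) are example values chosen for illustration.

```python
import math

def sample_complexity_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

m = sample_complexity_bound(h_size=10, epsilon=0.1, delta=0.2)
```

Here m comes out to 40: the raw bound is about 39.1, rounded up to a whole number of examples.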

Answer 109

Learners with constrained bounds (X and Y and Not Z) Learner with mistake bounds (X and Not X): 1 - assume possible each variable is positive and negated 2 - given input, compute output 3 - if wrong, set + to - and - to +. Go to 2 (k + 1) mistakes

Answer 110

Training Set

Answer 111

most alphas are 0, most w's don't matter. most data points are not support vectors

(139 cards)