Decision Trees Flashcards

1
Q

Inductive Bias

A

The set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances. Some form of inductive bias is needed in order to generalize beyond training data.

2
Q

Restriction bias

A

Form of inductive bias that comes from the learner's hypothesis space being restricted
(less desirable because it potentially excludes the unknown target function altogether)

3
Q

Preference bias

A

Form of inductive bias that comes from the learner's search strategy, where certain hypotheses are preferred over others (with no hard restrictions on which hypotheses can be enumerated)

(more desirable than restriction bias because the target function will always be in the hypothesis space)

4
Q

ID3 algorithm: basic description

A

Learns by constructing the tree top-down, using a simple-to-complex greedy hill-climbing search (no backtracking).

Hypothesis space searched: set of all possible decision trees

5
Q

ID3 algorithm: pseudocode

A
  • Pick best attribute (via information gain) to use as decision node
  • For each value, create descendant of node & sort training examples to leaves
  • Repeat until (a) everything classified correctly or (b) all attributes used (see the sketch below)
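A minimal, self-contained Python sketch of this loop – assuming discrete attributes and examples stored as dicts; all names are illustrative, not from any particular library:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(examples, attr, target):
        before = entropy([ex[target] for ex in examples])
        after = 0.0
        for v in set(ex[attr] for ex in examples):
            subset = [ex[target] for ex in examples if ex[attr] == v]
            after += len(subset) / len(examples) * entropy(subset)
        return before - after

    def id3(examples, attributes, target):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:       # (a) everything classified correctly
            return labels[0]
        if not attributes:              # (b) all attributes used -> majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(examples, a, target))
        return {best: {v: id3([ex for ex in examples if ex[best] == v],
                              [a for a in attributes if a != best], target)
                       for v in set(ex[best] for ex in examples)}}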
6
Q

Entropy

A

Characterizes the (im)purity of an arbitrary collection of samples (i.e. the expected number of bits needed to encode the class of a randomly drawn sample) – formula & worked example below

If all belong to same class -> 0
If equal # of positive & negative samples -> 1
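The general formula, plus a quick worked example for the two-class case (a commonly used 9+/5- sample):

    Entropy(S) = sum over classes i of [ -p_i * log2(p_i) ]

    e.g. S with 9 positive & 5 negative examples:
    Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940 bits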

7
Q

Information Gain

A

Expected reduction in entropy caused by partitioning the examples according to the attribute in question (i.e. # bits saved – how well it separates examples; written out below)
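Written out, where S_v is the subset of S for which attribute A takes value v:

    Gain(S, A) = Entropy(S) - sum over v in Values(A) of [ (|S_v| / |S|) * Entropy(S_v) ]

(this is exactly the gain helper in the ID3 sketch above)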

8
Q

Implications of no backtracking in ID3

A

Comes with the usual risks of hill climbing w/o backtracking – converging to a local optimum (instead of the global one)

A way to counter this is post-pruning (a form of backtracking done after completing the tree)

9
Q

Inductive bias of ID3

A

Comes entirely from preference bias:

  • Prefers best splits (highest information gain) closest to the root (at top of tree)
  • Prefers shorter trees over longer ones (not always the outcome though, due to first bullet)

(ID3 searches over complete hypothesis space, so no restriction bias)

10
Q

Occam’s razor

A

Prefer the simplest among different choices.

This is why we prefer shorter hypotheses, as they are less likely to fit the data by statistical coincidence

11
Q

In general, what causes overfitting

A
  • Training data has noise that is learned, so training error keeps decreasing while the ability to generalize to new examples gets worse
  • Too few training examples (even if noise-free)
12
Q

How to avoid overfitting with decision trees

A

Two types of approaches:

1) Stop growing the tree earlier (before training data perfectly classified)
2) Allow overfitting, but then post-prune the tree (more successful in practice, b/c it's hard to estimate precisely when to stop growing)

13
Q

What is the main way to add safety checks against overfitting?

A

Divide training data into training and validation sets (cross-validation, leave-one-out, etc.)

14
Q

Neural Networks: high level basics

A

Robust approach to approximate real-valued, discrete-valued, & vector-valued target functions

  • Well suited for noisy, complex data
  • Long training times
  • Fast evaluation of learned target function
  • Difficult to interpret learned function (since we’re just learning network weights)
15
Q

What is the key speculation from biology that Neural Networks are loosely modeled after?

A

That the human brain gains its information-processing abilities from HIGHLY PARALLEL processes operating on distributed representations

16
Q

What is a perceptron

A

Linear combination of inputs that outputs 1 if the result is greater than some threshold, and 0 otherwise

  • Results in a hyperplane that separates the data
  • Single perceptron can represent {AND, OR, NOT} functions
  • Chaining multiple together allows representation of any BOOLEAN function
17
Q

perceptron training rule

A

Uses the result from the thresholded activation (the predicted label) to compute how much to shift the weights (see the sketch below).
[ delta_w_i = lr * (y - y_hat) * x_i ]

  • Convergence guaranteed (in finite iterations) only when data is linearly separable (& sufficiently small learning rate used)
  • If those conditions are satisfied, converges to a perfect classifier
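A minimal Python sketch of one training pass, assuming binary labels in {0, 1} and a bias weight folded in as x[0] = 1 (names like lr are illustrative):

    # One epoch of the perceptron training rule.
    def perceptron_epoch(w, data, lr=0.1):
        for x, y in data:                  # y in {0, 1}; x includes x[0] == 1 (bias)
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            w = [wi + lr * (y - y_hat) * xi for wi, xi in zip(w, x)]
        return w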
18
Q

gradient descent / delta training rule

A

Uses the activation (raw / un-thresholded) to compute weight changes.

In reality, this means we move the weights along the direction that decreases error w.r.t. each weight (gradient descent along the error derivative – see the batch sketch below)

  • Converges asymptotically towards at least a local minimum of the error surface, possibly in unbounded time
  • Works whether data linearly separable or not
  • Runs risk of overstepping the minimum of error surface if learning rate too large
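For contrast with the per-example perceptron rule above, a sketch of one batch step on a linear unit o = w·x (illustrative names):

    # One batch gradient-descent step of the delta rule on a linear unit.
    def delta_rule_step(w, data, lr=0.05):
        grad = [0.0] * len(w)
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))   # raw, un-thresholded output
            for i, xi in enumerate(x):
                grad[i] += (t - o) * xi                # accumulate over all examples
        return [wi + lr * g for wi, g in zip(w, grad)]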
19
Q

stochastic gradient descent

A

Approximates true gradient descent closely (with small enough learning rate) by updating weights after each training example is observed

  • Less computation per update step than standard gradient descent
  • Can sometimes help avoid falling into local minima, since it follows the per-sample error gradient instead of the gradient of the error over the whole training set
20
Q

What kind of functions can we represent with multi-layer neural networks of linear units?

A

Still can only produce linear functions!

21
Q

What type of activation function is needed for gradient descent?

A

We need:

  • Unit whose output is a nonlinear function of inputs
  • But whose output is also a differentiable function of its inputs

(so perceptron unit does not work as it has a discontinuous threshold – not differentiable)

22
Q

what is the sigmoid function

A

A common choice of an activation function. Also called the logistic function or ‘squashing’ function.

  • Outputs are continuous between 0 and 1
  • Easy derivative expressed in terms of its own output: sigmoid(x) * (1 - sigmoid(x)) – see the sketch below
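A direct Python transcription (nothing assumed beyond the card itself):

    from math import exp

    def sigmoid(x):
        return 1.0 / (1.0 + exp(-x))

    def sigmoid_prime(x):
        s = sigmoid(x)
        return s * (1.0 - s)   # derivative expressed via the output itself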
23
Q

what is backpropagation (basic idea)

A

The method for learning weights in multilayer networks via gradient descent.

  • Error is the sum of errors over the network’s output units
  • Gradient descent used because we have a large hypothesis space (all possible weight values)
24
Q

what is momentum in a neural network?

A

A mechanism where we make the weight update on the nth iteration depend partially on the update from the (n-1)th iteration (one common form shown below)

  • Analogous to a ball rolling down a surface, where momentum keeps the ball rolling in the same direction as it was before
  • Can help ‘roll’ through local minima or through flat regions to speed up convergence
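One common form of the update, with alpha (0 <= alpha < 1) as the momentum constant:

    [ delta_w(n) = -lr * dE/dw + alpha * delta_w(n-1) ]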
25
Q

What are the convergence guarantees with backpropagation

A

In multilayer networks, only guaranteed to converge to local minima (not necessarily global)

  • In practice, it is still highly effective
26
Q

Why is backpropagation still effective even though convergence only to local minima guaranteed?

A
  • With high-dimensional weight spaces, a local minimum with respect to one of the weights is not necessarily a local minimum w.r.t. the others
  • When weights are initialized near zero, the network represents a smooth, approximately linear function (due to the natural shape of the sigmoid). Only once the weights have grown can it represent highly non-linear functions – that's when local minima become more likely, but hopefully by then we're already close enough to the global minimum that it's OK
27
Q

How can we deal with local minima in neural networks with backpropagation?

A

Not a lot of surefire ways to know when they’ll cause difficulty, but common methods are:

  • Add a momentum term
  • Use stochastic gradient descent instead of true g.d.
  • Train multiple networks each w/different initial weights. Then select the best one or use a committee approach and average the results
28
Q

What is the nature of the hypothesis space search with backpropagation?

A

We have:

  • A continuous (large) hypothesis space of weight values
  • A differentiable error function w.r.t. our continuous hypotheses

This provides a useful structure (via the error gradient) for organizing the search for the best hypothesis

29
Q

What is the inductive bias at play with backpropagation?

A

Difficult to characterize, as it depends on the interplay between the g.d. search & the way in which the weight space spans the space of representable functions

But there is PREFERENCE BIAS:

  • g.d. prefers initial weights to be small random values, so we start with lower complexity / simpler explanations (small values) and sufficient variability (random values)

(NO restriction bias, b/c we can represent any type of relationship with neural networks)

30
Q

Hidden layer representations of ANNs

A

The intermediate representations of a neural network that are free to be set as whatever representation best minimizes squared error E

This allows ANNs to define new hidden layer features that aren’t explicit in the input representation, but that still capture properties relevant to learning the target function

  • Flexible so human doesn’t have to invent features
  • More layers = more complex representations
31
Q

Idea of overfitting with ANNs & how to combat it

A

The large number of weight parameters gives many degrees of freedom for overfitting on training data idiosyncrasies

to combat:

  • best way is still by using cross validation sets
  • can also use weight decay (decrease weights by a factor each iteration – update form below) to keep them small & bias against complex decision surfaces
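In update form, with lambda as a small decay constant (notation illustrative):

    [ w <- (1 - lambda) * w - lr * dE/dw ]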
32
Q

Recurrent Neural Networks

A

directed, cyclic graphs that allow representations of time-series data

  • Outputs at time ‘t’ are used as inputs for time ‘t+1’
  • Trained with a variant of backprop
  • More difficult to train than neural networks without feedback loops, & don't generalize as reliably
  • Increased representational power
33
Q

What does it mean for hypothesis A to be “more general than” hypothesis B?

A

That any instance classified as positive by hypothesis B will also be classified as positive with hypothesis A.

  • In other words, more general -> fewer constraints
34
Q

What are some features of Bayesian learning?

A
  • Each training example can incrementally inc./dec. estimated probability a hypothesis is correct (more flexible than completely eliminating a hypothesis if inconsistent with a single example)
  • Prior knowledge can be injected
  • Can accommodate hypotheses that make probabilistic predictions (e.g. 93% chance of recovery)
  • Provides a standard of optimal decision making to measure against, even if intractable
35
Q

Practical difficulties of Bayesian learning

A
  • Typically requires knowledge of many probabilities initially – and if not known, they’re estimated using assumptions / bias
  • Significant computational cost required to determine Bayes optimal hypothesis (linear in # of candidates)
36
Q

Main idea of Bayesian learning?

A

Instead of finding the ‘best’ hypothesis, we want to find the ‘most probable’ hypothesis, given the data and any initial knowledge of the prior probabilities

37
Q

what is the posterior probability?

A

Pr(h | D)

reflects our confidence in hypothesis ‘h’ holding after seeing the training data ‘D’

38
Q

What is the MAP hypothesis?

A

maximum a posteriori – the most likely hypothesis of the posterior distribution, meaning it incorporates prior probabilities as well

= argmax_h [ Pr(D | h) * Pr(h) ]
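A toy Python sketch with made-up priors & likelihoods, just to show the argmax:

    # Hypothetical numbers for three candidate hypotheses.
    prior      = {'h1': 0.6, 'h2': 0.3, 'h3': 0.1}
    likelihood = {'h1': 0.2, 'h2': 0.5, 'h3': 0.9}   # Pr(D | h)

    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    # h_map == 'h2' (0.5 * 0.3 = 0.15), even though 'h3' has the highest
    # likelihood -- the ML hypothesis (next card) would pick 'h3' instead.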

39
Q

What is the ML hypothesis?

A

maximum likelihood – simplified version of MAP, where we assume each hypothesis in H is equally probable (uniform) a priori

= argmax_h [ Pr(D | h) ]

40
Q

What is a consistent learner?

A

one that always outputs a hypothesis with zero error over training examples

(i.e. a hypothesis belonging to the version space)

41
Q

What is the version space?

A

Set of all hypotheses ‘h’ in H that correctly classify all training examples ‘D’

42
Q

What is the relationship between Bayes theorem and consistent learners?

A

If our training data ‘D’ is noise-free, and no a priori reason to believe any hypothesis more probable than another (uniform),

Pr(h | D) =

  • 1 / [size of version space], if ‘h’ consistent with ‘D’
  • 0, otherwise (not consistent)

Every consistent hypothesis is therefore a MAP hypothesis, and consistent learners always produce a MAP hypothesis! (even if Bayes isn't used explicitly)

43
Q

What is the relationship between Bayes theorem and sum of squared error algorithms?

A

Using Bayes, we can show that under certain assumptions, a learning algorithm that minimizes the sum of squared errors between predictions and training labels will actually output a [maximum likelihood hypothesis]!

Basically we can find (via Bayes) justification for many curve fitting methods that use sum of squared error, as well as gradient descent methods like in neural networks

44
Q

Relationship between minimum description length and Bayes

A

Minimum Description Length principle recommends choosing the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis (shorter hypotheses preferred)

Bayes theorem and basic information theory results can be used to describe the rationale for this

45
Q

What does the minimum description length principle give us?

A

A method to deal with overfitting by trading off hypothesis complexity for # errors committed by the hypothesis

46
Q

What is a Bayes optimal classifier?

A

One that combines the predictions of all hypotheses, weighted by their posterior probabilities

  • NOT necessarily the prediction outputted by the MAP hypothesis!
  • Resulting ‘h’ is a new hypothesis, not always part of H, because we’re taking linear combinations of predictions
47
Q

What is the Gibbs algorithm and how does it compare to the Bayes optimal classifier

A

Steps:

  1. Choose h in H at random according to the posterior distribution
  2. Use that ‘h’ to predict classification of the next instance X

Less optimal than Bayes optimal (duh) – at worst 2x the expected error of the Bayes optimal classifier

48
Q

Basics of Naive Bayes classifier

A

Simplifies Bayes theorem by assuming ‘naively’ that ALL attribute values are conditionally independent given the target value (count-based sketch below)

  • If that assumption is actually satisfied, the resulting classification will be equivalent to the Bayes optimal classification
  • If not, still works well in practice (like with the NLP example)
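A count-based Python sketch (add-one smoothing included so unseen attribute values don't zero out the product; all names illustrative):

    from collections import Counter, defaultdict
    from math import log

    def train_nb(examples):
        """examples: list of (attribute_tuple, label) -- training is just counting."""
        labels = Counter(lbl for _, lbl in examples)
        counts = defaultdict(Counter)     # (attr_index, label) -> value counts
        values = defaultdict(set)         # attr_index -> values seen in training
        for attrs, lbl in examples:
            for i, v in enumerate(attrs):
                counts[(i, lbl)][v] += 1
                values[i].add(v)
        return labels, counts, values, len(examples)

    def classify_nb(model, attrs):
        labels, counts, values, n = model
        def log_score(lbl):               # log Pr(lbl) + sum_i log Pr(v_i | lbl)
            s = log(labels[lbl] / n)
            for i, v in enumerate(attrs):
                s += log((counts[(i, lbl)][v] + 1) / (labels[lbl] + len(values[i])))
            return s
        return max(labels, key=log_score)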
49
Q

Describe the hypothesis search mechanism of Naive Bayes classifiers

A

NB doesn’t search through a hypothesis space!

Hypothesis is simply formed by counting frequencies of data combinations

50
Q

Basics of Bayesian Belief Networks (Bayes Nets)

A

Describes the joint probability distribution for a set of variables

  • Allows stating conditional independence assumptions to subsets of the variables (so less constraining than Naive Bayes which assumes all are conditionally independent)
51
Q

What is needed for Bayes Nets to be straightforward?

A
  • Network structure known in advance
  • All network variables directly observable in each example

Without one of these, more difficult

52
Q

What can we do if we want to build a Bayes Net but relevant instance variables are unobservable?

A

Can use the EM algorithm to estimate them

Goes back & forth between estimating the ML hypothesis & the expected values of the hidden variables until convergence to a local maximum

53
Q

What is sample complexity?

A

The # of training examples needed for a learner to converge to a successful hypothesis

54
Q

What is computational complexity?

A

How much computational effort is needed for a learner to converge to a successful hypothesis

55
Q

What is the mistake bound?

A

Number of training examples that learner will misclassify before converging to a successful hypothesis

56
Q

What is true error?

A

The probability that ‘h’ will misclassify an instance drawn at random from the underlying distribution D (not just the training data)

57
Q

PAC learning: general idea

A

Probably Approximately Correct

  • Require the error to be bounded by a constant, epsilon
  • Require the learner’s probability of failure to be bounded by a constant, delta
58
Q

What does it mean to say that a concept class is “PAC-learnable” by a learner ‘L’

A

that for all concepts in the concept class, L will, with probability at least (1 - delta), output a hypothesis with true error at most epsilon (in time polynomial in 1/epsilon, 1/delta, and the size of the instances & concept).

59
Q

What is an epsilon-exhausted version space and its meaning?

A

to be epsilon-exhausted, all hypotheses in the version space must have true error less than epsilon

This allows us to bound the # examples needed by any consistent learner by making sure the version space contains no ‘unacceptable’ hypotheses

60
Q

What does the Haussler Theorem let us do?

A

Using the size of the hypothesis space, allows us to bound the probability that the version space is epsilon-exhausted after a given number of training examples
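The usual form of the bound is Pr(VS not epsilon-exhausted) <= |H| * e^(-epsilon * m); requiring this to be <= delta gives the sample bound below (numbers in the example are illustrative):

    from math import ceil, log

    def sample_bound(h_size, eps, delta):
        """m >= (1/eps) * (ln|H| + ln(1/delta)) for any consistent learner."""
        return ceil((log(h_size) + log(1.0 / delta)) / eps)

    # e.g. |H| = 2**10, eps = 0.1, delta = 0.05:
    print(sample_bound(2**10, 0.1, 0.05))   # 100 training examples suffice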

61
Q

How does PAC-learning relate to VC dimensions?

A

hypothesis space H is PAC-learnable if and only if its VC dimension is FINITE

62
Q

What is a VC dimension

A

Measures the complexity of hypothesis space H as the size of the largest finite subset of instances from X that can be shattered (completely discriminated) using H.

The larger the VC dimension of a hypothesis space, the more data that’s needed to properly learn (increases sample complexity)

63
Q

What is shattering?

A

Gives a measure of a hypothesis space’s capacity to represent target concepts: the larger the subset that can be shattered, the more expressive H is & the more data that’s needed

We say that H shatters a subset of instances ‘S’ if every dichotomy of S can be represented by some hypothesis from H (see the sketch below)
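A brute-force Python check for one concrete (hypothetical) choice of H – closed intervals [a, b] on the real line. Intervals shatter any 2 points but no set of 3, so their VC dimension is 2:

    from itertools import product

    def interval_labels(a, b, points):
        return tuple(1 if a <= p <= b else 0 for p in points)

    def intervals_shatter(points):
        # Endpoints drawn from the points themselves (plus sentinels) suffice,
        # and a > b yields the empty interval (the all-negative dichotomy).
        ends = [min(points) - 1] + sorted(points) + [max(points) + 1]
        achievable = {interval_labels(a, b, points) for a in ends for b in ends}
        return all(d in achievable for d in product([0, 1], repeat=len(points)))

    print(intervals_shatter([1, 2]))      # True  -- all 4 dichotomies achievable
    print(intervals_shatter([1, 2, 3]))   # False -- (1, 0, 1) is impossible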

64
Q

How do SVMs and perceptrons relate?

A

Both learn linear separators (thresholded linear functions of the input). A perceptron settles for any hyperplane that separates the training data; an SVM chooses the separating hyperplane that maximizes the margin to the nearest training examples (&, via kernels, can find such hyperplanes in implicit higher-dimensional feature spaces).