Decision Trees Flashcards

1
Q

Inductive Bias

A

The set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances. Some form of inductive bias is needed in order to generalize beyond training data.

2
Q

Restriction bias

A

Form of inductive bias that comes from the learner's hypothesis space being restricted
(less desirable because it potentially excludes the unknown target function altogether)

3
Q

Preference bias

A

Form of inductive bias that comes from the learner's search strategy, where certain hypotheses are preferred over others (with no hard restrictions on which hypotheses can be enumerated)

(more desirable than restriction bias because the target function will always be in the hypothesis space)

4
Q

ID3 algorithm: basic description

A

Learns by constructing the tree top-down, using a simple-to-complex greedy hill-climbing search (no backtracking).

Hypothesis space searched: set of all possible decision trees

5
Q

ID3 algorithm: pseudocode

A
  • Pick best attribute (via information gain) to use as decision node
  • For each value, create descendant of node & sort training examples to leaves
  • Repeat until (a) everything classified correctly or (b) all attributes used (see the sketch below)
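A minimal, self-contained Python sketch of this loop – assuming discrete attributes and examples stored as dicts; all names are illustrative, not from any particular library:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(examples, attr, target):
        before = entropy([ex[target] for ex in examples])
        after = 0.0
        for v in set(ex[attr] for ex in examples):
            subset = [ex[target] for ex in examples if ex[attr] == v]
            after += len(subset) / len(examples) * entropy(subset)
        return before - after

    def id3(examples, attributes, target):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:       # (a) everything classified correctly
            return labels[0]
        if not attributes:              # (b) all attributes used -> majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(examples, a, target))
        return {best: {v: id3([ex for ex in examples if ex[best] == v],
                              [a for a in attributes if a != best], target)
                       for v in set(ex[best] for ex in examples)}}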
6
Q

Entropy

A

Characterizes the (im)purity of an arbitrary collection of samples (i.e. the expected number of bits needed to encode the class of a randomly drawn sample) – formula & worked example below

If all belong to same class -> 0
If equal # of positive & negative samples -> 1
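The general formula, plus a quick worked example for the two-class case (a commonly used 9+/5- sample):

    Entropy(S) = sum over classes i of [ -p_i * log2(p_i) ]

    e.g. S with 9 positive & 5 negative examples:
    Entropy(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940 bits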

7
Q

Information Gain

A

Expected reduction in entropy caused by partitioning the examples according to the attribute in question (i.e. # bits saved – how well it separates examples; written out below)
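Written out, where S_v is the subset of S for which attribute A takes value v:

    Gain(S, A) = Entropy(S) - sum over v in Values(A) of [ (|S_v| / |S|) * Entropy(S_v) ]

(this is exactly the gain helper in the ID3 sketch above)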

8
Q

Implications of no backtracking in ID3

A

Comes with the usual risks of hill climbing w/o backtracking – converging to a local optimum (instead of the global one)

A way to counter this is post-pruning (a form of backtracking done after completing the tree)

9
Q

Inductive bias of ID3

A

Comes entirely from preference bias:

  • Prefers best splits (highest information gain) closest to the root (at top of tree)
  • Prefers shorter trees over longer ones (not always the outcome though, due to first bullet)

(ID3 searches over complete hypothesis space, so no restriction bias)

10
Q

Occam’s razor

A

Prefer the simplest among different choices.

This is why we prefer shorter hypotheses, as they are less likely to fit the data by statistical coincidence

11
Q

In general, what causes overfitting

A
  • Training data has noise that is learned, so training error keeps decreasing while the ability to generalize to new examples gets worse
  • Too few training examples (even if noise-free)
12
Q

How to avoid overfitting with decision trees

A

Two types of approaches:

1) Stop growing the tree earlier (before training data perfectly classified)
2) Allow overfitting, but then post-prune the tree (more successful in practice, b/c it's hard to estimate precisely when to stop growing)

13
Q

What is the main way to add safety checks against overfitting?

A

Divide training data into training and validation sets (cross-validation, leave-one-out, etc.)

14
Q

Neural Networks: high level basics

A

Robust approach to approximate real-valued, discrete-valued, & vector-valued target functions

  • Well suited for noisy, complex data
  • Long training times
  • Fast evaluation of learned target function
  • Difficult to interpret learned function (since we’re just learning network weights)
15
Q

What is the key speculation from biology that Neural Networks are loosely modeled after?

A

That the human brain gains its information-processing abilities from HIGHLY PARALLEL processes operating on distributed representations

16
Q

What is a perceptron

A

Linear combination of inputs that outputs 1 if the result is greater than some threshold, and 0 otherwise

  • Results in a hyperplane that separates the data
  • Single perceptron can represent {AND, OR, NOT} functions
  • Chaining multiple together allows representation of any BOOLEAN function
17
Q

perceptron training rule

A

Uses the result from the thresholded activation (the predicted label) to compute how much to shift the weights (see the sketch below).
[ delta_w_i = lr * (y - y_hat) * x_i ]

  • Convergence guaranteed (in finite iterations) only when data is linearly separable (& sufficiently small learning rate used)
  • If those conditions are satisfied, converges to a perfect classifier
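A minimal Python sketch of one training pass, assuming binary labels in {0, 1} and a bias weight folded in as x[0] = 1 (names like lr are illustrative):

    # One epoch of the perceptron training rule.
    def perceptron_epoch(w, data, lr=0.1):
        for x, y in data:                  # y in {0, 1}; x includes x[0] == 1 (bias)
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            w = [wi + lr * (y - y_hat) * xi for wi, xi in zip(w, x)]
        return w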
18
Q

gradient descent / delta training rule

A

Uses the activation (raw / un-thresholded) to compute weight changes.

In reality, this means we move the weights along the direction that decreases error w.r.t. each weight (gradient descent along the error derivative – see the batch sketch below)

  • Converges asymptotically towards at least a local minimum of the error surface, possibly in unbounded time
  • Works whether data linearly separable or not
  • Runs risk of overstepping the minimum of error surface if learning rate too large
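For contrast with the per-example perceptron rule above, a sketch of one batch step on a linear unit o = w·x (illustrative names):

    # One batch gradient-descent step of the delta rule on a linear unit.
    def delta_rule_step(w, data, lr=0.05):
        grad = [0.0] * len(w)
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))   # raw, un-thresholded output
            for i, xi in enumerate(x):
                grad[i] += (t - o) * xi                # accumulate over all examples
        return [wi + lr * g for wi, g in zip(w, grad)]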
19
Q

stochastic gradient descent

A

Approximates true gradient descent closely (with small enough learning rate) by updating weights after each training example is observed

  • Less computation per update step than standard gradient descent
  • Can sometimes help avoid falling into local minima, since it follows the per-sample error gradient instead of the gradient of the error over the whole training set
20
Q

What kind of functions can we represent with multi-layer neural networks of linear units?

A

Still can only produce linear functions!

21
Q

What type of activation function is needed for gradient descent?

A

We need:

  • Unit whose output is a nonlinear function of inputs
  • But whose output is also a differentiable function of its inputs

(so perceptron unit does not work as it has a discontinuous threshold – not differentiable)

22
Q

what is the sigmoid function

A

A common choice of an activation function. Also called the logistic function or ‘squashing’ function.

  • Outputs are continuous between 0 and 1
  • Easy derivative expressed in terms of its own output: sigmoid(x) * (1 - sigmoid(x)) – see the sketch below
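A direct Python transcription (nothing assumed beyond the card itself):

    from math import exp

    def sigmoid(x):
        return 1.0 / (1.0 + exp(-x))

    def sigmoid_prime(x):
        s = sigmoid(x)
        return s * (1.0 - s)   # derivative expressed via the output itself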
23
Q

what is backpropagation (basic idea)

A

The method for learning weights in multilayer networks via gradient descent.

  • Error is the sum of errors over the network’s output units
  • Gradient descent used because we have a large hypothesis space (all possible weight values)
24
Q

what is momentum in a neural network?

A

A mechanism where we make the weight update on the nth iteration depend partially on the update from the (n-1)th iteration (one common form shown below)

  • Analogous to a ball rolling down a surface, where momentum keeps the ball rolling in the same direction as it was before
  • Can help ‘roll’ through local minima or through flat regions to speed up convergence
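One common form of the update, with alpha (0 <= alpha < 1) as the momentum constant:

    [ delta_w(n) = -lr * dE/dw + alpha * delta_w(n-1) ]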
25
Q

What are the convergence guarantees with backpropagation

A

In multilayer networks, only guaranteed to converge to local minima (not necessarily global)

  • In practice, it is still highly effective
26
Q

Why is backpropagation still effective even though convergence only to local minima guaranteed?

A
  • With high-dimensional weight spaces, a local minimum with respect to one of the weights is not necessarily a local minimum w.r.t. the others
  • When weights are initialized near zero, the network represents a smooth, approximately linear function (due to the natural shape of the sigmoid). Only once the weights have grown can it represent highly non-linear functions – that's when local minima become more likely, but hopefully by then we're already close enough to the global minimum that it's OK
27
Q

How can we deal with local minima in neural networks with backpropagation?

A

Not a lot of surefire ways to know when they’ll cause difficulty, but common methods are:

  • Add a momentum term
  • Use stochastic gradient descent instead of true g.d.
  • Train multiple networks each w/different initial weights. Then select the best one or use a committee approach and average the results
28
Q

What is the nature of the hypothesis space search with backpropagation?

A

We have:

  • A continuous (large) hypothesis space of weight values
  • A differentiable error function w.r.t. our continuous hypotheses

This provides a useful structure (via the error gradient) for organizing the search for the best hypothesis

29
Q

What is the inductive bias at play with backpropagation?

A

Difficult to characterize, as it depends on the interplay between the g.d. search & the way in which the weight space spans the space of representable functions

But there is PREFERENCE BIAS:

  • g.d. prefers initial weights to be small random values, so we start with lower complexity / simpler explanations (small values) and sufficient variability (random values)

(NO restriction bias, b/c we can represent any type of relationship with neural networks)

30
Q

Hidden layer representations of ANNs

A

The intermediate representations of a neural network that are free to be set as whatever representation best minimizes squared error E

This allows ANNs to define new hidden layer features that aren’t explicit in the input representation, but that still capture properties relevant to learning the target function

  • Flexible so human doesn’t have to invent features
  • More layers = more complex representations
31
Q

Idea of overfitting with ANNs & how to combat it

A

The large number of weight parameters gives many degrees of freedom for overfitting on training data idiosyncrasies

to combat:

  • best way is still by using cross validation sets
  • can also use weight decay (decrease weights by a factor each iteration – update form below) to keep them small & bias against complex decision surfaces
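In update form, with lambda as a small decay constant (notation illustrative):

    [ w <- (1 - lambda) * w - lr * dE/dw ]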
32
Q

Recurrent Neural Networks

A

directed, cyclic graphs that allow representations of time-series data

  • Outputs at time ‘t’ are used as inputs for time ‘t+1’
  • Trained with a variant of backprop
  • More difficult to train than neural networks without feedback loops, & don't generalize as reliably
  • Increased representational power
33
Q

What does it mean for hypothesis A to be “more general than” hypothesis B?

A

That any instance classified as positive by hypothesis B will also be classified as positive with hypothesis A.

  • In other words, more general -> fewer constraints
34
Q

What are some features of Bayesian learning?

A
  • Each training example can incrementally inc./dec. estimated probability a hypothesis is correct (more flexible than completely eliminating a hypothesis if inconsistent with a single example)
  • Prior knowledge can be injected
  • Can accommodate hypotheses that make probabilistic predictions (e.g. 93% chance of recovery)
  • Provides a standard of optimal decision making to measure against, even if intractable
35
Q

Practical difficulties of Bayesian learning

A
  • Typically requires knowledge of many probabilities initially – and if not known, they’re estimated using assumptions / bias
  • Significant computational cost required to determine Bayes optimal hypothesis (linear in # of candidates)
36
Q

Main idea of Bayesian learning?

A

Instead of finding the ‘best’ hypothesis, we want to find the ‘most probable’ hypothesis, given the data and any initial knowledge of the prior probabilities

37
Q

what is the posterior probability?

A

Pr(h | D)

reflects our confidence in hypothesis ‘h’ holding after seeing the training data ‘D’

38
Q

What is the MAP hypothesis?

A

maximum a posteriori – the most likely hypothesis of the posterior distribution, meaning it incorporates prior probabilities as well

= argmax_h [ Pr(D | h) * Pr(h) ]
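A toy Python sketch with made-up priors & likelihoods, just to show the argmax:

    # Hypothetical numbers for three candidate hypotheses.
    prior      = {'h1': 0.6, 'h2': 0.3, 'h3': 0.1}
    likelihood = {'h1': 0.2, 'h2': 0.5, 'h3': 0.9}   # Pr(D | h)

    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    # h_map == 'h2' (0.5 * 0.3 = 0.15), even though 'h3' has the highest
    # likelihood -- the ML hypothesis (next card) would pick 'h3' instead.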

39
Q

What is the ML hypothesis?

A

maximum likelihood – simplified version of MAP, where we assume each hypothesis in H is equally probable (uniform) a priori

= argmax_h [ Pr(D | h) ]

40
Q

What is a consistent learner?

A

one that always outputs a hypothesis with zero error over training examples

(i.e. a hypothesis belonging to the version space)

41
Q

What is the version space?

A

Set of all hypotheses ‘h’ in H that correctly classify all training examples ‘D’

42
Q

What is the relationship between Bayes theorem and consistent learners?

A

If our training data ‘D’ is noise-free, and no a priori reason to believe any hypothesis more probable than another (uniform),

Pr(h | D) =

  • 1 / [size of version space], if ‘h’ consistent with ‘D’
  • 0, otherwise (not consistent)

Every consistent hypothesis is therefore a MAP hypothesis, and consistent learners always produce a MAP hypothesis! (even if Bayes isn't used explicitly)

43
Q

What is the relationship between Bayes theorem and sum of squared error algorithms?

A

Using Bayes, we can show that under certain assumptions, a learning algorithm that minimizes the sum of squared errors between predictions and training labels will actually output a [maximum likelihood hypothesis]!

Basically we can find (via Bayes) justification for many curve fitting methods that use sum of squared error, as well as gradient descent methods like in neural networks

44
Q

Relationship between minimum description length and Bayes

A

Minimum Description Length principle recommends choosing the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis (shorter hypotheses preferred)

Bayes theorem and basic information theory results can be used to describe the rationale for this

45
Q

What does the minimum description length principle give us?

A

A method to deal with overfitting by trading off hypothesis complexity for # errors committed by the hypothesis

46
Q

What is a Bayes optimal classifier?

A

One that combines the predictions of all hypotheses, weighted by their posterior probabilities

  • NOT necessarily the prediction outputted by the MAP hypothesis!
  • Resulting ‘h’ is a new hypothesis, not always part of H, because we’re taking linear combinations of predictions
47
Q

What is the Gibbs algorithm and how does it compare to the Bayes optimal classifier

A

Steps:

  1. Choose h in H at random according to the posterior distribution
  2. Use that ‘h’ to predict classification of the next instance X

Less optimal than Bayes optimal (duh) – at worst 2x the expected error of the Bayes optimal classifier

48
Q

Basics of Naive Bayes classifier

A

Simplifies Bayes theorem by assuming ‘naively’ that ALL attribute values are conditionally independent given the target value (count-based sketch below)

  • If that assumption is actually satisfied, the resulting classification will be equivalent to the Bayes optimal classification
  • If not, still works well in practice (like with the NLP example)
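A count-based Python sketch (add-one smoothing included so unseen attribute values don't zero out the product; all names illustrative):

    from collections import Counter, defaultdict
    from math import log

    def train_nb(examples):
        """examples: list of (attribute_tuple, label) -- training is just counting."""
        labels = Counter(lbl for _, lbl in examples)
        counts = defaultdict(Counter)     # (attr_index, label) -> value counts
        values = defaultdict(set)         # attr_index -> values seen in training
        for attrs, lbl in examples:
            for i, v in enumerate(attrs):
                counts[(i, lbl)][v] += 1
                values[i].add(v)
        return labels, counts, values, len(examples)

    def classify_nb(model, attrs):
        labels, counts, values, n = model
        def log_score(lbl):               # log Pr(lbl) + sum_i log Pr(v_i | lbl)
            s = log(labels[lbl] / n)
            for i, v in enumerate(attrs):
                s += log((counts[(i, lbl)][v] + 1) / (labels[lbl] + len(values[i])))
            return s
        return max(labels, key=log_score)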
49
Q

Describe the hypothesis search mechanism of Naive Bayes classifiers

A

NB doesn’t search through a hypothesis space!

Hypothesis is simply formed by counting frequencies of data combinations

50
Q

Basics of Bayesian Belief Networks (Bayes Nets)

A

Describes the joint probability distribution for a set of variables

  • Allows stating conditional independence assumptions to subsets of the variables (so less constraining than Naive Bayes which assumes all are conditionally independent)
51
Q

What is needed for Bayes Nets to be straightforward?

A
  • Network structure known in advance
  • All network variables directly observable in each example

Without one of these, more difficult

52
Q

What can we do if we want to build a Bayes Net but relevant instance variables are unobservable?

A

Can use the EM algorithm to estimate them

Goes back & forth between estimating the ML hypothesis & the expected values of the hidden variables until convergence to a local maximum

53
Q

What is sample complexity?

A

The # of training examples needed for a learner to converge to a successful hypothesis

54
Q

What is computational complexity?

A

How much computational effort is needed for a learner to converge to a successful hypothesis

55
Q

What is the mistake bound?

A

Number of training examples that learner will misclassify before converging to a successful hypothesis

56
Q

What is true error?

A

The probability that ‘h’ will misclassify an instance drawn at random from the underlying distribution D (not just the training data)

57
Q

PAC learning: general idea

A

Probably Approximately Correct

  • Require the error to be bounded by a constant, epsilon
  • Require the learner’s probability of failure to be bounded by a constant, delta
58
Q

What does it mean to say that a concept class is “PAC-learnable” by a learner ‘L’

A

that for all concepts in the concept class, L will, with probability at least (1 - delta), output a hypothesis with true error at most epsilon (in time polynomial in 1/epsilon, 1/delta, and the size of the instances & concept).

59
Q

What is an epsilon-exhausted version space and its meaning?

A

to be epsilon-exhausted, all hypotheses in the version space must have true error less than epsilon

This allows us to bound the # examples needed by any consistent learner by making sure the version space contains no ‘unacceptable’ hypotheses

60
Q

What does the Haussler Theorem let us do?

A

Using the size of the hypothesis space, allows us to bound the probability that the version space is epsilon-exhausted after a given number of training examples
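The usual form of the bound is Pr(VS not epsilon-exhausted) <= |H| * e^(-epsilon * m); requiring this to be <= delta gives the sample bound below (numbers in the example are illustrative):

    from math import ceil, log

    def sample_bound(h_size, eps, delta):
        """m >= (1/eps) * (ln|H| + ln(1/delta)) for any consistent learner."""
        return ceil((log(h_size) + log(1.0 / delta)) / eps)

    # e.g. |H| = 2**10, eps = 0.1, delta = 0.05:
    print(sample_bound(2**10, 0.1, 0.05))   # 100 training examples suffice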

61
Q

How does PAC-learning relate to VC dimensions?

A

hypothesis space H is PAC-learnable if and only if its VC dimension is FINITE

62
Q

What is a VC dimension

A

Measures the complexity of hypothesis space H as the size of the largest finite subset of instances from X that can be shattered (completely discriminated) using H.

The larger the VC dimension of a hypothesis space, the more data that’s needed to properly learn (increases sample complexity)

63
Q

What is shattering?

A

Gives a measure of a hypothesis space’s capacity to represent target concepts: the larger the subset that can be shattered, the more expressive H is & the more data that’s needed

We say that H shatters a subset of instances ‘S’ if every dichotomy of S can be represented by some hypothesis from H (see the sketch below)
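A brute-force Python check for one concrete (hypothetical) choice of H – closed intervals [a, b] on the real line. Intervals shatter any 2 points but no set of 3, so their VC dimension is 2:

    from itertools import product

    def interval_labels(a, b, points):
        return tuple(1 if a <= p <= b else 0 for p in points)

    def intervals_shatter(points):
        # Endpoints drawn from the points themselves (plus sentinels) suffice,
        # and a > b yields the empty interval (the all-negative dichotomy).
        ends = [min(points) - 1] + sorted(points) + [max(points) + 1]
        achievable = {interval_labels(a, b, points) for a in ends for b in ends}
        return all(d in achievable for d in product([0, 1], repeat=len(points)))

    print(intervals_shatter([1, 2]))      # True  -- all 4 dichotomies achievable
    print(intervals_shatter([1, 2, 3]))   # False -- (1, 0, 1) is impossible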

64
Q

How do SVMs and perceptrons relate?

A

Both learn linear separators (thresholded linear functions of the input). A perceptron settles for any hyperplane that separates the training data; an SVM chooses the separating hyperplane that maximizes the margin to the nearest training examples (&, via kernels, can find such hyperplanes in implicit higher-dimensional feature spaces).