Decision Trees Flashcards
Inductive Bias
The set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances. Some form of inductive bias is needed in order to generalize beyond training data.
Restriction bias
Form of inductive bias that comes from restricting the learner's search space of hypotheses
(less desirable because it can exclude the unknown target function altogether)
Preference bias
Form of inductive bias that comes from the learner's search strategy, where certain hypotheses are preferred over others (with no hard restrictions on which hypotheses can be enumerated)
(more desirable than restriction bias because target function will always be in the hypothesis space)
ID3 algorithm: basic description
Learns by constructing tree from top-down, using a simple-to-complex greedy hill climb search (no backtracking).
Hypothesis space searched: set of all possible decision trees
ID3 algorithm: pseudocode
- Pick best attribute (via information gain) to use as decision node
- For each value, create descendant of node & sort training examples to leaves
- Repeat until (a) everything classified correctly or (b) all attributes used
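The loop above can be sketched as a small recursive function. This is a toy version (not the cards' own code): attributes are column indices, `gain` is information gain, and ties are broken arbitrarily.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    total = entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[a] == v]
        total -= len(sub) / len(labels) * entropy(sub)
    return total

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # (a) everything classified correctly
        return labels[0]
    if not attrs:                      # (b) all attributes used -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))  # greedy, no backtracking
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                      [a for a in attrs if a != best])
    return (best, tree)
```

A leaf is a label; an internal node is `(attribute_index, {value: subtree})`.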
Entropy
characterizes (im)purity of an arbitrary selection of samples (i.e. the expected encoding length in bits)
If all belong to same class -> 0
If equal # of positive & negative samples -> 1
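The two boundary cases above follow directly from the formula H(S) = -sum over classes of p * log2(p), which a few lines of Python can verify (illustrative sketch, not from the cards):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

All samples in one class give 0 bits; an even positive/negative split gives 1 bit.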
Information Gain
Expected reduction in entropy caused by partitioning the examples according to the attribute in question (i.e. # bits saved – how well it separates the examples)
Implications of no backtracking in ID3
Comes with the usual risks of hill climbing w/o backtracking – converging to a local optimum (instead of the global one)
A way to counter this is post-pruning (a form of backtracking done after completing the tree)
Inductive bias of ID3
Comes entirely from preference bias:
- Prefers best splits (highest information gain) closest to the root (at top of tree)
- Prefers shorter trees over longer ones (not always the outcome though, due to first bullet)
(ID3 searches over complete hypothesis space, so no restriction bias)
Occam’s razor
Prefer the simplest among different choices.
This is why we prefer shorter hypotheses: they are less likely to fit the training data by statistical coincidence
In general, what causes overfitting
- Training data has noise that is learned, such that training error decreases but so does ability to generalize to new examples
- Too few training examples (even if noise free)
How to avoid overfitting with decision trees
Two types of approaches:
1) Stop growing the tree earlier (before training data perfectly classified)
2) Allow overfitting, but then post-prune the tree (more successful b/c more concrete than #1)
What is the main way to add safety checks against overfitting?
Divide training data into training and validation sets (cross-validation, leave-one-out, etc.)
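A minimal hold-out split might look like the sketch below (the function name and 70/30 fraction are illustrative choices, not from the cards):

```python
import random

def train_val_split(examples, val_fraction=0.3, seed=0):
    """Shuffle the examples and hold out a fraction as a validation set."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(len(examples) * (1 - val_fraction))
    train = [examples[i] for i in idx[:cut]]
    val = [examples[i] for i in idx[cut:]]
    return train, val
```

Cross-validation and leave-one-out repeat this idea over multiple disjoint splits.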
Neural Networks: high level basics
Robust approach to approximate real-valued, discrete-valued, & vector-valued target functions
- Well suited for noisy, complex data
- Long training times
- Fast evaluation of learned target function
- Difficult to interpret learned function (since we’re just learning network weights)
What is the key speculation from biology that Neural Networks are loosely modeled after?
That the human brain gains its information-processing abilities from HIGHLY PARALLEL processes operating over distributed representations
What is a perceptron
Linear combination of inputs that outputs 1 if the result is greater than some threshold, and 0 otherwise
- Results in a hyperplane that separates the data
- Single perceptron can represent {AND, OR, NOT} functions
- Chaining multiple together allows representation of any BOOLEAN function
perceptron training rule
Uses the result from the thresholded activation (the predicted label) to compute how much we should shift our weights by.
[ weight delta = lr * (y - y_hat) * x ]
- Convergence guaranteed (in finite iterations) only when data is linearly separable (& sufficiently small learning rate used)
- If those conditions satisfied, converges to perfect classifier
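The bracketed update rule, applied per example with the thresholded prediction, can be sketched as follows (toy code; the bias is folded in as a constant input x0 = 1, and OR is just an example of linearly separable data):

```python
def perceptron_update(w, x, y, lr=0.1):
    """One perceptron-rule step on example (x, y), with y in {0, 1}."""
    y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0  # thresholded output
    return [wi + lr * (y - y_hat) * xi for wi, xi in zip(w, x)]

# Learn OR, with a bias input x0 = 1 prepended to each example
data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
for _ in range(20):  # linearly separable, so finitely many mistakes before convergence
    for x, y in data:
        w = perceptron_update(w, x, y)
```

Note the update is zero whenever the prediction is already correct (y == y_hat).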
gradient descent / delta training rule
Uses the activation (raw / un-thresholded) to compute weight changes.
In reality, this means we move weights along the direction that decreases error w.r.t. each weight (gradient descent along the error derivative)
- Converges asymptotically toward at least a local minimum of the error surface, possibly in unbounded time
- Works whether data linearly separable or not
- Runs risk of overstepping the minimum of error surface if learning rate too large
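For a linear unit with squared error E = 1/2 * sum (y - o)^2, the batch delta rule accumulates (y - o) * x over the whole training set before moving the weights. A sketch (names and the y = 2x toy data are illustrative):

```python
def delta_rule_epoch(w, data, lr=0.05):
    """One batch gradient-descent step for a linear unit o = w . x."""
    grads = [0.0] * len(w)
    for x, y in data:
        o = sum(wi * xi for wi, xi in zip(w, x))  # raw, unthresholded output
        for i, xi in enumerate(x):
            grads[i] += (y - o) * xi              # -dE/dw_i summed over examples
    return [wi + lr * g for wi, g in zip(w, grads)]

# Fit y = 2x; the weight approaches 2 asymptotically over many epochs
data = [((1.0,), 2.0), ((2.0,), 4.0)]
w = [0.0]
for _ in range(100):
    w = delta_rule_epoch(w, data, lr=0.05)
```

With too large an `lr` the same loop overshoots and diverges instead, matching the last bullet above.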
stochastic gradient descent
Approximates true gradient descent closely (with small enough learning rate) by updating weights after each training example is observed
- Less computation per update step than standard gradient descent
- Can sometimes help avoid falling into local minima, since it follows the per-example error gradient rather than the gradient over the whole training set
What kind of functions can we represent with multi-layer neural networks of linear units?
Still can only produce linear functions!
What type of activation function is needed for gradient descent?
We need:
- Unit whose output is a nonlinear function of inputs
- But whose output is also a differentiable function of its inputs
(so perceptron unit does not work as it has a discontinuous threshold – not differentiable)
what is the sigmoid function
A common choice of an activation function. Also called the logistic function or ‘squashing’ function.
- Outputs are continuous between 0 and 1
- Easy derivative expressed in terms of its output: sigmoid(x) * (1 - sigmoid(x))
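Both properties are two lines of code (a direct transcription of the card, nothing assumed):

```python
import math

def sigmoid(x):
    """Logistic / 'squashing' function: continuous output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative expressed in terms of the output itself."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The derivative peaks at 0.25 when the output is 0.5 and vanishes as the unit saturates toward 0 or 1.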
what is backpropagation (basic idea)
The method for learning weights in multilayer networks via gradient descent.
- Error is the sum of errors over the network’s output units
- Gradient descent used because we have a large hypothesis space (all possible weight values)
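One gradient step for a one-hidden-layer sigmoid network can be sketched as below. This is a toy illustration (weight layout, names, and the single scalar output are assumptions): the output unit's error term is scaled by the sigmoid derivative o(1 - o), then propagated back to the hidden units through the output weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(W1, W2, x, y, lr=0.5):
    """One gradient step; W1 holds hidden-unit weight rows, W2 the output weights."""
    # forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    # backward pass (chain rule): error terms for output and hidden units
    delta_o = (y - o) * o * (1 - o)
    delta_h = [hi * (1 - hi) * W2[j] * delta_o for j, hi in enumerate(h)]
    # gradient-descent weight updates
    W2 = [w + lr * delta_o * hi for w, hi in zip(W2, h)]
    W1 = [[w + lr * delta_h[j] * xi for w, xi in zip(row, x)]
          for j, row in enumerate(W1)]
    return W1, W2, o
```

Repeating the step on one example drives the output toward its target, which is the whole mechanism in miniature.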
what is momentum in a neural network?
A mechanism where we make the weight update on the nth iteration depend partially on the update occurring during the (n-1)th iteration
- Analogous to a ball rolling down a surface, where momentum keeps the ball rolling in the same direction as it was before
- Can help ‘roll’ through local minima or through flat regions to speed up convergence
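The dependence on the previous iteration is usually kept in a "velocity" term; a sketch with illustrative names and constants:

```python
def momentum_update(w, grad, velocity, lr=0.1, alpha=0.9):
    """Carry over a fraction alpha of the previous step into the current one."""
    velocity = [alpha * v + lr * g for v, g in zip(velocity, grad)]
    w = [wi + v for wi, v in zip(w, velocity)]
    return w, velocity
```

With a constant gradient the steps grow (the ball accelerates), which is what carries it through flat regions.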
What are the convergence guarantees with backpropagation
In multilayer networks, only guaranteed to converge to a local minimum (not necessarily the global one)
- In practice, it is still highly effective