Midterm Prep Flashcards
How do decision trees work?
Data is split recursively down a tree. At each node you choose the split that maximizes information gain (reduction in randomness), i.e., the split that separates the data into branches that are as pure as possible with respect to the class labels.
What’s the big O notation of XOR vs OR?
As a decision tree, n-input XOR (parity) is exponential: it needs O(2^n) nodes, while n-input OR needs only a linear number, O(n).
What is restriction bias?
When the set of possible hypotheses considered becomes restricted to a smaller subset.
What is entropy?
A measure of randomness. When you split a decision tree you’re trying to reduce the amount of randomness as much as possible. If after the split you end up with 1/2 of each class down each branch then there was no change in entropy (50x, 50y @ node1 -> node2: 25x, 25y; node3: 25x, 25y). If a split completely separates the classes, you get the maximum entropy reduction.
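The node1/node2/node3 example can be checked numerically. A small sketch (the function names are my own):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    return entropy(parent) - sum(sum(ch) / total * entropy(ch) for ch in children)

# 50x/50y split into 25/25 and 25/25: no change in entropy, zero gain
print(information_gain([50, 50], [[25, 25], [25, 25]]))  # 0.0
# A split that completely separates the classes: maximum gain (1 bit)
print(information_gain([50, 50], [[50, 0], [0, 50]]))    # 1.0
```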
What is preference bias?
What sort of hypotheses from your hypothesis set do you prefer? This is the heart of inductive bias.
What’s the inductive bias of DTs?
- Good splits at the top
- Prefers correct trees to incorrect trees
- Prefers shorter trees to longer trees - this comes naturally from doing good splits at the top. If you do bad splits, the trees would be longer
How do DTs treat continuous attributes?
Pose it as a question: 20 <= age < 30, then split on the answer. You can split multiple times on the same continuous variable along a path, but you can’t ask the exact same question twice. Information gain should rule that out anyway, because repeating a question gives no reduction in entropy.
What is pruning?
Collapse leaves back up into a single node and label it via e.g. the average, or a vote, etc. If collapsing doesn’t increase the error too much then do it; if it does, don’t. This helps prevent overfitting.
Could you use DTs for regression?
Sure - outputs could be something like average or local linear fit. When you split you could look at the variance.
What is cross validation?
Split your training data into subsets (k folds of them). Then use all but one of those sets as your training data and perform validation against the one remaining subset. Repeat this so that each subset becomes the validation set and then you will average the test scores.
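A minimal sketch of that procedure; `train_and_score` is a hypothetical callback that fits on the training folds and returns a score on the held-out fold:

```python
def k_fold_scores(data, k, train_and_score):
    """Split `data` into k folds; each fold serves once as the validation
    set while the other k-1 folds form the training set. Returns the
    average of the k validation scores."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, val))
    return sum(scores) / k
```

With 10 data points and k=5, each validation fold holds 2 points and the training set holds the other 8.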
What is IID?
Independent and Identically Distributed. All data is coming from the same distribution/source. Each pull from the distribution is independent of each other and they are all distributed the same.
How do you get a ‘not’ behavior from a perceptron?
Set the weight to negative (or the threshold)!
What boolean functions can be represented by a single perceptron?
AND, OR, NOT, NAND, NOR - you cannot represent XOR with a single perceptron.
What is the perceptron training rule?
w_i += delta_w_i
delta_w_i = eta(y - y_hat)x_i
y = target, y_hat = output, eta = learning rate
This is a thresholded training rule, i.e., the outputs will be on or off.
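A sketch of the rule on an AND gate (a function a single perceptron can represent). The names are my own; eta is left at 1 since, starting from zero weights, its scale doesn’t change which side of the threshold the sums land on:

```python
def perceptron_train(X, y, eta=1, epochs=10):
    """Perceptron rule: w_i += eta * (y - y_hat) * x_i, where y_hat is the
    THRESHOLDED output. Each row of X carries a leading 1 as a bias input."""
    w = [0] * len(X[0])
    for _ in range(epochs):
        for xs, target in zip(X, y):
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0
            for i, xi in enumerate(xs):
                w[i] += eta * (target - y_hat) * xi
    return w

# AND gate: linearly separable, so the rule converges in finitely many steps
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]  # leading 1 = bias
y = [0, 0, 0, 1]
w = perceptron_train(X, y)
print([1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0 for xs in X])
# [0, 0, 0, 1]
```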
What does linear separability mean to the perceptron training rule?
If the data IS linearly separable, then the perceptron training rule WILL find that separation in a finite number of iterations.
What does linearly separable mean?
All data can be subdivided by half planes with no misclassification. Technically you can make any N points linearly separable using at most N - 1 planes.
What is gradient descent?
Learning rule for NNs for when the data is not linearly separable. It will trend toward the target concept over time but is only guaranteed to converge to a local optimum.
delta_w_i = eta(y-a)x_i
eta = learning rate, y = target output, a = sum(w_i * x_i)
The outputs of these are not thresholded.
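A sketch of one epoch of this rule (the names are my own); note the update uses the raw activation a, not a thresholded output:

```python
def delta_rule_epoch(w, X, y, eta=0.05):
    """One pass of the (stochastic) delta rule: delta_w_i = eta*(y - a)*x_i
    with a = sum(w_i * x_i), the UNTHRESHOLDED activation."""
    for xs, target in zip(X, y):
        a = sum(wi * xi for wi, xi in zip(w, xs))
        for i, xi in enumerate(xs):
            w[i] += eta * (target - a) * xi
    return w

# Fitting y = 2x: the weight drifts toward 2 over repeated epochs
w = [0.0]
for _ in range(100):
    w = delta_rule_epoch(w, [[1], [2], [3]], [2, 4, 6])
print(w)  # close to [2.0]
```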
What’s the difference b/t the perceptron rule and the gradient descent rule?
perceptron: eta(y-y_hat)x_i
gradient descent: eta(y-sum(w_ix_i)*x_i
perceptron learning is done on the thresholded output while gradient descent is done on the non-thresholded output.
What is the sigmoid activation function?
S shaped threshold:
sigma(a) = 1 / (1+e^-a)
as a -> -inf, sigma(a) -> 0
as a -> inf, sigma(a) -> 1
This is differentiable.
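A quick sketch of the function along with the standard identity for its derivative, sigma'(a) = sigma(a)(1 - sigma(a)), which is what makes the differentiability so convenient:

```python
import math

def sigmoid(a):
    """sigma(a) = 1 / (1 + e^-a): an S-shaped, differentiable threshold."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_deriv(a):
    """Its derivative has the tidy form sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0))    # 0.5, the midpoint
print(sigmoid(10))   # near 1  (a -> inf)
print(sigmoid(-10))  # near 0  (a -> -inf)
```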
What is the restriction bias of NNs (what set of hypotheses will we consider)?
perceptron: half spaces
sigmoids: much more complex
Not much restriction. There is a real danger of overfitting though with too complex of a network. Use CV to help avoid this.
What kind of functions can you represent with a NN of various layers?
Boolean: networks of threshold like units
Continuous (no discontinuities): 1 hidden layer with enough hidden units. Each hidden unit can worry about one little patch of the function and they get patched together at the output.
Arbitrary - with discontinuities: 2 hidden layers
What is the preference bias of NNs (which hypotheses you would prefer given the options)?
Prefer small weights, because larger weights mean more complexity (Occam’s razor).
What is instance based learning?
Lazy learner - store all of the training data. Don’t fit or do anything like that. Computation is done during prediction instead of fitting.
For KNN, what is distance?
It’s really a ‘similarity’ metric, but it could be defined as something like Euclidean (‘as the crow flies’) or Manhattan distance.
Which components in KNN represent your domain knowledge?
The distance metric and the number of neighbors.
How does classification work for KNN?
Find the k nearest neighbors as determined by your distance metric. Then you could use voting or a weighted vote.
How does regression work for KNN?
Find the k nearest neighbors as determined by your distance metric. Then take the mean of the y values for the nearest neighbors (or weighted mean, etc).
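Both cards can be sketched with one helper (Euclidean distance and the names are my choices):

```python
from collections import Counter

def knn_predict(train, query, k, mode="classify"):
    """Lazy learner: all the work happens at query time.
    `train` is a list of (point, label) pairs; squared Euclidean distance
    is enough since only the ordering of neighbors matters."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    neighbors = sorted(train, key=lambda pl: dist2(pl[0], query))[:k]
    labels = [label for _, label in neighbors]
    if mode == "classify":
        return Counter(labels).most_common(1)[0][0]  # plurality vote
    return sum(labels) / k                           # mean for regression

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((5, 5), 1), ((5, 6), 1)]
print(knn_predict(train, (0.2, 0.2), k=3))                # 0 (vote)
print(knn_predict(train, (5, 5.5), k=2, mode="regress"))  # 1.0 (mean)
```

Weighted voting or a weighted mean would scale each neighbor’s contribution by inverse distance instead.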
What’s the time complexity of KNN for learning and queries?
For learning it’s always constant - you just store the data.
Query time is O(log n + k) (given sorted 1-D data): you can find the query point in log_2(n) time with binary search, then bounce back and forth to collect your k points.
If k is on the order of n/2, the k term dominates the log_2(n) and queries become O(n) instead of O(log n).
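For 1-D data kept in sorted order, the find-then-bounce idea looks like this (a sketch; the names are mine):

```python
import bisect

def knn_1d(sorted_xs, q, k):
    """Find the k values nearest to q in a sorted list in O(log n + k):
    binary-search the insertion point, then expand outward, taking
    whichever side is closer at each step."""
    i = bisect.bisect_left(sorted_xs, q)   # O(log n)
    lo, hi = i - 1, i
    out = []
    for _ in range(k):                     # O(k) bounces
        if lo < 0:
            out.append(sorted_xs[hi]); hi += 1
        elif hi >= len(sorted_xs):
            out.append(sorted_xs[lo]); lo -= 1
        elif q - sorted_xs[lo] <= sorted_xs[hi] - q:
            out.append(sorted_xs[lo]); lo -= 1
        else:
            out.append(sorted_xs[hi]); hi += 1
    return out

print(knn_1d([1, 3, 4, 8, 10, 12], 5, k=3))  # [4, 3, 8]
```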
What is the preference bias for KNN?
- Locality: assumes near points are similar
- Smoothness: averages
- All features matter equally (because of the distance metric). Example: if your real function is x1^2 + x2, then in your distance function you could square the difference in x1 instead of just using the linear difference.
What’s the curse of dimensionality?
As the number of features (dimensions) grows, you need exponentially more data to generalize accurately.
This applies to all ML problems. The idea being that you want your data points to represent the same ‘amount’ of the dimensional space.
1D:
| x x x | must go to:
2D:
| x x x |
| x x x |
| x x x |
When you add another feature.
What is locally weighted regression (KNN)?
Performing a local regression on the nearest neighbors where the closest ones have the most influence. So you’re building concepts around the local neighbors.
What’s the theorem of no free lunch?
The No Free Lunch theorem says that, averaged over all possible problems, no method will do better than random.
What is ensemble learning?
Use weak learners to create a bunch of small rules over small parts of the dataset (random small subset) and then combine them all to create a complex rule.
What is bagging/bootstrapping?
Create weak learners, train each on a random subset of the data (a bootstrap sample - drawn with replacement), then average the hypotheses learned by each.
What is a weak learner?
No matter what the distribution over the data is, a weak learner will do better than chance: its error rate is always strictly < 0.5.
What is boosting?
Start with a distribution of 1/n.
Train a weak learner on the dataset, then redistribute the likelihood of drawing samples from D so that the next weak learner has a higher probability of drawing the samples the previous learner got wrong. Repeat this process for many weak learners. In the end, make a weighted combination of the weak learners to form the final hypothesis.
Boosting says that your classification labels are {-1, +1}
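A minimal AdaBoost-style sketch on 1-D data with threshold ‘stumps’ as the weak learners. The structure follows the card: D starts at 1/n, mistakes get upweighted each round, and the final hypothesis is a weighted combination passed through sign. All names are mine:

```python
import math

def boost(xs, ys, rounds):
    """Labels are {-1, +1}. D starts uniform at 1/n; each round we fit the
    best threshold stump under D, then reweight D toward the mistakes."""
    n = len(xs)
    D = [1.0 / n] * n
    ensemble = []  # (alpha, threshold, polarity)
    for _ in range(rounds):
        # weak learner: h(x) = polarity if x >= threshold else -polarity,
        # chosen to minimize the weighted error under D
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(d for d, x, y in zip(D, xs, ys)
                          if (pol if x >= t else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, t, pol))
        # upweight what the stump got wrong, downweight what it got right
        D = [d * math.exp(-alpha * y * (pol if x >= t else -pol))
             for d, x, y in zip(D, xs, ys)]
        z = sum(D)
        D = [d / z for d in D]
    return ensemble

def predict(ensemble, x):
    s = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if s >= 0 else -1  # the final sign is what makes this non-linear

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates this
print([predict(boost(xs, ys, 5), x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

No single stump can classify this labeling, but the weighted combination of a few can.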
How is boosting like other algorithms?
Like KNN, because it takes simple hypotheses and builds a more complex one (as locally weighted regression does in KNN).
Weighted combination of things is like NN’s.
Why is Boosting non-linear?
Because you pass it through the sgn function at the end which is non-linear.
Why does boosting tend to do well? What’s the intuition?
You’re using weak learners that must each get strictly more than 50% of the classifications correct under the current distribution. At each step you redistribute so that the samples the previous learner got wrong are more likely to be drawn. So the next weak learner focuses on the problems the previous learner got wrong, and it must still get more than 50% of them correct. This means you should always be improving.