Midterm Exam Flashcards
What is the definition of ML that Dr. Isbell prefers?
ML is about using math, computation, engineering (among other things) to build computational artifacts that learn over time.
What is inductive reasoning?
Reasoning that goes from specifics –> generalities
What is deductive reasoning?
Applying general rules to draw specific (logical) conclusions
At a high-level, what is supervised learning considered an example of?
Function approximation. It’s the process of inductively learning a general rule from observations.
What is unsupervised learning all about?
It’s about DESCRIBING data. In unsupervised learning, we’re only given “inputs”, so the objective is to learn whether there is some latent structure in the data that we can use to describe the data in a more efficient (i.e. more concise) way. Clustering is a common type of unsupervised ML.
What are two types of supervised machine learning?
- Classification (Discrete output values)
- Regression (Continuous output values)
In a Decision Tree (DT) model, nodes are _______ and edges are _____?
Attributes, Values
What is the decision tree algorithm?
1. Pick the best attribute (where “best” means the attribute that best splits the data, e.g. roughly in half)
2. Ask the question
3. Follow the answer path
4. Go to 1 (until you arrive at an answer; see the sketch below)
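A minimal ID3-style sketch of that loop in Python. The toy rows, attribute names, and the entropy/information-gain helpers are illustrative assumptions, not code from the lectures:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on `attribute` (higher = better split)."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Top-down induction: pick the best attribute, split, recurse."""
    if len(set(labels)) == 1 or not attributes:      # pure node, or nothing left to ask
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):     # follow each answer path
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        node[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return node

# Toy usage: the attribute with the highest information gain ends up at the root.
rows = [{"outlook": "sun", "wind": "weak"}, {"outlook": "sun", "wind": "strong"},
        {"outlook": "rain", "wind": "weak"}]
labels = ["yes", "no", "yes"]
print(build_tree(rows, labels, ["outlook", "wind"]))
```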
What is the complexity of XOR?
O(2^n) [i.e. exponential; an n-way XOR needs a decision tree with a node for essentially every combination of attribute values, which is why it’s considered a “hard” function]
What is the complexity of OR?
O(n) [i.e. linear]
What are the two types of biases we worry about when searching through a hypothesis space?
- Restriction bias: This is how restrictive the function space is we’ve chosen. So given that there are an infinite number of ways of representing a function, the restriction bias of a decision tree algorithm limits it to the space of boolean functions.
- Preference bias: This tells us what hypotheses within the space that we prefer. THIS IS AT THE HEART OF INDUCTIVE BIAS. (Because we have a preference for hypotheses that fit the data well.)
What are some of the inductive biases of the ID3 algorithm for decision trees?
- Preference for good splits at the top of the tree as opposed to the bottom of the tree (because we build from the top of the tree down)
- Prefers models that fit the data well (i.e. prefers correct over incorrect)
- Tends to prefer shorter trees over taller trees (this is just a natural corollary that stems from the fact that we prefer trees that perform better splits at the top)
According to Mitchell, what three features must be present in order to have a well-posed learning problem?
- The class of tasks
- The measure of performance to be improved
- The source of experience
Historically, how did we end up with the term “regression”?
The idea of regression to the mean, e.g. the children of taller- or shorter-than-average parents tend to have heights that ‘regress’ back toward the mean.
What are some examples of where error comes from?
- Sensor Error
- Malicious/adversarial data
- Transcription error
- Un-modeled influences
What is one of the fundamental assumptions that we make in most supervised learning algorithms?
That the data from the training set are REPRESENTATIVE of what we expect in the future. If this isn’t the case, then our model won’t GENERALIZE, which is what we really care about as ML practitioners. More formally, the general assumption is that data are i.i.d. (Independent and Identically Distributed); that is, that the process that generated the training data is the same process that is generating the test data (in fact, any future data!).
In the context of regression, the best constant in terms of the squared error is the _____?
mean
Describe cross-validation and why we use it?
We split the training data into k folds. Then we use k-1 of the folds for training and the remaining fold as a validation set (i.e. a stand-in for the test set). We repeat this with each of the k folds taking a turn as the validation set, and then average the validation error across all of them. The best model is then the one with the lowest average validation error.
We use cross validation to avoid overfitting the model. Since we should have no access to the test set when developing our model, using cross validation improves generalization.
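A minimal sketch of k-fold cross-validation in Python. The `fit`/`error` callables and the toy data are placeholders for whatever learner and loss you are actually tuning (here the “model” is just the training-set mean, the best constant under squared error):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, error, k=5, seed=0):
    """Average validation error over k folds: train on k-1 folds,
    validate on the held-out fold, repeat for every fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))

# Toy usage: predict the training mean, score with mean squared error.
fit = lambda X, y: y.mean()
mse = lambda m, X, y: float(np.mean((y - m) ** 2))
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + np.random.default_rng(1).normal(size=20)
print(k_fold_cv_error(X, y, fit, mse, k=5))
```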
Logical AND is expressible as a perceptron unit? (True/False)
True
Logical OR is not expressible as a perceptron unit? (True/False)
False
Logical NOT is expressible as a perceptron unit? (True/False)
True
Logical XOR is not expressible as a perceptron unit? (True/False)
True. XOR is not linearly separable, so a single perceptron unit cannot express it; it requires a network of at least two units.
For perceptron network training, what is the difference between the “perceptron rule” and the “gradient descent” rule?
The perceptron rule uses thresholded output values while gradient descent uses the UNthresholded values.
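A sketch of the two weight updates side by side, assuming the threshold/bias has been folded into the weight vector via a constant-1 input feature (illustrative, not lecture code):

```python
import numpy as np

def perceptron_rule_update(w, x, y, eta=0.1):
    """Perceptron rule: the error term uses the THRESHOLDED output y_hat."""
    y_hat = 1.0 if np.dot(w, x) >= 0 else 0.0
    return w + eta * (y - y_hat) * x

def gradient_descent_update(w, x, y, eta=0.1):
    """Gradient descent (delta rule): the error term uses the UNthresholded
    activation a = w.x, which is smooth and therefore differentiable."""
    a = np.dot(w, x)
    return w + eta * (y - a) * x
```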
If the data are linearly separable, will the perceptron rule find the hyperplane that separates them in a finite number of iterations?
Yes
Why can we not a priori know whether a dataset is linearly separable (at least in non-trivial cases)?
Because while the perceptron rule is guaranteed to find a separating hyperplane in a finite number of iterations if the data are linearly separable, we have no way of knowing how long “finite” is. Maybe the error finally goes to zero, but maybe it never does. It’s essentially a halting problem.
The perceptron training rule is more robust against non-linearly separable data than gradient descent is? (True/False)
False.
The perceptron rule does not have the theoretical guarantee of finite convergence? (True/False)
False
What assumption is made for the theoretical guarantee of finite convergence for the perceptron rule?
That the data are linearly separable.
What convergence guarantee does gradient descent offer?
Convergence to a local optimum. This is in contrast to the perceptron rule which does offer the guarantee of convergence to the global optimum in finite time, but only in the case of linearly separable data. Gradient descent is more ROBUST to non-linearly separable data than the perceptron rule.
Since gradient descent is more robust than the perceptron rule, why not just use GD on the weight update for the perceptron rule also?
Because the perceptron rule uses the THRESHOLDED output values, and the thresholding operation isn’t smooth (it has a discontinuity at the point of thresholding), hence it is not differentiable, which means we can’t use the calculus required for GD.
What is the domain for the sigmoid function? How does the output behave at the limits?
The domain is from -infinity to +infinity. As the input goes to -infinity, the output asymptotically approaches 0; conversely, as the input goes to +infinity, the output asymptotically approaches +1.
What is the derivative of the sigmoid function?
Let ‘a’ := the activation and S(a) := the sigmoid function. Then the derivative dS(a)/da = S(a)*(1 - S(a))
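A quick sketch of the function and its derivative (plain NumPy, just to make the identity concrete):

```python
import numpy as np

def sigmoid(a):
    """S(a) = 1 / (1 + e^(-a)); -> 0 as a -> -inf, -> 1 as a -> +inf."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """dS(a)/da = S(a) * (1 - S(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))   # 0.5, 0.25
```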
What is backpropagation?
It’s a computationally beneficial organization of the chain rule from calculus that allows us to calculate how the weights should move to minimize the error of the output from a neural network.
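A minimal sketch of one backprop step for a single-hidden-layer sigmoid network trained with squared error. The layer sizes, learning rate, and toy input are arbitrary assumptions; the point is just to show how the chain rule is organized from the output backwards:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W1, W2, eta=0.5):
    """One gradient-descent step, chaining derivatives from the output back."""
    # Forward pass
    h = sigmoid(W1 @ x)          # hidden activations
    y_hat = sigmoid(W2 @ h)      # network output
    # Backward pass (reusing S'(a) = S(a)(1 - S(a)))
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # pushed back through W2
    # Weight updates
    W2 = W2 - eta * np.outer(delta_out, h)
    W1 = W1 - eta * np.outer(delta_hid, x)
    return W1, W2

# Toy usage (random weights; this only demonstrates the mechanics).
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```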
What are some of the more advanced methods that are sometimes beneficial to help mitigate the tendency for gradient descent to get stuck in local minima?
- Momentum terms
- Using higher-order derivatives (the analogy would be adding error “velocity/acceleration” terms, to borrow from the equations of physical motion)
- Randomized optimization
- Penalty for “complexity”, where complexity might be the result of too large a network STRUCTURE (in terms of breadth and/or depth) or weights that are excessively large (in the latter case we typically refer to this complexity penalty as “regularization”; see the sketch after this list)
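A sketch of two of these ideas in isolation; the hyperparameter values are arbitrary placeholders, shown as a momentum-style weight update and a squared-error loss with an L2 “complexity” penalty on the weights:

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.1, beta=0.9):
    """Momentum term: blend the previous step direction into the new one,
    which can help roll through shallow local minima."""
    velocity = beta * velocity - eta * grad
    return w + velocity, velocity

def l2_penalized_loss(y, y_hat, w, lam=0.01):
    """Squared error plus a penalty on large weights ('regularization')."""
    return float(np.mean((y - y_hat) ** 2) + lam * np.sum(w ** 2))
```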
What is the ‘Restriction Bias’?
It relates to the representational power of a learner by limiting the set of hypotheses that we will consider as valid.
What is the hypothesis space of a single perceptron unit?
Half-spaces (i.e. a hyperplane that perfectly separates data [although only in the case where the data is linearly separable!])
What is the restriction bias of a simple network of threshold-like units?
Boolean functions
If we add a single fully connected hidden layer that includes sigmoided activation functions, what is the restriction bias of the network?
Continuous functions
If we stitch together two or more fully connected hidden layers that include smooth (i.e. differentiable) activation functions, what is the restriction bias of the network?
Arbitrary functions
A network with two hidden layers containing 50 neurons each can represent any arbitrary function? (True/False)
False. While a finite neural net with at least two hidden layers can theoretically model an arbitrary function, we don’t necessarily (in fact, in almost every case we don’t) know how big a network has to be to model the function of interest.
If the error on the validation set increases while the error on the training set continues to decrease, what situation has occurred?
We’ve OVERFIT the training data. The model is no longer capable of generalizing.
What is ‘Preference Bias’?
When a given algorithm is making a selection between one representation or another, its preference bias determines which one of those representations it is more likely to choose.
Say while training a NN, a single weight is growing exponentially, becoming much larger than all the other weights in the network. What situation is likely occurring?
We’re likely overfitting the dataset. In the case where one weight is growing much larger than all the others, the model has probably homed in on a single feature of the data that is highly correlated with the outputs in the training dataset, but it is unlikely to generalize to new data.
What are some of the preference biases of a Neural Network?
Initialization of weights to small random values results in a preference (ceteris paribus) for models with lower complexity; i.e. if most of the weights can stay small and still model the problem sufficiently well, then the model is going to prefer that situation.
Which is preferable: a restriction bias or preference bias?
Preference bias. At least in that case, we can be assured that the hypothesis space contains our target function of interest.
Where does the inductive bias of the ID3 algorithm for decision trees come from?
Its search strategy, which makes it prefer shorter trees over taller trees and tends to place attributes with the highest information gain toward the top of the tree.
In terms of computational complexity, learning for linear regression is _______ and querying ______?
Expensive | Cheap
In terms of computational complexity, learning for K-NN is _______ and querying ______?
Cheap | Expensive
What is the computational complexity (runtime and space) for KNN vs linear regression?
For KNN, learning is essentially free in time (we just store the data) but costs O(n) space; querying is the expensive part, since we have to search the stored points for the nearest neighbors. For linear regression, learning is the expensive part (we do the work up front, scanning all n points to fit the parameters), but querying is constant time and space once the parameters are learned.
What are some of the preference biases of the KNN algorithm?
- Locality: nearer points are similar
- Smoothness: expectation that averaging data points works (i.e. the underlying function generating the data is smooth)
- Equality: all features matter equally
What is the “Curse of Dimensionality”?
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
The curse of dimensionality only applies to the KNN algorithm? (True/False)
False. It’s true for any ML algorithm that as the number of features grows, we need exponentially more data to cover the space in order to generalize.
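A back-of-the-envelope illustration (a simple grid-covering argument, not a formula from the lectures): if covering one dimension at some fixed resolution takes 10 points, covering d dimensions at the same resolution takes roughly 10^d points.

```python
# Points needed to cover [0, 1]^d with a grid of 10 points per dimension.
points_per_dim = 10
for d in (1, 2, 3, 5, 10):
    print(f"d={d}: {points_per_dim ** d} points")
# d=1: 10, d=2: 100, d=3: 1000, d=5: 100000, d=10: 10000000000
```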
What is the difference between ‘eager’ and ‘lazy’ learners? What are some examples of each?
A lazy learner puts off learning as long as possible; eager learners solve the problem as soon as it is posed, and save the result as some sort of parametric function that can be used later to quickly evaluate new data.
KNN is an example of a lazy learner. Linear regression is an example of an eager learner.
KNN can only handle classification problems? (True/False)
False, it can handle classification and regression. In the case of regression, it could (as one example) simply take the average of the points to get a continuous output.
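A minimal sketch of KNN regression by averaging (plain NumPy; k and the toy arrays are arbitrary):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """KNN for regression: average the targets of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(knn_regress(X, y, np.array([1.2])))   # averages the 3 nearest targets
```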
How does locally weighted KNN work and what does it allow us to do?
Locally weighted KNN allows us to represent more complex non-linear functions. It does this by approximating data locally as piece-wise functions. By stacking up these local functions we can generate non-linear outputs.
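One way to sketch the “locally weighted” idea is inverse-distance weighting of the k neighbors. This is just one illustrative weighting choice; the lectures discuss locally weighted regression more generally:

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k=3, eps=1e-12):
    """Distance-weighted KNN: closer neighbors get more say, so the
    stitched-together local predictions can trace a non-linear function."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)       # inverse-distance weighting
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(X, y, np.array([1.9])))   # dominated by the closest point
```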
How does ensemble learning work?
- Take a subset of the data and create individual learners that learn a rule for that subset
- Combine all the individual learners into an ensemble
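A minimal sketch of that recipe; the stand-in “weak” learner here is a 1-NN rule fit on each random subset (any simple rule learned on the subset would do), and the ensemble combines the individual predictions by averaging:

```python
import numpy as np

def subset_learner(X_sub, y_sub):
    """Stand-in individual learner: 1-NN fit on the subset."""
    def predict(x):
        return y_sub[np.argmin(np.linalg.norm(X_sub - x, axis=1))]
    return predict

def ensemble_predict(X_train, y_train, x_query, n_learners=10, seed=0):
    """Each learner sees a random subset of the data; combine their outputs."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_learners):
        idx = rng.choice(len(y_train), size=len(y_train) // 2, replace=False)
        learner = subset_learner(X_train[idx], y_train[idx])
        preds.append(learner(x_query))
    return float(np.mean(preds))

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
print(ensemble_predict(X, y, np.array([4.5]), n_learners=5))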
What are the two important ideas in boosting?
- Select data subsets based on their difficulty
- Use a weighted mean instead of simple average
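An AdaBoost-style sketch of both ideas: hard (misclassified) examples get their weights increased each round, and the final prediction is a weighted vote. The decision-stump weak learner and the specific alpha/reweighting formulas are the standard AdaBoost choices, given as an illustration rather than as quoted lecture code:

```python
import numpy as np

def stump(X, y, w):
    """Weak learner: best single-feature threshold under weights w (labels are +/-1)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= t, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=5):
    """Reweight 'hard' examples each round; keep (alpha, stump) pairs for the vote."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(rounds):
        err, j, t, sign = stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # weight of this weak hypothesis
        pred = sign * np.where(X[:, j] >= t, 1, -1)
        w = w * np.exp(-alpha * y * pred)              # up-weight misclassified points
        w = w / w.sum()
        model.append((alpha, j, t, sign))
    return model

def boosted_predict(model, x):
    """Weighted vote (weighted 'mean') of the weak hypotheses."""
    score = sum(alpha * sign * (1 if x[j] >= t else -1) for alpha, j, t, sign in model)
    return 1 if score >= 0 else -1

X = np.array([[0.0], [1.0], [2.0], [3.0]]); y = np.array([-1, -1, 1, 1])
model = adaboost(X, y, rounds=3)
print(boosted_predict(model, np.array([2.5])))   # -> 1
```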
What is the definition of a “weak learner”?
It’s a learner that, no matter what distribution the training data are drawn from, outputs a prediction that does better than random chance (i.e. its error is at most 1/2 - ε for some ε > 0).
Define ‘Inductive Learning’ (per the computational learning theory lecture)
It’s the process of “Learning from Examples”. The lecture characterizes an inductive learning problem in terms of:
- Probability of successful training (i.e. 1 - δ, where δ is the probability of failure)
- Number of examples to train on (i.e. m)
- Complexity of the hypothesis class H
- Accuracy to which the target concept is approximated (often denoted ε)
- Manner in which training examples are presented (batch vs online)
- Manner in which training examples are selected
What are the two main ways that examples can be presented to a learning algorithm?
- Batch: we get a set of training data, and it’s handed over to the learner all at once to learn from
- Online: The learner is presented with samples one at a time, predicts a label, and then gets feedback from the algorithm/oracle about whether that prediction was correct or not
Describe the relationship between Learners and Teachers. How do they interact with one another?
- Learner asks questions of the teacher - so for example, the learner chooses some data/observation x and asks the teacher for its label c(x)
- Teacher gives examples to help learner (i.e. Teacher chooses x, tells c(x) to Learner)
- Fixed Distribution, i.e. x is chosen from D by nature (Dr. Isbell notes that this is mostly what we’ve talked about up until this point)
- Evil distribution - i.e. some sort of adversarial approach
Random thoughts to consider: in the lectures discussing computational learning theory, one of the quizzes asks about a game like ‘20 Questions’, where we have a set H of hypotheses (the possible people) and X questions we can ask; how many questions does the learner need to ask, if the learner chooses x, to figure out the correct h? Dr. Isbell says log2(|H|), because we can, on average, eliminate half the hypotheses with each question we ask. Makes sense from an information theoretic perspective. This is predicated on the assumption that we can ask “good” questions, but doesn’t really say anything about what makes a question “good”. What distinguishes good questions from bad questions?
A naive response would be that it’s one that eliminates roughly half the answers. But that’s tautological. It seems to me that domain knowledge is really implicit here. In the case where H is a set of people, we naturally have background knowledge that admits a lot of binary questions (at least from an information theoretic perspective, if not exactly politically correct): is the person male/female? Are they older than 40? Etc., etc. Not only that, we also have really strong priors we can derive from this background knowledge: we know that the world population is roughly 50/50 male/female. Human life expectancy gives us a good idea of how to split into roughly equal groups.
But what about the case where we’re much less certain (or have no information at all) about what H contains? I guess since we’re human, the most reasonable place to start would be asking questions based on the prevalence of some attribute that H is most likely to contain? So questions like “Is h in the space of people?”, etc. It’s tough to swallow as things become more abstract, though. What if H is the set of all atoms in the universe, and the answer we’re looking for is something ridiculous like: “the 3rd atom in Saturn’s rings”. (Interesting to note that the first example that jumped to my mind was Saturn: I’ve also just shown my own preference bias for hypotheses that are in the set of things in our solar system.)
I don’t know exactly where I’m going with this, and I guess it comes down to the problem of a potentially infinite hypothesis space, which I’m sure we’ll discuss in the lectures. Anyway, it’s interesting stuff to think about.
Why do constrained queries make things so hard for learners?
Because we can only ask things that amount to data points, and not the thing we’re actually interested in itself. Think about the case of bit strings and a hypothesis class that consists of conjunctions of literals or their negations. What we’d really like to ask are things like: is x1 in the formula? But the constraint forces us to only ask questions where the x’s take on binary values. This means it could take a huge number of samples before we finally get a positive result; the negative results simply don’t tell us much. So in the worst case it could take on the order of 2^k queries to find the actual answer.
What is ‘Version Space’?
Let S be a training set of data drawn from X. The version space is simply the set of all candidate hypotheses h such that h is consistent w.r.t. the data S that it has seen thus far. More succinctly, it’s the “set of hypotheses consistent with the examples”
In the context of PAC learning, what are the two types of error?
Training error and True Error
What is training error?
It is the fraction of training examples misclassified by hypothesis h
What is the ‘True Error’?
The fraction of examples that would be misclassified on samples drawn from D, in the limit of infinitely many samples (i.e. the probability of misclassification under D)
What does this formula represent: error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]?
The True Error, i.e. the fraction of examples that would be misclassified by some hypothesis h(x) compared to the true hypothesis (i.e. concept) c(x)
What is the definition of PAC Learning (“Probably Approximately Correct”)?
Concept class C is learnable by learner L using hypothesis space H if and only if the learner will, with probability 1 - δ, output a hypothesis h in H such that the error(h) <= ε in time and samples polynomial in 1/ε, 1/δ, and n (where n is the size of the hypothesis space).
Mathematically, what do the words “Probably Approximately Correct” represent?
Probably := 1 - δ [i.e. the certainty goal]
Approximately := ε [i.e. the error goal]
Correct := h(x)=c(x) [i.e. the chosen hypothesis is equal to the actual concept hypothesis]
At a high-level, describe what PAC Learning is?
Something is considered PAC learnable if the learner L can find a hypothesis that has low true error and can be learned in polynomial time as a function of 1/ε, 1/δ and n (the size of the hypothesis space).
Is the concept in the image below PAC learnable?
Yes. We just keep track of the version space (the hypotheses that are consistent with the data) and then - since we don’t have any additional information to suggest otherwise - we should simply pick uniformly from that version space.
My original guess on the answer (it’s wrong per the lectures)
I’m going to go with ‘No’. So first let’s remember what it means for something to be PAC learnable: it means the learner has to be able to find a hypothesis h in H that has a low, bounded true error and that can be learned in polynomial time as a function of the error rate, certainty goal, and hypothesis space size.
Well, first off, we can see right away that our hypothesis space is going to grow as a function of k. But that’s constant, so we at least tick that box. The error bounds seem more of an issue to me. Think about the case where we increased the length k for each new sample:
0
01
010
1000
What is ε-exhaustion of a Version Space (VS)?
A Version Space is ε-exhausted when every hypothesis you might possibly choose from the VS has an error of at most ε. In that case we can return any of the hypotheses in the VS by choosing uniformly, because we have no prior information that compels us to choose one over the other, and they all provide the same error guarantees! [Note: This is mentioned as a key concept!]
If we know the size of our hypothesis space and desire to find a hypothesis that’s bounded by some error goal ε and certainty goal δ, then the number of samples required to achieve that is polynomial? (True/False)
True. This comes from the Haussler Theorem, which bounds the number of samples needed to ε-exhaust the version space: m ≥ (1/ε)(ln|H| + ln(1/δ)), which is polynomial in 1/ε, 1/δ, and ln|H|.
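A quick sketch of plugging numbers into that bound (the |H|, ε, and δ values are arbitrary examples):

```python
import math

def haussler_sample_bound(H_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples suffice to
    epsilon-exhaust the version space with probability at least 1 - delta."""
    return math.ceil((1.0 / epsilon) * (math.log(H_size) + math.log(1.0 / delta)))

print(haussler_sample_bound(H_size=10**6, epsilon=0.05, delta=0.01))   # -> 369
```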
One way to think of why the distribution need not be uniform for the number of samples needed for something to be PAC learnable is that the distribution in some sense sort of cancels out between the training and true error? (True/False)
True. See the second to last video in the computational learning theory lectures for a really good example of this.
What is the ‘agnostic’ situation in the context of PAC learnability?
It occurs when the target concept is not in our hypothesis space. In that case, the learner doesn’t match the true concept, but just chooses the best one. The error bounds change a bit, but are more or less the same, and the learning sample complexity is still polynomial in all of the terms we’ve been discussing previously (size of the hypothesis space, error goal, failure rate)
Which of these hypothesis spaces are infinite?
[yes] Linear separators - As an example, think of an SVM using a linear kernel to create a hyperplane between two classes. Could you rotate it 0.0000000001 degrees one direction or the other and still split the classes? There’s a good chance.
[yes] Artificial Neural Networks - think about GPT-3 with its billions of parameters. Each one of those parameters is a floating point number that could take on any real value. Definitely an infinite space.
[no] Decision Trees (discrete inputs) - there are only so many ways we can split discrete data, so the hypothesis space for discrete inputs is not infinite.
[yes] Decision Trees (continuous inputs) - a continuous input could be split at any arbitrary point along the number line and at infinite precision, so the hypothesis space is definitely infinite.