Midterm Exam Flashcards
What is the definition of ML that Dr. Isbell prefers?
ML is about using math, computation, engineering (among other things) to build computational artifacts that learn over time.
What is inductive reasoning?
Reasoning that goes from specifics → generalities
What is deductive reasoning?
Applying general rules to draw specific (logical) conclusions
At a high-level, what is supervised learning considered an example of?
Function approximation. It’s the process of inductively learning a general rule from observations.
What is unsupervised learning all about?
It’s about DESCRIBING data. In unsupervised ML, we’re only given “inputs”, so the objective is to learn whether there is some latent structure in the data that we can use to describe the data in a more efficient (i.e. more concise) way. Clustering is a common type of unsupervised ML.
What are two types of supervised machine learning?
- Classification (Discrete output values)
- Regression (Continuous output values)
In a Decision Tree (DT) model, nodes are _______ and edges are _____?
Attributes, Values
What is the decision tree algorithm?
1. Pick the best attribute, where “best” means it splits the data roughly in half (see the sketch below)
2. Ask the question
3. Follow the answer path
4. Go to 1 (repeat until you arrive at an answer)
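A minimal Python sketch of step 1, assuming discrete-valued attributes and using information gain to score splits (as ID3 does); the function names and the dict-per-example data layout are illustrative, not from the lecture:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, labels, attribute):
    """Reduction in label entropy from splitting on `attribute`."""
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def best_attribute(examples, labels, attributes):
    """Step 1: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))

# e.g. best_attribute([{"x1": 0, "x2": 1}, {"x1": 1, "x2": 1}], [0, 1], ["x1", "x2"])
# returns "x1", since x1 alone determines the label here.
```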
What is the complexity of XOR?
O(2^n) [i.e. exponential]. A decision tree for n-bit XOR (parity) needs exponentially many nodes, which is why XOR is considered a “hard” function.
What is the complexity of OR?
O(n) [i.e. linear]. A decision tree for n-bit OR (“any of”) needs only linearly many nodes.
What are the two types of biases we worry about when searching through a hypothesis space?
- Restriction bias: how restrictive the hypothesis space we’ve chosen is. Given that there are infinitely many ways of representing a function, the restriction bias of a decision tree algorithm limits it to the space of Boolean functions.
- Preference bias: which hypotheses within the space we prefer. THIS IS AT THE HEART OF INDUCTIVE BIAS. (Because we have a preference for hypotheses that fit the data well.)
What are some of the inductive biases of the ID3 algorithm for decision trees?
- Preference for good splits near the top of the tree rather than the bottom (because we build the tree from the top down)
- Prefers models that fit the data well (i.e. prefers correct over incorrect)
- Tends to prefer shorter trees over taller ones (a natural corollary of preferring good splits near the top)
According to Mitchell, what three features must be present in order to have a well-posed learning problem?
- The class of tasks
- The measure of performance to be improved
- The source of experience
Historically, how did we end up with the term “regression”?
The idea of “regression to the mean”: e.g. the children of taller- or shorter-than-average parents tend to have heights that ‘regress’ back toward the mean.
What are some examples of where error comes from?
- Sensor Error
- Malicious/adversarial data
- Transcription error
- Un-modeled influences
What is one of the fundamental assumptions that we make in most supervised learning algorithms?
That the data from the training set are REPRESENTATIVE of what we expect in the future. If this isn’t the case, then our model won’t GENERALIZE, which is what we really care about as ML practitioners. More formally, the general assumption is that the data are I.I.D. (Independent and Identically Distributed); that is, the process that generated the training data is the same process that is generating the test data (in fact, any future data!).
In the context of regression, the best constant in terms of the squared error is the _____?
mean
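A quick sanity check on why the mean wins (a standard derivation, not from the original card): set the derivative of the squared error of a constant prediction $c$ to zero.

$$
\frac{d}{dc}\sum_{i=1}^{n}(y_i - c)^2 \;=\; -2\sum_{i=1}^{n}(y_i - c) \;=\; 0
\quad\Longrightarrow\quad
c \;=\; \frac{1}{n}\sum_{i=1}^{n} y_i
$$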
Describe cross-validation. Why do we use it?
We split the training data into k folds. Then we train on k-1 of the folds and use the remaining fold as a validation set (i.e. a stand-in for the test set). We repeat this for each of the k choices of held-out fold and average the validation error across all of them. The best model is the one with the lowest average error.
We use cross validation to avoid overfitting the model. Since we should have no access to the test set when developing our model, using cross validation improves generalization.
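A minimal sketch of k-fold cross-validation using scikit-learn’s `KFold` (the synthetic data and the choice of `DecisionTreeClassifier` here are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; in practice X, y come from your training set.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = (X[:, 0] > 0.5).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    # Error on the held-out fold (the stand-in for the test set).
    errors.append(1 - model.score(X[val_idx], y[val_idx]))

print("average validation error:", np.mean(errors))
```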
Logical AND is expressible as a perceptron unit? (True/False)
True
Logical OR is not expressible as a perceptron unit? (True/False)
False
Logical NOT is expressible as a perceptron unit? (True/False)
True
Logical XOR is not expressible as a perceptron unit? (True/False)
True. XOR is not linearly separable, so a single perceptron unit cannot express it (a network of units is required).
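To make these four answers concrete, here is a small check in Python; the particular weights and thresholds are just one valid choice among many:

```python
def perceptron(weights, threshold, inputs):
    """A single unit: fires (1) iff the weighted sum meets the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

bits = [(a, b) for a in (0, 1) for b in (0, 1)]

# AND: both inputs needed to reach a threshold of 2.
assert all(perceptron([1, 1], 2, p) == (p[0] & p[1]) for p in bits)
# OR: either input alone reaches a threshold of 1.
assert all(perceptron([1, 1], 1, p) == (p[0] | p[1]) for p in bits)
# NOT: a negative weight with threshold 0 flips the input.
assert all(perceptron([-1], 0, [a]) == (1 - a) for a in (0, 1))
# XOR: no (weights, threshold) pair works; it is not linearly separable.
```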
For perceptron network training, what is the difference between the “perceptron rule” and the “gradient descent” rule?
The perceptron rule updates weights using the thresholded output values, while gradient descent uses the UNthresholded activation (see the sketch below).
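A side-by-side sketch of the two updates (eta is the learning rate; w, x, y are weights, input, and target; the names are illustrative):

```python
import numpy as np

def perceptron_rule_update(w, x, y, eta=0.1):
    """Perceptron rule: the error uses the THRESHOLDED output."""
    y_hat = int(np.dot(w, x) >= 0)   # thresholded activation
    return w + eta * (y - y_hat) * x

def gradient_descent_update(w, x, y, eta=0.1):
    """Gradient descent (delta rule): the error uses the UNthresholded activation."""
    a = np.dot(w, x)                 # raw activation, no threshold
    return w + eta * (y - a) * x
```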
If the data are linearly separable, will the perceptron rule find the hyperplane that separates them in a finite number of iterations?
Yes (this is the perceptron convergence guarantee; it holds only when the data are linearly separable).