Exam 2017 Flashcards
What is the primary difference between “supervised” and “unsupervised” learning?
whether the training instances are explicitly labelled or not
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: blood pressure level, with possible values {low, medium, high}
ordinal
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: age, with possible values [0,120]
numeric
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: weather, with possible values {clear, rain, snow}
categorical
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: abalone sex, with possible values {male, female, infant}
categorical
Describe a strategy for measuring the distance between two data points consisting of “categorical” features
Hamming distance OR cosine similarity OR jaccard OR dice
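As an illustrative sketch (not part of the original answer), Hamming distance over categorical feature vectors simply counts the positions at which the values differ:

```python
def hamming_distance(a, b):
    """Count the positions where two categorical vectors disagree."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Two toy instances over (weather, sex, blood pressure) features:
print(hamming_distance(["clear", "male", "high"],
                       ["rain", "male", "low"]))  # 2
```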
What is the relationship between “accuracy” and “error rate” in evaluation?
accuracy = 1− error rate
With the aid of a diagram, describe what is meant by “maximal marginal” in the context of training a “support vector machine”.
the width of the margin (= the distance between the separating hyperplane and the support vectors) should be maximised
What makes a feature “good”, i.e. worth keeping in a feature representation? How might we measure that “goodness”?
good = correlation/association with category of interest (and non-redundant)
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-layer perceptron with a softmax final layer
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: soft k-means
clustering
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-response linear regression
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: logistic regression
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: model tree
regression
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: support vector regression
regression
With the aid of an example, briefly describe what a “hyperparameter” is.
a top-level setting for a given model (which is set prior to training)
With the use of an example, outline what “stacking” is.
combining the output of a number of base classifiers as input to a further supervised learning model
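A minimal sketch of the stacking data flow, using made-up threshold rules as base classifiers and a trivial stand-in meta-classifier (in a real system all of these would be trained models):

```python
def base_1(x):
    """Hypothetical base classifier: a rule on feature 0."""
    return int(x[0] > 0.5)

def base_2(x):
    """Hypothetical base classifier: a rule on feature 1."""
    return int(x[1] > 0.5)

def stack_features(x):
    """Meta-level representation: the base classifiers' outputs
    become the feature vector for the meta-classifier."""
    return [base_1(x), base_2(x)]

def meta(z):
    """Stand-in meta-classifier over base outputs (here: a vote)."""
    return int(sum(z) >= len(z) / 2)

x = [0.9, 0.2]
print(stack_features(x))        # [1, 0]
print(meta(stack_features(x)))  # final stacked prediction
```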
What is the convergence criterion for the “EM algorithm”?
convergence of the maximum log-likelihood to within an epsilon (small) change between iterations
Outline the basis of “purity” as a form of cluster evaluation.
what proportion of instances in the cluster correspond to majority class
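Purity can be sketched in a few lines; the clusters below are toy gold-label assignments:

```python
from collections import Counter

def purity(clusters):
    """Weighted purity: for each cluster, the number of instances
    belonging to its majority class, summed and divided by the
    total number of instances."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# First cluster is 3/4 'A', second is 2/3 'B': purity = (3 + 2) / 7
print(purity([["A", "A", "A", "B"], ["B", "B", "A"]]))
```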
What is the underlying assumption behind active learning based on “query-by-committee”?
disagreement between base classifiers indicates that the instance is hard to classify, and thus will have high utility as a training instance
Random forests
“Random forests” are based on decision trees under different dimensions of “randomisation”. With reference to the following toy training dataset, provide a brief outline of two (2) such “random processes” used in training a random forest. (You should give examples as necessary; it is not necessary to draw the resulting trees, although you may do so if you wish.)
1. random sampling of training instances (similar to bagging)
2. random subsampling of attributes for a given decision tree
3. random construction of new features based on linear combinations of numeric features
HMMs
In the “forward algorithm”, αt(j) is used to “memoise” a particular value for each state j and observation t. Describe what each αt(j) represents.
the probability of observing all observations up to and including t and ending up in state j
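A minimal sketch of the forward recursion, with a made-up two-state HMM (`pi`: initial state probabilities, `A`: transitions, `B`: emissions; all numbers invented for illustration):

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(o_1..o_t, ending in state j).
    Returns the total probability of the observation sequence."""
    n = len(pi)
    # Base case: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for t in range(1, len(obs)):
        # Recursion: sum over all predecessor states i
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return sum(alpha[-1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(forward(pi, A, B, [0, 1, 1]))
```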
HMMs
In the “Viterbi algorithm”, two memoisation variables are used to describe for each combination of state j and observation t:
- βt(j) (which plays a similar role to αt(j) in the forward algorithm)
- φt(j). Describe what each φt(j) represents.
the most probable immediately preceding state for state j given observations up to and including t
HMMs
Why do we tend to use “log probabilities” in the Viterbi algorithm but not the forward
algorithm?
Viterbi maximises a product of probabilities, so taking logs converts it into maximising a sum of log probabilities (which also avoids floating-point underflow); the forward algorithm computes a sum of products of probabilities, and the log of a sum does not decompose,
so logging the probabilities doesn’t help in the calculation
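Both Viterbi memoisation variables, worked in log space, can be sketched as follows (a made-up two-state HMM; `row` plays the role of βt(j) and `ptr` the role of φt(j)):

```python
import math

def viterbi(pi, A, B, obs):
    """Most probable state sequence, computed in log space:
    products of probabilities become sums of log probabilities."""
    n = len(pi)
    logp = [[math.log(pi[j]) + math.log(B[j][obs[0]]) for j in range(n)]]
    back = []
    for t in range(1, len(obs)):
        row, ptr = [], []
        for j in range(n):
            # phi_t(j): best immediately preceding state for j
            best = max(range(n), key=lambda i: logp[t - 1][i] + math.log(A[i][j]))
            row.append(logp[t - 1][best] + math.log(A[best][j])
                       + math.log(B[j][obs[t]]))
            ptr.append(best)
        logp.append(row)
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda j: logp[-1][j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(viterbi(pi, A, B, [0, 1, 1]))  # [0, 1, 1]
```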
Model Learning
Is our primary objective in machine learning to derive a model that fits the subset of the data
that we do have? Why or why not?
No. Want to build a model that generalises to new data
Model Learning
Explain how we can use our limited data, in a machine learning context, to demonstrate
whether or not our objective has been met.
Split data into training/dev/test. Need to measure generalisation from “best” (tuned) model to unseen data
Model Learning
Identify and explain one important problem that can emerge with respect to this primary objective, even if we are successful in deriving a good model for the data that we have. Name one specific technique discussed in class that can be applied to mitigate this problem, and explain how it does so.
overfitting; model does well on training data but poorly on test data
technique: L1/L2/Lasso regularisation
how: reduce complexity of model/constrain model function
Model Learning
Define “bias” and “variance”, indicating how we might detect each one. Discuss how bias
and variance relate to each other in the context of our primary objective
bias: how well our model approximates the (training) data; approximation error (high bias=consistently poor performance)
variance: how well our model generalises to held-out (test) data; estimation error (high variance=sensitive to training data)
Relationship: Increasing the complexity of our model tends to reduce bias (by fitting the data better) but increase variance
(because the model will be more strongly fit to the training data); in order to obtain generalisation we seek to minimise both
bias and variance in order to have a good model that generalises well (isn’t overfit to training). Increase training examples
and control for overfitting to lower variance
Briefly explain — in at most two sentences — the basic logic behind the “ID3” algorithmic
approach toward building decision trees. This should be focussed on labelling the nodes
and leaves; you do not have to explain edge-cases.
Recursively determine which feature has the highest information gain (i.e. does the best job of partitioning the data into pure
subsets) over the subset of training instances selected by the path to that node, and add a branch for each value of that feature;
continue until every leaf node is pure (and label with corresponding class)
The criterion for labelling a node, as explained in the lectures, was based around the idea of
“entropy” — what does entropy tell us about a node of a decision tree?
how skewed (“pure”) the label distribution is (lower entropy → more skewed → better)
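Entropy and the resulting information gain (as used by ID3 to label nodes) can be sketched as:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p * log2(p) over the class distribution at a node."""
    counts = Counter(labels).values()
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

def info_gain(labels, feature_values):
    """Entropy at the node minus the weighted entropy of the
    children produced by splitting on the feature."""
    split = {}
    for lab, val in zip(labels, feature_values):
        split.setdefault(val, []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in split.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                          # 1.0: maximally mixed node
print(info_gain(labels, ["a", "a", "b", "b"]))  # 1.0: feature splits perfectly
print(info_gain(labels, ["a", "b", "a", "b"]))  # 0.0: feature is useless
```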