Exam 2017 Flashcards
What is the primary difference between “supervised” and “unsupervised” learning?
whether the training instances are explicitly labelled or not
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: blood pressure level, with possible values {low, medium, high}
ordinal
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: age, with possible values [0,120]
numeric
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: weather, with possible values {clear, rain, snow}
categorical
Indicate which of “numeric”, “ordinal”, and “categorical” best captures its type: abalone sex, with possible values {male, female, infant}
categorical
Describe a strategy for measuring the distance between two data points consisting of “categorical” features
Hamming distance OR cosine similarity OR jaccard OR dice
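As an illustrative sketch (not part of the original answer), Hamming distance over categorical feature vectors simply counts the positions at which the values differ:

```python
def hamming_distance(a, b):
    """Count the positions where two categorical vectors disagree."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Two toy instances over (weather, sex, blood pressure) features:
print(hamming_distance(["clear", "male", "high"],
                       ["rain", "male", "low"]))  # 2
```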
What is the relationship between “accuracy” and “error rate” in evaluation?
accuracy = 1− error rate
With the aid of a diagram, describe what is meant by “maximal marginal” in the context of training a “support vector machine”.
the width of the margin (= the distance between the separating hyperplane and the support vectors) should be maximised
What makes a feature “good”, i.e. worth keeping in a feature representation? How might we measure that “goodness”?
good = correlation/association with category of interest (and non-redundant)
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-layer perceptron with a softmax final layer
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: soft k-means
clustering
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: multi-response linear regression
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: logistic regression
classification
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: model tree
regression
For the model below, state whether it is canonically applied in a “classification”, “regression” or “clustering” setting: support vector regression
regression
With the aid of an example, briefly describe what a “hyperparameter” is.
a top-level setting for a given model (which is set prior to training)
With the use of an example, outline what “stacking” is.
combining the output of a number of base classifiers as input to a further supervised learning model
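A minimal sketch of the stacking data flow, using made-up threshold rules as base classifiers and a trivial stand-in meta-classifier (in a real system all of these would be trained models):

```python
def base_1(x):
    """Hypothetical base classifier: a rule on feature 0."""
    return int(x[0] > 0.5)

def base_2(x):
    """Hypothetical base classifier: a rule on feature 1."""
    return int(x[1] > 0.5)

def stack_features(x):
    """Meta-level representation: the base classifiers' outputs
    become the feature vector for the meta-classifier."""
    return [base_1(x), base_2(x)]

def meta(z):
    """Stand-in meta-classifier over base outputs (here: a vote)."""
    return int(sum(z) >= len(z) / 2)

x = [0.9, 0.2]
print(stack_features(x))        # [1, 0]
print(meta(stack_features(x)))  # final stacked prediction
```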
What is the convergence criterion for the “EM algorithm”?
convergence of the maximum log-likelihood to within an epsilon (small) change between iterations
Outline the basis of “purity” as a form of cluster evaluation.
what proportion of instances in the cluster correspond to majority class
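Purity can be sketched in a few lines; the clusters below are toy gold-label assignments:

```python
from collections import Counter

def purity(clusters):
    """Weighted purity: for each cluster, the number of instances
    belonging to its majority class, summed and divided by the
    total number of instances."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# First cluster is 3/4 'A', second is 2/3 'B': purity = (3 + 2) / 7
print(purity([["A", "A", "A", "B"], ["B", "B", "A"]]))
```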
What is the underlying assumption behind active learning based on “query-by-committee”?
disagreement between base classifiers indicates that the instance is hard to classify, and thus will have high utility as a training instance
Random forests
“Random forests” are based on decision trees under different dimensions of “randomisation”. With reference to the following toy training dataset, provide a brief outline of two (2) such “random processes” used in training a random forest. (You should give examples as necessary; it is not necessary to draw the resulting trees, although you may do so if you wish.)
1. random sampling of training instances (similar to bagging)
2. random subsampling of attributes for a given decision tree
3. random construction of new features based on linear combinations of numeric features
HMMs
In the “forward algorithm”, αt(j) is used to “memoise” a particular value for each state j and observation t. Describe what each αt(j) represents.
the probability of observing all observations up to and including t and ending up in state j
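A minimal sketch of the forward recursion, with a made-up two-state HMM (`pi`: initial state probabilities, `A`: transitions, `B`: emissions; all numbers invented for illustration):

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(o_1..o_t, ending in state j).
    Returns the total probability of the observation sequence."""
    n = len(pi)
    # Base case: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for t in range(1, len(obs)):
        # Recursion: sum over all predecessor states i
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return sum(alpha[-1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(forward(pi, A, B, [0, 1, 1]))
```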
HMMs
In the “Viterbi algorithm”, two memoisation variables are used to describe for each combination of state j and observation t:
- βt(j) (which plays a similar role to αt(j) in the forward algorithm)
- φt(j). Describe what each φt(j) represents.
the most probable immediately preceding state for state j given observations up to and including t
HMMs
Why do we tend to use “log probabilities” in the Viterbi algorithm but not the forward
algorithm?
Viterbi maximises a product of probabilities, so taking logs converts it into maximising a sum of log probabilities (which also avoids floating-point underflow); the forward algorithm computes a sum of products of probabilities, and the log of a sum does not decompose,
so logging the probabilities doesn’t help in the calculation
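Both Viterbi memoisation variables, worked in log space, can be sketched as follows (a made-up two-state HMM; `row` plays the role of βt(j) and `ptr` the role of φt(j)):

```python
import math

def viterbi(pi, A, B, obs):
    """Most probable state sequence, computed in log space:
    products of probabilities become sums of log probabilities."""
    n = len(pi)
    logp = [[math.log(pi[j]) + math.log(B[j][obs[0]]) for j in range(n)]]
    back = []
    for t in range(1, len(obs)):
        row, ptr = [], []
        for j in range(n):
            # phi_t(j): best immediately preceding state for j
            best = max(range(n), key=lambda i: logp[t - 1][i] + math.log(A[i][j]))
            row.append(logp[t - 1][best] + math.log(A[best][j])
                       + math.log(B[j][obs[t]]))
            ptr.append(best)
        logp.append(row)
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda j: logp[-1][j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(viterbi(pi, A, B, [0, 1, 1]))  # [0, 1, 1]
```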
Model Learning
Is our primary objective in machine learning to derive a model that fits the subset of the data
that we do have? Why or why not?
No. Want to build a model that generalises to new data
Model Learning
Explain how we can use our limited data, in a machine learning context, to demonstrate
whether or not our objective has been met.
Split data into training/dev/test. Need to measure generalisation from “best” (tuned) model to unseen data
Model Learning
Identify and explain one important problem that can emerge with respect to this primary objective, even if we are successful in deriving a good model for the data that we have. Name one specific technique discussed in class that can be applied to mitigate this problem, and explain how it does so.
overfitting; model does well on training data but poorly on test data
technique: L1/L2/Lasso regularisation
how: reduce complexity of model/constrain model function
Model Learning
Define “bias” and “variance”, indicating how we might detect each one. Discuss how bias
and variance relate to each other in the context of our primary objective
bias: how well our model approximates the (training) data; approximation error (high bias=consistently poor performance)
variance: how well our model generalises to held-out (test) data; estimation error (high variance=sensitive to training data)
Relationship: Increasing the complexity of our model tends to reduce bias (by fitting the data better) but increase variance
(because the model will be more strongly fit to the training data); in order to obtain generalisation we seek to minimise both
bias and variance in order to have a good model that generalises well (isn’t overfit to training). Increase training examples
and control for overfitting to lower variance
Briefly explain — in at most two sentences — the basic logic behind the “ID3” algorithmic
approach toward building decision trees. This should be focussed on labelling the nodes
and leaves; you do not have to explain edge-cases.
Recursively determine which feature has the highest information gain (i.e. does the best job of partitioning the data into pure
subsets) over the subset of training instances selected by the path to that node, and add a branch for each value of that feature;
continue until every leaf node is pure (and label with corresponding class)
The criterion for labelling a node, as explained in the lectures, was based around the idea of
“entropy” — what does entropy tell us about a node of a decision tree?
how skewed (“pure”) the label distribution is (lower entropy → more skewed → better)
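Entropy and the resulting information gain (as used by ID3 to label nodes) can be sketched as:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p * log2(p) over the class distribution at a node."""
    counts = Counter(labels).values()
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

def info_gain(labels, feature_values):
    """Entropy at the node minus the weighted entropy of the
    children produced by splitting on the feature."""
    split = {}
    for lab, val in zip(labels, feature_values):
        split.setdefault(val, []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in split.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                          # 1.0: maximally mixed node
print(info_gain(labels, ["a", "a", "b", "b"]))  # 1.0: feature splits perfectly
print(info_gain(labels, ["a", "b", "a", "b"]))  # 0.0: feature is useless
```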