Oldie Flashcards
Shortcomings of Information Gain?
- IG tends to prefer highly branching attributes
- Subsets are more likely to be pure when an attribute has a large number of values
- May result in overfitting
Solve with Gain Ratio
What aspect of Information Gain is “Gain Ratio” intended to remedy? Explain with equation how to achieve this.
- Fixes IG’s shortcoming of preferring highly branching attributes (e.g. an ID attribute: every value gives a pure, single-instance subset, so IG is maximal)
- GR reduces the bias for IG toward highly branching attributes by normalising relative to the Split Information (SI).
GR(Ra | R) = IG(Ra | R) / SI(Ra | R)
IG(Ra | R) = H(R) - Σi P(xi) H(xi)
SI(Ra | R) = - Σi P(xi) log P(xi)
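A minimal Python sketch of this calculation (helper names and toy data are illustrative only):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """IG / SI for one attribute, given parallel lists of attribute values and class labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    ig = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    si = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return ig / si if si > 0 else 0.0

# Toy example: an attribute with two values
print(gain_ratio(["a", "a", "b", "b"], ["yes", "no", "no", "no"]))
```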
What’s the difference between classification and regression?
Classification = Predicting which class a data point is part of.
- The dependent variables are categorical
Regression = Predicting continuous values
- The dependent variables are numerical
Outline the nature of “hill climbing” & provide example of hill climbing algorithm.
- Iteratively moves towards a maximum by only taking steps that improve the objective
- Can get stuck in a local maximum rather than the global one (unless the objective is convex/concave)
e.g. EM algorithm - guaranteed positive (non-decreasing) hill climb on the log likelihood
What basic approach does hill climbing contrast with?
???
Advantages of “bagging”
- Decrease variance
- Highly effective over noisy data
- Performance is generally better & never substantially worse
- Possibility to parallelise computation of individual base classifiers
What is sum of squared errors? Name one thing it’s applied to.
Method to evaluate the quality of clusters.
Applied to K-Means
How does stratified cross validation contrast with non-stratified cross validation?
Stratification is generally better than non-stratified cross-validation.
- Both in terms of bias and variance
Achieves this by arranging the data so that each fold preserves the class distribution of the whole data set.
Outline the difference between supervised and unsupervised ML methods
Unsupervised:
- No knowledge of labelled data
- Needs to find clusters, patterns, etc by itself
Supervised:
- Has knowledge of labelled data
- Learning involves comparing its current predicted output with the correct outputs
- Able to know what the error is
Define “discretisation” with an example
Process of converting continuous attributes to nominal or ordinal attributes, e.g. binning a continuous age attribute into {young, middle-aged, old}.
Some learners, such as Decision Trees, generally work better with nominal attributes.
Briefly describe what “overfitting” is
Overfitting is when the model fits errors and/or noise in the training data rather than the underlying pattern (often made worse by a lack of training data)
- High variance in classifier
- Causes the training accuracy and test accuracy to not be similar - big gap between test and training curves
- Occurs when the model is excessively complex
- Struggles to generalise
- Overreacts to minor fluctuations in the training data
Briefly outline the “inductive learning hypothesis”
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples
Categorise each as either “parametric” or “non-parametric”
i) SVM
ii) Max Entropy Model
iii) C4.5-style DT
SVM - Depends on the kernel (linear: parametric; e.g. RBF: effectively non-parametric)
Max Entropy - Parametric
DT - Non-parametric
What is a “consistent” classifier?
A classifier which is able to flawlessly predict the class of all TRAINING instances
Outline the difference between “Lazy” and “eager” classifiers
Lazy (instance-based) -> KNN
- Stores data and waits until a query is made to do anything
Eager (decision tree, SVM)
- Constructs a classification model (generalise the training data) before receiving queries (new data to classify)
What is Zero-R?
Classifies every instance as the most frequent class in the training data.
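A tiny Python sketch of Zero-R (names are illustrative):

```python
from collections import Counter

def zero_r(train_labels):
    """Zero-R: always predict the most frequent class seen in training."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

classify = zero_r(["yes", "yes", "no"])
print(classify({"outlook": "sunny"}))  # -> "yes"
```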
Briefly outline the nature of conditional independence assumption in the context of NB
Features are assumed to be conditionally independent of each other given the class:
P(x1, ..., xn | c) = P(x1 | c) × ... × P(xn | c)
This makes the probabilities easy to estimate from counts, though the assumption rarely holds exactly in practice.
What is “hidden” in a HMM?
The sequence of states - only the outputs (observations) are visible; the states that generated them are hidden.
What is the role of “log likelihood” in EM algorithm?
It is the objective being maximised - each EM iteration is guaranteed not to decrease the log likelihood of the data, and the relative change in log likelihood between iterations is used as the convergence criterion.
“Smoothing” is considered important when dealing with probability
i) why?
ii) name one smoothing method - explain how it works
i) Without smoothing, unseen events get probability 0, which wipes out the whole product (e.g. in Naïve Bayes) and means the model cannot handle unseen feature values at test time.
ii) Laplace (add-one) smoothing - add 1 to every count before normalising, so every event receives a small non-zero probability.
What are 2 parameters required to generate a Gaussian probability density function?
The mean (μ) and the variance (σ²) - equivalently, the standard deviation (σ).
How are missing values in test instances dealt with in NB?
If training -> Don’t include in classifier
If test -> Ignore from end result
i.e. just drop the missing values
Briefly explain how to generate a training & test learning curve
Train the classifier on increasing amounts of the training data (e.g. 10%, 20%, ...); at each size, record accuracy over the training data used and over a fixed held-out test set; plot both accuracies against training set size.
Explain with equations what PRECISION and RECALL are
Use Precision and Recall when we care about performance on the positive class (what we positively identified), not just overall accuracy
Precision = positive predictive values - proportion of positive predictions that are correct
= TP / TP+FP
Recall = sensitivity - accuracy with respect to positive cases
= TP / TP+FN
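A small Python sketch computing both from gold and predicted labels (assuming a single designated positive class):

```python
def precision_recall(gold, predicted, positive="yes"):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN), counted over the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"]))  # (0.5, 0.5)
```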
What sort of task is “root relative squared error” used to evaluate
Don’t think this is part of course
Describe with equation how the “overlap” metric is used to calculate distance between 2 continuous feature values in instance based learning
Don’t think this is part of course
Advantage of “overlap” metric over Euclidean Distance?
??? don’t think relevant
Describe how instance based learning combines distances for discrete and continuous features
??? don’t think relevant
Describe with use of an equation how “inverse distance” is used in weighting voting for instance based learning
??? don’t think relevant
Outline each step you would go through in calculating classification accuracy based on “10-fold stratified cross-validation” for a given dataset and supervised algo.
1) Partition the dataset into 10 folds, each preserving the class distribution of the full dataset (stratification)
2) For each fold i: train the algorithm on the other 9 folds, classify fold i, and record the accuracy
3) Average the 10 per-fold accuracies to get the final estimate
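A scikit-learn sketch of the same procedure (the iris data and the decision tree are just stand-ins for "a given dataset and supervised algo"):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # stand-in dataset
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # stand-in supervised algo
    accuracies.append(clf.score(X[test_idx], y[test_idx]))           # accuracy on the held-out fold

print(np.mean(accuracies))                           # 10-fold stratified CV accuracy
```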
Why is stratified cross-validation superior to “holdout” method
Every instance is used for both training and testing (in different folds), the estimate is averaged over 10 train/test splits rather than depending on a single, possibly unrepresentative split, and stratification keeps each fold representative of the class distribution - so the accuracy estimate has lower variance.
Outline steps involved in the “bagging” algorithm
1) Generate k bootstrap samples of the training set (random sampling with replacement)
2) Build a base classifier on each sample
3) Classify new instances by (unweighted) voting over the base classifiers
Give an example of an “unstable” algorithm
Decision trees - small changes in the training data can produce very different trees (which is why bagging them into a random forest helps).
Classify as either supervised or unsupervised
i) C4.5 DT
ii) K-means
iii) boosting
iv) zero-R
i) Supervised
ii) Unsupervised
iii) Supervised
iv) Supervised
What is an “incremental” algo? When is this a desirable property?
An algorithm that can update its model as new training instances arrive, without retraining from scratch.
Desirable when data arrives as a stream or the full training set is too large to reprocess (e.g. nearest neighbour simply stores the new instance).
What is the relationship between similarity and distance?
They are inverses: the higher the similarity between two instances, the smaller the distance between them (and vice versa).
What is a nearest neighbour?
Closest point: Maximum similarity or Minimum distance
x is the nearest neighbour of y in C if d(x, y) = min{ d(z, y) : z ∈ C }
Big O of brute force nearest neighbour?
N samples, D dimensions
- O(DN^2)
Good for small datasets, but quickly becomes infeasible as N grows
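A brute-force 1-NN sketch in numpy (toy data is illustrative only):

```python
import numpy as np

def nearest_neighbour(X_train, y_train, x):
    """Brute force 1-NN: O(D*N) per query, so O(D*N^2) to classify N queries."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["a", "a", "b"])
print(nearest_neighbour(X_train, y_train, np.array([4.0, 4.5])))  # -> "b"
```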
PROS and CONS of Nearest Neighbours
PROS
- Simple
- Can handle arbitrarily many classes
- Incremental (can add extra data to the classifier on the fly)
CONS
- Prone to bias
- Arbitrary k-value
- Need a useful distance function
- Need an averaging function
- Expensive
- Lazy learner
What is the specificity equation
Accuracy with respect to negative cases
Specificity = TN / TN+FP
What is the (training) bias of a classifier?
The difference between correct and predicted value, averaged over training sets
- Bias is large if the learning method produces classifiers that are consistently wrong (underfitting)
- Bias is small if the classifier is consistently correct, or if different training sets cause errors on different test instances
What is the (test) variance of a classifier?
The variation in the predictions of learned classifiers across training sets - Measures inconsistency, not accuracy
- Variance is large if different training sets give rise to very different classifiers (overfitting)
- Variance is small if different training sets have a minor effect on the classification decisions made, don’t care if correct or incorrect
Pros and Cons of Holdout Strategy?
Pros
- Simple to work with
- High reproducibility
Cons
- Tradeoff between more training and more test data (variance vs bias)
- Representativeness of the training and test data
Pros and Cons of Random Sampling?
PROS
- Reduction in variance and bias over “holdout” strategy
Cons
- Lack of reproducibility
Pros and Cons of leave-one-out?
Pros
- No sampling bias in evaluating the system and results will be unique and repeatable
- Generally gives high accuracy
Cons
- Very expensive if working with a large dataset
Pros and Cons of M-Fold Cross Validation
Pros
- Need to only train the model M times (rather than N times like in leave-one-out) M is partitions, N is all instances
- Can measure stability of the system across different training/test combinations
Cons
- How data is distributed among the M partitions due to sampling can lead to bias (training data differs from test)
- The results will not be unique unless we always partition the data identically
- Slightly lower accuracy because (M-1) / M of data is used for training
What is M-Fold CV similar to?
Similar to leave-one-out, instead of doing over all N instances, you partition into M and only do over M
What is a baseline and what is a benchmark?
Baseline
- Naïve method which we would expect any reasonably well-developed classifier to outperform (the bare minimum)
Benchmark
- Established rival technique which we are pitching our method against (goal)
Discuss the importance of baselines, give examples
- Establishing whether any proposed method is doing better than “dumb and simple”
- Valuable in getting a sense for the intrinsic difficulty of a given task
e. g. Random Baseline, Zero-R, One-R
What is One-R? How does it work?
Baseline method
- Creates one rule for each attribute in the training data
- Then selects the rule with the smallest error rate as its “one rule”
How it works
- Create a decision stump for each attribute with branches for each value (attribute)
- Populate leaf with the majority class at that leaf (i.e. make all instances the majority class in this leaf - if majority is YES, make all instances YES)
- Select decision stump which leads to lowest error rate over training
Weather data, Outlook attribute (9 Yes / 5 No overall):
- Sunny: 2 Yes, 3 No -> replace all with NO (2 errors)
- Overcast: 4 Yes, 0 No -> replace all with YES (0 errors)
- Rainy: 3 Yes, 2 No -> replace all with YES (2 errors)
Total errors = 4 / 14
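A Python sketch of One-R over instances represented as attribute -> value dicts (names and toy data are illustrative):

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """One-R: build a majority-class decision stump per attribute, keep the one with fewest training errors."""
    best = None
    for attr in instances[0]:
        by_value = defaultdict(list)                 # labels grouped by this attribute's values
        for inst, y in zip(instances, labels):
            by_value[inst[attr]].append(y)
        rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in by_value.items()}
        errors = sum(1 for inst, y in zip(instances, labels) if rule[inst[attr]] != y)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (chosen attribute, value -> class rule, training errors)

data = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "overcast"}, {"outlook": "rainy"}]
print(one_r(data, ["no", "no", "yes", "yes"]))
```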
What to be aware of / sensitive of when formulating a baseline?
Be sensitive to the relative importance of positives and negatives in the classification task (e.g. class imbalance).
Pros & cons of One-R?
Pros
- Simple to understand and implement
- Simple to comprehend results
- Good results
Cons
- Unable to capture attribute interactions
- Bias towards high-arity attributes
What is the “convergence” criterion in k-means?
No change over the past 2 iterations, or the difference between iterations falls below a specified threshold.
=> Stable state
What is the name of the process of converting nominal -> continuous values? explain how this works with e.g.
Similarities / Hamming Distance
What is “meta-classification”?
Training a classifier over the OUTPUTS of the base classifiers (as in stacking), rather than over the original features.
What is the range in entropy values?
0 <= Entropy <= log(n)
n = number of outcomes
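A quick numeric check of the bounds (illustrative):

```python
import math

def entropy(probs):
    """H = -sum p log2 p (terms with p = 0 contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))    # 0.0 -> minimum: one certain outcome
print(entropy([0.5, 0.5]))    # 1.0 -> maximum for n=2: log2(2)
print(entropy([0.25] * 4))    # 2.0 -> maximum for n=4: log2(4)
```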
Formula for Bayes Rule? How to apply? Give example
P(A | B) = P(B | A) P(A) / P(B)
Apply it to invert a conditional probability, e.g. in classification: P(class | features) = P(features | class) P(class) / P(features), and pick the class that maximises the numerator.
What is F-Score? How to apply? Give example
F1 = 2 × Precision × Recall / (Precision + Recall) - the harmonic mean of precision and recall.
Use it when a single figure summarising both is needed, e.g. Precision = 0.5 and Recall = 1.0 gives F1 = 2 × 0.5 × 1.0 / 1.5 ≈ 0.67.
Derive Naïve Bayes
P(c | x1, ..., xn) = P(c) P(x1, ..., xn | c) / P(x1, ..., xn)   (Bayes rule)
≈ P(c) Πi P(xi | c) / P(x1, ..., xn)   (conditional independence assumption)
The denominator is the same for every class, so classify with argmax_c P(c) Πi P(xi | c).
K-Means algorithm?
Randomly initialise k clusters (or centroids); assign each point to its nearest centroid; recompute the centroids; repeat until the assignments stop changing - see "What is k-means? How does it work?" below.
How to calculate base probabilities in Naïve Bayes
Class priors from relative class frequencies: P(c) = count(c) / N
Conditional probabilities from counts within each class: P(xi | c) = count(xi, c) / count(c), usually with smoothing for unseen combinations.
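A Python sketch of the count-based estimates with add-alpha smoothing (the smoothing choice and all names are illustrative assumptions):

```python
from collections import Counter, defaultdict

def nb_train(instances, labels, alpha=1):
    """Estimate NB probabilities from counts, with add-alpha (Laplace) smoothing on the conditionals."""
    n = len(labels)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    counts = defaultdict(Counter)                    # counts[(attr, class)][value]
    for inst, c in zip(instances, labels):
        for attr, value in inst.items():
            counts[(attr, c)][value] += 1

    def conditional(attr, value, c):
        attr_counts = counts[(attr, c)]
        vocab = {v for (a, _), ctr in counts.items() if a == attr for v in ctr}
        return (attr_counts[value] + alpha) / (sum(attr_counts.values()) + alpha * len(vocab))

    return priors, conditional

data = [{"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "sunny"}]
priors, cond = nb_train(data, ["no", "yes", "yes"])
print(priors, cond("outlook", "sunny", "yes"))
```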
Nearest Neighbour algorithm?
Store the training instances; for a test instance, compute its distance to every training instance, take the k nearest, and predict by (possibly weighted) majority vote over their classes.
Nearest Prototype Algorithm?
Compute the centroid (prototype) of each class from the training instances; classify a test instance as the class of its nearest prototype.
What is semi-supervised learning and when is it desirable?
TL;DR do both supervised and unsupervised
When we have a small number of labelled instances, large number of unlabelled instances.
labelled << unlabelled
This is not enough data to train a RELIABLE classifier (purely supervised), but we can potentially leverage the labelled instances to build a better classifier than a purely unsupervised method.
What is self training?
Self Training
- Train the learner on the currently labelled instances
- Use the learner to predict the labels of unlabelled instances
- Where the learner is very confident, add newly labelled instances to the training set
- Repeat until all instances are labelled, or no new instances can be labelled confidently
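A Python sketch of this loop, assuming numpy arrays and using logistic regression with a fixed confidence threshold purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Self-training sketch: repeatedly add confidently-predicted unlabelled instances to the training set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        clf = LogisticRegression().fit(X_lab, y_lab)
        preds = clf.predict(X_unlab)
        confidence = clf.predict_proba(X_unlab).max(axis=1)
        confident = confidence >= threshold              # only keep very confident predictions
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, preds[confident]])
        X_unlab = X_unlab[~confident]
    return LogisticRegression().fit(X_lab, y_lab)        # final model on the augmented training set
```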
What is co-training?
More or less the same as self-training, but with two learners operating (hopefully) independently.
What is active learning? And what are some methods to choose instances?
In active learning, the learner is allowed to choose a small number of instances to be labelled by a human judge (an oracle).
The idea is that many instances are easy to classify, while a small number are difficult to classify but would become easy with more training data (a luxury we don’t have).
Some methods to choose instances include;
- The learner generating its own difficult instances
- Instances are selected as the ones which are most difficult to classify in a fixed, unlabelled set
What is a structured classification?
Supervised Machine Learning technique that involves predicting structured objects, rather than discrete or real values
- Sequential structure
- Hierarchical structure
- Graph structure
Give an example of a structured classification
Translating a natural language sentence into a syntactic representation of a parse tree
What is a Hidden Markov Model?
Variant of a finite state machine having a set of hidden states
Current state is not observable (hidden)
Each state produces an output with a certain probability
What are the 3 canonical problems to solve with HMM?
1) Evaluation: Compute the PROBABILITY of a particular output sequence - forward & backward algo
2) Decode: Find the MOST LIKELY SEQUENCE OF STATES which could have generated a given output sequence - Viterbi & posterior decoding
3) Train: Given an output sequence, find the most likely set of state transition and output probabilities - Baum-Welch
How to evaluate HMM?
1) Likelihood of test data
- Keep some test data and compute the likelihood of the test sequences by using the forward algo
2) Predicting parts of the data given other parts
How to decode HMM?
1) Enumerate all the hidden state sequences, brute force & sort
2) Viterbi
How to train HMM (unsupervised)?
Baum-Welch
Limitations of HMM?
Similar to Naïve Bayes, suffers from floating point underflow.
As with generative models, hard to add ad-hoc features
Difference between max entropy markov model and HMM?
Max Entropy Markov Model is able to add extra features indiscriminately, as well as capturing unidirectional tag interactions
What are some alternatives to HMM for structured classification?
Maximum Entropy Markov Models - just logistic regression model where we condition on the tag for the preceding instance
Conditional Random Fields
What is clustering?
Method for unsupervised learning.
Grouping a set of objects in such a way that the objects in the same group (called a cluster) are more similar to each other than those in other groups (clusters)
The 2 types of clustering?
Hard clustering
- Each data point either belongs to a cluster or not
Soft clustering
- Probability or likelihood of that data point belonging to that cluster is assigned
Types of clustering models?
Connectivity models - Hierarchical
Centroid models - K-means & soft k-means
Distribution models - Expectation Maximisation (EM)
Density models - DBSCAN
Incremental vs Batch
Batch clustering processes the whole dataset in each pass (e.g. k-means re-assigns every instance each iteration); incremental clustering updates the clusters one instance at a time as the data arrives.
What is k-means? How does it work?
Iterative clustering algorithm that converges to a locally optimal set of clusters (minimising within-cluster squared error)
1) If not given k, specify k (the number of clusters/centroids)
2) Randomly assign each data point to a cluster (or randomly initialise the k centroids)
3) compute centroid point of cluster
4) reassign each data point to the closest cluster centroid
5) recompute cluster centroids
6) repeat 4 and 5 until no more updates can be made -> convergence criterion reached
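A numpy sketch of these steps, initialising with random centroids rather than random assignments (toy data is illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: random initial centroids, then alternate assign / recompute until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # initialise k centroids
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                                # assign each point to its closest centroid
        new_centroids = np.array([X[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
                                  for j in range(k)])                # recompute cluster centroids
        if np.allclose(new_centroids, centroids):                    # convergence: no more updates
            break
        centroids = new_centroids
    return centroids, assign

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(X, k=2))
```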
What is soft k-means? How does it work?
K-means with softmax function
1) Set t = 0, randomly initialise the centroids
2) Soft assign each instance Xj to a cluster
3) Update each centroid
4) Set t = t+1, go back to step 2, until the centroids stabilise
Overlapping, probabilistic, partitioning, batch clustering. Which k-means?
Soft k-means
Exclusive, deterministic, partitioning, batch clustering. Which k-means?
K-means
What is EM clustering?
Expectation Maximisation.
- Quasi-Newton parameter estimation method with guaranteed positive hill-climbing behaviour with respect to the log likelihood
- used to estimate hidden parameter values or cluster membership
How does EM work?
Iteratively alternates between performing 2 steps
Step E (expectation) - Calculate the expected log-likelihood based on the current estimates of the parameters
Step M (maximisation) - Compute new parameter distribution, maximising the log-likelihood found in step E.
How to measure convergence?
Relative difference in log likelihood from one iteration to the next; once this falls below a predefined level, we can consider the estimate to have converged.
WTF Is Baum Welch?
Application of EM to HMM (unsupervised)
PROS and CONS of EM?
PROS
- Guaranteed positive hill climbing
- Fast to converge
- Results in probabilistic cluster assignment
- Relatively simple but powerful
CONS
- Can get stuck in a local maximum
- Relies on arbitrary k
- Tends to overfit
What is the output of hard/soft k-means and EM sensitive to?
- The initial random seed centroids
- Initial class assignments
What are the two types of cluster evaluation measures?
Unsupervised:
- Cohesion and Separation
Supervised:
- Purity and entropy
Discuss Unsupervised Cluster Evaluation
A good cluster analysis should have one or both:
- High cluster cohesion
- High cluster separation
What will the implementation detail for unsupervised cluster evaluation depend on?
Whether clustering method is:
- prototype or graph-based
- deterministic or probabilistic
- etc.
WTF is cluster compactness (squared errors)?
Sum of Squared Errors (SSE) is a method to evaluate the quality of clusters (especially k-means)
WTF is silhouette coefficient?
Combines cohesion and separation to compute a figure of merit for the overall clustering output, individual clusters and individual points.
Discuss Supervised cluster evaluation
Measures the degree to which predicted cluster labels MATCH the actual class labels (e.g. purity and entropy)
What assumptions is co-training based on?
That there is a feature split which leads to independent classifiers.
If given this, can lead to significant improvement in classifier accuracy.
What are the main sampling strategies in active learning?
1) membership query synthesis
- Synthesises queries for labelling
2) Stream-based selective sampling
- Determines, for each instance in a stream, whether to have it labelled or not
3) Pool based sampling
- Selects from a fixed set of instances what it wants to have labelled
Outline a selection of query strategies in active learning
1) query those that the classifier is least confident on
2) perform margin sampling
3) query-by-committee
What is classifier combination?
Constructs a set of base classifiers from a given set of training data and aggregates their outputs into a single meta-classifier
What are the motivations behind classifier combination?
1) the sum of lots of weak classifiers can be at least as good as one strong classifier
2) the sum of a selection of strong classifiers is usually at least as good as the best of the base classifiers
What is bagging?
Simple classifier combination method based on sampling and voting
- Can parallelise computation of individual base classifiers
- Highly effective over noisy data
- Performance is generally significantly better and never substantially worse
- Decreases variance
Thinking behind Bagging? How does it work?
- Construct a novel dataset through random sampling with replacement
- Generate k such bootstrap samples of the training set, and build a classifier for each
- Combine classifiers via voting
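A Python sketch using decision trees as the base classifier (an assumption purely for illustration):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Bagging sketch: each base classifier is trained on a bootstrap sample (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample of the training set
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]        # simple (unweighted) voting
```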
What is boosting? How does it work?
Tunes classifiers to focus on the HARD TO CLASSIFY instances.
- Mathematically complicated but computationally cheap per iteration
- Overall more computationally expensive than bagging (iterations are sequential, so cannot be parallelised)
- Tendency to overfit
How it works?
Iteratively change the distribution and weights of training instances to reflect performance of classifier in previous iteration
1) Each instance has probability of 1/N of being included in sample
2) Over T iterations, train classifier and update the weight of each instance according to whether it is correctly classified
3) Combine base classifiers via WEIGHTED voting
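A sketch of one specific boosting algorithm (AdaBoost-style), assuming labels encoded as -1/+1 and decision stumps as base classifiers:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """AdaBoost-style boosting sketch for labels y in {-1, +1}: reweight instances, combine by weighted vote."""
    n = len(X)
    w = np.full(n, 1.0 / n)                     # step 1: every instance starts with weight 1/N
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5 or err == 0:              # stop if the weak learner is no better than chance (or perfect)
            if err == 0:
                models.append(stump); alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)          # step 2: up-weight misclassified, down-weight correct
        w /= w.sum()
        models.append(stump); alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, x):
    score = sum(a * m.predict(x.reshape(1, -1))[0] for m, a in zip(models, alphas))
    return 1 if score >= 0 else -1              # step 3: weighted vote of the base classifiers
```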
WTF is random forest?
Bagging of Decision Trees
- An extension of DT
Is bagging prone to overfitting?
No
What does Bagging/RF minimise?
Variance
What does Boosting minimise?
Bias
What is stacking?
Smooths errors over a range of algorithms with different biases
Method 1) Simple voting - assumes the classifiers have equal performance
Method 2) Train a classifier over the outputs of the base classifiers (meta classification)
What is a meta classifier?
A classifier trained over the outputs of the base classifiers, aggregating them into a single prediction
What is error analysis?
Analysing the errors the system makes to figure out whether the issue is the quantity of data or something else
- Identifying different classes of errors that the system makes
- Hypothesising what caused the errors
- Feed hypothesis back into feature/model engineering
How to carry out error analysis?
1) Confusion matrix
2) Random subsample of misclassified instances
3) Come up with hypothesis for error
4) Test hypothesis against data
5) Where possible, use the model to guide the error analysis
What are a model’s HYPERPARAMETERS and PARAMETERS?
Hyperparameter:
- Parameters which define/bias/constrain the learning process
Parameters:
- What is learnt when a given learner with a given set of hyperparameters is applied to a particular training set and is then used to classify test instances
How can a model trained with a given set of hyperparameters be interpreted?
Interpreted relative to the parameters associated with a given test instance
Hyperparams for Nearest Neighbour?
- K (neighbourhood size)
- Distance / Similarity metric
- Feature weighting / selection
Parameters for Nearest Neighbour?
NONE
- because it’s a lazy learner
- doesn’t abstract away from the training instances in any way
How can Nearest Neighbour model be interpreted?
Relative to the training instances that give rise to a given classification and their geometric distribution
Hyperparameters for Nearest Prototype?
- Distance / similarity metric
- Feature weighting / selection
Parameters for Nearest Prototype?
- Prototype for each class
- Size = O(|C| |F|)
C = set of classes
F = set of features
How can Nearest Prototype be interpreted?
relative to the geometric distribution of the prototypes and distance to each for a given test instance
Hyperparameters for Naïve Bayes?
- Choice of smoothing method
- Optionally, the choice of distribution used to model the features - binomial or multinomial
Parameters for Naïve Bayes
- Class priors and conditional probability for each feature-value-class combination
- Size = O(|C| + |C||FV|)
C = set of classes
FV = set of feature-value pairs
How can Naïve Bayes be interpreted?
Usually based on the most positively weighted features associated with a given instance
Hyperparameters for Decision Trees?
- Choice of function used for attribute selection - Information Gain or Gain Ratio
- Convergence Criterion
Parameters for Decision Trees?
- The decision tree itself
- Worst case size O(V^|Tr|)
- Average case size O(|FV|^2|Tr|)
V = average branching factor
Tr = set of training instances
FV = set of feature-value pairs
How can decision trees be interpreted?
Based directly on the path through the decision tree
Hyperparameters for SVMs?
- Penalty term for soft margin SVM
- Feature value scaling
- Choice of kernel (and any hyperparameters associated with it)
Parameters for SVM?
- Vector of feature weights + bias term
- Size = O(|C|^2|F|) - assuming one-vs-rest
Hyperparameter for Logistic Regression?
Choice of weight of regulariser
Parameters of logistic regression?
- Weight associated with each feature function and the bias term
- Size = O(|C||FV|)
C = set of classes
FV = set of feature-value pairs
How can logistic regression be interpreted?
Based on the features with the largest absolute weights associated with a given instance
- high positive = correlation
- high negative = anti-correlation
What is dimensionality reduction?
Reducing the number of features or attributes, e.g. so the data can be visualised/graphed more easily
What is the caveat to dimensionality reduction?
Going to be lossy
- Not possible to reproduce the original data FROM the reduced version
WTF is PCA? How does PCA work?
Principal Component Analysis - a method of dimensionality reduction
Transforming the data to a new set of variables, the principal components, which are uncorrelated and ordered so that the first few retain most of the variation present in the original variables
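A scikit-learn sketch (the random data is purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))   # 100 instances, 5 features
pca = PCA(n_components=2)                             # keep the first 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2) - the dimensionality-reduced data
print(pca.explained_variance_ratio_)    # fraction of the variance each component retains
```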
Wtf is nearest prototype?
Also known as the nearest centroid classifier: assigns each instance the label of the class whose training-sample mean (centroid) is closest to it.