Week 1 - 4 Flashcards

1
Q

Briefly explain how Nearest Prototype classifier works

A
  • Calculate the centroid of each class (by averaging the numeric values along each axis)
  • Classify each test instance according to the class of the centroid it is nearest to (by Euclidean distance)
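
A minimal sketch of this in Python (numpy assumed; names and data are illustrative):

```python
import numpy as np

def train_prototypes(X, y):
    """Compute one centroid ("prototype") per class by averaging feature values."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class whose centroid is nearest by Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array(["a", "a", "b", "b"])
protos = train_prototypes(X, y)
print(predict(protos, np.array([4.5, 4.9])))   # -> "b"
```
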
2
Q

What are the theoretical properties of Nearest Prototype?

A
  • Multiclass Classification model
  • Parametric: Prototypes capture everything that is needed for prediction
  • Incremental: Easy to add extra data to the classification on the fly
  • Handles both nominal & continuous data
  • Decision boundary between two classes is linear
3
Q

What are some examples of Parametric models?

A
  • Nearest Prototype
  • Naïve Bayes
  • Linear or Logistic Regression
4
Q

What are some examples of Non-Parametric models?

A
  • K-NN
  • Decision Trees

5
Q

Is SVM parametric or non-parametric?

A

Depends on the kernel used

6
Q

Quirks of parametric models?

A
  • Simpler, since they have a clear functional form
  • Once you have learnt the model coefficients, the training data is no longer needed
  • TEND TOWARDS UNDERFITTING
7
Q

Quirks of non-parametric models?

A
  • Require more data, since no assumptions are made about the form of the function
  • TEND TOWARDS OVERFITTING
8
Q

Briefly explain Bayesian Methods

A
  • Learning & Classification methods based on probability theory
  • Build a generative model that approximates how data is produced
  • Categorisation produces a posterior probability distribution P(C | X) over the possible categories
9
Q

Formula for Bayes Rule?

A

P(C | X) = [ P(X | C) * P(C) ] / P(X)

10
Q

Formula for Naïve Bayes?

A

argmax over cj ∈ C of [ P(cj) * ∏i P(xi | cj) ]

11
Q

What to do if prob for Naïve Bayes is 0?

A

Replace it with a very small number ε (less than 1/n) so the whole product doesn’t become 0

13
Q

How to do Naïve Bayes?

A

P(C) * P(A | C) * P(B | C) … * P(X | C)
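
A hedged Python sketch of training and prediction for nominal features; the toy data and the epsilon value are illustrative, and epsilon handles the zero-probability case from the previous card:

```python
from collections import Counter, defaultdict
import math

def train_nb(instances, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)        # (feature index, class) -> value counts
    for inst, c in zip(instances, labels):
        for i, v in enumerate(inst):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def predict_nb(class_counts, value_counts, inst, epsilon=1e-6):
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, cc in class_counts.items():
        # log P(c) + sum_i log P(x_i | c); zero probabilities become epsilon
        score = math.log(cc / n)
        for i, v in enumerate(inst):
            p = value_counts[(i, c)][v] / cc
            score += math.log(p if p > 0 else epsilon)
        if score > best_score:
            best, best_score = c, score
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "cool"]]
y = ["no", "no", "yes", "yes"]
cc, vc = train_nb(X, y)
print(predict_nb(cc, vc, ["rainy", "hot"]))    # -> "yes"
```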

14
Q

How to handle missing values in Naïve Bayes?

A

If a value is missing in a TEST instance, it is possible to simply ignore that feature for the purpose of classification.

If a value is missing in a TRAINING instance, it is possible to simply have it not contribute to the attribute-value counts / probability estimates for that feature.

15
Q

How to do Naïve Bayes for continuous features?

A

1) Discretise the feature into nominal features

2) Use probability density estimation to estimate
P(Xi | Cj)
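
For option 2, one common concrete choice (an assumption here, not stated on the card) is a per-class Gaussian density, sketched below:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x; use this value as P(xi | cj)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Illustrative per-class estimates for one continuous feature (e.g. temperature, class "yes")
temps_yes = [21.0, 23.5, 22.0, 24.0]
mean_yes = sum(temps_yes) / len(temps_yes)
std_yes = (sum((t - mean_yes) ** 2 for t in temps_yes) / len(temps_yes)) ** 0.5

print(gaussian_pdf(22.5, mean_yes, std_yes))
```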

16
Q

Theoretical properties of Naïve Bayes models?

A
  • Multiclass classification method
  • Parametric: only have to store attribute-value counts / probabilities for each class, not the actual instances
  • Incremental: Implications for weakly supervised learning
  • Handles both nominal & continuous features
  • Simple -> Fast
17
Q

WTF is Noise?

A
  • Noise refers to corruptions in the values of attributes
    • Erroneous values
    • Missing values
    • Incomplete values
    • Simply uninformative or unpredictive values

To fix, do some combination of feature selection and feature weighting

18
Q

WTF is Feature Weighting?

A
  • Weighting the distance calculation for each feature
  • Feature selection: by thresholding based on weight, we can select only the features we want to keep

19
Q

What are some common methods for feature weighting?

A
  • Pointwise Mutual Information & Mutual Information
  • Chi Squared
  • Information Gain
21
Q

WTF is Feature Filtering?

A
  • Intuition is that it is possible to evaluate the “goodness” of each feature separately from the other features
  • Consider each feature separately: linear time in the number of attributes
  • Possible (but difficult) to control inter-dependence of features
  • Typically the most popular strategy
22
Q

What makes a feature set good?

A

One that leads to better models (better classification performance)

23
Q

What makes a single feature good?

A

Well correlated with the class

  • Knowing the attribute value a lets us predict the class c with more confidence

24
Q

How to calc Pointwise Mutual Info?

A

PMI(A,C) = log2 ( P(A,C) / (P(A) * P(C)) )

Attributes with the greatest PMI = best attributes

Refer to Toy Example in Lecture 3B Page ~39
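
A small Python illustration with made-up counts:

```python
import math

n = 100      # total training instances (made-up numbers)
n_a = 40     # instances where attribute A is true
n_c = 30     # instances in class C
n_ac = 20    # instances with both A and C

p_a, p_c, p_ac = n_a / n, n_c / n, n_ac / n
pmi = math.log2(p_ac / (p_a * p_c))
print(pmi)   # ~0.74 > 0: A and C co-occur more than chance, so A is informative for C
```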

25
Q

WTF is a contingency table?

A

A table of counts recording how often each attribute value co-occurs with each class (and how often each occurs without the other); feature scores such as PMI and MI are computed from it. Refer to lecture 3b

26
Q

How to calc Mutual Info?

A

MI(A, C) = Σ over all values a of A and c of C of P(a, c) * log2( P(a, c) / (P(a) * P(c)) ) — i.e. the expected PMI over the cells of the contingency table. Refer to lecture 3b page 53

27
Q

How to choose a good feature set (feature selection)?

A

“Wrapper” methods

  • Choose subset of attributes that give best performance on the development data
  • e.g. train on all combinations of features in a dataset
  • on {outlook}
  • on {outlook, temperature}
  • on {temperature}
  • Best performance on development data -> best feature set
28
Q

Pros and Cons of “Wrapper” methods for feature selection?

A

PROS
- Feature set with optimal performance on development data

CONS
- Takes a long time

29
Q

What are more practical “Wrapper” methods?

A

Greedy approach

  • Train and evaluate a model on each single attribute; keep the best one
  • Then try adding each of the remaining attributes to the current set, keeping the addition that performs best
  • Iterate until performance stops increasing
  • Takes roughly m^2 / 2 train-and-evaluate cycles in the worst case, for m attributes

“Ablation” approach

  • Start with all attributes
  • Remove one attribute at a time, train and evaluate the model
  • Stop when performance starts to degrade by more than a threshold
  • O(m^2)
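
A rough sketch of the greedy (forward) wrapper approach; `train_and_eval` is a hypothetical placeholder that trains on the given attribute subset and returns accuracy on development data:

```python
def greedy_forward_selection(attributes, train_and_eval):
    """Greedy wrapper: repeatedly add the attribute that helps most."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Score the current set extended with each remaining attribute
        scored = [(train_and_eval(selected + [a]), a) for a in remaining]
        score, attr = max(scored)
        if score <= best_score:        # stop when performance stops increasing
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score
```
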
30
Q

What are “embedded methods”

A
  • Models that perform feature selection as part of the algorithm
31
Q

Examples of “embedded methods”?

A
  • Linear classifiers
  • To some degree: SVM
  • To some degree: Decision Trees
32
Q

What is the Zero-R algorithm?

A

Baseline classifier (also known as “majority class classifier”)

  • Throws out all of the attributes except for the class labels
  • Predict each test instance according to whichever label is most common in the training data
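
Zero-R is tiny in code (a quick sketch with illustrative data):

```python
from collections import Counter

def zero_r(train_labels, test_instances):
    """Ignore the attributes; predict the most common training label for everything."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority for _ in test_instances]

print(zero_r(["yes", "yes", "no"], [["sunny"], ["rainy"]]))   # ['yes', 'yes']
```
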
33
Q

What do SVM try to do?

A

SVMs attempt to partition the training data based on the best line (hyperplane) that divides the positive instances of the class that we’re looking for from the negative instances.

34
Q

Explain what “linear separability” is

A

When the points in our k-dimensional vector space corresponding to positive instances can be separated from the negative instances by a line.

(More accurately a (k - 1)-dimensional hyperplane)

This means that all of the positive instances are on one side, and all the negative instances are on the other side.

35
Q

What is the definition of “best” line?

A

When choosing our best hyperplane, we choose a pair of parallel lines, one for the positive instances and one for the negative instances. This creates a sort of “street”, with the two lines forming the “gutters”. The gutters we choose are the ones that make the street the widest.

The perpendicular distance from the middle of the street (the separating hyperplane) to a gutter is called the “margin”.

36
Q

Explain what the “support vectors” are.

A

The support vectors are the training instances that lie on the gutters; they define the margin. We can classify a test instance by calculating which set of support vectors (gutter) the point defined by the test instance is closer to.

Alternatively, just use the single line midway between the gutters (the separating hyperplane) to figure out which side the point belongs to.

37
Q

What are “soft margins”? (SVM)

A
  • Relaxing linear separability

A small number of points are allowed to be on the “wrong” side of the line if it means getting a much better set of support vectors (i.e. a larger margin).

38
Q

What is so desirable about “kernel functions”? (SVM)

A

They allow us to transform the data, which lets us tackle the problem from a different perspective.

Sometimes the data isn’t linearly separable, but after applying some kind of function to transform it, the data becomes linearly separable.

This allows us to apply the algorithm as usual.

39
Q

Why are SVMs “binary classifiers”?

A

Trying to find a line that partitions the data into true and false, suitably interpreted for our data.

For two classes we only need a single line.

40
Q

How to extend SVM to “multiclass classifiers”

A

Need at least two lines to define three classes. But this is an issue when the lines aren’t parallel.

In general, we want to find three “half-lines”, radiating out from some central point between the three different classes.
HOWEVER, this is numerically much more difficult to solve.

Alternative? Use clustering, might as well use Nearest Prototype.

If we really wanted to use SVM, we would have to build multiple models, either by comparing each class against all other classes (one-vs-rest) or by building a model for each pair of classes (one-vs-one)

42
Q

What is dimensionality reduction, and why is it important?

A

TL;DR

  • Reducing the number of dimensions in a vector space.
  • Speeds up learning
  • Can remove redundant, misleading or noisy information

If we interpret our attribute set as implicitly defining a vector space (“feature space”), dimensionality reduction is about reducing the number of dimensions in this space.

Most machine learning methods have a time complexity that is linear in the number of dimensions of the feature space, so reducing this number means that we get a speed increase.

In some data sets, some of the information is redundant, misleading or noise. Reducing dimensionality can remove some of this useless information.

43
Q

Why is “feature selection” a kind of dimensionality reduction?

A

Fewer attributes = fewer dimensions in the feature space - DUH!

44
Q

What is the logic behind Principal Component Analysis (PCA)?

A
  • Attributes with higher variance are more likely to be useful than attributes with low variance
  • So we create dimensions according to (the linear combination of) attributes with the greatest variance, after accounting for (correlation with) the dimensions that we have already created.
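
A minimal PCA sketch via the covariance matrix (numpy assumed; data is illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k directions of greatest variance."""
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                              # the new, lower-dimensional features

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, 1))                                 # each instance reduced to 1 dimension
```
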
45
Q

Will there be an improvement in error rate on adding a 3rd classifier to an ensemble when the errors were correlated?

A

NO, need to be uncorrelated for improvement.

If errors were ‘mostly’ correlated, would only see a small improvement.

46
Q

What will happen for classifier ensembles when the errors are highly correlated?

A

All systems will be making the same predictions, so the error rate of the ensemble will be roughly the same as that of the individual classifiers, and voting is unlikely to improve the ensemble.

47
Q

Why can’t you use classifier combination like the one in the tute for multi-class problems?

A
Error rate alone doesn't tell us what sorts of errors each classifier is making in a multi-class problem (i.e. which wrong class was predicted), so we can't reconstruct a confusion matrix from it.

It is also difficult to make estimates when the ensemble is only partly correct, e.g. a 4-class problem and an ensemble of 5 classifiers, where 2 classifiers predict the right class and 3 of the classifiers predict the wrong class:

  • if the three wrong classifiers all chose the same (wrong) label, the ensemble will be wrong.
  • If the three wrong classifiers all chose different (wrong) labels, the ensemble will actually be right.
48
Q

What is macro-averaged precision and recall?

A

Compute precision or recall for each class separately, then take the average of the per-class values.

Macro-averaging can be used when you want to know how the system performs overall across the classes: every class counts equally, so it gives a general picture rather than a decision about any specific class.

49
Q

What is micro-averaged precision and recall?

A

Sum up all the TPs, FPs and FNs across the classes, then work out precision and recall as normal.

Micro-averaging can be a useful measure when the classes vary in size, since larger classes contribute proportionally more to the result.
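
A short sketch contrasting the two averages, using made-up per-class counts:

```python
per_class = {          # class -> (TP, FP); made-up counts
    "a": (50, 10),
    "b": (5, 5),
}

# Macro: average of per-class precisions (each class weighted equally)
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute precision once
tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro_p = tp_sum / (tp_sum + fp_sum)

print(macro_p, micro_p)   # ~0.67 vs ~0.79 -- the micro average is pulled towards the big class
```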

50
Q

Precision formula?

A

TP / (TP + FP)

51
Q

Recall formula?

A

TP / (TP + FN)

52
Q

Why is a high-bias, low-variance classifier undesirable?

A
  • Consistently wrong
53
Q

Why is a low-bias, high-variance classifier undesirable?

A

Low bias = making mostly correct predictions
High variance = the predictions change as we change the training data, so not all of them can be correct (otherwise it would be low variance).

INCONSISTENT

Makes it difficult to be certain about the performance of the classifier; we might estimate a low error rate on one data set but a high error rate on another.

54
Q

What is the difference between evaluating using a holdout strategy and evaluating using a cross-validation strategy?

A

Holdout
- Partition the data into a training set and test set; usually 80 - 20 split. We build the model on the 80 split and evaluate on the 20 split.

Cross-validation
- Do the same as holdout, but do it a number of times, where each iteration uses one partition of the data as a test set and the rest as a training set (partition is different each time).
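
A small index-level sketch of how the test partition rotates in k-fold cross-validation (purely illustrative):

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]    # round-robin partition into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(test)      # every instance appears in exactly one test partition
```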

55
Q

What are some reasons we would prefer holdout over cross-validation and vice versa?

A

Holdout

  • Subject to some random variation, depending on which instances are assigned to the training data, and which are assigned to the test data
  • This could mean that our estimate of the performance of the model is way off

Cross-validation

  • Solves the problems in Holdout by averaging a bunch of values, so that one weird partition of the data won’t throw our estimate of performance completely off
  • Takes longer to cross-validate because we need to train a model for every test partition
56
Q

What is One-R algorithm?

A

Slightly better baseline method than Zero-R

  • Creates one rule for each attribute in the training data
  • Then selects the rule with the smallest error rate as its “one rule”

How it works

  • Create a decision stump for each attribute, with one branch for each attribute value
  • Populate each leaf with the majority class at that leaf (i.e. predict that class for every instance reaching the leaf - if the majority is YES, predict YES for all of them)
  • Select the decision stump which leads to the lowest error rate over the training data

Outlook (weather data): 9 Yes, 5 No overall

  • Sunny: 2 Yes, 3 No (predict NO - 2 errors)
  • Overcast: 4 Yes, 0 No (predict YES - 0 errors)
  • Rainy: 3 Yes, 2 No (predict YES - 2 errors)

Total errors = 4 / 14
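
A rough Python sketch of One-R stump selection (data layout and names are illustrative):

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Build one stump per attribute; return the one with the fewest training errors."""
    best = None                                    # (errors, attribute index, rule)
    for i in range(len(instances[0])):
        by_value = defaultdict(Counter)
        for inst, c in zip(instances, labels):
            by_value[inst[i]][c] += 1              # class counts under each attribute value
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["overcast", "hot"], ["rainy", "mild"]]
y = ["no", "no", "yes", "yes"]
print(one_r(X, y))   # -> (0, 0, {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'})
```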

57
Q

How to compare Accuracy against F-Score?

A

YOU CAN’T DIRECTLY COMPARE THESE

58
Q

Little labelled training data is available, which classifier?

A

Try Naive Bayes

59
Q

Lots of labelled training data is available, which classifier?

A

K-NN (will overfit with too little)

60
Q

Reasonable amounts of labelled training data is available, which classifier?

A

SVM

61
Q

Numerical or nominal (labelled) data? Which classifier?

A

Nominal: Decision Trees

62
Q

Non-Linearly separable data, which classifier?

A

SVM with kernel projection