Week 1 - 4 Flashcards

1
Q

Briefly explain how Nearest Prototype classifier works

A
  • Calculate the centroid of each class (by averaging the numeric values along each axis)
  • Classify each test instance according to the class of the centroid it is nearest to (by Euclidean distance)
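
A minimal sketch of this in Python (numpy assumed; names and data are illustrative):

```python
import numpy as np

def train_prototypes(X, y):
    """Compute one centroid ("prototype") per class by averaging feature values."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class whose centroid is nearest by Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array(["a", "a", "b", "b"])
protos = train_prototypes(X, y)
print(predict(protos, np.array([4.5, 4.9])))   # -> "b"
```
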
2
Q

What are the theoretical properties of Nearest Prototype?

A
  • Multiclass Classification model
  • Parametric: Prototypes capture everything that is needed for prediction
  • Incremental: Easy to add extra data to the classification on the fly
  • Handles both nominal & continuous data
  • Decision boundary between two classes is linear
3
Q

What are some examples of Parametric models?

A
  • Nearest Prototype
  • Naïve Bayes
  • Linear or Logistic Regression
4
Q

What are some examples of Non-Parametric models?

A
  • K-NN
  • Decision Trees

5
Q

Is SVM parametric or non-parametric?

A

Depends on the kernel used

6
Q

Quirks of parametric models?

A
  • Simpler, since they have a clear functional form
  • Once you have learnt the model coefficients, the training data is no longer needed
  • TEND TOWARDS UNDERFITTING
7
Q

Quirks of non-parametric models?

A
  • Require more data, since no assumptions are made about the form of the function
  • TEND TOWARDS OVERFITTING
8
Q

Briefly explain Bayesian Methods

A
  • Learning & Classification methods based on probability theory
  • Build a generative model that approximates how data is produced
  • Categorisation produces a posterior probability distribution P(C | X) over the possible categories
9
Q

Formula for Bayes Rule?

A

P(C | X) = [ P(X | C) * P(C) ] / P(X)

10
Q

Formula for Naïve Bayes?

A

argmax over cj ∈ C of [ P(cj) * ∏i P(xi | cj) ]

11
Q

What to do if prob for Naïve Bayes is 0?

A

Replace it with a very small number ε (less than 1/n) so the whole product doesn’t become 0

13
Q

How to do Naïve Bayes?

A

P(C) * P(A | C) * P(B | C) … * P(X | C)
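
A hedged Python sketch of training and prediction for nominal features; the toy data and the epsilon value are illustrative, and epsilon handles the zero-probability case from the previous card:

```python
from collections import Counter, defaultdict
import math

def train_nb(instances, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)        # (feature index, class) -> value counts
    for inst, c in zip(instances, labels):
        for i, v in enumerate(inst):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def predict_nb(class_counts, value_counts, inst, epsilon=1e-6):
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, cc in class_counts.items():
        # log P(c) + sum_i log P(x_i | c); zero probabilities become epsilon
        score = math.log(cc / n)
        for i, v in enumerate(inst):
            p = value_counts[(i, c)][v] / cc
            score += math.log(p if p > 0 else epsilon)
        if score > best_score:
            best, best_score = c, score
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "cool"]]
y = ["no", "no", "yes", "yes"]
cc, vc = train_nb(X, y)
print(predict_nb(cc, vc, ["rainy", "hot"]))    # -> "yes"
```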

14
Q

How to handle missing values in Naïve Bayes?

A

If a value is missing in a TEST instance, it is possible to simply ignore that feature for the purpose of classification.

If a value is missing in a TRAINING instance, it is possible to simply have it not contribute to the attribute-value counts / probability estimates for that feature.

15
Q

How to do Naïve Bayes for continuous features?

A

1) Discretise the feature into nominal features

2) Use probability density estimation to estimate
P(Xi | Cj)
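
For option 2, one common concrete choice (an assumption here, not stated on the card) is a per-class Gaussian density, sketched below:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x; use this value as P(xi | cj)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Illustrative per-class estimates for one continuous feature (e.g. temperature, class "yes")
temps_yes = [21.0, 23.5, 22.0, 24.0]
mean_yes = sum(temps_yes) / len(temps_yes)
std_yes = (sum((t - mean_yes) ** 2 for t in temps_yes) / len(temps_yes)) ** 0.5

print(gaussian_pdf(22.5, mean_yes, std_yes))
```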

16
Q

Theoretical properties of Naïve Bayes models?

A
  • Multiclass classification method
  • Parametric: only have to store attribute-value counts / probabilities for each class, not the actual instances
  • Incremental: Implications for weakly supervised learning
  • Handles both nominal & continuous features
  • Simple -> Fast
17
Q

WTF is Noise?

A
  • Noise refers to corruptions in the values of attributes
    • Erroneous values
    • Missing values
    • Incomplete values
    • Simply uninformative or unpredictive values

To fix, do some combination of feature selection and feature weighting

18
Q

WTF is Feature Weighting?

A
  • Weighting the distance calculation for each feature
  • Feature selection: by thresholding based on weight, we can select only the features we want to keep

19
Q

What are some common methods for feature weighting?

A
  • Pointwise Mutual Information & Mutual Information
  • Chi Squared
  • Information Gain
21
Q

WTF is Feature Filtering?

A
  • Intuition is that it is possible to evaluate the “goodness” of each feature separately from the other features
  • Consider each feature separately: linear time in the number of attributes
  • Possible (but difficult) to control inter-dependence of features
  • Typically the most popular strategy
22
Q

What makes a feature set good?

A

One that leads to better models (better classification performance)

23
Q

What makes a single feature good?

A

Well correlated with the class

  • Knowing the attribute value a lets us predict the class c with more confidence

24
Q

How to calc Pointwise Mutual Info?

A

PMI(A,C) = log2 ( P(A,C) / (P(A) * P(C)) )

Attributes with the greatest PMI = best attributes

Refer to Toy Example in Lecture 3B Page ~39
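
A small Python illustration with made-up counts:

```python
import math

n = 100      # total training instances (made-up numbers)
n_a = 40     # instances where attribute A is true
n_c = 30     # instances in class C
n_ac = 20    # instances with both A and C

p_a, p_c, p_ac = n_a / n, n_c / n, n_ac / n
pmi = math.log2(p_ac / (p_a * p_c))
print(pmi)   # ~0.74 > 0: A and C co-occur more than chance, so A is informative for C
```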

25
Q

WTF is a contingency table?

A

A table of counts recording how often each attribute value co-occurs with each class (and how often each occurs without the other); feature scores such as PMI and MI are computed from it. Refer to lecture 3b

26
Q

How to calc Mutual Info?

A

MI(A, C) = Σ over all values a of A and c of C of P(a, c) * log2( P(a, c) / (P(a) * P(c)) ) — i.e. the expected PMI over the cells of the contingency table. Refer to lecture 3b page 53

27
Q

How to choose a good feature set (feature selection)?

A

“Wrapper” methods

  • Choose subset of attributes that give best performance on the development data
  • e.g. train on all combinations of features in a dataset
  • on {outlook}
  • on {outlook, temperature}
  • on {temperature}
  • Best performance on development data -> best feature set
28
Q

Pros and Cons of “Wrapper” methods for feature selection?

A

PROS
- Feature set with optimal performance on development data

CONS
- Takes a long time

29
Q

What are more practical “Wrapper” methods?

A

Greedy approach

  • Train and evaluate a model on each single attribute; keep the best one
  • Then try adding each of the remaining attributes to the current set, keeping the addition that performs best
  • Iterate until performance stops increasing
  • Takes roughly m^2 / 2 train-and-evaluate cycles in the worst case, for m attributes

“Ablation” approach

  • Start with all attributes
  • Remove one attribute at a time, train and evaluate the model
  • Stop when performance starts to degrade by more than a threshold
  • O(m^2)
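
A rough sketch of the greedy (forward) wrapper approach; `train_and_eval` is a hypothetical placeholder that trains on the given attribute subset and returns accuracy on development data:

```python
def greedy_forward_selection(attributes, train_and_eval):
    """Greedy wrapper: repeatedly add the attribute that helps most."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Score the current set extended with each remaining attribute
        scored = [(train_and_eval(selected + [a]), a) for a in remaining]
        score, attr = max(scored)
        if score <= best_score:        # stop when performance stops increasing
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score
```
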
30
Q

What are “embedded methods”

A
  • Models that perform feature selection as part of the algorithm
31
Q

Examples of “embedded methods”?

A
  • Linear classifiers
  • To some degree: SVM
  • To some degree: Decision Trees
32
Q

What is the Zero-R algorithm?

A

Baseline classifier (also known as “majority class classifier”)

  • Throws out all of the attributes except for the class labels
  • Predict each test instance according to whichever label is most common in the training data
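
Zero-R is tiny in code (a quick sketch with illustrative data):

```python
from collections import Counter

def zero_r(train_labels, test_instances):
    """Ignore the attributes; predict the most common training label for everything."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority for _ in test_instances]

print(zero_r(["yes", "yes", "no"], [["sunny"], ["rainy"]]))   # ['yes', 'yes']
```
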
33
Q

What do SVM try to do?

A

SVMs attempt to partition the training data based on the best line (hyperplane) that divides the positive instances of the class that we’re looking for from the negative instances.

34
Q

Explain what “linear separability” is

A

When the points in our k-dimensional vector space corresponding to positive instances can be separated from the negative instances by a line.

(More accurately a (k - 1)-dimensional hyperplane)

This means that all of the positive instances are on one side, and all the negative instances are on the other side.

35
Q

What is the definition of “best” line?

A

When choosing our best hyperplane, we choose a pair of parallel lines, one for the positive instances and one for the negative instances. This creates a sort of “street”, with the two lines forming the “gutters”. The gutters we choose are the ones that make the street the widest.

The perpendicular distance from the middle of the street (the separating hyperplane) to a gutter is called the “margin”.

36
Q

Explain what the “support vectors” are.

A

The support vectors are the training instances that lie on the gutters; they define the margin. We can classify a test instance by calculating which set of support vectors (gutter) the point defined by the test instance is closer to.

Alternatively, just use the single line midway between the gutters (the separating hyperplane) to figure out which side the point belongs to.

37
Q

What are “soft margins”? (SVM)

A
  • Relaxing linear separability

A small number of points are allowed to be on the “wrong” side of the line if it means getting a much better set of support vectors (i.e. a larger margin).

38
Q

What is so desirable about “kernel functions”? (SVM)

A

They allow us to transform the data, which lets us tackle the problem from a different perspective.

Sometimes the data isn’t linearly separable, but after applying some kind of function to transform it, the data becomes linearly separable.

This allows us to apply the algorithm as usual.

39
Q

Why are SVMs “binary classifiers”?

A

Trying to find a line that partitions the data into true and false, suitably interpreted for our data.

For two classes we only need a single line.

40
Q

How to extend SVM to “multiclass classifiers”

A

Need at least two lines to define three classes. But this is an issue when the lines aren’t parallel.

In general, we want to find three “half-lines”, radiating out from some central point between the three different classes.
HOWEVER, this is numerically much more difficult to solve.

Alternative? Use clustering, might as well use Nearest Prototype.

If we really wanted to use SVM, we would have to build multiple models, either by comparing each class against all other classes (one-vs-rest) or by building a model for each pair of classes (one-vs-one)

42
Q

What is dimensionality reduction, and why is it important?

A

TL;DR

  • Reducing the number of dimensions in a vector space.
  • Speeds up learning
  • Can remove redundant, misleading or noisy information

If we interpret our attribute set as implicitly defining a vector space (“feature space”), dimensionality reduction is about reducing the number of dimensions in this space.

Most machine learning methods have a time complexity that is linear in the number of dimensions of the feature space, so reducing this number means that we get a speed increase.

In some data sets, some of the information is redundant, misleading or noise. Reducing dimensionality can remove some of this useless information.

43
Q

Why is “feature selection” a kind of dimensionality reduction?

A

Fewer attributes = fewer dimensions in the feature space - DUH!

44
Q

What is the logic behind Principal Component Analysis (PCA)?

A
  • Attributes with higher variance are more likely to be useful than attributes with low variance
  • So we create dimensions according to (the linear combination of) attributes with the greatest variance, after accounting for (correlation with) the dimensions that we have already created.
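
A minimal PCA sketch via the covariance matrix (numpy assumed; data is illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k directions of greatest variance."""
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                              # the new, lower-dimensional features

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, 1))                                 # each instance reduced to 1 dimension
```
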
45
Q

Will there be an improvement in error rate on adding a 3rd classifier to an ensemble when the errors were correlated?

A

NO, need to be uncorrelated for improvement.

If errors were ‘mostly’ correlated, would only see a small improvement.

46
Q

What will happen for classifier ensembles when the errors are highly correlated?

A

All systems will be making the same predictions, so the error rate of the ensemble will be roughly the same as that of the individual classifiers, and voting is unlikely to improve the ensemble.

47
Q

Why can’t you use classifier combination like the one in the tute for multi-class problems?

A
Error rate alone doesn't tell us what sorts of errors each classifier is making in a multi-class problem (i.e. which wrong class was predicted), so we can't reconstruct a confusion matrix from it.

It is also difficult to make estimates when the ensemble is only partly correct, e.g. a 4-class problem and an ensemble of 5 classifiers, where 2 classifiers predict the right class and 3 of the classifiers predict the wrong class:

  • if the three wrong classifiers all chose the same (wrong) label, the ensemble will be wrong.
  • If the three wrong classifiers all chose different (wrong) labels, the ensemble will actually be right.
48
Q

What is macro-averaged precision and recall?

A

Compute precision or recall for each class separately, then take the average of the per-class values.

Macro-averaging can be used when you want to know how the system performs overall across the classes: every class counts equally, so it gives a general picture rather than a decision about any specific class.

49
Q

What is micro-averaged precision and recall?

A

Sum up all the TPs, FPs and FNs across the classes, then work out precision and recall as normal.

Micro-averaging can be a useful measure when the classes vary in size, since larger classes contribute proportionally more to the result.
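
A short sketch contrasting the two averages, using made-up per-class counts:

```python
per_class = {          # class -> (TP, FP); made-up counts
    "a": (50, 10),
    "b": (5, 5),
}

# Macro: average of per-class precisions (each class weighted equally)
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute precision once
tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro_p = tp_sum / (tp_sum + fp_sum)

print(macro_p, micro_p)   # ~0.67 vs ~0.79 -- the micro average is pulled towards the big class
```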

50
Q

Precision formula?

A

TP / (TP + FP)

51
Q

Recall formula?

A

TP / (TP + FN)

52
Q

Why is a high-bias, low-variance classifier undesirable?

A
  • Consistently wrong
53
Q

Why is a low-bias, high-variance classifier undesirable?

A

Low bias = making mostly correct predictions
High variance = the predictions change as we change the training data, so not all of them can be correct (otherwise it would be low variance).

INCONSISTENT

Makes it difficult to be certain about the performance of the classifier; we might estimate a low error rate on one data set but a high error rate on another.

54
Q

What is the difference between evaluating using a holdout strategy and evaluating using a cross-validation strategy?

A

Holdout
- Partition the data into a training set and test set; usually 80 - 20 split. We build the model on the 80 split and evaluate on the 20 split.

Cross-validation
- Do the same as holdout, but do it a number of times, where each iteration uses one partition of the data as a test set and the rest as a training set (partition is different each time).
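
A small index-level sketch of how the test partition rotates in k-fold cross-validation (purely illustrative):

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]    # round-robin partition into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(test)      # every instance appears in exactly one test partition
```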

55
Q

What are some reasons we would prefer holdout over cross-validation and vice versa?

A

Holdout

  • Subject to some random variation, depending on which instances are assigned to the training data, and which are assigned to the test data
  • This could mean that our estimate of the performance of the model is way off

Cross-validation

  • Solves the problems in Holdout by averaging a bunch of values, so that one weird partition of the data won’t throw our estimate of performance completely off
  • Takes longer to cross-validate because we need to train a model for every test partition
56
Q

What is One-R algorithm?

A

Slightly better baseline method than Zero-R

  • Creates one rule for each attribute in the training data
  • Then selects the rule with the smallest error rate as its “one rule”

How it works

  • Create a decision stump for each attribute, with one branch for each attribute value
  • Populate each leaf with the majority class at that leaf (i.e. predict that class for every instance reaching the leaf - if the majority is YES, predict YES for all of them)
  • Select the decision stump which leads to the lowest error rate over the training data

Outlook (weather data): 9 Yes, 5 No overall

  • Sunny: 2 Yes, 3 No (predict NO - 2 errors)
  • Overcast: 4 Yes, 0 No (predict YES - 0 errors)
  • Rainy: 3 Yes, 2 No (predict YES - 2 errors)

Total errors = 4 / 14
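
A rough Python sketch of One-R stump selection (data layout and names are illustrative):

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Build one stump per attribute; return the one with the fewest training errors."""
    best = None                                    # (errors, attribute index, rule)
    for i in range(len(instances[0])):
        by_value = defaultdict(Counter)
        for inst, c in zip(instances, labels):
            by_value[inst[i]][c] += 1              # class counts under each attribute value
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["overcast", "hot"], ["rainy", "mild"]]
y = ["no", "no", "yes", "yes"]
print(one_r(X, y))   # -> (0, 0, {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'})
```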

57
Q

How to compare Accuracy against F-Score?

A

YOU CAN’T DIRECTLY COMPARE THESE

58
Q

Little labelled training data is available, which classifier?

A

Try Naive Bayes

59
Q

Lots of labelled training data is available, which classifier?

A

K-NN (will overfit with too little)

60
Q

Reasonable amounts of labelled training data is available, which classifier?

A

SVM

61
Q

Numerical or nominal (labelled) data? Which classifier?

A

Nominal: Decision Trees

62
Q

Non-Linearly separable data, which classifier?

A

SVM with kernel projection