Chapter 5 Flashcards

1
Q

What is data mining?

A

The process of extracting valuable information from datasets.

  • Processing data and identifying patterns and trends in the information
  • Help make predictions on future trends by analysing past data
  • Identify relationships between different pieces of data
2
Q

What is the aim of supervised learning?

A

To build a model that makes predictions based on evidence in the presence of uncertainty.

A supervised learning algorithm takes a set of known input data and known responses/targets and trains a model to generate predictions for the response of the set of new data.

3
Q

When thinking of an entire set of input data for supervised learning as a heterogeneous mix, what do the columns and rows represent?

How can you think of the target data?

A
  • Columns are called predictors / attributes / features and represent a measurement taken on every subject
  • Rows are called observations / examples / instances and each contain a set of measurements for a subject
  • Target data can be thought of as a column vector where each row contains the output of the corresponding observation in the input data
4
Q

What are the two categories of supervised learning algorithms and what do they depend on?

A

Classification and regression

Depends on what the target feature is

5
Q

What is classification used for?

A

Classification is used where the target feature to be predicted is a categorical feature (class), which is divided into categories called levels.

6
Q

How many levels can a class have?

A

Two or more levels
- Yes / No
- A / B / C

The levels may or may not be ordinal

7
Q

What is regression used for?

A

To predict a continuous measurement for an observation (target variables are real numbers)

8
Q

Define the training dataset and test dataset.

A
  • Training dataset: the set of known input data and known targets. Its purpose is to generate the predictive model.
  • Test dataset: the set of new data that is unknown to the model. Its purpose is to assess the accuracy of the model.
9
Q

How are the training and test datasets often obtained?

A

Partitioning the raw (given) dataset

10
Q

What are the most popular partitioning data methods?

A
  • The holdout partitioning method
  • The K-Fold Cross-Validation Partitioning method
11
Q

Describe the holdout partitioning method.

A

In the HP method, the raw dataset is divided into training and test datasets based on some predefined percentage.

12
Q

What is the usual amount of data held out for testing?

A
  • 1/3 for testing
  • 2/3 for training

This proportion can vary depending on the amount of available data

13
Q

Why do you need to ensure samples are randomly divided into the two groups (test and train)?

A

To ensure there are no systematic differences between the training and test data.

14
Q

Why do we need a test set?

A

We can’t say how good our model is if we don’t have known values to compare

15
Q

How do you ensure the holdout method results in a truly accurate estimate of the future performance?

A

By ensuring that the performance on the test dataset is not allowed to influence the model.

  • For example, after building several models on the training data, don’t cherry-pick the one with the highest accuracy on the test data. Cherry-picking means the test performance is not an unbiased measure of the performance on unseen data.
16
Q

How can you overcome this problem?

A

In addition to training and test datasets, create a validation dataset.

17
Q

What is a validation dataset?

A

The validation dataset would be used for iterating and refining the model(s) - using a little bit of the data to fine-tune the model.

The test dataset, by contrast, is kept completely separate until the end.

18
Q

What does the use of a validation dataset mean for the test dataset?

A

The test dataset is only used once as a final step to report an estimated error rate for future predictions.

19
Q

What is a typical split between training, test and validation data?

A

50 / 25 / 25

Varies depending on the size of the dataset.

20
Q

What is a simple method to create holdout samples?

A

Use random number generators to assign records to partitions.
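
A minimal Python sketch of this idea, assuming X and y are NumPy arrays of features and labels (the function name and the seed are illustrative, not from the flashcards):

    import numpy as np

    def holdout_split(X, y, test_fraction=1/3, seed=42):
        """Randomly assign records to a training or a test partition."""
        rng = np.random.default_rng(seed)            # random number generator
        indices = rng.permutation(len(y))            # shuffle the record indices
        n_test = int(round(len(y) * test_fraction))  # e.g. hold out 1/3 for testing
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    X_train, X_test, y_train, y_test = holdout_split(X, y)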

21
Q

What is the problem with holdout sampling this way?

A

Each partition may have a larger or smaller proportion of some classes.

In particular, if a class makes up a very small proportion of the dataset, it can end up omitted from the training dataset entirely. This is a significant problem, because the model will not be able to learn that class.

22
Q

What is the problem if a class is not in the training dataset?

A

The model will not be able to learn it.

23
Q

How do you account for this?

A

Use a technique called stratified random sampling.

This guarantees that the random partitions have nearly the same proportion of each class as the full dataset, even when some classes are small. (We want to make sure a particular class isn’t omitted from the final testing dataset).
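
If scikit-learn is available, stratification can be requested directly; a hedged sketch (X and y are assumed to be the feature matrix and class labels):

    from sklearn.model_selection import train_test_split

    # Hold out 1/3 of the records while preserving the class proportions found in y.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=42)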

24
Q

Stratified random sampling distributes the classes evenly, but what can it not guarantee?

A

Other types of representativeness

  • Eg some samples may have too many/few difficult cases, easy-to-predict cases or outliers.
  • This is especially true for smaller datasets, which may not have a large enough portion of such cases to be divided among the training and test sets.
25
Q

What are the problems of the holdout method?

A
  • Potentially biased samples
  • Substantial portions of data must be reserved to test and validate the model
26
Q

Why are performance estimates using the holdout method likely to be conservative?

A

The test and validation data cannot be used to train the model until its performance has been measured.

27
Q

What technique mitigates the problems of randomly composed training datasets?

A

Repeated holdout method

This is a special case of the holdout method that uses the average result from several random holdout samples to evaluate a model’s performance.

28
Q

Why does the repeated holdout method make it less likely that the model is trained or tested on non-representative data?

A

Multiple holdout samples are used

It still has the issue that the different test sets considered potentially overlap - this may influence the overall predictive accuracy calculated.

29
Q

What kind of estimate of model performance does testing on hold-out data give?

How does this differ from what we want in practice?

A

Single-point estimate

In practice, we want both an unbiased estimate of our model’s future performance on new data (simulated by test data) and an estimate of the distribution of this estimate under typical variations in data and training procedures.

30
Q

What is a good method to obtain both an unbiased estimate of the model’s future performance and an estimate of the distribution of this estimate under typical variations in data and training procedures?

A

K-fold cross-validation

This technique helps make sure that your predictions are not just a one-hit wonder but consistently reliable across new, unseen datasets.

And the related ideas of:
- Empirical resampling
- Bootstrapping

31
Q

What is k-fold cross-validation?

A

K-Fold Cross-Validation is a robust technique used to evaluate the performance of machine learning models. It helps ensure that the model generalises well to unseen data by using different portions of the dataset for training and testing in multiple iterations.

32
Q

What is the idea behind k-fold cross-validation?

A

Repeat the construction of the model on different subsets of the available training data, and then evaluate the model only on data not seen during construction.

This is an attempt to simulate the performance of the model on unseen future data.

33
Q

What is the most common convention of the k number of sections?

A

k = 10

10-fold cross-validation (10-fold CV)

34
Q

Why is k = 10 often used?

A

Empirical evidence suggests that there is little added benefit in using a greater number.

35
Q

How are machine learning models built using 10-fold CV?

A

For each of the 10 folds (each containing 10% of the total data), a machine learning model is built on the remaining 90% of the data. The fold's matching 10% of samples is then used for model evaluation.

After training and evaluation have been repeated 10 times, the average performance across all the folds is reported.
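
A hedged sketch of 10-fold CV with scikit-learn (the decision tree estimator and the arrays X, y are placeholders, not prescribed by the flashcards):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # build on 90%
        preds = model.predict(X[test_idx])                                # evaluate on the matching 10%
        fold_scores.append(accuracy_score(y[test_idx], preds))

    print(np.mean(fold_scores))   # average performance across the 10 folds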

36
Q

Discuss the extreme case of a k-fold CV.

A

The leave-one-out method.

This performs k-fold CV using a fold for each of the data’s samples. If the dataset contains N observations, all of the observations are divided into N equal-sized sections.

A predictive model is built repeatedly, leaving one observation out in each of the N iterations; the omitted observation is then used to evaluate the generated predictive model.

37
Q

What are the advantages and disadvantages of the leave-one-out method?

A
  • Ensures that the greatest amount of data is used to train the model
  • It is so computationally expensive it is rarely used in practice
38
Q

Once you have built a model, what is the first thing to check?

A

If it works on the data it was trained from

39
Q

What is the goal of evaluating a predictive model?

A

To have a better understanding of how its performance will extrapolate to future cases.

40
Q

Why do we typically simulate future conditions and how?

A

It is usually unfeasible to test a still-unproven model in a live environment.

Ask the model to make a prediction based on the cases that resemble what it will be asked to do in the future.

41
Q

How do we learn about a model's strengths and weaknesses?

A

By observing the learner’s responses when asked to make a prediction based on the cases that resemble what it will be asked to do in the future.

We compare predicted values with the actual values from the dataset. We need to know the correct answer for a machine learner’s predictions. We need two vectors of data, one with the correct class values and one with the predicted class values. Both vectors must have the same number of values stored in the same order.

42
Q

What is model evaluation?

A

Quantification of the performance of a model, ie calculation of the summary scores that tell us if the model is effective or not.

43
Q

How do we decide if a summary score is high or low?

A

We look at some “ideal” models
- The Null Model (tells us what low performance looks like)
- The best single-variable model (tells us what a simple model can achieve)

44
Q

What is the null model?

A

The best model of a very simple form of the task you are trying to perform.

eg classification - the model always returns the most popular category
eg scoring model - the model returns the average of all outcomes (it has the least squared deviation from all the outcomes)

If the null model is not outperformed by the generated predictive model, then the generated model is of no value.

45
Q

What are the two most typical null model choices?

A
  • A model that is a single constant (returns the same answer for all situations)
  • A model that is independent (doesn’t record any important interaction between inputs and outputs)
46
Q

Why should we compare a single-variable model?

A

A complicated model can’t be justified if it does not outperform the best single-variable model available from the training data.

47
Q

What are the most common metrics used for the assessment of the classifier quality?

A
  • Accuracy and error rate
  • Precision and recall
  • Sensitivity and specificity
48
Q

What do we need to produce in order to calculate and describe these metrics?

A

A confusion matrix

49
Q

What is a confusion matrix?

A

A table that categorises predictions according to whether they match the actual value.

One of the table’s dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values.

eg 3x3 matrix for a three-class model

50
Q

What is a correct classification and how is it denoted?

A

A correct classification is when the predicted value is the same as the actual value.

Denoted by O - these fall in the cells on the diagonal of the matrix

51
Q

What are incorrect predictions and how are they denoted?

A

Cases where the predicted value differs from the actual value.

These are the off-diagonal matrix cells - denoted by X

52
Q

What are the performance measures for classification models based on from the confusion matrix?

A

The counts of predictions falling on and off the diagonal in these tables

  • True positives
  • False negatives
  • False positives
  • True negatives
53
Q

What do the most common performance measures consider?

A

The model’s ability to discern one class versus all others

54
Q

What is the class of interest known as vs the others?

A
  • Positive class - class of interest
  • Negative class - all others

(Not intended to imply any value judgement)

55
Q

The relationship between positive class and negative class predictions can be presented as a 2x2 confusion matrix that tabulates predictions into one of four categories. What are the categories?

A
  • True positive (TP): correctly classified as class of interest
  • True negative (TN): correctly classified as not the class of interest
  • False positive (FP): incorrectly classified as class of interest
  • False negative (FN): incorrectly classified as not the class of interest
56
Q

What are the common metrics for evaluating the classifier’s performance from the confusion matrix?

What are their formulas?

A
  • Accuracy
  • Error rate
  • Sensitivity
  • Specificity
  • Precision
  • Recall

[See flashcard]
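
The formulas are on the flashcard image; as a cross-check, a small Python sketch of the standard definitions computed from the four confusion-matrix counts:

    def classification_metrics(tp, tn, fp, fn):
        """Standard metrics from the counts of a 2x2 confusion matrix."""
        total = tp + tn + fp + fn
        accuracy = (tp + tn) / total        # fraction of correct predictions
        error_rate = (fp + fn) / total      # = 1 - accuracy
        sensitivity = tp / (tp + fn)        # true positive rate (same as recall)
        specificity = tn / (tn + fp)        # true negative rate
        precision = tp / (tp + fp)          # how often a positive call is correct
        recall = sensitivity
        return {"accuracy": accuracy, "error rate": error_rate,
                "sensitivity": sensitivity, "specificity": specificity,
                "precision": precision, "recall": recall}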

57
Q

What is the most widely known measure of classifier performance?

A

Accuracy - at the very least you want the classifier to be accurate

58
Q

What is accuracy?

What is its formula?

A

For a classifier, accuracy is the number of items categorised correctly divided by the total number of items.

It is simply what fraction of the time the classifier is correct.

59
Q

What is the error rate?

A

Represents the proportion of the incorrectly classified samples.

60
Q

When is accuracy an inappropriate measure?

A

Accuracy is an inappropriate measure for unbalanced classes.

eg when we have a rare event we are trying to predict.
- The null model (predicting that the event never happens) is very accurate, and more accurate than a useful classifier
- Accuracy is not a good measure for events that have an unbalanced distribution or unbalanced costs (different costs of “type 1” and “type 2” errors)

61
Q

What is the sensitivity of a model?

A

True positive rate (TPR)

Measures the proportion of positive examples that were correctly classified.

It is the number of true positives divided by the total number of positives (both correctly and incorrectly classified).

62
Q

What is the specificity of a model?

A

True negative rate (TNR)

Measures the proportion of negative examples that were correctly classified.

The number of true negatives divided by the total number of negatives.

63
Q

What do the pair of performance measures sensitivity and specificity capture?

A

The tradeoff / balance between predictions that are overly conservative and overly aggressive.

  • Think email spam filter
64
Q

What are sensitivity and specificity measures of?

A

They are measures of effect.

What fraction of class members are identified as positive and what fraction of non-class members are identified as negative.

65
Q

What is the range of sensitivity and specificity?

A

0 to 1

  • Closer to 1 is more desirable
  • A value of 1 corresponds to every prediction in the confusion matrix falling on the diagonal

We want to find balance between the two - a task that is often context-specific

66
Q

What other technique can assist with understanding the trade-off between sensitivity and specificity?

A

ROC curve

  • Receiver operating characteristic
67
Q

What are precision and recall?

A

Performance evaluation metrics that come from the field of information retrieval.

They are intended to provide an indication of how interesting and relevant a classifier’s results are, or whether the predictions are diluted by meaningless noise.

68
Q

What is precision?

A

The proportion of positive examples that are truly positive.

Precision describes how often a positive indication turns out to be correct. It is a measure of confirmation (when the classifier indicates positive, how often it is in fact correct).

69
Q

What is recall?

A

A metric that describes how complete the results are. It is a measure of utility (how much the classifier finds out of what there is to find out).

The number of true positives over the total number of positives.

Classifiers with a high recall capture a large portion of the positive examples, meaning that they have wide breadth.

eg high recall if the majority of spam messages are correctly identified
eg search engines with high recall return a large number of documents pertinent to the search query.

70
Q

Discuss the trade-off between precision and recall.

A

It is easy to be precise if you target the easy to classify samples.

It is easy to have high recall by casting a very wide net, meaning that the model is overly aggressive in identifying the positive case.

High precision and high recall at the same time is very challenging.

We want to test a variety of models to find a combination of precision and recall that will meet the needs of the project.

71
Q

What is the typical business need for accuracy?

A

“we need most of our decisions to be correct”

72
Q

What is the typical business need for precision?

A

“Most of what we marked as spam needs to be spam”

73
Q

What is the typical business need for recall?

A

“We want to cut down on the amount of spam a user sees by a factor of 10 (eliminate 90% of spam)”

74
Q

What is the typical business need for sensitivity?

A

“We have to cut a lot of spam, otherwise the user won’t see a benefit”

75
Q

What is the typical business need for specificity?

A

“We must be at least three nines on legitimate email; the user must see at least 99.9% of their non-spam email”

76
Q

What do statistics do for understanding performance vs visualisations?

A
  • Statistics attempt to boil model performance down to a single number
  • Visualisations depict how a learner performs across a wide range of conditions
77
Q

If two classifiers have similar accuracies, are they the same?

A

No, learning algorithms have different biases so they could have drastic differences in how they achieve their accuracy.

78
Q

What do visualisations allow when comparing learners?

A

A method to understand trade-offs, by comparing learners side by side in a single chart

79
Q

What does ROC stand for?

A

Receiver operating characteristic

80
Q

What is the ROC curve used for?

A

It is commonly used to examine the trade-off between detecting true positives and avoiding false positives.

81
Q

Describe the ROC curve.

A

[See flashcard]

Also known as a sensitivity/specificity plot.

  • Proportion of true positives on the vertical axis (sensitivity)
  • Proportion of false positives on the horizontal axis (1 - specificity)
  • Points indicate the true positive rate at varying false positive thresholds

To create the curve, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Each prediction then traces the curve vertically (for a correct prediction) or horizontally (for an incorrect prediction).

82
Q

What does the diagonal line on the ROC curve indicate?

A

A classifier with no predictive value.

This classifier detects true positives and false positives at exactly the same rate, implying that the classifier cannot discriminate between the two.

This is the baseline by which other classifiers may be judged.

The closer to the line, the less useful the model.

83
Q

What is the perfect classifier in an ROC curve?

A

A curve that passes through the point at a 100% true positive rate and 0% false positive rate.

It is able to correctly identify all the positives before it incorrectly classifies any negative result.

The closer the curve is to the perfect classifier, the better it is at identifying positive values.

84
Q

How do we compare the test classifier curve to the perfect classifier curve?

A

The closer it is, the better it is at identifying positive values.

Measured using the area under the ROC curve (AUC). The AUC treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve.

AUC ranges from 0.5 (classifier with no predictive value) to 1.0 (perfect classifier).
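
A hedged sketch of how this is usually computed in practice with scikit-learn (y_true holding the actual labels and y_score the model's estimated probabilities of the positive class are assumptions, not flashcard content):

    from sklearn.metrics import roc_curve, roc_auc_score

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
    auc = roc_auc_score(y_true, y_score)                # area under that curve
    print(round(auc, 3))   # 0.5 = no predictive value, 1.0 = perfect classifier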

85
Q

How can you interpret the AUC scores?

A

0.5 - 0.6: no discrimination
0.6 - 0.7: poor
0.7 - 0.8: acceptable / fair
0.8 - 0.9: excellent / good
0.9 - 1.0: outstanding

86
Q

Generally speaking, when does a regression predictive model have good performance?

A

If the differences between observed values and the model's predicted values (the residuals) are small and unbiased.

87
Q

What are the most commonly used metrics for evaluating regression predictive models?

A
  • Root mean square error (RMSE)
  • Mean absolute error (MAE)
  • Coefficient of determination - R^2
88
Q

What is the RMSE?

A

Root mean square error

RMSE is the most common goodness-of-fit measurement for assessing regression models. It is a quadratic scoring rule which measures the average magnitude of the error.

It is calculated as the square root of the average square of the difference between prediction and actual values.

[See flashcard]

Since the errors are squared before being averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.

89
Q

What is the MAE?

A

Mean absolute error

Measures the average magnitude of the errors in a set of forecasts, without considering their directions.

It is calculated as the average absolute difference between forecast and the corresponding observation.

[See flashcard]

It is a linear score, meaning all the individual differences are weighted equally in the average.
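
The two formulas referenced above, in standard notation (y_i observed values, \hat{y}_i predictions, n observations); these are the usual textbook definitions rather than the exact flashcard layout:

    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
    \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert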

90
Q

What can the RMSE and MAE be used together for?

A

To diagnose the variation in the errors in a set of forecasts.

The RMSE will always be larger than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample.

If RMSE = MAE, then all the errors are of the same magnitude.

Both range from 0 to infinity. They are negatively oriented scores: lower values are better.

91
Q

What is the Coefficient of Determination R^2?

A

Describes the proportion of the variation in the response that is predictable from the independent variable.

Ranges from 0 to 1 where higher values indicate better performance of the regression model
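
In standard notation (again the usual textbook form, with \bar{y} the mean of the observed responses, not necessarily the flashcard's exact notation):

    R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}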

92
Q

Why would you want to build a model using many input variables?

A

Models which combine the effects of many input variables (predictors) tend to be much more powerful than models that use only a single predictor.

93
Q

What do you need to consider when building a model with multiple input variables?

A
  • Selecting what input variables to use
  • How the input variables are to be transformed or treated
94
Q

How does WHEN a variable is available impact model utility?

A

A variable that’s coincident with (available near or after) the time that the outcome occurs may make a very accurate model with little utility (as it can’t be used for long-term prediction).

95
Q

What other variables do analysts need to be cautious of?

A

Variables that are a function of, or “contaminated by”, the value to be predicted

96
Q

Who should you discuss variable availability with?

A

Project sponsor

97
Q

What can improving model utility be at the cost of and how might you achieve it?

A
  • Possible cost of accuracy
  • Removing variables from project design
98
Q

What kind of prediction is much more useful, in terms of utility?

A

An acceptable prediction one day before an event can be much more useful than a more accurate prediction one hour before the event.

99
Q

What does each variable represent?

A
  • Chance of explaining more of the outcome variation (ie a chance of building a better model)
  • A possible source of noise and overfitting
100
Q

How do we control for the effect of noise and overfitting?

A

We preselect which subset of variables we will use to fit.

101
Q

What are criteria for the selection of variables to be included in the model based on?

A

Log likelihood values - ln(L)

102
Q

What are the names of different criteria you can use for selecting variables?

A
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Likelihood ratio test
103
Q

Describe AIC.

A

AIC - akaike information criterion

Using AIC to compare models, the model with the smallest AIC is chosen.

[See flashcard]

  • The measure penalises models that have an excessive number of parameters (they may be overfitting).
  • The measure can also be used to compare between different statistical models (eg logistic vs probit)
104
Q

Describe BIC.

A

BIC - Bayesian information criterion or Schwarz criterion

The model chosen is the one with the smallest BIC.
It has a larger penalty than the AIC

[ See flashcard ]

105
Q

What is the difference between AIC and BIC?

A

BIC has a larger penalty for models that have an excessive number of parameters.
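
For reference, the standard definitions (k = number of estimated parameters, n = number of observations, ln(L) = maximised log-likelihood); the flashcard images may use slightly different notation:

    \mathrm{AIC} = 2k - 2\ln(L)
    \qquad
    \mathrm{BIC} = k\ln(n) - 2\ln(L)

The BIC penalty k ln(n) exceeds the AIC penalty 2k whenever n > e^2 (about 7.4), which is why BIC penalises extra parameters more heavily on all but the smallest datasets.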

106
Q

Describe the likelihood ratio test.

A

In the likelihood ratio test, two nested models are compared.

Model 1 is nested in Model 2 if the covariates in Model 1 are also contained in Model 2.

[ See flashcard ]

Under the null hypothesis that the additional β parameter(s) in Model 2 all equal zero, this ratio should be an observation from a χ² distribution. If the statistic lies in the tail of the distribution, there is sufficient evidence to reject the null hypothesis (ie some of the additional covariates in Model 2 are required). If we do not reject the null hypothesis, this suggests that the additional covariates in Model 2 are not required.

107
Q

What measures can we use to compare supervised predictive models?

A
  • Re-substitution error rate
  • Predictive accuracy
  • Speed and scalability
  • Robustness
  • Interpretability
108
Q

How do we calculate the re-substitution error rate?

A

A predictive model is built based on training data. It is then applied to the same training data so that comparisons can be made between predicted and actual observations recorded in the training data.

109
Q

What does the re-substitution error rate indicate?

A

How good (or bad) a predictive model performs on the training data. It gives a performance measure for the algorithm (lower = better algorithm).

It does not correspond to how well the model would predict previously unseen observations.

110
Q

How do we assess the predictive accuracy of a supervised predictive model?

A

Observations recorded (with known classification) are split between training and test sets. The predictive model is first built based on the training set, then applied to the test set.

111
Q

What does the predictive accuracy indicate?

A

How good (or bad) a predictive model performs on the test data.

112
Q

What are the time and space requirements for the two distinct phases of classification?

A
  • Time to construct the predictive model
  • Time to use the model
113
Q

In the case of a simple linear classifier, what is the time taken to fit the line linear in?

A

The number of instances

114
Q

In the case of a simple linear classifier, what is the time taken to use the model?

A

The time taken to test which side of the line the unlabelled instance is. This can be done in constant time.

115
Q

What is robustness?

A

Handling
- Noise (eg mistyped values)
- Missing values
- Irrelevant features
- Streaming data

116
Q

What does streaming data refer to?

A

For many real world problems, we do not have a single fixed dataset. Instead the data continuously arrives, potentially forever.

  • Eg stock market, weather data, sensor data
  • Eg if the data takes three weeks to arrive or the model takes three weeks to run, the result will always be three weeks out of date
117
Q

What is interpretability?

A

Interpretability refers to the degree to which a human can predict the outcome of a model or understand the reasons behind its decisions. The understanding and insight provided by the model.

118
Q

Some classifiers offer a bonus feature, what is this?

A

The structure of the learned classifier tells us something about the domain.

  • eg a single linear classifier does not work well but two linear classifiers do.
119
Q

What is the aim of classification problems?

A

To identify the characteristics that indicate the group to which each case belongs.

Classification deals with associating a class label to a point, according to the class to which the point is believed to belong. The goal of classification is to group items that have similar feature values into the same group.

  • Classification can be based on a distance measure between the point of consideration and other points belonging to known classes in point space.
120
Q

What can classification patterns be used for?

A
  • Understanding the existing data
  • Predict how new instances will behave
121
Q

How does a simple linear classifier group items that have similar feature values into groups?

A

Makes a classification decision based on the value of the linear combination of these features.

Linear classification classifies points according to their position relative to a hyperplane in the point space. All points on a given side of the hyperplane are considered as belonging to the same class.
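
In symbols, for a feature vector x, weight vector w and bias b (a standard formulation, not quoted from the flashcards):

    \hat{y} = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)

The hyperplane w · x + b = 0 is the decision boundary: points giving a positive value fall on one side and are assigned to one class, points giving a negative value to the other.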

122
Q

What is a two-class classification problem?

A

Two classes - eg yes and no

123
Q

How can you visualise the operation of a linear classifier for the two-class classification problem?

A

Splitting a high-dimensional input space with a hyperplane.

All points on one side of the hyperplane are classified as “yes” and all points on the other are classified as “no”.

124
Q

When is linear classification considered a simple and robust method of classification?

A

In the absence of a known model for class data.

125
Q

What is the term for problems that can be solved by a linear classifier?

A

Linearly separable

126
Q

What is an example of a linear classifier?

A

Naive Bayes

127
Q

What is an alternative to a simple linear classifier?

A

A piecewise linear classifier, which can be generalised to N classes by fitting N-1 lines.

128
Q

Why might a simple linear classifier perform poorly?

A

There may not exist a single hyperplane capable of uniquely separating point classes

  • Geometrical reasons
  • There are more than 2 classes
129
Q

How is this problem solved?

A

Piecewise linear classification

130
Q

What does piecewise linear classification involve?

A

Defining boundaries between classes using several hyperplanes - allows us to define more than two regions in space, making it possible to classify a set of points more precisely.

131
Q

What is a non-linear classifier?

A

A classifier which makes its classification decision based on a non-linear combination of its features.

  • k-nearest neighbour (K-NN) algorithm: the decision boundaries are locally linear segments, but in general have a complex shape that is not equivalent to a line in 2D or hyperplane in higher dimensions.
132
Q

What are decision trees?

A

A supervised learning model which learns on the data by making decision rules on the variables to separate the classes in a flowchart-like tree structure.

  • Can be used for both regression and classification
  • Regression trees: response variable is continuous
  • Classification trees: response variable is quantitative discrete or qualitative
133
Q

What is the idea behind decision trees?

A

To learn the structure of the tree using a training sample of data, and then using the resulting tree structure to create a set of classification rules.

Once such rules are created, new unknown samples can be classified accordingly
- Directing us to a prediction of which class a new piece of data belongs to

134
Q

What are the features of a decision tree?

A
  • A flow-chart-like structure
  • A rectangle (internal node) denotes a test on an attribute/variable
  • Branches of internal nodes represent the outcomes of the test
  • Leaf nodes (circles) represent class labels or class distributions
135
Q

What are the phases of decision tree generation?

A

Two phases
- tree construction
- tree pruning

136
Q

What does tree construction involve?

A

At the start, all the training examples are at the root. The examples are partitioned recursively based on selected attributes.

137
Q

What does tree pruning involve?

A
  • PRE-PRUNING - halt tree construction early
  • Eg decide not to add a further split at some node, so the node becomes a leaf
  • POST-PRUNING - identify and remove branches that reflect noise or outliers
138
Q

What is the use for decision trees?

A

To classify an unknown sample
- By testing the variable values of the sample against the decision tree
- Not all variables will necessarily help distinguish between cases

139
Q

What is the basic algorithm used to split the records in a decision tree?

A

The ID3 algorithm

  • A greedy algorithm
  • Splits the records based on attribute test that optimises certain criterion
140
Q

Discuss the ID3 algorithm for tree construction.

A
  • The tree is constructed in a top-down recursive divide-and-conquer manner
  • At the start, the tree starts as a single node (the root) representing the training samples
  • If all the samples are in the same class, the node becomes a leaf and is labelled with that class
  • Variables are categorical (or discretised if continuous)
  • The algorithm uses a statistical measure (eg information gain) for selecting the variable that will best separate the samples into individual classes - this variable becomes the decision variable at the node
  • A branch is created for each known value of the decision variable and the samples are partitioned accordingly
  • The algorithm uses the same process recursively to form a decision tree for the samples at each partition.
141
Q

Once a variable has been used at a node, what is true?

A

The variable is not considered for any of the node’s descendants.

142
Q

How do we decide what variable to split on?

A

Use information gain - entropy

143
Q

What are potential conditions for stopping partitioning?

A
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf
  • There are no samples left
144
Q

What are the requirements of the sample data used in the ID3 algorithm?

A
  • Variable value description: the same variables must describe each example and have a fixed number of values
  • Predefined classes - an example’s variables must already be defined, they are not learned by ID3
  • Sufficient examples - there must be enough test cases to distinguish valid patterns from chance occurrences
145
Q

How does the ID3 decide which variable is best used for splitting?

A

A statistical property called information is used for selecting the variable that will best separate the samples into individual classes

Selects the variable with the highest information gain for branching.

146
Q

What does information gain measure?

A

How well a given variable separates training examples into targeted classes.

147
Q

How do we define information gain?

A

Using the statistic entropy which measures the amount of information (or dispersion) in a variable or sample.

Information gain measures the expected reduction in entropy caused by knowing the value of a variable.

148
Q

What are the formulas for entropy and information gain?

A

[See flashcard]
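
The standard forms of the two formulas (c = number of classes, p_i = proportion of S belonging to class i, S_v = subset of S for which variable A takes the value v); the flashcard may format them differently:

    \mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
    \qquad
    \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)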

149
Q

When is the entropy 0?

A

When all members of S belong to the same class (ie perfect homogeneity)

150
Q

When is the entropy 1?

A

If all observations are uniformly distributed across the c possible outcomes (ie maximum heterogeneity)

151
Q

How do you figure out which variable should be at the root node of the decision tree?

A

Calculate the gain for each variable and use the variable with the highest gain as the one that splits the root node (ie the decision variable at the root node).
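
A minimal Python sketch of this root-node selection for categorical variables (the data layout - a dict of value lists plus a label list - is an assumption for illustration):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels, using log base 2."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(values, labels):
        """Expected reduction in entropy from splitting on one categorical variable."""
        n = len(labels)
        remainder = 0.0
        for v in set(values):
            subset = [lab for val, lab in zip(values, labels) if val == v]
            remainder += (len(subset) / n) * entropy(subset)
        return entropy(labels) - remainder

    # variables: dict mapping variable name -> list of values (one per observation)
    # root = max(variables, key=lambda name: information_gain(variables[name], labels))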

152
Q

Why will the entropy values for no/perfect classification not be valid if using log10 or loge?

A

Entropy is constrained to lie between 0 and 1 (inclusive); with log10 or loge the maximum value is no longer 1, so log base 2 must be used.

153
Q

How do you extract classification rules from decision trees?

A

Represent as classification IF-THEN rules

  • Create one rule for each path from the root to a leaf
154
Q

What kinds of real-world applications have classification trees been implemented in?

A
  • Medical diagnosis
  • Credit risk assessment of loan applications
  • Equipment failures
155
Q

What are the advantages of decision trees?

A
  • They are intuitive and easy to understand
  • Simple and easily-understood classification rules produced - easy to understand how to classify a new point
  • Relatively fast (compared to other classification methods)
156
Q

What are the disadvantages of decision trees?

A
  • May suffer from overfitting (we can have a huge tree if there is lots of data but it may be overfit - need to prune)
  • Rectangular partitioning approach (does not handle correlated features very well)
  • Can be quite large - pruning is sometimes necessary
  • May suffer from “noisy” data
157
Q

What is another decision tree algorithm that can be used to construct decision trees?

A

C4.5 algorithm

It is based on many of the principles of the ID3 algorithm but includes a number of improvements.

158
Q

What are improvements of the C4.5 algorithm?

A
  • It can handle continuous and discrete attributes - for continuous variables, it creates a threshold and splits based on this
  • It can handle training data with missing variable values - they are simply not used in gain and entropy calculations
  • It can handle variables with differing costs (weights)
  • Can prune trees after creation
159
Q

How do decision trees handle continuous variables (C4.5 algorithm)?

A

The best split-point for a continuous variable Ak must be determined.

Typically the observed values of Ak in the training set are sorted and the information gain is calculated with the split-point placed at each of the mid-points between the sorted values. The split-point position with the best information gain is chosen as the actual split-point. This is then compared in the usual way to the other candidate variables.

160
Q

What is “noisy data”?

A

Examples which are misclassified or where one or more of the attribute variables is wrong.

161
Q

What might happen if there is “noisy data”?

A

We could use unused attributes to elaborate the tree to take care of this one case but it would be wrong and likely misclassify real data.

If we know there is noise in the training data, it might be wise to “prune” the decision tree to remove nodes which, statistically speaking, seem likely to arise from noise in the training data.

162
Q

What does overfitting refer to?

A

Decision trees can become overly complex, describing even the noise in the training data.

163
Q

How do we overcome overfitting?

A

By splitting the data into a training, test and validation set

164
Q

What is a training set?

A

A set of examples used for learning, that is, to fit the parameters of the classifier.

165
Q

What is a validation set?

A

A set of examples used to tune the parameters of a classifier - for example to choose the number of nodes to keep in a decision tree.

eg when deciding between 15 or 20 nodes, the validation set is used to compare the performance of the two decision trees and decide which one to take. The training data will likely show less error with 20 nodes, but the validation data may perform worse with 20 nodes due to overfitting.

166
Q

What is the test set?

A

A set of examples used only to assess the performance of a fully-specified classifier.

167
Q

How does the error in the training set and validation set vary with the number of nodes?

A
  • The error in the training set decreases as the decision tree becomes more complex
  • The same may or may not occur for the validation data. Initially the error decreases with the number of nodes, but at some point the model becomes overfit to the training data and the validation error increases.

The best predictive and fitted model would be where the error in the validation set has its global minimum.

168
Q

What are Bayesian classifiers?

A

Statistical classifiers that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

Bayesian classification is based on Bayes’ theorem.

169
Q

What is one kind of Bayesian classifier?

A

The naive Bayesian classifier (also known as idiot Bayes or simple Bayes)

170
Q

What do Naive Bayesian classifiers assume?

A

The effect of one variable on a given class is independent of the values of other variables.

Quite often variables are not completely independent - eg in a heart attack example, a naive Bayesian model might struggle because we are assuming class conditional independence (a disadvantage of this model)

CLASS CONDITIONAL INDEPENDENCE

171
Q

Why do naive bayesian classifiers assume class conditional independence?

A

To simplify the computations

172
Q

What is Bayes' theorem?

A

[See flashcard]
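
In the notation used on the surrounding cards (C_i a class, X an observation), the standard statement is:

    P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}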

173
Q

What does posterior probability mean?

A

The probability of an event occurring after taking into account relevant prior information or evidence.

174
Q

The naive bayesian classifier predicts that X belongs to the class with what property?

A

The highest posterior probability conditioned on X

[See flashcard]

175
Q

What do we need to maximise in Bayes’ theorem?

A

P(Ci|X)

Since P(X) is the same for every class, we therefore need to maximise P(X|Ci) x P(Ci).

  • Because the denominator P(X) is constant across classes, only the top part (numerator) of the formula needs to be maximised

If we assume all the class priors P(Ci) are equal, we only need to maximise P(X|Ci).

176
Q

How do we estimate class prior probabilities from the data?

A

P(Ci) = si / s, where si is the number of training observations in class Ci and s is the total number of training observations

177
Q

How can we calculate P(X|Cj)

A

[See flashcard]
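
Under the class conditional independence assumption, the standard simplification (with x_1, ..., x_n the variable values of X) is:

    P(X \mid C_j) = \prod_{k=1}^{n} P(x_k \mid C_j)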

178
Q

How do we estimate P(Xk|Cj) for categorical variables?

A

Estimate from a training sample by counting the number of observations in the class Cj having the value Xk for Ak and dividing by the total number of observations in the class Cj.

179
Q

How do we estimate P(Xk|Cj) for continuous variables?

A

The variable is typically assumed to have a normal (Gaussian) distribution, so P(Xk|Cj) is evaluated from the normal density using the mean and standard deviation of the variable within class Cj.

180
Q

In a sentence, what is the method of the naive bayesian classification method?

A

Find out the probability of a previously unseen instance belonging to each class, then simply pick the most probable class.

181
Q

What are the advantages of Naive Bayes?

A
  • Fast to train (single scan) and fast to classify
  • Speed comparable to decision trees
  • Handles real and discrete data
182
Q

What are the disadvantages of the naive bayesian classifier?

A

Assumes independence of features - this is often a difficult assumption in reality

183
Q

With the Naive Bayesian classifier, if the training set does not contain any observations for a particular Xk and Ci, what happens?

A

The probability P(Xk|Ci) is estimated to be zero and therefore P(X|Ci) is also inappropriately estimated as zero.

184
Q

How do we avoid this problem?

A

We can add a small number to the counts in both the numerator and denominator of the probability estimation. If the training set is large, this Laplacian correction makes only a negligible difference to the probability estimation, but ensures no probability is zero.
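
A minimal Python sketch of the corrected estimate (the argument names and the alpha = 1 default are illustrative assumptions):

    def conditional_prob(count_xk_in_ci, count_ci, n_values, alpha=1):
        """P(Xk | Ci) with a Laplacian (add-alpha) correction.

        count_xk_in_ci : observations in class Ci with value Xk for this variable
        count_ci       : total observations in class Ci
        n_values       : number of possible values the variable can take
        """
        return (count_xk_in_ci + alpha) / (count_ci + alpha * n_values)

    # Without the correction an unseen (Xk, Ci) pair would give probability 0
    # and force the whole product P(X|Ci) to 0; with it the estimate stays small
    # but non-zero, e.g. conditional_prob(0, 50, 3) is roughly 0.019.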

185
Q

What is the nearest neighbour classifier?

A

A supervised learning algorithm where the result of a new query is classified based on the class of its nearest neighbour.

The purpose of this algorithm is to classify a new object based on attributes and training samples.

186
Q

What model is used for the nearest neighbour classifier?

A

No model is built; the method is purely memory-based.

187
Q

How do we measure a new object’s neighbours?

A

Euclidean distance - quantitative variables
Similarity measures - qualitative variables

188
Q

How does the nearest neighbour algorithm react to outliers?

A

The nearest neighbour algorithm is sensitive to outliers.
May result in misclassification.

189
Q

What do we generalise the nearest neighbour algorithm to?

A

The k-Nearest Neighbour (k-NN) algorithm.

190
Q

Describe the k-Nearest Neighbour algorithm.

A

It works on the minimum distance from the query to the training samples to determine the k-nearest neighbours.

After the nearest neighbours are gathered, we take the simple majority (majority vote) of these k neighbours' classes as the prediction for the query instance.

191
Q

What value is k typically?

A

An odd number (to avoid tied votes in a two-class problem)

192
Q

What does generalising the nearest neighbour algorithm to k-nearest neighbours do?

A

Helps deal with the outliers.

193
Q

Briefly, for k-NN to be performed on a sample data set by hand, what are the steps?

A
  • Get coordinates for query point
  • Calculate euclidean distances for all points from query point
  • Rank these distances (smallest = 1)
  • Determine the k nearest points
  • Of these k nearest points, determine which class has majority rule
  • Conclude what the query point can be classified as
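
A minimal Python sketch of these steps (the 2-D sample points and k = 3 are illustrative, not from the flashcards):

    from collections import Counter
    from math import dist   # Euclidean distance (Python 3.8+)

    def knn_classify(query, training_points, k=3):
        """training_points: list of ((x, y), class_label) tuples."""
        # Rank every training point by its distance from the query point
        ranked = sorted(training_points, key=lambda p: dist(query, p[0]))
        # Take the k nearest and let the majority class decide
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]

    sample = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"), ((5, 6), "B"), ((6, 5), "B")]
    print(knn_classify((1.5, 1.5), sample, k=3))   # -> "A"
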
194
Q

What are the advantages of the nearest neighbour algorithms?

A
  • Simple to implement
  • Handles correlated features
  • Defined for any distance measure
195
Q

What are the disadvantages of the nearest neighbour algorithms?

A
  • Very sensitive to irrelevant features
  • Slow classification time for large datasets
  • Sensitive to scale of measurement
196
Q

What does being very sensitive to irrelevant features mean?

A

The points may get separated out using an unhelpful variable and result in misclassification

197
Q

How do we mitigate the nearest neighbour’s sensitivity to irrelevant features?

A
  • Use more training instances
  • Ask an expert what features are relevant to the task
  • Use statistical tests to try to determine which features are useful
198
Q

What is one possible solution to account for the nearest neighbour algorithm’s sensitivity to the scale of measurement?

A

Standardise the variables before the classification process.

199
Q

What is the wrong classification due to the presence of many irrelevant attributes termed?

A

The curse of dimensionality

200
Q

What approaches help with classification accuracy?

A
  • Associate weights with the attributes
  • Backward elimination
201
Q

What is involved in associating weights with the attributes?

A
  • Assign random weights
  • Calculate the classification error
  • Adjust the weights according to the error
  • Repeat until an acceptable level of accuracy is reached
202
Q

What is involve in backward elimination?

A

Starts with the full set of features and greedily removes the feature whose removal most improves performance (or degrades it only slightly).