Chapter 5 Flashcards

1
Q

What is data mining?

A

The process of extracting valuable information from datasets.

  • Processing data and identifying patterns and trends in the information
  • Help make predictions on future trends by analysing past data
  • Identify relationships between different pieces of data
2
Q

What is the aim of supervised learning?

A

To build a model that makes predictions based on evidence in the presence of uncertainty.

A supervised learning algorithm takes a set of known input data and known responses/targets and trains a model to generate predictions for the response of the set of new data.

3
Q

When thinking of an entire set of input data for supervised learning as a heterogeneous mix, what do the columns and rows represent?

How can you think of the target data?

A
  • Columns are called predictors / attributes / features and represent a measurement taken on every subject
  • Rows are called observations / examples / instances and each contain a set of measurements for a subject
  • Target data can be thought of as a column vector where each row contains the output of the corresponding observation in the input data
4
Q

What are the two categories of supervised learning algorithms and what do they depend on?

A

Classification and regression

Depends on what the target feature is

5
Q

What is classification used for?

A

Classification is used where the target feature to be predicted is a categorical feature (class), which is divided into categories called levels.

6
Q

How many levels can a class have?

A

Two or more levels
- Yes / No
- A / B / C

The levels may or may not be ordinal

7
Q

What is regression used for?

A

To predict a continuous measurement for an observation (target variables are real numbers)

8
Q

Define the training dataset and test dataset.

A
  • Training dataset: the set of known input data and known targets. Its purpose is to generate the predictive model.
  • Test dataset: the set of new data that is unknown to the model. Its purpose is to assess the accuracy of the model.
9
Q

How are the training and test datasets often obtained?

A

Partitioning the raw (given) dataset

10
Q

What are the most popular partitioning data methods?

A
  • The holdout partitioning method
  • The K-Fold Cross-Validation Partitioning method
11
Q

Describe the holdout partitioning method.

A

In the HP method, the raw dataset is divided into training and test datasets based on some predefined percentage.

12
Q

What is the usual amount of data held out for testing?

A
  • 1/3 for testing
  • 2/3 for training

This proportion can vary depending on the amount of available data

13
Q

Why do you need to ensure samples are randomly divided into the two groups (test and train)?

A

To ensure there are no systematic differences between the training and test data.

14
Q

Why do we need a test set?

A

We can’t say how good our model is if we don’t have known values to compare

15
Q

How do you ensure the holdout method results in a truly accurate estimate of the future performance?

A

By ensuring that the performance on the test dataset is not allowed to influence the model.

  • For example, after building several models on the training data, don’t cherry-pick the one with the highest accuracy on the test data. Cherry-picking means the test performance is not an unbiased measure of the performance on unseen data.
16
Q

How can you overcome this problem?

A

In addition to training and test datasets, create a validation dataset.

17
Q

What is a validation dataset?

A

The validation dataset would be used for iterating and refining the model(s) - using a little bit of the data to fine-tune the model.

The test dataset, by contrast, is kept completely separate until the end.

18
Q

What does the use of a validation dataset mean for the test dataset?

A

The test dataset is only used once as a final step to report an estimated error rate for future predictions.

19
Q

What is a typical split between training, test and validation data?

A

50 / 25 / 25

Varies depending on the size of the dataset.

20
Q

What is a simple method to create holdout samples?

A

Use random number generators to assign records to partitions.
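
A minimal Python sketch of this idea, assuming X and y are NumPy arrays of features and labels (the function name and the seed are illustrative, not from the flashcards):

    import numpy as np

    def holdout_split(X, y, test_fraction=1/3, seed=42):
        """Randomly assign records to a training or a test partition."""
        rng = np.random.default_rng(seed)            # random number generator
        indices = rng.permutation(len(y))            # shuffle the record indices
        n_test = int(round(len(y) * test_fraction))  # e.g. hold out 1/3 for testing
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    X_train, X_test, y_train, y_test = holdout_split(X, y)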

21
Q

What is the problem with holdout sampling this way?

A

Each partition may have a larger or smaller proportion of some classes.

In particular, if a class makes up a very small proportion of the dataset, it can end up omitted from the training dataset entirely. This is a significant problem, because the model will not be able to learn that class.

22
Q

What is the problem if a class is not in the training dataset?

A

The model will not be able to learn it.

23
Q

How do you account for this?

A

Use a technique called stratified random sampling.

This guarantees that the random partitions have nearly the same proportion of each class as the full dataset, even when some classes are small. (We want to make sure a particular class isn’t omitted from the final testing dataset).
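
If scikit-learn is available, stratification can be requested directly; a hedged sketch (X and y are assumed to be the feature matrix and class labels):

    from sklearn.model_selection import train_test_split

    # Hold out 1/3 of the records while preserving the class proportions found in y.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=42)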

24
Q

Stratified random sampling distributes the classes evenly, but what can it not guarantee?

A

Other types of representativeness

  • Eg some samples may have too many/few difficult cases, easy-to-predict cases or outliers.
  • This is especially true for smaller datasets, which may not have a large enough portion of such cases to be divided among the training and test sets.
25
Q

What are the problems of the holdout method?

A
  • Potentially biased samples
  • Substantial portions of data must be reserved to test and validate the model
26
Q

Why are performance estimates using the holdout method likely to be conservative?

A

The test and validation data cannot be used to train the model until its performance has been measured.

27
Q

What technique mitigates the problems of randomly composed training datasets?

A

Repeated holdout method

This is a special case of the holdout method that uses the average result from several random holdout samples to evaluate a model’s performance.

28
Q

Why does the repeated holdout method make it less likely that the model is trained or tested on non-representative data?

A

Multiple holdout samples are used

It still has the issue that the different test sets considered potentially overlap - this may influence the overall predictive accuracy calculated.

29
Q

What kind of estimate of model performance does testing on hold-out data give?

How does this differ from what we want in practice?

A

Single-point estimate

In practice, we want both an unbiased estimate of our model’s future performance on new data (simulated by test data) and an estimate of the distribution of this estimate under typical variations in data and training procedures.

30
Q

What is a good method to obtain both an unbiased estimate of the model’s future performance and an estimate of the distribution of this estimate under typical variations in data and training procedures?

A

K-fold cross-validation

This technique helps make sure that your predictions are not just a one-hit wonder but consistently reliable across new, unseen datasets.

And the related ideas of:
- Empirical resampling
- Bootstrapping

31
Q

What is k-fold cross-validation?

A

K-Fold Cross-Validation is a robust technique used to evaluate the performance of machine learning models. It helps ensure that the model generalises well to unseen data by using different portions of the dataset for training and testing in multiple iterations.

32
Q

What is the idea behind k-fold cross-validation?

A

Repeat the construction of the model on different subsets of the available training data, and then evaluate the model only on data not seen during construction.

This is an attempt to simulate the performance of the model on unseen future data.

33
Q

What is the most common convention of the k number of sections?

A

k = 10

10-fold cross-validation (10-fold CV)

34
Q

Why is k = 10 often used?

A

Empirical evidence suggests that there is little added benefit in using a greater number.

35
Q

How are machine learning models built using 10-fold CV?

A

For each of the 10 folds (each containing 10% of the total data), a machine learning model is built on the remaining 90% of the data. The fold's matching 10% of samples is then used for model evaluation.

After training and evaluation have been repeated 10 times, the average performance across all the folds is reported.
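
A hedged sketch of 10-fold CV with scikit-learn (the decision tree estimator and the arrays X, y are placeholders, not prescribed by the flashcards):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # build on 90%
        preds = model.predict(X[test_idx])                                # evaluate on the matching 10%
        fold_scores.append(accuracy_score(y[test_idx], preds))

    print(np.mean(fold_scores))   # average performance across the 10 folds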

36
Q

Discuss the extreme case of a k-fold CV.

A

The leave-one-out method.

This performs k-fold CV using a fold for each of the data’s samples. If the dataset contains N observations, all of the observations are divided into N equal-sized sections.

A predictive model is built repeatedly, leaving one observation out in each of the N iterations; the omitted observation is then used to evaluate the generated predictive model.

37
Q

What are the advantages and disadvantages of the leave-one-out method?

A
  • Ensures that the greatest amount of data is used to train the model
  • It is so computationally expensive it is rarely used in practice
38
Q

Once you have built a model, what is the first thing to check?

A

If it works on the data it was trained from

39
Q

What is the goal of evaluating a predictive model?

A

To have a better understanding of how its performance will extrapolate to future cases.

40
Q

Why do we typically simulate future conditions and how?

A

It is usually unfeasible to test a still-unproven model in a live environment.

Ask the model to make a prediction based on the cases that resemble what it will be asked to do in the future.

41
Q

How do we learn about a model's strengths and weaknesses?

A

By observing the learner’s responses when asked to make a prediction based on the cases that resemble what it will be asked to do in the future.

We compare predicted values with the actual values from the dataset. We need to know the correct answer for a machine learner’s predictions. We need two vectors of data, one with the correct class values and one with the predicted class values. Both vectors must have the same number of values stored in the same order.

42
Q

What is model evaluation?

A

Quantification of the performance of a model, ie calculation of the summary scores that tell us if the model is effective or not.

43
Q

How do we decide if a summary score is high or low?

A

We look at some “ideal” models
- The Null Model (tells us what low performance looks like)
- The best single-variable model (tells us what a simple model can achieve)

44
Q

What is the null model?

A

The best model of a very simple form of the task you are trying to perform.

eg classification - the model always returns the most popular category
eg scoring model - the model returns the average of all outcomes (it has the least squared deviation from all the outcomes)

If the null model is not outperformed by the generated predictive model, then the generated model is of no value.

45
Q

What are the two most typical null model choices?

A
  • A model that is a single constant (returns the same answer for all situations)
  • A model that is independent (doesn’t record any important interaction between inputs and outputs)
46
Q

Why should we compare a single-variable model?

A

A complicated model can’t be justified if it does not outperform the best single-variable model available from the training data.

47
Q

What are the most common metrics used for the assessment of the classifier quality?

A
  • Accuracy and error rate
  • Precision and recall
  • Sensitivity and specificity
48
Q

What do we need to produce in order to calculate and describe these metrics?

A

A confusion matrix

49
Q

What is a confusion matrix?

A

A table that categorises predictions according to whether they match the actual value.

One of the table’s dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values.

eg 3x3 matrix for a three-class model

50
Q

What is a correct classification and how is it denoted?

A

A correct classification is when the predicted value is the same as the actual value.

Denoted by O - these fall in the cells on the diagonal of the matrix

51
Q

What are incorrect predictions and how are they denoted?

A

Cases where the predicted value differs from the actual value.

These are the off-diagonal matrix cells - denoted by X

52
Q

What are the performance measures for classification models based on from the confusion matrix?

A

The counts of predictions falling on and off the diagonal in these tables

  • True positives
  • False negatives
  • False positives
  • True negatives
53
Q

What do the most common performance measures consider?

A

The model’s ability to discern one class versus all others

54
Q

What is the class of interest known as vs the others?

A
  • Positive class - class of interest
  • Negative class - all others

(Not intended to imply any value judgement)

55
Q

The relationship between positive class and negative class predictions can be presented as a 2x2 confusion matrix that tabulates predictions into one of four categories. What are the categories?

A
  • True positive (TP): correctly classified as class of interest
  • True negative (TN): correctly classified as not the class of interest
  • False positive (FP): incorrectly classified as class of interest
  • False negative (FN): incorrectly classified as not the class of interest
56
Q

What are the common metrics for evaluating the classifier’s performance from the confusion matrix?

What are their formulas?

A
  • Accuracy
  • Error rate
  • Sensitivity
  • Specificity
  • Precision
  • Recall

[See flashcard]
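
The formulas are on the flashcard image; as a cross-check, a small Python sketch of the standard definitions computed from the four confusion-matrix counts:

    def classification_metrics(tp, tn, fp, fn):
        """Standard metrics from the counts of a 2x2 confusion matrix."""
        total = tp + tn + fp + fn
        accuracy = (tp + tn) / total        # fraction of correct predictions
        error_rate = (fp + fn) / total      # = 1 - accuracy
        sensitivity = tp / (tp + fn)        # true positive rate (same as recall)
        specificity = tn / (tn + fp)        # true negative rate
        precision = tp / (tp + fp)          # how often a positive call is correct
        recall = sensitivity
        return {"accuracy": accuracy, "error rate": error_rate,
                "sensitivity": sensitivity, "specificity": specificity,
                "precision": precision, "recall": recall}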

57
Q

What is the most widely known measure of classifier performance?

A

Accuracy - at the very least you want the classifier to be accurate

58
Q

What is accuracy?

What is its formula?

A

For a classifier, accuracy is the number of items categorised correctly divided by the total number of items.

It is simply what fraction of the time the classifier is correct.

59
Q

What is the error rate?

A

Represents the proportion of the incorrectly classified samples.

60
Q

When is accuracy an inappropriate measure?

A

Accuracy is an inappropriate measure for unbalanced classes.

eg when we have a rare event we are trying to predict.
- The null model (predicting that the event never happens) is very accurate, and more accurate than a useful classifier
- Accuracy is not a good measure for events that have an unbalanced distribution or unbalanced costs (different costs of “type 1” and “type 2” errors)

61
Q

What is the sensitivity of a model?

A

True positive rate (TPR)

Measures the proportion of positive examples that were correctly classified.

It is the number of true positives divided by the total number of positives (both correctly and incorrectly classified).

62
Q

What is the specificity of a model?

A

True negative rate (TNR)

Measures the proportion of negative examples that were correctly classified.

The number of true negatives divided by the total number of negatives.

63
Q

What do the pair of performance measures sensitivity and specificity capture?

A

The tradeoff / balance between predictions that are overly conservative and overly aggressive.

  • Think email spam filter
64
Q

What are sensitivity and specificity measures of?

A

They are measures of effect.

What fraction of class members are identified as positive and what fraction of non-class members are identified as negative.

65
Q

What is the range of sensitivity and specificity?

A

0 to 1

  • Closer to 1 is more desirable
  • A value of 1 corresponds to every prediction in the confusion matrix falling on the diagonal

We want to find balance between the two - a task that is often context-specific

66
Q

What other technique can assist with understanding the trade-off between sensitivity and specificity?

A

ROC curve

  • Receiver operating characteristic
67
Q

What are precision and recall?

A

Performance evaluation metrics that come from the field of information retrieval.

They are intended to provide an indication of how interesting and relevant a classifier’s results are, or whether the predictions are diluted by meaningless noise.

68
Q

What is precision?

A

The proportion of positive examples that are truly positive.

Precision describes how often a positive indication turns out to be correct. It is a measure of confirmation (when the classifier indicates positive, how often it is in fact correct).

69
Q

What is recall?

A

A metric that describes how complete the results are. It is a measure of utility (how much the classifier finds out of what there is to find out).

The number of true positives over the total number of positives.

Classifiers with a high recall capture a large portion of the positive examples, meaning that they have wide breadth.

eg high recall if the majority of spam messages are correctly identified
eg search engines with high recall return a large number of documents pertinent to the search query.

70
Q

Discuss the trade-off between precision and recall.

A

It is easy to be precise if you target the easy to classify samples.

It is easy to have high recall by casting a very wide net, meaning that the model is overly aggressive in identifying the positive case.

High precision and high recall at the same time is very challenging.

We want to test a variety of models to find a combination of precision and recall that will meet the needs of the project.

71
Q

What is the typical business need for accuracy?

A

“we need most of our decisions to be correct”

72
Q

What is the typical business need for precision?

A

“Most of what we marked as spam needs to be spam”

73
Q

What is the typical business need for recall?

A

“We want to cut down on the amount of spam a user sees by a factor of 10 (eliminate 90% of spam)”

74
Q

What is the typical business need for sensitivity?

A

“We have to cut a lot of spam, otherwise the user won’t see a benefit”

75
Q

What is the typical business need for specificity?

A

“We must be at least three nines on legitimate email; the user must see at least 99.9% of their non-spam email”

76
Q

What do statistics do for understanding performance vs visualisations?

A
  • Statistics attempt to boil model performance down to a single number
  • Visualisations depict how a learner performs across a wide range of conditions
77
Q

If two classifiers have similar accuracies, are they the same?

A

No, learning algorithms have different biases so they could have drastic differences in how they achieve their accuracy.

78
Q

What do visualisations allow when comparing learners?

A

A method to understand trade-offs, by comparing learners side by side in a single chart

79
Q

What does ROC stand for?

A

Receiver operating characteristic

80
Q

What is the ROC curve used for?

A

It is commonly used to examine the trade-off between detecting true positives and avoiding false positives.

81
Q

Describe the ROC curve.

A

[See flashcard]

Also known as a sensitivity/specificity plot.

  • Proportion of true positives on the vertical axis (sensitivity)
  • Proportion of false positives on the horizontal axis (1 - specificity)
  • Points indicate the true positive rate at varying false positive thresholds

To create the curve, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Each prediction then traces the curve vertically (for a correct prediction) or horizontally (for an incorrect prediction).

82
Q

What does the diagonal line on the ROC curve indicate?

A

A classifier with no predictive value.

This classifier detects true positives and false positives at exactly the same rate, implying that the classifier cannot discriminate between the two.

This is the baseline by which other classifiers may be judged.

The closer to the line, the less useful the model.

83
Q

What is the perfect classifier in an ROC curve?

A

A curve that passes through the point at a 100% true positive rate and 0% false positive rate.

It is able to correctly identify all the positives before it incorrectly classifies any negative result.

The closer the curve is to the perfect classifier, the better it is at identifying positive values.

84
Q

How do we compare the test classifier curve to the perfect classifier curve?

A

The closer it is, the better it is at identifying positive values.

Measured using the area under the ROC curve (AUC). The AUC treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve.

AUC ranges from 0.5 (classifier with no predictive value) to 1.0 (perfect classifier).
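
A hedged sketch of how this is usually computed in practice with scikit-learn (y_true holding the actual labels and y_score the model's estimated probabilities of the positive class are assumptions, not flashcard content):

    from sklearn.metrics import roc_curve, roc_auc_score

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
    auc = roc_auc_score(y_true, y_score)                # area under that curve
    print(round(auc, 3))   # 0.5 = no predictive value, 1.0 = perfect classifier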

85
Q

How can you interpret the AUC scores?

A

0.5 - 0.6: no discrimination
0.6 - 0.7: poor
0.7 - 0.8: acceptable / fair
0.8 - 0.9: excellent / good
0.9 - 1.0: outstanding

86
Q

Generally speaking, when does a regression predictive model have good performance?

A

If the differences between observed values and the model's predicted values (the residuals) are small and unbiased.

87
Q

What are the most commonly used metrics for evaluating regression predictive models?

A
  • Root mean square error (RMSE)
  • Mean absolute error (MAE)
  • Coefficient of determination - R^2
88
Q

What is the RMSE?

A

Root mean square error

RMSE is the most common goodness-of-fit measurement for assessing regression models. It is a quadratic scoring rule which measures the average magnitude of the error.

It is calculated as the square root of the average square of the difference between prediction and actual values.

[See flashcard]

Since the errors are squared before being averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.

89
Q

What is the MAE?

A

Mean absolute error

Measures the average magnitude of the errors in a set of forecasts, without considering their directions.

It is calculated as the average absolute difference between forecast and the corresponding observation.

[See flashcard]

It is a linear score, meaning all the individual differences are weighted equally in the average.
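
The two formulas referenced above, in standard notation (y_i observed values, \hat{y}_i predictions, n observations); these are the usual textbook definitions rather than the exact flashcard layout:

    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
    \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert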

90
Q

What can the RMSE and MAE be used together for?

A

To diagnose the variation in the errors in a set of forecasts.

The RMSE will always be larger than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample.

If RMSE = MAE, then all the errors are of the same magnitude.

Both range from 0 to infinity. They are negatively oriented scores: lower values are better.

91
Q

What is the Coefficient of Determination R^2?

A

Describes the proportion of the variation in the response that is predictable from the independent variable.

Ranges from 0 to 1 where higher values indicate better performance of the regression model
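
In standard notation (again the usual textbook form, with \bar{y} the mean of the observed responses, not necessarily the flashcard's exact notation):

    R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}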

92
Q

Why would you want to build a model using many input variables?

A

Models which combine the effects of many input variables (predictors) tend to be much more powerful than models that use only a single predictor.

93
Q

What do you need to consider when building a model with multiple input variables?

A
  • Selecting what input variables to use
  • How the input variables are to be transformed or treated
94
Q

How does WHEN a variable is available impact model utility?

A

A variable that’s coincident with (available near or after) the time that the outcome occurs may make a very accurate model with little utility (as it can’t be used for long-term prediction).

95
Q

What other variables do analysts need to be cautious of?

A

Variables that are a function of, or “contaminated by”, the value to be predicted

96
Q

Who should you discuss variable availability with?

A

Project sponsor

97
Q

What can improving model utility be at the cost of and how might you achieve it?

A
  • Possible cost of accuracy
  • Removing variables from project design
98
Q

What kind of prediction is much more useful, in terms of utility?

A

An acceptable prediction one day before an event can be much more useful than a more accurate prediction one hour before the event.

99
Q

What does each variable represent?

A
  • Chance of explaining more of the outcome variation (ie a chance of building a better model)
  • A possible source of noise and overfitting
100
Q

How do we control for the effect of noise and overfitting?

A

We preselect which subset of variables we will use to fit.

101
Q

What are criteria for the selection of variables to be included in the model based on?

A

Log likelihood values - ln(L)

102
Q

What are the names of different criteria you can use for selecting variables?

A
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Likelihood ratio test
103
Q

Describe AIC.

A

AIC - akaike information criterion

Using AIC to compare models, the model with the smallest AIC is chosen.

[See flashcard]

  • The measure penalises models that have an excessive number of parameters (they may be overfitting).
  • The measure can also be used to compare between different statistical models (eg logistic vs probit)
104
Q

Describe BIC.

A

BIC - Bayesian information criterion or Schwarz criterion

The model chosen is the one with the smallest BIC.
It has a larger penalty than the AIC

[ See flashcard ]

105
Q

What is the difference between AIC and BIC?

A

BIC has a larger penalty for models that have an excessive number of parameters.
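
For reference, the standard definitions (k = number of estimated parameters, n = number of observations, ln(L) = maximised log-likelihood); the flashcard images may use slightly different notation:

    \mathrm{AIC} = 2k - 2\ln(L)
    \qquad
    \mathrm{BIC} = k\ln(n) - 2\ln(L)

The BIC penalty k ln(n) exceeds the AIC penalty 2k whenever n > e^2 (about 7.4), which is why BIC penalises extra parameters more heavily on all but the smallest datasets.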

106
Q

Describe the likelihood ratio test.

A

In the likelihood ratio test, two nested models are compared.

Model 1 is nested in Model 2 if the covariates in Model 1 are also contained in Model 2.

[ See flashcard ]

Under the null hypothesis that the additional β parameter(s) in Model 2 all equal zero, this ratio should be an observation from a χ² distribution. If the statistic lies in the tail of the distribution, there is sufficient evidence to reject the null hypothesis (ie some of the additional covariates in Model 2 are required). If we do not reject the null hypothesis, this suggests that the additional covariates in Model 2 are not required.

107
Q

What measures can we use to compare supervised predictive models?

A
  • Re-substitution error rate
  • Predictive accuracy
  • Speed and scalability
  • Robustness
  • Interpretability
108
Q

How do we calculate the re-substitution error rate?

A

A predictive model is built based on training data. It is then applied to the same training data so that comparisons can be made between predicted and actual observations recorded in the training data.

109
Q

What does the re-substitution error rate indicate?

A

How good (or bad) a predictive model performs on the training data. It gives a performance measure for the algorithm (lower = better algorithm).

It does not correspond to how well the model would predict previously unseen observations.

110
Q

How do we assess the predictive accuracy of a supervised predictive model?

A

Observations recorded (with known classification) are split between training and test sets. The predictive model is first built based on the training set, then applied to the test set.

111
Q

What does the predictive accuracy indicate?

A

How good (or bad) a predictive model performs on the test data.

112
Q

What are the time and space requirements for the two distinct phases of classification?

A
  • Time to construct the predictive model
  • Time to use the model
113
Q

In the case of a simple linear classifier, what is the time taken to fit the line linear in?

A

The number of instances

114
Q

In the case of a simple linear classifier, what is the time taken to use the model?

A

The time taken to test which side of the line the unlabelled instance is. This can be done in constant time.

115
Q

What is robustness?

A

Handling
- Noise (eg mistyped values)
- Missing values
- Irrelevant features
- Streaming data

116
Q

What does streaming data refer to?

A

For many real world problems, we do not have a single fixed dataset. Instead the data continuously arrives, potentially forever.

  • Eg stock market, weather data, sensor data
  • Eg if the data takes three weeks to arrive or the model takes three weeks to run, the result will always be three weeks out of date
117
Q

What is interpretability?

A

Interpretability refers to the degree to which a human can predict the outcome of a model or understand the reasons behind its decisions. The understanding and insight provided by the model.

118
Q

Some classifiers offer a bonus feature, what is this?

A

The structure of the learned classifier tells us something about the domain.

  • eg a single linear classifier does not work well but two linear classifiers do.
119
Q

What is the aim of classification problems?

A

To identify the characteristics that indicate the group to which each case belongs.

Classification deals with associating a class label to a point, according to the class to which the point is believed to belong. The goal of classification is to group items that have similar feature values into the same group.

  • Classification can be based on a distance measure between the point of consideration and other points belonging to known classes in point space.
120
Q

What can classification patterns be used for?

A
  • Understanding the existing data
  • Predict how new instances will behave
121
Q

How does a simple linear classifier group items that have similar feature values into groups?

A

Makes a classification decision based on the value of the linear combination of these features.

Linear classification classifies points according to their position relative to a hyperplane in the point space. All points on a given side of the hyperplane are considered as belonging to the same class.
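
In symbols, for a feature vector x, weight vector w and bias b (a standard formulation, not quoted from the flashcards):

    \hat{y} = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)

The hyperplane w · x + b = 0 is the decision boundary: points giving a positive value fall on one side and are assigned to one class, points giving a negative value to the other.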

122
Q

What is a two-class classification problem?

A

Two classes - eg yes and no

123
Q

How can you visualise the operation of a linear classifier for the two-class classification problem?

A

Splitting a high-dimensional input space with a hyperplane.

All points on one side of the hyperplane are classified as “yes” and all points on the other are classified as “no”.

124
Q

When is linear classification considered a simple and robust method of classification?

A

In the absence of a known model for class data.

125
Q

What is the term for problems that can be solved by a linear classifier?

A

Linearly separable

126
Q

What is an example of a linear classifier?

A

Naive Bayes

127
Q

What is an alternative to a simple linear classifier?

A

A piecewise linear classifier, which can be generalised to N classes by fitting N-1 lines.

128
Q

Why might a simple linear classifier perform poorly?

A

There may not exist a single hyperplane capable of uniquely separating point classes

  • Geometrical reasons
  • There are more than 2 classes
129
Q

How is this problem solved?

A

Piecewise linear classification

130
Q

What does piecewise linear classification involve?

A

Defining boundaries between classes using several hyperplanes - allows us to define more than two regions in space, making it possible to classify a set of points more precisely.

131
Q

What is a non-linear classifier?

A

A classifier which makes its classification decision based on a non-linear combination of its features.

  • k-nearest neighbour (K-NN) algorithm: the decision boundaries are locally linear segments, but in general have a complex shape that is not equivalent to a line in 2D or hyperplane in higher dimensions.
132
Q

What are decision trees?

A

A supervised learning model which learns on the data by making decision rules on the variables to separate the classes in a flowchart-like tree structure.

  • Can be used for both regression and classification
  • Regression trees: response variable is continuous
  • Classification trees: response variable is quantitative discrete or qualitative
133
Q

What is the idea behind decision trees?

A

To learn the structure of the tree using a training sample of data, and then using the resulting tree structure to create a set of classification rules.

Once such rules are created, new unknown samples can be classified accordingly
- Directing us to a prediction of which class a new piece of data belongs to

134
Q

What are the features of a decision tree?

A
  • A flow-chart-like structure
  • A rectangle (internal node) denotes a test on an attribute/variable
  • Branches of internal nodes represent the outcomes of the test
  • Leaf nodes (circles) represent class labels or class distributions
135
Q

What are the phases of decision tree generation?

A

Two phases
- tree construction
- tree pruning

136
Q

What does tree construction involve?

A

At the start, all the training examples are at the root. The examples are partitioned recursively based on selected attributes.

137
Q

What does tree pruning involve?

A
  • PRE-PRUNING - halt tree construction early
  • Eg decide not to add a further split at some node, so the node becomes a leaf
  • POST-PRUNING - identify and remove branches that reflect noise or outliers
138
Q

What is the use for decision trees?

A

To classify an unknown sample
- By testing the variable values of the sample against the decision tree
- Not all variables will necessarily help distinguish between cases

139
Q

What is the basic algorithm used to split the records in a decision tree?

A

The ID3 algorithm

  • A greedy algorithm
  • Splits the records based on attribute test that optimises certain criterion
140
Q

Discuss the ID3 algorithm for tree construction.

A
  • The tree is constructed in a top-down recursive divide-and-conquer manner
  • At the start, the tree starts as a single node (the root) representing the training samples
  • If all the samples are in the same class, the node becomes a leaf and is labelled with that class
  • Variables are categorical (or discretised if continuous)
  • The algorithm uses a statistical measure (eg information gain) for selecting the variable that will best separate the samples into individual classes - this variable becomes the decision variable at the node
  • A branch is created for each known value of the decision variable and the samples are partitioned accordingly
  • The algorithm uses the same process recursively to form a decision tree for the samples at each partition.
141
Q

Once a variable has been used at a node, what is true?

A

The variable is not considered for any of the node’s descendants.

142
Q

How do we decide what variable to split on?

A

Use information gain - entropy

143
Q

What are potential conditions for stopping partitioning?

A
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf
  • There are no samples left
144
Q

What are the requirements of the sample data used in the ID3 algorithm?

A
  • Variable value description: the same variables must describe each example and have a fixed number of values
  • Predefined classes - an example’s variables must already be defined, they are not learned by ID3
  • Sufficient examples - there must be enough test cases to distinguish valid patterns from chance occurrences
145
Q

How does the ID3 decide which variable is best used for splitting?

A

A statistical property called information is used for selecting the variable that will best separate the samples into individual classes

Selects the variable with the highest information gain for branching.

146
Q

What does information gain measure?

A

How well a given variable separates training examples into targeted classes.

147
Q

How do we define information gain?

A

Using the statistic entropy which measures the amount of information (or dispersion) in a variable or sample.

Information gain measures the expected reduction in entropy caused by knowing the value of a variable.

148
Q

What are the formulas for entropy and information gain?

A

[See flashcard]
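
The standard forms of the two formulas (c = number of classes, p_i = proportion of S belonging to class i, S_v = subset of S for which variable A takes the value v); the flashcard may format them differently:

    \mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
    \qquad
    \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)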

149
Q

When is the entropy 0?

A

When all members of S belong to the same class (ie perfect homogeneity)

150
Q

When is the entropy 1?

A

If all observations are uniformly distributed across the c possible outcomes (ie maximum heterogeneity)

151
Q

How do you figure out which variable should be at the root node of the decision tree?

A

Calculate the gain for each variable and use the variable with the highest gain as the one that splits the root node (ie the decision variable at the root node).
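
A minimal Python sketch of this root-node selection for categorical variables (the data layout - a dict of value lists plus a label list - is an assumption for illustration):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels, using log base 2."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(values, labels):
        """Expected reduction in entropy from splitting on one categorical variable."""
        n = len(labels)
        remainder = 0.0
        for v in set(values):
            subset = [lab for val, lab in zip(values, labels) if val == v]
            remainder += (len(subset) / n) * entropy(subset)
        return entropy(labels) - remainder

    # variables: dict mapping variable name -> list of values (one per observation)
    # root = max(variables, key=lambda name: information_gain(variables[name], labels))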

152
Q

Why will the entropy values for no/perfect classification not be valid if using log10 or loge?

A

Entropy is constrained to lie between 0 and 1 (inclusive); with log10 or loge the maximum value is no longer 1, so log base 2 must be used.

153
Q

How do you extract classification rules from decision trees?

A

Represent as classification IF-THEN rules

  • Create one rule for each path from the root to a leaf
154
Q

What kinds of real-world applications have classification trees been implemented in?

A
  • Medical diagnosis
  • Credit risk assessment of loan applications
  • Equipment failures
155
Q

What are the advantages of decision trees?

A
  • They are intuitive and easy to understand
  • Simple and easily-understood classification rules produced - easy to understand how to classify a new point
  • Relatively fast (compared to other classification methods)
156
Q

What are the disadvantages of decision trees?

A
  • May suffer from overfitting (we can have a huge tree if there is lots of data but it may be overfit - need to prune)
  • Rectangular partitioning approach (does not handle correlated features very well)
  • Can be quite large - pruning is sometimes necessary
  • May suffer from “noisy” data
157
Q

What is another decision tree algorithm that can be used to construct decision trees?

A

C4.5 algorithm

It is based on many of the principles of the ID3 algorithm but includes a number of improvements.

158
Q

What are improvements of the C4.5 algorithm?

A
  • It can handle continuous and discrete attributes - for continuous variables, it creates a threshold and splits based on this
  • It can handle training data with missing variable values - they are simply not used in gain and entropy calculations
  • It can handle variables with differing costs (weights)
  • Can prune trees after creation
159
Q

How do decision trees handle continuous variables (C4.5 algorithm)?

A

The best split-point for a continuous variable Ak must be determined.

Typically the observed values of Ak in the training set are sorted and the information gain is calculated with the split-point placed at each of the mid-points between the sorted values. The split-point position with the best information gain is chosen as the actual split-point. This is then compared in the usual way to the other candidate variables.

160
Q

What is “noisy data”?

A

Examples which are misclassified or where one or more of the attribute variables is wrong.

161
Q

What might happen if there is “noisy data”?

A

We could use unused attributes to elaborate the tree to take care of this one case but it would be wrong and likely misclassify real data.

If we know there is noise in the training data, it might be wise to “prune” the decision tree to remove nodes which, statistically speaking, seem likely to arise from noise in the training data.

162
Q

What does overfitting refer to?

A

Decision trees can become overly complex, describing even the noise in the training data.

163
Q

How do we overcome overfitting?

A

By splitting the data into a training, test and validation set

164
Q

What is a training set?

A

A set of examples used for learning, that is, to fit the parameters of the classifier.

165
Q

What is a validation set?

A

A set of examples used to tune the parameters of a classifier - for example to choose the number of nodes to keep in a decision tree.

eg when deciding between 15 or 20 nodes, the validation set is used to compare the performance of the two decision trees and decide which one to take. The training data will likely show less error with 20 nodes, but the validation data may perform worse with 20 nodes due to overfitting.

166
Q

What is the test set?

A

A set of examples used only to assess the performance of a fully-specified classifier.

167
Q

How does the error in the training set and validation set vary with the number of nodes?

A
  • The error in the training set decreases as the decision tree becomes more complex
  • The same may or may not occur for the validation data. Initially the error decreases with the number of nodes, but at some point the model becomes overfit to the training data and the validation error increases.

The best predictive and fitted model would be where the error in the validation set has its global minimum.

168
Q

What are Bayesian classifiers?

A

Statistical classifiers that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

Bayesian classification is based on Bayes’ theorem.

169
Q

What is one kind of Bayesian classifier?

A

The naive Bayesian classifier (also known as idiot Bayes or simple Bayes)

170
Q

What do Naive Bayesian classifiers assume?

A

The effect of one variable on a given class is independent of the values of other variables.

Quite often variables are not completely independent - eg in a heart attack example, a naive Bayesian model might struggle because we are assuming class conditional independence (a disadvantage of this model)

CLASS CONDITIONAL INDEPENDENCE

171
Q

Why do naive bayesian classifiers assume class conditional independence?

A

To simplify the computations

172
Q

What is Bayes' theorem?

A

[See flashcard]
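
In the notation used on the surrounding cards (C_i a class, X an observation), the standard statement is:

    P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}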

173
Q

What does posterior probability mean?

A

The probability of an event occurring after taking into account relevant prior information or evidence.

174
Q

The naive bayesian classifier predicts that X belongs to the class with what property?

A

The highest posterior probability conditioned on X

[See flashcard]

175
Q

What do we need to maximise in Bayes’ theorem?

A

P(Ci|X)

Since P(X) is the same for every class, we therefore need to maximise P(X|Ci) x P(Ci).

  • Because the denominator P(X) is constant across classes, only the top part (numerator) of the formula needs to be maximised

If we assume all the class priors P(Ci) are equal, we only need to maximise P(X|Ci).

176
Q

How do we estimate class prior probabilities from the data?

A

P(Ci) = si / s, where si is the number of training observations in class Ci and s is the total number of training observations

177
Q

How can we calculate P(X|Cj)

A

[See flashcard]
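
Under the class conditional independence assumption, the standard simplification (with x_1, ..., x_n the variable values of X) is:

    P(X \mid C_j) = \prod_{k=1}^{n} P(x_k \mid C_j)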

178
Q

How do we estimate P(Xk|Cj) for categorical variables?

A

Estimate from a training sample by counting the number of observations in the class Cj having the value Xk for Ak and dividing by the total number of observations in the class Cj.

179
Q

How do we estimate P(Xk|Cj) for continuous variables?

A

The variable is typically assumed to have a normal (Gaussian) distribution, so P(Xk|Cj) is evaluated from the normal density using the mean and standard deviation of the variable within class Cj.

180
Q

In a sentence, what is the method of the naive bayesian classification method?

A

Find out the probability of a previously unseen instance belonging to each class, then simply pick the most probable class.

181
Q

What are the advantages of Naive Bayes?

A
  • Fast to train (single scan) and fast to classify
  • Speed comparable to decision trees
  • Handles real and discrete data
182
Q

What are the disadvantages of the naive bayesian classifier?

A

Assumes independence of features - this is often a difficult assumption in reality

183
Q

With the Naive Bayesian classifier, if the training set does not contain any observations for a particular Xk and Ci, what happens?

A

The probability P(Xk|Ci) is estimated to be zero and therefore P(X|Ci) is also inappropriately estimated as zero.

184
Q

How do we avoid this problem?

A

We can add a small number to the counts in both the numerator and denominator of the probability estimation. If the training set is large, this Laplacian correction makes only a negligible difference to the probability estimation, but ensures no probability is zero.
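
A minimal Python sketch of the corrected estimate (the argument names and the alpha = 1 default are illustrative assumptions):

    def conditional_prob(count_xk_in_ci, count_ci, n_values, alpha=1):
        """P(Xk | Ci) with a Laplacian (add-alpha) correction.

        count_xk_in_ci : observations in class Ci with value Xk for this variable
        count_ci       : total observations in class Ci
        n_values       : number of possible values the variable can take
        """
        return (count_xk_in_ci + alpha) / (count_ci + alpha * n_values)

    # Without the correction an unseen (Xk, Ci) pair would give probability 0
    # and force the whole product P(X|Ci) to 0; with it the estimate stays small
    # but non-zero, e.g. conditional_prob(0, 50, 3) is roughly 0.019.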

185
Q

What is the nearest neighbour classifier?

A

A supervised learning algorithm where the result of a new query is classified based on the class of its nearest neighbour.

The purpose of this algorithm is to classify a new object based on attributes and training samples.

186
Q

What model is used for the nearest neighbour classifier?

A

No model is built; the method is purely memory-based.

187
Q

How do we measure a new object’s neighbours?

A

Euclidean distance - quantitative variables
Similarity measures - qualitative variables

188
Q

How does the nearest neighbour algorithm react to outliers?

A

The nearest neighbour algorithm is sensitive to outliers.
May result in misclassification.

189
Q

What do we generalise the nearest neighbour algorithm to?

A

The k-Nearest Neighbour (k-NN) algorithm.

190
Q

Describe the k-Nearest Neighbour algorithm.

A

It works on the minimum distance from the query to the training samples to determine the k-nearest neighbours.

After the nearest neighbours are gathered, we take the simple majority (majority vote) of these k neighbours' classes as the prediction for the query instance.

191
Q

What value is k typically?

A

An odd number (to avoid tied votes in a two-class problem)

192
Q

What does generalising the nearest neighbour algorithm to k-nearest neighbours do?

A

Helps deal with the outliers.

193
Q

Briefly, for k-NN to be performed on a sample data set by hand, what are the steps?

A
  • Get coordinates for query point
  • Calculate euclidean distances for all points from query point
  • Rank these distances (smallest = 1)
  • Determine the k nearest points
  • Of these k nearest points, determine which class has majority rule
  • Conclude what the query point can be classified as
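
A minimal Python sketch of these steps (the 2-D sample points and k = 3 are illustrative, not from the flashcards):

    from collections import Counter
    from math import dist   # Euclidean distance (Python 3.8+)

    def knn_classify(query, training_points, k=3):
        """training_points: list of ((x, y), class_label) tuples."""
        # Rank every training point by its distance from the query point
        ranked = sorted(training_points, key=lambda p: dist(query, p[0]))
        # Take the k nearest and let the majority class decide
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]

    sample = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"), ((5, 6), "B"), ((6, 5), "B")]
    print(knn_classify((1.5, 1.5), sample, k=3))   # -> "A"
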
194
Q

What are the advantages of the nearest neighbour algorithms?

A
  • Simple to implement
  • Handles correlated features
  • Defined for any distance measure
195
Q

What are the disadvantages of the nearest neighbour algorithms?

A
  • Very sensitive to irrelevant features
  • Slow classification time for large datasets
  • Sensitive to scale of measurement
196
Q

What does being very sensitive to irrelevant features mean?

A

The points may get separated out using an unhelpful variable and result in misclassification

197
Q

How do we mitigate the nearest neighbour’s sensitivity to irrelevant features?

A
  • Use more training instances
  • Ask an expert what features are relevant to the task
  • Use statistical tests to try to determine which features are useful
198
Q

What is one possible solution to account for the nearest neighbour algorithm’s sensitivity to the scale of measurement?

A

Standardise the variables before the classification process.

199
Q

What is the wrong classification due to the presence of many irrelevant attributes termed?

A

The curse of dimensionality

200
Q

What approaches help with classification accuracy?

A
  • Associate weights with the attributes
  • Backward elimination
201
Q

What is involved in associating weights with the attributes?

A
  • Assign random weights
  • Calculate the classification error
  • Adjust the weights according to the error
  • Repeat until an acceptable level of accuracy is reached
202
Q

What is involve in backward elimination?

A

Starts with the full set of features and greedily removes the feature whose removal most improves performance (or degrades it only slightly).