Chapter 5 Flashcards
What is data mining?
The process of extracting valuable information from datasets.
- Processing data and identifying patterns and trends in the information
- Help make predictions on future trends by analysing past data
- Identify relationships between different pieces of data
What is the aim of supervised learning?
To build a model that makes predictions based on evidence in the presence of uncertainty.
A supervised learning algorithm takes a set of known input data and known responses (targets) and trains a model to generate predictions for the responses of new data.
When thinking of an entire set of input data for supervised learning as a heterogeneous mix, what do the columns and rows represent?
How can you think of the target data?
- Columns are called predictors / attributes / features and represent a measurement taken on every subject
- Rows are called observations / examples / instances and each contain a set of measurements for a subject
- Target data can be thought of as a column vector where each row contains the output of the corresponding observation in the input data
What are the two categories of supervised learning algorithms and what do they depend on?
Classification and regression
Depends on what the target feature is
What is classification used for?
Where the target feature to be predicted is a categorical feature (class) and is divided into categories called levels.
How many levels can a class have?
Two or more levels
- Yes / No
- A / B / C
The levels may or may not be ordinal
What is regression used for?
To predict a continuous measurement for an observation (target variables are real numbers)
Define the training dataset and test dataset.
- Training dataset: the set of known input data and known targets. Its purpose is to generate the predictive model.
- Test dataset: the set of new data that is unknown to the model. Its purpose is to assess the accuracy of the model.
How are the training and test datasets often obtained?
Partitioning the raw (given) dataset
What are the most popular partitioning data methods?
- The holdout partitioning method
- The K-Fold Cross-Validation Partitioning method
Describe the holdout partitioning method.
In the HP method, the raw dataset is divided into training and test datasets based on some predefined percentage.
What is the usual amount of data held out for testing?
- 1/3 for testing
- 2/3 for training
This proportion can vary depending on the amount of available data
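A minimal sketch of the holdout split in Python, assuming scikit-learn is available and using a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for the raw (given) dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out 1/3 of the samples for testing, keep 2/3 for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

print(len(X_train), len(X_test))  # roughly 200 / 100
```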
Why do you need to ensure samples are randomly divided into the two groups (test and train)?
To ensure there are no systematic differences between the training and test data.
Why do we need a test set?
We can’t say how good our model is if we don’t have known values to compare
How do you ensure the holdout method results in a truly accurate estimate of the future performance?
By ensuring that the performance on the test dataset is not allowed to influence the model.
- For example, after building several models on the training data, don’t cherry-pick the one with the highest accuracy on the test data. Cherry-picking means the test performance is not an unbiased measure of the performance on unseen data.
How can you overcome this problem?
In addition to training and test datasets, create a validation dataset.
What is a validation dataset?
The validation dataset is used for iterating and refining the model(s), while the test dataset is kept completely separate until the end.
It sets aside a small portion of the data to fine-tune the model.
What does the use of a validation dataset mean for the test dataset?
The test dataset is only used once as a final step to report an estimated error rate for future predictions.
What is a typical split between training, test and validation data?
50 / 25 / 25
Varies depending on the size of the dataset.
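A sketch of a 50 / 25 / 25 split into training, validation, and test sets, again assuming scikit-learn; the proportions are applied in two stages:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# First stage: keep 50% of the data for training
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Second stage: split the remainder evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 200, 100, 100
```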
What is a simple method to create holdout samples?
Use random number generators to assign records to partitions.
What is the problem with holdout sampling this way?
Each partition may have a larger or smaller proportion of some classes.
In particular, if a class makes up a very small proportion of the dataset, that class can easily be omitted from the training dataset. This is a significant problem, because the model will not be able to learn this class.
What is the problem if a class is not in the training dataset?
The model will not be able to learn it.
How do you account for this?
Use a technique called stratified random sampling.
This guarantees that the random partitions have nearly the same proportion of each class as the full dataset, even when some classes are small. (We want to make sure a particular class isn’t omitted from the final testing dataset).
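A sketch of stratified random sampling for the holdout split, assuming scikit-learn; the `stratify` argument keeps the class proportions of `y` roughly the same in both partitions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)

# stratify=y preserves the class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0
)

print(np.bincount(y_train) / len(y_train))  # ~[0.9, 0.1]
print(np.bincount(y_test) / len(y_test))    # ~[0.9, 0.1]
```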
Stratified random sampling distributes the classes evenly, but what can it not guarantee?
Other types of representativeness
- Eg some samples may have too many/few difficult cases, easy-to-predict cases or outliers.
- This is especially true for smaller datasets, which may not have a large enough portion of such cases to be divided among the training and test sets.
What are the problems of the holdout method?
- Potentially biased samples
- Substantial portions of data must be reserved to test and validate the model
Why are performance estimates using the holdout method likely to be conservative?
The test and validation data cannot be used to train the model until after its performance has been measured, so the model is built on less data than is actually available.
What technique mitigates the problems of randomly composed training datasets?
Repeated holdout method
This is a special case of the holdout method that uses the average result from several random holdout samples to evaluate a model’s performance.
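A sketch of the repeated holdout method: several random holdout samples are drawn and the model's accuracy is averaged across them (scikit-learn assumed, with a decision tree as an illustrative stand-in model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores = []
for seed in range(10):  # 10 different random holdout samples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(np.mean(scores))  # average accuracy over the repeated holdouts
```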
Why does the repeated holdout method make it less likely that the model is trained or tested on non-representative data?
Multiple holdout samples are used
It still has the issue that the different test sets considered potentially overlap - this may influence the overall predictive accuracy calculated.
What kind of estimate of model performance does testing on hold-out data give?
How does this differ from what we want in practice?
Single-point estimate
In practice, we want both an unbiased estimate of our model’s future performance on new data (simulated by test data) and an estimate of the distribution of this estimate under typical variations in data and training procedures.
What is a good method to obtain both an unbiased estimate of the model’s future performance and an estimate of the distribution of this estimate under typical variations in data and training procedures?
K-fold cross-validation
This technique helps make sure that your predictions are not just a one-hit wonder but consistently reliable across new, unseen datasets.
And the related ideas of:
- Empirical resampling
- Bootstrapping
What is k-fold cross-validation?
K-Fold Cross-Validation is a robust technique used to evaluate the performance of machine learning models. It helps ensure that the model generalises well to unseen data by using different portions of the dataset for training and testing in multiple iterations.
What is the idea behind k-fold cross-validation?
Repeat the construction of the model on different subsets of the available training data, and then evaluate the model only on data not seen during construction.
This is an attempt to simulate the performance of the model on unseen future data.
What is the most common convention of the k number of sections?
k = 10
10-fold cross-validation (10-fold CV)
Why is k = 10 often used?
Empirical evidence suggests that there is little added benefit in using a greater number.
How are machine learning models built using 10-fold CV?
For each of the 10 folds (each containing 10% of the total data), a machine learning model is built on the remaining 90% of the data. The held-out fold's 10% of samples is then used for model evaluation.
After the process of training and evaluating has been repeated 10 times, the average performance across all the folds is reported.
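A sketch of 10-fold cross-validation, assuming scikit-learn; `cross_val_score` builds the model on 9 folds and evaluates it on the held-out fold, repeating this 10 times:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# cv=10: train on 90% of the data, evaluate on the remaining 10%, 10 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print(scores)           # one accuracy value per fold
print(np.mean(scores))  # average performance across all 10 folds
```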
Discuss the extreme case of k-fold CV.
The leave-one-out method.
This performs k-fold CV using a fold for each of the data’s samples. If the dataset contains N observations, all of the observations are divided into N equal-sized sections.
A predictive model is built repeatedly, leaving one observation out in each of the N iterations; the omitted observation is then used to evaluate the generated predictive model.
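A sketch of the leave-one-out method using scikit-learn's `LeaveOneOut` splitter (kept deliberately small because of the computational cost):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small dataset: leave-one-out needs one model fit per observation
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

# Each of the N=50 iterations trains on 49 observations and
# evaluates on the single omitted observation
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())

print(len(scores), np.mean(scores))  # 50 single-observation evaluations
```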
What are the advantages and disadvantages of the leave-one-out method?
- Ensures that the greatest amount of data is used to train the model
- It is so computationally expensive it is rarely used in practice
Once you have built a model, what is the first thing to check?
If it works on the data it was trained from
What is the goal of evaluating a predictive model?
To have a better understanding of how its performance will extrapolate to future cases.
Why do we typically simulate future conditions and how?
It is usually unfeasible to test a still-unproven model in a live environment.
Ask the model to make a prediction based on the cases that resemble what it will be asked to do in the future.
How do we learn about a model's strengths and weaknesses?
By observing the learner’s responses when asked to make a prediction based on the cases that resemble what it will be asked to do in the future.
We compare predicted values with the actual values from the dataset. We need to know the correct answer for a machine learner’s predictions. We need two vectors of data, one with the correct class values and one with the predicted class values. Both vectors must have the same number of values stored in the same order.
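A minimal illustration of the two vectors being compared, using made-up class values:

```python
# Hypothetical actual and predicted class values, in the same order
actual    = ["spam", "ham", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "spam", "spam", "ham", "ham"]

# Fraction of positions where the two vectors agree
agreement = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(agreement)  # 4 of 6 predictions are correct -> ~0.67
```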
What is model evaluation?
Quantification of the performance of a model, ie calculation of the summary scores that tell us if the model is effective or not.
How do we decide if a summary score is high or low?
We look at some “ideal” models
- The Null Model (tells us what low performance looks like)
- The best single-variable model (tells us what a simple model can achieve)
What is the null model?
The best model of a very simple form for the task you are trying to perform.
eg classification - a model that always returns the most popular category
eg score model - a model that returns the average of all outcomes (this has the least squared deviation from all the outcomes)
If the null model is not outperformed by the generated predictive model, then the generated model is of no value.
What are the two most typical null model choices?
- A model that is a single constant (returns the same answer for all situations)
- A model that is independent (doesn’t record any important interaction between inputs and outputs)
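A sketch of the two typical null models, using scikit-learn's dummy estimators as convenient stand-ins: a constant majority-class classifier and a mean-predicting regressor:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Classification null model: always return the most popular category
y_class = np.array([0, 0, 0, 1, 0, 1, 0, 0])
null_clf = DummyClassifier(strategy="most_frequent").fit(np.zeros((8, 1)), y_class)
print(null_clf.predict(np.zeros((3, 1))))  # [0 0 0]

# Regression null model: always return the mean of the outcomes
y_score = np.array([2.0, 4.0, 6.0, 8.0])
null_reg = DummyRegressor(strategy="mean").fit(np.zeros((4, 1)), y_score)
print(null_reg.predict(np.zeros((2, 1))))  # [5. 5.]
```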
Why should we compare a single-variable model?
A complicated model can’t be justified if it does not outperform the best single-variable model available from the training data.
What are the most common metrics used for the assessment of the classifier quality?
- Accuracy and error rate
- Precision and recall
- Sensitivity and specificity
What do we need to produce in order to calculate and describe these metrics?
A confusion matrix
What is a confusion matrix?
A table that categorises predictions according to whether they match the actual value.
One of the table’s dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values.
eg 3x3 matrix for a three-class model
What is a correct classification and how is it denoted?
A correct classification is when the predicted value is the same as the actual value.
Denoted by O - these fall on the diagonal values
What are incorrect predictions and how are they denoted?
Cases where the predicted value differs from the actual value.
These are the off-diagonal matrix cells - denoted by X
What are the performance measures for classification models based on when using the confusion matrix?
The counts of predictions falling on and off the diagonal in these tables
- True positives
- False negatives
- False positives
- True negatives
What do the most common performance measures consider?
The model’s ability to discern one class versus all others
What is the class of interest known as vs the others?
- Positive class - class of interest
- Negative class - all others
(Not intended to imply any value judgement)
The relationship between the positive class and negative class predictions can be depicted as a 2x2 confusion matrix that tabulates whether predictions fall into one of four categories. What are the categories?
- True positive (TP): correctly classified as class of interest
- True negative (TN): correctly classified as not the class of interest
- False positive (FP): incorrectly classified as class of interest
- False negative (FN): incorrectly classified as not the class of interest
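A sketch of building a 2x2 confusion matrix and reading off the four counts, assuming scikit-learn; with labels [0, 1] and 1 as the positive class, `ravel()` returns TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

# 1 = positive class (class of interest), 0 = negative class
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
# TP: 3  TN: 3  FP: 1  FN: 1
```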
What are the common metrics for evaluating the classifier’s performance from the confusion matrix?
What are their formulas?
- Accuracy
- Error rate
- Sensitivity
- Specificity
- Precision
- Recall
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - accuracy
- Sensitivity (true positive rate) = TP / (TP + FN)
- Specificity (true negative rate) = TN / (TN + FP)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
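A sketch of these formulas computed directly from the four confusion-matrix counts (plain Python, with illustrative counts only):

```python
# Illustrative confusion-matrix counts
tp, tn, fp, fn = 60, 20, 10, 10
total = tp + tn + fp + fn

accuracy    = (tp + tn) / total  # fraction of correct predictions
error_rate  = (fp + fn) / total  # 1 - accuracy
sensitivity = tp / (tp + fn)     # recall / true positive rate
specificity = tn / (tn + fp)     # true negative rate
precision   = tp / (tp + fp)     # correct fraction of positive calls
recall      = sensitivity        # same formula as sensitivity

print(accuracy, error_rate, sensitivity, specificity, precision, recall)
# 0.8 0.2 ~0.857 ~0.667 ~0.857 ~0.857
```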
What is the most widely known measure of classifier performance?
Accuracy - at the very least you want the classifier to be accurate
What is accuracy?
What is its formula?
For a classifier, accuracy is the number of items categorised correctly divided by the total number of items.
It is simply what fraction of the time the classifier is correct.
What is the error rate?
Represents the proportion of the incorrectly classified samples.
When is accuracy an inappropriate measure?
Accuracy is an inappropriate measure for unbalanced classes.
eg when we have a rare event we are trying to predict.
- The null model (predicting that the event never happens) is very accurate, and can be more accurate than a useful classifier
- Accuracy is not a good measure for events that have unbalanced distribution or unbalanced costs (different costs of “type 1” and “type 2” errors)
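A small illustration of why accuracy misleads on unbalanced classes: with a 1% positive rate, the null model that never predicts the event is 99% accurate yet detects nothing:

```python
# 1000 cases, only 10 of which are the rare positive event
actual = [1] * 10 + [0] * 990

# Null model: predict that the event never happens
predicted = [0] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.99 -- yet the model never detects a single positive case
```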
What is the sensitivity of a model?
True positive rate (TPR)
Measures the proportion of positive examples that were correctly classified.
It is the number of true positives divided by the total number of positives (both correctly and incorrectly classified)
What is the specificity of a model?
True negative rate (TNR)
Measures the proportion of negative examples that were correctly classified.
True negatives divided by the total number of negatives.
What do the pair of performance measures sensitivity and specificity capture?
The tradeoff / balance between predictions that are overly conservative and overly aggressive.
- Think email spam filter
What are sensitivity and specificity measures of?
They are measures of effect.
What fraction of class members are identified as positive and what fraction of non-class members are identified as negative.
What is the range of sensitivity and specificity?
0 to 1
- Closer to 1 is more desirable
- A value of 1 corresponds to every prediction in the confusion matrix falling on the diagonal
We want to find balance between the two - a task that is often context-specific
What other technique can assist with understanding the trade-off between sensitivity and specificity?
ROC curve
- Receiver operating characteristic
What are precision and recall?
Performance evaluation metrics that come from the field of information retrieval.
They are intended to provide an indication of how interesting and relevant a classifier’s results are, or whether the predictions are diluted by meaningless noise.
What is precision?
The proportion of predicted positive examples that are truly positive (TP / (TP + FP)).
Precision describes how often a positive indication turns out to be correct. It is a measure of confirmation (when the classifier indicates positive, how often it is in fact correct).
What is recall?
A metric that describes how complete the results are. It is a measure of utility (how much the classifier finds out of what there is to find out).
The number of true positives over the total number of actual positives (TP / (TP + FN)).
Classifiers with a high recall capture a large portion of the positive examples, meaning that it has wide breadth.
eg high recall if the majority of spam messages are correctly identified
eg search engines with high recall return a large number of documents pertinent to the search query.
Discuss the trade-off between precision and recall.
It is easy to be precise if you target the easy to classify samples.
It is easy to have high recall by casting a very wide net, meaning that the model is overly aggressive in identifying the positive case.
High precision and high recall at the same time is very challenging.
We want to test a variety of models to find a combination of precision and recall that will meet the needs of the project.
What is the typical business need for accuracy?
“we need most of our decisions to be correct”
What is the typical business need for precision?
“Most of what we marked as spam needs to be spam”
What is the typical business need for recall?
“We want to cut down on the amount of spam a user sees by a factor of 10 (eliminate 90% of spam)”
What is the typical business need for sensitivity?
“We have to cut a lot of spam, otherwise the user won’t see a benefit”
What is the typical business need for specificity?
“We must be at least three nines on legitimate email; the user must see at least 99.9% of their non-spam email”
What do statistics do for understanding performance vs visualisations?
- Statistics attempt to boil model performance down to a single number
- Visualisations depict how a learner performs across a wide range of conditions
If two classifiers have similar accuracies, are they the same?
No, learning algorithms have different biases so they could have drastic differences in how they achieve their accuracy.
What do visualisations allow when comparing learners?
A method to understand trade-offs, by comparing learners side by side in a single chart
What does ROC stand for?
Receiver operating characteristic
What is the ROC curve used for?
It is commonly used to examine the trade-off between detecting true positives and avoiding false positives.
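A sketch of computing an ROC curve with scikit-learn's `roc_curve`, which returns the false positive rate and true positive rate at each score threshold (synthetic classes and scores for illustration):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Actual classes and the classifier's estimated probability of the positive class
actual = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

fpr, tpr, thresholds = roc_curve(actual, scores)
print(list(zip(fpr, tpr)))            # points on the ROC curve
print(roc_auc_score(actual, scores))  # area under the curve as a single summary
```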