Chapter 5 Flashcards
What is data mining?
The process of extracting valuable information from datasets.
- Processing data and identifying patterns and trends in the information
- Help make predictions on future trends by analysing past data
- Identify relationships between different pieces of data
What is the aim of supervised learning?
To build a model that makes predictions based on evidence in the presence of uncertainty.
A supervised learning algorithm takes a set of known input data with known responses (targets) and trains a model to generate predictions for the responses to new data.
When thinking of an entire set of input data for supervised learning as a heterogeneous mix, what do the columns and rows represent?
How can you think of the target data?
- Columns are called predictors / attributes / features and represent a measurement taken on every subject
- Rows are called observations / examples / instances and each contain a set of measurements for a subject
- Target data can be thought of as a column vector where each row contains the output of the corresponding observation in the input data
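As an illustration of this layout (the values and class names here are toy examples, not from the source):

```python
# Each row is an observation/example; each column is a predictor/feature,
# i.e. one measurement taken on every subject.
X = [
    [5.1, 3.5],  # observation 1: two measurements on one subject
    [4.9, 3.0],  # observation 2
    [6.2, 2.9],  # observation 3
]
# The target is a column vector: one output per observation,
# aligned row-for-row with the input data.
y = ["class A", "class A", "class B"]
assert len(y) == len(X)  # one target per observation
```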
What are the two categories of supervised learning algorithms and what do they depend on?
Classification and regression
Depends on what the target feature is
What is classification used for?
Where the target feature to be predicted is a categorical feature (class) and is divided into categories called levels.
How many levels can a class have?
Two or more levels
- Yes / No
- A / B / C
The levels may or may not be ordinal
What is regression used for?
To predict a continuous measurement for an observation (target variables are real numbers)
Define the training dataset and test dataset.
- Training dataset: the set of known input data and known targets. Its purpose is to generate the predictive model.
- Test dataset: the set of new data that is unknown to the model. Its purpose is to assess the accuracy of the model.
How are the training and test datasets often obtained?
Partitioning the raw (given) dataset
What are the most popular partitioning data methods?
- The holdout partitioning method
- The K-Fold Cross-Validation Partitioning method
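The k-fold method named above can be sketched in plain Python (the function name, fold count, and seed are illustrative, not from the source):

```python
import random

def k_fold_indices(n_rows, k=5, seed=42):
    """Split row indices into k roughly equal folds.

    Each fold serves once as the test set while the remaining
    k-1 folds together form the training set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds, round-robin style.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds)
                     if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits

splits = k_fold_indices(10, k=5)
# 5 splits; every row appears in exactly one test fold
```

Every observation is used for testing exactly once, which makes the performance estimate less sensitive to a single lucky or unlucky split.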
Describe the holdout partitioning method.
In the holdout partitioning method, the raw dataset is divided into training and test datasets based on some predefined percentage.
What is the usual amount of data held out for testing?
- 1/3 for testing
- 2/3 for training
This proportion can vary depending on the amount of available data
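A minimal sketch of the 1/3–2/3 holdout split using only Python's standard library (the helper name and seed are hypothetical):

```python
import random

def holdout_split(records, test_fraction=1/3, seed=0):
    """Randomly partition records into (train, test) sets.

    Illustrative sketch: shuffle, then hold out the first
    test_fraction of the shuffled records for testing.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = holdout_split(range(90))
# 2/3 of the 90 records (60) go to training, 1/3 (30) to testing
```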
Why do you need to ensure samples are randomly divided into the two groups (test and train)?
To ensure there are no systematic differences between the training and test data.
Why do we need a test set?
We can’t say how good our model is if we don’t have known target values to compare its predictions against
How do you ensure the holdout method results in a truly accurate estimate of the future performance?
By ensuring that the performance on the test dataset is not allowed to influence the model.
- For example, after building several models on the training data, don’t cherry-pick the one with the highest accuracy on the test data. Cherry-picking means the test performance is not an unbiased measure of the performance on unseen data.
How can you overcome this problem?
In addition to training and test datasets, create a validation dataset.
What is a validation dataset?
The validation dataset is used for iterating and refining the model(s), so the test dataset can be kept completely separate until the end.
A small portion of the data is set aside purely to fine-tune the model.
What does the use of a validation dataset mean for the test dataset?
The test dataset is only used once as a final step to report an estimated error rate for future predictions.
What is a typical split between training, test and validation data?
50 / 25 / 25
Varies depending on the size of the dataset.
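The three-way split above can be sketched as follows (a standard-library illustration; the function name and default proportions are assumptions matching the typical split):

```python
import random

def three_way_split(records, fractions=(0.50, 0.25, 0.25), seed=7):
    """Partition records into training / validation / test sets
    in the given proportions (illustrative sketch)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

train, val, test = three_way_split(range(100))
# 50 / 25 / 25 records respectively
```

The test partition is set aside and touched only once, at the very end, to report the estimated error rate.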
What is a simple method to create holdout samples?
Use random number generators to assign records to partitions.
What is the problem with holdout sampling this way?
Each partition may have a larger or smaller proportion of some classes.
Particularly when a class makes up a very small proportion of the dataset, it can end up omitted from the training dataset entirely. This is a significant problem, because the model will not be able to learn that class.
What is the problem if a class is not in the training dataset?
The model will not be able to learn it.
How do you account for this?
Use a technique called stratified random sampling.
This guarantees that the random partitions have nearly the same proportion of each class as the full dataset, even when some classes are small. (We want to make sure a particular class isn’t omitted from either partition.)
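Stratified random sampling can be sketched by splitting each class separately and then combining the per-class partitions (the helper name and seed are illustrative, not from the source):

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=1):
    """Sample ~test_fraction of each class separately, so both
    partitions keep nearly the same class proportions as the
    full dataset. Returns (train_indices, test_indices)."""
    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    # Hold out the same fraction within every class.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# 60 samples of class "a" and 30 of class "b": the 2:1 ratio
# is preserved in both the training and the test partition.
labels = ["a"] * 60 + ["b"] * 30
train_idx, test_idx = stratified_holdout(labels)
```

Because the split is done within each class, even a rare class contributes rows to both partitions (as long as it has at least a few observations).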
Stratified random sampling distributes the classes evenly, but what can it not guarantee?
Other types of representativeness
- Eg some samples may have too many/few difficult cases, easy-to-predict cases or outliers.
- This is especially true for smaller datasets, which may not have a large enough portion of such cases to be divided among the training and test sets.