Module 2: Chapter 4 - Supervised Learning – Part 2: Machine Learning Techniques Flashcards
What is a decision tree?
A decision tree is a supervised machine-learning technique that examines input features sequentially and is so-called because, pictorially, the model can be represented as a tree. At each node is a question, which branches an observation to another node or a leaf (a terminal node).
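A minimal sketch (not from the source) of this tree-of-questions picture using scikit-learn; the made-up "dividend" classification task and the feature names is_tech and earnings_yield are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: classify whether a stock pays a dividend from two
# illustrative features (names and values are assumptions, not from the source).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 500),        # "is_tech" score
                     rng.normal(10, 3, 500)])       # "earnings_yield"
y = ((X[:, 0] < 0.5) & (X[:, 1] > 9)).astype(int)   # 1 = pays a dividend

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node poses a question ("feature <= threshold?"); every path
# from the root ends at a leaf (terminal node) that carries the prediction.
print(export_text(tree, feature_names=["is_tech", "earnings_yield"]))
```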
What are CARTs?
Classification and regression trees (CARTs)
Why are CARTs popular?
CARTs are popular due to their interpretability, and for this reason, they are sometimes known as “white-box models,” in contrast to other techniques such as neural networks where the fitted model is very difficult to interpret.
What is a problem with CARTs?
CARTs typically achieve lower predictive accuracy than “black-box” techniques such as neural networks.
How to improve predictive accuracy of decision trees?
To improve their performance, trees are often combined using ensemble techniques such as random forests, bagging and boosting.
What are regression trees?
Decision trees can be applied both when the target is a continuous variable and when it is a categorical (qualitative) one. In the first case, we call them regression trees. The goal is to split the feature space into regions such that we minimize the residual sum of squares (RSS).
Since it is impossible to check all partitions of a feature space in a regression tree application, what can we do?
We employ a top-down recursive binary splitting approach. In this approach, we start with all the observations in one region and search for the split that produces the maximum reduction of the RSS; then, for each of the two regions obtained in this way, we look for a further best split, and we proceed recursively until a given stopping criterion is reached.
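A minimal sketch (not from the source, on made-up data) of a single greedy split search: for each feature and candidate threshold, compute the RSS of the two resulting regions and keep the split with the lowest total RSS. Recursive binary splitting simply repeats this search inside each resulting region.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two illustrative features
y = np.where(X[:, 0] > 0.3, 2.0, -1.0) + rng.normal(scale=0.2, size=200)

def rss(values):
    """Residual sum of squares around the region's mean prediction."""
    return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

best = (np.inf, None, None)                        # (total RSS, feature, threshold)
for j in range(X.shape[1]):
    for threshold in np.unique(X[:, j]):
        left, right = y[X[:, j] <= threshold], y[X[:, j] > threshold]
        total = rss(left) + rss(right)
        if total < best[0]:
            best = (total, j, threshold)

print(f"Best split: feature {best[1]} at threshold {best[2]:.3f}, RSS {best[0]:.2f}")
```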
How are classification trees estimated?
The objective is to split the data into groups that are as pure as possible (that is, each group contains as large a proportion of a single class as possible).
What measures are used as alternatives to the RSS in classification trees?
Two measures are generally considered: entropy and the Gini coefficient.
Entropy is a measure of disorder in a system
The Gini coefficient is a measure of the impurity of a node: a small value indicates that the node mostly contains instances from the same class.
Gini and entropy usually lead to very similar decision trees.
How is purity/impurity for the Gini coefficient defined?
Ideally, a particular question will provide a perfect split between categories – i.e., each terminal node will be a pure set. For instance, if it had been the case that no technology stocks paid a dividend, this would be highly beneficial information and the node containing technology stocks would be pure (that is, it would only contain non-dividend-paying stocks). On the other hand, the worst possible scenario would be where exactly half of the technology stocks paid a dividend and the other half did not, in which case knowing only whether a company was a tech stock would be much less useful.
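A small sketch (not from the source) of the two impurity measures, applied to the pure node and the half-and-half node described above; the functions implement the standard entropy and Gini impurity over the class proportions.

```python
import numpy as np

def entropy(p):
    """Entropy: -sum p_k * log2(p_k) over classes with non-zero proportion."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini(p):
    """Gini impurity: 1 - sum p_k^2; zero for a pure node."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

# Pure node (e.g., no technology stock pays a dividend): impurity is zero.
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0  0.0
# Worst case: half the technology stocks pay a dividend and half do not.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0  0.5
```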
What is pruning in the context of decision trees?
Small trees offer several advantages over large trees: interpretability, fewer irrelevant features and, especially, avoidance of overfitting. In addition to the use of a separate testing sub-sample, overfitting can be prevented by applying stopping rules specified a priori (pre- or online pruning) or by pruning the tree after it has been grown (post-pruning).
One example of a stopping rule is that, once a certain number of branches has been reached, no further splitting is allowed. Another example is to stop splitting a node if the number of observations under that node falls below a given threshold.
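In scikit-learn, for example, such stopping rules map onto hyperparameters of the tree estimator; a sketch with arbitrary illustrative values (not from the source):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning via a priori stopping rules: the tree stops splitting once it
# reaches a given depth, or when a node holds too few observations.
pre_pruned_tree = DecisionTreeClassifier(
    max_depth=4,            # cap the number of levels of branching
    min_samples_split=50,   # do not split nodes with fewer than 50 observations
    min_samples_leaf=20,    # every leaf must keep at least 20 observations
)
# pre_pruned_tree.fit(X_train, y_train)  # X_train / y_train assumed to exist
```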
What are the differences and benefits of pre- and post-pruning?
Whereas pre-pruning prevents a tree from growing too much, post-pruning consists of growing the tree fully and then identifying ‘weak links’ ex post. In other words, it consists of replacing some subtrees with leaves whose label is the class of most of the instances that reach the subtree in the original classifier (model). The various pruning algorithms can be distinguished as top-down or bottom-up approaches depending on whether they start at the root or at the leaves of the tree.
What is reduced error pruning?
One of the simplest forms of pruning is reduced error pruning, which is a bottom-up approach. Starting from the leaves, the algorithm replaces a node with a leaf labeled with its most popular class whenever the resulting pruned tree performs no worse than the original tree on the validation sample.
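A toy sketch of the replace-and-check logic (not from the source); the dict-based node layout, the toy tree and the tiny validation set are made-up illustrations.

```python
import numpy as np

def predict_one(node, x):
    """Route one observation down the tree until a leaf label is reached."""
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(root, X, y):
    preds = np.array([predict_one(root, x) for x in X])
    return float(np.mean(preds == y))

def reduced_error_prune(node, root, X_val, y_val):
    """Bottom-up: tentatively replace each internal node with a leaf holding its
    majority class; keep the change if validation accuracy does not fall."""
    if "label" in node:
        return
    reduced_error_prune(node["left"], root, X_val, y_val)
    reduced_error_prune(node["right"], root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    backup = dict(node)
    node.clear()
    node["label"] = backup["majority"]          # prune: the node becomes a leaf
    if accuracy(root, X_val, y_val) < before:
        node.clear()
        node.update(backup)                     # revert: pruning hurt accuracy

# Toy tree whose right-hand subtree is redundant (both its leaves predict class 1).
tree = {"feature": 0, "threshold": 0.5, "majority": 1,
        "left": {"label": 0},
        "right": {"feature": 1, "threshold": 0.3, "majority": 1,
                  "left": {"label": 1}, "right": {"label": 1}}}
X_val = np.array([[0.2, 0.1], [0.8, 0.2], [0.9, 0.7]])
y_val = np.array([0, 1, 1])
reduced_error_prune(tree, tree, X_val, y_val)
print(tree)   # the redundant subtree has collapsed into a single leaf
```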
What is cost complexity pruning?
It consists of adding a penalty term to the RSS such that a trade-off is established between accuracy over the training sample and the number of terminal nodes. The extent of the trade-off is determined by a tuning parameter alpha, which is chosen by cross-validation.
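In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter; a sketch (made-up data, illustrative settings) of picking alpha by cross-validation over the pruning path:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate alphas taken from the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},   # larger alpha => fewer terminal nodes
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```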
Which three ensemble techniques are used in the context of decision trees?
1) Bootstrap Aggregation (a.k.a. bagging)
2) Random Forests
3) Boosting
How does bootstrap aggregation work?
It involves bootstrapping (sampling with replacement) from among the training sample to create multiple decision trees. The resulting predictions or classifications are aggregated to construct a new prediction or classification.
Steps:
1) Sample a subset of the complete training set. For example, if the training set consists of 100,000 observations, sample 10,000.
2) Construct a decision tree in the usual fashion.
3) Repeat steps 1 and 2 many times, sampling with replacement, so that an observation in one subsample can also be in another subsample.
4) If the problem is a regression, average across the forecasts of the several trees to obtain the final prediction. If the problem is a classification, record the class predicted by each of the trees and take a majority vote: the class predicted by most of the trees is the overall prediction.
Side note: Because the data are sampled with replacement, some observations will not appear at all in a given replication. The observations that were not selected (called out-of-bag data) will not have been used for estimation in that replication and can be used to evaluate model performance.
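A sketch of the procedure using scikit-learn's BaggingRegressor on made-up data (the base learner defaults to a decision tree); the sample sizes and settings are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=1000)

bag = BaggingRegressor(
    n_estimators=200,     # steps 1-3: draw 200 bootstrap samples, one tree each
    max_samples=0.1,      # each tree sees a 10% sub-sample of the training set
    bootstrap=True,       # with replacement (bootstrap=False would give pasting)
    oob_score=True,       # evaluate each tree on its out-of-bag observations
    random_state=0,
).fit(X, y)

print(bag.oob_score_)     # out-of-bag R^2, a built-in estimate of performance
# bag.predict(...) averages the 200 trees' forecasts (step 4); a
# BaggingClassifier would instead take a majority vote.
```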
What is pasting in the context of bootstrap aggregation?
Pasting is an approach identical to bagging except that sampling takes place without replacement, so that each datapoint can be drawn at most once across the sub-samples. With 100,000 items in the training set and sub-samples of 10,000, there would therefore be a total of 10 sub-samples.
How do random forests work?
Aggregating several forecasts works particularly well when the different learners exhibit low correlation. Random forests provide an improvement over bagging by reducing correlation across the trees. To achieve this, each time that a tree is split, only a random subset of all the features is considered.
The logic is that, if, for instance, a feature is a very strong predictor whereas the rest only have modest predictive power, all the resulting trees will have this feature at the top and they are likely to yield very similar forecasts. In contrast, by forcing some trees to deliberately ignore this strong predictor, the other features are given a chance, and the resulting forecasts are less correlated.
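A sketch with scikit-learn on made-up data; max_features controls the size of the random feature subset considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)   # one strong predictor, one weak

forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",   # only sqrt(10) ~ 3 randomly chosen features per split,
                           # so not every tree can lean on the strong predictor
    random_state=0,
).fit(X, y)

print(forest.feature_importances_.round(2))
```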
How does boosting work?
Like bagging, boosting entails combining the forecasts from many decision trees. However, while in bagging each tree is grown independently from the others, in boosting each tree is grown while exploiting the information from the prediction errors of the previously grown trees. The two main varieties of boosting are gradient boosting and adaptive boosting (so-called AdaBoost).
What is the difference between gradient boosting and adaptive boosting (AdaBoost)?
Gradient boosting constructs a new model on the residuals of the previous one, which then become the target—in other words, the labels in the training set are replaced with the residuals from the previous iteration. AdaBoost involves training a model with equal weights on all observations and then sequentially increasing the weight on misclassified outputs to incentivize the classifier to focus more on those cases.
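A minimal gradient-boosting sketch on made-up data, written out by hand to show the residuals-become-the-target idea (scikit-learn's GradientBoostingRegressor and AdaBoostClassifier implement the two varieties directly); the learning rate and number of rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

learning_rate, n_rounds = 0.1, 200
prediction = np.full_like(y, y.mean())       # start from a constant forecast
trees = []
for _ in range(n_rounds):
    residuals = y - prediction               # residuals replace the labels
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # shrink each tree's contribution
    trees.append(tree)

print(float(np.mean((y - prediction) ** 2)))  # in-sample MSE falls as rounds increase
```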
How does K Nearest Neighbors work?
K nearest neighbors (KNN) is a simple, intuitive, supervised machine-learning model that can be used for either classification or predicting the value of a target variable. To predict the outcome (or the class) for an observation not in the training set, we search for the K observations in the training set that are closest to it using one of the standard distance measures (such as Euclidean distance).
If the problem is a regression one, the prediction is the mean of the nearest neighbors’ outcomes. If the problem is a classification one, the instance to be classified is assigned to the class to which most of the nearest neighbors belong (an approach known as majority voting).
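A small sketch of the KNN logic written with NumPy (not from the source; the Euclidean distance, the helper name knn_predict and the made-up data are illustrative choices).

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5, regression=False):
    """Find the k training observations closest to x_new (Euclidean distance),
    then average their targets (regression) or take a majority vote (classification)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if regression:
        return float(y_train[nearest].mean())
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Illustrative use on made-up data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] > 0).astype(int)
print(knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5))   # likely class 1
```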