Module 2: Chapter 4 - Supervised Learning – Part 2: Machine Learning Techniques Flashcards

1
Q

What is a decision tree?

A

A decision tree is a supervised machine-learning technique that examines input features sequentially and is so-called because, pictorially, the model can be represented as a tree. At each node is a question, which branches an observation to another node or a leaf (a terminal node).

2
Q

What are CARTs?

A

Classification and regression trees (CARTs)

3
Q

Why are CARTs popular?

A

CARTs are popular due to their interpretability, and for this reason, they are sometimes known as “white-box models,” in contrast to other techniques such as neural networks where the fitted model is very difficult to interpret.

4
Q

What is a problem with CARTs?

A

CARTs typically perform less well than “black-box” techniques, such as neural networks, in terms of predictive accuracy.

5
Q

How can the predictive accuracy of decision trees be improved?

A

To improve their performance, trees are often combined using ensemble techniques such as random forests, bagging, and boosting.

6
Q

What are regression trees?

A

Decision trees can be applied both when the target is a continuous variable and when it is a categorical (qualitative) one. In the first case, we call them regression trees. The goal is to split the feature space into regions such that we minimize the residual sum of squares (RSS).

7
Q

Since it is impossible to check all partitions of a feature space in a regression tree application, what can we do?

A

We employ a top-down recursive binary splitting approach. In this approach, we start with all the observations in one region and search for the split that produces the maximum reduction of the RSS; then, for each of the two regions obtained in this way, we look for a further best split, and we proceed recursively until a given stopping criterion is reached.
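
To make the split search concrete, here is a minimal Python sketch (assuming numpy) of finding the single best split point on one feature by minimizing the RSS; the function name and synthetic data are illustrative, and a full CART would repeat this search over every feature and every region recursively.

    import numpy as np

    def best_split(x, y):
        # Find the threshold on one feature x that minimizes the combined RSS
        # of the two resulting regions (illustrative helper, not a full CART).
        best_t, best_rss = None, np.inf
        for t in np.unique(x)[:-1]:                      # candidate split points
            left, right = y[x <= t], y[x > t]
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best_rss:
                best_t, best_rss = t, rss
        return best_t, best_rss

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = np.where(x < 4, 1.0, 3.0) + rng.normal(scale=0.3, size=200)
    print(best_split(x, y))                              # the chosen threshold should be close to 4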

8
Q

How are classification trees estimated?

A

As with regression trees, a recursive splitting procedure is used, but the objective is now to split the data into groups that are as pure as possible (that is, each containing as large a proportion of a single class as possible).

9
Q

What measures are used as alternatives to the RSS in classification trees?

A

Two measures are generally considered: entropy and the Gini coefficient.

Entropy is a measure of disorder in a system.

The Gini coefficient is a measure of the impurity of a node. A small value of the Gini index indicates that a node mostly contains instances from the same class.

Gini and entropy usually lead to very similar decision trees.
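
As a small illustration, the two impurity measures can be computed from a node's class proportions as follows (a hedged numpy sketch; the function names are illustrative):

    import numpy as np

    def entropy(p):
        # entropy of a node given its class proportions p (0*log0 treated as 0)
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def gini(p):
        # Gini impurity of a node given its class proportions p
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # maximally impure two-class node
    print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node: both measures equal 0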

10
Q

How is purity/impurity for the Gini coefficient defined?

A

Ideally, a particular question will provide a perfect split between categories – i.e., each terminal node will be a pure set. For instance, if it had been the case that no technology stocks paid a dividend, this would be highly beneficial information and the node containing technology stocks would be pure (that is, it will only contain non dividend paying stocks). On the other hand, the worst possible scenario would be where exactly half of the technology stocks paid a dividend, and the other half did not, in which case having information only on whether a company was a tech stock or not would be much less useful.

11
Q

What is pruning in the context of decision trees?

A

Small trees offer several advantages over large trees: interpretability, fewer irrelevant features and, especially, avoidance of overfitting. As well as employing a separate testing sub-sample, overfitting can be prevented by using stopping rules specified a priori (pre- or online pruning) or by pruning the tree after it has been grown (post pruning).

One example of a stopping rule is that once a certain number of branches has been reached, no further splitting is allowed. Another is to stop splitting a node if the number of observations under that node falls below a given threshold.

12
Q

What are the differences and benefits of pre- and post-pruning?

A

Whereas pre-pruning prevents a tree from growing too much, post-pruning consists of growing the tree fully and then identifying ‘weak links’ ex post. In other words, it consists of replacing some subtrees with leaves whose label is the class of most of the instances that reach the subtree in the original classifier (model). There are several pruning algorithms, which can be distinguished as top-down or bottom-up approaches depending on whether they start at the root or at the leaves of the tree.

13
Q

What is reduced error pruning?

A

One of the simplest forms of pruning is reduced error pruning, which is a bottom-up approach. Starting at the bottom, this algorithm replaces a node with its most popular class any time that the resulting pruned tree does not perform worse than the original tree in the validation sample.

14
Q

What is cost complexity pruning?

A

It consists of adding a penalty term to the RSS such that a trade-off between accuracy over the training sample and the number of terminal nodes is established. The extent of the trade-off is determined by a tuning parameter, alpha, which is chosen with cross-validation.
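
As a sketch of how this can be done in practice, scikit-learn exposes cost complexity pruning through a ccp_alpha parameter, and the candidate alphas can be tuned by cross-validation roughly as follows (the synthetic data and variable names are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=300)

    # candidate alpha values along the pruning path of the fully grown tree
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

    # choose the alpha with the best cross-validated score
    scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                              X, y, cv=5).mean()
              for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    print(best_alpha)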

15
Q

Which three ensemble techniques are used in the context of decision trees?

A

1) Bootstrap Aggregation (a.k.a. bagging)
2) Random Forests
3) Boosting

16
Q

How does bootstrap aggregation work?

A

It involves bootstrapping (sampling with replacement) from among the training sample to create multiple decision trees. The resulting predictions or classifications are aggregated to construct a new prediction or classification.

Steps:

1) Sample a subset of the complete training set. For example, if the training set consists of 100,000 observations, sample 10,000.

2) Construct a decision tree in the usual fashion.

3) Repeat steps 1 and 2 many times, sampling with replacement, so that an observation in one subsample can also be in another subsample.

4) If the problem is a regression, average across the forecasts of the several trees to obtain the final prediction. If the problem is a classification, record the class predicted by each of the trees and take a majority vote: the class predicted by most of the trees is the overall prediction.

Side note: Because the data are sampled with replacement, some observations will not appear at all. The observations that were not selected (called out-of-bag data) will not have been used for estimation in that replication and can be used to evaluate model performance
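
A minimal Python sketch of these steps for a regression problem, bootstrapping sub-samples and averaging the trees' forecasts (the sample sizes and synthetic data are illustrative; for a classification problem the final line would take a majority vote instead of a mean):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=1000)

    n_trees, subsample = 100, 200
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=subsample, replace=True)     # step 1: bootstrap with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # step 2: grow a tree on the sub-sample

    X_new = rng.normal(size=(5, 3))
    y_hat = np.mean([t.predict(X_new) for t in trees], axis=0)     # step 4: average the forecasts
    print(y_hat)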

17
Q

What is pasting in the context of bootstrap aggregation?

A

Pasting is an approach identical to bagging except that sampling takes place without replacement (so that each datapoint can only be drawn at most once in any replication). In pasting with 100,000 items in the training set and sub-samples of 10,000, there would be a total of 10 sub-samples.

18
Q

How do random forests work?

A

Aggregating several forecasts works particularly well when the different learners exhibit low correlation. Random forests provide an improvement over bagging by reducing correlation across the trees. To achieve this, each time that a tree is split, only a random subset of all the features is considered

The logic is that, if, for instance, a feature is a very strong predictor whereas the rest only have modest predictive power, all the resulting trees will have this feature at the top and they are likely to yield very similar forecasts. In contrast, by forcing some trees to deliberately ignore this strong predictor, the other features are given a chance, and the resulting forecasts are less correlated.
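
In scikit-learn, for example, the size of the random feature subset is controlled by the max_features argument; the sketch below is illustrative (synthetic data, arbitrary parameter values) and also reports the out-of-bag score discussed in the bagging card:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 16))
    y = (X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # max_features="sqrt": only sqrt(16) = 4 randomly chosen features are considered at each split,
    # which de-correlates the trees; using all features at every split would amount to plain bagging
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                    oob_score=True, random_state=0)
    forest.fit(X, y)
    print(forest.oob_score_)   # accuracy estimated from the out-of-bag observations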

19
Q

How does boosting work?

A

Like bagging, boosting entails combining the forecasts from many decision trees. However, while in bagging each tree is grown independently from the others, in boosting each tree is grown while exploiting the information from the prediction errors of the previously grown trees. The two main varieties of boosting are gradient boosting and adaptive boosting (so-called AdaBoost).

20
Q

What is the difference between gradient boosting and adaptive boosting (AdaBoost)?

A

Gradient boosting constructs a new model on the residuals of the previous one, which then become the target—in other words, the labels in the training set are replaced with the residuals from the previous iteration. AdaBoost involves training a model with equal weights on all observations and then sequentially increasing the weight on misclassified outputs to incentivize the classifier to focus more on those cases.
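
A minimal sketch of the gradient-boosting idea for a regression problem: each new shallow tree is fit to the residuals of the current ensemble, and a shrunken version of its prediction is added to the running forecast (the learning rate, tree depth, and synthetic data are illustrative choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.2, size=500)

    learning_rate, n_rounds = 0.1, 100
    prediction = np.full_like(y, y.mean())       # start from the unconditional mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                          # the residuals become the new target
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)       # update the ensemble forecast
        trees.append(tree)

    print(np.mean((y - prediction) ** 2))        # in-sample MSE falls as rounds are added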

21
Q

How does K Nearest Neighbors work?

A

K nearest neighbors (KNN) is a simple, intuitive, supervised machine-learning model that can be used for either classification or predicting the value of a target variable. To predict the outcome (or the class) for an observation not in the training set, we search for the K observations in the training set that are closest to it using one of the distance measures.

Our prediction is the mean of the nearest neighbors’ outcomes. If the problem is a classification one, the instance to be classified is assigned to the class to which most of the nearest neighbors belong (an approach that is known as majority voting).

22
Q

Why is KNN sometimes called a lazy learner?

A

KNN is sometimes termed a lazy learner because it does not learn the relationships in the dataset in the way that other approaches do. In other words, it does not actually build a model; instead, every time KNN encounters a new instance, it compares it to all the existing instances to make a prediction.

23
Q

What are the steps in a typical KNN implementation?

A

The steps involved in a typical KNN implementation are as follows:

1) Select a value of K and a distance measure, usually either the Euclidean or the Manhattan measure.
2) Among the points in the training sample, identify the K points in feature space that are closest, according to the chosen measure, to the point for which a prediction is to be made.
3) a) If it is a prediction problem, compute the mean of the outcomes for the K neighbors that have been identified, or
   b) If it is a classification problem, assign the instance to the class to which most of the nearest neighbors belong.
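
A minimal numpy sketch of these steps for a classification problem, using the Euclidean distance and majority voting (the value of K, the function name, and the synthetic data are illustrative):

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=5):
        # classify x_new by majority vote among its k nearest (Euclidean) neighbors
        dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))   # distances to all training points
        nearest = np.argsort(dists)[:k]                           # indices of the k closest points
        labels = y_train[nearest]
        values, counts = np.unique(labels, return_counts=True)
        # for a prediction (regression) problem, return labels.mean() instead
        return values[np.argmax(counts)]                          # majority vote

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
    print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=5))   # expected class: 1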

24
Q

Why is scaling applied in KNN?

A

As with the k-means algorithm, it is crucial that the features be standardized, because the distances between data points depend dramatically on the features' scales. If the data are not scaled, features measured on the largest scales would dominate the distance calculations and hence be given the most importance.
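
A tiny illustration of the problem and of a z-score fix (the feature values are made up): with one feature in the tens of thousands and one in single digits, the unscaled Euclidean distance is driven almost entirely by the first feature, whereas standardization puts the two on a comparable footing.

    import numpy as np

    # two features on very different scales, e.g., market cap (millions) and dividend yield (percent)
    X = np.array([[50_000.0, 2.0],
                  [50_010.0, 8.0],
                  [90_000.0, 2.1]])

    # unscaled distances from the first observation: dominated by the first column
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

    # z-score standardization: subtract the mean and divide by the standard deviation of each feature
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))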

25
Q

What is the impact of choosing K in KNN?

A

Small values of K tend to overfit the data, whereas large values of K may underfit. A common choice is to set K approximately equal to the square root of N, the total size of the training sample; so, if N = 5,000 points, then K ≈ 71. Another approach is to use cross-validation to tune K and choose the value that minimizes the error over the validation sample.

26
Q

What are the pros and cons of KNN?

A

KNN is simple and yet tends to yield quite accurate forecasts. However, its main disadvantage is that it is computationally intensive, as the distances between one instance and all the others must be computed before KNN can identify the nearest neighbors. Another drawback is that it performs poorly when there are even a few irrelevant or noisy features, as those can drive similar instances apart in the feature space.

27
Q

What is the purpose of Support Vector Machines?

A

Support vector machines (SVMs) are a class of supervised machine-learning models that are particularly well suited to classification problems when there are large numbers of features.

The objective is to identify the position of a line (technically, the classification boundary) that best separates the two groups, enabling us to predict whether the outcome for an additional data point not in the sample should be −1 or +1.

28
Q

Explain the margin in SVM

A

More generally, there is an infinite number of linear boundaries that perfectly classify the data. Therefore, we need a metric that helps us to identify which of the boundaries is the most appropriate. SVM uses a metric called margin. Broadly, the margin is the sum of the distances between the classification boundary and the closest instance in the training data for each of the two classes. Given a classification boundary, it is possible to construct two lines that are parallel to it and that touch the training data of opposite classes on either side, and that have no data points between them.

The training data points on these two lines are called support vectors and the distance between the two lines is the margin; the two lines are sometimes referred to as margin constraints. The optimal classification boundary, also known as the maximum margin classifier, is such that it is equidistant from each support vector and the margin is maximized.
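
As an illustrative sketch, scikit-learn's SVC with a linear kernel can be fit to separable toy data; a very large C approximates a hard margin, and the fitted support vectors can be inspected directly (the data and parameter values are made up):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # two roughly separable clouds labeled -1 and +1
    X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C: close to a maximum margin classifier
    print(clf.support_vectors_)                   # the support vectors found during fitting
    print(clf.predict([[3.0, 3.0]]))              # classify a new point; expected label +1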

29
Q

How can SVMs handle classification problems where the classes cannot be easily separated?

A

In such cases, where the data are not linearly separable, the approach described above cannot be used without modification. A more flexible method is to use soft margins, which introduce a penalty term into the optimization for incorrect classifications. Further extensions are available that allow the classification boundary to be nonlinear.

30
Q

What is an ANN?

A

Artificial neural networks (ANNs) are a class of machine-learning approaches loosely modeled on how the brain performs computation. By far the most common type of ANN is a feedforward network with backpropagation, sometimes known as a multi-layer perceptron.

31
Q

Explain the logic behind neurons in ANNs?

A

The basic unit of a multi-layer perceptron is the neuron, a unit that holds information. The neurons are arranged in layers and a multi-layer perceptron consists of several layers of neurons.

Consider a simple multi-layer perceptron with three layers: the input layer, a hidden layer, and the output layer. The input layer has three neurons, each containing one of three features, x1, x2, and x3. Each neuron in the input layer is connected to each neuron in the hidden layer (there are three of them in this case). Every connection carries a weight, w_ji^(h), where the notation means that the weight connects neuron j in layer h with neuron i in the next layer. Each input is multiplied by the weight associated with its link and passed to the hidden layer; each neuron in the hidden layer receives as input the weighted sum of the features and transforms it through an activation function, f(.).

The term b1 is called a bias and it is often added to the weighted sum of the features. It is a constant, like the intercept in a standard linear regression.

32
Q

What is the purpose of activation functions in ANN?

A

The activation function introduces nonlinearity into the relationship between the inputs and output. Without it, the outputs from the model would merely be linear combinations of the hidden layer(s), which would, in turn, be linear combinations of the inputs. Such a structure would be, in essence, a linear regression and not of interest because the purpose of a neural network is to discover complex nonlinear relationships.

33
Q

What is feeding forward in ANNs?

A

The neurons in the hidden layer become inputs for the next layer (in this case, the output layer); they are multiplied by the weights w_11^(2), w_21^(2), and w_31^(2) and again transformed by means of an activation function. The result is the prediction of the output. The process of propagating the attributes from the input layer to the output is called feeding forward.
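
A numpy sketch of feeding a single observation forward through the 3-3-1 network described above, with a sigmoid activation in the hidden layer and a linear output neuron (the weights here are random placeholders that training would normally determine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = np.array([0.2, -1.0, 0.5])             # the three input features x1, x2, x3

    W1 = rng.uniform(-0.1, 0.1, size=(3, 3))   # weights w_ji^(1): input layer -> hidden layer
    b1 = np.zeros(3)                           # biases added to the hidden neurons' weighted sums
    W2 = rng.uniform(-0.1, 0.1, size=(3, 1))   # weights w_j1^(2): hidden layer -> output neuron
    b2 = np.zeros(1)

    h = sigmoid(x @ W1 + b1)   # hidden layer: activation of the weighted sum plus bias
    y_hat = h @ W2 + b2        # output layer: the network's prediction
    print(y_hat)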

34
Q

What is Deep Learning?

A

Deep learning refers to machine learning methods utilizing multiple neural network layers to extract the nonlinear relationships embedded in the data being modeled. The deep learning methods are widely used in natural language processing, generative artificial intelligence, image processing and other areas.

35
Q

What are typical activation functions in ANNs?

A

There are several activation functional forms in common usage. The logistic (sigmoid) function we encountered in connection with logistic regression is a popular choice.

Other examples of activation functions are the softmax function, the rectified linear unit (ReLU), the leaky ReLU, and the hyperbolic tangent.

Notably, each layer can employ a different activation function (while all the neurons in the same layer apply the same activation function). However, it is common to use the same activation function for all the hidden layers (when there is more than one) and a different function for the output layer.
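
For concreteness, numpy versions of the activation functions listed above might look as follows (a hedged sketch; the leaky-ReLU slope is an arbitrary choice):

    import numpy as np

    def sigmoid(z):             # logistic function, squashes values into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):                # rectified linear unit: max(0, z)
        return np.maximum(0.0, z)

    def leaky_relu(z, a=0.01):  # small slope a for negative inputs instead of a hard zero
        return np.where(z > 0, z, a * z)

    def tanh(z):                # hyperbolic tangent, squashes values into (-1, 1)
        return np.tanh(z)

    def softmax(z):             # maps a vector to probabilities summing to 1 (numerically stabilized)
        e = np.exp(z - np.max(z))
        return e / e.sum()

    print(softmax(np.array([1.0, 2.0, 3.0])))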

36
Q

What is backpropagation?

A

The strategy is to find weights such that some measure of classification or prediction error, a loss function, typically residual sum of squares or mean squared error (MSE), is minimized.

The procedure is recursive. At the beginning, the weights are assigned random values (typically within the range [−0.1, 0.1]). Then, a first training example is forward propagated to the network’s output. Then the weights are updated using the error between the calculated and actual values of the labels. After updating the weights, a new example is introduced, and the weights are updated again. When the last training example is introduced, one epoch (i.e., iteration) is completed.

In general, the successful training of a network requires many epochs (typically, tens of them for a simple architecture but it could be in the thousands depending on the complexity of the network and the convergence rate of the optimizer). Therefore, neural networks are usually regarded as a computationally intensive technique.
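
A minimal numpy sketch of this recursion for a network with one hidden layer: the weights start at small random values, each training example is fed forward, and the weights are then updated from the prediction error; one pass over all examples is one epoch. The architecture, learning rate, and synthetic data below are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                     # 200 examples, 3 features
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2          # a nonlinear target

    n_hidden, lr = 8, 0.05
    W1 = rng.uniform(-0.1, 0.1, size=(3, n_hidden))   # random starting weights in [-0.1, 0.1]
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.1, 0.1, size=(n_hidden, 1))
    b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(50):                           # one epoch = one pass over the training set
        for x_i, y_i in zip(X, y):
            h = sigmoid(x_i @ W1 + b1)                # forward pass: hidden activations
            y_hat = h @ W2 + b2                       # forward pass: output
            err = y_hat - y_i                         # prediction error
            grad_W2 = np.outer(h, err)                # backward pass: gradients of 0.5 * err**2
            grad_b2 = err
            grad_h = (W2 @ err) * h * (1 - h)         # chain rule back through the sigmoid
            grad_W1 = np.outer(x_i, grad_h)
            grad_b1 = grad_h
            W2 -= lr * grad_W2; b2 -= lr * grad_b2    # gradient-descent updates
            W1 -= lr * grad_W1; b1 -= lr * grad_b1

    print(np.mean((sigmoid(X @ W1 + b1) @ W2 + b2 - y.reshape(-1, 1)) ** 2))   # in-sample MSE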

37
Q

What is the universal approximation theorem?

A

Mathematicians have proven that, with the right number of hidden neurons, and under some assumptions, neural networks are able to approximate any function with arbitrary precision.

38
Q

What is the biggest problem with neural networks and how can it be tackled?

A

Neural networks are highly computationally intensive.

[Consider a network with 100 features, one hidden layer with 100 neurons, and an output layer with only one neuron. This is not an unlikely case in real life applications and entails the estimation of 100 × 100 + 100 = 10,100 weights! Besides, many epochs are generally needed to reach convergence, which makes this approach highly impractical.]

An alternative technique to find an appropriate size of network proceeds as follows. We start with a small number of hidden layers. At the end of each epoch, the algorithm computes the value of the loss function over the training set. This value is likely to keep decreasing when the number of epochs increases, until a point is reached where the performance fails to improve. This can happen either because the network lacks the necessary flexibility to make correct predictions, or because a local minimum has been found.

When this is observed, additional layers are added to the network and the training is resumed. If this allows a reduction in the value of the loss function, the newly added neurons are retained; if not, the smaller model is preferred.

39
Q

What determines the number of neurons?

A

The number of neurons within each layer is dictated by the sizes of the feature set and target(s). It was common practice to structure the network such that the number of neurons decreased from one layer to the next as the network approached the output layer. This “pyramid”-style structure has been largely superseded by a more uniform structure with an equal number of neurons in the hidden layers, coupled with regularization techniques to ensure that the model does not overfit. Still, depending on the model, it may be useful to have a larger number of neurons in the first layer than in the other layers.

40
Q

What is the most common estimation problem of neural networks?

A

Especially when the model is large and insufficient training examples have been provided, neural networks tend to learn random artifacts of the training data. This implies that they will fail to generalize well to unseen test instances. An extreme form of overfitting is memorization, which results in an almost perfect fit to the training data, and which is not uncommon with neural networks.

Signs of overfitting are:

(1) The same model obtains very different predictions depending on the sample it is trained with.

(2) The gap between the prediction error over the training and the test samples is very large.

41
Q

Which techniques are used to tackle the problem of overfitting in neural networks?

A

(1) Penalty based regularization. This involves imposing a penalty over the loss function, such as λ·Σ w_i² (the sum over i = 0, …, d of the squared weights, scaled by a constant λ), where d is the number of neurons.

(2) Dropout. This is an ensemble technique explicitly designed for neural networks. It consists of creating alternative networks by selectively dropping a few neurons each time. The forecasts from these networks are then aggregated to create the final prediction.

(3) Early stopping. The optimization is stopped before converging to the optimal solution on the training data. A portion of the data is held out and used to determine the optimal stopping point. The training is stopped when the error in the hold-out sample starts to rise.
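
As a small illustration of item (1), the penalized loss simply adds the scaled sum of squared weights to the usual error measure (the function and variable names are illustrative):

    import numpy as np

    def penalized_loss(y, y_hat, weights, lam=0.01):
        # MSE plus an L2 penalty: lam times the sum of squared weights across all layers
        mse = np.mean((y - y_hat) ** 2)
        l2 = lam * sum(np.sum(w ** 2) for w in weights)
        return mse + l2

    y = np.array([1.0, 0.0, 1.0])
    y_hat = np.array([0.8, 0.2, 0.7])
    weights = [np.full((3, 3), 0.1), np.full((3, 1), 0.1)]   # weight matrices of a toy network
    print(penalized_loss(y, y_hat, weights))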

42
Q

What are Convolutional Neural Networks?

A

Convolutional neural networks (CNN) are a specialized form of neural networks where the neurons in one layer are only connected to a subset of the neurons in the next layer. They are designed to work with inputs that have a grid structure and where adjacent points in the grid exhibit dependencies. The most obvious application of CNNs is with 2-dimensional images, but they can also be employed for textual, voice or time series data.

CNNs are ideal in such cases because the number of network weights to be trained is drastically reduced, which results in faster training of the model.

CNNs take their name from the fact that they contain at least one convolutional layer.

43
Q

What is the most common type of CNNs and how do they work?

A

The most common type of convolutional layer is the 2D or planar convolutional layer. A 2D convolution applies an n × n kernel matrix W (where n is generally smaller than m) over an m × m input grid (typically called the image, regardless of whether or not it contains pixels) to obtain a new, smaller filtered image, which is called the feature map. The kernel matrix contains weights that are learned during the training process.

The feature map is obtained by “sliding” the kernel over the image, starting from the top left corner, and moving it through all the positions where it fits entirely within the boundaries of the image. Each position corresponds to a single cell in the feature map, whose value is calculated by multiplying the kernel values by the underlying image values cell by cell and then summing these products. The region of the input space that influences a particular cell in the feature map is termed its receptive field.

It is also common to have a pooling layer. The pooling layer replaces the output of the previous layer at certain locations with summary statistics. For instance, the feature map could be summarized by taking the average or the maximum of the values in blocks of its cells. CNNs are parsimonious in terms of parameters, as the same weights are applied to all the receptive fields. Therefore, CNNs are useful for processing images, which typically involve millions of pixels.
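
A numpy sketch of the sliding-kernel computation and a simple max-pooling step (the image, kernel values, and sizes are made up; in a real CNN the kernel weights would be learned):

    import numpy as np

    def convolve2d(image, kernel):
        # slide an n x n kernel over an m x m image (valid positions only) to build the feature map
        m, n = image.shape[0], kernel.shape[0]
        out = np.zeros((m - n + 1, m - n + 1))
        for i in range(m - n + 1):
            for j in range(m - n + 1):
                patch = image[i:i + n, j:j + n]        # the receptive field of feature-map cell (i, j)
                out[i, j] = np.sum(patch * kernel)     # multiply cell by cell, then sum
        return out

    rng = np.random.default_rng(0)
    image = rng.normal(size=(6, 6))           # m = 6
    kernel = rng.normal(size=(3, 3))          # n = 3
    feature_map = convolve2d(image, kernel)
    print(feature_map.shape)                  # (4, 4): smaller than the input image

    # a 2 x 2 max-pooling layer: replace each 2 x 2 block of the feature map with its maximum
    pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled.shape)                       # (2, 2)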

44
Q

What are Recurrent Neural Networks?

A

Recurrent neural networks (RNNs) differ from a standard multi-layer perceptron as the former models employ a temporal sequence to preserve the order in which the observations occur. In other words, a RNN is designed to have some memory. RNNs are often employed in time series applications, and they are at the heart of large language models.

45
Q

What are Autoencoders?

A

Autoencoders are a class of artificial neural network (ANN) models used for unsupervised learning. They are feedforward specifications, but the outputs are the same features as the inputs, and hence there are no labels. Unlike K-means clustering, autoencoders are primarily used for dimensionality reduction and so are best thought of as non-linear extensions of PCA.

46
Q

What is the difference between PCA and Autoencoders?

A

Autoencoders can provide a compact representation of the feature data and are particularly useful for high-dimensional systems. It should be noted that although PCA is commonly discussed in machine learning contexts, in fact there is no learning involved because it is merely a decomposition with a unique mathematical solution. Autoencoders, on the other hand, are trained and learn the relationships present in the data through model estimation. The advantage of autoencoders over PCA is the use of non-linear activation functions, which provide the universal approximation property in high-dimensional space.

47
Q

How do Autoencoders work?

A

In a schematic representation of an autoencoder, each circle on the left-hand side represents a feature (say m = 5 of them). The features are then put through the encoder, which is a function, to arrive at the values in the hidden layer (also known as latent variables). In the second part of the autoencoder, the values in the hidden layer are converted back to the feature values through the decoder. Each line connecting the circles represents a weight or parameter to be estimated. The optimization objective is to reconstruct the original features as accurately as possible. The weights between the input layer and the hidden layer encode the information from the features, and the weights between the hidden layer and the output layer decode the information.

In most applications, there are fewer units in the hidden layer than there are features (for example, a single hidden layer containing three neurons when there are five features). This is known as a constricted or bottleneck hidden layer and will lead to dimensionality reduction.

48
Q

How are autoencoders calculated?

A

The hidden layers are simply calculated as a weighted sum of the inputs (features), and the outputs (reconstructed features) are a weighted sum of the values at the hidden layers.

The weights are chosen by minimizing a loss function, L, akin to the residual sum of squares (RSS) in a linear regression, based on the differences between the actual values of the features and their reconstructed values.

An alternative measure called the mean squared error (MSE) which is the mean of the squared residuals is also used as the loss function.

L will be a positive number to reflect that the feature inputs will not be reconstructed precisely after the encoding and decoding processes, but this is the price paid to obtain a more parsimonious representation. Autoencoders can be used, for example, to store images such as photos without requiring much space, but when they are reconstructed, some definition will inevitably be lost.
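
A numpy sketch of the forward pass and the reconstruction loss for a 5-3-5 autoencoder; the weights below are random placeholders, whereas training would choose them to minimize the loss (the data and dimensions are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))             # 100 observations, m = 5 features
    W_enc = 0.1 * rng.normal(size=(5, 3))     # encoder weights: 5 features -> 3 hidden units
    W_dec = 0.1 * rng.normal(size=(3, 5))     # decoder weights: 3 hidden units -> 5 reconstructed features

    H = np.tanh(X @ W_enc)                    # hidden layer (latent variables), non-linear activation
    X_hat = H @ W_dec                         # reconstructed features
    mse = np.mean((X - X_hat) ** 2)           # the loss L: mean squared reconstruction error
    print(mse)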

49
Q

What is the mathematical relationship between PCA and Autoencoders?

A

As outlined until this point, the autoencoder model is linear in the weights, and thus it will perform a function comparable to PCA. More specifically, in such a linear model, if there is only one hidden layer with K nodes, if both the encoder and decoder are linear, and if the inputs are suitably normalized, then the encoder hidden nodes will be the first K principal components.

50
Q

Are hidden layers based on linear or non-linear activation functions?

A

It is more common to use a non-linear autoencoder by applying an activation function to the weighted sums in the hidden layer(s). The activation functions introduce nonlinearity into the relationship between the inputs and output. Without them, the outputs from the model would merely be linear combinations of the hidden layer(s), which would, in turn, be linear combinations of the inputs.

By capturing any non-linear relationships between the features that are present in the data, activation functions can also allow the number of nodes in the hidden layer to be further reduced so that the representation is even more compact than if a purely linear specification was used.

51
Q

What is the relationship between the number of inputs and the number of neurons in the hidden layer?

A

The number of hidden neurons is usually lower than m so that dimensionality is reduced. If the number of hidden neurons were exactly m, it would be possible to trivially reconstruct the exact features, but this would be pointless because no reduction in dimensionality would have been achieved. In a sparse autoencoder, by contrast, the number of hidden neurons is set to a value greater than m; even though the number of hidden units is large, many of the weights (e.g., 80% of them) are set to zero, so that the effective number of weights is much lower and commensurate with a smaller number of hidden units.

52
Q

What are deep autoencoders?

A

It is also possible to add additional hidden layers to form a deep autoencoder, which has more scope to capture more sophisticated non-linear patterns between the features. In such models, the design is usually (although not necessarily) symmetrical about the bottleneck center hidden layer. So, for instance, if we have m = 10 features, we might have a center layer with three neurons and two intermediate hidden layers (one on the encoder side and one on the decoder side), each with five neurons.

53
Q

What is the reconstruction error?

A

We evaluate autoencoders by calculating the loss, L, as described above, on the final fitted model, where it is called the reconstruction error.
