Module 2: Chapter 7 - Supervised Learning - Model Estimation Flashcards

1
Q

What is Non-Linear Least Squares?

A

OLS can be used in situations where the underlying model is linear in the parameters, but this does not apply to many machine learning models, such as neural networks. In these cases, a more flexible approach is needed. Nonlinear least squares (NLS) is an approach that can be used when the model is nonlinear, and it works using the same principles as OLS – i.e., by minimizing the residual sum of squares.

But in this case, $\hat{y}_i = f(x_1, x_2, \ldots, x_m; \mathbf{w})$, where f can be any nonlinear function of the m explanatory variables or features, which are denoted by $x_i$, and the corresponding parameters are denoted by $w_i$ (also known as weights in the case of neural networks). Similarly, the mean squared error (MSE) can also be calculated. Because the relationship between the features and the output could in principle take any form, it is often not possible to derive a set of closed-form solutions to this minimization problem. Therefore, NLS usually uses a numerical approach to finding the optimal parameter estimates.
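
As a concrete sketch, NLS can be run numerically by minimizing the residual sum of squares with SciPy (the model, data, and starting values below are purely illustrative, not from the text):

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical nonlinear model y = w1 * exp(w2 * x) with noisy data.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 2.0, 50)
y = 2.0 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

def residuals(w):
    # NLS minimizes the sum of squares of these residuals
    return y - w[0] * np.exp(w[1] * x)

fit = least_squares(residuals, x0=[1.0, 1.0])  # numerical search from initial guesses
print(fit.x)  # estimated weights, close to (2.0, 0.8)
```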

2
Q

Which steps are taken to identify the optimal solution of NLS?

A

(1) Begin with a set of initial values for the parameters – these could either be randomly generated or ‘best preliminary guesses.’

(2) Evaluate the objective function (RSS or MSE).

(3) Modify the parameter estimates and re-evaluate the objective function.

(4) If the improvement in the objective function is below a pre-specified threshold, then stop looking and report the current parameter values as the chosen ones. If not, return to step 2.

The third of the above steps is the crucial one; usually, a gradient descent algorithm is employed, which is discussed in a following sub-section after a preliminary discussion of hill climbing.
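
The four steps map onto a simple loop. The sketch below is generic (the names `objective` and `update` are placeholders for the RSS/MSE computation and the modification rule, not from the text):

```python
def fit_by_iteration(objective, w_init, update, tol=1e-8, max_iter=10_000):
    """Steps 1-4 above: start from initial guesses, evaluate the objective
    (RSS or MSE), modify the estimates, and stop once the improvement
    falls below a pre-specified threshold."""
    w = list(w_init)                  # step 1: initial values
    best = objective(w)               # step 2: evaluate the objective
    for _ in range(max_iter):
        w_new = update(w)             # step 3: modify the estimates
        value = objective(w_new)
        if best - value < tol:        # step 4: stop if improvement is tiny
            break
        w, best = w_new, value
    return w
```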

3
Q

What is Hill Climbing?

A

A simple form of optimizer for estimating the parameters of a nonlinear model is hill climbing. This involves starting with initial guesses of each parameter and then making small changes in both directions to each parameter one-at-a-time. The aim is to maximize the value of an objective function (for instance, increasing the value of a likelihood function or increasing the negative of the RSS) until no further improvement in its value is observed.

Hill climbing is very straightforward because it does not require the calculation of derivatives, and therefore it can be applied to non-differentiable functions. It is also simple to implement, and for this reason it is sometimes termed a “heuristic optimizer.”
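
A minimal hill-climbing sketch (the step size and iteration cap are illustrative), changing one parameter at a time in both directions and keeping any improvement:

```python
import numpy as np

def hill_climb(objective, w_init, step=0.1, max_iter=1_000):
    """Maximize `objective` without derivatives by perturbing each
    parameter one-at-a-time in both directions."""
    w = np.asarray(w_init, dtype=float)
    best = objective(w)
    for _ in range(max_iter):
        improved = False
        for i in range(w.size):
            for delta in (step, -step):
                trial = w.copy()
                trial[i] += delta
                value = objective(trial)
                if value > best:          # keep the change only if it helps
                    w, best, improved = trial, value, True
        if not improved:
            break  # no single-parameter move improves the objective
    return w, best
```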

4
Q

What are the disadvantages of Hill Climbing?

A

(1) Of all the optimization techniques available, hill climbing is the most susceptible to getting stuck in local optima.

(2) Convergence to the optimal solution can be very slow.

(3) Only one parameter can be adjusted at a time, meaning that it is easy for the algorithm to miss optimal parameter combinations, particularly for complex and highly interconnected models.

5
Q

What is the Gradient Descent Method?

A

A popular numerical procedure for parameter estimation is known as gradient descent.

In this method, the objective function, for example the residual sum of squares, is minimized. Suppose that all the parameters to be estimated are stacked into a single vector W; the objective function in this case is known as the loss function and is denoted L(W). At each iteration, the algorithm chooses the path of steepest descent, i.e., the direction (“slope”) that reduces the value of the loss function the most. The method works similarly when the maximum likelihood method is used, in which case the negative of the log-likelihood function is minimized.

Note that the adjustment to each weight $w_i$ takes place in the opposite direction to the gradient (or “slope”), hence the negative sign in the weight-updating rule (equation 7.8): $w_i \leftarrow w_i - \eta \, \partial L(\mathbf{W}) / \partial w_i$, where η is the learning rate discussed below.

To avoid the iteration going into infinite cycles, a maximum number of iterations is assigned to the algorithm. In machine learning parlance, each iteration, comprising calculating the loss function and gradient then adjusting the weights using the whole training data sample, is known as an epoch.
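
As a minimal sketch for a linear model with an MSE loss (η and the epoch count are illustrative choices, not values from the text):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_epochs=1_000):
    """Minimize the MSE loss of a linear model y ~ X @ w.
    Each pass over the full training sample is one epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        grad = -2.0 * X.T @ (y - X @ w) / len(y)  # gradient of the loss
        w -= eta * grad                           # step against the gradient
    return w
```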

6
Q

What alternative ways exist to implement the gradient descent approach?

A

We usually use the entire training data sample and minimize the loss function with respect to all of it, which is known as batch gradient descent.

A slightly less common alternative is to apply gradient descent to one data point at a time, selected at random from the training set. This is known as stochastic gradient descent.

Alternatively, applying it to subsets of the training data is known as mini-batch gradient descent.
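
The three variants differ only in how much data each update uses, so one sketch can cover them all (same illustrative linear-model MSE loss as above; `batch_size` selects the variant):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_epochs=100, batch_size=1, seed=0):
    """batch_size=1: stochastic gradient descent; 1 < batch_size < N:
    mini-batch; batch_size=N: batch gradient descent."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    w = np.zeros(m)
    for _ in range(n_epochs):
        order = rng.permutation(N)                 # random order each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = -2.0 * X[idx].T @ (y[idx] - X[idx] @ w) / len(idx)
            w -= eta * grad                        # update after every (mini-)batch
    return w
```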

7
Q

What is the advantage of using stochastic gradient descent?

A

The benefit of stochastic gradient descent is that it does not require the entire database to be loaded into memory simultaneously, which reduces the required computational resources compared with the batch approach. However, because updating the weights will take place after the algorithm sees each new data point, the convergence on the optimal values will require more iterations and will be less smooth.

8
Q

What is the hyperparameter η used for in the gradient descent approach?

A

The parameter η is a hyperparameter that defines how much adjustment to the weights takes place. This hyperparameter must be chosen judiciously: if it is too small, each iteration will yield only modest improvements in the loss function and will result in a slow movement towards the optimum, requiring many iterations in total. On the other hand, if η is too large, there is a danger of overshooting and an erratic path towards the optimal $w_i^*$. The algorithm can overshoot the minimum, oscillate around it, or may not converge.

For batch gradient descent, η is fixed a priori, but for stochastic gradient descent, the learning rate can be made a diminishing function of the number of epochs that have already occurred so that the weight updating slows down as the algorithm comes closer to the optimum. In other words, we could employ dynamic learning, which entails starting with a larger η to get close to the optimal solution faster, but to then reduce η as the learning proceeds to avoid overshooting.
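
A diminishing learning-rate schedule of the kind described can be as simple as the following (the functional form and the decay constant are illustrative, not prescribed by the text):

```python
def dynamic_eta(eta0, epoch, decay=0.01):
    # Start with a larger eta, then shrink it as epochs accumulate so
    # that weight updates slow down near the optimum.
    return eta0 / (1.0 + decay * epoch)
```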

9
Q

What is a problem associated with the gradient descent approach?

A

One such issue occurs when the function does not have a single, global optimum but rather a series of local optima or an extended plateau away from the true optimum, as illustrated in the right panel of Figure 7.3. In such circumstances, the optimizer can get stuck at a local optimum or plateau and never reach the optimal solution $w_i^*$.

10
Q

What is backpropagation?

A

Determining the optimal weights in a neural network model is particularly challenging because, even with a single hidden layer, the output is a function of a function. It is like running a logistic regression on the output from another logistic regression.

Therefore, a technique known as backpropagation is used along with gradient descent to determine the weights in neural network models.

11
Q

How does backpropagation work?

A

The backpropagation algorithm involves starting on the right-hand side of the neural network (i.e., beginning with the output layer) and then successively working backward through the layers to update the weight estimates at each iteration. It begins by calculating the errors (actual minus fitted values) for each target data point; these errors are then “assigned” to each of the weights in the layer before it.

Gradient descent can then be applied to the weights to calculate improved values. The output-layer error is then recomputed via a feedforward of the feature values with the updated weights, and the process continues. The derivatives are computed starting from the output layer and moving backward, with an application of the chain rule (hence the name backpropagation). The algorithm stops when convergence is achieved – that is, when updating the weights no longer reduces the cost function (or reduces it by a trivially small amount). The key to backpropagation is to consider each layer separately rather than trying to do all the computation in a single step, because breaking it down in this way greatly simplifies the mathematics.
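
A minimal sketch of these passes for a single-hidden-layer network with sigmoid activations and an MSE loss (the sizes, rates, and the omission of bias terms are simplifications for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=4, eta=0.1, n_epochs=1_000, seed=0):
    """X: (N, m) features; y: (N, 1) targets. Bias terms omitted."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))  # input -> hidden
    W2 = rng.normal(scale=0.1, size=(hidden, 1))           # hidden -> output
    for _ in range(n_epochs):
        # forward pass: feed the feature values through the network
        H = sigmoid(X @ W1)
        y_hat = sigmoid(H @ W2)
        # backward pass: chain rule, starting from the output layer
        err = y_hat - y                        # error for each data point
        d2 = err * y_hat * (1.0 - y_hat)       # output-layer delta
        d1 = (d2 @ W2.T) * H * (1.0 - H)       # error "assigned" to the hidden layer
        # gradient descent on each layer's weights
        W2 -= eta * H.T @ d2 / len(X)
        W1 -= eta * X.T @ d1 / len(X)
    return W1, W2
```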

12
Q

What are the steps to identify the optimal weights in an ANN?

A

(1) Generate initial guesses (usually at random) for all the weights, including the biases.

(2) Given these weights, feedforward the values of the inputs to calculate the values at each neuron and then finally the value of the outputs. This is done separately for each of the N data points in the training sample.

(3) Once the fitted values of the outputs are determined, the error, which is the difference between the network output and the actual value, can be calculated for each observation.

(3.a.) If this is the first iteration, proceed to step 4.

(3.b.) If the residual sum of squares is below a particular threshold, or has not improved much since the previous iteration, or the number of iterations has reached a pre-specified maximum value, stop and fix the weights at their current values. Otherwise, proceed to step 4.

(4) During the backward pass, the gradient descent method is used to calculate improved values of the weights. In this process, the error is propagated through the network, and the weights are updated to minimize the loss function. Return to step 2 and run through a further iteration.

13
Q

Why do we include a momentum term and how can it help with local minima?

A

Sometimes a momentum term is added to the optimizer, which increases the learning rate if the previous change in the weights and the current change are in the same direction but reduces it if they are in opposite directions.

The benefit of this approach is that it speeds up convergence but at the same time reduces the probability of overshooting the optimal parameter values.

In the momentum update, μ is the momentum rate, which can be chosen between 0 and 1; it controls how much of the previous weight change is kept in the next iteration.

This works by ‘overshooting’ the target, which helps prevent the algorithm from getting stuck in local minima.
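
A minimal sketch of such an update (a reconstruction, since the card's own equation is not shown; the values of η and μ are illustrative):

```python
def momentum_update(w, grad, prev_change, eta=0.01, mu=0.9):
    # change = mu * (previous change) - eta * gradient: moves that keep
    # pointing the same way accelerate, while reversals are damped.
    # (mu is the momentum rate and must lie between 0 and 1.)
    change = mu * prev_change - eta * grad
    return w + change, change
```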

14
Q

What is the issue raised by the term vanishing gradient?

A

Deep networks are plagued by another computational issue: the so-called vanishing gradient. Loosely, the problem can be understood as follows. Backpropagation is an application of the chain rule, which entails the multiplication of several derivatives (as many as there are layers in the network). When the derivatives are numbers between 0 and 1 (which is always the case for the logistic function), their product tends to become very small very quickly. The opposite problem is an exploding gradient, where the product of the derivatives becomes larger and larger.

When one of these problems emerges, the only way to find the optimum is by using an extremely large number of small updates, which of course makes the learning very slow.
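
A quick numeric illustration of the vanishing case: the logistic function's derivative never exceeds 0.25, so the chain-rule product across many layers collapses towards zero:

```python
# Upper bound on a product of sigmoid derivatives across 20 layers.
depth = 20
print(0.25 ** depth)  # ~9.1e-13: almost no gradient reaches the early layers
```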

15
Q

What solutions exist to tackle vanishing/exploding gradients?

A

(1) An appropriate choice of activation function. In recent years, the use of the logistic (sigmoid) function as an activation function for hidden layers has been abandoned in favor of, for instance, the ReLU function, which is less prone to the vanishing gradient problem (see the sketch after this list).

(2) Batch normalization. This consists of adding ‘normalization layers’ between the hidden layers. The normalization works in a similar fashion to that discussed in Chapter 1, where the features were normalized by subtracting the mean and dividing by the standard deviation. Here, it is the new inputs originating from the hidden layers that are normalized.

(3) Specific network architectures. Some network architectures have been developed to be resistant to the vanishing or exploding gradient problem. An example is the long short-term memory (LSTM) network.
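
A sketch of the ReLU choice from point (1): its derivative is 1 wherever the unit is active, so chain-rule products do not shrink the way products of sigmoid derivatives (at most 0.25) do:

```python
import numpy as np

def relu(z):
    # Derivative is 1 for z > 0 and 0 otherwise, so active units pass
    # gradients backward without attenuation.
    return np.maximum(0.0, z)
```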

16
Q

What is overfitting?

A

Overfitting is a situation in which a model is chosen that is “too large” or excessively parameterized. A simple example is when a high-order polynomial is used to fit a data set that is roughly quadratic. The most obvious sign of an overfitted model is that it performs considerably worse on new data.

An overfitted model captures excessive random noise in the training set rather than just the relevant signal. Overfitting gives a false impression of an excellent specification because the RSS on the training set will be very low (possibly close to zero). However, when applied to other data not in the training set, the model’s performance will likely be poor, and the model will not be able to generalize well.

Overfitting is usually a more severe issue with machine learning than with conventional econometric models due to the larger number of parameters in the former. For instance, a standard linear regression model generally has a relatively small number of parameters. By contrast, it is not uncommon for neural networks to have several thousand parameters.
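
The polynomial example can be made concrete with synthetic, roughly quadratic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
x_new = rng.uniform(-1, 1, 30)                  # data not in the training set

def f(x):                                       # true (quadratic) signal
    return 1.0 - 2.0 * x + 3.0 * x**2

y = f(x) + rng.normal(scale=0.3, size=30)
y_new = f(x_new) + rng.normal(scale=0.3, size=30)

for degree in (1, 2, 9):
    coef = np.polyfit(x, y, degree)
    mse_train = np.mean((y - np.polyval(coef, x)) ** 2)
    mse_new = np.mean((y_new - np.polyval(coef, x_new)) ** 2)
    print(degree, round(mse_train, 4), round(mse_new, 4))
# The degree-9 fit gives the lowest training MSE but typically the
# worst MSE on the new data: the signature of overfitting.
```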

17
Q

What is underfitting?

A

Underfitting is the opposite problem to overfitting and occurs when relevant patterns in the data remain uncaptured by the model. For instance, we might expect the relationship between the performance of hedge funds and their size (measured by assets under management) to be quadratic. Funds that are too small would have insufficient access to resources and less scope to spread their costs, and funds that are too big may struggle to implement their strategies in a timely fashion without causing adverse price movements in the market. A linear model would not be able to capture this phenomenon and would estimate a monotonic relationship between performance and size, and so would be underfitted. A more appropriate specification would allow for a nonlinear relationship between fund size and performance.

It is clear from these examples that underfitting is more likely in conventional models than in some machine-learning approaches where only a minimal assumption (such as smoothness) on the signal is imposed.

However, it is also possible for machine-learning approaches, as well as econometric models, to underfit the data. This can happen either when the number or quality of inputs is insufficient, or if steps taken to prevent overfitting are excessively stringent.

18
Q

What is the Bias-Variance Trade-Off?

A

If the model is underfitted, the omission of relevant factors or interactions will lead to biased predictions but with low variance. On the other hand, if the model is overfitted, there will be low bias but a high variance in predictions.
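
Behind this trade-off sits the standard decomposition of expected squared prediction error (a textbook identity, not shown on the card); an underfitted model sits at the high-bias, low-variance end, an overfitted one at the low-bias, high-variance end:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$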

19
Q

What can be done to address overfitting?

A

One approach to limit the chances of overfitting is to carry out calculations for the validation data set at the same time as the training data set. As the algorithm steps down the multi-dimensional valley, the objective function will improve for both data sets, but at some stage, further steps down the valley will start to worsen the value of the objective function for the validation set while still improving it for the training set. This is the point at which the gradient descent algorithm should be stopped because further steps down the valley will lead to overfitting the training data and therefore poor generalization and poor predictions for the test sample.
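
This stopping rule is usually called early stopping. A minimal sketch, where `train_step` and `validation_loss` are hypothetical callables standing in for one gradient-descent step and the validation-set objective:

```python
def early_stopping(train_step, validation_loss, patience=5, max_epochs=1_000):
    """Stop once the validation objective worsens for `patience`
    consecutive epochs, even while the training objective improves."""
    best = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_step()                        # one step down the valley
        v = validation_loss()
        if v < best:
            best, epochs_without_improvement = v, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # further steps would overfit
    return best
```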

20
Q

What is the trade-off between accuracy and interpretability?

A

Machine learning models are often highly complex and heavily parameterized, so they have often been accused of being “black boxes.” More flexible models often deliver more accurate predictions (with the caveats discussed above concerning overfitting), as they can generate a wider range of shapes for the function that maps the features to the outcome. Therefore, they can fit the highly complex and nonlinear patterns in real-world data. However, these models often lack interpretability.

The linear regression model is often inadequate to model the complex nature of real-world relationships between the predictors and the target variable. Yet, its popularity among financial economists remains unchallenged, thanks to its ability to deliver an easy-to-understand relationship between the predictors and outcome that resonates with financial theory. Generally, less flexible but more interpretable models are preferred when the goal is to investigate causal relationships. In contrast, more flexible models tend to be the obvious choice when the goal is to make accurate predictions.

21
Q

What is regularization?

A

The stepwise selection methods discussed in Chapter 3 add or remove predictors to a regression with the aim of finding the combination that maximizes the model performance. An alternative is to fit a model on all m features but using a regularization technique that shrinks the regression coefficients towards zero. Regularization can be used for standard linear regression models such as those discussed above and for many other machine-learning models discussed in subsequent chapters. It is usually good practice to normalize or standardize the data beforehand using one of the methods described earlier.

The two most common regularization techniques are ridge regression and least absolute shrinkage and selection operator (LASSO). Both work by adding a penalty term to the objective function that is being minimized. The penalty term is the sum of the squares of the coefficients in ridge regression and the sum of the absolute values of the coefficients in LASSO. Regularization can simplify models, making them easier to interpret, and reduce the likelihood of overfitting to the training sample.

22
Q

How does Ridge Regression work?

A

Suppose that we have a dataset with N observations on each of m features in addition to a single output variable y and, for simplicity, assume that we are estimating a standard linear regression model with hats above parameters denoting their estimated values.
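
The expression being referred to is presumably the standard ridge objective (reconstructed here, since the formula itself did not survive on the card):

$$\min_{\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_m} \; \sum_{i=1}^{N}\Big(y_i - \hat{\alpha} - \sum_{j=1}^{m}\hat{\beta}_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{m}\hat{\beta}_j^2$$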

The first sum in this expression is the usual regression objective function (i.e., the residual sum of squares), and the second is the shrinkage term that introduces a penalty for large slope-parameter values (of either sign). The parameter λ controls the relative weight given to shrinkage versus model fit, and some experimentation is necessary to find the best value in any given situation. Parameters that are used to determine the model but are not part of the model itself are referred to as hyperparameters. In this case, λ is a hyperparameter and the $\hat{\beta}$s are model parameters.

23
Q

What is LASSO regression?

A

LASSO is a similar idea to ridge regression, but the penalty takes an absolute value form rather than a square.

Whereas there is an analytic approach to determining the values of the βs for ridge regression, a numerical procedure must be used to determine these parameters for LASSO because the absolute value function is not everywhere differentiable.
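
So the LASSO penalty is $\lambda \sum_{j=1}^{m} |\hat{\beta}_j|$ in place of the squared term. A quick comparison with scikit-learn, whose `alpha` plays the role of λ (the data and penalty strength are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=200)  # most features irrelevant

X_std = StandardScaler().fit_transform(X)      # standardize features first
print(Ridge(alpha=1.0).fit(X_std, y).coef_)    # shrunk towards, but not to, zero
print(Lasso(alpha=1.0).fit(X_std, y).coef_)    # some coefficients exactly zero
```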

24
Q

What are the key differences between LASSO and Ridge regression?

A

Ridge regression and LASSO are sometimes known, respectively, as L2 and L1 regularization due to the order of the penalty terms in these methods. There are key differences between them:

(1) Ridge regression (L2) tends to reduce the magnitude of the β parameters, making them closer to, but not equal to, zero. This simplifies the model and avoids situations in which, for two correlated variables, a large positive coefficient is assigned to one and a large negative coefficient is assigned to the other.

(2) LASSO (L1) is different in that it sets some of the less important β estimates exactly to zero.

(3) The choice of one approach rather than the other depends on the situation and on whether the objective is to reduce extreme parameter estimates or remove some terms from the model altogether.

(4) LASSO is sometimes referred to as a feature selection technique because de facto it removes the less important features by setting their coefficients equal to zero. As the value of λ is increased, more features are removed.

25
Q

What is the relationship between Maximum Likelihood and Regularization techniques?

A

Ridge regression and LASSO can be used with logistic regression. Maximizing the likelihood is equivalent to minimizing its negative. Therefore, to apply a regularization, we add λ times the sum of the squares of the parameters or λ times the sum of the absolute values of the parameters to the negative of the expression for the log-likelihood. Then the objective would be to find the values of the parameters that jointly minimize this composite of the negative of the log-likelihood and the sum of the absolute values of the parameters.
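
scikit-learn's logistic regression minimizes exactly this kind of penalized negative log-likelihood; note that its `C` parameter is the inverse of λ, so a smaller `C` means stronger regularization (the settings below are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# L2 (ridge-style) penalty added to the negative log-likelihood:
clf_l2 = LogisticRegression(penalty="l2", C=0.5)
# L1 (LASSO-style) penalty needs a solver that supports it:
clf_l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
# clf_l2.fit(X, y) / clf_l1.fit(X, y) would then estimate the parameters.
```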

26
Q

What is an elastic net?

A

A third possible regularization tool is a hybrid of the two above, where the loss function contains both squared and absolute-value functions of the parameters.

By appropriately selecting the two hyperparameters (λ1 and λ2), it is sometimes possible to obtain the benefits of both ridge regression and LASSO: reducing the magnitudes of some parameters and removing some unimportant ones entirely.
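
scikit-learn's `ElasticNet` parametrizes the hybrid slightly differently: `alpha` scales the overall penalty and `l1_ratio` splits it between the absolute-value and squared terms, so the pair maps onto (λ1, λ2) (values illustrative):

```python
from sklearn.linear_model import ElasticNet

# l1_ratio=0.5 gives equal weight to the L1 and L2 penalty components.
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
```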

27
Q

How does cross-validation work?

A

It involves combining the training and validation data into a single sample, with only the test data held back. This means that there is effectively no separate validation sample, only a combined sample, which we now call the training sample. This training sample is then split into equally sized sub-samples, with the estimation being performed repeatedly and one of the sub-samples left out each time.

The technique known as k-fold cross-validation splits the total data available, N, into k samples, and it is common to choose k = 5 or 10. Suppose for illustration that k = 5. Then the training data would be partitioned into five equally sized, randomly selected sub-samples, each comprising 20% of that data. If we define the sub-samples ki (i = 1,2,3,4,5), the first estimation would use samples k1 to k4 with k5 left out. Next, the estimation would be repeated with sub-samples k1 to k3 and k5, with k4 left out, and so on.
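
With scikit-learn, the k = 5 partitioning just described looks like this (toy data):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)              # 50 toy observations
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X):
    # four sub-samples (80%) train the model; the fifth is left out
    print(len(train_idx), len(val_idx))        # 40 10, five times
```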

28
Q

What is LOOCV?

A

In the context of cross-validation, a larger value of k will imply an increased training sample size, which might be valuable if the overall number of observations is low. The limit as k increases would be k = N, which would correspond to having as many folds as the total number of data points in the training set. This situation is known as N-fold cross-validation, jack-knifing, or leave-one-out cross-validation (LOOCV).

29
Q

What are two disadvantages of cross-validation?

A

(1) Using LOOCV will increase the size of the matrix in Figure 7.7 to N × N, which maximizes the sizes of the training and validation samples but will be computationally expensive as the data are trained at each of the N iterations. (Remark: For linear models, LOOCV is computationally inexpensive because of the updating formula, i.e., the Sherman–Morrison–Woodbury formula.) For nonlinear models, cross-validation exercises always increase the computational costs, which grow with k.

(2) For unordered data, the points allocated to each fold will usually be selected randomly. This implies that k-fold cross-validation cannot be used when the data have a natural ordering, for example time-series observations, where there is a need to preserve the order and where the validation sample would usually comprise points that are chronologically after the training data. In this case, the appropriate framework would be to use a rolling window.
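
For the ordered-data case in point (2), scikit-learn's `TimeSeriesSplit` is one order-preserving scheme (an expanding rather than strictly rolling window, shown purely as an illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(12, 2)               # 12 ordered observations
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # every validation point is chronologically after the training window
    print(train_idx.max() < val_idx.min())     # True for each split
```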

30
Q

What is Stratified Cross-validation?

A

When the overall sample is very small, there is a heightened risk that one or more of the training or validation sub-samples will, purely by chance, comprise a set of datapoints that are atypical compared with the other sub-samples. Some classes or types of data will be overrepresented in the training sample and underrepresented in the validation sample and vice versa for the other classes or types of data. Specifically, if the output data are categorical but unbalanced between categories, it might be the case that there are no instances of one or more categories in one or more of the sub-samples.

A potential solution to this problem is to use stratified k-fold cross-validation. In that case, instead of drawing k samples (without replacement) to comprise the validation data, the positive and negative outcomes would be sampled separately in proportion to their presence in the overall sample.

An alternative way to deal with imbalanced classes is to generate further artificial instances of the minority (underrepresented) class using what is known as SMOTE (Synthetic Minority Over-sampling Technique), or to use an asymmetric loss function that puts more weight on any incorrect predictions for this class. However, these approaches are more complex than those described above and hence are not considered further.
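
scikit-learn's `StratifiedKFold` implements this proportional sampling (toy unbalanced data for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(50, 2)
y = np.array([0] * 45 + [1] * 5)               # unbalanced classes: 90% / 10%
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # each fold keeps the 9:1 proportion, so the minority class
    # is never absent from a validation sample
    print(np.bincount(y[val_idx]))             # [9 1] every time
```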

31
Q

What is bootstrapping?

A

A further variant on k-fold cross-validation is to use a bootstrap. Bootstrapping is a simulation technique in which new data sets are created by sampling with replacement from the original data. In this context, it would involve, for each iteration, drawing a sample of size N (the combined size of the training and validation samples) with replacement. This sample is highly likely to contain some instances from the original sample more than once, while some instances will not appear at all – typically, around a third of the original data will not be sampled in each iteration.

Those instances not appearing in the bootstrapped training sample (called out-of-bootstrap data) then comprise the validation sample for that iteration. A large number of iterations would be performed (10,000 or more if computational resources permit) and the results averaged over the iterations as for k-fold cross-validation.
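
One bootstrap iteration is a few lines of NumPy; the “around a third” figure follows from $(1 - 1/N)^N \approx e^{-1} \approx 0.37$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000
boot_idx = rng.integers(0, N, size=N)           # draw N indices with replacement
oob_idx = np.setdiff1d(np.arange(N), boot_idx)  # out-of-bootstrap observations
print(len(oob_idx) / N)                         # ~0.37: that iteration's validation sample
```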

32
Q

Why are cross-validation and bootstrapping preferred over an arbitrary selection of training/validation samples?

A

Effectively, every observation appears in both the training and validation samples for different folds.

Cross-validation is also straightforward to implement, but a disadvantage is that it might be computationally expensive if the model is complex, the number of folds is large, or the total sample size is large. If there are k folds and h different possible values of the hyperparameter to consider, this will involve estimating k × h separate models, each on a sample of size N(1 − 1/k); for instance, with k = 5 and h = 101, that is 505 model fits, each using 80% of the training data. That could be computationally infeasible. Bootstrapping with a large number of iterations will be even more computationally demanding, although recent advances in computing have made this less onerous than previously.

33
Q

What are Grid Searches?

A

The purpose of cross-validation might be to determine the optimal value of a hyperparameter. To do this, the researcher might use a grid search procedure, which involves selecting a set of possible parameter values. To illustrate, suppose that the model under study involves specifying one hyperparameter, λ. This might, for example, be the hyperparameter that controls the strength of the penalty term in a LASSO regularization. Assume that the researcher determines that a range of 0 to 100 is plausible to investigate, with a step size of 1. The most appropriate value of λ could then be determined using 5-fold cross-validation as follows.

(1) Separate the composite training sample into five randomly assigned sub-samples.

(2) Set λ = 0.

(3) Perform the following operations:

(3.a.) Combine four of the sub-samples and estimate the model under study on that composite.

(3.b.) Using the remaining subsample, calculate a performance measure, such as the percentage of correct classifications or the mean squared error of the predictions.

(4) Repeat steps 3a and 3b for the other four combinations of the sub-samples.

(5) Calculate the average of the performance measure across the five validation folds.

(6) Add one to λ, and if λ ≤ 100, repeat steps 3 to 5, otherwise proceed to the next step.

(7) There will now be 101 performance statistics: one for each value of λ from 0 to 100. Select the optimal value of λ (call this λ*) corresponding to the best value of the performance statistic.

(8) Perform one final estimation of the model, this time using the entire training sample with the hyperparameter set to λ*.
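
scikit-learn's `GridSearchCV` packages steps (1) through (8), including the final refit on the whole training sample with λ*; the grid, data, and scoring choice below are illustrative, and `alpha` plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + rng.normal(size=200)

# lambda = 1, 2, ..., 100 (alpha = 0 is plain OLS, which sklearn's Lasso discourages)
param_grid = {"alpha": np.arange(1, 101)}
search = GridSearchCV(Lasso(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)          # refit=True by default: final fit on all data with lambda*
print(search.best_params_)
```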

34
Q

What is the drawback of a grid search framework?

A

A first issue is that the researcher may have no idea of even the scale of the hyperparameter, so a power scale might need to be used for λ, such as $10^{-1}, 10^{0}, 10^{1}, 10^{10}, \ldots$ It would be possible to search over a coarse grid, including relatively few points over a wide range, but that could leave the best hyperparameter value from the grid search still a long way from the optimal value. Even getting close to the latter using a more refined grid could impose a vast computational burden, but this is becoming less of a concern with the recent advances in computing technologies.

A second problem is that searching over too many grid points is another manifestation of overfitting and could lead to weaker test sample performance.

An alternative to grid search for hyperparameter selection would be to use random draws. This could reduce the computational time significantly and seems to work surprisingly well compared with more structured approaches. But if the researcher is unlucky, it could be that none of the randomly selected hyperparameter values come close to the optimum.
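
The random-draw alternative, sketched with a log-scaled distribution so that the unknown-scale problem is addressed at the same time (all settings illustrative):

```python
from scipy.stats import loguniform
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# 20 random draws of lambda across five orders of magnitude: far fewer
# model fits than an exhaustive fine grid would require.
search = RandomizedSearchCV(Lasso(),
                            {"alpha": loguniform(1e-3, 1e2)},
                            n_iter=20, cv=5, random_state=0)
```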
