Module 2: Chapter 3 - Supervised Learning for Numerical Data - Part 1, Econometric Techniques Flashcards

1
Q

Linear Regression / Machine Learning
Describe the OLS regression terms in ML parlance

A

y is a linear function of x and an unobservable error term
y is the target

u is the unobservable error term with mean zero and constant variance

β0 is a parameter to be estimated
β0 is the intercept parameter
β0 is known as the bias (ML)
β0 is the value that y would take if x were zero

β1 is a parameter to be estimated
β1 measures the impact on y of a unit change in x
β1 is the weight (ML)

The index i for each variable, or feature, denotes the observation number (i = 1, …, N, where N is the total number of data points, or instances, available for each variable)
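
For concreteness, the simple model described above is yi = β0 + β1xi + ui. Below is a minimal sketch of estimating the bias (intercept) and weight (slope) by OLS; it is written in Python on synthetic data and assumes the statsmodels library as the tool.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                    # single feature
u = rng.normal(scale=0.5, size=200)         # error term: mean zero, constant variance
y = 1.0 + 2.0 * x + u                       # target: y_i = beta0 + beta1 * x_i + u_i

X = sm.add_constant(x)                      # adds the intercept (bias) column
results = sm.OLS(y, X).fit()
print(results.params)                       # [estimated bias (beta0), estimated weight (beta1)]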

2
Q

What are the three methods to estimate parameters of a regression model?

A

(1) Least squares
(2) Maximum likelihood
(3) The method of moments

3
Q

Explain linearity in the context of regression

A

The model is both linear in the parameters (the equation is a linear function of β0 and β1) and linear in the variables (linear with respect to y and x). To use OLS, the model must be linear in the parameters, although it does not necessarily have to be linear in the features.
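
As an illustration, a model such as yi = β0 + β1xi^2 + ui is non-linear in the feature xi but still linear in the parameters β0 and β1, so OLS applies once the feature is transformed. A minimal Python sketch on synthetic data (statsmodels assumed):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=200)
y = 0.5 + 1.5 * x**2 + rng.normal(scale=0.3, size=200)   # y is non-linear in x

# Regressing y on the transformed feature x^2 keeps the model linear in the parameters,
# so ordinary least squares can still be used.
X = sm.add_constant(x**2)
print(sm.OLS(y, X).fit().params)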

4
Q

How many parameters does a multiple regression model estimate?

A

In the multiple linear regression model, there will be m + 1 parameters (m ≥ 1) to estimate: the intercept, β0, plus the m slope parameters, with further parameters for any interaction or power terms that are included.

5
Q

Explain the meaning of the different parameters in a multiple regression model

A

In the multiple linear regression model, each parameter measures the partial effect of the attached variable after controlling for the effects of all the other features included in the regression.

6
Q

Which specifications can we add to a multiple regression model?

A

1) It is common to apply a logarithmic transformation to some or all of the feature variables and/or to the output variable. Such a transformation implies a different interpretation of the parameter estimates, but OLS can still be used because the model remains linear in the parameters

2) We could incorporate interaction terms (i.e., features multiplied together)

3) We could include power terms of the features
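
As a hedged sketch of these specifications in Python (statsmodels' formula interface with made-up data; the names y, x1, and x2 are purely illustrative), a log transformation, an interaction term, and a power term can all be included while the model remains linear in the parameters:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.uniform(1, 10, 300), "x2": rng.uniform(1, 10, 300)})
df["y"] = np.exp(0.2 + 0.5 * np.log(df["x1"]) + 0.1 * df["x2"] + rng.normal(scale=0.1, size=300))

# Log of the output and one feature, an interaction term (x1:x2), and a power term (x2 squared);
# the specification is still linear in its parameters, so OLS remains valid.
model = smf.ols("np.log(y) ~ np.log(x1) + x2 + x1:x2 + I(x2 ** 2)", data=df).fit()
print(model.params)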

7
Q

What type of data do we need to run regression models?

A

The output variable y must be continuous, but the features could be continuous or discrete

8
Q

Which problems can we encounter with multiple regression models?

A

Problem 1: Wrong Features or Wrong “Functional Form”
Problem 2: Multicollinearity
Problem 3: Outliers
Problem 4: Heteroskedasticity

9
Q

Explain: Problem 1: Wrong Features or Wrong “Functional Form”

A

1) The model omits some relevant features
This is a potentially serious misspecification: it can cause the parameter estimates to be biased, and the bias does not disappear as the sample size increases

2) The model includes some irrelevant features
This is less serious than the first misspecification, but it can result in “inefficiency,” where the parameters are not estimated precisely. A further consequence is that the model is likely to find it hard to generalize from the specific training sample to the test sample

3) The model includes the correct features, but they are incorporated in the wrong way
This is known as an incorrect functional form. It could occur, for instance, if the true relationship between the features and the output is non-linear but a linear regression model is used.

These three problems are all more challenging to resolve in practice than they appear because the researcher never knows the true relationship between the variables. This is where a strong theoretical knowledge of the problem at hand and the wider context can be valuable in guiding the model development, rather than a purely data-driven approach.

10
Q

Explain: Problem 2: Multicollinearity

A

Multicollinearity occurs when the features are highly related to one another. We can draw a distinction between two degrees of multicollinearity: perfect and near.

Perfect multicollinearity occurs when two or more of the features have an exact linear relationship that holds for every data point. In such cases, there is insufficient information to estimate the parameters on both of the affected features (say, x2i and x3i), and hence the only solution is to remove one of the perfectly correlated variables from the model

If, for example, the correlation between x2i and x3i were 0.9, this would be known as near multicollinearity. In such circumstances, the estimation technique would find it hard to disentangle the separate influences of each variable. A common consequence is that the parameter estimates become highly unstable, changing wildly when a feature is added to or removed from the model.
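
A small Python illustration of near multicollinearity (synthetic data; the 0.9 correlation mirrors the example above, and statsmodels is an assumed tool): the estimate on x2 shifts markedly depending on whether x3 is included.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)   # corr(x2, x3) ≈ 0.9
y = 1.0 + x2 + x3 + rng.normal(size=n)

both = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
only_x2 = sm.OLS(y, sm.add_constant(x2)).fit()
print(both.params)      # estimates on x2 and x3 are imprecise when both are included
print(only_x2.params)   # the coefficient on x2 changes sharply once x3 is dropped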

11
Q

How to deal with multicollinearity?

A

There are various empirical approaches to dealing with near multicollinearity. These include the removal of one or more of the highly correlated variables or turning them into a ratio or difference rather than including them individually. Another way forward is to use regularization.
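
As one possible illustration of the regularization route (a Python sketch using scikit-learn's ridge regression on synthetic data; ridge is just one regularization choice among several):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x2 = rng.normal(size=200)
x3 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=200)   # highly correlated features
y = 1.0 + x2 + x3 + rng.normal(size=200)

# Ridge regression penalizes large coefficients, which stabilizes the estimates
# when features are highly correlated; alpha sets the strength of the penalty.
ridge = Ridge(alpha=1.0).fit(np.column_stack([x2, x3]), y)
print(ridge.intercept_, ridge.coef_)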

12
Q

Explain: Problem 3: Outliers

A

In broad terms, an outlier is an anomalous data point that lies a long way from the others. Because OLS squares the residuals, points that are a considerable distance from the others exert a disproportionate effect on the parameter estimates.

Outliers can be detected by examining a plot of the residuals (the differences between the actual data points and the corresponding values fitted from the regression line) and noting any points that lie much further from the line than the others.

A more sophisticated approach to outlier detection is to calculate Cook’s distance, which measures the influence of each individual data point on the parameter estimates. This is achieved by removing each data point in turn from the regression and measuring how much the fit for all the remaining points changes.
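
A minimal Python sketch of both detection approaches on synthetic data (statsmodels assumed; the outlier is injected artificially for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)
y[0] += 10.0                                   # inject an artificial outlier

results = sm.OLS(y, sm.add_constant(x)).fit()

# Residual plot idea: look for residuals lying much further from zero than the rest.
resid = results.resid

# Cook's distance: influence of each observation on the fitted regression.
cooks_d, _ = results.get_influence().cooks_distance
print(np.argmax(cooks_d), cooks_d.max())       # the injected outlier should stand out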

13
Q

Explain: Problem 4: Heteroskedasticity

A

OLS assumes that the variance of the error term is constant and finite, which is known as the homoskedasticity assumption.

If the assumption does not hold, and the variance is not constant, this is known as heteroskedasticity. It occurs frequently in time series of stock and bond returns and is usually also present in the residuals from models of these series.

Heteroskedasticity can lead to several issues with regression estimation, most notably that OLS becomes inefficient and that it becomes hard to accurately evaluate the empirical importance of each feature in determining the output.

14
Q

How to identify heteroskedasticity?

A

A residual plot can sometimes be useful in detecting heteroskedasticity: we look at whether the spread of the residuals around their mean (usually zero) is constant or changes systematically. It is common to plot the residuals on the y-axis against the fitted values from the model on the x-axis.

Alternatively, there are various formal statistical tests for heteroskedasticity. A popular such test is the Goldfeld-Quandt test, which splits the sample into two parts and statistically compares the residual variances between the two parts.

An alternative is White’s test, which involves obtaining the residuals, ui, from a regression such as (3.2) and conducting a second (“auxiliary”) regression of the squared residuals (ui^2) on the features (such as x1i and x2i), the squares of the features (such as x1i^2 and x2i^2), and the interactions between features (such as x1i × x2i). If there is no heteroskedasticity, the parameter estimates from this auxiliary regression will not be statistically significant.
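
Both tests are available in statsmodels; the following is a Python sketch on synthetic, deliberately heteroskedastic data (the data-generating choices are illustrative assumptions):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt, het_white

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)        # error variance grows with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Goldfeld-Quandt: split the sample and compare residual variances between the two parts.
f_stat, p_gq, _ = het_goldfeldquandt(y, X)
# White: auxiliary regression of squared residuals on features, their squares and interactions.
lm_stat, p_white, _, _ = het_white(results.resid, X)
print(p_gq, p_white)                                  # small p-values point to heteroskedasticity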

15
Q

How to deal with heteroskedasticity?

A

One approach is to weight the observations to account for the changing error variance, using a technique known as weighted least squares (WLS) instead of OLS.

Alternatively, making a logarithmic transformation of the variables and using these in place of the raw variables in the regression model can also help to resolve the issue.
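
A minimal WLS sketch in Python (statsmodels on synthetic data; weighting by 1/x^2 assumes the error standard deviation is proportional to x, which is an illustrative assumption):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # error standard deviation proportional to x

X = sm.add_constant(x)
# Weight each observation by the inverse of its assumed error variance,
# so that noisier observations receive less influence in the fit.
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls.params)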

16
Q

What are stepwise regression procedures used for?

A

The presence of non-informative or redundant features in a linear regression model can add uncertainty to the predictions and reduce the effectiveness of the model, especially in the presence of highly correlated features. This creates a need for methods to remove non-informative predictors.

17
Q

What are wrapper methods for feature selection?

A

Stepwise regression procedures belong to the category of wrapper methods, which add or remove predictors to a regression with the aim of finding the combination that maximizes model performance. The inclusion or exclusion of a feature is based on a criterion that measures the predictive accuracy of alternative sets of predictors.

A popular approach is to choose the model that minimizes the Akaike Information Criterion (AIC), a measure of prediction error adjusted to account for the number of features in the model. Unlike R-squared, the AIC penalizes large models, so it can either increase or decrease when an additional feature is added to the model.

18
Q

What are the most common stepwise procedures?

A

The forward stepwise selection method starts with a model containing no features and adds features one at a time, at each step choosing the feature that produces the largest drop in the AIC. The procedure stops when the addition of any new predictor fails to decrease the AIC.

The backward stepwise selection method begins with the full model and removes the predictors one by one, starting with the least important, until any further elimination fails to decrease the AIC.

The two procedures will not necessarily select the same model. In fact, more generally, there is no guarantee that either of the two methods will select the optimal model, as only a subset of the 2^m possible models (where m is the number of predictors) is considered. Although in principle it is possible to estimate all the possible models and compare them, this is very impractical when the set of candidate predictors is large.

Backward selection tends to be more computationally efficient when the set of candidate predictors is large. However, as the full model is the first to be estimated, the number of predictors is required to be strictly smaller than the number of observations. In contrast, forward selection can also be applied when the number of predictors is larger than the number of observations.
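
A sketch of forward stepwise selection by AIC in Python (statsmodels on synthetic data; forward_stepwise is a hypothetical helper written for illustration, not a library function):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(y, X):
    # Greedy forward selection: add the feature giving the largest AIC drop; stop when none helps.
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic              # intercept-only model
    while remaining:
        aics = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic for c in remaining}
        best_col = min(aics, key=aics.get)
        if aics[best_col] >= best_aic:                            # no candidate lowers the AIC
            break
        best_aic = aics[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 1.0 + 2.0 * X["x0"] - 1.0 * X["x3"] + rng.normal(size=200)
print(forward_stepwise(y, X))                                     # typically selects ['x0', 'x3']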

19
Q

What is a classification problem?

A

In machine learning jargon, the problem of predicting a qualitative outcome is referred to as a classification problem, and assigning an observation to one class rather than another is referred to as classifying the observation. A specific case of categorical data is where the output is binary – that is, it has only two outcomes.

In such cases, we would be interested in modeling the probability of one of the outcomes occurring (the probability of a prospective borrower defaulting, the probability of a transaction being fraudulent, etc.). One outcome (referred to as the positive outcome) is assigned a value of one, and the other (referred to as the negative outcome) is assigned a value of zero.

20
Q

What is the logistic regression model?

A

The logistic regression model applies a cumulative logistic function transformation to a linear combination of the features, so the output (the fitted probability of the positive outcome) is bounded between zero and one.
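
In symbols, the fitted probability is P(yi = 1 | xi) = 1 / (1 + exp(-(β0 + β1xi))), which always lies between zero and one. A minimal Python sketch on synthetic data (statsmodels assumed):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=500)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))          # logistic function keeps p in (0, 1)
y = rng.binomial(1, p)                               # binary outcome: 1 = positive class

logit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(logit.params)                                   # estimates of beta0 and beta1
print(logit.predict(sm.add_constant(x))[:5])          # fitted probabilities, bounded in (0, 1)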

21
Q

What other types of limited dependent variable models exist?

A

Discrete Choice Models

There are situations where we wish to model categorical data with more than two categories, which calls for a discrete choice model. For example, we can use an extension of logit regression known as the multinomial logit model: models are estimated for all but one of the categories present in the data, and the omitted category serves as the baseline.

Ordinal Variables

A further class of problems relates to categorical data where an output could be drawn from one of several categories but where the categories have an implicit ordering – i.e., ordinal data. For modeling ordinal variables with more than two outcomes, ordered logit models are used. The estimation principles are the same as for the binary case, but the values of the cutoff parameters between categories must also be estimated.
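
An illustrative Python sketch of both model types (statsmodels on made-up category labels, purely to show the mechanics rather than a meaningful fit):

import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(10)
x = rng.normal(size=(400, 2))
y_cat = rng.integers(0, 3, size=400)                      # three toy categories (0, 1, 2)

# Multinomial logit: one set of parameters per non-baseline category.
mnl = sm.MNLogit(y_cat, sm.add_constant(x)).fit(disp=0)
print(mnl.params.shape)                                    # (features + const) x (categories - 1)

# Ordered logit: for ordered categories; cutoff parameters between categories are also estimated.
ordered = OrderedModel(y_cat, x, distr="logit").fit(method="bfgs", disp=0)
print(ordered.params)                                      # slopes plus the estimated cutoffs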

22
Q

What is Linear Discriminant Analysis?

A

When there are multiple, well-separated classes, the estimates from a logistic regression turn out to be very unstable. In this case, an alternative to logistic regression is offered by linear discriminant analysis (LDA).

Linear discriminant analysis assumes that, within each class, the features follow a multivariate normal distribution with a class-specific mean vector and a variance-covariance matrix that is common across classes.

The idea is to assign each instance to the class with the highest conditional probability given its features – that is, the class the new data point is most likely to belong to. LDA has been shown to work well in practice even when its assumptions are not met.
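
A minimal LDA sketch in Python (scikit-learn on synthetic data drawn to match the assumptions above: two classes with different mean vectors but a common covariance matrix):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(11)
cov = [[1.0, 0.3], [0.3, 1.0]]                               # common covariance matrix
class0 = rng.multivariate_normal([0.0, 0.0], cov, size=150)  # class-specific mean vectors
class1 = rng.multivariate_normal([2.0, 2.0], cov, size=150)
X = np.vstack([class0, class1])
y = np.array([0] * 150 + [1] * 150)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))           # assign the new point to its most probable class
print(lda.predict_proba([[1.0, 1.0]]))     # conditional class probabilities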