Module 2: Chapter 3 - Supervised Learning for Numerical Data - Part 1, Econometric Techniques Flashcards
Linear Regression / Machine Learning
Describe the OLS regression terms in ML parlance
y is a linear function of x and an unobservable error term
y is the target
u is the unobservable error term with mean zero and constant variance
β0 is a parameter to be estimated
β0 is the intercept parameter
β0 is known as the bias (ML)
β0 is the value that y would take if x were zero
β1 is a parameter to be estimated
β1 measures the impact on y of a unit change in x
β1 is the weight (ML)
The index i for each variable, or feature, denotes the observation number (i = 1, …, N, where N is the total number of data points, or instances, available for each variable)
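For reference, the simple regression model these terms describe can be written as follows (a standard statement of the model, using the notation defined above):

```latex
y_i = \beta_0 + \beta_1 x_i + u_i, \qquad i = 1, \dots, N
```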
What are the three methods to estimate parameters of a regression model?
(1) Least squares
(2) Maximum likelihood
(3) The method of moments
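As a minimal sketch of the first method, least squares, the following fits a simple regression by ordinary least squares on simulated data; the variable names and simulated values are illustrative assumptions, not from the text.

```python
import numpy as np

# Simulated data: y is a linear function of x plus an error term (illustrative values)
rng = np.random.default_rng(42)
N = 200
x = rng.normal(size=N)
u = rng.normal(scale=0.5, size=N)   # error term with mean zero and constant variance
y = 1.0 + 2.0 * x + u               # true beta0 (bias) = 1.0, true beta1 (weight) = 2.0

# Least squares: stack a column of ones (for the intercept) next to the feature
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta0 (bias):  ", beta_hat[0])
print("estimated beta1 (weight):", beta_hat[1])
```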
Explain linearity in the context of regression
The model is both linear in the parameters (the equation is a linear function of β0 and β1) and linear in the variables (linear with respect to y and x). To use OLS, the model must be linear in the parameters, although it does not necessarily have to be linear in the features.
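For example, a specification such as the following (an illustrative example, not one from the text) is nonlinear in the feature x but still linear in the parameters, so OLS can be applied:

```latex
y_i = \beta_0 + \beta_1 x_i^2 + u_i
```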
How many parameters does a multiple regression model estimate?
In the multiple linear regression model, there will be m + 1 parameters (m ≥ 1) to estimate: one for the intercept, β0, and one for each of the m slope parameters (plus potential interaction or power terms)
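Written out in full, the multiple linear regression model with m features takes the standard form:

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_m x_{mi} + u_i
```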
Explain the meaning of the different parameters in a multiple regression model
In the multiple linear regression model, each parameter measures the partial effect of the attached variable after controlling for the effects of all the other features included in the regression.
Which specifications can we add to a multiple regression model?
1) It is common to apply a logarithmic transformation to some or all of the feature variables and/or to the output variable. Such a transformation implies a different interpretation of the parameter estimates, but OLS can still be used because the model remains linear in the parameters
2) We could incorporate interaction terms (i.e., features multiplied together)
3) We could include power terms of the features
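A rough sketch of how these three specifications might be constructed before estimating with OLS; the library choice (statsmodels), column names, and simulated data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data set with two positive-valued features (names and values are made up)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"x1": rng.uniform(1, 10, n), "x2": rng.uniform(1, 10, n)})
df["y"] = (2.0 + 0.5 * np.log(df["x1"]) + 0.3 * df["x2"]
           + 0.1 * df["x1"] * df["x2"] + rng.normal(scale=0.2, size=n))

X = pd.DataFrame({
    "log_x1": np.log(df["x1"]),     # (1) logarithmic transformation of a feature
    "x2": df["x2"],
    "x1_x2": df["x1"] * df["x2"],   # (2) interaction term
    "x2_sq": df["x2"] ** 2,         # (3) power term
})
X = sm.add_constant(X)              # intercept (bias)

# The model is still linear in the parameters, so OLS can be used
results = sm.OLS(df["y"], X).fit()
print(results.params)
```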
What type of data do we need to run regression models?
The output variable y must be continuous, but the features could be continuous or discrete
Which problems can we encounter with multiple regression models?
Problem 1: Wrong Features or Wrong “Functional Form”
Problem 2: Multicollinearity
Problem 3: Outliers
Problem 4: Heteroskedasticity
Explain: Problem 1: Wrong Features or Wrong “Functional Form”
1) The model omits some relevant features
This can be a serious misspecification: it can cause the parameter estimates to be biased, and they will not become more accurate as the sample size increases
2) The model includes some irrelevant features
This is less serious than the first misspecification, but can result in “inefficiency,” where the parameters are not estimated precisely. A further consequence is that it is likely the model will find it hard to generalize from the specific training sample to the test sample
3) The model includes the correct features, but they are incorporated in the wrong way
This is known as an incorrect functional form. It could occur, for instance, if the true relationship between the features and the output is non-linear but a linear regression model is used (see the sketch following this answer).
These three problems are all more challenging to resolve in practice than they appear because the researcher never knows the true relationship between the variables. This is where a strong theoretical knowledge of the problem at hand and the wider context can be valuable in guiding the model development, rather than a purely data-driven approach.
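As an illustration of point (3), the sketch below simulates a quadratic relationship and fits a straight line to it; the systematic pattern left in the residuals is the tell-tale sign of an incorrect functional form (all data are simulated, purely for illustration).

```python
import numpy as np
import statsmodels.api as sm

# Simulate a quadratic relationship between the feature and the output
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 500)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=500)

# Misspecified model: linear in x only (wrong functional form)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# If the functional form were correct, the residuals would look like noise;
# here they are strongly related to x^2, revealing the misspecification
print("correlation of residuals with x^2:", round(np.corrcoef(resid, x**2)[0, 1], 3))
```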
Explain: Problem 2: Multicollinearity
Multicollinearity occurs when the features are highly related to one another. We can draw a distinction between two degrees of multicollinearity: perfect and near.
Perfect multicollinearity occurs when two or more of the features have an exactly linear relationship that holds for every data point. Suppose, for example, that two features x2i and x3i are related in this way: there is then insufficient information to estimate the parameters on both, and the only solution is to remove one of the perfectly correlated variables from the model.
If, instead, the correlation between x2i and x3i were 0.9, this would be known as near multicollinearity, and in such circumstances the estimation technique would find it hard to disentangle the separate influences of each variable. A common consequence is that the parameter estimates become highly unstable, changing wildly when a feature is added to or removed from the model.
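A rough illustration of how near multicollinearity destabilizes the estimates; the simulated correlation of roughly 0.9 between the two features mirrors the example above, and all values are made up.

```python
import numpy as np
import statsmodels.api as sm

# Two features with correlation of roughly 0.9 (simulated, illustrative)
rng = np.random.default_rng(7)
N = 200
x2 = rng.normal(size=N)
x3 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=N)
y = 1.0 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=N)

# Fit with both correlated features, then with x3 dropped
fit_both = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
fit_one = sm.OLS(y, sm.add_constant(x2)).fit()

print("both features, params:", fit_both.params, "std errors:", fit_both.bse)
print("x3 removed, params:   ", fit_one.params)   # coefficient on x2 shifts noticeably
```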
How to deal with multicollinearity?
There are various empirical approaches to dealing with near multicollinearity. These include the removal of one or more of the highly correlated variables or turning them into a ratio or difference rather than including them individually. Another way forward is to use regularization.
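A minimal sketch of the regularization route, using ridge regression from scikit-learn; the choice of library and of the penalty strength is an assumption, not something specified in the text.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same kind of highly correlated features as in the previous sketch (simulated)
rng = np.random.default_rng(7)
N = 200
x2 = rng.normal(size=N)
x3 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=N)
X = np.column_stack([x2, x3])
y = 1.0 + x2 + x3 + rng.normal(size=N)

# The L2 penalty shrinks the coefficients, stabilizing them despite the collinearity
ridge = Ridge(alpha=1.0)    # alpha is the penalty strength (illustrative value)
ridge.fit(X, y)
print("intercept:   ", ridge.intercept_)
print("coefficients:", ridge.coef_)
```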
Explain: Problem 3: Outliers
In broad terms, an outlier is an anomalous data point that lies a long way from the others. Because OLS squares the residual distances between the data points and the fitted line, points that are a considerable distance from the others exert a disproportionate effect on the parameter estimates.
Outliers can be detected by examining a plot of the residuals (the differences between the actual data points and the corresponding values fitted from the regression line) and noting any points that lie much further from the line than the others.
A more sophisticated approach to outlier detection is to calculate Cook’s distance, which measures the influence of each individual data point on the parameter estimates. This is achieved by removing each data point separately from the regression and determining the difference in model fit for all the remaining points.
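A sketch of how Cook's distance might be computed in practice, here via statsmodels' influence diagnostics; the data are simulated and one artificial outlier is injected for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with one artificially injected outlier
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
y[0] += 10.0

results = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's distance: the influence of each individual data point on the estimates
cooks_d, _ = results.get_influence().cooks_distance
print("most influential observation:", int(np.argmax(cooks_d)))
print("its Cook's distance:", round(float(np.max(cooks_d)), 3))
```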
Explain: Problem 4: Heteroskedasticity
OLS assumes that the variance of the error term is constant and finite, which is known as the homoskedasticity assumption.
If this assumption does not hold and the variance is not constant, this is known as heteroskedasticity. It occurs frequently in time series of stock and bond returns and is usually also present in the residuals from models of these series.
Heteroskedasticity can lead to several issues with regression estimation, most notably that OLS becomes inefficient and that it becomes hard to evaluate accurately the empirical importance of each feature in determining the output.
How to identify heteroskedasticity?
A residual plot can sometimes be useful in detecting heteroskedasticity: we look at whether the spread of the residuals around their mean (usually zero) is constant or changes systematically. It is common to plot the residuals on the y-axis against the fitted values from the model on the x-axis.
Alternatively, there are various formal statistical tests for heteroskedasticity. One popular test is the Goldfeld-Quandt test, which splits the sample into two parts and statistically compares the residual variances of the two.
An alternative is White's test, which involves obtaining the residuals, ui, from a regression such as (3.2) and conducting a second ("auxiliary") regression of the squared residuals (ui²) on the features (such as x1i and x2i), the squares of the features (such as x1i² and x2i²), and the interactions between features (such as x1i·x2i). If there is no heteroskedasticity, the parameter estimates from this auxiliary regression will not be statistically significant.
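A sketch of how both tests could be run with statsmodels; the simulated data are constructed so that the error variance grows with the feature, purely for illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt, het_white

# Simulated data where the error variance increases with the feature
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Goldfeld-Quandt: splits the sample and compares the residual variances of the two parts
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
print("Goldfeld-Quandt p-value:", round(gq_pvalue, 4))

# White's test: auxiliary regression of squared residuals on features, squares, and interactions
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(resid, X)
print("White's test p-value:   ", round(lm_pvalue, 4))
```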
How to deal with heteroskedasticity?
One approach is to weight the observations to account for the changing error variance, using a technique known as weighted least squares (WLS) instead of OLS.
Alternatively, making a logarithmic transformation of the variables and using these in place of the raw variables in the regression model can also help to resolve the issue.
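A minimal sketch of the WLS approach with statsmodels, assuming (unrealistically, for illustration) that the form of the changing error variance is known so the weights can be set to its inverse.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where the error standard deviation grows with the feature
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
sigma = 0.3 * x                     # assumed known here; in practice it must be estimated
y = 1.0 + 0.5 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)

# WLS down-weights the noisier observations: weights = 1 / error variance
wls_fit = sm.WLS(y, X, weights=1.0 / sigma**2).fit()
ols_fit = sm.OLS(y, X).fit()
print("WLS estimates:", wls_fit.params)
print("OLS estimates:", ols_fit.params)
```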