Econometrics 2: Bivariate Linear Regression 2.2 - Fitting the “best” line Flashcards

1
Q

Introduce & explain ‘best fit line’ and why it’s needed

A

The econometrician is interested in the values of the parameters $\beta_0$ and $\beta_1$, particularly $\beta_1$, as this directly shows how $X$ affects $Y$. If we knew their numerical values then we would know the position of the true regression line. However, we do not know these parameter values and therefore we do not know the position of the true regression line. Hence, we collect data on the variables of interest, $X$ and $Y$, and plot the data points on a scatter graph to gain an idea of the correlation. The data alone are not enough: to glean any information about this economic relationship we also need a statistical technique that will allow us to estimate values for $\beta_0$ and $\beta_1$. We will denote these estimated values as $\hat{\beta}_0$ and $\hat{\beta}_1$.
So, $\beta_0$ and $\beta_1$ are the true parameters, whose values are unknown, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of these parameters. The main issue for the econometrician is how to get these estimates. We know already that to do estimation we need data and we need a statistical procedure. Combining the two will produce estimated values for the parameters $\beta_0$ and $\beta_1$. An important question is whether the estimates we obtain are accurate. This relies on us using good quality data and an appropriate estimation method. Different model specifications may require different techniques, or the data we collect may not be exactly in the form that the model specifies, which may affect the properties of the model and hence suggest a certain type of estimation procedure. But for now, we will assume that the data are appropriate and that the most basic of estimation techniques is applicable.
Assume that we have a set of data that represents the income and consumption of a sample of people, shown in the earlier plot. We want to find the line that “best” fits through this plot of data. Once we have found this best line then we have found our estimates of $\beta_0$ and $\beta_1$, i.e. $\hat{\beta}_0$ and $\hat{\beta}_1$, where $\hat{\beta}_0$ is the value where the line crosses the $Y$ axis and $\hat{\beta}_1$ is the slope of the line. Now that we are relating the regression model to the sample of data observations on income and consumption, we can show that for each individual $i$ in the sample:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{3}$$
where $i = 1, \dots, n$. The subscript $i$ indicates that we are looking at data on individuals and there are $n$ individuals in the data set. We say that the sample is of size $n$.

If we replace the parameters with their estimates (we’ll discuss shortly how to get these estimates), we get
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$$
Here $\hat{\beta}_0 + \hat{\beta}_1 X_i$ is the estimated regression line and the term $\hat{\varepsilon}_i$ is called the residual, which is essentially the estimated version of the error term $\varepsilon_i$. The residual gives us the vertical distance that each data point in the sample lies away from the estimated regression line (positive for points above the line, negative for points below it). Using Excel, we can find the best fitting line for any sample of data, and Excel also gives us $\hat{\beta}_0$ and $\hat{\beta}_1$. Some individuals are above the line, some below, some close to it, some not so close. It is not possible to fit a single straight line through all points.
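As a rough illustration of the fitted line and its residuals, the sketch below uses Python with NumPy in place of Excel. The income and consumption figures are made-up numbers chosen only for illustration, and `np.polyfit` is used here simply as one convenient way to obtain a best fitting straight line.

```python
import numpy as np

# Hypothetical sample (illustrative numbers only): income X and
# consumption Y for n = 6 individuals.
income = np.array([100.0, 150.0, 200.0, 250.0, 300.0, 350.0])
consumption = np.array([90.0, 120.0, 170.0, 190.0, 250.0, 260.0])

# Fit a straight line. np.polyfit returns coefficients from the highest
# degree down, so the slope comes first and the intercept second.
beta1_hat, beta0_hat = np.polyfit(income, consumption, deg=1)

# Fitted values on the estimated line, and residuals = actual - fitted.
fitted = beta0_hat + beta1_hat * income
residuals = consumption - fitted

print(f"intercept estimate: {beta0_hat:.2f}")
print(f"slope estimate:     {beta1_hat:.2f}")
for x, y, e in zip(income, consumption, residuals):
    print(f"X={x:.0f}  Y={y:.0f}  residual={e:+.2f}")
```

Each observation equals its fitted value plus its residual, and the residuals take both signs because no single straight line passes through every point.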

2
Q

Introduce & explain OLS

A

The important question for the moment is how to actually find the best fitting line, i.e. how does Excel come up with the line in a scatter graph? We need a mathematical/statistical criterion with which to do this. Obviously, we want the residuals to be as small as possible, i.e. we want observations to be as close to the line as possible. Consider two example lines drawn through the scatter graph. Both are pretty bad at fulfilling such a criterion and clearly do not represent the general relationship between the two variables $X$ and $Y$. One of the example lines has the right slope but the intercept is too high (and all the residuals are negative because all of the points lie below it). The other line has a negative slope and the intercept is also too high (some residuals are positive and some negative).

How do we find the best line? What if we consider adding up all of the residual values and choose the line that gives the lowest absolute summed value, i.e. find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that give the value of $S$ that is closest to 0, where

$$S = \left| \sum_{i=1}^{n} \hat{\varepsilon}_i \right|$$

The problem with this criterion is that a line like the downward sloping one above can produce the lowest value for $S$. This is because the positive and negative residuals cancel each other out, so the sum is close to 0. In fact, any line that passes through the point of sample means $(\bar{X}, \bar{Y})$ has residuals that sum to exactly zero, whatever its slope. We clearly do not want to use a criterion that chooses a downward sloping line for a set of data that is clearly trending upwards.
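The cancellation problem can be checked numerically. The sketch below is only an illustration: the data are made-up, and the helper `abs_sum_of_residuals` and both candidate lines are hypothetical choices, not taken from the original cards.

```python
import numpy as np

# Hypothetical, clearly upward-trending sample (illustrative numbers only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def abs_sum_of_residuals(intercept, slope):
    """Return S = |sum of residuals| for the line y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return abs(residuals.sum())


x_bar, y_bar = x.mean(), y.mean()

# A downward-sloping line forced through the point of means (x_bar, y_bar).
bad_slope = -2.0
bad_intercept = y_bar - bad_slope * x_bar

# A sensible upward-sloping line that misses the point of means slightly.
ok_intercept, ok_slope = 0.1, 2.0

s_bad = abs_sum_of_residuals(bad_intercept, bad_slope)
s_ok = abs_sum_of_residuals(ok_intercept, ok_slope)

print(f"S for the downward-sloping line: {s_bad:.6f}")
print(f"S for the upward-sloping line:   {s_ok:.6f}")
```

Under this criterion the downward-sloping line scores better than the sensible one, even though it plainly misrepresents the data.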
The criterion that works the best and which is used most often by econometricians is to minimise
the sum of squared residuals, i.e.
$$S = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$$

That way, negative residuals, when squared, would become positive and would no longer cancel
out with the squares of the positive residuals when summed together. The process that finds
estimates based on this criterion is called Ordinary Least Squares estimation or OLS for short. This
is an extremely common estimation technique for econometricians and is often the basis for other
techniques, when OLS itself is not appropriate.
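To make the least squares criterion concrete, the sketch below computes OLS estimates for a small made-up sample and compares the sum of squared residuals with that of two poor lines like those described earlier. The closed-form expressions used, slope $= \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and intercept $= \bar{Y} - \hat{\beta}_1 \bar{X}$, are the standard textbook solution for the bivariate case and are not stated in the cards themselves. The data, the helper `sum_squared_residuals` and the two alternative lines are hypothetical.

```python
import numpy as np

# Hypothetical upward-trending sample (illustrative numbers only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def sum_squared_residuals(intercept, slope):
    """Return S = sum of squared residuals for y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# Standard closed-form OLS estimates for a bivariate regression.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Candidate 1: the OLS line itself.
ols_ssr = sum_squared_residuals(beta0_hat, beta1_hat)
# Candidate 2: right slope, but intercept shifted too high.
too_high_ssr = sum_squared_residuals(beta0_hat + 3.0, beta1_hat)
# Candidate 3: downward-sloping line through the point of means.
downward_ssr = sum_squared_residuals(y_bar + 2.0 * x_bar, -2.0)

print(f"OLS estimates: intercept={beta0_hat:.3f}, slope={beta1_hat:.3f}")
print(f"SSR, OLS line:             {ols_ssr:.3f}")
print(f"SSR, intercept too high:   {too_high_ssr:.3f}")
print(f"SSR, downward-sloping:     {downward_ssr:.3f}")
```

Because squaring removes the sign, the downward-sloping line no longer benefits from cancellation and is penalised heavily, while the OLS line gives the smallest value of the three.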
