Chapter 3: Multiple Regression Panel Data Flashcards
1
Q
Linear Regression
A
- Investigate the relationship between two variables
- Does blood pressure relate to age?
- Does income relate to education?
- Do sales relate to years of experience?
- Regressions identify relationships between dependent and independent variables
- Is there an association between the two variables?
- Estimation of the impact of an independent variable
- Used for numerical prediction and time series forecasting
- Regression is a long-established statistical technique:
- Sir Francis Galton (1822-1911) studied the relationship between a father's height and his son's height
2
Q
The Simple Linear Regression Model
A
- Linear regression is a statistical tool for numerical prediction.
- The first-order linear model:
- y = β0 + β1x + ε
- y = dependent variable
- x = independent variable
- β0 = y-intercept
- β1 = slope of the line
- ε = error variable (estimated by the sample residual)
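A minimal sketch of what the model asserts, simulating data from it (the coefficient values β0 = 2 and β1 = 0.5 are invented for illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters, not from the text: y = β0 + β1*x + ε
beta0, beta1 = 2.0, 0.5
x = rng.uniform(0, 10, size=100)       # independent variable
epsilon = rng.normal(0, 1, size=100)   # error variable
y = beta0 + beta1 * x + epsilon        # dependent variable
```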
3
Q
Estimating the Coefficients
A
- The estimated coefficients are random variables (they vary from sample to sample)
- Ordinary least squares (OLS) estimates are determined by
- drawing a sample from the population of interest,
- calculating sample statistics, and
- producing the straight line that best fits the data (minimizing the sum of squared errors).
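A sketch of the computation on a small invented sample, using the standard OLS formulas b1 = Sxy / Sxx and b0 = ȳ - b1·x̄:

```python
import numpy as np

# Invented sample drawn "from the population of interest"
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample statistics give the least-squares line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"yhat = {b0:.3f} + {b1:.3f} * x")
```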
4
Q
The Multiple Linear Regression Model
A
- A k-variable regression model can be expressed as a series of equations
- Condensing the equations into matrix form gives the general linear model
- The b coefficients are known as partial regression coefficients
- x1, x2, for example:
- x1 = "years of experience"
- x2 = "age"
- y = "salary"
- Estimated equation:
- ŷ = b0 + b1x1 + b2x2
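A sketch of this salary example with statsmodels' formula API; the data values are invented so the code runs end to end:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented illustrative data for the card's example
df = pd.DataFrame({
    "salary":     [40, 46, 52, 60, 63, 70],
    "experience": [ 1,  3,  5,  8, 10, 13],
    "age":        [24, 28, 31, 36, 40, 45],
})

model = smf.ols("salary ~ experience + age", data=df).fit()
print(model.params)  # b0 (Intercept), b1 (experience), b2 (age)
```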
5
Q
Matrix Notation
A
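- In matrix form (as used on the OLS Estimation card): y = Xβ + ε
- y: n × 1 vector of observations of the dependent variable
- X: n × (p + 1) data (design) matrix, whose first column is all 1s for the intercept
- β: (p + 1) × 1 vector of regression coefficients
- ε: n × 1 vector of error terms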
6
Q
OLS Estimation
A
- Sample-based counterpart to the population regression model:
- y = Xβ + ε (population model)
- y = Xb + e (sample estimate, with residual vector e)
- OLS chooses the values of the estimated coefficients such that the error sum-of-squares (SSE) is as small as possible for the sample.
- SSE = eᵀe = (y − Xb)ᵀ(y − Xb)
- Differentiating with respect to the unknown coefficients and setting the result to zero gives the normal equations XᵀXb = Xᵀy, hence b = (XᵀX)⁻¹Xᵀy.
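A sketch of the resulting closed-form solution in NumPy, on randomly generated data (the true coefficients below are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)     # invented true coefficients

# Normal equations: solve (X'X) b = X'y (more stable than explicitly inverting X'X)
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b   # residual vector
sse = e @ e     # the minimized error sum-of-squares
print(b, sse)
```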
7
Q
Selected Statistics
A
- Adjusted R² (penalizes extra parameters)
- It represents the proportion of variability of Y explained by the X's. R² is adjusted so that models with different numbers of variables can be compared.
- The F-test (does any parameter have influence?)
- A significant F indicates a linear relationship between Y and at least one of the X's.
- The t-test of each partial regression coefficient (does parameter Xi have influence?)
- A significant t indicates that the variable in question influences the response variable while controlling for the other explanatory variables.
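Assuming a statsmodels fit like the salary example above, all three statistics can be read off the results object (data again invented):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "salary":     [40, 46, 52, 60, 63, 70],
    "experience": [ 1,  3,  5,  8, 10, 13],
    "age":        [24, 28, 31, 36, 40, 45],
})
res = smf.ols("salary ~ experience + age", data=df).fit()

print(res.rsquared, res.rsquared_adj)  # R² and adjusted R²
print(res.fvalue, res.f_pvalue)        # F-test: is there any linear relationship?
print(res.tvalues, res.pvalues)        # t-test per partial regression coefficient
```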
8
Q
Gauss-Markov Assumptions
A
The OLS estimator is the best linear unbiased estimator (BLUE) if:
- 1) There is a linear relationship between the predictors and y
- 2) The expected value of the residual vector is 0
- 3) There is no correlation between the ith and jth residual terms
- 4) The residuals follow a Gaussian distribution and exhibit constant variance (homoscedasticity)
- 5) The covariance between the X's and the residual terms is 0
- Usually satisfied if the predictor variables are fixed and non-stochastic
- 6) No multicollinearity
- The data matrix X has full column rank: rank(X) = p + 1 (p predictors plus the intercept column)
- p < n, the number of observations
- No exact linear relationship among the X variables
- A basic check for multicollinearity is to calculate the correlation coefficient for each pair of predictor variables (see the sketch after this list)
- Large correlations (both positive and negative) indicate problems.
- One interpretation of "large": greater than the correlations between the predictors and the response
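A sketch of that basic pairwise check with pandas; x2 is constructed to be nearly collinear with x1 (all values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "y":  [5, 7, 9, 11, 14, 16],
    "x1": [1, 2, 3,  4,  5,  6],
    "x2": [2, 4, 5,  8, 10, 12],  # roughly 2 * x1 -> near-exact linear relationship
})

corr = df.corr()
print(corr.loc[["x1", "x2"], ["x1", "x2"]])  # predictor-predictor correlations
print(corr.loc[["x1", "x2"], "y"])           # predictor-response correlations
# Warning sign: |corr(x1, x2)| exceeds the predictor-response correlations
```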
9
Q
Heteroscedasticity
A
- When the requirement of constant variance is violated, we have heteroscedasticity.
- The Breusch-Pagan test or the White test can be used to check for heteroscedasticity.
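A sketch of the Breusch-Pagan test as implemented in statsmodels, on data generated so that the error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, x)  # error variance increases with x

res = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
print(lm_pvalue)  # small p-value -> reject constant variance, i.e. heteroscedasticity
```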
10
Q
Homoscedasticity
A
- When the requirement of constant variance is not violated, we have homoscedasticity.
11
Q
Outliers
A
- An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
- There was an error in recording the value.
- The point does not belong in the sample.
- The observation is valid.
- Identify outliers from the scatter diagram.
- There are also methods for "robust" regression.
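One numeric complement to the scatter-diagram check (an addition, not from the card) is to flag externally studentized residuals; a sketch with one planted outlier:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 50)
y[10] += 15  # plant one unusually large observation

res = sm.OLS(y, sm.add_constant(x)).fit()
student = res.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])  # flags observation 10 for investigation
```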
12
Q
Modeling: Nominal Predictor Variables
A
- Binary variables are coded 0, 1.
- For example, a variable x1 (Gender) is coded male = 0, female = 1.
- Then in the regression equation y = β0 + β1x1 + β2x2, when x1 = 1 the value of y indicates what is obtained for females;
- when x1 = 0 the value of y indicates what is obtained for males.
- If we have a nominal variable with more than two categories, one can create a number of new dummy (also called indicator) binary variables: c categories require c - 1 dummies.
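A sketch of that dummy coding with pandas (the variable name and categories are invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# c categories -> c - 1 dummy columns; the dropped category acts as the baseline
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # columns: region_south, region_west (north is the baseline)
```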
13
Q
Model Comparisons
A
- Hundreds of predictor variables: what to do?
- Too many "irrelevant" attributes can negatively impact the performance of a model
- Our interest is in parsimonious modeling
- We seek a minimum set of X variables to predict variation in the Y response variable.
- Goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.
- Does leaving out one of the β's significantly diminish the variance explained by the model?
- Compare a saturated (full) model to an unsaturated (reduced) model, as sketched below
- Note there are many possible unsaturated models.
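A sketch of one such comparison via the partial F-test in statsmodels, asking whether dropping x2 diminishes the explained variance (data invented):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":  [5, 7, 9, 11, 14, 16, 18, 21],
    "x1": [1, 2, 3,  4,  5,  6,  7,  8],
    "x2": [3, 1, 4,  1,  5,  9,  2,  6],
})

full    = smf.ols("y ~ x1 + x2", data=df).fit()  # saturated (full) model
reduced = smf.ols("y ~ x1", data=df).fit()       # one β left out

f_stat, p_value, df_diff = full.compare_f_test(reduced)
print(f_stat, p_value)  # large p -> dropping x2 does not significantly hurt the model
```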
14
Q
βStepwiseβ Linear Regression
A
- Considers all possible simple regressions.
- Starts with the variable that has the largest correlation with y.
- Considers next the variable that makes the largest contribution to the regression's sum of squares.
- Tests the significance of the contribution.
- Checks that the individual contributions of variables already in the equation are still significant.
- Repeats until all possible additions are non-significant and all possible deletions are significant.
- We will discuss attribute selection later in the course in more detail.
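A compact forward-only sketch of this procedure (the helper name and the 0.05 threshold are my own; full stepwise would also re-test and drop variables already in the equation):

```python
import statsmodels.formula.api as smf

def forward_select(df, response, alpha=0.05):
    """Greedily add the predictor with the most significant
    contribution until no remaining candidate is significant."""
    selected = []
    candidates = [c for c in df.columns if c != response]
    while candidates:
        pvals = {}
        for c in candidates:
            formula = f"{response} ~ {' + '.join(selected + [c])}"
            pvals[c] = smf.ols(formula, data=df).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # all possible additions are non-significant
        selected.append(best)
        candidates.remove(best)
    return selected
```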
15
Q
Applications of Linear Regressions to Time Series Data
A
Example: average hours worked per week by manufacturing workers.
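A sketch of the usual setup for such a series, regressing the observations on a time index (synthetic data, since the card's dataset is not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
t = np.arange(1, 41)                             # time index: period 1..40
hours = 40 - 0.05 * t + rng.normal(0, 0.5, 40)   # synthetic series with a mild trend

res = sm.OLS(hours, sm.add_constant(t)).fit()
print(res.params)                  # intercept and trend per period
print(res.predict([[1.0, 41.0]]))  # forecast for the next period
```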