Study Session 3 - Correlation and Regression Flashcards

1
Q

Formula for sample covariance

A

covᵪᵧ = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / (n - 1)
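As a sketch, the formula can be computed directly in Python (the data values below are made up for illustration):

```python
def sample_covariance(x, y):
    """Sum of (Xi - Xbar)(Yi - Ybar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Example with made-up data: Y moves exactly twice as much as X.
print(sample_covariance([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 5.0
```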

2
Q

Formula for correlation

A

r = covᵪᵧ / (sᵪ × sᵧ)
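A minimal sketch of the same formula, dividing the sample covariance by the product of the two sample standard deviations (data values are assumed for illustration):

```python
def correlation(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    sd_x = (sum((a - x_bar) ** 2 for a in x) / (n - 1)) ** 0.5
    sd_y = (sum((b - y_bar) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (sd_x * sd_y)

# A perfectly linear relationship gives r = 1.
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```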

3
Q

What are the three limitations of correlation analysis?

A

Impact of outliers, spurious correlation, and nonlinear relationships

4
Q

What is an outlier?

A

A sample value that is extraordinarily large or small relative to the rest of the data. Outliers can skew the data to show a relationship where there isn't one, and vice versa.

5
Q

What is spurious correlation?

A

The appearance of a causal linear relationship when there is none; correlation by chance.

6
Q

What is a nonlinear relationship in correlation analysis?

A

Correlation captures linear relationships but not nonlinear relationships such as parabolas or other shapes.

7
Q

What is the hypothesis test for testing correlation?

A

H₀: ρ = 0 vs. Hₐ: ρ ≠ 0

to test whether the population correlation between the two variables is equal to zero.

8
Q

Formula for critical value t is correlation hypothesis test

A

t = r√(n - 2) / √(1 - r²)

n - 2 degrees of freedom; r is the sample correlation.

Reject H₀ if t falls outside the critical values; fail to reject if it falls within the interval.
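A sketch of the test statistic, using assumed example values (r = 0.5, n = 27, not from any real data set):

```python
import math

def t_stat_correlation(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumed example: sample correlation 0.5 from 27 observations.
t = t_stat_correlation(0.5, 27)
print(round(t, 3))  # 2.887
```

Compare the result against the critical t-value at n - 2 = 25 degrees of freedom.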

9
Q

What is the purpose of a simple linear regression?

A

To explain the variation in a dependent variable in terms of variation in a single independent variable. Dependent is the explained variable - the predicted. Independent is the explainer - the predictor.

10
Q

What are the underlying assumptions of linear regression?

A
  1. A linear relationship exists between the independent and dependent variable.
  2. The independent variable is uncorrelated with the residuals.
  3. The expected value of the residual term is zero: E(𝝴) = 0.
  4. Each residual term is independently distributed, not related to that of another.
  5. The residual term is normally distributed.
11
Q

What is the regression model formula?

A

Yᵢ=b₀+b₁Xᵢ+𝝴ᵢ, i=1…, n

Y=value of the dependent variable
b0= regression intercept term
b1= regression slope coefficient
𝝴ᵢ= residual for ith observation

Gives the line through the scatter plot that 'best' explains the values of Y in terms of X: the line of "best fit."

12
Q

What is the linear equation?

A

Ŷᵢ = b̂₀ + b̂₁Xᵢ, i = 1…, n

Same as the regression model, but Ŷ, b̂₀, b̂₁ are ESTIMATED values.

13
Q

What is Sum of Squared Errors (SSE)?

A

The sum of the squared distances between the estimated and actual Y-values.

14
Q

Formula for Estimated slope coefficient

A

For the regression line, the slope describes the change in Y for a one-unit change in X.

b̂₁ = covᵪᵧ / σ²ᵪ

The slope of the simple regression is estimated as the covariance of X and Y divided by the variance of X.

15
Q

What is the intercept term? And formula

A

The value of Y where X = 0.

b̂₀ = Ȳ - b̂₁X̄

where b̂₀, b̂₁ are ESTIMATES and X̄, Ȳ are sample MEANS.
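Cards 14 and 15 together fully determine the fitted line. A minimal sketch, with made-up data chosen to lie exactly on Y = 1 + 2X:

```python
def fit_simple_regression(x, y):
    """Returns (b0, b1): intercept and slope estimated from the sample means,
    the covariance of X and Y, and the variance of X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - x_bar) ** 2 for a in x) / (n - 1)
    b1 = cov / var_x              # slope: cov(X,Y) / var(X)
    b0 = y_bar - b1 * x_bar       # intercept: Ybar - b1 * Xbar
    return b0, b1

b0, b1 = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```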

16
Q

What is Standard error of estimate and what does it measure?

A

SEE measures the degree of variability of the actual Y-values relative to the estimated Y-values; the smaller the SEE, the better the fit. It is the standard deviation of the error terms.

17
Q

What is the coefficient of determination?

A

R². The higher the R², the more the variation in the dependent variable is explained by the independent variable.

18
Q

What is the common hypothesis test for regression coefficient?

A

To test whether the slope coefficient is different from 0.

H₀: b₁ = 0 vs. Hₐ: b₁ ≠ 0

Confidence interval: b̂₁ ± (critical t × standard error of the regression coefficient); reject H₀ if 0 falls outside the interval.

19
Q

Explain hypothesis test for true slope coefficient?

A

a t-test can be set up to test if the true slope coefficient is statistically different from a hypothesized value.

with n-2 degrees of freedom,

tᵦ₁ = (b̂₁ - b₁) / sᵦ₁, where sᵦ₁ is the standard error of the slope estimate.
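A sketch of the statistic; the slope estimate (1.6) and standard error (0.4) below are assumed values, not real regression output:

```python
def t_coefficient(b_hat, b_hypothesized, std_error):
    """t = (estimated coefficient - hypothesized value) / standard error,
    compared against critical t with n - 2 degrees of freedom."""
    return (b_hat - b_hypothesized) / std_error

# Assumed values: estimated slope 1.6, testing against 0, standard error 0.4.
print(t_coefficient(1.6, 0.0, 0.4))  # 4.0
```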

20
Q

What is the formula for PREDICTED VALUES?

A

Ŷ = b̂₀ + b̂₁Xₚ

Y is the predicted value of dependent
Xp is the FORECASTED value of independent.

21
Q

What is ANOVA?

A

ANOVA (analysis of variance) analyzes the total variability of the dependent variable.

22
Q

What is Total Sum of Squares (SST)?

A

SST is the total variation in the dependent variable.

SST=RSS+SSE
SST=Explained+unexplained

SST = ∑(Yᵢ - Ȳ)²

23
Q

What is Regression Sum of Squares?

A

RSS measures the variation of the dependent variable that is explained by the independent variable: the sum of the squared distances between the predicted values and the mean of Y.

24
Q

What is mean regression sum of squares (MSR)?

A

RSS/k

Regression sum of squares divided by its degrees of freedom, k, the number of independent variables.

25
Q

What is mean squared error (MSE)?

A

MSE = SSE / (n - 2)

26
Q

How is F calculated? What is it?

A

F = MSR/MSE = (RSS/k) / (SSE/(n - k - 1))

Tests how well, as a group, the independent variables explain the variation of the dependent variable.

Always a one-tailed test.

27
Q

R² Formula

A

R² = (SST - SSE) / SST = RSS / SST

expressed as a percentage

28
Q

How to calculate the standard deviation of the regression error terms (SEE)?

A

SEE = √MSE = √(SSE / (n - 2))
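Cards 22 through 28 all flow from the same decomposition. A sketch computing SST, SSE, RSS, R², SEE, and F from assumed toy numbers (the Y and Ŷ values below are made up, with one independent variable):

```python
import math

# Assumed toy data: actual Y and fitted Y-hat from a one-variable regression.
y     = [2, 4, 5, 7]
y_hat = [2.1, 3.7, 5.3, 6.9]
n, k  = len(y), 1

y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (squared errors)
rss = sst - sse                                        # explained
r_squared = rss / sst
see = math.sqrt(sse / (n - 2))                         # std. deviation of the errors
f_stat = (rss / k) / (sse / (n - k - 1))

print(round(r_squared, 3), round(see, 3), round(f_stat, 1))  # 0.985 0.316 128.0
```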

29
Q

In simple linear regression (one independent variable) what is the significance of F?

A

Equivalent to testing whether the slope coefficient is different from 0; with one independent variable, F = t².

30
Q

What is the decision rule for an F test?

A

Always one-tailed.

If calculated F exceeds critical F, reject H₀.

31
Q

What are two limitations of regression analysis?

A

Linear relationships can change over time, referred to as parameter instability. Also, usefulness in an investment context is limited if many market participants identify and act on the same relationship, because the relationship may then break down.

32
Q

What is the overall point of multiple regression?

A

To explain the variation in a dependent variable using more than one independent variable; the resulting fitted equation minimizes the sum of the squared residuals.

33
Q

What does the intercept term represent in a multiple regression?

A

The intercept is the value of the dependent variable when all independent variables are set to 0.

34
Q

What are the slope coefficients in multiple regression?

A

Each slope coefficient represents how much the dependent variable will change for a one-unit change in that independent variable, holding all other independent variables constant.

35
Q

Explain a hypothesis test for coefficients in multiple regression

A

Tests whether each of the slope coefficients contribute significantly to explaining the variation of the dependent variable.

t = (estimated regression coefficient - hypothesized value) / coefficient standard error

with df = n - k - 1

36
Q

Statistical significance tests are always…

A

H₀: bⱼ = 0 vs. Hₐ: bⱼ ≠ 0

Otherwise, the hypothesized value replaces 0 and is incorporated into the calculation of t.

37
Q

What is the p-value?

A

The smallest level of significance at which the null hypothesis can be rejected.

If p < α, reject H₀; if p ≥ α, fail to reject.

38
Q

What are the assumptions of a multiple regression model?

A
  • A linear relationship between independent and dependent variables exists.
  • The expected value of the error term is 0
  • The variance of the error terms is constant for all observations
  • The error term is normally distributed.
39
Q

How is the F test used in multiple regression?

A

Tests whether at least one of the independent variables contributes significantly to explaining the variation of the dependent variable.

40
Q

How is a hypothesis test structured for F test in multiple regression?

A

H₀: b₁ = b₂ = … = bₖ = 0
Hₐ: at least one bⱼ ≠ 0

The F-test is always one-tailed; if calculated F is greater than critical F, reject H₀.

41
Q

What is R² in term of multiple regression?

A

In multiple regression, it is the percentage of variation in the dependent variable explained by the independent variables collectively.

42
Q

Why is R² not reliable in multiple regression?

A

Because R² almost always increases as variables are added to the model, it can overstate the model's explanatory power; this is commonly known as overestimating the regression.

43
Q

What is adjusted R² and formula?

A

Adjusts R2 for the number of variables in the model.

R²a = 1 - [(n - 1)/(n - k - 1)] × (1 - R²)

In words: one minus the degrees-of-freedom ratio times the unexplained portion of the variation, 1 - R².
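A sketch of the adjustment; the R², n, and k values below are assumed for illustration:

```python
def adjusted_r2(r2, n, k):
    """R2a = 1 - [(n - 1) / (n - k - 1)] * (1 - R2).
    Penalizes R2 for the number of independent variables, k."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Assumed example: R2 = 0.80 with n = 62 observations and k = 5 variables.
print(round(adjusted_r2(0.80, 62, 5), 4))  # 0.7821
```

Note that adjusted R² is always at most R², and the gap widens as more variables are added.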

44
Q

What is a dummy variable?

A

A variable that is binary in nature, either on or off, assigned a value of 1 when the condition is present and 0 otherwise. Its coefficient equals the change in the dependent variable when the condition is present.

45
Q

How many dummy variables are appropriate, and why?

A

To distinguish between n classes, we must use n - 1 dummy variables; otherwise the assumption of no exact linear relationship among the independent variables is violated.

Whichever class is omitted becomes the reference point for the model, captured by the intercept.
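A hypothetical illustration: distinguishing 4 quarters requires n - 1 = 3 dummies, with Q4 as the omitted reference class (the quarter numbering here is invented for the example):

```python
# Encode quarters 1-4 with 3 dummy variables; quarter 4 is the omitted
# reference class, represented by all zeros (i.e., the intercept).
def quarter_dummies(quarter):
    return [int(quarter == q) for q in (1, 2, 3)]

print(quarter_dummies(2))  # [0, 1, 0]
print(quarter_dummies(4))  # [0, 0, 0]
```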

46
Q

What are the 4 things I need to know for the 3 assumption violations in multiple regression analysis?

A

What is it?
What is its effect on regression?
How do we detect it?
How do we correct it?

47
Q

What is heteroskedasticity?

A

Assumption: the variance of the residuals is constant across all observations.
Violation: the variance is not the same across all observations because some subsamples are more spread out than others.

Unconditional: does not increase with the value of the independent variable; not a problem for regression.
Conditional: increases with the value of the independent variable; causes problems.

48
Q

What are the effects of heteroskedasticity?

A
  1. Standard errors are unreliable.
  2. The coefficient estimates aren't affected.
  3. If the standard errors are too small, t-statistics will be too large and H₀ will be rejected too often; the opposite also holds.
  4. The F-test is unreliable.
49
Q

How do we detect heteroskedasticity?

A

Look at a plot of the residuals: is there a point where the spread of the errors suddenly changes?

More common: the Breusch-Pagan chi-square test.

Chi-square statistic = n × R² (where R² is from a regression of the squared residuals on the independent variables), with k degrees of freedom.
One-tailed test; compare against the chi-square table.
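A hedged sketch of the statistic itself; the n and residual-regression R² below are assumed example values, not real output:

```python
# Breusch-Pagan statistic: n times the R-squared from a second regression of
# the SQUARED residuals on the independent variables. Compared against a
# one-tailed chi-square critical value with k degrees of freedom.
def breusch_pagan_stat(n, r2_resid):
    return n * r2_resid

# Assumed example: 50 observations, residual-regression R-squared of 0.10.
bp = breusch_pagan_stat(50, 0.10)
print(bp)
```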

50
Q

How do we correct heteroskedasticity?

A

By using robust standard errors, which are then used to recalculate the t-statistics,

or generalized least squares, which attempts to eliminate the heteroskedasticity by modifying the regression equation.

51
Q

What is serial correlation?

A

When the residuals are correlated.

Positive SC: when a positive regression error in one time period increases the likelihood of a positive one in the next.
Negative SC: when a positive regression error in one time period increases the likelihood of a negative one in the next.

52
Q

What is the effect of serial correlation on regression?

A

Because the data cluster together, serial correlation typically results in standard errors that are too small. This makes t-statistics too large, so H₀ is rejected too often: too many Type I errors.

53
Q

How do we detect serial correlation? What is the decision rule?

A

Durbin-Watson (DW) test.

DW ≈ 2(1 - r), if the sample is large enough.

r = the correlation between consecutive residuals

Decision rule:
From 0 to d_lower: reject H₀ (positive serial correlation)
From d_lower to d_upper: inconclusive
Above d_upper: do not reject H₀
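The card gives the large-sample approximation DW ≈ 2(1 - r); the statistic itself is the ratio of squared successive residual differences to squared residuals, sketched here with made-up residuals:

```python
def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Alternating residuals (strong negative serial correlation) push DW above 2;
# uncorrelated residuals would give DW near 2.
print(durbin_watson([1, -1, 1, -1]))  # 3.0
```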

54
Q

How can we correct serial correlation?

A
  • Adjust the coefficient standard errors using the Hansen method (which also corrects for heteroskedasticity); the adjusted errors are then used in hypothesis testing.
  • Or improve the specification of the model by explicitly incorporating the time-series nature of the data (seasonality, etc.). Hard to do.
55
Q

How do you detect Multicollinearity?

A

t-tests indicate that none of the individual coefficients are significantly different from zero,

the F-test is statistically significant,

and R² is high.

Together the independent variables explain a lot of the variation, but individually they do not, meaning the independent variables are highly correlated with each other.

Note: low pairwise correlation among the independent variables does not mean there isn't multicollinearity.

56
Q

How do you correct for multicollinearity?

A

The most common correction is to omit one of the correlated variables, though it is sometimes hard to tell which variable is to blame.

57
Q

What are the three broad types of model misspecification?

A

Misspecified functional form
Misspecified explanatory variables
Other time-series misspecifications that result in nonstationarity

58
Q

What are the sub groups of misspecification for functional?

A

Important variables are omitted.
Variables need to be transformed (e.g., using ln instead of the linear form, or vice versa).
Data is improperly pooled (pooling data that should be kept separate).

59
Q

What are the sub groups of Explanatory variables in misspecification?

A

A lagged dependent variable is used as an independent variable.
A function of the dependent variable is used as an independent variable.
Independent variables are measured with error.

60
Q

How is the mistake of forecasting the past committed?

A

Using data from the same period as the one being forecast, e.g., forecasting July using data from July.

61
Q

How is the mistake of measuring independent variables with error committed?

A

Using proxy variables. For example, corporate governance quality could be proxied by free float, but free float is not an actual measure of governance quality, so we measure it with error, undermining the regression.

62
Q

What is a qualitative dependent variable?

A

A dummy variable with a value of 1 or 0, used to predict the likelihood of an event happening or not.

63
Q

What is the difference between a profit and logic model?

A

A probit model is based on the normal distribution.

A logit model is based on the logistic distribution.

64
Q

What is a discriminant model?

A

A discriminant model makes different assumptions regarding the independent variables. It results in a linear function, similar to an ordinary regression, which generates a score used to rank an observation.

Ex. using financial ratios as the independent variables to predict the qualitative dependent variable of bankruptcy.

65
Q

What are the correct steps for working through a multiple regression?

A

  1. Is the model correctly specified? Correct it if not.
  2. t-test the individual coefficients to check for significance.
  3. F-test for overall model significance; respecify the model if not significant.
  4. Check for heteroskedasticity with the Breusch-Pagan chi-square test.
  5. Check for serial correlation with the Durbin-Watson test.
  6. Check for multicollinearity; fix it if present.
  7. Use the model.