Regression Analysis Flashcards
Do you know what the standard error of the coefficient captures?
The standard error of the coefficient is the standard deviation of the coefficient estimate (its sampling distribution). It measures how precisely the model estimates the coefficient’s unknown true value. The standard error of the coefficient is always positive.
The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error yields a t-value. If the p-value associated with this t-statistic is less than your alpha level, you conclude that the coefficient is significantly different from zero.
Do you know how to calculate the t-statistic using a formula?
t = β̂1 / SE(β̂1), where β̂1 is the estimated coefficient and SE(β̂1) is its standard error.
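A quick check with made-up numbers: if β̂1 = 2.0 and SE(β̂1) = 0.5, then t = 2.0 / 0.5 = 4.0, well beyond the usual large-sample critical value of about 1.96 at the 5% level.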
Do you know what a p-value is and what it tells us?
A p-value is a statistical measure used to evaluate a hypothesis against observed data.
The p-value is the probability, assuming the null hypothesis is true, of obtaining results at least as extreme as those actually observed. It is not the probability that the null hypothesis is true, and (1 – the p-value) is not the probability that the alternative hypothesis is true; a low p-value also does not by itself show that the results are replicable.
The lower the p-value, the greater the statistical significance of the observed difference.
p-value = the probability, under the null hypothesis, that the test statistic (e.g. an F-value or z-score) exceeds the value observed in the sample; equivalently, p < α exactly when the observed statistic is above the corresponding critical value (the one that we find in the table in the book).
Do you know the difference between t-tests and F-tests?
The F-test and the t-test are two statistical tests used for hypothesis testing. They help researchers decide whether to reject the null hypothesis or fail to reject it.
- The t-test is a univariate hypothesis test, applied when the population standard deviation is not known and the sample size is small.
- The F-test is a statistical test that determines the equality of the variances of two normal populations.
- The t-statistic follows Student’s t-distribution under the null hypothesis.
- The F-statistic follows Snedecor’s F-distribution under the null hypothesis.
- The t-test is used to compare the means of two populations.
- The F-test is used to compare two population variances (and, in regression, the joint significance of several coefficients).
Do you know the null and alternative hypotheses behind the p-value of the F-test in Stata outputs? They are reported in the top right corner.
Decision rule based on the p-value:
- Reject H0 if p-value ≤ α
- Fail to reject H0 if p-value > α
F-test:
- H0: the βs are jointly equal to 0
- Ha: at least one β is different from 0
Do you know how to conduct an F-test in Stata using the test command after running a regression?
Yes. After running a regression, you write the command test followed by the independent variables you want to check (you want to see whether it makes sense to include them in the model, i.e. whether excluding them could induce OVB). You should keep them if the p-value < 0.05.
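A minimal hypothetical sketch (y, x1, x2, x3 are placeholder variable names):

    regress y x1 x2 x3
    test x2 x3    // H0: the coefficients on x2 and x3 are jointly zero

If the reported Prob > F is below 0.05, reject H0 and keep x2 and x3 in the model.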
Do you know how to use the factor variable notations in Stata?
Factor variables are categorical variables. You need to add the i. prefix before the variable(s) when running a regression, so Stata expands them into indicator (dummy) variables automatically.
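A hypothetical example (wage, region, and exper are placeholder variable names):

    regress wage i.region exper    // i.region is expanded into region dummies, with one base level omitted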
What is heteroskedasticity?
There is heteroskedasticity when the variance of the residuals is unequal across values of x (homoskedasticity is the opposite).
One of the assumptions of OLS is that there is no heteroskedasticity. To control for heteroskedasticity we add the option “robust” (or “r”) to the regression command.
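A minimal sketch (y and x are placeholder variable names):

    regress y x, robust    // reports heteroskedasticity-robust (Huber-White) standard errors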
Why is heteroskedasticity a problem?
Heteroskedasticity refers to a situation where the variance of the residuals is unequal over a range of measured values.
If heteroskedasticity exists, the errors do not have a constant variance; the usual OLS standard errors are then wrong, so hypothesis tests and confidence intervals based on them may be invalid.
Models involving a wide range of values tend to be more prone to heteroskedasticity.
Heteroskedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoskedasticity) –> one of the assumptions of the Gauss-Markov Theorem.
What are robust standard errors?
They are the standard errors that we obtain when we add the r (robust) option, i.e. when we control for heteroskedasticity. “Robust” standard errors are a technique for obtaining valid (consistent) standard errors of OLS coefficients under heteroskedasticity; the coefficient estimates themselves are unchanged.
What is OVB? What are the 2 conditions for OVB? Why is OVB a problem?
OVB arises when a relevant variable is omitted from the regression.
There are two conditions that together induce OVB (both must hold):
1. The omitted variable (Z) is a determinant of Y (i.e. Z is part of u); and
2. Z is correlated with the regressor X (i.e. corr(Z,X) ≠ 0)
Having an omitted variable in research can bias the estimated coefficients and lead the researcher to an erroneous conclusion. A classic example: in a regression of wages on education, omitting ability (which affects wages and is correlated with education) biases the education coefficient.
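A hypothetical Stata sketch (wage, educ, and exper are placeholder names; here exper plays the role of the omitted Z):

    regress wage educ          // if exper determines wage and corr(educ, exper) != 0, the educ coefficient is biased
    regress wage educ exper    // including the previously omitted variable removes this source of bias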
Do you know the differences between perfect and imperfect collinearity?
Perfect multicollinearity: when an independent variable or a set of independent variables predicts the value of another independent variable perfectly. There is redundant information: one implies the other (e.g. male and female dummies).
Imperfect multicollinearity: two independent variables are highly correlated, but the relationship is not perfect (e.g. height and weight).
What is a dummy variable trap?
Why is it a problem?
Dummy Variable Trap: when the number of dummy variables created equals the number of categories the categorical variable can take on, and all of them are included in the regression together with the intercept.
It is a problem because the dummies then sum to one and perfectly predict the constant term: this perfect multicollinearity makes it impossible to compute unique regression coefficients (and the associated p-values). The fix is to omit one category as the base.
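A hypothetical sketch (wage and region are placeholders; region has several categories):

    regress wage i.region    // factor notation drops one base category automatically, avoiding the trap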
Do you know why collinearity is problematic?
A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.
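A common diagnostic, sketched under the assumption of a plain OLS fit (y, x1, x2, x3 are placeholders): Stata’s estat vif reports variance inflation factors after regress.

    regress y x1 x2 x3
    estat vif    // rule of thumb: VIF values well above 10 signal troubling collinearity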
What is regression?
A statistical method that uses data to test whether a relationship exists between two or more variables, and to quantify it.
What are the objectives of regression analysis?
1) To estimate the effect of an independent variable on a dependent variable;
2) To test whether the effect is statistically different from zero (or from a certain value)
What is a random variable?
A variable whose values depend on the outcome of a probabilistic (random) event.
What does it mean when we say that X has a linear relationship with Y?
It means that X has the same effect on Y at all levels of X –> the degree of change is the same everywhere (same slope). In the model Y = β0 + β1X + u, a one-unit increase in X changes Y by β1 regardless of the starting level of X.