13. Linear regression models Flashcards
What does multiple regression do?
Tells us how the mean of the DV changes as a function of the IVs- we can partial out the effect of each IV
* Ceteris paribus = other things equal
* What is the effect of education on salary, keeping gender, region, industry…. equal?
* Whatever is part of the regression is controlled for and held constant
* Flexibility in terms of functional form
* Popularity: macros, shortcuts, post-hoc fixes and hacks exist for it like for no other method
* These principles carry over to other, more complicated versions of regression
What can multiple regression test?
REGRESSION LANGUAGE- NO CAUSALITY
* Remember causality is inferred not tested
* Causality is established through research design
* Very often we have a cross-sectional design when using regression- in that case…
* …adjust the wording of the hypothesis!
Hypothesis: Employees who feel more engaged experience the symptoms of burnout less frequently.
Hypothesis: For employees with temporary contracts, job satisfaction is less strongly related to turnover intentions
than for employees with permanent contracts.
…we expect a positive/negative relationship, …will increase/decrease with…
Regression supports/doesn’t support the hypothesis (at a specific significance level)
=> Regression can tell you whether there is a significant (non-zero) relationship between an IV and the DV, approximately how big
that relationship is (magnitude), and what its direction is (positive/negative)
What does multiple regression technically do?
General model
Y = β0 + β1X1 + β2X2 + … + βkXk + u
Y= Dependent variable
β0= Intercept
β1…βk = coefficients on the independent variables; β1 is the change in Y for a one-unit change in X1, ceteris paribus
u = error term- random error and factors other than the Xs that influence Y
k=number of independent variables
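A minimal sketch of fitting this model, assuming Python with statsmodels and simulated (illustrative) data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated IVs and error term (illustrative, k = 2)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(size=n)

# True model: Y = b0 + b1*X1 + b2*X2 + u
y = 1.0 + 0.5 * x1 - 0.3 * x2 + u

# add_constant appends the intercept column (b0)
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, t-statistics, p-values, R-squared
```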
What is centering?
Centering- subtracting the mean of the variable
* Meaningful intercept
* Easier interpretation of interactions
* Don’t center binary variables
* Interpretation of the slope coefficients stays the same (only the intercept changes)
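A small sketch of centering on illustrative simulated data; the slope is unchanged and the intercept becomes meaningful:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(40, 10, 300)                  # e.g. age on its raw scale
y = 2.0 + 0.1 * x + rng.normal(size=300)

raw = sm.OLS(y, sm.add_constant(x)).fit()
centered = sm.OLS(y, sm.add_constant(x - x.mean())).fit()

# Slopes match; only the intercept changes-
# after centering it is the predicted Y at the mean of X
print(raw.params, centered.params)
```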
What is standardization?
Standardization- subtracting the mean and dividing by the std. dev.
* Standardized variables- mean zero, std. dev. 1
* Coefficients become comparable when variables are on different scales
* Interpretation changes
* SPSS calculates standardized coefficients for you- no reason to change raw scores
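For intuition, a sketch of computing standardized coefficients by hand (z-scoring Y and the Xs) on illustrative simulated data; this mirrors the standardized Beta column that SPSS reports:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)                      # variables on very different scales
x2 = rng.normal(scale=100, size=n)
y = 0.4 * x1 + 0.004 * x2 + rng.normal(size=n)

def z(v):
    return (v - v.mean()) / v.std()          # subtract mean, divide by std. dev.

# Regressing z-scored Y on z-scored Xs yields standardized coefficients,
# which are comparable across predictors measured on different scales
std_fit = sm.OLS(z(y), sm.add_constant(np.column_stack([z(x1), z(x2)]))).fit()
print(std_fit.params)
```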
What does significance testing do?
Significance testing – is our coefficient significantly different from 0?
* So if we want to test that H0: β1=0
* t-statistic of β̂1 = β̂1 / standard error(β̂1)
* Standard error- depends on the sample size
* Our trust in the results should be influenced by how much information we have about the population = sample size
* Standard error = estimated standard deviation of the coefficient estimate; it shrinks as the sample size grows
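A sketch of the t-statistic computed by hand from the estimate and its standard error, checked against statsmodels (illustrative data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 0.5 * x + rng.normal(size=150)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# t-statistic of beta1 = estimate / standard error of the estimate
beta1, se1 = fit.params[1], fit.bse[1]
print(beta1 / se1, fit.tvalues[1])           # the two values agree
```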
How do I know my t-statistic is big enough?
H0- coefficient is 0- we usually want to reject
* P-value- given the observed t-statistic, what is the smallest significance level at which H0 would be rejected?
* *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
* Significance level tells me about the probability of being wrong- the probability of being wrong should be low
* Type I error- we reject H0 although it is true- we should only do this 0.1%, 1%, or 5% of the time => we pick our significance level
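A sketch of how the p-value follows from the observed t-statistic, assuming scipy is available; the numbers are illustrative:

```python
from scipy import stats

# Illustrative: t-statistic of 2.1 with 120 residual degrees of freedom
t_stat, df = 2.1, 120

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(p_value)                               # ~0.04: significant at 5%, not at 1%
```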
Methods of choosing predictors
Hierarchical entry- you can choose to enter and remove entire blocks of variables in gradual steps (see the sketch after this list)
* Your first block would be variables found in previous research (controls)
* Either built up or down (enter or remove)
Forced entry
* All variables are entered simultaneously- you select the set and decide which ones stay
Stepwise methods- SPSS chooses for you
* Forward- starting with the intercept only, the variable with the highest simple correlation with the DV is chosen first
* The variable that best explains the remaining variance is chosen next
* Stepwise- same as forward, but also removes variables that are “least useful” in explaining variance at the same time as adding new ones
* Backward- the opposite logic: start with all variables and remove them gradually based on p-values or t-statistics
* SPSS is not smarter than you- you should always make the decisions yourself
* SPSS has no idea about theory, previous research and which hypotheses you are testing
* Your strategy- controls and basic effects first- then variables in hypotheses- then interactions, mediation tests
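A sketch of hierarchical (blockwise) entry as nested models in statsmodels; the variable names and data are illustrative, not from the flashcards:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 250
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "tenure": rng.normal(8, 4, n),
    "engagement": rng.normal(0, 1, n),
})
df["burnout"] = 3 - 0.4 * df["engagement"] + 0.02 * df["age"] + rng.normal(size=n)

# Block 1: controls only
m1 = smf.ols("burnout ~ age + tenure", data=df).fit()
# Block 2: add the variable of interest
m2 = smf.ols("burnout ~ age + tenure + engagement", data=df).fit()

# Change in R-squared shows what the new block adds;
# compare_f_test gives the F-test for the added block
print(m2.rsquared - m1.rsquared)
print(m2.compare_f_test(m1))                 # (F, p-value, df difference)
```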
What is variance?
Goodness of fit- we judge how good our “model” is by how much of the actual variance in the dependent variable it explains
* Total sum of squares SST- total variance of the dependent variable
* Explained sum of squares SSE- variance explained by our model
* Sum of squared residuals SSR- left over variance that is unexplained
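A sketch of the decomposition SST = SSE + SSR and R² = SSE/SST, computed by hand on illustrative data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 1 + 0.7 * x + rng.normal(size=200)

fit = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = fit.fittedvalues

sst = ((y - y.mean()) ** 2).sum()            # total variance of the DV
sse = ((y_hat - y.mean()) ** 2).sum()        # variance explained by the model
ssr = ((y - y_hat) ** 2).sum()               # leftover, unexplained variance

# SST = SSE + SSR, and R-squared = SSE / SST
print(sst, sse + ssr, sse / sst, fit.rsquared)
```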
What are control variables?
Control variables = the same as any other IV, but the researcher is not really interested in their effects
What happens when adding control variables?
Adding control variables to the model means that the variables of interest explain the “left over” variance, especially
if they are correlated with the control variables themselves (see the sketch after this list)
* GOOD- ceteris paribus
* GOOD- making sure we are not omitting an important variable that has an effect
* BAD- continuity of research- can we compare to research with other or no control variables?
* BAD- multicollinearity can decrease significance of variables
* BAD- because we are usually more lenient about their precision (is age a good proxy for career stage?)
* UGLY- we can find a combination that suits our research agenda (do I want effects of interest to be significant or not?)
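A sketch of the point above: when the IV of interest is correlated with a control, its coefficient shifts once the control enters the model (illustrative simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
control = rng.normal(size=n)                 # e.g. age
iv = 0.6 * control + rng.normal(size=n)      # IV correlated with the control
y = 0.5 * iv + 0.8 * control + rng.normal(size=n)

without = sm.OLS(y, sm.add_constant(iv)).fit()
with_ctrl = sm.OLS(y, sm.add_constant(np.column_stack([iv, control]))).fit()

# Omitting the correlated control biases the IV's coefficient upward;
# adding the control moves it back toward the true value of 0.5
print(without.params[1], with_ctrl.params[1])
```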
How do we test significance?
Statistical tests, in a nutshell, help us decide whether our “model” is good
We are trying to model reality as closely as possible- for example, model the real relationship between the IVs and the DV
Test statistic- a number that summarizes how good our model is, how close it is to reality, how much of the variance in the DV our
model is explaining
* F-test- variance explained by model (between-group variance in ANOVA) / error variance
* Chi-squared- the result of the fitting function; if our model (e.g. in a CFA) matched reality perfectly, the fitting function would be 0
* The t-statistic in a regression is based on the variance of the coefficient (beta) estimate
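A sketch of the F-test computed from the variance decomposition, F = (SSE/k) / (SSR/(n−k−1)), checked against statsmodels on illustrative data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 200, 2
X = rng.normal(size=(n, k))
y = 1 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
y_hat = fit.fittedvalues

sse = ((y_hat - y.mean()) ** 2).sum()        # variance explained by the model
ssr = ((y - y_hat) ** 2).sum()               # error variance

# F = explained variance per model df / error variance per residual df
f_stat = (sse / k) / (ssr / (n - k - 1))
print(f_stat, fit.fvalue)                    # the two values agree
```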