Topic 2: Correlation, Simple, and Multiple Linear Regression Flashcards
correlation
displays the form, direction, and strength of a relationship
pearson’s correlation
measures the direction & strength of a linear relationship between two quantitative variables
covariance
indicates the degree to which x and y vary together
interpreting covariance
positive = x and y move in the same direction
negative = x and y move in opposite directions
0 = no linear relationship between x and y (zero covariance does not by itself imply independence)
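A minimal NumPy sketch of these interpretations (data values made up for illustration):

```python
import numpy as np

# Hypothetical data where y tends to rise with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.cov returns the 2x2 covariance matrix; the off-diagonal
# entry is cov(x, y)
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)  # positive, since x and y move in the same direction
```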
when should we not use r?
- when two variables have a non-linear relationship
- observations aren’t independent
- outliers exist
- homoscedasticity is violated
- the sample size is very small
- one or both variables are not measured on a continuous scale
point-biserial correlation
binary & continuous variables
phi coefficient
two binary variables
spearman’s rho
- two ordinal variables
- recommended when N > 100
kendall’s tau
- two ordinal variables
- recommended when N < 100
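All four of these alternatives can be computed with SciPy (there is no dedicated phi function, but applying Pearson's r to two 0/1 variables yields phi); a rough sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group = rng.integers(0, 2, 50)          # binary variable
flag = rng.integers(0, 2, 50)           # another binary variable
score = rng.normal(0, 1, 50) + group    # continuous variable

r_pb, p = stats.pointbiserialr(group, score)  # point-biserial
phi, _ = stats.pearsonr(group, flag)          # phi: Pearson's r on 0/1 data
rho, _ = stats.spearmanr(score, group)        # Spearman's rho
tau, _ = stats.kendalltau(score, group)       # Kendall's tau
```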
confounder
an observed common factor that influences both variables and may explain the association between them
lurking factors
potential common causes that we don’t measure
partial correlation
the correlation between two variables after the influence of another variable is removed
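One common way to compute this is to regress each variable on the control variable and correlate the residuals; a minimal sketch (the helper name is ours):

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    # Remove the linear influence of z from both x and y,
    # then correlate what is left over
    res_x = x - np.polyval(np.polyfit(z, x, 1), z)
    res_y = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(res_x, res_y)[0]
```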
hypotheses for significance of a correlation coefficient
- H₀: ρ = 0 (no linear association between the two variables; ρ = population correlation)
- H₁: ρ ≠ 0 (linear association between the two variables)
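SciPy's pearsonr carries out exactly this test, returning r together with a two-sided p-value; a small sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

# p-value tests H0: rho = 0 against H1: rho != 0
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```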
simple linear regression
- used to study an asymmetric linear relationship between x and y
- describes how the DV changes as a single IV changes
β
the slope: the expected change in y for a one-unit increase in x
linear regression equation
- Ŷ = ɑ + βX
- Ŷ = predicted value of y
- ɑ = intercept
- β = slope
method of least squares
- makes the sum of the squares of the vertical distances of the data points from the line as small as possible
- minimizes ss (error)
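For simple regression the least-squares solution has a closed form; a sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()

y_hat = alpha + beta * x
ss_error = np.sum((y - y_hat) ** 2)  # the quantity least squares minimizes
```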
stating hypotheses for the significance of the slope in simple linear regression
- H₀: β = 0 (There is no linear relationship between x & y)
- H₁: β ≠ 0 (There is a linear relationship between x & y)
t-test formula
t = sample statistic / standard error
standard error
the standard deviation of the sampling distribution of a statistic
assumptions to apply t-test to slope
normal distribution & independence of observations
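Once those assumptions hold, scipy.stats.linregress reports the slope, its standard error, and the p-value for this t-test in one call; a sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 1.5 * x + rng.normal(size=40)

res = stats.linregress(x, y)
t = res.slope / res.stderr   # t = sample statistic / standard error
print(t, res.pvalue)         # pvalue tests H0: beta = 0
```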
partitioning variance in simple linear regression
- ss (regression) = variation in y explained by the regression line
- ss (error) = variation in y unexplained by the regression line
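A sketch of the partition, reusing the closed-form fit from above (made-up data): the two pieces sum to the total variation, and r² falls out as their ratio:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x

ss_regression = np.sum((y_hat - y.mean()) ** 2)  # explained by the line
ss_error = np.sum((y - y_hat) ** 2)              # unexplained
ss_total = np.sum((y - y.mean()) ** 2)

# SS(Total) = SS(Regression) + SS(Error); r^2 = SS(Regression) / SS(Total)
r_squared = ss_regression / ss_total
```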
r²
the proportion of total variation in y accounted for by the regression model
interpreting r²
0 = no explanation at all
1 = perfect explanation
multiple linear regression
explains how the DV changes as multiple IVs change
regression plane
Ŷ = ɑ + β₁X₁ + β₂X₂
how can we compute ɑ and βⱼ?
using the least squares method
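A minimal sketch using statsmodels' OLS (simulated data; any least-squares solver would do):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))   # two IVs: X1 and X2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

# add_constant supplies the intercept column; OLS solves by least squares
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)  # [alpha, beta1, beta2]
```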
standardized regression coefficient
- the effect of a standardized IV on the standardized DV (z-scores)
- the number of standard deviations by which the DV changes when xⱼ increases by one standard deviation
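One way to obtain these coefficients is simply to z-score every variable and refit; a sketch in the same statsmodels style (simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

# Convert every variable to z-scores, then refit: the slopes are now
# standardized (beta) coefficients
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)
std_model = sm.OLS(yz, sm.add_constant(Xz)).fit()
print(std_model.params[1:])  # standardized coefficients for X1 and X2
```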
stating hypotheses for overall significance in multiple regression
- H₀: β₁ = β₂ = … = βⱼ = 0 (none of the Xs are linearly related to y)
- H₁: at least one coefficient is not 0 (at least one X is linearly related to y)
stating hypotheses for individual regression coefficients in multiple regression
- H₀: βⱼ = 0 (Xⱼ is not linearly related to y)
- H₁: βⱼ ≠ 0 (Xⱼ is linearly related to y)
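A fitted statsmodels model reports both of these tests directly, covering this card and the overall-significance card above (simulated data; X2 is deliberately unrelated to y):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)  # X2 has no real effect

model = sm.OLS(y, sm.add_constant(X)).fit()

# Overall F-test: H0 that every slope is zero
print(model.fvalue, model.f_pvalue)

# Individual t-tests: H0 that a single beta_j is zero
print(model.tvalues, model.pvalues)  # order: [const, X1, X2]
```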
r² in multiple regression
If a new IV is added to the model, SS (Error) never increases and SS (Regression) never decreases, so r² never decreases when another variable is added to the model
adjusted r²
if no substantial increase in r² is obtained by adding a new IV, adjusted r² tends to decrease
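The usual adjustment is r²_adj = 1 − (1 − r²)(n − 1)/(n − p − 1) for n observations and p predictors; a tiny sketch of the penalty at work (numbers made up):

```python
def adjusted_r2(r2, n, p):
    # Penalizes r^2 for each added predictor
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A useless extra IV nudges r^2 up, yet adjusted r^2 goes down
print(adjusted_r2(0.50, 30, 2))  # ~0.463
print(adjusted_r2(0.51, 30, 3))  # ~0.453
```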
f-test vs. t-test in linear regression
- simple: the f-test and the t-test for the slope are equivalent (F = t²)
- multiple: only the f-test can assess overall model significance (t-tests apply to individual coefficients)
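The simple-regression equivalence is easy to verify numerically, since F = t² for the slope (simulated data, statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=40)
y = 1.5 * x + rng.normal(size=40)

model = sm.OLS(y, sm.add_constant(x)).fit()
t_slope = model.tvalues[1]
print(model.fvalue, t_slope ** 2)  # equal: in simple regression F = t^2
```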