Week 3 Flashcards
β0
intercept
β1
regression coefficient
If you know the values for β0 and β1, then you can find….
a straight line that describes the linear relationship between x and y.
β0 is the value of y when…
x equals 0
β1 is the amount of change in y…
when x is increased by 1 unit.
ANOVA and regression are both part of the same
General Linear Model (GLM).
ANOVA is a special case of regression where the IVs are categorical or ordinal.
What can simple linear regression be used as?
As a descriptive technique.
It can also be used for statistical inference.
What does using simple linear regression for statistical inference involve?
- statistical modelling
- thinking about the true population model
- hypothesis testing
- using the sample regression coefficient to make inferences about the population regression coefficient
Mechanically, what does simple linear regression involve?
Fitting a line to data.
It is a minimization problem in mathematics.
Does simple linear regression (as a descriptive technique) involve statistical modelling?
No, it does not involve statistical modelling.
Predictor variable
x is the predictor
Criterion variable
y is the criterion
The goal of simple linear regression is…
to find the best-fitting straight line that describes the relationship between x and y.
what method is used to determine the best fitting line?
The least squares method.
What does the least squares method involve?
The least squares method involves calculating the sum of squared residuals (SSresidual), where ei represents the residual for the ith participant.
Criterion for the “best fitting line”
The line that minimizes the SSresidual.
residuals
ei = yi − ŷi
the difference between the observed yi and the predicted ŷi.
Computing SS residual
1. Compute each residual: observed − predicted.
2. Square the residuals.
3. Sum the squared residuals.
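The steps above can be sketched in Python with NumPy (the data values are made up for illustration):

```python
import numpy as np

# Hypothetical observed and predicted criterion scores (made-up data)
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])       # observed y_i
y_hat = np.array([2.5, 3.5, 4.5, 4.5, 5.0])   # predicted yhat_i

e = y - y_hat                   # residuals: observed minus predicted
ss_residual = np.sum(e ** 2)    # square each residual, then sum them up
# ss_residual is 1.0 for these numbers
```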
rxy
correlation between x and y
sx, sy
standard deviation for x, y
sxy
covariance of x and y
In the least square method, we minimize the sum of the squared vertical distances between the observed and predicted values to find the __________________
“best fitting line”.
There are other criteria for finding the best-fitting line:
§ minimize the sum of the squared horizontal distances.
§ minimize the sum of the squared perpendicular distances.
§ minimize the sum of the absolute vertical distances.
Based on the equation for β0, what is the predicted value of the criterion variable y when the predictor variable x is at its mean? In other words, for the regression equation
ŷ = β0 + β1x,
what is ŷ when x = x̄, given that β0 = ȳ − β1x̄?
ȳ. Substituting: ŷ = (ȳ − β1x̄) + β1x̄ = ȳ.
This shows that the regression line always passes through the point (x̄, ȳ).
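This identity is easy to check numerically; a quick sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# evaluating the fitted line at xbar recovers ybar,
# so (xbar, ybar) always lies on the regression line
y_hat_at_xbar = b0 + b1 * x.mean()
```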
For the simple linear regression, based on the equation for β1, do you think the regression coefficient and correlation coefficient will always have the same sign? (Hint: think about the range of possible values for sy and sx).
β1 = rxy(sy/sx)
Recall that the standard deviation is always positive. Therefore, sy/sx is always positive. This means the regression coefficient and the correlation coefficient always have the same sign in simple linear regression.
Positive β1: direct relationship between x and y, and positive rxy.
Negative β1: inverse relationship between x and y, and negative rxy.
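A numerical illustration of the sign relationship, using made-up data with an inverse relationship:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.0, 9.0, 6.0, 5.0])   # y tends to fall as x rises

r_xy = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)

b1 = r_xy * (s_y / s_x)   # beta1 = r_xy * (s_y / s_x)

# s_y / s_x > 0, so b1 and r_xy must share a sign (both negative here)
```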
Simple Linear Regression with Standardized Variables
We standardize x and y and denote the standardized variables as zx and zy, respectively:
z̄x = 0, szx = 1; z̄y = 0, szy = 1.
When we conduct the simple linear regression analysis with zx and zy, the regression coefficient equals
β1 = rxy
Therefore, for the simple linear regression, the regression coefficient β1 for the standardized variables is the…
correlation coefficient rxy.
If we use the standardized variables zx and zy to conduct the regression analysis, what will the intercept be?
β0 = 0, since β0 = z̄y − β1·z̄x = 0 − β1·0 = 0.
For simple linear regression with standardized variables, the regression line always passes through the origin and its slope (the regression coefficient) equals the correlation coefficient rxy.
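A sketch verifying both facts on made-up data: after z-scoring, the fitted slope equals rxy and the intercept is 0.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

zx = (x - x.mean()) / np.std(x, ddof=1)   # standardized x
zy = (y - y.mean()) / np.std(y, ddof=1)   # standardized y

b1_z = np.cov(zx, zy, ddof=1)[0, 1] / np.var(zx, ddof=1)  # slope on z-scores
b0_z = zy.mean() - b1_z * zx.mean()                       # intercept on z-scores
r_xy = np.corrcoef(x, y)[0, 1]
# b1_z equals r_xy, and b0_z is 0 (up to floating-point error)
```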
If we want to do statistical inference (e.g., test hypothesis), we need…
statistical modelling
involves making assumptions about the true population model.
The population regression model is the regression equation found via the least-squares method using
population data.
The sample regression model is the regression equation found via the least-squares method using
sample data.
xi
score on the predictor for ith participant
It is considered a constant across repeated studies in classical regression analysis.
For experimental designs, the predictor is fixed by the experimenter.
µyi|xi:
the predicted score on the criterion variable for the ith participant using the population regression model.
It is also the long-run average of the observed criterion variable yi conditional on the value of xi across repeated studies.
In the population regression model, the difference between the observed score and the predicted score on the criterion variable is called the:
error.
yi:
the score on the criterion variable for the ith participant.
An important assumption of the error term ϵi
ϵi ~ N(0, σ²)
Population Model – Assumption for the Error Term
Based on probability theory, this assumption implies that…
yi ~ N(β0 + β1xi, σ²)
Population Model – Assumption for the Error Term
- Normality: ϵi and yi are normally distributed.
- Linearity: Because the mean of ϵi is 0, the predicted score in the population is a linear function of the predictor: µyi|xi = β0 + β1xi.
- Constant Variance (a.k.a. homoscedasticity): Var(ϵi) = σ² is constant across participants.
- Independence: ϵi is not related to ϵj when i and j represent different participants.
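These assumptions can be illustrated by simulating the population model; all parameter values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5
xi = 3.0   # predictor value, treated as fixed across repeated studies

# e_i ~ N(0, sigma^2), so y_i | x_i ~ N(beta0 + beta1*x_i, sigma^2)
eps = rng.normal(0.0, sigma, size=100_000)
yi = beta0 + beta1 * xi + eps

# the long-run average of y_i given x_i approaches beta0 + beta1*xi = 7.0
long_run_mean = yi.mean()
```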
β̂0
estimate of the population β0.
Note: the hat accent means the “sample estimate” of a certain parameter.
β̂1
estimate of the population β1.
ŷi
predicted value of the criterion variable for the ith participant based on the sample regression model.
residual
In the sample regression model, the difference between the observed score and the predicted score on the criterion variable
yi
the score on the criterion variable for the ith participant.
ŷi = β̂0 + β̂1xi
the predicted score on the criterion variable for the ith participant using the sample regression model.
For the sample regression model, the sample regression line is obtained via the least-squares method by
minimizing the sum of squared residuals: SSresidual = Σ ei² = Σ (yi − ŷi)².
For simple linear regression (or multiple regression in general), you can conduct a hypothesis test….
§ for the intercept
§ for each of the regression coefficients
§ for the overall regression model taking into account all predictors.
For simple linear regression, since we only have one predictor x1, the hypothesis test for the overall regression model is equivalent to
the hypothesis test for the regression
coefficient β1 for x1.
A significant result on a simple linear regression
indicates the observed criterion variable
y can be significantly predicted or explained by the predictor x.
To conduct a hypothesis test regarding the population regression coefficient, we need to figure out the…
sampling distribution of the sample regression coefficient.
That is, the distribution of the sample regression coefficient over repeated studies.
µβ̂:
mean of sample regression coefficients over repeated samples.
β1
population regression coefficient
σ²β̂
variance of the sample regression coefficients over repeated studies
σ²
population error variance
s2x
variance of the predictor x
Replacing σ with its sample estimate σ̂ results in a t-statistic that follows the
t-distribution.
σ̂
the sample estimate of the population error standard deviation σ.
Assuming H0: β1 = 0 is true, over repeated studies, the sampling distribution of the t-statistic is
t ~ t(n − 2)
With the standard error formula, we can also find the 95% CI for the regression coefficient:
95% CI = β̂1 ± tcrit · SE(β̂1)
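A sketch of the t-test and 95% CI for β̂1, checked against `scipy.stats.linregress`; the data are made up, and this assumes SciPy is available:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 8.2])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - 2)                    # estimate of sigma^2
se_b1 = np.sqrt(sigma2_hat / ((n - 1) * np.var(x, ddof=1)))  # SE(beta1_hat)

t_stat = b1 / se_b1                            # test of H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p from t(n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```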
Interpret non-significant p-value
p = 0.0702 means that, assuming H0: β1 = 0 is true, the probability of obtaining a sample t-statistic at least as extreme as the one we obtained (i.e., t = 2.088) is 0.0702.
Correct interpretation of confidence interval
Over repeated studies, 95% of the CIs contain the population regression coefficient β1.