Lecture 6 - Statistical Tests II: Linear Regression Flashcards
in what instance would we select a linear regression?
interested in association → interested in trend [where x is continuous] → “experiment” → y-continuous → y-normal → linear regression
linear regression is a statistical model which shows:
the relationship between 2 continuous variables
what questions should we be asking when we are choosing a statistical test?
(1) what type of response variable? [continuous, discrete/count, proportion, binary]
(2) what type of explanatory variable? [continuous, discrete/count, proportion, binary, categorical]
(3) interested in differences or trends/relationships?
(4) paired or independent samples?
(5) normal or non-normal distribution?
what types of variables are present when we select a chi-squared statistical test?
when we are dealing with two categorical variables (y-counts, x categorical)
what variables must be present in order for us to carry out a linear regression statistical analysis?
for a linear regression, both the X and Y variables must be continuous
gradient =
change in y / change in x
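a minimal sketch of the gradient card in Python; the function name `gradient` and the two example points are made up for illustration:

```python
def gradient(x1, y1, x2, y2):
    """Slope between two points: change in y divided by change in x."""
    return (y2 - y1) / (x2 - x1)

# from (1, 2) to (3, 6): y rises by 4 while x rises by 2
print(gradient(1, 2, 3, 6))
```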
what are the three stages when trying to calculate a linear regression?
(1) choose your model: linear / non-linear
(2) estimate the parameters of the model
(3) model fit: how well does the model describe our data
what is the Y-bar line that runs horizontally across a graph?
Y-bar (a Y with a line on top of it), usually drawn as a horizontal dotted line, marks the mean value of Y in your data
how do you calculate the total sum of squares?
total sum of squares (SST) is the sum of the squared vertical distances between each data point and the mean line (Y-bar)
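as a rough illustration, the total sum of squares can be computed like this; the y values are invented toy data and `total_sum_of_squares` is a hypothetical helper name:

```python
def total_sum_of_squares(y):
    """Sum of squared distances between each y value and the mean of y."""
    y_bar = sum(y) / len(y)          # the Y-bar (mean) line
    return sum((yi - y_bar) ** 2 for yi in y)

y = [2, 4, 5, 4, 5]                  # made-up response values, mean = 4
sst = total_sum_of_squares(y)
print(sst)
```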
what is the equation of the regression line and what are its terms?
ŷ (y-hat, the predicted value of y) = a + bx
where:
a = intercept
b = slope
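a sketch of how a and b can be estimated, assuming the usual ordinary least squares formulas (b = Sxy / Sxx, a = Y-bar minus b times X-bar); the cards do not spell these formulas out, and the data and the name `fit_line` are made up for illustration:

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # sum of cross-products and sum of squares of x around their means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx            # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

x = [1, 2, 3, 4, 5]            # made-up explanatory values
y = [2, 4, 5, 4, 5]            # made-up response values
a, b = fit_line(x, y)
print(round(a, 3), round(b, 3))
```

the slope here comes out positive, so y tends to increase as x increases in this toy data.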
what is the error sum of squares (residuals)?
error sum of squares (SSE) = the sum of the squared residuals, where a residual is the vertical distance between an individual data point and the line of best fit (ŷ = a + bx)
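a small sketch of the error sum of squares for a given line; the data are invented, and the intercept 2.2 and slope 0.6 are simply the least-squares values for this toy data set:

```python
def error_sum_of_squares(x, y, a, b):
    """Sum of squared residuals around the line y-hat = a + b*x."""
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return sum(r ** 2 for r in residuals)

x = [1, 2, 3, 4, 5]            # made-up explanatory values
y = [2, 4, 5, 4, 5]            # made-up response values
sse = error_sum_of_squares(x, y, a=2.2, b=0.6)
print(round(sse, 3))
```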
what must all lines of best fit pass through and what allows us to choose what line of best fit is the most appropriate one?
all lines of best fit must pass through the point where the mean of X and the mean of Y meet (X-bar, Y-bar)
we select the best line as the one that leaves the least unexplained variation in our response, i.e. the one with the smallest sum of squared residuals
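one way to see this is to try several candidate lines that all pass through (X-bar, Y-bar) and compare their error sums of squares. this is only an illustrative sketch: the data, the candidate slopes and the helper name `sse_through_means` are all made up.

```python
x = [1, 2, 3, 4, 5]            # made-up explanatory values
y = [2, 4, 5, 4, 5]            # made-up response values

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

def sse_through_means(slope):
    """SSE of the line with this slope that passes through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

candidate_slopes = [0.2, 0.4, 0.6, 0.8, 1.0]
for slope in candidate_slopes:
    print(slope, round(sse_through_means(slope), 3))

# the candidate with the smallest SSE is the best of these lines
best_slope = min(candidate_slopes, key=sse_through_means)
print("smallest SSE at slope", best_slope)
```

a real fit does not search a list of candidates; it uses formulas that give the minimising slope directly, but the idea of "smallest residual variation wins" is the same.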
with regression, if the slope is positive or negative, what does this show about the relationship between the two variables?
if the slope is positive: the relationship between the variables is positive
if the slope is negative: the relationship between the variables is negative
what happens to the total sum of squares, SST, if we add additional data points?
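the value gets larger: each extra data point adds another squared distance, which can never be negative (SST stays the same only if the new point falls exactly on the mean)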
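a quick check of this with invented numbers (the helper name `total_sum_of_squares` and the extra point 9 are assumptions for illustration):

```python
def total_sum_of_squares(values):
    """Sum of squared distances between each value and the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

y = [2, 4, 5, 4, 5]                          # made-up response values
sst_before = total_sum_of_squares(y)
sst_after = total_sum_of_squares(y + [9])    # same data plus one extra point
print(round(sst_before, 3), round(sst_after, 3))
```

this is why a raw sum of squares is hard to compare between data sets of different sizes.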
how to calculate mean sum of squares:
mean variability = mean sum of squares (MS) = a sum of squares divided by its degrees of freedom (roughly "dividing by sample size", but using degrees of freedom, e.g. n - 1 for the total sum of squares)
mean sum of squares = sum of squared deviations from the mean / degrees of freedom
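a minimal sketch of the mean sum of squares for the total variation, assuming n - 1 degrees of freedom (the usual choice for the total sum of squares); the data are made up. with that choice the result matches the sample variance from Python's `statistics` module:

```python
import statistics

y = [2, 4, 5, 4, 5]                       # made-up response values
n = len(y)
y_bar = sum(y) / n

sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
df_total = n - 1                          # degrees of freedom (assumed n - 1)
ms_total = sst / df_total                 # mean sum of squares

print(ms_total, statistics.variance(y))
```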