CCS: Regression Flashcards
General linear model
Says your data can be explained by two things:
DATA = MODEL + ERROR
Model
y = mx + b, or equivalently y = b + mx
Ex - Predicting Trick or Treaters
GLMs break data into two categories:
- Information that can be accounted for by our model
- Information that can’t
Many types of GLMs
Error
Deviation from the model
The DATA isn’t wrong, the MODEL is
Models allow…
Allow us to make inferences.
Linear regression
Can provide a prediction for the data.
Instead of providing a prediction with a categorical variable (as with a t-test), we use a continuous one.
EX: Predict the number of likes a trending YouTube video gets, given the number of comments it has
Y = B + MX
Predicted likes = likes for a video with 0 comments (the intercept B) + (slope M × number of comments), where the slope is the increase in likes per additional comment
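In code, the prediction is just intercept plus slope times comments. A minimal sketch using the example numbers that appear later in these cards (intercept 9100, slope 6.5):

```python
# Example fit from these cards: intercept 9100, slope 6.5.
intercept = 9100.0   # predicted likes for a video with 0 comments
slope = 6.5          # extra likes per additional comment

def predicted_likes(comments: float) -> float:
    """Y = B + M*X: the regression prediction for a given comment count."""
    return intercept + slope * comments

predicted_likes(1000)  # 9100 + 6.5 * 1000 = 15600.0
```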
Plotting a linear regression
- Plot the data (from 100 videos) to see whether a straight line would fit it well
- Look at outliers (and decide how to handle them)
We have to set criteria for what an outlier is.
In regression, why are we so concerned about identifying and removing outliers?
Values that are really far away from the rest of our data can have an undue influence on the regression line.
What’s one of the biggest assumptions about linear regression?
One of the assumptions we make when using linear regression is that the relation is linear
Regression line
A straight line that’s as close as possible to all the data points at once.
It minimizes the sum of the squared distance of each point to the line.
Formula for the sum of all squared distances of each point to the line
Σ(Yi − Y-hat_i)^2 — note it's the predicted value Y-hat, not the mean Y-bar, since the distance is measured from each point to the regression line
YouTube likes = Y-intercept (9100) + slope/coefficient (6.5) × number of comments + error
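The line that minimizes the summed squared distances has a standard closed-form solution. A sketch with toy points (the slope/intercept formulas here are ordinary least squares, not spelled out on the cards):

```python
# Ordinary least squares by hand:
# slope m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)², intercept b = ȳ - m·x̄,
# which gives the unique line minimizing Σ(yi - ŷi)².
def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    b = y_bar - m * x_bar
    return b, m

# Toy points lying exactly on y = 1 + 2x recover b = 1.0, m = 2.0
b, m = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```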
Residual plot
Visual representation of the error.
We can tell a lot by its shape.
We want an evenly distributed plot of residuals.
It’s especially concerning if you can see a weird pattern (mustache shape).
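A residual is just observed minus predicted. A quick sketch with made-up data and a made-up fitted line:

```python
# Residual = observed - predicted, one per data point.
def residuals(x, y, intercept, slope):
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy data around the line y = 1 + 2x
r = residuals([0, 1, 2, 3], [1.2, 2.9, 5.1, 6.8], intercept=1.0, slope=2.0)
# A healthy residual plot scatters these values evenly around zero,
# with no funnel or "mustache" pattern as x increases.
```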
F-Test
Like a t-test, it helps us quantify how well we think our data fit a distribution, such as the null distribution
General form of test statistics
Test Statistic = (Observed data - what we expect if the null is true) / Average variation
Null hypothesis
Null hypothesis: There’s no relationship between number of comments and number of likes on YouTube
- Scatter plot is messy, no discernible pattern
- Regression line with slope of 0
Mapping the regression line
Y-hat equals predicted number of outcome variable (predicted number of likes)
Y-bar: mean value of likes in the sample
SStotal = Σ(Yi − Y-bar)^2
Total variation in data set
Similar to how to calculate variance
Var(Y) = SStotal/N
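A sketch of SStotal and the Var(Y) = SStotal/N formula above, with toy values:

```python
# SStotal = Σ(Yi - Ȳ)²; dividing by N gives the variance, as on the card.
def ss_total(y):
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)

y = [1, 3, 5, 7]     # toy values; mean is 4
sst = ss_total(y)    # 9 + 1 + 1 + 9 = 20.0
var_y = sst / len(y) # 5.0
```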
How much of the variation is accounted for by our model, and how much is just error?
Three key regression stats
Total Sums of Squares (SST): summed square of distance from data to null hypothesis line (0 slope)
Represents ALL the information we have
SST = SSR + SSE
Sums of Squares for Regression: summed square of distance from regression line to null hypothesis line
F-statistic: (observed model - model if null is true)/Average variation
Numerator: SSR
Represents the portion of the information we can explain using the model we created
DATA = MODEL + ERROR
Sums of Squares for Error (SSE): summed square of distance from data points to regression line
Represents the portion of the data we CAN’T explain by our model.
- Small SSE means data points are close to the regression line
- Large means the opposite
F-STATISTIC: F = (SSR/df-model) / (SSE/df-error) — explained vs. unexplained variation, each divided by its degrees of freedom (the shorthand SSR/SSE captures the idea but leaves out the df scaling)
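Putting the three sums of squares together, a sketch with toy data (the intercept 1.15 and slope 1.9 are the least-squares estimates for these four points, so SST = SSR + SSE holds exactly up to float noise):

```python
# SST = SSR + SSE decomposition for a least-squares fit, plus the F-statistic.
def regression_ss(x, y, intercept, slope):
    y_bar = sum(y) / len(y)
    y_hat = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error
    return sst, ssr, sse

sst, ssr, sse = regression_ss([0, 1, 2, 3], [1.2, 2.9, 5.1, 6.8],
                              intercept=1.15, slope=1.9)
# Simple regression: df_model = 1 predictor, df_error = n - 2
f_stat = (ssr / 1) / (sse / 2)
```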