CCS: Regression Flashcards

1
Q

General linear model

A

Says your data can be explained by two things:
DATA = MODEL + ERROR

Model
Y = mx + b or y = b + mx

Ex - Predicting trick-or-treaters

GLMs break data into two categories:

  1. Information that can be accounted for by our model
  2. Information that can’t

Many types of GLMs
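The DATA = MODEL + ERROR decomposition can be sketched in a few lines of Python (all numbers made up; the "model" here is just the sample mean, the simplest possible model):

```python
# A minimal sketch of DATA = MODEL + ERROR, using made-up trick-or-treater
# counts and a simple "predict the mean" model (all numbers hypothetical).
data = [30, 45, 25, 50, 40]           # observed trick-or-treaters per night
mean = sum(data) / len(data)          # the simplest possible model: the mean
model = [mean] * len(data)            # what the model accounts for
error = [d - m for d, m in zip(data, model)]  # what it can't account for

# Every observation decomposes exactly into model + error.
for d, m, e in zip(data, model, error):
    assert d == m + e
```

Note that the errors always sum to zero around the mean model: the model captures the central tendency, the error captures everything else.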

2
Q

Error

A

Deviation from the model

The DATA isn’t wrong, the MODEL is

3
Q

Models allow…

A

Allow us to make inferences.

4
Q

Linear regression

A

Can provide a prediction for the data.

Instead of a categorical predictor (as with a t-test), linear regression uses a continuous one.

EX: Predict the number of likes a trending YouTube video gets given the number of comments it has

Y = B + MX
Predicted likes = likes at 0 comments + (increase in likes per comment x number of comments)
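A sketch of that prediction, borrowing the intercept (9100) and slope (6.5) from the deck's later YouTube-likes card; the function name is made up:

```python
# Plugging the deck's example numbers (intercept 9100, slope 6.5) into
# Y = B + MX. The function name predict_likes is hypothetical.
def predict_likes(comments, intercept=9100.0, slope=6.5):
    """Predicted likes = likes at 0 comments + increase per comment x comments."""
    return intercept + slope * comments

print(predict_likes(0))      # a video with no comments: just the intercept
print(predict_likes(1000))   # 9100 + 6.5 * 1000
```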

5
Q

Plotting a linear regression

A
  1. Plot data (from 100 videos)
    This allows us to see if the data would be best fit by a straight line
  2. Look at outliers (and how to handle them)
    We have to set criteria for what an outlier is.
6
Q

In regression, why are we so concerned about identifying and removing outliers?

A

Values that are really far away from the rest of our data can have an undue influence on the regression line.
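A quick sketch (made-up numbers) of that influence, computing the least-squares slope with and without one extreme point:

```python
# How a single extreme point drags the least-squares slope.
# Slope via the closed form Sxy / Sxx; all data hypothetical.
def slope(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

xs = list(range(1, 11))
ys = [2 * x for x in xs]             # clean data: exact slope of 2
print(slope(xs, ys))                 # 2.0
print(slope(xs + [10], ys + [100]))  # one outlier more than doubles the slope
```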

7
Q

What’s one of the biggest assumptions about linear regression?

A

One of the assumptions we make when using linear regression is that the relationship between the predictor and the outcome is linear.

8
Q

Regression line

A

A straight line that's as close as possible to all the data points at once.

It minimizes the sum of the squared vertical distances from each point to the line.
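The line that minimizes those squared distances has a standard closed-form solution; a sketch on toy points that lie exactly on y = 1 + 2x:

```python
# Fitting the line that minimizes the sum of squared vertical distances,
# using the standard closed-form least-squares solution (toy data).
def fit_line(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    b = ybar - m * xbar   # the fitted line always passes through (xbar, ybar)
    return b, m

b, m = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # points on y = 1 + 2x
print(b, m)
```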

9
Q

Formula for the sum of all squared distances of each point to the line

A

Capital Sigma: Σ(Yi - Y-hat_i)^2, where Y-hat_i is the value the line predicts for point i. (Σ(Yi - Y-bar)^2, with Y-bar the mean, measures distance to the mean instead — that's SStotal, not distance to the line.)

Example fitted line:
YouTube likes = Y-intercept (9100) + slope/coefficient (6.5) x number of comments + error

10
Q

Residual plot

A

Visual representation of the error.

We can tell a lot from its shape.

We want residuals that are evenly scattered around zero.

It's especially concerning if you can see a pattern (e.g., a funnel or curved "mustache" shape).

11
Q

F-Test

A

Like the t-test, it helps us quantify how well our data fit a hypothesized distribution, such as the null distribution.

12
Q

General form of test statistics

A

Test Statistic = (Observed data - what we expect if the null is true) / Average variation
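That general shape can be sketched as a one-sample t-statistic (hypothetical sample, null mean of 100):

```python
# The general form (observed - expected under null) / average variation,
# instantiated as a one-sample t statistic. Sample numbers are made up.
import math

sample = [104, 98, 110, 102, 96, 108]
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = sd / math.sqrt(n)          # "average variation" of the sample mean

t = (mean - 100) / se           # observed minus expected, over variation
print(t)
```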

13
Q

Null hypothesis

A

Null hypothesis: There’s no relationship between number of comments and number of likes on Youtube

  • Scatter plot is messy, no discernible pattern
  • Regression line with slope of 0
14
Q

Mapping the regression line

A

Y-hat: predicted value of the outcome variable (predicted number of likes)

Y-bar: mean value of likes in the sample

SStotal = Σ(Yi - Y-bar)^2
Total variation in the data set

Similar to how we calculate variance:
Var(Y) = SStotal / N

The key question: how much of the variation is accounted for by our model, and how much is just error?
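SStotal and its link to variance, sketched on made-up like counts:

```python
# SStotal and Var(Y) = SStotal / N, on hypothetical like counts.
likes = [12000, 15000, 9000, 14000, 10000]
n = len(likes)
ybar = sum(likes) / n                            # Y-bar: sample mean

ss_total = sum((y - ybar) ** 2 for y in likes)   # total variation in the data
variance = ss_total / n                          # Var(Y) = SStotal / N

print(ybar, ss_total, variance)
```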

15
Q

Three key regression stats

A

Total Sums of Squares (SST): summed squared distance from the data points to the null-hypothesis line (the flat line at Y-bar, slope 0)
Represents ALL the information we have
SST = SSR + SSE

Sums of Squares for Regression (SSR): summed squared distance from the regression line to the null-hypothesis line
Represents the portion of the information we CAN explain using the model we created (the MODEL in DATA = MODEL + ERROR)

Sums of Squares for Error (SSE): summed squared distance from the data points to the regression line
Represents the portion of the data we CAN'T explain with our model
- Small SSE means the data points are close to the regression line
- Large SSE means the opposite

F-statistic: (observed model - model if null is true) / average variation
Numerator: SSR; denominator: SSE (each scaled by its degrees of freedom)
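The whole decomposition can be checked numerically; a sketch on hypothetical (comments, likes) pairs chosen to land near the deck's 6.5-likes-per-comment example:

```python
# SST = SSR + SSE and the F statistic, on made-up (comments, likes) data.
def fit(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    return ybar - m * xbar, m

comments = [100, 200, 300, 400, 500, 600]
likes = [9800, 10300, 11200, 11500, 12600, 12900]

b, m = fit(comments, likes)
ybar = sum(likes) / len(likes)
yhat = [b + m * x for x in comments]

sst = sum((y - ybar) ** 2 for y in likes)             # data vs. null line
ssr = sum((p - ybar) ** 2 for p in yhat)              # model vs. null line
sse = sum((y - p) ** 2 for y, p in zip(likes, yhat))  # data vs. model

df_r, df_e = 1, len(likes) - 2          # 1 predictor; n - 2 left for error
f_stat = (ssr / df_r) / (sse / df_e)

print(sst, ssr + sse, f_stat)           # sst equals ssr + sse (up to rounding)
```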

16
Q

Degrees of freedom

A

Represent the amount of independent information we have

17
Q

DoF in Regression

A

SSR has one degree of freedom (one predictor).

SSE has N - 2 degrees of freedom (we estimated two parameters: the intercept and the slope).

We have to divide each sum of squares by its degrees of freedom because we want to weight each one appropriately: we scale the sums of squares by the amount of independent information each has.

18
Q

Final bit of rejecting the null hypothesis with f-statistic.

A

Using an F-distribution, we get our p-value: the probability that we'd get an F-statistic as big as or bigger than the one we observed (59.613 in the YouTube example).

19
Q

Final F-Statistic equation

A

F-STATISTIC = (SSR / DFr) / (SSE / DFe)

20
Q

F-Statistic allows what?

A

Allows us to directly compare the amount of variation our model can and cannot explain.

When it explains a lot (relative to the error), we consider the result statistically significant.

For a simple regression with one predictor, a t-test on the slope gives the same p-value, and squaring the t-statistic gives the F-statistic (t^2 = F). The two calculations are related.
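That t^2 = F relationship can be checked numerically on toy data (made-up points scattered near y = 2x):

```python
# Numeric check that for simple regression, the slope t-statistic squared
# equals the F-statistic. All data points are hypothetical.
import math

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b = ybar - m * xbar

sse = sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))
mse = sse / (n - 2)                     # error variance estimate

f_stat = (m * m * sxx) / mse            # SSR / 1 over SSE / (n - 2)
t_stat = m / math.sqrt(mse / sxx)       # slope over its standard error

print(f_stat, t_stat ** 2)              # identical up to rounding
```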