CCS: Regression Flashcards
General linear model
Says your data can be explained by two things:
DATA = MODEL + ERROR
Model
y = mx + b, or equivalently y = b + mx
Ex - Predicting Trick or Treaters
GLMs break data into two categories:
- Information that can be accounted for by our model
- Information that can’t
Many types of GLMs
Error
Deviation from the model
The DATA isn’t wrong, the MODEL is
Models allow…
Allow us to make inferences.
Linear regression
Can provide a prediction for the data.
Instead of providing a prediction with a categorical variable (as with a t-test), we use a continuous one.
EX: Predict the number of likes a trending YouTube video gets, given the number of comments it has
Y = B + MX
Predicted likes = likes for a video with 0 comments (the intercept B) + (slope M × number of comments), where the slope is the increase in likes per additional comment
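In code, the prediction is just intercept plus slope times comments. A minimal sketch using the example numbers that appear later in these cards (intercept 9100, slope 6.5):

```python
# Example fit from these cards: intercept 9100, slope 6.5.
intercept = 9100.0   # predicted likes for a video with 0 comments
slope = 6.5          # extra likes per additional comment

def predicted_likes(comments: float) -> float:
    """Y = B + M*X: the regression prediction for a given comment count."""
    return intercept + slope * comments

predicted_likes(1000)  # 9100 + 6.5 * 1000 = 15600.0
```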
Plotting a linear regression
- Plot the data (from 100 videos) to see whether a straight line would fit it well
- Look at outliers (and decide how to handle them)
We have to set criteria for what an outlier is.
In regression, why are we so concerned about identifying and removing outliers?
Values that are really far away from the rest of our data can have an undue influence on the regression line.
What’s one of the biggest assumptions about linear regression?
One of the assumptions we make when using linear regression is that the relation is linear
Regression line
A straight line that’s as close as possible to all the data points at once.
It minimizes the sum of the squared distance of each point to the line.
Formula for the sum of all squared distances of each point to the line
Σ(Yi − Y-hat_i)^2 — note it's the predicted value Y-hat, not the mean Y-bar, since the distance is measured from each point to the regression line
YouTube likes = Y-intercept (9100) + slope/coefficient (6.5) × number of comments + error
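The line that minimizes the summed squared distances has a standard closed-form solution. A sketch with toy points (the slope/intercept formulas here are ordinary least squares, not spelled out on the cards):

```python
# Ordinary least squares by hand:
# slope m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)², intercept b = ȳ - m·x̄,
# which gives the unique line minimizing Σ(yi - ŷi)².
def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    b = y_bar - m * x_bar
    return b, m

# Toy points lying exactly on y = 1 + 2x recover b = 1.0, m = 2.0
b, m = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```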
Residual plot
Visual representation of the error.
We can tell a lot by its shape.
We want an evenly distributed plot of residuals.
It’s especially concerning if you can see a weird pattern (mustache shape).
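A residual is just observed minus predicted. A quick sketch with made-up data and a made-up fitted line:

```python
# Residual = observed - predicted, one per data point.
def residuals(x, y, intercept, slope):
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy data around the line y = 1 + 2x
r = residuals([0, 1, 2, 3], [1.2, 2.9, 5.1, 6.8], intercept=1.0, slope=2.0)
# A healthy residual plot scatters these values evenly around zero,
# with no funnel or "mustache" pattern as x increases.
```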
F-Test
Like a t-test, it helps us quantify how well we think our data fit a distribution, such as the null distribution
General form of test statistics
Test Statistic = (Observed data - what we expect if the null is true) / Average variation
Null hypothesis
Null hypothesis: There’s no relationship between number of comments and number of likes on YouTube
- Scatter plot is messy, no discernible pattern
- Regression line with slope of 0
Mapping the regression line
Y-hat equals predicted number of outcome variable (predicted number of likes)
Y-bar: mean value of likes in the sample
SStotal = Σ(Yi − Y-bar)^2
Total variation in data set
Similar to how to calculate variance
Var(Y) = SStotal/N
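A sketch of SStotal and the Var(Y) = SStotal/N formula above, with toy values:

```python
# SStotal = Σ(Yi - Ȳ)²; dividing by N gives the variance, as on the card.
def ss_total(y):
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)

y = [1, 3, 5, 7]     # toy values; mean is 4
sst = ss_total(y)    # 9 + 1 + 1 + 9 = 20.0
var_y = sst / len(y) # 5.0
```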
How much of the variation is accounted for by our model, and how much is just error?
Three key regression stats
Total Sums of Squares (SST): summed square of distance from data to null hypothesis line (0 slope)
Represents ALL the information we have
SST = SSR + SSE
Sums of Squares for Regression: summed square of distance from regression line to null hypothesis line
F-statistic: (observed model - model if null is true)/Average variation
Numerator: SSR
Represents the portion of the information we can explain using the model we created
DATA = MODEL + ERROR
Sums of Squares for Error (SSE): summed square of distance from data points to regression line
Represents the portion of the data we CAN’T explain by our model.
- Small SSE means data points are close to the regression line
- Large means the opposite
F-STATISTIC: F = (SSR/df-model) / (SSE/df-error) — explained vs. unexplained variation, each divided by its degrees of freedom (the shorthand SSR/SSE captures the idea but leaves out the df scaling)
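Putting the three sums of squares together, a sketch with toy data (the intercept 1.15 and slope 1.9 are the least-squares estimates for these four points, so SST = SSR + SSE holds exactly up to float noise):

```python
# SST = SSR + SSE decomposition for a least-squares fit, plus the F-statistic.
def regression_ss(x, y, intercept, slope):
    y_bar = sum(y) / len(y)
    y_hat = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error
    return sst, ssr, sse

sst, ssr, sse = regression_ss([0, 1, 2, 3], [1.2, 2.9, 5.1, 6.8],
                              intercept=1.15, slope=1.9)
# Simple regression: df_model = 1 predictor, df_error = n - 2
f_stat = (ssr / 1) / (sse / 2)
```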