Unit 7 Flashcards

Question 1

Q

Why do we use scatterplots?

Answer

A

To graph the relationship between two quantitative variables.
We use them when we have two variables that we believe are related to one another. A histogram shows only one variable at a time, so it cannot display that relationship.

Question 2

Q

What is a scatterplot?

Answer

A

each point on a scatterplot shows the values of two different variables measured on the same individual
if you have an explanatory variable and a response variable, they should be plotted on the x-axis and y-axis respectively
if there is no explanatory variable, it does not matter which variable goes on which axis

Question 3

Q

What are the two different types of models for modeling a variable y by a variable x?

Answer

A

deterministic model: a model where y can always be predicted exactly by x alone
probabilistic model: a model that includes both a deterministic portion and a random error term

Question 4

Q

What is the general form of a probabilistic linear model?

Answer

A

y = y-intercept + slope*x + random error term (the error term follows a normal distribution, N(0, sigma))
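A minimal sketch of the difference between the deterministic part and the full probabilistic model, using only the Python standard library. The function name `simulate_linear_model` and the parameter values (intercept 2.0, slope 0.5) are made up for illustration. Setting sigma to 0 removes the random error and leaves only the deterministic line.

```python
import random

def simulate_linear_model(xs, intercept, slope, sigma, seed=0):
    """Generate y = intercept + slope*x + error, with error drawn from N(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [intercept + slope * x + rng.gauss(0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]

# Probabilistic model: each y is the line value plus a random error.
ys_noisy = simulate_linear_model(xs, intercept=2.0, slope=0.5, sigma=1.0)

# Deterministic model: sigma = 0, so y is predicted exactly by x alone.
ys_exact = simulate_linear_model(xs, intercept=2.0, slope=0.5, sigma=0.0)

print(ys_exact)
print(ys_noisy)
```

A larger sigma scatters the simulated points further from the line, and a smaller sigma keeps them close to it.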

Question 5

Q

How does the scatterplot affect the sigma value used for the random error term?

Answer

A

points spread out -> sigma large
points close to the line -> sigma small

Question 6

Q

What does the probabilistic linear model look like for average y?

Answer

A

mean-y = y-intercept + slope*x (the errors should average out to 0 since the error follows N(0, sigma))

Question 7

Q

What is the regression line?

Answer

A

The line we will use to model our data and make predictions.
the predicted value of y for a given x = the estimate of the intercept + the estimate of the slope*x

Question 8

Q

What is the difference between observed and predicted values of y?

Answer

A

Observed values of y are the points on our scatterplot. Our predicted values of y are the values along the line we fit.

Question 9

Q

What is the relationship between the observed and predicted values of y called?

Answer

A

The vertical difference between them is the “error”: how far off our real-world value is from our modeled/predicted value

Question 10

Q

What do we want to minimize in fitting our curve?

Answer

A

the sum of (observed y - predicted y)^2. We do not minimize the plain sum of the differences, because positives and negatives would cancel each other out; squaring avoids this, just as it does in the variance formula
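A small sketch of why the plain sum of differences is a poor criterion. The data and the flat candidate line at y = 4 are invented for illustration. The line clearly misses several points, yet the signed errors sum to zero.

```python
observed = [2, 4, 5, 4, 5]
predicted = [4, 4, 4, 4, 4]  # hypothetical flat line at y = 4

errors = [obs - pred for obs, pred in zip(observed, predicted)]

sum_of_errors = sum(errors)                       # positives and negatives cancel
sum_of_squared_errors = sum(e ** 2 for e in errors)  # every miss counts

print(errors)
print(sum_of_errors, sum_of_squared_errors)
```

The sum of errors suggests a perfect fit, while the sum of squared errors correctly reports that the line is off.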

Question 11

Q

What are the values of the estimate of the intercept and the estimate of the slope that minimize the squared errors?

Answer

A

estimate of slope: SSxy / SSxx
estimate of intercept: sample mean of y - estimate of slope*sample mean of x
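The two estimates can be computed directly from the sums of squares. This is a sketch on a small made-up data set; the function name `fit_line` is hypothetical, and SSxy and SSxx are taken to be the usual sums of cross-deviations and squared deviations from the sample means.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx              # estimate of slope: SSxy / SSxx
    intercept = y_bar - slope * x_bar  # mean of y minus slope times mean of x
    return intercept, slope

intercept, slope = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(intercept, slope)
```

For this data set SSxy = 6 and SSxx = 10, so the slope is about 0.6 and the intercept about 2.2.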

Question 12

Q

What does the slope of the regression line (beta-1-hat) tell us?

Answer

A

How much y changes when x changes by one

Question 13

Q

What is the general form for an interpretation of beta1-hat?

Answer

A

For every unit increase in X we predict that Y will increase(decrease) by beta-1-hat units

Question 14

Q

What is the general form for an interpretation for beta0-hat?

Answer

A

The predicted value of Y when X is zero (if x = 0 is outside the range of the data, this interpretation may be meaningless or nonsensical)

Question 15

Q

What are the assumptions we need to make for our regression to be valid?

Answer

A

the expected value (the mean) of the errors is 0
the variance of the error terms is constant across all values of x
the relationship between x and y is linear
the probability distribution of the errors is normal
the errors are independent between observations

Question 16

Q

What is a residual plot?

Answer

A

A way to test if the variance of the error terms is constant and if the relationship is linear. We plot the errors (residuals) against the predicted values. If no section of the plot is obviously much taller than any other, the variance of the errors is constant. If there is no discernible pattern in the plot, the relationship is linear.
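A sketch of how the points of a residual plot could be produced, assuming a simple least-squares fit on made-up data. The function name `residual_plot_points` is hypothetical. Each pair is (predicted value, residual), and plotting these pairs with any charting tool gives the residual plot.

```python
def residual_plot_points(xs, ys):
    """Return a list of (predicted y, residual) pairs for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar

    points = []
    for x, y in zip(xs, ys):
        predicted = intercept + slope * x
        points.append((predicted, y - predicted))  # residual = observed - predicted
    return points

points = residual_plot_points([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for predicted, residual in points:
    print(round(predicted, 2), round(residual, 2))
```

With only five points no real pattern can be judged; the sketch only shows which quantities go on each axis.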

Question 17

Q

How can we test if the probability distribution of the errors is normal?

Answer

A

Make a normal quantile plot of the error terms. In particular, we make a normal quantile plot of the standardized residuals

Question 18

Q

What are standardized residuals?

Answer

A

Error terms converted to z-scores
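A simplified sketch of the conversion, assuming each residual is divided by the estimated standard deviation of the errors (with n - 2 in the denominator). Statistical software usually also adjusts each residual for its leverage, which this sketch leaves out. The data and the function name `standardized_residuals` are made up for illustration.

```python
import math

def standardized_residuals(xs, ys):
    """Residuals divided by the estimated error standard deviation (no leverage adjustment)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / sigma_hat for r in residuals]

z_scores = standardized_residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print([round(z, 3) for z in z_scores])
```

These z-scores are the values that would go into the normal quantile plot.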

Question 19

Q

What does the error variance tell us about our model?

Answer

A

The bigger the variance, the worse our prediction will be. The more the errors vary from 0, the more error there is in the model

Question 20

Q

How can we find an estimate for the variance of the error?

Answer

A

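estimated variance: sigma-hat^2 = (SSyy - beta1-hat*SSxy) / (n - 2)
estimated standard deviation (standard error): sigma-hat = sqrt( (SSyy - beta1-hat*SSxy) / (n - 2) )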
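A sketch of the estimate on a small made-up data set. It assumes the sum of squared errors is computed with the shortcut SSyy - beta1-hat*SSxy and divided by n - 2 before taking the square root. The function name `error_std_estimate` is hypothetical.

```python
import math

def error_std_estimate(xs, ys):
    """Estimate sigma as sqrt((SSyy - slope * SSxy) / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = ss_xy / ss_xx
    sse = ss_yy - slope * ss_xy  # sum of squared errors via the shortcut formula
    return math.sqrt(sse / (n - 2))

sigma_hat = error_std_estimate([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(sigma_hat ** 2, sigma_hat)
```

Here SSyy = 6, SSxy = 6 and the slope is 0.6, so the estimated variance is about 2.4 / 3 = 0.8.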

Question 21

Q

What is the general interpretation of standard error?

Answer

A

Approximately 95% of our observations will fall within two standard errors of the regression line

Question 22

Q

What would our model look like if x had no relationship with y?

Answer

A

y would equal the population mean of y plus or minus some error (the slope would be 0, so x drops out of the model)

Question 23

Q

What is the correlation of a sample?

Answer

A

r (a number between -1 and 1) measures the strength of a linear relationship between two variables. It can tell us both the strength and the direction (positive or negative) of the relationship

Question 24

Q

What are the properties of correlation?

Answer

A

positive r values indicate a positive association. Negative r values indicate a negative association
r falls between -1 and 1 inclusively
r values near -1 and 1 indicate a strong linear relationship. r values near 0 indicate a weak linear relationship.
If r = -1 or r = 1, the points fall exactly on a straight line and are perfectly correlated
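A sketch that checks these properties numerically. It assumes the standard sample formula r = SSxy / sqrt(SSxx * SSyy), which the cards do not state explicitly. The data sets and the function name `correlation` are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r_scattered = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # positive, less than 1
r_perfect_up = correlation([1, 2, 3], [2, 4, 6])             # points on a rising line
r_perfect_down = correlation([1, 2, 3], [6, 4, 2])           # points on a falling line

print(r_scattered, r_perfect_up, r_perfect_down)
```

The scattered data give r of roughly 0.77, while the two exact lines give r at the extremes of 1 and -1.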

Question 25

Q

What is a lurking variable?

Answer

A

A variable that helps explain the relationship between variables in a study but which is not itself included in the study

Question 26

Q

What is the coefficient of determination?

Answer

A

r squared: it helps us establish how much of y’s movement is explained by its relationship to x and how much is due to error/other factors. Of all the movement in our y, what proportion of it can be attributed to a relationship with x. The better the points fit to the line, the higher this value will be

Question 27

Q

What does a coefficient of determination of zero mean?

Answer

A

x explains nothing about the variation in y and it’s all due to error

Question 28

Q

What does a coefficient of determination of one mean?

Answer

A

x explains all of the variation in y and there is no error

Question 29

Q

What are the properties of r squared?

Answer

A

r squared takes on values between 0 and 1 inclusive
if r squared = 1 we can predict Y exactly from X and we say the regression accounts for all the variability of Y
if r squared = 0, regression on X tells us nothing about the value of Y
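A sketch of r squared as the proportion of the variation in y explained by the line, assuming it is computed as 1 - SSE / SSyy, where SSE is the sum of squared residuals. The data and the function name `r_squared` are made up for illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination: 1 - SSE / SSyy for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / ss_yy

r2_scattered = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r2_perfect = r_squared([1, 2, 3], [2, 4, 6])  # points exactly on a line

print(r2_scattered, r2_perfect)
```

For the scattered data about 60% of the variation in y is explained by x; for the points that lie exactly on a line the value is 1.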

Question 30

Q

What is the residual?

Answer

A

observed - predicted. The error of our estimate.

Question 31

Q

What does least-squares regression actually do?

Answer

A

It finds the line that minimizes the sum of the squared errors. If we were to just minimize the sum of the errors, positive and negative errors would cancel out. We square instead of taking the absolute value to put a heavier penalty on extreme errors.

Question 32

Q

What are outliers in a two-variable model?

Answer

A

Points that do not fit in with the general pattern and/or the range of our other x and y values. They can affect the slope of our regression line because the slope is based on sums of squares, which can be huge if we have points very far away from the mean.

Question 33

Q

What are the different kinds of outliers in a two-variable model?

Answer

A

x direction (outside of the general range of our x points. Creates massive error term)
y direction (the slope doesn’t drastically change. If we tilt the line too much to meet it, we would create really bad outliers on the other side)
x and y direction (may not change the slope, but will change average x and y and hence sums of squares)
outside of general pattern (violates normality of errors assumption)
