Unit 7 Flashcards

1
Q

Why do we use scatterplots?

A

To graph the relationship between two quantitative variables.
We use them when we have two variables that we believe are related to one another; with two variables, we cannot use a histogram.

2
Q

What is a scatterplot?

A
  • each point on a scatterplot shows the values of two different variables measured on the same individual
  • if you have an explanatory variable and a response variable, they should be plotted on the x-axis and y-axis respectively
  • if there is no explanatory variable, the labelling of the axes is unimportant
3
Q

What are the two different types of models used to model a variable y by a variable x?

A
  1. deterministic model: a model where y can always be predicted exactly by x alone
  2. probabilistic model: a model that includes both a deterministic portion and a random error term
4
Q

What is the general form of a probabilistic linear model?

A

y = y-intercept + slope*x + random error term (the error follows a normal distribution N(0, sigma))

5
Q

How does the scatterplot affect the sigma value used for the random error term?

A

points spread out -> sigma large
points close to the line -> sigma small

6
Q

What does the probabilistic linear model look like for average y?

A

mean-y = y-intercept + slope*x (the errors should average out to 0 since the error follows N(0, sigma))

7
Q

What is the regression line?

A

The line we will use to model our data and make predictions.
the predicted value of y for a given x = the estimate of the intercept + the estimate of the slope*x

8
Q

What is the difference between observed and predicted values of y?

A

Observed values of y are the points on our scatterplot. Our predicted values of y are the values along the line we fit.

9
Q

What is the relationship between the observed and predicted values of y called?

A

The vertical difference between them is the “error”: how far off our real-world value is from our modelled/predicted value.

10
Q

What do we want to minimize in fitting our curve?

A

The sum of (observed y - predicted y)^2. We do not minimize the sum of the raw differences, because positive and negative errors would cancel each other out, as with variance.

11
Q

What are the values of the estimate of the intercept and the estimate of the slope that minimize the squared errors?

A

estimate of slope: SSxy / SSxx
estimate of intercept: sample mean of y - estimate of slope * sample mean of x

12
Q

What does the slope of the regression line (beta-1-hat) tell us?

A

How much y is predicted to change when x increases by one unit.

13
Q

What is the general form for an interpretation of beta1-hat?

A

For every one-unit increase in X, we predict that Y will increase (decrease) by beta-1-hat units.

14
Q

What is the general form for an interpretation for beta0-hat?

A

The predicted value of Y when X is zero (if x = 0 is outside the range of the data, this interpretation may be meaningless or nonsensical)

15
Q

What are the assumptions we need to make for our regression to be valid?

A
  1. the expected value (the mean) of the errors is 0
  2. the variance of the error terms is constant across all values of x
  3. the relationship between x and y is linear
  4. the probability distribution of the errors is normal
  5. the errors are independent between observations
16
Q

What is a residual plot?

A

A way to check whether the variance of the error terms is constant and whether the relationship is linear. We plot the errors (residuals) against the predicted values. If no section of the plot is obviously more spread out than any other, the variance of the errors is constant. If there is no discernible pattern in the plot, the relationship is linear.

17
Q

How can we test if the probability distribution of the errors is normal?

A

Make a normal quantile plot of the error terms. In particular, we make a normal quantile plot of the standardized residuals

18
Q

What are standardized residuals?

A

Error terms converted to z-scores

19
Q

What does the error variance tell us about our model?

A

The bigger the variance, the worse our predictions will be. The more the errors vary from 0, the more error there is in the model.

20
Q

How can we find an estimate for the variance of the error?

A

sigma-hat = sqrt( (SSyy - beta1-hat*SSxy) / (n - 2) )

21
Q

What is the general interpretation of standard error?

A

Approximately 95% of our observations will fall within two standard errors of the regression line

22
Q

What would our model look like if x had no relationship with y?

A

y would equal the population mean of y plus or minus some error (beta-1-hat = 0)

23
Q

What is the correlation of a sample?

A

r (a number between -1 and 1) measures the strength of a linear relationship between two variables. It can tell us both the strength and the direction (positive or negative) of the relationship

24
Q

What are the properties of correlation?

A
  1. positive r values indicate a positive association. Negative r values indicate a negative association
  2. r falls between -1 and 1 inclusively
  3. r values near -1 and 1 indicate a strong linear relationship. r values near 0 indicate a weak linear relationship.
  4. if r = -1 or r = 1, the points fall exactly on a straight line and are perfectly correlated
25
Q

What is a lurking variable?

A

A variable that helps explain the relationship between variables in a study but which is not itself included in the study

26
Q

What is the coefficient of determination?

A

r squared: it tells us how much of y’s movement is explained by its relationship to x and how much is due to error/other factors. Of all the movement in our y, what proportion can be attributed to the relationship with x? The better the points fit the line, the higher this value will be.

27
Q

What does a coefficient of determination of zero mean?

A

x explains nothing about the variation in y and it’s all due to error

28
Q

What does a coefficient of determination of one mean?

A

x explains all of the variation in y and there is no error

29
Q

What are the properties of r squared?

A
  1. r squared takes on values between 0 and 1 inclusive
  2. if r squared = 1 we can predict Y exactly from X and we say the regression accounts for all the variability of Y
  3. if r squared = 0, regression on X tells us nothing about the value of Y
30
Q

What is the residual?

A

observed - predicted. The error of our estimate.

31
Q

What does least-squares regression actually do?

A

If we were to just minimize the sum of the errors, positive and negative errors would cancel out. We square the errors (rather than take the absolute value) to put a larger penalty on extreme errors.

32
Q

What are outliers in a two-variable model?

A

Points that do not fit the general pattern and/or range of our other x and y values. They can affect the slope of our regression line because it is based on sums of squares, which can be huge if we have points very far away from the mean.

33
Q

What are the different kinds of outliers in a two-variable model?

A
  • x direction (outside the general range of our x points; creates a massive error term)
  • y direction (the slope doesn’t drastically change; if we tilted the line too far to meet the point, we would create really bad outliers on the other side)
  • x and y direction (may not change the slope, but will change the means of x and y and hence the sums of squares)
  • outside the general pattern (violates the normality-of-errors assumption)