Module 6 Flashcards

1
Q

scatterplots

A

plots that graph bivariate, quantitative data. The horizontal axis shows the explanatory variable and the vertical axis shows the response variable. Each observation is plotted as a point. These plots help us visualize the relationship between two variables.

2
Q

Association

A

used to describe a relationship between 2 variables. An association can be positive, negative, or absent (no association).

3
Q

correlation

A

measure of the strength of linear association between 2 variables

4
Q

What does a curve in a scatterplot indicate?

A

the relationship is not linear, so measuring correlation is not appropriate.

5
Q

Properties of r

A
  1. independent of units
  2. symmetric: r is the same no matter which variable is treated as explanatory and which as response
  3. Magnitude determines strength of the linear relationship
  4. Falls between -1 and 1
  5. Sign determines type of relationship (positive or negative)
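
A minimal sketch of properties 1, 2 and 4, assuming the usual Pearson formula for r. The data values and the helper name `pearson_r` are made up for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                         # property 2: same value with x and y swapped
r_rescaled = pearson_r([100 * xi for xi in x], y)   # property 1: changing units of x leaves r unchanged
```

All three values agree (up to floating-point rounding) and lie between -1 and 1.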
6
Q

What plot must be analyzed before looking at a linear relationship?

A

a scatterplot must be analyzed to see whether it makes sense to look at a linear relationship and to make sure there are no curves present

7
Q

What do r values of -1, 0 and 1 indicate?

A
  • -1: perfect negative LINEAR correlation
    0: no linear correlation
    1: perfect positive LINEAR correlation
8
Q

what do different ranges of r indicate?

A

0 to 0.4 or -0.4 to 0: weak positive/negative correlation

0.4 to 0.8 or -0.8 to -0.4: moderate positive/negative correlation

0.8 to 1 or -1 to -0.8: strong positive/negative correlation
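
The ranges above can be sketched as a small lookup. The function name `describe_r` is hypothetical. Because the ranges share their endpoints, this sketch assumes a boundary value such as 0.4 goes to the stronger category.

```python
def describe_r(r):
    """Label a correlation using the weak/moderate/strong cutoffs of 0.4 and 0.8."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < 0.4:
        strength = "weak"
    elif size < 0.8:        # assumption: exactly 0.4 counts as moderate
        strength = "moderate"
    else:                   # assumption: exactly 0.8 counts as strong
        strength = "strong"
    return f"{strength} {direction}"

labels = [describe_r(value) for value in (0.25, -0.6, 0.95)]
```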
9
Q

regression equation and variable meanings

A
yhat=a+bx
a=y-intercept
b=slope
x=explanatory variable
yhat=predicted mean value of response variable
10
Q

How to interpret slope

A

the slope equals the amount that the predicted mean value of the response variable (y) changes when the explanatory variable (x) increases by one unit

11
Q

residual definition

A

difference between observed and predicted values. Residual=observed-predicted

12
Q

What do residuals represent graphically?

A

the vertical distance each point lies from the regression line

13
Q

How many residuals does each observation have?

A

1

14
Q

What are properties of residuals if the observation is larger and if it is smaller than the predicted value?

A

Larger: Positive residual value, point will lie above the line

Smaller: Negative residual value, point will lie below the line
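
A short sketch of residual = observed - predicted and what the sign says about a point's position. The data are hypothetical, and `a` and `b` are assumed to be the least squares intercept and slope for these points.

```python
# Hypothetical data; a and b are assumed to be the fitted intercept and slope.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

predicted = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]  # observed - predicted

# One residual per observation; its sign tells which side of the line the point is on.
positions = [
    "above" if e > 0 else "below" if e < 0 else "on the line"
    for e in residuals
]
```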

15
Q

Which regression line fits a data set best?

A

the line that minimizes the sum of the squared residuals (also called the sum of squared errors)
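
A minimal sketch of this idea, assuming the standard least squares formulas b = Sxy/Sxx and a = ybar - b*xbar. The data and the helper names `least_squares` and `sse` are hypothetical. The fitted line is compared with a few nearby lines to show that each of them has a larger sum of squared residuals.

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line yhat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, x, y):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
best = sse(a, b, x, y)

# Shift the intercept and slope a little; every shifted line fits worse.
others = [sse(a + da, b + db, x, y) for da in (-0.5, 0.5) for db in (-0.2, 0.2)]
```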

16
Q

Coefficient of determination

A

r^2, lies between 0 and 1. Gives the proportion (often reported as a percentage) of variation in the observed values of the response variable that is explained by the regression line. A measure of how useful the line is for making predictions.

17
Q

extrapolation

A

using a regression line to make predictions for the response variable at values of the explanatory variable outside the range observed in the data. May result in incorrect predictions if the linear relationship does not hold beyond that range.

18
Q

regression outlier

A

a data point that falls far from the regression line relative to the other data points. These points are removed if they are the result of a measurement or recording error.

19
Q

influential observation

A

an observation that, when removed, causes the regression line and equation to change considerably. It does not have to be a regression outlier; it is typically a data point separated in the x-direction from the other data points. Removed if it is the result of a measurement or recording error.
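
A small sketch of how one point that is far away in the x-direction can change the fitted line. The data and the helper name `fit_line` are hypothetical; the last point (x = 15) plays the role of the influential observation.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line yhat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data; the last observation sits far to the right of the others.
x = [1, 2, 3, 4, 5, 15]
y = [2, 4, 5, 4, 5, 2]

a_all, b_all = fit_line(x, y)               # fit with every observation
a_drop, b_drop = fit_line(x[:-1], y[:-1])   # fit with the last observation removed
```

In this made-up data set the slope is negative with the point included and positive once it is removed, which is the kind of considerable change the card describes.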

20
Q

Simpson’s paradox

A

the direction of an association between two variables can change after a third variable is taken into account
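
A made-up numeric illustration of the paradox. The two groups stand in for the third variable; the data and the helper name `slope` are hypothetical. Within each group the association is positive, but it is negative when the groups are pooled.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data split by a third (grouping) variable.
group_a_x, group_a_y = [1, 2, 3], [8, 9, 10]
group_b_x, group_b_y = [6, 7, 8], [1, 2, 3]

slope_a = slope(group_a_x, group_a_y)    # positive within group A
slope_b = slope(group_b_x, group_b_y)    # positive within group B
slope_pooled = slope(group_a_x + group_b_x, group_a_y + group_b_y)  # negative overall
```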

21
Q

deterministic relationship

A

a relationship in which y is completely determined by the value of x; this is not a regression model

22
Q

probabilistic relationship

A

a relationship in which the value of y is related to the value of x but not all variation in y is explained by the x value.

23
Q

population regression line

A

µy=a+bx
a=population y-intercept
b=population slope

describes how the population mean of each conditional distribution for the response variable (y) depends on the value of the explanatory variable (x); the model also describes the variability of y observations around that mean

24
Q

Sample regression line

A

the usual regression line (yhat=a+bx), which describes the relationship between x and the estimated means of y at various values of x. It is different for each sample.

25
Q

Purpose of sum of squares in regression models

A

quantify explained and unexplained variability in regression

26
Q

total sum of squares

A

shows total variation

27
Q

regression sum of squares

A

shows explained variation

28
Q

error sum of squares

A

shows unexplained variation

29
Q

regression identity

A

SST=SSR+SSE

r^2=SSR/SST
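
A quick numeric check of the identity, assuming the usual definitions: SST sums squared distances from y to ybar, SSR from yhat to ybar, and SSE from y to yhat. The data are hypothetical.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

r_squared = ssr / sst
```

For these values SST is 6, SSR is 3.6 and SSE is 2.4, so r^2 is 0.6 (up to floating-point rounding).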

30
Q

what type of variation does a residual quantify?

A

unexplained

31
Q

What information can we get from the distance between ybar, the regression line and the point on a graph?

A

if the distance from ybar to the regression line (explained) is greater than the distance from the point to the regression line (unexplained), we have more explained variability and r^2 will be larger; if it is smaller, r^2 will be smaller

32
Q

purpose of a regression t-test

A

used to determine if x is useful in predicting y

33
Q

What are the options for the null hypotheses in a regression t-test?

A
  1. B=0
  2. Model is not useful in predicting y
  3. variables are independent
34
Q

What are you testing in a regression t-test?

A

whether the regression equation or the mean of y gives a better prediction for y given a value of x
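
A sketch of the test statistic for the slope, assuming the standard simple linear regression formulas t = b / se(b), se(b) = s / sqrt(Sxx), and s = sqrt(SSE / (n - 2)). The data and the function name `slope_t_statistic` are hypothetical. The p-value would come from a t distribution with n - 2 degrees of freedom, which is not computed here.

```python
import math

def slope_t_statistic(x, y):
    """Return (t, df) for testing H0: slope = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # residual standard deviation
    se_b = s / math.sqrt(sxx)      # standard error of the slope
    return b / se_b, n - 2

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat, df = slope_t_statistic(x, y)
```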

35
Q

what does B=0 indicate conceptually?

A

x is not useful for predicting y: ybar predicts y just as well as the regression equation at any value of x

36
Q

What does B=0 indicate graphically?

A

the regression line is horizontal, crossing the y-axis at the mean of y (ybar)

37
Q

standardized/studentized residuals

A

measures how many standard errors each observation is from the regression line

38
Q

how do you use standardized/studentized residuals to find outliers?

A

make histograms and boxplots of the standardized residuals to find outliers; see which observations are more than 3 standard errors from the regression line
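
A rough sketch of the 3-standard-error rule. It assumes one common definition of a standardized (internally studentized) residual for simple linear regression: e_i / (s * sqrt(1 - h_i)), with leverage h_i = 1/n + (x_i - xbar)^2 / Sxx. A course or software package may use a slightly different definition. The data and the function name `standardized_residuals` are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard errors (simple regression)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    result = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result

# Hypothetical data: y = 2x everywhere except index 7, which is shifted up by 10.
x = list(range(1, 16))
y = [2, 4, 6, 8, 10, 12, 14, 26, 18, 20, 22, 24, 26, 28, 30]

z = standardized_residuals(x, y)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 3]   # indices of likely outliers
```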

39
Q

most important simple linear regression assumption

A

linear relationship between explanatory and response variables

40
Q

residual standard deviation definition

A

aka standard error of the estimate; used to estimate the common standard deviation of the conditional distributions of y

41
Q

residual std interpretation

A

indicates, on average, how much the predicted values of the response variable, yhat, differ from the observed values of the response variable, y

42
Q

confidence interval for population regression line

A

used to estimate the mean of y for all observations that have a particular value of x. Gives a range of plausible values for the mean. Same assumptions as simple linear regression.

43
Q

confidence interval for y

A

also called a prediction interval. Used to estimate the value of y for an individual who has a particular value of x; gives a range of plausible values for a randomly selected subject. Same assumptions as simple linear regression.

44
Q

reasons to use multiple regression

A
  1. make better predictions by using several explanatory variables at once
  2. consider simultaneous impact of predictors of interest on response
  3. the effect of an explanatory variable can change after you account for potential lurking variables
45
Q

how does multiple linear regression work?

A

you analyze the association between 2 variables while controlling/fixing the values of other variables

46
Q

similarities between multiple and simple linear regression

A
  1. Use least squares to estimate B
  2. same calculations for residuals
  3. same calculation of standard error estimate
  4. same assumptions
47
Q

differences between multiple and simple linear regression

A
  1. multiple linear regression has more model flexibility
  2. multiple regression does not fit a 2-D line to data
  3. interpretation of B changes
  4. multiple regression is more complex due to having multiple explanatory variables
  5. can answer different types of questions
48
Q

population multiple regression model

A

relates the mean value of a quantitative response variable y to a set of explanatory variables. µy=alpha+B1X1+B2X2+…+BnXn

49
Q

sample multiple regression line

A

when we substitute the values of x1, x2, …, xn, the equation gives the estimated mean of y for all subjects with those values. yhat has the same form as the population multiple regression model equation, with sample estimates in place of the parameters

50
Q

multiple regression coefficient interpretation

A

holding all other explanatory variables constant, for every one-unit increase in xn, the predicted mean of y changes by ….. units (the value of the coefficient Bn)

51
Q

multiple correlation

A

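R, describes the association between y and a set of explanatory variables (the correlation between the observed y values and the predicted yhat values). Same weak/moderate/strong categories as before, ranges from 0 to 1. Its square, R^2, is the percentage of variability in y that is explained by the regression equation.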

52
Q

R^2

A

R^2=SSR/SST. A larger value means more variability is explained by the model, meaning better predictions. Never decreases as predictors are added to the model. Does not depend on units.

53
Q

what do R^2 values of 0 and 1 indicate

A

0: all yhat=ybar
1: all of the y=yhat

54
Q

how to solve for F in a multiple regression model

A

F=MSR/MSE

55
Q

how to find degrees of freedom for a multiple regression model

A

df1=# of explanatory variables (k)

df2=n-(k+1), i.e., n minus the number of parameters in the model (k slopes plus the intercept)
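
A minimal sketch of the F calculation, assuming the standard definitions MSR = SSR / k and MSE = SSE / (n - k - 1), where k is the number of explanatory variables and n is the sample size. The function name `f_statistic` and the input values are hypothetical.

```python
def f_statistic(ssr, sse, n, k):
    """Return (F, df1, df2) for a regression with k explanatory variables."""
    df1 = k              # numerator degrees of freedom
    df2 = n - k - 1      # denominator degrees of freedom
    msr = ssr / df1      # mean square for regression
    mse = sse / df2      # mean square error
    return msr / mse, df1, df2

# Hypothetical sums of squares from a fit with n = 5 observations and k = 1 predictor.
f_stat, df1, df2 = f_statistic(ssr=3.6, sse=2.4, n=5, k=1)
```

With a single predictor, F equals the square of the regression t statistic for the slope.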

56
Q

purpose of confidence intervals in multiple regression

A

to estimate values of beta parameters and give plausible values for the parameter

57
Q

what does 0 being in the confidence interval indicate during multiple regression?

A

the explanatory variable may have no effect on the response variable when other explanatory variables are held constant

58
Q

multiple regression model assumptions

A

L.I.N.E.
Linearity: the mean of y is related to the explanatory variables in the population by the same linear form the model uses to describe it

Independence: the observations are independent of one another

Normality: normal y distribution for each setting of the explanatory variables

Equality of variance: distributions of y values have the same variance for every setting of the explanatory variables

59
Q

how to check normality in the multiple regression model

A

Check normality of residuals: make sure QQ plot is roughly linear and histogram is roughly bell shaped

60
Q

What should a scatterplot of the residuals look like when checking multiple regression assumptions?

A

fall roughly in a horizontal band that is centered about the x-axis and not exhibit any curvature or pattern

61
Q

What do residual plots that only violate the linearity assumption look like?

A

residuals show a linear or curved trend

62
Q

What does a residual plot that only violates the equal variance assumption look like?

A

fanning shape

63
Q

what does a residual plot that violates the linearity and equal variance assumption look like?

A

linear/curved WITH fanning

64
Q

What must each graph look like to pass each assumption of multiple regression?

A

Linearity: scatter plot shows linear relationship, residual plot shows no curve/pattern

Independence: assume this is not violated

Normality: normal probability plot shows a straight line

Equal variance: residual plot shows no fanning or pattern

65
Q

How do you know if a new model is better in multiple regression?

A
  1. Adjusted R^2 increases
  2. F-stat increases
  3. MSE decreases
  4. Fewer variables in the model aren't useful
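
A short sketch of the first criterion, assuming the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k is the number of predictors. The R^2 values, sample size and function name `adjusted_r_squared` are made up for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical comparison: the larger model raises R^2 only slightly.
small_model = adjusted_r_squared(r_squared=0.60, n=30, k=2)
large_model = adjusted_r_squared(r_squared=0.61, n=30, k=5)

# True when the extra predictors do not justify themselves under this criterion.
prefer_small = small_model > large_model
```

Here R^2 goes up when predictors are added, but adjusted R^2 goes down, so by this criterion the larger model is not better.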