Module 6 Flashcards
scatterplots
plots that graph bivariate, quantitative data. The horizontal axis has the explanatory variable and the vertical axis has the response variable. Each observation is plotted as a point. These plots are used to help us visualize relationships between two variables
Association
used to describe a relationship between 2 variables. There is positive, negative and no association.
correlation
measure of the strength of linear association between 2 variables
What does a curve in a scatterplot indicate?
no linear relationship, measuring correlation is not appropriate.
Properties of r
- independent of units
- r is the same regardless of which variable is treated as explanatory and which as response
- Magnitude determines strength of relationship
- Falls between -1 and 1
- Sign determines type of relationship (positive or negative)
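The properties of r listed above can be checked numerically. This is a minimal sketch, assuming made-up height and weight data; `pearson_r` is a helper written for this illustration, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 72]
weights_lb = [115, 120, 140, 155, 180]

r = pearson_r(heights_in, weights_lb)

# Swapping the explanatory and response roles does not change r.
r_swapped = pearson_r(weights_lb, heights_in)

# Changing units (inches to centimeters) does not change r.
heights_cm = [h * 2.54 for h in heights_in]
r_cm = pearson_r(heights_cm, weights_lb)

print(round(r, 4), round(r_swapped, 4), round(r_cm, 4))
```

All three printed values should agree, and the result should be positive and no larger than 1.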
What plot must be analyzed before looking at a linear relationship?
a scatterplot must be analyzed to see if it makes sense to look at a linear relationship and to make sure there are no curves present
What do r values of -1, 0 and 1 indicate?
-1: perfect negative LINEAR correlation
0: no linear correlation
1: perfect positive LINEAR correlation
what do different ranges of r indicate?
0 to 0.4 or -0.4 to 0: weak positive/negative correlation
0.4 to 0.8 or -0.8 to -0.4: moderate positive/negative correlation
0.8 to 1 or -1 to -0.8: strong positive/negative correlation
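The ranges above can be written as a small lookup. This is only a sketch: the card does not say which category the exact cutoffs 0.4 and 0.8 belong to, so assigning them to the higher category is an assumption made here, and `describe_r` is a hypothetical helper name.

```python
def describe_r(r):
    """Label the strength and direction of a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    size = abs(r)
    # Assumption: the cutoffs 0.4 and 0.8 fall in the higher category.
    if size < 0.4:
        strength = "weak"
    elif size < 0.8:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.2, -0.5, 0.95):
    print(value, describe_r(value))
```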
regression equation and variable meanings
yhat=a+bx
a=y-intercept
b=slope
x=explanatory variable
yhat=predicted mean value of response variable
How to interpret slope
the slope equals the amount that the predicted mean value of the response variable (y) changes when the explanatory variable (x) increases by one unit
residual definition
difference between observed and predicted values. Residual=observed-predicted
What do residuals represent graphically?
the vertical distance each point lies from the regression line
How many residuals does each observation have?
1
What are properties of residuals if the observation is larger and if it is smaller than the predicted value?
Larger: Positive residual value, will lie above line
Smaller: Negative residual value, will lie below line
Which regression line fits a data set best?
the line that minimizes the sum of the squared residuals (also called the sum of squared errors)
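A minimal sketch of the least-squares idea, assuming a small made-up data set: it computes a and b with the usual closed-form formulas, lists the residuals (observed minus predicted), and checks that nudging the line away from the fitted one makes the sum of squared residuals larger. The function names are illustrative only.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed - predicted)^2 for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

best = sum_squared_residuals(a, b, xs, ys)
shifted = sum_squared_residuals(a + 0.5, b, xs, ys)   # intercept moved
tilted = sum_squared_residuals(a, b + 0.1, xs, ys)    # slope moved

print(a, b, residuals)
print(best, shifted, tilted)
```

Positive residuals correspond to points above the fitted line and negative residuals to points below it.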
Coefficient of determination
r^2, lies between 0 and 1, determines percentage of variation in the observed values of the response variable that is explained by the regression line, measure of usefulness for making predictions
extrapolation
using a regression line to make predictions for the response variable outside the range of the explanatory variable, may result in incorrect predictions if the linear relationship does not hold past the range of the explanatory variable
regression outlier
a data point that falls far from the regression line relative to other data points, these points are removed if they are the result of a measurement or recording error
influential observation
an observation that, when removed, causes the regression line and equation to change considerably, does not have to be a regression outlier, a data point separated in the x-direction from the other data points, removed if result of measurement or recording error
Simpson’s paradox
the direction of an association between two variables can change after a third variable is included in the analysis
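A small numeric illustration of the paradox, assuming two made-up groups: within each group the slope is positive, but when the groups are pooled and the grouping variable is ignored, the slope turns negative. `slope` is a helper defined here for illustration.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: (x values, y values) for each level of a third variable.
group_a = ([1, 2, 3], [10, 11, 12])
group_b = ([6, 7, 8], [3, 4, 5])

slope_a = slope(*group_a)
slope_b = slope(*group_b)

# Pool the groups, ignoring the third variable.
pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]
slope_pooled = slope(pooled_x, pooled_y)

print(slope_a, slope_b, slope_pooled)
```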
deterministic relationship
a relationship in which y is completely determined by the value of x, not a regression model
probabilistic relationship
a relationship in which the value of y is related to the value of x but not all variation in y is explained by the x value.
population regression line
µy=a+bx
a=population y-intercept
b=population slope
describes how the population mean of each conditional distribution for the response variable (y) depends on the value of the explanatory variable (x), describes variability of y observations
Sample regression line
normal regression line that describes the relationship between x and the estimated means of y at various values of x, different for each sample
Purpose of sum of squares in regression models
quantify explained and unexplained variability in regression
total sum of squares
shows total variation
regression sum of squares
shows explained variation
error sum of squares
shows unexplained variation
regression identity
SST=SSR+SSE
r^2=SSR/SST
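The identity can be verified on any least-squares fit. A minimal sketch, assuming a small made-up data set and the usual definitions SST = sum of (y - ybar)^2, SSR = sum of (yhat - ybar)^2, and SSE = sum of (y - yhat)^2:

```python
# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar
y_hats = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)          # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained variation

r_squared = ssr / sst

print(sst, ssr, sse, r_squared)
```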
what type of variation does a residual quantify?
unexplained
What information can we get from the distance between ybar, the regression line and the point on a graph
if the distance from ybar to the regression line is greater than the distance from the point to the regression line, we have more explained variability and r^2 will be larger and vice versa
purpose of a regression t-test
used to determine if x is useful in predicting y
What are the options for the null hypotheses in a regression t-test?
- B=0
- Model is not useful in predicting y
- variables are independent
What are you testing in a regression t-test?
whether the regression equation or the mean of y gives a better prediction for y given a value of x
what does B=0 indicate conceptually?
x does not help predict y: ybar predicts y just as well as the regression line at any value of x
What does B=0 indicate graphically?
the regression line would be a horizontal line that crosses the y-axis at the mean of y (ybar)
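A sketch of the test statistic for the null hypothesis B=0 in simple linear regression, assuming the standard formulas t = b / se(b), se(b) = s / sqrt(Sxx), and s = sqrt(SSE / (n - 2)) with n - 2 degrees of freedom. The data are made up, and the p-value is left out because it needs a t-distribution table or a statistics package.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b, se_b, t, df) for testing slope = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))      # residual standard deviation
    se_b = s / math.sqrt(sxx)         # standard error of the slope
    return b, se_b, b / se_b, n - 2

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, se_b, t, df = slope_t_statistic(xs, ys)
print(b, se_b, t, df)
```

A t value far from 0 is evidence against B=0, meaning the regression line predicts y better than ybar alone.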
standardized/studentized residuals
measures how many standard errors each observation is from the regression line
how do you use standardized/studentized residuals to find outliers?
make histograms and boxplots of the standardized residuals to find outliers, see which observations are more than 3 standard errors from the regression line
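A sketch of the 3-standard-error check for simple linear regression. Software packages differ in exactly how they define standardized and studentized residuals; the version assumed here divides each residual by s * sqrt(1 - h), where h = 1/n + (x - xbar)^2 / Sxx is the leverage of the observation. The data and function name are made up.

```python
import math

def standardized_residuals(xs, ys):
    """Residuals divided by their estimated standard errors (one common definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    result = []
    for x, e in zip(xs, residuals):
        leverage = 1 / n + (x - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

std_resid = standardized_residuals(xs, ys)

# Flag observations more than 3 standard errors from the line.
flagged = [i for i, z in enumerate(std_resid) if abs(z) > 3]
print(std_resid, flagged)
```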
most important simple linear regression assumption
linear relationship between explanatory and response variables
residual standard deviation definition
aka standard error of the estimate, used to estimate the common std of the conditional distributions of y
residual std interpretation
indicates, on average, how much the predicted values of the response variable, yhat, differ from the observed values of the response variable, y
confidence interval for population regression line
used to estimate the mean of y for all observations that have a particular value of x, gives range of plausible values for the mean, same assumptions as simple linear regression
prediction interval for y
used to estimate the value of y for an individual who has a particular value of x, gives range of plausible values for a randomly selected subject, same assumptions as simple linear regression
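The interval for the mean of y and the interval for an individual y (often called a prediction interval) share the same center, yhat, but use different standard errors. A minimal sketch, assuming the standard simple-regression formulas and made-up data. The critical value `t_star` is passed in by hand; 3.182 is the t-table value for 95% confidence with 3 degrees of freedom.

```python
import math

def interval_estimates(xs, ys, x0, t_star):
    """Return (yhat, interval for mean of y, interval for an individual y) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    y_hat = a + b * x0
    se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    se_individual = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    mean_int = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pred_int = (y_hat - t_star * se_individual, y_hat + t_star * se_individual)
    return y_hat, mean_int, pred_int

# Hypothetical data; n = 5, so df = n - 2 = 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

y_hat, mean_int, pred_int = interval_estimates(xs, ys, x0=4, t_star=3.182)
print(y_hat, mean_int, pred_int)
```

The interval for an individual is always wider, because a single observation varies around the mean in addition to the uncertainty in estimating the mean itself.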
reasons to use multiple regression
- make better predictions by using several explanatory variables at once
- consider simultaneous impact of predictors of interest on response
- the effect of an explanatory variable can change after you account for potential lurking variables
how does multiple linear regression work?
you analyze the association between 2 variables while controlling/fixing the values of other variables
similarities between multiple and simple linear regression
- Use least squares to estimate B
- same calculations for residuals
- same calculation of standard error estimate
- same assumptions
differences between multiple and simple linear regression
- multiple linear regression has more model flexibility
- multiple regression does not fit a 2-D line to data
- interpretation of B changes
- multiple regression is more complex due to having multiple explanatory variables
- can answer different types of questions
population multiple regression model
relates the mean value of a quantitative response variable y to a set of explanatory variables. µy=alpha+B1X1+B2X2+…+BnXn
sample multiple regression line
when we substitute the values of x1, x2,…xn, the equation gives the estimated mean of y for all subjects with those values. yhat=same form as the population multiple regression model equation, with coefficients estimated from the sample
multiple regression coefficient interpretation
holding all other variables constant, for every one-unit increase in xn, the predicted mean of y increases by ….. units
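A sketch of a least-squares fit with two explanatory variables, using only the standard library. It solves the normal equations by Gaussian elimination, which is one way to get the estimates and is not necessarily how statistical software does it. The data are generated exactly from y = 1 + 2*x1 + 3*x2 (made-up values), so the fit should recover those coefficients. All function names are illustrative.

```python
def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [value] for row, value in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [a, b1, ..., bk] from the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 is for the intercept
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: each row is (x1, x2).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 2)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

a, b1, b2 = fit_multiple_regression(rows, ys)

def predict(x1, x2):
    return a + b1 * x1 + b2 * x2

# Holding x2 fixed at 4, raising x1 by one unit changes the prediction by b1.
change = predict(4, 4) - predict(3, 4)
print(a, b1, b2, change)
```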
multiple correlation
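R, the correlation between the observed y values and the predicted yhat values, describes association between y and a set of explanatory variables, same weak/moderate/strong categories as before, ranges from 0 to 1. Its square, R^2, gives the percentage of variability in y that is explained by the regression equation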
R^2
R^2=SSR/SST, larger value equals more variability being explained by the model, meaning better predictions. Increases as predictors are added to the model, does not depend on units
what do R^2 values of 0 and 1 indicate
0: all yhat=ybar
1: all of the y=yhat
how to solve for F in a multiple regression model
F=MSR/MSE
how to find degrees of freedom for a multiple regression model
df1=# of explanatory variables
df2=n-# of parameters in the model (explanatory variables plus the intercept)
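A minimal sketch of the F calculation, assuming MSR = SSR / df1 and MSE = SSE / df2, with df1 equal to the number of explanatory variables and df2 equal to n minus the number of estimated parameters (explanatory variables plus the intercept). The sums of squares below are made-up inputs, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(ssr, sse, n, num_predictors):
    """Return (F, df1, df2) for the overall regression F test."""
    df1 = num_predictors
    df2 = n - (num_predictors + 1)  # n minus number of estimated parameters
    msr = ssr / df1                 # mean square for regression
    mse = sse / df2                 # mean square error
    return msr / mse, df1, df2

# Hypothetical sums of squares from a fit with one predictor and n = 5.
F, df1, df2 = f_statistic(ssr=3.6, sse=2.4, n=5, num_predictors=1)
print(F, df1, df2)
```

With a single explanatory variable, F equals the square of the t statistic for the slope.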
purpose of confidence intervals in multiple regression
to estimate values of beta parameters and give plausible values for the parameter
what does 0 being in the confidence interval indicate during multiple regression?
the explanatory variable may have no effect on the response variable when other explanatory variables are held constant
multiple regression model assumptions
L.I.N.E.
Linearity: the mean of y is a linear function of the explanatory variables, so the model form used on the data matches the relationship in the population
Independence
Normality: normal y distribution for each setting of the explanatory variable
Equality of variance: distribution of y values have the same variance for every setting of the explanatory variables
how to check normality in the multiple regression model
Check normality of residuals: make sure QQ plot is roughly linear and histogram is roughly bell shaped
What should a scatterplot of the residuals look like when checking multiple regression assumptions?
fall roughly in a horizontal band that is centered about the x-axis and not exhibit any curvature or pattern
What do residual plots that only violate the linearity assumption look like?
linear or curved
What does a residual plot that only violates the equal variance assumption look like?
fanning shape
what does a residual plot that violates the linearity and equal variance assumption look like?
linear/curved WITH fanning
What must each graph look like to pass each assumption of multiple regression?
Linearity: scatter plot shows linear relationship, residual plot shows no curve/pattern
Independence: assume this is not violated
Normality: normal probability plot shows a straight line
Equal variance: residual plot shows no fanning or pattern
How do you know if a new model is better in multiple regression?
- Adjusted R^2 increases
- F-stat increases
- MSE decreases
- Fewer of the variables in the model aren't useful
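A sketch of why adjusted R^2 is used for comparing models, assuming the usual formula adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1), where k is the number of explanatory variables. The R^2 values below are made up: adding a third predictor raises R^2 slightly but lowers adjusted R^2, which suggests the extra variable is not useful.

```python
def adjusted_r_squared(r_squared, n, num_predictors):
    """Adjusted R^2, which penalizes adding explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - num_predictors - 1)

n = 30  # hypothetical sample size

# Hypothetical fits: the second model adds one predictor.
adj_two_predictors = adjusted_r_squared(0.600, n, num_predictors=2)
adj_three_predictors = adjusted_r_squared(0.605, n, num_predictors=3)

print(adj_two_predictors, adj_three_predictors)
```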