Module 6 Flashcards
scatterplots
plots that graph bivariate, quantitative data. The horizontal axis shows the explanatory variable and the vertical axis shows the response variable. Each observation is plotted as a point. These plots help us visualize the relationship between two variables.
Association
used to describe a relationship between 2 variables. An association can be positive or negative, or there can be no association.
correlation
measure of the strength of linear association between 2 variables
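A minimal Python sketch of how r can be computed from paired data. The helper name pearson_r and the study-hours data are made up for illustration and are not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (explanatory) and exam score (response)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

r = pearson_r(hours, scores)
print(round(r, 3))  # about 0.98, a strong positive linear association
```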
What does a curve in a scatterplot indicate?
no linear relationship, measuring correlation is not appropriate.
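A small sketch of why r is misleading for curved data. The data are invented: y is exactly x squared, so the two variables are perfectly related, yet r comes out as 0 because the relationship is not linear. The helper pearson_r is a hypothetical name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented curved data: a U-shaped pattern centered on x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # r is 0 even though y is completely determined by x
```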
Properties of r
- independent of units
- r is the same no matter which of the two variables is treated as the explanatory variable and which as the response variable
- Magnitude determines strength of relationship
- Falls between -1 and 1
- Sign determines type of relationship (positive or negative)
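The first two properties can be checked numerically. This is a sketch with made-up height and weight data; the helper name pearson_r is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: heights in inches and weights in pounds
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]

# Same heights expressed in centimeters (a change of units)
heights_cm = [h * 2.54 for h in heights_in]

r_original = pearson_r(heights_in, weights_lb)
r_converted = pearson_r(heights_cm, weights_lb)  # units changed
r_swapped = pearson_r(weights_lb, heights_in)    # roles of x and y swapped

print(r_original, r_converted, r_swapped)  # all three agree up to rounding
```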
What plot must be analyzed before looking at a linear relationship?
a scatterplot must be analyzed to see if it makes sense to look at a linear relationship and to make sure there are no curves present
What do r values of -1, 0 and 1 indicate?
-1: perfect negative LINEAR correlation
0: no linear correlation
1: perfect positive LINEAR correlation
what do different ranges of r indicate?
0 to 0.4 or -0.4 to 0: weak positive/negative correlation
0.4 to 0.8 or -0.8 to -0.4: moderate positive/negative correlation
0.8 to 1 or -1 to -0.8: strong positive/negative correlation
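These ranges can be written as a small lookup. This is only a sketch: the function name describe_r is hypothetical, and the card does not say which label applies exactly at 0.4 or 0.8, so placing those boundary values in the stronger category is an assumption.

```python
def describe_r(r):
    """Label a correlation using the cutoffs 0.4 and 0.8 from this card."""
    if not -1 <= r <= 1:
        raise ValueError("r must fall between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    # Assumption: values exactly at a cutoff go to the stronger category
    if size < 0.4:
        strength = "weak"
    elif size < 0.8:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction}"

print(describe_r(0.35))  # weak positive
print(describe_r(-0.9))  # strong negative
```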
regression equation and variable meanings
yhat=a+bx
a=y-intercept
b=slope
x=explanatory variable
yhat=predicted mean value of response variable
How to interpret slope
the slope equals the amount that the predicted mean value of the response variable (y) changes when the explanatory variable (x) increases by one unit
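A sketch of the slope interpretation, assuming the line is fit with the usual least-squares formulas (b = Sxy / Sxx, a = mean of y minus b times mean of x). The function name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # y-intercept
    return a, b

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

a, b = fit_line(hours, scores)

def predict(x):
    """Predicted mean score for a given number of hours."""
    return a + b * x

# Raising x by one unit changes the prediction by exactly the slope
change_per_unit = predict(4) - predict(3)
print(a, b, change_per_unit)  # a is about 46.8, b and the change are about 5.6
```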
residual definition
difference between observed and predicted values. Residual=observed-predicted
What do residuals represent graphically?
the vertical distance each point lies from the regression line
How many residuals does each observation have?
1
What are properties of residuals if the observation is larger and if it is smaller than the predicted value?
Larger: Positive residual value, will lie above line
Smaller: Negative residual value, will lie below line
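A short sketch of residuals and their signs. The line yhat = 46.8 + 5.6x and the data are hypothetical values chosen for illustration.

```python
# Hypothetical regression line yhat = a + b*x
a, b = 46.8, 5.6

# Hypothetical observations: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

residuals = []
for x, y in zip(hours, scores):
    predicted = a + b * x
    residual = y - predicted  # residual = observed - predicted
    if residual > 0:
        position = "above the line"
    elif residual < 0:
        position = "below the line"
    else:
        position = "on the line"
    residuals.append(residual)
    print(f"x={x}: observed={y}, predicted={predicted:.1f}, "
          f"residual={residual:+.1f} ({position})")
```

Each observation gets exactly one residual, so the list has the same length as the data.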
Which regression line fits a data set best?
the line that minimizes the sum of the squared residuals (also called the sum of squared errors)
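A sketch comparing the sum of squared residuals for the least-squares line against a few other candidate lines. The data, the candidate lines, and the function names fit_line and sse are all hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

a, b = fit_line(hours, scores)
best = sse(a, b, hours, scores)

# Arbitrary alternative lines as (intercept, slope) pairs
candidates = [(45.0, 6.0), (46.8, 5.0), (50.0, 5.6)]
others = [sse(a_alt, b_alt, hours, scores) for a_alt, b_alt in candidates]

print(best, others)  # the least-squares line has the smallest value
```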
Coefficient of determination
r^2, lies between 0 and 1; gives the proportion (often stated as a percentage) of variation in the observed values of the response variable that is explained by the regression line; a measure of the line's usefulness for making predictions
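A sketch of computing r^2 by squaring r and reading it as a percentage. The data and the helper name pearson_r are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

r = pearson_r(hours, scores)
r_squared = r ** 2

# Roughly 96% of the variation in scores is explained by the line
print(f"{r_squared:.1%} of the variation in y is explained by x")
```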
extrapolation
using a regression line to make predictions for the response variable at values of the explanatory variable outside the range observed in the data; may result in incorrect predictions if the linear relationship does not hold past that range
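One way to guard against this is to flag any prediction made outside the observed x range. This is a sketch: the function name predict_with_check, the line, and the data are hypothetical.

```python
def predict_with_check(a, b, x, observed_xs):
    """Return (prediction, is_extrapolation) for the line yhat = a + b*x."""
    lowest, highest = min(observed_xs), max(observed_xs)
    is_extrapolation = x < lowest or x > highest
    return a + b * x, is_extrapolation

# Hypothetical line and the x values it was fit on (hours studied, 1 to 5)
a, b = 46.8, 5.6
hours = [1, 2, 3, 4, 5]

inside = predict_with_check(a, b, 3, hours)
outside = predict_with_check(a, b, 12, hours)

print(inside)   # x = 3 is within the observed range
print(outside)  # x = 12 is flagged as extrapolation
```

In this made-up example the prediction at x = 12 is about 114, which would not be believable if the score were out of 100. That is the kind of error extrapolation can produce.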
regression outlier
a data point that falls far from the regression line relative to other data points, these points are removed if they are the result of a measurement or recording error
influential observation
an observation that, when removed, causes the regression line and equation to change considerably. It does not have to be a regression outlier; it is usually a data point separated in the x-direction from the other data points. It is removed if it is the result of a measurement or recording error.
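A sketch of how one point far out in the x-direction can change the fitted line. The data are invented, and fit_line is a hypothetical helper that uses the usual least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Five points that follow a steep upward trend
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# One extra point far to the right that does not follow the trend
xs_with = xs + [15]
ys_with = ys + [12.0]

a_without, slope_without = fit_line(xs, ys)
a_with, slope_with = fit_line(xs_with, ys_with)

print(slope_without)  # about 1.96
print(slope_with)     # about 0.61, the single point flattens the line
```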
Simpson’s paradox
the direction of an association between two variables can change after a third variable is included
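A sketch of the paradox with invented data. Within each group the association is positive, but pooling the groups and ignoring the third variable (group membership) gives a negative association. The helper name pearson_r is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented data for two groups (the third variable is the group label)
group_a_x, group_a_y = [1, 2, 3], [10, 11, 12]
group_b_x, group_b_y = [6, 7, 8], [3, 4, 5]

r_a = pearson_r(group_a_x, group_a_y)   # positive within group A
r_b = pearson_r(group_b_x, group_b_y)   # positive within group B

# Pool both groups and ignore the group label
r_all = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b, r_all)  # r_all is negative although r_a and r_b are positive
```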
deterministic relationship
a relationship in which y is completely determined by the value of x, not a regression model
probabilistic relationship
a relationship in which the value of y is related to the value of x but not all variation in y is explained by the x value.
population regression line
µy=a+bx
a=population y-intercept
b=population slope
describes how the population mean of each conditional distribution of the response variable (y) depends on the value of the explanatory variable (x); the full regression model also describes the variability of y observations around those means
Sample regression line
the ordinary regression line, yhat=a+bx, computed from sample data; it describes the relationship between x and the estimated means of y at various values of x, and is different for each sample
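A simulation sketch of why the sample line differs from sample to sample. Everything here is assumed for illustration: the population line µy = 2 + 3x, the normal noise with standard deviation 1, the sample size, and the helper names fit_line and draw_sample.

```python
import random

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def draw_sample(rng, n=200):
    """Draw n points scattered around the assumed population line 2 + 3x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)  # fixed seed so the run is repeatable

a1, b1 = fit_line(*draw_sample(rng))
a2, b2 = fit_line(*draw_sample(rng))

# Two samples from the same population give two slightly different lines,
# both close to the population intercept 2 and slope 3
print(a1, b1)
print(a2, b2)
```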
Purpose of sum of squares in regression models
quantify explained and unexplained variability in regression
total sum of squares
shows the total variation of the observed y values around their mean: the sum of (y - ybar)^2 over all observations
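A sketch of how the total sum of squares splits into an explained part and an unexplained part. The split holds for the least-squares line. The data and the names fit_line, ss_explained, and ss_unexplained are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

a, b = fit_line(hours, scores)
mean_y = sum(scores) / len(scores)
predicted = [a + b * x for x in hours]

# Total variation of y around its mean
ss_total = sum((y - mean_y) ** 2 for y in scores)
# Variation explained by the line (predictions around the mean)
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
# Variation left unexplained (squared residuals)
ss_unexplained = sum((y - p) ** 2 for y, p in zip(scores, predicted))

print(ss_total, ss_explained, ss_unexplained)
# total = explained + unexplained (up to rounding)
```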