Ch. 17 Flashcards
Regression
A method that predicts values of one numerical variable from values of another numerical variable
Difference between regression and correlation
Correlation measures the strength of association between two variables, which is reflected in the amount of scatter in the data
Regression fits a line through the data to predict one variable from another and to measure how steeply one variable changes with changes in the other
Linear regression
The most common type of regression
Assumes a linear relationship between variables
Least-squares regression line
The line for which the sum of the squared deviations in Y (observed Y minus predicted Y) is smallest
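A minimal Python sketch of the least-squares idea. The data values and the function names `least_squares` and `sum_sq_dev` are illustrative assumptions, not from the cards. It computes the slope and intercept with the standard formulas, then checks that nudging the line in any direction makes the sum of squared deviations larger.

```python
from statistics import mean

def least_squares(x, y):
    """Return (intercept a, slope b) of the least-squares line Y-hat = a + b*X."""
    x_bar, y_bar = mean(x), mean(y)
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum_xy / sum_xx          # slope: change in Y per unit X
    a = y_bar - b * x_bar        # the line passes through (x_bar, y_bar)
    return a, b

def sum_sq_dev(x, y, a, b):
    """Sum of squared deviations in Y around the line a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
ss_best = sum_sq_dev(x, y, a, b)

# Any other line should do worse; try shifting the slope and the intercept.
ss_other_slope = sum_sq_dev(x, y, a, b + 0.1)
ss_other_intercept = sum_sq_dev(x, y, a + 0.1, b)
print(a, b, ss_best, ss_other_slope, ss_other_intercept)
```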
Slope (what is it?)
The slope of a linear regression is the rate of change in Y per unit X
Represented by b (the sample estimate); the population parameter is β (beta)
What is “Y-hat”?
The value of Y predicted by the regression line for a given value of X
What do predicted values of Y tell you?
They give you an estimate of the mean value of Y for all individuals for that given value of X
Residual
Observed value minus predicted value
MSresiduals
The residual mean square, SSresiduals / (n - 2); it estimates the variance of the residuals
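As a rough illustration of the two cards above, the sketch below computes residuals and MSresiduals for made-up data. The data and all variable names are assumptions for the example.

```python
from statistics import mean

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares fit
x_bar, y_bar = mean(x), mean(y)
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar

# Residual = observed value minus predicted value
y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# MSresiduals = SSresiduals / (n - 2)
ss_residuals = sum(r ** 2 for r in residuals)
ms_residuals = ss_residuals / (n - 2)
print(residuals, ms_residuals)
```

The residuals of a least-squares line with an intercept sum to zero, which makes a handy sanity check.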
Confidence bands
95% Confidence bands measure the precision of the predicted MEAN Y for each value of X
Prediction intervals
Measure the precision of the predicted SINGLE Y-values for each X (usually 95%)
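The cards do not give formulas for these two intervals. The standard textbook forms are sketched below for comparison, where X is the value at which the prediction is made and t is the two-sided 5% critical value with n - 2 degrees of freedom. The only difference is the extra "1 +" term, which is why a prediction interval is always wider than the confidence band at the same X.

```latex
\hat{Y} \pm t_{0.05(2),\,n-2}\,\mathrm{SE}

% Confidence band: precision of the predicted MEAN Y at X
\mathrm{SE}_{\text{mean}} = \sqrt{ MS_{\text{residuals}} \left( \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right) }

% Prediction interval: precision of a predicted SINGLE Y at X
\mathrm{SE}_{\text{single}} = \sqrt{ MS_{\text{residuals}} \left( 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right) }
```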
Extrapolation
The prediction of the value of a response variable outside the range of X-values in the data
Why is extrapolation a bad idea?
There is no way to know whether the relationship between X and Y holds beyond the range of the data, so predictions made there are unreliable
Degrees of Freedom for Regression?
n - 2 (two parameters, the slope and the intercept, are estimated from the data)
When can ANOVA be used in place of the t-test?
When the test is two-sided and the null hypothesized slope is ZERO
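One way to see why the two tests agree when the null slope is zero: with a single explanatory variable, the ANOVA F-ratio equals the square of the t-statistic. The sketch below checks this numerically on made-up data. The data and variable names are assumptions for the example, and no P-value is computed.

```python
import math
from statistics import mean

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares fit
x_bar, y_bar = mean(x), mean(y)
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# ANOVA approach: F = MSregression / MSresiduals
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_residuals = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ms_regression = ss_regression / 1          # 1 df for the slope
ms_residuals = ss_residuals / (n - 2)      # n - 2 df for the residuals
f_ratio = ms_regression / ms_residuals

# t-test approach with null hypothesized slope of zero
se_b = math.sqrt(ms_residuals / ss_x)
t_stat = (b - 0) / se_b

print(f_ratio, t_stat ** 2)  # the two values should agree
```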
R^2
SSregression / SStotal; measures the fraction of the variation in Y that is explained by the regression line
What does it mean when R^2 is close to 1?
It means that X explains most of the variation in Y (the Y-values are clustered tightly around the regression line, with little scatter)
What does it mean when R^2 is close to 0?
It means that X explains little of the variation in Y, and the data points are widely scattered above and below the regression line
What is the name for R^2?
Coefficient of determination
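A short sketch of the coefficient of determination computed as SSregression / SStotal on made-up data. The data and variable names are assumptions for the example. Because SStotal = SSregression + SSresiduals for a least-squares line, the same number can be obtained as 1 - SSresiduals / SStotal.

```python
from statistics import mean

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares fit
x_bar, y_bar = mean(x), mean(y)
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Sums of squares
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_residuals = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ss_regression / ss_total
r_squared_alt = 1 - ss_residuals / ss_total
print(r_squared, r_squared_alt)
```

These points lie almost exactly on a line, so the value comes out close to 1.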
Assumptions for linear regression?
- For each value of X, there is a population of possible Y-values whose mean lies on the true regression line (this is the assumption that the relationship is linear)
- At each value of X, the possible Y-values are normally distributed, with the same variance at every value of X
- At each value of X, the Y-measurements are a random sample from the population of possible Y-values
How to detect outliers?
Use a scatter plot of the data and examine it
How to reduce effect of the outlier?
Transform the data
How to detect non-linearity?
Use a scatter plot and see whether you can fit a straight line through the data well
Residual plot
A residual plot is a scatter plot of the residuals (observed Yi minus predicted Y-hat) against X, the values of the explanatory variable
How would one detect non-normality and/or unequal variance?
Inspect a residual plot. If the assumptions are met, it should show:
- A roughly symmetric cloud of points above and below the horizontal line at 0, with a higher density of points close to the line than far from it
- Little noticeable curvature moving left to right along x-axis
- Approximately equal variance of points above and below the line at all values of X
Effect of measurement error in Y in regression?
The variance of the residuals increases, and so does the sampling error of the slope estimate, but the expected slope remains the same (no bias)
Effect of measurement error in X?
Increases the variance of the residuals and causes bias in the slope estimate (on average, the estimate is closer to zero than the true slope β)
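The two measurement-error cards can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the true slope of 2, the sample size, the noise levels, and the random seed. Error added to Y should leave the estimated slope near 2, while error added to X should pull it toward zero (toward about 1 in this setup, because the error variance in X equals the variance of X itself).

```python
import random

def slope(x, y):
    """Least-squares slope of Y on X."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x

rng = random.Random(17)   # fixed seed so the run is repeatable
n = 5000
true_slope = 2.0

# Error-free X, and Y with only natural scatter around the line
x_true = [rng.gauss(0, 1) for _ in range(n)]
y_true = [true_slope * xi + rng.gauss(0, 0.5) for xi in x_true]

# Add measurement error to Y only, then to X only
y_noisy = [yi + rng.gauss(0, 1) for yi in y_true]
x_noisy = [xi + rng.gauss(0, 1) for xi in x_true]

b_clean = slope(x_true, y_true)
b_y_error = slope(x_true, y_noisy)   # still close to the true slope
b_x_error = slope(x_noisy, y_true)   # biased toward zero
print(b_clean, b_y_error, b_x_error)
```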
How to deal with non-linear relationships?
- Transformation
- Quadratic/polynomial regression
- Splines (smoothing)
Smoothing
Fitting a curve to data without specifying a formula
Limits to terms with polynomial?
Sample size should be at least 7 times the number of terms
(i.e. Keep It Simple Stupid)
Logistic regression
Tests for a relationship between a numerical explanatory variable and a binary response variable
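A bare-bones sketch of fitting a logistic regression by Newton's method, using only the standard library. The data, the function names `sigmoid` and `fit_logistic`, and the fixed iteration count are assumptions for the example. It estimates the curve only; the hypothesis test itself (for instance a likelihood-ratio test) is not shown. In practice a statistics package would normally be used.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, n_iter=25):
    """Fit P(Y = 1) = sigmoid(a + b*X) by Newton's method; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(a + b * xi) for xi in x]
        # Gradient of the log-likelihood
        g_a = sum(yi - pi for yi, pi in zip(y, p))
        g_b = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix entries (negative Hessian)
        w = [pi * (1 - pi) for pi in p]
        h_aa = sum(w)
        h_ab = sum(wi * xi for wi, xi in zip(w, x))
        h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h_aa * h_bb - h_ab ** 2
        # Newton step: solve the 2x2 system explicitly
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Made-up data: numerical X, binary Y (1s become more common as X grows)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(x, y)
p_low = sigmoid(a + b * 1)    # predicted probability of Y = 1 at X = 1
p_high = sigmoid(a + b * 8)   # predicted probability of Y = 1 at X = 8
print(a, b, p_low, p_high)
```

The example data deliberately overlap (0s and 1s are interleaved), because the fit does not converge when X separates the two outcomes perfectly.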