Correlation and Regression Flashcards
Correlation
Assesses the nature (direction) and strength of the linear association b/w variables
No dependent/independent variable structure
Linear Regression
Equation that best describes linear relationship between variables (equation of the line)
Dependence structure:
Y=dependent variable
X=independent variables (this is what we have control over)
Correlation Coefficient
Population correlation coefficient (linear association):
ρ (rho), −1 ≤ ρ ≤ +1 (See slide)
Sample correlation coefficient: (See slide)
r, −1 ≤ r ≤ +1
- r is an estimate of the population correlation coefficient
Sign indicates the nature of the r/s (positive or direct, negative or inverse)
Magnitude indicates strength:
- Values close to plus or minus 1 indicate strong linear association
- Values close to 0 indicate weak linear association
Sample Correlation Coefficient
The formula involves covariance, a raw (unscaled) measure of variability: how the 2 variables move together relative to their means.
The correlation coefficient quantifies correlation; AKA Pearson’s product-moment sample correlation coefficient, r
r, the sample correlation coefficient, is the point estimate used as an estimate of the population correlation coefficient, ρ (rho)
Involves using the variance of Pearson’s product-moment sample correlation coefficient; this variance is a function of the sample correlation coefficient and the sample size
Involves using the standard error of r
- the standard error (aka the sampling variability) of r is used to construct the test statistic for hypothesis tests and confidence intervals for the population correlation coefficient, ρ
- the n − 2 in this formula is the degrees of freedom (DOF), which also acts as the averaging constant
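The pieces of this card can be sketched in pure Python: the sample correlation built from the covariance, plus the standard error of r with its n − 2 degrees of freedom (function names are mine, not from the slides):

```python
import math

def pearson_r(x, y):
    """Sample correlation: covariance of x and y scaled by their SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

def se_r(r, n):
    """Standard error of r; n - 2 is the degrees of freedom."""
    return math.sqrt((1 - r ** 2) / (n - 2))
```

Note that the scaling by the two standard deviations is what turns the raw covariance into a unitless number between −1 and +1.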
Correlation Analysis
Hypothesis test on ρ
When testing whether ρ = 0, the test statistic is based on the t-distribution; the t value reveals how many standard error units r falls from the hypothesized value, which is 0
the numerator involves r (the sample estimate of the correlation coefficient) and the sample size; 2 is subtracted from the sample size because of the DOF
all you need to compute this test statistic is the sample size and the sample correlation coefficient
r close to zero will support the null
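Since only the sample size and r are needed, the test statistic t = r·√(n − 2) / √(1 − r²) is a one-liner (a sketch; the function name is illustrative):

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2.
    Measures how many standard error units r falls from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```

An r close to zero drives the numerator, and hence t, toward zero, which is why such an r supports the null.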
Assumptions for Pearson’s Product Moment Correlation Coefficient
Pearson’s is the most common way to measure the correlation coefficient
It is assumed that each variable follows a normal distribution.
It is further assumed that the 2 variables involved in the correlation follow a bivariate normal distribution (special form of a multivariate distribution)
SPECIFICALLY MEASURES LINEAR
Spearman’s Correlation
A common alternative to Pearson’s
An alternative nonparametric measure (aka not assuming a specific distribution) of correlation
No distributional assumptions are made on the variables
The correlation is computed exactly the same way, except that the ranks of the data values are used; what goes into the formula changes. Rank each variable and associate the ranks. FOR TIED VALUES, AVERAGE THE RANKS
Can assess linear and non-linear associations
Spearman’s Correlation Coefficient
Spearman’s sample correlation coefficient, r(S) (See Slide)
Where r(X) and r(Y) are the ranked values of X and Y
Note that r(S) is computed by applying Pearson’s formula after replacing the observed data values with their respective ranks
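A minimal sketch of this rank-then-correlate idea, including the tie rule (average the ranks of a tied block); the helper names are mine:

```python
import math

def ranks(values):
    """Assign ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the block of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rs(x, y):
    """Pearson's formula applied to the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

Because ranking preserves only order, a perfectly monotone but non-linear relationship (e.g., y = x²) still yields r(S) = 1.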
Simple Linear Regression
allows us to fit the line
Linear regression is a general statistical methodology that allows the assessment of the relationship b/w variables (usually continuous) and prediction
Linear regression assumes a dependence structure in which the level of one variable is assumed to vary linearly depending on the level of the other variable.
independent and dependent variables.
Y = dependent variable, aka the outcome or response variable
X = independent variable, aka the predictor variable or covariate
Regression assumes that the mean of Y can be related to the level of X using the equation of a line:
Y = a + bX
Where “a” is the value of Y when the line crosses the Y-axis, and “b” is the slope of the line that measures the rate of change in the mean of Y as a linear function of X
- the slope is a measure of rate of change; how much Y varies for every one unit increase in X
- slope=rise/run=b
The equation of a line is completely determined by the y-intercept and slope
These are population values or parameters of the population regression line, and are generally unknown
We must estimate these parameters from sample data
Once the slope and y-intercept are estimated, we construct the estimated or fitted regression line
The population regression line is the model for the mean response of Y; we use sample data to estimate the y-intercept and slope (the population parameters) and thereby estimate this line
Simple Linear Regression Assumptions
Linear r/s b/w X and Y; we use the slope to test the r/s b/w X and Y
Independence of errors
Homoscedasticity (constant variance aka variance that is the same) of the errors
Normality of errors
Linear regression uses an estimation method called least squares to find estimates for the slope and y-intercept
Least squares is the most common methodology for determining the estimates of the slope and y-intercept: it finds the values that minimize the sum of squared deviations of the observed data points from the line in the vertical direction, i.e., it determines the line of best fit
IN SIMPLE LINEAR REGRESSION, THE MAIN INTEREST IS IN THE SLOPE
There is a direct r/s b/w the correlation coefficient and the slope, but only in simple linear regression
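The least-squares estimates described above reduce to two closed-form expressions; a sketch (function name is mine):

```python
def least_squares_fit(x, y):
    """Least-squares estimates of the intercept a and slope b: the values
    that minimize the sum of squared vertical deviations from the line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx    # slope: change in the mean of Y per one-unit increase in X
    a = my - b * mx  # intercept: the fitted line passes through (x-bar, y-bar)
    return a, b
```

The r/s with correlation: in simple linear regression b = r · (s_Y / s_X), which is why the slope and r always share the same sign.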
Regression Analysis: Inference on the Slope
Produces predictions and the estimated regression line, which allows you to use the regression equation to compare different variables. The slope can be interpreted as the rate of change in the mean of Y for a one-unit increase in X
Involves:
Standard error of the estimated slope (the thing we want to know the most about; the most important part of the line. We don’t know the true slope, so we have to use the estimated slope)
it is a function of both the population SD and the total variability in X
sigma or the standard deviation is usually unknown
to estimate sigma, we use our best estimate of random error: the MSE from the ANOVA table
The standard error of the slope is required for hypothesis test or confidence intervals involving the slope
The slope is the most important parameter in the regression equation as it measures the expected rate of change in the dependent variable for a one unit increase in the independent variable
Quantifies the r/s b/w X and Y
The primary inference goal of regression is to determine if the slope is significantly different from zero
This is carried out by testing the following hypotheses (see slide 52)
- Beta 1 (β1) is the population slope
- zero is the hypothesized value of the slope under the null
2 ways to carry out the test:
- T-test
- F-test
The F-test statistic uses MS regression
the rejection rule is: reject the null if F is greater than or equal to the critical F value (a significant F will be well above 1)
MS regression is the estimate of that part of the total variability in the response that can be explained by the linear association between X and Y
The results from a regression analysis are usually presented in the form of an ANOVA table
The analysis of variance can be thought of as a special case of regression analysis
In regression, the 2 sources of variability being analyzed are the variance in the response explained by the linear association of Y and X, and the variance not explained.
The latter is often referred to as the “residual” variance and is denoted by the mean squared residual, or MSE
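Putting the slope-inference pieces together (MSE as the estimate of sigma², SE(b) = √(MSE / Sxx), and the t statistic b / SE(b) with n − 2 df); the function name is illustrative:

```python
import math

def slope_inference(x, y):
    """Estimated slope b, its standard error SE(b) = sqrt(MSE / Sxx),
    and the t statistic for testing H0: slope = 0 (df = n - 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)          # estimates sigma^2, the error variance
    se_b = math.sqrt(mse / sxx)  # standard error of the estimated slope
    return b, se_b, b / se_b
```

In simple linear regression the two tests agree: the F statistic from the ANOVA table equals this t statistic squared.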
Regression Analysis: Inference on the Slope
What does MSE measure in a regression model?
The average of the squared deviations of the individual observed response values from those predicted by the regression model
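The ANOVA decomposition behind this card can be sketched directly: partition the total variability in y into the explained (regression) and residual parts, then form MSE and the F ratio (function name is mine):

```python
def regression_anova(x, y):
    """Partition total variability in y: the part explained by the fitted
    line (MS regression) vs. the residual part (MSE), and the F ratio."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    fitted = [a + b * xi for xi in x]
    ssr = sum((f - my) ** 2 for f in fitted)              # explained sum of squares
    sse = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # residual sum of squares
    msr = ssr / 1        # regression df = 1 in simple linear regression
    mse = sse / (n - 2)  # average squared deviation of observed y from the line
    return msr, mse, msr / mse
```

MSE here is literally the sum of squared residuals divided by n − 2, which is the "average squared deviation" the flashcard describes.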