Correlation and Regression Flashcards

1
Q

Correlation

A

Assesses the nature (direction) and strength of the linear association between two variables

No dependent/independent variable structure; the two variables are treated symmetrically

2
Q

Linear Regression

A

The equation that best describes the linear relationship between two variables (the equation of the line)

Dependence structure:
Y = dependent variable
X = independent variable (this is what we have control over)

3
Q

Correlation Coefficient

A

Population correlation coefficient (linear association):
ρ, with −1 ≤ ρ ≤ +1 (see slide)

Sample correlation coefficient (see slide):
r, with −1 ≤ r ≤ +1
-is an estimate of the population correlation

Sign indicates the nature of the relationship (positive or direct, negative or inverse)

Magnitude indicates strength:

  • Values close to ±1 indicate strong linear association
  • Values close to 0 indicate weak linear association
4
Q

Sample Correlation Coefficient

A

The formula involves covariance, which is a raw measure of variability: how the two variables move together relative to their means.

The correlation coefficient quantifies correlation; it is also known as Pearson's product-moment sample correlation coefficient, r.

r, the sample correlation coefficient, is the point estimate used to estimate the population correlation coefficient, ρ (rho).

The variance of Pearson's product-moment sample correlation coefficient is involved; this variance is a function of the sample correlation coefficient and the sample size.

The standard error of r is also involved:

  • the standard error (sampling variability) of r is used to construct the test statistic for hypothesis tests and confidence intervals for the population correlation coefficient, ρ
  • the n − 2 in this formula is the degrees of freedom (DOF), aka the averaging constant
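Not from the slides: a minimal Python sketch of the computation described above, using made-up data. It builds r from the sample covariance and the two sample standard deviations, then the standard error SE(r) = √((1 − r²)/(n − 2)).

```python
import math

# Illustrative data (made up), roughly y ≈ 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Covariance: how the two variables move together relative to their means
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))

r = cov / (sx * sy)                        # Pearson's product-moment r
se_r = math.sqrt((1 - r ** 2) / (n - 2))   # standard error of r; n - 2 = DOF
```

Here r comes out close to +1, since the made-up data are nearly perfectly linear.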
5
Q

Correlation Analysis

A

Hypothesis test on ρ

When testing whether ρ = 0, the test statistic is based on the t-distribution. The t value reveals how many standard-error units r falls from the hypothesized value, which is 0:

t = r·√(n − 2) / √(1 − r²), with n − 2 degrees of freedom

The numerator involves r (the sample correlation coefficient, our estimate of ρ) and the sample size; 2 is subtracted from the sample size because of the degrees of freedom.

All you need to compute this test statistic is the sample size and the sample correlation coefficient.

An r close to zero will support the null.
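A sketch of the test above in Python (values are made up; the critical value 2.048 is the tabled two-sided t cutoff for α = 0.05 with 28 DOF, hardcoded here rather than looked up in code):

```python
import math

# Only n and r are needed, as the card notes
n, r = 30, 0.45

# t statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_crit = 2.048  # tabled t(0.975, df = 28), two-sided alpha = 0.05
reject_null = abs(t) >= t_crit  # True here: r = 0.45 is far enough from 0
```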

6
Q

Assumptions for Pearson’s Product Moment Correlation Coefficient

A

Pearson’s is the most common correlation coefficient.

It is assumed that each variable follows a normal distribution.

It is further assumed that the 2 variables involved in the correlation jointly follow a bivariate normal distribution (a special form of a multivariate distribution).

SPECIFICALLY MEASURES LINEAR association

7
Q

Spearman’s Correlation

A

A common alternative to Pearson’s

A nonparametric measure of correlation (i.e., it does not assume a specific distribution)

No distributional assumptions are made on the variables

The correlation is computed exactly the same way, except that the ranks of the data values are used: what goes into the formula is different. You rank each variable and associate each value with its rank. FOR TIED VALUES, AVERAGE THE RANKS.

Can assess monotonic associations, whether linear or non-linear

8
Q

Spearman’s Correlation Coefficient

A

Spearman’s sample correlation coefficient, r(S) (see slide)

where r(X) and r(Y) are the ranked values of X and Y.

Note that r(S) is computed by applying Pearson’s formula after replacing the observed data values with their respective ranks.
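Not from the slides: a small Python sketch of exactly that recipe: rank each variable (averaging ranks for ties), then feed the ranks into Pearson’s formula. Data are made up; y is monotone in x but non-linear, so r(S) comes out as 1.

```python
import math

def ranks(values):
    """1-based ranks; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rk = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            rk[order[k]] = avg
        i = j + 1
    return rk

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]              # monotone but non-linear in x
r_s = pearson(ranks(x), ranks(y))  # Spearman's r(S): Pearson's formula on ranks
```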

9
Q

Simple Linear Regression

A

Allows us to fit the line.

Linear regression is a general statistical methodology for assessing the relationship between variables (usually continuous) and for prediction.

Linear regression assumes a dependence structure in which the level of one variable varies linearly depending on the level of the other variable.

Independent and dependent variables:

Y = dependent variable, aka outcome or response variable
X = independent variable, aka predictor or covariate

Regression assumes that the mean of Y can be related to the level of X using the equation of a line:
Y = a + bX

where “a” is the value of Y when the line crosses the Y-axis, and “b” is the slope of the line, which measures the rate of change in the mean of Y as a linear function of X.

  • the slope is a measure of rate of change: how much Y varies for every one-unit increase in X
  • slope = rise/run = b

The equation of a line is completely determined by the y-intercept and slope.

These are population values, or parameters of the population regression line, and are generally unknown.

We must estimate these parameters from sample data.

Once the slope and y-intercept are estimated, we construct the estimated (fitted) regression line.

The population regression line is the model for the mean response of Y, so we use sample data to estimate the y-intercept and slope (the population parameters) and thereby estimate this line.
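A minimal Python sketch (data made up) of estimating the slope b and y-intercept a from a sample, using the standard least-squares formulas b = Sxy/Sxx and a = ȳ − b·x̄:

```python
# Illustrative data, roughly y ≈ 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxx = sum((xi - xbar) ** 2 for xi in x)

b = Sxy / Sxx        # slope: change in the mean of Y per one-unit increase in X
a = ybar - b * xbar  # y-intercept: fitted value of Y at X = 0

# Fitted regression line: Y-hat = a + b * X
```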

10
Q

Simple Linear Regression Assumptions

A

Linear relationship between X and Y; we use the slope to test the relationship between X and Y

Independence of the errors

Homoscedasticity (constant variance) of the errors

Normality of the errors

Linear regression uses an estimation method called least squares to find estimates for the slope and y-intercept.

Least squares is the most common method for determining the estimates of the slope and y-intercept: it minimizes the sum of squared deviations of the observed data points from the line in the vertical direction, and so determines the line of best fit.

IN SIMPLE LINEAR REGRESSION THE MAIN INTEREST IS THE SLOPE

There is a relationship between the correlation coefficient and the slope, but only in simple linear regression.
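That relationship is b = r · (s_Y / s_X) in simple linear regression. A quick Python check of the identity on made-up data:

```python
import math

# Illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 2.1, 2.8, 4.3, 4.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx                      # least-squares slope
r = Sxy / math.sqrt(Sxx * Syy)     # Pearson's r
sy_over_sx = math.sqrt(Syy / Sxx)  # ratio of sample SDs (the n - 1 cancels)

# The identity b = r * (s_Y / s_X) holds exactly
assert abs(b - r * sy_over_sx) < 1e-9
```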

11
Q

Regression Analysis: Inference on the Slope

A

Produces predictions and the estimated regression line, which allows you to use the regression equation to compare different variables. The slope can be interpreted as the rate of change in the mean of Y for a one-unit increase in X.

Involves:
The standard error of the estimated slope (the quantity we want to know the most about, the most important part of the line; we don’t know the true slope, so we have to use the estimated slope)

It is a function of both the population SD and the total variability in X.

Sigma, the standard deviation, is usually unknown.

To estimate sigma, we use our best estimate of random error: the MSE from the ANOVA table.

The standard error of the slope is required for hypothesis tests or confidence intervals involving the slope.

The slope is the most important parameter in the regression equation, as it measures the expected rate of change in the dependent variable for a one-unit increase in the independent variable.

It quantifies the relationship between X and Y.

The primary inference goal of regression is to determine whether the slope is significantly different from zero.

This is carried out by testing the following hypotheses (see slide 52):

  • β₁ is the population slope
  • zero is the hypothesized value of the slope under the null (H₀: β₁ = 0)

2 ways to carry out the test:

  • t-test
  • F-test

The F-test statistic uses MS regression.

The rejection rule: reject the null if the observed F is greater than or equal to the critical value from the F distribution.

MS regression is the estimate of the part of the total variability in the response that can be explained by the linear association between X and Y.

The results from a regression analysis are usually presented in the form of an ANOVA table.

The analysis of variance can be thought of as a special case of regression analysis.

In regression, the 2 sources of variability being analyzed are the variance in the response explained by the linear association of Y and X, and the variance not explained.

The latter is often referred to as the “residual” variance and is denoted by the Mean Squared Error (MSE), aka the residual mean square.
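A Python sketch (made-up data) tying the pieces together: MSE estimates σ², SE(b) = √(MSE/Sxx), t = b/SE(b) with n − 2 DOF, and in simple linear regression the F statistic (MS regression / MSE) equals t².

```python
import math

# Illustrative data with an upward trend
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.2, 4.8, 6.1, 6.6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx            # estimated slope
a = ybar - b * xbar      # estimated y-intercept

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in resid)
mse = sse / (n - 2)      # residual mean square: our estimate of sigma^2

se_b = math.sqrt(mse / Sxx)  # standard error of the slope
t = b / se_b                 # t statistic for H0: beta1 = 0, df = n - 2

ssr = b * Sxy                # regression sum of squares (1 df), so MS regression = ssr
f = ssr / mse                # F statistic; in simple linear regression F = t^2
```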

12
Q

Regression Analysis: Inference on the Slope

What does MSE measure in a regression model?

A

The average squared deviation of the individual observed response values from those predicted by the regression model
