Chapter 9 simple linear regression/correlation slide 13 onwards Flashcards
3 assumptions when doing Pearson correlation analysis
- X and Y are quantitative data
- Variables X and Y are simple random variables
- Pairs of X and Y follow the bivariate normal distribution
- individual variables are normally distributed
- r is sensitive to outliers
Strong positive relationship correlation coefficient
Close to +1
Negative linear relationship correlation coefficient
Close to -1
No linear relationship correlation coefficient
Close to zero
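As a rough illustration of the three cases above, here is a minimal sketch that computes Pearson's r by hand. The function name `pearson_r` and the data are made up for this example and are not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises exactly in line with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls exactly in line with x
r_none = pearson_r(x, [2, 1, 3, 1, 2])   # symmetric pattern, no linear trend
```

With this data, `r_pos` comes out close to +1, `r_neg` close to -1, and `r_none` close to zero.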
Hypothesis testing for correlation
ρ (represents the correlation coefficient of the population)
Null hypothesis: ρ = 0
Alternative hypothesis: ρ ≠ 0
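The cards do not give the test statistic, so the sketch below assumes the usual t-test for a correlation, t = r × sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The function name and the values r = 0.6, n = 27 are hypothetical, and the critical value is read from a standard t table.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 pairs")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = 0.6 observed from n = 27 pairs
t_stat = correlation_t_statistic(0.6, 27)
degrees_of_freedom = 27 - 2

# 2.060 is the two-sided 5% critical value of t with 25 df (standard t table)
reject_null = abs(t_stat) > 2.060
```

Here the statistic exceeds the critical value, so under these assumptions the null hypothesis ρ = 0 would be rejected at the 5% level.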
Often, coefficient of determination is expressed as
Proportion or percentage
What does it mean when coefficient of determination r^2 is 0.85
85% of the variation in y is explained by its linear relationship with x (this alone does not show that x causes the change in y)
The ______ the absolute value of the correlation coefficient, the ______ the coefficient of determination. This implies that _________ of the variation in the dependent variable is explained by the independent variable
larger, larger, more
interpretation when r^2 is zero
no correlation
interpretation when r^2 is up to 49%
low correlation (need to consider the sign)
interpretation when r^2 is 50-95%
high correlation, need to consider the sign
interpretation when r^2 is 96-99%
very high correlation, need to consider the sign
interpretation when r^2 is 100%
perfect correlation (need to consider the sign)
2 methods for determining correlation for non-normally distributed data
logarithmic transformation and non-parametric correlation analysis
what is non-parametric correlation analysis
Rank the data from smallest to largest, then calculate the correlation coefficient from the ranks using Kendall’s or Spearman’s rank correlation
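The ranking idea can be sketched as follows for Spearman's version: rank each variable, then compute Pearson's r on the ranks. Giving tied values their average rank is an assumption here (a common convention, not stated on the cards), and the function names and data are made up.

```python
import math

def average_ranks(values):
    """Rank from smallest (rank 1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up data: y = x**3 is non-linear but always increasing
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
rho = spearman_rho(x, y)
tied_ranks = average_ranks([10, 20, 20, 30])
```

Because y always increases with x, the ranks agree exactly and `rho` is close to 1 even though the relationship is not a straight line.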
5 properties of simple linear regression
- relationship between 2 variables is approximated by a straight line (the best-fit line)
- one independent and one dependent variable
- aka regression analysis
- forms a simple equation to describe relationship
- intercept: alpha, slope: beta
General equation for simple linear regression
y’=beta x + alpha
What is Y-Y’
errors of prediction
What is Y’
predicted values
5 things needed to calculate linear regression
- mean of X
- mean of Y
- standard dev of X
- standard dev Y
- r (correlation between x and y)
Formula for slope
beta = r × (standard dev of y) / (standard dev of x)
Formula for intercept
alpha= mean of y- beta(mean of x)
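The slope and intercept formulas can be sketched in a few lines using the five quantities listed earlier. The data and the function name `fit_line` are made up for illustration, and sample (n - 1) standard deviations are assumed.

```python
import statistics

def fit_line(xs, ys):
    """Return (alpha, beta) for y' = beta * x + alpha from means, SDs and r."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sd_x = statistics.stdev(xs)  # sample standard deviation (n - 1)
    sd_y = statistics.stdev(ys)
    # Pearson r from the sample covariance and the two standard deviations
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = cov_xy / (sd_x * sd_y)
    beta = r * sd_y / sd_x            # slope
    alpha = mean_y - beta * mean_x    # intercept
    return alpha, beta

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
alpha, beta = fit_line(x, y)
predicted_at_6 = beta * 6 + alpha  # predicted value y' for x = 6
```

For this data the slope works out to about 0.6 and the intercept to about 2.2.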
3 potential factors that affect correlation and regression
Outliers, multiple observations from the same subject, combining data collected from different populations
2 effects of outliers on regression line
- over influence the regression line
- increase the residual error and reduce the correlation
An outlier at the high end of the distribution affects the correlation coefficient _________ than outliers that do not lie at the high end of the distribution
more
2 methods to identify outliers
Examine scatter plot or examine residual plot
Explain more about the scatter plot and how it can be used to observe outliers
- linear regression takes all the points into account to derive the best fit line
- the vertical distance from a point to the regression line is the error of prediction for that point
- small distance: error of prediction is small
- large distance: error of prediction is large (a possible outlier)
1 commonly used criterion for best fitting line
line that minimizes the sum of squared errors of prediction
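To make the criterion concrete, the sketch below compares the sum of squared errors (SSE) of the least-squares line with that of a slightly different line. The data and names are made up, and the slope is computed as Sxy / Sxx, which is algebraically the same as r × (standard dev of y) / (standard dev of x).

```python
def sum_squared_errors(xs, ys, alpha, beta):
    """Sum of squared errors of prediction for the line y' = beta * x + alpha."""
    return sum((y - (beta * x + alpha)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (alpha, beta) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

alpha, beta = least_squares_line(x, y)
sse_best = sum_squared_errors(x, y, alpha, beta)
# A nearby line with a slightly steeper slope, chosen only for comparison
sse_other = sum_squared_errors(x, y, alpha, beta + 0.1)
```

Any other line, such as the steeper one here, gives a larger SSE than the least-squares line for the same data.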
What does residual plot consist of
Consists of plotting the (y-y’) against x
What does a good residual plot look like
- usually no pattern
- the residuals (y-y’) are randomly scattered around zero on the y axis
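A minimal sketch of the numbers behind a residual plot: each residual y - y' is paired with its x value. The data are made up, and the line y' = 0.6x + 2.2 is assumed to be the least-squares fit for that data. Plotting is left out to keep the example dependency-free.

```python
# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Assumed fitted line for this data: y' = beta * x + alpha
alpha, beta = 2.2, 0.6

predicted = [beta * xi + alpha for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Points (x, y - y') that would be drawn on the residual plot
residual_points = list(zip(x, residuals))

# For a least-squares line the residuals sum to (approximately) zero
residual_sum = sum(residuals)

# The largest absolute residual is the first point to check as a possible outlier
largest = max(residual_points, key=lambda p: abs(p[1]))
```

Here the residuals fall on both sides of zero with no obvious trend, and the point at x = 3 has the largest residual.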