1.2 the correlation between two variables Flashcards
Correlation
measures the strength and direction of the linear relationship between two variables
to compute it, we need to find the covariance first
covariance formula
sXY = (Sum of all (Xi - Xmean)(Yi - Ymean))/(n-1)
correlation formula
rXY = sXY/(sX*sY), where sX and sY are the sample standard deviations of X and Y
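A minimal Python sketch of the two formulas, assuming the sample (n-1) versions and using the relation rXY = sXY/(sX*sY). The function names and the small dataset are illustrative assumptions, not part of the original cards.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of (Xi - Xmean)(Yi - Ymean), divided by n - 1."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

def sample_std(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(sample_covariance(x, x))

def correlation(x, y):
    """Correlation: covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))

# Hypothetical data, chosen only to illustrate the calculation
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

s_xy = sample_covariance(x, y)  # 1.5
r_xy = correlation(x, y)        # roughly 0.77
print(s_xy, r_xy)
```

Dividing the covariance by both standard deviations removes the units, which is why the result can be compared across different pairs of variables.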
the 4 important properties of correlation
- The correlation coefficient is bounded by -1 and +1.
-1 <= rXY <= 1
- A correlation of 0 (i.e., rXY=0) indicates that there is no linear relationship between the two variables.
- A positive correlation coefficient (i.e., rXY>0) indicates a positive linear relationship between the variables.
–> In other words, an increase in X is associated with an increase in Y.
–> When rXY=1, the variables have a perfect positive linear relationship
- A negative correlation coefficient (i.e., rXY<0) indicates a negative linear relationship between the variables.
–> In other words, an increase in X is associated with a decrease in Y
–> When rXY=−1, the variables have a perfect inverse linear relationship or perfect negative linear relationship.
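The two boundary cases can be checked with a short Python sketch. The data and the helper name are assumptions made for illustration: Y is built as an exact linear function of X, once with a positive slope and once with a negative slope.

```python
import math

def correlation(x, y):
    """Sample correlation: covariance over the product of standard deviations."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 * xi + 1.0 for xi in x]   # exact line with positive slope
y_down = [-3.0 * xi for xi in x]      # exact line with negative slope

r_up = correlation(x, y_up)      # approximately +1
r_down = correlation(x, y_down)  # approximately -1
print(r_up, r_down)
```

The size of the slope does not matter here, only its sign: any exact straight-line relationship gives a correlation of +1 or -1.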
if there is no linear pattern, is it appropriate to use the correlation coefficient to test any relationship between variables?
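No. The correlation coefficient only captures linear relationships, so it can be misleading when the pattern is nonlinear.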
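A small Python sketch of why this matters, using hypothetical data: Y is completely determined by X (Y = X squared), yet the correlation comes out as zero because the relationship is not linear.

```python
import math

def correlation(x, y):
    """Sample correlation: covariance over the product of standard deviations."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

# X values symmetric around zero, Y an exact (but nonlinear) function of X
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r_parabola = correlation(x, y)  # 0: no linear relationship, despite a perfect pattern
print(r_parabola)
```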
Limitations of Correlation Analysis
- The correlation coefficient is not a reliable measure when the variables have a nonlinear relationship
- The correlation coefficient is very sensitive to outliers.
–> Analysts need to justify the inclusion of the outliers in the data or handle them through trimming (removing extreme observations) or winsorization (replacing extreme observations with less extreme values)
- Correlation does not imply causation
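–> Conclusions about causal relationships may not be valid, even when the data show a strong correlation.
–> A spurious correlation is one that arises by chance or through a shared link to a third variable, not from a direct relationship between the two variables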
- Correlation may not produce a full picture of the data
–> Different pairs of datasets may have the same correlation but different underlying relationships
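The sensitivity to outliers can be illustrated with a short Python sketch. The data are hypothetical: six weakly related points, then the same points plus one extreme observation.

```python
import math

def correlation(x, y):
    """Sample correlation: covariance over the product of standard deviations."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Six points with only a weak linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 3.0, 1.0, 3.0, 2.0]
r_without = correlation(x, y)  # weak, roughly 0.24

# The same data plus a single extreme observation at (30, 30)
x_out = x + [30.0]
y_out = y + [30.0]
r_with = correlation(x_out, y_out)  # strong, above 0.9

print(r_without, r_with)
```

A single point is enough to move the coefficient from weak to nearly perfect, which is why the treatment of outliers needs to be justified before the result is interpreted.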