L4 - Correlation Analysis Flashcards
What is the Correlation Coefficient for raw data?
- While this type of graphical analysis can be quite informative, it is often the case that we would like a single, summary statistic of the strength of the relationship between two variables.
- This is provided by the (sample) correlation coefficient, which is defined in the following way. Suppose that we have a sample of N pairs of observations on the variables X and Y:
- (X_{1}, Y_{1}), (X_{2}, Y_{2}), …, (X_{N}, Y_{N})
- The correlation coefficient is then given by:
- r_{XY} = (NΣXY - ΣXΣY) / sqrt((NΣX^2 - (ΣX)^2) × (NΣY^2 - (ΣY)^2))
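As a quick sketch of the raw-data formula in Python (the five paired observations below are hypothetical, purely for illustration):

```python
import math

# Hypothetical sample of N = 5 paired observations (illustration only)
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 5]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# r_XY = (N*SumXY - SumX*SumY) / sqrt((N*SumX^2 - (SumX)^2)(N*SumY^2 - (SumY)^2))
r = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)
)
print(r)  # 0.8 for this sample
```

For real work this should agree with a library routine such as `scipy.stats.pearsonr`.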
How can we get to the Correlation Coefficient if the data is in mean deviation form from the sample variance?
- Recall that the sample variance of X was defined as:
- s_X^2 = Σ(X - x̅)^2 / (N - 1)
Similarly, the sample variance of Y is:
- s_Y^2 = Σ(Y - ȳ)^2 / (N - 1)
We may also define the sample covariance between X and Y to be:
- s_{XY} = Σ(X - x̅)(Y - ȳ) / (N - 1)
- This is a measure of how the two variables covary, i.e., move together. If negative (positive) mean deviations in X are predominantly accompanied by negative (positive) mean deviations in Y, then the product of these mean deviations, and hence s_{XY}, will be positive.
- If, however, mean deviations of opposite sign predominantly accompany each other, then their product, and hence s_{XY}, will be negative.
- The covariance thus gives a measure of the strength and direction of the covariation existing between X and Y.
- It has the disadvantage, however, that it will be measured in units that are the product of the units that X and Y are measured in and hence could take a value of any magnitude.
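A minimal sketch of the sample covariance on made-up data (positive here, since X and Y tend to move together, and measured in the product of X's and Y's units):

```python
# Sample covariance s_XY = Sum((X - xbar)(Y - ybar)) / (N - 1)
X = [1, 2, 3, 4, 5]  # hypothetical data, illustration only
Y = [2, 1, 4, 3, 5]
N = len(X)
xbar, ybar = sum(X) / N, sum(Y) / N

s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (N - 1)
print(s_xy)  # 2.0: positive, so X and Y predominantly move together
```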
- A scale free measure can be obtained by dividing the covariance by the square root of the product of the variances, i.e., by dividing by the product of the (sample) standard deviations, in which case it can be shown that:
- -1 ≤ s_{XY} / (s_X s_Y) ≤ 1
In fact, this ratio defines the correlation coefficient, i.e.,
- r_{XY} = s_{XY} / (s_X s_Y) = Σ(X - x̅)(Y - ȳ) / sqrt(Σ(X - x̅)^2 × Σ(Y - ȳ)^2)
How can we simplify the Correlation Coefficient if the data is in mean deviation form?
The formula will be easier to use in subsequent algebraic derivations since, on defining the mean deviations x_{i} = X_{i} - x̅ and y_{i} = Y_{i} - ȳ, it can be written concisely as
- r_{XY} = Σxy / sqrt(Σx^2 × Σy^2)
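Sketching the mean-deviation form in Python (the same kind of hypothetical data); term by term it matches the concise formula:

```python
import math

# r_XY = Sum(xy) / sqrt(Sum(x^2) * Sum(y^2)), with x, y in mean-deviation form
X = [1, 2, 3, 4, 5]  # hypothetical data, illustration only
Y = [2, 1, 4, 3, 5]
xbar, ybar = sum(X) / len(X), sum(Y) / len(Y)
x = [xi - xbar for xi in X]  # mean deviations of X
y = [yi - ybar for yi in Y]  # mean deviations of Y

r = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y)
)
print(r)  # 0.8, the same value the raw-data formula gives for this sample
```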
What does the Correlation Coefficient tell us?
Thus, r_{XY} > 0 signifies a positive correlation between X and Y,
- with r_{XY} = 1 signifying a perfect positive correlation, where all the points in the scatterplot lie exactly on an upward-sloping straight line.
- Conversely, r_{XY} < 0 signifies a negative correlation, with r_{XY} = -1 a perfect negative correlation, where all the points lie on a downward-sloping straight line.
- X and Y are uncorrelated if r_{XY} = 0.
- Values above about 0.5/0.6, or below about -0.5/-0.6, normally signify a strong correlation
How can you deal with outliers when computing a sample Correlation Coefficient?
- Like many summary statistics, the correlation coefficient can be heavily influenced by outliers.
- The data set is obviously ‘contaminated’ by a single outlying observation; otherwise there would be a perfect correlation of +1.
- As it is, the sample correlation coefficient is only 0.816, large but some way below unity.
- One way of dealing with the outlier is to compute the (Spearman’s) rank correlation coefficient.
- Rather than using the raw data, this uses the ranks of the data instead.
- So take the values of X and Y and rank them from lowest to highest.
- Here d = rank(X) - rank(Y) is the difference in ranks, and the rank correlation is given by
- r_{XY}^s = 1 - (6Σd^2) / (N(N^2 - 1))
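A sketch of the rank correlation on a small made-up sample with one outlying Y value; because only ranks enter the formula, the outlier's magnitude is irrelevant (the simple ranking below assumes no ties):

```python
def ranks(values):
    # Rank from lowest (1) to highest (N); assumes no tied values
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

X = [1, 2, 3, 4, 5]    # hypothetical data, illustration only
Y = [2, 1, 4, 3, 500]  # 500 is an outlier, but its rank is just 5

d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(X), ranks(Y)))
N = len(X)
r_s = 1 - (6 * d2) / (N * (N ** 2 - 1))
print(r_s)  # 0.8: unaffected by how extreme the outlying value is
```

`scipy.stats.spearmanr` computes the same statistic (and also handles ties).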
Why are Correlation and Causation different?
- It is tempting to conclude that, when a large correlation between two variables is found, one of the variables in some sense causes the other. (Note that we do not attempt to give a formal definition of what is meant by causality, as this is still the subject of great philosophical debate!)
- Such a temptation should be resisted because, unless we are able to invoke a feasible causal theory positing that changes in one variable produce changes in the other, correlation does not imply causation.
- Sometimes there is a ‘natural’ causal ordering: it is hard to believe that the large correlation between salary and years of higher education reflects anything other than a causal link from the latter to the former, as the reverse link would be ‘time inconsistent’.
- However, the negative correlation between inflation and unemployment could just as well be argued to represent a causal effect running from unemployment to inflation (high unemployment leads to lower demands for wage increases and hence lower inflation) as one running from inflation to unemployment (workers price themselves out of a job by demanding wages that keep up with inflation).
- Even the consumption-income relationship is by no means clear cut: the consumption function states that consumption is a function of income, but the national accounting identity has income defined as the sum of consumption, investment, government expenditure, etc., thus making the relationship one of simultaneity, i.e., the two variables jointly influence each other.
What are some further pitfalls of correlation analysis?
The correlation between the two variables is clearly zero since the covariance is Σxy=0 and hence it appears that X and Y are unrelated. However, the scatterplot of the two variables belies that conclusion: it shows they are, in fact, perfectly related, but that the relationship is nonlinear, as all the data points lie on the circle Y^2+X^2=9 , so that Y=sqrt(9-X^2) .
- This illustrates the important point that correlation is a measure of linear association and will not necessarily correctly measure the strength of a nonlinear association.
- Thus, for example, the correlation between inflation and unemployment may well underestimate the strength of the relationship if, as suggested by the earlier Phillips Curve analysis, it is really a nonlinear one.
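The circle example is easy to reproduce numerically: with X values placed symmetrically about zero and Y = sqrt(9 - X^2), the covariance Σxy is (numerically) zero even though Y is an exact function of X:

```python
import math

# Points on the upper half of the circle X^2 + Y^2 = 9
X = [-3, -2, -1, 0, 1, 2, 3]
Y = [math.sqrt(9 - x * x) for x in X]

xbar, ybar = sum(X) / len(X), sum(Y) / len(Y)  # xbar is 0 by symmetry
cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
print(abs(cov) < 1e-9)  # True: zero linear association, perfect nonlinear one
```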
- Another way that correlation can give a misleading measure of association is when the observed correlation between two variables is a consequence of both variables being related to a third. This gives rise to the phenomenon known as spurious correlation.
How can you calculate a partial correlation coefficient?
In this example we found that all three variables were highly correlated, so that it is possible that the large positive correlation between, for example, salary and education may be due to both being highly correlated with years of work experience.
- To ascertain whether this may be the case, we can calculate the set of partial correlation coefficients.
- These measure the correlation between two variables with the influence of the third removed or, to be more precise, ‘held constant’.
- The partial correlation between Y (salary) and X (education) with Z (experience) held constant is defined as
- r_{XY.Z} = (r_{XY} - r_{XZ} r_{YZ}) / (sqrt(1 - r_{XZ}^2) × sqrt(1 - r_{YZ}^2))
- Only if r_{XZ} and r_{YZ} are both zero, i.e., both X and Y are uncorrelated with Z, will the partial correlation be identical to the ‘simple’ correlation r_{XY}.
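The formula needs only the three pairwise correlations, so it can be sketched directly (the 0.9/0.8/0.85 inputs below are made-up values, not the salary data from the notes):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation r_XY.Z between X and Y with Z held constant."""
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    )

# With hypothetical pairwise correlations, a large simple correlation can shrink
# once the shared correlation with Z is held constant
print(partial_corr(0.9, 0.8, 0.85))

# When X and Y are both uncorrelated with Z, the partial equals the simple r
print(partial_corr(0.9, 0.0, 0.0))  # 0.9
```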
What does the Partial Correlation between salary, education and experience tell us?
- The partial correlation between Y (salary) and X (education) with Z (experience) held constant is r_{XY.Z} = -0.246
- This is not only much smaller than the simple correlation, it is negative, implying that, holding the influence of experience constant, there is a negative association between salary and education! The positive correlation between salary and education is therefore spurious: a ‘statistical artifact’ produced by omitting experience from the analysis.
- The partial correlation between X (education) and Z (experience) with Y (salary) held constant is r_{XZ.Y} = 0.447
- The partial correlation between Y (salary) and Z (experience) with X (education) held constant is r_{YZ.X} = 0.975
- We can see that the strength of the association between salary and experience holds up after taking account of education,
- so experience, rather than education, is really the driver of salary
What is the partial correlation between Consumption, Income and Time Trend?
- We can see from the following time series plot of Consumption and Income that both have pronounced upward movements over the sample, i.e., they have time trends, which is what we expect to see in macroeconomic aggregate data
- Could the strong correlation between C and Y (0.997) be due to a shared correlation with time? A time trend variable is t = 1, 2, 3, 4, …
- The correlations between C and t, and between Y and t, are r_{Ct} = 0.966 and r_{Yt} = 0.976, which indicate that there are time trends in both series
- The partial correlation between C and Y holding t constant, although smaller than r_{CY}, is still large
- This indicates that once we have controlled for the common time trend there is still a strong positive correlation between C and Y, i.e., the correlation between C and Y is not spurious