Lecture 7 - Non parametric statistics and measures of association Flashcards
Parametric vs Non parametric
• Parametric data:
– Assumes normal distribution, homogeneous variance, and data sets are typically ratio or interval.
– Can draw more conclusions.
• Non-Parametric data:
– No assumption on distribution or variance relationship, and data sets are typically ordinal or nominal.
– Simpler and less affected by outliers.
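A minimal sketch of the outlier point, using made-up numbers: the mean (the summary that parametric methods rely on) shifts sharply when one extreme value is added, while the median (the summary that non-parametric methods rely on) barely moves. The variable names and values are illustrative only.

```python
from statistics import mean, median

# Hypothetical sample, then the same sample with one extreme value added.
values = [12, 14, 15, 16, 18]
with_outlier = values + [200]

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier: mean =", mean_before, "median =", median_before)
print("with outlier:    mean =", round(mean_after, 2), "median =", median_after)
```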
Correlation and Correlation Coefficient
• Technique for investigating the relationship between two numerical variables
• A correlation coefficient is a measure of the relationship between two numerical measurements
– Magnitude of relationship
– Direction of relationship
– Bivariate distribution
Positive and negative correlation
• Positive Correlation (Direct)
– Present when high values of one variable are associated with high values of the other variable, and low values with low values
• Negative Correlation (Inverse)
– Present when high values of one variable are associated with low values of the other variable, or vice versa
Correlation: Scatterplot
• Scatterplot
– A two-dimensional graph displaying the relationship between two numerical characteristics or variables
• It shows:
– Whether there is an association between the variables
– What the association looks like (linear? nonlinear?)
– The trend of the association (positive, negative)
Pearson correlation coefficient (r)
• Measures the strength of linear association between two quantitative variables
– The r value has no units
• The level of measurement for both variables is interval or ratio scale
Interpretation of r:
Negative correlation gets stronger as r approaches -1
Positive correlation gets stronger as r approaches +1
An r close to zero indicates little or no linear relationship
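A minimal sketch of how r can be computed from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the small data sets are hypothetical, chosen only to show one positive and one negative relationship.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied, test score, and number of errors.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
errors = [9, 8, 6, 5, 2]

r_positive = pearson_r(hours, score)   # close to +1: strong positive correlation
r_negative = pearson_r(hours, errors)  # close to -1: strong negative correlation
print(round(r_positive, 3), round(r_negative, 3))
```

Because the deviations are divided by quantities in the same units, the units cancel, which is why r has no units.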
Usefulness of scatterplot
- We learn the truth by simply looking at the graphs:
- The upper-left graph looks like what we might have expected from the regression output: a straight-line relationship with some scatter about the best line.
- The upper-right graph shows a strong relationship between x and y, but it is NOT linear.
- In the lower-right graph, it doesn’t make any sense to fit a line since there is essentially no variability in the x values.
- In the lower-left graph, there is a strong linear relationship with the exception of one outlier.
- The moral of this example is: ALWAYS FIRST GRAPH YOUR DATA and don’t rely solely on summary output.
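The four graphs described above match the layout of Anscombe's quartet. Assuming that is the data set being shown (the notes do not name it), the sketch below computes r for each of the four data sets. The helper `pearson_r` and the dictionary `quartet` are names introduced here for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Anscombe's quartet (assumed to be the data behind the four graphs).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "upper-left (linear)": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "upper-right (curved)": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "lower-left (one outlier)": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "lower-right (no x spread)": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Compute r for every panel; the values come out nearly identical.
r_values = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
for name, r in r_values.items():
    print(f"{name}: r = {r:.3f}")
```

All four data sets give almost the same r, so the coefficient alone cannot distinguish them; only the plots reveal the curve, the outlier, and the lack of spread in x.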
Spearman Correlation Coefficient (rs)
- The non-parametric equivalent of the Pearson product-moment correlation
- Measures the strength of association between two ranked variables
- The Spearman correlation can be used when the assumptions of the Pearson correlation are markedly violated.
- It assumes that there is a monotonic relationship between the variables (as one increases, the other consistently increases or consistently decreases).
- It is calculated by first ranking the data for each variable and then applying the Pearson correlation coefficient formula to the ranks.
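The calculation just described can be sketched as follows: rank each variable, then apply the Pearson formula to the ranks. This is a minimal illustration, assuming tied values receive the average of their ranks (a common convention that the notes do not spell out). The function names and the data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

rs = spearman_rs(x, y)
r = pearson_r(x, y)
print("Spearman rs =", round(rs, 3), "Pearson r =", round(r, 3))
```

Here the relationship is perfectly monotonic, so rs reaches 1, while Pearson's r stays below 1 because the relationship is not a straight line.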
Correlation and causality
Correlation does not imply causation
• Example:
– MMR vaccination and autism spectrum disorders
– Gender and IQ
– Alcohol and lung cancer
Regression analysis
• It is a common way of estimating the relationship among variables
– E.g.: Given the age of an individual, can we estimate their income levels?
– Also, can we use the age of the individual to predict their income levels?
Linear regression is the most basic and common type of predictive analysis
– At the centre of the regression analysis is the task of fitting a single straight line through a scatter plot
• Regression line
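A minimal sketch of fitting a regression line, assuming ordinary least squares (the usual method for simple linear regression; the notes do not name one). The function `fit_line` and the age and income figures are hypothetical and only illustrate the age/income example.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the point of means
    return slope, intercept

# Hypothetical data: age in years and income in thousands.
age = [25, 30, 35, 40, 45]
income = [30, 36, 41, 47, 52]

slope, intercept = fit_line(age, income)

# Use the fitted line to predict income for a new age.
predicted_income_at_50 = intercept + slope * 50
print(round(slope, 3), round(intercept, 3), round(predicted_income_at_50, 3))
```

The slope estimates how much income changes per additional year of age, and the fitted line is then used to predict income for an age that was not in the sample.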
Non parametric statistics for hypothesis testing
• The population median η is tested instead of the population mean μ
Sign test (+,-)
• Testing hypotheses concerning the median
H0: η = η0
H1: η ≠ η0
• If the null hypothesis is true, approximately equal numbers of observations fall above (+) and below (-) the hypothesized median
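A minimal sketch of a two-sided sign test. It assumes the usual treatment: observations equal to the hypothesized median are dropped, and under the null hypothesis the number of plus signs follows a binomial distribution with probability 0.5. The function name `sign_test`, the sample, and the hypothesized median of 50 are all hypothetical.

```python
from math import comb

def sign_test(data, median0):
    """Two-sided sign test of H0: population median equals median0."""
    plus = sum(1 for v in data if v > median0)   # observations above median0
    minus = sum(1 for v in data if v < median0)  # observations below median0
    n = plus + minus                             # ties with median0 are dropped
    k = min(plus, minus)
    # Probability of k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)                 # doubled for a two-sided test
    return plus, minus, p_value

# Hypothetical sample tested against a hypothesized median of 50.
sample = [52, 55, 49, 61, 58, 63, 50, 57, 66, 59]
plus, minus, p_value = sign_test(sample, 50)
print("plus =", plus, "minus =", minus, "p =", round(p_value, 4))
```

A small p-value means the split between plus and minus signs is more lopsided than would be expected if the true median were the hypothesized value.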