Lecture 7 - Statistical Tests III: Correlations & Comparing Two Groups Flashcards
what are the two statistical tests for correlative studies for both parametric and non-parametric data?
correlative parametric data - pearson's correlation
correlative non-parametric data - spearman’s rank correlation
what will pearson's correlation be used for?
pearson's correlation is used for two continuous variables, where the correlation coefficient "R" describes the strength and direction of the association as a number between -1 and 1
what does correlation describe?
correlation describes the amount of variation or scatter in a scatter plot
the higher the scatter…
the lower the strength of correlation
R values for positive, negative & no correlations:
positive correlation: r >0
negative correlation: r <0
no correlation: r = 0
what is the difference between a linear regression and a pearson's correlation?
the difference is that with a pearson's correlation no line is fitted, whereas with a linear regression a regression line is fitted through the data
pearson's assumptions:
both continuous variables are normally distributed
random sampling
independence of observations
pearson's null hypothesis:
there is no correlation between the variables: ρ (rho) = 0
if the p-value is larger than 0.05, it is not worth discussing the R values
regression or correlation?
how are x & y related? how much does y change with x? = regression
how well are x & y related? = correlation
it is correlation rather than regression if:
it is correlation rather than regression if neither of the two continuous variables is predicted to depend on the other (e.g. there may be no biological reason to assume such dependence - i.e. it is not clear which variable would be the response and which the explanatory)
it is regression rather than a correlation if:
your data come from an EXPERIMENT, as with experiments there is usually a direct relationship [we assume y is dependent on x] between the two variables, therefore a linear regression must be plotted
how can we check to see if it is safe to use pearson's correlation?
after first deducing that it is a correlation and not a direct relationship [i.e. not the result of an experiment], you must check that both variable data sets are normally distributed using the shapiro.test command in R
how can we check for normal distribution of variable data before confirming if we can use pearson's correlation?
we attach our data frame with attach(data) and list the variable names with names(data)
then for each name we input:
shapiro.test(variable_1_name)
shapiro.test(variable_2_name)
providing the p-values for both sets of data are ABOVE 0.05 we can assume for normal data distribution
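a minimal sketch of this check, using a made-up data frame (the name sparrows and its variables are hypothetical, not from the lecture):
# hypothetical example data
sparrows <- data.frame(wing_length = rnorm(30, mean = 60, sd = 2),
                       body_mass = rnorm(30, mean = 28, sd = 3))
attach(sparrows)
names(sparrows)
shapiro.test(wing_length)   # p > 0.05 means we can assume normality
shapiro.test(body_mass)     # p > 0.05 means we can assume normality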
how can you command R to give the pearson's correlation?
cor.test(variable_1, variable_2, method = "pearson")
note: doesn’t matter what way around your variables are - answer will be the same either way
how do we write up the results of a pearson's cor.test in R?
the (variable one) and (variable two) of (object) were negatively/positively correlated (pearsons correlation; R = value, p = value, N = 15)
what do we receive from a pearson's cor.test command and how do you infer it?
you will get a p-value and a test statistic found underneath "cor" at the bottom of the output, which is our correlation coefficient
(1) if the p-value is smaller than 0.05 then we can assume that the two variables are correlated
(2) if the cor value is positive it means there is a positive correlation, if the cor value is negative it means there is a negative correlation
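a minimal sketch putting the command and its interpretation together (the variables here are simulated, purely for illustration):
# hypothetical example: two correlated continuous variables
x <- rnorm(30)
y <- 0.8 * x + rnorm(30, sd = 0.5)
result <- cor.test(x, y, method = "pearson")
result$p.value    # (1) p < 0.05: assume the variables are correlated
result$estimate   # (2) "cor": positive value = positive correlation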
if the shapiro.test results are greater/lower than 0.05 we:
> 0.05: data IS normally distributed
<0.05: data IS NOT normally distributed
what is the non-parametric equivalent of the pearson's correlation?
spearman’s rank
spearman’s rank overall function and assumptions:
- ranks both the x and y variable used to calculate a measure of correlation
- assumptions: none about distribution of variables; random sampling; independence of observations
what does spearman’s rank correlation, r/s / R/s describe?
describes the strength and direction of the linear association between the ranks of the two variables, a number between -1 & 1
what is different between the pearson's correlation and spearman's rank?
pearson's is for parametric data and is calculated on the unranked values
spearman's rank is for non-parametric data and is calculated on ranked values
what must be done to your variables when calculating spearman’s rank?
the data from both variables must be ranked separately from low to high - the lowest value gets rank one, and values receive progressively higher integer ranks the larger they are
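R's rank() function does this ranking for you; a tiny illustration with made-up values:
x <- c(3.1, 1.2, 5.6, 2.4)
rank(x)   # returns 3 1 4 2 - the lowest value (1.2) gets rank 1
note: tied values are given the average of their ranks by default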
how can you use R to calculate your spearman’s rank values?
we, once again, use:
cor.test(variable_1, variable_2, method = "spearman")
how do we infer the results of our spearman’s rank values in R?
you are given a p-value: if it is greater than 0.05 then we must accept the null hypothesis and assume no correlation, if the value is smaller than 0.05 we must accept the alternative hypothesis and assume a correlation
you are also given a “rho” test statistic (Rs) at the bottom of the output: ONLY if the p-value is <0.05, we look at this value - if it is positive it suggests a positive correlation and if it is a negative value it suggests a negative correlation
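a minimal sketch, with simulated skewed data standing in for non-parametric variables:
# hypothetical example: skewed (non-normal) variables
x <- rexp(30)
y <- x^2 + rexp(30)
result <- cor.test(x, y, method = "spearman")
result$p.value    # < 0.05: accept the alternative hypothesis (correlation)
result$estimate   # "rho" (Rs): its sign gives the direction of the correlation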
what is a crucial thing you must always do before statistically testing correlations to ensure you are using the right test?
you must always check if the data present for each variable is either parametric or non-parametric using the shapiro.test(variable name) command in R
as parametric = pearson's
and non-parametric = spearman's
what statistical tests do we use when investigating the difference between normally distributed samples?
for paired parametric samples: paired t-test
for independent parametric samples: t-tests
what statistical tests do we use when investigating the difference between non-parametric samples?
for non-parametric paired samples = Paired Wilcoxon Test
for non-parametric independent samples = Mann-Whitney U Test / Wilcoxon test
when is student's t-test used?
normal distribution of both groups and equal variances
when is Welch’s t-test used?
normal distribution of both groups and unequal variance
when is Mann Whitney U Test / Wilcoxon Test used?
non-normal distribution (no assumptions)
how can we test for normality?
graphically: histograms or quantile plots
formal tests: Shapiro-Wilk Test [shapiro.test(variable name)]
how can you use R and histograms to test for normality when you are comparing two groups?
hist(x_variable[male_type=="control"])
hist(x_variable[male_type=="knockout"])
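a fuller sketch of the same idea, using a hypothetical data frame (the names mice, body_mass and male_type are made up for illustration):
# hypothetical example data
mice <- data.frame(body_mass = rnorm(40, mean = 30, sd = 4),
                   male_type = rep(c("control", "knockout"), each = 20))
attach(mice)
par(mfrow = c(1, 2))   # put the two histograms side by side
hist(body_mass[male_type == "control"])
hist(body_mass[male_type == "knockout"])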
F-test command:
var.test(y-variable~x-variable)
what requirements do we need in order to do a t-test?
we need to ensure we have normally distributed data and non-differing variance
test distribution using: shapiro.test(variable name)
test for variance using: var.test(y-variable~x-variable) - if the p-value is over 0.05 in the F-test it means that the variances do not differ
what is the t-test command in R?
t.test(y-variable~x-variable, var.equal=TRUE)
note: you can only carry out the t-test provided that the variances are actually equal, something you can find out by doing an F-test with the command var.test(y~x) - your p-value must be >0.05 for the variances not to differ
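an end-to-end sketch of the checks plus the test (simulated data; the variable names are invented for illustration):
# hypothetical example data: two normal groups with equal spread
body_mass <- c(rnorm(20, mean = 30, sd = 4), rnorm(20, mean = 33, sd = 4))
male_type <- factor(rep(c("control", "knockout"), each = 20))
shapiro.test(body_mass[male_type == "control"])    # p > 0.05: assume normality
shapiro.test(body_mass[male_type == "knockout"])   # p > 0.05: assume normality
var.test(body_mass ~ male_type)                    # p > 0.05: variances do not differ
t.test(body_mass ~ male_type, var.equal = TRUE)    # student's t-test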
how do we infer the results of our t-test in R?
p-value = if your p-value is below 0.05 it means there is a significant difference between the means of the two groups
how can you get the mean results for different variable data sets?
you can get the mean results for variable data sets using the command: tapply(continuous variable, grouping variable, mean)
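for example, with the hypothetical body_mass and male_type variables from the sketch above:
tapply(body_mass, male_type, mean)   # mean body_mass for each level of male_type
tapply(body_mass, male_type, sd)     # the same pattern works for sd, median, etc.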
what statistical test would you use if continuous variable in each of the groups were normally distributed but the variances were not equal?
[variances = not equal] & [distribution = normal] = welch’s t-test
how do you do a welch's t-test in R, and how does it differ from a normal student's t-test command?
you simply write > t.test(y-variable~x-variable)
this differs from the student's t-test in that the code doesn't have the additional "var.equal=TRUE", as we only apply welch's test when the variances aren't equal
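a minimal sketch, with simulated normal data whose variances deliberately differ between groups:
# hypothetical example: normal groups with unequal spread
y <- c(rnorm(20, mean = 10, sd = 1), rnorm(20, mean = 12, sd = 4))
group <- factor(rep(c("A", "B"), each = 20))
var.test(y ~ group)   # p < 0.05 here would mean the variances differ, so use welch's
t.test(y ~ group)     # welch's t-test (R's default when var.equal isn't set)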
mann-whitney U test/ Wilcoxon-Mann-Whitney test requirements:
non-parametric equivalent of the independent samples t-test, one continuous variable (response variable) and one categorical variable with two factor levels (explanatory variable)
wilcoxon test in R:
(1) attach(data-frame)
(2) names(data)
(3) wilcox.test(y~x)
how do we infer the results of our wilcoxon test in R?
you are given a test statistic (= w) and also a p-value, if your p-value is <0.05 then it means that we accept the alternative hypothesis - significant difference established
what can we construct once we have confirmed a statistically significant difference via a wilcoxon test in R?
once confirming significant difference (<0.05 - wilcoxon p-value) you can then construct your plot via the command:
> plot(y~x, las = 1)
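a sketch of the whole sequence, assuming a hypothetical data frame (fish, parasite_load and site are invented for illustration):
# hypothetical example data: skewed response, two-level factor
fish <- data.frame(parasite_load = c(rexp(20, rate = 1), rexp(20, rate = 0.3)),
                   site = factor(rep(c("upstream", "downstream"), each = 20)))
attach(fish)
wilcox.test(parasite_load ~ site)     # p < 0.05: significant difference
plot(parasite_load ~ site, las = 1)   # box plot of the two groups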
paired wilcoxon test requirements:
- non-parametric test, uses medians
- assumptions: none
- null hypothesis: median difference between measurements is 0
paired wilcoxon test R command and interpretation:
wilcox.test(paired-variable-1, paired-variable-2, paired = T)
p-value <0.05 = reject null hypothesis - alternative hypothesis = true
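a minimal sketch, with hypothetical before/after measurements on the same individuals:
# hypothetical paired data: each individual measured twice
before <- rexp(15, rate = 1)
after <- before + rexp(15, rate = 2)        # values tend to increase
wilcox.test(before, after, paired = TRUE)   # p < 0.05: reject the null hypothesis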
Mann-Whitney U Test & Wilcoxon Tests are:
the exact same non-parametric test!
how can we check for the assumptions of a linear regression in R?
at the end of your linear regression command you can check that constant variance and normal distribution are present via the command:
> plot(m1)
this will show you the two graphs where (1) the residuals vs fitted plot should look like a star-filled sky (constant variance) & (2) the Q-Q plot dots should fall along the line (normal distribution) - providing the assumptions are met
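a sketch assuming m1 is a fitted lm() object (the data here are simulated for illustration):
# hypothetical example regression
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
m1 <- lm(y ~ x)
par(mfrow = c(1, 2))
plot(m1, which = 1:2)   # (1) residuals vs fitted: star-filled sky = constant variance
                        # (2) normal Q-Q: dots along the line = normal distribution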
the three statistical tests used for comparing two groups against a continuous variable:
student's t-test: both groups are parametric with equal variances
welch's t-test: both groups parametric but with unequal variance
Mann-Whitney U Test / Wilcoxon Test: non-parametric data (no assumptions)
strongest statistical test out of std.t-test, welch’s & Mann-Whitney / Wilcoxon:
std.t-test - it has the most statistical power, provided its assumptions are met
correlations are used when:
- when we are interested in how WELL x and y are related
- if neither of the two variables is predicted to depend on the other (not clear what is the response variable and what is the explanatory)