Skills for Geographers, quantitative analysis Flashcards
What is accuracy?
Refers to how close a measurement is to the true or accepted value
What is precision?
Refers to how close repeated measurements are to each other (e.g. range, standard deviation, etc.)
Can you be precise and inaccurate and vice versa?
Yes. They are independent of each other
What are significant figures?
Digits expressing a measurement (or the results of a calculation involving such measurements) such that only the last digit is uncertain are called significant figures.
What do significant figures indicate?
the precision of a measuring tool that was used to measure a value.
Significant figure rule
The number of significant figures in the result of a calculation should not exceed that of the least precise measurement (the one with the fewest significant figures) used in the calculation
For example: 4.3 / 0.2748 = 16 NOT 15.6477 (4.3 has only two significant figures)
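Below is a minimal Python sketch (not part of the original flashcards) that applies this rule; `round_sig` is a hypothetical helper written here just for illustration.

```python
# A minimal helper (not from the flashcards) for rounding to a chosen number of
# significant figures, illustrating the rule above.
from math import floor, log10

def round_sig(x, sig):
    """Round x to `sig` significant figures."""
    return round(x, sig - int(floor(log10(abs(x)))) - 1)

print(4.3 / 0.2748)                # 15.64774... (raw result)
print(round_sig(4.3 / 0.2748, 2))  # 16.0 -> two significant figures, matching 4.3
```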
What is nominal/categorical data?
one in which there is no particular relationship between the different possibilities.
it doesn’t make any sense to say that one of them is ‘bigger’ or ‘better’ than any other one, and it doesn’t make any sense to average them.
Nominal level data can be differentiated and grouped into categories by “kind,” but are not ranked from high to low.
The only thing that you can say about the different possibilities of nominal data are that “they are different”.
Note: Nominal data can also be displayed using numbers, e.g. 0 = ‘Male’; 1 = ‘Female’. These numbers are placeholders for categorical labels; this does not make them a continuous variable!
Examples of nominal/categorical data:
a land cover category on a map –> can be grouped by “kind”, e.g. woods, scrub, orchard, vineyard, or mangrove
Hair colour
Names
How would you represent nominal data?
Mode (the most common value); Frequency tables; Bar charts.
Usually, you would display the data as percentages rather than as counts, but you can consider adding both into your tables/graph
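A minimal Python sketch (illustrative only, with made-up land-cover data) showing a frequency table, percentages and a bar chart with pandas:

```python
# Minimal sketch (assumed data): summarising a nominal variable with counts,
# percentages and a bar chart using pandas.
import pandas as pd

land_cover = pd.Series(["woods", "scrub", "woods", "orchard", "woods", "scrub"])

counts = land_cover.value_counts()                            # frequency table (counts)
percentages = land_cover.value_counts(normalize=True) * 100   # as percentages

print(pd.DataFrame({"count": counts, "percent": percentages.round(1)}))
counts.plot(kind="bar")                                        # bar chart of the categories
```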
What is ordinal data?
An ordinal variable is one in which there is a natural, meaningful way to order the different possibilities, but you can’t do anything else.
They can be placed in a meaningful order e.g. rank order, but they can’t be averaged!
Examples of ordinal data:
transportation routes that are classified hierarchically (Motorways, A roads, B roads, unclassified roads )
Education level
Satisfaction rating
How do you represent ordinal data?
Mode (the most common value); Frequency tables, Bar charts. Usually, you would display the data as percentages rather than as counts, but you can consider adding both into your tables/graphs.
What is interval data?
interval scale variables are variables for which the numerical value is genuinely meaningful.
In the case of interval scale variables the differences between the numbers are interpretable, but the variable doesn’t have a “natural” zero value.
Examples of interval data:
year e.g. 2021, 2022.
Suppose I’m interested in looking at how the attitudes of first-year university students have changed over time. Obviously, I’m going to want to record the year in which each student started. This is an interval scale variable. A student who started in 2010 did arrive 5 years after a student who started in 2005 (you can add and subtract the values meaningfully) . However, it would be completely crazy for me to divide 2010 by 2005 and say that the second student started ‘1.0024 times later’ than the first one!
Mark grading
Time passed
IQ test
Temperature (in Celsius or Fahrenheit)
What is ratio data?
For ratio scale variables the numerical value is genuinely meaningful and a zero really means zero, and it’s okay to multiply and divide
Examples of ratio data:
Distance and height - zero metres and zero feet mean exactly the same thing
Addition, subtraction, multiplication and division all make sense when using ratio data. An implication of this difference is that a quantity of 20 measured at the ratio scale is twice the value of 10 (20 metres is twice the distance of 10 metres), a relation that does not hold true for quantities measured at the interval level (20 degrees is not twice as warm as 10 degrees).
Income
Weight
What is a variable?
A variable is a record of any number, quantity or characteristic that can be measured.
Variables can be manipulated, controlled for or measured in your research. Research experiments will consist of a series of different variables. When we analyse our data, we usually try to explain some of the variables in terms of some of the other variables.
What is an independent variable?
the variable that’s doing the explaining
What is a dependent variable?
The variable being explained
What is a measure of central tendency?
Measures of Central Tendency provide details that will help you describe the centre of your data in a set of single values. (Mean, Median, Mode)
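A minimal Python sketch (made-up data) computing the three measures with the standard-library `statistics` module:

```python
# Minimal sketch (assumed data): mean, median and mode in Python.
import statistics

ages = [21, 22, 22, 23, 25, 27, 30]

print("mean:",   statistics.mean(ages))
print("median:", statistics.median(ages))
print("mode:",   statistics.mode(ages))   # the most common value
```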
What shape does normal distribution follow?
Bell-shaped curve
What is skewness?
Skewness is basically a measure of asymmetry. If your data are normally distributed then skewness = 0. Data can be positively or negatively skewed, as we saw earlier.
If the reported skewness value is less than -1 (negatively skewed) or greater than 1 (positively skewed), your data are highly skewed!
What is kurtosis?
Provides information about how thin or fat (sometimes called light and heavy) the tails of a data distribution are. Kurtosis is related to the degree of presence of outliers in your data. We say that a normal distribution curve has zero (excess) kurtosis, and the degree of kurtosis is assessed relative to this curve.
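A minimal Python sketch (simulated data) showing how skewness and kurtosis might be checked with scipy; note that scipy reports excess kurtosis by default, so roughly normal data give values near zero:

```python
# Minimal sketch (simulated data): checking skewness and kurtosis with scipy.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=1000)   # roughly normal sample

print("skewness:", skew(x))       # close to 0 for normal data
print("kurtosis:", kurtosis(x))   # excess kurtosis, close to 0 for normal data
```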
What are the 3 common measures of dispersion?
Minimum, maximum and range
Quartiles and the interquartile range
Variance and standard deviations.
What does the standard deviation tell you?
Standard deviations tell you about the average variability in your dataset. They tell you on average how far values are away from the mean. Most values will cluster around the central region with fewer and fewer values at the edge.
The bigger the standard deviation, the bigger the spread of the data.
What units does standard deviation take?
The standard deviation takes the same units as your variable. So if your variable is age (e.g. years), then the standard deviation is reported in the same units i.e. years.
What is the empirical rule?
Standard deviation is related to the normal distribution so it only really makes sense when your data are normally distributed. If your data are normally distributed and you know (i) the standard deviation of your data and (ii) the mean, you can tell where most of the values in your distribution should lie.
The empirical rule is:
Around 68% of values are within 1 standard deviation of the mean
Around 95% of values are within 2 standard deviations of the mean
Around 99.7% of values are within 3 standard deviations of the mean
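A minimal Python sketch (simulated data) checking the empirical rule on a large normal sample:

```python
# Minimal sketch (simulated data): counting the share of values within
# 1, 2 and 3 standard deviations of the mean.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=100_000)
mean, sd = x.mean(), x.std()

for k in (1, 2, 3):
    share = np.mean(np.abs(x - mean) <= k * sd)
    print(f"within {k} SD: {share:.1%}")   # ~68%, ~95%, ~99.7%
```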
For Single Variable, Continuous Data, what plot should you use?
Histograms: A histogram is similar in appearance to a bar chart. A histogram condenses your data into a series of logical ranges (bars). Each bar shows you how many data points appear in that range.
Density Plot: This plot is similar to a histogram, but instead of having separate bars, you get a smooth line to represent the distribution of your data.
Box plots: We’ve already talked a bit about box plots when we talked about quartiles. A box plot is another way of displaying the distribution of your data. A box plot displays your Minimum, Maximum, the IQR (the box), your Median (line through the box), the Mean (black square) and any outliers (black dots along the “whisker”).
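A minimal Python sketch (simulated data, assuming matplotlib and scipy are available) producing the three plots with pandas:

```python
# Minimal sketch (simulated data): histogram, density plot and box plot for a
# single continuous variable, using pandas plotting (matplotlib backend).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = pd.Series(np.random.default_rng(1).normal(170, 8, 500))  # e.g. heights in cm

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
x.plot(kind="hist", ax=axes[0], title="Histogram")
x.plot(kind="density", ax=axes[1], title="Density plot")   # requires scipy
x.plot(kind="box", ax=axes[2], title="Box plot")
plt.tight_layout()
plt.show()
```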
For Single Variable, categorical Data, what plot should you use?
Bar Charts: Each bar represents a category and the proportion (height/length) is related to the value it represents. Bars can be plotted vertically or horizontally. You can plot them as grouped or stacked bar charts. Choose between count (represents the actual numbers of cases/observations) or percentages.
For multiple Variables, Continuous Data, what plot should you use?
Scatterplot: These show the relationship between two continuous variables. This relationship can be displayed by a line (usually linear - but it can also be non-linear). A scatterplot uses dots to represent your data. Each dot represents the intersection between the values of the variables you are using.
Note: These are just plots! Without accompanying data or statistical tests, you should not infer that a relationship is statistically significant!
What are inferential statistics?
Essentially inferential statistics allows you to make predictions (“inferences”) from data that you have obtained.
With inferential statistics, you take data from samples and make generalizations about a population.
What is a sample?
a subset of the population
What is the difference between inferential and descriptive statistics?
you might stand in a shopping centre and ask a sample of 100 people if they like shopping at Poundland. You could make a bar chart of yes or no answers (that would be descriptive statistics) or you could use your research (and inferential statistics) to reason that around 75-80% of the population (all shoppers in all shopping centres) like shopping at Poundland!
What is hypothesis testing?
Hypothesis testing is where you can use sample data to answer research questions.
What are the key steps in hypothesis testing?
- Research hypothesis - Define the research hypothesis for the study.
- Selection of variables to measure - Explain how you are going to operationalize (that is, measure or define) what you are studying and set out the variables to be studied.
- Statistical hypothesis - Define the statistical hypothesis: Null and alternative hypothesis.
- Sample - Collect your sample of data i.e. sample the population that you actually want to know about.
- Normality - Determine whether the distribution that you are studying is normal as this has implications for the types of statistical tests that you can run on your data.
- Statistical test - Select an appropriate statistical test based on the variables that you have defined and whether the distribution is normal or not.
- Significance level - Set the significance level for your chosen test.
- P-value - Run the statistical tests on your data and interpret the output p-value.
- Decide - Based on the outputs from the test, reject or fail to reject the null hypothesis.
What is a research hypothesis?
A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (‘x affects y because …’).
What is a statistical hypothesis?
A statistical hypothesis is a mathematical statement about the characteristics of the data generating mechanism (i.e. the population). Statistical hypotheses always come in pairs: the null and alternative hypotheses.
Describe an example to explain the difference between a research hypothesis and a statistical hypothesis.
Two statistics lecturers, Rose and Fred, think that they use the best method to teach their students. Each lecturer has 50 statistics students who are studying for a degree in geography. In Rose’s class, students have to attend one lecture and one practical class every week, whilst in Fred’s class students only have to attend one lecture.
Rose thinks that practical classes, in addition to lectures, are an important teaching method in statistics, whilst Fred believes that lectures are sufficient by themselves and thinks that students are better off solving problems by themselves in their own time.
This is the first year that Rose has given practical classes, but since they take up a lot of her time, she wants to make sure that she is not wasting her time and that they do improve students’ performance.
Research aim:
to determine whether performance is different between the two different teaching methods.
Research hypothesis:
When students attend practical classes, in addition to lectures, their performance increases.
In statistics terminology, the students in the study are the sample and the larger group they represent (i.e., all statistics students on a geography degree) is called the population.
Assuming that the sample of students in the study are representative of a larger population of statistics students (assuming we had a good sampling design!) , we can use hypothesis testing to understand whether any differences or effects discovered in the study exist in the population.
In other words, hypothesis testing is used to establish whether a research hypothesis extends beyond those individuals examined in a single study. We perform statistics on our sample data to make inferences about the population from which they are drawn (or sampled).
Statistical hypothesis:
The mean exam mark (for the population) for students exposed to the “practical” and “lecture-only” teaching methods is not the same.
What is a null hypothesis?
The null hypothesis is essentially the “devil’s advocate” position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero)
Using the ‘lecturers’ dilemma’, what would the null hypothesis be?
The mean exam mark for the “practical” and “lecture-only” teaching methods is the same in the population (i.e. zero difference).
Using the ‘lecturers’ dilemma’, what would the alternative hypothesis be?
The mean exam mark for the “practical” and “lecture-only” teaching methods is not the same in the population.
What is the goal of a hypothesis test?
the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true.
The goal is to show that the null hypothesis is (probably) false.
How is the level of statistical significance often expressed?
as the so-called p-value or probability value.
What is the p-value?
The p-value is a number describing how likely it is that we would get this sample result by chance if there is NO effect in the population (i.e. if the null hypothesis is true).
It is always related to the null hypothesis NOT to the alternative hypothesis.
What does the p-value mean in terms of the ‘lecturers’ dilemma’?
How likely would it be to see a difference in the mean exam performance between the two teaching methods as large as (or larger than) that which has been observed in your sample if there really is no difference between the two teaching methods in the wider population (i.e. the null hypothesis is true)?
What is the normal scale of a p-value?
0-1
Explain what you can interpret based on different p-values (using specific values)
A p-value smaller than or equal to 0.05 is statistically significant. It indicates strong evidence against the null hypothesis:
there is less than a 5% probability (or, more specifically, the actual p-value reported!) of obtaining a result at least this extreme if the null hypothesis were true, which is usually tolerable.
Consequently, we reject the null hypothesis in favour of the alternative hypothesis.
A p-value larger than 0.05 is not statistically significant: the evidence against the null hypothesis is weak (it does not prove that the null hypothesis is true).
Essentially it can be interpreted as “the error rate I would have to tolerate if I rejected my null hypothesis is > 5%”, which is usually not tolerable!
This means we can not reject the null hypothesis.
The smaller the p-value, the more confident you can be in rejecting the null hypothesis.
How does some statistical software indicate p-value significance?
using significance stars
p < 0.05 = *
p < 0.01 = **
p < 0.001 = ***
What do various statistical methods used for the analysis of continuous data, make assumptions about?
NORMALITY
Such methods include correlation, regression, t-tests, and analysis of variance (ANOVA).
If continuous data follow a normal distribution, how can we present the data?
using the mean value i.e. all of the parametric tests make use of the mean value.
What happens if the data are not normally distributed?
the mean is not a representative value of the data (remember the session on descriptive stats)! Consequently, choosing the wrong representative value for a data set, and then calculating significance levels from it, can lead to a wrong interpretation!
What must we do before undertaking any sort of statistical analyses?
determine if our data are normally distributed; that is we test the normality of the data. If the data are normally distributed, then we know that the mean is applicable as a representative value for our data and we can use the mean values in parametric tests otherwise we need to use median values and nonparametric methods.
What are the two main methods of assessing normality?
Graphical
Numerical
Advantages of using graphical interpretation to assess normality:
allows you to use good judgement to assess normality in situations where numerical tests might be over- or under-sensitive
Examples of graphical interpretation to assess normality
histograms, density plots and Q–Q (quantile-quantile) plots.
A Q-Q plot is particularly useful to visually test for normality. Essentially it plots the quantiles of your data on the Y axis against the quantiles that would be expected if your data were normally distributed on the X axis. If the points fall on or close to the straight reference line, you can be confident that your data are normally distributed.
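A minimal Python sketch (simulated data) drawing a Q-Q plot against the normal distribution with scipy:

```python
# Minimal sketch (simulated data): Q-Q plot against the normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(2).normal(0, 1, 200)

stats.probplot(x, dist="norm", plot=plt)  # points near the line => roughly normal
plt.show()
```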
Advantages of using descriptive statistics to assess normality:
easy to generate and can provide some initial indication regarding the distribution and variability present within the data.
Examples of descriptive statistics to assess normality
mean, standard deviation, skewness and kurtosis
Advantages and disadvantages of using statistical tests to assess normality:
Have the advantage of making an objective judgment of normality but have the disadvantage of sometimes not being sensitive enough at low sample sizes or overly sensitive to large sample sizes.
Most popular examples of statistical tests to assess normality
Shapiro–Wilk test and the Kolmogorov–Smirnov test.
When is the Shapiro-Wilk test most appropriate?
The Shapiro–Wilk test is a more appropriate method for small sample sizes (n <50 samples).
Although it can also be used on larger sample sizes, it often becomes very sensitive to small deviations from normality (which are potentially inconsequential) for very large sample sizes.
When is the Kolmogorov–Smirnov test more appropriate?
Kolmogorov–Smirnov test is usually used for n ≥50 (where n is commonly used to indicate the number of samples).
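A minimal Python sketch (simulated data) running both tests with scipy; standardising the data first is one common way to apply the K-S test against a normal distribution:

```python
# Minimal sketch (simulated data): normality tests in scipy.
import numpy as np
from scipy import stats

x = np.random.default_rng(3).normal(10, 2, 40)

w, p_shapiro = stats.shapiro(x)            # better suited to small n (< 50)
z = (x - x.mean()) / x.std(ddof=1)         # standardise before the K-S test
d, p_ks = stats.kstest(z, "norm")          # more usual for n >= 50

print(f"Shapiro-Wilk: p = {p_shapiro:.3f}")
print(f"Kolmogorov-Smirnov: p = {p_ks:.3f}")  # p > 0.05 => no evidence of non-normality
```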
What is variance?
a measure of the distribution, or “spread”, of values around the mean; formally, it is the average of the squared deviations from the mean
What is Levene’s test for homogeneity of variance?
You use it to test whether your data satisfy the assumption that the groups have equal variances (the test statistic is F). The hypotheses can be set up as below:
Null Hypothesis (H0): There is no significant difference between the population variances of the groups (i.e. variance between groups is equal)
Alternative Hypothesis (H1): There is a significant difference between the population variances of the groups (i.e. variance between groups is not equal)
If p > 0.05 the null hypothesis CAN NOT be rejected and thus we can assume that the population variances between groups are the same or similar.
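A minimal Python sketch (made-up groups) running Levene's test with scipy:

```python
# Minimal sketch (assumed data): Levene's test for equal variances.
from scipy import stats

group_a = [23, 25, 28, 30, 31, 27]
group_b = [22, 24, 29, 35, 38, 26]

stat, p = stats.levene(group_a, group_b)
print(f"test statistic = {stat:.2f}, p = {p:.3f}")  # p > 0.05 => assume equal variances
```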
What questions do we ask to decide what test to run to analyse our data?
What type/level of data measurements do I have? (e.g. nominal, ordinal, continuous)
Are my data normally distributed?
Do I want to look at differences between sets of measurements (groups) or do I want to look at associations (relationships) between them?
How many sets of measurements (groups) do I have in my data?
Are my measurements in pairs (e.g. repeated measurements on the same individual/sample)?
Am I investigating the effect of one factor, or two together?
What is the Chi-square goodness of fit test?
a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not.
It is often used to evaluate whether your sample data are representative of the full population by testing whether your data are similar to the expected outcome i.e. if you were to actually collect data from the “entire” population!!
The Chi-square test quantifies the difference between observed and expected values. Is the actual data from our sample “close enough” to what is expected to conclude that the numbers of settlements on flat and steep land in the full population of settlements are equal, or not?
When do we use the chi-square goodness of fit test?
you have counts of values for a categorical variable (it is not for continuous data!);
data values are a simple random sample from the full population and;
the dataset is large enough so that at least five values are expected in each of the observed data categories (or groups).
What do we need to run the Chi-square goodness of fit test?
We only need one variable. We also need an idea, or hypothesis, about how that variable is distributed
Example of when to use the Chi-square goodness of fit test:
You are investigating the location of Huron Indian populations in Ontario, Canada and you want to know if local slope conditions are important in influencing settlement location.
Specifically, you want to know if there is a difference between the number of settlements on steep land, and the number of settlements on flat land.
If slope does not influence the number of settlements in that location, then we would expect the number of settlements on steep and flat land to be similar.
We have a simple random sample of 50 villages. We meet this requirement.
Our categorical variable is the land type.
We have the count of villages in each land type. We meet this requirement.
Adjusting for the different areas of each land type (63.6 % of the total land area is flat, and 36.4 % is steep), we expect that out of the 50 villages sampled, that 32 (rounded up from 31.8!) of those villages would be expected to occur on flat land (i.e. 63.6 % of 50) and 18 on steep land (i.e. 36.4 % of 50). This is more than the requirement of five expected values in each category!
Our statistical hypotheses would be:
Null Hypothesis (H0): There is no significant difference between settlement numbers on flat land compared to steep land.
Alternative Hypothesis (H1): There is a significant difference between settlement numbers on flat land compared to steep land.
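A minimal Python sketch of this example with scipy; the expected counts come from the flashcard above, but the observed counts below are made up purely for illustration:

```python
# Minimal sketch: chi-square goodness of fit for the settlements example.
# Expected counts (32 flat, 18 steep) are from the flashcard; observed counts are hypothetical.
from scipy import stats

observed = [40, 10]   # hypothetical counts of villages on flat vs steep land
expected = [32, 18]   # 63.6% and 36.4% of the 50 sampled villages

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")  # p < 0.05 => reject H0 for these made-up counts
```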
What are T-tests?
T-tests are a suite of parametric tests used to test for “differences” between sets of measurements.
What do we need to run t-tests?
T-tests use “mean values” so the data need to be normally distributed and the variances of the groups to be compared need to be equal or similar (remember how to test this in Jamovi, from the last session?).
You need to have a quantitative variable (not counts!) and a categorical or nominal variable (i.e. a grouping variable).
There are a number of different types of t-test, which one is most appropriate depends on your data and research question.
What are the types of t-tests?
One sample t-test: Is the mean different from an expected value? (similar to the Chi-square test, but this is for continuous data not counts!)
Paired t-test: Are the means of two measurements made on the same sample different from each other?
Two-sample t-test: Are the means of two separate samples different from each other? –> most common one
What is a two-sample t-test?
The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.
What do you need to run a two-sample t-test?
we need two variables. One variable defines the two groups (categorical or nominal variable). The second variable is the measurement of interest.
We also have an idea, or hypothesis, that the means of the underlying populations for the two groups are different.
When can we use a two-sample t-test?
Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
Data in each group must be obtained via a random sample from the population.
Data in each group are normally distributed.
Data values are continuous.
The variances for the two independent groups are equal.
Example of when to use a two-sample t-test:
I want to know if the mean age of people who live on two different housing estates in Manchester is different.
The data values are independent. The age of any one person does not depend on the age of another person.
We assume the people represent a simple random sample from the population who live in the housing estates.
We assume the data are normally distributed and we can check this assumption (e.g. Shapiro-Wilk test).
The data values are age so the measurements are continuous.
We assume the variances for Estate A and Estate B are equal and we can check this assumption (e.g. Levene’s test).
statistical hypotheses:
Null Hypotheses (H0): There is no significant difference in the mean ages of people who live on Estate A and Estate B
Alternative Hypothesis (H1): There is a significant difference in the mean ages of people who live on Estate A and Estate B
The software will calculate the t-test statistic ( t-statistic) for your sample data and compare it against a hypothesised t-distribution for “no difference” between groups (i.e. we want to know how far from the null hypothesis (i.e. difference = 0) our data are).
Our results indicate that the p-value of the t-test statistic (which is ultimately the value that we are most interested in!) is p = 0.33; suggesting that there is a 33 % risk of concluding the mean ages for people in each of the housing estates are different, when they are really not. This is far greater than our 0.05 (5 %) significance level so we “can not” reject the null hypothesis
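A minimal Python sketch (made-up ages) of a two-sample t-test with scipy, assuming equal variances as in the classic Student's t-test:

```python
# Minimal sketch (assumed data): independent two-sample t-test.
# equal_var=True matches the classic Student's t-test (equal variances assumed).
from scipy import stats

estate_a = [34, 45, 29, 51, 38, 42, 36, 40]   # hypothetical ages, Estate A
estate_b = [31, 47, 35, 52, 44, 39, 41, 46]   # hypothetical ages, Estate B

t_stat, p = stats.ttest_ind(estate_a, estate_b, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p:.3f}")  # p > 0.05 => fail to reject H0
```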
What is SE?
SE = standard error, which is the sample standard deviation divided by the square root of the number of samples (SD/√n).
What is a t-statistic?
The t-statistic is the number of standard deviations from the (hypothesised) mean of a t-distribution.
What is a t-distribution?
a type of distribution that looks similar to a normal distribution (also called z-distribution) but is more suited to smaller sample sizes (n < 30) as it’s more conservative than the normal distribution (lower probability around the mean and higher probabilities for the more extreme values, than the normal distribution).
As the sample size increases, the t-distribution starts to look more like a normal distribution.
What is an ANOVA test?
a way to find out if survey or experiment results are significant. Specifically ANOVA tests to see if there is a difference between more than two different groups of observations. The grouping variable is the independent variable (or predictor variable).
we should always use ANOVA when we have more than two groups/categories within our data
Example of when to use an ANOVA test:
We would use ANOVA if we were measuring plant height along the Manchester Canal and we wanted to see if plant height differs by species. ANOVA would be suitable if we measured the height of more than two species of plant (i.e. groups). If we were only measuring two species then we would use the T-test
What do we need for ANOVA?
A mix of one continuous “quantitative” dependent variable (which corresponds to the measurements to which the question relates) and one “qualitative” variable (with at least 2 groups to compare; 3 if we actually use ANOVA rather than a T-test)
When can we use ANOVA?
Independence: The data should be collected from a representative and randomly selected portion of the total population and should be independent between groups and within each group.
Ask yourself if one observation is related to another (if one observation has an impact on another) within each group or between the groups themselves. If not, it is most likely that you have independent samples.
Normality: Your sample data should approximate a normal distribution. Less important for sample sizes > 30 but you should always check your data.
If your data are not normally distributed use the non-parametric alternative to ANOVA which is the Kruskal-Wallis test
Homogeneity of variances: The variances of the different groups should be equal in the populations.
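A minimal Python sketch (made-up plant heights for three species, echoing the example above) of a one-way ANOVA with scipy:

```python
# Minimal sketch (assumed data): one-way ANOVA on plant height for three species.
from scipy import stats

species_1 = [12.1, 13.5, 11.8, 14.0, 12.7]   # hypothetical heights (cm)
species_2 = [15.2, 16.1, 14.8, 15.9, 16.5]
species_3 = [11.0, 10.5, 11.9, 10.8, 11.3]

f_stat, p = stats.f_oneway(species_1, species_2, species_3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")  # p < 0.05 => at least one group mean differs
```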
What happens once ANOVA has been run and we know we can reject the null hypothesis?
ANOVA is an omnibus statistic that tests against the null hypothesis, which states that there is no difference between the means of each group.
If the null hypothesis is rejected (e.g. p < 0.05), it doesn’t tell us which of the groups are different from which, just that there is a difference between at least one of the groups and one other.
We need to undertake a post-hoc test to determine what those differences are
How do we interpret ANOVA?
The ANOVA test statistic is F (F ratio).
What we need to focus on in our output is the significance of this test statistic (p value). If the p value is < 0.05 then we can reject the null hypothesis and conclude that the means are significantly different.
However, we don’t usually want to stop here if we have a significant result. What we usually want to know is which means are different from which?
Following one-way ANOVA, how can we find out which means are different from which?
We use a post-hoc test, which takes into account the fact that multiple tests between means are needed, but deals with the problem by adjusting the significance level in some way so that the probability of observing at least one significant result due to chance remains below our desired global significance level (e.g. p < 0.05).
What are the two most common post-hoc tests?
Bonferroni correction: if you have a set of planned comparisons to do.
Tukey HSD: used to compare all groups to each other (so all possible comparisons of 2 groups)
What is the Bonferroni correction?
The Bonferroni correction is simple: you divide the desired global significance level, e.g. 0.05, by the number of comparisons.
Using the flower example:
we have 3 comparisons (i.e. 3 species) so if we want to keep a global significance level of 0.05 we have a new local significance level for each individual test of 0.05/3 = 0.0167. Once we have done that, then we can simply perform a student’s t-test for each comparison (as we did last week), and compare the obtained p-values with the new significance level ( p < 0.0167) to see if the difference is significant.
Essentially, instead of the significance level being 5% for each t-test, it’s now more stringent, i.e. 0.0167 (1.67%)! We don’t want to tolerate more than ~2 % error in each of our t-tests.
What is the Tukey HSD (Honest Significant Difference) Test?
This test compares all means to one another at the same time and outputs a series of p-values for each combination of groups, which has been adjusted to account for the need for multiple tests (we don’t need to know the maths behind it, just what has happened!).
It is these adjusted p-values that are used to test whether two groups are significantly different or not, and we can be confident that the entire set of comparisons collectively has an error rate of no more than 0.05
In a similar vein to all the previous tests we’ve talked about, for each pair of means, we decide whether we can reject the null hypothesis of no significant difference between the means, based on the reported p-values
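A minimal Python sketch (made-up data) of a Tukey HSD test using statsmodels' `pairwise_tukeyhsd`:

```python
# Minimal sketch (assumed data): Tukey HSD after a significant ANOVA.
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    "height":  [12.1, 13.5, 11.8, 15.2, 16.1, 14.8, 11.0, 10.5, 11.9],
    "species": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

result = pairwise_tukeyhsd(endog=df["height"], groups=df["species"], alpha=0.05)
print(result)  # adjusted p-values for every pairwise comparison of species
```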
What is the Pearson’s correlation coefficient usually denoted as and when is it used?
The correlation coefficient (Pearson’s correlation coefficient) usually denoted as r; is used to determine the strength of a relationship between two variables (X and Y). The correlation coefficient varies from -1 to 1. When:
r = -1 means that we have a perfect negative relationship
r = 1 means that we have a perfect positive relationship
r= 0 means that there is no relationship at all
-1 to -0.9 = very strong negative
-0.9 to -0.7 = strong negative
-0.7 to -0.4 = moderate negative
-0.4 to -0.2 = weak negative
-0.2 to 0 = negligible
0 to 0.2 = negligible
0.2 to 0.4 = weak positive
0.4 to 0.7 = moderate positive
0.7 to 0.9 = strong positive
0.9 to 1 = very strong positive
How do we interpret correlation?
You should always look at the scatterplot before attaching any interpretation to the data. A correlation might not mean what you think it means. The classic illustration of this is “Anscombe’s Quartet” (Anscombe 1973), a collection of four data sets, which all have the same r value but show very different patterns in their relationships!
What is the general form for the null hypothesis for a Pearson correlation?
H0: There is no significant [linear] association between the two variables
What are the outputs of the Pearson’s Correlation test?
Outputs from the Pearson Correlation test will include a p value that you can use to determine whether or not you can reject the null hypothesis. The r value is then used to determine the actual strength of the correlation.
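A minimal Python sketch (made-up variables) computing Pearson's r and its p-value with scipy:

```python
# Minimal sketch (assumed data): Pearson's correlation.
from scipy import stats

distance = [1.0, 2.5, 3.0, 4.2, 5.5, 6.1]     # hypothetical variable X
cost     = [2.1, 4.8, 6.2, 8.0, 11.1, 12.4]   # hypothetical variable Y

r, p = stats.pearsonr(distance, cost)
print(f"r = {r:.2f}, p = {p:.4f}")  # r gives strength/direction; p tests H0 of no association
```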
What is the Spearman’s rank-order correlation?
the non-parametric version of the Pearson’s correlation. Spearman’s correlation coefficient (usually denoted by rs), measures the strength and direction of association between two ranked variables.
When would you usually use Spearman’s rank-order correlation?
if your relationship was non-linear and/or your data were of the type “ordinal”.
What does spearman’s correlation determine?
the strength and direction of the monotonic relationship between your two variables rather than the strength and direction of the linear relationship between your two variables, which is what Pearson’s correlation determines.
What is a monotonic relationship?
A monotonic relationship is a relationship that does one of the following:
As the value of one variable increases, so does the value of the other variable; or
As the value of one variable increases, the other variable value decreases.
How do you interpret Spearman’s rank correlation coefficient?
Similar to the Pearson’s correlation coefficient, the Spearman correlation coefficient, rs, can take values from +1 to -1.
An rs of +1 indicates a perfect association of ranks, an rs of zero indicates no association between ranks and an rs of -1 indicates a perfect negative association of ranks.
The closer rs is to zero, the weaker the association between the ranks.
As for the Pearson Correlation test the outputs from the Spearman rank test will include a p value that you can use to determine whether or not you can reject the null hypothesis. The rs value is then used to determine the actual strength of the association.
What is the general form of a null hypothesis for a Spearman correlation?
H0: There is no significant [monotonic] association between the two variables
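A minimal Python sketch (made-up ordinal ratings) computing Spearman's rs with scipy:

```python
# Minimal sketch (assumed data): Spearman's rank correlation, suitable for
# ordinal data or non-linear but monotonic relationships.
from scipy import stats

satisfaction = [1, 2, 2, 3, 4, 5, 5]     # hypothetical ordinal ratings
visits       = [2, 3, 5, 4, 8, 12, 15]   # hypothetical counts

rs, p = stats.spearmanr(satisfaction, visits)
print(f"rs = {rs:.2f}, p = {p:.4f}")
```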
What are linear regression models?
basically a slightly fancier version of the Pearson correlation, but they are really useful tools.
What is a regression line?
a line of best fit drawn through the points on a scatter graph
the equation of a straight line is y = a + bx
The two variables are x and y, and the two coefficients are a and b. Coefficient a is the y-intercept (i.e. the value of y that you get when x = 0) and the coefficient b is the slope of the line (i.e. if you increase the x-value by 1 unit, the y-value goes up by b units)
We use a similar formula for a regression line
Ŷᵢ = b₀ + b₁Xᵢ
The subscript i just refers to a specific data point i.e. the ith observation
Y is the outcome (predicted) observation. It has a ^ over it so that we know that this is an estimate and not an actual measurement.
The whole point of regression is to “predict” something that we haven’t measured from something that we have.
The letters have all changed to b (because that’s just how statisticians like it!). Now b0 always refers to the intercept term and b1 refers to the slope
What is a residual?
Regardless of whether a regression model is good or bad, it’s never “perfect” - the data points never fall perfectly on the regression line. In other words, the predicted values of Y are never actually identical to the actual values of Y.
The difference between an actual and a predicted value of Y is called a residual.
When the regression line is good, our residuals should be small, but when the model is bad, our residuals are large. Consequently, the best-fitting regression line is one that has the smallest residuals
the technical name for this type of linear regression is ordinary least squares (OLS) regression.
How do we interpret OLS regression?
R2 and F statistics
When you run a linear regression what you get is a table which outputs a series of results.
Overall model quality: R2 and F statistics
The Model Fit Measures gives us the correlation coefficient (R) and the coefficient of determination (R2).
The coefficient of determination is the regression statistic. It represents the proportion of variance in the outcome variable (Y) that is explained by the model. So in this case my amount of sleep (predictor) can explain 81.6% of the variance in my grumpiness
This particular table also presents the F test statistic and its associated level of significance (p-value). As the table suggests, the F test is a hypothesis test which tests for overall model significance. Essentially it’s used as an initial indicator of whether the model is poor.
If the p value is > 0.05 - we cannot reject the null hypothesis and we should conclude that this is a poor model or that the data are poor. We should not proceed with the modelling.
If the p-value is < 0.05 - we can’t necessarily conclude that the model is good (yet!); we just know that it has passed this overall model test. We may still want to test the significance of the individual coefficients (or predictor variables).
What is the null hypothesis for the F test?
H0:The regression model is not significantly different to a model which has no predictor variables
How do we interpret OLS regression?
Importance of the predictors: Model coefficients and the T-statistic
The model coefficients are found in the Estimate column - i.e. an estimate of the model coefficients
Slope: The slope coefficient is -8.94. This means that if I increase my sleep by 1 hour (X value) then I will reduce my grumpiness (Y value) by 8.94 grumpiness points! The slope coefficient is related to your predictor variable i.e. amount of sleep.
Intercept: The intercept coefficient is 125.96. If I get zero hours of sleep (x = 0) then my grumpiness will be a whopping 125.96 (!), which is very grumpy!!
The t-statistic is testing the significance of the individual model coefficients.
The null hypothesis for the t-statistic is:
H0: The coefficient of the model is not significantly different from zero
If the coefficient is important to the model then the p value should be < 0.05 (i.e. reject the null hypothesis). This statistic is most useful when looking at the importance of individual predictor variables (i.e. slope coefficients) in a multiple regression model.
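A minimal Python sketch of a simple OLS regression with statsmodels; the sleep/grumpiness data here are simulated to roughly match the coefficients quoted above, not the original dataset:

```python
# Minimal sketch (simulated data): simple OLS regression with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
sleep = rng.uniform(4, 10, 100)                                # hours of sleep (predictor X)
grumpiness = 125.96 - 8.94 * sleep + rng.normal(0, 6, 100)     # outcome Y with noise

X = sm.add_constant(sleep)        # adds the intercept term b0
model = sm.OLS(grumpiness, X).fit()
print(model.summary())            # R^2, F statistic, coefficients, t-tests and p-values
```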
When do we use Multiple linear regression models?
Simple linear regression models assume that we only have a single predictor variable. However, in reality you often have multiple predictor variables that you want to examine. If that’s the case then you want to use a multiple regression model.
What is our multiple linear regression equation?
We need to add more terms to our regression equation
Ŷᵢ = b₀ + b₁Xᵢ₁ + b₂Xᵢ₂
How do we interpret multiple linear regression models?
Let’s take our previous example, but now we want to predict my level of grumpiness using the amount of sleep I’ve had AND the amount of sleep the baby has had. Now we have two X variables, i.e. the amount of sleep I got and the amount of sleep my son got.
As always b0 is the intercept; b1 is the coefficient associated with my sleep (-8.95) and b2 is the coefficient associated with my son’s sleep (0.01).
We can see that the R2 values are in fact similar to those of our simpler regression model. Even though the overall model is not bad (F(2, 97) = 215.24, p < 0.001), the results of the t-tests suggest that the amount of sleep that my son gets is not a significant predictor of my level of grumpiness (p > 0.05). Consequently, for this example I would conclude that I should just use a single regression model for this relationship, or at least that adding information on the amount of sleep my son gets does not add anything useful to our existing model!
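A minimal Python sketch (simulated data, not the original dataset) of the corresponding multiple regression using statsmodels' formula interface:

```python
# Minimal sketch (simulated data): multiple regression with two predictors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "my_sleep":   rng.uniform(4, 10, 100),
    "baby_sleep": rng.uniform(3, 12, 100),
})
df["grumpiness"] = (125.96 - 8.95 * df["my_sleep"] + 0.01 * df["baby_sleep"]
                    + rng.normal(0, 6, 100))

model = smf.ols("grumpiness ~ my_sleep + baby_sleep", data=df).fit()
print(model.summary())   # check the t-test p-value for each predictor separately
```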
What assumptions does linear regression modelling rely on?
Normality:
Like many of the models in statistics, basic simple or multiple linear regression relies on an assumption of normality. Specifically, it assumes that the model “residuals” are normally distributed. It’s actually okay if the predictors X and the outcome Y are non-normal, so long as the residuals are normal!
Linearity:
A pretty fundamental assumption of all linear regression models is that the relationship between X and Y actually is linear!
Homogeneity of variance:
Strictly speaking, the regression model assumes that each residual is generated from a normal distribution with mean 0, and (more importantly for the current purposes) with a standard deviation that is the same for every single residual. In practice, it’s impossible to test the assumption that every residual is identically distributed. Instead, what we care about is that the standard deviation of the residual is the same for all values of Y, and (if we’re being especially paranoid) all values of every predictor X in the model.
Uncorrelated predictors:
The idea here is that, in a multiple regression model, you don’t want your predictors to be too strongly correlated with each other. This isn’t “technically” an assumption of the regression model, but in practice it is required. Predictors that are too strongly correlated with each other (referred to as “collinearity”) can cause problems when evaluating the model as it is hard to detangle which variables are meaningfully important from an interpretation perspective, rather than just identified as important statistically, because they happen to be correlated with a meaningful variable!
Residuals are independent of each other:
This is really just a “catch all” assumption, to the effect that “there’s nothing else funny going on in the residuals”. If there is something weird (e.g., the residuals all depend heavily on some other unmeasured variable) going on, it might mess things up
No “bad” outliers:
Again, not actually a technical assumption of the model (or rather, it’s sort of implied by all the others), but there is an implicit assumption that your regression model isn’t being too strongly influenced by one or two anomalous data points because this raises questions about the adequacy of the model and the trustworthiness of the data in some cases.