Quantitative Flashcards
Define an observational study.
A study that does not include an intervention or experiment, only observation of natural relationships between factors and outcomes.
What are some types of observational studies?
cross-sectional, longitudinal, case-control, cohort and survey studies.
Describe a cross-sectional study.
A study that looks at a population (or a sample of it) at a single point in time.
Describe a longitudinal study.
A study that uses repeated measures over a long period of time.
Describe a case-control study.
A study that compares people who have an outcome (cases) with people who do not (controls) and looks back at their previous exposures. Usually a ‘retrospective study’.
Describe a cohort study.
Almost the opposite of a case-control study. A study that follows a population with a known exposure over time to identify whether an outcome develops or not.
Describe a survey study.
A study that uses surveys to collect data from participants. Particularly useful for collecting data from a geographically widespread population.
Define an interventional study.
A study that employs manipulation of a variable to define the outcome of this intervention on a specific population. Also known as experimental studies.
What are some types of interventional studies?
randomised control trials, pre-post studies, and non-randomised control trials.
Describe a randomised control trial.
A trial where subjects are randomly assigned to one of two (or more) groups, either the experimental or the control group. The outcomes of the groups are then compared.
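A minimal sketch of random assignment, assuming a simple two-group design with equal allocation. The function name randomise and the subject IDs 1 to 20 are made up for illustration.

```python
import random

def randomise(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # random order removes any selection pattern
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}

# Hypothetical subject IDs 1..20
groups = randomise(range(1, 21), seed=1)
print(groups)
```

Every subject ends up in exactly one group, and which group is decided by chance rather than by the researcher.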
What are some features of a well-designed RCT?
a large enough sample to allow generalisation of results
concealed randomisation of the subjects to each group
both groups are treated identically by researchers
analysis is focused on the research question
Describe a pre-post study.
A study that measures the occurrence of an outcome before and again after a particular intervention is implemented.
Why is a pre-post study not as strong as an RCT?
Pre-post studies suffer from poor internal validity because, unlike an RCT, they cannot control for every variable that may be responsible for the outcome of an intervention.
Describe a non-randomised trial.
Similar to an RCT in that there is an intervention group and a control group; however, participants are not randomised into these groups.
Why is a non-randomised trial not considered a strong study design?
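Without randomisation the groups may differ systematically from the start, so the results can suffer from selection bias and confounding.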
What is a variable?
an attribute that can vary or change between individuals or objects.
What are some different types of variables?
numeric (discrete or continuous), categorical (nominal or ordinal).
What is a numeric variable?
a variable that has a measurable value described by a number
What is the difference between a discrete and continuous variable?
a discrete variable takes only whole-number values (e.g. 1 child) whereas a continuous variable can take values between units (e.g. 55.4 kg).
What is a categorical variable?
A variable that may be divided into groups (e.g. race, sex, age group).
What is the difference between a nominal and an ordinal variable?
nominal variables have no natural order (e.g. gender), whereas ordinal variables can be ordered (e.g. satisfaction with treatment: 1 = not satisfied, 2 = slightly satisfied, 3 = moderately satisfied, 4 = very satisfied).
What is the difference between an interval scale and a ratio scale?
A ratio scale has a true zero point (e.g. weight, height) whereas an interval scale has an arbitrary zero point (e.g. temperature in degrees Celsius).
When might we see a bimodal distribution of data on a histogram?
When there are two distributions mixed together, e.g. heights of males and females on the same histogram.
When the distribution of data on a histogram is skewed, which direction do we name this for?
towards the tail, so a distribution with a tail to the right will be skewed to the right or positively skewed
When would we use a density plot over a histogram to visualise data?
When we need a better understanding of the data density. Histograms can vary in their picture depending on how many ‘bins’ are chosen.
It is also possible to overlay density plots making comparing of two data groups possible.
When might histograms be chosen over density plots?
When the data must be visualised by hand: density plots are difficult to draw and need software to produce.
What is the difference between mean and median?
The mean is the arithmetic average of a data set (the sum of the values divided by the number of values) whereas the median is the middle value once the data are sorted.
What is the median also known as?
the 50th percentile or the 0.5 quantile
How do we calculate a 5-number summary for a data set?
By finding the minimum, first quartile, median, third quartile and maximum.
minimum= smallest number
1st quartile= median of the values below the median
median= middle number
3rd quartile= median of the values above the median
maximum= largest number
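A short sketch of the median-of-halves method described above. The function name five_number_summary and the scores are illustrative. Statistical software may use a different quartile convention and give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # values below the median
    upper = data[half + 1:] if n % 2 else data[half:]    # values above the median
    return (data[0], median(lower), median(data), median(upper), data[-1])

scores = [7, 1, 3, 9, 5, 11, 13]    # hypothetical data
summary = five_number_summary(scores)
print(summary)
```

When the number of values is odd, the median itself is left out of both halves before the quartiles are found.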
What is the interquartile range?
The distance between the first and third quartiles (third quartile minus first quartile).
It spans the middle 50% of the data.
What is the p quantile?
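the value below which a proportion p of the data falls (the same as the 100 x p percentile). For example, the 0.5 quantile is the median and the 1.0 quantile (100th percentile) is the maximum.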
What kind of distribution is more likely to have observations flagged as unusual or outliers in a box & whisker plot?
a skewed distribution.
What data is used in a box & whiskers plot?
a 5-number summary
What is an outlier?
a data value that does not seem to match the overall distribution observed
it could be either a genuine observation or a data entry error which is why they are marked in SPSS for review
When might we see a number flagged as a possible outlier in a box & whisker plot?
if that number/data point is more than 1.5 times the interquartile range above the third quartile or below the first quartile.
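A sketch of the 1.5 x IQR rule, applied on both sides of the box. The helper names quartiles and flag_outliers and the data are illustrative, and the quartiles use the median-of-halves method, which may differ slightly from what software reports.

```python
from statistics import median

def quartiles(values):
    """First and third quartiles by the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(upper)

def flag_outliers(values):
    """Return the values lying beyond 1.5 x IQR from the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [1, 3, 5, 7, 9, 11, 40]      # hypothetical data with one extreme value
print(flag_outliers(data))
```

Here Q1 = 3 and Q3 = 11, so the IQR is 8 and the fences sit at -9 and 23; only 40 falls outside them.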
True or false: a skewed data set is more likely to flag possible outliers on a box & whisker plot.
True
What type of variables are most likely going to be visualised using bar charts?
categorical (nominal or ordinal)
Why should pie charts be avoided for visualising categorical data?
they are harder to interpret reliably, because people judge angles and areas less accurately than the lengths of bars
If data distribution is skewed, the mean will be pulled in which direction?
towards the tail
What does it mean to say that the mean/average is susceptible to the presence of outliers?
It means that outliers can pull the mean towards them, so that it no longer represents a typical value in the data set.
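A small illustration using made-up values: adding one extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

values = [30, 32, 35, 38, 40]       # hypothetical data
with_outlier = values + [400]       # the same data plus one extreme value

print(mean(values), median(values))
print(mean(with_outlier), median(with_outlier))
```

The mean jumps from 35 to about 95.8, while the median only moves from 35 to 36.5.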
What are some advantages of using a mean rather than median to discuss a data set?
the mean tends to be more powerful than the median because it takes into account every piece of data
the mean has a rich theory behind it, through the central limit theorem, which makes it very useful in practice
What are some disadvantages of using a mean rather than median to discuss a data set?
the mean does not carry meaningful quantitative information for data gathered from nominal or ordinal scales
the mean is sensitive to extreme values
What is variance?
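a measure of how far observations deviate from the mean, calculated as the average of the squared deviations from the mean (its square root is the standard deviation)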
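A worked sketch with made-up values. It assumes the sample variance (dividing by n - 1), which is what statistics.variance in the Python standard library computes.

```python
from math import sqrt
from statistics import mean, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
m = mean(values)

# Square each deviation from the mean, then average (using n - 1 for a sample).
squared_deviations = [(x - m) ** 2 for x in values]
sample_variance = sum(squared_deviations) / (len(values) - 1)
standard_deviation = sqrt(sample_variance)

print(sample_variance, standard_deviation)
```

The squared deviations sum to 32, so the sample variance is 32 / 7.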
What is the 68 - 95 - 99.7 rule?
For any normal distribution, the area within 1 standard deviation of the mean is approximately 68%, the area within 2 standard deviations of the mean is approximately 95% and the area within 3 standard deviations of the mean is approximately 99.7%.
This rule is used to make statements about data that has a normal distribution, e.g. what range of values would include 95% of subjects.
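The rule can be checked with the standard normal distribution from the Python standard library. The function name area_within and the height figures (mean 170 cm, standard deviation 10 cm) are assumptions for illustration.

```python
from statistics import NormalDist

def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()                # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# Hypothetical heights: mean 170 cm, sd 10 cm.
# About 95% of subjects fall within 2 sd of the mean.
mean_height, sd_height = 170, 10
range_95 = (mean_height - 2 * sd_height, mean_height + 2 * sd_height)
print(range_95)
```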
How can we tell if data has a normal distribution?
data that looks symmetrical on a histogram
data that matches up with a normal quantile plot
Why is the Normal distribution our friend?
We can use it to:
Describe the distribution of observations, such as height.
Describe the distribution of statistics, such as the sample mean.
What is the Student’s t-test and when would we use it?
a test to determine if there is a significant difference between the means of two groups or populations. It is typically used when the sample sizes are small and the population standard deviation is unknown.
What is a t-value?
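a ratio of the difference between the means of two groups to the standard error of that difference (which reflects the variation within each group and the sample sizes).
a larger t-value (in absolute terms) suggests a larger difference between the means relative to the variation, and a smaller probability that a difference this large would arise by chance alone.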
What kind of distribution and data is a t test suitable for?
a normal distribution, continuous
What are the two types of t-tests? What are they used for?
independent samples t-test: used to compare the means of two independent groups
paired (dependent) samples t-test: used to compare the means of related groups (typically based on before-and-after measurements or matched subjects)
How do we calculate the degrees of freedom (df) in a t-test?
for an independent samples (pooled) t-test: sample 1 size + sample 2 size - 2
for a paired samples t-test: number of pairs - 1
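A sketch of the t-value and degrees of freedom for an independent samples t-test, assuming equal variances (the pooled version). The function name pooled_t and the two groups are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """t-value and df for an independent samples t-test with pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / standard_error
    return t, df

group_a = [5, 6, 7, 8, 9]           # hypothetical data
group_b = [1, 2, 3, 4, 5]
t_value, df = pooled_t(group_a, group_b)
print(t_value, df)
```

The means differ by 4 and the standard error works out to 1, so the t-value is 4 on 5 + 5 - 2 = 8 degrees of freedom.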
What are pooled t-tests appropriate for? What about a Welch t-test?
types of independent two-sample t-tests
pooled t-test is used if the two populations being compared can be assumed to have equal variances (as suggested by a Levene’s test with an outcome that is not significant)
Welch t-test is used if the two populations being compared do not have equal variances (as suggested by a Levene’s test with an outcome that is significant)
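A sketch of the Welch version, which does not pool the variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. The function name welch_t and the data are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample1, sample2):
    """t-value and approximate df for a Welch (unequal variances) t-test."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1     # squared standard error of mean 1
    v2 = variance(sample2) / n2     # squared standard error of mean 2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [5, 6, 7, 8, 9]           # hypothetical data, variance 2.5
group_b = [0, 3, 3, 3, 6]           # hypothetical data, variance 4.5
t_value, df = welch_t(group_a, group_b)
print(t_value, df)
```

With unequal variances the Welch degrees of freedom come out below the pooled value of n1 + n2 - 2, which makes the test more cautious.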
What are type 1 and type 2 errors?
type 1 error (also known as a false positive) is the error of rejecting the null hypothesis when it is actually true.
type 2 error (also known as a false negative) is the error of not rejecting the null hypothesis even though it is false
What is central limit theorem and how is it useful?
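if a sample is sufficiently large (as a rule of thumb, sample size >20) and the observations are independent, the distribution of the sample mean will be approximately normal, even if the data themselves are not normally distributed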
it is useful to allow us to use tests that assume a normal distribution
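A small simulation sketch of the idea. It assumes strongly right-skewed data (an exponential distribution with mean 1), a sample size of 30 and 2000 repeated samples; all of these choices, and the helper names, are illustrative.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)              # fixed seed so the run is repeatable

def skewness(values):
    """Simple skewness measure: about 0 for symmetric data."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, repeats=2000):
    """Means of many samples of size n drawn from a right-skewed distribution."""
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(repeats)]

raw = [rng.expovariate(1.0) for _ in range(2000)]   # individual observations
means_of_30 = sample_means(30)                      # sample means, n = 30

print("skewness of raw data:    ", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(means_of_30), 2))
```

The raw observations are heavily skewed, but the sample means are much closer to symmetric and cluster around the true mean of 1.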
When are non-parametric tests used?
also known as distribution-free tests, used when data is not normally distributed and the central limit theorem cannot be relied on (small sample size)
‘free from parameters’: a t-test is a parametric test because it estimates parameters (e.g. population means) using statistics
What are some non-parametric tests? When are each appropriate?
The Mann-Whitney U test (equivalent to the Wilcoxon Rank Sum test), used to compare two independent groups
The Kruskal-Wallis test, used to compare more than two independent groups (better for outliers or ordinal data than ANOVA)
The chi-square test, used to test the association between two categorical variables
How are non-parametric tests protected from outliers?
by ranking the values (i.e. each observation is replaced by its rank rather than its actual value, so an extreme value is only ever one rank away from its neighbour)
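A sketch of ranking, assuming tied values share the average of their ranks (a common convention). The function name average_ranks and the data are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1     # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

print(average_ranks([10, 20, 20, 50]))
print(average_ranks([10, 20, 20, 1000]))   # extreme value, same ranks
```

Replacing 50 with 1000 changes the raw data a lot but leaves the ranks identical, which is why rank-based tests are resistant to outliers.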
When is an ANOVA test used? (Analysis of Variance) What assumptions are made to use this kind of test?
to compare means between three or more groups
assumes the data in each group are normally distributed, the observations are independent and the variances of the groups are equal
What kind of data is a chi-square test suitable for?
categorical
What is a chi-square test for?
it is used to determine if there is a significant association/dependence or independence between two categorical variables
e.g. ‘is there a significant relationship between gender and voting preference?’
What value does a chi-square test give us?
a chi-square statistic, which can then be compared with a critical value (or converted to a p-value and compared with the significance level) to determine whether the association is significant or could plausibly have occurred by chance
What are the degrees of freedom (df) in a chi-square test?
(number of rows - 1) x (number of columns - 1)
What is the difference between the critical value and p-value in a chi-square test?
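the critical value is a cut-off taken from the chi-square distribution for the chosen significance level and degrees of freedom, and the chi-square statistic is compared against it; the p-value is the probability of seeing a statistic at least this large if there were really no association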
How do we know if a chi-square test has told us the data is significant?
the chi-square statistic will exceed the critical value, or the p-value will be below the significance level (usually 0.05).
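A sketch of the chi-square statistic and degrees of freedom for a table of counts. The function name chi_square and the 2 x 2 table (gender by voting preference) are made up, and no continuity correction is applied, so software output for a 2 x 2 table may differ.

```python
def chi_square(table):
    """Chi-square statistic and df for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: rows = gender, columns = voting preference
table = [[30, 20],
         [20, 30]]
statistic, df = chi_square(table)

CRITICAL_VALUE = 3.841    # chi-square critical value for df = 1 at the 0.05 level
print(statistic, df, statistic > CRITICAL_VALUE)
```

Every expected count here is 25, so the statistic is 4 on 1 degree of freedom, which exceeds the critical value of 3.841.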
What is a Pearson Correlation coefficient and what does it tell us?
denoted by r
a unit-less measure that ranges from -1 to +1
tells us whether there is a strong, moderate or weak, positive or negative, linear correlation between two sets of data (it only measures linear relationships, so a curved relationship can give a value near 0)
e.g. a scatterplot with points lying close to an upward-sloping straight line has a strong, positive correlation coefficient whereas a scatterplot with points lying close to a downward-sloping straight line has a strong, negative correlation coefficient.
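A sketch of how r is calculated from the deviations of each variable about its mean. The function name pearson_r and the data are illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

upward = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])      # perfect upward line
downward = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # perfect downward line
curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared

print(upward, downward, curved)
```

The curved example has a perfect relationship (y is x squared) yet r is 0, because r only picks up straight-line association.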
What is an R-squared value?
the Pearson correlation coefficient squared (in simple linear regression).
the proportion of all the variability that is explained by the differences between groups
so how much of the variability can be explained by the data in question (e.g. how much variability in SPPB score can be attributed to levels of physical activity)
calculated as the sum of squares between groups divided by the total sum of squares, and presented as a proportion or a percentage (e.g. 0.4 or 40%).
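A sketch of the between-groups calculation. The function name r_squared and the three groups of scores are made up for illustration.

```python
from statistics import mean

def r_squared(groups):
    """Sum of squares between groups divided by the total sum of squares."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical scores for low, moderate and high activity groups
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(r_squared(groups))
```

Here the between-groups sum of squares is 54 and the total is 60, so 0.9 (90%) of the variability is explained by group membership.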
What is a Spearman correlation coefficient and what does it tell us?
denoted as Spearman’s rho
tells us about the correlation between data just like a Pearson correlation coefficient, but first ranks the observations in each variable separately (much like non-parametric methods, this protects against outliers)
useful for when data is ordinal or there are outliers
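A sketch of Spearman’s rho as the Pearson correlation of the ranks. For brevity it assumes there are no tied values. The function names and the data are illustrative.

```python
from math import sqrt
from statistics import mean

def ranks(values):
    """Rank from 1 (smallest). Assumes no tied values."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson correlation applied to the ranks of each variable."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]                 # hypothetical data
y = [1, 4, 9, 16, 1000]             # always increasing, with one extreme value

print(pearson_r(x, y), spearman_rho(x, y))
```

Because y always increases with x, the ranks agree perfectly and rho is 1, while the extreme value drags the Pearson coefficient down to about 0.72.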
What is linear regression and what does it tell us?
like correlation, it is a method used to describe the relationship between a dependent variable and one or more independent variables
aims to establish the straight line that best fits the data points and predicts the value of the dependent variable based on the values of the independent variable
assumptions: independent observations, linear association, normally distributed residuals, constant variability of the residuals
come back to residuals???
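A sketch of fitting a straight line by least squares with one independent variable. The function name fit_line and the data are made up for illustration. A residual is the observed value minus the value predicted by the line.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4, 5]                  # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.0]       # hypothetical dependent variable
slope, intercept = fit_line(x, y)

# Residual = observed value minus predicted value.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print(slope, intercept)
print(residuals)
```

Plotting the residuals against x is one common way to check the linearity and constant variability assumptions by eye.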