Data Interpretation Flashcards

1
Q

What should be included in a good figure caption

A

A reader should be able to interpret the results from the figure, caption and legends alone, without having to read the associated text

Brief description of the treatment conditions
Brief description of results
Statistical tests
Number of data points used to create graphs
What error bars and points represent

2
Q

What is the difference between descriptive and inferential statistics

A

Descriptive

  • describes and summarises a data set
  • calculations are made without uncertainty

Inferential

  • inference about a parameter of interest in the population, based on what is observed in a sample
  • calculations are estimated with a degree of uncertainty
3
Q

What are the 2 forms of inferential statistics

-what is the difference between them

A

Estimation
-estimation of a population parameter of interest from the value observed in the sample

Hypothesis testing
-a way to test for differences in the parameter of interest between groups, producing a p value

4
Q

What are the 2 forms of estimation

A

Point estimate
-single value (best guess) of the parameter in the population

Interval

  • defined by 2 numbers between which the population parameter is said to lie
  • examples include confidence intervals
5
Q

What are the types of variable

-what are examples of each type

A
Categorical (qualitative)
-classified into categories

Binary (2 categories) - e.g. sex
Nominal (2+ categories, without intrinsic ordering) - e.g. ethnicity
Ordinal (categories ordered from low to high) - e.g. age groups

  • spacing between values does not have to be consistent

Numeric (quantitative)

Discrete - e.g. cell counts
Continuous - e.g. height

6
Q

What operators would be used for these variables

  • qualitative
  • ordinal
  • quantitative
A

Qualitative - = or ≠
Ordinal - < or >
Quantitative - +, -, ×

7
Q

How would you work out the average and spread for these variables

  • qualitative
  • ordinal
  • quantitative
A

Qualitative

  • mode
  • frequency distribution - frequency tables, graphs (bar charts, pie charts)

Ordinal

  • median - the 50th percentile
  • absolute ranges - range between max and min value
  • percentiles - the value below which a certain percent of observations fall
  • IQR - the range between the 25th and 75th percentile

Quantitative

  • mean
  • variance - average squared deviation of each observation from the mean
  • SDs - square root of the variance
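A rough illustration (not from the lecture): a minimal Python sketch with made-up values showing how these summaries are computed.

```python
import numpy as np

values = np.array([4.1, 5.0, 5.2, 5.9, 6.3, 7.8, 12.4])  # made-up continuous data

mean = values.mean()                         # quantitative: mean
variance = values.var(ddof=1)                # average squared deviation (sample variance)
sd = values.std(ddof=1)                      # SD = square root of the variance

median = np.median(values)                   # ordinal/quantitative: 50th percentile
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25                              # range between the 25th and 75th percentiles
absolute_range = values.max() - values.min() # max minus min

print(mean, variance, sd, median, iqr, absolute_range)
```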
8
Q

How are the mean and median influenced by extreme values

A

The mean is affected by extreme values
The median is not

Where there are extreme values, the median is a better measure of the average

9
Q

Describe what the mode, median and mean would be in a normal distribution

A

Bell curve

All 3 would be the same

10
Q
Describe what % of observations would fall within
-1SD
-2SD
-3SD
in a normal distribution
A

1SD - 68% of observations would be within 1SD from the mean

2SD - 95% of observations would be within 2SD from the mean

3SD - 99.7% (virtually all) of observations would be within 3SD from the mean

11
Q

What are the 2 possible skewed distribution curves

-how will this affect the mean and median

A

Negatively skewed - the longer tail of distribution points in a negative direction
-mean is less than the median

Positively skewed - the longer tail of distribution points in a positive direction
-mean is more than the median

12
Q

What are the dangers associated with categorising continuous variables
But why is this done

A

Done to improve clinical interpretation of results

However this may lead to

  • a loss of information, leading to a loss of statistical power (loss of ability to detect a difference)
  • results become dependent on the choice of cut-offs, which is problematic when the cut-offs are not based on a strong a priori rationale
13
Q

What is the difference between a population vs sample

-when is the use of a sample ok

A

Population
-the whole group we are interested in

Sample

  • too time consuming and expensive to contact the whole population
  • representative group taken from population
  • use this sample to infer information about the whole population

Sample results are appropriate when they are

  • valid - sample is representative of population
  • accurate - sample size is large enough
14
Q

How do standard error, standard deviation and confidence interval differ from each other

A

Standard error

  • describes the accuracy of the point estimate
  • used to calculate CI
  • 95% confidence interval indicates the range of values likely to include the true value in the population

Standard deviation

  • measure of spread (variability)
  • used for descriptive statistics to calculate intervals showing variability in the data
  • for data sets with a normal distribution, 95% of data points will fall within 2SDs of the mean

Confidence interval
-estimated range of values likely to include the true unknown value in the population

15
Q

Why do we use 95% confidence intervals

-what does this mean

A

Good compromise between 90% and 99%

95% CI
-if we repeated the same sampling from the same population 100 times, 95 of the 100 CIs would contain the true population parameter

16
Q

How would we calculate the standard error of the mean

A

SE of the mean = SD of sample / √(sample size)

17
Q

How would you calculate the 95% CI of the mean

-what are we assuming with this calculation

A

sample mean ± (1.96 × SE)

We expect 95% of sample means to lie within 1.96 × SE of the population mean

We assume that

  • the means of all samples that could be drawn from our population follow a normal distribution
  • SE corresponds to the SD of all sample means (this is different from the SD of the original data)
  • distribution follows the 3sigma rule
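A minimal Python sketch (made-up sample, numpy assumed) combining the SE of the mean from the previous card with the 95% CI formula above.

```python
import numpy as np

values = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.1])  # made-up sample

mean = values.mean()
sd = values.std(ddof=1)
se = sd / np.sqrt(len(values))   # SE of the mean = sample SD / sqrt(n)

ci_low = mean - 1.96 * se        # 95% CI = mean +/- 1.96 x SE
ci_high = mean + 1.96 * se
print(f"mean {mean:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```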
18
Q

How does sample size impact on the accuracy of your estimate

A

SE is inversely proportional to the square root of the sample size

So the larger the sample size, the greater the accuracy: lower SE and narrower CI

19
Q

How to calculate SE for binary variables

-when would you use this

A

SE = √[p(1-p)/n]

p = sample proportion
n = sample size

SE of proportion is an approximation and only of use if the sample size is large
-a large sample size allows us to assume that the estimate is from a normal distribution and the SE is well estimated

np and n(1-p) should exceed 5 for the SE to be a good approximation
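A minimal Python sketch with made-up counts; the assertion mirrors the np and n(1-p) > 5 rule of thumb, and the final line anticipates the CI formula on the next card.

```python
import math

successes, n = 30, 120        # made-up counts
p = successes / n             # sample proportion

# validity check: np and n(1-p) should both exceed 5
assert n * p > 5 and n * (1 - p) > 5

se = math.sqrt(p * (1 - p) / n)          # SE of the proportion
ci = (p - 1.96 * se, p + 1.96 * se)      # 95% CI of the proportion
print(p, se, ci)
```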

20
Q

How would you calculate the 95% CI of the proportion

A

p ± (1.96 × SE)

21
Q

What is a

  • hypothesis
  • hypothesis testing
A

Hypothesis - a statement about the true value of parameters and the relationship in a defined population

Hypothesis testing - procedure, based on the observed values of the parameters in a sample of the population, to determine whether the hypothesis is a reasonable statement

22
Q

What are the steps involved in hypothesis testing

A

Define the hypotheses
Perform the test and calculate the test statistic
-the measure that summarises the difference or relationship you want to test

Estimate the p value
-tells you whether or not to reject H0

Interpret test results

23
Q

What is the difference between the

  • null hypothesis H0
  • alternative hypothesis H1
A

H0

  • assumed to be true
  • there is no true difference or relationship between the observed values in the sampled population

H1
-there is a true difference between the observed values in the sampled population

24
Q

What is the difference between a 2 sided and 1 sided alternative hypothesis

A

2 sided

  • the difference can be in either direction
  • default

1 sided

  • the difference can be in 1 direction only
  • recommended if there is strong supporting evidence that the effect can be in one direction only
25
Q

For qualitative variables, which statistical test would you use

  • unpaired
  • paired
A

Unpaired - Pearson X2

Paired - McNemar X2

26
Q

For ordinal variables, which statistical test would you use

  • unpaired
  • paired
A

Unpaired - Mann Whitney U

Paired - Sign

27
Q

For quantitative unpaired tests, which statistical test would you use

  • parametric
  • non-parametric
A

Parametric - Student’s T test

Non-parametric - Mann Whitney U

28
Q

For quantitative paired tests, which statistical test would you use

  • parametric
  • non parametric
A

Parametric - Student’s T test

Non-parametric - Wilcoxon signed-rank test

29
Q

What does the
-p value
-α
tell you

A

p value
-the probability of obtaining a result at least as extreme as the one observed in our sample by chance alone, when H0 is true

α

  • an arbitrary cut-off to help us evaluate the size of p
  • the probability of committing a type 1 error
  • with α = 0.05, a 5% chance of rejecting H0 incorrectly

Important to note that statistically significant results are not always clinically significant. Clinically significant results are not always statistically significant

30
Q

How would you interpret the p value

  • <0.05
  • <0.001
  • 0.049 vs 0.051
A

<0.05
-unlikely that the results are due to chance alone => reject H0, significant result

<0.001
-strong evidence for a significant result

0.049 or 0.051
-'borderline' values; results either side of the 0.05 cut-off carry similar strength of evidence, so interpret them cautiously rather than treating the cut-off as absolute
31
Q

What is the difference between a

  • type 1 error (α error)
  • type 2 error (β error)
A

Type 1 (α)

  • occurs when a statistical test rejects a true H0
  • false positive
  • the significance level is set at 5%, so every time we carry out a test, we have a 5% chance of a false positive

Type 2 (β)

  • test fails to reject a false H0
  • false negative
  • β = 1 - power
  • power set at 80% => β = 0.2

The type 2 error can be decreased by increasing the sample size of the study => increases the power to detect an effect, if a true effect exists

32
Q

What is the problem with multiple tests within the same dataset
-what are some examples of multiple testing

A

Each time you perform a test at the conventional significance threshold α = 0.05, you accept a 5% chance of a false positive

The more tests you perform in the same dataset, the higher the number of false positives you have to accept

This applies when you have

  • multiple outcomes you are measuring
  • a single outcome in multiple subgroups
  • multiple exposures
33
Q

How would you avoid multiple testing

A

Limit the number of primary outcomes when designing the investigation

If this isn’t possible
Present findings as exploratory findings
-this requires further confirmation because the sample size was calculated on the primary outcome only => high risk of false positives due to multiple testing

Combine multiple outcomes into 1 composite outcome

At the statistical analysis stage - Bonferroni correction
-this method corrects the significance level α for n (the number of tests performed) according to the formula α/n
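A small illustrative Python sketch (made-up p values) of the Bonferroni correction described above.

```python
# Bonferroni: compare each of the n p values against alpha / n
alpha = 0.05
p_values = [0.001, 0.020, 0.013, 0.350]   # made-up p values from 4 tests

threshold = alpha / len(p_values)         # corrected significance level
significant = [p < threshold for p in p_values]
print(threshold, significant)
```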

34
Q

How well does the Bonferroni correction perform with

  • independent hypotheses
  • correlated hypotheses
A

Independent (same hypothesis tested in several different groups of subjects from the same sample)
-performance is good, i.e. we have a good correction

Correlated (same hypothesis tested on related outcomes or exposures)

  • performance is poor, i.e. we have an overconservative correction
  • apply a modified Bonferroni that accounts for the correlation between tests
35
Q

How would you avoid getting false negatives

A

False negatives result from low power (small sample sizes) and are wrongly interpreted as evidence of no effect

Can be avoided if a priori sample size calculations are performed

Must also account for

  • possible drop outs
  • the need to correct for multiple testing if necessary
  • the minimal detectable difference has to reflect clinically meaningful effects
36
Q

What is the problem with studies with a sample size that is

  • too large
  • too small
A

Too large => likely to detect differences that are statistically significant but too small to be clinically meaningful

Too small => likely to produce false negatives
-aim to design the investigation to avoid this where possible

37
Q

How would you calculate the sample size of each group

A

N in each group =
f(α, β) × 2SD² / (m2 - m1)²

α = probability of wrongly rejecting H0 (type 1 error)
β = type 2 error (1 - power)
m1, m2 = means of the 2 groups
SD = standard deviation of the outcome
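An illustrative Python sketch with made-up values, assuming f(α, β) expands to (z(1-α/2) + z(1-β))², which is the usual form of this factor.

```python
from scipy.stats import norm

alpha, power = 0.05, 0.80
sd = 10.0                 # assumed SD of the outcome (made-up)
m1, m2 = 50.0, 55.0       # minimal detectable difference of 5 units (made-up)

# f(alpha, beta) = (z_{1-alpha/2} + z_{1-beta})^2, where beta = 1 - power
f = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

n_per_group = f * 2 * sd**2 / (m2 - m1) ** 2
print(round(n_per_group))   # roughly 63 per group with these made-up numbers
```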

38
Q

Describe the relationship between the

  • minimal detectable difference and the sample size
  • minimal detectable difference and power
A

If we halve the size of the minimal detectable difference => we quadruple the sample size

If we increase the minimal detectable difference, we increase the power

39
Q

How would you do a power calculation

A

Z_power = [|m2 - m1| / SD] × √(n/2) - Z_(α/2)
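An illustrative sketch with made-up values (scipy assumed), converting Z_power back to a probability with the normal CDF.

```python
from scipy.stats import norm

sd, m1, m2 = 10.0, 50.0, 55.0   # made-up values
n = 63                          # per group
alpha = 0.05

z_power = abs(m2 - m1) / sd * (n / 2) ** 0.5 - norm.ppf(1 - alpha / 2)
power = norm.cdf(z_power)       # convert the Z value back to a probability
print(round(power, 2))          # about 0.80 with these made-up numbers
```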

40
Q

What is the difference between

  • unpaired
  • paired variables
A

Unpaired (independent)

  • variable of interest has been measured once in each subject
  • the groups we want to compare are independent as they include separate subjects

Paired (dependent)

  • variable of interest has been measured more than once in the same subjects but at 2 different time points
  • we aim to compare the values of the variable within the same subject
41
Q

What is the

  • H0 for x2 test
  • formula
  • interpretation
A

H0: the observed frequencies are the same as the frequencies that would be expected if there were no association between the 2 variables

X2 = sum over all cells of [(O-E)^2/E]

The X2 value and the degrees of freedom are needed to find the p value
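A minimal sketch using a made-up 2x2 table; scipy's chi2_contingency handles the (O-E)²/E calculation and the expected frequencies.

```python
import numpy as np
from scipy.stats import chi2_contingency

# made-up 2x2 table of observed frequencies: rows = exposure, columns = outcome
observed = np.array([[30, 70],
                     [10, 90]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # X2 and the degrees of freedom give the p value
```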

42
Q

When would you use Fisher’s exact test

-what is the formula for Fishers exact test

A

X2 should not be used if you have a
-small sample size, or
-20%+ of cells with expected frequencies under 5

For a 2x2 table, X2 should not be used if even 1 cell has an expected frequency under 5

Interpretation of Fisher's test is the same as for X2

p = (a+b)!(c+d)!(a+c)!(b+d)! / (n! a! b! c! d!)
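A minimal sketch with a made-up small-sample 2x2 table, using scipy's fisher_exact rather than evaluating the factorial formula by hand.

```python
from scipy.stats import fisher_exact

# made-up 2x2 table with small expected frequencies, where X2 would be unreliable
table = [[3, 7],
         [9, 2]]

odds_ratio, p = fisher_exact(table)
print(odds_ratio, p)
```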

43
Q

What is McNemar’s test formula

-what is it testing

A

X2 = (|b - c| - 1)^2 / (b + c)

Looks at the discordant pairs - those who move from 1 category to the other (cells b and c of the paired 2x2 table)
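A minimal sketch with made-up discordant counts, applying the formula above and reading the p value from the chi-squared distribution with 1 degree of freedom.

```python
from scipy.stats import chi2

# paired binary data: b and c are the discordant pairs (made-up counts)
b, c = 15, 6                           # changed in one direction vs the other

x2 = (abs(b - c) - 1) ** 2 / (b + c)   # McNemar's statistic (continuity corrected)
p = chi2.sf(x2, df=1)                  # 1 degree of freedom
print(x2, p)
```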

44
Q

What is the Mann-Whitney U test for

  • how would you do this
  • what is the formula
A

Tests the difference in the values of an ordinal variable between 2 independent samples

  1. Create a single list of all values ranked from lowest to highest, keeping track of which group each value belongs to
  2. Assign ranks across the combined list
    - if 2+ observations have the same value, each is given a rank equal to the midpoint of the unadjusted rankings
  3. Sum the ranks of each group separately (R1 and R2)
U1 = n1n2 + [ n1(n1 + 1)/2 ] - R1
U2 = n1n2 + [ n2(n2 + 1)/2 ] - R2

U = test statistic
-is the smaller of U1 and U2

If U is lower than the critical value from the significance tables => statistically significant
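A minimal sketch with made-up ordinal scores; scipy's mannwhitneyu reports U and a p value directly, so the significance tables are not needed.

```python
from scipy.stats import mannwhitneyu

# made-up ordinal scores for two independent groups
group1 = [3, 5, 4, 6, 2, 5]
group2 = [6, 7, 5, 8, 7, 6]

u, p = mannwhitneyu(group1, group2, alternative="two-sided")
print(u, p)
```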

45
Q

What is the sign test used for

-what would the H0 be

A

Test the difference in the values of pairs of observations for an ordinal variable
-can only say whether the difference is positive or negative

H0: overall no change between time points 1 and 2 (equal numbers of positive and negative differences)

Compare the number of positive (or negative) differences to the values in the significance tables
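A minimal sketch with made-up paired scores; under H0 the count of positive differences follows a binomial distribution with p = 0.5, so scipy's binomtest (assumed available, scipy >= 1.7) can supply the p value instead of the tables.

```python
from scipy.stats import binomtest

# made-up paired ordinal scores at two time points
before = [2, 3, 4, 2, 5, 3, 4, 2]
after  = [3, 4, 4, 3, 4, 4, 5, 3]

diffs = [a - b for a, b in zip(after, before) if a != b]   # zero differences dropped
n_pos = sum(d > 0 for d in diffs)

# under H0, positive and negative differences are equally likely (p = 0.5)
result = binomtest(n_pos, n=len(diffs), p=0.5)
print(n_pos, len(diffs), result.pvalue)
```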

46
Q

What assumptions are made in the unpaired Student’s t test

  • how would you check the assumptions
  • formula
  • H0 and H1
  • what would you do if variances are not similar
A

Data is

  • independent
  • normally distributed
  • variance of the data in the 2 compared groups is the same (calculate SD^2 for each group to check they are the same/similar)

t = (mean1 - mean2) / [pooled SD × √(1/n1 + 1/n2)]

t value compared to value in stats tables to assess the probability of getting that result

H0
-there is no difference between the means of both groups
H1
-there is a difference between the means of both groups

Unequal variance (Welch's) unpaired t test uses a different formula
H0
-there is no difference between the means of the 2 groups (equal variances are not assumed)
H1
-there is a difference between the means of the 2 groups
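A minimal sketch with made-up data; scipy's ttest_ind covers both the equal-variance and unequal-variance forms.

```python
from scipy.stats import ttest_ind

# made-up continuous measurements from two independent groups
group1 = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
group2 = [6.5, 6.8, 7.1, 6.2, 7.0, 6.6]

t, p = ttest_ind(group1, group2, equal_var=True)               # assumes equal variances
t_welch, p_welch = ttest_ind(group1, group2, equal_var=False)  # unequal-variance version
print(t, p, t_welch, p_welch)
```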

47
Q

What are the assumptions that must be met to use the paired Student’s paired t test
-formula

A

Data is not independent
Difference between paired groups are normally distributed (this is not the same as the distribution of each group is normal)
Does not assume that variance within each paired group is the same

t = mean difference/SE of mean difference

The t value is compared to the value from the stats tables to assess for significance

Mean difference is different from the difference between 2 means (x1-x2)
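A minimal sketch with made-up paired measurements, using scipy's ttest_rel, which works on the within-subject differences.

```python
import numpy as np
from scipy.stats import ttest_rel

# made-up measurements on the same subjects at two time points
before = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
after  = np.array([5.6, 5.2, 6.5, 6.1, 5.4, 6.4])

t, p = ttest_rel(after, before)       # tests the within-subject differences
mean_diff = (after - before).mean()
print(mean_diff, t, p)
```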

48
Q

When would you use Wilcoxon signed-rank test

-how would you do this

A

Test the difference in the values of pairs of observations (dependent samples) for quantitative variables, whenever the assumptions of the parametric test, Student’s paired t test are not met

  1. Calculate the difference between each paired value
  2. Rank by size of difference ignoring the sign
    - if differences are identical, each is given the mean of the ranks they would have if distinct
    - zero differences are omitted
  3. Add up the ranks of the positive and the negative differences separately
  4. The test statistic is the smaller of the 2 rank sums; compare this to the value from the stats table
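A minimal sketch reusing the same kind of made-up paired data with scipy's wilcoxon, which ranks the differences and omits zeros by default.

```python
from scipy.stats import wilcoxon

# made-up paired data; use when the differences are not normally distributed
before = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
after  = [5.6, 5.2, 6.5, 6.1, 5.4, 6.4]

stat, p = wilcoxon(after, before)   # ranks the differences, zero differences dropped
print(stat, p)
```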
49
Q

What is Pearson’s Correlation Coefficient

  • what does each value mean
  • formula
  • what assumptions have been made
A

Rho quantifies the direction and strength of the linear association between 2 continuous variables
-the correlation coefficient

-1 = perfect negative correlation
0 = no association
Between 0 and 0.4 = weak
Between 0.4 and 0.7 = moderate
Between 0.7 and 1 = strong
+1 = perfect positive correlation

r = ∑(xi - mean of x)(yi - mean of y) / √[∑(xi - mean of x)^2 × ∑(yi - mean of y)^2]

Can calculate a p value

Assumes a
-normal distribution of both variables
-linearity between both variables
If these assumptions are not met => use Spearman Rank or Kendall’s Tau correlation
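A minimal sketch with made-up data showing both coefficients; spearmanr is the rank-based fallback mentioned above.

```python
from scipy.stats import pearsonr, spearmanr

# made-up continuous variables
x = [1.0, 2.1, 2.9, 4.2, 5.1, 6.0]
y = [2.3, 2.9, 3.8, 4.6, 5.9, 6.8]

r, p = pearsonr(x, y)          # assumes normality and a linear relationship
rho, p_s = spearmanr(x, y)     # rank-based alternative if those assumptions fail
print(r, p, rho, p_s)
```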

50
Q

When would you use Spearman’s Rank

-formula

A

If the assumptions of the Pearson’s correlation coefficient are not met but you still want to investigate the relationship between 2 variables

Rho is interpreted in the same way as in Pearson

Formula calculated by statistical software

51
Q

Assumptions and errors with correlation

A

A strong correlation cannot tell you the direction of causation, i.e. which factor affects which

  • could be another factor that moderates the effect
  • will need simple or multiple regression to further investigate the relationship between X and Y

Cannot use correlation to assess the agreement between 2 different measurement methods - a high correlation does not mean the 2 methods agree

52
Q

What does regression tell you

-how does this differ from correlation

A

Assess the relationship between an outcome and one or more risk factors
-how does the value of an outcome change with a change in the risk factor

Correlation
-what is the relationship between X and Y

Regression

  • what is the increase in Y per unit increase in X
  • gradient of the regression line
53
Q

When would you use

  • linear regression
  • logistic regression
A

Linear
-used for continuous outcomes
Most common method - least squares regression
-chooses the straight line that minimises the sum of squares of the distance between the line and data points

Y = a + bX + e
Y = outcome
a = Y intercept
b = gradient
X = risk factor
e = difference between the observed and predicted value of Y

Logistic
-used for binary outcomes
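An illustrative sketch with simulated data (statsmodels assumed): OLS for the continuous outcome and Logit for the binary outcome.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)                             # made-up risk factor

# linear regression for a continuous outcome: Y = a + bX + e
y_cont = 2.0 + 0.5 * x + rng.normal(size=100)
linear = sm.OLS(y_cont, sm.add_constant(x)).fit()    # least squares
print(linear.params)                                 # intercept a and gradient b

# logistic regression for a binary outcome
p = 1 / (1 + np.exp(-(0.2 + 1.0 * x)))
y_bin = rng.binomial(1, p)
logistic = sm.Logit(y_bin, sm.add_constant(x)).fit(disp=False)
print(logistic.params)                               # coefficients on the logit scale
```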

54
Q

What does linear regression assume

A

Residuals are normally distributed

The observed values of Y have the same spread around the regression line as X changes

Residuals are independent of each other

55
Q

What are the 3 links between correlation and regression

A
  1. Gradient = Pearson’s correlation coeff x (SDy/SDx)
  2. Pearson’s correlation coeff^2 (R^2) is the proportion of variation in Y that is explained by X
  3. p value of correlation coeff = p value of gradient
56
Q

How would you calculate odds

-what is the difference between probability and odds

A

odds = p/(1 - p)
p=probability

Probability
-number of successes/total number of trials
Equal probability = 1 success every 2 trials
Can range from 0 to 1

Odds
-number of successes/number of failures
Equal odds = 1 success and 1 fail
Odds can range from 0 to + infinity
-greater than 1 = success more likely than failure
57
Q

How would you use odds to calculate the logit

-why do we need logit

A

logit = the log of the odds
logit = loge[p/(1 - p)]

logit = a + bX + e
-the logit substitutes for Y in the linear regression equation: probabilities are bounded between 0 and 1, whereas the logit is unbounded, so a straight-line model can be fitted to it

58
Q

How would you calculate the odds ratio

A

From a 2x2 table: OR = ad/bc

Or it can be calculated from the odds
-odds of disease with exposure / odds of disease without exposure

Or it can be calculated from logistic regression
-OR = e^gradient
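A minimal sketch with a made-up 2x2 table showing that ad/bc and the ratio of the two odds give the same OR.

```python
import math

# made-up 2x2 table: rows = exposed / unexposed, columns = disease / no disease
a, b = 30, 70       # exposed:   diseased, not diseased
c, d = 10, 90       # unexposed: diseased, not diseased

or_table = (a * d) / (b * c)                       # OR = ad/bc

odds_exposed = (a / (a + b)) / (1 - a / (a + b))   # odds of disease with exposure
odds_unexposed = (c / (c + d)) / (1 - c / (c + d))
or_from_odds = odds_exposed / odds_unexposed       # same value

print(or_table, or_from_odds)
# from logistic regression the OR would be math.exp(gradient)
```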

59
Q

What are the advantages of using logistic regression

A

Gives us a p value for the odds ratio

Gives us a confidence interval for the odds ratio

Can be used to investigate or control for several different factors simultaneously

60
Q

What are the assumptions of logistic regression

A

Dependent variable must be binary
-must be coded 0 and 1
Each observation is independent
Needs a large sample (minimum of 20 is good)

61
Q

How do we interpret the odds ratio with the confidence interval

A

If the CI does not include 1 => evidence of a relationship between the risk factor and outcome
-p value will be less than 0.05

62
Q

What is a confounder

A

A factor that distorts the relationship between X and Y

It must be

  • a risk factor for the effect
  • associated with the study base
  • not an intermediate factor between the exposure and effect
63
Q

How can the confounder affect the results

A

Systematic error
Positive - overestimate true association, bias away from H0
Negative - underestimate true association, bias toward H0

64
Q

How do we identify a confounder

A

If, after adding a variable into the statistical model, the crude estimated measure of association between exposure and outcome changes by 10%+, the variable is considered a confounder

Slightly different methods of assessing the magnitude of confounding:
Crude starting value = (RRcrude - RRadjusted) / RRcrude

Adjusted starting value = (RRcrude - RRadjusted) / RRadjusted

65
Q

How to design out confounding

A

Randomisation

  • random allocation of subjects to exposure
  • requires large sample size
  • only applicable in experimental studies

Restriction

  • inclusion of subjects belonging to 1 stratum only of the confounders
  • may result in lower statistical power, limits external validity

Matching

  • selection of controls so distribution of potential confounders is similar to cases
  • expensive, time consuming
66
Q

How to control for confounding at data analysis

A

Stratification (Mantel Haenszel method)

  • strength of association between exposure and outcome measured in each stratum of the confounder
  • a weighted average is calculated to account for differences in the distribution of the confounder
  • but this reduces power as there will be fewer participants per stratum

Multivariable modelling
-statistical modelling that controls multiple confounders at once

67
Q

What are the possible characteristics we can conclude about a confounder

A

Strength of effects between confounder and exposure

Direction of effect between confounder and exposure
-positive or negative

May appear to be a risk factor or a protective factor

68
Q

Why is it not possible to completely eliminate confounding

A

Limitation of ability to control for all potential confounders
-some are unknown or unmeasurable

Random error in measuring confounders reduces our power to control them

69
Q

What are the assumptions for multiple regressions model

-formula

A

Used to control for confounders during data analysis

Same as simple regression
In addition, there should be no strong correlation between the predictors (no multicollinearity)

Y = a + b1(factor1) + b2(factor2) + b3(factor3) + e

each b is the effect of its factor adjusted for the other factors in the model
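An illustrative sketch with simulated data (statsmodels assumed): the exposure coefficient from the multivariable model is adjusted for the confounder.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
confounder = rng.normal(size=n)                       # made-up confounder
exposure = 0.6 * confounder + rng.normal(size=n)      # correlated with the confounder
outcome = 1.0 + 0.5 * exposure + 0.8 * confounder + rng.normal(size=n)

X = sm.add_constant(np.column_stack([exposure, confounder]))
model = sm.OLS(outcome, X).fit()
print(model.params)   # the exposure coefficient is adjusted for the confounder
```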

70
Q

What is the formula for multiple logistical regression

A

Used to control for confounders during data analysis

logit = a + b1(factor1) + b2(factor2) + b3(factor3) + e