1B Statistics Flashcards

1
Q

combining probabilities: OR

A
  • If events A and B are mutually exclusive (eg the possible outcomes of a single dice roll)
    p(A or B) = p(A) + p(B)
  • If events A and B are not mutually exclusive
    p(A or B) = p(A) + p(B) - p(A and B)

(Unless you subtract p(A and B), the probability of both events occurring is counted twice, wrongly inflating the probability.)

2
Q

Combining probabilities: AND

A

p(A and B) = p(A) x p(B|A)

p(B|A) = probability of B given that A has occurred. If A has no impact on B (the events are independent) then p(B|A) = p(B)

3
Q

A smoking cessation intervention gives each participant a 0.4 chance of quitting. Adam and Ben never meet (so their outcomes are independent). What is the probability of at least one of them quitting?

A

Events are not mutually exclusive so p(A or B) = p(A) + p(B) - p(A and B)

As the events are independent, p(A and B) = 0.4 x 0.4 = 0.16

= 0.4 + 0.4 - 0.16
= 0.64
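
The arithmetic above can be checked with a short sketch. It assumes, as the card implies, that Adam and Ben are independent; the variable names are illustrative only. The complement rule (1 minus the chance that neither quits) is included as a cross-check.

```python
# Probability that at least one of two independent people quits.
p_quit = 0.4  # chance of quitting for each person (from the card)

# Independence: p(A and B) = p(A) x p(B)
p_both = p_quit * p_quit

# OR rule for events that are not mutually exclusive
p_at_least_one = p_quit + p_quit - p_both

# Cross-check via the complement: 1 - p(neither quits)
p_complement = 1 - (1 - p_quit) ** 2

print(round(p_at_least_one, 2), round(p_complement, 2))
```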

4
Q

Sampling error

A

Sampling error is the chance variation (assuming the study is unbiased) between the values obtained from the study sample and the values that would be obtained by measuring the whole population.

The most common method for measuring the likely sampling error is to calculate the standard error.

5
Q

Standard error

A

The standard error estimates how precisely a population parameter (ie mean, proportion, difference between means) is estimated by the equivalent statistic in the sample.

The standard error is the standard deviation of the sampling distribution of the statistic.

The method of calculating the standard error therefore depends on the data type and the statistic being used (ie is the data continuous or binary, and are you calculating a mean or a proportion? Each requires a different formula for the standard error).
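
As an illustration of how the formula changes with the statistic, the sketch below uses the two standard textbook formulas, which the card itself does not state: s / sqrt(n) for a mean and sqrt(p(1-p)/n) for a proportion. The function names and data are hypothetical.

```python
import math
import statistics

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

heights_cm = [162, 170, 168, 175, 181, 159, 177, 166]  # hypothetical sample
print(se_mean(heights_cm))
print(se_proportion(0.4, 100))
```

Both formulas shrink as n grows, which is why larger samples give more precise estimates.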

6
Q

sampling distribution

A

Could be created by drawing many random samples of the same size from the same population and calculating the same sample statistic. The frequency distribution of all these sample statistics is a sampling distribution.

These distributions (ie a normal distribution) help you understand how a sample statistic differs from sample to sample and are the basis for making inferences from sample to population.

The shape of the sampling distribution depends on the type of data and statistic (ie continuous data and a mean = normal distribution, binary data and a proportion = binomial distribution)
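
The "many random samples" idea can be simulated. This is a minimal sketch with a made-up, skewed population; the seed, population and sample size are arbitrary assumptions. It compares the spread of the simulated sample means with the usual formula for the standard error of a mean (population SD / sqrt(n)).

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values from a skewed (exponential) distribution
population = [random.expovariate(1.0) for _ in range(10_000)]

sample_size = 50
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(2_000)
]

# The spread of the sample means is the standard error of the mean
observed_se = statistics.stdev(sample_means)
expected_se = statistics.pstdev(population) / sample_size ** 0.5
print(round(observed_se, 3), round(expected_se, 3))
```

Although the population is skewed, a histogram of sample_means would look roughly bell shaped, and the two printed values should be close.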

7
Q

How are confidence intervals calculated and how do they relate to standard error and the sampling distribution

A

- the sampling distribution of a mean is a normal distribution, and those of other statistics (ie proportion or rate) can be approximated by a normal distribution
- in an unbiased study the mean of the sampling distribution is equal to the true population parameter
- the standard deviation of the sampling distribution is the standard error of the statistic
- therefore 95% of sample statistics would lie within 1.96 standard errors of the true population parameter
- from this we can infer that there is a 95% chance that the true population parameter lies within 1.96 standard errors above or below a sample statistic (95% CI = statistic +/- 1.96 x standard error)
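
A minimal sketch of the calculation for a mean, using the 1.96 and 2.58 multipliers from these cards. The data and the function name are hypothetical, and the standard error is taken as s / sqrt(n).

```python
import math
import statistics

def confidence_interval(values, z=1.96):
    """Return (lower, upper) for the mean using mean +/- z x standard error."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * se, mean + z * se

systolic_bp = [118, 125, 131, 122, 140, 128, 135, 119, 127, 133]  # hypothetical
low95, high95 = confidence_interval(systolic_bp)          # 95% CI
low99, high99 = confidence_interval(systolic_bp, z=2.58)  # 99% CI
print(low95, high95)
print(low99, high99)
```

With a sample this small a t distribution multiplier would normally replace 1.96; the normal multiplier is kept here only to mirror the card.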

8
Q

value used for 99% confidence intervals

A

+/- 2.58x standard error

9
Q

interpret 3 possible scenarios when comparing 2 confidence intervals

  1. CIs do not overlap
  2. CIs overlap but neither point estimate is within the other's confidence interval
  3. Either point estimate is within the confidence interval of the other
A
  1. significant at the 5% level
  2. Unclear- need a significance test
  3. Not significant at 5% level
10
Q

Formula for conditional probability, p(B|A)

A

p(B|A) = probability of B given A has occurred

we know:
p(A and B) = p(A) x p(B|A)

rearranged:
p(B|A) = p(A and B) / p(A)
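
A small worked example of the rearranged formula, using one roll of a fair die. The events A and B are illustrative choices, not from the card, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# One roll of a fair six-sided die (illustrative events, not from the card)
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

def p(event):
    """Probability of an event as an exact fraction of equally likely outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

p_b_given_a = p(A & B) / p(A)   # p(B|A) = p(A and B) / p(A)
print(p_b_given_a)
```

Here p(B) is 1/2 but p(B|A) is 2/3, so knowing the roll is even changes the probability of B: the two events are not independent.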

11
Q

what is a statistical distribution?

A

A function that shows all the possible values of a variable and how frequently they occur

12
Q

Statistical distributions: the normal distribution

A
  • symmetrical bell shaped curve
  • described by 2 parameters:
    variance (SD squared)
    mean
  • The standard normal distribution has a mean of 0 and a variance of 1
    -Any normally distributed variable can be converted to a standard normal distribution
  • the normal distribution is very useful as many variables in biology follow a normal distribution
  • the sampling distribution of a mean follows a normal distribution
  • with large enough samples other distributions approximate to the normal distribution
13
Q

Standard statistical distribution: binomial distribution

A
  • PROPORTIONS
    -the binomial distribution shows the frequency of events that have 2 possible outcomes
    ie success and fail
    -it is constructed using 2 parameters:
    n (sample size)
    pi (true probability)
  • when sample size is large it approximates to the normal distribution
  • used for:
    discrete data with 2 possible outcomes
    sampling distribution for proportions
  • since proportions or probability cannot be negative it has no negative values
14
Q

In the normal distribution what percentage of the area under the curve is within:
1 standard deviation
1.96 standard deviations
2.58 standard deviations

A

Within 1 standard deviation = 68%
Within 1.96 standard deviations = 95%
Within 2.58 standard deviations = 99%
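
These areas can be checked with the standard library's NormalDist class (Python 3.8+). The helper name area_within is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (1, 1.96, 2.58):
    print(z, round(area_within(z) * 100, 1))
```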

15
Q

standard statistical distributions: Poisson distributions

A
  • RATES/COUNTS
    -deals with the frequency with which an event occurs over a given time ie deaths from MI over a month
  • used in the analysis of rates
  • assumes that the data are discrete, events occur at random and are independent
  • described by a single parameter, the mean, which is also the variance (FOR THE POISSON DISTRIBUTION THE MEAN AND THE VARIANCE ARE THE SAME)
  • small samples give an asymmetric distribution and large samples approximate to the normal distribution
  • no negative values as a rate cannot be negative
16
Q

Standard statistical distributions: Students T distribution

A
  • SMALL SAMPLE SIZE
    -Bell shaped like a normal distribution but tails are more spread out
  • Single parameter: degrees of freedom
  • as the degrees of freedom increase it approaches the normal distribution
17
Q

Standard statistical distributions: Chi squared distribution

A
  • right skewed shape
  • parameter: degrees of freedom
  • as degrees of freedom increase it becomes more like normal distribution
    -used in chi squared tests which are used for analysing categorical variables (comparing expected and observed event frequencies)
18
Q

Standard statistical distributions; F distribution

A
  • right skewed
    -values are positive
  • parameters: 2 degrees of freedom values, one for the numerator and one for the denominator of the ratio
  • uses: ANOVA tests
19
Q

Degrees of freedom

A

number of independent pieces of information used to calculate a statistic

20
Q

what is the difference between standard deviation and variance?

A
  • SD is a measure of how spread out the values in a data set are around the mean
  • variance is the average squared distance of the values in a data set from the mean
  • SD is the square root of the variance
  • SD is in the same units as the data whereas the variance is in squared units
21
Q

Sampling distribution shape:
Outcome variable= continuous
statistic type = mean

A

Normal shaped sampling distribution

22
Q

Sampling distribution shape:
Outcome variable= binary
statistic type = proportion/risk

A

Binomial distribution

23
Q

Sampling distribution shape:
Outcome variable= binary over time
statistic type = rate

A

Poisson distribution

24
Q
A
25
Q

what is inference?

A

The process of drawing conclusions for a population based on observations collected from a sample

26
Q

what are the 2 main methods of inference?

A
  • Estimation
    point estimation (mean, proportion)
    interval estimation: expresses the uncertainty associated with a point estimate, eg confidence intervals
  • Hypothesis testing
    assesses the likelihood that a given observation in a sample would have occurred due to chance

Both estimation and hypothesis testing are derived from the standard error

27
Q

measures of data location (5)

A
  1. arithmetic mean
  2. geometric mean
  3. mode
  4. median
  5. percentiles
28
Q

measures of data dispersion (5)

A
  1. range
  2. interquartile range
  3. variance
  4. standard deviation
  5. coefficient of variation
29
Q

measures of data location: arithmetic mean ( how to calculate, advantages and disadvantages)

A
  • all values summed and divided by n
  • a sample arithmetic mean is denoted by xbar
  • a population arithmetic mean is denoted by mu
  • advantages: amenable to statistical analysis
  • disadvantages: not good for asymmetric distributions, affected by outliers

30
Q

measures of data location: geometric mean (how to calculate, advantages and disadvantages)

A
  • nth root of the product of all n values
  • advantages: more appropriate for positively skewed distributions
  • disadvantages: cannot be calculated if any value is 0 or negative
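
A short sketch of the calculation with hypothetical values. It also shows the equivalent "mean of the logs" route, which illustrates why zero or negative values cannot be used (their logarithm is undefined). The standard library also offers statistics.geometric_mean from Python 3.8.

```python
import math
import statistics

values = [1, 10, 100]  # hypothetical positively skewed data

# nth root of the product of the n values
by_definition = math.prod(values) ** (1 / len(values))

# Equivalent: exponentiate the mean of the logs (why zeros and negatives fail)
via_logs = math.exp(statistics.mean(math.log(v) for v in values))

print(by_definition, via_logs, statistics.mean(values))
```

The geometric mean is about 10 while the arithmetic mean is 37, showing how much less the large value pulls it upward.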
31
Q

measures of data location: median ( how to calculate, advantages and disadvantages)

A
  • middle value when the data are ranked in order (with an even number of values, the mean of the middle 2)
  • advantages: unaffected by extreme outliers, good for skewed distributions
  • disadvantages: value determined solely by rank so gives no information on any other values
32
Q

measures of data location: mode ( how to calculate, advantages and disadvantages)

A
  • most commonly occurring value
  • advantages: not generally affected by extreme outliers
  • disadvantages: there may not always be a mode, not amenable to statistical analysis
33
Q

measures of data location: percentiles (how to calculate, advantages and disadvantages)

A
  • data are ranked and divided into 100 equal groups, where the 100th percentile is the highest
  • advantages: useful for comparing measurements (BMI, child height etc)
  • disadvantages: comparisons at the extreme ends of the spectrum are less useful than those in the middle
34
Q

measure of data dispersion: range ( how to calculate, advantages and disadvantages)

A
  • highest value minus lowest
  • advantages: simple, intuitive
  • disadvantages: sensitive to size of sample and outliers
35
Q

measure of data dispersion: interquartile range (how to calculate, advantages and disadvantages)

A
  • the spread of the middle 50% of the sample
  • calculated as the upper quartile minus the lower quartile
  • advantages: more stable than the range as sample size increases
  • disadvantages: unstable for small samples, does not allow for further mathematical manipulation
36
Q

measure of data dispersion: variance ( how to calculate, advantages and disadvantages)

A
  • average squared deviation of each value from its mean
  • the formula differs slightly depending on whether calculated for a sample (divided by n-1) or a population (divided by n)
  • advantages: takes all values into account, useful for making inferences about population
  • disadvantages: units differ from that of the data
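
The n versus n-1 distinction can be seen directly in a small sketch. The data are hypothetical; statistics.pvariance and statistics.variance are the standard library's population and sample versions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / len(data)        # divide by n
sample_variance = sum(squared_deviations) / (len(data) - 1)      # divide by n - 1

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
print(population_variance ** 0.5)  # SD, back in the units of the data
```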
37
Q

measures of data dispersion: standard deviation (how to calculate, advantages and disadvantages)

A
  • square root of variance
  • advantages: most commonly used, units are the same as data, useful for making inferences about the population
  • disadvantages: sensitive to some extent to extreme values
38
Q

measures of data dispersion: coefficient of variation (how to calculate, advantages and disadvantages)

A
  • ratio of the standard deviation to the mean
  • gives an idea of the size of the variation relative to the size of the observations
  • advantages: allows comparison of the variation of populations that have significantly different mean values
  • disadvantages: where the mean value is near 0 the coefficient of variation is highly sensitive to changes in standard deviation
39
Q

6 key elements to mention when describing a graph in the exam

A
  1. type of graph
  2. the axes
  3. the data displayed (ie mortality)
  4. the units
  5. any obvious findings
  6. what interpretation, if any, can be made from the findings (remember very unlikely to be able to conclude causality from a graph)
40
Q

Displaying categorical data: 2 types of graph

A
  • bar graph
  • pie chart
41
Q

categorical data: bar graph

A
  • bars can show frequency (total count) or relative frequency (percentage)
42
Q

categorical data: Pie chart

A
  • start at the 12 o'clock position; wedges should descend in order of size going clockwise (ie biggest to smallest)
43
Q

continuous data: 6 types of chart

A
  • stem and leaf display
  • box plot
  • histogram
  • frequency polygon
  • frequency distribution
  • cumulative frequency distribution
44
Q

continuous data: stem and leaf display (what is it, advantages and disadvantages)

A
  • a quick technique for displaying numerical data graphically
  • a vertical stem is drawn consisting of the first few significant figures of values in a dataset
  • any subsequent figures are the leaf
  • back to back stem and leaf displays can be used to display multiple data sets

advantages:
1. simple, quick and easy
2. actual values are retained

disadvantages:
1. hard to display large data sets

45
Q

continuous data: box plot (what is it, advantages and disadvantages)

A
  • gives a measure of central location (MEDIAN)
  • shows the 25th and 75th percentiles, so gives the interquartile range as well as the range

Advantages:
1. box element contains a lot of information
2. good for comparing 2 datasets

Disadvantage:
1. actual values are not retained

46
Q

continuous data: histogram (what is it, advantages and disadvantages)

A
  • divides the sample values into many intervals which are called bins
  • bars then display the number of values in that bin
  • most histograms use bins that are roughly equal in width, but bins can instead be sized so they contain an approximately equal number of values (this can result in bins that are too narrow to see!)

advantages:
1. gives an idea of the central tendency of the data
2. demonstrates skewness and the shape of the frequency distribution

disadvantages:
1. cannot read exact values as data are grouped into intervals
2. more difficult to compare 2 data sets
3. can only be used with continuous data

47
Q

continuous data: frequency polygon (what is it, advantages and disadvantages)

A

constructed by joining the midpoint of the top of each bar in the histogram

48
Q

continuous data: frequency distribution (what is it, advantages and disadvantages)

A

-essentially the frequency polygon that would be drawn for a histogram with a very large number of bins
-leads to a smooth line

remember you describe the skewness of a graph according to WHERE THE TAIL IS (tail to the right = positive/right skew, tail to the left = negative/left skew)

49
Q

continuous data: cumulative frequency (what is it and advantages disadvantages)

A
  • a running count starting with the lowest value and showing how the number of observations accumulate
50
Q

continuous data: Showing association between 2 variables: which graph type?

A
  • bivariate data is almost always best shown using a scatter plot
51
Q

continuous data: scatter plot for showing association between 2 variables

A
  • data from 2 variables are plotted against each other to explore the relationship between them
  • trend line is drawn to explore whether any correlation is:
    1. positive negative or non existent
    2. linear or non linear
    3. strong, moderate or weak

advantages:
1. data values and data set are retained
2. shows a trend in data relationship
3. shows minimum maximum and outliers

disadvantages:
- data from both variables must be continuous
- hard to visualise large data sets

52
Q

Z test: what is it and how is it used

A

Used to compare proportions/means between 2 groups.

Different formulas for testing different things but all include the standard error

The z value is looked up in a z-distribution table which gives a P value.

The test can be used for paired data. To do this the difference in the observation for each pair is calculated and then the pair is treated as a single observation

53
Q

z test value of significance

A

z score > 1.96 or < -1.96 is significant at the 5% level (2-sided)

54
Q

T test: what is it, when is it used

A

Used to compare means/ proportions between 2 groups when a sample size is small (normally less than 60)

Based on a T distribution rather than a normal distribution.

T values are looked up in a T distribution table in order to discern the P value.

55
Q

Chi Squared test

A

used to compare the counts of categorical response between 2 or more independent groups.

Needs to compare counts, it cannot compare proportions/percentages.

The formula for the chi squared values is on the formula sheet.

Firstly you need to construct a rows (r) x columns (c) contingency table. You then need to calculate the expected value for each box. This is calculated by multiplying the total value for the row by the total value for the column and dividing by the total number in the table.

Once you have used the formula to calculate a value for each box you sum all the values to give the chi squared value

You look the chi squared value up in a table to get a P value

You need to know the degrees of freedom to do this. This is calculated by

(r-1)x (c-1)= degrees of freedom
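
The steps above can be sketched as follows. The formula sheet is not reproduced in these cards, so this assumes the usual Pearson form, the sum over all boxes of (observed - expected) squared / expected. The function name and the counts are hypothetical.

```python
def chi_squared(table):
    """Return (chi-squared value, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected = row total x column total / total number in the table
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)   # (r-1) x (c-1)
    return statistic, dof

# Hypothetical counts: rows = treatment / control, columns = improved / not improved
counts = [[30, 20],
          [15, 35]]
value, dof = chi_squared(counts)
print(round(value, 2), dof)
```

The resulting value and degrees of freedom would then be looked up in a chi squared table to obtain the P value.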

56
Q

Chi squared degrees of freedom formula

A

(r-1)x(c-1)
where r= rows and c= columns

57
Q

Chi squared value of significance

A

1.96 squared = 3.84
therefore a chi squared value of 3.84 corresponds to P = 0.05 (with 1 degree of freedom, ie a 2x2 table)

therefore a chi squared value > 3.84 is significant at the 5% level

58
Q

Two way ANOVA (what is it)

A

Used for when you have 2 or more independent groups but want to consider 2 factors

ie a study comparing drug A , drug B and drug C that also wants to analyse any difference in effects between male and females

58
Q

McNemars: what is it and how is it calculated

A

  • Used for paired data with a binary outcome (when you could not use chi-squared as it would not take the pairing into account)
  • for example: individually matched case control studies or multiple measures of the same variable on individuals
  • when calculating McNemars we are interested in areas of discordance between pairs, ie if a case was exposed but the control was not.

You therefore need to construct the table differently: controls go in columns (one column for exposed and one for unexposed) and cases go in rows (one row for exposed and one for unexposed), so each cell counts pairs.

The formula is on the formula sheet, n12 and n21 correspond to the 2 cells where there is discordance.
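
The formula sheet itself is not shown in these cards, so the sketch below assumes the commonly quoted form of the McNemar statistic, (n12 - n21) squared / (n12 + n21), with an optional continuity correction. Check it against the formula sheet before relying on it; the counts are hypothetical.

```python
def mcnemar(n12, n21, continuity_correction=False):
    """McNemar chi-squared statistic from the two discordant cell counts."""
    difference = abs(n12 - n21)
    if continuity_correction:
        difference = max(difference - 1, 0)
    return difference ** 2 / (n12 + n21)

# Hypothetical matched case-control study:
# n12 = pairs where only the case was exposed, n21 = pairs where only the control was
print(mcnemar(25, 10))
print(mcnemar(25, 10, continuity_correction=True))
```

Only the discordant pairs enter the calculation; pairs where case and control had the same exposure carry no information about the association.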

58
Q

McNemars value of significance

A

value > 3.84 is significant at the 5% level

59
Q

One way ANOVA

A
  • used when there are 2 or more independent groups needing comparison
  • parametric test

ie a study comparing the effects of drug A, Drug b and drug C

59
Q

MANOVA (what is it)

A

A MANOVA is used when you have multiple dependent variables and 1 or more independent variables that define 2 or more groups.

For example if you wanted to consider the effect of being male/female on IQ, reading and numeracy results.

60
Q

Repeated measures ANOVA

A

parametric test

-used when you have paired data ie multiple readings of the same variable on each subject over time

61
Q

The difference between parametric and non- parametric tests

A
  • If a sample is normally distributed it can be described by mean and standard deviation.
  • In this situation a difference between samples can be ascertained by examining their means
  • you cannot do this if a sample is not normally distributed.
  • in this instance you need to use a non-parametric test that focuses on rank-ordering the data rather than the individual values themselves
  • non-parametric tests are usually only used for small samples as with larger samples the lack of normality tends to not be problematic and parametric tests are used
62
Q

disadvantages of non-parametric tests (3)

A
  • lower power (greater risk of type II error)
  • harder to calculate confidence intervals
  • can generally only be used for bivariate analysis (ie unable to test for interaction or adjust for confounding)
63
Q

Wilcoxon signed rank test

A

-Used for paired, non parametric data

  1. Find the difference between each individual pair
  2. omit any 0 values
  3. ignoring any +/- signs rank the differences (if 2 values are the same, give them both the rank halfway between the ranks they would have occupied)
  4. reapply the +/- signs
  5. Find the sum of the positive ranks and the sum of the negative ranks (these are ‘rank totals’)
  6. Ignoring +/- signs select the smaller rank total from step 5
  7. Look the rank total up in a Wilcoxon table. If the rank total is larger than the value in the Wilcoxon table it is not significant at that level.
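
Steps 1 to 6 above can be sketched in code; the table lookup in step 7 is left out. The function names and the paired readings are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def wilcoxon_rank_totals(before, after):
    """Return (sum of positive ranks, sum of negative ranks, smaller total)."""
    differences = [a - b for b, a in zip(before, after) if a != b]  # steps 1-2
    ranks = average_ranks([abs(d) for d in differences])            # step 3
    positive = sum(r for r, d in zip(ranks, differences) if d > 0)  # steps 4-5
    negative = sum(r for r, d in zip(ranks, differences) if d < 0)
    return positive, negative, min(positive, negative)              # step 6

before = [10, 12, 9, 14, 11, 13]   # hypothetical paired readings
after = [12, 11, 9, 18, 13, 16]
print(wilcoxon_rank_totals(before, after))
```

In this example one pair has a difference of 0 and is dropped, and the two differences of 2 share rank 2.5.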
64
Q

Mann whitney U test

A
  • Non parametric test used for unpaired data
  1. rank the values from both groups in a single table (colour code each group differently)
  2. Add up the ranks for each group
  3. select the smaller of the 2 rank totals
  4. Look up in the mann whitney U table using n1 as the number in one group and n2 as the number in the other group
  5. if the smaller rank total is larger than the number in the Mann-Whitney U table then the result is not significant at this level
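
Steps 1 to 3 can be sketched as below. The card does not say how to handle ties, so this borrows the average-rank rule from the Wilcoxon card as an assumption. Names and scores are hypothetical, and the table lookup is not included.

```python
def rank_totals(group_1, group_2):
    """Rank both groups together and return each group's rank total."""
    pooled = [(value, 1) for value in group_1] + [(value, 2) for value in group_2]
    pooled.sort(key=lambda pair: pair[0])

    totals = {1: 0.0, 2: 0.0}
    position = 0
    while position < len(pooled):
        # Find the run of tied values and share the average rank between them
        end = position
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[position][0]:
            end += 1
        shared_rank = (position + end) / 2 + 1
        for _, group in pooled[position:end + 1]:
            totals[group] += shared_rank
        position = end + 1
    return totals[1], totals[2]

drug = [7, 9, 12, 15]        # hypothetical scores
placebo = [3, 5, 6, 9, 10]
total_1, total_2 = rank_totals(drug, placebo)
print(total_1, total_2, min(total_1, total_2))
```

A useful check is that the two rank totals always add up to N(N+1)/2, where N is the combined number of observations.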
65
Q

What is a P value

A

A p value provides an estimate of the probability of recording an association at least as large as the association found in the sample if the null hypothesis is true.

66
Q

Type I errors (other name, false ???, generally accepted limits)

A

Alpha errors
False Positive
0.05 or 0.01

67
Q

Type II errors (other name, false ???, generally accepted limits)

A

Beta errors
False negative
0.2 (ie 80% power)

68
Q

trade off between type I and II errors

A

If you try to reduce the rate of type I errors by requiring a smaller P-value to suggest significance, the risk of type II error increases as you are more likely to wrongly fail to reject the null hypothesis.

69
Q

What is data mining/fishing and what is the risk

A
  • when hypothesis and variables to be tested have not been prespecified
  • there can be a tendency to go looking for associations between variables
  • doing lots of tests increases the risk of a type I error, given there is a 1 in 20 chance of getting a significant result by chance alone in each test at a significance level of 0.05
  • should correct for multiple comparisons, ie using a Bonferroni correction
70
Q

what is a bonferroni correction?

A

Used when multiple comparisons are being made. Adjusts the alpha value of significance so that the overall chance of a type I error across all the comparisons stays at around 1 in 20.

adjusted alpha= original alpha (ie 0.05) / n

where n = the number of tests conducted.
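
A minimal sketch of the adjustment, with made-up P values from 4 tests; the function name is illustrative.

```python
def bonferroni_alpha(alpha, n_tests):
    """Adjusted significance threshold: original alpha divided by the number of tests."""
    return alpha / n_tests

p_values = [0.001, 0.012, 0.030, 0.200]  # hypothetical results from 4 tests
threshold = bonferroni_alpha(0.05, len(p_values))
significant = [p for p in p_values if p < threshold]
print(threshold, significant)
```

Note that 0.030 would have counted as significant at the unadjusted 0.05 level but does not pass the adjusted threshold of 0.0125.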

71
Q

why are sample size calculations important

A

Sample size calculations ensure there are sufficient participants in order to answer the study question but not so many as to be wasteful

72
Q

5 key factors in sample size calculations

A
  1. Expected Effect size (small effect size needs larger sample)
  2. Significance level (ie accepted type I error rate, a smaller P value requires a larger sample)
  3. Power (usually power above or equal to 80% is used, higher power requires larger sample)
  4. Event rate in population (in case control studies this is the prevalence of exposure in the controls, in cohort and experimental studies it is the prevalence of the outcome in the unexposed). Smaller prevalence requires a larger sample.
  5. Standard deviation in the population (for continuous outcomes; a larger SD requires a larger sample)
73
Q

Define power

A

Power is the probability a study will be able to detect a difference if it truly exists.

It is 1- type II error rate.

74
Q

Situations where the sample size may need to be increased (5)

A
  • high loss to follow up
  • confounding
  • interaction
  • cluster sampling
  • low response rate
75
Q

How to increase power in case-control studies?

A

Increase the ratio of controls to cases.

76
Q

what is correlation

A

Correlation tests assess the strength of any linear relationship between 2 variables

77
Q

parametric correlation test, what value does it give and how do you interpret this

A

Pearson’s correlation
Gives pearson’s correlation coefficient (r)

r= -1 perfect negative correlation
r=0 no correlation
r=+1 perfect positive correlation

78
Q

non-parametric correlation test

A

Spearmans rank correlation

79
Q

what is r squared and how is it interpreted

A
  • r2 is the square of the correlation coefficient r (whose significance is assessed using a form of t-test)
  • r2 gives an assessment of how much variation in the y variable is accounted for by the linear relationship with the x variable
    ie if r2 = 64% then the regression model (which derives the line of best fit) explains 64% of the variation in the y variable (other variables may explain the other 36%).
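
As a sketch of where r and r2 come from, the code below computes Pearson's r from its usual definition (covariation of x and y divided by the square root of the product of their sums of squares) and then squares it. The data and names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

age = [1, 2, 3, 4, 5]            # hypothetical x values
score = [2, 4, 5, 4, 5]          # hypothetical y values
r = pearson_r(age, score)
print(round(r, 3), round(r ** 2, 3))
```

Here r2 is 0.6, so the linear relationship with x accounts for 60% of the variation in y, leaving 40% unexplained.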
80
Q

What is linear regression, what is it used for?

A

Linear regression is used to determine a line of best fit for the association between two variables which have a linear relationship when plotted on a scatter plot. Linear regression also gives the equation of the line of best fit

81
Q

What is the equation for a line derived from linear regression

A

y=a +bx

where a = intercept with y axis
b= gradient of the line
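
A sketch of how a and b might be estimated, assuming the usual least-squares method (the card does not name the fitting method). The function name and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of a (intercept) and b (gradient) in y = a + bx."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x   # the fitted line passes through (mean x, mean y)
    return a, b

dose = [1, 2, 3, 4, 5]        # hypothetical exposure
response = [2, 4, 5, 4, 5]    # hypothetical outcome
a, b = fit_line(dose, response)
print(a, b)
print(a + b * 6)     # predicted y at x = 6
```

For these data the gradient is about 0.6, meaning y rises by roughly 0.6 for each 1 unit increase in x.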

82
Q

what is the regression coefficient and how is it interpreted?

A

The regression coefficient is another name for the gradient of the line, so = b
If there was no association the gradient would be 0.
So linear regression tests the null hypothesis that the regression coefficient = 0.
Accordingly if the confidence interval for b does not include 0 this is evidence of an association

83
Q

Multiple regression: what is it, what can it be used for

A

Multiple regression allows you to study the impact of multiple exposures on an outcome variable.

Allows adjustment for confounding factors

Interaction terms (ie when the effect of 2 or more variables is NOT additive) can also be included to assess for effect modification

84
Q

Logistic regression

A

Used for binary or ordered categorical outcome variables.

85
Q

Poisson regression

A

Generally used for rates or count data

86
Q

Cox regression

A

Generally used for survival times

87
Q

Define: The survival function S(t)

A

The probability of not experiencing the event of interest, at least until time point t.

88
Q

Define: the hazard function h(t)

A

The conditional probability of experiencing the event of interest at time t, having survived to that time (ie the specific rate at that time point)

89
Q

2 main methods for generating survival functions and survival curves

A
  1. Life tables
  2. Kaplan-Meier method
90
Q

Life tables: what are they

A

Generally used to display patterns of survival in a cohort when we do not know the exact survival time of each individual but we do know number of survivors at regular time intervals

91
Q

2 main types of life tables (what are they)

A
  1. cohort life tables
  2. Period life tables
92
Q

Life tables: what are cohort life tables

A

Show survival of an actual group of individuals through time
This is the main method used in life table survival analysis

93
Q

Life tables: What are period life tables

A

Uses age specific mortality rates applied to a hypothetical population to calculate expected survival times.

Often used in demographics

94
Q

Life tables: what information is collected at specified time intervals (3 pieces of information)

A
  1. The number of individuals alive at the beginning of the time period (at)
  2. The number of deaths (or other event of interest) during the time period (dt)
  3. the number of individuals censored during the time period (ie lost to follow up. died of another cause etc) (ct)
95
Q

Life tables: how to calculate number of persons at risk for the time period

A

nt = at - (ct / 2)

where ct is divided by 2 based on the assumption that, on average, censoring happens halfway through the time period

96
Q

Life tables: how to calculate the risk of dying for the time period (rt)

A

rt= dt/ nt

97
Q

Life tables: how to calculate the risk of surviving for the time period (St)

A

St= 1-rt

98
Q

Life tables: how to calculate the survival function S(t)

A

S(t)= S(t-1) x St
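
The four life table formulas from these cards can be chained together as in the sketch below. The cohort counts are hypothetical, and S(0) is taken as 1 on the assumption that everyone is alive at the start.

```python
def life_table(intervals):
    """intervals: list of (a_t, d_t, c_t) tuples, one per time period.

    Returns a list of (n_t, r_t, s_t, S_t) rows using the card formulas.
    """
    rows = []
    cumulative_survival = 1.0            # S(0) = 1: everyone is alive at the start
    for alive_at_start, deaths, censored in intervals:
        at_risk = alive_at_start - censored / 2      # nt = at - ct / 2
        risk_of_dying = deaths / at_risk             # rt = dt / nt
        interval_survival = 1 - risk_of_dying        # St = 1 - rt
        cumulative_survival *= interval_survival     # S(t) = S(t-1) x St
        rows.append((at_risk, risk_of_dying, interval_survival, cumulative_survival))
    return rows

# Hypothetical cohort of 100 followed over three yearly intervals
cohort = [(100, 10, 4), (86, 8, 6), (72, 6, 2)]
for row in life_table(cohort):
    print([round(value, 3) for value in row])
```

Each interval starts with the previous count minus that interval's deaths and censored individuals (100 - 10 - 4 = 86, and so on).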

99
Q

How does the Kaplan-Meier method differ from life tables

A
  • life tables collect information and calculate the survival function at set time intervals
  • however for many studies such as cohort studies the exact day an event occurs is known
  • if it is known it is better to use this information than to assume only that the event occurred at some point in the time period
  • the kaplan -meier method calculates the survival function every time the event of interest occurs or someone is censored.
100
Q

What does a Kaplan-Meier survival curve look like and how is it constructed

A

The kaplan-meier survival curves are constructed using the calculated survival functions.

As these are calculated every time the event occurs or someone is censored the curve has a stepwise appearance with steps of varying widths
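
A rough sketch of the stepwise calculation, assuming the standard product-limit form of the Kaplan-Meier estimate, which these cards do not spell out: at each event time the current survival is multiplied by (1 - events / number at risk). The estimate only changes at event times, so only those points are recorded; censoring simply reduces the number at risk. Data and names are hypothetical.

```python
def kaplan_meier(times, event_observed):
    """Return [(time, S(t))] at each time where at least one event occurred.

    times: follow-up time for each person
    event_observed: True if the event occurred, False if the person was censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        events = sum(1 for time, seen in zip(times, event_observed) if time == t and seen)
        leaving = sum(1 for time in times if time == t)
        if events:
            survival *= 1 - events / at_risk   # step down at each event time
            curve.append((t, survival))
        at_risk -= leaving                     # events and censored both leave the risk set
    return curve

follow_up_days = [5, 8, 8, 12, 15, 20]                 # hypothetical
had_event = [True, True, False, True, False, True]
print(kaplan_meier(follow_up_days, had_event))
```

Plotting these points as a step function gives the characteristic staircase shape, with step widths set by the gaps between event times.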

101
Q

2 main tests for looking for a statistical difference between survival curves of 2 groups

A
  1. log rank test
  2. Cox regression
102
Q

What is the log rank test and how is it done. What is it not likely to detect

A
  • a special application of the Mantel-Haenszel chi squared test
  • For each time interval (life tables) or step (kaplan-meier) a 2x2 table is constructed in order to compare the proportion in each group who died
  • observed and expected deaths are compared to test the null hypothesis- there is no difference between the 2 groups
  • best if the difference in survival between the 2 groups remains constant
  • unlikely to detect a difference if the survival curves cross (always plot survival curves first to check for this)
  • purely a test of significance, cannot provide an estimate of the size of the difference between the groups.
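
The observed-versus-expected comparison can be sketched as below for two groups with exact event times. This is an illustrative sketch, not a validated implementation: the function name `log_rank` and the data are hypothetical, and the p value uses the chi-squared distribution with 1 degree of freedom via `math.erfc`.

```python
import math


def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log rank test; events: 1 = died, 0 = censored."""
    data = [(t, e, "a") for t, e in zip(times_a, events_a)]
    data += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e == 1}):
        # One 2x2 table per event time: group by died / survived
        n_a = sum(1 for ti, _, g in data if ti >= t and g == "a")
        n_b = sum(1 for ti, _, g in data if ti >= t and g == "b")
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == "a")
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == "b")
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # deaths expected in A under the null
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared tail, 1 df
    return chi2, p_value


# Hypothetical data: group A dies early, group B dies late
chi2_diff, p_diff = log_rank([2, 3, 4, 5, 6], [1, 1, 1, 1, 1],
                             [8, 9, 10, 11, 12], [1, 1, 0, 1, 1])
# Identical groups should give no evidence of a difference
chi2_same, p_same = log_rank([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
print(round(chi2_diff, 2), round(p_diff, 4), chi2_same, p_same)
```

The output is only a test statistic and a p value, which matches the point that the log rank test gives no estimate of the size of the difference.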
103
Q

What is Cox regression also known as

A

The proportional hazards regression

104
Q

How does Cox regression differ from the log rank test

A
  • the log rank test can only compare groups on one factor at a time, ie were they given drug A or drug B
  • Cox regression can consider the impact of multiple variables on survival
  • it can therefore also adjust for confounding etc
105
Q

What assumption is Cox regression based on

A

Based on the assumption that the ratio of hazards (ie the immediate risk of dying at time point t) between groups is constant over time. Ie if group A are twice as likely to die at the beginning, this is also true at the end

106
Q

What is the output of Cox regression and how is it interpreted

A

Cox regression gives a log hazard ratio, which can be converted into a hazard ratio by exponentiating it.

Hazard ratio is interpreted similarly to relative risk:
1= no difference
>1= increased hazard
<1 = decreased hazard
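
A small sketch of the conversion and interpretation, assuming the log hazard ratio is a natural logarithm. The function name `interpret_hazard_ratio` and the input values are hypothetical.

```python
import math


def interpret_hazard_ratio(log_hr):
    """Convert a log hazard ratio to a hazard ratio and label it."""
    hr = math.exp(log_hr)
    if math.isclose(hr, 1.0):
        label = "no difference"
    elif hr > 1:
        label = "increased hazard"
    else:
        label = "decreased hazard"
    return hr, label


# Hypothetical log hazard ratios from a fitted model
for value in (0.0, math.log(2), -0.5):
    hr, label = interpret_hazard_ratio(value)
    print(round(hr, 2), label)
```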

107
Q

what is heterogeneity?

A

Differences in population, observations or studies

It is a particular problem for meta-analysis when you want to combine the results of many studies into a single result

108
Q

What are the three types of heterogeneity

A
  1. Statistical heterogeneity (can be caused by clinical or methodological heterogeneity. Can be tested for, assesses whether study results differ by more than would be expected by chance)
  2. methodological heterogeneity (studies conducted/ designed differently)
  3. Clinical heterogeneity (difference in population, intervention or outcome measure)
109
Q

2 main statistical tests for heterogeneity

A

1. Cochran’s Q statistic
2. I squared statistic

110
Q

Cochran’s Q statistic (what is it, what does it measure, how is it interpreted)

A
  • tests whether the differences between studies are greater than would be expected by chance
  • Has low power
  • tests the null hypothesis that the true effect size in all the studies is the same
  • A low P value therefore indicates heterogeneity
  • As power is low a 10% significance level is normally used
111
Q

Cochran’s Q statistic AKA

A

chi squared test for heterogeneity

112
Q

I squared statistic (what does it measure)

A

- Developed because Cochran’s Q statistic has low power
- Describes the percentage of variation across studies that is due to heterogeneity rather than chance
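
A minimal sketch of how Q and I squared could be computed from study effect estimates and their standard errors. It assumes inverse-variance weights and the usual definition I squared = 100% x (Q - df)/Q, truncated at 0; the function name `cochran_q_and_i2` and the data are hypothetical.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, degrees of freedom, I squared as a percentage)."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled effect
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


# Hypothetical effect sizes (eg log odds ratios) and standard errors
q, df, i2 = cochran_q_and_i2([0.1, 0.5, 0.9], [0.1, 0.1, 0.1])
print(round(q, 2), df, round(i2, 2))
```

Q would be compared against a chi-squared distribution with df degrees of freedom to obtain the p value.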

113
Q

Interpreting I squared statistic

A

25%= low heterogeneity
50%= moderate heterogeneity
75%= high heterogeneity

114
Q

What is a funnel plot and what is plotted on x and y?

A
  • particular type of scatter plot
  • x axis plots treatment effect whilst y axis plots a measure of study precision (ie standard error, study size)
115
Q

In what scenarios are funnel plots used

A
  • meta-analysis (look for small study effect)
  • performance analysis (look for units with outlying performance values)
116
Q

What is the small study effect and give 2 reasons it occurs

A
  • generic term for the phenomenon that small studies often report larger effect sizes than larger studies
  • most commonly this occurs due to publication bias (if study is small it is more likely to be published if it shows a very large effect size)
  • can also occur for genuine reasons, ie a study is done on high risk patients, of whom there are only a small number, but in whom the intervention has a particularly large impact
117
Q

Formal method of assessing for funnel plot asymmetry

A

Egger’s regression
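
A rough sketch of the regression, assuming the common formulation in which each study's standardised effect (effect / standard error) is regressed on its precision (1 / standard error) and an intercept far from zero suggests asymmetry. The function name `egger_intercept` and the data are hypothetical, and no p value is computed here.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, standard error of the intercept)."""
    y = [e / se for e, se in zip(effects, std_errors)]  # standardised effect
    x = [1 / se for se in std_errors]                   # precision
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = residual_ss / (n - 2)  # residual variance of the fitted line
    se_intercept = math.sqrt(sigma2 * (1 / n + mean_x ** 2 / sxx))
    return intercept, se_intercept


# Hypothetical meta-analysis: the smaller studies (larger SE) report larger effects
asym_intercept, asym_se = egger_intercept([0.9, 0.6, 0.4, 0.3], [0.4, 0.3, 0.2, 0.1])
# Same effect in every study: no asymmetry, so the intercept is about zero
flat_intercept, _ = egger_intercept([0.3, 0.3, 0.3, 0.3], [0.1, 0.2, 0.3, 0.4])
print(round(asym_intercept, 3), round(flat_intercept, 3))
```

Dividing the intercept by its standard error gives a t statistic with n - 2 degrees of freedom, which is what a formal test would use.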

118
Q

Funnel plots in meta-analysis- what is the impact if small study effect is present

A

If a small study effect exists in a meta analysis the treatment effect size will be overestimated

119
Q

Funnel plots in performance assessment: how are they used

A
  • used to compare units/ clinical teams
  • measure of success is plotted along with confidence intervals, those units with values outside of the confidence intervals are identified as outliers which may warrant further investigation
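
One way the limits could be drawn is sketched below. This is an assumption rather than the method stated above: it treats the measure as a proportion, centres 95% limits on a target rate using a normal approximation, and flags units outside them. The function names, the target rate and the unit data are hypothetical.

```python
import math


def funnel_limits(target_rate, n, z=1.96):
    """Approximate 95% limits around the target rate for a unit with n cases."""
    half_width = z * math.sqrt(target_rate * (1 - target_rate) / n)
    return max(0.0, target_rate - half_width), min(1.0, target_rate + half_width)


def flag_outliers(units, target_rate):
    """units: dict of name -> (events, cases). Returns names outside the limits."""
    outliers = []
    for name, (events, cases) in units.items():
        lower, upper = funnel_limits(target_rate, cases)
        rate = events / cases
        if rate < lower or rate > upper:
            outliers.append(name)
    return outliers


# Hypothetical units: (number of events, number of cases)
units = {"Unit A": (10, 100), "Unit B": (30, 100), "Unit C": (52, 500)}
outliers = flag_outliers(units, target_rate=0.10)
print(outliers)
```

The limits narrow as the number of cases grows, which produces the funnel shape.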
120
Q

What does Bayes’ theorem enable?

A
  • method to incorporate prior beliefs into probability calculations
  • ie existing knowledge about a patient will affect how much credence a clinician will give to a test result
  • Bayes’ theorem is important when considering how to interpret test results
  • it allows the positive predictive value to be related to the sensitivity of the test and the negative predictive value to be related to the specificity
121
Q

Bayes’ theorem: what is the overarching equation

A

p(A and B) = p(A) x p(B given A) = p(B) x p(A given B)

therefore

p(A given B) = p(A) x p(B given A) / p(B)

122
Q

Sensitivity (expressed in terms of Bayes’ theorem)

A

Sensitivity is the probability of a positive result GIVEN the person has the disease.

Sensitivity is a test characteristic and it is not affected by background disease prevalence

123
Q

Specificity (expressed in terms of Bayes’ theorem)

A

The probability of a negative result GIVEN the person does not have the disease

Specificity is a test characteristic and it is not affected by background disease prevalence

124
Q

positive predictive value (expressed in terms of Bayes’ theorem)

A

The probability of someone having the disease GIVEN they have tested positive

This is affected by the population/the background disease prevalence

125
Q

negative predictive value (expressed in terms of Bayes’ theorem)

A

The probability of someone not having the disease GIVEN they have tested negative.

This is affected by the population/the background disease prevalence
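
To illustrate the dependence on prevalence, the sketch below applies Bayes’ theorem to a test with fixed sensitivity and specificity at two different prevalences. The function name `predictive_values` and the numbers are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) using p(A given B) = p(A) x p(B given A) / p(B)."""
    # p(test positive): true positives plus false positives
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos
    # p(test negative): true negatives plus false negatives
    p_neg = specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    npv = specificity * (1 - prevalence) / p_neg
    return ppv, npv


# Same hypothetical test (90% sensitive, 90% specific) in two populations
low_prev = predictive_values(0.9, 0.9, prevalence=0.1)
high_prev = predictive_values(0.9, 0.9, prevalence=0.5)
print([round(v, 3) for v in low_prev], [round(v, 3) for v in high_prev])
```

With these inputs the positive predictive value is 0.5 at 10% prevalence and 0.9 at 50% prevalence, even though the test itself is unchanged.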

126
Q

Bayes’ theorem: converting probability to odds

A

Bayes’ theorem is easier to apply when quantities are expressed as odds rather than probabilities

Prior odds= Prior probability/ (1- prior probability)

127
Q

Bayes’ theorem: what is the likelihood ratio

A

Likelihood ratios compare the probability that someone with the disease has a particular test result with the probability that someone without the disease has that same result.

128
Q

Bayes’ theorem: calculating the likelihood ratio (LR+)

A

LR+ = sensitivity / (1 - specificity)

129
Q

Bayes’ theorem: how to calculate posterior odds of disease

A

posterior odds= prior odds x likelihood ratio

130
Q

Bayes’ theorem: calculating the posterior probability

A

posterior probability= posterior odds/ (1+ posterior odds)

131
Q

Stages in applying Bayes’ theorem

A
  1. calculate prior probability
  2. calculate prior odds
  3. Calculate likelihood ratio
  4. calculate posterior ODDS of disease
  5. calculate posterior probability of disease
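
The five stages can be sketched as one function for a positive test result. The function name `posterior_probability` and the example values are hypothetical, and the prior probability is taken as given (eg the background prevalence).

```python
def posterior_probability(prior_probability, sensitivity, specificity):
    """Posterior probability of disease after a positive test result."""
    # Stages 1-2: prior probability -> prior odds
    prior_odds = prior_probability / (1 - prior_probability)
    # Stage 3: likelihood ratio for a positive result (LR+)
    lr_positive = sensitivity / (1 - specificity)
    # Stage 4: posterior odds of disease
    posterior_odds = prior_odds * lr_positive
    # Stage 5: posterior odds -> posterior probability
    return posterior_odds / (1 + posterior_odds)


# Hypothetical: 10% prior probability, test 90% sensitive and 90% specific
result = posterior_probability(0.1, 0.9, 0.9)
print(round(result, 3))
```

Here the prior odds are 1 to 9 and LR+ is 9, so the posterior odds are 1 to 1 and the posterior probability is 0.5.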
132
Q

Bayes’ theorem: advantages of use (3)

A
  1. more flexible
  2. incorporates ALL the available knowledge
  3. mathematics is not controversial
133
Q

Bayes’ theorem: disadvantages of use (2)

A
  1. different people may quantify a priori probability differently
  2. if different people calculate prior probability differently then they will get different posterior probability results
134
Q
A