Statistics Flashcards

Question

What is the standard error of a distribution of sample means?

Answer 1

The SE of a distribution of sample means is a measure of the spread of those means. It is the standard deviation of the sampling distribution, and it measures the precision of the sample mean.

Answer 2

Small - all means close to the true mean - precise estimate

Answer 3

Gets smaller

Answer 4

```
SE = σ / √N
σ = SD of the population observations
N = sample size
```
However, we don't have data from the whole population, so we have to make do with the SD (s) of a single sample to estimate σ. As long as the sample is large, this should be a good estimate.
SE estimated = s / √N
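A minimal sketch of the estimated standard error, SE = s / √N, using only the standard library. The function name `estimated_standard_error` and the sample values are hypothetical, chosen for illustration.

```python
import math
import statistics


def estimated_standard_error(sample):
    """Estimate the SE of the mean as s / sqrt(N), where s is the sample SD."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample SD (n - 1 in the denominator)
    return s / math.sqrt(n)


# Hypothetical measurements, for illustration only
sample = [4.0, 6.0, 8.0, 10.0, 12.0]
se = estimated_standard_error(sample)
print(f"Estimated SE = {se:.3f}")
```

With only five observations this is a poor estimate of σ / √N; the card's caveat about needing a large sample still applies.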

Answer 5

``` Sample proportion Difference between 2 means Difference between 2 proportions Relative risk/odds ratio Regression coefficients ``` They all have different standard error formulae.

Answer 6

SE(p) = √(p x (1-p) / n)
SE(0.2) = √(0.2 x 0.8 / 100) = 0.04
68% CI for asthma: 0.2 +/- 0.04
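The asthma example can be reproduced with a short sketch. The function name `se_proportion` is an assumption for illustration; the 95% interval uses the 1.96 multiplier from the confidence interval card.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


p, n = 0.2, 100  # values from the card: 20% with asthma in a sample of 100
se = se_proportion(p, n)

ci_68 = (p - se, p + se)                # +/- 1 SE
ci_95 = (p - 1.96 * se, p + 1.96 * se)  # +/- 1.96 SEs

print(f"SE = {se:.2f}")
print(f"68% CI = {ci_68[0]:.2f} to {ci_68[1]:.2f}")
print(f"95% CI = {ci_95[0]:.4f} to {ci_95[1]:.4f}")
```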

Answer 7

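An interval around a sample estimate, constructed so that in repeated sampling 95% of such intervals would contain the true population value (often loosely read as a 95% probability that the true value lies within it). Sample mean +/- 1.96 SEs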

Answer 8

If looking at a difference between means or proportions: does 0 lie in the CI? If looking at a relative risk or odds ratio: does 1 lie in the CI? If so, the result is not statistically significant.

Answer 9

RCT Non- randomised clinical intervention studies Experimental lab studies

Answer 10

``` Cohort studies Case- control studies Cross-sectional study Ecological study Case study ```

Answer 11

Usually disease free cohort followed over time and subsequent disease status recorded. Usually prospective Accurate

Answer 12

Accurate Selection bias avoided BUT..long and expensive, loss to follow up and inappropriate for rare diseases

Answer 13

Cases who already have the disease are compared to disease free controls Retrospective

Answer 14

Quick and cheap Suitable for rare diseases BUT... - subject to recall bias, selection bias, assessment bias - relative timings can be difficult to ascertain - Not suitable for rare exposures - relative risks cannot be directly calculated

Answer 15

Relative risk | Odds ratio

Answer 16

RR>1 = increased risk | RR <1 = decreased risk

Answer 17

Case control study - RR would not work in case control as you have picked the number of people with the disease. Use odds ratio instead.

Answer 18

| | Outcome present | Outcome absent | Total |
|---|---|---|---|
| RF present | a | b | a+b |
| RF absent | c | d | c+d |
| Total | a+c | b+d | |

RR = (a / (a+b)) / (c / (c+d))
(Number with risk factor and disease / total number with risk factor) divided by (number without risk factor and with disease / total number without risk factor)

Answer 19

| | Outcome present | Outcome absent | Total |
|---|---|---|---|
| RF present | a | b | a+b |
| RF absent | c | d | c+d |
| Total | a+c | b+d | |

Odds of having the risk factor among the cases vs odds of having the risk factor among the controls
Odds ratio = (a/c) / (b/d)
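A small sketch of both measures from the same 2x2 table, using the cell labels a, b, c, d from the cards. The function names and the counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """RR = risk in those with the risk factor / risk in those without it."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (a / c) / (b / d), which simplifies to (a * d) / (b * c)."""
    return (a * d) / (b * c)


# Hypothetical counts: a, b = risk factor present (outcome yes / no)
#                      c, d = risk factor absent  (outcome yes / no)
a, b, c, d = 20, 80, 10, 90

rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
print(f"RR = {rr:.2f}, OR = {or_:.2f}")
```

In this made-up table the outcome is common, so the odds ratio (2.25) overstates the relative risk (2.0); the two are close only when the outcome is rare.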

Answer 20

Statement that there is no difference between groups in the population from which the sample has come. ALWAYS about the population - it would not make sense to hypothesise about the sample, as we already know about that.

Answer 21

Probability of obtaining sample data showing a difference as large as, or larger than, that observed if there is really no difference in the population from which the samples came, i.e. if the null hypothesis is true.

Answer 22

Unlikely that the sample could have come from a population where the null hypothesis is true (less than a 5% chance of seeing a difference this large if it were).

Answer 23

It is possible that the sample could have come from a population where the null hypothesis is true -> insufficient evidence to reject the null hypothesis (NEVER say we accept the null hypothesis)

Answer 24

Variable is numerical - you will be comparing means Variable is categorical you will be comparing percentages Variable is ordinal- you may use a specific test for ordinal data or you may treat the variable as categorical

Answer 25

Paired t test - paired differences are normally distributed, or large sample size (>100 pairs)
Wilcoxon's signed ranks test - does not need a normal distribution - NOT appropriate for ordinal data (as compares distributions not means)

Answer 26

Two types: - when the same person provides 2 values (eg crossover trial) - when each person in one group has a matched control in another group (eg case control studies)

Answer 27

```
Are you comparing means or percentages?
How many groups are you comparing?
Are the groups paired or independent?
Are the test assumptions met?
- sample size
- distributions
- equal variances
```

Answer 28

Comparing means of 2 independent groups. Data normally distributed (or >50 in each group). Equal variances.

Answer 29

Skewed data often summarised using medians instead of means If mean - 2SDs takes you below minimum possible value (often zero), or mean +2SDs takes you above the max possible value then the data cannot be normally distributed.

Answer 30

Equal distribution around the mean. | Can have normal distribution but different variance - bell is flatter or thinner but still symmetrical.

Answer 31

Do a statistical test eg Levene's test - if p <0.05 conclude variances are not equal, if p >0.05 there is no evidence against equal variances. - BUT if the sample size is small the test is unlikely to have sufficient power, and if it is large the test is likely to pick up unimportant differences. Could check for equal standard deviations (less than a factor of 1.5 is ok). If variances are not equal, some packages perform a separate variances version of the t-test. Or could try transforming the data (taking logs if positively skewed).

Answer 32

If assumptions for the independent samples t test are not met, i.e. non-parametric data. Can be used for numerical or ordinal data. Less powerful than the t test.

Answer 33

Paired differences are normally distributed (raw data can be skewed but the paired differences should be normally distributed) If >100 pairs can drop this.

Answer 34

Non parametric paired data Generally less powerful than the paired t test NOT ordinal data

Answer 35

Normally distributed with equal variances. Used for >2 groups. p >0.05: no evidence of a real difference between any pair of groups. p <0.05: there is evidence of a real difference between some or all of the groups. Does NOT tell you which groups. Needs follow up with post hoc tests, which tell you which groups differ. - compare each pair of groups - automatically make an adjustment for multiple testing. Many tests available including Scheffé, Bonferroni.

Answer 36

Non- parametric test For > 2 groups less powerful than ANOVA Can be used for ordinal data

Answer 37

Comparing percentages - categorical data. Between 2 independent groups.
```
Calculate observed (O) and expected (E) frequencies
Σ (O-E)^2 / E
```
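A rough sketch of the calculation, assuming the usual expected frequency of row total x column total / overall total and no continuity correction. The function name and the counts are hypothetical; in practice a statistics package would also supply the p value.

```python
def chi_squared(observed):
    """Return (statistic, expected table) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency
            expected_row.append(e)
            statistic += (o - e) ** 2 / e
        expected.append(expected_row)
    return statistic, expected


# Hypothetical 2x2 table: rows are groups, columns are outcome yes / no
observed = [[30, 70], [50, 50]]
stat, expected = chi_squared(observed)
print(f"Chi-squared = {stat:.2f}")
print("Expected frequencies:", expected)
```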

Answer 38

- any cells have expected freq <1 - > 20% cells have an expected freq < 5 Then use Fishers exact test (no min sample size)

Answer 39

Paired groups, comparing percentages | Only valid if number of discordant pairs is at least 10

Answer 40

Ordinal variable- ordered groups Large sample >30 Percentages increase/decrease linearly across groups.

Answer 41

2 sided test - difference can be in either direction. Null hypothesis: no difference between groups. Alternative hypothesis: there is a difference between groups, which could be in either direction.
1 sided test - Null hypothesis: no difference between groups, or a difference in one direction. Alternative hypothesis: a difference in the other direction. More likely to get a statistically significant result with a 1 sided test, as the whole 5% rejection region sits in one tail.

Answer 42

Non-inferiority trial | Should not be used just because a true difference in one direction is thought to be very unlikely

Answer 43

α = significance level of test. Usually set at 0.05, so p <0.05 is taken as statistically significant.

Answer 44

Wrongly rejecting the null hypothesis when it is true. So α (significance level of test) is the probability of making a type 1 error -usually 5% Type 1 errors also occur in multiple testing

Answer 45

Failing to reject the null hypothesis when it is in fact false (missing a real difference). β = probability of making a type II error.

Answer 46

1 - β = power Probability of avoiding a type II error - correctly rejecting the null hypothesis. 1 - β is usually set at 0.8-0.9 (80-90%) - for phase 3 trials would be 0.9

Answer 47

When large differences observed but the sample size is small so results not statistically significant.

Answer 48

Subgroup analysis Many outcomes or many predictors Repeated measures data Pairwise comparisons (>2 groups) Repeated testing as more subjects recruited Data- driven hypothesis Trying different definitions of your variables until you find one that is significant

Answer 49

Probability of getting a non-significant result when the null hypothesis is true (i.e. getting it right) is usually 95% (1-α). If we do 2 independent tests, the probability of correctly getting 2 non-significant results is 0.95 x 0.95 = 0.9025, roughly 0.90. So the probability of incorrectly getting at least one significant result (making a type I error) is about 10%. If you perform 20 tests for which the null hypotheses are all true, you would expect to get 1 significant result.
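The arithmetic generalises to any number of independent tests as 1 - (1 - α)^k. A minimal sketch, with a hypothetical function name:

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in n independent tests
    when every null hypothesis is true: 1 - (1 - alpha) ** n."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 2, 5, 20):
    rate = familywise_error_rate(k)
    expected_false_positives = k * 0.05
    print(f"{k:>2} tests: P(at least one type I error) = {rate:.3f}, "
          f"expected false positives = {expected_false_positives:.2f}")
```

This assumes the tests are independent, as the card does; correlated tests would give a different rate.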

Answer 50

Adjust for it - use an appropriate significance test - a single overall test like repeated measures ANOVA, or post hoc tests which have an inbuilt adjustment. Simple manual Bonferroni correction. Report the number of tests you perform - be honest.

Answer 51

Used to try to adjust for multiple testing. Multiplies the p value for each test by the number of tests performed. By increasing the p value it makes it more difficult to find significant results. If the p value was 0.001 and you had done 10 tests it would be corrected to 0.01. Considered a rather severe adjustment.
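A minimal sketch of the correction, assuming the common convention of capping adjusted p values at 1. The function name and the p values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p values from 10 tests
p_values = [0.001, 0.02, 0.04] + [0.5] * 7
adjusted = bonferroni_adjust(p_values)

for raw, adj in zip(p_values[:3], adjusted[:3]):
    verdict = "significant" if adj < 0.05 else "not significant"
    print(f"raw p = {raw}, adjusted p = {adj:.3f} -> {verdict}")
```

Here only the first result survives the adjustment, which illustrates why the correction is considered severe.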

Answer 52

```
Significance level, α
Power, 1 - β
Standard deviation of data
Size of difference of clinical interest - min clinically important difference
Expected response
```
Need to allow for compliance/loss to follow up

Answer 53

```
              True diagnosis
              +ve     -ve
Test   +ve    a       b       a+b
       -ve    c       d       c+d
              a+c     b+d
```
Sensitivity = a / (a+c)
True positives / all truly positive

Answer 54

```
              True diagnosis
              +ve     -ve
Test   +ve    a       b       a+b
       -ve    c       d       c+d
              a+c     b+d
```
Specificity = d / (b+d)
True negatives / all truly negative

Answer 55

```
              True diagnosis
              +ve     -ve
Test   +ve    a       b       a+b
       -ve    c       d       c+d
              a+c     b+d
```
PPV = a / (a+b)
True positives / all that tested positive

Answer 56

```
              True diagnosis
              +ve     -ve
Test   +ve    a       b       a+b
       -ve    c       d       c+d
              a+c     b+d
```
NPV = d / (c+d)
True negatives / all that tested negative
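The four measures can be computed together from the same table. A short sketch using the cell labels a, b, c, d from the cards; the function name and the counts are hypothetical.

```python
def diagnostic_measures(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),  # true positives / all truly positive
        "specificity": d / (b + d),  # true negatives / all truly negative
        "ppv": a / (a + b),          # true positives / all that tested positive
        "npv": d / (c + d),          # true negatives / all that tested negative
    }


# Hypothetical screening results for 1000 people, 100 of whom have the disease
measures = diagnostic_measures(a=90, b=50, c=10, d=850)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```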

Answer 57

PPV low as small numbers | Others will be high

Answer 58

ROC curve- look at area under curve, bigger area = better test

Answer 59

Ratio of the chance of a positive result if the patient has the disease to the chance of a positive result if they do not have the disease. Sensitivity / (1 - specificity). The higher the positive LR the better.

Answer 60

Ratio of the chance of a negative result if the patient has the disease to the chance of a negative result if they do not have the disease. (1 - sensitivity) / specificity. The lower the negative LR the better.
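Both likelihood ratios follow directly from sensitivity and specificity. A minimal sketch; the function name and the input values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical test with 90% sensitivity and 80% specificity
lr_pos, lr_neg = likelihood_ratios(0.9, 0.8)
print(f"Positive LR = {lr_pos:.2f}, negative LR = {lr_neg:.3f}")
```

A test with specificity of exactly 1 would make the positive LR undefined (division by zero), so this sketch assumes specificity below 1 and above 0.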

Answer 61

Standardised mortality ratio = observed deaths/expected deaths x 100 SMRs adjust for difference in age distributions of the groups being compared SMR <100 indicates a lower death rate than expected having adjusted for age SMR >100 indicates a higher death rate than expected, having adjusted for age

Answer 62

Number of new cases over a given time period

Answer 63

Number of existing cases at a certain point in time

Answer 64

Existing + new cases which develop over a given time period.

Answer 65

Used for meta analysis often Boxes= effect size for each study - larger study= bigger box Horizontal lines = 95% CI Diamond= pooled effect eg relative risk Width of diamond= 95% CI for pooled effect Log scale often used for relative risks as can increase infinitely

Answer 66

Combination of results of several different studies investigating the same effect. Single overall pooled estimate is obtained- often relative risk or odds ratio Increases the power

Answer 67

Selected as part of a systematic review with pre-defined inclusion criteria. Assess study quality, and report the review eg following the PRISMA reporting guidelines.

Answer 68

Publication bias - small studies which do not show an effect are unlikely to be published. Use a funnel plot to check for this.
Statistical heterogeneity - we can test for heterogeneity in the treatment effects beyond that expected by chance. If statistically significant, then it is unlikely that the studies actually reflect a single underlying treatment effect.
Clinical heterogeneity - causes statistical heterogeneity - when studies have important differences eg population, contexts, eligibility, control, follow up

Answer 69

Number of person-years at risk

Answer 70

The correlation coefficient is a measure of the strength and the direction of the linear relationship between 2 numerical variables Affected by outliers

Answer 71

R= -1 to +1 R +ve as x increases, y increases R -ve as x increases, y decreases R= 1 or -1 - perfect correlation, all points lie in a line (don't confuse this with slope of the line- can have any slope) R >0.8 strong correlation R <0.2 weak correlation R= 0 no correlation

Answer 72

R squared x 100 Tells you how much the variation in one variable can be explained by the other. Eg r = 0.94 … indicates a very strong positive correlation between a country’s average alcohol consumption and deaths rates from cirrhosis 0.94^2 x 100 = 88% so 88% of the variation in deaths from cirrhosis is accounted for by the variation in alcohol consumption
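A short sketch of how r and the percentage of variation explained relate, computing Pearson's r from first principles. The function name and the data are hypothetical and are not the alcohol and cirrhosis figures from the card.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
percent_explained = r ** 2 * 100
print(f"r = {r:.3f}, variation explained = {percent_explained:.0f}%")
```

Swapping `xs` and `ys` gives the same r, which matches the note that the correlation coefficient does not depend on which variable is called x.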

Answer 73

NO | Shows an association

Answer 74

Pearson's correlation coefficient - at least one of the variables is normally distributed Spearman's rank correlation coefficient - data at least ordinal

Answer 75

``` Bradford Hill's criteria • Strength of association • The cause must precede the effect • Dose-response relationship • Biologically plausible • Consistent results from several studies • Removing the risk factor should reduce the risk of disease (reversibility) ```

Answer 76

If 2 variables appear to be related then linear regression fits a straight line to the data. Can predict one variable from another.

Answer 77

y = a + b x ``` x = explanatory variable (also predictor; independent) y = outcome variable (also dependent; response) a = the intercept (value of y when x=0) b = the slope ( increase in y when x increases by 1 unit) ``` YOU MUST NOT REVERSE X and Y as would get a different line. (correlation coefficient you can swap them and it does not matter)

Answer 78

finds the line which minimises the sum of the squares of vertical deviations of points (called residuals) from the line
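For a single explanatory variable, the least squares line y = a + b x has a closed-form solution. A minimal sketch with a hypothetical function name and made-up data; it also shows that regressing x on y gives a different line.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b * x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"y = {a:.2f} + {b:.2f} x")
print(f"Sum of squared residuals = {sum(res ** 2 for res in residuals):.2f}")

# Reversing the roles of x and y minimises horizontal deviations instead
a_rev, b_rev = fit_line(ys, xs)
print(f"x = {a_rev:.2f} + {b_rev:.2f} y")
```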

Answer 79

Slope = 0 | No association in the population

Answer 80

Residuals are normally distributed around the line Residuals have constant variance around the line If assumptions not met try a transformation eg log

Answer 81

Simple: one explanatory variable | Multiple regression: several explanatory variables. (3 dimensional line). Same assumptions apply

Answer 82

y = a + b1x1 + b2x2 + … + bkxk

Answer 83

Logistic: binary categorical outcome, eg hypertension or not. Use odds ratios. Cox: time to event. Use hazard ratios.

Answer 84

Summarise survival/mortality according to age. Only use when interested in age rather than a disease Based on current age specific death rates Cross sectional

Answer 85

qx: probability of dying between x & (x+1) years px: probability of surviving from age x to age (x+1years) qx +px = 1

Answer 86

nx: no of survivors at age x nx+1 = nx * px Px: cumulative survival probability

Answer 87

Survival of a special group eg breast cancer Measure survival from a particular stage (age per se is not important) At analysis some have not experienced an outcome -> censored

Answer 88

Lost to follow up Still alive at end of study The data contributes for as long as they have been observed Will cause number at risk to be reduced but will not affect probability of survival or cumulative survival

Answer 89

Prob death = no of deaths/ no at risk Prob survival = 1 - probability of death Cumulative survival = previous cumulative survival x new probability of survival
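The three steps can be sketched as a small loop over the event times. This assumes the usual convention that a patient censored at the same time as a death is still counted as at risk at that time; the function name and the follow-up data are hypothetical.

```python
def cumulative_survival(times, events):
    """times: follow-up time per patient; events: 1 = death, 0 = censored.
    Returns a list of (time, cumulative survival) at each death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    cumulative = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            prob_death = deaths / at_risk
            prob_survival = 1 - prob_death
            cumulative *= prob_survival
            curve.append((t, cumulative))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up in months
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = cumulative_survival(times, events)
for t, s in curve:
    print(f"t = {t}: cumulative survival = {s:.3f}")
```

Censored patients reduce the number at risk for later steps without changing the survival probability at the time they leave.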

Answer 90

Assuming censoring is not self selected | If lots of people dropped out then may not be reliable

Answer 91

No - can't compare survival in 2 groups using survival at a fixed point. Will be different times when they are nearer or further.

Answer 92

Logrank test - non parametric test - uses all survival data - no assumptions about shape of survival curve - assumes lines don't cross over

Answer 93

Assumes survival same in 2 groups = null hypothesis Calculate expected nos of deaths & compare with observed nos. Test this using a X^2 statistic

Answer 94

Σ (d - e)^2 / e, summed over the groups (d = observed deaths, e = expected deaths in each group)

Answer 95

Uses a mathematical function of time to model how the probability of death varies with time. The probability of death is known as the hazard, and the function of time t is often denoted by h(t) - the hazard function.

Answer 96

The log of the hazard ratio

Answer 97

< 1 is better =1 chances are the same HR= 2 - 2 x higher chance Probability of progression in one group/ probability of progressing in other group.

Answer 98

Hazards are proportional - the relative risk between groups doesn't change over time | Lines do not cross over

Answer 99

Only systematic difference between trial groups should be randomised treatment Repeated error.

Answer 100

``` Efficient and appropriate trial design Randomisation Blinding - pts and doctors Using an intention to treat population Minimise treatment and protocol deviations ```

Answer 101

Caused by unknown unpredictable changes. | Results are estimates of a population

Answer 102

confidence intervals and p values | Minimise by having a sufficient sample size

Answer 103

Aim: dose finding: MTD Conduct: 3+3, rolling 6, continual reassessment method (CRM) (need some previous human data for this usually) Endpoints: tolerability, PK, PD, bioavailability

Answer 104

Aim - determine if a drug has a therapeutic effect. Conduct - historically a single arm study of 20-80 pts - single stage design - two stage - Simon, Gehan design, allows trial to be terminated at end of 1st stage if clearly inactive. Endpoints - tumour response - quick, pCR, ORR - PFS - biomarker

Answer 105

Prone to selection bias. No real allowance for imprecision in the historical estimate of response. Modest treatment effects may be lost.

Answer 106

Aim - to determine if a new treatment is better than an existing treatment. Conduct - unbiased, reliable, clinically useful, randomised comparison. Endpoints - DFS, PFS, OS - adverse risk vs benefit profile - translational research - identify patients who have most/least to gain

Answer 107

Parallel groups: between patient comparisons.
Factorial groups: >2 comparisons in the same trial without necessarily increasing the size. Patients change to a different drug or have 2 interventions at the same time.
Cross over: within patient comparison - each patient receives all treatments.

Answer 108

Use accumulating data to decide how to modify aspects of the study without undermining the validity or integrity of the trial. eg platform trials/ umbrella protocols/ basket trials

Answer 109

Change dose of treatment Change allocation ratio control: research Early stopping for benefit/lack of benefit Adding in new treatments - via randomisation or as additional cohorts

Answer 110

Reduce bias | Prevent confounders

Answer 111

Treatment allocated at random, easy and quick | But... can be an imbalance in the allocation due to chance

Answer 112

Blocks allocate to treatment, ensures each treatment occurs a given number of times in a given series of patients. It avoids predictable allocation but still some imbalance in prognostic factors.
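A rough sketch of permuted block randomisation for two treatments. The function name, block size and seed are assumptions for illustration; a real trial would generate and conceal the list through its clinical trials unit rather than ad hoc code.

```python
import random


def block_randomise(n_patients, treatments=("A", "B"), block_size=4, seed=None):
    """Allocate patients in shuffled blocks so each treatment appears
    equally often within every complete block."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_patients:
        block = list(treatments) * (block_size // len(treatments))
        rng.shuffle(block)  # random order within the block
        allocations.extend(block)
    return allocations[:n_patients]


allocations = block_randomise(12, seed=1)
print(allocations)
print("A:", allocations.count("A"), "B:", allocations.count("B"))
```

Balance is guaranteed only at the end of each complete block, and nothing here balances prognostic factors, which is where stratification or minimisation comes in.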

Answer 113

Divide the patients into groups (strata) depending on important characteristics, then allocate equally within each stratum using either simple or, preferably, random permuted block randomisation.

Answer 114

Dynamic allocation method - patient is allocated dependent on the characteristics of patients who have already been allocated. Also might incorporate a random element to avoid prediction of the next treatment (80% chance imbalance reduced and 20% chance it is increased)

Answer 115

Allocation of new patients dependent on characteristics of those that went before. Allocation lists cannot be drawn up Treatment allocation uses balancing factors NOT stratification

Answer 116

If standard therapy is no therapy Helps double blinding Ensures benefit due to treatment not just fact they are being treated

Answer 117

Larger sample size

Answer 118

Larger sample size | If significance level 1%

Answer 119

Avoid bias: people who receive non- allocated treatment likely to be a selected subset, ignoring them excludes this type of person from treatment arm. More pragmatic - gives an idea of the real world

Answer 120

Usually excludes patients who have any major protocol violations and analysis is by treatment actually received. Often used for - safety analyses - non-inferiority trials because data from patients who did not receive the protocol treatment tends to bias results towards equivalence and could make a truly inferior treatment appear non-inferior BUT bias

Answer 121

Should be defined in advance, but no standard definition Analysis by treatment received, but can include all patients who received some treatment (even if they were ineligible) Sensitivity analysis conducted on ITT population and patients with complete follow-up

Answer 122

Should be prespecified in the protocol to avoid data dredging. If not pre-specified then interpret with caution. Only for hypothesis generating, not for real data.

Answer 123

1) Background and rationale
2) Specific objectives and purpose
3) Description of trial design (randomised, placebo etc)
4) Registration and randomisation methods
5) Trial endpoints
6) Inclusion and exclusion criteria
7) Description of trial treatment - treatment schedule - dose modification procedures
8) Methods of patient evaluation - baseline and follow up
9) Assessment of safety - adverse event reporting
10) Required size of study - rationale for statistical assumptions
11) Trial progress - 'stopping rules'
12) Data handling & record keeping
13) Ethics considerations
14) Plans for statistical analysis - interim analyses - monitoring of quality of data
15) Administrative responsibilities
16) Finance and insurance
17) Publication policy

Answer 124

Start up phase - identify hypothesis - design trial - write protocol - apply for funding - identify sponsor - ethics approval - CTA - centre approvals
Conduct trial - recruit patients - manage data - monitor patient safety - GCP
Analyse data - test hypothesis - analyse safety and efficacy data, publish results

Answer 125

Research ethics committee for approval | IRAS - Integrated Research Application System - for combined ethics and central R&D approval

Answer 126

Research ethics committee | also changes in protocol

Answer 127

Central statistical monitoring - monitor recruitment rates, compliance, adverse events - freq depends on trial but should be done fairly regularly
Interim analysis - freq depends on trial - to look for treatment differences that are convincing and important enough to warrant stopping the trial early or changing the design

Answer 128

A multidisciplinary committee responsible for overseeing scientific and operational aspects of the trial. - includes CI, co-investigators, key clinical and scientific collaborators, clinical trials unit representatives and patient representatives

Answer 129

- input into trial protocol and case report forms - oversee ongoing conduct of trial - provide clinical or other expert guidance - develop strategies to optimise recruitment - promote and maintain profile of trial during its follow up phase - actively contribute to interpretation and write up of results

Answer 130

Provide expert independent oversight of trial on behalf of sponsors and funders Includes an independent chair and at least two further independent members with clinical or statistical expertise ( one must be a statistician)

Answer 131

• consider protocol amendments that will significantly alter trial design, conduct or analysis • consider TMG strategies to improve trial conduct, e.g. recruitment • consider recommendations of the IDMC • consider decisions on future continuation (or otherwise) of trial • oversee the timely reporting of trial results • consider requests for analyses (from TMG and external groups) not identified in the protocol or SAP

Answer 132

Small group eg 2 clinicians and a statistician Independent of trial organisers Assess pre-specified interim analysis of data (results confidential). Look at recruitment and completeness of data, side effects and interim results Can recommend a trial is stopped -> give recommendation to TSC who makes final decision

Answer 133

The more times you calculate the p value, the more chance you have of finding a significant result. Several statistical stopping rules or guidelines have been developed for multiple testing eg Pocock, Haybittle-Peto, O'Brien and Fleming

Answer 134

Refers to how well the outcome of the study can be generalised to the real world.

Answer 135

Extent to which study establishes a trustworthy cause and effect. Depends largely on procedures of the study eg randomisation, blinding, protocol

Answer 136

examines the relationship between disease (or other health related state) and other variables of interest as they exist in a defined population at a single point in time or over a short period of time (e.g. calendar year) Main outcome obtained is prevalence

Answer 137

Is at population level. Measures an outcome or risk in a population Looks at a group, not individuals.

Answer 138

Histogram groups the numbers into a range

Answer 139

Overall responsibility for the conduct of the trial Responsible for safety assessments Must evaluate all SAEs and decide if they are SARs or SUSARs. Must report all SUSARs to MHRA

Answer 140

``` • Must record all AEs during a study – records can be inspected by the Sponsor • decide if an event is serious • decide if an event is a reaction • decide if a reaction caused by IMP • Must immediately notify the Sponsor of SAE/Rs (usually within 24 hrs). ```

Answer 141

Row Total x column total / total number in both groups.

Answer 142

Log rank -> time to event, single predictor, categorical data Cox regression-> more than one variable, continuous data

Answer 143

Trial where the sample size is not defined in advance Data evaluated as it is collected and stopped at a predefined outcome. Good when time between treatment and outcome is short.

Answer 144

The National Cancer Registration and Analysis Service (NCRAS), part of Public Health England (PHE), is the population-based cancer registry for England. It collects, quality assures and analyses data on all people living in England who are diagnosed with malignant and pre-malignant neoplasms, with national coverage since 1971. It produces the national cancer registration dataset for England. The primary role of NCRAS is to provide near real-time, cost-effective, comprehensive data collection and quality assurance over the entire cancer care pathway. To achieve this, it receives data from across the National Health Service (NHS). The NHS Act 2006 protects PHE's right to collect cancer related data.