Stats Flashcards
Explain the two categories of data
Categoric:
Nominal, binary and ordinal
Numeric:
Continuous and discrete
Give 3 ways of describing categoric data
Example scenario:
Total participants = 707. Those using the drug who had an MI (myocardial infarction) = 31, those using the drug with no MI = 310
Those using the placebo who had an MI = 61, those using the placebo with no MI = 305
Risk of MI with the drug = 31/341 = 0.091 (9.1%; multiply by 100 to get the percentage)
Risk of MI with the placebo = 61/366 = 0.167 (16.7%)
Odds for MI with drugs = 31/310 = 0.1
Odds for MI with placebo = 61/305 = 0.2
(Absolute) risk difference = 0.167 - 0.091 = 0.076 (7.6 percentage points). This means the risk with the placebo is 7.6 percentage points higher than with the drug
Usually just called the risk difference; "absolute" is added when the sign (direction) of the difference is ignored
(Relative) risk ratio = 0.167/0.091 = 1.835 (183.5%), with the placebo on top so it becomes the focus group. This means the risk is 83.5% higher with the placebo than with the drug
Risk ratio = 0.091/0.167 = 0.545 (54.5%), with the drug on top so it becomes the focus group. This means the risk is 45.5% lower with the drug than with the placebo
Usually called the risk ratio or relative risk (RR). The abbreviation RRR normally stands for relative risk reduction (1 - RR), not the risk ratio
Odds ratio = 0.2/0.1 = 2 (with the placebo as the focus group). The ratio is 1 above the no-difference value of 1, so there is a 100% increase in the odds of having an MI on the placebo compared to drug A
Odds ratio = 0.1/0.2 = 0.5 (with drug A as the focus group). The ratio is 0.5 below the no-difference value of 1, so there is a 50% decrease in the odds of an MI on drug A compared to the placebo
How to find the risk ratio for the other focus group, e.g. in the drug A and placebo scenario
If the risk ratio with the placebo as the focus group is = 1.835
To find the risk ratio with drug A as the focus group, we do 1/1.835 which gives 0.545.
Basically finding the reciprocal
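The worked example above can be checked with a short Python sketch. The function name two_by_two_summary and the dictionary keys are illustrative and not part of the original notes; the counts are the ones from the example scenario.

```python
def two_by_two_summary(events_a, no_events_a, events_b, no_events_b):
    """Summarise a 2x2 table for group A (focus group) versus group B.

    Hypothetical helper for illustration; the argument names are assumptions.
    """
    risk_a = events_a / (events_a + no_events_a)
    risk_b = events_b / (events_b + no_events_b)
    odds_a = events_a / no_events_a
    odds_b = events_b / no_events_b
    return {
        "risk_a": risk_a,
        "risk_b": risk_b,
        "odds_a": odds_a,
        "odds_b": odds_b,
        "risk_difference": risk_a - risk_b,
        "risk_ratio": risk_a / risk_b,
        "odds_ratio": odds_a / odds_b,
    }


# Placebo as the focus group (A), drug as the comparison group (B).
placebo_vs_drug = two_by_two_summary(61, 305, 31, 310)
for name, value in placebo_vs_drug.items():
    print(f"{name}: {value:.3f}")

# Swapping the focus group gives the reciprocal ratios.
drug_vs_placebo = two_by_two_summary(31, 310, 61, 305)
print(f"risk ratio, drug as focus: {drug_vs_placebo['risk_ratio']:.3f}")
```

Working from unrounded risks gives a risk ratio of about 1.833 rather than 1.835; the small difference comes from rounding the two risks to three decimal places before dividing.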
Give two ways to measure and present categorical data
Pie charts
Bar charts
Give three ways to measure and present numerical (quantitative) data
Dot plots
Histograms
Box and whisker plots (box plots)
Give a way to present an association between two continuous variables
Scatter plots
Give some characteristics of a histogram
It can be used to show a normal distribution (also called a Gaussian distribution)
It can show skewed data, which is when the data is not symmetrical
Negatively skewed data = has a long, low left tail and peaks at high values on the right
Positively skewed data = has a long, low right tail and peaks at low values on the left
Give some characteristics of a box plot
The box contains the middle 50% of the data
The line in the box plot shows the median value
Outliers (values more than 1.5 box lengths beyond the upper or lower edge of the box) are plotted as dots outside the whiskers
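The outlier rule can be sketched in Python as follows. This is a minimal illustration, assuming the box length is the interquartile range (IQR) and using the default quartile convention of the standard library's statistics.quantiles; other software may compute quartiles slightly differently. The function name and the data are made up.

```python
import statistics


def box_plot_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1  # the "box length"
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


# Made-up data with one extreme value.
data = [2, 4, 5, 5, 6, 7, 8, 9, 30]
lower, upper, outliers = box_plot_outliers(data)
print(f"fences: {lower} to {upper}, outliers: {outliers}")
```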
Give some characteristics of scatter plots
The independent variable is on the X axis and is usually what the experimenter changes
The dependent variable is on the Y axis and is usually the response to what the experimenter changes
Give three ways to measure the spread of data
Range
Inter-quartile range
Standard deviation
What is the variance
Variance = standard deviation squared (SD^2)
Standard deviation = square root of the variance
What does a large or small standard deviation show
A small standard deviation shows that any random value picked is likely to be close to the mean so small spread of data
A large standard deviation shows that any random value picked is likely to be further from the mean so large spread of data
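A small sketch of the two cards above, using Python's standard statistics module. The two data sets are invented so that they share the same mean but differ in spread; the sample (n - 1) versions of the variance and standard deviation are used here, which is an assumption since the notes do not say which version is meant.

```python
import statistics

# Two made-up data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, values in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)       # sample standard deviation
    var = statistics.variance(values)   # sample variance, equal to sd squared
    print(f"{name}: mean={mean}, sd={sd:.2f}, variance={var:.2f}")
```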
What is the best method to use if there is a symmetric distribution of data
Mean and the standard deviation
What is the best method to use if the distribution of data is non-symmetric
Median and the interquartile range
Methods to summarise categoric variables
Proportion, percentage, risk and odds
Methods to summarise numerical (quantitative) data
Mean, median, range, interquartile range and standard deviation
Methods to quantify differences between two categorical variables
Absolute risk difference
Relative risk ratio
Odds ratio
Methods to quantify the differences between two numeric variables
Pearson's correlation coefficient (r)
r must be between -1 and +1
+1 shows a perfect positive linear correlation
0 shows no linear correlation
-1 shows a perfect negative linear correlation
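As an illustration, the coefficient can be computed directly from its definition. The sketch below is a plain-Python version with made-up data; the function name pearson_r is an assumption, not something from the original notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])      # points on a rising line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])      # points on a falling line
r_none = pearson_r([1, 2, 3, 4, 5], [2, 1, 0, 1, 2])    # V shape, no linear trend
print(r_positive, r_negative, r_none)
```

The third data set has a clear V-shaped pattern but r close to 0, which is why r is described as measuring linear correlation only.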
Methods to calculate the difference between one categoric variable and one numeric variable
If the numeric variable has a symmetrical distribution in both groups, use the difference in means (mean - mean)
If the numeric variable has a non-symmetrical distribution in either group, use the difference in medians (median - median)
Give the percentage of a normal distribution that lies within 1, 2 and 3 standard deviations of the mean
1 standard deviation = 68%
2 standard deviations = 95.4%
3 standard deviations = 99.7%
How many standard deviations cover 95% of the distribution
1.96 standard deviations either side of the mean cover 95% of the distribution
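These proportions can be computed for a normal distribution with the error function from Python's math module. This is a minimal sketch; the function name coverage_within is made up for the example.

```python
import math


def coverage_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 1.96):
    print(f"within {k} SD: {coverage_within(k) * 100:.2f}%")
```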
If you cannot use a mean for a certain set of data, would you still be able to use standard deviation for that data
No, the standard deviation would be affected by the same issue of being skewed by outliers
If possible, it is best to use the mean and standard deviation because they use every value in the data, which makes them more powerful
Who is Sir Francis Galton and what are his contributions to statistics
1822-1911
Standard deviation, correlation, concepts of regression, medians and ranking
First weather map
How to cut a cake
Attractiveness of cities
What is standard error
It is an estimate of how precisely the sample mean represents the true population mean
How is the standard error calculated
Standard error = standard deviation/ the square root of the sample size
When can the standard error not be used
When the standard deviation and the mean cannot be used due to how skewed the data is
What are the rules to use the standard error
The data has to be normally distributed
The sample size has to be large enough (more than 20 individuals)
Show how to calculate the confidence interval from the standard error
If the mean = 18,477, sample size (n) = 12 and standard deviation (SD) = 3,732
Standard error = standard deviation / the square root of the sample size
Standard error = 3,732 / square root of 12 = 1077.3
To get a 95% confidence interval, use the multiplier 1.96 (the number of standard deviations that covers 95% of a normal distribution)
To get a 99% confidence interval, use the multiplier 2.58
Mean - (1.96 x standard error) = 18,477 - (1.96 x 1077.3) = 16,365
Mean + (1.96 x standard error) = 18,477 + (1.96 x 1077.3) = 20,589
95% confident that the true value of the mean lies between 16,365 and 20,589
If we wanted to get the 99% confidence, we would do mean - (2.58 x standard error) and mean + (2.58 x standard error)
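The calculation above can be reproduced with a few lines of Python. The function name confidence_interval is illustrative; the mean, standard deviation, sample size and multipliers are the ones given in the card.

```python
import math


def confidence_interval(mean, sd, n, multiplier=1.96):
    """Return (lower, upper, standard_error) for a mean, using a normal multiplier."""
    standard_error = sd / math.sqrt(n)
    margin = multiplier * standard_error
    return mean - margin, mean + margin, standard_error


lower_95, upper_95, se = confidence_interval(18477, 3732, 12)
lower_99, upper_99, _ = confidence_interval(18477, 3732, 12, multiplier=2.58)

print(f"standard error = {se:.1f}")
print(f"95% CI: {lower_95:.0f} to {upper_95:.0f}")
print(f"99% CI: {lower_99:.0f} to {upper_99:.0f}")
```

The 99% interval is wider than the 95% interval because a larger multiplier is applied to the same standard error.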
What does a small standard error mean for a sample
Greater precision: the results from the sample are more likely to be representative of the population
What does a small standard deviation mean
The values are less spread so there is less variability in the sample
What is correlation used to explore
- how two numeric continuous variables are related
- the strength of an association
Give the regression equation and what each part of it means
y = a + bx
y is the dependent variable (also called the outcome or response), the one we measure, e.g. blood pressure reading, pain score, hours of sleep
x is the independent variable (also called the predictor or explanatory variable), e.g. age, deprivation level and family history of illness
a is the y-intercept (also called the constant)
b is the coefficient: the change in y when we increase x by 1 unit
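To make the equation concrete, here is a minimal sketch that estimates a and b by ordinary least squares, the usual fitting method for a straight line (the notes themselves do not say how the line is fitted, so this is an assumption). The data and the function name fit_line are made up.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (intercept) and b (coefficient) in y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx          # change in y per 1-unit increase in x
    a = mean_y - b * mean_x  # value of y when x is 0
    return a, b


# Made-up data lying exactly on the line y = 2 + 1.5x.
xs = [1, 2, 3, 4, 5]
ys = [3.5, 5.0, 6.5, 8.0, 9.5]
a, b = fit_line(xs, ys)
predicted = a + b * 6  # predicted outcome when x = 6
print(f"a = {a:.2f}, b = {b:.2f}, prediction at x=6: {predicted:.2f}")
```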
Give the name of the type of regression where the outcome is a single continuous variable e.g sleep time
Linear regression
Give the name of a regression which has a binary outcome e.g pass or fail
Logistic regression
What is the benefit of regressions compared to just correlations
We can incorporate additional values (additional predictors) which allows us to account for confounding variables
Give some characteristics of regression models
- they can be used to create predictive models
- remove the effects of confounding variables
- explore how a particular drug influences the outcome
Does linear regression require a binary or numeric outcome variable
It requires a numeric outcome variable
A regression model with more than one predictor is called
A multivariable model
What is a great benefit of a multivariable regression model (more than one predictor)
It adjusts for confounding variables
A difference between coefficients in a single and multiple regression model is
Coefficients in a multiple model have taken account of background factors
In linear regression with 1 predictor variable, a coefficient of 1.5 means that increasing the predictor by 1 unit causes
An increase in the outcome by 1.5 units
What is prevalence
Proportion of population with a disease at one point in time
Number of cases at a point in time/ total population = prevalence
What is incidence
A rate
Rate at which new cases appear in a population over a given period of time
Number of new cases in the period / at-risk population = incidence
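A short numeric illustration of the two formulas, using an entirely hypothetical town. Treating the at-risk population as everyone who does not already have the disease is an assumption made for this example.

```python
def prevalence(existing_cases, total_population):
    """Proportion of the population with the disease at one point in time."""
    return existing_cases / total_population


def incidence(new_cases, at_risk_population):
    """New cases per person at risk over the study period."""
    return new_cases / at_risk_population


# Hypothetical figures for a town followed for one year.
total_population = 1000
existing_cases = 50
at_risk = total_population - existing_cases  # people who could still develop the disease
new_cases_this_year = 19

point_prevalence = prevalence(existing_cases, total_population)
yearly_incidence = incidence(new_cases_this_year, at_risk)
print(f"prevalence = {point_prevalence:.1%}, incidence = {yearly_incidence:.1%} per year")
```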
Advantages of ecological studies
- uses routinely collected data, so quick and cheap
- units of analysis are populations (groups of people)
- can examine patterns of ill-health by age, sex, ethnicity, country and/or by time
- few ethical issues
- useful for generating hypotheses
Disadvantages of ecological studies
- no link between individual exposure and effect
- bias, e.g. variation in diagnostic criteria
- absence of records of individual attributes
- unsuitable format of records
- inconsistency in data presentation
Advantages of cross-sectional studies
- results used to generate hypotheses
-rapid feedback of current events in the community - quick and cheap
-few ethical problems
Disadvantages of cross-sectional studies
- could just be reporting a medical oddity
- prone to bias, e.g. sampling, subject and observer variation
- no time reference
Advantages of case control studies
-by concentrating effort on the identification of affected individuals and recruiting controls from the unaffected population, the number of subjects required to obtain significant results is kept to a minimum (so good for rare diseases)
-results can be obtained relatively quickly because the investigation does not have to wait for the disease to develop (compare this with cohort studies – see later) and can look for multiple causes
-it is a relatively inexpensive type of study
Disadvantages of case control studies
-generally rely on retrospective data, which has its own dangers. The ability of individuals to recall past events tends to be unreliable due to a tendency for memory to be selective. Records of past events may be incomplete.
-because data are collected retrospectively, it is difficult to say if an association is causal or not. This is less of a problem when the exposure is highly specific or where the time between exposure and disease is short
-prone to selection and information biases
-there can be difficulties choosing controls
-the incidence of disease within a population cannot be calculated from this type of study
Advantages of cohort study
-the main advantage is that it is possible to distinguish antecedent causes from concurrent associated factors (cause comes before effect)
-since incidence can be determined for both exposed and non-exposed groups, we can determine absolute, relative and attributable risks
-we can study more than one outcome to the same exposure
-there is less chance of bias since exposure is measured before development of disease
Disadvantages of cohort study
-cannot be certain that exposures are causal- this requires controlled studies
-long periods of study, and large populations mean that cohort studies are expensive
-follow-up can be a problem- especially if the period of study is long- this needs to be considered in the design of the study
-diagnosis of cases may change over the years as medical science becomes more advanced- better at detecting the disease or with different criteria for a diagnosis
Advantages of randomised controlled trials
-randomization should mean that confounding factors (age, sex etc.) are equally distributed. This helps to concentrate the study on the effect of the intervention
-by randomly allocating patients to interventions, it is likely that staff and patients will not break the blinding
-statistical tests for significance are easier to interpret when the study design removes confounders
-confounders and many biases minimised
Disadvantages of randomised controlled trials
-to allow sufficient numbers to balance confounders these tend to be large and expensive trials. They are often multicentre and may even be multinational
-there is always a chance that volunteer bias will be a problem, because people who refuse to be included in the trial, or who are never asked, may differ from those who take part
-there may be ethical difficulties in withholding treatment from the control group or offering what is believed to be an inferior treatment to one group
-may lose statistical power if poor compliance
What is critical appraisal
-critical appraisal is the assessment of evidence by systematically reviewing its relevance, validity and results to specific situations (R Chambers, 1998)
Difference between parametric and non-parametric tests
Parametric tests make assumptions about the data that need to be met before they can be used
Non-parametric tests are used as an alternative when those assumptions are not met; they make fewer assumptions about the distribution of the data
Assumptions include sample size, normal distribution and linearity in regression
Examples of parametric test
- one sample t-tests
- two sample t-tests (also called Student's t-test)
- chi square test
- ANOVA test
- Pearson correlation coefficient
Examples of non-parametric test
- one sample Wilcoxon test
- Mann-Whitney U test
- Fisher's exact test
- Kruskal-Wallis ANOVA
- Spearman rank correlation coefficient
Give two examples of critical appraisal tools
- CASP
- AXIS has 20 questions and no scoring system
Give the frequency of these single gene defects
Cystic fibrosis, alpha-1-antitrypsin deficiency, hereditary haemorrhagic telangiectasia (HHT), immotile cilia syndrome
Cystic fibrosis = 1 in 2500
Alpha-1-antitrypsin deficiency = 1 in 2000
Hereditary haemorrhagic telangiectasia (HHT) = 1 in 4000
Immotile cilia syndrome = 1 in 20000
How can the CFTR gene (chromosome 7, 27 exons, encoding a 1480-residue protein) be identified
- linkage
- positional cloning
- sequencing
How can Cystic Fibrosis be diagnosed
- sweat test
- gene mutation analysis
Give the symptoms that Cystic Fibrosis could cause
- abnormal ion transport across epithelium
- salt loss
- impaired mucociliary clearance
- chronic infections
- sterility (infertility)
- impaired digestion (meconium ileus)
- failure to thrive
- liver disease
- diabetes
Treatment of Cystic Fibrosis
- pancreatic enzyme supplementation
- control of infection
- suppression of chronic infection - antibiotic nebulisers
- bronchodilation - salbutamol nebulisers
- anti-inflammatory - azithromycin
- diabetes - insulin
- vaccinations - flu, pneumococcal
Give the chromosomal cause of alpha-1-antitrypsin deficiency
- autosomal recessive
- chromosome 14
- 14q32.1
What are the normal and disease-associated phenotypes in alpha-1-antitrypsin deficiency
M is the normal phenotype
S and Z are the variants associated with major disease presentation
What is the clinical presentation of alpha-1-antitrypsin deficiency
Due to build up of deformed alpha-1-antitrypsin in the liver
- childhood jaundice
- early onset cirrhosis
Due to the unopposed action of neutrophil elastase in the lungs
-early onset emphysema and bronchiectasis
Highly sensitive to cigarette smoke
What is the inheritance pattern of hereditary haemorrhagic telangiectasia (HHT)
-autosomal dominant
Hereditary haemorrhagic telangiectasia (HHT) is also known as Osler-Weber-Rendu disease (or syndrome)
-causes abnormal blood vessel formation in the skin, mucous membranes and in organs such as the lungs, liver and brain
Give the loci affected by the 3 forms of Osler-Weber-Rendu Syndrome and symptoms experienced in these conditions
HHT1
-endoglin gene (ENG) on chromosome 9
HHT2
-ALK-1
HHT3
-chromosome 5
Symptoms:
- telangiectasia
- epistaxis (nosebleeds)
- PAVMs (pulmonary arteriovenous malformations)
- GI blood loss
Give another name for immotile cilia syndrome and its inheritance patterns
- Kartagener’s syndrome or primary ciliary dyskinesia
- autosomal recessive
Give some symptoms of Kartagener’s syndrome
10 variations in dynein arm
- infertility
- sinusitis
- bronchiectasis
- situs inversus
Give some examples of disease from polygenic influences
- asthma
- chronic obstructive pulmonary disease (COPD)
- venous thrombosis and pulmonary embolism
- Tuberculosis
- sarcoidosis (NRAMP)
- Obstructive sleep apnoea
- infant respiratory distress
Give 4 examples of autosomal recessive respiratory disease and their genes
- cystic fibrosis - CFTR
- alpha-1-antitrypsin deficiency - SERPINA1
- Kartagener’s syndrome (immotile cilia) - DNAI1
- pulmonary veno-occlusive disease - EIF2AK4
Give an x linked example of a respiratory condition and their genes
-chronic granulomatous disease - CYBB
Give 2 examples of autosomal dominant conditions and their genes
- hereditary haemorrhagic telangiectasia (HHT) - ALK-1, ENG
- hereditary pulmonary arterial hypertension (HPAH) - BMPR2
What is secondary prevention
-aims to detect early disease in order to alter the course of the disease, e.g. screening by mammography for breast cancer in order to treat it early
What is sensitivity and give the formula to calculate it
-the proportion of people with the disease who are correctly identified by the screening test
True positive / (true positive + false negative) = sensitivity
What is specificity
-the proportion of people without the disease who are correctly excluded by the screening test
True negative / (true negative + false positive) = specificity
What is positive predictive value and give the formula to calculate it
-the proportion of people with a positive test result who actually have the disease
True positive / (true positive + false positive) = positive predictive value
What is a negative predictive value and give the formula to calculate it
-the proportion of people with a negative test result who do not have the disease
True negative / (false negative + true negative) = negative predictive value
Give the formula to calculate prevalence
(True positive + false negative) / (true positive + false negative + false positive + true negative) = prevalence
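The five formulas above can be collected into one small Python function. The counts below are hypothetical, and the function name screening_metrics and its dictionary keys are made up for the example.

```python
def screening_metrics(tp, fn, fp, tn):
    """Screening test measures from the four cells of a 2x2 table.

    tp = true positives, fn = false negatives,
    fp = false positives, tn = true negatives.
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "prevalence": (tp + fn) / total,
    }


# Hypothetical screening of 1000 people, 100 of whom have the disease.
metrics = screening_metrics(tp=90, fn=10, fp=45, tn=855)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```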
Which of these have an effect on the predictive values
- prevalence
- specificity
- sensitivity
- predictive values are dependent on prevalence, as well as on the sensitivity and specificity of the test
- sensitivity and specificity themselves do not change with prevalence
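The dependence on prevalence can be seen by holding sensitivity and specificity fixed and varying the prevalence. The sketch below uses the standard Bayes' theorem form of the positive predictive value; the test characteristics (90% sensitivity, 95% specificity) and the function name are assumptions chosen for illustration.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value for a test applied at a given prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Same hypothetical test applied in populations with different prevalence.
ppv_values = [ppv_from_prevalence(0.90, 0.95, p) for p in (0.01, 0.10, 0.50)]
for p, ppv in zip((0.01, 0.10, 0.50), ppv_values):
    print(f"prevalence {p:.0%}: PPV = {ppv:.1%}")
```

With the same test, a positive result is far more likely to be a true positive when the disease is common than when it is rare.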
How would screening programs be evaluated
-by randomised controlled trial (individual or clusters)
Give 3 forms of bias that can affect evaluation of screening programs
- selection bias
- lead time bias
- length time bias (or length bias)
What is selection bias
-people who choose to participate in screening programmes may be different from those who do not
- may be at more risk
- may be at less risk
What is lead time bias
When screening appears to increase survival time only because the disease was discovered and diagnosed earlier, even if the time of death is unchanged
What is length time bias
An overestimation of survival because slow-progressing (long duration) cases are more likely to be detected by screening than fast-progressing (short duration) cases, e.g. PSA screening is more likely to detect slow-growing tumours
What are the 5 types of screening
- population-based screening programmes (e.g. national diabetes and hypertension screening, as in Thailand)
- opportunistic screening (e.g. prevention and control of substance abuse)
- screening for communicable diseases (e.g. the Heaf test)
- pre-employment and occupational medicals (e.g. vision tests for commercial drivers)
- commercially provided screening (note: screening is a programme, not a test)