Definitions from scratch Flashcards
Two types of variable
Metric variables
Categorical variables
Categorical variables can be
Nominal: relates to named things i.e. it is NOT numeric
It is categorical because we allocate each data point to a category e.g. male or female
Ordinal
Nominal data
= categorical nominal variable
Nominal: relates to named things
Category: each data point is placed in a category
Properties of nominal data:
- They do not have units of measurement
- The ordering of categories is arbitrary i.e. does not matter
Example:
Males: 45
Females: 72
Properties of nominal data
Categorical nominal variable
- They do not have units
- The ordering of categories is arbitrary
Example:
Males: 43
Females: 52
OR
Females: 52
Males: 43
Ordinal data
=categorical ordinal data
Ordinal data is categorical but it can be ordered in a meaningful way i.e. smallest to largest
e.g. Glasgow coma scale
If person A has a GCS of 5, and person B has a GCS of 10, we can conclude person A’s consciousness is lower BUT we CANNOT conclude by how much i.e. we CANNOT say it is half as much
The difference between adjacent scores is not constant
The seemingly numeric values are NOT numbers, but labels
Properties of ordinal data
Categorical ordinal data
- Does not have units
- CAN be ordered in a meaningful way
- Nearly always integers
- Assessed rather than measured
NOTE: they do not have a numeric value; they seemingly have numeric values but these are actually labels i.e. a GCS of 3 is saying that they fit into a category called GCS 3
What you shouldn’t do with ordinal data
YOU SHOULD NOT TREAT THEM AS NUMBERS
i.e. for ordinal data you should not add, divide, or average it
Ordinal data = number labels
Metric variables can be
Discrete: values occur in discrete intervals i.e. 1, 2, 3, 4, 5
- comes from counting i.e. number of operations
- difference between each count is constant (in comparison to ordinal data)
- 4 operations is twice as many as 2 operations
Continuous:
Properties of discrete data
Discrete metric data
- Has units
- Discrete variables come from counting, so they take integer values
Continuous data
Continuous metric data
- Values form a continuum
- Real numbers
- Has units
Frequency table
Used to illustrate descriptive statistics
Frequency distribution
Illustrates the number of events in each category
Relative frequency
= percentages
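A frequency table and its relative frequencies can be sketched in a few lines of Python (the sample values below are made up for illustration):

```python
from collections import Counter

# Hypothetical sample of categorical (nominal) observations
observations = ["male", "female", "female", "male", "female"]

counts = Counter(observations)            # frequency distribution
total = sum(counts.values())
relative = {k: round(100 * v / total, 1)  # relative frequency as a percentage
            for k, v in counts.items()}

print(counts)    # frequencies per category
print(relative)  # {'male': 40.0, 'female': 60.0}
```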
Contingency table
Cross tabulations
Illustrate association between two variables in a single population
Rows hold the categories of one variable; columns hold the categories of the other
Ranking data
Allows assessment of non-parametric data
Order data into size
Starting with the largest value, rank it with a value of 1
Next value rank as 2
Equal values are tied and given the average of the ranks they would occupy, e.g. 7 8 5 5 5 3 1
8: 1
7: 2
5: =4
5: =4
5: =4 (3, 4, 5: average = 4)
3: 6
1: 7
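The tied-ranking rule above can be sketched as a small Python function (a hand-rolled illustration, not a library routine):

```python
def rank_descending(values):
    """Rank values from largest (rank 1) downward, giving tied
    values the average of the ranks they would have occupied."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # positions i..j-1 are tied; average their 1-based ranks
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [ranks[v] for v in values]

print(rank_descending([7, 8, 5, 5, 5, 3, 1]))
# -> [2.0, 1.0, 4.0, 4.0, 4.0, 6.0, 7.0]
```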
Ogive
Pronounced ojive
Cumulative frequency curve with continuous metric data
Curved (no step) chart
Measures of shape (skew)
Skewness:
-skewness coefficient defined from -1 to +1
Kurtosis:
Left skew
= negative skew
Lots of large values
Negative –> peak is further away from y-axis
Right skew
=positive skew
Lots of small values
“Right skew, close to you”
Distributions
Symmetric: classic one humped distribution
Bimodal: two peaks
Multimodal: multiple peaks
Kurtosis
Measure of distribution
Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution (e.g., five or more standard deviations from the mean).
Skewness differentiates extreme values in one versus the other tail, kurtosis measures extreme values in either tail.
If you hold the area the same, decreasing the kurtosis makes the peak flatter and broader (platykurtic); increasing it gives heavier tails and a sharper peak (leptokurtic)
Kurtosis value of normal distribution
=3
(excess kurtosis value = 0 i.e. the excess version subtracts 3 from the calculation)
(uniform distribution = 1.8)
Mode
Useful in categorical data
Useless in continuous data, where no two values are likely to be the same
Median
CAN BE USED FOR ORDINAL AND CONTINUOUS DATA
Discards a lot of information
Not as affected by skew vs mean
Not as affected by outliers vs mean
Therefore median is a stable measure
Mean
Uses all the data - each value is included
Therefore subjected to effect from outliers and skew
Cannot be performed on ordinal data
Percentiles
Values that divide a data set into 100 equal-sized groups
To find percentile, multiply percentage in decimals by (n+1)
Where n is equal to number of data points
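A minimal sketch of the (n + 1) position method, with linear interpolation between adjacent values when the position is fractional (the data below are hypothetical):

```python
def percentile(data, pct):
    """Percentile by the (n + 1) position method:
    position = pct/100 * (n + 1), interpolating between neighbours."""
    s = sorted(data)
    n = len(s)
    pos = pct / 100 * (n + 1)
    lo = int(pos)
    frac = pos - lo
    if lo < 1:
        return s[0]
    if lo >= n:
        return s[-1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(percentile(data, 50))  # position 0.5 * (9 + 1) = 5 -> 5th value = 5
```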
Properties of using range
Lowest to highest value
Not affected by skew
Sensitive to outliers which may misguide range
Interquartile range
Removes 25% from each end
Reduces effect of outliers
Affected by skewed distributions
Limitations of interquartile range
Discards 50% of the data!
Definition of standard deviation
Average distance of the data values from their collective mean
Uses each data point, i.e. uses all the data (unlike IQR)
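In Python's standard library, statistics.pstdev gives the population SD and statistics.stdev the sample SD (n - 1 denominator); strictly, the SD is the root of the mean squared distance from the mean rather than the plain average distance. The data are a made-up example:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean = statistics.mean(data)             # 5.0
sd_population = statistics.pstdev(data)  # population SD: 2.0
sd_sample = statistics.stdev(data)       # sample SD (n - 1 denominator)

print(mean, sd_population)
```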
Measuring spread with ordinal data
Range and IQR
Not standard deviation, which can only be used on metric (continuous) data
Why not use median and SD
If you have continuous data, SD is the measure of choice.
However, if you use a median value, that suggests your data are skewed. Therefore, you shouldn’t use SD.
Mean +/- 1 SD
68% of values in range
*for normally distributed data
Mean +/- 2 SD
95% of values in range
*for normally distributed data
Mean +/- 3 SD
99.7% of values in range
*for normally distributed data
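The 68/95/99.7 coverage can be illustrated by simulation (a sketch; the mean, SD, sample size and seed are arbitrary choices):

```python
import random

# Simulate draws from a normal distribution to illustrate the
# 68 / 95 / 99.7 rule
random.seed(42)
mu, sigma, n = 0.0, 1.0, 100_000
sample = [random.gauss(mu, sigma) for _ in range(n)]

def within(k):
    """Proportion of the sample within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in sample) / n

print(within(1), within(2), within(3))  # roughly 0.68, 0.95, 0.997
```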
Testing for normal distribution
Shapiro-Wilk test
- if fewer than 2000 values
- provides p-value with null hypothesis set for normal distribution
Kolmogorov-Smirnov test
- if more than 2000 values
- provides p-value with null hypothesis set for normal distribution
Transforming data
Make it normally distributed
log (to the base 10) = most common
square-root
1 over value
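A sketch of the three transforms applied to a made-up right-skewed sample, with a simple moment-based skewness coefficient to show the skew shrinking:

```python
import math

# A small right-skewed sample (hypothetical values)
skewed = [1, 2, 2, 3, 4, 5, 8, 15, 40]

log10_t = [math.log10(x) for x in skewed]  # log transform (base 10)
sqrt_t = [math.sqrt(x) for x in skewed]    # square-root transform
recip_t = [1 / x for x in skewed]          # reciprocal ("1 over value")

def skewness(xs):
    """Simple moment-based (population) skewness coefficient."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

print(round(skewness(skewed), 2), round(skewness(log10_t), 2))
```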
Incidence rate
Is actually the crude incidence rate
Number of new cases of a disease or event in a defined population over a given time period
= number of new cases / number at risk
(same time period)
Incidence rate ratio
ratio of two incidence rates
Prevalence
Number of cases in a given population at a given point in time
Crude mortality rate
= number of deaths over a period of time (usually 1 year) divided by population at mid-point of that time duration
MULTIPLY by 1000
gives crude mortality per 1000 per year
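The calculation above as a one-line function (figures are hypothetical):

```python
def crude_mortality_per_1000(deaths, mid_period_population):
    """Crude mortality rate per 1000 per period:
    deaths / mid-period population, multiplied by 1000."""
    return deaths / mid_period_population * 1000

# Hypothetical figures: 560 deaths in one year, mid-year population 80,000
print(crude_mortality_per_1000(560, 80_000))  # about 7.0 per 1000 per year
```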
Case fatality rate
Number of deaths from a disease in a given time period divided by total number with the disease over that time period
Standardised mortality rate
Observed deaths in the study population divided by the deaths expected if it had the mortality rates of a standard reference population
Properties of a confounding variable:
A confounding variable:
- is associated (causally or not) with the exposure
- causally related to the outcome
- must not be part of the exposure-outcome pathway
Positive confounding
Leads to effect of exposure being inflated
Negative confounding
Leads to effect of exposure being reduced
Confounding by indication
Occurs when the clinical indication for selecting a particular treatment (eg, severity of the illness) also affects the outcome.
Indication for exposure leads to disease outcome
Not exposure itself
Residual confounding
Residual confounding is the distortion that remains after controlling for confounding in the design and/or analysis of a study. There are three causes of residual confounding:
There were additional confounding factors that were not considered, or there was no attempt to adjust for them, because data on these factors was not collected.
Control of confounding was not tight enough. For example, a study of the association between physical activity and age might control for confounding by age by a) restricting the study population to subjects between the ages of 30-80 or b) matching subjects by age within 20-year categories. In either event there might be persistent differences in age among the groups being compared. Residual confounding might also occur in a randomised clinical trial if the sample size was small. In a stratified analysis or in a regression analysis there could be residual confounding because data on the confounding variable was not precise enough, e.g. age was simply classified as “young” or “old”.
There were many errors in the classification of subjects with respect to confounding variables.
Controlling for confounding at design stage
- Restriction
- exclude all those with the confounding exposure
- limits generalisability of evidence e.g. if you exclude all smokers then results are unlikely to generalise to the whole population
- Matching
- choice of method in case-control studies
e.g. frequency matching (same proportions)
e.g. propensity score matching
- Randomisation
- choice of method in RCTs
- controls for known and unknown confounding
Controlling for confounding at analysis
- Stratification
- divides into strata, with and without exposure
- essentially restriction but after the event
- Adjustment
- regression
Reverse causality
The exposure-disease process is reversed; in other words, the disease causes the apparent exposure.
Example: lower employment status appears to cause depression.
It may well be that depression causes lower employment status.
Descriptive cross-sectional studies
Do not infer any causality; usually measure one variable (e.g. prevalence) but can measure multiple
- Generally not subject to confounding if only measuring prevalence
- If measuring multiple things, will need to adjust for potential confounding
Analytical cross-sectional studies
Attempts to assess potential links between two or more variables at a given time point
- Does not infer causality
- Need to be adjusted for confounding variables
Cross-sectional studies
- Take one set of measurements from each participant at a SINGLE point in time
- Used to investigate associations between variables but NOT causality or direction
- Not useful if condition is rare
- If used to assess opinions or attitudes, referred to as surveys
Cohort studies
Pros
- Main purpose is to identify if exposures or risk factors cause a certain disease
- Several outcomes can be studied for single exposure
- Temporal relationship can be established
- ->therefore adds to causality
- Suited for rare exposures
- Less subject to bias and confounding than case-control
Cons
- Sampling bias
- Not suited for rare diseases
- Long follow-up: leads to attrition and bias
- Recall bias in retrospective studies
- Data quality in retrospective studies
Problems with case-control studies
- Recall bias
- Suitable cases can be difficult to find
- Difficult to match patients for each variable
- Sampling bias of cases +/- controls
- Definition of a case e.g. GOLD 1 COPD is unlikely to clarify much
Ecological studies
Make large-scale comparisons between two groups of people
Statistical inference
Data from the sample will inform conclusions about the target population
Using sample statistics, we are inferring about the population
Sample statistics
Variable measured in a sample
=sample statistics
This is used to inform inferences regarding population parameters
Sample error
Deviation from true value of a parameter in sampled population
-usually unknown
Rules of probability
Chance of an event occurring lies between 0 - 1
1 is absolute certainty of an event e.g. everyone will die some day
0 is an impossible outcome e.g. rolling an 8 on a dice labelled 1- 6
If an event is equally likely to happen as to not happen, the probability would be 0.5
If p is the probability of an event happening, the probability of the vents not happening is 1 - p
Proportional frequency
Used to calculate probability in clinical settings when outcomes do not all have an equal chance of occurring
i.e. any clinical setting
Proportional frequency states that the probability of an event occurring is equal to the proportion of times that outcome would have occurred if we repeated the experiment a large number of times
Techniques for randomisation
Simple randomisation
Block randomisation
-ensures at any given point there are roughly equal numbers in each group
Stratification
-ensures balanced strata of variables across each group
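Block randomisation can be sketched as shuffling balanced blocks (a toy illustration; real trials use dedicated software and often variable block sizes):

```python
import random

def block_randomise(n_participants, block_size=4, arms=("A", "B")):
    """Block randomisation sketch: shuffle balanced blocks so the
    arms stay roughly equal at any point during recruitment."""
    assert block_size % len(arms) == 0
    allocation = []
    while len(allocation) < n_participants:
        block = list(arms) * (block_size // len(arms))
        random.shuffle(block)  # random order within each balanced block
        allocation.extend(block)
    return allocation[:n_participants]

random.seed(1)
alloc = block_randomise(20)
print(alloc.count("A"), alloc.count("B"))  # -> 10 10
```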
Reducing placebo or response bias
Blinding of participant
Problems with cross-over trials
- Participants may have undergone change between treatment 1 and treatment 2
- Does not work for treatments that require a long time to take effect
- Does not work in self-resolving or acute illness that responds to therapy immediately
- “Carry over” effect despite washout periods
Hawthorne effect
Change in behaviour after knowledge of being observed
Some trials do not recruit separate controls if data collection would not differ between groups, since the Hawthorne effect itself changes outcomes
Intention to treat
The process of analysing the data as if participants are still in the original group allocation despite loss or changeover of participants
- Maintains baseline characteristics
- Prevents attrition bias
- Reflects real-world practice
- Keeps sample size and power the same
Cons
- requires imputation
- can sometimes underestimate effect size
Per protocol analysis
Analysis performed as per treatments received by participant
- protocol deviation has taken place
- balanced baseline characteristics lost
- attrition bias now in action
- subject to confounding
- loss of power
- likely to overestimate effect size e.g. those most unwell are least likely to tolerate the side effects of new drug, hence only moderate disease is analysed in treatment group vs full-spectrum of severity in control group
Cluster sampling
Overcomes need for sampling frame
Common sampling technique in randomised-controlled trials
Units represent GP surgeries, hospitals, schools, clinics etc.
Sampling units are a likely place to find spectrum of participants
However, clusters are not a sampling frame - they do not include everyone eligible, hence sampling bias and then selection bias will be introduced
Example: 75 GP surgeries identified as eligible sampling units
Randomly select 25 as your cluster sample
-People who aren’t registered with a GP have no chance of being included, hence not equal sampling probability
Probability density function
Used when calculating probabilities for continuous variables
The pdf gives the probability that a continuous random variable will lie between two values
THIS IS BECAUSE: continuous variables have an infinite number of possible outcomes, hence the probability of any single exact outcome = 0
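Written as a formula, with f the pdf of a continuous random variable X:

```latex
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx,
\qquad
P(X = a) = \int_{a}^{a} f(x)\,dx = 0
```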
Absolute risk
Probability of an outcome occurring in a population with exposure
Relative risk
Risk exposed divided by risk unexposed
Same as risk ratio (decimal)
Risk ratios
Risk exposed / risk unexposed
Decimal
Can over-inflate the apparent risk (a large ratio may reflect a tiny absolute risk)
Relative risk reduction
= absolute risk reduction / risk in unexposed
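The risk measures above, computed from hypothetical trial counts (treated arm taken as "exposed", control as "unexposed"):

```python
# Hypothetical trial counts: events / totals in each arm
treated_events, treated_total = 20, 200  # treated ("exposed") arm
control_events, control_total = 30, 200  # control ("unexposed") arm

risk_exposed = treated_events / treated_total    # absolute risk, exposed: 0.10
risk_unexposed = control_events / control_total  # absolute risk, unexposed: 0.15

relative_risk = risk_exposed / risk_unexposed             # risk ratio
abs_risk_reduction = risk_unexposed - risk_exposed        # ARR
rel_risk_reduction = abs_risk_reduction / risk_unexposed  # RRR

print(relative_risk, abs_risk_reduction, round(rel_risk_reduction, 3))
```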