BIO Statistics Flashcards
Central limit theorem
The sampling distribution of the mean of any independent random variable will be normal, or nearly so, if the sample size is large enough.
Gaussian curve: area between μ and 1 SD, 1 SD and 2 SD, 2 SD and 3 SD, and beyond 3 SD
μ to 1 SD: 34.1%
1 SD to 2 SD: 13.6%
2 SD to 3 SD: 2.1%
Past 3 SD: 0.1%
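The band areas above can be verified numerically; a minimal Python sketch using only the standard library:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1
nd = NormalDist()

def band(lo, hi):
    """Area under the curve between lo and hi SDs from the mean."""
    return nd.cdf(hi) - nd.cdf(lo)

print(round(band(0, 1) * 100, 1))       # μ to 1 SD   -> 34.1
print(round(band(1, 2) * 100, 1))       # 1 SD to 2 SD -> 13.6
print(round(band(2, 3) * 100, 1))       # 2 SD to 3 SD -> 2.1
print(round((1 - nd.cdf(3)) * 100, 1))  # past 3 SD    -> 0.1
```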
Parametric statistics (definition)
A class of statistical procedures relying on assumptions about the shape of the distribution in the population (assumed normal) and about the parameters (μ, SD) of that distribution.
Non parametric statistics (definition)
A class of statistical procedures NOT relying on assumptions about the shape or form of the probability distribution from which the data are drawn.
Descriptive statistics include
Mean, median, mode, range, variance, SD, SE
Range
Difference between largest and smallest sample values
Sensitive to outliers, so not a reliable indicator of the data set's overall dispersion
Variance
Average of the squared distances of each value from the mean.
Squaring makes every deviation positive, so negative and positive deviations both contribute.
Standard deviation
Tells you how tightly each sample value is clustered around the mean.
Tight cluster = low SD.
The 68/95/99.7% interpretation holds only under a normal distribution.
Describes the spread of the data; the standard error, not the SD, reflects the precision of the calculated mean.
Standard error
Measure of how far the sample mean is from the population mean.
Gets smaller as sample size increases, since the mean of a larger sample is likely to be closer to the population mean.
Confidence interval (definition)
The estimate of the range that is likely to contain the true population mean. Takes into account the size of the sample and the scatter of the measurements.
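A minimal Python sketch tying SD, SE, and the confidence interval together (the sample values are made up, and 1.96 is the normal-approximation critical value for 95%; a t critical value would be more exact for small n):

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical sample of measurements (illustrative values)
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
n = len(data)

m = mean(data)
se = stdev(data) / sqrt(n)  # standard error shrinks as n grows

# Approximate 95% CI using the normal critical value 1.96
ci = (m - 1.96 * se, m + 1.96 * se)
print(f"mean={m:.3f}, SE={se:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```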
What constitutes reliable data?
Precise, accurate, repeatable, reproducible.
Random error
Caused by inherently unpredictable fluctuations in the readings of the measurement apparatus or in the experimenter's interpretation of the instrumental reading.
Can occur in any direction
Systematic error
Result of bad science. Predictable, one direction. Caused by imperfect calibration of instruments, imperfect methods.
Alpha
Significance level. Probability threshold below which the H0 will be rejected.
0.05 and 0.01 are common choices.
Type 1 error
Incorrect rejection of a true Ho. (False positive)
Say the experiment worked when it didn’t
Type II error
Incorrectly retaining a false Ho. (False negative)
If the true state of the Ho is false and you fail to reject it. Usually an issue with power.
Z Test definition
Any statistical test for which the distribution of the test statistic can be approximated by a normal distribution, with n > 30.
Assumes pop and sample are normally distributed.
What does the value of Z mean in a z test?
Z is the number of standard errors the sample mean lies from the population mean. A large |Z| means the observed mean is unlikely to have occurred by chance if the Ho is true.
A Z score of 2.5 means that the sample mean is 2.5 standard errors away from the population mean.
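A minimal sketch of the z calculation, with assumed (made-up) population parameters and sample results:

```python
from math import sqrt

# Assumed population parameters and experimental results (illustrative)
pop_mean, pop_sd = 100.0, 15.0
sample_mean, n = 107.5, 36

# z = (sample mean - population mean) / standard error of the mean
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
print(z)  # -> 3.0: the sample mean sits 3 standard errors above μ
```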
T test is used when (general)
You have a normal distribution in the population and the sample, and have a small n (typically n < 30) or an unknown population SD.
P value– what do large and small p mean
Large p indicates weak evidence against the Ho; fail to reject.
Small p indicates strong evidence against the Ho, reject.
One tailed t test
To test if the experimental mean is significantly greater than the population mean, or significantly less than, but not both.
The directional assumption about the data makes this less robust
Two tailed t test.
Testing whether the exp. mean is significantly greater than or significantly less than the pop. mean.
More robust because alpha is split between the tails, leaving a smaller rejection area on each side of the distribution (2.5% on each for alpha = 0.05)
Paired t test
The observed data are from the same subject, twins, or otherwise matched subject and are drawn from a population with a normal distribution
Unpaired t test
Observed data are from two independent, random samples from a population with a normal distribution.
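The unpaired t statistic can be sketched directly (group values are made up; the pooled-variance form assumes equal variances in the two groups):

```python
from statistics import mean, stdev
from math import sqrt

# Two hypothetical independent samples (illustrative values)
a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.9, 4.7]
na, nb = len(a), len(b)

# Pooled variance (assumes equal variances in the two groups)
sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)

# t statistic and degrees of freedom
t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
df = na + nb - 2
print(f"t = {t:.2f} with {df} degrees of freedom")
# Compare |t| to a t-table critical value for the chosen alpha
```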
ANOVA
Compares 3 or more means. Measures the sum of squares to understand the variance.
ANOVA tells you whether any of the means differ from each other, taking scatter and variability into consideration.
One way ANOVA
One measurement variable and one nominal variable is explored.
All the groups are independent, and only one thing is being measured in each group. There is theoretically a normal distribution within each group.
Two way ANOVA
1 measurement variable and 2 nominal variables.
There are two factors being measured within each group that affect the outcome. Ex: how 3 different drugs affect subjects - both men and women. Drug response and gender are the two factors.
Post hoc tests
In follow up to the ANOVA. Used when ANOVA rejects Ho. Tests whether the group means differ significantly, correcting for multiple comparisons.
Mann Whitney U test
For independent measures with 2 groups. It’s a non-parametric two sample t test.
Ranks measurements across the groups, then computes a U from each sample set. The lowest U is compared to the table; if U(exp) is less than or equal to U(critical), reject the Ho.
Correlation
The extent to which two variables have a linear relationship with each other.
Pearson correlation Coefficient
How well the variables correlate: the strength and direction of the linear relationship between X and Y, i.e., how well knowing X predicts Y.
Linear regression
Used to adjust the values of the slope and intercept to find the line that best predicts Y from X based on the data. Assumes the data are linear; they may not be.
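A minimal least-squares sketch for the slope and intercept (the data points are made up and happen to lie near a line):

```python
from statistics import mean

# Hypothetical (x, y) observations (illustrative values)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

mx, my = mean(xs), mean(ys)

# Ordinary least squares: slope = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(f"y ≈ {slope:.2f}x + {intercept:.2f}")
```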
Categorical data
No mean, median, mode, or normal distribution. Dead or alive, diabetes or no diabetes.
May be inherent in the data or made from continuous data.
May be more meaningful clinically
Chi square- what it is used for
It is the appropriate statistic for measuring relationships between categorical data in a contingency table. Compares experimental outcomes to expected outcomes to see if there is a significant difference.
Assumptions made by a chi square test
Data are frequency data
Adequate sample size
Measures are independent of each other (a patient only goes in one box).
When to use a Chi Square (check list)
Categorical data
Not normally distributed
No assumption that data will be normal.
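The chi-square statistic for a contingency table can be sketched as follows (counts are made up, e.g. treated/untreated vs improved/not improved):

```python
# Hypothetical 2x2 contingency table of observed counts
observed = [[30, 20],   # row 1
            [10, 40]]   # row 2

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# chi2 = Σ (observed - expected)² / expected
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected count
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))
# Compare chi2 to the critical value with (rows-1)(cols-1) df
```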
Experimental research design includes these 3 things
Independent variables manipulated, extraneous factors controlled, random assignment into groups.
Run-in experiment
Precedes the randomized control trial. A period of time where subjects are put on the control regimen to see if they will continue with the study and comply. If not then they will be removed before the real study starts.
Healthy user bias
Sample is more healthy, or medically fluent than the average population.
Berkson’s bias
Sample selected from an impaired or diseased group, like hospital patients. Clearly doesn’t reflect the regular population
Exclusion bias
Excluding subjects based on potential extraneous factors.
Excluding reduces generalizability
Selection bias
Bias in placing sample subjects into treatment or control arms. (Hand picking). Leads to non-equivalent groups, which builds inherent biases.
Investigator bias
Where the investigators are aware of which subjects are in each group and this influences how they work with the subject or record results
Hawthorne effect
Subjects will change their behavior in a study, affecting internal and external validity.
Usually done to gain approval of/please investigators.
Incidence (def)
The number of new cases of disease arising during a given period of time.
Also “absolute risk”. (Number of new cases)/(total number of people at risk)
Relative risk
Incidence in exposed population/incidence in unexposed population.
Cohort study
A cohort of people who have something in common when they are first assembled are observed to see what happens to them
Not random, the cohort subjects have a relationship.
Goal: to study predictor variables and associated outcomes
Case-control studies
Looking backward to compare people with and without a condition– trying to determine risk factors for disease or outcome. Good for long latency, or rare disease.
Recall bias
People may not remember the exposure or details about it, and the exposure may not appear in the medical record.
Equation for Variance
SUM [(each value - sample mean)^2] / (N-1)
Equation for Standard Deviation
Square root of the variance.
SQRT:
SUM [(each value - sample mean)^2] / (N-1)
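The two equations above, implemented directly and cross-checked against the standard library (data values are made up):

```python
from math import sqrt
import statistics

# Hypothetical sample (illustrative values)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
m = sum(data) / n

# Sample variance: squared distances from the mean, divided by N-1
variance = sum((x - m) ** 2 for x in data) / (n - 1)
sd = sqrt(variance)  # SD is the square root of the variance

# Cross-check against the standard library
assert abs(variance - statistics.variance(data)) < 1e-9
assert abs(sd - statistics.stdev(data)) < 1e-9
print(variance, sd)
```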
Grubb’s test
For outliers
G = |outlier - mean| / SD
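A minimal sketch of the Grubbs statistic (data values are made up; the critical value still has to come from a Grubbs table for the given n and alpha):

```python
from statistics import mean, stdev

# Hypothetical measurements; 7.8 is the suspect outlier
data = [5.0, 5.1, 4.9, 5.2, 5.0, 7.8]

m, s = mean(data), stdev(data)
g = max(abs(x - m) for x in data) / s  # G = |outlier - mean| / SD
print(round(g, 2))
# Reject the point as an outlier if G exceeds the table's critical value
```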
Effect on required N: increased variability
Increased N
Effect on required N: greater differences between groups
Lower N required
Effect on required N: smaller alpha
Increase N
Effect on required N: decrease Power
Decrease N
Correlation coefficient r and R^2 - values?
r ranges from -1 to +1.
0 means no correlation. R^2, the square of r, ranges from 0 to 1 and gives the fraction of the variance in Y explained by X.
Odds ratio- values?
OR > 1: increased odds that the exposure is associated with the case. OR < 1: decreased odds. OR = 1: no association between exposure and outcome.
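Relative risk and odds ratio from a 2x2 table can be sketched as follows (counts are made up):

```python
# Hypothetical 2x2 table: exposure status vs disease status
exposed_d, exposed_h = 30, 70      # exposed: diseased, healthy
unexposed_d, unexposed_h = 10, 90  # unexposed: diseased, healthy

# Relative risk: incidence in exposed / incidence in unexposed
risk_exposed = exposed_d / (exposed_d + exposed_h)
risk_unexposed = unexposed_d / (unexposed_d + unexposed_h)
rr = risk_exposed / risk_unexposed

# Odds ratio: odds of disease in exposed / odds in unexposed
odds_ratio = (exposed_d / exposed_h) / (unexposed_d / unexposed_h)

print(round(rr, 2), round(odds_ratio, 2))
```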
Risk factor definition
Characteristic or factor that increases a person’s risk of disease. Can be inherited, environmental, socioeconomic, behavioral.
Chemical agents
Workplace exposure to chemicals, etc
Physical agents
Radioactivity in your state, noise, vibration
Biologic agents
Infectious agents (like bacteria, virus), allergens
Psychosocial agents
Stress, trauma/ptsd, depression
Mechanical agents
Repetitive motion jobs/hobbies (typing), heavy lifting,
Lifestyle risk factors
Drugs, alcohol, unsafe sex, sun exposure
Framingham calculator
Risk assessment tool for 10 year risk of having a heart attack based on risk factors.
Absolute risk
The probability of an event in a population under study. Same as incidence.
Attributable Risk
Absolute risk, or incidence, of a disease in exposed persons, minus the absolute risk from non-exposed persons.
Risk attributed to an exposure
Relative risk
Compares the probability of an event occurring in the exposed group vs the non-exposed group.
Relative Risk Reduction RRR
By how much the treatment reduced the risk of disease outcomes, relative to the control group who did not receive treatment.
Absolute Risk Reduction ARR
The most useful. Shows difference in risk comparing treated vs non treated. Expressed as NNT
NNT
Number needed to treat. The number of patients you need to treat before seeing a benefit of the intervention
1/(ARR%)
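ARR, RRR, and NNT can be sketched together (the event rates are made up):

```python
# Hypothetical outcome rates (illustrative values)
risk_control = 0.20  # 20% of untreated patients have the outcome
risk_treated = 0.15  # 15% of treated patients have the outcome

arr = risk_control - risk_treated  # absolute risk reduction
rrr = arr / risk_control           # relative risk reduction
nnt = 1 / arr                      # number needed to treat

print(arr, rrr, round(nnt))
```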
High sensitivity of a diagnostic test
Probability of testing positive given the patient has the disease. Low false-negative rate; in practice often comes with more false positives.
High specificity of a diagnostic test
Probability of testing negative given the patient does not have the disease. Low false-positive rate; in practice often comes with more false negatives.
Prevalence of a diagnostic test
The proportion of people possessing a clinical condition or outcome at a given point in time. The probability of disease before test result is known.
Positive predictive value of a diagnostic test
Probability of having disease, given a positive test result.
Negative predictive value of a diagnostic test
Probability of not having a disease given a negative test result.
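Sensitivity, specificity, PPV, NPV, and prevalence can all be computed from one 2x2 table of test result vs disease status (counts are made up):

```python
# Hypothetical diagnostic-test counts
tp, fp = 90, 50    # test positive: true positives, false positives
fn, tn = 10, 850   # test negative: false negatives, true negatives

sensitivity = tp / (tp + fn)  # P(test+ | disease)
specificity = tn / (tn + fp)  # P(test- | no disease)
ppv = tp / (tp + fp)          # P(disease | test+)
npv = tn / (tn + fn)          # P(no disease | test-)
prevalence = (tp + fn) / (tp + fp + fn + tn)

print(sensitivity, round(specificity, 3),
      round(ppv, 3), round(npv, 3), prevalence)
```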