Biostats - Week 1 Flashcards
Which kind of graph is negatively skewed?
One where the bulk of the data (the peak of the curve) is on the right and the tail of skewed data stretches out to the Left
layman’s terms for precision v. accuracy. Give ex
Precision related to # of participants in your study. More participants = more precise.
Accuracy related to where you draw your sample from. Drawing from registered voters is considered to be an accurate measure of the population.
simplified way to think about what a chi-square test measures
how many people fall into one group or not (e.g. who got a cold after taking Vitamin A and who did not)
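A minimal sketch of the arithmetic behind a chi-square test on a 2x2 table, in the spirit of the cold example above. The counts are made up for illustration; only the formula (observed vs. expected counts) is standard.

```python
# Hypothetical counts (not from the course): rows = took the vitamin / did not,
# columns = got a cold / did not get a cold.
observed = [[20, 80],
            [30, 70]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if group and outcome were unrelated:
# (row total * column total) / grand total
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

# For a 2x2 table (1 degree of freedom) the 0.05 critical value is 3.841.
significant = chi_square > 3.841
print(round(chi_square, 3), significant)
```

With these made-up counts the statistic falls below 3.841, so the difference between the two groups would not be called statistically significant at alpha = 0.05.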
If a confidence interval range does NOT include 0 (e.g. 0.61-1.19cm), what does that tell you about the (two-sided) p-value for testing the null hypothesis?
The p value is the probability of getting results at least this extreme by chance alone if the null hypothesis were true. A value of 0 means "no difference" (the null). So if 0 is outside the confidence interval, the result is unlikely to be due to chance and thus p < alpha (e.g. p < 0.05 for a 95% confidence interval)
Using a lower - or more stringent - value of alpha does what
Makes it LESS likely to make a Type I error (helps prevent Type I errors). Idea is it’s harder to get a statistically significant result. Thus, you can be more confident of your findings IF they are statistically significant (p < alpha)
What can you never conclude from a p value
Can never conclude that there is a CLINICAL significance just because there is a statistical significance.
(4) data types. Which are categorical and which are numerical?
think “data NOIR”
Categorical = nominal and ordinal
-Nominal: UNordered categorical data
-Ordinal: ordered categorical data
Numerical = interval and ratio
- Interval: similar intervals for numeric groups, but NO absolute zero
- Ratio: similar intervals WITH an absolute zero, so can compute ratios
Nominal data, def and medical ex
unordered categories of data, i.e. no particular order or way of measuring these things; just different buckets to put stuff in
ex. smoking status, ethnicity, or specialty
What data type is dichotomous data?
Nominal data that only has 2 groups (buckets)
ex. diabetic v. non-diabetic
Ordinal data, def and medical ex
ordered (grouped) categorical data; so there is an order, but intervals between groups may be different. Means that computations on ordinal data are mathematically flawed. ex. class rank and 5-point rating scale for faculty evals (b/c a rating of 4 isn't twice as good as a 2)
Interval data, def and ex
data is ordered with meaningful intervals between the groups, but NO absolute zero exists
ex. graduation years (has no absolute zero)
Ratio data, def and ex. How can ratio data be further broken down?
interval scale with an absolute zero, so you can compute ratios. Can be discrete (only has certain integer values) or continuous (can take on any value)
ex. BP, weight, or age can take on any value (continuous) but we generally reduce it to discrete data b/c we round it off
ex. of discrete would be # of patients seen in a day
Addition rule, def and ex
the probability that A OR B will happen is the sum of individual probabilities of A and B. Applies to two MUTUALLY EXCLUSIVE events, i.e. events that can NOT both happen.
ex. probability of surgery clerkship first = 16% and prob of IM first = 16%. Probability of getting IM OR surgery first = 32%
multiplication rule, def and ex
probability of A AND B both occurring = product of the individual probabilities (must know the individual probabilities of both, and the events must be independent).
ex. prob of getting IM clerkship first = 16%. The probability of passing it is 95%.
Probability of getting IM first AND passing it = 0.16 x 0.95 = 15.2%
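The addition and multiplication examples above can be checked with a few lines of arithmetic. This is only a sketch using the clerkship numbers from the cards; the variable names are made up.

```python
p_surgery_first = 0.16
p_im_first = 0.16
p_pass_im = 0.95

# Addition rule (events that cannot both happen): P(A or B) = P(A) + P(B)
p_im_or_surgery_first = p_im_first + p_surgery_first

# Multiplication rule (independent events): P(A and B) = P(A) * P(B)
p_im_first_and_pass = p_im_first * p_pass_im

print(f"{p_im_or_surgery_first:.1%}")  # 32.0%
print(f"{p_im_first_and_pass:.1%}")    # 15.2%
```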
precision v. accuracy (immunity from…)
precision = immunity from random variation. It’s related to the width of the confidence interval (sqrt of n)
accuracy = immunity from systematic error or bias (bias is something wrong with the way samples are chosen)
for a gaussian distribution, what is between +/-1 SD?
68% of your data lies in the range between +/- 1 SD
what % of the data lies below the +1 SD mark?
84% of the data. (50% below the mean + 34% between mean and +1)
Where does 99% of the data lie on a gaussian curve?
between +/-3 standard deviations (more precisely, about 99.7% of the data)
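The percentages on the Gaussian cards above can be reproduced with Python's standard library. A small sketch, assuming Python 3.8+ (for `statistics.NormalDist`); the helper name `share_within` is made up.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

def share_within(k):
    """Fraction of a normal distribution lying within +/- k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"+/-{k} SD: {share_within(k):.1%}")

# Fraction below the +1 SD mark (50% below the mean + ~34% up to +1 SD)
below_plus_1 = std_normal.cdf(1)
print(f"below +1 SD: {below_plus_1:.1%}")
```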
z score, def and eqn
EACH data point on a “standard” Gaussian distribution has a z score, meaning that data point (x) is “z” standard deviations above or below the mean
z = (x - mean)/SD
If looking at a Z table and see that z score of 1.10 = 0.8643. What does that mean?
means 86.43% of the data lies BELOW the point where z = 1.1
Why are z scores symmetric?
because the gaussian curve is symmetric
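A short sketch of the z-score equation, a Z-table lookup, and the symmetry of the curve. The blood-pressure-style numbers (x = 131, mean = 120, SD = 10) are hypothetical and chosen only so that z comes out to 1.1.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """How many SDs the data point x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(x=131, mean=120, sd=10)  # hypothetical values

# What a Z table reports: the fraction of the data lying below z
area_below = NormalDist().cdf(z)

# Symmetry: the area below -z equals the area above +z
area_below_negative = NormalDist().cdf(-z)
area_above_positive = 1 - area_below

print(round(z, 2), round(area_below, 4))
```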
(2) typical reasons for using z (t) scores
- To figure out how many SDs is your sample mean above or below the population mean
- Figure out how many SD away from the mean will contain a certain proportion of the data
Why is the z score that divides the top 5% of a normal population from the remaining 95% not = +2?
picture the gaussian curve. z = +2 has only ~2.3% beyond it. So the cutoff has to be a z score LOWER than +2 (about +1.645) in order to encompass all of the top 5%
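To see where the top-5% cutoff actually falls, the Z table can be read "backwards" with the inverse CDF. A minimal sketch using the standard library (Python 3.8+):

```python
from statistics import NormalDist

std_normal = NormalDist()

# z score with 95% of the data below it (and 5% above it)
top_5_cutoff = std_normal.inv_cdf(0.95)

# Fraction of the data lying beyond z = +2
tail_beyond_2 = 1 - std_normal.cdf(2)

print(round(top_5_cutoff, 3))   # about 1.645
print(round(tail_beyond_2, 4))  # about 0.0228
```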
Why are t scores used more in practice than z scores?
Z scores are based on the ACTUAL standard error of the true population, which we don’t know.
But T scores use an ESTIMATED standard error of the mean
Why does increasing the n# make t and z scores get closer to the same value? Around what n value are t and z about the same?
T scores depend on the degrees of freedom (n-1), which means that t scores change based on the sample size (n). As n gets higher and higher, the d.f. goes up and the t distribution gets closer and closer to the z (normal) distribution.
n > 100, t and z scores are about the same
mode
the measure of central tendency that is the value with the greatest frequency. Is the high point on the graph and is NOT influenced by extreme values (unlike mean)
When are mean, median, and mode (measures of central tendency) all the same?
normal (gaussian) distribution
On a negatively skewed distribution, where do the mean, median, and mode measurements fall?
First, negatively skewed means the skewed data (tail) is to the left (heading towards negative x axis) and bulk is on R.
Mode = peak, Mean = closest to skewed tail, and Median is in between the two
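A quick check of the ordering on a negatively skewed data set. The values are made up (think exam scores out of 10, with a few low scores dragging out a left tail).

```python
from statistics import mean, median, mode

# Hypothetical, negatively skewed data: bulk on the right, tail to the left
scores = [1, 5, 7, 8, 8, 9, 9, 9, 10]

data_mean = mean(scores)      # pulled toward the left tail
data_median = median(scores)  # sits between mean and mode
data_mode = mode(scores)      # the peak

print(round(data_mean, 2), data_median, data_mode)  # 7.33 8 9
```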
endemic v. epidemic
A disease is ENDEMIC when it is constantly present in a population or area. An endemic disease has a usual incidence/prevalence. Ex. Rhinovirus (common cold)
EPIDEMIC means more cases of that disease than expected in a population/location within a time frame. Diseases that start as epidemics may drift into endemicity.
epidemiology
study of the distribution and determinants of disease frequency. Disease does NOT occur randomly; there are causes and/or preventative factors for disease. Epidemiology is the study of those things
Preclinical v. Clinical phase of a disease
Preclinical begins with the onset of the disease and ends once signs/sx of the disease manifest.
Clinical phase begins with signs/sx and ends (ideally) with treatment/resolution
incubation period. What phase is this in?
time from colonization to the point where the pt has sx. It falls in the preclinical phase
(2) types of epidemiological studies and example
Experimental and Observational:
- Experimental important in testing drugs
- Observational studies are really important for finding possible causes of disease. ex. figured out that Reye’s syndrome was linked to kids with viral infections taking ASA for fever
Rate v. Proportion
Rate IS proportion per a specific time period.
Proportion = (# of cases)/(population at risk)
Rate = (# of cases)/(population at risk) IN A TIME
Incidence
[# of people who ACQUIRE the disease] divided by [# of people at risk] IN A TIME
(“associate in your mind the word ‘acquire’ with incidence”)
Synonym for “attack rate”
Incidence
prevalence
(# of people that HAVE the disease)/(# of people at risk) …at a given point in time
What does prevalence not account for?
latent/undiagnosed diseases
Incidence rate v. Prevalence rate
Incidence rate = probability that healthy people will develop a particular disease DURING a specific period of time
Prevalence rate = proportion of people in a population who HAVE the disease AT a given time (point prevalence or period prevalence)
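A small sketch contrasting incidence (new cases over a time period) with prevalence (existing cases at a point in time). The town and its counts are hypothetical, and `per_population` is a made-up helper.

```python
def per_population(cases, population, per=1000):
    """Express cases/population as 'cases per `per` people'."""
    return cases / population * per

# Hypothetical town (all numbers invented for illustration)
population_at_risk = 10_000
new_cases_this_year = 50      # people who ACQUIRED the disease during the year
existing_cases_today = 400    # people who HAVE the disease right now

incidence = per_population(new_cases_this_year, population_at_risk)
prevalence = per_population(existing_cases_today, population_at_risk)

print(f"incidence: {incidence:.0f} per 1,000 per year")
print(f"point prevalence: {prevalence:.0f} per 1,000")
```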
visual depiction of incidence, prevalence, mortality, and cure (slide 45)
prevalence is existing cup of liquid. Incidence is new cup pouring into prevalence.
Coming out at bottom of prevalence cup are mortalities and cures
mortality rate
(# deaths)/(population)
Population is standardized to 10^n for a specific time interval. e.g. 10^3 = 1,000 or 10^5 = 100,000
neonatal v. infant mortality rate
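Neonatal: (# deaths in the first 28 days of life)/(# live births)
Infant: (# deaths in the first year of life)/(# live births)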
crude mortality rate v. cause-specific death rate
Crude mortality rate is simply the # of deaths (from all causes)/population (10^n) in specific time period
v. cause-specific death rate, which is (# of deaths due to certain cause)/population (10^n) in specific time period
death to case ratio
(# of deaths attributed to a disease) / (# new cases identified)
ex. total of 300 cases of disease with 50 new cases, 20 of whom have died. Death to Case ratio = 20:50
case fatality rate
(# cause specific deaths among the incident cases) / (# of incident cases). Can ONLY calculate the proportion of fatal cases once the epidemic ends.
ex. epidemic of a disease ends with 500 total cases, 250 of whom died
Case fatality rate = 250/500 = 50%
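The two worked examples above (death-to-case ratio and case fatality rate) as a short sketch. The function names are made up; the numbers come from the cards.

```python
def death_to_case_ratio(deaths, new_cases):
    """Deaths attributed to a disease divided by new cases identified."""
    return deaths / new_cases

def case_fatality_rate(deaths_among_cases, incident_cases):
    """Proportion of incident cases that died of the disease."""
    return deaths_among_cases / incident_cases

dtc = death_to_case_ratio(deaths=20, new_cases=50)
cfr = case_fatality_rate(deaths_among_cases=250, incident_cases=500)

print(dtc)           # 0.4, i.e. 20:50
print(f"{cfr:.0%}")  # 50%
```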
crude birth rate v. crude fertility rate
crude birth = (# live births)/(population, 10^n)
crude fertility = (# live births)/(women aged 15-44 yrs)
relationship of variance and standard deviation
variance = standard deviation squared
standard deviation = square root of the variance
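A minimal check of the variance/SD relationship on a made-up data set (population formulas are used here to keep the numbers round).

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical data with mean 5
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(values)  # average squared distance from the mean
sd = pstdev(values)           # square root of the variance

print(variance, sd)  # variance 4, SD 2
```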
how is variance (and standard deviation) related to the accuracy or reliability of the data? For the population?
LESS variance (which means lower SD) means MORE accurate/reliable data because less variation means your data is more clustered and more accurate around the mean. Overall idea = results for sample more closely represent the true result in the population
concept of standard deviation
in a normal distribution, the proportion of data elements is CONSTANT for a given number of standard deviations above or below the mean
percentile (e.g. what does the 90th percentile represent?)
x percentile is the value below which x% of the data lie. e.g. 90% of the data lies below the 90th percentile
what percentile is at the +1SD and +2SD marks of a normal distribution? How is +2SD percentile different from +/-2 SDs?
+1 SD = 84th percentile
+2 SD = 98th percentile
Even though 95% of the data LIES in between +/-2 SDs, this does not mean that +2SDs is the 95th percentile! It’s the 98th b/c the little tail of data below -2SD (~2.5%) also lies below the +2SD mark
(4) examples of non-gaussian data distribution
skewed (positive or negative)
J-shaped (high frequency at R-most)
bimodal: two peaks of highest frequency
U-shaped: high frequency at both extremes
Most important and practical way to increase the power of a statistical test
increase the sample size
Probability v. Odds - getting heads on a coin toss
Probability of getting heads = 50%
The ODDS of getting heads = 50%/50% = 1
This is because Odds of an event happening = the probability it does happen / probability it doesn’t
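The probability-to-odds conversion as a tiny sketch. The function name is made up, and the 75% example is an extra illustration that is not on the card.

```python
def odds(probability):
    """Odds = P(event happens) / P(event does not happen)."""
    return probability / (1 - probability)

coin_odds = odds(0.5)     # 50%/50% = 1 (i.e. 1:1)
likely_odds = odds(0.75)  # 75%/25% = 3 (i.e. 3:1)

print(coin_odds, likely_odds)  # 1.0 3.0
```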
T/F - a cross-sectional study can be retrospective or prospective
False! Cross-sectional study collects data ONLY at one point in time; it is not retrospective or prospective
The only type of study that can determine the absolute risk of contracting a disease
cohort study
another name for case-control studies
retrospective studies. Case-control studies, by definition, look backward in time
which type of study is the most powerful way to establish cause-and-effect relationships?
Controlled clinical trials are the only way to establish causation between exposure and illness. Cohort and case-control studies are only able to establish a statistical association, not actual causation
Which type of study is the best method for evaluating rare illnesses?
case-control studies (b/c they identify the cases at the start of the study)
Which type of study is often very large, expensive, and spanning many years?
cohort studies
selection bias
occurs when there is a systematic difference between either:
- those participating in the study and those who do not, or
- those in the tx arm of the study and those in control group
ex. if study conducted at a hospital where pts with that disease are more likely to be referred, then that sample of pts probably doesn’t accurately represent the population
(4) different ways to describe a Type I error
False positive
alpha error
incorrectly reject a TRUE null hypothesis
we think there’s an effect, when really there is NOT
(4) different ways to describe a Type II error
False negative
beta error
incorrectly accept (fail to reject) a FALSE null hypothesis
we think there is not an effect, when really there IS
relationship of correlation, causation, and association
Correlation is a measure of the variables’ statistical ASSOCIATION, not of their causal relationship. Correlation does not equal causation.
(2) different ways to define the null hypothesis
- no difference between the two groups
- any observed differences are due to chance
State alternative hypothesis for two-tailed v. one-tailed
Two-tailed: There is a difference between the two groups
One-tailed: the mean of the trial group is greater than the mean of the control group
level of significance
the probability level at which it’s decided that the null hypothesis is INCORRECT is the significance level (alpha)
(2) rules of Central Limit Theorem
If you plot the frequency distribution of the MEANS of infinite # of random samples, then:
- it will be a normal distribution, and
- the distribution mean - i.e. sample mean (mu x-bar) - will be the same as the population mean (mu)
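A rough simulation of both rules. This is only a sketch: "infinite" is replaced by 2,000 samples of size 50, and the population is assumed to be a skewed (exponential) distribution with mean 1, chosen purely for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of one random sample of size n from a skewed population (mean 1)."""
    return mean(random.expovariate(1.0) for _ in range(n))

sample_means = [sample_mean(50) for _ in range(2000)]

# Rule 2: the mean of the sample means sits close to the population mean (1.0)
mean_of_means = mean(sample_means)

# Rule 1: the sample means look roughly normal, e.g. about 68% of them
# fall within 1 SD of their own mean
sd_of_means = stdev(sample_means)
within_1_sd = sum(abs(m - mean_of_means) <= sd_of_means
                  for m in sample_means) / len(sample_means)

print(round(mean_of_means, 2), round(within_1_sd, 2))
```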
critical values, def and how to calculate
the +/- limits of the area of acceptance range (accept the null). Outside the critical value range = area of rejection (reject the null).
- Must find critical values by looking at a t score table: find the row for the degrees of freedom (n - 1), then look for the value under the chosen alpha (.05 for two-tailed).
ex. +/-2.262 for df = 9
estimated standard error of the mean, def and eqn
estimates how much a sample mean is expected to deviate from the population mean
standard error = SD (x-bar) = SD/(sqrt of n)
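A short sketch of the standard error equation, showing why a bigger n means more precision. The SD of 10 is a made-up value.

```python
from math import sqrt

def standard_error(sd, n):
    """Estimated standard error of the mean = SD / sqrt(n)."""
    return sd / sqrt(n)

# Hypothetical SD of 10: quadrupling the sample size halves the standard error
se_25 = standard_error(sd=10, n=25)    # 2.0
se_100 = standard_error(sd=10, n=100)  # 1.0

print(se_25, se_100)
```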
what does a t-score represent in hypothesis testing? T-score eqn?
the number of Estimated Standard Errors that the sample mean lies above or below the hypothesized population mean
Tcalc = [sample mean - hypothesized population mean] / est standard error of the sample mean
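A worked sketch of the t-score equation for a one-sample test. The ten measurements and the hypothesized mean of 5.0 are invented; the critical value +/-2.262 is the df = 9 value quoted on the critical values card.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 10 measurements (df = 9)
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2, 5.9, 5.5]
hypothesized_mean = 5.0

n = len(sample)
est_standard_error = stdev(sample) / sqrt(n)  # SD / sqrt(n)
t_calc = (mean(sample) - hypothesized_mean) / est_standard_error

# Two-tailed test at alpha = .05 with df = 9: critical values are +/-2.262
reject_null = abs(t_calc) > 2.262

print(round(t_calc, 2), reject_null)
```

Here the sample mean lies a little over 4 estimated standard errors above the hypothesized mean, which is outside the +/-2.262 acceptance range, so the null would be rejected.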
t-test v. ANOVA
t-test compares the means between 2 groups
ANOVA: compares means of 3 or more different populations
What data type does chi square test use?
Nominal data, since chi square is a test of proportions between groups (categorical)
(3) ex of nonparametric tests
Spearman’s rank CORRELATION test
Wilcoxon rank SUM test
Mann-Whitney test
power analysis, customary power, and level of significance
aims to prevent type 2 error by ensuring adequate study size, involves:
- fixing customary power to 80% and
- fixing level of significance to 5%
What does a non-inferiority design study? 1- or 2-tailed?
only wants to study (ensure) that the intervention is not worse than current standard of care.
Is a 1-tailed analysis that doesn’t need as many patients
What kind of time-frame do case-control studies look at?
case-control studies are ALWAYS retrospective
(3) advantages of case-control studies
- easy and inexpensive
- can study multiple risk factors
- since you identify patient cases at the beginning of the study, it’s the best way to study rare diseases
(2) disadvantages of case-control studies
- highly prone to bias & confounding (especially recall and selection biases)
- hard to identify a truly matched population (e.g. similar in severity of illness, age-matched, etc)
Name (2) examples that can NOT be studied via randomized trials? What must we rely on instead?
Surgery and pregnancy. ex. Can’t randomize people to get surgery or not.
For these, rely on observational studies
definition of bias
systematic error in the study design that produces results “systematically” different from the truth
(3) types of bias
- Selection (sampling) Bias: selection of pts doesn’t represent the population it’s supposed to represent (e.g. too old or too well-educated)
- Recall Bias: exists ANY time historical self-report info is collected from the respondents
- Measurement Bias: just means something wrong w/ way it’s being measured (instrument or observer)
RRR v. ARR
RRR = Relative Risk Reduction - the ARR divided by the risk in the control group (same as 1 - relative risk). ex. relative risk = 12%/20% = 0.60, so RRR = 1 - 0.60 = 0.40 (40%)
ARR = Absolute Risk Reduction - % risk in control group - % risk in treatment group. Since you just subtract, the ARR is always LESS than the RRR. Ex. 20% - 12% = ARR of 8%
Calculate NNT
NNT = 100/ARR (if ARR is %) or 1/ARR (if ARR is in decimals)
ARR = % risk in control group - % risk in treatment group
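A sketch of the risk-reduction arithmetic using the 20% v. 12% example from the RRR/ARR card. It uses the standard definitions (RRR = ARR / control-group risk; NNT = 1/ARR, rounded up to a whole patient); the variable names are made up.

```python
from math import ceil

risk_control = 0.20    # risk without the intervention
risk_treatment = 0.12  # risk with the intervention

arr = risk_control - risk_treatment          # absolute risk reduction
relative_risk = risk_treatment / risk_control
rrr = arr / risk_control                     # same as 1 - relative risk
nnt = ceil(1 / arr)                          # round up to whole patients

print(f"ARR = {arr:.0%}, RR = {relative_risk:.2f}, RRR = {rrr:.0%}, NNT = {nnt}")
```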
observational v. experimental study and subcategories of each
these are the 2 major types of clinical studies.
Experimental mostly refers to randomized controlled trials
Observational (just watching) includes cohorts, case-control, cross-sectional, and case reports
How to interpret the OR or RR
RR = 1: no difference
RR > 1: Increased risk
RR < 1: Decreased risk
calculating odds
Odds = (probability of the event) / (probability of the NOT event)
or
(probability of the event) / (1 - probability of the event)
difference in odds v. risk ratio
the denominator
Odds denominator = probability of the NOT event (or 1 - probability of the event), whereas
Risk denominator = everyone in the group, i.e. sum total of those with the event + those without the event
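The difference in denominators is easiest to see on a 2x2 table. A minimal sketch with invented counts (exposed v. unexposed, sick v. well):

```python
# Hypothetical 2x2 table (counts are made up for illustration)
exposed_sick, exposed_well = 30, 70
unexposed_sick, unexposed_well = 10, 90

# Risk: denominator is EVERYONE in the group (sick + well)
risk_exposed = exposed_sick / (exposed_sick + exposed_well)
risk_unexposed = unexposed_sick / (unexposed_sick + unexposed_well)
risk_ratio = risk_exposed / risk_unexposed

# Odds: denominator is only those WITHOUT the event
odds_exposed = exposed_sick / exposed_well
odds_unexposed = unexposed_sick / unexposed_well
odds_ratio = odds_exposed / odds_unexposed

print(round(risk_ratio, 2), round(odds_ratio, 2))  # 3.0 3.86
```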
Structure of the 7-character ICD-10 code
_ _ _ . _ _ _ _
Begins and ends with letters. First letter is the disease group (e.g. M = musculo sys, N = GU system). First 3 characters total represent category.
Next 3 = etiology, anatomic site, and severity respectively and the last one is an extension.
What should patient-physician email not be used for, according to the AMA?
To establish a patient-physician relationship
Which EHR received the highest ranking from users for its disease management features (in the survey of family physicians using EHRs)?
Praxis
According to the O’Donnell article, among both physicians who use and do not use the copy and paste function (everyone), what was considered most problematic with copy and paste EHR function?
notes contain more inconsistent and more outdated information
According to the O’Donnell article, what do the most physicians believe is best solution for problems w/ copy and paste function in EHR?
provide education for physicians regarding the copy/paste function use
According to the Bryant article, what was observed about “alert fatigue”?
the number of alerts received did not correlate with physicians’ override rate
According to the Bryant article, what was the rate of drug-drug alert overrides?
greater than 95% !!
meta-analysis
summary study of previous trials to give us an overall result
In screening for disease that has LOW prevalence, what will most positive tests be?
False positives
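Why low prevalence produces mostly false positives can be sketched with a positive predictive value calculation. The 90% sensitivity and specificity and both prevalence figures are assumptions chosen for illustration, not values from the course.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of positive tests that are TRUE positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Same hypothetical test (90% sensitive, 90% specific), two prevalences
ppv_rare = positive_predictive_value(0.01, 0.90, 0.90)
ppv_common = positive_predictive_value(0.30, 0.90, 0.90)

print(f"prevalence 1%:  {ppv_rare:.0%} of positives are true")
print(f"prevalence 30%: {ppv_common:.0%} of positives are true")
```

At 1% prevalence fewer than one in ten positive results is a true positive under these assumptions, so most positives are false positives.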
What is the ‘implied promise’ of screening tests?
That the screening test IS beneficial and will do more good than harm
When is the screening period?
Period between possible detection and occurrence of symptoms
USPSTF rating for screening mammography before age 50
C, should be an individual decision
Age group that has a B (v. C) rating for PSA screening in men
B: men 55-69 yo
Under 40 men = C
40-54 YO at average risk = C
USPSTF recommendation for testicular cancer screening in ASYMPTOMATIC patients
D (recommend against)
USPSTF guidelines for mammography screening in 40-49, 50-74, and >75 YO women
40-49 = C 50-74 = B >75 = I (Insufficient evidence)
USPSTF rating for lung cancer screening
B
Name (4) D rating cancer screenings
Ovarian cancer
Pancreatic cancer
Prostate cancer
Testicular cancer
Name (3) I-rated cancer screens
Bladder cancer
Oral cancer
Skin cancer prevention