Biostats Flashcards
Tchebysheff's Theorem
For any value of k that is ≥ 1, at least 100(1 – 1/k²)% of the data will lie within k standard deviations of the mean. For k = 1, at least 100(1 – 1/1²)% = 0% of the data will lie within one standard deviation of the mean.
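The bound above can be checked numerically. This is a minimal sketch; the function name `chebyshev_lower_bound` is made up for illustration and is not from the flashcards.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of data within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 100 * (1 - 1 / k**2)


# k = 1 gives the trivial bound of 0%; k = 2 gives at least 75%.
for k in (1, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))
```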
When performing a nonparametric Wilcoxon rank-sum test, the first step is to combine the data values in the two samples and assign a rank of ‘1’ to
the smallest observation
Contingency table test
Degrees of freedom = (r – 1)(c – 1), where r is the number of rows and c is the number of columns
Dichotomous variables
Only two possible responses
Used to classify participants (e.g. has/does not have attribute of interest)
Ordinal variables
Categorical, ordered variable
Nominal variables
Categorical, unordered variable
Continuous variables
Quantitative/measurement variables; unlimited responses
Standard deviation
Measures how far individual observations deviate from the average
Small = the observed values are close to the sample mean
Large = the observed values vary widely around the sample mean
Sample variance
Average of squared deviations from the sample mean (the sum is divided by n – 1); it is in squared units and so is hard to interpret, therefore use the sample standard deviation.
Sample standard deviation
Square root of sample variance
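A short sketch of how the sample variance and sample standard deviation relate. The data values are invented for illustration, and the n – 1 divisor is the usual sample convention.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n

# Squared deviations of each observation from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance divides by n - 1 rather than n.
sample_variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation is the square root, back in the original units.
sample_sd = math.sqrt(sample_variance)

print(mean, sample_variance, sample_sd)
```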
Interquartile range
Difference between 1st and 3rd quartiles
IQR = Q3 – Q1
Sensitivity
true positive fraction; the probability of a diseased person testing positive
Specificity
true negative fraction; the probability of a disease-free person testing negative
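Sensitivity and specificity can be computed from the four cells of a screening-test table. The sketch below uses hypothetical counts, and the function and argument names are illustrative only.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """P(test positive | diseased) = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """P(test negative | disease-free) = TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 100 diseased people and 200 disease-free people.
print(sensitivity(true_pos=90, false_neg=10))   # 0.9
print(specificity(true_neg=160, false_pos=40))  # 0.8
```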
Z scores
Used when the known properties of a normal distribution (e.g. the percentages within 1, 2, or 3 standard deviations of the mean) cannot be applied directly
Converting to a z score means we are standardizing
Z score formula converts x values to a standard normal distribution: Z = (x – μ)/σ
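As a quick worked example of standardizing, here is a sketch with made-up values for x, μ, and σ. The function name `z_score` is an assumption for illustration.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: how many standard deviations x lies from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical example: x = 130 with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))  # 2.0, i.e. two standard deviations above the mean
```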
Central Limit Theorem
States that as long as the sample size is sufficiently large (n ≥ 30), the distribution of sample means is approximately normal, whether the population distribution is normal or skewed
Two exceptions:
1. If the outcome is normally distributed in the population, the sample means will be normally distributed even if the sample size is less than 30
2. If the outcome in the population is dichotomous, the result holds when the following criterion is met: min[np, n(1 – p)] > 5
Standard error
Standard deviation of the sample means
Decreases as sample size increases
Variability in sample means is smaller for larger sample sizes (extreme values less likely to impact larger samples)
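To see the standard error shrink as n grows, the sketch below uses the usual formula for the standard error of a mean, σ/√n, with a made-up σ of 10.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(10, n))
```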
confidence interval
Range of values for a population parameter with a level of confidence attached (e.g. 95% confidence = we are 95% confident that the interval contains the unknown parameter).
Confidence Interval estimates
General form: point estimate ± margin of error
A confidence interval starts with the point estimate, then adds and subtracts a margin of error
Margin of error = Z*SE
Z value = Z score value from standard normal distribution based on confidence level (e.g. 90%, 95%, etc.)
SE = standard error of the point estimate (sampling variability)
Confidence level = reflects the likelihood that the confidence interval contains the true, unknown parameter
Commonly used values are 90%, 95%, and 99% (Table 1 B in textbook)
Higher confidence levels = larger z values, therefore wider confidence intervals; (99% CI = wider range to account for greater variability to include unknown parameter)
0 = null value for a difference; if it is included in the range, then the results are not statistically significant
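The general form point estimate ± Z*SE can be sketched as follows. The point estimate and standard error are invented numbers, and the Z value comes from Python's `statistics.NormalDist` rather than the textbook table.

```python
from statistics import NormalDist


def z_confidence_interval(point_estimate, standard_error, confidence=0.95):
    """Return (lower, upper) for point estimate +/- Z * SE."""
    # Two-sided interval: put half of (1 - confidence) in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * standard_error
    return point_estimate - margin_of_error, point_estimate + margin_of_error


# Hypothetical point estimate of 120 with a standard error of 2.
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(120, 2, level)
    print(level, round(low, 2), round(high, 2))
```

Higher confidence levels give larger Z values and therefore wider intervals.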
T distribution
Used for small samples (generally n
Hypothesis testing
An explicit statement or hypothesis generated about a population parameter. We analyze sample statistics to determine if the hypothesis about the population parameter is supported or rejected.
Based on probability theory and the Central Limit Theorem
Type 1 Error
Reject the null hypothesis when it is true
Type 2 Error
Fail to reject the null hypothesis when it is false
Commonly occurs when the sample sizes are small
Chi-Square test for goodness of fit
Tests with one sample, categorical or ordinal outcome
Independent Samples T-Test
Tests with two independent samples, continuous outcome
Paired-Samples T-Test
Two matched samples, continuous outcome
Chi-Square Test for Independence
Two or more independent samples, categorical or ordinal outcome
ANOVA
More than two independent samples, continuous outcome
The goal of an ANOVA statistical analysis is to determine whether or not
One of the simplest experimental designs is the completely randomized design, in which random samples are selected independently from each of g populations. An analysis of variance is used to test whether the g population means are the same or whether at least one mean differs from the others.
The probability distribution for all possible values of a given sample statistic is called
The sampling distribution of a statistic is the distribution of values of the statistic over all possible samples of size n that could have been selected from the reference population.
The sum of the deviations of the individual observations from their mean is?
The sum of the deviations of the individual data elements from their mean is always equal to zero. This is why we use the sum of squared deviations.
How to find cumulative relative frequency
Add all the previous relative frequencies to the relative frequency for the current row. Thus the last entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data has been accumulated.
Which is larger, sample size or population size?
The sample is a subset of the population, thus the sample size is always smaller than or equal to the population size.
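The running-total idea can be sketched with `itertools.accumulate`. The category counts are hypothetical.

```python
from itertools import accumulate

counts = [5, 15, 20, 10]  # hypothetical frequencies for four categories
total = sum(counts)

# Relative frequency: each count divided by the total.
relative = [c / total for c in counts]

# Cumulative relative frequency: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for rel, cum in zip(relative, cumulative):
    print(round(rel, 2), round(cum, 2))
```

The final cumulative entry comes out to 1 (up to floating-point rounding).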
Protective factor
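confidence interval falls entirely below the null value (0 for a difference, 1 for a ratio)
Risk factor
confidence interval falls entirely above the null value (0 for a difference, 1 for a ratio)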
If the range includes _ or _ there is no significant difference
0 or 1
The assumption of a t-test for the difference between the means of two independent populations is that the respective
One of the assumptions for the t-test for two independent populations is normality.
One can describe the F-distribution as a sampling distribution of the ratio of which of the following
The ratio of two sample variances: the F statistic, which is typically used for comparing two population variances. If the parent populations are independently and normally distributed, the F statistic is calculated as F = var1/var2 or F = var2/var1, where the numerator is the larger of the two variances. This ratio has an F-distribution with degrees of freedom n₁ – 1 and n₂ – 1, where n₁ and n₂ are the sample sizes.
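A minimal sketch of the variance ratio, placing the larger sample variance in the numerator. The two samples are invented, and the helper name `f_ratio` is hypothetical.

```python
from statistics import variance


def f_ratio(sample_a, sample_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


group_1 = [1, 2, 3, 4, 5]          # sample variance 2.5
group_2 = [2, 4, 6, 8, 10, 12]     # sample variance 14
f_value, df_num, df_den = f_ratio(group_1, group_2)
print(f_value, df_num, df_den)
```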
A clinical experiment with four treatment groups was analyzed using an ANOVA and a significant difference in the population means is found. Which of the following is a natural next step?
Once a significant difference among the population means is found after performing an ANOVA, we next examine pairwise comparisons to further identify the nature of the differences while adjusting for the multiple comparisons via Tukey’s method or a similar method
Poisson Distribution
The Poisson distribution is used to model data that represent the number of occurrences of a specified event in a given unit of time or space
Parts of a box plot
The lower fence is defined as: Q1 – 1.5(IQR). The upper fence is defined as: Q3 + 1.5(IQR) where Q1 and Q3 are the lower and upper quartiles and IQR is the interquartile range. The upper and lower fences are boundaries to detect any measurements beyond those fences which are called outliers
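The fence formulas can be applied directly once Q1 and Q3 are known. In this sketch the quartiles and data are made up, and the quartiles are passed in rather than computed, since quartile conventions vary between textbooks and software.

```python
def outlier_fences(q1: float, q3: float):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
lower, upper = outlier_fences(10, 20)
print(lower, upper)  # -5.0 35.0

# Any measurement beyond the fences is flagged as an outlier.
data = [-7, 3, 12, 18, 40]
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [-7, 40]
```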
What components are needed to compute a z-score?
mean and standard deviation
Kurtosis
a measure of the “peakedness” of the probability distribution of a real-valued random variable. Higher kurtosis means more of the variance is due to infrequent extreme deviations versus frequent modestly sized deviations
standard deviation of a mean
Given by the standard deviation divided by the square root of n: σ/√n; here n = 100.
scatterplot
used to investigate the relationship between two continuous variables
If all of the numbers in a list increase by 2, then the standard deviation is
Unchanged. Adding a constant to every number in a list shifts each value (and the mean) by that constant, but it does not change the standard deviation.
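This can be verified on a small made-up list: adding 2 to every value moves the mean but leaves the spread alone.

```python
import math
from statistics import mean, stdev

original = [1, 2, 3, 4, 5]           # hypothetical list
shifted = [x + 2 for x in original]  # every value increased by 2

# The mean moves by 2; the standard deviation is unaffected.
print(mean(original), mean(shifted))
print(stdev(original), stdev(shifted))
print(math.isclose(stdev(original), stdev(shifted)))
```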
In simple linear regression, what is a method of determining the slope and intercept of the best-fitting line
least squares
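A minimal sketch of least squares for one predictor, using the standard closed-form expressions (slope = Sxy/Sxx, intercept = ȳ – slope × x̄). The data points and the function name are invented for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squared deviations in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical points lying exactly on y = 1 + 2x.
slope, intercept = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```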
point prevalence
# of current cases / # of people in the population
What is beta?
slope of a regression