Biostats Flashcards
In biostatistics, name the steps in study design. There are 5 steps.
1) Design of studies –> sample size, selection of study participants, role of randomization
2) Data collection variability –> important patterns in data are obscured by variability.
3) Inference –> draw conclusions from limited data
4) Summarize –> what summary measures will best convey the results
5) Interpretation –> what do the results mean in terms of practice, the program and the population
What are the 4 types of data in biostatistics?
1) Binary (Dichotomous) data: yes/no answers
2) Categorical Data: either nominal (no ordering) or ordinal (ordering)
3) Continuous Data: blood pressure, weight, etc.
4) Time to event data: time in remission
There are different statistical methods for different types of data. What two methods are used for binary data?
Fisher's Exact Test
Chi-Square Test
What methods are used for continuous data?
2-sample t test
Wilcoxon rank sum (nonparametric) test
How would you calculate the mean of a sample (sample average)?
Add up data and then divide by the sample size
What is the difference between population and sample in regards to data?
Population –> the entire group about which you want information (e.g., all women between ages 30 and 40)
Sample –> a part of the population from which we actually collect information; used to draw conclusions about the whole population.
How is population vs sample mean differentiated when it comes to statistical symbols?
Population Mean –> μ (mu)
Sample Mean –> X̄ (X-bar)
The median number is the middle number. What happens when the sample size is an even number?
Average the two middle numbers
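A minimal Python sketch of the mean and median cards above. The readings are made-up values chosen only for illustration; they are not from the lecture.

```python
from statistics import median

# Hypothetical blood pressure readings (mmHg), already sorted; n = 6 is even.
readings = [110, 118, 121, 125, 130, 140]

# Sample mean: add up the data, then divide by the sample size.
sample_mean = sum(readings) / len(readings)

# Median: with an even n, the two middle numbers (121 and 125) are averaged.
sample_median = median(readings)

print(sample_mean)    # 124.0
print(sample_median)  # 123.0
```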
What are ways in which the spread of a distribution can be described?
Min and Max
Range –> max - min
sample standard deviation (SD)
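As a rough illustration, the measures of spread listed above can be computed like this in Python. The data values are hypothetical.

```python
from statistics import stdev

# Hypothetical sample; values are for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_min = min(values)
data_max = max(values)
data_range = data_max - data_min   # range = max - min

# Sample standard deviation: statistics.stdev divides by n - 1.
sample_sd = stdev(values)

print(data_min, data_max, data_range)  # 2 9 7
print(round(sample_sd, 2))             # 2.14
```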
Why would a researcher feel it appropriate to make a histogram?
It is a way of displaying the distribution of a set of data by charting the number of observations whose values fall within predefined numerical ranges
How would one go about making a histogram?
Divide the data into equal intervals
Count the number of observations in each class
Draw the histogram
Label scales
Generally, how many intervals should you have in a histogram?
depends on the sample size, n
usually the guideline is the square root of n
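The histogram steps and the square-root guideline above could be sketched as follows. This is only one possible implementation: the function name `histogram_counts`, the rounding of sqrt(n), and the height data are all assumptions made for illustration.

```python
import math

def histogram_counts(data, num_bins=None):
    """Count observations in equal-width intervals.

    If num_bins is not given, use the square-root-of-n guideline (rounded).
    """
    n = len(data)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The largest value goes in the last interval instead of a new one.
        idx = min(int((x - lo) / width), num_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return counts

# Hypothetical heights in inches, n = 16, so the guideline gives 4 intervals.
heights = [61, 62, 63, 64, 64, 65, 65, 65, 66, 66, 67, 68, 69, 70, 71, 72]
counts = histogram_counts(heights)
print(counts)  # [3, 7, 3, 3]
```

Dividing each count by n would give the relative frequencies used in a relative frequency histogram.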
What are other types of histograms?
frequency histogram
relative frequency histogram
relative frequency polygon
(note see lecture page 9 for images)
There are several shapes of distribution when plotting data. Explain what right skewed, left skewed and symmetrical mean.
Symmetrical –> right and left sides are mirror images (mean = median = mode)
Left Skewed (negatively skewed) –> long left tail; mean < median
Right Skewed (positively skewed) –> long right tail; mean > median (ex: hospital stays)
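A small sketch of how a long right tail pulls the mean above the median. The lengths of stay below are invented for illustration, loosely following the hospital stay example.

```python
from statistics import median

# Hypothetical hospital stays in days; one very long stay forms the right tail.
stays = [1, 2, 2, 3, 3, 4, 5, 30]

stay_mean = sum(stays) / len(stays)
stay_median = median(stays)

print(stay_mean)    # 6.25
print(stay_median)  # 3.0
```

The single 30-day stay raises the mean well above the median, which is the pattern expected for right skewed data.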
Describe in general terms what probability density refers to?
smooth idealized curve that shows the shape of the distribution in the population
What are some features of a normal (gaussian) distribution
symmetric
bell shaped
mean = median = mode
(mean is the center) (SD is the spread)
What does the 68-95-99.7 rule mean?
In any normal distribution, approximately:
68% of the observations fall within one standard deviation of the mean
95% of the observations fall within two standard deviations of the mean
99.7% of the observations fall within three standard deviations of the mean
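The rule can be checked numerically with Python's standard library. This is a sketch; the helper name `fraction_within` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

def fraction_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within = {k: fraction_within(k) for k in (1, 2, 3)}
for k, frac in within.items():
    print(k, round(100 * frac, 2))
```

The exact values are close to 68.3%, 95.4% and 99.7%, so the rule is a rounded approximation.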
What is a Z score?
Tells how many standard deviations from the population mean you are
Z = (observation - population mean) / SD
What are the standard Z scores?
Z = 1 –> observation lies one SD above the mean
Z = 2 –> observation lies two SD above the mean
Z = -1 –> observation lies one SD below the mean
Z = -2 –> observation lies two SD below the mean
If female heights have mean = 65 inches and SD = 2.5 inches,
what is the Z score for 72.5 inches and for 60 inches?
For 72.5 inches: Z = (72.5 - 65) / 2.5 = +3.0 (3 SD above the mean)
For 60 inches: Z = (60 - 65) / 2.5 = -2.0 (2 SD below the mean)
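The height example can be reproduced with a short Python function. The name `z_score` is an assumption; the numbers are the ones from the card.

```python
def z_score(observation, population_mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - population_mean) / sd

z_tall = z_score(72.5, 65, 2.5)
z_short = z_score(60, 65, 2.5)

print(z_tall)   # 3.0
print(z_short)  # -2.0
```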
Example:
Suppose the population is normally distributed: if you have a standard score of Z=2, what percent of the population would have scores greater than you?
2.5% (95% fall within 2 SD, so 5% fall outside in total; the question asks only for scores greater than yours, so 5% / 2 = 2.5%)
(refer to the 68-95-99.7 rule)
Example:
If you have a standard score of Z=2, what % of the population would have scores less than you?
97.5% (5% fall more than 2 SD from the mean, 2.5% in each tail, so 100 - 2.5 = 97.5)
again refer to the 68-95-99.7 rule
Example:
If you have a standard score of Z=3, what % of the population would have scores greater than you?
0.15% (0.3% fall more than 3 SD from the mean in total; the question asks only for scores greater than yours, so 0.3% / 2 = 0.15%)
again refer to the 68-95-99.7 rule
Example:
If you have a standard score of Z=-1.5, what % of the population would have scores less than you?
6.68% (this requires a Z table)
As a check: Z = -2 leaves 2.5% below and Z = -1 leaves 16% below, so the answer has to be higher than 2.5% but less than 16%
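In place of a printed table, the lookup for Z = -1.5 can be done with the standard normal CDF. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
below = NormalDist().cdf(-1.5)   # fraction of the population with Z < -1.5
percent_below = round(100 * below, 2)

print(percent_below)  # 6.68
```

The result sits between the 2.5% and 16% bounds given by the 68-95-99.7 rule.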
Example:
Suppose we call “unusual” observations those that are either at least 2 SD above the mean or at least 2 SD below the mean. What % are unusual? (In other words, what % of the observations will have a standard score either Z > +2.0 or Z < -2.0, i.e. |Z| > 2?)
5% fall outside 2 SD (again, this follows from the 95% part of the rule)
What % of the observations would have |Z| > 1.0 (i.e., lie more than 1 SD away from the mean)?
32%
again, 100 - 68 = 32
What % of the observations would have |Z| > 3.0?
0.3%
again, 100 - 99.7 = 0.3
What % of the observations would have |Z| > 1.15?
|Z| > 1.0 would be 32% and |Z| > 2.0 would be 5%
so the answer would be between 5% and 32%
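The bracketing argument above can be checked against the normal CDF. This sketch treats the question as two-sided (more than 1.15 SD from the mean in either direction), which is what the 32% and 5% bounds imply.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Two-sided tail area: P(Z > 1.15) + P(Z < -1.15), equal by symmetry.
two_sided = 2 * (1 - std_normal.cdf(1.15))

print(round(100 * two_sided, 1))  # 25.0
```

About 25% lies beyond 1.15 SD, which does fall between 5% and 32%.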
What is the difference between a parameter and a statistic?
Parameter –> number that describes the population; this is a fixed number (population mean; population proportion)
Statistic –> number that describes a sample of data; can be calculated (sample mean; sample proportion)
What are errors from biased sampling?
The study systematically favors certain outcomes
Examples: voluntary response, non-response, convenience sampling
Solution: random sampling
What are errors from random sampling?
caused by chance occurrence
get a bad sample because of bad luck
can be controlled by taking a larger sample
When a selection procedure is biased does taking a larger sample help?
no
this just repeats the mistake on a larger scale
When a sample is randomly selected from the population, it is called what?
random sample
What is an advantage of a random sample?
helps control systematic bias
however there is still some sampling variability or error
If we repeatedly choose samples from the same population, a statistic will take different values in different samples, what is this called?
Sampling Variability
The spread of a sampling distribution depends on the sample size. Is it better to have a bigger or smaller sample size?
larger unbiased samples are better
larger samples also give us more tightly clustered histograms therefore more values are closer to the mean
If the researcher were to increase the sample size by a factor of 4, what would happen to the spread?
The spread will be cut in half (it shrinks in proportion to 1 / square root of n, and the square root of 4 is 2)
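A quick numeric check of the factor-of-4 card, assuming the spread of the sampling distribution of the mean is SD / sqrt(n). The SD of 14.0 and the sample sizes are illustrative values only.

```python
import math

def standard_error(sd, n):
    """Spread of the sampling distribution of the mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)

se_n = standard_error(14.0, 25)     # original sample size
se_4n = standard_error(14.0, 100)   # sample size increased by a factor of 4
ratio = se_4n / se_n

print(se_n, se_4n, ratio)  # 2.8 1.4 0.5
```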
Describe the sampling distribution
what the distribution of the statistic would look like if we chose a large number of samples from the same population
it describes the distribution of all sample means, from all possible random samples of the same size taken from a population.
What is the central limit theorem?
It provides this mathematical result: the sampling distribution of a statistic is often approximately normally distributed
For the theorem to apply, the sample size (n) needs to be large (the guideline here is n > 60)
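A small simulation can make the theorem concrete. This is a sketch under stated assumptions: the population is taken to be exponential with mean 1 and SD 1 (strongly right skewed), the sample size is 64, and 2000 samples are drawn with a fixed seed. None of these choices come from the lecture.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def one_sample_mean(n):
    """Mean of one random sample of size n from an exponential population."""
    return mean([rng.expovariate(1.0) for _ in range(n)])

n = 64
sample_means = [one_sample_mean(n) for _ in range(2000)]

center = mean(sample_means)    # expected to be near the population mean, 1
spread = stdev(sample_means)   # expected to be near 1 / sqrt(64) = 0.125

# Fraction of sample means within 2 standard errors of the population mean.
within_2se = sum(abs(m - 1.0) <= 2 * 0.125 for m in sample_means) / len(sample_means)

print(round(center, 3), round(spread, 3), round(within_2se, 3))
```

Even though the population is skewed, roughly 95% of the sample means should land within 2 standard errors of the population mean, as a normal sampling distribution would predict.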
What is a standard error (SE)?
Measures the precision of your sample statistic such as the sample mean or proportion that is calculated from a number (n) of different observations.
As the sample size gets bigger what happens to the standard error?
It gets smaller, and therefore the sample mean is more precise.
Standard Error of the Mean (SEM) is again a measure of the precision of the sample mean. What is the formula to calculate SEM?
SEM = s / square root of n
Example: blood pressure in a random sample of 100 students
Sample Size: n = 100
Sample Mean: X̄ = 123.4 mmHg
Sample SD: s = 14.0 mmHg
SEM: 14 / sq. root of 100 = 1.4 mmHg
How close to the population mean (mu) is the sample mean (X̄)?
The standard error of the sample mean tells us that, 95% of the time, the population mean will lie within about 2 standard errors of the sample mean.
X̄ +- 2 SEM
123.4 +- 2 x 1.4
123.4 +- 2.8
We are 95% confident that the sample mean is within 2.8 mmHg of the population mean. The 95% error bound is 2.8 mmHg.
From the blood pressure example, what would be the 95% Confidence Interval (CI)?
123.4 +- 2.8
We are 95% confident that the population mean falls in the range 120.6 to 126.2 mmHg
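The blood pressure interval can be reproduced in a few lines of Python. The function name `ci_95_large_sample` is an assumption; it uses the cards' large-sample rule of the sample mean plus or minus 2 SEM.

```python
import math

def ci_95_large_sample(sample_mean, s, n):
    """Approximate 95% CI for the population mean: sample mean +- 2 SEM."""
    sem = s / math.sqrt(n)   # standard error of the mean
    margin = 2 * sem         # 95% error bound
    return sample_mean - margin, sample_mean + margin

low, high = ci_95_large_sample(123.4, 14.0, 100)
print(round(low, 1), round(high, 1))  # 120.6 126.2
```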
Is a 99% or 90% CI wider?
99% CI is wider
90% is narrower
The length of a CI decreases (gets narrower) when n and s do what?
n increases
s decreases
(or when the level of confidence decreases)
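These three effects on CI length can be checked numerically. The sketch below uses normal-based multipliers from `NormalDist().inv_cdf` (about 1.645, 1.96 and 2.576 for 90%, 95% and 99%) in place of the rounded value 2 used on the cards; the function name `ci_width` and the input values are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def ci_width(s, n, confidence):
    """Full width of a normal-based CI for the mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided multiplier
    return 2 * z * s / math.sqrt(n)

w90 = ci_width(14.0, 100, 0.90)
w95 = ci_width(14.0, 100, 0.95)
w99 = ci_width(14.0, 100, 0.99)
w95_bigger_n = ci_width(14.0, 400, 0.95)   # n increased by a factor of 4
w95_smaller_s = ci_width(7.0, 100, 0.95)   # s cut in half

print(round(w90, 2), round(w95, 2), round(w99, 2))
print(round(w95_bigger_n, 2), round(w95_smaller_s, 2))
```

A higher confidence level widens the interval, while a larger n or a smaller s narrows it.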
What are the two underlying assumptions for a 95% CI for the population mean?
Random sample of the population
Sample size n is at least 60 (to use +- 2 SEM)
How would one calculate a 95% CI for the mean if the sample size is smaller or larger than 60?
Base it on a t-table
df is degrees of freedom: n - 1
Use the df to look up the t value
For example, if:
n = 5
X̄ = 99 mmHg
s = 15.97, then what is the 95% CI?
99 +- 2.776 (from t table, df = 4) x SEM
SEM = 15.97 / sq. root of 5 = 7.142
99 +- 2.776 x 7.142
99 +- 19.83
The 95% CI for mean blood pressure is: (79.17, 118.83)
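The small-sample example can be checked with a short Python sketch. The t value 2.776 (df = 4) is taken from the card's t table and passed in by hand, since the Python standard library has no t-distribution lookup; the function name `t_confidence_interval` is an assumption.

```python
import math

def t_confidence_interval(sample_mean, s, n, t_critical):
    """CI for the mean: sample mean +- t_critical x SEM, with t from a t table."""
    sem = s / math.sqrt(n)
    margin = t_critical * sem
    return sample_mean - margin, sample_mean + margin

# n = 5 gives df = 4; 2.776 is the tabulated two-sided 95% t value for df = 4.
low, high = t_confidence_interval(99, 15.97, 5, 2.776)
print(round(low, 2), round(high, 2))  # 79.17 118.83
```

The margin works out to about 19.83, which matches the interval (79.17, 118.83).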
Does the standard error or the standard deviation depend on sample size?
standard error
remember the formula (s / sq. root of n)