Exam 1: Biostatistics Flashcards
Descriptive Statistics
- Involves
- Purpose
Involves: Collecting, Presenting, and Characterizing Data
Purpose: Describe Data
Inferential Statistics
- Involves
- Purpose
Involves: Estimation and Hypothesis Testing
Purpose: Make decisions about population characteristics
***Allows us to describe a population based on a sample***
Variable
A symbol of an event, act, characteristic, trait, or attribute that can be measured and to which we can assign some values
Categorical Variable
Consists of some numeric or character codes that represent either:
- The presence or absence of something that is of research interest
- The relative weight or rank of the thing that is of research interest
Quantitative Variable
Variable that holds the numerical result of some measurement
Process
Series of actions or operations that transforms inputs to outputs; generates output over time
Characteristics of Variables: Nominal Scale
Simplest level of measurement - categories without order
Characteristics of Variables: Ordinal Scale
Nominal variables with an inherent order among the categories
Characteristics of Variables: Interval Scale
Measurable difference or interval or distance between observations
Characteristics of Variables: Ratio
Same as interval but with an absolute reference point (such as “0”)
Data Presentation: Qualitative Data
Summary Table –> Either a Bar Graph or Pie Chart
Data Presentation: Quantitative Data
Dot Plot, Stem and Leaf Display, or Frequency Distribution –> Histogram
Class
One of the categories into which qualitative data can be classified
Class Frequency
Number of observations in the data set falling into a particular class
Class Relative Frequency
Class frequency divided by the total number of observations in the data set
Class Percentage
The class relative frequency multiplied by 100
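A minimal Python sketch of the three class summaries above. The blood-type observations are made-up illustration data, not taken from the course material.

```python
from collections import Counter

# Hypothetical qualitative data: blood types of 10 patients (illustration only).
observations = ["A", "O", "O", "B", "A", "O", "AB", "O", "A", "O"]

n = len(observations)
class_frequency = Counter(observations)  # number of observations in each class
class_relative_frequency = {c: f / n for c, f in class_frequency.items()}
class_percentage = {c: 100 * rf for c, rf in class_relative_frequency.items()}

for c in sorted(class_frequency):
    print(c, class_frequency[c], class_relative_frequency[c], class_percentage[c])
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on a summary table.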
Bar Graph
Classes (Bars) have heights equivalent to class frequency, class relative frequency, or class percentage
(Unlike Histogram –> just class frequency and class relative frequency, bars are touching)
Pie Chart
Classes are in slices proportional to the class relative frequency
Central Tendency
Tendency to cluster/center about certain numerical values
Variability
Spread of the data
What symbols represent Sample/Population Mean and Size?
X bar should be lower case x
Which is used for both quantitative and qualitative data: mean, median, or mode?
Which is not affected by extreme values?
Mode
Median and Mode
Summary of Mean, Median, and Mode
Variance and Standard Deviation
- Measures of dispersion ***More reliable than Range***
- Most common measures
- Consider how data are distributed (unlike Range)
- Show variation about mean
What does Normal Distribution mean?
Mean = Median = Mode
What does the mean equal in the standard normal curve and what is the first standard deviation?
Mean = 0
First SD is +/- 1
Standard Notation (Sample vs. Population)
- Mean
- Standard Deviation
- Variance
- Size
When do you use n-1 vs. n in the denominator of the Variance Formula?
n-1 = Sample Variance
n = Population Variance
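A short sketch of the two denominators, using Python's standard `statistics` module. The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical measurements (illustration only).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_variance = statistics.variance(data)       # divides by n - 1
population_variance = statistics.pvariance(data)  # divides by n

# The same quantities written out to show the two denominators.
mean = sum(data) / len(data)
sum_of_squares = sum((x - mean) ** 2 for x in data)
manual_sample_variance = sum_of_squares / (len(data) - 1)
manual_population_variance = sum_of_squares / len(data)

print(sample_variance, population_variance)
```

Because n - 1 is smaller than n, the sample variance is always slightly larger than the population formula applied to the same numbers.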
Shape of Curve: Mean vs. Median
- Left-Skewed
Left-Skewed
Mean < Median
Shape of Curve: Mean vs. Median
- Right-Skewed
Right-Skewed
Mean > Median
The Empirical Rule
- Applies to
- What percentage of the measurements lie within 1, 2, and 3 SDs of the mean? What are their Z-scores?
Applies to: Data sets that are mound-shaped and symmetric (i.e. Normal Distributions)
Approximately 68% of measurements lie within one SD of the mean (x̄ - s to x̄ + s); z-scores between -1 and 1
Approximately 95% of measurements lie within two SDs of the mean (x̄ - 2s to x̄ + 2s); z-scores between -2 and 2
Approximately 99.7% of measurements lie within three SDs of the mean (x̄ - 3s to x̄ + 3s); z-scores between -3 and 3
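The three percentages can be checked against the standard normal curve. This is a small sketch using `statistics.NormalDist` (available in Python 3.8 and later); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the standard normal curve between -k and +k standard deviations.
within = {}
for k in (1, 2, 3):
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}")
```

The printed areas round to the 68%, 95%, and 99.7% figures of the rule.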
If you scored in the 58th percentile, what percentage of test takers scored lower/higher than you?
Lower: 58%
Higher: 42%
Numerical Measures of Relative Standing: Z-Scores
- Describes…
- Measures…
Describes the relative location of a measurement compared to the rest of the data
Measures the number of standard deviations away from the mean a data value is located
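As a sketch, the z-score is the distance from the mean divided by the standard deviation. The exam mean and SD below are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Hypothetical exam with mean 70 and SD 10 (illustration only).
print(z_score(85, 70, 10))  # 1.5
print(z_score(60, 70, 10))  # -1.0
```

A positive z-score is above the mean and a negative one is below it.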
What is the Frequentist definition of Probability?
If an experiment is repeated n times under identical conditions and if the event A occurs m times, then as n grows, the ratio of m/n approaches a fixed limit called the probability of A
P(A) = m/n
“Law of Large Numbers”
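A rough simulation of the frequentist idea, assuming a fair coin so that the limiting value is 0.5. The function name and the fixed seed are illustrative choices, not part of the definition.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return m/n for heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the ratio m/n can sit well away from 0.5; with many tosses it settles close to it.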
Probability Equation
Frequency of times an outcome occurs divided by the total number of possible outcomes (symbolized as p)
Random Event
Any event where the outcomes observed involve uncertainty or the outcome can vary
(predicted by Probability)
When is probability unnecessary to calculate?
For a fixed event
An Event (Two Definitions)
- An occurrence due to nature
- A collection of one or more outcomes of an experiment
Simple vs Compound Probabilities
Simple = Single occurrence
Compound = Result of operations
- Defines relationships between, or combinations of, event occurrences
What are the three operations that can be used to create compound events?
- Intersection
- Union
- Complement
Intersection
The intersection is defined as “both A and B”
Represented by A ∩ B
Union
Union is defined as “either A or B or both A and B”
Represented by A ∪ B
Complement
Defined as “Not A”
Denoted by A^C or -A
The Additive Law: Special Rule of Addition
Two events A and B that cannot occur simultaneously are said to be mutually exclusive or disjoint
e.g. The probability of a newborn weighing under 2000 grams is 0.025 and over is 0.043
***simply would add the probabilities of the individual events***
The Additive Law: General Rule of Addition
This is used when there is a common region; must subtract out common region
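Both addition rules can be sketched with one function: the general rule subtracts the common region, and the special rule is the case where that region is empty. The probabilities below are hypothetical.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """P(A or B). With p_a_and_b = 0 this is the special rule (mutually exclusive events)."""
    return p_a + p_b - p_a_and_b

# Hypothetical probabilities (illustration only).
mutually_exclusive = p_union(0.30, 0.20)   # nothing to subtract
overlapping = p_union(0.30, 0.20, 0.10)    # subtract the common region once

print(mutually_exclusive, overlapping)
```

Without the subtraction, outcomes in the common region would be counted twice.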
Two-Way Table Example
Probabilities Example
Independent Events
Two unrelated events
***When expressing the joint probability of independent events, the general rule of multiplication is not needed; the special rule applies***
The Special Rule of Multiplication
e.g. tossing a coin
Second toss has nothing to do with the first
Questions about Mutual Exclusiveness…
- If events are mutually exclusive…
- If events are not mutually exclusive…
Use “or” and the additive rule
- ME: add them all up
- Not ME: Subtract out common region
Questions about independence…
- Independent events…
- Not independent…
Use “and” and the multiplication rule
- Multiply them all together
- P (A and B) = P(A|B) x P(B)
P (B and A) = P(B|A) x P(A)
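A small sketch of the two multiplication rules. The function names and the dependent-event probabilities are illustrative assumptions.

```python
def p_joint_independent(p_a, p_b):
    """Special rule: P(A and B) = P(A) x P(B) when A and B are independent."""
    return p_a * p_b

def p_joint_general(p_a_given_b, p_b):
    """General rule: P(A and B) = P(A|B) x P(B)."""
    return p_a_given_b * p_b

# Two tosses of a fair coin are independent: P(heads and heads).
print(p_joint_independent(0.5, 0.5))  # 0.25

# Hypothetical dependent events: P(A|B) = 0.8 and P(B) = 0.25.
print(p_joint_general(0.8, 0.25))  # 0.2
```

When A and B are independent, P(A|B) equals P(A), so the general rule gives the same answer as the special rule.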
Bayes’ Theorem
- When is it used
- P(A) vs P(B|A)
- Importance
- When multiplicative events are not independent
- P(A) = prior probability (known before calculation)
P(B|A) = posterior probability (only known after calculation)
- Helps investigators determine the other pertinent probability when only one is known
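A hedged sketch of a Bayes' theorem calculation for a screening test. It uses descriptive names (event, evidence) rather than the card's A and B, and every number is invented for illustration.

```python
def posterior_probability(prior, p_evidence_given_event, p_evidence_given_no_event):
    """Bayes' theorem: P(event | evidence) from a prior and two conditional probabilities."""
    p_evidence = (p_evidence_given_event * prior
                  + p_evidence_given_no_event * (1 - prior))
    return p_evidence_given_event * prior / p_evidence

# Hypothetical screening test (numbers invented for illustration):
# prevalence 1%, P(positive | disease) = 0.90, P(positive | no disease) = 0.05.
p_disease_given_positive = posterior_probability(0.01, 0.90, 0.05)
print(round(p_disease_given_positive, 3))
```

Here the prior (1%) is known before the calculation, and the posterior, the probability of disease given a positive test, is only known after it.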
How do you figure out the population mean?
You can estimate it from a sample; the sample mean will usually be very close to it
Unbiased vs. Biased Estimates
Unbiased: if the sampling distribution of a sample statistic has a mean equal to the population parameter that the statistic is intended to estimate
Biased: if the mean of the sampling distribution is not equal to the parameter
Central Limit Theorem
As the sample size gets large enough, the sampling distribution of the sample mean becomes approximately normal
***Justifies Inferential Statistics***
Confidence Interval for a Population Mean: Normal (z) Statistic
- Finds what?
Finds the range over which the population parameter MIGHT be found
***A range of plausible values for the population parameter***
What does a 95% Confidence Level indicate?
In the long run, 95% of our confidence intervals will contain μ (the population mean) and 5% will not
What are 2 conditions required for a Valid Confidence Interval for μ?
- A Random Sample is selected from the target population
- The sample size n is LARGE
- Due to the Central Limit Theorem this condition guarantees that the sampling distribution of x̄ is approximately normal
Also, for large n, s will be a good estimator of σ (population standard deviation)
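A minimal sketch of a large-sample confidence interval for the mean, assuming the standard form x̄ ± z·s/√n with s standing in for σ. The sample values are generated by a simple hypothetical pattern, not real measurements.

```python
import math
from statistics import NormalDist, mean, stdev

def z_confidence_interval(sample, confidence=0.95):
    """Large-sample CI for the population mean: x-bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample SD (n - 1 denominator), used to estimate sigma
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of 49 readings (illustration only).
sample = [120 + (i % 7) for i in range(49)]
lower, upper = z_confidence_interval(sample)
print(round(lower, 2), round(upper, 2))
```

Raising the confidence level widens the interval, and a larger n narrows it.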
Student’s t-statistic
Has a sampling distribution very much like that of the z-statistic (mound shaped, symmetric, with mean 0)
***Primary difference is that t-statistic is more variable than z-statistic***
Degrees of Freedom (df)
Actual amount of variability in the sampling distribution of t depends on the sample size, n
T-statistic has (n-1) degrees of freedom
What happens as Degrees of Freedom (df) go down?
The t-distribution flattens out
Sampling Error
A way of expressing the reliability associated with a confidence interval for the population mean, μ
Sampling Error (SE) is equal to the half-width of the confidence interval
What is a Hypothesis?
A statement about the numerical value of a population parameter
Null Hypothesis (H0)
The hypothesis that will be accepted unless the data provide convincing evidence that it is false. This usually represents the “status quo” or some claim about the population parameter that the researcher wants to test
Alternative Hypothesis (Ha)
The hypothesis that will be accepted only if the data provide convincing evidence of its truth. This usually represents the values of a population parameter for which the researcher wants to gather evidence to support
***Opposite of the null hypothesis***
When do we use Hypothesis Testing?
- Observational Studies
  - Find the “true” population parameter
  (e.g. what is the prevalence of AIDS in some community)
  ***1 sample***
- Clinical Trials
  - Compare Group 1 to Group 2 or
  - Compare Baseline state to post-intervention state
  ***2 sample tests - Independent Samples***
Test Statistic
A sample statistic, computed from information provided in the sample, that the researcher uses to decide between the null and alternative hypotheses
Type I Error
Occurs if the researcher rejects the null hypothesis in favor of the alternative when, in fact, the null hypothesis is true. The probability of committing a Type I error is denoted by α (alpha)
***The level of α is usually small and is referred to as the level of significance of the test***
Rejection Region
The set of possible values of the test statistic for which the researcher will reject H0 in favor of Ha
Type II Error
Occurs if the researcher accepts the null hypothesis when, in fact, it is false. The probability of committing a Type II error is denoted by β (beta)
How do you identify the null hypothesis?
It will always have an equality sign
What is a p-value?
The observed significance level for a specific statistical test is the probability (assuming H0 is true) of observing a value of the test statistic that is at least as contradictory to the null hypothesis, and supportive of the alternative hypothesis, as the actual one computed from the sample data
***Used to make rejection decision***
What does a p-value ≥ α mean?
DO NOT reject H0
What does a p-value < α mean?
REJECT H0
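A sketch of the decision rule, assuming a z test statistic and a two-sided alternative. Both function names are illustrative.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) under H0, where Z is standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Reject H0 only when the p-value is below the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"

print(decide(two_sided_p_value(2.5)))  # reject H0
print(decide(two_sided_p_value(1.0)))  # do not reject H0
```

The comparison is always between the p-value and the significance level chosen before the test is run.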
Where is the Confidence in hypothesis testing?
Confidence is in the testing process, NOT in the particular result of a single test
Strength of Correlation
Reflects how consistently scores for each factor change
Regression Line
The best fitting straight line to a set of data points. A best fitting line is the line that minimizes the distance of all data points from it (in least-squares regression, the sum of squared vertical distances)
Numerical Measure of Correlation: Pearson Correlation Coefficient
The Pearson (product moment) correlation coefficient (r)- used to measure the direction and strength of the linear relationship of two factors in which the data for both factors are measured on an interval or ratio scale of measurement
Numerator –> Covariance (extent to which X and Y vary together)
Denominator –> Extent to which X and Y vary independently or separately
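A minimal sketch of the Pearson coefficient, written so the numerator and denominator match the description above. The paired data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance term over the product of the separate spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # X and Y together
    ss_x = sum((x - mean_x) ** 2 for x in xs)  # X on its own
    ss_y = sum((y - mean_y) ** 2 for y in ys)  # Y on its own
    return ss_xy / math.sqrt(ss_x * ss_y)

# Hypothetical paired measurements (illustration only).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, score), 3))
```

The sign of r gives the direction of the linear relationship, and its distance from 0 gives the strength.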
Regression Analysis
Statistical procedure used to determine the equation of a regression line to a set of data points and to determine the extent to which the regression equation can be used to predict values of one variable, given known values of a second factor in a population
- One quantitative dependent variable
- One or more quantitative or qualitative (binary) independent variables
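A sketch of fitting a regression line with a single quantitative predictor by least squares, then using it to predict. The data and the new x value of 6 are hypothetical.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical paired measurements (illustration only).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(hours, score)
predicted = intercept + slope * 6  # predicted score for a new x value of 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The fitted line always passes through the point (mean of X, mean of Y).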
Regression Analysis –>
Logistic Regression –>
Regression Analysis (Quantitative DV)
Logistic Regression (Qualitative DV)
-Yes or no, Male or Female
What do rows and columns represent in a data table?
Rows = Cases
Columns = Variables
What type of data do proportions summarize?
Nominal and Ordinal
(i.e. Qualitative data)
How are rates different from proportions?
They are similar to proportions EXCEPT a multiplier (e.g. 100, etc.) is used
***They have a time reference - are computed over a known/given period of time***
Vital Statistics Rates
Also known as demographic measures
***Describe the health status of a population***
e.g. Mortality Rates (Crude, Specific) and Morbidity Rates
Crude Mortality Rate
Number of all deaths in a given geography over a given year divided by the total population of the geography during the same year
Specific Mortality
Relates to specific populations within the geographic region
What is Morbidity Rate also known as?
Prevalence or Prevalence Rate
Incidence
The number of new cases that have occurred during a given interval of time divided by the total population at risk
What are Adjusted Rates used for?
To make a fair comparison between different populations and to avoid Confounding
Examples of Confounding Factors
Age composition, Gender composition, Race/ethnic composition of a population
Absolute Risk Reduction (ARR)
The reduction in risk (by the experiment) compared with the baseline risk
Number Needed to Treat (NNT)
The number needed to treat in order to prevent one event
What is the reciprocal of the absolute value of NNT?
Absolute Risk Increase or the Number Needed to Harm
Relative Risk Reduction (RRR)
The amount of risk reduction relative to the baseline risk
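A sketch of the three risk-reduction measures, assuming the usual definitions ARR = control rate minus treated rate, RRR = ARR / control rate, and NNT = 1 / ARR. The trial rates are hypothetical.

```python
def risk_reduction_measures(control_event_rate, treated_event_rate):
    """Return (ARR, RRR, NNT) from the event rates in the control and treated groups."""
    arr = control_event_rate - treated_event_rate  # absolute risk reduction
    rrr = arr / control_event_rate                 # reduction relative to baseline risk
    nnt = 1 / arr                                  # undefined when arr is 0 (no difference)
    return arr, rrr, nnt

# Hypothetical trial: 20% of controls and 15% of treated patients have the event.
arr, rrr, nnt = risk_reduction_measures(0.20, 0.15)
print(round(arr, 2), round(rrr, 2), round(nnt))
```

The same absolute reduction can correspond to very different relative reductions depending on the baseline risk, which is why both are reported.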
Relative Risk
What types of studies is it mainly used in?
The ratio of the incidence of a disease in people who are exposed to a risk to the incidence in people without exposure to the risk
Mainly used in cohort studies
(Prospective)
Odds Ratio
What type of study is this used in?
The odds that a person with the disease is exposed to a potential cause for the disease relative to the odds that a person without the disease is exposed to the potential cause
Mainly used in a case/control study
(Retrospective)
What does a RR or OR <1, >1, or =1 mean?
< 1 = Protective exposure
> 1 = Risky exposure
= 1 = No effect
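A sketch of both ratios computed from a 2x2 table, with the interpretation rule above. The cell labels a, b, c, d and the counts are assumptions made for illustration.

```python
def relative_risk(a, b, c, d):
    """RR for a 2x2 table: a, b = exposed with/without disease; c, d = unexposed with/without."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR for the same table: (a * d) / (b * c)."""
    return (a * d) / (b * c)

def interpret(ratio):
    """Apply the < 1, > 1, = 1 reading of an RR or OR."""
    if ratio < 1:
        return "protective exposure"
    if ratio > 1:
        return "risky exposure"
    return "no effect"

# Hypothetical counts (illustration only).
a, b, c, d = 30, 70, 10, 90
print(round(relative_risk(a, b, c, d), 2), interpret(relative_risk(a, b, c, d)))
print(round(odds_ratio(a, b, c, d), 2), interpret(odds_ratio(a, b, c, d)))
```

In practice an estimated ratio is rarely exactly 1, so whether an effect is present is judged with a confidence interval or test rather than the point estimate alone.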
Inference (on RR and OR)
Inference is possible using the normal distribution
The distributions of RR and OR themselves do not follow a normal distribution
The distributions of the natural log of RR and of OR do follow a normal distribution (approximately)
***Need to transform to generate inferential statistics***
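A sketch of the log transformation for an odds ratio, assuming the commonly used standard error sqrt(1/a + 1/b + 1/c + 1/d) for ln(OR). The counts are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """CI for an odds ratio, built on the log scale and then transformed back."""
    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)

# Hypothetical 2x2 counts (illustration only).
lower, upper = odds_ratio_ci(30, 70, 10, 90)
print(round(lower, 2), round(upper, 2))
```

The interval is symmetric around ln(OR) but not around the OR itself. An interval that excludes 1 corresponds to rejecting the null hypothesis of no effect at the matching significance level.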
When can you reject the null hypothesis?
When the p-value is smaller than the amount of error you were willing to commit (the significance level, α)
p-value of 0.03
significance level of 0.05
***Can reject the null hypothesis in this case***