Week 9 Descriptive and Comparative Statistics Flashcards
Objectives of Today’s Class
-understanding the role of statistical analysis in health research
-getting familiar with terminology
you must know all terms in ch29 and 30
-essential for correct interpretation of health research results
note: today's class only introduces the concepts; you will learn much more in a biostatistics class
Research process
Descriptive Statistics
Definitions you need to know in ch
Biostatistics
is the science of analyzing data and interpreting the results so that they can be applied to solving problems related to biology, health or related fields
Univariate analysis
describes one variable in a data set using simple statistics like counts (frequencies), proportions, and averages
Bivariable analysis
uses rate ratios, odds ratios, and other comparative statistical tests to examine the associations between two variables (mostly exposure and outcome)
Multivariable analysis
encompasses statistical tests such as multiple regression models that examine the relationships among three or more variables
Advanced statistics should be used only when
they are appropriate for the study question and the analyst knows how to use and interpret them correctly
Link to Study Design
What is a Variable?
- any quantity that varies from one entity to another (sometimes within an entity over time)
*any attribute, phenomenon, or event that can have different values
-describes a characteristic of a person, place, thing, or idea
We measure variables when an experiment is carried out or an observation is made (week5-2)
THE BIG PICTURE
VMDSI
WHAT are the types of variables
Qn-DC (Quantitative: Discrete, Continuous)
Ql-NO (Qualitative: Nominal, Ordinal)
Types of Variables (1)Nominal
-No intrinsic or logical order or value
* university programs, countries, types of fruits
-You can assign numbers to different categories (like assigning a number to a pear) but they do not have any other numeric properties
-doing arithmetic (e.g., 1+2= 3) IS NONSENSICAL
Types of Variables (1)ORDINAL
Intrinsic value but with no clear or equal differences between levels (a set of ordered categories)
Primary vs. secondary vs. university education
Mild vs. moderate vs. severe pain
Rating scales (assigning numbers)
Legitimate to say: 1 ≠ 2; 5 > 4 > 3 > 2 > 1
But in terms of the attribute being measured, we cannot say
(4-3) = (3-2) = (2-1)
4 is not two times larger than 2
Displaying Qualitative (nominal, ordinal) Data
-Pie chart
-Bar chart
Florence Nightingale
Types of Variables: Quantitative (Numeric)
Meaningful numeric scales
Age, blood pressure, # of friends, temperature
Assigned numbers have total mathematical meaning
1 ≠ 2; 2 ≠ 3
5 > 4 > 3 > 2 > 1
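4 is indeed two times larger than 2 (strictly true only for ratio variables with a true zero, such as age; not for interval variables such as temperature in °C)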
(4-3) = (3-2) = (2-1)
Classification of QUANTITATIVE VARIABLES
Continuous vs Discrete
Classification of QUANTITATIVE VARIABLES
Interval vs Ratio
Numeric Variables; Measures of Central Tendency: 1- Mean
A sample MEAN (the arithmetic average) is calculated by adding up all the values for a particular variable and dividing that sum by the total number of individuals with a value for the variable
Find the mean for the set of measurements:
2, 9, 11, 6, 6, 26
Solution: x̅=(2+9+11+6+6+26)/6=10
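As a quick check of the arithmetic above, the mean can be sketched in Python. This is only an illustration; the variable names are made up here and are not from the textbook.

```python
from statistics import mean

# Measurements from the worked example above
values = [2, 9, 11, 6, 6, 26]

# Mean = sum of all values divided by the number of values
manual_mean = sum(values) / len(values)

# The standard library computes the same quantity
library_mean = mean(values)

print(manual_mean, library_mean)
```

Both approaches agree with the hand calculation of 10.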
Numeric Variables; Measures of Central Tendency: 2-Median
The median is the value in the middle when you rank the data in ascending or descending order
Divides the data into 2 equal parts
Find the median for the set of measurements: 9, 5, 11, 6, 6, 26
Solution:
Rank the measurements from smallest to largest: 5, 6, 6, 9, 11, 26
Find the middle observation(s)
With an even number of observations, take the average of the two middle observations (6 and 9)
Median = (6+9)/2 = 7.5
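The ranking steps above can be sketched in Python as follows. The variable names are illustrative assumptions; the odd/even rule is the usual textbook convention.

```python
from statistics import median

# Measurements from the worked example above
values = [9, 5, 11, 6, 6, 26]

# Step 1: rank the measurements from smallest to largest
ranked = sorted(values)  # [5, 6, 6, 9, 11, 26]
n = len(ranked)

# Step 2: find the middle observation(s)
if n % 2 == 1:
    # Odd count: a single middle observation
    manual_median = ranked[n // 2]
else:
    # Even count: average the two middle observations
    manual_median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

print(manual_median, median(values))
```

Unlike the mean, the median here is not pulled upward by the single large value 26.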
Numeric Variables; Measures of Central Tendency: 3-Mode
The most frequently occurring value for a particular variable in a data set
Find the mode for the set of measurements:
2, 9, 11, 6, 6, 26
Solution: the mode is 6 (it occurs twice; every other value occurs once)
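Finding the mode amounts to counting how often each value occurs. A minimal Python sketch, with illustrative variable names:

```python
from collections import Counter
from statistics import multimode

# Measurements from the example above
values = [2, 9, 11, 6, 6, 26]

# Count how many times each value occurs
counts = Counter(values)

# multimode returns every value tied for the highest count,
# so it also handles data sets with more than one mode
modes = multimode(values)

print(counts.most_common(1), modes)
```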
Displaying Distributions: Histogram
-it is important to choose the intervals (bins) carefully
-remember that each bar of a histogram covers a group (interval) of values, such as 70-80
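To illustrate how values are grouped into intervals, here is a small Python sketch that draws a text histogram. The scores and the interval width of 10 are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical exam scores (made up for illustration)
scores = [52, 67, 71, 74, 78, 79, 81, 83, 85, 88, 90, 94]
bin_width = 10  # assumed interval width

# Map each score to the lower edge of its interval, e.g. 74 -> 70.
# Each interval includes its lower edge but not its upper edge,
# so a score of 80 would fall in 80-90, not 70-80.
bin_counts = Counter((s // bin_width) * bin_width for s in scores)

# One bar per interval; bar length = number of scores in it
for lower in sorted(bin_counts):
    upper = lower + bin_width
    print(f"{lower}-{upper}: {'#' * bin_counts[lower]}")
```

Changing `bin_width` changes the shape of the display, which is why the choice of intervals matters.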
Shape of the Histogram: Normal Distribution
Positively skewed (long tail to the right) looks like a P on its back.
Numeric Variables: Measures of Variability (Spread, dispersion)
-The range for a variable is the difference between the minimum (lowest) and the maximum (highest) values in the data set
-Quartiles mark the three values that divide a data set into four equal parts
The interquartile range (IQR) captures the middle 50% of values for a numeric variable
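The range, quartiles, and IQR can be sketched in Python as below. Note the assumption: `statistics.quantiles` uses one particular interpolation rule by default, and other software or textbooks may give slightly different quartile values for small data sets.

```python
from statistics import quantiles

# Same measurements as in the mean example
values = [2, 9, 11, 6, 6, 26]

# Range = maximum minus minimum
data_range = max(values) - min(values)

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(values, n=4)

# IQR = distance between the first and third quartiles (middle 50%)
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

The second quartile `q2` is the median, so it matches the median of the data.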
Boxplot: Display of Distribution
A simple visual depiction of the data and an intuitive way to explore it
Variance (σ²)
-The extent of deviation from the average value of that variable in the data set
-Calculated by adding together the squares of the differences between each observation and the mean (µ) and then dividing by the total number of observations (when estimating the variance from a sample, divide by n − 1 instead)
-The standard deviation (σ) is the square root of the variance
-The standard error of the mean adjusts for the number of observations in the data set by dividing the variance by the total number of observations and then taking the square root of that number
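The definitions above can be followed step by step in Python. This is a sketch using the measurements from the mean example; the variable names are illustrative. It follows the card's wording (divide by the number of observations) and also shows the n − 1 version that software usually reports for samples.

```python
from math import sqrt
from statistics import pvariance, variance

values = [2, 9, 11, 6, 6, 26]
n = len(values)
mean_value = sum(values) / n

# Squared difference between each observation and the mean
squared_deviations = [(x - mean_value) ** 2 for x in values]

# Variance as described above: divide by the number of observations
var_pop = sum(squared_deviations) / n

# Standard deviation = square root of the variance
sd_pop = sqrt(var_pop)

# Standard error of the mean = square root of (variance / n)
sem = sqrt(var_pop / n)

# Sample version: divide by n - 1 instead of n
var_sample = sum(squared_deviations) / (n - 1)

print(var_pop, sd_pop, sem, var_sample)
```

In practice the standard error is usually computed from the sample (n − 1) standard deviation; the difference shrinks as n grows.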
Mean (µ) and SD (σ) in a Normal distribution
-About 68% of area (population) within μ±1σ;
-About 95% of area within μ±2σ;
-About 99.7% of area within μ±3σ
If μ=20 and σ=5, then about 68% of subjects are measured between 15 (20-5) and 25 (20+5)
The probability of observing a value between 15 and 25 ≈ 0.68
between 10 and 30 ≈ 0.95
between 5 and 35 ≈ 0.997
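These probabilities can be checked with Python's built-in normal distribution. A minimal sketch using the μ=20, σ=5 example; the helper name `prob_between` is made up for illustration.

```python
from statistics import NormalDist

# Normal distribution from the example: mean 20, SD 5
dist = NormalDist(mu=20, sigma=5)

def prob_between(low, high):
    """Probability that a value falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)

p_1sd = prob_between(15, 25)  # within 1 SD of the mean
p_2sd = prob_between(10, 30)  # within 2 SD of the mean
p_3sd = prob_between(5, 35)   # within 3 SD of the mean

print(round(p_1sd, 4), round(p_2sd, 4), round(p_3sd, 4))
```

The exact values are slightly different from the rounded 68%, 95%, and 99.7% figures, which is why they are usually quoted as approximations.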
Confidence Intervals (CI)
-Provide information about the expected value of a measure in a source population based on the value of that measure in a study population
–A larger sample size will yield a narrower confidence interval
-A 95% confidence interval is usually reported for statistical estimates, which means that 5% of the time the confidence interval is expected to miss capturing the true value of a measure in the source population
–Example: mean systolic blood pressure of a sample is 120 mmHg; 95%CI: 110-130
-We are 95% confident that the real average is between 110 and 130 mmHg. Strictly, this means that if the study were repeated many times, about 95% of the intervals built this way would capture the true mean and about 5% would miss it
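A rough sketch of how a 95% confidence interval for a mean can be computed, and why a larger sample narrows it. Assumptions: the standard deviation (20 mmHg) and the sample sizes are hypothetical, the helper `ci95` is made up for illustration, and the normal approximation is used (small samples would normally use a t distribution instead).

```python
from math import sqrt
from statistics import NormalDist

def ci95(sample_mean, sd, n):
    """Approximate 95% CI for a mean using the normal distribution."""
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    se = sd / sqrt(n)                # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se

# Hypothetical study: mean systolic BP 120 mmHg, SD 20 mmHg
small_sample = ci95(120, 20, 16)  # n = 16
large_sample = ci95(120, 20, 64)  # n = 64, same mean and SD

print(small_sample, large_sample)
```

With these made-up numbers the smaller sample gives an interval of roughly 110 to 130, and quadrupling the sample size halves its width.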
Comparative Statistics
Comparing main factors between exposed and unexposed in cohort studies
Average age of exposed=Average age of unexposed
% male in exposed=% male in unexposed
Testing if randomization was effective in experimental studies
Comparing the outcome status
We can NOT just look at the calculated values (these are estimates from samples, subject to random sampling error)
Inferential Statistics
Techniques that use statistics from a random sample of a population to make evidence-based assumptions (inference) about the values of parameters in the population as a whole
Decisions about population parameters, based on information obtained from a sample, are made through hypothesis testing
Hypothesis Testing
Steps in Hypothesis Testing
- Take a random sample from the population of interest
- Set up two competing hypotheses (based on research questions)
Null Hypothesis (H0): no effect, no difference between sample and the original population
Alternative Hypothesis (H1 or Ha): there is an effect (a difference)
- Use sample statistics (mean, frequency) to decide whether to reject or fail to reject the null
By calculating a test statistic
Note: Tests (specific formulas) are developed for different types of data and research questions (Figures 30-12 to 30-15 of the textbook)
- Determine, if the null hypothesis is really true, what the observed sample statistics would be expected to be
How?
Idea of the p-value (probability)
Introduced by Fisher to determine whether the observed sample supports the null
Between 0.1 and 0.9: no reason to suspect null is false
<0.02: sufficiently strong evidence to conclude that the null does not reflect the state of nature (it is unlikely to be true)
“The value for which P=0.05, or 1 in 20; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.”
0.05 is the convention commonly used in health research
The p-value measures how compatible the sample data are with the null
Idea of the p-value (probability)
It is calculated from the observed data using a pertinent test statistic
The probability of obtaining a value of the test statistic as extreme as, or more extreme than, the observed one in a universe in which we know that the null is true
If p = 0.01, it means that if the null is true in the real world (no difference), there is only a 1% chance that the data would show a difference at least as large as the one observed
This chance is small, so we reject the null
The significance level (α) is the p-value threshold at or below which the null hypothesis is rejected, usually 0.05 in health research
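The idea of "as or more extreme in a universe where the null is true" can be illustrated with an exact permutation test. Note the assumptions: this is a swapped-in technique chosen because it makes the definition concrete, not necessarily one of the tests in the textbook figures, and the blood pressure values are made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical systolic blood pressures in mmHg (made-up data)
exposed = [130, 135, 140, 145, 150]
unexposed = [110, 115, 120, 125, 128]

# Test statistic: difference between the two group means
observed_diff = mean(exposed) - mean(unexposed)

# If the null is true, group labels are arbitrary, so every way of
# splitting the pooled values into two groups is equally likely
pooled = exposed + unexposed
k = len(exposed)
indices = range(len(pooled))

extreme = 0  # splits with a difference as or more extreme than observed
total = 0    # all possible splits
for chosen in combinations(indices, k):
    chosen_set = set(chosen)
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in indices if i not in chosen_set]
    diff = mean(group_a) - mean(group_b)
    total += 1
    # Two-sided: compare absolute differences (tiny tolerance for floats)
    if abs(diff) >= abs(observed_diff) - 1e-12:
        extreme += 1

p_value = extreme / total
alpha = 0.05  # significance level
reject_null = p_value < alpha

print(p_value, reject_null)
```

Here only 2 of the 252 possible splits are as extreme as the observed one, so the p-value is well below 0.05 and the null is rejected.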
A parametric test
assumes the variables being examined have particular distributions
Inferential methods are based on types of distributions (mostly normal)
A nonparametric test
does not make assumptions about the distributions of responses
Nonparametric tests are used for ranked variables and when the distribution of a ratio or interval variable is non-normal