Module 1-8 Flashcards
Biostatistics
-Statistics is not merely a compilation of computational techniques
-Statistics
~Is a way of learning from data
~Is concerned with all elements of study design, data collection, and analysis of numerical data
~Does require judgment
-Biostatistics is statistics applied to biological and health problems
Biostatisticians are
-Data Detectives
~Who uncover patterns and clues
~This involves exploratory data analysis (EDA) and descriptive statistics
-Data Judges
~Who judge and confirm clues
~This involves statistical inference
Measurement
-Measurement (defined)
~The assigning of numbers and codes according to prior-set rules
-There are three broad types of measurements
~Nominal
~Ordinal
~Quantitative
Nominal Measurements
-Classify observations into named categories
~No order
~Typically two categories (binary, e.g., yes/no); there can be more than two, but the categories cannot be ordered
-Ex:
~HIV status (positive or negative)
~Sex (Male or Female)
~Hair color (red, brown, black, blonde, gray, etc.)
Ordinal Measurement
-Categories that can be put in rank order
~Opinion
*More than two categories that have to be in order
-Ex:
~ Stage of cancer classified as Stage I, Stage II, Stage III, Stage IV
~Opinion classified as strongly agree (5), agree (4), neutral (3), disagree (2), strongly disagree (1)
~Age groups 0-4, 5-9, 10-14, etc.
Quantitative Measurements
-True numerical values that can be put on a number line
-Numerical values with equal spacing between numerical values
-Ex:
~Age (years)
~Serum cholesterol (mg/dL)
~T4 cell count (per dL)
Illustrative Example:
-Weight Change and Heart Disease
-This study sought to determine the effect of weight change on coronary heart disease risk
-It followed 115,818 women, 30-55 years of age and initially free of CHD, for 14 years
-Measurements included the following variables
~Nominal (including Binary)
*CHD onset (yes or no)
*Family history of CHD (yes or no)
~Ordinal
*Non-smoker, light-smoker, moderate smoker, heavy smoker
~Quantitative
*BMI (kg/m^2)
*Age (years)
*Weight presently
*Weight at age 18
Observation, Variable, and Value
-Observation
~The unit upon which measurements are made and can be an individual or aggregate
-Variable
~The generic thing we measure
*Age of person
*HIV status of a person
-Value
~A realized measurement
*“27”
*“positive”
Data Table
-Each row corresponds to an observation
-Each column contains information on a variable
-Each cell in the table contains a value
-Units of observation in these data are individual regions, not individual people
~Table 1.2 in the textbook
Measurement Inaccuracies
-Imprecision
~The inability to get the same result upon repetition
-Bias
~A tendency to overestimate or underestimate the true value of an object
Types of Studies
-Surveys
~Describe population characteristics
*A study of the prevalence of hypertension in a population
-Comparative Studies
~Determine relationships between variables
*A study to address whether weight gain causes hypertension
2.1 Surveys
-Goal
~To describe population characteristics
-Studies a subset (sample) of the population
~Census vs. Sample
-Uses sample to make inferences about population
-Sampling
~Saves time
~Saves money
~Allows resources to be devoted to greater scope and accuracy
Illustrative Example
-Youth Risk Behavior Surveillance (YRBS)
-YRBS monitors health behaviors in youth and young adults in the US. Six categories of health-risk behaviors are monitored. These include:
~Behavior that contributes to unintentional injuries and violence
~Tobacco use
~Alcohol and drug use
~Sexual behaviors
~Unhealthy dietary behavior
~Physical activity levels and body weight
-Ex:
~Population
*Several million public and private school students in the US in 2003
~Sample
*15,240 questionnaires completed at 158 schools
Types of Samples
-Probability Sample
~Simple random sample (SRS)
~Stratified random sample
~Cluster sample
Types of Samples
-Non-probability sample
~Convenience sample
Sampling
-Probability samples
~Use chance mechanisms to select individuals
-Most basic type of probability sample is the simple random sample (SRS)
-SRS
~Each population member has the same probability of being selected into the sample
~Selection of any individual into the sample does not influence the likelihood of selecting any other individuals
Simple Random Sampling Method
-Identify each population member with a number 1, 2, …, N
-Use a random number generator to generate n random numbers between 1 and N
~Ex:
*http://www.random.org/integer-sets/
-Keep in mind
~The objective of an SRS is that every possible subset is equally likely!
Let’s select 5 random IDs from our class IDs (1-79)
-Step one
-Generate 1 set with 5 unique random integers in each
-Each integer should have a value between 1 and 79 (both inclusive; limits +/- 1,000,000,000)
-The total number of integers must be no greater than 10,000
Class ID Selection
-15, 31, 40, 48, and 76
-Random Integer Set Generator
~One requested 1 set with 5 unique random integers, taken from the [1,79] range. The integers were sorted in ascending order
*Here is the set
**Set 1: 15, 31, 40, 48, 76
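The same selection can be sketched in code. This is a minimal illustration using Python's standard `random` module rather than random.org; the function name `simple_random_sample` and the seed value are assumptions made for the example, and the IDs it draws will differ from the set shown above.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw an SRS of IDs from 1..population_size, without replacement."""
    rng = random.Random(seed)  # optional seed, only so a run can be repeated
    ids = range(1, population_size + 1)
    # random.sample returns unique items; every subset of the
    # requested size is equally likely to be chosen
    return sorted(rng.sample(ids, sample_size))

# Select 5 of the 79 class IDs (seed chosen arbitrarily for illustration)
class_sample = simple_random_sample(79, 5, seed=2024)
print(class_sample)
```

Sorting the result mirrors the "sorted in ascending order" option used in the class example.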
Sampling
-Sampling Fraction = (n)/N
~n is the sample size
~N is the size of the population
-Sampling with replacement
~Tossing selected members back into the mix after they’ve been selected
~Any given unit can appear more than once in the sample
-Sampling without replacement
~Selected units are removed from possible future reselection
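A short sketch of these three ideas, assuming the class of 79 IDs and a sample of 5 from the earlier example. The variable names and the seed are illustrative only.

```python
import random

N = 79  # population size (class IDs 1..79)
n = 5   # sample size

sampling_fraction = n / N
print(f"sampling fraction = {n}/{N} = {sampling_fraction:.3f}")  # 0.063

rng = random.Random(1)  # arbitrary seed for repeatability
population = list(range(1, N + 1))

# Without replacement: a selected ID is removed from future selection
without_replacement = rng.sample(population, n)

# With replacement: a selected ID goes back into the mix and may repeat
with_replacement = rng.choices(population, k=n)

print(without_replacement)
print(with_replacement)
```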
Other Types of Probability Samples
-(More Advanced Methods)
-Stratified random sample
~Draw random samples from strata (subsets) within the population
~Ex:
*The population can be divided into 5-year age groups (0-4, 5-9,…) with simple random samples of varying sizes drawn from each age-strata
-Cluster sample
~Randomly sample clusters comprising varying numbers of observations
~Ex:
*Households (cluster) are selected at random, and ALL individuals are studied within the clusters
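The two designs can be contrasted with a small sketch. Everything here is hypothetical: the 60-person population, the three age groups, the household numbering, and the helper names `stratified_sample` and `cluster_sample` are invented for illustration.

```python
import random

rng = random.Random(0)  # arbitrary seed for repeatability

# Hypothetical population: (person_id, age_group, household_id)
population = [(i, ("0-4", "5-9", "10-14")[i % 3], i // 4) for i in range(60)]

def stratified_sample(units, stratum_of, sizes):
    """Draw an SRS of sizes[s] units from within each stratum s."""
    sample = []
    for stratum, size in sizes.items():
        members = [u for u in units if stratum_of(u) == stratum]
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(units, cluster_of, n_clusters):
    """Pick clusters at random, then keep ALL units in the chosen clusters."""
    clusters = sorted({cluster_of(u) for u in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [u for u in units if cluster_of(u) in chosen]

# Samples of varying sizes from each age stratum
strat = stratified_sample(population, lambda u: u[1],
                          {"0-4": 3, "5-9": 2, "10-14": 4})
# Three households chosen at random; everyone in them is studied
clus = cluster_sample(population, lambda u: u[2], 3)
print(len(strat), len(clus))
```

In the stratified design, chance operates inside every stratum; in the cluster design, chance operates only on which clusters are chosen.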
Cautions when Sampling
-Undercoverage
~Groups in the source population are left out or underrepresented in the population list used to select the sample
-Volunteer bias
~Occurs when self-selected participants are atypical of the source population
-Nonresponse bias
~Occurs when a large percentage of selected individuals refuse to participate or cannot be contacted
2.2 Comparative Studies
-Comparative designs study the relationship between an explanatory variable and a response variable
-Explanatory variable
~Synonyms
*Independent variable, factor, predictor, exposure
~Treatment or exposure that explains or predicts changes in the response variable
-Response variable
~Synonyms
*dependent variable, outcome
~Outcome or response being investigated
Experimental vs. non-experimental
-Comparative studies may be experimental or non-experimental
-In EXPERIMENTAL DESIGNS, the investigator assigns the subjects to groups according to the explanatory variable
~Exposed and unexposed groups
-In NONEXPERIMENTAL DESIGNS, the investigator does not assign subjects into groups; individuals are merely classified as “exposed” or “non-exposed”
Figure 2.1
-Experimental and nonexperimental study designs
Example of an Experimental Design
-The Women’s Health Initiative study randomly assigned about half its subjects to a group that received hormone replacement therapy (HRT)
-Subjects were followed for ~5 years to ascertain various health outcomes, including heart attacks, strokes, the occurrence of breast cancer, and so on
Example of a Nonexperimental design
-The Nurses’ Health Study classified individuals according to whether they received HRT
-Subjects were followed for ~5 years to ascertain the occurrence of various health outcomes
Comparison of Experimental and Nonexperimental Designs
-In both the experimental (WHI) study and the nonexperimental (Nurses’ Health) study, the relationship between HRT (explanatory variable) and various health outcomes (response variables) was studied
-In the experimental design, the investigators controlled who was and who was not exposed
-In the nonexperimental design, the study subjects (or their physicians) decided whether or not subjects were exposed
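What "the investigators controlled who was exposed" means in practice can be sketched as random assignment. This is only an illustration with 20 hypothetical subject IDs; the function name `randomize`, the group labels, and the seed are assumptions and do not describe the actual WHI procedure.

```python
import random

def randomize(subject_ids, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)          # chance, not the subject, decides the group
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

groups = randomize(range(1, 21), seed=7)  # 20 hypothetical subjects
print(groups["treatment"])
print(groups["control"])
```

In a nonexperimental design there is no such step: the exposure status is simply recorded as it occurred.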
Jargon
-A subject = an individual participating in the experiment
-A factor = an explanatory variable being studied; experiments may address the effect of multiple factors
-A treatment = a specific set of factors
What can you say about data?
-Ages of people in group A
~21, 42, 5, 11, 30, 50, 28, 27, 24, 52
Stemplots
-You can observe a lot by looking - Yogi Berra
-Start by exploring the data with Exploratory Data Analysis (EDA)
-A popular univariate EDA technique is the stem-and-leaf plot
-The stem of the stemplot is a number-line (axis)
-Each leaf represents a data point
-Ex:
0 l 5
1 l 1
2 l 1 4 7 8
3 l 0
4 l 2
5 l 0 2
Stemplot
-Illustration
-10 ages (data sequenced as an ordered array)
~ 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-Draw the stem to cover the range 5 to 52
0 l
1 l
2 l 1
3 l
4 l
5 l
x 10 <- axis multiplier
-Divide each data point into a stem-value (in this example, the tens place) and leaf-value (the ones-place, in this example)
-Place leaves next to the stem value
-Example of a leaf: 21 (plotted)
-Plot all the data points in rank order
0 l 5
1 l 1
2 l 1478
3 l 0
4 l 2
5 l 02
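The plotting steps above can be sketched as a small function. It assumes whole-number data with the tens place as the stem and the ones place as the leaf (axis multiplier x10); the name `stemplot` is an illustrative choice, and a vertical bar is printed as the stem separator.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = tens place, leaf = ones place."""
    leaves = defaultdict(list)
    for v in sorted(values):          # rank order, so leaves come out sorted
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Walk the whole stem range so empty stems still appear on the axis
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | " + "".join(str(d) for d in leaves[stem]))
    return lines

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
for line in stemplot(ages):
    print(line)
# 0 | 5
# 1 | 1
# 2 | 1478
# 3 | 0
# 4 | 2
# 5 | 02
```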
Interpreting Stemplots
-Shape
-Symmetry (mirror image of itself around its center)
-Modality (number of peaks)
-Kurtosis (width of tails or steepness of the mound)
-Departures (outliers)
Interpreting Stemplots
-Location
-Gravitational center -> mean
-Middle value -> median
Interpreting Stemplots
-Spread
-Range and inter-quartile range
-Standard deviation and variance (chapter 4)
Shape
-“Shape” refers to the pattern when plotted
-Here’s the “skyline silhouette” of our data
    x
    x
    x     x
x x x x x x
0 1 2 3 4 5
-Consider: symmetry, modality, kurtosis
-Do NOT “over-interpret” plots when n is small
Location
-Mean
-“Eye-ball method” -> visualize where the plot would balance
~Around 25 to 35
-Arithmetic method = sum values and divide by n
~mean = 290/10 = 29
Central location
-Median
-Ordered array:
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-The median has depth (n+1) / 2
- n = 10, median’s depth = (10+1) / 2 = 5.5
-Falls between 27 and 28
-When n is even, average the adjacent values
~Median = 27.5
Spread
-Range
-For now, report the range (minimum and maximum values)
-Current data range is “5 to 52”
-The range is the easiest but not the best way to describe spread (better methods described later)
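The location and spread summaries for the ten ages can be checked with a few lines of code. This is a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

mean = sum(ages) / len(ages)        # 290 / 10
ordered = sorted(ages)              # ordered array
depth = (len(ordered) + 1) / 2      # 5.5: between the 5th and 6th values
median = statistics.median(ages)    # averages 27 and 28 because n is even
data_range = (min(ages), max(ages))

print(mean, depth, median, data_range)  # 29.0 5.5 27.5 (5, 52)
```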
Outlier
-An outlier is a striking deviation from the overall pattern or shape of the distribution
0 l 679
1 l 124557
2 l
3 l
4 l
5 l 0
x10
Stemplot
-Second Example
-Data:
~1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
-Stem = ones-place
-Leaves = tenths-place
-Truncate extra digit (e.g., 1.47 -> 1.4)
~DO NOT plot decimal
-Center
~Between 3.4 and 3.7
-Spread
~ 1.4 to 4.4
-Shape
~Mound, no outliers
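The plot for this second example is not shown above, so here is a sketch that builds it from the listed data. It assumes the rules just stated (stem = ones place, leaf = tenths place, extra digit truncated, decimal point not plotted); converting to hundredths with `round` is an implementation choice made here to avoid floating-point surprises.

```python
data = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42]

rows = {}
for v in sorted(data):
    hundredths = round(v * 100)        # 1.47 -> 147
    tenths = hundredths // 10          # truncate the extra digit: 147 -> 14
    stem, leaf = divmod(tenths, 10)    # 14 -> stem 1, leaf 4
    rows.setdefault(stem, []).append(leaf)

lines = [
    f"{stem} | {''.join(map(str, rows.get(stem, [])))}"
    for stem in range(min(rows), max(rows) + 1)
]
for line in lines:
    print(line)
# 1 | 4
# 2 | 03
# 3 | 4779
# 4 | 4
```

The axis multiplier for this plot is x1, and the mound shape with no outliers is visible in the row lengths.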
Third Illustrative Example (n = 26)
-Data
14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38
-Regular stemplot
1 l 4789
2 l 2234466789
3 l 000123445678
x10
Too squished to see the shape
Third Illustration; Split Stem
-Split stem values into two ranges
~The first “1” holds leaves from 0 to 4, and the second “1” holds leaves from 5 to 9
-Split-stem
1 l 4
1 l 789
2 l 22344
2 l 66789
3 l 00012344
3 l 5678
x10
-Negative skew now evident
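Splitting the stem can be sketched as follows, using the data listed for this example. The function name `split_stemplot` is illustrative, and for brevity this version prints only rows that contain at least one leaf.

```python
data = [14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29,
        30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38]

def split_stemplot(values):
    """Stemplot with each stem split into two rows (leaves 0-4 and 5-9)."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf <= 4 else 1   # first row: leaves 0-4; second: 5-9
        rows.setdefault((stem, half), []).append(leaf)
    return [
        f"{stem} | {''.join(map(str, leaves))}"
        for (stem, half), leaves in sorted(rows.items())
    ]

for line in split_stemplot(data):
    print(line)
```

With twice as many rows, the longer rows toward the high values and the thin tail toward the low values become easier to see.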
How many stem-values?
-Start with between 4 and 12 stem-values
-Trial and error
~Try different stem multiplier
~Try splitting stem
~Look for most informative plot
3.3 Body weight (pounds) of students in a class, n = 53
-Data range from 100 to 260 lbs
-x100 axis multiplier -> only two stem-values (1x100 and 2x100)
-x100 axis-multiplier w/ split stem -> only 4 stem values -> might be okay
-x10 axis-multiplier -> see next slide
Fourth Stemplot Example (n = 53)
10 l 0166
11 l 009
12 l 0034578
13 l 00359
14 l 08
15 l 00257
16 l 555
17 l 000255
18 l 000055567
19 l 245
20 l 3
21 l 025
22 l 0
23 l
24 l
25 l
26 l 0
x10
-Shape
~Positive skew, high outlier (260)
-Location
~Median about 165
-Spread
~From 100 to 260
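The median read from this plot can be verified by rebuilding the values from the stems and leaves. This is a sketch; the dictionary `stem_rows` simply transcribes the non-empty rows of the stemplot above.

```python
# Non-empty rows of the stemplot (stem x10, leaf x1)
stem_rows = {
    10: "0166", 11: "009", 12: "0034578", 13: "00359", 14: "08",
    15: "00257", 16: "555", 17: "000255", 18: "000055567",
    19: "245", 20: "3", 21: "025", 22: "0", 26: "0",
}

weights = sorted(
    stem * 10 + int(leaf)
    for stem, leaves in stem_rows.items()
    for leaf in leaves
)

n = len(weights)                 # 53
depth = (n + 1) // 2             # n is odd, so the depth is a whole number: 27
median = weights[depth - 1]      # 27th ordered value
print(n, depth, median, weights[0], weights[-1])  # 53 27 165 100 260
```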
Frequency Table
-Frequency = count
-Relative frequency = proportion or %
-Cumulative frequency = % less than or equal to level
Frequency Table with Class intervals
-When data are sparse, group data into class intervals
-Create 4 to 12 class intervals
-Classes can be uniform or non-uniform
-End-point convention
~First class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
-Tally frequencies
-Calculate relative frequency
-Calculate cumulative frequency
Class Intervals
-Uniform class intervals table (width 10) for data
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Class    Freq    Relative Freq (%)    Cumulative Freq (%)
0-9      1       10                   10
10-19    1       10                   20
20-29    4       40                   60
30-39    1       10                   70
40-49    1       10                   80
50-59    2       20                   100
Total    10      100
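The tally, relative frequency, and cumulative frequency steps can be sketched in code for the same ten ages. The variable names are illustrative; classes are uniform with width 10 and include the lower end-point but not the upper one.

```python
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
width = 10
n = len(ages)

# Tally frequencies by the lower end-point of each class
counts = {}
for a in ages:
    lower = (a // width) * width
    counts[lower] = counts.get(lower, 0) + 1

table = []
cumulative = 0.0
for lower in range(0, 60, width):
    freq = counts.get(lower, 0)
    rel = 100 * freq / n           # relative frequency (%)
    cumulative += rel              # cumulative frequency (%)
    table.append((f"{lower}-{lower + width - 1}", freq, rel, cumulative))

for row in table:
    print(row)
```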
Histogram
-A histogram is a frequency chart for a quantitative measurement
~The bars will touch
Bar Chart
-A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals
Summary Statistics
-Central location
~Mean
~Median
~Mode
-Spread
~Range and interquartile range (IQR)
~Variance and standard deviation
-Shape Summaries
~Seldom used in practice
Notation
-n = sample size
-x = the variable (e.g., ages of subjects)
-xi = the value of individual i for variable x
-Σ = sum all values (capital sigma)
-Illustrative data (ages of participants)
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
n = 10
x = Age variable
x1 = 5, x2 = 11, …, x10 = 52
Σxi = x1 + x2 + … + x10 = 5 + 11 + … + 52 = 290
4.1 Central Location
-Sample Mean
-“Arithmetic average”
-Traditional measure of central location
-Sum the values and divide by n
-“xbar” refers to the sample mean
-Equation (see pages 77-79 in the textbook)
~xbar = (1/n) Σxi = (1/n)(x1 + x2 + … + xn)
Example
-Sample Mean
-Ten individuals selected at random have the following ages
21, 42, 5, 11, 30, 50, 28, 27, 24, 52
*Note that n = 10, Σxi = 21 + 42 + … + 52 = 290, and xbar = (1/10)(290) = 29.0
Uses of the Sample Mean
-The sample mean can be used to predict
~The value of an observation drawn at random from the sample
~The population mean
Population Mean
-μ = (1/N) Σxi, summing over all N members of the population
-Same operation as the sample mean except based on the entire population (N = population size)
-Conceptually important
-Usually not available in practice
-Sometimes referred to as the expected value
4.2 Central Location
-Median
-The median is the value with a depth of (n + 1) / 2
-When n is even, average the two values that straddle a depth of (n + 1) / 2
-For the 10 values listed below, the median has depth (10 + 1) / 2 = 5.5, placing it between 27 and 28
~Average these two values to get the median = 27.5
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
M = 27.5
More Examples of Medians
-Ex A:
~2, 4, 6
*M = 4
-Ex B:
~2, 4, 6, 8
*M = 5
-Ex C:
~6, 2, 4
*M ≠ 2; after ordering (2, 4, 6), M = 4
**(Values MUST be ORDERED first)
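The depth rule and the three examples can be checked with a short function. This is a sketch of the (n + 1) / 2 rule as stated on these cards; the name `median_by_depth` is an illustrative choice.

```python
def median_by_depth(values):
    """Median via the depth rule: the value at depth (n + 1) / 2."""
    ordered = sorted(values)             # values MUST be ordered first
    n = len(ordered)
    depth = (n + 1) / 2
    if depth.is_integer():               # n odd: the depth lands on one value
        return ordered[int(depth) - 1]
    lower = ordered[int(depth) - 1]      # n even: the two values that
    upper = ordered[int(depth)]          # straddle the depth are averaged
    return (lower + upper) / 2

print(median_by_depth([2, 4, 6]))        # 4
print(median_by_depth([2, 4, 6, 8]))     # 5.0
print(median_by_depth([6, 2, 4]))        # 4
```

Because the function sorts its input, Ex C gives 4 even though the middle value of the unsorted list is 2.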