Module 1-8 Flashcards
Biostatistics
-Statistics is not merely a compilation of computational techniques
-Statistics
~Is a way of learning from data
~Is concerned with all elements of study design, data collection, and analysis of numerical data
~Does require judgment
-Biostatistics is statistics applied to biological and health problems
Biostatisticians are
-Data Detectives
~Who uncover patterns and clues
~This involves exploratory data analysis (EDA) and descriptive statistics
-Data Judges
~Who judge and confirm clues
~This involves statistical inference
Measurement
-Measurement (defined)
~The assigning of numbers and codes according to prior-set rules
-There are three broad types of measurements
~Nominal
~Ordinal
~Quantitative
Nominal Measurements
-Classify observations into named categories
~No order
~Typically two categories (binary, e.g., yes/no); there can be more, but the categories cannot be ordered
-Ex:
~HIV status (positive or negative)
~Sex (Male or Female)
~Hair color (red, brown, black, blonde, gray, etc.)
Ordinal Measurement
-Categories that can be put in rank order
~Opinion
*More than two categories that have to be in order
-Ex:
~ Stage of cancer classified as Stage I, Stage II, Stage III, Stage IV
~Opinion classified as strongly agree (5), agree (4), neutral (3), disagree (2), strongly disagree (1)
~Age groups 0-4, 5-9, 10-14, etc.
Quantitative Measurements
-True numerical values that can be put on a number line
-Numerical values with equal spacing between numerical values
-Ex:
~Age (years)
~Serum cholesterol (mg/dL)
~T4 cell count (per dL)
Illustrative Example:
-Weight Change and Heart Disease
-This study sought to determine the effect of weight change on coronary heart disease risk
-It studied 115,818 women, 30-55 years of age, free of CHD over 14 years
-Measurements included the following variables
~Nominal (including Binary)
*CHD onset (yes or no)
*Family history of CHD (yes or no)
~Ordinal
*Non-smoker, light-smoker, moderate smoker, heavy smoker
~Quantitative
*BMI (kg/m^2)
*Age (years)
*Weight presently
*Weight at age 18
Observation, Variable, and Value
-Observation
~The unit upon which measurements are made and can be an individual or aggregate
-Variable
~The generic thing we measure
*Age of person
*HIV status of a person
-Value
~A realized measurement
*“27”
*“positive”
Data Table
-Each row corresponds to an observation
-Each column contains information on a variable
-Each cell in the table contains a value
-Units of observation in these data are individual regions, not individual people
~Table 1.2 in the textbook
Measurement Inaccuracies
-Imprecision
~The inability to get the same result upon repetition
-Bias
~A tendency to overestimate or underestimate the true value of an object
Types of Studies
-Surveys
~Describe population characteristics
*A study of the prevalence of hypertension in a population
-Comparative Studies
~Determine relationships between variables
*A study to address whether weight gain causes hypertension
2.1 Surveys
-Goal
~To describe population characteristics
-Studies a subset (sample) of the population
~Census vs. Sample
-Uses sample to make inferences about population
-Sampling
~Saves time
~Saves money
~Allows resources to be devoted to greater scope and accuracy
Illustrative Example
-Youth Risk Behavior Surveillance (YRBS)
-YRBS monitors health behaviors in youth and young adults in the US. Six categories of health-risk behaviors are monitored. These include:
~Behavior that contributes to unintentional injuries and violence
~Tobacco use
~Alcohol and drug use
~Sexual behaviors
~Unhealthy dietary behavior
~Physical activity levels and body weight
-Ex:
~Several million public and private school students in the US in 2003
~Sampling
*15,240 questionnaires completed at 158 schools
Types of Samples
-Probability Sample
-Simple random sample (SRS)
-Stratified random sample
-Cluster sample
Types of Samples
-Non-probability sample
-Convenience sample
Sampling
-Probability samples
~Use chance mechanisms to select individuals
-Most basic type of probability sample is the simple random sample (SRS)
-SRS
~Each population member has the same probability of being selected into the sample
~Selection of any individual into the sample does not influence the likelihood of selecting any other individuals
Simple Random Sampling Method
- Identify each population member with a number 1,2,…N
-Use a random number generator to generate n random numbers between 1 and N
~Ex:
*http://www.random.org/integer-sets/
-Keep in mind
~The objective of an SRS is that every possible subset is equally likely!
Let’s Select 5 random IDs from our class ID (1-79)
-Step one
-Generate 1 set with 5 unique random integers in each
-Each integer should have a value between 1 and 79 (both inclusive; limits +/- 1,000,000,000)
-The total number of integers must be no greater than 10,000
Class ID Selection
-15, 31, 40, 48, and 76
-Random Integer Set Generator
~One requested 1 set with 5 unique random integers, taken from the [1,79] range. The integers were sorted in ascending order
*Here is the set
**Set 1: 15, 31, 40, 48, 76
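The same kind of draw can be made locally instead of on random.org. The following is a minimal sketch in Python, assuming class IDs run from 1 to 79 as in the example; the function name `draw_srs` and the seed value are illustrative choices, not part of the course material.

```python
import random

def draw_srs(population_size, sample_size, seed=None):
    """Draw a simple random sample of IDs 1..population_size without replacement."""
    # The seed is optional; fixing it only makes the draw repeatable.
    rng = random.Random(seed)
    ids = range(1, population_size + 1)
    # random.sample gives every subset of the requested size the same chance.
    return sorted(rng.sample(ids, sample_size))

sample_ids = draw_srs(79, 5, seed=1)
print(sample_ids)
```

Each run with a different seed (or no seed) gives a different set of five IDs, which is the point of an SRS.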
Sampling
-Sampling Fraction = (n)/N
~n is the sample size
~N is the size of the population
-Sampling with replacement
~Tossing selected members back into the mix after they’ve been selected
~Any given unit can appear more than once in the sample
-Sampling without replacement
~Selected units are removed from possible future reselection
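As a rough illustration of these three ideas, the sketch below computes a sampling fraction and draws one sample each way. The population of IDs 1 to 79, the sample size of 5, and the seed are assumptions borrowed from the class example for demonstration only.

```python
import random

def sampling_fraction(n, N):
    """Sampling fraction = sample size n divided by population size N."""
    return n / N

rng = random.Random(2)            # arbitrary seed, for repeatability only
population = list(range(1, 80))   # IDs 1..79

# Without replacement: a selected ID cannot be selected again.
without_replacement = rng.sample(population, 5)
# With replacement: a selected ID goes back into the mix and may repeat.
with_replacement = rng.choices(population, k=5)

print(sampling_fraction(5, 79))
print(without_replacement)
print(with_replacement)
```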
Other Types of Probability Samples
-(More Advanced Methods)
-Stratified random sample
~Draw random samples from within strata (subsets) of the population
~Ex:
*The population can be divided into 5-year age groups (0-4, 5-9,…) with simple random samples of varying sizes drawn from each age-strata
-Cluster sample
~Randomly sample clusters comprising varying numbers of observations
~Ex:
*Households (cluster) are selected at random, and ALL individuals are studied within the clusters
Cautions when Sampling
-Undercoverage
~Groups in the source population are left out or underrepresented in the population list used to select the sample
-Volunteer bias
~Occurs when self-selected participants are atypical of the source population
-Nonresponse bias
~Occurs when a large percentage of selected individuals refuse to participate or cannot be contacted
2.2 Comparative Studies
-Comparative designs study the relationship between an explanatory variable and a response variable
-Explanatory variable
~Synonyms
*Independent variable, factor, predictor, exposure
~Treatment or exposure that explains or predicts changes in the response variable
-Response variable
~Synonyms
*dependent variable, outcome
~Outcome or response being investigated
Experimental vs. non-experimental
-Comparative studies may be experimental or non-experimental
-In EXPERIMENTAL DESIGNS, the investigator assigns the subjects to groups according to the explanatory variable
~Exposed and unexposed groups
-In NONEXPERIMENTAL DESIGNS, the investigator does not assign subjects into groups; individuals are merely classified as “exposed” or “non-exposed”
Figure 2.1
-Experimental and non-experimental study design
Example of an Experimental Design
-The Women’s Health Initiative study randomly assigned about half its subjects to a group that received hormone replacement therapy (HRT)
-Subjects were followed for ~5 years to ascertain various health outcomes, including heart attacks, stroke, the occurrence of breast cancer, and so on
Example of a Nonexperimental design
-The Nurses’ Health Study classified individuals according to whether they received HRT
-Subjects were followed for ~5 years to ascertain the occurrence of various health outcomes
Comparison of Experimental and Nonexperimental Designs
-In both the experimental (WHI) study and nonexperimental (Nurses’ Health) study, the relationship between HRT (explanatory variable) and various health outcomes (response variables) was studied
-In the experimental design, the investigators controlled who was and who was not exposed
-In the nonexperimental design, the study subjects (or their physicians) decided whether or not subjects were exposed
Jargon
-A subject = an individual participating in the experiment
-A factor = an explanatory variable being studied; experiments may address the effect of multiple factors
-A treatment = a specific set of factors
What can you say about data?
-Ages of people in group A
~21, 42, 5, 11, 30, 50, 28, 27, 24, 52
Stemplots
-You can observe a lot by looking - Yogi Berra
-Starting by exploring the data with Exploratory Data Analysis (EDA)
-A popular univariate EDA technique is the stem-and-leaf plot
-The stem of the stemplot is a number-line (axis)
-Each leaf represents a data point
-Ex:
0 l 5
1 l 1
2 l 1 4 7 8
3 l 0
4 l 2
5 l 0 2
Stemplot
-Illustration
-10 ages (data sequenced as an ordered array)
~ 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-Draw the stem to cover the range 5 to 52
0 l
1 l
2 l 1
3 l
4 l
5 l
x 10 <- axis multiplier
-Divide each data point into a stem-value (in this example, the tens place) and leaf-value (the ones-place, in this example)
-Place leaves next to the stem value
-Example of a leaf: 21 (plotted)
-Plot all the data points in rank order
0 l 5
1 l 1
2 l 1478
3 l 0
4 l 2
5 l 02
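The construction steps above can be sketched in a few lines of Python. This is an illustrative helper (the name `stemplot` is made up here), and it assumes non-negative integer data with the tens place as the stem and the ones place as the leaf, i.e., an axis multiplier of x10.

```python
def stemplot(values):
    """Build a stem-and-leaf plot for non-negative integers (stem = tens, leaf = ones)."""
    ordered = sorted(values)
    lines = []
    # Include every stem from the smallest to the largest so gaps stay visible.
    for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1):
        leaves = "".join(str(v % 10) for v in ordered if v // 10 == stem)
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
for line in stemplot(ages):
    print(line)
```

Because the data are sorted first, the leaves on each stem come out in rank order automatically.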
Interpreting Stem plots
-Shape
-Symmetry (mirror image of itself around its center)
-Modality (number of peaks)
- Kurtosis (width of tails or steepness of the mound)
-Departures (outliers)
Interpreting Stem Plots
-Location
-Gravitational center -> mean
-Middle value -> median
Interpreting Stemplots
-Spread
-Range and inter-quartile range
-Standard deviation and variance (chapter 4)
Shape
-“Shape” refers to the pattern when plotted
-Here’s the “skyline silhouette” of our data
x
x
x x
x x x x x x
0 1 2 3 4 5
- Consider: symmetry, modality, kurtosis
-Do NOT “over-interpret” plots when n is small
Location
-Mean
-“Eye-ball method” -> visualize where the plot would balance
~Around 25 to 35
-Arithmetic method = sum values and divide by n
~mean = 290/10 = 29
Central location
-Median
-Ordered array:
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-The median has depth (n+1) / 2
- n = 10, median’s depth = (10+1) / 2 = 5.5
-Falls between 27 and 28
-When n is even, average the adjacent values
~Median = 27.5
Spread
-Range
-For now, report the range (minimum and maximum values)
-Current data range is “5 to 52”
-The range is the easiest but not the best way to describe spread (better methods described later)
Outlier
-An outlier is a striking deviation from the overall pattern or shape of the distribution
0 l 679
1 l 124557
2 l
3 l
4 l
5 l 0
x10
Stemplot
-Second Example
-Data:
~1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
-Stem = ones-place
-Leaves = tenths-place
-Truncate extra digit (ex., 1.47 -> 1.4)
~DO NOT plot decimal
-Center
~Between 3.4 and 3.7
-Spread
~ 1.4 to 4.4
-Shape
~Mound, no outliers
Third Illustrative Example (n = 26)
-Data
14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38
-Regular stemplot
1 l 4789
2 l 2234466789
3 l 000123445678
x10
Too squished to see the shape
Third Illustration; Split Stem
-Split stem values into two ranges
~First “1” holds leaves from 0 to 4, and second “1” holds leaves from 5 to 9
-Split-stem
1 l 4
1 l 789
2 l 22344
2 l 66789
3 l 00012344
3 l 5678
x10
-negative skew now evident
How many stem-values?
-Start with between 4 and 12 stem-values
-Trial and error
~Try different stem multiplier
~Try splitting stem
~Look for most informative plot
3.3 Body weight (pounds) of students in a class, n = 53
-Data range from 100 to 260 lbs
-x100 axis multiplier -> only two stem-values (1x100 and 2x100)
-x100 axis-multiplier w/ split stem -> only 4 stem values -> might be okay
-x10 axis-multiplier -> see next slide
Fourth Stemplot Example (n = 53)
10 l 0166
11 l 009
12 l 0034578
13 l 00359
14 l 08
15 l 00257
16 l 555
17 l 000255
18 l 000055567
19 l 245
20 l 3
21 l 025
22 l 0
23 l
24 l
25 l
26 l 0
x10
-Shape
~Positive skew, high outlier (260)
-Location
~Median about 165
-Spread
~From 100 to 260
Frequency Table
-Frequency = count
-Relative frequency = proportion or %
-Cumulative frequency = % less than or equal to level
Frequency Table with Class intervals
-When data are sparse, group data into class intervals
-Create 4 to 12 class intervals
-Classes can be uniform or non-uniform
-End-point convention
~First class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
-Tally frequencies
-Calculate relative frequency
-Calculate cumulative frequency
Class Intervals
-Uniform class intervals table (width 10) for data
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Class    Freq   Relative Freq (%)   Cumulative Freq (%)
0-9      1      10                  10
10-19    1      10                  20
20-29    4      40                  60
30-39    1      10                  70
40-49    1      10                  80
50-59    2      20                  100
Total    10     100
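The tallying steps (frequency, relative frequency, cumulative frequency) can be sketched as follows. The function name `frequency_table` is hypothetical, and the sketch assumes whole-number data and uniform class intervals that include the lower end point and exclude the upper one.

```python
def frequency_table(values, width=10):
    """Tally values into uniform class intervals [low, low + width)."""
    n = len(values)
    first = min(values) // width
    last = max(values) // width
    rows, cumulative = [], 0
    for k in range(first, last + 1):
        low = k * width
        # End-point convention: include the lower limit, exclude the upper limit.
        count = sum(1 for v in values if low <= v < low + width)
        cumulative += count
        label = f"{low}-{low + width - 1}"   # label assumes whole-number data
        rows.append((label, count, 100 * count / n, 100 * cumulative / n))
    return rows

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
for row in frequency_table(ages):
    print(row)
```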
Histogram
-A histogram is a frequency chart for a quantitative measurement
~The bars will touch
Bar Chart
-A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals
Summary Statistics
-Central location
~Mean
~Median
~Mode
-Spread
~Range and interquartile range (IQR)
~Variance and standard deviation
-Shape Summaries
~Seldom used in practice
Notation
-n = sample size
-x = the variable (ex. ages of subjects)
-xi = the value of individual i for variable X
-Σ = sum all values (capital sigma)
-Illustrative data (ages of participants)
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
n = 10
x = Age variable
x1 = 5, x2 = 11, …, x10 = 52
Σxi = x1 + x2 + … + x10 = 5 + 11 + … + 52 = 290
4.1 Central Location
-Sample Mean
-“Arithmetic average”
-Traditional measure of central location
-Sum the values and divide by n
-“xbar” refers to the sample mean
-Formula (textbook pages 77-79): xbar = (1/n) Σxi = (x1 + x2 + … + xn) / n
Example
-Sample Mean
-Ten individuals selected at random have the following ages
21, 42, 5, 11, 30, 50, 28, 27, 24, 52
*Note that n = 10, Σxi = 21 + 42 + … + 52 = 290, and xbar = (1/10)(290) = 29.0
Uses of the Sample Mean
-The sample mean can be used to predict:
~The value of an observation drawn at random from the sample
~The population mean
Population Mean
-mu = (1/N) Σxi
-Same operation as the sample mean except based on the entire population (N = population size)
-Conceptually important
-Usually not available in practice
-Sometimes referred to as the expected value
4.2 Central Location
-Median
-The median is the value with a depth of (n + 1) / 2
-When n is even, average the two values that straddle a depth of (n + 1) / 2
-For the 10 values listed below, the median has depth (10 + 1) / 2 = 5.5, placing it between 27 and 28
~Average these two values to get the median = 27.5
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
M = 27.5
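The depth rule translates directly into code. Below is a small sketch (function names are illustrative) that computes the sample mean and finds the median by its depth, averaging the two straddling values when the depth is not a whole number.

```python
import math

def sample_mean(values):
    """Sum the values and divide by n."""
    return sum(values) / len(values)

def median_by_depth(values):
    """Median = value at depth (n + 1) / 2 in the ordered array."""
    ordered = sorted(values)              # values MUST be ordered first
    depth = (len(ordered) + 1) / 2
    lower = ordered[math.floor(depth) - 1]   # depths are 1-based, lists are 0-based
    upper = ordered[math.ceil(depth) - 1]
    # When the depth is a whole number, lower and upper are the same value.
    return (lower + upper) / 2

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(sample_mean(ages), median_by_depth(ages))
```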
More Examples of Medians
-Ex A:
~2, 4, 6
*M = 4
-Ex B:
~2, 4, 6, 8
*M = 5
-Ex C:
~6, 2, 4
*M = 4, not 2
**(Values MUST be ORDERED first)
The Median is Robust
-The median is more resistant to skews and outliers than the mean; it is more robust
-This data set has a mean of 1600
1362, 1439, 1460, 1614, 1666, 1792, 1867
-Here’s the same data set with a data entry error “outlier”
~This data set has a mean of 2743
1362, 1439, 1460, 1614, 1666, 1792, 9867
-The median is 1614 in both instances, demonstrating its robustness in the face of outliers
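The comparison is easy to recompute. This short sketch uses Python's standard `statistics` module on the seven metabolic rates, with and without the data entry error; the variable names are illustrative.

```python
from statistics import mean, median

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = rates[:-1] + [9867]   # last value mistyped as 9867

# The mean shifts a lot when one value is wrong; the median does not move.
print(mean(rates), median(rates))
print(mean(with_error), median(with_error))
```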
4.3 Mode
-The mode is the most commonly encountered value in the dataset
-This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
-This data set has no mode
{4, 6, 7, 8} (each point appears only once)
-The mode is useful only in large data sets with repeating values
4.7 Spread
-Standard Deviation
-Most common descriptive measure of spread
-Based on deviations around the mean
-This figure demonstrates the deviations of two of its values
Variance and Standard Deviation
-Deviation = xi - xbar
~Sum of squared deviations = SS = Σ(xi - xbar)^2
~Sample variance = s^2 = SS / (n - 1)
~Sample standard deviation = s = sqrt(s^2)
Standard deviation (formula)
s = sqrt( Σ(xi - xbar)^2 / (n - 1) )
-Sample standard deviation s is the estimator of population standard deviation sigma
~See “Facts About the Standard Deviation” page 93
Illustrative Example
-Standard Deviation (p. 92)
Observation   Deviation       Squared deviation
36            36 - 36 = 0     0^2 = 0
38            38 - 36 = 2     2^2 = 4
39            39 - 36 = 3     3^2 = 9
40            40 - 36 = 4     4^2 = 16
36            36 - 36 = 0     0^2 = 0
34            34 - 36 = -2    (-2)^2 = 4
33            33 - 36 = -3    (-3)^2 = 9
32            32 - 36 = -4    (-4)^2 = 16
SUMS ->       0*              SS = 58
*SUM of deviations always equals zero
-Sample variance (s^2)
~s^2 = SS / (n - 1) = 58 / (8 - 1) = 8.286
-Standard deviation (s)
~s = sqrt(8.286) = 2.88
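The same arithmetic can be checked with a short script. This is a sketch using the eight observations from the table; the function names are made up for illustration.

```python
import math

def sum_of_squares(values):
    """SS = sum of squared deviations around the sample mean."""
    xbar = sum(values) / len(values)
    return sum((x - xbar) ** 2 for x in values)

def sample_variance(values):
    """s^2 = SS / (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)

observations = [36, 38, 39, 40, 36, 34, 33, 32]
ss = sum_of_squares(observations)
s2 = sample_variance(observations)
s = math.sqrt(s2)   # the standard deviation is the square root of the variance
print(ss, round(s2, 3), round(s, 2))
```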
Interpretation of Standard Deviation
-Measures spread (ex. if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., variability)
4.5 Spread
-Quartiles
-Two distributions can be quite different yet can have the same mean
-These data compare particulate matter in air samples (µg/m^3) at two sites
~Both sites have a mean of 36, but Site 1 exhibits much greater variability
*We would miss the high pollution days if we relied solely on the mean
Site 1 l l Site 2
42 l 2 l
8 l 2 l
2 l 3 l 234
86 l 3 l 6689
2 l 4 l 0
l 5 l
l 5 l
l 6 l
8 l 6 l
x10
Spread
-Range
-Range = maximum - minimum
-Illustrative example
~Site 1 range = 68 - 22 = 46
~Site 2 range = 40 - 32 = 8
-Beware:
~The sample range will tend to underestimate the population range
-Always supplement the range with at least one additional measure of spread
Spread
-Quartiles
-Quartile 1 (Q1)
~Cuts off bottom quarter of data = median of the lower half of the data set
-Quartile 3 (Q3)
~Cuts off top quarter of data = median of the upper half of the data set
-Interquartile Range (IQR) = Q3-Q1 covers the middle 50% of the distribution
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Q1 = 21, Q3 = 42, and IQR = 42-21 = 21
Quartiles (Tukey’s Hinges)
-Example 2 Data are metabolic rates (cal/day), n = 7
1362, 1439, 1460, 1614, 1666, 1792, 1867
Median = 1614
-When n is odd, include the median in both halves of the data set
-Bottom half:
~ 1362, 1439, 1460, 1614 which has a median of 1449.5 (Q1)
-Top half
~1614, 1666, 1792, 1867 which has a median of 1729 (Q3)
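A possible sketch of this hinge rule in Python is shown below. The helper names are hypothetical; the logic follows the notes: split the ordered data in half, include the median in both halves when n is odd, and take the median of each half.

```python
def _median(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def tukey_hinges(values):
    """Return (Q1, median, Q3) using Tukey's hinges."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2   # when n is odd, the median lands in both halves
    lower, upper = ordered[:half], ordered[n - half:]
    return _median(lower), _median(ordered), _median(upper)

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
print(tukey_hinges(rates))
```

Statistical packages use several slightly different quartile rules, so library functions may not match these hinges exactly on small data sets.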
Five-Point Summary
-Q0 (the minimum)
-Q1 (25th percentile)
-Q2 (median)
-Q3 (75th percentile)
-Q4 (the maximum)
4.6 Boxplots
-Calculate 5-point summary
~Draw box from Q1 to Q3 with line at median
-Calculate IQR and fences as follows
~Fence lower = Q1-1.5(IQR)
~Fence upper = Q3 + 1.5(IQR)
*DO NOT DRAW FENCES
-Determine if any values lie outside the fences (outside values)
~If so, plot these separately
-Determine values inside the fences (inside values)
~Draw whisker from Q3 to upper inside value
~Draw whisker from Q1 to lower inside value
Illustrative Example
-Boxplot
Data: 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-5 pt summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line @ 27.5
-IQR = 42 - 21 = 21
~Fu = Q3 + 1.5 (21) = 73.5
~Fl = Q1 - 1.5 (21) = -10.5
-No values lie above the upper fence or below the lower fence
-Upper inside value = 52
-Lower inside value = 5
-Draw whiskers
Illustrative Example
-Boxplot 2
-5 pt summary
~3, 22, 25.5, 29, 51: draw a box
-IQR = 29 - 22 = 7
~Fu = Q3 + 1.5 (7) = 39.5
~Fl = Q1 - 1.5 (7) = 11.5
-One above the top fence (51) and one below the bottom fence (3)
-Upper inside value is 31
-Lower inside value is 21
-Draw whiskers
Illustrative Example
-Boxplot 3
-Seven metabolic rates
1362, 1439, 1460, 1614, 1666, 1792, 1867
-5 pt summary
~1362, 1449.5, 1614, 1729, 1867
-IQR = 1729 - 1449.5 = 279.5
~Fu = Q3 + 1.5 (279.5) = 2148.25
~Fl = Q1 - 1.5 (279.5) = 1030.25
-None outside
-Whiskers end @ 1867 and 1362
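The fence and whisker rules used in these three examples can be sketched as one small function. The name `boxplot_parts` and the dictionary layout are illustrative assumptions; Q1 and Q3 are passed in so that any quartile rule can be used.

```python
def boxplot_parts(values, q1, q3):
    """Compute IQR, fences, outside values, and whisker ends for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),   # used for classification, not drawn
        "outside": sorted(outside),             # plotted as separate points
        "whiskers": (min(inside), max(inside)), # lower and upper inside values
    }

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
print(boxplot_parts(ages, q1=21, q3=42))
```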
Boxplots
-Interpretation
-Location
~Position of median
~Position of box
-Spread
~Hinge-spread (IQR)
~Whisker-to-whisker spread
-Shape
~Symmetry or direction of skew
~Long whiskers (tails) indicate leptokurtosis
Side-by-side boxplots
-Boxplots are especially useful when comparing groups
Choosing Summary Statistics
-Always report a measure of a central location, a measure of spread, and the sample size
-Symmetrical mound-shaped distributions -> report the mean and standard deviation
-Odd-shaped distributions -> report 5-point summaries (or median and IQR)
Definitions
-Random variable = a numerical quantity that takes on different values depending on chance
-Ex:
~Number of smokers in a simple random sample of size n, the ages of subjects selected at random at UNR
-Sample Space = the set of all possible values from a random variable
-Ex:
~If the subject’s age is a random variable of interest, the set of all possible values for this random variable is???
-Event = an outcome or set of outcomes from random variables
-Probability = the proportion of times an event is expected to occur in the population
-Ex:
~Roll a fair die: the probability that the die lands on “one”
*Ideas about probability are founded on relative frequencies (proportions) in populations
Die Example
-Random Variable
~The number on the face
-Population (Sample Space): (not a population of people)
{1, 2, 3, 4, 5, 6}
-Event: 1
-Probability: 1/6
EX:
~Event: 5 or 6
~Probability: 2/6 or 1/3
Probability Illustrated
-In a given year, there were 42,636 traffic fatalities in a population of N= 293,655,000
-If we randomly select a person from this population, what is the probability that they will experience a traffic fatality by the end of that year?
-ANS
~The relative frequency of that event in the population = 42,636 / 293,655,000 = 0.0001452
*Thus, Pr(traf. fatality) = 0.0001452 (about 1 in 6887)
5.2 Random Variables
-Random variable = a numerical quantity that takes on different values depending on chance
-Two types of random variables
-Discrete random variables
~A countable set of possible outcomes
*X = number of smokers (cannot have half of a person)
-Continuous random variable
~An unbroken continuum of possible outcomes
*Weight in pounds (cannot have 0 due to it not existing)
Discrete Random Variables
-Discrete Random Variables
~A countable set of possible outcomes
*The variable number of leukemia cases in a geographic region in a given period
*The variable number of success in n independent treatments
*The variable number of smokers in a simple random sample of size n
-Continuous random variable
~An unbroken continuum of possible outcomes
*The variable Amount of time it takes to complete a task
*The variable Height of an individual selected at random
5.3 Discrete Random Variables
-Probability mass function (pmf) = a mathematical relation that assigns probabilities to all possible outcomes for discrete random variables
-Illustrative example:
~One rolls a die 2 times
*Let X = the variable number of times one gets six
*This is the pmf for the random variable
X 0 1 2
Pr(X=x) 0.6944 0.2778 0.0278
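One way to see where these three probabilities come from is to list all 36 equally likely outcomes of two rolls and count the sixes in each. The sketch below does this by brute-force enumeration; it is an illustration, not the textbook's derivation.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes contain 0, 1, or 2 sixes.
counts = Counter(roll.count(6) for roll in outcomes)

# pmf: probability = favorable outcomes / total outcomes.
pmf = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in pmf.items():
    print(x, p, round(float(p), 4))
```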
Illustrative example 2
-“Four Patients”
~Suppose one treats four patients with an intervention that is successful 75% of the time
*Let X = the variable number of successes in this experiment
*This is the pmf for this random variable
X 0 1 2 3 4
Pr(X=x) 0.0039 0.0469 0.2109 0.4219 0.3164
Operations on Events
-Intersection
~For two events A and B, the intersection A ∩ B represents the event that both A and B occur
-Union
~For two events A and B, the union A ∪ B represents the event that A or B occurs
*A occurs without B, B occurs without A, or A and B both occur
-Complement
~For an event A, the complement of A represents the event that occurs if A does not occur. It is typically denoted by A-bar
Properties of Probabilities
-Property 1
~Probabilities are always between 0 and 1
-Property 2
~A sample space is all possible outcomes
*The probabilities in the sample space sum to 1 (exactly)
-Property 3
~The complement of an event is “the event not happening”
*The probability of a complement is 1 minus the probability of the event
**Pr(rain tomorrow) = 0.6
**Pr(not rain tomorrow) = 0.4
-Property 4
~Probabilities of disjoint events can be added
*Pr(X = 1) + Pr(X = 2)
**X = number in die
Properties of Probabilities in Symbols
-Property 1.
0 ≤ Pr(A) ≤ 1
-Property 2.
Pr(S) = 1, where S represents the sample space (all possible outcomes)
-Property 3.
Pr (A-bar) = 1- Pr(A), A-bar represents the complement of A (NOT A)
-Property 4.
If A and B are disjoint, then Pr(A or B) = Pr(A) + Pr(B)
Properties 1 and 2 Illustrated
-Figure 5.2
-Property 1.
0 ≤ Pr(A) ≤ 1
~Note that all individual probabilities are between 0 and 1
-Property 2
Pr(S) = 1
~Note that the sum of all probabilities = .0039 + .0469 + .2109 + .4219 + .3164 = 1
Property 3 Illustrated
-Property 3
Pr (A-bar) = 1- Pr(A)
~As an example, let A represent 4 successes
Pr (A) = .3164
-Let A-bar represent the complement of A (“NOT A”), which is “3 or fewer”
Pr(A-bar) = 1 - Pr(A) = 1 - 0.3164 = 0.6836
Property 4 Illustrated
-Property 4
Pr(A or B) = Pr (A) + Pr(B) for disjoint events
~Let A represent 4 successes
~Let B represent 3 successes
-Since A and B are disjoint, Pr(A or B) = Pr(A) + Pr(B) = 0.3164 + 0.4219 = 0.7383
-The probability of observing 3 or 4 successes is 0.7383 or about 74%
Area Under the Curve (AUC)
-The area under the curve (AUC) on a pmf corresponds to the probability
-Pr (X = 2)
~area of shaded region = height x base
*.2109(1.0) = .2109
Cumulative Probability
-“Cumulative probability” refers to the probability of the value or less
-Notation
Pr(X ≤ x)
-Corresponds to AUC to the left of the point (“Left tail”)
-Ex:
Pr(X ≤ 2)
~Shaded “tail”
0.0039 + 0.0469 + 0.2109 = 0.2617
Mean and Variance of a Discrete Random Variable
-Definitional formula for mean or expectation (p.111)
mu = Σ x * Pr(X = x)
-Definitional formula for variance (p.111)
sigma^2 = Σ (x - mu)^2 * Pr(X = x)
Expected Mean
X 0 1 2 3 4
Pr(X=x) 0.0039 0.0469 0.2109 0.4219 0.3164
How to calculate the expected mean?
mu = 0*0.0039 + 1*0.0469 + 2*0.2109 + 3*0.4219 + 4*0.3164 = 3
Variance
X 0 1 2 3 4
Pr(X=x) 0.0039 0.0469 0.2109 0.4219 0.3164
How to calculate the variance?
sigma^2 = (0-3)^2 * 0.0039 + (1-3)^2 * 0.0469 + (2-3)^2 * 0.2109 + (3-3)^2 * 0.4219 + (4-3)^2 * 0.3164 = 0.75
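Both calculations are weighted sums over the pmf, which a few lines of Python can reproduce. The sketch below uses the rounded “Four patients” probabilities from the table, so the results match 3 and 0.75 only up to rounding.

```python
# pmf for the "Four patients" example: value -> probability (rounded to 4 places)
pmf = {0: 0.0039, 1: 0.0469, 2: 0.2109, 3: 0.4219, 4: 0.3164}

# Expected value: each value weighted by its probability.
mu = sum(x * p for x, p in pmf.items())

# Variance: squared distance from the mean, weighted by probability.
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())

print(round(mu, 2), round(variance, 2))
```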
5.4 Continuous Random Variables
-Continuous random variables form a continuum of possible values
-As an illustration, consider the spinner
-The spinner will generate a continuum of random numbers between 0 to 1
-A probability density function (pdf) is a mathematical relation that assigns probabilities to all possible outcomes for a continuous random variable
-The pdf for our random spinner is shown here
-The shaded area under the curve represents probability, in this instance
Pr(0 < X < 0.5) = 0.5
~Base x height = (0.5 - 0) * 1 = 0.5
Pr(0.25 < X < 0.5) = 0.25
~Base x height = (0.5 - 0.25) * 1 = 0.25
Pr(X > 0.7) = 0.3
~Base x height = (1 - 0.7) * 1 = 0.3
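For the spinner, every probability is just the area of a rectangle under a flat curve of height 1. The sketch below captures that; the function name `uniform_probability` and its default range of 0 to 1 are assumptions made for this illustration.

```python
def uniform_probability(a, b, low=0.0, high=1.0):
    """Pr(a < X < b) for X uniform on [low, high]: base times height."""
    height = 1.0 / (high - low)                    # flat pdf; total area must equal 1
    base = max(0.0, min(b, high) - max(a, low))    # part of (a, b) inside the range
    return base * height

print(uniform_probability(0, 0.5))
print(uniform_probability(0.25, 0.5))
print(uniform_probability(0.7, 1.0))
```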
Examples of pdfs
-pdfs obey all the rules of probabilities
-pdfs come in many forms (shapes)
~Uniform pdf
~Normal pdf
~Chi-square pdf
~Exercise 5.13 pdf
*The most common pdf is the normal (We study the Normal pdf in detail in the next chapter)
Area Under the Curve
-As was the case with pmfs, pdfs display probability with the area under the curve (AUC)
-This histogram shades bars corresponding to ages < 9 (~40% of the histogram's area)
-This shaded AUC on the Normal pdf curve also corresponds to ~40% of total
X = age
X ~ Normal
Pr (X < 9) = 0.4
6.1 Binomial Random Variables
-Binomial = a family of discrete random variables
-Binomial Random Variable = the random number of successes in n independent Bernoulli trials (a Bernoulli trial has two possible outcomes: “success” or “failure”)
-Binomial random variables have two parameters
~n = number of trials
~p = probability of success of each trial
Binomial Example
-Consider the random number of successful treatments when treating four patients
-Suppose the probability of success in each instance is 75%
-The random number of successes can vary from 0 to 4
-The random number of successes is a binomial with parameters n = 4 and p = 0.75
-Notation
~Let X ~b(n,p) represent a binomial random variable with parameters n and p
*The illustration variable is X ~ b(4, 0.75)
6.2 Calculating Binomial Probabilities
-Formula for binomial probabilities
Pr(X = x) = nCx p^x q^(n-x)
-Where
~nCx = the binomial coefficient (next slide)
~p = probability of success for each trial
~q = probability of failure = 1-p
Binomial Coefficient
-Formula for the binomial coefficient
nCx = (n!) / (x! (n-x)!)
-Where ! represents the factorial function, calculated as
-x! = x * (x-1) * (x-2) * … * 1
-Ex:
~4! = 4 * 3 * 2 * 1 = 24
-By definition 1! = 1 and 0! = 1
-Ex:
4C2 = (4!) / ((2!)(4-2)!) = (4!) / ((2!)(2!)) = (4 * 3 * 2 * 1) / ((2 * 1)(2 * 1)) = 6
Binomial Coefficient Cont.
nCx = (n!) / (x! (n-x)!)
-The binomial coefficient is called the “choose function” because it tells you the number of ways you could choose x items out of n
nCx = the number of ways to choose x items out of n
-Ex:
4C2 = 6 means there are six ways to choose two items out of four
Binomial Calculation
-Example
-Recall the “Four patients example”
-Four patients; probability of success of each treatment = 0.75
-The number of successes is the binomial random variable X ~b(4, 0.75)
-Note q = 1 - 0.75 = 0.25
-What is the probability of observing 0 successes under these circumstances?
Pr(X = 0) = nCx p^x q^(n-x)
= 4C0 * 0.75^0 * 0.25^(4-0)
= (4!) / (0! * 4!) * 0.75^0 * 0.25^4
= 1 * 1 * 0.0039
= 0.0039
Pr(X = 1) = 4C1 * 0.75^1 * 0.25^(4-1)
= 4 * 0.75 * 0.0156
= 0.0469
Pr(X = 2) = 4C2 * 0.75^2 * 0.25^(4-2)
= 6 * 0.5625 * 0.0625
= 0.2109
Pr(X = 3) = 4C3 * 0.75^3 * 0.25^(4-3)
= 4 * 0.4219 * 0.25
= 0.4219
Pr(X = 4) = 4C4 * 0.75^4 * 0.25^(4-4)
= 1 * 0.3164 * 1
= 0.3164
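The hand calculations above can be verified with a short function. This sketch uses `math.comb` from the Python standard library for the binomial coefficient; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# "Four patients" example: n = 4 trials, p = 0.75 chance of success each.
for x in range(5):
    print(x, round(binomial_pmf(x, 4, 0.75), 4))
```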
Area Under the Curve
-Recall the area under the curve (AUC) concept
AUC = probability
6.3 Cumulative Probability
-Recall the cumulative probability concept
-Cumulative probability = the probability of that value or less
-Pr(X ≤ x)
-Correspond to left tail of pmf
Cumulative Probability Function
-Cumulative probability function lists cumulative probabilities for all possible outcomes
-Ex:
~The cumulative probability function for X ~b(4, 0.75)
Pr(X ≤ 0) = 0.0039
Pr(X ≤ 1) = 0.0508
Pr(X ≤ 2) = 0.2617
Pr(X ≤ 3) = 0.6836
Pr(X ≤ 4) = 1.0000
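Cumulative probabilities are running totals of the pmf. A minimal sketch, assuming the same b(4, 0.75) example, builds the pmf and accumulates it with `itertools.accumulate`.

```python
from itertools import accumulate
from math import comb

n, p = 4, 0.75

# pmf: Pr(X = x) for x = 0, 1, ..., n
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

# Cumulative probability: Pr(X <= x) is the running sum of the pmf.
cdf = list(accumulate(pmf))

for x, c in enumerate(cdf):
    print(f"Pr(X <= {x}) = {c:.4f}")
```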
6.5 Expected Value and Variance for Binomials
-The expected value (mean) mu of a binomial pmf is its “balance point”
-The variance sigma^2 is its spread
-Shortcut formula
mu = np
sigma^2 = npq
Expected Value and Variance, Binomials, illustration
-For the “Four patients” pmf of X~b(4, 0.75)
mu = n*p = 4(0.75) = 3
sigma^2 = n(p)(q) = 4(0.75)(0.25) = 0.75
6.6 Using the Binomial
-Suppose we observe 2 successes in the “Four patients” example
-Note mu = 3, suggesting we should see 3 successes on average
-Does the observation of 2 successes cast doubt on p = 0.75?
-No, because Pr(X ≤ 2) = 0.2617 is not too unusual
Normal Distributions
-Normal random variables are the most common type of continuous random variable
-More importantly, they describe the behavior of means
Normal Probability Density Function
-Recall that continuous random variables are described with smooth probability density functions (pdfs) - see chapter 5
-Normal pdfs are recognized by their familiar bell-shape
Figure 7.1
-Histogram with overlying Normal Curve
~The overlying curve represents its Normal pdf model
Area Under the Curve
-The darker bars of the histogram in Figure 7.2 correspond to ages less than or equal to 9 (~40% of observations)
-This darker area under the curve (see Figure 7.3) also corresponds to ages less than 9 (~40% of the total area)
Figure 7.2
-Proportion less than 9 shaded darker color
Figure 7.3
-Proportion less than 9 (area under the curve)
~This shaded area is the probability associated with the range 0-9 years old
f(x) = (1 / (sigma * sq root (2 pi))) * e^(-(1/2) * ((x - mu) / sigma)^2)
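To connect the density formula with the "area = probability" idea, the curve can be evaluated numerically and its area summed. This is a rough sketch, assuming a simple midpoint Riemann sum is accurate enough; `normal_pdf` and `area_under_curve` are illustrative names, not functions from the text.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the Normal density curve at x."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

def area_under_curve(a, b, mu, sigma, steps=10_000):
    """Approximate the area between a and b with a midpoint Riemann sum."""
    width = (b - a) / steps
    heights = (normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

# Standard Normal curve (mu = 0, sigma = 1)
total_area = area_under_curve(-8, 8, 0, 1)        # essentially the whole curve
within_one_sigma = area_under_curve(-1, 1, 0, 1)  # from mu - sigma to mu + sigma

print(round(total_area, 4), round(within_one_sigma, 4))
```

The total area comes out at about 1, as it must for a probability model, and roughly 68% of the area lies within one sigma of mu.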
Parameters mu and sigma
-Normal pdfs are a family of distributions
-family members identified by parameters
mu (mean) and
sigma (standard deviation)
-mu controls location (see Figure 7.4)
-sigma controls spread (see Figure 7.5)
Standard Deviation sigma
-Points of inflection (where the slope of the curve begins to level) occur one sigma below and one sigma above mu
Normal Distribution
-Normal distribution is often written as N(mu, sigma^2) to indicate that the density curve depends upon the parameters mu and sigma^2, which are the mean and variance of the random variable
~mu corresponds to the middle of the curve
~sigma^2 determines the spread of the curve
-The standard Normal Distribution is a normal distribution with mu = 0, sigma = 1
Standard Normal Distribution
-A standard normal random variable is generally denoted as Z
~The area between a and b under the standard normal density curve provides the probability that Z will assume a value over the interval (a,b): P(a<Z<b)
Example
-Let Z be a standard normal random variable
~Find the following probabilities using Table B
a) P(Z < 1.96) = 0.9750
b) P(-2.00 < Z < 2.00) = P(Z < 2.00) - P(Z < -2.00) = 0.9772 - 0.0228 = 0.9544
c) P(Z > -1.28) = 1 - P(Z < -1.28) = 1 - 0.1003 = 0.8997
d) P(-5.13 < Z < 2.00) = P(Z < 2.00) - P(Z < -5.13) = 0.9772 - 0 = 0.9772
e) P (Z = 1.71) = 0
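The Table B lookups in this example can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+) in place of the printed table; results may differ from the table in the fourth decimal place because table entries are rounded before they are subtracted.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard Normal variable

prob_a = Z.cdf(1.96)                 # a) P(Z < 1.96)
prob_b = Z.cdf(2.00) - Z.cdf(-2.00)  # b) P(-2.00 < Z < 2.00)
prob_c = 1 - Z.cdf(-1.28)            # c) P(Z > -1.28)
prob_d = Z.cdf(2.00) - Z.cdf(-5.13)  # d) P(-5.13 < Z < 2.00)

for label, prob in zip("abcd", (prob_a, prob_b, prob_c, prob_d)):
    print(f"{label}) {prob:.4f}")
```

Part (e) needs no calculation: a single point has no width, so the area above it is 0 for any continuous random variable.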
7.2 Determining Normal Probabilities
-To determine a Normal Probability
~State the problem
~Standardize the value (z score)
~Sketch and shade the curve
~Use Table B to determine the probability
Standard Normal (Z) Variable
-Standard Normal Variable = a Normal random variable with mu = 0 and sigma = 1
-Called “z variables”
-Notation
Z ~N(0,1)
-Use Table B to look up cumulative probabilities
Figure 7.11
-Portion of Table B highlighting P (Z < 1.96) = 0.9750
Example: Normal Probability
-Step 1: Statement of Problem
-We want to determine the percentage of human gestations that are less than 40 weeks in length
-We know that uncomplicated human pregnancy from conception to birth is approximately Normally distributed with mu = 39 weeks and sigma = 2 weeks
*Note: clinicians measure gestation from the last menstrual period to birth, which adds 2 weeks to mu
-Let X represent human gestation in weeks
X ~ N(39, 2) (here the second value is sigma, not sigma^2)
-Statement of the problem
Pr(X < 40) =
Normal Probability
-Step 2: Standardize
-To standardize, subtract mu and divide by sigma
Z = (x-mu) / sigma
-The z-score tells how many sigma-units the value falls above or below mu
-Ex:
~The value 40 from X~N(39,2) has
Z= (40-39) / 2 = 0.5
Pr(Z < 0.5) = 0.6915
Normal Probability
-Steps 3 and 4: Sketch and use Table B
-Sketch and label axes
-Use Table B to look up
Pr(Z < 0.5) = 0.6915
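The four steps for the gestation example can be sketched in code, with `statistics.NormalDist` standing in for Table B. The variable names are illustrative; mu = 39 and sigma = 2 are the values given in the example.

```python
from statistics import NormalDist

mu, sigma = 39, 2  # gestation in weeks: mean and standard deviation
x = 40             # value of interest

# Step 2: standardize
z = (x - mu) / sigma

# Step 4: cumulative probability, via the z score and directly on the X scale
p_via_z = NormalDist().cdf(z)
p_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z}, Pr(X < 40) = {p_via_z:.4f}")
```

Standardizing first and working on the original scale give the same probability; the z score is only needed because the printed table covers the standard Normal curve.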
Probabilities Between Two Points
-Let a represent the lower boundary and b represent the upper boundary of a range
Pr(a < Z < b) = Pr (Z < b) - Pr (Z < a)
7.3 Looking up the z percentile value
-Use Table B to look up the z-percentile value
~Ex:
*The z score for the probability in question
-Look inside the table for the entry closest to the associated cumulative probability
-Then trace the z score to the row and column labels
Looking up the (Z) percentile value
-Suppose one wanted the 97.5th percentile z score
~Look inside the table for 0.975
*Then trace the z-score to the margins
-Notation
~Let Zp represent the z-score with cumulative probability p
~EX:
Z.975 = 1.96
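Looking "inside the table" for a probability and tracing back to the z score is the inverse of the cumulative lookup. As a sketch, `NormalDist.inv_cdf` in the Python standard library performs the same reverse lookup; the name `z_percentile` is illustrative.

```python
from statistics import NormalDist

def z_percentile(p):
    """z score whose cumulative probability is p (the reverse Table B lookup)."""
    return NormalDist().inv_cdf(p)

z_975 = z_percentile(0.975)
z_025 = z_percentile(0.025)

print(round(z_975, 2), round(z_025, 2))
```

Because the standard Normal curve is symmetric about 0, the 2.5th percentile is the negative of the 97.5th percentile.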
8.1 Concepts
-Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty
~We are curious about parameters in the population
~We calculate statistics in the sample
Parameters and Statistics
-It is essential to draw the distinction between parameters and statistics
                      Parameters      Statistics
Source                Population      Sample
Calculated?           No              Yes
Constant?             Yes             No
Notation (examples)   mu, sigma, p    x-bar, s, p̂
8.2 Sampling Behavior of a mean
-How precisely does a given sample mean reflect the underlying population mean?
Sampling
-Age
~Population: 65 students in our CHS 280 class
-Which sample mean reflects the underlying population mean more precisely?
-If the sample size is 3
~Sampling distribution of the sample mean
N= 65
mu = mean age of the population
X-bar1 = (18 + 18 + 19) / 3 = 18.33
X-bar2 = (19 + 20 + 21) / 3 = 20
-If the sample size is 50
~Sampling distribution of the sample mean
N = 65
X-bar1 = (…+…+…+…) / 50 =
Standard deviation of population and sampling distribution
-Population (Individual observation)
Sigma
-Sampling Distribution of x-bar
Sigma / (sq root n)
Standard deviation of sampling distribution of x-bar
-Standard error of the mean
sigma_x-bar = SE_x-bar = sigma / (sq root n)
*The square root law says the SE of the mean is inversely proportional to the square root of the sample size
Example
-The Wechsler Adult Intelligence Scale has sigma = 15
-For n = 1 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 1 = 15
-For n = 4 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 4 = 7.5
-For n = 16 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 16 = 3.75
*Quadrupling the sample size cuts the SE in half (square root law)
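The square root law is easy to tabulate. A minimal sketch, with `standard_error` as an illustrative helper name and sigma = 15 taken from the example:

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / sqrt(n)

sigma = 15  # Wechsler Adult Intelligence Scale
se_by_n = {n: standard_error(sigma, n) for n in (1, 4, 16)}

for n, se in se_by_n.items():
    print(f"n = {n:>2}: SE = {se}")
```

Each fourfold increase in n halves the standard error, so gains in precision become progressively more expensive in sample size.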
Figure 8.2
-Sampling distribution of the mean based on n = 10 compared with the distribution of population values, Wechsler Adult Intelligence Scale scores
Central Limit Theorem
-Sampling distribution of x-bar tends toward Normality even when the population distribution is not Normal
~This effect is strong in large samples
Law of Large Numbers
-As the sample size gets larger and larger, the sample mean tends to get closer and closer to mu
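Both ideas can be illustrated with a small simulation. This sketch assumes a deliberately non-Normal population (exponential, with mu = 1 and sigma = 1) chosen only for illustration; the seed, the number of repetitions, and the function name are arbitrary choices, not part of the course material.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(280)  # arbitrary seed so the run is repeatable

def simulate_sample_means(n, reps=5000):
    """Means of `reps` random samples of size n from a right-skewed
    population (exponential with mu = 1 and sigma = 1)."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(reps)]

means_n2 = simulate_sample_means(2)
means_n30 = simulate_sample_means(30)

# Spread of the sample means shrinks roughly as sigma / sqrt(n)
sd_n2, sd_n30 = stdev(means_n2), stdev(means_n30)
se_theory_n30 = 1 / sqrt(30)

# If x-bar is roughly Normal, about 68% of means fall within 1 SE of mu
share_within_1se = mean(abs(m - 1) <= se_theory_n30 for m in means_n30)

print(round(mean(means_n30), 3), round(sd_n2, 3), round(sd_n30, 3))
print(round(se_theory_n30, 3), round(share_within_1se, 3))
```

With n = 30 the sample means cluster tightly around mu and behave approximately like a Normal variable, even though individual values from this population are strongly skewed.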
Adult Intelligence Scale Example
-Wechsler Adult Intelligence Scale (WAIS) scores vary according to a Normal distribution with mu = 100 and sigma = 15
a) What can we say about the sampling distribution of a mean based on an SRS of 10 such scores?
mu = 100 sigma = 15
SE_x-bar = sigma / (sq root n) = 15 / (sq root 10) = 4.7434
X-bar~N (100, 4.74)
b) What is the probability of getting an x-bar less than 90?
Pr(X-bar < 90) = ?
Z = (X-bar - mu) / SE_x-bar = (90 - 100) / 4.74 = -2.11
Pr(X-bar < 90) = Pr(Z < -2.11) = 0.0174
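The WAIS calculation can be sketched in code, with `statistics.NormalDist` standing in for Table B. The software keeps the unrounded z score, so its answer can differ slightly in the fourth decimal from the table value, which is based on z rounded to two places.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 10

se = sigma / sqrt(n)  # standard error of the mean
z = (90 - mu) / se    # standardize x-bar = 90

# Sampling distribution of x-bar is Normal with mean mu and SD equal to se
p_below_90 = NormalDist(mu, se).cdf(90)

print(round(se, 4), round(z, 2), round(p_below_90, 4))
```

A sample mean below 90 is unusual for n = 10, even though an individual score below 90 is quite common.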
8.3 Sampling Behavior of Counts and Proportions
-Binomial Random Variable
~Random number of successes (X) in n independent “success/ failure” trials
~Probability of success for each trial is p
-Notation X~b(n,p)
~When n is large (npq >= 5), we can use the Normal approximation to the binomial
Normal Approximation for a Binomial Count
-mu = np and sigma = sq root (npq)
-When Normal approximation applies
X ~ N(np, sq root (npq))
Normal Approximation for a Binomial Proportion
-mu = p and sigma = sq root ((pq) / n)
p̂~N(p, sq root ((pq) / n))
Example
Recent statistics claim the prevalence of maternal smoking is quite low, at only 5%
~Suppose another research group sampled 107 pregnant mothers in their third trimester
n = 107
p = 0.05
q = 0.95
A. Can we assume an approximation to the normal distribution for this case?
npq = 107(0.05)(0.95) = 5.0825, which is >= 5, so yes, the Normal approximation can be used
B. Using a Normal approximation, calculate the probability that at least 12 of the 107 mothers smoked during their pregnancy
mu = np = 107(0.05) = 5.35
sigma = sq root (npq) = sq root (107 * 0.05 * 0.95) = 2.2544
x = 12
X ~ N(5.35, 2.2544)
z = (12 - 5.35) / 2.2544 = 2.95
Pr(X >= 12) = Pr(Z >= 2.95) = 1 - 0.9984 = 0.0016
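As a sketch, the Normal approximation for this example can be computed in code and set beside the exact binomial tail it approximates. The exact calculation is an addition for comparison only and is not part of the original example; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 107, 0.05
q = 1 - p

mu = n * p
sigma = sqrt(n * p * q)

# Normal approximation (no continuity correction, matching the worked example)
z = (12 - mu) / sigma
p_normal = 1 - NormalDist().cdf(z)

# Exact binomial tail Pr(X >= 12), added here only for comparison
p_exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(12, n + 1))

print(round(z, 2), round(p_normal, 4), round(p_exact, 4))
```

Because npq is only just above 5 and p is small, the binomial here is noticeably right-skewed, so the approximation should be read as rough in the upper tail.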