Module 1-8 Flashcards

1
Q

Biostatistics

A

-Statistics is not merely a compilation of computational techniques
-Statistics
~Is a way of learning from data
~Is concerned with all elements of study design, data collection, and analysis of numerical data
~Does require judgment
-Biostatistics is statistics applied to biological and health problems

2
Q

Biostatisticians are

A

-Data Detectives
~Who uncover patterns and clues
~This involves exploratory data analysis (EDA) and descriptive statistics
-Data Judges
~Who judge and confirm clues
~This involves statistical inference

3
Q

Measurement

A

-Measurement (defined)
~The assigning of numbers and codes according to prior-set rules
-There are three broad types of measurements
~Nominal
~Ordinal
~Quantitative

4
Q

Nominal Measurements

A

-Classify observations into named categories
~No order
~Often two categories (binary, e.g., yes/no); there can be more, but the categories cannot be ordered
-Ex:
~HIV status (positive or negative)
~Sex (Male or Female)
~Hair color (red, brown, black, blonde, gray, etc.)

5
Q

Ordinal Measurement

A

-Categories that can be put in rank order
~Opinion
*More than two categories that have to be in order
-Ex:
~ Stage of cancer classified as Stage I, Stage II, Stage III, Stage IV
~Opinion classified as strongly agree (5), agree (4), neutral (3), disagree (2), strongly disagree (1)
~Age groups 0-4, 5-9, 10-14, etc.

6
Q

Quantitative Measurements

A

-True numerical values that can be put on a number line
-Numerical values with equal spacing between numerical values
-Ex:
~Age (years)
~Serum cholesterol (mg/dL)
~T4 cell count (per dL)

7
Q

Illustrative Example:
-Weight Change and Heart Disease

A

-This study sought to determine the effect of weight change on coronary heart disease risk
-It followed 115,818 women, 30-55 years of age and free of CHD at the start, for 14 years
-Measurements included the following variables
~Nominal (including Binary)
*CHD onset (yes or no)
*Family history of CHD (yes or no)
~Ordinal
*Non-smoker, light-smoker, moderate smoker, heavy smoker
~Quantitative
*BMI (kg/m^2)
*Age (years)
*Weight presently
*Weight at age 18

8
Q

Observation, Variable, and Value

A

-Observation
~The unit upon which measurements are made and can be an individual or aggregate
-Variable
~The generic thing we measure
*Age of person
*HIV status of a person
-Value
~A realized measurement
*“27”
*“positive”

9
Q

Data Table

A

-Each row corresponds to an observation
-Each column contains information on a variable
-Each cell in the table contains a value
-Units of observation in these data are individual regions, not individual people
~Table 1.2 in the textbook

10
Q

Measurement Inaccuracies

A

-Imprecision
~The inability to get the same result upon repetition
-Bias
~A tendency to overestimate or underestimate the true value of an object

11
Q

Biostatisticians are

A

-Data Detectives
~Who uncover patterns and clues
~This involves exploratory data analysis (EDA) and descriptive statistics
-Data Judges
~Who judge and confirm clues
~This involves statistical inference

12
Q

Types of Studies

A

-Surveys
~Describe population characteristics
*A study of the prevalence of hypertension in a population
-Comparative Studies
~Determine relationships between variables
*A study to address whether weight gain causes hypertension

13
Q

2.1 Surveys

A

-Goal
~To describe population characteristics
-Studies a subset (sample) of the population
~Census vs. Sample
-Uses sample to make inferences about population
-Sampling
~Saves time
~Saves money
~Allows resources to be devoted to greater scope and accuracy

14
Q

Illustrative Example
-Youth Risk Behavior Surveillance (YRBS)

A

-YRBS monitors health behaviors in youth and young adults in the US. Six categories of health-risk behaviors are monitored. These include:
~Behavior that contributes to unintentional injuries and violence
~Tobacco use
~Alcohol and drug use
~Sexual behaviors
~Unhealthy dietary behavior
~Physical activity levels and body weight
-Ex:
~Several million public and private school students in the US in 2003
~Sampling
*15,240 questionnaires completed at 158 schools

15
Q

Types of Samples
-Probability Sample

A

-Simple random sample (SRS)
-Stratified random sample
-Cluster sample

16
Q

Types of Samples
-Non-probability sample

A

-Convenience sample

17
Q

Sampling

A

-Probability samples
~Use chance mechanisms to select individuals
-Most basic type of probability sample is the simple random sample (SRS)
-SRS
~Each population member has the same probability of being selected into the sample
~Selection of any individual into the sample does not influence the likelihood of selecting any other individuals

18
Q

Simple Random Sampling Method

A

-Identify each population member with a number 1, 2, …, N
-Use a random number generator to generate n random numbers between 1 and N
~Ex:
*http://www.random.org/integer-sets/
-Keep in mind
~The objective of an SRS is that every possible subset of size n is equally likely!

19
Q

Let’s select 5 random IDs from our class IDs (1-79)
-Step one

A

-Generate 1 set with 5 unique random integers in each
-Each integer should have a value between 1 and 79 (both inclusive; limits +/- 1,000,000,000)
-The total number of integers must be no greater than 10,000

20
Q

Class ID Selection

A

-15, 31, 40, 48, and 76
-Random Integer Set Generator
~One requested 1 set with 5 unique random integers, taken from the [1,79] range. The integers were sorted in ascending order
*Here is the set
**Set 1: 15, 31, 40, 48, 76
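
The same kind of draw can be sketched in Python. This is a minimal illustration, assuming the standard-library `random` module stands in for the random.org generator; the function name `simple_random_sample` and the seed value are hypothetical, and the IDs it returns will differ from the set shown above.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw an SRS of IDs from 1..population_size without replacement."""
    rng = random.Random(seed)  # the seed is optional; it only makes the draw repeatable
    ids = range(1, population_size + 1)
    # random.sample picks unique IDs; every subset of this size is equally likely
    return sorted(rng.sample(ids, sample_size))

class_sample = simple_random_sample(79, 5, seed=1)
print(class_sample)
```

Sorting mirrors the ascending order reported by the generator above.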

21
Q

Sampling

A

-Sampling Fraction = (n)/N
~n is the sample size
~N is the size of the population
-Sampling with replacement
~Tossing selected members back into the mix after they’ve been selected
~Any given unit can appear more than once in the sample
-Sampling without replacement
~Selected units are removed from possible future reselection
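
A short Python sketch of these ideas, assuming the class of N = 79 and n = 5 from the earlier cards; the variable names and the seed are illustrative only.

```python
import random

N = 79  # population size
n = 5   # sample size
sampling_fraction = n / N

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
population = list(range(1, N + 1))

# Without replacement: a selected unit cannot be selected again
without_replacement = rng.sample(population, n)

# With replacement: a unit goes back into the mix and may appear more than once
with_replacement = rng.choices(population, k=n)

print(round(sampling_fraction, 4))
print(without_replacement)
print(with_replacement)
```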

22
Q

Other Types of Probability Samples
-(More Advanced Methods)

A

-Stratified random sample
~Draw random samples from strata (subsets) within the population
~Ex:
*The population can be divided into 5-year age groups (0-4, 5-9,…) with simple random samples of varying sizes drawn from each age-strata
-Cluster sample
~Randomly sample clusters comprising varying numbers of observations
~Ex:
*Households (cluster) are selected at random, and ALL individuals are studied within the clusters

23
Q

Cautions when Sampling

A

-Undercoverage
~Groups in the source population are left out or underrepresented in the population list used to select the sample
-Volunteer bias
~Occurs when self-selected participants are atypical of the source population
-Nonresponse bias
~Occurs when a large percentage of selected individuals refuse to participate or cannot be contacted

24
Q

2.2 Comparative Studies

A

-Comparative designs study the relationship between an explanatory variable and a response variable
-Explanatory variable
~Synonyms
*Independent variable, factor, predictor, exposure
~Treatment or exposure that explains or predicts changes in the response variable
-Response variable
~Synonyms
*dependent variable, outcome
~Outcome or response being investigated

25
Experimental vs. non-experimental
-Comparative studies may be experimental or non-experimental
-In EXPERIMENTAL DESIGNS, the investigator assigns the subjects to groups according to the explanatory variable
~Exposed and unexposed groups
-In NONEXPERIMENTAL DESIGNS, the investigator does not assign subjects to groups; individuals are merely classified as "exposed" or "non-exposed"
26
Figure 2.1
-Experimental and non-experimental study design
27
Example of an Experimental Design
-The Women's Health Initiative study randomly assigned about half its subjects to a group that received hormone replacement therapy (HRT)
-Subjects were followed for ~5 years to ascertain various health outcomes, including heart attacks, stroke, the occurrence of breast cancer, and so on
28
Example of a Nonexperimental design
-The Nurses' Health Study classified individuals according to whether they received HRT
-Subjects were followed for ~5 years to ascertain the occurrence of various health outcomes
29
Comparison of Experimental and Nonexperimental Designs
-In both the experimental (WHI) study and the nonexperimental (Nurses' Health) study, the relationship between HRT (explanatory variable) and various health outcomes (response variables) was studied
-In the experimental design, the investigators controlled who was and who was not exposed
-In the nonexperimental design, the study subjects (or their physicians) decided whether or not subjects were exposed
30
Jargon
-A subject = an individual participating in the experiment
-A factor = an explanatory variable being studied; experiments may address the effect of multiple factors
-A treatment = a specific set of factors
31
What can you say about data?
-Ages of people in group A
~21, 42, 5, 11, 30, 50, 28, 27, 24, 52
32
Stemplots
-"You can observe a lot by looking" - Yogi Berra
-Start by exploring the data with Exploratory Data Analysis (EDA)
-A popular univariate EDA technique is the stem-and-leaf plot (stemplot)
-The stem of the stemplot is a number line (axis)
-Each leaf represents a data point
-Ex:
0 | 5
1 | 1
2 | 1 4 7 8
3 | 0
4 | 2
5 | 0 2
33
Stemplot -Illustration
-10 ages (data sequenced as an ordered array)
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-Draw the stem to cover the range 5 to 52
0 |
1 |
2 | 1
3 |
4 |
5 |
x10 <- axis multiplier
-Divide each data point into a stem-value (the tens place, in this example) and a leaf-value (the ones place, in this example)
-Place leaves next to the stem value
-Example of a leaf: 21 (plotted)
-Plot all the data points in rank order
0 | 5
1 | 1
2 | 1478
3 | 0
4 | 2
5 | 02
x10
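The plotting steps can be sketched in Python. This is a minimal illustration, assuming non-negative integer data with the tens place as the stem and the ones place as the leaf (an x10 axis multiplier); the function name `stemplot` is hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative integers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for v in sorted(values):  # rank order, so leaves come out sorted
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Empty stems are kept so gaps and outliers stay visible
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}")
    return rows

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
for row in stemplot(ages):
    print(row)
```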
34
Interpreting Stemplots -Shape
-Symmetry (mirror image of itself around its center)
-Modality (number of peaks)
-Kurtosis (width of tails or steepness of the mound)
-Departures (outliers)
35
Interpreting Stemplots -Location
-Gravitational center -> mean
-Middle value -> median
36
Interpreting Stemplots -Spread
-Range and inter-quartile range
-Standard deviation and variance (chapter 4)
37
Shape
-"Shape" refers to the pattern when plotted
-Here's the "skyline silhouette" of our data
    x
    x
    x     x
x x x x x x
0 1 2 3 4 5
-Consider: symmetry, modality, kurtosis
-Do NOT "over-interpret" plots when n is small
38
Location -Mean
-"Eye-ball method" -> visualize where the plot would balance
~Around 25 to 35
-Arithmetic method = sum values and divide by n
~mean = 290/10 = 29
39
Central location -Median
-Ordered array: 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-The median has depth (n+1) / 2
-n = 10, median's depth = (10+1) / 2 = 5.5
-Falls between 27 and 28
-When n is even, average the adjacent values
~Median = 27.5
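The arithmetic mean and the depth rule for the median can be checked with a small Python sketch. The function names are illustrative; the data are the ten ages used on these cards.

```python
def sample_mean(values):
    """Sum the values and divide by n."""
    return sum(values) / len(values)

def median_by_depth(values):
    """Median located at depth (n + 1) / 2 in the ordered array."""
    ordered = sorted(values)  # values MUST be ordered first
    depth = (len(ordered) + 1) / 2
    if depth.is_integer():
        return ordered[int(depth) - 1]  # depth is 1-based, list index is 0-based
    # n is even: average the two values that straddle the depth
    lower = ordered[int(depth) - 1]
    upper = ordered[int(depth)]
    return (lower + upper) / 2

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(sample_mean(ages), median_by_depth(ages))
```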
40
Spread -Range
-For now, report the range (minimum and maximum values)
-Current data range is "5 to 52"
-The range is the easiest but not the best way to describe spread (better methods described later)
41
Outlier
-An outlier is a striking deviation from the overall pattern or shape of the distribution
0 | 679
1 | 124557
2 |
3 |
4 |
5 | 0
x10
42
Stemplot -Second Example
-Data:
~1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
-Stem = ones-place
-Leaves = tenths-place
-Truncate extra digit (ex., 1.47 -> 1.4)
~DO NOT plot decimal
-Center
~Between 3.4 and 3.7
-Spread
~1.4 to 4.4
-Shape
~Mound, no outliers
43
Third Illustrative Example (n = 25)
-Data
~14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38
-Regular stemplot
1 | 4789
2 | 2234466789
3 | 000123445678
x10
-Too squished to see the shape
44
Third Illustration; Split Stem
-Split stem values into two ranges
~First "1" holds leaves 0 to 4, and second "1" holds leaves 5 to 9
-Split-stem
1 | 4
1 | 789
2 | 2234
2 | 66789
3 | 00012344
3 | 5678
x10
-Negative skew now evident
45
How many stem-values?
-Start with between 4 and 12 stem-values
-Trial and error
~Try different stem multiplier
~Try splitting stem
~Look for most informative plot
46
3.3 Body weight (pounds) of students in a class, n = 53
-Data range from 100 to 260 lbs
-x100 axis multiplier -> only two stem-values (1x100 and 2x100)
-x100 axis-multiplier w/ split stem -> only 4 stem values -> might be okay
-x10 axis-multiplier -> see next slide
47
Fourth Stemplot Example (n = 53)
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
x10
-Shape
~Positive skew, high outlier (260)
-Location
~Median about 165
-Spread
~From 100 to 260
48
Frequency Table
-Frequency = count
-Relative frequency = proportion or %
-Cumulative frequency = % less than or equal to level
49
Frequency Table with Class intervals
-When data are sparse, group data into class intervals
-Create 4 to 12 class intervals
-Classes can be uniform or non-uniform
-End-point convention
~First class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
-Tally frequencies
-Calculate relative frequency
-Calculate cumulative frequency
50
Class Intervals
-Uniform class intervals table (width 10) for data
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Class    Freq   Relative Freq (%)   Cumulative Freq (%)
0-9      1      10                  10
10-19    1      10                  20
20-29    4      40                  60
30-39    1      10                  70
40-49    1      10                  80
50-59    2      20                  100
Total    10     100
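The tally can be reproduced with a short Python sketch. It assumes uniform class intervals that include the lower end point and exclude the upper one, as in the end-point convention on the previous card; the function name `frequency_table` is hypothetical.

```python
def frequency_table(values, width, start=0):
    """Rows of (lower, upper, freq, relative %, cumulative %) for uniform classes."""
    n = len(values)
    rows = []
    cumulative = 0
    lower = start
    while lower <= max(values):
        upper = lower + width
        # Include the lower end point, exclude the upper end point
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq
        rows.append((lower, upper, freq, 100 * freq / n, 100 * cumulative / n))
        lower = upper
    return rows

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
for row in frequency_table(ages, width=10):
    print(row)
```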
51
Histogram
-A histogram is a frequency chart for a quantitative measurement
~The bars will touch
52
Bar Chart
-A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals
53
Summary Statistics
-Central location
~Mean
~Median
~Mode
-Spread
~Range and interquartile range (IQR)
~Variance and standard deviation
-Shape Summaries
~Seldom used in practice
54
Notation
-n = sample size
-X = the variable (ex. ages of subjects)
-x_i = the value of individual i for variable X
-Σ = sum all values (capital sigma)
-Illustrative data (ages of participants)
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
~n = 10
~X = Age variable
~x_1 = 5, x_2 = 11, ..., x_10 = 52
~Σx_i = x_1 + x_2 + ... + x_10 = 5 + 11 + ... + 52 = 290
55
4.1 Central Location -Sample Mean
-"Arithmetic average"
-Traditional measure of central location
-Sum the values and divide by n
-"xbar" ($\bar{x}$) refers to the sample mean (pages 77-79 in the textbook have the equation to use)
~$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
56
Example -Sample Mean
-Ten individuals selected at random have the following ages
~21, 42, 5, 11, 30, 50, 28, 27, 24, 52
*Note that n = 10, Σx_i = 21 + 42 + ... + 52 = 290, and xbar = (1/10)(290) = 29.0
57
Uses of the Sample Mean
-The sample mean can be used to predict:
~The value of an observation drawn at random from the sample
~The population mean
58
Population Mean
-Same operation as the sample mean except based on the entire population (N = population size)
-Conceptually important
-Usually not available in practice
-Sometimes referred to as the expected value
59
4.2 Central Location -Median
-The median is the value with a depth of (n + 1) / 2
-When n is even, average the two values that straddle a depth of (n + 1) / 2
-For the 10 values listed below, the median has depth (10 + 1) / 2 = 5.5, placing it between 27 and 28
~Average these two values to get the median = 27.5
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
~M = 27.5
60
More Examples of Medians
-Ex A:
~2, 4, 6
*M = 4
-Ex B:
~2, 4, 6, 8
*M = 5
-Ex C:
~6, 2, 4
*M does not = 2
**(Values MUST be ORDERED first)
61
The Median is Robust
-The median is more resistant to skews and outliers than the mean; it is more robust
-This data set has a mean of 1600
~1362, 1439, 1460, 1614, 1666, 1792, 1867
-Here's the same data set with a data entry error "outlier"
~This data set has a mean of 2743
~1362, 1439, 1460, 1614, 1666, 1792, 9867
-The median is 1614 in both instances, demonstrating its robustness in the face of outliers
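A quick way to see this robustness is to compute both statistics with and without the data entry error. This is a minimal sketch using Python's standard-library `statistics` module; the variable names are illustrative.

```python
import statistics

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = clean[:-1] + [9867]  # last value mistyped as 9867

mean_clean = statistics.mean(clean)
mean_error = statistics.mean(with_error)
median_clean = statistics.median(clean)
median_error = statistics.median(with_error)

# The mean shifts with the outlier; the median does not
print(mean_clean, round(mean_error))
print(median_clean, median_error)
```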
62
4.3 Mode
-The mode is the most commonly encountered value in the dataset
-This data set has a mode of 7
~{4, 7, 7, 7, 8, 8, 9}
-This data set has no mode
~{4, 6, 7, 8} (each point appears only once)
-The mode is useful only in large data sets with repeating values
63
4.7 Spread -Standard Deviation
-Most common descriptive measure of spread
-Based on deviations around the mean
-This figure demonstrates the deviations of two of its values
64
Variance and Standard Deviation
-Deviation = $x_i - \bar{x}$
~Sum of squared deviations = SS = $\sum (x_i - \bar{x})^2$
~Sample variance = $s^2 = \frac{SS}{n-1}$
~Sample standard deviation = $s = \sqrt{s^2}$
65
Standard deviation (formula)
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
-Sample standard deviation s is the estimator of population standard deviation
~See "Facts About the Standard Deviation" page 93
66
Illustrative Example -Standard Deviation (p. 92)
Observation   Deviation      Squared deviation
36            36-36 = 0      0^2 = 0
38            38-36 = 2      2^2 = 4
39            39-36 = 3      3^2 = 9
40            40-36 = 4      4^2 = 16
36            36-36 = 0      0^2 = 0
34            34-36 = -2     (-2)^2 = 4
33            33-36 = -3     (-3)^2 = 9
32            32-36 = -4     (-4)^2 = 16
SUMS ->       0*             SS = 58
*The sum of deviations always equals zero
67
Illustrative Example -Standard Deviation (p. 92)
Observation   Deviation      Squared deviation
36            36-36 = 0      0^2 = 0
38            38-36 = 2      2^2 = 4
39            39-36 = 3      3^2 = 9
40            40-36 = 4      4^2 = 16
36            36-36 = 0      0^2 = 0
34            34-36 = -2     (-2)^2 = 4
33            33-36 = -3     (-3)^2 = 9
32            32-36 = -4     (-4)^2 = 16
SUMS ->       0*             SS = 58
*The sum of deviations always equals zero
-Sample variance (s^2)
~s^2 = SS / (n-1) = 58 / (8-1) = 8.286
-Standard deviation (s)
~s = sqrt(8.286) = 2.88
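The table's arithmetic can be verified with a short Python sketch, assuming the n - 1 divisor for the sample variance given on the earlier card; the function names are illustrative.

```python
import math

def sum_of_squares(values):
    """SS: the sum of squared deviations around the sample mean."""
    xbar = sum(values) / len(values)
    return sum((x - xbar) ** 2 for x in values)

def sample_variance(values):
    """s^2 = SS / (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)

def sample_sd(values):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(values))

observations = [36, 38, 39, 40, 36, 34, 33, 32]
print(sum_of_squares(observations))
print(round(sample_variance(observations), 3), round(sample_sd(observations), 2))
```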
68
Interpretation of Standard Deviation
-Measures spread (ex. if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., variability)
69
4.5 Spread -Quartiles
-Two distributions can be quite different yet can have the same mean
-These data compare particulate matter in air samples (μg/m^3) at two sites
~Both sites have a mean of 36, but Site 1 exhibits much greater variability
*We would miss the high pollution days if we relied solely on the mean
Site 1 |   | Site 2
    42 | 2 |
     8 | 2 |
     2 | 3 | 234
    86 | 3 | 6689
     2 | 4 | 0
       | 5 |
       | 5 |
       | 6 |
     8 | 6 |
        x10
70
Spread -Range
-Range = maximum - minimum
-Illustrative example
~Site 1 range = 68 - 22 = 46
~Site 2 range = 40 - 32 = 8
-Beware:
~The sample range will tend to underestimate the population range
-Always supplement the range with at least one additional measure of spread
Site 1 |   | Site 2
    42 | 2 |
     8 | 2 |
     2 | 3 | 234
    86 | 3 | 6689
     2 | 4 | 0
       | 5 |
       | 5 |
       | 6 |
     8 | 6 |
        x10
71
Spread -Quartiles
-Quartile 1 (Q1)
~Cuts off bottom quarter of data = median of the lower half of the data set
-Quartile 3 (Q3)
~Cuts off top quarter of data = median of the upper half of the data set
-Interquartile Range (IQR) = Q3 - Q1 covers the middle 50% of the distribution
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
~Q1 = 21, Q3 = 42, and IQR = 42 - 21 = 21
72
Spread -Quartiles
-Quartile 1 (Q1)
~Cuts off bottom quarter of data = median of the lower half of the data set
-Quartile 3 (Q3)
~Cuts off top quarter of data = median of the upper half of the data set
-Interquartile Range (IQR) = Q3 - Q1 covers the middle 50% of the distribution
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
~Q1 = 21, Q3 = 42, and IQR = 42 - 21 = 21
73
Quartiles (Tukey's Hinges) -Example 2
-Data are metabolic rates (cal/day), n = 7
~1362, 1439, 1460, 1614, 1666, 1792, 1867
-Median = 1614
-When n is odd, include the median in both halves of the data set
-Bottom half:
~1362, 1439, 1460, 1614, which has a median of 1449.5 (Q1)
-Top half:
~1614, 1666, 1792, 1867, which has a median of 1729 (Q3)
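The hinge rule can be sketched in Python. This is a minimal illustration of Tukey's hinges as described on these cards (the median is included in both halves when n is odd); other software may use different quartile conventions, and the function names are hypothetical.

```python
def median_of_ordered(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def tukey_hinges(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # includes the median in both halves when n is odd
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median_of_ordered(lower_half), median_of_ordered(upper_half)

metabolic_rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
print(tukey_hinges(metabolic_rates))
print(tukey_hinges(ages))
```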
74
Five-Point Summary
-Q0 (the minimum)
-Q1 (25th percentile)
-Q2 (median)
-Q3 (75th percentile)
-Q4 (the maximum)
75
4.6 Boxplots
-Calculate 5-point summary
~Draw box from Q1 to Q3 with line at median
-Calculate IQR and fences as follows
~Fence lower = Q1 - 1.5(IQR)
~Fence upper = Q3 + 1.5(IQR)
*DO NOT DRAW FENCES
-Determine if any values lie outside the fences (outside values)
~If so, plot these separately
-Determine values inside the fences (inside values)
~Draw whisker from Q3 to upper inside value
~Draw whisker from Q1 to lower inside value
76
Illustrative Example -Boxplot
-Data: 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-5 pt summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line @ 27.5
-IQR = 42 - 21 = 21
~Fu = Q3 + 1.5 (21) = 73.5
~Fl = Q1 - 1.5 (21) = -10.5
-No values lie above the upper fence or below the lower fence
-Upper inside value = 52
-Lower inside value = 5
-Draw whiskers
77
Illustrative Example -Boxplot 2
-5 pt summary
~3, 22, 25.5, 29, 51: draw a box
-IQR = 29 - 22 = 7
~Fu = Q3 + 1.5 (7) = 39.5
~Fl = Q1 - 1.5 (7) = 11.5
-One value above the top fence (51) and one below the bottom fence (3)
-Upper inside value is 31
-Lower inside value is 21
-Draw whiskers
78
Illustrative Example -Boxplot 3
-Seven metabolic rates
~1362, 1439, 1460, 1614, 1666, 1792, 1867
-5 pt summary
~1362, 1449.5, 1614, 1729, 1867
-IQR = 1729 - 1449.5 = 279.5
~Fu = Q3 + 1.5 (279.5) = 2148.25
~Fl = Q1 - 1.5 (279.5) = 1030.25
-None outside
-Whiskers end @ 1867 and 1362
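The fence calculations in the three boxplot examples can be sketched in Python. The quartiles are passed in as given on the cards; the function names are illustrative.

```python
def boxplot_fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def inside_and_outside(values, q1, q3):
    """Split values into those inside the fences and those outside."""
    lower, upper = boxplot_fences(q1, q3)
    inside = [v for v in values if lower <= v <= upper]
    outside = [v for v in values if v < lower or v > upper]
    return inside, outside

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
inside, outside = inside_and_outside(ages, q1=21, q3=42)

# Whiskers run to the smallest and largest inside values
print(boxplot_fences(21, 42), min(inside), max(inside), outside)
```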
79
Boxplots -Interpretation
-Location
~Position of median
~Position of box
-Spread
~Hinge-spread (IQR)
~Whisker-to-whisker spread
-Shape
~Symmetry or direction of skew
~Long whiskers (tails) indicate leptokurtosis
80
Side-by-side boxplots
-Boxplots are especially useful when comparing groups
81
Choosing Summary Statistics
-Always report a measure of central location, a measure of spread, and the sample size
-Symmetrical mound-shaped distributions -> report the mean and standard deviation
-Odd-shaped distributions -> report 5-point summaries (or median and IQR)
82
Definitions
-Random variable = a numerical quantity that takes on different values depending on chance
-Ex:
~Number of smokers in a simple random sample of size n
~The ages of subjects selected at random at UNR
-Sample Space = the set of all possible values of a random variable
-Ex:
~If the subject's age is a random variable of interest, the set of all possible values for this random variable is???
-Event = an outcome or set of outcomes of a random variable
-Probability = the proportion of times an event is expected to occur in the population
-Ex:
~Roll a fair die: the probability that the die lands on "one"
*Ideas about probability are founded on relative frequencies (proportions) in populations
83
Die Example
-Random Variable
~The number on the face
-Population (Sample Space): (not a population of people)
~{1, 2, 3, 4, 5, 6}
-Event: 1
-Probability: 1/6
-Ex:
~Event: 5 or 6
~Probability: 2/6 or 1/3
84
Probability Illustrated
-In a given year, there were 42,636 traffic fatalities in a population of N = 293,655,000
-If a person is selected at random from this population, what is the probability that they will be a traffic fatality by the end of that year?
-ANS
~The relative frequency of that event in the population = 42,636 / 293,655,000 = 0.0001452
*Thus, Pr(traf. fatality) = 0.0001452 (about 1 in 6887)
85
5.2 Random Variables
-Random variable = a numerical quantity that takes on different values depending on chance
-Two types of random variables
-Discrete random variables
~A countable set of possible outcomes
*X = number of smokers (cannot have half of a person)
-Continuous random variable
~An unbroken continuum of possible outcomes
*Weight in pounds (cannot have 0 due to it not existing)
86
Discrete Random Variables
-Discrete Random Variables
~A countable set of possible outcomes
*The variable number of leukemia cases in a geographic region in a given period
*The variable number of successes in n independent treatments
*The variable number of smokers in a simple random sample of size n
-Continuous random variable
~An unbroken continuum of possible outcomes
*The variable amount of time it takes to complete a task
*The variable height of an individual selected at random
87
5.3 Discrete Random Variables
-Probability mass function (pmf) = a mathematical relation that assigns probabilities to all possible outcomes for discrete random variables
-Illustrative example:
~One rolls a die 2 times
*Let X = the variable number of times one gets six
*This is the pmf for the random variable
x          0        1        2
Pr(X=x)    0.6944   0.2778   0.0278
-Illustrative example 2 -"Four Patients"
~Suppose one treats four patients with an intervention that is successful 75% of the time
*Let X = the variable number of successes in this experiment
*This is the pmf for this random variable
x          0        1        2        3        4
Pr(X=x)    0.0039   0.0469   0.2109   0.4219   0.3164
88
Operations on Events
-Intersection
~For two events A and B, the intersection A ∩ B represents the event that both A and B occur
-Union
~For two events A and B, the union A ∪ B represents the event that A or B occurs
*A occurs without B, B occurs without A, or A and B both occur
-Complement
~For an event A, the complement of A represents the event that occurs if A does not occur. It is typically denoted by A-bar
89
Properties of Probabilities
-Property 1
~Probabilities are always between 0 and 1
-Property 2
~A sample space is all possible outcomes
*The probabilities in the sample space sum to 1 (exactly)
-Property 3
~The complement of an event is "the event not happening"
*The probability of a complement is 1 minus the probability of the event
**Pr(rain tomorrow) = 0.6
**Pr(not rain tomorrow) = 0.4
-Property 4
~Probabilities of disjoint events can be added
*Pr(X = 1 or X = 2) = Pr(X = 1) + Pr(X = 2)
**X = number on die
90
Properties of Probabilities in Symbols
-Property 1. 0 ≤ Pr(A) ≤ 1
-Property 2. Pr(S) = 1, where S represents the sample space (all possible outcomes)
-Property 3. Pr(A-bar) = 1 - Pr(A), where A-bar represents the complement of A (NOT A)
-Property 4. If A and B are disjoint, then Pr(A or B) = Pr(A) + Pr(B)
91
Properties 1 and 2 Illustrated
-Figure 5.2
-Property 1. 0 ≤ Pr(A) ≤ 1
~Note that all individual probabilities are between 0 and 1
-Property 2. Pr(S) = 1
~Note that the sum of all probabilities = .0039 + .0469 + .2109 + .4219 + .3164 = 1
92
Property 3 Illustrated
-Property 3. Pr(A-bar) = 1 - Pr(A)
~As an example, let A represent 4 successes
*Pr(A) = .3164
-Let A-bar represent the complement of A ("NOT A"), which is "3 or fewer"
~Pr(A-bar) = 1 - Pr(A) = 1 - 0.3164 = 0.6836
93
Property 4 Illustrated
-Property 4. Pr(A or B) = Pr(A) + Pr(B) for disjoint events
~Let A represent 4 successes
~Let B represent 3 successes
-Since A and B are disjoint, Pr(A or B) = Pr(A) + Pr(B) = 0.3164 + 0.4219 = 0.7383
-The probability of observing 3 or 4 successes is 0.7383, or about 74%
94
Area Under the Curve (AUC)
-The area under the curve (AUC) on a pmf corresponds to probability
-Pr(X = 2)
~Area of shaded region = height x base
*.2109(1.0) = .2109
95
Cumulative Probability
-"Cumulative probability" refers to the probability of the value or less
-Notation Pr(X ≤ x)
-Corresponds to AUC to the left of the point ("left tail")
-Ex: Pr(X ≤ 2)
~Shaded "tail" 0.0039 + 0.0469 + 0.2109 = 0.2617
96
Mean and Variance of a Discrete Random Variable
-Definitional formula for mean or expectation (p.111)
~$\mu = \sum x \cdot \Pr(X = x)$
-Definitional formula for variance (p.111)
~$\sigma^2 = \sum (x - \mu)^2 \cdot \Pr(X = x)$
97
Expected Mean
x          0        1        2        3        4
Pr(X=x)    0.0039   0.0469   0.2109   0.4219   0.3164
-How to calculate the expected mean?
~μ = 0*0.0039 + 1*0.0469 + 2*0.2109 + 3*0.4219 + 4*0.3164 = 3
98
Variance
x          0        1        2        3        4
Pr(X=x)    0.0039   0.0469   0.2109   0.4219   0.3164
-How to calculate the variance?
~σ^2 = (0-3)^2 * 0.0039 + (1-3)^2 * 0.0469 + (2-3)^2 * 0.2109 + (3-3)^2 * 0.4219 + (4-3)^2 * 0.3164 = 0.75
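Both calculations follow the same pattern, which a short Python sketch makes explicit. It uses the rounded pmf values from the card, so the results match 3 and 0.75 only up to rounding; the variable names are illustrative.

```python
# pmf for the "Four patients" example, rounded to four decimals as on the card
pmf = {0: 0.0039, 1: 0.0469, 2: 0.2109, 3: 0.4219, 4: 0.3164}

# Expected value: weight each outcome by its probability
mu = sum(x * p for x, p in pmf.items())

# Variance: weight each squared deviation from mu by its probability
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())

print(round(mu, 2), round(variance, 2))
```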
99
5.4 Continuous Random Variables
-Continuous random variables form a continuum of possible values
-As an illustration, consider the spinner
-The spinner will generate a continuum of random numbers between 0 and 1
-A probability density function (pdf) is a mathematical relation that assigns probabilities to all possible outcomes for a continuous random variable
-The pdf for our random spinner is shown here
-The shaded area under the curve represents probability (base x height, with height = 1)
~Pr(0 < X < 0.5) = (0.5 - 0) * 1 = 0.5
~Pr(0.25 < X < 0.5) = (0.5 - 0.25) * 1 = 0.25
~Pr(X > 0.7) = (1 - 0.7) * 1 = 0.3
100
Examples of pdfs
-pdfs obey all the rules of probabilities
-pdfs come in many forms (shapes)
~Uniform pdf
~Normal pdf
~Chi-square pdf
~Exercise 5.13 pdf
*The most common pdf is the Normal (we study the Normal pdf in detail in the next chapter)
101
Area Under the Curve
-As was the case with pmfs, pdfs display probability as the area under the curve (AUC)
-This histogram shades bars corresponding to ages < 9 (~40% of the histogram)
-This shaded AUC on the Normal pdf curve also corresponds to ~40% of the total
~X = age
~X ~ Normal
~Pr(X < 9) = 0.4
102
6.1 Binomial Random Variables
-Binomial = a family of discrete random variables
-Binomial Random Variable = the random number of successes in n independent Bernoulli trials (a Bernoulli trial has two possible outcomes: "success" or "failure")
-Binomial random variables have two parameters
~n = number of trials
~p = probability of success of each trial
103
Binomial Example
-Consider the random number of successful treatments when treating four patients
-Suppose the probability of success in each instance is 75%
-The random number of successes can vary from 0 to 4
-The random number of successes is a binomial with parameters n = 4 and p = 0.75
-Notation
~Let X ~ b(n, p) represent a binomial random variable with parameters n and p
*The illustrative variable is X ~ b(4, 0.75)
104
6.2 Calculating Binomial Probabilities
-Formula for binomial probabilities
~Pr(X = x) = nCx p^x q^(n-x)
-Where
~nCx = the binomial coefficient (next slide)
~p = probability of success for each trial
~q = probability of failure = 1 - p
105
Binomial Coefficient
-Formula for the binomial coefficient
~nCx = (n!) / (x! (n-x)!)
-Where ! represents the factorial function, calculated as
~x! = x * (x-1) * (x-2) * ... * 1
-Ex:
~4! = 4*3*2*1 = 24
-By definition 1! = 1 and 0! = 1
-Ex:
~4C2 = (4!) / ((2!)(4-2)!) = (4!) / ((2!)(2!)) = (4*3*2*1) / ((2*1)(2*1)) = 6
106
Binomial Coefficient Cont.
nCx = (n!) / (x! (n-x)!)
-The binomial coefficient is called the "choose function" because it tells you the number of ways you could choose x items out of n
~nCx = the number of ways to choose x items out of n
-Ex:
~4C2 = 6 means there are six ways to choose two items out of four
107
Binomial Calculation -Example
-Recall the "Four patients" example
-Four patients; probability of success of each treatment = 0.75
-The number of successes is the binomial random variable X ~ b(4, 0.75)
-Note q = 1 - 0.75 = 0.25
-What is the probability of observing 0 successes under these circumstances?
~Pr(X = 0) = nCx p^x q^(n-x) = 4C0 * 0.75^0 * 0.25^(4-0) = (4!) / (0! * 4!) * 0.75^0 * 0.25^4 = 1 * 1 * 0.0039 = 0.0039
-The remaining probabilities are calculated the same way
~Pr(X = 1) = 4C1 * 0.75^1 * 0.25^(4-1) = 4 * 0.75 * 0.0156 = 0.0469
~Pr(X = 2) = 4C2 * 0.75^2 * 0.25^(4-2) = 6 * 0.5625 * 0.0625 = 0.2109
~Pr(X = 3) = 4C3 * 0.75^3 * 0.25^(4-3) = 4 * 0.4219 * 0.25 = 0.4219
~Pr(X = 4) = 4C4 * 0.75^4 * 0.25^(4-4) = 1 * 0.3164 * 1 = 0.3164
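The five hand calculations can be reproduced with a minimal Python sketch of the binomial formula. `math.comb` is the standard-library choose function; the name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# "Four patients" example: X ~ b(4, 0.75)
for x in range(5):
    print(x, round(binomial_pmf(x, 4, 0.75), 4))
```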
108
Area Under the Curve
-Recall the area under the curve (AUC) concept
AUC = probability
109
6.3 Cumulative Probability
-Recall the cumulative probability concept
-Cumulative probability = the probability of that value or less
-Pr(X ≤ x)
-Corresponds to the left tail of the pmf
110
Cumulative Probability Function
-The cumulative probability function lists cumulative probabilities for all possible outcomes
-Ex:
~The cumulative probability function for X ~ b(4, 0.75)
Pr(X ≤ 0) = 0.0039
Pr(X ≤ 1) = 0.0508
Pr(X ≤ 2) = 0.2617
Pr(X ≤ 3) = 0.6836
Pr(X ≤ 4) = 1.0000
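A cumulative probability is a running total of the pmf. The sketch below, assuming only the standard library, rebuilds the table for X ~ b(4, 0.75); the variable names are illustrative.

```python
import math
from itertools import accumulate

n, p = 4, 0.75
q = 1 - p

# pmf: Pr(X = x) for x = 0..n
pmf = [math.comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

# cdf: Pr(X <= x) is the sum of the pmf from 0 up to and including x
cdf = list(accumulate(pmf))

for x, prob in enumerate(cdf):
    print(f"Pr(X <= {x}) = {prob:.4f}")
```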
111
6.5 Expected Value and Variance for Binomials
-The expected value (mean) mu of a binomial pmf is its "balance point"
-The variance sigma^2 measures its spread
-Shortcut formulas
mu = np
sigma^2 = npq
112
Expected Value and Variance, Binomials, illustration
-For the "Four patients" pmf of X ~ b(4, 0.75)
mu = np = 4(0.75) = 3
sigma^2 = npq = 4(0.75)(0.25) = 0.75
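As a check on the shortcut formulas, the sketch below computes the mean and variance directly from the pmf, using the general definitions mu = sum of x * Pr(X = x) and sigma^2 = sum of (x - mu)^2 * Pr(X = x), and compares them with np and npq. Variable names are illustrative.

```python
import math

n, p = 4, 0.75
q = 1 - p
pmf = [math.comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

# Definitions applied to the pmf
mean_from_pmf = sum(x * prob for x, prob in enumerate(pmf))
var_from_pmf = sum((x - mean_from_pmf) ** 2 * prob for x, prob in enumerate(pmf))

# Shortcut formulas for a binomial
mean_shortcut = n * p
var_shortcut = n * p * q

print(mean_from_pmf, mean_shortcut)
print(var_from_pmf, var_shortcut)
```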
113
6.6 Using the Binomial
-Suppose we observe 2 successes in the "Four patients" example
-Note mu = 3, suggesting we should see 3 successes on average
-Does the observation of 2 successes cast doubt on p = 0.75?
-No, because Pr(X ≤ 2) = 0.2617 is not too unusual
114
Normal Distributions
-Normal random variables are the most common type of continuous random variable
-More importantly, they describe the behavior of means
115
Normal Probability Density Function
-Recall that continuous random variables are described with smooth probability density functions (pdfs); see chapter 5
-Normal pdfs are recognized by their familiar bell shape
116
Figure 7.1
-Histogram with overlying Normal Curve ~The overlying curve represents its Normal pdf model
117
Area Under the Curve
-The darker bars of the histogram in Figure 7.2 correspond to ages less than or equal to 9 (~40% of observations)
-The darker area under the curve (see Figure 7.3) also corresponds to ages less than or equal to 9 (~40% of the total area)
118
Figure 7.2
-Proportion less than 9 shaded in a darker color
119
Figure 7.3
-Proportion less than 9 (area under the curve)
~This shaded area is the probability associated with the range 0-9 years old
-The Normal pdf is
f(x) = (1 / (sigma * sq root(2 * pi))) * e^((-1/2) * ((x - mu) / sigma)^2)
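A small sketch of the Normal pdf formula, written out term by term and compared with `statistics.NormalDist.pdf` from the standard library. The function name `normal_pdf` and the values mu = 10, sigma = 3 are assumptions chosen only for illustration; they are not taken from the figure.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-0.5 * ((x - mu) / sigma)**2)."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

mu, sigma = 10, 3  # hypothetical parameters, for illustration only
for x in (4, 7, 10, 13, 16):
    by_formula = normal_pdf(x, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(x)
    print(x, round(by_formula, 5), round(by_library, 5))
```

The density is highest at x = mu and symmetric around it, which is the bell shape the cards describe.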
120
Parameters mu and sigma
-Normal pdfs are a family of distributions
-Family members are identified by the parameters mu (mean) and sigma (standard deviation)
-mu controls location (see Figure 7.4)
-sigma controls spread (see Figure 7.5)
121
Standard Deviation sigma
-Points of inflection (where the slope of the curve begins to level) occur one sigma below and one sigma above mu
122
Normal Distribution
-The Normal distribution is often written as N(mu, sigma^2) to indicate that the density curve depends upon the parameters mu and sigma^2, which are the mean and variance of the random variable
~mu corresponds to the middle of the curve
~sigma^2 determines the spread of the curve
-The standard Normal distribution is a Normal distribution with mu = 0, sigma = 1
123
Standard Normal Distribution
-A standard normal random variable is generally denoted as Z
~The area between a and b under the standard normal density curve provides the probability that Z will assume a value over the interval (a, b): P(a < Z < b)
124
Example
-Let Z be a standard normal random variable
~Find the following probabilities using Table B
a) P(Z < 1.96) = 0.9750
b) P(-2.00 < Z < 2.00) = P(Z < 2.00) - P(Z < -2.00) = 0.9772 - 0.0228 = 0.9544
c) P(Z > -1.28) = 1 - P(Z < -1.28) = 1 - 0.1003 = 0.8997
d) P(-5.13 < Z < 2.00) = P(Z < 2.00) - P(Z < -5.13) = 0.9772 - 0 = 0.9772
e) P(Z = 1.71) = 0 (a single point has no area under a continuous pdf)
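The Table B lookups above can be reproduced with `statistics.NormalDist` (Python 3.8+), whose `cdf` method returns the cumulative probability P(Z < z). This is a sketch for checking answers, not a replacement for the table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard Normal

a = Z.cdf(1.96)                  # P(Z < 1.96)
b = Z.cdf(2.00) - Z.cdf(-2.00)   # P(-2.00 < Z < 2.00)
c = 1 - Z.cdf(-1.28)             # P(Z > -1.28)
d = Z.cdf(2.00) - Z.cdf(-5.13)   # P(-5.13 < Z < 2.00)

for label, value in zip("abcd", (a, b, c, d)):
    print(f"{label}) {value:.4f}")
```

Because table entries are rounded to four decimals before subtracting, a computed answer can differ from the table-based one in the last digit.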
125
7.2 Determining Normal Probabilities
-To determine a Normal probability
~State the problem
~Standardize the value (z-score)
~Sketch and shade the curve
~Use Table B to determine the probability
126
Standard Normal (Z) Variable
-Standard Normal variable = a Normal random variable with mu = 0 and sigma = 1
-Called "z variables"
-Notation
Z ~ N(0, 1)
-Use Table B to look up cumulative probabilities
127
Figure 7.11
-Portion of Table B highlighting P (Z < 1.96) = 0.9750
128
Example: Normal Probability -Step 1: Statement of Problem
-We want to determine the percentage of human gestations that are less than 40 weeks in length
-We know that uncomplicated human pregnancy from conception to birth is approximately Normally distributed with mu = 39 weeks and sigma = 2 weeks
*Note: clinicians measure gestation from the last menstrual period to birth, which adds 2 weeks to mu
-Let X represent human gestation in weeks
X ~ N(39, 2), where the second parameter here is sigma
-Statement of the problem
Pr(X < 40) = ?
129
Normal Probability -Step 2: Standardize
-To standardize, subtract mu and divide by sigma
Z = (x - mu) / sigma
-The z-score tells one how many sigma-units the value falls above or below mu
-Ex:
~The value 40 from X ~ N(39, 2) has z = (40 - 39) / 2 = 0.5
Pr(X < 40) = Pr(Z < 0.5) = 0.6915
130
Normal Probability -Steps 3 and 4: Sketch and use Table B
-Sketch and label axes
-Use Table B to look up Pr(Z < 0.5) = 0.6915
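The gestation example can be sketched in code, assuming `statistics.NormalDist` stands in for Table B. It standardizes first, as on the cards, and then confirms that working directly on the original scale gives the same probability.

```python
from statistics import NormalDist

mu, sigma = 39, 2   # gestation in weeks, from the example
x = 40

# Step 2: standardize
z = (x - mu) / sigma

# Step 4: cumulative probability from the standard Normal (Table B equivalent)
prob_via_z = NormalDist(0, 1).cdf(z)

# Same answer without standardizing, using the original scale
prob_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z}")
print(f"Pr(X < 40) = {prob_via_z:.4f}")
print(f"Direct     = {prob_direct:.4f}")
```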
131
Probabilities Between Two Points
-Let a represent the lower boundary and b represent the upper boundary of a range
Pr(a < Z < b) = Pr(Z < b) - Pr(Z < a)
132
7.3 Looking up the z percentile value
-Use Table B to look up the z-percentile value
~Ex:
*The z-score for the probability in question
-Look inside the table for the entry closest to the associated cumulative probability
-Then trace the z-score to the row and column labels
133
Looking up the (Z) percentile value
-Suppose one wanted the 97.5th percentile z-score
~Look inside the table for 0.975
*Then trace the z-score to the margins
-Notation
~Let z_p represent the z-score with cumulative probability p
~Ex: z_0.975 = 1.96
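Looking up a percentile is the reverse of looking up a cumulative probability. In Python's standard library this is `NormalDist.inv_cdf`; the sketch below finds the 97.5th percentile and checks the round trip. Variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

p = 0.975
z_p = Z.inv_cdf(p)   # z-score with cumulative probability p

print(f"z for p = {p}: {z_p:.2f}")
print(f"Check: Pr(Z < {z_p:.2f}) = {Z.cdf(z_p):.4f}")
```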
134
8.1 Concepts
-Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty
~We are curious about parameters in the population
~We calculate statistics in the sample
135
Parameters and Statistics
-It is essential to draw the distinction between parameters and statistics

                      Parameters      Statistics
Source                Population      Sample
Calculated?           No              Yes
Constant?             Yes             No
Notation (examples)   mu, sigma, p    x-bar, s, p̂
136
8.2 Sampling Behavior of a mean
-How precisely does a given sample mean reflect the underlying population mean?
137
Sampling
-Age
~Population: 65 students in our CHS 280 class
-Which sample mean reflects the underlying population mean more precisely?
-If the sample size is 3
~Sampling distribution of the sample mean
N = 65
mu = mean age in the population
x-bar1 = (18 + 18 + 19) / 3 = 18.33
x-bar2 = (19 + 20 + 21) / 3 = 20
-If the sample size is 50
~Sampling distribution of the sample mean
N = 65
x-bar1 = (...+...+...+...) / 50 =
138
Deviation of population and sampling distribution
-Population (individual observations): standard deviation = sigma
-Sampling distribution of x-bar: standard deviation = sigma / (sq root n)
139
Standard deviation of sampling distribution of x-bar
-Standard error of the mean
sigma(x-bar) = SE(x-bar) = sigma / (sq root n)
*The square root law says the SE of the mean is inversely proportional to the square root of the sample size
140
Example -The Wechsler Adult Intelligence Scale has sigma = 15
-For n = 1 -> SE(x-bar) = sigma / (sq root n) = 15 / sq root 1 = 15
-For n = 4 -> SE(x-bar) = sigma / (sq root n) = 15 / sq root 4 = 7.5
-For n = 16 -> SE(x-bar) = sigma / (sq root n) = 15 / sq root 16 = 3.75
*Quadrupling the sample size cuts the SE in half (square root law)
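A short sketch of the square root law for the Wechsler example, assuming sigma = 15 as stated on the card. The function name `standard_error` is illustrative.

```python
import math

def standard_error(sigma, n):
    """SE of the mean = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 15
for n in (1, 4, 16):
    print(f"n = {n:2d} -> SE = {standard_error(sigma, n)}")

# Quadrupling n halves the SE
ratio = standard_error(sigma, 16) / standard_error(sigma, 4)
print(f"SE(n=16) / SE(n=4) = {ratio}")
```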
141
Figure 8.2
-Sampling distribution of the mean based on n = 10 compared with the distribution of population values, Wechsler Adult Intelligence Scale scores
142
Central limit Theorem
-The sampling distribution of x-bar tends toward Normality even when the population distribution is not Normal
~This effect is strong in large samples
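A simulation sketch of this idea, under stated assumptions: the population is Uniform(0, 1), which is not Normal, and the number of repetitions (2000), the sample sizes, and the seed are arbitrary choices for illustration. It draws repeated samples, records each sample mean, and compares the spread of those means with sigma / (sq root n).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n from a Uniform(0, 1) population."""
    return [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

# Uniform(0, 1) has mu = 0.5 and sigma = sqrt(1/12)
mu = 0.5
sigma = (1 / 12) ** 0.5

for n in (4, 16, 64):
    means = sample_means(n)
    observed_sd = statistics.stdev(means)   # spread of the simulated sample means
    predicted_se = sigma / n ** 0.5         # square root law
    print(n, round(statistics.fmean(means), 3), round(observed_sd, 4), round(predicted_se, 4))
```

Plotting a histogram of `means` for each n would show the bell shape emerging; that step is left out to keep the sketch dependency-free.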
143
Law of Large Numbers
-As the sample size gets larger and larger, the sample mean tends to get closer and closer to the population mean mu
144
Adult Intelligence Scale Example
-Wechsler Adult Intelligence Scale (WAIS) scores vary according to a Normal distribution with mu = 100 and sigma = 15
a) What can we say about the sampling distribution of a mean based on an SRS of 10 such scores?
mu = 100
sigma = 15
SE(x-bar) = sigma / (sq root n) = 15 / (sq root 10) = 4.7434
x-bar ~ N(100, 4.74)
b) What is the probability of getting an x-bar less than 90?
Pr(x-bar < 90) = ?
z = (x-bar - mu) / SE(x-bar) = (90 - 100) / 4.74 = -2.11
Pr(x-bar < 90) = Pr(Z < -2.11) = 0.0174
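The two parts of this example can be sketched as follows, assuming `statistics.NormalDist` in place of Table B. Variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 100, 15, 10

# a) sampling distribution of x-bar: Normal with mean mu and SD equal to the SE
se = sigma / math.sqrt(n)
xbar_dist = NormalDist(mu, se)

# b) Pr(x-bar < 90), via the z-score and directly
z = (90 - mu) / se
prob_via_z = NormalDist(0, 1).cdf(z)
prob_direct = xbar_dist.cdf(90)

print(f"SE = {se:.4f}")
print(f"z  = {z:.2f}")
print(f"Pr(x-bar < 90) = {prob_via_z:.4f}")
```

The computed probability uses the unrounded z-score, so it can differ slightly in the fourth decimal from the table value for z = -2.11.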
145
8.3 Sampling Behavior of Counts and Proportions
-Binomial random variable
~Random number of successes (X) in n independent "success/failure" trials
~Probability of success for each trial is p
-Notation
X ~ b(n, p)
~When n is large (npq ≥ 5), we can use the Normal approximation to the binomial
146
Normal Approximation for a Binomial Count
-mu = np and sigma = sq root (npq)
-When the Normal approximation applies
X ~ N(np, sq root (npq))
147
Normal Approximation for a Binomial Proportion
-mu = p and sigma = sq root ((pq) / n)
p̂ ~ N(p, sq root ((pq) / n))
148
Example
-Recent statistics claim the prevalence of maternal smoking is quite low, at only 5%
~Suppose another research group sampled 107 pregnant mothers in their third trimester
n = 107
p = 0.05
q = 0.95
149
Example
n = 107
p = 0.05
q = 0.95
A. Can we assume an approximation to the Normal distribution for this case?
npq = 107 * 0.05 * 0.95 = 5.0825
Yes, because npq ≥ 5
150
Example
n = 107
p = 0.05
q = 0.95
B. Calculate the probability that at least 12 of the 107 mothers smoked during pregnancy, using a Normal approximation
mu = np = 107 * 0.05 = 5.35
sigma = sq root (npq) = sq root (107 * 0.05 * 0.95) = 2.2544
X ~ N(5.35, 2.2544)
z = (12 - 5.35) / 2.2544 = 2.95
Pr(X ≥ 12) = Pr(Z ≥ 2.95) = 1 - Pr(Z < 2.95) = 1 - 0.9984 = 0.0016
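A sketch of this calculation, assuming `statistics.NormalDist` in place of Table B. Like the card, it applies no continuity correction. The exact binomial tail at the end is an addition for comparison only and is not part of the original example.

```python
import math
from statistics import NormalDist

n, p = 107, 0.05
q = 1 - p

# A. rule of thumb for using the Normal approximation
npq = n * p * q
print(f"npq = {npq:.4f}, approximation allowed: {npq >= 5}")

# B. Normal approximation to Pr(X >= 12)
mu = n * p
sigma = math.sqrt(npq)
z = (12 - mu) / sigma
normal_tail = 1 - NormalDist(0, 1).cdf(z)
print(f"mu = {mu:.2f}, sigma = {sigma:.4f}, z = {z:.2f}")
print(f"Normal approximation: Pr(X >= 12) = {normal_tail:.4f}")

# For comparison only: exact binomial tail, summing the pmf from 12 to n
exact_tail = sum(math.comb(n, x) * p**x * q**(n - x) for x in range(12, n + 1))
print(f"Exact binomial:       Pr(X >= 12) = {exact_tail:.4f}")
```

With npq only just above 5, the two answers need not agree closely, which is worth keeping in mind when the rule of thumb is barely met.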