Module 1-8 Flashcards

1
Q

Biostatistics

A

-Statistics is not merely a compilation of computational techniques
-Statistics
~Is a way of learning from data
~Is concerned with all elements of study design, data collection, and analysis of numerical data
~Does require judgment
-Biostatistics is statistics applied to biological and health problems

2
Q

Biostatisticians are

A

-Data Detectives
~Who uncover patterns and clues
~This involves exploratory data analysis (EDA) and descriptive statistics
-Data Judges
~Who judge and confirm clues
~This involves statistical inference

3
Q

Measurement

A

-Measurement (defined)
~The assigning of numbers and codes according to prior-set rules
-There are three broad types of measurements
~Nominal
~Ordinal
~Quantitative

4
Q

Nominal Measurements

A

-Classify observations into named categories
~No order
~Often just two categories (binary, e.g., yes/no); there can be more, but they cannot be ordered
-Ex:
~HIV status (positive or negative)
~Sex (Male or Female)
~Hair color (red, brown, black, blonde, gray, etc.)

5
Q

Ordinal Measurement

A

-Categories that can be put in rank order
~Opinion
*More than two categories that have to be in order
-Ex:
~ Stage of cancer classified as Stage I, Stage II, Stage III, Stage IV
~Opinion classified as strongly agree (5), agree (4), neutral (3), disagree (2), strongly disagree (1)
~Age groups 0-4, 5-9, 10-14, etc.

6
Q

Quantitative Measurements

A

-True numerical values that can be put on a number line
-Numerical values with equal spacing between numerical values
-Ex:
~Age (years)
~Serum cholesterol (mg/dL)
~T4 cell count (per dL)

7
Q

Illustrative Example:
-Weight Change and Heart Disease

A

-This study sought to determine the effect of weight change on coronary heart disease risk
-It followed 115,818 women, 30-55 years of age and free of CHD at the start, over 14 years
-Measurements included the following variables
~Nominal (including Binary)
*CHD onset (yes or no)
*Family history of CHD (yes or no)
~Ordinal
*Non-smoker, light smoker, moderate smoker, heavy smoker
~Quantitative
*BMI (kg/m^2)
*Age (years)
*Weight presently
*Weight at age 18

8
Q

Observation, Variable, and Value

A

-Observation
~The unit upon which measurements are made and can be an individual or aggregate
-Variable
~The generic thing we measure
*Age of person
*HIV status of a person
-Value
~A realized measurement
*“27”
*“positive”

9
Q

Data Table

A

-Each row corresponds to an observation
-Each column contains information on a variable
-Each cell in the table contains a value
-Units of observation in these data are individual regions, not individual people
~Table 1.2 in the textbook

10
Q

Measurement Inaccuracies

A

-Imprecision
~The inability to get the same result upon repetition
-Bias
~A tendency to overestimate or underestimate the true value of an object

12
Q

Types of Studies

A

-Surveys
~Describe population characteristics
*A study of the prevalence of hypertension in a population
-Comparative Studies
~Determine relationships between variables
*A study to address whether weight gain causes hypertension

13
Q

2.1 Surveys

A

-Goal
~To describe population characteristics
-Studies a subset (sample) of the population
~Census vs. Sample
-Uses sample to make inferences about population
-Sampling
~Saves time
~Saves money
~Allows resources to be devoted to greater scope and accuracy

14
Q

Illustrative Example
-Youth Risk Behavior Surveillance (YRBS)

A

-YRBS monitors health behaviors in youth and young adults in the US. Six categories of health-risk behaviors are monitored. These include:
~Behavior that contributes to unintentional injuries and violence
~Tobacco use
~Alcohol and drug use
~Sexual behaviors
~Unhealthy dietary behavior
~Physical activity levels and body weight
-Ex:
~Several million public and private school students in the US in 2003
~Sampling
*15,240 questionnaires completed at 158 schools

15
Q

Types of Samples
-Probability Sample

A

-Simple random sample (SRS)
-Stratified random sample
-Cluster sample

16
Q

Types of Samples
-Non-probability sample

A

-Convenience sample

17
Q

Sampling

A

-Probability samples
~Use chance mechanisms to select individuals
-Most basic type of probability sample is the simple random sample (SRS)
-SRS
~Each population member has the same probability of being selected into the sample
~Selection of any individual into the sample does not influence the likelihood of selecting any other individuals

18
Q

Simple Random Sampling Method

A

-Identify each population member with a number 1, 2, …, N
-Use a random number generator to generate n random numbers between 1 and N
~Ex:
*http://www.random.org/integer-sets/
-Keep in mind
~The objective of an SRS is that every possible subset of size n is equally likely!
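The SRS steps above can be sketched in Python. This is only an illustration: it swaps the random.org generator for Python's built-in `random.sample`, and the function name `simple_random_sample` and the seed value are made up for the example.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Draw n distinct IDs from 1..N; every subset of size n is equally likely."""
    rng = random.Random(seed)                       # seed only makes the draw repeatable
    return sorted(rng.sample(range(1, N + 1), n))   # sampling without replacement

# Hypothetical use: 5 IDs from a class numbered 1 to 79
ids = simple_random_sample(N=79, n=5, seed=1)
print(ids)
```

Sorting is only for readability; it does not change which IDs were drawn.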
19
Q

Let’s select 5 random IDs from our class IDs (1-79)
-Step one

A

-Generate 1 set with 5 unique random integers in each
-Each integer should have a value between 1 and 79 (both inclusive; limits +/- 1,000,000,000)
-The total number of integers must be no greater than 10,000

20
Q

Class ID Selection

A

-15, 31, 40, 48, and 76
-Random Integer Set Generator
~One requested 1 set with 5 unique random integers, taken from the [1,79] range. The integers were sorted in ascending order
*Here is the set
**Set 1: 15, 31, 40, 48, 76

21
Q

Sampling

A

-Sampling Fraction = (n)/N
~n is the sample size
~N is the size of the population
-Sampling with replacement
~Tossing selected members back into the mix after they’ve been selected
~Any given unit can appear more than once in the sample
-Sampling without replacement
~Selected units are removed from possible future reselection
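A small sketch of the sampling fraction and of the two sampling schemes, assuming the class of N = 79 IDs and n = 5 from the earlier cards. The seed is an arbitrary choice.

```python
import random

N = 79                      # population size (class IDs 1..79)
n = 5                       # sample size
sampling_fraction = n / N   # n / N

rng = random.Random(0)      # arbitrary seed, for repeatability only
population = list(range(1, N + 1))

without_replacement = rng.sample(population, n)    # each ID can appear at most once
with_replacement = rng.choices(population, k=n)    # the same ID may appear more than once

print(round(sampling_fraction, 4), without_replacement, with_replacement)
```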

22
Q

Other Types of Probability Samples
-(More Advanced Methods)

A

-Stratified random sample
~Draw random samples from strata (subsets) within the population
~Ex:
*The population can be divided into 5-year age groups (0-4, 5-9,…) with simple random samples of varying sizes drawn from each age-strata
-Cluster sample
~Randomly sample clusters comprising varying numbers of observations
~Ex:
*Households (cluster) are selected at random, and ALL individuals are studied within the clusters
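The two designs can be contrasted with a short sketch. Everything here is hypothetical: the 60-person population, the three age strata, the households of four, and the function names are invented for illustration only.

```python
import random

rng = random.Random(42)   # arbitrary seed

# Hypothetical population: (person_id, age_group, household_id)
population = [(i, ("0-4", "5-9", "10-14")[i % 3], i // 4) for i in range(60)]

def stratified_sample(pop, per_stratum):
    """Group people by age stratum, then draw an SRS inside each stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[1], []).append(person)
    return {group: rng.sample(members, per_stratum) for group, members in strata.items()}

def cluster_sample(pop, n_clusters):
    """Pick whole households at random and keep ALL of their members."""
    clusters = {}
    for person in pop:
        clusters.setdefault(person[2], []).append(person)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for c in chosen for person in clusters[c]]

strat = stratified_sample(population, per_stratum=4)
clust = cluster_sample(population, n_clusters=3)
```

In the stratified draw the random step happens inside every stratum; in the cluster draw the random step picks the households, and nobody inside a chosen household is left out.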

23
Q

Cautions when Sampling

A

-Undercoverage
~Groups in the source population are left out or underrepresented in the population list used to select the sample
-Volunteer bias
~Occurs when self-selected participants are atypical of the source population
-Nonresponse bias
~Occurs when a large percentage of selected individuals refuse to participate or cannot be contacted

24
Q

2.2 Comparative Studies

A

-Comparative designs study the relationship between an explanatory variable and a response variable
-Explanatory variable
~Synonyms
*Independent variable, factor, predictor, exposure
~Treatment or exposure that explains or predicts changes in the response variable
-Response variable
~Synonyms
*dependent variable, outcome
~Outcome or response being investigated

25
Q

Experimental vs. non-experimental

A

-Comparative studies may be experimental or non-experimental
-In EXPERIMENTAL DESIGNS, the investigator assigns the subjects to groups according to the explanatory variable
~Exposed and unexposed groups
-In NONEXPERIMENTAL DESIGNS, the investigator does not assign subjects into groups; individuals are merely classified as “exposed” or “non-exposed”

26
Q

Figure 2.1

A

-Experimental and non-experimental study design

27
Q

Example of an Experimental Design

A

-The Women’s Health Initiative study randomly assigned about half its subjects to a group that received hormone replacement therapy (HRT)
-Subjects were followed for ~5 years to ascertain various health outcomes, including heart attacks, stroke, the occurrence of breast cancer, and so on

28
Q

Example of a Nonexperimental design

A

-The Nurses’ Health Study classified individuals according to whether they received HRT
-Subjects were followed for ~5 years to ascertain the occurrence of various health outcomes

29
Q

Comparison of Experimental and Nonexperimental Designs

A

-In both the experimental (WHI) study and the nonexperimental (Nurses’ Health) study, the relationship between HRT (explanatory variable) and various health outcomes (response variables) was studied
-In the experimental design, the investigators controlled who was and who was not exposed
-In the nonexperimental design, the study subjects (or their physicians) decided whether or not subjects were exposed

30
Q

Jargon

A

-A subject = an individual participating in the experiment
-A factor = an explanatory variable being studied; experiments may address the effect of multiple factors
-A treatment = a specific set of factors

31
Q

What can you say about data?

A

-Ages of people in group A
~21, 42, 5, 11, 30, 50, 28, 27, 24, 52

32
Q

Stemplots

A

-You can observe a lot by looking - Yogi Berra
-Start by exploring the data with Exploratory Data Analysis (EDA)
-A popular univariate EDA technique is the stem-and-leaf plot
-The stem of the stemplot is a number line (axis)
-Each leaf represents a data point
-Ex:
0 | 5
1 | 1
2 | 1 4 7 8
3 | 0
4 | 2
5 | 0 2

33
Q

Stemplot
-Illustration

A

-10 ages (data sequenced as an ordered array)
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-Draw the stem to cover the range 5 to 52
0 |
1 |
2 | 1
3 |
4 |
5 |
x10 <- axis multiplier
-Divide each data point into a stem-value (the tens place in this example) and a leaf-value (the ones place in this example)
-Place leaves next to the stem value
-Example of a leaf: 21 (plotted)
-Plot all the data points in rank order
0 | 5
1 | 1
2 | 1478
3 | 0
4 | 2
5 | 02
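The plotting steps can be sketched in a few lines of Python. This is a minimal version that assumes non-negative whole numbers and a x10 axis multiplier; the function name `stemplot` is made up for the example.

```python
def stemplot(values):
    """Stem = tens place, leaf = ones place (axis multiplier x10)."""
    leaves = {}
    for v in sorted(values):                       # rank order first
        leaves.setdefault(v // 10, []).append(str(v % 10))
    stems = range(min(leaves), max(leaves) + 1)    # keep empty stems on the axis
    return [f"{s} | {''.join(leaves.get(s, []))}" for s in stems]

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
lines = stemplot(ages)
print("\n".join(lines))
```

Keeping the empty stems matters: dropping them would hide gaps and make outliers look closer to the rest of the data than they are.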

34
Q

Interpreting Stemplots
-Shape

A

-Symmetry (mirror image of itself around its center)
-Modality (number of peaks)
- Kurtosis (width of tails or steepness of the mound)
-Departures (outliers)

35
Q

Interpreting Stemplots
-Location

A

-Gravitational center -> mean
-Middle value -> median

36
Q

Interpreting Stemplots
-Spread

A

-Range and inter-quartile range
-Standard deviation and variance (chapter 4)

37
Q

Shape

A

-“Shape” refers to the pattern when plotted
-Here’s the “skyline silhouette” of our data
    x
    x
    x     x
x x x x x x
0 1 2 3 4 5
-Consider: symmetry, modality, kurtosis
-Do NOT “over-interpret” plots when n is small

38
Q

Location
-Mean

A

-“Eye-ball method” -> visualize where the plot would balance
~Around 25 to 35
-Arithmetic method = sum values and divide by n
~mean = 290/10 = 29

39
Q

Central location
-Median

A

-Ordered array:
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-The median has depth (n+1) / 2
- n = 10, median’s depth = (10+1) / 2 = 5.5
-Falls between 27 and 28
-When n is even, average the adjacent values
~Median = 27.5

40
Q

Spread
-Range

A

-For now, report the range (minimum and maximum values)
-Current data range is “5 to 52”
-The range is the easiest but not the best way to describe spread (better methods described later)

41
Q

Outlier

A

-An outlier is a striking deviation from the overall pattern or shape of the distribution
0 | 679
1 | 124557
2 |
3 |
4 |
5 | 0
x10

42
Q

Stemplot
-Second Example

A

-Data:
~1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
-Stem = ones-place
-Leaves = tenths-place
-Truncate extra digit (e.g., 1.47 -> 1.4)
~DO NOT plot decimal
-Center
~Between 3.4 and 3.7
-Spread
~ 1.4 to 4.4
-Shape
~Mound, no outliers

43
Q

Third Illustrative Example (n = 26)

A

-Data
14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38
-Regular stemplot
1 | 4789
2 | 2234466789
3 | 000123445678
x10
-Too squished to see the shape

44
Q

Third Illustration; Split Stem

A

-Split stem values into two ranges
~The first “1” holds leaves 0 to 4, and the second “1” holds leaves 5 to 9
-Split-stem
1 | 4
1 | 789
2 | 22344
2 | 66789
3 | 00012344
3 | 5678
x10
-Negative skew now evident

45
Q

How many stem-values?

A

-Start with between 4 and 12 stem-values
-Trial and error
~Try different stem multiplier
~Try splitting stem
~Look for most informative plot

46
Q

3.3 Body weight (pounds) of students in a class, n = 53
-Data range from 100 to 260 lbs

A

-x100 axis multiplier -> only two stem-values (1x100 and 2x100)
-x100 axis-multiplier w/ split stem -> only 4 stem values -> might be okay
-x10 axis-multiplier -> see the next card

47
Q

Fourth Stemplot Example (n = 53)

A

10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
x10
-Shape
~Positive skew, high outlier (260)
-Location
~Median about 165
-Spread
~From 100 to 260

48
Q

Frequency Table

A

-Frequency = count
-Relative frequency = proportion or %
-Cumulative frequency = % less than or equal to level

49
Q

Frequency Table with Class intervals

A

-When data are sparse, group data into class intervals
-Create 4 to 12 class intervals
-Classes can be uniform or non-uniform
-End-point convention
~First class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
-Tally frequencies
-Calculate relative frequency
-Calculate cumulative frequency

50
Q

Class Intervals

A

-Uniform class intervals table (width 10) for data
~5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Class   Freq   Relative Freq (%)   Cumulative Freq (%)
0-9     1      10                  10
10-19   1      10                  20
20-29   4      40                  60
30-39   1      10                  70
40-49   1      10                  80
50-59   2      20                  100
Total   10     100
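The table can be rebuilt with a short loop. This is a sketch assuming uniform classes of width 10 that include the lower end-point and exclude the upper one; the variable names are illustrative.

```python
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
width = 10
n = len(ages)

rows = []
cumulative = 0.0
for lower in range(0, 60, width):
    # End-point convention: include the lower bound, exclude the upper bound
    freq = sum(lower <= a < lower + width for a in ages)
    relative = 100 * freq / n          # relative frequency in %
    cumulative += relative             # cumulative frequency in %
    rows.append((f"{lower}-{lower + width - 1}", freq, relative, cumulative))

for label, freq, relative, cum in rows:
    print(f"{label:<7}{freq:<6}{relative:<8.0f}{cum:.0f}")
```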

51
Q

Histogram

A

-A histogram is a frequency chart for a quantitative measurement
~The bars will touch

52
Q

Bar Chart

A

-A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals

53
Q

Summary Statistics

A

-Central location
~Mean
~Median
~Mode
-Spread
~Range and interquartile range (IQR)
~Variance and standard deviation
-Shape Summaries
~Seldom used in practice

54
Q

Notation

A

-n = sample size
-x = the variable (e.g., ages of subjects)
-$x_i$ = the value of individual i for variable x
-$\sum$ = sum all values (capital sigma)
-Illustrative data (ages of participants)
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
n = 10
x = Age variable
$x_1 = 5, x_2 = 11, \ldots, x_{10} = 52$
$\sum x_i = x_1 + x_2 + \cdots + x_{10} = 5 + 11 + \cdots + 52 = 290$

55
Q

4.1 Central Location
-Sample Mean

A

-“Arithmetic average”
-Traditional measure of central location
-Sum the values and divide by n
-“xbar” ($\bar{x}$) refers to the sample mean
-$\bar{x} = \frac{1}{n}\sum x_i$ (see pages 77-79 in the textbook)

56
Q

Example
-Sample Mean

A

-Ten individuals selected at random have the following ages
21, 42, 5, 11, 30, 50, 28, 27, 24, 52
*Note that n = 10, $\sum x_i$ = 21 + 42 + … + 52 = 290, and $\bar{x}$ = (1/10)(290) = 29.0
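The same arithmetic as a tiny Python check (variable names are illustrative):

```python
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)        # sample size
total = sum(ages)    # sum of all x_i
xbar = total / n     # sample mean: sum the values and divide by n

print(n, total, xbar)
```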

57
Q

Uses of the Sample Mean

A

-The sample mean can be used to predict:
~The value of an observation drawn at random from the sample
~The population mean

58
Q

Population Mean

A

-$\mu = \frac{1}{N}\sum x_i$
-Same operation as the sample mean except based on the entire population (N = population size)
-Conceptually important
-Usually not available in practice
-Sometimes referred to as the expected value

59
Q

4.2 Central Location
-Median

A

-The median is the value with a depth of (n + 1) / 2
-When n is even, average the two values that straddle a depth of (n + 1) / 2
-For the 10 values listed below, the median has depth (10 + 1) / 2 = 5.5, placing it between 27 and 28
~Average these two values to get the median = 27.5
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
M = 27.5

60
Q

More Examples of Medians

A

-Ex A:
~2, 4, 6
*M = 4
-Ex B:
~2, 4, 6, 8
*M = 5
-Ex C:
~6, 2, 4
*M = 4, not 2
**(Values MUST be ORDERED first)
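The depth rule can be written out directly. A minimal sketch, with a made-up function name; it sorts first, which is exactly the step Ex C warns about.

```python
import math

def median_by_depth(values):
    """Median = value at depth (n + 1) / 2 in the ORDERED array."""
    ordered = sorted(values)                    # order first
    n = len(ordered)
    depth = (n + 1) / 2                         # 1-based position; ends in .5 when n is even
    lower = ordered[math.floor(depth) - 1]
    upper = ordered[math.ceil(depth) - 1]
    return (lower + upper) / 2                  # same value twice when n is odd

print(median_by_depth([2, 4, 6]))       # Ex A
print(median_by_depth([2, 4, 6, 8]))    # Ex B
print(median_by_depth([6, 2, 4]))       # Ex C
```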

61
Q

The Median is Robust

A

-The median is more resistant to skews and outliers than the mean; it is more robust
-This data set has a mean of 1600
1362, 1439, 1460, 1614, 1666, 1792, 1867
-Here’s the same data set with a data entry error “outlier”
~This data set has a mean of 2743
1362, 1439, 1460, 1614, 1666, 1792, 9867
-The median is 1614 in both instances, demonstrating its robustness in the face of outliers
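The comparison is easy to rerun with Python's `statistics` module. The variable names are illustrative; the second list only changes the last value to the mistyped 9867.

```python
from statistics import mean, median

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = rates[:-1] + [9867]    # same data with the data entry error

# The mean moves a lot when one value is wrong; the median does not move at all
print(mean(rates), median(rates))
print(round(mean(with_error)), median(with_error))
```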

62
Q

4.3 Mode

A

-The mode is the most commonly encountered value in the dataset
-This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
-This data set has no mode
{4, 6, 7, 8} (each point appears only once)
-The mode is useful only in large data sets with repeating values
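A short sketch of the mode, following this card's convention that a data set where every value appears once has no mode. The function name `modes` is made up, and returning a list (to allow ties) is a design choice of this example, not something the card specifies.

```python
from collections import Counter

def modes(values):
    """Return the most common value(s); empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                     # every value appears once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([4, 7, 7, 7, 8, 8, 9]))
print(modes([4, 6, 7, 8]))
```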

63
Q

4.7 Spread
-Standard Deviation

A

-Most common descriptive measure of spread
-Based on deviations around the mean
-This figure demonstrates the deviations of two of its values

64
Q

Variance and Standard Deviation

A

-Deviation = $x_i - \bar{x}$
~Sum of squared deviations = SS = $\sum (x_i - \bar{x})^2$
~Sample variance = $s^2 = \frac{SS}{n-1}$
~Sample standard deviation = $s = \sqrt{s^2}$

65
Q

Standard deviation (formula)

A

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

-Sample standard deviation s is the estimator of population standard deviation
~See “Facts About the Standard Deviation” page 93

66
Q

Illustrative Example
-Standard Deviation (p. 92)

A

Observation   Deviation       Squared deviation

36            36 - 36 = 0     0^2 = 0
38            38 - 36 = 2     2^2 = 4
39            39 - 36 = 3     3^2 = 9
40            40 - 36 = 4     4^2 = 16
36            36 - 36 = 0     0^2 = 0
34            34 - 36 = -2    (-2)^2 = 4
33            33 - 36 = -3    (-3)^2 = 9
32            32 - 36 = -4    (-4)^2 = 16
SUMS ->       0*              SS = 58
*SUM of deviations always equals zero

67
Q

Illustrative Example
-Standard Deviation (p. 92)

A

Observation   Deviation       Squared deviation

36            36 - 36 = 0     0^2 = 0
38            38 - 36 = 2     2^2 = 4
39            39 - 36 = 3     3^2 = 9
40            40 - 36 = 4     4^2 = 16
36            36 - 36 = 0     0^2 = 0
34            34 - 36 = -2    (-2)^2 = 4
33            33 - 36 = -3    (-3)^2 = 9
32            32 - 36 = -4    (-4)^2 = 16
SUMS ->       0*              SS = 58
*SUM of deviations always equals zero
-Sample variance (s^2)
~$s^2 = \frac{SS}{n-1} = \frac{58}{8-1} = 8.286$

-Standard deviation (s)
~$s = \sqrt{s^2} = \sqrt{8.286} = 2.88$
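The table and the two formulas can be checked step by step. A minimal sketch using the eight observations from the table; variable names are illustrative.

```python
import math

obs = [36, 38, 39, 40, 36, 34, 33, 32]

n = len(obs)
xbar = sum(obs) / n                      # sample mean
deviations = [x - xbar for x in obs]     # x_i - xbar
ss = sum(d ** 2 for d in deviations)     # sum of squared deviations (SS)
variance = ss / (n - 1)                  # s^2 = SS / (n - 1)
sd = math.sqrt(variance)                 # s = square root of s^2

print(xbar, sum(deviations), ss, round(variance, 3), round(sd, 2))
```

Dividing by n - 1 rather than n is what makes this the sample variance.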

68
Q

Interpretation of Standard Deviation

A

-Measures spread (e.g., if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., variability)

69
Q

4.5 Spread
-Quartiles

A

-Two distributions can be quite different yet can have the same mean
-These data compare particulate matter in air samples (µg/m^3) at two sites
~Both sites have a mean of 36, but Site 1 exhibits much greater variability
*We would miss the high pollution days if we relied solely on the mean
Site 1 |   | Site 2
    42 | 2 |
     8 | 2 |
     2 | 3 | 234
    86 | 3 | 6689
     2 | 4 | 0
       | 4 |
       | 5 |
       | 5 |
       | 6 |
     8 | 6 |
x10

70
Q

Spread
-Range

A

-Range = maximum - minimum
-Illustrative example
~Site 1 range = 68 - 22 = 46
~Site 2 range = 40 - 32 = 8
-Beware:
~The sample range will tend to underestimate the population range
-Always supplement the range with at least one additional measure of spread
Site 1 |   | Site 2
    42 | 2 |
     8 | 2 |
     2 | 3 | 234
    86 | 3 | 6689
     2 | 4 | 0
       | 4 |
       | 5 |
       | 5 |
       | 6 |
     8 | 6 |
x10

71
Q

Spread
-Quartiles

A

-Quartile 1 (Q1)
~Cuts off bottom quarter of data = median of the lower half of the data set
-Quartile 3 (Q3)
~Cuts off top quarter of data = median of the upper half of the data set
-Interquartile Range (IQR) = Q3-Q1 covers the middle 50% of the distribution
5, 11, 21, 24, 27, 28, 30, 42, 50, 52
Q1 = 21, Q3 = 42, and IQR = 42-21 = 21

73
Q

Quartiles (Tukey’s Hinges)
-Example 2: data are metabolic rates (cal/day), n = 7

A

1362, 1439, 1460, 1614, 1666, 1792, 1867
Median = 1614
-When n is odd, include the median in both halves of the data set
-Bottom half:
~1362, 1439, 1460, 1614, which has a median of 1449.5 (Q1)
-Top half:
~1614, 1666, 1792, 1867, which has a median of 1729 (Q3)
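Tukey's hinges can be sketched as "median of each half", with the overall median included in both halves when n is odd. The function name is made up; note that other software may use different quartile rules and give slightly different values.

```python
from statistics import median

def tukey_hinges(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                  # when n is odd, the median joins both halves
    q1 = median(ordered[:half])          # lower half
    q3 = median(ordered[n - half:])      # upper half
    return q1, q3

print(tukey_hinges([1362, 1439, 1460, 1614, 1666, 1792, 1867]))
print(tukey_hinges([5, 11, 21, 24, 27, 28, 30, 42, 50, 52]))
```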

74
Q

Five-Point Summary

A

-Q0 (the minimum)
-Q1 (25th percentile)
-Q2 (median)
-Q3 (75th percentile)
-Q4 (the maximum)

75
Q

4.6 Boxplots

A

-Calculate 5-point summary
~Draw box from Q1 to Q3 with line at median
-Calculate IQR and fences as follows
~Fence lower = Q1-1.5(IQR)
~Fence upper = Q3 + 1.5(IQR)
*DO NOT DRAW FENCES
-Determine if any values lie outside the fences (outside values)
~If so, plot these separately
-Determine values inside the fences (inside values)
~Draw whisker from Q3 to upper inside value
~Draw whisker from Q1 to lower inside value
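The fence and whisker rules can be sketched without any plotting. This assumes Q1 and Q3 have already been found (for example with the hinge method); the function name and dictionary keys are invented for the example.

```python
def boxplot_stats(values, q1, q3):
    """Fences, whisker ends (inside values), and outside values for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "fences": (lower_fence, upper_fence),      # used for the check, not drawn
        "whiskers": (min(inside), max(inside)),    # lower and upper inside values
        "outside": outside,                        # plotted separately
    }

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
stats = boxplot_stats(ages, q1=21, q3=42)
print(stats)
```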

76
Q

Illustrative Example
-Boxplot

A

Data: 5, 11, 21, 24, 27, 28, 30, 42, 50, 52
-5 pt summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line @ 27.5
-IQR = 42 - 21 = 21
~Fu = Q3 + 1.5 (21) = 73.5
~Fl = Q1 - 1.5 (21) = -10.5
-No values lie above the upper fence or below the lower fence
-Upper inside value = 52
-Lower inside value = 5
-Draw whiskers

77
Q

Illustrative Example
-Boxplot 2

A

-5 pt summary
~3, 22, 25.5, 29, 51: draw a box
-IQR = 29 - 22 = 7
~Fu = Q3 + 1.5 (7) = 39.5
~Fl = Q1 - 1.5 (7) = 11.5
-One above the top fence (51) and one below the bottom fence (3)
-Upper inside value is 31
-Lower inside value is 21
-Draw whiskers

78
Q

Illustrative Example
-Boxplot 3

A

-Seven metabolic rates
1362, 1439, 1460, 1614, 1666, 1792, 1867
-5 pt summary
~1362, 1449.5, 1614, 1729, 1867
-IQR = 1729 - 1449.5 = 279.5
~Fu = Q3 + 1.5 (279.5) = 2148.25
~Fl = Q1 - 1.5 (279.5) = 1030.25
-None outside
-Whiskers end @ 1867 and 1362

79
Q

Boxplots
-Interpretation

A

-Location
~Position of median
~Position of box
-Spread
~Hinge-spread (IQR)
~Whisker-to-whisker spread
-Shape
~Symmetry or direction of skew
~Long whiskers (tails) indicate leptokurtosis

80
Q

Side-by-side boxplots

A

-Boxplots are especially useful when comparing groups

81
Q

Choosing Summary Statistics

A

-Always report a measure of a central location, a measure of spread, and the sample size
-Symmetrical mound-shaped distributions -> report the mean and standard deviation
-Odd-shaped distributions -> report 5-point summaries (or median and IQR)

82
Q

Definitions

A

-Random variable = a numerical quantity that takes on different values depending on chance
-Ex:
~Number of smokers in a simple random sample of size n, the ages of subjects selected at random at UNR
-Sample Space = the set of all possible values from a random variable
-Ex:
~If the subject’s age is a random variable of interest, the set of all possible values for this random variable is???
-Event = an outcome or set of outcomes from random variables
-Probability = the proportion of times an event is expected to occur in the population
-Ex:
~Roll a fair die: the probability that the die lands on “one”
*Ideas about probability are founded on relative frequencies (proportions) in populations

83
Q

Die Example

A

-Random Variable
~The number on the face
-Population (Sample Space): (not a population of people)
{1, 2, 3, 4, 5, 6}
-Event: 1
-Probability: 1/6
EX:
~Event: 5 or 6
~Probability: 2/6 or 1/3

84
Q

Probability Illustrated

A

-In a given year, there were 42,636 traffic fatalities in a population of N= 293,655,000
-If one randomly selects a person from this population, what is the probability that they will experience a traffic fatality by the end of that year?
-ANS
~The relative frequency of that event in the population = 42,636 / 293,655,000 = 0.0001452
*Thus, Pr(traf. fatality) = 0.0001452 (about 1 in 6887)

85
Q

5.2 Random Variables

A

-Random variable = a numerical quantity that takes on different values depending on chance
-Two types of random variables
-Discrete random variables
~A countable set of possible outcomes
*X = number of smokers (cannot have half of a person)
-Continuous random variable
~An unbroken continuum of possible outcomes
*Weight in pounds (any value in a range is possible, though a weight of exactly 0 cannot occur)

86
Q

Discrete Random Variables

A

-Discrete Random Variables
~A countable set of possible outcomes
*The variable number of leukemia cases in a geographic region in a given period
*The variable number of successes in n independent treatments
*The variable number of smokers in a simple random sample of size n
-Continuous random variable
~An unbroken continuum of possible outcomes
*The variable Amount of time it takes to complete a task
*The variable Height of an individual selected at random

87
Q

5.3 Discrete Random Variables

A

-Probability mass function (pmf) = a mathematical relation that assigns probabilities to all possible outcomes for discrete random variables
-Illustrative example:
~One rolls a die 2 times
*Let X = the variable number of times one gets six
*This is the pmf for the random variable
x           0        1        2
Pr(X = x)   0.6944   0.2778   0.0278
-Illustrative example 2: “Four Patients”
~Suppose one treats four patients with an intervention that is successful 75% of the time
*Let X = the variable number of successes in this experiment
*This is the pmf for this random variable
x           0        1        2        3        4
Pr(X = x)   0.0039   0.0469   0.2109   0.4219   0.3164

88
Q

Operations on Events

A

-Intersection
~For two events A and B, the intersection A ∩ B represents the event that both A and B occur
-Union
~For two events A and B, the union A ∪ B represents the event that A or B occurs
*A occurs without B, B occurs without A, or A and B both occur
-Complement
~For an event A, the complement of A represents the event that occurs if A does not occur. It is typically denoted by A-bar

89
Q

Properties of Probabilities

A

-Property 1
~Probabilities are always between 0 and 1
-Property 2
~A sample space is all possible outcomes
*The probabilities in the sample space sum to 1 (exactly)
-Property 3
~The complement of an event is “the event not happening”
*The probability of a complement is 1 minus the probability of the event
**Pr(rain tomorrow) = 0.6
**Pr(not rain tomorrow) = 0.4
-Property 4
~Probabilities of disjoint events can be added
*Pr(X = 1) + Pr(X = 2)
**X = number on the die

90
Q

Properties of Probabilities in Symbols

A

-Property 1.
0 ≤ Pr(A) ≤ 1
-Property 2.
Pr(S) = 1, where S represents the sample space (all possible outcomes)
-Property 3.
Pr (A-bar) = 1- Pr(A), A-bar represents the complement of A (NOT A)
-Property 4.
If A and B are disjoint, then Pr(A or B) = Pr(A) + Pr(B)

91
Q

Properties 1 and 2 Illustrated

A

-Figure 5.2
-Property 1.
0 ≤ Pr(A) ≤ 1
~Note that all individual probabilities are between 0 and 1
-Property 2
Pr(S) = 1
~Note that the sum of all probabilities = .0039 + .0469 + .2109 + .4219 + .3164 = 1

92
Q

Property 3 Illustrated

A

-Property 3
Pr (A-bar) = 1- Pr(A)
~As an example, let A represent 4 successes
Pr (A) = .3164
-Let A-bar represent the complement of A (“NOT A”), which is “3 or fewer”
Pr(A-bar) = 1 - Pr(A) = 1 - 0.3164 = 0.6836

93
Q

Property 4 Illustrated

A

-Property 4
Pr(A or B) = Pr (A) + Pr(B) for disjoint events
~Let A represent 4 successes
~Let B represent 3 successes
-Since A and B are disjoint, Pr(A or B) = Pr(A) + Pr(B) = 0.3164 + 0.4219 = 0.7383
-The probability of observing 3 or 4 successes is 0.7383 or about 74%

94
Q

Area Under the Curve (AUC)

A

-The area under the curve (AUC) of a pmf corresponds to probability
-Pr(X = 2)
~Area of shaded region = height x base
*0.2109 x 1.0 = 0.2109

95
Q

Cumulative Probability

A

-“Cumulative probability” refers to the probability of that value or less
-Notation
Pr(X ≤ x)
-Corresponds to AUC to the left of the point (“left tail”)
-Ex:
Pr(X ≤ 2)
~Shaded “tail”
0.0039 + 0.0469 + 0.2109 = 0.2617
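Cumulative probabilities are running sums of the pmf. A small sketch using the rounded “Four Patients” pmf from the earlier card; the dictionary names are illustrative.

```python
from itertools import accumulate

# pmf for X = number of successes among four patients (rounded values from the card)
pmf = {0: 0.0039, 1: 0.0469, 2: 0.2109, 3: 0.4219, 4: 0.3164}

# cdf[x] = probability of x or fewer successes (left tail, including x)
cdf = dict(zip(pmf, accumulate(pmf.values())))

print(round(cdf[2], 4))        # 2 or fewer successes
print(round(1 - cdf[3], 4))    # complement of "3 or fewer", i.e. exactly 4
```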

96
Q

Mean and Variance of a Discrete Random Variable

A

-Definitional formula for mean or expectation (p.111)
$\mu = E(X) = \sum x \cdot \Pr(X = x)$

-Definitional formula for variance (p.111)
$\sigma^2 = \sum (x - \mu)^2 \cdot \Pr(X = x)$

97
Q

Expected Mean

A

x           0        1        2        3        4
Pr(X = x)   0.0039   0.0469   0.2109   0.4219   0.3164
How to calculate the expected mean?
$\mu = 0(0.0039) + 1(0.0469) + 2(0.2109) + 3(0.4219) + 4(0.3164) = 3$

98
Q

Variance

A

x           0        1        2        3        4
Pr(X = x)   0.0039   0.0469   0.2109   0.4219   0.3164
How to calculate the variance?
$\sigma^2 = (0-3)^2(0.0039) + (1-3)^2(0.0469) + (2-3)^2(0.2109) + (3-3)^2(0.4219) + (4-3)^2(0.3164) = 0.75$
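Both calculations are weighted sums over the pmf. A minimal sketch with the rounded probabilities from the card, so the results match 3 and 0.75 only up to rounding.

```python
xs = [0, 1, 2, 3, 4]
probs = [0.0039, 0.0469, 0.2109, 0.4219, 0.3164]   # rounded pmf values

# Expected value: each outcome weighted by its probability
mu = sum(x * p for x, p in zip(xs, probs))

# Variance: squared distance from mu, weighted by probability
var = sum((x - mu) ** 2 * p for x, p in zip(xs, probs))

print(round(mu, 2), round(var, 2))
```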

99
Q

5.4 Continuous Random Variables

A

-Continuous random variables form a continuum of possible values
-As an illustration, consider the spinner
-The spinner will generate a continuum of random numbers between 0 and 1
-A probability density function (pdf) is a mathematical relation that assigns probabilities to all possible outcomes for a continuous random variable
-The pdf for our random spinner is shown here
-The shaded area under the curve represents probability (width x height, with height = 1), in this instance
Pr(0 < X < 0.5) = (0.5 - 0) * 1 = 0.5
Pr(0.25 < X < 0.5) = (0.5 - 0.25) * 1 = 0.25
Pr(X > 0.7) = (1 - 0.7) * 1 = 0.3
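For the spinner, probability is just the width of the interval, because the pdf is a flat line of height 1 between 0 and 1. A sketch with a made-up helper name:

```python
def uniform_prob(a, b, low=0.0, high=1.0):
    """Pr(a < X < b) for a uniform pdf on [low, high]: area = width x height."""
    a, b = max(a, low), min(b, high)    # clip the interval to where the pdf is non-zero
    width = max(b - a, 0.0)
    height = 1.0 / (high - low)         # height that makes the total area equal 1
    return width * height

print(uniform_prob(0, 0.5))
print(uniform_prob(0.25, 0.5))
print(round(uniform_prob(0.7, 1.0), 4))
```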

100
Q

Examples of pdfs

A

-pdfs obey all the rules of probabilities
-pdfs come in many forms (shapes)
~Uniform pdf
~Normal pdf
~Chi-square pdf
~Exercise 5.13 pdf
*The most common pdf is the Normal (we study the Normal pdf in detail in the next chapter)

101
Q

Area Under the Curve

A

-As was the case with pmfs, pdfs display probability with the area under the curve (AUC)
-This histogram shades the bars corresponding to ages < 9 (~40% of the histogram's area)
-This shaded AUC on the Normal pdf curve also corresponds to ~40% of the total area
X = age
X ~ Normal
Pr(X < 9) = 0.4

102
Q

6.1 Binomial Random Variables

A

-Binomial = a family of discrete random variables
-Binomial random variable = the random number of successes in n independent Bernoulli trials (a Bernoulli trial has two possible outcomes: “success” or “failure”)
-Binomial random variables have two parameters
~n = number of trials
~p = probability of success of each trial

103
Q

Binomial Example

A

-Consider the random number of successful treatments when treating four patients
-Suppose the probability of success in each instance is 75%
-The random number of successes can vary from 0 to 4
-The random number of successes is a binomial with parameters n = 4 and p = 0.75
-Notation
~Let X ~ b(n, p) represent a binomial random variable with parameters n and p
*The illustrative variable is X ~ b(4, 0.75)

104
Q

6.2 Calculating Binomial Probabilities

A

-Formula for binomial probabilities
Pr(X = x) = nCx p^x q^(n-x)
-Where
~nCx = the binomial coefficient (see the Binomial Coefficient card)
~p = probability of success for each trial
~q = probability of failure = 1-p

105
Q

Binomial Coefficient

A

-Formula for the binomial coefficient
nCx = (n!) / (x! (n-x)!)
-Where ! represents the factorial function, calculated as
x! = x * (x-1) * (x-2) * … * 1
-Ex:
~4! = 4 * 3 * 2 * 1 = 24
-By definition 1! = 1 and 0! = 1
-Ex:
4C2 = 4! / (2! (4-2)!) = 4! / (2! 2!) = (4 * 3 * 2 * 1) / ((2 * 1)(2 * 1)) = 24 / 4 = 6
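
The factorial formula can be sketched directly and compared with Python's built-in `math.comb`. The helper name `choose` is an illustrative choice.

```python
from math import comb, factorial

def choose(n, x):
    """Binomial coefficient nCx = n! / (x! (n - x)!)."""
    return factorial(n) // (factorial(x) * factorial(n - x))

print(factorial(4))      # 4 * 3 * 2 * 1
print(choose(4, 2))      # ways to choose 2 items out of 4
print(comb(4, 2))        # standard-library equivalent of choose(4, 2)
```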

106
Q

Binomial Coefficient Cont.

A

nCx = (n!) / (x! (n-x)!)
-The binomial coefficient is called the “choose function” because it tells you the number of ways you could choose x items out of n
nCx = the number of ways to choose x items out of n
-Ex:
4C2 = 6 means there are six ways to choose two items out of four

107
Q

Binomial Calculation
-Example

A

-Recall the “Four patients example”
-Four patients; probability of success of each treatment = 0.75
-The number of successes is the binomial random variable X ~ b(4, 0.75)
-Note q = 1 - 0.75 = 0.25
-What is the probability of observing 0 successes under these circumstances?
Pr(X = 0) = nCx p^x q^(n-x)
= 4C0 * 0.75^0 * 0.25^(4-0)
= (4!) / (0! * 4!) * 0.75^0 * 0.25^4
= 1 * 1 * 0.0039
= 0.0039
Pr(X = 1) = 4C1 * 0.75^1 * 0.25^(4-1)
= 4 * 0.75 * 0.0156
= 0.0469
Pr(X = 2) = 4C2 * 0.75^2 * 0.25^(4-2)
= 6 * 0.5625 * 0.0625
= 0.2109
Pr(X = 3) = 4C3 * 0.75^3 * 0.25^(4-3)
= 4 * 0.4219 * 0.25
= 0.4219
Pr(X = 4) = 4C4 * 0.75^4 * 0.25^(4-4)
= 1 * 0.3164 * 1
= 0.3164
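
The hand calculations above can be reproduced with a small sketch of the binomial formula. The function name `binomial_pmf` is an assumption for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ b(n, p), using nCx * p^x * q^(n - x)."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)

# "Four patients" example: n = 4, p = 0.75
for x in range(5):
    print(x, round(binomial_pmf(x, 4, 0.75), 4))
```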

108
Q

Area Under the Curve

A

-Recall the area under the curve (AUC) concept
AUC = probability

109
Q

6.3 Cumulative Probability

A

-Recall the cumulative probability concept
-Cumulative probability = the probability of that value or less
-Pr(X <= x)
-Corresponds to the left tail of the pmf

110
Q

Cumulative Probability Function

A

-The cumulative probability function lists cumulative probabilities for all possible outcomes
-Ex:
~The cumulative probability function for X ~ b(4, 0.75)
Pr(X <= 0) = 0.0039
Pr(X <= 1) = 0.0508
Pr(X <= 2) = 0.2617
Pr(X <= 3) = 0.6836
Pr(X <= 4) = 1.0000
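
A cumulative probability function is a running total of the pmf. The sketch below builds it for the four-patient example; the variable names `pmf` and `cdf` are illustrative.

```python
from itertools import accumulate
from math import comb

n, p = 4, 0.75
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# Running total of the pmf gives Pr(X <= x) for x = 0, 1, ..., n
cdf = list(accumulate(pmf))

for x, value in enumerate(cdf):
    print(f"Pr(X <= {x}) = {value:.4f}")
```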

111
Q

6.5 Expected Value and Variance for Binomials

A

-The expected value (mean) mu of a binomial pmf is its “balance point”
-The variance sigma^2 measures its spread
-Shortcut formulas
mu = np
sigma^2 = npq

112
Q

Expected Value and Variance, Binomials, illustration

A

-For the “Four patients” pmf of X~b(4, 0.75)
mu = n*p = 4(0.75) = 3
sigma^2 = n(p)(q) = 4(0.75)(0.25) = 0.75
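
As a quick check, the shortcut formulas np and npq should agree with the definitional sums taken over the exact binomial pmf. The variable names below are illustrative.

```python
from math import comb

n, p = 4, 0.75
q = 1 - p
pmf = {x: comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)}

# Definitional formulas applied to the exact pmf
mu_def = sum(x * pr for x, pr in pmf.items())
var_def = sum((x - mu_def) ** 2 * pr for x, pr in pmf.items())

# Shortcut formulas for binomials
mu_short = n * p
var_short = n * p * q

print(mu_def, mu_short)
print(var_def, var_short)
```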

113
Q

6.6 Using the Binomial

A

-Suppose we observe 2 successes in the “Four patients” example
-Note mu = 3, suggesting we should see 3 successes on average
-Does the observation of 2 successes cast doubt on p = 0.75?
-No, because Pr(X <= 2) = 0.2617 is not too unusual

114
Q

Normal Distributions

A

-Normal random variables are the most common type of continuous random variable
-More importantly, they describe the behavior of sample means

115
Q

Normal Probability Density Function

A

-Recall that continuous random variables are described with smooth probability density functions (pdfs); see Chapter 5
-Normal pdfs are recognized by their familiar bell-shape

116
Q

Figure 7.1

A

-Histogram with overlying Normal Curve
~The overlying curve represents its Normal pdf model

117
Q

Area Under the Curve

A

-The darker bars of the histogram in Figure 7.2 correspond to ages less than or equal to 9 (~40% of observations)
-This darker area under the curve (see Figure 7.3) also corresponds to ages less than 9 (~40% of the total area)

118
Q

Figure 7.2

A

-Proportion less than 9 shaded darker color

119
Q

Figure 7.3

A

-Proportion less than 9 (area under the curve)
~This shaded area is the probability associated with the range 0-9 years old
f(x) = (1 / (sigma * sq root (2 pi))) * e^(-(1/2) ((x - mu) / sigma)^2)
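
The Normal pdf formula can be written out as a short sketch and compared with the standard library's `statistics.NormalDist`. The function name `normal_pdf` is an illustrative assumption.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the Normal density curve at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z**2) / (sigma * sqrt(2 * pi))

# Compare with the standard library at an arbitrary point
print(normal_pdf(1.0, 0.0, 1.0))
print(NormalDist(mu=0.0, sigma=1.0).pdf(1.0))
```

Note that the pdf gives the height of the curve, not a probability; probabilities come from areas under it.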

120
Q

Parameters mu and sigma

A

-Normal pdfs are a family of distributions
-Family members are identified by the parameters
mu (mean) and
sigma (standard deviation)
-mu controls location (see Figure 7.4)
-sigma controls spread (see Figure 7.5)

121
Q

Standard Deviation sigma

A

-Points of inflection (where the slope of the curve begins to level) occur one sigma below and one sigma above mu

122
Q

Normal Distribution

A

-Normal distribution is often written as N(mu, sigma^2) to indicate that the density curve depends upon the parameters mu and sigma^2, which are the mean and variance of the random variable
~mu corresponds to the middle of the curve
~sigma^2 determines the spread of the curve
-The standard Normal Distribution is a normal distribution with mu = 0, sigma = 1

123
Q

Standard Normal Distribution

A

-A standard normal random variable is generally denoted as Z
~The area between a and b under the standard normal density curve provides the probability that Z will assume a value over the interval (a,b): P(a<Z<b)

124
Q

Example

A

-Let Z be a standard normal random variable
~Find the following probabilities using Table B
a) P(Z < 1.96) = 0.9750
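b) P(-2.00 < Z < 2.00) = P(Z < 2.00) - P(Z < -2.00) = 0.9772 - 0.0228 = 0.9544
c) P(Z > -1.28) = 1 - P(Z < -1.28) = 1 - 0.1003 = 0.8997
d) P(-5.13 < Z < 2.00) = P(Z < 2.00) - P(Z < -5.13) = 0.9772 - 0 = 0.9772 (the area below -5.13 is essentially 0)
e) P(Z = 1.71) = 0 (a single point has no area under the curve)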
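
In place of Table B, the same probabilities can be sketched with `statistics.NormalDist` from the Python standard library, whose `cdf` method returns the cumulative (left-tail) probability. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard Normal

a = Z.cdf(1.96)                      # P(Z < 1.96)
b = Z.cdf(2.00) - Z.cdf(-2.00)       # P(-2.00 < Z < 2.00)
c = 1 - Z.cdf(-1.28)                 # P(Z > -1.28)
d = Z.cdf(2.00) - Z.cdf(-5.13)       # P(-5.13 < Z < 2.00)

for label, value in zip("abcd", (a, b, c, d)):
    print(label, round(value, 4))
```

Software works with unrounded values, so an answer can differ from the table-based one in the fourth decimal place (part b is one such case).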

125
Q

7.2 Determining Normal Probabilities

A

-To determine a Normal Probability
~State the problem
~Standardize the value (z score)
~Sketch and shade the curve
~Use Table B to determine the probability

126
Q

Standard Normal (Z) Variable

A

-Standard Normal Variable = a Normal random variable with mu = 0 and sigma = 1
-Called “z variables”
-Notation
Z ~N(0,1)
-Use Table B to look up cumulative probabilities

127
Q

Figure 7.11

A

-Portion of Table B highlighting P (Z < 1.96) = 0.9750

128
Q

Example: Normal Probability
-Step 1: Statement of Problem

A

-We want to determine the percentage of human gestations that are less than 40 weeks in length
-We know that uncomplicated human pregnancy from conception to birth is approximately Normally distributed with mu = 39 weeks and sigma = 2 weeks
*Note: clinicians measure gestation from the last menstrual period to birth, which adds 2 weeks to mu
-Let X represent human gestation in weeks
X ~ N(39, 2), i.e., mu = 39 and sigma = 2
-Statement of the problem
Pr(X < 40) = ?

129
Q

Normal Probability
-Step 2: Standardize

A

-To standardize, subtract mu and divide by sigma
Z = (x-mu) / sigma
-The z-score tells how many sigma-units the value falls above or below mu
-Ex:
~The value 40 from X~N(39,2) has
Z= (40-39) / 2 = 0.5
Pr(Z < 0.5) = 0.6915

130
Q

Normal Probability
-Steps 3 and 4: Sketch and use Table B

A

-Sketch and label axes
-Use Table B to look up
Pr(Z < 0.5) = 0.6915
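
The standardize-then-look-up steps for the gestation example can be sketched as follows, with `NormalDist().cdf` standing in for Table B. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 39, 2          # gestation in weeks, as on the cards
x = 40

z = (x - mu) / sigma                         # standardize the value
prob_table = NormalDist().cdf(z)             # Pr(Z < z), what Table B tabulates
prob_direct = NormalDist(mu, sigma).cdf(x)   # same answer without standardizing

print(z, round(prob_table, 4), round(prob_direct, 4))
```

Both routes give the same probability; standardizing is only needed when working from a printed z table.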

131
Q

Probabilities Between Two Points

A

-Let a represent the lower boundary and b represent the upper boundary of a range
Pr(a < Z < b) = Pr (Z < b) - Pr (Z < a)

132
Q

7.3 Looking up the z percentile value

A

-Use Table B to look up the z-percentile value
~Ex:
*The z-score for the probability in question
-Look inside the table for the entry closest to the associated cumulative probability
-Then trace the z score to the row and column labels

133
Q

Looking up the (Z) percentile value

A

-Suppose one wanted the 97.5th percentile z score
~Look inside the table for 0.975
*Then trace the z-score to the margins
-Notation
~Let Zp represent the z-score with cumulative probability p
~EX:
Z.975 = 1.96
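
The reverse lookup (from cumulative probability to z-score) corresponds to the inverse cdf. A minimal sketch using the standard library's `NormalDist.inv_cdf`, with an illustrative variable name:

```python
from statistics import NormalDist

# z-score with cumulative probability 0.975 (the 97.5th percentile)
z_975 = NormalDist().inv_cdf(0.975)
print(round(z_975, 2))
```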

134
Q

8.1 Concepts

A

-Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty
~We are curious about parameters in the population
~We calculate statistics in the sample

135
Q

Parameters and Statistics

A

-It is essential to draw the distinction between parameters and statistics
                      Parameters       Statistics
Source                Population       Sample
Calculated?           No               Yes
Constant?             Yes              No
Notation (examples)   mu, sigma, p     x-bar, s, p̂

136
Q

8.2 Sampling Behavior of a mean

A

-How precisely does a given sample mean reflect the underlying population mean?

137
Q

Sampling

A

-Age
~Population: 65 students in our CHS 280 class
-Which sample mean reflects the underlying population mean more precisely?
-If the sample size is 3
~Sampling distribution of the sample mean
N = 65
mu = population mean age
X-bar1 = (18 + 18 + 19) / 3 = 18.33
X-bar2 = (19 + 20 + 21) / 3 = 20
-If the sample size is 50
~Sampling distribution of the sample mean
N = 65
X-bar1 = (…+…+…+…) / 50 =

138
Q

Standard deviation of population and sampling distribution

A

-Population (Individual observation)
Sigma
-Sampling Distribution of x-bar
Sigma / (sq root n)

139
Q

Standard deviation of sampling distribution of x-bar

A

-Standard error of the mean
sigma_x-bar = SE_x-bar = sigma / (sq root n)
*The square root law says the SE of the mean is inversely proportional to the square root of the sample size

140
Q

Example
-The Wechsler Adult Intelligence Scale has sigma = 15

A

-For n = 1 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 1 = 15
-For n = 4 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 4 = 7.5
-For n = 16 -> SE_x-bar = sigma / (sq root n) = 15 / sq root 16 = 3.75
*Quadrupling the sample size cuts the SE in half (square root law)
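
The square root law is easy to see numerically. In this sketch the helper name `standard_error` is an illustrative choice.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

# sigma = 15; each step quadruples n, so each SE is half the previous one
for n in (1, 4, 16):
    print(n, standard_error(15, n))
```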

141
Q

Figure 8.2

A

-Sampling distribution of the mean based on n = 10 compared with population values, Wechsler Adult Intelligence Scale scores

142
Q

Central limit Theorem

A

-Sampling distribution of x-bar tends toward Normality even when the population distribution is not Normal
~This effect is strong in large samples

143
Q

Law of Large Numbers

A

-As the sample size gets larger and larger, the sample mean tends to get closer and closer to mu

144
Q

Adult Intelligence Scale Example

A

-Wechsler Adult Intelligence Scale (WAIS) scores vary according to a Normal distribution with mu = 100 and sigma = 15
a) What can we say about the sampling distribution of a mean based on an SRS of 10 such scores?
mu = 100, sigma = 15
SE_x-bar = sigma / (sq root n) = 15 / (sq root 10) = 4.7434
x-bar ~ N(100, 4.74)
b) What is the probability of getting an x-bar less than 90?
Pr(x-bar < 90) = ?
z = (x-bar - mu) / SE_x-bar = (90 - 100) / 4.74 = -2.11
Pr(x-bar < 90) = Pr(Z < -2.11) = 0.0174
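
A minimal sketch of the same calculation, using `statistics.NormalDist` with the standard error as the spread of the sampling distribution. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 15, 10
se = sigma / sqrt(n)                 # standard error of the mean

z = (90 - mu) / se                   # standardized value of x-bar = 90
prob = NormalDist(mu, se).cdf(90)    # Pr(x-bar < 90)

print(round(se, 4), round(z, 2), round(prob, 4))
```

Because the code keeps the unrounded standard error and z-score, its probability can differ slightly from the table-based 0.0174.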

145
Q

8.3 Sampling Behavior of Counts and Proportions

A

-Binomial Random Variable
~Random number of successes (X) in n independent “success/ failure” trials
~Probability of success for each trial is p
-Notation X~b(n,p)
~When n is large (npq >= 5), we can use the Normal approximation to the binomial

146
Q

Normal Approximation for a Binomial Count

A

mu = np and sigma = sq root (npq)
-When Normal approximation applies
X ~ N(np, sq root (npq))

147
Q

Normal Approximation for a Binomial Proportion

A

-mu = p and sigma = sq root ((pq) / n)
p̂ ~ N(p, sq root ((pq) / n))

148
Q

Example

A

Recent statistics claim the prevalence of maternal smoking is quite low, at only 5%
~Suppose another research group sampled 107 pregnant mothers in their third trimester
n = 107
p = 0.05
q = 0.95

149
Q

Example

A

n = 107
p = 0.05
q = 0.95
A. Can we assume an approximation to the normal distribution for this case?
npq = 107 * 0.05 * 0.95 = 5.0825 >= 5, so yes, the Normal approximation is reasonable

150
Q

Example

A

n = 107
p = 0.05
q = 0.95
B. Calculate the probability that at least 12 of the 107 mothers smoked during pregnancy, using a Normal approximation
mu = np = 107 * 0.05 = 5.35
sigma = sq root (npq) = sq root (107 * 0.05 * 0.95) = 2.2544
X ~ N(5.35, 2.2544)
z = (12 - 5.35) / 2.2544 = 2.95
Pr(Z < 2.95) = 0.9984
Pr(X >= 12) = Pr(Z > 2.95) = 1 - 0.9984 = 0.0016
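
The maternal smoking example can be sketched end to end as below. Like the card, this sketch applies no continuity correction; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p = 107, 0.05
q = 1 - p

mu = n * p                       # expected count
sigma = sqrt(n * p * q)          # standard deviation of the count
approx_ok = n * p * q >= 5       # rule of thumb from the cards

z = (12 - mu) / sigma            # standardize the observed count
prob = 1 - NormalDist().cdf(z)   # Pr(Z > z), the upper tail

print(approx_ok, round(mu, 2), round(sigma, 4), round(z, 2), round(prob, 4))
```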