Lectures 1-4 Flashcards
Frequentist Statistics
What is the probability of a wrong decision about the treatment effect?
►What should we conclude from the observed data given a specified null hypothesis?
Bayesian Statistics
What should we believe about the treatment effect given the data that are observed?
Likelihood Inference
What is the evidence about the treatment effect given the data that are observed?
Biostatistics
The science of learning from biomedical data involving appreciable variability or uncertainty
The application of statistical reasoning and methods to the solution of biological, medical, and public health problems
►The scientific use of quantitative information to describe or draw inferences about natural phenomena
►Scientific—accepted theory (ideas) and practice; ethical standards
►Quantitative information—data reflecting variation in populations
►Inference—to conclude or surmise from evidence
Generate hypotheses
Ask questions
►Falsifiable
Design and conduct studies to generate evidence
Collect data
Descriptive statistics
Describe the distributions of observations
Statistical inference
Assess strength of evidence in favor of competing hypotheses
►Use data to update beliefs and make decisions
Also known as confirmatory data analysis (CDA)
►Draw conclusions about a population (whole group; true mechanism) from a sample (representative part of a group; “trials”)
►Assess strength of evidence in support of competing hypotheses
►Make comparisons
►Make decisions
►Make predictions
Design of a Study
Ask a precise, testable, and appropriate question
►Choose a research approach and design
►Define outcome of interest
►Define comparison groups
►Choose a population to study
►Implementation: collect data
Descriptive Statistics / Exploratory Data Analysis (EDA)
Organization and summarization of data
►Graphical display to visualize important patterns and variation
►Hypothesis generating
Explanations
hypotheses about mechanisms
Simple
scientists prefer simple, rather than complex, explanations
►Occam’s razor
►Principle of parsimony
Interrelationships
associations; causal connections
Variable
a characteristic taking on different values
Random variable
a variable for which the values obtained are usually thought of as arising partly as a result of chance factors
Response variable (𝒀)
the outcome measure; that which may be affected or caused; often a health measure
Explanatory variables (𝑿)
those that affect or cause the response:
►Treatment (intervention)—explanatory variable that can be controlled by the scientist
►Risk factors—explanatory variables that influence the risk of the outcome; of scientific interest (e.g., smoking, salt intake, environment) and usually cannot be controlled
Quantitative
concept of amount; numerical
Discrete variables
gaps in values; e.g., number of births, number of drinks per week
Continuous variables
no gaps in values; e.g., blood pressure, age, height, time to seroconversion
Special case
time-to-event data in which we need to deal with “censoring”
Qualitative
concept of attribute; categorical
Nominal scale
Binary or dichotomous—e.g., disease status (diseased or not diseased), vital status (alive or dead)
●Polychotomous or polytomous—e.g., occupation, marital status
Ordinal or ordered scale
e.g., ratings, preferences
Variation
refers to the differences among a set of measurements
Natural variation
differences among persons (experimental units) in the “true” values of the variable of interest
Measurement variation (or error)
differences between the measured and true values
Bias
difference between the average (expected) value of a measurement (variable) and the true value that it targets
Variance
variation among measurements about their average or mean value, even if that mean differs from the true targeted value
Mean Squared Error
MSE= variance + bias^2
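The decomposition above can be checked numerically. This is a minimal sketch, assuming a made-up set of repeated measurements and a made-up true value of 120; variance here uses divisor n (`pvariance`), which is what makes the identity exact for a finite set of measurements.

```python
from statistics import fmean, pvariance

# Hypothetical repeated measurements of a quantity whose true value is 120
true_value = 120.0
measurements = [122.0, 125.0, 121.0, 124.0, 123.0]

mean_measurement = fmean(measurements)
bias = mean_measurement - true_value       # average measurement minus the target
variance = pvariance(measurements)         # spread about the measurements' own mean (divisor n)
mse = fmean((m - true_value) ** 2 for m in measurements)  # average squared distance from the truth

print("bias:", bias)
print("variance:", variance)
print("MSE:", mse, "variance + bias^2:", variance + bias ** 2)
```

The measurements cluster tightly (small variance) but around the wrong value (nonzero bias), so most of the MSE comes from the bias term.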
Cause
something that brings about an effect or result
Confounder
another variable ($X_2$) that needs to be taken into account when assessing the true association between the risk factor $X_1$ and the outcome $Y$ (e.g., BMI)
Effect modifier
another variable ($X_2$) that identifies subgroups of individuals (units) across which the association between the risk factor $X_1$ and the outcome $Y$ will differ
Inference
Estimate the association between the outcome of mortality and treatment, and characterize the estimate’s uncertainty
Prediction
Best predict the outcome of mortality on the basis of available data of treatment and other factors, and characterize the prediction’s accuracy
Experimental studies
control allocation of “treatment” to subjects (experimental units)
Laboratory studies:
control variation (e.g., effect of pesticide on rate of mutations in rat pups)
Clinical trials
randomize to produce groups with similar observed and unobserved characteristics; average over rather than control variation (e.g., compare two treatments to reduce blood pressure)
Observational studies
do not control allocation of “treatment” to subjects (experimental units)
Frequency
the count (frequency) of the number of individuals in a particular group
Empirical distribution function
a frequency distribution which describes an observed set of values of a variable
Cumulative frequency
the count (frequency) of the number of individuals in a particular age group or lower age group ►That is, the cumulative count
Relative frequency
the proportion of individuals in a particular age group = the count (frequency) of the number of individuals in a particular age group divided by the overall total
Cumulative relative frequency
the cumulative proportion of individuals in a particular age group or any lower age group
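The four frequency measures above can be built from one column of counts. A minimal sketch, assuming hypothetical age groups and counts that are not from the lectures:

```python
from itertools import accumulate

# Hypothetical counts of individuals by age group, ordered youngest to oldest
age_groups = ["0-19", "20-39", "40-59", "60+"]
frequency = [10, 25, 40, 25]

total = sum(frequency)
cumulative_frequency = list(accumulate(frequency))            # running count
relative_frequency = [f / total for f in frequency]           # proportion in each group
cumulative_relative_frequency = [c / total for c in cumulative_frequency]

for row in zip(age_groups, frequency, cumulative_frequency,
               relative_frequency, cumulative_relative_frequency):
    print(row)
```

The last cumulative frequency equals the overall total, and the last cumulative relative frequency equals 1.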
Range
difference between largest and smallest values
Variance
“average” of the squared differences of observations from the sample mean
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation
$s = \sqrt{s^2}$; the square root of the variance
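The sample variance and standard deviation can be computed step by step and compared with the standard library. A minimal sketch, assuming five hypothetical observations:

```python
from math import sqrt
from statistics import mean, stdev, variance

x = [4, 8, 6, 5, 7]  # hypothetical observations
n = len(x)
x_bar = mean(x)

# Sum of squared differences from the sample mean, divided by n - 1
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s = sqrt(s2)

print("sample variance:", s2, "library:", variance(x))
print("standard deviation:", s, "library:", stdev(x))
```

The divisor n - 1 (rather than n) is why the card puts "average" in quotation marks.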
Stats terminology
Upper hinge = $Q_3$
►Median = $Q_2$
►Lower hinge = $Q_1$
►Interquartile range (IQR) = $Q_3 - Q_1$
●Contains the middle 50% of the observations
►Whiskers: lines drawn to the smallest and largest actual observations within the calculated fences
fences
Fences are not observed data points
►Fences are calculated to provide guidelines for identifying outliers
►Upper fence = upper hinge + 1.5 × IQR = $Q_3 + 1.5 \times \text{IQR}$
►Lower fence = lower hinge − 1.5 × IQR = $Q_1 - 1.5 \times \text{IQR}$
outliers
Outliers are actual observed data values falling beyond the calculated fences (higher or lower)
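The hinge, fence, whisker, and outlier definitions can be traced on a small data set. A minimal sketch, assuming hypothetical observations; the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is also an assumption, since software packages compute quartiles and hinges in slightly different ways.

```python
from statistics import quantiles

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical observations

# Assumed quartile convention: "inclusive" interpolates within the observed data
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences are calculated guidelines, not observed data points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme actual observations inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

# Outliers are actual observations beyond the fences
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, q2, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whiskers, "outliers:", outliers)
```

Here the upper fence is 15, a value that never occurs in the data, while the upper whisker stops at 10, the largest observation below it.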
Positively skewed:
more lower values, sparse higher values
►Also: long “tail” of higher values
►Also: mean > median > mode
Negatively skewed
reverse of positively skewed
Symmetric
not skewed in either direction
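The ordering of mean, median, and mode under skew can be seen on a toy example. A minimal sketch, assuming hypothetical data; the ordering is a common rule of thumb and is not guaranteed for every skewed data set.

```python
from statistics import mean, median, mode

# Hypothetical data: many low values, one sparse high value (long right tail)
positively_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 12]

# Mirror image of the same data (long left tail)
negatively_skewed = [13 - x for x in positively_skewed]

for name, values in [("positive", positively_skewed), ("negative", negatively_skewed)]:
    print(name, "mean:", round(mean(values), 2),
          "median:", median(values), "mode:", mode(values))
```

The single value 12 pulls the mean upward far more than it moves the median, which is the sense in which outlying values strongly influence some summary measures.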
Outlying values
Values that are “far” from most values
►Importance: a few outlying values can strongly influence certain statistical summary measures and analyses
arithmetic scale
each increment represents change by a constant amount
logarithmic scale
each increment represents change by a constant multiplier
Probability
provides a measure of the uncertainty associated with the occurrence of events
Outcome
the exact result of the experiment
Event
specific way(s) the experiment can turn out
mutually exclusive
Two events, A and B, are mutually exclusive if the events cannot occur together
statistically independent
Two events, A and B, are statistically independent if the probability of A occurring is not influenced by the presence or absence of B
conditional probability
$P(A \mid B) = P(A \text{ and } B) / P(B)$, where $P(B) \neq 0$
(Vertical bar | = “given”)
statistically independent
$P(A \mid B) = P(A)$
►That is, the probability of $A$ occurring is not influenced by the presence or absence of $B$
Joint probability
“and”: $P(A \text{ and } B)$
Multiplication rule
From conditional probability, we can write the joint probability as …
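The card above trails off in the source. For reference, the usual statement of the multiplication rule, obtained by multiplying both sides of the conditional probability definition by the conditioning event's probability, is sketched here; it is the standard textbook form and not necessarily the exact wording of the lecture.

```latex
P(A \text{ and } B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)
```

When $A$ and $B$ are statistically independent, $P(A \mid B) = P(A)$, so the rule reduces to $P(A \text{ and } B) = P(A)\,P(B)$.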
mutually exclusive
Two outcomes or events are mutually exclusive if and only if the probability of their joint outcome equals zero
statistically independent
Two outcomes or events are statistically independent if and only if the probability of their joint outcome equals the product of the probabilities of occurrence of each outcome
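The joint-probability definitions of mutual exclusivity and independence can be checked directly from a table of counts. A minimal sketch, assuming a hypothetical 2×2 table; the event labels and counts are illustrative only, and `Fraction` is used so the equality checks are exact.

```python
from fractions import Fraction

# Hypothetical counts cross-classifying event A (yes/no) by event B (yes/no)
counts = {
    ("A", "B"): 30,
    ("A", "not B"): 20,
    ("not A", "B"): 30,
    ("not A", "not B"): 20,
}
total = sum(counts.values())

p_a_and_b = Fraction(counts[("A", "B")], total)                       # joint probability
p_a = Fraction(counts[("A", "B")] + counts[("A", "not B")], total)    # marginal P(A)
p_b = Fraction(counts[("A", "B")] + counts[("not A", "B")], total)    # marginal P(B)
p_a_given_b = p_a_and_b / p_b                                         # conditional P(A | B)

mutually_exclusive = p_a_and_b == 0            # joint probability equals zero
independent = p_a_and_b == p_a * p_b           # joint equals product of marginals

print("P(A and B) =", p_a_and_b, " P(A) =", p_a, " P(B) =", p_b)
print("P(A | B) =", p_a_given_b)
print("mutually exclusive:", mutually_exclusive, " independent:", independent)
```

In this table the events are independent but not mutually exclusive. Mutually exclusive events with nonzero probabilities can never be independent, because their joint probability is zero while the product of their probabilities is not.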
Probability distributions
a complete listing of the probabilities for every possible value of a random variable
Binomial
two possible outcomes
►Underlies much of statistical applications to epidemiology
►Basic model for logistic regression
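The binomial distribution assigns a probability to each possible number of "successes" in a fixed number of independent two-outcome trials. A minimal sketch, using the standard binomial probability formula and assuming hypothetical values n = 10 trials and success probability p = 0.2:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.2  # hypothetical number of trials and success probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))
print("total probability:", round(sum(pmf), 10))
```

Listing the probability for every value from 0 to n is exactly the "complete listing" that defines a probability distribution.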
Poisson
uses counts of events or rates
►Basis for log-linear and survival models
Gaussian (normal) bell-shaped curve
means are normally distributed or approximately normally distributed
Exponential
useful in describing times to events and population growth
Counting techniques
Factorial
►Permutations
►Combinations
Factorial
“$n$ factorial” = number of possible arrangements (orderings) of $n$ objects
►Notation: “$n$ factorial” = $n!$
Permutation
ordered arrangement of $n$ objects taken $r$ at a time
Combination
a selection of $n$ objects taken $r$ at a time without regard to order
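The three counting techniques are available in Python's `math` module (3.8 or later). A minimal sketch, assuming hypothetical values n = 5 and r = 2:

```python
from math import comb, factorial, perm

n, r = 5, 2  # hypothetical: 5 objects taken 2 at a time

arrangements = factorial(n)   # orderings of all n objects
permutations = perm(n, r)     # ordered arrangements of r out of n
combinations = comb(n, r)     # unordered selections of r out of n

print("n! =", arrangements)
print("permutations =", permutations)
print("combinations =", combinations)
```

Each unordered selection of r objects can be ordered in r! ways, so the number of permutations equals the number of combinations multiplied by r!.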
Poisson
Describes the totally random (haphazard) occurrences of events in time or objects in space
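The Poisson probabilities for a given average count can be listed directly. A minimal sketch, using the standard Poisson probability formula and assuming a hypothetical average of 2 events per interval:

```python
from math import exp, factorial

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the expected count is `rate`."""
    return exp(-rate) * rate ** k / factorial(k)

rate = 2.0  # hypothetical expected number of events per interval
probabilities = [poisson_pmf(k, rate) for k in range(60)]  # tail beyond 59 is negligible

mean_count = sum(k * p for k, p in enumerate(probabilities))
variance_count = sum((k - mean_count) ** 2 * p for k, p in enumerate(probabilities))

print("P(0 events) =", round(probabilities[0], 4))
print("mean =", round(mean_count, 6), "variance =", round(variance_count, 6))
```

The mean and variance both come out equal to the rate, a characteristic property of the Poisson distribution.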