Biostats Test 1 Flashcards
Statistics
the science of data
Data
numbers with a context
Biostatistics
the application of statistics to topics in biology, including, but not limited to the design and analysis of biological experiments and observational studies
Descriptive Statistics
Methods of organizing, summarizing and presenting data in an informative way
Inferential Statistics
Methods for drawing conclusions about a phenomenon (population) on the basis of data (sample)
- draw conclusions about hypotheses
Population vs. sample
Population: all subjects or items of interest (whose size, the number of subjects in the population, is denoted by N)
Sample: a group (or subset) selected from a population whose size is denoted by n
- Many different samples can be selected from any given population
- The number of distinct samples depends on the size of both the population and the sample
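To make the last point concrete: if individuals are drawn without replacement and order does not matter (an assumption; the card does not name the sampling scheme), the number of distinct samples is "N choose n". A minimal Python sketch with made-up sizes:

```python
import math

def count_distinct_samples(N, n):
    """Number of distinct unordered samples of size n drawn without
    replacement from a population of size N (assumed sampling scheme)."""
    return math.comb(N, n)

# Hypothetical sizes chosen only for illustration
small = count_distinct_samples(10, 3)    # 120
larger = count_distinct_samples(20, 3)   # 1140
```

Doubling the population size here multiplies the number of possible samples by almost ten, which is why the count depends on both N and n.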
Data
observations (such as measurements, genders or survey responses) that have been collected
Parameter
a number that describes a characteristic of a population
Statistic
a number that describes a characteristic of a sample (aka sample statistics)
- The observed value of a statistic is used to estimate the unobserved value of a parameter
Unbiased statistic
A statistic is unbiased if the mean of its sampling distribution is the same as the parameter it is intended to estimate
Individuals
Individuals are the objects described in a set of data
- Individuals may be people, animals, plants or things (ex: freshmen, newborns, fields of corn, cells)
Variable
A variable is any property that characterizes an individual.
- A variable can take different values for different individuals (ex: age, gender, blood pressure, blood types, flower color)
- two types: quantitative, categorical
Quantitative variable
Some quantity assessed or measured for each individual. We can then report the average of all individuals.
- Numeric (ex: age in years, blood pressure)
Categorical variable
Some characteristic describing each individual. We can then report the count or proportion of individuals with that characteristic.
- Gender (male, female), blood type (A, AB, O, B), flower color (white, yellow, red)
- finite number of categories
- don’t calculate averages for categorical variables - instead, often calculate proportions
- pie charts and bar graphs are often used to represent categorical data
Histograms
This is a summary graph for a single variable. Histograms are useful to understand the pattern of variability in the data, especially for large data sets
A histogram is a graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values and the bars are drawn adjacent to each other.
- tells us shape and distribution
- break data into bins/ranges of equal length
Dotplots and stemplots
These are graphs for the raw data. They are useful to describe the pattern of variability in the data, especially for small data sets
- Also called stem and leaf plots
- usually when 20 or fewer observations (if 21+, use histogram)
A graph in which each data value is plotted as a point along a scale of values. Dots representing equal values are stacked.
- not recommended unless small sample size (few observations)
- dots: where the observations are located along the line
Measures of center
The center of a data set is a representative or average value that indicates where the middle of the data set is located.
- Mean
- Median
- Mode
Mean
The mean or arithmetic average of a data set is the measure of center found by adding the values and dividing the total by the number of values
- sample mean = summation of all observations / number of values in sample
Median
The median of a data set is the measure of center that is the middle value when the data values are arranged in increasing or decreasing order.
To find the median, first sort the values, then:
- If the number of values is odd, the median is the number located in the exact middle of the list
- If the number of values is even, the median is found by computing the mean of the two middle numbers
Mode
The mode of a data set is the value that occurs most frequently.
- When two values occur with the same (greatest) frequency, each one is a mode and the data set is bimodal.
- When more than two values occur with the same (greatest) frequency, each is a mode and the data set is multimodal.
- When no value is repeated, there is no mode.
- One mode: unimodal
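The three measures of center can be checked quickly with Python's standard `statistics` module. This is only a sketch on made-up values, not part of the course material:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]   # made-up data, already sorted

mean = statistics.mean(values)          # sum of values / number of values = 30 / 6
median = statistics.median(values)      # even count: mean of the two middle values (3 and 5)
modes = statistics.multimode(values)    # list of every most-frequent value

bimodal = statistics.multimode([1, 1, 2, 2, 3])   # two values tie for most frequent
```

One caveat: when no value repeats, `multimode` returns every value, whereas the card above says such a data set has no mode.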
Skewed data distribution
A distribution of data is skewed if it is not symmetric and extends more to one side than the other
- If tail is on left (skinny side), mean pulled towards left (mean < median)
- If tail is on right, mean pulled towards right (mean > median)
Left skew (negative skew): the mean and median are to the LEFT of the mode (mean < median)
Symmetric (zero skew): the mean, median, and mode are the same
Right-skew (positive skew): the mean and median are to the RIGHT of the mode (mean > median)
The Best Measure of Center
Each measure of center has advantages and disadvantages
- Mean: is unique in that it takes all data values into account. However, it is NOT resistant to skew and extreme values (outliers)
- Median: is resistant to skew and outliers
- For data that is approximately symmetric with only one mode, the mean, median, mode and midrange will be approximately the same
- For data that is obviously asymmetric, you should report both the mean and the median
Variation
a measure of the amount that values within a data set vary among themselves
Range
The range of a set of data is the difference between the maximum value and the minimum value
- Range = max - min
Standard deviation
The standard deviation of a set of sample values is a measure of variation of values about the mean
- The standard deviation “s” is used to describe the variation around the mean.
- Like the mean, it is NOT resistant to skew or outliers.
(s is used to estimate the population standard deviation, sigma)
Variance
The variance of a set of values is a measure of variation equal to the square of the standard deviation (s^2)
(s^2 is used to estimate the population variance, sigma^2)
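A short Python sketch of the sample standard deviation and variance on made-up values. It assumes the usual sample formulas with divisor n - 1, which is what `statistics.stdev` and `statistics.variance` compute:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample, mean = 5

# Squared deviations from the mean add up to 32, so s^2 = 32 / (8 - 1)
s_squared = statistics.variance(values)   # sample variance
s = statistics.stdev(values)              # sample standard deviation = sqrt(s^2)
```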
Z-score
(also known as standardized score)
A z-score can be used to compare values from different data sets.
- The Z-score is the number of standard deviations that a given value x is above or below the mean.
- If the z-score is 1, the value is 1 standard deviation above the mean. If the z-score is -1.5, the value is 1.5 standard deviations below the mean.
use when we want to compare different populations
- need to standardize data to make comparable
positive vs. negative z-score
Positive z-score: indicates that the value is above the mean
Negative z-score: indicates that the value is below the mean.
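As an illustration, the z-score formula (value minus mean, divided by standard deviation) can be written as a small Python function. The exam numbers below are hypothetical:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical results from two tests measured on different scales
z_a = z_score(85, mean=75, sd=10)     # 1.0: one standard deviation above the mean
z_b = z_score(620, mean=680, sd=40)   # -1.5: one and a half standard deviations below
```

Because both results are now in standard deviation units, they can be compared even though the raw scales differ.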
Quartiles
Quartiles divide the sorted data values into four equal parts.
- The median divides the data into two equal components.
- Q1: 25% of values are less than or equal to Q1, and 75% of values are greater than or equal to Q1
- Q2: equal to the median
- Q3: 75% of values are less than or equal to Q3, and 25% of values are greater than or equal to Q3
Exploratory data analysis
the process of using statistical tools to investigate data sets in order to understand their important characteristics, including: center, variation, distribution, outliers and time
Outlier
An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value
- An outlier can have a dramatic effect on the mean, the standard deviation, and the scale of the histogram so that the true nature of the distribution is obscured
Five-number summary
min Q1 M (median) Q3 max
Inter-quartile range (IQR)
The IQR is the distance between the first and third quartiles (the length of the box in the box plot)
- IQR = Q3-Q1
- used to find suspected high and low outliers
Calculating outliers
An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?
- Suspected low outlier: any value < Q1 - 1.5 IQR
- Suspected high outlier: any value > Q3 + 1.5 IQR
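The 1.5 IQR rule can be sketched in Python as follows. Quartile conventions differ between textbooks and software; this sketch assumes the median-of-halves convention (Q1 and Q3 are the medians of the lower and upper halves of the sorted data), and the data are made up:

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]                    # values below the median position
    upper = xs[half + len(xs) % 2:]      # values above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def suspected_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]   # made-up data
summary = five_number_summary(data)       # (1, 4, 5.5, 8, 30)
flagged = suspected_outliers(data)        # [30], since 30 > 8 + 1.5 * 4 = 14
```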
How to draw a boxplot
- Find the 5 number summary
- Construct a scale with values that includes the minimum and maximum data values
- Construct a box extending from Q1 to Q3, and draw a line in the box at the median value
- Draw lines extending outward from the box to the minimum and maximum data values
(won’t be asked to do this on exam)
Bivariate (or paired) data
can be analyzed to determine if there is an association between the two variables.
- We explore only linear associations within quantitative data
Correlation
A correlation exists between two variables when one of them is linearly related to the other in some way
- must be quantitative variables
How to investigate the association between two variables
- Make a scatterplot
- - What type of relationship is there? linear or nonlinear
- - Direction of relationship? positive (as x increases, y increases) or negative (as x increases, y decreases)
- - How strong is the relationship? strong (if you can connect dots), weak (if scattered)
- - Look for potential outliers
Linear correlation coefficient (r) - definition + requirements
The correlation measures the strength of the linear association between paired x and y quantitative values in a sample. r is a sample statistic that estimates the population correlation coefficient, rho.
Requirements for making inferences about rho, using r:
- Paired data (x, y) must be a random sample
- A scatterplot must confirm that the points approximate a straight-line pattern
- Outliers should be removed if they are known to be errors
Properties of the correlation coefficient
- The value of r is always between -1 and 1, inclusive (-1 less than or equal to r less than or equal to 1)
- The value of r does not change if all values of either variable are converted to a different scale
- The value of r is not affected by the choice of x and y (ex: doesn’t matter if BMI x or y, cholesterol, y or x)
- r measures the strength and direction of a linear association
Negative correlation: - slope
Positive correlation: + slope
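The properties above can be checked numerically. This Python sketch computes r from its usual sums-of-products form on made-up paired data; the helper name `correlation` is hypothetical:

```python
import math

def correlation(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]          # made-up paired data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)                               # about 0.77, a positive association
r_swapped = correlation(y, x)                       # same value: x and y are interchangeable
r_rescaled = correlation([10 * v for v in x], y)    # same value: a change of units does not matter
```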
Interpreting the correlation coefficient
If r is close to zero, we can conclude that there is no significant linear correlation between x and y.
If r is close to -1 or 1, we conclude that there is significant linear correlation (values closer to -1 or 1 indicate stronger correlation)
- CANNOT conclude that there is no relationship at all (there could be another relationship like a parabola)
Interpreting r
If we conclude that there is a linear correlation between x and y, we can find a linear equation that expresses y in terms of x and that equation can be used to predict values of y for given values of x. (Simple Linear Regression)
The value of r^2 is the proportion of variation in y that is explained by x. In addition to x, there may be a variety of other factors affecting y, such as random variation or other factors not included in the study. We will explore this in more detail with linear regression.
Interpreting r - common errors
- Concluding that correlation implies causality (ex: shark attacks and ice cream consumption)
- Data based on averages: Averages suppress individual variation and may inflate the correlation coefficient (averages may make things look better than they are)
- Linearity: An association may exist between x and y even when there is no significant linear correlation.
r is not resistant to outliers:
- Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers
- Outliers will make a relationship look stronger/weaker than it actually is
Simple linear regression
The regression equation expresses an association between x and y.
Variable x: the independent, predictor, or explanatory variable
Variable y: the dependent or response* variable
Data comes in pairs (xi, yi) where xi is the ith observation for variable x and yi is the ith observation for variable y
A linear regression model with one predictor variable is a simple linear regression (SLR) model
x is what we are using to predict y
least-squares regression line: definition
the unique line such that the sum of the signed vertical distances between the data points and the line is zero, and the sum of the squared vertical distances is the smallest possible
- same as line of best fit
- line: smallest amount of vertical distances squared (minimizes error)
- sum of all vertical distances has to = 0
- always has to pass through the point (x bar, y bar)
- Only for linear associations
- Don’t compute the regression line until you have confirmed that there is a linear relationship between x and y - always plot the raw data first to confirm linear association (always do a scatterplot and correlation coefficient first)
least squares regression line: notation and interpretations
y hat: the predicted value of y for a given value of x
y hat = intercept + slope x
*always have to write y hat
slope of the regression line: describes how much we expect y to change, on average, for every unit change in x
intercept: a necessary mathematical descriptor of the regression line (it does not describe a specific property of the data)
Slope of the regression line:
b1 = r (sy/sx)
- r: the correlation coefficient between x and y
- sy: standard deviation of the response variable y
- sx: standard deviation of the explanatory variable x
Intercept:
b0 = y bar - b1 (x bar)
- x bar and y bar are the respective means of the x and y variables
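A minimal Python sketch of the slope and intercept formulas, using made-up data. The correlation r is computed here from standardized products with divisor n - 1, which is one common textbook form and is assumed rather than taken from the card:

```python
import statistics

x = [1, 2, 3, 4, 5]          # made-up explanatory values
y = [2, 4, 5, 4, 5]          # made-up response values

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Correlation coefficient r from standardized products (divisor n - 1)
n = len(x)
r = sum(((xi - x_bar) / sx) * ((yi - y_bar) / sy) for xi, yi in zip(x, y)) / (n - 1)

b1 = r * (sy / sx)           # slope, about 0.6
b0 = y_bar - b1 * x_bar      # intercept, about 2.2

def predict(x_value):
    """Predicted response: y hat = b0 + b1 * x."""
    return b0 + b1 * x_value
```

Calling `predict(x_bar)` returns y bar (up to rounding), which matches the fact that the least-squares line passes through the point (x bar, y bar).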
coefficient of determination
coefficient of determination = r^2
It is the square of the correlation coefficient. It represents the fraction of the variation (%) in y that is explained by the regression model.
- Always between 0 and 1
- The closer r^2 gets to 1, the better the model explains (fits) the data
Interpretation ex:
- If r = 0.87, then r^2 = 0.76: about 76% of the variation in children’s heights is explained by the regression model with FEV. In general, the regression model explains 76% of the variation in y.
Outliers vs. influential points
Outlier: an observation that lies outside the overall pattern
“Influential individual/point”: an observation that markedly changes the regression if removed. This is often an isolated point
Residuals
the vertical distances from each point to the least-squares regression line
- The sum of all the residuals is by definition 0
- Outliers have unusually large residuals (in absolute value) - residuals can be positive or negative
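To see that the residuals of a least-squares line sum to zero, here is a small Python check on made-up data. The slope is computed from the equivalent sums-of-products form, and the sum is zero only up to floating-point rounding:

```python
x = [1, 2, 3, 4, 5]          # made-up data
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y hat (positive above the line, negative below)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
total = sum(residuals)       # zero up to floating-point rounding
```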
Association + Causation (and reason)
Association, however strong, does NOT imply causation.
- The observed association could have an external cause.
- A lurking variable is a variable that is not among the explanatory or response variables in a study, and yet may influence the relationship between the variables studied
- We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other.
Establishing causation
Establishing causation from an observed association can be done if:
- The association is strong
- The association is consistent
- Higher doses are associated with stronger responses
- The alleged cause precedes the effect
- The alleged cause is plausible
Observational study
record data on individuals without attempting to influence the responses
- not imposing anything on anyone or manipulating anything about individuals
Experimental study
deliberately imposing or assigning a treatment on individuals and recording their responses
- influential factors can be controlled
- implement something that changes person’s lifestyle, etc.
confounded variables
Two variables are confounded when their effects on a response variable cannot be distinguished
- Observational studies often fail to yield clear causal conclusions because the explanatory variable is confounded with lurking variables
Lurking variables: didn’t collect info on it (Can’t account for it)
Confounding variables: collected the info while doing the study
Can control for confounding variables in experiments generally but not in observational studies
Population vs. sample
Population: the entire group of individuals in which we are interested but can’t usually assess directly
Sample: the part of the population we actually examine and for which we do have data
Parameter vs. Statistic
Parameter: a number summarizing a characteristic of the Population
Statistic: a number summarizing a characteristic of a Sample
Randomization in sampling
Probability sampling: individuals or units are randomly selected; the sampling process is unbiased
Methods that are NOT ideal but only done if probability sampling not possible:
- Voluntary response sampling: individuals choose to be involved
- Convenience sampling: ask whoever is around (mail, street) or take the next 10 units
Simple random sample (SRS)
Made of randomly selected individuals
- Each individual in the population has the same probability of being in the sample
- All possible samples of size n have the same chance of being drawn
How to choose an SRS:
- draw from a hat (lottery style)
- flip a coin
- use a table of published random numbers
- use software that generates random numbers
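The last option can be sketched with Python's standard `random` module. The population labels and the helper name `simple_random_sample` are hypothetical; `random.sample` draws without replacement, so each individual appears at most once:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every subset of size n is equally likely."""
    rng = random.Random(seed)      # the seed is only for reproducibility
    return rng.sample(population, n)

population = list(range(1, 101))   # hypothetical individuals labeled 1..100
sample = simple_random_sample(population, 10, seed=1)
```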
Sample Survey
an observational study that relies on a random sample drawn from the entire population
- Opinion polls are sample surveys that typically use voter registries or telephone numbers to select their samples
- In epidemiology, sample surveys are used to establish the incidence (rate of new cases per year) and the prevalence (rate of all cases at one point in time) of various medical conditions, diseases, and lifestyles.
Some survey challenges
- undercoverage or selection bias: parts of the population are systematically left out (based on the way you choose to distribute the survey)
- nonresponse: some people choose not to answer/participate
- wording effects: biased or leading questions, complicated/confusing statements can influence survey results
- response bias: fancy term for lying or forgetting (especially on sensitive/personal issues) - can be exacerbated by survey method (in person vs. by phone or online) - more likely to lie to someone’s face than online
case control studies
start with 2 random samples of individuals with different outcomes and look for exposure factors in the subjects’ past (“retrospective”)
- common for rare diseases
cohort studies
enlist individuals of common demographic, and keep track of them over a long period of time (“prospective”)
- individuals who later develop a condition are compared with those who don’t
cross-sectional studies
measure the exposure and the outcome at the same time (i.e. surveys)
the individuals in an experiment are called the….
experimental units
- if they are human, we call them subjects
the explanatory variables in an experiment are often called the
factors
treatments
any specific experimental condition applied to the subjects
- if an experiment has several factors, a treatment is a combination of specific levels of each factor
ex: the factor may be the administration of a drug
experiments
compare the response to a given treatment versus:
- another treatment
- the absence of treatment (often called a control)
- a placebo (a fake treatment)
Experiments randomize the assignment of subjects to treatments.
Experiments use replication: several or many individuals are studied
Negative vs. Positive control in cellular biology experiments
negative control: expect outcome to stay the same (expect that not going to help/hurt)
positive control: expect outcome to change
placebo effect
improvement in health or perceived condition due not to any active treatment but only to the patient’s belief that he or she is being cared for or helped
- therapeutic results on up to 35% of patients
- neural response to the placebo effect seen as early as the spinal cord
Hawthorne effect
term used to describe a type of bias that may occur due to behavior modification because of study enrollment
- also known as “observer effect”
- blinding can help against bias
if people know which group they are in, or the doctor knows who is in the placebo group, it may change the individual’s behavior or the doctor’s treatment
- people behave differently if they know they are being observed by a doctor
double blind experiment
one in which neither the subjects nor the experimenter(s) know which individuals received which treatment until the experiment is completed
completely randomized experimental design
individuals are randomly assigned to groups then the groups are assigned to treatments completely at random
matched pairs design
choose pairs of subjects that are closely matched (like twins but doesn’t have to be) - within each pair, randomly assign who will receive which treatment (each gets a different treatment)
repeated measures design
give the two (or more) treatments to each subject over time, in random order, so we have repeated measures for each subject
Belmont report 1979
established IRB (partly in response to Tuskegee Syphilis study - also Stanford Prison experiment, Milgram experiment)
3 main aims:
- respect for persons (consent)
- beneficence (maximize benefit while minimizing harm - study will be beneficial)
- justice (fair selection of subjects; the burdens and benefits of the research are distributed fairly)
two-way tables
summarize data about two categorical variables (or factors) collected on the same set of individuals
- Each factor can have any number of levels. If the row factor has “r” levels and the column factor has “c” levels, we say that the two-way table is an “r by c” table
marginal distributions
- We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents
- total all rows or total all columns
conditional distribution
is the distribution of one factor for each level of the other factor
- fix either a row or column and calculate percentages across that fixed row/column
- A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total)
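A small Python sketch of marginal and conditional distributions for a 2 by 2 table. The counts and factor names are invented for illustration:

```python
# Hypothetical counts: rows = smoking status, columns = disease status
table = {
    "smoker":     {"disease": 30, "no disease": 70},
    "non-smoker": {"disease": 20, "no disease": 180},
}

row_totals = {row: sum(cols.values()) for row, cols in table.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of the row factor: row total / grand total
marginal_rows = {row: total / grand_total for row, total in row_totals.items()}

# Conditional distribution of disease status within each row: count / row total
conditional = {
    row: {col: count / row_totals[row] for col, count in cols.items()}
    for row, cols in table.items()
}
```

Here 30% of smokers but only 10% of non-smokers have the disease; those conditional percents use the row totals (100 and 200) as denominators, not the grand total of 300.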
Random event
outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions
probability
We define the probability of any outcome of a random phenomenon as the proportion of times the outcome would occur in a very long series of repetitions (number of times outcome will occur in long series of replications)
- description of the pattern in the LONG RUN
probability models
Probability models mathematically describe the outcome of random processes. They consist of two parts:
1) S= Sample Space: This is a list or description of ALL possible outcomes of a random process. An event is a subset of the sample space.
2) A probability assigned for each possible simple event in the sample space S
Discrete sample space
Discrete variables that can take on only certain values (a whole number or a descriptor) - sample with finite number of outcomes (ex: blood types - there are only 4)
Continuous sample space
Continuous variables that can take on any one of an infinite number of possible values over an interval (ex: height, weight, BMI generally continuous - can be any number within lower/upper bound)
- have a minimum/lower bound, upper bound but there are unlimited values within
Probability rules
- Probabilities range from 0 (no chance) to 1 (event has to happen) – For any event A, P(A) is between 0 and 1
- The probability of the complete sample space S must equal 1: P(sample space) = 1 (probabilities of all outcomes must add up to 1)
- Complement rule: The probability that an event A does not occur (not A) equals 1 minus the probability that it does occur: P(not A) = 1 - P(A)
disjoint events
Two events are disjoint or mutually exclusive if they can never happen together (have no outcome in common)
Addition Rules
Addition rules for disjoint events: When two events A and B are disjoint, P(A or B) = P(A) + P(B)
General addition rule for ANY two events A and B:
P(A or B) = P(A) + P(B) - P(A and B)
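The addition and complement rules can be verified on a small sample space. This Python sketch assumes one roll of a fair six-sided die (equally likely outcomes) and uses exact fractions:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed)

def prob(event):
    """Probability of an event (a subset of the sample space), outcomes equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}       # even number
B = {4, 5, 6}       # greater than 3
C = {1}             # disjoint from A

p_a_or_b = prob(A) + prob(B) - prob(A & B)   # general addition rule: 1/2 + 1/2 - 1/3
p_a_or_c = prob(A) + prob(C)                 # disjoint events: nothing to subtract
p_not_a = 1 - prob(A)                        # complement rule
```

Both sums agree with counting the outcomes in the union directly (`prob(A | B)` and `prob(A | C)`, where `|` is Python's set union).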
Continuous sample spaces
contain an infinite number of events
- We use density curves to model continuous probability distributions
- They assign probabilities over the range of values making up the sample space
Continuous Probabilities + intervals (density curves)
Events are defined over intervals of values
- The total area under a density curve represents the whole population (sample space) and equals 1 (100%)
- Probabilities are computed as areas under the corresponding portion of the density curve for the chosen interval
- The probability of an event being equal to a single numerical value is zero when the sample space is continuous
Area under curve between specified points = P(A) - probability of event A happening
Independent events
two events are independent if knowing that one event is true or has happened does not change the probability of the other event
- If knowledge of the first event affects the second -> dependent
Conditional probabilities
reflect how the probability of an event can be different if we know that some other event has occurred or is true
The conditional probability of event B, given event A is: P(B|A) = P(A and B)/P(A) (this is the probability B will occur given that A has already occurred)
When two events A and B are independent, P(B|A) = P(B). No information is gained from the knowledge of event A.
To show independence, you have to show:
P(B|A) = P(B) = P(B|A complement)
Multiplication rule
General multiplication rule: The probability that ANY two events, A and B both occur is: P(A and B) = P(A)P(B|A)
Multiplication rule for independent events: If A and B are independent, then: P(A and B) = P(A)P(B)
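A short Python sketch of conditional probability, the independence check, and the general multiplication rule. It again assumes one roll of a fair die, and the events are chosen only for illustration:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A)."""
    return prob(a & b) / prob(a)

A = {2, 4, 6}        # even number
B = {1, 2, 3, 4}     # at most 4
C = {4, 5, 6}        # greater than 3

independent_ab = cond_prob(B, A) == prob(B)   # True:  P(B|A) = 2/3 = P(B)
independent_ac = cond_prob(C, A) == prob(C)   # False: P(C|A) = 2/3 but P(C) = 1/2

# General multiplication rule: P(A and C) = P(A) * P(C | A)
p_a_and_c = prob(A) * cond_prob(C, A)
```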
Tree diagrams
are used to represent probabilities graphically and facilitate computations
Bayes’ Theorem
In general, P(A|B) is not equal to P(B|A)
If we know the conditional probability P(B|A) and the individual probability P(A), we can use Bayes’ theorem to find the conditional probability P(A|B)
equation is on equation sheet
Specificity
probability you get negative result when you don’t have the disease
- P(negative result given negative disease status)
Sensitivity
how likely it is to give positive result when you have the disease (want this to be close to 1) - how accurate the test is
P( positive result given positive disease status)
Normal (Gaussian) distributions
a family of symmetrical, bell-shaped density curves defined by a mean u (“mu”) and a standard deviation (sigma): N(u, sigma)
Normal curves are used to model many biological variables. They can describe a population distribution or a probability distribution.
Rule for any N(mu, sigma) - all normal curves:
68-95-99.7 rule
All normal curves share the same properties:
- About 68% of all observations are within 1 standard deviation (sigma) of the mean (mu): the range from mu - sigma to mu + sigma
- About 95% of all observations are within 2 sigma of the mean mu
- Almost all (99.7%) observations are within 3 sigma of the mean; values farther out are probably outliers
To obtain any other area under a normal curve, use Table B
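The 68-95-99.7 rule can be checked with software instead of a table. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is hypothetical:

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

within_1 = area_within(1)   # about 0.68
within_2 = area_within(2)   # about 0.95
within_3 = area_within(3)   # about 0.997
```

The areas do not depend on mu or sigma, so `area_within(1, mu=100, sigma=15)` gives the same result as `area_within(1)`.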
Standard Normal Distribution
We can standardize data by computing a z score: z = (x-mu)/sigma where x= an observation
- If x has the N(mu, sigma) distribution, then z has the N(0,1) distribution
- A Normal distribution with mean 0 and sd 1 is the standard Normal distribution
z score
measures the number of standard deviations that a data value x is from the mean mu
- When x is 1 standard deviation larger than the mean, then z = 1
- When x is 2 standard deviations larger than the mean, then z = 2
When x is larger than the mean, z is positive.
When x is smaller than the mean, z is negative.
The area under N(0,1) for a single value of z is zero
Z table: finding area to the right of a z-value
area to the right of z
= 1 - area left of z OR
= area left of -z
Area -> Z score
- find the desired area/proportion in the body of the table
- then read the corresponding z-value from the left column and top row
- percentile corresponds to area under curve
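Both table lookups (z to area, and area back to z) have software equivalents. A minimal Python sketch using the standard library's `NormalDist`; the function names are hypothetical:

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)

def area_left(z):
    """Area under N(0,1) to the left of z (what a z table reports)."""
    return standard_normal.cdf(z)

def area_right(z):
    """Area to the right of z: 1 - area left of z, which equals area left of -z."""
    return 1 - area_left(z)

def z_for_area_left(area):
    """Inverse lookup: the z-value with the given area (percentile) to its left."""
    return standard_normal.inv_cdf(area)

right_of_1 = area_right(1.0)      # about 0.1587
z_95th = z_for_area_left(0.95)    # about 1.645
```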
Normal Quantile Plots (QQ Plots)
One way to assess if a data set has an approximately Normal distribution is to plot the data on a QQ Plot (assess normality of data)
- The data points are ranked and the percentile ranks are converted to z-scores. The z-scores are then used for the horizontal axis and the actual data values are used for the vertical axis. Use technology to obtain normal quantile plots
- If the data have approximately a Normal distribution, the Normal quantile plot will have roughly a straight-line pattern (if straight line, then data probably normally distributed)
Sampling Distributions
- Different random samples taken from the same population will give different statistics, but there is a predictable pattern in the long run
- A statistic computed from a random sample is a random variable
The sampling distribution of a statistic is the probability distribution of that statistic for samples of a given size n taken from a given population
- Every time you do simple random sample, get slightly different average (due to sampling error and non-sampling error (any error involving human - data collection, etc.))
Sampling distribution of x bar (the sample mean)
The mean of the sampling distribution of x bar is mu.
- There is no tendency for a sample average to fall systematically above or below mu, even if the population distribution is skewed.
- x bar is an unbiased estimate of the population mean mu.
The standard deviation of the sampling distribution of means is sigma/square root of n.
- The standard deviation of the sampling distribution measures how much the sample statistic x bar varies from sample to sample
- Averages are less variable than individual observations
Sample mean for normally distributed populations
When a variable in a population is Normally distributed, the sampling distribution of the sample mean x bar is also Normally distributed
Population: N(mu, sigma)
Sampling distribution: N(mu, sigma/square root of n)
Standardizing a normal sample distribution
When the sampling distribution is Normal, we can standardize the value of a sample mean x bar to obtain a z-score. This z-score can then be used to find areas under the sampling distribution from the Normal probability table.
z = (x bar - mu) / (sigma / square root of n)
Here we work with the sampling distribution
sigma/square root of n is its standard deviation (indicative of spread)
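As a worked example, a sample mean is standardized by dividing by sigma / square root of n rather than by sigma. The numbers in this Python sketch are hypothetical:

```python
import math
from statistics import NormalDist

def z_for_sample_mean(x_bar, mu, sigma, n):
    """Standardize a sample mean using the sampling distribution's sd, sigma / sqrt(n)."""
    standard_error = sigma / math.sqrt(n)
    return (x_bar - mu) / standard_error

# Hypothetical numbers: population N(100, 15), sample of n = 25, observed mean 106
z = z_for_sample_mean(106, mu=100, sigma=15, n=25)   # (106 - 100) / (15 / 5) = 2.0
p_above = 1 - NormalDist().cdf(z)                    # area to the right of z, about 0.023
```

For a single observation the same value 106 would have z = 0.4, which illustrates that averages are less variable than individual observations.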
Central Limit Theorem
When randomly sampling from ANY population with mean mu and standard deviation sigma, when n is large enough, the sampling distribution of x bar is approximately Normal: N(mu, sigma/square root of n)
- The larger the sample size n, the better the approximation of Normality
- This is very useful in inference: Many statistical tests assume Normality for the sampling distribution. The central limit theorem tells us that, if the sample size is large enough, we can safely make this assumption even if the raw data appear non-Normal
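A small simulation can make the theorem tangible. This Python sketch is only an illustration: the exponential population, the sample size, the number of repetitions, and the seed are all arbitrary choices, not values from the course:

```python
import math
import random
import statistics

rng = random.Random(0)        # fixed seed so the run is repeatable
n = 40                        # sample size
reps = 5000                   # number of simulated samples

# Exponential population with rate 1: strongly right-skewed, mu = 1, sigma = 1
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # should be near mu = 1
sd_of_means = statistics.stdev(sample_means)     # should be near sigma / sqrt(n)
expected_sd = 1 / math.sqrt(n)                   # about 0.158
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is far from Normal.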
How large a sample size
It depends on the population distribution. More observations are required if the population distribution is far from Normal.
- A sample size of 25 or more is generally enough to obtain a Normal sampling distribution from a skewed population, even with mild outliers in the sample
- A sample size of 40 or more will typically be good enough to overcome an extremely skewed population and mild (but not extreme) outliers in the sample
In many cases, n=25 isn’t a huge sample. Thus, even for strange population distributions, we can assume a Normal sampling distribution of the sample mean and work with it to solve problems
How do we know if the population is normal
- Sometimes we are told that a variable has an approximately Normal distribution (e.g. large studies on human height or bone density)
- Most of the time, we just don’t know. All we have is sample data.
- We can summarize the data with a histogram and describe its shape.
- If the sample is random, the shape of the histogram should be similar to the shape of the population distribution.
- The central limit theorem can help guess whether the sampling distribution should look roughly Normal or not
Law of large numbers
As the number of randomly drawn observations (n) in a sample increases:
- the mean of the sample (x bar) gets closer and closer to the population mean mu (quantitative variable)
- the sample proportion (p hat) gets closer and closer to the population proportion p (categorical variable)
x bar should be getting closer to population mean as sample size increases
Law of large numbers and sampling distribution (when sampling randomly from a given population)
When sampling randomly from a given population:
- The law of large numbers describes what would happen if we took samples of increasing size n
- A sampling distribution describes what would happen if we took all possible random samples of a fixed size n
Both are conceptual ideas with many important practical applications. We rely on their known mathematical properties but we don’t actually build them from data
sampling distribution of x bar
- distribution of x bar -> approximately normally distributed (when the population is Normal or n is large enough)
- mean of x bar -> equal to the population mean mu (mean of sample mean equals mean of population)
- spread/standard deviation of sampling distribution of x bar is smaller than the population spread/standard deviation whenever n > 1 (it equals sigma/square root of n)
PPV (Positive predictive value)
P(Disease | +)
probability that you have the disease given + result on test (if positive result, how likely that you have the disease?)
negative predictive value (NPV)
P(Disease complement | negative)
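PPV and NPV follow from Bayes' theorem once sensitivity, specificity, and the prevalence P(Disease) are known. A minimal Python sketch with hypothetical test characteristics; the function name `predictive_values` is made up for illustration:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' theorem.

    sensitivity = P(+ | disease), specificity = P(- | no disease),
    prevalence  = P(disease).
    """
    true_pos = sensitivity * prevalence               # P(+ and disease)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(+ and no disease)
    true_neg = specificity * (1 - prevalence)         # P(- and no disease)
    false_neg = (1 - sensitivity) * prevalence        # P(- and disease)
    ppv = true_pos / (true_pos + false_pos)           # P(disease | +)
    npv = true_neg / (true_neg + false_neg)           # P(no disease | -)
    return ppv, npv

# Hypothetical test: 90% sensitive, 95% specific, 1% prevalence
ppv, npv = predictive_values(0.90, 0.95, 0.01)   # about 0.154 and 0.999
```

With a rare disease, even a fairly accurate test gives a low PPV, because false positives from the large healthy group outnumber true positives.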
notation for complement rule
P (not A)
P(Ac)
P(A bar)
Strong vs. weak vs. moderate correlation
For her:
strong correlation is r of (-1 to -.7) or (.7 to 1)
moderate correlation: (.4 to .69) or (-.4 to -.69)
weak correlation: (-.39 to .39)
y intercept
y value when x=0
note, it may not make sense in the context
Interpretation of b0 and b1
b0: When the ___ is 0, the predicted ____ is ___.
b1: For a ___ unit increase in____, there is, on average, a ____ increase in_____.
variation vs. standard deviation vs. variance
variation: spread (how spread out the data is)
standard deviation: measures spread (large standard deviation = more spread out data)
variance: standard deviation squared