Exam 1 - Sept 20 Flashcards
What are the four steps of the experimental process?
Formulate Theory → Collect Data → Summarize Results → Interpret Results and Make Decisions
Variable
An observed category (label) or quantity (number) in an experiment that may “vary” for different individuals
Categorical variable
Individuals are classified into groups or categories
Quantitative variable
A numerical quantity
Explanatory variable
Variable that is thought to affect (“explain”) another variable
Response Variable
Variable that is thought to be affected by (“respond to”) the explanatory variable(s)
Inference
A conclusion that patterns from data can be extended to some broader context
Statistical Inference
An inference justified by a probability model linking the data to the broader context; incorporates a measure of uncertainty
Causal Inference
Enables us to establish a cause and effect relationship
Population Inference
An inference about population characteristics; extends results from the study sample to the larger population
Describe the probability model of randomization. What kind of inferences can be made when it is used?
Assigning experimental units (subjects) to treatment groups using a chance mechanism
Causal inference
Describe the probability model of random sampling. What kind of inferences can be made when it is used?
Selecting experimental units (subjects) to be in a sample using a chance mechanism
Population inference
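The two chance mechanisms above can be sketched in a few lines of Python. This is only an illustration: the population of 100 subject IDs, the sample size, and the group sizes are made-up assumptions, not part of the course material.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100 subject IDs (an assumption for illustration).
population = list(range(1, 101))

# Random sampling: a chance mechanism decides WHO is in the study.
# This is what supports population inference.
sample = rng.sample(population, 10)

# Randomization: a chance mechanism decides WHICH group each subject joins.
# This is what supports causal inference.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:5], shuffled[5:]

print("sample:   ", sorted(sample))
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

A study can use either mechanism, both, or neither, which is why the two kinds of inference are judged separately.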
Anecdotal Evidence
A short story or example of an interesting event that could lead to scientific investigation, but does not establish a scientific theory
Observational Study
A study in which the group status (e.g., gender) is beyond the control of the researcher; results may be due to confounding variables
Randomized Experiments
An experiment in which randomization is used to assign subjects to groups; on average, randomization balances confounding variables across the groups
Main Lesson for Causal Inferences
causal inferences can be made from randomized experiments, but not observational studies
Confounding Variables
variables that are related to both the group membership and the outcome
Main Lesson for Population Inferences
population inferences can only be made from samples which utilize random sampling
Population
A well-defined collection of objects that we are interested in drawing conclusions about
Sample
A subset of objects from the population
Describe the two types of random sampling
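Simple Random Sample (SRS) → Every possible sample of size n has an equal chance of being selected, so all individuals have an equal chance of being selected
Stratified Random Sample → The population is divided into groups (strata) and individuals are randomly selected within each group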
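A minimal Python sketch of the two sampling schemes, assuming a made-up population of 100 units split into two strata labeled "A" and "B". The helper name `stratified_sample` and the per-stratum sample sizes are hypothetical.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: (id, stratum) pairs, 60 units in "A" and 40 in "B".
population = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 100)]

# Simple random sample: draw 10 units from the whole population at once.
srs = rng.sample(population, 10)

def stratified_sample(units, per_stratum):
    """Draw a separate random sample inside each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[1], []).append(unit)
    picked = []
    for label in sorted(strata):
        picked.extend(rng.sample(strata[label], per_stratum[label]))
    return picked

# Stratified random sample: 6 units from stratum "A" and 4 from stratum "B".
stratified = stratified_sample(population, {"A": 6, "B": 4})

print("SRS strata:       ", [s for _, s in srs])
print("Stratified strata:", [s for _, s in stratified])
```

The SRS may over- or under-represent a stratum by chance, while the stratified sample fixes the count taken from each stratum.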
Self-selection
sampling using volunteers
Convenience sampling
Sampling whichever individuals are easiest to reach; more common than random sampling but allows for a higher probability of bias
Control Groups
Gives a baseline for comparison with test groups
Placebo Effect
Individuals may respond favorably even when given a treatment that is known to be ineffective; the opposite (responding unfavorably) is the nocebo effect
Blinding
The treatment assignment is kept secret from the experimental subject
Double Blinding
The treatment assignment is kept secret from both the experimental subject and the individuals measuring the response
Sampling Error
Discrepancy between a sample statistic and the corresponding population value that arises by chance from sampling
Nonresponse bias
Not everyone who is asked to participate agrees to do so, and nonresponders differ from responders
What are some ways to display categorical variables in graphic form?
Bar plots and pie charts
Give a general description of a histogram
The range of observations is divided into subintervals (usually of equal size)
The frequency of observations in each subinterval is plotted as the height of a bar (frequency on the y-axis)
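The binning step can be sketched in Python as follows. The function name `histogram_counts`, the data, and the bin settings are illustrative assumptions; plotting libraries choose bins with their own rules.

```python
def histogram_counts(values, low, width, n_bins):
    """Count the values falling in each equal-width bin starting at `low`."""
    counts = [0] * n_bins
    high = low + n_bins * width
    for v in values:
        if v == high:
            k = n_bins - 1          # the top edge belongs to the last bin
        else:
            k = int((v - low) // width)
        if 0 <= k < n_bins:         # values outside [low, high] are skipped
            counts[k] += 1
    return counts

# Hypothetical observations, binned into 4 subintervals of width 2.5 on [0, 10].
data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]
counts = histogram_counts(data, low=0, width=2.5, n_bins=4)

# Text version of the histogram: one bar per subinterval.
for k, count in enumerate(counts):
    left = k * 2.5
    print(f"{left:>4} to {left + 2.5:<4} | " + "#" * count)
```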
What three aspects of the data are shown by histograms?
Center, Outliers, and General Shape
What would data look like that is symmetric or left/right skewed?
Symmetric or skewed describes the shape of the distribution
Symmetric → Both halves are a reflection of each other
Skewed → One side has a long tail and the other side has the bulk of the data
The skew is named for the side with the tail (left or right skewed)
Unimodal/Multimodal
number of peaks in the distribution
What is a quartile?
A value that divides the ordered observations into quarters; the 25th and 75th percentiles are the first (Q1) and third (Q3) quartiles
How do you make a box plot?
The median of the observations is denoted by a thick line
A box is drawn from the Q1 to the Q3
Whiskers extend to the largest and smallest observations that are not outliers
Outliers are shown as stars
What is a five-number summary?
The set of numbers that make up the → minimum, Q1, median, Q3, maximum
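A small Python sketch of this summary, using the standard library's `statistics.quantiles`. The data are made up, and the "inclusive" quartile method is an assumption: textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little from a hand calculation.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical observations.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(data)
print(summary)  # (2, 4.0, 7.0, 9.0, 12)
```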
Observations
The categorical or quantitative measurements made (data)
Frequency
A count of observations that fall into a certain category
Statistic (general)
A numerical measure calculated from the observations; sample characteristic
(2) measures of center
mean or median
(3) measures of spread
variance, standard deviation, IQR
What is the symbol for mean? What is its strength/weakness?
ȳ (y with a horizontal line over it)
Strength: efficient in using all data
Weakness: sensitive to outliers
What is the symbol for median? What is its strength/weakness?
M - population median
m (italics) - sample median
Strength: resistant to outliers
Weakness: does not use the actual values of all the data
Percentile
The pth percentile of the observations is the observation value such that p% of the observations are smaller than it
IQR or Interquartile Range
Q3 - Q1
Measures dispersion
What is the symbol for variance?
σ^2 - population variance
s^2 (italics) - sample variance
Standard Deviation (formula, will not need to calculate) Why is SD better than Variance?
s = √( Σ(y_i - ȳ)^2 / (n-1) ): the square root of 1/(n-1) times the sum of the squared differences between each value and the mean
(Roughly the typical distance of each value from the mean)
same units as the data, variance is squared
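Although the card says the formula will not need to be calculated by hand, a short Python sketch may help make it concrete. The data are made up; the result is compared with the standard library's `statistics.variance` and `statistics.stdev`, which use the same n-1 divisor.

```python
import math
import statistics

# Hypothetical observations.
data = [2, 4, 4, 5, 7, 8]

n = len(data)
mean = sum(data) / n                                      # 5.0
sum_sq_diff = sum((y - mean) ** 2 for y in data)          # 24.0
variance = sum_sq_diff / (n - 1)                          # s^2, in squared units
sd = math.sqrt(variance)                                  # s, in the data's units

print(variance, statistics.variance(data))
print(sd, statistics.stdev(data))
```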
What is the symbol for standard deviation?
σ - population standard deviation
s (italics) - sample standard deviation
How is an ‘outlier’ defined?
An observation is considered an outlier if it is smaller than Q1 - 1.5(IQR) or larger than Q3 + 1.5(IQR)
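The 1.5(IQR) rule can be sketched in Python like this. The data are made up, and the "inclusive" quartile method is an assumption, since quartile conventions vary between textbooks and software.

```python
import statistics

def outlier_fences(values):
    """Return the lower and upper cutoffs Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical observations with one unusually large value.
data = [2, 4, 4, 5, 7, 8, 9, 11, 30]
lower, upper = outlier_fences(data)
outliers = [y for y in data if y < lower or y > upper]

print(lower, upper)  # -3.5 16.5
print(outliers)      # [30]
```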
Parameter
population characteristic
(Box-plots) What is the meaning of long-tailed or short-tailed?
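Long-Tailed → Spike in data near the center, with tails that extend far out (long whiskers, many outliers)
Short-Tailed → Data evenly spread, with no extended tails (short whiskers)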
What are the proper graphs (2) to show the relationship between two categorical variables?
Frequency or Relative Frequency Table
Row percentages displayed, each cell is the count for that cell divided by the row total
Stacked Relative Frequency Bar Chart
Percent within levels of ____
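The row percentages described above can be computed as in this Python sketch. The two-way table of counts, its labels, and the function name `row_percentages` are made-up assumptions for illustration.

```python
# Hypothetical two-way table of counts: rows are groups, columns are outcomes.
table = {
    "Treatment": {"Improved": 30, "Not improved": 20},
    "Control":   {"Improved": 15, "Not improved": 35},
}

def row_percentages(counts):
    """Divide each cell count by its row total and express it as a percent."""
    result = {}
    for row, cells in counts.items():
        total = sum(cells.values())
        result[row] = {col: 100 * c / total for col, c in cells.items()}
    return result

row_pct = row_percentages(table)
for row, cells in row_pct.items():
    print(row, cells)
```

Each row of the output sums to 100, and these are the values a stacked relative frequency bar chart would display, one bar per row.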
What are the proper graphs (2) to show the relationship between a quantitative and a categorical variable?
Side by Side Box Plots
Side by Side Dotplots
What is the proper graph to show the relationship between two quantitative variables?
Scatterplot, with the explanatory variable on the x-axis and the response on the y-axis
What is the standard notation for a normal distribution?
Y ~ N(μ, σ)
μ is mean
σ is SD
How can the mean and the SD affect the appearance of a graph of normal distribution?
Mean (μ) → Determines the center
SD (σ) → Determines the spread (a larger σ gives a wider, shorter curve)
What does it mean to standardize a data point with respect to the normal curve?
Rescaling a normally distributed variable to the standard normal scale (mean 0, SD 1), so that different variables are equivalent with respect to the area under the curve
What is the equation to standardize a data point with respect to the normal curve?
Z = (Y - μ)/σ: subtract the mean and divide by the standard deviation to yield the # of SDs from the mean (Z)
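As a quick illustration of standardizing, here is a one-function Python sketch. The values μ = 100 and σ = 15 are made-up assumptions.

```python
def z_score(y, mu, sigma):
    """Number of standard deviations that y lies from the mean."""
    return (y - mu) / sigma

# Hypothetical example: Y ~ N(100, 15) and an observed value of 130.
z = z_score(130, mu=100, sigma=15)
print(z)  # 2.0, i.e. 130 is two SDs above the mean
```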
Using a normal distribution table, how can you convert from a data point to the proportion of data above or below that point?
Convert to Z value
The exact Z is the value on the leftmost column (ones and tenths digits) plus the value on the topmost row (hundredths digit)
→ Area/Proportion below Z = table value
→ Area/Proportion above Z = 1 - (table value)
Using a normal distribution table, how can you convert two data points to the proportion of data between those points?
Convert both data points to Z values (ZA and ZB, with ZA the smaller)
→ Area/Proportion between ZA and ZB = table value B - table value A
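The table lookups for areas below, above, and between can be mimicked in Python with the standard library's `statistics.NormalDist`, whose `cdf` method returns the area below a value (the same quantity a normal table lists). The values μ = 100, σ = 15 and the data points 85 and 115 are made-up assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, SD 1

# Hypothetical example: Y ~ N(100, 15), data points 85 and 115.
mu, sigma = 100, 15
z_a = (85 - mu) / sigma     # -1.0
z_b = (115 - mu) / sigma    #  1.0

below_b = std_normal.cdf(z_b)                        # area below Z_B ("table value")
above_b = 1 - below_b                                # area above Z_B
between = std_normal.cdf(z_b) - std_normal.cdf(z_a)  # area between Z_A and Z_B

print(round(below_b, 4), round(above_b, 4), round(between, 4))
```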
Using a normal distribution table, how can you convert a percentile to the corresponding cutoff point?
Find the proportion (percentile/100) in the body of the table, then read off the corresponding Z-value
Convert the Z-value back to Y using the standardization equation solved for Y: Y = μ + Zσ
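Going from a percentile back to a cutoff is the reverse lookup. A Python sketch using `statistics.NormalDist.inv_cdf`, which returns the Z-value with a given area below it. The values μ = 100, σ = 15 and the 90th percentile are made-up assumptions.

```python
from statistics import NormalDist

# Hypothetical example: find the 90th percentile of Y ~ N(100, 15).
mu, sigma = 100, 15

z = NormalDist().inv_cdf(0.90)   # Z-value with 90% of the area below it
y = mu + z * sigma               # standardization equation solved for Y

print(round(z, 4), round(y, 2))
```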
What are the four ways to assess the normality of data?
Histogram, Normal Curve, Probability Tables, Normality Tests
How do you assess normality using a histogram?
Plot the data as a histogram and superimpose a normal curve
How do you assess normality using a normal curve?
Compare the proportions of data within 1, 2, and 3 SDs of the mean with the 68-95-99.7 rule
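One way to sketch this check in Python is to compute the observed proportions directly. The data here are simulated from a normal distribution with made-up parameters, so the proportions should land near 68%, 95%, and 99.7%. The function name `proportion_within` is hypothetical.

```python
import random
import statistics

def proportion_within(values, k):
    """Proportion of values within k sample SDs of the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(1 for v in values if abs(v - mean) <= k * sd) / len(values)

rng = random.Random(3)                            # fixed seed for reproducibility
data = [rng.gauss(50, 10) for _ in range(2000)]   # simulated, hypothetical data

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(k, round(proportion_within(data, k), 3), "expected about", expected)
```

Proportions far from these benchmarks would suggest the data are not well described by a normal curve.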
How do you assess normality using probability tables?
Comparison of observed versus expected left tail percentages
How do you assess normality using the Shapiro-Wilk test?
Yields a p-value; a p-value above .1 indicates no evidence of non-normality
Sampling Variability
Variability among random samples from the same population
Sampling Distribution
A probability distribution that characterizes some aspect of sampling variability
Cutoff for CLT
As a rule of thumb, a sample size over 30 allows for the use of the CLT (Central Limit Theorem)
Standard Error (defn and formula)
The uncertainty in the mean of the sample data due to sampling variability, equal to the SD of X-bar
σ (or s) over √n
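Sampling variability, the sampling distribution, and the standard error can be seen together in a small simulation. This Python sketch assumes a made-up skewed population (exponential with mean 1 and SD 1), a sample size of 40, and 2000 repeated samples. All of these are illustrative choices.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is reproducible

n = 40                   # sample size (above the n > 30 rule of thumb)
n_samples = 2000         # number of repeated random samples

# Hypothetical skewed population: exponential with mean 1 and SD 1.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

center = statistics.mean(sample_means)        # should be near the population mean, 1
observed_se = statistics.stdev(sample_means)  # SD of the sample means
theoretical_se = 1.0 / math.sqrt(n)           # sigma / sqrt(n)

print(round(center, 3), round(observed_se, 3), round(theoretical_se, 3))
```

The sample means differ from sample to sample, and their spread is close to σ/√n even though the population itself is not normal.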
Bias
Estimates are systematically off from the true value; reduced by random sampling
Variability
Spread of estimates, reduced by increasing sample size
Confidence Level
The percentage of samples that will produce confidence intervals containing μ
Margin of Error (MOE)
Half the width of the confidence interval, equal to t(alpha/2, n-1) * s/√n
Critical Value
Z𝞪/2, the z-value whose upper tail probability under the standard normal curve is 𝞪/2
The z-values corresponding to the cutoffs for the confidence interval; they can be converted to Y to find the endpoints of the confidence interval
What is the notation for a normal curve created for a sample mean (SD known)?
X-bar ~ Normal(μ, σ/√n)
How do you find the confidence interval for a population mean calculated from sample means when SD is known?
Choose the confidence level 100(1-𝞪)% → look up the critical value Z𝞪/2 (e.g., 1.96 for 95%)
Sample Mean +/- Critical Value * Standard Error (standard deviation/√sample size) = Confidence Interval
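A Python sketch of this calculation when σ is known, using `statistics.NormalDist.inv_cdf` to get the critical value instead of a table. The function name `z_confidence_interval` and the example numbers are made-up assumptions.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value Z(alpha/2)
    standard_error = sigma / math.sqrt(n)          # sigma over the square root of n
    margin = z_crit * standard_error
    return xbar - margin, xbar + margin

# Hypothetical example: sample mean 50, known sigma 10, n = 25, 95% confidence.
low, high = z_confidence_interval(xbar=50, sigma=10, n=25)
print(round(low, 2), round(high, 2))
```

Here the standard error is 10/√25 = 2 and the critical value is about 1.96, so the interval is roughly 50 ± 3.92.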
How do you find the confidence interval for a population mean calculated from sample means using only estimated components?
X-bar +/- t(alpha/2, n-1) * s/√n
X-bar is sample mean, s is the sample standard deviation, n is the sample size
t(alpha/2, n-1) is the critical value of Student’s t-distribution with n-1 degrees of freedom for tail probability 𝞪/2
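The t-based interval can be sketched in Python as below. The standard library has no t-distribution, so this sketch assumes the critical value is read from a t table and passed in. The value 2.262 is the tabled t(.025, 9), and the ten observations are made up. The function name `t_confidence_interval` is hypothetical.

```python
import math
import statistics

def t_confidence_interval(values, t_crit):
    """Confidence interval for a mean using the sample SD and a tabled t critical value."""
    n = len(values)
    xbar = statistics.mean(values)
    s = statistics.stdev(values)            # sample SD (n-1 divisor)
    margin = t_crit * s / math.sqrt(n)      # margin of error
    return xbar - margin, xbar + margin, margin

# Hypothetical sample of n = 10 observations, so n-1 = 9 degrees of freedom.
data = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
t_crit = 2.262                              # t(.025, 9) from a t table, for 95% confidence

low, high, margin = t_confidence_interval(data, t_crit)
print(round(low, 3), round(high, 3), round(margin, 3))
```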
How do you calculate required sample size for a 95% confidence interval using sample standard deviation and desired margin of error?
Margin of Error depends on 𝞪 and n; if 𝞪 is .05 then t(.025, n-1) is approximately 2, and the required sample size (n) is approximately equal to (2s/MOE) squared
Plug in desired MOE and sample s to get recommended n, then round up
Or solve t(alpha/2, n-1) * s/√n = MOE for n, using an estimated t-value*
*this is the same relationship, rearranged
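The sample size rule can be written as a short Python function. The name `required_sample_size`, the default t-value of 2 (the approximation for 95% confidence used above), and the example numbers are illustrative assumptions.

```python
import math

def required_sample_size(s, moe, t_approx=2.0):
    """Approximate n so that t * s / sqrt(n) is at most the desired margin of error."""
    return math.ceil((t_approx * s / moe) ** 2)   # always round up

# Hypothetical examples.
print(required_sample_size(s=10, moe=2.5))   # (2*10/2.5)^2 = 64     -> 64
print(required_sample_size(s=15, moe=4))     # (2*15/4)^2   = 56.25  -> 57
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below the target.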
What are the assumptions when creating a one-sample confidence interval?
Data must be regarded as a random sample from a large population
Observations must be independent of each other
If n is small, the population distribution must be approximately normal
What measure of spread is resistant to outliers?
IQR