midterm 1 Flashcards
Three Criteria for Causality
- Plausibility
- Time Order
- Non-Spuriousness
What are the two types of variables?
Numeric and Categorical
mean, median, mode
- mean: the sum of all values divided by the number of values
- median: the middle value when the data are sorted
- mode: the most common value
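A minimal R sketch of the three measures, using made-up numbers. Base R has no built-in function for the statistical mode, so the `table()` approach below is just one common workaround.

```r
# Hypothetical data, made up for illustration
x <- c(2, 3, 3, 5, 7, 10)

mean_x   <- mean(x)    # sum of the values divided by how many there are
median_x <- median(x)  # middle value (average of the two middle values when n is even)

# mode() in base R reports the storage type, not the most common value,
# so count each value and pick the most frequent one.
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])
```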
Range
maximum – minimum value
IQR
Q3 – Q1 (the spread of the middle 50% of the data)
Standard deviation is the square root of what?
variance
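A short R sketch of range, IQR, and the link between standard deviation and variance, on made-up numbers. Note that `IQR()` uses R's default quantile method, which can differ slightly from a hand calculation.

```r
# Hypothetical data, made up for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

range_x <- max(x) - min(x)  # maximum minus minimum
iqr_x   <- IQR(x)           # Q3 minus Q1, using R's default quantile method

var_x <- var(x)             # sample variance (divides by n - 1)
sd_x  <- sd(x)              # sample standard deviation

# The standard deviation is the square root of the variance
same <- isTRUE(all.equal(sd_x, sqrt(var_x)))
```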
Cluster Sample vs Stratified Sample
Cluster: randomly samples existing clusters, then samples within those clusters
Stratified: creates subgroups based on variables and samples from within those subgroups
68 – 95 – 99.7 rule for normal distribution
About 68% of the data fall within 1 sd of the mean
About 95% of the data fall within 2 sd of the mean
About 99.7% of the data fall within 3 sd of the mean
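The rule can be checked in R with `pnorm()`, the standard normal cumulative probability. This is only a quick sanity check, not part of the original card.

```r
k <- 1:3  # number of standard deviations from the mean

# Probability of landing within k sd of the mean on a normal curve
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)
```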
how to calculate ANOVA
Test Statistic:
F = Mean Square between Groups (MSG) / Mean Square Error (MSE)
What does MSG conceptually mean?
the amount of variation between groups. In other words, how much of the variation you see in the sample is because there are multiple groups.
If this number is high, it means that the groups are different from each other. If this number is low, it means that the group means are all very similar to the overall mean – the groups are NOT different.
What is MSE conceptually?
the amount of variation within groups. If this number is high, it means that there is a lot of variation within the groups. If this number is low, it means that all of the observations are pretty close to average for their group.
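A sketch of the F statistic computed by hand and cross-checked against `aov()`. The scores and group labels are made up for illustration.

```r
# Hypothetical scores for three groups, made up for illustration
scores <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
group  <- factor(rep(c("a", "b", "c"), each = 3))

grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)
group_sizes <- tapply(scores, group, length)

k <- nlevels(group)   # number of groups
n <- length(scores)   # total number of observations

# MSG: how far the group means sit from the overall mean
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
msg <- ssg / (k - 1)

# MSE: how far observations sit from their own group mean
sse <- sum((scores - group_means[as.character(group)])^2)
mse <- sse / (n - k)

f_stat <- msg / mse

# Cross-check against R's built-in one-way ANOVA
f_aov <- summary(aov(scores ~ group))[[1]][["F value"]][1]
```

Here the group means (5, 8, 2) are far apart and the values inside each group are close together, so MSG is large relative to MSE.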
Factorial ANOVA
A technique for studying the effect of two or more categorical independent variables on a numeric dependent variable, accounting for interaction effects among the independent variables
Main Effect
The overall relationship between an independent and a dependent variable
Interaction Effect
when the relationship between two variables is different depending on the value of a third variable
Tukey HSD (factorial anova)
Tests every pair of group means to see whether they are statistically significantly different
In R: TukeyHSD() function on the aov model object
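A sketch of a factorial ANOVA followed by Tukey HSD, using `warpbreaks`, a dataset that ships with R. The choice of dataset and of the `tension` factor is only for illustration.

```r
# warpbreaks: breaks (numeric) by wool (2 levels) and tension (3 levels)
model <- aov(breaks ~ wool * tension, data = warpbreaks)

# Rows for wool and tension are main effects; wool:tension is the interaction
summary(model)

# Pairwise comparisons among the three tension levels
tukey <- TukeyHSD(model, which = "tension")
tukey
```

Each row of the output compares one pair of levels and reports the difference, a confidence interval, and an adjusted p-value.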
multivariate
studies relationships of independent variables with multiple dependent variables
conditions for regression
- linearity
- nearly normal residuals
- constant variability
- independent observations
coefficient vs y-intercept
coefficient (slope): B1 in y = B0 + B1x, or m in y = mx + b
y-intercept: B0 (or b)
RMSE root mean square error (three steps)
- find squared error
- calculate the mean of the squared errors
- take the square root
what minimizes the RMSE?
For a single variable, the mean is the guess that minimizes RMSE. In regression, the line that minimizes RMSE is the best-fit line, found by OLS (ordinary least squares).
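The three RMSE steps as a small R function, plus a quick check that the mean beats other single-number guesses. The function name `rmse` and the data are made up for illustration.

```r
# Hypothetical helper: RMSE in three steps
rmse <- function(actual, predicted) {
  squared_error <- (actual - predicted)^2  # step 1: squared errors
  mean_sq_error <- mean(squared_error)     # step 2: mean of the squared errors
  sqrt(mean_sq_error)                      # step 3: square root
}

y <- c(2, 4, 6, 8, 10)  # made-up data

# Try many single-number guesses and keep the one with the lowest RMSE
candidates <- seq(0, 12, by = 0.5)
errors     <- sapply(candidates, function(g) rmse(y, g))
best_guess <- candidates[which.min(errors)]
```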
R for regression
summary(lm(dependent_variable ~ independent_variable, data = dataset))
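A concrete version of the call above, using the `mtcars` dataset that ships with R. The variables `mpg` and `wt` are picked only as an example.

```r
# Regress miles per gallon (dependent) on car weight (independent)
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# "(Intercept)" is the y-intercept; "wt" is the slope coefficient
coef(fit)

# p-value for the test that the slope is zero
p_slope <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
```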
high leverage vs influential point
high leverage: very high or low value on the independent variable (x)
influential point: a high-leverage point that also pulls the regression line toward itself (typically extreme in both the independent (x) and dependent (y) variables)
Simpson’s Paradox
where the observed relationship between two variables changes, or even reverses, when the population is divided into different groups
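A tiny made-up example of the paradox in R: the slope is positive inside each group but negative when the groups are pooled. All numbers and names here are hypothetical.

```r
# Hypothetical data: two groups, made up for illustration
d <- data.frame(
  x = c(1, 2, 3, 6, 7, 8),
  y = c(10, 11, 12, 2, 3, 4),
  g = rep(c("A", "B"), each = 3)
)

# Slope when the groups are pooled together
slope_all <- coef(lm(y ~ x, data = d))[["x"]]

# Slope within each group separately
slope_a <- coef(lm(y ~ x, data = d[d$g == "A", ]))[["x"]]
slope_b <- coef(lm(y ~ x, data = d[d$g == "B", ]))[["x"]]
```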
Adjusted R squared in multiple regression
applies a penalty based on the number of parameters in the model
B0 + B1 = two parameters
B0 + B1 + B2 = three parameters
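A sketch of the penalty, assuming the usual formula 1 - (1 - R squared)(n - 1)/(n - k - 1), where k is the number of predictors. The model on `mtcars` is only an example.

```r
# Two predictors plus an intercept = three parameters
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)  # number of cases
k <- 2             # number of predictors (intercept not counted)

# Adjusted R squared by hand: shrinks R squared as k grows
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)

# R's own value, for comparison
adj_from_r <- s$adj.r.squared
```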
How to Build a Multiple Regression Model
- Simultaneous Inference or Full Entry: put all variables into model
- Hierarchical: adding variables in a pre-specified and theoretically justified order
- Forward Selection: start with no variables and add them one at a time, keeping those that improve the model
- Backward Elimination: start with the full model and remove variables one at a time
How is ANOVA a special case of regression?
Regression: how much of the variation in Y is explained by variation in the numerical variable X
ANOVA: how much of the variation in Y is explained by a categorical variable
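One way to see the connection in R: fitting the same categorical predictor with `aov()` and with `lm()` gives the same F statistic. `PlantGrowth` ships with R and is used here only as an example.

```r
# PlantGrowth: weight (numeric) by group (ctrl, trt1, trt2)
anova_fit <- aov(weight ~ group, data = PlantGrowth)
reg_fit   <- lm(weight ~ group, data = PlantGrowth)

# F statistic from the ANOVA table
f_anova <- summary(anova_fit)[[1]][["F value"]][1]

# Overall F statistic from the regression summary
f_reg <- unname(summary(reg_fit)$fstatistic["value"])
```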
nominal vs ordinal (categorical variables)
nominal: Data is organized into categories that have no intrinsic order
ordinal: Data is organized into categories that have a clear order
variable vs case
Variable: A feature that varies across cases/observational units/units of analysis
Case: an individual person/object whose features you are measuring and evaluating.
4 types of sampling methods
- Simple random sample: Randomly select a sample from a list
- Systematic sample: Organize the sampling frame in a list and select every kth individual
- Cluster Sample: Randomly sample existing clusters (e.g. cities, schools, neighborhoods) first, then sample within those clusters
- Stratified sample: Create subgroups (called strata) of the sample based on important variables (often race, gender, or other demographic features) and then sample from within those subgroups
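A sketch of the four methods in R on a made-up sampling frame of 100 students. The column names, school labels, and sample sizes are all hypothetical.

```r
set.seed(42)

# Hypothetical sampling frame: 100 students in 10 schools, two grades
frame <- data.frame(
  id     = 1:100,
  school = rep(paste0("school_", 1:10), each = 10),
  grade  = rep(c("9", "10"), times = 50)
)

# Simple random sample: 10 rows picked at random from the list
srs <- frame[sample(nrow(frame), 10), ]

# Systematic sample: random start, then every kth row
k          <- 10
start      <- sample(k, 1)
systematic <- frame[seq(start, nrow(frame), by = k), ]

# Cluster sample: pick 3 schools at random, then 4 students within each
chosen  <- sample(unique(frame$school), 3)
cluster <- do.call(rbind, lapply(chosen, function(s) {
  rows <- frame[frame$school == s, ]
  rows[sample(nrow(rows), 4), ]
}))

# Stratified sample: split by grade (the strata), then 5 students from each
stratified <- do.call(rbind, lapply(split(frame, frame$grade), function(s) {
  s[sample(nrow(s), 5), ]
}))
```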
The Central Limit Theorem (for proportion):
When observations are independent and the sample is large enough, the sampling distribution of sample proportions for a sample of size n is approximately normal with a mean of P and a standard deviation of √(P(1 − P)/n)
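A small simulation sketch of this in R. The values of `p` and `n` and the number of repeated samples are made up for illustration.

```r
set.seed(1)

p <- 0.3   # hypothetical population proportion
n <- 200   # size of each sample

# Draw 10,000 samples and record each sample proportion
p_hats <- rbinom(10000, size = n, prob = p) / n

# Standard deviation predicted by the formula
se_formula <- sqrt(p * (1 - p) / n)

mean(p_hats)  # should be close to p
sd(p_hats)    # should be close to se_formula
```

A histogram of the sample proportions, `hist(p_hats)`, should look roughly bell-shaped and centered near `p`.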
parameters vs statistics
parameter (population)
P – the population proportion
μ (mu) – the population mean
σ (sigma) – the population standard deviation
statistic (sample)
𝑝̂ (p-hat) – the sample proportion
𝑥̅ (x-bar) – the sample mean
s – the sample standard deviation
What's the null hypothesis in linear regression?
that the slope coefficient (B1) is zero, meaning there is no linear relationship between x and y