Midterm Flashcards
Dependent Variable
Outcome we are interested in
Depends on the other variable
Goes on the y-axis
Independent Variable
Intervention or treatment
The “cause”
Goes on the x-axis
Confounder
Something related to both the IV and DV
Comparability
Treated and control groups are comparable if they would look the same, on average, in the absence of treatment
Treated Group
Those who get some treatment of interest
Control Group
Those who do not get the treatment of interest
Observational Study
General term for research where you don’t get to randomize who gets the treatment
Instead you just observe some relationship in the world
Experimental Study & Randomized Control Trial (RCT)
common terms for research designs in which you do randomize who gets the treatment
Typically you can make causal claims from experimental studies
Quasi-experimental research
research in which you have observational data, but you find ways to ensure that the treatment was effectively randomly distributed
Internal Validity
Is the experiment well designed? Is it free from confounders or bias?
External Validity
Is the finding generalizable to other populations, situations or cases? Does it apply outside of the context in which the finding was generated?
Problems with experiments
Not everything can be randomized (democracy, gender)
Not everything should be randomized (wars, right to vote and medication for birth defects)
Ethical dilemma: randomizing treatment means denying treatment to some, but not randomizing means we don’t really know if it’s effective or not
Running experiments is expensive
Yi
dependent variable, outcome variable, the thing we want to predict
Xi
independent variable, the thing that predicts the DV
Ei
error term
the part of the DV that the IV doesn’t explain
everything NOT in our model
β1
slope coefficient
relationship between X & Y
Indicates how much change in Y is expected if X increases by 1 unit
β0
constant
Value of Y when X is zero (intercept)
Indicates where the regression line crosses the Y-axis
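Putting the cards for Yi, Xi, Ei, β1 and β0 together gives the simple regression model Yi = β0 + β1 Xi + Ei. A minimal sketch of how the pieces fit; the values of β0 and β1 and the single observation are made up for illustration and are not from the course:

```python
# Simple regression model: Y_i = beta0 + beta1 * X_i + E_i
# beta0 and beta1 are made-up values, for illustration only.
beta0 = 2.0   # constant (intercept): value of Y when X is 0
beta1 = 0.5   # slope: expected change in Y when X increases by 1 unit


def predicted_y(x):
    """Best guess of Y for a given X, ignoring the error term."""
    return beta0 + beta1 * x


# The error term is whatever the line does not explain for one observation.
x_i, y_i = 4.0, 5.0            # one hypothetical observation
e_i = y_i - predicted_y(x_i)   # observed Y minus the line's prediction

print("Y when X is 0:", predicted_y(0))
print("Change in Y for a 1-unit change in X:", predicted_y(3) - predicted_y(2))
print("Error term for the observation:", e_i)
```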
Endogeneity
the IV is correlated with the error term
Confounder: there is another unmeasured variable (a confounder) that affects both the IV and the DV
We haven’t included this other confounding variable in our model
To get our causal estimate
we need to create exogeneity
Exogeneity
IV is unrelated to or uncorrelated with the error term
Accomplish this through random assignment
Randomness
noise in the data
could go away with larger sample sizes
address some of these concerns with t-tests (p-values) and confidence intervals
Randomization
using a coin toss to create treatment and control groups which creates exogeneity
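The endogeneity, exogeneity and randomization cards can be checked with a small simulation. This is only a sketch under assumed numbers: the true effect of 1.0, the confounder's weight of 2.0 and the self-selection rule are all invented for illustration.

```python
import random

random.seed(0)
TRUE_EFFECT = 1.0   # assumed treatment effect, chosen for illustration


def diff_in_means(randomize, n=20000):
    """Treated-minus-control difference in mean outcome."""
    treated, control = [], []
    for _ in range(n):
        z = random.gauss(0, 1)               # unmeasured confounder
        if randomize:
            t = random.random() < 0.5        # coin toss: unrelated to z
        else:
            t = z + random.gauss(0, 1) > 0   # self-selection: depends on z
        # z is not in our "model", so it lives in the error term
        y = TRUE_EFFECT * t + 2.0 * z + random.gauss(0, 1)
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


observational = diff_in_means(randomize=False)   # endogenous treatment
experimental = diff_in_means(randomize=True)     # exogenous treatment

print("Observational estimate:", round(observational, 2))
print("Experimental estimate: ", round(experimental, 2))
```

With the coin toss the estimate should land close to the assumed effect; with self-selection it should be far too large, because treatment is correlated with the confounder sitting in the error term.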
Population
the overall collection of individuals, beyond just the sample
Sample
the collection of individuals on which statistical analyses are performed, and from which general trends for the population are inferred
Individual
also called an “object” or “unit”
a single data point contributing to the sample
Mean
average of a variable
X bar
Minimum
lowest value of a variable (min)
Maximum
the highest value of a variable (max)
Sample size
the number of observations (N)
Standard deviation
how widely dispersed the values of the variable are around the mean (can be computed as a population or a sample standard deviation)
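The summary statistics on the last few cards can be computed with Python's standard library. The data values here are hypothetical.

```python
import statistics

x = [2, 4, 4, 5, 7, 8]   # hypothetical values of one variable

n = len(x)                              # sample size (N)
mean = statistics.mean(x)               # X bar
minimum, maximum = min(x), max(x)       # min and max
sd_sample = statistics.stdev(x)         # sample SD: divides by N - 1
sd_population = statistics.pstdev(x)    # population SD: divides by N

print(n, mean, minimum, maximum)
print(round(sd_sample, 3), round(sd_population, 3))
```

The sample version is slightly larger than the population version because it divides by N - 1 instead of N.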
Probability Theory
Population distribution (Y)
Estimand/Parameter
Then goes to sample (Y1, Y2,…Yn)
Then to Estimator/Statistic g(Y1, Y2,…Yn)
Finally to Estimate and back to population distribution
Expectation
the best guess about what number will be drawn from the distribution
Variance
how far the numbers you drew tend to be from the best guess
E[X]
something that exists for any random variable
if you could draw repeatedly from a distribution and take an average, it would get closer and closer to the expectation
Variance
measure of spread, or how far you expect a random draw to be from the expectation
E[(X - E[X])^2]
draw a bunch of numbers from the same distribution, square the difference between each and the mean, and average those
E[X] & Var[X]
are properties of the distribution of X, not your data
Normal Distribution
X ~ N(E[X], Var(X))
X ~ N(µ, σ^2), where µ is the expectation and σ^2 is the variance.
Sample Mean
The Sample mean of X1, X2…XN is
X bar = (1/N) Σ Xi = (1/N) [X1 + X2 + … + XN]
This is an example of an estimator
Law of Large Numbers
For RVs (random variables) X1, X2, …, XN drawn from the same distribution, the mean (X bar) gets closer and closer to E[X]
As N grows the estimator gets better
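A quick way to see the Law of Large Numbers is to draw from a distribution whose expectation is known. This sketch assumes Uniform(0, 1) draws, so E[X] = 0.5; the choice of distribution is for illustration only.

```python
import random

random.seed(1)


def sample_mean(n):
    """Mean of n draws from Uniform(0, 1), whose expectation is 0.5."""
    return sum(random.random() for _ in range(n)) / n


# The sample mean is an estimator of E[X]; it tends to improve as N grows.
for n in (10, 1000, 100000):
    print(n, round(sample_mean(n), 4))

big_mean = sample_mean(200000)
print("Distance from E[X] with N = 200000:", abs(big_mean - 0.5))
```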
Distribution of the mean
Properties
Property 1: the mean’s distribution is centered around E[X]
You only get a mean once, but you know it takes a value from a distribution centered on E[X]
We don’t know E[X] (though we know X bar gets closer and closer as N grows)
Property 2: the variance of the mean, Var(X bar), is Var(X) divided by N
• So we can estimate Var(X bar) as Varbar(X)/N, where Varbar(X) is the estimated variance of X
How is the mean distributed
o There is one more piece to understanding the distribution of the mean: we know its expectation and variance, but what about the shape?
o The shape must depend on the distribution of the underlying X, you would think
Fortunately it does not
o Property 3: Central limit theorem (CLT): the distribution of the mean tends toward a normal distribution
This is magical: regardless of how the original X is distributed, when you take the mean of multiple RVs drawn from the same distribution, it starts to look normal
You do need N to be big enough for this to work, but that’s often not a problem
We will discuss some rules for deciding if N is big enough and adjustments to use when it is not
As N increases, the distribution starts to turn into a triangle shape, with the peak in the middle.
The more N increases, the width of each box gets smaller
o Theoretical understanding of distribution
Take the N=50 case
We can estimate the center using X bar which is 0.5
We can estimate the variance of X bar using (1/N) Varbar(X)
We know the shape of the distribution of the mean is normal
Using all this, we estimate that the mean should be distributed
• X bar ~ N(µ = 0.5, σ^2 = 0.08/N)
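The three properties of the mean can be checked by simulation. This sketch assumes the underlying X is Uniform(0, 1), which has E[X] = 0.5 and Var(X) = 1/12 ≈ 0.08; that matches the numbers in the N = 50 case, but the notes do not state the distribution, so treat it as an assumption.

```python
import random
import statistics

random.seed(2)
N = 50          # draws per sample
REPS = 20000    # number of repeated samples

# Take the mean of N uniform draws, many times over.
means = [sum(random.random() for _ in range(N)) / N for _ in range(REPS)]

center = statistics.fmean(means)            # Property 1: close to E[X] = 0.5
var_of_mean = statistics.pvariance(means)   # Property 2: close to Var(X)/N
predicted_var = (1 / 12) / N                # Var(X)/N under the assumption

# Property 3 (CLT): if the shape is normal, about 95% of the means
# should fall within 1.96 standard deviations of the center.
sd_of_mean = var_of_mean ** 0.5
share_within = sum(abs(m - center) <= 1.96 * sd_of_mean for m in means) / REPS

print("Center:", round(center, 4))
print("Variance of the mean:", round(var_of_mean, 5), "predicted:", round(predicted_var, 5))
print("Share within 1.96 SDs:", round(share_within, 3))
```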
Covariance
the mean value of the product of the deviations of two variables from their respective means
measure of the joint variability of two random variables
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for lesser values, the variables tend to show similar behavior and the covariance is positive
Positive association: when X is higher, we expect Y is usually higher
Negative association: when X is higher, we expect Y is usually lower
Not associated: when X is higher it doesn’t tell us anything about Y
Problem with covariance: scale is not very natural
Correlation
statistical technique that shows whether and how strongly pairs of variables are related
Correlation ranges between -1 & 1
A perfect positive relationship has Cor(X,Y) = 1
A perfect negative relationship has Cor(X,Y) = -1
Two perfectly unrelated variables have Cor (X,Y) = 0
When we divide by the standard deviations
We are in effect standardizing the covariance and rescaling it from -1 to 1.
Correlation only sees
linear relationships
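The covariance and correlation cards can be illustrated with a few lines of code. The data are made up; the covariance here divides by N, following the "mean value of the product of the deviations" definition.

```python
def mean(v):
    return sum(v) / len(v)


def covariance(x, y):
    """Mean of the product of deviations from the respective means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Covariance divided by both standard deviations."""
    sd_x = covariance(x, x) ** 0.5
    sd_y = covariance(y, y) ** 0.5
    return covariance(x, y) / (sd_x * sd_y)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * a + 1 for a in x]          # perfect positive line
y_rescaled = [100 * b for b in y_linear]   # same relationship, new units
y_curved = [a ** 2 for a in x]             # strong but non-linear

# Covariance changes with the units; correlation does not.
print(covariance(x, y_linear), covariance(x, y_rescaled))
print(correlation(x, y_linear), correlation(x, y_rescaled))

# Correlation only sees linear relationships.
print(correlation(x, y_curved))
```

The curved case shows why a correlation of zero does not mean "unrelated": Y is fully determined by X, but not along a line.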
Regression logic
o How can we best figure out this relationship between X & Y
o Suppose we guess the slope and intercept. How do we assess whether it is a good guess for the relationship between X & Y?
o We use the sum of squared errors (the residuals squared and then added up) to see how well we are doing
o It turns out the best way to estimate this relationship is to choose the slope and intercept that minimize the sum of squared errors.
o The residual is the distance between the line and any given point
o The SSE takes those residuals, squares them, and adds them up
o The regression shows us the best fitting line in terms of sum of squared errors
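The regression logic above can be sketched as: guess a line, score it with the SSE, then compare it with the line that minimizes the SSE. The data and the guessed line are made up; the closed-form slope and intercept are the standard least-squares solution for one X.

```python
def sse(x, y, intercept, slope):
    """Sum of squared errors: square each residual, then add them up."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def ols(x, y):
    """Slope and intercept that minimize the SSE (simple regression)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data

guess_sse = sse(x, y, 0.0, 2.5)   # a guessed intercept and slope
b0, b1 = ols(x, y)
best_sse = sse(x, y, b0, b1)

print("Guess SSE:", round(guess_sse, 3))
print("OLS intercept and slope:", round(b0, 3), round(b1, 3))
print("OLS SSE:", round(best_sse, 3))
```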
Standard Error
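o If we took another sample of data from the same source and estimated β̂ again, we would get a somewhat different value; across many samples, these estimates form a distribution
o SE(β̂) estimates the standard deviation of that distribution of β̂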
Compare modeled outcome to the simple mean
We can think of regression as a prediction machine that tells us our best guess of Y (growth) given our knowledge of X (yearsschool) for an observation
Understanding the variance explained or R squared
Spread of points around the regression line should be smaller than the spread of points around the mean line
If we average up the squared distances around the mean we get the variance of Y.
If we average up the squared distances around the regression line we get mean squared error for the regression
We were trying to minimize this form of prediction error in our choice of Beta.
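A common way to write the comparison above is R squared = 1 - (mean squared error around the regression line) / (variance of Y). The notes do not spell out this formula, so take it as the standard definition rather than the course's exact wording; the data are made up.

```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Best-fitting line in terms of the sum of squared errors.
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

# Average squared distance around the mean line: the variance of Y.
var_y = sum((b - mean_y) ** 2 for b in y) / n

# Average squared distance around the regression line: mean squared error.
mse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / n

r_squared = 1 - mse / var_y   # share of the variance in Y explained by X

print("Variance of Y:", round(var_y, 4))
print("MSE of regression:", round(mse, 4))
print("R squared:", round(r_squared, 4))
```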
Causal Inference
o Just like difference in means reflected an association, correlations, covariance, and regression coefficients only reflect an observed relationship in the data
The setup makes us focus on variation in IV as an explanation for DV
But those countries that are higher or lower on IV are probably higher on other things
These other things may be why the DV differs, not just the IV
In terms of confounders
What is bias
Unbiased estimate: on average, our estimate is equal to the true parameter
Biased estimate: our coefficient is systematically wrong, either higher or lower than the true parameter
Omitted Variable Bias
o OLS does not necessarily create unbiased estimates
o Omitted variable bias: this is a specific form of endogeneity
X is correlated with something else that influences Y; because that something is left in the error term, the error term is correlated with X
This is often the reason why, if you change the model specification, your estimates change. We say, our model is not robust to alternative specifications
Therefore our model is missing a key confounder, and we haven’t estimated a causal relationship
Theoretically, you could include this confounder in your model (multi-variable regression) but this is often hard to do: maybe you don’t know what the confounder is, or you don’t have data on it.
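Omitted variable bias can be seen in a simulation where we know the truth. Everything here is an assumption made for illustration: the true slope of 2.0, the confounder Z and its coefficients. The "controlled" estimate strips Z out of X before regressing, which gives the same coefficient on X as a regression that includes both X and Z.

```python
import random

random.seed(3)


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def slope(x, y):
    """OLS slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)


n = 50000
z = [random.gauss(0, 1) for _ in range(n)]                  # confounder
x = [0.8 * zi + random.gauss(0, 1) for zi in z]             # Z affects X
y = [1.0 + 2.0 * xi + 3.0 * zi + random.gauss(0, 1)         # Z affects Y
     for xi, zi in zip(x, z)]

naive = slope(x, y)                          # Z omitted: biased
implied_bias = 3.0 * cov(x, z) / cov(x, x)   # effect of Z times its link to X

# Controlling for Z: remove the part of X explained by Z, then regress.
x_on_z = slope(z, x)
x_resid = [xi - x_on_z * zi for xi, zi in zip(x, z)]
controlled = slope(x_resid, y)

print("True slope: 2.0")
print("Naive slope:", round(naive, 3), "implied bias:", round(implied_bias, 3))
print("Slope controlling for Z:", round(controlled, 3))
```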
Homoscedasticity
when the error term HAS the same variance for all values of X. This isn’t a problem
Heteroscedasticity
when the error term DOES NOT have the same variance for all values of X. This is a fixable problem
o Remember this only affects our standard errors, not our slope estimate. Therefore it doesn’t cause bias
o So what’s the solution? We use a slightly different estimator for our standard error calculations. Intuitively, we let the estimated error variance differ across observations instead of assuming it is constant.
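One common version of the "slightly different estimator" is the heteroscedasticity-robust (White, or HC0) standard error. The notes do not say which estimator the course uses, so this is an assumption, and the simulated data are made up. The sketch compares the usual standard error of the slope with the robust one when the error spread grows with X.

```python
import random

random.seed(4)

n = 5000
x = [random.uniform(0, 10) for _ in range(n)]
# Error spread grows with x, so the errors are heteroscedastic.
y = [1.0 + 2.0 * xi + random.gauss(0, 0.1 * xi ** 2) for xi in x]

mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)

b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx   # slope
b0 = my - b1 * mx                                               # intercept
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Usual standard error: assumes one common error variance.
se_classical = (sum(e ** 2 for e in resid) / (n - 2) / sxx) ** 0.5

# Robust (HC0) standard error: each observation keeps its own squared residual.
se_robust = (sum((xi - mx) ** 2 * e ** 2 for xi, e in zip(x, resid))
             / sxx ** 2) ** 0.5

print("Slope:", round(b1, 3))   # the slope estimate itself is not biased
print("Classical SE:", round(se_classical, 4))
print("Robust SE:", round(se_robust, 4))
```

The slope is the same either way; only the standard error changes, which is the point of the reminder above.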
Outliers
o An observation that is extremely different from the rest of the observations in the sample. “One of these things is not like the others”
o Reminder, what does an outlier do to our estimate of the mean? It drags it toward the outlier. The mean is sensitive to outliers
o Intuitively, then, what would outliers do to our regression estimates? They would also drag our estimate of the slope towards the outlier.
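A small made-up example shows both effects: adding a single extreme point pulls the mean and the regression slope toward it.

```python
def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """OLS slope from a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]    # points sit exactly on a line with slope 1

x_out = x + [6]
y_out = y + [30]       # one hypothetical outlier far above the line

print("Mean of Y:", mean(y), "->", mean(y_out))
print("Slope:", slope(x, y), "->", round(slope(x_out, y_out), 3))
```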
Two sample tests:
o How can we compare one sample to another sample to ask: “Are these samples from the same population or different populations?”
We call these two-sample tests: they compare the samples
• We will try to understand how likely we are to observe some difference if there really is no difference
• That will require thinking about a null distribution, describing how “weird” our result is if there really is no difference (our null hypothesis)
Difference in means
o The fundamental quantity of interest today will be the difference in means between two groups on some outcome
Difference in mean income across two states
Difference in voter turnout among two groups of people (Students vs employed)
Difference in probability of war in two groups of countries (autocracies vs democracies)
Difference in proportion of heads after tossing one coin N1 times and another coin N2 times.
One sided vs. Two sided tests
Two sided: is one group different, either bigger or smaller, than the other?
One sided: is one group bigger (smaller) than the other?
Hypothesis
o A good null: H0: VoteR = VoteD
o Alternative: VoteR ≠ VoteD
o We find that:
Among Republicans, 20/25 report they will vote. (VoteR=0.8)
Among Democrats, 22/35 report they will vote (VoteD=0.63)
Thus VoteR – VoteD = 0.17
What is the key thing we need to assess how “weird” this result is?
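One reasonable answer to the last question is the standard error of the difference, which tells us how much the difference would bounce around from sample to sample. The sketch below uses the counts from the example; the unpooled standard error formula and the normal approximation are assumptions, and with groups this small the approximation is rough.

```python
from statistics import NormalDist

# Counts from the example above.
n_r, yes_r = 25, 20    # Republicans: 20 of 25 report they will vote
n_d, yes_d = 35, 22    # Democrats: 22 of 35 report they will vote

vote_r = yes_r / n_r
vote_d = yes_d / n_d
diff = vote_r - vote_d

# Standard error of a difference in proportions (unpooled version).
se = (vote_r * (1 - vote_r) / n_r + vote_d * (1 - vote_d) / n_d) ** 0.5

z = diff / se                                    # standardized difference
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided
reject = abs(z) > 1.96

print("Difference:", round(diff, 3))
print("Standard error:", round(se, 3))
print("Test statistic:", round(z, 2))
print("Two-sided p-value:", round(p_value, 3))
print("Reject the null at 1.96?", reject)
```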
What is a p-value?
o A p-value is the probability of observing a difference in means or a coefficient at least as big as what we observed, if the null were true.
o First, you need to decide: is this a one-sided or two-sided test?
o Then you need to set your critical value: how different would I want these two distributions to be before I conclude they probably weren’t from the same distribution?
o Based on this critical value, you can say whether it seems like these two groups are significantly different
For two sample test with a difference in means
Null Hypothesis: difference in means between two groups is zero
Alternative: the difference in means is not zero, so the two samples are probably from different populations
Critical value: how certain do I want to be that they’re really different? Often 1.96
P-value: how often you would get a result at least as extreme as the one you got under the null. Usually we set the cut-off for the significance level to 0.05, or 1 out of 20 cases
Hypothesis testing with regression
o What if we wanted to know if the relationship we found between X and Y was real, versus we saw it just by chance?
o It could be that the sample we draw shows a relationship between X & Y but if we had a slightly different sample we wouldn’t see any relationship
o We need to set decision criteria: how different from zero do we want the relationship to be before we decide we’ve really uncovered a real relationship. We call this the critical value.
Hypothesis testing with regression
o Null: β1 = 0 (there is no relationship between X and Y)
o Alternative: β1 ≠ 0 (there is a relationship between X and Y)
o How would we set our cut off for making sure we don’t make mistakes? Pick our critical value
The standard normal distribution
o Up to 1.64 SDs covers 95% of the distribution (one-sided)
o -1.64 to 1.64 SDs covers the central 90% (90% confidence interval)
o Up to 1.96 SDs covers 97.5% of the distribution (one-sided)
o -1.96 to 1.96 SDs covers the central 95% (95% confidence interval)
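These four numbers can be checked against the standard normal distribution using Python's standard library.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

below_164 = std_normal.cdf(1.64)                              # about 0.95
central_164 = std_normal.cdf(1.64) - std_normal.cdf(-1.64)    # about 0.90
below_196 = std_normal.cdf(1.96)                              # about 0.975
central_196 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)    # about 0.95

# Going the other way: which cut-off leaves 2.5% in each tail?
critical_two_sided = std_normal.inv_cdf(0.975)                # about 1.96

print(round(below_164, 3), round(central_164, 3))
print(round(below_196, 3), round(central_196, 3))
print(round(critical_two_sided, 2))
```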
Hypothesis Testing
o State null hypothesis
o Determine if hypothesis test is one sided or two sided
o State alternative hypothesis
o Run Regression
o Test statistic: take coefficient and standardize it
Divide coefficient by standard error
We do this so we can compare the coefficient to the standard normal distribution
This will let us know how likely it would be that we got this coefficient just by chance (if the RA scrambled our data by accident)
o Decide on a critical value (typically 1.96 for two sided test)
o State whether we can reject the null in favor of the alternative
If critical value < |test statistic|: we reject the null
If critical value > |test statistic|: we fail to reject the null
Fearon and Laitin Dataset
o Research question: does ethnic fractionalization explain how long civil wars last?
Dependent variable? Year of civil war
Independent variable? Ethnic fractionalization
Null hypothesis?
• There is no association between ethnic fractionalization and years of civil war
One or two sided? Let’s go with two-sided to be safe
Alternative hypothesis?
• There is an association (either positive or negative) between ethnic fractionalization and years of a civil war
Let’s get the test statistic
• Step 1: get β1 (coefficient for ethnic fractionalization)
• Step 2: Standardize β1
o Estimate/std. error
o For year of civil war: Estimate for the intercept (44.571), std. error (2.079), t-value (21.439), p-value (2e-16)
o For ethfrac: Estimate (-9.830), std error (4.205), t-value (-2.338) , p-value (0.0207)
o For standardizing β1: (-9.830/4.205) = -2.338
Which is your test statistic for ethfrac!
• Let’s decide on a critical value: 1.96
• |test statistic| = 2.338
• Critical value = 1.96
• Therefore, critical value < |test statistic|
o In other words, we are very unlikely to see such a large estimate with this data due to chance alone. How unlikely? If the null were true, an estimate this large would come up only about 2% of the time
o We reject the null hypothesis
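The steps of this example can be sketched in code using the reported estimate and standard error for ethfrac. The p-value below uses a normal approximation, which is an assumption; the reported 0.0207 most likely comes from a t distribution, so the two will not match exactly.

```python
from statistics import NormalDist

# Reported regression output for ethfrac.
estimate = -9.830
std_error = 4.205
critical_value = 1.96    # two-sided test

# Standardize the coefficient: estimate / standard error.
test_statistic = estimate / std_error

# Reject the null if the critical value is below |test statistic|.
reject_null = critical_value < abs(test_statistic)

# Two-sided p-value under a normal approximation (assumption).
p_normal = 2 * (1 - NormalDist().cdf(abs(test_statistic)))

print("Test statistic:", round(test_statistic, 3))
print("Reject the null?", reject_null)
print("Approximate p-value:", round(p_normal, 4))
```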
Why do we use p-values?
• Intuitive: it’s just the probability of seeing a coefficient at least this large by chance if the null were true
• Easy short-hand: run the regression, if your p-value is below 0.05 reject the null. If not, you fail to reject the null
o Confidence intervals
Given we have just one sample, and we know there is variability from just having one sample rather than the full population of data, how confident should we be in our results (our point estimate)?
What factors would make you more confident in your estimates? Larger sample size, smaller variance in data
What factors would make you less confident in your estimates? Smaller sample size, large variance in your data
o Calculating confidence intervals
As long as you have a relatively big sample size then you can use the following basic formula
Notice this formula is also dependent on choosing your critical value
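The formula itself is missing from these notes. A common large-sample version, offered here as an assumption about what was intended, is: estimate ± critical value × standard error. The sketch applies it to the ethfrac coefficient from the Fearon and Laitin example.

```python
# Assumed large-sample formula: estimate +/- critical value * standard error
estimate = -9.830        # ethfrac coefficient from the example
std_error = 4.205        # its standard error
critical_value = 1.96    # gives a 95% confidence interval

margin = critical_value * std_error
lower = estimate - margin
upper = estimate + margin

print("95% confidence interval:", round(lower, 3), "to", round(upper, 3))

# If the interval excludes zero, this agrees with rejecting the null
# at the same critical value.
print("Excludes zero?", not (lower <= 0 <= upper))
```

A bigger critical value (say 2.58 for 99%) widens the interval, and a smaller standard error (from a larger sample or less variance) narrows it.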