Midterm Flashcards

Question

Individual

Answer 1

also called an "object" or "unit" | a single data point contributing to the sample

Answer 2

average of a variable | X bar

Answer 3

lowest value of a variable (min)

Answer 4

the highest value of a variable (max)

Answer 5

the number of observations (N)

Answer 6

how widely dispersed the values of the variable are (population or sample standard deviation)

Answer 7

Population distribution (Y), which has the estimand/parameter → sample (Y1, Y2, ..., Yn) → estimator/statistic g(Y1, Y2, ..., Yn) → estimate, which we use to learn about the population distribution

Answer 8

the best guess about what number will be drawn from the distribution

Answer 9

how far the numbers you drew tend to be from the best guess

Answer 10

something that exists for any random variable | if you could draw repeatedly from a distribution and take an average, it would get closer and closer to the expectation

Answer 11

measure of spread, or how far you expect a random draw to be from the expectation: Var(X) = E[(X - E[X])^2]. To estimate it: draw a bunch of numbers from the same distribution, square the difference between each and the mean, and average those
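A small sketch of the "draw, square the differences, average" recipe. The fair six-sided die is an assumption made only for illustration; its true expectation is 3.5 and its true variance is 35/12 (about 2.92), so the simulated numbers should land close to those.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumption: draws come from a fair six-sided die, chosen only for illustration.
draws = [random.randint(1, 6) for _ in range(100_000)]

mean = sum(draws) / len(draws)
# Average squared distance from the mean: the sample analogue of E[(X - E[X])^2].
variance = sum((x - mean) ** 2 for x in draws) / len(draws)

print(mean, variance)
```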

Answer 12

are properties of the distribution of X, not your data

Answer 13

X ~ N(E[X], Var(X)), i.e. X ~ N(µ, σ²), where µ is the expectation and σ² is the variance.

Answer 14

The sample mean of X1, X2, ..., XN is X bar = (1/N) Σ Xi = (1/N)[X1 + X2 + ... + XN]. This is an example of an estimator

Answer 15

For RVs X1, X2, ..., XN drawn from the same distribution, the mean (X bar) gets closer and closer to E[X] as N grows. In other words, as N grows the estimator gets better
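One way to see this is to simulate it. The sketch below assumes draws from Uniform(0, 1), picked only because its expectation (0.5) is known; the function name sample_mean is made up for this example.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from Uniform(0, 1), whose expectation is 0.5."""
    return sum(random.random() for _ in range(n)) / n

# The printed means should drift toward 0.5 as N grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```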

Answer 16

Property 1: the mean’s distribution is centered around E[X]
  • You only get a mean once, but you know it takes a value from a distribution centered on E[X]
  • We don’t know E[X] (though we know X bar gets closer and closer to it as N grows)
Property 2: the variance of the mean, Var(X bar), is Var(X) divided by N
  • So we can estimate Var(X bar) as Varbar(X)/N, where Varbar(X) is the variance estimated from the sample
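Both properties can be checked with a rough simulation. This sketch assumes Uniform(0, 1) draws (so E[X] = 0.5 and Var(X) = 1/12) and a sample size of 25; both choices are illustrative, not from the cards.

```python
import random
import statistics

random.seed(2)
N = 25          # size of each sample (assumed for illustration)
REPS = 20_000   # number of repeated samples

# Take many samples of size N and record the mean of each one.
means = [statistics.fmean(random.random() for _ in range(N)) for _ in range(REPS)]

center = statistics.fmean(means)        # Property 1: should be near E[X] = 0.5
observed = statistics.pvariance(means)  # Property 2: should be near Var(X)/N
theory = (1 / 12) / N
print(center, observed, theory)
```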

Answer 17

o There is one more piece to understanding the distribution of the mean: we know its expectation and variance, but what about the shape?
o You would think the shape must depend on the distribution of the underlying X
  • Fortunately, it does not
o Property 3: Central limit theorem (CLT): the distribution of the mean tends toward a normal distribution
  • This is magical: regardless of how the original X is distributed, when you take the mean of multiple RVs drawn from the same distribution, it starts to look normal
  • You do need N to be big enough for this to work, but that’s often not a problem
  • We will discuss some rules for deciding if N is big enough and adjustments to use when it is not
  • As N increases, the distribution starts to turn into a triangle shape, with the peak in the middle
  • As N increases, the width of each box gets smaller
o Theoretical understanding of the distribution
  • Take the N=50 case
  • We can estimate the center using X bar, which is 0.5
  • We can estimate the variance of X bar using (1/N) Varbar(X)
  • We know the shape of the distribution of the mean is normal
  • Using all this, we estimate that the mean should be distributed
    • X bar ~ N(µ = 0.5, σ² = 0.08/N)
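A quick simulation can make Property 3 concrete. The sketch assumes the underlying X is Uniform(0, 1), which is consistent with the card's center of 0.5 and variance of roughly 0.08, although the card does not name the distribution. If the mean is approximately normal, about 95% of simulated means should fall within 1.96 standard deviations of 0.5.

```python
import math
import random
import statistics

random.seed(3)
N = 50          # sample size from the card
REPS = 20_000   # number of repeated samples (assumed)

means = [statistics.fmean(random.random() for _ in range(N)) for _ in range(REPS)]

sd_of_mean = math.sqrt((1 / 12) / N)   # theoretical SD of X bar for Uniform(0, 1)
# Share of simulated means inside the band 0.5 +/- 1.96 SDs.
inside = sum(abs(m - 0.5) <= 1.96 * sd_of_mean for m in means) / REPS
print(inside)  # a normal shape would put roughly 95% of means in this band
```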

Answer 18

the mean value of the product of the deviations of two variables from their respective means; a measure of the joint variability of two random variables
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for lesser values, the variables tend to show similar behavior and the covariance is positive
Positive association: when X is higher, we expect Y is usually higher
Negative association: when X is higher, we expect Y is usually lower
No association: when X is higher, it doesn't tell us anything about Y
Problem with covariance: its scale is not very natural
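The definition translates almost word for word into code. This is a minimal sketch with made-up numbers; it divides by n (the population form), whereas many textbooks and software packages divide by n - 1 for samples.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from each variable's mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Made-up numbers, purely for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises with x
y_down = [10, 8, 6, 4, 2]    # falls as x rises

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative
# Rescaling x changes the covariance even though the relationship is the same,
# which is the "scale is not very natural" problem.
print(covariance([100 * v for v in x], y_up))
```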

Answer 19

statistical technique that shows whether and how strongly pairs of variables are related
Correlation ranges between -1 and 1
A perfect positive relationship has Cor(X,Y) = 1
A perfect negative relationship has Cor(X,Y) = -1
Two perfectly unrelated variables have Cor(X,Y) = 0

Answer 20

We are in effect standardizing the covariance and rescaling it from -1 to 1.
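"Standardizing" here can be read as dividing the covariance by the product of the two standard deviations. The sketch below uses made-up data and a hypothetical helper named correlation; it suggests the result stays the same when X is rescaled, unlike the covariance.

```python
import math

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]  # made-up data
print(correlation(x, [2, 4, 6, 8, 10]))   # perfect positive, close to 1
print(correlation(x, [10, 8, 6, 4, 2]))   # perfect negative, close to -1
print(correlation([100 * v for v in x], [2, 4, 6, 8, 10]))  # rescaling x changes nothing
```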

Answer 21

linear relationships

Answer 22

o How can we best figure out the relationship between X & Y?
o Suppose we guess the slope and intercept. How do we assess whether it is a good guess for the relationship between X & Y?
o We use the sum of squared errors (the residuals squared and added up) to see how well we are doing
o It turns out the best way to estimate this relationship is to choose the slope and intercept that minimize the sum of squared errors (SSE)
o The residual is the vertical distance between the line and any given point
o The SSE takes those residuals, squares them, and adds them up
o The regression shows us the best-fitting line in terms of the sum of squared errors
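A minimal sketch of the idea, using the standard closed-form least-squares slope and intercept for one X variable. The data and the function names fit_line and sse are invented for illustration. Nudging the slope or intercept away from the fitted values should only make the SSE larger.

```python
def fit_line(xs, ys):
    """Slope and intercept that minimize the sum of squared errors."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Square each residual (actual y minus predicted y) and add them up."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
best = sse(x, y, slope, intercept)
guess = sse(x, y, slope + 0.5, intercept)   # a different guess for the slope
print(slope, intercept, best, guess)
```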

Answer 23

o If we took another sample of data from the same source and estimated β̂ again, we would get a somewhat different value; across many such samples, the estimates form a distribution
o SE(β̂) estimates the standard deviation of that distribution of β̂

Answer 24

We can think of regression as a prediction machine that tells us our best guess of Y (growth) given our knowledge of X (yearsschool) for an observation

Answer 25

The spread of points around the regression line should be smaller than the spread of points around the mean line
  • If we average up the squared distances around the mean, we get the variance of Y
  • If we average up the squared distances around the regression line, we get the mean squared error for the regression
  • We were trying to minimize this form of prediction error in our choice of beta
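The comparison can be computed directly. This sketch uses made-up data and a hypothetical helper, mean_squared_distance, applied once to the flat mean line and once to the fitted regression line.

```python
def mean_squared_distance(ys, predictions):
    """Average of the squared distances between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar

# Spread around the mean line: the variance of Y.
variance_of_y = mean_squared_distance(y, [y_bar] * n)
# Spread around the regression line: the mean squared error.
mse = mean_squared_distance(y, [intercept + slope * a for a in x])
print(variance_of_y, mse)
```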

Answer 26

o Just like a difference in means reflects an association, correlations, covariances, and regression coefficients only reflect an observed relationship in the data
  • The setup makes us focus on variation in the IV as an explanation for the DV
  • But those countries that are higher or lower on the IV are probably also higher or lower on other things
  • These other things, not just the IV, may be why the DV differs
  • Think of these other things as confounders

Answer 27

Unbiased estimate: on average, our estimate is equal to the true parameter
Biased estimate: our coefficient is systematically wrong, either higher or lower than the true parameter

Answer 28

o OLS does not necessarily create unbiased estimates
o Omitted variable bias: this is a specific form of endogeneity
  • X is correlated with something else that influences Y, so the error term is correlated with X
  • This is often the reason why, if you change the model specification, your estimates change. We say our model is not robust to alternative specifications
  • In that case our model is missing a key confounder, and we haven’t estimated a causal relationship
  • Theoretically, you could include this confounder in your model (multi-variable regression), but this is often hard to do: maybe you don’t know what the confounder is, or you don’t have data on it
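A simulation sketch of omitted variable bias. Everything about the data-generating process is an assumption made for illustration: a confounder Z drives both X and Y, and the true effect of X on Y is set to 1.0. Controlling for Z is done here by "partialling out": regress X and Y on Z, then regress the leftover Y on the leftover X, which gives the same X coefficient as a regression of Y on both X and Z.

```python
import random

random.seed(4)

def fit(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def residuals(xs, ys):
    """Part of ys left over after removing its linear relationship with xs."""
    slope, intercept = fit(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

n = 50_000
z = [random.gauss(0, 1) for _ in range(n)]                 # confounder
x = [zi + random.gauss(0, 1) for zi in z]                  # X depends on Z
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

naive, _ = fit(x, y)                                   # Z omitted
adjusted, _ = fit(residuals(z, x), residuals(z, y))    # Z controlled for
print(naive, adjusted)
```

With these assumed numbers the naive slope should come out near 2.0, about double the true effect, while the adjusted slope should sit near 1.0.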

Answer 29

when the error term (the spread of Y around the regression line) HAS the same variance for all values of X. This isn’t a problem

Answer 30

when the error term (the spread of Y around the regression line) DOES NOT have the same variance for all values of X. This is a fixable problem
o Remember this only affects our standard errors, not our slope estimate. Therefore it doesn’t cause bias
o So what is the solution? We use a slightly different estimator for our standard error calculations (often called robust standard errors). Intuitively, it lets the estimated error variance differ across observations instead of assuming one common variance

Answer 31

o An observation that is extremely different from the rest of the observations in the sample. “One of these things is not like the others”
o Reminder: what does an outlier do to our estimate of the mean? It drags it toward the outlier. The mean is sensitive to outliers
o Intuitively, then, what would outliers do to our regression estimates? They would also drag our regression line, and so our estimate of the slope, toward the outlier

Answer 32

o How can we compare one sample to another sample to ask: “Are these samples from the same population or different populations?”
  • We call these two-sample tests: comparing the samples
    • We will try to understand how likely we are to observe some difference if there really is no difference
    • That will require thinking about a null distribution, describing how “weird” our result is if there really is no difference (our null hypothesis)

Answer 33

o The fundamental quantity of interest today will be the difference in means between two groups on some outcome
  • Difference in mean income across two states
  • Difference in voter turnout between two groups of people (students vs. employed)
  • Difference in probability of war in two groups of countries (autocracies vs. democracies)
  • Difference in proportion of heads after tossing one coin N1 times and another coin N2 times

Answer 34

Two sided: is one group different, either bigger or smaller than the other?
One sided: is one group bigger (smaller) than the other?

Answer 35

o A good null: H0: VoteR = VoteD
o Alternative: VoteR ≠ VoteD
o We find that:
  • Among Republicans, 20/25 report they will vote (VoteR = 0.8)
  • Among Democrats, 22/35 report they will vote (VoteD = 0.63)
  • Thus VoteR – VoteD = 0.17
  • What is the key thing we need to assess how “weird” this result is?
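One plausible answer to the closing question is the standard error of the difference, which tells us how much the difference would bounce around under the null. The sketch below assumes a pooled two-proportion z-test with a normal approximation; the card does not say which test the course uses, so treat the method as an assumption.

```python
import math
from statistics import NormalDist

n_r, votes_r = 25, 20   # Republicans in the card's sample
n_d, votes_d = 35, 22   # Democrats in the card's sample

p_r, p_d = votes_r / n_r, votes_d / n_d
diff = p_r - p_d

# Under the null both groups share one turnout rate, so pool the samples.
p_pool = (votes_r + votes_d) / (n_r + n_d)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_r + 1 / n_d))

z = diff / se                                   # standardized difference
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided
print(round(diff, 3), round(se, 3), round(z, 2), round(p_value, 3))
```

Under these assumptions the standardized difference falls short of 1.96, so a 0.17 gap in samples this small would not be especially "weird" if the null were true.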

Answer 36

o A p-value is the probability of observing a difference in means or a coefficient at least as big as what we observed, if the null were true
o First, you need to decide: is this a one-sided or two-sided test?
o Then you need to set your critical value: how different would I want these two distributions to be before I conclude they probably weren’t from the same distribution?
o Based on this critical value, you can say whether these two groups are significantly different

Answer 37

• Null hypothesis: the difference in means between the two groups is zero
• Alternative: the two samples are probably from different groups
• Critical value: how certain do I want to be that they’re really different? Often 1.96
• P-value: how often you would get a result at least as extreme as the one you got under the null. Usually we set the cut-off for the significance level to 0.05, or 1 out of 20 cases

Answer 38

o What if we wanted to know whether the relationship we found between X and Y is real, or whether we saw it just by chance?
o It could be that the sample we drew shows a relationship between X & Y, but if we had a slightly different sample we wouldn’t see any relationship
o We need to set decision criteria: how different from zero do we want the relationship to be before we decide we’ve uncovered a real relationship? We call this the critical value

Answer 39

o Null: β1 = 0, there is no relationship between X and Y
o Alternative: β1 ≠ 0, there is a relationship between X and Y
o How would we set our cut-off for making sure we don’t make mistakes? Pick our critical value

Answer 40

o Up to 1.64 SDs covers 95% of the distribution (one-sided)
o -1.64 to 1.64 SDs gives the central 90% confidence interval
o Up to 1.96 SDs covers 97.5% of the distribution (one-sided)
o -1.96 to 1.96 SDs gives the central 95% confidence interval
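These four numbers can be checked against the standard normal distribution using Python's standard library. The helper name coverage is made up for this sketch; the results are approximate because 1.64 and 1.96 are rounded cut-offs.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, SD 1

def coverage(z):
    """Share of the standard normal below z, and share between -z and z."""
    below = std_normal.cdf(z)
    central = below - std_normal.cdf(-z)
    return below, central

for z in (1.64, 1.96):
    below, central = coverage(z)
    print(z, round(below, 3), round(central, 3))
```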

Answer 41

o State null hypothesis
o Determine if hypothesis test is one sided or two sided
o State alternative hypothesis
o Run regression
o Test statistic: take the coefficient and standardize it
  • Divide the coefficient by its standard error
  • We do this so we can compare the coefficient to the normal distribution
  • This will let us know how likely it would be that we got this coefficient just by chance (if the RA scrambled our data by accident)
o Decide on a critical value (typically 1.96 for a two-sided test)
o State whether we can reject the null in favor of the alternative
  • If critical value < |test statistic|: we reject the null
  • If critical value > |test statistic|: we fail to reject the null
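The last three steps can be sketched as a tiny function. The name z_test and the example coefficients are hypothetical, invented only to show one case on each side of the critical value.

```python
def z_test(coefficient, std_error, critical_value=1.96):
    """Standardize a coefficient and compare it with the critical value."""
    test_statistic = coefficient / std_error
    reject_null = abs(test_statistic) > critical_value
    return test_statistic, reject_null

# Hypothetical regression output, invented for illustration.
print(z_test(5.0, 2.0))   # |2.5| > 1.96, so reject the null
print(z_test(3.0, 2.0))   # |1.5| < 1.96, so fail to reject the null
```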

Answer 42

o Research question: does ethnic fractionalization explain how long civil wars last?
  • Dependent variable? Years of civil war
  • Independent variable? Ethnic fractionalization
  • Null hypothesis?
    • There is no association between ethnic fractionalization and years of civil war
  • One or two sided? Let’s go with two-sided to be safe
  • Alternative hypothesis?
    • There is an association (either positive or negative) between ethnic fractionalization and years of civil war
  • Let’s get the test statistic
    • Step 1: get β1 (coefficient for ethnic fractionalization)
    • Step 2: standardize β1
      o Estimate / std. error
      o For the intercept: estimate (44.571), std. error (2.079), t-value (21.439), p-value (2e-16)
      o For ethfrac: estimate (-9.830), std. error (4.205), t-value (-2.338), p-value (0.0207)
      o For standardizing β1: (-9.830/4.205) = -2.338
        • Which is your test statistic for ethfrac!
    • Let’s decide on a critical value: 1.96
    • |test statistic| = 2.338
    • Critical value = 1.96
    • Therefore, critical value < |test statistic|
      o In other words, we are very unlikely to see such a large estimate with this data due to chance. How unlikely? Such an estimate will come up only about 2% of the time
      o We reject the null hypothesis
  • Why do we use p-values?
    • Intuitive: it’s the probability of seeing a coefficient at least this large just by chance, if the null were true
    • Easy short-hand: run the regression; if your p-value is below 0.05, reject the null. If not, you fail to reject the null
o Confidence intervals
  • Given we have just one sample, and we know there is variability from having one sample rather than the full population of data, how confident should we be in our results (our point estimate)?
  • What factors would make you more confident in your estimates? Larger sample size, smaller variance in the data
  • What factors would make you less confident in your estimates? Smaller sample size, larger variance in your data
o Calculating confidence intervals
  • As long as you have a relatively big sample size, you can use the following basic formula (in its usual large-sample form): estimate ± critical value × standard error
  • Notice this formula also depends on choosing your critical value
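The ethfrac numbers can be run through the same steps. This is a sketch under two assumptions: the p-value uses a normal approximation (the card's 0.0207 likely comes from a t distribution, so it differs slightly), and the confidence interval uses the usual large-sample form, estimate plus or minus critical value times standard error.

```python
from statistics import NormalDist

estimate, std_error = -9.830, 4.205   # ethfrac row from the card
critical_value = 1.96

test_statistic = estimate / std_error
reject_null = abs(test_statistic) > critical_value

# Normal-approximation p-value (assumption; a t distribution gives a slightly
# larger value, closer to the card's 0.0207).
p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))

# Assumed large-sample interval: estimate +/- critical value * standard error.
ci_low = estimate - critical_value * std_error
ci_high = estimate + critical_value * std_error

print(round(test_statistic, 3), reject_null, round(p_value, 4))
print(round(ci_low, 3), round(ci_high, 3))
```

The interval lies entirely below zero, which agrees with rejecting the null at the 1.96 critical value.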