Midterm Flashcards

1
Q

What is the Sample Average Treatment Effect? (SATE)
How do you find it?

A

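SATE = the average, across all units in the sample, of each unit's treatment effect. In a randomized experiment it is estimated by the mean outcome of the treatment group minus the mean outcome of the control group.

formula:
SATE = 1/n * sum (Yi(1) - Yi(0))

where Yi(1) is unit i's outcome with the treatment and Yi(0) is its outcome without it.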
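
A minimal Python sketch of the formula, assuming we somehow had both potential outcomes for every unit (real data never does, since only one outcome per unit is observed). The function name `sate` and the numbers are made up for illustration.

```python
def sate(y_treated, y_control):
    """Average of the unit-level effects Yi(1) - Yi(0).

    Both lists hold hypothetical potential outcomes for the SAME units.
    """
    if len(y_treated) != len(y_control) or not y_treated:
        raise ValueError("need two non-empty lists of equal length")
    effects = [y1 - y0 for y1, y0 in zip(y_treated, y_control)]
    return sum(effects) / len(effects)


# Made-up potential outcomes for four units.
y1 = [5, 7, 6, 8]  # outcome if treated
y0 = [3, 6, 6, 5]  # outcome if not treated
print(sate(y1, y0))
```

The unit-level effects here are 2, 1, 0 and 3, so the average is 1.5.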

2
Q

How do you find Mean?
What is a detriment to using mean?

A

Add together all of the numbers and divide the sum by how many numbers there are.

Detriment: can be influenced by outliers which pull the average too high or too low.

3
Q

How do you find the median?
What is the benefit of using the median?

A

If you have an odd number of values, sort them and locate the exact middle value.

If you have an even number of values, sort them, locate the two middle values, add them together, and divide the sum by 2.

Benefit: more robust against the impact of outliers
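
A short Python sketch comparing the mean and the median when one extreme value is added. The helper names and the numbers are made up for illustration.

```python
def mean(values):
    # Add everything up and divide by how many values there are.
    return sum(values) / len(values)


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the exact middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

Adding the single value 1000 drags the mean far upward, while the median only moves from 35 to 36.5.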

4
Q

How do you find Range?

A

subtract the minimum number from the maximum number

5
Q

How do you find the interquartile range?

A

Subtract Q1 from Q3

6
Q

How do you determine if a number is an outlier?

A

Find the lower and upper limits for non-outlier values. The lower limit is Q1 - 1.5 * IQR. The upper limit is Q3 + 1.5 * IQR. A number below the lower limit or above the upper limit is an outlier.
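
A small Python sketch of the 1.5 * IQR rule, taking Q1 and Q3 as already known. The function names and the example quartiles are made up for illustration.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) for non-outlier values."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    low, high = outlier_limits(q1, q3)
    return x < low or x > high


# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
print(outlier_limits(10, 20))
print(is_outlier(40, 10, 20))
print(is_outlier(35, 10, 20))
```

With these quartiles the limits are -5 and 35, so 40 is an outlier while 35 (exactly on the limit) is not.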

7
Q

How do you find the three Quartiles?

A

Start by sorting the list and finding the median of the entire list; this median is Q2. The median separates the list into two halves. The median of the lower half is Q1, and the median of the upper half is Q3.
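
A Python sketch of the median-of-halves method. One assumption not stated on the card: when the list has an odd number of values, the middle value is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, skip the middle value itself.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([1, 3, 5, 7, 9, 11, 13]))
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```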

8
Q

How do you find Standard Deviation? What is the formula?

A

Formula:
SD = sqrt( 1/(n-1) * sum (Xi - mean of X)^2 )

Steps:
1.) find the mean of X
2.) Subtract the mean from each x value
3.) Square each result from step 2
4.) Add together all the squares
5.) Divide the sum of the squares by the total number of observations minus 1
6.) Square root the result of step 5

The result of step 6 is the standard deviation
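
The six steps can be sketched in Python as follows. The function name `sample_sd` and the data are made up for illustration.

```python
import math


def sample_sd(values):
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n                    # step 1
    deviations = [x - mean for x in values]   # step 2
    squares = [d ** 2 for d in deviations]    # step 3
    total = sum(squares)                      # step 4
    variance = total / (n - 1)                # step 5
    return math.sqrt(variance)                # step 6


print(sample_sd([1, 2, 3, 4, 5]))
```

For 1 through 5 the mean is 3, the squared deviations add up to 10, and 10 / 4 = 2.5, so the SD is the square root of 2.5.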

9
Q

How do you find variance?

A

SD^2
Square the standard deviation

10
Q

What is the formula for the Correlation Coefficient (r)
<you will not compute this by hand!>
What is r telling you?
How is it written?
how is it described?

A

r = 1/(n-1) * sum of ( ((Xi - mean of X) / SD of X) * ((Yi - mean of Y) / SD of Y) )

r tells you
- The strength and direction of the linear relationship between two variables.
- How similar the measurements of the two variables are across a dataset.
- How closely the variables move together

It is written as a number between -1 and 1.
It can be described as strong or weak, positive or negative, or no correlation (0).
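
You will not compute r by hand, but a short Python sketch shows what the formula is doing: it averages the products of the standardized X and Y values. The helper names and the data are made up for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    # Product of the standardized x and y for each observation.
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly positive line
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly negative line
```

The first pair lies on an upward line and the second on a downward line, so the results are 1 and -1 up to floating-point rounding.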

11
Q

How do you find mean prediction error?

A

1.) For each data point, subtract the predicted value from the actual value.

2.) Add all of those values together.

3.) divide the sum by the total number of values.
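
The three steps as a small Python sketch. The function name and the numbers are made up for illustration.

```python
def mean_prediction_error(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]  # step 1
    total = sum(errors)                                  # step 2
    return total / len(errors)                           # step 3


print(mean_prediction_error([10, 12, 14], [9, 13, 11]))
```

The errors are 1, -1 and 3, so the mean prediction error is 1.0. Positive and negative errors can cancel each other out in this measure.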

12
Q

How do you find the root mean square error (RMSE)?

A

RMSE = sqrt (RSS/n)

1.) find the value of RSS (subtract predicted y from actual y for each data point, square the results, add the squares)

2.) divide RSS by the total number of values

3.) square root the result
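
A Python sketch of the three steps. The function name and the numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    # Step 1: RSS = sum of squared (actual - predicted).
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Steps 2 and 3: divide by the number of values, then square root.
    return math.sqrt(rss / len(actual))


print(rmse([10, 10, 10, 10], [7, 13, 7, 13]))
```

Every prediction here is off by 3 (twice too low, twice too high), so the RMSE is 3.0 even though the plain average of the errors would be 0.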

13
Q

What is the equation for a linear regression model?

A

Y= α +βX + ε

14
Q

What do the variables in the linear regression model mean?
Y= α +βX + ε

A

Y: dependent variable, what you are trying to predict
α: alpha, the y-intercept. The value of Y when X = 0
β: beta, the slope. The change in Y associated with a one-unit increase in X
X: independent variable, the predictor
ε: error term, the part of Y that the line α + βX does not explain.
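
A Python sketch of how α and β could be estimated from data with least squares. The closed-form formulas used here (slope = sum of cross-deviations / sum of squared X deviations, intercept = mean of Y - slope * mean of X) are the standard one-predictor solution; they are not given on the cards. Function names and data are made up.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) for the least-squares line Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept
    return alpha, beta


def predict(alpha, beta, x):
    return alpha + beta * x


alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)
print(predict(alpha, beta, 4))
```

These points lie exactly on Y = 1 + 2X, so the fit recovers an intercept of 1 and a slope of 2, and every residual is 0.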

15
Q

How do you find residuals (the estimated error term)?

A

Actual y - predicted y

16
Q

How do you find the residual sum of squares (RSS)?
(What is the formula)

A

RSS = sum of (Yi - Ŷi)^2

1.) subtract the predicted value of y from the actual value of y for each data point.
2.) square each result
3.) add together all of the squares

The result of step 3 is the RSS

17
Q

How do you find the total sum of squares (TSS)?
(What is the formula)

A

TSS = sum of (Yi-Ȳ)^2

1.) subtract the mean of y from each y value in the data set.
2.) square the results of each subtraction in step 1
3.) add together all of the squares

The result of step 3 is the TSS

18
Q

What is R2 and how do you find it?
what does it tell you?
What is the formula?

A

R2 is the proportion of variation in Y explained by the model.

Tells you how well a model fits the data

R2 = 1 - (RSS/TSS)

1.) find RSS
2.) find TSS
3.) Divide RSS by TSS
4.) Subtract the result of step 3 from 1

Result of step 4 is R2
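
A Python sketch that puts RSS, TSS and R2 together. The function names and the numbers are made up for illustration.

```python
def rss(actual, predicted):
    # Residual sum of squares: squared (actual - predicted), added up.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def tss(actual):
    # Total sum of squares: squared distance of each y from the mean of y.
    mean_y = sum(actual) / len(actual)
    return sum((a - mean_y) ** 2 for a in actual)


def r_squared(actual, predicted):
    # Undefined (division by zero) if every actual value is identical.
    return 1 - rss(actual, predicted) / tss(actual)


y = [1, 2, 3, 4, 5]
y_hat = [1.5, 2, 3, 4, 4.5]
print(rss(y, y_hat), tss(y), r_squared(y, y_hat))
```

Here RSS is 0.5 and TSS is 10, so R2 is about 0.95. Predicting the mean of y for every point would give R2 = 0, and perfect predictions would give R2 = 1.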

19
Q

What is the counterfactual?
What is the factual?

A

Counterfactual = what would have been observed under the condition that did not actually occur (for example, what would have happened absent the treatment)

Factual = What was actually observed

20
Q

What is the fundamental problem of causal inference?

A

The counterfactual can never be observed

  • you must infer the counterfactual outcomes as accurately as possible, but will never actually know what would have happened.
21
Q

What is the rule of causality?

(e.g., ice cream sales and suicide)

A

association does not equal causation!

22
Q

How can you figure out counterfactuals?
What is the problem with this tactic?

A

Matching: find a similar unit that did not receive the treatment and matches the treated unit as closely as possible

Problem: you cannot match on everything, and the unmatched differences can act as confounders

23
Q

What are confounders?

How do you minimize confounders

A

Variables associated with both the treatment and the outcome. They impact the results and make it difficult to attribute changes to the treatment.

Can be observed or unobserved.

Minimize by using randomized controlled trials

24
Q

What are Randomized Controlled Trials and how do they work to minimize confounders?

A

An RCT is when scientists randomly assign the treatment, which makes the treatment and control groups identical on average.

The groups are similar in terms of all characteristics, observed and unobserved. This allows scientists to attribute any difference in outcomes to the treatment and rule out confounders.

25
Q

What are double-blind experiments?

A

An experiment where neither the scientists nor the study participants know who is receiving the treatment and who is part of the control. Often used to prevent bias in the experiment.

26
Q

What is the placebo effect?

A

when a “fake” treatment produces a result that cannot be attributed to the placebo itself and is therefore caused by the patient’s belief in the “treatment”

  • people think they received treatment, and that belief affects the result

(ex. the subject says pills work to cure illness even though they received just a sugar pill that did nothing.)

27
Q

What is the Hawthorne Effect?

A

the phenomenon where study subjects behave differently because they know they are being observed by researchers.

28
Q

What are observational Studies

A

Studies where the treatment is naturally assigned. Scientists don’t DO anything, they just observe what is happening in nature.

29
Q

Why can observational studies not be randomized?

A

Ethical and logistical reasons

Ex. Ethical: smoking and lung cancer, it would be unethical to force a group of humans to smoke just to observe if they got lung cancer

Ex. Logistical: wars occur naturally; you cannot feasibly make countries go to war just to see what happens in the UN assembly. (This is an ethical problem too.)

30
Q

Compared to controlled experiments, do observational studies have weak or strong external validity? Why is this good?

A

Observational studies have stronger external validity (generalization beyond the study) than RCTs. This is because the events occur naturally and are not confined to the narrow, specific conditions of an experiment.

Strong external validity is good because it means the findings can be applied more broadly.

31
Q

Compared to controlled experiments do observational studies have weak or strong internal validity?

Why?

A

They have weaker internal validity.

Because:
* pre-treatment variables may differ between treatment and control groups
* confounding bias may exist due to these differences
* selection bias from self-selection into treatment may occur
* statistical control is needed (subclassification, control variables)
* unobserved confounding poses a threat

32
Q

What is external validity?

A

The extent to which the conclusions of a study can be generalized beyond the particular study

33
Q

What is internal validity?

A

the extent to which causal assumptions are satisfied in the study

The extent to which the effect of the treatment in a study can be attributed solely to the treatment itself and not other confounders.

This is the main advantage of Randomized controlled Trials

34
Q

what are the three strategies of observational studies?

A

1.) Cross-section comparison
2.) Within-unit effects (AKA Before and After comparison)
3.) Differences-in-differences

No strategy is best!

35
Q

What is cross-section comparison?
What is the assumption and the problems associated with this strategy?

A

When you compare treated units with control units after the treatment.
* Assumption: the treated and control units are comparable
* Possible unit-specific confounders may exist, so you may need statistical controls
* may also have selection bias

Ex.) Compare New Jersey and Pennsylvania unemployment after a minimum wage increase in New Jersey. The difference between the two states can be attributed to the minimum wage increase only if the two states are otherwise comparable.

36
Q

What is Within- Unit Effects comparison?
What are the problems associated with this strategy?

A

When you compare only one unit before and after treatment.

The advantage here is that differences between units (for example, states) do not introduce unit-based confounders.

The problem is that this introduces time-varying confounders. Other changes over time, aside from the identified treatment, may impact the results.

Ex. Compare just New Jersey unemployment before a minimum wage increase with the unemployment after the increase.

37
Q

What is a differences-in-differences comparison?
What is the assumption and the problems associated with this strategy?

A

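Using the change over time in a control group that did not get the treatment to predict what would have happened to the treated group had the treatment not been implemented. The effect is the change in the treated group minus the change in the control group.

Uses the parallel trends assumption.
-the assumption that in the absence of treatment, the difference between the ‘treatment’ and ‘control’ group would be constant over time

Addresses both unit-specific confounders (as long as they do not change over time) and time-varying confounders (as long as they affect both groups equally). Problem: the estimate is biased if the parallel trends assumption does not hold.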

Ex. using what happened in PA unemployment to determine what would have happened in NJ had they not introduced a higher minimum wage.
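
A Python sketch that contrasts the three observational strategies on the same numbers. The unemployment rates below are made up for illustration; they are not real New Jersey or Pennsylvania figures.

```python
def cross_section(treated_after, control_after):
    # Treated vs. control, both measured after the treatment.
    return treated_after - control_after


def before_and_after(treated_before, treated_after):
    # Same unit, after vs. before the treatment.
    return treated_after - treated_before


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    # Change in the treated unit minus change in the control unit.
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical unemployment rates (percent).
nj_before, nj_after = 5.0, 6.0
pa_before, pa_after = 5.5, 6.0

print(cross_section(nj_after, pa_after))
print(before_and_after(nj_before, nj_after))
print(diff_in_diff(nj_before, nj_after, pa_before, pa_after))
```

With these numbers the cross-section comparison gives 0.0, the before-and-after comparison gives 1.0, and differences-in-differences gives 0.5, because half of the change in the treated state also shows up in the control state.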

38
Q

What is probability Sampling?
Why is it used?

A

Is when every unit in the population has a known, non-zero probability of being selected to participate in the study.
It is used to ensure representativeness.

39
Q

What is Simple Random Sampling?

A

Is used to draw the sample purely at random. The bigger the sample, the more precise the results tend to be.

In simple random sampling, every unit has an equal selection probability.
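
A minimal Python sketch of simple random sampling using the standard library's `random.sample`, which draws without replacement so that every unit is equally likely to be picked. The function name, the `seed` parameter and the population of ID numbers are made up for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement, each with equal probability."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical population of 100 unit IDs.
population = list(range(100))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```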

40
Q

What is bias?
What are the 5 types of potential bias in survey sampling?

A

Bias is a systematic fault in the sampling system. If the error is not systematic, it is just noise and not bias
1.) frame bias
2.) sampling bias (selection bias)
3.) Unit non-response bias
4.) Item non-response bias
5.) response bias

41
Q

What is Frame bias?

A

When the sampling frame (the list the sample is drawn from) does not represent the target population

42
Q

What is Sampling Bias?

A

When the sample is selected in a way that is systematically non-random, so some units are more likely to be chosen than others

43
Q

What is Unit non-response bias?

A

When people in the sample or frame population systematically do not respond/participate in the survey

44
Q

What is item non-response bias?

A

When participants in the survey systematically do not respond to a specific item on the survey

45
Q

What is response bias?

A

When respondents lie on the survey or do not tell you the real response

ex.) social desirability bias, people tell you the answer they think is the most socially correct, not their real answer.

46
Q

What are list experiments?
When are they useful?
Example?

A

List experiments are when the control group of respondents is given a list of 3 items and asked how many of the 3 they support (or another indicator), while the treatment group is given the same list with an extra 4th item. If the average number of “supported” items is higher in the treatment group than in the control group, this indicates “support” for the 4th item. The difference between the two averages estimates the share of respondents who support it.

useful when the questions are sensitive or there is social pressure.

Ex.) To determine whether Afghans supported the Taliban, a control group was given a list of 3 organizations and asked how many they supported, and the average response was calculated. A treatment group was given the same question and list with the addition of the Taliban. The average number of supported groups was 2 in the control group and 3 in the treatment group. The increase indicates support for the Taliban.
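
A Python sketch of how a list experiment is analyzed: compare the average item count between the two groups. The function name and the responses are made up for illustration.

```python
def list_experiment_estimate(control_counts, treatment_counts):
    """Difference in the mean number of items supported.

    The difference estimates the share of respondents who support
    the extra (sensitive) item shown only to the treatment group.
    """
    mean_control = sum(control_counts) / len(control_counts)
    mean_treatment = sum(treatment_counts) / len(treatment_counts)
    return mean_treatment - mean_control


# Hypothetical answers: how many items each respondent supports.
control = [2, 1, 3, 2]     # saw the 3-item list
treatment = [2, 2, 3, 3]   # saw the same list plus the sensitive item
print(list_experiment_estimate(control, treatment))
```

The control average is 2.0 and the treatment average is 2.5, so the estimate is 0.5: roughly half of these made-up respondents support the sensitive item, without any individual having to say so directly.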

47
Q

What functions measure the center of data?

A

mean and median

48
Q

What functions measure the distribution of data?

A

Range
Quartiles
Interquartile range

49
Q

What does standard deviation measure?

A

measures on average how far away the data points are from their mean. It is purely descriptive.

50
Q

Why do you square and then square root in standard deviation?

A

You square to eliminate the impact of being on the opposite side of the mean: squaring removes the negative signs, so each value counts as a distance from the mean. This prevents the deviations from canceling each other out, which would leave you with 0. You take the square root because once the numbers are squared they are no longer in the units of the original data; the square root brings them back to the same unit.

51
Q

Why do you use n-1 in the standard deviation

A

In a sample, observed values fall closer to the sample mean than to the true population mean, so dividing by n underestimates the true population spread. Dividing by n-1 instead makes the result a better estimate of the population standard deviation.

n-1 accounts for the difference between the sample mean and the population mean when using a sample to estimate the SD of the true population.

53
Q

What is variance?

A

Shows how much the data points vary overall. It is the average squared distance of the data points from their mean (the square of the standard deviation), so it is measured in squared units. It is helpful for comparing samples.

54
Q

If data is far from the mean (high variance) is the mean representative?

A

probably not, high variance means there is more uncertainty around the mean.

55
Q

what is the unit of analysis?

A

the unit that represents the entity you are studying

ex. country, individual, household, congressional district, state

56
Q

What is the unit of observation

A

what uniquely identifies the observation being studied.

  • is a characteristic of the unit of analysis
    ex. country-year, state-month, individual wave
57
Q

what is a variable?
What is the key rule?

A

an empirical measure of a concept/characteristic.

key rule: variables must vary across observations

58
Q

what are the 2 types of variables, describe them.

A

1: Quantitative/Interval/Continuous- observations can take on an infinite number of numerical values between any two values (decimals).
2: Categorical — observations belong to one of a discrete set of categories & we assign a number to each category

59
Q

what are the 3 types of categorical variables, describe them

A

1.) Nominal — categories are named (independent).
2.) Ordinal — categories are ranked
3.) Dichotomous variables — two values (e.g., yes/no)

60
Q

what type of variable is age?

A

Age is treated as continuous, but it is usually recorded in whole years, so it can look ordinal and is often observed as non-continuous

61
Q

how can you transform variables

A

You can collapse continuous variables into ordinal (or nominal) variables. This does not work in reverse.
ex. you can turn incomes into categories of incomes

Log Transformation for continuous variables
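
A Python sketch of both transformations. The income cut points (30,000 and 80,000) and the category labels are hypothetical, chosen only for illustration.

```python
import math


def income_category(income):
    # Collapse a continuous income into an ordinal category.
    # Cut points are hypothetical.
    if income < 30000:
        return "low"
    if income < 80000:
        return "middle"
    return "high"


def log_transform(values):
    # Natural log; only defined for values greater than 0.
    return [math.log(v) for v in values]


incomes = [12000, 45000, 250000]
print([income_category(i) for i in incomes])
print(log_transform(incomes))
```

Once incomes are collapsed to "low", "middle" and "high", the exact amounts cannot be recovered, which is why the transformation does not work in reverse.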

62
Q

What does the distribution of a variable tell us?

A

what values a variable takes and how often it takes on these values

63
Q

What techniques can be used to describe the distribution of each type of variable

A

Categorical — frequency tables, barplots,

Continuous — mean/median, SD, histogram, density plots, boxplots,

64
Q

what two S words are used to describe distribution?
define them

A

symmetric- looks the same on both sides, for example a normal bell curve distribution

skewed- the data bunches on one side of the curve and creates a tail on the other.

65
Q

differentiate the two types of skewness

A

right skew- the tail is on the right
left skew- the tail is on the left

66
Q

what are the two types of modes a distribution can have, define them

A

unimodal: one mode/one hump in a distribution
bimodal: two modes/two humps in a distribution

67
Q

what items do you look at to understand the shape of distribution?

A

symmetry, skewness, number of modes, outliers and deviations from shape.

68
Q

what is a scatterplot

A

a plot with dots that shows a direct graphical comparison of two variables

69
Q

what is positive correlation?

A

When x is larger than its mean, y is likely to be larger than its mean

70
Q

What is negative correlation?

A

When x is larger than its mean, y is likely to be smaller than its mean

71
Q

what does it mean to have high correlation?

A

data cluster tightly around a line
indicates the two variables have a strong relationship

72
Q

what are the properties of the correlation coefficient r

A

1.) Correlation is between −1 and 1
2.) Order does not matter: cor(x, y) = cor(y, x)
3.) Not affected by changes of scale
4.) Correlation measures linear association

73
Q

what is a Z score?

A

the score given to each observation of a variable which measures the number of standard deviations an observation is above or below the mean

It is a measure of deviation from the mean

It is not sensitive to how the variable is scaled and/or shifted.

74
Q

how do you determine Z scores

A

For each observation, subtract the mean of the variable from the observation's value and then divide the result by the standard deviation of the variable.

z score of Xi = (Xi-x̄) / SD of X
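
A Python sketch of the z-score formula, assuming the sample SD (dividing by n-1) as elsewhere in this deck. The function name and the data are made up for illustration.

```python
import math


def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]


heights_cm = [150, 160, 170, 180, 190]
# Same data, rescaled to metres and shifted by a constant.
heights_shifted = [h / 100 + 5 for h in heights_cm]

print(z_scores(heights_cm))
print(z_scores(heights_shifted))
```

Both lists give the same z-scores up to floating-point rounding, which illustrates why z-scores are not sensitive to scaling or shifting.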

75
Q

what is the difference between a scatter plot and a QQ plot?

A

A scatterplot plots the relationship between two variables.

A QQ plot compares the quantiles of two distributions; use it to understand whether the distributions are similar

76
Q

what is clustering

A

making meaningful groups in the data

77
Q

how does the K-Means algorithm work

A

goal: split the data into similar groups where each group is associated with its centroid, which is equal to the within-group mean.

steps:
1.) choose the initial centers (centroids) of the K clusters
2.) assign each observation to the centroid closest to it
3.) recompute each centroid as the average of the points in its cluster
4.) reassign the observations to the clusters of the centroids closest to them
5.) repeat steps 3 and 4 until the assignments no longer change.
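
A small Python sketch of these steps for points given as coordinate tuples. It assumes the caller supplies the initial centroids (step 1) and that distance is ordinary squared Euclidean distance; the function names, the `max_iter` safety limit and the data are made up for illustration.

```python
def distance_sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)   # step 1: starting centers
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: give each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: distance_sq(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # step 5: nothing moved, stop
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = centroid(members)
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, [(0, 0), (10, 10)])
print(labels)
print(centers)
```

The first three points end up in one cluster and the last three in the other, and each final centroid is the mean of its cluster.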

78
Q

what is a moving average?
What determines the smoothness?

A

When you find the average within a defined window of time. As time moves forward, the average changes gradually because only one unit of time is replaced at each step.

Ex. if you want the average over a 7-day period, on the 8th day you drop day one. You keep 6 of the 7 days and add one new day’s data.

Smoothness is determined by the window size. A smaller window creates less smooth lines (each “day” has a bigger impact in a small window)
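
A Python sketch of a trailing moving average. The function name and the daily values are made up for illustration.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(values))
    ]


daily = [1, 2, 3, 4, 5, 6, 7, 8]
print(moving_average(daily, 7))  # days 1-7, then days 2-8
print(moving_average(daily, 3))  # smaller window, follows the data more closely
```

With a 7-day window the first average covers days 1 to 7 (4.0) and the second drops day 1 and adds day 8 (5.0).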

79
Q

Why do you square and then square root in standard deviation?

A

You square to eliminate the impact of being on the opposite side of the mean: squaring removes the negative signs, so each value counts as a distance from the mean. This prevents the deviations from canceling each other out, which would leave you with 0. You take the square root because once the numbers are squared they are no longer in the units of the original data; the square root brings them back to the same unit.

80
Q

are OLS regressions sensitive to outliers?
Why?

A

Yes.
OLS fits the line that minimizes the sum of squared distances from the points to the line. A point far outside the pattern has a very large squared distance, so one or more such points can change the slope of the line considerably.

Ex. Palm Beach vote share

81
Q

What are the 4 things to keep in mind when it comes to OLS regressions, and their components?

A

1.) OLS regressions are linear
-uses line of best fit, but it may not be appropriate
-not resistant to the influence of outliers
-slope is constant
-true relationship may not be linear

2.) OLS allows for unreasonable predictions
-only want to generate reasonable predictions
-Evaluating predictions is key to assessing the relationship between variables & the strength of the model

3.) OLS correlations do not necessarily indicate causation
-correlation can be driven by unobserved variables

4.) OLS regressions are versatile and robust
-Models the relationship between IV and DV and allows for making predictions
-allows for including additional variables in the model
-Continuous DVs & continuous and/or dichotomous IVs

82
Q

if an OLS relationship is not linear what are the 2 possibilities?

A

Curvilinear (i.e., quadratic) — a sign/slope shift (or reversal) — i.e., the effect of X on Y changes direction at different levels of X

Diminishing returns — slope stays in same direction, but the effect of a one-unit change in X decreases (or increases) as values of X increase

83
Q

In OLS it is valuable that we can add additional IVs, why should we control for additional variables?

A

Worried about omitted variable bias: some underlying (unobserved) factor (X2) is driving relationship between X1 and Y

important to ‘control’ for other variables that we think lie in the causal path. When we control we can determine how much effect each X is having on Y

-Venn diagram, find the net effect of each by removing overlapping areas.

84
Q

what does the RSS tell you?

A

measures the amount of variation in the residuals, i.e., the variation in y that the model leaves unexplained

85
Q

what does the TSS tell you

A

measures the total variation of y based on the squared distance from the mean.

  • the deviation of data points away from the mean value