Notes Flashcards

1
Q

Experimental observations

A

the observation of a variable factor under controlled conditions to determine if this changes as a result of the manipulation of another variable

2
Q

Hypothesis testing

A

generating a theory from observations through inductive reasoning; this is usually the first step of most analyses in the social sciences

3
Q

Can you use the same data/information that gave rise to a theory to test that theory?

A

No

4
Q

Random treatment

A

some subjects get a treatment, others do not, and we observe the outcome

5
Q

Random treatment in identical subjects

A

confidence that any difference between the treated and the non-treated group is due to the treatment itself

6
Q

Random treatment in heterogeneous subjects

A

we cannot have full confidence that any difference between the treated and the non-treated group is due to the treatment itself, but we can get close to full confidence because treatment is randomised, especially with larger sample sizes

7
Q

Self-selection

A

subjects put themselves forward for participation in the experiment

8
Q

Sources of endogeneity bias (problems in causal inference)

A

omitted variables, simultaneity, reverse causality, selection

9
Q

Identification strategy

A

a research design that addresses endogeneity bias in order to derive a robust causal inference

10
Q

Regression analysis

A

a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables

11
Q

Probability distribution

A

a statistical function that gives the probabilities of occurrence of possible outcomes for an experiment within a given range

12
Q

Sampling

A

the process by which we select a portion of observations from all the possible observations in the population (done in order to learn about the larger population)

13
Q

Difference between good and bad sample

A

due to both the sampling procedure and luck

14
Q

Random sample

A

every possible sample of a given size has an equal chance of being selected

15
Q

Central Limit Theorem

A

there is a systematic relationship between the probability that we will pick a particular sample, and how far that sample is from the true population average

16
Q

Sampling distribution

A

a picture that shows the relationship between the many possible sample averages we might conceivably calculate from different samples and the probability of getting those sample averages

17
Q

What is Beta in this sampling distribution and what do we know?

A

Beta is the true mean which is unknown, we know the shape of the sampling distribution

18
Q

What does the 0 and 15 represent here?

A

0 is the mean under H0, 15 is the cut-off value for a one-tail test at 5%

19
Q

Is the null hypothesis correct here?

A

Yes, as the mean under H0 is equal to the true mean

20
Q

Is the null hypothesis correct here?

A

No, as the mean under H0 is not equal to the true mean

21
Q

What are the two possible errors one can make in hypothesis testing?

A

Type I error and Type II error

22
Q

What is a Type I error and can the likelihood of it be controlled?

A

the null hypothesis is correct but we make a mistake and reject the null, yes it can be controlled

23
Q

What is a Type II error and can we know the likelihood of it?

A

the null hypothesis is incorrect but we fail to reject it; we cannot know the likelihood of a Type II error

24
Q

What happens to a Type II error as the likelihood of a Type I error decreases?

A

it increases

25
Null hypothesis
has the status of a maintained hypothesis: it is assumed true and will not be rejected unless the sample data provide strong contrary evidence (it is in a favoured position relative to the alternative hypothesis)
26
Statistical significance level
the probability of obtaining a coefficient as large as the one observed if the null hypothesis were true
27
Regression line
can be thought of as a 'guessing rule' that takes any value of X and maps it to a predicted value of Y (and vice versa)
28
What is the average of the residuals in a regression line always equal to?
0
29
What does the error Term (/residual) equal?
equals the observed (true) value of Y minus the predicted value of Y
30
What does the Central Limit Theorem tell us?
the probability distribution of the sample residuals (errors) from a regression will be a normal distribution with mean 0; for sufficiently large sample sizes the sampling distribution of the sample mean will be approximately normally distributed
31
Normal distributions
have in common a fixed relationship between the probability mass and the standard deviation
32
How do you calculate the standard normal?
(z-distribution) 1) subtract the mean, beta, from every value along the X-axis; 2) divide every value on the X-axis by the original standard deviation. The resulting distribution is a normal curve with (new) mean equal to 0 and (new) standard deviation equal to 1
33
How do you calculate the t-test?
a test of statistical significance, where t = (beta_hat - beta_H0)/sd_beta_hat; since beta_H0 equals 0 under the null, t = beta_hat/sd_beta_hat (sd_beta_hat is the standard error of beta_hat)
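A minimal numeric sketch of this calculation (the coefficient and standard error below are made-up numbers for illustration, not course data):

```python
# Hypothetical regression output
beta_hat = 0.42      # estimated slope
sd_beta_hat = 0.15   # standard error (sd of the sampling distribution of beta_hat)
beta_H0 = 0.0        # value of beta under the null hypothesis

t = (beta_hat - beta_H0) / sd_beta_hat   # t = beta_hat / sd_beta_hat when beta_H0 = 0
print(round(t, 2))   # 2.8 > 1.96, so the null would be rejected at the 5% level (two-tailed)
```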
34
What does the t-statistic tell you?
how many standard deviations of the sampling distribution (or standard errors) beta hat is from the mean under the null hypothesis, 0 (if sufficiently far away, then null hypothesis is rejected)
35
Counterfactual
a potential outcome that would have happened in the absence of the cause, not observable
36
Statistical inference
generalising from sample to population
37
Causal (statistical) inference
understanding cause- and effect- relationships
38
Law of Large Numbers (LLN)
the sample mean will converge in probability to the population mean as the sample size grows larger
39
What are the practical implications of the Law of Large Numbers (LLN)?
we can trust large sample sizes to yield accurate parameter estimates as the sample mean is used to estimate the population mean
40
What are the practical implications of the Central Limit Theorem (CLT)?
allows us to make probabilistic statements about the sample mean and construct confidence intervals (the normal distribution approximation is crucial for applying z-tests and t-tests)
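A small simulation sketch of both results, assuming numpy is available (data are simulated, not from the course): sample means converge to the population mean (LLN) and the distribution of sample means is approximately normal (CLT), even for a skewed population.

```python
import numpy as np

rng = np.random.default_rng(0)

# LLN: the sample mean of an exponential(1) population (true mean 1) approaches 1 as n grows
for n in (10, 1_000, 100_000):
    print(n, rng.exponential(1.0, size=n).mean())

# CLT: the distribution of 10,000 sample means (each from n = 50 draws) is roughly normal
sample_means = rng.exponential(1.0, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean(), sample_means.std())   # mean near 1, sd near 1/sqrt(50)
```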
41
What does a regression do?
compares treatment and control subjects who have the same observed characteristics
42
What does regression-based causal inference have to assume?
that when key observed variables have been made equal across treatment and control groups, selection bias from the things we cannot see is mostly eliminated
43
What are regression estimates?
weighted averages of multiple matched comparisons
44
What do dummy variables do?
classify data into yes-or-no categories
45
How are residuals calculated?
the difference between the observed Y and the fitted values Y_hat generated by the regression
46
How is regression analysis accomplished?
by choosing values that minimise the sum of squared residuals
47
Omitted Variable Bias
selection bias generated by inadequate controls (not enough or not the right ones)
48
What is the Omitted Variable Bias formula?
a tool that allows us to consider the impact of controlling for variables we wish we had (OVB = 'regression of omitted on included' multiplied by 'effect of omitted in long')
49
What is long regression?
includes additional controls, those omitted from the short
50
What is short regression?
a regression without the additional controls; its coefficient equals the long-regression coefficient plus the effect of the omitted variable in the long regression times the regression of omitted on included
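A sketch of this short = long + (effect of omitted in long) x (regression of omitted on included) identity on simulated data, assuming pandas/statsmodels are available (all variable names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # omitted variable, correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)  # x2 also affects y
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

long = smf.ols("y ~ x1 + x2", data=df).fit()   # 'long' regression
short = smf.ols("y ~ x1", data=df).fit()       # 'short' regression (x2 omitted)
aux = smf.ols("x2 ~ x1", data=df).fit()        # regression of omitted on included

# short coefficient = long coefficient + (effect of omitted in long) * (regression of omitted on included)
print(short.params["x1"])
print(long.params["x1"] + long.params["x2"] * aux.params["x1"])   # identical
```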
51
What is regression to the mean?
not a causal relationship, but a statistical property of correlated pairs of variables
52
What are conditional expectations?
the expectation of a variable in groups defined by a second variable (the conditional expectation of Y given that X equals the particular value x)
53
What does conditional expectations tell us?
how the population average of one variable changes as we move the conditioning variable over the values this variable might assume (E(Y|X=x))
54
What is the conditional expectation function (CEF)?
for every value of the conditioning variable, we might get a different average of the dependent variable, Y, and this is the collection of all such averages (E(Y|X))
55
What do CEFs with more than one conditioning variable tell us?
the population average of Y with K other variables held fixed
56
What are the two important properties of covariance?
1) the covariance of a variable with itself is its variance 2) if the expectation of either X or Y is 0, the covariance between them is the expectation of their product
57
What is a bivariate regression model?
a regression with one regressor, X, plus an intercept (the slope and intercept are the values of a and b that minimise the associated RSS)
58
How do you calculate the RSS?
RSS(a,b) = E[(Y - a - bX)^2]
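A sketch of the closed-form values that minimise this RSS, assuming numpy is available (simulated data; slope = Cov(X,Y)/Var(X), intercept = E(Y) - slope*E(X)); it also illustrates the residual properties on the next card:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=500)
Y = 2.0 + 3.0 * X + rng.normal(size=500)

b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)   # slope that minimises the RSS
a = Y.mean() - b * X.mean()                     # intercept that minimises the RSS

residuals = Y - a - b * X
print(a, b)                              # close to the true 2 and 3
print(residuals.mean())                  # ~0: residuals average to zero
print(np.corrcoef(residuals, X)[0, 1])   # ~0: uncorrelated with the regressor
```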
59
What are two properties of residuals?
they have expectation 0 and are uncorrelated with all the regressors that made them and with the corresponding fitted values
60
What does beta represent in a bivariate regression with dummy regressor?
the difference in expected Y with the dummy switched on versus off
61
What is homoskedasticity?
assumes the variance of residuals is unrelated to regressors (homoskedastic residuals can make regression estimates a statistically efficient matchmaker, however the assumption may not be satisfied)
62
What does the robust standard error formula allow for?
the possibility that the regression line fits more or less well for different values of X (heteroskedasticity)
63
What is the equation for a line?
Y = alpha + beta*X
64
What is the equation for a model?
Y = alpha + beta*X + u
65
What type of distribution is seen under the Central Limit Theorem?
normal distribution
66
What does the mean equal under the Central Limit Theorem?
0
67
What type of distribution does beta_hat have?
a normal sampling distribution
68
What is beta_hat?
the regression slope coefficient which is a function of the sample residuals
69
What is a normal distribution?
all have a fixed relationship between the probability mass and the standard deviation
70
What does the t-statistic tell you?
how many standard deviations (/standard errors) of the sampling distribution beta_hat is from the mean under the null hypothesis (if sufficiently far away, you reject the null hypothesis)
71
What happens if you set the criteria for rejecting the Null hypothesis to be stricter?
increases Type II error
72
What is multiple regression?
graphical representation of how a regression "holds Y constant"
73
What does the null hypothesis state?
always that there is no relationship (to reduce the risk of Type I error as it is worse than Type II error)
74
Is there a mechanical way to know the truth?
No
75
How do you compare to the standard error?
divide the coefficient by 2 and compare it to the standard error; if the halved coefficient is less than the standard error, you fail to reject the null hypothesis
76
How does least squares work?
by minimising the sum of the squared residuals of the points from the plotted curve
77
What does the p-value measure?
the probability of obtaining the observed results, assuming that the null hypothesis is true (the lower it is, the greater the statistical significance of the observed difference)
78
In the (population) linear regression function Y = alpha + beta*X + u, what does each term represent?
Y is the dependent variable; X is the independent variable; alpha and beta are parameters we want to estimate; u is the error term
79
What does the Ordinary Least Squares method aim to do?
minimise the sum of squared residuals
80
What t-statistic indicates statistical significance?
larger than 1.96 at 5% level for a two-tailed test
81
What does the population regression function look like?
Y = alpha + beta*X + u
82
What does the sample regression function look like?
Y_hat = alpha_hat + beta_hat*X
83
When does omitted variable bias occur?
when the omitted variable is correlated with both the dependent variable and the independent variable of interest
84
What could the direction of bias do?
biases that operate in the opposite direction may actually strengthen the final argument because exclusion makes it harder to reject the null hypothesis, not easier
85
What do dummy variables capture?
some "qualitative" characteristic of each observation that does not have an obvious numerical variable
86
What is an omitted category (/control dummy variable)?
one dummy variable that is not included in the regression; the coefficients on the included dummies are interpreted relative to the omitted category
87
What do you have to do when running a regression with dummy variables (e.g. north, south, east, west)?
omit the intercept term or include the intercept term and omit one dummy variable
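A hedged sketch of both options using statsmodels' formula interface (simulated data; variable names are invented). In this interface, C() builds the dummies and, when the intercept is kept, drops one category, which becomes the omitted/reference category:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=400),
    "y": rng.normal(size=400),
})

# Keep the intercept and omit one dummy (patsy drops the first category alphabetically):
with_intercept = smf.ols("y ~ C(region)", data=df).fit()
print(with_intercept.params)   # each coefficient = difference in mean y vs the omitted category

# Or omit the intercept and keep all four dummies:
no_intercept = smf.ols("y ~ C(region) - 1", data=df).fit()
print(no_intercept.params)     # each coefficient = mean y in that region
```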
88
When do you use an interaction term?
when the relationship between the dependent and independent variable changes depending on the value of another independent variable
89
If a 'mechanism' variable connects X and Y and you hold the 'mechanism' variable constant across the whole data, what should you not observe?
you should not observe any relationship between X and Y anymore (beta_hat would not be statistically significantly different from zero)
90
What is the purpose of a model?
not to predict every outcome, but useful analytically to help to isolate and thus better understand a particular generalisable mechanism that may come into play
91
What is a description?
a perfect correspondence with outcomes
92
What is a reduced form relationship?
a simplified equation that directly shows how the dependent variable is affected by one or more independent variables, without detailing the underlying mechanisms behind that relationship
93
What are parameters?
alpha and beta; constant values to be estimated that represent the relationship between variables
94
What are variables?
Y and X; observed or measured quantities that can change; Y is the dependent (outcome) variable, X is the independent (explanatory) variable
95
What does the error term represent and what does it account for?
represents unobserved factors affecting Y; accounts for randomness and measurement errors
96
What are exogenous variables?
determined outside the model and not influenced by other variables in the model
97
What are endogenous variables?
determined within the model and influenced by other variables in the model
98
When do endogeneity concerns in a regression model occur?
when explanatory variables are correlated with the error term (the estimated coefficients will be biased, meaning they do not converge to the true population parameters as the sample size increases)
99
If the true regression is Y = beta0 + beta1*X1 + beta2*X2 + u and the misspecified regression is Y = alpha0 + alpha1*X1 + v, when does omitting X2 lead to a bias in alpha1?
when X2 has a non-zero effect on Y (meaning X2 is a relevant explanatory variable for Y) and when X2 is correlated with X1 (which allows the effect of X2 to be partially captured by X1 in the misspecified model)
100
What do "country fixed effects" do?
allow each country to have its own intercept term
101
When "country fixed effects" are included, what does the single common slope coefficient across countries represent?
the average 'within' relationship within countries (regression will look like a set of parallel lines with different intercepts but same slope)
102
When "country fixed effects" are included, what does each country's intercept (their fixed effect) absorb?
will completely absorb any 'between variation' in the data
103
What do fixed effects force the regression to estimate?
the relationship within the countries
104
In panel data with country fixed effects, what do the slope coefficients capture?
a weighted average of the 'within' relationship in each country
105
What do country fixed effects control for?
'between' variation (control for both observable and unobservable time-invariant country characteristics (omitted variables))
106
What do time fixed effects control for in an analysis with countries?
control for time-varying observable and unobservable omitted variables that are common across countries (e.g. global shocks and trends)
107
Where does the difficulty lie in applying good causal inference?
difficulty lies in getting the analytics correct, not in getting the technical particulars correct
108
What must the estimates of standard errors take into consideration?
the patterns of correlations between the units of analysis
109
When are clustered standard errors necessary?
when an estimating regression includes variables of different degrees of aggregation
110
What are clustered standard errors?
a type of robust standard error that estimates the standard error of a regression parameter when observations are grouped into smaller clusters
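A hedged sketch of clustered standard errors in statsmodels (simulated data with a cluster-level error component; cov_type="cluster" is a real statsmodels option, but the setup here is illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n, n_clusters = 1_000, 50
cluster = rng.integers(0, n_clusters, size=n)                    # e.g. country IDs
x = rng.normal(size=n) + rng.normal(size=n_clusters)[cluster]    # x varies partly by cluster
y = 1.5 * x + rng.normal(size=n_clusters)[cluster] + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

model = smf.ols("y ~ x", data=df)
plain = model.fit()                                                          # default SEs
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": df["cluster"]})
print(plain.bse["x"], clustered.bse["x"])   # the clustered SE is typically larger here
```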
111
When the regression is level(Y)-level(X), how do you interpret it?
a unit-increase in X results in a beta unit increase in Y
112
When the regression is level(Y)-log(X), how do you interpret it?
a 1% increase in X leads to a beta/100 unit increase in Y
113
When the regression is log(Y)-level(X), how do you interpret it?
a unit increase in X results in a 100*beta% increase in Y
114
When the regression is log(Y)-log(X), how do you interpret it?
a 1% increase in X leads to a beta% increase in Y
115
If a fixed effect is correlated with X and is left out of the regression, what does that mean for the regression of Y on X?
it is biased
116
How can a fixed effect correlated with X be omitted without leading to bias?
running the regression on the de-meaned variables is consistent, because de-meaning removes the fixed effects, making OLS consistent (running the regression on the de-meaned variables Y-Y_bar and X-X_bar is numerically equivalent to running OLS of Y on X plus a full set of dummies)
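A sketch of this numerical equivalence on simulated panel data, assuming pandas/statsmodels are available (all names invented; "LSDV" below means least squares with dummy variables):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_units, n_periods = 100, 8
unit = np.repeat(np.arange(n_units), n_periods)
fe = rng.normal(size=n_units)[unit]           # unit fixed effect
x = 0.7 * fe + rng.normal(size=unit.size)     # x correlated with the fixed effect
y = 2.0 * x + fe + rng.normal(size=unit.size)
df = pd.DataFrame({"y": y, "x": x, "unit": unit})

# (1) LSDV: OLS of y on x plus a dummy for each unit
lsdv = smf.ols("y ~ x + C(unit)", data=df).fit()

# (2) Within estimator: de-mean y and x by unit, then run OLS
dm = df[["y", "x"]] - df.groupby("unit")[["y", "x"]].transform("mean")
within = smf.ols("y ~ x", data=dm).fit()

print(lsdv.params["x"], within.params["x"])   # identical slope coefficients
```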
117
When can fixed effects be used with cross-sectional data?
can be used to account for group-level characteristics that are constant within groups but vary between them (unobserved heterogeneity at the group level)
118
In what type of data are fixed effects most commonly used?
in panel data
119
Why does demeaning and dummy variables yield the same coefficients?
both methods isolate the within-entity variation in Y and X (demeaning subtracts the average of each variable for each entity, removing entity-specific, time-invariant effects) (dummy variables control for the same entity-specific effects by assigning each entity its own intercept)
120
What are two-way fixed effects?
entity fixed effects and time fixed effects
121
What is left after two-way fixed effects?
variation within-entity, over-time deviations relative to the overall time trends (estimation focuses on deviations of each entity from its average trajectory over time, relative to global or common time trends) --> remaining variation is how Y and X deviate from entity-specific average and global time trend for each year (time fixed effects absorb all variation common to all entities in given time-frame, entity fixed effects remove any time-invariant differences between entities)
122
What would the interacted fixed effect 'continent*year' fixed effect account for?
continent-specific shocks or trends that vary over time, but which affect all countries within a given continent the same way in each year (capture continent-level trends that vary from year to year, but are constant across all countries within the continent in a given year) --> estimates relationship between X and Y based on the within-country variation, net of time trends that are specific to the country or continent for a particular year
123
What is selection bias?
a distortion in a measure of association due to a sample selection that does not accurately reflect the target population
124
What are the four main examples of evidence?
1) testing the robustness of results to the inclusion of different sets of control variables and/or different methods of estimation 2) "placebo" or falsification tests (looking for an effect in otherwise similar circumstances but where you know no 'treatment' has been administered) 3) use of qualitative supporting evidence (e.g. descriptive information, survey data, ethnographic data, historical archives) 4) exploiting otherwise unlikely testable hypotheses from theory (Einstein approach)
125
What does reverse causality with any variable in a regression do to the coefficient estimates?
will bias all the coefficient estimates
126
What is reverse causality?
When Y causes X
127
What is simultaneity?
an omitted variable that is correlated to Y and X
128
How is the 2SLS estimator calculated?
(reduced form)/(first stage)
129
What type of estimation is Instrumental Variable Estimation?
Local Average Treatment Effect (LATE)
130
How do you test the exclusion restriction?
if there is more than one instrument, one can run an 'over-identification' test of the hypothesis that delta1=delta2=0
131
Are 'over-identification' tests strong?
No, they are weak tests
132
When is an instrumental variable valid? (3 points)
1) must be significantly correlated with the endogenous explanatory variable of interest; 2) only impacts the dependent variable via its impact on the endogenous explanatory variable (exclusion restriction); 3) is itself exogenous, in that the dependent variable cannot cause the instrumental variable
133
What does the exclusion restriction state?
that the instrumental variable only impacts the dependent variable via its impact on the endogenous explanatory variable
134
Does the 'treatment' (/causal) effect for any individual remain observed or unobserved?
unobserved
135
What does the Average Treatment Effects (ATE) equal?
the difference between the average outcome across all units if all units were in the treatment condition and the average outcome across all units if all units were in the control condition
136
How does true ATE emerge?
when we take averages across many individuals, the differences due to other unobserved factors tend to cancel out (,especially if treatment assignment is random)
137
What are compliers?
subjects who do what they are told
138
What are always-takers?
subjects who always take up the treatment whether they are told to or not
139
What are never-takers?
subjects who never take up the treatment whether they are told to or not
140
What are defiers?
subjects who always do the opposite of what they are told
141
What is the Intention-To-Treat effect (ITT)?
the estimate from when being an always- or never-taker is correlated with outcomes, and the treatment and control group are unbalanced
142
How do you calculate the treatment on the compliers?
(ITT)/(% of compliers) (as long as you can rule out the presence of 'defiers')
143
What is the Local Average Treatment Effect (LATE)?
local treatment effect as it is the treatment effect only on a subset of the population
144
When is the LATE estimate very close to the ATE estimate?
when the subset in LATE is similar to the rest of the population
145
What are two forms of LATE?
treatment on compliers; instrumental variables
146
What is a critique of shift-share approach?
takes time for markets to adjust to shocks
147
What is a covariate?
measurable variable that may impact a study's outcome (and has a statistical relationship with the dependent variable)
148
Do instrumental variables harness random assignment?
harness partial or incomplete random assignment whether naturally occurring or generated by researchers
149
What happens after standardisation?
values are measured in units defined by the standard deviation of the reference population
150
How is standardisation done?
by subtracting the mean and dividing by the standard deviation of the reference population
151
what is the first stage?
instrumental variable has a causal effect on the variable whose effects we are trying to capture
152
What is the independence assumption?
the instrumental variable is randomly assigned, so it is unrelated to the omitted variables we might like to control for
153
What is the reduced form?
the direct effect of the instrument on outcomes, which runs the full length of the chain
154
How is the causal effect of interest determined (LATE in IV)?
determined by the ratio of reduced form to first-stage estimates
155
Compliers and LATE
LATE is the average causal effect of interest on such people
156
What is monotonicity?
the no-defiers assumption, meaning that the instrument pushes those affected in one direction only
157
What is the LATE theorem?
for any randomly assigned instrument with a nonzero first stage, satisfying both monotonicity and an exclusion restriction, the ratio of reduced form to first stage is LATE (the average causal effect on compliers)
158
What is the average causal effect called?
treatment effect on the treated (TOT)
159
Is the TOT the same as LATE?
usually not the same, as the treated population includes always-takers
160
What is external validity?
whether a particular causal estimate has predictable values for times, places, and people beyond those represented in the study that produced it
161
What is the best evidence for external validity?
from comparisons of LATEs for the same or similar treatments across different populations
162
What are ITT effects?
effects of random assignment in randomised trials with imperfect compliance, where treatment assigned differs from treatment delivered (captures the causal effect of being assigned to treatment, but ignores the fact that some of those assigned not to be treated were in fact treated)
163
What is the ITT the reduced form for?
ITT is the reduced form for a randomly assigned instrument (dividing ITT estimates from a randomised trial by the corresponding difference in compliance rates gives the effect on compliers)
164
What does Two-stage least squares (2SLS) do?
generalises IV estimation; uses multiple instruments efficiently; controls for covariates, thereby mitigating OVB from imperfect instruments; allows as many control variables as you like, but they must appear in both the first and second stages
165
When is 2SLS weighted average more precise?
when instruments generate similar results when used one at a time, it is typically a more precise estimate of the common causal effect
166
What does "manual 2SLS" not do?
does not produce the correct standard errors needed to measure sampling variance
167
What are two assumption checks for instrumental variable?
1) first stage, by looking for a strong relationship between instruments and the proposed causal channel; 2) independence, by checking covariate balance with the instrument switched on and off, as in a randomised trial
168
Can the exclusion restriction be easily verified?
No
169
When does finite sample bias occur?
occurs when the instrumental variable estimator does not converge to its true value as the sample size increases
170
What is the fixed effects estimator also called?
within estimator
171
What variation does the IV regression use?
only use variation in the endogenous regressor that is induced by the instruments
172
What do the IV estimators tell us?
IV estimators only tell us the effect on the outcome of the type of variation in the endogenous variable that is typically induced by the instruments
173
What is used to estimate IV estimators?
2SLS
174
What are the steps of 2SLS?
1) regress X on Z (X = y0 + y1*Z) and obtain the fitted values X_hat; 2) regress Y on X_hat --> the estimated coefficient on X_hat, beta_hat, is the causal estimate of interest
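A hedged "manual" sketch of these two steps on simulated data (statsmodels assumed available; recall from the earlier card that manual 2SLS does not produce correct standard errors, so a dedicated IV routine should be used for inference):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 5_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor (correlated with u)
y = 1.0 * x + 2.0 * u + rng.normal(size=n)   # true causal effect of x is 1
df = pd.DataFrame({"y": y, "x": x, "z": z})

print(smf.ols("y ~ x", data=df).fit().params["x"])   # OLS is biased upwards here

first = smf.ols("x ~ z", data=df).fit()              # 1) first stage: X on Z
df["x_hat"] = first.fittedvalues
second = smf.ols("y ~ x_hat", data=df).fit()         # 2) Y on the fitted X_hat
print(second.params["x_hat"])                        # close to 1 (SEs here are not valid)
```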
175
Are differences-in-differences commonly used?
one of the most common mainstays of quantitative causal analysis
176
What should a convincing identification strategy show in a DiD analysis?
convincing identification strategy should show that the control group is a valid counterfactual for the treated group
177
What are parallel (common) trends? (DiD)
before treatment, both groups were following this, even if they had different levels
178
Is it possible to control for differences in trends if parallel trends is violated? (DiD)
Yes
179
Is parallel trends enough? (DiD)
No, story must be compelling for estimator to be convincing
180
What is the 'treatment' arguably exogenous to? (DiD)
arguably exogenous to factors related to time trends in the outcome variable
181
For 'treatment', can something else have happened to the treated group at the same time as the treatment that could affect the outcome variable? (DiD)
No
182
What do robustness checks in DiD do?
ensure that it is likely the treatment itself caused the change
183
What is the difference-in-differences estimate?
the coefficient on the interaction term
184
What is an event study?
it is part of the DiD estimating strategy exploiting the staggered timing of treatment
185
What does an event study have to do to work?
the year in which treatment started is normalised to 0 for all treated units
186
What do the time terms describe in an event study?
time terms describe the dynamic path in the years before treatment started and years after
187
What is the purpose of an event study model?
used for the purpose of estimating dynamic treatment effects when there are multiple instances of a treatment (an 'event') (treatments can occur simultaneously across all units, or staggered across time)
188
What do the coefficients in an event study model capture after the event occurring?
capture the dynamic effects of the treatment as these effects manifest over time since the event
189
What do the terms in an event study model provide before the event occurring?
provide a placebo or falsification test
190
What happens if there are only treated units with common event date in an event study model?
cannot identify treatment effects, as cannot separate effects of event from other confounders that occur in calendar time
191
What happens if there are both treated and untreated units with common event date in an event study model?
can identify treatment effects, as never-treated units help to identify the change in counterfactual outcomes across calendar times
192
What happens if there are only treated units with varying event date in an event study model?
if the timing of the event is as good as random, those treated earlier or later can serve as controls for one another
193
In event study models, what is a variable that is common to exclude?
the dummy variable for j = -1 event time
194
How can a classic differences-in-differences be implemented?
can be implemented using a two-way fixed effects regression that includes unit fixed effects and time fixed effects (unit fixed effects controls for time-invariant characteristics, time fixed effects control for common shocks)
195
What is the equation for a classic DiD?
Y_it = alpha + beta*(Treat_i*Post_t) + y*Treat_i + d*Post_t + u_it
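A hedged sketch of estimating this equation on simulated data (statsmodels assumed available; in the formula, treat*post expands to both main effects plus the interaction, whose coefficient is the DiD estimate):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4_000
treat = rng.integers(0, 2, size=n)    # treated-group indicator
post = rng.integers(0, 2, size=n)     # post-period indicator
# Parallel trends by construction: group gap of 1, common time effect of 2,
# and a true treatment effect of 3 for treated units in the post period.
y = 1.0 * treat + 2.0 * post + 3.0 * treat * post + rng.normal(size=n)
df = pd.DataFrame({"y": y, "treat": treat, "post": post})

did = smf.ols("y ~ treat * post", data=df).fit()
print(did.params["treat:post"])       # DiD estimate, close to 3
```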
196
What does a standard DiD estimate?
estimates single treatment effect
197
Does an event study allow dynamic responses?
Yes
198
What are 3 benefits of an event study?
tests parallel pre-trends (beta_k = 0 for k<0); reveals treatment dynamics (the beta_k pattern over k>0); shows anticipation effects (beta_k =/= 0 just before treatment)
199
What are the implementation details for an event study?
normalise beta for the omitted period (usually k = -1) to 0, and add confidence intervals for inference
200
What are the interpretations of the beta regarding time in an event study?
beta_k for k<0 captures pre-trends; beta_0 is the immediate effect; beta_k for k>0 captures dynamic responses
201
In an event study graph what do the solid points, empty circle and vertical bars represent?
solid dots are point estimates (beta_k); the empty circle is the omitted period (normally k=-1, normalised to 0); vertical bars are the 95% confidence intervals
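A hedged sketch of such an event-study regression on simulated staggered-treatment panel data (pandas/statsmodels assumed available; event-time dummies are built by hand, k = -1 is dropped as the normalised period, and unit and year fixed effects are included):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
units, years = 60, range(2000, 2011)
df = pd.DataFrame([(i, t) for i in range(units) for t in years], columns=["unit", "year"])
df["treat_year"] = 2005 + (df["unit"] % 3)               # staggered treatment timing
df["k"] = (df["year"] - df["treat_year"]).clip(-4, 4)    # event time, binned at +/-4
df["y"] = np.where(df["k"] >= 0, 2.0, 0.0) + rng.normal(size=len(df))  # effect only post-event

event = pd.get_dummies(df["k"], prefix="k").drop(columns="k_-1")   # omit k = -1
fes = pd.get_dummies(df[["unit", "year"]].astype(str), drop_first=True)
X = sm.add_constant(pd.concat([event, fes], axis=1).astype(float))

fit = sm.OLS(df["y"], X).fit()
print(fit.params.filter(like="k_"))   # beta_k: ~0 for k < 0 (pre-trends), ~2 for k >= 0
```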
202
What are 3 identifying assumptions for a DiD model?
parallel trends; no anticipation; no spillovers
203
Is a suitable control group an important check for DiD?
most important check for DiD
204
When is a suitable control group often feasible for DiD?
often feasible when 'treatment' is (quasi-) random
205
When is finding a suitable control group challenging for DiD?
challenging when treated groups' characteristics that could drive trends are correlated with treatment probability
206
What is matching?
use statistical techniques to construct an artificial control group by identifying for every treated observation an untreated observation that has similar observable characteristics
207
When matching, is there a guarantee groups will be matched on unobservable characteristics?
No
208
What does matching create?
a comparison group for which the joint distribution of observable characteristics is the same as that of the treated group
209
What do matching estimators allow for?
allow broadly for nonlinearities in the relationships between observables
210
Do regression-based estimators return linear approximations?
Yes, unless nonlinearities are introduced, e.g. quadratics, logs, interaction terms
211
When are regression-based estimators more likely to give a better answer?
when observable and unobservable characteristics are reasonably uniformly distributed across a continuum
212
When is matching better to find suitable controls?
if both the observables are clustered together, and unobservables are correlated with the observables (some chance that unobservable characteristics cluster as well)
213
How does traditional matching pair (and how to interpret findings)?
pair each 'treated' observation with an observably similar non-treated observation and interpret the difference in their outcomes as the effect of the treatment
214
What is the ATT the mean of?
mean of individual differences from traditional matching (expected difference in outcomes between the treated units and what those same units would have experienced had they not been treated)
215
Why do matching techniques yield ATT and not ATE?
because they examine the difference between the treatment and control on observations that have similar characteristics to the treated group
216
What is the curse of dimensionality problem?
difficult to apply exact matching if conditioning on a large set of characteristics is required
217
What does the Propensity score (Pr(Z)) represent?
probability that a unit of analysis is 'treated' based on observed characteristics (Z) (solution to the dimensionality problem) (probability of receiving treatment given covariates)
218
What type of matching results in more participants being able to be matched than with exact matching?
Propensity Score Matching (PSM)
219
How does Propensity Score Matching (PSM) match?
units with same (or similar) propensity scores are matched
220
For which units can PSM only be done?
can only be done for units whose propensity scores lie within the common support
221
What is the common support?
the values of the propensity score for which there are observations in the data for both the treated and the untreated
222
What is nearest neighbour matching (pairwise matching)?
non-treated unit whose propensity score is closest to the treated is selected as the match
223
Why is nearest neighbour matching often used?
because of its ease of implementation
224
What is Caliper matching?
variation of nearest neighbour matching that attempts to avoid bad matches by limiting the maximum distance between propensity scores allowed
225
What does stratification/interval matching require?
requires decision about how wide intervals should be (e.g. intervals so that the mean values of the estimated propensity scores are not statistically different from each other within each interval)
226
What are the nonparametric methods in matching (Kernel matching/local linear matching)?
construct a match for each treated unit using a weighted average over multiple units in the non-treated group rather than a single nearest neighbour (more recent approach)
227
What is the main advantage of nonparametric methods in matching (Kernel matching/local linear matching)?
reduction in the variance of the estimated matched outcome (may come at expense of higher bias)
228
What is the difference between linear probability model and probit/logit model?
in a linear probability model when Y is a binary dummy variable, predicted values of Y from OLS can be greater than 1 or less than 0, whereas a probit/logit model is an alternative that generates only predicted values between 0 and 1 (nonlinear approximation function)
229
Why is the linear probability model recommended unless strong reason not to? (matching)
have known properties and tend to be much more robust under various conditions
230
What is the probit/logit model common for estimating and is it 'less biased' than the linear probability model?
common for estimating PSM, and not 'less biased' than the LPM, as it is very unlikely that the true underlying distribution is probit or logit (trading one bias for another)
231
What data do DiD matching estimators require?
require panel data or repeated cross-sectional data both before and after treatment time
232
How do DiD matching estimators identify treatment effects?
by comparing outcome changes of treated observations to outcome changes for matched untreated observations
233
What do DiD matching estimators allow selection into treatment to be based on?
allow selection into treatment to be based on unobserved time-invariant characteristics of observations
234
What do DiD matching estimators control for?
control for time-varying unobserved characteristics to the extent that time varying unobservables are clustered and correlated with time varying observables
235
What can DiD matching estimators not control for?
cannot control for time varying unobservables not correlated with observables
236
What do synthetic control matching (SCM) construct? (matching)
construct a "synthetic" control group as a weighted combination of potential control units, which in theory provides a counterfactual of what would have happened to treated units in absence of treatment
237
What do synthetic control matching (SCM) place a lot of emphasis on?
matching pre-treatment trends between treated unit and synthetic control
238
What cases are synthetic control matching (SCM) suited to?
suited to cases with a clear treatment and control group, but where traditional DiD might be problematic due to lack of parallel trends or other pre-treatment differences
239
What type of data is synthetic control matching (SCM) more suited for?
aggregate-level data and works more effectively when there is a single or few treated units and many potential control units
240
What can synthetic control matching (SCM) not address?
cannot address unobserved confounders that affect post-treatment trends differently than pre-treatment trends
241
What cases are synthetic control matching (SCM) most suitable for?
most suitable for cases where a counterfactual is needed for a single country or region (or at most a few) with a clear treatment and where there is sufficiently long time series data for both treatment and 'donor' countries or regions
242
What challenge does PSM address and how does PSM address it?
the challenge of how to estimate causal effects when treatment is not random (selection bias); the solution is to find 'similar' individuals across treatment and control groups
243
What is the fundamental problem in DiD?
(treatment outcome - no treatment outcome) is wanted, but only one potential outcome is observed, and comparing means introduces selection bias
244
What is the key theorem of PSM?
if selection into treatment depends only on observables (X), then matching on p(X) is as good as matching on X (reduces dimensionality problem)
245
What are 3 questions regarding the dimensionality problem in matching?
When are units "similar"? How close is "close enough"? What are the trade-offs between variables?
246
What do you need to do for multi-dimensional matching?
need to match on each X separately (many cells will be empty)
247
What are 3 benefits of one-dimensional matching?
a single number summarises all characteristics; a clear metric for "closeness" makes it easier to find matches; the balancing property ensures the X's are balanced
248
How are Propensity Scores usually estimated?
usually via logistic regression, modelling log(p(X)/(1-p(X))) as a linear function of X
249
What does a high Propensity Score mean?
high score means similar to treated units (low score means similar to non-treated units)
250
In the distribution of Propensity Scores, what does the left tail, right tail and middle represent? What are the implications?
the left tail are controls with no comparable treated units; the right tail are treated units with no comparable controls; the middle is where valid comparisons can be made; the implications are that the treatment effect is only defined where the distributions overlap (cannot generalise to very high/low propensity regions; balance improves when restricting to the common support)
251
How do you use PSM?
1) find comparable individuals: estimate p(X) = P(T=1|X) and match treated individuals with similar control individuals; 2) compare matched individuals (single period, or DiD to compare changes over time)
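A hedged sketch of these two steps using scikit-learn on simulated data (logistic regression for p(X), nearest-neighbour matching on the score, then the mean of matched differences as the ATT; all names and numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(9)
n = 2_000
X = rng.normal(size=(n, 3))                          # observed covariates
p_true = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
T = rng.binomial(1, p_true)                          # selection on observables
y = 2.0 * T + X[:, 0] + rng.normal(size=n)           # true treatment effect is 2

# 1) estimate the propensity score p(X) = P(T=1|X)
pscore = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# 2) match each treated unit to the nearest control on the propensity score
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(pscore[control].reshape(-1, 1))
_, idx = nn.kneighbors(pscore[treated].reshape(-1, 1))
matched_control = control[idx.ravel()]

att = (y[treated] - y[matched_control]).mean()       # ATT: mean of matched differences
print(att)                                           # roughly 2
```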
252
What do matches from PSM in single period account for?
account for X, but not pre-treatment outcome
253
What do matches from PSM in DiD account for?
account for X and pre-treatment outcome
254
What do matches from PSM in DiD control for?
controls for observables through matching; controls for time-invariant unobservables through differencing
255
Why is PSM and DiD combined methods more credible than either alone?
PSM ensures similar individuals are compared and DiD removes fixed differences between groups
256
What assumption does PSM alone rely on?
selection on observables assumption
257
In a Regression Discontinuity Design (RDD), what is being found?
'jumps' in the probability of treatment as we move along some running variable
258
How is 'treatment' assigned in Regression Discontinuity Design (RDD)?
assigned to a unit if and only if Z>z, where Z is observable and where z is a known threshold
259
What is Z in a Regression Discontinuity Design (RDD)?
the 'running' (or 'forcing') variable and is a continuous variable assigning units to treatment
260
What does Z in a Regression Discontinuity Design (RDD) depend on?
can depend on unit's characteristics and choices, but there is also a random chance element
261
In a Regression Discontinuity Design (RDD), when is treatment status as good as randomised?
When Z = z (for units at the threshold z, treated and control groups should possess the same distribution of baseline characteristics)
262
What does Regression Discontinuity Design (RDD) require?
requires assignment rule to be known, precise, and free of manipulation (does not have to be arbitrary)
263
What is the complication in a Regression Discontinuity Design (RDD)?
matching cannot be used, as there is no case where the same underlying attribute occurs both with treatment and without treatment (no "common support")
264
What is extrapolation in RDD?
comparing units with different values of the running variable (only overlap on the limit as Z approaches the cutoff from either direction)
265
What does Kernel local linear regression do (RDD)?
gives more weight to observations closer to the cutoff (the bandwidth is a crucial parameter of the kernel function that determines the range around the cutoff within which data are included)
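A hedged sketch of a sharp-RDD estimate with a triangular kernel and local linear regression on either side of the cutoff, via weighted least squares (numpy/statsmodels assumed available; data simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n, cutoff, bandwidth = 5_000, 0.0, 0.5
z = rng.uniform(-1, 1, size=n)                       # running variable
treated = (z >= cutoff).astype(float)                # sharp design: treatment jumps at the cutoff
y = 1.0 * z + 2.0 * treated + rng.normal(scale=0.5, size=n)   # true jump is 2

# Triangular kernel weights: more weight near the cutoff, zero outside the bandwidth
w = np.clip(1 - np.abs(z - cutoff) / bandwidth, 0, None)

# Local linear regression: intercept, centred running variable, treatment dummy, interaction
X = sm.add_constant(np.column_stack([z - cutoff, treated, (z - cutoff) * treated]))
fit = sm.WLS(y, X, weights=w).fit()
print(fit.params[2])   # coefficient on the treatment dummy, ~2 (the jump at the cutoff)
```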
266
Are there warnings against using higher order polynomials in RDD?
Yes, so methods are still evolving
267
What is the local random assignment assumption? (RDD)
there needs to be a non-trivial random chance component to the ultimate precise value of the running variable Z
268
What is the exclusion restriction? (RDD)
a random draw of Z does not itself have an impact on the outcome except through its impact on treatment status
269
What is the continuity assumption? (RDD)
the running variable Z is a smooth, continuous process (absent the treatment, the expected potential outcomes would not have jumped, but would have remained smooth functions of Z)
270
What does the continuity assumption rule out? (RDD)
rules out omitted variable bias at the cut-off itself (if, without the treatment, the expected potential outcomes do not jump at z, then there are no competing interventions occurring at z)
271
Can the continuity assumption be proved? (RDD)
cannot directly prove it, but some implications are empirically testable
272
What is the identifying assumption in RDD?
density of the forcing variable should be smooth around the cutoff
273
What are sharp RD designs?
probability of treatment goes from 0 to 1 at the cutoff, C (treatment status is entirely determined by the running variable, Z)
274
What are fuzzy RD designs?
probability of treatment discontinuously increases at the cutoff (represents a discontinuous "jump" in the probability of treatment when Z>=C, and cutoff is used as an instrumental variable for treatment)
275
What are two common kinds of RDD studies?
sharp and fuzzy
276
What is the cutoff used as in fuzzy RD designs?
as an instrumental variable
277
What are the differences between fuzzy design and sharp design?
the fuzzy design differs from the sharp design in that treatment assignment is not a deterministic function of Z, because there are also other variables that determine assignment to treatment
278
What do randomised evaluations do?
use random assignment to create a counterfactual
279
How do randomised control trials (RCTs) work?
there is a heterogeneous population on observables and unobservables and individuals are randomly assigned to being under treatment or not (comparison group) (the two groups are, on average, comparable)
280
What do we ask to judge internal validity?
Can we infer from the data that policy caused the desired outcome? Did X really cause Y?
281
What do RCTs solve?
solve reverse causality and omitted variable bias by construction
282
What do we ask to judge external validity?
Can we predict that this policy will have the same impact when implemented somewhere else? Will X cause Y in other, similar contexts?
283
What are 4 challenges to internal validity?
measurement bias; statistical power; spillovers; attrition
284
What is involved in measurement in RCTs?
innovative data collection (primary and secondary data collection); evaluation effects; survey-based bias (not restricted to RCTs)
285
What are evaluation effects?
when respondents change their behaviour in response to the evaluation itself instead of the intervention (salience of being evaluated, social pressure)
286
What are 5 evaluation effects?
Hawthorne effects; anticipation effects; resentment/demoralisation effects; demand effects; survey effects
287
What are Hawthorne effects? (evaluation effects)
behaviour changes due to attention from the study or intervention
288
What are anticipation effects? (evaluation effects)
comparison group changes behaviour because they expect to receive the treatment later
289
What are resentment/demoralisation effects? (evaluation effects)
comparison group resents missing out on treatment and changes behaviour
290
What are demand effects? (evaluation effects)
behaviour changes due to perceptions of evaluators objectives
291
What are survey effects? (evaluation effects)
being surveyed changes subsequent behaviour
292
What are 2 solutions to evaluation effects?
1) minimise the salience of the evaluation as much as possible (make sure staff are impartial and treat both groups similarly, e.g. blind data collection staff to the treatment arm); 2) measure the evaluation-driven effects in a subset of the sample (prime a subset of the sample by reminding them of the evaluation)
293
What is statistical power?
the probability of detecting an impact of a given size if there is one (probability of finding an effect if there actually is one)
294
Without statistical power, can we learn much from an experiment?
might not learn much
295
What does statistical power avoid?
avoids false negatives (falsely concluding there is no impact, Type II error)
296
What is statistical power by convention?
80% power is aimed for (expect that 20% of the time, falsely conclude there is no impact)
297
What is statistical significance, what does it avoid and what is it usually set to?
"detecting an effect" avoiding false positives (falsely condoling there is an impact when there is none, Type I error) usually set to 90% or higher (10% of time, false positive is gotten)
298
What can failure to find statistically significant effect be misinterpreted as?
can be misinterpreted as failure of the program, but it might just be a failure of the evaluation
299
What affects statistical power? (7)
effect size (minimal detectable effect size of X on Y); sample size (power calculations); variance of the outcome; unit of randomisation; attrition; spillovers; non-compliance
300
Can unit of randomisation only randomise at individual level?
No, can randomise at individual level or relevant unit (e.g. schools, households, villages)
301
What is the challenge of unit of randomisation in clusters?
the challenge is that units within clusters are not independent of one another (e.g. students within the same school likely have similar family income)
302
What is attrition?
Data are missing for some participants in the study (refusals, not located, missing from administrative data, etc.)
303
Does attrition reduce statistical power?
Yes
304
What are spillovers?
the outcomes of comparison units are indirectly affected by the treatment given to the treated units (common causes are geographic proximity, social networks like information transmission or market interactions)
305
What are marketwide/general equilibrium effects?
competing in the same region for market share (control group harmed)
306
How can you deal with spillovers?
avoid spillovers (e.g. spatial buffers between treatment and control units, randomise at a higher level); measure spillovers
307
Why can partial compliance happen?
individuals assigned to the treatment group may not receive the program; individuals assigned to the comparison group may access the treatment; this can be due to project implementers or the participants themselves
308
What can non-compliance lead to?
can lead to sample selection bias and threaten internal validity if not properly accounted for in the analysis
309
When does selection bias occur?
occurs when individuals who receive or opt into the program are systematically different from those who do not
310
Can you switch or drop non-compliers?
No, you cannot switch or drop them; you keep comparing the original groups (treatment and control)
311
What does ITT measure?
difference in means regardless of whether groups received the treatment
312
What overall effect does ITT give?
gives overall effect of intervention, acknowledging that noncompliance is likely to happen
313
How is ITT calculated?
(average outcome in treated group) - (average outcome in control group)
314
What is ToT and how is it calculated?
effect of the treatment on those who complied with their treatment status ((ITT)/((take-up in the treatment group)-(take-up in control group)))
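A worked numeric example with made-up compliance rates (illustration only):

```python
# Hypothetical randomised trial with imperfect compliance
mean_assigned_treatment = 10.0   # average outcome among those ASSIGNED to treatment
mean_assigned_control = 8.0      # average outcome among those assigned to control
takeup_treatment = 0.70          # share actually treated in the treatment group
takeup_control = 0.10            # share actually treated in the control group

itt = mean_assigned_treatment - mean_assigned_control   # effect of assignment
tot = itt / (takeup_treatment - takeup_control)         # effect on compliers
print(itt, tot)   # ITT = 2.0, ToT = 2.0 / 0.6, roughly 3.33
```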
315
When is ToT LATE?
if estimated by IV (using random assignment as the instrument)
316
What are 3 steps (questions) to judge external validity?
1) What needs does the program address and what is the disaggregated theory behind the program? Are the needs the same in the new setting? 2) How strong is the evidence? 3) Can the intervention be implemented in the new setting?
317
What are 5 advantages of RCTs?
1) when well implemented, allow for rigorous counterfactual analysis with the fewest assumptions; 2) take advantage of scarcity of resources to rigorously assess impact; 3) cleanest/easiest technique; 4) easier to communicate results and methods to policy-makers, so more likely to be scaled up; 5) allow for straightforward cost-effectiveness analysis
318
What are 3 disadvantages of RCTs?
1) studies are in 'real' time (only works for prospective evaluations since it requires random assignment before the policy starts); 2) restricting treatment may be politically difficult or undesirable (best for pilots we are not certain will work); 3) not suitable for several key policies (e.g. rule of law, macroeconomic policies, etc.)
319
What does a placebo test do?
involves repeating an analysis using a different dataset or a part of the dataset where no intervention occurred
320
What is bunching?
a behavioral pattern where individuals or firms locate at key policy thresholds (e.g. firms reporting earnings just below a threshold that triggers taxes or regulations, individuals working hours just below a threshold that classifies them as full-time)
321
What is Moore’s Law?
the number of transistors on computer chips doubles approximately every two years
322
How has computational capacity changed over time?
has increased exponentially over time
323
How has the cost of memory changed over time?
has plummeted over time
324
What are structured datasets? (ML)
matrices of variables across observations
325
What are "Bulldozer methods" according to Diana?
one class of ML tools which are methods that work with structured datasets that look very much like the datasets economists work with using ‘conventional’ econometrics
326
What can "Bulldozer methods" provide insight on?
can provide insights into the importance of different features for making predictions and identifying heterogeneous treatment effects
327
Can "Bulldozer methods" address problems of endogenous unobservable selection or reverse causality?
cannot (yet) mechanically address problems of endogenous unobservable selection or reverse causality
328
What can you do to make "Bulldozer methods" more powerful?
used in combination with more conventional approaches to causal inference, may become powerful additions to researchers’ toolkits
329
What type of AI ML technique are 'deep learning models' and what do they use?
the second class of AI/ML techniques; they use more complex neural network engines
330
Why are 'deep learning models' considered ‘black box’ approaches?
are considered ‘black box’ approaches in that it is hard to back out how they arrived at a conclusion
331
In what way are 'deep learning models' powerful?
powerful enough to be able to work with initially featureless (raw) data (work out themselves how best to combine the data to generate analytically salient features (variables))
332
What is a key difference between conventional econometric causal inference and AI/ML methods regarding the primary goal?
primary goal of ML is predictive power, rather than estimation of a particular structural or causal parameter or the ability to formally test hypotheses (inference)
333
What is a key difference between conventional econometric causal inference and AI/ML methods regarding what it relies on?
ML relies much more on out-of-sample comparisons rather than in-sample goodness-of-fit measures
334
What has the focus on prediction come at the expense of for AI/ML methods?
focus on prediction has come at expense of ability to do inference (e.g. construct asymptotically valid confidence intervals), for many ML methods it is currently impossible to construct valid confidence intervals
335
What is a key difference between conventional econometric causal inference and AI/ML methods regarding the literature?
the ML literature is much more concerned with overfitting than the traditional statistics and econometrics literatures, and it is also much more concerned with computational issues and the ability to implement estimation methods with large data sets (e.g. a key computational optimisation tool used in many ML methods is stochastic gradient descent (SGD))
336
What is a key difference between conventional econometric causal inference and AI/ML methods regarding how data is divided?
in ML, data is typically divided into three separate datasets (training data, cross-validation dataset, testing data)
337
What is training data in ML?
used to fit the parameters in the model
338
What is a cross-validation dataset in ML?
used to assess how well the model is performing i.e. overfitting vs out of sample performance, e.g. in ‘k-fold cross validation’ the training set is split into ‘k’ smaller sets and the model is then trained on ‘k-1’ of these sets and validated on the remaining set with the process repeated ‘k’ times
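A hedged scikit-learn sketch of 5-fold cross-validation (simulated data; Ridge is just an example model, and the default score for a regressor here is R^2):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Each fold: train on 4/5 of the data, validate on the held-out 1/5; repeat 5 times
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print(scores, scores.mean())   # out-of-sample performance across the 5 folds
```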
339
What is testing data in ML?
never used in the training phase, is used to test the model’s performance after the model has been trained and validated to assess the real-world performance of the model
340
What is stochastic gradient descent (SGD)?
the idea behind SGD is that it is better to take many small steps that are noisy but on average in the right direction than to spend equivalent computational cost very accurately figuring out in what direction to take a single small step
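A toy sketch of the SGD idea applied to linear regression with NumPy; the data, loss, learning rate, and number of steps are assumptions made only for illustration.

```python
# Toy stochastic gradient descent for linear regression: each step uses one
# randomly drawn observation, so steps are noisy but cheap and, on average,
# point in the right direction.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(2)
lr = 0.01
for step in range(5000):
    i = rng.integers(len(y))              # pick one observation at random
    grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of the squared error at that point
    w -= lr * grad                        # small, noisy step
print(w)                                  # should end up close to [2.0, -1.0]
```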
341
What is a key difference between conventional econometric causal inference and AI/ML methods regarding how models are used?
ML use of model averaging and “ensemble” methods (in many cases a single model does not perform as well as a combination of possibly quite different models, averaged using weights (sometimes called votes) obtained by optimizing out-of-sample performance)
342
In ML, what is “regression” in “regression trees” ?
“regression” in “regression trees” refers to the type of problem being addressed, not necessarily the method used to address it (“regression” is used to describe techniques for predicting a continuous outcome variable based on one or more predictor variables)
343
In ML, instead of the model being ‘estimated’, what is it said to be?
it is being trained
344
What are features in ML?
regressors, covariates, or predictors
345
In ML, what are regression parameters sometimes referred to as?
weights
346
What does the bias-variance trade-off refer to?
refers to the balance between two types of errors that a predictive model can make
347
When does high bias occur in ML?
occurs when there are systematic errors in predictions, regardless of the sample size (high bias means the model cannot capture the underlying patterns in the data (the model may be considered too simple))
348
When does high variance occur in ML?
occurs when the model is so flexible it fits the training data very closely, including its noise (overly flexible model will be sensitive to fluctuations and can lead to overfitting (when a model is overfitting, it performs very well on the training data but poorly on unseen or new data))
349
What type of models in ML often generalise better to new, unseen data in real-world scenarios?
simpler models
350
What might happen in ML when fitting a high-degree polynomial to data that is inherently linear with some noise?
the curve might wiggle to pass through every data point, capturing not just the linear trend but also the noise (such a model will have large coefficients for higher-degree terms)
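A short sketch of this overfitting pattern with NumPy; the data-generating process and polynomial degrees are assumed for illustration.

```python
# Fitting a high-degree polynomial to data that is inherently linear with noise
# makes the curve chase the noise and inflates the higher-degree coefficients.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 15)
y = 3 * x + rng.normal(scale=0.3, size=x.size)   # linear truth plus noise

linear_fit = np.polyfit(x, y, deg=1)     # low variance, captures the trend
wiggly_fit = np.polyfit(x, y, deg=10)    # high variance, wiggles through the points
print(linear_fit)
print(wiggly_fit)                        # note the large higher-degree coefficients
```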
351
What may a too-simplistic model (high bias) fail to capture?
may not capture the varying effects of treatment across subgroups
352
What may too complex models (high variance) detect?
might detect patterns that are just noise, leading to overfitting and unreliable estimates on new data
353
What is a decision-tree?
flowchart-like structure
354
What is a node in a decision-tree and what type of nodes are there?
node is a point of decision (root node, decision node, leaf node)
355
What is a root node in a decision-tree?
the node from which the tree starts, represents the entire dataset
356
What is a decision node in a decision-tree?
node that splits the data into subsets based on some criteria
357
What is a leaf node in a decision-tree?
the terminal node that predicts the outcome
358
What does each internal node in a decision-tree represent?
each internal node represents a “test” on an attribute
359
What does each branch represent in a decision-tree?
each branch represents the outcome of the test
360
What does each leaf node represent in a decision-tree?
each leaf node represents a class label
361
What do the paths from the root to the leaf represent in a decision-tree?
the paths from the root to the leaf represent classification rules
362
What is this type of chart called?
decision-tree
363
What is Gini impurity?
a metric used to determine how often a randomly chosen element would be incorrectly classified (measure of disorder or impurity)
364
What does lower Gini impurity represent?
the lower the Gini impurity, the closer we are to having a ‘pure subset’ in which all samples in that subset belong to the same classification of the target variable
365
What is a 'pure subset' in ML?
all samples in that subset belong to the same classification of the target variable
366
How is the Gini impurity index calculated?
in the binary case the Gini impurity index is Gini(p) = 1 – (p^2 + (1 – p)^2); it is calculated and weighted for all possible splits on all attributes, and the attribute with the lowest weighted Gini impurity is chosen as the best attribute to split on (the algorithm “tries out” all possible splits and picks the one that minimises the impurity of the resulting groups)
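A small sketch of the binary Gini calculation and of weighting a candidate split by child-node size; the function names and example numbers are illustrative assumptions.

```python
# Binary Gini impurity and a size-weighted split evaluation.
def gini(p):
    """Binary Gini impurity: 1 - (p^2 + (1 - p)^2), where p is the share of one class."""
    return 1 - (p**2 + (1 - p)**2)

def weighted_gini(n_left, p_left, n_right, p_right):
    """Impurity of a candidate split, weighting each child node by its size."""
    n = n_left + n_right
    return (n_left / n) * gini(p_left) + (n_right / n) * gini(p_right)

# a split producing purer children has lower weighted Gini impurity
print(weighted_gini(40, 0.9, 60, 0.1))   # relatively pure children
print(weighted_gini(40, 0.5, 60, 0.5))   # maximally mixed children
```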
367
What is information gain?
the decrease in entropy achieved by partitioning a dataset based on an attribute (aim is to choose the attribute that provides the highest information gain for a split)
368
What is entropy?
the randomness or disorder in a set
369
What does 0 in entropy represent?
0 when all examples are of one class, either all positive or all negative (no disorder)
370
What does 1 in entropy represent?
1 when the set has an equal number of positive and negative examples (maximum disorder)
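A companion sketch for entropy and information gain using the standard binary entropy formula; function names and example values are illustrative assumptions.

```python
# Binary entropy and the information gain from a candidate split.
import math

def entropy(p):
    """Binary entropy: 0 when p is 0 or 1 (no disorder), 1 when p = 0.5 (maximum disorder)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(p_parent, n_left, p_left, n_right, p_right):
    """Decrease in entropy achieved by partitioning the parent node."""
    n = n_left + n_right
    child_entropy = (n_left / n) * entropy(p_left) + (n_right / n) * entropy(p_right)
    return entropy(p_parent) - child_entropy

print(entropy(0.5))                                # 1.0, maximum disorder
print(information_gain(0.5, 50, 0.9, 50, 0.1))     # a useful split yields positive gain
```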
371
What do both Gini Impurity and Information Gain do?
tend to produce similar decision trees
372
How do you choose between using Gini Impurity and Information Gain?
the choice between them is usually based on computational considerations, the nature of the data, or personal preference (practitioners may evaluate both and see which one performs better for a particular problem)
373
What are the steps to build a decision tree?
(1) Selection of attribute (2) Splitting (3) Repeating steps 1 and 2 (4) Termination
374
What does the step (1) Selection of attribute do when building a decision tree?
tree selects the best attribute using a metric like Gini impurity or information gain (IG) to split the data into subsets
375
What does the step (2) Splitting do when building a decision tree?
dataset is split into subsets based on the chosen attribute, which results in a decision node (binary splits are the most common, multi-way splits are possible but create complexity and can lead to over-fitting)
376
What does the step (3) Repeating steps 1 and 2 do when building a decision tree?
steps 1 and 2 are repeated recursively for each subset, creating further branches in the tree
377
What does the step (4) Termination do when building a decision tree?
tree stops growing when it achieves a certain depth, or when further splitting no longer adds value
378
What is pruning decision trees?
process of reducing the size of the tree by removing parts of it, to make it more general and less susceptible to overfitting
379
What is the relationship between decision trees and non-linearity?
decision trees are able to split data non-linearly, building interesting relationships between features and the target variable that would not be discoverable with more conventional linear models
380
What does the non-linear nature of decision trees lead to?
due to their non-linear nature, decision trees are typically high-variance models with a tendency towards overfitting, especially if they are allowed to grow deep (such a tree would perform very well on the training data but might generalise poorly to unseen data)
381
What else can you call pre-pruning of a decision tree?
early stopping
382
What does pre-pruning of a decision tree involve and what are examples of doing this?
involves setting conditions to stop the tree growth earlier, before it perfectly classifies the training set (e.g. setting a maximum depth for the tree, setting a minimum number of samples required to make a split at a node, setting a minimum gain or reduction in impurity to continue splitting)
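A minimal sketch of pre-pruning (early stopping) with scikit-learn; the synthetic data and the specific hyperparameter values are illustrative assumptions.

```python
# Early stopping via tree hyperparameters: depth, minimum samples per split,
# and minimum impurity decrease all cap how far the tree can grow.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

pre_pruned_tree = DecisionTreeClassifier(
    max_depth=4,                # maximum depth for the tree
    min_samples_split=20,       # minimum number of samples required to split a node
    min_impurity_decrease=0.01, # minimum reduction in impurity to continue splitting
)
pre_pruned_tree.fit(X, y)
print(pre_pruned_tree.get_depth())   # depth is capped by the early-stopping rules
```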
383
What else can you call post-pruning of a decision tree?
reduced error pruning
384
What does post-pruning of a decision tree involve and what are ways of doing this?
involves letting the tree grow to its full size and then removing the nodes or branches that provide little power in predicting the target variable
385
What is a regression tree?
a type of decision tree that has continuous numerical values (or ordered discrete values) as its target variable instead of predicting a class label, it predicts a numerical value of the target
386
What does a regression tree predict instead of a class label?
it predicts a numerical value of the target
387
How does a regression tree decide where to split?
use the mean squared error (MSE) instead of Gini impurity or Information Gain to decide where to split, aiming to minimise the MSE of the predicted value (the average of the leaf) against the actual values
388
How does the regression tree - find the ‘best fit’?
find the ‘best fit’ by partitioning the data into subsets that are as homogenous as possible in terms of the target variable
389
In a regression tree, what is the prediction for a leaf?
within each one, the prediction for a leaf is simply the sample average outcome of the target variable within the leaf
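A small sketch of a regression tree with scikit-learn; the synthetic data (loosely echoing the income/education example used later in the cards) and the depth limit are illustrative assumptions.

```python
# Regression tree: splits are chosen to reduce squared error, and each leaf
# predicts the sample average of the target variable within that leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 2))                       # e.g. household income, mother's education
y = 5 * X[:, 0] + 2 * (X[:, 1] > 0.5) + rng.normal(scale=0.5, size=300)

reg_tree = DecisionTreeRegressor(max_depth=3)
reg_tree.fit(X, y)
print(reg_tree.predict(X[:5]))                       # each prediction is a leaf average
```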
390
What does a causal tree try to do?
tries to find partitions (subgroups) where the effect of a ‘treatment’ on the outcome variable is the most different
391
How does the causal tree decide how to split?
for each potential split, the tree evaluates whether splitting on that attribute would result in two or more groups with notably different treatment effects, and selects the split that results in the greatest difference in the outcome variable between treatment groups; potential splits are compared based on how much they increase the homogeneity within groups and enhance the differences in treatment effects between groups (it is looking for splits that lead to clearer, more distinct treatment effects)
392
Do causal trees solve the problems of causal inference?
do not solve the problems of causal inference (cannot test whether the treatment is exogenous, so this approach is more useful in conjunction with a more conventional causal inference technique, e.g. RCT’s, DiD, RDD)
393
With what in mind are causal trees designed?
designed to capture complex, non-linear relationships and interactions between features that might be associated with heterogenous effects and provides a local treatment effect estimate for observations that fall into each terminal leaf
394
What are negligible effects in causal trees?
no heterogeneity in the treatment effect within this group based on the available data and covariates
395
What are decision trees, regression trees and causal trees each optimised to find?
decision trees are optimised to predict a classification outcome (e.g. child ‘attends school’ or ‘does not attend school’) and are able to split data non-linearly, building interesting relationships between features and the target variable that would be difficult to discover with more conventional linear models; regression trees are optimised to predict a numerical value of a target variable (e.g. child years of schooling) based on a set of predictor variables (e.g. household income, mother’s education); causal trees are optimised to find heterogenous treatment effects, i.e. partitions (subgroups) where the effect of a ‘treatment’ (e.g. mother’s education) on the target variable (e.g. children’s schooling) is the most different
396
What is a random forest (and causal forest)?
an algorithm to construct a ‘forest’ of diverse trees and aggregate their predictions, offering higher accuracy, robustness, and versatility than a single decision tree
397
What do random forests (and causal forests) tell us regarding features?
feature importance (random forests provide insights into the importance of different features in making predictions (features that lead to better splits (more homogenous nodes) are considered more important))
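A minimal sketch of a random forest and its feature importances with scikit-learn; the synthetic data and forest size are illustrative assumptions.

```python
# Random forest: an ensemble of diverse trees; features that lead to better
# splits (more homogenous nodes) receive higher importance scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)   # the first two features should dominate here
```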
398
What else is 'bagging' sometimes called? (ML)
bootstrap aggregating
399
What is a bootstrapped sample?
a random sample taken with replacement
400
What does 'bagging' do?
each tree in the random forest is trained on a different bootstrapped sample of the data
401
How does feature randomness help within 'bagging'?
at each split in each new tree, a random subset of features is considered (introducing further diversity among the trees)
402
What has become one of the most widely-used techniques in machine learning?
boosting, with decision trees as the base learner (‘weak learner’)
403
What is boosting? (ML)
an iterative technique that adjusts the weights of observations based on the last classification (if an observation was classified incorrectly, it tries to increase the weight of this observation in the next iteration, making it more likely to be classified correctly)
404
What are the steps in boosting?
(1) initialise weights (2) build a weak learner (3) compute errors (4) adjust weights (5) iterate (6) aggregate predictions
405
What does step (1) initialise weights in boosting do?
every observation in the dataset is initially given an equal weight
406
What does step (2) build a weak learner in boosting do?
train a decision tree on the dataset using the current weights, this tree is usually shallow, making it a “weak learner” (meaning it does slightly better than random guessing)
407
What does step (3) compute errors in boosting do?
after training, classify all observations and identify the ones that were misclassified
408
What does step (4) adjust weights in boosting do?
increase the weights of the misclassified observations and decrease the weights of correctly classified ones
409
What does step (5) iterate in boosting do?
build another tree using the adjusted weights
410
What does step (6) aggregate predictions in boosting do?
predictions from individual trees are combined through a weighted majority vote or the weighted sum of the predictions from individual trees
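A sketch of boosting with shallow decision trees as the weak learner, using AdaBoost from recent versions of scikit-learn (the `estimator` argument assumes scikit-learn 1.2 or later); the synthetic data and settings are illustrative assumptions.

```python
# Boosting: shallow "weak learner" trees are fit iteratively, with misclassified
# observations reweighted so later trees focus on them; predictions are combined
# by a weighted vote.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # non-linear synthetic target

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a shallow weak learner (a "stump")
    n_estimators=100,
)
booster.fit(X, y)
print(booster.score(X, y))                  # in-sample accuracy of the boosted ensemble
```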
411
What does regularisation promote? (ML)
promotes simpler models
412
What does regularisation do? (ML)
by adding a penalty for complexity, the model is pushed towards the optimal point in the bias-variance trade-off to help prevent overfitting
413
What is regularisations relationship with features?
regularisation is sensitive to the scale of features, so features should be standardised (have a mean of 0 and a standard deviation of 1) or normalised (scaled to lie between 0 and 1) before applying regularisation
414
What is the objective of L1 (Lasso) regularisation?
minimise the usual loss (e.g. the residual sum of squares) plus a penalty equal to gamma times the sum of the absolute values of the coefficients, i.e. RSS + gamma*sum|beta_j| (gamma is the regularisation strength; a larger gamma results in more penalty, pushing the coefficients towards zero)
415
What is the objective of L2 (Ridge regression) regularisation?
minimise the usual loss (e.g. the residual sum of squares) plus a penalty equal to gamma times the sum of the squared coefficients, i.e. RSS + gamma*sum(beta_j^2) (gamma is the regularisation strength; a larger gamma results in more penalty, pushing the coefficients towards zero)
416
What is elastic net regularisation?
a combination of L1 and L2 regularisation
417
How does the penalty in L1 (Lasso) regularisation work?
penalty added to the loss function is the absolute value of the magnitude of coefficients (can lead to some coefficients becoming exactly zero, which means the corresponding feature is entirely ignored for predicting the output)
418
When is L1 (Lasso) regularisation useful?
can perform feature selection, which makes Lasso especially useful when dealing with datasets with a large number of features, where only a subset might be relevant
419
How does the penalty in L2 (Ridge) regularisation work?
the penalty added is the square of the magnitude of the coefficients (coefficients are shrunken towards zero, but they will never be exactly zero)
420
Why is cross-validation often used across the three methods of regularisation?
across all three methods of regularisation, cross-validation is commonly used to find the optimal value of the hyperparameter gamma (by training the model with different values of gamma and evaluating on a validation set, you can find the value that gives the best performance on unseen data)
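A minimal sketch of choosing the regularisation strength by cross-validation with scikit-learn, which calls the hyperparameter alpha rather than gamma; the synthetic data are an illustrative assumption.

```python
# Lasso with the penalty strength chosen by cross-validation; features are
# standardised first, and many coefficients are driven exactly to zero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))              # many features, only a few relevant
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

X_std = StandardScaler().fit_transform(X)   # standardise before regularising
lasso = LassoCV(cv=5).fit(X_std, y)
print(lasso.alpha_)                         # penalty chosen on held-out folds
print(lasso.coef_)                          # most coefficients are exactly zero
```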
421
What are Convolutional Neural Networks (CNNs) inspired by and what are they designed to do?
inspired by the organisation of the animal visual cortex and are designed to automatically and adaptively learn spatial hierarchies of features from input images
422
What are Convolutional Neural Networks (CNNs) prone to?
prone to overfitting, in which a model generates accurate predictions on the data used to fit parameters, but fails to generalise on out-of-sample data
423
How are the best values for the hyperparameters determined in Convolutional Neural Networks (CNNs)?
to determine best values for hyperparameters, data is partitioned into subsets for training, validation, and testing
424
What are the steps for Convolutional Neural Networks (CNNs) ?
(1) input (2) convolutional layers (3) pooling layers (4) fully connected layers (5) output layer (6) backpropagation
425
What does step (1) input involve for Convolutional Neural Networks (CNNs)?
input layer takes an image as input, which is a height*width*depth array of pixel values (the depth dimension is shown as the stacking of the three colour channels (red, green, blue) that make up a typical RGB image, indicating that each pixel in the image has three values associated with it, one for each colour channel)
426
What does step (2) convolutional layers involve for Convolutional Neural Networks (CNNs)?
these layers will extract features from the image, early layers might detect simple features like edges and curves, while deeper layers can detect more complex features like shapes or specific objects
427
What does step (3) pooling layers involve for Convolutional Neural Networks (CNNs)?
these layers reduce the spatial resolution of the feature maps, making the detection of features more robust to variations and reducing the number of parameters and computations
428
What does step (4) fully connected layers involve for Convolutional Neural Networks (CNNs)?
after several stages of convolution and pooling, the network combines all the features learned by previous layers across the image to identify more complex objects
429
What does step (5) output layer involve for Convolutional Neural Networks (CNNs)?
the output is a vector where each entry corresponds to a class label (e.g. a cat, dog, car, etc.); the network assigns a probability to each label
430
What does step (6) backpropagation involve for Convolutional Neural Networks (CNNs)?
this is a training method where the CNN learns by adjusting the weights of the filters to minimise the difference between the predicted output and ground-truth labels
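A minimal CNN sketch in PyTorch following the steps above (convolution, pooling, fully connected layer, class probabilities, backpropagation); the layer sizes, 32x32 RGB input, and dummy batch are illustrative assumptions.

```python
# Tiny CNN: convolution layers extract features, pooling reduces spatial
# resolution, a fully connected layer combines features, and backpropagation
# adjusts the filter weights to reduce the loss.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # input: 3 colour channels (RGB)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling halves the resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, more complex features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)  # fully connected layer

    def forward(self, x):                     # x: batch of 32x32 RGB images
        h = self.features(x)
        return self.classifier(h.flatten(1))  # one score per class label

model = TinyCNN()
images = torch.randn(4, 3, 32, 32)            # dummy batch of 4 images
probs = model(images).softmax(dim=1)          # probability assigned to each label
loss = nn.CrossEntropyLoss()(model(images), torch.tensor([0, 1, 2, 3]))
loss.backward()                               # backpropagation computes weight gradients
```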
431
Why is tuning used for Convolutional Neural Networks (CNNs)?
CNNs contain a large number of tunable parameters (hyperparameters) which control the model architecture and optimisation process (e.g. dimension of convolution filters, number of channels produced by each convolution layer, strength of regularisation on weights, and step size used by the optimisation algorithm)
432
What danger is there when using Convolutional Neural Networks (CNNs)?
danger of mechanically induced association (first-order concern for researchers using CNNs is that image-derived proxies for a variable of interest could themselves include the policy intervention of interest, undermining inference)
433
What is imperative for researcher to fully understand? (AI/ML)
the nature of the underlying data before using any AI/ML generated predicted variables
434
How does Matrix Completion (MC) work?
works by treating counterfactual untreated observations in the treatment group as missing values in a matrix, with these values imputed through a regularized process that penalizes matrix complexity
435
When using Matrix Completion, is there a guarantee that the conditions of unbiased potential outcome predictions will be met?
no guarantee that the conditions of unbiased potential outcome predictions will be met
436
What are the assumptions for Matrix Completion?
the low rank assumption, the missing at random (MAR) assumption, and the causal inference assumption
437
What does the low rank assumption for Matrix Completion say?
suggests that the outcomes are determined by a few key factors; there are underlying factors (latent variables) which are fewer in number than the observed data points
438
What does the missing at random assumption for Matrix Completion say?
pertains to the mechanism by which data is missing; if data is MAR, the missingness in the outcome is related to the observed data but not to the unobserved data, and once you control for the observed data, the missingness is random with respect to the unobserved data
439
What does the causal inference assumption for Matrix Completion say?
potential outcomes are independent of the treatment assignment, conditional on observed covariates
440
What is the challenge of using Matrix Completion as an ML method?
assumptions may be less intuitive and harder to verify
441
What does SC-EN stand for?
Synthetic Controls with Elastic Net
442
What does SC-EN do?
combines the ideas behind synthetic controls and elastic net regularisation, using a flexible approach to matching treated units with a weighted average of control units who were trending similarly pre-treatment
443
What is elastic net's role in SC-EN?
elastic net is used to determine the weights given to each control unit when constructing the synthetic control
444
What is synthetic controls' role in SC-EN?
validity of causal inference here relies on the ability to construct a synthetic control that provides a good approximation of the unobserved counterfactual (if the treated and untreated clusters differ systematically in unobserved ways that also affect the outcome, then the synthetic control may not be a valid counterfactual)
445
What are ML methods useful for discovering and what is the core difficulty regarding this?
can be useful for discovering ex post whether there is any relevant heterogeneity in the treatment effect by covariates; the core difficulty of applying ML tools to the estimation of heterogenous causal effects is that, while they are successful in prediction empirically, it is much more difficult to obtain valid inference
446
What is quantile-aggregated inference?
technique for calculating valid, interpretable standard errors for ML approach to a-theoretically searching for heterogenous treatment effects
447
In ML methods, what is the difference in requirements for calculating standard errors and estimating average treatment effects?
generating robust and conservative standard errors for estimates of heterogenous effects detected with ex post ML methods requires a larger sample size than just estimating (mean) average treatment effects
448
In ML methods, what is the importance of theory for conventional causal inference?
the use of theory provides significant power to statistical tests (without theory, e.g. when using ML bulldozer methods, much more data and cross-validation/out-of-sample techniques are needed for robust inference)
449
Are credible instrumental variables easy to find?
can be hard to find
450
Are dramatic policy discontinuities easy to find?
can be hard to find
451
What method recognises that in the absence of random assignment treatment and control groups are likely to differ for many reasons?
Difference-in-Differences
452
What does comparing changes in the outcomes between the treatment and control group (instead of levels) adjust for?
adjusts for differences in the pre-treatment period (subtract pre-treatment difference from the post-treatment difference)
453
Where does the DiD counterfactual come from?
comes from the common trends assumption (absent the treatment, the treated group would have evolved along the same trend as the control group)
454
Is the common trends assumption strong?
strong but easily stated assumption
455
What does the common trends assumption take account of?
takes account of pre-treatment differences in levels
456
Can the common trends assumption be tested?
with more data can be probed, tested and relaxed
457
Does regression DiD facilitate statistical inference
yes, facilitates statistical inference
458
What does the equation of the regression DiD contain?
contains a dummy for treatment, a dummy for the post-treatment period, and an interaction term between the treatment dummy and the post-treatment period
459
In the equation of the regression DiD, what does the dummy for treatment control for?
controls for fixed differences between the units being compared
460
In the equation of the regression DiD, what does the dummy for post-treatment period control for?
controls for the fact that conditions change over time for everyone, whether treated or not
461
In the equation of the regression DiD, what represents the DiD causal effect?
coefficient on the interaction term is the DiD causal effect
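A minimal regression DiD sketch in Python with statsmodels; the two-group, two-period synthetic panel and the true effect of 2.0 are illustrative assumptions.

```python
# Regression DiD: Y on a treatment dummy, a post-period dummy, and their
# interaction; the interaction coefficient is the DiD causal effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 2000
treat = rng.integers(0, 2, n)                 # dummy for the treated group
post = rng.integers(0, 2, n)                  # dummy for the post-treatment period
y = 1.0 + 0.5 * treat + 0.3 * post + 2.0 * treat * post + rng.normal(size=n)
df = pd.DataFrame({"y": y, "treat": treat, "post": post})

did = smf.ols("y ~ treat + post + treat:post", data=df).fit()
print(did.params["treat:post"])               # DiD estimate, close to 2.0 here
```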
462
When can one relax the common trends assumption? (DiD)
samples that include many years and many units allow us to introduce a degree of nonparallel evolution in outcomes between units in the absence of a treatment effect (differences in trend can be captured by unit-specific trend parameters)
463
How does the standard OLS estimator fit a line?
fits a line by minimising the sample average of squared residuals, with each squared residual getting equal weight in the sum (e.g. for regression on states, estimates are averages over states, not over people)
464
How does weighted least squares (WLS) work?
weights each term in the residual sum of squares by unit size or some other researcher-chosen weight (e.g. for regression on states, population weighting generates a people-weighted average)
465
Does the population-weighted average from state regression increase precision of regression estimates?
may increase precision of regression estimates (in a statistical sense, data from state with larger population may be more reliable and therefore worthy of higher weight)
466
When does the population-weighted average from state regression increase precision of regression estimates?
only when a number of restrictive technical conditions are met e.g. the underlying CEF is linear (however many regression models are only linear approximations to the CEF)
467
Is population-weighted average from state regression appealing to use?
may not be appealing, as variation may be just as useful in both states
468
When using population-weighted average what do you need to hope for regarding the regression estimate?
hope that regression estimates from state-year panel are not highly sensitive to weighting
469
Why is DiD better than simple two-group comparisons?
comparing changes instead of levels eliminates fixed differences between groups that might otherwise generate omitted variable bias
470
In a state-year panel DiD, what do you need to control for?
need to only control for state and year effects
471
What does the fate of DiD estimates rest on?
lives and dies by parallel trends (though we can allow for unit-specific linear trends when a panel is long enough, we hope for results that are unchanged by their inclusion)
472
What does unit-year panel data typically exhibit?
serial correlation (the repetitive structure of such data raises this issue)
473
What is serial correlation regarding unit-year panel data?
deviation from randomness with important consequence that each new observation in a serially correlated time series contains less information than would be the case if the sample were random
474
What is the issue with serially correlated data?
persistent, meaning the values of variables for nearby periods are likely to be similar; when the dependent variable in a regression is serially correlated, the residuals from any regression model explaining this variable are often serially correlated as well
475
When you have a combination of serially correlated residuals and serially correlated regressors, what changes?
a combination of serially correlated residuals and serially correlated regressors changes the formula required to calculate standard errors
476
If you ignore serial correlation and use the simple standard error formula, what happens?
resulting statistical conclusions are likely to be misleading (penalty for this is that you exaggerate the precision of regression estimates, as sampling theory for regression inference presumes that data come from random samples)
477
What do robust standard errors correct for?
correct for heteroskedasticity
478
What issues do clustered standard errors address?
answers the serial correlation challenge; appropriate for a wide variety of settings; solves for any sort of dependence problem in your data (although the standard errors that result may be large)
479
What does clustering allow for?
allow for correlated data within researcher-defined clusters
480
Does clustering require that all data are randomly sampled?
does not require that all data are randomly sampled, requires only that the clusters be sampled randomly, with no random sampling assumption invoked for what is inside them
481
Is a pair or a handful of clusters enough?
a pair or a handful of clusters may not be enough
482
When you do clustering, what does statistical inference presume?
once you start, statistical inference presumes you have many clusters instead of (or in addition to) many individual observations within clusters
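A minimal sketch of clustered standard errors in statsmodels for a state-year panel; the synthetic data, the treat_x_post variable, and the within-state shock are illustrative assumptions.

```python
# Clustered standard errors: allow for correlated observations within
# researcher-defined clusters (here, states) in a state-year panel.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
states, years = np.arange(30), np.arange(10)
df = pd.DataFrame([(s, t) for s in states for t in years], columns=["state", "year"])
df["treat_x_post"] = ((df["state"] < 15) & (df["year"] >= 5)).astype(int)
state_shock = rng.normal(size=len(states))             # induces within-state correlation
df["y"] = 1.0 + 0.8 * df["treat_x_post"] + state_shock[df["state"]] + rng.normal(size=len(df))

model = smf.ols("y ~ treat_x_post + C(state) + C(year)", data=df)
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": df["state"]})
print(clustered.bse["treat_x_post"])   # standard error allowing within-state correlation
```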
483
Does RDD work for all causal questions?
No, does not work for all causal questions
484
When an RDD works, what does it almost have the same causal force as?
as those from a randomised trial
485
What does RDD exploit?
exploits abrupt changes in treatment status that arise when treatment is determined by a cutoff
486
What does RDD require us to know?
requires us to know the relationship between the running variable and potential outcomes in the absence of treatment
487
When must RDD control for the relationship between the running variable and potential outcomes in the absence of treatment?
must control for this relationship when using discontinuities to identify causal effects (randomised trials require no such control)
488
When does confidence in the causal conclusions from RDD increase?
cannot be sure that the control strategy is adequate, but confidence in causal conclusions increases when RD estimates remain similar as we change details of the RD models
489
On what paradoxical idea is RD based on?
based on the seemingly paradoxical idea that rigid rules, which at first appear to reduce or even eliminate the scope for randomness, create valuable experiments
490
How does RD separate trend variation from any treatment effects?
to separate trend variation from any treatment effects, controls for smooth variation in outcomes generated by z
491
What are signal features of RD designs?
treatment status is a deterministic function of z, so that once we know z, we know whether D is 0 or 1
492
Why is treatment status a discontinuous function of z?
treatment status is a discontinuous function of z because no matter how close z gets to the cutoff, D remains unchanged until the cutoff is reached
493
What is the running variable in RDD?
the variable that determines treatment
494
Is there a value of the running variable where we observe both treatment and control observations?
no value of the running variable where we observe both treatment and control observations
495
What is sharp RDD?
treatment switches cleanly off or on as the running variable passes a cutoff
496
What is fuzzy RDD?
the probability or intensity of treatment jumps at a cutoff
497
What does the validity of RD depend on? (regarding the running variable)
turns on our willingness to extrapolate across values of the running variable (at least for values in the neighbourhood of the cutoff at which treatment switches on)
498
In the RDD equation, what parameter captures the jump in outcome differences?
parameter on D captures the jump in outcome differences (Y = alpha + p*D+y*z+e)
499
In the RDD equation, what coefficient reflects outcome trend?
slope coefficient on running variable reflects outcome trend as running variable changes (Y = alpha + p*D+y*z+e)
500
When can we be sure that that no OVB afflicts the short regression (Y = alpha + p*D+y*z+e) in RDD?
when the effect of z on the outcome is captured by a linear function
501
What does the OVB formula tell us in the RDD equation (Y = alpha + p*D+y*z+e) in RDD?
OVB formula tells us that the estimate of p in this short regression and the results any longer regression might produce depend on correlation between variables added to the long regression and D
502
What does the question of causality turn to in RDD?
turns on whether the relationship between the running variable and outcomes has indeed been nailed by a regression with linear control for the running variable
503
Are RD tools guaranteed to produce reliable causal estimates?
not guaranteed to produce reliable causal estimates
504
What are the strategies to reduce likelihood of RD mistakes?
modelling nonlinearities directly; focusing solely on observations near the cutoff (none provide perfect insurance)
505
What is the standard error?
the standard deviation of a statistic like the sample average
506
What is the sampling variance?
variability of a sample statistic (as opposed to the dispersion of raw data)
507
What is variance?
average squared deviations from the mean to gauge variability (positive and negative gaps get equal weights)
508
What does causal inference compare?
compares potential outcomes (descriptions of the world when alternative roads are taken)
509
What is the first check in any research design and what does it involve?
checking for balance: a process to check whether treatment and control groups indeed look similar, which amounts to a comparison of sample averages
510
How is the conditional expectation of a variable, Yi, given a dummy variable, Di = 1 written?
E(Yi|Di = 1) is the average of Yi in the population that had Di equal to 1; E(Yi|Di = 0) is the average of Yi in the population that had Di equal to 0
511
What is the difference in conditional expectation based on whether Yi and Di are generated by a random process or come from a sample survey?
if Yi and Di are variables generated by a random process, E(Yi|Di = d) is the average over infinitely many repetitions of this process while holding the circumstances indicated by Di fixed at d; if Yi and Di come from a sample survey, E(Yi|Di = d) is the average computed when everyone in the population who has Di = d is sampled
512
What is the mathematical expectation and how does it differ depending on whether Yi is generated by a random process or comes from a sample survey?
E(Yi) is the population average of this variable; if Yi is a variable generated by a random process, E(Yi) is the average over infinitely many repetitions of this process; if Yi is a variable that comes from a sample survey, E(Yi) is the average obtained if everyone in the population from which the sample is drawn were to be enumerated
513
What does the LLN state about the sample average?
a sample average can be brought as close as we like to the average in the population from which it is drawn simply by enlarging the sample
514
How do groups treated or untreated assigned by random assignment differ? (randomised trial/experiment)
groups treated or untreated by random assignment differ only in their treatment and any outcomes that follow from it (due to the LLN, two randomly chosen groups are indeed comparable when large enough)
515
How does experimental random assignment eliminate selection bias?
works not by eliminating individual differences but rather by ensuring that the mix of individuals being compared is the same
516
What is the constant-effects assumption?
Y1i - Y0i = k (k is both the individual and average causal effect of the treatment on the outcome)
517
What is the average causal effect of treatment and how does it relate to the difference in group means?
the average of Y1i - Y0i, where averaging is done in the usual way (sum individual outcomes and divide by n); the difference in group means, Avgn(Y1i|Di=1) - Avgn(Y0i|Di=0), equals the average causal effect plus selection bias
518
What do standard deviations measure?
measure variability in data
519
What does a good control group reveal?
reveals the fate of the treated in a counterfactual world where they are not treated
520
What is fuzzy RD, what is its instrument and how is it analysed?
when the probability or intensity of treatment jumps at a cutoff; a dummy for clearing the cutoff becomes an instrument; it is analysed by 2SLS
521
What is sharp RD?
when treatment itself switches on or off at a cutoff
522
What are clustered standard errors used for?
used to adjust for the fact that the data contain correlated observations
523
What does the exclusion restriction require us to commit to?
requires us to commit to a specific causal channel, but the assumed channel need not be the only one that matters in practice
524
What does it mean when the reduced-form estimate is close to and not significantly different from zero?
if close to and not significantly different from zero, so is the corresponding 2SLS estimate
525
What does the second-stage always include?
always includes the same control variables as appear in the first stage (Y = alpha + y*X+ß2*Z+e), where y is the causal effect and the variable X is the first-stage fitted value
526
What is the reduced form and first stage that make fuzzy RD IV?
(Y = alpha + p*D+ß0*Z+e) is the reduced form for a 2SLS setup where the endogenous variable is X; the first-stage equation that goes with this reduced form is (X = alpha + o*D+ß1*Z+e), where the parameter o captures the jump in the endogenous variable induced by the treatment; the second stage inherits a covariate from the first stage and reduced form, the running variable Z, while the jump dummy D is excluded from the second stage since this is the instrument that makes the 2SLS run
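A sketch of fuzzy RD as IV, estimated by 2SLS "by hand" with statsmodels on synthetic data; the variables (z centred at the cutoff, d the jump dummy, x the endogenous treatment intensity) and the true effect of 2.0 are illustrative assumptions, and the manual second-stage standard errors are not the correct 2SLS ones (a dedicated IV routine should be used for inference).

```python
# Fuzzy RD as IV: first stage regresses the endogenous variable on the jump
# dummy d and the running variable z; the second stage uses the first-stage
# fitted values, with d excluded because it is the instrument.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 2000
z = rng.uniform(-1, 1, n)                     # running variable, centred at the cutoff
d = (z >= 0).astype(int)                      # dummy for clearing the cutoff
x = 0.5 * d + 0.2 * z + rng.normal(scale=0.1, size=n)   # treatment intensity jumps at cutoff
y = 2.0 * x + 0.3 * z + rng.normal(size=n)    # true causal effect of x is 2.0
df = pd.DataFrame({"y": y, "x": x, "z": z, "d": d})

first_stage = smf.ols("x ~ d + z", data=df).fit()        # d shifts the endogenous variable
df["x_hat"] = first_stage.fittedvalues
second_stage = smf.ols("y ~ x_hat + z", data=df).fit()   # d excluded: it is the instrument
reduced_form = smf.ols("y ~ d + z", data=df).fit()       # effect of the jump dummy on y
print(second_stage.params["x_hat"])                      # 2SLS estimate, close to 2.0 here
```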
527
Why is fuzzy RD an IV?
it is IV because we assume that around the cutoff, after adjusting for running variable effects with a linear control, the instrument has no direct effect on the outcome, but rather influences the independent variable (exclusion restriction)
528
What does RD require tough judgement on?
requires tough judgement about the causal channels through which instruments affect outcomes (as with any IV)
529
What is the difference between sharp and fuzzy RD designs?
with fuzzy, units who cross a threshold are exposed to more intense treatment, while in a sharp design treatment switches cleanly on or off at the cutoff
530
What is bandwidth? (RDD)
the parameter b that describes the width of the window around the cutoff z0 (z0 - b <= z <= z0 + b)
531
What should bandwidth vary as a function of?
should vary as a function of the sample size (the more information available about outcomes in the neighbourhood of an RD cutoff, the narrower we can set the bandwidth while still hoping to generate estimates precise enough to be useful)
532
What is bandwidths choice goal?
bandwidth choice requires a judgement call and goal is not as much to find perfect bandwidth as to show that the findings generated by choice of bandwidth are not a fluke
533
What does non-parametric RD estimate and what does this method do?
estimates the RDD equation (Y = alpha + p*D+y*z+e) in a narrow window around the cutoff; this is the method that trades off the reduction in bias near the boundary against the increased variance suffered by throwing data away
534
What does non-parametric RD exploit?
second RD strategy that exploits the fact that the problem of distinguishing jumps from nonlinear trends grows less vexing as we zero in on points close to the cutoff (for the small set of points close to the boundary, nonlinear trends need not concern us at all)
535
What does non-parametric RD do?
an approach that compares averages in a narrow window just to the left and just to the right of the cutoff
536
What is a drawback of non-parametric RD?
drawback is if the window is very narrow, there are few observations left, meaning the resulting estimates are likely to be too imprecise to be useful (should be able to trade the reduction in bias near the boundary against the increased variance suffered by throwing data away, generating some kind of optimal window size)
537
What does modification of the RD equation allow for?
allows for different running variable coefficients to the left and right of the cutoff (generates models that interact z with D)
538
How does modification of the RD work and what is the implication of the model?
center the running variable by subtracting the cutoff z0, replacing z with (z-z0) and adding an interaction term (z-z0)*D, so the RD model becomes (Y = alpha + p*D+y*(z-z0)+b*((z-z0)*D)+e); centering the running variable ensures that p is still the jump in average outcomes at the cutoff; the implication of the model with interaction terms is that away from the z0 cutoff, the treatment effect is given by p+b*(z-z0)
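A minimal sketch of this interacted RD model with statsmodels; the synthetic sharp-RD data, the cutoff z0 = 50, and the true jump of 3.0 are illustrative assumptions.

```python
# RD with a centred running variable and an interaction term, allowing
# different slopes on each side of the cutoff; the coefficient on d is the
# jump at the cutoff, and away from it the effect is p + b*(z - z0).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 2000
z0 = 50.0
z = rng.uniform(0, 100, n)                    # running variable
d = (z >= z0).astype(int)                     # treatment switches on at the cutoff
y = 10 + 0.1 * z + 3.0 * d + 0.05 * (z - z0) * d + rng.normal(size=n)
df = pd.DataFrame({"y": y, "z": z, "d": d})

df["z_c"] = df["z"] - z0                      # centre the running variable at the cutoff
rd = smf.ols("y ~ d + z_c + z_c:d", data=df).fit()
print(rd.params["d"])                         # p: jump at the cutoff, close to 3.0 here
print(rd.params["z_c:d"])                     # b: slope difference to the right of the cutoff
```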
539
How are nonlinearities modelled using RD strategy?
typically modeled using polynomial functions of the running variable (ideally results are insensitive to the degree of nonlinearity the model allows, however sometimes they are not and question of how much nonlinearity is enough requires a judgement call)
540
What is the risk of modelling nonlinearities using RD strategy?
the risk is RD practitioners picking the model that produces results that seem most appealing (perhaps favouring those that conform most closely to their prejudices), so RD practitioners owe their readers a report on how their RD estimates change as the details of the regression model used to construct them change