Oral Exam Questions Flashcards

1
Q

OLS. What does this abbreviation stand for? What does it mean?

A

Ordinary Least Squares. Method used in linear regression to find line of best fit for data points. Minimizes sum of squared differences between actual and predicted values on the line.
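
The idea can be sketched in a few lines of Python. The closed-form slope and intercept below are exactly the values that minimize the sum of squared residuals (the data points are made up for illustration):

```python
# Minimal OLS sketch for a simple regression y = a + b*x (hypothetical data).
# The formulas below are the closed-form least-squares solutions.

def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx            # intercept passes through the means
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]           # exactly linear: y = 0 + 2*x
a, b = ols_fit(x, y)
print(a, b)                    # -> 0.0 2.0
```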

2
Q

How do we test whether a coefficient is different from zero? [t-statistics]

A

We use a t-test: t = β̂ / SE(β̂), to check whether β is significantly different from 0 or the estimate is just random chance. If the t-statistic is large (in absolute value, roughly above 2), β is likely different from zero. If it is small, we cannot confidently say β is different from zero.
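
A sketch of the computation with made-up data; the standard error here uses the textbook homoskedastic formula s² = SSR / (n − 2):

```python
import math

# t-statistic for the slope in a simple regression (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
alpha = my - beta * mx
ssr = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt((ssr / (n - 2)) / sxx)   # SE of the slope
t_stat = beta / se
print(round(t_stat, 1))                 # a large t, well above 2 -> beta likely nonzero
```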

2
Q

How do we interpret the coefficient of interest? [e.g., test_score = α + β teacher_student_ratio + u]

A

The coefficient β represents the impact of a one-unit change in the teacher–student ratio on the test score: β > 0 means the ratio raises test scores, β < 0 means it lowers them, and β = 0 means it has no impact.

3
Q

Economic significance. How do we measure it? Why do we need it?

A

Economic significance asks if an effect is big enough to matter in real life.
* Measured by the size of the coefficient in context.
* Needed because statistical significance alone doesn’t tell us if an effect is practically important.

4
Q

How do we measure goodness of fit? [R-squared, intuition behind its construction, limitations]

A

We measure goodness of fit by the R-squared. It compares how well the model’s predictions match the actual data versus a baseline (the mean of the data). Values range from 0 to 1 (higher is better).
* Does not tell us whether the model is correctly specified or whether the included variables are the relevant ones
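
The construction can be sketched directly: R² = 1 − SSR/SST, where SST is how badly the mean-only baseline does (all numbers below are hypothetical):

```python
# R-squared sketch: compare model residuals with the mean-only baseline.

y     = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]    # hypothetical model predictions
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                 # baseline: predict the mean
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # model residuals
r2 = 1 - ssr / sst
print(r2)   # -> 0.95
```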

5
Q

Why do we include control variables in regressions?

A

We include control variables in regressions to account for factors that might affect the dependent variable. This helps isolate the effect of the main variable of interest by holding other factors constant, improving the accuracy of the coefficient estimates and reducing omitted variable bias.

6
Q

How do we interpret the coefficient of interest in a multiple regression? [e.g., test_score = α +
β teacher_student_ratio + γ avg_f amily_income + u]

A

The coefficient Beta represents the change in test score for a one-unit increase in the teacher-student ratio, holding average family income constant.

7
Q

What is adjusted R-squared? Why do we need it? [intuition behind its construction, limitations]

A

Adjusted R-squared accounts for the number of predictors in a model, penalizing excessive use of irrelevant variables. Provides a more accurate measure of goodness of fit but can be misleading if the model is incorrectly specified.

8
Q

What is F-test? Why is t-test not enough?

A

The F-test is used to determine whether a group of variables collectively has a significant effect on the dependent variable. It complements the t-test by evaluating the overall fit of the model, allowing us to see if at least one predictor significantly contributes to explaining variation in the dependent variable.

9
Q

Which hypotheses can we test with F-test?

A

We can test:

Overall model significance: whether at least one of the regression coefficients is significantly different from zero.

Nested models: whether additional predictors provide a significantly better fit than a simpler model.

10
Q

OLS assumption 1: Zero conditional mean. What is it? What happens if it doesn’t hold?

A

Zero conditional mean assumption states that: The expected value of the error term is zero given the independent variables. If it doesn’t hold, it leads to biased and inconsistent coefficient estimates, making it difficult to determine the true effects of the independent variables.

11
Q

OLS assumption 2: Random sampling. What is it? What happens if it doesn’t hold?

A

Random Sampling Assumption: Data points are collected randomly from the population, ensuring that every individual has an equal chance of being selected. If it doesn’t hold, it can result in sampling bias, leading to non-representative data and biased estimates.

12
Q

OLS assumption 3: Rare outliers. What is it? What happens if it doesn’t hold?

A

Rare Outliers Assumption: Extreme values in the data are rare and do not unduly influence the regression results. If it doesn’t hold, outliers can skew the estimates and lead to misleading conclusions, inflated coefficients, and reduced model accuracy.

13
Q

OLS assumption 4: No multicollinearity. What is it? What happens if it doesn’t hold?

A

No Multicollinearity: The independent variables in a regression model should not be highly correlated. If it doesn’t hold, it can lead to inflated standard errors, making it difficult to determine the individual effect of each predictor.

14
Q

VIF. What is it? Why do we need it?

A

Measures how much the variance of an estimated regression coefficient increases because the independent variables are correlated.

We need VIF to detect multicollinearity. A high VIF (above 10) indicates problematic correlation among predictors, suggesting that the coefficient estimates may be unreliable.
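
A sketch of the construction with two hypothetical predictors: regress x2 on x1, take that auxiliary R², and compute VIF = 1 / (1 − R²):

```python
# VIF sketch: auxiliary regression of one predictor on the other (hypothetical data).

def aux_r2(x, z):
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    b = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sum((xi - mx) ** 2 for xi in x)
    a = mz - b * mx
    sst = sum((zi - mz) ** 2 for zi in z)
    ssr = sum((zi - (a + b * xi)) ** 2 for xi, zi in zip(x, z))
    return 1 - ssr / sst

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 2.1, 2.9, 4.1]      # nearly a copy of x1 -> severe collinearity
vif = 1 / (1 - aux_r2(x1, x2))
print(vif > 10)                 # -> True: flags problematic collinearity
```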

15
Q

Which methods are used to deal with outliers?

A

Remove them if they are deemed irrelevant.

Trim the dataset, dropping the most extreme observations (e.g., the top and bottom 1%).

Winsorize the data, replacing extreme values with the nearest values within a specified range.
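
Winsorizing can be sketched as a clip to chosen cutoffs (fixed cutoffs here for illustration; in practice the cutoffs are usually percentiles of the data):

```python
# Winsorizing sketch: extreme values are replaced by the nearest allowed value.

def winsorize(values, lower, upper):
    return [min(max(v, lower), upper) for v in values]

data = [1, 2, 3, 4, 100]            # 100 is an outlier
print(winsorize(data, 1, 10))       # -> [1, 2, 3, 4, 10]
```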

16
Q

OLS property 1: Unbiased. What is it? Why is it important?

A

On average, the estimated coefficients equal the true population parameters. It is important because it ensures that the conclusions drawn from the regression are valid.

17
Q

OLS property 2: Consistent. What is it? Why is it important?

A

As the sample size increases, the estimated coefficients converge to the true population parameters. It is important because it assures us that with enough data, our estimates will become more accurate and reliable.

18
Q

OLS property 3: Normally distributed. What is it? Why is it important?

A

When the sample size is large, the sampling distribution of the coefficients approaches a normal distribution due to the central limit theorem.

Important because it enables valid hypothesis testing and construction of confidence intervals.

19
Q

Heteroscedasticity of errors. What is it? Why should we care about it?

A

Occurs when the variance of the error term is not constant across levels of the independent variables. We care because the usual OLS standard errors become invalid (the coefficient estimates stay unbiased but are inefficient), so t-tests and confidence intervals can be misleading; robust standard errors correct for this.

20
Q

Biases: Sample selection bias. What is it? How can we correct it?

A

Occurs when the sample used is not representative of the population due to a non-random selection process, which can lead to biased estimates and conclusions. It can be corrected with the Heckman selection model, or by redesigning data collection to obtain a random sample.

21
Q

Biases: Omitted variable bias. What is it? How can we correct it?

A

Occurs when a relevant variable that affects the dependent variable is left out of the regression model.

Can lead to biased and inconsistent coefficient estimates. To remedy this, we can try to include the omitted variable.

22
Q

Biases: Simultaneity bias. What is it? How can we correct it?

A

Simultaneity bias happens when the explanatory variable is correlated with the regression error term, ε, because X causes Y but Y also causes X (reverse causality).

Correct with Instrumental Variable regression (2SLS).

23
Q

Biases: Attenuation bias. What is it? How can we correct it?

A

Occurs when an independent variable is measured with error, biasing the estimated effect toward zero and so underestimating the true effect on the dependent variable.

A larger sample does not remove the bias; standard remedies are instrumental variables or more accurate measurement of the variable.

24
Q

Polynomials. How do we interpret coefficients? [e.g., wage = α + β1 age + β2 age2 + u]

A

The quadratic term indicates how the effect of age on wage changes as age increases: the marginal effect of age is β1 + 2β2·age. Positive β2 means the impact of age on wage grows as age rises; negative β2 means it shrinks.
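
With hypothetical coefficients, the age-varying marginal effect β1 + 2β2·age can be sketched as:

```python
# Marginal effect of age in wage = a + b1*age + b2*age^2 is b1 + 2*b2*age,
# so the effect itself depends on age. Coefficients below are hypothetical.

b1, b2 = 5.0, -0.05    # negative b2: the effect of age shrinks as age rises

def marginal_effect(age):
    return b1 + 2 * b2 * age

print(round(marginal_effect(20), 2))   # -> 3.0
print(round(marginal_effect(50), 2))   # -> 0.0  (turning point at age = -b1/(2*b2))
```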

25
Q

Log transformation. How do we interpret coefficients? [e.g., log(wage) = α + β age + u or log(quantity) =
α + β log(price) + u]

A

Log(wage): a one-unit increase in age changes wage by approximately 100·β percent.

Log(quantity): Beta indicates the price elasticity of quantity demanded. If Beta = -1.2, a 1% increase in price results in a 1.2% decrease in quantity demanded.
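
The elasticity reading of the log-log slope can be sketched with simulated demand: data generated with a true elasticity of −1.2 give back exactly that slope in logs:

```python
import math

# Log-log sketch: in log(quantity) = a + b*log(price), b is the elasticity.
a_true, b_true = 2.0, -1.2
prices = [1.0, 2.0, 4.0, 8.0]
log_p = [math.log(p) for p in prices]
log_q = [a_true + b_true * lp for lp in log_p]   # simulated, noise-free demand

n = len(prices)
mp, mq = sum(log_p) / n, sum(log_q) / n
b = sum((x - mp) * (y - mq) for x, y in zip(log_p, log_q)) / sum((x - mp) ** 2 for x in log_p)
print(round(b, 6))   # -> -1.2  (the price elasticity)
```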

26
Q

Interactions. How do we interpret coefficients? [e.g., wage = α + β1 age + β2 Ind + β3 age* Ind + u, where
Ind equals one for jobs in technical industries (like oil, IT)]

A

β1 (age): Represents the change in wage for a one-year increase in age for non-technical jobs (when Ind = 0).

β2 (Ind): Indicates the wage difference associated with working in a technical industry at age 0 (the intercept for technical jobs).

β3 (age × Ind): Measures how the effect of age on wage changes for those in technical industries.

27
Q

What is a binary variable? Give an example.

A

A binary variable is a variable that can take on the value of 0 or 1. It can be used in regression models as a so-called “dummy variable”, e.g., female = 1 if the person is female and 0 otherwise.

28
Q

Linear probability model (LPM). What is it? How do we interpret coefficient of interest? [e.g.,
loan_approval = α + β wage + u, where loan_approval equals one for applicants whose loans were
approved]

A

The LPM is a linear regression model used for binary outcomes (0 or 1). It predicts the probability of the outcome happening. The coefficient of interest shows how a one-unit increase in the independent variable (wage) changes the probability of the outcome (loan approval).

29
Q

Logit and probit regressions. What are their benefits relative to LPM?

A

Benefits of logit and probit are:
1. Always predict probabilities within the valid range (0–1), unlike LPM.
2. Better capture the non-linear relationship between variables and probabilities.
3. Avoid the heteroskedasticity that LPM errors have by construction.
30
Q

How can you estimate marginal effects of logit and probit regressions? [average marginal effect, marginal
effect at the mean]

A

Average Marginal Effect: gives the average change in probability across all observations for a one-unit increase in the independent variable.

Marginal Effect at Mean: Shows change in probability when the independent variable increases by one unit, evaluated at average values of all other variables
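
For a logit, the marginal effect of x at any point is β·p·(1 − p), where p is the fitted probability; averaging that over observations gives the AME, while plugging in mean x gives the MEM. A sketch with hypothetical coefficients:

```python
import math

# AME vs MEM sketch for a logit with hypothetical coefficients a, b.
a, b = -1.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0]

def p(x):
    return 1 / (1 + math.exp(-(a + b * x)))     # logistic probability

ame = sum(b * p(x) * (1 - p(x)) for x in xs) / len(xs)   # average marginal effect
x_bar = sum(xs) / len(xs)
mem = b * p(x_bar) * (1 - p(x_bar))                      # marginal effect at the mean
print(round(ame, 4), round(mem, 4))   # -> 0.1146 0.1231  (similar, but not identical)
```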

31
Q

What does MLE stand for? What is the intuition behind its construction?

A

Maximum Likelihood Estimation. Finds the parameter values that make the observed data most likely, by maximizing the likelihood function. The idea is to choose the parameters that best explain the data.

32
Q

What is McFadden R-squared? Why is simple R-squared bad?

A

Measures goodness of fit for logit and probit models. Compares the likelihood of the fitted model to that of a null (intercept-only) model. Values are generally lower than a normal R².

Simple R2 assumes linearity and does not work well for binary outcomes.

33
Q

How do you test hypotheses in logit and probit? [LR-test, intuition behind its construction]

A

Using the Likelihood Ratio (LR) test. It compares two models: one with the full set of variables and one without the variables of interest. Intuition: see whether the more complex model significantly improves the likelihood of the data. If the LR statistic is large, the full model fits better.

34
Q

What is cross-sectional data, time-series data, and panel data? Provide examples.

A

Cross-Sectional: Data observed at a single point in time across different entities (e.g., wages of 1,000 workers in 2020).

Time-Series: Data collected over time for a single entity (e.g., a country’s GDP each year).

Panel Data: Combination of cross-sectional and time-series data -> multiple entities observed over time (e.g., the same firms tracked annually).

35
Q

First Difference Estimator. How is it constructed? When can it be used?

A

Measures change over time by subtracting the previous observation from the current observation (ΔY_t = Y_t − Y_{t−1}). Used in panel data to control for unobserved variables that are constant over time.
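
The construction is just a one-line transformation of each entity's series (hypothetical values):

```python
# First-difference sketch: Delta y_t = y_t - y_{t-1} for one entity's time series.

y = [10, 12, 15, 14]
dy = [y[t] - y[t - 1] for t in range(1, len(y))]
print(dy)   # -> [2, 3, -1]
```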

36
Q

Pooled OLS vs First Difference Estimator. When is the second estimator better?

A

Pooled OLS: Treats panel data like cross-sectional data. Valid if individual effects are uncorrelated with the regressors.

FD Estimator: Focuses on changes over time, eliminating unobserved, constant factors.

FD is Better:
If there are constant unobserved factors that could bias results, because FD removes them.

37
Q

Entity-demeaned estimator. Why is it useful?

A

Subtracts the mean of each entity’s observations from their values to focus on within-entity variations.

Useful for: Controlling for unobserved factors (eliminate biases).

Improves estimates by providing more accurate coefficients in panel data.
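
The demeaning step can be sketched directly: subtract each entity's own mean, so only within-entity variation remains (panel values below are hypothetical):

```python
# Entity-demeaning sketch: each entity's constant level drops out.

panel = {"firm_A": [10.0, 12.0, 14.0], "firm_B": [100.0, 101.0, 102.0]}

demeaned = {
    entity: [v - sum(vals) / len(vals) for v in vals]
    for entity, vals in panel.items()
}
print(demeaned["firm_A"])   # -> [-2.0, 0.0, 2.0]
print(demeaned["firm_B"])   # -> [-1.0, 0.0, 1.0]  (the level gap is gone)
```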

38
Q

How do we interpret coefficient of interest in regressions with entity fixed effects? [e.g., market_to_book_ratioi,t =
αi + β R&D_spendingsi,t + ui,t]

A

α_i (Entity Fixed Effects): Represents unobserved factors unique to each entity (like company culture or management style) that do not change over time.

β (Coefficient of Interest): Indicates the change in the market-to-book ratio for a one-unit increase in R&D spending, controlling for all time-invariant differences between entities.

39
Q

How can we deal with fixed effects?

A

Entity-Demeaning: Subtract the mean of each entity’s observations from individual values, focusing on within-entity variations.

Including Dummy Variables: Add dummy variables for each entity (except one to avoid multicollinearity) to control for fixed effects directly in the regression.

First Difference Estimator: Use the first difference approach to eliminate fixed effects by focusing on changes over time.

All of these are ways of estimating the fixed effects regression model.

40
Q

Zero conditional mean assumption. What changes relative to simple OLS assumption?

A

Simple OLS Assumption: The error term has conditional mean zero given the regressors in the same period, E(u|x) = 0.

With panel data, the assumption strengthens to strict exogeneity: the error term u_it must have mean zero given the regressors in all time periods for that entity, E(u_it | X_i1, …, X_iT) = 0, not just period t.

41
Q

Clustered standard errors. Why do we need them?

A

We need them when observations in the data are related to each other within groups (clusters), e.g., students within the same school. Clustered standard errors adjust for correlation within the same cluster, provide robust inference, and also control for heteroscedasticity.

42
Q

Price-based event studies vs value event studies. What’s the difference? Why do we need them?

A

Price-based: Designed to test the efficient market hypothesis, i.e., information efficiency — the speed and accuracy with which prices reflect new information.

Value event: Examine the impact of events on the market value of companies, taking the EMH as given.

43
Q

Which steps should you undertake to perform an event study?

A
  1. Identify the event (nature, announcement date)
  2. Identify event windows
  3. Compute abnormal returns
  4. Compute cumulative abnormal returns
  5. Hypothesis testing, using an estimate of the variance of abnormal returns.
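
Steps 3–4 can be sketched in a few lines (the return figures below are hypothetical; "expected" would come from a model fitted on the estimation window):

```python
# Event-study sketch: abnormal return AR = actual - expected,
# cumulative abnormal return CAR = sum of ARs over the event window.

actual   = [0.010, 0.031, -0.002]   # returns in the event window
expected = [0.008, 0.009, 0.007]    # e.g. market-model predictions

ar = [a - e for a, e in zip(actual, expected)]
car = sum(ar)
print([round(x, 3) for x in ar], round(car, 3))   # -> [0.002, 0.022, -0.009] 0.015
```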
44
Q

How does a usual timeline in an event study look like?

A

1 – Event window: includes only the event.

2 – Hold-out window: excludes confounding events and potential drift.

3 – Estimation window: long enough to get precise estimates, but short enough to avoid structural breaks.

45
Q

Why do we need an estimation window?

A

Estimation window helps to calculate expected returns under normal conditions, without effects of the event.

It also isolates the event’s effect by comparing actual returns during event to expected return without event

46
Q

How do we assess whether an event leads to a significant market reaction?

A

We compute abnormal return and cumulative abnormal return, then test for significance to determine if event led to significant market reaction.

47
Q

What do abbreviations “IV” and “2SLS” stand for?

A

IV stands for Instrumental Variables; IV regression is used when an explanatory variable is correlated with the error term. 2SLS stands for Two-Stage Least Squares, the standard method for estimating IV regressions.

48
Q

Why do we need an IV regression? Which issues can it solve?

A

To break x into two parts: one that is correlated with u, and one that is not. By isolating the part not correlated with u, it is possible to estimate the true β1. It can eliminate the bias that arises when E(u|x) ≠ 0 (e.g., from omitted variables, simultaneity, or measurement error).

49
Q

Good instrument: Relevance. What is it? What happens if this property is missing?

A

Relevance: at least one instrument must enter the population counterpart of the first-stage regression. If it is missing, the instruments explain very little of the variation in x beyond what is explained by the included exogenous regressors W, and 2SLS estimates become unreliable (weak instruments).

50
Q

How can we check whether an instrument is relevant?

A

We run an F-test on the instruments in the first-stage regression. Rule of thumb: if the first-stage F-statistic is below 10, we conclude that the set of instruments is weak.

51
Q

Good instrument: Exogeneity. What is it? What happens if this property is missing?

A

Exogenous: all the instruments must be uncorrelated with the error term. If missing, first stage of 2SLS cannot isolate a component of x that is uncorrelated with error term.

52
Q

Why should you estimate 2SLS regression in one go?

A

You do it to get correct standard errors: if you run the two stages manually, R does not understand that the fitted values x̂ from the first stage are themselves estimated, so the second-stage standard errors are wrong.

53
Q

How many instruments do you need if you have 3 endogenous variables? Which properties should these
instruments have?

A

You need at least 3 instruments, so that m ≥ k (at least as many instruments as endogenous regressors) and β1 can be estimated. These instruments should be relevant (at least one enters the population counterpart of the first stage) and exogenous (all are uncorrelated with the error term).

54
Q

What is the J-test? What does it show?

A

It is a test for overidentifying restrictions in IV regression (m > k). It checks whether the instruments are exogenous. If the test statistic is significant, it suggests that some instruments might be correlated with the error term.

55
Q

Why are experiments useful?

A

Testing Hypotheses: Experiments allow researchers to test their hypotheses in controlled environments, ensuring that the results are due to the variables being tested and not other factors.

Establishing Cause and Effect: can establish causal relationships, which is crucial for understanding how different factors influence each other.

56
Q

What is the difference between experiments and quasi-experiments?

A

Experiments involve random assignment to treatment and control groups, ensuring that any differences observed are due to the treatment itself.

Quasi-experiments lack random assignment, relying instead on natural variations or pre-existing groups. This makes it harder to rule out other factors influencing the results.

57
Q

What is internal Validity? When can it break down?

A

Internal validity refers to the extent to which a study can establish a causal relationship between the treatment and the observed outcome, free from confounding variables.

It can break down when:

There are confounding variables that aren’t controlled.

Selection bias occurs, meaning groups differ in ways other than the treatment.

58
Q

What is external validity? When can it break down?

A

External validity is the extent to which the results of a study can be generalized to other settings, populations, and times.

It can break down when:

The sample isn’t representative of the larger population.

59
Q

Which assumption should hold if you use difference in differences estimator?

A

The parallel trends assumption. This means that in the absence of treatment, the difference between the treatment and control groups would have remained constant over time.
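
Under parallel trends, the DiD estimator is simple arithmetic on four group means (hypothetical numbers below):

```python
# Difference-in-differences sketch:
# DiD = (treat_post - treat_pre) - (control_post - control_pre).

treat_pre, treat_post = 10.0, 15.0
ctrl_pre, ctrl_post = 9.0, 11.0

did = (treat_post - treat_pre) - (ctrl_post - ctrl_pre)
print(did)   # -> 3.0  (treatment effect, net of the common trend)
```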

60
Q

Can you test the parallel trend assumption?

A

The assumption concerns a counterfactual, so it cannot be tested directly. But you can check whether the outcome trends for both groups are similar before the intervention; if they are, the assumption likely holds. You can also run statistical tests on the pre-treatment period to confirm this.

61
Q

What is matching? Why do we perform it?

A

Matching is a method used to pair units in the treatment group with similar units in the control group based on certain characteristics. We perform it to reduce bias and make the treatment and control groups more comparable, ensuring a more accurate estimate of the treatment effect.

62
Q

Name two different matching algorithms.

A

Nearest-neighbor matching: pairs each treated unit with the closest control unit based on certain characteristics.

Propensity score matching: pairs units based on the estimated probability of receiving the treatment, given their characteristics.
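
Nearest-neighbor matching on a single covariate can be sketched as follows (unit names and values are hypothetical; in practice the distance is computed over several covariates or a propensity score):

```python
# Nearest-neighbor matching sketch: each treated unit is paired with the
# control unit closest on one covariate.

treated = {"t1": 0.30, "t2": 0.70}
controls = {"c1": 0.25, "c2": 0.65, "c3": 0.90}

matches = {
    t: min(controls, key=lambda c: abs(controls[c] - x))
    for t, x in treated.items()
}
print(matches)   # -> {'t1': 'c1', 't2': 'c2'}
```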