Regression, event, panel, portfolio Flashcards

1
Q

Regression analysis shows correlations, not causal relationships. Why?

A

Because the direction or nature of causality depends on a solid theory, not just statistical modeling.

2
Q

Which are the OLS assumptions? (3)

A
  1. no perfect linear dependence among the independent variables
  2. exogeneity of covariates
  3. homoskedasticity
3
Q

Explain Exogeneity of covariates

A

That the error u is not a function of X.

  • The covariates (independent variables) don’t contain any information that predicts the error term (u).
  • This ensures that the model is correctly specified, and the independent variables only explain the dependent variable, not the errors.
  • For every data point, the expected value of the error term, given the independent variables, is zero.
  • This ensures that errors are purely random and not systematically related to the covariates.
4
Q

What is endogeneity?

A

Endogeneity is when the error term u is related to the independent variables → biased and inconsistent estimates

5
Q

What is homoskedasticity?

A

aka the constant variance assumption.
The error u has the same variance given any value of the explanatory variables. (This is distinct from the no-serial-correlation assumption, which says the covariance between any two error terms is zero, i.e. one error does not affect another.)

Homoskedasticity ensures that the regression model treats all observations equally. If the variance changes across observations (heteroskedasticity), the coefficient estimates remain unbiased but become inefficient, and the usual standard errors are unreliable.

6
Q

What can you say about the data generating process of covariates and errors?

A

The data in X can be a combination of constant and random variables

  • OLS relies on variance in the covariates to estimate the relationship between independent variables and the dependent variable.
  • If a covariate doesn’t vary (e.g., all values are the same), OLS cannot estimate its effect because it has no explanatory power.
7
Q

What does the exogeneity assumption say?

A

The error term u is unrelated to the independent variables X.

It ensures that the model captures the true relationship between X and Y without bias from omitted variables.

8
Q

Which are the OLS derivations?

A
  1. standard errors
  2. the t-test
  3. goodness-of-fit (R-squared)
9
Q

What are standard errors?

A

Standard errors tell us how much the model’s predictions and estimates (like the coefficients) might vary due to random noise or limited data.

10
Q

What is residual variance?

A

Measures how far the actual data points are from the model’s prediction on average (tells how much error is left after fitting the regression line).

11
Q

What is the residual standard error?

A

The average size of the errors in the model’s predictions.

12
Q

What is the t-test?

A

T-tests are used in regression to check if a regression coefficient β is significantly different from zero. They help determine if an independent variable significantly contributes to the model.

The significance level (p-value) should be below 0.05 for a variable to be considered meaningful in most cases.
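As an illustrative sketch (synthetic data and names, not from the deck), the standard error and t-statistic for a slope can be computed by hand with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # true slope is 1.5

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)          # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)     # Var(beta_hat)
se = np.sqrt(np.diag(cov))                # standard errors
t_slope = beta[1] / se[1]                 # t-statistic for H0: slope = 0
```

The t-statistic is then compared with a critical value (roughly 2 for a 5% two-sided test in large samples).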

13
Q

What is goodness-of-fit?

A

aka r squared or the coefficient of determination.
Used to evaluate how well a regression model fits the data. It helps you assess whether the model is good at predicting the dependent variable Y or if it leaves too much unexplained variability.

The value ranges from 0 to 1. Higher R2 is better.

14
Q

When is it good to use the adjusted R2?

A

Good to use when evaluating and comparing different models when having multiple independent variables.
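A numpy sketch (synthetic data, illustrative names) showing why: R² can only rise when a regressor is added, even an irrelevant one, while adjusted R² applies a penalty for the extra parameter:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
z = rng.normal(size=n)                  # irrelevant regressor
y = 1.0 + 0.8 * x + rng.normal(size=n)

def fit_r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - resid @ resid / sst
    k = X.shape[1] - 1                  # regressors excluding the constant
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

r2_small, adj_small = fit_r2(np.column_stack([np.ones(n), x]), y)
r2_big, adj_big = fit_r2(np.column_stack([np.ones(n), x, z]), y)
# r2_big >= r2_small always; adjusted R^2 need not rise
```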

15
Q

Which are the OLS assumption violations? (6)+1

A
  1. non-linearity
  2. heteroskedasticity
  3. auto-correlated errors
  4. multicollinearity
  5. irrelevant variables (over specified model)
  6. omitted variables (under specified model)
    other issues
  7. outliers
16
Q

What is heteroskedasticity?

A

Occurs when the variance of the error terms u in a regression model is not constant. So the “errors” (mistakes) in your regression model don’t have a consistent spread (their variability changes across observations).

Heteroskedasticity doesn’t bias the regression coefficients but it makes standard errors and hypothesis testing unreliable.

17
Q

How can heteroskedasticity be addressed?

A
  • Robust Standard Errors: Adjusts the standard errors to account for heteroskedasticity.
  • Weighted Least Squares (WLS): Reweights observations to stabilize variance.
  • Model Refinements: Modify the model to better explain the variability in the data.
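A minimal sketch of the first remedy, White/HC0 heteroskedasticity-robust standard errors, computed by hand with numpy. The sandwich formula is standard; the synthetic model and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
u = (0.5 + np.abs(x)) * rng.normal(size=n)   # heteroskedastic errors
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# classical covariance, valid only under homoskedasticity
cov_ols = (resid @ resid / (n - 2)) * XtX_inv
# White/HC0 sandwich: inv(X'X) X' diag(e^2) X inv(X'X)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_ols = np.sqrt(np.diag(cov_ols))
se_hc0 = np.sqrt(np.diag(cov_hc0))
```

With variance increasing in |x|, the robust slope standard error exceeds the classical one, which understates the true uncertainty.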
18
Q

What are auto-correlated errors?

A

Autocorrelated errors occur when the errors (residuals) in a regression model are not independent but instead show a pattern or relationship over time. This violates one of the key assumptions in regression analysis, leading to unreliable results.

19
Q

How can you solve auto-correlated errors?

A
  • Adjust your model to directly address the source of autocorrelation (e.g., include lagged terms).
  • Use robust standard errors (like Newey-West) to correct for the issues in residuals.
  • Robust standard errors are a good default because they work under a variety of error conditions.
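A sketch of a Newey-West (Bartlett-kernel HAC) covariance computed by hand on synthetic AR(1) errors; the `newey_west_cov` helper is hypothetical, written here only to illustrate the formula:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
u = np.zeros(n)                          # AR(1) errors: serial correlation
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

def newey_west_cov(X, resid, lags):
    """HAC covariance with Bartlett-kernel (Newey-West) weights."""
    n, _ = X.shape
    u = X * resid[:, None]               # score contributions
    S = u.T @ u / n
    for lag in range(1, lags + 1):
        w = 1 - lag / (lags + 1)         # Bartlett weight
        G = u[lag:].T @ u[:-lag] / n
        S += w * (G + G.T)
    Q_inv = np.linalg.inv(X.T @ X / n)
    return Q_inv @ S @ Q_inv / n

cov_nw = newey_west_cov(X, resid, lags=5)
se_nw = np.sqrt(np.diag(cov_nw))
```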
20
Q

What is multicollinearity?

A

Multicollinearity doesn’t violate the assumption of “no perfect linear dependence” (as long as predictors aren’t perfectly collinear), but it still causes numerical issues in estimating coefficients.

Large standard errors due to multicollinearity make it hard to determine the true effect of each variable, leading to unstable regression results.

Multicollinearity can be measured through VIF, Variance Inflation Factor. High VIF indicates severe multicollinearity and inflated standard errors.
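VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing covariate j on the remaining covariates. A numpy sketch on synthetic data (the `vif` helper is an illustrative assumption, not a library function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated covariate
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j) from regressing x_j on the other covariates."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - resid @ resid / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)
```

`vif(X, 0)` is large because x1 and x2 are nearly collinear; `vif(X, 2)` stays near 1.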

21
Q

How can you solve multicollinearity?

A
  • increase sample size N (this increases SST)
  • remove or combine highly correlated variables
22
Q

What are irrelevant variables?

A

An over-specified model occurs when irrelevant variables are included in the model. The irrelevant variable is denoted z.

Including irrelevant variables (z) does not introduce bias in the coefficient estimates (β). However, it increases variance in the estimates due to sampling error, making the model less efficient.

Over-specifying the model adds unnecessary noise, which can affect the reliability and interpretability of the results.

23
Q

What are omitted variables?

A

When the error term u is not purely random noise; it contains an omitted variable z, which creates bias.

Omitted variable bias creates:

Bias in Coefficients:

Omitting a relevant variable z introduces bias in β̂ because the effect of z is wrongly attributed to X.

The bias increases if z is strongly correlated with X or has a large γ (strong effect on y).

In contrast to over-specified models (where coefficients remain unbiased but less efficient), under-specified models produce biased estimates.

Practical Impact:

Omitting relevant variables can severely distort conclusions from the model, leading to incorrect inferences about the relationship between X and y.

24
Q

What is the difference between sampling error and omitted variables bias?

A

Sampling error diminishes as the sample size N increases. Not so for omitted variable bias: the bias is systematic and stems from the structure of the model itself, caused by leaving out a relevant variable that is correlated with X. X picks up the effect of the omitted variable.

25
Q

What are outliers?

A

Outliers are extreme values that deviate a lot from the rest of the data. Outliers can disrupt or distort the causal relationship between the dependent variable and the independent variables.

26
Q

How can you treat outliers?

A
  • Transformation:

Apply mathematical transformations to reduce the influence of extreme values. Example: Use the natural logarithm to compress large values and spread smaller ones.

  • Trimming:

Remove extreme values (e.g., top and bottom 5% of the dataset) from the analysis.

  • Winsorizing:

Replace extreme values with the nearest non-outlier value. Example: Cap values at the 95th percentile or floor them at the 5th percentile.
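Winsorizing can be sketched in a few lines of numpy (the `winsorize` helper and the 5th/95th percentile cut-offs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=1000)
data[:5] = [25.0, -30.0, 40.0, 18.0, -22.0]   # planted outliers

def winsorize(a, lower=5, upper=95):
    """Cap values at the given lower/upper percentiles."""
    lo, hi = np.percentile(a, [lower, upper])
    return np.clip(a, lo, hi)

w = winsorize(data)   # same length as data, extremes capped
```

Unlike trimming, no observations are dropped; the extremes are pulled in to the nearest non-outlier value.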

27
Q

What is the constant elasticity model?

A

A constant elasticity model is a type of regression model where the relationship between the dependent variable and the independent variable(s) exhibits a constant percentage change (elasticity).
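In practice this is the log-log specification ln y = β₀ + β₁ ln x, where β₁ is the elasticity. A sketch with synthetic data (true elasticity 1.5 is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = np.exp(rng.normal(size=n))                           # positive regressor
y = 3.0 * x ** 1.5 * np.exp(0.05 * rng.normal(size=n))   # elasticity = 1.5

ly, lx = np.log(y), np.log(x)
X = np.column_stack([np.ones(n), lx])
beta, *_ = np.linalg.lstsq(X, ly, rcond=None)
elasticity = beta[1]   # a 1% change in x implies ~beta[1]% change in y
```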

28
Q

What are the Gauss-Markov assumptions for simple regression?

A

Justifies the use of the OLS method rather than using a variety of competing estimators.

The Gauss-Markov theorem requires errors to have constant variance (homoskedasticity) and no correlation over time (no serial correlation). If errors are serially correlated, OLS is no longer the best estimator, and its standard errors and test statistics become invalid, even for large samples.

A1: Linear in Parameters
A2: Random Sampling
A3: Sample Variation in the Explanatory Variable
A4: Zero Conditional Mean
A5: Homoskedasticity

29
Q

What is ordinary least squares?

A

chooses the estimates to minimize the sum of squared residuals. the method of ordinary least squares is easily applied to estimate the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.

30
Q

What is the Frisch-Waugh theorem?

A

the general partialling out result
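The theorem can be checked numerically: the coefficient on x2 from the full multiple regression equals the slope from regressing the partialled-out y on the partialled-out x2. A numpy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 250
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# full multiple regression: coefficient on x2 is beta_full[2]
beta_full = ols(np.column_stack([np.ones(n), x1, x2]), y)

# partial out the constant and x1 from both y and x2, regress the residuals
W = np.column_stack([np.ones(n), x1])
ry = y - W @ ols(W, y)
rx = x2 - W @ ols(W, x2)
beta_fwl = (rx @ ry) / (rx @ rx)   # equals beta_full[2] exactly
```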

31
Q

What does no perfect collinearity mean?

A

In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables

32
Q

What is perfect collinearity?

A

If an independent variable in is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS.

33
Q

What is zero conditional mean?

A

the error u has an expected value of zero given any values of the independent variables. when this assumption holds, we often say that we have exogenous explanatory variables

34
Q

What is micronumerosity?

A

Problems of small sample size

35
Q

What is BLUE?

A

Best linear unbiased estimator. Under the Gauss-Markov assumptions, the OLS estimators are the best linear unbiased estimators (BLUEs).

36
Q

What is the error variance?

A

The error variance measures the “noise” or randomness in the model. A higher error variance means more variability in the dependent variable that cannot be explained by the independent variables. When it is large, the variances of the OLS estimators increase, making them less precise. To reduce the error variance, include more relevant independent variables in the model (which reduces unexplained variation). However, finding these variables can be challenging.

37
Q

What is the sample variation in the independent variable?

A

Measures the total variation in the independent variable X (SST). Higher variation in X improves the precision of the estimate. When SST is small, Var(β̂) becomes large.

Improve SST by increasing the sample size.

38
Q

What is the normality assumption?

A

states that the error terms u are normally distributed. This assumption is particularly important for conducting hypothesis tests and creating confidence intervals for the regression coefficients

39
Q

Which are the classical linear model assumptions?

A

A1: Linear in Parameters

A2: Random Sampling

A3: No Perfect Collinearity

A4: Zero Conditional Mean

A5: Homoskedasticity

40
Q

What is a dummy variable?

A

A dummy variable is defined to distinguish between two groups, and the coefficient estimate on the dummy variable estimates the ceteris paribus difference between the two groups. Dummy variables are also useful for incorporating ordinal information in regression models: we simply define a set of dummy variables representing different outcomes of the ordinal variable, allowing one of the categories to be the base group.

41
Q

What is the dummy variable trap?

A

Arises when too many dummy variables describe a given number of groups (e.g., a dummy for every category plus an intercept), which creates perfect collinearity.
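A quick numerical illustration of the trap with synthetic groups: an intercept plus one dummy per category makes the design matrix lose full column rank, so OLS cannot be estimated; dropping a base group restores full rank:

```python
import numpy as np

groups = np.repeat([0, 1, 2], 50)                      # three categories
D = (groups[:, None] == np.arange(3)).astype(float)    # one dummy per group
ones = np.ones((150, 1))

X_trap = np.hstack([ones, D])        # intercept + all three dummies
X_ok = np.hstack([ones, D[:, 1:]])   # drop the base group's dummy

rank_trap = np.linalg.matrix_rank(X_trap)   # 3 < 4 columns: perfect collinearity
rank_ok = np.linalg.matrix_rank(X_ok)       # 3 = 3 columns: full rank
```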

42
Q

What is an ordinal variable?

A

an ordinal variable is a type of categorical variable where the categories have a natural, meaningful order or ranking, but the differences between the ranks are not necessarily equal or measurable (good, better, best)

43
Q

What is the self-selection problem?

A

the problem of participation decisions differing systematically by individual characteristics

44
Q

What is the white test for heteroskedasticity?

A

is a statistical test used to detect the presence of heteroskedasticity in a regression model. Heteroskedasticity occurs when the variance of the residuals (errors) is not constant across observations, violating a key assumption of the Classical Linear Regression Model (CLRM). The White test is a versatile and widely used diagnostic tool because it does not require prior knowledge of the specific form of heteroskedasticity
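One common implementation regresses the squared residuals on the regressors, their squares, and cross-products, and uses LM = n·R² of that auxiliary regression as a χ² statistic. A numpy sketch on synthetic heteroskedastic data (a simplified one-regressor variant, so no cross-products are needed):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = rng.normal(size=n)
u = (0.5 + np.abs(x)) * rng.normal(size=n)   # error variance depends on x
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta) ** 2                     # squared residuals

# auxiliary regression of squared residuals on x and x^2
Z = np.column_stack([np.ones(n), x, x ** 2])
g, *_ = np.linalg.lstsq(Z, e2, rcond=None)
fit = Z @ g
r2_aux = 1 - ((e2 - fit) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()
LM = n * r2_aux    # ~ chi2(2) under the null of homoskedasticity
```

Here LM far exceeds the 5% χ²(2) critical value of 5.99, so homoskedasticity is rejected.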

45
Q

What is the sample size in time series dataset determined by?

A

the sample size is determined by the number of time periods for which we observe the variables of interest (e.g., 30 daily observations of stock prices = sample size of 30).

46
Q

What is the spurious regression problem?

A

happens when you analyze two unrelated data sets that both have trends (non-stationary), and the regression falsely shows a strong relationship. This misleading result happens because the trends overlap, not because there’s any real connection. Ex: If you compare ice cream sales to sea level rise, both might increase over time, but they have no real link. The regression might say they are connected just because they both trend upward

47
Q

What is the Durbin-Watson statistic?

A

a test for serial correlation: the Durbin-Watson statistic measures the presence of serial correlation (autocorrelation) in the residuals of a regression model. It checks whether the errors in your model are independent over time, as required by standard regression assumptions
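The statistic is DW = Σ(e_t − e_{t−1})² / Σe_t², roughly 2 when residuals are independent and below 2 under positive autocorrelation. A numpy sketch on synthetic residual series:

```python
import numpy as np

def durbin_watson(e):
    """DW statistic: sum of squared first differences over sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(9)
n = 500
e_iid = rng.normal(size=n)        # independent residuals
ar1 = np.zeros(n)                 # positively autocorrelated residuals
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + e_iid[t]

dw_iid = durbin_watson(e_iid)     # close to 2
dw_ar1 = durbin_watson(ar1)       # well below 2
```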

48
Q

What is the Breusch-Godfrey test?

A

is a statistical test used to detect serial correlation (autocorrelation) in the residuals of a regression model. Unlike the Durbin-Watson test, which is limited to detecting first-order autocorrelation, the Breusch-Godfrey test can detect higher-order autocorrelation
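A sketch of the test's auxiliary regression: residuals on the original regressors plus p lagged residuals, with LM = (n − p)·R² compared to a χ²(p) critical value. Synthetic data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
u = np.zeros(n)                          # AR(1) errors: serial correlation
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

p = 2   # test up to second-order autocorrelation
lagged = np.column_stack([e[p - L: n - L] for L in range(1, p + 1)])
Z = np.column_stack([X[p:], lagged])     # regressors + p lagged residuals
g, *_ = np.linalg.lstsq(Z, e[p:], rcond=None)
target, fit = e[p:], Z @ g
r2_aux = 1 - ((target - fit) ** 2).sum() / ((target - target.mean()) ** 2).sum()
LM_bg = (n - p) * r2_aux    # ~ chi2(p) under the null of no serial correlation
```

With AR(1) errors, LM_bg far exceeds the 1% χ²(2) critical value of 9.21.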

49
Q

What is quasi-differenced data?

A

is a transformation of a time series that helps remove serial correlation or handle trends in regression models. It’s particularly useful when working with time series data that exhibits autoregressive behavior, where past values influence current values

50
Q

What is the Cochrane-Orcutt estimation?

A

it transforms the original regression model to correct for first-order serial correlation

51
Q

What is the Prais-Winsten estimation?

A

similar to Cochrane-Orcutt, but it modifies the procedure to retain the first observation, addressing the data loss issue

52
Q

What is weighted least squares?

A

Weighted Least Squares (WLS) is a regression method used when the assumption of homoskedasticity (constant variance of errors) is violated. In cases of heteroskedasticity (errors with unequal variances), WLS assigns weights to each observation to give less influence to observations with higher variance and more influence to those with lower variance

53
Q

What is the ARCH model?

A

stands for autoregressive conditional heteroskedasticity. It is a statistical model used to describe and predict time-varying volatility in time series data, especially in financial markets. It is particularly useful when the variance of the errors (or returns) changes over time, which is a common feature in financial data like stock prices or exchange rates

54
Q

What is panel data regression?

A

Panel data regression is a method used to analyze data that varies across two dimensions: entities (e.g., individuals, firms, countries) and time periods (e.g., years, quarters). This type of data allows us to control for unobserved characteristics that are constant within entities or time periods, making the analysis more robust.

Panel data is data where you observe the same entities (e.g., people, companies, countries) over multiple time periods. Example: Tracking the income of individuals over 5 years.

Panel data has two dimensions:

Entities (cross-sectional dimension): Different individuals, firms, or countries.

Time (time-series dimension): Multiple observations over time for each entity.

55
Q

What is pooled OLS?

A

Pooled OLS simplifies panel data analysis by ignoring the panel structure, but it requires strong assumptions. When these assumptions do not hold, alternative models like fixed or random effects are more appropriate. Pooled OLS simplifies the model by ignoring the entity-specific and time-specific effects.

Pooled OLS is appropriate when entity-specific and time-specific effects are uncorrelated with the independent variables.

It assumes:

  • No Correlation Between Covariates and Entity-Specific Effects
  • No Correlation Between Covariates and Time-Specific Effects
56
Q

Which are the implications of pooled OLS?

A
  • Collapses the Panel Structure:

Pooled OLS treats the data as a single dataset without accounting for individual-specific or time-specific variation. It does not differentiate between differences across entities or time periods.

  • Efficiency and Bias:

Pooled OLS can be efficient if the assumptions are valid, as it does not estimate fixed or random effects.

However, if the assumptions are violated, the model produces biased and inconsistent estimates.

Alternative Methods

If the assumptions of Pooled OLS are violated, consider:

  1. Fixed Effects Model:
    • Accounts for entity-specific or time-specific effects by removing their influence.
  2. Random Effects Model:
    • Models entity- and time-specific effects as random and assumes they are uncorrelated with X
  3. Two-Way Fixed Effects:
    • Controls for both entity- and time-specific effects simultaneously.
57
Q

What is first differencing?

A

First-differencing is a way to clean up your data by removing things that don’t change over time. First-differencing removes anything that stays the same over time (e.g., intelligence, personality), so you can focus on how changes in your independent variable (e.g., education) impact changes in your dependent variable (e.g., salary).
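A numpy sketch on a synthetic panel: the entity effects alpha_i vanish once each entity's series is differenced, so the slope can be recovered from the changes alone:

```python
import numpy as np

rng = np.random.default_rng(10)
n_ent, T = 50, 4
alpha = 5.0 * rng.normal(size=(n_ent, 1))            # time-invariant effects
x = rng.normal(size=(n_ent, T))
y = alpha + 1.0 * x + 0.1 * rng.normal(size=(n_ent, T))  # true slope 1.0

# differencing within each entity removes alpha_i entirely
dy = np.diff(y, axis=1).ravel()
dx = np.diff(x, axis=1).ravel()
beta_fd = (dx @ dy) / (dx @ dx)   # no-intercept OLS on differenced data
```

Note that one time period per entity is lost in the differencing.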

58
Q

What is LSDV?

A

Least Squares Dummy Variables. LSDV is a panel data regression approach that explicitly includes dummy variables to control for unobserved factors that vary:

  • Across entities (cross-sectional effects).
  • Across time (time-specific effects).

LSDV yields the same estimates as the de-meaning (within) transformation used in fixed-effects regression; it is essentially the fixed effects estimator with more output (an explicit coefficient for each dummy)

59
Q

What is an independently pooled cross section?

A

is obtained by sampling randomly from a large population at different points in time (usually years). From a statistical standpoint, these data sets have an important feature: they consist of independently sampled observations. This was also a key aspect in our analysis of cross-sectional data: among other things, it rules out correlation in the error terms across different observations.

An independently pooled cross section differs from a single random sample in that sampling from the population at different points in time likely leads to observations that are not identically distributed.

OLS using pooled data is the leading method of estimation, and the usual inference procedures are available, including corrections for heteroskedasticity. (Serial correlation is not an issue because the samples are independent across time.) Because of the time series dimension, we often allow different time intercepts. We might also interact time dummies with certain key variables to see how they have changed over time.

60
Q

What is a natural experiment?

A

natural experiment occurs when some exogenous event—often a change in government policy—changes the environment in which individuals, families, firms, or cities operate. A natural experiment always has a control group, which is not affected by the policy change, and a treatment group, which is thought to be affected by the policy change. Unlike a true experiment, in which treatment and control groups are randomly and explicitly chosen, the control and treatment groups in natural experiments arise from the particular policy change.

61
Q

What is parallel trends assumption?

A

assume that average health trends would be the same for the low-income and middle-income families in the absence of the intervention.


63
Q

What is the difference between first differencing and fixed effects estimator?

A

Compared with first differencing, the fixed effects estimator is efficient when the idiosyncratic errors are serially uncorrelated (as well as homoskedastic), and we make no assumptions about correlation between the unobserved effect a_i and the explanatory variables. As with first differencing, any time-constant explanatory variables drop out of the analysis. Fixed effects methods apply immediately to unbalanced panels, but we must assume that the reasons some time periods are missing are not systematically related to the idiosyncratic errors.

64
Q

When is the random effects estimator appropriate?

A

when the unobserved effect is thought to be uncorrelated with all the explanatory variables.

65
Q

What is a cluster sample?

A

A cluster sample has the same appearance as a cross-sectional data set, but there is an important difference: clusters of units are sampled from a population of clusters rather than sampling individuals from the population of individuals. In the previous examples, each family is sampled from the population of families, and then we obtain data on at least two family members. Therefore, each family is a cluster.

66
Q

What are different analyses used for panel data?

A

Panel data combines both the cross-sectional and time series elements

  1. cross-sectional data format: 1 time period but several entities (horizontal)
  2. time series data format: 1 entity, several time periods (vertical)
67
Q

What is a balanced panel?

A

Equal number of observations for each firm for the entire period. (every firm has 3 years of observations)

When there are no missing observations.

68
Q

What is an unbalanced panel?

A

Number of observations are NOT the same for all subjects

69
Q

What is the residual error term called in panel data regression?

A

epsilon (ε), the idiosyncratic error. The composite error term also contains eta (η_i), a time-invariant individual-specific component. (Time-invariant = does not change over time, stays the same.)

70
Q

A common way of estimating a panel regression is by using the fixed-effects regression. What is another name of it?

A

Within regression

71
Q

How is fixed effects regression used?

A
  • the data is transformed to remove the individual-specific average.
  • no constant term

Subtract each individual's average over time from every variable (time-demeaning). It is called the within estimator because we focus on changes within an entity over time: estimation in terms of deviations from means.
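A numpy sketch of time-demeaning on a synthetic panel; the entity effect cancels when each entity's time average is subtracted:

```python
import numpy as np

rng = np.random.default_rng(11)
n_ent, T = 50, 4
alpha = 5.0 * rng.normal(size=(n_ent, 1))            # entity fixed effects
x = rng.normal(size=(n_ent, T))
y = alpha + 1.0 * x + 0.1 * rng.normal(size=(n_ent, T))  # true slope 1.0

# subtract each entity's mean over time: alpha_i cancels out
y_dm = (y - y.mean(axis=1, keepdims=True)).ravel()
x_dm = (x - x.mean(axis=1, keepdims=True)).ravel()
beta_fe = (x_dm @ y_dm) / (x_dm @ x_dm)   # the within (fixed effects) estimator
```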

72
Q

An alternative way of estimating a panel data model is by taking differences. What does this mean?

A
  • the problematic time-invariant individual specific unobserved heterogeneity is removed by first differencing the data
  • we lose one time period when taking differences
  • focus on 2 points in time
73
Q

Why is the standard OLS regression a bad model for panel data?

A

It does not take into account the panel structure of the data. It fails because it does not take into account unobserved time-invariant heterogeneity

74
Q

How is FE and FD similar?

A

both can deal with the problem of eta(i), the time-invariant individual-specific component.

75
Q

What is another word for panel data?

A

Longitudinal data

76
Q

What does panel data consist of?

A

time series and cross-section observations

77
Q

What is cross-sectional data?

A

Observations of the subjects are obtained at the same point in time

78
Q

What is cross-sectional regression?

A

For each year, we estimate a separate cross-sectional regression. With 3 years we get 3 different cross-sectional regressions. Very limiting: it restricts the degrees of freedom required to perform a meaningful analysis.

79
Q

What is time-series data?

A

Observations are generated over time

80
Q

What is time-series regression?

A

Estimate a time-series model for each firm using OLS

But we end up with disparate pieces of information, which would not enable a comprehensive assessment of how X1 and X2 jointly affect Y.

Also ignores information about other firms operating in the same environment. Serial correlation might be a problem because of the time-dependent nature of Y

81
Q

So how do we estimate the panel data model?

A

Consider
1. pooled OLS
2. Fixed effects model
- LSDV
- within group
- first differencing
3. Random effects model

82
Q

What is an event study?

A

an event study typically tries to examine return behavior for a sample of firms experiencing a common type of event; it examines the behavior of firms' stock prices around corporate events. Event studies focusing on announcement effects over a short horizon around an event provide evidence relevant for understanding corporate policy decisions. Event studies are joint tests of market efficiency and a model of expected returns

83
Q

How does event studies test market efficiency?

A

event studies also test market efficiency, since systematically nonzero abnormal security returns that persist after a particular type of corporate event are inconsistent with market efficiency. Event studies focusing on long horizons following an event can provide key evidence on market efficiency; examination of post-event returns provides information on market efficiency

84
Q

What is a model of normal returns?

A

must be specified before an abnormal return can be defined. A variety of expected return models have been used in event studies, like the capital asset pricing model or the constant mean return model. The approaches can be grouped into two categories, statistical and economic:

Statistical

  1. Constant Mean Return Model
  2. Market Model
  3. Factor Model

Economic

  1. Capital Asset Pricing Model
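A sketch of the market model on synthetic returns: fit R_it = a + b·R_mt over an estimation window, then compute abnormal returns in the event window. The planted 5% announcement effect and all data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(12)
# estimation window: fit the market model R_it = a + b * R_mt + e
n_est = 120
rm_est = 0.01 * rng.normal(size=n_est)
r_est = 0.001 + 1.2 * rm_est + 0.005 * rng.normal(size=n_est)
X = np.column_stack([np.ones(n_est), rm_est])
(a, b), *_ = np.linalg.lstsq(X, r_est, rcond=None)

# event window: abnormal return = actual minus normal (model) return
rm_evt = 0.01 * rng.normal(size=3)
r_evt = 0.001 + 1.2 * rm_evt + 0.005 * rng.normal(size=3)
r_evt[1] += 0.05                      # planted announcement-day effect
ar = r_evt - (a + b * rm_evt)         # abnormal returns
car = ar.sum()                        # cumulative abnormal return (CAR)
```

The CAR recovers the planted effect up to estimation noise; in a real study it would then be tested against zero.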
85
Q

What is the Type I error?

A

occurs when the null hypothesis is falsely rejected

86
Q

What is type II error?

A

occurs when the null is falsely accepted

87
Q

What is the joint test problem?

A

all tests are joint tests; that is, event study tests are well-specified only to the extent that the assumptions underlying their estimation are correct. This poses a significant challenge because event study tests are joint tests of whether abnormal returns are zero and of whether the assumed model of expected returns (CAPM etc.) is correct

88
Q

What is the Brown-Warner simulation?

A

to directly address the issue of event study properties, the standard tool in event study methodology research is simulation procedures that use actual security return data. The basic idea is simple and intuitive: different event study methods are simulated by repeated application of each method to samples that have been constructed through a random (or stratified random) selection of securities and random assignment of an event date to each. If performance is measured correctly, these samples should show no abnormal performance, on average. This makes it possible to study test statistic specification, that is, the probability of rejecting the null hypothesis when it is known to be true. Further, various levels of abnormal performance can be artificially introduced into the samples. This permits direct study of the power of event study tests, that is, the ability to detect a given level of abnormal performance. The evidence in Brown and Warner is that analytical and simulation methods yield similar power functions for a well-specified test statistic.
