4. Mathematical Foundations/Probability Theory, III Flashcards
What is the variance?
The variance of a random variable X, denoted Var(X) or σ^2, measures how much the values of X deviate from their mean (expected value): Var(X) = E[(X − E(X))^2]. Variance provides a population-level description of the spread or dispersion of X.
* Sample variance: a measure of how much the values in a sample deviate from the sample mean, s^2 = (1/(n−1)) Σ (x_i − x̄)^2. Dividing by n − 1 rather than n makes it an unbiased estimate of the population variance when only a subset (sample) of the entire population is available.
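A minimal sketch of the distinction in NumPy (the data values are made up for illustration): `np.var` divides by n by default and by n − 1 when `ddof=1` is passed.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

pop_var = np.var(x)           # divides by n: variance of x treated as the whole population
samp_var = np.var(x, ddof=1)  # divides by n - 1: sample variance s^2, estimating a population

print(pop_var, samp_var)      # 4.0 and ~4.571
```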
What is covariance and correlation?
Covariance, Cov(X,Y) = E[(X − E(X))(Y − E(Y))], measures the extent to which two variables change together. If X and Y tend to increase or decrease together, the covariance is positive. If one tends to increase while the other decreases, the covariance is negative. If there is no linear relationship, the covariance is zero.
Correlation standardizes the covariance to give a dimensionless measure of the linear relationship between two variables: Corr(X,Y) = Cov(X,Y)/(σ_X σ_Y). It ranges from −1 to 1, where −1 is a perfect negative linear relationship, 1 is a perfect positive linear relationship, and 0 is no linear relationship.
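A short NumPy sketch of both quantities on simulated data (the slope of 2 and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y tends to move with x, plus independent noise

cov_xy = np.cov(x, y)[0, 1]        # off-diagonal entry of the 2x2 covariance matrix
corr_xy = np.corrcoef(x, y)[0, 1]  # Cov(X,Y) / (sd(X) * sd(Y)), dimensionless

print(cov_xy)   # close to 2: Cov(x, 2x + noise) = 2 * Var(x) = 2
print(corr_xy)  # close to 2 / sqrt(5) ~ 0.89, since Var(y) = 4 + 1 = 5
```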
What is a population model?
A population model is a theoretical framework that describes the true relationship between variables in a population. In statistics, population models are used to make inferences about how variables are related at the population level. Components:
* Dependent variable (Y): The outcome the model aims to explain or predict
* Independent variables (X): The explanatory variables (predictors) that are believed to influence Y.
* Parameters (β): The coefficients that quantify the relationship between X and Y in the population.
* Error term (u, sometimes written ϵ): A random variable representing unobserved factors that affect the dependent variable.
What is the basic linear model and assumptions to estimate its parameters?
In the context of linear regression, the basic population model can be expressed as:
* y = β0 + β1x + u
Assumptions to estimate the parameters:
* Zero mean of errors: E(u) = 0, the average value of the error term u across the population is zero.
* Mean independence: E(u∣x) = E(u), the error term is mean independent of x, i.e., the average of u does not vary systematically with x.
* Zero conditional mean assumption: E(u∣x) = 0, which combines the two assumptions above: at every value of x, the error term has an expected value of zero, so on average the unobserved factors do not shift y at any level of x.
Conditional expectation function: if E(u∣x) = 0, then E(y∣x) = β0 + β1x + E(u∣x) = β0 + β1x, so the population regression function is E(y∣x) = β0 + β1x.
How can we estimate the slope and intercept of a linear model?
To estimate the slope and intercept of a linear model, we use the Ordinary Least Squares (OLS) method. OLS minimizes the sum of squared residuals (SSR) to find the best-fitting line.
You can calculate the slope of the regression line from the covariance between X and Y relative to the variance of X:
* β̂1 = Cov(X,Y)/Var(X)
You can then estimate the intercept from the sample means:
* β̂0 = Ȳ − β̂1 X̄
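A hedged sketch of these formulas in NumPy on simulated data (the true coefficients are chosen arbitrarily), cross-checked against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.5 * x + rng.normal(size=200)  # true beta0 = 3, beta1 = 0.5

# Slope and intercept via the formulas above; ddof must match in both terms
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

slope, intercept = np.polyfit(x, y, 1)  # NumPy's least-squares line, for comparison
print(beta1_hat, beta0_hat)             # should match slope, intercept
print(slope, intercept)
```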
What is a residual in linear regression and how is it related to the error term?
A residual (e_i) is the difference between the observed value of the dependent variable (Y_i) and its predicted value from the regression model: e_i = Y_i − Ŷ_i. Residuals show how far off the model's predictions are from the actual values.
* Positive residuals mean the model under-predicted.
* Negative residuals mean the model over-predicted.
Residuals are sample-based estimates of the unobservable population error term (u, also written ϵ). They capture the unexplained variation in Y in the sample and are used to assess model fit and validate assumptions such as linearity and homoscedasticity.
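A small self-contained sketch (simulated data, fit via `np.polyfit`) verifying two algebraic properties of OLS residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)  # OLS fit
resid = y - (intercept + slope * x)     # e_i = Y_i - Y_hat_i

# With an intercept in the model, OLS residuals sum to zero and are
# orthogonal to the regressor by construction.
print(np.isclose(resid.sum(), 0.0))
print(np.isclose((resid * x).sum(), 0.0))
```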
What are the three main sum of squares terms in linear regression?
In regression analysis, the sum of squares terms measure different components of the variation in the dependent variable (Y):
* Total sum of squares (SST): the sum of squared differences between each observed value of Y and the mean of Y, SST = Σ (Y_i − Ȳ)^2. It represents the total variability in the dependent variable, both the portion explained and the portion unexplained by the regression model.
* Explained sum of squares (SSE): the variation in Y that is explained by the regression model, SSE = Σ (Ŷ_i − Ȳ)^2. It reflects how well X explains Y; a larger SSE means the model explains more of the variability in Y.
* Sum of squared residuals (SSR): the unexplained variation in Y, SSR = Σ e_i^2 = Σ (Y_i − Ŷ_i)^2. A smaller SSR indicates a better-fitting model. (Caution: some textbooks swap the acronyms, using SSE for the sum of squared errors and SSR for the regression sum of squares; here SSE is the explained part.)
What is the relationship between SST, SSR, and SSE and how is R^2 related to them?
The total variation in Y (SST) is the sum of the unexplained variation (SSR) and the explained variation (SSE):
* SST=SSR+SSE
R^2, the coefficient of determination, measures the proportion of the total variation explained by the model:
* R^2 = SSE/SST = 1 − SSR/SST
* If R^2=1, then all of the variation is explained by the model. If R^2=0, then none of the variation is explained by the model.
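A NumPy sketch on simulated data checking the decomposition and both ways of computing R^2 (using this section's naming, where SSE is the explained part):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.8 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = ((y - y.mean()) ** 2).sum()      # total variation
sse = ((y_hat - y.mean()) ** 2).sum()  # explained variation
ssr = ((y - y_hat) ** 2).sum()         # unexplained (residual) variation

print(np.isclose(sst, sse + ssr))      # decomposition holds for OLS with an intercept
print(sse / sst, 1 - ssr / sst)        # two equivalent computations of R^2
```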
What does unbiasedness mean and what are the assumptions required for OLS to be unbiased?
In OLS regression, unbiasedness means that the expected value of the estimated coefficients equals the true population coefficients.
* Mathematically: E(β̂) = β; that is, across many repeated samples, the OLS estimator β̂ will, on average, equal the true population parameter β.
Assumptions for unbiasedness:
* Linear in parameters: The model must be linear in the coefficients, meaning that y = β0 + β1x + u is a correct representation of the relationship between y and x.
* Random sampling: The sample we are using must be randomly drawn from the population, ensuring that the data represents the population fairly.
* Sample Variation in x: There must be some variation in the values of x in the sample. If all the x-values are the same, we cannot estimate the effect of x on y.
* Zero conditional mean: The error term u must have an expected value of zero for any given value of x. E(u∣x)=E(u)=0.
* Homoscedasticity, V(u∣x) = σ^2 (the variance of the error term u is constant across all levels of x), is often listed alongside these, but it is not required for unbiasedness. It is the fifth Gauss-Markov assumption, needed for the usual OLS variance formulas and for OLS to be BLUE (best linear unbiased estimator).
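A hedged Monte Carlo sketch of what unbiasedness means in practice: simulate many samples that satisfy the assumptions above and check that the slope estimates average out to the true slope (all parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1 = 1.0, 0.5
estimates = []
for _ in range(2000):
    x = rng.uniform(0, 10, size=50)  # sample variation in x
    u = rng.normal(size=50)          # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u        # linear in parameters
    slope, _ = np.polyfit(x, y, 1)
    estimates.append(slope)

print(np.mean(estimates))  # close to the true beta1 = 0.5
```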
Why are robust standard errors used?
Robust standard errors are used when the homoscedasticity assumption does not hold (heteroskedasticity). They adjust the estimated standard errors for non-constant error variance so that inference (t-tests, confidence intervals) remains valid; the coefficient estimates themselves are unchanged.
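A sketch of how this looks in practice, assuming the statsmodels package is available: heteroskedastic errors are simulated deliberately, and `cov_type="HC1"` requests heteroskedasticity-robust (White) standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=500)
u = rng.normal(size=500) * x   # error variance grows with x: heteroskedastic
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()             # usual standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

print(classical.bse)  # likely misleading here
print(robust.bse)     # valid under heteroskedasticity; coefficients are identical
```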
What are cluster robust standard errors, and when are they used?
Cluster robust standard errors are used when errors are correlated within groups (clusters) but independent across clusters. They:
* Account for within-cluster correlation.
* Allow error variances to differ between clusters.
* Ensure valid standard errors for OLS in clustered data (e.g., students within schools or workers within firms).
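A sketch with statsmodels (the cluster structure, shared shock, and parameter values are all illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_clusters, per_cluster = 40, 25
groups = np.repeat(np.arange(n_clusters), per_cluster)  # cluster id for each observation
shock = rng.normal(size=n_clusters)[groups]             # shock shared within each cluster

x = rng.uniform(0, 10, size=n_clusters * per_cluster)
y = 1.0 + 0.5 * x + shock + rng.normal(size=n_clusters * per_cluster)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print(fit.bse)  # standard errors allowing correlation within clusters
```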