Statistical Modeling Flashcards
Let dPaup = a + b*dOut + c*dOld + d*dPop + error (the d-prefix denotes the change of a variable), where Paup = number of people receiving poor relief, Out = ratio of people receiving poor relief outside of poorhouses (phs) to people in phs, Old = proportion of over-65-year-olds in the general population, and Pop = total population.
Yule: Out and Paup are correlated even when accounting for the other covariates such as Old. Therefore, supporting people in their usual surroundings (outside poorhouses) leads to even more poverty.
What is the problem with his conclusion?
- Those districts with more efficient administration also built more poorhouses at the time
- Efficient administration leads to a reduction in poverty
- I.e. effects of efficient administration and of the establishment of phs cannot be separated
- We call these two variables “confounded”
What is our goal in regression?
Analyze the influence of the covariates on the mean value of the response: E[y|x1,…,xk] = f(x1,…,xk), where y is the response/target/dependent variable and x1,…,xk are the explanatory/independent variables / regressors / covariates.
Assumptions:
- Additive noise eps (a random variable): y = f(x1,…,xk) + eps, i.e. the relation between covariates and response is not deterministic.
- The error eps does not depend on the covariates.
- The error term may comprise unobservable variables that influence the response; if these are correlated with the covariates, this causes "omitted variable bias".
Give the scalar, vector and matrix notations of the linear model
scalar: Y_i = b_1*x_{i1} + … + b_p*x_{ip} + eps_i = f(x_{i1},…,x_{ip}) + eps_i
vector: Y_i = x_i^T * b + eps_i = f(x_{i1},…,x_{ip}) + eps_i
matrix: Y = X * b + eps, where Y = (Y_1,…,Y_n)^T, X = (x_1,…,x_n)^T is the nxp design matrix whose i-th row is x_i^T (first entry 1 for the intercept), and dim(eps) = nx1. We assume p <= n, E[eps] = 0, Cov(eps) = sig^2 * Id, eps|X ~ N(0, sig^2*Id).
Define the linear models (i.e. give X, p, and b):
location model,
2-sample model,
regression through the origin,
simple linear regression,
multiple linear regression,
quadratic regression.
Can the following be transformed into a lin. model?:
power: Y_i = alpha*x_i^beta + eps_i,
exponential: Y_i = alpha*exp(beta*x_i) + eps_i?
p=1,X=(1,..,1)^T, b=b_1;
p=2, X=[ [1,0],…,[1,0],[0,1],…,[0,1] ], b = (b1,b2)^T (Q: is b1=b2 plausible? How large is the diff?);
p=1, X=(x_1,…,x_n)^T, b=b_1;
p=2, X=[ [1,x_1],…,[1,x_n] ], b = (b1,b2)^T;
p=k+1, X=[ [1,x_{11},…,x_{1k}],…,[1,x_{n1},…,x_{nk}] ], b = (b1,…,bp)^T;
p=3, X=[ [1,x_1,x_1^2],…,[1,x_n,x_n^2] ], b = (b1,b2,b3)^T;
Yes, provided the error enters multiplicatively: Y_i = alpha*x_i^beta*eps_i or Y_i = alpha*exp(beta*x_i)*eps_i (with additive errors the models cannot be linearized exactly).
Then perform linear regression on log(Y_i) = log(alpha) + beta*log(x_i) + eps'_i (power) resp. log(Y_i) = log(alpha) + beta*x_i + eps'_i (exponential),
e.g. for the power model p=2, X=[ [1,log(x_1)],…,[1,log(x_n)] ], b = (log(alpha),beta)^T; then transform back by taking the exponential to get Y_i = alpha*x_i^beta*nu_i, where nu_i = exp(eps'_i).
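BTW, a minimal R sketch (simulated data; alpha = 2 and beta = 1.5 are assumed values, not from the lecture) of fitting the power model via the log-log transform:
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.1))   # power model with multiplicative error
fit <- lm(log(y) ~ log(x))                  # linear in log(alpha) and beta
exp(coef(fit)[1])                           # back-transformed intercept estimates alpha
coef(fit)[2]                                # slope estimates beta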
R command for linear regression and interpretation of output?
- Summary statistics of residuals
- Coefficients with their standard error, t value, p value [ Pr(>|t|) ], significance codes
- Multiple/adjusted R-squared
- F-statistic
Call: e.g. lm(formula = rent ~ area, data = dataset_xy)
or e.g. lm(formula = y ~ x1 + x2 + x3 + x4)
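A short sketch of the full call (assuming, as above, a data set dataset_xy with variables rent and area):
fit <- lm(rent ~ area, data = dataset_xy)
summary(fit)    # residual summary, coefficient table (Estimate, Std. Error, t value, Pr(>|t|)),
                # residual standard error, multiple/adjusted R-squared, F-statistic
confint(fit)    # confidence intervals for the coefficients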
State the assumptions of the “classical linear model”
(Y = X * b + eps, where Y = (Y_1,…,Y_n)^T, X is the nxp design matrix with rows x_i^T, and dim(eps) = nx1)
- Expectation: E[eps]=(0,…,0)^T
- Covariance: Cov(eps)=sig^2*I_n=diag(sig^2,…,sig^2)
- Homoscedastic errors: Var(eps_i)=sig^2
- Correlation: uncorrelated errors: Cov(eps_i, eps_j) = 0 for i != j
- Gaussian errors: often one assumes a normal distribution for the errors, eps | X ~ N(0, sig^2*I_n), implying that eps and X are independent (motivated by the CLT, since the errors are additive [a multiplicative combination of elementary errors gives a log-normal distribution])
(Design matrix: there are two settings: x1,…,xk are deterministic or random.
- If X is random, then the observations are realizations of the random vector (y, x^T) and all model assumptions are conditional on the design matrix, i.e. E[eps|X] = 0, Cov(eps|X) = sig^2*I_n and eps|X ~ N(0, sig^2*I_n),
but we omit the dependence on X for notational simplicity.)
What is the distribution of y under the assumptions of the classical linear model?
Since Y = X * b + eps, it follows that: if E[eps] = E[eps|X] = 0 and Cov(eps) = sig^2*Id, then E[y] = X*b and Cov(y) = sig^2*I_n. If moreover eps|X ~ N(0, sig^2*Id), then y ~ N(X*b, sig^2*I_n).
Definition of a residual
eps^_i = y_i - x_i^T*b^
Note that eps!=eps^, i.e. error != residual
What does a partial residual do?
Define a partial residual.
A partial residual quantifies the removal of the effect of “some” covariates, e.g. all but x_j.
Def: eps^_{x_j, i} = y_i - x_i^T * b^ + b^_j * x_{i,j}
(slide 34 in lec 1)
Def: Y~ = Y - X * b^ + b^_j * X_j ~ b_j * X_j + eps, where X_j is the j-th feature and b_j measures the effect of X_j on Y when keeping all other X_l (l != j) fixed, i.e. when conditioning on them.
b_j = E[y | X_j = x+1, {X_l : l != j}] - E[y | X_j = x, {X_l : l != j}]
(from notes in 2nd week)
BTW: Partial residuals regress out all effects of X_l (l != j).
b^_j measures the effect of X_j on Y that is not explained by the other X_l (l != j), i.e. when the others are held fixed. (See the sketch below.)
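A minimal R sketch of partial residuals, assuming hypothetical covariates x1 and x2:
fit <- lm(y ~ x1 + x2)
pres_x1 <- residuals(fit) + coef(fit)["x1"] * x1   # eps^_i + b^_1 * x_{i,1}
plot(x1, pres_x1)                                  # effect of x1 with x2 regressed out
termplot(fit, partial.resid = TRUE)                # built-in partial-residual plots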
First steps in data analysis
- Look at univariate distributions of the variables (exploratory analysis)
- -continuous vars: look at summary statistics (e.g. mean, median, std, etc.), visualize with histograms, box plots, etc.
- -categorical vars: look at frequency tables, visualize with bar graphs etc.
- Check for extreme values
- Graphical association analysis, i.e. plot the response variable against the explanatory variables (bivariate analysis, i.e. individually) to get insight, e.g. linear vs. non-linear relationships (see the R sketch after this list)
- -continuous: scatter plots (might not be informative for large sample sizes, in that case discretise to show mean and std)
- -Boxplots
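A sketch of such a first pass in R, assuming a hypothetical data frame dat with continuous y and area and a categorical location:
summary(dat$area); hist(dat$area); boxplot(dat$area)   # univariate, continuous
table(dat$location); barplot(table(dat$location))      # univariate, categorical
plot(dat$area, dat$y)                                  # bivariate scatter plot
boxplot(y ~ location, data = dat)                      # response vs. categorical covariate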
Define the terms returned by the R function lm: standard error, t value, p-value = Pr(>|t|), significance codes, residual standard error, multiple/adjusted R-squared, F-statistic
A standard error is the standard deviation of the sampling distribution of a statistic (in the R output: the estimated standard deviation of a coefficient estimate, s.e.^(b^_j) = sig^_{b_j}).
The t-value/t-statistic is the ratio of the departure (/difference) of the estimated parameter from its hypothesized value to its standard error, i.e. (b^-b_{H0})/s.e.^(b^).
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct; here Pr(>|t|) is the probability that |T| exceeds the observed |t| under H0.
Residual standard error is an estimate of the standard deviation of the errors, which are assumed to be iid with mean 0 and constant standard deviation.
Residual standard error = sqrt( SSE / df ), where SSE = residual sum of squares and df = degrees of freedom = n - p.
Multiple R-squared=R^2=||y^-bar(y)||^2/||y-bar(y)||^2
(= coefficient of determination = proportion of variance explained by the model; measures goodness of fit)
Adjusted R-squared = 1 - (1 - R^2)*(n-1)/df, df = n - p (since adding covariates always improves the fit on the training sample, this adjusts for the number of covariates).
An F-statistic is the ratio of two scaled sums of squares reflecting different sources of variability i.e. the variance explained by the parameters in the model (sum of squares of regression, SSR/(p-1) ) and the residual or unexplained variance (sum of squared errors, SSE/(n-p)).
So: F = [ ||y^ - bar(y)||^2/(p-1) ] / [ ||y - y^||^2/(n-p) ]
Let b^{(0)}:=[bar(y),0,…,0], y^{(0)}:=bar(y)*[1,…,1]=:bar(y) (as vec).
H0 (all coefficients except the intercept are zero): SSE_0 := ||y - y^{(0)}||^2 = ||y - bar(y)||^2 = ||y - y^||^2 + ||y^ - bar(y)||^2 (Pythagoras).
Under H0, F ~ F_{p-1, n-p}; this is the F-test on which the last p-value of the R output is based.
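As a sanity check, these quantities can be recomputed by hand from any fitted lm object fit (variable names assumed):
y <- model.response(model.frame(fit)); yhat <- fitted(fit); ybar <- mean(y)
n <- length(y); p <- length(coef(fit))
R2 <- sum((yhat - ybar)^2) / sum((y - ybar)^2)                          # multiple R-squared
Fstat <- (sum((yhat - ybar)^2)/(p - 1)) / (sum((y - yhat)^2)/(n - p))   # F-statistic
c(R2, Fstat)   # should match summary(fit)$r.squared and summary(fit)$fstatistic[1]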
Which nonlinear relations are possible within the scope of linear models?
Relations that are linear in the parameters, e.g. y = b0 + b1*log(z) + eps, but not e.g. y = b0 + b1*sin(b2*z_i) + eps
Define homo- and heteroscedastic
Homo: Having the same finite variance for all elements.
Hetero: the error variances differ across observations.
Define autocorrelated errors
eps_i= rho * eps_{i-1} + u_i with u_i iid
What to do if errors are multiplicative instead of additive?
Logarithmic transformation: it turns y_i = exp(b0 + b1*x_{i1} + … + bk*x_{ik} + eps_i) = exp(b0)*exp(b1*x_{i1})*…*exp(bk*x_{ik})*exp(eps_i) into log(y_i) = b0 + b1*x_{i1} + … + bk*x_{ik} + eps_i.
BTW: If the errors are normally distributed, the response is log-normally distributed with E[y_i|x_i] = exp(x_i^T*b + sig^2/2).
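A minimal sketch of the back-transformation, assuming hypothetical variables y, x1, x2 with multiplicative (log-normal) errors:
fit <- lm(log(y) ~ x1 + x2)
sig2 <- summary(fit)$sigma^2
exp(fitted(fit))              # estimates the conditional median of y
exp(fitted(fit) + sig2 / 2)   # estimates the conditional mean E[y|x] = exp(x^T*b + sig^2/2)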
State the theorem about the explicit solution of least squares:
Thm: Let X be an nxp matrix and let y be a vector in R^n. Then the following are equivalent:
1. Xz=y has a unique least-squares solution.
2. The columns of X are linearly independent.
3. X^TX is invertible.
In this case, the least-squares solution is:
b^ = (X^TX)^{-1}X^T*y.
https://textbooks.math.gatech.edu/ila/least-squares.html
Lecture: this requires n >= p (otherwise the columns cannot be linearly independent).
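A sketch of the explicit solution, assuming a hypothetical response y and covariate x:
X <- cbind(1, x)                     # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% y)   # b^ = (X^T X)^{-1} X^T y (needs linearly independent columns)
coef(lm(y ~ x))                      # should agree; lm() uses a QR decomposition, which is numerically safer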
Explain dummy coding
(almost one-hot encoding)
For modeling the effect of a covariate x in {1,…,c} with c categories using dummy coding, we define the c-1 dummy variables x_{i,1}=1{x_i=1},…,x_{i,c-1}=1{x_i=c-1}.
For identifiability, we omit one of the dummy variables - this category is called the reference category.
BTW: Estimated effects can be interpreted by direct comparison with the omitted reference category (coeff. b0).
Without a reference category (intercept plus all c dummies), the total effects are b0+b1,…,b0+b_c, but the parameters are not identifiable;
with a reference category, the total effects are
b0, b0+b1,…,b0+b_{c-1}, i.e. the interpretation is relative to the reference category.
In the second case the regression parameters are uniquely determined.
Dummy coding: Interpret the last two coefficients of the rent index: rent^=112.69+5.85*area+57.26*glocation, glocation=good location, alocation=average location (reference)
For apartments in a good or average location, an increase of the living area by 1 m2 leads to an average increase in rent of about 5.85 euros.
The average rent for an apartment in a good location is about 57.26 Euro higher than for an apartment of the same living area in an average location.
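A sketch in R, assuming the rent data set dataset_xy contains a factor location with levels "average" and "good":
dataset_xy$location <- relevel(factor(dataset_xy$location), ref = "average")  # set the reference category
fit <- lm(rent ~ area + location, data = dataset_xy)
head(model.matrix(fit))   # intercept, area, and the 0/1 dummy locationgood
coef(fit)                 # locationgood about 57.26 = surcharge relative to an average location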
How do you model effects of categorical covariate interactions?
What happens to E[y] when x changes by an amount d?
For two explanatory variables with two levels:
If y = b0 + b1*x + b2*z + eps is the initial model, add the interaction term x*z to get the new model: y = b0 + b1*x + b2*z + b3*x*z + eps
For two explanatory variables with three levels:
Define dummy variables x1, x2 and z1, z2 for x and z and model: y = b0 + b1*x1 + b2*x2 + b3*z1 + b4*z2 + b5*x1*z1 + b6*x2*z1 + b7*x1*z2 + b8*x2*z2 + eps
Total effects: b0 for (x,z) = (3,3) (both at the reference level), b0+b1 for (x,z) = (1,3), etc.
E[y|x+d,z] - E[y|x,z] = b0 + b1*(x+d) + b2*z + b3*(x+d)*z - (b0 + b1*x + b2*z + b3*x*z) = b1*d + b3*d*z
- If b3 = 0: the expected change b1*d is independent of z.
- If b3 != 0: the expected change b1*d + b3*d*z depends on d and on the value of z. (See the R sketch below.)
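A sketch of interaction terms in R formulas, with hypothetical variables x and z (factors or continuous):
fit1 <- lm(y ~ x + z)   # main effects only
fit2 <- lm(y ~ x * z)   # shorthand for x + z + x:z; for factors, x:z adds all dummy products
anova(fit1, fit2)       # F-test of H0: all interaction coefficients are zero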
Define the normal equation(s). (Derive for 3+)
To compute the least squares estimate, we calculate the partial derivatives of ||y - Xb||^2 with respect to b (they form a vector, the gradient) and require them to be zero, thus obtaining the equation (-2)X^T(y - Xb^) = 0 (equivalently X^TXb^ = X^Ty).
Def. The normal equation is the equation X^TXb^=X^T*y.
Describe the goal of least squares in geometric terms.
What property does the vector of residuals have?
Project y onto the space {Xv | v in R^p} = col(X); the projection is Xb^.
eps^=r=y-Xb^ is orthogonal to all the columns of X, i.e. r^TX=0=X^T*r.
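This orthogonality is easy to verify numerically for any fitted lm object fit (sketch):
X <- model.matrix(fit)
t(X) %*% residuals(fit)   # X^T r, numerically zero in every component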
Least Squares: Define the P- (or H-) and Q-matrices and state their properties.
The matrix P is the matrix with y^ = Py, i.e. P = X(X^TX)^{-1}X^T.
It has the properties P = P^T, P^2 = P and tr(P) = sum_i P_{ii} = p.
These are necessary and sufficient conditions for P to be an orthogonal projection of R^n onto a p-dimensional subspace (here col(X)).
r=y-y^=Qy, where Q=I-P.
Q is also a projection:
Q^T=Q^2=Q, PQ=QP=0 (hence orthogonal to each other), tr(Q)=n-p.
BTW: P is also known as the hat matrix. P_{ii} (the leverage) tells us how much influence the observation y_i has on the fitted value y^_i.
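A sketch computing P explicitly for a fitted lm object fit (names assumed); in practice one uses hatvalues(fit) for the diagonal:
X <- model.matrix(fit)
P <- X %*% solve(t(X) %*% X) %*% t(X)   # P = X (X^T X)^{-1} X^T
max(abs(P %*% P - P))                   # ~ 0: idempotent
sum(diag(P))                            # = p, the number of columns of X
range(diag(P) - hatvalues(fit))         # ~ 0: the diagonal entries are the leverages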
Define the MLE
The MLE is b^ = argmax_b P(y_1,…,y_n | x_1,…,x_n, b).
What is the connection between the MLE and linear regression?
The MLE under the assumption of Gaussian iid errors is equal to the LSE (least-squares estimator).
BTW: Y = X * b + eps ~ N(Xb, sig^2*Id) => p(y_i | x_i, b, sig) = N(x_i^T*b, sig^2) => L(b, sig^2) = Prod_i exp( -(y_i - x_i^T*b)^2 / (2*sig^2) ) / (sqrt(2*pi)*sig) (taking the argmax of log L yields the LSE for b).
The MLE for sig^2 is (sig^)^2 = (1/n) * sum_i (y_i - y^_i)^2; in practice, however, one uses (sig^)^2 = (1/(n-p)) * sum_i (y_i - y^_i)^2, which is rescaled to be unbiased.
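Sketch of the two variance estimates for a fitted lm object fit (names assumed):
n <- length(residuals(fit)); p <- length(coef(fit))
sum(residuals(fit)^2) / n         # ML estimate (biased)
sum(residuals(fit)^2) / (n - p)   # unbiased estimate; equals summary(fit)$sigma^2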
Why doesn’t one replace multiple linear regression with p simple regressions?
Because problems arise when the explanatory variables are strongly correlated.
If explanatory variables are orthogonal, then multiple regression is equal to iteratively applying simple regressions.
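A small simulated illustration (assumed coefficients) of why a single simple regression can mislead when covariates are correlated:
set.seed(1)
x1 <- rnorm(200); x2 <- x1 + rnorm(200, sd = 0.3)   # strongly correlated covariates
y <- 1 + 2 * x1 + 0 * x2 + rnorm(200)               # x2 has no effect given x1
coef(lm(y ~ x1 + x2))   # close to (1, 2, 0)
coef(lm(y ~ x2))        # simple regression: x2 appears to have a strong effect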