Statistical Modeling Flashcards

1
Q

Let dPaup = a + b*dOut + c*dOld + d*dPop + error, where Paup = # of people receiving poor relief, Out = (people receiving poor relief outside of poorhouses (phs)) / (people in phs), Old = proportion of over-65-year-olds in the general population, and Pop = total population.
Yule: dOut and dPaup are correlated even when accounting for the other covariates like dOld. Therefore, supporting people in their usual surroundings leads to even more poverty.
What is the problem with his conclusion?

A
  • Those districts with more efficient administration also built more poorhouses at the time
  • Efficient administration leads to a reduction in poverty
  • I.e. effects of efficient administration and of the establishment of phs cannot be separated
  • We call these two variables “confounded”
2
Q

What is our goal in regression?

A

Analyze the influence of the covariates on the mean value of the response: E[y|x1,…,xk] = f(x1,…,xk), where y is the response/target/dependent variable and x1,…,xk are the explanatory/independent variables / regressors / covariates.
Assumptions:
- additive noise eps (a random variable): y = f(x1,…,xk) + eps, so the relation between the covariates and the response is not deterministic
- the error eps does not depend on the covariates
- the error term may comprise unobservable variables; if these depend on both the covariates and the response, the estimated effects are distorted (“omitted variable bias”)

3
Q

Give the scalar, vector and matrix notations of the linear model

A

scalar: Y_i = b_1*x_{i1} + … + b_p*x_{ip} + eps_i = f(x_{i1},…,x_{ip}) + eps_i
vector: Y_i = x_i^T * b + eps_i, with x_i = (x_{i1},…,x_{ip})^T
matrix: Y = X * b + eps, where Y = (Y_1,…,Y_n)^T and X = (x_1,…,x_n)^T is the n×p design matrix with rows x_i^T (first entry 1 if an intercept is included), dim(eps) = n×1. We assume that p <= n, E[eps] = 0, Cov(eps) = sig^2 * Id, eps|X ~ N(0, sig^2*Id).

4
Q

Define the linear models (i.e. give X, p, and b):
location model,
2-sample model,
regression through the origin,
simple linear regression,
multiple linear regression,
quadratic regression.
Can the following be transformed into a lin. model?:
power: Y_i = alpha * x_i^beta + eps_i,
exponential: Y_i = alpha * exp(beta*x_i) + eps_i?

A

location model: p=1, X=(1,…,1)^T, b=b_1;
2-sample model: p=2, X=[ [1,0],…,[1,0],[0,1],…,[0,1] ], b=(b1,b2)^T (Q: is b1=b2 plausible? How large is the difference?);
regression through the origin: p=1, X=(x_1,…,x_n)^T, b=b_1;
simple linear regression: p=2, X=[ [1,x_1],…,[1,x_n] ], b=(b1,b2)^T;
multiple linear regression: p=k+1, X=[ [1,x_{11},…,x_{1k}],…,[1,x_{n1},…,x_{nk}] ], b=(b1,…,bp)^T;
quadratic regression: p=3, X=[ [1,x_1,x_1^2],…,[1,x_n,x_n^2] ], b=(b1,b2,b3)^T.

Power model Y_i = alpha*x_i^beta*nü_i and exponential model Y_i = alpha*exp(beta*x_i)*nü_i, with multiplicative errors nü_i = exp(eps_i): take logs and perform linear regression on log(Y_i) = log(alpha) + beta*log(x_i) + eps_i (for the exponential model use x_i instead of log(x_i)),
p=2, X=[ [1,log(x_1)],…,[1,log(x_n)] ], b=(log(alpha),beta)^T; then transform back by taking the exponential to get Y_i = alpha*x_i^beta*nü_i. With strictly additive errors these models cannot be written as linear models.

5
Q

R command for linear regression and interpretation of output?

  • Summary statistics of residuals
  • Coefficients with their standard error, t value, p values [ Pr(>|t|) ], significance codes
  • Multiple/adjusted R-squared
  • F-statistic
A

Call: e.g. lm(formula = rent ~ area, data = dataset_xy)
or e.g. lm(formula = y ~ x1 + x2 + x3 + x4)

Definitions of outputs:

A standard error is the standard deviation of the sampling distribution of a statistic. (In the R output: the estimated standard deviation of a coefficient, s.e.^(b^_i) = (sig_{b_i})^.)

The t-value/t-statistic is the ratio of the departure (/difference) of the estimated parameter from its hypothesized value to its standard error, i.e. (b^-b_{H0})/s.e.^(b^).

The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. In the R output: Pr(>|t|) = P(|T| > |t_observed|) for T ~ t_{n-p} under H0.

Multiple R-squared=R^2=||y^-bar(y)||^2/||y-bar(y)||^2
(=coefficient of determination=portion of variance explained by model; measures goodness of fit)

Adjusted R-squared = 1 - (1-R^2)*(n-1)/df, df = n-p (since adding covariables never decreases R^2 on the training sample, this adjusts for the number of covariables).

Residual standard error = sqrt( SSE / df ), where df = degrees of freedom = n-p.

An F-statistic is the ratio of two scaled sums of squares reflecting different sources of variability i.e. the variance explained by the parameters in the model (sum of squares of regression, SSR/(p-1) ) and the residual or unexplained variance (sum of squared errors, SSE/(n-p)).
So: F = [ ||y^ - bar(y)||^2/(p-1) ] / [ ||y - y^||^2/(n-p) ]
Let b^{(0)} := (bar(y),0,…,0)^T, y^{(0)} := bar(y)*(1,…,1)^T.
Under H0 (all covariate coefficients zero): SSE_0 := ||y - y^{(0)}||^2 = ||y - bar(y)||^2 = ||y - y^||^2 + ||y^ - bar(y)||^2.
Under H0, F ~ F_{p-1,n-p}; this is the F-test on which the last p-value in the output is based.
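A minimal R sketch showing the call and where these quantities appear in the output (the data frame dataset_xy with columns rent and area from the example call above is assumed to exist):

fit <- lm(rent ~ area, data = dataset_xy)
summary(fit)   # coefficient table (Estimate, Std. Error, t value, Pr(>|t|)), residual standard error, multiple/adjusted R^2, F-statistic
coef(fit)      # estimated coefficients b^
confint(fit)   # confidence intervals for the coefficients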

6
Q

State the assumptions of the “classical linear model”

A

(Y = X * b + eps, where Y = (Y_1,…,Y_n)^T, X is the n×p design matrix with rows x_i^T, and dim(eps) = n×1)

  • Expectation: E[eps]=(0,…,0)^T
  • Covariance: Cov(eps)=sig^2*I_n=diag(sig^2,…,sig^2)
    • Homoscedastic errors: Var(eps_i)=sig^2
    • Correlation: uncorrelated errors: Cov(eps_i,eps_j)=0 for i != j
  • Gaussian errors: often one additionally assumes a normal distribution for the errors, eps | X ~ N(0, sig^2*I_n), implying that eps and X are independent (motivated by the CLT, since the errors are additive [a multiplicative combination of elementary errors gives a log-normal distribution])

(Design matrix: there are two settings: x1,…,xk are deterministic or random.
– If X is random, all observations are regarded as realizations of the random vector (y,x^T), and all model assumptions are conditional on the design matrix, i.e. E[eps|X]=0, Cov(eps|X)=sig^2*Id_n and eps|X ~ N(0, sig^2*Id_n),
but we omit the dependence on X for notational simplicity.)

7
Q

What is the distribution of y under the assumptions of the classical linear model?

A
Since Y = X * b + eps, it follows that:
If E[eps]=E[eps|X]=0 and Cov(eps)=sig^2*Id,
then
E[y]=X*b
Cov(y) = sig^2 * Id_n
If additionally eps|X ~ N(0, sig^2*Id), then
y ~ N(X*b, sig^2 * Id_n)
8
Q

Definition of a residual

A

eps^_i = y_i - x_i^T*b^

Note that eps!=eps^, i.e. error != residual

9
Q

What does a partial residual do?

Define a partial residual.

A

A partial residual removes the effect of “some” covariates, e.g. all but x_j.
Def: eps^_{x_j, i} = y_i - x_i^T * b^ + b^_j * x_{i,j}
(slide 34 in lec 1)
Def: Y~ := Y - X * b^ + b^_j * X_j ~ b_j * X_j + eps, where X_j is the j-th feature and b_j measures the effect of X_j on Y when keeping all other X_l (l != j) fixed, i.e. when conditioning on them:
b_j = E[y | X_j=x+1, {X_l : l != j}] - E[y | X_j=x, {X_l : l != j}]
(from notes in 2nd week)
BTW: Partial residuals regress out all effects of the X_l (l != j);
b^_j measures the effect of X_j on Y that has not been explained by the other X_l (l != j) (i.e. when the others are held fixed)

10
Q

First steps in data analysis

A
  • Look at univariate distributions of the variables (exploratory analysis)
  • - continuous vars: look at summary statistics (e.g. mean, median, std etc.), visualize with histograms, box plots etc.
  • - categorical vars: look at frequency tables, visualize with bar charts etc.
  • Check for extreme values
  • Graphical association analysis, i.e. plot the response variable against the explanatory variables (bivariate analysis = individually) to get insight, e.g. linear vs. non-linear
  • - continuous: scatter plots (might not be informative for large sample sizes; in that case discretize and show mean and std)
  • - box plots (see the R sketch below)
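A minimal R sketch of these first steps, assuming a hypothetical data frame d with a continuous response y, a continuous covariate x and a categorical covariate g:

summary(d$y); hist(d$y); boxplot(d$y)   # univariate summaries and plots for a continuous variable
table(d$g); barplot(table(d$g))         # frequency table and bar chart for a categorical variable
plot(d$x, d$y)                          # bivariate: response against a continuous covariate
boxplot(y ~ g, data = d)                # response against a categorical covariate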
11
Q

Define the terms returned by the R function lm: standard error, t value, p-value = Pr(>|t|), significance codes, residual standard error, multiple/adjusted R-squared, F-statistic

A

A standard error is the standard deviation of the sampling distribution of a statistic. (In the R output: the estimated standard deviation of a coefficient, s.e.^(b^_i) = (sig_{b_i})^.)

The t-value/t-statistic is the ratio of the departure (/difference) of the estimated parameter from its hypothesized value to its standard error, i.e. (b^-b_{H0})/s.e.^(b^).

The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. In the R output: Pr(>|t|) = P(|T| > |t_observed|) for T ~ t_{n-p} under H0.

Residual standard error is an estimate of the standard deviation of the errors, which are assumed to be iid with mean 0 and constant standard deviation.

Residual standard error = sqrt( SSE / df ), where df = degrees of freedom = n-p.

Multiple R-squared=R^2=||y^-bar(y)||^2/||y-bar(y)||^2
(=coefficient of determination=portion of variance explained by model; measures goodness of fit)

Adjusted R-squared = 1 - (1-R^2)*(n-1)/df, df = n-p (since adding covariables never decreases R^2 on the training sample, this adjusts for the number of covariables).

An F-statistic is the ratio of two scaled sums of squares reflecting different sources of variability i.e. the variance explained by the parameters in the model (sum of squares of regression, SSR/(p-1) ) and the residual or unexplained variance (sum of squared errors, SSE/(n-p)).
So: F = [ ||y^ - bar(y)||^2/(p-1) ] / [ ||y - y^||^2/(n-p) ]
Let b^{(0)} := (bar(y),0,…,0)^T, y^{(0)} := bar(y)*(1,…,1)^T.
Under H0 (all covariate coefficients zero): SSE_0 := ||y - y^{(0)}||^2 = ||y - bar(y)||^2 = ||y - y^||^2 + ||y^ - bar(y)||^2.
Under H0, F ~ F_{p-1,n-p}; this is the F-test on which the last p-value in the output is based.

12
Q

Which nonlinear relations are possible within the scope of linear models?

A

Relations that are linear in the parameters, e.g. y = b0 + b1*log(z) + eps, but not e.g. y = b0 + b1*sin(b2*z) + eps

13
Q

Define homo- and heteroscedastic

A

Homoscedastic: all errors have the same finite variance.
Heteroscedastic: the error variances differ across observations.

14
Q

Define autocorrelated errors

A

eps_i = rho * eps_{i-1} + u_i with u_i iid (first-order autoregressive, AR(1), errors)

15
Q

What to do if errors are multiplicative instead of additive?

A
logarithmic transformation:
turns y_i = exp(b0 + b1*x_{i1} + ... + bk*x_{ik} + eps_i) = exp(b0)*exp(b1*x_{i1})*…*exp(bk*x_{ik})*exp(eps_i)
into
log(y_i) = b0 + b1*x_{i1} + ... + bk*x_{ik} + eps_i
BTW: If the errors are normally distributed, the response is log-normally distributed with E[y_i|x_i] = exp(x_i^T*b + sig^2/2)
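A minimal R sketch of the fit and the back-transformed mean, assuming numeric vectors x and y > 0 of equal length:

fit.log <- lm(log(y) ~ x)             # multiplicative errors become additive on the log scale
b <- coef(fit.log)
s2 <- summary(fit.log)$sigma^2        # estimate of sig^2 on the log scale
y.mean <- exp(b[1] + b[2]*x + s2/2)   # back-transformed conditional mean E[y|x] under log-normal errors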
16
Q

State the theorem about the explicit solution of least squares:

A

Thm: Let X be an n×p matrix and let y be a vector in R^n. Then the following are equivalent:
1. Xz = y has a unique least-squares solution.
2. The columns of X are linearly independent.
3. X^T*X is invertible.
In this case, the least-squares solution is:
b^ = (X^T*X)^{-1}*X^T*y.
https://textbooks.math.gatech.edu/ila/least-squares.html
Lecture: requires n>=p.
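A minimal R sketch of the explicit solution, assuming a design matrix X (n x p, full column rank) and a response vector y:

b.hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations (X^T X) b = X^T y
b.qr  <- qr.solve(X, y)                  # numerically more stable, same least-squares solution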

17
Q

Explain dummy coding

A

(almost one-hot encoding)
For modeling the effect of a covariate x in {1,…,c} with c categories using dummy coding, we define the c-1 dummy variables x_{i,1}=1{x_i=1},…,x_{i,c-1}=1{x_i=c-1}.
For identifiability, we omit the dummy variable of one category - this category is called the reference category.
BTW: Estimated effects are interpreted by direct comparison with the omitted reference category (whose effect is absorbed into the intercept b0).
Without a reference category (intercept plus all c dummies), the total effects are not uniquely determined;
with a reference category, the total effects are:
b0, b0+b1, …, b0+b_{c-1}, i.e. interpretation is relative to the reference category.
In this second case the regression parameters are uniquely determined.
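A minimal R sketch of dummy coding via factors (the category labels are hypothetical):

x <- factor(c("avg", "good", "good", "top", "avg"))
x <- relevel(x, ref = "avg")   # choose the reference category
model.matrix(~ x)              # intercept column plus c-1 dummy columns; "avg" is the omitted reference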

18
Q
Dummy coding:
Interpret the last two coefficients of the rent index:
rent^=112.69+5.85*area+57.26*glocation,
glocation=good location,
alocation=average location (reference)
A

For apartments in a good or average location, an increase of the living area by 1 m2 leads to an average increase in rent of about 5.85 Euros.
The average rent for an apartment in a good location is about 57.26 Euro higher than for an apartment of the same living area in an average location.

19
Q

How do you model effects of categorical covariate interactions?

What happens to E[y] when x changes by an amount d?

A

For two explanatory variables with two levels each:
If y = b0 + b1*x + b2*z + eps is the initial model, add the interaction term x*z to get the new model: y = b0 + b1*x + b2*z + b3*x*z + eps (see the R sketch below).
For two explanatory variables with three levels each:
Define dummy variables x1,x2 and z1,z2 for x and z and model: y = b0 + b1*x1 + b2*x2 + b3*z1 + b4*z2 + b5*x1*z1 + b6*x2*z1 + b7*x1*z2 + b8*x2*z2 + eps
Total effects: b0 for (x,z)=(3,3) (both at the reference level), b0+b1 for (x,z)=(1,3), etc.
Etc.

E[y|x+d,z] - E[y|x,z] = [b0 + b1*(x+d) + b2*z + b3*(x+d)*z] - [b0 + b1*x + b2*z + b3*x*z] = b1*d + b3*d*z

  • If b3=0: the expected change b1*d is independent of z.
  • If b3!=0: the expected change b1*d + b3*d*z depends on d and on the value of z
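A minimal R sketch of fitting and testing an interaction, assuming factors x and z and a numeric response y:

fit.main <- lm(y ~ x + z)    # main effects only
fit.int  <- lm(y ~ x * z)    # expands to x + z + x:z, i.e. all interaction dummies
anova(fit.main, fit.int)     # F-test of whether the interaction terms are needed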
20
Q

Define the normal equation(s). (Derive for 3+)

A

To compute the least squares estimate, we calculate the partial derivatives of ||y - Xb||^2 with respect to b (they form a vector) and require them to be zero, obtaining the equation (-2)X^T(y - Xb^) = 0 (equivalently X^T*X*b^ = X^T*y).
Def. The normal equation is the equation X^TXb^=X^T*y.

21
Q

Describe the goal of least squares in geometric terms.

What property does the vector of residuals have?

A

Project y onto the space {Xv | v in R^p} = col(X). The projection is y^ = Xb^.

eps^=r=y-Xb^ is orthogonal to all the columns of X, i.e. r^TX=0=X^T*r.

22
Q

Least Squares: Define the P- (or H-) and Q-matrices and state their properties.

A

The matrix P is defined by y^ = Py, i.e. P = X(X^TX)^{-1}X^T.
It has the properties P = P^T, P^2 = P and sum(P_{ii}; i) = tr(P) = p.
These are necessary and sufficient conditions for P to be an orthogonal projection of R^n onto a p-dimensional subspace (here the column space of X).
r = y - y^ = Qy, where Q = I - P.
Q is also a projection:
Q^T = Q^2 = Q, PQ = QP = 0 (hence they project onto orthogonal subspaces), tr(Q) = n-p.
BTW: P is also known as the hat matrix. P_{ii} tells us how much influence the observation y_i has on the fitted value y^_i.
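A minimal R sketch, assuming a full-column-rank design matrix X (n x p) and a response y:

P <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix: y.hat = P %*% y
Q <- diag(nrow(X)) - P                  # residual projection: r = Q %*% y
c(sum(diag(P)), sum(diag(Q)))           # traces p and n - p
# for a fitted lm object 'fit', the diagonal entries P_ii (leverages) are given by hatvalues(fit)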

23
Q

Define the MLE

A

The MLE is b^ = argmax_b P(y1,…,yn | x1,…,xn, b), i.e. the parameter value maximizing the likelihood of the observed data.

24
Q

What is the connection between the MLE and linear regression?

A

The MLE under the assumption of Gaussian iid errors is equal to the LSE (least-squares estimator).
BTW: Y = X*b + eps ~ N(Xb, sig^2*Id) => y_i | x_i ~ N(x_i^T*b, sig^2) => L(b,sig^2) = Prod_i exp( -(y_i - x_i^T*b)^2 / (2*sig^2) ) / (sqrt(2*pi)*sig) (taking the argmax of log L over b yields the LSE).
The MLE for sig^2 is (sig^)^2 = 1/n * Sum (y_i - y^_i)^2; in practice, however, one uses (sig^)^2 = 1/(n-p) * Sum (y_i - y^_i)^2, which is rescaled to be unbiased.
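A minimal R sketch comparing the two variance estimators, assuming a fitted model fit:

n <- length(resid(fit)); p <- length(coef(fit))
sig2.mle      <- sum(resid(fit)^2) / n         # MLE, biased
sig2.unbiased <- sum(resid(fit)^2) / (n - p)   # the estimator used in practice
summary(fit)$sigma^2                           # R reports the unbiased version (squared residual standard error)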

25
Q

Why doesn’t one replace multiple linear regression with p simple regressions?

A

Because problems arise when the explanatory variables are strongly correlated.
If explanatory variables are orthogonal, then multiple regression is equal to iteratively applying simple regressions.

26
Q

Simple linear regression: What can you say about (bar(x),bar(y))?

A

Always: the point (bar(x), bar(y)) lies on the fitted regression line {(x, b^_0 + b^_1*x) | x in R}

27
Q

Simple linear regression: How does one perform simple linear regression without an intercept when one is required?

A
  1. Transform the data: y~ := y - bar(y), x~ := x - bar(x),
  2. perform the regression Y~ = b*x~ + eps~ (through the origin), and
  3. transform back to get the new parameters.
    BTW: Y^ = b^_0 + b^_1*x = bar(y) + b^_1*(x - bar(x))
28
Q

Simple linear regression: Describe the regression to the mean phenomenon

A

(y^ - bar(y)) / sig^_y = corr^(x,y) * (x - bar(x)) / sig^_x
iff
y^ - bar(y) = corr^(x,y) * (sig^_y / sig^_x) * (x - bar(x))
Therefore, in standardized units, Y^ lies less far above/below bar(Y) than x lies above/below bar(x) (since |corr^| <= 1).
BTW: y^ - bar(y) = b^_1*(x - bar(x)) = [ Sum[(x_i - bar(x))*(y_i - bar(y))] / Sum[(x_i - bar(x))^2] ] * (x - bar(x))

29
Q

Define (Sig/) and state how it relates to b_j

A

(Sig/) := Cov((Y, X_1,…,X_p)^T), the joint covariance matrix of the response and the covariates.

b_j = Parcor(Y, X_j | {X_l : l != j}) * sqrt( (Sig/)^{-1}_{j,j} / (Sig/)^{-1}_{y,y} )

30
Q

IID errors: What property does the LSE b^ have under the assumption of a linear model with i.i.d. centered errors?

A

It is unbiased: E[b^]=b

BTW: E[b^] = E[(X^TX)^{-1}X^T*Y] = E[(X^TX)^{-1}X^T*(Xb + eps)] = b (the last equality uses E[eps] = 0)
Also: E[hat(Y)] = E[X*hat(b)] = X*b

31
Q

IID errors:
Under the assumption of a linear model with i.i.d. centered errors, what is Cov(b^), where b^ is the LSE?

What about Cov(hat(Y)) = Cov(X*b^) and Cov(r) = Cov(eps^)?

A

Cov(b^) = Cov((X^TX)^{-1}X^T*Y) = A*Cov(Y)*A^T = sig^2 * (X^TX)^{-1}, where A = (X^TX)^{-1}X^T, since Cov(Y) = Cov(eps) = sig^2 * Id.

Cov(Y^) = Cov(P*Y) = P*Cov(Y)*P^T = sig^2 * P (since P is a projection)

Cov(eps^) = Cov(r) = Cov(Q*Y) = Q*Cov(Y)*Q^T = sig^2*Q = sig^2*(I - P) (similarly)

NOTE: => the eps^_i are (unfortunately) correlated!

BTW: Var(eps^_i) = sig^2*Q_{i,i} = sig^2*(1 - P_{i,i}) is not constant, in contrast to Cov(eps) = sig^2*Id; and Cov(eps^, y^) = 0 (since QP = 0)

32
Q

IID errors:
What is the MLE of sig^2 under the assumption of a linear model with i.i.d. normal errors?

What's the problem with this estimator?
HINT: E[Sum((eps^_i)^2)] = ?

A

(sig^)^2=1/n*Sum((eps^_i)^2)=||Y-Xb^||^2/n

The problem: it is biased. One therefore defines (sig^)^2 = 1/(n-p) * Sum((eps^_i)^2) = ||y - X*b^||^2/(n-p), since then E[(sig^)^2] = sig^2 (i.e. this estimator is unbiased).

This estimator is unbiased even under weaker assumptions (the errors need not be Gaussian).

BTW: The justification follows from Var(eps^_i) = sig^2*(1 - P_{i,i}):
E[Sum((eps^_i)^2)] = (by linearity) = Sum(Var(eps^_i)) = Sum(sig^2*(1 - P_{i,i})) = sig^2*(n-p), using that Trace(P) = p.

33
Q

Under the Gaussian iid noise assumption, what is the distribution of the MLE/LSE b^?

3+: What holds for y^ and eps^?

A

b^ = (X^TX)^{-1}X^T*Y = (X^TX)^{-1}X^T*(Xb + eps) = (X^TX)^{-1}X^TX*b + (X^TX)^{-1}X^T*eps = b + (X^TX)^{-1}X^T*eps ~ N(b, sig^2*(X^TX)^{-1}), using that a linear combination of Gaussians is Gaussian [Var(AZ) = A*Var(Z)*A^T] and eps ~ N(0, sig^2*Id)
OR
b^ = (X^TX)^{-1}X^T*Y ~ N(b, sig^2*(X^TX)^{-1}), using Y ~ N(Xb, sig^2*Id) and that a linear combination of Gaussians is Gaussian
NOTE: b is unknown, so the true distribution is unknown!

BTW: y^ ~ N(X*b, sig^2 * P) and eps^ ~ N(0, sig^2 * Q)
They are also independent, since they are uncorrelated and jointly normally distributed => (sig^)^2 and b^ are also independent

34
Q

State the Theorem of asymptotic normality for LSE of the model Y=X*b+eps, where eps_i are i.i.d. with E[eps_i]=0 and Var(eps_i)=sig^2.

(But not necessarily Gaussian)

A

Assume:
A1: lambda diverges to infinity, where lambda is the smallest eigenvalue of X^T*X,
A2: max_j { P_{j,j} } = max_j { x_j^T * (Sum_i x_i*x_i^T)^{-1} * x_j } = max_j (X(X^TX)^{-1}X^T)_{j,j} converges to 0,
A3: the errors eps_i are iid with mean 0 and variance sig^2.
Then the LSE b^ is consistent (for b), and the distribution of (X^T*X)^{1/2}*(b^ - b) converges weakly to N(0, sig^2*I).

BTW: It can also be shown that sig^ is consistent.
If the fourth moment of the eps_i exists, then sig^ is also asymptotically normal.

35
Q

Least squares:
Under the Gaussian iid noise assumption, what is the the distribution of:
Sum(r_i^2)/sig^2 ?

A

Sum(r_i^2)/sig^2 ~ Chi^2_(n-p)

36
Q

X^T*X=Var(X)^=Sig^?

A

False: Var(X)^=Sig^=X^T*X/n

37
Q

Linear model with normal iid errors:
What are consequences of the fact that (sig^)^2 and b^are independent?
State the distributions of:
a. (b^_i - b_i) / ( sig^ * sqrt( (X^T*X)^{-1}_{i,i} ) ) ~ ???
b. (b^ - b)^T * (X^T*X) * (b^ - b) / ( p * (sig^)^2 ) ~ ???
c. Let v := B*b (B a q×p matrix); then (v^ - v)^T * V^{-1} * (v^ - v) / ( q * (sig^)^2 ) ~ ???, where V = B*(X^T*X)^{-1}*B^T
d. (y^_i - E[y_i]) / ( sig^ * sqrt(P_{ii}) ) ~ ???
e. (y^_0 - E[y_0]) / ( sig^ * sqrt( x_0^T*(X^T*X)^{-1}*x_0 ) ) ~ ???
f. (y_0 - y^_0) / ( sig^ * sqrt( 1 + x_0^T*(X^T*X)^{-1}*x_0 ) ) ~ ???

Under the null hypothesis H0: b_j = 0, what is the test statistic?

Which of a-f can test the global null hypothesis? I.e. H0: b=0?

A

a. (b^_i - b_i) / ( sig^ * sqrt( (X^T*X)^{-1}_{i,i} ) ) ~ t_{n-p}
b. ||X(b^ - b)||^2 / ( p*(sig^)^2 ) = (b^ - b)^T * (X^T*X) * (b^ - b) / ( p * (sig^)^2 ) ~ F_{p,n-p}
c. Let v := B*b (B a q×p matrix); then (v^ - v)^T * V^{-1} * (v^ - v) / ( q * (sig^)^2 ) ~ F_{q,n-p}, where V = B*(X^T*X)^{-1}*B^T
d. (y^_i - E[y_i]) / ( sig^ * sqrt(P_{ii}) ) ~ t_{n-p}
e. (y^_0 - E[y_0]) / ( sig^ * sqrt( x_0^T*(X^T*X)^{-1}*x_0 ) ) ~ t_{n-p}
f. (y_0 - y^_0) / ( sig^ * sqrt( 1 + x_0^T*(X^T*X)^{-1}*x_0 ) ) ~ t_{n-p}, where y_0 = y_0(x_0) is a new observation

Under H0: b_j = 0, the test statistic is T_j = b^_j / ( sig^ * sqrt( (X^T*X)^{-1}_{j,j} ) ) ~ t_{n-p}.
(BTW:) The R output reports the realized value t_j and the p-value p_j = P( |T| > |t_j| ) for T ~ t_{n-p}.

The global null hypothesis H0: b = 0 can be tested with b. (or with c., taking B = Id_p and v = 0).
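A minimal R sketch, assuming a fitted model fit and a one-row data frame new with the covariate values x_0:

summary(fit)$coefficients                              # b^_j, standard errors, t_j, Pr(>|t|)
confint(fit, level = 0.95)                             # confidence intervals, cf. a.
predict(fit, newdata = new, interval = "confidence")   # CI for E[y_0], cf. e.
predict(fit, newdata = new, interval = "prediction")   # prediction interval for y_0, cf. f.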

38
Q

Linear model with normal iid errors (intuition):
Why are estimators divided by sig^ t-distributed instead of normally distributed as with sig?

How does an F-distribution look like?

A

Because estimating sig results in more randomness and the t-distribution is flatter and wider than normal distribution. In the limit these t-distributions again become a normal distribution.

The F-distribution is supported on [0, inf); depending on the degrees of freedom, its density shows a one-sided decay, a Gaussian-like bump, or something in between (it is a heavy-tailed distribution).

39
Q

Linear model with normal iid errors:

Describe the 1-alpha confidence region of the LSE E[y_0] for unseen data (y_0,x_0), using the LSE y^_0.

A

The “simultaneous” (“for all x_0”) confidence region is described by the hyperboloid:
(y^_0 - E[y_0])^2 <= (sig^)^2 * ( x_0^T * (X^T*X)^{-1} * x_0 ) * p * F_{p,n-p}(1-alpha)

40
Q

Reassess if I should add 1.5.3-1.5.4 & 1.5.1 c) B… here….
Testing if multiple b_i=c_i simultaneously…
“lemma not important, its kind of boring, important is that you understand the geometric stuff that is going on here”

A

update

41
Q

Define and interpret the coefficient of determination.

A

coefficient of determination = R^2 := ||y^ - bar(y)||^2 / ||y - bar(y)||^2
It is the proportion of variance explained by the model. It measures the goodness of fit of the model with explanatory variables X_j.
NOTE: R^2 and F are at first the most important numbers in the computer output.
BTW: It is not difficult to see that R^2 is also the maximum squared correlation of y with an arbitrary linear combination of the columns X_j.
It is also equal to the square of the multiple correlation coefficient between y and the X_j. The linear combination maximizing the correlation with y is the least squares estimate y^ itself.

42
Q

Define F in the computer output.

A

update

43
Q

Simple linear LS regression:
How does one test:
H0: b=b0 at level g ?

What is the confidence interval?

Describe the confidence interval around y^_0(x_0) for new data x_0.

What about for all points x simultaneously?

A

Reject the null hypothesis if:
sqrt(SS_X) * | b^ - b0 | / sig^ > t_{n-2;1-g/2},
where SS_X = Sum[ (x_i - bar(x))^2 ].

The confidence interval for b is:
b^ +/- t_{n-2;1-g/2} * sig^ / sqrt(SS_X)

Confidence interval around y^_0(x_0) for new data x_0:
b^_0 + b^_1*x_0 +/- t_{n-2;1-g/2} * sig^ * sqrt( 1/n + (x_0 - bar(x))^2 / SS_X ).

For all points x simultaneously: replace x_0 by x in the formula above and the factor t_{n-2;1-g/2} by sqrt( 2*F_{2,n-2;1-g/2} ).

NOTE: The simultaneous band (the last) is wider than the pointwise interval (the second last).

44
Q

State the properties of the (Pearson) correlation rho := Corr(X,Y) = Cov(X,Y) / (sig_X * sig_Y).

State the definition of its estimator and the estimator's properties.

A
  • rho in [-1,1]
  • |rho| = 1 iff the joint distribution of X and Y is concentrated on a line (and the sign of rho matches the sign of this line's slope)

rho can be estimated by:

r := rho^ = Sum[ (x_i - bar(x)) * (y_i - bar(y)) ] / sqrt( Sum[(x_i - bar(x))^2] * Sum[(y_i - bar(y))^2] )
  • r in [-1,1]
  • |rho^| = 1 iff all points lie on a single line
  • sign(rho^) = sign(b^_1)

45
Q

Define the z transformation (“variance-stabilizing transformation for the correlation coefficient”) and state its distribution and interpretation.

A

Z:=tanh^{-1}(rho^)=0.5*log[ (1+rho^) / (1 - rho^) ] ~N(tanh^{-1}(rho), 1/(n-3) )

If the true value of rho is near 0, the variance of rho^ is high.
If the true value of rho lies near +/- 1, the variance of rho^ is small.
The z transformation rescales so as to make the variance (approximately) constant (i.e. it “compresses in the middle” and “stretches at the edges”).
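A minimal R sketch, assuming numeric vectors x and y:

r <- cor(x, y)
z <- atanh(r)                                         # Fisher z transformation, = 0.5*log((1+r)/(1-r))
n <- length(x)
ci.z   <- z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)   # approximate 95% CI on the z scale
ci.rho <- tanh(ci.z)                                  # back-transformed CI for rho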

46
Q

How can we test H0: rho = 0 (i.e. Cov(X,Y) = 0) vs. H1: rho != 0?

A
  1. Use a table or diagram
  2. Use the t- or F-test of b=0
  3. tanh^{-1} transformation
47
Q

Define Spearman’s rank correlation

A

r_S = 1 - 6* Sum(D_i^2) / [n*(n^2-1)], where D_i:=rank(x_i) - rank(y_i) and rank(x_i):=|{j: x_j >=x_i}|.

48
Q

Define Kendall’s rank correlation

A

r_K = 2 * (T_k - T_d)/[n(n-1)], where T_k = # concordances=# pairs with (x_i - x_j)(y_i - y_j) > 0 and T_d = # discordances=# pairs with (x_i - x_j)*(y_i - y_j) < 0
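A minimal R sketch for this and the previous card, assuming numeric vectors x and y:

cor(x, y, method = "spearman")       # Spearman's rank correlation r_S
cor(x, y, method = "kendall")        # Kendall's rank correlation r_K
cor.test(x, y, method = "kendall")   # the corresponding test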

49
Q

Define and interpret partial correlation

A

rho_{XY.Z} := [ rho_{XY} - rho_{XZ}*rho_{YZ} ] / sqrt[ (1 - rho^2_{XZ}) * (1 - rho^2_{YZ}) ]
Interpretation: the correlation between X and Y after removing the linear effect of Z from both variables.

50
Q

Show regression to the mean by expressing a formula.

A

y^ - bar(y) = rho^ * (sig^_Y / sig^_X) * (x - bar(x))

51
Q

Define QQ plot and normal plot

A
QQ plot = quantile-quantile plot: plot the empirical quantiles of the data against the quantiles of a reference distribution.
Normal plot: For iid r.v.s X_1,…,X_n (w.l.o.g. ordered), the empirical distribution function F_n(x) = |{i : X_i <= x}| / n ≈ Phi[ (x - mu)/sig ] if the X_i follow a normal distribution. Thus if we set z := Phi^{-1}(F_n(x)), then z ≈ (x - mu)/sig for large n. Hence, if the r.v.s really are normal, plotting (x, z) [for x in {X_1,…,X_n}] should result in (approximately) a straight line.
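A minimal R sketch of a normal plot of the residuals, assuming a fitted model fit:

qqnorm(resid(fit))   # sample quantiles of the residuals vs. standard normal quantiles
qqline(resid(fit))   # reference line; an approximately straight pattern supports the normality assumption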
52
Q

Define the Tukey-Anscombe plot

What to do if:
Variance increases linearly?
Variance increases by square root?

A

Plot (y^_i, r_i), where r_i = residual = y_i - y^_i.
If this plot shows systematic structure, the model assumptions are violated. Since Sum(r_i * y^_i) = 0, the points are centered around r = 0. Ideally, all points should be randomly scattered above and below r = 0 (purely random, no patterns).
Trends: the regression function is likely not specified correctly (i.e. the mean error isn't zero). Variance increasing linearly: log transform the response. Variance increasing like a square root: take the square root of the target variable (BTW: can be seen via a Taylor expansion).
Parabolic shape: most likely an x^2 term is missing.
For simple linear regression, this is (essentially) equivalent to plotting r_i against x_i (unlike for multiple regression, where one plots against the individual x_ij).
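A minimal R sketch, assuming a fitted model fit:

plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")   # Tukey-Anscombe plot
abline(h = 0, lty = 2)
# plot(fit, which = 1) gives the same plot with a smoother added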

53
Q

Correlated errors suppose Y=X*b + eps, where eps~N_n(0,Sig). How is b^ distributed?

A

b^~N_p(b,(X^TX)^{-1}(X^TSigX)(X^TX)^{-1})

54
Q

How can autocorrelation in time series data be detected?

A

By plotting residuals r_i against the observation times t_i.
If the points vary randomly around the horizontal axis in the time series plot, everything is fine. However, if adjacent r_i are similar, this indicates that the errors may be serially correlated.

Sometimes we even observe a jump in the level of the residuals. In such a case, the model has evidently changed suddenly at a particular point in time.

55
Q

Which tests can be used to check independence of the errors in time series data, as an alternative to inspecting the serial correlation (autocorrelation)?

A

i) Run test: counts the number of maximal sub-sequences (“runs”) in which the residuals have identical signs; this number should be neither too low nor too high.
ii) Durbin-Watson test: H0: uncorrelated errors (roughly, it examines Cov(eps_i, eps_{i+1}), assuming Var(eps_i) is constant in i). Uses the test statistic T = Sum[ (r_{i+1} - r_i)^2 : i in {1,…,n-1} ] / Sum( r_i^2 : all i ). Expanding reveals: T ≈ 2*( 1 - Sum[ r_i*r_{i+1} ] / Sum( r_i^2 ) ). If T is too small, reject H0.
More details: script p. 38
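A minimal R sketch, assuming a fitted model fit whose observations are in time order (the commented-out call assumes the add-on package lmtest is installed):

acf(resid(fit))                  # sample autocorrelations of the residuals
# library(lmtest); dwtest(fit)   # Durbin-Watson test of H0: uncorrelated errors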

56
Q

Generalized least squares aka weighted regression: describe the model.
Reformulate it into the “tilde model”.
Give the explicit solution to the tilde model and its distribution

A

Y = X*b + eps with eps ~ N(0, sig^2*Sig); assume Sig is known and positive definite but sig^2 is unknown.
Since Sig is positive definite, there exists a regular/invertible matrix A s.t. A*A^T = Sig.
Now: y~ := A^{-1}*y = A^{-1}*(X*b + eps) = A^{-1}*X*b + A^{-1}*eps =: X~*b + eps~.
Then E[eps~] = E[A^{-1}*eps] = A^{-1}*E[eps] = 0 and
Cov(eps~) = Cov(A^{-1}*eps) = A^{-1}*Cov(eps)*(A^{-1})^T = A^{-1}*sig^2*(A*A^T)*(A^{-1})^T = sig^2*Id.
BTW: Performing least squares on ||y~ - X~*b||^2 (see p. 40) is equivalent to performing least squares on the original data with a different scalar product.
b^ = (X~^T*X~)^{-1}*X~^T*y~ = (X^T*Sig^{-1}*X)^{-1}*X^T*Sig^{-1}*y
Now, b^ ~ N_p( b, sig^2*(X^T*Sig^{-1}*X)^{-1} ).
BTW: If Sig != Id, then the generalized least squares estimator has a smaller variance than standard least squares.
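A minimal R sketch of the whitening step, assuming X, y and a known positive definite matrix Sigma:

A   <- t(chol(Sigma))                            # Sigma = A %*% t(A); chol() returns the upper-triangular factor
y.t <- solve(A, y); X.t <- solve(A, X)           # the "tilde" model
b.gls <- solve(t(X.t) %*% X.t, t(X.t) %*% y.t)   # equals (X^T Sig^{-1} X)^{-1} X^T Sig^{-1} y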

57
Q

What is the Cochrane-Orcutt procedure?

A
  1. Perform least squares
  2. Estimate Sig^
  3. Perform generalized least squares with Sig^
    BTW: often used when the errors exhibit correlation in time.
58
Q

Define sampling distribution, standard error, t-value/t-statistic, p-value, residual standard error, multiple R^2, adjusted R^2, F-statistic

A

A sampling distribution is a probability distribution of a statistic obtained from a number of samples drawn from a specific population.

A standard error is the standard deviation of the sampling distribution of a statistic. (In the R output: the estimated standard deviation of a coefficient, s.e.^(b^_i) = (sig_{b_i})^.)

The t-value/t-statistic is the ratio of the departure (/difference) of the estimated parameter from its hypothesized value to its standard error, i.e. (b^-b_{H0})/s.e.^(b^).

The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null-hypothesis is correct.

Residual standard error = sqrt( SSE / df ), where df = degrees of freedom = n-p.

Multiple R-squared=R^2=||y^-bar(y)||^2/||y-bar(y)||^2
(=coefficient of determination=portion of variance explained by model; measures goodness of fit)

Adjusted R-squared = 1 - (1-R^2)*(n-1)/df, df = n-p (since adding covariables never decreases R^2 on the training sample, this adjusts for the number of covariables).

An F-statistic is the ratio of two scaled sums of squares reflecting different sources of variability i.e. the variance explained by the parameters in the model (sum of squares of regression, SSR/(p-1) ) and the residual or unexplained variance (sum of squared errors, SSE/(n-p)).
So: F = [ ||y^ - bar(y)||^2/(p-1) ] / [ ||y - y^||^2/(n-p) ]
Let b^{(0)} := (bar(y),0,…,0)^T, y^{(0)} := bar(y)*(1,…,1)^T.
Under H0 (all covariate coefficients zero): SSE_0 := ||y - y^{(0)}||^2 = ||y - bar(y)||^2 = ||y - y^||^2 + ||y^ - bar(y)||^2.
Under H0, F ~ F_{p-1,n-p}; this is the F-test on which the last p-value in the output is based.

59
Q

Exam FS18:
The least squares estimator is always unique (that is, there is only one solution of the least squares optimization criterion) if and only if rank(X)=p. T/F?

A

T

60
Q

Exam FS18:
Consider the least squares estimator b^. Then, the variance of b^_1 + b^_2 equals Var(b^_1 + b^_2) = sig^2*( (X^T*X)^{-1}_{11} + (X^T*X)^{-1}_{22} ). T/F?

A

F (the covariance term 2*Cov(b^_1, b^_2) = 2*sig^2*(X^T*X)^{-1}_{12} is missing)

61
Q

Exam FS18:
For testing the null-hypothesis H_{0,j}: b_j=0 versus H_{A,j}:b_j !=0, the two-sided t-test is appropriate whenever the linear model is correct and the errors eps~N(0,sig^2*Id). T/F?

A

T

62
Q

Exam FS18:
Consider a linear model with p=10 covariables. First we fit a model using the first 8 variables. Assume that in this model with 8 covariables the two-sided t-test for the null-hypothesis H_{0,1}: b_1=0 exhibits a p-value of 2.1e-4 and thus (on the 5%-level) is significantly different from 0. Now we look at the full model with all 10 covariables. Based on the p-value above, it is guaranteed that the covariable X_1 is also significant on the same 5%-level in this full model. T/F?

A

F

63
Q

Let H: y = X*b + eps with rank(X) = p, eps ~ N(0, sig^2*Id).
Suppose we want to test whether the first p-q covariables are superfluous, i.e. “the first p-q coefficients are all zero”. How can we test this?

A

Define B=[[1,0,…,0,…,0],…,[0,…,1,…,0]]=[Id_{p-q} , 0_q] (rank(B)=p-q), v=[0,…,0] (b in notes)
Let the null hypothesis be:
H0: B*b=v.

As a test statistic take:
(Bb^-v)^T(B(X^TX)^{-1}B^T)^{-1}(Bb^-v)/[(p-q)(sig^)^2]
Under the hypothesis, its distribution is F_{p-q,n-p}.

Alternatively: Assume v=0.
Then: [ (SSE_0-SSE)/(p-q) ] / [ SSE/(n-p) ]=[ ||y^-(y^)^(0)||^2/(p-q) ] / [ ||y-y^||^2/(n-p) ] ~ F_{p-q,n-p}
So we can use this as a test statistic.

With SSE_0 = ||y - (y^)^(0)||^2, where (y^)^(0) is the fit under the constraint B*b = v = 0 (so that SSE_0 - SSE = ||y^ - (y^)^(0)||^2).

BTW: The following lemma shows these two versions are identical.

Lem: The least squares estimator (b^)^(0) under the supplementary condition B*b = v is:
(b^)^(0) = b^ - (X^T*X)^{-1}*B^T*( B*(X^T*X)^{-1}*B^T )^{-1}*( B*b^ - v ).
Furthermore,
SSE_0 = SSE + ( B*b^ - v )^T*( B*(X^T*X)^{-1}*B^T )^{-1}*( B*b^ - v )
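A minimal R sketch of the same test via nested models, assuming a response y and covariates x1,…,x4 and testing whether x1, x2 are superfluous:

fit.full <- lm(y ~ x1 + x2 + x3 + x4)
fit.red  <- lm(y ~ x3 + x4)    # model under H0: the coefficients of x1 and x2 are zero
anova(fit.red, fit.full)       # F-statistic [ (SSE_0 - SSE)/(p-q) ] / [ SSE/(n-p) ]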

64
Q

Define biased estimator.

A

update