Statistical Modeling Flashcards
Let dPaup = a + b*dOut + c*dOld + d*dPop + error (the d-prefix denotes the change of a variable), where Paup = number of people receiving poor relief, Out = ratio of people receiving poor relief outside of poorhouses (phs) to people in phs, Old = proportion of over-65-year-olds in the general population, and Pop = total population.
Yule: Out and Paup are correlated even when accounting for the other covariates such as Old. Therefore, supporting people in their usual surroundings (outside poorhouses) leads to even more poverty.
What is the problem with his conclusion?
- Those districts with more efficient administration also built more poorhouses at the time
- Efficient administration leads to a reduction in poverty
- I.e. effects of efficient administration and of the establishment of phs cannot be separated
- We call these two variables “confounded”
What is our goal in regression?
Analyze the influence of the covariates on the mean value of the response: E[y|x1,…,xk] = f(x1,…,xk), where y is the response/target/dependent variable and x1,…,xk are the explanatory/independent variables / regressors / covariates.
Assumptions:
- Additive noise eps (a random variable): y = f(x1,…,xk) + eps, i.e. the relation between covariates and response is not deterministic.
- The error eps does not depend on the covariates.
- The error term may comprise unobservable variables that influence the response; if these are correlated with the covariates, this causes "omitted variable bias".
Give the scalar, vector and matrix notations of the linear model
scalar: Y_i = b_1*x_{i1} + … + b_p*x_{ip} + eps_i = f(x_{i1},…,x_{ip}) + eps_i
vector: Y_i = x_i^T * b + eps_i = f(x_{i1},…,x_{ip}) + eps_i
matrix: Y = X * b + eps, where Y = (Y_1,…,Y_n)^T, X = (x_1,…,x_n)^T is the nxp design matrix whose i-th row is x_i^T (first entry 1 for the intercept), and dim(eps) = nx1. We assume p <= n, E[eps] = 0, Cov(eps) = sig^2 * Id, eps|X ~ N(0, sig^2*Id).
Define the linear models (i.e. give X, p, and b):
location model,
2-sample model,
regression through the origin,
simple linear regression,
multiple linear regression,
quadratic regression.
Can the following be transformed into a lin. model?:
power: Y_i = alpha*x_i^beta + eps_i,
exponential: Y_i = alpha*exp(beta*x_i) + eps_i?
p=1,X=(1,..,1)^T, b=b_1;
p=2, X=[ [1,0],…,[1,0],[0,1],…,[0,1] ], b = (b1,b2)^T (Q: is b1=b2 plausible? How large is the diff?);
p=1, X=(x_1,…,x_n)^T, b=b_1;
p=2, X=[ [1,x_1],…,[1,x_n] ], b = (b1,b2)^T;
p=k+1, X=[ [1,x_{11},…,x_{1k}],…,[1,x_{n1},…,x_{nk}] ], b = (b1,…,bp)^T;
p=3, X=[ [1,x_1,x_1^2],…,[1,x_n,x_n^2] ], b = (b1,b2,b3)^T;
Yes, provided the error enters multiplicatively: Y_i = alpha*x_i^beta*eps_i or Y_i = alpha*exp(beta*x_i)*eps_i (with additive errors the models cannot be linearized exactly).
Then perform linear regression on log(Y_i) = log(alpha) + beta*log(x_i) + eps'_i (power) resp. log(Y_i) = log(alpha) + beta*x_i + eps'_i (exponential),
e.g. for the power model p=2, X=[ [1,log(x_1)],…,[1,log(x_n)] ], b = (log(alpha),beta)^T; then transform back by taking the exponential to get Y_i = alpha*x_i^beta*nu_i, where nu_i = exp(eps'_i).
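BTW, a minimal R sketch (simulated data; alpha = 2 and beta = 1.5 are assumed values, not from the lecture) of fitting the power model via the log-log transform:
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.1))   # power model with multiplicative error
fit <- lm(log(y) ~ log(x))                  # linear in log(alpha) and beta
exp(coef(fit)[1])                           # back-transformed intercept estimates alpha
coef(fit)[2]                                # slope estimates beta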
R command for linear regression and interpretation of output?
- Summary statistics of residuals
- Coefficients with their standard error, t value, p value [ Pr(>|t|) ], significance codes
- Multiple/adjusted R-squared
- F-statistic
Call: e.g. lm(formula = rent ~ area, data = dataset_xy)
or e.g. lm(formula = y ~ x1 + x2 + x3 + x4)
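A short sketch of the full call (assuming, as above, a data set dataset_xy with variables rent and area):
fit <- lm(rent ~ area, data = dataset_xy)
summary(fit)    # residual summary, coefficient table (Estimate, Std. Error, t value, Pr(>|t|)),
                # residual standard error, multiple/adjusted R-squared, F-statistic
confint(fit)    # confidence intervals for the coefficients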
State the assumptions of the “classical linear model”
(Y = X * b + eps, where Y = (Y_1,…,Y_n)^T, X is the nxp design matrix with rows x_i^T, and dim(eps) = nx1)
- Expectation: E[eps]=(0,…,0)^T
- Covariance: Cov(eps)=sig^2*I_n=diag(sig^2,…,sig^2)
- Homoscedastic errors: Var(eps_i)=sig^2
- Correlation: uncorrelated errors: Cov(eps_i, eps_j) = 0 for i != j
- Gaussian errors: often one assumes a normal distribution for the errors, eps | X ~ N(0, sig^2*I_n), implying that eps and X are independent (motivated by the CLT, since the errors are additive [a multiplicative combination of elementary errors gives a log-normal distribution])
(Design matrix: there are two settings: x1,…,xk are deterministic or random.
- If X is random, then the observations are realizations of the random vector (y, x^T) and all model assumptions are conditional on the design matrix, i.e. E[eps|X] = 0, Cov(eps|X) = sig^2*I_n and eps|X ~ N(0, sig^2*I_n),
but we omit the dependence on X for notational simplicity.)
What is the distribution of y under the assumptions of the classical linear model?
Since Y = X * b + eps, it follows that: if E[eps] = E[eps|X] = 0 and Cov(eps) = sig^2*Id, then E[y] = X*b and Cov(y) = sig^2*I_n. If moreover eps|X ~ N(0, sig^2*Id), then y ~ N(X*b, sig^2*I_n).
Definition of a residual
eps^_i = y_i - x_i^T*b^
Note that eps!=eps^, i.e. error != residual
What does a partial residual do?
Define a partial residual.
A partial residual quantifies the removal of the effect of “some” covariates, e.g. all but x_j.
Def: eps^_{x_j, i} = y_i - x_i^T * b^ + b^_j * x_{i,j}
(slide 34 in lec 1)
Def: Y~ = Y - X * b^ + b^_j * X_j ~ b_j * X_j + eps, where X_j is the j-th feature and b_j measures the effect of X_j on Y when keeping all other X_l (l != j) fixed, i.e. when conditioning on them.
b_j = E[y | X_j = x+1, {X_l : l != j}] - E[y | X_j = x, {X_l : l != j}]
(from notes in 2nd week)
BTW: Partial residuals regress out all effects of X_l (l != j).
b^_j measures the effect of X_j on Y that is not explained by the other X_l (l != j), i.e. when the others are held fixed. (See the sketch below.)
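A minimal R sketch of partial residuals, assuming hypothetical covariates x1 and x2:
fit <- lm(y ~ x1 + x2)
pres_x1 <- residuals(fit) + coef(fit)["x1"] * x1   # eps^_i + b^_1 * x_{i,1}
plot(x1, pres_x1)                                  # effect of x1 with x2 regressed out
termplot(fit, partial.resid = TRUE)                # built-in partial-residual plots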
First steps in data analysis
- Look at univariate distributions of the variables (exploratory analysis)
- -continuous vars: look at summary statistics (e.g. mean, median, std, etc.), visualize with histograms, box plots, etc.
- -categorical vars: look at frequency tables, visualize with bar graphs etc.
- Check for extreme values
- Graphical association analysis, i.e. plot the response variable against the explanatory variables (bivariate analysis, i.e. individually) to get insight, e.g. linear vs. non-linear relationships (see the R sketch after this list)
- -continuous: scatter plots (might not be informative for large sample sizes, in that case discretise to show mean and std)
- -Boxplots
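A sketch of such a first pass in R, assuming a hypothetical data frame dat with continuous y and area and a categorical location:
summary(dat$area); hist(dat$area); boxplot(dat$area)   # univariate, continuous
table(dat$location); barplot(table(dat$location))      # univariate, categorical
plot(dat$area, dat$y)                                  # bivariate scatter plot
boxplot(y ~ location, data = dat)                      # response vs. categorical covariate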
Define the terms returned by the R function lm: standard error, t value, p-value = Pr(>|t|), significance codes, residual standard error, multiple/adjusted R-squared, F-statistic
A standard error is the standard deviation of the sampling distribution of a statistic (in the R output: the estimated standard deviation of a coefficient estimate, s.e.^(b^_j) = sig^_{b_j}).
The t-value/t-statistic is the ratio of the departure (/difference) of the estimated parameter from its hypothesized value to its standard error, i.e. (b^-b_{H0})/s.e.^(b^).
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct; here Pr(>|t|) is the probability that |T| exceeds the observed |t| under H0.
Residual standard error is an estimate of the standard deviation of the errors, which are assumed to be iid with mean 0 and constant standard deviation.
Residual standard error = sqrt( SSE / df ), where SSE = residual sum of squares and df = degrees of freedom = n - p.
Multiple R-squared=R^2=||y^-bar(y)||^2/||y-bar(y)||^2
(= coefficient of determination = proportion of variance explained by the model; measures goodness of fit)
Adjusted R-squared = 1 - (1 - R^2)*(n-1)/df, df = n - p (since adding covariates always improves the fit on the training sample, this adjusts for the number of covariates).
An F-statistic is the ratio of two scaled sums of squares reflecting different sources of variability i.e. the variance explained by the parameters in the model (sum of squares of regression, SSR/(p-1) ) and the residual or unexplained variance (sum of squared errors, SSE/(n-p)).
So: F = [ ||y^ - bar(y)||^2/(p-1) ] / [ ||y - y^||^2/(n-p) ]
Let b^{(0)}:=[bar(y),0,…,0], y^{(0)}:=bar(y)*[1,…,1]=:bar(y) (as vec).
H0 (all coefficients except the intercept are zero): SSE_0 := ||y - y^{(0)}||^2 = ||y - bar(y)||^2 = ||y - y^||^2 + ||y^ - bar(y)||^2 (Pythagoras).
Under H0, F ~ F_{p-1, n-p}; this is the F-test on which the last p-value of the R output is based.
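As a sanity check, these quantities can be recomputed by hand from any fitted lm object fit (variable names assumed):
y <- model.response(model.frame(fit)); yhat <- fitted(fit); ybar <- mean(y)
n <- length(y); p <- length(coef(fit))
R2 <- sum((yhat - ybar)^2) / sum((y - ybar)^2)                          # multiple R-squared
Fstat <- (sum((yhat - ybar)^2)/(p - 1)) / (sum((y - yhat)^2)/(n - p))   # F-statistic
c(R2, Fstat)   # should match summary(fit)$r.squared and summary(fit)$fstatistic[1]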
Which nonlinear relations are possible within the scope of linear models?
Relations that are linear in the parameters, e.g. y = b0 + b1*log(z) + eps, but not e.g. y = b0 + b1*sin(b2*z_i) + eps
Define homo- and heteroscedastic
Homo: Having the same finite variance for all elements.
Hetero: the error variances differ across observations.
Define autocorrelated errors
eps_i= rho * eps_{i-1} + u_i with u_i iid
What to do if errors are multiplicative instead of additive?
Logarithmic transformation: it turns y_i = exp(b0 + b1*x_{i1} + … + bk*x_{ik} + eps_i) = exp(b0)*exp(b1*x_{i1})*…*exp(bk*x_{ik})*exp(eps_i) into log(y_i) = b0 + b1*x_{i1} + … + bk*x_{ik} + eps_i.
BTW: If the errors are normally distributed, the response is log-normally distributed with E[y_i|x_i] = exp(x_i^T*b + sig^2/2).
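A minimal sketch of the back-transformation, assuming hypothetical variables y, x1, x2 with multiplicative (log-normal) errors:
fit <- lm(log(y) ~ x1 + x2)
sig2 <- summary(fit)$sigma^2
exp(fitted(fit))              # estimates the conditional median of y
exp(fitted(fit) + sig2 / 2)   # estimates the conditional mean E[y|x] = exp(x^T*b + sig^2/2)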
State the theorem about the explicit solution of least squares:
Thm: Let X be an nxp matrix and let y be a vector in R^n. Then the following are equivalent:
1. Xz=y has a unique least-squares solution.
2. The columns of X are linearly independent.
3. X^TX is invertible.
In this case, the least-squares solution is:
b^ = (X^TX)^{-1}X^T*y.
https://textbooks.math.gatech.edu/ila/least-squares.html
Lecture: this requires n >= p (otherwise the columns cannot be linearly independent).
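A sketch of the explicit solution, assuming a hypothetical response y and covariate x:
X <- cbind(1, x)                     # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% y)   # b^ = (X^T X)^{-1} X^T y (needs linearly independent columns)
coef(lm(y ~ x))                      # should agree; lm() uses a QR decomposition, which is numerically safer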
Explain dummy coding
(almost one-hot encoding)
For modeling the effect of a covariate x in {1,…,c} with c categories using dummy coding, we define the c-1 dummy variables x_{i,1}=1{x_i=1},…,x_{i,c-1}=1{x_i=c-1}.
For identifiability, we omit one of the dummy variables - this category is called the reference category.
BTW: Estimated effects can be interpreted by direct comparison with the omitted reference category (coeff. b0).
Without a reference category (intercept plus all c dummies), the total effects are b0+b1,…,b0+b_c, but the parameters are not identifiable;
with a reference category, the total effects are
b0, b0+b1,…,b0+b_{c-1}, i.e. the interpretation is relative to the reference category.
In the second case the regression parameters are uniquely determined.
Dummy coding: Interpret the last two coefficients of the rent index: rent^=112.69+5.85*area+57.26*glocation, glocation=good location, alocation=average location (reference)
For apartments in a good or average location, an increase of the living area by 1 m2 leads to an average increase in rent of about 5.85 euros.
The average rent for an apartment in a good location is about 57.26 Euro higher than for an apartment of the same living area in an average location.
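A sketch in R, assuming the rent data set dataset_xy contains a factor location with levels "average" and "good":
dataset_xy$location <- relevel(factor(dataset_xy$location), ref = "average")  # set the reference category
fit <- lm(rent ~ area + location, data = dataset_xy)
head(model.matrix(fit))   # intercept, area, and the 0/1 dummy locationgood
coef(fit)                 # locationgood about 57.26 = surcharge relative to an average location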
How do you model effects of categorical covariate interactions?
What happens to E[y] when x changes by an amount d?
For two explanatory variables with two levels:
If y = b0 + b1*x + b2*z + eps is the initial model, add the interaction term x*z to get the new model: y = b0 + b1*x + b2*z + b3*x*z + eps
For two explanatory variables with three levels:
Define dummy variables x1, x2 and z1, z2 for x and z and model: y = b0 + b1*x1 + b2*x2 + b3*z1 + b4*z2 + b5*x1*z1 + b6*x2*z1 + b7*x1*z2 + b8*x2*z2 + eps
Total effects: b0 for (x,z) = (3,3) (both at the reference level), b0+b1 for (x,z) = (1,3), etc.
E[y|x+d,z] - E[y|x,z] = b0 + b1*(x+d) + b2*z + b3*(x+d)*z - (b0 + b1*x + b2*z + b3*x*z) = b1*d + b3*d*z
- If b3 = 0: the expected change b1*d is independent of z.
- If b3 != 0: the expected change b1*d + b3*d*z depends on d and on the value of z. (See the R sketch below.)
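A sketch of interaction terms in R formulas, with hypothetical variables x and z (factors or continuous):
fit1 <- lm(y ~ x + z)   # main effects only
fit2 <- lm(y ~ x * z)   # shorthand for x + z + x:z; for factors, x:z adds all dummy products
anova(fit1, fit2)       # F-test of H0: all interaction coefficients are zero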
Define the normal equation(s). (Derive for 3+)
To compute the least squares estimate, we calculate the partial derivatives of ||y - Xb||^2 with respect to b (they form a vector, the gradient) and require them to be zero, thus obtaining the equation (-2)X^T(y - Xb^) = 0 (equivalently X^TXb^ = X^Ty).
Def. The normal equation is the equation X^TXb^=X^T*y.
Describe the goal of least squares in geometric terms.
What property does the vector of residuals have?
Project y onto the space {Xv | v in R^p} = col(X); the projection is Xb^.
eps^=r=y-Xb^ is orthogonal to all the columns of X, i.e. r^TX=0=X^T*r.
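This orthogonality is easy to verify numerically for any fitted lm object fit (sketch):
X <- model.matrix(fit)
t(X) %*% residuals(fit)   # X^T r, numerically zero in every component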
Least Squares: Define the P- (or H-) and Q-matrices and state their properties.
The matrix P is the matrix with y^ = Py, i.e. P = X(X^TX)^{-1}X^T.
It has the properties P = P^T, P^2 = P and tr(P) = sum_i P_{ii} = p.
These are necessary and sufficient conditions for P to be an orthogonal projection of R^n onto a p-dimensional subspace (here col(X)).
r=y-y^=Qy, where Q=I-P.
Q is also a projection:
Q^T=Q^2=Q, PQ=QP=0 (hence orthogonal to each other), tr(Q)=n-p.
BTW: P is also known as the hat matrix. P_{ii} (the leverage) tells us how much influence the observation y_i has on the fitted value y^_i.
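A sketch computing P explicitly for a fitted lm object fit (names assumed); in practice one uses hatvalues(fit) for the diagonal:
X <- model.matrix(fit)
P <- X %*% solve(t(X) %*% X) %*% t(X)   # P = X (X^T X)^{-1} X^T
max(abs(P %*% P - P))                   # ~ 0: idempotent
sum(diag(P))                            # = p, the number of columns of X
range(diag(P) - hatvalues(fit))         # ~ 0: the diagonal entries are the leverages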
Define the MLE
The MLE is b^ = argmax_b P(y_1,…,y_n | x_1,…,x_n, b).
What is the connection between the MLE and linear regression?
The MLE under the assumption of Gaussian iid errors is equal to the LSE (least-squares estimator).
BTW: Y = X * b + eps ~ N(Xb, sig^2*Id) => p(y_i | x_i, b, sig) = N(x_i^T*b, sig^2) => L(b, sig^2) = Prod_i exp( -(y_i - x_i^T*b)^2 / (2*sig^2) ) / (sqrt(2*pi)*sig) (taking the argmax of log L yields the LSE for b).
The MLE for sig^2 is (sig^)^2 = (1/n) * sum_i (y_i - y^_i)^2; in practice, however, one uses (sig^)^2 = (1/(n-p)) * sum_i (y_i - y^_i)^2, which is rescaled to be unbiased.
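Sketch of the two variance estimates for a fitted lm object fit (names assumed):
n <- length(residuals(fit)); p <- length(coef(fit))
sum(residuals(fit)^2) / n         # ML estimate (biased)
sum(residuals(fit)^2) / (n - p)   # unbiased estimate; equals summary(fit)$sigma^2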
Why doesn’t one replace multiple linear regression with p simple regressions?
Because problems arise when the explanatory variables are strongly correlated.
If explanatory variables are orthogonal, then multiple regression is equal to iteratively applying simple regressions.
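A small simulated illustration (assumed coefficients) of why a single simple regression can mislead when covariates are correlated:
set.seed(1)
x1 <- rnorm(200); x2 <- x1 + rnorm(200, sd = 0.3)   # strongly correlated covariates
y <- 1 + 2 * x1 + 0 * x2 + rnorm(200)               # x2 has no effect given x1
coef(lm(y ~ x1 + x2))   # close to (1, 2, 0)
coef(lm(y ~ x2))        # simple regression: x2 appears to have a strong effect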