V9 Flashcards

Regression

1
Q

what is regression analysis

A
  • statistical process for estimating the relationships among variables
  • includes many techniques for modelling and analysing several variables
  • focus is on the relationship between a dependent variable and one or more independent variables
2
Q

regression model

A

regression model variables

  • unknown parameters, denoted as β (effect size)
  • independent variables (X)
  • dependent variable (Y)
  • > the independent variables cause / influence the dependent variable

Formula: Y = f(X, β)

β is a vector of length k
X has N rows and k columns

Constants
N = number of independent measurements (in practice more like siblings: not totally independent)
k = number of unknown parameters
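
To make N, k and the parameter vector concrete, here is a minimal sketch in R. It assumes the built-in airquality data and a simple linear model of Ozone on Temp; the names fit, X, beta and y_hat are only illustrative.

```r
# Fit Ozone as a linear function of Temp on the built-in airquality data.
fit <- lm(Ozone ~ Temp, data = airquality)

X    <- model.matrix(fit)  # design matrix: one row per measurement actually used
beta <- coef(fit)          # estimated parameter vector (intercept and slope)

N <- nrow(X)  # number of measurements (rows with missing Ozone are dropped by lm)
k <- ncol(X)  # number of unknown regression parameters

# For a linear model, f(X, beta) is simply the matrix product X %*% beta.
y_hat <- as.numeric(X %*% beta)
```

In this linear case the fitted values returned by fitted(fit) should match y_hat.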

3
Q

assumptions of the regression model

A
  • sample is representative of the population for the inference prediction
  • error is a random variable with a mean of zero conditional on the explanatory variables
  • independent variables are measured with no error
  • predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others
  • errors are uncorrelated
  • variance of the errors should be approximately equal across the range of your predicted values (homoscedasticity). if not, a log-transform or other methods might be used instead
4
Q

how to make statistical predictions about the unknown parameters

A

when N ≫ k and the measurement errors are normally distributed, the excess of information contained in the (N - k) measurements is used to make statistical predictions about the unknown parameters
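
As a rough illustration of where the (N - k) excess measurements show up in R, the sketch below assumes the airquality data and a simple Ozone ~ Temp model. The residual degrees of freedom are N - k, and they enter the estimate of the error standard deviation.

```r
fit <- lm(Ozone ~ Temp, data = airquality)

N <- nobs(fit)          # measurements used in the fit
k <- length(coef(fit))  # unknown regression parameters (intercept and slope)

df_res <- df.residual(fit)  # residual degrees of freedom, i.e. N - k

# Residual standard error: the residual sum of squares is divided by N - k.
sigma_hat <- sqrt(sum(residuals(fit)^2) / (N - k))
```

The value sigma_hat is the "Residual standard error" reported by summary(fit).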

5
Q

how to code linear model

A

with the lm() function

data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)  # Ozone dependent, Temp independent
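
A short sketch of how the fitted model might be inspected. It assumes that lm.sum (a name used on a later card) holds the summary object; that is an assumption, since the deck does not show its definition.

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)

lm.sum <- summary(lm.temp)  # assumed meaning of lm.sum: the model summary

coef_table <- coef(lm.sum)  # estimates, standard errors, t values, p values
r2         <- lm.sum$r.squared
print(coef_table)
```

The coefficient table has one row per parameter, here the intercept and the slope for Temp.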

6
Q

how to plot a regression line

A

plot(airquality$Ozone ~ airquality$Temp, pch = 19, col = "blue")

abline(lm.temp)  # draw the fitted regression line (lm.temp is the model fitted with lm())

7
Q

confidence interval

A
  • calculate the margin of error
  • in standard regression we can use the t-statistic
  • > standard error (from the lm summary)
  • > critical value: probability boundary (alpha) 0.05, two-sided, so the 0.975 quantile of the t-distribution with N-2 degrees of freedom
  • > critique
8
Q

how to calculate Margin of error

A

Margin of error = critical value × standard error

9
Q

code confidence interval and margin of error

A

lm.sum
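
The answer on this card is cut off, so the following is only a sketch of one way to do it, not the original code. It assumes lm.sum is the summary of the Ozone ~ Temp model and computes a 95% confidence interval for the slope; se, crit, moe and ci are illustrative names.

```r
lm.temp <- lm(Ozone ~ Temp, data = airquality)
lm.sum  <- summary(lm.temp)  # assumption: lm.sum is the model summary

se    <- coef(lm.sum)["Temp", "Std. Error"]            # standard error of the slope
alpha <- 0.05
crit  <- qt(1 - alpha / 2, df = df.residual(lm.temp))  # two-sided: 0.975 quantile, N - 2 df

moe   <- crit * se                     # margin of error = critical value x standard error
slope <- unname(coef(lm.temp)["Temp"])
ci    <- c(lower = slope - moe, upper = slope + moe)
```

The same interval should come out of confint(lm.temp, "Temp"), which is a convenient cross-check.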

10
Q

predicted values

- predict data from the minimum to the maximum temperature

A

new
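
The answer here is truncated. A possible sketch, assuming that new is a data frame of temperatures and that conf (a name used on the next card) holds the predictions with their confidence limits:

```r
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# New data: a grid of temperatures from the observed minimum to the maximum.
new <- data.frame(Temp = seq(min(airquality$Temp), max(airquality$Temp), by = 1))

# Predicted Ozone with a 95% confidence interval for the mean response.
conf <- predict(lm.temp, newdata = new, interval = "confidence", level = 0.95)
head(conf)  # matrix with columns fit, lwr and upr
```

The column name in new has to match the predictor name used in the model formula (Temp), otherwise predict() cannot find the variable.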

11
Q

plot predicted data with confidence interval

A

plot(airquality$Temp, airquality$Ozone, xlab = "Temp", ylab = "Ozone")
lines(new$Temp, as.numeric(conf[, "fit"]), lty = 1, lwd = 2)  # fitted regression line
lines(new$Temp, conf[, "upr"], lty = 2, col = "blue")  # upper confidence limit
lines(new$Temp, conf[, "lwr"], lty = 2, col = "blue")  # lower confidence limit

-> can also use the function visreg() (from the visreg library)

12
Q

residuals

A
  • describe how well the regression line fits the data (goodness of fit)
  • we aim to minimise the sum of squares of the residuals; with normally distributed errors, the model that does so is the maximum likelihood model
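
A small sketch of the "minimise the sum of squares" idea, assuming the airquality data and an Ozone ~ Temp model. The helper rss() is hypothetical and only written for this illustration.

```r
aq  <- na.omit(airquality[, c("Ozone", "Temp")])
fit <- lm(Ozone ~ Temp, data = aq)

# Residual sum of squares for any intercept a and slope b.
rss <- function(a, b) sum((aq$Ozone - (a + b * aq$Temp))^2)

a_hat <- unname(coef(fit)[1])
b_hat <- unname(coef(fit)[2])

rss_fit     <- rss(a_hat, b_hat)        # RSS at the least-squares estimates
rss_shifted <- rss(a_hat, b_hat + 0.1)  # RSS after nudging the slope away from the estimate
```

Moving the slope (or the intercept) away from the estimates returned by lm() increases the residual sum of squares.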
13
Q

how to visualise residuals (code)

A
  • easiest on clean data, not containing NAs

airqual. clean
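
The code on this card is cut off. One way it could look, as a sketch only: the name airquality_clean and the choice of plots are assumptions, not the original answer.

```r
# Drop rows with missing values in the variables used by the model.
airquality_clean <- na.omit(airquality[, c("Ozone", "Temp")])

fit <- lm(Ozone ~ Temp, data = airquality_clean)
res <- residuals(fit)

# Residuals against the predictor: look for curvature or a funnel shape.
plot(airquality_clean$Temp, res, xlab = "Temp", ylab = "Residual", pch = 19)
abline(h = 0, lty = 2)

# Distribution of the residuals.
hist(res, main = "Residuals", xlab = "Residual")
```

Working on the cleaned data keeps the residual vector the same length as the data columns, which makes plotting one against the other straightforward.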

14
Q

multiple linear regression

A
  • every factor has a different β

Y = a + b1*x1 + b2*x2

  • modelling ozone as a function of Temp and Wind

Ozone = a + b1*Temp + b2*Wind

-> two factors often fit better than one; this can be checked with R^2: the higher it is, the larger the share of the variation in the response explained by the model
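
A minimal sketch of this comparison in R, assuming the airquality data. The model names lm.temp and lm.temp.wind are illustrative.

```r
lm.temp      <- lm(Ozone ~ Temp, data = airquality)
lm.temp.wind <- lm(Ozone ~ Temp + Wind, data = airquality)

coef(lm.temp.wind)  # a (intercept), b1 (Temp), b2 (Wind)

r2_one  <- summary(lm.temp)$r.squared
r2_two  <- summary(lm.temp.wind)$r.squared
adj_two <- summary(lm.temp.wind)$adj.r.squared  # penalises the extra parameter
```

Plain R^2 cannot decrease when a predictor is added to a model fitted on the same rows, so the adjusted R^2 is usually the fairer number to compare.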

15
Q

plot observed versus estimated data points

A

first make the predicted dataset (Ozone.est)

plot(airquality$Temp, airquality$Ozone)  # plot observed data
points(airquality$Temp, Ozone.est, pch = 2, col = "red")  # add predicted data to the same plot
legend("topleft", legend = c("observed", "predicted"), pch = c(1, 2), col = c("black", "red"), bty = "n")
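
The card does not show how Ozone.est is built. A possible sketch, assuming it comes from the Temp and Wind model of the previous card:

```r
lm.temp.wind <- lm(Ozone ~ Temp + Wind, data = airquality)

# One estimate per row of airquality, so it lines up with airquality$Temp.
# Temp and Wind have no missing values, so every row gets a prediction.
Ozone.est <- predict(lm.temp.wind, newdata = airquality)
```

Passing newdata = airquality (rather than relying on the fitted values) gives predictions even for rows where the observed Ozone is missing, which keeps the vector the same length as the Temp column.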

16
Q

quadratic regression + coding

A
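  • linearity refers to the coefficients: the response is a linear combination of the parameters, so a change in ONE of the coefficients always yields a proportional change in the response variable
  • this means the following functions are also linear, even though they are not straight lines in X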

Y= a+b1(X^2)
Y = a + b1*e^X

lm(Ozone ~ Temp + I(Temp^2), data = airquality)

  • a term with ^ always needs to be wrapped in I(), because ^ has a special meaning inside a formula
17
Q

how to plot a quadratic regression from its coefficients

A

plot(Ozone ~ Temp, data = airquality)
curve(a + b1*x + b2*x^2, add = TRUE)  # curve() expects the variable to be called x
- take a, b1 and b2 from the lm summary
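
Instead of copying the numbers by hand, the coefficients can be pulled from the fitted model. This is a sketch assuming the quadratic model Ozone ~ Temp + I(Temp^2); lm.quad, a, b1 and b2 are illustrative names.

```r
lm.quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

cf <- unname(coef(lm.quad))
a  <- cf[1]  # intercept
b1 <- cf[2]  # coefficient of Temp
b2 <- cf[3]  # coefficient of Temp^2

plot(Ozone ~ Temp, data = airquality)
curve(a + b1 * x + b2 * x^2, add = TRUE, lwd = 2)  # fitted quadratic curve
```

The expression passed to curve() must be written in terms of x; add = TRUE draws it over the existing scatter plot.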

18
Q

how to see which model is the best ?

A
  • model selection is the task of selecting a statistical model from a set of candidate models
  • AIC (Akaike information criterion)
  • relative comparison of models
  • not a test, just a guideline
  • AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters
19
Q

AIC

A

AIC = 2k - 2ln(L)

  • L is the maximum value of the likelihood function for the model (goodness of fit)
  • k is the number of estimated parameters in the model
  • want the lowest AIC value
  • in R: AIC()
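
A short sketch of the formula and the comparison in R, assuming the three airquality models used on the earlier cards (the model names are illustrative). Note that for lm fits R counts the error variance as an estimated parameter, so k is one larger than the number of regression coefficients.

```r
lm.temp      <- lm(Ozone ~ Temp, data = airquality)
lm.temp.wind <- lm(Ozone ~ Temp + Wind, data = airquality)
lm.quad      <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

# AIC by hand for the simplest model: 2k - 2ln(L).
ll <- logLik(lm.temp)
k  <- attr(ll, "df")  # estimated parameters, including the error variance
aic_manual <- 2 * k - 2 * as.numeric(ll)

# AIC for all candidates at once; the lowest value is preferred.
aic_table <- AIC(lm.temp, lm.temp.wind, lm.quad)
print(aic_table)
```

The comparison is only meaningful when all models are fitted to the same observations, which holds here because only Ozone has missing values among the variables used.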