V9 Flashcards
Regression
what is regression analysis
- statistical process for estimating the relationships among variables
- includes many techniques for modelling and analysing several variables
- focus is on the relationship between a dependent variable and one or more independent variables
regression model
regression model variables
- unknown parameters, denoted as β (effect size)
- independent variables (x)
- dependent variable (y)
- > the independent variables cause / influence the dependent variable
Formula: Y = f(X, β)
β is a vector of k parameters
X has N rows and k columns
Constants
N = number of independent measurements (more like siblings, not totally independent)
k = number of unknown parameters
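To make N and k concrete, here is a minimal sketch in R. It assumes a linear model of Ozone on Temp and Wind from the built-in airquality data, chosen only for illustration; the names X and beta are not from the cards.

```r
data(airquality)

# Design matrix X: one row per usable measurement, one column per parameter
# (assumed model: intercept + Temp + Wind; rows with missing Ozone are dropped)
X <- model.matrix(Ozone ~ Temp + Wind, data = airquality)

N <- nrow(X)  # number of measurements that enter the fit
k <- ncol(X)  # number of unknown parameters (intercept, Temp, Wind)

# beta: the estimated parameter vector, one entry per column of X
beta <- coef(lm(Ozone ~ Temp + Wind, data = airquality))

N - k  # residual degrees of freedom left over for inference
```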
assumptions of the regression model
- the sample is representative of the population about which inferences and predictions are made
- error is a random variable with a mean of zero conditional on the explanatory variables
- independent variables are measured with no error
- predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others
- errors are uncorrelated
- the variance of the errors should be approximately equal across the range of the predicted values; if not, a log-transformation or other methods might be used instead
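One way to eyeball the equal-variance assumption is a plot of residuals against fitted values. This is a minimal sketch, assuming the airquality data and a simple Ozone ~ Temp model; the names aq, fit.raw and fit.log are illustrative and not from the cards.

```r
data(airquality)

# Keep only complete rows so residuals and fitted values line up
aq <- na.omit(airquality[, c("Ozone", "Temp")])

fit.raw <- lm(Ozone ~ Temp, data = aq)       # untransformed response
fit.log <- lm(log(Ozone) ~ Temp, data = aq)  # log-transformed response

# Residuals vs fitted values, side by side; a funnel shape suggests unequal variance
op <- par(mfrow = c(1, 2))
plot(fitted(fit.raw), resid(fit.raw), xlab = "fitted", ylab = "residuals", main = "Ozone")
abline(h = 0, lty = 2)
plot(fitted(fit.log), resid(fit.log), xlab = "fitted", ylab = "residuals", main = "log(Ozone)")
abline(h = 0, lty = 2)
par(op)
```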
how to make statistical predictions about the unknown parameters
when N >> k and the measurement errors are normally distributed, the excess of information contained in the (N - k) measurements is used to make statistical predictions about the unknown parameters
how to code a linear model
with the lm function
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality) # Ozone dependent, Temp independent
how to plot a regression line
plot(airquality$Ozone ~ airquality$Temp, pch = 19, col = "blue")
a
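The call that draws the line itself is cut off in the card. A minimal sketch, assuming it was abline() applied to the fitted model lm.temp (the lwd setting is an illustrative choice):

```r
data(airquality)

# Fit the simple model: Ozone dependent, Temp independent
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Scatter plot of the observations
plot(airquality$Ozone ~ airquality$Temp, pch = 19, col = "blue")

# Assumed call: abline() reads intercept and slope from the lm object
abline(lm.temp, lwd = 2)
```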
confidence interval
- calculate the margin of error
- in standard regression we can use the t-statistic
- > standard error (from the lm summary)
- > critical value: t-quantile for the probability boundary (alpha) 0.05; two-sided, so the 0.975 quantile, with N-2 degrees of freedom
- > critique
how to calculate Margin of error
Critical value x standard error
code confidence interval and margin of error
lm.sum
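The code on this card is truncated after lm.sum. A minimal sketch of the calculation described above, assuming lm.sum holds summary(lm.temp) and the interval is wanted for the Temp slope; the names se, crit, moe and ci are illustrative:

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Assumed: lm.sum is the model summary, which contains the standard errors
lm.sum <- summary(lm.temp)
se <- lm.sum$coefficients["Temp", "Std. Error"]

# Critical value: 0.975 quantile of the t-distribution (two-sided, alpha = 0.05)
# with N - 2 degrees of freedom
crit <- qt(0.975, df = lm.temp$df.residual)

# Margin of error = critical value x standard error
moe <- crit * se

# 95% confidence interval for the slope
ci <- coef(lm.temp)["Temp"] + c(-1, 1) * moe
ci
```

The result can be cross-checked against confint(lm.temp), which computes the same interval.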
predicted values
- predict data from the minimum to the maximum temperature
new
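The definition of new is cut off in the card. A minimal sketch, assuming new is a data frame of temperatures in steps of 1 and conf is the matrix returned by predict() with a confidence interval (the step size and level are assumptions):

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Assumed: new covers the observed temperature range in steps of 1
new <- data.frame(Temp = seq(min(airquality$Temp), max(airquality$Temp), by = 1))

# Predicted values plus lower and upper limits of the 95% confidence interval;
# the result is a matrix with columns "fit", "lwr" and "upr"
conf <- predict(lm.temp, newdata = new, interval = "confidence", level = 0.95)
head(conf)
```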
plot predicted data with confidence interval
plot(airquality$Temp, airquality$Ozone, xlab = "Temp", ylab = "Ozone")
lines(new$Temp, as.numeric(conf[, "fit"]), lty = 1, lwd = 2) # regression line
lines(new$Temp, conf[, "upr"], lty = 2, col = "blue") # upper limit
lines(new$Temp, conf[, "lwr"], lty = 2, col = "blue") # lower limit
-> can also use the function visreg (from the visreg library)
residuals
- describe how well the regression line fit the data (goodness of fit)
- we aim to minimise the sum of squares of the residuals; the model that does so is the maximum likelihood model (for normally distributed errors)
how to visualise residuals (code)
- easiest on clean data, not containing NAs
airqual. clean
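The code on this card is cut off. A minimal sketch, assuming airqual.clean is the airquality data with the NA rows removed and the residuals are drawn as vertical segments from each point to the fitted line (lm.clean and rss are illustrative names):

```r
data(airquality)

# Assumed: clean data = only complete Ozone/Temp rows
airqual.clean <- na.omit(airquality[, c("Ozone", "Temp")])
lm.clean <- lm(Ozone ~ Temp, data = airqual.clean)

# Observed points and fitted regression line
plot(airqual.clean$Temp, airqual.clean$Ozone, pch = 19, xlab = "Temp", ylab = "Ozone")
abline(lm.clean, lwd = 2)

# Each residual is the vertical distance between an observation and the line
segments(x0 = airqual.clean$Temp, y0 = airqual.clean$Ozone,
         x1 = airqual.clean$Temp, y1 = fitted(lm.clean), col = "red")

# Sum of squares of the residuals, the quantity the fit minimises
rss <- sum(resid(lm.clean)^2)
rss
```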
multiple linear regression
- every predictor has its own β
Y = a + b1*x1 + b2*x2
- modeling ozone as a function of Temp and Wind
Ozone = a + b1*Temp + b2*Wind
-> two predictors often fit better; this can be checked with R^2: the higher it is, the larger the share of the variation in the data explained by the model
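The R^2 comparison can be sketched as follows, assuming the airquality data; lm.multi, r2.temp and r2.multi are illustrative names. Because R^2 never decreases when a predictor is added, the adjusted R^2 is shown as well.

```r
data(airquality)

# One predictor vs. two predictors
lm.temp  <- lm(Ozone ~ Temp, data = airquality)
lm.multi <- lm(Ozone ~ Temp + Wind, data = airquality)

# Share of the variation in Ozone explained by each model
r2.temp  <- summary(lm.temp)$r.squared
r2.multi <- summary(lm.multi)$r.squared
c(Temp = r2.temp, Temp.Wind = r2.multi)

# Adjusted R^2 penalises the extra parameter, so it is the fairer comparison
c(Temp = summary(lm.temp)$adj.r.squared, Temp.Wind = summary(lm.multi)$adj.r.squared)
```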
plot observed versus estimated data points
first make predicted dataset
plot(airquality$Temp, airquality$Ozone) # plot observed data
points(airquality$Temp, Ozone.est, pch = 2, col = "red") # add predicted data to the same plot
legend("topleft", legend = c("observed", "predicted"), pch = c(1, 2), col = c("black", "red"), bty = "n")
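The card does not show how Ozone.est is created. A minimal sketch, assuming it holds the predictions of the two-predictor model (Temp and Wind) for every row of airquality; lm.multi is an illustrative name:

```r
data(airquality)

# Assumed model: Ozone as a function of Temp and Wind
lm.multi <- lm(Ozone ~ Temp + Wind, data = airquality)

# Predict for all rows; Temp and Wind have no missing values, so every row
# gets an estimate, including rows where the observed Ozone is NA
Ozone.est <- predict(lm.multi, newdata = airquality)
head(Ozone.est)
```

Predicting with newdata = airquality keeps Ozone.est the same length as airquality$Temp, which the plotting calls need.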