Section 3/4 - Tutorial R and In-Class Questions R Flashcards
Lecture slide 37 week 4
Question 2
Using the mtcars dataset, use backward selection to choose a model that predicts whether an engine will be v-shaped or straight (vs), based on:
o Weight wt
o Engine size, displacement disp
Question 3
Using your model from question 2 obtain the linear predictor
Question 2 - Backward selection
#########################################################
### Workings for Section 4 - model selection ###
#########################################################
Uses mtcars data set
summary(mtcars)
####################################################################################################
####################################################################################################
Start with all covariates
modelA <- glm(vs ~ wt * disp, data=mtcars, family=binomial)
summary(modelA)
None of these factors is significant
# So remove interaction term which is least significant
modelB <- update(modelA, ~.-wt:disp)
summary(modelB)
AIC has fallen from 29.361 to 27.4
The wt term is not significant, so try removing that
modelC <- update(modelB, ~.-wt)
summary(modelC)
Both coefficients are significant and the AIC has fallen from 27.4 to 26.696
We could also use anova to consider the p-values
anova(modelB, modelC, test=”Chi”)
There is no significant difference between the models (p-value 0.255) and so it is correct to remove wt
addldata <- data.frame(disp=180)
predict(modelC, addldata, type=”response”)
linear predictor is: 4.14 - 0.022xdisp
Tutorial
Q7 This question uses the mtcars data set in R.
i) Using R, formulate a GLM to predict the probability of a car having a manual gearbox, including the number of cylinders and gross horsepower as main effects.
Hint 1: use ?mtcars to get a description of the data in R Studio.
Hint2: choose an appropriate distribution for the probability of having a manual gearbox.
State the linear predictor, and state the parameters.
ii) Using your model from part (i), predict the probability of an engine with 8 cylinders and horsepower of 200 having a manual engine
note that canonical link will be used if link is blank, same result is given, data= arg also not needed
Q7 i) See R file for glm.
Linear predictor: 𝜂 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 where 𝑥1 = 𝑐𝑦𝑙𝑖𝑛𝑑𝑒𝑟𝑠 and 𝑥2 =
ℎ𝑜𝑟𝑠𝑒𝑝𝑜𝑤𝑒𝑟
Parameters: 𝛽0 = 5.83220, 𝛽1 = −1.70306, 𝛽2 = 0.02775
ii) Probability = 9.6%
Rfile
###########################################################################
### Question 7 ###
###########################################################################
(i) Write model
Use binomial distribution as we’re looking for a %
gearboxmodel <- glm(am ~ cyl + hp, data=mtcars, family=binomial(link=logit))
gearboxmodel
summary(gearboxmodel)
(ii) predict
newdata <- data.frame(hp=200, cyl=8)
predict(gearboxmodel, newdata, type=”response”)
Tutorial
Q8 This question uses the CO2 data in R Studio (note, not the co2 dataset!), which examines the carbon dioxide uptake in grass plants. We will consider the impact of the origin of the plant (Quebec or Mississippi), overnight chilling and ambient carbon dioxide levels on the uptake.
i) Assuming that the uptake measurements are gamma distributed, fit a generalised linear model with uptake as the response variable and Type (𝛽𝑖), Treatment (𝛾𝑗) and conc (𝑥) as explanatory variables:
𝑦 = 𝛼 + 𝛽𝑖 + 𝛾𝑗 + 𝛿𝑥
Obtain the coefficients, and explain what has happened to the coefficients for TypeQuebec and TreatmentNonChilled.
ii) Test the hypothesis 𝐻0: 𝛿 = 0 vs 𝐻1: 𝛿 ≠ 0.
iii) Find a 99% confidence interval for 𝛿, and use it to test at the 1% level whether 𝛿 = −1.6 × 10−5.
iv) Obtain the value of the linear predictor for uptake for a plant in Quebec which is chilled overnight, with an ambient CO2 level of 420 mL/L.
v) Calculate the deviance residuals for this model, and explain why we don’t consider the Pearson residuals.
vi) Obtain a plot of the residuals against the fitted values. Comment on the constancy of the variance of the residuals and whether a normal model is appropriate.
vii) Define the AIC, and obtain the AIC for this model.
viii) Create a new model, which doesn’t include Type as an explanatory variable. Use the AIC to compare the two models
Could calculate both deviance and pearson residuals to confirm this
Q8 i) Model fit in R file.
Coefficients:
𝛼 = 3.598 × 10−2, 𝛽𝑀𝑖𝑠𝑠𝑖𝑠𝑠𝑖𝑝𝑝𝑖 = 1.727 × 10−2, 𝛾𝐶ℎ𝑖𝑙𝑙𝑒𝑑 = 8.599 × 10−3, 𝛿
= −1.957 × 10^ − 5
ii) Either: p-value is less than 5% so we reject 𝐻0, or, confidence interval doesn’t include 0 so we reject.
iii) 99% confidence interval contains 1.6 × 10−5 and so we don’t reject the hypothesis
iv) Using R file, answer is 0.03635705
Exercise – check by hand
v) Residuals in R file
Don’t consider Pearson residuals as they are often skewed for non-normally distributed data (we are using the gamma distribution), making it more difficult to interpret residual plots.
Deviance residuals are more likely to be symmetrically distributed and to have approximately normal distributions.
vi)
There does appear to be something of a pattern, and more residuals lie above the line than below. However there are no clear outliers.
vii) The AIC is: 𝑑𝑒𝑣𝑖𝑎𝑛𝑐𝑒 + 2 × 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠
AIC for this model (from summary() ): 583.63
viii) Updated model has AIC 622.28. This has increased from the previous model and so we should keep Type as an explanatory variable
##########################################################################
### Question 8 ###
###########################################################################
(i) fit GLM
uptakemodel <- glm(uptake ~ conc + Type + Treatment, family=Gamma, data=CO2)
uptakemodel
summary(uptakemodel)
(ii) test delta =0 hypothesis (or use summary)
confint(uptakemodel,”conc”,level=0.95)
(iii) test whether delta 1.6e-5
confint(uptakemodel,”conc”,level=0.99)
(iv) obtain linear predictor for plant described in question
addldata <- data.frame(Type=”Quebec”, Treatment=”chilled”, conc=420)
predict(uptakemodel,addldata) #no “response” type as we need the raw log odd i.e. the linear predictor
(v) residuals
residuals(uptakemodel)
R automatically calculates deviance residuals.
# (Not required)
residuals(uptakemodel, type=”pearson”)
residuals(uptakemodel, type=”deviance”)
(vi) residuals plot
plot(uptakemodel,1)
(viii) updated model with Type removed
uptakemodel1 <- glm(uptake ~ conc + Treatment, family=Gamma, data=CO2)
summary(uptakemodel1)
Tutorial - exam standard question - 2019/20 computer based assessment Q3 (continued from Chapter 2 tutorial parts 1-5)
Q9 An engineer is analysing the heating and cooling load of different
buildings. The data file BuildingData.csv includes the heating and
cooling loads for 768 buildings, as well as the following data for each
building:
* Relative Compactness (Compactness)
* Surface Area (SurfaceArea)
* Wall Area (WallArea)
* Roof Area (RoofArea)
* Overall Height (OverallHeight)
* Orientation (Orientation)
* Glazing Area (GlazArea)
* Glazing Area Distribution (GlazAreaDist)
(i) Fit a linear model to the data with the cooling load as the
response variable, and relative compactness, surface area,
wall area, overall height and glazing area as explanatory
variables. (6 Marks)
(ii) State the formula of the model fitted in part (i), clearly
explaining all terms you use. (3 Marks)
(iii) State the “Adjusted 𝑅2” for this model and comment on the
fit of the model to the data. (3 Marks)
(iv) Produce a plot (you must include the plot in your Word
document) to analyse whether the data is normally
distributed, and comment. (5 Marks)
(v) Adjust your model in (i) to allow for an interaction term
between surface area and wall area and compare the
suitability of the models in light of your answer in (iii). (4 Marks)
(vi) Reformulate your model from (i) in R using the Generalised
Linear Models framework, and state the Akaike’s
Information Criterion (AIC) for the model. (4 Marks)
(vii) Fit a generalised linear model to the data using a Gamma
distribution, with the cooling load as the response variable,
and relative compactness, surface area, overall height and
glazing area as explanatory variables. (2 Marks)
(viii) State the formula for the model fitted in (vii), clearly
explaining all terms you use, and state the AIC of this
model. (5 Marks)
(ix) Compare the fit of the models in (i) and (vii), considering
both the AIC and significance of the parameters. Select one
model for use in the rest of this question, and justify your
selection. (6 Marks)
(x) Using the model selected in (ix), predict the cooling load for
an individual building with the following traits:
* Relative Compactness: 0.73
* Surface Area: 690
* Wall Area: 275
* Overall Height: 7
* Glazing Area: 0.15 (3 Marks)
Q9 - Regression
(i) See R
(ii) 𝑌 = 97.761848 − 70.787707𝑥1 − 0.088245𝑥2 + 0.44682𝑥3 + 4.283843𝑥4 +
14.817971𝑥5
Where 𝑌 is cooling load, 𝑥1 is the variable for compactness, 𝑥2 surface area, 𝑥3 wall
area, 𝑥4 height, 𝑥5 glazing area.
(iii) 0.8868 – this is quite high and implies that the model captures most of the behaviour of the cooling load
(iv)
Normal assumption sensible in the middle of the data set, but breaks down at the tails, especially at the higher end.
(v) Adjusted 𝑅2 has increased to 0.8989 and so the fit is improved.
(vi) AIC: 3974.3
(vii) See R
(viii) Gamma GLM:
𝜂̂ = 1
𝜇 = 0.315 − 0.09399𝑥1 − 0.0001871𝑥2 − 0.01251𝑥3 − 0.02194𝑥4
Where 𝑥1 is the variable for compactness, 𝑥2 surface area, 𝑥3 height, 𝑥4 glazing area.
AIC: 3709.6
(ix) All parameters are significantly different from 0 in both models.
The AIC is lower in the gamma model, and so we select it for use in the rest of the question.
(x) 37.8955
#########################################################################
############## Regression question ###################
#########################################################################
data <- read.csv(“BuildingData.csv”, header=TRUE)
compact <- data[,1]
surface <- data[,2]
wall <- data[,3]
roof <- data[,4]
height <- data[,5]
orientation <- data[,6]
glazarea <- data[,7]
glazdist <- data[,8]
heatingload <- data[,9]
coolingload <- data[,10]
(i)
model <- lm(coolingload ~ compact+surface+wall+height+glazarea)
summary(model)
(iv)
plot(model,2)
(v)
modelinteract <- lm(coolingload ~ compact+surface*wall+height+glazarea)
summary(modelinteract)
(vi)
modelg <- glm(coolingload ~ compact+surface+wall+height+glazarea)
summary(modelg)
(vii)
gammamodel <- glm(coolingload ~ compact+surface+height+glazarea, family=Gamma)
summary(gammamodel)
(x)
newbuilddata <- data.frame(compact=0.73, surface=690, wall=275, height=7, glazarea=0.15)
predict.glm(gammamodel, newbuilddata, type=”response”)