Section 2 - Tutorial R and In-Class Questions R Flashcards

Question 1

Q

Tutorial 2 Q2 (exam standard)

Q2 This question uses the airquality data set in R, which records the following data
each day from 1 May 1973 to 30 September 1973:
* Ozone – mean ozone in parts per billion
* Solar.R – solar radiation in Langleys
* Wind – average wind speed in miles per hour
* Temp – average temperature in degrees Fahrenheit
i. Calculate the mean ozone level, solar radiation, wind speed and
temperature over the whole period. [2]
ii. Calculate the mean ozone level, solar radiation, wind speed and temperature for the month of July. [4]
iii. Produce a scatter plot of the data for the entire period. [3]
iv. Obtain the intercept and slope parameters for the linear regression
model of the form $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$, where the response variable $y$ is Ozone and the explanatory variables $x_i$ are:
$x_1$: Solar radiation
$x_2$: Wind speed
$x_3$: Temperature
$x_4$: Month
[6]
v. Obtain the total sum of squares, residual sum of squares and regression sum of squares for the above model. [6]
vi. Obtain the coefficient of determination and adjusted coefficient of determination for the above model. Explain the difference. [7]
vii. Use an F-test to evaluate the model. [8]
viii. Predict the Ozone level for a day in August with solar radiation of 271
Langleys, wind speed of 12 miles per hour and temperature 91 degrees
Fahrenheit, and calculate the 95% confidence interval. [6]
ix. Test at the 5% level whether $\beta_3$ is different to 1.75. [6]
x. Obtain a Q-Q plot of the residuals, and comment on the normality
assumption. [8]
xi. Update the model to remove any explanatory variables not
significantly different from zero at the 1% level. Compare the adjusted
R-squared value with that for the initial model. [10]
xii. It has been suggested that this model could be used as a general model
for pollution. Comment on this suggestion. [9]
[Total 75]

Answer

A


attach(airquality)

(i) summarise all data

summary(airquality)

(ii) summarise data for July

July<-airquality[Month==7,1:6]

summary(July)
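summary() reports the means alongside the other summary statistics. If only the means are wanted, a minimal sketch is given below. It assumes missing values (Ozone and Solar.R contain NAs) should be dropped column by column, as summary() does; the names vars, means_all and means_july are illustrative and not part of the original solution.

```r
# Columns asked for in parts (i) and (ii)
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# (i) means over the whole period, dropping NAs within each column
means_all <- colMeans(airquality[, vars], na.rm = TRUE)

# (ii) means for July only (Month == 7), without relying on attach()
means_july <- colMeans(airquality[airquality$Month == 7, vars], na.rm = TRUE)

print(means_all)
print(means_july)
```

Indexing with airquality$Month rather than the bare name Month keeps the code working even if attach(airquality) has not been run.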

(iii) scatterplot

plot(airquality)

(iv) fit model

modelAQ <- lm(Ozone~Solar.R+Wind+Temp+Month,data=airquality)
modelAQ

summary(modelAQ)

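(v) sum of squares
anova(modelAQ)

# total sum of squares: sum of every entry in the "Sum Sq" column
sum(anova(modelAQ)[,2])

# regression sum of squares: total minus residual
sum(anova(modelAQ)[,2])-anova(modelAQ)[5,2]

# residual sum of squares: the Residuals row (row 5)
anova(modelAQ)[5,2]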
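As a cross-check on the anova() table, the three sums of squares can also be computed from their definitions. This is a sketch rather than the model answer: the names y, ss_tot, ss_res and ss_reg are illustrative, and the response is taken from the model frame because lm() drops the rows with missing values.

```r
modelAQ <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

# Observed responses actually used in the fit (rows with NAs are dropped by lm)
y <- model.response(model.frame(modelAQ))

ss_tot <- sum((y - mean(y))^2)        # total sum of squares
ss_res <- sum(residuals(modelAQ)^2)   # residual sum of squares
ss_reg <- ss_tot - ss_res             # regression sum of squares

print(c(SS_TOT = ss_tot, SS_REG = ss_reg, SS_RES = ss_res))
```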

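(vi) from summary above. R squared never decreases when an explanatory variable is added, so judged on R squared alone a larger model always seems at least as accurate. Adjusted R squared takes account of the number of predictors: it penalises each extra variable, so it can fall when a variable adds little to the fit.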
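The adjustment can be made explicit with the standard formula, adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), where n is the number of observations used and k the number of explanatory variables. The sketch below applies it to the fitted model; n, k, r2 and adj_r2 are illustrative names.

```r
modelAQ <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

n  <- nobs(modelAQ)                 # observations used in the fit
k  <- length(coef(modelAQ)) - 1     # explanatory variables (intercept excluded)
r2 <- summary(modelAQ)$r.squared

# Penalise R squared for the number of explanatory variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(R2 = r2, adjusted_R2 = adj_r2))
```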

(vii) from summary above, F-statistic 43.21 on 4 and 106 degrees of freedom, p-value < 2.2e-16, so reject the null hypothesis that all slope parameters are zero and conclude at least one of them is non-zero
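The F-statistic quoted in the summary can be reproduced from the sums of squares, which may help in showing working. A minimal sketch, with illustrative variable names:

```r
modelAQ <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

y      <- model.response(model.frame(modelAQ))
ss_tot <- sum((y - mean(y))^2)
ss_res <- sum(residuals(modelAQ)^2)
ss_reg <- ss_tot - ss_res

k      <- length(coef(modelAQ)) - 1   # numerator degrees of freedom
df_res <- df.residual(modelAQ)        # denominator degrees of freedom

# F = (regression mean square) / (residual mean square)
f_stat  <- (ss_reg / k) / (ss_res / df_res)
p_value <- pf(f_stat, k, df_res, lower.tail = FALSE)

print(c(F = f_stat, df1 = k, df2 = df_res, p = p_value))
```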

(viii) predict ozone level

AugustData <- data.frame(Solar.R=271,Wind=12,Temp=91.0,Month=8)

predict(modelAQ,AugustData,interval="prediction",level=0.95)
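predict() offers two kinds of interval: "confidence" for the mean response at the given inputs, and "prediction" for a single new observation. The sketch below computes both so they can be compared; conf_int and pred_int are illustrative names, and which interval the examiner intends is a matter of reading the question.

```r
modelAQ    <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
AugustData <- data.frame(Solar.R = 271, Wind = 12, Temp = 91, Month = 8)

# Interval for the mean Ozone level on days with these conditions
conf_int <- predict(modelAQ, AugustData, interval = "confidence", level = 0.95)

# Interval for the Ozone level on one individual day with these conditions
pred_int <- predict(modelAQ, AugustData, interval = "prediction", level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both intervals share the same fitted value; the prediction interval is wider because it also allows for the residual variation of an individual day.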

(ix) confidence interval beta_3

confint(modelAQ,level=0.95)

95% confidence interval for beta_3 is (1.328,2.413). This contains 1.75, so at the 5% level we cannot conclude that beta_3 is different to 1.75.
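The same conclusion can be reached with an explicit t-test of H0: beta_3 = 1.75 against a two-sided alternative. This is a sketch of that equivalent route, not the method used in the answer above; est, se, t_stat and p_value are illustrative names.

```r
modelAQ <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

coefs <- coef(summary(modelAQ))
est   <- coefs["Temp", "Estimate"]
se    <- coefs["Temp", "Std. Error"]

# Test statistic for H0: beta_3 = 1.75
t_stat  <- (est - 1.75) / se
p_value <- 2 * pt(abs(t_stat), df.residual(modelAQ), lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```

A p-value above 0.05 corresponds exactly to 1.75 lying inside the 95% confidence interval.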

(x) Q-Q plot

plot(modelAQ,2)

Normality assumption only appears to hold in the middle of the data set; the tails are fatter than a normal distribution would give. There are not many large outliers, however.
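The Q-Q plot can also be built by hand from the standardised residuals, and a formal test can supplement the visual judgement. The Shapiro-Wilk test here is an addition for illustration and is not part of the original answer; res and sw are illustrative names.

```r
modelAQ <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

# Standardised residuals, the quantity plotted by plot(modelAQ, 2)
res <- rstandard(modelAQ)

qqnorm(res)   # sample quantiles against theoretical normal quantiles
qqline(res)   # reference line through the quartiles

# Formal test of normality; a small p-value is evidence against normality
sw <- shapiro.test(res)
print(sw)
```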

(xi) adjust model

remove Month as not significant at 1% level

modelAQ1 <- lm(Ozone~Solar.R+Wind+Temp,data=airquality)
modelAQ1

summary(modelAQ1)

Adjusted R-squared: 0.5948
Adjusted R-squared has fallen slightly as Month is significant at the 5% level.
Temperature and wind speed significant at 1% level – these will correlate with month, and so significance of month may be explained by weather conditions already modelled. May need to examine interaction terms
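The reduced model can also be obtained with update(), and the two nested models compared directly with a partial F-test. A sketch under the assumption that both models are fitted to the same rows (Month has no missing values in airquality, so they are); modelAQ1_alt and cmp are illustrative names.

```r
modelAQ  <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
modelAQ1 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Equivalent way to drop Month from the existing fit
modelAQ1_alt <- update(modelAQ, . ~ . - Month)

# Partial F-test: does adding Month significantly reduce the residual SS?
cmp <- anova(modelAQ1, modelAQ)
print(cmp)

# Adjusted R-squared for each model, side by side
print(c(initial = summary(modelAQ)$adj.r.squared,
        reduced = summary(modelAQ1)$adj.r.squared))
```

With a single variable removed, the p-value of this F-test equals the p-value of the t-test on Month in summary(modelAQ).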

(xii)
R-sq and adjusted R-sq around 60%, so model explains about 60% of
pollution due to ozone in the data set.
* Only for summer months – winter months will have very different conditions
* NY very urban, model may not work in rural or heavy industrial areas
* Data from 1973, more recent data may give different results
* Only looks at ozone as measure of pollution – not heavy particles from diesel etc.
3 marks for any valid observation

Question 2

Q

Exam standard question - 2019/20 computer based assessment Q3

Q3. An engineer is analysing the heating and cooling load of different
buildings. The data file BuildingData.csv includes the heating and cooling loads for 768 buildings, as well as the following data for each building:
* Relative Compactness (Compactness)
* Surface Area (SurfaceArea)
* Wall Area (WallArea)
* Roof Area (RoofArea)
* Overall Height (OverallHeight)
* Orientation (Orientation)
* Glazing Area (GlazArea)
* Glazing Area Distribution (GlazAreaDist)
(i) Fit a linear model to the data with the cooling load as the response variable, and relative compactness, surface area,
wall area, overall height and glazing area as explanatory variables. (6 Marks)
(ii) State the formula of the model fitted in part (i), clearly explaining all terms you use. (3 Marks)
(iii) State the “Adjusted $R^2$” for this model and comment on the fit of the model to the data. (3 Marks)
(iv) Produce a plot (you must include the plot in your Word document) to analyse whether the data is normally distributed, and comment. (5 Marks)
(v) Adjust your model in (i) to allow for an interaction term between surface area and wall area and compare the suitability of the models in light of your answer in (iii). (4 Marks)
To be completed following generalised linear models lectures

Answer

A

#########################################################################
############## Regression question ###################
#########################################################################

data <- read.csv("BuildingData.csv", header=TRUE)

compact <- data[,1]
surface <- data[,2]
wall <- data[,3]
roof <- data[,4]
height <- data[,5]
orientation <- data[,6]
glazarea <- data[,7]
glazdist <- data[,8]
heatingload <- data[,9]
coolingload <- data[,10]

(i)

model <- lm(coolingload ~ compact+surface+wall+height+glazarea)
summary(model)

(ii)

$Y = 97.761848 - 70.787707 x_1 - 0.088245 x_2 + 0.044682 x_3 + 4.283843 x_4 + 14.817971 x_5$

Where $Y$ is cooling load, $x_1$ is the variable for compactness, $x_2$ surface
area, $x_3$ wall area, $x_4$ height, $x_5$ glazing area

(iii)

0.8868 – this is quite high and implies that the model captures most of
the behaviour of the cooling load

(iv)

plot(model,2)
Normal assumption sensible in the middle of the data set, but breaks
down at the tails, especially at the higher end

(v)

modelinteract <- lm(coolingload ~ compact+surface*wall+height+glazarea)
summary(modelinteract)

Adjusted $R^2$ has increased to 0.8989 and so the fit is improved
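In an R formula, surface*wall is shorthand for both main effects plus their interaction, surface + wall + surface:wall. Since BuildingData.csv is not available here, the sketch below checks this on simulated data; the data frame sim and every number in it are made up for illustration only.

```r
set.seed(1)

# Hypothetical stand-in for the building data (values are simulated)
sim <- data.frame(surface = runif(50, 500, 800),
                  wall    = runif(50, 250, 400))
sim$coolingload <- 10 + 0.02 * sim$surface + 0.03 * sim$wall +
  0.0001 * sim$surface * sim$wall + rnorm(50)

# The * operator expands to main effects plus the interaction term
m_star <- lm(coolingload ~ surface * wall, data = sim)
m_long <- lm(coolingload ~ surface + wall + surface:wall, data = sim)

print(names(coef(m_star)))
print(all.equal(coef(m_star), coef(m_long)))
```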

Question 3

Q

FIN3026 – Actuarial Econometrics and Data Science
Section 2 Tutorial – In Class Question
Q1 This question uses the Boston Housing data set from the MASS package, Boston.
To load the package and the data set for the tutorial, use the following code:
library(MASS)
i) Get a feel for the data by:
a) Reading the help file
b) Summarising the data using summary()
c) Printing the first few lines of data using head()
ii) Estimate a simple linear regression model with the response variable
equal to the median house value of a district (medv) and one explanatory variable which is the percentage of households with low socioeconomic status (lstat).
iii) State the equation of the fitted line of regression.
iv) Comment on the fit of this model.
v) Create a new linear model by adding two explanatory variables to your model in part (ii):
* age – average age of buildings
* crim – per-capita crime
vi) Compare your models from parts (ii) and (v).
vii) Create a new linear model by adding the explanatory variable rm
(average number of rooms per property) to your model from part (v).
viii) Is your latest model an improvement?
ix) Can you improve the model any further?

Answer

A

#----------------------------------------------------------------------------------------

######################################
### Section 2 Tutorial - Solutions ###
######################################

# In Class Question

### part (i)
#----------------------------------------------------------------------------------------

# install.packages("MASS") if first time using package
# If issues, ensure R is updated

library(MASS)

# a) help file

?Boston

# b) summarise

summary(Boston)

# c) print first few lines

head(Boston)

### part (ii)
#----------------------------------------------------------------------------------------

model1 <- lm(medv ~ lstat, data = Boston)

summary(model1)

### part (iii)
#----------------------------------------------------------------------------------------

# medv = 34.55 - 0.95 lstat + u
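The fitted equation can be read off the summary, or the coefficients can be pulled out with coef() and rounded. A minimal sketch, assuming the MASS package is installed; b is an illustrative name.

```r
library(MASS)

model1 <- lm(medv ~ lstat, data = Boston)

# Named vector: intercept first, then the slope on lstat
b <- coef(model1)

# Write out the fitted line with coefficients rounded to 2 decimal places
cat(sprintf("medv = %.2f %+.2f * lstat\n", b[["(Intercept)"]], b[["lstat"]]))
```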

### part (iv)
#----------------------------------------------------------------------------------------

# From summary output:
# - R-sq is 0.544, so over half of the behaviour is explained
# - The coefficient on lstat is significantly different to 0 so there is a relationship
# - Coefficient on lstat is negative,
# so as the percentage of deprivation increases, house prices fall

### part (v)
#----------------------------------------------------------------------------------------

model2 <- lm(medv ~ lstat + age + crim, data = Boston)

summary(model2)

### part (vi)
#----------------------------------------------------------------------------------------

# Adj R-sq of model 2 is higher (0.5533 vs 0.5432), indicating a better fit
# All parameters in model 2 are significantly different from 0 at the 5% level
# Model 2 is an improvement on Model 1
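Rather than reading the adjusted R-squared values off two separate summaries, they can be collected side by side, and the nested models compared with a partial F-test. A sketch assuming the MASS package is installed; adj_r2 is an illustrative name, and the F-test is an addition to the original answer.

```r
library(MASS)

model1 <- lm(medv ~ lstat, data = Boston)
model2 <- lm(medv ~ lstat + age + crim, data = Boston)

# Adjusted R-squared of each model in one named vector
adj_r2 <- sapply(list(model1 = model1, model2 = model2),
                 function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test: do age and crim jointly improve on model1?
print(anova(model1, model2))
```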

### part (vii)
#----------------------------------------------------------------------------------------

model3 <- lm(medv ~ lstat + age + crim + rm, data = Boston)

summary(model3)

### part (viii)
#----------------------------------------------------------------------------------------

# Adj R-sq has increased (0.5533 to 0.6439)
# However, age is no longer significant, so we should remove it from the model

### part (ix)
#----------------------------------------------------------------------------------------

# Remove age

model4 <- lm(medv ~ lstat + crim + rm, data = Boston)

summary(model4)

# Adj R-Sq slightly down but all parameters now significant

# Plenty of other options available, e.g. add pupil-teacher ratio

model5 <- lm(medv ~ lstat + crim + rm + ptratio, data = Boston)

summary(model5)

# Adj R-sq has increased to 0.6789
# And all parameters significant
# Coefficients behave as expected

plot(model5,1)

# plot of residuals - slight trend in behaviour at edges
# But in the middle residuals are centred on zero with no pattern

plot(model5,2)

# Most points following line
# But clearly some behaviour unexplained
