Section 6 - Tutorial R and In-Class Questions R Flashcards
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
1 Find all flights that:
i) Arrived more than two hours late, but didn’t leave late.
ii) Were delayed by at least an hour, but made up over 30 minutes in
flight.
iii) Departed between midnight and 6am inclusive.
Question 1
(i)
filter(flights, arr_delay >= 120 & dep_delay <= 0)
(ii)
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
(iii)
We need to be sure we include midnight correctly.
# To do this we get a summary of the data and note the min and max for dep_time
# We could also use summary(flights$dep_time)
summary(flights)
filter(flights, dep_time ==2400 | dep_time <= 0600)
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
Q2 i) Find the three most delayed flights.
ii) Find the three flights that left earliest.
Question 2
(i)
head(arrange(flights, desc(dep_delay)))
(ii)
head(arrange(flights, dep_delay))
Q2 i) Most delayed flights:
1. HA 51 on 9 January 2013: left 1301 minutes late
2. MQ 3535 on 15 June 2013: left 1137 minutes late
3. MQ 3659 on 10 January 2013: left 1126 minutes late
ii) Fights that left earliest:
1. B6 97 on 7 December 2013: left 43 minutes early
2. DL 1715 on 3 February 2013: left 33 minutes early
3. EV 5713 on 10 November 2013: left 32 minutes early
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
Q3 Find the five fastest flights, and state their carrier, flight number, destination, departure date and speed.
Question 3
fastflights <- mutate(flights, speed = distance / air_time *60)
fastflights <- select(fastflights, speed, carrier, flight, tailnum, origin, dest, year, month, day)
head(arrange(fastflights, desc(speed)))
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
Q4 i) Currently, dep_time and arr_time are convenient to look at, but hard to compute with because they’re not continuous numbers. Convert them into a more convenient representation of the number of minutes since midnight.
ii) Compare air_time and arr_time – dep_time. Are they equal? Suggest reasons for your findings.
Q4 i) See R file.
ii) First count the number of non-zero differences: 327,150 (this will also have removed missing values).
Then consider the reasons for these non-zero values.
Some may be because the flight occurred over midnight, this would give a difference of 1440.
Some will also be because the flights flew to different time zones. The data is
for flights leaving New York, so all destinations will be in time zones the same or behind this. The longest time difference is 5 hours (to Hawaii).
Removing differences of 60, 120, 180, 240, 300 and 1440 minutes, leaves 326,977 flights.
This is because air_time is the amount of time spent in the air, whereas dep_time and arr_time are the times the flights left/arrived at the gates. So the differences are due to taxiing time etc.
Question 4
(i)
Times are shown as HHMM, so to get the hours, divide by 100 using %/% and multiply by 60
# Then add on the minutes using %%
# This will give 1440 for midnight rather than 0 (because midnight is recorded as 2400 not 0000)
# So if we then use %% 1440, this will give 0 for midnight and not change the answer for other times
flights_times <- mutate(flights,
dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
arr_time_mins = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440)
view only relevant columns
select(
flights_times, dep_time, dep_time_mins, arr_time,
arr_time_mins, air_time)
(ii)
create airtimediff to show the difference
flights_airtime <- mutate(flights_times, airtimediff = air_time - arr_time_mins+dep_time_mins)
summary(flights_airtime$airtimediff)
count number of non-zero differences
nrow(filter(flights_airtime, airtimediff !=0))
filter out flights that pass midnight (where the difference should be 1440)
# filter out flights where the difference is due to crossing a timezone
# (i.e. difference is 1, 2, 3, 4 or 5 hours)
nrow(filter(flights_airtime, airtimediff != 0)) -
nrow(filter(flights_airtime, airtimediff == 1440 | airtimediff == 60 |airtimediff == 120 |
airtimediff == 180 |airtimediff == 240|airtimediff == 300))
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
Q5 Which carrier has the worst delays?
Question 5
flights %>%
group_by(carrier) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(arr_delay))
Carrier is F9. This is Frontier Airlines.
Questions 1 to 6 use the same flights data from the nycflights13 package as the lecture
Q6 i) Which plane (tailnum) has the worst on-time record?
ii) Does the answer change if you include only planes that flew 20 or more times in a year?
Question 6
flights %>%
group_by(tailnum) %>%
summarise(arr_delay = mean(arr_delay), n=n()) %>%
filter(min_rank(desc(arr_delay))==1)
flights %>%
group_by(tailnum) %>%
summarise(arr_delay = mean(arr_delay), n=n()) %>%
filter(n >= 20) %>%
filter(min_rank(desc(arr_delay))==1)
6 i) Tail number N844MH.
ii) Yes, it is now N203FR (41 flights)
Q7 Consider TableA.csv saved on Canvas. Convert this to tidy data, where each observation is an insurance company and type of policy
Question 7
tableA <- read_csv(“TableA.csv”)
tableA %>%
gather(House, Car, key = “Type”, value = “Number Claims”)
Q8 i) Consider TableB.csv saved on Canvas. Split the claim and payment data into two columns.
ii) Add a column to calculate the average payment per claim.
Hint: What form is your split claim and payment data in? How can you tell?
To make sure it’s numerical, include convert=TRUE in part (i).
Question 8
tableB <- read_csv(“TableB.csv”)
tableBtidy <- separate(tableB, ClaimPayment, into = c(“Claim”, “Payment”), convert=TRUE)
tableBtidy
tableBcalc <- mutate(tableBtidy, AvgPayment = Payment / NumClaims)
tableBcalc
Exam standard question – 2019/20 Computer Based Assessment Q1
Q9 This question uses the airquality data set stored within R
Studio, which records the following data each day from 1 May 1973
to 30 September 1973:
* Ozone – mean ozone in parts per billion
* Solar.R – solar radiation in Langleys
* Wind – average wind speed in miles per hour
* Temp – average temperature in degrees Fahrenheit
(i) Construct a scatterplot of Solar.R and Ozone, and explain your choice of axis for each variable. (7 Marks)
(ii) Select just the data entries where Temp is greater than or equal to 75. (4 Marks)
(iii) Using the data selected in (ii), construct a scatterplot of Solar.R and Ozone, with the same choice of axes as in (i). (4 Marks)
(iv) Count the number of rows in airquality, and the number of missing values in Ozone. Then do the same for the data
selected in (ii) and compare the answers. (6 Marks)
(Total 21 Marks)
Q9 (i)
By reading the help file for the data, we can see Ozone is the response variable as it is
the measure of air quality. Solar radiation the explanatory variable, and so we put
ozone on the y axis and solar radiation on the x axis.
(ii) See R file
(iii)
(iv) Full data: 153 rows, 37 missing ozone values
Selected data: 105 rows, 29 missing ozone values
Both data sets have a high proportion of missing values – 24% for the full data set,
and 28% for the high temperature data set.
This may imply that the ozone data is not very reliable.
####################
(i)
plot(airquality$Solar.R, airquality$Ozone,
xlab = “Solar radiation”, ylab=”Ozone”,
main=”Solar radation and ozone levels for New York, May-Sept 1973”)
(ii)
hightemp <- filter(airquality, Temp >=75)
(iii)
plot(hightemp$Solar.R, hightemp$Ozone,
xlab = “Solar radiation”, ylab=”Ozone”,
main=”Solar radation and ozone levels for days over 75 degrees F”)
(iv)
nrow(airquality)
sum(is.na(airquality$Ozone))
nrow(hightemp)
sum(is.na(hightemp$Ozone))