Section 5 - Tutorial R and In-Class Questions R Flashcards

Question 1

Q

FIN3026 – Actuarial Econometrics and Data Science
Section 5 Tutorial
Q1 i) Using the CO2 data set within R, construct a scatterplot of CO2 concentration and uptake. Explain your choice of axis for each variable.
ii) Amend your plot so that each treatment is displayed as a different colour or point character.
iii) Discuss whether the plot in part (ii) allows you to draw any additional conclusions about the data than the plot in part (i).
iv) Amend your plot from (i) so that each type of plant is displayed as a different colour or point character.
v) Discuss whether the plot in part (iv) allows you to draw any additional conclusions about the data than the plot in part (i).
vi) Amend your plot from (ii) to show a different subplot for each plant.
vii) Amend your plot from (iv) to show a different subplot for each treatment.
viii) Discuss the usefulness of the plots in (vi) and (vii).

Answer

A

Q1(i)
By reading the help file for the data, we can see that CO2 uptake is the response
variable, and so we plot uptake on the y-axis and concentration on the x-axis.
(ii) [Plot: uptake against concentration, coloured by treatment; code in the R file below.]
(iii) The plot allows us to make a quick rough assessment as to whether uptake is
influenced by treatment.
All of the lowest uptake points are for plants that were chilled overnight, and all
of the highest uptake points are for plants that weren’t chilled.
So there does appear to be some influence.
However, there is a lot of overlap, so while the relationship would be worth
investigating, there is no overwhelming evidence from the plot alone.
(iv) [Plot: uptake against concentration, coloured by plant type; code in the R file below.]
(v) Here, it is clear that the uptake is higher for the plants which originated in
Quebec, with very little overlap between the plant types.
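(vi) [Plot: one subplot per plant, coloured by treatment; code in the R file below.]
(vii) [Plot: one subplot per treatment, coloured by plant type; code in the R file below.]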
(viii) The usefulness of each plot depends on the question we are trying to answer.
The plot in (vi) allows us to see the shape of the curve more clearly, and that the shape
is broadly consistent across all plants.
We can also see that the effect of the treatment seems more pronounced on the
Mississippi plants.
The plot in (vii) clearly demonstrates this bigger impact.

Rfile:
### Q1

library(tidyverse)  # loads ggplot2; CO2 ships with base R

# (i)

ggplot(data = CO2) +
  geom_point(mapping = aes(x = conc, y = uptake))

# (ii)

ggplot(data = CO2) +
  geom_point(mapping = aes(x = conc, y = uptake, colour = Treatment))

# (iv)

ggplot(data = CO2) +
  geom_point(mapping = aes(x = conc, y = uptake, colour = Type))

# (vi)

ggplot(data = CO2) +
  geom_point(mapping = aes(x = conc, y = uptake, colour = Treatment)) +
  facet_wrap(~ Plant)

# (vii)

ggplot(data = CO2) +
  geom_point(mapping = aes(x = conc, y = uptake, colour = Type)) +
  facet_wrap(~ Treatment)
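
The visual impressions in (iii) and (v) can be checked against simple numeric summaries. The following is a minimal sketch in base R and is not part of the original R file; the object names treatment_range and type_mean are illustrative.

```r
# CO2 is in the datasets package that ships with R, so no extra packages are needed.

# Minimum and maximum uptake within each treatment (supports part iii):
# one column per treatment, rows are the min and max.
treatment_range <- sapply(split(CO2$uptake, CO2$Treatment), range)
rownames(treatment_range) <- c("min", "max")
print(treatment_range)

# Mean uptake for each plant origin (supports part v).
type_mean <- sapply(split(CO2$uptake, CO2$Type), mean)
print(round(type_mean, 1))
```

If the ranges overlap heavily while the extremes differ, that matches the cautious reading in (iii): treatment may matter, but the plot alone is not conclusive.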

Question 2

Q

Q2 The gss_cat data set within R contains categorical variables from the General Social Survey, 2000-2014.
i) Produce a bar chart showing the numbers of people in each category of marital status.
ii) Amend your bar chart to display the number of people in each marital status category with a different colour for each party affiliation.
iii) Amend your plot to display the proportion in each party affiliation by marital status. Choose an appropriate Brewer colour scale.
iv) Amend your plot from part (iii) to use appropriate colours for the political parties: red for Republican, blue for Democrat, purple for Independent and yellow for those who declined to answer or who answered Don’t Know or Other. There is no need to use a Brewer colour scale.
Hint: Use colours() to display the colour names within R.
Hint: You can specify the colours using scale_fill_manual().

Answer

A

### Q2

library(tidyverse)  # loads ggplot2 and forcats (which provides gss_cat)

# (i)

ggplot(data = gss_cat) +
  geom_bar(mapping = aes(x = marital))

# (ii)

ggplot(data = gss_cat) +
  geom_bar(mapping = aes(x = marital, fill = partyid))

# (iii)

ggplot(data = gss_cat) +
  geom_bar(mapping = aes(x = marital, fill = partyid), position = "fill") +
  scale_fill_brewer(palette = "Paired")

# (iv)
# Colours are listed in the order of the partyid levels: three
# "no answer / don't know / other" levels, two Republican, three
# Independent, two Democrat.

ggplot(data = gss_cat) +
  geom_bar(mapping = aes(x = marital, fill = partyid), position = "fill") +
  scale_fill_manual(values = c("lemonchiffon", "lemonchiffon", "lemonchiffon", "red2", "red2",
                               "violet", "violet", "violet", "royalblue", "royalblue"))
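
The answer to (iv) relies on the colours being listed in the same order as the factor levels. One possible alternative, sketched here and not taken from the original R file, is to name each colour by its partyid level so the mapping is explicit. The vector name party_colours and the specific colour names are illustrative choices; the sketch assumes the forcats package is installed.

```r
library(ggplot2)
library(forcats)  # provides the gss_cat data set

# Named vector: level name on the left, colour on the right.
party_colours <- c(
  "No answer"          = "yellow",
  "Don't know"         = "yellow",
  "Other party"        = "yellow",
  "Strong republican"  = "red",
  "Not str republican" = "red",
  "Ind,near rep"       = "purple",
  "Independent"        = "purple",
  "Ind,near dem"       = "purple",
  "Not str democrat"   = "blue",
  "Strong democrat"    = "blue"
)

# Guard against typos: every level must have a colour, and every
# colour must be one of the names listed by colours().
stopifnot(setequal(names(party_colours), levels(gss_cat$partyid)))
stopifnot(all(party_colours %in% colours()))

party_plot <- ggplot(data = gss_cat) +
  geom_bar(mapping = aes(x = marital, fill = partyid), position = "fill") +
  scale_fill_manual(values = party_colours)

print(party_plot)
```

With a named vector, reordering the factor levels later does not silently reassign colours to the wrong party.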

Question 3

Q

Q3 This question uses the iris data set within R.
i) Construct a scatterplot with sepal length on the x axis and petal length on the y axis. Display each species of plant in a different colour.
ii) Amend your plot in (i) so that it is suitable for publication in a botany magazine, discussing the difference in sepal and petal length for each species. You can assume that the reader is familiar with the species of iris and the difference between sepals and petals. Include alt text in your answer.

Answer

A

Alt text: The petal length increases with the sepal length for the virginica and
versicolor species. The versicolor iris is smaller than virginica, but the
relationship is similar for these species. The setosa species is smaller, and
has similar petal length for all sepal lengths, indicating there is no
relationship for the setosa species.
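
The relationships described in the alt text can be sanity-checked with a within-species correlation. This is a minimal base R sketch, not part of the original answer; the name species_cor is illustrative, and a correlation coefficient is only one rough measure of a linear relationship.

```r
# iris ships with base R (datasets package).

# Correlation between sepal length and petal length, computed
# separately for each species.
species_cor <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)

print(round(species_cor, 2))
```

A much weaker correlation for setosa than for the other two species would be consistent with the statement that petal length barely changes with sepal length for setosa.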

Rfile:

library(tidyverse)  # loads ggplot2; iris ships with base R

# (i)

ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length, colour = Species))

# (ii)

ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  scale_colour_brewer(palette = "Set2") +
  labs(
    title = "Petal length increases with sepal length for Versicolor and Virginica irises",
    subtitle = "Petal length is roughly constant for Setosa irises",
    caption = "iris dataset",
    x = "Sepal Length",
    y = "Petal Length",
    colour = "Iris Species"
  ) +
  theme_minimal()
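
The alt text can also be stored on the plot object itself rather than kept only in a separate note. The sketch below assumes a reasonably recent ggplot2 (the alt argument to labs() and the get_alt_text() helper were added in the 3.3.x series); the wording of alt_text is a shortened illustration, not the model answer.

```r
library(ggplot2)

# Illustrative alt text; in practice use the full description from the answer.
alt_text <- paste(
  "Scatterplot of petal length against sepal length for three iris species.",
  "Petal length increases with sepal length for versicolor and virginica,",
  "while setosa has short petals at every sepal length."
)

iris_plot <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  labs(alt = alt_text)

# Retrieve the stored alt text, e.g. to pass to a document or web page.
stored_alt <- get_alt_text(iris_plot)
print(stored_alt)
```

Keeping the alt text with the plot means tools that understand it (for example some R Markdown and Shiny output formats) can pick it up without it being retyped.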

Question 4

Q

Q6 This question uses the PlantGrowth data set in R, which records the results from an experiment to compare yields (as measured by the dried weight of plants) using two different treatment conditions and a control.
Produce a box plot showing the weight recorded, grouped by treatment option.

Answer

A

The boxes show the middle two quartiles (the interquartile range); the thick lines show the medians.
Treatment 1 lowers the median weight compared with control.
Results under treatment 1 look less skewed than under control or treatment 2.
Treatment 2 only slightly increases the median weight compared with control,
but its lower quartile is higher than the control's.
Control is skewed left; treatment 2 is skewed right.
There is a narrower range of results under treatment 2.

Rfile:
library(tidyverse)  # loads ggplot2; PlantGrowth ships with base R

ggplot(data = PlantGrowth) +
  geom_boxplot(mapping = aes(x = group, y = weight))
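
The features read off the box plot can be confirmed numerically. The following base R sketch is not part of the original R file; the names group_quartiles and group_spread are illustrative.

```r
# PlantGrowth ships with base R (datasets package).
weights_by_group <- split(PlantGrowth$weight, PlantGrowth$group)

# Lower quartile, median and upper quartile for each group:
# rows are "25%", "50%", "75%"; columns are ctrl, trt1, trt2.
group_quartiles <- sapply(weights_by_group, quantile, probs = c(0.25, 0.5, 0.75))
print(round(group_quartiles, 2))

# Full range (max minus min) of weights in each group.
group_spread <- sapply(weights_by_group, function(w) diff(range(w)))
print(round(group_spread, 2))
```

Comparing the "50%" row across columns checks the statements about medians, and group_spread checks the statement about the narrower range under treatment 2.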

Question 5

Q

Section 5 Tutorial – In Class Question
December 2021 Computer Based Assessment
Strictly Come Dancing is a television show that airs in the UK. Every week, each celebrity performs a dance with a professional partner, which is scored by four judges out of a possible 40. One celebrity is eliminated each week based on the judges’ scoring and a public vote.
Your actuarial consultancy has been hired to do some preliminary analysis on the scoring. You have been provided with a data set containing the scores for all dances in the first three series. This data set is called Strictly.csv and is available on Canvas. It includes the following data items:
* celebrity_id: a unique identifier for each celebrity
* celebrity: name of celebrity
* professional: name of professional
* dance: name of dance performed
* series: 1, 2 or 3
* week: episode number within series
* total: total score received from judges out of a possible 40
* Sex: sex of celebrity
* Type: categorises dance as Latin or Ballroom
(i) Construct a scatterplot showing week on the x-axis and total on the y-axis. Your scatterplot should differentiate between the scores for male and female celebrities.
Your intended audience for this scatterplot is your colleague who will be furthering the analysis tomorrow while you are on annual leave.
(ii) Write a brief description of the data displayed in your scatterplot above that you will email to your colleague along with your plot.
Your good friend writes a blog about Strictly Come Dancing and asks you to provide a graphic based on the above data file. He would like to display the male/female split in the numbers of each dance performed. He mentions that the accessibility of the graphics in his blog is important to him.
(iii) Create a graphic that displays the proportion of male and female performances of each dance.

Answer

A

###############################################################################
### FIN3026 Section 5 In Class Question ###
###############################################################################

library(tidyverse)

# import data (assumes Strictly.csv is in the working directory)

strictlydata <- read.csv("Strictly.csv", header = TRUE)

# (i) scatterplot

ggplot(data = strictlydata) +
  geom_point(mapping = aes(x = week, y = total, colour = Sex))

# (ii) Description:

# Number of dances reduces each week as contestants are eliminated.
# Scores broadly increase as weeks go on, but top of distribution increases very slowly.
# Variance reduces significantly as weaker contestants are eliminated.
# High scores appear strongly female, low scores appear strongly male,
# but further analysis is required to see if this is statistically significant.

# (iii) visualisation for blog (example)

ggplot(data = strictlydata) +
  geom_bar(mapping = aes(x = dance, fill = Sex), position = "fill") +
  labs(title = "Male contestants performed more than half of almost all dance types",
       x = NULL,
       y = "Proportion") +
  scale_fill_brewer(palette = "Set1") +
  coord_flip() +
  theme_minimal()

# ALT text:

# Male contestants performed just over half of most dances, implying there were more
# performances by male celebrities than female over the first three series.
# The dance with the highest proportion of male performances is the Viennese Waltz,
# with over 75% of performances by a male celebrity. The dance with the lowest proportion
# of male performances is the Waltz, with just under half performed by a male celebrity.
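
The proportions quoted in the alt text can be read off a table rather than estimated from the bars. The sketch below is an illustration only: the helper male_share_by_dance is hypothetical, it assumes the Sex column is coded "Male" and "Female" (the coding in Strictly.csv is not stated in the question), and the toy data frame is invented purely to make the example runnable without the course file.

```r
# Hypothetical helper: proportion of performances of each dance by male celebrities.
# Assumes df has columns `dance` and `Sex`, with Sex coded "Male"/"Female".
male_share_by_dance <- function(df) {
  counts <- table(df$dance, df$Sex)          # performances per dance and sex
  shares <- prop.table(counts, margin = 1)   # convert each row to proportions
  sort(shares[, "Male"], decreasing = TRUE)  # highest male share first
}

# Invented toy data (NOT the real Strictly.csv) with the same column names.
toy_data <- data.frame(
  dance = c("Waltz", "Waltz", "Tango", "Tango", "Tango", "Tango"),
  Sex   = c("Male", "Female", "Male", "Male", "Male", "Female")
)

toy_shares <- male_share_by_dance(toy_data)
print(toy_shares)
```

With the real data loaded, male_share_by_dance(strictlydata) would give the figures needed to check statements such as "over 75%" before they go into the alt text.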
