Chapter 2: Data Visualization Flashcards

1
Q

What is the correct way to make the points blue in a scatterplot?

  1. ggplot(persinj, aes(x = op_time, y = total, color = "blue")) + geom_point()
  2. ggplot(persinj, aes(x = op_time, y = total)) + geom_point(colour = "blue")
A
  2. is the correct way

Remember that the aes() function maps variables in our dataset to visual properties of the graph. In the first choice, "blue" sits inside aes(), so ggplot treats it as a data value to be mapped rather than as a colour name: the points are drawn in a default colour and a legend labelled "blue" appears.
The mistake is mapping an aesthetic to a constant inside aes() instead of setting it as an argument of the geom.
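
A minimal sketch of mapping versus setting a colour. It uses the built-in mtcars data as a stand-in for persinj, so the columns wt and mpg are placeholders and not part of the original card:

```r
library(ggplot2)

# Setting: the colour is a constant passed to the geom, outside aes().
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue")

# Mapping: inside aes(), "blue" is treated as a data value, not a colour name.
p_map <- ggplot(mtcars, aes(x = wt, y = mpg, colour = "blue")) +
  geom_point()

# Colours actually assigned to the points in each plot.
colours_set <- unique(layer_data(p_set)$colour)
colours_map <- unique(layer_data(p_map)$colour)
```

Comparing colours_set with colours_map shows that only the first plot really uses blue.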

2
Q

T/F: the aesthetics determine what relationships we want to see in the plot and the geoms determine how we want to see these relationships.

A

T

3
Q

How could we use the “color” aesthetic the right way in a ggplot?

A

Map color to a factor variable inside aes(), alongside the two numeric variables being compared, e.g. aes(x = numeric1, y = numeric2, color = factor_var).
The points on the graph are then drawn in a different colour for each level of the factor variable.

4
Q

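Can you map colour to a numerically coded categorical variable without converting it to a factor first (in the aesthetic mapping)?

A

Not if you want one distinct colour per level. A numeric variable is treated as continuous and gets a colour gradient, so convert it with the factor() function, e.g. colour = factor(variable).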
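
A small sketch of the difference, assuming the built-in mtcars data, where cyl is stored as a number (the dataset and column names are only for illustration):

```r
library(ggplot2)

# cyl is numeric in mtcars, so this mapping gives a continuous colour gradient.
p_gradient <- ggplot(mtcars, aes(x = wt, y = mpg, colour = cyl)) +
  geom_point()

# Wrapping it in factor() gives one discrete colour per level.
p_discrete <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

n_levels  <- nlevels(factor(mtcars$cyl))
n_colours <- length(unique(layer_data(p_discrete)$colour))
```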

5
Q

What are two common arguments of geom_smooth()?

A

method = "lm" (fit a straight regression line) and se = FALSE (hide the standard error band)

6
Q

What are the 4 common arguments used in geom_point()?

A

color
alpha = transparency of the points (0 = fully transparent, 1 = fully opaque)
shape
size
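
A quick sketch of the four arguments set as constants, assuming the built-in mtcars data; the specific values are arbitrary:

```r
library(ggplot2)

p_points <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(
    colour = "blue",  # constant colour for every point
    alpha  = 0.5,     # transparency: 0 = invisible, 1 = opaque
    shape  = 17,      # 17 is a filled triangle
    size   = 3        # point size
  )

point_data <- layer_data(p_points)
```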

7
Q

what are the 2 common arguments of geom_bar()?

A

fill

alpha

8
Q

what are the 3 common arguments for geom_histogram()?

A

fill
alpha
bins = number of bins you want to use.
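
A sketch of the three arguments together, assuming the built-in mtcars data; the fill colour and the bin count are arbitrary choices:

```r
library(ggplot2)

p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(fill = "steelblue", alpha = 0.7, bins = 10)

# One row per bin, with the number of observations in each.
hist_data <- layer_data(p_hist)
```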

9
Q

what curve is usually used with geom_point?

A

geom_smooth(method = "lm", se = FALSE)
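
A minimal sketch of a scatterplot with a fitted line on top, assuming the built-in mtcars data:

```r
library(ggplot2)

p_fit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight line, no SE band

# Fitted values behind the line (layer 2 is the smooth).
fit_data <- layer_data(p_fit, 2)
```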

10
Q

You are comparing 3 variables: two are numeric and one is a factor variable.
You want the colour of the points to differ based on the value of the factor variable.
You use geom_point() and geom_smooth() with SE bands.
How could you make sure that each SE band is filled with the colour of its factor level (consistent with the points)?

A

ggplot(dataset, aes(x = numeric1, y = numeric2, color = factor(categorical), fill = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)

This gives consistent colouring between the points, the smoothed lines, and the SE bands, because the factor variable is mapped to both fill and colour.
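
A runnable sketch of the same idea, assuming the built-in mtcars data with cyl standing in for the categorical variable:

```r
library(ggplot2)

p_groups <- ggplot(
  mtcars,
  aes(x = wt, y = mpg, colour = factor(cyl), fill = factor(cyl))
) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# One fitted line and one SE band per level of the factor.
smooth_data    <- layer_data(p_groups, 2)
n_fitted_lines <- length(unique(smooth_data$group))
```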

11
Q

you are comparing 3 variables. two are numeric and one is a factor variable.
you want to make the colour of the points different based on the value of the factor variable.
you use geom_point() and geom_smooth() with an se bound.
how could you make sure that there is only one smoothed line and one se bound for all data points?

A

ggplot(dataset, aes(x = numeric1, y = numeric2)) + geom_point(aes(color = factor(categorical))) + geom_smooth(method = "lm", se = TRUE)

  • take the color mapping out of ggplot(), so geom_smooth() does not inherit it
  • put it into geom_point() as an aesthetic, because we want the points to still be different colours
  • keep se = TRUE in geom_smooth()
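
A runnable sketch of this layout, assuming the built-in mtcars data with cyl standing in for the categorical variable:

```r
library(ggplot2)

p_single <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +   # colour applies to the points only
  geom_smooth(method = "lm", se = TRUE)     # inherits only x and y

n_point_colours <- length(unique(layer_data(p_single, 1)$colour))
n_fitted_lines  <- length(unique(layer_data(p_single, 2)$group))
```

The points keep one colour per level, while the smooth layer sees a single group.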
12
Q

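T/F: aesthetic mappings in the original ggplot function will not be inherited by all geom functions.

A

False. Mappings set in the original ggplot() call are inherited by every geom, unless a geom overrides them with its own aes().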

13
Q

What is faceting? What output does it provide when used? What function is used?

A

It's a convenient way to split our data into distinct groups based on the values of a categorical predictor.
Faceting displays the observations in separate plots, one for each value of the faceting variable, placed side by side to ease comparison.

The function used is facet_wrap().

14
Q

How do you use the facet_wrap() function?

A

ggplot(dataset, aes(x = numeric1, y = numeric2, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE) +
  facet_wrap(~ FACET_VAR, ncol = n)

facet_wrap(~ FACET_VAR, ncol = n) is added to the end of a ggplot; ncol sets the number of columns of panels

by default, all panels are on the same scale so that they are easy to compare
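
A short sketch, assuming the built-in mtcars data with cyl as the faceting variable:

```r
library(ggplot2)

p_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  facet_wrap(~ cyl, ncol = 3)   # one panel per value of cyl, laid out in 3 columns

n_panels <- length(unique(layer_data(p_facets, 1)$PANEL))
```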

15
Q

How do you add titles to a ggplot?

A

ggplot(dataset, aes(x = numeric1, y = numeric2, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE) +
  labs(x = "x axis title", y = "y axis title", title = "main title")

using the labs() function; the main title is set with the title argument (main belongs to base R plotting, not to labs())
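
A small sketch of adding titles, assuming the built-in mtcars data; the title strings are made up for illustration:

```r
library(ggplot2)

p_titled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x     = "Weight (1000 lbs)",
    y     = "Miles per gallon",
    title = "Fuel economy by weight"
  )
```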

16
Q

How do you display more than one graph on the page? What package?

A

library(gridExtra)

grid.arrange(graph1, graph2, …, graph n, ncol = n)

17
Q

How do you paste a graph into Word from R?

A

right-click on the graph > Copy > go to Word > right-click > Paste Options: Picture

18
Q

how do you alter the coordinates of a ggplot?

A

ggplot(dataset, aes(x = numeric1, y = numeric2, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE) +
  coord_cartesian(ylim = c(number1, number2), xlim = c(400, 1000))
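
A short sketch, assuming the built-in mtcars data; the limits are arbitrary. Note that coord_cartesian() zooms the view without dropping observations:

```r
library(ggplot2)

p_zoom <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  coord_cartesian(xlim = c(2, 4), ylim = c(15, 30))

# All observations are still in the layer data; only the visible window changes.
n_points_kept <- nrow(layer_data(p_zoom, 1))
```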

19
Q

What are the two components of exploratory data analysis?

A
  1. descriptive statistics
  2. visual aids (graphs)

20
Q

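Univariate data exploration: what 2 aspects of a numeric variable can you reveal using summary()?

A
central tendency (mean and median, which tell us the "size" of the variable)
dispersion (minimum, maximum, and quartiles, which give the range and interquartile range; summary() does not report the variance, so use var() or sd() for that)

We can also look for skew in the variable's data by comparing the mean and median: a mean well above the median suggests a right skew, and a mean well below it suggests a left skew.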
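
A small sketch of these checks in base R, assuming the mpg column of the built-in mtcars data as the numeric variable:

```r
x <- mtcars$mpg

five_num_and_mean <- summary(x)   # minimum, quartiles, median, mean, maximum

# Spread measures that summary() does not print directly.
spread_sd  <- sd(x)
spread_iqr <- IQR(x)

# A mean above the median hints at a right skew.
mean_minus_median <- mean(x) - median(x)
```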

21
Q

Univariate data exploration: what kind of graphs are used when exploring numeric variables?

A

Histograms: provide a visual summary of the count or relative frequency in each bin. We can learn about the shape of the distribution of a numeric variable.

Boxplots: visualize the distribution of a numeric variable.
- Although boxplots don't directly show the actual shape of the variable's distribution, they offer a useful graphical summary of the key numeric statistics (median, quartiles, outliers) and allow for a visual comparison of the distributions of different numeric variables (e.g. the relative magnitude of their medians).

22
Q

TODO
if you want to compare the values of a numeric variable across different levels of a categorical predictor, what kind of graphical display could you use?

A

split boxplots: ggplot(dataset, aes(x = factor(categorical), y = numeric)) + geom_boxplot()

23
Q

How could you calculate the summary statistics of a numeric variable for two groups of observations?

ex. separating the values of another numeric variable; you want the summary stats for y based on if x > 50 or x < 50?
ex. separating the values of a categorical predictor?

A

categorical:

dataset. value1 #, ]
dataset. under
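
One way to get summary statistics for two groups, sketched in base R with logical subsetting. The built-in mtcars data, the hp threshold of 150, and the am column are assumptions for illustration, not the card's own variables:

```r
# Numeric split: summary of mpg for cars with hp above vs. at or below 150.
high_hp <- mtcars[mtcars$hp > 150, ]
low_hp  <- mtcars[mtcars$hp <= 150, ]
summary_high <- summary(high_hp$mpg)
summary_low  <- summary(low_hp$mpg)

# Categorical split: summary of mpg for one level of a categorical variable.
manual <- mtcars[mtcars$am == 1, ]
summary_manual <- summary(manual$mpg)
```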

24
Q

what should we do if we have a skewed target variable?

A

Transformation: apply a monotonic concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable.

25
Q

what kind of transformations can be done on a skewed target variable?

A
  1. log transformation (can only be done to strictly positive data)
  2. square root transformation (similar to the log transformation, but also works when the data contain zeros)
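
A tiny sketch of both transformations on made-up right-skewed values, checking that the ranks are preserved and that the gap between mean and median (used here only as a rough skew indicator) shrinks:

```r
x <- c(1, 2, 3, 5, 8, 13, 100)   # made-up right-skewed values

log_x  <- log(x)    # requires strictly positive values
sqrt_x <- sqrt(x)   # also defined at zero

# Both functions are increasing, so the ordering of the observations is preserved.
ranks_preserved <- identical(rank(x), rank(log_x)) &&
  identical(rank(x), rank(sqrt_x))

# Gap between mean and median before and after the log transformation.
gap_raw <- mean(x) - median(x)
gap_log <- mean(log_x) - median(log_x)
```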
26
Q

How do we deal with natural outliers? (4 options)

A
  1. Remove: if we know that removing them will not have a material effect
  2. Ignore: if the outliers make up an insignificant proportion of the data and are unlikely to create bias, then keep them
  3. Modify: modify them to make them more reasonable
  4. Use robust model forms: instead of minimizing the squared error between predicted and observed values, minimize the absolute error, which places a lot less weight on the observations that are outliers
27
Q

what could you use to look at the descriptive statistics for a categorical variable?

A

frequency tables.

table(dataset$categorical) / nrow(dataset)
This produces a table of relative frequencies (proportions), which tells you which level of the categorical predictor is predominant. table(dataset$categorical) on its own gives the raw counts.
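
A short sketch in base R, assuming the cyl column of the built-in mtcars data as the categorical variable:

```r
counts      <- table(mtcars$cyl)       # raw counts per level
proportions <- counts / nrow(mtcars)   # relative frequencies

predominant_level <- names(which.max(counts))
```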

28
Q

How could you graphically explore a categorical variable?

A

bar charts: extract the information in a frequency table and present the numeric counts visually

29
Q

what is bivariate data exploration used for?

A

relationships, patterns, outliers

30
Q

bivariate: numeric vs. numeric

what kind of descriptive statistics are to be used?

A

correlation coefficient!

use the cor() function

31
Q

bivariate: numeric vs. numeric

what kind of graph would you use to evaluate the relationship?

A
scatterplot
ggplot(dataset, aes(x = numeric1, y = numeric2)) + geom_point() + geom_smooth()
32
Q

bivariate: categorical vs. numeric

How could you use descriptive statistics to look at the relationship, and how would you compute them?

A

We can partition the data into different subsets, one for each level of the categorical variable, and compute the mean of the numeric variable within each subset.

(using a tibble from library(tidyverse))
* don't need to know how to write that code
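
For reference only, a base R sketch of the same partition-and-average idea. It uses tapply() rather than the tidyverse approach mentioned above, and assumes the built-in mtcars data with cyl as the categorical variable:

```r
# Mean of mpg within each level of cyl.
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
```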

33
Q

bivariate: numeric vs. categorical
what if you wanted to look at a 3-way interaction between one numeric variable (target variable) and 2 categorical predictors?

A

use boxplots!

ggplot(dataset, aes(x = categorical1, y = numeric, fill = factor(categorical2))) + geom_boxplot()
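
A runnable sketch of such a boxplot, assuming the built-in mtcars data with cyl and am standing in for the two categorical predictors:

```r
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(am))) +
  geom_boxplot()

# One box per combination of the two categorical variables.
n_boxes <- nrow(layer_data(p_box, 1))
```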

34
Q

bivariate: categorical vs. categorical

How could you use descriptive statistics to evaluate this relationship?

A

two-way frequency table using the table() function. The first argument defines the rows and the second argument defines the columns.

ex.
table(dataset$categorical1, dataset$categorical2)
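
A short sketch, assuming the cyl and am columns of the built-in mtcars data as the two categorical variables:

```r
two_way <- table(mtcars$cyl, mtcars$am)   # rows: cyl, columns: am
```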

35
Q

bivariate: categorical vs. categorical

how could you use graphical displays? what are the 3 options to display this graph?

A

Split bar charts are a good way to visualize the relationship.

ggplot(dataset, aes(x = categorical1, fill = categorical2)) + geom_bar(position = "…..")

position can be three things:
position = "fill": bars are stacked and scaled to the same height, showing proportions
position = "dodge": bars are placed side by side
position = "stack" (the default): bars are stacked on top of each other

fill is usually the most useful for depicting the interplay between two categorical variables
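
A sketch of the three position options side by side, assuming the built-in mtcars data with cyl and am standing in for the two categorical variables:

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

p_stack <- p_base + geom_bar()                     # default: position = "stack"
p_dodge <- p_base + geom_bar(position = "dodge")   # bars side by side
p_fill  <- p_base + geom_bar(position = "fill")    # proportions, every bar has height 1

max_height_stack <- max(layer_data(p_stack)$ymax)
max_height_fill  <- max(layer_data(p_fill)$ymax)
```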