Chapter 2: Data Visualization Flashcards
What is the correct way to make the points blue in a scatterplot?
- ggplot(persinj, aes(x = op_time, y = total, color = "blue")) + geom_point()
- ggplot(persinj, aes(x = op_time, y = total)) + geom_point(colour = "blue")
The second option is the correct way.
Remember that the aes() function is used to map variables in our dataset to visual properties of the graph. With the first choice, ggplot treats "blue" as a data value (a one-level categorical variable) rather than a colour name, so the points are drawn in the default colour and a legend labelled "blue" appears.
The mistake is setting an aesthetic to a constant value inside aes(); constants belong outside aes(), in the geom.
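The difference between setting and mapping a colour can be checked directly. A minimal sketch, assuming the built-in mtcars data as a stand-in for persinj (wt and mpg play the roles of op_time and total):

```r
library(ggplot2)

# Assumption: mtcars stands in for persinj.
# Setting: the colour is a constant, supplied outside aes().
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue")

# Mapping: "blue" inside aes() is treated as data, not as a colour name.
p_map <- ggplot(mtcars, aes(x = wt, y = mpg, colour = "blue")) +
  geom_point()

# Colours that each plot actually draws its points with
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
```

Printing set_colours and map_colours shows that only the first plot really uses blue.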
T/F: the aesthetics determine what relationships we want to see in the plot and the geoms determine how we want to see these relationships.
T
How could we use the “color” aesthetic the right way in a ggplot?
we could map color = a factor variable inside aes(), alongside two numeric variables on x and y
it will make the points on the graph different colours depending on the value of the factor variable
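can you set colour = a categorical variable without converting it to a factor first (in the aesthetics mapping)?
not if it is stored as numbers: a numeric variable gets a continuous colour gradient, so you have to convert it using the factor() function. A variable that is already a factor or a character vector is treated as discrete as-is.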
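A short sketch of why factor() matters, assuming mtcars as a stand-in dataset whose cyl column is a category stored as numbers:

```r
library(ggplot2)

# cyl is numeric (4, 6, 8), so mapping it directly gives a continuous gradient.
p_numeric <- ggplot(mtcars, aes(x = wt, y = mpg, colour = cyl)) +
  geom_point()

# Wrapped in factor(), each cylinder count gets its own distinct colour
# and the legend shows three separate keys.
p_factor <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Distinct colours used by the factor version
n_colours <- length(unique(layer_data(p_factor)$colour))
```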
what are two common arguments of the geom_smooth() ?
method = "lm" and se = FALSE
what are the 4 common arguments used in geom_point()?
color
alpha = transparency of the points (0 = fully transparent, 1 = fully opaque)
shape
size
what are the 2 common arguments of geom_bar()?
fill
alpha
what are the 3 common arguments for geom_histogram()?
fill
alpha
bins = number of bins you want to use.
what curve is usually used with geom_point?
geom_smooth(method = "lm", se = FALSE)
you are comparing 3 variables. two are numeric and one is a factor variable.
you want to make the colour of the points different based on the value of the factor variable.
you use geom_point() and geom_smooth() with se bounds.
how could you make sure that the SE bands are filled in with the colour of the corresponding factor level (consistent with the points)?
ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical), fill = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
this gives consistent colouring between the points, the smoothed curves and the SE bands, because we mapped the factor variable to both fill and colour.
you are comparing 3 variables. two are numeric and one is a factor variable.
you want to make the colour of the points different based on the value of the factor variable.
you use geom_point() and geom_smooth() with an se bound.
how could you make sure that there is only one smoothed line and one se bound for all data points?
ggplot(dataset, aes(x = numeric, y = numeric)) + geom_point(aes(color = factor(categorical))) + geom_smooth(method = "lm", se = TRUE)
- take the color mapping out of ggplot()
- put it into geom_point() as an aesthetic because we want the points to still be different colours
- se = TRUE in the geom_smooth
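The two cards above differ only in where the colour mapping lives. A sketch that builds both versions and counts the fitted lines, assuming mtcars as a stand-in dataset with cyl as the factor variable:

```r
library(ggplot2)

# Colour and fill mapped in ggplot(): every geom inherits them,
# so each cylinder group gets its own line and matching SE band.
p_grouped <- ggplot(mtcars, aes(x = wt, y = mpg,
                                colour = factor(cyl), fill = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# Colour mapped only in geom_point(): points stay coloured by group,
# but geom_smooth() sees no grouping and fits a single line.
p_single <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = TRUE)

# Number of fitted curves in the smooth layer (layer 2) of each plot
n_grouped <- length(unique(layer_data(p_grouped, 2)$group))
n_single  <- length(unique(layer_data(p_single, 2)$group))
```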
T/F: aesthetic mappings in the original ggplot() call will not be inherited by the geom functions.
False. They are inherited by every geom, unless a geom supplies its own mapping for that aesthetic.
What is faceting? What output does it provide when used? What function is used?
it’s a convenient way to categorize our data into distinct groups based on the value of our categorical predictors.
faceting displays the observations in separate plots produced for each value of the faceting variable placed side-by-side to ease comparison
the function used is facet_wrap()
How do you use the facet_wrap() function?
ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
+ facet_wrap(~ FACET VAR, ncol = n)
facet_wrap(~ FACET VAR, ncol = n) is added to the end of a ggplot
by default, all panels share the same x and y scales so that they are easy to compare
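A minimal faceting sketch, assuming mtcars as a stand-in dataset and am (transmission type) as the faceting variable:

```r
library(ggplot2)

# One panel per value of am, laid out in two columns.
p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  facet_wrap(~ am, ncol = 2)

# Number of panels produced
n_panels <- length(unique(layer_data(p_facet, 1)$PANEL))
```

If shared axes hide the pattern in one panel, facet_wrap() also accepts scales = "free", "free_x" or "free_y", at the cost of making panels harder to compare.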
How do you add titles to a ggplot?
ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
+ labs(x = "x axis title", y = "y axis title", title = "main title")
using the labs() function
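A small sketch of labs(), assuming mtcars as a stand-in dataset; the label text is made up for illustration:

```r
library(ggplot2)

# Axis titles and a main title are all set in one labs() call.
p_labelled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       title = "Fuel economy vs. weight")

# The title is stored with the plot object
plot_title <- p_labelled$labels$title
```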
How do you display more than one graph on the page? What package?
library(gridExtra)
grid.arrange(graph1, graph2, …, graph n, ncol = n)
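A sketch of placing two plots side by side, assuming the gridExtra package is installed and mtcars stands in for the dataset:

```r
library(ggplot2)
library(gridExtra)

g1 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
g2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Draw both plots on one page, in two columns.
grid.arrange(g1, g2, ncol = 2)

# arrangeGrob() builds the same layout without drawing it,
# which is handy for saving or inspecting the combined object.
combined <- arrangeGrob(g1, g2, ncol = 2)
```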
How do you paste a graph into word from R?
right click on the graph > copy > go to word > right click > paste options: picture
how do you alter the coordinates of a ggplot?
ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
+ coord_cartesian(ylim = c(number1, number2), xlim = c(400, 1000))
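A sketch of zooming with coord_cartesian(), assuming mtcars as a stand-in dataset and arbitrary limits chosen for illustration. coord_cartesian() only changes the visible window, so points outside the limits still feed into the fitted line:

```r
library(ggplot2)

p_zoom <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  coord_cartesian(xlim = c(2, 4), ylim = c(10, 30))

# All observations are still part of the plot data after zooming.
n_points <- nrow(layer_data(p_zoom, 1))
```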
What are the two components of exploratory data analysis?
1. descriptive statistics
2. visual aids (graphs)
univariate data exploration: what 2 aspects of a numeric variable are you able to reveal using summary()
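central tendency (mean and median, which tell us the "size" of the variable) and dispersion (range and interquartile range, read off the min, quartiles and max; the variance itself is not reported)
we can also look for skew in the variable's data by comparing the mean and median (mean > median suggests right skew).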
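A quick sketch of reading summary() output, using a small made-up vector with one large value:

```r
# Hypothetical data: mostly small values plus one large outlier.
x <- c(1, 2, 2, 3, 3, 4, 50)

# Min, quartiles, median, mean and max in one call.
s <- summary(x)

# Dispersion measures that summary() does not report directly.
x_var <- var(x)
x_iqr <- IQR(x)

# A mean well above the median points to right skew.
right_skewed <- mean(x) > median(x)
```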
Univariate data exploration: what kind of graphs are used when exploring numeric variables?
Histograms: provide a visual summary of the count or relative frequency in each bin. we can learn about the shape of the distribution of a numeric variable
Boxplots: visualize the distribution of a numeric variable.
- although boxplots don’t directly show the actual shape of the variable’s distribution, they offer a useful graphical summary of the key numeric statistics and allow for a visual comparison of the distributions of different numeric variables (e.g., the relative magnitude of their medians and quartiles)
if you want to compare the values of a numeric variable across different levels of a categorical predictor, what kind of graphical display could you use?
ggplot(dataset, x = factor(
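One display that fits this question is a split boxplot, with one box per level of the categorical variable. A minimal sketch, assuming mtcars as a stand-in dataset with cyl as the categorical predictor and mpg as the numeric variable:

```r
library(ggplot2)

# One box of mpg for each cylinder count.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Number of boxes drawn
n_boxes <- nrow(layer_data(p_box))
```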
How could you calculate the summary statistics of a numeric variable for two groups of observations?
ex. separating the values of another numeric variable; you want the summary stats for y based on if x > 50 or x < 50?
ex. separating the values of a categorical predictor?
categorical:
dataset. value1 #, ]
dataset. under
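One way to get summary statistics by group in base R is logical subsetting. This is a sketch under assumptions, not necessarily the code the card had in mind: mtcars stands in for the dataset, mpg for y, hp for x, and the cutoff of 150 is hypothetical.

```r
# Split on another numeric variable: mpg for high vs. low horsepower.
mpg_high_hp <- mtcars$mpg[mtcars$hp > 150]
mpg_low_hp  <- mtcars$mpg[mtcars$hp <= 150]
summary(mpg_high_hp)
summary(mpg_low_hp)

# Split on a categorical predictor: mpg for each transmission type.
mpg_am0 <- mtcars[mtcars$am == 0, "mpg"]
mpg_am1 <- mtcars[mtcars$am == 1, "mpg"]
summary(mpg_am0)
summary(mpg_am1)
```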
what should we do if we have a skewed target variable?
transformation. apply a monotonic concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable
what kind of transformations can be done on a skewed target variable?
- log-transformation (can only be done to strictly positive data)
- sqrt (similar to the log transformation)
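A sketch of both transformations on simulated right-skewed data (the lognormal sample is an assumption made purely for illustration):

```r
set.seed(1)

# Simulated strictly positive, right-skewed variable.
x <- rlnorm(1000)

log_x  <- log(x)   # requires strictly positive values
sqrt_x <- sqrt(x)  # also works when some values are zero

# Both functions are increasing, so the ordering of observations is unchanged.
same_order <- identical(order(x), order(log_x))

# The gap between mean and median is a rough skewness check.
gap_raw <- mean(x) - median(x)
gap_log <- mean(log_x) - median(log_x)
```

Comparing gap_raw with gap_log gives a quick sense of how much the log has symmetrized the distribution; a histogram of each tells the same story visually.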
how do we deal with natural outliers? (4 options)
- remove: if we know that it will not have a material effect
- ignore: if the outliers make up an insignificant proportion of the data and are unlikely to create bias, then keep them
- Modify: modify them to make them more reasonable
- use robust model forms: instead of minimizing the squared error between the predicted and observed values, we could minimize the absolute error, which places a lot less weight on the observations that are outliers
what could you use to look at the descriptive statistics for a categorical variable?
frequency tables.
table(dataset$categorical) / nrow(dataset)
this will produce a relative frequency table (proportions rather than counts); it will tell you what the predominant level of the categorical predictor is.
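A small sketch of counts versus proportions, assuming mtcars as a stand-in dataset and cyl as the categorical variable:

```r
# Raw counts per level
counts <- table(mtcars$cyl)

# Proportions per level; prop.table(counts) gives the same result.
props <- counts / nrow(mtcars)

# Level with the most observations
predominant <- names(which.max(counts))
```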
How could you graphically explore a categorical variable?
bar charts: extract the information in a frequency table and present the numeric counts visually
what is bivariate data exploration used for?
relationships, patterns, outliers
bivariate: numeric vs. numeric
what kind of descriptive statistics are to be used?
correlation coefficient!
use the cor() function
bivariate: numeric vs. numeric
what kind of graph would you use to evaluate the relationship?
scatterplot: ggplot(dataset, aes(x = numeric1, y = numeric2)) + geom_point() + geom_smooth()
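The statistic and the graph go together naturally. A sketch, assuming mtcars as a stand-in dataset with wt and mpg as the two numeric variables:

```r
library(ggplot2)

# Correlation coefficient between the two numeric variables
r <- cor(mtcars$wt, mtcars$mpg)

# Scatterplot with a fitted straight line for the same pair
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

Keep in mind that cor() only measures linear association, so the scatterplot is still worth checking for curvature or outliers.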
bivariate: categorical vs. numeric
how could you use descriptive statistics to look at the relationship? How could you do this?
we can partition the data into different subsets, one for each level of the categorical variable, and compute the mean of the numeric variable within the subset
(using a tibble from library(tidyverse))
* don’t need to know how to write that code
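For reference, the same group means can be computed in base R without the tidyverse. This sketch swaps in tapply() for the tibble approach mentioned on the card, and assumes mtcars as a stand-in dataset:

```r
# Mean of mpg within each level of cyl
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_means
```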
bivariate: numeric vs. categorical
what if you wanted to look at a 3-way interaction between one numeric variable (target variable) and 2 categorical predictors?
use boxplots!
ggplot(dataset, aes(x = categorical1, y = numeric, fill = factor(categorical2))) + geom_boxplot()
bivariate: categorical vs. categorical
How could you use descriptive statistics to evaluate this relationship?
two-way frequency table using the table() function. The first argument gives the rows and the second argument gives the columns.
ex.
table(dataset $ categorical1, dataset $ categorical2)
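A sketch of a two-way table, assuming mtcars as a stand-in dataset with cyl and am as the two categorical variables:

```r
# Rows are cylinder counts, columns are transmission types.
tab <- table(mtcars$cyl, mtcars$am)
tab

# Row proportions: the split of am within each cyl level.
row_props <- prop.table(tab, margin = 1)
```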
bivariate: categorical vs. categorical
how could you use graphical displays? what are the 3 options to display this graph?
split bar charts are a good way to visualize the relationship
ggplot(dataset, aes(x = categorical1, fill = categorical2)) + geom_bar(position = " ….. ")
position can be three things:
position = "fill"
position = "dodge"
position = "stack" (the default value, stacked bars)
fill is usually the most useful for depicting the interplay between two categorical variables
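The three positions can be compared side by side. A sketch, assuming mtcars as a stand-in dataset with cyl on the x axis and am as the fill variable:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

p_stack <- base + geom_bar()                    # default: counts stacked
p_dodge <- base + geom_bar(position = "dodge")  # counts side by side
p_fill  <- base + geom_bar(position = "fill")   # proportions, each bar sums to 1

# Tallest bar in the stacked and filled versions
top_stack <- max(layer_data(p_stack)$ymax)
top_fill  <- max(layer_data(p_fill)$ymax)
```

With position = "fill" every bar has the same height, so differences in the mix of the second variable are easy to see even when group sizes differ.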