Chapter 2: Data Visualization Flashcards by Marija Spehar

What is the correct way to make the points blue in a scatterplot?

ggplot(persinj, aes(x = op_time, y = total, color = "blue")) + geom_point()
ggplot(persinj, aes(x = op_time, y = total)) + geom_point(colour = "blue")

The second option is the correct way.

Remember that the aes() function is used to map variables in our dataset to visual properties of the graph. If we were to use the first choice, ggplot would treat "blue" as a data value (a one-level categorical variable) rather than as a colour, so the points would get the default palette colour and a legend labelled "blue".
The mistake is setting an aesthetic to a constant value inside aes(); constants belong in the geom call, outside aes().
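
A minimal sketch of the difference, assuming the ggplot2 package is installed. The built-in mtcars data stands in for persinj (an assumption, since persinj does not ship with R), and layer_data() is used only to inspect the colours ggplot2 actually computed.

```r
library(ggplot2)

# mtcars is a stand-in for persinj (assumption: persinj is not bundled with R)
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = "blue")) + geom_point()
p_set    <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(colour = "blue")

# Colours computed for the point layer of each plot
mapped_cols <- unique(layer_data(p_mapped, 1)$colour)
set_cols    <- unique(layer_data(p_set, 1)$colour)

print(mapped_cols)  # a default palette colour, not "blue"
print(set_cols)     # "blue"
```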

T/F: the aesthetics determine what relationships we want to see in the plot and the geoms determine how we want to see these relationships.

True.

How could we use the “color” aesthetic the right way in a ggplot?

we could map color to a factor variable inside aes(), i.e. color = factor variable, alongside a comparison of two numeric variables
it will make the points on the graph different colours depending on the value of the factor variable

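can you set colour = a categorical variable stored as numbers without converting it to a factor variable first (in the aesthetics mapping)?

no, you have to convert it to a factor using the factor() function, e.g. colour = factor(variable); otherwise ggplot treats it as continuous and uses a colour gradient. (A variable already stored as a factor or character needs no conversion.)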
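
A small sketch of why the conversion matters, assuming ggplot2 is installed. The cyl column of the built-in mtcars data (stored as the numbers 4, 6, 8) is used here as a hypothetical numeric-coded category.

```r
library(ggplot2)

# cyl is numeric in mtcars, so without factor() it gets a continuous colour scale
p_numeric <- ggplot(mtcars, aes(x = wt, y = mpg, colour = cyl)) + geom_point()
p_factor  <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) + geom_point()

# Compare the colours each plot assigns to the points
cols_numeric <- sort(unique(layer_data(p_numeric, 1)$colour))
cols_factor  <- sort(unique(layer_data(p_factor, 1)$colour))

print(cols_numeric)  # shades along one gradient
print(cols_factor)   # one distinct colour per level
```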

what are two common arguments of the geom_smooth() ?

method = "lm" and se = FALSE

what are the 4 common arguments used in geom_point()?

color
alpha = how transparent you want your points to be (0 = fully transparent, 1 = fully opaque)
shape
size

what are the 2 common arguments of geom_bar()?

fill

alpha

what are the 3 common arguments for geom_histogram()?

fill
alpha
bins = number of bins you want to use.

what curve is usually used with geom_point?

geom_smooth(method = "lm", se = FALSE)

you are comparing 3 variables. two are numeric and one is a factor variable.
you want to make the colour of the points different based on the value of the factor variable.
you use geom_point() and geom_smooth() with se bounds.
how could you make sure that the SE bands are filled with the colour of the corresponding factor level (consistent with the points)?

ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical), fill = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)

This gives consistent colouring between the points, the smoothed lines, and the SE bands, because we assigned the factor variable to both colour and fill.

you are comparing 3 variables. two are numeric and one is a factor variable.
you want to make the colour of the points different based on the value of the factor variable.
you use geom_point() and geom_smooth() with an se bound.
how could you make sure that there is only one smoothed line and one se bound for all data points?

ggplot(dataset, aes(x = numeric var, y = numeric var)) + geom_point(aes(color = factor(categorical))) + geom_smooth(method = "lm", se = TRUE)

take the color mapping out of ggplot()
put it into geom_point() as an aesthetic because we want the points to still be different colours
keep se = TRUE in geom_smooth()
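
The two cards above differ only in where the colour mapping lives. A sketch of both versions, assuming ggplot2 is installed and using the built-in mtcars data as a stand-in (wt and mpg as the numeric variables, cyl as the categorical one); formula = y ~ x is spelled out only to silence the message geom_smooth() would otherwise print.

```r
library(ggplot2)

# Mapping in ggplot(): inherited by both geoms, so one line and band per level
p_grouped <- ggplot(mtcars, aes(x = wt, y = mpg,
                                colour = factor(cyl), fill = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# Mapping only in geom_point(): coloured points, but a single line and band
p_single <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# Count the separate smoothed lines in layer 2 (the geom_smooth layer)
n_lines_grouped <- length(unique(layer_data(p_grouped, 2)$group))
n_lines_single  <- length(unique(layer_data(p_single, 2)$group))
print(c(grouped = n_lines_grouped, single = n_lines_single))
```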

T/F: aesthetic mappings in the original ggplot function will not be inherited to all geom functions.

False. Aesthetic mappings in ggplot() are inherited by all geom functions (unless a geom overrides them).

What is faceting? What output does it provide when used? What function is used?

it’s a convenient way to categorize our data into distinct groups based on the value of our categorical predictors.
faceting displays the observations in separate plots produced for each value of the faceting variable placed side-by-side to ease comparison

the function used is facet_wrap()

How do you use the facet_wrap() function?

ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
+ facet_wrap(~ FACET_VAR, ncol = n)

facet_wrap(~ FACET_VAR, ncol = n) is added to the end of a ggplot

by default, all panels are on the same scale so that they are easy to compare
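
A runnable sketch of faceting, assuming ggplot2 is installed. The built-in mtcars data is a stand-in, with cyl playing the role of the faceting variable.

```r
library(ggplot2)

# One panel per value of cyl, laid out in three columns
p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  facet_wrap(~ cyl, ncol = 3)

# Number of panels produced (one per level of the faceting variable)
n_panels <- length(unique(layer_data(p_facet, 1)$PANEL))
print(n_panels)
```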

How do you add titles to a ggplot?

ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)

+ labs(x = "x axis title", y = "y axis title", title = "main title")

using the labs() function (the main title argument is title, not main)

How do you display more than one graph on the page? What package?

library(gridExtra)

grid.arrange(graph1, graph2, …, graph n, ncol = n)
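
A short sketch of the call, assuming both ggplot2 and gridExtra are installed. The two plots are made-up examples built from the built-in mtcars data.

```r
library(ggplot2)
library(gridExtra)

# Two illustrative plots (hypothetical examples, not from the course data)
graph1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
graph2 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10)

# Draws both plots side by side; the combined layout is also returned invisibly
combined <- grid.arrange(graph1, graph2, ncol = 2)
```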

How do you paste a graph into word from R?

right click on the graph > copy > go to word > right click > paste options: picture

how do you alter the coordinates of a ggplot?

ggplot(dataset, aes(x = numeric, y = numeric, color = factor(categorical))) + geom_point() + geom_smooth(method = "lm", se = TRUE)

+ coord_cartesian(ylim = c(number1, number2), xlim = c(400, 1000))

What are the two components of exploratory data analysis?

1. descriptive statistics
2. visual aids (graphs)

univariate data exploration: what 2 aspects of a numeric variable are you able to reveal using summary()

central tendency (mean and median - tell us the "size" of the variable)
dispersion (the spread between the minimum, quartiles, and maximum)

we can also look for skew in the variable's data by comparing the mean and median (a mean well above the median suggests right skew).
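
A base-R sketch of reading central tendency, dispersion, and skew from summary(). The vector x is made-up illustrative data with one large value.

```r
# Made-up data: mostly small values plus one large observation
x <- c(1, 2, 2, 3, 3, 3, 4, 50)

stats <- summary(x)
print(stats)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# A mean well above the median hints at right skew
right_skewed <- stats[["Mean"]] > stats[["Median"]]
print(right_skewed)
```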

Univariate data exploration: what kind of graphs are used when exploring numeric variables?

Histograms: provide a visual summary of the count or relative frequency in each bin. we can learn about the shape of the distribution of a numeric variable

Boxplots: visualize the distribution of a numeric variable.
- boxplots don’t directly show the actual shape of the variable’s distribution, but they offer a useful graphical summary of the key numeric statistics and allow for a visual comparison of the distributions of different numeric variables (relative magnitude of their medians and quartiles)

if you want to compare the values of a numeric variable across different levels of a categorical predictor, what kind of graphical display could you use?

ggplot(dataset, x = factor(

How could you calculate the summary statistics of a numeric variable for two groups of observations?

ex. separating the values of another numeric variable; you want the summary stats for y based on if x > 50 or x < 50?
ex. separating the values of a categorical predictor?

categorical:

dataset. value1 #, ]
dataset. under

what should we do if we have a skewed target variable?

transformation. apply a monotonic concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable

what kind of transformations can be done on a skewed target variable?

1. log-transformation (can only be done to strictly positive data)
2. sqrt (similar to the log transformation)
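
A base-R sketch of both transformations on made-up, strictly positive data. The gap between mean and median is used here as a rough, informal indicator of skew; that choice is an assumption for illustration, not a formal skewness measure.

```r
# Made-up right-skewed data, strictly positive so log() is defined
x <- c(1, 2, 2, 3, 3, 3, 4, 50)

log_x  <- log(x)
sqrt_x <- sqrt(x)

# Informal skew indicator: how far the mean sits above the median
gap <- function(v) mean(v) - median(v)
print(c(raw = gap(x), log = gap(log_x), sqrt = gap(sqrt_x)))

# Both functions are monotonic, so the ranks of the observations are preserved
ranks_kept <- identical(rank(x), rank(log_x)) && identical(rank(x), rank(sqrt_x))
print(ranks_kept)
```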

how do we deal with natural outliers? (4 ways)

1. remove: if we know that removing them will not have a material effect
2. ignore: if the outliers make up an insignificant proportion of the data and are unlikely to create bias, then keep them
3. modify: modify them to make them more reasonable
4. use robust model forms: instead of minimizing the squared error between predicted and observed values, we could minimize the absolute error, which places a lot less weight on the observations that are outliers

what could you use to look at the descriptive statistics for a categorical variable?

frequency tables.
table(dataset$categorical) / nrow(dataset)
this will produce a relative frequency table (the proportion of observations in each level), which tells you the predominant level of the categorical predictor.
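
A base-R sketch of the frequency table, using the cyl column of the built-in mtcars data as a stand-in categorical variable.

```r
# Counts per level, then proportions of all rows
counts <- table(mtcars$cyl)
props  <- counts / nrow(mtcars)
print(props)

# The predominant level is the one with the largest proportion
predominant <- names(which.max(props))
print(predominant)
```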

How could you graphically explore a categorical variable?

bar charts: extract the information in a frequency table and present the numeric counts visually

what is bivariate data exploration used for?

relationships, patterns, outliers

bivariate: numeric vs. numeric | what kind of descriptive statistics are to be used?

correlation coefficient! | use the cor() function
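
A one-line illustration of cor(), using two numeric columns of the built-in mtcars data as stand-ins.

```r
# Pearson correlation between car weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
print(r)  # negative: heavier cars tend to have lower mpg
```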

bivariate: numeric vs. numeric | what kind of graph would you use to evaluate the relationship?

scatterplot
ggplot(dataset, aes(x = numeric1, y = numeric2)) + geom_point() + geom_smooth()

bivariate: categorical vs. numeric | how could you use descriptive statistics to look at the relationship? How could you do this?

we can partition the data into different subsets, one for each level of the categorical variable, and compute the mean of the numeric variable within each subset (using a tibble from library(tidyverse); you don't need to know how to write that code)
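
For reference only, one possible way to write that grouped summary, assuming the dplyr package (part of the tidyverse) is installed. The built-in mtcars data is a stand-in, and this is a sketch rather than the course's own code.

```r
library(dplyr)

# Mean of the numeric variable (mpg) within each level of the categorical one (cyl)
group_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

print(group_means)
```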

bivariate: numeric vs. categorical | what if you wanted to look at a 3-way interaction between one numeric variable (target variable) and 2 categorical predictors?

use boxplots!
ggplot(dataset, aes(x = categorical1, y = numeric, fill = factor(categorical2))) + geom_boxplot()

bivariate: categorical vs. categorical | How could you use descriptive statistics to evaluate this relationship?

two-way frequency table using the table() function. the first argument gives the rows and the second argument gives the columns
ex. table(dataset$categorical1, dataset$categorical2)
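
A base-R sketch of a two-way table, using cyl and am from the built-in mtcars data as stand-in categorical variables.

```r
# Rows come from the first argument (cyl), columns from the second (am)
tab <- table(mtcars$cyl, mtcars$am)
print(tab)
print(dim(tab))  # number of row levels, number of column levels
```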

bivariate: categorical vs. categorical | how could you use graphical displays? what are the 3 options to display this graph?

split bar charts are a good way to visualize the relationship
ggplot(dataset, aes(x = categorical1, fill = categorical2)) + geom_bar(position = " ..... ")
position can be three things:
position = "stack" (the default)
position = "dodge"
position = "fill"
fill is usually the most useful for depicting the interplay between two categorical variables
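
A sketch of the three position options, assuming ggplot2 is installed. cyl and am from the built-in mtcars data stand in for the two categorical variables.

```r
library(ggplot2)

# Shared base: one bar per cyl level, split by am
base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

p_stack <- base + geom_bar()                    # default: counts stacked
p_dodge <- base + geom_bar(position = "dodge")  # counts side by side
p_fill  <- base + geom_bar(position = "fill")   # proportions within each bar

# With position = "fill", every bar is rescaled to a total height of 1
top_of_bars <- max(layer_data(p_fill, 1)$ymax)
print(top_of_bars)
```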


(35 cards)