Introduction to R Flashcards

This deck/module focuses on mastering key points from Hadley Wickham's introductory R text, R for Data Science (https://r4ds.had.co.nz/introduction.html). The goal is fluid recall that can be deployed while coding.

1
Q

What is Hadley’s model of the tools needed in a typical data science project?

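A

Import, tidy, transform, visualise, model, and communicate, with programming surrounding all of these steps.
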
2
Q

What does it mean to tidy data?

A

Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.

In brief, when your data is tidy, each column is a variable, and each row is an observation.

Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
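
As a small illustration of "each column is a variable, each row is an observation", the sketch below reshapes a made-up table with tidyr's pivot_longer(). The data frame and its column names are hypothetical, invented for this card.

```r
library(tidyr)

# Hypothetical untidy table: the year variable is spread across column names.
untidy <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# Gather the year columns so that each row is one country-year observation.
tidy <- pivot_longer(
  untidy,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
```

After reshaping, country, year and cases are each a column, and every row describes one observation.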

3
Q

What does it mean to transform data?

A

Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
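
The three kinds of transformation can be sketched with dplyr. The trips data frame below is hypothetical, invented for illustration; filter(), mutate() and summarise() are the dplyr verbs covered later in this deck.

```r
library(dplyr)

# Hypothetical trip records, invented for illustration.
trips <- data.frame(
  city     = c("Oslo", "Oslo", "Rome", "Rome"),
  distance = c(10, 30, 20, 60),  # km
  time     = c(0.5, 1, 0.5, 2)   # hours
)

# Narrow in on observations of interest.
oslo <- filter(trips, city == "Oslo")

# Create a new variable as a function of existing variables.
with_speed <- mutate(trips, speed = distance / time)

# Calculate summary statistics.
overall <- summarise(with_speed, n = n(), mean_speed = mean(speed))
```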

4
Q

What is wrangling data?

A

Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight.

5
Q

What are the two engines of knowledge generation?

A

Visualization and Modeling

These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

6
Q

What is a good visualisation?

A

A good visualisation will show you things that you did not expect, or raise new questions about the data.

A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data.

Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.

7
Q

What are models? When should you use them? What can or can’t they do?

A

Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.

Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains!

But every model makes assumptions, and by its very nature a model cannot question its own assumptions.

That means a model cannot fundamentally surprise you.

8
Q

What is the last step of data science?

A

Communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.

9
Q

How does programming tie into data science?

A

Programming is a cross-cutting tool that you use in every part of the project.

You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.

10
Q

What are the two camps analysis can be divided into?

A

Hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis).

11
Q

What are the two key regions in the RStudio interface?

A
12
Q

What is tidyverse?

A

A collection of R packages that share a common philosophy of data and R programming and are designed to work together.

13
Q

What is a package in R?

A

Packages are the fundamental units of reproducible R code.

They include reusable functions, the documentation that describes how to use them, and sample data.

14
Q

How do you install a package? Install the tidyverse package

A

install.packages("tidyverse")

15
Q

What’s the “prompt” in R?

A

Refers to the “>” in the local console

16
Q

When asking a question, what are the three things you need to include to make your example reproducible?

A

required packages, data, and code

17
Q

What is data exploration?

A

Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.

18
Q

What is the goal of data exploration?

A

The goal of data exploration is to generate many promising leads that you can later explore in more depth.

19
Q

What is the grammar of graphics?

A

A coherent system for describing and building graphs

20
Q

What is ggplot2?

A

One of the most elegant and versatile systems for making graphs in R. It implements the grammar of graphics.

It is one of the core packages of the tidyverse.

21
Q

How do you load a library? Use tidyverse as an example

A

library(tidyverse)

22
Q

What must you do with a package every time you start a new R session?

A

Reload the package with library(). Installing is needed only once, but loading is needed in every new session.

23
Q

How do you call up a function from a package? Give an example.

A

package::function()

e.g. ggplot2::ggplot()

24
Q

What is a data frame?

A

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).

25
Q

How do you learn more about a function or a package loaded into R?

A

“?” followed by the name of the function or dataset.

e.g. ?geom_point or ?mpg

26
Q

Provide code for a reusable template for making graphs with ggplot2

A

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
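
As one way to fill in the template, the sketch below uses the mpg data set that ships with ggplot2, geom_point() as the geom function, and displ and hwy as the mapped variables. The object name p is only for illustration.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)  # draws the scatterplot
```

Swapping in a different data set, geom function or set of mappings gives a different graph from the same skeleton.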

27
Q

How do you begin a plot in ggplot?

A

ggplot ( ).

28
Q

What does the function “ggplot()” do?

A

“ggplot()” creates a coordinate system that you can add layers to.

The first argument of ggplot() is the dataset to use in the graph.

On its own, ggplot(data = mpg) creates an empty graph.

29
Q

What does the function “geom_point()” do?

A

Adds a layer of points to your plot, which creates a scatterplot

30
Q

What is the “mapping” argument in a geom function?

A

This defines how variables in your dataset are mapped to visual properties.

31
Q

What is the “mapping” argument always paired with?

A

“aes()”

32
Q

What do the “aes()” arguments “x” and “y” specify?

A

Specify which variables to map to the x and y axes.

ggplot2 looks for the mapped variables in the “data” argument

33
Q

How can you add a third variable to a scatter plot?

A

You can add a third variable, like class, to a two-dimensional scatterplot by mapping it to an aesthetic.

34
Q

What is an aesthetic? Give an example:

A

An aesthetic is a visual property of the objects in your plot.

Aesthetics include things like the size, the shape, or the color of your points.

35
Q

What is scaling?

A

The process of assigning a unique level of the aesthetic (such as a unique color) to each unique value of the variable.

36
Q

What does the “alpha” aesthetic do?

A

Controls the transparency of the points.

37
Q

Give an example using ggplot, the mpg data set, and an aesthetic.

x = displ, y = hwy, alpha/shape = class

A

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

38
Q

What is a common problem regarding the “+” when creating ggplot2 graphics?

A

Putting the + in the wrong place. It has to come at the end of a line, not at the start of the next line.
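
A short sketch of the two placements, using the mpg data set from ggplot2. The incorrect version is left commented out so the block runs as written.

```r
library(ggplot2)

# Correct: the + ends the first line, so R knows the expression continues.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Incorrect (commented out): R treats the first line as a complete
# expression and then fails on the line that starts with +.
# ggplot(data = mpg)
# + geom_point(mapping = aes(x = displ, y = hwy))
```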

39
Q

What are facets?

A

Subplots that each display one subset of the data.

40
Q

What are two types of facet functions?

A

facet_wrap(), to facet a plot by a single variable

facet_grid(), to facet a plot on the combination of two variables
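
A minimal sketch of both facet functions on the mpg data set. The object names base, by_class and by_drv_cyl are only for illustration.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One panel per value of a single variable.
by_class <- base + facet_wrap(~ class, nrow = 2)

# A grid of panels: rows by drv, columns by cyl.
by_drv_cyl <- base + facet_grid(drv ~ cyl)
```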

41
Q

What is a geom?

A

A geom is the geometrical object that a plot uses to represent data.

People often describe plots by the type of geom that the plot uses.

42
Q

How many geoms does ggplot2 have?

A

Over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org for a sampling)

43
Q

How can you display multiple geoms in the same plot?

A

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

44
Q

How can you avoid having to modify each function (and the inherent error potential this creates) when multiple geoms are being used in the same plot?

A

You can avoid this type of repetition by passing a set of mappings to ggplot().

ggplot2 will treat these mappings as global mappings that apply to each geom in the graph.

45
Q

How do you apply a set of mappings to the ggplot() function? Show by example.

A

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +

geom_point() + geom_smooth()

46
Q

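What does the “se” argument in geom_smooth() do?

A

Controls whether the confidence interval around the smooth line is displayed as a shaded band. It defaults to TRUE; se = FALSE removes the shading.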
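
The effect of se can be seen by building the same smoother twice on the mpg data set. This is a sketch; the object names are only for illustration.

```r
library(ggplot2)

# Default (se = TRUE): smooth line with a shaded confidence band.
with_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth()

# se = FALSE: the same smooth line without the band.
no_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(se = FALSE)
```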

47
Q

What is an additional feature seen with bar charts, histograms, and frequency polygons?

A

Bar charts, histograms, and frequency polygons bin your data and then plot bin counts (the number of points that fall in each bin), so the plotted values are computed rather than read directly from the dataset.

48
Q

What do “smoothers” do?

A

Smoothers fit a model to your data and then plot predictions from the model.

49
Q

What is a stat?

A

The algorithm used to calculate new values for a graph. Stat is short for statistical transformation.

50
Q

You can generally use geoms and _____ interchangeably

A

stats

51
Q

Why can you use geoms and stats interchangeably?

A

Because every geom has a default stat and every stat has a default geom.
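
For example, geom_bar() and stat_count() are each other's defaults, so the two calls below should produce the same bar chart. The sketch uses the diamonds data set from ggplot2.

```r
library(ggplot2)

# geom_bar() uses stat_count() by default ...
via_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# ... and stat_count() uses geom_bar() by default.
via_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```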

52
Q

What does stat_summary() do?

A

Summarises the y values for each unique x value, to draw attention to the summary that you’re computing.
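
A sketch of stat_summary() on the diamonds data set, showing the median depth per cut with a line from the minimum to the maximum. The argument names fun, fun.min and fun.max assume ggplot2 3.3.0 or later; older versions used different names.

```r
library(ggplot2)

# For each cut, draw the median depth with a range from min to max.
# fun, fun.min and fun.max assume ggplot2 >= 3.3.0.
p <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
```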

53
Q

How many stats does ggplot2 have? How can you see a complete list?

A

Over 20 stats.

The ggplot2 cheat sheet has the complete list.

54
Q

How can you color a bar chart?

A

You can colour a bar chart using either the colour aesthetic, or, more usefully, fill.

55
Q

With regards to the “fill” aesthetic in bar graphs, what is necessary with regards to the height of the bars?

A

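When a bar chart of proportions uses the fill aesthetic, the heights of the bars need to be normalized (for example, by dividing each count by the total count), because proportions are otherwise computed within each group.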

56
Q

What does the position argument do?

A

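Specifies the position adjustment, which controls how overlapping objects are arranged. In a bar chart with a fill aesthetic, the default adjustment stacks the bars.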

57
Q

How can you modify the position argument so it doesn’t stack the bars?

A

If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

58
Q

What does position = “identity” do?

A

position = “identity” will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them.

59
Q

What does position = “fill” do? When should this be used?

A

position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

60
Q

What does position = “dodge” do in a bar graph?

A

position = “dodge” places overlapping objects directly beside one another. This makes it easier to compare individual values.
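
The “dodge” and “fill” adjustments from the last two cards can be compared on the diamonds data set. This is a sketch; the object names are only for illustration.

```r
library(ggplot2)

# "dodge": bars for each clarity sit side by side within a cut.
dodged <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

# "fill": stacked bars rescaled to the same height, showing proportions.
filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```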

61
Q

What is overplotting?

A

When many observations in a scatterplot have identical or similar values, so their points are drawn on top of each other. Because the values are rounded, the points fall on a grid and many of them overlap.

62
Q

What is one approach to avoid overplotting?

A

You can avoid this gridding by setting the position adjustment to “jitter”. position = “jitter” adds a small amount of random noise to each point.
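
A minimal sketch of the jitter adjustment on the mpg data set. The noise is random, so the exact point positions differ from run to run.

```r
library(ggplot2)

# Each point is shifted by a small random amount, spreading out overlaps.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```

ggplot2 also provides geom_jitter() as a shorthand for geom_point(position = "jitter").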

63
Q

What does coord_flip() do?

A

Switches the x and y axes in a Cartesian coordinate system.
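
One common use is turning vertical boxplots into horizontal ones, sketched here on the mpg data set.

```r
library(ggplot2)

# Horizontal boxplots: handy when category labels are long.
p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```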

64
Q

What does coord_polar() do?

A

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

65
Q

What does geom_abline() do?

A

Adds a reference line defined by a slope and an intercept. With no arguments it draws the line with slope 1 and intercept 0.
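
As a sketch, the plot below compares city and highway mileage in mpg and adds the default reference line, so points above the line have better highway than city mileage. coord_fixed() is added here so the line sits at 45 degrees.

```r
library(ggplot2)

# cty vs hwy with a y = x reference line (slope 1, intercept 0 by default).
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
```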

66
Q

What is the seven-parameter template applicable to most graphs which Hadley suggests?

A

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

67
Q

What insight is the grammar of graphics based on?

A

Based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.

68
Q

What are the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges?

A

Pick observations by their values (filter()).

Reorder the rows (arrange()).

Pick variables by their names (select()).

Create new variables with functions of existing variables (mutate()).

Collapse many values down to a single summary (summarise()).

Together with group_by(), these six functions provide the verbs for a language of data manipulation.

69
Q

What does group_by() allow us to do in conjunction with the five key dplyr functions?

A

Changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
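
The change of scope can be seen by summarising the same data with and without group_by(). The scores data frame is hypothetical, invented for illustration.

```r
library(dplyr)

# Hypothetical data, invented for illustration.
scores <- data.frame(
  team  = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# Without group_by(): one summary row for the whole dataset.
overall <- summarise(scores, mean_score = mean(score))

# With group_by(): one summary row per team.
by_team <- scores %>%
  group_by(team) %>%
  summarise(mean_score = mean(score))
```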

70
Q

What function can we use to avoid the trouble created by floating point numbers?

A

near(x, y), which tests whether two numbers are equal within a small tolerance instead of exactly.
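
A short sketch of the problem and the fix, using dplyr's near(). The variable names exact and approx are only for illustration.

```r
library(dplyr)

# Finite-precision arithmetic makes exact comparison unreliable.
exact <- sqrt(2) ^ 2 == 2        # FALSE

# near() compares within a small tolerance instead.
approx <- near(sqrt(2) ^ 2, 2)   # TRUE
```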
