R programming Flashcards

1
Q

import

A

take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.

2
Q

tidy

A

storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.

3
Q

transform

A

Narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).

4
Q

wrangling

A

tidying and transforming

5
Q

two main engines of knowledge generation:

A

visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

6
Q

visualization

A

A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data. Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.

7
Q

models

A

complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

8
Q

communication

A

last step, communicate results

9
Q

prompt and comment

A

This is a segment of prompt:
> text
>
This is a comment:
#comment here

10
Q

data exploration

A

the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.

11
Q

R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.

A

ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.

12
Q

geom

A

geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
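For example, a minimal sketch (assuming ggplot2 and its built-in mpg data): the same variables drawn with two different geoms.

library(ggplot2)

# point geom: a scatterplot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# smooth geom: a fitted line
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))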

13
Q

facet

A

can make multi-panel plots and control how the scales of one panel relate to the scales of another.
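A minimal sketch, assuming ggplot2 and its mpg data: facet_wrap() facets by a single variable, facet_grid() by two.

library(ggplot2)

# one panel per class of car
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# rows by drive train, columns by number of cylinders
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)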

14
Q

stat

A

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

15
Q

geom_bar and the data set

A
  1. geom_bar() begins with the data set
  2. geom_bar() transforms the data with the count stat, which returns a data set of x values and counts
  3. geom_bar() uses the transformed data to build the plot: the x values are mapped to the x-axis and the counts to the y-axis
16
Q

geom_bar() can be used interchangeably with another function; both produce the same plot:

A

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))

17
Q

This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:

A

You might want to override the default stat. In the code below, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.

demo <- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))
#> Warning: stat(prop) was deprecated in ggplot2 3.4.0.
#> ℹ Please use after_stat(prop) instead.

You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:

ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)

18
Q

color gradient on bar chart:
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.

A

position = “identity” will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.
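A sketch of both workarounds, assuming ggplot2 and its diamonds data:

library(ggplot2)

# slightly transparent bars
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

# completely transparent bars, outlined by clarity instead
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
  geom_bar(fill = NA, position = "identity")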

19
Q

overplotting

A

The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. This arrangement makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?

You can avoid this gridding by setting the position adjustment to “jitter”. position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

20
Q

coord_flip()

A

switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.
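A minimal sketch, assuming ggplot2 and its mpg data:

library(ggplot2)

# horizontal boxplots: the long class labels end up on the y-axis
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()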

21
Q

coord_quickmap()

A

sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2

22
Q

coord_polar()

A

uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
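A sketch, assuming ggplot2 and its diamonds data: the same bar layer rendered in Cartesian and polar coordinates.

library(ggplot2)

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()   # horizontal bar chart
bar + coord_polar()  # Coxcomb chart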

23
Q

assignment statements

A

object <- val

24
Q

Object names are snake_case

A

i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention

25
Q

calling functions

A

function_name(arg1 = val1, arg2 = val2, …)

26
Q

(seq)uences
seq() function

A

makes regular sequences of numbers

seq(1, 10)

27
Q

TAB button

A

keyboard shortcut to find possible completions of a function

28
Q

grammar of graphics

A

coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.

29
Q

int
dbl
chr
dttm

A

integer
doubles/ real nums
character vector, string
date + time

30
Q

lgl

A

logical, vector of TRUE/FALSE

31
Q

fctr

A

factors, represent categorical vars with fixed possible values

32
Q

date

A

stands for dates

33
Q

five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

A

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

The first argument is a data frame.

The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
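A minimal sketch of that shared interface, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

# pick observations by their values
filter(flights, month == 1, day == 1)

# reorder rows, then pick variables by name
flights %>%
  arrange(desc(dep_delay)) %>%
  select(year:day, dep_delay, carrier)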

34
Q

Comparisons

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).

A

A common problem you might encounter when using ==: floating point numbers. These results might surprise you!

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

35
Q

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near()

A

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE

36
Q

Sometimes you can simplify complicated subsetting by remembering De Morgan’s law:

As well as & and |, R also has && and ||

A

!(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

37
Q

Missing values or NA’s

A

If you want to determine if a value is missing, use is.na():

x <- NA
is.na(x)
#> [1] TRUE

38
Q

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

A

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

filter(df, is.na(x) | x > 1)

39
Q

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

A

It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)

40
Q

desc()

A

re-order by column in descending order
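A sketch, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

# most-delayed departures first
arrange(flights, desc(dep_delay))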

41
Q

select()

A

rapidly zoom in on a useful subset using operations based on the names of the variables.

42
Q

There are a number of helper functions you can use within select():

A

starts_with("abc"): matches names that begin with "abc".

ends_with("xyz"): matches names that end with "xyz".

contains("ijk"): matches names that contain "ijk".

matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings.

num_range("x", 1:3): matches x1, x2 and x3.

43
Q

rename()

A

A variant of select() that renames variables while keeping all the variables that aren't explicitly mentioned.
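A sketch, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

# rename tailnum to tail_num; every other column is kept
rename(flights, tail_num = tailnum)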

45
Q

use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

A

select(flights, time_hour, air_time, everything())

46
Q

mutate()

A

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

mutate() always adds new columns at the end of your dataset
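A minimal sketch, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)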

47
Q

View()

A

see all columns

48
Q

transmute()

A

to only keep the new variables created for the data set

transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)

49
Q

There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

A

Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.

Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:

transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)

50
Q

Logs

log2() is recommended because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.

A

log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.

51
Q

Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).

A

They are most useful in conjunction with group_by(), which you’ll learn about shortly.

(x <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
#> [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
#> [1] 2 3 4 5 6 7 8 9 10 NA

52
Q

Cumulative and rolling aggregates

A

R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

x
#> [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
#> [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

53
Q

Logical comparisons

A

<, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

54
Q

Ranking

A

there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1] 1 2 2 NA 4 5
min_rank(desc(y))
#> [1] 5 3 3 NA 2 1

If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()

row_number(y)
#> [1] 1 2 3 NA 4 5
dense_rank(y)
#> [1] 1 2 2 NA 3 4
percent_rank(y)
#> [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
#> [1] 0.2 0.6 0.6 NA 0.8 1.0

55
Q

summarize()

A
  • collapses a data frame to a single row
  • not very useful on its own
  • used with group_by() to change the unit of analysis from the complete dataset to individual groups
  • dplyr verbs used on a grouped data frame are applied group by group (see the sketch below)
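A minimal sketch, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

# on its own, summarise() collapses flights to a single row
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# with group_by(), the same call returns one row per day
flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))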
56
Q

Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).

A

That way you can check that you’re not drawing conclusions based on very small amounts of data.

57
Q

Measure of location-
mean(x) - sum / length
median(x) - 50% of x is above it, 50% below it

A

It’s sometimes useful to combine aggregation with logical subsetting.

58
Q

Measures of spread:
sd(x)
IQR(x)
mad(x)

A

The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))

59
Q

Measure of rank
min(x)
quantile(x, 0.25)
max(x)

A

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)

60
Q

Measure of position:
first(x)
nth(x, 2)
last(x)

A

These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)

These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))

61
Q

Counts

A

n(), no arguments, return size of current group

To count number of non-missing values, use sum(!is.na(x)).

To count the number of distinct (unique) values, use n_distinct(x)

not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))

dplyr provides a simple helper if all you want is a count:
not_cancelled %>%
count(dest)

You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

not_cancelled %>%
count(tailnum, wt = distance)

62
Q

Counts and proportions of logical values:
sum(x > 10)
mean(y == 0)

A

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

# How many flights left before 5am? (these usually
# indicate delayed flights from the previous day)

not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))

# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60))

63
Q

Grouping by multiple variables

Be careful when progressively rolling up summaries: it’s OK for sums and counts, but you need to think about weighting means and variances, and it’s not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.

A

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:

daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))

(per_month <- summarise(per_day, flights = sum(flights)))

(per_year <- summarise(per_month, flights = sum(flights)))

64
Q

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

A

daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights

65
Q

grouped mutates and filters:

A

Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter()
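A sketch, assuming dplyr and the flights data from nycflights13:

library(dplyr)
library(nycflights13)

# grouped mutate: each flight's delay as a share of its day's worst delay
flights %>%
  group_by(year, month, day) %>%
  mutate(prop_of_worst = arr_delay / max(arr_delay, na.rm = TRUE))

# grouped filter: keep the 10 most-delayed flights of each day
flights %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)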

66
Q

Pipe ( %>% )
lhs %>% rhs

A

lhs - value or magrittr placeholder

rhs - function call using magrittr semantics

When functions require only one argument, x %>% f is equivalent to f(x) (not exactly equivalent; see technical note below.)

Placing lhs as the first argument in rhs call
The default behavior of %>% when multiple arguments are required in the rhs call, is to place lhs as the first argument, i.e. x %>% f(y) is equivalent to f(x, y).

Placing lhs elsewhere in rhs call
Often you will want lhs to go to the rhs call at a position other than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).

Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, lhs will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly more compact. It is possible to overrule this behavior by enclosing the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is equivalent to c(min(1:10), max(1:10)).

Using %>% with call- or function-producing rhs
It is possible to force evaluation of rhs before the piping of lhs takes place. This is useful when rhs produces the relevant call or function. To evaluate rhs first, enclose it in parentheses, i.e. a %>% (function(x) x^2), and 1:10 %>% (call(“sum”)). Another example where this is relevant is for reference class methods which are accessed using the $ operator, where one would do x %>% (rc$f), and not x %>% rc$f.

Using lambda expressions with %>%
Each rhs is essentially a one-expression body of a unary function. Therefore defining lambdas in magrittr is very natural, and as the definitions of regular functions: if more than a single expression is needed one encloses the body in a pair of braces, { rhs }. However, note that within braces there is no “first-argument rule”: it will be exactly like writing a unary function where the argument name is “.” (the dot).

67
Q

The magrittr pipe operators use non-standard evaluation. They capture their inputs and examine them to figure out how to proceed. First a function is produced from all of the individual right-hand side expressions, and then the result is obtained by applying this function to the left-hand side.

A

For most purposes, one can disregard the subtle aspects of magrittr’s evaluation, but some functions may capture their calling environment, and thus using the operators will not be exactly equivalent to the “standard call” without pipe-operators.

68
Q

A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it’s hard to check that you’ve done the manipulation correctly.

A

Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). You can learn more about useful window functions in the corresponding vignette: vignette(“window-functions”).

69
Q

pipe operation sequence

A

df %>%
do_this_operation %>%
then_do_this_operation %>%
then_do_this_operation …

70
Q

Pipe operator to summarize one variable

A

summarize mean mpg grouped by cyl

mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg))

71
Q

Use Pipe Operator to Group & Summarize Multiple Variables
The following code shows how to use the pipe (%>%) operator to group by the cyl and am variables, and then summarize the mean of the mpg variable and the standard deviation of the hp variable:

A

mtcars %>%
group_by(cyl, am) %>%
summarise(mean_mpg = mean(mpg),
sd_hp = sd(hp))

72
Q

Use Pipe Operator to Create New Variables
The following code shows how to use the pipe (%>%) operator along with the mutate function from the dplyr package to create two new variables in the mtcars data frame:

A

add two new variables in mtcars

new_mtcars <- mtcars %>%
mutate(mpg2 = mpg*2,
mpg_root = sqrt(mpg))

head(new_mtcars)

73
Q

EDA (Exploratory Data Analysis) is an iterative cycle.

A

Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.

Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

74
Q

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.

A

To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

75
Q

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.

A

You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

76
Q

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

A

What type of variation occurs within my variables?

What type of covariation occurs between my variables?

77
Q

variable

A

quantity, quality, or property that you can measure.

78
Q

value

A

the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

79
Q

observation

A

set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.

80
Q

Tabular data

A

set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

81
Q

Variation

A

the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.

82
Q

How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is categorical if it can only take one of a small set of values.

A

In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

83
Q

A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables.

A

To examine the distribution of a continuous variable, use a histogram
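A minimal sketch, assuming ggplot2 and its diamonds data:

library(ggplot2)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)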

84
Q

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.

A

You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.

85
Q

If you wish to overlay multiple histograms in the same plot, I recommend using geom_freqpoly() instead of geom_histogram().

A

geom_freqpoly() performs the same calculation as geom_histogram(), but instead of displaying the counts with bars, uses lines instead. It’s much easier to understand overlapping lines than bars.

86
Q

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

A

Which values are the most common? Why?

Which values are rare? Why? Does that match your expectations?

Can you see any unusual patterns? What might explain them?

87
Q

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science.

A

When you have a lot of data, outliers are sometimes difficult to see in a histogram.

88
Q

(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.)

A

This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr

89
Q

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification.

A

You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.

90
Q

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

A

Drop the entire row with the strange values. This is not recommended: one invalid measurement doesn't mean every measurement in that observation is invalid, and with low-quality data this approach can leave you with very little data.

Replace the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA.
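A sketch of the second option, assuming dplyr and the diamonds data from ggplot2 (y here is the width of the diamond in mm):

library(dplyr)
library(ggplot2)  # for the diamonds data

# replace implausible widths with NA instead of dropping whole rows
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))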

91
Q

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false. Alternatively to ifelse, use dplyr::case_when(). case_when() is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.

A

Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed

92
Q

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way.

A

The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

93
Q

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count.

A

That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape.

94
Q

To make the comparison easier we need to swap what is displayed on the y-axis.

A

Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

95
Q

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

A

A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.

Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.

A line (or whisker) that extends from each end of the box and goes to the
farthest non-outlier point in the distribution.

96
Q

Scatterplots become less useful as the size of your dataset grows,

A

because points begin to overplot, and pile up into areas of uniform black

97
Q

Using transparency can be challenging for very large datasets. Another solution is to bin. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension.

A

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins.
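A sketch, assuming ggplot2 (and the hexbin package for geom_hex()) with the diamonds data:

library(ggplot2)

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# requires the hexbin package to be installed
ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))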

98
Q

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

A

Could this pattern be due to coincidence (i.e. random chance)?

How can you describe the relationship implied by the pattern?

How strong is the relationship implied by the pattern?

What other variables might affect the relationship?

Does the relationship change if you look at individual subgroups of the data?

99
Q

Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.

A

If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

100
Q

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.

A

It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.

101
Q

The first two arguments to ggplot() are data and mapping, and the first two arguments to aes() are x and y. In the remainder of the book, we won’t supply those names.

A

That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots.

102
Q

Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from %>% to +. I wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.

A

diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()

103
Q

R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.

A

RStudio shows your current working directory at the top of the console

And you can print this out in R code by running getwd()

104
Q

Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows. There are three chief ways in which they differ:

A

The most important difference is how you separate the components of the path. Mac and Linux uses slashes (e.g. plots/diamonds.pdf) and Windows uses backslashes (e.g. plots\diamonds.pdf). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.

Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different. In Windows they start with a drive letter (e.g. C:) or two backslashes (e.g. \servername) and in Mac/Linux they start with a slash “/” (e.g. /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.

The last minor difference is the place that ~ points to. ~ is a convenient shortcut to your home directory. Windows doesn’t really have the notion of a home directory, so it instead points to your documents directory.

105
Q

R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.

A

Click File > New Project > New Directory > New Project

106
Q

run the complete script which will save a PDF and CSV file into your project directory.

A

library(tidyverse)

ggplot(diamonds, aes(carat, price)) +
geom_hex()
ggsave("diamonds.pdf")

write_csv(diamonds, "diamonds.csv")

107
Q

Tibbles

A

data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.

The tibble package provides opinionated data frames that make working in the tidyverse a little easier. In most places, I’ll use the term tibble and data frame interchangeably; when I want to draw particular attention to R’s built-in data frame, I’ll call them data.frames.

108
Q

Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():

A

as_tibble(iris)

109
Q

You can create a new tibble from individual vectors with tibble(). tibble() will automatically recycle inputs of length 1, and allows you to refer to variables that you just created

A

tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
#> # A tibble: 5 × 3
#> x y z
#> <int> <dbl> <dbl>
#> 1 1 1 2
#> 2 2 1 5
#> 3 3 1 10
#> 4 4 1 17
#> 5     5     1    26

110
Q

If you’re already familiar with data.frame(), note that tibble() does much less: it never changes the type of the inputs

A

(e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.

111
Q

It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:

You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.

A

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
#> # A tibble: 1 × 3
#>   `:)`  ` `   `2000`
#>   <chr> <chr> <chr>
#> 1 smile space number

112
Q

Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

A

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 × 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5

113
Q

There are two main differences in the usage of a tibble vs. a classic data.frame:

A

printing and subsetting.

114
Q

Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str()

A

tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 × 5
#> a b c d e
#> <dttm> <date> <int> <dbl> <chr>
#> 1 2022-11-18 23:15:18 2022-11-25 1 0.368 n
#> 2 2022-11-19 17:20:28 2022-11-30 2 0.612 l
#> 3 2022-11-19 11:44:07 2022-12-10 3 0.415 p
#> 4 2022-11-19 01:05:24 2022-12-09 4 0.212 m
#> 5 2022-11-18 21:29:41 2022-12-06 5 0.733 i
#> 6 2022-11-19 08:30:38 2022-12-02 6 0.460 n
#> # … with 994 more rows

115
Q

Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.

First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:

nycflights13::flights %>%
print(n = 10, width = Inf)

A

You can also control the default print behaviour by setting options:

options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.

Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.

116
Q

use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

A

nycflights13::flights %>%
View()

117
Q

Subsetting
If you want to pull out a single variable, you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

A

df <- tibble(
x = runif(5),
y = rnorm(5)
)

Extract by name
df$x
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161
df[[“x”]]
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161

Extract by position
df[[1]]
#> [1] 0.73296674 0.23436542 0.66035540 0.03285612 0.46049161

118
Q

Compared to a data.frame, tibbles are more strict:

A

they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.

119
Q

Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:

A

class(as.data.frame(tb))
#> [1] "data.frame"

120
Q

The main reason that some older functions don’t work with tibble is the [ function. We don’t use [ much because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code

A

With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.

121
Q

readr functions:

read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.

A

read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.

read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)

122
Q

The first argument to read_csv() is the most important: it’s the path to the file to read.

When you run read_csv() it prints out a column specification that gives the name and type of each column. That’s an important part of readr

A

heights <- read_csv("data/heights.csv")

123
Q

You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:

A

read_csv("a,b,c
1,2,3
4,5,6")

124
Q

read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:

A

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = “#” to drop all lines that start with (e.g.) #.

read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("1,2,3\n4,5,6", col_names = FALSE)

125
Q

“\n” is a convenient shortcut for adding a new line.

A

Alternatively you can pass col_names a character vector which will be used as the column names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

126
Q

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

A

read_csv("a,b,c\n1,2,.", na = ".")

127
Q

There are a few good reasons to favour readr functions over the base equivalents:

A

They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.

They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.

They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.

128
Q

parse_*() functions

These functions take a character vector and return a more specialised vector like a logical, integer, or date:

A

These functions are useful in their own right, but are also an important building block for readr.

Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing
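A minimal sketch, assuming readr is loaded:

library(readr)

str(parse_logical(c("TRUE", "FALSE", "NA")))
#>  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
#>  int [1:3] 1 2 3

# the na argument marks strings to treat as missing
parse_integer(c("1", "231", ".", "456"), na = ".")
#> [1]   1 231  NA 456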

129
Q

If there are many parsing failures, you’ll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.

A

problems(x)

130
Q

Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:

A

parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.

parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.

parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.

parse_factor() create factors, the data structure that R uses to represent categorical variables with fixed and known values.

parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.

131
Q

It seems like it should be straightforward to parse a number, but three problems make it tricky:

A

People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

132
Q

People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

A

readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument:

parse_double("1.23")
#> [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

133
Q

readr’s default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country

A

parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

parse_number("$100")
#> [1] 100
parse_number("20%")
#> [1] 20
parse_number("It cost $123.45")
#> [1] 123.45

134
Q

The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:

A

# Used in America
parse_number("$123,456,789")
#> [1] 123456789

# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
#> [1] 123456789

# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
#> [1] 123456789

135
Q

there are multiple ways to represent the same string. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

A

charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79

Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the American Standard Code for Information Interchange.

136
Q

How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.

A

guess_encoding(charToRaw(x1))

The first argument to guess_encoding() can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
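For example (a minimal sketch; x1 here is a hypothetical Latin-1 encoded string, since the vector referenced above is not defined on this card):

x1 <- "El Ni\xf1o was particularly bad this year"  # hypothetical string; \xf1 is "ñ" in Latin-1
guess_encoding(charToRaw(x1))
parse_character(x1, locale = locale(encoding = "Latin1"))
#> [1] "El Niño was particularly bad this year"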

137
Q

R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present

A

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
#> Warning: 1 parsing failure.
#> row col           expected   actual
#>   3  -- value in level set bananana
#> [1] apple  banana <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr>
#> 1     3    NA value in level set bananana
#> Levels: apple banana

138
Q

You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:

A

parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

parse_date() expects a four digit year, a - or /, the month, a - or /, then the day

parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier
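A quick sketch of these defaults (assuming readr is loaded):

parse_datetime("2010-10-01T2010")
#> [1] "2010-10-01 20:10:00 UTC"
parse_date("2010-10-01")
#> [1] "2010-10-01"
parse_time("01:10 am")
#> 01:10:00
parse_time("20:10:01")
#> 20:10:01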

139
Q

Year

A

%Y (4 digits).

%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

140
Q

Month

A

%m (2 digits).

%b (abbreviated name, like “Jan”).

%B (full name, “January”).

141
Q

Day

A

%d (2 digits).

%e (optional leading space).

142
Q

Time

A

%H 0-23 hour.

%I 0-12, must be used with %p.

%p AM/PM indicator.

%M minutes.

%S integer seconds.

%OS real seconds.

%Z Time zone (as name, e.g. America/Chicago). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We’ll come back to time zones later.

%z (as offset from UTC, e.g. +0800).

143
Q

Non-digits

A

%. skips one non-digit character.

%* skips any number of non-digits.
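A small sketch of how these specifiers combine (month names assume the default English locale):

parse_date("01/02/15", "%m/%d/%y")
#> [1] "2015-01-02"
parse_date("2015-Mar-07", "%Y-%b-%d")
#> [1] "2015-03-07"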

144
Q

If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().

A

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"

145
Q

readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess each column’s type.

A

You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess() which uses that guess to parse the column
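For example (exact guesses can differ slightly between readr versions):

guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "double"
str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"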

146
Q

The heuristic tries each of the following types, stopping when it finds a match:

A

logical: contains only “F”, “T”, “FALSE”, or “TRUE”.

integer: contains only numeric characters (and -).

double: contains only valid doubles (including numbers like 4.5e-5).

number: contains valid doubles with the grouping mark inside.

time: matches the default time_format.

date: matches the default date_format.

date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.

147
Q

defaults don’t always work for larger files. There are two basic problems:

The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.

The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a logical vector, whereas you probably want to parse it as something more specific.

A

readr contains a challenging CSV that illustrates both of these problems:

challenge <- read_csv(readr_example("challenge.csv"))

148
Q

There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It’s always a good idea to explicitly pull out the problems(), so you can explore them in more depth:

A

problems(challenge)
#> # A tibble: 0 × 5
#> # … with 5 variables: row <int>, col <int>, expected <chr>,
#> #   actual <chr>, file <chr>

149
Q

A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the y column. If we look at the last few rows, you’ll see that they’re dates stored in a character vector:

A

tail(challenge)
#> # A tibble: 6 × 2
#>       x y
#>   <dbl> <date>
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

That suggests we need to use a date parser instead. Start from the column specification that readr guessed, where y is parsed as a logical:

challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_logical()
  )
)

Then fix the type of the y column by specifying that y is a date column:

challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
tail(challenge)

150
Q

Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already

A

you use col_xyz() when you want to tell readr how to load the data.

151
Q

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:

A

Always encoding strings in UTF-8.

Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

152
Q

If you want to export a csv file to Excel, use write_excel_csv() — this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.

A

The most important arguments are x (the data frame to save), and path (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.

write_csv(challenge, "challenge.csv")

153
Q

Note that the type information is lost when you save to csv:

A

This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load them in. There are two alternatives:

write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS

The feather package implements a fast binary file format that can be shared across programming languages. Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you’ll learn about in the many models chapter); feather currently does not.
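A minimal sketch of both alternatives (the feather package must be installed separately):

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")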

154
Q

haven

A

reads SPSS, Stata, and SAS files.

155
Q

readxl

A

reads excel files (both .xls and .xlsx).

156
Q

DBI

A

along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

157
Q

For hierarchical data: use jsonlite

A

for JSON, and xml2 for XML.

158
Q

There are three interrelated rules which make a dataset tidy:

A

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

159
Q

Why ensure that your data is tidy? There are two main advantages:

A

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

160
Q

The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:

A

One variable might be spread across multiple columns.

One observation might be scattered across multiple rows.

Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: pivot_longer() and pivot_wider().

161
Q

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

A

pivot the offending columns into a new pair of variables. To describe that operation we need three parameters: the set of columns whose names are values rather than variables, the name of the variable to move the column names into (names_to), and the name of the variable to move the column values into (values_to).
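For example, with tidyr’s built-in table4a dataset, where the columns 1999 and 2000 are really values of a year variable (a sketch, assuming the tidyverse is loaded):

table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")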

162
Q

pivot_wider() is the opposite of pivot_longer()

A

You use it when an observation is scattered across multiple rows.

163
Q

To tidy this up, we first analyse the representation in similar way to pivot_longer(). This time, however, we only need two parameters:

A

The column to take variable names from. Here, it’s type.

The column to take values from. Here it’s count.
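For example, with tidyr’s built-in table2 dataset (a sketch, assuming the tidyverse is loaded):

table2 %>%
  pivot_wider(names_from = type, values_from = count)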

164
Q

pivot_wider() and pivot_longer() are complements.

A

pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.

165
Q

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.

A

By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter).
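For example, with tidyr’s built-in table3 dataset, where the rate column holds values like "745/19987071" (a sketch):

table3 %>%
  separate(rate, into = c("cases", "population"))

# Being explicit about the separator, and converting the new columns to numbers:
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)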

166
Q

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

A

Explicitly, i.e. flagged with NA.

Implicitly, i.e. simply not present in the data.
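A small sketch with a made-up stocks tibble (assuming the tidyverse is loaded): the 2015 Q4 return is explicitly missing, the 2016 Q1 observation is implicitly missing, and complete() turns implicit missing values into explicit ones.

# made-up example data
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks %>%
  complete(year, qtr)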

167
Q

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.

A

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)

default.stringsAsFactors()  # this is deprecated!
Arguments

…
these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.

row.names
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.

check.rows
if TRUE then the rows are checked for consistency of length and names.

check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.

fix.empty.names
logical indicating if arguments which are “unnamed” (in the sense of not being formally called as someName = arg) get an automatically constructed name or rather name “”. Needs to be set to FALSE even when check.names is false if “” names should be kept.

stringsAsFactors
logical: should character vectors be converted to factors? The ‘factory-fresh’ default has been TRUE previously but has been changed to FALSE for R 4.0.0.

168
Q

multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

A

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.

169
Q

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

A

Mutating joins, which add new variables to one data frame from matching observations in another.

Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

Set operations, which treat observations as if they were set elements.

170
Q

The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.

A

Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.

171
Q

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation. For example, each plane is uniquely identified by its tailnum.

A

In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.

172
Q

primary key

A

uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.

173
Q

foreign key

A

uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.

174
Q

A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airports table.

A

Once you’ve identified the primary keys in your tables, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one
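For example, with the nycflights13 tables (a sketch; an empty result means the key really is unique):

planes %>%
  count(tailnum) %>%
  filter(n > 1)

weather %>%
  count(year, month, day, hour, origin) %>%
  filter(n > 1)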

175
Q

surrogate key

A

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data.
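A minimal sketch with nycflights13 (flight_id is a hypothetical name for the new surrogate key):

flights %>%
  mutate(flight_id = row_number()) %>%  # flight_id is a hypothetical surrogate key
  select(flight_id, everything())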

176
Q

primary key and the corresponding foreign key in another table form a relation. Relations are typically one-to-many.

A

you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.

177
Q

mutating join

A

allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out.

178
Q

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

A

A left join keeps all observations in x.

A right join keeps all observations in y.

A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.

179
Q

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match.

A

The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
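For example, with nycflights13 (a sketch): keep a few columns of flights, then look up the full airline names.

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>%
  left_join(airlines, by = "carrier")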

180
Q

The default, by = NULL, uses all variables that appear in both tables, the so called natural join.

A named character vector: by = c(“a” = “b”). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

A

A character vector, by = “x”. This is like a natural join, but uses only some of the common variables.

181
Q

Joining different variables between the tables, e.g. inner_join(x, y, by = c(“a” = “b”)) uses a slightly different syntax in SQL: SELECT * FROM x INNER JOIN y ON x.a = y.b.

A

As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).

182
Q

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

A

semi_join(x, y) keeps all observations in x that have a match in y.

anti_join(x, y) drops all observations in x that have a match in y.
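For example, with nycflights13 (a sketch):

# Keep only flights that go to one of the ten most popular destinations:
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)

flights %>%
  semi_join(top_dest, by = "dest")

# Find flights whose tailnum has no match in planes:
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)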

183
Q

Join problems

A

Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier. If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that’s unique in your current data but the relationship might not be true in general.

Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!

Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join(). It’s common for keys not to match because of data entry errors. Fixing these is often a lot of work.

If you do have missing keys, you’ll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don’t have a match.

184
Q

Set Operations
All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:

A

intersect(x, y): return only observations in both x and y.

union(x, y): return unique observations in x and y.

setdiff(x, y): return observations in x, but not in y.
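A tiny sketch with two made-up data frames:

# made-up example data
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)  # the single row shared by both: x = 1, y = 1
union(df1, df2)      # the three unique rows
setdiff(df1, df2)    # the row only in df1: x = 2, y = 1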

185
Q

The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:

A

Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.

You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe.

You are starting to think about a directed graph with a complex dependency structure. Pipes are fundamentally linear and expressing complex relationships with them will typically yield confusing code.

186
Q

All packages in the tidyverse automatically make %>% available for you, so you don’t normally load magrittr explicitly. However, there are some other useful tools inside magrittr that you might want to try out:

A

When working with more complex pipes, it’s sometimes useful to call a function for its side-effects. Maybe you want to print out the current object, or plot it, or save it to disk. Many times, such functions don’t return anything, effectively terminating the pipe.

To work around this problem, you can use the “tee” pipe. %T>% works like %>% except that it returns the left-hand side instead of the right-hand side. It’s called “tee” because it’s like a literal T-shaped pipe.

rnorm(100) %>%
  matrix(ncol = 2) %>%
  plot() %>%
  str()
#> NULL

rnorm(100) %>%
  matrix(ncol = 2) %T>%
  plot() %>%
  str()
#> num [1:50, 1:2] -0.387 -0.785 -1.057 -0.796 -1.756 …

If you’re working with functions that don’t have a data frame based API
(i.e. you pass them individual vectors, not a data frame and expressions to be evaluated in the context of that data frame), you might find %$% useful. It “explodes” out the variables in a data frame so that you can refer to them explicitly. This is useful when working with many functions in base R:

mtcars %$%
cor(disp, mpg)
#> [1] -0.8475514

For assignment magrittr provides the %<>% operator which allows you to replace code like:

mtcars <- mtcars %>%
  transform(cyl = cyl * 2)

with

mtcars %<>% transform(cyl = cyl * 2)

187
Q

You can use || (or) and && (and) to combine multiple logical expressions. These operators are “short-circuiting”: as soon as || sees the first TRUE it returns TRUE without computing anything else. As soon as && sees the first FALSE it returns FALSE. You should never use | or & in an if statement: these are vectorised operations that apply to multiple values (that’s why you use them in filter()). If you do have a logical vector, you can use any() or all() to collapse it to a single value.

A

Be careful when testing for equality. == is vectorised, which means that it’s easy to get more than one output. Either check the length is already 1, collapse with all() or any(), or use the non-vectorised identical(). identical() is very strict: it always returns either a single TRUE or a single FALSE, and doesn’t coerce types. This means that you need to be careful when comparing integers and doubles:

identical(0L, 0)
#> [1] FALSE

188
Q

Be wary of floating point numbers

A

x <- sqrt(2) ^ 2
x
#> [1] 2
x == 2
#> [1] FALSE
x - 2
#> [1] 4.440892e-16
Instead, use dplyr::near() for comparisons, as described in the comparisons section.

And remember, x == NA doesn’t do anything useful!
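For example:

dplyr::near(sqrt(2) ^ 2, 2)
#> [1] TRUE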

189
Q

The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:

A

In log(), the data is x, and the detail is the base of the logarithm.

In mean(), the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).

In t.test(), the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.

In str_c() you can supply any number of strings to …, and the details of the concatenation are controlled by sep and collapse.

190
Q

data arguments should come first. Detail arguments should go on the end, and usually should have default values.

A

You specify a default value in the same way you call a function with a named argument
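A sketch (mean_ci is a made-up example function; conf is a detail argument with a default):

# mean_ci is a hypothetical example function
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)              # uses the default 95% interval
mean_ci(x, conf = 0.99)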

191
Q

The names of the arguments are also important. R doesn’t care, but the readers of your code (including future-you!) will. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It’s worth memorising these:

A

x, y, z: vectors.

w: a vector of weights.

df: a data frame.

i, j: numeric indices (typically rows and columns).

n: length, or number of rows.

p: number of columns.

Otherwise, consider matching the names of arguments in existing R functions. For example, use na.rm to determine if missing values should be removed.

192
Q

Figuring out what your function should return is usually straightforward: it’s why you created the function in the first place! There are two things you should consider when returning a value:

A

Does returning early make your function easier to read?

Can you make your function pipeable?

193
Q

two basic types of pipeable functions: transformations and side-effects. With transformations, an object is passed to the function’s first argument and a modified object is returned. With side-effects, the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file.

A

Side-effects functions should “invisibly” return the first argument, so that while they’re not printed they can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame

show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")

  invisible(df)
}

194
Q

R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:

A

f <- function(x) {
  x + y
}

y <- 100
f(10)
#> [1] 110

y <- 1000
f(10)
#> [1] 1010

195
Q

There are two types of vectors:

Atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors.

Lists, which are sometimes called recursive vectors because lists can contain other lists.

A

The chief difference between atomic vectors and lists is that atomic vectors are homogeneous, while lists can be heterogeneous.

196
Q

type, which you can determine with typeof().

A

typeof(letters)
#> [1] "character"
typeof(1:10)
#> [1] "integer"

197
Q

Length, which you can determine with length().

A

x <- list("a", "b", 1:10)
length(x)
#> [1] 3

198
Q

Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors which build on additional behaviour. There are three important types of augmented vector:

A

Factors are built on top of integer vectors.

Dates and date-times are built on top of numeric vectors.

Data frames and tibbles are built on top of lists.

199
Q

There are two ways to convert, or coerce, one type of vector to another:

A

Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.

Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
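The most important implicit coercion is from logical to numeric: TRUE becomes 1 and FALSE becomes 0, so sums and means of logical vectors give counts and proportions. A quick sketch:

x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)   # how many values are greater than 10?
mean(y)  # what proportion of values are greater than 10?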

200
Q

As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to the same length as the longer vector.

A

Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers.

201
Q

There are four types of things that you can subset a vector with:

A

A numeric vector containing only integers. The integers must either be all positive, all negative, or zero.

Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions

If you have a named vector, you can subset it with a character vector

The simplest type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but it is useful when subsetting matrices (and other high dimensional structures) because it lets you select all the rows or all the columns, by leaving that index blank. For example, if x is 2d, x[1, ] selects the first row and all the columns, and x[, -1] selects all rows and all columns except the first.
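A few quick sketches of the first three cases:

x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]                 # positive integers select elements by position
x[c(-1, -3, -5)]              # negative integers drop elements

x[x == "two" | x == "five"]   # logical subsetting keeps the TRUE positions

y <- c(abc = 1, def = 2, xyz = 5)
y[c("xyz", "def")]            # character subsetting on a named vector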

202
Q

There are three very important attributes that are used to implement fundamental parts of R:

A

Names are used to name the elements of a vector.

Dimensions (dims, for short) make a vector behave like a matrix or array.

Class is used to implement the S3 object oriented system.

203
Q

Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these augmented vectors, because they are vectors with additional attributes, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:

A

factors
dates
date-times
tibbles