R programming Flashcards
import
take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
tidy
storing data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
transform
a common step once data is tidy: narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
wrangling
tidying and transforming
two main engines of knowledge generation:
visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
visualization
A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data. Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.
models
complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
communication
last step, communicate results
prompt and comment
In the R console, the prompt is >. A segment of prompt output looks like:
> text
>
A comment starts with #:
# comment here
data exploration
the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
geom
geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
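For instance, a minimal sketch using ggplot2's built-in mpg data set, drawing the same variables with two different geoms:
library(ggplot2)
# the point geom gives a scatterplot; the smooth geom overlays a fitted curve
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()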
facet
can make multi-panel plots and control how the scales of one panel relate to the scales of another.
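A short sketch with the mpg data: facet_wrap() splits the plot on a single variable, facet_grid() on two.
# one panel per value of class
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
# a grid of panels: drive type (rows) by cylinders (columns)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)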
stat
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The steps below describe how this process works with geom_bar().
geom_bar() and the data set
- geom_bar() begins with the raw data set (e.g. diamonds)
- geom_bar() transforms the data with the count stat, which returns a data set of x-values and their counts
- geom_bar() uses the transformed data to build the plot: the x-values are mapped to the x-axis, the counts to the y-axis
geom_bar() can be used interchangeably with stat_count(); both produce the same plot:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
You might want to override the default stat. In the code below, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately, when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or to the previous bar chart, where the height of the bar is generated by counting rows.
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))
#> Warning: `stat(prop)` was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(prop)` instead.
You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
color gradient on bar chart:
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".
position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.
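A sketch of the three alternatives on the diamonds data, with fill mapped to clarity:
# "identity" overlaps the bars; alpha makes the overlap visible
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")
# "fill" stacks bars to the same height, good for comparing proportions
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
# "dodge" places overlapping objects directly beside one another
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")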
overplotting
when the values of a plot are rounded so the points appear on a grid and many points overlap each other. This arrangement makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?
You can avoid this gridding by setting the position adjustment to “jitter”. position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
coord_flip()
switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.
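For example, a horizontal boxplot: build the plot as usual, then flip the axes.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()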
coord_quickmap()
sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2
coord_polar()
uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
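A sketch of that connection: the same bar object rendered with flipped and with polar coordinates.
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()   # horizontal bars
bar + coord_polar()  # Coxcomb chart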
assignment statements
object <- val
Object name conventions vary; snake_case is recommended:
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
calling functions
function_name(arg1 = val1, arg2 = val2, …)
seq() function
makes regular sequences of numbers
seq(1, 10)
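The by and length.out arguments control the step size and the number of values:
seq(1, 10, by = 2)
#> [1] 1 3 5 7 9
seq(0, 1, length.out = 5)
#> [1] 0.00 0.25 0.50 0.75 1.00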
TAB button
keyboard shortcut to find possible completions of a function
grammar of graphics
coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
int
integer
dbl
double (real numbers)
chr
character vector / string
dttm
date-time (a date + a time)
lgl
logical, vector of TRUE/FALSE
fctr
factors, represent categorical vars with fixed possible values
date
stands for dates
five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
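A quick sketch of all five verbs, assuming the flights data frame from the nycflights13 package:
library(dplyr)
library(nycflights13)
filter(flights, month == 1, day == 1)                       # pick rows
arrange(flights, year, month, day)                          # reorder rows
select(flights, year, month, day)                           # pick columns
mutate(flights, speed = distance / air_time * 60)           # add new columns
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))   # collapse to one row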
Comparisons
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
A common problem you might encounter when using == is floating-point numbers. These results might surprise you!
sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE
Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near()
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
As well as & and |, R also has && and ||; don't use them when filtering, they are for conditional execution.
Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
Missing values (NAs)
If you want to determine if a value is missing, use is.na():
x <- NA
is.na(x)
#> [1] TRUE
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
desc()
re-order by column in descending order
select()
rapidly zoom in on a useful subset using operations based on the names of the variables.
There are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.
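A few of these helpers in action on flights:
select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("time"))     # every column whose name contains "time"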
rename()
variant of select() that keeps all the variables that aren’t explicitly mentioned; use it to rename a variable
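For example:
# rename tailnum to tail_num; every other column is kept unchanged
rename(flights, tail_num = tailnum)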
use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
mutate() always adds new columns at the end of your dataset
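For example, on a narrow slice of flights:
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)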
View()
opens an interactive viewer so you can see all columns
transmute()
like mutate(), but only keeps the new variables you create
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:
Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
Logs
log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.
log2() is recommended because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
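A quick illustration: each doubling on the original scale adds exactly 1 on the log2 scale.
log2(c(1, 2, 4, 8, 16))
#> [1] 0 1 2 3 4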
Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
They are most useful in conjunction with group_by(), which you’ll learn about shortly.
(x <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
#> [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
#> [1] 2 3 4 5 6 7 8 9 10 NA
Cumulative and rolling aggregates
R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
x
#> [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
#> [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Logical comparisons
<, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
Ranking
there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1] 1 2 2 NA 4 5
min_rank(desc(y))
#> [1] 5 3 3 NA 2 1
If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
row_number(y)
#> [1] 1 2 3 NA 4 5
dense_rank(y)
#> [1] 1 2 2 NA 3 4
percent_rank(y)
#> [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
#> [1] 0.2 0.6 0.6 NA 0.8 1.0
summarize()
- collapses a data frame to a single row
- not very useful on its own
- used with group_by() to change unit of analysis from complete dataset to individual groups
- use with dplyr verbs on grouped data frame to have them applied by the group
Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
That way you can check that you’re not drawing conclusions based on very small amounts of data.
Measures of location:
mean(x) - sum / length
median(x) - 50% of x is above it, 50% below it
It’s sometimes useful to combine aggregation with logical subsetting.
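For example, using the not_cancelled data frame from the surrounding examples:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    # the average positive delay
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )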
Measures of spread:
sd(x)
IQR(x)
mad(x)
The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
not_cancelled %>%
  group_by(dest) %>%
  summarise(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
Measure of rank
min(x)
quantile(x, 0.25)
max(x)
Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
Measure of position:
first(x)
nth(x, 2)
last(x)
These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
Counts
n(): takes no arguments, returns the size of the current group
To count number of non-missing values, use sum(!is.na(x)).
To count the number of distinct (unique) values, use n_distinct(x)
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
dplyr provides a simple helper if all you want is a count:
not_cancelled %>%
  count(dest)
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
not_cancelled %>%
  count(tailnum, wt = distance)
Counts and proportions of logical values:
sum(x > 10)
mean(y == 0)
When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(n_early = sum(dep_time < 500))
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(hour_prop = mean(arr_delay > 60))
Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
Be careful when progressively rolling up summaries: it’s OK for sums and counts, but you need to think about weighting means and variances, and it’s not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year <- summarise(per_month, flights = sum(flights)))
Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights
grouped mutates and filters:
Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter()
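Two sketches on flights:
# grouped filter: keep the 10 most-delayed arrivals of each day
flights %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
# grouped mutate: each delayed flight's share of its destination's total delay
flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))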
Pipe ( %>% )
lhs %>% rhs
lhs - value or magrittr placeholder
rhs - function call using magrittr semantics
When functions require only one argument, x %>% f is equivalent to f(x) (not exactly equivalent; see technical note below.)
Placing lhs as the first argument in rhs call
The default behavior of %>% when multiple arguments are required in the rhs call, is to place lhs as the first argument, i.e. x %>% f(y) is equivalent to f(x, y).
Placing lhs elsewhere in rhs call
Often you will want lhs passed to the rhs call at a position other than the first. For this purpose you can use the dot (.) as a placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).
Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, lhs will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly more compact. It is possible to overrule this behavior by enclosing the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is equivalent to c(min(1:10), max(1:10)).
Using %>% with call- or function-producing rhs
It is possible to force evaluation of rhs before the piping of lhs takes place. This is useful when rhs produces the relevant call or function. To evaluate rhs first, enclose it in parentheses, i.e. a %>% (function(x) x^2), and 1:10 %>% (call("sum")). Another example where this is relevant is for reference class methods which are accessed using the $ operator, where one would do x %>% (rc$f), and not x %>% rc$f.
Using lambda expressions with %>%
Each rhs is essentially a one-expression body of a unary function. Defining lambdas in magrittr is therefore very natural and works like defining a regular function: if more than a single expression is needed, enclose the body in a pair of braces, { rhs }. Note, however, that within braces there is no first-argument rule: it is exactly like writing a unary function where the argument name is "." (the dot).
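A small sketch of a braced lambda:
# the braced rhs is a unary function whose argument is named "."
iris %>% {
  rbind(head(., 2), tail(., 2))   # first and last two rows
}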
The magrittr pipe operators use non-standard evaluation. They capture their inputs and examine them to figure out how to proceed. First a function is produced from all of the individual right-hand side expressions, and then the result is obtained by applying this function to the left-hand side.
For most purposes, one can disregard the subtle aspects of magrittr’s evaluation, but some functions may capture their calling environment, and thus using the operators will not be exactly equivalent to the “standard call” without pipe-operators.
A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it’s hard to check that you’ve done the manipulation correctly.
Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). You can learn more about useful window functions in the corresponding vignette: vignette("window-functions").
pipe operation sequence
df %>%
  do_this_operation %>%
  then_do_this_operation %>%
  then_do_this_operation …
Pipe operator to summarize one variable
# summarize mean mpg grouped by cyl
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
Use Pipe Operator to Group & Summarize Multiple Variables
The following code shows how to use the pipe (%>%) operator to group by the cyl and am variables, and then summarize the mean of the mpg variable and the standard deviation of the hp variable:
mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg),
            sd_hp = sd(hp))
Use Pipe Operator to Create New Variables
The following code shows how to use the pipe (%>%) operator along with the mutate function from the dplyr package to create two new variables in the mtcars data frame:
# add two new variables in mtcars
new_mtcars <- mtcars %>%
  mutate(mpg2 = mpg * 2,
         mpg_root = sqrt(mpg))

head(new_mtcars)
EDA (Exploratory Data Analysis) is an iterative cycle:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.
To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
variable
quantity, quality, or property that you can measure.
value
the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
observation
set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
Tabular data
set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
Variation
the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.