R programming Flashcards
import
take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
tidy
storing data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
transform
a common step once data is tidy: narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
wrangling
tidying and transforming
two main engines of knowledge generation:
visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
visualization
A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data. Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.
models
complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
communication
last step, communicate results
prompt and comment
In the R console, the prompt is >. A segment of prompt output looks like:
> text
>
A comment starts with #:
# comment here
data exploration
the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
geom
geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
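For instance, a minimal sketch using ggplot2's built-in mpg data set, drawing the same variables with two different geoms:
library(ggplot2)
# the point geom gives a scatterplot; the smooth geom overlays a fitted curve
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()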
facet
can make multi-panel plots and control how the scales of one panel relate to the scales of another.
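A short sketch with the mpg data: facet_wrap() splits the plot on a single variable, facet_grid() on two.
# one panel per value of class
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
# a grid of panels: drive type (rows) by cylinders (columns)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)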
stat
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The steps below describe how this process works with geom_bar().
geom_bar() and the data set
- geom_bar() begins with the raw data set (e.g. diamonds)
- geom_bar() transforms the data with the count stat, which returns a data set of x-values and their counts
- geom_bar() uses the transformed data to build the plot: the x-values are mapped to the x-axis, the counts to the y-axis
geom_bar() can be used interchangeably with stat_count(); both produce the same plot:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
You might want to override the default stat. In the code below, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately, when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or to the previous bar chart, where the height of the bar is generated by counting rows.
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))
#> Warning: `stat(prop)` was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(prop)` instead.
You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
color gradient on bar chart:
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".
position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.
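A sketch of the three alternatives on the diamonds data, with fill mapped to clarity:
# "identity" overlaps the bars; alpha makes the overlap visible
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")
# "fill" stacks bars to the same height, good for comparing proportions
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
# "dodge" places overlapping objects directly beside one another
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")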
overplotting
when the values of a plot are rounded so the points appear on a grid and many points overlap each other. This arrangement makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?
You can avoid this gridding by setting the position adjustment to “jitter”. position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
coord_flip()
switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.
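For example, a horizontal boxplot: build the plot as usual, then flip the axes.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()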
coord_quickmap()
sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2
coord_polar()
uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
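A sketch of that connection: the same bar object rendered with flipped and with polar coordinates.
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()   # horizontal bars
bar + coord_polar()  # Coxcomb chart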
assignment statements
object <- val
Object name conventions vary; snake_case is recommended:
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
calling functions
function_name(arg1 = val1, arg2 = val2, …)
seq() function
makes regular sequences of numbers
seq(1, 10)
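The by and length.out arguments control the step size and the number of values:
seq(1, 10, by = 2)
#> [1] 1 3 5 7 9
seq(0, 1, length.out = 5)
#> [1] 0.00 0.25 0.50 0.75 1.00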
TAB button
keyboard shortcut to find possible completions of a function
grammar of graphics
coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.
int
integer
dbl
double (real numbers)
chr
character vector / string
dttm
date-time (a date + a time)
lgl
logical, vector of TRUE/FALSE
fctr
factors, represent categorical vars with fixed possible values
date
stands for dates
five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
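A quick sketch of all five verbs, assuming the flights data frame from the nycflights13 package:
library(dplyr)
library(nycflights13)
filter(flights, month == 1, day == 1)                       # pick rows
arrange(flights, year, month, day)                          # reorder rows
select(flights, year, month, day)                           # pick columns
mutate(flights, speed = distance / air_time * 60)           # add new columns
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))   # collapse to one row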
Comparisons
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
A common problem you might encounter when using == is floating-point numbers. These results might surprise you!
sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE
Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near()
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
As well as & and |, R also has && and ||; don't use them when filtering, they are for conditional execution.
Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
Missing values (NAs)
If you want to determine if a value is missing, use is.na():
x <- NA
is.na(x)
#> [1] TRUE
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
desc()
re-order by column in descending order
select()
rapidly zoom in on a useful subset using operations based on the names of the variables.
There are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.
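A few of these helpers in action on flights:
select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("time"))     # every column whose name contains "time"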
rename()
variant of select() that keeps all the variables that aren’t explicitly mentioned; use it to rename a variable
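For example:
# rename tailnum to tail_num; every other column is kept unchanged
rename(flights, tail_num = tailnum)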
use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
mutate() always adds new columns at the end of your dataset
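For example, on a narrow slice of flights:
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)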
View()
opens an interactive viewer so you can see all columns
transmute()
like mutate(), but only keeps the new variables you create
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:
Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
Logs
log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.
log2() is recommended because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
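A quick illustration: each doubling on the original scale adds exactly 1 on the log2 scale.
log2(c(1, 2, 4, 8, 16))
#> [1] 0 1 2 3 4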
Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
They are most useful in conjunction with group_by(), which you’ll learn about shortly.
(x <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
#> [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
#> [1] 2 3 4 5 6 7 8 9 10 NA
Cumulative and rolling aggregates
R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
x
#> [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
#> [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Logical comparisons
<, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
Ranking
there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1] 1 2 2 NA 4 5
min_rank(desc(y))
#> [1] 5 3 3 NA 2 1
If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
row_number(y)
#> [1] 1 2 3 NA 4 5
dense_rank(y)
#> [1] 1 2 2 NA 3 4
percent_rank(y)
#> [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
#> [1] 0.2 0.6 0.6 NA 0.8 1.0
summarize()
- collapses a data frame to a single row
- not very useful on its own
- used with group_by() to change unit of analysis from complete dataset to individual groups
- use with dplyr verbs on grouped data frame to have them applied by the group
Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
That way you can check that you’re not drawing conclusions based on very small amounts of data.
Measures of location:
mean(x) - sum / length
median(x) - 50% of x is above it, 50% below it
It’s sometimes useful to combine aggregation with logical subsetting.
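For example, using the not_cancelled data frame from the surrounding examples:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    # the average positive delay
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )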
Measures of spread:
sd(x)
IQR(x)
mad(x)
The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
not_cancelled %>%
  group_by(dest) %>%
  summarise(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
Measure of rank
min(x)
quantile(x, 0.25)
max(x)
Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
Measure of position:
first(x)
nth(x, 2)
last(x)
These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
Counts
n(): takes no arguments, returns the size of the current group
To count number of non-missing values, use sum(!is.na(x)).
To count the number of distinct (unique) values, use n_distinct(x)
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
dplyr provides a simple helper if all you want is a count:
not_cancelled %>%
  count(dest)
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
not_cancelled %>%
  count(tailnum, wt = distance)
Counts and proportions of logical values:
sum(x > 10)
mean(y == 0)
When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(n_early = sum(dep_time < 500))
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(hour_prop = mean(arr_delay > 60))
Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
Be careful when progressively rolling up summaries: it’s OK for sums and counts, but you need to think about weighting means and variances, and it’s not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year <- summarise(per_month, flights = sum(flights)))
Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights
grouped mutates and filters:
Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter()
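Two sketches on flights:
# grouped filter: keep the 10 most-delayed arrivals of each day
flights %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
# grouped mutate: each delayed flight's share of its destination's total delay
flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))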
Pipe ( %>% )
lhs %>% rhs
lhs - value or magrittr placeholder
rhs - function call using magrittr semantics
When functions require only one argument, x %>% f is equivalent to f(x) (not exactly equivalent; see technical note below.)
Placing lhs as the first argument in rhs call
The default behavior of %>% when multiple arguments are required in the rhs call, is to place lhs as the first argument, i.e. x %>% f(y) is equivalent to f(x, y).
Placing lhs elsewhere in rhs call
Often you will want lhs passed to the rhs call at a position other than the first. For this purpose you can use the dot (.) as a placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).
Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, lhs will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly more compact. It is possible to overrule this behavior by enclosing the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is equivalent to c(min(1:10), max(1:10)).
Using %>% with call- or function-producing rhs
It is possible to force evaluation of rhs before the piping of lhs takes place. This is useful when rhs produces the relevant call or function. To evaluate rhs first, enclose it in parentheses, i.e. a %>% (function(x) x^2), and 1:10 %>% (call("sum")). Another example where this is relevant is for reference class methods which are accessed using the $ operator, where one would do x %>% (rc$f), and not x %>% rc$f.
Using lambda expressions with %>%
Each rhs is essentially a one-expression body of a unary function. Defining lambdas in magrittr is therefore very natural and works like defining a regular function: if more than a single expression is needed, enclose the body in a pair of braces, { rhs }. Note, however, that within braces there is no first-argument rule: it is exactly like writing a unary function where the argument name is "." (the dot).
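A small sketch of a braced lambda:
# the braced rhs is a unary function whose argument is named "."
iris %>% {
  rbind(head(., 2), tail(., 2))   # first and last two rows
}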
The magrittr pipe operators use non-standard evaluation. They capture their inputs and examine them to figure out how to proceed. First a function is produced from all of the individual right-hand side expressions, and then the result is obtained by applying this function to the left-hand side.
For most purposes, one can disregard the subtle aspects of magrittr’s evaluation, but some functions may capture their calling environment, and thus using the operators will not be exactly equivalent to the “standard call” without pipe-operators.
A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it’s hard to check that you’ve done the manipulation correctly.
Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). You can learn more about useful window functions in the corresponding vignette: vignette("window-functions").
pipe operation sequence
df %>%
  do_this_operation %>%
  then_do_this_operation %>%
  then_do_this_operation …
Pipe operator to summarize one variable
# summarize mean mpg grouped by cyl
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
Use Pipe Operator to Group & Summarize Multiple Variables
The following code shows how to use the pipe (%>%) operator to group by the cyl and am variables, and then summarize the mean of the mpg variable and the standard deviation of the hp variable:
mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg),
            sd_hp = sd(hp))
Use Pipe Operator to Create New Variables
The following code shows how to use the pipe (%>%) operator along with the mutate function from the dplyr package to create two new variables in the mtcars data frame:
# add two new variables in mtcars
new_mtcars <- mtcars %>%
  mutate(mpg2 = mpg * 2,
         mpg_root = sqrt(mpg))

head(new_mtcars)
EDA (Exploratory Data Analysis) is an iterative cycle:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.
To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
variable
quantity, quality, or property that you can measure.
value
the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
observation
set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
Tabular data
set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
Variation
the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.