dplyr Flashcards
A type of data frame that makes data easier to work with
tbl
Function to create a tbl
tbl_df(mydata) (deprecated in recent dplyr versions; as_tibble(mydata) is the current equivalent)
Use lookup tables to clean data (similar to case when)
two
glimpse
shows some data from every column in a tbl
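A minimal sketch of creating a tbl and inspecting it, assuming the dplyr package is installed. The built-in mtcars data set stands in for mydata, and as_tibble() is used in place of tbl_df(), which recent dplyr versions deprecate.

```r
library(dplyr)

# mtcars is a built-in data frame, used here as a stand-in for mydata
cars_tbl <- as_tibble(mtcars)

# Prints one line per column: its name, its type, and the first few values
glimpse(cars_tbl)
```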
What are the 5 main data manipulation functions in dplyr?
select, mutate, arrange, filter, summarise
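A short sketch of the five verbs, one call each. The grades table and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical data, invented for this example
grades <- tibble(
  student = c("Ann", "Bob", "Cy", "Dee"),
  test1   = c(90, 70, 85, 60),
  test2   = c(80, 75, 95, 65)
)

picked   <- select(grades, student, test1)             # keep some columns
with_avg <- mutate(grades, avg = (test1 + test2) / 2)  # add a new column
passed   <- filter(with_avg, avg >= 70)                # keep some rows
ordered  <- arrange(with_avg, desc(avg))               # reorder the rows
overall  <- summarise(with_avg, mean_avg = mean(avg))  # collapse to one row
```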
arrange()
Reorders the rows according to one or more variables.
summarise()
Reduces each group to a single row by calculating aggregate measures.
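A sketch of arrange() with two sort variables and summarise() on grouped data. The scores table is hypothetical, and group_by() (not named on the cards) is the dplyr function assumed here for defining the groups.

```r
library(dplyr)

# Hypothetical data, invented for this example
scores <- tibble(
  class = c("A", "A", "B", "B"),
  name  = c("Ann", "Bob", "Cy", "Dee"),
  score = c(70, 90, 85, 60)
)

# Sort by class first, then by descending score within each class
sorted <- arrange(scores, class, desc(score))

# One row per class, holding that class's mean score
by_class <- summarise(group_by(scores, class), mean_score = mean(score))
```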
select()
select(data, column1, column2, ...) keeps only the named columns
starts_with("X")
returns every column whose name starts with "X"
ends_with("X")
returns every column whose name ends with "X"
contains("X")
returns every column whose name contains "X"
matches("X")
returns every column whose name matches the regular expression "X"
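A sketch of the four select helpers. The data frame and its column names are hypothetical, chosen so that each helper picks a different set of columns.

```r
library(dplyr)

# Hypothetical one-row table; only the column names matter here
df <- tibble(x_id = 1, x_name = "a", total_x = 2, max_y = 3, score = 4)

starts  <- names(select(df, starts_with("x")))   # x_id, x_name
ends    <- names(select(df, ends_with("x")))     # total_x
has_x   <- names(select(df, contains("x")))      # x_id, x_name, total_x, max_y
pattern <- names(select(df, matches("_[xy]$")))  # total_x, max_y
```

The first three helpers compare literal text, while matches() treats its argument as a regular expression.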
“Not equal to” operator
!=
“Equal to” operator
==
%in%
Used in the filter clause as a logical operator to test whether a variable's value is found in a vector.
Ex. filter(grades, grade %in% c("a", "b")) returns only the rows whose grade (a hypothetical column name) is an a or b.
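A runnable sketch of %in% inside filter(). The grades table and its column name grade are assumptions; the card does not name a column.

```r
library(dplyr)

# Hypothetical table; the column name grade is an assumption
grades <- tibble(
  student = c("Ann", "Bob", "Cy", "Dee"),
  grade   = c("a", "c", "b", "d")
)

# Keep only the rows whose grade is "a" or "b"
top <- filter(grades, grade %in% c("a", "b"))
```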
You can combine logical tests in the filter clause
& (and), | (or), and ! (not), applied to comparisons such as != and ==
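A sketch of combining tests inside filter(). The table and its columns grade and year are hypothetical.

```r
library(dplyr)

# Hypothetical table, invented for this example
grades <- tibble(
  student = c("Ann", "Bob", "Cy", "Dee"),
  grade   = c("a", "c", "b", "d"),
  year    = c(1, 2, 2, 1)
)

# AND: grade is not "d" and year equals 1
both   <- filter(grades, grade != "d" & year == 1)

# OR: grade equals "a" or year equals 2
either <- filter(grades, grade == "a" | year == 2)

# NOT: grade is neither "a" nor "b"
not_ab <- filter(grades, !(grade %in% c("a", "b")))
```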
dplyr function. The number of rows in the data frame or group of observations that summarise() describes.
n()
dplyr function. The number of unique values in vector x.
n_distinct()
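A sketch of n() and n_distinct() inside summarise(). The orders table is hypothetical, and group_by() is assumed as the way to form the groups.

```r
library(dplyr)

# Hypothetical data, invented for this example
orders <- tibble(
  customer = c("x", "x", "y", "y", "y"),
  item     = c("pen", "pen", "ink", "pen", "cap")
)

# Per customer: how many rows, and how many different items
counts <- summarise(
  group_by(orders, customer),
  n_orders = n(),
  n_items  = n_distinct(item)
)
```

Customer x has two rows but one distinct item, so the two counts differ for that group.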
How can you use dplyr to calculate the proportion of observations that pass a logical test?
First apply the logical test, which returns a vector of TRUE and FALSE values. Then call mean() on that vector. R coerces TRUE to 1 and FALSE to 0, so the mean of the vector is the proportion of TRUE observations.
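A sketch of the proportion trick, using a hypothetical score column and an assumed pass mark of 75.

```r
library(dplyr)

# Hypothetical data; the pass mark of 75 is an assumption
scores <- tibble(score = c(90, 70, 85, 60))

# The logical test gives TRUE, FALSE, TRUE, FALSE
passed <- scores$score >= 75

# mean() treats TRUE as 1 and FALSE as 0
prop <- mean(passed)

# The same calculation inside summarise()
result <- summarise(scores, prop_pass = mean(score >= 75))
```

Two of the four scores pass, so both prop and result$prop_pass equal 0.5.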