Data Transformation and Workflow Script Flashcards
What is the dplyr package?
The dplyr package is used to transform data.
What is a data frame?
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
What is a tibble?
Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
tibble is a package for manipulating and printing data frames in R. Tibbles are a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. The name comes from dplyr: originally you created these objects with tbl_df(), which was most easily pronounced as “tibble diff”.
How do you create a tibble?
You can create a tibble from an existing object with as_tibble() (older code uses as_data_frame(), which is now deprecated):
as_tibble(iris)
This works for data frames, lists, matrices, and tables.
You can also create a new tibble from individual vectors with tibble() (formerly data_frame(), also deprecated):
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
In a tibble, what are the abbreviations under the column names that show the type of each variable?
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
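A small sketch of where these abbreviations show up. The column names and values below are made up for illustration, and the example assumes dplyr is installed (it re-exports tibble()).

```r
library(dplyr)  # dplyr re-exports tibble()

# Hypothetical columns, one per type listed above
types_demo <- tibble(
  count   = 1:3,                                                 # integer
  ratio   = c(0.5, 1.5, 2.5),                                    # double
  label   = c("a", "b", "c"),                                    # character
  stamped = as.POSIXct("2013-01-01 05:15:00", tz = "UTC") + 0:2  # date-time
)

# Printing shows <int>, <dbl>, <chr>, <dttm> under the column names
print(types_demo)

# The underlying classes can also be checked directly
col_types <- vapply(types_demo, function(col) class(col)[1], character(1))
```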
What are the five key dplyr functions?
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data manipulation.
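The six verbs can be sketched on a tiny table. The data frame delays and its values are hypothetical, chosen only so the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical data: delays (in minutes) for a handful of flights
delays <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(5, 130, -2, 45, 10),
  arr_delay = c(0, 140, -10, 30, 20)
)

late      <- filter(delays, dep_delay > 0)                  # pick rows by value
sorted    <- arrange(delays, desc(arr_delay))               # reorder rows
narrow    <- select(delays, carrier, arr_delay)             # pick columns by name
with_gain <- mutate(delays, gain = dep_delay - arr_delay)   # add a new column
overall   <- summarise(delays, mean_arr = mean(arr_delay))  # collapse to one row

# group_by() makes summarise() work per carrier instead of on the whole table
by_carrier <- summarise(group_by(delays, carrier), mean_arr = mean(arr_delay))
```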
Do all the data manipulation verbs work the same way? If so, how do they work?
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
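A minimal sketch of that shared pattern, using a made-up data frame called scores:

```r
library(dplyr)

# Hypothetical data frame
scores <- tibble(name = c("x", "y", "z"), points = c(3, 9, 6))

# First argument: the data frame; later arguments: unquoted column names
high <- filter(scores, points > 5)

# The input is not modified; the verb returns a new data frame
n_before <- nrow(scores)  # scores still has all of its rows
n_after  <- nrow(high)    # high holds only the matching rows
```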
How does filter() work for filtering rows?
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
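As an illustration, with a hypothetical table trips standing in for a real dataset:

```r
library(dplyr)

# Hypothetical table of flights
trips <- tibble(
  month = c(1, 1, 2, 12),
  day   = c(1, 15, 1, 1),
  dest  = c("IAH", "MIA", "ORD", "LAX")
)

# Several expressions are combined with "and": both must be TRUE for a row
jan1 <- filter(trips, month == 1, day == 1)
```

filter() does not change trips; the result has to be assigned (here to jan1) if you want to keep it.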
dplyr
dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. dplyr is faster, has a more consistent API and should be easier to use. There are three key ideas that underlie dplyr:
dplyr
What does “dplyr” stand for? I know it is a package for manipulating datasets in R.
Tibble vs. data frame?
Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
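R comparison operators
R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal). Because computers use finite-precision arithmetic, == can give surprising results on computed numbers; use near() instead:
sqrt(2) ^ 2 == 2
#> [1] FALSE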
1 / 49 * 49 == 1
#> [1] FALSE
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
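The idea behind near() is comparison within a small tolerance. A base-R sketch of that idea follows; the helper name approx_equal is made up here, and the default tolerance of sqrt(.Machine$double.eps) is an assumption about what counts as "close enough".

```r
# Hypothetical helper: TRUE when x and y differ by less than a small tolerance
approx_equal <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

exact  <- sqrt(2) ^ 2 == 2              # exact comparison trips on rounding error
approx <- approx_equal(sqrt(2) ^ 2, 2)  # tolerant comparison succeeds
```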
x %in% y
A useful shorthand. x %in% y will select every row where x is one of the values in y.
nov_dec
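A short sketch of %in% on a plain vector and inside filter(). The table trips and the name late_year are hypothetical.

```r
library(dplyr)

months <- c(1, 11, 12, 6)
in_nov_dec <- months %in% c(11, 12)  # one TRUE/FALSE per element of months

# Hypothetical table; keep rows whose month is November or December
trips <- tibble(month = months, dest = c("IAH", "MIA", "ORD", "LAX"))
late_year <- filter(trips, month %in% c(11, 12))
# Equivalent to: filter(trips, month == 11 | month == 12)
```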
Using De Morgan’s law: Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.
For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
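The equivalence of the two conditions can be checked on plain vectors. The delay values below are made up for the check.

```r
# Hypothetical delay values in minutes
arr_delay <- c(10, 150, 30, 200)
dep_delay <- c(5, 20, 130, 180)

negated_or <- !(arr_delay > 120 | dep_delay > 120)
both_small <- arr_delay <= 120 & dep_delay <= 120

# Both forms flag exactly the same elements
same <- identical(negated_or, both_small)
```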
is.na()?
Determines whether a value is missing. filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly: filter(df, is.na(x) | x > 1)
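A small sketch of the difference, with a three-row df built just for this example:

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

kept      <- filter(df, x > 1)             # drops the FALSE row and the NA row
kept_w_na <- filter(df, is.na(x) | x > 1)  # keeps the NA row as well
```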