Data Transformation and Workflow Script Flashcards by Birwe Leon

What is the dplyr package?

The dplyr package is used to transform data.

What is a data frame?

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

What is a tibble?

Tibbles are data frames, but slightly tweaked to work better in the tidyverse.

tibble is a package for manipulating and printing data frames in R. Tibbles are a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. The name comes from dplyr: originally you created these objects with tbl_df(), which was most easily pronounced as “tibble diff”.

How do you create a tibble?

You can create a tibble from an existing object with as_tibble() (called as_data_frame() in early versions of tibble, now deprecated):

as_tibble(iris)

This works for data frames, lists, matrices, and tables.

You can also create a new tibble from individual vectors with tibble() (formerly data_frame(), now deprecated):

tibble(x = 1:5, y = 1, z = x ^ 2 + y)

In a tibble, what are the column-header abbreviations that show the type of each variable?

int stands for integers.

dbl stands for doubles, or real numbers.

chr stands for character vectors, or strings.

dttm stands for date-times (a date + a time).

What are the five key dplyr functions?

Pick observations by their values (filter()).

Reorder the rows (arrange()).

Pick variables by their names (select()).

Create new variables with functions of existing variables (mutate()).

Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

These six functions provide the verbs for a language of data manipulation.

Do all the data manipulation verbs work the same way? If so, how do they work?

The first argument is a data frame.

The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

The result is a new data frame.
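
The three points above can be seen in a minimal sketch. The `scores` table and its columns are made up for illustration; only the dplyr functions themselves come from the cards.

```r
library(dplyr)

# Hypothetical toy data, not part of the original flashcards
scores <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(72, 91, 85)
)

# First argument: the data frame. Later arguments: unquoted column names.
top     <- filter(scores, score > 80)
ordered <- arrange(scores, score)

# Each verb returns a new data frame; the input is left unchanged
nrow(scores)   # still 3 rows
nrow(top)      # 2 rows
ordered$name   # "Ana" "Cy" "Ben"
```

Because nothing is modified in place, you need to assign the result with `<-` if you want to keep it.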

How does filter() work?

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. e.g data_517
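
As a rough illustration of how filtering expressions are written, here is a sketch on a small made-up table (`trips` and its columns are assumptions, not data from the cards):

```r
library(dplyr)

# Hypothetical toy data standing in for a real data set
trips <- tibble(
  month = c(1, 1, 2, 12),
  day   = c(1, 15, 1, 25),
  delay = c(5, 130, -3, 40)
)

# Several expressions are combined with "and": rows from January 1st
jan1 <- filter(trips, month == 1, day == 1)

# A single expression: rows delayed by more than 30 minutes
late <- filter(trips, delay > 30)

jan1$delay   # 5
late$delay   # 130 40
```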

dplyr

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. dplyr is faster, has a more consistent API and should be easier to use. There are three key ideas that underlie dplyr:

dplyr

What does “dplyr” stand for? I know it is a package for manipulating datasets in R.

Tibble vs. data frame?

Tibbles are data frames, but slightly tweaked to work better in the tidyverse.

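R comparison operators

R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).

Because computers use finite-precision arithmetic, == can give surprising results with floating-point numbers:

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

Instead of relying on ==, use near():

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE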

x %in% y

A useful shorthand: x %in% y selects every row where x is one of the values in y.

nov_dec
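
A minimal sketch of `%in%` inside filter(), using a made-up `trips` table (the table and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data
trips <- tibble(month = c(1, 11, 12, 6, 11), id = 1:5)

# Keep rows whose month is November or December
late_year <- filter(trips, month %in% c(11, 12))

# The same rows, written out with | ("or")
late_year_or <- filter(trips, month == 11 | month == 12)

late_year$id   # 2 3 5
```

The `%in%` form stays short however many values you need to match, whereas the `|` form grows by one comparison per value.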

Using De Morgan’s law: sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

is.na()?

is.na() determines whether a value is missing. filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly: filter(df, is.na(x) | x > 1)
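
The difference is easiest to see side by side. This sketch assumes a tiny data frame `df` with one missing value:

```r
library(dplyr)

# Small example table with a missing value in x
df <- tibble(x = c(1, NA, 3))

# NA > 1 evaluates to NA, so the NA row is dropped along with the FALSE row
dropped <- filter(df, x > 1)

# Asking for missing values explicitly keeps the NA row
kept <- filter(df, is.na(x) | x > 1)

dropped$x   # 3
kept$x      # NA 3
```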

What is the arrange() function in dplyr?

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by.

arrange(flights, year, month, day)

What is the use of desc() in arrange()?

Use desc() to re-order by a column in descending order.
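
A short sketch comparing ascending and descending order. The `trips` table is hypothetical and includes one missing value to show where it ends up:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
trips <- tibble(id = 1:4, delay = c(12, NA, 45, 3))

# Default: ascending order
asc <- arrange(trips, delay)

# desc() reverses the order for that column
dsc <- arrange(trips, desc(delay))

asc$id   # 4 1 3 2
dsc$id   # 3 1 4 2
```

In both cases the row with the missing delay (id 2) is sorted to the end.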

select()

Select columns with select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(flights, year, month, day)
select(flights, year:day)     # all columns from year to day (inclusive)

select(flights, -(year:day))  # all columns except those from year to day (inclusive)

rename(data, z = y): what is it used for?

rename() renames variables and keeps all the other columns. The syntax is new_name = old_name:

rename(flights, tail_num = tailnum)

select(dat, everything()): what does everything() do here?

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
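
For instance, moving one column to the front might look like this (the table `dat` and its column names are invented for the sketch):

```r
library(dplyr)

# Hypothetical one-row table
dat <- tibble(id = 1, x = 2, y = 3, z = 4)

# Put z first, then every remaining column in its original order
moved <- select(dat, z, everything())

names(moved)   # "z" "id" "x" "y"
```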

What are the possible ways to select columns?

Specify column names as unquoted variable names.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

Specify column names as strings.

select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

Specify the column numbers of the variables.

select(flights, 4, 6, 7, 9)

Specify the names of the variables with a character vector and one_of().

select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

Select the variables by matching the start of their names using starts_with().

select(flights, starts_with("dep_"), starts_with("arr_"))

Select the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.

select(flights, matches("^(dep|arr)_(time|delay)$"))

What happens if you include the name of a variable multiple times in a select() call?

The select() call ignores the duplication. Any duplicated variables are only included once, in the first location they appear. The select() function does not raise an error or warning or print any message if there are duplicated variables.

This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.

select(flights, arr_delay, everything())

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

The one_of() function selects variables with a character vector rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

vars
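
A small sketch of selecting with a character vector. The table `dat` and the vector `wanted` are hypothetical stand-ins, not the values from the original exercise:

```r
library(dplyr)

# Hypothetical one-row table
dat <- tibble(year = 2013, month = 1, day = 1, carrier = "UA")

# Column names held in a character vector, e.g. built by other code
wanted <- c("year", "month", "day")

picked <- select(dat, one_of(wanted))

names(picked)   # "year" "month" "day"
```

In recent versions of dplyr, one_of() is superseded by any_of() and all_of(), which take a character vector in the same way.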

What does mutate() do?

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

Note that you can refer to columns that you’ve just created in the same expression.
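
A minimal sketch of that behaviour, on a made-up `trips` table (the column names and values are assumptions):

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
trips <- tibble(distance = c(300, 150), air_time = c(60, 30))

with_speed <- mutate(trips,
  hours = air_time / 60,
  speed = distance / hours   # uses `hours`, created on the line above
)

names(with_speed)   # "distance" "air_time" "hours" "speed"
with_speed$speed    # 300 300
```

The new columns are appended after the existing ones, and `trips` itself is unchanged.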

transmute()

mutate() creates new variables and adds them to the end of the data frame. If you only want to keep the new variables, use transmute() rather than mutate():

```
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

This outputs only the new columns.

Grouped summaries with summarise(): how does it work?

summarise() collapses a data frame to a single row:

```
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6
```

summarise() is not terribly useful unless we pair it with group_by():

```
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day delay
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1  11.5
#> 2  2013     1     2  13.9
```

If you have code on two lines, how many times do you press Return to run it?

You need to press Cmd + Return twice.

How can you combine multiple operations?

Use the pipe, %>%.

%>%: why use the pipe?

This focuses on the transformations, not what’s being transformed, which makes the code easier to read. A good way to pronounce %>% when reading code is “then”. Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.
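
To make the “then” reading concrete, here is a sketch on a made-up `trips` table (the data and column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data
trips <- tibble(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(10, NA, 5, 25, 30)
)

# Read as: take trips, then group by carrier, then summarise, then sort
summary_tbl <- trips %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))

summary_tbl$carrier      # "UA" "AA"
summary_tbl$mean_delay   # 20 10
```

Without the pipe, the same steps would have to be nested inside one another or stored in intermediate variables.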

na.rm: what is its use?

It removes missing values. By default, if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay, na.rm = TRUE))

Cmd/Ctrl + Shift + N

Opens a new script in the script editor. The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it’s a good idea to save your scripts regularly and to back them up.

Cmd/Ctrl + Enter

The key to using the script editor effectively is to memorise one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console.

How do you run the complete script using a shortcut?

You can also execute the complete script in one step: Cmd/Ctrl + Shift + S.

Why should you never share code that contains install.packages() or setwd()?

You should never include install.packages() or setwd() in a script that you share. It’s very antisocial to change settings on someone else’s computer!

RStudio diagnostics

The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar. Hover over the cross to see what the problem is. RStudio will also let you know about potential problems.

How do you zoom in on only one pane at a time?

Use the RStudio shortcuts, which are laid out roughly clockwise:

```
Ctrl + Shift + 1 (Script)
Ctrl + Shift + 2 (Console)
Ctrl + Shift + 3 (File Viewer)
Ctrl + Shift + 4 (Environment)
```

Data Transformation and Workflow Script Flashcards

(37 cards)