Data Transformation and Workflow Script Flashcards

1
Q

What is the dplyr package?

A

The dplyr package is used to transform data.

2
Q

What is a data frame?

A

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

3
Q

What is a tibble?

A

Tibbles are data frames, but slightly tweaked to work better in the tidyverse.

The tibble package provides tools for manipulating and printing data frames in R. Tibbles are a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. The name comes from dplyr: originally you created these objects with tbl_df(), which was most easily pronounced as “tibble diff”.

4
Q

How do you create a tibble?

A

You can create a tibble from an existing object with as_tibble() (the older name, as_data_frame(), is deprecated):

as_tibble(iris)

This works for data frames, lists, matrices, and tables.

You can also create a new tibble from individual vectors with tibble() (the older name, data_frame(), is deprecated):

tibble(x = 1:5, y = 1, z = x ^ 2 + y)

5
Q

In a tibble, what are the abbreviations under the column headers that show the type of each variable?

A

int stands for integers.

dbl stands for doubles, or real numbers.

chr stands for character vectors, or strings.

dttm stands for date-times (a date + a time).
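A minimal sketch of where these abbreviations show up, assuming dplyr is installed (it re-exports tibble()). The data and column names below are hypothetical and chosen only so that each of the four types appears once.

```r
library(dplyr)  # re-exports tibble()

# Hypothetical toy data: one column per type abbreviation.
types_demo <- tibble(
  n_items    = c(1L, 2L, 3L),                                        # int
  price      = c(9.99, 12.5, 3),                                     # dbl
  label      = c("a", "b", "c"),                                     # chr
  ordered_at = as.POSIXct("2013-01-01 05:15:00", tz = "UTC") + 0:2   # dttm
)

# Printing the tibble shows the type abbreviation under each column name.
print(types_demo)

# The underlying R class of each column, for comparison.
col_classes <- vapply(types_demo, function(col) class(col)[1], character(1))
print(col_classes)
```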

6
Q

What are the five key dplyr functions?

A

Pick observations by their values (filter()).

Reorder the rows (arrange()).

Pick variables by their names (select()).

Create new variables with functions of existing variables (mutate()).

Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

These six functions provide the verbs for a language of data manipulation.
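To see how group_by() changes the scope of a verb, here is a small sketch. The delays table and its values are hypothetical; only summarise() and group_by() come from the card.

```r
library(dplyr)

# Hypothetical toy data: two carriers, two flights each.
delays <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  delay   = c(10, 20, 0, 40)
)

# Without group_by(): one summary row for the whole dataset.
overall <- summarise(delays, mean_delay = mean(delay))

# With group_by(): one summary row per carrier.
per_carrier <- summarise(group_by(delays, carrier), mean_delay = mean(delay))

print(overall)
print(per_carrier)
```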

7
Q

Do all the data manipulation verbs work the same way? If so, how do they work?

A

The first argument is a data frame.

The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

The result is a new data frame.
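The three points above can be sketched with any of the verbs. This example uses filter() on a hypothetical scores table; note that the original data frame is left unchanged.

```r
library(dplyr)

# Hypothetical toy data.
scores <- tibble(
  name  = c("ann", "bob", "cy"),
  score = c(90, 72, 85)
)

# First argument: the data frame. Second: what to do, using the
# column name `score` without quotes. Result: a new data frame.
high <- filter(scores, score > 80)

nrow(scores)  # still 3: the input is not modified
nrow(high)    # 2
```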

8
Q

How does filter() work for filtering rows?

A

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. e.g data_517
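A small sketch of filter() with more than one expression, assuming a hypothetical trips table. Expressions separated by commas are combined with "and", and the result has to be assigned if you want to keep it.

```r
library(dplyr)

# Hypothetical toy data.
trips <- tibble(
  month     = c(1, 1, 2, 12),
  day       = c(1, 2, 1, 25),
  dep_delay = c(5, NA, 30, 0)
)

# Keep only rows where month is 1 AND day is 1.
jan1 <- filter(trips, month == 1, day == 1)
print(jan1)
```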

9
Q

dplyr

A

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. dplyr is faster, has a more consistent API and should be easier to use. There are three key ideas that underlie dplyr:

10
Q

dplyr

A

What does “dplyr” stand for? I know it is a package for manipulating datasets in R.

11
Q

Tibble vs dataframe?

A

Tibbles are data frames, but slightly tweaked to work better in the tidyverse.

12
Q

R comparison operator

A

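R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal). Be careful when using == with floating point numbers, because computers use finite-precision arithmetic:

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

Instead of relying on ==, use near():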

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE

13
Q

x %in% y

A

A useful shorthand. x %in% y will select every row where x is one of the values in y.

nov_dec
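As a sketch of the shorthand, the two filters below should select the same rows. The trips table and the names late_year and spelled_out are hypothetical.

```r
library(dplyr)

# Hypothetical toy data.
trips <- tibble(month = c(1, 11, 12, 6, 11))

# Shorthand: keep rows whose month is one of the listed values.
late_year <- filter(trips, month %in% c(11, 12))

# The longer equivalent, written with | (or).
spelled_out <- filter(trips, month == 11 | month == 12)

print(late_year)
```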

14
Q

Using De Morgan’s law: sometimes you can simplify complicated subsetting by remembering that !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

A

For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
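A quick way to convince yourself that the two filters agree is to run both on a small table. This sketch uses hypothetical delay values (with no missing values) in place of the flights data.

```r
library(dplyr)

# Hypothetical toy data standing in for flights.
toy_flights <- tibble(
  arr_delay = c(10, 130, 50, 200, 120),
  dep_delay = c(5, 20, 150, 300, 120)
)

# !(x | y) form.
negated <- filter(toy_flights, !(arr_delay > 120 | dep_delay > 120))

# !x & !y form (comma-separated conditions are combined with "and").
split_up <- filter(toy_flights, arr_delay <= 120, dep_delay <= 120)

print(negated)
print(split_up)
```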

15
Q

is.na()?

A

is.na() determines whether a value is missing. filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly: filter(df, is.na(x) | x > 1)
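A minimal sketch of the difference, using a hypothetical one-column table with a single missing value.

```r
library(dplyr)

# Hypothetical toy data with one missing value.
df <- tibble(x = c(1, NA, 3))

# The NA row is dropped because NA > 1 is NA, not TRUE.
without_na <- filter(df, x > 1)

# Asking for missing values explicitly keeps the NA row.
with_na <- filter(df, is.na(x) | x > 1)

nrow(without_na)  # 1
nrow(with_na)     # 2
```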

16
Q

What is the arrange() function in dplyr?

A

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by:

arrange(flights, year, month, day)

17
Q

What is the use of desc() in arrange()?

A

Use desc() to re-order by a column in descending order:
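For instance, a sketch on a hypothetical three-row table, sorted so the largest delay comes first:

```r
library(dplyr)

# Hypothetical toy data.
toy_flights <- tibble(
  flight    = c("A", "B", "C"),
  dep_delay = c(15, 120, 3)
)

# Largest departure delay first.
most_delayed_first <- arrange(toy_flights, desc(dep_delay))
print(most_delayed_first)
```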

18
Q

select()

A

Select columns with select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(flights, year, month, day)   # select columns by name
select(flights, year:day)           # select all columns from year to day (inclusive)

select(flights, -(year:day))        # select all columns except those from year to day (inclusive)

19
Q

What is rename(data, z = y) used for?

A

rename() renames variables. For example, rename(flights, tail_num = tailnum) renames tailnum to tail_num and keeps all the other variables.

20
Q

select(dat, everything()): what does everything() do here?

A

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

21
Q

What are the possible ways to select columns?

A
  1. Specify column names as unquoted variable names.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

  2. Specify column names as strings.

select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

  3. Specify the column numbers of the variables.

select(flights, 4, 6, 7, 9)

  4. Specify the names of the variables with a character vector and one_of().

select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

  5. Select the variables by matching the start of their names using starts_with().

select(flights, starts_with("dep_"), starts_with("arr_"))

  6. Select the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.

select(flights, matches("^(dep|arr)_(time|delay)$"))

22
Q

What happens if you include the name of a variable multiple times in a select() call?

A

The select() call ignores the duplication. Any duplicated variables are only included once, in the first location they appear. The select() function does not raise an error or warning or print any message if there are duplicated variables.

This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.

select(flights, arr_delay, everything())

23
Q

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

A

The one_of() function selects variables with a character vector rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

vars
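A sketch of selecting with a character vector. The one-row table and the name wanted are hypothetical. Recent versions of dplyr mark one_of() as superseded in favour of all_of() and any_of(), so both spellings are shown; this assumes a dplyr version that provides all_of().

```r
library(dplyr)

# Hypothetical one-row table with a few flight-like columns.
toy_flights <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

# A character vector of column names, e.g. generated by other code.
wanted <- c("year", "month", "day", "dep_delay", "arr_delay")

# Older helper named on the card.
picked_one_of <- select(toy_flights, one_of(wanted))

# Newer helper: errors if any requested column is missing.
picked_all_of <- select(toy_flights, all_of(wanted))

names(picked_all_of)
```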

24
Q

What does mutate() do?

A

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

Note that you can refer to columns that you’ve just created in the same expression.
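A short sketch of both points on a hypothetical two-row table: gain_per_hour refers to gain and hours, which are created earlier in the same mutate() call.

```r
library(dplyr)

# Hypothetical toy data.
toy_flights <- tibble(
  dep_delay = c(10, 30),
  arr_delay = c(4, 40),
  air_time  = c(120, 60)
)

with_gain <- mutate(
  toy_flights,
  gain          = dep_delay - arr_delay,  # new column
  hours         = air_time / 60,          # new column
  gain_per_hour = gain / hours            # uses the two columns just created
)
print(with_gain)
```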

25
Q

transmute

A

mutate() creates new variables and adds them to the end of the data frame. If you only want to keep the new variables, use transmute() rather than mutate():

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

This outputs only the new columns.

26
Q

How do grouped summaries with summarise() work?

A

summarise() collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6

summarise() is not terribly useful unless we pair it with group_by().

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day delay
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1  11.5
#> 2  2013     1     2  13.9

27
Q

If you have code that spans two lines, how many times do you press Return to run it?

A

You need to press Cmd + Return twice.

28
Q

How can you combine multiple operations?

A

Use the pipe, %>%.
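As a sketch, the same three steps can be written as nested calls or as a pipeline. The data and the names nested and piped are hypothetical; both versions should give the same result.

```r
library(dplyr)

# Hypothetical toy data.
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(10, NA, 40, 20, 5)
)

# Nested calls: read from the inside out.
nested <- arrange(
  summarise(
    group_by(filter(toy_flights, !is.na(dep_delay)), carrier),
    mean_delay = mean(dep_delay)
  ),
  desc(mean_delay)
)

# Piped: read top to bottom, saying "then" at each %>%.
piped <- toy_flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay))

print(piped)
```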

29
Q

What does using the pipe, %>%, do for your code?

A

This focuses on the transformations, not what’s being transformed, which makes the code easier to read.

A good way to pronounce %>% when reading code is “then”.

30
Q

How does the pipe, %>%, relate to the tidyverse?

A

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.

31
Q

What is the na.rm argument used for?

A

The na.rm argument removes missing values. Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay, na.rm = TRUE))
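The effect is easy to see on a plain vector before using it inside summarise(). A minimal sketch with hypothetical values:

```r
# Hypothetical delays with one missing value.
delays <- c(2, NA, 4)

# Default: the missing value propagates to the result.
default_mean <- mean(delays)

# With na.rm = TRUE the NA is dropped before the mean is computed.
clean_mean <- mean(delays, na.rm = TRUE)

print(default_mean)
print(clean_mean)
```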

32
Q

cmd+shift+N

A

Opens a new script in the script editor. The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it’s a good idea to save your scripts regularly and to back them up.

33
Q

cmd+Enter

A

The key to using the script editor effectively is to memorise one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console.

34
Q

How do you run the complete script with a shortcut?

A

You can also execute the complete script in one step: Cmd/Ctrl + Shift + S.

35
Q

Why should you never share code that contains install.packages() or setwd()?

A

Note, however, that you should never include install.packages() or setwd() in a script that you share. It’s very antisocial to change settings on someone else’s computer!

36
Q

RStudio diagnostics

A

The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar.

Hover over the cross to see what the problem is.

RStudio will also let you know about potential problems.

37
Q

How do you zoom in on only one pane at a time?

A
Use the RStudio shortcuts, which are laid out clockwise:
Ctrl + Shift + 1 (Script)
Ctrl + Shift + 2 (Console)
Ctrl + Shift + 3 (File Viewer)
Ctrl + Shift + 4 (Environment)