SWIRL for data collection & cleaning Flashcards

Question 1

Q

INSTALLING dplyr

Answer

A

Install: install.packages("dplyr"), then load it with library(dplyr)

Check package version: packageVersion("dplyr")

Read file into variable (e.g. mydf) using read.csv / read.table, etc.

Convert the data frame to dplyr's tibble format using: tbl_df(mydf) (deprecated in current dplyr versions in favour of as_tibble(mydf))

Five “verbs” are supplied with dplyr:
select()
filter()
arrange()
mutate()
summarize()
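
The setup steps above can be sketched roughly as follows. This is a minimal sketch, assuming dplyr is already installed; the contents of mydf are made up and stand in for whatever read.csv() would return, and tibble::as_tibble() is used in place of tbl_df() on the assumption that a recent dplyr version is in use.

```r
# Install once per machine (commented out so the script can be re-run safely)
# install.packages("dplyr")

library(dplyr)

# Hypothetical stand-in for a file read with read.csv() / read.table()
mydf <- data.frame(
  package = c("swirl", "dplyr", "swirl"),
  country = c("US", "IN", "US"),
  size    = c(100600, 2048, 512),
  stringsAsFactors = FALSE
)

# as_tibble() is the current equivalent of the tbl_df() call on the card
cran <- tibble::as_tibble(mydf)

print(packageVersion("dplyr"))  # reports the installed version
print(class(cran))              # the class vector includes "tbl_df"
```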

Question 2

Q

dplyr - SELECT()

Answer

A

Form: _select(cran, ip_id, package, country)_

Column names are given bare; there is no need for the cran$ prefix

Display will show columns in the order stated in select()

Can use the : operator, as in _select(cran, r_arch:country)_, for all columns
between the two columns stated

The : range also works in reverse order: _select(cran, country:r_arch)_

Exclude columns by using -: select(cran, -package) or
select(cran, -(country:r_arch))
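
The select() forms above can be tried on a small table. This is only an illustrative sketch: the column names follow the card, but the values and the exact column layout of cran are made up.

```r
library(dplyr)

# Toy data; column names mirror the card, values are hypothetical
cran <- tibble::tibble(
  ip_id   = c(1L, 2L, 3L),
  package = c("swirl", "dplyr", "swirl"),
  r_arch  = c("x86_64", "i386", "x86_64"),
  r_os    = c("linux-gnu", "mingw32", "linux-gnu"),
  country = c("US", "IN", "US")
)

picked   <- select(cran, ip_id, country, package)  # columns come back in this order
ranged   <- select(cran, r_arch:country)           # r_arch, r_os, country
reversed <- select(cran, country:r_arch)           # country, r_os, r_arch
dropped  <- select(cran, -package)                 # everything except package
dropped2 <- select(cran, -(r_arch:country))        # drops the whole range

print(names(picked))
print(names(reversed))
print(names(dropped2))
```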

Question 3

Q

dplyr - FILTER()

Answer

A

Subsetting using columns is covered by select().

Subsetting by rows uses filter(): filter(cran, package == "swirl")
The comparison package == "swirl" returns a logical (TRUE / FALSE) vector that is used as
an index into cran, so that only the rows corresponding to TRUE are kept

Several filter conditions can be combined (a comma means AND, | means OR):
filter(cran, r_version == "3.1.1", country == "US")
filter(cran, country == "US" | country == "IN")
_filter(cran, size > 100500, r_os == "linux-gnu")_
filter(cran, !is.na(r_version))
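
A short sketch of how these conditions behave, using a made-up four-row cran table (the values are hypothetical and chosen so that each condition keeps a different subset of rows):

```r
library(dplyr)

# Toy data; values are hypothetical
cran <- tibble::tibble(
  package   = c("swirl", "dplyr", "swirl", "ggplot2"),
  country   = c("US", "IN", "DE", "US"),
  r_version = c("3.1.1", NA, "3.1.1", "3.0.2"),
  size      = c(100600, 2048, 512, 200000)
)

# The condition is an ordinary logical vector, one value per row
print(cran$package == "swirl")

swirl_rows <- filter(cran, package == "swirl")                     # rows 1 and 3
both       <- filter(cran, r_version == "3.1.1", country == "US")  # comma = AND
either     <- filter(cran, country == "US" | country == "IN")      # | = OR
known      <- filter(cran, !is.na(r_version))                       # drops the NA row

print(nrow(both))
print(nrow(either))
```

Rows where a condition evaluates to NA are dropped by filter(), which is why the row with a missing r_version does not appear in both.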

Question 4

Q

dplyr - ARRANGE()

Answer

A

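Forms: _arrange(cran2, ip_id)_ sorts by ip_id in ascending order (the default)
_arrange(cran2, desc(ip_id))_ sorts by ip_id in descending order
_arrange(cran2, package, ip_id)_ sorts by package; rows with the same package are then sorted by ip_id in ascending order
arrange(cran2, country, desc(r_version), ip_id) sorts by country, then by r_version descending, then by ip_id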
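
The ordering rules can be checked on a small example. This is a sketch with made-up values; only the column names ip_id and package come from the card.

```r
library(dplyr)

# Toy data; values are hypothetical
cran2 <- tibble::tibble(
  ip_id   = c(3L, 1L, 2L, 1L),
  package = c("swirl", "dplyr", "swirl", "swirl")
)

asc     <- arrange(cran2, ip_id)           # ascending is the default
dsc     <- arrange(cran2, desc(ip_id))     # descending
by_both <- arrange(cran2, package, ip_id)  # ties in package broken by ip_id

print(asc$ip_id)
print(dsc$ip_id)
print(by_both)
```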

Question 5

Q

dplyr - MUTATE()

Answer

A

Used to create a new variable based on the value of one or more variables already in a dataset:

To add a column called size_mb that contains the download size in megabytes: _mutate(cran3, size_mb = size / 2^20)_
You can use the value just computed for the new column (size_mb) to create a further column in the same call:
_mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)_
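
As a worked example of the unit conversion, the sketch below uses two made-up download sizes in bytes (exactly 1 MB and 3 GB in binary units) so the results are easy to verify by hand.

```r
library(dplyr)

# Toy data; sizes are hypothetical and given in bytes
cran3 <- tibble::tibble(
  package = c("swirl", "dplyr"),
  size    = c(2^20, 3 * 2^30)
)

# size_gb is computed from size_mb, which was created earlier in the same call
out <- mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)

print(out)
```

mutate() returns a new table and leaves cran3 unchanged, so the result has to be assigned if it is needed later.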

Question 6

Q

dplyr - SUMMARIZE()

Answer

A

summarize() collapses a dataset to a single row of summary values; applied to grouped data, it gives the requested value for each group in your dataset
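
A minimal sketch of both behaviours follows. The card does not say how groups are defined; group_by() is assumed here as the usual dplyr way to do it, and the column name avg_bytes and all values are made up for illustration.

```r
library(dplyr)

# Toy data; values are hypothetical
cran <- tibble::tibble(
  package = c("swirl", "dplyr", "swirl"),
  size    = c(100, 500, 300)
)

# Without grouping: one row for the whole table
overall <- summarize(cran, avg_bytes = mean(size))

# With grouping (assumption: groups are created with group_by())
by_package  <- group_by(cran, package)
per_package <- summarize(by_package, avg_bytes = mean(size))

print(overall)
print(per_package)
```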

Question 7

Q

e.g. USING INDEX FOR FILTERING

Answer

A

> ZipVector <- zipVector[zipVector == "21231"]

> ZipVector

[1] "21231" "21231" "21231" "21231" …
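
The same indexing idea can be broken into steps in base R. The values in zipVector below are hypothetical; the names keep and matches are introduced only for this sketch.

```r
# Hypothetical ZIP codes stored as character strings
zipVector <- c("21231", "21202", "21231", "21218")

# Step 1: the comparison yields a logical vector, one value per element
keep <- zipVector == "21231"

# Step 2: indexing with that logical vector keeps only the TRUE positions
matches <- zipVector[keep]

print(keep)
print(matches)
print(sum(keep))  # number of matching elements
```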
