SWIRL for data collection & cleaning Flashcards

1
Q

INSTALLING dplyr

A

Install (once): install.packages("dplyr")

Load: library(dplyr)

Check package version: packageVersion("dplyr")

Read file into variable (e.g. mydf) using read.csv / read.table, etc.

Convert the data frame just read into dplyr's table format using: tbl_df(mydf) (deprecated since dplyr 1.0.0; as_tibble(mydf) is the current equivalent)

Five “verbs” are supplied with dplyr:
select()
filter()
arrange()
mutate()
summarize()
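The setup steps above might look like the following minimal sketch. It assumes dplyr is already installed; the small `mydf` data frame and its column names are hypothetical stand-ins for whatever read.csv() would return.

```r
# Assumes dplyr is installed: install.packages("dplyr")
library(dplyr)

# Hypothetical stand-in for a data frame read with read.csv()
mydf <- data.frame(
  package = c("swirl", "dplyr", "ggplot2"),
  size    = c(100000, 250000, 500000),
  stringsAsFactors = FALSE
)

# as_tibble() is the current replacement for the deprecated tbl_df()
cran <- as_tibble(mydf)

packageVersion("dplyr")  # reports the installed dplyr version
class(cran)              # includes "tbl_df" alongside "data.frame"
```

Printing `cran` shows only the first rows and as many columns as fit the console, which is the main practical benefit of the conversion.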

2
Q

dplyr - SELECT()

A

Form: _select(cran, ip_id, package, country)_

Column names are not referred to using $

Display will show columns in the order stated in select()

Can use the (:) operator, as in _select(cran, r_arch:country)_, to select
all columns between the two columns stated

Can use the above in reverse order: _select(cran, country:r_arch)_

Exclude columns by using (-): select(cran, -package) or
select(cran, -(country:r_arch))
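The select() forms above can be tried on a small table. This is only a sketch: the five-column `cran` tibble below is a hypothetical miniature of the download log, with made-up values.

```r
library(dplyr)

# Hypothetical miniature of the cran download log
cran <- tibble(
  ip_id   = c(1, 2, 3),
  package = c("swirl", "dplyr", "swirl"),
  r_arch  = c("x86_64", "i386", "x86_64"),
  r_os    = c("linux-gnu", "mingw32", "darwin13"),
  country = c("US", "IN", "US")
)

sel_named   <- select(cran, ip_id, package, country)  # columns in the order listed
sel_range   <- select(cran, r_arch:country)           # r_arch through country
sel_reverse <- select(cran, country:r_arch)           # same columns, reversed order
sel_drop    <- select(cran, -package)                 # everything except package
sel_drop_rg <- select(cran, -(country:r_arch))        # drop the whole range

names(sel_reverse)
names(sel_drop_rg)
```

Note that no `$` or quoting is needed: column names are written bare inside select().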

3
Q

dplyr - FILTER()

A

Subsetting using columns is covered by select().

Subsetting using rows involves filter(): filter(cran, package == "swirl")
The comparison package == "swirl" returns a TRUE/FALSE vector that is used as
an index to filter cran and display only the rows corresponding to TRUE

Several filter arguments can be combined (a comma means AND, | means OR):
filter(cran, r_version == "3.1.1", country == "US")
filter(cran, country == "US" | country == "IN")
_filter(cran, size > 100500, r_os == "linux-gnu")_
filter(cran, !is.na(r_version))
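The filter() calls above can be checked against a tiny table. The `cran` tibble here is a hypothetical four-row example with invented values, chosen so each condition keeps a different subset of rows.

```r
library(dplyr)

# Hypothetical four-row miniature of the cran download log
cran <- tibble(
  package   = c("swirl", "dplyr", "swirl", "ggplot2"),
  r_version = c("3.1.1", NA, "3.1.1", "3.0.2"),
  country   = c("US", "IN", "DE", "US"),
  size      = c(90000, 200000, 150000, 120000),
  r_os      = c("linux-gnu", "mingw32", "linux-gnu", "darwin13")
)

both    <- filter(cran, r_version == "3.1.1", country == "US")  # comma = AND
either  <- filter(cran, country == "US" | country == "IN")      # | = OR
big_lin <- filter(cran, size > 100500, r_os == "linux-gnu")     # both must hold
known   <- filter(cran, !is.na(r_version))                      # drop rows with NA

nrow(both); nrow(either); nrow(big_lin); nrow(known)
```

Rows where a condition evaluates to NA are dropped, which is why the row with a missing r_version never appears in `both`.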

4
Q

dplyr - ARRANGE()

A

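Forms: _arrange(cran2, ip_id)_ sorts by ip_id in ascending order (the default)
_arrange(cran2, desc(ip_id))_ sorts in descending order
_arrange(cran2, package, ip_id)_ sorts by package; rows with the same package are then sorted by ip_id in ascending order
arrange(cran2, country, desc(r_version), ip_id) sorts by country, then by r_version descending, then by ip_id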
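A short sketch of the arrange() forms above, using a hypothetical four-row `cran2` tibble with invented values:

```r
library(dplyr)

# Hypothetical miniature of cran2
cran2 <- tibble(
  ip_id   = c(3, 1, 2, 1),
  package = c("swirl", "dplyr", "swirl", "swirl"),
  country = c("US", "IN", "US", "DE")
)

asc   <- arrange(cran2, ip_id)           # ascending by default
dsc   <- arrange(cran2, desc(ip_id))     # descending
multi <- arrange(cran2, package, ip_id)  # ties on package broken by ip_id

asc$ip_id
dsc$ip_id
multi
```

In `multi`, the three "swirl" rows stay together and appear in ip_id order 1, 2, 3.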

5
Q

dplyr - MUTATE()

A

Used to create a new variable based on the value of one or more variables already in a dataset:

  • To add a column called size_mb that contains the download size in megabytes: _mutate(cran3, size_mb = size / 2^20)_
  • You can use the value just computed for the new column (size_mb) to create another column in the same call:
    _mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)_
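The mutate() calls above can be verified with sizes chosen to give round results. The `cran3` tibble below is a hypothetical two-row example, with sizes of 2^20 and 2^31 bytes.

```r
library(dplyr)

# Hypothetical miniature of cran3; sizes are 2^20 and 2^31 bytes
cran3 <- tibble(
  ip_id   = c(1, 2),
  package = c("swirl", "dplyr"),
  size    = c(1048576, 2147483648)
)

# size_gb is computed from size_mb, which is created in the same call
out <- mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)

out$size_mb
out$size_gb
```

mutate() returns a new table and leaves `cran3` unchanged, so the result must be assigned to keep the new columns.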
6
Q

dplyr - SUMMARIZE()

A

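summarize() collapses a dataset to a single row of summary values, e.g. summarize(cran, avg_bytes = mean(size)). Combined with group_by(), it can give you the requested value FOR EACH group in your dataset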
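A minimal sketch of the difference between an overall summary and a per-group summary. The `cran` tibble and the `avg_bytes` column name are illustrative assumptions, and group_by() is the dplyr function used here to define the groups.

```r
library(dplyr)

# Hypothetical miniature of the cran download log
cran <- tibble(
  package = c("swirl", "dplyr", "swirl", "dplyr"),
  size    = c(100, 300, 200, 500)
)

# One row for the whole table
overall <- summarize(cran, avg_bytes = mean(size))

# One row per package
by_pkg <- summarize(group_by(cran, package), avg_bytes = mean(size))

overall
by_pkg
```

The grouped result has one row per distinct value of `package`, each with its own mean.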

7
Q

e.g. USING INDEX FOR FILTERING

A

> ZipVector <- zipVector[zipVector == "21231"]

> ZipVector

[1] "21231" "21231" "21231" "21231" …
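The indexing step can be broken into two parts to show the intermediate logical vector. This sketch uses base R only; the contents of `zipVector` are hypothetical values, not the original data.

```r
# Hypothetical zip codes stored as character strings
zipVector <- c("21231", "21202", "21231", "21218")

# The comparison yields one TRUE/FALSE per element
keep <- zipVector == "21231"

# Indexing with the logical vector keeps only the TRUE positions
matches <- zipVector[keep]

keep
matches
```

Because R names are case-sensitive, `ZipVector` and `zipVector` in the card above are two different variables: the filtered result is stored under the capitalized name and the original vector is left untouched.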
