SWIRL for data collection & cleaning Flashcards
INSTALLING dplyr
Install: install.packages("dplyr")
Load: library(dplyr)
Check package version: packageVersion("dplyr")
Read file into variable (e.g. mydf) using read.csv / read.table, etc.
Load the data thus read into dplyr's table format using: tbl_df(mydf) (deprecated in newer dplyr versions, where as_tibble(mydf) does the same job)
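The setup steps above can be sketched roughly as follows. The inline CSV text and its values are invented for illustration, and as_tibble() is used on the assumption that a recent dplyr is installed (older lessons use tbl_df()).

```r
library(dplyr)

# Hypothetical inline CSV standing in for a real file on disk
csv_text <- "package,country,size
swirl,US,1000
dplyr,IN,2500"

# read.csv() accepts text=, so the sketch needs no external file
mydf <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Wrap the plain data frame as a tibble for dplyr-style printing
cran <- as_tibble(mydf)

packageVersion("dplyr")  # shows the installed dplyr version
cran
```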
Five “verbs” are supplied with dplyr:
select()
filter()
arrange()
mutate()
summarize()
dplyr - SELECT()
form: _select(cran, ip_id, package, country)_
Column names are not referred to using $
Display will show columns in the order stated in select()
Can use (:) as in _select(cran, r_arch:country)_ to get all columns
between the columns stated
Can use the above in reverse order: _select(cran, country:r_arch)_
Exclude columns by using (-): select(cran, -package) or
select(cran, -(country:r_arch))
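A minimal sketch of the select() forms listed above. The cran table here is a toy stand-in with invented values; only the column names are taken from the cards.

```r
library(dplyr)

# Toy stand-in for the cran table (values are invented)
cran <- tibble(
  size      = c(1000, 2500),
  r_version = c("3.1.1", "3.0.2"),
  r_arch    = c("x86_64", "i386"),
  r_os      = c("linux-gnu", "mingw32"),
  package   = c("swirl", "dplyr"),
  country   = c("US", "IN"),
  ip_id     = c(1L, 2L)
)

picked   <- select(cran, ip_id, package, country)  # columns in the order listed
ranged   <- select(cran, r_arch:country)           # r_arch through country
reversed <- select(cran, country:r_arch)           # same columns, reversed order
dropped  <- select(cran, -(r_arch:country))        # everything outside that range

names(picked)
names(reversed)
names(dropped)
```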
dplyr - FILTER()
Subsetting using columns is covered by select().
Subsetting using rows involves filter(): filter(cran, package == "swirl")
The package == "swirl" comparison returns a TRUE / FALSE vector that is used as
an index to filter "cran" and display only the rows corresponding to TRUE
Several filter conditions can be combined (a comma means AND, | means OR):
filter(cran, r_version == "3.1.1", country == "US")
filter(cran, country == "US" | country == "IN")
_filter(cran, size > 100500, r_os == "linux-gnu")_
filter(cran, !is.na(r_version))
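The filter() calls above can be tried on a small table like the one below. The rows are invented for illustration; the column names follow the cards.

```r
library(dplyr)

# Toy stand-in for the cran table (values are invented)
cran <- tibble(
  package   = c("swirl", "dplyr", "swirl", "ggplot2"),
  country   = c("US", "IN", "DE", "US"),
  r_version = c("3.1.1", NA, "3.1.1", "3.0.2"),
  r_os      = c("linux-gnu", "mingw32", "linux-gnu", "linux-gnu"),
  size      = c(200000, 50000, 150000, 90000)
)

swirl_rows    <- filter(cran, package == "swirl")                      # one condition
us_311        <- filter(cran, r_version == "3.1.1", country == "US")   # comma = AND
us_or_in      <- filter(cran, country == "US" | country == "IN")       # | = OR
big_linux     <- filter(cran, size > 100500, r_os == "linux-gnu")      # numeric AND text
known_version <- filter(cran, !is.na(r_version))                       # drop missing versions

nrow(swirl_rows)
nrow(known_version)
```

Rows where a condition evaluates to NA are dropped by filter(), which is why the !is.na() form is needed to test for missing values explicitly.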
dplyr - ARRANGE()
forms: _arrange(cran2, ip_id)_ default ascending order
_arrange(cran2, desc(ip_id))_
_arrange(cran2, package, ip_id)_ sorts by package first; rows with the same package are then sorted by ip_id in ascending order
arrange(cran2, country, desc(r_version), ip_id)
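A short sketch of the arrange() forms above, using an invented four-row cran2 table so the tie-breaking is easy to follow.

```r
library(dplyr)

# Toy stand-in for cran2 (values are invented)
cran2 <- tibble(
  ip_id     = c(3L, 1L, 2L, 1L),
  package   = c("swirl", "dplyr", "dplyr", "swirl"),
  country   = c("US", "IN", "US", "IN"),
  r_version = c("3.1.1", "3.0.2", "3.1.1", "3.1.0")
)

by_id      <- arrange(cran2, ip_id)          # ascending by default
by_id_desc <- arrange(cran2, desc(ip_id))    # descending
by_pkg_id  <- arrange(cran2, package, ip_id) # ties on package broken by ip_id
mixed      <- arrange(cran2, country, desc(r_version), ip_id)

by_pkg_id
mixed
```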
dplyr - MUTATE()
Used to create a new variable based on the value of one or more variables already in a dataset:
- We want to add a column called size_mb that contains the download size in megabytes: _mutate(cran3, size_mb = size / 2^20)_
- You can use the value computed for your second column (size_mb) to create a third column in the same call:
_mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)_
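The two mutate() steps can be checked on a toy cran3 table. The sizes below are invented and chosen as exact multiples of 2^20 bytes so the results are easy to read.

```r
library(dplyr)

# Toy stand-in for cran3; sizes are in bytes (values are invented)
cran3 <- tibble(
  ip_id   = c(1L, 2L),
  package = c("swirl", "dplyr"),
  size    = c(1048576, 2097152)  # 1 * 2^20 and 2 * 2^20 bytes
)

# size_gb is computed from size_mb, which is created earlier in the same call
with_sizes <- mutate(cran3, size_mb = size / 2^20, size_gb = size_mb / 2^10)

with_sizes
```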
dplyr - SUMMARIZE()
summarize() collapses a dataset into a single row of summary values; used on grouped data (see group_by()), it can give you the requested value FOR EACH group in your dataset
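A minimal sketch of summarize() with and without groups. The table values and the column name avg_bytes are illustrative assumptions, and group_by() is the dplyr function assumed here for defining the groups.

```r
library(dplyr)

# Toy stand-in for the cran table (values are invented)
cran <- tibble(
  package = c("swirl", "swirl", "dplyr"),
  size    = c(100, 300, 500)
)

# Without groups: one row summarizing the whole table
overall <- summarize(cran, avg_bytes = mean(size))

# With groups: one row per package
by_package <- summarize(group_by(cran, package), avg_bytes = mean(size))

overall
by_package
```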
USING A LOGICAL INDEX FOR FILTERING (base R)
> ZipVector <- zipVector[zipVector == "21231"]
> ZipVector
[1] "21231" "21231" "21231" "21231" …
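The indexing step can be broken into two parts to show the TRUE / FALSE vector doing the work. The zip codes other than "21231" are invented for the sketch.

```r
# Toy character vector of zip codes (values other than "21231" are invented)
zipVector <- c("21231", "21201", "21231", "21218")

# The comparison yields one TRUE / FALSE per element
is_match <- zipVector == "21231"

# Indexing with the logical vector keeps only the TRUE positions
ZipVector <- zipVector[is_match]

is_match
ZipVector
```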