Data Wrangling Flashcards
General Form :
DT[i, j, by]
Take DT, subset rows using i, then calculate j grouped by by
DT[3:5,] or DT[3:5]
Subsetting rows : Selects third to 5th row
DT[ V2 == “A”]
DT[ V2 %in% c(“A”, “C”)]
Selects all rows that have value A in column V2.
Select all rows that have the value A or C in column V2.
DT[,V2]
Select 1 column in j. Column V2 is returned as a vector.
[1] “A” “B” “C” “A” “B” “C” …
DT[,.(V2,V3)]
Select several columns in j.Columns V2 and V3 are returned as a data.table.
V2 V3
1: A -1.1727
2: B -0.3825
3: C -1.0604
.() or list()
.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.
DT[,sum(V1)]
Call functions in j.Returns the sum of all elements of column V1 in a vector.
[1] 18
DT[,.(sum(V1),sd(V3))]
Computing on several columns. Returns the sum of all elements of column V1 and the standard deviation of V3 in a data.table.
V1 V2
1 : 18 0.7634655
DT[,.(Aggregate = sum(V1), Sd.V3 = sd(V3))]
Assigning column names to computed columns.
The same with previous card, but with new names.
Aggregate Sd.V3
1 : 18 0.7634655
DT[,.(V1, Sd.V3 = sd(V3))]
Columns gets recycled if different length. Selects column V1, and compute std. dev. of V3, which returns a single value and gets recycled.
V1 Sd.V3 1: 1 0.7634655 2: 2 0.7634655 ... 11: 1 0.7634655 12: 2 0.7634655
DT[,{print(V2)
plot(V3)
NULL}]
Multiple expressions can be wrapped in curly braces. Print column V2 and plot V3.
[1] "A" "B" "C" "A" "B" "C" ... #And a plot
DT[,.(V4.Sum = sum(V4)),by=V1]
Doing j by group. Calculates the sum of V4, for every group in V1.
V1 V4.Sum
1:1 36
DT[,.(V4.Sum = sum(V4)),by=.(V1,V2)]
Doing j by several groups using .(). The same with previous cards, but for every group in V1 and V2.
V1 V2 V4.Sum
1: 1 A 8
2: 2 B 10
3: 1 C 12
4: 2 A 14
5: 1 B 16
6: 2 C 18
DT[,.(V4.Sum = sum(V4)),by=sign(V1-1)]
Call functions in by. Calculates the sum of V4, for every group in
sign(V1-1).
sign V4.Sum
1: 0 36
2: 1 42
DT[,.(V4.Sum = sum(V4)),
by=.(V1.01 = sign(V1-1))]
Assigning new column names in by. Same as previous card, but with a new name for the variable we are grouping by.
V1.01 V4.Sum
1: 0 36
2: 1 42
DT[1:5, .(V4.Sum = sum(V4)), by=V1]
Grouping only on a subset by specifying i.Calculates the sum of V4, for every group in V1, after subsetting on the first five rows.
V1 V4.Sum
1: 1 2
2: 9 6
DT[,.N,by=V1]
Using .N to get the total number of observations of each
group. Count the number of rows for every group in V1.
V1 N
1: 1 6
2: 2 6
DT[, V1 := round(exp(V1),2)]
Adding/updating a column by reference using := in one line. Watch out: extra assignment (DT
DT[, c(“V1”,”V2”) := list (round(exp(V1),2), LETTERS [4:6])]
Adding/updating several columns by reference using :=. Column V1 and V2 are updated by what is after :=.
Returns the result invisibly. Column V1 changed as above. Column V2 went from: [1] “A” “B” “C” “A” “B” “C” …to:[1] “D” “E” “F” “D” “E” “F” …
DT[, ‘:=’ (V1 = round(exp(V1),2),
V2 = LETTERS[4:6])] [ ]
Using functional :=. Another way to write the same line as previous card, but easier to write comments side-by-side. Also, when [ ] is added the result is printed to the screen.
Same changes as line above this one, but the result is printed to the screen because of the [ ] at the end of the statement.
DT[, V1 := NULL]
Remove a column instantly using :=. Removes column V1.
Returns the result invisibly. Column V1 became NULL.
DT[, c(“V1”,”V2”) := NULL]
Remove several columns instantly using :=.
Removes columns V1 and V2.
Returns the result invisibly. Column V1 and V2 became NULL.
2 DT[, Cols.chosen := NULL]
Cols.chosen = c(“A”,”B”)
2 Watch out: this deletes the column with column name Cols.chosen.
Wrap the name of a variable which contains column names in parenthesis to pass the contents of that variable to be deleted.
setkey(DT,V2)
Use setkey() to set a key on a DT. The data is sorted on the column we specified by reference. A key is set on column V2.
DT[“A”]
Use keys like supercharged rownames to select rows. Returns all the rows where the key column (set to column V2 in the line above) has the value A.
V1 V2 V3 V4
1: 1 A -1.1727 1
2: 2 A 0.6651 4
3: 1 A -1.0604 7
4: 2 A -0.3825 10
DT[c(“A”,”C”)]
Returns all the rows where the key column (V2) has the value A or C.
V1 V2 V3 V4 1: 1 A -1.1727 1 2: 2 A 0.6651 4 ... 7: 1 C -1.1727 9 8: 2 C 0.6651 12
DT[“A”, mult =”first”]
DT[“A”, mult = “last”]
The mult argument is used to control which row that i matches to is returned, default is all.
Returns first row of all rows that match the value A in the key column (V2).
Returns last row of all rows that match the value A in the key column (V2).
V1 V2 V3 V4 1: 1 A -1.1727 1 V1 V2 V3 V4 1: 2 A -0.3825 10