Data Wrangling Flashcards
General Form :
DT[i, j, by]
Take DT, subset rows using i, then calculate j grouped by by
DT[3:5,] or DT[3:5]
Subsetting rows : Selects third to 5th row
DT[ V2 == “A”]
DT[ V2 %in% c(“A”, “C”)]
Selects all rows that have value A in column V2.
Select all rows that have the value A or C in column V2.
DT[,V2]
Select 1 column in j. Column V2 is returned as a vector.
[1] “A” “B” “C” “A” “B” “C” …
DT[,.(V2,V3)]
Select several columns in j.Columns V2 and V3 are returned as a data.table.
V2 V3
1: A -1.1727
2: B -0.3825
3: C -1.0604
.() or list()
.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.
DT[,sum(V1)]
Call functions in j.Returns the sum of all elements of column V1 in a vector.
[1] 18
DT[,.(sum(V1),sd(V3))]
Computing on several columns. Returns the sum of all elements of column V1 and the standard deviation of V3 in a data.table.
V1 V2
1 : 18 0.7634655
DT[,.(Aggregate = sum(V1), Sd.V3 = sd(V3))]
Assigning column names to computed columns.
The same with previous card, but with new names.
Aggregate Sd.V3
1 : 18 0.7634655
DT[,.(V1, Sd.V3 = sd(V3))]
Columns gets recycled if different length. Selects column V1, and compute std. dev. of V3, which returns a single value and gets recycled.
V1 Sd.V3 1: 1 0.7634655 2: 2 0.7634655 ... 11: 1 0.7634655 12: 2 0.7634655
DT[,{print(V2)
plot(V3)
NULL}]
Multiple expressions can be wrapped in curly braces. Print column V2 and plot V3.
[1] "A" "B" "C" "A" "B" "C" ... #And a plot
DT[,.(V4.Sum = sum(V4)),by=V1]
Doing j by group. Calculates the sum of V4, for every group in V1.
V1 V4.Sum
1:1 36
DT[,.(V4.Sum = sum(V4)),by=.(V1,V2)]
Doing j by several groups using .(). The same with previous cards, but for every group in V1 and V2.
V1 V2 V4.Sum
1: 1 A 8
2: 2 B 10
3: 1 C 12
4: 2 A 14
5: 1 B 16
6: 2 C 18
DT[,.(V4.Sum = sum(V4)),by=sign(V1-1)]
Call functions in by. Calculates the sum of V4, for every group in
sign(V1-1).
sign V4.Sum
1: 0 36
2: 1 42
DT[,.(V4.Sum = sum(V4)),
by=.(V1.01 = sign(V1-1))]
Assigning new column names in by. Same as previous card, but with a new name for the variable we are grouping by.
V1.01 V4.Sum
1: 0 36
2: 1 42