Data Wrangling Flashcards by Birwe Leon

General Form :

DT[i, j, by]

Take DT, subset rows using i, then calculate j grouped by by

How well did you know this?

Not at all

Perfectly

DT[3:5,] or DT[3:5]

Subsetting rows : Selects third to 5th row

How well did you know this?

Not at all

Perfectly

DT[ V2 == “A”]

DT[ V2 %in% c(“A”, “C”)]

Selects all rows that have value A in column V2.

Select all rows that have the value A or C in column V2.

How well did you know this?

Not at all

Perfectly

DT[,V2]

Select 1 column in j. Column V2 is returned as a vector.

[1] “A” “B” “C” “A” “B” “C” …

How well did you know this?

Not at all

Perfectly

DT[,.(V2,V3)]

Select several columns in j.Columns V2 and V3 are returned as a data.table.

V2 V3

1: A -1.1727
2: B -0.3825
3: C -1.0604

How well did you know this?

Not at all

Perfectly

.() or list()

.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.

How well did you know this?

Not at all

Perfectly

DT[,sum(V1)]

Call functions in j.Returns the sum of all elements of column V1 in a vector.

[1] 18

How well did you know this?

Not at all

Perfectly

DT[,.(sum(V1),sd(V3))]

Computing on several columns. Returns the sum of all elements of column V1 and the standard deviation of V3 in a data.table.

V1 V2
1 : 18 0.7634655

How well did you know this?

Not at all

Perfectly

DT[,.(Aggregate = sum(V1), Sd.V3 = sd(V3))]

Assigning column names to computed columns.
The same with previous card, but with new names.

Aggregate Sd.V3
1 : 18 0.7634655

How well did you know this?

Not at all

Perfectly

DT[,.(V1, Sd.V3 = sd(V3))]

Columns gets recycled if different length. Selects column V1, and compute std. dev. of V3, which returns a single value and gets recycled.

V1              Sd.V3
1: 1         0.7634655 
2: 2       0.7634655 
... 
11: 1        0.7634655 
12: 2      0.7634655

How well did you know this?

Not at all

Perfectly

DT[,{print(V2)
plot(V3)
NULL}]

Multiple expressions can be wrapped in curly braces. Print column V2 and plot V3.

[1] "A" "B" "C" "A" "B" "C" ...
#And a plot

How well did you know this?

Not at all

Perfectly

DT[,.(V4.Sum = sum(V4)),by=V1]

Doing j by group. Calculates the sum of V4, for every group in V1.

V1 V4.Sum
1:1 36

How well did you know this?

Not at all

Perfectly

DT[,.(V4.Sum = sum(V4)),by=.(V1,V2)]

Doing j by several groups using .(). The same with previous cards, but for every group in V1 and V2.

V1   V2    V4.Sum

1: 1 A 8
2: 2 B 10
3: 1 C 12
4: 2 A 14
5: 1 B 16
6: 2 C 18

How well did you know this?

Not at all

Perfectly

DT[,.(V4.Sum = sum(V4)),by=sign(V1-1)]

Call functions in by. Calculates the sum of V4, for every group in
sign(V1-1).

sign     V4.Sum

1: 0 36
2: 1 42

How well did you know this?

Not at all

Perfectly

DT[,.(V4.Sum = sum(V4)),

by=.(V1.01 = sign(V1-1))]

Assigning new column names in by. Same as previous card, but with a new name for the variable we are grouping by.

V1.01 V4.Sum

1: 0 36
2: 1 42

How well did you know this?

Not at all

Perfectly

DT[1:5, .(V4.Sum = sum(V4)), by=V1]

Study These Flashcards

Grouping only on a subset by specifying i.Calculates the sum of V4, for every group in V1, after subsetting on the first five rows.
V1 V4.Sum
1: 1 2
2: 9 6

DT[,.N,by=V1]

Study These Flashcards

Using .N to get the total number of observations of each
group. Count the number of rows for every group in V1.

V1     N

1: 1 6
2: 2 6

DT[, V1 := round(exp(V1),2)]

Study These Flashcards

Adding/updating a column by reference using := in one line. Watch out: extra assignment (DT

DT[, c(“V1”,”V2”) := list (round(exp(V1),2), LETTERS [4:6])]

Study These Flashcards

Adding/updating several columns by reference using :=. Column V1 and V2 are updated by what is after :=.

Returns the result invisibly. Column V1 changed as above. Column V2 went from: [1] “A” “B” “C” “A” “B” “C” …to:[1] “D” “E” “F” “D” “E” “F” …

DT[, ‘:=’ (V1 = round(exp(V1),2),

V2 = LETTERS[4:6])] [ ]

Study These Flashcards

Using functional :=. Another way to write the same line as previous card, but easier to write comments side-by-side. Also, when [ ] is added the result is printed to the screen.
Same changes as line above this one, but the result is printed to the screen because of the [ ] at the end of the statement.

DT[, V1 := NULL]

Study These Flashcards

Remove a column instantly using :=. Removes column V1.

Returns the result invisibly. Column V1 became NULL.

DT[, c(“V1”,”V2”) := NULL]

Study These Flashcards

Remove several columns instantly using :=.
Removes columns V1 and V2.

Returns the result invisibly. Column V1 and V2 became NULL.

2 DT[, Cols.chosen := NULL]

Cols.chosen = c(“A”,”B”)

Study These Flashcards

2 Watch out: this deletes the column with column name Cols.chosen.

Wrap the name of a variable which contains column names in parenthesis to pass the contents of that variable to be deleted.

setkey(DT,V2)

Study These Flashcards

Use setkey() to set a key on a DT. The data is sorted on the column we specified by reference.
A key is set on column V2.

DT["A"]

Use keys like supercharged rownames to select rows. Returns all the rows where the key column (set to column V2 in the line above) has the value A. V1 V2 V3 V4 1: 1 A -1.1727 1 2: 2 A 0.6651 4 3: 1 A -1.0604 7 4: 2 A -0.3825 10

DT[c("A","C")]

Returns all the rows where the key column (V2) has the value A or C. ``` V1 V2 V3 V4 1: 1 A -1.1727 1 2: 2 A 0.6651 4 ... 7: 1 C -1.1727 9 8: 2 C 0.6651 12 ```

DT["A", mult ="first"] DT["A", mult = "last"]

The mult argument is used to control which row that i matches to is returned, default is all. Returns first row of all rows that match the value A in the key column (V2). Returns last row of all rows that match the value A in the key column (V2). V1 V2 V3 V4 1: 1 A -1.1727 1 V1 V2 V3 V4 1: 2 A -0.3825 10

Data Wrangling Flashcards

(27 cards)