Lecture 2 Flashcards
Data Wrangling with data.table in R
What is the syntax for a data.table operation in R?
DT[i, j, by], where:
i: Row conditions
j: Column operations
by: Grouping
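A minimal sketch of the three parts working together. The toy table and the DELAY column are made up for illustration and are not from the lecture data:

```r
library(data.table)

# Hypothetical toy data (assumption, not the lecture's dataset)
DT <- data.table(
  AIRLINE = c("AA", "AA", "DL", "DL"),
  DELAY   = c(10, 30, 5, 20)
)

# i:  keep only rows with DELAY > 5
# j:  compute the mean delay of the remaining rows
# by: produce one result row per airline
res <- DT[DELAY > 5, .(mean_delay = mean(DELAY)), by = AIRLINE]
print(res)
```

Read it as "take DT, filter rows with i, then compute j, grouped by by".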
How do you create a data.table in R?
data.table(x = c(...), y = c(...)), where each c(...) holds the values of that column
What function converts a data.frame to a data.table?
as.data.table()
Which function is used to load large data files efficiently into a data.table?
fread()
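The three ways of getting a data.table can be sketched as follows. The values and the temporary CSV file are placeholders chosen for illustration; fwrite() is used here only to produce a file for fread() to read:

```r
library(data.table)

# 1. Build a data.table directly from column vectors
DT1 <- data.table(x = c(1, 2, 3), y = c("a", "b", "c"))

# 2. Convert an existing data.frame
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
DT2 <- as.data.table(df)

# 3. Read a file from disk (a temporary CSV written just for this example)
tmp <- tempfile(fileext = ".csv")
fwrite(DT1, tmp)
DT3 <- fread(tmp)

print(DT3)
```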
How can you access the 2nd row of a data.table?
DT[2]
How can you subset rows using multiple conditions?
Use & for AND and | for OR, e.g., DT[AIRLINE == "AA" & DEPARTURE_TIME > 600]
How can you select rows where a column value is in a set of values?
Use %in%, e.g., DT[DESTINATION_AIRPORT %in% c("JFK", "LGA")]
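A short sketch of the row-subsetting cards above. The column names come from the cards, but the four rows are invented for illustration:

```r
library(data.table)

# Hypothetical flights table (made-up rows)
flights <- data.table(
  AIRLINE             = c("AA", "AA", "DL", "UA"),
  DEPARTURE_TIME      = c(545, 730, 900, 1215),
  DESTINATION_AIRPORT = c("JFK", "LAX", "LGA", "JFK")
)

# Row by position: the 2nd row
second_row <- flights[2]

# Multiple conditions combined with & (AND)
aa_late <- flights[AIRLINE == "AA" & DEPARTURE_TIME > 600]

# Membership in a set of values
nyc <- flights[DESTINATION_AIRPORT %in% c("JFK", "LGA")]

print(second_row)
print(aa_late)
print(nyc)
```

Note that R code needs straight quotes ("AA"); curly quotes copied from slides or word processors will not parse.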
How can you ensure changes in a data.table do not affect the original data?
Use copy(), e.g., new_DT <- copy(DT)
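Why copy() matters: plain assignment of a data.table does not duplicate it, so a later := on one name is visible through the other. A small sketch with made-up names (alias, snapshot):

```r
library(data.table)

DT <- data.table(x = 1:3)

alias    <- DT        # second name for the same table, no copy is made
snapshot <- copy(DT)  # independent copy

# := modifies DT by reference
DT[, y := x * 2]

alias_has_y    <- "y" %in% names(alias)     # the alias sees the new column
snapshot_has_y <- "y" %in% names(snapshot)  # the copy does not

print(alias_has_y)
print(snapshot_has_y)
```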
How do you access a specific column by name?
DT[, COLUMN_NAME], which returns the column as a vector
How do you access multiple columns as a data.table?
DT[, .(col1, col2)]
How do you add a new column in a data.table?
Use :=, e.g., DT[, NEW_COLUMN := OLD_COLUMN * 2]
How do you remove a column in a data.table?
DT[, COLUMN_NAME := NULL]
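The column cards above can be tried on a toy table. Column names a, b and b2 are placeholders for illustration:

```r
library(data.table)

DT <- data.table(a = 1:3, b = c(10, 20, 30))

# A single bare column name in j returns a plain vector
v <- DT[, a]

# Wrapping names in .() returns a data.table
sub <- DT[, .(a, b)]

# := adds a column by reference (no reassignment needed)
DT[, b2 := b * 2]

# Assigning NULL removes a column by reference
DT[, a := NULL]

print(names(DT))
```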
What does .N represent in data.table?
.N is a built-in variable that holds the number of rows in the table or, when by is used, in the current group.
What is the purpose of the by argument in data.table?
It is used for grouped operations, e.g., DT[, .(mean_col = mean(COLUMN)), by = GROUP_COLUMN]
How do you calculate the mean of a column in a data.table?
DT[, mean(COLUMN_NAME, na.rm = TRUE)]
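A sketch combining .N, by and mean() on a made-up table with one missing value (columns g and v are illustrative):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = c(1, NA, 3))

# .N without by: total number of rows
n_total <- DT[, .N]

# .N with by: number of rows in each group
by_group <- DT[, .N, by = g]

# Mean over the whole column, ignoring the NA
overall <- DT[, mean(v, na.rm = TRUE)]

# Mean per group, returned as a named column
group_means <- DT[, .(mean_v = mean(v, na.rm = TRUE)), by = g]

print(by_group)
print(group_means)
```

Without na.rm = TRUE, any group containing an NA would have a mean of NA.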
What is the output of rep(6:9, 2)?
a. 6 7 8 9 6 7 8 9
b. 6 6 7 7 8 8 9 9
c. "6", "7", "8", "9", "6", "7", "8", "9"
d. "6", "6", "7", "7", "8", "8", "9", "9"
The correct answer is A
rep(6:9, 2)
[1] 6 7 8 9 6 7 8 9
Answer B is wrong because rep(6:9, 2) repeats the whole sequence 6 7 8 9 twice; it does not repeat each number individually.
Answers C and D are wrong because their elements are character strings, not integers.
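For comparison, a short sketch showing how the pattern in answer B could be produced instead, using the each argument of rep() (variable names are illustrative):

```r
# Second positional argument is `times`: repeat the whole sequence
twice <- rep(6:9, 2)

# `each` repeats every element before moving on to the next one
each_twice <- rep(6:9, each = 2)

print(twice)
print(each_twice)

# The result is an integer vector, not character
print(class(twice))
```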
What is the output of the following code: c(3 != sqrt(9), TRUE == (3 > 8))?
a. FALSE TRUE
b. TRUE
c. FALSE FALSE
d. FALSE
The correct answer is C
c(3 != sqrt(9), TRUE == (3 > 8))
[1] FALSE FALSE
Note: the expression simplifies to the following, which makes the logic easier to see:
c(3 != 3, TRUE == FALSE)
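The same evaluation, broken into steps that can be run one at a time (the intermediate variable names are made up for illustration):

```r
# First element: is 3 different from sqrt(9)?
root  <- sqrt(9)     # the square root of 9 is 3
first <- 3 != root   # 3 is not different from 3

# Second element: does TRUE equal the result of 3 > 8?
cmp    <- 3 > 8      # 3 is not greater than 8
second <- TRUE == cmp

result <- c(first, second)
print(result)
```

c() combines the two logical values into a vector of length 2, which is why the single-value answers B and D cannot be right.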
Let x <- c(1, 6, 3, 2). What is the output of sort(x)?
a. 1 2 3 6
b. 6 3 2 1
c. 1 4 3 2
d. 2 3 4 1