Lecture 3 Flashcards
Tidy Data
What are the three criteria for a tidy dataset?
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
What are common signs of untidy data?
Column headers are values, not variable names.
Multiple variables stored in one column.
Variables stored across both rows and columns.
A single observational unit stored across multiple tables.
What is the function used to convert wide data to long data?
melt()
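A minimal sketch of melt() from data.table, assuming a made-up wide table in which the year values appear as column headers. The table and the names country, year, and cases are illustrative only.

```r
library(data.table)

# Hypothetical wide table: the year values are used as column headers
wide <- data.table(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21)
)

# Gather the year columns into one "year" column and one "cases" column
long <- melt(
  wide,
  id.vars       = "country",
  measure.vars  = c("2019", "2020"),
  variable.name = "year",
  value.name    = "cases"
)
print(long)
```

Each row of the result is one country-year observation, so the former header values become ordinary data.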
What is the function used to convert long data to wide data?
dcast()
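A minimal sketch of dcast() from data.table, assuming a made-up long table. The formula puts the row identifier on the left and the column whose values become headers on the right. All names are illustrative.

```r
library(data.table)

# Hypothetical long table: one row per country-year observation
long <- data.table(
  country = c("A", "A", "B", "B"),
  year    = c("2019", "2020", "2019", "2020"),
  cases   = c(10, 11, 20, 21)
)

# One row per country, one column per year, cells filled from `cases`
wide <- dcast(long, country ~ year, value.var = "cases")
print(wide)
```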
How do you split a column into multiple columns?
Use separate() from tidyr, e.g., separate(data, col = "proportion", into = c("votes", "total_votes")).
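A minimal sketch of separate(), assuming a made-up results table whose proportion column stores values such as "120/300". The explicit sep = "/" and convert = TRUE are additions for illustration. Recent tidyr releases mark separate() as superseded by separate_wider_delim(), though it is still available.

```r
library(tidyr)

# Hypothetical table: "proportion" packs two variables into one column
results <- data.frame(
  candidate  = c("Ann Lee", "Bob Ray"),
  proportion = c("120/300", "180/300")
)

# Split on "/" and convert the resulting pieces to integers
split_results <- separate(
  results,
  col     = "proportion",
  into    = c("votes", "total_votes"),
  sep     = "/",
  convert = TRUE
)
print(split_results)
```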
How do you combine multiple columns into one?
Use unite() from tidyr, e.g., unite(data, col = "candidate", name, surname, sep = " ").
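A minimal sketch of unite(), assuming a made-up table with separate name and surname columns. By default the input columns are dropped once they are combined.

```r
library(tidyr)

# Hypothetical table: the candidate's name is spread over two columns
people <- data.frame(
  name    = c("Ann", "Bob"),
  surname = c("Lee", "Ray"),
  votes   = c(120, 180)
)

# Paste name and surname into a single "candidate" column
combined <- unite(people, col = "candidate", name, surname, sep = " ")
print(combined)
```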
What function is used to concatenate multiple tables?
rbindlist()
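A minimal sketch of rbindlist() from data.table, assuming two made-up monthly tables with identical columns. The idcol argument is optional; here it records which table each row came from, using the list names.

```r
library(data.table)

# Hypothetical tables with the same columns, one per month
jan <- data.table(id = 1:2, sales = c(100, 150))
feb <- data.table(id = 1:2, sales = c(120, 130))

# Stack the tables row-wise; "month" is filled from the list names
all_months <- rbindlist(list(jan = jan, feb = feb), idcol = "month")
print(all_months)
```

If the tables do not share exactly the same columns, use.names = TRUE together with fill = TRUE matches columns by name and pads missing ones with NA.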
What are the four types of merges in data.table?
Inner merge (all = FALSE, the default): only rows whose key matches in both tables.
Outer merge (all = TRUE): all rows from both tables, with NAs wherever one table has no match.
Left merge (all.x = TRUE): all rows from the first table, with NAs in the second table's columns where there is no match.
Right merge (all.y = TRUE): all rows from the second table, with NAs in the first table's columns where there is no match.
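The four merge types can be compared side by side. This is a minimal sketch assuming two made-up tables that share an id column and overlap only on ids 2 and 3.

```r
library(data.table)

# Hypothetical tables overlapping on id 2 and id 3 only
table1 <- data.table(id = c(1, 2, 3), x = c("a", "b", "c"))
table2 <- data.table(id = c(2, 3, 4), y = c("B", "C", "D"))

inner <- merge(table1, table2, by = "id", all = FALSE)   # ids 2, 3
outer <- merge(table1, table2, by = "id", all = TRUE)    # ids 1, 2, 3, 4
left  <- merge(table1, table2, by = "id", all.x = TRUE)  # ids 1, 2, 3
right <- merge(table1, table2, by = "id", all.y = TRUE)  # ids 2, 3, 4

print(outer)
```

In the outer result, y is NA for id 1 and x is NA for id 4, since those ids exist in only one of the tables.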
How do you perform an inner merge in data.table?
merge(table1, table2, by = "column", all = FALSE)
How do you merge two tables by multiple columns?
merge(table1, table2, by = c("col1", "col2"))
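A minimal sketch of a merge on two key columns, assuming made-up tables keyed by student and year. A row is kept only when both key values match.

```r
library(data.table)

# Hypothetical tables identified by the pair (student, year)
scores <- data.table(
  student = c("Ann", "Ann", "Bob"),
  year    = c(2023, 2024, 2023),
  score   = c(90, 95, 80)
)
classes <- data.table(
  student = c("Ann", "Bob", "Bob"),
  year    = c(2024, 2023, 2024),
  class   = c("B2", "A1", "A2")
)

# Inner merge on both columns: (Ann, 2024) and (Bob, 2023) match
merged <- merge(scores, classes, by = c("student", "year"))
print(merged)
```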
Why is there no single tidy representation of a dataset?
The tidy representation depends on what is treated as an observation and on the goal of the analysis.
What is the difference between back-end and front-end data needs?
Back-end: Data is normalized to avoid redundancy.
Front-end: Data may be combined for easier analysis, even with some redundancy.
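The contrast can be sketched with two made-up normalized tables that are merged into one analysis table. All table and column names here are illustrative assumptions.

```r
library(data.table)

# Back-end style: normalized tables, each customer's city stored once
customers <- data.table(customer_id = c(1L, 2L), city = c("Munich", "Berlin"))
orders <- data.table(
  order_id    = 1:3,
  customer_id = c(1L, 1L, 2L),
  amount      = c(10, 20, 5)
)

# Front-end style: one combined table, where city repeats for each order
analysis_table <- merge(orders, customers, by = "customer_id")
print(analysis_table)
```

The combined table repeats "Munich" for both orders of customer 1, which is redundant but convenient for analysis.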