3 | R: Data Flashcards
(POLL)
Return or print? To use the value of a function outside of that function what would you use at the end of your function?
- print
- return
- neither
return
(POLL)
What does dim() return for a data frame?
- number of columns
- number of rows
- both, first columns then rows
- both, first rows then columns
both, first rows then columns
(POLL)
To display the data from a data frame df
for the column col
in a sorted manner, what is the right statement to do so?
- df[order(df$col)]
- df[order(df$col),]
- df[sort(df$col)]
- df[sort(df$col),]
- sort(df)
df[order(df$col),]
(POLL)
To summarise a column of a data frame by one or more categories we use?
- aggregate
- apply
- print
- summary
aggregate
(POLL)
What is the command to combine two data frames by a column that holds the same set of values in both?
- attach
- cbind
- join
- merge
- rbind
merge
(POLL)
What is the command you would use to get the sum of all row values for a matrix?
- aggregate
- apply
- sum
- summary
apply
(POLL)
To display tables with more than 2 dimensions we use:
- cat
- ftable
- summary
- table
ftable
(summary: might give unwanted info
table: maybe also)
(POLL)
To write a single data frame to the file system in a compressed compact file we use …
- save
- save.image
- saveRDS
- writehistory()
- write.table
save, saveRDS
(save.image: saves whole workspace!
writehistory(): not an R function; savehistory() saves the command history, not data
write.table: uncompressed!)
R:
Four data sets we worked with?
● survey, nym.2002 - data frames with different column types
● authors, books - two data frames to be merged
● protein-consumption - matrix of percentages for eating
● Titanic - contingency table for people on the ship belonging to certain categories
R:
How to return multiple objects in a function?
return(list(a, b, c, etc))
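A minimal sketch of this pattern. The function name `describe` and the element names are made up for illustration; naming the list elements lets you pick them out with $ afterwards.

```r
# Hypothetical helper: bundles several results into one named list
describe = function(x) {
  return(list(mean = mean(x), sd = sd(x), n = length(x)))
}

res = describe(c(2, 4, 6))
res$mean   # 4
res$n      # 3
```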
R:
How to create a data frame from survey in a .tab file?
> survey = read.table("../../../data/survey-2019-11.tab",
+     header=TRUE, stringsAsFactors=TRUE)
R:
How to check dimensions of a dataframe?
dim(dataframe)
R:
Two ways to check number of rows in a data frame?
> dim(dataframe)[1]
> nrow(dataframe)
R:
What is ordering? Code to order a dataframe?
- gives the indices of elements in some order
- does not change the data frame
eg:
~~~
head(somedf[order(somedf$someCol),])
~~~
R:
What is the difference between sorting and ordering?
sort - gives back the values themselves in sorted order
order - gives back the indices that would sort the values
neither changes the original object; the result must be assigned to keep it
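A small sketch with an invented vector, showing that sort() returns values, order() returns indices, and the original object stays untouched:

```r
x = c(30, 10, 20)

sort(x)       # values in sorted order: 10 20 30
order(x)      # indices that would sort x: 2 3 1
x[order(x)]   # same result as sort(x)
x             # unchanged: 30 10 20

# order() is what a data frame needs: it yields row indices
df = data.frame(name = c("a", "b", "c"), score = x)
sorted.df = df[order(df$score), ]
sorted.df$name   # "b" "c" "a"
```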
R:
Basic aggregate() Usage
What is the general syntax of the aggregate() function in R?
aggregate(numeric_vector, by = list(categorical_vector), FUN, ...)
R:
How do you calculate the mean age by gender from the nym dataset?
aggregate(nym$age, by = list(nym$gender), mean)
R:
How do you calculate the mean for place, age, and time, grouped by gender with trimmed mean (10%)?
aggregate(nym[, c('place', 'age', 'time')], by = list(nym$gender), mean, trim = 0.1)
R:
How can you use with() to avoid $ notation in aggregate()?
with(survey, aggregate(cm, by = list(gender), mean))
R:
How do you replace country codes in nym$home with USA for two-letter codes and World otherwise?
usa = as.character(nym$home)
usa[grep("^[A-Z][A-Z]$", nym$home)] = "USA"
usa[-grep("^[A-Z][A-Z]$", nym$home)] = "World"
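One caveat with negative indexing: if grep() finds no match it returns an empty vector, and indexing with a negated empty vector selects nothing, so no value would be set to "World". A possible alternative, sketched here on an invented stand-in for nym$home, uses grepl() with ifelse():

```r
# Toy stand-in for nym$home (values invented for illustration)
home = factor(c("NY", "MEX", "CA", "GER"))

# grepl() returns TRUE/FALSE per element, so no negative indexing is needed
usa = ifelse(grepl("^[A-Z][A-Z]$", as.character(home)), "USA", "World")
usa   # "USA" "World" "USA" "World"
```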
R:
How do you calculate the mean place, age, and time, grouped by gender and home (USA/World)?
aggregate(nym[, c('place', 'age', 'time')], by = list(nym$gender, as.factor(usa)), mean)
R:
nym dataset:
How do you count the number of observations for each gender-home combination?
aggregate(nym[, c('age')], by = list(nym$gender, as.factor(usa)), length)
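The nym data is not included here, so the following sketch uses a small invented data frame (`runners`) to show the same two aggregate() calls. Naming the elements of the `by` list gives the group columns readable names.

```r
# Toy stand-in for the nym data (values invented for illustration)
runners = data.frame(
  gender = c("Female", "Male", "Female", "Male", "Male"),
  home   = c("USA", "USA", "World", "World", "USA"),
  age    = c(30, 40, 50, 20, 60)
)

# Mean age per gender
means = aggregate(runners$age, by = list(gender = runners$gender), FUN = mean)
means

# Number of observations per gender/home combination
counts = aggregate(runners$age,
                   by = list(gender = runners$gender, home = runners$home),
                   FUN = length)
counts
```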
R:
Which function can you use to add columns to a dataframe?
give an example
and rows?
using cbind()
> gender=c("male","female", "female","male")
> ages=c(12,23,22,11)
> df=data.frame(age=ages, gender=gender)
> colors=c("yellow","orange","yellow","green")
> df=cbind(df,color=colors)
rows analogously with rbind()
R:
What can you do with cbind() and rbind()? When does this not work?
They add rows (rbind) or columns (cbind) to a data frame.
They don’t work if the dimensions do not match –> but smartbind() from the ‘gtools’ package handles data frames with different columns
R:
How can you make one dataframe out of two? Give an example
> load('../../../data/authors.RData')
> authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
> head(books,4)
      name                         title other.author
1    Tukey     Exploratory Data Analysis         <NA>
2 Venables Modern Applied Statistics ...       Ripley
3  Tierney                     LISP-STAT         <NA>
4   Ripley            Spatial Statistics         <NA>
> merge(authors,books,by.x="surname",by.y="name")
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...
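A self-contained sketch of the same idea with two small invented data frames (the RData file is not needed). By default merge() keeps only keys found in both data frames; `all.x = TRUE` is shown as one way to keep unmatched rows of the first one.

```r
# Toy data frames (invented); the key column has a different name in each
authors = data.frame(surname = c("Tukey", "Ripley", "McNeil"),
                     nationality = c("US", "UK", "Australia"))
books = data.frame(name = c("Tukey", "Ripley", "Ripley"),
                   title = c("Exploratory Data Analysis",
                             "Spatial Statistics",
                             "Stochastic Simulation"))

# Default: only keys present in both data frames are kept
inner = merge(authors, books, by.x = "surname", by.y = "name")
nrow(inner)   # 3 (McNeil has no book in this toy data and is dropped)

# all.x = TRUE keeps every author; missing titles become NA
left = merge(authors, books, by.x = "surname", by.y = "name", all.x = TRUE)
nrow(left)    # 4
```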
R:
Compare dataframes and matrices
matrices
- always 2 dimensional
- only 1 type (usually numeric)
R:
How can you convert a data frame to a matrix?
> mt=as.matrix(mt)
R:
Can you use $ operator on matrices?
No
error "$ operator is invalid for atomic vectors" - matrices are internally saved as vectors (very efficient) so the $ operator doesn’t work
must use brackets [ ] with col/row names or indices
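A short sketch on an invented matrix; tryCatch() is used only so that the failing `$` access does not stop the script.

```r
m = matrix(1:6, nrow = 2,
           dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# m$a stops with an error, caught here for demonstration
res = tryCatch(m$a, error = function(e) "error")
res            # "error"

m[, "a"]       # whole column by name
m["r2", "b"]   # single cell by row and column name: 4
m[1, 3]        # single cell by index: 5
```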
R:
Would you use summary or aggregate with matrices?
- Using aggregate makes no sense here because we only have 1 type
- There is no column with categories
- remember all columns in a matrix must have the same type
R:
How could you get sums of all columns or rows in a matrix?
> head(apply(mt,1,sum),8)
(first 8 values)
> head(apply(mt,2,sum),5)
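The second argument of apply() is the margin: 1 means rows, 2 means columns. A sketch on a small invented matrix, with rowSums() and colSums() as shortcuts for the same sums:

```r
m = matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

apply(m, 1, sum)   # margin 1 = rows:    9 12
apply(m, 2, sum)   # margin 2 = columns: 3 7 11

# Shortcuts that give the same sums
rowSums(m)
colSums(m)
```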
R:
What are some useful variants of apply? Examples of usage?
lapply
lapply = list apply: applies a function to every list element
advantage: don’t need to loop over elements –> faster computation
> childs=list(Fritz=c("Max","Moritz"),
+     Klaus=c("Otto","Emi","Karl","Lotta"))
> lapply(childs,length)
$Fritz
[1] 2
$Klaus
[1] 4
> lapply(childs,length)$Klaus
[1] 4
rapply
~~~
> nc.childs=list(Fritz=list(Gerda=c("Max","Moritz"),Frieda=
+     c("Else")),Klaus=list(Marlene=c("Otto","Emi","Karl","Lotta")))
> rapply(nc.childs,length)
  Fritz.Gerda  Fritz.Frieda Klaus.Marlene
            2             1             4
~~~
R:
How do you do matrix multiplication ? What issue can arise?
> D %*% M
issue:
~~~
> N = D %*% M
> identical(M, N) # FALSE arrrghhh! Floating point issue!
[1] FALSE
> N == M
~~~
rounding issues!
the representation of floating point numbers is not 100% exact
solution:
~~~
> all.equal(M, N) # compares with a small numeric tolerance
~~~
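The matrices D and M are not defined in these notes, so here is the same effect on two plain numbers. all.equal() returns TRUE or a description of the difference, which is why it is wrapped in isTRUE() when used as a condition.

```r
a = 0.1 + 0.2
b = 0.3

a == b                    # FALSE: the two doubles differ in the last bits
identical(a, b)           # FALSE
isTRUE(all.equal(a, b))   # TRUE: equal up to a small tolerance
```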
R:
Different ways to create a table in R?
table
ftable
apply
matrix (and then table)
R:
What is in a table?
- contingency tables for counts
- each combination of factor levels is counted
R:
table vs data frame?
a table is not a data frame, but a contingency table that contains the counts of items for the different categories
R:
What is str()?
str() compactly displays the internal structure of an R object.
mostly used for displaying the contents of a list.
str() is an alternative to summary(), especially when the object is large or has more than two dimensions
eg:
~~~
> str(Titanic)
 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"
~~~
R:
What does ftable() do?
It ‘flattens’ data so that it will have 2 dimensions
eg:
~~~
> Titanic[c("1st","2nd"),"Male",,]
, , Survived = No
     Age
Class Child Adult
  1st     0   118
  2nd     0   154
, , Survived = Yes
     Age
Class Child Adult
  1st     5    57
  2nd    11    14
> ftable(Titanic[c("1st","2nd"),"Male",,])
            Survived  No Yes
Class Age
1st   Child            0   5
      Adult          118  57
2nd   Child            0  11
      Adult          154  14
~~~
R:
How to get a proportion table?
prop.table(table)
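A sketch using the built-in mtcars data (chosen here only because it ships with R). Without a margin the proportions refer to the grand total; `margin = 1` gives proportions within each row.

```r
# Contingency table of cylinders (rows) against gears (columns)
tab = table(mtcars$cyl, mtcars$gear)

p.all = prop.table(tab)               # shares of the grand total
p.row = prop.table(tab, margin = 1)   # shares within each row

sum(p.all)       # 1
rowSums(p.row)   # every row sums to 1
```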
R:
How to transpose a table?
t(table)
R:
What does with() do?
changes the scope of variables:
inside the with(...) call, the inner variables of a data frame or list (e.g. its columns) can be used directly by name, as if they were global variables;
outside the call nothing changes
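A small sketch with an invented data frame: the column name works on its own inside the call, and no new variable is left behind afterwards.

```r
df = data.frame(height = c(170, 180), weight = c(65, 80))

# Columns are visible by name only inside the with() call
m1 = with(df, mean(weight))
m2 = mean(df$weight)   # equivalent, with $ notation
m1                     # 72.5

# No variable called weight was created in the workspace
"weight" %in% ls()
```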
R:
How can you change the scope of variables?
with, attach, detach
R:
What do attach and detach do?
attach puts the inner variables on the search path, so they stay accessible by name until detach is called
detach removes the attached variables again
Hint: don’t use attach and detach
R:
What set operations are there? Name 4
intersect()
union()
setdiff()
setequal()
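The four functions on two invented vectors:

```r
a = c(1, 2, 3, 4)
b = c(3, 4, 5)

intersect(a, b)               # in both vectors:         3 4
union(a, b)                   # in one or both:          1 2 3 4 5
setdiff(a, b)                 # in a but not in b:       1 2
setdiff(b, a)                 # in b but not in a:       5
setequal(a, c(4, 3, 2, 1))    # same elements, any order: TRUE
```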
R:
&& vs & and || vs | ?
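- confusing feature of R
- & and | work element-wise on whole vectors
- && and || expect single values; in older R versions && worked on the first vector element only
- returns FALSE here because the condition is not TRUE for the first element (52)
- short-circuit evaluation: speed up to not go through long vectors
- but since R 4.3 a vector longer than one is an error
- good - I mostly did this by accident‼
in many languages we use &&, but in R don’t use it on vectors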
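A sketch on an invented vector: use & for element-wise tests, reduce to a single value with any() or all() before an if(), and keep && for single values where short-circuiting is wanted.

```r
x = c(5, 60, 70)

x > 50 & x < 65   # element-wise: FALSE TRUE FALSE
any(x > 50)       # one TRUE/FALSE for the whole vector: TRUE

# && on single values short-circuits:
# x[6] is never evaluated because the left side is already FALSE
ok = length(x) > 5 && x[6] > 50
ok                # FALSE
```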
R:
Reading and Saving data:
write.table and save - what’s the difference?
Give an example with nym
write.table → data frame in a file with tabstop as a separator
save → saved as binary - can’t be inspected from terminal. 1/3 of the size of tabular file
> head(nym[order(nym$age),],n=2)
    place gender age home     time
116 23373   Male  18  MEX 408.3333
182  8823 Female  20  MEX 244.8833
> nym2=nym[order(nym$age),]
> write.table(nym2,file="nym2.tab",sep="\t",quote=FALSE)
> save(nym2,file="nym2.RData")
R:
read.table, write.table - datatypes?
read.table always produces a data frame
- need to convert
~~~
> mt[1:2,1:4]
        RedMeat WhiteMeat Eggs Milk
Albania    10.1       1.4  0.5  8.9
Austria     8.9      14.0  4.3 19.9
> write.table(mt,file="protein-consumption2.tab",
+     sep="\t",quote=FALSE)
> mt2=read.table("protein-consumption2.tab", header=TRUE,
+     stringsAsFactors=TRUE)
> mt2[1:2,1:4]
        RedMeat WhiteMeat Eggs Milk
Albania    10.1       1.4  0.5  8.9
Austria     8.9      14.0  4.3 19.9
> class(mt)
[1] "matrix" "array"
> class(mt2)
[1] "data.frame"
> mt2=as.matrix(mt2)
> mt2[1:2,1:4]
        RedMeat WhiteMeat Eggs Milk
Albania    10.1       1.4  0.5  8.9
Austria     8.9      14.0  4.3 19.9
> class(mt2)
[1] "matrix" "array"
> sum(mt2-mt)
[1] 0
> identical(mt2,mt) # or use all.equal to be sure
[1] TRUE
~~~
R:
read.ftable / write.ftable - when to use them? Benefit?
use for tables with more than 2 dimensions - flattens them - better way to present data
eg
~~~
> ftable(Titanic["1st",,,])
> write.ftable(ftable(Titanic["1st",,,]),
+     file="ftable.ftab")
> sam=read.ftable("ftable.ftab")
> sam
~~~
R:
dot, comma problems with read.table?
some files use a comma (,) as the decimal separator instead of the usual dot (.).
When read.table() is used without specifying the decimal separator, R cannot interpret values such as 1,5 as numbers, so the affected columns are read as character (or factor) instead of numeric.
The issue is fixed by explicitly setting dec=',' in read.table(), telling R that commas indicate decimal points.
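A sketch that reads invented tab-separated text directly from a string through the `text` argument of read.table(), so no file is needed:

```r
# Invented data: tab-separated, with decimal commas
txt = "name\tvalue\na\t1,5\nb\t2,25"

wrong = read.table(text = txt, header = TRUE, sep = "\t")
class(wrong$value)   # "character" (a factor in R versions before 4.0)

right = read.table(text = txt, header = TRUE, sep = "\t", dec = ",")
class(right$value)   # "numeric"
sum(right$value)     # 3.75
```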
R:
what is RDS
when / how to use?
“Serialization Interface for Single Objects” - R’s own data file format
saveRDS(object, file="filename.RDS") saves a single object to a file.
readRDS("filename.RDS") loads the object into a variable (does not change any existing variables).
Use saveRDS() and readRDS() for single objects when you want explicit assignment and to avoid accidental overwrites.
Yes, RDS files save only single objects - but you can create lists!
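A round-trip sketch with a temporary file; the list `bundle` and its contents are invented to show how several objects can travel in one RDS file.

```r
rds.file = tempfile(fileext = ".RDS")

# Several objects bundled into one list, saved as a single object
bundle = list(numbers = c(1, 2, 3), label = "demo")
saveRDS(bundle, file = rds.file)

# readRDS() returns the object; you choose the variable name yourself
restored = readRDS(rds.file)
identical(bundle, restored)   # TRUE
restored$label                # "demo"
```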
R:
what to be careful of with load()?
it will overwrite variables if they exist
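A sketch of the overwrite with a temporary file, plus one possible safeguard: loading into a separate environment through the `envir` argument of load(). The variable names are invented.

```r
rdata.file = tempfile(fileext = ".RData")

x = "saved value"
save(x, file = rdata.file)

x = "new value"
load(rdata.file)   # silently replaces the current x
x                  # "saved value"

# Loading into a separate environment leaves existing variables alone
x = "new value"
e = new.env()
load(rdata.file, envir = e)
x      # still "new value"
e$x    # "saved value"
```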
R:
How to load excel files?
many different packages for this
eg:
~~~
> install.packages('openxlsx')
> library(openxlsx)
> sample=read.xlsx("../../../data/sample.xlsx")
~~~
(Quiz 1)
From which data sources can R directly import data without additional packages? Several answers are possible.
a. Tab files
b. RData files
c. RDS files
d. SQL Databases
e. Excel files
a. Tab files - you can use the read.table command
b. RData files - you can use the load command to import data in RData files
c. RDS files - readRDS is a built-in function
The commands load, read.table, readRDS can be used to import RData, Tab and RDS files into R without installing additional packages.
(Quiz 1)
What does this command do?
readRDS
loads a single object into a given variable name
(Quiz 1)
What does this command do?
load
loads one or more saved objects under their original names, without variable assignment
(Quiz 1)
What does this command do?
read.table
loads data from a flat text file
(Quiz 1)
What does this command do?
source
loads and executes R code from a flat file
(Quiz 1)
What does this command do?
loadhistory
loads old session R commands into the current session
(Quiz 2)
Within square brackets, for sorting / ordering the data of a data frame, which is probably the better choice?
* order
* by
* sort
order – indices will be returned (sort → values)
(Quiz 2)
To get the average value of all columns of a numerical matrix we usually use the ______ function together with the mean function, whereas for calculating group based means of a data frame for a numerical vector in this data frame against a factor vector in the data frame we use the ______ function. To add new rows for both data structures we use the ______ function whereas for new columns we use the ______ function. How many rows and columns are in both structures we can find out using the _____ function.
To get the average value of all columns of a numerical matrix we usually use the apply function together with the mean function, whereas for calculating group based means of a data frame for a numerical vector in this data frame against a factor vector in the data frame we use the aggregate function. To add new rows for both data structures we use the rbind function whereas for new columns we use the cbind function. How many rows and columns are in both structures we can find out using the dim function.
(Quiz 2)
Complete the code.
To change the name of the last column of a data frame df to a name ‘last’ we use the following construct:
________________
To change the name of the last column of a data frame df to a name ‘last’ we use the following construct:
~~~
colnames(df)[length(df)]="last"
~~~
(Quiz 2)
To remove the column with the name ‘last’ from the df data frame we use the following code:
________________
To remove the column with the name ‘last’ from the df data frame we use the following code:
~~~
df$last=NULL
~~~
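Both column operations on an invented data frame. Note that removal needs an assignment of NULL with a single `=` (or `<-`); `==` would only compare. R is also case-sensitive, so the variable must be written `df`, not `Df`.

```r
df = data.frame(a = 1:2, b = 3:4, c = 5:6)

colnames(df)[ncol(df)] = "last"   # ncol(df) equals length(df) for a data frame
colnames(df)                      # "a" "b" "last"

df$last = NULL                    # assigning NULL drops the column
colnames(df)                      # "a" "b"
```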
(Quiz 2)
To combine two data frames based on a column with values which can be matched, we use the ______ command.
merge()
(Quiz 2)
To get the elements of two vectors that are in both vectors, not in only one vector, we use the ______ command.
intersect()
(Quiz 2)
To get the elements of two vectors that are in one or both vectors we use the ______ command
union()
(Quiz 2)
To get the elements of vector 1 which are not in vector 2 we can use the ______ command
To get the elements of vector 1 which are not in vector 2 we can use the setdiff command
setdiff()
(Quiz 2)
To create a contingency table out of two variables we use the ______function
table()
(Quiz 2)
To display tables with more than two dimensions we use the ______ command
To display tables with more than two dimensions we use the ftable command
ftable()
(Quiz 2)
To extract two variables from a multidimensional table we can use the ______ command together with the ______ command
apply(), sum()
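A sketch with the built-in Titanic table: the margin names the dimensions to keep, and sum() collapses all the others.

```r
# Keep dimensions 1 (Class) and 4 (Survived), summing over Sex and Age
cs = apply(Titanic, c(1, 4), sum)
cs

# The same selection by dimension name instead of position
cs2 = apply(Titanic, c("Class", "Survived"), sum)

dim(cs)   # 4 2
sum(cs)   # 2201, the same total as sum(Titanic)
```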
(Quiz 2)
To create a new contingency table out of four given numbers we can use the _____command.
matrix()
(EXAM - VL3)
1- save
2- saveRDS
3- dev.copy2pdf
4- write.ftable
5- write.table
6- save.image
7- savehistory
?
(2024-2)
save – Saves multiple R objects to file in binary format (.RData).
saveRDS – Saves single R object to file in binary format (.rds), allowing selective loading.
dev.copy2pdf – Copies current graphics device output to PDF file.
write.ftable – Writes flat contingency table (ftable) to a text file.
write.table – Exports a data frame or matrix to a text file (CSV-like).
save.image – Saves entire current R workspace (all objects) to .RData file.
savehistory – Saves command history to a file (.Rhistory)