R Flashcards
Modulo operation
The modulo operator %% returns the remainder of dividing the number on its left by the number on its right; for example, 5 modulo 3, written 5 %% 3, is 2.
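A minimal sketch of the arithmetic in base R. The variable names are illustrative only; %/% (integer division) is shown alongside %% because together they reconstruct the original number.

```r
# %% gives the remainder, %/% gives the integer quotient
remainder <- 5 %% 3    # 2
quotient  <- 5 %/% 3   # 1

# quotient * divisor + remainder gives back the original number
reconstructed <- quotient * 3 + remainder   # 5
```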
Check the data type of a variable
class()
assign value to variable
var <- value
how to create a vector
with the combine function c()
assign names to vector
names()
[2:5] –> which values does this include?
includes the second through the fifth value of a vector (positions 2, 3, 4 and 5)
Define a new variable based on a selection from a vector
new_var <- some_vector[c(…, 3, 4, …)] or, for a range, some_vector[start:end]
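The vector cards above can be tried together in a short base R sketch. The vector days_vector and its values are made up for illustration.

```r
# Hypothetical example data: one value per weekday
days_vector <- c(140, -50, 20, -120, 240, 30)
names(days_vector) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

class(days_vector)                  # "numeric"

# 2:5 expands to 2, 3, 4, 5, so this keeps four elements (Tue to Fri)
midweek <- days_vector[2:5]

# An explicit set of positions built with c()
picked <- days_vector[c(1, 3, 4)]   # Mon, Wed, Thu
```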
Construct a matrix with 3 rows that contain the numbers 1 up to 9
matrix(1:9, byrow = TRUE, nrow = 3)
Name the columns and rows of a matrix
rownames()
colnames()
calculate sums of rows of matrix or of columns
rowSums() or colSums()
Merge matrices and/or vectors together by column (right) or below
cbind() or for below: rbind()
in console, check out contents of workspace
ls()
data on the rows 1, 2, 3 and columns 2, 3, 4.
my_matrix[1:3,2:4]
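The matrix cards above fit into one small base R sketch. The row and column names are arbitrary placeholders chosen for the example.

```r
# 3 x 3 matrix filled row by row: 1 2 3 / 4 5 6 / 7 8 9
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
rownames(my_matrix) <- c("r1", "r2", "r3")
colnames(my_matrix) <- c("a", "b", "c")

row_totals <- rowSums(my_matrix)   # 6 15 24
col_totals <- colSums(my_matrix)   # 12 15 18

# cbind() appends a column on the right, rbind() appends a row below
wide <- cbind(my_matrix, total = row_totals)   # 3 rows, 4 columns
tall <- rbind(my_matrix, total = col_totals)   # 4 rows, 3 columns

# Rows 1 to 3 and columns 2 to 4 of the widened matrix
block <- wide[1:3, 2:4]
```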
encode the vector as a factor –> and, optional, also give them an order
factor() –> factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
change factor levels of a factor vector to …
levels(factor_vector) <- c("", "") –> the order in which you assign the levels is important. If you don't specify the levels of the factor when creating the vector, R will automatically assign them alphabetically.
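A small base R sketch of both factor cards. The contents of temperature_vector and the short labels "H", "L", "M" are invented for the example.

```r
# Hypothetical example data
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Ordered factor: the levels argument fixes the ranking Low < Medium < High
factor_temperature <- factor(temperature_vector,
                             ordered = TRUE,
                             levels = c("Low", "Medium", "High"))
high_beats_low <- factor_temperature[1] > factor_temperature[2]   # TRUE

# Without levels =, R sorts the levels alphabetically
default_factor <- factor(temperature_vector)
default_levels <- levels(default_factor)   # "High" "Low" "Medium"

# Renaming must follow that existing order: High -> H, Low -> L, Medium -> M
levels(default_factor) <- c("H", "L", "M")
```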
quick overview of the contents of a variable
summary()
see first or last rows of a built-in dataframe
head() or tail()
select a subset based on a certain condition from your dataset
subset(dataframe, condition)
order a vector
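order() returns the positions that would sort the vector, so some_vector[order(some_vector)] is sorted; sort() returns the sorted values directly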
Order a dataframe based on a certain column with order()
dataframe[order(dataframe$column), ]
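A short base R sketch of how order() behaves on a vector and on a dataframe column. The small planets_df here is a made-up stand-in, not the full dataset from the course.

```r
a <- c(100, 10, 1000)
positions_a <- order(a)   # 2 1 3: positions of the smallest, middle, largest value
a[positions_a]            # 10 100 1000, same result as sort(a)

# Hypothetical three-row dataframe
planets_df <- data.frame(name = c("Earth", "Mercury", "Jupiter"),
                         diameter = c(1, 0.382, 11.209))

# order() on one column gives a positions vector for reordering the rows
positions <- order(planets_df$diameter)
sorted_df <- planets_df[positions, ]
```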
see structure of a dataframe
str()
create a dataframe
data.frame()
add components to a list, then assign names to the components
my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")
or
my_list <- list(name1 = your_comp1,
name2 = your_comp2)
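Both ways of naming list components can be compared in base R. The component values and the names "numbers" and "label" are placeholders for this sketch.

```r
# Variant 1: build the list first, then assign names
my_list <- list(1:3, "text")
names(my_list) <- c("numbers", "label")

# Variant 2: name the components while building the list
same_list <- list(numbers = 1:3, label = "text")

# Named components can be reached with $ or [[ ]]
my_list$numbers
same_list[["label"]]
```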
filter dataframe
filter()
sort the rows of a df based on a positions vector
planets_df[positions, ]
sort values in a dataset
arrange(column_to_use) or arrange(desc(column_to_use)) for descending
pipe
%>%
change values in a dataframe
mutate(what_is_replaced = what_is_calculated)
package for data visualization
ggplot2
create visualization
ggplot(dataset, aes(aesthetic mapping of variables)) + type of graph
function for creating scatterplot with ggplot
geom_point()
to ggplot, add color and size of dots
ggplot(dataset, aes(aesthetic mapping of variables, color = …, size = …)) + type of graph
divide one plot into multiple smaller plots
faceting: facet_wrap(~…)
turn groups into one row each before summarize()
group_by()
after specifying type of graph, also specify log scale
+ scale_x_log10()
turn many rows into one with pipe
… %>% summarize(… = mean(…))
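A possible dplyr pipeline that chains several of the verbs above. It assumes the dplyr package is installed and uses the built-in mtcars dataset; the threshold hp > 60 and the summary column names are arbitrary choices for the example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 60) %>%                    # keep rows meeting a condition
  group_by(cyl) %>%                      # one group per cylinder count
  summarize(mean_mpg = mean(mpg),        # collapse each group into one row
            n_cars = n()) %>%
  arrange(desc(mean_mpg))                # highest mean first
```

Because mtcars has three distinct cyl values, the result has three rows, one per group.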
make a line plot
geom_line()
make a bar plot
geom_col()
make a histogram
geom_histogram(binwidth = … or bins = …)
make a boxplot
geom_boxplot()
what are the lines extending up and down from the box of a boxplot called?
“whiskers”
add title to ggplot
+ ggtitle("…")
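The ggplot2 cards above combine into a single plot specification. This is a sketch assuming ggplot2 is installed; it uses the built-in mtcars dataset, and the column choices and title are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg,
                        color = factor(cyl),   # color by a categorical variable
                        size = wt)) +          # dot size by a numeric variable
  geom_point() +                               # scatterplot layer
  scale_x_log10() +                            # log scale on the x-axis
  facet_wrap(~ am) +                           # one panel per value of am
  ggtitle("Fuel economy vs. horsepower")

# print(p) draws the plot
```

Note that the closing parenthesis of ggplot() comes before the +, so each further layer or scale is added to the plot object rather than passed into ggplot().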
Data visualization points of consideration
Add a smooth geom to the plot
geom_smooth()
geom_point() has an alpha argument - what does it mean?
controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.
when is it necessary to map the color aesthetic inside the geom?
When all layers should NOT inherit the same aesthetics or when mixing different data sources
how is fill distinct from color?
color usually, but not always, refers to the outline of a shape, whereas fill refers to its interior
change line pattern
linetype
text on a plot or axes
geom_text() layer and aes(label = …)
attributes vs aesthetics
attributes are always set in the geom layer; for example, a geom's color attribute is set by the color argument and its size by the size argument. Aesthetics, in contrast, are mapped to variables inside aes().
get rownames of dataframe
rownames()
default position for dataplot
identity (put position exactly where it originally should be)
add random noise to points to counteract overplotting
position = "jitter", or define posn_j <- position_jitter(…) beforehand and then set position = posn_j
every aesthetic has a scale, so how can we access that scale?
with scale_…*()
most common scale arguments
limits, breaks, expand, labels
set the x- and y-axis labels
labs() sets the x- and y-axis labels. It takes strings for each argument.
define properties of the color scale (i.e. axis). The first argument sets the legend title. values is a named vector of colors to use.
scale_fill_manual()
set y limits of axis
+ ylim()
to which axes are the independent and dependent variables mapped?
Typically, the dependent variable is mapped onto the y-axis and the independent variable is mapped onto the x-axis.
Markdown: Bold, italics, both, headers, inline link, reference link, images, block quote, lists, soft break
Headers: # (1-6 # possible)
Bold: …
Italics: …
Both: …
Inline link: [some_text_displayed](link)
Reference link: instead of (link) directly, use […], and at the end, […]: link
images: same as links, just enter ! before []
block quote: simply enter the symbol > before
Lists: either just * or 1., 2., etc.
Soft break: with 2 spaces ( )
picture of different geometries
how to offset bars in a histogram? How to “use the complete space top to bottom”?
position = "dodge", position = "fill"
for plotting, count number of cases at each x position
geom_bar()
plot errorbars e.g. for a mean
geom_errorbar(aes(ymin = avg - stdev, ymax = avg + stdev))
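A minimal error-bar sketch, assuming ggplot2 is installed. The summary table stats_df and its numbers are invented; in practice avg and stdev would come from a summarize() step. The lower and upper ends of each bar are given by the ymin and ymax aesthetics.

```r
library(ggplot2)

# Hypothetical pre-computed means and standard deviations
stats_df <- data.frame(group = c("a", "b", "c"),
                       avg   = c(3, 5, 4),
                       stdev = c(0.5, 1, 0.8))

p <- ggplot(stats_df, aes(x = group, y = avg)) +
  geom_col() +                                        # bars at the mean
  geom_errorbar(aes(ymin = avg - stdev,               # lower end of each bar
                    ymax = avg + stdev),              # upper end of each bar
                width = 0.2)
```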
in aes, set different line types
linetype = …
modify visual elements not part of the data (text, line, rectangle)
element_…()
aesthetics for categorical variables
in a plot, how to change e.g. the axis title colour
+ theme(axis.title = element_text(color = …))
do we need to modify each e.g. text item individually to e.g. change the colour?
no, they inherit from each other in a hierarchy. All text elements inherit from text, so if we changed that argument, all downstream arguments would be affected. The same goes for line and rectangle.
remove legend
theme(legend.position = "none") - also: "top", "bottom", "left", or "right": place it at that side of the plot.
remove an element in a plot
eg line: line = element_blank()
look inside data how each column looks
glimpse()
Basic data types
Character, Numeric (Double/Integer), Logical
vector vs matrix vs array vs dataframe
row in excel, excel sheet, stacked excel sheets, 2-D table in which each column can hold a different data type
Tibble
2-D array with less functionality than a dataframe to limit user mistakes - tibble is the unifying feature of tidyverse (data is expected to be in tibble)
Feature matrix X ∈ R^(N×D)
feature matrix X which contains N observations and D features (and R = real numbers)
Dimensionality of data - what counts?
When we talk about ‘dimensionality’ we typically mean ‘how many independent variables do I have for analysis’?
A single observation forms a row of data - correct?
Yes
check whether a value is missing (and the opposite)
is.na() (TRUE where a value is NA, i.e. missing) and !is.na() (opposite)
how to tidy up the column names of a dataframe
clean_names() (janitor package)
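A quick base R check of what is.na() reports. The vector x is a made-up example; note that is.na() flags missing values (NA), while is.numeric() is the function that tests whether a vector holds numbers.

```r
x <- c(1, NA, 3)

missing_flags <- is.na(x)    # FALSE TRUE FALSE
present_flags <- !is.na(x)   # TRUE FALSE TRUE

# Keep only the values that are present
x_complete <- x[!is.na(x)]   # 1 3

is.numeric(x)                # TRUE: the vector is still numeric despite the NA
```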
you need to convert categorical characters to numbers - how to do that
by creating a factor with factor(), which stores each category as an integer code
each variable forms a column, each observation forms a row, each cell is a single value - correct?
yes
which format in tidyverse is considered tidy?
long format
function to change date format - and package in tidyverse
dmy() - lubridate package (parses day-month-year strings into dates)
create tidy format in tidyverse - function
pivot_longer()
select columns of interest
select() –> no need for “” to select column, just write column_name
select rows of interest
filter()
select all columns in a dataframe except one
select(-column_name)
instead of using mutate: drop all non-transformed variables - function
transmute()
mutate vs transmute
mutate() keeps all variables in the original dataframe (unless otherwise specified in the .keep argument.) transmute() returns a dataframe with only the newly computed or modified variables.
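The difference shows up directly in the column names of the result. A sketch assuming dplyr is installed; the dataframe df and the computed column ratio are hypothetical.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(a = 1:3, b = c(10, 20, 30))

mutated    <- df %>% mutate(ratio = b / a)      # keeps a and b, adds ratio
transmuted <- df %>% transmute(ratio = b / a)   # keeps only ratio

names(mutated)      # "a" "b" "ratio"
names(transmuted)   # "ratio"
```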
filter in a df from specific date to specific date
filter(date >= as.Date("2020-01-01")) %>%
filter(date < as.Date("2021-01-01"))
in timeseries, use e.g. previous value, what function?
lag()
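The date filter and lag() can be combined in one pipeline. A sketch assuming dplyr is installed; ts_df, its dates, and the change column are invented for the example.

```r
library(dplyr)

# Hypothetical time series: one row before, two inside, one after 2020
ts_df <- data.frame(
  date  = as.Date(c("2019-12-31", "2020-01-01", "2020-06-15", "2021-01-01")),
  value = c(1, 2, 4, 8)
)

in_2020 <- ts_df %>%
  filter(date >= as.Date("2020-01-01")) %>%    # from this date (inclusive)
  filter(date < as.Date("2021-01-01")) %>%     # up to this date (exclusive)
  mutate(change = value - lag(value))          # difference to the previous row

# The first row has no previous value, so its change is NA
```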
mutate datetime into year
mutate(year = year(date))
join two dataframes - function
inner_join(df1, df2, by = …)
for pivot_longer, use all available columns
pivot_longer(cols = everything(), …, …)
how to in ggplot access different infos/settings for plot
+ theme(…)
draw a straight line in a ggplot
+ geom_smooth(method = "lm", color = …) - another option for method is e.g. "loess", which works basically like a local regression
define some sort of baseline in a plot
geom_hline(yintercept = 0, color = …)
filter filters rows, select selects columns - correct?
yes
get probability of z-value - function
xpnorm(value z, mean = …, sd = …) (mosaic package; the base R equivalent is pnorm())
given a particular probability of 𝑍 < 𝑧, what is the corresponding value 𝑧?
qnorm(probability, mean = …, sd = …)
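The two directions can be checked against each other with base R's pnorm() and qnorm(). The value 1.96 and the standard normal parameters are just a familiar example.

```r
# From a value z to the probability P(Z < z)
p <- pnorm(1.96, mean = 0, sd = 1)    # about 0.975

# From a probability back to the value z
z <- qnorm(0.975, mean = 0, sd = 1)   # about 1.96

# The two functions are inverses of each other
round_trip <- qnorm(pnorm(0.5))       # 0.5, up to floating-point error
```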
Function for correlation
cor(X,Y)
Get summary statistics about variables in tibbles
skim() of the skimr package - skim() works with (grouped) tibbles
get some statistics on some data (e.g. tibbles)
favstats(~ column, data = data) (mosaic package)
Use the dplyr package to select columns from e.g. a tibble/data set and plot correlations with one (simple/nice) function (ggpairs() from the GGally package)
… %>%
dplyr::select(…, …) %>%
ggpairs()
add a trendline over an existing plot, and use a linear model, and control whether the standard error (confidence interval) of the fitted line should be displayed - do not display it
geom_smooth(method=lm, se=FALSE, color=…)
Which library to plot 3 essential plots for residuals? Code for that?
- ggfortify
- model1 %>% autoplot(which = 1:3) + theme_bw()
What does :: do?
:: is an operator that accesses a specific function from a specific package, written package::function()
Function/plots that allow to check whether assumptions of a Linear Regression model have been satisfied (i.e. examination of the behavior of the residuals for model inadequacies)
- autoplot()
- from ggfortify library to plot 3 essential plots for residuals
Build a multiple regression model
model <- lm(y_variable ~ var_1 + var_2 + var_3, data = dataframe)
use a library to compare valid models (regression) to choose the best one
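A runnable instance of the same pattern in base R, using the built-in mtcars dataset. The choice of mpg as the response and wt, hp, disp as predictors is only an illustration. Variable names in a formula must be valid R names, so they cannot start with a digit.

```r
# Multiple regression: one response, three predictors
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

coefs <- coef(model)   # intercept plus one slope per predictor
summary(model)         # coefficients, standard errors, R-squared

# Pairwise correlation between two of the variables
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)   # negative: heavier cars use more fuel
```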
Test in RStudio for Multicollinearity