Chapter 2 Code Flashcards
How can we load a data frame that comes from a package in R?
data <- nhanes  # after loading the package that provides it with library()
How can we load a data frame from a file?
data <- read.csv("nhanes.csv")
How do we write data to a file?
write.csv(data, "file_name")
write.csv(data, "file_name", row.names = TRUE)
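A minimal round-trip sketch of writing and then reading a CSV file. The data frame `df` and the temporary file path are made up for illustration; `row.names = FALSE` is shown here only as an alternative that leaves the row names out of the file.

```r
# Hypothetical data frame; the path is a temporary file, not a real dataset
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row names as an extra first column
write.csv(df, path, row.names = FALSE)

# Read the file back and inspect its structure
df_back <- read.csv(path)
str(df_back)
```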
How do we look at the structure of the data?
str(data)
How do you ensure that there is only complete data in the dataset?
Use na.omit(), which drops every row that contains a missing value (NA)
data <- na.omit(data)
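A small sketch of what na.omit() does to a data frame. The data frame `df` and its columns are hypothetical.

```r
# Hypothetical data frame with missing values in two different rows
df <- data.frame(age = c(25, NA, 40), bmi = c(22.1, 27.3, NA))

# Keep only the rows with no NA in any column
complete <- na.omit(df)

nrow(df)        # 3
nrow(complete)  # 1, only the first row is complete
```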
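How do you print a statement in R?
print(paste0(...)): paste0() joins its arguments into one string with no separator, and print() displays the result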
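A small sketch of building and printing a message: paste0() builds the string and print() displays it. The variable names are made up for illustration.

```r
n <- 3

# paste0() concatenates its arguments with no separator
msg <- paste0("There are ", n, " rows")

print(msg)  # [1] "There are 3 rows"
```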
What function was used to get the mode of a vector?
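getmode <- function(v) {
  uniqv <- unique(v)  # distinct values in v
  # count how often each distinct value occurs and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}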
What does tabulate() do?
Takes an integer-valued vector bin and counts the number of times each integer occurs in it
What does match(x, y) do?
For each element of x, returns the position of its first match in y (NA if there is no match).
What functions are used to determine the location (index) of the (first) minimum or maximum of a numeric (or logical) vector?
which.min() and which.max()
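A short sketch of how unique(), match(), tabulate(), and which.max() behave on a small vector. The vector `v` is an arbitrary example.

```r
v <- c(3, 1, 3, 2, 1, 3)

uniqv <- unique(v)        # 3 1 2
pos <- match(v, uniqv)    # 1 2 1 3 2 1  (position of each element in uniqv)
counts <- tabulate(pos)   # 3 2 1        (how often each position occurs)

which.max(counts)         # 1, so the most frequent value is uniqv[1] = 3
which.min(counts)         # 3, so the least frequent value is uniqv[3] = 2
```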
What function can you use to return the statistical mode?
R does not have a built-in function to compute the statistical mode (the built-in mode() returns an object's storage type instead). This is why you need to use a custom function like getmode to:
Identify the unique values.
Count how often each unique value occurs.
Return the value with the highest frequency.
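A usage sketch of the custom function on made-up vectors. Note that when several values tie for the highest frequency, which.max() picks the first one, so the function returns the tied value that appears first.

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(c(2, 4, 4, 7, 7))   # 4, because 4 and 7 tie and 4 appears first
getmode(c("a", "b", "b"))   # "b", the function also works on character vectors
```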
What are the measures of location?
mean(data)
median(data)
mode: no built-in function, so use a custom one such as getmode(data)
How is R indexed?
Indexed from 1
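A quick sketch of 1-based indexing on an arbitrary vector.

```r
x <- c(10, 20, 30)

first <- x[1]           # 10, the first element is at index 1
last <- x[length(x)]    # 30
empty <- x[0]           # numeric(0), index 0 selects nothing
```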
How do you convert a numeric variable to a categorical variable?
data$var <- as.factor(data$var)
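A small sketch of the conversion on a hypothetical 0/1 column called `smoker`.

```r
# Hypothetical data frame with a numeric 0/1 indicator
df <- data.frame(smoker = c(0, 1, 1, 0))

df$smoker <- as.factor(df$smoker)

is.factor(df$smoker)   # TRUE
levels(df$smoker)      # "0" "1"
```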
How can you quickly see the quartiles of data?
summary(data)
What are the measures of variation?
Range
- Can do max(data) - min(data)
- Can do range(data)[2] - range(data)[1]
IQR
- IQR(data)
- summary(data)[5] - summary(data)[2]
Variance
- var(data)
Use summary(data) to get the quartiles
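A sketch of the measures above on an arbitrary numeric vector `x`, showing that the alternative ways of computing the range and the IQR agree.

```r
x <- c(2, 4, 4, 5, 7, 9)

# Range, computed two ways
range_a <- max(x) - min(x)            # 7
range_b <- range(x)[2] - range(x)[1]  # 7

# IQR, computed directly and from the quartiles reported by summary()
iqr_a <- IQR(x)                                       # 2.5
iqr_b <- as.numeric(summary(x)[5] - summary(x)[2])    # 3rd Qu. minus 1st Qu.

# Sample variance (divides by n - 1)
v <- var(x)
```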
How do we convert frequencies to probabilities?
prob <- freq / sum(freq)
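A sketch of the conversion starting from raw categorical data. The vector `colours` is made up, and using table() to obtain the frequencies is an assumption about where `freq` comes from.

```r
colours <- c("red", "blue", "red", "green", "red", "blue")

freq <- table(colours)      # counts per category
prob <- freq / sum(freq)    # relative frequencies

prob
sum(prob)                   # 1
```

prop.table(freq) gives the same result in one step.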
What are the measures of heterogeneity?
Gini index of heterogeneity
Entropy
How do we calculate the Gini Index of Heterogeneity?
G <- 1 - sum(prob^2)
Why would you multiply the Gini index by k / (k - 1)?
What code is used to do this?
k / (k - 1) is a correction factor that adjusts the Gini index for the number of categories k. Without it, the index can reach at most (k - 1) / k, which happens when all categories are equally likely.
Multiplying by k / (k - 1) normalises the index to range from 0 to 1.
Normalised Gini index (G'):
k <- length(prob)
G_norm <- G * (k/(k-1))
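A worked sketch on made-up frequencies, plus a check that the normalised index reaches 1 when all categories are equally likely. Here k is taken to be the number of categories.

```r
freq <- c(5, 3, 2)
prob <- freq / sum(freq)        # 0.5 0.3 0.2

G <- 1 - sum(prob^2)            # 1 - (0.25 + 0.09 + 0.04) = 0.62
k <- length(prob)               # number of categories
G_norm <- G * (k / (k - 1))     # 0.62 * 3 / 2 = 0.93

# Uniform case: G = 1 - 1/k = (k - 1) / k, so the normalised value is 1
prob_u <- rep(1 / k, k)
G_u_norm <- (1 - sum(prob_u^2)) * (k / (k - 1))
```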
How do you calculate entropy?
E <- -sum(prob * log10(prob))
How do you normalise the entropy value?
Divide by log10(k), where k is the number of categories (use the same log base as in E)
E_norm <- E / log10(k)
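A worked sketch on made-up frequencies. Dropping zero probabilities before taking the log is an added assumption: it avoids 0 * log10(0), which evaluates to NaN in R, while k still counts every category.

```r
freq <- c(5, 3, 2)
prob <- freq / sum(freq)
k <- length(prob)               # number of categories

# Skip zero probabilities, which contribute nothing to the entropy
p <- prob[prob > 0]
E <- -sum(p * log10(p))
E_norm <- E / log10(k)          # between 0 and 1

# Uniform case: every category equally likely gives a normalised entropy of 1
prob_u <- rep(1 / k, k)
E_u_norm <- -sum(prob_u * log10(prob_u)) / log10(k)
```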
What is the Gini() function from DescTools used for?
It computes the Gini concentration index, which measures how unevenly a quantity is concentrated across units
How do you calculate the Gini concentration index?
Gini(data)
Manual process:
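n <- length(data)
n_list <- 1:n
data <- sort(data)            # order values from smallest to largest
Fi <- n_list / n              # cumulative share of units
xj <- cumsum(data)
N <- sum(data)
Qi <- xj / N                  # cumulative share of the total
FiQi <- Fi - Qi
FiQi <- head(FiQi, -1)        # drop the last term, where Fi = Qi = 1
sumFiQi <- sum(FiQi)
sumFi <- sum(head(Fi, -1))
R <- sumFiQi / sumFi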
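The manual steps can be wrapped in a function and checked on two extreme cases. This is a sketch: the function name `gini_concentration` is hypothetical, and it takes the ratio to be the sum of Fi - Qi over the sum of Fi for the first n - 1 units. It should be comparable to Gini(data) from DescTools, but that is worth confirming against the package documentation.

```r
# Hypothetical helper that reproduces the manual process
gini_concentration <- function(x) {
  x <- sort(x)
  n <- length(x)
  Fi <- seq_len(n) / n          # cumulative share of units
  Qi <- cumsum(x) / sum(x)      # cumulative share of the total
  keep <- seq_len(n - 1)        # the last term has Fi = Qi = 1, so drop it
  sum(Fi[keep] - Qi[keep]) / sum(Fi[keep])
}

gini_concentration(c(10, 10, 10, 10))  # 0, everything is shared equally
gini_concentration(c(0, 0, 0, 40))     # 1, one unit holds the whole total
```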