Chapter 2 Code Flashcards by Louise Rodgers

How can we load a dataframe if it is from a package in R?

data <- nhanes
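
As a small sketch of the same idea using datasets that ship with base R (mtcars and iris from the datasets package are stand-ins here, since these cards do not name the package that provides nhanes):

```r
# Option 1: refer to the dataset through its package namespace
cars <- datasets::mtcars

# Option 2: copy the dataset into the workspace by name with data()
data("iris", package = "datasets")

# Both are ordinary data frames once loaded
is.data.frame(cars)
is.data.frame(iris)
```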

How can we load a data frame from a file?

data <- read.csv("nhanes.csv")

How do we write data to a file?

write.csv(data, "file_name")

# row.names = TRUE is the default; use row.names = FALSE to leave the row names out
write.csv(data, "file_name", row.names = TRUE)
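
A minimal round-trip sketch, writing a data frame to a file and reading it back. The data frame df and the temporary path are made up for illustration:

```r
# A tiny illustrative data frame with one missing value
df <- data.frame(id = 1:3, score = c(2.5, NA, 4.0))

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Read it back; the missing value comes back as NA
df_back <- read.csv(path)
df_back
```

With row.names = FALSE the file holds only the two data columns, so the data frame read back has the same columns as the original.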

How do we look at the structure of the data?

str(data)

How do you ensure that there is only complete data in the dataset?

Use na.omit(), which drops every row that contains at least one NA.

data <- na.omit(data)
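
A short sketch of the effect on a made-up data frame (the column names age and bmi are illustrative only):

```r
# Rows 2 and 3 each contain a missing value
df <- data.frame(age = c(21, NA, 35, 40),
                 bmi = c(22.1, 24.3, NA, 27.8))

# Keep only the rows with no NA in any column
complete <- na.omit(df)

nrow(df)        # 4
nrow(complete)  # 2
```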

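How do you print a statement in R?

paste0()

paste0() builds the string by joining its arguments with no separator; wrap it in print() or cat() to display it.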
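
A minimal sketch of building a message and displaying it (the variable names n and msg are illustrative):

```r
n <- 5

# paste0() joins the pieces with no separator and returns one string
msg <- paste0("There are ", n, " rows")

print(msg)      # shows the string with quotes and an index prefix
cat(msg, "\n")  # shows the bare text followed by a newline
```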

What function was used to get the mode of a vector?

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

What does tabulate() do?

Takes an integer-valued vector bin and counts the number of times each integer occurs in it
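
For instance, on a small made-up vector:

```r
bin <- c(2, 3, 3, 5)

# One count per integer from 1 up to max(bin)
counts <- tabulate(bin)
counts  # 0 1 2 0 1
```

The result has one slot for each integer 1 to 5, so integers that never occur (1 and 4 here) get a count of 0.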

What does match(x, y) do?

Returns, for each element of x, the position of its first match in y (NA where there is no match).
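
A quick illustration with made-up vectors:

```r
x <- c("b", "z", "a")
y <- c("a", "b", "c")

# "b" is at position 2 of y, "z" is absent, "a" is at position 1
pos <- match(x, y)
pos  # 2 NA 1
```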

What function determines the location (index) of the first minimum or maximum of a numeric (or logical) vector?

which.max() for the maximum, which.min() for the minimum

What function can you use to return the statistical mode?

R does not have a built-in function to compute the statistical mode. This is why you need to use a custom function like getmode to:

Identify the unique values.
Count how often each unique value occurs.
Return the value with the highest frequency.
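
The three steps can be traced one at a time on a small made-up vector, using the getmode function from this deck:

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

v <- c("b", "a", "b", "c", "a", "b")

uniqv <- unique(v)                   # step 1: "b" "a" "c"
counts <- tabulate(match(v, uniqv))  # step 2: 3 2 1
result <- uniqv[which.max(counts)]   # step 3: "b"

# which.max() returns the first maximum, so on a tie the value
# that appears first in the vector is returned
tie <- getmode(c(4, 1, 1, 4))        # 4
```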

What are the measures of location?

mean(data)
median(data)

and the mode, which has no built-in function, so a custom one such as getmode() is needed
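
A small sketch on made-up numbers, showing how one extreme value pulls the mean but not the median:

```r
x <- c(1, 2, 2, 3, 100)

m <- mean(x)      # 21.6, dragged upwards by the value 100
med <- median(x)  # 2, the middle value of the sorted data
```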

How is R indexed?

Indexed from 1
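
A short illustration with a made-up vector:

```r
x <- c(10, 20, 30)

first <- x[1]         # 10: the first element is at index 1
last <- x[length(x)]  # 30: the last element is at index length(x)
empty <- x[0]         # numeric(0): index 0 selects nothing
```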

How do you convert a numeric variable to a categorical variable?

data$var <- as.factor(data$var)
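
A minimal sketch on a made-up data frame (df and its column var are illustrative):

```r
df <- data.frame(var = c(1, 2, 2, 3))

# Each distinct number becomes a category (level)
df$var <- as.factor(df$var)

is.factor(df$var)  # TRUE
levels(df$var)     # "1" "2" "3"

# To recover the original numbers, go through character first;
# as.numeric() on a factor alone returns the internal level codes
original <- as.numeric(as.character(df$var))
```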

How can you quickly see the quartiles of data?

summary(data)

What are the measures of variation?

Range
- Can do max(data) - min(data)
- Can do range(data)[2] - range(data)[1]

IQR
- IQR(data)
- summary(data)[5] - summary(data)[2]

Variance
- var(data)

Use summary(data) to get the quartiles
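
A sketch of the three measures on a made-up numeric vector. It uses quantile() for the quartiles, which is assumed here to give the same values that summary() reports:

```r
x <- c(2, 4, 4, 5, 7, 9)

# Range: distance between the largest and smallest value
rng <- max(x) - min(x)             # 7

# IQR: distance between the third and first quartile
q <- quantile(x, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])  # 2.5
iqr <- IQR(x)                      # 2.5

# Sample variance: squared deviations from the mean, divided by n - 1
v <- var(x)
v_manual <- sum((x - mean(x))^2) / (length(x) - 1)
```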

How do we convert frequencies to probabilities?

prob <- freq / sum(freq)
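
For example, starting from raw categorical answers (made up for illustration), table() gives the frequencies:

```r
answers <- c("yes", "no", "yes", "yes", "maybe")

freq <- table(answers)    # maybe: 1, no: 1, yes: 3
prob <- freq / sum(freq)  # maybe: 0.2, no: 0.2, yes: 0.6

sum(prob)                 # the probabilities add up to 1
```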

What are the measures of heterogeneity?

Gini index of heterogeneity
Entropy

How do we calculate the Gini Index of Heterogeneity?

G <- 1 - sum(prob^2)

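Why would you multiply the Gini index by k / (k - 1)?
What code is used to do this?

k / (k - 1) is a correction factor that adjusts the Gini index for the number of categories, k, in the dataset.

The largest value G can take is (k - 1) / k, reached when all categories are equally likely, so multiplying by k / (k - 1) normalises the index to range from 0 to 1.

G prime (the normalised Gini index):
k <- length(example)  # number of categories
G_norm <- G * (k / (k - 1))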
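
A sketch of the two extreme cases, using a hypothetical helper gini_het() (not part of any package) that takes a vector of category frequencies:

```r
# Hypothetical helper: normalised Gini index of heterogeneity
gini_het <- function(freq) {
  prob <- freq / sum(freq)
  k <- length(freq)     # number of categories
  G <- 1 - sum(prob^2)
  G * (k / (k - 1))
}

# All categories equally frequent: maximum heterogeneity
g_max <- gini_het(c(10, 10, 10, 10))  # 1

# Every observation in one category: no heterogeneity
g_min <- gini_het(c(40, 0, 0, 0))     # 0
```

In the first case the unnormalised G is 0.75, which is (k - 1) / k for k = 4, and the correction factor scales it up to 1.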

How do you calculate entropy?

E <- -sum(prob * log10(prob))

How do you normalise the entropy value?

Divide by log10(k), the same log base used for E, where k = length(data) is the number of categories

E_norm <- E / log10(k)
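
A sketch of the same calculation wrapped in a hypothetical helper entropy_norm() (not part of any package). It assumes the usual convention that an empty category contributes 0 to the sum, because 0 * log10(0) evaluates to NaN in R:

```r
# Hypothetical helper: normalised entropy from category frequencies
entropy_norm <- function(freq) {
  prob <- freq / sum(freq)
  k <- length(prob)        # number of categories, empty ones included
  p <- prob[prob > 0]      # skip empty categories to avoid NaN
  E <- -sum(p * log10(p))
  E / log10(k)
}

e_max <- entropy_norm(c(10, 10, 10, 10))  # 1: all categories equally likely
e_min <- entropy_norm(c(40, 0, 0, 0))     # 0: one category holds everything
```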

What is the Gini() function from DescTools used for?

The Gini concentration index - used for measuring concentration

How do you calculate the Gini concentration index?

Gini(data)

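Manual process:
example_1 <- sort(example_1)  # order the values from smallest to largest
n <- length(example_1)
n_list <- 1:n

Fi <- n_list / n              # cumulative share of units

xj <- cumsum(example_1)
N <- sum(example_1)
Qi <- xj / N                  # cumulative share of the total

FiQi <- Fi - Qi
FiQi <- head(FiQi, -1)        # drop the last term, where Fi = Qi = 1
sumFiQi <- sum(FiQi)
sumFi <- sum(head(Fi, -1))

R <- sumFiQi / sumFi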
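
The manual process can be sketched as a function and checked on the two extreme cases. The name gini_concentration() is hypothetical, and the input is assumed to be a vector of non-negative values with a positive total:

```r
# Hypothetical helper: Gini concentration ratio R from raw values
gini_concentration <- function(x) {
  x <- sort(x)
  n <- length(x)
  Fi <- (1:n) / n          # cumulative share of units
  Qi <- cumsum(x) / sum(x) # cumulative share of the total
  # The last term is dropped because Fi = Qi = 1 there
  sum(head(Fi - Qi, -1)) / sum(head(Fi, -1))
}

r_equal <- gini_concentration(c(5, 5, 5, 5))   # 0: total shared equally
r_single <- gini_concentration(c(0, 0, 0, 10)) # 1: one unit holds everything
```

Algebraically this ratio works out to n / (n - 1) times the uncorrected Gini coefficient, so it should agree with Gini() from DescTools under its default unbiased setting, though that is worth confirming on your own data.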

If we want to work on models and then check the results, what should we do?

Split the data into a training and test set

How do we split the data into a training and test set?

createDataPartition() from the caret package:

in_train <- createDataPartition(data$column, p = 0.8, list = FALSE)
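
To show how a vector of row indices turns into the two sets, here is a base R sketch that swaps in a plain random split with sample(). It is not the caret function and does not balance the outcome variable across the two sets; mtcars and the seed value are used only for illustration:

```r
set.seed(42)  # make the random split reproducible

data <- mtcars
n <- nrow(data)

# Randomly pick 80% of the row numbers for training
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- data[train_idx, ]  # rows that were picked
test <- data[-train_idx, ]  # all remaining rows
```

The same indexing pattern, data[idx, ] and data[-idx, ], applies to the index returned by createDataPartition() when list = FALSE.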

How do you add labels to a graph?

Add labs() to a ggplot2 plot:

+ labs(title = "yx")