Chapter 2 Code Flashcards

1
Q

How can we load a dataframe if it is from a package in R?

A

data <- nhanes  # works once the package that provides nhanes has been loaded with library()

2
Q

How can we load a data frame from a file?

A

data <- read.csv("nhanes.csv")

3
Q

How do we write data to a file?

A

write.csv(data, "file_name")

write.csv(data, "file_name", row.names = TRUE)
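
As a rough illustration of what the row.names argument changes, the sketch below writes a toy data frame to a temporary file and reads it back. The data frame, its values, and the variable names are made up for the example.

```r
# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(id = 1:3, score = c(10.5, 8.0, 9.25))

path <- tempfile(fileext = ".csv")  # temporary file, deleted at the end

# Default (row.names = TRUE): row names are written as an unnamed first column,
# which read.csv() then names "X"
write.csv(df, path)
with_rows <- read.csv(path)
names(with_rows)     # "X" "id" "score"

# row.names = FALSE: only the actual columns are written
write.csv(df, path, row.names = FALSE)
without_rows <- read.csv(path)
names(without_rows)  # "id" "score"

unlink(path)
```

If the file is meant to be read back into R unchanged, row.names = FALSE is usually the more convenient choice.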

4
Q

How do we look at the structure of the data?

A

str(data)

5
Q

How do you ensure that there is only complete data in the dataset?

A

Use na.omit(), which drops every row that contains at least one NA:

data <- na.omit(data)
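
A minimal sketch of the effect, using a made-up data frame (the column names and values are assumptions for illustration):

```r
# Hypothetical data with one missing value in each of two rows
df <- data.frame(age = c(34, NA, 51, 28),
                 bmi = c(22.1, 27.4, NA, 30.2))

complete <- na.omit(df)

nrow(df)            # 4
nrow(complete)      # 2: only the rows with no NA remain
complete.cases(df)  # TRUE FALSE FALSE TRUE: the rows that na.omit() keeps
```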

6
Q

How do you print a statement in R?

A

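print(paste0("Value: ", x))

paste0() only builds the string, joining its arguments with no separator; print() or cat() is what displays it.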

7
Q

What function was used to get the mode of a vector?

A

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

8
Q

What does tabulate() do?

A

Takes an integer-valued vector bin and counts how many times each integer from 1 to max(bin) occurs in it.
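
A small example may help; the input vector is arbitrary:

```r
bin <- c(2, 3, 3, 5)

# One count per integer 1..5: no 1s, one 2, two 3s, no 4s, one 5
counts <- tabulate(bin)
counts                     # 0 1 2 0 1

# nbins limits how many bins are reported; values above nbins are ignored
tabulate(bin, nbins = 3)   # 0 1 2
```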

9
Q

What does match(x, y) do?

A

Returns, for each element of x, the position of its first match in y (NA if there is no match).
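
For instance, with two made-up character vectors:

```r
x <- c("b", "a", "b", "z")
y <- c("a", "b", "c")

pos <- match(x, y)
pos   # 2 1 2 NA: "b" is at position 2 of y, "a" at position 1, "z" is absent
```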

10
Q

What functions determine the location (index) of the (first) minimum or maximum of a numeric (or logical) vector?

A

which.min() and which.max()
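
A short sketch showing that only the first position is returned when there are ties (the vector is arbitrary):

```r
v <- c(4, 9, 2, 9, 2)

i_max <- which.max(v)   # 2: the first of the two 9s
i_min <- which.min(v)   # 3: the first of the two 2s
v[i_max]                # 9

# All positions of the maximum, if ties matter
which(v == max(v))      # 2 4
```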

11
Q

What function can you use to return the statistical mode?

A

R does not have a built-in function to compute the statistical mode. This is why you need to use a custom function like getmode to:

Identify the unique values.
Count how often each unique value occurs.
Return the value with the highest frequency.
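
The three steps above can be sketched as follows. This is the same getmode idea with the intermediate result named; the test vectors are made up.

```r
getmode <- function(v) {
  uniqv <- unique(v)                    # 1. unique values, in order of first appearance
  counts <- tabulate(match(v, uniqv))   # 2. how often each unique value occurs
  uniqv[which.max(counts)]              # 3. value with the highest count
}

getmode(c(3, 7, 7, 2, 3, 7))      # 7
getmode(c("a", "b", "b", "a"))    # "a": on a tie, the value that appears first wins
```

Because which.max() returns only the first maximum, a vector with several modes yields just one of them.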

12
Q

What are the measures of location?

A

mean(data)
median(data)

Mode: R has no built-in function for it, so define one (e.g. getmode)

13
Q

How is R indexed?

A

Indexed from 1

14
Q

How do you convert a numeric variable to a categorical variable?

A

data$var <- as.factor(data$var)
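
A minimal sketch with a hypothetical 0/1 column, to show what changes after the conversion:

```r
# Hypothetical data: smoker coded numerically as 0/1
df <- data.frame(smoker = c(0, 1, 1, 0, 1))

df$smoker <- as.factor(df$smoker)

is.factor(df$smoker)   # TRUE
levels(df$smoker)      # "0" "1": the distinct values become category labels
table(df$smoker)       # counts per category
```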

15
Q

How can you quickly see the quartiles of data?

A

summary(data)

16
Q

What are the measures of variation?

A

Range
- Can do max(data) - min(data)
- Can do range(data)[2] - range(data)[1]

IQR
- IQR(data)
- summary(data)[5] - summary(data)[2]

Variance
- var(data)

Use summary(data) to get the quartiles
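
The measures above, applied to a small made-up numeric vector (here x plays the role of data):

```r
x <- c(2, 4, 4, 5, 7, 9, 11)

# Range: two equivalent ways
max(x) - min(x)   # 9
diff(range(x))    # 9

# Interquartile range: built-in, and via the quartiles reported by summary()
iqr_builtin <- IQR(x)                                   # 4
iqr_summary <- unname(summary(x)[5] - summary(x)[2])    # 3rd Qu. minus 1st Qu.

# Variance (sample variance, dividing by n - 1)
var(x)            # 10
```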

17
Q

How do we convert frequencies to probabilities?

A

prob <- freq / sum(freq)
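
As an illustration, the frequencies can come from table(); the colours vector is invented for the example:

```r
colours <- c("red", "blue", "red", "green", "red", "blue")

freq <- table(colours)    # blue 2, green 1, red 3
prob <- freq / sum(freq)  # relative frequencies

sum(prob)                 # 1 (up to floating-point rounding)
prop.table(freq)          # base R shortcut for the same calculation
```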

18
Q

What are the measures of heterogeneity?

A

Gini index of heterogeneity
Entropy

19
Q

How do we calculate the Gini Index of Heterogeneity?

A

G <- 1 - sum(prob^2)
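
A quick check of the two extremes, using a hypothetical helper gini_het() that takes a vector of category frequencies:

```r
# Hypothetical helper: Gini index of heterogeneity from category frequencies
gini_het <- function(freq) {
  prob <- freq / sum(freq)
  1 - sum(prob^2)
}

gini_het(c(10, 0, 0))  # 0: every observation is in one category
gini_het(c(5, 5, 5))   # about 0.667, i.e. (k - 1) / k with k = 3
```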

20
Q

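Why would you multiply the Gini index by k / (k - 1)?
What code is used to do this?

A

k / (k - 1) is a correction factor that accounts for the number of categories k. The unnormalised index can reach at most (k - 1) / k, when all categories are equally frequent, so multiplying by k / (k - 1) rescales it.

This normalises the index to range from 0 to 1.

Normalised Gini (G prime):
k <- length(example)  # number of categories, assuming example holds one frequency per category
G_norm <- G * (k / (k - 1))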
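
A sketch of the normalisation as a single function; gini_norm() is a hypothetical name and the frequency vectors are made up:

```r
# Hypothetical helper: normalised Gini index of heterogeneity
gini_norm <- function(freq) {
  prob <- freq / sum(freq)
  k <- length(freq)          # number of categories
  G <- 1 - sum(prob^2)
  G * (k / (k - 1))          # rescale so the maximum is 1
}

gini_norm(c(5, 5, 5))   # 1: maximum heterogeneity
gini_norm(c(8, 1, 1))   # 0.51
```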

21
Q

How do you calculate entropy?

A

E <- -sum(prob * log10(prob))

22
Q

How do you normalise the entropy value?

A

Divide by log10(k), using the same log base as in E, where k is the number of categories

E_norm <- E / log10(k)
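
A sketch that combines the entropy and its normalisation. The helper name entropy_norm() is hypothetical, and dropping zero probabilities is an added assumption: 0 * log10(0) evaluates to NaN in R, while its limit is 0.

```r
# Hypothetical helper: normalised entropy from category frequencies
entropy_norm <- function(freq) {
  prob <- freq / sum(freq)
  k <- length(prob)         # number of categories
  p <- prob[prob > 0]       # skip empty categories to avoid NaN
  E <- -sum(p * log10(p))
  E / log10(k)
}

entropy_norm(c(5, 5, 5, 5))   # approximately 1: all categories equally likely
entropy_norm(c(20, 0, 0, 0))  # 0: a single category holds everything
```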

23
Q

What is the Gini() function from DescTools used for?

A

The Gini concentration index - used for measuring concentration

24
Q

How do you calculate the Gini concentration index?

A

Gini(data)

Manual process:
example_1 <- sort(example_1)
n <- length(example_1)
n_list <- 1:n

Fi <- n_list / n

xj <- cumsum(example_1)
N <- sum(example_1)
Qi <- xj / N

FiQi <- Fi - Qi
FiQi <- head(FiQi, -1)
sumFiQi <- sum(FiQi)
sumFi <- sum(head(Fi, -1))

R <- sumFiQi / sumFi
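
The manual steps can be wrapped in a function and checked on the two extreme cases. This is a sketch under the assumption that R is the sum of (Fi - Qi) over the first n - 1 values divided by the sum of Fi over the same values; gini_conc() is a hypothetical name, and the inputs are made up.

```r
# Hypothetical helper: Gini concentration ratio computed by hand
gini_conc <- function(x) {
  x <- sort(x)
  n <- length(x)
  Fi <- (1:n) / n            # cumulative share of units
  Qi <- cumsum(x) / sum(x)   # cumulative share of the total
  # The last term is dropped: Fi and Qi both equal 1 there
  sum(head(Fi - Qi, -1)) / sum(head(Fi, -1))
}

gini_conc(c(5, 5, 5, 5))    # 0: perfectly equal distribution
gini_conc(c(0, 0, 0, 10))   # 1: one unit holds the whole total
```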

25
Q

If we want to work on models and then check the results, what should we do?

A

Split the data into a training and test set

26
Q

How do we split the data into a training and test set?

A

createDataPartition() from the caret package

library(caret)
in_train <- createDataPartition(data$column, p = 0.8, list = FALSE)
train <- data[in_train, ]
test <- data[-in_train, ]
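
For comparison, a base R stand-in for the same 80/20 idea is sketched below. It uses sample() rather than createDataPartition(), so it is a plain random split and does not balance the outcome across the two sets. The data frame and seed are made up.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical data
df <- data.frame(x = 1:10, y = rep(c("a", "b"), 5))

# Randomly pick 80% of the row indices for training
in_train <- sample(nrow(df), size = round(0.8 * nrow(df)))

train <- df[in_train, ]
test  <- df[-in_train, ]   # negative indices drop the training rows

nrow(train)  # 8
nrow(test)   # 2
```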

27
Q

How do you add labels to a graph?

A

In ggplot2, add a labs() layer to the plot:

+ labs(title = "yx")