Data Analytics R Flashcards

1
Q

Add 5 and 49

A

5+49

2
Q

Subtract 5 from 49

A

49-5

3
Q

Display a sequence of integers from 1 to 20

A

1:20

4
Q

Multiply 3 by 5

A

3*5

5
Q

Divide 12 by 4

A

12 / 4

6
Q

Two to the power of three

A

2^3

7
Q

Square root of 2

A

sqrt(2)

8
Q

Sine of pi

A

sin(pi)

9
Q

Exponential of 1

A

exp(1)

10
Q

Log 10 to the base of e

A

log(10)

11
Q

Log 10 to the base of 10

A

log(10, base = 10)

12
Q

Add a comment in R

A

#

13
Q

Add a mass comment in R

A

Ctrl + Shift + C in RStudio (comments or uncomments every selected line)

14
Q

Remove a variable

A

rm(var) or remove(var)

15
Q

How do you investigate a dataset or function in R?

A

?function

16
Q

What is the first thing you should do when presented with a dataset?

A

Investigate the variables - use head(dataset) or ?dataset in the console.

17
Q

How do you refer to variables created in the R markdown?

A

Use inline code: wrap r var_name in backticks, ie `r var_name`

18
Q

How do you print to the markdown?

A
  • Type the variable name that you want to print directly
  • paste()
  • paste0()
19
Q

How do you assign to an object? What is the benefit of this?

A

Use <-
We can store it in the R workspace and save it for future use.

20
Q

How do you calculate the mean?

A

mean(x)

21
Q

How do you calculate the variance?

A

var(x)

Longhand: sum((x - mean(x))^2) / (length(x) - 1)
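
As a quick check of the longhand formula, here is a minimal sketch. The sample values in x are made up for illustration; n is taken to be length(x).

```r
# Made-up sample values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance written out: squared deviations from the mean, divided by n - 1
longhand <- sum((x - mean(x))^2) / (n - 1)

# Built-in equivalent
builtin <- var(x)

# Compare with a tolerance rather than ==, since both are floating-point numbers
all.equal(longhand, builtin)
```

The division by n - 1 (rather than n) is what makes this the sample variance, which is what var() computes.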

22
Q

Find the size / the number of objects in a vector or list.

A

length(x)

23
Q

Find the maximum, minimum and range of a vector of objects.

A
  • max(x)
  • min(x)
  • range(x) - returns both the minimum and the maximum
24
Q

What function is used to collect things together into a vector?

A

x <- c(0, 7, 8)

25
Q

What is a vector in R?

A

A vector is a sequence of data elements of the same basic type. Members of a vector are called Components.

26
Q

Join the vector x and y together

A

c(x, y)

27
Q

Extract the 4th, 6th and 8th elements of a vector

A

x[ c(4, 6, 8) ]

28
Q

Extract everything from the 3rd to 9th element

A

x[ 3:9 ]

29
Q

Extract the second element of a vector

A

x[ 2 ]

30
Q

Remove everything from the 3rd to 9th element

A

x[ - (3:9) ]

31
Q

Can arithmetic operators be performed element-by-element on vectors?

A

Yes, the operation is performed on each element.
Eg y^x is y1 ^ x1, y2 ^ x2, y3 ^ x3 etc.
- If you sum two vectors in R it takes the element-wise sum

32
Q

What functions can be used to obtain patterned vectors?

A
  • rep()
  • seq()
33
Q

How do you generate a sequence?

A
  • seq()
    seq(from, to, by, length.out)
    from: Starting element of the sequence
    to: Ending element of the sequence
    by: Difference between the elements
    length.out: Desired length of the sequence
34
Q

How do you replicate elements in a vector?

A

rep(x, times = 1, length.out = NA, each = 1)
x: The object to replicate
times: The number of times to replicate object
length.out: Repeated x as many times as necessary to create vector of this length
each: Number of times to replicate individual elements in object
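
A short sketch of how the seq() and rep() arguments behave. The variable names are hypothetical and the values are chosen only for illustration.

```r
# seq(): step through a range with 'by', or ask for a fixed number of points
evens   <- seq(from = 2, to = 10, by = 2)       # 2 4 6 8 10
quarter <- seq(from = 0, to = 1, length.out = 5) # 0.00 0.25 0.50 0.75 1.00

# rep(): 'times' repeats the whole vector, 'each' repeats every element in turn
whole   <- rep(c(1, 2), times = 3)       # 1 2 1 2 1 2
by_elem <- rep(c(1, 2), each = 3)        # 1 1 1 2 2 2

# 'length.out' keeps repeating until the result has exactly this many elements
cut_off <- rep(c(1, 2), length.out = 5)  # 1 2 1 2 1
```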

35
Q

Create a character vector containing three colours

A

colours <- c("red", "yellow", "green")

36
Q

Extract or replace substrings in a character vector.

A

substr(x, start, stop)
x: the current character vector
start: position of the first character to extract
stop: position of the last character to extract

37
Q

What are different ways to format pastes?

A

sep = ":" sets the separator placed between the pasted items, eg paste("a", "b", sep = ":")
Use collapse to combine a vector into a single string - paste(c("Apple", "Banana", "Cherry"), collapse = ", ")

38
Q

What four things should be at the top of an R notebook?

A

install.packages("tidyverse")
# install.packages("plotly")
library(tidyverse)
library(plotly)

39
Q

What does paste() do for a vector?

A

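Works element by element: it converts each element to a character string and returns a character vector with one string per element (unless collapse is set)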

40
Q

How do you print vectors as one whole (ie not like paste which prints each element individually)?

A

cat()
Need to have "\n" to start a new line between prints.

41
Q

How do you find the remainder of division?

A

Modulo - %%

42
Q

What does a variable allow you to do?

A

Store a value or function in R

43
Q

What types of data are used in R?

A
  • Numerics
  • Integers (which are also numerics)
  • Logical
  • Characters (text or string)
44
Q

How can you check the data type of a variable?

A

class() function
- class(varName)

45
Q

What is a vector?

A

A one-dimensional array that can hold numeric data, character data or logical data. It is a simple tool to store data.

46
Q

How do you create a vector?

A

The combine function c().
Place the vector elements separated by a comma.
Can use the created vector to do calculations.

47
Q

How do you name a vector? Why might you want to do this?

A

names(vector) function
- names(some_vector) <- c("Name", "Profession")
- Naming aids understanding of the data you are using, and what each element refers to

48
Q

How can you calculate the sum of all elements in a vector?

A

sum(x)

49
Q

How do you compare values?
What are these called?

A

< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== equal to each other
!= not equal to each other

Relational operators

50
Q

What is the comparison operator for equal to?

A

==

51
Q

How do you select elements of a vector (or matrix or dataframe)?

A

Use of square brackets - indicate which element you want to select eg vector[3] or vector[c(2,3,4)] or vector[2:4].
Or could use the names of the vector elements (assigned with names(vector)) eg vector["Position"]

52
Q

What is the index of the first element of a vector in R?

A

1

53
Q

What is returned when you use comparison operators on a vector in R?

A

The command tests every element of the vector to see if the condition stated by the comparison operator is true or false. Get a vector of logicals.

54
Q

Instead of selecting a subset of days to investigate yourself, you can get R to return only the days with eg a positive return. How do you do this?

A

selection_vector <- vector > 0
new_vector <- vector[selection_vector]
- R knows how to handle it when you pass in a logical vector into the square brackets. It will only select elements that correspond to TRUE in the selection vector.
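
A minimal sketch of selecting with a logical vector. The daily returns below are hypothetical values, and the name returns stands in for the card's generic vector.

```r
# Hypothetical daily returns, named by weekday
returns <- c(Mon = 1.5, Tue = -0.7, Wed = 0.0, Thu = 2.1, Fri = -1.2)

# The comparison is applied to every element and gives a logical vector
selection_vector <- returns > 0

# Only the elements where selection_vector is TRUE are kept
positive_returns <- returns[selection_vector]
positive_returns
```

The two steps can also be written in one line as returns[returns > 0].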

55
Q

What is a matrix in R?

A

A collection of elements of the same data type arranged into a fixed number of rows and columns (2D)

56
Q

How do you construct a matrix in R?

A

The matrix() function
matrix(1:9, byrow = TRUE, nrow = 3)
or matrix(c(1,2,3,4,5,6,7,8,9), byrow = TRUE, nrow = 3)
or matrix(1:6, nrow = 2, ncol = 3)
- byrow indicates that the matrix is filled by the rows (FALSE if the matrix is filled by the columns)

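Can also add the argument dimnames = list(rownames, colnames), where rownames and colnames are character vectors holding the row and column names (row names come first). Because they are variables, they are written without quotation marks.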
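
A small sketch of the effect of byrow and dimnames. The label vectors row_labels and col_labels are hypothetical names chosen for this example.

```r
# Hypothetical label vectors; in dimnames the row names come first
row_labels <- c("r1", "r2", "r3")
col_labels <- c("a", "b", "c")

# Filled row by row: the first row is 1 2 3, the second 4 5 6, the third 7 8 9
m <- matrix(1:9, byrow = TRUE, nrow = 3,
            dimnames = list(row_labels, col_labels))
by_row_value <- m["r2", "c"]   # 6

# Default (byrow = FALSE): filled column by column, so the first column is 1 2 3
m_by_col <- matrix(1:9, nrow = 3)
by_col_value <- m_by_col[2, 3] # 8
```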

57
Q

How do you name a matrix in R? Why would you want to do this?

A
  • You can add names for the rows and columns:
    rownames(matrix)
    colnames(matrix)
  • Naming a matrix helps us read the data and is useful for selecting certain elements from the matrix
58
Q

How do you calculate the sum of each row in a matrix?

A

rowSums(matrix)

59
Q

What function do you use to add a new column to a matrix?

A

The cbind() function - this merges matrices and/or vectors together by a column
- cbind(matrix1, matrix2, vector)

60
Q

What function do you use to add a new row to a matrix?

A

The rbind() function

61
Q

How do you investigate the contents of the workspace?

A

The ls() function

62
Q

How do you select elements from a matrix?

A

Square brackets
- matrix[1,2] selects the element at the first row and second column
- matrix[,1] selects all elements of the first column
- matrix[1,] selects all elements of the first row

63
Q

How do standard operators like + / - * work on matrices in R?

A

Standard operators work in an element-wise way on matrices in R.
NB: The matrix1 * matrix2 creates a matrix where each element is the product of the corresponding elements. This is not standard matrix multiplication (achieved by %*%)

64
Q

What is a factor?

A

A factor is a statistical data type used to store categorical variables.

65
Q

How do you create factors in R?

A

The factor() function.
factor_vector <- factor(vector)
Firstly, you need to create a vector that contains all observations that belong to a limited number of categories.
By default the function factor() transforms a vector into an unordered factor.

66
Q

How do you create an ordered vector of factors?

A

This is possible for ordinal categorical variables, those with a natural ordering.
factor_temp_vector <- factor(temp_vector, ordered = TRUE, levels = c("low", "medium", "high"))

67
Q

How would you change the names of factors?

A

Change the names of the factor levels for clarity or other reasons using the function levels()
levels(x) <- c("name1", "name2", …)
Check the order in which you assign the levels

68
Q

What do you need to check before renaming factor levels?

A

The order of the current labels - check the output. R will automatically assign alphabetically if the order is not assigned.

69
Q

How can you give a quick overview of the contents of a variable?

A

summary()

70
Q

Male and female are what kind of factor levels?

A

Unordered or nominal - comparing them with relational operators such as < is meaningless, so R returns NA with a warning. R attaches an equal value to the levels of such factors.

71
Q

How do you create an ordered factor?

A

Use the factor() function with two additional arguments.
factor(x, ordered = TRUE, levels = c("lev1", "lev2", …))
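
A minimal sketch of an ordered factor in use. The temperature readings in temp_vector are made up for illustration.

```r
# Made-up observations of an ordinal variable
temp_vector <- c("high", "low", "high", "low", "medium")

# 'levels' fixes the order from lowest to highest; 'ordered = TRUE' makes it meaningful
factor_temp_vector <- factor(temp_vector, ordered = TRUE,
                             levels = c("low", "medium", "high"))

# Comparisons now work: "high" ranks above "low"
first_is_higher <- factor_temp_vector[1] > factor_temp_vector[2]  # TRUE

# The integer codes follow the level order: low = 1, medium = 2, high = 3
codes <- as.integer(factor_temp_vector)  # 3 1 3 1 2
```

Without ordered = TRUE the same comparison would return NA with a warning, because R has no ranking for unordered levels.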

72
Q

It may be more efficient to internally code the levels of the factor as integers. How do you do this?

A

as.integer(x), where x is the factor holding the categorical variable. This returns the integer codes R stores for the levels.

73
Q

How do you output the levels of a vector?

A

levels(x)

74
Q

What is an array in R?

A

An array is a more general way to store data. Array objects can hold data in any number of dimensions; a matrix is the two-dimensional case.

75
Q

How do you create an array?

A

The array() function
- eg a <- array(1:24, c(3,4,2))
- array(numbers, dimensions)
- creates a 3x4x2 array

76
Q

What is a data frame in R?

A

R provides a data structure, called a data frame, for collecting vectors into one object, which we can imagine as a table. More specifically, a data frame is an ordered collection of vectors, where the vectors must all be the same length but can be different types.
They are like matrices but the columns have their own names. Columns can be different types.

77
Q

How do you investigate the column names of a data frame?

A

names(df)

78
Q

How do you access the columns in the data frame?

A

Use the $ symbol
df$colName - this alone will print the data in the column and any associated levels

79
Q

What function applies a function to each value in a vector?

A

sapply(x, function)

80
Q

How do you check if two vectors are completely identical (same length and same elements in the same position)?

A

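all(x == y): This checks if all corresponding elements in x and y are equal. It returns TRUE if all comparisons are TRUE, and FALSE otherwise. Note that == recycles the shorter vector, so also check length(x) == length(y), or use identical(x, y).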

81
Q

What does the for() statement do?

A

Allows one to specify that a certain operation should be repeated a fixed number of times.

82
Q

How do you create a vector of length 12, which can hold numeric values?

A

Fib <- numeric(12)

83
Q

How do you assign the value 1 to the first two elements of a vector?

A

vector[1:2] <- 1

84
Q

What is the format of a for loop in R?

A

eg for (i in 3:12)

85
Q

What does an if statement allow you to do?

A

Allows you to control the statements that are executed

86
Q

What are functions?

A

Self-contained units of code. They generally take inputs, do calculations and produce outputs.

87
Q

What is the format of a function?

A

fun <- function(y) {
x <- 3*y
}

88
Q

What does attach() do?

A

Allows you to use column names directly - used when using data from an imported package. Otherwise would need to use df$colName

89
Q

What happens when you try to add these two vectors? p <- c(3,5,6,8)
q <- c(3,3,3)

A

You get a warning message about object length.

R uses the recycle rule when vectors have different lengths, i.e. it re-uses elements from the shorter vector (starting at the beginning of the vector).
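
The recycling rule can be seen directly with the two vectors from the question. This is a small sketch; total is a name chosen for this example.

```r
p <- c(3, 5, 6, 8)
q <- c(3, 3, 3)

# q is recycled to 3 3 3 3, and R warns that the longer length (4)
# is not a multiple of the shorter length (3)
total <- p + q
total  # 6 8 9 11
```

If the longer length were a multiple of the shorter one, for example adding a vector of length 2 to one of length 4, the recycling would happen silently with no warning.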

90
Q

I want to select only the rows from a data frame where the column “Gender” has “M” in it. How do I do this?

A

output <- df[df$Gender == "M", ]

91
Q

How do you count the number of NA elements in a vector?

A

num_NA <- sum(is.na(x))

92
Q

Does the length() function consider NA values?

A

Yes

93
Q

What are two ways to remove NAs from a vector?

A

na.omit(x) or x[!is.na(x)]

94
Q

How do you remove all rows with NA values?

A

df[complete.cases(df), ]

complete.cases(df): This function returns a logical vector indicating which rows have no NA values. Rows that have no missing values are marked as TRUE.
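
A minimal sketch of removing incomplete rows. The data frame below is made up for illustration, with one NA in each of two different rows.

```r
# Made-up data frame with missing values in rows 2 and 3
df <- data.frame(id = 1:4,
                 score = c(10, NA, 7, 3),
                 grade = c("A", "B", NA, "C"))

# TRUE for rows with no NA in any column
keep <- complete.cases(df)   # TRUE FALSE FALSE TRUE

# Keep only the complete rows
clean <- df[keep, ]

# Counting the NAs in a single column
num_NA <- sum(is.na(df$score))  # 1
```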

95
Q

When creating a function, how do you create a default argument? eg raising one number to the power of another

A

powerFunc <- function(base, exponent = 2) {
result <- base^exponent
print(result)
}

If a second argument isn’t specified, exponent defaults to 2.

96
Q

What is a data frame in R?

A

A data frame collects vectors into one object, which we can imagine as a table. More specifically, a data frame is an ordered collection of vectors, where the vectors must all be the same length but can be different types.

A dataframe has the observations as rows and the variables as columns.

97
Q

What functions can you use to get an overview of very large data frames?

A

head() shows the first observations and tail() shows the last observations. Both print a top line called the “header”, which contains the names of the different variables in the dataset.

Another way to get a rapid overview of the data is the function str(), which shows the structure of the dataset: the number of observations, the number of variables, the list of variable names, the data type of each variable and its first observations.

98
Q

How do you construct a data frame?

A

Using data.frame(), you can pass in all the vectors of equal length.

Use str() to confirm your understanding of the data frame.

99
Q

How do you select elements from a data frame?

A

Square brackets

df[1,2] selects the value at the first row and second column
df[1,] selects all elements of the first row
Can use variable names for columns as well as numerics.

100
Q

How do you select an entire column of a data frame?

A

df[, colName]
df$colName

101
Q

How do you select elements from a data frame with a certain condition in a certain column?

A

The subset() function

subset(df, subset = condition)
eg subset(df, subset = rings), where rings is a logical column
eg subset(df, subset = (diameter < 1))

102
Q

How do you sort the data according to a certain variable in the dataset?

A

order(x) - returns the indices that would sort x in ascending order (the position of the smallest element first)

x[order(x)] - uses the output of order(x) to index the vector, producing the vector rearranged in ascending order.

For a df:
positions <- order(df$column)
df[positions, ]
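
A short sketch of sorting a data frame with order(). The data frame and its column names are made up for illustration.

```r
# Made-up data frame
df <- data.frame(name = c("a", "b", "c", "d"),
                 value = c(30, 10, 40, 20))

# Indices of the rows from smallest to largest value
positions <- order(df$value)   # 2 4 1 3

# Reorder the rows using those indices
sorted_df <- df[positions, ]

# order() also takes decreasing = TRUE for a descending sort
sorted_desc <- df[order(df$value, decreasing = TRUE), ]
```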

103
Q

What is a list in R?

A

A list in R programming is a generic object consisting of an ordered collection of objects.

It allows you to gather a variety of objects under one name.

104
Q

How do you construct a list?

A

The list() function
The arguments are the list components

105
Q

How do you construct a named list?

A

Naming the components is useful
list <- list(name1 = var1, name2 = var2)

If you want to name the list components after the list has been created - use the names() function
names(list) <- c("name1", "name2")

106
Q

How do you name the components of a list?

A

Use the names() function
names(list) <- c("name1", "name2")

107
Q

How do you select elements from a list?

A

A list is built with numerous elements and components, so getting a single element is not always straightforward.

One way to select a component is using the numbered position (note double brackets)
list[[1]]

You can also refer to the names of the components
list[["reviews"]]
list$reviews

Can also select specific elements from a component in the list
list[[1]][2] - select from the first component the second element
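
A minimal sketch of the selection forms above. The list movie and its contents are hypothetical, chosen only to have a component called reviews.

```r
# Hypothetical named list with components of different types
movie <- list(title = "Example Film",
              reviews = c(4.5, 3.8, 5.0),
              released = TRUE)

by_position <- movie[[1]]          # "Example Film"
by_name     <- movie[["reviews"]]  # 4.5 3.8 5.0
by_dollar   <- movie$reviews       # same as movie[["reviews"]]

# Second element of the second component
second_review <- movie[[2]][2]     # 3.8

# Single brackets return a (shorter) list, double brackets return the component itself
single_class <- class(movie[2])    # "list"
double_class <- class(movie[[2]])  # "numeric"
```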

108
Q

How would you change all values that are 0 in the column X in data to 2.

A

data$X[data$X == 0] <- 2

108
Q

How can you output the even rows of a data frame?

A

rows <- nrow(data)
even_rows <- seq_len(rows) %% 2
data[even_rows == 0, ]

even_rows <- df[seq(1, nrow(df)) %% 2 == 0, ]

109
Q

How do you find quantiles of a dataset vector?

A

Using the quantile function
quantile(data, probs = c(0, 0.25, 0.5, 0.75, 1))

110
Q

How do you find the IQR of a vector?

A

IQR(data)

111
Q

What is a simple way to plot a histogram - eg of a vector?

A

hist(data) - showing the frequency of each group
hist(data, prob = TRUE) - showing the density of each group, so that the bars have a total area of 1

112
Q

What is a simple way to plot a box plot?

A

boxplot(data) - includes outliers
boxplot(data, outline = FALSE) - without outliers

Listing the outliers (without drawing the plot): boxplot(data, plot = FALSE)$out

113
Q

How do you use the boxplot function to find the outliers of a dataset?

A

boxplot(data, plot=F)$out

114
Q

What do conditional statements, loops and functions do?

A

Make the code more efficient

115
Q

For string comparison, how does R determine the greater than relationship?

A

Based on alphabetical order

116
Q

What is the not operator?

A

! - negates the logical value it is used on

!TRUE
!(x == 4)
!is.numeric(x)

117
Q

What are the logical operators?

A

AND (&) OR (|) NOT (!)

118
Q

What logical operator do you use to compare two vectors if you want only the first element of each vector to be compared?

A

&& - double sign version

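&& only examines the first element of each vector. Note: from R 4.3.0 onwards, && and || raise an error when given a vector of length greater than one, so select the first elements explicitly, eg x[1] && y[1].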

119
Q

What are conditional statements?

A

Conditional statements in programming allow the execution of different pieces of code based on whether certain conditions are true or false.

120
Q

What does an if statement do?

A

The if statement takes a condition. If the condition evaluates to TRUE, the R code associated with the if statement is executed.

if (condition) {
}

121
Q

What does the else statement do, when paired with the if?

A

If the condition of the if statement evaluates to FALSE (ie the if is not satisfied), the R code associated with the else statement is executed.

The else statement does not require a condition.

122
Q

What does the else if allow for?

A

It lets another piece of code be executed when a further condition is met.

This allows for further customisation of the control structure.

123
Q

In an if/else statement, what happens as soon as R gets to the condition that evaluates as true?

A

It executes this code block and leaves the structure.

This becomes important if the conditions listed are not mutually exclusive.

124
Q

What does the while loop do?

A

Executes the code inside if the condition is true. It will continue to loop through as long as the condition is true.

while(condition) {
}

Remember to CHANGE something within the while loop so that the condition eventually becomes FALSE. Otherwise you get an infinite loop: it runs indefinitely because the condition never stops being TRUE.

125
Q

How do you exit a while loop?

A

When the condition becomes FALSE.

Or through use of the break statement.
eg if (x %% 5 == 0) {
break
}
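
A minimal sketch that puts both exit routes in one loop. The starting value and the limit of 20 are arbitrary choices for this example.

```r
x <- 1

# The loop would end on its own once x exceeds 20 ...
while (x <= 20) {
  # ... but break leaves it early, at the first multiple of 5
  if (x %% 5 == 0) {
    break
  }
  x <- x + 1   # change something each pass, otherwise the loop never ends
}

x  # 5
```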

126
Q

What does the for loop do?
What is its syntax?

A

For each element in a sequence, it executes the expression.

Can use on vectors and lists.

for (variable in vectorName) {
}

for (i in 1:length(vectorName)) {
}

127
Q

What are the control statements that can be used in for loops?

A

Break and next.

Break - stops the code and abandons the for loop altogether.

Next - skips the remainder of the code for that iteration and proceeds to the next iteration.
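
A small sketch showing next and break in the same loop. The vector visited is a name chosen for this example; it records which iterations ran to the end.

```r
visited <- c()

for (i in 1:10) {
  if (i %% 2 == 0) {
    next    # even number: skip the rest of this iteration
  }
  if (i > 7) {
    break   # first odd number above 7: abandon the loop altogether
  }
  visited <- c(visited, i)
}

visited  # 1 3 5 7
```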

128
Q

How do you get the number of characters in a string?

A

nchar(string)

129
Q

How do you loop through a vector?

A

for (i in 1:length(vector)) {
}

for (variable in vector) {
}

130
Q

To access an element in a list, what do you need to remember?

A

You need double square brackets to select elements from a list
list[[i]]

131
Q

How does a nested for loop loop over a matrix?

A

The outer loop iterates over the rows. The inner loop iterates over the columns.

for(i in 1:nrow(matrix)) {
for (j in 1:ncol(matrix)) {
}
}

132
Q

How do you create a list?

A

list()

133
Q

How do you print to the console?

A

print() or paste0() or cat()

134
Q

How do you calculate the standard deviation of a vector?

A

sd()

There is an na.rm argument which tells us whether or not missing values should be removed. By default it is set to FALSE.

sd(values, TRUE)
or sd(values, na.rm = TRUE)

135
Q

How do you learn about the arguments of a function without having to read through the documentation?

A

args(function)

136
Q

What does the args() function do?

A

args(x) lets you learn about the argument of a function without having to read through the entire documentation.

137
Q

How do you write your own function in R?

A

name <- function(x, y) {
}

If you don’t explicitly say, the last line becomes the returned value.

We can also return explicitly with a return() statement, but this is not always needed.

name <- function(x, y) {
z <- x + y
return(z)
}

138
Q

How do you create a function with an optional argument?

A

name <- function(x, y = 1) {
}

Otherwise calling without y would create an error.

139
Q

If your function contains division and there is a possibility you could divide by 0 (resulting in infinity) how can you guard your function?

A

name <- function(x, y = 1) {
# Guard: return 0 instead of dividing by zero
if (y == 0) {
return(0)
}
x / y
}

140
Q

How do you generate a random dice throw in R?

A

dice_throw <- sample(1:6, size = 1)

1:6 represents the numbers on a die (1 to 6).
size = 1 means you are drawing one random number (one dice roll).

141
Q

If a variable is defined inside a function, can it be accessed outside of the function?

A

No, due to function scope.

142
Q

What is a package you can install for data visualisations?

A

ggvis

install.packages("ggvis")

143
Q

Before you can use a package what do you have to do?

A

Install it

install.packages() is part of the utils package.
The function goes to CRAN, downloads the package and installs it into your library.

We then have to load the package to use it - library("package")

144
Q

How can you find a list of packages and environments that R looks through to find a variable or function?

A

search()

145
Q

How do you load a package in R?

A

library("package")

require() also loads packages into the R session. It returns FALSE with a warning instead of stopping with an error when the package is missing, which helps when attaching packages dynamically inside functions.

You can check that it is there with search(). Loading a package attaches it to the search list.

146
Q

What does lapply() do?

A

lapply() is a function used to apply a specified function to each element of a list (or other objects that can be coerced into a list, such as a vector), and it returns the result as a list.

lapply(vector, nchar)

147
Q

Instead of using a for loop, what other function can you use which is more concise?

A

lapply()
lapply(vector, function)

The function is applied to each element of the vector - get a list returned.

Output is a list of the same length as the vector.

To convert the list to a vector, you can wrap lapply() inside the unlist() function
unlist(lapply(vector, function))

If the function also takes an additional argument
lapply(vector, function, argument = X).

Names on the list are maintained - useful

148
Q

How do you convert a list into a vector?

A

unlist()
unlist(lapply(vector, function))

149
Q

What function do you use in R to split up a string?

A

strsplit() splits the string by using a delimiter.
strsplit(string, split = ":")

150
Q

What is an anonymous function in R?

A

If you choose not to give the function a name, you get an anonymous function. You use an anonymous function when it’s not worth the effort to give it a name.

151
Q

What is a list capable of storing?

A

Heterogeneous content

152
Q

What does the apply() function do?

A

The apply() function applies a function to the rows or columns of a matrix or data frame. It takes the matrix or data frame, the margin (rows or columns) and the function as arguments, and returns the results as a vector, array or list.

apply(x, margin, function)

margin: If the margin is 1 the function is applied to each row; if the margin is 2 it is applied to each column.
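
A minimal sketch of the two margins on a small matrix. The matrix values are chosen only for illustration.

```r
# 2 x 3 matrix filled column by column: row 1 is 1 3 5, row 2 is 2 4 6
m <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(m, 1, sum)  # margin 1: one result per row    -> 9 12
col_totals <- apply(m, 2, sum)  # margin 2: one result per column -> 3 7 11
```

For sums specifically, rowSums(m) and colSums(m) give the same numbers; apply() is the general form that works with any function.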

153
Q

What does the sapply() function do?

A

The sapply() function applies a function to each element of a list, vector or data frame and, where possible, simplifies the result to a vector or matrix.

This is easier than using unlist on lapply if the function always returns the same type of object.

sapply(vector, function)

You can choose to not name the output - USE.NAMES = FALSE
sapply(vector, function, USE.NAMES = FALSE)

154
Q

What is the difference between lapply() and sapply()?

A

lapply() always returns a list, while sapply() attempts to return a vector or matrix if possible. If it can’t, it will fall back to returning a list. Usage: Use lapply() when you need to maintain the structure of your output as a list.

S - stands for simplify
Can’t simplify when there are vectors of varying lengths etc.

lapply() returns a list
sapply() usually returns a vector that is a simplified version of this list

155
Q

What does the tapply() function do?

A

The tapply() function computes statistical measures (mean, median, min, max, etc.) or a self-written function for each group of a vector, where the groups are defined by the levels of a factor. It splits the vector into subsets and then applies the function to each subset.

tapply( x, index, fun )
x: determines the input vector or an object.
index: determines the factor vector that helps us distinguish the data.
fun: determines the function that is to be applied to input data.

tapply(diamonds$price, diamonds$cut, mean)
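
The diamonds example needs ggplot2 to be loaded. As a sketch that runs in base R alone, the same idea can be tried on the built-in mtcars dataset (the variable name mean_mpg_by_cyl is just illustrative):

```r
# mtcars ships with R: mpg is numeric, cyl takes the values 4, 6 and 8
mean_mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# One mean per group, named after the group
print(round(mean_mpg_by_cyl, 1))
```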

156
Q

What does the gsub() function do?

A

The gsub() function can be used to replace all occurrences of certain text within a string in R.

gsub(pattern, replacement, x)

pattern: The pattern to look for
replacement: The replacement for the pattern
x: The string to search

157
Q

What function can you use to replace occurrences of a certain text within a string?

A

gsub()

158
Q

What does the identical() function do in R?

A

The identical() function in R can be used to test whether or not two objects in R are exactly equal.

identical(x,y)
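
A short sketch of how strict "exactly equal" is. The all.equal() comparison at the end is not part of the card; it is included only as a commonly used tolerant alternative:

```r
same_values <- identical(c(1, 2, 3), c(1, 2, 3))   # TRUE: same type, same values
int_vs_dbl  <- identical(1L, 1)                    # FALSE: integer vs double
float_sum   <- identical(0.1 + 0.2, 0.3)           # FALSE: floating-point rounding error
tolerant    <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```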

159
Q

What function can be used to test whether two objects are exactly equal?

A

identical()
identical(x,y)

160
Q

How can you generate random values from a uniform distribution in R?

A

runif()

runif(n, min=0, max=1)
n: The number of random values to generate
min: The minimum value of the distribution (default is 0)
max: The maximum value of the distribution (default is 1)

runif - random uniform

161
Q

What does the runif() function do?

A

Generates random values from a uniform distribution in R.

162
Q

Summarise lapply(), sapply() and vapply()

A

lapply() - apply function over list or vector. Output = list

sapply() - apply function over list or vector. Tries to simplify the list to an array

vapply() - apply function over list or vector. Explicitly specify output format

163
Q

What is the difference between vapply() and sapply()?

A

sapply() uses lapply() under the hood and then tries to simplify the result. vapply() is similar, except that you have to explicitly specify the type and length of the return value. In sapply(), this is not possible.

eg vapply(vector, function, numeric(1))

numeric(1) - specifying the type and length of the value returned for each element

Specifying this makes vapply() a safer alternative to sapply()
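
A sketch comparing the three functions on made-up inputs (all variable names are illustrative). The last two calls show why the explicit return type matters:

```r
words <- list("apple", "kiwi", "banana")

as_list   <- lapply(words, nchar)               # always a list
as_vector <- sapply(words, nchar)               # simplified to an integer vector
checked   <- vapply(words, nchar, integer(1))   # same values, but the shape is enforced

# sapply() cannot simplify results of different lengths, so it quietly returns a list
ragged <- sapply(list(1:2, 1:3), function(v) v * 2)

# vapply() stops with an error on the same input instead of changing its return type
failed <- tryCatch(
  vapply(list(1:2, 1:3), function(v) v * 2, numeric(1)),
  error = function(e) "vapply error"
)
```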

164
Q

What function can be used to sort a vector?

A

The sort() function returns a sorted version of the input vector.

sort(x, decreasing, na.last)

decreasing: Boolean value to sort in descending order (default FALSE)
na.last: TRUE puts NA values at the end, FALSE puts them first; by default (NA) they are removed

sort(x) - puts in ascending order

164
Q

How do you simplify a long decimal place answer?

A

round(data, digits = 1). Rounds to 0 dp by default.
signif(data, digits = 3)
ceiling(data) - round values UP to nearest integer
floor(data) - round values DOWN to nearest integer
trunc(data) - truncate decimal places OFF
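
The differences between these functions show up most clearly on a negative number. A small sketch with a made-up value:

```r
x <- -2.718

rounded   <- round(x, digits = 1)   # -2.7
sig_figs  <- signif(x, digits = 2)  # -2.7 (two significant figures)
up        <- ceiling(x)             # -2: rounds up, towards positive infinity
down      <- floor(x)               # -3: rounds down, towards negative infinity
truncated <- trunc(x)               # -2: drops the decimal part, towards zero
```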

165
Q

How do you get the absolute value / modulus of a number or vector?

A

abs()

166
Q

What functions can you use to check the type of your data structure?

A

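is.*() - eg is.list(), is.numeric(). These check the type and return logicals.
as.*() - eg as.list(), as.numeric(). These can be used to convert eg vectors to lists and return the converted object.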

167
Q

What does the append() function do?

A

Allows you to add elements to a vector or a list in a very readable way.

168
Q

What function allows you to add elements to a vector or a list in a readable way?

A

append()

169
Q

What does the rev() function do?

A

Reverses the order of the elements in a vector or list

170
Q

What function reverses the order of the elements in a vector or list?

A

rev()

171
Q

What is a trimmed mean and how is it determined?

A

A trimmed mean is the mean of the data calculated after removing a specific percentage of the smallest and largest values (the trim fraction is removed from each end).

# calculate trimmed mean with trim of 10%
print(mean(data,trim=0.10))
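
A worked sketch with made-up numbers, where one outlier dominates the ordinary mean. With five values and trim = 0.2, one value is dropped from each end:

```r
scores <- c(1, 2, 3, 4, 100)   # 100 is an outlier

plain_mean   <- mean(scores)               # 22
trimmed_mean <- mean(scores, trim = 0.2)   # drops 1 and 100, then averages 2, 3, 4 -> 3
```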

172
Q

What are regular expressions in R?

A

Regular expressions (regex) are powerful tools used for pattern matching within text data.

grep()
grepl()
sub()
gsub()

Handy for cleaning data - make the data ready for analysis

173
Q

What does the grepl() function do?

A

grepl(pattern, x) searches for matches of a pattern within a vector x and returns a logical vector indicating whether a match was found in each element.

eg grepl("Hello", text)
eg grepl(pattern = "a", x = animals)

l of grepl - logical

174
Q

What function do you use to find a pattern within a vector?

A

grepl(pattern, x) searches for matches of a pattern within a vector x and returns a logical vector indicating whether a match was found in each element.

175
Q

How do you search a vector for strings that begin with an A?

A

Use the caret symbol ^ in grepl() - it matches the empty string at the start of the string, so the pattern only matches strings beginning with a. Matching is case-sensitive, so use "^A" for a capital A.

grepl(pattern = "^a", x)

176
Q

How do you search a vector for strings that end with an A?

A

Use the $ symbol in grepl - matches the empty string at the end.

grepl(pattern = "a$", x)

177
Q

How do you match the empty string at the start vs the end of a string?

A

^ - start
$ - end

178
Q

What command can you run to further investigate regular expressions in R?

A

?regex

179
Q

What does grep() do?

A

grep(pattern, x) searches for matches of a pattern within a vector x and returns a vector containing the indices of the matching elements

Can use grep to subset the original vector, returning only elements with the pattern.
eg assign output to hits
vector[hits]

180
Q

How can you get the same output of grep() using grepl()?

A

which(grepl(pattern = "a", x = animals))
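
A sketch tying grepl(), grep() and the anchors together. The animals vector here is made up for illustration:

```r
animals <- c("cat", "moose", "impala", "ant", "kiwi")

has_a   <- grepl(pattern = "a", x = animals)   # logical vector, one entry per element
hits    <- grep(pattern = "a", x = animals)    # indices of the matching elements: 1 3 4
matched <- animals[hits]                       # subset the original vector with the indices

starts_with_a <- grepl("^a", animals)   # TRUE only for "ant"
ends_with_a   <- grepl("a$", animals)   # TRUE only for "impala"

same_as_grep <- identical(which(has_a), hits)   # TRUE
```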

181
Q

What does the sub() function do?

A

sub(pattern, replacement, x) replaces the first occurrence of a pattern in each element of vector x with the replacement.

The pattern can use | for "or", eg "a|i|e"

182
Q

What does the gsub() function do?

A

gsub(pattern, replacement, x) replaces all occurrences of a pattern in each element of vector x with the replacement.

The pattern can use | for "or", eg "a|i|e"

183
Q

What do the sub() and gsub() functions do? What are their differences?

A

Both replace patterns in strings.
sub replaces the first occurrence, gsub replaces all occurrences.

The pattern can use | for "or", eg "a|i|e"
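
A quick sketch of the difference on a single made-up string:

```r
x <- "banana"

first_only <- sub("a", "o", x)      # "bonana": only the first a is replaced
all_of_a   <- gsub("a", "o", x)     # "bonono": every a is replaced
either     <- gsub("a|n", "-", x)   # "b-----": every a or n is replaced
```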

184
Q

What is the syntax to match valid .edu emails?

A

pattern = "@.*\\.edu$"

.* matches any character (.) zero or more times (*) - the dot and asterisk are metacharacters. They are used to match any characters between the @ and .edu

\\.edu$ matches .edu at the end of the email. The backslash ESCAPES the dot - it tells R that you want to use it as an actual character. Inside an R string the backslash itself must be doubled, hence \\.
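
A sketch of the pattern applied to a few made-up addresses (the emails vector is hypothetical):

```r
emails <- c("john.doe@ivyleague.edu", "education@world.gov",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# The dot before edu is escaped with a doubled backslash inside the R string
is_edu <- grepl(pattern = "@.*\\.edu$", x = emails)

edu_emails <- emails[is_edu]   # keeps only the two addresses with an @ that end in .edu
```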

185
Q

What is the difference between grep/grepl and sub/gsub?

A

Grep and grepl are used to simply check whether a regular expression could be matched with a character vector.

Sub and gsub allow you to specify a replacement argument.

186
Q

What are different common symbols used for pattern matching in regex expressions?

A

.* - any character that is matched 0 or more times

\\s - matches a whitespace character such as a space (escaping s makes it a metacharacter; the backslash is doubled inside an R string)

[0-9]+ - match the numbers 0-9, at least once (+)

([0-9]+) - parentheses are used to make parts of the matching string available to define the replacement
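
A sketch of how these pieces combine, assuming a made-up string. The part captured by the parentheses is referred to as \\1 in the replacement:

```r
entry <- "Order 66 shipped"

# .* swallows the text on either side; ([0-9]+) captures the digits between two spaces
order_number <- sub(".*\\s([0-9]+)\\s.*", "\\1", entry)   # "66"
```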

187
Q

How do you get the current date and time?

A

Sys.Date()
Sys.time()

These values are not simple strings - check with class(). Sys.Date() returns a Date object and Sys.time() returns a POSIXct object.
The POSIXct class makes sure that the dates and times in R are compatible across different operating systems, according to the POSIX standards.

188
Q

How can you create a date?

A

as.Date("1971-05-15")

YYYY-MM-DD

For a date and time
as.POSIXct("1971-05-15 11:25:15")

189
Q

What is the default date format in R?

How do you change this?

A

YYYY-MM-DD

as.Date("1971-15-05", format = "%Y-%d-%m")

190
Q

How do Date and POSIXct objects behave differently?

A

Unit of date is a day, unit of POSIXct is a second eg in arithmetic

191
Q

How are dates stored in R?

A

The number of days since January 1st 1970 (convention)
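
A sketch showing both the different units and the underlying storage. The time zone is fixed to UTC here only so the output does not depend on the machine running it:

```r
d    <- as.Date("1971-05-15")
when <- as.POSIXct("1971-05-15 11:25:15", tz = "UTC")

next_day    <- d + 1      # Date arithmetic counts in days
next_second <- when + 1   # POSIXct arithmetic counts in seconds

format(next_day)                  # "1971-05-16"
format(next_second, "%H:%M:%S")   # "11:25:16"

# Under the hood a Date is the number of days since 1970-01-01
days_since_epoch <- as.numeric(as.Date("1970-01-10"))   # 9
```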

192
Q

What are dedicated R packages to deal with times in a more advanced fashion?

A

lubridate, zoo, xts

193
Q

What are the different symbols for formatting dates in R?

A

%Y - YYYY
%y - YY
%m - MM
%d - DD
%A - weekday eg Wednesday
%a - abbreviated weekday (Wed)
%B - month (January)
%b - abbreviated month (Jan)

Small letters - abbreviation

eg as.Date("13 January, 1982", format = "%d %B, %Y")

Default - "%Y-%m-%d" or "%Y/%m/%d"

194
Q

How do you convert dates to strings?

A

format(Sys.Date(), format = "%d %B, %Y")

195
Q

What are the different symbols for formatting times in R?

A

You can use as.POSIXct() to convert from a character string to a POSIXct object

%H - hours as a decimal number (00-23)
%I - hours as a decimal number (01-12)
%M - minutes as a decimal number
%S - seconds as a decimal number
%T - typical format %H:%M:%S
%p - AM/PM indicator

You can use format() to convert from a POSIXct object to a character string
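
A round-trip sketch: parse a made-up timestamp in a non-default layout, then print it back in other layouts. Only numeric symbols are used, since month names and the AM/PM indicator depend on the locale; the UTC time zone is an assumption made for reproducibility:

```r
# Character string -> POSIXct, describing the layout with conversion symbols
stamp <- as.POSIXct("15/05/1971 23:25:15",
                    format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# POSIXct -> character string
clock_24h <- format(stamp, "%T")         # "23:25:15"
clock_12h <- format(stamp, "%I:%M")      # "11:25" on the 12-hour clock
iso_date  <- format(stamp, "%Y-%m-%d")   # "1971-05-15"
```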

196
Q

How can you view a full list of conversion symbols?

A

?strptime documentation

197
Q

In R markdown

```{r echo=FALSE}

head(cars)

```

What is the difference between echo = FALSE and include = FALSE?

A

echo = FALSE hides the R code in the final output but shows the results (i.e., the output of the code).

include = FALSE hides both the code and the results (output) in the final document. The code still runs.

198
Q

How do you use base R graphics to produce a boxplot for a particular column from a dataset?

A

boxplot(dataset$column)

199
Q

How do you use ggplot to produce a boxplot for a particular column from a dataset?

A

cars %>%
ggplot(aes(y=speed)) +
geom_boxplot()

200
Q

How do you turn a static ggplot object into an interactive one?

A

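After creating the plot, call ggplotly() on a new line (not indented as part of the ggplot chain). ggplotly() comes from the plotly package and, with no arguments, converts the last ggplot that was created.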

201
Q

What is base R code for a scatterplot?

A

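plot(data$var1, data$var2)

plot(data) also works when the data frame has exactly two numeric columns, such as cars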

202
Q

What is the code for a ggplot scatterplot?

A

cars %>%
ggplot(aes(x=speed, y=dist)) +
geom_point() +
theme_bw()

ggplotly()

Can also assign this to a variable

203
Q

How do you find the Pearson’s Correlation Coefficient in R?

A

cor(data$var1, data$var2)

round(cor(data$var1, data$var2),2)

204
Q

What is the tidyverse?

A

A collection of data science tools within R for transforming and visualising data.

205
Q

What is dplyr?

A

dplyr is a package for making data manipulation easier.
library(dplyr)

206
Q

What is a tibble?

A

A modern data frame.

207
Q

When do you use filter()?

A

When you only want to look at a subset of observations, based on a particular condition.

208
Q

What do you need to do when applying a verb using dplyr?

A

Pipe %>%

This says, take whatever is before it and feed it into the next step

209
Q

How do we filter a dataset?

A

dataset %>%
filter(condition1, condition2 …)

Filter returns a new dataset.
To specify multiple conditions in the filter, separate them by a comma.

210
Q

What does arrange() do?

A

The arrange verb sorts the observations in the dataset.

dataset %>%
arrange(column)

Default is ascending order.

dataset %>%
arrange(desc(column))

This produces a new sorted dataset.

211
Q

How do you combine multiple dplyr verbs to apply to a dataset?

A

dataset %>%
filter(condition1) %>%
arrange(desc(column))

Pipe the dataset into the first verb, then use another pipe to take the result of the first verb and pass it to the second verb.

Piping together multiple simple operations can create a rich and informative data analysis.

212
Q

What does the dplyr mutate verb do?

A

Change one of the variables in your dataset (based on another) or add a new variable.

dataset %>%
mutate(pop = pop/10)

Left = what is being replaced / created
Right = what is being calculated

Returns a new data frame

213
Q

Does applying verbs to a dataset by piping them in change the data frame?

A

No, it creates a new data frame

214
Q

What are the built-in functions R provides for calculations related to the standard normal distribution (Z-distribution)?

A

dnorm() / pnorm() / qnorm() / rnorm()

dnorm(x, mean = 0, sd = 1)
Gives the probability density function (PDF) of a normal distribution at a given point x. Eg dnorm(0) gives the height of the PDF at Z = 0, which is the mean of the standard normal distribution.

pnorm(q, mean = 0, sd = 1)
This gives the cumulative distribution function (CDF), i.e., the probability that a standard normal random variable will be less than or equal to q.
Example: pnorm(1.96) returns the probability that a Z-score is less than or equal to 1.96.

qnorm(p, mean = 0, sd = 1):
This gives the quantile function, or the Z-score corresponding to a cumulative probability p.
Example: qnorm(0.975) gives the Z-score associated with the 97.5th percentile of the standard normal distribution.

rnorm(n, mean = 0, sd = 1):
This generates n random numbers from a normal distribution with specified mean and standard deviation.
Example: rnorm(5) generates 5 random values from a standard normal distribution.
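
A sketch showing how the four functions relate on the standard normal distribution. The seed is set only so the random draw can be reproduced:

```r
density_at_0 <- dnorm(0)       # height of the PDF at Z = 0, about 0.399
prob_below   <- pnorm(1.96)    # about 0.975
z_for_975    <- qnorm(0.975)   # about 1.96

# qnorm() undoes pnorm(): a probability goes back to its Z-score
round_trip <- qnorm(pnorm(1.5))   # 1.5

set.seed(42)
draws <- rnorm(5)   # five random values from the standard normal distribution
```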

215
Q

What do scatterplots do?

A

Compare two variables on an x and y axis

216
Q

When working with a subset, it is useful to save the filtered data as a new data frame. How do you do this?

A

Variable assignment <-

This can then be used to create a visualisation.

217
Q

How do you create a scatterplot using ggplot?

A

library(ggplot2)

ggplot(dataset, aes(x = var1, y = var2)) +
geom_point()

Aes - aesthetic: mapping of variables in the dataset to aesthetics in the graphs. An aesthetic is a visual dimension of a graph that can be used to communicate information.

218
Q

What is an aesthetic of a graph?

A

An aesthetic is a visual dimension of a graph that can be used to communicate information.

219
Q

What are the three parts of creating a graph with ggplot?

A

ggplot(data, aesthetic mapping) + layer

220
Q

If a lot of data points are crammed to the left of a scatterplot, what transformation can you do to help improve the visualisation?

A

Log transformation

This happens when a variable spans several orders of magnitude

221
Q

What is the code to add a logarithmic transformation to the x variable of a ggplot2 scatterplot?

A

ggplot(data, aes(x = var1, y =var2)) +
geom_point() +
scale_x_log10()

222
Q

What are two other aesthetics, in addition to the x and y axis, that you can use to represent variables on a scatter plot?

A

Size (numeric) and colour (categorical)

ggplot(data, aes(x = var1, y =var2, color = catVar, size = numVar)) +
geom_point()

223
Q

How can you explore categorical variables with a scatterplot?

A

Using colour (color = ) to distinguish the different categories.

Using faceting to divide the plots into subplots, getting a separate graph for each category.

Facet a plot by adding + facet_wrap(~variable)

ggplot(data, aes(x = var1, y =var2)) +
geom_point() +
facet_wrap(~var)

The tilde symbol typically means “by” - we are splitting the plot by the variable

224
Q

What does the summarize verb do?

A

summarize collapses the entire table down to one row

summarize() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.

dataset %>%
summarize(newVar = mean(column))

Can create multiple summaries at once

dataset %>%
summarize(newVar = mean(column), newVar2 = sum(column))

225
Q

How do we use the summarize verb to answer specific questions?

A

Combine it with filter - filter for a certain category then summarise the result

dataset %>%
filter(condition) %>%
summarize(newVar = mean(column))

226
Q

What functions can you use for summarising the dataset?

A

Mean, median, sum, min, max

227
Q

Using the Z-score, how do you find the probability that a randomly selected observation has at least X variable?

A

pnorm(x, mean = mu, sd = sd, lower.tail = FALSE)

Need to set lower tail to false (ie it will look at the upper tail)

or can do 1 - the LHS value which is pnorm(x, mean = mu, sd = sd)

228
Q

In R, how do you obtain the Z score?

How do you use this Z score to find the percentile it falls in?

A

Zscore <- function(x, mu, sd) {
(x - mu) / sd
}

pnorm(Zscore(x, mu, sd))

OR percentile <- pnorm(noOfInterest, mean = mu, sd = sigma)

229
Q

When given a percentile, how do you work out the value of the variable this corresponds to?

A

quantile <- percentile / 100

A –>
ZScoreCalculated <- qnorm(quantile)

xFromZScore <- function(Z, mu, sd) {
x <- (Z * sd) + mu
}

B –>
qnorm(quantile, mean = mu, sd = sd)
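
A worked sketch connecting the Z-score, the percentile and the way back, using made-up values (mean 100, standard deviation 15, observation 130):

```r
mu <- 100; sigma <- 15; x <- 130   # hypothetical values

z       <- (x - mu) / sigma                                      # Z-score: 2
p_below <- pnorm(z)                                              # percentile as a proportion
p_above <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)   # probability of at least x

# Going back: from the proportion to the value on the original scale
x_from_z      <- qnorm(p_below) * sigma + mu            # route A: via the Z-score
x_from_direct <- qnorm(p_below, mean = mu, sd = sigma)  # route B: in one step
```

Both routes should return 130, and p_below and p_above should add up to 1.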

230
Q

What are two ways you can round your answer?

A

round(var,2)
var %>% round(2)

231
Q

If you want to summarise by category, what verb should you apply before?

A

Rerunning filter each time is very tedious

group_by()

dataset %>%
group_by(categoricalColumn) %>%
summarize(newVar = mean(column))

summarize() turns each group into one row

Can use filter and group_by if necessary

232
Q

How can you turn summaries into visualisations?

A

Assign the summary to a variable, so it can be visualised using ggplot2.

233
Q

How do you specify that you want an axis to start at 0 using ggplot?

A

ggplot(data, aes(x = var1, y =var2)) +
geom_point() +
expand_limits(y=0)

234
Q

What are the different types of graphs we have explored?

A

Scatterplots - useful for comparing two variables
Line plots - useful for showing changes over time
Bar charts - good at comparing statistics for each of several categories
Histograms - describe the distribution of a 1D numerical variable
Box plots - compare the distribution of a numeric variable amongst several categories

235
Q

What are line plots useful for?

A

Visualising a change over time. You can get a sense of trends from a scatter plot but it is easier to visualise as a line plot.

geom_line()

236
Q

How do you specify a scatterplot?

A

geom_point()

237
Q

How do you specify a line graph?

A

geom_line()

eg.
ggplot(data, aes(x = var1, y = var2, color = var3)) +
geom_line() +
expand_limits(y=0)

238
Q

What are bar charts useful for?

A

Useful for comparing values across discrete categories.

239
Q

How do you create a bar plot to visualise a value across different categories?

A

Any necessary filters.
Group by (variable).
Summarise (get the variable you want to plot)

Then plot this dataset.

geom_col()

Two aesthetics - x is the categorical variable, y is the variable that determines the height

They always start at 0

ggplot(data, aes(x = var1, y =var2)) +
geom_col()

240
Q

Where does the y-axis of a bar chart start?

A

0

241
Q

What are histograms useful for?

A

Investigating one variable.
Every bar represents a bin of values.
The height represents how many observations fall into that bin.

Only one aesthetic - the x-axis, the variable whose distribution we are examining.

Width of each bin is chosen automatically - can customise with binwidth option

ggplot(data, aes(x = var1)) +
geom_histogram()

ggplot(data, aes(x = var1)) +
geom_histogram(binwidth=5)
or
ggplot(data, aes(x = var1)) +
geom_histogram(bins=50)

242
Q

What do box plots show?

A

The distribution of a numeric variable across the categories of a categorical variable.

Allows you to compare them

geom_boxplot()

Two aesthetics - x = category, y = variable under investigation

ggplot(data, aes(x = var1, y = var2)) +
geom_boxplot()

243
Q

What does the box plot show?

A

5 figure summary - min, Q1, median, Q3, max

The box represents the IQR (top and bottom of each box represents 75th and 25th percentile). The IQR has half of the distribution in it.

Whiskers and outliers

A boxplot helps give more context to the shape of the histogram.

244
Q

How do you add a title to your ggplot2 graph?

A

+ ggtitle("Insert title")

245
Q

What is a short-cut for the test of proportions using the Z-statistic?

A

TEST OF TWO PROPORTIONS

test_results <- prop.test(x = c(x1, x2), n = c(n1, n2), correct = F, alternative = "two.sided")

Setting correct = F disables the continuity correction - not necessary in large samples

To access the value: test_results$p.value

If you wanted to test whether one proportion is greater than the other, you could use “greater” or “less” as alternatives.

TEST OF SINGLE PROPORTION
prop.test(x, n, p = null_hypothesized_proportion, correct = F, alternative = "two.sided")
eg prop.test(threeYears, n, p = p, correct = F, alternative = "two.sided", conf.level = 0.95)

246
Q

What are the functions to import data?

A

UTILS:
read.table() - the main function
read.csv() - wrapper for CSV
read.delim() - wrapper for tab-delimited files

read.csv2() and read.delim2() exist for locale differences - eg European locales that use a comma as the decimal mark

247
Q

How do you determine what proportion of the t-distribution with 18 degrees of freedom falls below -2.10?

A

pt(q = -2.1, df = 18)

248
Q

What is the wording to report a t-value?

A

Under H0, t = x is an observation from t(n-1)

249
Q

What is a quick way to do a t-test in R?

A

t.test(vector, mu = 80, alternative = "less")
- Reports a one-sided confidence bound rather than a two-sided confidence interval

or t.test(vector, mu = 80, alternative = "two.sided")
- Reports the usual two-sided confidence interval

250
Q

What is a quick way to do a paired t-test in R?

A

t.test(a, b, paired = T, alternative = "two.sided")

251
Q

Is there a quick way to do a t.test function for a comparison of two means (independent samples) where you only have the means and not the raw values?

A

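Not with base R: t.test() needs the raw values. With only summary statistics, you have to compute the t-statistic by hand.

With the raw values:
t.test(method1, method2, var.equal = T, alternative = "two.sided")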
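
A sketch of the by-hand calculation for the pooled (equal variance) case. t_from_summary is a hypothetical helper written for this card, not a base R function, and the raw data below is made up purely to check it against t.test():

```r
# Pooled two-sample t-test from means, standard deviations and sample sizes
t_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  df        <- n1 + n2 - 2
  pooled_sd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)
  t_stat    <- (m1 - m2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))
  p_value   <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided
  list(t = t_stat, df = df, p.value = p_value)
}

# Sanity check: both routes should give the same t and p-value
a <- c(5, 6, 7, 8)
b <- c(1, 2, 3, 4, 5)
from_summary <- t_from_summary(mean(a), sd(a), length(a), mean(b), sd(b), length(b))
from_raw     <- t.test(a, b, var.equal = TRUE, alternative = "two.sided")
```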

252
Q

What are the relevant functions for the chi squared distribution?

A

chisq_test <- chisq.test(observed, p = probability)

OR

phi_squared <- sum((observed - expected)^2 / expected)

phi_star <- qchisq(conf, dof, lower.tail = FALSE)

prob <- pchisq(phi_squared, dof, lower.tail = FALSE)
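
A goodness-of-fit sketch with made-up counts, showing that the manual route and chisq.test() agree. All names and numbers here are illustrative:

```r
observed <- c(18, 22, 20, 40)            # hypothetical counts in four categories
probs    <- c(0.25, 0.25, 0.25, 0.25)    # proportions expected under H0
expected <- sum(observed) * probs        # 25 in each category

chi_sq   <- sum((observed - expected)^2 / expected)   # 12.32
dof      <- length(observed) - 1                      # 3
p_manual <- pchisq(chi_sq, dof, lower.tail = FALSE)   # upper-tail probability

# Critical value for a 5% significance level
critical <- qchisq(0.05, dof, lower.tail = FALSE)

gof <- chisq.test(observed, p = probs)   # same statistic and p-value in one call
```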

253
Q

How would you count the number of missing values in a column?

A

missing_count <- sum(is.na(df$col))
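
A small sketch on a made-up data frame (survey and score are hypothetical names), also showing how the missing values affect a mean:

```r
survey <- data.frame(score = c(4, NA, 7, NA, 10))

missing_count <- sum(is.na(survey$score))        # 2: TRUE counts as 1 when summed
mean_with_na  <- mean(survey$score)              # NA: missing values propagate
mean_clean    <- mean(survey$score, na.rm = TRUE) # 7: NAs dropped before averaging
```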

254
Q

What is the difference between summary and summarize?

A

summary() is a base R function used to give descriptive statistics about an entire dataset or vector. It provides a general overview such as the minimum, maximum, quartiles, mean, and number of NA values, depending on the data type.

summarize() is a function from the dplyr package used to apply summary statistics to grouped data or to a whole dataset.

summaryIris2 <- iris %>%
group_by(Species) %>%
summarize(meanSW = mean(Sepal.Width), sdSW = sd(Sepal.Width))

255
Q

How do you select relevant columns from a data frame (when creating a new data frame)?

A

select()

irisSW <- iris %>%
group_by(Species) %>%
select(Species, Sepal.Width)

256
Q

How do you add colour to a histogram?
How do you change the transparency?
How do you add a border?

A

fill
alpha
color

ggplot(irisSW, aes(x = Sepal.Width)) +
geom_histogram(bins = 20, fill = "blue", alpha = 0.5, color = "blue")

Different colour for each
ggplot(irisSW, aes(x = Sepal.Width, fill = Species)) +
geom_histogram(bins = 20, alpha = 0.5, color = "black") +
facet_wrap(~Species)

257
Q

How do you carry out a one-way ANOVA in R?

A

aov

irisSW_aov <- aov(numVar ~ catVar, data = df)

aov(Sepal.Width ~ Species, data = iris_copy)

258
Q

How do you extract the p-value from an ANOVA summary table?

A

p_value <- summaryIrisAOV[[1]][["Pr(>F)"]][1]

summaryIrisAOV[[1]][[1, "Pr(>F)"]]
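
An end-to-end sketch on the built-in iris data, assuming the summary object was created with summary() on the fitted aov model (the names fit and anova_table are illustrative):

```r
fit <- aov(Sepal.Width ~ Species, data = iris)

# summary() returns a list; its first element is the ANOVA table (a data frame)
anova_table <- summary(fit)[[1]]

p_by_name     <- anova_table[["Pr(>F)"]][1]    # first row of the p-value column
p_by_position <- anova_table[[1, "Pr(>F)"]]    # row 1, column "Pr(>F)"
```

The second row of the table is the residuals row, which has no p-value, so only the first entry is of interest.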

259
Q

If a sample has missing data, how do you exclude these observations from calculations like the mean?

A

na.rm=T

260
Q

If a categorical variable is not listed as such in a data frame, how would we do this before conducting ANOVA?

A

iris_copy$Species <- factor(iris_copy$Species)

261
Q

How do you perform a t-test following an ANOVA?

A

Pairwise t-tests using the Bonferroni correction

pairwise.t.test(data$score, data$technique, p.adjust.method = "bonferroni")

262
Q

What is the first condition to be met for an ANOVA test and how do you check for it in R?

A

Observations should be independent within and between groups.

We assume independence of the data.

263
Q

What is the second condition to be met for an ANOVA test and how do you check for it in R?

A

The observations within each group should be nearly normal.

We need to mean-center each sepal width by its respective group mean. These group-wise, mean-centered values are also known as residuals, and by using them we can assess the normality of all observations as a whole.

Graphical method for assessing normality of residuals - QQ Plot
plot(anova_test, 2)

Alternatively
anova_test_residuals <- residuals(anova_test)

shapiro.test(anova_test_residuals)

263
Q

What is the third condition to be met for an ANOVA test and how do you check for it in R?

A

The variability across the groups should be about equal.

plot(anova_test, 1)

leveneTest(Sepal.Width ~ Species, data = iris_copy) # leveneTest() comes from the car package

264
Q

What function do you use to fit linear regression models in R?

A

lm()

model <- lm( fitting_formula, dataframe )
summary(model)

eg
iris_model <- lm(Petal.Width ~ Petal.Length, data = irisPetal)
where we are predicting widths based on lengths

265
Q

How do you plot the line of best fit on a scatterplot?

A

stat_smooth

ggplot(irisPetal, aes(x = Petal.Length, y = Petal.Width)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ x, geom = "smooth", se = FALSE)

se = FALSE - you don’t want to display the confidence band around the line

OR
plot(irisPetal$Petal.Length, irisPetal$Petal.Width)
abline(iris_model, col = "red")

266
Q

How do you use your linear regression model to predict values?

A

new_data <- data.frame(
Petal.Length = c(1.5, 1.6, 1.7) # predictor column name needs to be the same as the one in the original dataset
)

predict(iris_model, new_data)
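
A self-contained sketch of the same steps, fitted on the built-in iris data rather than the irisPetal data frame used in these cards. The manual calculation at the end is included only to show what predict() is doing:

```r
iris_model <- lm(Petal.Width ~ Petal.Length, data = iris)

# The column name must match the predictor used when fitting the model
new_data <- data.frame(Petal.Length = c(1.5, 1.6, 1.7))

predicted <- predict(iris_model, new_data)

# Same predictions by hand: intercept + slope * x
coefs  <- coef(iris_model)
manual <- coefs[["(Intercept)"]] + coefs[["Petal.Length"]] * new_data$Petal.Length
```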

267
Q

How do you see the categories in a column of a dataset?

A

levels(dataset$column)

268
Q

How do you create a contingency table?

A

A contingency table shows the counts for each combination of two categorical variables
table(dataset$column1, dataset$column2)

269
Q

What are the three elements to specify in a ggplot graphic?

A

Dataset
Aesthetics - the variables you are interested in
Layers - describe how you want the variables to be plotted

270
Q

How do you produce a bar chart to show the relationship between two categorical variables?

A

Stacked bar chart

ggplot(data, aes(x = var1, fill = var2)) +
geom_bar()

271
Q

How do you plot all box plots on one axis? How do you improve the layout?

A

ggplot(data, aes(x = cut, y = price, colour = cut)) +
geom_boxplot() +
labs(title = "Boxplot of Diamond Prices by Cut", y = "Price (USD)", x = "Cut") +
theme_minimal()