RStudio functions: Flashcards
A + B + Enter button
Answer for A + B
*
Multiply
/
Divide
^
To the power of
Brackets ( )
To separate different operations used in one calculation
A <- 3
Store the value 3 in variable A
What is a scalar variable?
A variable storing a single value
Variable (eg. A) + Enter
Displays value stored in variable
C <- "apples"
Stores the word apples in variable C
What are vectors?
Variables that can hold more than one value
D <- c(3, 7, 1)
Stores the vector 3, 7, 1 in variable D
c stands for combine
D[2]
Allows access to the second value stored in the vector variable D
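The cards so far can be tried directly in the R console. This is a minimal sketch: the variable names follow the cards and the numbers are only illustrative.

```r
# Basic arithmetic: R prints the answer when you press Enter
2 + 3        # addition
2 * 3        # multiplication
6 / 3        # division
2 ^ 3        # 2 to the power of 3
(2 + 3) * 4  # the part in brackets is evaluated first

# Scalar variables store a single value
A <- 3
C <- "apples"
A            # typing the name displays the stored value

# Vectors store more than one value; c() combines the values
D <- c(3, 7, 1)
D[2]         # the second value stored in D
```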
What are data frames?
Tables that store large amounts of data in rows and columns
Typing the name of a data frame + Enter
Displays the full set of data in data frame
cars$mpg
Returns only the mpg column of the cars data frame
DataFrameName[RowNumber, ColumnNumber]
Selects the value stored at the given row and column of a data frame
cars[ , 1]
When row number is left blank, it will return all the rows.
In this example, every row of the first column of cars will be returned
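A short sketch of these ways of reaching into a data frame. It assumes that R's built-in mtcars data set stands in for the "cars" data frame on the cards (mtcars has mpg and hp columns); the name cars_df is made up for illustration.

```r
# Assumption: mtcars plays the role of the "cars" data frame
cars_df <- mtcars

cars_df$mpg                  # only the mpg column
cars_df[2, 1]                # the value in row 2, column 1
first_col <- cars_df[ , 1]   # row number left blank: all rows of column 1
```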
What is a function?
Anything that performs a particular operation on our data
What is the format that all functions follow in R?
function_name(argument)
What is an argument?
The input for the function
What would the function "mean(e)" do?
Give the mean/average of values in vector e
What code would you use to calculate the mean mpg of cars?
mean(cars$mpg)
How do you find the column names in a data frame?
By printing the whole data frame
names(name of data frame)
Only shows the column names of a data frame
head(name of data frame)
Only shows the column names and the top few rows of a data frame
Why might the function "head(name of data frame)" cause the full data frame to be shown?
The function shows the first 6 rows by default, so if the data frame has 6 or fewer rows then all of it will be shown
How are arguments separated in all functions?
Using commas
How do you define how many rows you want from a data frame?
head(name of data frame, n = )
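The function cards above can be sketched as follows, again assuming the built-in mtcars data set stands in for "cars" (cars_df is a made-up name).

```r
# Assumption: mtcars plays the role of the "cars" data frame
cars_df <- mtcars

mean(cars_df$mpg)              # one argument: the vector to average
names(cars_df)                 # column names only
head(cars_df)                  # column names and the first 6 rows
top3 <- head(cars_df, n = 3)   # two arguments, separated by a comma
```

R function names are case-sensitive, so mean() works while Mean() gives an error.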
hist()
Creates a histogram
Draw histogram of horsepower (hp) from cars:
hist(cars$hp)
What is the standard layout of data?
- Each column represents a different variable
- Each row represents a different subject or replicate
How is the standard layout of data different from the layout people often use when making spreadsheets?
In spreadsheets, each condition is often put into a separate column, rather than each variable
What is an advantage of the standard layout of data?
It is much easier to record additional variables in a data frame
What is the advantage of using histograms over box plots?
Histograms show more information
What is the advantage of using boxplots over histograms?
It is easier to compare data presented as a box plot than as a histogram
boxplot()
Function that plots a boxplot
How do you code for a boxplot of only one variable (column) of the data?
boxplot(DataFrameName$ColumnName)
What values are shown on a boxplot?
- Median
- Upper (3rd) quartile
- Lower (1st) quartile
- Lowest value in data
- Highest value in data
How much of the data falls within each interval of a boxplot (generally speaking)?
A quarter of the data
How much of the data falls within the interquartile range?
Half (50%) of the data
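The values behind a boxplot can be checked numerically. A small sketch, assuming the mpg column of the built-in mtcars data set as example data:

```r
# Assumption: mtcars$mpg is used purely as example data
x <- mtcars$mpg

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# Quartiles (these can differ slightly from the hinges in small samples)
quartiles <- quantile(x, c(0.25, 0.5, 0.75))

# Share of the data lying between the 1st and 3rd quartiles: roughly half
share_in_iqr <- mean(x >= quartiles[1] & x <= quartiles[3])
```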
When is the boxplot() function able to produce multiple plots?
If given a variable to group the data by
Name the code needed to make multiple plots using the boxplot() function:
boxplot(NameOfDataFrame$a ~ NameOfDataFrame$b)
- Where a is the variable that is plotted
- Where b is the variable for grouping the data
In the function "boxplot(NameOfDataFrame$a ~ NameOfDataFrame$b)", what lies to the left and right of the "~" symbol?
- The response variables lie to the left
- The explanatory variables lie to the right
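A sketch of a grouped boxplot. It assumes the built-in mtcars data set, with mpg as the response and cyl (number of cylinders) as the grouping variable; neither choice comes from the cards.

```r
# Assumption: mtcars, with mpg plotted and cyl used for grouping
bp <- boxplot(mtcars$mpg ~ mtcars$cyl)

# The same plot written with the data argument
bp2 <- boxplot(mpg ~ cyl, data = mtcars)

bp$names   # one box per group
```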
summary()
This function neatly and easily provides several summary statistics
List the function that would be used to make summary statistics for weight from the mice data frame:
summary(mice$weight)
List the function that would be used to make summary statistics from the whole mice data frame:
summary(mice)
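A sketch of both calls. The mice data frame below is invented for illustration, including the tail_length column.

```r
# Hypothetical data: the values and the tail_length column are made up
mice <- data.frame(
  weight      = c(21.5, 23.1, 19.8, 22.4, 20.9, 24.0),
  tail_length = c(8.1, 8.6, 7.9, 8.4, 8.0, 8.8)
)

weight_summary <- summary(mice$weight)   # one column
weight_summary
summary(mice)                            # every column at once
```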
When sex is a variable in the data frame, what problem can be caused when trying to make summary statistics? What are possible solutions?
- Sex isn’t a numerical variable
- Any numbers used would have no intrinsic meaning so wouldn’t be very useful
- Better summaries involve knowing the number of males or females
- To do this, we would need to tell R that sex is a categorical variable
What are categorical variables called in R?
Factors
What function is used to convert numerical values to factors?
factor()
List the function that would be used to convert the sex column from the data frame “mice” to a factor:
factor(mice$sex)
When a variable has been converted into a factor, the outcome will not be stored unless R is told to store it. How do you store the data in this format?
- By assigning the converted data to a column with the same name as the converted variable
- Here, it will overwrite the current column contents
class()
Function that reports what type (class) a variable is being stored as
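The factor cards can be sketched like this. The mice data frame and the coding 1 = male, 2 = female are assumptions made for illustration.

```r
# Hypothetical data: sex is coded as 1 (male) or 2 (female)
mice <- data.frame(
  weight = c(21.5, 23.1, 19.8, 22.4, 20.9, 24.0),
  sex    = c(1, 2, 1, 2, 2, 1)
)

class(mice$sex)   # numeric before conversion

# Assign the result back to the column, otherwise the conversion is lost
mice$sex <- factor(mice$sex, levels = c(1, 2), labels = c("male", "female"))

class(mice$sex)     # now a factor
summary(mice$sex)   # counts of males and females
```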
dbinom( x, size= , prob= )
Function giving the probability of exactly x successes under a binomial (discrete) distribution
Where x, size and probability are the arguments
List the function for finding the probability of getting 3 heads after tossing a coin 4 times:
dbinom(3, 4, 0.5)
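As a check, the same probability can be worked out from the binomial formula. The variable names are illustrative.

```r
# Probability of exactly 3 heads in 4 tosses of a fair coin
p_three_heads <- dbinom(3, size = 4, prob = 0.5)

# By hand: number of ways to choose 3 tosses out of 4,
# times the probability of 3 heads and 1 tail
by_hand <- choose(4, 3) * 0.5^3 * 0.5^1

p_three_heads   # 0.25
```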
pbinom()
Function calculates the probability of observing up to a certain number of successes or events
What are the arguments for pbinom?
pbinom (x, size = , prob = )
- x = observed number of successes
- size = number of trials (sample size)
- prob = probability of success
What is the probability of getting up to 3 heads after tossing 6 different coins?
pbinom( 3, 6, 0.5)
15 people are admitted to hospital with a heart attack. 4 in 100 people who have a heart attack die. What is the probability that more than 4 of the 15 die?
1- pbinom(4, 15, 0.04)
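A sketch showing that pbinom() is a running total of dbinom(), and two equivalent ways of getting the "more than" probability. The variable names are illustrative.

```r
# Up to 3 heads from 6 coins: cumulative probability
p_up_to_3 <- pbinom(3, size = 6, prob = 0.5)

# The same value as the sum of the individual probabilities for 0, 1, 2, 3
p_summed <- sum(dbinom(0:3, size = 6, prob = 0.5))

# More than 4 deaths out of 15: one minus "up to 4"
p_more_than_4 <- 1 - pbinom(4, size = 15, prob = 0.04)

# lower.tail = FALSE asks for the upper tail directly
p_upper_tail <- pbinom(4, size = 15, prob = 0.04, lower.tail = FALSE)
```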
pnorm()
Calculates cumulative probability of a normal distribution
What is a similarity and difference between the pnorm() and pbinom() functions?
- They both calculate cumulative probability
- pnorm() is used for normal distribution and pbinom() is used for binomial distributions
What are the arguments for pnorm()?
pnorm(x, mean, sd)
- Where x is the observation made
- Where mean is the mean of the population
- Where sd is the standard deviation
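A small sketch of pnorm(). The height example (mean 170, standard deviation 10) is hypothetical and not taken from the cards.

```r
# Hypothetical population: heights with mean 170 and sd 10
# Probability of observing a value of 180 or less
p_below_180 <- pnorm(180, mean = 170, sd = 10)

# Probability of observing a value above 180
p_above_180 <- 1 - p_below_180

# With no mean or sd given, pnorm() uses mean = 0 and sd = 1
pnorm(0)   # 0.5: half of the distribution lies below the mean
```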
The probability that two people share the same birthday in a cohort of 150 is being calculated. Explain why the function below is incorrect.
dbinom(2, 150, 0.00273973)
- size has to be the number of trials, and we are comparing one person's birthday with the 149 other birthdays
- This means the number of trials is 149, not 150
- x is the number of successes, and a success is when one other person has the same birthday, so x = 1, not 2
The probability of sharing a birthday with at least two other people in a cohort of 150 is being calculated. Why is the following function incorrect?
pbinom(2, 149, 0.00273973)
- pbinom() gives the probability of up to x successes, so this call gives the probability of sharing a birthday with two or fewer people
- pbinom(1, 149, 1/365) is the probability of sharing a birthday with none or one other person
- That outcome and "at least two" are mutually exclusive, so taking the value away from 1 gives the probability of sharing a birthday with more than one other person
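The two birthday cards can be sketched as below. This follows the cards' framing (one given person compared with the 149 others), not the classic "any two people in the room" birthday problem; the variable names are illustrative.

```r
p_birthday <- 1 / 365   # chance another person shares a given birthday
others     <- 149       # the rest of the cohort of 150

# Exactly one other person shares the birthday
p_exactly_one <- dbinom(1, size = others, prob = p_birthday)

# None or one other person shares the birthday
p_none_or_one <- pbinom(1, size = others, prob = p_birthday)

# At least two others share the birthday
p_at_least_two <- 1 - p_none_or_one
```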
qbinom()
Function calculates the critical value of a binomial distribution at a specific alpha
What are the arguments in the function qbinom()?
qbinom(alpha, size = , prob = )
- alpha is the cumulative probability, set from the significance level
- size is the number of trials
- prob is the probability of success
Out of 100 tosses of a coin, 59 were heads. With a value of alpha being 0.05, calculate the critical values given a two-tailed test:
The Lower critical value:
qbinom(0.025, 100, 0.5)
The upper critical value:
qbinom(0.975, 100, 0.5)
Remember that in a two-tailed test each tail uses 0.025, as the significance level of 0.05 is shared between both tails
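A sketch of the coin example. The final comparison against the critical values is one simple convention for reading the result and is an assumption, not something stated on the cards.

```r
alpha <- 0.05

# Two-tailed test: alpha is split equally between the tails
lower <- qbinom(alpha / 2, size = 100, prob = 0.5)       # 40
upper <- qbinom(1 - alpha / 2, size = 100, prob = 0.5)   # 60

# 59 heads were observed; is that outside the critical values?
observed <- 59
outside  <- observed < lower | observed > upper          # FALSE
```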
4 in every 100 people suffering with heart attacks die. 15 people are admitted to a hospital and 3 of them die. A doctor is concerned this is abnormally high, so finds 4% of 15 (which turns out to be 0.6) but is not sure what this means.
Null hypothesis - hospital does not suffer more fatalities from heart attacks than expected.
Alternative hypothesis - hospital suffers more fatalities from heart attacks than expected.
- Calculate the critical value for heart attacks at hospital.
Now as a two tailed test:
- Calculate the lower critical value
- Calculate the upper critical value
- qbinom(0.95, 15, 0.04)
- qbinom(0.025, 15, 0.04)
- qbinom(0.975, 15, 0.04)
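The three calls can be run and compared with the 3 observed deaths. The variable names are illustrative, and comparing the observation with the critical value is an assumed way of reading the result.

```r
# One-tailed test: all of alpha = 0.05 sits in the upper tail
upper_one_tailed <- qbinom(0.95, size = 15, prob = 0.04)

# Two-tailed test: 0.025 in each tail
lower_two_tailed <- qbinom(0.025, size = 15, prob = 0.04)
upper_two_tailed <- qbinom(0.975, size = 15, prob = 0.04)

# 3 deaths were observed
deaths <- 3
above_critical <- deaths > upper_one_tailed
```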
Why would we use a one-tailed test over a two-tailed test when trying to figure out if the fatality rate from heart attack in a hospital is higher than expected?
We are not interested in whether the hospital has a lower than expected fatality rate
binom.test()
Function that performs an exact binomial test
What are the arguments in the binom.test() function?
binom.test(x, n = , p = )
- x is the observed number of a particular outcome
- n is the number of trials
- p is the probability of success
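A sketch applying binom.test() to the two earlier examples (59 heads in 100 tosses, and 3 deaths in 15 admissions). Note that binom.test() names its arguments n and p, unlike dbinom() and pbinom(), which use size and prob. The use of alternative = "greater" for the hospital is an assumption matching the one-tailed hypothesis.

```r
# Coin: two-sided test of 59 heads in 100 tosses against p = 0.5
coin <- binom.test(59, n = 100, p = 0.5)
coin$p.value

# Hospital: one-sided test, is the death rate greater than 4%?
hospital <- binom.test(3, n = 15, p = 0.04, alternative = "greater")
hospital$p.value

# The one-sided p-value is the probability of 3 or more deaths
p_by_hand <- 1 - pbinom(2, size = 15, prob = 0.04)
```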
What is used to compare the mean of one sample to a particular value?
One sample t test
What is the function for a t-test?
t.test()
What are the arguments for a one sample t-test?
t.test(DataFrameName$ColumnName, mu = )
- mu is the value the sample mean is compared with (the hypothesised population mean)
- DataFrameName$ColumnName is the column of values being tested
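A sketch of a one-sample t-test. It assumes the mpg column of the built-in mtcars data set as the sample and a made-up comparison value of 20.

```r
# Assumptions: mtcars$mpg is the sample, mu = 20 is a hypothetical value
x <- mtcars$mpg
result <- t.test(x, mu = 20)

result$statistic   # t value
result$p.value     # p-value

# The t value by hand: distance from mu in units of standard error
t_by_hand <- (mean(x) - 20) / (sd(x) / sqrt(length(x)))
```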
What test compares the means of two samples?
The two sample t-test
The weight of 10 mice for each cohort is stored in data1. The data is arranged with the weights stored in the weight column and a treatment column containing either control or drug.
- Code a boxplot of weight grouped by treatment:
- Perform a two sample t-test to compare the weights grouped by treatment option:
- boxplot(weight ~ treatment, data = data1)
- t.test(weight ~ treatment, data = data1)
As the t-test is run on the same data as the boxplot, we can use the same arguments in both functions
What are the arguments for the function t.test() for a two sample t-test?
t.test(OutputVariable ~ VariableThisIsGroupedBy, data = )
- Where data is the name of the data frame
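A runnable sketch of the mouse example. The data1 values are simulated here, since the cards do not give them; the means and standard deviations are invented.

```r
# Hypothetical data1: 10 simulated weights per treatment group
set.seed(1)
data1 <- data.frame(
  weight    = c(rnorm(10, mean = 25, sd = 2), rnorm(10, mean = 27, sd = 2)),
  treatment = rep(c("control", "drug"), each = 10)
)

# Same formula and data argument in both functions
boxplot(weight ~ treatment, data = data1)
res <- t.test(weight ~ treatment, data = data1)

res$estimate   # mean weight in each group
res$p.value
```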
plot()
Function plots a scatter graph of data
What happens if the grouping variable used in the plot() function is a factor? How do you resolve this?
- If the grouping variable is a factor, R will produce a boxplot
- Use the function “as.numeric(NameOfGroupingVariable)” to plot a scatter graph
Plot a scatter graph of output grouped by day from data2:
plot(output ~ as.numeric(day), data = data2)
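A sketch of the difference, using an invented data2. One caution: as.numeric() on a factor returns the position of each level (1, 2, 3, ...), which only matches the original numbers when the levels are 1, 2, 3, ... in order.

```r
# Hypothetical data2: output measured three times on each of 5 days
data2 <- data.frame(
  day    = factor(rep(1:5, each = 3)),
  output = c(2.1, 2.4, 2.0, 3.2, 3.0, 3.5, 4.1, 3.8, 4.4,
             5.0, 5.3, 4.9, 6.2, 5.8, 6.1)
)

plot(output ~ day, data = data2)               # day is a factor: boxplots
plot(output ~ as.numeric(day), data = data2)   # numeric: scatter graph

# as.numeric() gives the level positions
day_codes <- as.numeric(data2$day)
```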
How do you colour in points on a graph showing data?
By using the argument “col=“ to colour the data points according to the variable you want
How do we change the t-test so it becomes a paired t-test?
By adding the argument paired=TRUE
When is a paired t-test useful?
- Two-sample t-tests compare the means of two groups
- Sometimes the variation between replicates masks the effect of the independent variable (the one we are testing for a significant effect)
- A paired t-test analyses the data in pairs to stop this happening
What is required in R for the paired t-test to work?
The data must be arranged in the same order in each group (sorted/grouped by the same variable)
When is the paired t-test used?
- When observations in one group can be paired with observations in the other group
- There needs to be a reason why an observation in one group is more closely related to one particular observation than the other observations in the second group
When can observations in one group be paired with observations in the other group?
- The observations were performed on the same subject
- The observations were performed at the same time
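A sketch of a paired t-test on invented before/after measurements from the same 8 subjects. It passes the two groups as separate vectors, which works in the same way across R versions.

```r
# Hypothetical data: the same 8 subjects, in the same order in both vectors
before <- c(72, 80, 65, 90, 77, 84, 69, 75)
after  <- c(70, 76, 66, 85, 74, 80, 68, 71)

paired   <- t.test(before, after, paired = TRUE)
unpaired <- t.test(before, after)

# A paired test is a one-sample t-test on the within-subject differences
on_differences <- t.test(before - after, mu = 0)

paired$p.value     # small: the within-subject change is consistent
unpaired$p.value   # larger: variation between subjects hides the change
```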
qqnorm()
Function plots data into a graph with sample quantiles against theoretical quantiles (Q-Q plot)
qqline()
Adds a line to Q-Q plot to see if data is normally distributed
By default, R uses Welch's t-test, which does not assume equal variance in the two populations from which the samples were taken. How do you tell R to change this when the variances are equal? What effect does this have?
By adding the argument "var.equal = TRUE"
- It increases the power of the test, but there is little advantage in most situations
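A sketch comparing the two versions of the test on simulated data. The group names and values are invented for illustration.

```r
# Hypothetical data: two simulated groups with the same standard deviation
set.seed(2)
control <- rnorm(10, mean = 25, sd = 2)
drug    <- rnorm(10, mean = 27, sd = 2)

welch   <- t.test(control, drug)                     # default: Welch's t-test
student <- t.test(control, drug, var.equal = TRUE)   # assumes equal variances

welch$parameter     # degrees of freedom, adjusted downwards by Welch
student$parameter   # degrees of freedom: 10 + 10 - 2 = 18
```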