R-Code for Exam, Modules 1-8 Flashcards
sd(DATASET_NAME$VARIABLE_NAME)
gives the standard deviation for the observations in the variable
favstats(DATASET_NAME$VARIABLE_NAME)
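provides summary statistics (minimum, quartiles, median, maximum, mean, standard deviation, n, and number missing) for the observations in the variable; requires the mosaic package to be loaded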
hist(DATASET_NAME$VARIABLE_NAME)
produces a histogram of the variable from the dataset
boxplot(DATASET_NAME$VARIABLE_NAME)
produces a boxplot of the variable from the dataset
boxplot(Y~X, data = DATASET_NAME)
produces side-by-side boxplots of variable “Y” for each group of variable “X” from the dataset
summary(DATASET_NAME)
gives numerical summaries of all of the variables in the data set
(minimum, maximum, median, mean, 1st quartile, 3rd quartile)
t.test(Y~X, data = DATASET_NAME, conf.level = 0.95)
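runs a two-sample t-test of “Y” between the two groups of “X” (Welch's unequal-variance test by default), reporting the t-statistic, degrees of freedom, p-value, and a 95% confidence interval for the difference in means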
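A minimal sketch of the summary and two-sample commands above. It assumes R's built-in ToothGrowth data set (not part of the original cards) stands in for DATASET_NAME, with len as “Y” and supp as “X”:

```r
# Assumption: ToothGrowth (built into R) stands in for DATASET_NAME.
data(ToothGrowth)

sd(ToothGrowth$len)      # standard deviation of one variable
summary(ToothGrowth)     # numerical summaries of every variable

# Side-by-side boxplots of len for each level of supp
boxplot(len ~ supp, data = ToothGrowth)

# Two-sample t-test of len between the two supp groups
result <- t.test(len ~ supp, data = ToothGrowth, conf.level = 0.95)
result$statistic   # t-statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # 95% confidence interval for the difference in means
```

Storing the test in an object (here the hypothetical name result) lets each piece be pulled out by name instead of being read off the printed output.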
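runif(N_OBS, X, Y)
produces a vector of uniformly distributed random numbers in the range (X, Y), with the number of observations specified by “N_OBS” (a “#” cannot be used in the name, because R treats everything after “#” as a comment)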
setwd("C:/Users/Joseph Paoli/Downloads/Lessons in R for Stats")
sets the working directory to the “Lessons in R for Stats” folder inside the Downloads folder
rm(list=ls())
removes every object from the current workspace, giving a clean working environment in RStudio
“Files” → Desired Folder (Click) → Blue Gear (Click) → “Set as Working Directory”
how to set the working directory if the RStudio code doesn’t work
“Plots” → “Export” → “Save as Image…”
how to export a plotted graph in the viewing area to a .jpeg or .png file
capture.output(summary(DATASET_NAME), file = "EXCEL_NAME.xls")
saves the printed summary statistics for a dataset to a file of a specified name; the contents are plain text, so Excel opens the file as text rather than as a true workbook
DATASET_NAME[DATASET_NAME > X]
returns only the values in the dataset that are greater than “X”; the original dataset is unchanged unless the result is assigned back to it
log(DATASET_NAME)
takes the natural log (base e) of the dataset; use log10(DATASET_NAME) for the common log (base 10)
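A short sketch of the difference between the log functions, using a hypothetical vector x that is not part of the original cards:

```r
# Hypothetical values chosen so the base-10 logs are easy to read.
x <- c(1, 10, 100)

natural_log <- log(x)             # natural log (base e), the default for log()
common_log  <- log10(x)           # common log (base 10)
same_as_log10 <- log(x, base = 10)  # log() with an explicit base

natural_log
common_log
```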
sqrt(DATASET_NAME)
takes the square root of the dataset
var(DATASET_NAME)
finds the variance in the set of values in the dataset
install.packages("PACKAGE")
installs a package called “PACKAGE” into R so that it is available in RStudio
library("PACKAGE")
loads the installed package “PACKAGE” for our use in the current RStudio session
?rstudio.command
this code would provide information on the command with the name “rstudio.command”
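help.search("data.input")
command which searches the help documentation for a topic (here, inputting data into RStudio) when the user does not know the name of the command they need
find("rstudio.command")
reports where a command called “rstudio.command” is located, for when you know its name and want to use it but don't know which package it belongs to (only packages that are currently loaded are searched)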
example(rstudio.command)
we want to run an example of the command “rstudio.command” to become better acquainted with it
demo(graphics)
generates a series of plots and shows the code to make them in the “Console” window in the lower-left of RStudio
colnames(DATASET_NAME)
provides the names of all of the columns in a data set
dim(DATASET_NAME)
provides the number of rows and the number of columns in the data set, in that order
str(DATASET_NAME)
provides the internal structure of the data set
range(DATASET_NAME$VARIABLE_NAME)
provides the minimum and maximum values of a certain variable from the dataset
quantile(DATASET_NAME$VARIABLE_NAME, X%)
provides the X% quantile of a certain variable from the dataset, which is to say that X% of the observations are below it and (100-X)% of the observations are above it (“X%” is expressed in decimal form, not as a percentage)
quantile(DATASET_NAME$VARIABLE_NAME, c(X%, Y%, Z%))
provides the X%, Y%, and Z% quantiles for a certain variable from the dataset, with X% < Y% < Z%; the values must be combined into one vector with c(), each expressed in decimal form
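A minimal sketch of the range and quantile commands, assuming the built-in mtcars data set and its mpg column stand in for DATASET_NAME$VARIABLE_NAME (they are not part of the original cards):

```r
# Assumption: mtcars$mpg stands in for DATASET_NAME$VARIABLE_NAME.
data(mtcars)

mpg_range <- range(mtcars$mpg)   # minimum and maximum
mpg_q25   <- quantile(mtcars$mpg, 0.25)   # one quantile, in decimal form

# Several quantiles at once: the probabilities go inside c()
mpg_qs <- quantile(mtcars$mpg, c(0.10, 0.50, 0.90))

mpg_range
mpg_q25
mpg_qs
```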
unique(DATASET_NAME$VARIABLE_NAME)
provides each distinct value or name observed for a variable from the data set, listed once
table(DATASET_NAME$VARIABLE_NAME)
for each unique entry of a given variable, this command tabulates the number of times it appears and displays the counts in the “Console” area
indexes = X:Y
would produce a vector of all integers from the lower integer “X” to the upper integer “Y”, inclusive
DATASET_FEW_COLUMNS = DATASET_NAME[, indexes]
code we can write to produce a new data set with fewer columns, which includes only the columns earmarked by a list of integers between “X” and “Y” in the object called “indexes”
DATASET_FEW_ROWS = DATASET_NAME[indexes ,]
code we can write to produce a new data set with fewer rows, which includes only the rows earmarked by a list of integers between “X” and “Y” in the object called “indexes”
DATASET_FEW_BOTH = DATASET_NAME[indexes , indexes]
code we can write to produce a new data set with fewer columns and rows, which includes only the rows and columns earmarked by a list of integers between “X” and “Y” in the object called “indexes”
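The three indexing cards can be sketched together. The example assumes the built-in mtcars data set in place of DATASET_NAME; the lower-case object names are hypothetical:

```r
# Assumption: mtcars (32 rows, 11 columns) stands in for DATASET_NAME.
data(mtcars)

indexes <- 1:3                          # the integers 1, 2, 3

few_columns <- mtcars[, indexes]        # all rows, first three columns
few_rows    <- mtcars[indexes, ]        # first three rows, all columns
few_both    <- mtcars[indexes, indexes] # first three rows and columns

dim(few_columns)
dim(few_rows)
dim(few_both)
```

The position of the comma decides what is kept: an index before the comma selects rows, and an index after it selects columns.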
main = "HISTOGRAM_TITLE"
provides a title for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”
xlab = "X-AXIS_LABEL"
provides an x-axis label for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”
ylab = "Y-AXIS_LABEL"
provides a y-axis label for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”
aggregate(VARIABLE_Y~VARIABLE_X, data = DATASET_NAME, FUN = sd)
would provide the standard deviation (or the mean, with FUN = mean) of “VARIABLE_Y” for each group of “VARIABLE_X” in the data set, with “VARIABLE_Y” being the y-variable and “VARIABLE_X” being the x-variable
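A minimal sketch of grouped summaries with aggregate(), assuming the built-in ToothGrowth data set (len as the y-variable, supp as the grouping x-variable), which is not part of the original cards:

```r
# Assumption: ToothGrowth stands in for DATASET_NAME.
data(ToothGrowth)

# One call per summary: FUN takes a single function name
group_sds   <- aggregate(len ~ supp, data = ToothGrowth, FUN = sd)
group_means <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)

group_sds
group_means
```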
query_VARIABLE = is.na(DATASET_NAME$VARIABLE_NAME)
index_VARIABLE = which(query_VARIABLE)
NEW_DATASET = DATASET_NAME[-index_VARIABLE, ]
we have a data set which has rows with the value “NA” under a certain variable, and we want to exclude these rows from “NEW_DATASET”
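The query, index, new-dataset steps can be sketched as follows, assuming the built-in airquality data set (whose Ozone column contains NA values) in place of DATASET_NAME. The last line is an alternative that is not in the original cards:

```r
# Assumption: airquality$Ozone stands in for DATASET_NAME$VARIABLE_NAME.
data(airquality)

query_ozone <- is.na(airquality$Ozone)   # TRUE where Ozone is NA
index_ozone <- which(query_ozone)        # row numbers of the NA values
clean_air   <- airquality[-index_ozone, ]  # drop those rows

# Alternative: keep rows where the query is FALSE. Unlike the negative
# index, this also works when the variable has no NA values at all
# (an empty negative index would return zero rows).
clean_air_alt <- airquality[!query_ozone, ]

nrow(airquality)
nrow(clean_air)
```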
plot(x = NEW_DATASET$EXPLANATORY, y = NEW_DATASET$RESPONSE)
would create a scatter-plot relating an explanatory variable to a response variable for a new dataset, “NEW_DATASET”.
pch = actual integer number (1, 2, 17, etc.)
argument which, if typed inside of the parentheses in the “plot()” command, changes the plotting symbol for the coordinates from the default open circles to another symbol
xlim = c(LOW INTEGER, HIGH INTEGER)
ylim = c(LOW INTEGER, HIGH INTEGER)
two arguments which, if typed inside of the parentheses in the “plot()” command, set the range of the x-axis and y-axis for the viewing window
text(x = X-COORDINATE, y = Y-COORDINATE, labels = "NAME")
separate command, run on its own line after the “plot()” command, which places a desired name at a particular set of coordinates on the existing plot
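A minimal sketch that combines the plotting cards, assuming the built-in cars data set (speed as the explanatory variable, dist as the response). The title, axis labels, and label position are hypothetical choices:

```r
# Assumption: cars stands in for NEW_DATASET.
data(cars)

plot(x = cars$speed, y = cars$dist,
     pch = 17,                          # filled triangles instead of open circles
     xlim = c(0, 30), ylim = c(0, 130), # viewing window for each axis
     main = "Stopping distance vs. speed",
     xlab = "Speed (mph)", ylab = "Distance (ft)")

# text() is its own command; it draws on the plot made above
text(x = 5, y = 120, labels = "cars data")
```

The pch, xlim, ylim, main, xlab, and ylab settings are arguments inside plot(), while text() is called afterwards to add to the finished plot.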
set.seed(1)
code which fixes the starting point (seed) of the random number generator, so that the same set of random numbers is generated each time the code is run
n = number of observations (100, 104, etc.)
mu = average of observation values
sigma = standard deviation (1, 2, etc.)
norm_dist = rnorm(n, mean = mu, sd = sigma)
we want to draw a random sample from a normal distribution (called “norm_dist”), which has an explicit number of observations (n), mean value of observations (mu), and an explicit standard deviation (sigma); what are the four lines of code necessary to establish “norm_dist”?
brk_points = seq(from = LOW, to = HIGH, by = SIZE)
to establish a list of numbers (called “brk_points”) which is bounded between two values (LOW, HIGH) and is sub-divided between each number in the set by ‘SIZE’
hist(norm_dist, xlim=c(LOW, HIGH), breaks=brk_points)
command which would create a histogram for the normal distribution “norm_dist”, whose x-axis is bound between (LOW, HIGH) and which has a width of the bins defined by the number sequence from the previous question
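The simulation cards can be run end to end as a sketch. The specific values (n = 100, mu = 0, sigma = 1, breaks from -5 to 5 in steps of 0.5) are assumptions chosen for illustration:

```r
set.seed(1)    # make the random draw reproducible

n     <- 100   # number of observations (assumed value)
mu    <- 0     # mean (assumed value)
sigma <- 1     # standard deviation (assumed value)
norm_dist <- rnorm(n, mean = mu, sd = sigma)

# Bin edges from LOW = -5 to HIGH = 5 in steps of SIZE = 0.5.
# The breaks must cover every value in norm_dist, or hist() stops
# with an error.
brk_points <- seq(from = -5, to = 5, by = 0.5)

hist(norm_dist, xlim = c(-5, 5), breaks = brk_points)
```

Increasing sigma while keeping mu fixed spreads the bars across more bins; changing mu shifts them along the x-axis without changing their spread.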
mu1 = mu2 = … = mu(n-1) = mun = x; sigma1 = y1, sigma2 = y2, etc. (different shapes)
if the averages (mu) of various normal distributions are all the same value (x) but the standard deviations differ, the curves are centered at the same place; those with smaller standard deviations are thinner and taller (reduced variation), and those with larger standard deviations are wider and flatter (greater variation)
mu1 = x1, mu2 = x2, etc.; sigma1 = sigma2 = … = sigma(n-1) = sigman = y (different places)
if averages (mu) for various normal distributions are different, but the standard deviations are all the same value (y), the curves will have the same overall shape but they will be centered at different parts of the x-axis
b0 ; b1 ; sigma ; n
intercept ; slope ; standard deviation (measure of spread) ; sample size
query_VARIABLE = DATASET_NAME$VARIABLE_X == INCLUDED_GROUP
index_VARIABLE = which(query_VARIABLE)
RELEVANT_DATASET = DATASET_NAME$VARIABLE_Y[index_VARIABLE]
situation in which we have a dataset with an explanatory variable with multiple unique traits (producer 1, producer 2, etc.), and we want to establish a dataset which includes only the responses associated with one of those unique representatives (i.e., all of the y-outputs associated with producer 1, or all of the y-outputs associated with producer 2)
qqnorm(RELEVANT_DATASET)
qqline(RELEVANT_DATASET, col = "red")
two commands which draw a normal Q-Q plot, with a reference line in a chosen color (“red”, “blue”, “green”, etc.), for the relevant data we want to inspect for the normal distribution
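A sketch of pulling out one group's responses and checking them for normality, assuming the built-in ToothGrowth data set, with supp as the grouping variable and "VC" as the included group (none of these names come from the original cards):

```r
# Assumption: ToothGrowth stands in for DATASET_NAME.
data(ToothGrowth)

query_vc <- ToothGrowth$supp == "VC"    # TRUE for rows in the VC group
index_vc <- which(query_vc)             # row numbers of those rows
len_vc   <- ToothGrowth$len[index_vc]   # responses for the VC group only

qqnorm(len_vc)                 # normal Q-Q plot of the group's values
qqline(len_vc, col = "red")    # reference line; one color name per call
```

Points that fall close to the reference line suggest the values are roughly normally distributed.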
log_RELEVANT = log(RELEVANT_DATASET)
command would perform a natural log (base e) transformation on the list of values in “RELEVANT_DATASET”, then place those values in the object “log_RELEVANT”; log10() would give the common log instead
hist(RELEVANT_DATASET) < hist(log_RELEVANT) [possible for normality]
expresses the possibility that a log transformation of the original sample values may adhere better to the normal distribution than the original values of the samples
t.test(DATASET_NAME, mu = log(TRUE_MEAN), alternative = "less"/"greater"/"two.sided")
code to run to perform a one-sample t-test on a dataset, with “mu” standing in for the hypothesized true mean (written as log(TRUE_MEAN) because the data were log-transformed) and “alternative” specifying whether the alternative hypothesis is that the population mean is “less” than, “greater” than, or (the default, spelled "two.sided" with a period) different in either direction from that value
numerator_1sided = mean(DATASET_NAME) - TRUE_MEAN
makes an object called “numerator_1sided” which is the mean of all of the values in the sample, minus the hypothesized mean in the population (or in a claim made by a vendor)
n = length(DATASET_NAME)
establishes an object “n” which stores the number of observations in the dataset
denominator_1sided = sd(DATASET_NAME)/sqrt(n)
makes an object called “denominator_1sided” which is the standard deviation of the sampled values, divided by the square root of the number of samples present
T_statistic_1sided = numerator_1sided/denominator_1sided
creates the object “T_statistic_1sided” for the manual calculation of the T-statistic associated with a manually-performed T-test – “T_statistic_1sided” makes use of two previously-generated values, “numerator_1sided” and “denominator_1sided”
df_1sided = length(DATASET_NAME) - 1
(df_1sided = n - 1)
creates the object “df_1sided”, which represents the degrees of freedom present in a manually-generated t-test, based on the length of the dataset (HINT; the object “df_1sided” and the object “n” differ in regard to only one thing)
pt(T_statistic_1sided, df = df_1sided)
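calculates the P-value for a one-sided, lower-tail t-test (alternative “less”) from the pre-established objects “T_statistic_1sided” and “df_1sided”; for the upper tail (alternative “greater”), use 1 - pt(T_statistic_1sided, df = df_1sided)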
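The manual one-sample calculation can be checked against t.test() in a short sketch. The data vector x and the claimed mean of 5.5 are hypothetical values, and the shortened object names are for illustration only:

```r
# Hypothetical sample and hypothetical claimed population mean.
x <- c(5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0, 5.2)
true_mean <- 5.5

numerator   <- mean(x) - true_mean   # sample mean minus claimed mean
n           <- length(x)             # number of observations
denominator <- sd(x) / sqrt(n)       # standard error of the mean
t_stat      <- numerator / denominator
df          <- n - 1                 # degrees of freedom

p_less    <- pt(t_stat, df = df)      # lower tail: alternative "less"
p_greater <- 1 - pt(t_stat, df = df)  # upper tail: alternative "greater"

# Built-in test for comparison
check <- t.test(x, mu = true_mean, alternative = "less")
all.equal(unname(check$statistic), t_stat)
all.equal(check$p.value, p_less)
```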
query_1/A = DATASET_NAME$CATEGORICAL_VARIABLE==1/"A"
query_2/B = DATASET_NAME$CATEGORICAL_VARIABLE==2/"B"
index_1/A = which(query_1/A)
index_2/B = which(query_2/B)
DATA_CAT_1/A = DATASET_NAME[index_1/A, "NAME OF QUANTITATIVE VALUES"]
DATA_CAT_2/B = DATASET_NAME[index_2/B, "NAME OF QUANTITATIVE VALUES"]
set of six commands we could input using the query~index~new dataset method to split a larger dataset with a category with two representatives (Producer 1 and Producer 2; Producer A and Producer B) into two new datasets with just their values
t.test(x = DATA_CAT_1/A, y = DATA_CAT_2/B, var.equal = TRUE, alternative = "two.sided")
command to run a two-sided, two-sample t-test with pooled (equal) variances, which compares the mean of the values from one category (DATA_CAT_1/A) to the mean of the values from the other category (DATA_CAT_2/B)
numerator_2sided = mean(DATA_CAT_1/A) - mean(DATA_CAT_2/B)
creates an object (“numerator_2sided”), the difference between the two group means, which is the numerator used to derive the t-statistic by hand for a two-sided, two-sample test
n_1/A = length(DATA_CAT_1/A)
n_2/B = length(DATA_CAT_2/B)
creates two objects which have the number of samples in the two unique categories used for our two-sided t-test (i.e., Farm 1 and Farm 2; Producer A and Producer B)
df_2sided = length(DATA_CAT_1/A) + length(DATA_CAT_2/B) - 2
(df_2sided = n_1/A + n_2/B - 2)
creates the object “df_2sided”, which represents the degrees of freedom in a manually calculated two-sample t-test, based on the sizes of the two groups (HINT: the object “df_2sided” is related to the objects “n_1/A” and “n_2/B”)
samp_sig2 = ((n_1/A - 1)*var(DATA_CAT_1/A) + (n_2/B - 1)*var(DATA_CAT_2/B))/df_2sided
samp_sig = sqrt(samp_sig2)
list of commands used to find the pooled standard deviation of the samples (“samp_sig”), which is the square root of the calculated pooled variance for DATA_CAT_1/A and DATA_CAT_2/B (“samp_sig2”)
denominator_2sided = samp_sig*sqrt((1/n_1/A) + (1/n_2/B))
the denominator in a manually calculated t-statistic for a two-sided t-test (“denominator_2sided”) equals the pooled standard deviation of the samples (“samp_sig”), times the square root of the sum of the reciprocals of the sample sizes for Category 1/A and Category 2/B
T_statistic_2sided = numerator_2sided/denominator_2sided
the t-statistic in a two-sided t-test is equal to the 2-sided numerator divided by the 2-sided denominator, which were previously worked out
2*(1-pt(abs(T_statistic_2sided), df=df_2sided))
code used to calculate the P-value associated with a two-sided t-test; abs() keeps the calculation correct when the t-statistic is negative
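The full by-hand two-sample calculation can be sketched and compared with t.test(). The example assumes the built-in ToothGrowth data set, with the "OJ" and "VC" groups of supp as the two categories. Because names such as “n_1/A” on the cards are shorthand and are not valid R names, the sketch uses hypothetical names such as n_a:

```r
# Assumption: the two supp groups of ToothGrowth stand in for
# DATA_CAT_1/A and DATA_CAT_2/B.
data(ToothGrowth)
a <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
b <- ToothGrowth$len[ToothGrowth$supp == "VC"]

n_a <- length(a)
n_b <- length(b)
df  <- n_a + n_b - 2                       # degrees of freedom

# Pooled variance and pooled standard deviation
pooled_var <- ((n_a - 1) * var(a) + (n_b - 1) * var(b)) / df
pooled_sd  <- sqrt(pooled_var)

numerator   <- mean(a) - mean(b)
denominator <- pooled_sd * sqrt(1 / n_a + 1 / n_b)
t_stat      <- numerator / denominator

# Two-sided P-value; abs() handles a negative t-statistic
p_two_sided <- 2 * (1 - pt(abs(t_stat), df = df))

# Built-in test for comparison
check <- t.test(x = a, y = b, var.equal = TRUE, alternative = "two.sided")
all.equal(unname(check$statistic), t_stat)
all.equal(check$p.value, p_two_sided)
```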
t.test(x = DATASET_NAME$TREAT_1, y = DATASET_NAME$TREAT_2, paired = TRUE)
command to run a paired two-sided t-test (as opposed to the default unpaired two-sided t-test) which relates the values of one variable (TREAT_1) to the values of another variable (TREAT_2)
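A minimal sketch of the paired test, assuming the built-in sleep data set, in which the same ten subjects were measured under two treatments (group "1" and group "2"). The object names are hypothetical:

```r
# Assumption: the two groups of sleep stand in for TREAT_1 and TREAT_2.
# The rows are ordered by subject within each group, so the values pair up.
data(sleep)
treat_1 <- sleep$extra[sleep$group == "1"]
treat_2 <- sleep$extra[sleep$group == "2"]

paired_result <- t.test(x = treat_1, y = treat_2, paired = TRUE)
paired_result

# A paired t-test is the same as a one-sample t-test on the differences
diff_result <- t.test(treat_1 - treat_2, mu = 0)
all.equal(unname(paired_result$statistic), unname(diff_result$statistic))
```

Pairing only makes sense when the two vectors have the same length and the values in matching positions belong to the same subject.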