Functions: Descriptive Statistics Flashcards
mean()
mean(pirates$age)
mean “age” of pirates in this dataset
max()
max(pirates$height)
max “height” of pirates in this dataset
table()
table(pirates$sex)
generate a frequency table of the sex of pirates
## female   male  other
##    464    490     46
aggregate()
# Calculate the mean age, separately for each sex
aggregate(age ~ sex,       # formula given by position, so it works whatever this argument is named in your R version
          data = pirates,
          FUN = mean)
##      sex age
## 1 female  30
## 2   male  25
## 3  other  27
median()
quantile()
length()
The length() function takes a vector as an argument, and returns a scalar representing the number of elements in the vector
a <- 1:10
length(a) # How many elements are in a?
##[1] 10
additional numeric functions
unique()
vec <- c(1, 1, 1, 5, 1, 1, 10, 10, 10)
gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F")
unique(vec)
##[1] 1 5 10
unique(gender)
##[1] "M" "F"
unique() returns each distinct value once; it doesn’t tell you how often each of these values occurs
table()
The function table() does the same thing as unique(), but goes a step further in telling you how often each of the unique values occurs:
table(vec)
##vec
##1 5 10
##5 1 3
table(gender)
##gender
##F M
##5 4
na.rm = TRUE argument
a <- c(1, 5, NA, 2, 10)
mean(a, na.rm = TRUE)
##[1] 4.5
this argument tells R to remove the NA values before calculating; without it, mean(a) returns NA because the vector contains an NA
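A short sketch of what happens with and without na.rm = TRUE, reusing the vector from this card. The variable names mean.with.na, mean.without.na and n.missing are only illustrative.

```r
a <- c(1, 5, NA, 2, 10)

# Without na.rm, a single NA makes the whole result NA
mean.with.na <- mean(a)

# With na.rm = TRUE, the NA is dropped and the mean uses the 4 remaining values
mean.without.na <- mean(a, na.rm = TRUE)

# Count how many values are missing
n.missing <- sum(is.na(a))

print(mean.with.na)
print(mean.without.na)
print(n.missing)
```

The same argument is accepted by other summary functions such as median(), sd(), max() and sum().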
z score formula
a <- c(5, 3, 7, 5, 5, 3, 4)
a.z <- (a - mean(a)) / sd(a) # z-score formula: subtract the mean, then divide by the standard deviation
a.z
##[1] 0.31 -1.12 1.74 0.31 0.31 -1.12 -0.41
the mean of a set of z scores should be 0 (and their standard deviation should be 1)
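A quick check of that claim, using the same vector. Comparing against the built-in scale() function is an addition here, not part of the original card.

```r
a <- c(5, 3, 7, 5, 5, 3, 4)

# Standardize by hand
a.z <- (a - mean(a)) / sd(a)

# The mean should be 0 (up to floating-point rounding) and the sd should be 1
z.mean <- mean(a.z)
z.sd <- sd(a.z)

# scale() centers and scales a vector the same way; it returns a one-column matrix
a.z.scaled <- as.numeric(scale(a))

print(round(z.mean, 10))
print(z.sd)
print(all.equal(a.z, a.z.scaled))
```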
summary()
Print descriptive statistics of the piercing data
summary(american.bp)
##Min. 1st Qu. Median Mean 3rd Qu. Max.
##1.0 3.0 4.0 3.7 4.8 6.0
summary(european.bp)
##Min. 1st Qu. Median Mean 3rd Qu. Max.
##3.0 4.2 5.5 5.3 6.0 7.0
independent samples t-test code
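A minimal sketch of an independent-samples t-test with t.test(). The two vectors are illustrative values chosen to agree with the summary() output above; they are an assumption, not necessarily the original data.

```r
# Illustrative body-piercing counts for two groups (assumed values)
american.bp <- c(3, 5, 2, 1, 4, 4, 6, 3, 5, 4)
european.bp <- c(6, 5, 7, 7, 6, 3, 4, 6, 5, 4)

# Two-sided independent-samples t-test
# (by default t.test() does not assume equal variances, i.e. it runs Welch's test)
bp.test <- t.test(x = american.bp,
                  y = european.bp,
                  alternative = "two.sided")

# Printing the object shows the t statistic, df, p-value, confidence interval and group means
print(bp.test)
```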
p value definition
Assuming that the null hypothesis is true (i.e., that there is no difference between the groups), what is the probability that we would have gotten a test statistic as far away from 0 as the one we actually got?
It’s a bullshit detector aimed at the null hypothesis. If the p-value gets too small, the bullshit detector goes off.
Does the p-value tell us the probability that the null hypothesis is true?
No!!! The p-value does not tell you the probability that the null hypothesis is true. In other words, if you calculate a p-value of .04, this does not mean that the probability that the null hypothesis is true is 4%. Rather, it means that if the null hypothesis were true, the probability of obtaining a result at least as extreme as the one you got is 4%. This does set off our bullshit detector, but it still does not mean that the probability that the null hypothesis is true is 4%.
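To make the definition concrete, the sketch below recomputes a two-sided p-value by hand: it is the probability, under the null hypothesis, of a t statistic at least as far from 0 as the observed one. The vectors x and y are hypothetical data.

```r
# Hypothetical data for two groups
x <- c(3, 5, 2, 1, 4, 4, 6, 3, 5, 4)
y <- c(6, 5, 7, 7, 6, 3, 4, 6, 5, 4)

result <- t.test(x, y)

# Observed t statistic and its degrees of freedom
t.stat <- unname(result$statistic)
df <- unname(result$parameter)

# Area in both tails of the t distribution beyond |t.stat|
p.manual <- 2 * pt(-abs(t.stat), df = df)

print(p.manual)
print(result$p.value)
```

Both printed values should agree. Note that nothing in this calculation involves the probability that the null hypothesis itself is true.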
htest
R stores hypothesis tests in special object classes called htest. htest objects contain all the major results from a hypothesis test, from the test statistic (e.g., a t-statistic for a t-test, or a correlation coefficient for a correlation test), to the p-value, to a confidence interval.
different hypothesis tests need the data passed to the function in different formats (vectors, dataframes, or tables)
names()
returns the names of all of the elements in the htest object
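A small sketch of inspecting an htest object with names() and pulling out single elements with $. The age and parrots vectors are hypothetical.

```r
# Hypothetical data
age <- c(22, 25, 28, 31, 35, 40, 44, 50)
parrots <- c(1, 0, 2, 3, 2, 4, 5, 4)

# Any hypothesis test function returns an htest object; here, a correlation test
age.parrots.test <- cor.test(x = age, y = parrots)

# List the named elements stored in the object
element.names <- names(age.parrots.test)
print(element.names)

# Extract individual results by name
print(age.parrots.test$statistic)  # test statistic
print(age.parrots.test$estimate)   # correlation coefficient
print(age.parrots.test$p.value)    # p-value
print(age.parrots.test$conf.int)   # confidence interval
```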
one sample t-test
you can pull data from a df or from separate vectors; it doesn’t have to come from a table() function
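A sketch of a one-sample t-test fed from a standalone vector and from a dataframe column. The data, the dataframe name survey, and the null value mu = 5 are all assumptions made for illustration.

```r
# Hypothetical data, once as a vector and once as a dataframe column
piercings <- c(3, 5, 2, 1, 4, 4, 6, 3, 5, 4)
survey <- data.frame(piercings = piercings)

# One-sample t-test: is the mean different from 5?
test.vec <- t.test(x = piercings, mu = 5)

# Same test, pulling the column out of the dataframe with $
test.df <- t.test(x = survey$piercings, mu = 5)

print(test.vec$estimate)
print(test.vec$p.value)
print(test.df$p.value)
```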
t tests compared to each other in bar chart form
you can pull data from a df or from separate vectors; it doesn’t have to come from a table() function
Using subset to select levels of an IV
use the %in% operator inside the subset argument to specify which levels of an IV you want to test
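A sketch of that idea: the IV has three levels, and subset with %in% keeps only the two to compare. The dataframe pirates.demo and its level names "A", "B" and "C" are hypothetical.

```r
# Hypothetical dataframe: an IV (college) with three levels
pirates.demo <- data.frame(
  age = c(30, 25, 28, 35, 22, 31, 27, 29, 33, 26, 24, 32),
  college = rep(c("A", "B", "C"), times = 4)
)

# A two-sample t-test needs exactly two groups,
# so keep only the rows whose college is "A" or "B"
age.test <- t.test(age ~ college,
                   data = pirates.demo,
                   subset = college %in% c("A", "B"))

# Group means for the two selected levels
print(age.test$estimate)
```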
cor.test()
two ways to run a correlation test
To run a correlation test between two variables x and y, use the cor.test() function. You can do this in one of two ways: if x and y are columns in a dataframe, use the formula notation (formula = ~ x + y). If x and y are separate vectors (not in a dataframe), use the vector notation (x, y):
you can pull data from a df or from separate vectors; it doesn’t have to come from a table() function
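Both notations side by side, as a minimal sketch. The dataframe demo and its values are hypothetical; the formula is passed by position here.

```r
# Hypothetical data
demo <- data.frame(
  age = c(22, 25, 28, 31, 35, 40, 44, 50),
  parrots = c(1, 0, 2, 3, 2, 4, 5, 4)
)

# Method 1: formula notation, for columns in a dataframe
test.formula <- cor.test(~ age + parrots, data = demo)

# Method 2: vector notation, for separate vectors
test.vector <- cor.test(x = demo$age, y = demo$parrots)

# Both give the same correlation coefficient
print(test.formula$estimate)
print(test.vector$estimate)
```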
example correlation test
using subset() in the cor.test() function
Just like the t.test() function, we can use the subset argument in the cor.test() function to conduct a test on a subset of the entire dataframe. For example, to run the same correlation test between a pirate’s age and the number of parrots she’s owned, but only for female pirates, I can add the subset = sex == "female" argument:
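A sketch of the subset argument in cor.test(), using a small hypothetical dataframe (pirates.demo) in place of the real pirates data. The manual calculation at the end only confirms that the subset was applied.

```r
# Hypothetical stand-in for the pirates dataframe
pirates.demo <- data.frame(
  age = c(22, 25, 28, 31, 35, 40, 24, 27, 30, 33, 38, 45),
  parrots = c(1, 0, 2, 3, 2, 4, 0, 2, 1, 3, 5, 4),
  sex = rep(c("female", "male"), each = 6)
)

# Correlation between age and parrots, female pirates only
female.test <- cor.test(~ age + parrots,
                        data = pirates.demo,
                        subset = sex == "female")

# Check: the same correlation computed by hand on the female rows
female.rows <- pirates.demo$sex == "female"
manual.r <- cor(pirates.demo$age[female.rows],
                pirates.demo$parrots[female.rows])

print(female.test$estimate)
print(manual.r)
```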
chisq.test()
used to determine whether there is a significant association between two categorical variables
you must first create a table of the data (with table()) to feed to the chisq.test() function
this example has one nominal variable, and we are testing whether a pirate is equally likely to attend either school.
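A minimal sketch of a one-sample chi-square test. The vector college and its two school labels "A" and "B" are hypothetical, with counts picked to keep the arithmetic easy.

```r
# Hypothetical data: 30 pirates attended school A, 20 attended school B
college <- c(rep("A", 30), rep("B", 20))

# Step 1: turn the raw values into a frequency table
college.table <- table(college)
print(college.table)

# Step 2: feed the table to chisq.test()
# With a one-way table, the null hypothesis is that all categories are equally likely
college.test <- chisq.test(x = college.table)

print(college.test$statistic)  # (30 - 25)^2 / 25 + (20 - 25)^2 / 25 = 2
print(college.test$p.value)
```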
2 sample chisq.test()
If you want to see if the frequency of one nominal variable depends on a second nominal variable, you’d conduct a 2-sample chi-square test.
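A sketch of the two-variable case: build a two-way table from two nominal columns, then pass it to chisq.test(). The dataframe demo, its columns and its counts are hypothetical.

```r
# Hypothetical data: school attended and whether the pirate wears an eyepatch
demo <- data.frame(
  college = rep(c("A", "B"), each = 50),
  eyepatch = c(rep(c("yes", "no"), times = c(30, 20)),   # school A
               rep(c("yes", "no"), times = c(15, 35)))   # school B
)

# Two-way frequency table: rows = college, columns = eyepatch
two.way <- table(demo$college, demo$eyepatch)
print(two.way)

# Does eyepatch use depend on college?
# (for a 2 x 2 table, chisq.test() applies a continuity correction by default)
two.way.test <- chisq.test(x = two.way)
print(two.way.test)
```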
apa-style conclusions using the apa() function
you can have R take the raw results in an htest object and extract only the relevant values, formatted in APA style, using this function