Data Analytics Flashcards
What is the operator to assign objects to variables?
<-
“assign operator”
helps assign objects to variables
c() function - what is it?
combine (concatenate) - defines a vector from the values passed to it
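A small sketch of the two cards above. The names `scores` and `more` and the values are made up for illustration.

```r
# Hypothetical values, for illustration only
scores <- c(4, 8, 15)     # c() combines its arguments into one vector; <- assigns it
more   <- c(scores, 16)   # an existing vector can be combined with further values
length(more)              # number of elements in the combined vector
```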
tidy or untidy dataframe?
- tidy: each variable has exactly one column, each observation has its own row, and each value has its own cell
What is the pipe operator?
%>% - the pipe operator chains function calls so the code is more readable than nested functions; it takes whatever comes before the pipe and passes it as the first argument of the function that comes after, so the output of one function goes directly into the next
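To make the contrast with nested functions concrete, here is a minimal sketch. It assumes the magrittr package is installed (dplyr also re-exports %>%); the vector `x` is hypothetical.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)    # hypothetical data

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right; each result becomes the
# first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # both forms compute the same value
```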
What function to use for making a dataframe tidy?
pivot_longer()
–> transforms data from a wide format (where each level of a variable has its own column) into a long format (where all values of a variable are placed in a single column)
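A minimal sketch of the wide-to-long step, assuming the tidyr package. The data frame `wide` and the column names `year` and `cases` are invented for the example.

```r
library(tidyr)

# Hypothetical wide data: each year (a level of the variable "year")
# has its own column
wide <- data.frame(
  region = c("North", "South"),
  `2020` = c(10, 20),
  `2021` = c(12, 18),
  check.names = FALSE   # keep the numeric-looking column names as they are
)

# Gather the year columns into one "year" column and one "cases" column
long <- pivot_longer(
  wide,
  cols      = c(`2020`, `2021`),
  names_to  = "year",
  values_to = "cases"
)
long
```

Each region now appears once per year, so the two rows of `wide` become four rows in `long`.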
What does group_by() do?
The group_by() function in dplyr is used to group a data frame by one or more variables (columns), so that subsequent operations like summarizing, mutating, or filtering are applied to each group independently
What does summarise() do?
The summarise() function in dplyr reduces a data frame by calculating a single value for each group of data. It is typically used after group_by() to perform aggregate calculations such as the mean, sum, or count.
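A short sketch of group_by() followed by summarise(), assuming dplyr is installed. The `sales` data frame and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data
sales <- data.frame(
  region = c("North", "North", "South", "South"),
  amount = c(10, 30, 5, 15)
)

# One row per region, with the mean amount and the number of rows in each group
by_region <- sales %>%
  group_by(region) %>%
  summarise(mean_amount = mean(amount), n = n())
by_region
```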
What does mutate() do?
A function which “mutates”, i.e. adds information - computes transformations on columns: creates new columns from existing columns while keeping the original ones - To keep only the newly created columns and drop all others, use transmute()
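The difference between mutate() and transmute() can be sketched as follows, assuming dplyr. The data frame `df` and the `bmi` column are made up for illustration.

```r
library(dplyr)

# Hypothetical data
df <- data.frame(height_cm = c(150, 180), weight_kg = c(50, 81))

# mutate() keeps the existing columns and adds the new one
added <- mutate(df, bmi = weight_kg / (height_cm / 100)^2)

# transmute() keeps only the newly computed column
only_new <- transmute(df, bmi = weight_kg / (height_cm / 100)^2)

names(added)
names(only_new)
```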
Confidence interval formula
- x bar +/- z * s / sqrt(n)
- (n = sample size, s = SD of sample)
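The formula above can be worked through in base R. This is a sketch with a hypothetical sample `x`; it uses the z-value for 95% confidence as in the card. With a small sample like this one, the t-critical value qt(0.975, df = n - 1) is often used in place of z.

```r
# Hypothetical sample
x <- c(12, 15, 11, 14, 13, 16, 12, 15)

n     <- length(x)   # sample size
x_bar <- mean(x)     # sample mean
s     <- sd(x)       # sample standard deviation

z      <- qnorm(0.975)       # z-value for a 95% confidence level
margin <- z * s / sqrt(n)    # z * s / sqrt(n)

ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```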
What is the interpretation of the Confidence Interval?
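If the sampling were repeated many times and a 95% interval calculated from each sample, about 95% of those intervals would contain the true population mean mu - the interval is built around x bar, so the 95% describes how often the interval captures mu, not where x bar falls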
Parameter of the t-distribution
n - 1 (the degrees of freedom, where n = sample size)
Hypothesis testing - z-test - formula
z-test = | (x bar - mu) / (sigma / sqrt(n)) |
The two outcomes for a hypothesis test
Compare the z-test statistic with z-critical:
–> z-test <= z-critical: Fail to Reject the Null Hypothesis
–> z-test > z-critical: Reject the Null Hypothesis
–> Failing to reject H0 is not the same as accepting H0 - it simply means there isn’t strong enough evidence against H0, not that H0 is true
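A sketch of the z-test and the decision rule in base R. All numbers are hypothetical, and a two-sided test at alpha = 0.05 is assumed.

```r
# Hypothetical inputs
x_bar <- 52     # sample mean
mu    <- 50     # mean under H0
sigma <- 6      # population standard deviation (assumed known)
n     <- 36     # sample size
alpha <- 0.05   # significance level, two-sided test assumed

z_stat <- abs((x_bar - mu) / (sigma / sqrt(n)))   # z-test statistic
z_crit <- qnorm(1 - alpha / 2)                    # z-critical

decision <- if (z_stat > z_crit) "Reject H0" else "Fail to reject H0"
decision
```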
Hypothesis testing - comparing two samples - z-test (technically can also be a t-test) - to test whether two means are different or the same - i.e. H0: mu1 - mu2 = 0; H1: mu1 - mu2 is not 0 - set-up of formula for SE:
- SE = square root (s1^2/n1 + s2^2/n2)
- Use the Gaussian (normal) distribution if n1 and n2 are both > 30
Hypothesis testing - comparing two samples - t-test for comparing two sample means - formula
t-test = ((x1 bar - x2 bar) - 0) / SE = …
0 because H0 is mu1 - mu2 = 0
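The SE and t-test formulas from the two cards above, sketched in base R with two hypothetical samples. As a cross-check, the statistic is compared with the one reported by t.test(), whose default (Welch) form uses the same unpooled standard error.

```r
# Hypothetical samples
x1 <- c(5.1, 4.9, 6.2, 5.8, 5.5, 6.0)
x2 <- c(4.2, 4.8, 5.0, 4.5, 4.9, 4.4)

# SE = square root (s1^2/n1 + s2^2/n2)
se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))

# t-test = ((x1 bar - x2 bar) - 0) / SE
t_stat <- ((mean(x1) - mean(x2)) - 0) / se

# Cross-check against the built-in test
builtin <- unname(t.test(x1, x2)$statistic)
all.equal(t_stat, builtin)
```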
reorder() function
The reorder() function reorders categories (e.g. region_names) based on the numerical data provided (e.g. total_crime)
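A minimal base R sketch of reorder(), reusing the names region_names and total_crime from the card. The data frame `crime` and its values are hypothetical.

```r
# Hypothetical data
crime <- data.frame(
  region_names = c("East", "North", "West"),
  total_crime  = c(300, 100, 200)
)

# Reorder the categories by total_crime (ascending) instead of alphabetically
crime$region_names <- reorder(crime$region_names, crime$total_crime)

levels(crime$region_names)
```

In a plot, the reordered factor makes bars or axis labels appear in order of total_crime instead of alphabetical order.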
aes() function
In R, the aes() function is often used within other graphing elements to specify the desired aesthetics. The aes() function can be used in a global manner (applying to all of the graph’s elements) by nesting within ggplot()
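A sketch of global versus local use of aes(), assuming the ggplot2 package. The data frame `df` and its columns are made up.

```r
library(ggplot2)

# Hypothetical data
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3), g = c("a", "b", "a"))

p <- ggplot(df, aes(x = x, y = y)) +   # global aes(): applies to every layer
  geom_line() +
  geom_point(aes(colour = g))          # local aes(): applies to this layer only
```

Printing `p` draws the plot; the line uses only the global x and y mapping, while the points are additionally coloured by `g`.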
function qt() - explain
qt() provides the t-critical value (students may also talk about the probability that it expects, and the degrees of freedom)
It returns the value on the x-axis below which the specified probability (the area under the t-distribution) lies. pt() does the reverse: it returns the probability that x is less than a specified value.
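A short base R sketch showing that qt() and pt() undo each other. The choice of 9 degrees of freedom (a sample of n = 10) and the probability 0.975 are assumptions for illustration.

```r
df <- 9                            # degrees of freedom, n - 1 for a hypothetical n = 10

t_crit  <- qt(0.975, df = df)      # x-axis value with 97.5% of the area below it
p_below <- pt(t_crit, df = df)     # probability of falling below that value

t_crit
p_below
```

The 0.975 is used because a two-sided 95% interval leaves 2.5% in each tail.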