Data Analytics Flashcards
What is the operator to assign objects to variables?
<-
“assign operator”
The operator assigns the value on its right-hand side to the variable or object on its left-hand side.
c() function - what is it?
Concatenate - combines its arguments into a vector, e.g. c(1, 2, 3)
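A minimal sketch of the assign operator and c() working together; the variable names are made up for illustration.

```r
# Build a numeric vector with c() and assign it with <-
scores <- c(4, 8, 15)

# c() can also concatenate an existing vector with further values
more_scores <- c(scores, 16, 23)

length(more_scores)
```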
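Tidy or untidy dataframe?
- Tidy: each variable has exactly one column (not more), each observation has its own row, and each value has its own cell. If one variable is spread across several columns, the dataframe is untidy.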
What is the pipe operator?
%>% - the pipe operator chains function calls so the code is more readable than nested functions; it takes whatever comes before the pipe and passes it as the first argument of the function that comes after, so the result of one function flows directly into the next
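A small sketch comparing a nested call with the same computation written with the pipe. It assumes dplyr is installed (it makes %>% available); the vector x is made up.

```r
library(dplyr)  # attaching dplyr makes %>% available

x <- c(1, 4, 9)

# Nested version: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped version: read left to right; each result becomes
# the first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```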
What function to use for making a dataframe tidy?
pivot_longer()
–> transforms data from a wide format (where each level of a variable has its own column) into a long format (where all values of a variable are placed in a single column)
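A minimal sketch of pivot_longer(), assuming the tidyr package is installed. The data frame and the column names (country, year, cases) are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21),
  check.names = FALSE  # keep the year column names as they are
)

# Gather the year columns into a name column and a value column
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

long
```

Each country now has one row per year instead of one column per year.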
What does group_by() do?
The group_by() function in dplyr is used to group a data frame by one or more variables (columns), so that subsequent operations like summarizing, mutating, or filtering are applied to each group independently
What does summarise() do?
The summarise() function in dplyr reduces a data frame to a single summary value for each group of data. It is typically used after group_by() to perform aggregate calculations such as the mean, sum, or count.
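A short sketch of group_by() followed by summarise(), assuming dplyr is installed. The sales data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical data: two observations per region
sales <- data.frame(
  region = c("North", "North", "South", "South"),
  amount = c(10, 30, 5, 15)
)

# One row per region, with the mean amount and the row count
by_region <- sales %>%
  group_by(region) %>%
  summarise(mean_amount = mean(amount), n = n())

by_region
```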
What does mutate() do?
A function which “mutates”, i.e. adds information: it computes transformations on columns and creates new columns from existing ones while keeping the rest of the data frame. To keep only the newly computed columns and drop all others, use transmute().
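A minimal sketch contrasting mutate() and transmute(), assuming dplyr is installed. The columns (height_cm, weight_kg, bmi) are hypothetical.

```r
library(dplyr)

df <- data.frame(height_cm = c(150, 180), weight_kg = c(50, 81))

# mutate() keeps the existing columns and adds the new one
with_bmi <- df %>% mutate(bmi = weight_kg / (height_cm / 100)^2)

# transmute() keeps only the columns computed in the call
only_bmi <- df %>% transmute(bmi = weight_kg / (height_cm / 100)^2)

names(with_bmi)
names(only_bmi)
```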
Confidence interval formula
- x bar +/- z * s / sqrt(n)
- (x bar = sample mean, z = critical value for the chosen confidence level, n = sample size, s = SD of sample)
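The formula can be computed step by step in base R. This is only a sketch: the sample is simulated, and a 95% confidence level (z = qnorm(0.975)) is an assumption.

```r
set.seed(1)
x <- rnorm(50, mean = 100, sd = 15)  # simulated sample, for illustration only

n     <- length(x)    # sample size
x_bar <- mean(x)      # sample mean
s     <- sd(x)        # sample standard deviation
z     <- qnorm(0.975) # critical value for a 95% interval, about 1.96

se <- s / sqrt(n)     # standard error of the mean
ci <- c(lower = x_bar - z * se, upper = x_bar + z * se)

ci
```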
What is the interpretation of the Confidence Interval?
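If I repeated the sampling many times and calculated a 95% confidence interval each time, about 95% of those intervals would contain the true population mean (mu). It does not mean that x bar falls into the interval 95% of the time: x bar is always the centre of its own interval.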
Parameter of the t-distribution
Degrees of freedom: df = n - 1 (n = sample size)
Hypothesis testing - z-test - formula
z-test = | (x bar - mu) / (sigma / sqrt(n)) |
The two outcomes for a hypothesis test
Compare the z-test value with z-critical:
–> z-test < z-critical: Fail to Reject the Null Hypothesis / z-test > z-critical: Reject the Null Hypothesis
–> Failing to reject H0 is not the same as accepting H0 - it simply means there isn’t strong enough evidence against H0, not that H0 is true
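A worked sketch of the z-test decision in base R. All numbers (x_bar, mu0, sigma, n) and the 5% two-sided significance level are assumptions chosen for illustration.

```r
x_bar <- 103   # hypothetical sample mean
mu0   <- 100   # mean under H0
sigma <- 15    # population SD, assumed known
n     <- 100   # sample size
alpha <- 0.05  # two-sided significance level

# Test statistic: |(x bar - mu) / (sigma / sqrt(n))| = |3 / 1.5| = 2
z_stat <- abs((x_bar - mu0) / (sigma / sqrt(n)))

# Two-sided critical value, about 1.96
z_crit <- qnorm(1 - alpha / 2)

decision <- if (z_stat > z_crit) "Reject H0" else "Fail to reject H0"
decision
```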
Hypothesis testing - comparing two samples - z-test (technically can also be a t-test) - to test whether two means are different or the same - i.e. H0: mu1 - mu2 = 0; H1: mu1 - mu2 is not 0 - set-up of formula for SE:
- SE = sqrt(s1^2/n1 + s2^2/n2)
- Use the Gaussian (normal) distribution if n1 and n2 are both > 30; otherwise use the t-distribution
Hypothesis testing - comparing two samples - t-test for comparing two sample means - formula
t-test = ((x1 bar - x2 bar) - 0) / SE = …
0 because H0 is mu1 - mu2 = 0
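A sketch of the two-sample calculation from summary statistics in base R. The means, SDs and sample sizes are invented. Because both samples are larger than 30, the critical value is taken from the normal distribution.

```r
# Hypothetical summary statistics for the two samples
x1_bar <- 52; s1 <- 10; n1 <- 40
x2_bar <- 48; s2 <- 12; n2 <- 36

# Standard error of the difference: sqrt(100/40 + 144/36) = sqrt(6.5)
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Test statistic; the 0 is the difference stated in H0
t_stat <- ((x1_bar - x2_bar) - 0) / se

# Two-sided critical value at the 5% level (normal, since n1 and n2 > 30)
z_crit <- qnorm(0.975)

reject_h0 <- abs(t_stat) > z_crit
reject_h0
```

With these numbers the statistic is about 1.57, which is below 1.96, so H0 is not rejected.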
reorder() function
The reorder() function reorders categories (e.g. region_names) based on the numerical data provided (e.g. total_crime)
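A minimal sketch of reorder() in base R, reusing the names region_names and total_crime from the card. The values are made up.

```r
crime <- data.frame(
  region_names = c("East", "North", "West"),
  total_crime  = c(300, 100, 200)
)

# Turn region_names into a factor whose levels are ordered by total_crime
crime$region_names <- reorder(crime$region_names, crime$total_crime)

levels(crime$region_names)
```

The levels now run from the lowest to the highest total_crime, which is the order a plot would use for the axis.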
aes() function
In ggplot2, the aes() function maps variables in the data to aesthetics (x, y, colour, size, ...) and is often used within other graphing elements (geoms). The aes() function can also be used in a global manner (applying to all of the graph’s elements) by nesting it within ggplot().
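A short sketch of global versus layer-specific aesthetics, assuming ggplot2 is installed. It uses the built-in mtcars data set; the choice of variables is arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # global: applies to every layer
  geom_point(aes(colour = factor(cyl))) +     # local: applies to the points only
  geom_smooth(method = "lm", se = FALSE)      # inherits x and y from ggplot()

# print(p) draws the plot
```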
function qt() - explain
qt() provides the t-critical value (students may also talk about the probability that it expects, and the degrees of freedom)
It returns the value on the x-axis below which the specified probability (the area under the t-distribution) lies. pt() does the reverse: it returns the probability that a t-distributed variable is less than a specified value.
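A quick sketch showing that qt() and pt() are inverses of each other. The degrees of freedom (24, i.e. n = 25) and the probability 0.975 are assumptions for illustration.

```r
df <- 24  # degrees of freedom, n - 1 for a hypothetical sample of 25

# t-critical value with 97.5% of the area to its left (about 2.064)
t_crit <- qt(0.975, df = df)

# pt() goes the other way: from the t value back to the probability
p_back <- pt(t_crit, df = df)

t_crit
p_back
```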
Explain what the :: operator does
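The :: operator is written package::function, i.e. from the stated package, use the stated function. It works without attaching the package via library() and makes clear which package a function comes from.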
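A minimal sketch of the :: operator, assuming dplyr is installed. filter() is a useful example because the stats package also has a function with that name.

```r
# Call filter() from dplyr explicitly, without library(dplyr)
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() is a different function (linear filtering of time series),
# so the prefix removes any ambiguity about which one is meant
nrow(four_cyl)
```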
Explain what the functions pnorm()/xpnorm() and qnorm() do
pnorm() returns the probability to the left of a specific z value in a Standard Normal distribution, i.e. 𝑃(𝑍 < 𝑧) for a normally distributed standardized random variable Z; xpnorm() (from the mosaic package) additionally reports the probability to the right, 𝑃(𝑍 > 𝑧), and draws the curve - pnorm(1.96) ≈ 0.975
qnorm() goes the other way: for a given probability it returns the corresponding z value such that 𝑃(𝑍 < 𝑧) equals that probability - qnorm(.975, mean=0, sd=1) ≈ 1.96
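The values quoted above can be checked in base R. This sketch uses only pnorm() and qnorm(); the right-tail probability is obtained with lower.tail = FALSE rather than xpnorm().

```r
p_left  <- pnorm(1.96)                      # P(Z < 1.96), about 0.975
p_right <- pnorm(1.96, lower.tail = FALSE)  # P(Z > 1.96), about 0.025

# qnorm() inverts pnorm(): from probability back to the z value
z <- qnorm(0.975, mean = 0, sd = 1)         # about 1.96

c(p_left = p_left, p_right = p_right, z = z)
```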
xxx_join and full_join(A,B)
full_join(A,B) - It joins tibble A and tibble B on the column(s) they have in common (here a column called colName); when no by argument is given, the common column is detected automatically. The result contains all rows from both tibbles, with matching rows combined. Where a row has no match in the other tibble, the missing values are filled with NA.
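A small sketch of full_join(), assuming dplyr is installed. The tibbles A and B and their contents are invented. The by argument is given explicitly here, although dplyr would also detect the shared column on its own.

```r
library(dplyr)

A <- tibble(colName = c(1, 2, 3), a = c("a1", "a2", "a3"))
B <- tibble(colName = c(2, 3, 4), b = c("b2", "b3", "b4"))

# Keep every row from both tibbles; unmatched cells become NA
joined <- full_join(A, B, by = "colName")

joined
```

Keys 2 and 3 appear in both tibbles and are combined; key 1 has NA in column b and key 4 has NA in column a.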
Discuss the plots that allow you to check whether assumptions of a Linear Regression model have been satisfied - and which R function is that?
The function name is autoplot(). There are 3 plots for testing the residual behavior and they are:
- Residuals vs Fitted: shows if residuals are randomly scattered (have an average of zero). This is shown with the blue line which should be horizontal at 0.
- Normal Q-Q plot: points should be located on the diagonal, which would mean the residuals are normally distributed.
- Scale-Location: should have a horizontal blue line, which would mean we have a case of constant variance of the residuals across the fitted values.
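A sketch of how these diagnostic plots might be produced. It assumes the ggfortify package, which supplies an autoplot() method for lm objects (the card does not name the package). The model mpg ~ wt on the built-in mtcars data is only an example.

```r
library(ggplot2)
library(ggfortify)  # assumption: provides autoplot() for lm objects

# Example model on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

# which = 1:3 requests Residuals vs Fitted, Normal Q-Q and Scale-Location
diag_plots <- autoplot(fit, which = 1:3)

# print(diag_plots) draws the three panels
```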