final exam Flashcards
what do we use when we want to test whether two categorical variables are related to each other (i.e., not independent)
two-way chi-square
formula for chi-square
X2 = (Observed − Expected)^2 / Expected, summed over every cell
when you run the formula for X2 (chi-square) and you get the final number, what is that called
test statistic
degrees of freedom for two-way chi-squared
(number of rows minus 1) * (number of columns – 1)
what do you plug into R-Studio to check p-value for two way chi-square
pchisq(X2, df = df, lower.tail = FALSE)
then check whether the p-value is greater or less than 0.05. If it’s lower, you reject the null hypothesis.
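Putting the last few cards together, a self-contained R sketch using a made-up 2x2 table (all counts are hypothetical):

```r
# Hypothetical 2x2 table of observed counts
observed <- matrix(c(30, 20,
                     10, 40), nrow = 2, byrow = TRUE)

# Expected count for each cell under the null: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic: sum of (Observed - Expected)^2 / Expected over every cell
X2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value from the upper tail of the chi-square distribution
p <- pchisq(X2, df = df, lower.tail = FALSE)
```

Note that `chisq.test(observed, correct = FALSE)` reproduces the same X2 and p-value in one call.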
What does “power” mean?
it’s the probability that a test will flag a particular difference (effect) as statistically significant when that difference really exists; power = 1 − β
what variable are you trying to find in power analysis?
the sample size (n) you need for the test to reject the null hypothesis, i.e., for a real effect of the hypothesized size to come out statistically significant
Why do you want to aim for ~ 80% for power?
you want an 80% probability of rejecting the null hypothesis when testing at the 95% confidence level (α = 0.05)
when we want an area to be 80%, the z value that corresponds with that is
qnorm(0.8) ≈ 0.84
formula for SE
SE = σ / √n (the population standard deviation divided by the square root of the sample size)
what are we trying to minimize in power analysis?
Type II error (β): This error occurs when the test fails to reject the null hypothesis when it is actually false. To minimize this, we aim to decrease the overlap between the two distributions (the H0 and Ha curves)
what are we trying to maximize through power analysis?
Distance between the null hypothesis and alternative hypothesis distributions
Maximizing the area of the alternative distribution that falls in the rejection region of the test (where we reject the null hypothesis) increases the chances of detecting a true effect.
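The ideas above can be sketched as a sample-size calculation in R for a one-sided test of a mean; sigma (the assumed SD) and delta (the smallest difference worth detecting) are made-up numbers:

```r
alpha <- 0.05   # significance level (95% confidence)
power <- 0.80   # desired power, i.e., 1 - Type II error rate
sigma <- 10     # assumed population standard deviation (hypothetical)
delta <- 3      # smallest difference we want to detect (hypothetical)

z_alpha <- qnorm(1 - alpha)   # cutoff under the null distribution
z_beta  <- qnorm(power)       # qnorm(0.8), about 0.84

# delta must span (z_alpha + z_beta) standard errors, where SE = sigma / sqrt(n);
# solving for n gives:
n <- ((z_alpha + z_beta) * sigma / delta)^2
ceiling(n)   # round up to a whole number of observations
```

A larger delta or smaller sigma shrinks the required n, since the two distributions overlap less.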
what does ANOVA stand for?
ANalysis Of VAriance
What is ANOVA used for?
ANOVA is used to determine if a numeric variable differs across different groups
What are the three assumptions for ANOVA?
The observations are independent within and across groups
The data within each group are nearly normal
The variability across the groups is about equal
Hypothesis testing for two-way chi-squared and ANOVA – what null are they testing?
chi-square Ho = the two variables are independent; any difference between the observed and expected counts is due to random chance
ANOVA Ho = all group means are equal
μ_x = μ_y = μ_c
when we reject the null hypothesis in ANOVA it doesn’t mean every mean is different from one another it just means
that there is at least one difference.
Mean Squared Between Groups (MSG) in ANOVA
conceptually represents the amount of variation between groups (how much group means deviate from the overall mean).
If it’s high, there’s more of a difference. If it’s low, the group means are similar to the overall mean - the groups aren’t different.
Mean Square Error (MSE) in ANOVA
measures how much the data points within each group deviate from their respective group mean. This is an estimate of the variance within the groups.
how do you calculate test statistic (F-Value) in Anova?
F = MSG/MSE
F distributions (ANOVA) take on two degrees of freedom, which are
DF1 = k – 1 (k = number of groups)
DF2 = n – k (n = total number of observations)
(ANOVA) on R studio, how do you calculate the mean and standard deviation for particular groups within the dataset?
dataset %>%
group_by(pos_group) %>%
summarise(mean = mean(variable1, na.rm = TRUE), sd = sd(variable1, na.rm = TRUE))
(requires the dplyr package; variable1 stands in for the numeric variable of interest)
(ANOVA) on R studio, how do you create multiple histograms for the particular groups of the dataset?
ggplot(dataset, aes(x = variable1)) + geom_histogram() + facet_wrap(~pos_group)
how do you do an ANOVA analysis on R-studio
anova <- aov( variable1 ~ pos_group, data = dataset)
summary(anova)
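A self-contained run on PlantGrowth, a dataset that ships with base R (numeric weight across a control group and two treatments), standing in for the dataset and pos_group placeholders above:

```r
# Fit the ANOVA: does mean plant weight differ across the three groups?
anova_fit <- aov(weight ~ group, data = PlantGrowth)
summary(anova_fit)   # shows Df, Sum Sq, Mean Sq (MSG and MSE), F value, Pr(>F)

# Pull MSG and MSE out of the summary table and recompute F by hand
tab <- summary(anova_fit)[[1]]
MSG <- tab[1, "Mean Sq"]   # mean square between groups
MSE <- tab[2, "Mean Sq"]   # mean square error (within groups)
F_by_hand <- MSG / MSE     # matches the F value in the table
```

With k = 3 groups and n = 30 plants, the Df column shows k − 1 = 2 and n − k = 27.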
how do you examine the structure of a dataset in R
str(dataset)
how do you look at the number of observation in dataset
nrow(dataset)
how do you look at the number of variables in dataset
ncol(dataset)
how do you look at the mean of a specific variable but you want to eliminate the missing values?
mean(dataset$variable1, na.rm = TRUE)
Pearson’s R/Correlation
a single number that describes the strength and direction of a linear relationship between two numeric variables.
Ranges from -1 to 1
coefficient of determination
the square of the correlation coefficient (r²). This represents how well the linear model
predicts the data, ranging from 0 to 1 (the share of the variation in one variable that is explained by the other)
what minimizes the RMSE (root mean square error) and what is that line called?
with no predictor, the single value that minimizes RMSE is the mean of the data points; with a predictor, the line that minimizes RMSE is called the “best fit line”
Why is it called OLS regression (ordinary least squares)?
because it finds the line with the least (smallest possible) sum of squared residuals, which is equivalent to minimizing the root mean squared error
What are the four assumptions of linear regression?
Linearity – the data have a linear trend
Nearly normal residuals
Constant variability
Independent observations
RMSE
a measure of how well a predictive model’s predictions match the actual data. It quantifies the average magnitude of the residuals
how to create a scatter plot with a regression in R
ggplot(dataset, aes(x = independent, y = dependent)) + geom_point() + geom_smooth(method = "lm")
how to construct an OLS regression model in R
lm(dependent ~ independent, data = dataset)
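A concrete run on the built-in mtcars data (mpg as the dependent variable, wt as the independent), which also confirms that the coefficient of determination is the squared correlation:

```r
# Fit the OLS model: predict miles per gallon from car weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)   # coefficients, residuals, R-squared

r  <- cor(mtcars$wt, mtcars$mpg)   # Pearson's r, between -1 and 1
r2 <- summary(fit)$r.squared       # coefficient of determination
# For simple regression, r^2 equals the model's R-squared
```

Here r is negative (heavier cars get fewer miles per gallon), but squaring it gives the same R² the model reports.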
when do you use chi-square, anova, and OLS regression
categorical - categorical = chi-square
independent variable is categorical and the dependent is numeric = ANOVA
numeric - numeric = OLS Regression
High leverage
in regression analysis, a point with a very high or low value on the independent variable (X) is said to have “high leverage”
Influential point
a data point that has an extreme value on both the independent (X) and dependent (Y) variables such that its inclusion meaningfully changes the regression line
How do you deal with outliers? (3)
- Check to make sure they are real and not data entry errors
- Don’t automatically exclude them – sometimes outliers can hold important insights
- be cautious about interpreting a relationship if it’s solely the product of an outlier
How do you calculate RMSE? (3 steps)
- find the sum of squared errors (SSE) by summing the squared distance between each point and the prediction
- find the mean squared error (MSE) by dividing the SSE by the number of points
- find the root by square-rooting that mean
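The three steps in R, using the residuals of a model fit on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- mtcars$mpg - predict(fit)   # distance from each point to the line

sse  <- sum(res^2)           # step 1: sum the squared distances
mse  <- sse / nrow(mtcars)   # step 2: mean of the squared distances
rmse <- sqrt(mse)            # step 3: square root of that mean
```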
Binary variables can take on exactly…
two values, 1 or 0
In R, TRUE or FALSE
With binary variables, what’s the difference between when x = 0 and when x = 1
x = 0: y = b, i.e., y = B0 (the intercept alone)
x = 1: y = m(1) + b, i.e., y = B0 + B1 (intercept plus slope)
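A quick check in R using mtcars$am, a built-in 0/1 variable (0 = automatic, 1 = manual transmission): the intercept is the predicted y when x = 0, and intercept plus slope is the predicted y when x = 1.

```r
fit <- lm(mpg ~ am, data = mtcars)   # am is already coded 0/1

b0 <- coef(fit)[1]   # B0: prediction when am = 0
b1 <- coef(fit)[2]   # B1: the amount added when am = 1

mean_x0 <- mean(mtcars$mpg[mtcars$am == 0])   # equals b0
mean_x1 <- mean(mtcars$mpg[mtcars$am == 1])   # equals b0 + b1
```

So with a binary predictor, the regression just reproduces the two group means.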
what are the three criteria for causality again?
- plausibility
- time order
- non-spuriousness
Multiple regression
is a way of mathematically controlling for variables to eliminate spuriousness
how would you turn this regression equation into a multiple regression equation?
crime = Bo + B1 * icecream eating + E
crime = Bo + B1 * icecream eating + B2 * summer + E
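A sketch of why adding the control works, using simulated data (every number here is made up): summer drives both ice cream eating and crime, so the bivariate slope on ice cream looks positive, but it collapses toward zero once summer is held constant:

```r
set.seed(1)
# Simulated stand-in data: summer causes both variables, ice cream causes nothing
summer   <- rep(c(0, 1), each = 50)
icecream <- 2 * summer + rnorm(100)
crime    <- 3 * summer + rnorm(100)
dataset  <- data.frame(crime, icecream, summer)

bivariate <- lm(crime ~ icecream, data = dataset)            # spurious relationship
multiple  <- lm(crime ~ icecream + summer, data = dataset)   # controls for summer

coef(bivariate)["icecream"]   # clearly positive
coef(multiple)["icecream"]    # near zero once summer is in the model
```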