final exam Flashcards
what do we use when we want to test whether two categorical variables are related to each other (i.e., not independent)
two-way chi-square
formula for chi-square
X2 = (Observed − Expected)^2 / Expected, summed over every cell
when you run the formula for X2 (chi-square) and you get the final number, what is that called
test statistic
degrees of freedom for two-way chi-squared
(number of rows minus 1) * (number of columns – 1)
what do you plug into R-Studio to check p-value for two way chi-square
pchisq(X2, df = df, lower.tail = FALSE)
then check whether the p-value is greater or less than 0.05. If it’s lower, you reject the null hypothesis.
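Putting the last few cards together, a self-contained R sketch using a made-up 2x2 table (all counts are hypothetical):

```r
# Hypothetical 2x2 table of observed counts
observed <- matrix(c(30, 20,
                     10, 40), nrow = 2, byrow = TRUE)

# Expected count for each cell under the null: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic: sum of (Observed - Expected)^2 / Expected over every cell
X2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value from the upper tail of the chi-square distribution
p <- pchisq(X2, df = df, lower.tail = FALSE)
```

Note that `chisq.test(observed, correct = FALSE)` reproduces the same X2 and p-value in one call.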
What does “power” mean?
it’s the probability that a test will flag a particular difference (effect) as statistically significant when that difference really exists; power = 1 − β
what variable are you trying to find in power analysis?
the sample size (n) you need for the test to reject the null hypothesis, i.e., for a real effect of the hypothesized size to come out statistically significant
Why do you want to aim for ~ 80% for power?
you want an 80% probability of rejecting the null hypothesis when testing at the 95% confidence level (α = 0.05)
when we want an area to be 80%, the z value that corresponds with that is
qnorm(0.8) ≈ 0.84
formula for SE
SE = σ / √n (the population standard deviation divided by the square root of the sample size)
what are we trying to minimize in power analysis?
Type II error (β): This error occurs when the test fails to reject the null hypothesis when it is actually false. To minimize this, we aim to decrease the overlap between the two distributions (the H0 and Ha curves)
what are we trying to maximize through power analysis?
Distance between the null hypothesis and alternative hypothesis distributions
Maximizing the area of the alternative distribution that falls in the rejection region of the test (where we reject the null hypothesis) increases the chances of detecting a true effect.
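The ideas above can be sketched as a sample-size calculation in R for a one-sided test of a mean; sigma (the assumed SD) and delta (the smallest difference worth detecting) are made-up numbers:

```r
alpha <- 0.05   # significance level (95% confidence)
power <- 0.80   # desired power, i.e., 1 - Type II error rate
sigma <- 10     # assumed population standard deviation (hypothetical)
delta <- 3      # smallest difference we want to detect (hypothetical)

z_alpha <- qnorm(1 - alpha)   # cutoff under the null distribution
z_beta  <- qnorm(power)       # qnorm(0.8), about 0.84

# delta must span (z_alpha + z_beta) standard errors, where SE = sigma / sqrt(n);
# solving for n gives:
n <- ((z_alpha + z_beta) * sigma / delta)^2
ceiling(n)   # round up to a whole number of observations
```

A larger delta or smaller sigma shrinks the required n, since the two distributions overlap less.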
what does ANOVA stand for?
ANalysis Of VAriance
What is ANOVA used for?
ANOVA is used to determine if a numeric variable differs across different groups
What are the three assumptions for ANOVA?
The observations are independent within and across groups
The data within each group are nearly normal
The variability across the groups is about equal
Hypothesis testing for two-way chi-squared and ANOVA – what null are they testing?
chi-square Ho = the two variables are independent; any difference between the observed and expected counts is due to random chance
ANOVA Ho = all group means are equal
μ_x = μ_y = μ_c
when we reject the null hypothesis in ANOVA it doesn’t mean every mean is different from one another it just means
that there is at least one difference.
Mean Squared Between Groups (MSG) in ANOVA
conceptually represents the amount of variation between groups (how much group means deviate from the overall mean).
If it’s high, there’s more of a difference. If it’s low, the group means are similar to the overall mean - the groups aren’t different.
Mean Square Error (MSE) in ANOVA
measures how much the data points within each group deviate from their respective group mean. This is an estimate of the variance within the groups.
how do you calculate test statistic (F-Value) in Anova?
F = MSG/MSE
F distributions (ANOVA) take on two degrees of freedom, which are
DF1 = k – 1 (k = number of groups)
DF2 = n – k (n = total number of observations)
(ANOVA) on R studio, how do you calculate the mean and standard deviation for particular groups within the dataset?
dataset %>%
group_by(pos_group) %>%
summarise(mean = mean(variable1, na.rm = TRUE), sd = sd(variable1, na.rm = TRUE))
(requires the dplyr package; variable1 stands in for the numeric variable of interest)
(ANOVA) on R studio, how do you create multiple histograms for the particular groups of the dataset?
ggplot(dataset, aes(x = variable1)) + geom_histogram() + facet_wrap(~pos_group)
how do you do an ANOVA analysis on R-studio
anova <- aov( variable1 ~ pos_group, data = dataset)
summary(anova)
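A self-contained run on PlantGrowth, a dataset that ships with base R (numeric weight across a control group and two treatments), standing in for the dataset and pos_group placeholders above:

```r
# Fit the ANOVA: does mean plant weight differ across the three groups?
anova_fit <- aov(weight ~ group, data = PlantGrowth)
summary(anova_fit)   # shows Df, Sum Sq, Mean Sq (MSG and MSE), F value, Pr(>F)

# Pull MSG and MSE out of the summary table and recompute F by hand
tab <- summary(anova_fit)[[1]]
MSG <- tab[1, "Mean Sq"]   # mean square between groups
MSE <- tab[2, "Mean Sq"]   # mean square error (within groups)
F_by_hand <- MSG / MSE     # matches the F value in the table
```

With k = 3 groups and n = 30 plants, the Df column shows k − 1 = 2 and n − k = 27.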
how do you examine the structure of a dataset in R
str(dataset)
how do you look at the number of observation in dataset
nrow(dataset)
how do you look at the number of variables in dataset
ncol(dataset)
how do you look at the mean of a specific variable but you want to eliminate the missing values?
mean(dataset$variable1, na.rm = TRUE)
Pearson’s R/Correlation
a single number that describes the strength and direction of a linear relationship between two numeric variables.
Ranges from -1 to 1
coefficient of determination
the square of the correlation coefficient (r²). This represents how well the linear model
predicts the data, ranging from 0 to 1 (the share of the variation in one variable that is explained by the other)
what minimizes the RMSE (root mean square error) and what is that line called?
with no predictor, the single value that minimizes RMSE is the mean of the data points; with a predictor, the line that minimizes RMSE is called the “best fit line”
Why is it called OLS regression (ordinary least squares)?
because it finds the line with the least (smallest possible) sum of squared residuals, which is equivalent to minimizing the root mean squared error
What are the four assumptions of linear regression?
Linearity – the data have a linear trend
Nearly normal residuals
Constant variability
Independent observations
RMSE
a measure of how well a predictive model’s predictions match the actual data. It quantifies the average magnitude of the residuals
how to create a scatter plot with a regression in R
ggplot(dataset, aes(x = independent, y = dependent)) + geom_point() + geom_smooth(method = "lm")
how to construct an OLS regression model in R
lm(dependent ~ independent, data = dataset)
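A concrete run on the built-in mtcars data (mpg as the dependent variable, wt as the independent), which also confirms that the coefficient of determination is the squared correlation:

```r
# Fit the OLS model: predict miles per gallon from car weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)   # coefficients, residuals, R-squared

r  <- cor(mtcars$wt, mtcars$mpg)   # Pearson's r, between -1 and 1
r2 <- summary(fit)$r.squared       # coefficient of determination
# For simple regression, r^2 equals the model's R-squared
```

Here r is negative (heavier cars get fewer miles per gallon), but squaring it gives the same R² the model reports.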
when do you use chi-square, anova, and OLS regression
categorical - categorical = chi-square
independent variable is categorical and the dependent is numeric = ANOVA
numeric - numeric = OLS Regression
High leverage
in regression analysis, a point with a very high or low value on the independent variable (X) is said to have “high leverage”
Influential point
a data point that has an extreme value on both the independent (X) and dependent (Y) variables such that its inclusion meaningfully changes the regression line
How do you deal with outliers? (3)
- Check to make sure they are real and not data entry errors
- Don’t automatically exclude them – sometimes outliers can hold important insights
- be cautious about interpreting a relationship if it’s solely the product of an outlier
How do you calculate RMSE? (3 steps)
- find the sum of squared errors (SSE) by summing the squared distance between each point and the prediction
- find the mean squared error (MSE) by dividing the SSE by the number of points
- find the root by square-rooting that mean
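The three steps in R, using the residuals of a model fit on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- mtcars$mpg - predict(fit)   # distance from each point to the line

sse  <- sum(res^2)           # step 1: sum the squared distances
mse  <- sse / nrow(mtcars)   # step 2: mean of the squared distances
rmse <- sqrt(mse)            # step 3: square root of that mean
```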
Binary variables can take on exactly…
two values, 1 or 0
In R, TRUE or FALSE
With binary variables, what’s the difference between when x = 0 and when x = 1
x = 0: y = b, i.e., y = B0 (the intercept alone)
x = 1: y = m(1) + b, i.e., y = B0 + B1 (intercept plus slope)
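A quick check in R using mtcars$am, a built-in 0/1 variable (0 = automatic, 1 = manual transmission): the intercept is the predicted y when x = 0, and intercept plus slope is the predicted y when x = 1.

```r
fit <- lm(mpg ~ am, data = mtcars)   # am is already coded 0/1

b0 <- coef(fit)[1]   # B0: prediction when am = 0
b1 <- coef(fit)[2]   # B1: the amount added when am = 1

mean_x0 <- mean(mtcars$mpg[mtcars$am == 0])   # equals b0
mean_x1 <- mean(mtcars$mpg[mtcars$am == 1])   # equals b0 + b1
```

So with a binary predictor, the regression just reproduces the two group means.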
what are the three criteria for causality again?
- plausibility
- time order
- non-spuriousness
Multiple regression
is a way of mathematically controlling for variables to eliminate spuriousness
how would you turn this regression equation into a multiple regression equation?
crime = Bo + B1 * icecream eating + E
crime = Bo + B1 * icecream eating + B2 * summer + E
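A sketch of why adding the control works, using simulated data (every number here is made up): summer drives both ice cream eating and crime, so the bivariate slope on ice cream looks positive, but it collapses toward zero once summer is held constant:

```r
set.seed(1)
# Simulated stand-in data: summer causes both variables, ice cream causes nothing
summer   <- rep(c(0, 1), each = 50)
icecream <- 2 * summer + rnorm(100)
crime    <- 3 * summer + rnorm(100)
dataset  <- data.frame(crime, icecream, summer)

bivariate <- lm(crime ~ icecream, data = dataset)            # spurious relationship
multiple  <- lm(crime ~ icecream + summer, data = dataset)   # controls for summer

coef(bivariate)["icecream"]   # clearly positive
coef(multiple)["icecream"]    # near zero once summer is in the model
```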