final exam Flashcards

1
Q

what do we use when we want to test whether two categorical variables differ from each other (are associated)?

A

two-way chi-square

2
Q

formula for chi-square

A

X² = SUM OF (Observed − Expected)² / Expected, summed over all cells
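
A quick worked example of the formula, using made-up counts (in Python for illustration; the course uses R, but the arithmetic is the same):

```python
# Hypothetical observed and expected counts for four cells
observed = [30, 20, 25, 25]
expected = [25, 25, 25, 25]

# Chi-square statistic: sum of (Observed - Expected)^2 / Expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)  # 2.0
```

Here the two off-by-5 cells each contribute 25/25 = 1, so X² = 2.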

3
Q

when you run the formula for X2 (chi-square) and you get the final number, what is that called

A

test statistic

4
Q

degrees of freedom for two-way chi-squared

A

df = (number of rows − 1) × (number of columns − 1)
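
A tiny check of the formula with a hypothetical table size:

```python
# Hypothetical 3x4 contingency table
rows, cols = 3, 4
df = (rows - 1) * (cols - 1)
print(df)  # 6
```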

5
Q

what do you plug into RStudio to check the p-value for a two-way chi-square

A

pchisq(X2, df = x, lower.tail = FALSE)

then check if the p-value is greater or less than 0.05. If it’s lower, you reject the null hypothesis.

6
Q

What does “power” mean?

A

it’s the probability that a test will find a particular (true) difference statistically significant

7
Q

what variable are you trying to find in power analysis?

A

the sample size (n) you need for the test to reject the null hypothesis, i.e., for the findings to come out statistically significant

8
Q

Why do you want to aim for ~ 80% for power?

A

you want an 80% probability of rejecting the null hypothesis at 95% confidence
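
One common way these two numbers enter a sample-size calculation, sketched under assumed values for a two-sided z test (sigma and delta below are hypothetical):

```python
import math
from statistics import NormalDist

# z for 95% confidence (two-sided) and for 80% power
z_alpha = NormalDist().inv_cdf(0.975)  # about 1.96
z_beta = NormalDist().inv_cdf(0.80)    # about 0.84

# Hypothetical sd and smallest difference we want to detect
sigma, delta = 15.0, 5.0

# Sketch of the sample-size formula n = ((z_alpha + z_beta) * sigma / delta)^2
n = ((z_alpha + z_beta) * sigma / delta) ** 2
print(math.ceil(n))  # 71
```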

9
Q

when we want an area to be 80%, the z value that corresponds with that is

A

qnorm(0.8) ≈ 0.84
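
The same lookup in Python's standard library, as a cross-check (inv_cdf is the analogue of R's qnorm):

```python
from statistics import NormalDist

# z value with 80% of the standard normal area below it
z = NormalDist().inv_cdf(0.8)
print(round(z, 2))  # 0.84
```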

10
Q

formula for SE

A

SE = σ / √n (the population standard deviation divided by the square root of the sample size)
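
With hypothetical numbers:

```python
import math

# Standard error of the mean: sigma divided by the square root of n
sigma, n = 10.0, 25
se = sigma / math.sqrt(n)
print(se)  # 2.0
```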

11
Q

what are we trying to minimize in power analysis?

A

Type II error (β): This error occurs when the test fails to reject the null hypothesis when it is actually false. To minimize this, we aim to decrease the overlap between the two distributions (the H0 and Ha curves)

12
Q

what are we trying to maximize through power analysis?

A

Distance between the null hypothesis and alternative hypothesis distributions

Maximizing the area of the alternative distribution that falls in the rejection region of the test (where we reject the null hypothesis) increases the chances of detecting a true effect.

13
Q

what does ANOVA stand for?

A

ANalysis Of VAriance

14
Q

What is ANOVA used for?

A

ANOVA is used to determine if a numeric variable differs across different groups

15
Q

What are the three assumptions for ANOVA?

A

The observations are independent within and across groups

The data within each group are nearly normal

The variability across the groups is about equal

16
Q

Hypothesis testing for two-way chi-squared and ANOVA – what null are they testing?

A

chi-square Ho = there is no meaningful difference between the observed and expected values (any difference is due to random chance)

ANOVA Ho = all group means are equal:
μ₁ = μ₂ = … = μₖ

17
Q

when we reject the null hypothesis in ANOVA it doesn’t mean every mean is different from one another it just means

A

that there is at least one difference.

18
Q

Mean Squared Between Groups (MSG) in ANOVA

A

conceptually represents the amount of variation between groups (how much group means deviate from the overall mean).

If it’s high, there’s more of a difference. If it’s low, the group means are similar to the overall mean - the groups aren’t different.

19
Q

Mean Square Error (MSE) in ANOVA

A

measures how much the data points within each group deviate from their respective group mean. This is an estimate of the variance within the groups.

20
Q

how do you calculate test statistic (F-Value) in Anova?

A

F = MSG/MSE

21
Q

F distributions (ANOVA) take on two degrees of freedom, which are

A

DF1 = k − 1 (k = number of groups)
DF2 = n − k (n = total number of observations)
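
Putting the F statistic and its two degrees of freedom together with hypothetical values:

```python
# Hypothetical ANOVA with k groups and n total observations
k, n = 3, 30
msg, mse = 48.0, 12.0  # hypothetical mean squares

f_value = msg / mse    # the F test statistic
df1 = k - 1            # between-groups df
df2 = n - k            # within-groups (error) df
print(f_value, df1, df2)  # 4.0 2 27
```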

22
Q

(ANOVA) in RStudio, how do you calculate the mean and standard deviation for particular groups within the dataset?

A

dataset %>%
  group_by(pos_group) %>%
  summarize(mean = mean(variable1, na.rm = TRUE),
            sd = sd(variable1, na.rm = TRUE))

23
Q

(ANOVA) in RStudio, how do you create multiple histograms for the particular groups of the dataset?

A

ggplot(dataset, aes(x = variable1)) +
  geom_histogram() +
  facet_wrap(~pos_group)

24
Q

how do you run an ANOVA in RStudio

A

anova <- aov(variable1 ~ pos_group, data = dataset)
summary(anova)

25
Q

how do you examine the structure of a dataset in R

A

str(dataset)

26
Q

how do you look at the number of observations in a dataset

A

nrow(dataset)

27
Q

how do you look at the number of variables in a dataset

A

ncol(dataset)

28
Q

how do you take the mean of a specific variable while excluding missing values?

A

mean(dataset$variable1, na.rm = TRUE)

29
Q

Pearson’s R/Correlation

A

a single number describing the strength and direction of a linear relationship.

Ranges from -1 to 1

30
Q

coefficient of determination

A

the square of the correlation coefficient (R). It represents how well the linear model predicts the data, ranging from 0 to 1 (how much of the variation in one variable is explained by the other)
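
A one-line illustration with a hypothetical correlation:

```python
# Squaring a hypothetical correlation of 0.9
r = 0.9
r_squared = r ** 2
print(round(r_squared, 2))  # 0.81
```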

31
Q

what minimizes the RMSE (root mean square error) and what is that line called?

A

a flat line at the mean of the data points minimizes RMSE among constant predictions; the line that minimizes RMSE overall is called the "best fit line"

32
Q

Why is it called OLS regression (ordinary least squares)?

A

because it finds the line with the smallest sum of squared residuals (the least squared error between the line and the data)

33
Q

What are the four assumptions of linear regression?

A

Linearity – the data have a linear trend
Nearly normal residuals
Constant variability
Independent observations

34
Q

RMSE

A

a measure of how well a predictive model’s predictions match the actual data. It quantifies the average magnitude of the residuals

35
Q

how to create a scatter plot with a regression in R

A

ggplot(dataset, aes(x = independent, y = dependent)) +
  geom_point() +
  geom_smooth(method = "lm")

36
Q

how to construct an OLS regression model in R

A

lm(dependent ~ independent, data = dataset)

37
Q

when do you use chi-square, anova, and OLS regression

A

categorical - categorical = chi-square
independent variable is categorical and the dependent is numeric = ANOVA
numeric - numeric = OLS Regression

38
Q

High leverage

A

in regression analysis, a point with a very high or low value on the independent variable (X) is said to have “high leverage”

39
Q

Influential point

A

a data point that has an extreme value on both the independent (X) and dependent (Y) variables such that its inclusion meaningfully changes the regression line

40
Q

How do you deal with outliers? (3)

A
  1. Check to make sure they are real and not data entry errors
  2. Don’t automatically exclude them – sometimes outliers can hold important insights
  3. Be cautious about interpreting a relationship if it’s solely the product of an outlier
41
Q

How do you calculate RMSE

A
  1. sum the squared residuals (the squared distances between each point and its predicted value) to get the SSE
  2. divide the SSE by the number of points to get the mean squared error
  3. take the square root of that mean
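
The three steps above, traced on a tiny made-up dataset:

```python
import math

# Hypothetical actual values and model predictions
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# 1) sum of squared residuals
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # 1 + 0 + 4 = 5
# 2) mean of the squared residuals
mse = sse / len(actual)
# 3) square root
rmse = math.sqrt(mse)
print(round(rmse, 3))  # 1.291
```
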
42
Q

Binary variables can take on exactly…

A

two values, 1 or 0
In R, TRUE or FALSE

43
Q

With binary variables, what’s the difference between when x = 0 and when x = 1

A

x = 0: y = b, i.e. y = Bo (the intercept alone)
x = 1: y = m + b, i.e. y = Bo + B1 (intercept plus slope)
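
The two cases, evaluated with hypothetical coefficients:

```python
# Hypothetical intercept (B0) and slope (B1) for a binary x
b0, b1 = 2.0, 3.0

def predict(x):
    return b0 + b1 * x

print(predict(0), predict(1))  # 2.0 5.0
```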

44
Q

what are the three criteria for causality?

A
  1. plausibility
  2. time order
  3. non-spuriousness
45
Q

Multiple regression

A

is a way of mathematically controlling for variables to eliminate spuriousness

46
Q

how would you turn this regression equation into a multiple regression equation?

crime = Bo + B1 * icecream eating + E

A

crime = Bo + B1 * icecream eating + B2 * summer + E