midterm 1 Flashcards

1
Q

Three Criteria for Causality

A
  1. Plausibility
  2. Time Order
  3. Non-Spuriousness
2
Q

What are the two types of variables?

A

Numeric and Categorical

3
Q

mean, median, mode

A
  1. mean: the sum of all values divided by the number of values
  2. median: the middle value when the data are sorted
  3. mode: the most common value
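
The three measures can be checked on a small data set. This is a minimal sketch using Python's standard `statistics` module; the `scores` list is made up for illustration and is not from the course.

```python
import statistics

# Hypothetical data, invented only to illustrate the three measures.
scores = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(scores)      # sum of values / number of values
median_value = statistics.median(scores)  # middle of the sorted data (average of the two middle values here)
mode_value = statistics.mode(scores)      # most common value

print(mean_value, median_value, mode_value)
```

With an even number of values there is no single middle value, so the median is the average of the two middle ones.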
4
Q

Range

A

maximum – minimum value

5
Q

IQR

A

Q3 – Q1 (the spread of the middle 50% of the data)
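
A short sketch of the range and the IQR on made-up data. It assumes Python's `statistics.quantiles` with its default "exclusive" method; other tools (including R's default `quantile()`) use slightly different conventions and may return slightly different quartiles.

```python
import statistics

# Hypothetical, already-sorted data for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Quartiles: n=4 returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# IQR: width of the interval that holds the middle 50% of the data.
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```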

6
Q

Standard deviation is the square root of what?

A

variance
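
The relationship can be verified numerically. This sketch uses made-up data and Python's `statistics` module; `pvariance`/`pstdev` treat the data as a whole population (divide by n), while `variance`/`stdev` treat it as a sample (divide by n - 1).

```python
import math
import statistics

# Hypothetical data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(data)  # divides by n
population_sd = statistics.pstdev(data)

sample_variance = statistics.variance(data)       # divides by n - 1
sample_sd = statistics.stdev(data)

# In both cases the standard deviation is the square root of the variance.
print(population_sd, math.sqrt(population_variance))
print(sample_sd, math.sqrt(sample_variance))
```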

7
Q

Cluster Sample vs Stratified Sample

A

Cluster: randomly samples existing clusters, then samples within those clusters

Stratified: creates subgroups based on variables and samples from within those subgroups

8
Q

68 – 95 – 99.7 rule for normal distribution

A

About 68% of the data fall within 1 sd of the mean
About 95% of the data fall within 2 sd of the mean
About 99.7% of the data fall within 3 sd of the mean
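
The three percentages come from the area under the normal curve. This is a sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The printed shares are roughly 0.683, 0.954 and 0.997, which the rule rounds to 68, 95 and 99.7 percent.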

9
Q

How to calculate the ANOVA test statistic (F):

A

F = Mean Square between Groups (MSG) / Mean Square Error (MSE)
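
A worked sketch of the ratio for a one-way ANOVA, computed by hand on made-up data. It assumes the usual degrees of freedom: k - 1 between groups and n - k within groups, where k is the number of groups and n the total number of observations.

```python
# Hypothetical data: three groups of three observations each.
groups = {
    "a": [4, 5, 6],
    "b": [6, 7, 8],
    "c": [8, 9, 10],
}

all_values = [v for values in groups.values() for v in values]
n_total = len(all_values)
k = len(groups)
grand_mean = sum(all_values) / n_total

def group_mean(values):
    return sum(values) / len(values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ssg = sum(len(g) * (group_mean(g) - grand_mean) ** 2 for g in groups.values())

# Within-group sum of squares: how far each observation is from its own group mean.
sse = sum((v - group_mean(g)) ** 2 for g in groups.values() for v in g)

msg = ssg / (k - 1)        # mean square between groups
mse = sse / (n_total - k)  # mean square error (within groups)
f_stat = msg / mse

print(msg, mse, f_stat)
```

Here the group means (5, 7, 9) are far apart relative to the spread inside each group, so F is large.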

10
Q

What does MSG conceptually mean?

A

the amount of variation between groups. In other words, how much of the variation you see in the sample is because there are multiple groups.

If this number is high, it means that the groups are different from each other. If this number is low, it means that the group means are all very similar to the overall mean – the groups are NOT different.

11
Q

What is MSE conceptually?

A

the amount of variation within groups. If this number is high, it means that there is a lot of variation within the groups. If this number is low, it means that all of the observations are pretty close to average for their group.

12
Q

Factorial ANOVA

A

A technique for studying the effect of two or more categorical independent variables on a numeric dependent variable accounting for interaction effects among the independent variables

13
Q

Main Effect

A

The overall relationship between an independent and a dependent variable

14
Q

Interaction Effect

A

when the relationship between two variables is different depending on the value of a third variable
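
One way to see an interaction is as a "difference of differences" in cell means. This sketch uses invented cell means for a 2x2 design; the variable names (`treatment`, age group) are hypothetical.

```python
# Hypothetical cell means: (treatment, age_group) -> mean outcome.
cell_means = {
    ("control", "young"): 10.0,
    ("drug", "young"): 14.0,
    ("control", "old"): 10.0,
    ("drug", "old"): 11.0,
}

# Effect of the drug within each age group.
effect_young = cell_means[("drug", "young")] - cell_means[("control", "young")]
effect_old = cell_means[("drug", "old")] - cell_means[("control", "old")]

# If the effect of the drug were the same in both age groups, this would be zero.
interaction = effect_young - effect_old

print(effect_young, effect_old, interaction)
```

The drug effect is larger for the young group than for the old group, so the relationship between treatment and outcome depends on age.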

15
Q

Tukey HSD (factorial anova)

A

Tests every pair of group means to see whether they differ statistically significantly, adjusting for the number of comparisons

In R: TukeyHSD() function on the aov model object

16
Q

multivariate

A

studies relationships of independent variables with multiple dependent variables

17
Q

conditions for regression

A
  1. linearity
  2. nearly normal residuals
  3. constant variability
  4. independent observations
18
Q

coefficient vs y-intercept

A

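coefficient (slope): B1 or m, the multiplier on x in y = B0 + B1x (or y = mx + b)
y-intercept: B0 (or b), the predicted value of y when x = 0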

19
Q

RMSE root mean square error (three steps)

A
  1. square each error (residual)
  2. calculate the mean of the squared errors
  3. take the square root
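
The three steps map directly onto code. This is a minimal sketch; the function name `rmse` and the example numbers are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    # Step 1: square each error.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Step 2: take the mean of the squared errors.
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    # Step 3: take the square root.
    return math.sqrt(mean_squared_error)

# Hypothetical observed values and model predictions.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 11]

print(rmse(actual, predicted))
```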
20
Q

what minimizes the RMSE?

A

For a single variable, the mean minimizes the RMSE. In regression, the line that minimizes the RMSE is the best-fit line, found by OLS (ordinary least squares)
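
A sketch of the OLS line for one predictor, using the standard closed-form slope and intercept, followed by a check that nudging the line in either direction makes the RMSE worse. The data and the helper `line_rmse` are made up for this example.

```python
import math

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form OLS estimates for y = intercept + slope * x.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

def line_rmse(b0, b1):
    """RMSE of the line y = b0 + b1 * x on the data above."""
    squared_errors = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

best = line_rmse(intercept, slope)
print(intercept, slope, best)

# Any other line does worse, for example:
print(line_rmse(intercept + 0.1, slope), line_rmse(intercept, slope + 0.1))
```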

21
Q

R for regression

A

summary(lm(dependent_variable ~ independent_variable, data = dataset))

22
Q

high leverage vs influential point

A

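high leverage: very high or low value on the independent variable (x)

influential point: a high-leverage point that noticeably changes the slope of the fitted line, typically because it is extreme in x and also out of line with the trend in the dependent variable (y)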

23
Q

Simpson’s Paradox

A

when the observed relationship between two variables changes, or even reverses, once the population is divided into groups
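
A small numeric sketch of the paradox. The counts are invented for illustration (they are not real study data): treatment A does better within each group, yet B looks better once the groups are pooled.

```python
# Hypothetical counts, invented for illustration: (successes, total).
results = {
    "mild cases": {"A": (9, 10), "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, total):
    return successes / total

# Success rate of each treatment within each group.
within_group = {
    group: {treatment: rate(*counts) for treatment, counts in arms.items()}
    for group, arms in results.items()
}

# Success rate of each treatment after pooling the groups.
overall = {}
for treatment in ("A", "B"):
    successes = sum(results[group][treatment][0] for group in results)
    total = sum(results[group][treatment][1] for group in results)
    overall[treatment] = rate(successes, total)

print(within_group)
print(overall)
```

The reversal happens because A was used mostly on severe cases, where success is rare for both treatments, so severity is confounded with treatment in the pooled comparison.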

24
Q

Adjusted R squared in multiple regression

A

applies a penalty based on the number of parameters in the model

B0 + B1 = two parameters
B0 + B1 + B2 = three parameters
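
A sketch of how the penalty works, using the standard adjusted R squared formula written in terms of the number of parameters (intercept plus slopes). The function name and the example numbers are made up for illustration.

```python
def adjusted_r_squared(r_squared, n_observations, n_parameters):
    """Adjusted R squared; n_parameters counts the intercept and every slope."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_parameters)

# Hypothetical case: the same R squared of 0.5 from 21 observations,
# once with two parameters and once with three.
two_parameters = adjusted_r_squared(0.5, 21, 2)
three_parameters = adjusted_r_squared(0.5, 21, 3)

print(two_parameters, three_parameters)
```

With the same raw R squared, the model with more parameters gets the lower adjusted value, so an extra variable only helps if it raises R squared by enough to offset the penalty.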

25
Q

How to Build a Multiple Regression Model

A
  1. Simultaneous Inference or Full Entry: put all variables into model
  2. Hierarchical: adding variables in a pre-specified and theoretically justified order
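  3. Forward Selection: start with no variables and add them one at a time, keeping only those that improve the model
  4. Backward Elimination: start with the full model and remove variables one at a time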
26
Q

How is ANOVA a special case of regression?

A

Regression: how much of the variation in Y is explained by variation in the numerical variable X

ANOVA: how much of the variation in Y is explained by a categorical variable

27
Q

nominal vs ordinal (categorical variables)

A

nominal: Data is organized into categories that have no intrinsic order

ordinal: Data is organized into categories that have a clear order

28
Q

variable vs case

A

Variable: A feature that varies across cases/observational units/units of analysis

Case: an individual person/object whose features you are measuring and evaluating.

29
Q

4 types of sampling methods

A
  1. Simple random sample: Randomly select a sample from a list
  2. Systematic sample: Organize the sampling frame in a list and select every kth individual
  3. Cluster Sample: Randomly sample existing clusters (e.g. cities, schools, neighborhoods) first, then sample within those clusters
  4. Stratified sample: Create subgroups (called strata) of the sample based on important variables (often race, gender, or other demographic features) and then sample from within those subgroups
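
The four methods can be sketched on a toy sampling frame. Everything here is hypothetical: 40 made-up students spread over 4 schools, with schools acting as the clusters in method 3 and as the strata in method 4.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 40 students in 4 schools.
frame = [{"id": i, "school": f"school_{i % 4}"} for i in range(40)]
schools = sorted({student["school"] for student in frame})

def members_of(school):
    return [student for student in frame if student["school"] == school]

# 1. Simple random sample: 8 students drawn at random from the list.
simple = rng.sample(frame, 8)

# 2. Systematic sample: random start, then every kth student.
k = len(frame) // 8
start = rng.randrange(k)
systematic = frame[start::k]

# 3. Cluster sample: randomly pick 2 schools, then sample 4 students within each.
chosen_schools = rng.sample(schools, 2)
cluster = []
for school in chosen_schools:
    cluster.extend(rng.sample(members_of(school), 4))

# 4. Stratified sample: sample 2 students from every school.
stratified = []
for school in schools:
    stratified.extend(rng.sample(members_of(school), 2))

print(len(simple), len(systematic), len(cluster), len(stratified))
```

The key contrast is in the last two: the cluster sample covers only some schools, while the stratified sample is guaranteed to include every school.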
30
Q

The Central Limit Theorem (for proportion):

A

When observations are independent and the sample is large enough, the sampling distribution of the sample proportion for samples of size n is approximately normal, with a mean of P and a standard deviation of √(P(1 − P)/n)
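
A sketch of the formula in use. The function name and the values (a population proportion of 0.5 and samples of size 100) are assumptions chosen for illustration; the argument `p` stands for the population proportion.

```python
import math
from statistics import NormalDist

def proportion_sampling_distribution(p, n):
    """Approximate sampling distribution of the sample proportion."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=standard_error)

# Hypothetical population proportion and sample size.
dist = proportion_sampling_distribution(p=0.5, n=100)

# Chance that a sample proportion lands between 0.4 and 0.6 (within 2 standard errors).
share_within_two_se = dist.cdf(0.6) - dist.cdf(0.4)

print(dist.mean, dist.stdev, share_within_two_se)
```

The standard deviation works out to 0.05, so about 95% of samples of this size would give a sample proportion between 0.4 and 0.6.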

31
Q

parameters vs statistics

A

parameter (population)
P – the population proportion
μ (mu) – the population mean
σ (sigma) – the population standard deviation

statistic (sample)
𝑝̂ (p-hat) – the sample proportion
𝑥̅ (x-bar) – the sample mean
s – the sample standard deviation

32
Q

what’s the null hypothesis in linear regression

A

that the slope coefficient (B1) is zero, i.e. there is no linear relationship between x and y