midterm 1 Flashcards by Kim Larkin

Three Criteria for Causality

Plausibility
Time Order
Non-Spuriousness

What are the two types of variables?

Numeric and Categorical

mean, median, mode

mean: the sum of all values divided by the number of values
median: the middle value when the data are sorted
mode: the most common value
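
A minimal sketch in R, using a made-up sample. Base R has no function for the statistical mode (its mode() reports the storage type), so stat_mode() below is a hypothetical helper written only for this example.

```r
# Small made-up sample for illustration
x <- c(2, 3, 3, 5, 7, 10)

mean_x   <- mean(x)    # sum(x) / length(x)
median_x <- median(x)  # middle value of the sorted data

# Hypothetical helper: returns the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_x <- stat_mode(x)

c(mean = mean_x, median = median_x, mode = mode_x)
```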

Range

maximum – minimum value

IQR

Q3 – Q1 (the spread of the middle 50% of the data)
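
A short sketch of range and IQR in R on made-up data. Note that R's range() returns the minimum and maximum rather than their difference, and that quantile() uses its default interpolation rule here (other textbooks may compute quartiles slightly differently).

```r
x <- c(1, 3, 4, 7, 9, 12, 15)

# Range as a single number: maximum minus minimum
range_x <- max(x) - min(x)   # 14

# IQR: third quartile minus first quartile
q     <- quantile(x, c(0.25, 0.75))
iqr_x <- unname(q[2] - q[1])

# Built-in shortcut for the same quantity
IQR(x)
```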

Standard deviation is the square root of what?

variance
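
A quick check in R on made-up data: taking the square root of var() should match sd(). Both use the sample formulas (dividing by n - 1).

```r
x <- c(4, 8, 6, 5, 3)

variance_x <- var(x)          # sample variance
sd_x       <- sqrt(variance_x)

all.equal(sd_x, sd(x))        # compares with the built-in sd()
```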

Cluster Sample vs Stratified Sample

Cluster: randomly samples existing clusters, then samples within those clusters

Stratified: creates subgroups based on variables and samples from within those subgroups

68 – 95 – 99.7 rule for normal distribution

about 68% of the data fall within 1 sd of the mean
about 95% of the data fall within 2 sd of the mean
about 99.7% of the data fall within 3 sd of the mean
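
The three percentages can be recovered from the normal CDF. A small sketch in R; within_sd() is a name made up for this example.

```r
# Share of a normal distribution within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_sd(1:3), 3)  # 0.683 0.954 0.997
```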

how to calculate ANOVA

Test statistic: F = Mean Square between Groups (MSG) / Mean Squared Error (MSE)

What does MSG conceptually mean?

the amount of variation between groups. In other words, how much of the variation you see in the sample is because there are multiple groups.

If this number is high, it means that the groups are different from each other. If this number is low, it means that the group means are all very similar to the overall mean – the groups are NOT different.

What is MSE conceptually?

the amount of variation within groups. If this number is high, it means that there is a lot of variation within the groups. If this number is low, it means that all of the observations are pretty close to average for their group.
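
MSG, MSE, and their ratio F can be computed by hand and compared with aov(). This is a sketch on made-up scores for three equal-sized groups; the variable names are chosen for this example only.

```r
# Made-up scores for three groups (illustrative data only)
scores <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
group  <- factor(rep(c("a", "b", "c"), each = 3))

k <- nlevels(group)   # number of groups
n <- length(scores)   # total number of observations
grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)
group_sizes <- tapply(scores, group, length)

# MSG: variation of the group means around the overall mean
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
msg <- ssg / (k - 1)

# MSE: variation of observations around their own group mean
sse <- sum((scores - group_means[as.character(group)])^2)
mse <- sse / (n - k)

f_stat <- msg / mse

# Cross-check against R's built-in one-way ANOVA
f_aov <- summary(aov(scores ~ group))[[1]][["F value"]][1]
c(by_hand = f_stat, aov = f_aov)
```

Here the group means (5, 8, 2) are far apart and the within-group spread is small, so MSG is large relative to MSE and F is large.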

Factorial ANOVA

A technique for studying the effect of two or more categorical independent variables on a numeric dependent variable, accounting for interaction effects among the independent variables

Main Effect

The overall relationship between an independent and a dependent variable

Interaction Effect

when the relationship between two variables is different depending on the value of a third variable
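
A sketch of a factorial ANOVA in R with main effects and an interaction. It assumes R's built-in ToothGrowth dataset (not part of these cards), with dose converted to a factor so it is treated as categorical.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # treat dose as a categorical variable

# supp * dose expands to supp + dose + supp:dose
fit <- aov(len ~ supp * dose, data = tg)
anova_table <- summary(fit)[[1]]
anova_table

# Rows: two main effects, one interaction effect, and the residuals
effect_names <- trimws(rownames(anova_table))
```

The supp and dose rows are the main effects; the supp:dose row tests whether the effect of one variable depends on the level of the other.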

Tukey HSD (factorial anova)

This tests every pair to see if they are statistically significantly different

In R: TukeyHSD() function on the aov model object
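
A minimal sketch, again assuming the built-in ToothGrowth dataset as a stand-in example. The which argument limits the comparisons to one factor.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

fit   <- aov(len ~ supp * dose, data = tg)
tukey <- TukeyHSD(fit, which = "dose")

# One row per pair of dose levels: difference, interval, adjusted p-value
tukey$dose
```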

multivariate

studies relationships of independent variables with multiple dependent variables

conditions for regression

linearity
nearly normal residuals
constant variability
independent observations

coefficient vs y-intercept

coefficient: the slope, B1 in B1x (or m in mx)
y-intercept: B0 (or b in y = mx + b)

RMSE root mean square error (three steps)

square each error (actual minus predicted)
calculate the mean of the squared errors
take the square root of that mean
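
The three steps can be sketched in R as follows, with made-up actual and predicted values.

```r
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

squared_error <- (actual - predicted)^2   # step 1: 1 0 1 4
mean_se       <- mean(squared_error)      # step 2: 1.5
rmse          <- sqrt(mean_se)            # step 3
rmse
```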

what minimizes the RMSE?

For a single variable, the mean is the one value that minimizes the RMSE. In regression, the line that minimizes the RMSE is the best-fit line, found by OLS (ordinary least squares).
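
A numerical check of the first claim, on made-up data: searching for the single guess with the lowest RMSE should land on the mean. rmse_from() is a name invented for this sketch.

```r
x <- c(2, 4, 4, 7, 13)

# RMSE when every observation is predicted by the same single guess
rmse_from <- function(guess) sqrt(mean((x - guess)^2))

# Search numerically for the guess with the smallest RMSE
best <- optimize(rmse_from, interval = range(x))$minimum

c(best_guess = best, mean = mean(x))   # the two should nearly agree
```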

R for regression

summary(lm(dependent_variable ~ independent_variable, data = dataset))
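
A concrete instance of the pattern above, assuming R's built-in mtcars dataset as a stand-in (mpg as the dependent variable, wt as the independent variable).

```r
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Pull out the y-intercept (B0) and the slope coefficient (B1)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["wt"]]
c(intercept = b0, slope = b1)
```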

high leverage vs influential point

high leverage: very high or low value on the independent variable (x)

influential point: extreme in both the independent (x) and dependent (y) variables, so that it noticeably changes the slope of the fitted line
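
A sketch on made-up data: the last point has an extreme x in both fits, but it only pulls the slope when its y value is also far from the trend. hatvalues() measures leverage and cooks.distance() is one common measure of influence.

```r
x    <- c(1, 2, 3, 4, 5, 20)                      # last x is extreme
y_on <- 2 * x + 1 + c(0.5, -0.5, 0.3, -0.3, 0.2, -0.2)

y_off    <- y_on
y_off[6] <- 5                                     # last y now far off the trend

fit_on  <- lm(y_on ~ x)    # high leverage, follows the trend
fit_off <- lm(y_off ~ x)   # high leverage and influential

leverage  <- hatvalues(fit_off)        # largest for the last point
influence <- cooks.distance(fit_off)   # also largest for the last point

slope_on  <- coef(fit_on)[["x"]]
slope_off <- coef(fit_off)[["x"]]
c(slope_on = slope_on, slope_off = slope_off)
```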

Simpson’s Paradox

where the observed relationship between two variables changes, or even reverses, when the population is divided into different groups
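
A sketch with fabricated data built to show the reversal: within each group y rises with x, but the pooled slope is negative because group B has higher x and lower y overall.

```r
# Fabricated data, constructed only to illustrate the paradox
d <- data.frame(
  x = c(1:5, 6:10),
  y = c(11:15, 1:5),
  g = rep(c("A", "B"), each = 5)
)

slope_all <- coef(lm(y ~ x, data = d))[["x"]]
slope_a   <- coef(lm(y ~ x, data = d, subset = g == "A"))[["x"]]
slope_b   <- coef(lm(y ~ x, data = d, subset = g == "B"))[["x"]]

c(pooled = slope_all, group_A = slope_a, group_B = slope_b)
```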

Adjusted R squared in multiple regression

applies a penalty based on the number of parameters in the model, so adding a variable only raises it if the variable improves the fit enough

B0 and B1 = two parameters
B0, B1, and B2 = three parameters
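
One common form of the penalty is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where p counts the predictors (so p + 1 is the number of parameters including B0). A sketch checking this against summary(), assuming the built-in mtcars dataset as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # parameters: B0, B1, B2
s   <- summary(fit)

n <- nrow(mtcars)
p <- 2                                    # number of predictors (excludes B0)

adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
c(by_hand = adj_by_hand, from_summary = s$adj.r.squared)
```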

How to Build a Multiple Regression Model

1. Simultaneous Inference or Full Entry: put all variables into the model at once
2. Hierarchical: add variables in a pre-specified and theoretically justified order
3. Forward Selection: start with no variables and add them one at a time
4. Backward Elimination: start with the full model and eliminate variables one at a time

How is ANOVA a special case of regression?

Regression: how much of the variation in Y is explained by variation in the numerical variable X
ANOVA: how much of the variation in Y is explained by a categorical variable
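
A sketch of the equivalence, assuming the built-in ToothGrowth dataset as a stand-in: fitting the same categorical predictor with aov() and with lm() gives the same F statistic.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# One-way ANOVA
f_anova <- summary(aov(len ~ dose, data = tg))[[1]][["F value"]][1]

# The same model as a regression on a categorical predictor
f_lm <- summary(lm(len ~ dose, data = tg))$fstatistic[["value"]]

c(aov = f_anova, lm = f_lm)
```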

nominal vs ordinal (categorical variables)

nominal: Data is organized into categories that have no intrinsic order
ordinal: Data is organized into categories that have a clear order

variable vs case

Variable: A feature that varies across cases/observational units/units of analysis
Case: An individual person/object whose features you are measuring and evaluating

4 types of sampling methods

1. Simple random sample: Randomly select a sample from a list (the sampling frame)
2. Systematic sample: Organize the sampling frame in a list and select every kth individual
3. Cluster sample: Randomly sample existing clusters (e.g. cities, schools, neighborhoods) first, then sample within those clusters
4. Stratified sample: Create subgroups (called strata) of the sample based on important variables (often race, gender, or other demographic features) and then sample from within those subgroups
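
The four approaches can be sketched in base R. The sampling frame below (100 students in 10 schools and 2 grades) is entirely hypothetical, as are the sample sizes.

```r
set.seed(42)

# Hypothetical sampling frame
frame <- data.frame(
  id     = 1:100,
  school = rep(paste0("school_", 1:10), each = 10),
  grade  = rep(c("9th", "10th"), times = 50)
)

# 1. Simple random sample: 10 rows chosen at random
srs <- frame[sample(nrow(frame), 10), ]

# 2. Systematic sample: random start, then every kth row
k <- 10
start <- sample(k, 1)
systematic <- frame[seq(start, nrow(frame), by = k), ]

# 3. Cluster sample: pick 3 schools, then sample 4 students within each
chosen  <- sample(unique(frame$school), 3)
cluster <- do.call(rbind, lapply(chosen, function(s) {
  rows <- frame[frame$school == s, ]
  rows[sample(nrow(rows), 4), ]
}))

# 4. Stratified sample: sample 5 students from every grade
stratified <- do.call(rbind, lapply(split(frame, frame$grade), function(rows) {
  rows[sample(nrow(rows), 5), ]
}))
```

The key contrast: the cluster sample leaves most schools out entirely, while the stratified sample includes every grade.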

The Central Limit Theorem (for proportion):

When observations are independent and the sample is large enough, the sampling distribution of sample proportions for a sample of size n is approximately normal, with a mean of P and a standard deviation of √(P(1 − P)/n)
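
A simulation sketch of this result. The values P = 0.3, n = 100, and 10,000 repeated samples are arbitrary choices for illustration.

```r
set.seed(1)

p <- 0.3     # assumed population proportion
n <- 100     # size of each sample

# Draw 10,000 samples and record the sample proportion from each
p_hats <- rbinom(10000, size = n, prob = p) / n

theory_sd <- sqrt(p * (1 - p) / n)

c(mean_of_p_hats = mean(p_hats), P = p)
c(sd_of_p_hats = sd(p_hats), theory = theory_sd)
# hist(p_hats) should look roughly bell-shaped
```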

parameters vs statistics

parameter (population)
P – the population proportion
μ (mu) – the population mean
σ (sigma) – the population standard deviation

statistic (sample)
p̂ (p-hat) – the sample proportion
x̄ (x-bar) – the sample mean
s – the sample standard deviation

what's the null hypothesis in linear regression

that the slope coefficient (B1) is zero, i.e. there is no linear relationship between x and y
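
In R, the test of this null hypothesis appears in the coefficient table of summary(). A sketch assuming the built-in mtcars dataset as a stand-in:

```r
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

# Row "wt" tests H0: slope = 0
coefs["wt", ]

# The t value is the estimate divided by its standard error
t_by_hand <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]
p_value   <- coefs["wt", "Pr(>|t|)"]
```

A small p-value is evidence against the null hypothesis that the slope is zero.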

midterm 1 Flashcards (32 cards)