439 midterm Flashcards
population & parameter
- pop: the entire set of things of interest
- par: A property or number descriptive of the population (a fixed number, but in practice, we do not know its value)
sample & statistic
- sample: A part of the population. Typically, this provides the data that we will examine to gather information
- stat/estimate: A property or number that describes a sample (use a statistic to estimate an unknown parameter)
descriptive & inferential statistics
- descriptive: Summarize/describe the properties of samples (or populations when they are completely known)
- inferential: Draw conclusions/make inferences about the properties of populations from sample data
types of variables
- nominal (classifies/identifies objects, can be dichotomous or multi-categorical) and ordinal (ranking data): categorical (discrete/qualitative)
- interval (rating data with equal distances) and ratio (Special kind of interval scale with a meaningful zero point): continuous (numerical/quantitative)
univariate and multivariate
- uni: one DV, can have multiple IVs (linear, logistic regression)
- multi: multiple DVs regardless of the number of IVs (dimension reduction, cluster analysis)
normal distribution
- Y is continuous and normally distributed in the population
- mean = median = mode
- Y ~ N(μ, σ), i.e., normal with mean μ and standard deviation σ
- 68% of scores within 1SD of the mean, 95% within 2SDs, 99.7% within 3SDs
z (standard) score
- we can convert Y scores to z scores that follow the standard normal distribution (z ~ N(0,1))
- the deviation of a score from the population mean divided by the population standard deviation: z = (Y − μ)/σ (dividing by σ scales the deviation into SD units)
- to determine how extreme any score is based on standard normal distribution
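A minimal Python sketch with made-up numbers (not from the course) of converting a score to z using the formula above and checking how extreme it is under the standard normal:

```python
import scipy.stats as st

# hypothetical population: mean 100, SD 15; one observed score of 130
mu, sigma = 100, 15
y = 130

z = (y - mu) / sigma          # deviation from the mean in SD units -> 2.0
p_above = 1 - st.norm.cdf(z)  # area of the standard normal above this z
print(z, p_above)             # 2.0, about 0.023 (a fairly extreme score)
```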
types of statistical inference
- significance tests (computing a p value)
- confidence intervals
- both types are based on sampling distributions of statistics
sampling distribution of statistics
- the distribution of the values taken by a statistic (like a mean) in all possible samples of size N from the same population
- the sampling distribution of the mean is normal if the population is normal
- if the pop is not normally distributed, we use the central limit theorem
central limit theorem
- as N increases, the distribution of a sample mean becomes closer to a normal distribution. This is true no matter how the population is distributed, as long as it has mean μ and standard deviation σ
- X̄ (sample mean) ~ N(μ, σ/√N)
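A quick simulation sketch (made-up exponential population, numpy assumed) of the claim above: sample means cluster around μ with spread close to σ/√N even though the population is skewed:

```python
import numpy as np

rng = np.random.default_rng(0)

# skewed (exponential) population with mean mu = 2 and SD sigma = 2
mu = sigma = 2.0
N = 50                                    # sample size
sample_means = rng.exponential(scale=mu, size=(10_000, N)).mean(axis=1)

print(sample_means.mean())                # close to mu (about 2)
print(sample_means.std())                 # close to sigma / sqrt(N), about 0.28
```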
Null Hypothesis Significance Testing
- State the null and alternative hypotheses (statements about a parameter; we test the hypothesis of no effect even if we think there is an effect)
- Calculate the value of an appropriate test statistic (how far the data are from the null; for one parameter in the null we use a t-test, for more than one an F-test, for frequency distributions a chi-square test)
- Find the p-value for the observed data (the test statistic value).
- State a conclusion (the significance level alpha sets the area of extreme scores that would be unlikely if the null were true; the cutoff value of the statistic based on alpha is the critical value)
what is a p-value
- a conditional probability (probability based on the condition that the null is true)
- the probability of obtaining a test statistic at least as extreme as the one you computed, if the null is true
- the smaller the value, the less compatibility between your data and the null (support for the alternative)
- if the p-value is smaller than the alpha level, there is a statistically significant effect and we reject the null
- just because we have a statistically significant effect doesn’t mean it’s meaningful; a p-value is highly affected by sample size
effect sizes
- magnitude of a treatment effect
- Pearson’s r, correlation squared (R2), Cohen’s d, omega or omega squared
- small effect: r = 0.1, r2 = 0.01, d = 0.25
- medium: r = 0.3, r2 = 0.06, d = 0.5
- large effect: r = 0.5, r2 = 0.15, d = 0.8
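A small sketch of one common version of Cohen's d (the pooled-SD form; the data and this particular formula are illustrative assumptions, not from the course):

```python
import numpy as np

# hypothetical scores for a treatment and a control group
treat = np.array([5.1, 6.0, 5.8, 6.4, 5.5])
ctrl = np.array([4.2, 4.9, 5.0, 4.4, 4.7])

# pooled-SD version of Cohen's d: mean difference in pooled-SD units
n1, n2 = len(treat), len(ctrl)
s_pooled = np.sqrt(((n1 - 1) * treat.var(ddof=1) + (n2 - 1) * ctrl.var(ddof=1))
                   / (n1 + n2 - 2))
d = (treat.mean() - ctrl.mean()) / s_pooled
print(d)   # magnitude of the treatment effect, not driven by sample size
```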
types of errors in NHST
- Type I: reject the null when it is true (false positive) = alpha level
- Type II: fail to reject the null when it is false (false negative) = beta level
- 1 - beta = probability of correctly rejecting a false null
- alpha and beta are related to each other (increase alpha = increase power and Type I error rate = decrease beta and Type II error rate)
z-test
- purpose: to test whether a sample mean differs from a population mean
- assumption 1: the population is normally distributed
- assumption 2: the population’s SD is known
- assumption 3: independence of observations (simple random sample)
- if your z-score is greater (in absolute value) than 1.96, we reject the null at alpha = .05
- limitation: knowing the SD of the population
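A minimal sketch of the z-test above (made-up scores, population SD assumed known):

```python
import numpy as np
import scipy.stats as st

sample = np.array([108, 112, 97, 105, 120, 101, 110, 99, 115, 103])
mu0, sigma = 100, 15               # null population mean; known population SD

z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p = 2 * (1 - st.norm.cdf(abs(z)))  # two-tailed p-value
print(z, p)                        # reject H0 at alpha = .05 only if |z| > 1.96
```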
t-test
- used when we don’t know the pop SD, but want to test whether a sample mean differs from a population mean
- assumptions: population is normal, independence of observations
- you use the sample SD in the formula, so the test statistic follows the t distribution rather than the standard normal distribution
- t distribution changes shape based on N (with a large enough N (df>30), the t distribution approximates the standard normal distribution, so you can use critical value 1.96)
- t distribution has fatter tails than normal, but still bell-shaped and symmetrical
- if t is large, either the numerator (the mean difference) is large or the denominator (the standard error) is small (a large N shrinks the standard error, so you can get a large t statistic even if the numerator is small)
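The same made-up sample run as a one-sample t-test (scipy assumed), now using the sample SD instead of a known population SD:

```python
import numpy as np
import scipy.stats as st

sample = np.array([108, 112, 97, 105, 120, 101, 110, 99, 115, 103])
mu0 = 100

t, p = st.ttest_1samp(sample, popmean=mu0)   # t distribution with df = N - 1
print(t, p)

# same t by hand: (sample mean - mu0) / (sample SD / sqrt(N))
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
print(t_manual)
```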
important p-value information
- a p-value doesn’t tell you the probability that the null is true
- it is the probability, computed under the null (the condition), of getting a test statistic at least as extreme as the one from your sample data
- = P(data at least this extreme | H0 is true), NOT P(H0 is true | data)
- you’re not testing whether the null is true
- p-value doesn’t give any information about the size or importance of your effect
- p-values indicate how incompatible your data are with a statistical model as long as the underlying assumptions hold
- p-values should not be the deciding factor in making a conclusion, should not be reported selectively
p-hacking
- trying different statistical methods until p<.05
- conducting multiple tests on subsets of the sample or controlling for different covariates
- collecting more data until p<.05
- excluding some observations
- dropping one of the conditions
scatterplot
- displays the form (linear, nonlinear), direction (positive, negative), strength (weak, strong) of a relationship between two quantitative variables measured on the same individual
- we usually need a numerical measure to supplement the graph (a correlation)
correlation
- measures the direction and strength of the linear relationship between two quantitative variables (Pearson correlation coefficient r)
- a correlation treats both variables as equals (shows a symmetric linear relationship)
pearson r
- standardized covariance
- covariance indicates the degree to which two variables vary together, but its raw size is hard to interpret because it is scale dependent
- standardizing the covariance (dividing by the standard deviations of the two variables) puts it between -1 and 1, so we can compare the relationships between variables
covariance
- the degree to which X and Y vary together
- positive Cov = moving in the same direction, negative = moving in opposite directions, 0 = no linear relationship (independent variables have zero covariance, but zero covariance does not guarantee independence)
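A short sketch (made-up paired scores) showing that Pearson's r is just the covariance divided by the two standard deviations:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.0, 4.5, 6.0, 8.5])

cov_xy = np.cov(x, y, ddof=1)[0, 1]            # scale-dependent covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # standardized covariance = r
print(r, pearsonr(x, y)[0])                    # the two values match
```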
when should we not use r?
- if two variables have a nonlinear relationship (there could be a nonlinear association, but r won’t pick up on it)
- if observations aren’t independent (there will be existing correlations between them)
- if there are outliers (very sensitive to outliers, will pull the line)
- if homoscedasticity is violated (if one variable has unequal variability across the range of the other variable)
- if the sample size is very small (N=3-6)
- if one or both variables are not continuous
other correlation coefficients
- point-biserial: binary & continuous variables
- phi coefficient: two binary variables
- Spearman rank order: two ordinal variables (like Judges’ ranks)
- Kendall’s tau: two ordinal variables (but N is small and there are many tied ranks)
why could two variables be correlated
- by chance
- two variables could be mutually affecting each other (price and demand)
- relationship could be driven by an underlying cause (a confounder)
lurking factors
- potential causes for a relationship that aren’t measured
partial correlation
- the relationship between 2 variables after removing the influence of another variable
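One standard formula for a first-order partial correlation (controlling for a single variable z); the data here are simulated just to show the idea:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
z = rng.normal(size=100)                 # the variable whose influence we remove
x = 0.6 * z + rng.normal(size=100)
y = 0.6 * z + rng.normal(size=100)

r_xy = pearsonr(x, y)[0]
r_xz = pearsonr(x, z)[0]
r_yz = pearsonr(y, z)[0]

# partial correlation of x and y after removing z from both
r_xy_given_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy, r_xy_given_z)   # r_xy is driven by z; the partial r is near 0
```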
statistical test for a correlation coefficient
- H0: ρ = 0, H1: ρ ≠ 0 (ρ is the population correlation)
- use a t-test (one parameter in the null) when the two variables are jointly (bivariate) normal (check normality with a Shapiro-Wilk test)
- general form: t = (sample statistic − null parameter)/SE; for r this is t = r√(N−2)/√(1−r²), with df = N−2
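A sketch (made-up data) checking that scipy's built-in test of H0: ρ = 0 matches the t formula above:

```python
import numpy as np
from scipy.stats import pearsonr, t as t_dist

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 3.0, 6.0, 8.0])
y = np.array([1.5, 3.0, 4.5, 6.0, 8.5, 2.0, 5.5, 7.0])
N = len(x)

r, p = pearsonr(x, y)                           # built-in test of H0: rho = 0

t_stat = r * np.sqrt(N - 2) / np.sqrt(1 - r**2) # same test by hand, df = N - 2
p_manual = 2 * t_dist.sf(abs(t_stat), df=N - 2)
print(p, p_manual)                              # the two p-values agree
```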
linear regression
- used if we have a directional hypothesis (how X affects Y), can show an asymmetric linear relationship between predictor and outcome variables
- the effect of X on Y is beta (regression coefficient/slope)
- there will also be error that accounts for some variation in Y (not just X)
simple linear regression
- only one X (how DV changes when IV changes)
- assumption of linearity
- used as a mathematical model summarizing the relationship by fitting a straight line/regression line (Ŷ) to the data that predicts values of Y from X: Ŷ = alpha + beta·X
interpret alpha and beta for simple linear regression
- alpha: intercept (average value of Y when X = 0)
- beta: slope (amount by which Y changes on average when X changes by one unit)
method of least squares
- finding the best fitting regression line
- minimizing the vertical distance between a data point and the line (minimizing the residuals Yi - Ŷi)
- compute all the residuals for all data points, square them, sum the squares
- we square the residuals to avoid the negative and positive residuals cancelling out to zero
- this method is problematic when you have outliers because squaring large values makes them larger
- B = Cov(X, Y)/Var(X)
- alpha = (mean of Y) − B·(mean of X)
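A sketch (made-up x and y) computing the least-squares estimates from the two formulas above and checking them against numpy's line fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

B = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # B = Cov(X,Y)/Var(X)
alpha = y.mean() - B * x.mean()                      # alpha = Ybar - B*Xbar

print(B, alpha)
print(np.polyfit(x, y, deg=1))                       # [slope, intercept] -> same
```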
statistical significance of the slope
- H0: B = 0, H1: B ≠ 0
- use a t-test
- assumptions of normality and independence
- df = N − 2 (two parameters, the intercept and the slope, are estimated)
- t = (B − 0)/SE(B)
- if |t| > the critical value or p < .05, the slope is significantly different from 0, which suggests a significant effect of X on Y
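A sketch of the slope test using scipy's linregress (same made-up data as the previous block):

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

res = linregress(x, y)              # simple linear regression
t = res.slope / res.stderr          # t = (B - 0) / SE(B), df = N - 2
print(t, res.pvalue)                # two-sided p-value for H0: B = 0
```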
partitioning of variance in simple linear regression for getting the F ratio
- SS(Total) = SS(Regression) + SS(Error)
- SS(Regression): The variation in Y explained by the regression line.
- SS(Error): The variation in Y unexplained by the regression line (residuals).
- df(Reg) = number of IVs in the model (= 1 here)
- df(E) = N-2
- divide each SS by its corresponding df to get MS
- F = MS(Reg)/MS(E)
- compare F ratio to F distribution using df(reg) and df(E)
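A sketch (same made-up data) that partitions the variance, builds the F ratio, and also gives R2; with one IV this F equals the squared t from the slope test, matching the next card (t² = F):

```python
import numpy as np
from scipy.stats import f as f_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
N = len(x)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)       # total variation in Y
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # variation explained by the line
ss_err = np.sum((y - y_hat) ** 2)            # residual (unexplained) variation

F = (ss_reg / 1) / (ss_err / (N - 2))        # MS(Reg) / MS(Error)
p = f_dist.sf(F, dfn=1, dfd=N - 2)
R2 = ss_reg / ss_total                       # coefficient of determination
print(F, p, R2)                              # F here equals t**2 from above
```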
how are t and F related in simple linear regression
- they’re testing the same thing ONLY in simple linear reg
- t^2 = F
goodness-of-fit of the regression model
- coefficient of determination (R2)
- proportion of variation in Y accounted for by the model (R2 = SS(Reg)/SS(Total))
- ranges between 0 and 1
- in simple linear reg, |r| = sqrt(R2) (the sign of r matches the sign of the slope)
simple linear regression in JASP
- model summary gives R2
- the ANOVA table gives the F test for the overall significance of the regression model
- coefficients table: the unstandardized values are the estimates (divide each estimate by its SE to get t)
- in simple linear regression, both ANOVA table and t values in coefficients are testing the same thing so you get the same conclusion about whether to reject the null
multiple linear regression
- one DV with multiple IVs (which IV matters most for the DV?)
- describes how the DV (Y) changes as multiple IVs (Xj, j ≥ 2) change
- with two IVs, the data points sit in 3D space and the model fits a regression plane
- equation of regression plane: Ŷ = alpha + B1X1 + B2X2 + … + BjXj
- IVs can be continuous or discrete (will change the interpretation)
- alpha: Average value of Y when all the Xj = 0
- beta: Amount by which Y changes on average when Xj changes by one unit, holding the other IVs constant (partialled out/controlled).
- we can still use least squares method to estimate the intercept and slope
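A minimal sketch of multiple regression by least squares with numpy (two simulated IVs; a column of 1s gives the intercept):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=100)   # simulated DV

X = np.column_stack([np.ones_like(x1), x1, x2])        # intercept column + IVs
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)    # approximately [alpha, B1, B2] = [1.0, 2.0, -0.5]
```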
standardized regression coefficients
- effect of a standardized IV on the standardized DV (z-scores)
- the number of standard deviations by which the DV changes, on average, when Xj changes by one standard deviation, holding the other IVs constant
- Used to compare the effects of IVs on the DV, when the IVs are measured in different units of measurement
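One way to get standardized coefficients, sketched on the same simulated data: z-score every variable and refit, so the slopes are in SD units and comparable across IVs:

```python
import numpy as np

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=100)

# regress z-scored y on z-scored IVs (plus an intercept, which will be ~0)
Z = np.column_stack([np.ones(len(y)), zscore(x1), zscore(x2)])
betas, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)
print(betas[1:])   # standardized betas: SD change in y per 1-SD change in each IV
```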