Sabina's Lectures 1 to 7 Flashcards
Variance & SD
σ² (or V) = the degree to which a variable ‘varies’ around its mean: V = Σ(X - X̄)² / (N - 1) = SS/df
SD = √V (in the same units as the variable, so easier to interpret)
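A minimal sketch of these two formulas in pure Python, using made-up scores (not from the lectures):

```python
# Sample variance uses N - 1 (df) in the denominator; SD is its square
# root, back in the original units of the variable.
def variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squares, SS
    return ss / (n - 1)                    # V = SS / df

def sd(xs):
    return variance(xs) ** 0.5

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
print(variance(scores))  # SS = 32, df = 7, so 32/7
print(sd(scores))
```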
Covariance
CoV = the degree to which two variables ‘vary’ simultaneously or co-vary
Note: the variance of a variable is… its covariance with itself.
Correlation
The degree of linear relationship between two variables; essentially, it is a standardised covariance
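A short sketch (made-up numbers) showing both ideas on this card: r is the covariance divided by the two SDs, and a variable's covariance with itself is its variance.

```python
# cov(x, x) reproduces the variance; corr standardises the covariance.
def cov(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    return cov(xs, ys) / (cov(xs, xs) ** 0.5 * cov(ys, ys) ** 0.5)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
print(cov(x, x))      # equals the variance of x
print(corr(x, y))     # standardised covariance, between -1 and 1
```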
Continuous versus discrete variables
Variables are either continuous or discrete (categorical)
Regression sum of squares
Regression sum of squares is about what we can predict (R2); (1 - R2) is the proportion we cannot predict.
how to work out t
t = b / SEb
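A minimal sketch of the t for a one-predictor slope, with made-up data: SEb = √(MSresidual / SSx), and t is simply b divided by that.

```python
# Fit a simple regression, then compute t = b / SE_b by hand.
def slope_t(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ss_x
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = (ss_res / (n - 2) / ss_x) ** 0.5   # df residual = N - 2 here
    return b, se_b, b / se_b

x = [1, 2, 3, 4, 5, 6]   # made-up data
y = [2, 3, 5, 4, 6, 7]
b, se, t = slope_t(x, y)
print(round(t, 3))
```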
df Residual
df residual corresponds to the portion of the DV we cannot predict: df = N - k - 1, where k is the number of predictors.
Confidence Intervals (CI)
b is an estimate of the population parameter. Ultimately, we want to know the true value of the regression coefficient. The CI helps to illustrate this idea (i.e., if we conducted this research 100 times, XX% of the resulting intervals would contain the true (yet unknown) slope).
how to use CIs
If the range includes 0, then we conclude that the findings are NOT statistically significant, and vice versa.
› We can also use the CI to test whether the slope is different from a particular value (e.g., whether this slope is different from the one found in previous studies).
› SPSS does not calculate CI automatically
The CI is sort of our parameter line: if I performed the experiment 100 times, this is the range I would expect b to fall in.
Converting from b to β [in italics!]
β = b × (SDx / SDy)    b = β × √(Vy / Vx) = β × (SDy / SDx)
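A sketch (made-up data) verifying the conversion: β computed as b × (SDx/SDy) matches the slope you get by running the regression on z-scored variables.

```python
# beta = b * (SDx / SDy); check it against the slope of the z-scored data.
def sd(xs):
    n, m = len(xs), sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
b = slope(x, y)
beta = b * sd(x) / sd(y)
zx = [(v - sum(x) / 5) / sd(x) for v in x]
zy = [(v - sum(y) / 5) / sd(y) for v in y]
print(abs(beta - slope(zx, zy)) < 1e-9)  # True: beta is b on z-scores
```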
If b is equal to zero
The predictor adds nothing systematic, but it will still be featured in the regression equation (DO NOT TAKE IT OUT).
The most common null hypothesis is that b = 0: the slope is not different from zero, nothing systematic is happening.
The null value doesn't have to be zero: e.g., is the new slope different from 1.5 (a slope found in previous studies)? If the CI includes that value, it is fine as a null hypothesis.
MR advantages
Can use both categorical and continuous independent
variables
› Can easily incorporate multiple independent variables
› Is appropriate for the analysis of experimental or
nonexperimental research
Factors Affecting the Results
of the Regression Equation
› Sample size (N)
› The amount of scatter of points around the regression line [indexed by Σ(Y-Y’)2 or SSresidual]: other things being equal, the smaller SSresidual, the larger SSregression, and hence the larger the F-ratio
› The range of values in the X variable, indicated by Σ(X-X̄)2
Assumptions Underlying MR (only a
glimpse now)
Dependent variable is a linear function of the IVs
- Curvilinearity can be overlooked if one selects extreme cases of X: selecting only extreme cases can ‘force’ the regression to appear linear, even if it is curvilinear for intermediate X values. Bad practice…
› Each observation is drawn independently
› Errors are normally distributed
› The mean of the errors is 0
› Errors are not correlated with each other, nor with the IV
› Homoscedasticity of variance
- Variance of errors is not a function of IVs
- The variance of errors at all values of X is constant, meaning that it is the same at all levels of IV
Regression df
The number of IVs (k)
do you report the non significant parts in regression conclusion?
YES
decimal places for b
Three decimal places (e.g., .003)
what happens when you shorten the sample range on the line graph?
b is the same, but β changes: the distribution is different, so the SDs change
why is β the same as ry2 when the two IVs don’t correlate
Because there is no overlap in the Venn diagram: β = ry2 when r12 = 0
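A sketch with made-up correlations: for two standardised predictors, β1 = (ry1 - ry2·r12)/(1 - r12²), which collapses to β1 = ry1 when r12 = 0.

```python
# Standardised two-predictor solution from the correlations alone.
def betas(ry1, ry2, r12):
    d = 1 - r12 ** 2
    return (ry1 - ry2 * r12) / d, (ry2 - ry1 * r12) / d

print(betas(0.5, 0.3, 0.0))  # (0.5, 0.3): betas equal the correlations
print(betas(0.5, 0.3, 0.4))  # here the betas shrink once the IVs overlap
```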
assumptions of error
we assume that they are normally distributed, independent,
and have constant variance.
regression line
The IVs are differentially weighted so that the prediction is optimised and the sum of squared errors of prediction is minimised. That is, the sum of squared values for each residual term is smaller than for any other possible straight line, hence the term least squares.
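A sketch of the least-squares property with made-up data: nudging either coefficient away from the fitted values can only increase the sum of squared residuals.

```python
# SSE of a candidate line; the OLS line should beat every perturbation.
def sse(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
mx, my = sum(x) / 5, sum(y) / 5
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
best = sse(a, b, x, y)
print(all(sse(a + da, b + db, x, y) >= best
          for da in (-0.5, 0, 0.5) for db in (-0.5, 0, 0.5)))  # True
```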
β way of writing conclusions
Report in standard scores or standard deviations, not ‘standard units’
what is a different metric
Different metrics include different scales of the same dimension: cm differs from metres just as cm differs from hours. The metrics must be exactly the same to compare b's; otherwise use β.
when is something not a common cause
The a, b and c paths are equivalent to β’s, where the DV is VarY, regressed on variables X1 and X2
› If VarX1 has no effect on Y (b=0), but it has an effect on X2, then:
- it is not a common cause
- βYX2 = rYX2 = c
- c does not change with the inclusion or exclusion of X1
- OR
- If VarX1 has no effect on X2 (a=0),
but it has an effect on Y, then:
- it is not a common cause
- βYX2 = rYX2 = c
- c does not change with the inclusion or exclusion of X1
importance of r2
For explanation, a high R2 is less important than proper variable selection
R2 should be within expected range
- Explaining 25% of the variance may be surprisingly high for some questions, low for others
› A high (?) R2 is important for prediction
› “Human freedom may then rest in the error term”
Indirect Effects
The regression weight for Parent Education changed
because a mediating variable (Previous Achievement) was included in the model.
› A portion of the direct effect from the first regression is now indirect (e.g., paths d and a)
› Mediating variables do not have to be included to interpret regression coefficients as effects
› However, this type of regression only focuses on
direct effect.
mean when you standardise something
Every time you standardise something, the mean will be (essentially) zero, like a z-score distribution
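A tiny sketch (made-up numbers) of why the mean goes to zero: z-scoring subtracts the mean before dividing by the SD.

```python
# z = (x - mean) / SD, so the z-scores sum to zero (up to rounding).
def zscores(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - m) / s for x in xs]

z = zscores([10, 12, 23, 23, 16, 23, 21, 16])  # made-up data
print(abs(sum(z) / len(z)) < 1e-12)  # True: mean of the z-scores is ~0
```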
what do you look at when you have intercorrelations output?
Correlations that our IVs have with our DV
- Correlations that our IVs have with each other
including a common cause
Prevents inflating the other variables' b's and β's
df for change statistics in sequential
Always 1, because only one variable is added at a time
order of entry
The variable entered first has the most opportunity to capture the highest proportion of variance; the one entered last gets a tiny ∆R2
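A sketch of order-of-entry effects with made-up correlations: the full R² for two standardised IVs can be computed from ry1, ry2 and r12, so we can compare what X2 claims when entered first versus last.

```python
# R^2 for two standardised predictors, from the correlations alone.
def r2_both(ry1, ry2, r12):
    return (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)

ry1, ry2, r12 = 0.6, 0.5, 0.4   # made-up correlations
full = r2_both(ry1, ry2, r12)
x2_first = ry2 ** 2             # X2 entered first claims all its shared variance
x2_last = full - ry1 ** 2       # X2 entered last gets only the unique increment
print(x2_first, round(x2_last, 3))  # entered first, X2 claims far more
```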
total effects
The direct effect plus e × d (the indirect-effect paths multiplied together)
importance measure better than ∆R2
√∆R2
Unique Variance
Some researchers add each variable last in a sequential regression to determine its “unique” effect/variance
› Can get the same information in simultaneous regression, requesting semipartial (part) correlations
› Square the part correlations to determine unique variance
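A sketch (made-up correlations) of the last bullet: the squared semipartial (part) correlation of a predictor equals the ∆R² from adding it last, i.e. its unique variance.

```python
# Semipartial (part) correlation of one standardised predictor with Y.
def part_r(ry_this, ry_other, r12):
    return (ry_this - ry_other * r12) / (1 - r12 ** 2) ** 0.5

ry1, ry2, r12 = 0.6, 0.5, 0.4   # made-up correlations
full = (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)
sr2 = part_r(ry2, ry1, r12)
print(abs(sr2 ** 2 - (full - ry1 ** 2)) < 1e-9)  # True: sr^2 = unique variance
```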
what to do with stepwise
A large N and cross-validation are necessary
interactions and curves
Test for interactions by sequentially adding a cross-product
term to the regression
› Test for curves in the regression plane by sequentially adding
powers of variables (e.g., variable²)
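A toy sketch of why a cross-product term carries the interaction, using fabricated data where y is exactly x1 × x2: the product term correlates perfectly with y, while each IV alone does not.

```python
# Pearson correlation, defined inline so the example is self-contained.
def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = sum((v - ma) ** 2 for v in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (sa * sb)

x1 = [1, 1, 2, 2, 3, 3]               # made-up IVs
x2 = [1, 2, 1, 2, 1, 2]
y = [a * b for a, b in zip(x1, x2)]   # DV is a pure interaction
prod = [a * b for a, b in zip(x1, x2)]
print(round(corr(y, prod), 3))  # 1.0: the cross-product captures it all
print(round(corr(y, x1), 3))    # 0.75: each IV alone falls short
```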
purpose sequential
Is a variable (or block of variables) important for an outcome?
- Does a variable explain/predict variance beyond that
explained by other influences?
- Does a variable explain/predict unique variance in an
outcome?
- Test for statistical significance of interactions and curves
- Does a variable aid in predicting some criterion?
What to Interpret in sequential
Magnitude/importance: √∆R2; statistical significance: ∆R2
when to use sequential
Useful for explanation when guided by theory
› Useful for testing interactions & curves
› Estimates total effects in implied model
› More ‘similar’ to ANOVA method (?)
be careful with order
alternatives to Stepwise
Simultaneous regression
› Sequential regression (final equation)
› Study correlations between IVs. If some are highly
intercorrelated, consider combining them in a composite.
› SEM (…?…)