Week 2: Correlation, Bivariate Regression & Preliminary Data Analyses Flashcards
Interpret correlations
r = Pearson’s correlation coefficient (or Pearson’s product-moment correlation coefficient). r ranges from -1 to 1: r = 0 indicates the absence of a linear relationship, r = 1 a perfect positive linear relationship, and r = -1 a perfect negative linear relationship.
r² = coefficient of determination = the strength of the relationship between 2 variables.
r² tells us the amount of variability in one variable that is shared by the other, and can be expressed as a % of the total variability.
eg if r = 0.50, then r² = 0.25, and therefore 25% of the variance in one variable is associated with the variance in the other. The 2 variables share 25% of the same information.
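As a quick illustration (the paired scores here are made up, not from the notes), Pearson’s r and r² can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical paired scores for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 7.0, 8.0])

# Pearson's r, taken from the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: proportion of shared variance
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For these invented scores r is close to 1, so nearly all of the variability in y is shared with x.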
Cohen’s effect size guides:
r = +/- 0.10-0.30 small effect
r = +/- 0.30-0.50 moderate effect
r = +/- 0.50-1 large effect. BUT whether an effect size should be viewed more leniently or stringently depends on the specific real-world situation it is applied to.
BUT for hypothesis testing, r still needs to be assessed for statistical significance. A t-test is used to assess the significance of r, but by APA convention this t-test is not reported. To report Pearson’s correlation coefficient, you need the size and direction of r, and the p value.
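A sketch of what actually gets computed and reported, using SciPy’s `pearsonr` (which returns r together with its two-tailed p value) on made-up data; the manual t shown alongside is the unreported test the notes mention:

```python
import numpy as np
from scipy import stats

# Made-up scores with a strong built-in linear trend
x = np.arange(1.0, 11.0)
y = 2.0 * x + np.array([0.1, -0.1, 0.2, 0.1, -0.2,
                        0.2, 0.1, -0.1, 0.0, 0.3])

# r and its two-tailed p value in one call
r, p = stats.pearsonr(x, y)

# The (unreported) t-test behind that p value: t = r*sqrt((n-2)/(1-r^2))
n = len(x)
t = r * np.sqrt((n - 2) / (1 - r ** 2))

print(f"r({n - 2}) = {r:.2f}, p = {p:.3g}")
```

An APA-style report would then give only the size and direction of r with its degrees of freedom (n - 2) and the p value, eg r(8) = .99, p < .001.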
general linear model (GLM)
This model is central to many different analyses, including correlation, regression, t-tests, and one-way & two-way ANOVA. Y = a + bX + error
outcome = model + error. The GLM assumes:
- all continuous variables are normally distributed
- each possible pair of variables is linearly related
bivariate (Pearson’s) correlation (r)
Used to examine the relationship between 2 variables. Assesses how well an association between 2 variables can be described via a linear relationship. The relationship is described by direction (+/-) and strength (magnitude). May be used when both variables are interval or ratio (continuous variables), or when 1 variable is continuous and the other is dichotomous (only 2 forms possible, eg male/female). Cannot validly evaluate non-linear relationships.
Spearman rank order correlation coefficient
Used if:
- the assumptions required for Pearson’s cannot be met, or
- one variable is measured with an ordinal scale (labels) with more than 2 options
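A minimal sketch contrasting the two coefficients on made-up monotonic but non-linear data, where Spearman’s rank-order coefficient captures the association better than Pearson’s:

```python
import numpy as np
from scipy import stats

# Made-up data: y increases monotonically with x, but not linearly
x = np.arange(1.0, 11.0)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)     # underestimates the association
spearman_rho, _ = stats.spearmanr(x, y) # perfect rank agreement

print(round(pearson_r, 3), round(spearman_rho, 3))
```

Because every increase in x is matched by an increase in y, the ranks agree perfectly and Spearman’s rho is 1, while Pearson’s r is pulled below 1 by the non-linearity.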
Parametric and non parametric data
With parametric data, certain assumptions (eg normal distribution) are made about the data, whereas with non-parametric data no such assumptions are made.
Understand and interpret bivariate regression
Bivariate linear regression is an extension of bivariate linear correlation. Regression uses the score on the independent (predictor) variable (X) to predict the score on the dependent (criterion) variable (Y).
If the correlation is high, prediction is reliable; if the correlation is low, prediction accuracy is poor.
Relationship between independent variable and dependent variable is summarised in a linear equation known as the regression line, or line of best fit.
yi = (b0 + b1xi) + ei
yi = the outcome to be predicted, b1 = the slope of the line, b0 = the intercept,
ei = error.
The error term is often left out of the equation, leaving:
ŷ = bx + a
b = slope, a = y intercept (the value of y when x = 0)
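The slope and intercept can be estimated by ordinary least squares, eg with NumPy’s `polyfit`; the predictor and criterion scores here are invented for illustration:

```python
import numpy as np

# Invented predictor (x) and criterion (y) scores
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares slope (b) and intercept (a) of the line of best fit
b, a = np.polyfit(x, y, 1)

# Predicted scores and residuals: y_hat = b*x + a, e = y - y_hat
y_hat = b * x + a
residuals = y - y_hat

print(f"y_hat = {b:.2f}x + {a:.2f}")
```

The fitted b estimates the average change in y per 1-unit change in x, and the residuals (which sum to zero for a least-squares fit) are the prediction errors.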
NOTE making the prediction is still NOT suggesting a causal relationship (although it may be possible) (demonstration of causal relationship is dependent upon study design not analysis type).
In regression we use R and R² (in correlation it is r and r²), but they can be interpreted the same way: as showing how much variance is shared between the 2 variables.
If R² = 0.25, this means that 25% of the variability in the DV can be predicted from its association with X. The remaining unpredicted portion (1 - R²) is defined as the error variability.
The slope of the regression line (b)
shows the amount of change in the dependent variable that corresponds to a 1-unit change in the independent variable.
gradient= rise/run=(change in y)/(change in x)
The slope is also referred to as the unstandardised regression coefficient (b); when the variables are first converted to z-scores, the slope is the standardised coefficient, beta (β).
If slope =0.50, then an increase of 1 unit of x, is associated with an average increase in y of 0.50 units.
A t-test is used to judge whether b is significantly different from 0, ie whether the change in the DV associated with a change in the IV is reliable, or due to chance or error.
NOTE here we are talking about a change in a variable as being a change in score across participants, we are NOT talking about a change in score occurring over a period of time in one individual.
the y intercept of the regression line
when x = 0
also called the constant
Point at which line crosses the vertical axis.
also evaluated via a t-test to see if it is significantly different from 0.
residual
is the difference between what the linear regression line predicts and the actual outcome.
The smaller the residuals, the more accurate the regression equation is at predicting.
A standardised residual is a residual converted to a z-score. We then know 95% of z-scores should lie between -1.96 and 1.96, 99% between -2.58 and 2.58, and 99.9% between -3.29 and 3.29. Thus a standardised residual with an absolute value > 3.29 is extremely unlikely. If more than 1% of sample cases have standardised residuals with absolute values > 2.58, there is evidence that the level of error within the model may be unacceptable. And if more than 5% of cases have standardised residuals with absolute values > 1.96, the model may be a very poor representation of the data.
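A sketch of this screening rule on made-up data with one injected outlier, treating the standardised residual simply as a z-score of the residuals (one common convention; regression software may standardise slightly differently):

```python
import numpy as np

# Made-up data: a clean linear trend with one injected gross error
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[10] += 15.0                      # the outlier

# Fit the regression line and compute raw residuals
b, a = np.polyfit(x, y, 1)
resid = y - (b * x + a)

# Standardise the residuals as z-scores
z = (resid - resid.mean()) / resid.std(ddof=1)

# Screening counts for the cut-offs above
n_over_196 = int(np.sum(np.abs(z) > 1.96))
n_over_258 = int(np.sum(np.abs(z) > 2.58))
print(n_over_196, n_over_258, round(np.abs(z).max(), 2))
```

Only the injected case crosses the cut-offs, and its standardised residual exceeds 3.29, flagging it as extremely unlikely under the model.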
residual variance and standard error of the estimate
These are also used to quantify the errors involved in making the prediction.
error variability
factors affecting correlations
includes:
range restrictions,
use of heterogeneous samples
distribution of scores on dichotomous scales
presence of outliers
range restriction
A range restriction of either X or Y (or both) is usually a conscious choice by the researcher for (hopefully) valid reasons. It means that only part of a variable’s range is explored in the regression relationship, which may be misleading. It usually results in a poor correlation, but occasionally inflates the correlation, and rarely gives a correlation in the opposite direction to what would be found if the entire range were considered.
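A small simulation (with an assumed true correlation of about .6, seeded for reproducibility) showing the usual outcome, where restricting the range of X deflates r:

```python
import numpy as np

# Simulated population with a true correlation of about .6
rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = 0.6 * x + 0.8 * rng.normal(size=2000)

r_full = np.corrcoef(x, y)[0, 1]

# Keep only the top of X's range (a range restriction on X)
keep = x > 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

Cutting away most of X’s variability shrinks the correlation markedly even though the underlying relationship is unchanged.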
heterogeneous samples
eg if you have 2 distinct groups but examine them together, this may lead to different correlations than if they were examined separately. If you suspect this might be the case, split the data into the appropriate groups, correlate separately, and see if this changes things.
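A contrived two-group example of this: within each group the correlation is perfectly positive, but pooling the groups produces a strong negative one:

```python
import numpy as np

# Two contrived groups: within each, y rises perfectly with x
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y1 = x1 + 10.0              # group A: low x, high y
x2 = x1 + 6.0               # group B: high x ...
y2 = x2 - 10.0              # ... but much lower y

# Pooled vs per-group correlations
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
r_pooled = np.corrcoef(x, y)[0, 1]
r_group_a = np.corrcoef(x1, y1)[0, 1]
r_group_b = np.corrcoef(x2, y2)[0, 1]

print(round(r_pooled, 2), round(r_group_a, 2), round(r_group_b, 2))
```

Splitting the data and correlating within each group reveals the true within-group relationship that the pooled analysis hides.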
unequal split of responses on a dichotomous measure
eg if you have 90% male and 10% female, the correlation between that measure and others will be deflated. The best solution is to delete the measure from further analysis.