Week 2: Correlation, Bivariate Regression & Preliminary Data Analyses Flashcards
Interpret correlations
r = Pearson’s correlation coefficient (or Pearson’s product-moment correlation coefficient). r ranges from -1 to 1: r = 0 indicates the absence of a linear relationship, r = 1 a perfect positive linear relationship, and r = -1 a perfect negative linear relationship.
r² = coefficient of determination = the strength of the relationship between 2 variables.
r² tells us the amount of variability in one variable that is shared by the other, and can be expressed as a % of the total variability.
eg if r = 0.50, then r² = 0.25, and therefore 25% of the variance in one variable is associated with the variance in the other. The 2 variables share 25% of the same information.
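As a quick illustration (the paired scores here are made up, not from the notes), Pearson’s r and r² can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical paired scores for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 7.0, 8.0])

# Pearson's r, taken from the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: proportion of shared variance
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For these invented scores r is close to 1, so nearly all of the variability in y is shared with x.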
Cohen’s effect size guides:
r = +/- 0.10-0.30 small effect
r = +/- 0.30-0.50 moderate effect
r = +/- 0.50-1 large effect. BUT whether an effect size should be viewed more leniently or stringently depends on the specific real-world situation it is applied to.
BUT for hypothesis testing, r still needs to be assessed for statistical significance. A t-test is used to assess the significance of r, but by APA convention this t-test is not reported. To report Pearson’s correlation coefficient, you need the size and direction of r, and the p value.
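A sketch of what actually gets computed and reported, using SciPy’s `pearsonr` (which returns r together with its two-tailed p value) on made-up data; the manual t shown alongside is the unreported test the notes mention:

```python
import numpy as np
from scipy import stats

# Made-up scores with a strong built-in linear trend
x = np.arange(1.0, 11.0)
y = 2.0 * x + np.array([0.1, -0.1, 0.2, 0.1, -0.2,
                        0.2, 0.1, -0.1, 0.0, 0.3])

# r and its two-tailed p value in one call
r, p = stats.pearsonr(x, y)

# The (unreported) t-test behind that p value: t = r*sqrt((n-2)/(1-r^2))
n = len(x)
t = r * np.sqrt((n - 2) / (1 - r ** 2))

print(f"r({n - 2}) = {r:.2f}, p = {p:.3g}")
```

An APA-style report would then give only the size and direction of r with its degrees of freedom (n - 2) and the p value, eg r(8) = .99, p < .001.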
general linear model (GLM)
This model is central to many different analyses, including correlation, regression, t-tests, and one-way & two-way ANOVA. Y = a + bX + error
outcome = model + error. The GLM assumes:
- all continuous variables are normally distributed
- each possible pair of variables is linearly related
bivariate (Pearson’s) correlation (r)
Used to examine the relationship between 2 variables. Assesses how well an association between 2 variables can be described via a linear relationship. The relationship is described by direction (+/-) and strength (magnitude). May be used when both variables are interval or ratio (continuous variables), or when 1 variable is continuous and the other is dichotomous (only 2 forms possible, eg male/female). Cannot validly evaluate non-linear relationships.
Spearman rank order correlation coefficient
Used if:
- the assumptions required for Pearson’s cannot be met, or
- one variable is measured with an ordinal scale (labels) with more than 2 options
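A minimal sketch contrasting the two coefficients on made-up monotonic but non-linear data, where Spearman’s rank-order coefficient captures the association better than Pearson’s:

```python
import numpy as np
from scipy import stats

# Made-up data: y increases monotonically with x, but not linearly
x = np.arange(1.0, 11.0)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)     # underestimates the association
spearman_rho, _ = stats.spearmanr(x, y) # perfect rank agreement

print(round(pearson_r, 3), round(spearman_rho, 3))
```

Because every increase in x is matched by an increase in y, the ranks agree perfectly and Spearman’s rho is 1, while Pearson’s r is pulled below 1 by the non-linearity.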
Parametric and non parametric data
With parametric data, certain assumptions (eg normal distribution) are made about the data, whereas with non-parametric data no such assumptions are made.
Understand and interpret bivariate regression
Bivariate linear regression is an extension of bivariate linear correlation. Regression uses the score on the independent (predictor) variable (X) to predict the score on the dependent (criterion) variable (Y).
If the correlation is high, prediction is reliable; if the correlation is low, prediction accuracy is poor.
Relationship between independent variable and dependent variable is summarised in a linear equation known as the regression line, or line of best fit.
yi = (b0 + b1xi) + ei
yi = the outcome to be predicted, b1 = the slope of the line, b0 = the intercept,
ei = error.
The error term is often left out of the equation, leaving:
ŷ = bx + a
b = slope, a = y intercept (the value of y when x = 0)
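The slope and intercept can be estimated by ordinary least squares, eg with NumPy’s `polyfit`; the predictor and criterion scores here are invented for illustration:

```python
import numpy as np

# Invented predictor (x) and criterion (y) scores
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares slope (b) and intercept (a) of the line of best fit
b, a = np.polyfit(x, y, 1)

# Predicted scores and residuals: y_hat = b*x + a, e = y - y_hat
y_hat = b * x + a
residuals = y - y_hat

print(f"y_hat = {b:.2f}x + {a:.2f}")
```

The fitted b estimates the average change in y per 1-unit change in x, and the residuals (which sum to zero for a least-squares fit) are the prediction errors.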
NOTE making the prediction is still NOT suggesting a causal relationship (although it may be possible) (demonstration of causal relationship is dependent upon study design not analysis type).
In regression we use R and R² (in correlation it is r and r²), but they can be interpreted the same way: as showing how much variance is shared between the 2 variables.
If R² = 0.25, this means that 25% of the variability in the DV can be predicted from its association with X. The remaining unpredicted portion (1 - R²) is defined as the error variability.
The slope of the regression line (b)
shows the amount of change in the dependent variable that corresponds to a 1-unit change in the independent variable.
gradient= rise/run=(change in y)/(change in x)
The slope is also referred to as the unstandardised regression coefficient (b); when the variables are first converted to z-scores, the slope is the standardised coefficient, beta (β).
If slope =0.50, then an increase of 1 unit of x, is associated with an average increase in y of 0.50 units.
A t-test is used to judge whether b is significantly different from 0, ie whether the change in the DV associated with a change in the IV is reliable, or due to chance or error.
NOTE here we are talking about a change in a variable as being a change in score across participants, we are NOT talking about a change in score occurring over a period of time in one individual.
the y intercept of the regression line
when x = 0
also called the constant
Point at which line crosses the vertical axis.
also evaluated via a t-test to see if it is significantly different from 0.
residual
is the difference between what the linear regression line predicts and the actual outcome.
The smaller the residuals, the more accurate the regression equation is at predicting.
A standardised residual is a residual converted to a z-score. We then know 95% of z-scores should lie between -1.96 and 1.96, 99% between -2.58 and 2.58, and 99.9% between -3.29 and 3.29. Thus a standardised residual with an absolute value > 3.29 is extremely unlikely. If more than 1% of sample cases have standardised residuals with absolute values > 2.58, there is evidence that the level of error within the model may be unacceptable. And if more than 5% of cases have standardised residuals with absolute values > 1.96, the model may be a very poor representation of the data.
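A sketch of this screening rule on made-up data with one injected outlier, treating the standardised residual simply as a z-score of the residuals (one common convention; regression software may standardise slightly differently):

```python
import numpy as np

# Made-up data: a clean linear trend with one injected gross error
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[10] += 15.0                      # the outlier

# Fit the regression line and compute raw residuals
b, a = np.polyfit(x, y, 1)
resid = y - (b * x + a)

# Standardise the residuals as z-scores
z = (resid - resid.mean()) / resid.std(ddof=1)

# Screening counts for the cut-offs above
n_over_196 = int(np.sum(np.abs(z) > 1.96))
n_over_258 = int(np.sum(np.abs(z) > 2.58))
print(n_over_196, n_over_258, round(np.abs(z).max(), 2))
```

Only the injected case crosses the cut-offs, and its standardised residual exceeds 3.29, flagging it as extremely unlikely under the model.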
residual variance and standard error of the estimate
These are also used to quantify the errors involved in making the prediction.
error variability
factors affecting correlations
includes:
range restrictions,
use of heterogeneous samples
distribution of scores on dichotomous scales
presence of outliers
range restriction
A range restriction of either X or Y (or both) is usually a conscious choice by the researcher for (hopefully) valid reasons. It means that only part of a variable’s range is explored in the regression relationship, which may be misleading. It usually results in a poor correlation, but occasionally inflates the correlation, and rarely gives a correlation in the opposite direction to what would be found if the entire range were considered.
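A small simulation (with an assumed true correlation of about .6, seeded for reproducibility) showing the usual outcome, where restricting the range of X deflates r:

```python
import numpy as np

# Simulated population with a true correlation of about .6
rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = 0.6 * x + 0.8 * rng.normal(size=2000)

r_full = np.corrcoef(x, y)[0, 1]

# Keep only the top of X's range (a range restriction on X)
keep = x > 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```

Cutting away most of X’s variability shrinks the correlation markedly even though the underlying relationship is unchanged.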
heterogeneous samples
eg if you have 2 distinct groups but examine them together, this may lead to different correlations than if they were examined separately. If you suspect this might be the case, split the data into the appropriate groups, correlate separately, and see if this changes things.
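A contrived two-group example of this: within each group the correlation is perfectly positive, but pooling the groups produces a strong negative one:

```python
import numpy as np

# Two contrived groups: within each, y rises perfectly with x
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y1 = x1 + 10.0              # group A: low x, high y
x2 = x1 + 6.0               # group B: high x ...
y2 = x2 - 10.0              # ... but much lower y

# Pooled vs per-group correlations
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
r_pooled = np.corrcoef(x, y)[0, 1]
r_group_a = np.corrcoef(x1, y1)[0, 1]
r_group_b = np.corrcoef(x2, y2)[0, 1]

print(round(r_pooled, 2), round(r_group_a, 2), round(r_group_b, 2))
```

Splitting the data and correlating within each group reveals the true within-group relationship that the pooled analysis hides.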
unequal split of responses on a dichotomous measure
eg if you have 90% male and 10% female, the correlation between that measure and others will be deflated. The best solution is to delete the measure from further analysis.