Week 3 - Parametric test assumptions Flashcards
What are the features of a parametric test?
- assess group means
- data must have normal distribution (+CLT)
- assumes equal (homogeneous) variances across groups
- more powerful
What are the features of a non-parametric test?
e.g. rank-based tests such as Spearman's correlation
- assess group MEDIANS
- data doesn’t need to be normally distributed
- can handle small sample size
Questions to ask yourself when deciding to use a parametric test or not
- sample size
- best way to measure central tendency (e.g. median or mean?)
What are the parametric test assumptions? (4)
- Additivity and linearity
- Normality (Gaussian distribution/Bell curve)
- Homogeneity of variances
- Independence of observations
Describe the assumption of Additivity and linearity
Involves a standard linear model/ equation (describing a straight line)
What is the Standard linear model equation
Yi = b0 + b1X1 + Ei
Yi= the ith person’s score on the outcome variable
b0 = Y-intercept. Value of Y when X = 0; the point at which the regression line crosses the y-axis
b1 = regression coefficient for the first predictor (b2 for the second predictor).
- Gradient (slope/ rise over run) of the regression
- Direction/ strength of relationship
Ei= the difference between the actual and predicted value of Y for the ith person
- residual/ error
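The equation above can be sketched numerically. A minimal illustration with made-up data (the values and variable names mirror the card's notation but are not from the course):

```python
import numpy as np

# Hypothetical data (illustrative only): X1 predicts Y
rng = np.random.default_rng(42)
X1 = rng.normal(50, 10, size=100)            # predictor scores
Y = 2.0 + 0.5 * X1 + rng.normal(0, 3, 100)   # outcome with random error

# Estimate b0 (intercept) and b1 (slope) by least squares
b1, b0 = np.polyfit(X1, Y, deg=1)

predicted = b0 + b1 * X1    # predicted Y for each person
residuals = Y - predicted   # Ei = actual minus predicted value

print(b0, b1)               # b1 should land near the true slope of 0.5
print(residuals.mean())     # least-squares residuals average ~0
```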
What does it mean for data to be linear and additive?
- X1 and X2 predict Y.
- The outcome is a linear function of the predictors (X1 + X2)
- predictors are added together and their effects do not depend on the values of other variables (unlike in a multiplicative model)
The outcome Y is an additive combination of the effects of X1 and X2. e.g. as both X1 and X 2 increase, Y increases also
True or false:
The outcome Y is an additive combination of the effects of X1 and X2. e.g. as both X1 and X 2 increase, Y increases also
true
How can we assess linearity?
- plot observed vs predicted values (points should fall symmetrically around a diagonal line)
- plot residuals vs predicted values (points should fall symmetrically around a horizontal line at 0)
How can we fix non-linearity?
- apply a nonlinear transformation to the variables
- add a nonlinear function of a regressor (e.g. a polynomial term) to fit a curve
- examine moderators
Describe the assumption of Normality
relevant to:
- parameters (sampling distribution)
- residuals/ error terms
- -> confidence intervals around parameter
- -> Null hypothesis significance testing
What is Central Limit Theorem (CLT)?
As the sample size increases toward infinity (gets larger), the sampling distribution approaches normal.
–> sample means will be normally distributed thus you don’t need to worry too much about the distribution that the samples came from.
–> distribution of means from many samples and re-samples
–>sample size must be AT LEAST 30
For CLT to apply, what size must the sample size be?
At least 30
True or false
According to CLT -
Even if the data is not normal, the sampling distribution of the mean will be approximately normal
True
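The CLT can be seen in a quick simulation. A sketch with made-up, clearly non-normal (exponential) data, not from the lecture:

```python
import numpy as np

# CLT sketch: sample repeatedly (n = 30 each time) from a positively
# skewed exponential population with mean 1.0
rng = np.random.default_rng(0)

sample_means = np.array([
    rng.exponential(scale=1.0, size=30).mean()   # one sample of n = 30
    for _ in range(5000)
])

# The distribution of the 5000 sample means is roughly normal and
# centred on the population mean, even though the raw data are skewed
print(sample_means.mean())   # close to 1.0
```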
True or false
Positively skewed data gathers on the left side and scores bunch at the low values with tails pointing to high values
true
True or false
Negatively skewed data gathers on the left side and scores bunch at the low values
false - it gathers on the RIGHT side (e.g. as you grow, conditions get "worse" in life)
scores bunch at the high values with tails pointing to low values
What is kurtosis?
The amount which data clusters in either the tails (ends) or the peak (tallest part) of the distribution
- heaviness of tails
Draw the following:
Negative Kurtosis
Positive Kurtosis
Normal distribution
Leptokurtic (heavy tails)
Mesokurtic
Platykurtic (light tails)
draw on paper
What are properties of frequency distributions?
- Skewness
- Kurtosis
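Both properties can be computed numerically. A sketch with simulated data (note scipy's `kurtosis` reports *excess* kurtosis, which is 0 for a normal distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=5000)
skewed_data = rng.exponential(size=5000)  # bunches at low values, tail to the right

print(stats.skew(normal_data))      # ~0: symmetric
print(stats.skew(skewed_data))      # clearly positive skew
print(stats.kurtosis(normal_data))  # excess kurtosis ~0 (mesokurtic)
```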
Checking the distribution to determine if the assumption of normality is met is important. Which graphical displays are used to test for normality?
Q-Q plots (dots on straight line = normal)
Histograms
What is the name for the software (e.g. JASP) based method for testing for normality?
Shapiro-Wilk test
Describe the Shapiro-Wilk test and what a p value of <0.05 means
- tests if data is different from normal distribution
- p < 0.05 = data varies significantly from normal distribution thus normality is violated
In the Shapiro-Wilk test, what does a p value >0.05 mean?
Data does not vary significantly from a normal distribution, thus the normality assumption is not violated
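In Python the same check can be run with `scipy.stats.shapiro` (a sketch with simulated samples; the data are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=100)
skewed_sample = rng.exponential(size=100)  # clearly non-normal

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

print(p_skew < 0.05)  # True: normality violated for the skewed sample
```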
Describe the assumption of homogeneity of variance
Assumes all groups or data points have the same or equal variances = the assumption of equal variances
What does homoscedasticity mean?
All groups have equal/ similar variances
What does heteroscedasticity mean?
All data points/ groups do NOT have equal variances. = unequal variances
Define the “error”
The deviation of an observed value from the regression line
- the difference between what we predicted Y would be based on its X value and what we actually observed in the data
Describe the assumption of independence of observation
Assumes that you do not have repeated measures of data.
- residuals (errors) are unrelated
- assume based on study design
According to the assumption of independence of observations, what happens when observations are non-independent?
results in downwardly biased standard errors (too small), thus incorrect statistical inferences (p values < 0.05 when they should be > 0.05)
–> false significant p values
—> this is why it is important to know study design
—> important for mean values of the outcome to come from a different person or other unit (e.g. family, school)
What is a univariate outlier?
outlier when considering only the distribution of the variable it belongs to
What is a bivariate outlier?
outlier when considering the joint distribution of two variables
- breaking away from the pattern of the association between two variables
What is a multivariate outlier?
outliers when simultaneously considering multiple variables.
What type of outlier is difficult to assess using numbers or graphs?
multivariate outliers
What types of outliers bias the mean and inflate the standard deviation?
Univariate outliers
What types of outliers bias the RELATIONSHIP between two variables e.g. change the strength
bivariate outliers
What are the three ways to deal with outliers?
REMOVE the case or trim the data
TRANSFORM the data
CHANGE the score (winsorizing) pulling the data in e.g. biological data (must be transparent about it when reporting results)
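A minimal winsorizing sketch in Python (the scores and the percentile cut-offs are illustrative choices, not a fixed rule):

```python
import numpy as np

# Hypothetical scores with one extreme high outlier
scores = np.array([3, 4, 4, 5, 5, 6, 6, 7, 40], dtype=float)

# Winsorize: cap scores beyond chosen percentiles, "pulling the data in"
lo, hi = np.percentile(scores, [5, 95])
winsorized = np.clip(scores, lo, hi)

print(winsorized.max() < scores.max())  # True: the 40 has been pulled in
```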
What are some reasons for transforming data?
- ease of interpretation - standardisation e.g. z-scores allow for simpler comparisons
- reducing skewness - closer to normality
- equalising spread/ improving homogeneity of variances
- linearising relationships between variables - to fit non-linear relationships into linear models
- making relationships additive therefore fulfilling assumptions for certain tests
Do linear transformations change the shape of the distribution ?
What do they change?
No
Changes the value of the mean/ SD but shape remains unchanged
How do linear transformations work?
- adding constant to each number, x + 1
- converting raw scores to z-scores (x-m)/SD
- mean centring, x- m
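The three linear transformations above, sketched with toy numbers:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

shifted = x + 1                     # adding a constant
z = (x - x.mean()) / x.std(ddof=1)  # z-scores: mean 0, SD 1
centred = x - x.mean()              # mean centring: mean 0, SD unchanged

# The shape is unchanged: only the mean (and, for z-scores, the SD) moves
print(z.mean(), z.std(ddof=1))
print(shifted.std(ddof=1), x.std(ddof=1))  # spread identical after shifting
```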
What type of transformation changes the shape of the distribution?
non-linear transformations
- Log, log(X) or ln(x)
- Square root of x
- Reciprocal, 1/x
When would you use a log transformation [log(x)]?
- reduce positive skew
- stabilise variance
- only defined for positive values
When would you use a square root transformation?
- reduce positive skew
- stabilise variance
- defined for zero/ positive values
When would you use a reciprocal transformation?( 1/x)
- reduce impact of large scores
- stabilise variance
- it reverses the scores; this can be avoided by reversing the scores before transforming: 1/(Xhighest - X)
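The three non-linear transformations compared on simulated positively skewed data (a sketch; the scale and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=1000)  # positively skewed, all values > 0

log_x = np.log(x)    # log transformation
sqrt_x = np.sqrt(x)  # square root transformation
recip_x = 1 / x      # reciprocal - note it reverses the ordering of scores

print(stats.skew(x) > stats.skew(sqrt_x))  # True: skew reduced
print(np.argmax(x) == np.argmin(recip_x))  # True: largest score becomes smallest
```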
What are the negatives of transforming data?
- non-linear transformations (used to normalise distributions e.g. log, square root, reciprocal) CHANGE the data & results -> a 1-unit increase on the natural log scale is not the same as a 1-unit increase in the raw scores
- transformation can hinder analysis if the wrong transformation is applied
- makes interpretation difficult (dealing with both raw scores and transformed scores)