1.8 Preparing for analysis Flashcards
Before conducting a statistical analysis you need to check your data for eight things:
- Accuracy of data entry,
- Missing data,
- Outliers,
- Normality,
- Linearity, homoscedasticity, and homogeneity of variance,
- Independence,
- Multicollinearity and singularity (MANOVA and multiple regression), and
- Other assumptions.
Missing data may be addressed through a range of approaches such as
list-wise deletion, mean substitution, expectation-maximization, and multiple imputation.
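A minimal sketch in Python (pandas) of the two simplest of these approaches, run on a small hypothetical data frame; expectation-maximization and multiple imputation need dedicated routines (available in packages such as scikit-learn or statsmodels) and are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value on 'anxiety'
df = pd.DataFrame({
    "anxiety": [12.0, 15.0, np.nan, 9.0, 14.0],
    "stress":  [20.0, 25.0, 18.0, 16.0, 22.0],
})

# List-wise deletion: drop any case with a missing value on any variable
listwise = df.dropna()

# Mean substitution: replace each missing value with that variable's mean
mean_substituted = df.fillna(df.mean(numeric_only=True))

print(listwise)
print(mean_substituted)
```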
As defined by Tabachnick and Fidell (2019, p. 63), an outlier is
“a case with such an extreme value on one variable (a univariate outlier) or such a strange combination of scores on two or more variables (multivariate outlier) that it distorts statistics”
If not identified and addressed, outliers can lead to
Both Type I and Type II errors.
There are several ways that outliers can be addressed, including
- ignoring (non-influential) data points (univariate, multivariate),
- deleting individual data points, if the sample size can accommodate this (univariate, multivariate),
- running the analysis with and without the outlier/s to justify retaining the outlier/s (univariate, multivariate),
- modifying the data to reduce bias through winsorizing or trimming (univariate; see the sketch after this list), and
- transforming data for large data sets (univariate; can be extremely complex for multivariate).
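A minimal sketch of winsorizing and trimming with SciPy, run on a small hypothetical set of scores containing one obvious univariate outlier:

```python
import numpy as np
from scipy.stats import mstats, trim_mean

scores = np.array([4, 5, 5, 6, 6, 7, 7, 8, 8, 31])  # 31 is a univariate outlier

# Winsorizing: the most extreme 10% of cases at each tail are pulled in
# to the nearest retained value (31 becomes 8; the lowest 4 becomes 5)
winsorized = mstats.winsorize(scores, limits=[0.10, 0.10])

# Trimming: the mean is computed after discarding 10% of cases at each tail
trimmed_mean = trim_mean(scores, proportiontocut=0.10)

print(winsorized)
print(trimmed_mean)
```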
Occasionally, new multivariate outliers may be identified following deletion of the original outliers. This happens because once you remove an outlier, the data set becomes more consistent and new data points will become
extreme points
Distributional information, such as skewness and kurtosis values, can provide indicators of
symmetry and peakedness of a variable’s distribution
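For example, skewness and kurtosis can be computed with SciPy; the variable below is simulated from an exponential distribution, which has a long right tail and is therefore positively skewed:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=500)  # simulated, positively skewed variable

print(skew(x))      # > 0 indicates positive skew (tail towards higher scores)
print(kurtosis(x))  # excess kurtosis: > 0 is leptokurtic, < 0 is platykurtic
```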
Skewness relates to the
symmetry of the distribution
Positive skew is depicted when most scores are clustered at the
lower end of the distribution
Kurtosis refers to the
peakedness of the distribution
A distribution with positive kurtosis is described as ________ and a distribution with negative kurtosis is described as:
leptokurtic; platykurtic
Screening the residuals for normality is common practice when conducting data analyses for
ungrouped data
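A brief sketch of this practice using statsmodels and SciPy on simulated regression data: fit the model, extract the residuals, and screen them (Shapiro-Wilk is used here as one common choice; a histogram or Q-Q plot of the residuals is an equally reasonable check):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)  # simulated outcome

# Fit a simple regression and take its residuals
model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Shapiro-Wilk: a significant p-value suggests the residuals are not normal
stat, p = shapiro(residuals)
print(stat, p)
```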
Linearity (straight-line relationships between variables) can be observed graphically through
bivariate scatterplots
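For example, a bivariate scatterplot can be drawn with matplotlib on simulated data; a roughly elliptical cloud of points suggests the straight-line relationship the assumption requires:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.8, size=200)  # simulated linear relationship

plt.scatter(x, y, alpha=0.6)
plt.xlabel("Predictor")
plt.ylabel("Outcome")
plt.title("Bivariate scatterplot for checking linearity")
plt.show()
```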
For ungrouped data, the assumption of homoscedasticity refers to
the assumption in regression analysis that the variances of the residuals are fairly consistent (i.e. similar) across the continuum of scores on the predictor variable
For grouped data, homogeneity of variance is
the assumption that the variance of one variable is stable (i.e. relatively similar) at all levels of another variable
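One common check of this assumption for grouped data is Levene's test; a minimal sketch with SciPy on three simulated groups (the groups and scores are hypothetical):

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(3)
group_a = rng.normal(loc=50, scale=10, size=40)
group_b = rng.normal(loc=55, scale=10, size=40)
group_c = rng.normal(loc=60, scale=10, size=40)

# Levene's test: a significant p-value suggests the group variances differ
stat, p = levene(group_a, group_b, group_c)
print(stat, p)
```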
There are two types of independence assumptions often referred to in statistics, which are
Independence of Observations and Independence of Residuals/Errors
Independence of Observations requires each participant to
participate only once in the research and as such only contribute one set of data
the assumption of Independence of Residuals/Errors is that the
errors in your model are not related to each other
the Durbin-Watson test statistic is used to
assess for serial correlations (autocorrelation) of errors
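A minimal sketch with statsmodels on simulated regression data; as a rough rule of thumb, values near 2 suggest no serial correlation, while values towards 0 or 4 suggest positive or negative autocorrelation respectively:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 3.0 + 0.5 * x + rng.normal(size=100)  # simulated outcome

model = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson statistic computed on the model residuals
print(durbin_watson(model.resid))
```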
Multicollinearity and singularity are
problems with a correlation matrix that occur when variables are too highly correlated. With multicollinearity, the variables are very highly correlated (say, above .80); with singularity, the variables are redundant: one of the variables is a combination of two or more of the other variables.
Investigation of Tolerance and Variance Inflation Factors can help determine
whether multicollinearity is a problem within your sample
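A sketch of how VIF (and Tolerance, its reciprocal) might be computed with statsmodels; the predictors x1, x2 and x3 are simulated, with x2 deliberately constructed to correlate highly with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(9)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)  # highly correlated with x1
x3 = rng.normal(size=200)

X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept term is not of interest here
    vif = variance_inflation_factor(X.values, i)
    tolerance = 1.0 / vif  # Tolerance is the reciprocal of the VIF
    print(name, round(vif, 2), round(tolerance, 2))
```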
the assumption of sphericity relates to
repeated measures ANOVA and mixed model ANOVA designs
Sphericity assumes that
variances of the differences between data taken from the same participant are equal
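The idea can be illustrated directly by computing the variance of each set of difference scores; the repeated-measures data below are hypothetical, and formal tests of sphericity (e.g. Mauchly's test) are provided by dedicated statistics packages:

```python
import numpy as np
from itertools import combinations

# Hypothetical repeated-measures data: rows are participants, columns are conditions
scores = np.array([
    [8, 10, 13],
    [6,  9, 11],
    [7,  8, 12],
    [9, 12, 14],
    [5,  7, 10],
])

# Sphericity holds when the variances of these difference scores are roughly equal
for i, j in combinations(range(scores.shape[1]), 2):
    diff = scores[:, i] - scores[:, j]
    print(f"var(condition {i + 1} - condition {j + 1}) = {diff.var(ddof=1):.2f}")
```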
Field (2018, p. 283) suggests that nonparametric statistics based on ranks are not affected by
small sample sizes, extreme scores, and outliers, and they do not require a normally distributed sample
Allen, Bennett & Heritage (2018) suggest that non-parametric tests should be used with
ordinal data, and/or where the sample is not normally distributed
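For example, the Mann-Whitney U test (a rank-based alternative to the independent-samples t-test) can be run with SciPy; the two groups below are simulated from a skewed distribution:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(11)
group_a = rng.exponential(scale=2.0, size=30)  # skewed, non-normal scores
group_b = rng.exponential(scale=3.0, size=30)

# Mann-Whitney U compares the groups using ranks rather than raw scores
stat, p = mannwhitneyu(group_a, group_b)
print(stat, p)
```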