Page 8 - Preparing for Analysis Flashcards
Before conducting a statistical analysis you need to check your data for…
- Accuracy of data entry
- Missing data
- Outliers
- Normality
- Linearity, homoscedasticity, homogeneity of variance
- Independence
- Multicollinearity and singularity (MANOVA and multiple regression)
- Other assumptions
What can you use to check data entry?
SPSS procedures, such as Frequencies
You must consider the … and … of missing data
Amount and pattern
Amount of data missing
If more than 5% of the data are missing, check the patterns
What kind of patterns of missing data?
MCAR (missing completely at random)
MAR (Missing at random)
MNAR (Missing not at random)
MAR
A pattern of missingness predictable from other variables in the data set
MNAR
A pattern of missingness related to the variable itself
Little’s MCAR test - when is data MCAR?
If the p value is above 0.05 (non-significant, i.e., the missingness does not differ from a completely random pattern)
Missing data can be handled by?
List-wise deletion
Mean substitution
Expectation maximisation
Multiple imputation
List-wise deletion is used when
Few cases are missing
Variables are not critical to your analysis
Data are missing at random
Missing data on a different variable
Mean substitution
Replacing the missing value with the mean of cases across items
Not highly recommended; imputed values sit at the mean, which can distort the distribution
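A minimal sketch of mean substitution in Python, using made-up item scores (None marks a missing case):

```python
# Mean substitution: fill each missing value (None) with the mean of the
# observed cases. Made-up item scores for illustration.
scores = [4, 5, None, 3, None, 5]

observed = [s for s in scores if s is not None]
mean = sum(observed) / len(observed)

filled = [mean if s is None else s for s in scores]
# Note: every imputed case sits exactly at the mean, shrinking the variance.
```

This makes the card's caveat concrete: the imputed cases add no spread of their own.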
Expectation maximisation
Estimates the shape of the distribution and infers the likelihood of the missing value falling within that distribution
The simplest and most reasonable approach when data are missing at random
Multiple imputation
Uses regression to predict values based on other variables in your dataset
Most respectable; can be used with MNAR and MCAR data
More difficult
An outlier is…
A case with such an extreme value on one variable (univariate) or such a strange combination of scores on two or more variables (multivariate) that it distorts statistics
Can lead to Type I (false positive) and Type II (false negative) errors
When can an outlier occur?
Participant interpreted the question incorrectly
Experimenter error
Participant’s answer comes from a different population
Population of participants has extreme values and is not normally distributed
Checking univariate outliers
Frequency distribution in histogram
Box-plots
Normal probability plots
Calculating standardised scores (z-scores beyond ±3.29)
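The z-score check from the last bullet can be sketched by hand in Python (made-up data; the ±3.29 cutoff is the one the card uses):

```python
import statistics

# Flag univariate outliers via standardised scores: |z| > 3.29.
# Made-up data: 28 typical scores plus one extreme value.
data = [10, 11, 12, 13] * 7 + [60]

mean = statistics.mean(data)
sd = statistics.stdev(data)   # sample standard deviation

z_scores = [(x - mean) / sd for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3.29]
```

Note that in very small samples the maximum possible z-score is bounded, so a sizeable n is needed for the 3.29 rule to flag anything.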
Before checking univariate outliers, determine…
Ungrouped (correlations, regression, and factor analysis) or,
Split by group (t-tests, ANOVAs, ANCOVAs, MANOVAs, logistic regression, and discriminant analysis)
Checking multivariate outliers
Mahalanobis distance (evaluated against the chi-squared (χ²) distribution)
Leverage
Discrepancy
Influence (Cook’s distance)
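A pure-Python sketch of the Mahalanobis distance for the two-variable case, with the 2×2 covariance inverse written out by hand (made-up data; df for the χ² comparison equals the number of variables, and 13.82 is the df = 2 critical value at p < .001):

```python
# Made-up bivariate data: the last case is not extreme on either variable
# alone, but its *combination* of scores is unusual.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2]
ys = [2, 3, 4, 5, 6, 7, 8, 9, 10, 9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix entries (divide by n - 1).
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

det = sxx * syy - sxy ** 2            # determinant of the 2x2 matrix
inv00, inv01, inv11 = syy / det, -sxy / det, sxx / det  # inverse covariance

# Squared Mahalanobis distance for each case: d' S^-1 d.
d2 = []
for x, y in zip(xs, ys):
    dx, dy = x - mx, y - my
    d2.append(inv00 * dx * dx + 2 * inv01 * dx * dy + inv11 * dy * dy)

# Each D^2 is compared against a chi-squared critical value with
# df = number of variables (df = 2 here; 13.82 at p < .001).
```

Here the last case has by far the largest D², even though neither of its scores is extreme on its own.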
Five methods of addressing Outliers
Ignoring data points
Deleting individual Data points
Running analysis with and without outlier/s
Modification to reduce the bias through winsorising or trimming (univariate only)
Transforming data for large data sets (univariate; normally too complex for multivariate)
Kurtosis
Peakedness of distribution
Positive (Leptokurtic) - High
Negative (Platykurtic) - Flat (looks like flat platypus)
https://img.tfd.com/mk/K/X2604-K-11.png
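Excess kurtosis can be computed from the central moments; a short sketch with made-up samples:

```python
# Excess kurtosis from the moment formula: m4 / m2**2 - 3.
# Positive -> leptokurtic (peaked); negative -> platykurtic (flat).
def excess_kurtosis(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # variance (population form)
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # uniform-like -> platykurtic
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 1, 9]   # heavy centre -> leptokurtic
```

A normal distribution has excess kurtosis 0 under this formula.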
What tests used for normality?
Kolmogorov-Smirnov (outdated)
Shapiro-Wilk
Weakness of normality tests
As sample size and statistical power increase, the tests can appear statistically significant (suggesting skewness or other non-normality) despite the data being approximately normally distributed
What does Field recommend when testing for normality?
You assess the extent of non-normality in your data using ‘converging evidence’ from multiple techniques
Box-plots, histograms, and/or normality tests
Non-normal data is more likely to result in Type 1, or Type 2 errors?
Type 1 (False Positive, incorrect rejection of the H0)
Is there a normality test for multivariate normality in SPSS?
No; however, if Shapiro-Wilk is non-significant, multivariate normality can be assumed
What is normality of residuals
Used for ungrouped data
The differences between observed and expected values on a variable should be normally distributed
What is linearity
A straight-line relationship between variables
*Seen via a bivariate scatterplot, or residual plots (multiple regression / linearity of residuals)
Field (2018) states that linearity is one of the most important assumptions to meet for your analyses, as it underpins the process that you want to model. True or False?
True
What is Homoscedasticity
Used for ungrouped data
The assumption in regression analysis that the residuals along the continuum of scores for the predictor variable are fairly consistent and have similar variances
What is Homogeneity of Variance
Same as homoscedasticity, but for grouped data
‘The assumption that the variance of one variable is stable at all levels of another variable.’ A linear relationship, not exponential, etc.
What is independence of observations
Each participant only participates once in the research + no influence of participants on other participants
What is independence of residuals/errors
Errors in your model are not related to each other
The Durbin-Watson test statistic is used to check this; values between 1 and 3 are preferred (2 = no autocorrelation)
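The Durbin-Watson statistic is simple enough to compute by hand; a sketch with two made-up residual series:

```python
# Durbin-Watson statistic for a series of model residuals:
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# DW near 2 indicates independent errors; values toward 0 or 4 indicate
# strong positive or negative autocorrelation respectively.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

trending = [1, 2, 3, 4, 5, 6]        # positively autocorrelated -> DW near 0
alternating = [1, -1, 1, -1, 1, -1]  # negatively autocorrelated -> DW toward 4
```

Both made-up series fall outside the 1-3 rule of thumb, so both would fail the independence check.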
What are multicollinearity and singularity
Problems with a correlation matrix that occur when variables are too highly correlated
Multicollinearity
Variables are very highly correlated (>0.8)
Singularity
Variables are redundant, one variable is a combination of two or more other variables
What is Additivity?
The combined effect of individual predictors on an outcome variable is best represented by adding these individual effects together
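The additive model can be shown in a few lines of Python; the coefficients are made up for illustration:

```python
# Additivity in a linear model: the prediction is the sum of each predictor's
# separate contribution. Hypothetical intercept and slopes.
b0, b1, b2 = 2.0, 0.5, 1.5

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# Raising x1 by 1 adds b1; raising x2 by 1 adds b2;
# raising both adds exactly b1 + b2 - there is no interaction term.
combined = predict(1, 1) - predict(0, 0)
separate = (predict(1, 0) - predict(0, 0)) + (predict(0, 1) - predict(0, 0))
```

If the combined effect did not equal the sum of the separate effects, an interaction term would be needed and additivity would fail.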
What test is used to check for homegeneity of variance?
Levene’s test
Non-significant = homogeneity of variance
Significant = heterogeneity of variance
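The Levene statistic itself is a one-way ANOVA on absolute deviations from the group means; a pure-Python sketch with made-up groups (SPSS converts the statistic to a p value via the F distribution, which is omitted here):

```python
# Levene's test statistic (mean-centred form): an ANOVA carried out on the
# absolute deviations z_ij = |x_ij - group mean|. Large W suggests
# heterogeneity of variance.
def levene_w(*groups):
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

narrow = [1, 2, 3, 4, 5]       # small spread
wide = [10, 20, 30, 40, 50]    # much larger spread -> large W
```

Two groups with identical spreads give W = 0, i.e., perfect homogeneity of variance.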
Trimming Data
Deletion of cases with extreme values
- Percentage based
- Standard deviation based
Winsorising
Extreme scores are replaced with a value that is not as extreme
- The next highest or lowest score
- Replace with the next highest/lowest score that is not an outlier
- Replace with a score ±3.29 standard deviations from the mean
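Percentage-based trimming and winsorising can be contrasted in a short sketch (made-up sample with two extreme scores):

```python
# Percentage-based trimming vs winsorising.
def trim(data, pct):
    """Drop the lowest and highest pct proportion of cases."""
    s = sorted(data)
    k = int(len(s) * pct)
    return s[k:len(s) - k]

def winsorise(data, pct):
    """Replace the lowest/highest pct of cases with the nearest kept score."""
    s = sorted(data)
    k = int(len(s) * pct)
    low, high = s[k], s[-k - 1]
    return [min(max(x, low), high) for x in data]

data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 90]  # 1 and 90 are extreme
```

Trimming shrinks the sample; winsorising keeps every case but pulls the extremes in to the nearest retained score.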
Non-parametric statistics
Do not rely on normally distributed data - e.g., Spearman’s correlation
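Spearman's correlation works on ranks rather than raw scores; a sketch using the rank-difference formula (valid when there are no tied scores):

```python
# Spearman's rho via the rank-difference formula (no tied scores):
# rho = 1 - 6 * sum(d_i**2) / (n * (n**2 - 1)), d_i = rank difference.
def spearman_rho(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relationship (e.g., y = x**2) still gives rho = 1,
# which is why no normality or linearity assumption is needed.
```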
Robust methods
Trimmed mean/M-Estimator
Bootstrapping (estimates parameters of the sampling distribution based on the sample data)
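A minimal percentile-bootstrap sketch for a confidence interval around the mean (made-up scores; the resample count and seed are arbitrary choices):

```python
import random

# Percentile bootstrap: resample with replacement many times; the spread of
# the resampled statistic (here the mean) estimates its sampling distribution.
def bootstrap_ci(data, n_boot=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)   # fixed seed for reproducibility
    means = sorted(
        sum(rng.choices(data, k=len(data))) / len(data)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]              # 2.5th percentile
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]    # 97.5th percentile
    return lo, hi

sample = [4, 5, 6, 5, 7, 4, 6, 5, 8, 5]   # made-up scores, mean 5.5
lo, hi = bootstrap_ci(sample)
```

Because no distributional shape is assumed, the interval comes entirely from the resampled data, which is what makes the method robust.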