Week 1, Topic 1: Screening and Cleaning Data Flashcards
1) Why is Data Screening Important?
- An important (often time-consuming!) precursor to any serious data analysis.
- Crucial to check relevant statistical assumptions for any subsequent analyses.
- Data screening can often provide an important first insight into the key variables of your study.
2) Data Screening:
What should be your first step?
• The first step in data screening should involve checking for the accuracy of the data entry:
– Out-of-range values
– Implausible values
– Check accuracy of coding in SPSS
• The Frequencies procedure in SPSS (Analyze → Descriptive Statistics → Frequencies) is often useful for the above checks.
3) Missing Data:
It is problematic because it may reduce the representativeness of your data.
It is a problem in many areas of psychological research - give 4 examples:
- Participant attrition (sustained pressure)
- Items/tasks not completed
- Completed data misplaced
- Equipment malfunctions
4)
Why is meticulous data collection important?
– Ensure participants complete all tasks/questionnaires
– Remind them to check that they’ve completed everything
– Program your task to minimize missing data
5)
Describe - missing completely at random (MCAR):
- Cause of missing data is independent of other variables in the study
- Non-missing data is representative of total data
- Best possible situation, but rare
- Should not cause any problems if the loss is relatively small (<5%) in moderate and large datasets.
6) Describe - missing at random (MAR):
- The pattern of missingness is predictable from other variables in the dataset.
- For instance, older patients might be less likely to complete a certain questionnaire: the missingness is predictable from age (which is recorded), not from the questionnaire scores themselves.
- If <5% of the data are missing, most missing-value procedures yield similar results.
7)
Describe - missing not at random (MNAR):
- Non-random missingness.
- The value of the variable is related to the reason why it's missing.
- For instance, patients are less likely to complete a questionnaire because of their scores on that questionnaire.
- Can seriously bias results if left as is.
8) SPSS Missing Value Analysis:
What is a logical approach?
- Separate cases into missing and non-missing groups and compare them on the other variables, e.g. a t-test with the IV (missing vs. non-missing) on all other variables (see the sketch below).
- Little's MCAR test: if non-significant, assume MCAR; if significant, but missingness is predictable from other variables (other than the DV), assume MAR. If the t-test is significant for missingness across DVs, assume MNAR.
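A minimal sketch of this group-comparison idea in Python (not part of the SPSS Missing Value Analysis itself); it assumes a pandas DataFrame `df`, and the column names are illustrative only:

```python
# Compare cases that are missing vs. non-missing on `target` across other variables.
import pandas as pd
from scipy import stats

def compare_missingness(df, target, other_vars):
    missing = df[target].isna()
    results = {}
    for var in other_vars:
        grp_missing = df.loc[missing, var].dropna()
        grp_present = df.loc[~missing, var].dropna()
        t, p = stats.ttest_ind(grp_missing, grp_present, equal_var=False)
        results[var] = (t, p)
    return results

# e.g. does missingness on 'quest_score' relate to 'age' or 'anxiety'?
# compare_missingness(df, 'quest_score', ['age', 'anxiety'])
```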
9)
Dealing With Missing Data:
Omission (but still report it)
– Usually OK if missingness is low frequency (<5%) and MCAR/MAR, and the dataset is moderate to large (this is the default option in SPSS).
– Can be problematic in small datasets and in experimental designs, e.g. unequal group sample sizes, loss of power, etc.
– If MNAR, can distort results.
10)
Dealing With Missing Data
Estimate (‘impute’) missing data - knowledge of the area/previous research can guide the estimates.
Another option is mean substitution - what are the consequences?
– Mean substitution:
- Easy to implement
- A conservative option
- Reduces the variance of the variable (see the sketch below)
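A minimal sketch of mean substitution in Python, assuming a pandas Series with NaN marking the missing values (the numbers are made up for illustration):

```python
import pandas as pd

scores = pd.Series([4, 5, None, 7, None, 6, 3])
filled = scores.fillna(scores.mean())  # replace missing values with the observed mean

print(scores.var())  # variance of the observed values (2.5)
print(filled.var())  # smaller (about 1.67): imputed cases add no variability
```

The shrinking variance is the conservative trade-off noted above.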
11) Dealing With Missing Data
Estimate (‘impute’) missing data using Regression
- Use complete cases to derive a regression equation with a series of IVs predicting the relevant DV.
- Use equation to predict scores for missing cases.
- Tends to reduce variance and inflate relationships with other variables.
- Relies on relationships between DV and potential IVs.
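A minimal sketch of regression imputation in Python; the DataFrame and variable names ('quest_score', 'age', 'mood') are assumptions for illustration, not from the lecture:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, dv, ivs):
    # fit the regression on complete cases only
    complete = df.dropna(subset=[dv] + ivs)
    model = LinearRegression().fit(complete[ivs], complete[dv])
    # predict the DV for cases where it is missing but the IVs are observed
    out = df.copy()
    to_fill = out[dv].isna() & out[ivs].notna().all(axis=1)
    out.loc[to_fill, dv] = model.predict(out.loc[to_fill, ivs])
    return out

# df = regression_impute(df, 'quest_score', ['age', 'mood'])
```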
12) Dealing With Missing Data
Estimate (‘impute’) missing data using Expectation maximisation (EM)
- Calculates missing values for DVs when missing data is random.
- Uses a maximum likelihood approach to iteratively generate values (usually) using a normal distribution.
- Produces biased standard errors for hypothesis testing.
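A rough illustrative EM imputation for a multivariate normal model, written with NumPy; this is a sketch of the idea (iterating between expected values for the missing entries and updated mean/covariance estimates), not SPSS's EM routine:

```python
import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    # X: 2-D array with np.nan marking missing entries
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                  # start from column means
    X_fill = np.where(miss, mu, X)
    sigma = np.cov(X_fill, rowvar=False, bias=True)
    for _ in range(n_iter):
        sigma_old = sigma.copy()
        cov_add = np.zeros_like(sigma)
        # E-step: expected values of missing entries given the observed ones
        for i in range(X.shape[0]):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            reg = sigma[np.ix_(m, o)] @ np.linalg.pinv(sigma[np.ix_(o, o)])
            X_fill[i, m] = mu[m] + reg @ (X_fill[i, o] - mu[o])
            cov_add[np.ix_(m, m)] += sigma[np.ix_(m, m)] - reg @ sigma[np.ix_(o, m)]
        # M-step: update the mean and covariance from the completed data
        mu = X_fill.mean(axis=0)
        sigma = np.cov(X_fill, rowvar=False, bias=True) + cov_add / X.shape[0]
        if np.max(np.abs(sigma - sigma_old)) < tol:
            break
    return X_fill
```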
13) Dealing With Missing Data
Estimate (‘impute’) missing data using Multiple imputation
- Makes no assumptions about randomness of missing data.
- Complex to undertake in SPSS.
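A minimal sketch of the multiple-imputation idea using scikit-learn's IterativeImputer (a MICE-style imputer) rather than SPSS; it generates several completed datasets that would then be analysed separately and pooled:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, m=5):
    # return m completed copies of X (np.nan marks missing values)
    datasets = []
    for seed in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        datasets.append(imputer.fit_transform(X))
    return datasets

# Run the analysis on each completed dataset and pool the m sets of estimates
# (e.g. with Rubin's rules) rather than averaging the imputed data themselves.
```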
14) Contrasting different methods
List 3 recommendations…
- It is worth repeating your analysis with and without missing data when using any type of imputation strategy.
- Discrepant results will be a cause for concern.
- The method you select should not be based on the analysis outcome.
15)
Define Outliers
Outliers are extreme values on one variable (univariate outlier) or on a combination of variables (multivariate outlier) that distort or obscure the results of analyses.
16) Outliers can be the result of:
(list 4 answers)
- Data entry errors
- Invalid missing data coding
- Case sampled is not from the intended population
- Case sampled is from the intended population, but simply represents an extreme value within that population, i.e. a genuine outlier.
17)
How should you check for univariate outliers?
– Standardise the variable and look for absolute values of z > 3.29 (about 0.1% of cases under normality; see the sketch after this card).
– Use graphical methods to inspect for outliers e.g. histograms, boxplots, etc.
– The selection of an outlier detection method should be independent of the results.
– Address univariate outliers as a starting point, as this will often limit the number of multivariate outliers.
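A minimal sketch of the z-score rule in Python, assuming a pandas Series `scores` holding one variable:

```python
import pandas as pd

def flag_univariate_outliers(scores, cutoff=3.29):
    z = (scores - scores.mean()) / scores.std()  # standardise the variable
    return scores[z.abs() > cutoff]              # cases beyond |z| = 3.29

# Graphical checks can complement the rule, e.g. scores.plot(kind='box')
```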
18)
How can we check for multivariate outliers?
- Best assessed using formal statistical procedures rather than graphical methods of detection.
- Mahalanobis Distance is the distance of a case from the centroid of the remaining cases (centroid is the intersection of the variable means).
- The MD is tested using a χ2 distribution, with a conservative value of alpha (usually p < .001).
19)
How can the Mahalanobis Distance be assessed in SPSS?
Using regression:
– Use any DV, with relevant variables as predictor IVs.
– Use the Save dialog to request MD values.
– Evaluate MD values using a χ2 distribution at p < .001, with df equal to the number of IVs in the model.
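Outside SPSS, the same check can be sketched in Python; this simplified version measures each case's distance from the overall centroid (rather than the centroid of the remaining cases) and flags cases beyond the χ2 cutoff at p < .001:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mahalanobis_outliers(df, alpha=0.001):
    X = df.to_numpy(dtype=float)
    diff = X - X.mean(axis=0)                             # deviation from the centroid
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    md2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)   # squared MD per case
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])           # df = number of variables
    return df[md2 > cutoff]
```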
20)
When are Leverage, Discrepancy and Influence used?
– Most often used in the context of multiple regression.
21) Detecting Outliers
What is leverage similar to?
– Leverage (used to identify potentially influential cases) is similar to Mahalanobis Distance (MD), but measured on a different scale.
22) Detecting Outliers
What does discrepancy measure?
– Discrepancy measures the extent to which a case deviates from the others (usually its deviation from the regression line).