Week 3 Data Cleaning Flashcards

Covers information from Slide 9 onwards (separate slides cover decision trees)

1
Q

what are some of the issues with data collection?

A
  • Data Entry Errors
  • Missing Data
  • Outliers
  • Non-normal distribution
  • Linearity
  • Homoscedasticity
  • Multicollinearity and Singularity
2
Q

Why is it necessary to be fastidious in cleaning your data?

A
  • Conclusions drawn from inferential statistics are dependent upon not having violated the assumptions underpinning the technique used.
  • Applying principles that generalize the findings.
  • Data cleaning helps you understand your data to draw conclusions.
  • Using small samples makes it even more important to ensure the data is representative and generalisable.
  • We infer from our sample data what the population may do or think – we are generalising to the broader community when using parametric statistics.
3
Q

Remembering that my data is precious and tells a story, what is the best method for checking the accuracy of my data?

A
  • physically comparing the original data with the onscreen entry
  • However, this is only practical with a small data set
  • Stats packages offer alternatives to help screen my data for accuracy
  • Descriptive statistics and graphical representations are a good start
    e.g. unusual data entries can be identified visually or sometimes by assessing outliers.
4
Q

Why should I look for out of range values?

A
  • out-of-range values may distort my results and are not representative.
  • data entry may be incorrect: e.g. a score of 12 on a scale of 1–7 is inaccurate.
  • if one entry is wrong, the surrounding sequence of entries may also be inaccurate.
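The course works in SPSS, but a range check is easy to sketch in Python with pandas; the column name "satisfaction" and the entries below are invented for illustration.

```python
# Sketch of a range check, assuming a 1-7 Likert item; the column name
# and entries are invented for illustration.
import pandas as pd

df = pd.DataFrame({"satisfaction": [3, 5, 12, 1, 7, 0]})

# Entries outside the valid 1-7 range are candidate data-entry errors
out_of_range = df[(df["satisfaction"] < 1) | (df["satisfaction"] > 7)]
bad_rows = out_of_range.index.tolist()   # rows to verify against the source data
```

Each flagged row should be compared against the original paper or export before anything is changed.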
5
Q

What is missing data, and why does it matter?

A

*Missing data is information not available for a subject (or case) about whom other information is available.
*It usually occurs when a participant fails to answer one or more questions in a survey.
I need to consider: is it systematic or random?
- I need to look for patterns and relationships underlying the missing data, so that when a remedy is applied the values stay as close as possible to the original distribution.
*Impact: it can reduce the sample size available for analysis and can also distort results.

6
Q

What are the principles of missing data screening?

A
  • Work out if it’s spread randomly across the dataset or is systematically missing – check patterns of missingness.
  • The number (amount) of missing data points is secondary.
  • How you deal with your missing data needs to be directed by stringent criteria.
  • Regardless of my decision, any transformations, deletions or adjustments need to be referenced succinctly in my results section.
7
Q

I need to deal with missing data prior to Cleaning & Analysis - but how do I do this?

A

I can test for levels of missingness by creating a dummy-coded variable (0/1) and then using a t-test to assess whether there is a mean difference between the 2 groups on the DV of interest.
*If the groups are not significantly different, how I deal with the data is less critical, although the degree of missingness is important to look at next (more than 5% is not great).
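As a sketch of this dummy-coding procedure in Python (the course uses SPSS; the variable names "wellbeing" and "income" and the simulated data are invented for illustration):

```python
# Sketch of screening missingness, assuming a DV "wellbeing" and an IV
# "income" with missing values -- both names are invented for illustration.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "wellbeing": rng.normal(50, 10, 100),   # DV of interest
    "income": rng.normal(60, 15, 100),      # IV that will have missing values
})
df.loc[rng.choice(100, 10, replace=False), "income"] = np.nan

# Dummy-code missingness: 1 = income missing, 0 = income present
df["income_missing"] = df["income"].isna().astype(int)

# t-test: do cases with and without missing income differ on the DV?
present = df.loc[df["income_missing"] == 0, "wellbeing"]
missing = df.loc[df["income_missing"] == 1, "wellbeing"]
t_stat, p_value = stats.ttest_ind(present, missing)

# Degree of missingness (more than 5% is not great)
pct_missing = df["income"].isna().mean() * 100
```

A non-significant p_value suggests the missingness is unrelated to the DV, after which the percentage missing guides the choice of remedy.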

8
Q

There are 3 kinds of missing data, what are they?

A

The data may be:
*Missing Completely at Random – unpredictable (MCAR)
*Missing At Random but ignorable response (MAR)
*Missing not at random or non-ignorable (MNAR)
MNAR is the worst kind :-(

9
Q

What are the 3 alternatives that used to be employed to handle missing data? and why are they not the most appropriate?

A

Historically, missing data has been handled by listwise deletion of the entire case, pairwise deletion (dropping the case only from calculations involving the missing item), or replacement with the mean value of the item in question.
Listwise deletion eliminates other very relevant information, and although pairwise deletion lessens this effect, it still discards substantial, valuable data. Instead, replacement through data transformation appears to be the optimal way of dealing with missing values, rather than eliminating the individual's whole pattern of responses.

10
Q

SPSS will use list-wise or pairwise deletion to deal with data during most analyses as the default. Why is this not a great idea?

A

A better method, which avoids loss of valuable data and inappropriate imputation, is either Regression or EM (expectation maximisation – model-based) replacement.
Systematic replacement of missing values may be used in your research using SPSS; however, remember to reference any changes made to the actual raw data set in your results section.

11
Q

Why is Regression Replacement (RP) a better option for handling missing data?

A

The regression option provides an assigned value that takes into account the individual case's pattern of responses on all other variables, supplying an adjusted estimate for each participant through a regression analysis.

12
Q

What other things do I need to remember when considering missing data?

A

*Missing data under 5% for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.). Particularly relevant in a large dataset.

The number of cases with no missing data must be sufficient for the selected analysis technique to be used if replacement values are not being substituted (imputed) for the missing data.

13
Q

So what if I have more than 5% data missing?

A

Under 10% – Any of the data replacement methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.

10 to 20% – With this much missing data, the all-available (pairwise) and regression methods are most preferred for MCAR data, and model-based methods are necessary with MAR missing-data processes.

14
Q

Why is Expectation Maximisation (EM) so highly regarded when dealing with missing data?

A

Expectation Maximisation (EM):

  • Uses data to generate the shape of the distribution
  • Uses the observed values and the estimate of the parameters to generate substitutions for the missing values
  • Based on the likelihood of those assigned values occurring under that distribution
  • Unlike regression, expectation maximisation produces realistic estimates of the variance but it is much more difficult to undertake.
15
Q

So, what does the Regression Method of Data replacement have to offer?

A
  • Other variables are used as IVs and the variable with missing data as the DV. The variables with complete cases generate the regression equation and this is used to predict the assigned missing values
  • It requires IVs to be good predictors of the variable with the missing data. It reduces variance and is more consistent with other scores than a “real” score (remember – they are used to predict it)
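A minimal sketch of the regression replacement idea in Python, assuming one complete IV (x) and one variable with missing values (y); the simulated data and names are invented for illustration.

```python
# Minimal sketch of regression replacement: complete cases generate the
# regression equation, which predicts assigned values for the missing cases.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 50)                 # IV, no missing values
y = 2.0 * x + rng.normal(0, 0.5, 50)     # variable with missing data
y[[3, 17, 40]] = np.nan                  # three cases failed to respond

complete = ~np.isnan(y)
# Complete cases generate the regression equation y = b0 + b1*x
b1, b0 = np.polyfit(x[complete], y[complete], 1)

# The equation predicts assigned values for the missing cases
y_imputed = np.where(complete, y, b0 + b1 * x)
```

Note the drawback the card mentions: the assigned values sit exactly on the regression line, so the imputed variable's variance is reduced relative to real scores.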
16
Q

What is the difference between EM and Regression Replacement?

A

EM gives you the estimated maximum-likelihood replacement, whereas the regression model takes the pattern of each case into account for the replacement.

17
Q

What does Little’s MCAR test: χ2 statistic tell us?

A

*This statistic tests whether the missing data can be characterised as MCAR, as opposed to MAR (missing at random) or MNAR (missing not at random).

  • MCAR may be inferred when Little’s MCAR test is NOT significant
  • MAR can be inferred when Little’s MCAR test is significant but missingness is predicted from variables (other than the DV) as indicated by the Separate Variance t Tests
  • MNAR is inferred if the t-test shows that missingness is related to the DV
18
Q

What does a Little’s MCAR test with p > .05 indicate?

A

When Little’s MCAR test is NOT significant, as in this case, it “indicates that the probability that the pattern of missing values diverges from randomness is greater than .05, so that MCAR may be inferred” (p. 63). This suggests that generated values may be used to replace the missing data.

19
Q

Problems that can arise suggesting EM is not the best method of data replacement, and what to do:

A
  • The t-test may show that 1 variable (such as sex) significantly deviates in relation to the other variables of interest. This presents a problem because uniformly replacing these values through EM would not be advisable, given males and females differed on the predicted values.
  • I need to compare the means for the (present) and (missing) groups to see whether replacement would be inadvisable, even though Little’s MCAR test suggested EM replacement for the patterns of the overall dataset was probable.
  • Both the t-test statistic and the chi-square statistic are used in conjunction to make this decision. An alternative is to use pairwise deletion when both these issues emerge.
  • Alternatively, tick the option to replace according to sex, and the values replaced will differ for males and females.
20
Q

What are univariate outliers?

A

Univariate outliers are extreme values on a variable, identified as scores beyond ±3.29 SD.

21
Q

How does one manage univariate outliers?

A
  • Standardisation of raw scores identifies whether suspected or possible outliers are in effect true outliers. A z-score beyond ±3.29 is an acceptable criterion for a univariate outlier.
  • Deletion is not really an option with univariate outliers – I usually check the multivariate outliers as well. You don’t want to delete any data if possible; it is so valuable. Adjustment may be preferable, but report any changes.
  • Univariate outliers may be dealt with by taking the most extreme score nearest to the outlier, adding one, and using that to replace the extreme value in the dataset. The adjusted value is still the most extreme, but that individual score will no longer inflate the error variance beyond what is applicable within the given dataset.
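The z-score check and the "nearest score plus one" adjustment can be sketched in Python; the data are simulated with one planted extreme value.

```python
# Sketch of the z-score check and the "nearest score plus one" adjustment;
# simulated data with one planted extreme value at index 100.
import numpy as np

rng = np.random.default_rng(7)
scores = np.append(rng.normal(50, 5, 100), 95.0)   # index 100 is the plant

z = (scores - scores.mean()) / scores.std(ddof=1)
flagged = np.where(np.abs(z) > 3.29)[0]            # candidate true outliers

# Replace the outlier with one unit above the next most extreme score:
# it stays the most extreme value but stops inflating the error variance.
adjusted = scores.copy()
next_most_extreme = np.sort(scores)[-2]
adjusted[flagged] = next_most_extreme + 1
```

Any such adjustment to the raw data should be reported in the results section, as the cards stress.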
22
Q

What are the 2 kinds of multivariate outliers? How does one manage multivariate outliers?

A

Multivariate Outliers:

  1. Casewise Diagnostics: outliers that have an unusual relationship between the IVs and DVs
    - SPSS identifies these by listing cases with a large standardised residual; the default is 3+, but most favour ±3.29, which corresponds to p = .001
  2. Mahalanobis Distance: an unusual pattern of scores among the IVs
23
Q

Remind me from Andy Field, what are the key values for Mahalanobis Distance?

A

Mahalanobis Distance measures the influence of a case by examining the distance of that case from the means of the other cases.

  • in large samples (500+) with 5 variables, values above 25 are problematic
  • In smaller samples (100+) with 3 predictors, values above 15 are problematic
  • In small samples (30+) with 2 predictors, values above 11 are problematic
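As an alternative to the rule-of-thumb values above, squared Mahalanobis distances are often compared against a chi-square cutoff at p = .001. A sketch in Python with simulated data and one planted unusual combination:

```python
# Sketch: squared Mahalanobis distance for each case, compared against a
# chi-square cutoff at p = .001 (df = number of variables).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=100)
X = np.vstack([X, [4.0, -4.0]])   # unusual combination: high on one, low on the other

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared distance per case

cutoff = stats.chi2.ppf(0.999, df=X.shape[1])        # ~13.82 for 2 variables
flagged = np.where(d2 > cutoff)[0]
```

The planted case is flagged because its *combination* of scores is unusual, even though each score alone might pass a univariate check.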
24
Q

With regard to data transformation: what do T & F recommend doing prior to dealing with outliers?

A

*If skewness or kurtosis is causing a significant problem, transformations may be undertaken:
*square-root transformation for moderate positive or negative skew
*for substantial positive skewness, use log transformations
NB: Transformations often reduce the impact of outliers. Transformations are best done on ungrouped data.
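The effect of these transformations on skewness can be sketched in Python with simulated, positively skewed (lognormal) data:

```python
# Sketch: comparing skewness before and after square-root and log
# transformations on simulated, positively skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
raw = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # substantial positive skew

sqrt_scores = np.sqrt(raw)    # option for moderate skew
log_scores = np.log(raw)      # option for substantial positive skew

skew_raw = stats.skew(raw)
skew_sqrt = stats.skew(sqrt_scores)
skew_log = stats.skew(log_scores)
```

The stronger transformation pulls the skew closer to zero, which is why the log is reserved for substantial positive skew.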

25
Q

What do we use to identify skew or kurtosis (i.e. non-normal distributions) in data?

A

We note violations using the Kolmogorov–Smirnov or Shapiro–Wilk tests to identify skewness or kurtosis and determine whether it is causing a significant problem.
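Both tests are available in SciPy; a sketch on a clearly skewed simulated sample, where a significant result (p < .05) flags a violation of normality:

```python
# Sketch: Shapiro-Wilk and Kolmogorov-Smirnov on a clearly skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
sample = rng.exponential(scale=2.0, size=200)   # simulated, skewed data

w_stat, p_shapiro = stats.shapiro(sample)

# KS against a normal using the sample's own mean and SD (a common shortcut;
# strictly, estimating the parameters calls for the Lilliefors correction)
d_stat, p_ks = stats.kstest(sample, "norm",
                            args=(sample.mean(), sample.std(ddof=1)))
```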

26
Q

How do we remember Skewness & Kurtosis?

A
Skewness = positively or negatively skewed.
Kurtosis = too peaked (leptokurtic) (remember Leap)
Kurtosis = too flat (Platykurtic) (remember Plateau)
27
Q

How do we know if our data is normally distributed?

A
  • Does it fit a normal distribution curve (NDC)?
  • This assumption underpins all parametric statistics.

Several ways of assessing this:

  • Stem & leaf plots, box plots, histograms
  • skewness and kurtosis.
28
Q

How do we know if our data fits the linearity assumptions?

A

Linearity: is there a straight-line relationship between the two variables?
Non-linearity is diagnosed either from residuals plots in analyses involving a predicted variable or from bivariate scatterplots between pairs of variables.

29
Q

How do we know if our data fits the homoscedasticity assumptions?

A

Ungrouped data:
*the variability in scores for one continuous variable is roughly the same at all values of another continuous variable.

For grouped data:

  • the same as the assumption of homogeneity of variance when one of the variables is discrete (the grouping variable), the other is continuous (the DV);
  • the variability in the DV is expected to be about the same at all levels of the grouping variable
30
Q

What are Multicollinearity and Singularity & when do they occur?

A

Multicollinearity and Singularity are problems with a correlation matrix that occur when variables are too highly correlated.
With multicollinearity, the variables are very highly correlated - say .90 and above.

With singularity, the variables are redundant; one of the variables is a combination of two or more other variables.
P.88 T&F.
They cause both logical and statistical problems.
See another card!

31
Q

How do we know if our data fits the Multicollinearity and Singularity assumptions?

A
  • multicollinearity, the variables are highly correlated (say, .90 and above);
  • singularity, the variables are redundant; one of the variables is a combination of two or more of the other variables

NB: Hills suggests that intercorrelations of .8 or even .7 perhaps should also be avoided when using regression to avoid problems in interpretation.
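Checking these thresholds amounts to scanning the correlation matrix; a sketch in Python with invented variable names, where "stress" is built as a near-duplicate of "anxiety":

```python
# Sketch: scanning a correlation matrix for pairs at or above the .90
# multicollinearity threshold; variable names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({"anxiety": rng.normal(size=100)})
df["stress"] = 0.97 * df["anxiety"] + rng.normal(scale=0.1, size=100)
df["age"] = rng.normal(size=100)

corr = df.corr()
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and abs(corr.loc[a, b]) >= 0.90]
```

Per Hills' stricter advice, the same scan could be re-run with a .80 or .70 threshold before a regression.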

32
Q

How do we know if our data fits the assumptions of homogeneity of variance?

A

*Parametric analyses (e.g. t-tests, ANOVA) assume that the sample tested is representative of the population and equal in variance across groups.
*Levene’s test evaluates whether Homogeneity of Variance is violated in your sample.
Greater is good: remember, p > .05 means the test is not significant.
This then indicates that the variability across the groups is similar, so unequal variances are not what is causing the differences between groups.
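A sketch of Levene's test in Python on simulated groups, one comparison where the spread is similar and one where it clearly is not:

```python
# Sketch: Levene's test for homogeneity of variance on simulated groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
group_a = rng.normal(50, 5, 40)
group_b = rng.normal(55, 5, 40)     # different mean, similar spread
group_c = rng.normal(55, 15, 40)    # much wider spread

stat_ok, p_ok = stats.levene(group_a, group_b)    # spreads similar
stat_bad, p_bad = stats.levene(group_a, group_c)  # assumption clearly violated
```

Levene's test compares spread, not means, which is why group_a vs group_b can pass despite their different means.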

33
Q

Parametric tests are based on the assumption that the dataset is large enough (20–30+ per group), homogeneity of variance criteria are met, normality assumptions are met, and generally the cells are of equal size.

How is the required sample size determined?

A
  • The sample size required is dictated by the specific type of analysis you undertake and whether parametric or non-parametric statistics are used.
  • If parametric statistics are used your sample size may also affect whether you are able to undertake the use of univariate or multivariate analyses.
34
Q

What are the pros & cons of parametric versus non-parametric tests?

A

*Parametric statistics – more stringent set of assumptions.
*Non-parametric statistical analysis - less stringent criteria.
Non-parametric tests should be used when:
- ordinal-level data is analysed
- sample size is small (<10 per cell)
- sample sizes are small and unequal

  • Nonparametric tests can be used as an alternative strategy for dealing with outliers in parametric tests due to the ordinal nature of the analysis of the data.
  • Nonparametric statistics are based on ranks that are not affected by extreme scores, and they do not require normality
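The outlier-resistance of rank-based tests can be sketched with the Mann-Whitney U (the rank-based alternative to an independent t-test); the scores below are invented.

```python
# Sketch: Mann-Whitney U is unaffected by how extreme an outlier is,
# because it works on ranks; scores are invented for illustration.
import numpy as np
from scipy import stats

group_a = np.array([3, 4, 5, 4, 3, 5, 4, 2])
group_b = np.array([6, 7, 5, 6, 7, 6, 5, 95])   # 95 is an extreme score

u_stat, p_value = stats.mannwhitneyu(group_a, group_b)

# Pulling 95 in to 8 leaves every rank unchanged, so U is identical
tamed_b = np.where(group_b == 95, 8, group_b)
u_tamed, _ = stats.mannwhitneyu(group_a, tamed_b)
```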
35
Q

What is EM and what does it involve?

A

EM stands for Expectation Maximization.
EM methods are available for randomly missing data.
EM forms a missing data correlation (or covariance) matrix by assuming the shape of a distribution (such as normal) for the partially missing data and basing inferences about missing values on the likelihood under that distribution.
IT IS AN ITERATIVE PROCEDURE WITH TWO STEPS:
Expectation &
Maximisation.
P.68 T&F

36
Q

What is the procedure of Expectation Maximisation (EM)?

A

EM is an iterative process with two steps:

1. Expectation – the E step finds the conditional expectation of the “missing data”, given the observed values and the current estimate of the parameters, such as correlations. These expectations are then substituted for the missing data.
2. Maximisation – the M step performs maximum likelihood estimation as though the missing data had been filled in.

Finally, after convergence is achieved, the EM variance-covariance matrix is provided and the filled-in data are saved in the dataset.
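A toy sketch of the two alternating steps in Python, for one variable (y) with missing values and one complete variable (x); simulated data and a fixed iteration count, not the full T&F procedure:

```python
# Toy sketch of the EM idea: alternate between filling missing values with
# their conditional expectation (E) and re-estimating parameters (M).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y_true = 1.5 * x + rng.normal(0, 1, 200)
miss = rng.random(200) < 0.2                      # ~20% missing at random
y_obs = np.where(miss, np.nan, y_true)

filled = np.where(miss, np.nanmean(y_obs), y_obs)   # crude starting values
for _ in range(25):
    # M step: maximum-likelihood (least-squares) fit as if data were complete
    b1, b0 = np.polyfit(x, filled, 1)
    # E step: conditional expectation of each missing value given x
    filled = np.where(miss, b0 + b1 * x, y_obs)
```

Note that the filled-in values carry no residual error, which is exactly the bias the card on EM's bias describes.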
37
Q

Why is the analysis of EM biased?

A

Because error is not added to the transformed data set.

41
Q

What are univariate outliers?

A

Univariate outliers are cases with an outlandish value on one variable: a z-score in excess of ±3.29.
They are easier to spot than multivariate outliers.
Among dichotomous variables, the cases on the “wrong” side of a very uneven split are likely to be univariate outliers.
Among continuous variables, univariate outliers are cases with very large standardised (z) scores on one or more variables, disconnected from the other z-scores and in excess of 3.29.

42
Q

What are multivariate outliers?

A

Multivariate outliers are cases with an unusual combination of scores on two or more variables.
E.g. a 15-year-old is within normal bounds, and someone earning $45,000.00/yr is normal, BUT a 15-year-old who earns $45,000.00 a year is not normal: a multivariate outlier.