Week1_Topic 1: Screening and cleaning data Flashcards

1
Q

1) Why is Data Screening Important?

A
  • An important (often time consuming!) precursor to any serious data analysis.
  • Crucial to check relevant statistical assumptions for any subsequent analyses.
  • Data screening can often provide an important first insight into the key variables of your study.
2
Q

2) Data Screening:

What should be your first step?

A

• The first step in data screening should involve checking for the accuracy of the data entry:

– Out of range values

– Plausible values

– Check accuracy of coding in SPSS

• The Frequencies window in SPSS (Descriptive Statistics) is often useful for the above checks.
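
Outside SPSS, the same accuracy checks can be sketched in a few lines of Python. This is only an illustration: the variable name `likert_item`, the valid range of 1 to 5, and the missing-value code 99 are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical responses to a 1-5 Likert item; 99 is an assumed missing-value code.
likert_item = [1, 3, 5, 4, 2, 3, 99, 4, 7, 3, 55, 2]
VALID_RANGE = range(1, 6)   # plausible values: 1, 2, 3, 4, 5
MISSING_CODE = 99

# Frequency table, similar in spirit to a Frequencies output.
frequencies = Counter(likert_item)
for value in sorted(frequencies):
    print(value, frequencies[value])

# Flag entries that are neither valid nor the declared missing code.
out_of_range = [
    (index, value)
    for index, value in enumerate(likert_item)
    if value not in VALID_RANGE and value != MISSING_CODE
]
print(out_of_range)
```

Each flagged (position, value) pair would then be checked against the original record before any analysis.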

3
Q

3) Missing Data:

It is problematic because it may reduce the representativeness of your data.

It is a problem in many areas of psychological research. Give 4 examples:

A
  1. Participant attrition (sustained pressure)
  2. Items/tasks not completed
  3. Completed data misplaced
  4. Equipment malfunctions
4
Q

4)

Why is meticulous data collection important?

A

– Ensure participants complete all tasks/ questionnaires

– Remind them to check that they’ve completed everything

– Program your task to minimize missing data

5
Q

5)

Describe - missing completely at random (MCAR):

A
  • Cause of missing data is independent of other variables in the study
  • Non-missing data is representative of total data
  • Best possible situation, but rare
  • Should not cause any problems if relatively small loss (<5%) in moderate and large datasets.
6
Q

6) Describe - missing at random (MAR):

A
  • The pattern of missingness is predictable from other variables in the dataset.
  • For instance, patients might be less likely to complete a certain questionnaire.
  • If <5% most missing value procedures yield similar results.
7
Q

7)

Describe - missing not at random (MNAR):

A
  • Non-random missingness.
  • Value of variable is related to reason why it’s missing.
  • Patients are less likely to complete a questionnaire because of their scores on the questionnaire.
  • Can seriously bias results if left as is.
8
Q

8) SPSS Missing Value Analysis:

What is a logical approach?

A
  • Separate missing and non-missing cases into two groups and compare them on other variables, e.g. a t-test with IV (miss vs. non-miss) on all other variables.
  • Little’s MCAR test: if non-significant then assume MCAR; if significant, but missingness is predictable from other variables (other than DV), then assume MAR. If t-test is significant for missingness across DVs, then assume MNAR.
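
The group comparison described above can be sketched in Python as follows. The data, the variable names (age and a depression score), and the use of Welch's t statistic are assumptions for illustration; the sketch computes only the statistic, not its p value, and it does not implement Little's MCAR test.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical cases: (age, depression_score); None marks a missing score.
cases = [(21, 10), (25, 12), (30, None), (34, 15), (41, None),
         (45, 11), (52, None), (58, 14), (63, None), (67, None)]

# Split on missingness of the score, then compare the two groups on age.
age_missing = [age for age, score in cases if score is None]
age_present = [age for age, score in cases if score is not None]

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t = welch_t(age_missing, age_present)
print(round(t, 2))
```

A large absolute t would suggest that missingness on the score is related to age, which argues against MCAR.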
9
Q

9)

Dealing With Missing Data:

Omission (but still report it)

A

– Usually ok if low frequency and MCAR/MAR (<5%), and dataset is moderate to large (default option in SPSS).

– Can be problematic in small datasets and in experimental designs e.g. unequal group sample sizes, loss of power etc

– If MNAR, can distort results.

10
Q

10)

Dealing With Missing Data

Estimate (‘impute’) missing data - knowledge of the area/previous research is good.

Another option is mean substitution - what are the consequences?

A

– Mean substitution

  • Easy to implement
  • Conservative option
  • Reduces variance in variable
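
A small Python sketch, using made-up scores, shows why mean substitution reduces variance: the mean is unchanged, but the imputed cases add no spread around it.

```python
from statistics import mean, variance

# Hypothetical scores; None marks a missing value.
scores = [4, 8, None, 6, 10, None, 2, 6]

observed = [s for s in scores if s is not None]
fill = mean(observed)

# Mean substitution: every missing value is replaced by the observed mean.
imputed = [fill if s is None else s for s in scores]

var_observed = variance(observed)
var_imputed = variance(imputed)
print(var_observed, var_imputed)
```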
11
Q

11) Dealing With Missing Data

Estimate (‘impute’) missing data using Regression

A
  • Use complete cases to derive a regression equation with a series of IVs predicting relevant DV.
  • Use equation to predict scores for missing cases.
  • Tends to reduce variance and inflate relationships with other variables.
  • Relies on relationships between DV and potential IVs.
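
The steps above can be sketched for the simplest case of one IV predicting one DV. The data are invented, and a single-predictor least squares line stands in for the "series of IVs" a real analysis would use.

```python
from statistics import mean

# Hypothetical (iv, dv) pairs; None marks a missing DV.
data = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, None)]

complete = [(x, y) for x, y in data if y is not None]
mx = mean(x for x, _ in complete)
my = mean(y for _, y in complete)

# Ordinary least squares slope and intercept from complete cases only.
slope = (sum((x - mx) * (y - my) for x, y in complete)
         / sum((x - mx) ** 2 for x, _ in complete))
intercept = my - slope * mx

# Predicted scores fill the gaps; they fall exactly on the regression line.
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in data]
print(imputed)
```

Because the imputed scores carry no residual scatter, they sit closer to the line than real scores would, which is one way to see why this method reduces variance and inflates relationships.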
12
Q

12) Dealing With Missing Data

Estimate (‘impute’) missing data using Expectation maximisation (EM)

A
  • Calculates missing values for DVs when missing data is random.
  • Uses a maximum likelihood approach to iteratively generate values (usually) using a normal distribution.
  • Produces biased standard errors for hypothesis testing.
13
Q

13) Dealing With Missing Data

Estimate (‘impute’) missing data using Multiple imputation

A
  • Makes no assumptions about randomness of missing data.
  • Complex to undertake in SPSS.
14
Q

14) Contrasting different methods

List 3 recommendations….

A
  1. It is worth repeating your analysis with and without missing data when using any type of imputation strategy.
  2. Discrepant results will be a cause for concern.
  3. The method you should select should not be based on the analysis outcome.
15
Q

15)

Define Outliers

A

Outliers are extreme values on one variable (univariate outlier) or a combination of variables (multivariate outlier), that distort or obscure the results of analyses.

16
Q

16) Outliers can be the result of:

(list 4 answers)

A
  • Data entry errors
  • Invalid missing data coding
  • Case sampled is not from the intended population
  • Case sampled is from the intended population, but simply represents an extreme value within that population, ie a genuine outlier.
17
Q

17)

How should you check for univariate outliers?

A

– Standardise variable and look for absolute values of z > 3.29 (.1% of sample)

– Use graphical methods to inspect for outliers e.g. histograms, boxplots, etc.

– The selection of an outlier detection method should be independent of the results.

– Address univariate outliers as a starting point, as this will often limit the number of multivariate outliers.
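
The standardising check can be sketched in Python like this. The scores are hypothetical; the 3.29 cut-off is the one given on this card.

```python
from statistics import mean, stdev

# Hypothetical scores: 30 typical values plus one extreme case.
scores = [50, 52, 48, 51, 49] * 6 + [120]

m, sd = mean(scores), stdev(scores)
z_scores = [(x - m) / sd for x in scores]

# Flag cases whose absolute z exceeds 3.29.
outliers = [(i, x) for i, (x, z) in enumerate(zip(scores, z_scores))
            if abs(z) > 3.29]
print(outliers)
```

Note that in a very small sample no case can reach |z| = 3.29, because the largest possible z is (n - 1) / sqrt(n); graphical checks matter more there.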

18
Q

18)

How can we check for multivariate outliers?

A
  • Best assessed using formal statistical procedures rather than graphical methods of detection.
  • Mahalanobis Distance is the distance of a case from the centroid of the remaining cases (centroid is the intersection of the variable means).
  • The MD is tested using a χ2 distribution, with a conservative value of alpha (usually p < .001).
19
Q

19)

How can the Mahalanobis Distance be assessed in SPSS?

A

Using regression:

– Use any DV, with relevant variables as predictor IVs.

– Use the Save dialog to request MD values.

– Evaluate MD values using a χ2 distribution at p < .001, with df equal to the number of IVs in the model.
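
For intuition, the MD check can be sketched by hand for two variables. Everything here is an assumption for illustration: the data are invented, the centroid is computed from all cases (including the case being tested) rather than from the remaining cases, and the critical value uses a closed form that holds only for df = 2. With more variables a chi-square table or a statistics package would be needed.

```python
from math import log
from statistics import mean

# Hypothetical two-variable data: 20 cases near the line y = x, plus one case
# (index 20) whose x and y are individually in range but jointly unusual.
xs = list(range(1, 21)) + [3]
ys = [i + (0.5 if i % 2 else -0.5) for i in range(1, 21)] + [18]

n = len(xs)
mx, my = mean(xs), mean(ys)
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
det = sxx * syy - sxy ** 2

def md_squared(x, y):
    """Squared Mahalanobis distance from the centroid (2x2 inverse by hand)."""
    dx, dy = x - mx, y - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# For df = 2 the chi-square upper tail is exp(-x / 2), so the p < .001
# critical value is -2 * ln(.001), roughly 13.82.
critical = -2 * log(0.001)
flagged = [i for i, (x, y) in enumerate(zip(xs, ys))
           if md_squared(x, y) > critical]
print(flagged)
```

The flagged case has unremarkable values on each variable taken alone, which is why a univariate z-score check would miss it.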

20
Q

20)

When are Leverage, Discrepancy and Influence used?

A

– Most often used in the context of multiple regression.

21
Q

21) Detecting Outliers

What is leverage similar to?

A

Leverage (identifying influence values) is similar to Mahalanobis Distance (MD), but measured on a different scale.

22
Q

22) Detecting Outliers

What does discrepancy measure?

A

Discrepancy measures the extent to which a case deviates from others (usually deviation from a straight line).

23
Q

23) Detecting Outliers

What is influence?

A
  • Influence is the product of both leverage and discrepancy. It is the extent to which regression coefficients change when a case is deleted.
  • Influence can be assessed using Cook’s Distance, obtained in the Save dialog of SPSS Regression.
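
To make the three ideas concrete, here is a minimal Python sketch for simple regression with one predictor. The data are invented, raw residuals stand in for discrepancy, and the leverage values are uncentered hat values, so they may be scaled differently from the values SPSS saves.

```python
from statistics import mean

# Hypothetical data: ten cases on the line y = 2x, plus one case (index 10)
# that is far from the other x values and far from the line.
xs = list(range(1, 11)) + [20]
ys = [2 * x for x in range(1, 11)] + [10]

n, p = len(xs), 2   # p = number of coefficients (intercept and slope)
mx, my = mean(xs), mean(ys)
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
intercept = my - slope * mx

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]  # discrepancy
leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]               # hat values
mse = sum(e ** 2 for e in residuals) / (n - p)

# Cook's distance combines the size of the residual with leverage.
cooks_d = [(e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
           for e, h in zip(residuals, leverage)]
most_influential = max(range(n), key=lambda i: cooks_d[i])
print(most_influential)
```

The planted case combines high leverage with high discrepancy, so it dominates Cook's distance.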
24
Q

24) Detecting Outliers

Describe the plot for leverage, discrepancy and influence in terms of high, moderate and low

A

Answer:

  • High leverage
  • Low discrepancy
  • moderate influence
25
Q

25) Detecting Outliers

Describe the plot for leverage, discrepancy and influence in terms of high, moderate and low

A

Answer:

High leverage

High discrepancy

High influence

26
Q

26) Detecting Outliers

Describe the plot for leverage, discrepancy and influence in terms of high, moderate and low

A

Answer:

Low leverage

High discrepancy

Moderate influence

27
Q

27) Detecting Outliers

When you have detected multivariate outliers, you should seek to determine why the cases are extreme.

How do you do this?

A

Can create a dummy variable with the outliers as one value and the rest of the cases with another value.

Use the dummy variable as a DV in logistic regression, to determine the IVs that best predict group membership.

Means, SDs etc can then be checked for the outliers on the variables identified in this way.

28
Q

28) Dealing with Outliers

List some solutions..

A
  1. Check for data entry error, coding mistakes etc.
  2. Omit cases if not part of the population (with caution).
  3. Otherwise,
  • Transform variable
  • Reduce/increase score to next unit above/below the next highest/ lowest score (e.g. Winsorize / nearest-neighbor).
  • Trim an a priori percentage of cases from the upper and lower tails of the distribution.
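
Score alteration and trimming can be sketched as follows. The scores and the 12.5% trimming proportion are assumptions for illustration; the alteration rule follows the "next unit above the next highest score" wording above.

```python
# Hypothetical scores with one extreme high value.
scores = [12, 15, 14, 13, 16, 15, 14, 48]

ordered = sorted(scores)
next_highest = ordered[-2]

# Score alteration: pull the extreme case in to one unit above the
# next highest score.
altered = [next_highest + 1 if x == ordered[-1] else x for x in scores]

def trimmed(values, proportion):
    """Drop the given proportion of cases from each tail (chosen a priori)."""
    k = int(len(values) * proportion)
    ordered_values = sorted(values)
    return ordered_values[k:len(ordered_values) - k]

trimmed_scores = trimmed(scores, 0.125)   # removes one case from each tail here
print(altered, trimmed_scores)
```

Alteration keeps the case in the dataset (still as the highest score), whereas trimming removes cases from both tails regardless of whether they are extreme.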
29
Q

29) Dealing with Outliers

list 4 more points to consider..

A
  1. If you still have multivariate outliers after dealing with univariate outliers and N is large, may be simplest to omit cases.
  2. Run analyses with and without outliers to check on effect of omissions.
  3. Make sure all outlier remedies are reported in the results section.
  4. Non-parametric statistics are less sensitive to the influence of outliers, but not wholly insensitive.
30
Q

30) Normality:

A
  • Assumption of multivariate normality is the assumption that each variable, and all linear combinations of the variables, are normally distributed.
  • This assumption underlies the use of many statistical tests and tests become less robust as distributions depart from normality.
31
Q

31) Normality

Describe normality for grouped data (e.g. ANOVA)

A
  • Normality applies to the sampling distribution of the variables' means. With a large enough N (i.e. > 20), the Central Limit Theorem shows the sampling distribution will be normally distributed regardless of the distribution of the variables.
32
Q

32) Normality

Describe normality for ungrouped data (e.g. Regression)

A

– Testing for multivariate normality impractical. The assumption can be largely checked by examining the normality, linearity, and homoscedasticity of individual variables and any analysis residuals.

33
Q

33) Define Homoscedasticity.

A

Homoscedasticity. This assumption means that the variance around the regression line is the same for all values of the predictor variable (X). The plot shows a violation of this assumption. For the lower values on the X-axis, the points are all very near the regression line

34
Q

34) Normality

Normality can be assessed by statistical and graphical means.

What is the test and distribution to look for with a visual inspection?

A
  • Kolmogorov-Smirnov test
  • Skewness is the degree of asymmetry in the distribution.
  • Kurtosis is the peakedness or flatness of the distribution.
35
Q

35) Normality

A normal distribution has skewness and kurtosis values of ______

A

A normal distribution has skewness and kurtosis values of 0.

Skewness and kurtosis values and their standard errors are provided by SPSS Frequencies.

Can test a hypothesis using a z score that skewness and kurtosis differ significantly from 0.

z = skewness/standard error (SE), z = kurtosis/SE

Small samples (n < 50): |z| > 1.96 indicates the data are non-normal.
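
The z test for skewness can be sketched in Python. The formulas below are the commonly used adjusted sample skewness and its standard error; that they reproduce SPSS output exactly is an assumption, and the scores are invented.

```python
from math import sqrt
from statistics import mean, stdev

def skewness_and_se(values):
    """Adjusted sample skewness and its standard error."""
    n = len(values)
    m, s = mean(values), stdev(values)
    skew = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)
    se = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return skew, se

# Hypothetical right-skewed scores (n = 10).
scores = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
skew, se = skewness_and_se(scores)
z = skew / se
print(round(skew, 2), round(se, 3), round(z, 2))
```

The standard error depends only on n, which is why large samples produce significant z values even for trivial skew.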

36
Q

36) Normality

For small to moderate N, use ________ ________

values e.g. p < ____.

For large N, likely to make a __________ as SEs become _____. Better to examine the _________________________ or look at absolute values.

A

For small to moderate N, use conservative alpha

values e.g. p < .001.

For large N, likely to make a Type 1 error as SEs become small. Better to examine the shape of the distribution or look at absolute values.

37
Q

37) Normality

____________ suggest normality can be assumed if skewness values are not greater than _______ and kurtosis values are not greater than ________.

Note that the consequences of departures from ________are generally more serious when variables are _______ in different directions.

A

Curran et al. (1996) suggest normality can be assumed if skewness values are not greater than an absolute value of 2 and kurtosis values are not greater than an absolute value of 7.

Note that the consequences of departures from normality are generally more serious when variables are skewed in different directions.

38
Q

38) Normality

Graphical methods - list 3

A
  1. Frequency histogram (with normal curve overlay in SPSS)
  2. Normal probability plot
  • Plots observed values against expected values based on a normal distribution.
  • For normality, points should fall along the diagonal line.
  3. Detrended normal probability plot
  • Similar to above, but plots the deviations from the diagonal.

39
Q

39) Normalise

To address normality…. list 6 ways..

A
  1. Transform variable (eg log transform)
  2. Winsorize/Trim
  3. Check modified variable to ascertain normality
  4. Examine how any changes impact analyses
  5. Choose a method based on what is effective in yielding a normal distribution
  6. If normality violations can’t be corrected, consider using non-parametric statistics
40
Q

40)

Define Linearity

A
  • The assumption that variables have linear (straight line) relationships with each other.
  • Underlies many statistical tests e.g. Pearson correlation, regression, etc
  • Can be assessed using bivariate scatterplots.
  • Some variables may inherently have a non-linear relationship.
  • See Tabachnick and Fidell for alternate analyses if nonlinearity present.
41
Q

41) Linearity

What is the relationship?

A

Quadratic relationship

42
Q

42) Linearity

What is the relationship?

A

Cubic relationship

43
Q

43) Homoscedasticity

Homoscedasticity is the __________ that ________ in scores for one ______________ is roughly the same at all levels of _______________________.

A

Homoscedasticity is the assumption that variance in scores for one continuous variable is roughly the same at all levels of another continuous variable.

44
Q

44) Homoscedasticity

For grouped data what is the homogeneity of variance assumption:

A

variability in DV is expected to be similar across all levels of the discrete IV.

45
Q

45) Homoscedasticity

How do you assess it?

A

Can be assessed with Levene’s test.

46
Q

46) Homoscedasticity

What should you do for ungrouped data?

A

For ungrouped data, inspect the bivariate scatterplots. Heteroscedasticity is caused by nonnormality in one or both of the variables.

Tests usually robust to some heteroscedasticity.

47
Q

47) Homoscedasticity

Describe the plot below:

A

Homoscedasticity with both variables normally distributed.

48
Q

48) Homoscedasticity

Describe the plot below:

A

Heteroscedasticity with skewness on one variable

49
Q

49) Data Transforms

List 5 key points.

A
  1. Generally recommended if there are significant departures from normality and/or problems with outliers etc.
  2. An exception may be if the variable is measured on an inherently meaningful scale, as interpretation becomes more difficult after transformation.
  3. It is worth trying several types of transformations to try to produce the best distribution.
  4. Always check a transformed variable for normality etc.
  5. If no transformation seems to work, try dichotomizing the variable or using NP statistics.
50
Q

50) Data Transforms

If data is skewed what is a good option?

A

Log transformations are often helpful with skewed data.
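
A quick Python sketch with invented, positively skewed scores shows the effect. It uses a simple moment-based skewness, which is an assumption of this example and not necessarily the statistic SPSS reports.

```python
from math import log
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

# Hypothetical positively skewed scores (all > 0, as the log requires).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [log(x) for x in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(round(skew_raw, 2), round(skew_logged, 2))
```

The transformed variable is still somewhat skewed here, which is why a transformed variable should always be rechecked.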

51
Q

51) Multicollinearity

Multicollinearity is a problem that occurs in a ________________when variables are too ________________ to each other (e.g. > .90).

A

Multicollinearity is a problem that occurs in a correlation matrix when variables are too highly related to each other (e.g. > .90).

52
Q

52) Multicollinearity

List 3 Problems with multicollinearity

A
  1. Conceptually, it indicates redundancy in the variables.
  2. It can be an important issue in regression.
  3. It doesn’t necessarily impact the model as a whole (e.g. in regression), but rather individual predictors.
53
Q

53) Multicollinearity

How to detect multicollinearity?

A
  • Basic principle is to examine overlap between variables.
  • Examine correlation matrix for preliminary check.
  • Specific test depends to some degree on type of analysis e.g. tolerance/VIF in multiple regression (more on this in a few weeks).
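
The preliminary correlation check can be sketched in Python. The predictor names and values are invented, and the VIF shown is the special case of a model containing only the two flagged predictors, where tolerance is 1 minus the squared correlation.

```python
from math import sqrt
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a)
                      * sum((y - mb) ** 2 for y in b))

# Hypothetical predictors; the names are invented for illustration.
predictors = {
    "anxiety": [10, 12, 15, 18, 20, 25],
    "worry":   [11, 13, 15, 19, 21, 24],   # nearly redundant with anxiety
    "age":     [40, 22, 35, 29, 51, 33],
}

names = list(predictors)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_r(predictors[first], predictors[second])
        if abs(r) > 0.90:
            # In a model with only these two predictors,
            # tolerance = 1 - r^2 and VIF = 1 / tolerance.
            flagged.append((first, second, 1 / (1 - r ** 2)))
print(flagged)
```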
54
Q

54) Multicollinearity

How do we deal with multicollinearity?

A

Remove or combine relevant variables

55
Q

55)

Data Screening

• Putting it all together:

A

– See Tabachnick and Fidell pg. 91 for a checklist of data screening prior to analysis.

– Also see end of chapter 4 for fully worked examples of data screening using SPSS, for ungrouped and grouped data.
