the research process Flashcards
different types of missing values
- Missing completely at random
- Missing at random
- Missing not at random
MCAR
no pattern in the missing data; the missingness is completely random. These cases can be ignored or removed
MNAR
- Data missing not at random (MNAR) are missing for reasons related to the values themselves.
- Mostly high or low scores are missing.
- Example: some participants with low incomes avoid reporting their holiday spending amounts because those amounts are low.
- You lack data from key subgroups within the sample, so the sample is not representative of your population.
MAR
missing values can be explained by other (observed) variables: the missingness is predictable from other variables in the dataset
how to identify HOW MUCH data is missing
1) Frequencies functions > statistics table
2) Explore function > case processing table
3) Missing value analysis > univariate statistics table
how to detect patterns
- Little's MCAR test
- T-test or dummy variables
Little's MCAR test
a multivariate test that evaluates the subgroups of the data that share the same missing-data pattern; it tests differences between the observed and estimated means in each missing-data pattern.
It is not a definitive test
* provides an EM means table with the MCAR test result
* If the p-value is above .05 (non-significant), the data can be treated as MCAR
* If the p-value is below .05 (significant), the data are not MCAR
t-tests for missing values
- T-test: evaluates whether missingness is related to any of the other variables, with alpha = .05
- For the t-test procedure, SPSS first separates cases with complete and missing values by creating an indicator variable for each variable that contains missing values
- This is done by partitioning the data into two parts: one set containing the cases with missing values, the other containing the cases with non-missing values
- After partitioning the data, a t-test of mean differences checks whether the two groups differ on the other variables in the sample
- MAR can be inferred if Little's MCAR test is statistically significant but missingness is predicted from variables other than the DV, as indicated by the separate-variance t-tests
- MNAR is inferred if the t-test shows that missingness is related to the DV
using dummy variables for missing values
- Construct a dummy variable with two groups, cases with missing and non-missing values on income, and perform a test of mean differences in attitude between the groups
- Code 1 = missing, 0 = observed
- Run t-tests and chi-square tests between this variable and the other variables in the dataset to see whether the missingness on this variable is related to the values of those other variables
- If there are no differences, decisions about how to handle data are not so critical
- For example, if women really are less likely to tell you their weight than men, a chi-square test will tell you that the percentage of missing data on the weight variable is higher for women than men.
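The dummy-variable check above can be sketched in Python. The data and variable names here are made up for illustration, and only the t statistic is computed (in practice SPSS or scipy.stats would supply the p-value):

```python
import math
from statistics import mean, variance

def missingness_indicator(values):
    """Dummy variable: 1 = missing (None), 0 = observed."""
    return [1 if v is None else 0 for v in values]

def welch_t(group_a, group_b):
    """Welch's t statistic for a mean difference (statistic only)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical data: income has missing values, attitude is complete.
income   = [None, 42, 31, None, 55, 60, None, 48]
attitude = [2.1, 4.5, 4.0, 1.8, 4.8, 5.0, 2.4, 4.2]

miss = missingness_indicator(income)
attitude_when_missing  = [a for a, m in zip(attitude, miss) if m == 1]
attitude_when_observed = [a for a, m in zip(attitude, miss) if m == 0]
t = welch_t(attitude_when_missing, attitude_when_observed)
# A large |t| suggests missingness on income is related to attitude,
# so the data are unlikely to be MCAR.
```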
listwise deletion
deleting data from all cases (participants) that have missing data on any variable in your dataset. You end up with a dataset that is complete for every participant included in it
- Use when only a few cases have missing data
- You may end up with a smaller and possibly biased sample to work with
pairwise deletion
Lets you keep more of your data by removing cases only from the analyses that involve their missing variables. It conserves more of your data because all available data from each case are included
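The difference between the two deletion strategies can be sketched with a toy dataset (all values hypothetical):

```python
def listwise_delete(rows):
    """Keep only cases with no missing value on any variable."""
    return [r for r in rows if None not in r]

def pairwise_n(rows, i, j):
    """Number of cases usable for an analysis of variables i and j."""
    return sum(1 for r in rows if r[i] is not None and r[j] is not None)

# Hypothetical dataset: columns are (age, income, attitude).
data = [(25, 40, 3.0), (31, None, 4.1), (29, 52, None), (44, 61, 4.8)]

complete = listwise_delete(data)        # only 2 of 4 cases survive
n_age_income = pairwise_n(data, 0, 1)   # 3 cases usable for age-income
```

Pairwise deletion keeps the third case for the age-income analysis even though its attitude score is missing; listwise deletion drops it everywhere.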
estimating data
- mean substitution
- common point replacement
- regression
- expectation maximisation
- multiple imputation
mean substitution
- replacing the missing value with the mean of the observed cases on that variable
- However, the variance of the variable is reduced, because each imputed value sits exactly at the mean and contributes no deviation
- It is best to avoid mean substitution unless the proportion of missing values is very small
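A minimal sketch of mean substitution on made-up scores, showing the variance shrinkage described above:

```python
from statistics import mean, pvariance

def mean_substitute(values):
    """Replace each None with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

scores = [10, 14, None, 18, None, 22]
filled = mean_substitute(scores)          # Nones become 16
observed = [v for v in scores if v is not None]
# The imputed values sit exactly at the mean, so they add no deviation:
# pvariance(filled) is smaller than pvariance(observed).
```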
Common point replacement
replacing the value with the midpoint of the scale
regression imputation
- other variables are used as IVs to write a regression equation, with the variable containing missing data serving as the DV
- Cases with complete data generate the regression equation; the equation is then used to predict the missing values for incomplete cases
- however, the scores fit together better than they should
- reduced variance because the estimate is probably too close to the mean
- the IVs must be good predictors of the variable with missing data
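Regression imputation with a single IV can be sketched as follows (the hours/scores data are hypothetical; a least-squares line is fitted on complete cases and used to fill the gap):

```python
from statistics import mean

def regression_impute(x, y):
    """Impute missing y values from x using a simple least-squares
    regression fitted on the complete cases only."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * a if b is None else b
            for a, b in zip(x, y)]

# Hypothetical: predict a missing exam score from hours studied.
hours  = [1, 2, 3, 4, 5]
scores = [52, 55, None, 61, 64]
filled = regression_impute(hours, scores)   # missing score becomes 58
```

Note the imputed value falls exactly on the regression line, which is why the filled-in scores "fit together better than they should" and the variance is understated.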
expectation maximisation
- creates a missing-data correlation (or covariance) matrix
- assumes a normal distribution for the partially missing data and infers the likelihood of the missing value falling within that distribution
- can be used when data are MCAR or MAR
- associated with bias and inappropriate standard errors (SEs)
1. First, the E step finds the conditional expectation of the missing data, given the observed values and the current estimate of the parameters (such as the correlations). These expectations are then substituted for the missing data.
2. Second, the M step performs maximum likelihood estimation as if the missing data had been filled in. After convergence is achieved, the EM variance-covariance matrix may be provided and/or the filled-in data saved in the dataset.
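The two steps above can be sketched for the simplest case, one predictor x and one variable y with missing values. This is a toy illustration with made-up data; it omits the correction to the second moments that full EM applies, and real software (e.g. SPSS Missing Value Analysis) handles the general multivariate case:

```python
from statistics import mean

def em_bivariate(x, y, iters=50):
    """Toy EM: the E step replaces each missing y with its conditional
    expectation given x under the current parameter estimates (here, a
    regression prediction); the M step re-estimates the parameters from
    the completed data. Repeat until the estimates stabilise."""
    # Start by filling missing y values with the observed mean.
    filled = [b if b is not None else mean(v for v in y if v is not None)
              for b in y]
    for _ in range(iters):
        # M step: re-estimate parameters from the completed data.
        mx, my = mean(x), mean(filled)
        slope = (sum((a - mx) * (b - my) for a, b in zip(x, filled))
                 / sum((a - mx) ** 2 for a in x))
        intercept = my - slope * mx
        # E step: refill the missing values with conditional expectations.
        filled = [intercept + slope * a if b is None else b
                  for a, b in zip(x, y)]
    return filled

x = [1, 2, 3, 4, 6]
y = [3, 5, None, 9, 13]          # complete cases lie on y = 2x + 1
completed = em_bivariate(x, y)   # missing y converges to 2*3 + 1 = 7
```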
multiple imputation
- uses regression (e.g. logistic regression) to predict values based on other variables in your dataset
- differentiates cases with and without missing data >> uses other variables in your dataset to estimate the missing values >> random samples are taken from the distribution of the variable to create new datasets
- Allows you to create 5 new data sets
- Most respected method and can be used when data are MNAR or MAR
- Used for regression, ANOVA, logistic regression, and longitudinal data
- Difficult to implement
- Analyse > Multiple Imputation > Impute Missing Data Values > insert key variables > set imputations to 5 > create a new dataset > click Constraints to specify which variables are predictors and which are to be imputed > impute, using the predictors for the variable that has the missing data
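The core idea, several plausible complete datasets rather than one, can be sketched in a few lines. This toy version (hypothetical spending data) draws each missing value from a normal distribution fitted to the observed values; real MI, as in SPSS, also uses the other variables as predictors and propagates the between-dataset uncertainty into the final estimates:

```python
import random
from statistics import mean, stdev

def multiple_impute(values, m=5, seed=0):
    """Create m imputed copies of the data; each None is replaced by a
    random draw from a normal fitted to the observed values (sketch)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    return [[rng.gauss(mu, sd) if v is None else v for v in values]
            for _ in range(m)]

spend = [120, None, 95, 140, None, 110]
imputations = multiple_impute(spend)   # 5 plausible complete datasets
```

Because the fills are random draws rather than a single best guess, the imputed values differ across the five datasets, which is what lets MI reflect the uncertainty due to missingness.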
univariate outliers
large standardised scores (z-scores) on a single variable.
how to detect univariate outliers
- By converting our data to z-scores, we can apply benchmarks to the dataset to search for outliers
- Analyse > Descriptive Statistics > Descriptives > select the variable to convert and tick "Save standardized values as variables"
- Cases with standardised scores (z-scores) more than 3.29 from the mean (p < .001, two-tailed test) are potential outliers (preferred method)
- Histograms (via Frequencies)
- Boxplots / IQR method: IQR = Q3 - Q1; outliers fall above Q3 + 1.5 x IQR or below Q1 - 1.5 x IQR
- P-P plots / detrended P-P plots
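Both numeric rules above can be sketched in Python on made-up scores (the stdlib `quantiles` function uses the exclusive method, so its quartiles may differ slightly from SPSS's):

```python
from statistics import mean, stdev, quantiles

def z_outliers(values, cutoff=3.29):
    """Flag cases whose |z| exceeds the cutoff (p < .001, two-tailed)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs((v - m) / s) > cutoff]

def iqr_outliers(values):
    """Boxplot rule: outside Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

scores = [48, 50, 51, 52, 53, 54, 55, 56, 57, 120]
```

Worth noting: with only n = 10 cases the largest possible |z| is (n-1)/sqrt(n), about 2.85, so the z-score rule cannot flag the extreme score here even though the boxplot rule catches it; the 3.29 benchmark is most useful in larger samples.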
multivariate outliers
a case with a strange combination of scores on two or more variables
Calculating Mahalanobis distance
a measure of the distance of each case from the centroid of the cases, determined from a combination of scores on two or more variables; evaluated against the chi-square distribution.
Mahalanobis distance cut-off for 2 predictor variables
13.816 (the chi-square critical value with df = 2, p = .001)
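For exactly two variables the calculation can be done by hand, inverting the 2x2 covariance matrix directly (a sketch on made-up points; with realistic sample sizes each squared distance would be compared against the 13.816 cut-off):

```python
from statistics import mean

def mahalanobis_2d(data):
    """Squared Mahalanobis distance of each case from the centroid,
    for exactly two variables (2x2 covariance inverted by hand)."""
    xs, ys = [p[0] for p in data], [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse covariance matrix times (dx, dy)^T
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(d2)
    return dists

points = [(1, 2), (2, 3), (3, 4), (4, 5), (10, 1)]  # last case is unusual
d2 = mahalanobis_2d(points)   # largest distance belongs to (10, 1)
```

A useful check: the squared distances always sum to p*(n-1), here 2*4 = 8.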
leverage
an unusual value on the predictor (x); such points are potentially influential
discrepancy
the extent to which a case is out of line with the others (an unusual y value given its x value)
influence
combination of leverage and discrepancy
assesses a change in regression coefficients when a case is deleted.
high leverage low discrepancy
moderate influence
high leverage
high discrepancy
high influence
low leverage
high discrepancy
moderate influence
Cook’s distance
- used in regression analysis to find influential outliers among a set of predictor variables
- Cases with Cook’s distance larger than 1.00 are suspected of being influential outliers
- If every Cook’s distance is below 1, no cases are flagged as influential
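Cook's distance can be illustrated by refitting a simple regression with each case deleted in turn (a sketch on hypothetical data; statistics software computes this without the explicit refits):

```python
from statistics import mean

def ols(x, y):
    """Least-squares slope and intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def cooks_distance(x, y):
    """Cook's D for each case in a simple regression, computed by
    comparing fitted values with and without the case (p = 2 params)."""
    n, p = len(x), 2
    slope, intercept = ols(x, y)
    fitted = [intercept + slope * a for a in x]
    mse = sum((b - f) ** 2 for b, f in zip(y, fitted)) / (n - p)
    ds = []
    for i in range(n):
        s, c = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        refit = [c + s * a for a in x]
        ds.append(sum((f - r) ** 2
                      for f, r in zip(fitted, refit)) / (p * mse))
    return ds

x = [1, 2, 3, 4, 5, 15]
y = [2, 4, 6, 8, 10, 0]   # the last case pulls the line down hard
d = cooks_distance(x, y)  # only the last case exceeds the 1.00 rule
```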
correcting outliers
- trimming data
- winsorizing
- robust estimation methods
- transforming the data
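The first two corrections can be sketched on made-up scores: trimming drops the most extreme cases, while winsorizing pulls them in to the next most extreme remaining value (this rank-based version is one of several winsorizing conventions; percentile-based variants also exist):

```python
def trim(values, k=1):
    """Drop the k smallest and k largest scores."""
    s = sorted(values)
    return s[k:len(s) - k]

def winsorize(values, k=1):
    """Replace the k most extreme scores at each end with the next
    most extreme remaining score, keeping the original order."""
    s = sorted(values)
    lo, hi = s[k], s[-k - 1]
    return [min(max(v, lo), hi) for v in values]

scores = [3, 50, 51, 52, 53, 54, 55, 56, 200]
trimmed = trim(scores)           # 3 and 200 removed
winsorized = winsorize(scores)   # 3 becomes 50, 200 becomes 56
```

Winsorizing keeps the sample size intact, which is why it is often preferred over trimming when few cases are available.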