Lesson 2 Flashcards
What happens during data cleaning?
- Ensuring an analyzable format (dummy coding).
- Ensuring the data takes legal values.
- Outliers are located and treated.
- Missing data is located and treated.
What missing data patterns are there?
- Univariate (only on one variable/column)
- Monotone (can be arranged into a staircase pattern)
- Arbitrary (randomly scattered)
What 5 rates of nonresponse are there?
- Percent/proportion missing (screening, apply on each variable)
- Attrition rate (proportion of participants who drop out of the study)
- Percent/proportion of complete cases (percent without missing data)
- Covariance coverage (proportion of cases observed on both variables of a pair)
- Fraction of missing information (proportion of an estimate's sampling variance that is due to missing data)
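The simpler rates above can be computed by counting. The following is a minimal sketch on a made-up data set (the variable names and values are hypothetical, and `None` is assumed to mark a missing value); it covers the proportion missing, the proportion of complete cases, and the covariance coverage.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 23, "income": 3100, "score": 7},
    {"age": 31, "income": None, "score": 5},
    {"age": None, "income": None, "score": 6},
    {"age": 45, "income": 5200, "score": None},
]
variables = ["age", "income", "score"]
n = len(rows)

# Proportion missing, computed separately for each variable.
prop_missing = {v: sum(r[v] is None for r in rows) / n for v in variables}

# Proportion of complete cases: rows without any missing value.
prop_complete = sum(all(r[v] is not None for v in variables) for r in rows) / n

# Covariance coverage: proportion of rows where both variables of a pair are observed.
def coverage(a, b):
    return sum(r[a] is not None and r[b] is not None for r in rows) / n

print(prop_missing)               # {'age': 0.25, 'income': 0.5, 'score': 0.25}
print(prop_complete)              # 0.25
print(coverage("age", "income"))  # 0.5
```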
What is Missing Completely at Random (MCAR)?
Missingness is unrelated to any data, observed or unobserved: the missing values are spread completely at random.
P(R | Y_mis, Y_obs) = P(R)
This is the best-case scenario.
What is Missing at Random (MAR)?
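Missingness has a reason, but it depends only on observed data and not on the missing values themselves.
P(R | Y_mis, Y_obs) = P(R | Y_obs)
Example: some frogs can't be caught because they are too smart, but the research is about length, so the missingness is not related to the missing lengths themselves.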
What is Missing Not at Random (MNAR)?
Missingness depends on the unobserved values themselves, for example when a certain group doesn't answer for a reason that is directly related to what is being researched.
Example: rich people don't report their salary.
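The three mechanisms can be compared in a small simulation. This is only a sketch under invented assumptions: the variables `age` and `salary`, the linear relation between them, and all missingness probabilities are hypothetical. It returns the mean of the observed salaries next to the mean of all salaries.

```python
import random

def simulate(mechanism, n=10000, seed=1):
    """Return (mean of observed salaries, mean of all salaries)."""
    rng = random.Random(seed)
    observed, everything = [], []
    for _ in range(n):
        age = rng.uniform(20, 60)                     # always observed
        salary = 1000 + 50 * age + rng.gauss(0, 300)  # may be missing
        if mechanism == "MCAR":
            p_missing = 0.3                            # same for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if age > 40 else 0.1       # depends on observed age only
        else:  # "MNAR"
            p_missing = 0.6 if salary > 3000 else 0.1  # depends on salary itself
        everything.append(salary)
        if rng.random() >= p_missing:
            observed.append(salary)
    return sum(observed) / len(observed), sum(everything) / n

for mechanism in ("MCAR", "MAR", "MNAR"):
    print(mechanism, simulate(mechanism))
```

In this setup the observed mean stays close to the full mean under MCAR and is clearly too low under MNAR. Under MAR the naive observed mean is also off, but the bias can in principle be corrected because it is explained by the observed `age`.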
What are 6 Bad methods to handle missing data?
- Listwise deletion (only using complete rows)
- Pairwise Deletion (only use complete pairs of observations)
- Mean Substitution (use mean for every missing value)
- Deterministic Regression Imputation (replace missing values with predictions from a regression model)
- Averaging Available Items (data must be MCAR)
- Last Observation Carried Forward (LOCF)
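To see why some of these methods are problematic, here is a minimal sketch of listwise deletion and mean substitution on made-up (x, y) pairs, where `None` is assumed to mark a missing y. It illustrates that listwise deletion throws away rows and that mean substitution shrinks the variance.

```python
from statistics import mean, variance

# Hypothetical (x, y) pairs; None marks a missing y value.
rows = [(1.0, 4.0), (2.0, None), (3.0, 6.0), (4.0, 8.0), (5.0, None), (6.0, 10.0)]

# Listwise deletion: keep only rows without any missing value.
complete = [r for r in rows if None not in r]
y_observed = [y for _, y in complete]

# Mean substitution: fill every missing y with the mean of the observed y values.
fill = mean(y_observed)
y_mean_imputed = [fill if y is None else y for _, y in rows]

print(len(rows), len(complete))   # 6 4
print(variance(y_observed))       # about 6.67
print(variance(y_mean_imputed))   # 4.0, the imputed values add no spread
```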
What are 2 Good methods to handle missing data?
- Stochastic Regression Imputation (fill missing data with regression predictions plus random noise)
- Multiple Imputation (stochastic regression imputation repeated several times, with the results pooled; computationally expensive)
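The difference between deterministic and stochastic regression imputation can be sketched as follows. This is a simplified illustration with one predictor, made-up data, and hypothetical helper names (`fit_line`, `impute`); the noise is assumed to be normal with the residual SD of the fitted line.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def impute(rows, stochastic, seed=0):
    rng = random.Random(seed)
    xs = [x for x, y in rows if y is not None]
    ys = [y for x, y in rows if y is not None]
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(e ** 2 for e in residuals) / (len(xs) - 2)) ** 0.5
    completed = []
    for x, y in rows:
        if y is None:
            y = intercept + slope * x            # deterministic prediction
            if stochastic:
                y += rng.gauss(0, resid_sd)      # add random residual noise
        completed.append((x, y))
    return completed

# Hypothetical data; None marks a missing y value.
rows = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, None)]
print(impute(rows, stochastic=False))
print(impute(rows, stochastic=True))
```

Multiple imputation would, roughly speaking, repeat the stochastic step several times (here: with different seeds), analyze each completed data set, and pool the results.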
What are univariate outliers?
Extreme values with respect to the distribution of a variable’s other observations
What causes an outlier?
- Data entry error
- Legal but extreme values
Outliers should only be kept if they are correct values and belong to the population the findings generalize to.
What are the 4 most popular methods to diagnose outliers?
- Internally studentized residuals (AKA Z-score method)
- Externally studentized residuals
- Median absolute deviation method
- Tukey’s boxplot method
How can you calculate Internally Studentized Residuals?
Each value's distance from the sample mean is divided by the sample SD. Values whose absolute score exceeds a cut-off are flagged as outliers.
This assumes that the sample is approximately normally distributed.
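A minimal sketch of the z-score method, assuming a cut-off of 3 (a common choice, but not prescribed by these notes) and a made-up sample:

```python
from statistics import mean, stdev

def z_score_outliers(values, cutoff=3.0):
    """Flag values whose absolute z-score exceeds the cut-off."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > cutoff]

# Hypothetical sample: 19 ordinary values and one extreme value.
values = [9, 10, 11] * 6 + [10, 50]
print(z_score_outliers(values))  # [50]
```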
How can you calculate Externally Studentized Residuals?
It uses the same calculation as the internally studentized residuals, but the observation being evaluated is left out when the mean and SD are computed, so an outlier cannot inflate its own reference values.
Uses the deletion mean and deletion SD.
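The effect of using the deletion mean and deletion SD can be sketched like this. It is a simplified illustration on a made-up sample, not the full formal definition; `deletion_scores` is a hypothetical helper name.

```python
from statistics import mean, stdev

def deletion_scores(values):
    """Score each value against the mean and SD of all other values."""
    scores = []
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]   # leave the current value out
        scores.append((v - mean(rest)) / stdev(rest))
    return scores

# Hypothetical small sample with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

z_of_50 = (50 - mean(values)) / stdev(values)
print(round(z_of_50, 2))                     # 2.84, below a cut-off of 3
print(round(deletion_scores(values)[-1], 1)) # far above a cut-off of 3
```

In this small sample the extreme value inflates the ordinary mean and SD so much that its own z-score stays below 3, while its deletion-based score is very large.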
How can you calculate Median Absolute Deviation Method?
It uses the median instead of the mean, and the median absolute deviation (MAD) instead of the SD. This way it is more resistant to outliers.
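A minimal sketch of the MAD method on a made-up sample. The scaling constant 1.4826 (which makes the MAD comparable to the SD for normal data) and the cut-off of 3 are assumptions, not taken from these notes.

```python
from statistics import median

def mad_outliers(values, cutoff=3.0):
    """Flag values that lie more than `cutoff` scaled MADs from the median."""
    med = median(values)
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this sample")
    return [v for v in values if abs(v - med) / mad > cutoff]

# Hypothetical small sample with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(mad_outliers(values))  # [50]
```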
What is the Boxplot Method?
It is a visualization method that calculates an inner fence and an outer fence.
Values outside the inner fence = possible outliers; values outside the outer fence = probable outliers.
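The fences can be sketched in a few lines. The multipliers 1.5 and 3 times the interquartile range are Tukey's conventional choices and are assumed here, as is the quartile definition (`method="inclusive"`); other quartile definitions give slightly different fences. The sample is made up.

```python
from statistics import quantiles

def tukey_fences(values):
    """Return (inner, outer) fences as (lower, upper) pairs."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

def classify(values):
    """Split flagged values into possible and probable outliers."""
    inner, outer = tukey_fences(values)
    probable = [v for v in values if v < outer[0] or v > outer[1]]
    possible = [v for v in values
                if (v < inner[0] or v > inner[1]) and v not in probable]
    return possible, probable

# Hypothetical sample.
values = [9, 9, 10, 10, 10, 10, 11, 11, 13, 50]
print(tukey_fences(values))  # ((8.5, 12.5), (7.0, 14.0))
print(classify(values))      # ([13], [50])
```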
What is a Breakdown Point?
The minimum proportion of cases that must be replaced by infinity to cause the value of the statistic to go to infinity
What is a multivariate outlier?
A combination of values that is very unlikely
Example: someone who is among the tallest people, but among the lightest.
What is the Mahalanobis Distance?
This is a method for detecting multivariate outliers. It is a multivariate generalization of the internally studentized residual (z-score) that also takes the correlations between variables into account.
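For two variables, the classical Mahalanobis distance can be written out by hand. The sketch below uses made-up (height, weight) pairs and a hypothetical helper `mahalanobis_2d`; it uses the ordinary sample mean and covariance, so it is the classical and not the robust version.

```python
from statistics import mean

def mahalanobis_2d(points):
    """Classical Mahalanobis distance of each (x, y) point to the sample centre."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse covariance matrix times (dx, dy), written out
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Hypothetical (height in cm, weight in kg); the last person is tall but light.
people = [(160, 54), (165, 61), (170, 64), (175, 71),
          (180, 74), (185, 81), (190, 84), (188, 56)]
distances = mahalanobis_2d(people)
print([round(d, 2) for d in distances])
```

The last person has neither the largest height nor the smallest weight, yet gets the largest distance because the combination is unusual.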
What is the difference between Classical and Robust Mahalanobis Distance?
Robust is less influenced by outliers
What is Winsorization?
Replacing an outlying value with the nearest non-outlying value.
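A minimal sketch of Winsorization, assuming the outlier bounds have already been determined by some detection method (the bounds 8.5 and 12.5 and the sample are hypothetical):

```python
def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest value inside it."""
    inside = [v for v in values if lower <= v <= upper]
    lowest, highest = min(inside), max(inside)
    return [lowest if v < lower else highest if v > upper else v for v in values]

# Hypothetical sample; 13 and 50 lie above the upper bound.
values = [9, 9, 10, 10, 10, 10, 11, 11, 13, 50]
print(winsorize(values, lower=8.5, upper=12.5))
# [9, 9, 10, 10, 10, 10, 11, 11, 11, 11]
```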
Why do we want outlier detection methods to be Robust?
A robust method is less influenced by the very outliers it is trying to detect.
Sort these detection methods by breakdown point:
- Boxplot method
- externally studentized residual method
- internally studentized residual method
- median absolute deviation method
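From lowest to highest breakdown point: internally studentized residuals (1/N, uses the mean), externally studentized residuals (2/N, uses the deletion mean), boxplot method (25%), median absolute deviation method (50%, uses the median).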
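The breakdown points of the mean and the median can be checked empirically. In this sketch a very large number stands in for infinity, and `contaminate` is a hypothetical helper that replaces the k largest values of a made-up sample.

```python
from statistics import mean, median

def contaminate(values, k, bad=1e12):
    """Replace the k largest values with an absurdly large number."""
    ordered = sorted(values)
    return ordered[:len(ordered) - k] + [bad] * k

clean = [9, 9, 10, 10, 10, 10, 11, 11, 12]   # N = 9

print(mean(contaminate(clean, 1)))    # explodes after a single bad value (1/N)
print(median(contaminate(clean, 4)))  # 10, still fine with 4 of 9 values replaced
print(median(contaminate(clean, 5)))  # 1e12, breaks once more than half are bad
```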