Data Preparation Flashcards
Phases of data preparation
- Identification of variables
- Univariate analysis
- Bivariate analysis
- Missing data
- Outlier treatment
- Variable transformation
- Creation of new variables
Types/causes of missing data
During data extraction/collection:
a) Missing completely at random: the probability of a value being missing is the same for every observation.
b) Missing at random: missingness is random within a variable but is related to other input variables (e.g. missing data on age is more common for women than for men).
c) Missing that depends on unobserved predictors: missingness is not random, but depends on a variable that has not been recorded.
d) Missing that depends on the missing value itself: the probability of a value being missing depends on the value itself (e.g. people with low income tend not to declare it).
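A minimal Python sketch of how these mechanisms could be simulated, to make the distinction concrete. The function name `make_missing`, the mechanism labels, the field names and all probabilities are assumptions chosen for illustration. Case c) behaves like b), except that the variable driving the missingness is absent from the data set, so it is not simulated separately.

```python
import random

def make_missing(rows, mechanism, rng):
    """Return copies of rows with some 'income' values replaced by None.

    rows: list of dicts with keys 'sex' and 'income' (illustrative fields).
    mechanism: 'completely_random', 'random' or 'own_value' (assumed labels).
    rng: a random.Random instance, so that runs are reproducible.
    """
    out = []
    for row in rows:
        if mechanism == "completely_random":
            p = 0.2                                      # same chance for every row
        elif mechanism == "random":
            p = 0.4 if row["sex"] == "F" else 0.1        # depends on another variable
        elif mechanism == "own_value":
            p = 0.5 if row["income"] < 30000 else 0.05   # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked = dict(row)                               # leave the input untouched
        if rng.random() < p:
            masked["income"] = None
        out.append(masked)
    return out
```

Comparing the share of missing values per group (by sex, or by income band) after each call shows which mechanism leaves a detectable pattern in the observed data.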
How to handle missing values:
1) Delete them.
- Listwise deletion: the entire observation that contains the missing data is deleted. Risk: too little data is left.
- Pairwise deletion: each statistic is calculated with all available values of the variables involved. Risk: different sample sizes for each variable. When data is missing completely at random, pairwise deletion is preferred because it keeps more of the data.
2) Impute with mean/mode/median. Mean and median for numeric data, the mode for categorical.
- Generalised replacement: one size fits all.
- Similarity replacement: calculate the values separately for each category and impute within it. For example, missing heights of men are filled with the men's average and missing heights of women with the women's average.
3) Build a predictive model to estimate the values that will replace the missing data.
Risk: estimated values are usually more regular (less variable) than the true values.
4) Impute with K-NN: fill each gap from the k most similar observations. Advantage: it works for both qualitative and quantitative variables.
Risk: computationally expensive, and sensitive to the choice of parameters such as k.
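The options above can be sketched in plain Python on a toy table. The records, field names and function names are all hypothetical, and the K-NN variant is deliberately simplified to one numeric feature with absolute distance; it is an illustration of the idea, not a reference implementation.

```python
from statistics import mean

# Toy records; None marks a missing value. All names are illustrative.
people = [
    {"sex": "M", "height": 180.0, "weight": 80.0},
    {"sex": "M", "height": None,  "weight": 90.0},
    {"sex": "M", "height": 176.0, "weight": 70.0},
    {"sex": "F", "height": 165.0, "weight": None},
    {"sex": "F", "height": None,  "weight": 60.0},
    {"sex": "F", "height": 161.0, "weight": 55.0},
]

def listwise_delete(rows, cols):
    """Keep only rows in which every column in cols is present."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

def pairwise_mean(rows, col):
    """Mean of one column, using every row where that column is present."""
    return mean(r[col] for r in rows if r[col] is not None)

def impute_mean(rows, col):
    """Generalised replacement: one overall mean fills every gap."""
    fill = pairwise_mean(rows, col)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]

def impute_group_mean(rows, col, by):
    """Similarity replacement: a separate mean per category of `by`.

    Assumes every category has at least one observed value.
    """
    groups = {}
    for r in rows:
        if r[col] is not None:
            groups.setdefault(r[by], []).append(r[col])
    fills = {g: mean(v) for g, v in groups.items()}
    return [{**r, col: fills[r[by]] if r[col] is None else r[col]} for r in rows]

def knn_impute(rows, col, feature, k=2):
    """Fill gaps in col with the mean of the k rows nearest on `feature`."""
    donors = [r for r in rows if r[col] is not None and r[feature] is not None]
    out = []
    for r in rows:
        if r[col] is None and r[feature] is not None:
            nearest = sorted(donors, key=lambda d: abs(d[feature] - r[feature]))[:k]
            r = {**r, col: mean(d[col] for d in nearest)}
        out.append(r)
    return out
```

For example, `listwise_delete(people, ["height", "weight"])` keeps three rows, while `pairwise_mean(people, "weight")` still uses five values. The two imputation functions fill the second row's height differently: the overall mean versus the mean for men only.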
Types of outliers and origins
Univariate
Multivariate
- Natural Outlier.
- Non-Natural / Due to errors:
- Intentional outlier: related to sensitive data. For example, when young people are interviewed about alcohol consumption, only some of them report the actual value.
- Measurement errors: the measurement tool used is faulty. Example: a faulty weighing machine.
- Experimental errors: an abnormal event has affected the outcome of the experiment.
- Sampling errors: for example, we want to measure the height of a few athletes and by mistake include a pair of basketball players in the sample.
- Data entry errors: human errors made while the data is collected or recorded.
- Data processing errors: when data is extracted from several sources, manipulation or extraction errors can introduce outliers into the final data set.
Consequences of outliers
1) They increase the error variance and reduce the power of statistical tests.
2) If outliers are not randomly distributed, they can compromise the normality of the distribution.
3) They can bias or influence estimates.
4) They may have an impact on the basic assumptions of regression, ANOVA and other statistical methods.
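To see the effect on variance and on estimates in numbers, here is a small sketch with hypothetical income values (in thousands). One extreme observation moves the mean and inflates the standard deviation, while the median barely changes.

```python
from statistics import mean, median, stdev

incomes = [30, 32, 35, 36, 38, 40, 41]   # hypothetical values
with_outlier = incomes + [400]           # one extreme observation added

for label, data in (("clean", incomes), ("with outlier", with_outlier)):
    # Mean and standard deviation react strongly; the median is robust.
    print(label, mean(data), median(data), round(stdev(data), 2))
```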
How to identify outliers
Univariate Outliers
- Visual inspection
- Beyond the interquartile range limits (below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR), outside the 5th and 95th percentiles, or more than 3 standard deviations from the mean.
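The three numeric rules can be sketched with the standard library. The function names and the sample values are illustrative; the multiplier 1.5, the 5th/95th percentiles and the threshold of 3 standard deviations are the conventional defaults, exposed as parameters.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < lo or v > hi]

def percentile_outliers(values, lower=5, upper=95):
    """Values outside the given lower and upper percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # cuts[i] is percentile i+1
    lo, hi = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if v < lo or v > hi]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

sample = [10, 11, 12, 13, 14] * 4 + [60]   # hypothetical data with one extreme value
```

On this sample all three rules flag only the value 60. Two caveats: on large, continuous data the percentile rule flags roughly 10% of the observations by construction, and in very small samples a single point cannot reach a z-score of 3.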
Multivariate Outliers
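- Mahalanobis distance (compare the squared distance with a chi-squared quantile as the cutoff value, with degrees of freedom equal to the number of variables).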
- Cook’s D (calculated by removing the ith data point from the model and recalculating the regression. It summarizes how much all the values in the regression model change when the ith observation is removed)
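Both measures can be sketched without external libraries for the simplest cases: two variables for the Mahalanobis distance, and a one-predictor regression for Cook's D. Everything here is an illustrative assumption: the function names, the 0.975 probability level, and the sample data. The chi-squared cutoff uses the closed form that holds for 2 degrees of freedom only. Cook's D is computed from residuals and leverages, which gives the same value as refitting the model without the ith point.

```python
import math
from statistics import mean

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the covariance matrix
    # d^2 = [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in points]

def chi2_cutoff_2df(prob=0.975):
    """Chi-squared quantile for exactly 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - prob)

def cooks_distance(xs, ys):
    """Cook's D of each point in the simple regression y = a + b*x."""
    n, p = len(xs), 2                   # p = number of fitted parameters
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in resid) / (n - p)
    lev = [1 / n + (x - mx) ** 2 / sxx for x in xs]   # leverage h_ii
    return [e * e / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

# Hypothetical data: a tight diagonal band plus one point far off the band.
points = [(x, x + d) for x in range(1, 11) for d in (0.5, -0.5)] + [(2, 9)]
flagged = [i for i, d2 in enumerate(mahalanobis_sq_2d(points))
           if d2 > chi2_cutoff_2df()]
```

Here `flagged` contains only the index of the point (2, 9), which is unremarkable on each axis separately but inconsistent with the correlation between the two variables. For Cook's D, values above 1 are often treated as influential; that threshold is a common rule of thumb, not something stated in these cards.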
How to manage outliers
- Elimination of observations.
- Data transformations: take the log of the values, or reduce the outliers' impact by giving them lower weights.
- Treating them as separate group.
- Replacing values.
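A short sketch of these options, assuming the outliers have already been identified with IQR fences. The function names and sample values are hypothetical, and capping at the fence is only one possible way of replacing values (a mean or median could be used instead).

```python
import math
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Lower and upper limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def eliminate(values, lo, hi):
    """Elimination: drop every value outside [lo, hi]."""
    return [v for v in values if lo <= v <= hi]

def split_groups(values, lo, hi):
    """Separate group: return (regular values, outliers) for separate analysis."""
    inside = [v for v in values if lo <= v <= hi]
    outside = [v for v in values if v < lo or v > hi]
    return inside, outside

def cap(values, lo, hi):
    """Replacing values: pull each outlier back to the nearest limit."""
    return [min(max(v, lo), hi) for v in values]

def log_transform(values):
    """Transformation: compress large values (positive inputs only)."""
    return [math.log(v) for v in values]

data = [10, 11, 12, 13, 14] * 4 + [60]   # hypothetical sample
lo, hi = iqr_fences(data)
```

With this sample the fences are 7.25 and 17.25, so `eliminate` drops the value 60, `cap` replaces it with 17.25, and `split_groups` returns it as its own group.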
Variable transformation strategies
- Categorical to dummies
- Right-skewed distribution: log, square root, cube root
- Left-skewed distribution: square, cube, exponential
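A sketch of these strategies on hypothetical data. Root and log transforms compress the long right tail of a right-skewed variable, while powers stretch the upper end of a left-skewed one. The `skewness` helper (third standardised moment) and the `one_hot` helper are written here for illustration, not taken from a library.

```python
import math
from statistics import mean

def skewness(values):
    """Third standardised moment; positive means a longer right tail."""
    m, n = mean(values), len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def one_hot(labels):
    """Categorical to dummies: one 0/1 column per category."""
    cats = sorted(set(labels))
    return [{f"is_{c}": int(label == c) for c in cats} for label in labels]

right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50]      # long right tail
left_skewed = [51 - v for v in right_skewed]            # mirror image

logged = [math.log(v) for v in right_skewed]            # positive values only
cube_rooted = [v ** (1 / 3) for v in right_skewed]
squared = [v ** 2 for v in left_skewed]

print(skewness(right_skewed), skewness(logged), skewness(cube_rooted))
print(skewness(left_skewed), skewness(squared))
print(one_hot(["red", "blue", "red"]))
```

On this sample each transform moves the skewness closer to zero without removing it entirely, so checking the result after transforming is worthwhile.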