L7; Data pre-processing & Dealing with missing data Flashcards
data pre-processing
real-world datasets often contain noisy, missing and inconsistent data. This is usually caused by the data being assembled from multiple sources or by poor data collection techniques. It is generally regarded that data pre-processing takes up about 80% of the analysis time.
includes: data transformation and reshaping, calculating variables that are functions of existing variables, aggregation, and dealing with missing data
dplyr package (5)
filter, select, arrange, mutate, summarise
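A minimal sketch of the five verbs, assuming a small example data frame df with columns group and score (both names are illustrative, not from the lecture):
library(dplyr)
df <- data.frame(group = c("a", "a", "b", "b"),
                 score = c(40, 60, 75, 55))
filter(df, score > 50)                    # keep rows meeting a condition
select(df, group)                         # keep only the named columns
arrange(df, desc(score))                  # sort rows, highest score first
mutate(df, score_pct = score / 100)       # add a variable derived from an existing one
summarise(df, mean_score = mean(score))   # collapse the data to a single summary row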
dplyr; filter
filter(data, condition to filter on)
and
&
or
|
equal
==
then
%>%
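A short sketch putting the operators and the pipe together with filter, again on an assumed example data frame:
library(dplyr)
df <- data.frame(group = c("a", "a", "b", "b"),
                 score = c(40, 60, 75, 55))
df %>% filter(group == "a" & score > 50)   # both conditions must hold
df %>% filter(group == "b" | score > 70)   # either condition may hold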
missing data problem (3)
missing data can usually be classified into:
1. Missing Completely at Random (MCAR):
missingness does not depend on any values in the dataset, observed or unobserved.
2. Missing at Random (MAR):
missingness does not depend on the unobserved values of the dataset but does depend on the observed values.
3. Not Missing at Random (NMAR):
missingness depends on the unobserved values of the dataset.
the procedure for dealing with missing data
- Identify the missing data (see the R sketch after this list).
- Identify the cause of the missing data.
- A: remove the rows containing the missing data (the naive approach; only appropriate when the missing data are not biased, e.g. when they are MCAR).
- B: replace missing values with alternative values, i.e. impute the missing values; there are a number of approaches.
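A minimal sketch of the identification step in base R, assuming an example data frame df (is.na, colSums and complete.cases are standard base functions):
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, 3, 5, 6))
is.na(df)                  # logical matrix flagging each missing value
colSums(is.na(df))         # count of missing values per column
which(!complete.cases(df)) # row numbers containing at least one missing value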
deletion (2)
listwise deletion analyses only the rows where complete data are present for every column. The advantages are that it is simple and that results are easily compared across analyses. The limitations are that it can be biased and that the lower n reduces statistical power.
Pairwise deletion analyses the rows where the variables of interest have data present. The advantage is that it uses all the available information, but the limitation is that separate analyses cannot be compared, as the data/sample will differ between them.
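A brief sketch contrasting the two deletion strategies, on an assumed example data frame with some missing values:
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, 3, 5, 6),
                 z = c(7, 8, 9, 10))
# Listwise deletion: keep only rows that are complete in every column
df_listwise <- na.omit(df)
# Pairwise deletion: each pair of variables uses every row where both are observed
cor(df, use = "pairwise.complete.obs")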
replacing missing data
- Simple Imputation
missing values are replaced with the mean, median or mode value. It is not stochastic and is very simple. The limitations are that it can be biased, underestimates standard errors, and can distort correlations among variables.
- Multiple Imputation
estimates missing data through repeated simulations. It is stochastic, so variability is represented more accurately. The limitations are that the algorithms are more complex and normally require more complex coding.
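A minimal sketch, assuming an example data frame df: simple mean imputation in base R, plus the usual entry point for multiple imputation via the mice package (shown commented out, since it requires the package to be installed):
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))
# Simple imputation: replace each NA with the column mean (deterministic, no stochastic element)
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df$y[is.na(df$y)] <- mean(df$y, na.rm = TRUE)
# Multiple imputation with mice: repeated stochastic simulations, then pooled results
# library(mice)
# imp <- mice(df, m = 5, seed = 123)   # create five imputed datasets
# fit <- with(imp, lm(y ~ x))          # analyse each completed dataset
# pool(fit)                            # combine estimates across imputations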