Lesson 7 Flashcards
how to verify data quality? (3 checks)
DOCUMENTATION
– CONTENT:
- Entity (what is it about?)
- Property
- Measurement (units), type of variable
- Time
– TECHNICAL
- Abbreviations, codes
- Program code for data set creation, conversion
TRUTHFULNESS
- verify from other sources
- check plausibility
COMPLETENESS
– TALL:
- missing observations?
– WIDE:
- how are missing values encoded? (NaN, 0, ...)
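A minimal sketch of the two completeness checks, using only the Python standard library. The records, the `MISSING_CODES` set and the expected id range are hypothetical; the real missing-value codes have to come from the data set's documentation.

```python
import math

# Hypothetical missing-value codes (assumption, not taken from any real data set).
# Note that 0 is deliberately NOT listed: it may be a true zero or a missing code,
# and only the documentation can tell.
MISSING_CODES = {None, "", "NA", -999}

def is_missing(value):
    """Return True if value is NaN or one of the assumed missing codes."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in MISSING_CODES

def missing_report(rows, columns):
    """WIDE check: count missing entries per column in a list of dict records."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}

rows = [
    {"id": 1, "income": 2500.0, "age": 34},
    {"id": 2, "income": float("nan"), "age": -999},
    {"id": 3, "income": 0.0, "age": 51},
]

# TALL check: which expected observations are absent altogether?
expected_ids = range(1, 5)  # assumed: ids 1..4 should all be present
missing_ids = sorted(set(expected_ids) - {row["id"] for row in rows})

report = missing_report(rows, ["income", "age"])
print(missing_ids)  # [4]
print(report)       # {'income': 1, 'age': 1}
```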
how to preserve data quality?
CONVERSION:
– DATA TRANSFORMATION
- convert unit
- aggregation
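The two transformations above can be sketched as follows. The records and column names are made up for illustration; the point is only that the original column is kept next to the converted one, so the step stays checkable.

```python
from collections import defaultdict
from statistics import mean

CM_PER_M = 100

# Hypothetical records (illustration only).
records = [
    {"group": "A", "height_cm": 170.0},
    {"group": "A", "height_cm": 180.0},
    {"group": "B", "height_cm": 160.0},
]

# Unit conversion: add a new column instead of overwriting the original one.
for rec in records:
    rec["height_m"] = rec["height_cm"] / CM_PER_M

# Aggregation: mean height per group.
by_group = defaultdict(list)
for rec in records:
    by_group[rec["group"]].append(rec["height_m"])
mean_height_m = {group: mean(values) for group, values in by_group.items()}
```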
MERGING:
- ‘key’ is key
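One way to take the key seriously is to check it before merging: duplicates in the lookup table and keys without a match are the usual sources of silent damage. A sketch under assumed data; `merge_on_key` and the country tables are hypothetical helpers, not a library API.

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return the key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def merge_on_key(left, right, key):
    """Inner join of two lists of dicts; refuses to run if `key` is not unique in `right`."""
    dups = duplicate_keys(right, key)
    if dups:
        raise ValueError(f"duplicate keys in right table: {dups}")
    lookup = {row[key]: row for row in right}
    merged = [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]
    unmatched = [row[key] for row in left if row[key] not in lookup]
    return merged, unmatched

# Hypothetical tables (values are placeholders, not real statistics).
gdp = [{"country": "IT", "gdp": 10}, {"country": "FR", "gdp": 12}, {"country": "XX", "gdp": 1}]
population = [{"country": "IT", "pop": 5}, {"country": "FR", "pop": 6}]

merged, unmatched = merge_on_key(gdp, population, "country")
print(unmatched)  # ['XX']
```

Reporting the unmatched keys instead of dropping them quietly keeps the merge visible in the audit trail.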
– DATA CLEANING
- limit the scope of the analysis (focus on the scope!)
- check the realistic/possible range of values
- check the origin of the data
- eliminate outliers
- eliminate border observations
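The range check can be sketched as a split into kept and flagged values. The bounds 0 to 120 for an age variable are an assumption made for the example; flagged values are returned rather than thrown away, so the decision to eliminate them stays explicit.

```python
def split_by_range(values, low, high):
    """Separate values inside [low, high] from values outside it."""
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged

ages = [34, 51, -3, 27, 240]  # hypothetical observations
kept, flagged = split_by_range(ages, 0, 120)
print(kept)     # [34, 51, 27]
print(flagged)  # [-3, 240]
```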
how can we treat the missing data?
ELIMINATE: delete horizontally or vertically (observations or variables), but then no statement can be made about what was deleted
IMPUTE: make a statement based on other variables (estimation).
INTERPOLATE: make a statement based on the same variable. OK in cross-section, NOT OK in time.
IMPROVE ECONOMETRICS: other methods, different from interpolation
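A sketch of the first two options, ELIMINATE and IMPUTE, on made-up records. The group-mean rule used for imputation is only one possible estimation choice (an assumption for this example), and `impute_by_group` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical records; None marks a missing income.
rows = [
    {"id": 1, "education": "high", "income": 3000.0},
    {"id": 2, "education": "high", "income": None},
    {"id": 3, "education": "low", "income": 1800.0},
    {"id": 4, "education": "high", "income": 3400.0},
]

# ELIMINATE horizontally: drop the observations with a missing income.
complete_rows = [r for r in rows if r["income"] is not None]

# ELIMINATE vertically: drop the whole income variable.
without_income = [{k: v for k, v in r.items() if k != "income"} for r in rows]

def impute_by_group(rows, target, group):
    """IMPUTE: fill missing `target` with the mean of `target` within the same `group`."""
    observed = {}
    for r in rows:
        if r[target] is not None:
            observed.setdefault(r[group], []).append(r[target])
    group_means = {g: mean(v) for g, v in observed.items()}
    result = []
    for r in rows:
        new = dict(r)
        new["imputed"] = r[target] is None  # keep a flag so estimates stay traceable
        if r[target] is None:
            # Raises KeyError if the group has no observed value to estimate from.
            new[target] = group_means[r[group]]
        result.append(new)
    return result

imputed = impute_by_group(rows, "income", "education")
print(imputed[1]["income"])  # 3200.0
```

The `imputed` flag matters: an estimated value is a statement about the data, not an observation, and later steps should be able to tell the two apart.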
what is meant by audit trail?
document the entire process from original data to the final result
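In code, an audit trail can be as simple as a log that every processing step writes to. A minimal sketch; `apply_step`, the step names and the sample values are all hypothetical.

```python
def apply_step(data, name, func, trail):
    """Run one processing step and record what it did to the number of rows."""
    result = func(data)
    trail.append({"step": name, "rows_in": len(data), "rows_out": len(result)})
    return result

trail = []
raw = [12.0, None, 15.5, 980.0, 14.2]  # original data, never modified in place

no_missing = apply_step(
    raw, "drop missing values", lambda d: [x for x in d if x is not None], trail
)
cleaned = apply_step(
    no_missing, "keep values in [0, 100]", lambda d: [x for x in d if 0 <= x <= 100], trail
)

for entry in trail:
    print(entry)
```

Because `raw` is left untouched and each step is logged, the path from the original data to `cleaned` can be replayed and checked.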
6 dimensions of data quality
- Accuracy -> Data is accurate when it reflects reality. e.g. a person's height recorded as 15 cm is not accurate.
- Completeness -> Data is complete when all required data for a particular use is present.
- Uniqueness -> Data is unique if each entry appears only once within a dataset, without duplicates.
- Consistency -> Data is consistent when values do not conflict within a record or across datasets.
- Timeliness -> Data is timely if it is available when expected and needed.
- Validity -> Data is valid if it conforms to the expected format, type, and range. e.g. the @ in an email address.
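Two of these dimensions, validity and uniqueness, lend themselves to mechanical checks. A sketch on made-up records; the rules (id must be an integer, email must contain exactly one @) are assumptions chosen to mirror the email example above, not a complete email validator.

```python
# Hypothetical records (illustration only).
records = [
    {"id": 1, "email": "anna@example.com"},
    {"id": 2, "email": "marco.example.com"},  # format problem: no @
    {"id": 2, "email": "luca@example.com"},   # duplicate id
    {"id": "4", "email": "sara@example.com"}, # type problem: id is a string
]

def invalid_records(records):
    """Validity: id must be an int and email must contain exactly one '@'."""
    bad = []
    for rec in records:
        ok_type = isinstance(rec["id"], int)
        ok_format = rec["email"].count("@") == 1
        if not (ok_type and ok_format):
            bad.append(rec)
    return bad

def duplicated_ids(records):
    """Uniqueness: return the set of ids that appear more than once."""
    seen, dups = set(), set()
    for rec in records:
        if rec["id"] in seen:
            dups.add(rec["id"])
        seen.add(rec["id"])
    return dups

print([rec["email"] for rec in invalid_records(records)])
print(duplicated_ids(records))  # {2}
```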