Part 3: Data Preparation Flashcards
What is the purpose of data preparation?
To convert acquired ‘raw’ data into valid, consistent data, using structures and representations that will make analysis straightforward
Initial steps of data preparation
1)Explore the content, values, and the overall shape of the data (see the sketch after this list)
2)Determine the purpose for which the data will be used
3)Determine the type and aims of the analysis to be applied to it
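A minimal sketch of the exploration step using Python and pandas; it assumes the raw data has already been loaded, and the column names and values are invented purely for illustration:

```python
import pandas as pd

# Invented example data standing in for freshly acquired 'raw' data.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, 29, None, 131],                   # one missing, one implausible value
    "country": ["UK", "U.K.", "France", "FR"],    # mixed representations
})

# Overall shape: how many rows and columns are we dealing with?
print(df.shape)

# Content and types: what columns exist and how are they stored?
print(df.dtypes)

# Values: summary statistics and a first look at the actual rows.
print(df.describe(include="all"))
print(df.head())
```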
Possible problems discovered in real data
1)Data is incorrectly packaged
2)Some values may not make sense
3)Some values may be missing
4)The format doesn’t seem right
5)The data doesn’t have the right structure for the tools and packages to be used with it (see the checks sketched after this list)
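A minimal sketch, again using pandas on invented data, of quick checks that surface several of these problems (missing values, values that don’t make sense, and format mismatches):

```python
import pandas as pd

# Invented example data containing several of the problems listed above.
df = pd.DataFrame({
    "age": [34, 29, None, 131],                  # missing and nonsensical values
    "signup_date": ["2021-03-01", "01/04/2021", "2021-05-12", "not recorded"],
    "country": ["UK", "U.K.", "France", "FR"],
})

# Missing values: count them per column.
print(df.isna().sum())

# Values that may not make sense: flag ages outside a plausible range
# (missing ages surface here as well).
print(df[~df["age"].between(0, 120)])

# Format problems: dates that do not parse with the expected format.
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
print(df[parsed.isna()])
```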
Data preparation activities:
1)Data cleansing: remove or repair obvious errors and inconsistencies in the dataset
2)Data integration: combining data from multiple sources into a single dataset
3)Data transformation: reshaping datasets into the structure needed for analysis (a combined sketch follows this list)
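A minimal combined sketch of the three activities, using pandas on two invented source tables:

```python
import pandas as pd

# Invented customer and order tables standing in for two separate sources.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": [" Alice ", "Bob", "Bob", "Carol"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [10.0, 25.0, 7.5],
})

# Data cleansing: remove duplicate rows and repair an obvious formatting problem.
customers = customers.drop_duplicates()
customers["name"] = customers["name"].str.strip()

# Data integration: combine the two sources on their shared key.
combined = customers.merge(orders, on="customer_id", how="left")

# Data transformation: reshape into one row per customer with a total spend.
per_customer = combined.groupby("name", as_index=False)["amount"].sum()
print(per_customer)
```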
In data warehousing, the data preparation activities are known as:
ETL (Extract, Transform, Load) is used for the process of taking data from operational systems and loading it into the warehouse
The terms data harmonization and data enhancement are also used
Classification of error types:
1)Validity
2)Accuracy
3)Completeness
4)Consistency
5)Uniformity
Validity
Checking whether the data values match any specified constraints, value limits, and formats for the column in which they appear
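A minimal validity-check sketch in pandas; the age range limit and postcode pattern below are assumed constraints, chosen only for illustration:

```python
import pandas as pd

# Invented records; the constraints are assumed for illustration:
# age must lie in 0-120 and postcode must match a simple pattern.
df = pd.DataFrame({
    "age": [34, -5, 200, 41],
    "postcode": ["AB1 2CD", "12345", "XY9 8ZW", "???"],
})

# Validity: do the values satisfy the specified limits and formats?
valid_age = df["age"].between(0, 120)
valid_postcode = df["postcode"].str.match(r"^[A-Z]{2}\d \d[A-Z]{2}$")

# Rows violating at least one constraint.
print(df[~(valid_age & valid_postcode)])
```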
Accuracy
Checking the correctness of values, which requires some external ‘gold standard’ to check them against
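A minimal sketch of an accuracy check in pandas; both the observed data and the ‘gold standard’ reference table are invented for illustration:

```python
import pandas as pd

# Invented dataset and an equally invented external 'gold standard' lookup.
observed = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"],
    "population_m": [8.8, 11.0, 3.7],     # recorded values
})
gold_standard = pd.DataFrame({
    "city": ["London", "Paris", "Berlin"],
    "population_m": [8.8, 2.1, 3.7],      # trusted reference values
})

# Accuracy: a value can be valid (in range, well formatted) yet still wrong;
# only comparison against the reference source reveals the error in the Paris row.
check = observed.merge(gold_standard, on="city", suffixes=("_obs", "_ref"))
print(check[check["population_m_obs"] != check["population_m_ref"]])
```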
Completeness
Checking whether all the values are present and whether any are missing
Consistency
If two values should be the same but are not, then there is an inconsistency.
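A minimal consistency-check sketch in pandas, using an invented rule that a line total should equal unit price times quantity:

```python
import pandas as pd

# Invented order records where two fields should agree by definition.
df = pd.DataFrame({
    "unit_price": [2.0, 5.0, 3.0],
    "quantity": [3, 2, 4],
    "line_total": [6.0, 10.0, 15.0],   # last row disagrees with price * quantity
})

# Consistency: line_total should equal unit_price * quantity;
# flag rows where the two derivations of the same fact do not match.
inconsistent = df[df["line_total"] != df["unit_price"] * df["quantity"]]
print(inconsistent)
```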
Uniformity
It is necessary to choose a base or
canonical representation and translate all values to that
form
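A minimal uniformity sketch in pandas, assuming weights recorded in a mix of kilograms and pounds, with kilograms chosen as the canonical unit:

```python
import pandas as pd

# Invented measurements recorded in a mixture of units.
df = pd.DataFrame({
    "weight": [70.0, 154.0, 65.0],
    "unit": ["kg", "lb", "kg"],
})

# Uniformity: choose kilograms as the canonical representation and
# translate every value to that form.
LB_TO_KG = 0.453592
df["weight_kg"] = df["weight"].where(df["unit"] == "kg", df["weight"] * LB_TO_KG)
print(df)
```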
Data harmonization
A data cleansing activity for creating a common (aka canonical) form for non-uniform data. Mixed forms often occur when two or more data sources use different representations and are then combined
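A minimal harmonization sketch in pandas; the country-name variants and the canonical mapping are invented for illustration:

```python
import pandas as pd

# Invented records combined from two sources that spell the same country
# in different ways.
df = pd.DataFrame({
    "country": ["UK", "U.K.", "United Kingdom", "FR", "France"],
})

# Harmonization: map every known variant onto a single canonical form.
canonical = {
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "United Kingdom": "United Kingdom",
    "FR": "France",
    "France": "France",
}
df["country"] = df["country"].map(canonical)
print(df["country"].value_counts())
```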
Approaches to handling dirty data
- fix it – replace incorrect or missing values with the correct values
- remove it – remove the value, or a group of values (or rows of data or data elements) from the dataset
- replace it – substitute a default marker for the incorrect value, so that later processing can recognize it is dealing with inappropriate values
- leave it – simply note that it was identified and leave it, hoping that its impact on subsequent processing is minimal (the sketch after this list illustrates the first three approaches)
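A minimal sketch of the ‘fix it’, ‘remove it’, and ‘replace it’ approaches in pandas, on invented data:

```python
import pandas as pd

# Invented data with one impossible age and one missing country.
df = pd.DataFrame({
    "age": [34, -5, 29],
    "country": ["UK", "France", None],
})

# Fix it: replace the incorrect value with the correct one
# (here assumed to have been looked up in the source system).
df.loc[df["age"] < 0, "age"] = 5

# Remove it: drop rows that are missing a value we cannot do without.
removed = df.dropna(subset=["country"])

# Replace it: substitute a recognisable default marker so later processing
# knows it is dealing with an inappropriate value.
marked = df.fillna({"country": "UNKNOWN"})

print(removed, marked, sep="\n")
```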
Documenting data cleansing
- It is necessary to:
- document how the dirty data was identified and handled, and for what reason
- maintain the data in both raw and ‘cleaned’ form
- If the data originally came from operational systems, it might be necessary to feed the findings back to the managers of these systems (a simple change-log sketch follows this list)
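A minimal sketch, in pandas, of keeping the raw and cleaned forms side by side and logging what was changed; the file names and log fields are placeholders:

```python
import pandas as pd

# Invented raw data; the file names below are placeholders.
raw = pd.DataFrame({"age": [34.0, -5.0, 29.0]})

# Keep the raw form untouched and apply changes to a copy.
cleaned = raw.copy()
cleaned.loc[cleaned["age"] < 0, "age"] = float("nan")

# Record what was changed, how it was identified, and why.
log = [{
    "column": "age",
    "issue": "negative value failed the 0-120 range check",
    "action": "replaced with a missing-value marker",
    "rows_affected": int((raw["age"] < 0).sum()),
}]

raw.to_csv("data_raw.csv", index=False)
cleaned.to_csv("data_cleaned.csv", index=False)
pd.DataFrame(log).to_csv("cleansing_log.csv", index=False)
```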
What are the benefits of documenting data cleansing?
- Allows others to consider the changes made and ensure they were both valid and sensible.
- Helps to build a core of approaches and methods for the kinds of datasets that are frequently used.
- Allows managers of the operational systems where the data came from to adjust and improve their validation processes.
- Allows you, in time, to develop effective cleansing regimes for specialized data assets.