Introduction to ETL and Data Transformation Flashcards
Where is process data hidden? And what needs to happen to it?
in ERP systems that captures data in tables, it needs to be first extracted then transformed into a specific format before being analysed in a mining tool like process intelligence
Explain the concept of ETL
preparing transactional data for mining
What does ETL stand for?
[Data] Extraction
[Data] Transformation
[Data] Load
In terms of business processes what is data extractions/
retrieval of all business-related data within the system used for process mining
What questions surrounding ETL should we ask?
what data is required and where is it stored?
What process is it?
Which IT-Systems are used?
What is the timeframe?
Does all recorded activity have a timestamp?
What are the minimum key requirements for ETL?
a valid case with a case ID, event name identifier and a timestamp for each event.
What would the time frame be?
ideally all records but as this is a lot of data, its usually a smaller parameter such as 1 year.
What is the issue with a smaller time frame?
not all cases have been executed so you need to consider whether they should be removed from extraction
3 steps of data extraction
- define the process scope [where it starts and end, what events are important]
- Identify the relevant business objects [state transitions of these objects that allow you to track the progress]
- Identify the required systems and tables [for each event idetify id, event name, timestamp]
When you’re extracting data from multiple systems what is the recommendation?
start small by by extracting the data from one system to get your first results then you can expand the process with more data in the next iteration
What can you do if the data is difficult to extract (external systems) or there is no unique identifier?
you can combine 2 values eg. order value and order time or reduce the timeframe if ID can’t be created