Data Processing Lifecycle Flashcards
Section 1
What are the five stages of the data processing lifecycle?
- Data Ingestion and Integration
- Data Processing
- Data Storage
- Data Analysis
- Reporting
What does the first stage of the data processing lifecycle involve?
- Collecting data from a variety of sources, transforming it (if needed) to match the target storage, and then storing it in the target storage
- When data is integrated from multiple sources, we have to be aware of data heterogeneity
What types of data make up heterogeneous data?
- XML files
- JSON Files
- Weblogs and Cookies
- SQL Queries
- Flat files
Why is heterogeneous data increasing?
The rise of new technologies is increasing the volume and variety of data produced
What is data heterogeneity?
- Data made up of different types, sources, structures or formats.
- May be in a structured, unstructured or semi-structured format
What is structured data?
- Conforms to a well-defined schema - schema on write
- Often tabular (rows are datapoints and columns are attributes)
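A minimal sketch of schema on write using Python's built-in sqlite3 module (the table and column names are illustrative): the schema is declared before any data is written, and rows that do not conform are rejected.

```python
import sqlite3

# Schema on write: the table structure is declared up front,
# before any rows are inserted.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        id      INTEGER PRIMARY KEY,
        sensor  TEXT NOT NULL,
        reading REAL NOT NULL
    )
""")

# Rows are datapoints, columns are attributes.
conn.execute("INSERT INTO measurements (sensor, reading) VALUES (?, ?)",
             ("thermometer-1", 21.5))

# A row that violates the schema (a missing required attribute)
# is rejected at write time.
try:
    conn.execute("INSERT INTO measurements (sensor) VALUES (?)", (None,))
except sqlite3.IntegrityError as err:
    print("rejected by schema:", err)
```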
Where is structured data stored?
- Relational Databases
- Data Warehouses
- Legacy Data Systems
What is semi-structured data?
- Schema is not completely defined by a data model - no schema on write
- The size, order and contents of the elements can be different
- Often used with IoT devices
What formats might semi-structured data be in and where might they be used?
- HTML files
- XML files
- JSON files
- Often used in IoT devices
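A small sketch, using only Python's standard library, of the same (hypothetical) IoT reading arriving in two semi-structured formats; the field names are invented. The size, order and contents of the elements can differ between payloads, so the consumer rather than the storage layer decides how to interpret them.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical sensor reading in two semi-structured formats.
json_payload = '{"device": "sensor-7", "temp": 21.5, "extra": {"battery": 0.9}}'
xml_payload = "<reading><temp>21.5</temp><device>sensor-7</device></reading>"

record_from_json = json.loads(json_payload)            # nested dict
root = ET.fromstring(xml_payload)
record_from_xml = {child.tag: child.text for child in root}

# Fields, nesting and order differ, yet both carry the same record.
print(record_from_json["device"], record_from_xml["device"])
```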
What is unstructured data?
- No formal description of schema - no schema on write
- Human readable, requires pre-processing for a computer to extract information.
What types of file formats make up unstructured data?
- Text files
- Image Files
- Video Files
- Audio Files
What are data ingestion and integration frameworks?
- Often carried out as a single technical solution
- Have been used in data warehousing for a long time
- Uses the ETL process
What is the ETL process?
- Extract: collects raw data, often in a structured format
- Transform: processes the data into a format matching the end destination
- Load: stores the transformed data into its new storage location
What might be carried out during the transformation process of ETL?
- Data cleaning
- Data enrichment
- Feature engineering
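A minimal ETL sketch in plain Python (the source records, field names and target table are all illustrative). The transform step labels the three operations above: cleaning, enrichment and feature engineering.

```python
import sqlite3

def extract():
    """Extract: collect raw records from a source (a real job might
    read a CSV file, call an API, or query another database)."""
    return [
        {"item": "widget", "price": "120.0"},
        {"item": "bolt", "price": ""},   # incomplete record
        {"item": "gadget", "price": "30.0"},
    ]

def transform(rows):
    """Transform: process the data into a format matching the destination."""
    out = []
    for row in rows:
        if not row["price"]:             # data cleaning: drop incomplete rows
            continue
        row["currency"] = "GBP"          # data enrichment: add derived context
        row["band"] = ("high" if float(row["price"]) > 100
                       else "low")       # feature engineering: new attribute
        out.append(row)
    return out

def load(rows, conn):
    """Load: store the transformed data in the target storage."""
    conn.execute("CREATE TABLE sales (item TEXT, price REAL, "
                 "currency TEXT, band TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (:item, :price, :currency, :band)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM sales").fetchall())
```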
Where is ETL traditionally used?
Batch processing
What other approaches can be used instead of ETL?
- IoT Hubs
- Digital Twins
- Data Pipeline Orchestrators
- Bulk import tools
- Data Streaming Platforms
What are the requirements for a data integration tool?
- Different protocols to support data collection from different sources
- Support for integration processes across different hardware and operating systems
- Scalability and adaptability
- Integrated capabilities for transformation operations (including fundamental and complex transformations)
- Security mechanisms to protect the data in the pipeline
- Visualisation of data flow (not necessary, but offered by many tools)
What is data variety?
The diversity of data types
What is data veracity?
The level of trust in the collected data
What is meant by data velocity?
The speed of data generation and its movement
What are challenges with data integration and ingestion?
- The increasing variety and veracity of data
- Processing large amounts of data with high velocity
- New requirements due to increased interest in new technologies
- Cybersecurity - data needs to be secure, trusted and accountable
- Encryption to protect data during transport
How do data processing frameworks work?
- Distribute storage and processing over several nodes
- Transform data in several steps
- Can efficiently store, access and process large amounts of data
How is the transformation process generally modelled in a data processing framework?
As a Directed Acyclic Graph
How are Directed Acyclic Graphs used in data processing?
- Each stage has an input and an output
- An input can feed multiple tasks; dependencies are clearly defined and there are no feedback loops
What are the components of a Directed Acyclic Graph used in data processing?
- Orchestrator: supervises task execution
- Scheduler: coordinates tasks
- Executor: runs the tasks
- Metadata component: monitors the state of the pipeline
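A toy sketch of the idea using the standard library (the task names are invented): each task declares its dependencies, a valid execution order is derived from the edges, and tasks are then run in that order. Production orchestrators such as Apache Airflow model pipelines the same way.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its inputs).
# One output ("extract") feeds two downstream tasks; no cycles allowed.
dag = {
    "extract": set(),
    "clean":   {"extract"},
    "enrich":  {"extract"},
    "load":    {"clean", "enrich"},
}

def run(task):
    print("executor running:", task)

# Scheduler role: derive a valid order from the dependency edges;
# TopologicalSorter raises CycleError if a feedback loop sneaks in.
for task in TopologicalSorter(dag).static_order():
    run(task)   # executor role: run each task once its inputs exist
```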
What is batch processing?
- Raw data are collected at given time intervals (hourly, daily, weekly)
- Data is transformed and loaded into centralised storage
- Process can be automated to run without human intervention
What is stream processing?
- Real-time processing
- Shortens processing intervals into smaller windows (typically sub-second)
- May use an event driven approach where data are processed as soon as they are available
What is the event driven approach in stream processing?
- Often implemented as a publish-subscribe architecture
- Data-producing sources notify the stream processor of new data/events
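A minimal in-process publish-subscribe sketch (the class, topic and event names are all invented): producers notify the broker as events occur, and subscribed processors handle each event as soon as it arrives rather than waiting for a batch.

```python
from collections import defaultdict

class Broker:
    """Toy publish-subscribe broker: sources publish events to a topic,
    subscribers are notified as soon as an event arrives."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)   # processed immediately, not batched

broker = Broker()
broker.subscribe("sensor-readings",
                 lambda e: print("stream processor got:", e))

# The data-producing source notifies the stream processor of new data.
broker.publish("sensor-readings", {"device": "sensor-7", "temp": 21.5})
```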
What is the difference between batch processing and stream processing?
- Batch data are bounded (the size and number of data entries are known before processing)
- Stream data are unbounded (the size and number of entries are unknown)
What data batch size does batch processing handle?
Large data batches
What data batch size does stream processing handle?
Individual records or microbatches
What is the latency of batch and stream processing?
- Batch processing is high latency
- Stream processing is low latency
What type of tasks is batch processing used for?
Complex data processing
What types of tasks is stream processing used for?
Time sensitive data or quick insights
What is a Lambda architecture?
- A data processing design pattern that supports both batch and stream processing to handle massive amounts of data
- Made up of a batch, speed and serving layer
How does a Lambda architecture work?
- Abstracts the underlying implementation.
- Enables complex analysis with batch processing and quick insights through stream processing
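A highly simplified sketch of the Lambda pattern (all names and numbers are invented): the batch layer recomputes an accurate view over all historical data, the speed layer maintains a quick incremental view over recent events, and the serving layer merges the two to answer queries.

```python
# Toy Lambda architecture: one dataset, two processing paths.
historical_events = [10, 20, 30]   # already landed in the batch store
recent_events = [5, 7]             # still arriving on the stream

def batch_layer(events):
    """Slow but complete: recompute an accurate view over all history."""
    return sum(events)

class SpeedLayer:
    """Fast but incremental: update a real-time view per event."""
    def __init__(self):
        self.view = 0
    def update(self, event):
        self.view += event

speed = SpeedLayer()
for event in recent_events:        # stream processing path
    speed.update(event)

def serving_layer():
    """Merge the batch view and the real-time view to answer queries."""
    return batch_layer(historical_events) + speed.view

print("total so far:", serving_layer())   # 60 + 12 = 72
```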
What is Kappa Architecture and how does it work?
- No separate batch layer
- The batch and speed layers are integrated into a single processing layer capable of both batch and stream processing
What are the benefits of using Kappa architecture?
- Easier to implement than lambda architecture
- Simplified architecture
- Easier to set up and maintain
- More secure as there is a smaller attack surface
Is it more efficient to set up your own processing architecture or use a managed service?
- More resource efficient to use a managed service than setting up, maintaining and administering your own service.
- AWS, Azure and Google Cloud Platform have the biggest market share.
- Apache also offers solutions through Hadoop
What services does Azure offer for batch processing?
- Azure Synapse - advanced analytics
- Azure Data Lake Analytics - analytical services for Azure Data Lake
- HDInsight - for Hadoop
- Databricks - large scale analytics based on Spark
What are examples of use cases for stream processing?
- Fraud detection
- Social media analysis
- Log monitoring
- Customer analysis
What services does Azure offer for stream processing?
- HDInsight with Spark or Storm
- Databricks
- Azure Stream Analytics
- Azure Functions
- Azure App Service WebJobs
- IoT and Event Hubs
What do you need to consider when choosing a stream processing solution?
- Which programming languages are supported
- Programming paradigm (declarative or imperative)
- Pricing model
- Available connectors.
What are the different pricing models for stream processing?
- Per Streaming Unit
- Per Active Cluster Hour
- Per Executed Function
What are data sources?
Systems that generate data
What are data sinks?
- Storage systems designed to receive data from different sources
- Used for processing, analysis or retrieval
What components are available for lambda architecture on AWS?
- Kinesis Firehose for data ingestion
- S3 (Amazon Simple Storage Service) - cloud based distributed file system
- AWS Glue for batch processing
- Amazon Athena: serverless interactive query service for data in S3
- Amazon QuickSight: reporting tool
What is primary data storage?
Holding data in memory or CPU registers during program execution
What is secondary data storage?
Storing data on hardware devices so that the data is available for future use.
What is physical secondary storage?
Hard Disk Drives, USB Flash Drives, SD Cards, Solid State Drives or other physical media used as direct attached storage (a storage drive that is directly connected to a computer).
What is cloud storage?
- Storage solutions deployed over a network
- An infrastructure of interconnected servers
Why is cloud storage suited for data intensive applications?
- Can store large amounts of data
- Distribute data across the physical storage of individual machines
What are the two ways stored files can be encoded?
Text files or Binary files. The file content of both is a sequence of bits.
What are text files?
- Human readable files
- The bits represent characters
- Can be stored as plain text (txt) or rich text (rtf)
- May also be a CSV table
What are binary files?
- Computer readable files
- A sequence of bytes, optimised for computer access
- Better at data compression - faster access and smaller files
- Images, audio, video files (or any file that needs translation by a computer program to read the content)
What is data serialisation?
The process of translating text data into binary data
What is data de-serialisation?
The process of translating binary data into text data
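A short sketch of these two cards using the standard library struct module (the values are invented): the same data as human-readable text and as packed binary. Here the binary form is smaller and fixed-layout, but needs a program to interpret it, matching the text/binary distinction above.

```python
import struct

# Text encoding: human readable, the bits represent characters.
text = "3.141592653589793,1000000"
print(len(text.encode("utf-8")), "bytes as text")    # 25

# Serialisation: translate the text data into binary
# (an 8-byte double followed by a 4-byte int, little-endian).
value, count = text.split(",")
blob = struct.pack("<di", float(value), int(count))
print(len(blob), "bytes as binary")                  # 12

# De-serialisation: translate the binary data back into text.
value2, count2 = struct.unpack("<di", blob)
print(f"{value2!r},{count2}")                        # round-trips
```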
What are the five fundamental storage categories?
- File systems
- Data Lakes
- Data Warehouses
- Relational Databases
- NoSQL databases
What are local file systems?
- Generally store unstructured data objects
- Hierarchy of drives, directories, subdirectories and files
- Generally focus on sharing and access aspects of small datasets rather than large distributed datasets
What are cloud-based file systems?
A collaborative platform to store and share files remotely
What features do cloud-based file systems offer?
- Automated backups
- Synchronisation of data
- User-specific file or folder access
- Versioning
- Data security services
- User friendly interface
What are examples of cloud based file systems?
- Sync
- pCloud
- Icedrive
- Google Drive
- SharePoint
- OneDrive
- Dropbox