Introduction to Data Engineering Flashcards
The role of a data engineer
- the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
- helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.
Types of Data
- Structured
- Unstructured
- Semi-structured
Structured data
- primarily comes from table-based source systems
- the rows and columns are aligned consistently throughout the file.
- common examples: comma-separated values (CSV) files and relational database (RDB) tables.
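
A minimal sketch of loading structured data with pandas; the CSV content and column names are hypothetical:

```python
import io

import pandas as pd

# Inline CSV so the example is self-contained; in practice the data
# would come from a file or a relational database table.
csv_data = io.StringIO(
    "order_id,customer,amount\n"
    "1001,Contoso,250.00\n"
    "1002,Fabrikam,125.50\n"
)
df = pd.read_csv(csv_data)  # rows and columns align consistently
print(df.dtypes)            # one consistent type per column
```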
Semi-structured Data
- doesn’t have to fit neatly into a table structure; JavaScript Object Notation (JSON) is a common example.
- may require flattening prior to loading into a relational store, as sketched below.
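
A minimal sketch of flattening nested JSON with pandas; the order records are hypothetical:

```python
import pandas as pd

# Nested records that don't fit a table until the "customer"
# object is flattened into columns.
records = [
    {"order_id": 1001, "customer": {"name": "Contoso", "region": "EU"}},
    {"order_id": 1002, "customer": {"name": "Fabrikam", "region": "US"}},
]
flat = pd.json_normalize(records)  # nested keys become dotted column names
print(flat.columns.tolist())       # ['order_id', 'customer.name', 'customer.region']
```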
Unstructured data
- data stored as key-value pairs that don’t adhere to standard relational models.
- other common types include Portable Document Format (PDF) files, word processor documents, and images.
Main data operations
- Data integration
- Data transformation
- Data consolidation
Data integration
establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.
Data transformation
performed through an extract, transform, and load (ETL) process in which the data is prepared to support downstream analytical needs (see the sketch below).
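
A minimal ETL sketch with pandas, assuming pyarrow is available for the Parquet write; the sales records and file name are hypothetical:

```python
import pandas as pd

# Extract: raw operational records (inlined here to keep the demo self-contained).
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-06"],
    "amount": ["250.00", "125.50", "80.00"],
})

# Transform: cast types and aggregate so the data suits analytical queries.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["amount"] = raw["amount"].astype(float)
daily = raw.groupby("order_date", as_index=False)["amount"].sum()

# Load: write to an analytical store (a local Parquet file stands in here).
daily.to_parquet("daily_sales.parquet", index=False)
```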
Data consolidation
- the process of combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.
- data from operational systems is extracted, transformed, and loaded into analytical stores such as a data lake or data warehouse.
Operational data
transactional data that is generated and stored by applications, often in a relational or non-relational database.
Analytical data
data that has been optimized for analysis and reporting, often in a data warehouse.
Streaming data
data generated in real time by perpetual sources, often relating to specific events.
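
A minimal Spark Structured Streaming sketch; the built-in rate source stands in for a real event source such as Azure Event Hubs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The "rate" source emits a timestamped row every second, forever,
# which makes it a convenient stand-in for a live event stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Treat the unbounded stream like a table and print rows as they arrive.
query = events.writeStream.format("console").start()
query.awaitTermination(10)  # run briefly for the demo
query.stop()
```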
Data pipelines
- are used to orchestrate activities that transfer and transform data.
- Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events.
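
A conceptual sketch of a pipeline as an ordered set of activities, in the spirit of what Azure Data Factory or Synapse pipelines orchestrate; all activity names here are hypothetical placeholders, not a real API:

```python
# Each function stands in for one pipeline activity; a trigger
# (schedule or event) would start the run.
def extract_orders():
    print("copy data from the operational store")

def transform_orders():
    print("clean and reshape the copied data")

def load_warehouse():
    print("load the result into the analytical store")

pipeline = [extract_orders, transform_orders, load_warehouse]
for activity in pipeline:  # activities run in order, like a pipeline run
    activity()
```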
Data lakes
- a storage repository that holds large amounts of data in native, raw formats.
- optimized for scaling to massive volumes (terabytes or petabytes) of data.
- data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured.
GOAL: store everything in its original, untransformed state.
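
A minimal sketch of landing raw data in a lake-style folder layout; the paths and payload are hypothetical:

```python
from datetime import date
from pathlib import Path

# Raw events land unmodified, partitioned by source and ingestion date,
# so the original, untransformed state is always preserved.
payload = b'{"order_id": 1001, "customer": "Contoso"}'  # stored as-is
ingest_date = date(2024, 1, 15)
target = Path("raw/orders") / f"{ingest_date:%Y/%m/%d}" / "events-0001.json"
target.parent.mkdir(parents=True, exist_ok=True)
target.write_bytes(payload)  # no transformation before landing
```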
Data warehouse
- a centralized repository of integrated data from one or more disparate sources.
- stores current and historical data in relational tables, organized into a schema that optimizes performance for analytical queries.
Apache Spark
a parallel processing framework that takes advantage of in-memory processing and distributed file storage.
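
A minimal PySpark sketch of parallel, in-memory processing; the sales data is hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# A small DataFrame stands in for files read from distributed storage;
# Spark splits it into partitions that are processed in parallel.
df = spark.createDataFrame(
    [("EU", 250.0), ("US", 125.5), ("EU", 80.0)],
    ["region", "amount"],
)

df.cache()  # keep the partitions in memory across the actions below
df.groupBy("region").agg(F.sum("amount").alias("total")).show()
print(df.count())  # a second action reuses the cached, in-memory data
```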
Core Azure technologies used to implement data engineering workloads include:
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Stream Analytics
- Azure Data Factory
- Azure Databricks
- Azure Event Hubs