Chapter 4 - Ingesting and Transforming Data Flashcards
Incremental loading
The process of loading data in smaller increments into a storage solution, e.g. daily data loaded into a data lake or hourly data flowing into an ETL pipeline.
What does Azure Data Factory do?
- Pipelines: a collection of activities linked together for control flow or data transformation.
- Activities: Steps in the pipeline, such as copying data or running a Spark job.
- Datasets: the data that pipelines and activities operate on.
- Linked Services: connections that are utilized by ADF to link a variety of data stores and resources in Azure.
What are the ADF incremental loading techniques?
Watermarks (Relational DB): Track the last updated record (e.g., via a “LastModified” column) to load only new or changed rows.
File Timestamp (Blob Storage): Use the file’s last modified date/time to identify and load only the updated files at source.
Partition Data (Time-Based Source): Load data incrementally by processing only the time-specific partitions that have changed.
Folder Structure (Time-Based Source): Organize files into time-based folders so you can pull only new or updated data during each load.
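A minimal PySpark sketch of the watermark technique, assuming an existing orders source table with a LastModified column and a single-row watermark_store table (all names are illustrative, not ADF syntax):

```python
# Sketch: load only rows changed since the last run (watermark technique).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Read the watermark saved by the previous run (assumed single-row table).
last_watermark = (
    spark.read.table("watermark_store").agg(F.max("LastModified")).collect()[0][0]
)

# Pull only new or changed rows from the source.
changed = spark.read.table("orders").filter(F.col("LastModified") > F.lit(last_watermark))

# Append the delta to the target; a real pipeline would then advance the watermark.
changed.write.mode("append").saveAsTable("orders_silver")
```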
Three APIs supported by Apache Spark?
- Resilient Distributed Datasets
- Dataframes
- Datasets
What are Resilient Distributed Datasets?
RDDs are the backbone of Spark: a fault-tolerant data abstraction that lets you perform in-memory computations across large clusters.
Think of them as a team of data warriors. Spark can build RDDs from file types such as JSON, CSV, and Parquet.
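A tiny, hedged RDD sketch showing the low-level API (the data is made up):

```python
# RDDs: distribute a collection across the cluster and transform it in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])   # build an RDD
squares = numbers.map(lambda x: x * x)      # lazy, fault-tolerant transformation
print(squares.collect())                    # [1, 4, 9, 16, 25]
```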
What are DataFrames?
Similar to tables in relational databases. They are immutable and distributed, but represent a higher level of data abstraction than RDDs.
They contain schemas, columns, and rows.
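A minimal DataFrame sketch showing schema, columns, and rows (sample data is made up):

```python
# DataFrames: table-like, immutable, distributed, with an explicit schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("Ana", 34), ("Ben", 29)],        # rows
    schema="name STRING, age INT",     # schema: columns and types
)
df.printSchema()
df.show()
```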
What is T-SQL and what is it used for?
Transact-SQL, a variant of SQL used primarily with Microsoft SQL Server; it extends standard SQL with enterprise features such as procedural logic, variables, and error handling.
How are Azure Data Factory and Azure Synapse Pipelines used to transform data?
Azure Data Factory (a stand-alone data integration and transformation service) and Azure Synapse Pipelines (the analytics-focused equivalent integrated with other Synapse features) offer low-code alternatives to T-SQL and Spark (Python) environments; custom code can still be added later to extend the available features.
What are Synapse Pipelines?
Synapse pipelines are ADF pipelines whose functionality has been ported into the Synapse workspace; for all practical purposes they are functionally equivalent to ADF pipelines.
What kind of transformations does Azure Data Factory provide?
Schema Transformations - change the schema of the data, e.g. Aggregate, Source (add a new source), Sink (add a destination to export to).
Row Transformations - Alter Row marks rows for insert, delete, update, or upsert against the sink.
Multi-I/O Transformations - operate on more than one input or split one input into multiple outputs (splitting rows, merging rows), e.g. Conditional Split, Join, Union. A PySpark sketch of a few of these follows below.
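Mapping data flows execute on Spark, so the same ideas can be sketched in PySpark; the tables and columns below are illustrative:

```python
# Source, Union (multi-I/O), Aggregate (schema), and Sink equivalents.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataflow-equivalents").getOrCreate()

# Two Sources
sales_eu = spark.createDataFrame([("2024-01-01", 100.0)], "order_date STRING, amount DOUBLE")
sales_us = spark.createDataFrame([("2024-01-01", 250.0)], "order_date STRING, amount DOUBLE")

# Multi-I/O: Union merges the two inputs into one stream.
sales = sales_eu.unionByName(sales_us)

# Schema transformation: Aggregate groups rows and adds a derived column.
daily_totals = sales.groupBy("order_date").agg(F.sum("amount").alias("total"))

# Sink: write the result to a destination (a Parquet folder here).
daily_totals.write.mode("overwrite").parquet("/tmp/daily_totals")
```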
Azure Data Factory Pipelines
ADF provides standard templates that you can use to copy and transform data, available in the ADF template gallery.
When should I use Stream Analytics? What does it do?
Stream Analytics provides real-time data processing, enabling transformation, filtering, and aggregation of continuous data streams from sources such as IoT sensors.
What is “trimming inputs” related to Cleansing data?
- Removing unwanted characters from values, such as trailing whitespace.
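A minimal PySpark sketch of trimming an input column (sample data is made up):

```python
# Trim leading/trailing whitespace before sending data downstream.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trim-demo").getOrCreate()
df = spark.createDataFrame([("  Contoso  ",)], "customer_name STRING")

df = df.withColumn("customer_name", F.trim("customer_name"))
df.show()
```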
What is standardizing values in data cleansing?
Converting currency values to a single type, e.g. USD, before sending data downstream.
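A hedged PySpark sketch of standardizing amounts to USD; the exchange rate and column names are illustrative assumptions:

```python
# Standardize every amount to a single currency (USD) using a fixed sample rate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("standardize-demo").getOrCreate()
df = spark.createDataFrame(
    [("EUR", 100.0), ("USD", 80.0)],
    "currency STRING, amount DOUBLE",
)

usd = df.withColumn(
    "amount_usd",
    F.when(F.col("currency") == "EUR", F.col("amount") * 1.08)   # illustrative rate
     .otherwise(F.col("amount")),
)
usd.show()
```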
What is handling outliers in data cleansing?
If values in some fields look abnormal, you can replace or impute them with average or median values, e.g. replace any Salary greater than 5,000 with AvgSalary, where AvgSalary is computed by an aggregate function in the workflow.
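A minimal PySpark sketch of this outlier rule; the 5,000 threshold comes from the example above, everything else is illustrative:

```python
# Replace abnormal salaries with the average computed by an aggregate step.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("outlier-demo").getOrCreate()
df = spark.createDataFrame(
    [("a", 4200.0), ("b", 3900.0), ("c", 950000.0)],
    "employee STRING, salary DOUBLE",
)

# Aggregate step: compute the average salary once.
avg_salary = df.agg(F.avg("salary")).collect()[0][0]

# Replace values above the threshold with the average.
cleaned = df.withColumn(
    "salary",
    F.when(F.col("salary") > 5000, F.lit(avg_salary)).otherwise(F.col("salary")),
)
cleaned.show()
```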
What is handling duplicate data within data cleansing?
Can be removed via deduplication and cleaning during the ETL process in ADF.
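A minimal PySpark deduplication sketch (in an ADF mapping data flow the same result is typically achieved with an Aggregate transformation over the business keys):

```python
# Drop exact duplicate rows; pass a column list to dedupe on business keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "Ana"), (1, "Ana"), (2, "Ben")],
    "id INT, name STRING",
)

deduped = df.dropDuplicates()
deduped.show()
```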
What is late-arriving data and how is it handled during data cleaning?
Late-arriving data is information that arrives after its expected processing window due to:
- Technical issues (network/system delays)
- Business processes (delayed entries, batch uploads)
Handling approaches:
- Drop the data if the loss is acceptable
- Reprocess the pipeline if accuracy is required
- Update historical records via SCD Type 2
- Track processing windows with timestamps
- Implement windowing strategies in data flows (see the streaming sketch below)
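A hedged sketch of the windowing approach using Spark Structured Streaming with a watermark to tolerate late events; the source path, schema, and window sizes are illustrative assumptions:

```python
# Windowing with a watermark: accept events up to 1 hour late, then finalize.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

events = (
    spark.readStream.format("json")
    .schema("device STRING, reading DOUBLE, event_time TIMESTAMP")
    .load("/landing/events/")            # assumed landing folder
)

windowed = (
    events.withWatermark("event_time", "1 hour")              # tolerate 1h lateness
    .groupBy(F.window("event_time", "15 minutes"), "device")
    .agg(F.avg("reading").alias("avg_reading"))               # later events are dropped
)
# A writeStream sink would start the query and emit finalized windows.
```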
What are the two data splitting techniques?
Conditional Split: split based on set criteria.
Cloning (New Branch): Copy entire dataset for new execution flow. Enables parallel processing or different operations on the same dataset without affecting the original data flow.
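A minimal PySpark sketch of both techniques; the amount threshold and column names are illustrative:

```python
# Conditional Split and New Branch equivalents.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-demo").getOrCreate()
orders = spark.createDataFrame([(1, 250.0), (2, 75.0)], "order_id INT, amount DOUBLE")

high_value = orders.filter(F.col("amount") >= 100)   # first output stream
low_value = orders.filter(F.col("amount") < 100)     # second output stream

audit_copy = orders.select("*")   # cloning: reuse the same data in a new branch
```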
What is JSON shredding?
A JSON document is broken down into more manageable parts for easier processing and analysis, e.g. individual components, key-value pairs, or arrays.
Three methods:
Spark native - create a DataFrame from the JSON in Spark (see the sketch after this list)
T-SQL - create a csv output using SQL to parse the JSON.
ADF: Flatten transformation to convert hierarchical data into flat structure (table)
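A minimal sketch of the Spark-native method, assuming a hypothetical /landing/orders.json file whose documents contain an order_id and an items array of structs:

```python
# Spark-native JSON shredding: read nested documents, explode the items array.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shred-demo").getOrCreate()

raw = spark.read.json("/landing/orders.json")   # assumed nested JSON source

flat = (
    raw.select("order_id", F.explode("items").alias("item"))   # one row per array element
       .select("order_id", "item.sku", "item.qty")             # flatten struct fields
)
flat.show()
```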
What is encoding or decoding data in data cleaning?
Encoding - the process of converting data from one form to another for efficient transmission, storage, or security (encryption).
Decoding - the reverse of encoding; data is returned to its original form.
Can be done in T-SQL, Spark SQL, or ADF (low code)
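A minimal PySpark sketch using Base64 as the encoding (column names are illustrative; encryption would be a separate concern):

```python
# Encode a string column to Base64, then decode it back to plain text.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("encode-demo").getOrCreate()
df = spark.createDataFrame([("secret value",)], "payload STRING")

encoded = df.withColumn("payload_b64", F.base64(F.col("payload").cast("binary")))
decoded = encoded.withColumn("payload_plain", F.unbase64("payload_b64").cast("string"))
decoded.show(truncate=False)
```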
Which pipeline tool provides built-in error handling?
ADF supports Success, Failure, Completion, and Skipped activity dependency conditions for error handling.
Normalization does what?
Gets your data in order, ensures everything has its proper place, and reduces clutter: it breaks a dataset down into multiple tables and establishes clear relationships between them.
Denormalization does what?
Denormalization combines data from multiple tables into a single table to improve database performance by reducing the complexity of queries that would otherwise join multiple tables, which can be slow and resource intensive.
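A minimal denormalization sketch in PySpark: join the normalized tables once into a wide table so downstream queries avoid the join; table and column names are illustrative:

```python
# Denormalize by joining normalized tables into one wide table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denorm-demo").getOrCreate()

orders = spark.createDataFrame([(1, 100, 250.0)], "order_id INT, customer_id INT, amount DOUBLE")
customers = spark.createDataFrame([(100, "Ana")], "customer_id INT, name STRING")

wide = orders.join(customers, on="customer_id", how="left")   # no join needed at query time
wide.show()
```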
What does pivot function do?
Pivot transforms rows into columns, producing a more compact data representation that is useful for visualization and reporting. It increases the number of columns and decreases the number of rows.
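A minimal pivot sketch in PySpark; the category/month columns and values are illustrative:

```python
# Pivot: month values become columns; fewer rows, more columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

sales = spark.createDataFrame(
    [("Bikes", "Jan", 10), ("Bikes", "Feb", 12), ("Helmets", "Jan", 4)],
    "category STRING, month STRING, qty INT",
)

pivoted = sales.groupBy("category").pivot("month").agg(F.sum("qty"))
pivoted.show()
```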