Chapter 4 - Ingesting and Transforming Data Flashcards
Incremental loading
The process of loading smaller increments of data into a storage solution, e.g., daily data loaded into a data lake or hourly data flowing into an ETL pipeline.
What does Azure Data Factory do?
- Pipelines: collection of activities linked together to control flow or data transformation.
- Activities: Steps in the pipeline, such as copying data or running a Spark job.
- Datasets: the data that pipeline activities operate on.
- Linked Services: connections that are utilized by ADF to link a variety of data stores and resources in Azure.
What are the ADF incremental loading techniques?
Watermarks (Relational DB): Track the last updated record (e.g., via a “LastModified” column) to load only new or changed rows (see the sketch after this list).
File Timestamp (Blob Storage): Use the file’s last modified date/time to identify and load only the updated files at source.
Partition Data (Time-Based Source): Load data incrementally by processing only the time-specific partitions that have changed.
Folder Structure (Time-Based Source): Organize files into time-based folders so you can pull only new or updated data during each load.
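A minimal sketch of the first technique above (watermarks), written in Spark (Python) rather than ADF itself: it assumes a hypothetical dbo.Orders table with a LastModified column, a previously stored high-water mark, and made-up connection details.

```python
# Watermark-style incremental load sketch (not ADF): read only rows changed
# since the last run, append the delta, then advance the watermark.
# Table, column, path, and connection details are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

last_watermark = "2024-01-01T00:00:00"  # normally read from a control table or file

changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=sales")
    .option("query", f"SELECT * FROM dbo.Orders WHERE LastModified > '{last_watermark}'")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Append only the delta to the lake (path is illustrative), then advance the watermark.
changed_rows.write.mode("append").parquet("abfss://lake@account.dfs.core.windows.net/orders/")
new_watermark = changed_rows.agg(F.max("LastModified")).first()[0]
```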
Three APIs supported by Apache Spark?
- Resilient Distributed Datasets
- Dataframes
- Datasets
What are Resilient Distributed Datasets?
RDDs are the backbone of Spark: a fault-tolerant data abstraction that lets you perform in-memory computations across large clusters.
Think of them as a team of data warriors. They support file formats such as JSON, CSV, and Parquet.
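A small RDD sketch in Spark (Python): distribute a local list, apply lazy transformations, and trigger execution with an action. The values are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])      # distribute a local list as an RDD
squares = numbers.map(lambda x: x * x)         # lazy transformation
evens = squares.filter(lambda x: x % 2 == 0)   # another lazy transformation
print(evens.collect())                         # action triggers execution: [4, 16]
```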
What are DataFrames?
Similar to tables in relational databases. They are immutable, resilient (fault-tolerant), and distributed, but represent a higher level of data abstraction than RDDs.
They contain schemas, columns, and rows.
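A DataFrame sketch in Spark (Python) showing an explicit schema, rows, and a simple query; the column names and data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-demo").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Ana", 34), ("Raj", 29)], schema)

df.printSchema()               # shows the schema, like a table definition
df.filter(df.age > 30).show()  # DataFrames are immutable; this returns a new DataFrame
```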
What is T-SQL and what is it used for?
A variation of SQL used primarily with Microsoft SQL Server; it extends standard SQL with enterprise features such as procedural logic, local variables, and error handling.
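As a hedged illustration of T-SQL-specific syntax (DECLARE, SELECT TOP), run from Python with pyodbc; the server, database, table, and credentials are assumptions.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=sales;"
    "UID=etl_user;PWD=********"
)

tsql = """
DECLARE @MinAmount MONEY = 100;     -- T-SQL local variable
SELECT TOP (10) OrderId, Amount     -- TOP is a T-SQL extension
FROM dbo.Orders
WHERE Amount > @MinAmount
ORDER BY Amount DESC;
"""

for row in conn.cursor().execute(tsql):
    print(row.OrderId, row.Amount)
```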
How are Azure Data Factory and Azure Synapse Pipelines used to transform data?
Azure Data Factory (a stand-alone data integration and transformation service) and Azure Synapse Pipelines (the same capability embedded in the analytics-focused Synapse workspace) offer low-code alternatives to T-SQL and Spark (Python) environments; code can still be added later to extend the available features.
What are Synapse Pipelines?
Synapse pipelines are simply ADF pipelines whose functionalities are ported into the Synapse workspace and can be considered, for all intents and purposes, functionally equivalent to ADF pipelines.
What kind of transformations does Azure Data Factory provide?
Schema Transformations - change the schema of the data, e.g., Aggregate, Source (add a new source), Sink (add a destination to write to).
Row Transformations - modify rows, e.g., Alter Row (mark rows for insert, delete, update, or upsert) and Sort.
Multi-I/O Transformations - operate on more than one input, or split one input into multiple outputs (splitting rows, merging rows), e.g., Conditional Split, Join, Union (see the sketch below).
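ADF configures these transformations visually in mapping data flows; as a rough analogue of the multi-I/O idea in Spark (Python), here is a sketch of a conditional split, a join, and a union. The column names and data are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-io-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "DE", 80.0), (3, "US", 45.0)],
    ["order_id", "country", "amount"],
)
customers = spark.createDataFrame([(1, "Ana"), (2, "Raj"), (3, "Mei")], ["order_id", "customer"])

# Conditional Split: one input becomes two outputs based on a predicate.
domestic = orders.filter(orders.country == "US")
international = orders.filter(orders.country != "US")

# Join: two inputs combined on a key.
enriched = domestic.join(customers, on="order_id", how="inner")

# Union: two inputs with the same schema stacked back together.
all_orders = domestic.unionByName(international)
```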
Azure Data Factory Pipelines
ADF provides standard templates that you can use to copy and transform data, available in the ADF template gallery.
When should I use Stream Analytics? What does it do?
Stream Analytics provides real-time data processing, enabling transformation, filtering, and aggregation of continuous data streams from sources such as IoT sensors.
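Stream Analytics itself is configured with its own SQL-like query language; purely as an analogous sketch of the same idea (filter and aggregate a continuous stream over a time window), here is a Spark Structured Streaming example in Python. The source and column names are assumptions, not Stream Analytics syntax.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The built-in "rate" source emulates a continuous stream (columns: timestamp, value).
readings = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("temperature", F.rand() * 40)  # fake sensor value
)

# Filter, then aggregate over a tumbling one-minute window.
hot_avg = (
    readings.filter(F.col("temperature") > 25)
    .groupBy(F.window("timestamp", "1 minute"))
    .agg(F.avg("temperature").alias("avg_temp"))
)

query = hot_avg.writeStream.outputMode("complete").format("console").start()
```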
What is “trimming inputs” related to Cleansing data?
- Removing extraneous characters, such as leading or trailing whitespace, from input values.
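In ADF this would typically be a Derived Column using a trim expression; an analogous sketch in Spark (Python), with a made-up column name:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trim-demo").getOrCreate()
df = spark.createDataFrame([("  Ana  ",), ("Raj ",)], ["customer_name"])

# Trim leading/trailing whitespace before passing the data downstream.
trimmed = df.withColumn("customer_name", F.trim("customer_name"))
```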
What is standardizing values in data cleansing?
Converting values to a single standard, e.g., converting all currency amounts to USD before sending data downstream.
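A small standardization sketch in Spark (Python); the exchange rate is a fixed example value, whereas a real pipeline would look rates up from a reference table.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("standardize-demo").getOrCreate()
sales = spark.createDataFrame([("EUR", 100.0), ("USD", 250.0)], ["currency", "amount"])

# Convert every amount to USD using an illustrative fixed rate.
usd_sales = sales.withColumn(
    "amount_usd",
    F.when(F.col("currency") == "EUR", F.col("amount") * 1.08).otherwise(F.col("amount")),
).withColumn("currency", F.lit("USD"))
```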
What is handling outliers in data cleansing?
If the values of some fields look abnormal, you can replace or impute them with average or median values, e.g., if Salary is greater than 5,000, replace it with AvgSalary, where AvgSalary is computed via an aggregate function earlier in the workflow.
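A sketch of that outlier rule in Spark (Python); the names, values, and the choice to compute the average over non-outlier rows are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("outlier-demo").getOrCreate()
employees = spark.createDataFrame(
    [("Ana", 3200.0), ("Raj", 4100.0), ("Mei", 250000.0)], ["name", "salary"]
)

# Aggregate step: compute the average salary (the card's "AvgSalary") over normal rows.
avg_salary = employees.filter(F.col("salary") <= 5000).agg(F.avg("salary")).first()[0]

# Replace abnormal values (here, salary > 5,000) with the average.
fixed = employees.withColumn(
    "salary",
    F.when(F.col("salary") > 5000, F.lit(avg_salary)).otherwise(F.col("salary")),
)
```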