Chapter 4 - Ingesting and Transforming Data Flashcards

1
Q

Incremental loading

A

The process of loading data in smaller increments into a storage solution, e.g., daily data loaded into a data lake or hourly data flowing into an ETL pipeline.

2
Q

What does Azure Data Factory do?

A
ADF is Azure's cloud data integration and orchestration service. Its main building blocks:
  • Pipelines: collections of activities linked together for control flow or data transformation.
  • Activities: steps in the pipeline, such as copying data or running a Spark job.
  • Datasets: the data that pipelines or activities operate on.
  • Linked Services: connections ADF uses to reach data stores and compute resources in Azure.
3
Q

What are the ADF incremental loading techniques?

A

Watermarks (Relational DB): Track the last updated record (e.g., via a “LastModified” column) to load only new or changed rows.

File Timestamp (Blob Storage): Use the file’s last modified date/time to identify and load only the updated files at the source.

Partition Data (Time-Based Source): Load data incrementally by processing only the time-specific partitions that have changed.

Folder Structure (Time-Based Source): Organize files into time-based folders so you can pull only new or updated data during each load.
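
As a rough sketch of the watermark technique from the first item above, here is the same idea in PySpark rather than ADF; the rows, the LastModified column, and the stored watermark value are all made up, and in practice the watermark would be persisted in a control table or file:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Hypothetical source rows with a LastModified column.
source = spark.createDataFrame(
    [(1, "A", "2024-01-05"), (2, "B", "2023-12-20")],
    ["id", "value", "LastModified"],
)

# Watermark recorded by the previous run (placeholder value).
last_watermark = "2024-01-01"

# Load only rows changed since the last run, then note the new high-water mark.
delta = source.filter(F.col("LastModified") > F.lit(last_watermark))
new_watermark = delta.agg(F.max("LastModified")).first()[0]
delta.show()
```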

4
Q

Three APIs supported by Apache Spark?

A
  1. Resilient Distributed Datasets (RDDs)
  2. DataFrames
  3. Datasets
5
Q

What are Resilient Distributed Datasets?

A

RDDs are the backbone of Spark: a powerful data processing abstraction that lets you perform in-memory computations across large clusters in a fault-tolerant way.

Think of them as a team of data warriors. RDDs can be built from JSON, CSV, and Parquet files, among other sources.
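
A minimal PySpark sketch of working with an RDD directly; the numbers are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a small collection across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy; the action (collect) triggers the computation.
squares = numbers.map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16, 25]
```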

6
Q

What are DataFrames?

A

Similar to tables in relational databases. DataFrames are immutable and distributed, but represent a higher level of data abstraction than RDDs.

They contain schemas, columns, and rows.
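
A minimal PySpark sketch of a DataFrame with an explicit schema; the column names and rows are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Explicit schema: named, typed columns, much like a relational table definition.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame([("Ana", 34), ("Ben", 28)], schema)
df.printSchema()
df.show()
```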

7
Q

What is T-SQL and what is it used for?

A

A variant of SQL used primarily with Microsoft SQL Server; it extends standard SQL with enterprise and procedural features (e.g., variables, control-of-flow statements, stored procedures).

8
Q

How are Azure Data Factory and Azure Synapse Pipelines used to transform data?

A

Azure Data Factory (a stand-alone data transformation service) and Azure Synapse Pipelines (analytics-focused and integrated with other Synapse features) offer low-code alternatives to T-SQL and Spark (Python) environments; code can still be added later to extend the available features.

9
Q

What are Synapse Pipelines?

A

Synapse pipelines are ADF pipelines whose functionality has been ported into the Synapse workspace; for all intents and purposes they are functionally equivalent to ADF pipelines.

10
Q

What kind of transformations does Azure Data Factory provide?

A

Schema Transformations - change the table schema, e.g., Aggregate, Source (add a new source), Sink (add a destination to export to).

Row Transformations - operate on individual rows, e.g., Alter Row (mark rows for insert, delete, update, or upsert) and Sort.

Multi-I/O Transformations - operate on more than one input or split one input into multiple outputs (splitting or merging rows), e.g., Conditional Split, Join, Union.
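
These transformations are configured visually in ADF mapping data flows, so there is no single code form; purely as an analogy, roughly equivalent operations in PySpark look like the following (the tables, columns, and values are invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adf-analogy").getOrCreate()

sales = spark.createDataFrame(
    [("West", "C1", 100.0), ("West", "C2", 50.0), ("East", "C1", 70.0)],
    ["Region", "CustomerId", "Amount"],
)
customers = spark.createDataFrame([("C1", "Ana"), ("C2", "Ben")], ["CustomerId", "Name"])

# Aggregate (schema transformation): total amount per region.
totals = sales.groupBy("Region").agg(F.sum("Amount").alias("TotalAmount"))

# Join (multi-I/O transformation): combine two inputs on a key.
joined = sales.join(customers, on="CustomerId", how="inner")

totals.show()
joined.show()
```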

11
Q

Azure Data Factory Pipelines

A

ADF provides standard templates that you can use to copy and transform data, available in the ADF template gallery.

12
Q

When should I use Stream Analytics? What does it do?

A

Stream Analytics provides real-time data processing, enabling transformation, filtering, and aggregation of continuous data streams, such as those from IoT sensors.
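
Stream Analytics jobs themselves are written in a SQL-like query language; purely as an analogy, and to stay consistent with the Spark examples in this deck, a continuous filter over an IoT stream can be sketched in PySpark Structured Streaming (the path, schema, and temperature threshold are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-filter-analogy").getOrCreate()

# Continuous source of IoT readings (path and schema are placeholders).
readings = (
    spark.readStream
         .schema("deviceId STRING, temp DOUBLE, ts TIMESTAMP")
         .json("/data/iot/")
)

# Filter the stream in flight and write the matching events continuously.
alerts = readings.filter(F.col("temp") > 30)
query = alerts.writeStream.outputMode("append").format("console").start()
```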

13
Q

What is “trimming inputs” related to Cleansing data?

A
  • Removing unwanted characters such as leading and trailing whitespace.
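
A tiny PySpark sketch of trimming a column; the name column and rows are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trim-demo").getOrCreate()
df = spark.createDataFrame([("  Ana  ",), ("Ben ",)], ["name"])

# Remove leading and trailing whitespace from the column.
clean = df.withColumn("name", F.trim(F.col("name")))
clean.show()
```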
14
Q

What is standardizing values in data cleansing?

A

Converting values to a single standard, e.g., converting all currency amounts to USD before sending data downstream.
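
A minimal PySpark sketch of standardizing amounts to USD; the exchange rates and rows are invented, and in practice the rates would come from a reference table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("standardize-demo").getOrCreate()

rates = spark.createDataFrame(
    [("USD", 1.0), ("EUR", 1.1), ("GBP", 1.3)], ["currency", "usd_rate"]
)
amounts = spark.createDataFrame([(100.0, "EUR"), (50.0, "USD")], ["amount", "currency"])

# Join on the currency code and convert every amount to a single standard (USD).
usd = amounts.join(rates, "currency").withColumn(
    "amount_usd", F.col("amount") * F.col("usd_rate")
)
usd.show()
```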

15
Q

What is handling outliers in data cleansing?

A

If values in some fields look abnormal, you can replace or impute them with average or median values, e.g., replace any Salary greater than 5,000 with AvgSalary, where AvgSalary is computed with an aggregate function in the workflow.
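
A minimal PySpark sketch of this pattern; the salaries and the 5,000 threshold mirror the made-up example above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("outlier-demo").getOrCreate()
df = spark.createDataFrame(
    [("Ana", 1200.0), ("Ben", 1500.0), ("Cy", 9000.0)], ["name", "salary"]
)

# Compute the average salary (the aggregate step in the workflow).
avg_salary = df.agg(F.avg("salary")).first()[0]

# Replace abnormal salaries (here, greater than 5,000) with the average.
cleaned = df.withColumn(
    "salary",
    F.when(F.col("salary") > 5000, F.lit(avg_salary)).otherwise(F.col("salary")),
)
cleaned.show()
```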

16
Q

What is handling duplicate data within data cleansing?

A

Duplicates can be removed via deduplication and cleaning during the ETL process in ADF.
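
A minimal PySpark sketch of deduplication; the rows are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()
df = spark.createDataFrame([(1, "Ana"), (1, "Ana"), (2, "Ben")], ["id", "name"])

# Drop exact duplicate rows (pass a subset of key columns to dedupe on keys only).
deduped = df.dropDuplicates()
deduped.show()
```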

17
Q

What is late-arriving data and how is it handled during data cleaning?

A

Late-arriving data is information that arrives after its expected processing window due to:

Technical issues (network/system delays)
Business processes (delayed entries, batch uploads)

Handling approaches:

Drop data if loss is acceptable
Reprocess pipeline if accuracy required
Update historical records via SCD Type 2
Track processing windows with timestamps
Implement windowing strategies in data flows
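
As a sketch of the timestamp/windowing idea from the list above, here is how late events can be tolerated in PySpark Structured Streaming; the path, schema, window size, and one-hour lateness allowance are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

# Streaming source with an event-time column (path and schema are placeholders).
events = (
    spark.readStream
         .schema("id STRING, ts TIMESTAMP, value DOUBLE")
         .json("/data/events/")
)

# Accept events arriving up to 1 hour late, then aggregate per 10-minute window.
windowed = (
    events.withWatermark("ts", "1 hour")
          .groupBy(F.window("ts", "10 minutes"))
          .agg(F.sum("value").alias("total"))
)

query = windowed.writeStream.outputMode("append").format("console").start()
```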

18
Q

What are the two data splitting techniques?

A

Conditional Split: split based on set criteria.

Cloning (New Branch): Copy entire dataset for new execution flow. Enables parallel processing or different operations on the same dataset without affecting the original data flow.
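
A minimal PySpark sketch of both techniques; the orders data and the 100 threshold are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-demo").getOrCreate()
orders = spark.createDataFrame([(1, 500.0), (2, 20.0), (3, 75.0)], ["id", "amount"])

# Conditional split: route rows into separate outputs based on criteria.
large = orders.filter(F.col("amount") >= 100)
small = orders.filter(F.col("amount") < 100)

# Cloning (new branch): reuse the same input for an independent flow.
order_count = orders.agg(F.count("*").alias("order_count"))

large.show()
order_count.show()
```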

19
Q

What is JSON shredding?

A

A JSON document is broken down into more manageable parts for easier processing and analysis, e.g., individual components, key-value pairs, or arrays.

Three methods:
Spark native - load the JSON into a DataFrame in Spark
T-SQL - parse the JSON with SQL (e.g., OPENJSON) to produce tabular/CSV output
ADF - use the Flatten transformation to convert hierarchical data into a flat (tabular) structure
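
A minimal sketch of the Spark-native method; the nested JSON document is invented:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shred-demo").getOrCreate()

# Hypothetical nested JSON document with an object and an array.
doc = ['{"order": 1, "customer": {"name": "Ana"}, '
       '"items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}']
df = spark.read.json(spark.sparkContext.parallelize(doc))

# Shred: promote nested fields to columns and explode the array into rows.
flat = df.select(
    "order",
    F.col("customer.name").alias("customer_name"),
    F.explode("items").alias("item"),
).select("order", "customer_name", "item.sku", "item.qty")
flat.show()
```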

20
Q

What is encoding or decoding data in data cleaning?

A

Encoding - the process of converting data from one form to another for efficient transmission, storage, or security (encryption).
Decoding - the reverse of encoding; data is returned to its original form.

Can be done in T-SQL, Spark SQL, or ADF (low code)
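
A minimal plain-Python sketch of one common encoding, Base64; the input string is arbitrary:

```python
import base64

# Encode a string to Base64 for safe transmission or storage, then decode it back.
original = "data engineering"
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)  # ZGF0YSBlbmdpbmVlcmluZw==
print(decoded)  # data engineering
```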

21
Q

Which pipeline tool provides built-in error handling?

A

ADF supports Success, Failure, Completion, and Skipped activity dependency conditions for error handling.

22
Q

Normalization does what?

A

Gets your data in order, ensures everything has its proper place, and reduces clutter: normalization breaks a dataset down into multiple tables and establishes clear relationships between them.
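
A minimal PySpark sketch of normalizing one wide table into two related tables; the columns, rows, and the choice of customer_email as the linking key are all made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("normalize-demo").getOrCreate()

# One wide table mixing customer and order attributes (hypothetical data).
orders_wide = spark.createDataFrame(
    [(1, "Ana", "ana@example.com", 100.0), (2, "Ana", "ana@example.com", 40.0)],
    ["order_id", "customer_name", "customer_email", "amount"],
)

# Normalize: split into a customers table and an orders table linked by a key.
customers = orders_wide.select("customer_name", "customer_email").distinct()
orders = orders_wide.select("order_id", "customer_email", "amount")

customers.show()
orders.show()
```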

23
Q

Denormalization does what?

A

Denormalization combines data from multiple tables into a single table to improve database performance by reducing the complexity of queries that join multiple tables, which can be slow and resource-intensive.
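
A minimal PySpark sketch of denormalizing by pre-joining two tables into one wide table; the tables and key are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalize-demo").getOrCreate()

customers = spark.createDataFrame([(1, "Ana"), (2, "Ben")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(10, 1, 100.0), (11, 2, 40.0)], ["order_id", "customer_id", "amount"]
)

# Denormalize: pre-join into one wide table so downstream queries avoid the join.
wide = orders.join(customers, on="customer_id", how="left")
wide.show()
```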

24
Q

What does pivot function do?

A

Pivot transforms rows into columns, allowing for a more compact data representation that is used for visualization and reporting. It increases the number of columns and decreases the number of rows.
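
A minimal PySpark sketch of a pivot; the year/quarter sales rows are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()
sales = spark.createDataFrame(
    [("2024", "Q1", 10.0), ("2024", "Q2", 20.0), ("2025", "Q1", 15.0)],
    ["year", "quarter", "amount"],
)

# Pivot: quarter values become columns, leaving one row per year.
pivoted = sales.groupBy("year").pivot("quarter").agg(F.sum("amount"))
pivoted.show()
```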

25
Q

What does the unpivot function do?

A

Unpivot turns columns into rows, converting data to a long format that is more suitable for analysis; it increases the number of rows and decreases the number of columns.
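
A minimal PySpark sketch of an unpivot using the stack() expression; the columns and rows are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unpivot-demo").getOrCreate()
pivoted = spark.createDataFrame([("2024", 10.0, 20.0)], ["year", "Q1", "Q2"])

# Unpivot: turn the Q1/Q2 columns back into (quarter, amount) rows.
long_df = pivoted.select(
    "year",
    F.expr("stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, amount)"),
)
long_df.show()
```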