Introduction to Data Engineering Flashcards

1
Q

The role of a data engineer

A
  1. the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
  2. helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.
2
Q

Types of Data

A
  1. Structured
  2. Unstructured
  3. Semi-structured
3
Q

Structured data

A
  1. primarily comes from table-based source systems
  2. the rows and columns are aligned consistently throughout the file.
  3. Examples: CSV files, relational databases (RDBs)
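A minimal sketch of what "rows and columns aligned consistently" means in practice, using only Python's standard library (the file contents and column names are invented for the example):

```python
import csv
import io

# A structured dataset: every row has the same columns in the same order.
csv_text = """id,name,amount
1,Alice,120.50
2,Bob,75.00
3,Carla,210.25
"""

# csv.DictReader maps each row onto the column headers.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])                         # Alice
print(sum(float(r["amount"]) for r in rows))   # 405.75
```

Because the structure is uniform, column-wise operations like the sum above need no per-record inspection.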
4
Q

Semi-structured Data

A
  1. may require flattening prior to loading into your analytical store.
  2. doesn’t have to fit neatly into a table structure.
  3. JSON is a common example.
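A small sketch of the flattening idea, assuming a nested JSON document (the field names here are hypothetical): nested and repeated fields are expanded into uniform tabular rows before loading.

```python
import json

# Semi-structured: a nested JSON document doesn't fit a table directly.
doc = json.loads("""
{
  "order_id": 42,
  "customer": {"name": "Alice", "city": "Leeds"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
""")

# Flattening: one row per nested item, with parent fields repeated.
rows = [
    {
        "order_id": doc["order_id"],
        "customer_name": doc["customer"]["name"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in doc["items"]
]
print(rows)
```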
5
Q

Unstructured data

A

Data stored as key-value pairs that don’t adhere to standard relational models. Other commonly used types of unstructured data include Portable Document Format (PDF) files, word processor documents, and images.

6
Q

Main data operations

A
  1. Data integration
  2. Data transformation
  3. Data consolidation
7
Q

Data integration

A

establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.

8
Q

Data transformation

A

extract, transform, and load (ETL) process

the data is prepared to support downstream analytical needs.

9
Q

Data consolidation

A
  1. the process of combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.
  2. data from operational systems is extracted, transformed, and loaded into analytical stores such as a data lake or data warehouse.
10
Q

Operational data

A

transactional data that is generated and stored by applications, often in a relational or non-relational database.

11
Q

Analytical data

A

data that has been optimized for analysis and reporting, often in a data warehouse.

12
Q

Streaming data

A

perpetual sources of data that generate data values in real-time, often relating to specific events.

13
Q

Data pipelines

A
  1. are used to orchestrate activities that transfer and transform data.
  2. Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events.
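The extract–transform–load pattern a pipeline orchestrates can be sketched end to end in a few lines of standard-library Python; the table, column names, and "drop non-positive amounts" business rule are all invented for illustration:

```python
import csv
import io
import sqlite3

# --- Extract: read raw records from a (simulated) source system. ---
source = io.StringIO("id,amount\n1,10.0\n2,-3.0\n3,5.5\n")
raw = list(csv.DictReader(source))

# --- Transform: cast types and apply a business rule. ---
clean = [
    {"id": int(r["id"]), "amount": float(r["amount"])}
    for r in raw
    if float(r["amount"]) > 0    # hypothetical rule: drop non-positive amounts
]

# --- Load: write the result into an analytical store (SQLite stands in here). ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (:id, :amount)", clean)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 15.5
```

A real pipeline adds what this sketch omits: scheduling or event triggers, retries, and monitoring around each stage.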
14
Q

Data lakes

A
  1. a storage repository that holds large amounts of data in native, raw formats.
  2. optimized for scaling to massive volumes (terabytes or petabytes) of data.
  3. data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured.

GOAL: store everything in its original, untransformed state.

15
Q

Data warehouse

A
  1. a centralized repository of integrated data from one or more disparate sources.
  2. store current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
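"Organized into a schema that optimizes performance for analytical queries" usually means something like a star schema: a fact table joined to dimension tables. A toy sketch with SQLite standing in for the warehouse (all table and column names are invented):

```python
import sqlite3

# A tiny star-schema sketch: one fact table plus one dimension table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales  VALUES (1, 12.0), (1, 8.0), (2, 30.0);
""")

# A typical analytical query: aggregate facts, grouped by a dimension attribute.
result = db.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(result)  # [('Books', 20.0), ('Games', 30.0)]
```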
16
Q

Apache Spark

A

a parallel processing framework that takes advantage of in-memory processing and distributed file storage.

17
Q

Core Azure technologies used to implement data engineering workloads include:

A
  1. Azure Synapse Analytics
  2. Azure Data Lake Storage Gen2
  3. Azure Stream Analytics
  4. Azure Data Factory
  5. Azure Databricks
  6. Azure Event Hubs
18
Q

Data Lake pt2

A
  1. provides file-based storage, usually in a distributed file system that supports high scalability for massive volumes of data.
  2. can store structured, semi-structured, and unstructured files in the data lake and then consume them from there in big data processing technologies, such as Apache Spark.
19
Q

Azure Data Lake Storage

A
  1. a comprehensive, massively scalable, secure, and cost-effective data lake solution for high-performance analytics built into Azure.
  2. combines a file system with a storage platform to help you quickly identify insights into your data.
  3. enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the high-availability, security, and durability capabilities of Azure Storage.
  4. can be used as the basis for both real-time and batch solutions.
20
Q

Benefits

A
  1. Hadoop compatible access
  2. Security
  3. Performance
  4. Data redundancy
21
Q

Hadoop compatible access

A
  1. As if stored in HDFS
  2. can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure Synapse Analytics without moving the data between environments.
  3. use storage formats such as Parquet, which is highly compressed and performs well across multiple platforms thanks to its internal columnar storage.
22
Q

Security

A
  1. supports access control lists (ACLs)
  2. Portable Operating System Interface (POSIX) permissions
  3. can set permissions at a directory level or file level for the data stored within the data lake
  4. encrypted at rest
23
Q

Performance

A

organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation.

24
Q

Data redundancy

A

takes advantage of the Azure Blob replication models that provide data redundancy in a single data center with locally redundant storage (LRS), or to a secondary region by using the geo-redundant storage (GRS) option.

25
Q

Consideration for data lake

A
  1. Types of data to be stored
  2. How the data will be transformed
  3. Who should access the data
  4. What are the typical access patterns
  5. How to plan for access control governance across your lake

26
Q

Enable Azure Data Lake

A
  1. isn’t a standalone Azure service, but rather a configurable capability of a StorageV2 (general-purpose v2) storage account
  2. select the option to Enable hierarchical namespace in the Advanced page when creating the storage account in the Azure portal

OR
use the Data Lake Gen2 upgrade wizard in the Azure portal page for your storage account resource.

27
Q

Azure Blob vs Azure Data Lake (WIP)

A
  1. With Azure Blob storage, you can store large amounts of unstructured (“object”) data in a flat namespace within a blob container.
  2. blobs are stored as a single-level hierarchy in a flat namespace.
  3. Blob: you can access this data by using HTTP or HTTPS
  4. Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.
  5. enables directory renames and deletes to be performed in a single atomic operation
  6. Hierarchical namespace: better storage and retrieval performance for an analytical use case and lowers the cost of analysis.
  7. applications can use either the Blob APIs or the Azure Data Lake Storage Gen2 file system APIs to access data.
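The "single atomic operation" point can be illustrated loosely with a local filesystem, which is itself a hierarchical namespace (the directory and file names below are made up):

```python
import os
import tempfile

# Local-filesystem analogy for a hierarchical namespace: renaming a
# directory is one metadata operation, no matter how many files it holds.
# In a flat blob namespace, a "rename" would instead mean copying and
# deleting every blob whose name starts with the old prefix.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "raw", "2024"))
open(os.path.join(root, "raw", "2024", "data.csv"), "w").close()

os.rename(os.path.join(root, "raw"), os.path.join(root, "staged"))  # one operation

print(os.path.exists(os.path.join(root, "staged", "2024", "data.csv")))  # True
```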
28
Q

4 stages for processing big data solutions that are common to all architectures:

A
  1. Ingest
  2. Store
  3. Prep and train
  4. Model and serve
29
Q

Ingest

A

Batch: Azure Synapse Analytics or Azure Data Factory
Stream: Apache Kafka for HDInsight or Stream Analytics

acquire the source data

30
Q

Store

A

Azure Data Lake Storage Gen2

where the ingested data should be placed

31
Q

Prep and train

A

Azure Synapse Analytics, Azure Databricks, Azure HDInsight, and Azure Machine Learning.

perform data preparation and model training and scoring for machine learning solutions

32
Q

Model and serve

A

Microsoft Power BI, or analytical data stores such as Azure Synapse Analytics

present the data to users.

33
Q

Azure Data Lake Storage Gen2 use cases:

A
  1. Big data processing and analytics
  2. Data warehousing
  3. Real-time data analytics
  4. Data science and machine learning
34
Q

Data lake: Big data processing and analytics

A
  1. provides a scalable and secure distributed data store on which big data services such as Azure Synapse Analytics, Azure Databricks, and Azure HDInsight can apply data processing frameworks such as Apache Spark, Hive, and Hadoop
  2. enables tasks to be performed in parallel, resulting in high-performance and scalability
35
Q

Data lake: Data warehousing

A
  1. integrate large volumes of data stored as files in a data lake with relational tables in a data warehouse.
  2. the data is staged in a data lake in order to facilitate distributed processing before being loaded into a relational data warehouse.
36
Q

“data lakehouse” or “lake database”

A
  1. the data warehouse uses external tables to define a relational metadata layer over files in the data lake
  2. The data warehouse can then support analytical queries for reporting and visualization.
37
Q

Data lake: real-time data analytics

A
  1. Streaming events are often captured in a queue for processing (e.g. Azure Event Hubs)
  2. Azure Stream Analytics enables you to create jobs that query and aggregate event data as it arrives, and write the results in an output sink.
  3. One such sink is Azure Data Lake Storage Gen2, from which the captured real-time data can be analyzed and visualized.
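Stream Analytics jobs typically aggregate events over time windows. A very loose pure-Python sketch of a tumbling-window sum (the event values and five-second window are invented for the example):

```python
from collections import defaultdict

# Simulated event stream: (timestamp_seconds, value) pairs arriving over time.
events = [(1, 10), (3, 5), (7, 2), (8, 8), (12, 1)]

WINDOW = 5  # tumbling window of 5 seconds

# Sum the values per window, roughly what a streaming query grouping by a
# 5-second tumbling window would produce.
sums = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    sums[window_start] += value

print(dict(sums))  # {0: 15, 5: 10, 10: 1}
```

The real service evaluates such windows continuously over an unbounded stream and writes each window's result to the output sink as the window closes.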
38
Q

4 Analytical techniques commonly used for research

A
  1. Descriptive analytics
  2. Diagnostic analytics
  3. Predictive analytics
  4. Prescriptive analytics
39
Q

Descriptive analytics

A

“What is happening in my business?”

creation of a data warehouse in which historical data is persisted in relational tables for multidimensional modeling and reporting.

40
Q

Diagnostic analytics

A

“Why is it happening?”

may involve exploring information that already exists in a data warehouse, but typically involves a wider search of your data estate to find more data to support this type of analysis.

41
Q

Prescriptive analytics

A

autonomous decision making based on real-time or near real-time analysis of data, using predictive analytics.

42
Q

Azure Synapse Analytics workspace

A
  1. an instance of the Synapse Analytics service in which you can manage the services and data resources needed for your analytics solution
  2. Synapse Studio: a web-based portal for Azure Synapse Analytics.
  3. A workspace typically has a default data lake, which is implemented as a linked service to an Azure Data Lake Storage Gen2 container
  4. Azure Synapse Analytics includes built-in support for creating, running, and managing pipelines that orchestrate the activities necessary to retrieve data from a range of sources, transform the data as required, and load the resulting transformed data into an analytical store.
  5. can create one or more Spark pools and use interactive notebooks to combine code and notes as you build solutions for data analytics
  6. Azure Synapse Data Explorer is a data processing engine in Azure Synapse Analytics that is based on the Azure Data Explorer service.
43
Q

Azure synapse analytics can be integrated with other services

A
  1. Azure Synapse Link enables near real-time synchronization between operational data stores (such as Azure Cosmos DB) and Azure Synapse Analytics
  2. Microsoft Power BI - data visualization
  3. Microsoft Purview - catalog data assets
  4. Azure Machine Learning