Azure Synapse Analytics Modern Warehousing Concepts Flashcards

1
Q

Define Data Warehousing with Microsoft Azure Synapse Analytics

A

Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, is an analytics engine. It is designed to process large amounts of data very quickly using SQL pools and a massively parallel processing (MPP) engine. Azure Synapse Analytics blends big data analytics with data warehousing.

2
Q

How do we access or consume the insights produced by Azure Synapse Analytics?

A

Using Analytics Dashboards, Operational Reporting, or Advanced Analytics.

3
Q

What are the 4 main stages in an Azure Synapse Pipeline flow?

A
  • Ingest and Prepare
  • Store
  • Model and Serve
  • Visualize
4
Q

Can we move SSIS ETL or transformation flows into Azure Data Factory?

A

Yes, we can do this easily.

There are four storage types for SSIS packages:

  • SSIS catalog (SSISDB). Introduced with SQL Server 2012, it contains a set of stored procedures, views, and table-valued functions used for working with SSIS projects/packages.
  • File System.
  • SQL Server system database (MSDB).
  • SSIS Package Store. A package management layer on top of two subtypes:
    • MSDB, which is a system database in SQL Server used to store SSIS packages.
    • Managed file system, which is a specific folder in the SQL Server installation path used to store SSIS packages.

Since version 5.0, the Data Migration Assistant (DMA) supports batch assessment of packages stored in the File System, the Package Store, and the SSIS catalog.

Get DMA, and perform your package assessment with it.

Migration

Depending on the storage types of source SSIS packages, the steps to migrate SSIS packages and SQL Server Agent jobs that schedule SSIS package executions may vary.

It is also practical to use SSIS DevOps Tools to batch-redeploy packages to the migration destination.

Package storage type: SSISDB
  • How to migrate SSIS packages: Redeploy packages via SSDT/SSMS to SSISDB hosted in Azure SQL Managed Instance. For more info, see Deploying SSIS packages in Azure.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: File System
  • How to migrate SSIS packages: Redeploy them to file shares/Azure Files via dtinstall/dtutil/manual copy, or keep them in file systems to access via VNet/Self-Hosted IR. For more info, see dtutil utility.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Migrate with the SSIS Job Migration Wizard in SSMS.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: SQL Server (MSDB)
  • How to migrate SSIS packages: Export them to file systems/file shares/Azure Files via SSMS/dtutil. For more info, see Exporting SSIS packages.
  • How to migrate SSIS jobs: Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: Package Store
  • How to migrate SSIS packages: Export them to the package store via SSMS/dtutil, or redeploy them to the package store via dtinstall/dtutil/manual copy. For more info, see Manage packages with Azure-SSIS Integration Runtime package store.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

5
Q

What are the 2 main tools for the Ingest and Prepare stage in Azure?

A

Azure Data Factory

Azure Databricks

6
Q

Can we use Azure Synapse exclusively (without any other tool)?

A

Yes. Synapse on its own is a good choice for greenfield implementations, while a hybrid approach suits corporations with existing Azure services.

7
Q

Name 3 languages which we can use in Azure Synapse Analytics

A
  • Python
  • SQL
  • Scala
8
Q

If we press the “Integrate” button in Synapse Analytics, what will we see?

A

An environment exactly like the one Azure Data Factory (ADF) has, since ADF is included in Synapse Analytics.

The pipeline creation experience is the same as in ADF, which gives you another powerful integration built into Azure Synapse Analytics, removing the need to use Azure Data Factory for data movement and transformation pipelines.

9
Q

What are the 3 Copy methods for a Sink in the Synapse Analytics Integrate tab?

A
  • PolyBase
  • Copy command
  • Bulk insert
10
Q

To perform a scale operation for a SQL Pool…

A

To perform a scale operation, SQL pool first kills all incoming queries and then rolls back transactions to ensure a consistent state.

Scaling only occurs once the transaction rollback is complete.
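
A minimal T-SQL sketch of triggering such a scale operation, assuming a dedicated SQL pool with the hypothetical name mySampleDataWarehouse and a target service objective of DW300c, run against the logical server hosting the pool:

  -- Hypothetical pool name and target DWU level.
  ALTER DATABASE mySampleDataWarehouse
  MODIFY (SERVICE_OBJECTIVE = 'DW300c');

  -- Optionally watch the pool state while the scale operation runs.
  SELECT name, state_desc
  FROM sys.databases
  WHERE name = 'mySampleDataWarehouse';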

11
Q

What are the 2 types of Analytics Pools available in Synapse?

A
  • SQL pools
    • Dedicated
    • Serverless
  • Apache Spark pools
12
Q

How does Synapse automatically scale Apache Spark pools out or in?

A

You can configure an Apache Spark pool to have a fixed size by disabling the autoscale setting. With autoscale enabled, you set the minimum and maximum number of nodes to control the amount of scale applied. When you enable autoscale, Synapse Analytics monitors the resource requirements of the load and scales the number of nodes up or down. It does this by continuously monitoring pending CPU, pending memory, free CPU, free memory, and used memory per node. It checks these metrics every 30 seconds and makes scaling decisions based on their values.

It can take 1-5 minutes for a scaling operation to complete.

13
Q

What options can we find in the Develop tab?

A
  • SQL Scripts
  • Notebooks
  • Data flows
  • Power BI
14
Q

We can use JSON functions, such as xxx and xxx, to extract JSON data into specific structured columns inside a table in a SQL pool.

A

JSON_VALUE and ISJSON
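
A minimal sketch, assuming a hypothetical staging table staging.RawOrders with an NVARCHAR column named doc that holds JSON documents:

  -- ISJSON filters out rows that do not contain valid JSON;
  -- JSON_VALUE extracts scalar values into structured columns.
  SELECT
      JSON_VALUE(doc, '$.customerId')         AS CustomerId,
      JSON_VALUE(doc, '$.customer.firstName') AS FirstName,
      JSON_VALUE(doc, '$.orderTotal')         AS OrderTotal
  FROM staging.RawOrders
  WHERE ISJSON(doc) > 0;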

15
Q

What is a data flow in the Develop tab of Synapse?

A

Data flows are powerful data transformation workflows that use the power of Apache Spark but are authored using a code-free GUI. The work you do in the UI is automatically transformed into code executed by a managed Spark cluster, without your having to write any code or manage the cluster.

16
Q

When ingesting raw data in batch from new data sources, which of the following data formats are natively supported by Synapse Analytics?

  • Scala
  • JSON
  • ORC
  • CSV
  • Parquet
A
  • JSON
  • ORC
  • CSV
  • Parquet
17
Q

Which of the following would be a valid reason for adding a staging area into the architecture of a modern data warehouse?

  • To reduce contention on source systems
  • To make data analytics available directly from the staging area
  • To join data from different source systems
  • To enable the ingestion of source systems based on different schedules
A
  • To reduce contention on source systems
  • To join data from different source systems
  • To enable the ingestion of source systems based on different schedules
18
Q

The process of building a modern data warehouse typically includes Data Ingestion and Preparation.

You have recently deployed Azure Synapse Analytics. You now have a requirement to ingest data code-free. Which of the following tools can be used to perform this task?

  • Azure Databricks
  • Azure Data Factory
  • Power BI
A

Azure Data Factory

19
Q

Processing data that arrives in real-time/near real-time is also referred to as streaming data processing. Azure offers purpose-built stream ingestion services such as Azure IoT Hub and Azure Event Hubs. To collect messages from these or similar services and process them, you can use which of the following features?

  • Azure Stream Analytics
  • Azure IoT Central
  • Azure Databricks
  • Azure Functions
A
  • Azure Stream Analytics
  • Azure Functions
  • Azure Databricks
20
Q


Which technology is typically used as a staging area in a modern data warehousing architecture?

  • Azure Data Lake
  • Azure Synapse SQL Pools
  • Azure Synapse Spark Pools.
A

Azure Data Lake

21
Q

Which of the following is a Big Data Solution that stores data in a relational table format with columnar storage?

  • Azure Synapse SQL Pools
  • Azure Synapse Spark Pools
A

Azure Synapse SQL Pools

22
Q

What is the main reason for adding a staging area into the architecture of a modern data warehouse?

A

A staging area is added to the architecture of a modern data warehouse for any one of the following reasons:

  • To reduce contention on source systems.
  • To deal with the ingestion of source systems on different schedules.
  • To join data together from different source systems.
  • To rerun failed data warehouse loads from the staging area.
23
Q

2 main functions of source systems in an organization

A

Source systems typically play an important role in fulfilling business operations that either bring in (generate) revenue for an organization or provide a function that is mission critical to the business.

24
Q

What is Data Dumping?

A

In this case, some data warehouse design strategies involve grabbing data from a source and dumping it into a staging area. This approach involves no transformation or cleansing; it simply grabs the data, which minimizes contention on the source system.

25
Q

What is metadata and how can you leverage this in a staging area?

A

You can even create additional tables that aid the process of joining data together from different source systems, referred to as mapping tables. In this scenario, imagine that you have a customers table in one source system with a column named FirstName, and in a second source system, perhaps running on an AS400, you have a customers table with a column named FIRNAM that also represents the first name of the customer. You can create a separate table containing metadata that maps the column in one source system to the column in the other source system that represents the same business entity, in this case FirstName.
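
A minimal sketch of such a mapping table; the schema, table, and source-system names (staging.ColumnMapping, CRM, AS400, Customers, CUSTMAST) are hypothetical and simply illustrate the FirstName/FIRNAM example above:

  -- Metadata table that maps equivalent columns across source systems
  -- for the same business entity.
  CREATE TABLE staging.ColumnMapping
  (
      BusinessEntity  NVARCHAR(100),  -- e.g. 'Customer'
      CanonicalColumn NVARCHAR(128),  -- e.g. 'FirstName'
      SourceSystem    NVARCHAR(100),  -- e.g. 'CRM' or 'AS400'
      SourceTable     NVARCHAR(128),
      SourceColumn    NVARCHAR(128)   -- e.g. 'FirstName' or 'FIRNAM'
  );

  INSERT INTO staging.ColumnMapping VALUES ('Customer', 'FirstName', 'CRM',   'Customers', 'FirstName');
  INSERT INTO staging.ColumnMapping VALUES ('Customer', 'FirstName', 'AS400', 'CUSTMAST',  'FIRNAM');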

26
Q

What is a Data Lake?

A

A data lake is a repository of data stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure. Azure Data Lake Storage combines a file system with a storage platform to help you quickly identify insights in your data. Data Lake Storage Gen2 builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads. This integration enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the high availability, security, and durability capabilities of Azure Storage.

27
Q

What are the 3 Data-loading velocities?

A
  • Batch query
    • A batch consists of queries or programs that take tens of minutes, hours, or days to complete. Activities could include initial data wrangling, a complete ETL pipeline, or preparation for downstream analytics.
  • Interactive query
    • An interactive query involves querying batch data at human-interactive speeds, which with the current generation of technologies means results are ready in time frames measured in seconds to minutes.
  • Real-time query
    • A real-time or near-real-time query is the processing of a typically infinite stream of input data. The time until results are ready is short, measured in milliseconds, or seconds in the longest of cases.
28
Q

Recommended storage formats for common data types in a Modern Data Warehouse

A
  • For raw data, it is recommended that this data be stored in its native format.
  • Data from relational databases should typically be stored in CSV format. This is the most widely supported format, so it provides the greatest flexibility.
  • For data from web APIs and NoSQL databases, JSON is the recommended format.
  • When it comes to storing refined versions of the data for possible querying, the recommended data format is Parquet.
29
Q

Parquet benefits

A
  • There is industry alignment around the Parquet format for sharing data at the storage layer, for example across Hadoop, Databricks, and SQL engine scenarios.
  • Parquet is a high-performance, column-oriented format optimized for big data scenarios. Columnar formats like Parquet have storage and performance benefits.
  • The values are clustered by column, so compression is more efficient and shrinks the storage footprint. A query engine can push down column projections to reduce read IO from network and disk by skipping unwanted columns, otherwise known as column pruning.
  • Similar data types for a column are stored together in Parquet files, leading to efficient data compression and encoding schemes.
  • Finally, Parquet stores the file schema in the file metadata, so a query engine can read it without being told the schema (see the sample query below). CSV files don’t store file metadata, so readers need to be either supplied with the schema or the schema needs to be inferred; supplying a schema is tedious, and inferring one is error prone and expensive.
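
A minimal sketch of querying such Parquet files from a Synapse serverless SQL pool; the storage URL and the ProductCategory/SalesAmount column names are hypothetical placeholders:

  -- FORMAT = 'PARQUET' needs no explicit schema because the schema
  -- travels with the Parquet files; only the referenced columns are read.
  SELECT
      r.ProductCategory,
      SUM(r.SalesAmount) AS TotalSales
  FROM OPENROWSET(
          BULK 'https://mydatalake.dfs.core.windows.net/data/gold/sales/*.parquet',
          FORMAT = 'PARQUET'
       ) AS r
  GROUP BY r.ProductCategory;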
30
Q

The first thing you should consider when ingesting data into the data lake is…

A

The first thing you should consider when ingesting data into the data lake is how to structure or organize data within the data lake. You should use Azure Data Lake Storage (ADLS) Gen2; within the Azure portal, this is an Azure storage account with a hierarchical namespace enabled. A key mechanism of ADLS Gen2 is to provide file system performance at object storage scale and prices, and it also provides a hierarchical namespace. This allows the collection of objects and files within an account to be organized into a hierarchy of directories and nested subdirectories, in the same way that the file system on your computer is organized. With a hierarchical namespace enabled, a storage account becomes capable of providing the scalability and cost-effectiveness of object storage together with file system semantics that are familiar to analytics engines and frameworks.

31
Q

In ADLS Gen2, it is a best practice (for storage environments) to have a…

A

In ADLS Gen2, it is a best practice to have a dedicated storage account for production, and a separate storage account for dev and test workloads. This will ensure that dev or test workloads never interfere with production.

32
Q

A common method for structuring folders within a data lake

A

A common method for structuring folders within a data lake is to organize data in separate folders by the degree of refinement.

For example,

  • Bronze folder might contain raw data.
  • Silver contains the cleaned, prepared, and integrated data.
  • Gold contains data ready to support analytics, which might include final refinements, such as pre-computed aggregates.
33
Q

When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure have an impact on performance… explain this.

A

When working with Data Lake Storage Gen2, the following should be considered:

  • When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure have an impact on performance.
  • If you store your data as many small files, this can negatively affect performance.
  • In general, organize your data into larger-sized files for better performance, 256 megabytes to 100 gigabytes in size.
  • Some engines and applications might have trouble efficiently processing files that are greater than 100 gigabytes in size, so keep file sizes below 100 GB.
  • Finally, data pipelines sometimes have limited control over the raw data, which arrives as lots of small files.
  • In that case, it is recommended to have a "cooking" process that generates larger files for downstream applications.
34
Q

You can natively perform data transformations with Azure Synapse pipelines, code-free, using…

A

You can natively perform data transformations with Azure Synapse pipelines, code-free, using the Mapping Data Flow task.

Mapping Data Flows provide a fully visual experience with no coding required. Your data flows will run on your own execution cluster for scaled-out data processing. Data flow activities can be operationalized via existing Data Factory scheduling, control flow, and monitoring capabilities.

35
Q

How can we debug Data flows?

A

When building data flows, you can enable debug mode, which turns on a small interactive Spark cluster. Turn on debug mode by toggling the slider at the top of the authoring module. Debug clusters take a few minutes to warm up, but can be used to interactively preview the output of your transformation logic.

36
Q

Add source data to the Mapping Data Flow

Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all in Azure. Currently which datasets can be used?

A

Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all in Azure. Currently the following datasets can be used in a source transformation:

  • Azure Blob Storage (JSON, Avro, Text, Parquet)
  • Azure Data Lake Storage Gen1 (JSON, Avro, Text, Parquet)
  • Azure Data Lake Storage Gen2 (JSON, Avro, Text, Parquet)
  • Azure Synapse Analytics
  • Azure SQL Database
  • Azure CosmosDB
37
Q

Use transformations in the Mapping Data Flow

What are the main tasks for transformations in Mapping Data Flows?

A

Now that you have moved the data into Azure Data Lake Store Gen2, you are ready to build a Mapping Data Flow that will transform your data at scale via a spark cluster and then load it into a Data Warehouse.

The main tasks for this are as follows:

  1. Preparing the environment
  2. Adding a Data Source
  3. Using Mapping Data Flow transformation
    1. Add a Select transformation to rename and drop a column
    2. Add a Filter Transformation to filter out unwanted years.
    3. Add a Derive Transformation to calculate primary genre.
    4. Rank movies via a Window Transformation.
    5. Aggregate ratings with an Aggregate Transformation.
    6. Specify Upsert condition via an Alter Row Transformation.
  4. Writing to a Data Sink
38
Q

What can we do in the Power BI blade of Synapse Studio?

A

As you can see, we can create, edit, and view Power BI reports from within Synapse Studio! As a business analyst, data engineer, or developer, you no longer need to open another browser window, sign in to Power BI, and toggle back and forth between environments.

39
Q

Which transformation in the Mapping Data Flow is used to route data rows to different streams based on matching conditions?

  • Lookup
  • GetMetadata activity
  • Conditional Split
A

Conditional Split

40
Q

Which transformation is used to load data into a destination data store or compute resource?

  • Sink
  • Source
  • Window
A

Sink

41
Q

When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure can have an impact on performance.

True or false

A

True

42
Q

When working with Data Lake Storage Gen2, many small files can negatively affect performance. The recommended file size for Data Lake Storage Gen2 is between which of the following sizes?

  • 256MB to 100GB
  • 1GB to 10GB
  • 10GB to 100GB
  • 256MB to 1GB
A

256MB to 100GB

43
Q

When building data flows in Azure Synapse, you can enable debug mode. When debug mode is enabled, Synapse automatically turns on which of the following?

  • Serverless cluster
  • Spark cluster
  • Dedicated SQL Pool
A

Spark cluster