Azure Synapse Analytics Modern Warehousing Concepts Flashcards

1
Q

Define Data Warehousing with Microsoft Azure Synapse Analytics

A

Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, is an analytics engine. It is designed to process large amounts of data very quickly using SQL pools and a massively parallel processing (MPP) engine. Azure Synapse Analytics blends big data analytics with data warehousing.

2
Q

How do we access or consume the insights produced by Azure Synapse Analytics?

A

Using Analytics Dashboards, Operational Reporting, or Advanced Analytics.

3
Q

What are the 4 main stages in an Azure Synapse Pipeline flow?

A
  • Ingest and Prepare
  • Store
  • Model and Serve
  • Visualize
4
Q

Can we move SSIS ETL or transformation flows into Azure Data Factory?

A

Yes, we can do this easily.

There are four storage types for SSIS packages:

  • SSIS catalog (SSISDB). Introduced with SQL Server 2012, it contains a set of stored procedures, views, and table-valued functions used for working with SSIS projects/packages.
  • File System.
  • SQL Server system database (MSDB).
  • SSIS Package Store. A package management layer on top of two subtypes:
    • MSDB, which is a system database in SQL Server used to store SSIS packages.
    • Managed file system, which is a specific folder in the SQL Server installation path used to store SSIS packages.

Since version 5.0, the Data Migration Assistant (DMA) supports batch assessment of packages stored in the File System, the Package Store, and the SSIS catalog.

Get DMA, and perform your package assessment with it.

Migration

Depending on the storage types of source SSIS packages, the steps to migrate SSIS packages and SQL Server Agent jobs that schedule SSIS package executions may vary.

It is also practical to use SSIS DevOps Tools to batch-redeploy packages to the migration destination.

Package storage type: SSISDB
  • How to migrate SSIS packages: Redeploy packages via SSDT/SSMS to SSISDB hosted in Azure SQL Managed Instance. For more info, see Deploying SSIS packages in Azure.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: File System
  • How to migrate SSIS packages: Redeploy them to file shares/Azure Files via dtinstall/dtutil/manual copy, or keep them in file systems to access via VNet/Self-Hosted IR. For more info, see dtutil utility.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Migrate with the SSIS Job Migration Wizard in SSMS.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: SQL Server (MSDB)
  • How to migrate SSIS packages: Export them to file systems/file shares/Azure Files via SSMS/dtutil. For more info, see Exporting SSIS packages.
  • How to migrate SSIS jobs: Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

Package storage type: Package Store
  • How to migrate SSIS packages: Export them to the package store via SSMS/dtutil, or redeploy them to the package store via dtinstall/dtutil/manual copy. For more info, see Manage packages with Azure-SSIS Integration Runtime package store.
  • How to migrate SSIS jobs:
    • Migrate from SQL Server Agent on premises to the SQL Managed Instance agent via scripts/manual copy. For more info, see Run SSIS packages via Azure SQL Managed Instance Agent.
    • Convert them into ADF pipelines/activities/triggers via scripts/SSMS/ADF portal. For more info, see SSMS scheduling feature.

5
Q

What are the 2 main tools for the Ingest and Prepare stage in Azure?

A

Azure Data Factory

Azure Databricks

6
Q

Can we use Azure Synapse exclusively (without any other tool)?

A

Yes. Synapse on its own is a good choice for greenfield implementations, while a hybrid approach suits corporations with existing Azure services.

7
Q

Name 3 languages which we can use in Azure Synapse Analytics

A
  • Python
  • SQL
  • Scala
8
Q

If we press the “Integrate” button in Synapse Analytics, what will we see?

A

An environment exactly like the one Azure Data Factory (ADF) has, since ADF is included in Synapse Analytics.

The pipeline creation experience is the same as in ADF, which gives you another powerful integration built into Azure Synapse Analytics, removing the need to use Azure Data Factory for data movement and transformation pipelines.

9
Q

What are the 3 Copy methods for a Sink in the Synapse Analytics Integrate tab?

A
  • PolyBase
  • Copy command
  • Bulk insert
10
Q

To perform a scale operation for a SQL Pool…

A

To perform a scale operation, SQL pool first kills all incoming queries and then rolls back transactions to ensure a consistent state.

Scaling only occurs once the transaction rollback is complete.
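
A minimal T-SQL sketch of triggering such a scale operation, assuming a dedicated SQL pool with the hypothetical name mySampleDataWarehouse and a target service objective of DW300c, run against the logical server hosting the pool:

  -- Hypothetical pool name and target DWU level.
  ALTER DATABASE mySampleDataWarehouse
  MODIFY (SERVICE_OBJECTIVE = 'DW300c');

  -- Optionally watch the pool state while the scale operation runs.
  SELECT name, state_desc
  FROM sys.databases
  WHERE name = 'mySampleDataWarehouse';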

11
Q

What are the 2 types of Analytics Pools available in Synapse?

A
  • SQL pools
    • Dedicated
    • Serverless
  • Apache Spark pools
12
Q

How does Synapse automatically scale Apache Spark pools out or in?

A

You can configure an Apache Spark pool to have a fixed size by disabling the autoscale setting. With autoscale enabled, you set the minimum and maximum number of nodes to control the amount of scale applied. When you enable autoscale, Synapse Analytics monitors the resource requirements of the load and scales the number of nodes up or down. It does this by continuously monitoring pending CPU, pending memory, free CPU, free memory, and used memory per node. It checks these metrics every 30 seconds and makes scaling decisions based on their values.

It can take 1-5 minutes for a scaling operation to complete.

13
Q

What options can we find in the Develop tab?

A
  • SQL Scripts
  • Notebooks
  • Data flows
  • Power BI
14
Q

We can use JSON functions, such as xxx and xxx, to extract JSON data into specific structured columns inside a table in a SQL pool.

A

JSON_VALUE and ISJSON
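
A minimal sketch, assuming a hypothetical staging table staging.RawOrders with an NVARCHAR column named doc that holds JSON documents:

  -- ISJSON filters out rows that do not contain valid JSON;
  -- JSON_VALUE extracts scalar values into structured columns.
  SELECT
      JSON_VALUE(doc, '$.customerId')         AS CustomerId,
      JSON_VALUE(doc, '$.customer.firstName') AS FirstName,
      JSON_VALUE(doc, '$.orderTotal')         AS OrderTotal
  FROM staging.RawOrders
  WHERE ISJSON(doc) > 0;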

15
Q

What is a data flow in the Develop tab of Synapse?

A

Data flows are powerful data transformation workflows that use the power of Apache Spark but are authored using a code-free GUI. The work you do in the UI is automatically transformed into code executed by a managed Spark cluster, without your having to write any code or manage the cluster.

16
Q

When ingesting raw data in batch from new data sources, which of the following data formats are natively supported by Synapse Analytics?

  • Scala
  • JSON
  • ORC
  • CSV
  • Parquet
A
  • JSON
  • ORC
  • CSV
  • Parquet
17
Q

Which of the following would be a valid reason for adding a staging area into the architecture of a modern data warehouse?

  • To reduce contention on source systems
  • To make data analytics available directly from the staging area
  • To join data from different source systems
  • To enable the ingestion of source systems based on different schedules
A
  • To reduce contention on source systems
  • To join data from different source systems
  • To enable the ingestion of source systems based on different schedules
18
Q

The process of building a modern data warehouse typically includes Data Ingestion and Preparation.

You have recently deployed Azure Synapse Analytics. You now have a requirement to ingest data code-free. Which of the following tools can be used to perform this task?

  • Azure Databricks
  • Azure Data Factory
  • Power BI
A

Azure Data Factory

19
Q

Processing data that arrives in real-time/near real-time is also referred to as streaming data processing. Azure offers purpose-built stream ingestion services such as Azure IoT Hub and Azure Event Hubs. To collect messages from these or similar services and process them, you can use which of the following features?

  • Azure Stream Analytics
  • Azure IoT Central
  • Azure Databricks
  • Azure Functions
A
  • Azure Stream Analytics
  • Azure Functions
  • Azure Databricks
20
Q


Which technology is typically used as a staging area in a modern data warehousing architecture?

  • Azure Data Lake
  • Azure Synapse SQL Pools
  • Azure Synapse Spark Pools.
A

Azure Data Lake

21
Q

Which of the following is a Big Data Solution that stores data in a relational table format with columnar storage?

  • Azure Synapse SQL Pools
  • Azure Synapse Spark Pools
A

Azure Synapse SQL Pools

22
Q

What is the main reason for adding a staging area into the architecture of a modern data warehouse?

A

A staging area is added to the architecture of a modern data warehouse for any one of the following reasons:

  • To reduce contention on source systems.
  • To deal with the ingestion of source systems on different schedules.
  • To join data together from different source systems.
  • To rerun failed data warehouse loads from the staging area.
23
Q

2 main functions of source systems in an organization

A

Source systems typically play an important role in fulfilling business operations that either bring in (generate) revenue for an organization or provide a function that is mission critical to the business.

24
Q

What is Data Dumping?

A

In this case, some data warehouse design strategies involve grabbing data from a source and dumping it into a staging area. This approach involves no transformation or cleansing; it simply grabs the data, which minimizes contention on the source system.

25
Q

What is metadata and how can you leverage this in a staging area?

A

You can even create additional tables that aid the process of joining data together from different source systems, referred to as mapping tables. In this scenario, imagine that you have a customers table in one source system with a column named FirstName, and in a second source system, perhaps running on an AS400, you have a customers table with a column named FIRNAM that also represents the first name of the customer. You can create a separate table containing metadata that maps the column in one source system to the column in the other source system that represents the same business entity, in this case FirstName.
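
A minimal sketch of such a mapping table; the schema, table, and source-system names (staging.ColumnMapping, CRM, AS400, Customers, CUSTMAST) are hypothetical and simply illustrate the FirstName/FIRNAM example above:

  -- Metadata table that maps equivalent columns across source systems
  -- for the same business entity.
  CREATE TABLE staging.ColumnMapping
  (
      BusinessEntity  NVARCHAR(100),  -- e.g. 'Customer'
      CanonicalColumn NVARCHAR(128),  -- e.g. 'FirstName'
      SourceSystem    NVARCHAR(100),  -- e.g. 'CRM' or 'AS400'
      SourceTable     NVARCHAR(128),
      SourceColumn    NVARCHAR(128)   -- e.g. 'FirstName' or 'FIRNAM'
  );

  INSERT INTO staging.ColumnMapping VALUES ('Customer', 'FirstName', 'CRM',   'Customers', 'FirstName');
  INSERT INTO staging.ColumnMapping VALUES ('Customer', 'FirstName', 'AS400', 'CUSTMAST',  'FIRNAM');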

26
Q

What is a Data Lake?

A

A data lake is a repository of data stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure. Azure Data Lake Storage combines a file system with a storage platform to help you quickly identify insights in your data. Data Lake Storage Gen2 builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads. This integration enables analytics performance, the tiering and data lifecycle management capabilities of Blob storage, and the high availability, security, and durability capabilities of Azure Storage.

27
Q

What are the 3 Data-loading velocities?

A
  • Batch query
    • A batch consists of queries or programs that take tens of minutes, hours, or days to complete. Activities could include initial data wrangling, a complete ETL pipeline, or preparation for downstream analytics.
  • Interactive query
    • An interactive query involves querying batch data at human-interactive speeds, which with the current generation of technologies means results are ready in time frames measured in seconds to minutes.
  • Real-time query
    • A real-time or near-real-time query is the processing of a typically infinite stream of input data. The time until results are ready is short, measured in milliseconds, or seconds in the longest of cases.
28
Q

Recommended storage formats for common data types in a Modern Data Warehouse

A
  • For raw data, it is recommended that this data be stored in its native format.
  • Data from relational databases should typically be stored in CSV format. This is the most widely supported format, so it provides the greatest flexibility.
  • For data from web APIs and NoSQL databases, JSON is the recommended format.
  • When it comes to storing refined versions of the data for possible querying, the recommended data format is Parquet.
29
Q

Parquet benefits

A
  • There is industry alignment around the Parquet format for sharing data at the storage layer, for example across Hadoop, Databricks, and SQL engine scenarios.
  • Parquet is a high-performance, column-oriented format optimized for big data scenarios. Columnar formats like Parquet have storage and performance benefits.
  • The values are clustered by column, so compression is more efficient and shrinks the storage footprint. A query engine can push down column projections to reduce read IO from network and disk by skipping unwanted columns, otherwise known as column pruning.
  • Similar data types for a column are stored together in Parquet files, leading to efficient data compression and encoding schemes.
  • Finally, Parquet stores the file schema in the file metadata, so a query engine can read it without being told the schema (see the sample query below). CSV files don’t store file metadata, so readers need to be either supplied with the schema or the schema needs to be inferred; supplying a schema is tedious, and inferring one is error prone and expensive.
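
A minimal sketch of querying such Parquet files from a Synapse serverless SQL pool; the storage URL and the ProductCategory/SalesAmount column names are hypothetical placeholders:

  -- FORMAT = 'PARQUET' needs no explicit schema because the schema
  -- travels with the Parquet files; only the referenced columns are read.
  SELECT
      r.ProductCategory,
      SUM(r.SalesAmount) AS TotalSales
  FROM OPENROWSET(
          BULK 'https://mydatalake.dfs.core.windows.net/data/gold/sales/*.parquet',
          FORMAT = 'PARQUET'
       ) AS r
  GROUP BY r.ProductCategory;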
30
Q

The first thing you should consider when ingesting data into the data lake is…

A

The first thing you should consider when ingesting data into the data lake is how to structure or organize data within the data lake. You should use Azure Data Lake Storage (ADLS) Gen2; within the Azure portal, this is an Azure storage account with a hierarchical namespace enabled. A key mechanism of ADLS Gen2 is to provide file system performance at object storage scale and prices, and it also provides a hierarchical namespace. This allows the collection of objects and files within an account to be organized into a hierarchy of directories and nested subdirectories, in the same way that the file system on your computer is organized. With a hierarchical namespace enabled, a storage account becomes capable of providing the scalability and cost-effectiveness of object storage together with file system semantics that are familiar to analytics engines and frameworks.

31
Q

In ADLS Gen2, it is a best practice (for storage environments) to have a…

A

In ADLS Gen2, it is a best practice to have a dedicated storage account for production, and a separate storage account for dev and test workloads. This will ensure that dev or test workloads never interfere with production.

32
Q

A common method for structuring folders within a data lake

A

A common method for structuring folders within a data lake is to organize data in separate folders by the degree of refinement.

For example,

  • Bronze folder might contain raw data.
  • Silver contains the cleaned, prepared, and integrated data.
  • Gold contains data ready to support analytics, which might include final refinements, such as pre-computed aggregates.
33
Q

When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure have an impact on performance… explain this.

A

When working with Data Lake Storage Gen2, the following should be considered:

  • When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure have an impact on performance.
  • If you store your data as many small files, this can negatively affect performance.
  • In general, organize your data into larger-sized files for better performance, 256 megabytes to 100 gigabytes in size.
  • Some engines and applications might have trouble efficiently processing files that are greater than 100 gigabytes in size, so keep file sizes below 100 GB.
  • Finally, data pipelines sometimes have limited control over the raw data, which arrives as lots of small files.
  • In that case, it is recommended to have a "cooking" process that generates larger files for downstream applications.
34
Q

You can natively perform data transformations with Azure Synapse pipelines, code-free, using…

A

You can natively perform data transformations with Azure Synapse pipelines, code-free, using the Mapping Data Flow task.

Mapping Data Flows provide a fully visual experience with no coding required. Your data flows will run on your own execution cluster for scaled-out data processing. Data flow activities can be operationalized via existing Data Factory scheduling, control flow, and monitoring capabilities.

35
Q

How can we debug Data flows?

A

When building data flows, you can enable debug mode, which turns on a small interactive Spark cluster. Turn on debug mode by toggling the slider at the top of the authoring module. Debug clusters take a few minutes to warm up, but can be used to interactively preview the output of your transformation logic.

36
Q

Add source data to the Mapping Data Flow

Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all in Azure. Currently which datasets can be used?

A

Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all in Azure. Currently the following datasets can be used in a source transformation:

  • Azure Blob Storage (JSON, Avro, Text, Parquet)
  • Azure Data Lake Storage Gen1 (JSON, Avro, Text, Parquet)
  • Azure Data Lake Storage Gen2 (JSON, Avro, Text, Parquet)
  • Azure Synapse Analytics
  • Azure SQL Database
  • Azure CosmosDB
37
Q

Use transformations in the Mapping Data Flow

What are the main tasks for transformations in Mapping Data Flows?

A

Now that you have moved the data into Azure Data Lake Store Gen2, you are ready to build a Mapping Data Flow that will transform your data at scale via a spark cluster and then load it into a Data Warehouse.

The main tasks for this are as follows:

  1. Preparing the environment
  2. Adding a Data Source
  3. Using Mapping Data Flow transformation
    1. Add a Select transformation to rename and drop a column
    2. Add a Filter Transformation to filter out unwanted years.
    3. Add a Derive Transformation to calculate primary genre.
    4. Rank movies via a Window Transformation.
    5. Aggregate ratings with an Aggregate Transformation.
    6. Specify Upsert condition via an Alter Row Transformation.
  4. Writing to a Data Sink
38
Q

What can we do in the Power BI blade of Synapse Studio?

A

As you can see, we can create, edit, and view Power BI reports from within Synapse Studio! As a business analyst, data engineer, or developer, you no longer need to open another browser window, sign in to Power BI, and toggle back and forth between environments.

39
Q

Which transformation in the Mapping Data Flow is used to route data rows to different streams based on matching conditions?

  • Lookup
  • GetMetadata activity
  • Conditional Split
A

Conditional Split

40
Q

Which transformation is used to load data into a destination data store or compute resource?

  • Sink
  • Source
  • Window
A

Sink

41
Q

When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure can have an impact on performance.

True or false

A

True

42
Q

When working with Data Lake Storage Gen2, many small files can negatively affect performance. The recommended file size for Data Lake Storage Gen2 is between which of the following sizes?

  • 256MB to 100GB
  • 1GB to 10GB
  • 10GB to 100GB
  • 256MB to 1GB
A

256MB to 100GB

43
Q

When building data flows in Azure Synapse, you can enable debug mode. When debug mode is enabled, Synapse automatically turns on which of the following?

  • Serverless cluster
  • Spark cluster
  • Dedicated SQL Pool
A

Spark cluster