Describe an analytics workload on Azure (25% - 30%) Flashcards
What is HTAP?
Hybrid transactional and analytical processing.
What Power BI tools do you use to make reports?
Power BI Desktop, Power BI service and Power BI Report Builder.
What Power BI tools do you use to make paginated reports?
Power BI service, Power BI Premium and Power BI Report Builder.
What’s the difference between a data lake and a data warehouse?
A data warehouse holds structured, processed data; a data lake holds raw data in its native format.
When would you need a data warehouse?
When you want to run complex queries over large sets of data (Big Data) from various sources.
What five services does Azure offer for modern data warehousing?
Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage, Azure Databricks and Azure Analysis Services.
What is Azure Data Factory?
Azure Data Factory (ADF) is a data integration service for retrieving data from various sources and converting it into a format for further processing.
It uses pipelines to ingest, clean, convert, and output data.
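As an illustration only (not the actual ADF API, which defines pipelines as JSON activities rather than code), the ingest → clean → convert → output flow of a pipeline can be sketched as a chain of stages. All names and data below are invented for the sketch:

```python
# Illustration only: a toy pipeline mirroring the ingest -> clean ->
# convert -> output stages an ADF pipeline chains together. Stage names
# and data are invented; real ADF pipelines are JSON activity definitions.

def ingest():
    # Pretend these rows arrived from two different sources.
    return ["  Alice,34 ", "Bob,29", "", "  Carol,41"]

def clean(rows):
    # Drop blank rows and strip stray whitespace.
    return [r.strip() for r in rows if r.strip()]

def convert(rows):
    # Turn each CSV line into a (name, age) record.
    return [(name, int(age)) for name, age in (r.split(",") for r in rows)]

def output(records):
    # A real pipeline would land this in a sink such as a data lake.
    return {name: age for name, age in records}

result = output(convert(clean(ingest())))
print(result)  # {'Alice': 34, 'Bob': 29, 'Carol': 41}
```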
What is Azure Data Lake Storage?
An extension of Azure Blob Storage organised as a near-infinite file system.
What is a data lake?
A repository for large quantities of raw data from various sources.
What are the characteristics of Azure Data Lake Storage?
It organises files in directories and sub-directories for improved file organisation.
It supports Portable Operating System Interface (POSIX) file and directory permissions, enabling granular role-based access control (RBAC) over your data.
It is compatible with the Hadoop Distributed File System (HDFS). All Apache Hadoop environments can access data in Azure Data Lake Storage Gen2.
What is Azure Databricks?
An Apache Spark environment running on Azure to provide big data processing, streaming, and machine learning.
What languages can you use with Azure Databricks?
R, Scala and Python.
What are notebooks?
A notebook is a collection of cells, each of which contains a separate block of code.
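A minimal sketch of the idea, assuming invented cell contents: a notebook is an ordered list of code cells that all execute against one shared namespace, so later cells can use names defined by earlier ones.

```python
# Illustration only: model a notebook as an ordered list of cells
# sharing one namespace, as in Databricks or Synapse notebooks.
cells = [
    "x = 10",
    "y = x * 2",
    "result = x + y",
]

namespace = {}
for cell in cells:
    exec(cell, namespace)  # each cell runs against the shared state

print(namespace["result"])  # 30
```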
What is Azure Synapse Analytics?
Azure Synapse Analytics is an analytics engine designed to process large amounts of data quickly.
How does Azure Synapse Analytics leverage a massively parallel processing (MPP) architecture?
With a control node and a pool of compute nodes.
What is a Control node?
The brain of the MPP architecture, and the front-end that interacts with all applications.
The MPP engine runs on the Control node to optimise and coordinate parallel queries.
When it receives a request, the Control node splits it into smaller requests that run in parallel against distinct subsets of the data.
What are Compute nodes?
The computational power of the architecture. The data to be processed is evenly distributed across the nodes.
The control node sends queries to the compute nodes, which run the queries over the portion of data that they each hold.
When each node has finished its processing, the results are sent back to the control node where they’re combined into an overall result.
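The distribute/aggregate/combine flow above can be sketched as a toy model (illustration only; real Synapse hash-distributes table rows across 60 distributions, and the node count and data here are invented):

```python
# Illustration only: a toy model of MPP query execution. The control
# node hash-distributes rows across compute nodes, each node aggregates
# the partition it holds, and the control node merges the partial results.
NODES = 4
rows = [("red", 3), ("blue", 5), ("red", 2), ("green", 7), ("blue", 1)]

# Control node: distribute rows across compute nodes by hashing a key.
partitions = [[] for _ in range(NODES)]
for key, value in rows:
    partitions[hash(key) % NODES].append((key, value))

# Compute nodes: each aggregates only its own partition.
def node_sum(partition):
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals

partials = [node_sum(p) for p in partitions]

# Control node: combine partial results into the overall result.
final = {}
for partial in partials:
    for key, value in partial.items():
        final[key] = final.get(key, 0) + value

print(final)  # same totals as a single-node GROUP BY / SUM would give
```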
What are the two computational models in Azure Synapse Analytics?
SQL pools and Spark pools.
What is a SQL pool?
A pool where each compute node uses an Azure SQL Database and Azure Storage to handle a portion of the data.
You can submit T-SQL (transact-SQL) queries on data retrieved from various data sources using PolyBase.
You specify the number of nodes when you create a SQL pool, and you can scale it out manually as necessary, but not while it is running.
What is PolyBase?
A feature of SQL Server and Azure Synapse Analytics that enables you to run T-SQL queries that read data from external relational and non-relational sources and present them as tables in a SQL database.
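PolyBase itself is a SQL Server / Synapse feature, but the idea it implements (expose external file data as an ordinary SQL table) can be mimicked in a sketch using SQLite from the standard library. The CSV content and table name below are invented:

```python
# Illustration only: mimic PolyBase's idea -- make an external data
# source queryable with plain SQL -- using in-memory SQLite.
import csv
import io
import sqlite3

# Pretend this arrived from an external, non-relational source.
external_csv = "name,sales\nAlice,100\nBob,250\nCarol,175\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE external_sales (name TEXT, sales INTEGER)")

# "Ingest" the external source so it looks like a regular table.
reader = csv.DictReader(io.StringIO(external_csv))
conn.executemany(
    "INSERT INTO external_sales VALUES (?, ?)",
    [(row["name"], int(row["sales"])) for row in reader],
)

# Now ordinary SQL works against what started life as an external file.
(total,) = conn.execute("SELECT SUM(sales) FROM external_sales").fetchone()
print(total)  # 525
```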
What is a Spark pool?
A pool where the compute nodes are replaced with a Spark cluster. Spark jobs are run using notebooks, as in Azure Databricks.
As with a SQL pool, the Spark cluster splits the work out into a series of parallel tasks that can be performed concurrently.
You can save data generated by your notebooks in Azure storage or Azure Data Lake Storage.
You specify the number of nodes when you create a Spark cluster, but, unlike SQL pools, you can enable autoscaling so the pool scales as needed. Furthermore, scaling can happen during processing.
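The split-into-parallel-tasks behaviour of a Spark cluster can be sketched without Spark itself, as an illustration only (the chunking scheme and data are invented; real Spark schedules tasks across executor processes on cluster nodes):

```python
# Illustration only: split one job into parallel tasks, run them
# concurrently, then gather the partial results -- the pattern a
# Spark cluster applies at much larger scale.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # four parallel tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

print(sum(partial_sums))  # 5050, the sum of 1..100
```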
What languages can Spark clusters in Azure Synapse Analytics use?
C#, Scala, Python and Spark SQL (a different dialect of SQL from T-SQL).
What is Spark optimised for and why?
Spark is optimised for in-memory processing which is much faster than disk-based operations, but requires additional memory resources.
How can you reduce your costs when using Azure Synapse Analytics?
By pausing the service. This releases resources in the pool for other users to use and reduces cost.