Describe an analytics workload on Azure (25% - 30%) Flashcards
What is HTAP?
Hybrid transactional and analytical processing.
What Power BI tools do you use to make reports?
Power BI Desktop, Power BI service and Power BI Report Builder.
What Power BI tools do you use to make paginated reports?
Power BI service, Power BI Premium and Power BI Report Builder.
What’s the difference between a data lake and a data warehouse?
A data warehouse holds structured, processed data; a data lake holds raw data in its native format.
When would you need a data warehouse?
When you want to run complex queries over large sets of data (Big Data) from various sources.
What five services does Azure offer for modern data warehousing?
Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage, Azure Databricks and Azure Analysis Services.
What is Azure Data Factory?
Azure Data Factory (ADF) is a data integration service for retrieving data from various sources and converting it into a format for further processing.
It uses pipelines to ingest, clean, convert, and output data.
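As an illustration only (not the actual ADF API, which defines pipelines as JSON activities rather than code), the ingest → clean → convert → output flow of a pipeline can be sketched as a chain of stages. All names and data below are invented for the sketch:

```python
# Illustration only: a toy pipeline mirroring the ingest -> clean ->
# convert -> output stages an ADF pipeline chains together. Stage names
# and data are invented; real ADF pipelines are JSON activity definitions.

def ingest():
    # Pretend these rows arrived from two different sources.
    return ["  Alice,34 ", "Bob,29", "", "  Carol,41"]

def clean(rows):
    # Drop blank rows and strip stray whitespace.
    return [r.strip() for r in rows if r.strip()]

def convert(rows):
    # Turn each CSV line into a (name, age) record.
    return [(name, int(age)) for name, age in (r.split(",") for r in rows)]

def output(records):
    # A real pipeline would land this in a sink such as a data lake.
    return {name: age for name, age in records}

result = output(convert(clean(ingest())))
print(result)  # {'Alice': 34, 'Bob': 29, 'Carol': 41}
```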
What is Azure Data Lake Storage?
An extension of Azure Blob Storage organised as a near-infinite file system.
What is a data lake?
A repository for large quantities of raw data from various sources.
What are the characteristics of Azure Data Lake Storage?
It organises files in directories and sub-directories for improved file organisation.
It supports Portable Operating System Interface (POSIX) file and directory permissions, enabling granular role-based access control (RBAC) over your data.
It is compatible with the Hadoop Distributed File System (HDFS). All Apache Hadoop environments can access data in Azure Data Lake Storage Gen2.
What is Azure Databricks?
An Apache Spark environment running on Azure to provide big data processing, streaming, and machine learning.
What languages can you use with Azure Databricks?
R, Scala and Python.
What are notebooks?
A notebook is a collection of cells, each of which contains a separate block of code.
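A minimal sketch of the idea, assuming invented cell contents: a notebook is an ordered list of code cells that all execute against one shared namespace, so later cells can use names defined by earlier ones.

```python
# Illustration only: model a notebook as an ordered list of cells
# sharing one namespace, as in Databricks or Synapse notebooks.
cells = [
    "x = 10",
    "y = x * 2",
    "result = x + y",
]

namespace = {}
for cell in cells:
    exec(cell, namespace)  # each cell runs against the shared state

print(namespace["result"])  # 30
```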
What is Azure Synapse Analytics?
Azure Synapse Analytics is an analytics engine designed to process large amounts of data quickly.
How does Azure Synapse Analytics leverage a massively parallel processing (MPP) architecture?
With a control node and a pool of compute nodes.
What is a Control node?
The brain of the MPP architecture, and the front-end that interacts with all applications.
The MPP engine runs on the Control node to optimise and coordinate parallel queries.
When it receives a request, the Control node splits it into smaller requests that run in parallel against distinct subsets of the data.
What are Compute nodes?
The computational power of the architecture. The data to be processed is evenly distributed across the nodes.
The control node sends queries to the compute nodes, which run the queries over the portion of data that they each hold.
When each node has finished its processing, the results are sent back to the control node where they’re combined into an overall result.
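The distribute/aggregate/combine flow above can be sketched as a toy model (illustration only; real Synapse hash-distributes table rows across 60 distributions, and the node count and data here are invented):

```python
# Illustration only: a toy model of MPP query execution. The control
# node hash-distributes rows across compute nodes, each node aggregates
# the partition it holds, and the control node merges the partial results.
NODES = 4
rows = [("red", 3), ("blue", 5), ("red", 2), ("green", 7), ("blue", 1)]

# Control node: distribute rows across compute nodes by hashing a key.
partitions = [[] for _ in range(NODES)]
for key, value in rows:
    partitions[hash(key) % NODES].append((key, value))

# Compute nodes: each aggregates only its own partition.
def node_sum(partition):
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals

partials = [node_sum(p) for p in partitions]

# Control node: combine partial results into the overall result.
final = {}
for partial in partials:
    for key, value in partial.items():
        final[key] = final.get(key, 0) + value

print(final)  # same totals as a single-node GROUP BY / SUM would give
```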
What are the two computational models in Azure Synapse Analytics?
SQL pools and Spark pools.
What is a SQL pool?
A pool where each compute node uses an Azure SQL Database and Azure Storage to handle a portion of the data.
You can submit T-SQL (transact-SQL) queries on data retrieved from various data sources using PolyBase.
You specify the number of nodes when you create a SQL pool, and you can scale it out manually as necessary, but not while it is running.
What is PolyBase?
A feature of SQL Server and Azure Synapse Analytics that enables you to run T-SQL queries that read data from external relational and non-relational sources and present them as tables in a SQL database.
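PolyBase itself is a SQL Server / Synapse feature, but the idea it implements (expose external file data as an ordinary SQL table) can be mimicked in a sketch using SQLite from the standard library. The CSV content and table name below are invented:

```python
# Illustration only: mimic PolyBase's idea -- make an external data
# source queryable with plain SQL -- using in-memory SQLite.
import csv
import io
import sqlite3

# Pretend this arrived from an external, non-relational source.
external_csv = "name,sales\nAlice,100\nBob,250\nCarol,175\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE external_sales (name TEXT, sales INTEGER)")

# "Ingest" the external source so it looks like a regular table.
reader = csv.DictReader(io.StringIO(external_csv))
conn.executemany(
    "INSERT INTO external_sales VALUES (?, ?)",
    [(row["name"], int(row["sales"])) for row in reader],
)

# Now ordinary SQL works against what started life as an external file.
(total,) = conn.execute("SELECT SUM(sales) FROM external_sales").fetchone()
print(total)  # 525
```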
What is a Spark pool?
A pool where the compute nodes are replaced with a Spark cluster. Spark jobs are run using notebooks, as in Azure Databricks.
As with a SQL pool, the Spark cluster splits the work out into a series of parallel tasks that can be performed concurrently.
You can save data generated by your notebooks in Azure storage or Azure Data Lake Storage.
You specify the number of nodes when you create a Spark cluster, but, unlike SQL pools, you can enable autoscaling so the pool scales as needed. Furthermore, scaling can happen during processing.
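The split-into-parallel-tasks behaviour of a Spark cluster can be sketched without Spark itself, as an illustration only (the chunking scheme and data are invented; real Spark schedules tasks across executor processes on cluster nodes):

```python
# Illustration only: split one job into parallel tasks, run them
# concurrently, then gather the partial results -- the pattern a
# Spark cluster applies at much larger scale.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # four parallel tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

print(sum(partial_sums))  # 5050, the sum of 1..100
```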
What languages can Spark clusters in Azure Synapse Analytics use?
C#, Scala, Python and Spark SQL (a different dialect of SQL from T-SQL).
What is Spark optimised for and why?
Spark is optimised for in-memory processing which is much faster than disk-based operations, but requires additional memory resources.
How can you reduce your costs when using Azure Synapse Analytics?
By pausing the service. This releases resources in the pool for other users to use and reduces cost.