Terms Flashcards
Azure Active Directory
Cloud-based identity and access management (IAM) solution.
Provides single sign-on and multi-factor authentication to help protect users.
Helps organize computers and users.
Azure Synapse Analytics
Analytics Service that brings together integration, enterprise data warehousing, and big data analytics. An evolution of Azure SQL Data Warehouse. Allows you to build and manage a modern DW.
Strengths: Quickly run complex queries across petabytes of data.
A serverless SQL pool is also available in Synapse; it adapts to the current workload and scales up or down on demand. In dedicated SQL pools, you can choose the table distribution pattern that suits the data (Hash, Round Robin, Replicated).
Azure Blob Storage
File storage in the cloud and an API that lets you build apps to access data. Unstructured, with no restriction on the kinds of data it holds. Blobs have higher latency than memory and local disk and don’t have indexing features.
Frequently used in combination with databases to store non-queryable data.
E.g. profile pictures for an app could be stored in blobs.
Every blob lives inside a blob container. If you want to store data without performing analysis on it, set Hierarchical Namespace to Disabled when setting up the storage account. Also good for archiving rarely used data or storing website assets such as images and media.
Azure Synapse Analytics Serverless SQL Pool
A query service over the data in the data lake. Benefits: basic discovery and exploration, logical data warehousing, data transformation.
Apache Spark
Processing system for big data workloads. Interface for programming entire clusters with implicit data parallelism and fault tolerance.
Azure Cosmos DB
Used within web and mobile applications. Good for modeling social interactions. Cosmos DB is a globally distributed NoSQL database. The Cosmos DB account is the organizational entity for your databases.
Azure Data Lake Storage Gen2
Capabilities dedicated to big data analytics. Built on Azure Blob Storage. File system semantics, file-level security, and scale.
Also: low-cost, tiered storage, high availability. If you’re performing analytics on data, set Hierarchical Namespace to Enabled.
Gen2 adds a hierarchical namespace on top of Blob Storage, meaning it has a real folder structure. Gen1 was a separate service, not built on Blob Storage.
Hierarchical namespace
A physical folder structure
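A rough sketch of why this matters, as a toy in-memory model in Python (not the Azure SDK). In a flat namespace a "folder" is only a shared name prefix, so renaming it touches every blob under it; with a hierarchical namespace the directory is a single object. The function names and sample paths are made up for illustration.

```python
# Toy model only: plain dicts stand in for a storage account.

def flat_rename(blobs: dict[str, bytes], old_prefix: str, new_prefix: str) -> int:
    """Rename a 'folder' in a flat namespace; returns the number of blobs touched."""
    touched = 0
    for name in list(blobs):  # iterate over a copy because the dict is modified
        if name.startswith(old_prefix):
            blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
            touched += 1
    return touched


def hier_rename(root: dict, old_name: str, new_name: str) -> int:
    """Rename a real directory node; one entry changes regardless of its contents."""
    root[new_name] = root.pop(old_name)
    return 1


flat = {"logs/2024/a.csv": b"1", "logs/2024/b.csv": b"2", "img/x.png": b"3"}
tree = {"logs": {"2024": {"a.csv": b"1", "b.csv": b"2"}}, "img": {"x.png": b"3"}}

flat_touched = flat_rename(flat, "logs/", "archive/")   # one operation per blob
tree_touched = hier_rename(tree, "logs", "archive")     # one operation in total
```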
Azure Databricks
Data analytics platform optimized for Azure cloud services. Uses notebooks that run on a Spark engine. Integrated with Power BI, Tableau, and similar.
Cannot be assigned a system-assigned managed identity. Instead, use a secret scope, which allows Databricks to access a Key Vault. Only users with Contributor permission or higher can activate secret scopes.
Azure Data Factory
Azure’s cloud ETL service for scale-out serverless data integration and data transformation. You can create and schedule data-driven workflows. Can create pipelines without writing code. Can copy and transform data. Can orchestrate batch data movement and transformations. Stores pipeline-run data for only 45 days.
Activities: copy data from source into a sink, perform transformation, or similar.
Linked services: connection tools that ADF uses to connect to services like API, storage accounts, and databases.
Pipelines: Container for activities in ADF. Contains a flow of activities that execute depending on completion status. Typically equipped with a trigger.
Integration Runtimes: the infrastructure used to compute activities, data flow, SSIS package, and data movements. 3 types:
Azure – runs data flow, data movement, and activities.
Self-hosted – runs data movement and activities.
Azure SSIS – SSIS-package execution
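The "flow of activities that execute depending on completion status" idea can be sketched in plain Python. This is a toy model, not the Data Factory SDK or its JSON format; the `Activity` class, `run_pipeline`, and the activity names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Activity:
    name: str
    run: Callable[[], None]            # the work itself (copy, transform, ...)
    depends_on: Optional[str] = None   # name of the upstream activity, if any
    on_status: str = "Succeeded"       # upstream status required for this one to run


def run_pipeline(activities: list[Activity]) -> dict[str, str]:
    """Run activities in order; each runs only if its dependency condition holds."""
    status: dict[str, str] = {}
    for act in activities:
        if act.depends_on is not None and status.get(act.depends_on) != act.on_status:
            status[act.name] = "Skipped"
            continue
        try:
            act.run()
            status[act.name] = "Succeeded"
        except Exception:
            status[act.name] = "Failed"
    return status


def failing_copy() -> None:
    raise RuntimeError("source unavailable")


result = run_pipeline([
    Activity("CopyToSink", failing_copy),
    Activity("Transform", lambda: None, depends_on="CopyToSink"),
    Activity("NotifyOnFailure", lambda: None, depends_on="CopyToSink", on_status="Failed"),
])
```

Here the copy fails, so the transformation is skipped and only the failure branch runs.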
Azure Virtual Network
VNet. Enables Azure resources to securely communicate.
Azure Event Hub
Real-time data ingestion service. Stream events to build dynamic data pipelines and respond immediately. Can process millions of events per second. Collects events. Exposes endpoints only for ingesting data; there is no mechanism for sending data back to the publisher. Good for massive scale or for a series of events.
Event Hubs Dedicated is a pricing tier, billed at a fixed monthly price, minimum of 4 hours of usage.
Azure Event Grid
Build applications with event-based architectures. Good for dealing with discrete events and when the application needs to work in a publisher/subscriber model and handle events but not data.
Azure IoT Hub
IoT connector to the cloud. Enables solutions with reliable and secure communication.
Azure SQL Database
Managed cloud database.
Azure Stream Analytics
Serverless scalable event processing engine. Can run real-time analytics on multiple streams.
Used together with Event Hubs. Event Hubs feeds events into Azure and Stream Analytics processes them.
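One common kind of real-time analytics is counting events in fixed, non-overlapping time windows (a tumbling window). A minimal sketch in plain Python, not the Stream Analytics query language; the event shape `(timestamp_seconds, device)` and the function name are assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, device) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, device in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)


# Hypothetical events as they might arrive from an ingestion service.
events = [(0, "a"), (3, "a"), (9, "b"), (10, "a"), (14, "b"), (19, "b")]
counts = tumbling_window_counts(events, 10)
```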
Azure DevOps Git Repository
Set of version control tools to manage code. Can help track changes over time.
Azure Monitor
Use it to keep monitoring data (such as Data Factory pipeline-run data) for longer than 45 days. Helps maximize availability and performance of applications and services.
Azure Log Analytics
Used to edit and run log queries on data collected by Azure Monitor Logs and to interactively analyze the results.
Azure Monitor builds on top of Log Analytics. Monitor is the marketing name; Log Analytics is the technology that powers it.
Microsoft Power BI
Business analytics service. Provides interactive visualizations and BI capabilities. Reports and dashboards for end users.
Microsoft Visual Studio
Development environment. IDE and Code Editor
Delta Lake
Efficient method of storing data. ACID-compliant; stores data as Parquet files with an added transaction log. Not readable by humans. Can be used for batch data and streaming data. Delta enforces schemas.
Blobs (3 kinds)
Block blobs: blocks of different sizes.
Append blobs: support only appending new data (not updating or deleting existing data). Good for scenarios like storing logs or writing streamed data.
Page blobs: for scenarios involving random-access reads and writes.
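The difference between append and page blobs can be sketched with two toy classes. This is an in-memory illustration, not the Azure Storage SDK; the class and method names are hypothetical, and the 512-byte page alignment is modeled in simplified form.

```python
class AppendBlob:
    """Append-only: new data goes on the end; existing bytes never change."""

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        self._data += chunk

    def read(self) -> bytes:
        return bytes(self._data)


class PageBlob:
    """Fixed-size blob that supports reads and writes at arbitrary page offsets."""

    PAGE_SIZE = 512

    def __init__(self, size: int) -> None:
        if size % self.PAGE_SIZE:
            raise ValueError("size must be a multiple of the page size")
        self._data = bytearray(size)

    def write(self, offset: int, chunk: bytes) -> None:
        if offset % self.PAGE_SIZE or len(chunk) % self.PAGE_SIZE:
            raise ValueError("writes must be page-aligned")
        if offset + len(chunk) > len(self._data):
            raise ValueError("write past end of blob")
        self._data[offset:offset + len(chunk)] = chunk

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])


log = AppendBlob()            # e.g. a log file that only ever grows
log.append(b"line1\n")
log.append(b"line2\n")

disk = PageBlob(1024)         # e.g. a disk image updated in place
disk.write(512, b"\x01" * 512)
```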
Container
Packages of software that contain everything needed to run in any environment. Virtualize the operating system.
SQL Pool
Traditional Data Warehouse
Dedicated SQL Pool
Formerly SQL DW. Refers to the enterprise data warehousing features available in Azure Synapse Analytics.
Spark Pool
To run computation jobs, notebooks run on Apache Spark pools, similar to Databricks.
Allows for usage of Hadoop file formats (Parquet, Avro, ORC)
Only charged when active. Possible to enable autoscaling.
Dynamic Management Views
Views that users can query to inspect database performance. All dynamic management views belong to the sys schema and have names starting with dm_.
Managed Identity
Provides an identity for applications when connecting to resources that support Azure AD authentication.
Shared Access Signature (SAS)
A URI that grants restricted access rights to Azure storage resources. You specify the validity period and the permissions. The URI points to one or more storage resources.
Good for untrusted clients.
Service-level: allows access to specific resources in the account. E.g. allow an app to retrieve a list of files or download a file.
Account-level: everything service-level allows, plus additional resources and abilities. E.g. the ability to create file systems.
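The general idea behind a signed URI (permissions and an expiry time, protected by a signature) can be sketched with Python's standard library. This is not Azure's actual SAS format: the query parameter names `perm`, `exp`, and `sig`, the string that gets signed, and the example URL are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _sign(resource: str, permissions: str, expiry: int, key: bytes) -> str:
    """HMAC-SHA256 over the resource, permissions, and expiry (simplified)."""
    to_sign = f"{resource}\n{permissions}\n{expiry}".encode()
    digest = hmac.new(key, to_sign, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode()


def make_token_url(resource_url: str, permissions: str, expiry: int, key: bytes) -> str:
    """Append permissions, expiry, and a signature to a resource URL."""
    sig = _sign(resource_url, permissions, expiry, key)
    query = urlencode({"perm": permissions, "exp": expiry, "sig": sig})
    return f"{resource_url}?{query}"


def is_allowed(url: str, wanted: str, now: int, key: bytes) -> bool:
    """Server-side check: signature valid, not expired, permission granted."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    resource = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    perm, exp, sig = query["perm"][0], int(query["exp"][0]), query["sig"][0]
    expected = _sign(resource, perm, exp, key)
    return hmac.compare_digest(sig, expected) and now < exp and wanted in perm


key = b"account-key-known-only-to-the-service"   # hypothetical secret
url = make_token_url("https://example.invalid/container/file.txt", "r", 1000, key)
```

The client never sees the key, so it cannot widen its own permissions or extend the expiry without invalidating the signature.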
Data Warehouse
Data management system. Intended to perform queries and analysis. Contains large amounts of historical data. Makes Data Mining possible (assists businesses in looking for patterns).
Data Masking
Used when users have access to a database but should not see certain columns in the data.
Default: different masking methods for different types. 0 for numeric, XXXX for strings.
Credit card: exposes only the last 4 digits.
Email: exposes the first letter and replaces the domain with XXXX.com.
Random number: generates a random number.
Custom text: exposes the first and last characters and replaces the rest with a padding string.
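The masking methods above, sketched as plain Python functions. These only illustrate the output shapes; they are not the database feature itself, and the function names and exact mask strings are assumptions.

```python
import random


def mask_default(value):
    """Full mask: 0 for numbers, a fixed run of X for strings (length assumed)."""
    if isinstance(value, (int, float)):
        return 0
    return "XXXX"


def mask_credit_card(number: str) -> str:
    """Expose only the last 4 digits."""
    digits = [c for c in number if c.isdigit()]
    return "XXXX-XXXX-XXXX-" + "".join(digits[-4:])


def mask_email(address: str) -> str:
    """Expose the first letter and replace the domain with XXXX.com."""
    return address[0] + "XXX@XXXX.com"


def mask_random(low: int, high: int) -> int:
    """Replace a number with a random one from the given range (inclusive)."""
    return random.randint(low, high)


def mask_custom(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Expose the first `prefix` and last `suffix` characters around a padding string."""
    tail = value[-suffix:] if suffix else ""
    return value[:prefix] + padding + tail


masked_card = mask_credit_card("4111 1111 1111 1234")
masked_mail = mask_email("alice@contoso.com")
masked_phone = mask_custom("555-123-4567", 1, "XXXXXXX", 2)
```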
PolyBase
Data virtualization feature for SQL Server. Accesses external data stored in Azure Blob Storage, Hadoop, or Azure Data Lake using T-SQL.
E.g. you can create an external table to query Parquet files stored in the Data Lake without importing the data into the DW.
Blob container
Containers are flat, cannot store other containers, only blobs
Round Robin
An arrangement of choosing all elements in a group equally in some rational order. The placement of a row in a round-robin table is non-deterministic, so the same row can end up in different distributions.
Good candidate for round robin: most columns are nullable and no good hash distribution can be achieved. Useful for improving load speed. Often a fit for staging tables.
Hash
Table with rows dispersed across multiple distributions based on a hash function applied to a column. Choose “not null” columns when creating hash distributed tables. Improves query performance on large fact tables. Improves speed of Joins.
Replicated distribution
Works when the tables are < 2GB
Distributed Table
Appears as a single table but the rows are stored across 60 distributions.
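A minimal sketch contrasting hash and round-robin placement across 60 distributions. `zlib.crc32` is only a stand-in hash chosen for the example (the hash function the service actually uses is not stated here), and the row data is made up.

```python
import zlib
from itertools import count

NUM_DISTRIBUTIONS = 60


def hash_distribution(key: str) -> int:
    """Deterministic: the same key always maps to the same distribution."""
    return zlib.crc32(key.encode()) % NUM_DISTRIBUTIONS


def make_round_robin():
    """Return an assigner that cycles through distributions, ignoring row content."""
    counter = count()
    return lambda _row: next(counter) % NUM_DISTRIBUTIONS


rows = [("cust-1", 10), ("cust-2", 20), ("cust-1", 30)]

assign = make_round_robin()
rr_placement = [assign(row) for row in rows]                   # [0, 1, 2]
hash_placement = [hash_distribution(row[0]) for row in rows]   # rows 0 and 2 match
```

With hash distribution, rows sharing a key land in the same distribution, which is what helps joins on that column. Round robin spreads rows evenly but places them without regard to content.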
Distribution
An Azure SQL database in which one or more distributed tables are stored.
External Table
A table whose data comes from flat files stored outside of the database. These tables support use cases for both exporting and importing data. Can be created solely for the purpose of writing data out to external files. Only the following statements are allowed on external tables:
CREATE TABLE
DROP TABLE
CREATE STATISTICS
DROP STATISTICS
CREATE VIEW
DROP VIEW
Staging table
Temporary table containing business data, modified and/or cleaned. Used primarily to stage incremental data from the transactional database. When ETL runs, staging tables are truncated before they are populated with change capture data.
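The truncate-then-load cycle can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the table names are hypothetical, SQLite has no TRUNCATE TABLE so a DELETE without WHERE stands in for it, and the merge step uses SQLite's INSERT OR REPLACE rather than whatever a real warehouse would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")


def run_etl(conn, changed_rows):
    """One ETL run: empty the staging table, load the changes, merge into the target."""
    with conn:  # commit on success, roll back on error
        # Stand-in for TRUNCATE TABLE staging_orders.
        conn.execute("DELETE FROM staging_orders")
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", changed_rows)
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, amount) "
            "SELECT id, amount FROM staging_orders"
        )


run_etl(conn, [(1, 10.0), (2, 20.0)])    # first batch of captured changes
run_etl(conn, [(2, 25.0), (3, 30.0)])    # second incremental batch

staged = conn.execute("SELECT id, amount FROM staging_orders ORDER BY id").fetchall()
final = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

After the second run the staging table holds only the latest batch, while the target table holds the accumulated result.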