Terms Flashcards
Azure Active Directory
Cloud-based identity and access management (IAM) solution.
Provides single sign-on and multi-factor authentication to help protect users.
Helps organize computers and users.
Azure Synapse Analytics
Analytics Service that brings together integration, enterprise data warehousing, and big data analytics. An evolution of Azure SQL Data Warehouse. Allows you to build and manage a modern DW.
Strengths: Quickly run complex queries across petabytes of data.
It is possible to use a serverless SQL pool in Synapse, which is adaptive to current workloads and can shrink or grow on demand. You can also choose the distribution pattern that fits the data (Hash, Round Robin, Replicated).
Azure Blob Storage
File storage in the cloud and an API that lets you build apps to access data. Unstructured, with no restriction on the kinds of data it holds. Blobs have higher latency than memory and local disk and lack indexing features.
Frequently used in combination with databases to store non-queryable data.
E.g. profile pictures for an app could be stored in blobs.
Every blob lives inside a blob container. If you want to store data without performing analysis on it, set Hierarchical Namespace to Disabled when setting up the storage account. Also good for archiving rarely used data or storing website assets such as images and media.
Azure Synapse Analytics Serverless SQL Pool
A query service over the data in the data lake. Benefits: basic discovery and exploration, logical data warehousing, data transformation.
Apache Spark
Processing system for big data workloads. Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Azure Cosmos DB
Used within web and mobile applications. Good for modeling social interactions. Cosmos DB is a globally distributed NoSQL database. The Cosmos DB account is the organizational entity for your databases.
Azure Data Lake Storage Gen2
Capabilities dedicated to big data analytics. Built on Azure Blob Storage. File system semantics, file-level security, and scale.
Also: low-cost, tiered storage, high availability. If you’re performing analytics on data, set Hierarchical Namespace to Enabled.
Gen2 adds a hierarchical namespace, meaning it has a real folder structure. Gen1 was a separate service; Gen2 is built on Blob Storage.
Hierarchical namespace
A physical folder structure
Azure Databricks
Data analytics platform optimized for Azure cloud services. Uses notebooks that run on a Spark engine. Integrated with PowerBI, Tableau, and similar.
Cannot be assigned a system-assigned managed identity. Instead, use a Secret Scope, which allows Databricks to access a Key Vault. Only users with Contributor permission, or higher, can create secret scopes.
Azure Data Factory
Azure’s cloud ETL service for scale-out serverless data integration and data transformation. You can create and schedule data-driven workflows, build pipelines without writing code, copy and transform data, and orchestrate batch data movement and transformations. Pipeline-run data is stored for only 45 days.
Activities: copy data from a source into a sink, perform a transformation, and similar.
Linked services: connection tools that ADF uses to connect to services such as APIs, storage accounts, and databases.
Pipelines: Container for activities in ADF. Contains a flow of activities that execute depending on completion status. Typically equipped with a trigger.
Integration Runtimes: the infrastructure used to compute activities, data flow, SSIS package, and data movements. 3 types:
Azure – runs data flow, data movement, and activities.
Self-hosted – runs data movement and activities.
Azure SSIS – SSIS-package execution
Azure Virtual Network
VNet. Enables Azure resources to securely communicate.
Azure Event Hub
Real-time data ingestion service. Stream events to build dynamic data pipelines and respond immediately. Can process millions of events per second. Collects events and exposes endpoints for ingestion only; there is no mechanism for sending data back to the publisher. Good for massive scale or for a series of events.
Event Hubs Dedicated is a pricing tier, billed at a fixed monthly price, minimum of 4 hours of usage.
Azure Event Grid
Build applications with event-based architectures. Good for dealing with discrete events, when the application needs to work in a publisher/subscriber model, and when it handles events but not the data itself.
Azure IoT Hub
IoT connector to the cloud. Enables solutions with reliable and secure communication.
Azure SQL Database
Managed cloud database.
Azure Stream Analytics
Serverless scalable event processing engine. Can run real-time analytics on multiple streams.
Used together with Event Hub: Event Hubs feeds events into Azure, and Stream Analytics processes them.
Azure DevOps Git Repository
Set of version control tools to manage code. Can help track changes over time
Azure Monitor
Use to keep pipeline-run data for longer than 45 days. Helps maximize availability and performance of applications and services.
Azure Log Analytics
To edit and run log queries from data collected by Azure Monitor Logs and interactively analyze their results.
Azure Monitor builds on top of Log Analytics. Monitor is the marketing name; Log Analytics is the technology that powers it.
Microsoft Power BI
Business analytics service. Provides interactive visualizations and BI capabilities. Reports and dashboards for end users.
Microsoft Visual Studio
Development environment. IDE and Code Editor
Delta Lake
Efficient method of storing data. ACID-compliant; stores data in Parquet files together with a transaction log, so the files are not human-readable. Can be used for both batch data and streaming data. Delta enforces schemas.
Blobs (3 kinds)
Block blobs: composed of blocks that can be of different sizes.
Append blobs: support only appending new data (not updating or deleting existing data). Good for scenarios like storing logs or writing streamed data.
Page blobs: for scenarios involving random-access reads and writes.
Container
Packages of software that contain everything needed to run in any environment. Containers virtualize the operating system.
SQL Pool
Traditional Data Warehouse
Dedicated SQL Pool
Formerly SQL DW. Refers to the enterprise data warehousing features available in Azure Synapse Analytics.
Spark Pool
To run computation jobs, notebooks run on Apache Spark pools, similar to Databricks.
Allows for usage of Hadoop file formats (Parquet, Avro, ORC)
Only charged when active. Possible to enable autoscaling.
Dynamic Management Views
Views that users can query to monitor database performance. All dynamic management views belong to the sys schema and have names beginning with dm_.
Managed Identity
Provide an identity for applications when connecting to resources that support Azure AD authentication.
Shared Access Signature (SAS)
A URI that grants restricted access rights to Azure storage resources. You specify the validity period and the permissions. The URI points to one or more storage resources.
Good for untrusted clients.
Service-level: allows access to specific resources in the account, e.g. letting an app retrieve a list of files or download a file.
Account-level: everything service-level allows, plus additional resources and abilities, e.g. the ability to create file systems.
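The heart of a SAS is a signature computed over the granted permissions and expiry using the account key. A minimal sketch of that idea in Python — make_sas_token is a hypothetical helper, and the real SAS string-to-sign format defined by Azure is considerably more involved:

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

def make_sas_token(resource_path: str, permissions: str, expiry: str, account_key: str) -> str:
    """Sign a string-to-sign with the account key (HMAC-SHA256), as a SAS does conceptually."""
    string_to_sign = "\n".join([resource_path, permissions, expiry])
    key = base64.b64decode(account_key)
    signature = base64.b64encode(
        hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    ).decode("utf-8")
    # Query-string style token, appended to the resource URI by the caller
    return urlencode({"sp": permissions, "se": expiry, "sig": signature})
```

Because the signature covers the permissions and expiry, a client cannot widen its own access without invalidating the token — which is why a SAS is suitable for untrusted clients.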
Data Warehouse
Data management system. Intended to perform queries and analysis. Contains large amounts of historical data. Makes Data Mining possible (assists businesses in looking for patterns).
Data Masking
Used when users have access to a database but should not see certain columns of data.
Default: different masking methods for different types, e.g. 0 for numeric columns and XXXX for strings.
Credit card: exposes only the last 4 digits.
Email: exposes the first letter and replaces the domain with XXXX.com.
Random number: generates a random number.
Custom text: exposes the first and last characters and replaces the rest with a padded string.
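These masking rules can be sketched as plain functions. A minimal illustration — the exact output formats produced by SQL dynamic data masking may differ:

```python
def mask_default(value):
    """Default masking: 0 for numeric types, XXXX for strings."""
    return 0 if isinstance(value, (int, float)) else "XXXX"

def mask_credit_card(number: str) -> str:
    """Expose only the last 4 digits."""
    return "XXXX-XXXX-XXXX-" + number[-4:]

def mask_email(address: str) -> str:
    """Expose the first letter and replace the domain with XXXX.com."""
    return address[0] + "XXX@XXXX.com"

def mask_custom(value: str, prefix: int = 1, suffix: int = 1, padding: str = "XXX") -> str:
    """Expose the first/last characters and replace the middle with a padded string."""
    return value[:prefix] + padding + value[-suffix:]
```

The key point: masking happens at query time for non-privileged users, while the underlying stored data is unchanged.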
PolyBase
Data virtualization feature for SQL Server. Accesses external data stored in Azure Blob Storage, Hadoop, or Azure Data Lake using T-SQL.
E.g. you can create an external table to query Parquet files stored in the Data Lake without importing the data into the DW.
Blob container
Containers are flat: they cannot store other containers, only blobs.
Round Robin
An arrangement for choosing all elements in a group equally in some rational order. The placement of a row in an RR table is non-deterministic; identical rows can end up in different distributions.
Good candidates for RR: tables where most columns are nullable and no good hash distribution can be achieved. Useful for improving load speed, e.g. for staging tables.
Hash
Table with rows dispersed across multiple distributions based on a hash function applied to a column. Choose "not null" columns when creating hash-distributed tables. Improves query performance on large fact tables and speeds up joins.
Replicated distribution
Works best when the tables are small (< 2 GB), since a full copy of the table is available on each compute node.
Distributed Table
Appears as a single table but the rows are stored across 60 distributions.
Distribution
An Azure SQL database in which one or more distributed tables are stored.
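The distribution choices above can be sketched in Python. The modulo-of-a-digest below is an illustrative stand-in for the real hash function, and 60 matches the fixed number of distributions:

```python
import hashlib
from itertools import cycle

DISTRIBUTIONS = 60  # a distributed table's rows are spread across 60 distributions

def hash_distribution(key) -> int:
    """Hash-distributed: the same key always lands in the same distribution,
    so joins and aggregations on that key avoid data movement."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % DISTRIBUTIONS

def round_robin_assigner():
    """Round-robin: rows are dealt out evenly; placement is not key-dependent."""
    return cycle(range(DISTRIBUTIONS))
```

This is why hash distribution suits large fact tables joined on the hash column, while round robin only guarantees an even spread of rows.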
External Table
A table whose data comes from flat files stored outside of the database. These tables support use cases for both exporting and importing data. Can be created solely for the purpose of writing data out to external files. Only the following statements are allowed on external tables:
CREATE TABLE
DROP TABLE
CREATE STATISTICS
DROP STATISTICS
CREATE VIEW
DROP VIEW
Staging table
Temporary table containing business data, modified and/or cleaned. Used primarily to stage incremental data from the transactional database. When ETL runs, staging tables are truncated before they are populated with change capture data.
Slowly Changing Dimension (SCD)
Dimension that contains relatively static data. Changes slowly but unpredictably. E.g. customers, products, geographic locations.
Type 0: Fixed Dimension (no changes)
Type 1: Overwrite-No History (updates record directly)
Type 2: Row Versioning. Supports versioning of dimension members.
Type 3: Previous Value Column. Adds new attribute in the dimension to preserve the old attribute value.
Type 4: History Table (track all changes in a separate table)
Type 5: Hybrid SCD Utilize techniques from Type 4 + 1
Type 6: Hybrid SCD Utilize techniques from Type 4 + 2
Type 7: Hybrid SCD Utilize techniques from Type 4 + 3
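Type 2 row versioning can be sketched as: expire the current row and append a new version instead of overwriting. scd2_update and the column names are illustrative; real implementations usually also carry a surrogate key per version:

```python
from datetime import date

def scd2_update(history, business_key, new_attrs, change_date=None):
    """SCD Type 2: close the current version and append a new one on change."""
    change_date = change_date or date.today().isoformat()
    for row in history:
        if row["key"] == business_key and row["current"]:
            if row["attrs"] == new_attrs:
                return history  # nothing changed, keep the current version
            row["current"] = False          # expire the old version
            row["end_date"] = change_date
    history.append({"key": business_key, "attrs": new_attrs,
                    "start_date": change_date, "end_date": None, "current": True})
    return history
```

Contrast with Type 1, which would simply overwrite attrs in place and lose the history.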
Surrogate Key
Unique identifier that is not derived from application data. Any column or set of columns can be declared the PK; a surrogate key is a candidate key. Most common is an incrementing integer.
Business Key
Same as Natural key.
Unique key formed of attributes that exist and are used in the external world, e.g. order number, product number, personal number.
Candidate Key
Keys that could be declared as PK
Natural Key
Unique key formed of attributes that exist and are used in the external world, e.g. order number, product number, personal number.
Also called business key.
Audit Column
Columns with stamps such as "Last Edited" or "Created Date" to track when records were updated or inserted. They do not track deletion of records.
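As a sketch, an application-side version of these stamps might look like this (the helper and column names are illustrative):

```python
from datetime import datetime, timezone

def with_audit_columns(record, existing=None):
    """Stamp 'Created Date' on first insert and 'Last Edited' on every write."""
    now = datetime.now(timezone.utc).isoformat()
    out = dict(record)
    # Preserve the original creation stamp on updates; set it on inserts
    out["Created Date"] = existing["Created Date"] if existing else now
    out["Last Edited"] = now
    return out
```

In a warehouse these stamps are typically set by the ETL process or by database defaults/triggers rather than application code.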
Heap indexing
A heap is a table without a clustered index. Data is stored in the heap without a specified order. Heaps can be used as staging tables.
Clustered Column Store
Physical Storage for entire table. Combines query results from columnstore and deltastore.
Columnstore
Standard for storing and querying large data warehousing fact tables by using a columnar data format. Better for analytic work / read efficiency.
Deltastore
Where incoming rows are stored temporarily in a clustered index until enough accumulate to be compressed into the columnstore. Deltastore operations are handled behind the scenes.
Partitioning
A way of splitting data into smaller parts to make it easier to work with. If partitioning is done on a date, it might be fastest to partition by day. With parallel workloads, we want to match partitions and keys between input and output. WHERE clauses then only check partitions with relevant data, so you avoid scanning the entire table. A good partitioning candidate should have over 1 million rows per partition, and queries should filter on the partition column; avoid creating too many partitions.
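Partition elimination can be illustrated with a toy layout where data is partitioned by day; a query that filters on the partition key touches only one partition (all names here are illustrative):

```python
# Partitioned-by-day layout: partition key -> rows in that partition
sales = {
    "2024-01-01": [{"id": 1, "amount": 10}],
    "2024-01-02": [{"id": 2, "amount": 25}],
}

def query_with_pruning(partitions, where_date):
    """Only the partition matching the WHERE predicate is scanned;
    all other partitions are skipped entirely."""
    return partitions.get(where_date, [])
```

A query without a predicate on the partition key gets no such benefit and must scan every partition.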
Stored procedure
A set of SQL statements with an assigned name so that it can be reused and shared by multiple programs.
Hadoop
Open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
Associated file formats include Avro, Parquet, and ORC.
All Hadoop formats are compressed and optimized for storage and querying.
Avro
Row-oriented remote procedure call and data serialization framework. Schemas are defined using JSON.
Supports timestamps. Useful for write-heavy use cases or when the aim is to retrieve full rows of data.
CSV
Comma-separated value file. Delimited text file.
Parquet
Open-source file format. Can only read needed columns. Column-oriented data storage format. Supported by Synapse, Databricks, ADF.
ORC
Optimized Row Columnar. Column-stored. Efficient for read-heavy operations or when only subsets of the columns are of interest.
TSV
Tab-separated values. Storing data in tabular format.
TXT
File extension for a plain text file.
GZip
Used for file compression and decompression. Reduces the size of files. Compressed files have the fastest load.
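A round trip with Python's standard gzip module shows the effect on repetitive text such as log or CSV data:

```python
import gzip

def compress(text):
    """Compress text to gzip bytes."""
    return gzip.compress(text.encode("utf-8"))

def decompress(blob):
    """Restore the original text from gzip bytes."""
    return gzip.decompress(blob).decode("utf-8")

payload = "timestamp,level,message\n" + "2024-01-01,INFO,ok\n" * 1000
packed = compress(payload)
# Highly repetitive data compresses to a small fraction of its original size.
```

Smaller files mean less data moved over the network, which is why compressed files load fastest.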
JSON
Open standard file format. Text-based. Often used for serializing structured data and exchanging it over a network, typically between server and web.
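Serializing an object to JSON text for the network, and deserializing it back into an in-memory object, with Python's standard library:

```python
import json

order = {"id": 42, "items": ["disk", "ram"], "total": 99.5}

wire = json.dumps(order)     # serialize: in-memory object -> text for the network
restored = json.loads(wire)  # deserialize: text -> object in memory
```

The round trip preserves the structure, which is what makes JSON a convenient exchange format between server and web client.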
Binary
The service does not parse file content but treats it as-is. Can be used in the Copy, GetMetadata, or Delete activities. When using Binary in a Copy activity, you can only copy from Binary to Binary.
Hot access tier
Storing data that is accessed frequently. Staged data. Data in active use.
Cool access tier
Storing data that is infrequently accessed and stored for at least 30 days. Short-term backup and disaster recovery. Older data that is not used often but is expected to be available immediately when accessed.
Archive access tier
Storing data that is rarely accessed. Stored at least 180 days. Long-term backup. Raw data that must be preserved. Compliance and archival data.
CI/CD
Continuous Integration and Continuous Delivery. In the ADF context, it means moving Data Factory pipelines from one environment to another.
Reference data
Also known as a lookup table. A finite data set that is static or slowly changing in nature, used to perform lookups.
POSIX
Portable Operating System Interface.
Family of standards specified by the IEEE. Its purpose is to maintain compatibility between operating systems.
FlattenHierarchy
A Copy activity copy behavior. All files from the source folder are placed in the first level of the target folder, with autogenerated names. The hierarchy is not preserved.
MergeFiles
Merges all files from the source folder into one file. If a file name is specified, the merged file gets that name; otherwise the name is autogenerated.
PreserveHierarchy
Default setting. Preserves the hierarchy in the target folder.
Geo-Redundant Storage (GRS)
Copies data synchronously three times within a single location in primary region using LRS. Then copies data asynchronously to a single physical location in secondary region. Within secondary region, data is copied synchronously three times with LRS.
Geo-Zone-Redundant Storage (GZRS)
Copies data synchronously across 3 Azure availability zones in primary region using ZRS. Then copies data asynchronously to a single physical location in secondary region. In secondary region data is copied synchronously 3 times using LRS. Main difference from GRS is how data is replicated in primary region.
Read-access GRS
Enables read access to the secondary region, so data can be read at all times, even if the primary region becomes unavailable. Note that Azure Files does not support RA-GRS.
Zone-Redundant Storage (ZRS)
Copies data synchronously across 3 Azure availability zones in the primary region. For applications requiring high availability. Recommended to use ZRS in the primary region and then replicate to a secondary region. Recommended for Azure Data Lake Storage Gen2 workloads.
Locally-Redundant Storage (LRS)
Copies data synchronously three times within a single physical location in the primary region. Least expensive replication option. Not recommended for applications requiring high availability or durability.
Tumbling Window trigger
Fires at a periodic time interval from a specified start time. Retains state. Can only reference a single pipeline. Can use the delay parameter so that late-arriving data is included in the load for the time when the data should have arrived. Supports backfill. Delay does not alter startTime.
Schedule trigger
Runs on a wall-clock schedule. Cannot backfill data.
On-Demand Trigger
Manual execution
Event-based trigger
Responds to an event
Sliding Window
Outputs events only for points in time when the content of the window actually changes, i.e. it reports when things change in the window. Events can belong to more than one window.
Hopping Window
Hop forward in time by a fixed period. Think of them as tumbling windows that can overlap. Events can belong to more than one Hopping window result set.
Tumbling Window
Used to segment a data stream into distinct time segments and perform a function against them. They repeat, do not overlap, and cannot belong to more than one tumbling window.
Session Window
Group events that arrive at similar times, filter out periods of time where there is no data.
Snapshot Window
Group events that have same timestamp.
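The difference between tumbling and hopping windows can be sketched with integer event timestamps — a simplification of the actual Stream Analytics windowing functions:

```python
def tumbling_windows(events, size):
    """Tumbling: fixed, non-overlapping windows; each event belongs to exactly one."""
    out = {}
    for t in events:
        out.setdefault((t // size) * size, []).append(t)
    return out

def hopping_windows(events, size, hop):
    """Hopping: windows advance by `hop` and can overlap (size > hop),
    so one event may land in several windows."""
    out = {}
    for t in events:
        start = (t // hop) * hop           # latest window start at or before t
        while start > t - size:            # every window [start, start+size) covering t
            if start >= 0:
                out.setdefault(start, []).append(t)
            start -= hop
    return out
```

With size 10 and hop 5, event 5 lands in the windows starting at 0 and at 5 — overlap that a tumbling window never produces.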
Deserialize
The reverse of serialization: the byte stream is used to recreate the actual object (e.g. a Java object) in memory.
Integration runtime (IR)
The compute infrastructure used by ADF and Azure Synapse pipelines to provide data integration capabilities across the environments: data flow, data movement, activity dispatch, SSIS package execution.
Self-hosted IR
Can run Copy activities between cloud data stores and a data store in a private network. Used to support data stores that require a bring-your-own driver, such as SAP HANA, MySQL, etc. Intended for on-premises use.
Azure-SSIS IR
Not supported in Synapse pipelines. Used to lift and shift existing SSIS workloads.
SSIS
SQL Server Integration Services. Platform for building enterprise-level data integration and data transformation solutions. Use to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.
Databricks Clusters
Standard clusters: recommended for a single user. Can run workloads developed in any language. Require at least one Spark worker node in addition to the driver node. Good for large amounts of data.
High Concurrency clusters: a managed cloud resource. Provides fine-grained sharing for maximum resource utilization and minimum query latencies. Can run SQL, Python, and R, but not Scala. Autoscaling is recommended.
Single Node clusters: have no workers and run Spark jobs on the driver node. Recommended for small amounts of data.
Clusters automatically delete themselves after 30 days of inactivity. If you don’t want a cluster to be deleted, you can pin it and it won’t be deleted despite inactivity. Up to 70 clusters can be pinned.
Azure Stream Analytics Functions
Aggregate functions: perform a calculation on a set of values and return a single value, e.g. COUNT().
GeoSpatial Functions: Enable real-time analytics.
Windowing Functions: Subsets of events, grouped by time.
HTTPS
Extension of HTTP. Used for secure communication over a computer network.
DTU
Database Transaction Unit. A blended measure of compute, storage, and IO resources.
vCores
Virtual cores. Give greater control over compute and storage resources.
SQL elastic pools
Related to eDTUs. Let you buy a set of compute and storage resources that are shared among all databases in the pool. A cost-effective resource allocation service used to scale and manage the performance and cost of a group of Azure SQL databases. Ideal when you have several SQL databases with low average utilization but infrequent spikes.
Collation
Refers to rules that sort and compare data. Helps you define sorting rules when case sensitivity, accent marks, and other characteristics are important.
SaaS (Software as a Service)
Software licensing and delivery model where software is licensed on a subscription basis and centrally hosted.
PolyBase
Fastest and most scalable way to load data. When using PolyBase with T-SQL, you must define 3 objects for it to work:
External data source (to connect with source)
External file format (info about how data is structured)
External table (defines the structure)
OPENROWSET
A function that allows the user to bulk load from a storage account directly. Reads directly from the storage account using AD authentication or SAS tokens. Can read various formats or many sources at once.
Azure Key Vault
Stores keys, certificates, and secrets for Azure services. The only way to access the encrypted data is with an Azure Active Directory identity, RBAC, or an access policy.
Star Schema
Design pattern of using fact and dimension tables when designing a data warehouse. Events are stored in the fact table with keys that refer to dimension tables. No need to normalize the database, query logic is easy to write, and queries may be faster to compute.
Role Based Access Control (RBAC)
To manage access in Azure Cloud Services. Permission is given out by assigning roles. Four levels:
Management group: the highest level; admins can restrict access to resources, and negative (deny) access overrules positive.
Subscription: a collection of resource groups; this is where billing is done.
Resource group: a container for several resources.
Resource: lowest level, connected to a single service.