Terms Flashcards

1
Q

Azure Active Directory

A

Cloud-based identity and access management (IAM) solution.
Provides single sign-on and multi-factor authentication to help protect users.
Helps organize computers and users.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Azure Synapse Analytics

A

Analytics Service that brings together integration, enterprise data warehousing, and big data analytics. An evolution of Azure SQL Data Warehouse. Allows you to build and manage a modern DW.

Strengths: Quickly run complex queries across petabytes of data.

Possible to use serverless SQl pool in Synapse, which is adaptive to current workloads and can shrink or grow on command. You can therefore use the pattern that the data takes (Hash, Round Robin, Replicated)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Apache Blob Storage

A

File storage in the cloud and an API that lets you build apps to access data. Unstructured, no restriction in the kinds of data it holds. Have higher latency than memory and local disk and don’t have indexing features.

Frequently used in combination with databases to store non-queryable data.
I.e profile pictures for an app could be stored in blobs.

Every blob lives inside a blob container. If you want to store data without performing analysis on the data, set Hierarchical Namespace to Disabled to set up the storage account. Also good for archive rarely used data or store website assets such as images and media.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Azure Synapse Analytics Serverless SQL Pool

A

A query service over the data in data lake. Benefits: Basic discovery and exploration, logical date warehousing, data transformation.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Apache Spark

A

Processing system for big data workloads. Interface for programming entire clusters with implicit data parallelism and fault tolerance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Azure Cosmos DB

A

Used within web and mobile applications. Good for modeling social interactions. Cosmos DB is globally distributed and a NoSQL database. Organizational entity for your databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Azure Data Lake Storage Gen2

A

Capabilities dedicated to big data analytics. Built on Azure Blob Storage. File system semantics, file-level security, and scale.

Also: low-cost, tiered storage, high availability. If you’re performing analytics on data, set Hierarchical Namespace to Enabled.
Gen2 is with hierarchical namespace, meaning it has a physical folder structure. Gen1 is Blob-based.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Hierarchical namespace

A

A physical folder structure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Azure Databricks

A

Data analytics platform optimized for Azure cloud services. Uses notebooks that run on a Spark engine. Integrated with PowerBI, Tableau, and similar.

Cannot be assigned a system assigned managed identity. Instead, use Secret Scope. Allows Databricks to access a Key-vault. Only users with Contributer-permission, or higher, might activate secret scopes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Azure Data Factory

A

Azure’s cloud ETL service for scale-out serverless data integration and data transformation. You can create and schedule data-driven workflows. Can create pipelines without writing code. Can copy and transform data. Can orchestrate batch data movement and transformations. Only store pipeline-run data for 45 days.

Activities: copy data from source into a sink, perform transformation, or similar.

Linked services: connection tools that ADF uses to connect to services like API, storage accounts, and databases.

Pipelines: Container for activities in ADF. Contains a flow of activities that execute depending on completion status. Typically equipped with a trigger.

Integration Runtimes: the infrastructure used to compute activities, data flow, SSIS package, and data movements. 3 types:
Azure – runs data flow, data movement, and activities.
Self-hosted – runs data movement and activities.
Azure SSIS – SSIS-package execution

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Azure Virtual Network

A

VNet. Enables Azure resources to securely communicate.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Azure Event Hub

A

Real-time data ingestion service. Stream events to build dynamic data pipelines and immediately respond. Can process millions of events per second. Collects events. Accepts only endpoints for ingestion of data. No mechanism for sending data back to publisher. Good for massive scale or for a series of events.

Event Hubs Dedicated is a pricing tier, billed at a fixed monthly price, minimum of 4 hours of usage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Azure Event Grid

A

Build applications with event-based architectures. Good for dealing with discrete events and when there’s a need for the application to work in a publisher/subscriber model and handle event but not data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Azure IoT Hub

A

IoT connector to the cloud. Enables solutions with reliable and secure communication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Azure SQL Database

A

Managed coud database

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Azure Stream Analytics

A

Serverless scalable event processing engine. Can run real-time analytics on multiple streams.

Used togeahter with Event Hub. Event Hubs feeds events into Azure and Stream Analytics processes them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Azure DevOps Git Repository

A

Set of version control tools to manage code. Can help track changes over time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Azure Monitor

A

To keep data for longer than 45 days. Helps maximize availability and performance of applications and services.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Azure Log Analytics

A

To edit and run log queries from data collected by Azure Monitor Logs and interactively analyze their results.

Azure Monitor builds on tp of Log Analytics. Monitor is the marketing name, Log Analytics is the technology that powers it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Microsoft Power BI

A

Business analytics service. Provides interactive visualizations and BI capabilities. Reports and dashboards for end users.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Microsoft Visual Studio

A

Development environment. IDE and Code Editor

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Delta Lake

A

Efficient method of storing data. ACID-compliant and stores data in hacked up parquet. Not readable by humans. Can be used for batch data and streaming data. Delta enforces schemas.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Blobs (3 kinds)

A

Block blobs: blocks of different sizes.

Append blobs: support only appending new data (not updating or deleting existing data). Good for scenarios like storing logs or writing streamed data.

Page blobs: for scenarios involving random-access reads and writes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Container

A

Packages of software that contains everything needed to run in any environment. Virtualize the operating system.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

SQL Pool

A

Traditional Data Warehouse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Dedicated SQL Pool

A

Formerly SQL DW. Refers to enterprise data warehousing features available in Azuer Synapse Analytics.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Spark Pool

A

To run computation jobs, notebooks run on Apache Spark pools, similar to Databricsks.

Allows for usage of Hadoop file formats (Parquet, Avro, ORC)

Only charged when active. Possible to enable autoscaling.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Dynamic Management Views

A

Views that user can query to view database performance. All dynamic views belong to sys schema and are named dm_

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Managed Identity

A

Provide an identity for applications when connecting to resources that support Azure AD authentication.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Shared Access Signature (SAS)

A

A URI that grants restricted access rights to Azure storage resources. Specify period and specify permissions. URI points to one or more storage resources.

Good for untrusted clients.

Service-level: Allow specific resources in account. I.e. allow app to retrieve a list of files or download a file.

Account-level: service-level + additional resources and abilities. I.e. ability to create file systems

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Data Warehouse

A

Data management system. Intended to perform queries and analysis. Contains large amounts of historical data. Makes Data Mining possible (assists businesses in looking for patterns).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Data Masking

A

When users have access to databases but should not access certain columns in data.

Default: Different masking methods for different types. 0 for nuberic, XXXXX for strings.
Credit card: the last 4 digits.
Email: First letter and replace domain with XXXX.com
Random number: Generate a random number.
Custom text: Expose first and last characters and replace other with padded string.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

PolyBase

A

Data virtualization feature for SQL Server. Accesses external data stored in Azure Blob Storage, Hadoop, or Azure Data Lake using T-SQL.

I.e. you can create and external table to query Parquet files stored in Data Lake without importing data to DW

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Blob container

A

Containers are flat, cannot store other containers, only blobs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Round Robin

A

An arrangement of choosing all elements in a group equally in some rational order. A row in a RR table is non-deterministic and can end up in different distributions.

Good candidate for RR: If most columns are null able and no good hash distribution can be achieved. Useful for improving load speed. Possibly for stage tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Hash

A

Table with rows dispersed across multiple distributions based on a hash function applied to a column. Choose “not null” columns when creating hash distributed tables. Improves query performance on large fact tables. Improves speed of Joins.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

Replicated distribution

A

Works when the tables are < 2GB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Distributed Table

A

Appears as a single table but the rows are stored across 60 distributions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

Distribution

A

An Azure SQL database in witch one or more distributed tables are stored

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

External Table

A

A table whose data comes from flat files stored outside of the database. These tables support use cases for both exporting and importing data. Can be created solely for the purpose of writing data out to external files. Only the following statements are allowed on external tables:
CREATE TABLE
DROP TABLE
CREATE STATISTICS
DROP STATISTICS
CREATE VIEW
DROP VIEW

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

Staging table

A

Temporary table containing business data, modified and/or cleaned. Used primarily to stage incremental data from the transactional database. When ETL runs, staging tables are truncated before they are populated with change capture data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

Slowly Changing Dimension (SCD)

A

Dimension that contains relatively static data. Changes slowly but unpredictably. I.e. customers, products, geographic locations etc.

Type 0: Fixed Dimension (no changes)
Type 1: Overwrite-No History (updates record directly)
Type 2: Row Versioning. Supports versioning of dimension members.
Type 3: Previous Value Column. Adds new attribute in the dimension to preserve the old attribute value.
Type 4: History Table (track all changes in sperate table)
Type 5: Hybrid SCD Utilize techniques from Type 4 + 1
Type 6: Hybrid SCD Utilize techniques from Type 4 + 2
Type 7: Hybrid SCD Utilize techniques from Type 4 + 3

43
Q

Surrogate Key

A

Unique identifier. Not derived from application data. Any column or set of columns can be declared the PK. Is a candidate key. Most kommon is incrementing integer.

44
Q

Business Key

A

Same as Natural key.

Unique key formed of attributes that exist and are used int eh external world. I.e. order number, product number, personal number etc.

45
Q

Candidate Key

A

Keys that could be declared as PK

46
Q

Natural Key

A

Unique key formed of attributes that exist and are used in the external world. I.e. order number, product number, personal number.

Also called business key.

47
Q

Audit Column

A

Columns with stamps such as “Last edited” or “Created Date” to track when records were updated or inserted. Don’t track deletion of columns.

48
Q

Heap indexing

A

Heap is a table without a clustered index. Data is stored in heap without specifying the order. Can be used as staging tables.

49
Q

Clustered Column Store

A

Physical Storage for entire table. Combines query results from columnstore and deltastore.

50
Q

Columnstore

A

Standard for storing and querying large data warehousing fact tables by using a columnar data format. Better for analytic work / read efficiency.

51
Q

Deltastore

A

When data is stored temporarily into a clustered index. Deltastore operations are handled behind the scenes.

52
Q

Partitioning

A

A way of splitting numbers into smaller parts to make them easier to work with. Might be faster to partition by day first if portioning is done on a date. With parallel problems, we want to match partitions and keys between input and output. WHERE. clauses will only check partitions with relevant data and you avoid to search through entire table A good choice for partition needs tho have over 1 million rows, there should not be many wheres, and it should not be a date-column.

53
Q

Stored proceedure

A

A set of SQL statements with an assigned name so that it can be reused and shared by multiple programs.

54
Q

Hadoop

A

Open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.

Includes: Avro Parquet, ORC.

All Hadoop-formats are compressed and optimized for storing and querying.

55
Q

Avro

A

Row-oriented remote procedure call and data serialization framework. Created using JSON.
Supports timestamps. Useful to write heavy uses or when aim is to retrieve full rows of data.

56
Q

CSV

A

Comma-separated value file. Delimited text file.

57
Q

Parquet

A

Open-source file format. Can only read needed columns. Column-oriented data storage format. Supported by Synapse, Databricks, ADF.

58
Q

ORC

A

Optimized Row Columnar. Column-stored. Efficient when reading heavy operations or when subsets of columns in data are interesting.

59
Q

TSV

A

Tab-separated values. Storing data in tabular format.

60
Q

TXT

A

File extension for a text

61
Q

GZip

A

Used for file compression and decompression. Reduces the size fo the files. Compressed files have the fastest load.

62
Q

JSON

A

Open standard file format. Text-based. Often used for serializing structured data and exchanging it over a network, typically between server and web.

63
Q

Binary

A

The service does not parse file content but treat it as-is. Can use in Copy activity, GetMetadata activity, or Delete activity. When using Binary in copy activity, you can only copy from Binary to Binary.

64
Q

Hot access tier

A

Storing data that is accessed frequently. Staged data. Data in active use.

65
Q

Cold access tier

A

Storing data that is infrequently accessed and stored for at least 30 days. short-term backup and disaster recovery. Older data expected to be accessed immediatley

66
Q

Archive access tier

A

Storing data that is rarely accessed. Stored at least 180 days. Long-term backup. Raw data that must be preserved. Compliance and archival data.

67
Q

CI/CD

A

Continuous Integration and -Delivery means moving Data Factory pipelines from one environment to another.

68
Q

Reference data

A

Also known as a lookup table. A finite data set that is static or slowly changing in nature, used to perform lookup.

69
Q

POSIX

A

Portable Operating System Interface.

Family of standards specified by IEEE. Purpose to maintain compatibility between operating systems.

70
Q

FlattenHierarchy

A

copy activity copy behavior. All files from source folder are in first level of the target folder. Target files have autogenerated names. Hierarchy is no preserved.

71
Q

MergeFiles

A

Merge all files from source folder to one file. If file name is specified, the merged file name is that name. Otherwise, autogenerated file name.

72
Q

PreserveHirearchy

A

Default setting. Preserves the hierarchy in the target folder.

73
Q

Geo-Redundant Storage (GRS)

A

Copies data synchronously three times within a single location in primary region using LRS. Then copies data asynchronously to a single physical location in secondary region. Within secondary region, data is copied synchronously three times with LRS.

74
Q

Geo-Zone-Redundant Storage (GZRS)

A

Copies data synchronously across 3 Azure availability zones in primary region using ZRS. Then copies data asynchronously to a single physical location in secondary region. In secondary region data is copied synchronously 3 times using LRS. Main difference from GRS is how data is replicated in primary region.

75
Q

Read-access GRS

A

Azure Files do not support RA-GRS. When you enable read access to the secondary region, but data can then be read at all times even if the primary region becomes unavailable.

76
Q

Zone-Redundant Storage (ZRS)

A

Copies data synchronously across 3 Azure availability zones in the primary region. For applications requiring high availability. Recommended to use ZRS in primary region and then replicate to a second region. Recommended for Azure data Lake Storage Gen2 workloads.

77
Q

Locally-Redundant Storage (LRS)

A

Copies data synchronously three times within a single physical location in primary region. Least expensive replication option. Not recommended for application requiring high availability or durability.

78
Q

Tumbling Window trigger

A

Fires at periodic time interval from specified start time. Retains state. Can only reference a singluar pipeline. Can use the delay parameter so that late-arriving data can be included in the load for the time when the data should have arrived. Backfill. Delay does not alter startTime.

79
Q

Schedule trigger

A

Runs on wall-clock schedule. Cannot backfill data.

80
Q

On-Demand Trigger

A

Manual execution

81
Q

Event-based trigger

A

Responds to an event

82
Q

Sliding Window

A

Output events only for points in time when the content of the window actually changes. Reports when things change in window. Events can belong to more than one window.

83
Q

Hopping Window

A

Hop forward in time by a fixed period. Think of them as tumbling windows that can overlap. Events can belong to more than one Hopping window result set.

84
Q

Tumbling Window

A

Used to segment a data stream into distinct time segments and perform a function against them. They repeat, do not overlap, and cannot belong to more than one tumbling window.

85
Q

Session Window

A

Group events that arrive at similar times, filter out periods of time where there is no data.

86
Q

Snapshot Window

A

Group events that have same timestamp.

87
Q

Deserialize

A

The reverse process where the byte stream is used to recreate the actual Java object in memory.

88
Q

Integration runtime (IR)

A

The compute infrastructure used by ADF and Azure Synapse pipelines to provide data integration capabilities across the environments: data flow, data movement, activity dispatch, SSIS package execution.

89
Q

Self-hosted IR

A

Can run copy activity between cloud data stores and a data store in private network. Use to support data stores that requires bring-your-own driver such as SAP Hana, MySQL, ets. Is to be used On-premises.

90
Q

Azure-SSIS IR

A

Are not supported in Synapse pipelines. To lift and shift existing SSIS workload.

91
Q

SSIS

A

SQL Server Integration Services. Platform for building enterprise-level data integration and data transformation solutions. Use to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.

92
Q

Databrics Cluster

A

Standard Clusters: Recommended for single user. Can run workloads developed in any language. Require at least one Spark worker node in addition to driver node. Good for large amounts of data.

High Concurrency clusters: A managed cloud resource. Provides fine-grained sharing for maximum resource utilization and minimum query latencies. Can run SQL, Python, and R. Cannot run Scala. Recommended to enable autoscaling.

Single Node Clusters: Has no workers, run Spark nodes on the driver node. recommended for small amounts of data.

Clusters automatically delete themselves after 30 days of inactivity. If you don’t want it to be deleted you can pin it and it won’t be deleted despite inactivity. Up to 70 clusters can be pinned.

93
Q

Azure stream Analytics Functions

A

Aggregate Functions: Perform a calculation on a set of calues and return a single calue. COUNT() and similar.

GeoSpatial Functions: Enable real-time analytics.

Windowing Functions: Subsets of events, grouped by time.

94
Q

HTTPS

A

Extension of HTTP. Used for secure communication over a computer network.

95
Q

DTU

A

Database Transaction Unit. Combines measure of compute, storage and IO resources.

96
Q

vCores

A

Virtual cores. Give greater control over compute and storage resources.

97
Q

SWL elastic pools

A

Relate to eDTU. Enable to buy a set of compute and storage resources that are shared among all databases in the pool. Cost effective service. Resource allocation service used to scale and manage the performance and cost of a group of Azure SQL databases. Ideal when you have several SQL databases with low average utilization, but infrequent spikes.

98
Q

Collation

A

Refers to rules that sort and compare data. Helps you define sorting rules when case sensitivity, accent marks, and other characteristics are important.

99
Q

SaaS (Software as a Service)

A

Software licensing and delivery model where software is licensed on subscription basis and centrally hosed.

100
Q

PolyBase

A

Fastest and most scalable way to load data. When using PolyBase with T’SQL, must define 3 objects in order for it to work:
External data source (to connect with source)
External file format (info about how data is structured)
External table (defines the structure)

101
Q

OPENROWSET

A

A function that allows user to bulk load from storage account directly. Reads directly from storage account using AD authentication and SAS tokens. Can read from various formats or may sources at once.

102
Q

Azure Key Vault

A

Stores keys, certificates, and secrets for Azure services. Only way to authenticate the encrypted data is with an Azure active directory, RBAC, or access policy.

103
Q

Star Schema

A

Design pattern of using fact and dimension tables when designing a data warehouse. Events are stored in fact table with keys that refer to dimesnion tables. No need to normalize database, easy query ogic to write. Queries may be faster to compute.

104
Q

Role Based Access Control (RBAC)

A

To manage access in Azure Cloud Services. Permission is given out by assigning roles. Four levels:
Management group: Admins can also restrict access to resources, Negative access overrules positive.
Subscription: a collection of all resource groups, here billing is done.
Resource group: a container for several resources.
Resource: lowest level, connected to a single service.