Terms Flashcards
Azure Active Directory
Cloud-based identity and access management (IAM) solution.
Provides single sign-on and multi-factor authentication to help protect users.
Helps organize computers and users.
Azure Synapse Analytics
Analytics Service that brings together integration, enterprise data warehousing, and big data analytics. An evolution of Azure SQL Data Warehouse. Allows you to build and manage a modern DW.
Strengths: Quickly run complex queries across petabytes of data.
It is possible to use a serverless SQL pool in Synapse, which is adaptive to current workloads and can shrink or grow on demand. You can also choose the distribution pattern that fits the data (Hash, Round Robin, Replicated).
Azure Blob Storage
File storage in the cloud and an API that lets you build apps to access data. Unstructured, with no restriction on the kinds of data it holds. Blobs have higher latency than memory and local disk and lack indexing features.
Frequently used in combination with databases to store non-queryable data.
E.g. profile pictures for an app could be stored in blobs.
Every blob lives inside a blob container. If you want to store data without performing analysis on it, set Hierarchical Namespace to Disabled when setting up the storage account. Also good for archiving rarely used data or storing website assets such as images and media.
Azure Synapse Analytics Serverless SQL Pool
A query service over the data in the data lake. Benefits: basic discovery and exploration, logical data warehousing, data transformation.
Apache Spark
Processing system for big data workloads. Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Azure Cosmos DB
Used within web and mobile applications. Good for modeling social interactions. Cosmos DB is a globally distributed NoSQL database. The Cosmos DB account is the organizational entity for your databases.
Azure Data Lake Storage Gen2
Capabilities dedicated to big data analytics. Built on Azure Blob Storage. File system semantics, file-level security, and scale.
Also: low-cost, tiered storage, high availability. If you’re performing analytics on data, set Hierarchical Namespace to Enabled.
Gen2 adds a hierarchical namespace, meaning it has a real folder structure. Gen1 was a separate service; Gen2 is built on Blob Storage.
Hierarchical namespace
A physical folder structure
Azure Databricks
Data analytics platform optimized for Azure cloud services. Uses notebooks that run on a Spark engine. Integrated with PowerBI, Tableau, and similar.
Cannot be assigned a system-assigned managed identity. Instead, use a Secret Scope, which allows Databricks to access a Key Vault. Only users with Contributor permission, or higher, can create secret scopes.
Azure Data Factory
Azure’s cloud ETL service for scale-out serverless data integration and data transformation. You can create and schedule data-driven workflows, build pipelines without writing code, copy and transform data, and orchestrate batch data movement and transformations. Pipeline-run data is stored for only 45 days.
Activities: copy data from a source into a sink, perform a transformation, and similar.
Linked services: connection tools that ADF uses to connect to services such as APIs, storage accounts, and databases.
Pipelines: Container for activities in ADF. Contains a flow of activities that execute depending on completion status. Typically equipped with a trigger.
Integration Runtimes: the infrastructure used to compute activities, data flow, SSIS package, and data movements. 3 types:
Azure – runs data flow, data movement, and activities.
Self-hosted – runs data movement and activities.
Azure SSIS – SSIS-package execution
Azure Virtual Network
VNet. Enables Azure resources to securely communicate.
Azure Event Hub
Real-time data ingestion service. Stream events to build dynamic data pipelines and respond immediately. Can process millions of events per second. Collects events and exposes endpoints for ingestion only; there is no mechanism for sending data back to the publisher. Good for massive scale or for a series of events.
Event Hubs Dedicated is a pricing tier, billed at a fixed monthly price, minimum of 4 hours of usage.
Azure Event Grid
Build applications with event-based architectures. Good for dealing with discrete events, when the application needs to work in a publisher/subscriber model, and when it handles events but not the data itself.
Azure IoT Hub
IoT connector to the cloud. Enables solutions with reliable and secure communication.
Azure SQL Database
Managed cloud database.
Azure Stream Analytics
Serverless scalable event processing engine. Can run real-time analytics on multiple streams.
Used together with Event Hub: Event Hubs feeds events into Azure, and Stream Analytics processes them.
Azure DevOps Git Repository
Set of version control tools to manage code. Can help track changes over time
Azure Monitor
Use to keep pipeline-run data for longer than 45 days. Helps maximize availability and performance of applications and services.
Azure Log Analytics
To edit and run log queries from data collected by Azure Monitor Logs and interactively analyze their results.
Azure Monitor builds on top of Log Analytics. Monitor is the marketing name; Log Analytics is the technology that powers it.
Microsoft Power BI
Business analytics service. Provides interactive visualizations and BI capabilities. Reports and dashboards for end users.
Microsoft Visual Studio
Development environment. IDE and Code Editor
Delta Lake
Efficient method of storing data. ACID-compliant; stores data in Parquet files together with a transaction log, so the files are not human-readable. Can be used for both batch data and streaming data. Delta enforces schemas.
Blobs (3 kinds)
Block blobs: composed of blocks that can be of different sizes.
Append blobs: support only appending new data (not updating or deleting existing data). Good for scenarios like storing logs or writing streamed data.
Page blobs: for scenarios involving random-access reads and writes.
Container
Packages of software that contain everything needed to run in any environment. Containers virtualize the operating system.
SQL Pool
Traditional Data Warehouse
Dedicated SQL Pool
Formerly SQL DW. Refers to the enterprise data warehousing features available in Azure Synapse Analytics.
Spark Pool
To run computation jobs, notebooks run on Apache Spark pools, similar to Databricks.
Allows for usage of Hadoop file formats (Parquet, Avro, ORC)
Only charged when active. Possible to enable autoscaling.
Dynamic Management Views
Views that users can query to monitor database performance. All dynamic management views belong to the sys schema and have names beginning with dm_.
Managed Identity
Provide an identity for applications when connecting to resources that support Azure AD authentication.
Shared Access Signature (SAS)
A URI that grants restricted access rights to Azure storage resources. You specify the validity period and the permissions. The URI points to one or more storage resources.
Good for untrusted clients.
Service-level: allows access to specific resources in the account, e.g. letting an app retrieve a list of files or download a file.
Account-level: everything service-level allows, plus additional resources and abilities, e.g. the ability to create file systems.
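The heart of a SAS is a signature computed over the granted permissions and expiry using the account key. A minimal sketch of that idea in Python — make_sas_token is a hypothetical helper, and the real SAS string-to-sign format defined by Azure is considerably more involved:

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

def make_sas_token(resource_path: str, permissions: str, expiry: str, account_key: str) -> str:
    """Sign a string-to-sign with the account key (HMAC-SHA256), as a SAS does conceptually."""
    string_to_sign = "\n".join([resource_path, permissions, expiry])
    key = base64.b64decode(account_key)
    signature = base64.b64encode(
        hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    ).decode("utf-8")
    # Query-string style token, appended to the resource URI by the caller
    return urlencode({"sp": permissions, "se": expiry, "sig": signature})
```

Because the signature covers the permissions and expiry, a client cannot widen its own access without invalidating the token — which is why a SAS is suitable for untrusted clients.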
Data Warehouse
Data management system. Intended to perform queries and analysis. Contains large amounts of historical data. Makes Data Mining possible (assists businesses in looking for patterns).
Data Masking
Used when users have access to a database but should not see certain columns of data.
Default: different masking methods for different types, e.g. 0 for numeric columns and XXXX for strings.
Credit card: exposes only the last 4 digits.
Email: exposes the first letter and replaces the domain with XXXX.com.
Random number: generates a random number.
Custom text: exposes the first and last characters and replaces the rest with a padded string.
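These masking rules can be sketched as plain functions. A minimal illustration — the exact output formats produced by SQL dynamic data masking may differ:

```python
def mask_default(value):
    """Default masking: 0 for numeric types, XXXX for strings."""
    return 0 if isinstance(value, (int, float)) else "XXXX"

def mask_credit_card(number: str) -> str:
    """Expose only the last 4 digits."""
    return "XXXX-XXXX-XXXX-" + number[-4:]

def mask_email(address: str) -> str:
    """Expose the first letter and replace the domain with XXXX.com."""
    return address[0] + "XXX@XXXX.com"

def mask_custom(value: str, prefix: int = 1, suffix: int = 1, padding: str = "XXX") -> str:
    """Expose the first/last characters and replace the middle with a padded string."""
    return value[:prefix] + padding + value[-suffix:]
```

The key point: masking happens at query time for non-privileged users, while the underlying stored data is unchanged.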
PolyBase
Data virtualization feature for SQL Server. Accesses external data stored in Azure Blob Storage, Hadoop, or Azure Data Lake using T-SQL.
E.g. you can create an external table to query Parquet files stored in the Data Lake without importing the data into the DW.
Blob container
Containers are flat: they cannot store other containers, only blobs.
Round Robin
An arrangement for choosing all elements in a group equally in some rational order. The placement of a row in an RR table is non-deterministic; identical rows can end up in different distributions.
Good candidates for RR: tables where most columns are nullable and no good hash distribution can be achieved. Useful for improving load speed, e.g. for staging tables.
Hash
Table with rows dispersed across multiple distributions based on a hash function applied to a column. Choose "not null" columns when creating hash-distributed tables. Improves query performance on large fact tables and speeds up joins.
Replicated distribution
Works best when the tables are small (< 2 GB), since a full copy of the table is available on each compute node.
Distributed Table
Appears as a single table but the rows are stored across 60 distributions.
Distribution
An Azure SQL database in which one or more distributed tables are stored.
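The distribution choices above can be sketched in Python. The modulo-of-a-digest below is an illustrative stand-in for the real hash function, and 60 matches the fixed number of distributions:

```python
import hashlib
from itertools import cycle

DISTRIBUTIONS = 60  # a distributed table's rows are spread across 60 distributions

def hash_distribution(key) -> int:
    """Hash-distributed: the same key always lands in the same distribution,
    so joins and aggregations on that key avoid data movement."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % DISTRIBUTIONS

def round_robin_assigner():
    """Round-robin: rows are dealt out evenly; placement is not key-dependent."""
    return cycle(range(DISTRIBUTIONS))
```

This is why hash distribution suits large fact tables joined on the hash column, while round robin only guarantees an even spread of rows.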
External Table
A table whose data comes from flat files stored outside of the database. These tables support use cases for both exporting and importing data. Can be created solely for the purpose of writing data out to external files. Only the following statements are allowed on external tables:
CREATE TABLE
DROP TABLE
CREATE STATISTICS
DROP STATISTICS
CREATE VIEW
DROP VIEW
Staging table
Temporary table containing business data, modified and/or cleaned. Used primarily to stage incremental data from the transactional database. When ETL runs, staging tables are truncated before they are populated with change capture data.
Slowly Changing Dimension (SCD)
Dimension that contains relatively static data. Changes slowly but unpredictably. E.g. customers, products, geographic locations.
Type 0: Fixed Dimension (no changes)
Type 1: Overwrite-No History (updates record directly)
Type 2: Row Versioning. Supports versioning of dimension members.
Type 3: Previous Value Column. Adds new attribute in the dimension to preserve the old attribute value.
Type 4: History Table (track all changes in a separate table)
Type 5: Hybrid SCD Utilize techniques from Type 4 + 1
Type 6: Hybrid SCD Utilize techniques from Type 4 + 2
Type 7: Hybrid SCD Utilize techniques from Type 4 + 3
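Type 2 row versioning can be sketched as: expire the current row and append a new version instead of overwriting. scd2_update and the column names are illustrative; real implementations usually also carry a surrogate key per version:

```python
from datetime import date

def scd2_update(history, business_key, new_attrs, change_date=None):
    """SCD Type 2: close the current version and append a new one on change."""
    change_date = change_date or date.today().isoformat()
    for row in history:
        if row["key"] == business_key and row["current"]:
            if row["attrs"] == new_attrs:
                return history  # nothing changed, keep the current version
            row["current"] = False          # expire the old version
            row["end_date"] = change_date
    history.append({"key": business_key, "attrs": new_attrs,
                    "start_date": change_date, "end_date": None, "current": True})
    return history
```

Contrast with Type 1, which would simply overwrite attrs in place and lose the history.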
Surrogate Key
Unique identifier that is not derived from application data. Any column or set of columns can be declared the PK; a surrogate key is a candidate key. Most common is an incrementing integer.
Business Key
Same as Natural key.
Unique key formed of attributes that exist and are used in the external world, e.g. order number, product number, personal number.
Candidate Key
Keys that could be declared as PK
Natural Key
Unique key formed of attributes that exist and are used in the external world, e.g. order number, product number, personal number.
Also called business key.
Audit Column
Columns with stamps such as "Last Edited" or "Created Date" to track when records were updated or inserted. They do not track deletion of records.
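As a sketch, an application-side version of these stamps might look like this (the helper and column names are illustrative):

```python
from datetime import datetime, timezone

def with_audit_columns(record, existing=None):
    """Stamp 'Created Date' on first insert and 'Last Edited' on every write."""
    now = datetime.now(timezone.utc).isoformat()
    out = dict(record)
    # Preserve the original creation stamp on updates; set it on inserts
    out["Created Date"] = existing["Created Date"] if existing else now
    out["Last Edited"] = now
    return out
```

In a warehouse these stamps are typically set by the ETL process or by database defaults/triggers rather than application code.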
Heap indexing
A heap is a table without a clustered index. Data is stored in the heap without a specified order. Heaps can be used as staging tables.
Clustered Column Store
Physical Storage for entire table. Combines query results from columnstore and deltastore.
Columnstore
Standard for storing and querying large data warehousing fact tables by using a columnar data format. Better for analytic work / read efficiency.
Deltastore
Where incoming rows are stored temporarily in a clustered index until enough accumulate to be compressed into the columnstore. Deltastore operations are handled behind the scenes.
Partitioning
A way of splitting data into smaller parts to make it easier to work with. If partitioning is done on a date, it might be fastest to partition by day. With parallel workloads, we want to match partitions and keys between input and output. WHERE clauses then only check partitions with relevant data, so you avoid scanning the entire table. A good partitioning candidate should have over 1 million rows per partition, and queries should filter on the partition column; avoid creating too many partitions.
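Partition elimination can be illustrated with a toy layout where data is partitioned by day; a query that filters on the partition key touches only one partition (all names here are illustrative):

```python
# Partitioned-by-day layout: partition key -> rows in that partition
sales = {
    "2024-01-01": [{"id": 1, "amount": 10}],
    "2024-01-02": [{"id": 2, "amount": 25}],
}

def query_with_pruning(partitions, where_date):
    """Only the partition matching the WHERE predicate is scanned;
    all other partitions are skipped entirely."""
    return partitions.get(where_date, [])
```

A query without a predicate on the partition key gets no such benefit and must scan every partition.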
Stored procedure
A set of SQL statements with an assigned name so that it can be reused and shared by multiple programs.
Hadoop
Open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
Associated file formats include Avro, Parquet, and ORC.
All Hadoop formats are compressed and optimized for storage and querying.
Avro
Row-oriented remote procedure call and data serialization framework. Schemas are defined using JSON.
Supports timestamps. Useful for write-heavy use cases or when the aim is to retrieve full rows of data.
CSV
Comma-separated value file. Delimited text file.
Parquet
Open-source file format. Can only read needed columns. Column-oriented data storage format. Supported by Synapse, Databricks, ADF.
ORC
Optimized Row Columnar. Column-stored. Efficient for read-heavy operations or when only subsets of the columns are of interest.
TSV
Tab-separated values. Storing data in tabular format.
TXT
File extension for a plain text file.
GZip
Used for file compression and decompression. Reduces the size of files. Compressed files have the fastest load.
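A round trip with Python's standard gzip module shows the effect on repetitive text such as log or CSV data:

```python
import gzip

def compress(text):
    """Compress text to gzip bytes."""
    return gzip.compress(text.encode("utf-8"))

def decompress(blob):
    """Restore the original text from gzip bytes."""
    return gzip.decompress(blob).decode("utf-8")

payload = "timestamp,level,message\n" + "2024-01-01,INFO,ok\n" * 1000
packed = compress(payload)
# Highly repetitive data compresses to a small fraction of its original size.
```

Smaller files mean less data moved over the network, which is why compressed files load fastest.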
JSON
Open standard file format. Text-based. Often used for serializing structured data and exchanging it over a network, typically between server and web.
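Serializing an object to JSON text for the network, and deserializing it back into an in-memory object, with Python's standard library:

```python
import json

order = {"id": 42, "items": ["disk", "ram"], "total": 99.5}

wire = json.dumps(order)     # serialize: in-memory object -> text for the network
restored = json.loads(wire)  # deserialize: text -> object in memory
```

The round trip preserves the structure, which is what makes JSON a convenient exchange format between server and web client.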
Binary
The service does not parse file content but treats it as-is. Can be used in the Copy, GetMetadata, or Delete activities. When using Binary in a Copy activity, you can only copy from Binary to Binary.
Hot access tier
Storing data that is accessed frequently. Staged data. Data in active use.
Cool access tier
Storing data that is infrequently accessed and stored for at least 30 days. Short-term backup and disaster recovery. Older data that is not used often but is expected to be available immediately when accessed.
Archive access tier
Storing data that is rarely accessed. Stored at least 180 days. Long-term backup. Raw data that must be preserved. Compliance and archival data.
CI/CD
Continuous Integration and Continuous Delivery. In the ADF context, it means moving Data Factory pipelines from one environment to another.
Reference data
Also known as a lookup table. A finite data set that is static or slowly changing in nature, used to perform lookups.
POSIX
Portable Operating System Interface.
Family of standards specified by the IEEE. Its purpose is to maintain compatibility between operating systems.
FlattenHierarchy
A Copy activity copy behavior. All files from the source folder are placed in the first level of the target folder, with autogenerated names. The hierarchy is not preserved.
MergeFiles
Merges all files from the source folder into one file. If a file name is specified, the merged file gets that name; otherwise the name is autogenerated.
PreserveHierarchy
Default setting. Preserves the hierarchy in the target folder.
Geo-Redundant Storage (GRS)
Copies data synchronously three times within a single location in primary region using LRS. Then copies data asynchronously to a single physical location in secondary region. Within secondary region, data is copied synchronously three times with LRS.
Geo-Zone-Redundant Storage (GZRS)
Copies data synchronously across 3 Azure availability zones in primary region using ZRS. Then copies data asynchronously to a single physical location in secondary region. In secondary region data is copied synchronously 3 times using LRS. Main difference from GRS is how data is replicated in primary region.
Read-access GRS
Enables read access to the secondary region, so data can be read at all times, even if the primary region becomes unavailable. Note that Azure Files does not support RA-GRS.
Zone-Redundant Storage (ZRS)
Copies data synchronously across 3 Azure availability zones in the primary region. For applications requiring high availability. Recommended to use ZRS in the primary region and then replicate to a secondary region. Recommended for Azure Data Lake Storage Gen2 workloads.
Locally-Redundant Storage (LRS)
Copies data synchronously three times within a single physical location in the primary region. Least expensive replication option. Not recommended for applications requiring high availability or durability.
Tumbling Window trigger
Fires at a periodic time interval from a specified start time. Retains state. Can only reference a single pipeline. Can use the delay parameter so that late-arriving data is included in the load for the time when the data should have arrived. Supports backfill. Delay does not alter startTime.
Schedule trigger
Runs on a wall-clock schedule. Cannot backfill data.
On-Demand Trigger
Manual execution
Event-based trigger
Responds to an event
Sliding Window
Outputs events only for points in time when the content of the window actually changes, i.e. it reports when things change in the window. Events can belong to more than one window.
Hopping Window
Hop forward in time by a fixed period. Think of them as tumbling windows that can overlap. Events can belong to more than one Hopping window result set.
Tumbling Window
Used to segment a data stream into distinct time segments and perform a function against them. They repeat, do not overlap, and cannot belong to more than one tumbling window.
Session Window
Group events that arrive at similar times, filter out periods of time where there is no data.
Snapshot Window
Group events that have same timestamp.
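The difference between tumbling and hopping windows can be sketched with integer event timestamps — a simplification of the actual Stream Analytics windowing functions:

```python
def tumbling_windows(events, size):
    """Tumbling: fixed, non-overlapping windows; each event belongs to exactly one."""
    out = {}
    for t in events:
        out.setdefault((t // size) * size, []).append(t)
    return out

def hopping_windows(events, size, hop):
    """Hopping: windows advance by `hop` and can overlap (size > hop),
    so one event may land in several windows."""
    out = {}
    for t in events:
        start = (t // hop) * hop           # latest window start at or before t
        while start > t - size:            # every window [start, start+size) covering t
            if start >= 0:
                out.setdefault(start, []).append(t)
            start -= hop
    return out
```

With size 10 and hop 5, event 5 lands in the windows starting at 0 and at 5 — overlap that a tumbling window never produces.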
Deserialize
The reverse of serialization: the byte stream is used to recreate the actual object (e.g. a Java object) in memory.
Integration runtime (IR)
The compute infrastructure used by ADF and Azure Synapse pipelines to provide data integration capabilities across the environments: data flow, data movement, activity dispatch, SSIS package execution.
Self-hosted IR
Can run Copy activities between cloud data stores and a data store in a private network. Used to support data stores that require a bring-your-own driver, such as SAP HANA, MySQL, etc. Intended for on-premises use.
Azure-SSIS IR
Not supported in Synapse pipelines. Used to lift and shift existing SSIS workloads.
SSIS
SQL Server Integration Services. Platform for building enterprise-level data integration and data transformation solutions. Use to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.
Databricks Clusters
Standard clusters: recommended for a single user. Can run workloads developed in any language. Require at least one Spark worker node in addition to the driver node. Good for large amounts of data.
High Concurrency clusters: a managed cloud resource. Provides fine-grained sharing for maximum resource utilization and minimum query latencies. Can run SQL, Python, and R, but not Scala. Autoscaling is recommended.
Single Node clusters: have no workers and run Spark jobs on the driver node. Recommended for small amounts of data.
Clusters automatically delete themselves after 30 days of inactivity. If you don’t want a cluster to be deleted, you can pin it and it won’t be deleted despite inactivity. Up to 70 clusters can be pinned.
Azure Stream Analytics Functions
Aggregate functions: perform a calculation on a set of values and return a single value, e.g. COUNT().
GeoSpatial Functions: Enable real-time analytics.
Windowing Functions: Subsets of events, grouped by time.
HTTPS
Extension of HTTP. Used for secure communication over a computer network.
DTU
Database Transaction Unit. A blended measure of compute, storage, and IO resources.
vCores
Virtual cores. Give greater control over compute and storage resources.
SQL elastic pools
Related to eDTUs. Let you buy a set of compute and storage resources that are shared among all databases in the pool. A cost-effective resource allocation service used to scale and manage the performance and cost of a group of Azure SQL databases. Ideal when you have several SQL databases with low average utilization but infrequent spikes.
Collation
Refers to rules that sort and compare data. Helps you define sorting rules when case sensitivity, accent marks, and other characteristics are important.
SaaS (Software as a Service)
Software licensing and delivery model where software is licensed on a subscription basis and centrally hosted.
PolyBase
Fastest and most scalable way to load data. When using PolyBase with T-SQL, you must define 3 objects for it to work:
External data source (to connect with source)
External file format (info about how data is structured)
External table (defines the structure)
OPENROWSET
A function that allows the user to bulk load from a storage account directly. Reads directly from the storage account using AD authentication or SAS tokens. Can read various formats or many sources at once.
Azure Key Vault
Stores keys, certificates, and secrets for Azure services. The only way to access the encrypted data is with an Azure Active Directory identity, RBAC, or an access policy.
Star Schema
Design pattern of using fact and dimension tables when designing a data warehouse. Events are stored in the fact table with keys that refer to dimension tables. No need to normalize the database, query logic is easy to write, and queries may be faster to compute.
Role Based Access Control (RBAC)
To manage access in Azure Cloud Services. Permission is given out by assigning roles. Four levels:
Management group: the highest level; admins can restrict access to resources, and negative (deny) access overrules positive.
Subscription: a collection of resource groups; this is where billing is done.
Resource group: a container for several resources.
Resource: lowest level, connected to a single service.