Azure Data Engineering Certification Flashcards

1
Q

resource group

A

A logical grouping of the services for a project or app. Used to manage pricing, budgets, permissions, and policies.

2
Q

Performance (Standard/Premium)

A

Premium: uses solid-state drives; more expensive, better performance. The choice changes the available access tier and replication options.

3
Q

4 Services Offered in Azure Storage

A

Blob: any type of unstructured data

Table: NoSQL non-relational data tables

File: shared file storage, similar to OneDrive/Google Drive; can be attached to multiple VMs for read/write access

Queue: message storage

4
Q

Attributes of Azure Storage (CAASS)

A

Cost Effective

Available & Durable: redundancy & replication of data

Accessible: REST APIs, SDKs (software development kits), Azure CLI (command-line interface), PowerShell, Storage Explorer, AzCopy

Security: private endpoints, user access, HTTPS, virtual networks, encrypted data

Scalable

5
Q

5 APIs of Cosmos DB

(Maggie Crosses The Street w/ Greg)

(details)

A

SQL: JSON format

Table: key-value pairs (think dictionary in Python)

MongoDB: JSON format (document storage)

Cassandra: wide-column data

Gremlin: graph data

6
Q

Account Selection

For Standard performance, what account types can you select and what replication options are offered?

For Premium performance, what account types can you select and what replication options are offered?

A

Standard

General Purpose v1: LRS, GRS, RA-GRS

General Purpose v2: LRS, GRS, RA-GRS, ZRS, GZRS, RA-GZRS

BlobStorage: LRS, GRS, RA-GRS

Premium

General Purpose v1: LRS

General Purpose v2: LRS

BlockBlobStorage: LRS, ZRS

FileStorage: LRS, ZRS
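
The answer above is effectively a lookup table. A minimal Python sketch of it, using only the options listed on this card; the names `REPLICATION_OPTIONS` and `replication_options` are illustrative and not part of any Azure SDK.

```python
# Replication options per (performance, account type), copied from this card.
REPLICATION_OPTIONS = {
    ("Standard", "General Purpose v1"): ["LRS", "GRS", "RA-GRS"],
    ("Standard", "General Purpose v2"): ["LRS", "GRS", "RA-GRS", "ZRS", "GZRS", "RA-GZRS"],
    ("Standard", "BlobStorage"): ["LRS", "GRS", "RA-GRS"],
    ("Premium", "General Purpose v1"): ["LRS"],
    ("Premium", "General Purpose v2"): ["LRS"],
    ("Premium", "BlockBlobStorage"): ["LRS", "ZRS"],
    ("Premium", "FileStorage"): ["LRS", "ZRS"],
}


def replication_options(performance, account_type):
    """Return the listed replication options, or [] if the combination is not offered."""
    return REPLICATION_OPTIONS.get((performance, account_type), [])


print(replication_options("Premium", "BlockBlobStorage"))  # ['LRS', 'ZRS']
print(replication_options("Premium", "BlobStorage"))       # [] (Standard only)
```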

7
Q

What functions do you get with the different storage types (5)?

Differences with Premium performance?

A

General Purpose v1

  • supports all storage types up to 100 terabytes
  • supports all blob storage types: Block, Append, Page, Hierarchical
  • used for VMs & VN (virtual networks) still on classic deployment
  • Premium: only page blob supported

General Purpose v2

  • supports all storage types up to 100 terabytes
  • supports all blob storage types: Block, Append, Page, Hierarchical
  • supports blob access tiers (hot or cool)
  • Premium: only page blob supported

BlobStorage

  • only available in Standard performance
  • supports only block & append blobs, no other storage types supported

BlockBlobStorage

  • only available in Premium performance
  • supports only block & append blobs, no other storage types supported
  • does not support blob access tiers (hot or cool)
  • designed for high-performance, low-latency, interactive workloads and mapping apps (analytics/data trans., e-commerce, quick display)

FileStorage

  • only available in Premium performance
  • higher performance and lower latency compared to general purpose
  • IOPS bursting: 3x input/output operations per second
  • billed based on provisioned storage up to 100 terabytes
8
Q

Setting up an Account: What are the different Networking options (3)?

A

Public (all networks): accessible from all networks, including the internet

Public (selected networks): access is limited to the networks you select, which prevents general internet access

Private Endpoint: secured access on private virtual networks using a private IP address. Connects to on-premises or ExpressRoute connections, placing Azure services inside your virtual network. Typically needs a private DNS zone so the account name resolves to the private IP address.

9
Q

What is Blob Soft Delete?

A

A built-in recycle bin for your blobs: deleted data is kept for a retention period so it can be recovered after accidental deletion.

10
Q

3 Blob Access Tiers & Definitions

A

Hot: lowest access cost, highest storage cost. Designed for current, frequently used data.

Cool: higher access cost, lower storage cost. Designed for older data that is not used frequently.

Archive: highest access cost, lowest storage cost. Data is offline and can take hours to access. Designed for historical, very old data.

11
Q

Locally Redundant Storage

A

Data is replicated within a single zone in one region, across different hardware racks (also called nodes).

If the zone goes down, data is lost. This is the default replication option.

12
Q

Zone Redundant Storage

A

Data copied across availability zones

If a region goes down, data is lost

13
Q

Geo Redundant Storage

A

Data is copied across regions to prevent loss of data in the event of a natural disaster. Generally done within the same country.

Data in secondary region is not available to applications without a failover initiated by MS. If region A goes down MS will initiate a failover and then your data will be available from region B.

14
Q

Geo-Zone Redundant Storage

A

Copied across availability zones within 1st region. Copied within an availability zone in the 2nd region.

Data in secondary region is not available to applications without a failover initiated by MS. If region A goes down MS will initiate a failover and then your data will be available from region B.

High availability and disaster recovery.

15
Q

Read-Access Geo Redundant Storage

A

Data is copied across regions to prevent loss of data in the event of a natural disaster. Generally done within the same country.

Without a failover, data is accessible to the individuals closest to each region. Data is always available for your applications.

High availability, disaster recovery, & immediate access in the event of a natural disaster.

16
Q

Read-Access Geo-Zone Redundant Storage

A

Copied across availability zones within the 1st region. Copied within an availability zone in the 2nd region. Without a failover, data is accessible to the individuals closest to each region.

Data is always available for your applications.
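
The six replication options on the cards above (LRS, ZRS, GRS, GZRS, RA-GRS, RA-GZRS) differ on three yes/no questions. A rough Python sketch that follows the simplifications used on these cards; the field names and the `copy_survives` helper are made up for illustration.

```python
# One row per replication option, following the descriptions on the cards.
REDUNDANCY = {
    "LRS":     {"multi_zone": False, "multi_region": False, "readable_secondary": False},
    "ZRS":     {"multi_zone": True,  "multi_region": False, "readable_secondary": False},
    "GRS":     {"multi_zone": False, "multi_region": True,  "readable_secondary": False},
    "GZRS":    {"multi_zone": True,  "multi_region": True,  "readable_secondary": False},
    "RA-GRS":  {"multi_zone": False, "multi_region": True,  "readable_secondary": True},
    "RA-GZRS": {"multi_zone": True,  "multi_region": True,  "readable_secondary": True},
}


def copy_survives(option, failure):
    """Does a copy of the data still exist after a 'rack', 'zone', or 'region' failure?"""
    info = REDUNDANCY[option]
    if failure == "rack":
        return True  # every option keeps copies on different racks
    if failure == "zone":
        return info["multi_zone"] or info["multi_region"]
    if failure == "region":
        return info["multi_region"]
    raise ValueError(f"unknown failure type: {failure}")


print(copy_survives("ZRS", "region"))   # False
print(copy_survives("GZRS", "region"))  # True
```

Only the RA- options set `readable_secondary`, which is the "no failover needed to read" distinction on the last two cards.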

17
Q

Azure Blob Storage Types (4)

A

Block: up to about 4.7 terabytes (larger with newer service versions), composed of blocks to optimize data for uploading

Append: append blocks, ideal for logs

Page: VM disks & databases, frequent & random read/write applications.

Hierarchical: allows a collection of files to be organized into a hierarchy of directories

18
Q

Advantages (5) & Disadvantages (5)

Azure Blob Storage

A

Advantages

  • designed for all types of unstructured data
  • scalable
  • cheap
  • simple set up, no configuration
  • no need for powerful computing to manage

Disadvantages

  • no indexes
  • no search tools
  • not optimized for performance
  • user responsible for replication & syncing
  • requires external computing to process
19
Q

Multi-model Cosmos DB

(4 General Types of NoSQL Databases)

(5 APIs per NoSQL Type)

(provide information about each API)

A

Document APIs

SQL (Core)

  • supports server-side programming model
  • JSON documents
  • SQL-like query language for NoSQL
  • default programming language after transitioning other APIs into Azure

MongoDB

  • all Mongo SDKs can interact with the Azure API, fully compatible with Mongo app code
  • implements the MongoDB “wire” protocol
  • BSON documents (binary JSON)

Key-Value API

Table

  • premium offering for Azure Table storage
  • not traditional SQL “table”
    • rows can be of different lengths
    • row value can be simple number

Wide-Column API

Cassandra

  • data is stored in columns, each column is stored separately (each attribute is separated from the others; think of an individual list per column)
  • name and format of columns can vary from row to row
  • compatible with existing, external Cassandra
  • ways to interact with Cassandra
    • Cassandra-based tools
    • Data Explorer
  • SDK: CassandraCSharpDriver

Graph API

Gremlin

  • entity relationships: nodes and edges
  • use cases
    • geospatial
    • recommendation engines
    • social networks
    • IoT
  • Persist relationships at the storage layer
  • no model required
20
Q

Redundancies Available in Cosmos DB

A

Geo-Redundant

Multi-Region Write

Availability Zone

21
Q

Encryption in Cosmos DB

(defaults & choices)

A

Default

  • data is always encrypted at rest (stored data)

Choices

  • Service Managed: Azure managed
  • Customer Managed: the user sets the encryption and manages the key
22
Q

Latency

(definition & mitigation)

A

Latency is the wait time between request and response. It is mitigated by housing the server as close as possible to the user.

23
Q

Throughput

(definition, when is the amount set, in what units does Azure manage throughput & the calculation)

A

Throughput is the number of requests that the database can process within a given timeframe. The throughput amount can be defined at either the database level or the container level. If requests exceed the allotted throughput, they are rate limited and an error is returned (HTTP 429).

Cosmos manages throughput in request units (RU)

RU calculation: Memory + CPU + IOPS (input/output operations per second)
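
To make the "error when throughput is exceeded" idea concrete, here is a toy Python model of a per-second RU budget. It is a sketch under simple assumptions (one fixed budget that resets every second), not how Cosmos DB is implemented; `RuBudget` and `ThrottledError` are hypothetical names, and the RU charges are made-up numbers.

```python
class ThrottledError(Exception):
    """Stands in for the rate-limiting error (HTTP 429) returned by the service."""


class RuBudget:
    """Toy model: a fixed number of request units (RUs) available per second."""

    def __init__(self, provisioned_ru_per_second):
        self.provisioned = provisioned_ru_per_second
        self.used = 0.0

    def charge(self, request_units):
        # Reject the request if it would push usage past the provisioned amount.
        if self.used + request_units > self.provisioned:
            raise ThrottledError("request rate too large for provisioned RU/s")
        self.used += request_units

    def next_second(self):
        # The budget is replenished at the start of every second.
        self.used = 0.0


budget = RuBudget(400)
budget.charge(300)          # fits within the 400 RU/s budget
try:
    budget.charge(200)      # 300 + 200 > 400, so this request is throttled
except ThrottledError as error:
    print("throttled:", error)
budget.next_second()
budget.charge(200)          # succeeds after the budget resets
```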

24
Q

Container

Components (5) & Names per API (3 for each API)

A

Components

  1. Database
  2. Throughput
  3. Container ID
  4. Partition Key
  5. Analytical Store

SQL API

Database is defined as Database

Container is defined as Container

Item is defined as Document

Cassandra API

Database is defined as Keyspace

Container is defined as Table

Item is defined as Row

MongoDB API

Database is defined as Database

Container is defined as Collection

Item is defined as Document

Gremlin API

Database is defined as Database

Container is defined as Graph

Item is defined as Node or edge

Table API

Database is not defined

Container is defined as Table

Item is defined as Item

25
Q

Partitioning Definitions

  1. Partitioning
  2. Partition Keys (remember the important fact)
  3. Logical Partition
  4. Physical Partition (remember the important fact)
  5. Composite Key (additional term)
  6. Partition Restrictions
A

Partitioning: the items in a container are divided into distinct subsets called logical partitions.

Partition Key: the value by which Azure organizes your data into logical divisions. Cannot change partition key after creation of the database or container.

Logical Partitions: subsets of your data divided by the partition key

Physical Partitions: the physical machines that house the different logical partitions. Logical partitions are never divided across multiple physical partitions.

Composite Key: multiple unique identifiers combined to create a single partition key, further subdividing data into smaller units.

Restrictions:

  1. Each document cannot exceed 2MB
  2. Each logical partition cannot exceed 20GB
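
The definitions above can be sketched in a few lines of Python. This is only an illustration: Cosmos DB does use hash-based partitioning, but its real hash function and the way it assigns ranges to machines differ from the SHA-256-and-modulo stand-in below. `physical_partition`, `composite_key`, and the sample items are hypothetical.

```python
import hashlib


def physical_partition(partition_key_value, physical_partition_count):
    """Map a logical partition (one partition key value) to one physical partition.

    A stable hash means the same key value always lands on the same machine,
    so a logical partition is never split across physical partitions.
    """
    digest = hashlib.sha256(str(partition_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_partition_count


def composite_key(*parts):
    """Combine several identifiers into a single partition key value."""
    return "-".join(str(part) for part in parts)


items = [
    {"id": "1", "country": "JP", "year": 2023},
    {"id": "2", "country": "US", "year": 2023},
    {"id": "3", "country": "JP", "year": 2024},
]

# Partitioning on "country" alone: items 1 and 3 share a logical partition.
by_country = {item["id"]: item["country"] for item in items}
# A composite key subdivides the data further: items 1 and 3 are now separate.
by_composite = {item["id"]: composite_key(item["country"], item["year"]) for item in items}

print(by_country)
print(by_composite)
print({key: physical_partition(key, 4) for key in set(by_composite.values())})
```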
26
Q

Dedicated & Shared Throughput

(definitions)

A

When you define throughput at the database level:

Shared: the throughput is shared across the containers in the database (recommended)

Dedicated: throughput is defined for each container. If throughput is defined at the container level, it is dedicated by default.

27
Q

Hot Partition

(definition)

A

A logical partition that does not have enough RUs for its workload, while other logical partitions have plenty of available RUs.

(good practice to create partition keys that evenly distribute data across logical partitions)
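
One rough way to spot a hot partition is to count how requests spread across partition key values. A small Python sketch; the 50% threshold and the tenant names are arbitrary choices for illustration, not an Azure rule.

```python
from collections import Counter


def hot_partitions(request_keys, threshold=0.5):
    """Return the partition key values that receive more than `threshold` of all requests."""
    counts = Counter(request_keys)
    total = sum(counts.values())
    return sorted(key for key, n in counts.items() if n / total > threshold)


# Eight of ten requests hit the same key, so its logical partition runs hot.
skewed = ["tenant-a"] * 8 + ["tenant-b", "tenant-c"]
print(hot_partitions(skewed))  # ['tenant-a']

# An evenly spread key has no hot partition.
even = ["tenant-a", "tenant-b", "tenant-c", "tenant-d"]
print(hot_partitions(even))    # []
```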

28
Q

Single v. Cross Partition Queries

A

Single: all the data for a query can be found in a single logical partition (most efficient).

Cross: the query has to look across multiple logical partitions to find its data. Also called a fan-out query.
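
The difference shows up in how many logical partitions a query has to touch. A toy in-memory model in Python, not the Cosmos DB SDK; the `Container` class and its `query` method are hypothetical.

```python
from collections import defaultdict


class Container:
    """Toy container that groups items into logical partitions by partition key."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.partitions = defaultdict(list)  # partition key value -> items

    def add(self, item):
        self.partitions[item[self.partition_key]].append(item)

    def query(self, predicate, partition_key_value=None):
        """Return (matching items, number of logical partitions scanned)."""
        if partition_key_value is not None:
            # Single-partition query: only one logical partition is read.
            scanned = [self.partitions.get(partition_key_value, [])]
        else:
            # Cross-partition (fan-out) query: every logical partition is read.
            scanned = list(self.partitions.values())
        matches = [item for part in scanned for item in part if predicate(item)]
        return matches, len(scanned)


orders = Container(partition_key="city")
orders.add({"id": "1", "city": "Tokyo", "total": 40})
orders.add({"id": "2", "city": "Seattle", "total": 75})
orders.add({"id": "3", "city": "Paris", "total": 90})

print(orders.query(lambda o: o["total"] > 30, partition_key_value="Tokyo"))  # scans 1 partition
print(orders.query(lambda o: o["total"] > 50))                               # scans 3 partitions
```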

29
Q

High Cardinality

(definition within context of databases)

A

columns with values that are unique or very uncommon
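
A quick way to gauge cardinality is the share of distinct values in a column. A minimal Python sketch; `cardinality_ratio` is an illustrative helper, not a database function.

```python
def cardinality_ratio(values):
    """Fraction of values that are distinct: 1.0 means every value is unique."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


user_ids = ["u1", "u2", "u3", "u4"]
statuses = ["active", "active", "inactive", "active"]

print(cardinality_ratio(user_ids))  # 1.0 (high cardinality)
print(cardinality_ratio(statuses))  # 0.5 (lower cardinality)
```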

30
Q

Fixed Request Charge

A

the cost to run each query against your data

31
Q

Time to Live

A
  • the time period for which data stays active before it is deleted
  • set the time to live value under settings
  • deletion defaults to consuming only leftover RUs; if other workloads are running, time to live data deletion will be delayed
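
A simplified Python model of the expiry check, assuming the usual Cosmos DB conventions: time to live is counted in seconds from the item's last-modified timestamp (`_ts`), an item-level `ttl` overrides the container default, and `-1` means "never expire". The function name `is_expired` is hypothetical.

```python
def is_expired(item, now, container_default_ttl):
    """Return True if the item is past its time to live at time `now` (epoch seconds)."""
    if container_default_ttl is None:
        return False  # time to live is switched off for the whole container
    ttl = item.get("ttl", container_default_ttl)  # item value overrides the default
    if ttl == -1:
        return False  # -1 means the item never expires
    return now >= item["_ts"] + ttl


session = {"id": "s1", "_ts": 1000, "ttl": 60}
print(is_expired(session, now=1059, container_default_ttl=-1))  # False
print(is_expired(session, now=1060, container_default_ttl=-1))  # True
```

Expired items are removed in the background, which is why the actual deletion can lag when few RUs are left over.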
32
Q

Cosmos DB Global Distribution

(def., paired regions, multi-region write & choices)

A

Definition

Data can be replicated globally and read from any selected region. Storage and throughput are copied into selected global region.

Paired Regions

Two geographic centers with a high-speed connection. Used for disaster recovery and business continuity purposes.

Multi-Region Write (Multi-Master Write)

Users from two separate regions (e.g. Japan & US) update the same data at the same time. Options:

  • last write wins (must define the property to compare, e.g. time stamp)
  • merge procedure (define the procedure)
  • merge procedure (don’t define)
    • conflicts are stored, and you manually define the stored procedure to resolve them later
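
The "last write wins" option can be sketched as picking the version with the largest value of the chosen property. A minimal Python illustration, assuming the timestamp property `_ts` is the one being compared; `last_write_wins` and the sample documents are made up.

```python
def last_write_wins(versions, path="_ts"):
    """Resolve a write conflict by keeping the version with the largest value at `path`."""
    return max(versions, key=lambda version: version[path])


# The same item updated in two regions at almost the same time.
from_japan = {"id": "42", "quantity": 5, "_ts": 1700000010, "region": "Japan"}
from_us = {"id": "42", "quantity": 7, "_ts": 1700000007, "region": "US"}

winner = last_write_wins([from_japan, from_us])
print(winner["region"], winner["quantity"])  # Japan 5
```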
33
Q

Automatic v. Manual Failover

(definitions & when does it apply)

A

Applies when there is only one write-enabled center.

Manual: the user chooses the next write-enabled center.

Automatic: the next write-enabled center is decided ahead of time, before a natural disaster.

Replication will occur automatically in either scenario as long as a global backup center has been identified.

34
Q

Consistency Levels of Cosmos DB

&

Definitions

(5)

A

In general, there is a trade-off between consistency and availability.

Strong: always read most up-to-date data, no dirty reads. High latency, highest cost.

Bounded Staleness: dirty reads are only possible within a bounded timeframe.

Session: within a session no dirty reads. Once session ends dirty reads possible. No dirty reads for writers in the same session, however dirty reads possible for other users.

Consistent Prefix: dirty reads are possible, but updates are never seen out of order. Data is always read in order, although it may not be the most recent data.

Eventual: responds to requests immediately, so dirty reads are possible and the data read may be out of order. Eventually everything will be updated to the correct data.

35
Q

Is it possible for clients to override consistency levels?

A

clients can set consistency levels to a lower level at connection time

(Strong is the highest consistency level)
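
The override rule can be written as an ordering check over the five levels, strongest first. A small Python sketch; `LEVELS` and `can_override` are illustrative names, not SDK calls.

```python
# Consistency levels ordered from strongest to weakest.
LEVELS = ["Strong", "Bounded Staleness", "Session", "Consistent Prefix", "Eventual"]


def can_override(account_default, requested):
    """A client may keep the default level or relax it, but never strengthen it."""
    return LEVELS.index(requested) >= LEVELS.index(account_default)


print(can_override("Session", "Eventual"))  # True: weaker than the default
print(can_override("Session", "Strong"))    # False: stronger than the default
```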

36
Q

Areas Covered in Non-Relational Portion of Exam

A

Azure Storage

  • how to provision an account
  • replication options (LRS, GRS, ZRS, GZRS, RA-GRS, RA-GZRS)
  • blob storage

Data Lake

  • evolution from Blob & distinctions
  • security options

Cosmos DB (largest area of this portion)

  • features
  • multi-model
  • consistency levels
  • databases & containers
  • throughput & request units
  • partitioning & horizontal scaling
  • global distribution
  • multi-master write
  • failover
  • time to live
  • CLI (code to create an account)
  • security
  • pricing