Implementing Data Storage Solutions: Non-Relational Flashcards
resource group
grouping of your services for a project or app; governs pricing, budget, permissions, and policies
performance (standard/premium)
premium: uses solid state drives; more expensive, better performance. Choosing it changes the available access tier and replication options.
4 Services Offered in Azure Storage
Blob: any type of unstructured data
Table: NoSQL non-relational data tables
File: similar to OneDrive/GoogleDrive, attaches to multiple VMs for read/write access
Queue: message storage
Attributes of Azure Storage (CAASS)
Cost Effective
Available & Durable: redundant & replication of data
Accessible: REST APIs, SDKs (software development kits), Azure CLI (Command Line Interface), PowerShell, Storage Explorer, AzCopy
Security: private endpoint, user access, HTTPS, virtual networks, encrypted data
Scalable
5 APIs of Cosmos DB
(Maggie Crosses The Street w/ Greg)
(details)
SQL: json format
Table: key-value pairs (think dictionary in python)
MongoDB: json format (file storage) - tree shaped data instead of rows and columns
Cassandra: wide column data
Gremlin: graph data
Account Selection
For Standard performance, what account types can you select and what replication options are offered?
For Premium performance, what account types can you select and what replication options are offered?
Standard
General Purpose v1: LRS, GRS, RA-GRS
General Purpose v2: LRS, GRS, RA-GRS, ZRS, GZRS, RA-GZRS
BlobStorage: LRS, GRS, RA-GRS
Premium
General Purpose v1: LRS
General Purpose v2: LRS
BlockBlobStorage: LRS, ZRS
FileStorage: LRS, ZRS
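The card above can be kept as a small lookup table. This is only a study-aid sketch: the dictionary copies the options listed on the card, and the helper name `account_types_supporting` is made up for illustration.

```python
# Replication options per (performance, account type), copied from the card above.
REPLICATION_OPTIONS = {
    ("Standard", "General Purpose v1"): ["LRS", "GRS", "RA-GRS"],
    ("Standard", "General Purpose v2"): ["LRS", "GRS", "RA-GRS", "ZRS", "GZRS", "RA-GZRS"],
    ("Standard", "BlobStorage"): ["LRS", "GRS", "RA-GRS"],
    ("Premium", "General Purpose v1"): ["LRS"],
    ("Premium", "General Purpose v2"): ["LRS"],
    ("Premium", "BlockBlobStorage"): ["LRS", "ZRS"],
    ("Premium", "FileStorage"): ["LRS", "ZRS"],
}


def account_types_supporting(performance, replication):
    """Return the account types that offer `replication` at this performance level."""
    return [
        account_type
        for (perf, account_type), options in REPLICATION_OPTIONS.items()
        if perf == performance and replication in options
    ]


print(account_types_supporting("Standard", "GZRS"))  # ['General Purpose v2']
print(account_types_supporting("Premium", "ZRS"))    # ['BlockBlobStorage', 'FileStorage']
```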
What functions do you get with the different storage types (5)?
Differences with Premium performance?
General Purpose v1
- support all storage types up to 100 terabytes
- supports all blob storage types: Block, Append, Page, Hierarchical
- used for VMs & VN (virtual networks) still on classic deployment
- Premium: only page blob supported
General Purpose v2
- support all storage types up to 100 terabytes
- supports all blob storage types: Block, Append, Page, Hierarchical
- supports blob access tiers (hot or cool)
- Premium: only page blob supported
BlobStorage
- only available in Standard performance
- supports only block & append blob storages, no other storage types supported
BlockBlobStorage
- only available in Premium performance
- only supports block & append blob storages, no other storage types supported
- does not support blob access tiers (hot or cool)
- designed for high performance low latency, interactive workloads, and mapping apps. (analytics/data trans., ecommerce, quick display)
FileStorage
- only available in Premium performance
- higher performance and lower latency compared to general purpose
- IOPS bursting: 3x input/output per sec.
- billed based on provisioned storage up to 100 terabytes
Setting up an Account: What are the different Networking options (3)?
Public (all networks): accessible from all networks, including the internet
Public (selected networks): restricts access to the networks you select, preventing general internet access
Private Endpoint: secured access on private virtual networks using a private IP address. Connects to on-prem or ExpressRoute connections, placing Azure services inside your virtual network. Needs a Business Network Zone (BNZ) to function.
What is Blob Soft Delete?
a recycle bin built in for your blobs: deleted data is retained for a set period so it can be recovered after accidental deletion
3 Blob Access Tiers & Definitions
Hot: lowest access cost, highest storage cost. Designed for current, frequently used data.
Cool: higher access cost, lower storage cost. Designed for older data not used frequently.
Archive: highest access cost, lowest storage cost. Data is offline and can take hours to access. Designed for historical, very old data.
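The trade-off between the three tiers (named Hot, Cool, and Archive in Azure) can be checked with a little arithmetic. The prices below are hypothetical, chosen only so that storage gets cheaper and access gets more expensive from Hot to Archive; they are not Azure's actual rates.

```python
# Hypothetical per-GB prices that only mirror the ordering on the card.
TIER_PRICES = {
    "Hot": {"storage": 0.020, "access": 0.000},
    "Cool": {"storage": 0.010, "access": 0.010},
    "Archive": {"storage": 0.002, "access": 0.050},
}


def monthly_cost(tier, stored_gb, read_gb):
    """Storage cost plus access cost for one month in the given tier."""
    prices = TIER_PRICES[tier]
    return stored_gb * prices["storage"] + read_gb * prices["access"]


def cheapest_tier(stored_gb, read_gb):
    """Return the tier with the lowest monthly cost for this usage pattern."""
    return min(TIER_PRICES, key=lambda tier: monthly_cost(tier, stored_gb, read_gb))


print(cheapest_tier(stored_gb=1000, read_gb=5000))  # Hot
print(cheapest_tier(stored_gb=1000, read_gb=500))   # Cool
print(cheapest_tier(stored_gb=1000, read_gb=0))     # Archive
```

Under these assumed prices, heavily read data is cheapest in Hot and untouched data is cheapest in Archive, which is the pattern the card describes.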
Local Redundant Storage
Replication within a zone, within a region across different hardware racks (also called nodes).
If the zone goes down, data is lost. This is the default replication option.

Zone Redundant Storage
Data copied across availability zones
If a region goes down, data is lost

Geo Redundant Storage
Data copied across regions to prevent loss of data in the event of a natural disaster. Generally done within the same country.
Data in secondary region is not available to applications without a failover initiated by MS. If region A goes down MS will initiate a failover and then your data will be available from region B.

Geo-Zone Redundant Storage
Copied across availability zones within 1st region. Copied within an availability zone in the 2nd region.
Data in secondary region is not available to applications without a failover initiated by MS. If region A goes down MS will initiate a failover and then your data will be available from region B.
High availability and disaster recovery.

Read-Access Geo Redundant Storage
Data copied across regions to prevent loss of data in the event of a natural disaster. Generally done within the same country.
Without failover, data is accessible to individuals closest to the region. Data is always available for your applications.
High availability, disaster recovery, & immediate access in the event of natural disaster.

Read-Access Geo-Zone Redundant Storage
Copied across availability zones within 1st region. Copied within an availability zone in the 2nd region. Without failover, data is accessible to individuals closest to the region.
Data is always available for your applications
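The six replication options differ along three yes/no properties, which makes them easy to compare side by side. The sketch below encodes the cards above; the class name `Redundancy` and the helper `matching` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redundancy:
    name: str
    zone_redundant: bool   # copies spread across availability zones in the primary region
    geo_redundant: bool    # copies also kept in a secondary region
    read_secondary: bool   # secondary region readable without a failover


OPTIONS = [
    Redundancy("LRS", False, False, False),
    Redundancy("ZRS", True, False, False),
    Redundancy("GRS", False, True, False),
    Redundancy("GZRS", True, True, False),
    Redundancy("RA-GRS", False, True, True),
    Redundancy("RA-GZRS", True, True, True),
]


def matching(zone=False, geo=False, read_secondary=False):
    """Return the names of all options that have every requested property."""
    return [
        o.name
        for o in OPTIONS
        if (o.zone_redundant or not zone)
        and (o.geo_redundant or not geo)
        and (o.read_secondary or not read_secondary)
    ]


print(matching(zone=True, geo=True))   # ['GZRS', 'RA-GZRS']
print(matching(read_secondary=True))   # ['RA-GRS', 'RA-GZRS']
```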

Azure Blob Storage Types (4)
Block: up to 4.7 terabytes, composed of blocks to optimize data for uploading
Append: append blocks, ideal for logs
Page: VM disks & databases, frequent & random read/write applications.
Hierarchical: allows for a collection of files to be organized into a hierarchy of directories
Advantages (5) & Disadvantages (5)
Azure Blob Storage
Advantages
- designed for all types of unstructured data
- scalable
- cheap
- simple set up, no configuration
- no need for powerful computing to manage
Disadvantages
- no indexes
- no search tools
- not optimized for performance
- user responsible for replication & syncing
- requires external computing to process
Multi-model Cosmos DB
(4 General Types of NoSQL Databases)
(5 APIs per NoSQL Type)
(provide information about each API)
Document APIs
SQL (Core)
- supports server-side programming model
- supports schema-less data
- json documents
- SQL like for NoSQL
- default programming language after transitioning other APIs into Azure
MongoDB
- all mongo SDKs can interact with the Azure API; fully compatible with mongo app code
- implements “wire” protocol
- bson documents (binary json)
- tree-shaped data instead of rows and columns
Key-Value API
Table
- premium offering for Azure Table storage
- not traditional SQL “table”
- rows can be of different lengths
- row value can be simple number
Wide-Column API
Cassandra
- data is stored in columns, each column is stored separately (each attribute is separated from the others, think of an individual list per column)
- name and format of columns can vary from row to row
- compatible with current, external Cassandra
- ways to interact with Cassandra
- Cassandra-based tools
- Data Explorer
- SDK: CassandraCSharpDriver
Graph API
Gremlin
- entity relationships: nodes and edges
- use cases
- geospatial
- recommendation engines
- social networks
- IoT
- Persist relationships at the storage layer
- no model required
Redundancies Available in Cosmos DB
Geo-Redundant
Multi-Region Write
Availability Zone
Encryption in Cosmos DB
(defaults & choices)
Default
- encryption is always set at rest (stored data)
Choices
- Service Managed: Azure managed
- Customer Managed: User set encryption and key
Latency
(definition & mitigation)
Latency is the wait time between request and response. It is mitigated by housing the server as close as possible to the user.
Throughput
(definition, when is the amount set, in what units does Azure manage throughput & the calculation)
Throughput is the number of requests that can be processed by the database within a given timeframe. The throughput amount can be defined either at the database level or at the container level. If requests exceed the allotted throughput, they are rate-limited and an error is thrown.
Cosmos manages throughput in request units (RU)
RU calculation: Memory + CPU + IOPS (input/output operations per second)
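A toy model helps show what running out of throughput looks like. The sketch below assumes a single one-second window and a made-up cost of 150 RU per request; Cosmos DB signals rate limiting with HTTP status 429, but the class and method names here are hypothetical and not part of any SDK.

```python
class RequestUnitBudget:
    """Toy model of provisioned throughput for a single one-second window."""

    def __init__(self, provisioned_ru_per_second):
        self.provisioned = provisioned_ru_per_second
        self.consumed = 0.0

    def try_request(self, ru_cost):
        """Return 200 if the request fits in the remaining budget, else 429."""
        if self.consumed + ru_cost > self.provisioned:
            return 429  # rate limited: the budget for this second is exhausted
        self.consumed += ru_cost
        return 200

    def next_second(self):
        """Start a new window; the full budget is available again."""
        self.consumed = 0.0


budget = RequestUnitBudget(provisioned_ru_per_second=400)
statuses = [budget.try_request(ru_cost=150) for _ in range(4)]
print(statuses)  # [200, 200, 429, 429]

budget.next_second()
print(budget.try_request(ru_cost=150))  # 200
```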
Container
Components (5) & Names per API (3 for each API)
Components
- Database
- Throughput
- Container ID
- Partition Key
- Analytical Store
SQL API
Database is defined as Database
Container is defined as Container
Item is defined as Document
Cassandra API
Database is defined as Keyspace
Container is defined as Table
Item is defined as Row
MongoDB API
Database is defined as Database
Container is defined as Collection
Item is defined as Document
Gremlin API
Database is defined as Database
Container is defined as Graph
Item is defined as Node or edge
Table API
Database is not defined
Container is defined as Table
Item is defined as Item
Partitioning Definitions
- Partitioning
- Partition Keys (rem. imp. fct.)
- Logical Partition
- Physical Partition (rem. imp. fct.)
- Composite Key (add. term)
- Partition Restrictions
Partition: items in a container are divided into distinct subsets called logical partitions.
Partition Key: the value by which Azure organizes your data into logical divisions. Cannot change partition key after creation of the database or container. Should be distinctive so that data is evenly distributed across logical partitions, but not so unique that you create overly numerous partitions, impacting read & write throughput.
Logical Partitions: subsets of your data divided by the partition key.
Physical Partitions: the physical machines that house the different logical partitions. Logical partitions are never divided across multiple physical partitions.
Composite Key: multiple unique identifiers combined to create a single partition key, further subdividing data into smaller units.
Restrictions:
- Each document cannot exceed 2MB
- Each logical partition cannot exceed 20GB
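The relationship between partition keys, logical partitions, and physical partitions can be sketched in a few lines. This is a simplified model, not how Cosmos DB actually places data: the SHA-256 modulo scheme, the sample items, and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def physical_partition_for(partition_key, physical_partition_count):
    """Map a partition key value to a physical partition index via a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_partition_count


def place(items, key_field, physical_partition_count):
    """Group item ids as physical partition -> logical partition (key value) -> ids."""
    layout = defaultdict(lambda: defaultdict(list))
    for item in items:
        key = item[key_field]
        physical = physical_partition_for(key, physical_partition_count)
        layout[physical][key].append(item["id"])
    return layout


items = [
    {"id": "1", "city": "Tokyo"},
    {"id": "2", "city": "Seattle"},
    {"id": "3", "city": "Tokyo"},
    {"id": "4", "city": "Lagos"},
    {"id": "5", "city": "Seattle"},
]

layout = place(items, key_field="city", physical_partition_count=3)
for physical, logical_partitions in sorted(layout.items()):
    print(physical, dict(logical_partitions))
```

Because the physical partition is derived only from the key value, every item sharing a key lands together, so a logical partition is never split across physical partitions.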

Dedicated & Shared Throughput
(definitions)
When you define throughput at the database level:
Shared: throughput is evenly distributed across containers (recommended)
Dedicated: defined throughput for each container, if throughput is defined at the container level by default it will be dedicated
Hot Partition
(definition)
not enough RUs for the logical partition, while other logical partitions have plenty of available RUs
(good practice to create partition keys that evenly distribute data across logical partitions)
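A quick calculation shows how a partition can be hot even when total demand is under the provisioned amount. The sketch makes a simplifying assumption that provisioned RUs are split evenly across partitions; the tenant names and demand figures are invented.

```python
def hot_partitions(ru_demand_by_key, provisioned_ru, partition_count):
    """Return the keys whose demand exceeds an even share of the provisioned RUs."""
    share = provisioned_ru / partition_count
    return sorted(key for key, demand in ru_demand_by_key.items() if demand > share)


# Hypothetical RU/s demand per partition key value.
demand = {"tenant-a": 600, "tenant-b": 80, "tenant-c": 70, "tenant-d": 50}

print(sum(demand.values()))  # 800, below the 1000 RU/s provisioned
print(hot_partitions(demand, provisioned_ru=1000, partition_count=4))  # ['tenant-a']
```

Here each partition gets 250 RU/s, so `tenant-a` is throttled while the other three leave most of their share unused.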
Single v. Cross Partition Queries
Single: can identify all data from a query in a single logical partition (most efficient).
Cross: the query has to look across multiple logical partitions to find its data. Also called a fan-out query.
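The difference in work between the two query types can be seen with an in-memory stand-in for a container. `ToyContainer` and its fields are hypothetical; the sketch only counts how many logical partitions a query must visit.

```python
from collections import defaultdict


class ToyContainer:
    """In-memory stand-in for a container, grouped by partition key value."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.partitions = defaultdict(list)

    def insert(self, item):
        self.partitions[item[self.partition_key_field]].append(item)

    def query(self, predicate, partition_key=None):
        """Return (matching items, number of logical partitions scanned)."""
        if partition_key is not None:
            scanned = [self.partitions.get(partition_key, [])]  # single partition
        else:
            scanned = list(self.partitions.values())            # fan out to all
        matches = [item for part in scanned for item in part if predicate(item)]
        return matches, len(scanned)


orders = ToyContainer(partition_key_field="customerId")
orders.insert({"id": "o1", "customerId": "c1", "total": 80})
orders.insert({"id": "o2", "customerId": "c1", "total": 20})
orders.insert({"id": "o3", "customerId": "c2", "total": 95})
orders.insert({"id": "o4", "customerId": "c3", "total": 10})


def is_large(order):
    return order["total"] > 50


single, scanned_single = orders.query(is_large, partition_key="c1")
cross, scanned_cross = orders.query(is_large)
print(len(single), scanned_single)  # 1 1
print(len(cross), scanned_cross)    # 2 3
```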
High Cardinality
(definition within context of databases)
columns with values that are unique or very uncommon
Fixed Request Charge
the cost to run each query against your data
Time to Live
- the time period for the data to be active before it is deleted
- set the time to live value under settings
- defaults to consuming only leftover RUs - if other workloads are running, time to live data deletion will be delayed
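The expiry rule itself is simple arithmetic. The sketch below assumes the time to live is given in seconds and counted from the item's last modification time; the function name and the timestamps are illustrative, and the delayed background deletion described above is not modeled.

```python
def is_expired(last_modified, ttl_seconds, now):
    """True once `ttl_seconds` have passed since the item was last modified.

    All times are seconds since the epoch; ttl_seconds=None means no TTL is set.
    """
    if ttl_seconds is None:
        return False
    return now >= last_modified + ttl_seconds


written_at = 1_700_000_000       # hypothetical write time
one_hour = 3600

print(is_expired(written_at, one_hour, now=written_at + 1800))  # False
print(is_expired(written_at, one_hour, now=written_at + 3600))  # True
print(is_expired(written_at, None, now=written_at + 10**9))     # False
```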
Cosmos DB Global Distribution
(def., paired regions, multi-region write & choices)
Definition
Data can be replicated globally and read from any selected region. Storage and throughput are copied into selected global region.
Paired Regions
Two geographic centers with a high speed connection. Used for disaster recovery and business continuity purposes.
Multi-Region Write (Multi-Master Write)
Users from two separate regions (ex. Japan & US) update data at the same time. Options:
- last write wins (must define, ex. time stamp)
- merge procedure (define the procedure)
- merge procedure (don’t define)
- actions are stored and you manually define the stored procedure later
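Last write wins is the easiest of these options to picture: among conflicting versions, the one with the highest value in the chosen field is kept. In the sketch below the field name `_ts` (a timestamp) and the sample writes are assumptions for illustration.

```python
def last_write_wins(conflicting_versions, resolution_field="_ts"):
    """Keep the version with the highest value in the resolution field."""
    return max(conflicting_versions, key=lambda version: version[resolution_field])


# Two regions update the same item at nearly the same time (hypothetical data).
writes = [
    {"id": "item-1", "region": "Japan", "price": 10, "_ts": 1_700_000_005},
    {"id": "item-1", "region": "US", "price": 12, "_ts": 1_700_000_009},
]

winner = last_write_wins(writes)
print(winner["region"], winner["price"])  # US 12
```

A custom merge procedure would replace the `max` call with whatever logic the application needs, such as combining fields from both versions.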
Automatic v. Manual Failover
(definitions & when do they apply)
applies when there is only one write enabled center
Manual: user chooses the next write enabled center
Automatic: decide prior to natural disaster
replication will automatically occur in either scenario as long as a global backup center has been identified
Consistency Levels of Cosmos DB
&
Definitions
(5)
Sally and Beth Steal from Cathy and Ethel
In general, there is a trade-off between consistency and availability.
Strong: always read most up-to-date data, no dirty reads. High latency, highest cost.
- Even user who writes data cannot see changes until they are committed and synchronized.
Bounded Staleness: dirty reads are only possible within a bounded timeframe.
Session: within a session no dirty reads. Once session ends dirty reads possible. No dirty reads for writers in the same session, however dirty reads possible for other users.
- Session refers to the user’s session, time on the computer.
- The user can read the value he/she writes within that session. Only the same user within the same session is guaranteed to read the same value written within a single session.
Consistent Prefix: dirty reads are possible but updates are never seen out of order. Data is always read in order although it may not be the most recent data.
Eventual: automatically responds to requests, so dirty reads are possible and those reads may be out of order. Eventually everything will be updated to the correct data.
Is it possible for clients to override consistency levels?
clients can set consistency levels to a lower level at connection time for each request
(Strong is the highest consistency level)
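The override rule amounts to comparing positions in the ordered list of levels. A minimal sketch, with a hypothetical helper name:

```python
# Ordered from strongest to weakest, matching the mnemonic above.
LEVELS = ["Strong", "Bounded Staleness", "Session", "Consistent Prefix", "Eventual"]


def can_override(account_default, requested):
    """A client may keep the account's default level or relax it, never strengthen it."""
    return LEVELS.index(requested) >= LEVELS.index(account_default)


print(can_override("Session", "Eventual"))  # True: weaker than the default
print(can_override("Session", "Strong"))    # False: stronger than the default
```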
Areas Covered in Non-Relational Portion of Exam
Azure Storage
- how to provision an account
- replication options (LRS, GRS, ZRS, GZRS, RA-GRS, RA-GZRS)
- blob storage
Data Lake
- evolution from Blob & distinctions
- security options
Cosmos DB (largest area of this portion)
- features
- multi-model
- consistency levels
- databases & containers
- throughput & request
- partitioning & horizontal scaling
- global distribution
- multi-master write
- failover
- time to live
- CLI (code to create an account)
- security
- pricing
For Cosmos DB:
What is a role based access control?
What is another name for role based access control?
What are the 5 different roles (BACCO)?
RBAC: access controls based on users and groups of users. Also referred to as Identity & Access Management. Does not automatically expire, needs to be manually revoked.
Roles:
- Owner: full access and grant others access
- Account Reader: read access only
- Backup Operator: restore the system
- Contributor: read, write, delete, account management
- Operator: provisions databases, accounts, but no creation of access keys
Cosmos DB:
What is Cross Origin Resource Sharing?
whitelist of domain names that are allowed to make requests to Cosmos DB
Cosmos DB
What are the access keys?
provide access to all of the administrative resources;
primary and secondary access keys that can be utilized to grant different read and write access;
can be utilized to change keys to limit unauthorized access to the systems;
What is a Data Lake and how is it different from Blob Storage?
Data Lake is a combination of Blob Storage and Hadoop HDFS stored in the cloud. It is optimized for big data analytics by handling the need for increased processing speed and wide variety of data types.
Blob v. Data Lake Similarities
- available in every region
- local and global redundancy
Data Lake Specific Features
- optimized for big data analytics
- allows for hierarchical namespace
- supports multiple integrations
- compatible with Hadoop
Blob Specific Features
- more features than Data Lake, general purpose data storage
- processing performance limits
Data Lake Security:
Storage Account Key Switching
(old approach to managing authentication)
- all accounts have 2 storage keys (gives all administrative access)
switching keys
- client applications use key one
- shift applications to key two
- then regenerated to create a new key one
- shift applications to key one
- then regenerated to create a new key two
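The point of the switching sequence is that applications always hold a valid key while the other one is regenerated. The sketch below walks through the steps with a made-up `StorageAccount` class; it does not call any Azure API.

```python
import secrets


class StorageAccount:
    """Toy storage account holding two interchangeable admin keys."""

    def __init__(self):
        self.keys = {"key1": secrets.token_hex(16), "key2": secrets.token_hex(16)}

    def regenerate(self, name):
        """Replace one key; the previous value stops working immediately."""
        self.keys[name] = secrets.token_hex(16)

    def accepts(self, key):
        return key in self.keys.values()


account = StorageAccount()
original_key1 = account.keys["key1"]
original_key2 = account.keys["key2"]
lockouts = 0

app_key = account.keys["key1"]   # client applications use key one
app_key = account.keys["key2"]   # shift applications to key two
account.regenerate("key1")       # regenerate key one
lockouts += not account.accepts(app_key)
app_key = account.keys["key1"]   # shift applications to the new key one
account.regenerate("key2")       # regenerate key two
lockouts += not account.accepts(app_key)

print(lockouts)                                                        # 0
print(account.accepts(original_key1), account.accepts(original_key2))  # False False
```

At the end both original keys are invalid, yet the application was never left holding a rejected key.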
Data Lake Security:
What is an Active Directory?
it creates users and user groups, feeds information into the role based access controls
Data Lake Security:
Shared Access Signature
(new approach to managing authentication)
(3 steps to setting up)
- gives users the minimum set of permissions needed to perform their task
- automatically expires after 2 months
How is it set up?
- specify permissions and the range of time for access to be granted
- identify the IP address you want to grant access to
- identify the account key to utilize
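The three set-up steps map onto a signed token: permissions and a time range, an allowed IP address, and an account key used to sign the grant. The sketch below illustrates the idea with a generic HMAC-SHA256 signature; the field layout and function names are invented and do not reproduce Azure's actual SAS format.

```python
import base64
import hashlib
import hmac


def _signature(account_key, permissions, start, expiry, allowed_ip):
    """HMAC-SHA256 over the grant fields, keyed with the account key."""
    string_to_sign = "\n".join([permissions, str(start), str(expiry), allowed_ip])
    digest = hmac.new(account_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def issue_token(account_key, permissions, start, expiry, allowed_ip):
    """Bundle the grant with a signature that only the key holder can produce."""
    return {
        "permissions": permissions,
        "start": start,
        "expiry": expiry,
        "ip": allowed_ip,
        "sig": _signature(account_key, permissions, start, expiry, allowed_ip),
    }


def is_allowed(token, account_key, operation, now, client_ip):
    """Check signature, permission, time window, and caller IP."""
    expected = _signature(
        account_key, token["permissions"], token["start"], token["expiry"], token["ip"]
    )
    return (
        hmac.compare_digest(expected, token["sig"])
        and operation in token["permissions"]
        and token["start"] <= now <= token["expiry"]
        and client_ip == token["ip"]
    )


key = b"hypothetical-account-key"
token = issue_token(key, permissions="r", start=1000, expiry=2000, allowed_ip="203.0.113.7")

print(is_allowed(token, key, "r", now=1500, client_ip="203.0.113.7"))  # True
print(is_allowed(token, key, "w", now=1500, client_ip="203.0.113.7"))  # False: read only
print(is_allowed(token, key, "r", now=2500, client_ip="203.0.113.7"))  # False: expired
```

Editing the token's permissions after it is issued changes the string that was signed, so the signature check fails.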
Data Lake Security:
Access Control List (ACL)
(def., what is a service principal?, what are the 3 types of service principals?)
- sets up permissions for files and folders
- service principal: defines the access policy and permissions for users/apps within an instance within an active directory
- 3 types of service principals:
- application - permissions @ app level
- managed identity - automated credentials management
- legacy: app. created before registration
Network Firewall Definition
a network security device that monitors incoming and outgoing traffic, deciding whether to allow or block the specific traffic
Recovery Point Objective (RPO)
&
Recovery Time Objective (RTO)
RPO: the maximum amount of data loss a business can tolerate, measured in time; it determines how frequently backups must occur
RTO: the amount of downtime a business can tolerate
Cosmos DB:
Backup & Restore Options
(3 properties & default configuration)
- backups are completely automated with no RU cost
- default settings (can be changed):
- interval: backups every 4 hours
- retention: available for 8 hours
- maximum 2 backups
- backups for Cosmos DB are stored separately in Blob storage
- initially stored in the same region (lower latency), then in the paired region
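The default numbers fit together arithmetically: an 8 hour retention divided by a 4 hour interval gives the 2 backups kept, and the interval is also the worst-case window of lost writes. A small sketch, with variable names chosen for illustration:

```python
from datetime import timedelta

backup_interval = timedelta(hours=4)
retention = timedelta(hours=8)

# How many backups exist at once, and how much recent data a restore could miss.
backups_kept = retention // backup_interval
worst_case_data_loss = backup_interval

print(backups_kept)          # 2
print(worst_case_data_loss)  # 4:00:00
```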
Terms that signify semi-structured data
(4)
- solution that does not restrict attributes to a specific vendor/customer/entity
- different types of products can have different attributes
- some products can have different columns populated
- not all columns in the existing database are used
Query Languages Used in
Different Cosmos DB APIs
(5)
Table: Language Integrated Query (LINQ)
MongoDB: JavaScript
Core SQL: SQL
Cassandra: Cassandra Query Language (CQL)
Gremlin: graph traversal language (Apache TinkerPop)
What API does GlobalDocumentDB refer to?
SQL API
What are Azure SQL Database elastic pools?
What is Data Sync & Sync Groups? (when to use & when not to use)
What Are Elastic Jobs?
Elastic Pools
- solution for managing and scaling multiple databases
- databases in a pool are on a single server and share a set number of resources at a set price
- used when you have to provision databases for different customer groups, where the source data comes from a similar source (remember data warehouse architecture)
Data Sync & Sync Groups
- sync group: group of databases to synchronize (hub & spoke)
- used when data needs to be kept updated across several databases
- not the preferred strategy for disaster recovery
Elastic Jobs
- automate tasks on a set of Azure SQL servers or SQL databases
- task that needs to run regularly on a schedule, or run
- administrative tasks, maintenance, and scheduled transaction queries
Azure Data Factory
(definition)
What is orchestration?
- orchestration platform to move data across different data stores via data pipelines; can run scheduled pipelines but is not suited for administrative tasks
- orchestration is the automated configuration, management, and coordination of computer systems, apps, and services
Resource Tokens
(what are they/used for?)
- provide granular access to Cosmos DB while limiting access to administrative tasks
- safe alternative to a master key
Shared Key Authorization
(def.)
full administrative access to storage accounts
File Permissions Hierarchy (3)
&
POSIX format
File Permissions Hierarchy (order matters)
- Owner
- Owner Group
- Everyone Else
POSIX
- Read Only: 4
- Write Only: 2
- Execute Only: 1
- No Access: 0
- Read + Write: 4 + 2 = 6
- Read + Execute: 4 + 1 = 5
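The digit arithmetic above extends to a full three-digit mode, one digit each for owner, owning group, and everyone else, in that order. A short sketch with hypothetical helper names:

```python
READ, WRITE, EXECUTE = 4, 2, 1


def permission_digit(read=False, write=False, execute=False):
    """Add up the values of the granted permissions to get one octal digit."""
    return (READ if read else 0) + (WRITE if write else 0) + (EXECUTE if execute else 0)


def mode(owner, owning_group, everyone_else):
    """Combine three digits in hierarchy order: owner, owning group, everyone else."""
    return f"{owner}{owning_group}{everyone_else}"


def describe(mode_string):
    """Expand a mode such as '750' into the familiar rwx notation."""
    parts = []
    for character in mode_string:
        digit = int(character)
        parts.append(
            ("r" if digit & READ else "-")
            + ("w" if digit & WRITE else "-")
            + ("x" if digit & EXECUTE else "-")
        )
    return "".join(parts)


full = permission_digit(read=True, write=True, execute=True)  # 4 + 2 + 1 = 7
read_execute = permission_digit(read=True, execute=True)      # 4 + 1 = 5
none = permission_digit()                                     # 0

file_mode = mode(full, read_execute, none)
print(file_mode)            # 750
print(describe(file_mode))  # rwxr-x---
```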