Data Platforms Flashcards

1
Q

Azure Cosmos DB

A

Azure Cosmos DB is a fully managed NoSQL database service designed for building scalable, globally distributed, and highly available applications. It provides multi-model support, low-latency access, and comprehensive SLA guarantees for throughput, latency, availability, and consistency.

Key Features
* Global Scale: Built for “planet scale.” Includes features for replication, failover, and access across the globe.
* Multi-Write: Unlike many other databases, Cosmos DB supports multi-region writes (concurrent read/write).
* Multi-Model: Supports multiple data models (document, key-value, column-family, graph, and relational) through several popular APIs.

Cosmos DB APIs
* NoSQL: JSON document format, SQL query language, REST protocol.
* MongoDB: BSON document format, MongoDB query language, MongoDB wire protocol.
* PostgreSQL: Relational table format, PostgreSQL query language, PostgreSQL wire protocol.
* Apache Cassandra: CQL column-family format, CQL query language, Cassandra wire protocol.
* Apache Gremlin: JSON graph format, Gremlin query language, Gremlin wire protocol.
* Table: Key-value format, OData/LINQ query language, REST protocol.

Architecture - Key Components
* Cosmos DB Account: Parent resource. Defines pricing model, API, replication, consistency, and more.
* Database: A kind of “namespace” housing one or more containers. Can share throughput.
* Container: Stores data and settings (procedures, triggers, etc.). Dedicate/share throughput.
* Item: The actual piece of data, for example, a JSON document or a table row.
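
The account > database > container > item hierarchy can be pictured as nested objects. The sketch below is a plain in-memory model for study purposes, not the Azure SDK; every class, field, and value name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Holds items (hypothetical model, not the Azure SDK)."""
    name: str
    items: list = field(default_factory=list)

    def upsert(self, item: dict) -> None:
        # Replace an existing item with the same id, otherwise append it.
        self.items = [i for i in self.items if i["id"] != item["id"]] + [item]


@dataclass
class Database:
    """Namespace that groups one or more containers."""
    name: str
    containers: dict = field(default_factory=dict)


@dataclass
class Account:
    """Parent resource; the API is chosen at this level."""
    name: str
    api: str
    databases: dict = field(default_factory=dict)


account = Account(name="demo-account", api="NoSQL")
db = account.databases.setdefault("shop", Database("shop"))
orders = db.containers.setdefault("orders", Container("orders"))

orders.upsert({"id": "1", "customerId": "c42", "total": 19.99})
orders.upsert({"id": "1", "customerId": "c42", "total": 24.99})  # same id: replaces

print(len(orders.items), orders.items[0]["total"])
```

The point of the model is only the nesting: settings such as the API live on the account, while the data itself lives in items inside a container.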

2
Q

Azure SQL

A

Azure SQL is a family of fully managed, scalable, and intelligent cloud database services offered by Microsoft Azure. It includes options for relational database solutions such as Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure Virtual Machines. These services are designed for modern applications requiring high availability, performance, and scalability.

Azure SQL Family

1. SQL Server on Azure VMs: (IaaS)
* VM Image preconfigured with SQL Server software.
* 100% SQL Server capabilities.
* Manage OS, software, and availability yourself.

2. Azure SQL Managed Instance: (PaaS)
* A kind of managed SQL Server deployment.
* Provides near-100% SQL Server capabilities.
* Includes backups, availability, etc.

3. Azure SQL Database: (PaaS)
* Fully managed by Microsoft, providing SQL Server-like capabilities.
* Includes backups, availability, etc.

3
Q

Azure SQL VMs

A

Key Considerations
* Full installation of SQL Server: All features and capabilities are supported.
* Manual patching required: Both SQL Server and the operating system need manual updates.
* Backup and high availability: Requires manual configuration of backups and high availability features.
* No SLA guarantee: Unless using Azure Premium SSD or Ultra Disk storage.

4
Q

Azure SQL Managed Instance

A

Key Considerations
* Almost 100% SQL Server feature parity (CLR, Agent, Database Mail, etc.).
* Always runs the latest stable version of the SQL Server database engine on a patched OS.
* Includes built-in backups, high availability, geo-replication, and failover.
* Supports a 99.99% availability guarantee for every database.

5
Q

Azure SQL Database

A

Key Considerations
* Some limitations, but supports most common SQL Server capabilities.
* Always running the latest stable version of SQL Server database on a patched OS.
* Includes built-in backups, high availability, geo-replication, and failover.
* Supports 99.99% to 99.995% availability guarantee for every database.
* Elastic Pools are a feature of Azure SQL Database designed to optimize resource utilization and cost efficiency when managing multiple databases. They allow you to share resources such as compute and storage across multiple databases within a pool, ensuring better performance and cost control for varying workloads.
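
Why pooling saves resources can be shown with a little arithmetic. The sketch below assumes made-up hourly usage figures in arbitrary "compute units" (not DTUs or vCores) and compares sizing each database for its own peak against sizing one pool for the combined peak; the function names are hypothetical.

```python
def capacity_if_separate(usage: dict) -> int:
    """Each database is sized for its own peak, so the peaks add up."""
    return sum(max(series) for series in usage.values())


def capacity_if_pooled(usage: dict) -> int:
    """The pool is sized for the busiest hour across all databases combined."""
    return max(sum(hour) for hour in zip(*usage.values()))


# Hypothetical hourly usage per database; the peaks fall in different hours.
usage = {
    "tenant_a": [10, 80, 10, 10],
    "tenant_b": [10, 10, 70, 10],
    "tenant_c": [60, 10, 10, 10],
}

separate = capacity_if_separate(usage)  # 80 + 70 + 60
pooled = capacity_if_pooled(usage)      # busiest hour: 80 + 10 + 10

print(separate, pooled)  # 210 100
```

The saving only appears when the databases peak at different times; if every database peaks in the same hour, the two numbers are equal.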

6
Q

Azure Pricing and Service Tiers

A

1. vCore (SQL DB / SQL MI):
* Lets you choose the desired compute resources.
* Supports serverless mode (pauses when idle).
* Supports Azure Hybrid Benefit.

General Purpose:
* Compute: 2 to 128 vCores.
* Storage: Premium remote storage (1GB - 4TB).
* Backups: Geo/zone/locally redundant backup (1-35 days short-term retention, 10 years long-term retention).
* Availability: One replica, no read-scale replicas, zone-redundant high availability.

Business Critical:
* Compute: 2 to 128 vCores.
* Storage: Super-fast local SSD (1GB - 4TB).
* Backups: Geo/zone/locally redundant backup (1-35 days short-term retention, 10 years long-term retention).
* Availability: Three replicas, one read-scale replica, zone-redundant high availability.

Hyperscale:
* Compute: 2 to 128 vCores.
* Storage: Decoupled storage with SSD cache (10GB - 100TB).
* Backups: Geo/zone/locally redundant backup (1-35 days short-term retention, 10 years long-term retention).
* Availability: Zone-redundant high availability with various replica types.

2. DTU (SQL DB Only)
* Bundled compute/storage packages.
* Does not support the serverless mode.
* Does not support Azure Hybrid Benefit.

Basic:
* Compute: Low CPU.
* Storage: Standard page blobs (up to 2GB).
* Backups: Geo/zone/locally redundant backup (1-7 days short-term retention, 10 years long-term retention).
* Availability: One replica, no read-scale replicas.

Standard:
* Compute: Low, Medium, High CPU.
* Storage: Standard page blobs (250GB - 1024GB).
* Backups: Geo/zone/locally redundant backup (1-35 days short-term retention, 10 years long-term retention).
* Availability: One replica, one read-scale replica (zone-redundant high availability opt-in).

Premium:
* Compute: Medium, High CPU.
* Storage: Premium page blobs (500GB - 4096GB).
* Backups: Geo/zone/locally redundant backup (1-35 days short-term retention, 10 years long-term retention).
* Availability: One replica, one read-scale replica, zone-redundant high availability.

7
Q

Azure Storage Overview

A

Azure Storage is a set of services that Microsoft provides for storing highly available data and accessing it in the cloud.

-It’s massively scalable, so you can store petabytes of data
-Great accessibility, it’s built for worldwide public internet access
-It’s managed for you

Types of Storage:

-Blob Storage: An object store, typically used for text and binary data
-Files: A managed file server (use it if you need a folder hierarchy)
-Queue: A messaging service that lets decoupled components of your applications communicate with one another
-Table: A schemaless NoSQL store for structured application data

Architecture

  1. Storage Account: Special container with important properties for the storage service
    -Name: Unique name used to create public DNS record for accessing storage
    -Performance: Standard (lowest cost, HDD backed) or Premium (higher cost, SSD backed)
    -Type: General Purpose v2, Page Blobs, BlockBlobs, File. Additional legacy options
    -Redundancy: Protect data by replicating across hardware and datacenters
  2. Storage Services: Multiple storage services can exist within a single storage account
  3. Public Endpoints: Storage services are built for public accessibility by design
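
Because the account name becomes part of a public DNS record, it has to follow DNS-friendly rules. The sketch below assumes the standard public-cloud endpoint pattern and the usual naming rule (3 to 24 characters, lowercase letters and digits only); the helper names and the example account name are hypothetical.

```python
import re

# The four storage services that live inside one storage account.
SERVICES = {"blob", "file", "queue", "table"}


def is_valid_account_name(name: str) -> bool:
    """3-24 characters, lowercase letters and digits only."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


def public_endpoint(account: str, service: str) -> str:
    """Build the public endpoint for one service of a storage account."""
    if not is_valid_account_name(account):
        raise ValueError(f"invalid storage account name: {account!r}")
    if service not in SERVICES:
        raise ValueError(f"unknown storage service: {service!r}")
    return f"https://{account}.{service}.core.windows.net"


print(public_endpoint("examplestore01", "blob"))
```

One account name yields one endpoint per service, which is why the name must be unique across Azure rather than just within a subscription.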
8
Q

Azure Blob Storage Overview

A

Blob Storage is built for web access to binary objects that need no particular structure.

Architecture

  1. Storage Account: Requires GPv2, BlockBlob, PageBlob (or legacy: GPv1, BlobStorage)
  2. Blob Container: Container for managing access to unstructured data (no hierarchy)
  3. Blobs: Are the actual objects/files that are stored

-You don’t get a hierarchy unless you turn on a feature called “hierarchical namespace”

Blob Types
-Block Blobs: Most common type of blob for storing binary/text data (standard files)
-Append Blobs: Like block blobs, but built for append operations (e.g., logging data)
-Page Blobs: Random access files. Used for VM disks and Azure SQL DB files

Blob Sub-Types
-Blob Version: Retain version history of blobs automatically when edited. (Version Control)
-Blob Snapshot: Read-only point-in-time copy of a blob (only stores differences)
-Soft-Deleted Blobs: Blobs that have been deleted but are kept for a specified retention period

9
Q

Azure Storage Redundancy

A

Redundancy Types
-Locally Redundant Storage (LRS): Creates three copies synchronously within a single physical location
-Zone Redundant Storage (ZRS): Creates three copies synchronously across three AZs within the region
-Geo Redundant Storage (GRS): LRS is followed by one asynchronous copy to the secondary region (3:1)
-Geo Zone Redundant Storage (GZRS): ZRS is followed by one asynchronous copy to the secondary region ((1-1-1):1)

Secondary Read Access
-Supported by RA-GRS or RA-GZRS (without the need for a failover to be triggered)
-Can help ensure continuity of access in the event of any outages
-The copy in the secondary region is available via a public endpoint
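
The redundancy options can be summarized as a small lookup table. The sketch below is a study aid, not an SDK call: the table fields and function name are hypothetical, and the `-secondary` suffix is assumed to follow the standard public-cloud endpoint pattern.

```python
# copies: synchronous copies in the primary region
# zones: availability zones those copies are spread across
# geo: whether an asynchronous copy goes to a secondary region
# read_access: whether the secondary can be read without a failover
REDUNDANCY = {
    "LRS":     {"copies": 3, "zones": 1, "geo": False, "read_access": False},
    "ZRS":     {"copies": 3, "zones": 3, "geo": False, "read_access": False},
    "GRS":     {"copies": 3, "zones": 1, "geo": True,  "read_access": False},
    "GZRS":    {"copies": 3, "zones": 3, "geo": True,  "read_access": False},
    "RA-GRS":  {"copies": 3, "zones": 1, "geo": True,  "read_access": True},
    "RA-GZRS": {"copies": 3, "zones": 3, "geo": True,  "read_access": True},
}


def secondary_endpoint(account: str, service: str, redundancy: str):
    """Return the read-only secondary endpoint, or None if there is none."""
    if not REDUNDANCY[redundancy]["read_access"]:
        return None
    return f"https://{account}-secondary.{service}.core.windows.net"


print(secondary_endpoint("examplestore01", "blob", "RA-GRS"))
print(secondary_endpoint("examplestore01", "blob", "GRS"))  # None
```

With plain GRS or GZRS the secondary copy exists but cannot be read until a failover happens; only the RA- variants expose it on its own endpoint.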

Storage Account Failover
-Storage account failover is initiated by the customer, manually
-All data in the primary is lost, and the secondary will become the new primary
-After failover, the account in the new primary region is configured as locally redundant (LRS); geo-redundancy has to be re-enabled to get a new secondary
-Failover can result in data loss, because replication is asynchronous
-Microsoft will update the DNS when you trigger the failover so that applications point to the secondary

Important Considerations
-Not all types of redundancy are supported by all storage account types (especially Premium)
-Redundancy should not be relied on for data backup; it is for disaster recovery
-You can convert from/to many redundancy types. Some require a support ticket or special process

10
Q

Blob Storage Access Tiers

A

Hot Tier
-An online tier optimized for storing data that is accessed or modified frequently
-Highest storage costs, but lowest access costs

Cool Tier:
-An online tier optimized for storing data that is infrequently accessed or modified.
-Lower storage costs, but higher access costs (Online)
-Should be stored for a minimum of 30 days
-Early deletion fee if deleted or moved to another tier before 30 days

Cold Tier:
-An online tier optimized for storing data that is rarely accessed or modified, but still requires fast retrieval.
-Should be stored for a minimum of 90 days.
-Lower storage costs and higher access costs compared to the cool tier.

Archive Tier:
-An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours.
-Lowest storage costs, but highest access costs (Offline) (Latency)
-Fee if deleted/moved tier earlier than 180 days
-You can’t use it with any zone-redundant option (ZRS, GZRS, or RA-GZRS)

-Rehydration: The process of moving a blob from the archive tier back to an online tier; it can take hours to complete
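
The minimum retention periods translate into a simple calculation: leaving a tier early is charged as if the blob had stayed for the remaining days. The sketch below assumes that prorated model and minimums of 30, 90, and 180 days for cool, cold, and archive; the function name is hypothetical and no prices are modeled.

```python
# Minimum days a blob is expected to stay in each tier.
MIN_RETENTION_DAYS = {"hot": 0, "cool": 30, "cold": 90, "archive": 180}


def early_deletion_days(tier: str, days_stored: int) -> int:
    """Days still billed if a blob leaves `tier` after `days_stored` days."""
    if days_stored < 0:
        raise ValueError("days_stored must not be negative")
    return max(0, MIN_RETENTION_DAYS[tier] - days_stored)


print(early_deletion_days("archive", 45))  # 135
print(early_deletion_days("cool", 30))     # 0
print(early_deletion_days("hot", 1))       # 0
```

So a blob archived and then deleted after 45 days is still billed for the remaining 135 days of archive storage, which is why the colder tiers only pay off for data that really stays put.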

Architecture
1. Storage Account: Supports General Purpose V2. Not supported by Premium Blockblob
2. Blobs: Supports block blobs only. Page/append blobs are not supported
3. Access Tier: Default is defined for a storage account. Can be assigned per blob

11
Q

Blob Storage Lifecycle Management

A

Lifecycle Management Policies automate actions on blobs, such as:
-Moving blobs between tiers
-Deleting blobs after a number of days

Configuration
1. Storage Account: General Purpose V2, Premium BlockBlob, BlobStorage (legacy)
2. Blobs: Supports block and append blobs (and sub-types: versions, snapshots)
3. Management Policy: Supports complex rules with filters, blob sub-types, and actions
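
A management policy is a JSON document made of rules, each with filters and actions. The sketch below builds one such rule in Python; the rule name, prefix, and day thresholds are hypothetical examples, and the JSON field names follow the lifecycle policy schema as I understand it, so check them against the current documentation before use.

```python
import json


def build_policy(prefix: str, cool_after: int, archive_after: int, delete_after: int) -> dict:
    """Build a one-rule lifecycle policy for block blobs under `prefix`."""
    if not cool_after < archive_after < delete_after:
        raise ValueError("thresholds must increase: cool < archive < delete")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "age-out-example",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    # Filters: which blobs the rule applies to.
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                    # Actions: what happens as the blob ages.
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after},
                            "tierToArchive": {"daysAfterModificationGreaterThan": archive_after},
                            "delete": {"daysAfterModificationGreaterThan": delete_after},
                        }
                    },
                },
            }
        ]
    }


policy = build_policy("logs/", cool_after=30, archive_after=90, delete_after=365)
print(json.dumps(policy, indent=2))
```

Tiering actions apply only to block blobs, so the filter above is limited to `blockBlob`; a delete-only rule could also target append blobs.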

12
Q

Azure Files

A

Azure Files is a cloud-based file sharing service. It lets users create highly available network file shares that can be accessed from multiple Azure virtual machines or from on-premises systems.

-Windows: SMB Share (GPv2 or Premium) - Win32
-Linux: NFS Share (Premium) - POSIX

Storage Tiers

Premium (SSD)
-Highest price for high performance, single-digit ms latency
-Supports both SMB and NFS shares
-Only supports provisioned billing (if you provision 100 GB, you pay for 100 GB, whether or not you use it)

Transaction Optimized (HDD)
-High price for storage with low costs for transactions
-Only supports SMB shares
-Uses pay-as-you-go billing

Hot (HDD)
-Mid price for both storage and transactions
-Only supports SMB shares
-Uses pay-as-you-go billing

Cool (HDD)
-Lowest price for storage, but high price for transactions
-Only supports SMB shares
-Uses pay-as-you-go billing

Architecture

  1. Storage Account: Supports General Purpose V2 and Premium FileStorage
  2. File Share (configured at creation time)
    -SMB: Supports all tiers/redundancy
    -NFS: Premium and LRS/ZRS only
  3. Client Connectivity: Accessed using REST API for apps or mounted with SMB/NFS (for users)
13
Q

Azure File Sync

A

Allows organizations to synchronize files between on-premises servers and Azure cloud storage. It helps in centralizing file services in Azure while maintaining local access to files. Useful for organizations with distributed offices or branch offices that need access to the same set of files and data.

-NFS is not supported
-SMB is supported, so users can connect directly to the share
-FTP can be used to give users or systems at remote sites access through the Windows file servers that synchronize back to the share
-The synchronization service runs on Windows Server, so you need a Windows Server machine
-If a file with the same name is created in two synchronized locations at the same time, File Sync keeps both: one version keeps the original name and the other is saved as a “conflict” copy with the server name included in its file name
-If users write data directly to the share, it can take up to 24 hours before that data is synchronized to the servers

Cloud Tiering: Helps organizations optimize storage usage and reduce costs by intelligently managing file data across on-premises servers and Azure cloud storage.

“I’ve got one of these sites that maybe doesn’t have that much storage, so just synchronize some of the data, but not all of it. Provide access to it, but don’t actually have it sitting on the file server unless someone goes and requests it.”

-The data is still visible to everyone, but it might not reside on that server until it is accessed

There are two ways to use Cloud Tiering:

-Space Policy: Looks at the space available on the file server and says “I need to keep 100 GB free.” It then keeps only as much data local as still leaves that free space available
-Date Policy: Looks at the last access time of the data. If data hasn’t been accessed in a long time, no local copy is cached or synchronized
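
The two policies can be sketched as a small selection function. This is a simplified illustration, not the actual tiering algorithm: the class, function, and parameter names are hypothetical, and it assumes the space policy tiers the least recently accessed files first.

```python
from dataclasses import dataclass


@dataclass
class LocalFile:
    name: str
    size_gb: float
    days_since_access: int


def files_to_tier(files, volume_gb, keep_free_gb=None, max_idle_days=None):
    """Return the names of files that would be kept in the cloud only."""
    tiered = set()
    # Date policy: anything idle for longer than the threshold is tiered.
    if max_idle_days is not None:
        tiered |= {f.name for f in files if f.days_since_access > max_idle_days}
    # Space policy: tier the coldest remaining files until enough space is free.
    if keep_free_gb is not None:
        local = sorted(
            (f for f in files if f.name not in tiered),
            key=lambda f: f.days_since_access,
            reverse=True,
        )
        used = sum(f.size_gb for f in local)
        for f in local:
            if volume_gb - used >= keep_free_gb:
                break
            tiered.add(f.name)
            used -= f.size_gb
    return sorted(tiered)


files = [
    LocalFile("designs.zip", 200, 2),
    LocalFile("q1-report.pptx", 150, 40),
    LocalFile("archive-2019.bak", 120, 400),
]

print(files_to_tier(files, volume_gb=500, keep_free_gb=100))  # ['archive-2019.bak']
print(files_to_tier(files, volume_gb=500, max_idle_days=30))  # ['archive-2019.bak', 'q1-report.pptx']
```

Tiered files stay visible in the folder listing; only their content is fetched from the share when someone opens them.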

Architecture

  1. File Share: SMB share within a GPv2 or Premium FileStorage storage account
  2. Sync Service: Servers register to one only; they can then belong to many Sync Groups
    -We create Sync Groups so we can say, what servers will have access to what shares
    -Sync Groups are bound to one share
  3. Endpoints
    -Cloud Endpoint: Azure Files share
    –One Sync Group can only have one Cloud Endpoint
    -Server Endpoint: Local folder. For servers to synchronize data, you create local endpoints (folder, volume, root directory). As long as the server is registered to the Sync Service that contains the Sync Group, you can add an endpoint to that Sync Group and the sync will take place
    –Servers can belong to multiple Sync Groups
    –One server can only be registered to one Sync Service
    –You can’t have multiple server endpoints that point to the same server for the same Sync Group
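
The registration and endpoint rules above can be checked with a small in-memory model. This is a sketch for reasoning about the topology, not the Azure File Sync API; every class, method, server, and share name in it is hypothetical.

```python
class SyncTopologyError(Exception):
    """Raised when one of the topology rules is violated."""


# Hypothetical global registry: server name -> Sync Service it is registered to.
_registrations = {}


class SyncGroup:
    def __init__(self, name, cloud_share):
        self.name = name
        self.cloud_endpoint = cloud_share  # exactly one cloud endpoint per group
        self.server_endpoints = {}         # server name -> local path


class SyncService:
    def __init__(self, name):
        self.name = name
        self.groups = {}

    def register_server(self, server):
        # A server can be registered to only one Sync Service.
        current = _registrations.get(server)
        if current not in (None, self.name):
            raise SyncTopologyError(f"{server} is already registered to {current}")
        _registrations[server] = self.name

    def create_group(self, name, cloud_share):
        self.groups[name] = SyncGroup(name, cloud_share)
        return self.groups[name]

    def add_server_endpoint(self, group_name, server, local_path):
        if _registrations.get(server) != self.name:
            raise SyncTopologyError(f"{server} is not registered to {self.name}")
        group = self.groups[group_name]
        # One server cannot have two endpoints in the same Sync Group.
        if server in group.server_endpoints:
            raise SyncTopologyError(f"{server} already has an endpoint in {group_name}")
        group.server_endpoints[server] = local_path


service = SyncService("corp-sync")
service.register_server("branch-fs01")
service.create_group("finance", cloud_share="finance-share")
service.create_group("hr", cloud_share="hr-share")

# The same server may join several Sync Groups, one endpoint per group.
service.add_server_endpoint("finance", "branch-fs01", r"D:\Finance")
service.add_server_endpoint("hr", "branch-fs01", r"D:\HR")
print(sorted(service.groups))
```

Trying to add a second `branch-fs01` endpoint to `finance`, or to register `branch-fs01` with another Sync Service, raises `SyncTopologyError` in this model.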
14
Q

Azure Storage Data Transfer Tools

A

Import/Export
Purpose: Move large volumes of data

  1. Customer Disk(s): Support one or more physical disks (2.5” or 3.5” SATA HDD or SSD)
  2. Supported Services: Supports Blobs (import/export) and Files (import only)
  3. Process: Disks are prepared using a Windows tool (WAImportExport). Manage the job through the Portal
    -You are still sending physical disks to Microsoft
    -You specify what service you’re working with, where it’s geographically based, what region and what storage account you are using
    -If you are importing, you are going to provide the Journal File to Microsoft

Data Box
Purpose: Move large volumes of data

  1. Data Box: Data Box/Disk/Heavy (offline) and Gateway (online) appliances
  2. Supported Services: Blobs (block/page), Managed Disks, Azure Files, ADLS Gen2
  3. Process: Order the device (for import or export), connect and use it locally, then return it
    -It supports NFS and SMB

AzCopy
Purpose: Manage data across different platforms

  1. AzCopy Tool: Cross-platform (Windows/Mac/Linux) command line tool
  2. Supported Services: Blobs and Files (was also Tables, until Cosmos DB team took over)
  3. Process: Authenticate with azcopy then upload/download blobs/files