Week 5: Data Storage Part 1: Cloud-based object storage and file systems Flashcards
What is an Instance Store?
A form of cloud-hosted, block-level storage that is directly attached to the physical host running your virtual machine (VM).
High Throughput: Because the storage is physically attached (often via NVMe or SATA), instance stores can offer higher throughput and lower latency than network-attached storage.
What are the different types of instance stores and how do they vary in size?
SSD-based instance stores: Often range from 80 GB to 320 GB.
HDD-based options: Can provide up to around 1.5 TB.
Some high-end instances (e.g., x1.32xlarge) may offer nearly 4 TB of storage.
What is the Lifetime of an Instance Store and why is it considered ephemeral?
- Data in an instance store persists only for the duration of the instance’s life on that particular physical host.
- If the instance is stopped, terminated, or moved to a different physical machine, the data is lost.
How Does Storage Virtualization Work in Cloud Block Storage?
- Transforms the raw physical storage into a logical view, enabling multiple physical disks to be managed as a single resource.
- The physical storage (hard drives/SSDs) is abstracted by the operating system’s file system through a virtual layer, which then presents a unified logical file system view.
- This abstraction allows storage devices to be pooled, resized, or reallocated dynamically, and it is fundamental to cloud block storage systems.
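As an illustration only (not part of the course material), here is a toy Python sketch of the logical-to-physical mapping idea: several physical disks are pooled and presented as one logical device, and each logical block number is translated into a (disk, offset) pair. The class and sizes are made up for illustration.

```python
# Toy model (not a real storage stack): a "logical volume" that pools several
# physical disks and maps logical block numbers onto (disk, block-within-disk).
class LogicalVolume:
    def __init__(self, disk_sizes_in_blocks):
        # Each entry is the capacity of one physical disk, in blocks.
        self.disks = disk_sizes_in_blocks

    def total_blocks(self):
        # The pooled capacity presented as a single logical device.
        return sum(self.disks)

    def locate(self, logical_block):
        # Translate a logical block number into (disk index, block within that disk).
        if logical_block >= self.total_blocks():
            raise IndexError("logical block out of range")
        for disk_index, size in enumerate(self.disks):
            if logical_block < size:
                return disk_index, logical_block
            logical_block -= size

# Three physical disks appear as one 3,000-block logical volume.
vol = LogicalVolume([1000, 1000, 1000])
print(vol.total_blocks())  # 3000
print(vol.locate(1500))    # (1, 500): block 1500 lives on the second disk
```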
What Is NVMe over Fabric (NVMe-oF)?
A protocol that extends the NVMe protocol (typically used for fast SSD access) over a network fabric (such as Ethernet or InfiniBand).
How Does NVMe-oF Impact Virtual Block Stores?
Performance Enhancement: Provides near-local storage speeds over the network by reducing latency and increasing throughput.
Bandwidth Considerations: While it improves access speeds, overall performance is still limited by the network’s bandwidth, unlike directly attached instance storage.
How Does EBS Storage Differ from Instance Stores in Terms of Persistence?
EBS (Elastic Block Store):
Persistent: Data stored in EBS volumes remains intact even after an instance is stopped or terminated.
Detach/Attach: Volumes can be detached from one instance and reattached to another.
Instance Stores:
Ephemeral: Data is lost once the instance is stopped, terminated, or moved to a different physical host.
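To make the detach/attach point concrete, a minimal sketch using the boto3 EC2 client; the volume, instance, and device identifiers are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder EBS volume ID

# Detach a persistent EBS volume from one instance...
ec2.detach_volume(VolumeId=VOLUME_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# ...and attach the same volume (and its data) to a different instance.
ec2.attach_volume(
    VolumeId=VOLUME_ID,
    InstanceId="i-0fedcba9876543210",  # placeholder instance ID
    Device="/dev/sdf",
)
```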
How Does EBS Storage Differ from Instance Stores in Terms of Bandwidth?
Instance Stores: Generally offer higher bandwidth due to being directly attached to the physical machine.
EBS Volumes: Accessed over a network. While they may have lower baseline throughput compared to instance stores, advanced features (e.g., dedicated EBS bandwidth on optimized instances) and provisioning options can deliver high performance.
What Factors Influence the Performance of EBS Volumes?
IOPS and Throughput: Defined by the volume type and the level of provisioned IOPS (input/output operations per second).
Volume Type: Different types (SSD vs. magnetic HDD) offer varying performance levels.
Instance Type: Some instances have dedicated EBS bandwidth, whereas others share bandwidth with network traffic.
Burst Capabilities: Many EBS volumes can burst beyond their baseline IOPS for short durations.
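As a sketch of how the volume type and provisioned IOPS are specified (the same `VolumeType` parameter also selects gp2, st1, sc1, and so on), using boto3 with placeholder values:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision an io1 volume with an explicit IOPS level; the volume type and the
# provisioned IOPS are two of the performance factors listed above.
response = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,          # GiB
    VolumeType="io1",
    Iops=5000,         # provisioned IOPS (io1 allows up to 50 IOPS per GiB)
)
print(response["VolumeId"])
```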
What Factors Influence the Cost of EBS Volumes?
Storage Capacity: Larger volumes cost more.
Provisioned IOPS: Higher performance configurations (like io1) incur additional charges.
Volume Type: SSD-based volumes are typically more expensive than magnetic (HDD-based) options.
Usage Patterns: Sustained high IOPS or throughput needs can increase overall cost.
What Happens to Data in an Instance Store When an Instance Is Stopped or Terminated?
When an instance that uses instance store storage is stopped, terminated, or relocated, all data stored in the instance store is lost.
This inherent volatility is why instance stores are best used only for temporary data like caches or scratch space.
How Can Reliability Be Improved When Using Instance Stores?
Data Replication: Implement a distributed file system (e.g., HDFS) that replicates data across multiple instance stores.
Periodic Backups: Regularly back up data from instance stores to persistent storage systems like EBS or S3.
Redundancy: Deploy multiple instances so that the failure of one does not result in complete data loss.
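A minimal sketch of the "periodic backups" idea: copy files from an instance-store mount point to S3 with boto3. The mount point and bucket name are placeholders, and in practice this would run on a schedule (e.g., via cron).

```python
import os
import boto3

s3 = boto3.client("s3")
SCRATCH_DIR = "/mnt/instance-store/scratch"  # placeholder instance-store mount point
BUCKET = "my-backup-bucket"                  # placeholder S3 bucket name

# Walk the ephemeral instance-store directory and copy each file to durable S3 storage.
for root, _dirs, files in os.walk(SCRATCH_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, SCRATCH_DIR)
        s3.upload_file(Filename=local_path, Bucket=BUCKET, Key=f"backups/{key}")
```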
What Are the Key Differences Between Amazon EBS and Instance Stores?
Amazon EBS:
- Persistent: Data remains even after instance termination.
- Network-Attached: Volumes are accessed over the network, offering flexibility (attach/detach) and additional features like encryption and snapshots.
- Varied Performance: Options range from general-purpose to high IOPS configurations.
Instance Stores:
- Ephemeral: Data is tied to the life of the instance and is lost on stopping/termination.
- Directly Attached: Generally provide higher bandwidth and lower latency due to physical attachment.
- Limited Capacity: Often smaller in size compared to what can be provisioned with EBS.
What Are the Main Types of Amazon EBS Volumes, and How Do They Differ?
- gp2 (General Purpose SSD): Balanced Performance: Offers a baseline IOPS with the ability to burst; suitable for a wide range of workloads.
- io1 (Provisioned IOPS SSD): High Performance: Allows you to provision a specific level of IOPS and throughput, ideal for I/O-intensive applications.
- st1 (Throughput Optimized HDD): High Throughput for Sequential Workloads: Best for applications like log processing or big data workloads that require high data transfer rates.
- sc1 (Cold HDD): Cost-Effective: Lower-cost option for infrequently accessed data with lower performance compared to st1.
Which EBS Volume Type Is Best Suited for High-Throughput Workloads Like Log Processing?
Throughput Optimized HDD (st1) is designed for high-throughput, sequential workloads (such as log processing, ETL jobs, or large-scale data analytics) where sustained data transfer is crucial.
What is Object Storage?
Object storage stores data as self-contained objects, each with metadata and a unique key inside a “bucket” or container. There is no hierarchical file structure (no folders or subdirectories in the backend) – you simply work with objects via APIs.
How Does Object Storage Differ from File and Block Storage?
File Storage: Uses a directory structure; supports granular operations like file locking and partial reads/writes.
Block Storage: Provides raw disk-like storage segmented into blocks (e.g., disk sectors); typically used for virtual machine file systems or databases.
Object Storage: Simplifies access (using GET, PUT, DELETE), lacks traditional file system features (e.g., file locking), and is highly scalable.
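A short boto3 sketch of the simplified object interface (PUT, GET, DELETE by key); the bucket name and key are placeholders, and the "/" in the key is just part of the name, not a real directory.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # placeholder bucket name

# PUT: store an object under a key (no folders; the "/" is part of the key).
s3.put_object(Bucket=BUCKET, Key="photos/2024/cat.jpg", Body=b"...binary data...")

# GET: retrieve the whole object by key.
obj = s3.get_object(Bucket=BUCKET, Key="photos/2024/cat.jpg")
data = obj["Body"].read()

# DELETE: remove the object.
s3.delete_object(Bucket=BUCKET, Key="photos/2024/cat.jpg")
```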
What are the Key Properties of Cloud Object Storage: Scalability, Availability, and Consistency?
Scalability:
Object storage can scale to virtually unlimited capacity because data is stored as discrete objects. You pay only for the storage you use.
Availability & Durability:
Data is distributed across multiple data centers or availability zones, providing high durability (often quoted as “11 nines”) and availability (frequently around 99.99% uptime).
Consistency:
Modern cloud providers generally offer strong (read-after-write) consistency for individual objects, ensuring that once an object is updated, subsequent reads return the latest version.
What are Common Use Cases for Object Storage?
Data Lakes & Analytics: Storing massive raw data sets for processing and analysis.
Backup & Archival: Due to its cost-effectiveness and durability, it’s ideal for backups and long-term retention.
Static Website Hosting & Media Streaming: Serving images, HTML, videos, and other media files.
Mobile & Web Applications: Storing user-generated content (photos, videos, documents).
What is the Impact of Storage Classes?
Hot/Standard: For frequently accessed data with low latency retrieval but higher cost.
Cool/Infrequent Access: Lower cost for data accessed less often, with potential retrieval fees and slight delays.
Archive (Glacier): Very low storage cost for rarely accessed data; retrieval times can range from minutes (expedited) to hours or even days (bulk or deep archive).
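Using S3 as a concrete example, a storage class can be chosen per object at upload time; the bucket name, key, and payload below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object directly into a cheaper, infrequent-access storage class.
s3.put_object(
    Bucket="example-bucket",        # placeholder bucket name
    Key="archive/2023-report.pdf",
    Body=b"...report contents...",  # placeholder payload
    StorageClass="STANDARD_IA",     # other values include "GLACIER" and "DEEP_ARCHIVE"
)
```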
What is the AWS S3 Structure?
Buckets and Objects: Data is stored as objects within buckets, with each object identified by a unique key.
Replication: S3 automatically replicates objects across multiple availability zones within a region (One Zone storage classes are the exception) to ensure high durability and availability.
What is AWS S3’s Management for Durability and Availability?
S3’s design supports 11 nines of durability (extremely low risk of data loss) and high availability (designed for around 99.99% uptime).
The system uses strong consistency for read-after-write operations, enhancing reliability for applications.
How Do the AWS S3 Storage Classes Compare?
Primary Storage Classes:
S3 Standard: For frequently accessed data, offering high availability and millisecond retrieval.
S3 Standard-Infrequent Access (IA): Lower cost for data accessed less often, with a 30-day minimum storage duration.
One Zone-IA: Lower cost than standard IA but stored in a single availability zone, hence less resilient.
Intelligent Tiering: Automatically moves objects between tiers based on access patterns.
Archival Classes:
S3 Glacier Instant Retrieval: Provides millisecond access for archival data that might require quick recovery.
S3 Glacier Flexible Retrieval: Suitable for archives where retrieval can take minutes to hours.
S3 Glacier Deep Archive: Lowest cost with retrieval times from 12 to 48 hours.
Lifecycle Management:
These classes can be managed via lifecycle policies, which automatically transition objects between tiers based on rules (e.g., age or access frequency).
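A sketch of such a lifecycle policy with boto3: objects under a prefix transition to Standard-IA after 30 days and to Deep Archive after 180 days. The bucket name, rule ID, and prefix are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle rule: tier down aging objects under the "logs/" prefix.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-data",       # placeholder rule ID
                "Filter": {"Prefix": "logs/"},    # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```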
What are AWS S3 Security Features?
IAM and Bucket Policies: Control access by defining who can access or modify objects.
Encryption: Supports server-side encryption (SSE-S3 and SSE-KMS) and client-side encryption to secure data at rest; data in transit is protected via HTTPS/TLS.
Access Points: Allow fine-grained access control tailored to different applications or user groups.
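As one concrete example of the encryption feature, a boto3 sketch that sets SSE-KMS as the default encryption for all new objects in a bucket; the bucket name and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Enable default server-side encryption (SSE-KMS) for every new object in the bucket.
s3.put_bucket_encryption(
    Bucket="example-bucket",  # placeholder bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/my-key",  # placeholder KMS key alias
                }
            }
        ]
    },
)
```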