Week 5: Data Storage Part 1: Cloud-based object storage and file systems Flashcards

1
Q

What is an Instance Store?

A

A form of cloud-hosted, block-level storage that is directly attached to the physical host running your virtual machine (VM).

High Throughput: Because the storage is physically attached (often using NVMe or SATA), instance stores can offer higher data transfer rates than network-attached volumes.

2
Q

What are the different types of instance stores and how do they vary in size?

A

SSD-based instance stores: Often range from 80 GB to 320 GB.
HDD-based options: Can provide up to around 1.5 TB.
Some high-end instances (e.g., x1.32xlarge) may offer nearly 4 TB of storage.

3
Q

What is the Lifetime of an Instance Store and why is it considered ephemeral?

A
  • Data in an instance store persists only for the duration of the instance’s life on that particular physical host.
  • If the instance is stopped, terminated, or moved to a different physical machine, the data is lost.
4
Q

How Does Storage Virtualization Work in Cloud Block Storage?

A
  • Transforms the raw physical storage into a logical view, enabling multiple physical disks to be managed as a single resource.
  • The physical storage (hard drives/SSDs) is abstracted by the operating system’s file system through a virtual layer, which then presents a unified logical file system view.
  • This abstraction allows storage devices to be pooled, resized, or reallocated dynamically, and it is fundamental to cloud block storage systems.
5
Q

What Is NVMe over Fabric (NVMe-oF)?

A

A protocol that extends the NVMe protocol (typically used for fast SSD access) over a network fabric (such as Ethernet or InfiniBand).

6
Q

How Does NVMe-oF Impact Virtual Block Stores?

A

Performance Enhancement: Provides near-local storage speeds over the network by reducing latency and increasing throughput.

Bandwidth Considerations: While NVMe-oF improves access speeds, overall performance is still constrained by the network's bandwidth, unlike directly attached instance storage.

7
Q

How Does EBS Storage Differ from Instance Stores in Terms of Persistence?

A

EBS (Elastic Block Store):
Persistent: Data stored in EBS volumes remains intact even after an instance is stopped or terminated.
Detach/Attach: Volumes can be detached from one instance and reattached to another.
Instance Stores:
Ephemeral: Data is lost once the instance is stopped, terminated, or moved to a different physical host.

8
Q

How Does EBS Storage Differ from Instance Stores in Terms of Bandwidth?

A

Instance Stores: Generally offer higher bandwidth due to being directly attached to the physical machine.
EBS Volumes: Accessed over a network. While they may have lower baseline throughput compared to instance stores, advanced features (e.g., dedicated EBS bandwidth on optimized instances) and provisioning options can deliver high performance.

9
Q

What Factors Influence the Performance of EBS Volumes?

A

IOPS and Throughput: Defined by the volume type and the level of provisioned IOPS (input/output operations per second).
Volume Type: Different types (SSD vs. magnetic HDD) offer varying performance levels.
Instance Type: Some instances have dedicated EBS bandwidth, whereas others share bandwidth with network traffic.
Burst Capabilities: Many EBS volumes can burst beyond their baseline IOPS for short durations.

10
Q

What Factors Influence the Cost of EBS Volumes?

A

Storage Capacity: Larger volumes cost more.
Provisioned IOPS: Higher performance configurations (like io1) incur additional charges.
Volume Type: SSD-based volumes are typically more expensive than magnetic (HDD-based) options.
Usage Patterns: Sustained high IOPS or throughput needs can increase overall cost.

11
Q

What Happens to Data in an Instance Store When an Instance Is Stopped or Terminated?

A

When an instance that uses instance store storage is stopped, terminated, or relocated, all data stored in the instance store is lost.
This inherent volatility is why instance stores are best used only for temporary data like caches or scratch space.

12
Q

How Can Reliability Be Improved When Using Instance Stores?

A

Data Replication: Implement a distributed file system (e.g., HDFS) that replicates data across multiple instance stores.
Periodic Backups: Regularly back up data from instance stores to persistent storage systems like EBS or S3 (see the sketch after this list).
Redundancy: Deploy multiple instances so that the failure of one does not result in complete data loss.
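
A minimal sketch of the periodic-backup idea, assuming boto3 and an instance store mounted at a hypothetical path; the bucket name and directory are placeholders, not part of the course material.

```python
import os
import boto3

# Hypothetical names: adjust to your own bucket and instance-store mount point.
BUCKET = "my-backup-bucket"
SCRATCH_DIR = "/mnt/instance-store/scratch"

s3 = boto3.client("s3")

def backup_scratch_to_s3():
    """Upload every file under the scratch directory to S3 (persistent storage)."""
    for root, _dirs, files in os.walk(SCRATCH_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            # The key preserves the relative path so the backup mirrors the directory tree.
            key = os.path.relpath(local_path, SCRATCH_DIR)
            s3.upload_file(local_path, BUCKET, key)

if __name__ == "__main__":
    backup_scratch_to_s3()
```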

13
Q

What Are the Key Differences Between Amazon EBS and Instance Stores?

A

Amazon EBS:
- Persistent: Data remains even after instance termination.
- Network-Attached: Volumes are accessed over the network, offering flexibility (attach/detach) and additional features like encryption and snapshots.
- Varied Performance: Options range from general-purpose to high IOPS configurations.
Instance Stores:
- Ephemeral: Data is tied to the life of the instance and is lost on stopping/termination.
- Directly Attached: Generally provide higher bandwidth and lower latency due to physical attachment.
- Limited Capacity: Often smaller in size compared to what can be provisioned with EBS.

14
Q

What Are the Main Types of Amazon EBS Volumes, and How Do They Differ?

A
  • gp2 (General Purpose SSD): Balanced Performance: Offers a baseline IOPS with the ability to burst; suitable for a wide range of workloads.
  • io1 (Provisioned IOPS SSD): High Performance: Allows you to provision a specific level of IOPS and throughput, ideal for I/O-intensive applications.
  • st1 (Throughput Optimized HDD): High Throughput for Sequential Workloads: Best for applications like log processing or big data workloads that require high data transfer rates.
  • sc1 (Cold HDD): Cost-Effective: Lower-cost option for infrequently accessed data with lower performance compared to st1.
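
A minimal sketch of how these volume types map onto the EC2 CreateVolume API via boto3; the region, Availability Zone, and sizes are illustrative assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
AZ = "us-east-1a"  # illustrative Availability Zone

# gp2: general-purpose SSD; baseline IOPS scales with size and can burst.
gp2 = ec2.create_volume(AvailabilityZone=AZ, Size=100, VolumeType="gp2")

# io1: provisioned IOPS SSD; you explicitly request the IOPS level you need.
io1 = ec2.create_volume(AvailabilityZone=AZ, Size=200, VolumeType="io1", Iops=5000)

# st1: throughput-optimized HDD for large, sequential workloads.
st1 = ec2.create_volume(AvailabilityZone=AZ, Size=500, VolumeType="st1")

# sc1: cold HDD for infrequently accessed data at the lowest HDD cost.
sc1 = ec2.create_volume(AvailabilityZone=AZ, Size=500, VolumeType="sc1")

print([v["VolumeId"] for v in (gp2, io1, st1, sc1)])
```
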
15
Q

Which EBS Volume Type Is Best Suited for High-Throughput Workloads Like Log Processing?

A

Throughput Optimized HDD (st1) is designed for high-throughput, sequential workloads (such as log processing, ETL jobs, or large-scale data analytics) where sustained data transfer is crucial.

16
Q

What is Object Storage?

A

Object storage stores data as self-contained objects, each with metadata and a unique key inside a “bucket” or container. There is no hierarchical file structure (no folders or subdirectories in the backend) – you simply work with objects via APIs.

17
Q

How Does Object Storage Differ from File and Block Storage?

A

File Storage: Uses a directory structure; supports granular operations like file locking and partial reads/writes.
Block Storage: Provides raw disk-like storage segmented into blocks (e.g., disk sectors); typically used for virtual machine file systems or databases.
Object Storage: Simplifies access (using GET, PUT, DELETE), lacks traditional file system features (e.g., file locking), and is highly scalable.
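
A minimal sketch of the GET/PUT/DELETE access model using boto3 against S3; the bucket name and key are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"          # assumed to exist already
KEY = "reports/2024/summary.txt"   # "/" is just part of the key, not a real directory

# PUT: store an object under a key.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"hello object storage")

# GET: retrieve the whole object (no partial in-place edits as in a file system).
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
print(body)

# DELETE: remove the object.
s3.delete_object(Bucket=BUCKET, Key=KEY)
```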

18
Q

What are the Key Properties of Cloud Object Storage (Scalability, Availability, and Consistency)?

A

Scalability:
Object storage can scale to virtually unlimited capacity because data is stored as discrete objects. You pay only for the storage you use.

Availability & Durability:
Data is distributed across multiple data centers or availability zones, providing high durability (often quoted as “11 nines”) and availability (frequently around 99.99% uptime).

Consistency:
Modern cloud providers generally offer strong (read-after-write) consistency for individual objects, ensuring that once an object is updated, subsequent reads return the latest version.

19
Q

What are Common Use Cases for Object Storage?

A

Data Lakes & Analytics: Storing massive raw data sets for processing and analysis.
Backup & Archival: Due to its cost-effectiveness and durability, it’s ideal for backups and long-term retention.
Static Website Hosting & Media Streaming: Serving images, HTML, videos, and other media files.
Mobile & Web Applications: Storing user-generated content (photos, videos, documents).

20
Q

What is the Impact of Storage Classes?

A

Hot/Standard: For frequently accessed data with low latency retrieval but higher cost.
Cool/Infrequent Access: Lower cost for data accessed less often, with potential retrieval fees and slight delays.
Archive (Glacier): Very low storage cost for rarely accessed data; retrieval times can range from minutes (expedited) to hours or even days (bulk or deep archive).
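
A minimal sketch of selecting a storage class at upload time with boto3; the bucket and keys are assumed for illustration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket

# Hot/Standard: default class, frequently accessed data.
s3.put_object(Bucket=BUCKET, Key="hot/data.csv", Body=b"...", StorageClass="STANDARD")

# Cool/Infrequent Access: cheaper storage, but retrieval fees apply.
s3.put_object(Bucket=BUCKET, Key="cool/data.csv", Body=b"...", StorageClass="STANDARD_IA")

# Archive: lowest storage cost; objects must be restored before they can be read.
s3.put_object(Bucket=BUCKET, Key="archive/data.csv", Body=b"...", StorageClass="DEEP_ARCHIVE")
```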

21
Q

What is the AWS S3 Structure?

A

Buckets and Objects: Data is stored as objects within buckets, with each object identified by a unique key.
Replication: S3 automatically replicates objects across multiple availability zones within a region to ensure high durability and availability.

22
Q

What is AWS S3’s Management for Durability and Availability?

A

S3’s design supports 11 nines of durability (extremely low risk of data loss) and high availability (with SLAs often around 99.99%).
The system uses strong consistency for read-after-write operations, enhancing reliability for applications.

23
Q

AWS S3 Storage Classes Comparison

A

Primary Storage Classes:
S3 Standard: For frequently accessed data, offering high availability and millisecond retrieval.
S3 Standard-Infrequent Access (IA): Lower cost for data accessed less often, with a 30-day minimum storage duration.
One Zone-IA: Lower cost than standard IA but stored in a single availability zone, hence less resilient.
Intelligent Tiering: Automatically moves objects between tiers based on access patterns.

Archival Classes:
S3 Glacier Instant Retrieval: Provides millisecond access for archival data that might require quick recovery.
S3 Glacier Flexible Retrieval: Suitable for archives where retrieval can take minutes to hours.
S3 Glacier Deep Archive: Lowest cost with retrieval times from 12 to 48 hours.

Lifecycle Management:
These classes can be managed via lifecycle policies, which automatically transition objects between tiers based on rules (e.g., age or access frequency).

24
Q

What are AWS S3 Security Features?

A

IAM and Bucket Policies: Control access by defining who can access or modify objects.
Encryption: Supports server-side encryption (SSE-S3 and SSE-KMS) and client-side encryption to secure data at rest and in transit.
Access Points: Allow fine-grained access control tailored to different applications or user groups.
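
A minimal sketch, assuming boto3 and an existing bucket, of two of the controls above: default server-side encryption (SSE-KMS) and a bucket policy that denies non-TLS access.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket name

# Turn on default server-side encryption (SSE-KMS) for all new objects.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Bucket policy: deny any request that is not made over TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```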

25
Q

What are AWS S3 Cost Optimization Best Practices?

A

Lifecycle Policies: Automate moving objects to lower-cost storage classes.
Intelligent Tiering: Helps dynamically optimize costs by moving data based on access patterns.
Data Transfer Considerations: Optimize application design to minimize unnecessary data transfer (data egress costs).
Monitoring: Use CloudWatch and CloudTrail to track access patterns and unexpected charges.

26
Q

What is AWS S3 Glacier?

A

- Designed for long-term data retention at low cost.
- Best for data that is infrequently accessed, such as compliance records, logs, or historical media files.

27
Q

What are the AWS S3 Glacier Retrieval Options for Archival Storage?

A

Expedited Retrieval: Access data within minutes (ideal for urgent needs).
Standard Retrieval: Typically takes a few hours.
Bulk Retrieval: Suitable for large-scale recoveries where time is less critical.
Glacier Deep Archive: Offers even lower storage costs with retrieval times of up to 48 hours.
(A retrieval sketch follows below.)

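A minimal sketch of requesting a restore of an archived object with boto3, with the retrieval tier chosen from the options above; the bucket and key are assumptions.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-archive-bucket"   # assumed bucket
KEY = "logs/2019/app.log.gz"        # assumed archived object

# Ask S3 to restore a Glacier-class object for 7 days; Tier picks the retrieval speed:
# "Expedited" (minutes), "Standard" (hours), or "Bulk" (cheapest, slowest).
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# Poll the object's metadata to see whether the temporary copy is ready yet.
head = s3.head_object(Bucket=BUCKET, Key=KEY)
print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while the restore is running
```
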
28
Q

How do lifecycle policies manage data transitions between S3 storage tiers?

A

- Rules that automate the transition of objects between storage classes based on object age or access patterns.
- They optimize cost and performance by keeping frequently accessed data in higher-cost tiers and moving older data to less expensive archival tiers.
- They are often configured to move data from S3 Standard to Infrequent Access or Glacier after a set period (e.g., 30 or 60 days); see the configuration sketch below.

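A minimal sketch of such a lifecycle rule configured with boto3 (30 days to Standard-IA, 90 days to Glacier); the bucket name and rule ID are assumptions.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket

# Transition objects to Standard-IA after 30 days and to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-old-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply the rule to every object in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```
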
29
Q

How does AWS S3 Glacier pricing compare to other storage options?

A

Storage Cost: Glacier tiers are among the lowest in cost compared to S3 Standard, Standard-IA, and other AWS storage options (like EFS or EBS).
Retrieval Fees: Although storage is cheap, you incur costs on data retrieval, and these fees vary by retrieval option (expedited, standard, bulk).
Cost Trade-offs: Ideal for large volumes of data that are seldom accessed, making it an attractive option for long-term archival where retrieval speed is not critical.

30
Q

What is a storage gateway, and why is it important for hybrid cloud environments?

A

- A storage gateway acts as a bridge between on-premises infrastructure and cloud storage, facilitating data transfer, synchronization, and hybrid cloud setups.
- It enables seamless integration of on-premises systems with cloud-based storage.
- It supports data migration, backup, and disaster recovery scenarios where both local and cloud storage are used.

31
Q

Clustered File System vs. Traditional Network File System

A

Clustered (or Network) File System:
- Presents a unified, local-like view of files and directories across many networked machines.
- Allows multiple servers (or nodes) to work together so that users see the same set of operations (e.g., mount, read, write) regardless of the underlying distribution of data.

Traditional Network File System:
- A single server typically manages file operations for multiple clients.
- This limits scalability and may offer less robust concurrency control compared to a clustered file system, which is designed to scale out with multiple servers sharing the workload.

32
Q

Challenges of Maintaining Consistency, Availability, and Partition Tolerance (CAP)

A

Consistency: Ensuring every read reflects the most recent write is challenging when data is spread across nodes. Techniques such as file locking and fencing (preventing concurrent writes) are used to avoid conflicts.
Availability: The system must always respond to user requests, yet distributed setups can introduce delays or failures if nodes become overloaded or unreachable.
Partition Tolerance: In the event of network issues that split the system into partitions, the file system must still operate reliably.
Balancing these three factors is nontrivial because, as per the CAP theorem, a distributed system can fully achieve at most two of these guarantees simultaneously.

33
Q

Benefits of Cloud-Managed File Systems over DIY Approaches

A

Reduced Operational Complexity: Rolling out your own clustered distributed file system requires specialized network engineering, continuous management of hardware, and careful tuning of CAP trade-offs. Cloud-managed services relieve you of this burden.
Automatic Scaling and Maintenance: Managed file systems automatically adjust capacity, offer redundancy, and handle failover without needing manual intervention.
Cost and Resource Efficiency: While DIY solutions might seem appealing for custom configurations, the overhead in time, expertise, and ongoing maintenance can be far greater than using a cloud service with pay-as-you-use pricing.

34
Q

Implementations of File Systems by Cloud Providers

A

AWS:
- AWS FSx: Offers options such as FSx for Lustre (high-performance scratch storage) and FSx for Windows File Server (leveraging Microsoft DFS for namespace management).
- AWS EFS (Elastic File System): A fully managed, NFSv4-based file system designed for scalable, concurrent access across multiple EC2 instances and Availability Zones.

Azure:
- Azure Files: Provides managed file storage based on Windows File Server technology, accessible via SMB and REST APIs.
- Azure Data Lake Storage Gen2: Supports Hadoop-compatible workflows.

Google Cloud:
- Filestore: Offers a managed file system solution that typically provisions a single large machine with defined limits (e.g., up to 64 TB), suitable for workloads that do not require a fully distributed backend.

35
Q

What are the key design goals of distributed file systems?

A

- Access Transparency: Users interact with the file system as if it were local.
- Location Transparency: Users do not need to know where data physically resides.
- Concurrency Transparency: Multiple users can access and modify files concurrently without conflict.

36
Q

How do distributed file systems ensure transparency for users?

A

These systems hide the complexity of data distribution, replication, and network communication behind standard file system operations (e.g., POSIX compliance). This makes distributed file systems feel like local file systems, despite being backed by multiple nodes.

37
Q

What is the primary motivation for AWS EFS?

A

AWS EFS was created to bridge the gap between object storage and block storage by providing a fully managed, scalable file system that supports concurrent access via the NFS protocol.

38
Q

How does EFS differ from S3 and EBS?

A

- S3 is an object store (now offering strong read-after-write consistency) and does not provide file system semantics such as hierarchical directories or file locking.
- EBS offers low-latency block storage but is typically confined to a single EC2 instance within one Availability Zone.
- EFS is accessible concurrently by multiple instances and automatically spans multiple AZs for higher availability (see the provisioning sketch below).

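A minimal sketch, assuming boto3, of creating an EFS file system and one mount target per Availability Zone so multiple instances can mount it concurrently; the subnet and security group IDs are placeholders.

```python
import boto3

efs = boto3.client("efs")

# Create the file system; capacity is not provisioned up front, it grows on demand.
fs = efs.create_file_system(PerformanceMode="generalPurpose", Encrypted=True)
fs_id = fs["FileSystemId"]

# One mount target per Availability Zone lets instances in each AZ mount the same
# file system over NFS (the subnet and security group IDs below are placeholders).
for subnet in ["subnet-aaa111", "subnet-bbb222"]:
    efs.create_mount_target(
        FileSystemId=fs_id,
        SubnetId=subnet,
        SecurityGroups=["sg-0123456789abcdef0"],
    )

# On each EC2 instance the file system is then mounted with an NFSv4 client, e.g.:
#   mount -t nfs4 <fs_id>.efs.<region>.amazonaws.com:/ /mnt/efs
```
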
39
Q

How does AWS EFS ensure high availability compared to other AWS storage services?

A

AWS EFS automatically stores data redundantly across multiple Availability Zones. This means that if one zone experiences issues, the data remains accessible from another zone.

40
Q

How does AWS EFS ensure high durability compared to other AWS storage services?

A

Data is replicated redundantly across multiple Availability Zones, so the failure of a single zone does not result in data loss. In addition, the architecture allows many EC2 instances to mount the file system simultaneously, ensuring continuous availability and quick recovery from localized failures.

41
Q

What are the key advantages of AWS EFS, such as scalability and redundancy across Availability Zones?

A

Scalability: EFS grows and shrinks automatically to meet the storage demands of your application without the need for manual provisioning.
High Throughput and Low Latency: Being SSD-backed and distributed, EFS provides consistent performance suitable for high-demand workloads.
Redundancy Across Availability Zones: With built-in multi-AZ data replication, EFS offers superior durability and availability compared to single-AZ services like EBS.
Ease of Management: The service abstracts the underlying complexity of managing distributed storage, allowing you to focus on application development rather than infrastructure maintenance.

42
Q

How does Internet-level personal filesystem storage differ from traditional object storage like AWS S3?

A

Traditional Object Storage (e.g., AWS S3):
- Designed to store and serve objects (files) in a flat structure (buckets).
- Optimized for high-scale applications where millions of users access data (e.g., hosting static websites).
- Lacks native support for synchronizing a user's local file system or advanced features like versioning or recovery tailored for personal use.
- Typically a lower-level service used by developers as a building block.

Internet-Level Personal Filesystem Storage:
- Built to serve end users who want to access and synchronize files across a limited number of personal devices (laptops, phones, tablets).
- Offers a rich set of features that add value beyond simple file storage (see the next card).
- Often built on top of underlying object storage but adds file system semantics and user-friendly interfaces.

43
Q

What are key features of Internet-level personal filesystem storage, such as synchronization and data recovery?

A

File Synchronization: Automatically synchronizes file changes across devices. Uses techniques like block-level copying to transfer only the changed portions of files, improving efficiency (see the sketch below).
Sync Throttling: Allows users to control how much bandwidth is used during synchronization, which is especially useful on limited connections.
Data Recovery & Versioning: Maintains historical versions of files (e.g., 30 to 180 days of backups) so that accidental changes or deletions can be recovered.
Shared Folders and Links: Enables collaboration by allowing users to share files or entire folders easily.
Local File System Integration & Web Access: Seamless integration with desktop clients and mobile devices, along with web interfaces for file access.
Value-Added Services: Some platforms include features such as text search within documents or even within images, enhancing the utility of stored data.

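A conceptual sketch of block-level change detection (not any provider's actual implementation): hash fixed-size blocks and re-upload only the blocks whose hashes changed. The block size and file names are arbitrary.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks, an arbitrary illustrative size

def block_hashes(path: str) -> list[str]:
    """Hash a file in fixed-size blocks so changed regions can be identified."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Return indices of blocks whose hashes differ; only these need re-uploading."""
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    indices += list(range(len(old), len(new)))  # blocks appended since the last sync
    return indices

# Example: compare the previously synced hash list with the current file state.
# previous = block_hashes("report_last_synced.docx")
# current  = block_hashes("report.docx")
# print(changed_blocks(previous, current))
```
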
44
Q

Why did Dropbox initially use AWS S3, and what motivated its transition to its own data centers in 2016?

A

Initial Use of AWS S3:
- As a startup, Dropbox used AWS S3 for file storage because it offered a low initial cost and allowed rapid deployment.
- S3 provided the scalability needed to support early growth without a large capital expense.

Transition to Own Data Centers (2014–2016):
- Economies of Scale: As Dropbox's user base grew, the cost of storing large volumes of data on S3 became unsustainable given subscription revenue (e.g., storing 2 TB on S3 might cost $46 while revenue was around $15).
- Cold Storage Strategies: Dropbox leveraged the fact that many users rarely access most files, allowing a move to more cost-efficient storage tiers.
- The shift reflects a broader "cloudonomics" strategy where, past a certain scale, owning the infrastructure can be more cost-effective than renting it.

45
Q

How does the Dropbox API work (architecture, definition, use cases, etc.)?

A

Overall Architecture:
- Client Interaction: Users access Dropbox via various clients (desktop, mobile, or web), which communicate with Dropbox's metadata service and file storage system.
- Metadata vs. File Data: Dropbox manages file metadata (small, structured information about files) on its own servers, while the actual file data was initially stored on AWS S3 (and later migrated to Dropbox's own data centers).
- Compute Layer: Tasks such as encoding, decoding, and encryption are handled on cloud compute instances (originally AWS EC2).

Use Cases:
- Integrating file storage and synchronization into third-party applications.
- Building custom workflows that rely on file sharing, version control, and collaboration features.
- Simplifying app development by outsourcing file management and authentication to Dropbox.

46
Q

What are the two levels of Dropbox API access, and how do they differ?

A

Drop-In API (Simplified Integration):
- Chooser: Allows users to select and download files from their Dropbox into an application. Comes with pre-built, cross-platform components that handle file browsing and authentication.
- Saver: Enables users to upload files from an application directly into their Dropbox account. Typically used on web and mobile web interfaces.
- Ideal for quickly adding Dropbox functionality to an app with minimal coding effort.

Core API (Full Programmatic Access):
- Provides comprehensive control over a user's Dropbox account.
- Supports advanced operations such as file upload/download, metadata access, revision history, delta synchronization (tracking file changes), and detailed file sharing permissions.
- Suited for more complex integrations where fine-grained control is needed (see the sketch below).

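A minimal sketch of Core-API-style programmatic access using the official Dropbox Python SDK; the access token, file names, and paths are placeholders, and this is an illustrative sketch rather than a complete integration.

```python
import dropbox

# Placeholder OAuth token; real applications obtain one via Dropbox's OAuth flow.
dbx = dropbox.Dropbox("ACCESS_TOKEN")

# Upload a file into the user's Dropbox.
with open("notes.txt", "rb") as f:
    dbx.files_upload(f.read(), "/notes.txt",
                     mode=dropbox.files.WriteMode.overwrite)

# List folder contents and inspect metadata (names, sizes, revisions).
for entry in dbx.files_list_folder("").entries:
    print(entry.name)

# Download the file back to the local machine.
dbx.files_download_to_file("notes_copy.txt", "/notes.txt")
```
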
47
Q

How does Dropbox store metadata and actual file data using cloud services?

A

Metadata Storage:
- Dropbox maintains detailed metadata (information about file names, sizes, versions, etc.) on its own servers.
- This enables quick searches, file management, and change notifications.

File Data Storage:
- Initially, the actual file data was stored on AWS S3, leveraging S3's object storage capabilities.
- Over time, as Dropbox scaled, it transitioned file storage to its own data centers to better manage costs and optimize performance.

48
Q

Why is Dropbox an especially good example of an internet-level personal file system?

A

User-Centric Design: Built to serve individual users and small groups with seamless synchronization across devices.
Integrated Features: Combines robust synchronization, efficient data recovery, and intuitive sharing features that are critical for everyday file access.
Ecosystem Integration: Dropbox's APIs (both Drop-In and Core) allow for rapid third-party integration, extending its functionality into other applications and services.
Evolution Driven by Scale: Its journey from using third-party storage (AWS S3) to owning its own data centers is a clear example of adapting infrastructure to meet growing demand and cost-efficiency.