Week 5: Data Storage Part 1: Cloud-based object storage and file systems Flashcards

1
Q

What is an Instance Store?

A

A form of cloud-hosted, block-level storage that is directly attached to the physical host running your virtual machine (VM).

High Throughput: Because the storage is physically attached (often using NVMe or SATA), instance stores can offer higher data transfer rates than network-attached volumes.

2
Q

What are the different types of instance stores and how do they vary in size?

A

SSD-based instance stores: Often range from 80 GB to 320 GB.
HDD-based options: Can provide up to around 1.5 TB.
Some high-end instances (e.g., x1.32xlarge) may offer nearly 4 TB of storage.

3
Q

What is the Lifetime of an Instance Store and why is it considered ephemeral?

A
  • Data in an instance store persists only for the duration of the instance’s life on that particular physical host.
  • If the instance is stopped, terminated, or moved to a different physical machine, the data is lost.
4
Q

How Does Storage Virtualization Work in Cloud Block Storage?

A
  • Transforms the raw physical storage into a logical view, enabling multiple physical disks to be managed as a single resource.
  • The physical storage (hard drives/SSDs) is abstracted by the operating system’s file system through a virtual layer, which then presents a unified logical file system view.
  • This abstraction allows storage devices to be pooled, resized, or reallocated dynamically, and it is fundamental to cloud block storage systems.
5
Q

What Is NVMe over Fabric (NVMe-oF)?

A

A protocol that extends the NVMe protocol (typically used for fast SSD access) over a network fabric (such as Ethernet or InfiniBand).

6
Q

How Does NVMe-oF Impact Virtual Block Stores?

A

Performance Enhancement: Provides near-local storage speeds over the network by reducing latency and increasing throughput.

Bandwidth Considerations: While NVMe-oF improves access speeds, overall performance is still constrained by the network's bandwidth, unlike directly attached instance storage.

7
Q

How Does EBS Storage Differ from Instance Stores in Terms of Persistence?

A

EBS (Elastic Block Store):
Persistent: Data stored in EBS volumes remains intact even after an instance is stopped or terminated.
Detach/Attach: Volumes can be detached from one instance and reattached to another.
Instance Stores:
Ephemeral: Data is lost once the instance is stopped, terminated, or moved to a different physical host.

8
Q

How Does EBS Storage Differ from Instance Stores in Terms of Bandwidth?

A

Instance Stores: Generally offer higher bandwidth due to being directly attached to the physical machine.
EBS Volumes: Accessed over a network. While they may have lower baseline throughput compared to instance stores, advanced features (e.g., dedicated EBS bandwidth on optimized instances) and provisioning options can deliver high performance.

9
Q

What Factors Influence the Performance of EBS Volumes?

A

IOPS and Throughput: Defined by the volume type and the level of provisioned IOPS (input/output operations per second).
Volume Type: Different types (SSD vs. magnetic HDD) offer varying performance levels.
Instance Type: Some instances have dedicated EBS bandwidth, whereas others share bandwidth with network traffic.
Burst Capabilities: Many EBS volumes can burst beyond their baseline IOPS for short durations.

10
Q

What Factors Influence the Cost of EBS Volumes?

A

Storage Capacity: Larger volumes cost more.
Provisioned IOPS: Higher performance configurations (like io1) incur additional charges.
Volume Type: SSD-based volumes are typically more expensive than magnetic (HDD-based) options.
Usage Patterns: Sustained high IOPS or throughput needs can increase overall cost.

11
Q

What Happens to Data in an Instance Store When an Instance Is Stopped or Terminated?

A

When an instance that uses instance store storage is stopped, terminated, or relocated, all data stored in the instance store is lost.
This inherent volatility is why instance stores are best used only for temporary data like caches or scratch space.

12
Q

How Can Reliability Be Improved When Using Instance Stores?

A

Data Replication: Implement a distributed file system (e.g., HDFS) that replicates data across multiple instance stores.
Periodic Backups: Regularly back up data from instance stores to persistent storage systems like EBS or S3 (see the sketch after this list).
Redundancy: Deploy multiple instances so that the failure of one does not result in complete data loss.
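
A minimal sketch of the periodic-backup idea, assuming boto3 and an instance store mounted at a hypothetical path; the bucket name and directory are placeholders, not part of the course material.

```python
import os
import boto3

# Hypothetical names: adjust to your own bucket and instance-store mount point.
BUCKET = "my-backup-bucket"
SCRATCH_DIR = "/mnt/instance-store/scratch"

s3 = boto3.client("s3")

def backup_scratch_to_s3():
    """Upload every file under the scratch directory to S3 (persistent storage)."""
    for root, _dirs, files in os.walk(SCRATCH_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            # The key preserves the relative path so the backup mirrors the directory tree.
            key = os.path.relpath(local_path, SCRATCH_DIR)
            s3.upload_file(local_path, BUCKET, key)

if __name__ == "__main__":
    backup_scratch_to_s3()
```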

13
Q

What Are the Key Differences Between Amazon EBS and Instance Stores?

A

Amazon EBS:
- Persistent: Data remains even after instance termination.
- Network-Attached: Volumes are accessed over the network, offering flexibility (attach/detach) and additional features like encryption and snapshots.
- Varied Performance: Options range from general-purpose to high IOPS configurations.
Instance Stores:
- Ephemeral: Data is tied to the life of the instance and is lost on stopping/termination.
- Directly Attached: Generally provide higher bandwidth and lower latency due to physical attachment.
- Limited Capacity: Often smaller in size compared to what can be provisioned with EBS.

14
Q

What Are the Main Types of Amazon EBS Volumes, and How Do They Differ?

A
  • gp2 (General Purpose SSD): Balanced Performance: Offers a baseline IOPS with the ability to burst; suitable for a wide range of workloads.
  • io1 (Provisioned IOPS SSD): High Performance: Allows you to provision a specific level of IOPS and throughput, ideal for I/O-intensive applications.
  • st1 (Throughput Optimized HDD): High Throughput for Sequential Workloads: Best for applications like log processing or big data workloads that require high data transfer rates.
  • sc1 (Cold HDD): Cost-Effective: Lower-cost option for infrequently accessed data with lower performance compared to st1.
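
A minimal sketch of how these volume types map onto the EC2 CreateVolume API via boto3; the region, Availability Zone, and sizes are illustrative assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
AZ = "us-east-1a"  # illustrative Availability Zone

# gp2: general-purpose SSD; baseline IOPS scales with size and can burst.
gp2 = ec2.create_volume(AvailabilityZone=AZ, Size=100, VolumeType="gp2")

# io1: provisioned IOPS SSD; you explicitly request the IOPS level you need.
io1 = ec2.create_volume(AvailabilityZone=AZ, Size=200, VolumeType="io1", Iops=5000)

# st1: throughput-optimized HDD for large, sequential workloads.
st1 = ec2.create_volume(AvailabilityZone=AZ, Size=500, VolumeType="st1")

# sc1: cold HDD for infrequently accessed data at the lowest HDD cost.
sc1 = ec2.create_volume(AvailabilityZone=AZ, Size=500, VolumeType="sc1")

print([v["VolumeId"] for v in (gp2, io1, st1, sc1)])
```
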
15
Q

Which EBS Volume Type Is Best Suited for High-Throughput Workloads Like Log Processing?

A

Throughput Optimized HDD (st1) is designed for high-throughput, sequential workloads (such as log processing, ETL jobs, or large-scale data analytics) where sustained data transfer is crucial.

16
Q

What is Object Storage?

A

Object storage stores data as self-contained objects, each with metadata and a unique key inside a “bucket” or container. There is no hierarchical file structure (no folders or subdirectories in the backend) – you simply work with objects via APIs.

17
Q

How Does Object Storage Differ from File and Block Storage?

A

File Storage: Uses a directory structure; supports granular operations like file locking and partial reads/writes.
Block Storage: Provides raw disk-like storage segmented into blocks (e.g., disk sectors); typically used for virtual machine file systems or databases.
Object Storage: Simplifies access (using GET, PUT, DELETE), lacks traditional file system features (e.g., file locking), and is highly scalable.
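
A minimal sketch of the GET/PUT/DELETE access model using boto3 against S3; the bucket name and key are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"          # assumed to exist already
KEY = "reports/2024/summary.txt"   # "/" is just part of the key, not a real directory

# PUT: store an object under a key.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"hello object storage")

# GET: retrieve the whole object (no partial in-place edits as in a file system).
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
print(body)

# DELETE: remove the object.
s3.delete_object(Bucket=BUCKET, Key=KEY)
```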

18
Q

What are the Key Properties of Cloud Object Storage (Scalability, Availability, and Consistency)?

A

Scalability:
Object storage can scale to virtually unlimited capacity because data is stored as discrete objects. You pay only for the storage you use.

Availability & Durability:
Data is distributed across multiple data centers or availability zones, providing high durability (often quoted as “11 nines”) and availability (frequently around 99.99% uptime).

Consistency:
Modern cloud providers generally offer strong (read-after-write) consistency for individual objects, ensuring that once an object is updated, subsequent reads return the latest version.

19
Q

What are Common Use Cases for Object Storage?

A

Data Lakes & Analytics: Storing massive raw data sets for processing and analysis.
Backup & Archival: Due to its cost-effectiveness and durability, it’s ideal for backups and long-term retention.
Static Website Hosting & Media Streaming: Serving images, HTML, videos, and other media files.
Mobile & Web Applications: Storing user-generated content (photos, videos, documents).

20
Q

What is the Impact of Storage Classes?

A

Hot/Standard: For frequently accessed data with low latency retrieval but higher cost.
Cool/Infrequent Access: Lower cost for data accessed less often, with potential retrieval fees and slight delays.
Archive (Glacier): Very low storage cost for rarely accessed data; retrieval times can range from minutes (expedited) to hours or even days (bulk or deep archive).
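
A minimal sketch of selecting a storage class at upload time with boto3; the bucket and keys are assumed for illustration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket

# Hot/Standard: default class, frequently accessed data.
s3.put_object(Bucket=BUCKET, Key="hot/data.csv", Body=b"...", StorageClass="STANDARD")

# Cool/Infrequent Access: cheaper storage, but retrieval fees apply.
s3.put_object(Bucket=BUCKET, Key="cool/data.csv", Body=b"...", StorageClass="STANDARD_IA")

# Archive: lowest storage cost; objects must be restored before they can be read.
s3.put_object(Bucket=BUCKET, Key="archive/data.csv", Body=b"...", StorageClass="DEEP_ARCHIVE")
```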

21
Q

What is the AWS S3 Structure?

A

Buckets and Objects: Data is stored as objects within buckets, with each object identified by a unique key.
Replication: S3 automatically replicates objects across multiple availability zones within a region to ensure high durability and availability.

22
Q

What is AWS S3’s Management for Durability and Availability?

A

S3’s design supports 11 nines of durability (extremely low risk of data loss) and high availability (with SLAs often around 99.99%).
The system uses strong consistency for read-after-write operations, enhancing reliability for applications.

23
Q

AWS S3 Storage Classes Comparison

A

Primary Storage Classes:
S3 Standard: For frequently accessed data, offering high availability and millisecond retrieval.
S3 Standard-Infrequent Access (IA): Lower cost for data accessed less often, with a 30-day minimum storage duration.
One Zone-IA: Lower cost than standard IA but stored in a single availability zone, hence less resilient.
Intelligent Tiering: Automatically moves objects between tiers based on access patterns.

Archival Classes:
S3 Glacier Instant Retrieval: Provides millisecond access for archival data that might require quick recovery.
S3 Glacier Flexible Retrieval: Suitable for archives where retrieval can take minutes to hours.
S3 Glacier Deep Archive: Lowest cost with retrieval times from 12 to 48 hours.

Lifecycle Management:
These classes can be managed via lifecycle policies, which automatically transition objects between tiers based on rules (e.g., age or access frequency).

24
Q

What are AWS S3 Security Features?

A

IAM and Bucket Policies: Control access by defining who can access or modify objects.
Encryption: Supports server-side encryption (SSE-S3 and SSE-KMS) and client-side encryption to secure data at rest and in transit.
Access Points: Allow fine-grained access control tailored to different applications or user groups.
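
A minimal sketch, assuming boto3 and an existing bucket, of two of the controls above: default server-side encryption (SSE-KMS) and a bucket policy that denies non-TLS access.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket name

# Turn on default server-side encryption (SSE-KMS) for all new objects.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Bucket policy: deny any request that is not made over TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```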

25
Q

What are AWS S3 Cost Optimization Best Practices?

A

Lifecycle Policies: Automate moving objects to lower-cost storage classes.
Intelligent Tiering: Helps dynamically optimize costs by moving data based on access patterns.
Data Transfer Considerations: Optimize application design to minimize unnecessary data transfer (data egress costs).
Monitoring: Use CloudWatch and CloudTrail to track access patterns and unexpected charges.

26
Q

What is AWS S3 Glacier?

A

- Designed for long-term data retention at low cost.
- Best for data that is infrequently accessed, such as compliance records, logs, or historical media files.

27
Q

What are the AWS S3 Glacier Retrieval Options for Archival Storage?

A

Expedited Retrieval: Access data within minutes (ideal for urgent needs).
Standard Retrieval: Typically takes a few hours.
Bulk Retrieval: Suitable for large-scale recoveries where time is less critical.
Glacier Deep Archive: Offers even lower storage costs with retrieval times of up to 48 hours.
(A retrieval sketch follows below.)

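A minimal sketch of requesting a restore of an archived object with boto3, with the retrieval tier chosen from the options above; the bucket and key are assumptions.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-archive-bucket"   # assumed bucket
KEY = "logs/2019/app.log.gz"        # assumed archived object

# Ask S3 to restore a Glacier-class object for 7 days; Tier picks the retrieval speed:
# "Expedited" (minutes), "Standard" (hours), or "Bulk" (cheapest, slowest).
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# Poll the object's metadata to see whether the temporary copy is ready yet.
head = s3.head_object(Bucket=BUCKET, Key=KEY)
print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while the restore is running
```
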
28
Q

How do lifecycle policies manage data transitions between S3 storage tiers?

A

- Rules that automate the transition of objects between storage classes based on object age or access patterns.
- They optimize cost and performance by keeping frequently accessed data in higher-cost tiers and moving older data to less expensive archival tiers.
- They are often configured to move data from S3 Standard to Infrequent Access or Glacier after a set period (e.g., 30 or 60 days); see the configuration sketch below.

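A minimal sketch of such a lifecycle rule configured with boto3 (30 days to Standard-IA, 90 days to Glacier); the bucket name and rule ID are assumptions.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed bucket

# Transition objects to Standard-IA after 30 days and to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-old-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply the rule to every object in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```
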
29
Q

How does AWS S3 Glacier pricing compare to other storage options?

A

Storage Cost: Glacier tiers are among the lowest in cost compared to S3 Standard, Standard-IA, and other AWS storage options (like EFS or EBS).
Retrieval Fees: Although storage is cheap, you incur costs on data retrieval, and these fees vary by retrieval option (expedited, standard, bulk).
Cost Trade-offs: Ideal for large volumes of data that are seldom accessed, making it an attractive option for long-term archival where retrieval speed is not critical.

30
Q

What is a storage gateway, and why is it important for hybrid cloud environments?

A

- A storage gateway acts as a bridge between on-premises infrastructure and cloud storage, facilitating data transfer, synchronization, and hybrid cloud setups.
- It enables seamless integration of on-premises systems with cloud-based storage.
- It supports data migration, backup, and disaster recovery scenarios where both local and cloud storage are used.

31
Q

Clustered File System vs. Traditional Network File System

A

Clustered (or Network) File System:
- Presents a unified, local-like view of files and directories across many networked machines.
- Allows multiple servers (or nodes) to work together so that users see the same set of operations (e.g., mount, read, write) regardless of the underlying distribution of data.

Traditional Network File System:
- A single server typically manages file operations for multiple clients.
- This limits scalability and may offer less robust concurrency control compared to a clustered file system, which is designed to scale out with multiple servers sharing the workload.

32
Q

Challenges of Maintaining Consistency, Availability, and Partition Tolerance (CAP)

A

Consistency: Ensuring every read reflects the most recent write is challenging when data is spread across nodes. Techniques such as file locking and fencing (preventing concurrent writes) are used to avoid conflicts.
Availability: The system must always respond to user requests, yet distributed setups can introduce delays or failures if nodes become overloaded or unreachable.
Partition Tolerance: In the event of network issues that split the system into partitions, the file system must still operate reliably.
Balancing these three factors is nontrivial because, as per the CAP theorem, a distributed system can fully achieve at most two of these guarantees simultaneously.

33
Q

Benefits of Cloud-Managed File Systems over DIY Approaches

A

Reduced Operational Complexity: Rolling out your own clustered distributed file system requires specialized network engineering, continuous management of hardware, and careful tuning of CAP trade-offs. Cloud-managed services relieve you of this burden.
Automatic Scaling and Maintenance: Managed file systems automatically adjust capacity, offer redundancy, and handle failover without needing manual intervention.
Cost and Resource Efficiency: While DIY solutions might seem appealing for custom configurations, the overhead in time, expertise, and ongoing maintenance can be far greater than using a cloud service with pay-as-you-use pricing.

34
Q

Implementations of File Systems by Cloud Providers

A

AWS:
- AWS FSx: Offers options such as FSx for Lustre (high-performance scratch storage) and FSx for Windows File Server (leveraging Microsoft DFS for namespace management).
- AWS EFS (Elastic File System): A fully managed, NFSv4-based file system designed for scalable, concurrent access across multiple EC2 instances and Availability Zones.

Azure:
- Azure Files: Provides managed file storage based on Windows File Server technology, accessible via SMB and REST APIs.
- Azure Data Lake Storage Gen2: Supports Hadoop-compatible workflows.

Google Cloud:
- Filestore: Offers a managed file system solution that typically provisions a single large machine with defined limits (e.g., up to 64 TB), suitable for workloads that do not require a fully distributed backend.

35
Q

What are the key design goals of distributed file systems?

A

- Access Transparency: Users interact with the file system as if it were local.
- Location Transparency: Users do not need to know where data physically resides.
- Concurrency Transparency: Multiple users can access and modify files concurrently without conflict.

36
Q

How do distributed file systems ensure transparency for users?

A

These systems hide the complexity of data distribution, replication, and network communication behind standard file system operations (e.g., POSIX compliance). This makes distributed file systems feel like local file systems, despite being backed by multiple nodes.

37
Q

What is the primary motivation for AWS EFS?

A

AWS EFS was created to bridge the gap between object storage and block storage by providing a fully managed, scalable file system that supports concurrent access via the NFS protocol.

38
Q

How does EFS differ from S3 and EBS?

A

- S3 is an object store (now offering strong read-after-write consistency) and does not provide file system semantics such as hierarchical directories or file locking.
- EBS offers low-latency block storage but is typically confined to a single EC2 instance within one Availability Zone.
- EFS is accessible concurrently by multiple instances and automatically spans multiple AZs for higher availability (see the provisioning sketch below).

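A minimal sketch, assuming boto3, of creating an EFS file system and one mount target per Availability Zone so multiple instances can mount it concurrently; the subnet and security group IDs are placeholders.

```python
import boto3

efs = boto3.client("efs")

# Create the file system; capacity is not provisioned up front, it grows on demand.
fs = efs.create_file_system(PerformanceMode="generalPurpose", Encrypted=True)
fs_id = fs["FileSystemId"]

# One mount target per Availability Zone lets instances in each AZ mount the same
# file system over NFS (the subnet and security group IDs below are placeholders).
for subnet in ["subnet-aaa111", "subnet-bbb222"]:
    efs.create_mount_target(
        FileSystemId=fs_id,
        SubnetId=subnet,
        SecurityGroups=["sg-0123456789abcdef0"],
    )

# On each EC2 instance the file system is then mounted with an NFSv4 client, e.g.:
#   mount -t nfs4 <fs_id>.efs.<region>.amazonaws.com:/ /mnt/efs
```
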
39
Q

How does AWS EFS ensure high availability compared to other AWS storage services?

A

AWS EFS automatically stores data redundantly across multiple Availability Zones. This means that if one zone experiences issues, the data remains accessible from another zone.

40
Q

How does AWS EFS ensure high durability compared to other AWS storage services?

A

Data is replicated redundantly across multiple Availability Zones, so the failure of a single zone does not result in data loss. In addition, the architecture allows many EC2 instances to mount the file system simultaneously, ensuring continuous availability and quick recovery from localized failures.

41
Q

What are the key advantages of AWS EFS, such as scalability and redundancy across Availability Zones?

A

Scalability: EFS grows and shrinks automatically to meet the storage demands of your application without the need for manual provisioning.
High Throughput and Low Latency: Being SSD-backed and distributed, EFS provides consistent performance suitable for high-demand workloads.
Redundancy Across Availability Zones: With built-in multi-AZ data replication, EFS offers superior durability and availability compared to single-AZ services like EBS.
Ease of Management: The service abstracts the underlying complexity of managing distributed storage, allowing you to focus on application development rather than infrastructure maintenance.

42
Q

How does Internet-level personal filesystem storage differ from traditional object storage like AWS S3?

A

Traditional Object Storage (e.g., AWS S3):
- Designed to store and serve objects (files) in a flat structure (buckets).
- Optimized for high-scale applications where millions of users access data (e.g., hosting static websites).
- Lacks native support for synchronizing a user's local file system or advanced features like versioning or recovery tailored for personal use.
- Typically a lower-level service used by developers as a building block.

Internet-Level Personal Filesystem Storage:
- Built to serve end users who want to access and synchronize files across a limited number of personal devices (laptops, phones, tablets).
- Offers a rich set of features that add value beyond simple file storage (see the next card).
- Often built on top of underlying object storage but adds file system semantics and user-friendly interfaces.

43
Q

What are key features of Internet-level personal filesystem storage, such as synchronization and data recovery?

A

File Synchronization: Automatically synchronizes file changes across devices. Uses techniques like block-level copying to transfer only the changed portions of files, improving efficiency (see the sketch below).
Sync Throttling: Allows users to control how much bandwidth is used during synchronization, which is especially useful on limited connections.
Data Recovery & Versioning: Maintains historical versions of files (e.g., 30 to 180 days of backups) so that accidental changes or deletions can be recovered.
Shared Folders and Links: Enables collaboration by allowing users to share files or entire folders easily.
Local File System Integration & Web Access: Seamless integration with desktop clients and mobile devices, along with web interfaces for file access.
Value-Added Services: Some platforms include features such as text search within documents or even within images, enhancing the utility of stored data.

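A conceptual sketch of block-level change detection (not any provider's actual implementation): hash fixed-size blocks and re-upload only the blocks whose hashes changed. The block size and file names are arbitrary.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks, an arbitrary illustrative size

def block_hashes(path: str) -> list[str]:
    """Hash a file in fixed-size blocks so changed regions can be identified."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Return indices of blocks whose hashes differ; only these need re-uploading."""
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    indices += list(range(len(old), len(new)))  # blocks appended since the last sync
    return indices

# Example: compare the previously synced hash list with the current file state.
# previous = block_hashes("report_last_synced.docx")
# current  = block_hashes("report.docx")
# print(changed_blocks(previous, current))
```
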
44
Q

Why did Dropbox initially use AWS S3, and what motivated its transition to its own data centers in 2016?

A

Initial Use of AWS S3:
- As a startup, Dropbox used AWS S3 for file storage because it offered a low initial cost and allowed rapid deployment.
- S3 provided the scalability needed to support early growth without a large capital expense.

Transition to Own Data Centers (2014–2016):
- Economies of Scale: As Dropbox's user base grew, the cost of storing large volumes of data on S3 became unsustainable given subscription revenue (e.g., storing 2 TB on S3 might cost $46 while revenue was around $15).
- Cold Storage Strategies: Dropbox leveraged the fact that many users rarely access most files, allowing a move to more cost-efficient storage tiers.
- The shift reflects a broader "cloudonomics" strategy where, past a certain scale, owning the infrastructure can be more cost-effective than renting it.

45
Q

How does the Dropbox API work (architecture, definition, use cases, etc.)?

A

Overall Architecture:
- Client Interaction: Users access Dropbox via various clients (desktop, mobile, or web), which communicate with Dropbox's metadata service and file storage system.
- Metadata vs. File Data: Dropbox manages file metadata (small, structured information about files) on its own servers, while the actual file data was initially stored on AWS S3 (and later migrated to Dropbox's own data centers).
- Compute Layer: Tasks such as encoding, decoding, and encryption are handled on cloud compute instances (originally AWS EC2).

Use Cases:
- Integrating file storage and synchronization into third-party applications.
- Building custom workflows that rely on file sharing, version control, and collaboration features.
- Simplifying app development by outsourcing file management and authentication to Dropbox.

46
Q

What are the two levels of Dropbox API access, and how do they differ?

A

Drop-In API (Simplified Integration):
- Chooser: Allows users to select and download files from their Dropbox into an application. Comes with pre-built, cross-platform components that handle file browsing and authentication.
- Saver: Enables users to upload files from an application directly into their Dropbox account. Typically used on web and mobile web interfaces.
- Ideal for quickly adding Dropbox functionality to an app with minimal coding effort.

Core API (Full Programmatic Access):
- Provides comprehensive control over a user's Dropbox account.
- Supports advanced operations such as file upload/download, metadata access, revision history, delta synchronization (tracking file changes), and detailed file sharing permissions.
- Suited for more complex integrations where fine-grained control is needed (see the sketch below).

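A minimal sketch of Core-API-style programmatic access using the official Dropbox Python SDK; the access token, file names, and paths are placeholders, and this is an illustrative sketch rather than a complete integration.

```python
import dropbox

# Placeholder OAuth token; real applications obtain one via Dropbox's OAuth flow.
dbx = dropbox.Dropbox("ACCESS_TOKEN")

# Upload a file into the user's Dropbox.
with open("notes.txt", "rb") as f:
    dbx.files_upload(f.read(), "/notes.txt",
                     mode=dropbox.files.WriteMode.overwrite)

# List folder contents and inspect metadata (names, sizes, revisions).
for entry in dbx.files_list_folder("").entries:
    print(entry.name)

# Download the file back to the local machine.
dbx.files_download_to_file("notes_copy.txt", "/notes.txt")
```
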
47
Q

How does Dropbox store metadata and actual file data using cloud services?

A

Metadata Storage:
- Dropbox maintains detailed metadata (information about file names, sizes, versions, etc.) on its own servers.
- This enables quick searches, file management, and change notifications.

File Data Storage:
- Initially, the actual file data was stored on AWS S3, leveraging S3's object storage capabilities.
- Over time, as Dropbox scaled, it transitioned file storage to its own data centers to better manage costs and optimize performance.

48
Q

Why is Dropbox an especially good example of an internet-level personal file system?

A

User-Centric Design: Built to serve individual users and small groups with seamless synchronization across devices.
Integrated Features: Combines robust synchronization, efficient data recovery, and intuitive sharing features that are critical for everyday file access.
Ecosystem Integration: Dropbox's APIs (both Drop-In and Core) allow for rapid third-party integration, extending its functionality into other applications and services.
Evolution Driven by Scale: Its journey from using third-party storage (AWS S3) to owning its own data centers is a clear example of adapting infrastructure to meet growing demand and cost-efficiency.