L9 - Cloud Storage Systems Flashcards
4 types of AWS storage
- Amazon Elastic Block Storage (EBS)
- Amazon EC2 Instance Storage
- Amazon Elastic File System (EFS)
- Amazon Simple Storage Service (S3)
2 types of storage devices for VMs
- Instance volumes
- EBS volumes
Instance volumes
Disks/SSDs attached to physical server
- Optimized for high IOPS rates
- Lost when VM is stopped
EBS volumes
Service providing volumes (Storage Area Network (SAN))
- can only be mounted to a single VM at a time
- survives stopping or termination of VM
- Boot device lost when VM is terminated
Types of storage
- Object store (S3)
- Shared file system (NAS) (EFS)
- Relational Databases (RDS)
- NoSQL databases
- data warehouses
6 characteristics of Cloud Storage Systems
- voluminous data
- commodity hardware (discrepancy btw. processor speed and storage access time)
- distributed data
- expect failures
- processing by applications
- optimization for dominant usage
CAP Theorem
Consistency, availability, and partition-tolerance cannot be achieved together in a distributed system
consistency (CP) = read returns the last write value (strict)
availability (AP) = all requests are answered in an acceptable time
partition-tolerance = the system continues working even if some nodes are separated
Which of the 3 aspects of the CAP Theorem is essential in large-scale distributed cloud systems?
Partition-tolerance
–> Storage solutions focus on either availability (AP) or consistency (CP)
-> AP systems apply eventual consistency: providing consistency only after a certain time
What is S3 Object Storage most used for?
- backup
- data spread across >= 3 data centers in a region
Data management in S3
Two level hierarchy of buckets and data objects
- data objects have a name, a blob of data (up to 5 TB), and metadata
- data objects can be searched by name, bucket name, and metadata BUT NOT CONTENT
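A minimal sketch of the bucket/object/metadata structure, expressed as the parameters that would go to boto3's `put_object` (the bucket name, key, and metadata values are made up for illustration):

```python
# Sketch of storing an object with user-defined metadata in S3.
# Bucket name, key, and metadata are hypothetical examples.
def build_put_object_request(bucket, key, data, metadata):
    """Build the parameter dict for s3.put_object(**params)."""
    return {
        "Bucket": bucket,      # top level of the two-level hierarchy
        "Key": key,            # object name within the bucket
        "Body": data,          # blob of data (up to 5 TB per object)
        "Metadata": metadata,  # user metadata; searchable, unlike the content
    }

params = build_put_object_request(
    "my-backup-bucket", "logs/2024-01-01.gz", b"...", {"origin": "web-01"}
)
# With boto3 this would be sent as: boto3.client("s3").put_object(**params)
```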
5 storage classes in AWS S3
- standard
- reduced redundancy (for expected loss)
- intelligent tiering
- glacier (retrieval 1-5min)
- deep archive (retrieval 12h)
Data access, versioning, and lifecycle in S3
- via Simple Object Access Protocol (SOAP), REST, BitTorrent
- data cannot be modified only uploaded, deleted, retrieved
- versioning possible
- lifecycle: rules can be set for transition (migration of objects to another storage class) and expiration (when an object is deleted)
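A lifecycle rule combining both kinds of actions, written as the configuration dict that would be passed to boto3's `put_bucket_lifecycle_configuration` (rule ID, prefix, and day counts are illustrative assumptions):

```python
# Hedged sketch: one lifecycle rule with a transition and an expiration,
# in the shape expected by
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",          # hypothetical rule name
            "Filter": {"Prefix": "logs/"},     # objects the rule applies to
            "Status": "Enabled",
            # transition: migrate objects to a cheaper storage class
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            # expiration: delete objects after a retention period
            "Expiration": {"Days": 365},
        }
    ]
}
```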
Consistency in S3
- When creating new objects the key (name) becomes visible only after all replicas were written (read-after-write)
- eventual consistency
Requirements of Google File System (GFS)
- most writes are appending at the end
- optimized for long sequential and short random reads/writes
- bandwidth is more important than latency (batch processing)
- support for concurrent modifications
Is it better to put small files or large ones on GFS?
Better to put large ones because:
Single master server and many chunk servers
-> large chunks reduce metadata at the master and the number of client-master interactions
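A back-of-the-envelope illustration of the metadata argument, comparing GFS's 64 MB chunks against a hypothetical 1 MB chunk size:

```python
# Why large chunks keep the master's metadata small: fewer chunks per file.
# The 1 MB alternative chunk size is an assumption for comparison.
def num_chunks(file_size_bytes, chunk_size_bytes):
    # Each file occupies ceil(size / chunk_size) chunks.
    return -(-file_size_bytes // chunk_size_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

large = num_chunks(1 * TB, 64 * MB)  # GFS-style 64 MB chunks
small = num_chunks(1 * TB, 1 * MB)   # hypothetical 1 MB chunks

print(large)  # 16384 chunks to track for 1 TB
print(small)  # 1048576 chunks -> 64x more master metadata and lookups
```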
How is the directory implemented in GFS?
- lookup table
What is special about the GFS Architecture?
Control and data flow are decoupled
–> Client first contacts the master but then interacts directly with the chunk servers (one is selected as primary and updates the replicas)
Data integrity, consistency, and metadata in GFS
Data integrity:
- each chunk server keeps a checksum
- corrupted chunks are overwritten with replica
Consistency:
- concurrent writes and appends to chunks
Metadata:
- master server contains metadata about all chunks
- each chunk server stores metadata and checksum
7 system interaction steps in GFS
- Client asks master for the primary and the locations of all replica chunkservers
- Master grants a new lease on chunk, version # of all chunks increased
- Client pushes data to all servers
- Client sends write request to primary
- Primary forwards write request to secondaries
- Secondaries reply to primary upon completion
- Primary replies to client with success or error (if write at primary succeeds but fails at secondaries)
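The steps above can be sketched as a toy in-memory model (class and method names are my own, not GFS APIs):

```python
# Toy model of the GFS write path: data is pushed to all replicas first
# (decoupled data flow), then the primary orders the write and forwards
# it to the secondaries (control flow).
class ChunkServer:
    def __init__(self):
        self.buffer = None  # data pushed ahead of the write request
        self.chunk = b""

    def push(self, data):   # step 3: client pushes data to all servers
        self.buffer = data

    def apply(self):        # steps 4-6: primary-ordered write is applied
        self.chunk += self.buffer
        self.buffer = None
        return True

def client_write(primary, secondaries, data):
    for server in [primary] + secondaries:  # step 3: push data everywhere
        server.push(data)
    ok = primary.apply()                    # step 4: write request to primary
    for s in secondaries:                   # step 5: forward to secondaries
        ok = s.apply() and ok               # step 6: secondaries reply
    return ok                               # step 7: success only if all succeeded

primary, s1, s2 = ChunkServer(), ChunkServer(), ChunkServer()
assert client_write(primary, [s1, s2], b"record")
assert primary.chunk == s1.chunk == s2.chunk == b"record"
```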
How does system interaction work for appends in GFS?
Same as before, but in step 4 (client sends write request to primary) the primary checks whether appending to the current chunk would exceed the maximum chunk size of 64 MB. If so, the chunk is padded and the client retries the append on the next chunk.
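The size check in step 4 amounts to this (function name and return values are illustrative):

```python
# Sketch of the record-append size check the primary performs.
CHUNK_SIZE = 64 * 1024 * 1024  # maximum chunk size: 64 MB

def append_decision(current_offset, record_size):
    """Return 'append' if the record fits, else 'pad-and-retry'."""
    if current_offset + record_size <= CHUNK_SIZE:
        return "append"
    # Pad the current chunk; the client retries the append on a new chunk.
    return "pad-and-retry"

print(append_decision(10 * 1024 * 1024, 1024))  # append
print(append_decision(CHUNK_SIZE - 100, 1024))  # pad-and-retry
```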
Limitations of GFS
- scalability of the single master –> partitioning of the file system and development of distributed master
- 64MB chunk size (e.g. google mail has much smaller files)
- no latency guarantees
Characteristics of Amazon Elastic File System (EFS)
- distributed
- capacity: unlimited file system size, individual files up to 48 TB
- automatic provisioning of capacity
- integrated lifecycle management (infrequent files moved to cheaper storage)
- parallel access up to 1000s of EC2 instances
- throughput scales with the file system
- Aggregated IOPS scales with # of threads accessing EFS
- many security measures available
Consistency in AWS EFS
- close-to-open consistency: Any changes are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it
- EFS can provide stronger consistency with read-after-write consistency (strict consistency)
What are the ACID properties of Relational Databases (RDB)?
ACID:
- Atomic: the set of operations is executed successfully or it does not change anything
- Consistency: a transaction takes the database from one valid state to another
- Isolation: during the execution of the transaction no intermediate status is visible to the outside
- Durability: result is stored persistently
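Atomicity can be demonstrated with Python's built-in sqlite3 (table and values are made up; the exception simulates a failure mid-transaction):

```python
# Either both operations of the transfer commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name='alice'")
        raise RuntimeError("crash before the matching credit")  # simulated failure
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name='bob'")
except RuntimeError:
    pass

# The partial debit was rolled back: alice still has her full balance.
balance = conn.execute("SELECT balance FROM accounts WHERE name='alice'").fetchone()[0]
print(balance)  # 100
```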
Are RDB designed for vertical scaling?
Yes
Characteristics of Amazon Aurora
- Amazon's own RDB as an alternative to MySQL
- fully managed
- database instance up to 64 TB
- low price
- 6 copies of data are replicated across 3 Availability Zones
- automatic backup in S3, scaling
NoSQL storage
- Schema-free: Easy to incorporate changes in applications
- Support for non-relational data
- designed for horizontal scaling (automatic distribution)
Types of NoSQL databases
- Key-value database
- Document-oriented (JSON)
- Graph
- Column-family
Amazon's NoSQL database is?
Amazon Dynamo
What is Amazon Dynamo?
NoSQL, Key-value database
- optimized for small requests, quick access, high availability
- fault-tolerant
- automatic scaling of tables
- support for ACID transactions
- fine-grained access control for tables
DynamoDB
- decentralized architecture and eventual consistency
- stores key-value pairs in a table
- schema-less
Management of Partitions in Dynamo
- Mapping of keys to partitions
  - keys are hashed
  - hash space is treated as a ring
- Mapping partitions to nodes
  - ring is split into segments that are handled by virtual nodes
  - hashing the key and going clockwise determines the responsible virtual node
- Virtual nodes are assigned to physical nodes
  - accounts for heterogeneity of physical nodes
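A minimal sketch of this ring lookup (hash function choice, ring size, and node names are my own assumptions, not Dynamo's internals):

```python
# Dynamo-style placement: hash keys onto a ring of virtual nodes and
# walk clockwise to the first virtual node at or after the key's position.
import bisect
import hashlib

def ring_hash(value):
    # Stable hash onto a 32-bit ring (illustrative choice).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, vnodes):
        # Sorted positions of the virtual nodes on the ring.
        self.points = sorted((ring_hash(v), v) for v in vnodes)

    def lookup(self, key):
        h = ring_hash(key)
        # Clockwise walk: first virtual node at or after h,
        # wrapping around to the start of the ring if needed.
        i = bisect.bisect_left(self.points, (h, "")) % len(self.points)
        return self.points[i][1]

# 3 hypothetical physical nodes, 4 virtual nodes each.
ring = Ring([f"node-{n}-vn-{i}" for n in "abc" for i in range(4)])
owner = ring.lookup("user:42")
print(owner)  # deterministic for a fixed ring
```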
What is (N,R,W) replication in Dynamo?
Replication (N,R,W)
- to N consecutive nodes
- if read is successful on R copies, it is overall successful
- same for write on W copies
-> ensures that the replicas are on distinct physical nodes
What is the typical replication configuration in Dynamo?
(3,2,2) -> R+W>N
–> ensures that the most recent written info is returned (strongly consistent reads -> always the latest value is returned)
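The quorum condition behind this, as a one-line check (function name is my own):

```python
# Reads see the latest write whenever the read and write quorums must
# overlap, i.e. R + W > N.
def strongly_consistent(n, r, w):
    return r + w > n

print(strongly_consistent(3, 2, 2))  # True  -> quorums overlap, as in (3,2,2)
print(strongly_consistent(3, 1, 1))  # False -> a read may miss the last write
```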
What can N, R, W be used for in Dynamo?
To meet the SLA requirements of the service
- N (# of consecutive nodes) determines the durability
- R and W determine the latency
How is failure handled in Dynamo?
- Gossip protocol: once a node stops responding, other nodes will eventually propagate knowledge of the failure
- Admin can replace the node
What is used to handle failures?
Replication
Comparison of S3, EBS, EFS
https://cloud.netapp.com/blog/ebs-efs-amazons3-best-cloud-storage-system