L9 - Cloud Storage Systems Flashcards
4 types of AWS storage
- Amazon Elastic Block Storage (EBS)
- Amazon EC2 Instance Storage
- Amazon Elastic File System (EFS)
- Amazon Simple Storage Service (S3)
2 types of storage devices for VMs
- Instance volumes
- EBS volumes
Instance volumes
Disks/SSDs attached to physical server
- Optimized for high IOPS rates
- Lost when VM is stopped
EBS volumes
Service providing volumes (Storage Area Network (SAN))
- can only be mounted to a single VM at a time
- survives stopping or termination of VM
- Boot device lost when VM is terminated
Types of storage
- Object store (S3)
- Shared file system (NAS) (EFS)
- Relational Databases (RDS)
- NoSQL databases
- data warehouses
6 characteristics of Cloud Storage Systems
- voluminous data
- commodity hardware (discrepancy btw. processor speed and storage access time)
- distributed data
- expect failures
- processing by applications
- optimization for dominant usage
CAP Theorem
Consistency, availability, and partition-tolerance cannot be achieved together in a distributed system
consistency (CP) = read returns the last write value (strict)
availability (AP) = all requests are answered in an acceptable time
partition-tolerance = the system continues working even if some nodes are separated
Which of the 3 aspects of the CAP Theorem is essential in large-scale distributed cloud systems?
Partition-tolerance
–> Storage solutions focus on either availability (AP) or consistency (CP)
-> AP systems apply eventual consistency: providing consistency only after a certain time
What is S3 Object Storage most used for?
- backup
- data spread across >= 3 data centers in a region
Data management in S3
Two level hierarchy of buckets and data objects
- data objects have a name, a blob of data (up to 5 TB), and metadata
- data objects can be searched by name, bucket name, and metadata BUT NOT CONTENT
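A minimal sketch of the bucket/object/metadata structure, expressed as the parameters that would go to boto3's `put_object` (the bucket name, key, and metadata values are made up for illustration):

```python
# Sketch of storing an object with user-defined metadata in S3.
# Bucket name, key, and metadata are hypothetical examples.
def build_put_object_request(bucket, key, data, metadata):
    """Build the parameter dict for s3.put_object(**params)."""
    return {
        "Bucket": bucket,      # top level of the two-level hierarchy
        "Key": key,            # object name within the bucket
        "Body": data,          # blob of data (up to 5 TB per object)
        "Metadata": metadata,  # user metadata; searchable, unlike the content
    }

params = build_put_object_request(
    "my-backup-bucket", "logs/2024-01-01.gz", b"...", {"origin": "web-01"}
)
# With boto3 this would be sent as: boto3.client("s3").put_object(**params)
```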
5 storage classes in AWS S3
- standard
- reduced redundancy (for expected loss)
- intelligent tiering
- glacier (retrieval 1-5min)
- deep archive (retrieval 12h)
Data access, versioning, and lifecycle in S3
- via Simple Object Access Protocol (SOAP), REST, BitTorrent
- data cannot be modified only uploaded, deleted, retrieved
- versioning possible
- lifecycle: rules can be set for transition (migration of objects to another storage class) and expiration (when an object is deleted)
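A lifecycle rule combining both kinds of actions, written as the configuration dict that would be passed to boto3's `put_bucket_lifecycle_configuration` (rule ID, prefix, and day counts are illustrative assumptions):

```python
# Hedged sketch: one lifecycle rule with a transition and an expiration,
# in the shape expected by
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",          # hypothetical rule name
            "Filter": {"Prefix": "logs/"},     # objects the rule applies to
            "Status": "Enabled",
            # transition: migrate objects to a cheaper storage class
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            # expiration: delete objects after a retention period
            "Expiration": {"Days": 365},
        }
    ]
}
```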
Consistency in S3
- When creating new objects the key (name) becomes visible only after all replicas were written (read-after-write)
- eventual consistency
Requirements of Google File System (GFS)
- most writes are appending at the end
- optimized for long sequential and short random reads/writes
- bandwidth is more important than latency (batch processing)
- support for concurrent modifications
Is it better to put small files or large ones on GFS?
Better to put large ones because:
Single master server and many chunk servers
-> large chunks reduce metadata at the master and the number of client-master interactions
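A back-of-the-envelope illustration of the metadata argument, comparing GFS's 64 MB chunks against a hypothetical 1 MB chunk size:

```python
# Why large chunks keep the master's metadata small: fewer chunks per file.
# The 1 MB alternative chunk size is an assumption for comparison.
def num_chunks(file_size_bytes, chunk_size_bytes):
    # Each file occupies ceil(size / chunk_size) chunks.
    return -(-file_size_bytes // chunk_size_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

large = num_chunks(1 * TB, 64 * MB)  # GFS-style 64 MB chunks
small = num_chunks(1 * TB, 1 * MB)   # hypothetical 1 MB chunks

print(large)  # 16384 chunks to track for 1 TB
print(small)  # 1048576 chunks -> 64x more master metadata and lookups
```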
How is the directory implemented in GFS?
- lookup table
What is special about the GFS Architecture?
Control and data flow are decoupled
–> Client first contacts the master but then interacts directly with the chunk servers (one is selected as primary and updates the replicas)
Data integrity, consistency, and metadata in GFS
Data integrity:
- each chunk server keeps a checksum
- corrupted chunks are overwritten with replica
Consistency:
- concurrent writes and appends to chunks
Metadata:
- master server contains metadata about all chunks
- each chunk server stores metadata and checksum
7 system interaction steps in GFS
- Client asks master for the primary and the locations of all replica chunkservers
- Master grants a new lease on chunk, version # of all chunks increased
- Client pushes data to all servers
- Client sends write request to primary
- Primary forwards write request to secondaries
- Secondaries reply to primary upon completion
- Primary replies to client with success or error (if write at primary succeeds but fails at secondaries)
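The steps above can be sketched as a toy in-memory model (class and method names are my own, not GFS APIs):

```python
# Toy model of the GFS write path: data is pushed to all replicas first
# (decoupled data flow), then the primary orders the write and forwards
# it to the secondaries (control flow).
class ChunkServer:
    def __init__(self):
        self.buffer = None  # data pushed ahead of the write request
        self.chunk = b""

    def push(self, data):   # step 3: client pushes data to all servers
        self.buffer = data

    def apply(self):        # steps 4-6: primary-ordered write is applied
        self.chunk += self.buffer
        self.buffer = None
        return True

def client_write(primary, secondaries, data):
    for server in [primary] + secondaries:  # step 3: push data everywhere
        server.push(data)
    ok = primary.apply()                    # step 4: write request to primary
    for s in secondaries:                   # step 5: forward to secondaries
        ok = s.apply() and ok               # step 6: secondaries reply
    return ok                               # step 7: success only if all succeeded

primary, s1, s2 = ChunkServer(), ChunkServer(), ChunkServer()
assert client_write(primary, [s1, s2], b"record")
assert primary.chunk == s1.chunk == s2.chunk == b"record"
```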
How does system interaction work for appends in GFS?
Same as before, but in step 4 (client sends write request to primary) the primary checks whether appending to the current chunk would exceed the maximum chunk size of 64 MB. If so, the chunk is padded and the client retries the append on the next chunk.
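The size check in step 4 amounts to this (function name and return values are illustrative):

```python
# Sketch of the record-append size check the primary performs.
CHUNK_SIZE = 64 * 1024 * 1024  # maximum chunk size: 64 MB

def append_decision(current_offset, record_size):
    """Return 'append' if the record fits, else 'pad-and-retry'."""
    if current_offset + record_size <= CHUNK_SIZE:
        return "append"
    # Pad the current chunk; the client retries the append on a new chunk.
    return "pad-and-retry"

print(append_decision(10 * 1024 * 1024, 1024))  # append
print(append_decision(CHUNK_SIZE - 100, 1024))  # pad-and-retry
```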
Limitations of GFS
- scalability of the single master –> partitioning of the file system and development of distributed master
- 64MB chunk size (e.g. google mail has much smaller files)
- no latency guarantees
Characteristics of Amazon Elastic File System (EFS)
- distributed
- capacity: unlimited file system size, individual files up to 48 TB
- automatic provisioning of capacity
- integrated lifecycle management (infrequent files moved to cheaper storage)
- parallel access up to 1000s of EC2 instances
- throughput scales with the file system
- Aggregated IOPS scales with # of threads accessing EFS
- many security measures available
Consistency in AWS EFS
- close-to-open consistency: Any changes are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it
- EFS can provide stronger consistency with read-after-write consistency (strict consistency)
What are the ACID properties of Relational Databases (RDB)?
ACID:
- Atomic: the set of operations is executed successfully or it does not change anything
- Consistency: a transaction takes the database from one valid state to another
- Isolation: during the execution of the transaction no intermediate status is visible to the outside
- Durability: result is stored persistently
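Atomicity can be demonstrated with Python's built-in sqlite3 (table and values are made up; the exception simulates a failure mid-transaction):

```python
# Either both operations of the transfer commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name='alice'")
        raise RuntimeError("crash before the matching credit")  # simulated failure
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name='bob'")
except RuntimeError:
    pass

# The partial debit was rolled back: alice still has her full balance.
balance = conn.execute("SELECT balance FROM accounts WHERE name='alice'").fetchone()[0]
print(balance)  # 100
```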
Are RDB designed for vertical scaling?
Yes
Characteristics of Amazon Aurora
- Amazon's own RDB as an alternative to MySQL
- fully managed
- database instance up to 64 TB
- low price
- 6 copies of data are replicated across 3 Availability Zones
- automatic backup in S3, scaling
NoSQL storage
- Schema-free: Easy to incorporate changes in applications
- Support for non-relational data
- designed for horizontal scaling (automatic distribution)
Types of NoSQL databases
- Key-value database
- Document-oriented (JSON)
- Graph
- Column-family
Amazon's NoSQL database is?
Amazon Dynamo
What is Amazon Dynamo?
NoSQL, Key-value database
- optimized for small requests, quick access, high availability
- fault-tolerant
- automatic scaling of tables
- support for ACID transactions
- fine-grained access control for tables
DynamoDB
- decentralized architecture and eventual consistency
- stores key-value pairs in a table
- schema-less
Management of Partitions in Dynamo
- Mapping of keys to partitions
  - keys are hashed
  - hash space is treated as a ring
- Mapping partitions to nodes
  - ring is split into segments that are handled by virtual nodes
  - hashing the key and going clockwise determines the responsible virtual node
- Virtual nodes are assigned to physical nodes
  - accounts for heterogeneity of physical nodes
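A minimal sketch of this ring lookup (hash function choice, ring size, and node names are my own assumptions, not Dynamo's internals):

```python
# Dynamo-style placement: hash keys onto a ring of virtual nodes and
# walk clockwise to the first virtual node at or after the key's position.
import bisect
import hashlib

def ring_hash(value):
    # Stable hash onto a 32-bit ring (illustrative choice).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, vnodes):
        # Sorted positions of the virtual nodes on the ring.
        self.points = sorted((ring_hash(v), v) for v in vnodes)

    def lookup(self, key):
        h = ring_hash(key)
        # Clockwise walk: first virtual node at or after h,
        # wrapping around to the start of the ring if needed.
        i = bisect.bisect_left(self.points, (h, "")) % len(self.points)
        return self.points[i][1]

# 3 hypothetical physical nodes, 4 virtual nodes each.
ring = Ring([f"node-{n}-vn-{i}" for n in "abc" for i in range(4)])
owner = ring.lookup("user:42")
print(owner)  # deterministic for a fixed ring
```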
What is (N,R,W) replication in Dynamo?
Replication (N,R,W)
- to N consecutive nodes
- if read is successful on R copies, it is overall successful
- same for write on W copies
-> ensures that the replicas are on distinct physical nodes
What is the typical replication configuration in Dynamo?
(3,2,2) -> R+W>N
–> ensures that the most recent written info is returned (strongly consistent reads -> always the latest value is returned)
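The quorum condition behind this, as a one-line check (function name is my own):

```python
# Reads see the latest write whenever the read and write quorums must
# overlap, i.e. R + W > N.
def strongly_consistent(n, r, w):
    return r + w > n

print(strongly_consistent(3, 2, 2))  # True  -> quorums overlap, as in (3,2,2)
print(strongly_consistent(3, 1, 1))  # False -> a read may miss the last write
```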
What can N, R, W be used for in Dynamo?
To meet the SLA requirements of the service
- N (# of consecutive nodes) determines the durability
- R and W determine the latency
How is failure handled in Dynamo?
- Gossip protocol: once a node stops responding, other nodes will eventually propagate knowledge of the failure
- Admin can replace the node
What is used to handle failures?
Replication
Comparison of S3, EBS, EFS
https://cloud.netapp.com/blog/ebs-efs-amazons3-best-cloud-storage-system