Data Stores Flashcards
What are the 3 types (concepts) of data store in AWS?
1) Persistent datastore
2) Transient datastore
3) Ephemeral datastore
Define persistent data storage and give 2 examples…
Data that is durable and sticks around after a reboot, restart or power cycle
e.g. Glacier, RDS
Define a transient data store and give 2 examples…
Data is just temporarily stored and passed along to another process or persistent store
e.g. SQS, SNS
Define an ephemeral data store and give 2 examples…
Data is lost when stopped.
e.g. EC2 instance store, Elasticache- Memcached
What does IOPS stand for and what does it measure?
IOPS- Input/Output Operations Per Second
It is a measure of how fast we can read and write to a device
What does throughput measure?
It is a measure of how much data can be moved per unit of time
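A rough sketch of how the two measures relate: throughput is approximately IOPS multiplied by the size of each I/O operation. The function name and the sample figures (3,000 IOPS at 16 KiB) are illustrative assumptions, not values for any particular AWS volume type.

```python
def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    """Approximate throughput in MiB/s for a given IOPS rate and I/O size."""
    return iops * io_size_kib / 1024


# Hypothetical example: 3,000 IOPS with 16 KiB operations.
print(throughput_mib_per_s(3000, 16))  # 46.875
```

The same device can therefore show high IOPS with small operations but modest throughput, or the reverse with large sequential operations.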
What are the two types of data storage consistency models?
1) ACID
2) BASE
What does ACID stand for?
Atomic- Transactions are all or nothing
Consistent- Transactions must be valid
Isolated- Transactions can’t mess with one another
Durable- Completed transactions must stick around
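A minimal sketch of the "all or nothing" property, using Python's built-in sqlite3 module as a stand-in for any ACID store. The accounts table, the names and the balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500))  # False
print(balances(conn))  # both balances are unchanged
```

The credit to bob is applied first and then undone when the debit fails, which is the atomic and consistent behaviour the card describes.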
What does BASE stand for?
Basically Available- Values available even if stale
Soft-state- Might not be instantly consistent across stores
Eventually consistent- Will achieve consistency at some point
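A toy simulation of the three BASE properties, assuming a single primary copy and replicas that are only updated when replicate() runs. The class and method names are hypothetical and do not correspond to any AWS API.

```python
class EventuallyConsistentStore:
    """Toy key-value store with one primary and lagging replicas."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Reads are always served (basically available), possibly stale.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Apply queued writes in order; afterwards all copies agree.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("price", 10)
store.replicate()
store.write("price", 12)
print(store.read("price"))  # 10 (stale, but available)
store.replicate()
print(store.read("price"))  # 12 (replicas have converged)
```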
Why would you want a model (BASE) that was not consistent?
Because, as accurate and precise as ACID stores are, they don’t scale very well.
BASE is not inconsistent; its copies are just not guaranteed to be in sync at every moment
What type of store is S3?
An Object store
What is the maximum object size in S3 and what is the largest object in a single PUT?
Max object size is 5TB
Largest single put 5GB
How can you increase the efficiency of uploads with files larger than 100MB?
You can use multi-part uploads
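A sketch of how an object might be split into byte ranges for a multi-part upload. It only plans the parts and does not call S3. The 5 MiB minimum part size and the 10,000-part maximum are S3 limits; the function name and the 100 MiB default part size are assumptions for illustration.

```python
import math

MIB = 1024 ** 2
MIN_PART = 5 * MIB   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000   # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=100 * MIB):
    """Return (part_number, start_byte, end_byte) tuples covering the object."""
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    count = math.ceil(object_size / part_size)
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return [
        (n + 1, n * part_size, min((n + 1) * part_size, object_size) - 1)
        for n in range(count)
    ]


# A hypothetical 250 MiB file is planned as three parts.
for part in plan_parts(250 * MIB):
    print(part)
```

Each planned part can be uploaded independently and retried on its own, which is where the efficiency gain comes from.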
How are objects referenced in S3?
By a KEY, which is essentially a URL-path-like string.
s3:///finance/April/16/invoice.pdf
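A small sketch of splitting an S3 URI into its bucket and key using only the standard library. The bucket name "example-bucket" is made up for illustration and the helper name is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// URI")
    return parsed.netloc, parsed.path.lstrip("/")


# "example-bucket" is a made-up bucket name for illustration.
bucket, key = split_s3_uri("s3://example-bucket/finance/April/16/invoice.pdf")
print(bucket)  # example-bucket
print(key)     # finance/April/16/invoice.pdf
```

The slashes in the key only look like folders; the whole string after the bucket name is the key.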
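What is S3’s consistency model for read-after-writes, and what does this mean in lay terms?
S3 provides read-after-write consistency for PUTs of new objects
If a new file is added that S3 has never seen before, you can read it immediately once it is written. (Since December 2020, S3 provides strong read-after-write consistency for all PUTs and DELETEs; the eventual-consistency cards that follow describe the earlier model.)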
What is S3’s consistency model for HEAD or GET requests of a KEY before the object exists? and what does this mean in lay terms?
HEAD or GET requests for a KEY before the object exists will result in eventual consistency.
Until an object has been fully written and replicated across AZs, S3 will say that it doesn’t know what the object is. So it will let you read it eventually.
What is S3’s consistency model for overwrite PUTS and DELETES of objects? and what does this mean in lay terms?
S3 offers eventual consistency for overwrite PUTS (updates) and DELETES.
S3 will serve the original object until it has updated or deleted the file and has replicated this change across all other AZs. It will serve the update or deletion once the change has been fully replicated.
What is S3’s consistency model for updates to a single KEY? and what does this mean in lay terms?
Updates to a single KEY are atomic
Whoa there, only one person can update this object at a time. If I get two requests I’ll process them in order of their timestamps and you’ll see the updates as soon as I replicate them elsewhere.
What are the 4 methods of securing objects in an S3 bucket?
1) Resource-based (bucket policy)
2) User-based (IAM policies)
3) Object-based (Object ACL)
4) Optional MFA before delete
In what order does S3 evaluate the security access of an object?
User-based (IAM policy) > Resource-based (bucket policy) > Object-based (Object ACL)
What does versioning in S3 enable?
Enables “roll-back” and “un-delete” capabilities
Do you get charged for old versions of objects?
Yes
Why use MFA in S3?
1) If you require safeguarding against accidental deletion of an object
2) If you would like to change the versioning state of your bucket
Why use cross-region replication in S3?
1) increased durability
2) reduced latency
3) To meet compliance requirements
What are the 7 storage classes of S3? and what types of data are they suited for?
1) Standard- Frequently accessed
2) Standard IA- Long-lived, infrequently accessed
3) One Zone IA- Long-lived, non-critical
4) Reduced redundancy- Frequently accessed, non-critical
5) Intelligent tiering- Long-lived with changing or unknown access patterns
6) Glacier- Long-term data archiving with retrieval in minutes to hours
7) Glacier Deep Archive- Long-term data archiving with retrieval within 12-48 hours
Why use S3 lifecycle management in S3?
1) optimise storage costs
2) Adhering to a data retention policy
3) Keep S3 buckets well-maintained
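A sketch of what a lifecycle rule could look like, written as a plain dictionary in the shape that boto3's put_bucket_lifecycle_configuration accepts. The rule ID, the "logs/" prefix and the day counts are hypothetical, and nothing here calls AWS.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",  # hypothetical rule name
            "Filter": {"Prefix": "logs/"},     # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},       # retention policy: delete after a year
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3 installed and credentials configured, this dictionary would be passed as the LifecycleConfiguration argument alongside the Bucket name.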
Name 4 ways S3 can be used in analytics…
1) Data lake concept- S3 data used as a data lake accessible to Athena, Redshift or QuickSight
2) IoT streaming data repo- Stream data into Kinesis Firehose
3) Machine learning and AI storage- Rekognition, Lex, MXNet
4) Storage class analysis- Analyses current usage… used by S3 management analytics to recommend areas where you can save
Name the 4 encryption at rest options available with S3…
1) SSE-S3 - S3-managed encryption keys using AES-256
2) SSE-C - Provide your own AES-256 encryption key, which S3 will use when it writes the objects
3) SSE-KMS - Use a key generated and managed by AWS Key Management Service
4) Client-side - Encrypt objects using your own local encryption process before uploading to S3 (e.g. PGP, GPG)
What is transfer acceleration in S3?
A process of speeding up data uploads using CloudFront in reverse
What does Requester Pays mean in S3?
The requester pays for requests and data transfer rather than the bucket owner.
What is a tag in the context of S3?
Assign tags to objects for use in costing, billing and security etc…
What is an event in the context of S3?
Events fire when certain things happen in your S3 bucket (modification/add/delete). They can trigger notifications to SNS, SQS or Lambda.
What is static web hosting in S3?
Simple and massively scalable static website hosting
How can BitTorrent be used with S3?
You can use the BitTorrent protocol to retrieve any publicly available object by automatically generating a .torrent file
What type of data is AWS Glacier useful for?
Seldom accessed data, cold storage
Which hybrid cloud service uses Glacier for storage?
AWS storage gateway virtual tape library
Is Glacier integrated with S3 lifecycle management?
Yes
What is a glacier vault?
A way to group archives together in S3 Glacier
What is an archive in Glacier?
Any object such as a photo, video or document. It is the base unit of Glacier storage. Each archive has a unique ID and an optional description. The archive ID is unique within the AWS region in which the archive is stored.
What is the max size of an archive?
40TB
What are the two levels and ways access to a vault is controlled?
1) Resource-based- Vault access policy
2) Identity-based- IAM policies
What is a vault access policy? Give an example of its use…
Sets rules which vaults must abide by.
e.g. no one can delete an object or before anyone deletes an object they must use MFA
How are IAM policies used for access to vaults? Also, Vault locks are ___….
Access managed through IAM gives users permissions to administer a vault or to overwrite or delete a vault lock.
Immutable… They cannot be changed
What are the 4 steps of locking a vault?
1) Create a lock
2) Initiate vault lock
3) Within 24 hours, confirm that the lock is performing as expected
a) if lock confirmed the lock is applied forever… no changes
b) if the lock is not confirmed then the lock dissolves
What is EBS? (2 points)
Elastic Block Store. Essentially virtual hard drives. Can be unplugged and used with a different instance
Can EBS volumes be used in multi-AZ?
No, confined to a single AZ. Only one instance can access volume by default.
What backup strategy can you use with EBS volumes?
EBS snapshots
When would you use an instance store over an EBS volume?
when you want very fast access e.g. cache/buffer/scratch.
EBS is over the network so not as fast
What are the 3 benefits of using EBS snapshots?
1) Provides a cost-effective and easy backup-snapshot
2) Easy to share data sets with other users/accounts
3) Easy to migrate a system to a new AZ or region
What are the 4 steps to convert an unencrypted volume to an encrypted volume?
1) take a snapshot
2) Use snapshot to create a new volume
3) Tick the encryption option when creating the volume
4) Mount the volume in EC2
What information is stored in a volume snapshot?
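Only the blocks that have changed since the previous snapshot (the first snapshot captures the whole volume)
Given that we have 3 snapshots (1, 2, 3), if we delete 2, do we lose 3?
No, we still have 1 and 3, but we can no longer re-create the volume as it was at the time of snapshot 2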
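A toy model of why deleting the middle snapshot does not break the later one: each snapshot is a set of pointers to blocks, unchanged blocks are shared, and a block is only removed once no remaining snapshot points to it. This is a simplified sketch with hypothetical names, not a description of how EBS is actually implemented.

```python
class SnapshotStore:
    """Toy model of incremental snapshots that share unchanged blocks."""

    def __init__(self):
        self.blocks = {}     # block_id -> data, each stored once
        self.snapshots = {}  # snapshot name -> {volume offset: block_id}
        self._next_id = 0

    def take(self, name, volume):
        previous = list(self.snapshots.values())[-1] if self.snapshots else {}
        pointers = {}
        for offset, data in volume.items():
            old_id = previous.get(offset)
            if old_id is not None and self.blocks[old_id] == data:
                pointers[offset] = old_id          # unchanged: reuse the block
            else:
                self.blocks[self._next_id] = data  # changed: store a new block
                pointers[offset] = self._next_id
                self._next_id += 1
        self.snapshots[name] = pointers

    def delete(self, name):
        del self.snapshots[name]
        in_use = {b for p in self.snapshots.values() for b in p.values()}
        for block_id in list(self.blocks):
            if block_id not in in_use:
                del self.blocks[block_id]          # only unreferenced blocks go

    def restore(self, name):
        return {off: self.blocks[b] for off, b in self.snapshots[name].items()}


store = SnapshotStore()
volume = {0: "A", 1: "B"}
store.take("snap1", volume)
volume[1] = "C"
store.take("snap2", volume)
volume[0] = "D"
store.take("snap3", volume)
store.delete("snap2")
print(store.restore("snap3"))  # {0: 'D', 1: 'C'}
```

Block "C" was first captured by snap2, but it survives the deletion because snap3 still points to it.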
What is a snapshot?
A collection of pointer data which is stored in S3
What are the 2 ways we can use lifecycle manager to manage EBS snapshots?
1) Schedule snapshots to be created for volumes e.g. every hour
2) Set retention rules to remove stale snapshots
What is EFS? and what is it an implementation of?
Elastic File System. An implementation of NFS- the Network File System protocol
What is the pay model for EBS?
You pay for a set amount of GB per month, regardless of use!
What is the pay model for EFS?
You only pay for the amount of storage you use
Is EFS multi-AZ?
Yes
How do EC2 instances access files on EFS?
Through mount points in one or many AZs
Can you use EFS to mount on prem?
Yes, but caution here… you would need to have a stable connection e.g. AWS Direct Connect, or use AWS DataSync (the successor to EFS File Sync)
How does EFS compare price-wise to EBS and S3?
3x more expensive than EBS
20x more expensive than S3
What is Amazon Storage Gateway?
A virtual machine that you run on-prem with VMware. It provides local resources and backs onto S3 and Glacier
What are 2 common use cases for Amazon Storage Gateway?
1) Disaster recovery
2) Cloud migrations
What are the four types of Amazon Storage Gateway?
1) File gateway
2) Volume gateway stored mode
3) Volume gateway cached mode
4) Tape gateway
What is file gateway and which interfaces does it allow?
Allows on prem to store objects in S3 via NFS or SMB mount points
NFS, SMB
What is volume gateway stored mode and which interface does it use?
Primary data is stored on-prem and asynchronously replicated to S3
iSCSI
What is volume gateway cached mode and which interface does it use?
Primary data stored in S3 with frequently accessed data cached locally on prem
iSCSI
What is the tape gateway and which interface does it use?
A virtual media changer and tape library for use with existing backup software
iSCSI
What is Amazon WorkDocs?
AWS’s version of Dropbox or Google Drive
When would you run a database on EC2?
When you want ultimate flexibility or a database that is not currently supported by RDS e.g. SAP HANA
What are the disadvantages of running your own database on EC2?
You are responsible for backup, patching and scaling…
What is RDS?
A managed option for MySQL, PostgreSQL, MariaDB, Aurora…
What type of data is RDS most suited to?
Structured and relational data
What are the benefits of using RDS? (3 points)
1) Automates backup and patching in customer-defined maintenance windows
2) push-button scaling
3) redundancy
What service do you use if you need to store large binary objects (BLOBS)?
S3
What service do you use if you need automated scalability for the data you want to store?
DynamoDB
What service do you use if you need to store name/value data?
DynamoDB
Which service do you use if you want to store data that is not well structured or is unpredictable?
DynamoDB
Which service do you use if you require a non-supported database such as SAP HANA or you want complete control?
EC2
What are the two types of multi-AZ replication available for RDS databases?
1) Synchronous replication
2) Asynchronous replication
What is synchronous replication and how does a master and standby use this type of replication in a multi-AZ architecture?
Instant replication of data from a master to a standby in a different AZ.
What happens if a master RDS fails?
The standby RDS gets promoted to the master and has ALL of the data that the master had
What is asynchronous replication and how does a read replica use this type of replication in a multi-AZ architecture?
Read replicas are seconds or mins behind the master.
What happens if a region fails that contains a master and standby? (Read replicas are in a different region)
Read replicas are promoted to master and new standby created (This will be done manually)
What is DynamoDB?
DynamoDB is a managed multi-AZ NoSQL datastore with a cross-region replication option.
What is the consistency model for DynamoDB?
BASE, Eventual consistency by default.
How does the pricing model work for DynamoDB?
Based on throughput
How does autoscaling work for DynamoDB? What alternative can you use to allow full scaling?
Set min/max levels in anticipation of need. You can use on-demand capacity if you do not know the amount of capacity you need.
Can DynamoDB be ACID?
Yes, you can force ACID by using DynamoDB transactions
What is an attribute (DynamoDB)?
A name and value pair
What is an item (similar to a record) (DynamoDB)?
A collection of attributes
What is a table (DynamoDB)?
A collection of items
Each item has a partition (aka primary) key associated with it. What does DynamoDB do with this key?
It creates a hash of the key value, which is used to assign the partition (the underlying physical storage) to use. The partition key is also known as a hash attribute.
What is a composite key (DynamoDB)?
Partition key + sort key
What is the role of the partition key and the sort key?
Partition key- the location the data will be physically stored
Sort key- The order the data will be stored in for all keys with the same partition key
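A toy sketch of the two roles: the partition key is hashed to pick a partition, and items that share a partition key are returned in sort-key order. DynamoDB's real hash function and partition count are internal; SHA-256, the four partitions and the sample orders are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 4  # hypothetical; real partitions are managed by DynamoDB


def partition_for(partition_key):
    """Map a partition key to a partition number (SHA-256 as a stand-in hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % PARTITION_COUNT


# partition number -> partition key -> {sort key: item}
partitions = defaultdict(lambda: defaultdict(dict))


def put_item(partition_key, sort_key, item):
    partitions[partition_for(partition_key)][partition_key][sort_key] = item


def query(partition_key):
    """Return the items for one partition key, ordered by sort key."""
    items = partitions[partition_for(partition_key)][partition_key]
    return [items[sort_key] for sort_key in sorted(items)]


put_item("customer-42", "2024-03-01", "order C")
put_item("customer-42", "2024-01-15", "order A")
put_item("customer-42", "2024-02-10", "order B")
print(query("customer-42"))  # ['order A', 'order B', 'order C']
```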
Name 2 secondary indexes…
1) Global secondary index
2) Local secondary index
What is a Global secondary index?
Partition key and sort key can be different from those on the table
I AM GLOBAL BABY!
What is a local secondary index?
Same partition key as the table, but a different sort key
When would you use a global secondary index?
When you want a fast query of attributes outside of the primary key without having to do a table scan
e.g. querying sales orders by customer number rather than sales by order number
When would you use a local secondary index?
When you already know the partition key and want to quickly query on some other attribute
e.g. I have a sales order number but I would like to retrieve only those records with a certain material number
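The two index types can be sketched over an in-memory list of items. Here the base table is assumed to use order_id as the partition key and line_no as the sort key; the attribute names and data are made up, and the index is a plain dictionary rather than anything DynamoDB-specific.

```python
from collections import defaultdict

# Hypothetical sales-order table: partition key = order_id, sort key = line_no.
table = [
    {"order_id": "SO-1", "line_no": 1, "customer": "C-7", "material": "M-100"},
    {"order_id": "SO-1", "line_no": 2, "customer": "C-7", "material": "M-200"},
    {"order_id": "SO-2", "line_no": 1, "customer": "C-9", "material": "M-100"},
]


def build_index(items, partition_attr, sort_attr):
    """Group items by partition_attr, each group ordered by sort_attr."""
    index = defaultdict(list)
    for item in items:
        index[item[partition_attr]].append(item)
    for group in index.values():
        group.sort(key=lambda item: item[sort_attr])
    return index


# Global secondary index: a different partition key (customer).
gsi = build_index(table, "customer", "order_id")
# Local secondary index: same partition key (order_id), different sort key.
lsi = build_index(table, "order_id", "material")

orders_for_customer = [item["order_id"] for item in gsi["C-7"]]
lines_with_material = [i["line_no"] for i in lsi["SO-1"] if i["material"] == "M-200"]
print(orders_for_customer)   # ['SO-1', 'SO-1']
print(lines_with_material)   # [2]
```

Both lookups go straight to one group of items instead of scanning the whole table, which is the point of either index.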
(DynamoDB use case- solution, cost and benefit) What would you do if you need to… access just A FEW attributes the fastest way possible?
solution- project just those few attributes in a global secondary index
cost- minimal
benefit- lowest possible latency access for non-key items
(DynamoDB use case- solution, cost and benefit) What would you do if you need to… frequently access SOME non-key attributes
solution- project those attributes in a global secondary index
cost- moderate, aims to offset table scan cost
benefit- low latency for access to non-key items
(DynamoDB use case- solution, cost and benefit) What would you do if you need to… frequently access MOST non-key attributes
Solution- Project those attributes or even the entire table into a global secondary index
cost- up to double
benefit- Maximum flexibility
(DynamoDB use case- solution, cost and benefit) What would you do if you need to… rarely query but write or update frequently
Solution- Project keys only for the global secondary index
cost- minimal
benefit- very fast writes or updates for non-partition key items
Why would you use global secondary for table replicas?
To apply different WCU (write capacity units) and RCU (read capacity units) to tables e.g. free and premium customers.
What is Redshift?
A cost-effective, scalable data warehouse. You use this to query large data sets and identify correlations between disparate datasets. You can also query S3 using Redshift Spectrum
What is Neptune?
A graph database. Allows you to store and query relationship data.
What is ElastiCache?
An in-memory data store (not persistent in the traditional sense)
Which two in-memory stores does ElastiCache provide?
1) Memcached
2) Redis
Which is faster, ElastiCache or DynamoDB?
Elasticache
Which in-memory store is most appropriate for (and why?)…. Web session storage?
Redis, using Redis avoids storing session data on server
Which in-memory store is most appropriate for (and why?)…. database caching?
Memcached, cheap and fast!
Which in-memory store is most appropriate for (and why?)…. leader boards
Redis, uses sorted sets! can keep order of millions of users instantly
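A toy stand-in for a leaderboard built on a Redis sorted set. It mimics the behaviour of ZADD and ZREVRANGE in plain Python so it runs without a Redis server. Real Redis keeps members ordered as they are inserted rather than sorting on every query, and the class name and tie-breaking rule here are assumptions.

```python
class Leaderboard:
    """Toy stand-in for a Redis sorted set: members ranked by score."""

    def __init__(self):
        self.scores = {}

    def add(self, member, score):
        # Loosely like ZADD: insert the member or overwrite its score.
        self.scores[member] = score

    def top(self, n):
        # Loosely like ZREVRANGE 0 n-1 WITHSCORES: highest scores first.
        ranked = sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


board = Leaderboard()
board.add("ana", 1200)
board.add("raj", 1500)
board.add("li", 900)
board.add("ana", 1600)  # ana improves her score
print(board.top(2))  # [('ana', 1600), ('raj', 1500)]
```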
Which in-memory store is most appropriate for (and why?)…. streaming
Use either! e.g. landing spot for streaming sensor data on the factory floor
What are the 4 key reasons to choose Memcached?
1) simple and straightforward
2) you need to scale out and in as demand changes
3) you need multiple CPU cores and threads
4) you need to cache objects like database queries
What are the 8 reasons to choose Redis?
1) you need encryption
2) you need HIPAA compliance
3) you need support for clustering
4) you need complex data types
5) you need HA
6) you need pub/sub compatibility
7) you need geospatial indexing
8) you need backup and restore
What is Amazon Managed Blockchain and what is QLDB?
Managed blockchain framework
QLDB is an ordering service that is used to maintain a complete history of all transactions
What is the Amazon Timestream database, and when would you use it?
Fully managed database designed to manage time-series data e.g. industrial machinery
What is Amazon DocumentDB
AWS based MongoDB - HA, multiAZ, scalable
What is Amazon Elasticsearch?
Search engine but also a doc store, part of the ELK stack… basically just a way to perform analytics on data.
Choose a database option based on the scenario below….
You need ultimate control over the database and the preferred DB is not available on RDS
Database on EC2
Choose a database option based on the scenario below….
Need traditional relational database for OLTP (online transactional processing), data is well structured
Amazon RDS
Choose a database option based on the scenario below….
Your data is in name/value pairs or in an unpredictable structure.
You also need in-memory performance with persistence
DynamoDB
Choose a database option based on the scenario below….
You have massive amounts of data that will primarily be used for OLAP workloads
Amazon Redshift
Choose a database option based on the scenario below….
Relationships between objects are a major portion of the data value
Amazon Neptune
Choose a database option based on the scenario below….
You need fast temporary storage for small amounts of data
The data is highly volatile
Amazon Elasticache
What does file gateway expose its interface as?
NFS or SMB; it does not expose as iSCSI!
How would you improve the performance of queries against your DynamoDB table if most of the queries do not use the partition key?
Create a global secondary index with the most common queried attribute as the hash key (partition key)
You try and get a file that doesn’t exist, then you add the file and try and fetch again… What are the two outcomes of fetching metadata from a newly added file in S3? and why is this?
1) You get a 404 error because the upload has not yet propagated
2) you get the metadata
Because of eventual consistency for read after write
What is a lazy write?
Another name for eventual consistency
What does FQDN stand for and which service is this used in?
FULLY QUALIFIED DOMAIN NAME
e.g. when specifying a mount point in EFS
How would you ensure that EFS can tolerate an AZ failure?
Create EFS mount targets in each AZ and configure each EC2 instance to mount the common mount target via its FQDN
How do EC2 instances use the FQDN for EFS?
The EC2 instances use the common FQDN as a mount target. The FQDN will resolve to the local mount target in each AZ
What type of database is SAP HANA or Neo4j?
Neo4j is a graph database; SAP HANA is an in-memory, column-oriented relational database with graph capabilities
What 3 formats does Amazon Athena support?
1) JSON
2) Apache Parquet
3) Apache ORC
NOT XML
What 2 features can be used to increase the speed of read operations in DynamoDB?
1) DynamoDB Accelerator (DAX)- in memory cache in front of DynamoDB
2) Secondary indexes
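A sketch of the read-through caching idea behind an in-memory cache placed in front of a table. A plain dictionary stands in for the table, and the class name, key and item are hypothetical; this is not the DAX client API.

```python
class ReadThroughCache:
    """Toy read-through cache in front of a slower backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]  # stand-in for a read from the table
        self.cache[key] = value
        return value


table = {"user#1": {"name": "Ana"}}  # hypothetical item
cache = ReadThroughCache(table)
cache.get("user#1")  # miss: fetched from the table, then cached
cache.get("user#1")  # hit: served from memory
print(cache.hits, cache.misses)  # 1 1
```

Only the first read of a key pays the cost of going to the table; repeat reads are served from memory until the entry is evicted or expires, which this sketch does not model.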