AWS Hello, Storage Definitions Flashcards
3 types of AWS storage
- Block: EBS (persistent), EC2 Instance Store (ephemeral)
- File: EFS
- Object: S3, S3 Glacier
EBS
- Elastic Block Store
EFS
- Elastic File System
S3
- Amazon Simple Storage Service
3 V’s of big data
- Velocity
- Variety
- Volume
Velocity
- Speed at which data is being read/written
- Measured in reads per second (RPS) or writes per second (WPS)
- Can be based on batch processing, periodic, near real time, or real time speed
Variety
- Determines how structured the data is AND
- How many different structures exist in the data.
- Ex: Highly structured -> loosely structured, unstructured, or BLOB
BLOB
- Binary large object data.
Volume
- Total size of dataset.
- Typical metrics that measure the ability of a data store to support volume are:
- Maximum storage capacity and cost - Ex: $/GB
Hot data
- Actively worked on (new ingests, updates, transformations)
- Read and writes tend to be single item.
- Items tend to be small (up to hundreds of kilobytes)
- Speed of access = essential
- Tends to be high velocity + low volume
Warm data
- Still being actively accessed (less frequent than hot)
- Items can be small like hot, but are updated and read in sets.
- Speed of access while important is less than hot.
- More balanced across velocity and volume dimensions.
Cold data
- Still needs occasional access.
- Updates to data are rare
- Reads can tolerate higher latency
- Items tend to be large (tens to hundreds of megabytes or gigabytes)
- Often written / read individually.
- High durability, low cost = essential
- High volume and low velocity.
Frozen data
- Needs to be preserved for business continuity / archival / regulatory reasons.
- Not actively worked on.
- New data can be regularly added to data store, existing data is NEVER updated.
- Reads are very infrequent (“write once, read never”)
- Can tolerate high latency.
- Very high volume, very low-velocity
Transient data
- Usually short-lived.
- Loss of a subset of transient data does not have a big impact on system.
- Ex: clickstream or Twitter data.
- Usually don’t need high durability of this data (b/c we expect it to be quickly consumed, yielding higher value data)
- Note: not all streaming data is transient. (ex: intrusion alert system)
Reproducible data
- Contains a copy of useful information that is often created to improve performance or simplify consumption.
- Ex: adding more structure or altering structure to match consumption patterns.
- Loss of some or all of this data may affect the system’s performance or availability.
- Does not result in data loss (b/c it’s reproducible)
- Ex: Data warehouse data, read replicas of OLTP, many types of caches.
- Invest a bit in durability (to reduce impact on system’s performance/availability) but only to a point.
OLTP
- Online transaction processing systems.
- Category of data processing focused on transaction-oriented tasks.
- Usually Inserting, Updating, Deleting small amounts of data in a database.
- Mainly deals with large numbers of transactions by large number of users.
Authoritative data
- Source of truth.
- Losing it will significantly impact business b/c difficult/impossible to restore or replace.
- Willing to invest additional durability. More important, more durability desired.
Critical/Regulated data
- Business must retain at any cost.
- Tends to be stored for longer periods of time.
- Needs to be protected from accidental or malicious changes, not just data loss.
- In addition to durability, cost and security are equally important.
ERP
- Enterprise resource planning systems.
Block storage
- Offers low-latency storage for high-performance workloads.
- Analogous to DAS (direct-attached storage) or SAN (storage area network).
- Ex: EBS volumes attached to EC2 instances.
- ERPs are a good example of an enterprise application that requires dedicated, low-latency storage for each host.
DAS
- Direct-attached storage
- Analogous to Block storage.
SAN
- Storage Area Network
- Analogous to Block Storage.
- Computer network which provides access to consolidated, block-level storage.
Object storage
- Ideal for building modern applications from scratch that require scale and flexibility.
- Can be used to import existing data stores for analytics, backup, or archive.
- Cloud storage makes it possible to store virtually limitless data in native format.
- Ex: S3
File storage
- For applications that need access to shared files and require a file system.
- Ideal for large content repositories, development environments, media stores, user home directories.
- Often supported with NAS (network-attached storage) server
NAS
- Network-attached storage server usually supports File Storage.
Confidentiality
- Equated to privacy level of your data.
- Refers to levels of encryption or access policies for your storage / files.
- Limit access to prevent accidental information disclosure by restricting access and enabling encryption.
Integrity
- Refers to whether your data is trustworthy and accurate.
- Ex: Are you sure the file you generated has not been changed when audited later?
- Tip - restrict permission of who can modify data.
- Tip - Enable backup and versioning.
Availability
- Refers to the availability of an AWS storage service, where an authorized party can gain reliable access to the resource.
- Tip - restrict permission of who can delete data.
- Tip - enable MFA for S3 delete operation.
- Tip - enable backup and versioning.
AFR
- Annual Failure Rate
- EBS volumes are about 20x more reliable than typical commodity disk drives (AFR around 4%)
- EBS AFR: 0.1% - 0.2%
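The "20x" figure follows from the two AFR values on this card. A minimal sketch of the arithmetic, assuming a hypothetical fleet of 1,000 devices (the fleet size and the function name are illustrative, not from the source):

```python
def expected_annual_failures(device_count: int, afr: float) -> float:
    """Expected number of devices that fail in one year, with AFR given as a fraction."""
    return device_count * afr

COMMODITY_AFR = 0.04   # around 4%, from the card
EBS_AFR_LOW = 0.001    # 0.1%, lower end of the EBS range
EBS_AFR_HIGH = 0.002   # 0.2%, upper end of the EBS range

fleet_size = 1000  # hypothetical fleet, for illustration only

print("Commodity disks:", round(expected_annual_failures(fleet_size, COMMODITY_AFR), 2))
print("EBS (low end):  ", round(expected_annual_failures(fleet_size, EBS_AFR_LOW), 2))
print("EBS (high end): ", round(expected_annual_failures(fleet_size, EBS_AFR_HIGH), 2))

# Ratio of commodity AFR to the upper end of the EBS range: roughly 20x.
print("Reliability ratio:", round(COMMODITY_AFR / EBS_AFR_HIGH, 2))
```

On these numbers, roughly 40 commodity drives out of 1,000 would be expected to fail in a year, against one or two EBS volumes.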
IOPS
- Input/output operations per second.
- Common performance measurement used to benchmark computer storage devices like hard disk drives (HDD) and solid state drives (SSD) and storage area network (SAN)
HDD
- Hard disk drive
- HDD backed volumes are optimized for large streaming workloads where throughput (measured in MiB/s) is a better performance measure than IOPS.
SSD
- Solid state drives
- SSD backed volumes are optimized for transactional workloads involving frequent read/write operations with small I/O size
- Where dominant performance attribute is IOPS.
- Newer, faster type of device than HDD; stores data on instantly accessible memory chips
MiB/s
- Mebibyte per second
- Unit of data transfer rate = 1,048,576 bytes per second
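IOPS and throughput are linked by I/O size: throughput is roughly IOPS multiplied by the size of each operation. The sketch below works that out in MiB/s; the IOPS figures and I/O sizes are hypothetical examples, not EBS limits.

```python
KIB = 1024
MIB = 1024 * KIB  # 1,048,576 bytes

def throughput_mib_per_s(iops: float, io_size_bytes: int) -> float:
    """Throughput in MiB/s for a given IOPS rate and I/O size in bytes."""
    return iops * io_size_bytes / MIB

# Small, frequent I/O (transactional style): many operations, modest throughput.
print(throughput_mib_per_s(3000, 16 * KIB))

# Large, sequential I/O (streaming style): few operations, high throughput.
print(throughput_mib_per_s(250, 1 * MIB))
```

This is why IOPS is the more telling measure for small transactional I/O, while MiB/s is the more telling measure for large streaming I/O.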
AES-256
- 256-bit Advanced Encryption Standard
- An algorithm used in EBS encryption
- Encryption occurs on the server that hosts the EC2 instance.
- This provides encryption of data in transit from EC2 instance to EBS Storage.
- This is used in conjunction with AWS KMS.
AWS KMS
- AWS Key Management Service.
- Amazon-managed key infrastructure.
- Encryption occurs on the server that hosts the EC2 instance.
- This provides encryption of data in transit from EC2 instance to EBS Storage.
- This is used in conjunction with AES-256.
CMK
- Customer master key.
- One of two options for EBS encryption key creation.
- AWS KMS will create the CMK if you choose this option. (Instead of creating a KMS generated key)
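A minimal sketch of requesting an encrypted EBS volume with boto3, assuming an EC2 client is passed in. The function name, Availability Zone, and size are hypothetical; `Encrypted` and `KmsKeyId` are parameters of the EC2 `create_volume` call. Omitting `KmsKeyId` leaves the key choice to the account's default.

```python
def create_encrypted_volume(ec2_client, availability_zone, size_gib, kms_key_id=None):
    """Create an encrypted EBS volume; optionally name a specific KMS key."""
    params = {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,      # size in GiB
        "Encrypted": True,     # request EBS encryption
    }
    if kms_key_id is not None:
        # Use a specific KMS key instead of the default key for EBS.
        params["KmsKeyId"] = kms_key_id
    return ec2_client.create_volume(**params)

# Usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   create_encrypted_volume(ec2, "us-east-1a", 100)
```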
Pre-warming / Initialization
- Pre-warming is the previous term for “initialization”
- For an EBS volume created from a snapshot, each block must be pulled down from S3 and written to the volume before you can access it.
- This preliminary time can cause a significant increase in latency of an I/O operation the first time each block is accessed.
- Performance returns after the data is accessed once
Initialization process
- For most applications, it is ok to amortize the cost of initializing a volume from a snapshot over the lifetime of the application.
- If this is not acceptable, you can avoid a performance hit by accessing each block (thus absorbing the downtime) prior to putting the volume into production.
- This process = initialization.
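Initialization amounts to reading every block of the volume once. Tools such as dd or fio are commonly used for this; the Python sketch below only illustrates the idea. The function name and the device path in the usage note are hypothetical.

```python
def read_all_blocks(device_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read a device (or file) from start to end in chunks; return total bytes read.

    Touching every block once forces it to be loaded, so later reads
    do not pay the first-access latency penalty.
    """
    total_bytes = 0
    with open(device_path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # end of device reached
                break
            total_bytes += len(chunk)
    return total_bytes

# Usage (hypothetical device name; reading a block device usually requires root):
#   read_all_blocks("/dev/xvdf")
```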
RAID 0
- Configuration that allows you to join (stripe) multiple volumes together for certain types of instances.
- Recommended to maximize utilization of instance resources.
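RAID 0 spreads consecutive blocks across several volumes so they can be read and written in parallel. A simplified sketch of round-robin striping, assuming a stripe unit of one block (real RAID implementations use a configurable stripe size; the function name is illustrative):

```python
def stripe_location(logical_block: int, volume_count: int) -> tuple:
    """Map a logical block number to (volume index, block index within that volume)."""
    return logical_block % volume_count, logical_block // volume_count

# With two volumes, consecutive logical blocks alternate between them.
for block in range(4):
    print(block, stripe_location(block, 2))
```

Because RAID 0 keeps no redundant copy, losing any one volume loses the whole array.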
What is the read-ahead configuration to achieve maximum throughput for a block device?
- 1 MiB
- This is a per-block-device setting.
- Should only be applied to HDD volumes.
RDP
- Remote Desktop Protocol
AMI
- Amazon Machine Image
EBS volume vs EC2 instance store
- EBS = Persistent
- Location: NETWORK-attached
- Recommended for durable data
- EC2 instance store = Temporary
- Location: Disks which are PHYSICALLY attached to host computer
- cannot be the only source of truth for your data
- good for incurring large amounts of I/O at lowest possible latency
T/F - You can add an instance store after an EC2 instance has been launched.
- False, it must be enabled when the EC2 instance is launched.
T/F - Instance store provides the lowest-latency storage to your instance (other than RAM)
- True.
Object
- Piece of data like a document, image, or video that is stored with some metadata in a flat structure.
- Object storage provides that data to applications via APIs over the internet.
Metadata
- A set of data that describes and gives information about other data.
- “Data about data”
- Ex: descriptive, structural, administrative, reference, statistical.
S3
- Simple Storage Service
Bucket
- A container for objects stored in S3.
- Every object is contained in a bucket.
- Bucket is like a drive or volume in traditional terminology.
T/F - It is a good idea to use buckets like folders in S3.
- False. This is not best practice, as there is a 100-bucket limit. (You could reach the limit as your application or data grows)
DNS
- Domain Name System (S3 bucket names must be DNS-compliant)
SSL / TLS
- Secure Sockets Layer
- Its successor is TLS (Transport Layer Security)
- Protocols for establishing authenticated and encrypted links between computer networks
T/F - Amazon bucket names must be universally unique.
- True
Versioning
- Keeping multiple variants of an object in same bucket.
- When versioning is turned on, S3 will create new versions of your object every time you overwrite a particular object key.
- Every time you update an object with the same key, S3 will maintain a new version of it.
Versioning-enabled buckets
- Let you recover objects from accidental deletion or overwrite.
- Bucket’s versioning configuration can also be MFA Delete-enabled for additional layer of security.
- If you overwrite an object, it results in a new object version in the bucket.
- You can always restore from a previous version.
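A minimal boto3 sketch of turning versioning on and listing the stored versions of one key. The function names are hypothetical and an S3 client is assumed to be passed in; `put_bucket_versioning` and `list_object_versions` are S3 client calls.

```python
def enable_versioning(s3_client, bucket_name):
    """Turn on versioning for a bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )

def list_version_ids(s3_client, bucket_name, key):
    """Return the version IDs stored for one object key."""
    response = s3_client.list_object_versions(Bucket=bucket_name, Prefix=key)
    # Prefix matching can return other keys, so keep exact matches only.
    return [v["VersionId"] for v in response.get("Versions", []) if v["Key"] == key]

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   enable_versioning(s3, "my-bucket")
#   print(list_version_ids(s3, "my-bucket", "photos/photo1.jpg"))
```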
Versioning and Lifecycle policies
- Can use versioning in combination with lifecycle policies, applying rules according to whether an object is the current or a previous version.
- If concerned with building up of many versions and using space for a particular object, configure lifecycle policy that will delete the old version of the object after a certain period of time.
- Tip - Easy to set up lifecycle policy to control the amount of data that’s being retained when you use versioning on a bucket.
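One way to cap version build-up is a lifecycle rule that expires noncurrent versions after a set number of days. The sketch below assumes an S3 client is passed in; the rule ID and function name are hypothetical. Note that `put_bucket_lifecycle_configuration` replaces any lifecycle configuration already on the bucket.

```python
def expire_old_versions(s3_client, bucket_name, days):
    """Add a lifecycle rule that deletes noncurrent object versions after `days` days."""
    rule = {
        "ID": "expire-noncurrent-versions",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},            # empty prefix: applies to every object
        "NoncurrentVersionExpiration": {"NoncurrentDays": days},
    }
    # This call overwrites the bucket's existing lifecycle configuration.
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": [rule]},
    )
    return rule

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   expire_old_versions(boto3.client("s3"), "my-bucket", 30)
```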
How to discontinue versioning on a bucket
- Copy all of your objects to a new bucket that has versioning disabled and use that bucket moving forward.
- Tip - Can never return to an un-versioned state.
- But you can suspend versioning on the bucket.
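A sketch of checking and suspending versioning with boto3, assuming an S3 client is passed in. For a bucket that has never had versioning enabled, the `get_bucket_versioning` response carries no `Status` field; the "Unversioned" label below is a placeholder of this sketch, not an S3 value.

```python
def versioning_state(s3_client, bucket_name):
    """Return 'Enabled', 'Suspended', or 'Unversioned' (our own label) for a bucket."""
    response = s3_client.get_bucket_versioning(Bucket=bucket_name)
    return response.get("Status", "Unversioned")

def suspend_versioning(s3_client, bucket_name):
    """Stop creating new versions; existing versions are kept."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Suspended"},
    )

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   suspend_versioning(s3, "my-bucket")
#   print(versioning_state(s3, "my-bucket"))
```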
Cost implications of a versioning-enabled bucket
- Must calculate as though every version is a completely separate object that takes up the same space as the object itself.
- This may make this option cost prohibitive.
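A rough sketch of the cost calculation, treating every version as a full-size copy as the card describes. The price per GB-month is a hypothetical figure for illustration, not an AWS quote.

```python
def versioned_storage_cost(object_size_gb, version_count, price_per_gb_month):
    """Monthly storage cost when every version is billed as a full-size object."""
    return object_size_gb * version_count * price_per_gb_month

HYPOTHETICAL_PRICE = 0.02  # assumed $/GB-month, for illustration only

one_version = versioned_storage_cost(5, 1, HYPOTHETICAL_PRICE)
ten_versions = versioned_storage_cost(5, 10, HYPOTHETICAL_PRICE)
print(f"1 version:   ${one_version:.2f}/month")
print(f"10 versions: ${ten_versions:.2f}/month")
```

Cost scales linearly with the number of retained versions, which is why a lifecycle rule that expires old versions is worth pairing with versioning.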
Buckets in regions
- S3 creates buckets in region you specify.
- Can choose a region that is geographically close to optimize latency, minimize costs, or address regulatory requirements.
- Tip - Objects belonging to a bucket that you create in a specific AWS Region never leave that region unless you explicitly transfer them to another region.
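A sketch of creating a bucket in a chosen region with boto3, assuming an S3 client is passed in (ideally one created for that same region). The function name is hypothetical. For us-east-1 the `LocationConstraint` is left out, because S3 does not accept that value there.

```python
def create_bucket_in_region(s3_client, bucket_name, region):
    """Create a bucket in the given region."""
    params = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every region other than us-east-1 needs an explicit location constraint.
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return s3_client.create_bucket(**params)

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3", region_name="eu-west-1")
#   create_bucket_in_region(s3, "my-bucket", "eu-west-1")
```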
Python code - Create a bucket
import boto3
# Create an S3 client
s3 = boto3.client('s3')
# Create the bucket (the name must be globally unique)
s3.create_bucket(Bucket='my-bucket')
Python code - Get list of all bucket names
import boto3
# Create an S3 client
s3 = boto3.client('s3')
# Call S3 to list current buckets
response = s3.list_buckets()
# Get a list of all bucket names from the response
buckets = [bucket['Name'] for bucket in response['Buckets']]
# Print out the bucket list
print("Bucket List: %s" % buckets)
Java code - Delete an object from a bucket
- Note: a bucket must be empty before you delete it (unless you use a force parameter), so its objects are deleted first
import java.io.IOException;
import com.amazonaws.AmazonServiceException;
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.DeleteObjectRequest;
public class DeleteObjectNonVersionedBucket {
    public static void main(String[] args) throws IOException {
        String clientRegion = "*** Client region ***";
        String bucketName = "*** Bucket name ***";
        String keyName = "*** Key name ****";
        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withCredentials(new ProfileCredentialsProvider())
                    .withRegion(clientRegion)
                    .build();
            s3Client.deleteObject(new DeleteObjectRequest(bucketName, keyName));
        } catch (AmazonServiceException e) {
            // The call was transmitted successfully, but Amazon S3 couldn't process
            // it, so it returned an error response.
            e.printStackTrace();
        } catch (SdkClientException e) {
            // Amazon S3 couldn't be contacted for a response, or the client
            // couldn't parse the response from Amazon S3.
            e.printStackTrace();
        }
    }
}
CLI - Delete a bucket (with force parameter)
- Note: --force deletes all objects first and then deletes the bucket
$ aws s3 rb s3://bucket-name --force
Object tagging
- Enables you to categorize storage.
- Each tag is a key-value pair.
- Ex for personal health information: PHI = true OR Classification = PHI
- WARNING - It is acceptable to use tags to label objects that contain confidential or PII data, but the tags themselves should not contain confidential information.
- Can use multiple tags on one object
- Can tag new or existing objects
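A sketch of tagging an existing object with boto3, assuming an S3 client is passed in. The function name is hypothetical; the tag shown in the usage note comes from the PHI example on this card. Note that `put_object_tagging` replaces the object's whole tag set rather than adding to it.

```python
def tag_object(s3_client, bucket_name, key, tags):
    """Replace an object's tag set with the given {tag key: tag value} mapping."""
    tag_set = [{"Key": k, "Value": v} for k, v in tags.items()]
    s3_client.put_object_tagging(
        Bucket=bucket_name,
        Key=key,
        Tagging={"TagSet": tag_set},
    )
    return tag_set

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   tag_object(boto3.client("s3"), "my-bucket", "records/scan.pdf",
#              {"Classification": "PHI"})
```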
Object keys and values
- Key and Values are case sensitive.
How developers typically name their folders (tagging).
- Categorize their files in a folder-like structure in the key name.
- S3 has a flat file structure
- Ex:
- photos/photo1.jpg
- project/projectx/document.pdf
- project/projecty/document2.pdf
- Allows only one-dimensional categorization: everything under a prefix is one category.
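Because keys are plain strings in a flat namespace, a "folder" is just a shared key prefix. The sketch below filters the example keys from this card locally; against a real bucket, the S3 `list_objects_v2` call accepts a `Prefix` parameter that does the same filtering on the server side. The function name is illustrative.

```python
def keys_under_prefix(keys, prefix):
    """Return the keys that fall under a folder-like prefix."""
    return [key for key in keys if key.startswith(prefix)]

keys = [
    "photos/photo1.jpg",
    "project/projectx/document.pdf",
    "project/projecty/document2.pdf",
]

print(keys_under_prefix(keys, "project/"))
print(keys_under_prefix(keys, "project/projectx/"))
```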
Benefits of tagging
- Object tags enable fine-grained access control of permissions.
- Ex: Can grant an IAM user permission to read only objects with specific tags.
- Enable fine-grained object lifecycle management in which you can specify a tag-based filter, in addition to key-name prefix, in a lifecycle rule.
- When using S3 analytics, can configure filters to group objects together
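Tag-based read access can be expressed as an IAM policy with a condition on the object's existing tags. A minimal sketch that builds such a policy document as a Python dictionary; the bucket name, tag key and value, and function name are hypothetical placeholders.

```python
import json

def read_only_by_tag_policy(bucket_name, tag_key, tag_value):
    """Build an IAM policy allowing GetObject only on objects carrying a given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Allow the read only when the object already has this tag.
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
                },
            }
        ],
    }

policy = read_only_by_tag_policy("my-bucket", "Classification", "Public")
print(json.dumps(policy, indent=2))
```

The resulting JSON would be attached to the IAM user or role that should get the tag-scoped read access.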