AWS Hello, Storage Concepts Flashcards
Data Dimension
- The three V's of big data: volume, velocity, and variety.
- Consider the storage mechanism most suitable for a particular workload. NOT a single data store for the entire system.
- Right tool for the right job
Highly structured data
- Has a pre-defined schema.
- Ex: Relational database
- Each entity of the same type has the same number of attributes and the domain of allowed values for an attribute can be further constrained.
- Advantage: its self-describing nature.
Loosely structured data
- Has entities, which have attributes/fields.
- A field uniquely identifies each entity.
- However, attributes are not required to be the same in every entity.
- Result: the data is more difficult to analyze and process in an automated fashion, placing a higher burden of reasoning about the data on the consumer or application.
Unstructured data
- Does not have any predefined sense of structure.
- No entities or attributes
- Can contain useful information.
- Result: any useful information must be extracted by the consumer.
BLOB data
- Useful as a whole
- But little benefit trying to extract value from a piece or attribute.
- Result: systems that store BLOB treat as a “black box” to store/retrieve as a whole.
Data Temperature
- Another useful way to look at data to determine the right storage for an application.
- Helps us to understand how “lively” data is (how much is being written/read and how soon it needs to be available)
- Ex: Hot, Warm, Cold, Frozen
- The same data can start hot and gradually cool.
- When this happens, tolerance of read latency increases as does data set size.
Data value
- Some data must be preserved at all costs, other data can be easily regenerated or even lost without significant impact.
- Value of data will impact the investment in durability.
Data value tip!
- To optimize cost and/or performance further, segment data within each workload by value and temperature, and consider different data storage options for different segments.
Data dimensions tip!
- Think in terms of a data storage mechanism that is most suitable for a particular workload - not a single data store for the entire system. Choose the right tool for the job.
Storage tip - One size does not fit all!
- Know the availability, level of durability, and cost factors for each storage option and how they compare.
AWS Shared Responsibility Model and Storage
- AWS: responsible for securing the storage services
- Developer/customer: responsible for securing access to and using encryption on artifacts you create/store.
- Best practice to always use principle of least privilege.
CIA model
- Confidentiality, Integrity, and Availability form the fundamentals of information security. These should be applied to AWS storage.
- Availability (1) sits on top of Integrity (2) and Confidentiality (3) to form "Information Security".
EBS characteristics
- EBS presents data to EC2 instance as a disk volume.
- Provides lowest-latency access to your data from single EC2 instances.
- EBS provides durable, persistent block storage volume for use with EC2 instances.
- Automatically replicated within its AZ (offering high availability and durability).
- Offers consistent, low-latency performance.
- Can scale up and down within minutes. Pay for what you provision
Typical use cases for EBS
- Boot volumes on EC2 instances
- Relational and NoSQL databases
- Stream and log processing applications
- Data warehousing applications
- Big data analytics engines (Hadoop) and Amazon EMR clusters.
EBS designed to achieve:
- Availability of 99.999%
- Durability via replication within a single AZ.
- Annual failure rate (AFR) between 0.1 and 0.2 percent
EBS Volume attributes
- Persist independently from the running life of an EC2 instance. (After EBS is attached to an instance, use it like any other physical hard drive.)
- Very flexible. (On current-generation volumes attached to current-generation instance types, you can dynamically increase size, modify provisioned input/output operations per second (IOPS) capacity, and change the volume type on live production volumes.)
EBS Volume types
- SSD-backed volumes
- HDD-backed volumes
SSD Use Cases
- GENERAL PURPOSE: recommended for most workloads.
  - System boot volumes
  - Virtual desktops
  - Low-latency interactive apps
  - Development and test environments
- PROVISIONED IOPS:
  - I/O-intensive workloads
  - Relational DBs
  - NoSQL DBs
HDD Use Cases
- THROUGHPUT-OPTIMIZED:
  - Streaming workloads requiring consistent, fast throughput at a low price
  - Big data
  - Data warehouses
  - Log processing
  - Cannot be a boot volume
- COLD:
  - Throughput-oriented storage for large volumes of data that is infrequently accessed
  - Scenarios where the lowest storage cost is important
  - Cannot be a boot volume
Elastic Volume benefits
- Changes can be made with no downtime, no performance impact, and no changes to the application.
- Create the volume with just the capacity and performance needed at deployment, because you can always change it later.
- Saves hours of planning cycles and prevents overprovisioning.
EBS Snapshot
- Point-in-time snapshots of EBS volumes.
- Backed up to S3 for long-term durability.
- The volume does not need to be attached to a running instance to take a snapshot (see the SDK sketch below).
- Snapshots are incremental backups; only the blocks that have changed since the last snapshot are saved, making them a much more cost-effective way to store block data.
- When deleting snapshots, EBS retains the data needed to restore the volume from the most recent snapshot.
- EBS determines which dependent snapshot data can be deleted so that all other snapshots still work.
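A minimal sketch of taking a snapshot with the AWS SDK for Java v1 (the same SDK as the bucket example later in these cards); the volume ID, description, and class name are placeholder assumptions:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateSnapshotRequest;
import com.amazonaws.services.ec2.model.CreateSnapshotResult;

public class SnapshotExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        // The volume does not need to be attached to a running instance.
        CreateSnapshotRequest request = new CreateSnapshotRequest(
                "vol-1234567890abcdef0",              // placeholder volume ID
                "Point-in-time backup of my volume"); // snapshot description
        CreateSnapshotResult result = ec2.createSnapshot(request);
        System.out.println("Snapshot ID: " + result.getSnapshot().getSnapshotId());
    }
}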
Elastic Volume
- Allows you to increase capacity dynamically, tune performance, and change the volume type on a live volume.
- A feature of EBS.
- Changes can be made with no downtime, performance impact, or changes to the application (see the sketch below).
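A minimal sketch of modifying a volume live with the AWS SDK for Java v1; the volume ID and the target size, type, and IOPS are placeholder assumptions:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.ModifyVolumeRequest;

public class ElasticVolumeExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        // Grow the volume and change its type while it stays attached and in use.
        ModifyVolumeRequest request = new ModifyVolumeRequest()
                .withVolumeId("vol-1234567890abcdef0") // placeholder volume ID
                .withSize(200)                         // new size in GiB
                .withVolumeType("io1")                 // change type live
                .withIops(10000);                      // provisioned IOPS for io1
        ec2.modifyVolume(request);
    }
}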
EBS Optimization
- Remember that EBS volumes are network-attached (not attached directly to the host like instance store volumes).
- On instances without EBS-optimized throughput, general network traffic can contend with traffic between your instance and your EBS volumes.
- On EBS-optimized instances, these two types of traffic are separated.
- Some instance configurations incur an extra cost for EBS optimization, while others are always EBS-optimized at no extra cost.
EBS Encryption
- For simplified data encryption, create encrypted EBS volumes with the EBS encryption feature.
- All EBS volume types support encryption.
- EBS uses the 256-bit Advanced Encryption Standard (AES-256) algorithm and keys managed through AWS Key Management Service (AWS KMS).
EBS encryption options
- Use an AWS KMS-generated key, OR
- Select a Customer Master Key (CMK) that you create separately using AWS KMS.
- You can also encrypt files before placing them on the volume.
- Snapshots of encrypted EBS volumes are automatically encrypted, as are volumes restored from those snapshots (see the sketch below).
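A minimal sketch of creating an encrypted volume with the AWS SDK for Java v1; the Availability Zone and CMK ARN are placeholder assumptions:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateVolumeRequest;

public class EncryptedVolumeExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        CreateVolumeRequest request = new CreateVolumeRequest()
                .withAvailabilityZone("us-west-1a") // placeholder AZ
                .withSize(100)                      // size in GiB
                .withVolumeType("gp2")
                .withEncrypted(true)
                // Omit withKmsKeyId to use the AWS-managed default key, or pass
                // the ARN of a CMK you created separately in AWS KMS (placeholder).
                .withKmsKeyId("arn:aws:kms:us-west-1:111122223333:key/EXAMPLE-KEY-ID");
        ec2.createVolume(request);
    }
}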
EBS Performance Best Practices
- Use EBS-optimized instances.
  - Dedicated throughput makes volume performance more predictable and consistent.
  - EBS volume network traffic won't compete with your other instance traffic, because the two are separated on EBS-optimized instances.
- Understand how performance is calculated.
  - You must understand the units of measure involved and how performance is calculated.
- Understand your workload.
  - There is a relationship between the maximum performance of EBS volumes, the size and number of I/O operations, and the time it takes for each action to complete.
  - Each of these factors affects the others, and different applications are more sensitive to one factor or another.
- Be aware of the performance penalty when initializing volumes restored from snapshots (a process known as initialization).
  - New EBS volumes receive their maximum performance the moment they are available and do not require initialization.
EBS Workload implications
- One of the EBS Performance best practices.
- On a given volume configuration, certain I/O characteristics drive the performance behavior for your Amazon EBS volumes.
- SSD-backed volumes (General Purpose SSD and Provisioned IOPS SSD) deliver consistent performance whether an I/O operation is random or sequential.
- HDD-backed volumes (Throughput Optimized HDD and Cold HDD) deliver optimal performance only when I/O operations are large and sequential.
EBS Workload theory
- To understand how SSD and HDD backed volumes will perform, must understand the connection between:
- demand on the volume
- the quantity of IOPS available to it
- the time it takes for an I/O operation to complete
- volume’s throughput limits
Factors that can degrade HDD performance
- When you create a snapshot of a Throughput-optimized HDD or Cold HDD volume, performance may drop as far as the volume’s baseline while the snapshot is in progress
- Specific only to these volume types
- Other factors that can limit performance
- driving more throughput than the instance can support
- performance penalty encountered when initializing volumes restored from a snapshot
- excessive amounts of small, random I/O on the volume
How to increase read-ahead for high-throughput, read-heavy workloads
- If your workload is read-heavy and accesses the block device through the operating system page cache (for example, from a file system), increase read-ahead.
- To achieve maximum throughput, it is recommended that you configure the read-ahead setting to 1 MiB.
- This is a per-block-device setting that should be applied ONLY to your HDD volumes.
How to maximize utilization of instance resources.
- Use RAID 0.
- Some instance types can drive more I/O throughput than you can provision on a single EBS volume. On these instance types, you can join multiple volumes together in a RAID 0 configuration.
- This lets you use the full available bandwidth of these instances.
EBS Troubleshooting - If you are using an EBS volume as a boot volume and your instance is no longer accessible, what do you do?
- You can't use SSH or RDP to access the boot volume directly.
- If the instance is based on an AMI, you can simply terminate it and create a new one.
- If you need access to that EBS boot volume, follow these steps to make it accessible (see the SDK sketch below):
  - Create a new EC2 instance with its own boot volume (a micro instance is great for this).
  - Detach the root EBS volume from the troubled instance.
  - Attach the root EBS volume from the troubled instance to your new EC2 instance as a secondary volume.
  - Connect to the new EC2 instance and access the files on the secondary volume.
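A minimal sketch of the detach/attach steps with the AWS SDK for Java v1; both IDs and the device name are placeholder assumptions, and in practice you would wait for the detach to complete before attaching:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.AttachVolumeRequest;
import com.amazonaws.services.ec2.model.DetachVolumeRequest;

public class RescueBootVolumeExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        String rootVolumeId = "vol-1234567890abcdef0";   // root volume of the troubled instance
        String rescueInstanceId = "i-0abcdef1234567890"; // the new (rescue) instance

        // Detach the root volume from the troubled instance...
        ec2.detachVolume(new DetachVolumeRequest().withVolumeId(rootVolumeId));

        // ...then attach it to the rescue instance as a secondary volume.
        ec2.attachVolume(new AttachVolumeRequest(rootVolumeId, rescueInstanceId, "/dev/sdf"));
    }
}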
AMI
- Amazon Machine Image
- Provides the information required to launch an instance.
- Must specify an AMI when you launch an instance.
- Can launch multiple instances from a single AMI (when you need multiple instances with the same configurations).
- Can use different AMIs to launch instances when you need different configurations (see the launch sketch below).
- AMI includes:
- One or more EBS snapshots or, for instance-store-backed AMIs, a template for the root volume of the instance (ex: an OS, an application server, and applications)
- Launch permissions that control which AWS accounts can use the AMI to launch instances.
- A block device mapping that specifies the volumes to attach to the instance when it’s launched.
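A minimal sketch of launching multiple identically configured instances from one AMI with the AWS SDK for Java v1; the AMI ID and instance type are placeholder assumptions:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

public class LaunchFromAmiExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        // One AMI, three instances with the same configuration.
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-0abcdef1234567890") // placeholder AMI ID
                .withInstanceType("t2.micro")
                .withMinCount(3)
                .withMaxCount(3);
        ec2.runInstances(request);
    }
}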
Instance Store
- Another type of block storage available to your EC2 instance for short-lived storage.
- Provides TEMPORARY block-level storage.
- Storage is located on disks that are physically attached to the host computer. (Unlike EBS volumes which are network attached)
- Does not persist if the instance fails or is terminated.
- Because it is on the host computer of the EC2 instance, instance store provides the lowest-latency storage available to your instance (other than RAM).
- Use it when your application incurs large amounts of I/O and needs the lowest possible latency.
- You MUST ensure you have another source of truth for your data and that the only copy is NOT placed in the instance store!
- For durable data, EBS volumes are recommended.
When is your data a candidate for the EC2 instance store?
- If your data does NOT need to be resilient to reboots, restarts, or auto recovery.
- But, exercise caution.
Instance Store Volumes - available instance types.
- Not all instance types come with available instance store volume(s).
- The size and type of volume vary by instance type.
- When you launch an instance, the instance store is available at no additional cost (depending on instance type).
- However, you must enable these volumes when you launch the EC2 instance, because you cannot add instance store volumes to an EC2 instance after launch (see the sketch below).
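A minimal sketch of enabling an instance store volume at launch via a block device mapping, using the AWS SDK for Java v1; the AMI ID, device name, and instance type are placeholder assumptions:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.BlockDeviceMapping;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

public class InstanceStoreAtLaunchExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        // Map the first instance store volume (ephemeral0) at launch;
        // instance store volumes cannot be added after the instance is running.
        BlockDeviceMapping ephemeral = new BlockDeviceMapping()
                .withDeviceName("/dev/sdb")
                .withVirtualName("ephemeral0");
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-0abcdef1234567890") // placeholder AMI ID
                .withInstanceType("m3.medium")        // an instance type that has instance store
                .withMinCount(1)
                .withMaxCount(1)
                .withBlockDeviceMappings(ephemeral);
        ec2.runInstances(request);
    }
}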
When is an Instance Store volume available to the EC2 instance?
- After you launch an instance, the storage volumes are available.
- However, you cannot access them until they are mounted.
ADDITIONAL INFORMATION:
1. Learn more about how to mount EBS volumes on different operating systems.
TBD
Using both EBS and Instance Store data with instances.
- Many customers use a combination of EBS volumes and Instance Store.
- Ex: you may want to put scratch data, tempdb, or other temporary files on the instance store while your root volume is on EBS.
- NEVER use instance store for any production data.
Instance Store-backed EC2 instances
- You can have your instance boot from the instance store; however, you would want to configure it to use an AMI so that new instances are created if one fails.
- NOT recommended for primary instances (where users would have issues if the instance fails).
- But this configuration can save money on storage costs compared to using EBS as your boot volume, in cases where your system is configured to be resilient to instances relaunching.
- You must understand your application and infrastructure needs before choosing to use instance store-backed EC2 instances. Choose carefully!
- EC2 instance store-backed instances CANNOT be stopped or take advantage of the auto-recovery feature of EC2 instances.
- It is possible to build instances on the fly that are completely resilient to reboot, relaunch, or failure and use the instance store as their root volume. (But this requires due diligence regarding your application and infrastructure to ensure the scenario works for you.)
S3
- Allows you to build web applications, delivering content to users by retrieving data via API calls over the internet.
- Storage for the internet.
- Simple Storage Service offers developers highly scalable, reliable, low-latency data storage infrastructure at low cost.
Bucket Limitations (in S3)
- Do not use buckets as folders, because there is a 100-bucket limit per account.
- Cannot create a bucket within another bucket.
- Bucket is owned by the AWS account that created it.
- Bucket ownership is NOT transferable.
- A bucket must be empty before you can delete it.
- After a bucket is deleted, that name becomes available for reuse.
- However, you might not be able to reuse the name if someone else claims it after you release it by deleting the bucket.
- If you expect to reuse the bucket, do not delete it.
Universal Namespace (buckets)
- A bucket name must be unique across all existing bucket names in S3 across ALL of AWS. (Not just within your account or AWS Region.)
- Must comply with DNS naming conventions when choosing a bucket name.
DNS - compliant bucket name rules
- Must be at least 3 and no more than 63 characters long.
- Must consist of a series of one or more labels, with adjacent labels separated by a single period.
- Must contain only lowercase letters, numbers, and hyphens.
- Each label must start and end with a lowercase letter or number
- Must not be formatted like IP addresses
- AWS recommends that you do not use periods in bucket names. (When using virtual hosted-style buckets with SSL, the SSL wildcard certificate only matches buckets that do not contain periods.)
- To work around this, use HTTP or write your own certificate verification logic.
Create a bucket using Java - code snippet
import java.io.IOException;

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.CreateBucketRequest;
import com.amazonaws.services.s3.model.GetBucketLocationRequest;

public class CreateBucketExample {
    private static String bucketName = "*** bucket name ***";

    public static void main(String[] args) throws IOException {
        AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
        s3client.setRegion(Region.getRegion(Regions.US_WEST_1));

        if (!(s3client.doesBucketExist(bucketName))) {
            // Note that CreateBucketRequest does not specify region, so the bucket is
            // created in the region specified in the client.
            s3client.createBucket(new CreateBucketRequest(bucketName));
        }

        // Get the bucket location.
        String bucketLocation = s3client.getBucketLocation(new GetBucketLocationRequest(bucketName));
        System.out.println("bucket location = " + bucketLocation);
    }
}
When to use versioning
- To preserve, retrieve, and restore every version of every object stored in your S3 bucket, including recovering deleted objects.
- With versioning, you can easily recover from both unintended user actions and application failures.
- Versioning is turned OFF by default (see the sketch below for turning it on).
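A minimal sketch of turning versioning on with the AWS SDK for Java v1; the bucket name is a placeholder assumption:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketVersioningConfiguration;
import com.amazonaws.services.s3.model.SetBucketVersioningConfigurationRequest;

public class EnableVersioningExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Versioning is OFF by default; enable it explicitly on the bucket.
        BucketVersioningConfiguration config = new BucketVersioningConfiguration()
                .withStatus(BucketVersioningConfiguration.ENABLED);
        s3.setBucketVersioningConfiguration(
                new SetBucketVersioningConfigurationRequest("my-bucket", config)); // placeholder bucket
    }
}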
Reasons a developer would turn on versioning of files in S3.
- Protecting from accidental deletion.
- Recovering an earlier version.
- Retrieving deleted objects.
How to retrieve any particular object in a versioned bucket.
- Perform a GET on the object key name and the particular version (see the sketch below).
- S3 versioning tracks changes over time.
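A minimal sketch of a versioned GET with the AWS SDK for Java v1; the bucket, key, and version ID are placeholder assumptions:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class GetObjectVersionExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Supplying a version ID along with the key retrieves that exact version.
        S3Object object = s3.getObject(
                new GetObjectRequest("my-bucket", "my-key", "EXAMPLE-VERSION-ID")); // placeholders
        System.out.println("Retrieved version: " + object.getObjectMetadata().getVersionId());
    }
}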
How does S3 versioning protect against unintended deletes?
- If you issue a delete command against an object in a versioned bucket, AWS places a delete marker on top of that object.
- When you then perform a GET on it, you'll get an error, since the object no longer appears to exist.
- However, an administrator, or someone with the necessary permissions, can remove the delete marker and regain access to the data (see the sketch below).
- When a delete request is issued against an object in a versioned bucket, S3 retains the data but removes users' ability to retrieve it.
- Can also be MFA Delete-enabled for an additional layer of security.
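A minimal sketch of removing a delete marker with the AWS SDK for Java v1; the bucket, key, and the delete marker's version ID are placeholder assumptions:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class RemoveDeleteMarkerExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // A simple DELETE only adds a delete marker. Deleting the marker by its
        // own version ID removes it, making the object retrievable again.
        s3.deleteVersion("my-bucket", "my-key", "DELETE-MARKER-VERSION-ID"); // placeholders
    }
}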
T/F - Versioning is turned off by default?
- True.
How many objects can you store within S3?
- Unlimited
- But an individual object's size can only be between 1 byte and 5 TB.
- If you have an object larger than 5 TB, use a file splitter and upload the chunks to S3 (reassemble them later after downloading).
Largest object that can be uploaded in a single PUT
- 5 GB
- For objects larger than 100 MB, you should consider using multipart upload.
- For anything larger than 5 GB, you MUST use multipart upload (see the sketch below).
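A minimal sketch of a multipart upload via the SDK's TransferManager (AWS SDK for Java v1), which splits large files into parts and uploads them in parallel; the bucket, key, and file path are placeholder assumptions:

import java.io.File;

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadExample {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = TransferManagerBuilder.defaultTransferManager();
        // TransferManager automatically uses multipart upload for large files.
        Upload upload = tm.upload("my-bucket", "big-file.bin", // placeholders
                new File("/path/to/big-file.bin"));
        upload.waitForCompletion(); // blocks until all parts have finished
        tm.shutdownNow();
    }
}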
Object facets
- Key
- VersionID
- Value
- Metadata
- Subresources
- Access Control Information
Key (object facet)
- Name that you assign to an object, may include a simulated folder structure.
- Each key must be unique within a bucket (unless versioning is turned on).
- S3 URLs are a basic data map between "bucket + key + version" and the web service endpoint.
- Ex: in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, doc is the name of the bucket and 2006-03-01/AmazonS3.wsdl is the key.
VersionID (object facet)
- Within a bucket, a key and version ID uniquely identify an object.
- If versioning is turned on, you can have multiple versions of a stored object.
Value (object facet)
- Actual content you are storing.
- Can be any sequence of bytes.
- Objects can range in size from 1 byte to 5 TB.
Metadata (object facet)
- Set of name-value pairs with which you can store information regarding the object.
- Can assign metadata (referred to as user-defined metadata) to your objects in S3.
- S3 also assigns system metadata to manage these objects (see the sketch below).
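A minimal sketch of attaching user-defined metadata on upload with the AWS SDK for Java v1; the bucket, key, and metadata pair are placeholder assumptions (user-defined keys are stored with an x-amz-meta- prefix):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class UserMetadataExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        InputStream stream = new ByteArrayInputStream(content);

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(content.length);         // system metadata
        metadata.addUserMetadata("department", "finance"); // user-defined metadata (placeholder pair)

        s3.putObject("my-bucket", "my-key", stream, metadata); // placeholder bucket/key
    }
}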
Subresources (object facets)
- S3 uses the subresource mechanism to store additional object-specific information.
- Subresources are subordinate to objects; they are always associated with some other entity, such as a bucket or object (which S3 uses for managing the object).
- Ex: ACL and Torrent
ACL
- Access Control List
- A list of grants identifying the grantees and the permissions granted.
- A type of subresource associated with S3 objects.
- Resource-based
Torrent
- Returns the torrent file associated with the specific object.
- A type of subresource associated with S3 objects.
Resource-based v. user-based access control
- Resource-based:
  - ACLs
  - Bucket policies
- User-based:
  - IAM policies attached to users, groups, or roles
How many tags can you associate with an object?
- Up to 10 tags per object (see the sketch below).
- Each tag associated with an object must have a unique tag key.
- A tag key can be up to 128 Unicode characters long.
- A tag value can be up to 256 Unicode characters long.
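A minimal sketch of tagging an object at upload time with the AWS SDK for Java v1; the bucket, key, file path, and tag pairs are placeholder assumptions:

import java.io.File;
import java.util.Arrays;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectTagging;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.Tag;

public class ObjectTaggingExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Up to 10 tags per object; each tag key must be unique on the object.
        ObjectTagging tags = new ObjectTagging(Arrays.asList(
                new Tag("project", "flashcards"),
                new Tag("classification", "public")));
        s3.putObject(new PutObjectRequest("my-bucket", "my-key", // placeholders
                new File("/path/to/file.txt")).withTagging(tags));
    }
}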