1. MAAREK Flashcards
Multi-AZ Resharding:
Multi-AZ: Indicates support for running Redis in multiple availability zones (AZs) to enhance fault tolerance and high availability.
Resharding: Refers to the process of redistributing data across nodes, possibly to accommodate changes in the cluster size or improve load distribution.
Online and Offline Resharding:
Online Resharding: This suggests that the resharding process can be performed without taking the cluster offline, ensuring continuous availability.
Offline Resharding: In some cases, resharding might require the cluster to be taken offline temporarily. This can impact availability during the process.
Maximum of 5 Read Replicas (RR) per Cluster:
Read Replicas: Additional nodes that replicate the data from the primary node for read-heavy workloads.
Limitation: The cluster is configured to support a maximum of 5 read replicas.
Cluster Scaling:
Scaling: The ability to add or remove nodes dynamically to adapt to changing workloads.
Tasks Spread Across Nodes: This suggests that the cluster distributes tasks or data across its nodes to balance the load.
New Nodes Immediately Updated:
Dynamic Updates: Changes or additions to the cluster, such as the creation of new nodes, are immediately reflected in the cluster’s state.
Important Considerations:
High Availability (HA): Multi-AZ deployment and the ability to perform online resharding are key elements for ensuring high availability.
Performance: Read replicas and the ability to scale the cluster contribute to improved performance and the ability to handle increased workloads.
These features and capabilities are typical of a Redis cluster designed for scalability, fault tolerance, and high availability, which are essential for many production systems.
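The slot math behind resharding can be made concrete. A minimal sketch of Redis Cluster's key-to-slot mapping (CRC16/XMODEM over the key, modulo 16,384 slots; hash tags are ignored here for brevity):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Redis Cluster assigns every key to one of 16384 hash slots.
    return crc16_xmodem(key.encode()) % 16384
```

Resharding reassigns ranges of these slots to different shards; the key-to-slot mapping itself never changes, which is what makes online resharding possible.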
Memcached
Addition of New Nodes:
Horizontal scaling involves adding more nodes to the system to handle increased load.
New nodes can be dynamically added to the cluster.
Autodiscovery Function: There is a function in place that automatically discovers and updates all new nodes in the cluster.
Limit: The system is designed to handle a maximum of 40 nodes.
Vertical Scaling:
Node Upgrade Limitation:
Vertical scaling involves upgrading the existing nodes to handle increased load.
Limitation: Memcached nodes cannot be upgraded directly. Instead, the approach is to swap out the old node for a new one. This typically involves taking the old node offline during the process.
Empty New Nodes:
When adding new nodes through vertical scaling, these nodes start empty and do not retain the data from the old node.
Application Reload: To fill the new nodes with data, the application needs to reload the data. This implies a potential data migration or reloading process.
Offline Swap:
Vertical scaling often involves taking the old node offline during the swap process.
Data Loading Requirement:
The mention of “new nodes are empty” suggests that when a new node is introduced through vertical scaling, it doesn’t automatically inherit the data from the old node. The application needs to handle the reloading or migration of data.
Both horizontal and vertical scaling have their trade-offs and are chosen based on specific use cases and requirements. Horizontal scaling offers more flexibility in handling increased load dynamically, while vertical scaling involves upgrading the capacity of the existing resources to handle additional load. The decision between them often depends on factors such as system architecture, performance requirements, and the nature of the application.
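The reason new Memcached nodes start empty, and why clients need Auto Discovery to learn the new node list, is that key placement depends on which nodes exist. A toy consistent-hashing sketch (not ElastiCache's actual client logic; all node names are made up):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: each node owns several points on the ring."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first ring point at or after its hash (wrapping).
        idx = bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

old = HashRing(["node1", "node2", "node3"])
new = HashRing(["node1", "node2", "node3", "node4"])
keys = [f"key{i}" for i in range(1000)]
moved = sum(old.node_for(k) != new.node_for(k) for k in keys)
# Roughly a quarter of the keys now hash to the empty node4; the rest stay put.
```

Every key that moved now lands on the new, empty node, which is why the application must repopulate the cache after scaling.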
REDIS RR vs RESHARDING
Read Replicas (RR) and resharding are two distinct concepts in the context of distributed systems like Redis.
Read Replicas (RR):
Purpose:
Read Scaling: Read Replicas are used to scale read operations in a distributed database system.
Improved Performance: By offloading read operations to replicas, the primary node is freed up to handle write operations.
Functionality:
Data Replication: Data from the primary node is replicated to the Read Replicas.
Read-Only: Read Replicas are typically read-only nodes, meaning that they can’t accept write operations.
Availability:
Fault Tolerance: Read Replicas provide fault tolerance. If the primary node fails, one of the replicas can be promoted to become the new primary.
Consistency:
Eventual Consistency: Depending on the replication mechanism, there might be some delay (latency) between the primary node and the replicas, resulting in eventual consistency.
Resharding:
Purpose:
Write Scaling: Resharding is a process used to scale write operations and distribute data more evenly across nodes in a cluster.
Dynamic Load Balancing: It allows for redistributing data to accommodate changes in workload or cluster size.
Functionality:
Data Redistribution: Resharding involves moving data from one set of nodes to another, typically to balance the load or accommodate changes in the cluster.
Availability:
Impact on Availability: Resharding might require taking the cluster offline temporarily or have some impact on availability during the process, depending on the implementation.
Consistency:
Maintaining Consistency: Resharding must ensure that data consistency is maintained during the redistribution process.
Key Differences:
Purpose: Read Replicas are primarily for read scaling, while resharding is for write scaling and dynamic load balancing.
Functionality: Read Replicas replicate data for improved read performance, while resharding redistributes data to optimize the distribution of writes.
Impact on Availability: Read Replicas provide fault tolerance without much impact on availability, while resharding might have some impact, especially if it involves taking the cluster offline temporarily.
In summary, Read Replicas and resharding serve different purposes in a distributed system. Read Replicas focus on improving read performance and fault tolerance, while resharding is about optimizing the distribution of writes and dynamically balancing the cluster.
Elasticache eviction issues
Eviction Scenario:
Cause: Low capacity in the system, leading to the need to remove data to make space for new data.
Solutions:
Scale Up or Scale Out:
Scale Up (Vertical Scaling):
Description: Increase the capacity of individual nodes by changing to larger nodes.
Advantages: This provides more resources (CPU, memory) to handle increased load.
Considerations: There might be limits to how much you can scale up, and larger nodes could be more expensive.
Scale Out (Horizontal Scaling):
Description: Add more nodes to the system to distribute the load.
Advantages: This provides increased capacity by distributing the workload across multiple nodes.
Considerations: This approach often offers better scalability, but it requires a distributed architecture.
Change Eviction Policy:
Description: Modify the eviction policy to allow for the early retirement of non-current data.
Example: You might consider changing the eviction policy to prioritize removing less frequently accessed or older data.
Considerations: This can help manage space more effectively, but it’s important to align the eviction policy with the application’s requirements.
Considerations:
Eviction Policy:
The choice of eviction policy depends on the nature of your application and the importance of different types of data.
Common eviction policies include
LRU (Least Recently Used),
LFU (Least Frequently Used), and others.
Monitoring:
Regularly monitor the system to identify trends in data access patterns and capacity usage.
Capacity Planning:
Plan for future growth and consider both vertical and horizontal scaling strategies.
Cost and Performance Trade-offs:
Consider the cost implications and performance trade-offs associated with scaling up or scaling out.
In summary, addressing eviction issues involves a combination of capacity planning, scaling strategies, and tuning the eviction policy to align with the application’s requirements. The choice between scaling up and scaling out depends on your specific use case and requirements.
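An LRU eviction policy like the one described above can be modeled in a few lines (a toy model, not the cache engine's actual implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used entry once capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)   # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes most recently used
cache.put("c", 3)    # capacity exceeded: "b" is evicted
```

An LFU policy would instead track access counts and evict the least-frequently-used key; which policy fits depends on the application's access pattern.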
SSH syntax
ssh -i keypair.pem ec2-user@10.0.0.0
Check which network adapter/interface is present in the instance
ethtool -i eth0
Enhanced Networking
Enabled by default on the Amazon Linux 2 AMI.
However, whether it is actually harnessed depends on the instance type.
Changing instance type
Only EBS-backed instances can have their instance type changed.
Cluster placement advantage
Low latency, high bandwidth, high PPS (packets per second)
Best for HPC
Partition placement
- Max of 7 partitions per AZ
- Can span multiple AZs in the same region
- Partition information can be found in the instance's metadata.
Partition placement use case
Application has to be partition-aware (able to distribute data across partitions within the cluster)
Tightly linked distributed systems
EC2 vCPU limits
Only applicable to On-Demand and Spot instances
Insufficient capacity error
AWS does not have enough capacity available for your instance type in a particular AZ
SSH troubleshooting
EC2 Instance Connect uses one of the reserved IP ranges for your region. As long as port 22 is open, Instance Connect picks one of those IPs and connects. Be careful when whitelisting a CIDR range for inbound SSH: if the reserved range is not whitelisted, it is implicitly blocked.
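The whitelisting pitfall above can be checked mechanically: verify that the service's reserved range is a subnet of your allowed CIDR. Both ranges below are hypothetical examples, not real published values:

```python
import ipaddress

def cidr_covers(allowed_cidr: str, service_cidr: str) -> bool:
    """True if every address in service_cidr falls inside allowed_cidr."""
    return ipaddress.ip_network(service_cidr).subnet_of(
        ipaddress.ip_network(allowed_cidr)
    )

# Hypothetical office whitelist vs. a hypothetical Instance Connect range:
office = "203.0.113.0/24"
instance_connect = "198.51.100.0/27"
# cidr_covers(office, instance_connect) is False here, so Instance Connect
# would be implicitly blocked by a security group allowing only `office`.
```

The real published ranges per region are available in AWS's ip-ranges.json feed; the check itself is the same.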
CloudWatch Metric types
- Basic (default 5 min; detailed monitoring 1 min)
  - CPU usage, CPU credits
  - Disk (instance store only)
  - Network
  - Status checks
- Custom (default 1 min; high resolution 1 sec)
  - RAM
  - Application-level
  - Requires an IAM role
Procstat plugin
Collects system- and application-level metrics of individual processes for the CloudWatch agent (Windows and Linux)
Terminal
Upgrade privilege
~ sudo su
Terminal
Install Apache
~ sudo yum install -y httpd
~ echo "Hello World from $(hostname)" > /var/www/html/index.html
To start Apache
~ sudo systemctl start httpd
To persist through system restarts
~ sudo systemctl enable httpd
Httpd server log examples
~ cat /var/log/httpd/access_log
~ cat /var/log/httpd/error_log
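Lines in access_log follow Apache's Common Log Format; a small parser sketch (the sample line is invented):

```python
import re

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the log fields as a dict, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.12 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 11'
entry = parse_line(sample)
```

Grepping status codes out of access_log this way is a quick first step when the index.html page above doesn't load.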
Cloudwatch Logs vs Metrics
Logs - Report(text files)
Metrics - measurements(graphs)
Troubleshoot EC2 status check failure (system)?
Migrate the instance to another host (stop and start the instance)
CloudWatch system recovery (triggered by a configured CW alarm)
Recovery maintains the system's public/private IPs, Elastic IP, metadata, and placement group.
Examples of problems that require instance recovery:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host that impact network reachability
If your instance fails a system status check, then you can use CloudWatch alarm actions to automatically recover your instance. The recover option is available for over 90% of deployed Amazon EC2 instances. However, the recover option works only for system check failures, not for instance status check failures.
AMI create volume permission
A created AMI can be encrypted or decrypted with the relevant permissions.
Can be shared privately (with an account or ARN).
Permission to copy can also be granted to the receiving account or organization.
EBS Multi-attach
Only available for io1/io2
SSM Agent
By default, the SSM Agent is already installed on Amazon Linux 2.
The SSM Agent works on both VMs and on-premises instances.
Troubleshooting the SSM Agent
- Permission issues
- Corrupt agent: reinstall the agent
Resource group purpose
To automate patching and manage resources at the group level
SSM Documents
The configuration script for all planned operations
SSM SSH
SSM does not need SSH or HTTP; the agent connects to SSM by itself
SSM Run command
- Executes the Document
- Error and Rate control
- Integrated with IAM and Cloudtrail
- Runs command on multiple instances and groups
- No need for SSH(Magical)
- Command output is printed on the screen or can be sent to S3 or CloudWatch
- Status can be viewed on the console
- Can be invoked using EventBridge
SSM get parameter
aws ssm get-parameters --names <parameter1> <parameter2>
SSM inventory
(SSM) Inventory is a feature that enables you to collect metadata from your managed instances. It provides a detailed view of your infrastructure, making it easier to understand its current state and track changes over time.
Here’s a brief overview of how SSM Inventory works:
Data Collection: You can configure SSM Inventory to collect information such as installed applications, network configurations, OS updates, and more from your EC2 instances or on-premises servers.
Resource Data Sync: Collected data can be stored in an Amazon S3 bucket or in an AWS Systems Manager Association. This allows you to centralize and aggregate inventory data from multiple AWS accounts and regions.
Querying and Reporting: You can use AWS Config or the AWS Systems Manager Console to query and generate reports based on the collected inventory data. This helps you understand the state of your resources and their configurations.
Automation: You can use inventory data to create automation workflows, such as triggering actions based on changes detected in the environment.
To set up SSM Inventory, you typically need to:
Configure Inventory Collection: Use SSM Documents to specify what inventory data you want to collect. These documents are associated with an inventory configuration.
Define Inventory Configurations: Create an inventory configuration that references your SSM Documents. This configuration specifies the type of data you want to collect.
Attach Inventory Configurations: Associate inventory configurations with your managed instances.
View and Query Data: Use the AWS Management Console, AWS CLI, or APIs to view and query the collected inventory data.
SSM State Manager
Manages the state of nodes in a group, ensuring that the fleet always matches the defined state
SSM Inventory
- Data can be viewed on the console
- Stored on S3
- Queried and analyzed using QuickSight and Athena
Elb Sticky Sessions
Always redirect a specific client request to a particular server/instance by adding a cookie to the request
Can cause load imbalance
ELB Health checks
If a target group contains only unhealthy targets, ELB routes requests across all of its unhealthy targets. This is usually the case during warmup/booting.
ELB Access Logs
ELB Access Logs are encrypted by default
Lambda permission
Resource-based policy - another service invoking Lambda (push model)
Execution role - Lambda polling another service for jobs (pull model)
Lambda Function throttling
DLM
- Does not work with instance store
- Uses tags to identify resources
- Creates snapshots and AMIs
- Can't be used to manage snapshots/AMIs created outside DLM
- Cannot be used to manage instance-store-backed AMIs
EBS Multi-Attach
- Max 16 instances
- The file system must be cluster-aware
- Can only happen within a single AZ
One IAM role can contain multiple policies?
True
New EBS prep
After an EBS volume's size is increased, the partition and file system must be extended (e.g. growpart, then xfs_growfs or resize2fs) before the new space is usable.
You cannot reduce the size of an EBS volume.
EFS Operations
Certain operations can be performed on the fly, while some can't.
In place:
- Lifecycle policy
- Throughput mode and provisioned throughput number
- EFS access points
Requires migration (using DataSync):
- Encryption
- Decryption
- Performance mode (e.g. Max I/O)
S3 Replication
Replicates only new objects;
in order to replicate existing objects, use S3 Batch Replication
S3 Analytics
Requires 24-48 hrs after activation to start generating data analysis reports.
Its recommendations only cover transitions to Standard-IA; it does not work for One-Zone-IA or Glacier.
S3 Multi-Part
Recommended for objects above 100 MB,
mandatory for objects > 5 GB
s3 Transfer acceleration
Works for both upload and download
S3 select
Retrieves only the data needed, using SQL, instead of retrieving the whole object and then filtering or running ETL. It is only meant to be used on a subset of the data in an object.
S3 Batch operation
to perform bulk operations on existing S3 objects with a single request
eg,
modify object metadata
modify object properties
copy objects between S3 buckets
encrypt unencrypted objects
modify ACLs/tags
restore objects from S3 Glacier
invoke lambda function to perform a custom operation on an object
S3 inventory
Comprehensive report on the objects in your bucket
S3 glacier
You can place a file into S3 Glacier the same minute you create it.
Glacier operates two types of policy: vault access policies and vault lock policies.
Glacier Vault - like a bucket in Glacier
Glacier retrieval methods
Expedited (1-5 minutes) - you need to purchase provisioned capacity units
Standard: 3-5 hrs
Bulk: 5-12 hrs
** Because restoration is asynchronous, there has to be some sort of notification job: S3 Event Notifications (restore initiated and completed) or EventBridge
Glacier Vault Policies
Strong access policies for strict regulatory or compliance control over the files in Glacier.
Vault Lock is immutable/irreversible.
Vault Lock is completed by re-entering the lock ID back into the vault lock.
Upload files to glacier
This is not possible via the console; you have to use the API, CLI, or an SDK
Multipart upload
Divide-and-conquer algorithm:
Split and upload the parts in parallel; S3 concatenates them at the receiving end.
- Parts can be uploaded in any order
- Recommended for uploading files > 100 MB
- Mandatory for files > 5 GB
Use lifecycle policies to clean up failed uploads (there's a lifecycle preset for incomplete multipart uploads)
** Multipart upload is only available via the CLI/SDK
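The part arithmetic follows from S3's documented limits (at most 10,000 parts; each part between 5 MiB and 5 GiB, except the last). A sketch that picks a workable part size for a given object:

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB      # S3 minimum part size (except the last part)
MAX_PARTS = 10_000      # S3 maximum number of parts per upload

def choose_part_size(object_size: int, desired_part: int = 100 * MIB) -> int:
    """Return a part size that keeps the upload within 10,000 parts."""
    part = max(desired_part, MIN_PART)
    # Grow the part size until the object fits in 10,000 parts.
    while (object_size + part - 1) // part > MAX_PARTS:
        part *= 2
    return part

def part_count(object_size: int, part: int) -> int:
    # Ceiling division: the last part may be smaller than the rest.
    return (object_size + part - 1) // part
```

This is why a 5 TiB object (the S3 maximum) cannot use 100 MiB parts: that would need over 52,000 parts, so the part size has to grow.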
Athena
Serverless service that queries and analyzes files in S3 in place, using SQL, without moving the data.
Athena best practice
Use columnar data formats for cost savings; they allow fewer, faster scans (e.g. ORC, Apache Parquet)
Compress data for smaller retrievals
Partition your files to ease queries
Use a folder/path structure to ease queries, directly querying a specific directory/prefix
Performs better with larger files
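The columnar-format advice works because a query touches only the columns it selects. A toy comparison of cells scanned (an illustration of the idea, not Athena itself):

```python
# Row-oriented layout: reading one column still touches every field of every row.
rows = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(1000)]
row_cells_scanned = sum(len(r) for r in rows)   # all 3 fields x 1000 rows

# Column-oriented layout: only the selected column is touched.
columns = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "score": [r["score"] for r in rows],
}
col_cells_scanned = len(columns["score"])       # 1 field x 1000 rows
```

Since Athena bills by bytes scanned, scanning one column out of three in a Parquet file cuts cost roughly proportionally.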
SSE KMS
Uses API calls to and from KMS for encryption/decryption, which may result in KMS throttling.
SSE-C
All SSE-C requests must be encrypted in transit
Enforce TLS for client requests
Use a bucket policy Deny statement with the condition
"aws:SecureTransport": "false"
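Spelled out as a full bucket policy, the condition key is aws:SecureTransport and the statement denies non-TLS requests (the bucket name example-bucket is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
```

Denying when the condition is false is the standard pattern: allow statements stay as they are, and any plain-HTTP request is rejected.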
SSE-C on the console
Can only be done via the CLI/SDK; not allowed in the console
MFA delete
- Only Root user can enable
- Can only be enabled via the AWS CLI, AWS SDK, or the Amazon S3 REST API
S3 Retention Mode
Compliance - Strict and immutable
Governance - privileged principals can change versions and retention settings
retention period for both is fixed
S3 Legal Hold
(s3:PutLegalHold)
- Can protect the object indefinitely, irrespective of the retention mode or period
- An identity with the s3:PutLegalHold permission can deactivate this mode
S3 access point policy
A scaled-down bucket policy scoped to specific prefixes.
The policy grants exclusive access to specific prefixes/directories and limits all access to those prefixes.
A single access point policy can contain access to more than one prefix.
Each access point has its own DNS name and a network origin (internet or VPC).