1. MAAREK Flashcards
Multi-AZ Resharding:
Multi-AZ: Indicates support for running Redis in multiple availability zones (AZs) to enhance fault tolerance and high availability.
Resharding: Refers to the process of redistributing data across nodes, possibly to accommodate changes in the cluster size or improve load distribution.
Online and Offline Resharding:
Online Resharding: The resharding process can be performed without taking the cluster offline, so the cluster remains continuously available.
Offline Resharding: In some cases, resharding requires the cluster to be taken offline temporarily, which impacts availability during the process.
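A minimal CLI sketch of triggering online resharding, assuming a hypothetical replication group named my-redis (the target shard count is a placeholder); ElastiCache rebalances slots while the cluster keeps serving traffic:
~ aws elasticache modify-replication-group-shard-configuration \
    --replication-group-id my-redis \
    --node-group-count 4 \
    --apply-immediately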
Maximum of 5 Read Replicas (RR) per Cluster:
Read Replicas: Additional nodes that replicate the data from the primary node for read-heavy workloads.
Limitation: a maximum of 5 read replicas is supported (per shard/node group).
Cluster Scaling:
Scaling: The ability to add or remove nodes dynamically to adapt to changing workloads.
Tasks Spread Across Nodes: The cluster distributes tasks and data across its nodes to balance the load.
New Nodes Immediately Updated:
Dynamic Updates: Changes or additions to the cluster, such as the creation of new nodes, are immediately reflected in the cluster’s state.
Important Considerations:
High Availability (HA): Multi-AZ deployment and the ability to perform online resharding are key elements for ensuring high availability.
Performance: Read replicas and the ability to scale the cluster contribute to improved performance and the ability to handle increased workloads.
These features and capabilities are typical of a Redis cluster designed for scalability, fault tolerance, and high availability, all of which are essential for many production systems.
Memcached
Addition of New Nodes:
Horizontal scaling involves adding more nodes to the system to handle increased load.
New nodes can be dynamically added to the cluster.
Autodiscovery Function: Clients automatically discover all nodes in the cluster (including newly added ones) via the cluster's configuration endpoint.
Limit: The system is designed to handle a maximum of 40 nodes.
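A sketch of adding nodes via the CLI, assuming a hypothetical cluster ID my-memcached; clients using Auto Discovery pick up the new nodes through the configuration endpoint:
~ aws elasticache modify-cache-cluster \
    --cache-cluster-id my-memcached \
    --num-cache-nodes 5 \
    --apply-immediately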
Vertical Scaling:
Node Upgrade Limitation:
Vertical scaling involves upgrading the existing nodes to handle increased load.
Limitation: Memcached nodes cannot be upgraded directly. Instead, the approach is to swap out the old node for a new one. This typically involves taking the old node offline during the process.
Empty New Nodes:
When adding new nodes through vertical scaling, these nodes start empty and do not retain the data from the old node.
Application Reload: To fill the new nodes with data, the application needs to reload the data. This implies a potential data migration or reloading process.
Offline Swap:
Vertical scaling often involves taking the old node offline during the swap process.
Data Loading Requirement:
When a new node is introduced through vertical scaling, it does not automatically inherit the data from the old node; the application must handle reloading or migrating the data (see the sketch below).
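A sketch of the swap, with hypothetical cluster names and node type; the new cluster starts empty, so the application must be repointed to it and must warm the cache itself:
~ aws elasticache create-cache-cluster \
    --cache-cluster-id my-memcached-large \
    --engine memcached \
    --cache-node-type cache.r6g.large \
    --num-cache-nodes 3
# Then update the application's endpoint, reload the data, and delete the old cluster.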
Both horizontal and vertical scaling have trade-offs and are chosen based on specific use cases and requirements. Horizontal scaling offers more flexibility in handling increased load dynamically, while vertical scaling upgrades the capacity of existing resources to handle additional load. The decision often depends on system architecture, performance requirements, and the nature of the application.
Redis RR vs Resharding
Read Replicas (RR) and resharding are two distinct concepts in the context of distributed systems like Redis.
Read Replicas (RR):
Purpose:
Read Scaling: Read Replicas are used to scale read operations in a distributed database system.
Improved Performance: By offloading read operations to replicas, the primary node is freed up to handle write operations.
Functionality:
Data Replication: Data from the primary node is replicated to the Read Replicas.
Read-Only: Read Replicas are typically read-only nodes, meaning that they can’t accept write operations.
Availability:
Fault Tolerance: Read Replicas provide fault tolerance. If the primary node fails, one of the replicas can be promoted to become the new primary.
Consistency:
Eventual Consistency: Depending on the replication mechanism, there might be some delay (latency) between the primary node and the replicas, resulting in eventual consistency.
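One way to inspect replication state from the primary, assuming a hypothetical endpoint; comparing master_repl_offset against each replica's reported offset gives a rough estimate of lag:
~ redis-cli -h my-redis.example.cache.amazonaws.com -p 6379 INFO replication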
Resharding:
Purpose:
Write Scaling: Resharding is a process used to scale write operations and distribute data more evenly across nodes in a cluster.
Dynamic Load Balancing: It allows for redistributing data to accommodate changes in workload or cluster size.
Functionality:
Data Redistribution: Resharding involves moving data from one set of nodes to another, typically to balance the load or accommodate changes in the cluster.
Availability:
Impact on Availability: Resharding might require taking the cluster offline temporarily or have some impact on availability during the process, depending on the implementation.
Consistency:
Maintaining Consistency: Resharding must ensure that data consistency is maintained during the redistribution process.
Key Differences:
Purpose: Read Replicas are primarily for read scaling, while resharding is for write scaling and dynamic load balancing.
Functionality: Read Replicas replicate data for improved read performance, while resharding redistributes data to optimize the distribution of writes.
Impact on Availability: Read Replicas provide fault tolerance without much impact on availability, while resharding might have some impact, especially if it involves taking the cluster offline temporarily.
In summary, Read Replicas and resharding serve different purposes in a distributed system. Read Replicas focus on improving read performance and fault tolerance, while resharding is about optimizing the distribution of writes and dynamically balancing the cluster.
ElastiCache eviction issues
Eviction Scenario:
Cause: Low capacity in the system, leading to the need to remove data to make space for new data.
Solutions:
Scale Up or Scale Out:
Scale Up (Vertical Scaling):
Description: Increase the capacity of individual nodes by changing to larger nodes.
Advantages: This provides more resources (CPU, memory) to handle increased load.
Considerations: There might be limits to how much you can scale up, and larger nodes could be more expensive.
Scale Out (Horizontal Scaling):
Description: Add more nodes to the system to distribute the load.
Advantages: This provides increased capacity by distributing the workload across multiple nodes.
Considerations: This approach often offers better scalability, but it requires a distributed architecture.
Change Eviction Policy:
Description: Modify the eviction policy to allow for the early retirement of non-current data.
Example: You might consider changing the eviction policy to prioritize removing less frequently accessed or older data.
Considerations: This can help manage space more effectively, but it’s important to align the eviction policy with the application’s requirements.
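A sketch of changing the policy on a custom parameter group (the group name is hypothetical; maxmemory-policy is the relevant Redis parameter):
~ aws elasticache modify-cache-parameter-group \
    --cache-parameter-group-name my-redis-params \
    --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"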
Considerations:
Eviction Policy:
The choice of eviction policy depends on the nature of your application and the importance of different types of data.
Common eviction policies include LRU (Least Recently Used) and LFU (Least Frequently Used), among others.
Monitoring:
Regularly monitor the system to identify trends in data access patterns and capacity usage.
Capacity Planning:
Plan for future growth and consider both vertical and horizontal scaling strategies.
Cost and Performance Trade-offs:
Consider the cost implications and performance trade-offs associated with scaling up or scaling out.
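For monitoring eviction pressure, the AWS/ElastiCache namespace exposes an Evictions metric; a CLI sketch, with a hypothetical cluster ID and time window:
~ aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache \
    --metric-name Evictions \
    --dimensions Name=CacheClusterId,Value=my-redis-001 \
    --start-time 2024-01-01T00:00:00Z --end-time 2024-01-02T00:00:00Z \
    --period 3600 --statistics Sum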
In summary, addressing eviction issues involves a combination of capacity planning, scaling strategies, and tuning the eviction policy to align with the application’s requirements. The choice between scaling up and scaling out depends on your specific use case and requirements.
SSH syntax
ssh -i keypair.pem ec2-user@10.0.0.0
Check which network adapter/interface is present in the instance
ethtool -i eth0
Enhanced Networking
Enabled by default on the Amazon Linux 2 AMI.
However, whether it is actually used depends on the instance type.
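To confirm from the CLI whether ENA is enabled (the instance ID is a placeholder); inside the instance, ethtool -i eth0 reporting "driver: ena" means enhanced networking is in use:
~ aws ec2 describe-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --attribute enaSupport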
Changing instance type
Only EBS-backed instances can have their instance type changed; the instance must be stopped first.
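A sketch of the resize flow via the CLI (instance ID and target type are placeholders):
~ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
~ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
~ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"m5.large\"}"
~ aws ec2 start-instances --instance-ids i-0123456789abcdef0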
Cluster placement advantage
Low latency, high bandwidth, high packets per second (PPS)
Best for HPC
Partition placement
- Max of 7 partitions per AZ
- Can span multiple AZs in the same region
- Partition information can be found in the instance’s metadata.
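From inside an instance, the partition number can be read from the metadata service (an IMDSv1-style call is shown for brevity):
~ curl http://169.254.169.254/latest/meta-data/placement/partition-number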
Partition placement use case
Application has to be partition-aware (able to distribute data across instances within the cluster)
Large distributed and replicated workloads (e.g., HDFS, HBase, Cassandra, Kafka)
EC2 vCPU limits
Only applicable to On-demand and Spot instances
Insufficient capacity error
AWS does not have enough capacity for the requested instance type available in that particular AZ.
SSH troubleshooting
Instance Connect uses one of the reserved IP ranges for your region. As long as port 22 is open, Instance Connect picks one of those IPs and connects. Be careful when whitelisting a CIDR range for inbound SSH: if the reserved range is not whitelisted, it is implicitly blocked.
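The published ranges live in ip-ranges.json under the EC2_INSTANCE_CONNECT service; a lookup sketch (the region is a placeholder; requires jq):
~ curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | \
    jq -r '.prefixes[] | select(.service=="EC2_INSTANCE_CONNECT" and .region=="us-east-1") | .ip_prefix'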
CloudWatch metric types
- Basic (default: 5 min; detailed monitoring: 1 min)
  - CPU usage, CPU credits
  - Disk (instance store only)
  - Network
  - Status check
- Custom (default: 1 min; high resolution: 1 sec)
  - RAM
  - Application-level metrics
  - Requires an IAM role
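A sketch of publishing a custom high-resolution metric (namespace, metric name, and value are hypothetical; the caller needs the cloudwatch:PutMetricData permission):
~ aws cloudwatch put-metric-data \
    --namespace MyApp \
    --metric-name MemUsedPercent \
    --value 72.5 \
    --storage-resolution 1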
Procstat plugin
Collects system-level and application-level metrics from individual processes for the CloudWatch agent (Windows and Linux).
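A minimal sketch of an agent config enabling procstat for an httpd process (the config path is the agent's default; the process name and measurements are illustrative):
~ cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        { "exe": "httpd", "measurement": ["cpu_usage", "memory_rss"] }
      ]
    }
  }
}
EOF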
Terminal
Elevate privileges
~ sudo su
Terminal
Install Apache
~ yum install httpd
~ echo "Hello world" > /var/www/html/index.html
To start Apache
~ sudo systemctl start httpd
To persist through system restarts
~ systemctl enable httpd
$(hostname) — shell command substitution; e.g., echo "Hello from $(hostname -f)" > /var/www/html/index.html embeds the server's hostname in the page.
Httpd server logs examples
~ cat /var/log/httpd/access_log
~ cat /var/log/httpd/error_log
CloudWatch Logs vs Metrics
Logs - Report(text files)
Metrics - measurements(graphs)
Troubleshoot EC2 status check failure (system)
Migrate Instance to another host (Stop and start instance)
CloudWatch system recovery (triggered by a configured CloudWatch alarm)
Recovery maintains the instance's public/private IP, Elastic IP, metadata, and placement group.
Examples of problems that require instance recovery:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host that impact network reachability
If your instance fails a system status check, then you can use CloudWatch alarm actions to automatically recover your instance. The recover option is available for over 90% of deployed Amazon EC2 instances. However, the recover option works only for system check failures, not for instance status check failures.
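A sketch of such a recovery alarm via the CLI (instance ID and region are placeholders; the recover action ARN follows the documented arn:aws:automate:&lt;region&gt;:ec2:recover format):
~ aws cloudwatch put-metric-alarm \
    --alarm-name ec2-auto-recover \
    --namespace AWS/EC2 --metric-name StatusCheckFailed_System \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Maximum --period 60 --evaluation-periods 2 \
    --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:automate:us-east-1:ec2:recover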
AMI create volume permission
A created AMI can be encrypted or decrypted, given the relevant permissions.
Can be shared privately (with an account ID or an organization ARN).
Permission to copy can also be granted to the receiving account or organization.
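A sketch of sharing an AMI privately with another account (image and account IDs are placeholders):
~ aws ec2 modify-image-attribute \
    --image-id ami-0123456789abcdef0 \
    --launch-permission "Add=[{UserId=123456789012}]"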
EBS Multi-attach
Only available for io1/io2 (Provisioned IOPS) volumes; a volume can be attached to up to 16 instances in the same AZ.
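A sketch of creating a Multi-Attach-enabled io2 volume (size, IOPS, and AZ are placeholders):
~ aws ec2 create-volume \
    --volume-type io2 --size 100 --iops 3000 \
    --multi-attach-enabled \
    --availability-zone us-east-1a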