1. MAAREK Flashcards
Multi-AZ Resharding:
Multi-AZ: Indicates support for running Redis in multiple availability zones (AZs) to enhance fault tolerance and high availability.
Resharding: Refers to the process of redistributing data across nodes, possibly to accommodate changes in the cluster size or improve load distribution.
Online and Offline Resharding:
Online Resharding: This suggests that the resharding process can be performed without taking the cluster offline, ensuring continuous availability.
Offline Resharding: In some cases, resharding might require the cluster to be taken offline temporarily. This can impact availability during the process.
Maximum of 5 Read Replicas (RR) per Cluster:
Read Replicas: Additional nodes that replicate the data from the primary node for read-heavy workloads.
Limitation: The cluster is configured to support a maximum of 5 read replicas.
Cluster Scaling:
Scaling: The ability to add or remove nodes dynamically to adapt to changing workloads.
Tasks Spread Across Nodes: This suggests that the cluster distributes tasks or data across its nodes to balance the load.
New Nodes Immediately Updated:
Dynamic Updates: Changes or additions to the cluster, such as the creation of new nodes, are immediately reflected in the cluster’s state.
Important Considerations:
High Availability (HA): Multi-AZ deployment and the ability to perform online resharding are key elements for ensuring high availability.
Performance: Read replicas and the ability to scale the cluster contribute to improved performance and the ability to handle increased workloads.
These features and capabilities are typical of a Redis cluster designed for scalability, fault tolerance, and high availability, which are essential for many production systems.
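As a minimal sketch of triggering online resharding from the CLI (the replication group ID and target shard count below are hypothetical):
aws elasticache modify-replication-group-shard-configuration \
  --replication-group-id my-redis-cluster \
  --node-group-count 4 \
  --apply-immediately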
Memcached
Addition of New Nodes:
Horizontal scaling involves adding more nodes to the system to handle increased load.
New nodes can be dynamically added to the cluster.
Autodiscovery Function: There is a function in place that automatically discovers and updates all new nodes in the cluster.
Limit: The system is designed to handle a maximum of 40 nodes.
Vertical Scaling:
Node Upgrade Limitation:
Vertical scaling involves upgrading the existing nodes to handle increased load.
Limitation: Memcached nodes cannot be upgraded directly. Instead, the approach is to swap out the old node for a new one. This typically involves taking the old node offline during the process.
Empty New Nodes:
When adding new nodes through vertical scaling, these nodes start empty and do not retain the data from the old node.
Application Reload: To fill the new nodes with data, the application needs to reload the data. This implies a potential data migration or reloading process.
Offline Swap:
Vertical scaling often involves taking the old node offline during the swap process.
Data Loading Requirement:
The mention of “new nodes are empty” suggests that when a new node is introduced through vertical scaling, it doesn’t automatically inherit the data from the old node. The application needs to handle the reloading or migration of data.
Both horizontal and vertical scaling have their trade-offs and are chosen based on specific use cases and requirements. Horizontal scaling offers more flexibility in handling increased load dynamically, while vertical scaling involves upgrading the capacity of the existing resources to handle additional load. The decision between them often depends on factors such as system architecture, performance requirements, and the nature of the application.
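A minimal sketch of horizontally scaling a Memcached cluster from the CLI (the cluster ID and node count are hypothetical); autodiscovery-aware clients then pick up the new nodes:
aws elasticache modify-cache-cluster \
  --cache-cluster-id my-memcached \
  --num-cache-nodes 4 \
  --apply-immediately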
REDIS RR vs RESHARDING
Read Replicas (RR) and resharding are two distinct concepts in the context of distributed systems like Redis.
Read Replicas (RR):
Purpose:
Read Scaling: Read Replicas are used to scale read operations in a distributed database system.
Improved Performance: By offloading read operations to replicas, the primary node is freed up to handle write operations.
Functionality:
Data Replication: Data from the primary node is replicated to the Read Replicas.
Read-Only: Read Replicas are typically read-only nodes, meaning that they can’t accept write operations.
Availability:
Fault Tolerance: Read Replicas provide fault tolerance. If the primary node fails, one of the replicas can be promoted to become the new primary.
Consistency:
Eventual Consistency: Depending on the replication mechanism, there might be some delay (latency) between the primary node and the replicas, resulting in eventual consistency.
Resharding:
Purpose:
Write Scaling: Resharding is a process used to scale write operations and distribute data more evenly across nodes in a cluster.
Dynamic Load Balancing: It allows for redistributing data to accommodate changes in workload or cluster size.
Functionality:
Data Redistribution: Resharding involves moving data from one set of nodes to another, typically to balance the load or accommodate changes in the cluster.
Availability:
Impact on Availability: Resharding might require taking the cluster offline temporarily or have some impact on availability during the process, depending on the implementation.
Consistency:
Maintaining Consistency: Resharding must ensure that data consistency is maintained during the redistribution process.
Key Differences:
Purpose: Read Replicas are primarily for read scaling, while resharding is for write scaling and dynamic load balancing.
Functionality: Read Replicas replicate data for improved read performance, while resharding redistributes data to optimize the distribution of writes.
Impact on Availability: Read Replicas provide fault tolerance without much impact on availability, while resharding might have some impact, especially if it involves taking the cluster offline temporarily.
In summary, Read Replicas and resharding serve different purposes in a distributed system. Read Replicas focus on improving read performance and fault tolerance, while resharding is about optimizing the distribution of writes and dynamically balancing the cluster.
Elasticache eviction issues
Eviction Scenario:
Cause: Low capacity in the system, leading to the need to remove data to make space for new data.
Solutions:
Scale Up or Scale Out:
Scale Up (Vertical Scaling):
Description: Increase the capacity of individual nodes by changing to larger nodes.
Advantages: This provides more resources (CPU, memory) to handle increased load.
Considerations: There might be limits to how much you can scale up, and larger nodes could be more expensive.
Scale Out (Horizontal Scaling):
Description: Add more nodes to the system to distribute the load.
Advantages: This provides increased capacity by distributing the workload across multiple nodes.
Considerations: This approach often offers better scalability, but it requires a distributed architecture.
Change Eviction Policy:
Description: Modify the eviction policy to allow for the early retirement of non-current data.
Example: You might consider changing the eviction policy to prioritize removing less frequently accessed or older data.
Considerations: This can help manage space more effectively, but it’s important to align the eviction policy with the application’s requirements.
Considerations:
Eviction Policy:
The choice of eviction policy depends on the nature of your application and the importance of different types of data.
Common eviction policies include
LRU (Least Recently Used),
LFU (Least Frequently Used), and others.
Monitoring:
Regularly monitor the system to identify trends in data access patterns and capacity usage.
Capacity Planning:
Plan for future growth and consider both vertical and horizontal scaling strategies.
Cost and Performance Trade-offs:
Consider the cost implications and performance trade-offs associated with scaling up or scaling out.
In summary, addressing eviction issues involves a combination of capacity planning, scaling strategies, and tuning the eviction policy to align with the application’s requirements. The choice between scaling up and scaling out depends on your specific use case and requirements.
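For the "change eviction policy" option above, a minimal sketch using a custom parameter group (the group name is hypothetical; maxmemory-policy and allkeys-lru are standard Redis parameter values):
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"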
Ssh syntax
ssh -i keypair.pem ec2-user@10.0.0.0
Check which network adapter/interface is present in the instance
ethtool -i eth0
Enhanced Networking
Enabled by default on the Amazon Linux 2 AMI.
However, whether it can actually be used depends on the instance type (see the check below).
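A quick sketch to check ENA support (the instance ID is hypothetical):
ethtool -i eth0                 # "driver: ena" means ENA is in use
modinfo ena                     # confirms the ENA module is available
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query "Reservations[].Instances[].EnaSupport"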
Changing instance type
Only EBS-backed instances can have their instance type changed.
Cluster placement advantage
Low latency, high bandwidth, high packets per second (PPS)
Best for HPC
Partition placement
- Max of 7 partitions per az
- Can be across Azs
- Partition information can be found in the instance’s metadata.
Partition placement use case
Application has to be partition-aware (able to distribute data across instances within the cluster)
Tightly linked distributed systems
Ec2 Vcpu limits
Only applicable to On-demand and Spot instances
Insufficient capacity error
AWS does not have enough capacity available for the requested instance type in that particular AZ
Ssh troubleshooting
Instance Connect uses one of the reserved IP ranges for your region. As long as port 22 is open, Instance Connect picks one of those IPs and connects. Be careful when whitelisting a CIDR range for inbound SSH, because if the reserved range is not whitelisted, it is implicitly blocked.
Cloud watch Metrics types
- Basic (default 5-minute period; 1-minute with detailed monitoring enabled):
- CPU usage, CPU credits
- Disk (instance store only)
- Network
- Status checks
- Custom (default 1-minute period, high resolution down to 1 second):
- RAM
- Application-level metrics
- Requires an IAM role/permissions to push the metrics
Procstat plugin
Collects system and application level metrics of individual processes for Cloudwatch agent (WINDOWS AND LINUX)
Terminal
Escalate privileges
~ sudo su
Terminal
Install Apache
~ sudo yum install -y httpd
~ echo "Hello world from $(hostname -f)" > /var/www/html/index.html
To start Apache
~ sudo systemctl start httpd
To persist through system restarts
~ sudo systemctl enable httpd
Httpd server logs examples
~ cat /var/log/httpd/access_log
~ cat /var/log/httpd/error_log
Cloudwatch Logs vs Metrics
Logs - Report(text files)
Metrics - measurements(graphs)
Troubleshoot ec2 status failure (system) ?
Migrate Instance to another host (Stop and start instance)
Cloudwatch system recovery (triggered by a configured CW alarm)
Maintains system’s public/private IP, Elastic IP, Metadata, and placement group.
Examples of problems that require instance recovery:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host that impact network reachability
If your instance fails a system status check, then you can use CloudWatch alarm actions to automatically recover your instance. The recover option is available for over 90% of deployed Amazon EC2 instances. However, the recover option works only for system check failures, not for instance status check failures.
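A minimal sketch of such a recovery alarm from the CLI (instance ID, region, and thresholds are hypothetical):
aws cloudwatch put-metric-alarm \
  --alarm-name recover-my-instance \
  --namespace AWS/EC2 --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum --period 60 --evaluation-periods 2 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover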
AMI create volume permission
A created AMI can be encrypted or decrypted with the relevant permissions.
Can be shared privately (with an account ID or ARN).
Permission to copy can also be granted to the receiving account or organization.
EBS Multi-attach
only available to io1/io2
SSM Agent
By default, the SSM Agent is already installed on Amazon Linux 2
The SSM Agent can work on both VMs and on-premises instances
Troubleshooting the SSM Agent
- Permission (IAM instance profile) issues
- Corrupt agent - reinstall the agent
Resource group purpose
To Automate patching and managing resources at group level
SSM Documents
The configuration script for all planned operations
SSM SSH
SSM does not need SSH or inbound HTTP ports; the agent connects outbound to SSM by itself
SSM Run command
- Executes the Document
- Error and Rate control
- Integrated with IAM and Cloudtrail
- Runs command on multiple instances and groups
- No need for SSH(Magical)
- Command output printed on the screen or can be sent to S3 or cloud watch
- Status can be viewed on the console
- Can be invoked using EventBridge
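A minimal Run Command sketch from the CLI, assuming instances tagged Environment=Prod and an S3 bucket named my-logs-bucket (both hypothetical):
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Environment,Values=Prod" \
  --parameters '{"commands":["uptime"]}' \
  --output-s3-bucket-name my-logs-bucket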
SSM get parameter
aws ssm get-parameters --names <parameter1> <parameter2>
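Example usage (the parameter name is hypothetical; --with-decryption is needed for SecureString parameters):
aws ssm get-parameters --names /prod/db/password --with-decryption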
SSM inventory
(SSM) Inventory is a feature that enables you to collect metadata from your managed instances. It provides a detailed view of your infrastructure, making it easier to understand its current state and track changes over time.
Here’s a brief overview of how SSM Inventory works:
Data Collection: You can configure SSM Inventory to collect information such as installed applications, network configurations, OS updates, and more from your EC2 instances or on-premises servers.
Resource Data Sync: Collected data can be stored in an Amazon S3 bucket or in an AWS Systems Manager Association. This allows you to centralize and aggregate inventory data from multiple AWS accounts and regions.
Querying and Reporting: You can use AWS Config or the AWS Systems Manager Console to query and generate reports based on the collected inventory data. This helps you understand the state of your resources and their configurations.
Automation: You can use inventory data to create automation workflows, such as triggering actions based on changes detected in the environment.
To set up SSM Inventory, you typically need to:
Configure Inventory Collection: Use SSM Documents to specify what inventory data you want to collect. These documents are associated with an inventory configuration.
Define Inventory Configurations: Create an inventory configuration that references your SSM Documents. This configuration specifies the type of data you want to collect.
Attach Inventory Configurations: Associate inventory configurations with your managed instances.
View and Query Data: Use the AWS Management Console, AWS CLI, or APIs to view and query the collected inventory data.
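A minimal sketch of the collection/association setup from the CLI, using the managed AWS-GatherSoftwareInventory document (targets, schedule, and parameters are hypothetical):
aws ssm create-association \
  --name "AWS-GatherSoftwareInventory" \
  --targets "Key=InstanceIds,Values=*" \
  --schedule-expression "rate(12 hours)" \
  --parameters '{"applications":["Enabled"],"networkConfig":["Enabled"]}'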
SSM State Manager
Manages the state of nodes in a group, ensuring that the actual configuration always matches the defined state
SSM Inventory
- data can be viewed on the console,
- stored on s3,
- Queried and analyzed using QuickSight and Athena
Elb Sticky Sessions
Always redirect a specific client request to a particular server/instance by adding a cookie to the request
Can cause load imbalance
ELB Health checks
If a target group contains only unhealthy targets, ELB routes requests across all of its unhealthy targets. This is usually the case during warmup/booting
ELB Access Logs
ELB Access Logs are encrypted by default
Lambda permission
Resource-based policy - grants another resource/service permission to invoke the Lambda function (push invocations)
Execution role - grants Lambda permission to poll another service for jobs (pull invocations, e.g. SQS, Kinesis)
Lambda Function throttling
Invocations beyond the concurrency limit are throttled: synchronous invocations receive a 429 error, while asynchronous invocations are retried automatically.
DLM (Data Lifecycle Manager)
DLM does not work with instance store
- Uses tags to identify resources
- Creates snapshots and AMIs
- Can't be used to manage snapshots/AMIs created outside DLM
- Cannot be used to manage instance store-backed AMIs
EBS Multi attach
- Max 16 instances
- File system that must be cluster-aware
- can only happen within a single AZ
one iam role can contain multiple policies
true
New EBS prep
After an EBS volume's size is increased, the partition and file system must be extended before the new space is usable (see the commands below).
You cannot reduce the size of an EBS volume.
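A minimal sketch of extending the partition and file system after the resize (device names and file system type are assumptions; adjust for your volume):
lsblk                       # confirm the new volume size and partition layout
sudo growpart /dev/xvda 1   # grow partition 1 into the new space
sudo xfs_growfs -d /        # if the root file system is XFS
sudo resize2fs /dev/xvda1   # or, if it is ext4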
EFS Operations
Certain operations can be performed in place, while others can't.
In Place -
- LifeCycle Policy
- Throughput mode and provisioned throughput number
- EFS Access Point
Requires Migration (using Datasync)-
- encryption
- Decryption
- Performance mode(eg, max i/o)
S3 Replication
Replicates only new objects created after replication is enabled;
to replicate existing/older objects, use S3 Batch Replication
S3 Analytics
Requires 24-48 hrs after activation in order to start generating data analysis reports.
Its recommendations only cover transitions from Standard to Standard-IA; it does not work for One Zone-IA or Glacier
S3 Multi-Part
recommended for objects above 100MB
mandatory for objects > 5G
s3 Transfer acceleration
Upload and download
S3 select
Retrieves only the data needed, using SQL, instead of pulling the whole object and then filtering or running ETL. It is only meant for retrieving a subset of an object's data.
S3 Batch operation
to perform bulk operations on existing S3 objects with a single request
eg,
modify object metadata
modify object properties
copy objects between s3 buckets
encrypt an unencrypted object
modify acl tags
restore objects from S3 glacier
invoke lambda function to perform a custom operation on an object
S3 inventory
comprehensive report on the objects in our bucket
S3 glacier
You can place a file into S3 Glacier the same minute you create it.
Glacier operates two types of vault policy: a vault access policy and a vault lock policy.
Glacier Vault - like a bucket in Glacier
Glacier retrieval methods
Expedited (1-5 minutes) - you need to purchase provisioned capacity units for guaranteed capacity
Standard 3-5hrs
Bulk 5-12Hrs
** While the restore is in progress, there has to be some sort of notification mechanism to handle the asynchronous process: S3 Event Notifications (restore initiated and completed) or EventBridge
Glacier Vault Policies
Strong access policies for strict regulatory or compliance requirements on the files in Glacier.
A vault lock is immutable/irreversible.
The vault lock is completed by sending the lock ID (returned when the lock was initiated) back to the vault lock.
Upload files to glacier
This is not possible via the console, you would have to use the API, CLI or SDK
Multipart upload
Divide and conquer algorithms
Split and upload files in parallel, then concatenate them at the receiving end.
- Multipart upload is done in parts and in any order
- recommended for uploading files >100MB and
- mandatory for files >5GB
Use lifecycle policies to clean up failed uploads (there's a lifecycle rule preset for incomplete multipart uploads)
**Multipart is only available via CLI/SDK
Athena
Serverless query engine that queries and analyzes files in S3 using SQL, without moving the data.
Athena best practice
Use columnar data formats for cost savings; they allow fewer, faster scans (e.g. ORC, Apache Parquet)
Compress data for smaller retrievals
Partition your files to ease queries.
use folder/path structure to ease queries, directly querying a specific directory/prefix
Performs better with larger files
SSE KMS
Uses API calls to and from KMS for encryption/decryption, which may run into the KMS request quota (throttling).
SSE-C
All SSE-C requests must be made over HTTPS (encrypted in transit)
Enforce TLS for client requests
Condition: deny requests when "aws:SecureTransport" is "false" (see the example below)
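A minimal sketch of such a policy applied from the CLI (the bucket name is hypothetical):
aws s3api put-bucket-policy --bucket my-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}'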
SSE-C on the console
Can only be done over CLI and not allowed in the console
MFA delete
- Only Root user can enable
- Can only be enabled via the AWS CLI, AWS SDK, or the Amazon S3 REST API
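A minimal sketch of enabling MFA delete with the CLI, run with root credentials (bucket name, MFA device ARN, and code are hypothetical):
aws s3api put-bucket-versioning --bucket my-bucket \
  --versioning-configuration Status=Enabled,MFADelete=Enabled \
  --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"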
S3 Retention Mode
Compliance - strict and immutable: the retention period cannot be shortened and the mode cannot be changed
Governance - privileged principals can change the version protections, retention settings, or mode
A retention period must be set in both modes
S3 Legal Hold
(s3:PutObjectLegalHold permission)
- Can protect the object indefinitely, irrespective of the retention mode or period
- Any identity with that permission can place or remove the hold
S3 access point policy
A scaled-down bucket policy scoped to specific prefixes.
This policy grants access to specific prefixes/directories and limits all access to those prefixes
a single access point policy can contain access to more than one prefix
each access point will have its own DNS name
Gateway Policy
This is also known as an access point policy allowing access to a target prefix from the vpc
Cross region access point
there’s an implicit creation of CRR when a cross-region access point is created.
Fsx for Windows 💨
Fsx for windows can also be mounted on a Linux server
Lustre (Linux + cluster)
For HPC
R/W with S3 is seamless
Accessible via DX and VPN
Resides in a single AZ
Scratch file system - single copy of data
Persistent file system - duplicates data in the same AZ, and can persist data to S3
FSx File Gateway
Gateway for on-premises access to FSx for Windows file shares
FSx for NetApp ONTAP
For migrating NetApp ONTAP workloads to AWS.
Compatible with all popular OSs
Storage Gateway
file gateway
- Linux compliant
- file system(SMB,NFS) backs up to S3
Volume gateway
- For Block storage
- Cached volumes (primary data in S3, frequently accessed data cached locally) or stored volumes (primary data on-premises, asynchronously backed up to S3)
Storage Gateway Lifecycle management
File Gateway (Linux file system) - restart
Volume Gateway (block storage management) - stop gateway, restart server, then start/attach gateway
RDS vs DB on an EC2 instance
RDS: fully automated OS patching, fully managed DB, no user access to the underlying instance
RDS RR vs Multi AZ
RR
- Asynchronous
- On fail over promotion, application has to update read/writer endpoint
Multi-AZ
- Synchronous replication
- Inter AZ-Transfer fee = free
- Auto failover to Standby in case of AZ failure
RDS Single to Multi AZ conversion
On the go, Zero downtime operation.
Just modify
RDS EBS Autoscaling and threshold setting
The maximum storage threshold sets the upper limit for storage autoscaling
RDS IAM Auth
Allows users to log in to the database with an authentication token generated from their IAM credentials instead of a password (see the sketch below).
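A minimal sketch of generating an auth token (hostname, port, and username are hypothetical); the token is then passed as the password when connecting over TLS:
TOKEN=$(aws rds generate-db-auth-token \
  --hostname mydb.abcdefgh1234.us-east-1.rds.amazonaws.com \
  --port 3306 --username db_user)
# e.g. with the MySQL client: --password="$TOKEN" --enable-cleartext-plugin --ssl-ca=<rds-ca-bundle>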
RDS Deployment options
Multi-AZ DB Cluster
Creates a DB cluster with a primary DB instance and two readable standby DB instances, with each DB instance in a different Availability Zone (AZ). Provides high availability, data redundancy and increases capacity to serve read workloads.
Multi-AZ DB instance
Creates a primary DB instance and a standby DB instance in a different AZ. Provides high availability and data redundancy, but the standby DB instance doesn’t support connections for read workloads.
Single DB instance
Creates a single DB instance with no standby DB instances.
troubleshooting RDS
- endpoint
- Security group ingress setting:
protocol: MySQL/MariaDB
port 3306
RDS Best Practices
If one of these instances (secondary) is launched in a private subnet and the primary is launched in a public subnet, after a Multi-AZ failover the RDS instance becomes inaccessible to the public network because the promoted secondary (new primary) instance was launched in the private subnet.
- In addition to disabling public access at the subnet level, Amazon RDS provides a feature to enable or disable public access to the respective instances. Even if an instance is launched in a public subnet for any reason, it's still possible to disable internet access to the instance by disabling public access. When you disable public access on the RDS instance, the RDS endpoint resolves to a private IP address only and is accessible to instances in the same VPC (or a VPC connected via other means, like VPC peering).
- If all your applications servers are in the same VPC as your RDS instance, consider disabling public access to the instance. Furthermore, to help developers or admins who need access to the RDS instance to perform required tasks, create bastion instances in the same VPC as the RDS instances. These bastion instances have public access (with proper security group rules), and users (developers and admins) can connect to the bastion instances via the internet and connect to the RDS instance from the respective bastion instances
- In addition to having least privilege to AWS APIs, it’s very important to perform regular audits on the security rules to ensure that public access to any instance isn’t enabled due to human or automation errors. In these cases, it’s recommended to use AWS Config to create rules to check for any changes and perform remediation automatically. For example, we can create an AWS Config rule to identify any RDS instances with public access enabled and perform remediation using managed rules (rds-instance-public-access-check). In a similar way, we can utilize existing managed rules or create custom rules to perform required checks and remediation.
- Restrict access to cloud users on Amazon RDS using IAM
Additional logging and monitoring
You can use the following services and features for additional logging and monitoring:
AWS CloudTrail – CloudTrail provides a record of actions taken by a user, role, or AWS service in Amazon RDS. CloudTrail captures all API calls for Amazon RDS as events, including calls from the Amazon RDS console and from code calls to Amazon RDS API operations. It’s important to monitor the API calls to understand the different operations performed by the users and applications in your AWS account. This can help you perform audits on different operations and manage permissions. It’s also helpful to provide an incident report when unintended operations are run on Amazon RDS resources. For more information, refer to Monitoring Amazon RDS API calls in AWS CloudTrail.
Amazon RDS recommendations – This feature is enabled by default and provides recommendations on different details related to the DB instances, read replicas, and parameter groups. These recommendations provide best practice guidance by analyzing DB instance configuration, usage, and performance data. For example, any pending engine version upgrades or instance maintenance operations are included in the recommendations. You can consider taking action on the provided recommendations immediately or in the following maintenance window.
Amazon RDS event notifications – Event notifications are the best way to track changes and get notifications when an Amazon RDS event occurs. For example, if you subscribe to a configuration change category for a DB security group, you’re notified when the DB security group is changed. This helps you address unintended changes immediately and take appropriate action to remediate them.
AWS Trusted Advisor – Trusted Advisor draws upon best practices learned from serving hundreds of thousands of AWS customers and helps close any security gaps in your AWS account. With respect to Amazon RDS security, Trusted Advisor checks for any security group access risks. For more information, refer to AWS Trusted Advisor check reference.
Automatic minor version upgrades - enable them so the DB engine automatically receives minor version patches during maintenance windows.
Lambda residence
By default, Lambda runs outside your VPC (in an AWS-owned VPC), so it can only reach your public endpoints.
Launch Lambda in a VPC so you do not have to expose your RDS to the public
Once configured, Lambda will create an ENI in your VPC (it needs the right permissions, e.g. the AWSLambdaVPCAccessExecutionRole managed policy)
RDS TooManyConnections
Lambda scales out and overwhelms RDS with connections. Solution? RDS Proxy
RDS Proxy-Lambda connection
RDS Proxy is never publicly accessible, so Lambda must be deployed in the VPC to reach it.
RDS Proxy closes idle connections (handles Lambda scale-in)
RDS proxy supports IAM Auth
Supports DB Password Auth
is inherently Autoscaling
DB Parameter group
Parameter Group is a collection of database engine parameter values that can be applied to one or more database instances. These parameter groups allow you to configure various settings for your RDS instance, including security, performance, and behavior.
A Parameter Group in Amazon RDS is a set of parameters that define the configuration settings for a particular database engine. These parameters can control various aspects of the database, such as memory allocation, logging, backup behavior, and security settings.
Importance: Parameter groups are important because they allow you to customize the behavior of your database instances without modifying the instance itself. This helps in maintaining consistency across multiple instances and makes it easier to manage and update configurations.
Use Cases:
Security: You can use parameter groups to set security-related parameters, such as controlling access, enabling encryption, or configuring SSL.
Performance: Parameters related to performance tuning, cache sizes, and query optimization can be adjusted using parameter groups.
Behavior: Parameters that govern the behavior of the database engine, such as the handling of connections, transaction timeouts, and logging, can be configured.
Dynamic Nature:
Explanation: Parameter groups are dynamic, meaning you can modify the parameter values in a group at any time, and these changes will be applied to the associated database instances.
Example: If you need to adjust the maximum number of allowed connections or change the log retention period, you can do so by modifying the parameter group without requiring a reboot of the RDS instance.
In summary, Parameter Groups in Amazon RDS play a crucial role in configuring and customizing the behavior of your database instances, and their dynamic nature allows for flexibility in adapting to changing requirements.
DB Parameter group activation
Associating a new parameter group with a DB instance requires a reboot. Changes to dynamic parameters apply immediately, while static parameter changes require a reboot (see the example below).
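A minimal sketch of changing a parameter in a custom group (group name and values are hypothetical; ApplyMethod=immediate only works for dynamic parameters, static ones need pending-reboot):
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-params \
  --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"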
DATABASE Restore
Both Backup and snapshots will create a new database. They do not do an in-place restore
RDS Snapshot sharing
Manual snapshots can be shared, while Automated snapshots cannot be shared.
RDS recommendations
Enabled by default; provides best-practice guidance (e.g. pending engine version upgrades or instance maintenance) by analyzing DB instance, read replica, and parameter group configuration, usage, and performance data.
RDS event notifications
Subscribe to event categories (e.g. a configuration change on a DB security group) to be notified when an RDS event occurs, so unintended changes can be addressed and remediated quickly.
S3 operations exclusive to CLI/SDK
- Multipart upload
- SSE-C
- CSE must be encrypted in transit (aws:SecureTransport = true)
- upload files straight to glacier
RDS - Cloudwatch integration
To access OS/VM-level metrics, RDS Enhanced Monitoring uses a native agent on the RDS host to gather system-level metrics (standard CloudWatch only sees hypervisor-level metrics)
RDS performance insights
Once enabled (modify the instance), RDS Performance Insights can filter metrics
- By waits - which resource? (CPU, IO, etc.)
- By SQL statements - which SQL query?
- By Hosts - which Host
- By users - who?
Aurora
Distributed writes: 6 copies of the data across 3 AZs (two copies per AZ)
Aurora Reader/Writer Endpoints
Both act as single access points: the writer endpoint to the write head, the reader endpoint to all RRs. Under the hood, Aurora autoscales the RRs to maintain the required number of RRs while keeping a single access point: the reader endpoint.
Maintaining a Single Access Point (Reader Endpoint): Despite autoscaling on Read Replicas, Aurora maintains a single access point for read operations through the Reader endpoint. This simplifies the application’s connection management, as the application can always connect to the same endpoint for read operations, and Aurora handles the distribution of those reads across available replicas.
Aurora Regional cluster
There’s a minimum instance requirement to enable a cross region cluster
PITR vs Back tracking
PITR (backup) - a recovery process; RDS/Aurora recovery involves restoring the DB, i.e. spinning up a new DB.
Backtracking - an in-place restore; simply rewind the existing cluster to an earlier time (<= 72 hrs)
Aurora RDS Encryption
Must be done at Launch time.
On an already-running unencrypted DB, encryption can only be added via a snapshot-copy-and-restore process.
RDS and Aurora audit logs
Short-term logs; for long-term retention, export them to CloudWatch Logs.
Migrating RDS to Aurora
Yes, RDS snapshot can be migrated to Aurora
RR promotion
Aurora follows the declared priority:
1. Highest priority is promoted first
2. Or the Replica with the largest size in a case of equal priority
3. else, a random Replica
Elasticache
Takes read workload off the DB (caching), reducing its load
Can be used to store session data so the application itself becomes stateless
ElastiCache connectivity
Using ElastiCache usually necessitates heavy application code changes
Redis
HA
Backup and restore features
Data durability
Multi-AZ with Auto-failover
Sets and sorted sets support
memcached
- Multinode for partitioning of data(Sharding), but all in a single AZ
- No HA
- No data persistence
- No backup and restore
- Multithreaded architecture
Redis cluster
Each Redis cluster has one write node and a max of 5 RR
All replication is asynchronous
The primary node does R/W
One shard = one cluster (cluster mode disabled)
Cluster mode disabled:
Data is replicated across the nodes of the single shard
Cluster mode enabled:
Data is spread (partitioned) across shards
Important metrics to monitor
- Cache evictions (memory overload)
- Swap usage
- Current connections
- CPU utilization
- DatabaseMemoryUsagePercentage
- NetworkBytesIn, NetworkBytesOut
- ReplicationBytes
- ReplicationLag
RDS Storage scaling
You can only scale RDS storage once every 6 hours
Aurora DB Automatic Backup
Can not be disabled
aws cloudwatch put-metric-data <--flag1 --flag2 ... --flagN>
For sending custom metrics to CloudWatch
common CW metrics
put-metric-data
--namespace <value>
[--metric-name <value>]
[--metric-data <value>]
[--timestamp <value>]
[--unit <value>] [--value <value>]
[--dimensions <value>]
[--statistic-values <value>]
[--storage-resolution <value>]
[--cli-input-json | --cli-input-yaml]
[--generate-cli-skeleton <value>]
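A concrete (hypothetical) example of pushing a high-resolution custom metric:
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-name LoggedInUsers \
  --value 42 --unit Count \
  --dimensions InstanceId=i-0123456789abcdef0 \
  --storage-resolution 1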
CW agent
The unified CloudWatch agent simply makes regular PutMetricData API calls
CW Custom metrics
For sending custom metrics, e.g. logged-in user count, RAM, disk space, etc.
Resolution
Standard: 1 min
High res: 1, 5, 10, or 30 s
Accepts data points with timestamps up to 2 weeks in the past and up to 2 hours into the future.
Has the ability to segment metrics using dimensions (attributes)
CW Dashboard characteristics
- can display metrics from different accounts in different regions
- you can change the time zone of the dashboard
- you can setup automatic refresh
- you can share dashboard with a non aws identity
CW Logs Insights
A data query feature/engine within CloudWatch that presents its results as both log lines (text) and a visualization
Results (logs) can either be exported (CreateExportTask API)
or saved for future use
Can query multiple log groups simultaneously
Log data consistency
Not near real time - log data can take up to 12 hours to become available for export.
Cloudwatch logs subscription
- for real-time log events from CloudWatch Logs, for processing and analytics
- data can be streamed to Kinesis
- to lambda
- you can specify the subscription filter of the target logs
Cloudwatch Troubleshooting
The following issues can prevent the unified CloudWatch agent from pushing log events:
- Out-of-sync metadata caused by creating an Amazon Machine Image (AMI) after the CloudWatch agent is installed
- Using an outdated version of the CloudWatch agent
- Failure to connect to the CloudWatch Logs endpoint
- Incorrect account, Region, or log group configurations
- Insufficient AWS Identity and Access Management (IAM) permissions
- CloudWatch agent run errors
- Timestamp issue
To push log events to the CloudWatch service, the CloudWatch agent requires credentials from either the IAM user or the IAM role policy.
VPC Flow Logs definition and scope
VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC. Flow log data can be published to Amazon CloudWatch Logs and Amazon S3. After you’ve created a flow log, you can retrieve and view its data in the chosen destination.
NAT Gateway best practice
To create a NAT gateway, you must specify the public subnet in which the NAT gateway should reside. You must also specify an Elastic IP address to associate with the NAT gateway when you create it. After you’ve created a NAT gateway, you must update the route table associated with one or more of your private subnets to point Internet-bound traffic to the NAT gateway. This enables instances in your private subnets to communicate with the internet.
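A minimal sketch of the CLI steps (subnet, EIP allocation, route table, and NAT gateway IDs are hypothetical):
aws ec2 create-nat-gateway --subnet-id subnet-0abc1234 --allocation-id eipalloc-0abc1234
aws ec2 create-route --route-table-id rtb-0abc1234 \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0abc1234def567890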
Cloudfront Headers
If you want CloudFront to cache different versions of your objects based on the device that a user is using to view your content, we recommend that you configure CloudFront to forward one or more of the following headers to your custom origin:
- CloudFront-Is-Desktop-Viewer
- CloudFront-Is-Mobile-Viewer
- CloudFront-Is-SmartTV-Viewer
- CloudFront-Is-Tablet-Viewer
you can’t set the cache behavior of a CloudFront distribution to forward the User-Agent header. This is configured in the Origin Custom Headers setting.
VERSIONING
An IAM Administrator account can suspend Versioning on an S3 bucket, but only the bucket owner can enable/suspend the MFA-Delete on the objects.
How CFN CreationPolicy and WaitCondition work
if you install and configure software applications on an EC2 instance, you might want those applications to be running before proceeding. In such cases, you can add a CreationPolicy attribute to the instance and then send a success signal to the instance after the applications are installed and configured.
CreationPolicy:
  AutoScalingCreationPolicy:
    MinSuccessfulInstancesPercent: Integer
  ResourceSignal:
    Count: Integer
    Timeout: String
SQS - ASG Scaling
You can configure the scale-out policy to check the number of messages in your SQS queue and then verify that your Auto Scaling group has launched an additional EC2 instance. Similarly, you can test your scale-in policy by decreasing the number of messages in your SQS queue and then verifying that the Auto Scaling group has terminated an EC2 instance.
SYSTEM MANAGER
Run Command - AWS System Manager Run Command to automate common administrative tasks and execute scripts remotely. Run Command enables you to automate common administrative tasks and perform adhoc configuration changes at scale.
Automation -it is used to create custom workflows and not to remotely configure managed instances at scale.
Session Manager - mainly used to quickly and securely access your Windows and Linux instances. It cannot automate administrative tasks and execute shell scripts remotely.
Config Rules
Rules types
1. Managed Rules (Over 150)
2. Custom Rules (using Lambda)
Config can either be triggered or Scheduled
Config Auto-remediation
Use Config to trigger SSM Automation to bring a non-compliant resource state back to compliant. The SSM Automation document can invoke a Lambda function, but Config cannot call Lambda directly for remediation.
Aggregator - centralizes Config findings from multiple accounts.
To deploy the Config rules themselves across accounts, CloudFormation StackSets is the better option.
Egress-only internet gateway (IPv6)
You must update your route tables to route your IPv6 traffic.
For a public subnet, create a route that routes all IPv6 traffic from the subnet to the Internet gateway.
For a private subnet, create a route that routes all Internet-bound IPv6 traffic from the subnet to an egress-only Internet gateway.
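A minimal sketch of the CLI steps (VPC, route table, and gateway IDs are hypothetical):
aws ec2 create-egress-only-internet-gateway --vpc-id vpc-0abc1234
aws ec2 create-route --route-table-id rtb-0abc1234 \
  --destination-ipv6-cidr-block ::/0 \
  --egress-only-internet-gateway-id eigw-0abc1234def567890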
Route 53 Healthcheck types
You can create three types of Amazon Route 53 health checks:
- Health checks that monitor an endpoint - You can configure a health check that monitors an endpoint that you specify either by IP address or by domain name. At regular intervals that you specify, Route 53 submits automated requests over the internet to your application, server, or other resources to verify that it’s reachable, available, and functional. Optionally, you can configure the health check to make requests similar to those that your users make, such as requesting a web page from a specific URL.
- Health checks that monitor other health checks (calculated health checks) - You can create a health check that monitors whether Route 53 considers other health checks healthy or unhealthy. One situation where this might be useful is when you have multiple resources that perform the same function, such as multiple web servers, and your chief concern is whether some minimum number of your resources are healthy. You can create a health check for each resource without configuring notifications for those health checks. Then you can create a health check that monitors the status of the other health checks, and that notifies you only when the number of available web resources drops below a specified threshold.
- Health checks that monitor CloudWatch alarms - You can create CloudWatch alarms that monitor the status of CloudWatch metrics, such as the number of throttled read events for an Amazon DynamoDB database or the number of Elastic Load Balancing hosts that are considered healthy. After you create an alarm, you can create a health check that monitors the same data stream that CloudWatch monitors for the alarm.
Failover Routing Policy
Unlike weighted routing, this policy does not let you control how much traffic is routed across your resources: all traffic goes to the primary record, and Route 53 fails over to the secondary only when the primary's health check fails (active-passive).
CFN Creation Policy
You can associate the CreationPolicy attribute with a resource to prevent its status from reaching create complete until AWS CloudFormation receives a specified number of success signals or the timeout period is exceeded. To signal a resource, you can use the cfn-signal helper script or SignalResource API. AWS CloudFormation publishes valid signals to the stack events so that you track the number of signals sent.
The creation policy is invoked only when AWS CloudFormation creates the associated resource. Currently, the only AWS CloudFormation resources that support creation policies are AWS::AutoScaling::AutoScalingGroup, AWS::EC2::Instance, and AWS::CloudFormation::WaitCondition.
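A minimal sketch of sending the success signal from the instance's user data via the cfn-signal helper script (stack name, logical resource ID, and region are hypothetical):
/opt/aws/bin/cfn-signal -e $? \
  --stack my-stack \
  --resource WebServerInstance \
  --region us-east-1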
Direct connect Gateway
Lets a single Direct Connect connection reach VPCs (via virtual private gateways or a transit gateway) in multiple Regions and accounts.
Route 53 Alias record
It is possible to create an Alias record that points to a resource in another account. In this case the fully qualified domain name of the ALB must be obtained and then entered when creating the record set. This is the most cost-effective option as you do not pay for Alias records and there is minimal configuration required.
EC2 Instance Lifecycle
STOP/START (Amazon EBS-backed instances only):
We move the instance to a new host computer (though in some cases, it remains on the current host). Public IPV4 is changed
HIBERNATE (Amazon EBS-backed instances only):
We move the instance to a new host computer (though in some cases, it remains on the current host). Public IPV4 is changed. The RAM is saved to a file on the root volume
REBOOT/RESTART:
The instance stays on the same host computer. Rebooting an instance is equivalent to rebooting an operating system. The public IPv4 address is retained.
An Elastic IP stays associated through all of these transitions.
Cloudfront and Dynamic Contents
Dynamic content is not cacheable on CloudFront edge locations
Dealing With Custom AMI
You may have to manually install the default agents that would natively come with an AWS-provided AMI, such as the SSM Agent, etc.
Signed URL vs Signed cookies
Signed URL - One User Per file
Signed Cookies - Several Users to several files
AWS storage services - changing encryption status
S3 - Encrypts objects by default
EFS and RDS - Encryption status cannot be changed after deployment
Http 403, 503
403 - Bucket policy issues
503 - High request rates for new buckets
How to enable AWS Shield Standard
AWS Shield Standard is automatically enabled to all AWS customers at no additional cost. AWS Shield Advanced is an optional paid service.
AWS Service Health Troubleshooting
Note that most AWS outages are limited to a single AZ. Therefore, for a Multi-AZ app behind an ASG, service health information might not be useful.
Adding SSL cert to a cloudfront distribution
Once you have an SSL certificate through ACM, you need to add it to your CloudFront distribution, then update the cache behavior to redirect traffic to HTTPS.
Ec2 access to SSM
AWS Systems Manager requires an IAM role for EC2 instances that it manages, to perform actions on your behalf. This IAM role is referred to as an instance profile.
If an instance is not managed by Systems Manager, one likely reason is that the instance does not have an instance profile, or the instance profile does not have the necessary permissions to allow Systems Manager to manage the instance.
Power of tags
Tags are used for organizing resources, not for controlling access. While they can be used in conjunction with IAM policies to allow or deny access, tags alone do not grant or deny access to Systems Manager. They are metadata to categorize your AWS resources and do not affect the operational aspects of AWS Systems Manager.
Special note on Usage:
User-defined tags are tags that you define, create, and apply to resources. After you have created and applied the user-defined tags, you can activate by using the Billing and Cost Management console for cost allocation tracking. Cost Allocation Tags appear on the console after you’ve enabled Cost Explorer, Budgets, AWS Cost and Usage reports, or legacy reports.
When using AWS Organizations, you must use the Billing and Cost Management console in the payer account to mark the tags as cost allocation tags. You can use the Cost Allocation Tags manager to do this.
VPC Endpoints
Interface Endpoint (paid): a private connection via an ENI, so it requires security group settings to connect the instance's private IP to the AWS service. Accessible via VPN and VPC peering.
Gateway Endpoint (free and scales better): only requires a route table entry targeting the endpoint; once targeted from the route table, instances get a private path to DynamoDB and S3.
In order to share an encrypted snapshot with another account, the KMS key policy has to be reviewed and the relevant permissions granted by the origin account
CERTAINLY!
Amazon Inspector (Amazon Antivirus)
Amazon Inspector is an automated vulnerability management service that continually scans Amazon Elastic Compute Cloud (EC2), AWS Lambda functions, and container images in Amazon ECR and within continuous integration and continuous delivery (CI/CD) tools, in near-real time for software vulnerabilities and unintended …
Amazon Inspector helps you discover potential security issues by using security rules to analyze your AWS resources. Amazon Inspector monitors and collects behavioral data (telemetry) about your resources.
To get started, you create an assessment target (a collection of the AWS resources that you want Amazon Inspector to analyze). Next, you create an assessment template (a blueprint that you use to configure your assessment). You use the template to start an assessment run, which is the monitoring and analysis process that results in a set of findings.
CloudTrail Security
Enabling log file integrity validation in AWS CloudTrail allows for the detection of whether a log file was modified, deleted, or unchanged after CloudTrail delivered it. You can use AWS CLI commands to manually validate the integrity of the log files. This provides a cryptographically verifiable method of ensuring log files have not been altered.
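A minimal sketch of manual validation from the CLI (trail ARN and start time are hypothetical):
aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/my-trail \
  --start-time 2024-01-01T00:00:00Z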
Memcached Scaling
Evictions occur when memory is over filled or greater than the maxmemory setting in the cache, resulting in the engine selecting keys to evict in order to manage its memory. The keys that are chosen are based on the eviction policy that is selected.
You cannot add read replicas to a Memcached cluster.
ELB Health check troubleshooting
- All health checks failed = misconfigured SG, missing route table route, target application not in service, etc.
- HealthyHostCount dropping (e.g., from 5 to 2) = health checks have failed and the ALB has taken EC2 instances out of service
Ec2 sending events to Events bridge
Amazon EC2 sends an EC2 Instance State-change Notification event to Amazon EventBridge when the state of an instance changes.
eg:
- pending
- running
- stopping
- stopped
- shutting-down
- terminated
SSM
Run Command - to run live commands from managed or custom documents
Automation - more like a complete pipeline of commands and events
State Manager - maintain configuration consistency by reapplying configuration state, and view detailed configuration history and output. Quickly identify and remediate non-compliant machines across multiple accounts.
Route Propagation
Route propagation allows a virtual private gateway to automatically propagate routes to the route tables. This means that you don't need to manually enter VPN routes into your route tables. You can enable or disable route propagation. An edge association is used to route inbound VPC traffic to an appliance.
Route propagation enables the automatic propagation of routes from a gateway (like a transit gateway) to a route table.
Shared Responsibility Model -OS
"Managing the health of the Linux operating systems" (as an AWS responsibility) is incorrect; the operating systems of EC2 instances are managed by customers, not AWS.
AD connector
Over VPN, users can use their on-premises Active Directory accounts to log in to AWS (IAM); they would then be able to access the AWS Management Console.
CW Logs vs Metrics
Metrics are quantitative measurements (indicators) about an instance or a service.
Logs are text reports on anticipated or unanticipated events.
Note
Metrics:
Metrics are quantitative measures or data points that provide information about the performance or status of a system, service, or application.
They are typically numerical values that can be collected over time, allowing for the analysis of trends, patterns, and anomalies.
Metrics are used to monitor and evaluate the health and efficiency of a system, helping to identify issues or optimize performance.
Examples of metrics include response time, error rate, CPU usage, memory usage, and throughput.
Logs:
Logs are records or entries generated by a system, application, or service to capture information about events, activities, or transactions.
They are typically text-based and can include details about errors, warnings, user actions, and system events.
Logs serve as a chronological record of what has happened within a system, and they are crucial for troubleshooting, debugging, and auditing.
While logs can contain metrics, they also provide context and narrative information about the events that occurred.
In summary, metrics are quantitative measurements that help monitor the performance of a system, while logs are textual records that provide a detailed account of events. While metrics are often used for trend analysis and alerting, logs are essential for diagnosing and understanding the context of issues. Both metrics and logs are crucial components of effective monitoring and troubleshooting in various IT and software contexts.
RDS
- You cannot copy an automated DB snapshot.
- Snapshots exist on S3 but you cannot directly work with them.
- You cannot create multi-AZ standby instances in another account
RDS Failover conditions/Troubleshooting
Amazon RDS handles failovers automatically so you can resume database operations as quickly as possible without administrative intervention. The primary DB instance switches over automatically to the standby replica if any of the following conditions occur:
* An Availability Zone outage.
* The primary DB instance fails.
* The DB instance’s server type is changed.
* The operating system of the DB instance is undergoing software patching.
* A manual failover of the DB instance was initiated using Reboot with failover.
- Failure on the primary database and the DB instance type being changed are both conditions that would cause a failover event to occur.
AWS Guard duty
AWS GuardDuty operates at the application layer, which corresponds to the OSI model’s Layer 7 (Application Layer). GuardDuty is a managed threat detection service that continuously monitors for malicious activity and unauthorized behavior within your AWS accounts and workloads.
GuardDuty analyzes events and logs generated by various AWS services, such as CloudTrail logs, VPC Flow Logs, and DNS logs, to identify potential security threats. It uses machine learning, anomaly detection, and threat intelligence to detect activities such as reconnaissance, privilege escalation, and communication with known malicious IP addresses.
While GuardDuty operates at the application layer for analysis, it leverages data from multiple layers of the OSI model to provide a comprehensive view of potential security issues in your AWS environment. It’s important to note that GuardDuty focuses on detecting threats and suspicious activities within the application layer, which includes services and protocols operating at higher levels of the OSI model.
AWS Service Catalog portfolio
When you share a portfolio using account-to-account sharing or Organizations, you are sharing a reference of that portfolio. The products and constraints in the imported portfolio stay in sync with changes that you make to the shared portfolio, the original portfolio that you shared. The recipient cannot change the products or constraints, but can add AWS Identity and Access Management (IAM) access for end users.
Service Health vs Personal Health
AWS Personal Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact you. While the Service Health Dashboard displays the general status of AWS services, Personal Health Dashboard gives you a personalized view into the performance and availability of the AWS services underlying your AWS resources.
The dashboard displays relevant and timely information to help you manage events in progress and provides proactive notification to help you plan for scheduled activities. With Personal Health Dashboard, alerts are triggered by changes in the health of AWS resources, giving you event visibility, and guidance to help quickly diagnose and resolve issues.
RDS Encryption
You can only enable encryption for an Amazon RDS DB instance when you create it, not after the DB instance is created. However, because you can encrypt a copy of an unencrypted DB snapshot, you can effectively add encryption to an unencrypted DB instance.
To do this you create a snapshot of your DB instance, and then create an encrypted copy of that snapshot. You can then restore a DB instance from the encrypted snapshot, and thus you have an encrypted copy of your original DB instance.
Transfer acceleration
Transfer Acceleration: S3
Global Accelerator: EC2, ELB, ECS
Both accelerators use edge locations to transfer traffic into AWS; more like a reverse CDN.
Cloudwatch for ec2 Lifecycle
Using Amazon CloudWatch alarm actions, you can create alarms that automatically
stop,
terminate,
reboot,
or recover your EC2 instances.
Cloudfront Cache-hit/Cache-miss
Cache hit - the requested object is served directly from the edge location's cache. Cache miss - CloudFront forwards the request to the origin, then caches the response at the edge.
RDS Single to Multi-Az
click and apply with no downtime
choose apply Immediately
ELB = EC2
An ELB can be connected to EC2 instances without an Auto Scaling group
EC2 Rescue
EC2Rescue for EC2 Windows is a convenient, straightforward, GUI-based troubleshooting tool that can be run on your Amazon EC2 Windows Server instances to troubleshoot operating system-level issues and collect advanced logs and configuration files for further analysis. EC2Rescue simplifies and expedites the troubleshooting of EC2 Windows instances.