Database Specialty - RDS Flashcards
Instance Class Types in RDS
Standard, Memory-optimized, Burstable performance
Storage Type
General Purpose, Provisioned IOPS
Parameter groups important points
> Define configuration values specific to the select DB engine
Default parameter group cannot be edited
To make config changes, you must create a new parameter group
New parameter group inherits settings from the default parameter group
Can be applied to any DB instances within the AWS region
Restoring from a snapshot points
- By default, restored cluster gets applied with
- New security group
- Default parameter group
- Option group that was associated with the snapshot
- While restoring from a snapshot, be sure to
- Choose the correct security group to
ensure connectivity for the restored DB - Choose correct parameter group for the
restored DB - Recommended to retain parameter group
of the snapshot to help restore with the
correct parameter group
- Choose the correct security group to
PITR with RDS
- Point-In-Time Recovery
- Can only restore to a new instance
- The backup retention period controls the
PITR window - Can restore to any point in time during your
backup retention period - RDS uploads DB transaction logs to S3 every
5 minutes (affects latest restorable time) - You can move/restore a DB instance from
outside VPC to inside VPC with PITR (but not
other way round)
Exporting DB Snapshot Data to S3
- All types of backups can be exported (automatic/manual
or those created with AWS Backup service) - How to export?
- Setup an S3 bucket with appropriate IAM
permissions and create a KMS key for SSE - Export the snapshot using console (Actions à Export to Amazon S3) or using start-export-task CLI command
- Setup an S3 bucket with appropriate IAM
- Export runs in the background
- Doesn’t affect the DB performance
- Data exported in Apache Parquet format (=compressed and consistent)
- Allows you to analyze the DB data using Athena or Redshift Spectrum
Multi-AZ Deployments in RDS
- For high availability, data durability and fault-tolerance (not used for scaling)
- Offers SYNC replication to standby instance in another AZ over low latency links
- Performs automatic failover to standby instance in another AZ in case of planned or unplanned outage
- Uses DNS routing to point to the new master (no need to update connection strings)
- Failover times (RTO) are typically 60-120 seconds (minimal downtime)
- Backups are taken from standby instead of primary to ensure performance level
during backup activity - Recommended for production use cases
- To force a failover or simulate AZ-failure, reboot the master instance and choose Reboot with failover
RDS Read Replicas
- Read-only copies of master (primary) DB instance
- Up to 5 Read Replicas
- Within AZ, Cross AZ or Cross Region
- Replication is ASYNC, so reads are eventually
consistent - Applications must update the connection string to leverage read replicas
RDS Read Replicas 2
- Boost DB performance
and durability - Useful for scaling of read- heavy workloads
- Can be promoted to primary (complements
Multi-AZ) - To create a replica, you must enable automatic backups with at least one
day retention period - Replica can be Multi-AZ (= a replica with its own standby instance)
RDS Read Replicas as Multi-AZ
- Supported for MySQL / MariaDB / PostgreSQL / Oracle
- Works as a DR target. When promoted to primary, it works as Multi-AZ
- There’s added network cost when data goes from one AZ to another
RDS Read Replicas – Use Case
- You have a production database that is
taking on normal load - You want to run a reporting application
to run some analytics - You create a Read Replica to run the new
workload there - The production application is unaffected
- Read replicas are used for SELECT (=read) only kind of statements (not INSERT, UPDATE, DELETE)
Promoting a Read Replica to a Standalone DB
Instance
- Promoted instance is rebooted and becomes an independent DB instance
(separate from its source) - Will no longer work as a replica. Does not affect other replicas of the original DB
instance - You cannot promote a replica to a standalone instance while a backup is running
Promoting a Read Replica to a Standalone DB
Instance – Use cases
- Use as a DR strategy
- Avoid performance penalty of DDL operations (like rebuilding indexes)
- Perform DDL ops on a read replica and promote it to a standalone instance. Then
point your app to this new instance. - Sharding (splitting a large DB
into multiple smaller DBs)
Enabling writes on a read replica
- For MySQL / MariaDB read replica, set
the parameter read_only = 0 for the read replica to make it writable - You can then perform DDL operations on the read replica as needed without affecting the source DB
- Actions taken on the read replica don’t
affect the performance of the source DB instance - You can then promote the replica to a standalone DB
RDS Read Replica capabilities
- Can create multiple read replicas in quick succession
- Can use DB snapshot to perform PITR of a Read Replica
- Can create a replica from an existing replica
- reduces replication load from the master DB instance
- second-tier replica can have higher replication lag
Cross-Region Read Replicas in RDS
- Supported for MariaDB, MySQL, Oracle, and PostgreSQL
- Not supported for SQL Server
- Advantages
- Enhanced DR capability
- Scale read operations closer to the
end-users - Limitations
- Higher replica lag times
- AWS does not guarantee more than
five cross-region read replica instances
RDS replicas with an external database
- Replication b/w an external DB and an
RDS replica - Supported for MySQL / MariaDB engines
- Two ways
- Binlog replication
- GTID based Replication
RDS Disaster Recovery Strategies
- To ensure business continuity despite
unexpected failures/events - Multi-AZ is not enough (it can’t protect from
logical DB corruption, malicious attacks etc.) - Key metrics for DR plan – RTO and RPO
- RDS PITR offers RPO of 5 minutes (typically)
- RTO (Recovery time objective)
- How long it takes you to recover after a disaster
- Expressed in hours
- RPO (Recovery point objective)
- How much data you could lose due to a disaster
- Expressed in hours (e.g. RPO of 1 hour means you could lose an hour worth of data)
Comparing RDS DR Strategies Automated backups
RTO - Good
RPO - Better
Cost - Low
Scope - Single region
Comparing RDS DR Strategies Manual snapshots
RTO - Better
RPO - Good
Cost - Medium
Scope - Cross-Region
Comparing RDS DR Strategies Read replicas
RTO - Best
RPO - Best
Cost - High
Scope - Cross-Region
RDS Disaster Recovery Strategies
- To ensure business continuity despite
unexpected failures/events - Multi-AZ is not enough (it can’t protect from
logical DB corruption, malicious attacks etc.) - Key metrics for DR plan – RTO and RPO
- RDS PITR offers RPO of 5 minutes (typically)
- RTO (Recovery time objective)
- How long it takes you to recover after a disaster
- Expressed in hours
- RPO (Recovery point objective)
- How much data you could lose due to a disaster
- Expressed in hours (e.g. RPO of 1 hour means you could
lose an hour worth of data)
Troubleshooting high replica lag
- Asynchronous logical replication typically results in replica lag
- You can monitor ReplicaLag metrics in
CloudWatch - ReplicaLag metric reports Seconds_Behind_Master values
- Replication delays can happen due to:
- Long-running queries on the primary instance (slow query log can help)
- Insufficient instance class size or storage
- Parallel queries executed on the primary instance
Troubleshooting replication errors - Recommendations:
- Size the replica to match the source DB (storage size and DB instance class)
- Use compatible DB parameter group settings for source DB and replica
- Ex. max_allowed_packet for read replica must same as that of the source DB
instance - Monitor the Replication State field of the replica instance
- If Replication State = Error, then see error details in the Replication Error field
- Use RDS event notifications to get alerts on such replica issues
Troubleshooting replication errors (contd.)
- Writing to tables on a read replica
- Set read_only=0 to make read replica writable
- Use only for maintenance tasks (like creating indexes only on replica)
- If you write to tables on read replica, it might make it incompatible with source DB and break the replication
- So set read_only=1 immediately after completing maintenance tasks
- Replication is only supported with transactional storage engines like InnoDB. Using
engines like MyISAM will cause replication errors - Using unsafe nondeterministic queries such as SYSDATE() can break replication
- You can either skip replication errors (if its not a major one) or delete and recreate
the replica
Troubleshooting MySQL read replica issues
- Errors or data inconsistencies b/w source instance and replica
- Can happen due to binlog events or InnoDB redo logs aren’t flushed during a
replica or source instance failure - Must manually delete and recreate the replica
- Can happen due to binlog events or InnoDB redo logs aren’t flushed during a
- Preventive recommendations:
- sync_binlog=1
- innodb_flush_log_at_trx_commit=1
- innodb_support_xa=1
- These settings might reduce performance (so test before moving to production)
Performance hit on new read replicas
- RDS snapshots are EBS snapshots stored in S3
- When you spin up a new replica, its EBS volume loads lazily in the background
- This results in first-touch penalty (when you query any data, it takes longer to retrieve it for the first time)
- Suggestions:
- If DB is small, run “SELECT * FROM <table>” query on each table on the replica
- Initiate a full table scan with VACUUM ANALYZE (in PostgreSQL)
- Another reason could be an empty buffer pool (cache for table and index data)
Scaling in RDS
- Vertical Scaling (Scaling up)
- Single-AZ instance will be unavailable during scaling op
- Multi-AZ setup offers minimal downtime during scaling op – standby DB gets upgraded first and then primary will failover to the upgraded instance
- Horizontal Scaling (Scaling out)
- Useful for read-heavy workloads
- Use read-replicas
- Replicas also act as a DR target
Sharding in RDS
- Sharding = horizontal partitioning
- Split and distribute data across
multiple DBs (called shards) - Mapping / routing logic
maintained at application tier - Offers additional fault tolerance
(since no single point of failure) - If any shard goes through failover,
other shards are not impacted
RDS Monitoring - common metrics, native logs, manual monitoring tools
- Common metrics
- CPU, RAM, disk space consumption /
Network traffic / DB connections / IOPS
metrics
- CPU, RAM, disk space consumption /
- Native logs / extensions
- e.g. pgaudit extension in PostgreSQL for
auditing (DML / DCL / DDL etc)
- e.g. pgaudit extension in PostgreSQL for
- Manual Monitoring Tools
- RDS console (DB connections, R/W ops,
storage consumption, memory
utilization, N/W traffic) - AWS Trusted Advisor (cost optimization,
security, fault tolerance, performance
improvement checks) - CloudWatch (service health status etc.)
- RDS console (DB connections, R/W ops,
RDS Monitoring - Automated Monitoring Tools
- RDS event notifications
- Database logs (can be exported to
CloudWatch Logs) - CloudWatch (Metrics / Alarms / Logs)
- Enhanced Monitoring (real-time)
- Performance Insights
- RDS Recommendations
- CloudTrail (captures all RDS API calls, can
be viewed in CloudTrail console or
delivered to an S3 bucket) - Up to 90 days of your account activity can
be viewed in CloudTrail console (can create
a trail to deliver the audit logs to S3)
RDS Notifications / Event subscriptions
- Available within the RDS console
- Allows you to create CloudWatch alarms to notify you
whenever certain metric data crosses a threshold - You can send alarm notifications to an SNS topic (email
/ SMS) - You can also subscribe to RDS events
- Event sources can be snapshots, instances, security
groups, parameter groups, clusters, cluster snapshots,
etc. - Events like DB instance creation, deletion, availability
(shutdown / restart), backup, recovery, failover, failure,
backtrack, config change etc.
RDS Recommendations
- Periodic automated suggestions for DB instances, read replicas, and DB parameter
groups
RDS Logs
- View / watch / download DB logs from the RDS console
- Can export logs to CloudWatch Logs (log types vary by DB engine)
- CloudWatch Logs never expire. To expire them, set log group retention policy (1 day - 10 yrs)
- Logs are accessible from RDS console even if you disable log export to CloudWatch Logs
RDS Logs - Log types that can be exported to
CloudWatch Logs - topics
- Alert log – Oracle
- Audit log – Oracle, MariaDB, MySQL (must use option group with MARIADB_AUDIT_PLUGIN option for MariaDB and MySQL to audit database activity)
- Listener log – Oracle
- Trace log – Oracle
- Error log – SQL Server, MariaDB, MySQL
- Postgresql log – PostgreSQL (contains audit
logs) - Upgrade log – PostgreSQL
- General log – MariaDB, MySQL
- Slow query log – MariaDB, MySQL
Exporting AWS RDS logs to S3
- RDS database log files can be accessed
via RDS console, CLI or API - Transaction logs cannot be accessed
- You can export log data from
CloudWatch Logs to S3 by creating an
export task in CloudWatch (createexport-task CLI command) - Log files can also be downloaded using
the RDS API and uploaded to S3 (using
Lambda or AWS SDK)
RDS Enhanced Monitoring
- To analyze real-time OS level metrics
(CPU / memory usage etc.) - To monitor different processes or
threads that are using the CPU - Helps identify performance issues
- Increased granularity of 1 to 60
seconds * 1, 5, 10, 15, 30, or 60 seconds - Requires an agent to be installed on the
DB server to collect metrics
RDS Performance Insights
- Offers visual dashboard for performance
tuning, analysis and monitoring - Monitors DB load for the instance (if the
instance has multiple DBs, you’ll see
aggregated metrics) - DB load – average number of active
sessions (AAS – average active sessions) - Performance problems will appear as
spikes in the DB load graph - Helps identify performance bottlenecks,
expensive SQL statements, etc.
RDS Performance Insights - DB load
- You can visualize the DB load,
filter it by waits / SQL / hosts /
users - Waits - wait state for CPU, IO, Lock
etc. - SQL – SQL statements
- Hosts
- Users
- Identify slow queries (top SQL),
locks
CloudWatch Application Insights
- For .NET and SQL Server
- Also supports DynamoDB tables
- Identifies and sets up key metrics, logs, and alarms for SQL Server workloads
- Uses CloudWatch events and alarms
- Useful for problem detection, notification and troubleshooting
RDS on VMware
- Lets you deploy RDS DBs in on-premises VMware environments (VMware vSphere)
- Same user interface as in AWS
- Supports MySQL, PostgreSQL, and SQL Server
- Fully managed DBs
- Uses health monitoring to detect unhealthy
database instances and automatically recovers them - Support manual and automatic backups with PITR
- Can use CloudWatch for monitoring
RDS – Good things to know
- Read replica can be made writable (for MySQL/MariaDB)
- For other engines
- read replica cannot be made writable, but you can promote it to make it writable
- For Oracle and SQL Server
- Automatic backups or manual snapshots are not supported on the replica
- For Oracle
- Does not yet support Oracle RAC (a cluster DB with a shared cache architecture)
- For SQL Server
- Supports both Multi-AZ options – Database Mirroring and Always On
- For PostgreSQL
- Only manual snapshots are supported on the replica (no automatic backups)
- Set log retention with parameter rds.log_retention_period
- For MySQL/MariaDB
- Set log retention with stored procedures * e.g. call mysql.rds_set_configuration(‘binlog reten