Section 12: Databases and Analytics Flashcards
Relational vs Non-Relational databases
Relational
* SQL
* organised into tables, rows and columns
* rigid schema
* rules enforced in database
* usually vertically scaled
* supports complex queries and joins
* Amazon RDS, Oracle, MySQL, PostgreSQL
Non-relational
* NoSQL
* varied data storage models
* flexible schema stored in key-value pairs, columns, documents or graphs
* rules can be defined in application code (outside of database)
* scales horizontally
* unstructured, supports any kind of schema
* AWS DynamoDB, MongoDB, Redis, Neo4j
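The difference between rules enforced in the database and rules defined in application code can be sketched as follows. This is a minimal illustration only: Python's built-in sqlite3 module stands in for a relational database, a plain dict stands in for a document store, and the table, field names and `validate_user` helper are hypothetical.

```python
import sqlite3

# Relational: the schema is rigid and the database itself enforces the rules.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")

def insert_without_email(connection):
    """Try to break the NOT NULL rule; the database rejects the row."""
    try:
        connection.execute("INSERT INTO users (id, email) VALUES (2, NULL)")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected by database"

# Non-relational: items are schema-flexible, so rules live in application code.
document_store = {}  # stand-in for a key-value/document table

def validate_user(item):
    """Hypothetical application-level rule: every user needs an email."""
    return "email" in item

def put_user(key, item):
    if not validate_user(item):
        return "rejected by application"
    document_store[key] = item
    return "accepted"

relational_result = insert_without_email(conn)
flexible_result = put_user("u1", {"email": "a@example.com", "nickname": "al"})
missing_result = put_user("u2", {"nickname": "bo"})
```

Note that the document store happily accepts the extra `nickname` field without any schema change, while the missing email is only caught because the application checks for it.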
AWS Relational Database Service (RDS)
- Scales vertically, which means upgrading the EC2 instance (more CPU and RAM)
- Is an OLTP type of database (Online Transaction Processing)
- Horizontal scaling for queries (reads) can be done by creating a read replica. This means there is an RDS master database and an RDS read replica, and the master replicates its changes to the read replica asynchronously.
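One way an application might take advantage of read replicas is to route statements by type. The sketch below is an assumption-laden illustration, not an RDS feature: the `ReadWriteRouter` class and the endpoint names are hypothetical, and the read/write classification is deliberately naive.

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary and spread reads across read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas exist.
        self._readers = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql):
        # Naive classification: only SELECT statements count as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._readers) if is_read else self.primary

router = ReadWriteRouter(
    primary="primary.example.internal",       # hypothetical endpoint
    replicas=["replica-1.example.internal",   # hypothetical endpoints
              "replica-2.example.internal"],
)
```

Because replication to a read replica is asynchronous, a read routed this way can briefly return data that is older than the latest write.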
Relational Database Service (RDS) backups
Relational Database Service backups
Automated backups
* automated backups are retained for 0 to 35 days
* restore can be to any point in time during the retention period
Manual backups (snapshots)
* backs up entire DB instance, not just individual database
* snapshots do not expire
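The retention rule for automated backups can be expressed as a small check. This is a simplified sketch: `can_restore_to` is a hypothetical helper, and it ignores the short lag between the current time and the latest restorable time that a real instance has.

```python
from datetime import datetime, timedelta, timezone

def can_restore_to(target, now, retention_days):
    """Return True if `target` falls inside the automated backup window."""
    if not 0 <= retention_days <= 35:
        raise ValueError("retention must be between 0 and 35 days")
    if retention_days == 0:
        return False  # a retention of 0 disables automated backups
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= now

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
three_days_ago = now - timedelta(days=3)
ten_days_ago = now - timedelta(days=10)
```

With a 7-day retention period, `three_days_ago` is restorable and `ten_days_ago` is not; a manual snapshot would be needed to go back further.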
What is Amazon Aurora
Amazon Aurora:
* database in the RDS family
* great durability and scalability
* MySQL and PostgreSQL compatible
* built-in fault tolerance
Aurora key features
Aurora key features:
* high performance and scalability
* supports MySQL and PostgreSQL
* aurora replicas: in-region read scaling and failover target (up to 15 replicas)
* global database: cross-region cluster with read scaling
* multi-master: scales out writes within a region
* serverless: on-demand, autoscaling configuration; does not support read replicas or public IPs. Aurora Serverless is a configuration of Aurora, not a separate service
When to use Aurora Serverless
Use cases:
* infrequently used apps
* new apps
* variable workload
* unpredictable workloads
* dev and test databases
* multi-tenant apps
What is RDS Proxy?
- RDS Proxy is a fully managed database proxy for RDS
- highly available across multiple AZs
- increases scalability, fault tolerance and security
- reduced stress on CPU/Memory
- control authentication method
- controls pool of connections to database
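The idea of a controlled pool of connections can be sketched in a few lines. This is only an illustration of connection pooling in general, not of how RDS Proxy is implemented; the `ConnectionPool` class and its `connect` callback are hypothetical.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Open a fixed number of connections once and lend them out on demand."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost up front

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks while the pool is empty
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse

opened = []
# Stand-in connect function: records each "connection" and returns its number.
pool = ConnectionPool(connect=lambda: opened.append(1) or len(opened), size=2)

for _ in range(10):
    with pool.connection() as conn:
        pass  # run a query with `conn` here
```

Ten requests are served while only two connections are ever opened, which is why pooling reduces the CPU and memory pressure on the database.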
What is Amazon ElastiCache
- Fully managed implementation of Redis and Memcached
- It is a key/value store
- Can be put in front of databases such as RDS and DynamoDB
- ElastiCache runs on Amazon EC2 instances, so you must choose an instance family/type
ElastiCache - Memcached vs Redis
Redis:
* Data persistence
* Complex data types
* Partitioning (only in Cluster Mode)
* high availability
* NOT multithreaded
Memcached
* No data persistence
* Simple data types
* Partitioning
* Not high availability
* Multithreaded
ElastiCache use cases
- data that is relatively static and frequently accessed
- apps that are tolerant of stale data
- often used for storing session state (DynamoDB can also be used)
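These use cases map onto the cache-aside pattern, sketched below. The `TTLCache` class is a tiny in-memory stand-in for a key/value cache such as Redis, and `get_product` and `load_from_db` are hypothetical names; a real deployment would use a Redis or Memcached client instead.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for a key/value cache such as Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self.clock():
            self._data.pop(key, None)  # drop missing or expired entries
            return None
        return entry[1]

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)

def get_product(product_id, cache, load_from_db):
    """Cache-aside read: try the cache first, then fall back to the database."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = load_from_db(product_id)  # slow path, e.g. a query against RDS
    cache.set(product_id, value)
    return value
```

Until an entry expires, readers may see a value that is older than the database row, which is why the pattern suits apps that tolerate stale data.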
What is Amazon DynamoDB?
- NoSQL database service
- key/value store and document store
- non-relational, key-value type of database
- fully serverless
- autoscaling based on the defined read/write capacity
DynamoDB - TTL
- TTL (time to live) lets you define when data can be deleted. Great for using DynamoDB like you would Redis for caching purposes
- you add a timestamp attribute to an item in the table, and the item is deleted after that time has passed
- No extra cost and does not use WCU/RCU (write capacity units / read capacity units)
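A minimal sketch of preparing an item for TTL follows. DynamoDB expects the TTL attribute to be a number holding a Unix epoch time in seconds; the attribute name `expires_at` and the helpers `with_ttl` and `is_expired` are hypothetical, and the attribute name must match whatever is configured as the table's TTL attribute.

```python
import time

def with_ttl(item, ttl_seconds, now=None):
    """Return a copy of `item` with an epoch-seconds expiry attribute."""
    now = time.time() if now is None else now
    expiring = dict(item)
    # "expires_at" is a hypothetical attribute name; it must match the
    # attribute configured as the table's TTL attribute.
    expiring["expires_at"] = int(now) + ttl_seconds
    return expiring

def is_expired(item, now=None):
    """Filter helper: treat items past their expiry as already gone."""
    now = time.time() if now is None else now
    return item["expires_at"] <= int(now)

session = with_ttl({"session_id": "abc"}, ttl_seconds=3600, now=1_700_000_000)
```

Expired items are removed by a background process rather than at the exact expiry time, so a read-side check like `is_expired` can be useful when strict expiry matters.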
What is DynamoDB Streams?
DynamoDB Streams:
Captures a time-ordered sequence of item-level modifications to any DynamoDB table and stores this information in a log for up to 24 hours
What is DynamoDB Accelerator (DAX)?
- DAX is a fully managed, highly available, in-memory cache for DynamoDB
- improved performance from milliseconds to microseconds (will help with latency etc)
- acts as a read-through and write-through cache; the main gain is read performance
What is DynamoDB Global Tables
DynamoDB Global Tables:
* multi-region, multi-active database
* asynchronously replicates DynamoDB tables across regions (same data set)
What is Amazon Redshift?
Amazon Redshift:
* data warehouse
* used to analyse data using SQL and other Business Intelligence (BI) tools such as Amazon QuickSight, Tableau, Microsoft Power BI
* relational database
* used for OLAP (online analytical processing)
* uses EC2 instances
* keeps 3 copies of your data
* continuous and incremental backup
Use cases for Amazon Redshift
Amazon Redshift (data warehouse) use cases:
* perform **complex queries** on massive collections of structured and semi-structured data with fast performance
* use Redshift Spectrum for direct access of S3 objects in a data lake
What is Amazon Elastic Map Reduce (EMR)?
- Amazon Elastic Map Reduce is Amazon’s version of Hadoop
- It is used for running big data frameworks such as Apache Hadoop and Apache Spark
- used for processing data for analytics and business intelligence
- can also be used for transforming and moving large amounts of data
- performs extract, transform and load functions (ETL)
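The map/reduce model behind the service name can be sketched on a single machine. This is a toy illustration of the programming model only, with hypothetical function names; a real EMR job would run a framework such as Hadoop or Spark across a cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))

counts = word_count(["big data", "Big clusters"])
```

In a cluster, the map and reduce phases run in parallel on many nodes and the shuffle moves data between them, which is what lets the same model scale to very large inputs.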
What is Amazon Kinesis?
Amazon Kinesis:
Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale as a fully managed service. With Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications.
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it simple to analyze data directly in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can choose to pay based on the queries you run or the compute needed by your queries.
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can more easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio.
What is Amazon OpenSearch Service (ElasticSearch)
Search, visualise, and analyse text and unstructured data. It is derived from Elasticsearch, meaning you can use it with Logstash and Kibana dashboards (ELK stack).
Supports queries using SQL.
Amazon OpenSearch Service is a managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch.
Amazon OpenSearch (ElasticSearch) best practices
- deploy OpenSearch data instances across 3 Availability Zones
- provision instances in multiples of 3
- if 3 AZs are not available, use 2 AZs with an equal number of instances in each
- configure at least 1 replica for each index
- apply restrictive resource-based access policies to the domain (or use fine-grained access control)
- create the domain within an Amazon VPC
- for sensitive data, enable node-to-node encryption (in transit) and encryption at rest
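The instance placement rules above can be sketched as a small planning check. `plan_data_nodes` is a hypothetical helper written for illustration; it is not part of any AWS SDK.

```python
def plan_data_nodes(instance_count, available_azs):
    """Spread data instances evenly across 3 AZs, or 2 if 3 are unavailable."""
    if available_azs < 2:
        raise ValueError("at least 2 Availability Zones are needed")
    az_count = 3 if available_azs >= 3 else 2
    if instance_count % az_count != 0:
        raise ValueError(f"instance count must be a multiple of {az_count}")
    # Each entry is the number of data instances placed in one AZ.
    return [instance_count // az_count] * az_count

three_az_plan = plan_data_nodes(6, available_azs=3)
two_az_plan = plan_data_nodes(4, available_azs=2)
```

Keeping the count a multiple of the number of AZs means every zone holds the same number of instances, so losing one zone removes an equal share of capacity rather than a disproportionate one.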
What is AWS Batch?
AWS Batch is a set of batch management capabilities that enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS.
A script such as a shell script, executable or Docker container image is run as the “batch job”.
Other AWS databases
- DocumentDB = MongoDB-compatible. Document database for JSON data management
- Amazon Keyspaces (for Apache Cassandra). Uses Cassandra Query Language (CQL) code
- Amazon Neptune = graph database
- Amazon Quantum Ledger Database = ledger database. Provides a transparent, immutable (append-only, meaning it can NOT be overwritten or deleted) and cryptographically verifiable transaction log.
Other AWS analytics services
- Amazon Timestream = a fast, scalable, and serverless time-series database service that makes it easier to store and analyze trillions of events per day, up to 1,000 times faster than relational databases
- AWS Data Exchange = a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.
- AWS Data Pipeline = managed ETL (extract, transform, load) service. Data sources can be on-prem and can be processed and transformed.
- AWS Lake Formation = data lake (structured, semi-structured and unstructured data). Redshift is a data warehouse (structured data)
- Amazon Managed Streaming for Apache Kafka (MSK) = used for ingesting and processing data in real-time
Which DynamoDB feature integrates with AWS Lambda to automatically execute functions in response to table updates?
DynamoDB Streams
DynamoDB Streams maintains a list of item level changes and can integrate with Lambda to create triggers.
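A Lambda function triggered by a stream receives batches of change records. The handler below is a minimal sketch that only summarises each record; the sample event is hand-written for illustration, and the table keys and attribute names in it are hypothetical.

```python
def lambda_handler(event, context):
    """Summarise item-level changes delivered by a DynamoDB stream."""
    changes = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        changes.append({
            "action": record["eventName"],  # INSERT, MODIFY or REMOVE
            "keys": change["Keys"],
            # NewImage is absent for REMOVE events and for some view types.
            "new_image": change.get("NewImage"),
        })
    return {"processed": len(changes), "changes": changes}

# Hand-written sample event with hypothetical keys and attributes.
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"pk": {"S": "user#1"}},
                "NewImage": {"pk": {"S": "user#1"}, "name": {"S": "Ada"}},
            },
        },
        {
            "eventName": "REMOVE",
            "dynamodb": {"Keys": {"pk": {"S": "user#2"}}},
        },
    ]
}
result = lambda_handler(sample_event, None)
```

A real trigger would replace the summary with the actual reaction to each change, such as sending a notification or updating another table.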
An organization is migrating databases into the AWS Cloud. They require a managed service for their MySQL database and need automatic failover to a secondary database. Which solution should they use?
Amazon RDS with Multi-AZ
RDS Multi-AZ does provide automatic failover to a secondary database.
How many PUT records per second does each shard in Amazon Kinesis Data Streams support?
1000
Each shard can support up to 1000 PUT records per second.
Which Amazon Kinesis service stores data for later processing by applications?
Amazon Kinesis Data Streams
Kinesis Data Streams stores data for later processing by applications.
You need to implement an in-memory caching layer in front of an Amazon RDS database. The caching layer should allow encryption and replication. Which solution meets these requirements?
Amazon ElastiCache Redis
Redis provides encryption and replication.
A new application requires a database that can allow writes to DB instances in multiple availability zones with read after write consistency. Which solution meets these requirements?
Amazon Aurora Multi-Master
Amazon Aurora Multi-Master adds the ability to scale out write performance across multiple Availability Zones and provides configurable read after write consistency.
An organization is migrating their relational databases to the AWS Cloud. They require full operating system access to install custom operational toolsets. Which AWS service should they use to host their databases?
Amazon EC2
If you need to access the underlying operating system you must use Amazon EC2 for a relational database.
An existing Amazon RDS database needs to be encrypted. How can you enable encryption for an unencrypted Amazon RDS database?
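Take a snapshot of the DB instance, copy the snapshot with encryption enabled, and create a new database instance from the encrypted copy
You cannot encrypt an existing instance in place. Take a snapshot, create an encrypted copy of it, then restore a new database instance from the encrypted copy.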
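In practice the snapshot of an unencrypted instance is itself unencrypted, and encryption is applied when that snapshot is copied with a KMS key. The sketch below assumes `rds` behaves like a boto3 RDS client; the function name and the snapshot naming scheme are hypothetical, and error handling is omitted.

```python
def encrypt_rds_instance(rds, source_db, target_db, kms_key_id):
    """Create an encrypted copy of an unencrypted RDS instance.

    `rds` is expected to behave like a boto3 RDS client; the snapshot
    names below are hypothetical.
    """
    plain_snapshot = f"{source_db}-plain"
    encrypted_snapshot = f"{source_db}-encrypted"

    # 1. Snapshot the unencrypted instance.
    rds.create_db_snapshot(
        DBInstanceIdentifier=source_db,
        DBSnapshotIdentifier=plain_snapshot,
    )
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=plain_snapshot
    )

    # 2. Copy the snapshot; supplying a KMS key encrypts the copy.
    rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=plain_snapshot,
        TargetDBSnapshotIdentifier=encrypted_snapshot,
        KmsKeyId=kms_key_id,
    )
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=encrypted_snapshot
    )

    # 3. Restore a new, encrypted instance from the encrypted copy.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=target_db,
        DBSnapshotIdentifier=encrypted_snapshot,
    )
    return target_db
```

The result is a new instance with a new endpoint, so the application has to be pointed at it before the original unencrypted instance is retired.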
Which Amazon Kinesis service uses AWS Lambda to transform data?
Amazon Kinesis Firehose
Kinesis Firehose can invoke a Lambda function to transform incoming data before delivering it.
How can you scale an Amazon Kinesis Data Stream that is reaching capacity?
Add shards
You scale Kinesis by adding shards to a stream.
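A rough shard estimate can be worked out from the per-shard write limits. The sketch below uses the 1000 PUT records per second figure from these cards plus a per-shard write throughput limit of about 1 MB per second, which is an assumption not stated in these notes; `shards_needed` is a hypothetical helper and ignores read-side limits.

```python
import math

RECORDS_PER_SHARD = 1000       # PUT records per second, per shard
BYTES_PER_SHARD = 1024 * 1024  # assumed write limit of about 1 MB/s per shard

def shards_needed(records_per_second, avg_record_bytes):
    """Estimate the shard count from whichever write limit is hit first."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SHARD)
    by_bytes = math.ceil(
        records_per_second * avg_record_bytes / BYTES_PER_SHARD
    )
    return max(1, by_count, by_bytes)  # a stream always has at least 1 shard

small_records = shards_needed(2500, avg_record_bytes=200)
large_records = shards_needed(500, avg_record_bytes=10_000)
```

Many small records are limited by the record count, while fewer large records are limited by throughput, so both limits need checking before adding shards.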
Cheat sheets
- DynamoDB - https://digitalcloud.training/amazon-dynamodb/
- ElastiCache - https://digitalcloud.training/amazon-elasticache/
- Redshift - https://digitalcloud.training/amazon-redshift/
- EMR - https://digitalcloud.training/amazon-emr/
- Kinesis - https://digitalcloud.training/amazon-kinesis/
- Athena - https://digitalcloud.training/amazon-athena/
- Glue - https://digitalcloud.training/aws-glue/
- RDS - https://digitalcloud.training/amazon-rds/
- Aurora - https://digitalcloud.training/amazon-aurora/