DynamoDB/RedShift/etc. Flashcards
Describe DynamoDB.
Serverless NoSQL database.
Fully managed. Highly available.
Key/value type, non-relational, also document store.
Push-button scaling (no downtime)
On-Demand capacity.
How can you scale DynamoDB?
Auto Scaling
Or,
You define RCUs (read capacity units) and WCUs (write capacity units), and AWS scales horizontally across partitions. For performance.
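A sketch of the unit arithmetic (function names are mine; the 4 KB read / 1 KB write unit sizes are the documented ones):

```python
import math

def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """One RCU = one strongly consistent read/sec of an item up to 4 KB.
    Eventually consistent reads cost half as much."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    rcu = units_per_read * reads_per_second
    return rcu if strongly_consistent else math.ceil(rcu / 2)

def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """One WCU = one write/sec of an item up to 1 KB."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second
```

So 10 strongly consistent reads/sec of 6 KB items needs 2 units per read, i.e. 20 RCUs.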
What is DynamoDB TTL?
Time to Live.
Defines when an item in a table expires and is deleted automatically (e.g. session state storage). Saves $: no need to scan the table and delete expired items yourself.
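A minimal sketch of the item shape TTL expects: a Number attribute holding a Unix epoch in seconds (the attribute name, `expires_at` here, is whatever you configured on the table):

```python
import time

def session_item(session_id: str, ttl_seconds: int) -> dict:
    """Build a DynamoDB item (low-level typed form) whose TTL attribute
    is set ttl_seconds in the future. DynamoDB deletes it after expiry."""
    return {
        "session_id": {"S": session_id},
        "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
    }

item = session_item("abc123", 3600)  # expires in one hour
```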
My data has an unpredictable structure. What db should I use?
DynamoDB (NoSQL) gives you flexibility; lets you add more attributes.
What is a Dynamo Stream?
When an item in DynamoDB is created, updated, or deleted, a change record is written to the stream; you then process it somehow, e.g. with a Lambda function (write to CloudWatch Logs).
What pieces of information can you include in DynamoDB Streams?
KEYS_ONLY
NEW_IMAGE (entire item after the update)
OLD_IMAGE
NEW_AND_OLD_IMAGES
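A hypothetical Lambda handler for a stream configured with NEW_AND_OLD_IMAGES, showing where each image appears in the event:

```python
def handler(event, context):
    """Summarize DynamoDB Stream records. OldImage is absent for INSERT,
    NewImage is absent for REMOVE; Keys are always present."""
    changes = []
    for record in event["Records"]:
        ddb = record["dynamodb"]
        changes.append({
            "event": record["eventName"],   # INSERT / MODIFY / REMOVE
            "keys": ddb.get("Keys"),
            "old": ddb.get("OldImage"),
            "new": ddb.get("NewImage"),
        })
    return changes
```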
What is DynamoDB DAX?
DynamoDB Accelerator.
- Fully managed, highly available in-memory CACHE.
- Read through and write-through cache => improve read AND write performance
- improve performance of DynamoDB from milliseconds to microseconds.
- no code changes needed
Where does DAX live? What do you need to set it up?
In your VPC.
Configure inbound TCP 8111 (DAX default port; 9111 with in-transit encryption)
Restrict access to your app's security group rather than 0.0.0.0/0
- DAX needs IAM role for permissions (to access DynamoDB tables)
- Your app needs IAM role for permissions to DAX and DynamoDB.
What is the difference between DAX and ElastiCache?
DAX is optimized for DynamoDB.
ElastiCache can support more datastores (including DynamoDB), but:
• must modify application code to support
• more management overhead (e.g. invalidation)
What are DynamoDB Global Tables?
A multi-Region, multi-active (multi-master) database. Great for DR.
Tables in multiple Regions, replicated asynchronously across Regions using streams.
What is RedShift?
A PostgreSQL-based data warehouse for analytics (OLAP - online analytical processing).
Uses EC2 instances, so you must choose a node family/type
Stores data in columnar format; keeps 3 copies of your data
MPP (Massively Parallel Processing) query execution engine
Continuous and incremental backups to S3, auto-provisioned
Exabyte scale query capability
• 1-128 nodes
• up to 128TB per node
I want to run a reporting analysis against one database without a performance hit to the database. What should I do?
What if I need to do it with multiple databases?
Use a read replica - live data, no performance hit.
Use RedShift.
Where can RedShift get data?
Firehose to RedShift through S3
S3 (via internet OR within VPC)
Direct from EC2 - best to write in big batches
Mnemonic: "EDGE RODS"
• EC2
• Data Pipeline
• Glue
• EMR
• RDS
• On-prem servers
• DynamoDB
• S3 (RedShift Spectrum can query data directly in S3)
What is EMR?
Elastic Map Reduce.
Managed cluster platform for Big Data frameworks like Hadoop or Spark.
For BI, analytics
Can be used to ETL large amounts of data
What is Kinesis Data Analytics?
Real-time processing of streaming data using SQL or Apache Flink applications.
Provides analytics for data from Data Streams or Firehose.
Destination can be Data Streams, Firehose, Lambda.
How does Kinesis Data Streams work?
Producers send data to the stream; records are stored in shards for 24 hours by default (configurable up to 365 days)
Consumers take data, process it, and can save it to another service.
Real time (~200 ms). cf. Firehose, which is NEAR real time.
What do you need to process data from Kinesis data streams?
KCL (Kinesis Client Library) on the EC2 instance that consumes data.
The KCL tracks the shards and creates a record processor for each shard on a worker (the consumer).
Each shard is processed by exactly ONE worker and ONE record processor. (But a worker can process many shards)
Can you have ordering with Kinesis data streams?
Yes, but only within a shard, not across shards.
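A sketch of why this holds: Kinesis MD5-hashes the partition key and routes the record to the shard owning that hash range, so one key always lands in one shard (this sketch assumes shards evenly split the 2**128 key space):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis routes it:
    MD5 of the UTF-8 key -> 128-bit integer -> owning hash-key range."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)  # clamp the top remainder
```

Records for the same key are therefore ordered relative to each other; records on different shards are not.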
What is Kinesis Firehose?
ETL service. (actually ITL - ingest transform load)
Data is loaded continuously directly to the destination, not stored in Firehose. Can be processed before storing using Lambda.
Fully managed, no shards, elastic scalability, fully automated
NEAR realtime.
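A minimal sketch of the Lambda transformation contract (`recordId`/`result`/`data` are the documented record fields; the uppercase transform is just an example):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode each base64 record,
    transform it, and return one output record per input recordId."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["message"] = payload.get("message", "").upper()
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": out}
```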
What destinations can you use with Firehose?
RedShift, OpenSearch (ElasticSearch), S3, Splunk, Datadog, MongoDB, New Relic, HTTP endpoints
What is Amazon Athena?
Mnemonic: goddess holding bucket
- Serverless service for querying data in S3 using SQL
- Can connect to other data sources with Lambda
- Data can be in CSV, TSV, JSON, Parquet, ORC
- Uses a managed data catalog (Glue) to store metadata
Pay per query; can output results back to S3.
Secured through IAM.
Good for ad hoc (one-time) queries, log analytics
How can you optimize performance in Athena?
Partition or bucket data
Use compression (Apache Parquet, Apache ORC)
Use approximate functions
Only include columns you need
Optimize:
• file size
• columnar data store generation
• ORDER BY
• GROUP BY
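"Partition data" usually means a Hive-style S3 key layout that Athena can prune on; a sketch (bucket and prefix names hypothetical):

```python
def partition_prefix(bucket: str, year: int, month: int, day: int) -> str:
    """Build a Hive-style partitioned S3 prefix. A query filtering on
    year/month/day then scans only the matching prefixes, not the bucket."""
    return f"s3://{bucket}/logs/year={year}/month={month:02d}/day={day:02d}/"
```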
What is AWS Glue?
Fully managed, serverless ETL service using Spark.
• used for data analytics
• discovers data, stores metadata in its catalog
• works with lakes (S3), warehouses (RedShift), stores (EC2, RDS, DynamoDB)
- a CRAWLER crawls multiple stores and can populate/update the catalog tables with metadata
- you can define an ETL job to use catalog tables as source/target
What is OpenSearch?
Based on ElasticSearch (a fork of it). For Search and Indexing.
Service for searching, visualizing, and analyzing UNSTRUCTURED text/data; supports a full query DSL plus SQL syntax.
Search on any field, even partial matches. Good as a complement to a DB.
Petabyte scale, secure
Backup with snapshots.
Encryption at rest and in transit.
How do you scale OpenSearch?
Add/remove EC2 instances running OpenSearch.
Up to 3 AZs.
How do you deploy OpenSearch?
Create a cluster (aka Service Domain)
Specify # and type of instances
Specify storage options like UltraWarm or Cold storage
Where does OpenSearch data come from?
Firehose, Logstash, the ElasticSearch (OpenSearch) API
You can search/visualize/analyze by pointing Kibana at the OpenSearch service.
Where should you put an OpenSearch cluster? What are the drawbacks?
Deploy into a VPC for secure communication within the VPC. (cannot use IP-based access policies)
Use VPN or proxy to communicate to internet unless you use a public domain.
- Can’t switch from VPC to public endpoint and vice versa.
- Can’t launch in VPC that uses dedicated tenancy.
- Once launched you can’t move to a different VPC
What are the access control options for OpenSearch?
Resource-based policies (aka domain access policies)
Identity-based policies (attached to users/roles)
IP based policies (except for a cluster in a VPC)
Fine-grained access control:
• role-based control
• security at index, document, field level
• OpenSearch dashboards multi-tenancy
• HTTP basic authentication
What are my options for authentication when using OpenSearch?
Federation using SAML to on-prem directories
Amazon Cognito and social IdPs.
What is Amazon Cognito?
A user identity and data synchronization service: user pools handle sign-up/sign-in, identity pools grant temporary AWS credentials, and it can securely synchronize app data for your users across their MOBILE devices.
How do I get the best availability for my OpenSearch deployment?
Deploy across 3 AZs.
Provision instances in multiples of 3 for even distribution across AZs. (If your region only has 2 AZs, then use them both with equal numbers of instances.)
Use 3 dedicated master nodes
At least 1 replica for each index
Use resource-based access control (or fine-grained) for max restriction
Create within a VPC
If data is sensitive, enable node-to-node encryption and encryption at rest.
Lambda is processing streaming data from an API gateway and is generating a TooManyRequestsException as volume increases. What do you do?
Stream data into Kinesis Data Stream from the API Gateway and process in batches.
Security logs generated by WAF must be sent to a 3rd party auditing application.
Send logs to Kinesis Firehose; configure the 3rd party application using an HTTP endpoint.
Which DynamoDB feature integrates with AWS Lambda to automatically execute functions in response to table updates?
DynamoDB Streams maintains a list of item-level changes and can integrate with Lambda to create triggers.
How many PUT records per second does Amazon Kinesis Data Streams support?
1,000 records per second per shard (with a 1 MB/s per-shard write limit).
Which Amazon Kinesis service stores data for later processing by applications?
Kinesis Data Streams stores data for later processing by applications.
Which Amazon Kinesis service uses AWS Lambda to transform data?
Firehose.
How can you scale an Amazon Kinesis Data Stream that is reaching capacity?
You scale Kinesis by adding shards to a stream.
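A sketch of sizing the shard count from the documented per-shard limits (1 MB/s and 1,000 records/s in, 2 MB/s out); the function name is mine:

```python
import math

def required_shards(write_mb_per_s: float, records_per_s: int,
                    read_mb_per_s: float) -> int:
    """Each shard ingests up to 1 MB/s and 1,000 records/s and serves
    up to 2 MB/s of reads; size the stream to the binding constraint."""
    return max(
        math.ceil(write_mb_per_s / 1.0),
        math.ceil(records_per_s / 1000),
        math.ceil(read_mb_per_s / 2.0),
        1,
    )
```

For example, 5 MB/s of writes, 3,000 records/s, and 8 MB/s of reads is write-bound at 5 shards.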
What are some options for storing session data?
ElastiCache (submillisecond speed)
DynamoDB (1-9 milliseconds speed)
What is RedShift Enhanced VPC Routing?
A feature that forces COPY/UNLOAD traffic between the cluster and S3 through the VPC, without touching the internet.
How do you do backups in RedShift?
There is NO Multi-AZ mode.
Take snapshots of a cluster; store in S3.
• manual (retained until you delete)
• automated: every 8 hours, every 5 GB of changed data, or on a schedule; you set the retention period
Snapshots are incremental: only the changes are saved, not the whole cluster.
** You can configure RedShift to automatically copy snapshots to another Region (for DR) **
How does RedShift Spectrum work?
You have to have a RedShift cluster already.
The data you want to analyze is in S3.
Submit your query; it fans out to thousands of Spectrum nodes, and results are returned to the RedShift compute nodes for aggregation.
Lots more processing power without having to load all the data from S3.
How does RedShift compare to Athena?
Faster queries/joins/aggregations because data is loaded into the cluster and organized with distribution/sort keys (Athena queries S3 in place).
** High performance, Analytics, BI Data warehouse ==> RedShift.
Who can use the Glue Data Catalogs?
Glue for ETL
For data discovery:
•Athena
• Redshift Spectrum
• EMR
What is Neptune?
A fully managed graph database (e.g. for social networking, Wikipedia)