DynamoDB/RedShift/etc. Flashcards
Describe DynamoDB.
Serverless NoSQL database.
Fully managed. Highly available.
Key/value type, non-relational, also document store.
Push-button scaling (no downtime)
On-Demand capacity.
How can you scale DynamoDB?
Auto Scaling
Or,
You define RCUs (read capacity units) and WCUs (write capacity units), and AWS scales horizontally across partitions. For performance.
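A sketch of the unit arithmetic (function names are mine; the 4 KB read / 1 KB write unit sizes are the documented ones):

```python
import math

def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """One RCU = one strongly consistent read/sec of an item up to 4 KB.
    Eventually consistent reads cost half as much."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    rcu = units_per_read * reads_per_second
    return rcu if strongly_consistent else math.ceil(rcu / 2)

def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """One WCU = one write/sec of an item up to 1 KB."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second
```

So 10 strongly consistent reads/sec of 6 KB items needs 2 units per read, i.e. 20 RCUs.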
What is DynamoDB TTL?
Time to Live.
Defines when an item in a table expires and is deleted automatically (e.g. session state storage). Saves $: no need to scan the table and delete expired items yourself.
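A minimal sketch of the item shape TTL expects: a Number attribute holding a Unix epoch in seconds (the attribute name, `expires_at` here, is whatever you configured on the table):

```python
import time

def session_item(session_id: str, ttl_seconds: int) -> dict:
    """Build a DynamoDB item (low-level typed form) whose TTL attribute
    is set ttl_seconds in the future. DynamoDB deletes it after expiry."""
    return {
        "session_id": {"S": session_id},
        "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
    }

item = session_item("abc123", 3600)  # expires in one hour
```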
My data has an unpredictable structure. What db should I use?
DynamoDB (NoSQL) gives you flexibility; lets you add more attributes.
What is a Dynamo Stream?
When an item in DynamoDB is created, updated, or deleted, a change record is written to the stream; you then process it somehow, e.g. with a Lambda function (write to CloudWatch Logs).
What pieces of information can you include in DynamoDB Streams?
KEYS_ONLY
NEW_IMAGE (entire item after the update)
OLD_IMAGE
NEW_AND_OLD_IMAGES
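A hypothetical Lambda handler for a stream configured with NEW_AND_OLD_IMAGES, showing where each image appears in the event:

```python
def handler(event, context):
    """Summarize DynamoDB Stream records. OldImage is absent for INSERT,
    NewImage is absent for REMOVE; Keys are always present."""
    changes = []
    for record in event["Records"]:
        ddb = record["dynamodb"]
        changes.append({
            "event": record["eventName"],   # INSERT / MODIFY / REMOVE
            "keys": ddb.get("Keys"),
            "old": ddb.get("OldImage"),
            "new": ddb.get("NewImage"),
        })
    return changes
```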
What is DynamoDB DAX?
DynamoDB Accelerator.
- Fully managed, highly available in-memory CACHE.
- Read through and write-through cache => improve read AND write performance
- improve performance of DynamoDB from milliseconds to microseconds.
- no code changes needed
Where does DAX live? What do you need to set it up?
In your VPC.
Configure inbound TCP 8111 (DAX default port; 9111 with in-transit encryption)
Restrict access to your app's security group rather than 0.0.0.0/0
- DAX needs IAM role for permissions (to access DynamoDB tables)
- Your app needs IAM role for permissions to DAX and DynamoDB.
What is the difference between DAX and ElastiCache?
DAX is optimized for DynamoDB.
ElastiCache can support more datastores (including DynamoDB), but:
• must modify application code to support
• more management overhead (e.g. invalidation)
What are DynamoDB Global Tables?
A multi-Region, multi-active (multi-master) database. Great for DR.
Tables in multiple Regions, replicated asynchronously across Regions using streams.
What is RedShift?
A PostgreSQL-based data warehouse for analytics (OLAP - online analytical processing).
Uses EC2 instances, so you must choose a node family/type
Stores data in columnar format; keeps 3 copies of your data
MPP (Massively Parallel Processing) query execution engine
Continuous and incremental backups to S3, auto-provisioned
Exabyte scale query capability
• 1-128 nodes
• up to 128TB per node
I want to run a reporting analysis against one database without a performance hit to the database. What should I do?
What if I need to do it with multiple databases?
Use a read replica - live data, no performance hit.
Use RedShift.
Where can RedShift get data?
Firehose to RedShift through S3
S3 (via internet OR within VPC)
Direct from EC2 - best to write in big batches
Mnemonic: "EDGE RODS"
• EC2
• Data Pipeline
• Glue
• EMR
• RDS
• On-prem servers
• DynamoDB
• S3 (RedShift Spectrum can query data directly in S3)
What is EMR?
Elastic Map Reduce.
Managed cluster platform for Big Data frameworks like Hadoop or Spark.
For BI, analytics
Can be used to ETL large amounts of data
What is Kinesis Data Analytics?
Real-time processing of streaming data using SQL or Apache Flink applications.
Provides analytics for data from Data Streams or Firehose.
Destination can be Data Streams, Firehose, Lambda.
How does Kinesis Data Streams work?
Producers send data to the stream; records are stored in shards for 24 hours by default (configurable up to 365 days)
Consumers take data, process it, and can save it to another service.
Real time (~200 ms). cf. Firehose, which is NEAR real time.
What do you need to process data from Kinesis data streams?
KCL (Kinesis Client Library) on the EC2 instance that consumes data.
The KCL tracks the shards and creates a record processor for each shard on a worker (the consumer).
Each shard is processed by exactly ONE worker and ONE record processor. (But a worker can process many shards)
Can you have ordering with Kinesis data streams?
Yes, but only within a shard, not across shards.
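A sketch of why this holds: Kinesis MD5-hashes the partition key and routes the record to the shard owning that hash range, so one key always lands in one shard (this sketch assumes shards evenly split the 2**128 key space):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis routes it:
    MD5 of the UTF-8 key -> 128-bit integer -> owning hash-key range."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)  # clamp the top remainder
```

Records for the same key are therefore ordered relative to each other; records on different shards are not.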
What is Kinesis Firehose?
ETL service. (actually ITL - ingest transform load)
Data is loaded continuously directly to the destination, not stored in Firehose. Can be processed before storing using Lambda.
Fully managed, no shards, elastic scalability, fully automated
NEAR realtime.
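A minimal sketch of the Lambda transformation contract (`recordId`/`result`/`data` are the documented record fields; the uppercase transform is just an example):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode each base64 record,
    transform it, and return one output record per input recordId."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["message"] = payload.get("message", "").upper()
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": out}
```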
What destinations can you use with Firehose?
RedShift, OpenSearch (ElasticSearch), S3, Splunk, Datadog, MongoDB, New Relic, HTTP endpoints
What is Amazon Athena?
Mnemonic: goddess holding bucket
- Serverless service for querying data in S3 using SQL
- Can connect to other data sources with Lambda
- Data can be in CSV, TSV, JSON, Parquet, ORC
- Uses a managed data catalog (Glue) to store metadata
Pay per query; can output results back to S3.
Secured through IAM.
Good for ad hoc (one-time) queries, log analytics
How can you optimize performance in Athena?
Partition or bucket data
Use compression (Apache Parquet, Apache ORC)
Use approximate functions
Only include columns you need
Optimize:
• file size
• columnar data store generation
• ORDER BY
• GROUP BY
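"Partition data" usually means a Hive-style S3 key layout that Athena can prune on; a sketch (bucket and prefix names hypothetical):

```python
def partition_prefix(bucket: str, year: int, month: int, day: int) -> str:
    """Build a Hive-style partitioned S3 prefix. A query filtering on
    year/month/day then scans only the matching prefixes, not the bucket."""
    return f"s3://{bucket}/logs/year={year}/month={month:02d}/day={day:02d}/"
```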
What is AWS Glue?
Fully managed, serverless ETL service using Spark.
• used for data analytics
• discovers data, stores metadata in its catalog
• works with lakes (S3), warehouses (RedShift), stores (EC2, RDS, DynamoDB)
- a CRAWLER crawls multiple stores and can populate/update the catalog tables with metadata
- you can define an ETL job to use catalog tables as source/target
What is OpenSearch?
Based on ElasticSearch (a fork of it). For Search and Indexing.
Service for searching, visualizing, and analyzing UNSTRUCTURED text/data; supports a full query DSL plus SQL syntax.
Search on any field, even partial matches. Good as a complement to a DB.
Petabyte scale, secure
Backup with snapshots.
Encryption at rest and in transit.
How do you scale OpenSearch?
Add/remove EC2 instances running OpenSearch.
Up to 3 AZs.
How do you deploy OpenSearch?
Create a cluster (aka Service Domain)
Specify # and type of instances
Specify storage options like UltraWarm or Cold storage
Where does OpenSearch data come from?
Firehose, Logstash, the ElasticSearch (OpenSearch) API
You can search/visualize/analyze by pointing Kibana at the OpenSearch service.
Where should you put an OpenSearch cluster? What are the drawbacks?
Deploy into a VPC for secure communication within the VPC. (cannot use IP-based access policies)
Use VPN or proxy to communicate to internet unless you use a public domain.
- Can’t switch from VPC to public endpoint and vice versa.
- Can’t launch in VPC that uses dedicated tenancy.
- Once launched you can’t move to a different VPC
What are the access control options for OpenSearch?
Resource-based policies (aka domain access policies)
Identity-based policies (attached to users/roles)
IP based policies (except for a cluster in a VPC)
Fine-grained access control:
• role-based control
• security at index, document, field level
• OpenSearch dashboards multi-tenancy
• HTTP basic authentication
What are my options for authentication when using OpenSearch?
Federation using SAML to on-prem directories
Amazon Cognito and social IdPs.
What is Amazon Cognito?
A user identity and data synchronization service: user pools handle sign-up/sign-in, identity pools grant temporary AWS credentials, and it can securely synchronize app data for your users across their MOBILE devices.
How do I get the best availability for my OpenSearch deployment?
Deploy across 3 AZs.
Provision instances in multiples of 3 for even distribution across AZs. (If your region only has 2 AZs, then use them both with equal numbers of instances.)
Use 3 dedicated master nodes
At least 1 replica for each index
Use resource-based access control (or fine-grained) for max restriction
Create within a VPC
If data is sensitive, enable node-to-node encryption and encryption at rest.
Lambda is processing streaming data from an API gateway and is generating a TooManyRequestsException as volume increases. What do you do?
Stream data into Kinesis Data Stream from the API Gateway and process in batches.
Security logs generated by WAF must be sent to a 3rd party auditing application.
Send logs to Kinesis Firehose; configure the 3rd party application using an HTTP endpoint.
Which DynamoDB feature integrates with AWS Lambda to automatically execute functions in response to table updates?
DynamoDB Streams maintains a list of item-level changes and can integrate with Lambda to create triggers.
How many PUT records per second does Amazon Kinesis Data Streams support?
1,000 records per second per shard (with a 1 MB/s per-shard write limit).
Which Amazon Kinesis service stores data for later processing by applications?
Kinesis Data Streams stores data for later processing by applications.
Which Amazon Kinesis service uses AWS Lambda to transform data?
Firehose.
How can you scale an Amazon Kinesis Data Stream that is reaching capacity?
You scale Kinesis by adding shards to a stream.
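A sketch of sizing the shard count from the documented per-shard limits (1 MB/s and 1,000 records/s in, 2 MB/s out); the function name is mine:

```python
import math

def required_shards(write_mb_per_s: float, records_per_s: int,
                    read_mb_per_s: float) -> int:
    """Each shard ingests up to 1 MB/s and 1,000 records/s and serves
    up to 2 MB/s of reads; size the stream to the binding constraint."""
    return max(
        math.ceil(write_mb_per_s / 1.0),
        math.ceil(records_per_s / 1000),
        math.ceil(read_mb_per_s / 2.0),
        1,
    )
```

For example, 5 MB/s of writes, 3,000 records/s, and 8 MB/s of reads is write-bound at 5 shards.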
What are some options for storing session data?
ElastiCache (submillisecond speed)
DynamoDB (1-9 milliseconds speed)
What is RedShift Enhanced VPC Routing?
A feature that forces COPY/UNLOAD traffic between the cluster and S3 through the VPC, without touching the internet.
How do you do backups in RedShift?
There is NO Multi-AZ mode.
Take snapshots of a cluster; store in S3.
• manual (retained until you delete)
• automated: every 8 hours, every 5 GB of changed data, or on a schedule; you set the retention period
Snapshots are incremental: only the changes are saved, not the whole cluster.
** You can configure RedShift to automatically copy snapshots to another Region (for DR) **
How does RedShift Spectrum work?
You have to have a RedShift cluster already.
The data you want to analyze is in S3.
Submit your query; it fans out to thousands of Spectrum nodes, and results are returned to the RedShift compute nodes for aggregation.
Lots more processing power without having to load all the data from S3.
How does RedShift compare to Athena?
Faster queries/joins/aggregations because data is loaded into the cluster and organized with distribution/sort keys (Athena queries S3 in place).
** High performance, Analytics, BI Data warehouse ==> RedShift.
Who can use the Glue Data Catalogs?
Glue for ETL
For data discovery:
•Athena
• Redshift Spectrum
• EMR
What is Neptune?
A fully managed graph database (e.g. for social networking, Wikipedia)