Athena, OpenSearch, EMR, QuickSight Flashcards by David Dugas

Serverless query service to analyze data stored in Amazon S3
Uses standard SQL language to query the files (built on Presto)
Supports CSV, JSON, ORC, Avro, and Parquet
Pricing: $5.00 per TB of data scanned
Commonly used with Amazon QuickSight for reporting/dashboards

Amazon Athena
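
Because pricing is per TB scanned, the cost of a query can be estimated from the bytes it reads. The sketch below is only that arithmetic. It assumes a decimal terabyte (10^12 bytes) and ignores any minimum charge or rounding that AWS applies; the function name is made up for illustration.

```python
# Assumption: 1 TB = 10**12 bytes; check the AWS pricing page for the exact unit.
TB = 10 ** 12
PRICE_PER_TB_USD = 5.00  # price quoted on the card above


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scans."""
    return bytes_scanned / TB * PRICE_PER_TB_USD


full_scan = athena_query_cost(2 * TB)          # query reads a whole 2 TB dataset
pruned_scan = athena_query_cost(2 * TB // 10)  # query reads one tenth of it
print(f"full: ${full_scan:.2f}, pruned: ${pruned_scan:.2f}")
```

Anything that reduces the bytes scanned reduces the bill in the same proportion.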

analyze data in S3 using serverless SQL

Athena

Amazon Athena – Performance Improvement, Use __________ for cost-savings (less scan)

columnar data

What is recommended when using columnar data for cost-savings?

Apache Parquet or ORC

With Amazon Athena use ________ to convert your data to Parquet or ORC

Glue

__________ datasets in S3 for easy querying on virtual columns

Partition
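
One common way to partition is a Hive-style key layout, where each `name=value` folder in the S3 key acts as a virtual column. A minimal sketch, assuming a hypothetical `logs` table prefix and a year/month/day scheme (neither comes from the cards):

```python
from datetime import date


def partition_key(table_prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style S3 object key; each name=value folder is a partition column."""
    return (
        f"{table_prefix}/year={day.year}/month={day.month:02d}"
        f"/day={day.day:02d}/{filename}"
    )


# Hypothetical table prefix and file name, for illustration only.
key = partition_key("logs", date(2024, 1, 15), "part-0000.parquet")
print(key)  # logs/year=2024/month=01/day=15/part-0000.parquet
```

A query that filters on `year`, `month`, or `day` then only needs to read the objects under the matching prefixes.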

What are 4 performance improvements for Athena?

Use columnar data for cost-savings (less scan)
Compress data for smaller retrievals
Partition datasets in S3 for easy querying on virtual columns
Use larger files (> 128 MB) to minimize overhead

Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)
Store the results back in Amazon S3

Amazon Athena – Federated Query

Based on PostgreSQL
It’s OLAP – online analytical processing (analytics and data warehousing)
10x better performance than other data warehouses, scale to PBs of data
Columnar storage of data (instead of row based) & parallel query engine
Pay as you go based on the instances provisioned
Has a SQL interface for performing the queries
BI tools such as Amazon QuickSight or Tableau integrate with it
vs Athena: faster queries / joins / aggregations thanks to indexes

Redshift

2 types of nodes for Redshift Cluster

Leader Node
Compute Node

Redshift Cluster node for query planning, results aggregation

Leader node

Redshift Cluster node for performing the queries, send results to leader

Compute node

With a Redshift cluster, do you need to provision the node size in advance?

YES

With a Redshift cluster, can you use Reserved Instances for cost savings?

YES

Redshift has “Multi-AZ” mode for ____________

some clusters

Redshift Snapshots are __________ backups of a cluster, stored internally in __________

point-in-time
S3

Redshift Snapshots are ____________

incremental
(only what has changed is saved)

Can you restore a Redshift snapshot into a new cluster?

YES

Automated: Redshift Snapshots

Every 8 hours, after every 5 GB of changed data, or on a schedule. Set retention between 1 and 35 days

Manual: Redshift Snapshots

snapshot is retained until you delete it

You can configure Amazon Redshift to ___________ copy snapshots (automated or manual) of a cluster to another AWS Region

automatically

3 options when Loading data into Redshift

Amazon Kinesis Data Firehose
S3 using COPY command
EC2 Instance JDBC driver
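
For the S3 option, the load is a single SQL `COPY` statement run against the cluster. The sketch below only builds that statement as a string. The table name, bucket, and IAM role ARN are hypothetical placeholders, and Parquet input is an assumption.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# All three arguments are made-up values for illustration.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The statement would then be executed through any SQL client connected to the cluster.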

What is better when loading data into Redshift?

Large inserts (write data in batches rather than many small inserts)
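
One way to make an insert "large" over a JDBC-style connection is to send many rows in one multi-row `INSERT` instead of one statement per row. A minimal sketch: the `sales` table and its columns are hypothetical, and the `%s` placeholders assume a Python DB-API driver that uses the "format" parameter style.

```python
from typing import Any, List, Sequence, Tuple


def build_batch_insert(
    table: str, columns: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Tuple[str, List[Any]]:
    """Build one multi-row INSERT plus a flat parameter list for all rows."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    # Flatten the rows so parameters line up with the placeholders in order.
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table, columns, and data.
sql, params = build_batch_insert(
    "sales", ["id", "amount"], [(1, 9.5), (2, 3.0), (3, 7.25)]
)
print(sql)
print(params)
```

Values travel as parameters rather than being formatted into the SQL text, so only the table and column names need to come from trusted code.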

Query data that is already in S3 without loading it
Must have a Redshift cluster available to start the query

Redshift Spectrum

* Successor to Amazon Elasticsearch
* In DynamoDB, queries only work by primary key or indexes...
* With OpenSearch, you can search any field, even on partial matches
* Commonly used as a complement to another database
* Two modes: managed cluster or serverless cluster
* Does not natively support SQL (can be enabled via a plugin)
* Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
* Security through Cognito & IAM, KMS encryption, TLS
* Comes with OpenSearch Dashboards (visualization)

OpenSearch
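
A partial match on an arbitrary field can be expressed in the OpenSearch query DSL as a `wildcard` query. The sketch below only builds the JSON request body. The field name `title` and the search fragment are hypothetical, and sending the body to a cluster's `_search` endpoint is left out.

```python
import json


def partial_match_query(field: str, fragment: str) -> dict:
    """Build a query-DSL body matching documents whose field contains the fragment."""
    return {"query": {"wildcard": {field: {"value": f"*{fragment}*"}}}}


# Hypothetical field and fragment, for illustration only.
body = partial_match_query("title", "shift")
print(json.dumps(body, indent=2))
```

Leading wildcards can be slow on large indexes, so this is a sketch of the idea rather than a tuned query.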

What does Amazon EMR stand for?

Elastic MapReduce

* Helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
* Clusters can be made of hundreds of EC2 instances
* Comes bundled with Apache Spark, HBase, Presto, Flink...
* Takes care of all the provisioning and configuration
* Auto-scaling and integrated with Spot Instances

Amazon EMR

What are 4 use cases for EMR?

Data processing
Machine learning
Web indexing
Big data

What are 3 node types for EMR?

Master Node
Core Node
Task Node

Amazon EMR – Node types: Manage the cluster, coordinate, manage health – long running

Master Node

Amazon EMR – Node types: Run tasks and store data – long running

Core Node

Amazon EMR – Node types: Just to run tasks – usually Spot

Task Node

3 EMR Purchasing Options

* On-demand: reliable, predictable, won’t be terminated
* Reserved (min 1 year): cost savings (EMR will automatically use if available)
* Spot Instances: cheaper, can be terminated, less reliable
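
The node types and purchasing options combine in a cluster definition: long-running master and core nodes on On-Demand, task nodes on Spot. The sketch below builds the instance-group list in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The instance type, counts, and group names are assumptions, and no AWS call is made.

```python
def instance_group(role: str, market: str, instance_type: str, count: int) -> dict:
    """Describe one EMR instance group (role: MASTER, CORE or TASK)."""
    return {
        "Name": f"{role.lower()}-nodes",  # hypothetical naming scheme
        "InstanceRole": role,
        "Market": market,                 # "ON_DEMAND" or "SPOT"
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


# Instance type and counts are illustrative assumptions.
instance_groups = [
    instance_group("MASTER", "ON_DEMAND", "m5.xlarge", 1),  # manages the cluster
    instance_group("CORE", "ON_DEMAND", "m5.xlarge", 2),    # runs tasks, stores data
    instance_group("TASK", "SPOT", "m5.xlarge", 4),         # runs tasks only
]

for group in instance_groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Losing a Spot task node costs some compute but no stored data, which is why Spot is usually reserved for that role.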

* Serverless machine learning-powered business intelligence service to create interactive dashboards
* Fast, automatically scalable, embeddable, with per-session pricing
* Integrated with RDS, Aurora, Athena, Redshift, S3...
* In-memory computation using SPICE engine if data is imported into QuickSight
* Enterprise edition: possibility to set up Column-Level Security (CLS)

Amazon QuickSight

4 use cases for QuickSight

* Business analytics
* Building visualizations
* Perform ad-hoc analysis
* Get business insights using data

(37 cards)