Athena, OpenSearch, EMR, QuickSight Flashcards

1
Q
  • Serverless query service to analyze data stored in Amazon S3
  • Uses standard SQL language to query the files (built on Presto)
  • Supports CSV, JSON, ORC, Avro, and Parquet
  • Pricing: $5.00 per TB of data scanned
  • Commonly used with Amazon QuickSight for reporting/dashboards
A

Amazon Athena
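
As a rough sketch of the pricing bullet on this card, the cost of a query can be estimated from the number of bytes it scans. The $5.00 per TB rate comes from the card. The function name and the choice of a binary terabyte (2**40 bytes) are assumptions for illustration, and real billing details such as rounding or regional prices are not modeled.

```python
# Hypothetical helper: estimate Athena query cost from bytes scanned.
PRICE_PER_TB_USD = 5.00   # rate quoted on the card
BYTES_PER_TB = 2 ** 40    # assumption: binary terabyte


def estimate_scan_cost(bytes_scanned: int) -> float:
    """Return the approximate cost in USD for scanning `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning 2 TB of raw files versus one fifth of a TB after pruning columns.
full_scan = estimate_scan_cost(2 * BYTES_PER_TB)
pruned_scan = estimate_scan_cost(BYTES_PER_TB // 5)
print(f"full: ${full_scan:.2f}, pruned: ${pruned_scan:.2f}")
```

Because the charge follows the bytes scanned, anything that reduces the scan reduces the bill.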

2
Q

analyze data in S3 using serverless SQL

A

Athena

3
Q

Amazon Athena – Performance Improvement: use __________ for cost savings (less data scanned)

A

columnar data

4
Q

What is recommended when using columnar data for cost-savings?

A

Apache Parquet or ORC

5
Q

With Amazon Athena, use ________ to convert your data to Parquet or ORC

A

Glue

6
Q

__________ datasets in S3 for easy querying on virtual columns

A

Partition

7
Q

What are 4 performance improvements for Athena?

A

Use columnar data for cost-savings (less scan)
Compress data for smaller retrievals
Partition datasets in S3 for easy querying on virtual columns
Use larger files (> 128 MB) to minimize overhead
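
Partitioning on virtual columns usually means laying out S3 keys so that the column values appear in the path, following the Hive-style `key=value` convention. A minimal sketch, assuming a hypothetical bucket name and year/month/day partition columns:

```python
from datetime import date


def partition_prefix(table_root: str, day: date) -> str:
    """Build a Hive-style partition prefix for one day of data."""
    root = table_root.rstrip("/")
    return f"{root}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


# "example-bucket" and "logs" are placeholder names, not real resources.
prefix = partition_prefix("s3://example-bucket/logs/", date(2024, 3, 9))
print(prefix)
```

A query that filters on `year` and `month` then only needs to read objects under the matching prefixes, which is where the scan savings come from.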

8
Q
  • Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
  • Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)
  • Store the results back in Amazon S3
A

Amazon Athena – Federated Query

9
Q
  • Based on PostgreSQL
  • It’s OLAP – online analytical processing (analytics and data warehousing)
  • AWS claims up to 10x better performance than other data warehouses; scales to PBs of data
  • Columnar storage of data (instead of row based) & parallel query engine
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as Amazon QuickSight or Tableau integrate with it
  • vs Athena: faster queries / joins / aggregations thanks to sort keys and distribution keys (Redshift's counterpart to indexes)
A

Redshift

10
Q

2 types of nodes for Redshift Cluster

A

Leader Node
Compute Node

11
Q

Redshift Cluster node for query planning, results aggregation

A

Leader node

12
Q

Redshift Cluster node for performing the queries, send results to leader

A

Compute node
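
The split between leader and compute nodes can be pictured as a scatter-gather pattern. The sketch below is a toy model in plain Python, not Redshift code: the function names, the round-robin slicing, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def compute_node(rows):
    """Compute node: aggregate its own slice and return a partial result."""
    partial = defaultdict(int)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def leader_node(rows, num_compute_nodes):
    """Leader node: plan the work, hand out slices, merge the partial results."""
    # Planning step: one slice of rows per compute node (round-robin).
    slices = [rows[i::num_compute_nodes] for i in range(num_compute_nodes)]
    partials = [compute_node(s) for s in slices]
    # Aggregation step: combine what the compute nodes sent back.
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


sales = [("eu", 10), ("us", 5), ("eu", 7), ("us", 1), ("ap", 3)]
result = leader_node(sales, num_compute_nodes=2)
print(result)
```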

13
Q

For a Redshift cluster, do you need to provision the node size in advance?

A

YES

14
Q

With a Redshift cluster, can you use Reserved Instances for cost savings?

A

YES

15
Q

Redshift has “Multi-AZ” mode for ____________

A

some clusters

16
Q

Redshift Snapshots are __________ backups of a cluster, stored internally in __________

A

point-in-time
S3

17
Q

Redshift Snapshots are ____________

A

incremental
(only what has changed is saved)

18
Q

Can you restore a Redshift snapshot into a new cluster?

A

YES

19
Q

Automated: Redshift Snapshots

A

Every 8 hours, every 5 GB of data changes per node, or on a schedule. Retention can be set between 1 and 35 days.
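
The default triggers on this card can be written as a small check. This is a sketch using the thresholds stated on the card (8 hours, 5 GB, 1 to 35 days of retention); the function names are hypothetical and custom schedules are not modeled.

```python
def automated_snapshot_due(hours_since_last: float, gb_changed: float) -> bool:
    """Return True once either default trigger from the card has been reached."""
    return hours_since_last >= 8 or gb_changed >= 5


def valid_retention(days: int) -> bool:
    """Return True if the retention period falls in the 1 to 35 day range."""
    return 1 <= days <= 35


print(automated_snapshot_due(hours_since_last=3, gb_changed=6.5))
print(valid_retention(40))
```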

20
Q

Manual: Redshift Snapshots

A

snapshot is retained until you delete it

21
Q

You can configure Amazon Redshift to ___________ copy snapshots (automated or manual) of a cluster to another AWS Region

A

automatically

22
Q

3 options when Loading data into Redshift

A

Amazon Kinesis Data Firehose
S3 using COPY command
EC2 Instance JDBC driver
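
For the S3 option, the load is expressed as a COPY statement run against the cluster. The sketch below only builds the SQL text. The table name, bucket, and role ARN are placeholders, the Parquet format is an assumption, and no escaping is done, so treat it as an illustration rather than production code.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# All identifiers below are placeholders.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```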

23
Q

What is better when loading data into Redshift?

A

Large (batched) inserts rather than many small ones

24
Q
  • Query data that is already in S3 without loading it
  • Must have a Redshift cluster available to start the query
A

Redshift Spectrum

25
Q
  • Successor to Amazon Elasticsearch Service
  • In DynamoDB, queries are only possible by primary key or indexes…
  • You can search any field, even with partial matches
  • Commonly used as a complement to another database
  • Two modes: managed cluster or serverless cluster
  • Does not natively support SQL (can be enabled via a plugin)
  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
  • Security through Cognito & IAM, KMS encryption, TLS
  • Comes with OpenSearch Dashboards (visualization)
A

OpenSearch
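
The contrast between key-only lookups and searching any field can be shown with a toy example in plain Python. It does not use OpenSearch or DynamoDB: the records and function names are hypothetical, and a real search engine relies on an index rather than the linear scan used here.

```python
# Toy records; the dictionary keys play the role of a primary key.
products = {
    "p1": {"name": "Blue running shoes", "brand": "Acme"},
    "p2": {"name": "Red rain jacket", "brand": "Northwind"},
    "p3": {"name": "Trail runner socks", "brand": "Acme"},
}


def get_by_key(table, key):
    """Key-value style access: only works when the exact key is known."""
    return table.get(key)


def search_any_field(table, term):
    """Search style access: case-insensitive partial match on every field."""
    needle = term.lower()
    return [
        key
        for key, doc in table.items()
        if any(needle in str(value).lower() for value in doc.values())
    ]


key_hit = get_by_key(products, "p2")
partial_hits = search_any_field(products, "run")
brand_hits = search_any_field(products, "acme")
print(key_hit, partial_hits, brand_hits)
```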

26
Q

What does Amazon EMR stand for?

A

Elastic MapReduce

27
Q
  • Helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
  • The clusters can be made of hundreds of EC2 instances
  • Comes bundled with Apache Spark, HBase, Presto, Flink…
  • Takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot Instances
A

Amazon EMR

28
Q

What are 4 use cases for EMR?

A

data processing
machine learning
web indexing
big data

29
Q

What are 3 node types for EMR?

A

Master Node
Core Node
Task Node

30
Q

Amazon EMR – Node types: Manage the cluster, coordinate, manage health – long running

A

Master Node

31
Q

Amazon EMR – Node types: Run tasks and store data – long running

A

Core Node

32
Q

Amazon EMR – Node types: Just to run tasks – usually Spot

A

Task Node

33
Q

3 EMR Purchasing Options

A
  • On-demand: reliable, predictable, won’t be terminated
  • Reserved (min 1 year): cost savings (EMR uses them automatically if available)
  • Spot Instances: cheaper, can be terminated, less reliable
34
Q
  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Integrated with RDS, Aurora, Athena, Redshift, S3…
  • In-memory computation using SPICE engine if data is imported into QuickSight
  • Enterprise edition: possibility to set up Column-Level Security (CLS)
A

Amazon QuickSight

35
Q

4 use cases for QuickSight

A
  • Business analytics
  • Building visualizations
  • Perform ad-hoc analysis
  • Get business insights using data