Athena, OpenSearch, EMR, QuickSight Flashcards
- Serverless query service to analyze data stored in Amazon S3
- Uses standard SQL language to query the files (built on Presto)
- Supports CSV, JSON, ORC, Avro, and Parquet
- Pricing: $5.00 per TB of data scanned
- Commonly used with Amazon QuickSight for reporting/dashboards
Amazon Athena
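Because Athena bills by data scanned, a rough cost estimate is a single multiplication. The sketch below uses the $5.00 per TB figure from the card above; the function name and the choice of 1 TB = 1024^4 bytes are assumptions for illustration, not AWS's billing definition.

```python
PRICE_PER_TB_USD = 5.00      # from the card: $5.00 per TB of data scanned
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte


def estimate_scan_cost(bytes_scanned: int) -> float:
    """Return the approximate query cost in USD for the given bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning 3 TB of raw files vs. a quarter of a TB after pruning columns.
full_scan_cost = estimate_scan_cost(3 * BYTES_PER_TB)
pruned_scan_cost = estimate_scan_cost(BYTES_PER_TB // 4)
```

The comparison illustrates why scanning less data (columnar formats, compression, partitions) is the main cost lever.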
analyze data in S3 using serverless SQL
Athena
Amazon Athena – Performance Improvement: use __________ for cost savings (less scan)
columnar data
What is recommended when using columnar data for cost-savings?
Apache Parquet or ORC
With Amazon Athena use ________ to convert your data to Parquet or ORC
Glue
__________ datasets in S3 for easy querying on virtual columns
Partition
What are 4 Performance Improvements for Athena?
Use columnar data for cost-savings (less scan)
Compress data for smaller retrievals
Partition datasets in S3 for easy querying on virtual columns
Use larger files (> 128 MB) to minimize overhead
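Partitioning usually means laying objects out under key=value prefixes so that the keys can be queried as virtual columns. The sketch below builds such a prefix; the bucket name, the year/month/day scheme, and the function name are hypothetical choices for illustration.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style key=value S3 prefix for one day of data."""
    return (
        f"{base.rstrip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical bucket and dataset name.
prefix = partition_prefix("s3://example-bucket/logs/", date(2024, 1, 5))
```

A query that filters on year, month, or day then only needs to read the objects under the matching prefixes.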
- Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
- Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)
- Store the results back in Amazon S3
Amazon Athena – Federated Query
- Based on PostgreSQL
- It’s OLAP – online analytical processing (analytics and data warehousing)
- 10x better performance than other data warehouses, scale to PBs of data
- Columnar storage of data (instead of row based) & parallel query engine
- Pay as you go based on the instances provisioned
- Has a SQL interface for performing the queries
- BI tools such as Amazon QuickSight or Tableau integrate with it
- vs Athena: faster queries / joins / aggregations thanks to indexes
Redshift
2 types of nodes for Redshift Cluster
Leader Node
Compute Node
Redshift Cluster node for query planning, results aggregation
Leader node
Redshift Cluster node for performing the queries, send results to leader
Compute node
Do you need to provision the node size in advance?
YES
With a Redshift Cluster, can you use Reserved Instances for cost savings?
YES
Redshift has “Multi-AZ” mode for ____________
some clusters
Redshift Snapshots are __________ backups of a cluster, stored internally in __________
point-in-time
S3
Redshift Snapshots are ____________
incremental
(only what has changed is saved)
Can you restore a Redshift snapshot into a new cluster?
YES
Automated: Redshift Snapshots
every 8 hours, every 5 GB, or on a schedule. Set retention from 1 to 35 days
Manual: Redshift Snapshots
snapshot is retained until you delete it
You can configure Amazon Redshift to ___________ copy snapshots (automated or manual) of a cluster to another AWS Region
automatically
3 options when Loading data into Redshift
Amazon Kinesis Data Firehose
S3 using COPY command
EC2 Instance JDBC driver
What is better when loading data into Redshift?
Large inserts
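As an illustration of the S3 COPY option, the sketch below assembles a COPY statement as a string. The table name, S3 path, and IAM role ARN are placeholders, and the helper is hypothetical; it only formats text and does not connect to a cluster.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Format a Redshift COPY statement that bulk-loads Parquet files from S3.

    Values are interpolated directly, so pass only trusted input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder names for illustration only.
statement = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

One COPY over many files is a bulk load, which matches the "large inserts" advice better than many single-row INSERT statements.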
- Query data that is already in S3 without loading it
- Must have a Redshift cluster available to start the query
Redshift Spectrum
- Is the successor to Amazon Elasticsearch
- In DynamoDB, queries only exist by primary key or indexes…
- With OpenSearch, you can search any field, even partial matches
- It’s common to use as a complement to another database
- Two modes: managed cluster or serverless cluster
- Does not natively support SQL (can be enabled via a plugin)
- Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
- Security through Cognito & IAM, KMS encryption, TLS
- Comes with Dashboards (visualization)
OpenSearch
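A partial match on an arbitrary field can be expressed in the query DSL as a wildcard query. The sketch below only builds the request body as a Python dict; the field name and search fragment are hypothetical, and sending it to a cluster is left out.

```python
from typing import Any, Dict


def partial_match_query(field: str, fragment: str) -> Dict[str, Any]:
    """Build a wildcard query body matching values that contain `fragment`."""
    return {"query": {"wildcard": {field: {"value": f"*{fragment}*"}}}}


# Hypothetical field and fragment.
body = partial_match_query("product_name", "phone")
```

This is the kind of lookup that DynamoDB alone cannot serve, which is why OpenSearch is often paired with it.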
What does Amazon EMR stand for?
Elastic MapReduce
- Helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
- The clusters can be made of hundreds of EC2 instances
- comes bundled with Apache Spark, HBase, Presto, Flink…
- takes care of all the provisioning and configuration
- Auto-scaling and integrated with Spot instances
Amazon EMR
What are 4 use cases for EMR?
data processing
machine learning
web indexing
big data
What are 3 node types for EMR?
Master Node
Core Node
Task Node
Amazon EMR – Node types: Manage the cluster, coordinate, manage health – long running
Master Node
Amazon EMR – Node types: Run tasks and store data – long running
Core Node
Amazon EMR – Node types: Just to run tasks – usually Spot
Task Node
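The three node types can be sketched as the instance-group list that boto3's EMR run_job_flow call accepts under Instances.InstanceGroups. The helper below only builds the list; the function name, group names, default instance type, and counts are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_instance_groups(
    core_count: int, task_count: int, instance_type: str = "m5.xlarge"
) -> List[Dict[str, Any]]:
    """Return one group per EMR node type, with task nodes on Spot."""
    def group(name: str, role: str, market: str, count: int) -> Dict[str, Any]:
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        group("master", "MASTER", "ON_DEMAND", 1),       # manages the cluster
        group("core", "CORE", "ON_DEMAND", core_count),  # runs tasks, stores data
        group("task", "TASK", "SPOT", task_count),       # runs tasks only
    ]


groups = build_instance_groups(core_count=2, task_count=4)
```

Keeping master and core nodes off Spot reflects that they are long running, while task nodes hold no data and tolerate interruption.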
3 EMR Purchasing Options
- On-demand: reliable, predictable, won’t be terminated
- Reserved (min 1 year): cost savings (EMR will automatically use if available)
- Spot Instances: cheaper, can be terminated, less reliable
- Serverless machine learning-powered business intelligence service to create interactive dashboards
- Fast, automatically scalable, embeddable, with per-session pricing
- Integrated with RDS, Aurora, Athena, Redshift, S3…
- In-memory computation using SPICE engine if data is imported into QuickSight
- Enterprise edition: Possibility to set up Column-Level Security (CLS)
Amazon QuickSight
4 use cases for QuickSight
- Business analytics
- Building visualizations
- Perform ad-hoc analysis
- Get business insights using data