AWS Analytics Flashcards
Amazon Athena
- Exam tip: to analyze data in S3 using serverless SQL, use Athena
- Use cases: business intelligence / analytics / reporting; analyzing and
querying VPC Flow Logs, ELB logs, CloudTrail trails, etc.
- Serverless query service to analyze data stored in Amazon S3
- Athena Federated Query: run SQL queries across data stored in relational,
non-relational, object, and custom data sources (AWS or on-premises)
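As a minimal sketch of how a query might be submitted to Athena from Python, the helper below only builds the keyword arguments for boto3's `start_query_execution` call. The database name, table name, and results bucket are hypothetical placeholders, not values from these notes.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names: 'logs_db', 'vpc_flow_logs' and the bucket are placeholders.
request = build_athena_request(
    sql="SELECT srcaddr, COUNT(*) AS hits FROM vpc_flow_logs GROUP BY srcaddr LIMIT 10",
    database="logs_db",
    output_location="s3://example-athena-results/queries/",
)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

Keeping the request builder separate from the client call makes it possible to check the query parameters without touching AWS.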
Redshift
- vs Athena: faster queries / joins / aggregations thanks to sort keys and columnar storage (Redshift has no conventional indexes)
- Redshift is based on PostgreSQL, but it’s not used for OLTP
- It’s OLAP – online analytical processing (analytics and data warehousing)
- AWS claims up to 10x better performance than other data warehouses; scales to PBs of data
Redshift Cluster
- Leader node: for query planning, results aggregation
- Compute node: for performing the queries, send results to leader
- You provision the node size in advance
- You can use Reserved Instances for cost savings
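The leader/compute split above can be pictured with a small pure-Python model. This is only a conceptual sketch, not Redshift's actual execution engine: each "compute node" returns a partial result for its slice of the data, and the "leader" merges the partials. The function names and the `amount` column are made up for illustration.

```python
def compute_node_partial(rows):
    """A compute node scans its own slice and returns a partial (sum, count)."""
    amounts = [row["amount"] for row in rows]
    return sum(amounts), len(amounts)


def leader_average(slices):
    """The leader fans the work out, then merges partial results into one answer."""
    partials = [compute_node_partial(rows) for rows in slices]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


# Three compute nodes, each holding a slice of a hypothetical 'sales' table.
slices = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 30}],
    [],
]
average = leader_average(slices)  # 20.0
```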
Redshift – Snapshots & DR
- Redshift has “Multi-AZ” mode for some clusters
- Snapshots are point-in-time backups of a cluster, stored internally in S3
- Snapshots are incremental (only what has changed is saved)
- You can restore a snapshot into a new cluster
- Automated: every 8 hours, every 5 GB, or on a schedule. Set retention between 1 and 35 days
- Manual: snapshot is retained until you delete it
- You can configure Amazon Redshift to
automatically copy snapshots (automated or
manual) of a cluster to another AWS Region
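To make "incremental" concrete, here is a toy model in which a snapshot stores only the blocks that changed since the previous state, and a restore replays snapshots oldest first. It is an illustration of the idea only; the block IDs are hypothetical, deletions are ignored, and nothing here reflects how Redshift actually lays out snapshot data in S3.

```python
def take_incremental_snapshot(blocks, previous_state):
    """Return only the blocks that differ from the previously saved state."""
    return {
        block_id: data
        for block_id, data in blocks.items()
        if previous_state.get(block_id) != data
    }


def restore(snapshots):
    """Rebuild the full state by replaying snapshots from oldest to newest."""
    state = {}
    for snapshot in snapshots:
        state.update(snapshot)
    return state


day1 = {"b1": "aaa", "b2": "bbb"}
snap1 = take_incremental_snapshot(day1, {})  # first snapshot holds everything

day2 = {"b1": "aaa", "b2": "BBB", "b3": "ccc"}
snap2 = take_incremental_snapshot(day2, restore([snap1]))  # only b2 and b3

restored = restore([snap1, snap2])  # equals day2, as if restored into a new cluster
```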
Redshift Spectrum
- Query data that is already in S3 without loading it
- Must have a Redshift cluster available to start the query
- The query is then submitted to thousands of Redshift Spectrum nodes
Amazon OpenSearch Service
- DynamoDB queries work only by primary key or indexes
- OpenSearch can search on any field, even with partial matches
- Commonly used as a complement to another database
- Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
- Security through Cognito & IAM, KMS encryption, TLS
- Comes with OpenSearch Dashboards (visualization)
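The difference between a key lookup and a search on any field can be sketched in plain Python. The dictionary stands in for a key-value table and the scan stands in for a search index; neither uses the real DynamoDB or OpenSearch APIs, and the sample items are invented.

```python
items = [
    {"id": "p1", "name": "Wireless Mouse", "description": "Ergonomic 2.4 GHz mouse"},
    {"id": "p2", "name": "Mechanical Keyboard", "description": "Tactile switches"},
    {"id": "p3", "name": "Mouse Pad", "description": "Non-slip surface"},
]

# Key-value style access: fast, but only when the primary key is known.
by_primary_key = {item["id"]: item for item in items}


def get_item(item_id):
    """Look up one item by primary key (the only access path in this model)."""
    return by_primary_key.get(item_id)


def search_any_field(term):
    """Return the ids of items where any field contains the term (partial match)."""
    term = term.lower()
    return [
        item["id"]
        for item in items
        if any(term in str(value).lower() for value in item.values())
    ]


matching_ids = search_any_field("mous")  # ['p1', 'p3']
full_items = [get_item(item_id) for item_id in matching_ids]
```

The last two lines mirror the "complement to a database" pattern: the search side returns matching ids, and the full records are then fetched from the main database by key.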
Amazon EMR
- Use cases: data processing, machine learning, web indexing, big data…
- EMR stands for “Elastic MapReduce”
- EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
- The clusters can be made of hundreds of EC2 instances
- EMR comes bundled with Apache Spark, HBase, Presto, Flink…
- EMR takes care of all the provisioning and configuration
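Since the "MapReduce" in the name refers to a programming model, a single-process word count may help fix the idea. This is only a sketch of the map, shuffle, and reduce phases in plain Python; on EMR the same phases would run in parallel across cluster nodes through Hadoop or Spark, which this example does not use.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data on EMR", "EMR runs Spark", "big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))  # e.g. counts['big'] == 2, counts['emr'] == 2
```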
Amazon QuickSight
- Creates interactive dashboards
- Fast, automatically scalable, embeddable, with per-session pricing
- Define Users (Standard edition) and Groups (Enterprise edition)
- These users and groups exist only within QuickSight, not in IAM
- A dashboard…
- is a read-only snapshot of an analysis that you can share
- preserves the configuration of the analysis (filtering, parameters, controls, sort)
- You can share the analysis or the dashboard with Users or Groups
- To share a dashboard, you must first publish it
- Users who see the dashboard can also see the underlying data
AWS Glue
- Managed extract, transform, and load (ETL) service
- Glue Job Bookmarks: prevent re-processing old data
- Glue Elastic Views:
- Combine and replicate data across multiple data stores using SQL
- No custom code, Glue monitors for changes in the source data, serverless
- Leverages a “virtual table” (materialized view)
- Glue DataBrew: clean and normalize data using pre-built transformations
- Glue Studio: new GUI to create, run and monitor ETL jobs in Glue
- Glue Streaming ETL (built on Apache Spark Structured Streaming):
compatible with Kinesis Data Streams, Kafka, MSK (managed Kafka)
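The job bookmark idea from the list above (remember what was already processed so a rerun skips it) can be sketched as follows. The set-based bookmark and the file names are hypothetical; real Glue bookmarks are managed by the service and are not exposed as a Python set.

```python
def run_job(available_files, bookmark):
    """Process only files not yet recorded in the bookmark; return what was processed."""
    new_files = sorted(name for name in available_files if name not in bookmark)
    for name in new_files:
        bookmark.add(name)  # record progress so later runs skip this file
    return new_files


bookmark = set()
first_run = run_job({"2024-01-01.csv", "2024-01-02.csv"}, bookmark)
second_run = run_job(
    {"2024-01-01.csv", "2024-01-02.csv", "2024-01-03.csv"}, bookmark
)  # ['2024-01-03.csv']
```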
AWS Lake Formation
- Fully managed service that makes it easy to set up a data lake in days
- Combine structured and unstructured data in the data lake
- Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB…
- Fine-grained Access Control for your applications (row and column-level)
- Built on top of AWS Glue
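Row-level and column-level access control can be illustrated with a small filter function. This is a conceptual sketch only: Lake Formation enforces such permissions inside the service, and the function, the `customers` data, and the "analyst" rule below are all invented for the example.

```python
def apply_permissions(rows, allowed_columns, row_filter):
    """Keep only permitted rows, then strip every column the caller may not see."""
    return [
        {column: row[column] for column in allowed_columns if column in row}
        for row in rows
        if row_filter(row)
    ]


customers = [
    {"name": "Ana", "country": "BR", "ssn": "111"},
    {"name": "Bob", "country": "US", "ssn": "222"},
]

# Hypothetical analyst grant: US rows only, and no access to the 'ssn' column.
analyst_view = apply_permissions(
    customers,
    allowed_columns=["name", "country"],
    row_filter=lambda row: row["country"] == "US",
)  # [{'name': 'Bob', 'country': 'US'}]
```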
Kinesis Data Analytics (SQL application)
- Real-time analytics on Kinesis Data Streams & Firehose using SQL
- Add reference data from Amazon S3 to enrich streaming data
- Fully managed, no servers to provision
- Automatic scaling
- Pay for actual consumption rate
- Output:
- Kinesis Data Streams: create streams out of the real-time analytics queries
- Kinesis Data Firehose: send analytics query results to destinations
- Use cases:
- Time-series analytics
- Real-time dashboards
- Real-time metrics
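A typical real-time metric is a count per fixed time window. The sketch below computes tumbling-window counts in plain Python as a stand-in for what a windowed streaming SQL query would produce; it does not use the Kinesis Data Analytics API, and the event tuples are invented sample data.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp in seconds, event type)
events = [(0, "login"), (15, "login"), (59, "click"), (60, "login"), (125, "click")]
per_minute = tumbling_window_counts(events, 60)
# {(0, 'login'): 2, (0, 'click'): 1, (60, 'login'): 1, (120, 'click'): 1}
```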
Kinesis Data Analytics for Apache Flink
- Run any Apache Flink application on a managed cluster on AWS
- Provisioning of compute resources, parallel computation, automatic scaling
- Application backups (implemented as checkpoints and snapshots)
- Use any Apache Flink programming features
- Flink does not read from Firehose (use Kinesis Data Analytics for SQL instead)
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Alternative to Amazon Kinesis
- Fully managed Apache Kafka on AWS
- Allows you to create, update, delete clusters
- MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
- Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
- Automatic recovery from common Apache Kafka failures
- Data is stored on EBS volumes for as long as you want
- MSK Serverless
- Run Apache Kafka on MSK without managing the capacity
- MSK automatically provisions resources and scales compute & storage
- Consumers: Glue, Lambda, EC2, EKS, ECS, Kinesis Data Analytics for Apache Flink
Kinesis Data Streams vs. Amazon MSK
- Kinesis has a 1 MB message size limit
- Kafka allows larger messages (1 MB by default, configurable higher)
- Kinesis has data streams with shards, which can be split and merged
- Kafka has topics with partitions; you can add partitions to a topic
- Both support TLS in-flight encryption; Kafka also allows plaintext
- Both support KMS at-rest encryption
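Shard splitting and merging is easier to picture with hash key ranges. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose range contains that value. The sketch below models shards as `(start, end)` tuples; splitting at the midpoint is an assumption made for simplicity, and none of this calls the Kinesis API.

```python
import hashlib

MAX_HASH = 2**128 - 1  # upper bound of the 128-bit hash key space


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def split_shard(shard):
    """Split one shard's hash range into two adjacent halves (midpoint split)."""
    start, end = shard
    middle = (start + end) // 2
    return (start, middle), (middle + 1, end)


def merge_shards(left, right):
    """Merge two adjacent shards back into a single range."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent shards can be merged")
    return (left[0], right[1])


def find_shard(shards, partition_key):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(shards):
        if start <= value <= end:
            return index
    raise LookupError("no shard covers this hash key")


low, high = split_shard((0, MAX_HASH))
target = find_shard([low, high], "device-42")  # hypothetical partition key
```

A Kafka topic, by contrast, only grows: partitions can be added, but there is no equivalent of merging two partitions back together.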
Big Data Ingestion Pipeline
- IoT Core allows you to harvest data from IoT devices
- Kinesis is great for real-time data collection
- Firehose helps with data delivery to S3 in near real-time (1 minute)
- Lambda can help Firehose with data transformations
- Amazon S3 can trigger notifications to SQS
- Lambda can subscribe to SQS (we could have connected S3 to Lambda directly)
- Athena is a serverless SQL service and results are stored in S3
- The reporting bucket contains analyzed data and can be used by
reporting tools such as Amazon QuickSight, Redshift, etc.
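The Lambda transformation step in this pipeline could look roughly like the handler below. Firehose passes base64-encoded records and expects each one back with the same `recordId`, a `result` status, and re-encoded `data`. The payload fields (`temperature_c`, `temperature_f`) are hypothetical, and the sketch assumes every record is valid JSON.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Firehose record, enrich it, and return it re-encoded."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment: add a Fahrenheit reading to each IoT message.
        payload["temperature_f"] = payload["temperature_c"] * 9 / 5 + 32
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            }
        )
    return {"records": output}


# Local usage example with a hand-built event.
sample = base64.b64encode(json.dumps({"temperature_c": 100}).encode("utf-8"))
event = {"records": [{"recordId": "1", "data": sample.decode("utf-8")}]}
response = lambda_handler(event, None)
```

Appending a newline to each record keeps the objects that Firehose writes to S3 line-delimited, which is convenient when Athena later queries them.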