AWS Analytics Flashcards

1
Q

Amazon Athena

A
  • Serverless query service to analyze data stored in Amazon S3
  • Use cases: business intelligence / analytics / reporting; analyze &
    query VPC Flow Logs, ELB Logs, CloudTrail trails, etc…
  • Amazon Athena – Federated Query: allows you to run SQL queries across
    data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
  • Exam Tip: to analyze data in S3 using serverless SQL, use Athena
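
A minimal sketch of what submitting a query to Athena might look like. It only builds the keyword arguments for boto3's `start_query_execution` call; the table `vpc_flow_logs`, the database `logs_db`, and the results bucket are hypothetical names, not from the card.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table, database, and bucket names (assumptions for illustration).
request = build_athena_request(
    sql=(
        "SELECT srcaddr, COUNT(*) AS rejected "
        "FROM vpc_flow_logs WHERE action = 'REJECT' "
        "GROUP BY srcaddr ORDER BY rejected DESC LIMIT 10"
    ),
    database="logs_db",
    output_location="s3://example-athena-results/",
)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# response = boto3.client("athena").start_query_execution(**request)
# query_id = response["QueryExecutionId"]
```

The actual submission is left commented out so the sketch runs without an AWS account.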
2
Q

Redshift

A
  • vs Athena: faster queries / joins / aggregations thanks to sort keys and distribution keys (Redshift's counterpart to indexes)
  • Redshift is based on PostgreSQL, but it’s not used for OLTP
  • It’s OLAP – online analytical processing (analytics and data warehousing)
  • AWS claims up to 10x better performance than other data warehouses; scales to PBs of data
3
Q

Redshift Cluster

A
  • Leader node: for query planning, results aggregation
  • Compute node: for performing the queries, send results to leader
  • You provision the node size in advance
  • You can use Reserved Instances for cost savings
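
The leader/compute split can be pictured as a scatter-gather aggregation. This is a conceptual sketch in plain Python, not Redshift code; the sales rows and the function names are made up for illustration.

```python
from collections import Counter


def compute_node_partial(rows):
    """Each compute node aggregates only its own slice of the data."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def leader_aggregate(partials):
    """The leader node merges the partial results into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical sales rows distributed over two compute nodes.
node_slices = [
    [("eu", 10), ("us", 5)],
    [("eu", 7), ("us", 3), ("ap", 4)],
]
partials = [compute_node_partial(rows) for rows in node_slices]
result = leader_aggregate(partials)
```

Here `result` holds the per-region totals across both nodes, which is the role the card assigns to the leader node.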
4
Q

Redshift – Snapshots & DR

A
  • Redshift has “Multi-AZ” mode for some clusters
  • Snapshots are point-in-time backups of a cluster, stored internally in S3
  • Snapshots are incremental (only what has
    changed is saved)
  • You can restore a snapshot into a new cluster
  • Automated: every 8 hours, every 5 GB of data changes, or on a
    schedule. Set retention between 1 and 35 days
  • Manual: snapshot is retained until you delete it
  • You can configure Amazon Redshift to
    automatically copy snapshots (automated or
    manual) of a cluster to another AWS Region
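
To make "incremental" concrete, here is a toy sketch of the idea: each snapshot stores only the blocks that changed, and a restore replays snapshots in order. This is an assumption-laden illustration, not how Redshift implements snapshots internally, and it ignores deleted blocks.

```python
def take_incremental_snapshot(current_blocks, previous_state):
    """Return only the blocks that changed since the previous snapshot."""
    return {
        block_id: data
        for block_id, data in current_blocks.items()
        if previous_state.get(block_id) != data
    }


def restore(snapshots):
    """Rebuild the full state by replaying snapshots from oldest to newest."""
    state = {}
    for snapshot in snapshots:
        state.update(snapshot)
    return state


# Hypothetical block contents on two consecutive days.
day1 = {"b1": "alpha", "b2": "beta"}
snap1 = take_incremental_snapshot(day1, {})
day2 = {"b1": "alpha", "b2": "beta-v2", "b3": "gamma"}
snap2 = take_incremental_snapshot(day2, day1)
restored = restore([snap1, snap2])
```

The second snapshot contains only the changed block `b2` and the new block `b3`, yet replaying both snapshots reproduces the full day-2 state.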
5
Q

Redshift Spectrum

A
  • Query data that is already in S3 without loading it
  • Must have a Redshift cluster available to start the query
  • The query is then submitted to thousands of Redshift Spectrum nodes
6
Q

Amazon OpenSearch Service

A
  • DynamoDB can only be queried by primary key or indexes; OpenSearch can search on any field, even with partial matches
  • Commonly used as a complement to another database
  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
  • Security through Cognito & IAM, KMS encryption, TLS
  • Comes with OpenSearch Dashboards (visualization)
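
The difference between key-based lookup and search on any field can be sketched in plain Python. This is a conceptual illustration only; it uses neither the DynamoDB nor the OpenSearch API, and the product data is invented.

```python
# Hypothetical product catalog keyed by primary key.
items = {
    "p1": {"name": "Wireless Keyboard", "brand": "Acme"},
    "p2": {"name": "Wired Mouse", "brand": "Globex"},
    "p3": {"name": "Keyboard Cover", "brand": "Acme"},
}


def get_by_key(table, key):
    """Key-based access: only works if you already know the primary key."""
    return table.get(key)


def search_any_field(table, text):
    """Search-style access: match a partial string in any field."""
    needle = text.lower()
    return sorted(
        key
        for key, item in table.items()
        if any(needle in str(value).lower() for value in item.values())
    )


by_key = get_by_key(items, "p2")
partial_name = search_any_field(items, "keyb")
by_brand = search_any_field(items, "acme")
```

A real search engine uses an inverted index rather than a linear scan; the scan here only shows the kind of query that key-based access cannot answer.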
7
Q

Amazon EMR

A
  • Use cases: data processing, machine learning, web indexing, big data…
  • EMR stands for “Elastic MapReduce”
  • EMR helps create Hadoop clusters (Big Data) to analyze and process vast
    amounts of data
  • The clusters can be made of hundreds of EC2 instances
  • EMR comes bundled with Apache Spark, HBase, Presto, Flink…
  • EMR takes care of all the provisioning and configuration
8
Q

Amazon QuickSight

A
  • Serverless business intelligence service for building interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Define Users (Standard edition) and Groups (Enterprise edition)
  • These users & groups only exist within QuickSight, not IAM!
  • A dashboard:
    • is a read-only snapshot of an analysis that you can share
    • preserves the configuration of the analysis (filtering, parameters, controls, sort)
  • You can share the analysis or the dashboard with Users or Groups
  • To share a dashboard, you must first publish it
  • Users who see the dashboard can also see the underlying data
9
Q

AWS Glue

A
  • Managed extract, transform, and load (ETL) service
  • Glue Job Bookmarks: prevent re-processing old data
  • Glue Elastic Views:
    • Combine and replicate data across multiple data stores using SQL
    • No custom code; Glue monitors for changes in the source data; serverless
    • Leverages a “virtual table” (materialized view)
  • Glue DataBrew: clean and normalize data using pre-built transformations
  • Glue Studio: new GUI to create, run, and monitor ETL jobs in Glue
  • Glue Streaming ETL (built on Apache Spark Structured Streaming):
    compatible with Kinesis Data Streams, Kafka, MSK (managed Kafka)
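
The idea behind Job Bookmarks (remember what was already processed so a rerun skips it) can be sketched as follows. This is a toy model under stated assumptions, not the Glue API: the file names are hypothetical and the bookmark is just an in-memory set.

```python
def run_job(available_files, bookmark):
    """Process only files not yet recorded in the bookmark, then update it."""
    new_files = [name for name in available_files if name not in bookmark]
    bookmark.update(new_files)  # persist progress for the next run
    return new_files


# Hypothetical input files seen on two consecutive job runs.
bookmark = set()
first_run = run_job(["2024-01-01.csv", "2024-01-02.csv"], bookmark)
second_run = run_job(
    ["2024-01-01.csv", "2024-01-02.csv", "2024-01-03.csv"], bookmark
)
```

The second run processes only the newly arrived file, which is the behavior the card describes as preventing re-processing of old data.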
10
Q

AWS Lake Formation

A
  • Fully managed service that makes it easy to set up a data lake in days
  • Combine structured and unstructured data in the data lake
  • Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB…
  • Fine-grained Access Control for your applications (row and column-level)
  • Built on top of AWS Glue
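
Row- and column-level access control can be pictured as a filter applied before data reaches the consumer. The sketch below is conceptual and does not use the Lake Formation API; the customer table and the analyst policy are hypothetical.

```python
def apply_access_filter(rows, allowed_columns, row_predicate):
    """Return only permitted rows, each reduced to the permitted columns."""
    return [
        {column: row[column] for column in allowed_columns}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical customer table.
customers = [
    {"id": 1, "country": "DE", "email": "a@example.com"},
    {"id": 2, "country": "US", "email": "b@example.com"},
]

# Hypothetical policy: analysts see only German rows and no email column.
analyst_view = apply_access_filter(
    customers,
    allowed_columns=["id", "country"],
    row_predicate=lambda row: row["country"] == "DE",
)
```

The analyst ends up with one row and two columns, even though the underlying table holds more of both.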
11
Q

Kinesis Data Analytics (SQL application)

A
  • Real-time analytics on Kinesis Data Streams & Firehose using SQL
  • Add reference data from Amazon S3 to enrich streaming data
  • Fully managed, no servers to provision
  • Automatic scaling
  • Pay for actual consumption rate
  • Output:
    • Kinesis Data Streams: create streams out of the real-time analytics queries
    • Kinesis Data Firehose: send analytics query results to destinations
  • Use cases:
    • Time-series analytics
    • Real-time dashboards
    • Real-time metrics
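
A real-time metric such as "clicks per page per minute" is typically computed over fixed time windows. The sketch below shows a tumbling-window count in plain Python as an illustration of the concept; it is not Kinesis Data Analytics SQL, and the click events are invented.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (epoch seconds, page) click events.
events = [(0, "home"), (12, "home"), (59, "cart"), (60, "home"), (61, "cart")]
per_minute = tumbling_window_counts(events, window_seconds=60)
```

Each event falls into exactly one window, so the counts for the window starting at 0 and the window starting at 60 are independent of each other.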
12
Q

Kinesis Data Analytics for Apache Flink

A

Run any Apache Flink application on a managed cluster on AWS
* Automatic provisioning of compute resources, parallel computation, automatic scaling
* application backups (implemented as checkpoints and snapshots)
* Use any Apache Flink programming features
* Flink does not read from Firehose (use Kinesis Analytics for SQL instead)

13
Q

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

A

Alternative to Amazon Kinesis
* Fully managed Apache Kafka on AWS
* Allows you to create, update, and delete clusters
* MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
* Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
* Automatic recovery from common Apache Kafka failures
* Data is stored on EBS volumes for as long as you want
* MSK Serverless
  * Run Apache Kafka on MSK without managing the capacity
  * MSK automatically provisions resources and scales compute & storage

Consumers: Kinesis Data Analytics for Apache Flink, Glue, Lambda, and applications running on EC2, ECS, or EKS

14
Q

Kinesis Data Streams vs. Amazon MSK

A
  • Message size: Kinesis has a 1 MB limit; Kafka can be configured for larger messages
  • Kinesis: data streams with shards, which can be split and merged
  • Kafka: topics with partitions; you can add partitions to a topic
  • Kinesis & Kafka: TLS in-flight encryption (Kafka also allows plaintext)
  • Kinesis & Kafka: KMS at-rest encryption
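
Shard splitting and merging can be sketched as operations on hash key ranges. Kinesis hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose range contains it; the helper functions and the key `device-42` below are hypothetical and this is not the Kinesis API.

```python
import hashlib

MAX_HASH = 2**128 - 1  # MD5 yields a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def split_shard(shard):
    """Split one (start, end) hash range into two adjacent halves."""
    start, end = shard
    middle = (start + end) // 2
    return (start, middle), (middle + 1, end)


def merge_shards(left, right):
    """Merge two adjacent hash ranges back into one."""
    if left[1] + 1 != right[0]:
        raise ValueError("shards must be adjacent")
    return (left[0], right[1])


def find_shard(shards, partition_key):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(shards):
        if start <= value <= end:
            return index
    raise LookupError("no shard covers this hash")


stream = [(0, MAX_HASH)]           # one shard covering the whole key space
low, high = split_shard(stream[0])
stream = [low, high]               # after the split: two shards
owner = find_shard(stream, "device-42")
```

Merging `low` and `high` again yields the original single range, which mirrors how shards can be split to add capacity and merged to reduce it.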
15
Q

Big Data Ingestion Pipeline

A
  • IoT Core allows you to harvest data from IoT devices
  • Kinesis is great for real-time data collection
  • Firehose helps with data delivery to S3 in near real-time (1 minute)
  • Lambda can help Firehose with data transformations
  • Amazon S3 can trigger notifications to SQS
  • Lambda can subscribe to SQS (we could have connected S3 directly to Lambda)
  • Athena is a serverless SQL service and results are stored in S3
  • The reporting bucket contains analyzed data and can be used by
    reporting tools such as Amazon QuickSight, Redshift, etc…
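
The S3 to SQS to Lambda hop in this pipeline could look roughly like the handler below. It is a minimal sketch assuming each SQS message body is an S3 event notification in JSON; the bucket name and object key in the sample event are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Lambda-style handler: extract (bucket, key) pairs from SQS messages
    whose bodies are S3 event notifications."""
    objects = []
    for sqs_record in event.get("Records", []):
        body = json.loads(sqs_record["body"])
        # Some messages (such as S3 test events) carry no "Records" list.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys in S3 notifications are URL-encoded.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


# Hypothetical event: one SQS message wrapping one S3 notification.
sample_event = {
    "Records": [
        {
            "body": json.dumps(
                {
                    "Records": [
                        {
                            "s3": {
                                "bucket": {"name": "example-ingestion-bucket"},
                                "object": {"key": "iot/2024/device+1.json"},
                            }
                        }
                    ]
                }
            )
        }
    ]
}
found = handler(sample_event)
```

From here a real function would typically start an Athena query or other processing for each new object; that step is omitted because the card does not specify it.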