AWS Analytics Flashcards

Question 1

Q

Amazon Athena

Answer

A

Exam Tip: to analyze data in S3 using serverless SQL, use Athena
Use cases: business intelligence / analytics / reporting; analyze and query VPC Flow Logs, ELB logs, CloudTrail trails, etc.
Serverless query service to analyze data stored in Amazon S3
Amazon Athena Federated Query: lets you run SQL queries across data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
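
As a rough sketch of how an Athena query over logs in S3 might be submitted from Python, the helper below only assembles the arguments for boto3's `start_query_execution`. The database, table, column, and bucket names are hypothetical, and the actual call is left commented out because it needs AWS credentials.

```python
def build_athena_request(sql, database, output_location):
    """Assemble keyword arguments for boto3's Athena start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table defined over VPC Flow Logs stored in S3.
sql = (
    "SELECT srcaddr, COUNT(*) AS rejected "
    "FROM vpc_flow_logs WHERE action = 'REJECT' "
    "GROUP BY srcaddr ORDER BY rejected DESC LIMIT 10"
)
request = build_athena_request(sql, "logs_db", "s3://example-athena-results/")

# With credentials configured, the query could be submitted like this:
# import boto3
# boto3.client("athena").start_query_execution(**request)
```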

Question 2

Q

Redshift

Answer

A

vs Athena: faster queries / joins / aggregations thanks to indexes
Redshift is based on PostgreSQL, but it’s not used for OLTP
It’s OLAP – online analytical processing (analytics and data warehousing)
10x better performance than other data warehouses, scale to PBs of data

Question 3

Q

Redshift Cluster

Answer

A

Leader node: for query planning, results aggregation
Compute node: for performing the queries, send results to leader
You provision the node size in advance
You can use Reserved Instances for cost savings
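
The split between leader and compute nodes can be pictured with a toy model. This is only an illustration of the fan-out and aggregate idea, not how Redshift is implemented; the function names and the sample data are made up.

```python
def compute_node(rows):
    """A compute node runs the query on its own slice: partial sum and row count."""
    return sum(rows), len(rows)


def leader_node(table, node_count):
    """Plan the work, fan it out to compute nodes, then aggregate the partial results."""
    slices = [table[i::node_count] for i in range(node_count)]
    partials = [compute_node(rows) for rows in slices]
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count  # final AVG assembled by the leader


sales = [120, 80, 200, 40, 60, 100]
average = leader_node(sales, node_count=3)
print(average)  # 100.0
```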

Question 4

Q

Redshift – Snapshots & DR

Answer

A

Redshift has “Multi-AZ” mode for some clusters
Snapshots are point-in-time backups of a cluster, stored internally in S3
Snapshots are incremental (only what has changed is saved)
You can restore a snapshot into a new cluster
Automated: every 8 hours, every 5 GB, or on a schedule. Set retention between 1 and 35 days
Manual: snapshot is retained until you delete it
You can configure Amazon Redshift to automatically copy snapshots (automated or manual) of a cluster to another AWS Region
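
A minimal sketch of the retention and cross-Region copy settings, assuming boto3's Redshift client. The helper only builds the arguments for `modify_cluster` and `enable_snapshot_copy`; the cluster identifier and Region are hypothetical, and the 1 to 35 day check mirrors the range given on this card.

```python
def snapshot_settings(cluster_id, retention_days, copy_to_region):
    """Build arguments for two boto3 Redshift calls (nothing is executed here)."""
    if not 1 <= retention_days <= 35:
        raise ValueError("automated snapshot retention must be 1 to 35 days")
    modify_cluster = {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": retention_days,
    }
    enable_snapshot_copy = {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": copy_to_region,
    }
    return modify_cluster, enable_snapshot_copy


modify_args, copy_args = snapshot_settings("analytics-cluster", 7, "eu-west-1")

# With credentials configured, the settings could be applied like this:
# import boto3
# redshift = boto3.client("redshift")
# redshift.modify_cluster(**modify_args)
# redshift.enable_snapshot_copy(**copy_args)
```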

Question 5

Q

Redshift Spectrum

Answer

A

Query data that is already in S3 without loading it
Must have a Redshift cluster available to start the query
The query is then submitted to thousands of Redshift Spectrum nodes

Question 6

Q

Amazon OpenSearch Service

Answer

A

DynamoDB queries work only by primary key or indexes; OpenSearch can search any field, even with partial matches
Commonly used as a complement to another database
Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
Security through Cognito & IAM, KMS encryption, TLS
Comes with OpenSearch Dashboards (visualization)
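
The difference between key-based lookup and search on any field can be shown with a toy comparison in plain Python. The product data is invented, and the linear scan only stands in for the idea of partial matching; a real search engine uses an index rather than scanning every item.

```python
products = {
    "p1": {"name": "Wireless keyboard", "brand": "Acme"},
    "p2": {"name": "USB-C keyboard cover", "brand": "Globex"},
    "p3": {"name": "Laptop stand", "brand": "Acme"},
}


def get_by_key(table, key):
    """DynamoDB-style access: you must already know the primary key."""
    return table.get(key)


def search_any_field(table, text):
    """Search-style access: partial, case-insensitive match on any field."""
    needle = text.lower()
    return sorted(
        key
        for key, item in table.items()
        if any(needle in str(value).lower() for value in item.values())
    )


by_key = get_by_key(products, "p3")
partial = search_any_field(products, "keyb")
by_brand = search_any_field(products, "acme")
print(partial)   # ['p1', 'p2']
print(by_brand)  # ['p1', 'p3']
```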

Question 7

Q

Amazon EMR

Answer

A

Use cases: data processing, machine learning, web indexing, big data…
EMR stands for “Elastic MapReduce”
EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
The clusters can be made of hundreds of EC2 instances
EMR comes bundled with Apache Spark, HBase, Presto, Flink…
EMR takes care of all the provisioning and configuration

Question 8

Q

Amazon QuickSight

Answer

A

Create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Define Users (standard version) and Groups (enterprise version)
These users and groups exist only within QuickSight, not in IAM
A dashboard is a read-only snapshot of an analysis that you can share
A dashboard preserves the configuration of the analysis (filtering, parameters, controls, sort)
You can share the analysis or the dashboard with Users or Groups
To share a dashboard, you must first publish it
Users who see the dashboard can also see the underlying data

Question 9

Q

AWS Glue

Answer

A

Managed extract, transform, and load (ETL) service
Glue Job Bookmarks: prevent re-processing old data
Glue Elastic Views:
Combine and replicate data across multiple data stores using SQL
No custom code, Glue monitors for changes in the source data, serverless
Leverages a “virtual table” (materialized view)
Glue DataBrew: clean and normalize data using pre-built transformations
Glue Studio: new GUI to create, run and monitor ETL jobs in Glue
Glue Streaming ETL (built on Apache Spark Structured Streaming): compatible with Kinesis Data Streams, Kafka, MSK (managed Kafka)
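
The idea behind job bookmarks (remember what was already processed so that a rerun skips it) can be sketched in a few lines. This is a toy model with invented file names; it does not reflect how Glue stores or tracks its bookmark state.

```python
def run_job(available_files, bookmark):
    """Process only files not recorded in the bookmark, then update the bookmark."""
    new_files = [name for name in available_files if name not in bookmark]
    bookmark.update(new_files)
    return new_files


bookmark = set()
first_run = run_job(["2024-01-01.csv", "2024-01-02.csv"], bookmark)
second_run = run_job(
    ["2024-01-01.csv", "2024-01-02.csv", "2024-01-03.csv"], bookmark
)
print(second_run)  # ['2024-01-03.csv']
```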

Question 10

Q

AWS Lake Formation

Answer

A

Fully managed service that makes it easy to set up a data lake in days
Combine structured and unstructured data in the data lake
Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB…
Fine-grained Access Control for your applications (row and column-level)
Built on top of AWS Glue

Question 11

Q

Kinesis Data Analytics (SQL application)

Answer

A

Real-time analytics on Kinesis Data Streams & Firehose using SQL
Add reference data from Amazon S3 to enrich streaming data
Fully managed, no servers to provision
Automatic scaling
Pay for actual consumption rate
Output:
Kinesis Data Streams: create streams out of the real-time analytics queries
Kinesis Data Firehose: send analytics query results to destinations
Use cases:
Time-series analytics
Real-time dashboards
Real-time metrics
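
To make the time-series use case concrete, here is a plain-Python sketch of a tumbling-window count enriched with reference data. It only imitates the kind of query a SQL application would run; the device IDs, names, and timestamps are invented, and the reference dictionary stands in for a file loaded from S3.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds, reference):
    """Count events per (window start, enriched device name)."""
    counts = defaultdict(int)
    for timestamp, device_id in events:
        window_start = timestamp - (timestamp % window_seconds)
        # Enrich the streaming record with reference data.
        name = reference.get(device_id, "unknown")
        counts[(window_start, name)] += 1
    return dict(counts)


reference = {"d1": "boiler", "d2": "pump"}
events = [(0, "d1"), (12, "d1"), (59, "d2"), (61, "d1"), (75, "d3")]
result = tumbling_window_counts(events, 60, reference)
```

With 60-second windows, the first three events fall into the window starting at 0 and the last two into the window starting at 60.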

Question 12

Q

Kinesis Data Analytics for Apache Flink

Answer

A

Run any Apache Flink application on a managed cluster on AWS
* provisioning compute resources, parallel computation, automatic scaling
* application backups (implemented as checkpoints and snapshots)
* Use any Apache Flink programming features
* Flink does not read from Firehose (use Kinesis Data Analytics for SQL instead)

Question 13

Q

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Answer

A

Alternative to Amazon Kinesis
* Fully managed Apache Kafka on AWS
* Allows you to create, update, delete clusters
* MSK creates & manages Kafka broker nodes & Zookeeper nodes for you
* Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
* Automatic recovery from common Apache Kafka failures
* Data is stored on EBS volumes for as long as you want
* MSK Serverless
* Run Apache Kafka on MSK without managing the capacity
* MSK automatically provisions resources and scales compute & storage

Consumers: Glue, Lambda, EC2, EKS, ECS, Kinesis Data Analytics for Apache Flink

Question 14

Q

Kinesis Data Streams vs. Amazon MSK

Answer

A

Kinesis has a 1 MB message size limit; Kafka can be configured for larger messages
Kinesis has data streams with shards, which can be split and merged
Kafka has topics with partitions; you can add partitions to a topic
Both support TLS in-flight encryption; Kafka also allows plaintext
KMS at rest encryption
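
Shard splitting and merging can be sketched as operations on hash-key ranges. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose range contains it; the helper names and the partition key below are made up, and the even split is just one possible choice.

```python
import hashlib

MAX_HASH = 2**128 - 1


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def split_shard(shard):
    """Split one (start, end) hash range into two halves."""
    start, end = shard
    middle = (start + end) // 2
    return [(start, middle), (middle + 1, end)]


def merge_shards(left, right):
    """Merge two adjacent hash ranges into one."""
    if left[1] + 1 != right[0]:
        raise ValueError("shards must be adjacent")
    return (left[0], right[1])


def find_shard(shards, partition_key):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    return next(i for i, (start, end) in enumerate(shards) if start <= value <= end)


shards = split_shard((0, MAX_HASH))
index = find_shard(shards, "device-42")
merged = merge_shards(shards[0], shards[1])
```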

Question 15

Q

Big Data Ingestion Pipeline

Answer

A

IoT Core allows you to harvest data from IoT devices
Kinesis is great for real-time data collection
Firehose helps with data delivery to S3 in near real-time (1 minute)
Lambda can help Firehose with data transformations
Amazon S3 can trigger notifications to SQS
Lambda can subscribe to SQS (we could also have connected S3 directly to Lambda)
Athena is a serverless SQL service and results are stored in S3
The reporting bucket contains analyzed data and can be used by reporting tools such as Amazon QuickSight, Redshift, etc…
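
The Lambda transformation step in this pipeline could look roughly like the handler below. Firehose passes records as base64-encoded data and expects each one back with its `recordId`, a `result`, and re-encoded `data`; the temperature fields are hypothetical and only there to show a transformation.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation handler: decode, modify, and re-encode each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: add a Fahrenheit reading.
        payload["temperature_f"] = payload["temperature_c"] * 9 / 5 + 32
        line = json.dumps(payload) + "\n"
        data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
        output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
    return {"records": output}


# Local usage example with a fake Firehose event.
sample_event = {
    "records": [
        {
            "recordId": "1",
            "data": base64.b64encode(b'{"temperature_c": 20}').decode("utf-8"),
        }
    ]
}
response = lambda_handler(sample_event, None)
```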

(15 cards)