AWS Analytics Flashcards

1
Q

What is Amazon Elastic MapReduce (EMR)?

A

Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.

2
Q

How does Amazon EMR work?

A

EMR uses a managed, hosted Hadoop framework running on Amazon EC2 and Amazon S3 to process huge amounts of data.

It also supports Apache Spark, HBase, Presto, and Flink.

3
Q

What is EMR normally used for?

A

EMR is most commonly used for log analysis, financial analysis, and extract, transform, and load (ETL) activities.

4
Q

What is a Step?

A

A Step is a programmatic task for performing some process on the data (e.g. count words).
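
As a rough illustration, a Step can be described as a small dictionary in the shape that boto3's EMR `add_job_flow_steps` call accepts. The helper name `build_spark_step`, the bucket, and the script path below are hypothetical; only the dictionary is built here, and nothing is sent to AWS.

```python
def build_spark_step(name, script_s3_uri, args=()):
    """Build one EMR step definition that runs a Spark script.

    The keys follow the Steps=[...] shape used by boto3's EMR client.
    command-runner.jar is the EMR helper that runs a command on the cluster.
    """
    return {
        "Name": name,
        # Keep the cluster running and move on if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }


# Hypothetical bucket and script, for illustration only.
step = build_spark_step(
    "Word count",
    "s3://example-bucket/scripts/wordcount.py",
    ["s3://example-bucket/input/", "s3://example-bucket/output/"],
)
print(step["HadoopJarStep"]["Args"])
```

With boto3 installed and credentials configured, a dictionary like this would typically be passed as `boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of an existing cluster.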

5
Q

What is a Cluster?

A

A cluster is a collection of EC2 instances provisioned by EMR to run your Steps.

6
Q

What does EMR use as its distributed data processing engine?

A

EMR uses Apache Hadoop.

Apache Hadoop is an open-source Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware.

7
Q

What software is EMR a good place to deploy?

A

EMR is a good place to deploy Apache Spark.

Apache Spark is an open-source distributed processing system for big data workloads that uses in-memory caching and optimized query execution.

8
Q

What clusters can EMR launch?

A

EMR can launch Presto clusters.

Presto is an open-source distributed SQL query engine designed for fast analytic queries against large datasets.

9
Q

Where does EMR launch all its nodes?

A

EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone.

10
Q

What is Amazon Athena?

A

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

11
Q

What are Amazon Athena’s characteristics?

A

Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use – simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
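
The "point to your data, define the schema, and start querying" flow can be sketched as two small helpers: one that builds a `CREATE EXTERNAL TABLE` statement over an S3 prefix, and one that builds the parameters for boto3's Athena `start_query_execution` call. The helper names, database, table, and bucket are hypothetical, the data is assumed to be comma-delimited text, and nothing is sent to AWS.

```python
def create_table_ddl(table, columns, s3_location):
    """Build DDL for an external table over comma-delimited files in S3."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


def query_request(sql, database, output_location):
    """Build keyword arguments in the shape of Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names, for illustration only.
ddl = create_table_ddl(
    "access_logs",
    [("request_time", "string"), ("status", "int")],
    "s3://example-bucket/logs/",
)
request = query_request(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    "weblogs",
    "s3://example-bucket/athena-results/",
)
print(ddl)
```

With boto3 available, the request would typically be submitted as `boto3.client("athena").start_query_execution(**request)`.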

12
Q

What does Amazon Athena use?

A

Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro.

Amazon Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.

13
Q

What is Amazon Athena’s use case?

A

Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization.

It can also handle complex analysis, including large joins, window functions, and arrays.

14
Q

What is AWS Glue?

A

AWS Glue is a fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics.

15
Q

How does AWS Glue work?

A

AWS Glue automatically discovers and profiles data via the Glue Data Catalog, recommends and generates ETL code to transform your source data into target schemas.

AWS Glue runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination.

16
Q

What does AWS Glue allow you to do? What use case does it have?

A

You can create and run an ETL job with a few clicks in the AWS Management Console.

Use AWS Glue to discover properties of data, transform it, and prepare it for analytics.

AWS Glue also allows you to set up, orchestrate, and monitor complex data flows.

17
Q

What are some features of AWS Glue?

A

Glue can automatically discover both structured and semi-structured data stored in data lakes on Amazon S3, data warehouses in Amazon Redshift, and various databases running on AWS.

It provides a unified view of data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Glue automatically generates Scala or Python code for ETL jobs that you can further customize using tools you are already familiar with.

18
Q

Is AWS Glue serverless?

A

AWS Glue is serverless, so there are no compute resources to configure and manage.

19
Q

Data Analysis and Query Use Cases Summary - Primary Use Case

A

Amazon Athena - Query

Amazon Redshift - Data Warehouse

Amazon EMR - Data Processing

AWS Glue - ETL Service

20
Q

Data Analysis and Query Use Cases Summary - When to Use

A

Amazon Athena - Run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. Can be used with other services such as Amazon Redshift.

Amazon Redshift - Pull data from many sources, format and organize it, store it, and support complex, high-speed queries that produce business reports.

Amazon EMR - Highly distributed processing frameworks such as Hadoop, Spark, and Presto. Run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, and streaming data.

AWS Glue - Transform and move data to various destinations. Used to prepare and load data for analytics. The data source can be S3, Redshift, or another database. The Glue Data Catalog can be queried by Athena, EMR, and Redshift Spectrum.

21
Q

What is Amazon Kinesis?

A

Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.

It is a collection of services for processing streams of various kinds of data.

Data is processed in “shards”.

Four types of Kinesis service:

  • Kinesis Video Streams
  • Kinesis Data Streams
  • Kinesis Data Firehose
  • Kinesis Data Analytics
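
To make the idea of shards more concrete: in Kinesis Data Streams, each record carries a partition key, the key is hashed with MD5 to a 128-bit integer, and the record goes to the shard whose hash key range contains that value. The sketch below reproduces that mapping locally. It assumes the hash space is split evenly across the shards, and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(shard_count):
    """Split the hash space into contiguous, inclusive (start, end) ranges.

    Assumption: ranges are equal in size; the last shard absorbs any remainder.
    """
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key falls outside every shard range")


ranges = shard_ranges(4)
for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for(key, ranges))
```

Records that share a partition key always land on the same shard, which is why the choice of partition key affects how evenly load is spread.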
22
Q

What is Kinesis Video Streams?

A

Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.

Durably stores, encrypts, and indexes video data streams, and allows access to data through easy-to-use APIs.

23
Q

What are some features of Kinesis Video Streams?

A

Producers provide data streams.

Stores data for 24 hours by default, up to 7 days.

Consumers receive and process data.

Can have multiple shards in a stream.

Supports encryption at rest with server-side encryption (KMS) with a customer master key.

24
Q

What is Kinesis Data Streams?

A

Kinesis Data Streams enables you to build custom applications that process or analyze streaming data for specialized needs.

Kinesis Data Streams enables real-time processing of streaming big data.

Kinesis Data Streams is useful for rapidly moving data off data producers and then continuously processing the data.

Kinesis Data Streams stores data for later processing by applications (key difference with Firehose which delivers data directly to AWS services).

25
Q

What are the use cases for Kinesis Data Streams?

A

Accelerated log and data feed intake.
Real-time metrics and reporting.
Real-time data analytics.
Complex stream processing.

26
Q

What is Kinesis Data Firehose?

A

Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools.

Captures, transforms, and loads streaming data.

Enables near real-time analytics with existing business intelligence tools and dashboards.

27
Q

How can Kinesis Data Streams and Kinesis Data Firehose work together?

A

Kinesis Data Streams can be used as a source for Kinesis Data Firehose.

You can configure Kinesis Data Firehose to transform your data before delivering it.

28
Q

What are some features of Kinesis Data Firehose?

A

With Kinesis Data Firehose you don’t need to write an application or manage resources.

Firehose can batch, compress, and encrypt data before loading it.

Firehose synchronously replicates data across three AZs as it is transported to destinations.

Each delivery stream stores data records for up to 24 hours.
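
The "batch and compress" behavior can be pictured with a small local sketch. Firehose does this for you as a managed service, buffering by size and time interval; the code below only illustrates the idea by grouping records into fixed-size batches of newline-delimited JSON and gzip-compressing each batch. The function name and the fixed batch size are assumptions made for illustration.

```python
import gzip
import json


def batch_and_compress(records, batch_size):
    """Group records into batches and gzip each batch as newline-delimited JSON."""
    blobs = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        payload = "".join(json.dumps(record) + "\n" for record in batch)
        blobs.append(gzip.compress(payload.encode("utf-8")))
    return blobs


records = [{"id": i} for i in range(5)]
blobs = batch_and_compress(records, batch_size=2)
print(len(blobs), "compressed objects")

# Reading a batch back: decompress, then parse one JSON document per line.
first_batch = [json.loads(line) for line in gzip.decompress(blobs[0]).decode("utf-8").splitlines()]
print(first_batch)
```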

29
Q

What is Kinesis Data Analytics?

A

Amazon Kinesis Data Analytics is the easiest way to process and analyze real-time, streaming data.

Can use standard SQL queries to process Kinesis data streams.

Provides real-time analysis.

30
Q

What are the use cases for Kinesis Data Analytics?

A

Generate time-series analytics.

Feed real-time dashboards.

Create real-time alerts and notifications.

Quickly author and run powerful SQL code against streaming sources.

Can ingest data from Kinesis Data Streams and Kinesis Data Firehose.

Output to S3, Redshift, Elasticsearch, and Kinesis Data Streams.

Sits over Kinesis Data Streams and Kinesis Data Firehose.
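
A typical time-series use case is counting events per fixed time window. In Kinesis Data Analytics this would be written as SQL over the stream; the plain-Python sketch below shows only the underlying tumbling-window idea, assuming each event is a `(timestamp_seconds, key)` pair. The function name and the sample events are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) over non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (30, "a"), (61, "a"), (65, "b")]
result = tumbling_window_counts(events, window_seconds=60)
print(result)
```

Each event belongs to exactly one window, so the counts never overlap. That property is what makes this kind of aggregate suitable for feeding dashboards and alerts.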