AWS Analytics Flashcards

1
Q

A web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.

Highly distributed processing frameworks such as Hadoop, Spark, and Presto. Run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data.

A

Amazon Elastic MapReduce (EMR)

2
Q

Amazon Elastic MapReduce (EMR) Features

A

Utilizes a hosted Hadoop framework running on Amazon EC2 and Amazon S3

Managed Hadoop framework for processing huge amounts of data.

Also supports Apache Spark, HBase, Presto, and Flink

3
Q

Common Amazon Elastic MapReduce (EMR) uses

A

Most commonly used for log analysis, financial analysis, or extract, transform, and load (ETL) activities

4
Q

Amazon Elastic MapReduce (EMR) programmatic task for performing some process on data (e.g. count words)

A

Step
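
A minimal sketch of what a Step definition can look like when submitted through boto3's `add_job_flow_steps`. The helper name `build_spark_step`, the S3 paths, and the cluster ID in the comment are hypothetical; the code only builds the request and does not call AWS.

```python
def build_spark_step(name, script_uri, *script_args):
    """Return one EMR step definition that runs a Spark script via command-runner.jar."""
    return {
        "Name": name,
        # Keep the cluster running if this step fails (other values exist).
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *script_args],
        },
    }


# Hypothetical word-count script and input location.
step = build_spark_step(
    "Count words",
    "s3://example-bucket/scripts/word_count.py",
    "s3://example-bucket/input/",
)

# With credentials and a running cluster, the step could be submitted like this:
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-EXAMPLE", Steps=[step])
```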

5
Q

A collection of EC2 instances provisioned by EMR to run your Steps

A

Cluster
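
A rough sketch of a cluster request in the shape boto3's `run_job_flow` accepts, assuming the default EMR roles exist in the account. The helper name, release label, instance types, and Availability Zone are illustrative assumptions; nothing is sent to AWS.

```python
def build_cluster_request(name, availability_zone, core_count=2):
    """Return keyword arguments describing a small Spark cluster for run_job_flow."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_count},
            ],
            # All nodes of the cluster are placed in this single Availability Zone.
            "Placement": {"AvailabilityZone": availability_zone},
            # Terminate the cluster once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("example-cluster", "us-east-1a")

# With credentials configured, the cluster could be launched like this:
# import boto3
# boto3.client("emr").run_job_flow(**request)
```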

6
Q

Other EMR Facts

A

EMR uses Apache Hadoop as its distributed data processing engine, which is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware.

EMR is a good place to deploy Apache Spark, an open-source distributed processing system used for big data workloads, which utilizes in-memory caching and optimized query execution.

You can also launch Presto clusters. Presto is an open-source distributed SQL query engine designed for fast analytic queries against large datasets.

EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone.

You can access Amazon EMR by using the AWS Management Console, Command Line Tools, SDKs, or the EMR API.

With EMR you have access to the underlying operating system (you can SSH in)

Simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments

Flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements

7
Q

An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL

Run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. Can be used with other services such as Amazon Redshift

A

Amazon Athena

8
Q

Amazon Athena Features

A

Serverless, so there is no infrastructure to manage, and you pay only for the queries that you run

Easy to use – simply point to your data in Amazon S3, define the schema, and start querying using standard SQL

Uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro

While Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization, it can also handle complex analysis, including large joins, window functions, and arrays.

Amazon Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3

Easiest way to run ad-hoc queries for data in S3 without the need to set up or manage any servers
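
The "point to your data, define the schema, start querying" flow can be sketched as follows. The table name, columns, database, and S3 locations are made up for illustration, and `build_athena_query_request` is a hypothetical helper; it only assembles the arguments that boto3's `start_query_execution` expects.

```python
# Schema definition over files that already sit in S3 (hypothetical bucket and columns).
CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  request_time string,
  status int,
  bytes_sent bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/logs/'
"""

# An ad-hoc query in standard SQL.
TOP_STATUS = (
    "SELECT status, COUNT(*) AS hits "
    "FROM access_logs GROUP BY status ORDER BY hits DESC"
)


def build_athena_query_request(sql, database, output_location):
    """Return keyword arguments for start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_query_request(
    TOP_STATUS, "example_db", "s3://example-bucket/athena-results/"
)

# With credentials configured, the query could be started like this:
# import boto3
# boto3.client("athena").start_query_execution(**request)
```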

9
Q

Fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics

Transform and move data to various destinations. Used to prepare and load data for analytics. Data source can be S3, Redshift, or another database. Glue Data Catalog can be queried by Athena, EMR, and Redshift Spectrum

A

AWS Glue

10
Q

AWS Glue Features

A

Automatically discovers and profiles data via the Glue Data Catalog, and recommends and generates ETL code to transform your source data into target schemas.

Runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination

Allows you to set up, orchestrate, and monitor complex data flows

Lets you discover properties of data, transform it, and prepare it for analytics.

Glue can automatically discover both structured and semi-structured data stored in data lakes on Amazon S3, data warehouses in Amazon Redshift, and various databases running on AWS.

It provides a unified view of data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Glue automatically generates Scala or Python code for ETL jobs that you can further customize using tools you are already familiar with.

AWS Glue is serverless, so there are no compute resources to configure and manage.

11
Q

AWS data warehouse that provides the fastest query performance for enterprise reporting and business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries.

Pull data from many sources, format and organize it, store it, and support complex, high speed queries that produce business reports.

A

Amazon Redshift

12
Q

AWS Service that makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.

Collection of services for processing streams of various data.

Data is processed in “shards”.

A

Amazon Kinesis
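
A small illustration of how records land in shards: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The even split in `shard_ranges` and both function names are simplifying assumptions for this sketch, not the service's API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer falls in [0, 2**128)


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, roughly equal (start, end) ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("ranges do not cover the hash key space")


ranges = shard_ranges(4)
shard_index = shard_for_key("device-42", ranges)  # same key always maps to the same shard
```

Because the mapping depends only on the partition key, all records that share a key go to the same shard and keep their relative order there.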

13
Q

Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.

Durably stores, encrypts, and indexes video data streams, and allows access to data through easy-to-use APIs.

Producers provide data streams.

Stores data for 24 hours by default, up to 7 days.

Consumers receive and process data.

Can have multiple shards in a stream.

Supports encryption at rest with server-side encryption (KMS) with a customer master key.

A

Amazon Kinesis Video Streams

14
Q

Enables you to build custom applications that process or analyze streaming data for specialized needs.

Enables real-time processing of streaming big data.

Useful for rapidly moving data off data producers and then continuously processing the data.

Stores data for later processing by applications (the key difference from Firehose, which delivers data directly to AWS services)

A

Amazon Kinesis Data Streams

15
Q

Accelerated log and data feed intake.
Real-time metrics and reporting.
Real-time data analytics.
Complex stream processing.

A

Amazon Kinesis Data Streams Common Use Cases

16
Q

The easiest way to load streaming data into data stores and analytics tools.

Captures, transforms, and loads streaming data.

Enables near real-time analytics with existing business intelligence tools and dashboards.

Kinesis Data Streams can be used as a source for Kinesis Data Firehose.

Enables you to transform your data before delivering it.

You don’t need to write an application or manage resources.

Can batch, compress, and encrypt data before loading it.

Synchronously replicates data across three AZs as it is transported to destinations.

Each delivery stream stores data records for up to 24 hours

A

Amazon Kinesis Data Firehose
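
The "batch and compress before loading" idea can be sketched locally. This is not Firehose's implementation: `DeliveryBuffer` is a hypothetical class that flushes a gzip-compressed batch once a size or age threshold is reached, with the clock passed in as `now` so the behavior is easy to follow.

```python
import gzip


class DeliveryBuffer:
    """Collect records and emit one gzip batch when a size or age limit is hit."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first record of the current batch arrived

    def add(self, record, now):
        """Buffer one record (bytes); return a compressed batch if a limit is reached."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self):
        """Compress the buffered records (newline-delimited) and reset the buffer."""
        batch = gzip.compress(b"\n".join(self.records))
        self.records, self.size, self.opened_at = [], 0, None
        return batch


buffer = DeliveryBuffer(max_bytes=10, max_seconds=60)
first = buffer.add(b"abc", now=0)        # below both limits, nothing delivered
second = buffer.add(b"defghijk", now=1)  # 11 bytes buffered, size limit reached
```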

17
Q

Easiest way to process and analyze real-time, streaming data.

Can use standard SQL queries to process Kinesis data streams.

Provides real-time analysis

Quickly author and run powerful SQL code against streaming sources.

Can ingest data from Kinesis Data Streams and Kinesis Data Firehose.

Output to S3, Redshift, Elasticsearch, and Kinesis Data Streams.

Sits over Kinesis Data Streams and Kinesis Data Firehose

A

Amazon Kinesis Data Analytics

18
Q

Generate time-series analytics.
Feed real-time dashboards.
Create real-time alerts and notifications.

A

Amazon Kinesis Data Analytics Use Cases
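
As a rough picture of time-series analytics over a stream, the sketch below counts events per key in fixed, non-overlapping (tumbling) windows. It is plain Python rather than the service's SQL, and the function name and the `(timestamp_seconds, key)` event shape are assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "login"), (30, "login"), (61, "login"), (62, "error")]
per_minute = tumbling_window_counts(events, window_seconds=60)
# per_minute == {(0, "login"): 2, (60, "login"): 1, (60, "error"): 1}
```

Counts like these are the kind of series that could feed a real-time dashboard or trigger an alert when a window crosses a threshold.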