08 Analytics Flashcards

Question 1

Q

Describe Kinesis and its primary function.

Answer

A

Kinesis is a highly available, scalable streaming service that lets producers send data into streams, from which consumers read it for processing and analysis.

Question 2

Q

How does Kinesis ensure scalability?

Answer

A

Kinesis scales by sharding: each shard supports 1 MB/s of ingestion and 2 MB/s of read throughput, so a stream's capacity grows with its number of shards.
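
The per-shard limits make capacity planning simple arithmetic. The sketch below is illustrative only: the function name `shards_needed` is hypothetical, and it considers only the MB/s limits stated in this card, ignoring any other per-shard limits that may apply.

```python
import math

# Per-shard limits as stated in the card above.
INGEST_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0


def shards_needed(ingest_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies both throughput limits."""
    by_ingest = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_read)


print(shards_needed(5.0, 8.0))  # ingestion is the bottleneck here
print(shards_needed(1.5, 9.0))  # reads are the bottleneck here
```

Whichever limit demands more shards decides the result, so a read-heavy stream can need more shards than its ingestion rate alone suggests.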

Question 3

Q

What is the default data retention period in Kinesis?

Answer

A

Data in Kinesis is retained for 24 hours by default, but this can be increased to 365 days.
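
Retention is typically configured in hours, so a day-based target has to be converted and range-checked first. The helper below is a minimal sketch under that assumption; `retention_hours` is a hypothetical name, and the bounds are simply the 24-hour default and 365-day maximum from this card.

```python
DEFAULT_RETENTION_HOURS = 24       # default retention from the card above
MAX_RETENTION_HOURS = 365 * 24     # 365 days expressed in hours


def retention_hours(days: int) -> int:
    """Convert a retention period in days to hours, rejecting out-of-range values."""
    hours = days * 24
    if not DEFAULT_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"retention must be between 1 and 365 days, got {days}")
    return hours


print(retention_hours(7))    # one week, in hours
print(retention_hours(365))  # the maximum, in hours
```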

Question 4

Q

What is the role of Kinesis Data Firehose?

Answer

A

Kinesis Data Firehose delivers records from Kinesis streams to supported destinations such as S3, OpenSearch, or Redshift.

Question 5

Q

Explain the near-real-time aspect of Kinesis Data Firehose.

Answer

A

Kinesis Data Firehose operates in near real time rather than real time: records are buffered and delivered in batches, so there is a delay between ingestion and delivery.

Question 6

Q

How can data be transformed in Kinesis Data Firehose?

Answer

A

Data can be transformed in-flight using AWS Lambda functions.
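
A transformation Lambda receives a batch of base64-encoded records and returns each one with the same `recordId`, a `result` status, and the re-encoded `data`. The handler below is a minimal sketch that assumes the incoming payloads are JSON objects; the added `processed` field is purely illustrative.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each record, modify it, and hand it back to Firehose."""
    output = []
    for record in event["records"]:
        # Firehose delivers the payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical transformation

        # Newline-delimit the output so records stay separable at the destination.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must match the incoming record
            "result": "Ok",                  # alternatives: "Dropped", "ProcessingFailed"
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming record needs a matching entry in the response, which is why the sketch marks records with a status instead of silently omitting them.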

Question 7

Q

Describe the function of Kinesis Data Analytics.

Answer

A

Kinesis Data Analytics uses SQL to analyze streaming data in real time, with results sent to other Kinesis streams or to Kinesis Data Firehose.

Question 8

Q

What type of data can Kinesis Data Analytics reference?

Answer

A

Kinesis Data Analytics can reference static data from an S3 bucket.

Question 9

Q

Define Elastic MapReduce (EMR) and its purpose.

Answer

A

Elastic MapReduce (EMR) is a managed implementation of Apache Hadoop, designed to process large amounts of data using the Hadoop ecosystem.

Question 10

Q

What additional elements are included in Elastic MapReduce?

Answer

A

Elastic MapReduce includes other elements of the Apache ecosystem alongside Hadoop.

Question 11

Q

Describe the architecture of an EMR cluster.

Answer

A

An EMR cluster consists of a master node, core nodes that provide HDFS storage and run jobs, and task nodes that only run jobs.

Question 12

Q

How does EMRFS enhance durability compared to HDFS?

Answer

A

EMRFS is backed by S3, making it more durable than HDFS, which is tied to instances in a single Availability Zone (AZ).

Question 13

Q

Define Redshift and its primary use case.

Answer

A

Redshift is a petabyte-scale data warehouse optimized for Online Analytical Processing (OLAP) and uses a columnar data format.

Question 14

Q

Explain the node structure of a Redshift cluster.

Answer

A

A Redshift cluster has a single leader node and multiple compute nodes.

Question 15

Q

What is the significance of Enhanced VPC routing in Redshift?

Answer

A

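With Enhanced VPC routing enabled, Redshift sends COPY and UNLOAD traffic between the cluster and its data repositories through the VPC, so that traffic follows the VPC’s routing settings instead of using public endpoints.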

Question 16

Q

How often are backups taken in Redshift?

Answer

A

Redshift takes automated backups approximately every 8 hours or whenever more than 5 GB of data has changed.

Question 17

Q

Discuss the cost efficiency strategy for EMR clusters.

Answer

A

For cost efficiency, use on-demand instances for the master and core nodes, and spot instances for the task nodes.
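
This split can be expressed as a list of instance groups. The sketch below builds plain dictionaries shaped like the `InstanceGroups` list that boto3's EMR `run_job_flow` accepts; the function name, group names, and the `m5.xlarge` default are assumptions for illustration, and nothing here calls AWS.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> list:
    """Return instance groups with spot pricing only on the task nodes."""
    return [
        {   # The master node coordinates the cluster, so keep it on-demand.
            "Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
            "InstanceType": instance_type, "InstanceCount": 1,
        },
        {   # Core nodes hold HDFS data; losing one risks losing data.
            "Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
            "InstanceType": instance_type, "InstanceCount": core_count,
        },
        {   # Task nodes only run jobs, so an interruption costs compute, not data.
            "Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
            "InstanceType": instance_type, "InstanceCount": task_count,
        },
    ]


for group in build_instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Task nodes are the safe place for spot pricing because they store no HDFS data, so reclaiming one interrupts work without losing anything stored on the cluster.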

Question 18

Q

Identify the limitations of both EMR and Redshift regarding availability.

Answer

A

Both EMR and Redshift run in a single Availability Zone (AZ), which means they are not highly available.

Question 19

Q

What happens to snapshots in Redshift?

Answer

A

Snapshots in Redshift are automatically taken and stored in S3.

Question 20

Q

Describe the default retention period for automated snapshots in Redshift.

Answer

A

Automated snapshots are retained for 1 day by default, but a retention period of up to 35 days can be selected.

Question 21

Q

How does Redshift Spectrum function with data in S3?

Answer

A

Redshift Spectrum directly queries data in S3 without loading it into Redshift first.

Question 22

Q

What is required in order to use Redshift Spectrum?

Answer

A

Redshift Spectrum is not serverless; a Redshift cluster is still required.

Question 23

Q

What is the purpose of Federated Query in Redshift?

Answer

A

Federated Query allows Redshift to query data stored in other databases without loading it into Redshift first.

Question 24

Q

How can one mitigate the single-AZ risk in Redshift?

Answer

A

To overcome the single-AZ risk, take regular snapshots and restore to a new cluster.

Question 25

Q

Describe Amazon MQ.

Answer

A

Amazon MQ is a managed implementation of Apache ActiveMQ, providing functionality similar to SNS and SQS.

Question 26

Q

What messaging protocols does Amazon MQ support?

Answer

A

Amazon MQ supports JMS, AMQP, MQTT, OpenWire, and STOMP.

Question 27

Q

How does Amazon MQ handle messaging?

Answer

A

Amazon MQ supports both one-to-one messaging (queues) and one-to-many messaging (topics).

Question 28

Q

Where does Amazon MQ run?

Answer

A

Amazon MQ runs in a VPC either as a single instance or as a high availability (HA) pair (active/standby).