08 Analytics Flashcards

1
Q

Describe Kinesis and its primary function.

A

Kinesis is a highly available, scalable streaming service that lets producers send data into streams, where consumers can process and analyze it.

2
Q

How does Kinesis ensure scalability?

A

Kinesis uses sharding to provide scalability, with each shard supporting 1 MB/s of ingestion and 2 MB/s of read throughput.
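The per-shard limits above turn into a simple sizing calculation. This is a minimal sketch that assumes only the two throughput figures quoted in this card; the function name `shards_needed` is made up for illustration, and other limits such as records per second are ignored.

```python
import math

# Per-shard limits quoted in the card above.
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both write and read load."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(for_writes, for_reads, 1)


print(shards_needed(5, 8))    # 5: writes are the bottleneck
print(shards_needed(2, 10))   # 5: reads are the bottleneck
```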

3
Q

Define the data retention period in Kinesis by default.

A

Data in Kinesis is retained for 24 hours by default, but this can be increased to 365 days.

4
Q

What is the role of Kinesis Data Firehose?

A

Kinesis Data Firehose delivers records from Kinesis streams to supported destinations such as S3, OpenSearch, or Redshift.

5
Q

Explain the near-realtime aspect of Kinesis Data Firehose.

A

Kinesis Data Firehose is near real time rather than real time: records are buffered by size or time interval before delivery, so batching and any processing add a delay.
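The batching delay can be pictured with a toy buffer that flushes when it reaches a size limit or an age limit, whichever comes first. This is an illustrative sketch, not Firehose's implementation; the class `BufferedDelivery` and its thresholds are invented for the example.

```python
class BufferedDelivery:
    """Toy buffer that flushes on size or age, whichever comes first."""

    def __init__(self, max_bytes: int, max_age_s: float):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the first buffered record
        self.delivered = []    # list of (flush_time, batch)

    def put(self, record: bytes, now: float) -> None:
        # Flush first if the oldest buffered record has waited long enough.
        self.tick(now)
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes:
            self._flush(now)

    def tick(self, now: float) -> None:
        if self.opened_at is not None and now - self.opened_at >= self.max_age_s:
            self._flush(now)

    def _flush(self, now: float) -> None:
        self.delivered.append((now, self.records))
        self.records, self.size, self.opened_at = [], 0, None


buf = BufferedDelivery(max_bytes=10, max_age_s=60)
buf.put(b"abc", now=0)      # buffered, nothing delivered yet
buf.put(b"defghij", now=5)  # 10 bytes reached, so the batch is flushed
buf.put(b"x", now=10)       # buffered
buf.tick(now=70)            # 60 s elapsed, so the batch is flushed
print([(t, len(batch)) for t, batch in buf.delivered])  # [(5, 2), (70, 1)]
```

The last record waits 60 seconds before delivery even though it arrived at t=10, which is the kind of delay that makes the service near real time.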

6
Q

How can data be transformed in Kinesis Data Firehose?

A

Data can be transformed in-flight using AWS Lambda functions.
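A transformation function receives a batch of base64-encoded records and returns each one with the same `recordId`, a `result` status (`Ok`, `Dropped`, or `ProcessingFailed`), and the re-encoded `data`. The sketch below assumes JSON records with a hypothetical `message` field; the uppercase step is only a placeholder for real transformation logic.

```python
import base64
import json


def lambda_handler(event, context):
    """Uppercase a hypothetical 'message' field in each JSON record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = str(payload.get("message", "")).upper()
            data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, AttributeError, TypeError):
            # Not valid JSON (or not a JSON object): return the original
            # data unchanged and mark the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


# Local usage example with a hand-built event.
sample = {"records": [{"recordId": "1", "data": base64.b64encode(b'{"message": "hi"}').decode()}]}
result = lambda_handler(sample, None)
print(json.loads(base64.b64decode(result["records"][0]["data"])))  # {'message': 'HI'}
```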

7
Q

Describe the function of Kinesis Data Analytics.

A

Kinesis Data Analytics uses SQL to analyze streaming data in real time, with results sent to other Kinesis streams or to Kinesis Data Firehose.

8
Q

What type of data can Kinesis Data Analytics reference?

A

Kinesis Data Analytics can reference static data from an S3 bucket.

9
Q

Define Elastic MapReduce (EMR) and its purpose.

A

Elastic MapReduce (EMR) is a managed implementation of Apache Hadoop, designed to process large amounts of data using the Hadoop ecosystem.

10
Q

What additional elements are included in Elastic MapReduce (EMR)?

A

Elastic MapReduce (EMR) includes other elements of the Apache ecosystem alongside Hadoop, such as Spark, Hive, and HBase.

11
Q

Describe the architecture of an EMR cluster.

A

An EMR cluster consists of a master node, core nodes that provide HDFS storage and run jobs, and task nodes that only run jobs.

12
Q

How does EMRFS enhance durability compared to HDFS?

A

EMRFS is backed by S3, making it more durable than HDFS, which is tied to instances in a single Availability Zone (AZ).

13
Q

Define Redshift and its primary use case.

A

Redshift is a petabyte-scale data warehouse optimized for Online Analytical Processing (OLAP) and uses a columnar data format.
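Why a columnar format suits OLAP can be seen by storing the same data both ways. This is a toy illustration with made-up sample data, not how Redshift lays out blocks on disk: an aggregate over one field touches every whole row in the row layout, but only a single column in the columnar layout.

```python
# The same three rows stored row-wise and column-wise (made-up sample data).
rows = [
    {"order_id": 1, "region": "eu", "amount": 120.0},
    {"order_id": 2, "region": "us", "amount": 80.0},
    {"order_id": 3, "region": "eu", "amount": 50.0},
]

columns = {
    "order_id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "amount": [120.0, 80.0, 50.0],
}

# Row layout: every whole row is visited to total one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the 'amount' column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 250.0 250.0
```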

14
Q

Explain the node structure of a Redshift cluster.

A

A Redshift cluster has a single leader node and multiple compute nodes.

15
Q

What is the significance of Enhanced VPC routing in Redshift?

A

Enhanced VPC routing forces traffic between the Redshift cluster and data repositories (for example, COPY and UNLOAD to S3) through the VPC, so it follows the VPC's routing settings instead of using public endpoints.

16
Q

How often are backups taken in Redshift?

A

Automated snapshots in Redshift are taken approximately every 8 hours or after every 5 GB of data changes per node, whichever comes first.

17
Q

Discuss the cost efficiency strategy for EMR clusters.

A

For cost efficiency, use on-demand instances for the master and core nodes, and spot instances for the task nodes.

18
Q

Identify the limitations of both EMR and Redshift regarding availability.

A

By default, both EMR and Redshift clusters run in a single Availability Zone (AZ), which means they are not highly available.

19
Q

What happens to snapshots in Redshift?

A

Snapshots in Redshift are automatically taken and stored in S3.

20
Q

Describe the default retention period for automated snapshots in Redshift.

A

Automated snapshots are retained for 1 day by default, but a retention period of up to 35 days can be selected.

21
Q

How does Redshift Spectrum function with data in S3?

A

Redshift Spectrum directly queries data in S3 without loading it into Redshift first.

22
Q

Define the requirement for using Redshift Spectrum.

A

Redshift Spectrum is not serverless; a Redshift cluster is still required.

23
Q

What is the purpose of Federated Query in Redshift?

A

Federated Query allows Redshift to query data stored in other databases without loading it into Redshift first.

24
Q

How can one mitigate the single-AZ risk in Redshift?

A

To overcome the single-AZ risk, take regular snapshots and restore to a new cluster.

25
Q

Describe Amazon MQ.

A

Amazon MQ is a managed implementation of Apache ActiveMQ, providing functionality similar to SNS and SQS.

26
Q

What messaging protocols does Amazon MQ support?

A

Amazon MQ supports the JMS API and the AMQP, MQTT, OpenWire, and STOMP protocols.

27
Q

How does Amazon MQ handle messaging?

A

Amazon MQ allows for one-to-one messaging (queues) and one-to-many messaging (topics).
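The difference between the two patterns can be sketched with two in-memory classes. These are hypothetical stand-ins for a broker's queues and topics, not the Amazon MQ or ActiveMQ API: a queue hands each message to exactly one receiver, while a topic copies each message to every subscriber.

```python
from collections import deque


class Queue:
    """One-to-one: each message is consumed by exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class Topic:
    """One-to-many: each message is copied to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        inbox = []
        self._subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self._subscribers:
            inbox.append(message)


queue = Queue()
queue.send("job-1")
first = queue.receive()   # "job-1"
second = queue.receive()  # None: the message was already consumed

topic = Topic()
a, b = topic.subscribe(), topic.subscribe()
topic.publish("price-update")
print(first, second, a, b)  # job-1 None ['price-update'] ['price-update']
```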

28
Q

Where does Amazon MQ run?

A

Amazon MQ runs in a VPC either as a single instance or as a high availability (HA) pair (active/standby).