08 Analytics Flashcards
Describe Kinesis and its primary function.
Kinesis is a highly available, scalable streaming service that allows producers to send data into streams for processing and analysis.
How does Kinesis ensure scalability?
Kinesis uses sharding to provide scalability, with each shard supporting 1 MB/s of ingestion and 2 MB/s of read throughput.
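The per-shard limits above translate into simple sizing arithmetic. The following is a minimal sketch, assuming the 1 MB/s write and 2 MB/s read figures from the card; the function name `shards_needed` is hypothetical and not part of any AWS SDK.

```python
import math

# Per-shard limits as stated in the card above.
SHARD_WRITE_MB_PER_S = 1
SHARD_READ_MB_PER_S = 2


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both write and read load."""
    for_writes = math.ceil(write_mb_per_s / SHARD_WRITE_MB_PER_S)
    for_reads = math.ceil(read_mb_per_s / SHARD_READ_MB_PER_S)
    # A stream always has at least one shard.
    return max(1, for_writes, for_reads)
```

For example, `shards_needed(5, 4)` is driven by the write side, while `shards_needed(1.5, 8)` is driven by the read side.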
What is the default data retention period in Kinesis?
Data in Kinesis is retained for 24 hours by default, but this can be increased to 365 days.
What is the role of Kinesis Data Firehose?
Kinesis Data Firehose delivers records from Kinesis streams to supported destinations such as S3, OpenSearch, or Redshift.
Explain the near-realtime aspect of Kinesis Data Firehose.
Kinesis Data Firehose operates in near real time: records are buffered (by size or time interval) before delivery, so there is a short delay between ingestion and arrival at the destination.
How can data be transformed in Kinesis Data Firehose?
Data can be transformed in-flight using AWS Lambda functions.
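As an illustration of an in-flight transformation, here is a minimal sketch of a Lambda handler for Firehose. It assumes the documented transformation contract (each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries `recordId`, `result`, and re-encoded `data`). The JSON payload and its `message` field are hypothetical, chosen only for the example.

```python
import base64
import json


def lambda_handler(event, context):
    """Upper-case a hypothetical 'message' field in every record."""
    output = []
    for record in event["records"]:
        # Firehose delivers each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["message"] = str(payload.get("message", "")).upper()

        # Newline-delimit the output so records stay separable downstream.
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

The handler can be exercised locally by passing a dictionary shaped like a Firehose event; no AWS resources are needed for that.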
Describe the function of Kinesis Data Analytics.
Kinesis Data Analytics uses SQL to analyze data in real time, with results sent to other Kinesis streams or to Kinesis Data Firehose.
What type of data can Kinesis Data Analytics reference?
Kinesis Data Analytics can reference static data from an S3 bucket.
Define Elastic MapReduce (EMR) and its purpose.
Elastic MapReduce (EMR) is a managed implementation of Apache Hadoop, designed to process large amounts of data using the Hadoop ecosystem.
What additional elements are included in Elastic MapReduce?
Elastic MapReduce includes other elements of the Apache ecosystem alongside Hadoop.
Describe the architecture of an EMR cluster.
An EMR cluster consists of a master node, core nodes that provide HDFS storage and run jobs, and task nodes that only run jobs.
How does EMRFS enhance durability compared to HDFS?
EMRFS is backed by S3, making it more durable than HDFS, which is tied to instances in a single Availability Zone (AZ).
Define Redshift and its primary use case.
Redshift is a petabyte-scale data warehouse optimized for Online Analytical Processing (OLAP) and uses a columnar data format.
Explain the node structure of a Redshift cluster.
A Redshift cluster has a single leader node and multiple compute nodes.
What is the significance of Enhanced VPC routing in Redshift?
Enhanced VPC routing forces traffic between the Redshift cluster and its data repositories (for example, COPY and UNLOAD to S3) through the VPC, so it follows the VPC's routing settings instead of using public endpoints.
How often are backups taken in Redshift?
Automated backups in Redshift are taken approximately every 8 hours or after every 5 GB per node of data changes, whichever comes first.
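The two backup triggers can be modeled as a simple "whichever comes first" check. This is only a sketch of the rule as stated in the card, not how Redshift implements it. The function name and parameters are hypothetical, and the 5 GB threshold is treated as a single number however it is measured.

```python
# Thresholds taken from the card above.
SNAPSHOT_INTERVAL_HOURS = 8
SNAPSHOT_CHANGE_GB = 5


def should_snapshot(hours_since_last: float, gb_changed: float) -> bool:
    """Return True when either the time or the change threshold is reached."""
    return (
        hours_since_last >= SNAPSHOT_INTERVAL_HOURS
        or gb_changed >= SNAPSHOT_CHANGE_GB
    )
```

A quiet cluster is therefore still backed up on the time trigger, and a busy one is backed up more often on the change trigger.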
Discuss the cost efficiency strategy for EMR clusters.
For cost efficiency, use on-demand instances for the master and core nodes, and spot instances for the task nodes.
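The reasoning behind this split is that task nodes only run jobs and hold no HDFS data, so losing one to a spot interruption costs compute time but not stored data. A minimal sketch of the mapping follows. The function name and the role strings are hypothetical labels for illustration, not EMR API values.

```python
def purchase_option(node_role: str) -> str:
    """Map an EMR node role to the purchasing option suggested in the card."""
    role = node_role.lower()
    # Master and core nodes coordinate the cluster or hold HDFS data,
    # so they should not be interrupted.
    if role in {"master", "core"}:
        return "on-demand"
    # Task nodes only run jobs, so interruption is tolerable.
    if role == "task":
        return "spot"
    raise ValueError(f"unknown node role: {node_role!r}")
```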
Identify the limitations of both EMR and Redshift regarding availability.
By default, both EMR and Redshift clusters run in a single Availability Zone (AZ), which means they are not highly available across AZs.
What happens to snapshots in Redshift?
Snapshots in Redshift are automatically taken and stored in S3.
Describe the default retention period for automated snapshots in Redshift.
Automated snapshots are retained for 1 day by default, but a retention period of up to 35 days can be selected.
How does Redshift Spectrum function with data in S3?
Redshift Spectrum directly queries data in S3 without loading it into Redshift first.
Define the requirement for using Redshift Spectrum.
Redshift Spectrum is not serverless; a Redshift cluster is still required.
What is the purpose of Federated Query in Redshift?
Federated Query allows Redshift to query data stored in other databases without loading it into Redshift first.
How can one mitigate the single-AZ risk in Redshift?
To mitigate the single-AZ risk, take regular snapshots; they can be restored to a new cluster in a different AZ if the original AZ fails.
Describe Amazon MQ.
Amazon MQ is a managed implementation of Apache ActiveMQ, providing functionality similar to SNS and SQS.
What messaging protocols does Amazon MQ support?
Amazon MQ supports JMS, AMQP, MQTT, OpenWire, and STOMP.
How does Amazon MQ handle messaging?
Amazon MQ allows for one-to-one and one-to-many messaging.
Where does Amazon MQ run?
Amazon MQ runs in a VPC either as a single instance or as a high availability (HA) pair (active/standby).