08 Analytics Flashcards
Describe Kinesis and its primary function.
Kinesis is a highly available, scalable streaming service that lets producers send data into streams, where consumers can process and analyze it.
How does Kinesis ensure scalability?
Kinesis uses sharding to provide scalability, with each shard supporting 1 MB/s of ingestion and 2 MB/s of read throughput.
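As a rough sizing sketch based on the per-shard limits above (1 MB/s in, 2 MB/s out), the shard count for a stream can be estimated as below. The function name and the example throughput figures are illustrative assumptions, and the sketch ignores any other per-shard limits.

```python
import math

INGEST_MB_PER_SHARD = 1.0  # per-shard ingestion limit from the card above
READ_MB_PER_SHARD = 2.0    # per-shard read limit from the card above


def shards_needed(ingest_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both throughput targets."""
    for_writes = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, for_writes, for_reads)


print(shards_needed(5.5, 8.0))   # writes dominate: 6 shards
print(shards_needed(2.0, 10.0))  # reads dominate: 5 shards
```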
What is the default data retention period in Kinesis?
Data in Kinesis is retained for 24 hours by default; retention can be increased to a maximum of 365 days.
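Retention is usually configured in hours (for example, the boto3 Kinesis call `increase_stream_retention_period` takes a `RetentionPeriodHours` argument). A minimal sketch of converting and range-checking a value against the 24-hour default and 365-day maximum; the helper name is a hypothetical one chosen for illustration.

```python
DEFAULT_RETENTION_HOURS = 24      # default retention from the card above
MAX_RETENTION_HOURS = 365 * 24    # 365 days expressed in hours (8760)


def retention_hours(days: int) -> int:
    """Convert whole days to hours, rejecting values outside 1 to 365 days."""
    hours = days * 24
    if not DEFAULT_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"retention must be between 1 and 365 days, got {days}")
    return hours


print(retention_hours(7))    # 168
print(retention_hours(365))  # 8760
```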
What is the role of Kinesis Data Firehose?
Kinesis Data Firehose delivers records, either read from a Kinesis stream or sent to it directly by producers, to supported destinations such as S3, OpenSearch, or Redshift.
Explain the near real-time aspect of Kinesis Data Firehose.
Kinesis Data Firehose is near real-time rather than real-time: it buffers incoming records and delivers them in batches once a size or time threshold is reached, so delivery lags ingestion.
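To see why batching introduces delay, here is a toy model of size-or-time buffering. It is an illustrative sketch only, not how Firehose is implemented: the class name and thresholds are assumptions, and this model only checks its thresholds when a record arrives.

```python
class DeliveryBuffer:
    """Toy size-or-time batching model; hypothetical, not the Firehose implementation."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the first buffered record

    def add(self, record: bytes, now: float):
        """Buffer a record; return the batch if a threshold is crossed, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            batch = self.records
            self.records, self.size, self.opened_at = [], 0, None
            return batch
        return None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
print(buf.add(b"aaaa", now=0.0))   # None: still buffering
print(buf.add(b"bbbb", now=1.0))   # None: still buffering
print(buf.add(b"cccc", now=2.0))   # size threshold reached, batch of 3 delivered
```

Records that arrive early in a batch wait until the batch closes, which is the source of the delay.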
How can data be transformed in Kinesis Data Firehose?
Data can be transformed in-flight using AWS Lambda functions.
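A minimal sketch of such a transformation function. Firehose passes the Lambda a batch of base64-encoded records and expects each one back with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the re-encoded `data`. The JSON payload and its `message` field are assumptions made for this example.

```python
import base64
import json


def lambda_handler(event, context):
    """Uppercase a hypothetical 'message' field on each record in the batch."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = str(payload.get("message", "")).upper()
            # Newline-delimit the output so records stay separable at the destination.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Undecodable or non-object payloads are returned unchanged and marked failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned rather than skipped.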
Describe the function of Kinesis Data Analytics.
Kinesis Data Analytics uses SQL to analyze streaming data in real time, with results sent to other Kinesis streams or to Kinesis Data Firehose.
What type of data can Kinesis Data Analytics reference?
Kinesis Data Analytics can reference static data from an S3 bucket.
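Conceptually, reference data acts as a lookup table joined against each streaming record. The service expresses this join in SQL; the Python below is only a model of the idea, with an in-memory CSV standing in for the S3 object and invented field names.

```python
import csv
import io

# Stand-in for a CSV object in S3; the contents are invented for illustration.
REFERENCE_CSV = "device_id,location\nd1,warehouse\nd2,dock\n"


def load_reference(csv_text: str) -> dict:
    """Build a device_id -> location lookup table from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["device_id"]: row["location"] for row in reader}


def enrich(records, reference):
    """Attach the static location to each streaming record."""
    for record in records:
        location = reference.get(record["device_id"], "unknown")
        yield {**record, "location": location}


reference = load_reference(REFERENCE_CSV)
stream = [{"device_id": "d1", "temp": 21.5}, {"device_id": "d9", "temp": 30.0}]
for row in enrich(stream, reference):
    print(row)
```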
Define Elastic MapReduce (EMR) and its purpose.
Elastic MapReduce (EMR) is a managed implementation of Apache Hadoop, designed to process large amounts of data using the Hadoop ecosystem.
What additional elements are included in Elastic MapReduce?
Elastic MapReduce includes other elements of the Apache ecosystem alongside Hadoop, such as Spark, Hive, HBase, Presto, and Flink.
Describe the architecture of an EMR cluster.
An EMR cluster consists of a master node, core nodes that provide HDFS storage and run jobs, and task nodes that only run jobs.
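The three node roles can be sketched as a list shaped like the `InstanceGroups` entries accepted by the boto3 EMR `run_job_flow` call, which uses the roles `MASTER`, `CORE`, and `TASK`. The group names, instance types, counts, and helper functions are assumptions for illustration.

```python
# Shaped like boto3 run_job_flow InstanceGroups; names, types, and counts are illustrative.
INSTANCE_GROUPS = [
    {"Name": "primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 4},
]


def count_nodes(groups, roles):
    """Sum the instance counts of the groups whose role is in `roles`."""
    return sum(g["InstanceCount"] for g in groups if g["InstanceRole"] in roles)


def hdfs_storage_nodes(groups):
    """Only core nodes provide HDFS storage."""
    return count_nodes(groups, {"CORE"})


def job_nodes(groups):
    """Both core and task nodes run jobs."""
    return count_nodes(groups, {"CORE", "TASK"})


print(hdfs_storage_nodes(INSTANCE_GROUPS))  # 2
print(job_nodes(INSTANCE_GROUPS))           # 6
```

Because task nodes hold no HDFS data, they can be added or removed without affecting stored data.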
How does EMRFS enhance durability compared to HDFS?
EMRFS is backed by S3, so data persists independently of the cluster; this makes it more durable than HDFS, whose data lives on cluster instances in a single Availability Zone (AZ).
Define Redshift and its primary use case.
Redshift is a petabyte-scale data warehouse optimized for Online Analytical Processing (OLAP) and uses a columnar data format.
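To illustrate why a columnar format suits OLAP, the sketch below stores the same small table row by row and column by column. An aggregate over one column only needs to touch that column's values in the columnar layout. The table contents are invented, and this is a conceptual model rather than Redshift's storage engine.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "eu", "amount": 40.0},
    {"order_id": 2, "region": "us", "amount": 25.5},
    {"order_id": 3, "region": "eu", "amount": 10.0},
]

# Column-oriented layout: each column's values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every full record is visited to read a single field.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: only the 'amount' column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 75.5 75.5
```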
Explain the node structure of a Redshift cluster.
A Redshift cluster has a single leader node, which handles client connections and query planning, and multiple compute nodes, which store data and execute the queries.
What is the significance of Enhanced VPC routing in Redshift?
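Enhanced VPC routing forces traffic between the Redshift cluster and its data repositories (for example, COPY and UNLOAD operations against S3) through the VPC, so it follows the VPC’s routing settings and network controls instead of using public endpoints.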