Data Analytics Flashcards
For how many hours is data available in the moving time window that a Kinesis Stream uses?
24 hours (can be increased to 7 days for additional cost)
How many MB per second does a single Shard in Kinesis allow for ingestion and consumption?
1 MB/s for Ingestion
2 MB/s for Consumption
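A minimal boto3 sketch of writing to a stream within these per-shard limits; the stream name and payload are hypothetical:

```python
import boto3

kinesis = boto3.client("kinesis")

# Records with the same partition key always land on the same shard,
# so a single hot key is capped at that shard's 1 MB/s write limit.
resp = kinesis.put_record(
    StreamName="example-stream",      # hypothetical stream name
    Data=b'{"event": "click"}',       # payload, up to 1 MB per record
    PartitionKey="user-42",
)
print(resp["ShardId"], resp["SequenceNumber"])
```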
How many Shards does a Kinesis Stream have when newly created?
1
What’s the maximum size of a single Kinesis Data Record?
1 MB
How quickly is data delivered using Kinesis Firehose?
Near-Real-Time: within at most 60 seconds (the minimum buffer interval), or sooner if the buffer (minimum size 1 MB) fills up first. Actual latency depends on the configured buffer size/interval and on how quickly data is ingested.
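A minimal boto3 sketch of pushing a record into Firehose; the delivery stream name is hypothetical:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose delivers in batches: records are flushed to the destination
# once the configured buffer size or buffer interval is reached.
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",   # hypothetical name
    Record={"Data": b'{"event": "click"}\n'},
)
```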
What are the 11 valid destinations for Kinesis Firehose?
- Amazon OpenSearch/Elasticsearch Service
- Amazon Redshift
- Amazon S3
- HTTP endpoints
- Datadog
- Dynatrace
- LogicMonitor
- MongoDB
- New Relic
- Splunk
- Sumo Logic
How quickly is data delivered through Kinesis Streams?
In Real-Time (~ 200 ms)
Not to be confused with Kinesis Firehose, which delivers Near-Real-Time only!
What’s the right product to use when (potentially complex) real-time SQL processing is required?
Kinesis Data Analytics
What are the six third-party big data products that Amazon EMR provides as a managed service?
- Spark
- Hadoop (incl. Pig)
- HBase
- Hive
- Hudi
- Presto
Is Amazon EMR a Multi-AZ or Single-AZ product?
Single-AZ
What compute products can be used with Amazon EMR (i.e. which compute products are used to run EMR)?
EC2 & EKS
What’s the master node used for with Amazon EMR?
- manages the cluster and its health
- distributes workloads
- acts as the NameNode for HDFS
- allows SSH access to the cluster
- if it’s the only node in the cluster: runs MapReduce workload
What are core nodes used for with Amazon EMR?
- provide HDFS (the Hadoop Distributed File System)
- run task trackers
- can run MapReduce workload
Note: losing a core node means losing HDFS data and task tracking => core nodes should not be run on Spot instances!
Note #2: Multi-node clusters have at least one core node.
What are task nodes used for with Amazon EMR?
- run MapReduce workload
Note: ideal to be run on Spot instances
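A hedged boto3 sketch tying the three node types together: one master, on-demand core nodes, and Spot task nodes. The cluster name, instance types, release label, and roles are assumptions:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="example-cluster",                 # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",              # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # Master node: manages the cluster, runs the NameNode
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes: provide HDFS, so keep them on-demand
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes: pure compute, safe to run on Spot
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",       # default roles, assumed to exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```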
What’s EMRFS?
An S3-backed file system for EMR. Can be used to store the results of EMR workloads persistently, so they remain available even if nodes fail or the cluster is terminated.
What’s the right product to use when you want to directly query S3 data via Redshift?
Redshift Spectrum
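A sketch of registering an external (Spectrum) schema and then querying S3 data through it, run here via the Redshift Data API; the cluster, database, IAM role ARN, and table names are all hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

# Register an external schema backed by the Glue Data Catalog (hypothetical names/ARN)
rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA spectrum
        FROM DATA CATALOG
        DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
)

# Query the S3-backed table directly from Redshift
rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT count(*) FROM spectrum.sales;",
)
```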
Is Amazon Redshift a Multi-AZ or Single-AZ product?
Single-AZ
What’s the role of the Leader Node in Amazon Redshift?
Receives queries, distributes the work to the Compute nodes for execution, and aggregates the results.
If you want to customize the network options for Amazon Redshift, what do you need to enable?
Enhanced VPC Routing
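A one-call boto3 sketch of enabling this on an existing cluster (the identifier is hypothetical):

```python
import boto3

redshift = boto3.client("redshift")

# Forces COPY/UNLOAD traffic through the cluster's VPC
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",   # hypothetical identifier
    EnhancedVpcRouting=True,
)
```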
At which intervals are automatic snapshots taken with Amazon Redshift?
Approximately every 8 hours, or after about every 5 GB of data changes per node
What are valid data sources for Amazon Redshift (name 7)?
- Amazon S3
- Amazon RDS
- Amazon DynamoDB
- Amazon EMR
- AWS Glue
- AWS Data Pipeline
- SSH-enabled host on Amazon EC2 or on-premises
What retention periods are available for automatic snapshots taken with Amazon Redshift?
Anything between 1 day (default) and 35 days.
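Retention can be changed on an existing cluster with the same boto3 call; the identifier and value here are just examples:

```python
import boto3

redshift = boto3.client("redshift")

redshift.modify_cluster(
    ClusterIdentifier="example-cluster",      # hypothetical identifier
    AutomatedSnapshotRetentionPeriod=14,      # any value from 1 to 35 days
)
```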
What are valid data sources for AWS Batch?
- AWS Step Functions
- AWS Lambda
- Amazon EventBridge
- Amazon S3
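A minimal sketch of one such integration: a Lambda handler (e.g. wired to an EventBridge rule or an S3 event) that submits a Batch job. The queue and job-definition names are hypothetical:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    # Submit a job to an existing queue/job definition (hypothetical names)
    resp = batch.submit_job(
        jobName="example-job",
        jobQueue="example-queue",
        jobDefinition="example-job-definition",
    )
    return resp["jobId"]
```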
What’s the right product to use for long-running (> 15 minutes) compute tasks?
NOT AWS Lambda!
Use AWS Batch, EC2, or ECS instead, for example.
Is AWS Batch serverless?
No
How many records per second does a single Shard in Kinesis allow for ingestion and consumption?
1,000 records/second for ingestion (consumption is limited to 5 GetRecords read transactions per second per shard)
What is the default limit for the number of Shards in Kinesis?
500, but this soft limit can be increased (effectively without an upper bound)
What does a Kinesis Data Record consist of?
Partition Key, Sequence Number, Data
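Reading a shard with boto3 shows exactly these fields on each record; the stream and shard here are hypothetical:

```python
import boto3

kinesis = boto3.client("kinesis")

it = kinesis.get_shard_iterator(
    StreamName="example-stream",            # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=it, Limit=10)["Records"]:
    # Each record exposes partition key, sequence number, and the data blob
    print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
```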
What are the two main file systems used with Amazon EMR and what are their key differences?
HDFS and EMRFS
HDFS is fast, but ephemeral.
EMRFS is slower, but persistent, as it is backed by S3.
What’s the Amazon Kinesis Client Library (KCL) and when would you use it?
A library for reading and processing data from an Amazon Kinesis data stream. It removes some of the heavy lifting of working with stream data (e.g. checkpointing progress and balancing shards across workers), making it more efficient than using the Kinesis API directly.
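A rough sketch of a KCL consumer using the amazon_kclpy package's v1-style interface; the method names and the `handle` helper are assumptions and may differ between package versions, so treat this as an illustration rather than the canonical API:

```python
from amazon_kclpy import kcl

class RecordProcessor(kcl.RecordProcessorBase):
    def initialize(self, shard_id):
        # Called once when the worker is assigned a shard
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        for record in records:
            # KCL handles shard iteration; we only look at the payload
            handle(record)  # hypothetical application-level function
        checkpointer.checkpoint()  # KCL tracks progress for us

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(RecordProcessor()).run()
```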
What are the differences between AWS Glue and AWS Data Pipeline in regards to the compute infrastructure (and control of that infrastructure)?
- AWS Glue is serverless (but uses Apache Spark behind the scenes); there is no direct control over the compute resources.
- AWS Data Pipeline spins up EMR clusters and EC2 instances, which can be accessed directly.
What are the differences between AWS Glue and AWS Data Pipeline in regards to the engines they use?
- AWS Glue uses a serverless Apache Spark engine and generates Scala or Python code
- AWS Data Pipeline uses Amazon EMR and is therefore flexible regarding the engine (Spark, Hive, Hudi, Pig, etc.)
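For illustration, a skeleton of the kind of Python (PySpark) script Glue generates and runs on its serverless Spark engine; the database, table, and bucket names are hypothetical:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a Glue Data Catalog table and write Parquet to S3 (hypothetical names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-output-bucket/out/"},
    format="parquet",
)
job.commit()
```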
What AWS service makes use of Apache Flink capabilities?
Kinesis Data Analytics