Data Analytics Flashcards
How long is data available in the moving time window that a Kinesis Stream uses?
24 hours by default (can be extended to 7 days, or up to 365 days with long-term retention, for additional cost)
What throughput does a single Shard in Kinesis allow for ingestion and consumption?
1 MB/s for Ingestion
2 MB/s for Consumption
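Worked example: the per-shard limits above translate directly into a shard count for a given workload. The following is a minimal sketch, assuming the limits apply per second and per shard; the function name `shards_needed` and the example throughput figures are made up for illustration.

```python
import math

# Per-shard limits taken from the card above (assumed to be per second).
INGEST_MB_PER_SHARD = 1.0
CONSUME_MB_PER_SHARD = 2.0


def shards_needed(ingest_mb_per_s, consume_mb_per_s):
    """Return the smallest shard count that satisfies both limits."""
    by_ingest = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    by_consume = math.ceil(consume_mb_per_s / CONSUME_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_consume)


# Hypothetical workload: 5 MB/s written, 12 MB/s read in total.
print(shards_needed(5, 12))
```

Here consumption is the bottleneck: 5 MB/s of writes needs 5 shards, but 12 MB/s of reads needs 6.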
How many Shards does a Kinesis Stream have when newly created?
1
What’s the maximum size of a single Kinesis Data Record?
1 MB
How quickly is data delivered using Kinesis Firehose?
Near-Real-Time: anywhere between 1 and 60 seconds, depending on the amount being ingested (i.e. how quickly the 1 MB buffer it uses fills up).
What are the 11 valid destinations for Kinesis Firehose?
- Amazon OpenSearch/Elasticsearch Service
- Amazon Redshift
- Amazon S3
- HTTP endpoints
- Datadog
- Dynatrace
- LogicMonitor
- MongoDB
- New Relic
- Splunk
- Sumo Logic
How quickly is data delivered through Kinesis Streams?
In Real-Time (~ 200 ms)
Not to be confused with Kinesis Firehose, which only delivers in Near-Real-Time!
What’s the right product to use when (potentially complex) real-time SQL processing is required?
Kinesis Data Analytics
What six third-party big data products does Amazon EMR provide as a managed service?
- Spark
- Hadoop (incl. Pig)
- HBase
- Hive
- Hudi
- Presto
Is Amazon EMR a Multi-AZ or Single-AZ product?
Single-AZ
What compute products can be used with Amazon EMR (i.e. which compute products are used to run EMR)?
EC2 & EKS
What’s the master node used for with Amazon EMR?
- manages the cluster and its health
- distributes workloads
- acts as the NameNode (of HDFS) within the Hadoop cluster
- allows SSH access to the cluster
- if it’s the only node in the cluster: runs MapReduce workload
What are core nodes used for with Amazon EMR?
- provide the HDFS (Hadoop Distributed File System)
- run task trackers
- can run MapReduce workload
Note: losing a core node means losing HDFS data and track of its tasks, so core nodes should not be run on Spot instances!
Note #2: Multi-node clusters have at least one core node.
What are task nodes used for with Amazon EMR?
- run MapReduce workload
Note: ideal to be run on Spot instances
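Worked example: the node-type cards above imply a purchasing split, with master and core nodes On-Demand and task nodes on Spot. The following is a minimal sketch of that split, assuming the instance-group shape that boto3's EMR `run_job_flow` accepts under `Instances.InstanceGroups`; the helper name `build_instance_groups`, the group names, and the instance type are placeholders, not recommendations.

```python
# Placeholder instance type, chosen only for illustration.
INSTANCE_TYPE = "m5.xlarge"


def build_instance_groups(core_count, task_count):
    """Build an instance-group list: master/core On-Demand, task on Spot."""
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",  # manages the cluster, must not disappear
            "InstanceType": INSTANCE_TYPE,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",  # holds HDFS data, so avoid Spot
            "InstanceType": INSTANCE_TYPE,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "task",
                "InstanceRole": "TASK",
                "Market": "SPOT",  # compute only, safe to lose
                "InstanceType": INSTANCE_TYPE,
                "InstanceCount": task_count,
            }
        )
    return groups


for group in build_instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

The returned list only describes the groups; actually launching a cluster would need further settings (release label, roles, applications) that are out of scope for this card.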
What’s EMRFS?
S3-based file system for EMR. Can be used to store the results of EMR workloads in S3, so they survive the loss of the cluster.