Data Analytics Flashcards by Daniel Lemke

For how many hours is data available in the moving time window that a Kinesis Stream uses?

24 hours by default (can be increased to 7 days, or up to 365 days with long-term retention, at additional cost)

How many MB per second does a single Shard in Kinesis allow for ingestion and consumption?

1 MB/s for ingestion

2 MB/s for consumption

How many Shards does a Kinesis Stream have when newly created?

What’s the maximum size of a single Kinesis Data Record?

1 MB

How quickly is data delivered using Kinesis Firehose?

Near-Real-Time, anything between 1-60 seconds (depends on the amount being ingested, i.e. how quickly the 1 MB buffer it uses is filled up).
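
The buffer arithmetic on this card can be sketched in a few lines. This is only a rough model: the 1 MB buffer and the 1-60 second bounds are taken from the card, and the function and parameter names are made up for illustration, not Firehose settings.

```python
def seconds_until_delivery(ingest_mb_per_sec, buffer_mb=1.0,
                           min_wait_sec=1.0, max_wait_sec=60.0):
    """Estimate the delivery delay: the time to fill the buffer,
    clamped to the 1-60 second window quoted on the card."""
    if ingest_mb_per_sec <= 0:
        # Nothing arrives, so only the time limit can trigger delivery.
        return max_wait_sec
    fill_time = buffer_mb / ingest_mb_per_sec
    return max(min_wait_sec, min(max_wait_sec, fill_time))


# A slow producer hits the time limit, a fast one hits the buffer limit.
for rate in (0.001, 0.5, 10.0):
    print(rate, "MB/s ->", seconds_until_delivery(rate), "s")
```

The faster the producer, the sooner the buffer fills, which is why delivery is near real time rather than real time.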

What are the 11 valid destinations for Kinesis Firehose?

Amazon OpenSearch/Elasticsearch Service
Amazon Redshift
Amazon S3
HTTP endpoints
Datadog
Dynatrace
LogicMonitor
MongoDB
New Relic
Splunk
Sumo Logic

How quickly is data delivered through Kinesis Streams?

In Real-Time (~ 200 ms)

Not to be confused with Kinesis Firehose, which delivers in near real time only!

What’s the right product to use when (potentially complex) real-time SQL processing is required?

Kinesis Data Analytics

Which six third-party big data products does Amazon EMR provide as a managed service?

Spark
Hadoop (incl. Pig)
HBase
Hive
Hudi
Presto

Is Amazon EMR a Multi-AZ or Single-AZ product?

Single-AZ

What compute products can be used with Amazon EMR (i.e. which compute products are used to run EMR)?

EC2 & EKS

What’s the master node used for with Amazon EMR?

manages the cluster and its health
distributes workloads
acts as the NameNode for HDFS
allows SSH access to the cluster
if it’s the only node in the cluster: runs MapReduce workload

What are core nodes used for with Amazon EMR?

provide the HDFS (Hadoop Distributed File System)
run task trackers
can run MapReduce workload

Note: losing a core node means losing HDFS data and track of tasks => core nodes should not be run on Spot instances!
Note #2: Multi-node clusters have at least one core node.

What are task nodes used for with Amazon EMR?

run MapReduce workload

Note: ideal to be run on Spot instances
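
The Spot guidance for the three node types could be written down roughly as follows. This sketch only builds a list shaped like the `InstanceGroups` parameter of boto3's EMR `run_job_flow` call and never contacts AWS. The group names, instance type, and node counts are hypothetical placeholders.

```python
def build_instance_groups(core_count=2, task_count=4,
                          instance_type="m5.xlarge"):
    """Return EMR instance groups: master and core nodes on On-Demand
    capacity, task nodes on Spot (they hold no HDFS data)."""
    return [
        {"Name": "master", "InstanceRole": "MASTER",
         "Market": "ON_DEMAND", "InstanceType": instance_type,
         "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE",
         "Market": "ON_DEMAND", "InstanceType": instance_type,
         "InstanceCount": core_count},
        {"Name": "task", "InstanceRole": "TASK",
         "Market": "SPOT", "InstanceType": instance_type,
         "InstanceCount": task_count},
    ]


for group in build_instance_groups():
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Losing a Spot task node only costs the work in flight on that node, while the master and core nodes stay on capacity that is not reclaimed.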

What’s EMRFS?

S3-based file system for EMR. Can be used to store results of EMR workloads to ensure resilience with EMR.

What’s the right product to use when you want to directly query S3 data via Redshift?

Redshift Spectrum

Is Amazon Redshift a Multi-AZ or Single-AZ product?

Single-AZ

What’s the role of the Leader Node in Amazon Redshift?

Receive query input and distribute it to Compute nodes for execution

If you want to customize the network options for Amazon Redshift, what do you need to enable?

Enhanced VPC Routing

At which intervals are automatic snapshots taken with Amazon Redshift?

Every ~8 hours or after every ~5 GB of data changes per node, whichever comes first
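
The "whichever comes first" logic behind this card can be expressed as a small check. The thresholds are the approximate values from the card, and the function is purely illustrative, not part of any Redshift API.

```python
def snapshot_due(hours_since_last, gb_changed,
                 max_hours=8.0, max_gb=5.0):
    """Return True once either threshold from the card is reached."""
    return hours_since_last >= max_hours or gb_changed >= max_gb


# A quiet cluster waits for the time limit; a busy one snapshots sooner.
print(snapshot_due(hours_since_last=3, gb_changed=1))    # neither limit hit
print(snapshot_due(hours_since_last=3, gb_changed=6))    # data limit hit
print(snapshot_due(hours_since_last=8.5, gb_changed=0))  # time limit hit
```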

What are valid data sources for Amazon Redshift (name 7)?

Amazon S3
Amazon RDS
Amazon DynamoDB
Amazon EMR
AWS Glue
AWS Data Pipeline 
SSH-enabled host on Amazon EC2 or on-premises

What retention periods are available for automatic snapshots taken with Amazon Redshift?

Anything from 1 day (default) up to 35 days.

What are valid data sources for AWS Batch?

AWS Step Functions
AWS Lambda
Amazon EventBridge
Amazon S3

What’s the right product to use for long-running (> 15 minutes) compute tasks?

NOT AWS Lambda!

Use AWS Batch, EC2, or ECS instead, for example.

Is AWS Batch serverless?

How many records per second does a single Shard in Kinesis allow for ingestion and consumption?

1,000 records / second for ingestion; consumption is limited to 5 read transactions / second per shard
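
Taken together, the per-shard figures on these cards (1 MB/s in, 2 MB/s out, 1,000 records/s in) give a quick way to estimate how many shards a stream needs. The helper below is a back-of-the-envelope sketch with made-up names, not an AWS sizing tool.

```python
import math

# Per-shard limits as quoted on the cards.
INGEST_MB_PER_SEC = 1
CONSUME_MB_PER_SEC = 2
INGEST_RECORDS_PER_SEC = 1000


def shards_needed(ingest_mb_s, consume_mb_s, records_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingest_mb_s / INGEST_MB_PER_SEC),
        math.ceil(consume_mb_s / CONSUME_MB_PER_SEC),
        math.ceil(records_s / INGEST_RECORDS_PER_SEC),
    )


# 3.5 MB/s in, 7 MB/s out, 2,500 records/s: the byte limits dominate.
print(shards_needed(ingest_mb_s=3.5, consume_mb_s=7, records_s=2500))  # 4
```

Whichever limit yields the highest count decides the stream size, so many small records can require more shards than the byte rate alone suggests.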

What is the default limit for the number of Shards in Kinesis?

500, but the limit can be raised on request with no fixed upper bound

What does a Kinesis Shard consist of?

A sequence of data records, each made up of a Partition Key, a Sequence Number, and the Data
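
The partition key is what decides which shard a record lands in: Kinesis hashes it with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below assumes the hash space is split into equal ranges, one per shard, which is a simplification. The function name is made up.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming the hash key
    space is divided evenly among shard_count shards."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# The same key always maps to the same shard, which preserves per-key order.
for key in ("user-1", "user-2", "user-1"):
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

A low-cardinality partition key concentrates traffic on a few shards, so keys with many distinct values spread load more evenly.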

What are the two main file systems used with Amazon EMR and what are their key differences?

HDFS and EMRFS. HDFS is fast but ephemeral. EMRFS is slower but persistent, as it is backed by S3.

What's the Amazon Kinesis Client Library (KCL) and when would you use it?

A library for reading and processing data from an Amazon Kinesis data stream. It removes some of the heavy lifting of working with stream data, which makes it more efficient than using the Kinesis API directly.

What are the differences between AWS Glue and AWS Data Pipeline with regard to the compute infrastructure (and control of that infrastructure)?

- AWS Glue is serverless (but uses Apache Spark behind the scenes). No direct control over the compute resources.
- AWS Data Pipeline spins up EMR clusters and EC2 instances, which can be accessed directly.

What are the differences between AWS Glue and AWS Data Pipeline with regard to the engines they use?

- AWS Glue uses a serverless Apache Spark engine and generates Scala or Python code.
- AWS Data Pipeline uses Amazon EMR and, through that, is flexible on the engine (Spark, Hive, Hudi, Pig, etc.).

What AWS service makes use of Apache Flink capabilities?

Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink)

(33 cards)