Data Analytics Flashcards

1
Q

How many hours is data available in the moving time window that Kinesis Stream uses?

A

24 hours by default (retention can be extended, up to 365 days, at additional cost)

2
Q

How many MB per second does a single Shard in Kinesis allow for ingestion and consumption?

A

1 MB/s for ingestion

2 MB/s for consumption

3
Q

How many Shards does a Kinesis Stream have when newly created?

A

1

4
Q

What’s the maximum size of a single Kinesis Data Record?

A

1 MB

5
Q

How quickly is data delivered using Kinesis Firehose?

A

Near-Real-Time, anything between 1-60 seconds (depends on the amount being ingested, i.e. how quickly the 1 MB buffer it uses is filled up).
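
A rough way to picture the answer above: delivery happens when the buffer fills or when the time limit is reached, whichever comes first. The sketch below only uses the two numbers from this card (1 MB buffer, 60 second ceiling); the function name and parameters are made up for illustration, and real Firehose buffering settings are configurable per delivery stream.

```python
def flush_delay_seconds(ingest_mb_per_sec, buffer_mb=1.0, max_interval_s=60.0):
    """Estimate how long data waits before delivery.

    Assumption: a batch is delivered as soon as the buffer is full
    or the maximum interval has elapsed, whichever happens first.
    """
    if ingest_mb_per_sec <= 0:
        # Nothing is filling the buffer, so only the timer can trigger delivery.
        return max_interval_s
    time_to_fill = buffer_mb / ingest_mb_per_sec
    return min(max_interval_s, time_to_fill)


# A busy stream fills the buffer quickly; a quiet one waits for the timer.
busy = flush_delay_seconds(0.5)     # 1 MB buffer at 0.5 MB/s
quiet = flush_delay_seconds(0.001)  # buffer would take 1000 s to fill
```

With these inputs, the busy stream is delivered after about 2 seconds, while the quiet stream waits the full 60 seconds.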

6
Q

What are the 11 valid destinations for Kinesis Firehose?

A
  • Amazon OpenSearch/Elasticsearch Service
  • Amazon Redshift
  • Amazon S3
  • HTTP endpoints
  • Datadog
  • Dynatrace
  • LogicMonitor
  • MongoDB
  • New Relic
  • Splunk
  • Sumo Logic
7
Q

How quickly is data delivered through Kinesis Streams?

A

In Real-Time (~ 200 ms)

Not to be confused with Kinesis Firehose, which delivers in near real time only!

8
Q

What’s the right product to use when (potentially complex) real-time SQL processing is required?

A

Kinesis Data Analytics

9
Q

Which six third-party big data products does Amazon EMR provide as a managed service?

A
  • Spark
  • Hadoop (incl. Pig)
  • HBase
  • Hive
  • Hudi
  • Presto
10
Q

Is Amazon EMR a Multi-AZ or Single-AZ product?

A

Single-AZ

11
Q

What compute products can be used with Amazon EMR (i.e. which compute products are used to run EMR)?

A

EC2 & EKS

12
Q

What’s the master node used for with Amazon EMR?

A
  • manages the cluster and its health
  • distributes workloads
  • acts as the NameNode for HDFS
  • allows SSH access to the cluster
  • if it’s the only node in the cluster: runs MapReduce workload
13
Q

What are core nodes used for with Amazon EMR?

A
  • provide HDFS (Hadoop Distributed File System)
  • run task trackers
  • can run MapReduce workload

Note: losing a core node means losing HDFS data and task tracking, so core nodes should not be run on Spot instances!
Note #2: Multi-node clusters have at least one core node.

14
Q

What are task nodes used for with Amazon EMR?

A
  • run MapReduce workload

Note: ideal to be run on Spot instances
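
The Spot guidance for the three node types can be sketched as a small check. The dictionaries below follow the shape of the instance groups that boto3's EMR `run_job_flow` call accepts (`InstanceRole`, `Market`), but no AWS call is made here; the group names, instance types, counts and the helper function are hypothetical and only for illustration.

```python
# Hypothetical cluster layout: names, instance types and counts are illustrative.
INSTANCE_GROUPS = [
    {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m5.xlarge", "InstanceCount": 4},
]


def risky_spot_groups(groups):
    """Return the names of groups that hold cluster state but run on Spot.

    Master and core nodes manage the cluster or store HDFS data, so
    losing them to a Spot interruption is costly; task nodes only run
    workload and are the natural fit for Spot.
    """
    return [
        g["Name"]
        for g in groups
        if g["InstanceRole"] in ("MASTER", "CORE") and g["Market"] == "SPOT"
    ]


# The layout above keeps stateful nodes on On-Demand capacity.
problems = risky_spot_groups(INSTANCE_GROUPS)
```

Here `problems` is an empty list; switching the core group to `"SPOT"` would make the check report it.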

15
Q

What’s EMRFS?

A

S3-backed file system for EMR. Storing the results of EMR workloads there keeps them available even if the cluster is lost.

16
Q

What’s the right product to use when you want to directly query S3 data via Redshift?

A

Redshift Spectrum

17
Q

Is Amazon Redshift a Multi-AZ or Single-AZ product?

A

Single-AZ

18
Q

What’s the role of the Leader Node in Amazon Redshift?

A

Receive query input and distribute it to Compute nodes for execution

19
Q

If you want to customize the network options for Amazon Redshift, what do you need to enable?

A

Enhanced VPC Routing

20
Q

At which intervals are automatic snapshots taken with Amazon Redshift?

A

Every ~8 hours or after every ~5 GB of data changes per node, whichever comes first
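
The "either / or" rule in this answer can be written as a one-line check. This is only a sketch of the rule as stated on the card, using the per-node reading of the 5 GB threshold from the AWS documentation; the function and parameter names are made up, and the actual scheduling is handled internally by Redshift.

```python
def snapshot_due(hours_since_last, gb_changed_per_node,
                 max_hours=8.0, max_gb_per_node=5.0):
    """Return True if either threshold for an automatic snapshot is reached."""
    return (hours_since_last >= max_hours
            or gb_changed_per_node >= max_gb_per_node)


# Heavy writes trigger a snapshot early; a quiet cluster waits for the timer.
early = snapshot_due(hours_since_last=2, gb_changed_per_node=6)
not_yet = snapshot_due(hours_since_last=2, gb_changed_per_node=1)
```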

21
Q

What are valid data sources for Amazon Redshift (name 7)?

A
  • Amazon S3
  • Amazon RDS
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Glue
  • AWS Data Pipeline
  • SSH-enabled host on Amazon EC2 or on-premises
22
Q

What retention periods are available for automatic snapshots taken with Amazon Redshift?

A

Anything from 1 day (default) up to 35 days.

23
Q

What are valid data sources for AWS Batch?

A
  • AWS Step Functions
  • AWS Lambda
  • Amazon EventBridge
  • Amazon S3
24
Q

What’s the right product to use for long-running (> 15 minutes) compute tasks?

A

NOT AWS Lambda!

Use AWS Batch, EC2, or ECS instead, for example.

25
Q

Is AWS Batch serverless?

A

No

26
Q

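How many records per second does a single Shard in Kinesis allow for ingestion and consumption?

A

1,000 records / second for ingestion

Consumption is limited by read calls rather than records: up to 5 read transactions per second per shard.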
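
Putting the per-shard limits from these cards together (1 MB/s and 1,000 records/s for ingestion, 2 MB/s for consumption), the number of shards a stream needs can be estimated as below. This is a back-of-the-envelope sketch; the function name and the example workload are hypothetical.

```python
import math

# Per-shard limits as stated on the cards.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
CONSUME_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / CONSUME_MB_PER_SEC),
    )


# Example workload: 3.5 MB/s written as 2,500 records/s, 9 MB/s read.
estimate = shards_needed(3.5, 2500, 9)
```

In this example the read side is the bottleneck: writes need 4 shards, the record rate needs 3, and reads need 5, so the estimate is 5.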

27
Q

What is the default limit for the number of Shards in Kinesis?

A

500, but this limit can be increased on request (there is no fixed upper bound)

28
Q

What does a Kinesis Data Record consist of?

A

Partition Key, Sequence Number, Data
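
The partition key decides which shard receives a record: Kinesis hashes the key with MD5 into a 128-bit integer, and each shard owns a contiguous range of those hash values. The sketch below illustrates the idea with evenly split ranges. The helper names are hypothetical, and reading the digest as a big-endian integer is an assumption made for illustration.

```python
import hashlib

MAX_HASH = 2**128 - 1  # MD5 produces 128-bit values


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count):
    """Split the hash space into contiguous (start, end) ranges, one per shard."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = MAX_HASH if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash value outside all ranges")


ranges = even_ranges(4)
shard = shard_for("user-42", ranges)  # the same key always maps to the same shard
```

Because the mapping depends only on the key, all records sharing a partition key land on the same shard, which is what preserves their ordering.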

29
Q

What are the two main file systems used with Amazon EMR and what are their key differences?

A

HDFS and EMRFS

HDFS is fast, but ephemeral.
EMRFS is slower, but persistent as backed by S3.

30
Q

What’s the Amazon Kinesis Client Library (KCL) and when would you use it?

A

Library for reading and processing data from an Amazon Kinesis data stream. It removes some of the heavy lifting when working with stream data and is therefore more efficient than using the Kinesis API directly.

31
Q

What are the differences between AWS Glue and AWS Data Pipeline with regard to the compute infrastructure (and control of that infrastructure)?

A
  • AWS Glue is serverless (but uses Apache Spark behind the scenes). No direct control over the compute resources.
  • AWS Data Pipeline spins up EMR clusters and EC2 instances, which can be accessed directly.
32
Q

What are the differences between AWS Glue and AWS Data Pipeline with regard to the engines they use?

A
  • AWS Glue uses a serverless Apache Spark engine and generates Scala or Python code
  • AWS Data Pipeline uses Amazon EMR and through that is flexible on the engine (Spark, Hive, Hudi, Pig, etc.)
33
Q

What AWS service makes use of Apache Flink capabilities?

A

Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink)