Collection: 18% (Kinesis Streams/Firehose, MSK, SQS, Data Pipeline, Snow family, DMS, IoT Core) Flashcards

Be able to: a) determine the operational characteristics of the collection system b) select a collection system that handles the frequency, volume and source of data

1
Q

What is the throughput capacity of a PutRecords call to a Kinesis Streams shard?

A

1 MiB of data (including partition keys) per second

1,000 records per second
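A producer has to keep each shard's share of traffic under both limits. A minimal sketch (illustrative only — the KPL handles this automatically in production) of grouping records bound for one shard into per-second batches:

```python
# Sketch: group records destined for a single shard into batches that
# respect the per-shard write limits (1 MiB and 1,000 records per second).
SHARD_MAX_BYTES = 1024 * 1024   # 1 MiB per second per shard
SHARD_MAX_RECORDS = 1000        # 1,000 records per second per shard

def batch_for_shard(records):
    """records: list of (partition_key, data_bytes) tuples."""
    batches, current, size = [], [], 0
    for key, data in records:
        # partition keys count toward the 1 MiB limit
        record_size = len(key.encode("utf-8")) + len(data)
        if current and (size + record_size > SHARD_MAX_BYTES
                        or len(current) >= SHARD_MAX_RECORDS):
            batches.append(current)
            current, size = [], 0
        current.append((key, data))
        size += record_size
    if current:
        batches.append(current)
    return batches
```

Each batch can then be sent in a separate second (or spread across shards by key).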

2
Q

What is the throughput capacity of a GetRecords call to the Kinesis Streams API?

A

2 MiB of data and 5 transactions per second, per shard

3
Q

What are the default and maximum Kinesis Stream record retention periods?

A

Default: 24 hours
Maximum: 7 days

4
Q

What is the difference between the aggregation and collection mechanisms in Kinesis Streams?

A

Aggregation batches multiple KPL user records into a single Streams record, increasing the effective payload size, improving throughput and making more efficient use of each shard

Collection batches multiple Streams records into a single HTTP request, reducing request overhead

5
Q

What are the three means of adding a record to a Kinesis Stream?

A

a) Kinesis Agent
b) Kinesis Streams REST API in the SDK
c) Kinesis Producer Library
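For the Kinesis Agent route, configuration is a small JSON file on the producer host. A minimal sketch (the file pattern and stream name are placeholders):

```json
{
  "flows": [
    {
      "filePattern": "/var/log/myapp/*.log",
      "kinesisStream": "example-stream"
    }
  ]
}
```

The agent tails files matching the pattern and writes each new line to the named stream as a record.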

6
Q

What are the two scenarios that may cause a ProvisionedThroughputExceededException? What can be done to address them?

A

a) Frequent checkpointing
b) Too many shards

Both can be addressed by provisioning additional throughput for the DynamoDB application state table

7
Q

Name four services, other than the Snow family, that move data into AWS

A

a) Direct Connect
b) Storage Gateway
c) S3 Transfer Acceleration
d) Database Migration Service

8
Q

What are the three advantages of using Direct Connect?

A

a) reduced costs
b) increased bandwidth throughput
c) consistent network performance

9
Q

Name three advantages of using Snowball

A

a) scales to petabytes of data
b) faster than transmitting the data over the network
c) avoids creating network bottlenecks

10
Q

Name two advantages of using Snowball Edge

A

a) scales to petabytes of data

b) supports local collection and processing of data despite intermittent connectivity

11
Q

What is the advantage of using Snowmobile?

A

Scales to exabytes of data

12
Q

What does Storage Gateway provide?

A

Hybrid on-prem/cloud storage using a software (VM) or hardware gateway appliance, with native integration with S3

13
Q

What is the advantage of using S3 Transfer Acceleration?

A

Enables fast uploads to S3 from geographically distant locations by routing data through CloudFront edge locations

14
Q

Name two advantages of using the Database Migration Service

A

a) the source database remains fully operational during the migration
b) supports continuous data replication

15
Q

Where does Kinesis Data Streams store shard state and checkpointing information?

A

DynamoDB — the KCL creates one table for each consumer application

16
Q

Which service is MSK most similar to?

A

Kinesis Data Streams

17
Q

Name two differences between Kinesis Data Streams and MSK

A

a) MSK performs slightly better

b) Kinesis is more fully managed

18
Q

What does “checkpointing” refer to in Kinesis Streams?

A

The tracking of records that have already been processed

19
Q

Name two consumers that are supported by Streams but not by Firehose

A

a) Spark

b) KCL

20
Q

Which Kinesis service supports multiple S3 destinations?

A

Kinesis Data Streams

21
Q

Kinesis Connector Library can be used to emit data to which four AWS data services?

A

a) S3
b) DynamoDB
c) Redshift
d) Elasticsearch

22
Q

What are the minimum and maximum sizes for a Kinesis Firehose delivery stream buffer?

A

1 MB to 128 MB

23
Q

What are the two streaming models supported by Kafka, and which third model enables them to work together?

A

Queuing and publish/subscribe

The partitioned log model combines the two

24
Q

Name the four Data Pipeline components

A

a) datanode (end destination)
b) activity (pipeline action)
c) precondition (readiness check)
d) schedule (activity timing)

25
Q

Name five differences between Kinesis Firehose and Kinesis Data Streams

A

a) Firehose is fully managed whereas Streams requires some manual configuration
b) Firehose has a somewhat greater latency
c) Firehose does not support data storage or replay
d) Firehose can load data directly into storage services
e) Firehose does not support the KCL

26
Q

Which service, Kinesis Firehose or Kinesis Data Streams, supports connection to multiple destinations?

A

Kinesis Data Streams

27
Q

Which AWS service has largely replaced Data Pipeline?

A

Lambda

28
Q

Which service creates a dedicated private network connection between a customer network and AWS?

A

Direct Connect

29
Q

How can the CloudWatch logs agent be integrated with Kinesis?

A

Log data can be shared cross-region and cross-account by configuring Kinesis Data Stream subscriptions

30
Q

When would it be appropriate to store application logs in S3?

A

Consolidating CloudTrail audit logs OR implementing serverless log analytics using Kinesis Analytics [uncertain, question from Milner post]

31
Q

How can the Managed Service for Kafka be integrated with Kinesis Data Analytics?

A

It can’t be

32
Q

Can Kinesis Data Streams integrate directly with any data storage services? If yes, how is it done? If no, what should be done instead?

A

No

Consumers running on EC2 or as Lambda functions must use the Kinesis Client Library to retrieve records from the stream, then emit them using a storage service connector from the Kinesis Connector Library

33
Q

Kinesis Firehose can integrate directly with which three data storage services?

A

S3, Elasticsearch, and Redshift

Integration with DynamoDB is not supported, and Kinesis Analytics and Splunk are not storage services

34
Q

Name three beneficial abilities of stream processing

A

a) decouples collection and processing, which may be operating at different rates
b) multiple ingestion streams can be merged to a combined stream for consumption
c) multiple endpoints can work on the same data in parallel

35
Q

Name three benefits of the Storage Gateway implementation

A

a) low latency, achieved through local caching of frequently accessed data
b) transfer optimisation, through sending only modified data and by compressing data prior to transfer
c) native integration with S3

36
Q

What are the three key use cases for Storage Gateway?

A

a) backups and archives to the cloud
b) reduction of on-prem storage by using cloud-backed file shares
c) on-prem applications that require low latency access to data stored in AWS since data is cached

37
Q

What is the difference between a KPL user record and a Kinesis Data Stream record?

A

A user record is a blob of data that has particular meaning to the user

A Streams record is an instance of the service API Record structure

38
Q

What is the maximum size of a Streams record data blob, before Base64 encoding?

A

1 MB

39
Q

What is the difference between the intended purpose of Kinesis Data Streams and the Simple Queue Service?

A

Kinesis Streams is designed for the real-time processing of large volumes of data

SQS is designed as a polled buffer for much smaller data packets, where messages are processed independently

40
Q

Name three things that Kinesis Data Streams can do that SQS cannot

A

a) preserve record ordering (note that SQS FIFO queues also do this)
b) route related records to the same consumer
c) allow multiple consumers to consume the same stream concurrently

41
Q

Name four things that SQS can do that Kinesis Data Streams cannot

A

a) track the successful completion of each item independently
b) support independent parallel workers, since each message is delivered to only one consumer at a time
c) support priority queues
d) scale transparently (Kinesis requires shard numbers to be adjusted manually)

42
Q

What does MQTT stand for, and how are MQTT messages authenticated by IoT Core?

A

Message Queuing Telemetry Transport, a lightweight pub/sub protocol

Using X.509 certificates

43
Q

What is the IoT Device Gateway?

A

A secure, scalable service that manages all active device connections, enabling their efficient communication with IoT Core over one of three low-level protocols

44
Q

Which three communications protocols are supported by the IoT Device Gateway?

A

a) HTTPS
b) Websockets
c) MQTT

Note that FTP is not supported

45
Q

What is the IoT Device Registry?

A

A central location for storing attributes related to each connected device, or device that may be connected in the future

46
Q

What is the IoT Device Shadow?

A

The last known state of each connected device, stored as a JSON document, and used as a message channel to send commands to that device through a uniform interface

47
Q

What is the IoT Core Rules Engine?

A

The Rules Engine enables continuous filtering, transformation and routing of incoming device messages according to configured rules, specified using an SQL-based syntax
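As a sketch, a rule that selects high-temperature readings from a hypothetical telemetry topic might look like the following (the topic filter and field names are placeholders):

```sql
SELECT temperature, deviceId
FROM 'sensors/+/telemetry'
WHERE temperature > 60
```

Messages matching the statement are handed to the rule's configured actions.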

48
Q

What are IoT Core rule actions?

A

Rule actions specify which action the Rules Engine should take when a rule is triggered on an incoming device message

49
Q

Name three major types of behaviour that can be specified by IoT Core rule actions

A

a) filtering and transforming incoming device data
b) routing device data to other AWS services, directly or via Lambda
c) triggering CloudWatch alerts

50
Q

Which five endpoints can the IoT Core Rules Engine route device data to directly, without the need for Lambda, other than data storage services?

A

a) Kinesis
b) Simple Notification Service
c) Simple Queue Service
d) CloudWatch
e) IoT Analytics

51
Q

What specific setting needs to be changed after adding a new topic for MSK?

A

The Zookeeper connection string [uncertain, from Milner post]

52
Q

Name four use cases for Kinesis Data Streams

A

a) accelerated log and data feed intake
b) realtime metrics and reporting
c) loading of aggregate data into a warehouse or MR cluster
d) complex stream processing

53
Q

Name three use cases for the Simple Queue Service

A

a) decoupling microservices
b) scheduling batch jobs
c) distributing tasks to worker nodes

54
Q

Which service does Firehose use to perform asynchronous batch transformations, and how are these carried out?

A

Lambda (function blueprints are available for common transformations)

Firehose buffers data using a specified size or interval, then invokes the specified Lambda function on each batch

Transformed data is sent back to Firehose for further buffering before being sent to the consumer
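A minimal sketch of such a transformation handler, modelled on the common data-processing blueprints (the uppercase transform stands in for real business logic; the surrounding request/response shape is what Firehose requires):

```python
import base64

def handler(event, context):
    """Firehose invokes this with a batch of buffered records."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        transformed = payload.decode("utf-8").upper().encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming recordId must appear in the response, marked Ok, Dropped, or ProcessingFailed.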

55
Q

What happens when a Firehose transformation fails?

A

Transformations can be retried up to three times
Failed records are sent to S3
Errors can also be sent to CloudWatch

56
Q

Which library is used to emit data from Kinesis Data Streams to an AWS data service?

A

Kinesis Connector Library (not to be confused with Kinesis Client Library)

57
Q

What is an alternative to using the Kinesis Client and Connector Libraries in order to send data from a Kinesis Data Stream to an AWS data service?

A

Lambda functions

58
Q

What are two primary purposes of Kinesis Data Streams?

A

a) accepting data as soon as it has been produced, without the need for batching
b) enabling custom applications to process and analyse streaming data

59
Q

What is a Kinesis shard?

A

A uniquely identified sequence of data records in a Kinesis stream, providing a fixed unit of throughput capacity

60
Q

Name four capabilities of the Kinesis Producer Library

A

a) using PutRecords to write multiple records to one or more shards per request
b) integrating with the KCL to provide consumer record aggregation and de-aggregation
c) providing an automatic and configurable retry mechanism
d) submitting CloudWatch metrics to provide performance visibility

61
Q

Name the five primary low-level tasks of the Kinesis Client Library, other than de-aggregation

A

a) connecting to a Stream and enumerating its shards
b) instantiating a record processor for every shard managed
c) pulling records from the stream and pushing them to the corresponding record processor
d) checkpointing processed records
e) rebalancing shard-worker associations when the worker or shard counts change

62
Q

What are the three components of a Kinesis Streams data record?

A

a) partition key
b) sequence number
c) data blob (aggregated when produced via the KPL)

63
Q

How does Data Pipeline integrate with on-premise servers?

A

A Task Runner installed on the on-premises hosts polls Data Pipeline for work and issues the appropriate commands to run the specified activity, e.g. running a stored procedure

64
Q

If Kinesis Firehose fails to deliver data to S3, how often will it retry delivery? What is the maximum retention period?

A

5 seconds

24 hours, following which the data is discarded

65
Q

What range of retry durations can be specified for a Firehose delivery stream to Redshift or Elasticsearch?

A

0 - 7200 seconds (2 hours)

66
Q

What are the default and maximum SQS retention times?

A

Default: 4 days
Maximum: 14 days

67
Q

What happens if Kinesis Firehose fails to deliver data from S3 to Redshift after the maximum retry period?

A

Firehose delivers the skipped files to the S3 bucket as a manifest file in the errors folder. The manifest can then be used to manually load the data into Redshift using the COPY command once the issue causing the failure has been addressed.
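Once the underlying issue is fixed, the manual reload might look like the following sketch (table name, bucket, and role ARN are placeholders):

```sql
COPY example_table
FROM 's3://example-firehose-bucket/errors/manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/ExampleRedshiftRole'
MANIFEST;
```

The MANIFEST keyword tells Redshift to treat the S3 object as a list of files to load rather than data itself.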

68
Q

What happens if Firehose data delivery falls behind ingestion?

A

Firehose will automatically increase the buffer size

69
Q

Name the two formats that Kinesis Firehose can convert data to prior to delivery

A

Parquet and ORC

70
Q

How does Kinesis Data Streams ensure data durability?

A

By synchronously replicating data across three Availability Zones

71
Q

Name two services that can efficiently deliver data from a third-party service to S3 for upload into Redshift

A

a) Data Pipeline, using a RedshiftCopyActivity with S3 and Redshift data nodes
b) Lambda, using the Redshift Database Loader

72
Q

What are the minimum and maximum intervals that Kinesis Firehose will buffer data before flushing a delivery stream?

A

60 - 900 seconds
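These buffering limits appear as the BufferingHints block of a delivery stream's destination configuration. A sketch of building that portion of the boto3 create_delivery_stream kwargs (bucket and role ARNs are placeholders):

```python
# Sketch: the buffering-related portion of a Firehose S3 destination
# configuration. IntervalInSeconds must fall in the 60-900 range and
# SizeInMBs in the 1-128 range.
def s3_destination_config(bucket_arn, role_arn, size_mb=5, interval_s=300):
    if not 60 <= interval_s <= 900:
        raise ValueError("IntervalInSeconds must be 60-900")
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        "BufferingHints": {
            "SizeInMBs": size_mb,
            "IntervalInSeconds": interval_s,
        },
    }
```

Firehose flushes the buffer whenever either hint is reached first.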

73
Q

Name two key features of Kinesis Stream records

A

a) each record is uniquely identified

b) records have a fixed unit of capacity

74
Q

Name two IoT components that connected devices can communicate with via the Device Gateway

A

a) the Rules Engine

b) Device Shadow service

75
Q

Name five key components of IoT Core

A

a) Device Gateway
b) Message Broker
c) Registry
d) Device Shadow
e) Rules Engine

76
Q

What is the IoT Message Broker?

A

A high-throughput, topic-based pub/sub service that enables the asynchronous transmission of messages over MQTT between devices and applications

77
Q

Which Kinesis service can be configured to compress data before delivery?

A

Kinesis Firehose

78
Q

What is the maximum number of records that can be returned by a GetRecords call to a Kinesis stream? What is the maximum size of the data returned?

A

10,000 records

10 MiB

79
Q

What is the maximum number of records that can be included in a PutRecords call to a Firehose delivery stream? What is the maximum size of the data? How many calls can be made per second?

A

500 records
5 MiB, including partition keys
2,500 requests per second

80
Q

The KPL can be used to aggregate data written to a Firehose delivery stream. On which two occasions are they de-aggregated?

A

Before delivery to the destination

If the stream is configured for transformation, the records are de-aggregated before delivery to Lambda

81
Q

What is the maximum propagation delay of Kinesis Streams standard consumers and of enhanced fan-out consumers? What about Firehose?

A

Standard: 200 milliseconds
Enhanced fan-out: 70 milliseconds
Firehose: 60 seconds

82
Q

What is resharding?

A

The splitting or merging of shards to meet changing traffic demands

Can be performed without restarting the stream and without impact on producers
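Splitting a shard evenly comes down to computing the NewStartingHashKey passed to the SplitShard API: each shard covers a range of the 128-bit MD5 hash key space, and splitting at the midpoint halves its traffic. A minimal sketch:

```python
# Sketch: computing the NewStartingHashKey for an even SplitShard call.
def split_point(starting_hash_key: int, ending_hash_key: int) -> int:
    """Midpoint of a shard's hash key range."""
    return (starting_hash_key + ending_hash_key) // 2

# A brand-new single-shard stream covers the entire key space:
full_range_end = 2**128 - 1
midpoint = split_point(0, full_range_end)
```

Merging is the inverse: two adjacent shards are combined back into one covering both ranges.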

83
Q

After a weekend, the content of a Kinesis stream is found to have disappeared. What explains this behaviour?

A

The default retention period is 24 hours, so the records would have been deleted

84
Q

Name two places where Kinesis Streams consumers can run

A

a) on EC2 instances

b) as Lambda functions

85
Q

Which three data storage endpoints can the IoT Core Rules Engine route device data to directly, without the need for Lambda?

A

a) S3
b) DynamoDB
c) Elasticsearch

Note that RDS and Redshift are not directly integrated

86
Q

How many Firehose delivery streams can each AWS account have concurrently?

A

5

87
Q

Both the number of streams per account and the number of PutRecord calls per stream per second have soft limits. What can be done if additional throughput is required?

A

Complete and submit the Firehose Limits form

88
Q

What is the throughput capacity of a PutRecord or PutRecordBatch call to a Firehose delivery stream?

A

1 MiB, 1,000 requests and 1,000 records per second in most regions

5 MiB, 2,000 requests and 5,000 records per second in N. Virginia, Oregon and Ireland

89
Q

What is the default shard quota for a Kinesis data stream?

A

200 in most regions

500 in N. Virginia, Oregon and Ireland

90
Q

Which three compression formats can Firehose apply to data before delivery to S3?

Which single format can Firehose apply if the data will be further loaded into Redshift?

A

gzip, zip, and Snappy

gzip

91
Q

What are you primarily charged for when using Firehose?

What additional charges might be levied?

A

Volume of data ingested (number of records multiplied by the size of each record, rounded up to the nearest 5 KB)

Data format conversion, and volume of outgoing data to a destination resident in a VPC
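The round-up matters for many small records. A sketch of the per-record billing calculation, assuming each record is rounded up to the nearest 5 KB:

```python
import math

ROUND_KB = 5  # Firehose rounds each ingested record up to the nearest 5 KB

def billed_kb(record_bytes: int) -> int:
    """Billable size of one record, in KB."""
    return math.ceil(record_bytes / (ROUND_KB * 1024)) * ROUND_KB

# e.g. a 1 KB record is billed as 5 KB; a 6 KB record as 10 KB
```

This is why batching small events into larger records can noticeably reduce ingestion cost.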

92
Q

What is the pricing model for Data Pipeline?

A

Use of Pipeline is free, but there may be charges for the resources used

93
Q

What is the maximum number of records that can be returned by a GetRecords call to a Firehose delivery stream?

A

Firehose is not integrated with the KCL; rather, data is delivered directly to a specified data storage service.

94
Q

Name a best practice when defining partition keys via the KPL to balance shard load

A

Keys should be generated randomly
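A minimal sketch of this practice — Kinesis hashes the partition key with MD5 to choose a shard, so effectively-random keys spread records evenly:

```python
import uuid

def random_partition_key() -> str:
    """A random partition key; 32 hex chars, well under the 256-char limit."""
    return uuid.uuid4().hex
```

Use a random key only when records need no shard affinity; records that must be processed in order by the same consumer should share a key instead.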