Collection Flashcards

1
Q

Give an overview of AWS Kinesis

A

A managed alternative to Apache Kafka
Great for application logs, metrics, IoT, and clickstreams
Real-time big data
Works with streaming processing frameworks (Spark, NiFi, ...)
Data is automatically replicated across 3 AZs

Kinesis Streams: low-latency streaming ingestion

Kinesis Analytics: runs real-time analytics on streams using SQL

Kinesis Firehose: loads streams into S3, Redshift, ElasticSearch & Splunk

2
Q

How do I create alerts and dashboards from my application data (click streams, IoT devices, metrics & logs)?

A

You can use Amazon Kinesis Streams to collect the data and send it to Kinesis Analytics, which generates alerts based on SQL queries. Kinesis Firehose can then deliver the data to Amazon Redshift or an S3 bucket, where dashboards can be built with QuickSight.

3
Q

Name five characteristics of Kinesis Streams

A

Streams are divided into ordered shards / partitions
Data retention is 24 hours by default, and can go up to 7 days
Ability to reprocess / replay data
Multiple applications can consume the same stream
Real-time processing with scalable throughput
Once data is inserted into Kinesis, it can't be deleted (immutability)

4
Q

How do the shards of Kinesis Streams work?

A

One stream is made of many different shards
Billing is per provisioned shard; you can have as many shards as you want
Batching available, or per-message calls
The number of shards can evolve over time (resharding: split / merge)
Records are ordered per shard

5
Q

How do Kinesis Streams records work?

A

They have 3 parts: Data Blob, Record Key, Sequence Number

The data blob is the transported data, serialized as bytes. Maximum size of 1 MB. It can represent anything.

The record key controls how records are grouped into shards. Same key = same shard. Use a highly distributed key to avoid the "hot partition" problem.

The sequence number is the unique identifier of each record inserted into a shard. It is added by Kinesis after ingestion.

6
Q

What is the maximum size of the data sent through Kinesis Streams?

A

The data blob has a maximum size of 1 MB.

7
Q

What are the limits of Kinesis Data Streams?

A

Producer:
• 1MB/s or 1000 messages/s at write PER SHARD
• “ProvisionedThroughputException” otherwise

Consumer Classic:
• 2MB/s at read PER SHARD across all consumers
• 5 API calls per second PER SHARD across all consumers

Consumer Enhanced Fan-Out:
• 2MB/s at read PER SHARD, PER ENHANCED CONSUMER
• No API calls needed (push model)

Data Retention:
• 24 hours data retention by default
• Can be extended to 7 days

8
Q

What is the name of the API a producer uses to insert records into a Kinesis Data Stream?

A

The APIs used are PutRecord (one record) and PutRecords (many records)
• PutRecords uses batching and increases throughput => fewer HTTP requests
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc.
• Use case: low throughput, higher latency, simple API, AWS Lambda
• Managed AWS sources for Kinesis Data Streams:
  • CloudWatch Logs
  • AWS IoT
  • Kinesis Data Analytics
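A minimal boto3 sketch of a PutRecords call (the stream name and payloads are placeholder assumptions):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Batch several records into one PutRecords call => fewer HTTP requests.
response = kinesis.put_records(
    StreamName="my-stream",  # placeholder stream name
    Records=[
        {
            "Data": json.dumps({"sensor": i, "temp": 20 + i}).encode("utf-8"),
            "PartitionKey": f"sensor-{i}",  # same key => same shard
        }
        for i in range(10)
    ],
)

# PutRecords is not all-or-nothing: check FailedRecordCount and retry failures.
print("Failed records:", response["FailedRecordCount"])
```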

9
Q

What exception is raised when sending more data than is accepted? How do you solve it?

A

The exception is ProvisionedThroughputExceededException.

Solutions:
• Make sure you don't have a hot shard (e.g., a bad partition key sending too much data to one shard)
• Retries with exponential backoff (see the sketch below)
• Increase shards (scaling)
• Ensure your partition key is well distributed
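A sketch of the retry-with-backoff fix, assuming boto3 (stream name and payload are placeholders):

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, max_retries: int = 5):
    """Retry PutRecord with exponential backoff on throttling errors."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # placeholder stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise  # not a throttling error: fail fast
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after retries; consider adding shards")
```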

10
Q

Explain what the Kinesis Producer Library (KPL) is and how it works.

A

Easy-to-use and highly configurable C++ / Java library
Used for building high-performance, long-running producers
Automated and configurable retry mechanism
Synchronous or asynchronous API (better performance with async)
Submits metrics to CloudWatch for monitoring
Batching (both features turned on by default) – increases throughput, decreases cost:
• Collect: records are gathered and written to multiple shards in the same PutRecords API call
• Aggregate: stores multiple records in one record (to go over the 1,000 records/s limit) and increases payload size to improve throughput (maximize the 1 MB/s limit) – at the cost of increased latency
Compression must be implemented by the user
KPL records must be decoded with the KCL or a special helper library

11
Q

How does Kinesis Producer Library (KPL) batching work? When should you not use batching?

A

A collection of records is built up so they can be sent in a single PutRecords API call. You can tune the batching efficiency by introducing a delay with RecordMaxBufferedTime.

Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.

12
Q

What is the Kinesis Agent?

A

Monitors log files and sends them to Kinesis Data Streams
Java-based agent, built on top of the KPL
Installed in Linux-based server environments
Features:
• Write from multiple directories and write to multiple streams
• Routing feature based on directory / log file
• Pre-process data before sending it to streams (single line, CSV to JSON, log to JSON...)
• The agent handles file rotation, checkpointing, and retry upon failures
• Emits metrics to CloudWatch for monitoring

13
Q

Give examples of Kinesis consumers

A
Kinesis SDK
Kinesis Client Library (KCL)
Kinesis Connector Library
3rd-party libraries: Spark, Log4J Appenders, Flume, Kafka Connect...
Kinesis Firehose
AWS Lambda
(Kinesis Consumer Enhanced Fan-Out is discussed in a later card)
14
Q

How does the Kinesis Consumer SDK (GetRecords) work?

A
Classic Kinesis: records are polled by consumers from a shard
Each shard has 2 MB total aggregate throughput
GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
Maximum of 5 GetRecords API calls per shard per second => 200 ms latency
If 5 consumer applications consume from the same shard, each consumer can poll once per second and receives less than 400 KB/s
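A minimal polling-consumer sketch with the SDK (stream and shard IDs are placeholders); the sleep keeps it under the 5 GetRecords calls per shard per second limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start reading one shard from its oldest available record (placeholder names).
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(0.2)  # 5 calls/s per shard => one poll every 200 ms
```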
15
Q

How does the Kinesis Client Library (KCL) work?

A

• Java-first library, but it exists for other languages too (Golang, Python, Ruby, Node, .NET...)
• Reads records from Kinesis produced with the KPL (de-aggregation)
• Shares multiple shards across multiple consumers in one "group"; shard discovery
• Checkpointing feature to resume progress (in case your application goes down)
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
  • Make sure you provision enough WCU / RCU
  • Or use On-Demand for DynamoDB
  • Otherwise DynamoDB may slow down the KCL
• Record processors will process the data
• ExpiredIteratorException => increase WCU (problem checkpointing progress on DynamoDB)
16
Q

How does AWS Lambda sourcing from Kinesis work?

A

• AWS Lambda can source records from Kinesis Data Streams
• The Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
  • Amazon S3
  • DynamoDB
  • Redshift
  • ElasticSearch
  • Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in the Lambda section)
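A sketch of a Lambda handler on a Kinesis event source; record payloads arrive base64-encoded, and the downstream target is left as a comment:

```python
import base64
import json

def handler(event, context):
    """Lightweight ETL over a batch of Kinesis records delivered to Lambda."""
    for record in event["Records"]:
        # Kinesis payloads are base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # ... write to S3 / DynamoDB / Redshift / ElasticSearch here ...
        print(item)
    return {"processed": len(event["Records"])}
```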

17
Q

How does Kinesis Enhanced Fan-Out work?

A

• Game-changing feature introduced in August 2018
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers get 40 MB/s per shard, aggregated – no more shared 2 MB/s limit!
• Enhanced Fan-Out: Kinesis pushes data to consumers over HTTP/2
• Reduced latency (~70 ms)

18
Q

What is the difference between Enhanced Fan-Out and standard consumers?

A
• Standard consumers:
  • Low number of consuming applications (1, 2, 3...)
  • Can tolerate ~200 ms latency
  • Minimize cost

• Enhanced Fan-Out consumers:
  • Multiple consumer applications for the same stream
  • Low latency requirements (~70 ms)
  • Higher costs (see the Kinesis pricing page)
  • Default limit of 5 consumers using enhanced fan-out per data stream

19
Q

How does adding shards work in Kinesis?

A

• Also called "Shard Splitting"
• Can be used to increase the stream capacity (1 MB/s data in per shard)
• Can be used to divide a "hot shard"
• The old shard is closed and will be deleted once its data expires

(slide 39, lecture 11)

20
Q

How does merging shards work in Kinesis?

A

• Decreases the stream capacity and saves costs
• Can be used to group two shards with low traffic
• Old shards are closed and deleted based on data expiration

(slide 40, lecture 11)

21
Q

After resharding, how do you guarantee consistency of records between the parent shard and the child shards?

A
• After a reshard, you can read from child shards
• However, data you haven't read yet could still be in the parent
• If you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
• After a reshard, read entirely from the parent until you no longer get new records
• Note: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations

(Slide 41, Lecture 11)

22
Q

How is auto scaling done in Kinesis?

A
• Auto scaling is not a native feature of Kinesis
• The API call to change the number of shards is UpdateShardCount (see the sketch below)
• We can implement auto scaling with AWS Lambda
• See: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/

(Slide 42, Lecture 11)
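The API call itself, sketched with boto3 (stream name and target count are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count; mind the rolling 24-hour scaling limits.
kinesis.update_shard_count(
    StreamName="my-stream",        # placeholder stream name
    TargetShardCount=4,            # placeholder target
    ScalingType="UNIFORM_SCALING",
)
```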

23
Q

What are the limitations of Kinesis scaling?

A

• Resharding cannot be done in parallel. Plan capacity in advance
• You can only perform one resharding operation at a time, and it takes a few seconds
• For 1,000 shards, it takes 30,000 seconds (8.3 hours) to double the shards to 2,000
• You can't do the following:
  • Scale more than twice for each rolling 24-hour period for each stream
  • Scale up to more than double your current shard count for a stream
  • Scale down below half your current shard count for a stream
  • Scale up to more than 500 shards in a stream
  • Scale a stream with more than 500 shards down unless the result is fewer than 500 shards
  • Scale up to more than the shard limit for your account

24
Q

How do you handle data duplication for Kinesis producers?

A

• Producer retries can create duplicates due to network timeouts (the ack never arrives, but Kinesis did receive the data)
• Although the two records have identical data, they have unique sequence numbers
• Fix: embed a unique record ID in the data to de-duplicate on the consumer side

(Slide 45, Lecture 12)

25
Q

How do you handle data duplication for Kinesis consumers?

A
  • Consumer retries can make your application read the same data twice
  • Consumer retries happen when record processors restart:
    • A worker terminates unexpectedly
    • Worker instances are added or removed
    • Shards are merged or split
    • The application is deployed
  • Fixes:
    • Make your consumer application idempotent
    • If the final destination can handle duplicates, it’s recommended to do it there
  • More info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html

(Slide 46, Lecture 12)
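One way to make a consumer idempotent, sketched under the assumption that a unique record ID is embedded in the data and that a DynamoDB table (placeholder name "processed-records", partition key "record_id") tracks what was already processed; a conditional put rejects duplicates:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-records")  # placeholder table

def handle_once(record_id: str, payload: dict) -> bool:
    """Process a record only if its ID has never been seen before."""
    try:
        table.put_item(
            Item={"record_id": record_id, **payload},
            ConditionExpression="attribute_not_exists(record_id)",
        )
        return True   # first delivery: do the real work exactly once
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery: safely ignore
        raise
```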

26
Q

How does security work in Kinesis?

A
  • Control access / authorization using IAM policies
  • Encryption in flight using HTTPS endpoints
  • Encryption at rest using KMS
  • Client side encryption must be manually implemented (harder)
  • VPC Endpoints available for Kinesis to access within VPC
27
Q

How does AWS Kinesis Data Firehose work?

A

Fully managed service, no administration
Near real time (60-second latency minimum for non-full batches)
Loads data into Redshift / Amazon S3 / ElasticSearch / Splunk
Automatic scaling
Supports many data formats
Data conversion from JSON to Parquet / ORC (only for S3)
Data transformation through AWS Lambda (e.g., CSV => JSON)
Supports compression when the target is Amazon S3 (GZIP, ZIP, and SNAPPY)
Only GZIP if the data is further loaded into Redshift
Pay for the amount of data going through Firehose
Spark / KCL do not read from KDF

28
Q

What are the sources for KDF (Kinesis Data Firehose)?

A
SDK (Kinesis Producer Library - KPL)
Kinesis Agent
Kinesis Data Streams
CloudWatch Logs & Events
IoT rules actions

(slide 48)

29
Q

What are the destinations of KDF (Kinesis Data Firehose)?

A

Amazon S3
Redshift
ElasticSearch
Splunk

(slide 48)

30
Q

How does KDF (Kinesis Data Firehose) delivery work?

A

Diagram on slide 49

Ingestion from the source into KDF → data transformation (several "blueprint" templates available) → output: Amazon S3 + COPY into Amazon Redshift

Also:
Source records → KDF → Amazon S3
KDF → transformation failures → Amazon S3
KDF → delivery failures → Amazon S3

(slide 49)

31
Q

How does Firehose buffer sizing work?

A
  • Firehose accumulates records in a buffer
  • The buffer is flushed based on time and size rules
  • Buffer Size (ex: 32MB): if that buffer size is reached, it’s flushed
  • Buffer Time (ex: 2 minutes): if that time is reached, it’s flushed
  • Firehose can automatically increase the buffer size to increase throughput
  • High throughput => Buffer Size will be hit
  • Low throughput => Buffer Time will be hit
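Buffer size and buffer time map to BufferingHints when creating a delivery stream; a hedged boto3 sketch (names and ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-bucket",                      # placeholder
        # Flush on whichever rule is hit first: 32 MB or 120 seconds.
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
        "CompressionFormat": "GZIP",
    },
)
```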
32
Q

What is the difference between Kinesis Data Streams and Firehose?

A

Streams
• Going to write custom code (producer / consumer)
• Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
• Must manage scaling (shard splitting / merging)
• Data Storage for 1 to 7 days, replay capability, multi consumers
• Use with Lambda to insert data in real-time to ElasticSearch (for example)

Firehose
• Fully managed, send to S3, Splunk, Redshift, ElasticSearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated Scaling
• No data storage

33
Q

How do CloudWatch Logs Subscription Filters work?

A
  • You can stream CloudWatch Logs into
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • AWS Lambda
  • Using CloudWatch Logs Subscriptions Filters
  • You can enable them using the AWS CLI

NEAR REAL TIME
CloudWatch Logs → Subscription Filter → Kinesis Data Firehose + Lambda function enrichment (transformation) → (near real time) Amazon ES

REAL TIME
CloudWatch Logs → Subscription Filter → Lambda function → (real time) Amazon ES

REAL TIME ANALYTICS
CloudWatch Logs → Subscription Filter → Kinesis Data Stream → Kinesis Data Analytics → Lambda function
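Enabling a subscription filter through the API (the boto3 equivalent of the CLI call); log group, names, and ARNs are placeholders:

```python
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/my/app/logs",   # placeholder log group
    filterName="to-firehose",      # placeholder filter name
    filterPattern="",              # empty pattern = forward every log event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/my-stream",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-firehose",  # placeholder
)
```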

34
Q

How does the AWS SQS Standard Queue work?

A

It’s a queue
Fully managed
Scales from 1 message per second to 10,000s per second
Default retention of messages: 4 days, maximum of 14 days
No limit to how many messages can be in the queue
Low latency (<10 ms on publish and receive)
Horizontal scaling in terms of number of consumers
Can have duplicate messages (at least once delivery, occasionally)
Can have out of order messages (best effort ordering)
Limitation of 256KB per message sent

35
Q

How do we produce messages in SQS?

A
  • Define Body
  • Add message attributes (metadata – optional)
  • Provide Delay Delivery (optional)
  • Get back
    • Message identifier
    • MD5 hash of the body

(slide 58)
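A minimal produce sketch with boto3 (queue URL, body, and attributes are placeholders); note the message ID and MD5 hash returned:

```python
import boto3

sqs = boto3.client("sqs")

response = sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
    MessageBody="order-created",
    DelaySeconds=0,  # optional delay delivery (up to 900 seconds)
    MessageAttributes={  # optional metadata
        "OrderId": {"DataType": "String", "StringValue": "42"},
    },
)

print(response["MessageId"], response["MD5OfMessageBody"])
```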

36
Q

How do we consume messages in SQS?

A
  • Poll SQS for messages (receive up to 10 messages at a time)
  • Process the message within the visibility timeout
  • Delete the message using the message ID & receipt handle

SQS → poll messages → message → process message → delete message → SQS

(slide 59)
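A consume-loop sketch with boto3 (queue URL and handler are placeholders): poll up to 10 messages, process each within the visibility timeout, then delete by receipt handle:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def process(body: str) -> None:
    print("handling", body)  # stand-in for real work

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # up to 10 messages per poll
    WaitTimeSeconds=10,      # long polling
)

for msg in resp.get("Messages", []):
    process(msg["Body"])  # must finish within the visibility timeout
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```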

37
Q

How does AWS SQS - FIFO Queue work?

A
  • Newer offering (First In - First out) – not available in all regions!
  • Name of the queue must end in .fifo
  • Lower throughput (up to 3,000 per second with batching, 300/s without)
  • Messages are processed in order by the consumer
  • Messages are sent exactly once
  • 5-minute interval de-duplication using a "Deduplication ID"
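Sending to a FIFO queue requires a MessageGroupId (the ordering scope) and, unless content-based deduplication is enabled, a MessageDeduplicationId; a sketch with placeholder values:

```python
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo",  # placeholder
    MessageBody="payment-received",
    MessageGroupId="customer-42",       # messages within a group stay ordered
    MessageDeduplicationId="evt-0001",  # de-duplicated within the 5-minute window
)
```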
38
Q

The message size limit is 256 KB; how do you send large messages?

A

Using the SQS Extended Client (Java Library)

Producer → small metadata message → SQS queue → small metadata message → Consumer

Producer → sends the large message to S3
Consumer → retrieves the large message from S3

(slide 61)

39
Q

What are the use cases for AWS SQS?

A
  • Decouple applications (for example to handle payments asynchronously)
  • Buffer writes to a database (for example a voting application)
  • Handle large loads of messages coming in (for example an email sender)

• SQS can be integrated with Auto Scaling through CloudWatch!

40
Q

What are the limits for SQS?

A
  • Maximum of 120,000 in-flight messages being processed by consumers
  • Batch Request has a maximum of 10 messages – max 256KB
  • Message content is XML, JSON, Unformatted text
  • Standard queues have an unlimited TPS
  • FIFO queues support up to 3,000 messages per second (using batching)
  • Max message size is 256KB (or use Extended Client)
  • Data retention from 1 minute to 14 days
  • Pricing:
    • Pay per API Request
    • Pay per network usage
41
Q

How does AWS SQS Security work?

A
  • Encryption in flight using the HTTPS endpoint
  • Can enable SSE (Server Side Encryption) using KMS
    • Can set the CMK (Customer Master Key) we want to use
    • SSE only encrypts the body, not the metadata (message ID, timestamp, attributes)
  • IAM policy must allow usage of SQS
  • SQS queue access policy
    • Finer grained control over IP
    • Control over the time the requests come in
42
Q

What are the differences between Kinesis Data Streams and SQS?

A

Kinesis Data Stream:
• Data can be consumed many times
• Data is deleted after the retention period
• Ordering of records is preserved (at the shard level) – even during replays
• Build multiple applications reading from the same stream independently (Pub/Sub)
• “Streaming MapReduce” querying capability
• Checkpointing needed to track progress of consumption
• Shards (capacity) must be provided ahead of time

SQS:
• Queue, decouples applications
• One application per queue
• Records are deleted after consumption (ack / fail)
• Messages are processed independently for standard queues
• Ordering for FIFO queues
• Capability to "delay" messages
• Dynamic scaling of load (no-ops)

(See the table on slide 66)

43
Q

What is IoT Device Gateway?

A
  • Serves as the entry point for IoT devices connecting to AWS
  • Allows devices to securely and efficiently communicate with AWS IoT
  • Supports the MQTT, WebSockets, and HTTP 1.1 protocols
  • Fully managed and scales automatically to support over a billion devices
  • No need to manage any infrastructure
44
Q

What is IoT Message Broker?

A
  • Pub/sub (publishers/subscribers) messaging pattern - low latency
  • Devices can communicate with one another this way
  • Messages sent using the MQTT, WebSockets, or HTTP 1.1 protocols
  • Messages are published into topics (just like SNS)
  • Message Broker forwards messages to all clients connected to the topic

Thing → Device Gateway → message broker / topic

(Slide 70)

45
Q

What is IoT Thing Registry (IAM of IoT)?

A

All connected IoT devices are represented in the AWS IoT registry
Organizes the resources associated with each device in the AWS Cloud
Each device gets a unique ID
Supports metadata for each device (ex: Celsius vs Fahrenheit, etc…)
Can create X.509 certificate to help IoT devices connect to AWS
IoT Groups: group devices together and apply permissions to the group

46
Q

How can we authenticate with the IoT Device Gateway?

A

3 possible authentication methods for Things:
• Create X.509 certificates and load them securely onto the Things
• AWS SigV4
• Custom tokens with Custom authorizers

For mobile apps:
• Cognito identities (extension to Google, Facebook login, etc…)

Web / Desktop / CLI:
• IAM
• Federated Identities

47
Q

How does authorization for IoT Devices work?

A

AWS IoT policies:
• Attached to X.509 certificates or Cognito identities
• Able to revoke any device at any time
• IoT policies are JSON documents
• Can be attached to groups instead of individual Things

IAM policies:
• Attached to users, groups, or roles
• Used for controlling IoT AWS APIs

48
Q

What is a Device Shadow?

A
  • JSON document representing the state of a connected Thing
  • We can set the state to a different desired state (ex: light on)
  • The IoT thing will retrieve the state when online and adapt

Lightbulb (off) → IoT reported state (off) → change state (AWS API) using a mobile application → IoT desired state (on) – device shadow → synchronization of state → Lightbulb (on)
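A sketch of setting a desired state through the shadow API with boto3 (thing name and state keys are placeholders):

```python
import json
import boto3

iot_data = boto3.client("iot-data")

# Ask the device to turn on; it reads the desired state when it comes online.
iot_data.update_thing_shadow(
    thingName="lightbulb-1",  # placeholder thing name
    payload=json.dumps({"state": {"desired": {"power": "on"}}}),
)
```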

49
Q

What is Rules Engine?

A
  • Rules are defined on the MQTT topics
  • Rule = when it's triggered | Action = what it does
  • Rules need IAM Roles to perform their actions
  • Rules use cases:
    • Augment or filter data received from a device
    • Write data received from a device to a DynamoDB database
    • Save a file to S3
    • Send a push notification to all users using SNS