Collection Flashcards

1
Q

Give an overview of AWS Kinesis

A

A managed alternative to Apache Kafka
Great for application logs, metrics, IoT, and clickstreams
Real-time big data
Works with streaming processing frameworks (Spark, NiFi, ...)
Data is automatically replicated across 3 AZs

Kinesis Streams: low-latency streaming ingestion

Kinesis Analytics: runs real-time analytics on streams using SQL

Kinesis Firehose: loads streams into S3, Redshift, ElasticSearch & Splunk

2
Q

How do I create alerts and dashboards from my application data (click streams, IoT devices, metrics & logs)?

A

You can use Amazon Kinesis Streams to collect the data and send it to Kinesis Analytics, which generates alerts based on SQL queries. Kinesis Firehose can then deliver the data to Amazon Redshift or an S3 bucket, where dashboards can be built with QuickSight.

3
Q

Name five characteristics of Kinesis Streams

A

Streams are divided into ordered shards / partitions
Data retention is 24 hours by default, and can go up to 7 days
Ability to reprocess / replay data
Multiple applications can consume the same stream
Real-time processing with scalable throughput
Once data is inserted into Kinesis, it can't be deleted (immutability)

4
Q

How do the shards of Kinesis Streams work?

A

One stream is made of many different shards
Billing is per provisioned shard; you can have as many shards as you want
Batching available, or per-message calls
The number of shards can evolve over time (resharding: split / merge)
Records are ordered per shard

5
Q

How do Kinesis Streams records work?

A

They have 3 parts: Data Blob, Record Key, Sequence Number

The data blob is the transported data, serialized as bytes. Maximum size of 1 MB. It can represent anything.

The record key controls how records are grouped into shards. Same key = same shard. Use a highly distributed key to avoid the "hot partition" problem.

The sequence number is the unique identifier of each record inserted into a shard. It is added by Kinesis after ingestion.

6
Q

What is the maximum size of the data sent through Kinesis Streams?

A

The data blob has a maximum size of 1 MB.

7
Q

What are the limits of Kinesis Data Streams?

A

Producer:
• 1MB/s or 1000 messages/s at write PER SHARD
• “ProvisionedThroughputException” otherwise

Consumer Classic:
• 2MB/s at read PER SHARD across all consumers
• 5 API calls per second PER SHARD across all consumers

Consumer Enhanced Fan-Out:
• 2MB/s at read PER SHARD, PER ENHANCED CONSUMER
• No API calls needed (push model)

Data Retention:
• 24 hours data retention by default
• Can be extended to 7 days

8
Q

What is the name of the API a producer uses to insert records into a Kinesis Data Stream?

A

The APIs used are PutRecord (one record) and PutRecords (many records)
• PutRecords uses batching and increases throughput => fewer HTTP requests
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc.
• Use case: low throughput, higher latency, simple API, AWS Lambda
• Managed AWS sources for Kinesis Data Streams:
  • CloudWatch Logs
  • AWS IoT
  • Kinesis Data Analytics
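A minimal boto3 sketch of a PutRecords call (the stream name and payloads are placeholder assumptions):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Batch several records into one PutRecords call => fewer HTTP requests.
response = kinesis.put_records(
    StreamName="my-stream",  # placeholder stream name
    Records=[
        {
            "Data": json.dumps({"sensor": i, "temp": 20 + i}).encode("utf-8"),
            "PartitionKey": f"sensor-{i}",  # same key => same shard
        }
        for i in range(10)
    ],
)

# PutRecords is not all-or-nothing: check FailedRecordCount and retry failures.
print("Failed records:", response["FailedRecordCount"])
```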

9
Q

What exception is raised when sending more data than is accepted? How do you solve it?

A

The exception is ProvisionedThroughputExceededException.

Solutions:
• Make sure you don't have a hot shard (e.g., a bad partition key sending too much data to one shard)
• Retries with exponential backoff (see the sketch below)
• Increase shards (scaling)
• Ensure your partition key is well distributed
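A sketch of the retry-with-backoff fix, assuming boto3 (stream name and payload are placeholders):

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, max_retries: int = 5):
    """Retry PutRecord with exponential backoff on throttling errors."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # placeholder stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise  # not a throttling error: fail fast
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after retries; consider adding shards")
```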

10
Q

Explain what the Kinesis Producer Library (KPL) is and how it works.

A

Easy-to-use and highly configurable C++ / Java library
Used for building high-performance, long-running producers
Automated and configurable retry mechanism
Synchronous or asynchronous API (better performance with async)
Submits metrics to CloudWatch for monitoring
Batching (both features turned on by default) – increases throughput, decreases cost:
• Collect: records are gathered and written to multiple shards in the same PutRecords API call
• Aggregate: stores multiple records in one record (to go over the 1,000 records/s limit) and increases payload size to improve throughput (maximize the 1 MB/s limit) – at the cost of increased latency
Compression must be implemented by the user
KPL records must be decoded with the KCL or a special helper library

11
Q

How does Kinesis Producer Library (KPL) batching work? When should you not use batching?

A

A collection of records is built up so they can be sent in a single PutRecords API call. You can tune the batching efficiency by introducing a delay with RecordMaxBufferedTime.

Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.

12
Q

What is the Kinesis Agent?

A

Monitors log files and sends them to Kinesis Data Streams
Java-based agent, built on top of the KPL
Installed in Linux-based server environments
Features:
• Write from multiple directories and write to multiple streams
• Routing feature based on directory / log file
• Pre-process data before sending it to streams (single line, CSV to JSON, log to JSON...)
• The agent handles file rotation, checkpointing, and retry upon failures
• Emits metrics to CloudWatch for monitoring

13
Q

Give examples of Kinesis consumers

A
Kinesis SDK
Kinesis Client Library (KCL)
Kinesis Connector Library
3rd-party libraries: Spark, Log4J Appenders, Flume, Kafka Connect...
Kinesis Firehose
AWS Lambda
(Kinesis Consumer Enhanced Fan-Out is discussed in a later card)
14
Q

How does the Kinesis Consumer SDK (GetRecords) work?

A
Classic Kinesis: records are polled by consumers from a shard
Each shard has 2 MB total aggregate throughput
GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
Maximum of 5 GetRecords API calls per shard per second => 200 ms latency
If 5 consumer applications consume from the same shard, each consumer can poll once per second and receives less than 400 KB/s
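A minimal polling-consumer sketch with the SDK (stream and shard IDs are placeholders); the sleep keeps it under the 5 GetRecords calls per shard per second limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start reading one shard from its oldest available record (placeholder names).
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(0.2)  # 5 calls/s per shard => one poll every 200 ms
```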
15
Q

How does the Kinesis Client Library (KCL) work?

A

• Java-first library, but it exists for other languages too (Golang, Python, Ruby, Node, .NET...)
• Reads records from Kinesis produced with the KPL (de-aggregation)
• Shares multiple shards across multiple consumers in one "group"; shard discovery
• Checkpointing feature to resume progress (in case your application goes down)
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
  • Make sure you provision enough WCU / RCU
  • Or use On-Demand for DynamoDB
  • Otherwise DynamoDB may slow down the KCL
• Record processors will process the data
• ExpiredIteratorException => increase WCU (problem checkpointing progress on DynamoDB)
16
Q

How does AWS Lambda sourcing from Kinesis work?

A

• AWS Lambda can source records from Kinesis Data Streams
• The Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
  • Amazon S3
  • DynamoDB
  • Redshift
  • ElasticSearch
  • Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in the Lambda section)
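A sketch of a Lambda handler on a Kinesis event source; record payloads arrive base64-encoded, and the downstream target is left as a comment:

```python
import base64
import json

def handler(event, context):
    """Lightweight ETL over a batch of Kinesis records delivered to Lambda."""
    for record in event["Records"]:
        # Kinesis payloads are base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # ... write to S3 / DynamoDB / Redshift / ElasticSearch here ...
        print(item)
    return {"processed": len(event["Records"])}
```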

17
Q

How does Kinesis Enhanced Fan-Out work?

A

• Game-changing feature introduced in August 2018
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers get 40 MB/s per shard, aggregated – no more shared 2 MB/s limit!
• Enhanced Fan-Out: Kinesis pushes data to consumers over HTTP/2
• Reduced latency (~70 ms)

18
Q

What is the difference between Enhanced Fan-Out and standard consumers?

A
• Standard consumers:
  • Low number of consuming applications (1, 2, 3...)
  • Can tolerate ~200 ms latency
  • Minimize cost

• Enhanced Fan-Out consumers:
  • Multiple consumer applications for the same stream
  • Low latency requirements (~70 ms)
  • Higher costs (see the Kinesis pricing page)
  • Default limit of 5 consumers using enhanced fan-out per data stream

19
Q

How does adding shards work in Kinesis?

A

• Also called "Shard Splitting"
• Can be used to increase the stream capacity (1 MB/s data in per shard)
• Can be used to divide a "hot shard"
• The old shard is closed and will be deleted once its data expires

(slide 39, lecture 11)

20
Q

How does merging shards work in Kinesis?

A

• Decreases the stream capacity and saves costs
• Can be used to group two shards with low traffic
• Old shards are closed and deleted based on data expiration

(slide 40, lecture 11)

21
Q

After resharding, how do you guarantee consistency of records between the parent shard and the child shards?

A
• After a reshard, you can read from child shards
• However, data you haven't read yet could still be in the parent
• If you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
• After a reshard, read entirely from the parent until you no longer get new records
• Note: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations

(Slide 41, Lecture 11)

22
Q

How is auto scaling done in Kinesis?

A
• Auto scaling is not a native feature of Kinesis
• The API call to change the number of shards is UpdateShardCount (see the sketch below)
• We can implement auto scaling with AWS Lambda
• See: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/

(Slide 42, Lecture 11)
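The API call itself, sketched with boto3 (stream name and target count are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count; mind the rolling 24-hour scaling limits.
kinesis.update_shard_count(
    StreamName="my-stream",        # placeholder stream name
    TargetShardCount=4,            # placeholder target
    ScalingType="UNIFORM_SCALING",
)
```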

23
Q

What are the limitations of Kinesis scaling?

A

• Resharding cannot be done in parallel. Plan capacity in advance
• You can only perform one resharding operation at a time, and it takes a few seconds
• For 1,000 shards, it takes 30,000 seconds (8.3 hours) to double the shards to 2,000
• You can't do the following:
  • Scale more than twice for each rolling 24-hour period for each stream
  • Scale up to more than double your current shard count for a stream
  • Scale down below half your current shard count for a stream
  • Scale up to more than 500 shards in a stream
  • Scale a stream with more than 500 shards down unless the result is fewer than 500 shards
  • Scale up to more than the shard limit for your account

24
Q

How do you handle data duplication for Kinesis producers?

A

• Producer retries can create duplicates due to network timeouts (the ack never arrives, but Kinesis did receive the data)
• Although the two records have identical data, they have unique sequence numbers
• Fix: embed a unique record ID in the data to de-duplicate on the consumer side

(Slide 45, Lecture 12)

25
Q

How do you handle data duplication for Kinesis consumers?

A
  • Consumer retries can make your application read the same data twice
  • Consumer retries happen when record processors restart:
    • A worker terminates unexpectedly
    • Worker instances are added or removed
    • Shards are merged or split
    • The application is deployed
  • Fixes:
    • Make your consumer application idempotent
    • If the final destination can handle duplicates, it’s recommended to do it there
  • More info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html

(Slide 46, Lecture 12)
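One way to make a consumer idempotent, sketched under the assumption that a unique record ID is embedded in the data and that a DynamoDB table (placeholder name "processed-records", partition key "record_id") tracks what was already processed; a conditional put rejects duplicates:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-records")  # placeholder table

def handle_once(record_id: str, payload: dict) -> bool:
    """Process a record only if its ID has never been seen before."""
    try:
        table.put_item(
            Item={"record_id": record_id, **payload},
            ConditionExpression="attribute_not_exists(record_id)",
        )
        return True   # first delivery: do the real work exactly once
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery: safely ignore
        raise
```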

26
Q

How does security work in Kinesis?

A
  • Control access / authorization using IAM policies
  • Encryption in flight using HTTPS endpoints
  • Encryption at rest using KMS
  • Client side encryption must be manually implemented (harder)
  • VPC Endpoints available for Kinesis to access within VPC
27
Q

How does AWS Kinesis Data Firehose work?

A

Fully managed service, no administration
Near real time (60-second latency minimum for non-full batches)
Loads data into Redshift / Amazon S3 / ElasticSearch / Splunk
Automatic scaling
Supports many data formats
Data conversion from JSON to Parquet / ORC (only for S3)
Data transformation through AWS Lambda (e.g., CSV => JSON)
Supports compression when the target is Amazon S3 (GZIP, ZIP, and SNAPPY)
Only GZIP if the data is further loaded into Redshift
Pay for the amount of data going through Firehose
Spark / KCL do not read from KDF

28
Q

What are the sources for KDF (Kinesis Data Firehose)?

A
SDK (Kinesis Producer Library - KPL)
Kinesis Agent
Kinesis Data Streams
CloudWatch Logs & Events
IoT rules actions

(slide 48)

29
Q

What are the destinations of KDF (Kinesis Data Firehose)?

A

Amazon S3
Redshift
ElasticSearch
Splunk

(slide 48)

30
Q

How does KDF (Kinesis Data Firehose) delivery work?

A

Diagram on slide 49

Ingestion from the source into KDF → data transformation (several "blueprint" templates available) → output: Amazon S3 + COPY into Amazon Redshift

Also:
Source records → KDF → Amazon S3
KDF → transformation failures → Amazon S3
KDF → delivery failures → Amazon S3

(slide 49)

31
Q

How does Firehose buffer sizing work?

A
  • Firehose accumulates records in a buffer
  • The buffer is flushed based on time and size rules
  • Buffer Size (ex: 32MB): if that buffer size is reached, it’s flushed
  • Buffer Time (ex: 2 minutes): if that time is reached, it’s flushed
  • Firehose can automatically increase the buffer size to increase throughput
  • High throughput => Buffer Size will be hit
  • Low throughput => Buffer Time will be hit
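Buffer size and buffer time map to BufferingHints when creating a delivery stream; a hedged boto3 sketch (names and ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-bucket",                      # placeholder
        # Flush on whichever rule is hit first: 32 MB or 120 seconds.
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
        "CompressionFormat": "GZIP",
    },
)
```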
32
Q

What is the difference between Kinesis Data Streams and Firehose?

A

Streams
• Going to write custom code (producer / consumer)
• Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
• Must manage scaling (shard splitting / merging)
• Data Storage for 1 to 7 days, replay capability, multi consumers
• Use with Lambda to insert data in real-time to ElasticSearch (for example)

Firehose
• Fully managed, send to S3, Splunk, Redshift, ElasticSearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated Scaling
• No data storage

33
Q

How do CloudWatch Logs Subscription Filters work?

A
  • You can stream CloudWatch Logs into
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • AWS Lambda
  • Using CloudWatch Logs Subscriptions Filters
  • You can enable them using the AWS CLI

NEAR REAL TIME
CloudWatch Logs → Subscription Filter → Kinesis Data Firehose + Lambda function enrichment (transformation) → (near real time) Amazon ES

REAL TIME
CloudWatch Logs → Subscription Filter → Lambda function → (real time) Amazon ES

REAL TIME ANALYTICS
CloudWatch Logs → Subscription Filter → Kinesis Data Stream → Kinesis Data Analytics → Lambda function
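Enabling a subscription filter through the API (the boto3 equivalent of the CLI call); log group, names, and ARNs are placeholders:

```python
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/my/app/logs",   # placeholder log group
    filterName="to-firehose",      # placeholder filter name
    filterPattern="",              # empty pattern = forward every log event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/my-stream",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-firehose",  # placeholder
)
```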

34
Q

How does the AWS SQS Standard Queue work?

A

It’s a queue
Fully managed
Scales from 1 message per second to 10,000s per second
Default retention of messages: 4 days, maximum of 14 days
No limit to how many messages can be in the queue
Low latency (<10 ms on publish and receive)
Horizontal scaling in terms of number of consumers
Can have duplicate messages (at least once delivery, occasionally)
Can have out of order messages (best effort ordering)
Limitation of 256KB per message sent

35
Q

How do we produce messages in SQS?

A
  • Define Body
  • Add message attributes (metadata – optional)
  • Provide Delay Delivery (optional)
  • Get back
    • Message identifier
    • MD5 hash of the body

(slide 58)
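A minimal produce sketch with boto3 (queue URL, body, and attributes are placeholders); note the message ID and MD5 hash returned:

```python
import boto3

sqs = boto3.client("sqs")

response = sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
    MessageBody="order-created",
    DelaySeconds=0,  # optional delay delivery (up to 900 seconds)
    MessageAttributes={  # optional metadata
        "OrderId": {"DataType": "String", "StringValue": "42"},
    },
)

print(response["MessageId"], response["MD5OfMessageBody"])
```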

36
Q

How do we consume messages in SQS?

A
  • Poll SQS for messages (receive up to 10 messages at a time)
  • Process the message within the visibility timeout
  • Delete the message using the message ID & receipt handle

SQS → poll messages → message → process message → delete message → SQS

(slide 59)
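A consume-loop sketch with boto3 (queue URL and handler are placeholders): poll up to 10 messages, process each within the visibility timeout, then delete by receipt handle:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def process(body: str) -> None:
    print("handling", body)  # stand-in for real work

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # up to 10 messages per poll
    WaitTimeSeconds=10,      # long polling
)

for msg in resp.get("Messages", []):
    process(msg["Body"])  # must finish within the visibility timeout
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```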

37
Q

How does AWS SQS - FIFO Queue work?

A
  • Newer offering (First In - First out) – not available in all regions!
  • Name of the queue must end in .fifo
  • Lower throughput (up to 3,000 per second with batching, 300/s without)
  • Messages are processed in order by the consumer
  • Messages are sent exactly once
  • 5-minute interval de-duplication using a "Deduplication ID"
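Sending to a FIFO queue requires a MessageGroupId (the ordering scope) and, unless content-based deduplication is enabled, a MessageDeduplicationId; a sketch with placeholder values:

```python
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo",  # placeholder
    MessageBody="payment-received",
    MessageGroupId="customer-42",       # messages within a group stay ordered
    MessageDeduplicationId="evt-0001",  # de-duplicated within the 5-minute window
)
```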
38
Q

The message size limit is 256 KB; how do you send large messages?

A

Using the SQS Extended Client (Java Library)

Producer → small metadata message → SQS queue → small metadata message → Consumer

Producer → sends the large message to S3
Consumer → retrieves the large message from S3

(slide 61)

39
Q

What are the use cases for AWS SQS?

A
  • Decouple applications (for example to handle payments asynchronously)
  • Buffer writes to a database (for example a voting application)
  • Handle large loads of messages coming in (for example an email sender)

• SQS can be integrated with Auto Scaling through CloudWatch!

40
Q

What are the limits for SQS?

A
  • Maximum of 120,000 in-flight messages being processed by consumers
  • Batch Request has a maximum of 10 messages – max 256KB
  • Message content is XML, JSON, Unformatted text
  • Standard queues have an unlimited TPS
  • FIFO queues support up to 3,000 messages per second (using batching)
  • Max message size is 256KB (or use Extended Client)
  • Data retention from 1 minute to 14 days
  • Pricing:
    • Pay per API Request
    • Pay per network usage
41
Q

How does AWS SQS Security work?

A
  • Encryption in flight using the HTTPS endpoint
  • Can enable SSE (Server Side Encryption) using KMS
    • Can set the CMK (Customer Master Key) we want to use
    • SSE only encrypts the body, not the metadata (message ID, timestamp, attributes)
  • IAM policy must allow usage of SQS
  • SQS queue access policy
    • Finer grained control over IP
    • Control over the time the requests come in
42
Q

What are the differences between Kinesis Data Streams and SQS?

A

Kinesis Data Stream:
• Data can be consumed many times
• Data is deleted after the retention period
• Ordering of records is preserved (at the shard level) – even during replays
• Build multiple applications reading from the same stream independently (Pub/Sub)
• “Streaming MapReduce” querying capability
• Checkpointing needed to track progress of consumption
• Shards (capacity) must be provided ahead of time

SQS:
• Queue, decouples applications
• One application per queue
• Records are deleted after consumption (ack / fail)
• Messages are processed independently for standard queues
• Ordering for FIFO queues
• Capability to "delay" messages
• Dynamic scaling of load (no-ops)

(See the table on slide 66)

43
Q

What is IoT Device Gateway?

A
  • Serves as the entry point for IoT devices connecting to AWS
  • Allows devices to securely and efficiently communicate with AWS IoT
  • Supports the MQTT, WebSockets, and HTTP 1.1 protocols
  • Fully managed and scales automatically to support over a billion devices
  • No need to manage any infrastructure
44
Q

What is IoT Message Broker?

A
  • Pub/sub (publishers/subscribers) messaging pattern - low latency
  • Devices can communicate with one another this way
  • Messages sent using the MQTT, WebSockets, or HTTP 1.1 protocols
  • Messages are published into topics (just like SNS)
  • Message Broker forwards messages to all clients connected to the topic

Thing → Device Gateway → message broker / topic

(Slide 70)

45
Q

What is IoT Thing Registry (IAM of IoT)?

A

All connected IoT devices are represented in the AWS IoT registry
Organizes the resources associated with each device in the AWS Cloud
Each device gets a unique ID
Supports metadata for each device (ex: Celsius vs Fahrenheit, etc…)
Can create X.509 certificate to help IoT devices connect to AWS
IoT Groups: group devices together and apply permissions to the group

46
Q

How can we authenticate with the IoT Device Gateway?

A

3 possible authentication methods for Things:
• Create X.509 certificates and load them securely onto the Things
• AWS SigV4
• Custom tokens with Custom authorizers

For mobile apps:
• Cognito identities (extension to Google, Facebook login, etc…)

Web / Desktop / CLI:
• IAM
• Federated Identities

47
Q

How does authorization for IoT Devices work?

A

AWS IoT policies:
• Attached to X.509 certificates or Cognito identities
• Able to revoke any device at any time
• IoT policies are JSON documents
• Can be attached to groups instead of individual Things

IAM policies:
• Attached to users, groups, or roles
• Used for controlling IoT AWS APIs

48
Q

What is a Device Shadow?

A
  • JSON document representing the state of a connected Thing
  • We can set the state to a different desired state (ex: light on)
  • The IoT thing will retrieve the state when online and adapt

Lightbulb (off) → IoT reported state (off) → change state (AWS API) using a mobile application → IoT desired state (on) – device shadow → synchronization of state → Lightbulb (on)
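A sketch of setting a desired state through the shadow API with boto3 (thing name and state keys are placeholders):

```python
import json
import boto3

iot_data = boto3.client("iot-data")

# Ask the device to turn on; it reads the desired state when it comes online.
iot_data.update_thing_shadow(
    thingName="lightbulb-1",  # placeholder thing name
    payload=json.dumps({"state": {"desired": {"power": "on"}}}),
)
```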

49
Q

What is Rules Engine?

A
  • Rules are defined on the MQTT topics
  • Rule = when it's triggered | Action = what it does
  • Rules need IAM Roles to perform their actions
  • Rules use cases:
    • Augment or filter data received from a device
    • Write data received from a device to a DynamoDB database
    • Save a file to S3
    • Send a push notification to all users using SNS