Collection Flashcards
Give an overview of AWS Kinesis.
A managed alternative to Apache Kafka
Great for application logs, metrics, IoT, and clickstreams
Real-time big data
Compatible with stream processing frameworks (Spark, NiFi, …)
Data is automatically replicated across 3 AZs
Kinesis Streams: low-latency streaming ingestion
Kinesis Analytics: real-time analytics on streams using SQL
Kinesis Firehose: loads streams into S3, Redshift, ElasticSearch & Splunk
How do I create alerts and dashboards from my application data (clickstreams, IoT devices, metrics & logs)?
You can use Amazon Kinesis Streams to collect the data and send it to Kinesis Analytics, which generates the alerts based on SQL queries. From there, Kinesis Firehose can deliver the data to Amazon Redshift or an S3 bucket, where dashboards can be built with QuickSight.
List key characteristics of Kinesis Streams.
Streams are divided into ordered Shards / Partitions
Data retention is 24 hours by default, can go up to 7 days
Ability to reprocess / replay data
Multiple applications can consume the same stream
Real-time processing with scalable throughput
Once data is inserted in Kinesis, it can’t be deleted (immutability)
How do Kinesis Streams shards work?
One stream is made of many different shards
Billing is per shard provisioned; you can have as many shards as you want
Batching available, or per-message calls
The number of shards can evolve over time (reshard / merge)
Records are ordered per shard
How do Kinesis Streams records work?
A record has 3 parts: Data Blob, Record Key, Sequence Number
The data blob is the payload being transported, serialized as bytes. Maximum size of 1 MB. It can represent anything
The record key helps group records into shards. Same key = same shard. Use a highly distributed key to avoid the “hot partition” problem
The sequence number is the unique identifier for each record inserted into a shard. Added by Kinesis after ingestion
What is the maximum size of data sent through Kinesis Streams?
The data blob has a maximum size of 1 MB.
What are the Kinesis Data Streams limits?
Producer:
• 1MB/s or 1000 messages/s at write PER SHARD
• “ProvisionedThroughputExceededException” otherwise
Consumer Classic:
• 2MB/s at read PER SHARD across all consumers
• 5 API calls per second PER SHARD across all consumers
Consumer Enhanced Fan-Out:
• 2MB/s at read PER SHARD, PER ENHANCED CONSUMER
• No API calls needed (push model)
Data Retention:
• 24 hours data retention by default
• Can be extended to 7 days
Which API does a producer use to insert records into a Kinesis Data Stream?
The APIs used are PutRecord (one record) and PutRecords (many records)
• PutRecords uses batching and increases throughput => fewer HTTP requests (see the sketch after this list)
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc…
• Use case: low throughput, higher latency, simple API, AWS Lambda
• Managed AWS sources for Kinesis Data Streams:
• CloudWatch Logs
• AWS IoT
• Kinesis Data Analytics
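For illustration, a minimal boto3 sketch of both calls; the stream name, partition keys, and payloads are made up:

```python
import boto3

kinesis = boto3.client("kinesis")

# PutRecord: one record per HTTP request
kinesis.put_record(
    StreamName="my-stream",           # hypothetical stream name
    Data=b'{"event": "click"}',       # payload bytes (data blob, max 1 MB)
    PartitionKey="user-42",           # same key => same shard
)

# PutRecords: batch up to 500 records in one HTTP request
kinesis.put_records(
    StreamName="my-stream",
    Records=[
        {"Data": b'{"event": "view"}', "PartitionKey": "user-7"},
        {"Data": b'{"event": "buy"}',  "PartitionKey": "user-9"},
    ],
)
```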
What exception is raised when you send more data than is accepted? How do you solve it?
The exception is ProvisionedThroughputExceededException.
Solution:
• Make sure you don’t have a hot shard (e.g., a bad partition key sending too much data to one partition)
• Retries with backoff (see the sketch after this list)
• Increase shards (scaling)
• Ensure your partition key is a good one
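A minimal sketch of the retry-with-backoff idea around PutRecord, assuming boto3; the stream name and delay values are illustrative:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, max_retries: int = 5):
    """Retry PutRecord with exponential backoff on throughput errors."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream", Data=data, PartitionKey=key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise                       # unrelated error: do not retry
            time.sleep(0.1 * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
    raise RuntimeError("still throttled after retries; consider more shards")
```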
Explain what the Kinesis Producer Library (KPL) is and how it works.
Easy to use and highly configurable C++ / Java library
Used for building high performance, long-running producers
Automated and configurable retry mechanism
Synchronous or Asynchronous API (better performance for async)
Submits metrics to CloudWatch for monitoring
Batching (both turned on by default) – increases throughput, decreases cost:
• Collect – records destined for multiple shards are written in the same PutRecords API call
• Aggregate – increased latency:
  • Capability to store multiple records in one record (go over the 1000 records per second limit)
  • Increased payload size to improve throughput (maximize the 1 MB/s limit)
Compression must be implemented by the user
KPL records must be decoded with the KCL or a special helper library
How does Kinesis Producer Library (KPL) batching work? When should you not use batching?
A collection of records is built and sent with a single PutRecords API call. You can tune batching efficiency by introducing a delay with RecordMaxBufferedTime (see the config sketch below).
Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.
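As an illustration only, a KPL properties-file entry that caps the batching delay; 100 ms is the documented default, and the surrounding file layout is an assumption:

```properties
# Max time (ms) a record may sit in the KPL buffer before being sent.
# Lower values reduce latency but also reduce the batching benefit.
RecordMaxBufferedTime = 100
```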
What is the Kinesis Agent?
Monitors log files and sends them to Kinesis Data Streams
Java-based agent, built on top of KPL
Install in Linux-based server environments
Features:
• Write from multiple directories and write to multiple streams
• Routing feature based on directory / log file
• Pre-process data before sending to streams (single line, CSV to JSON, log to JSON…)
• The agent handles file rotation, checkpointing, and retry upon failures
• Emits metrics to CloudWatch for monitoring
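A sketch of what an agent configuration could look like; the file path, stream name, and processing options here are assumptions (check the Kinesis Agent documentation for the exact schema):

```json
{
  "flows": [
    {
      "filePattern": "/var/log/app/*.log",
      "kinesisStream": "my-stream",
      "dataProcessingOptions": [
        { "optionName": "LOGTOJSON", "logFormat": "COMMONAPACHELOG" }
      ]
    }
  ]
}
```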
Give examples of Kinesis consumers.
• Kinesis SDK
• Kinesis Client Library (KCL)
• Kinesis Connector Library
• 3rd-party libraries: Spark, Log4J Appenders, Flume, Kafka Connect…
• Kinesis Firehose
• AWS Lambda
• Kinesis Consumer Enhanced Fan-Out (discussed in the next lecture)
How does the Kinesis Consumer SDK (GetRecords) work?
• Classic Kinesis – records are polled by consumers from a shard
• Each shard has 2 MB total aggregate throughput
• GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
• Maximum of 5 GetRecords API calls per shard per second = 200 ms latency
• If 5 consumer applications consume from the same shard, each consumer can poll once per second and receive less than 400 KB/s
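A minimal boto3 polling sketch of the GetShardIterator / GetRecords flow; the stream name and shard ID are made up:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Position an iterator at the oldest available record in the shard
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",               # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")  # None once the shard is closed
    time.sleep(0.2)  # stay under the 5 GetRecords calls/shard/second limit
```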
How does the Kinesis Client Library (KCL) work?
• Java-first library, but exists for other languages too (Golang, Python, Ruby, Node, .NET…)
• Reads records from Kinesis produced with the KPL (de-aggregation)
• Shares multiple shards with multiple consumers in one “group”; shard discovery
• Checkpointing feature to resume progress (in case your application goes down)
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
  • Make sure you provision enough WCU / RCU, or use On-Demand for DynamoDB
  • Otherwise DynamoDB may slow down the KCL
• Record processors will process the data
• ExpiredIteratorException => increase WCU (indicates a problem checkpointing progress in DynamoDB)
How does AWS Lambda sourcing from Kinesis work?
• AWS Lambda can source records from Kinesis Data Streams
• Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
• Amazon S3
• DynamoDB
• Redshift
• ElasticSearch
• Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in Lambda section)
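A minimal sketch of a Python Lambda handler for a Kinesis event source; note that record data arrives base64-encoded (the JSON payload format is an assumption):

```python
import base64
import json

def lambda_handler(event, context):
    # The Kinesis event source delivers a batch of records per invocation
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)  # assuming producers send JSON
        print(record["kinesis"]["sequenceNumber"], item)
        # lightweight ETL: write `item` to S3 / DynamoDB / Redshift here
```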
How does Kinesis Enhanced Fan-Out work?
• New game-changing feature from August 2018.
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers will get 40 MB/s per shard, aggregated
• No more 2 MB/s limit!
• Enhanced Fan Out: Kinesis pushes data to consumers over HTTP/2
• Reduce latency (~70 ms)
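A sketch of enhanced fan-out with boto3 (RegisterStreamConsumer + SubscribeToShard); the ARN, consumer name, and shard ID are assumptions:

```python
import boto3

kinesis = boto3.client("kinesis")

# Register this application as an enhanced fan-out consumer of the stream
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    ConsumerName="my-efo-app",
)["Consumer"]

# Subscribe to one shard; Kinesis pushes records over HTTP/2
# (a subscription lasts up to 5 minutes, then you re-subscribe)
subscription = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)
for event in subscription["EventStream"]:
    print(event)  # SubscribeToShardEvent batches of pushed records
```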
What is the difference between Enhanced Fan-Out and Standard Consumers?
Standard consumers:
• Low number of consuming applications (1, 2, 3…)
• Can tolerate ~200 ms latency
• Minimize cost
Enhanced Fan-Out consumers:
• Multiple consumer applications for the same stream
• Low latency requirements (~70 ms)
• Higher costs (see the Kinesis pricing page)
• Default limit of 5 consumers using enhanced fan-out per data stream
How does adding shards work in Kinesis?
• Also called “Shard Splitting”
• Can be used to increase the stream capacity (1 MB/s of data in per shard)
• Can be used to divide a “hot shard”
• The old shard is closed and will be deleted once its data has expired (slide 39, lecture 11)
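A boto3 sketch of splitting a hot shard at the midpoint of its hash-key range; the stream name and the choice of shard are assumptions:

```python
import boto3

kinesis = boto3.client("kinesis")

# Find the shard to split and read its hash-key range
shard = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]   # picking shard 0 for the example
low = int(shard["HashKeyRange"]["StartingHashKey"])
high = int(shard["HashKeyRange"]["EndingHashKey"])

# Split at the midpoint: the parent shard closes, two children take over
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((low + high) // 2),
)
```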