AWS Data Collections Flashcards
AWS Collection Types
- Real-Time
- Near Real-Time
- Batch
Real-Time Collection Services
Kinesis Data Streams
SQS
IoT
Near Real-Time Collection Services
Kinesis Data Firehose
Database Migration Service (DMS)
Batch - Historical Analytics Services
Snowball
Data Pipeline
Explain the Kinesis Data Streams service
Managed service that allows you to collect, process, and analyze real-time streaming data from various sources such as IoT, mobile devices, server logs, social networks, and other real-time data sources.
Producers Kinesis Data Streams
Applications, Client, SDK, KPL, Kinesis Agent
Consumers Kinesis Data Streams
Apps (KCL, SDK), Lambda, Kinesis Data Firehose and Kinesis Data Analytics
Kinesis Data Streams Capacity Modes
Provisioned Mode
On-Demand Mode
In Kinesis Data Streams, each shard in Provisioned Mode gets…
1 MB/s or 1,000 records per second
In On-Demand Mode, Kinesis provisions a default capacity of…
4 MB/s or 4,000 records per second
Kinesis Data Streams Security
- IAM policies to control access
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Encryption/decryption of data on the client side
- VPC Endpoints
- Monitor API calls using CloudTrail
Explain Kinesis Producer SDK - PutRecord(s)
APIs used: PutRecord (one record) and PutRecords (many records)
PutRecords uses…
Batching, which increases throughput and reduces the number of HTTP requests
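The batching idea above can be sketched in Python. This is a minimal illustration of splitting records into PutRecords-sized calls, assuming the documented limits of 500 records and 5 MB per call; `batch_records` is a hypothetical helper, and a real producer would pass each batch to the API rather than just counting them.

```python
# Sketch: split records into PutRecords-sized batches.
# PutRecords accepts up to 500 records and 5 MB per call; this only
# shows the batching logic, not the real API call.

MAX_RECORDS_PER_CALL = 500
MAX_BYTES_PER_CALL = 5 * 1024 * 1024

def batch_records(records):
    """Yield lists of records that each fit in one PutRecords call."""
    batch, batch_bytes = [], 0
    for rec in records:
        size = len(rec["Data"]) + len(rec["PartitionKey"].encode())
        if batch and (len(batch) >= MAX_RECORDS_PER_CALL
                      or batch_bytes + size > MAX_BYTES_PER_CALL):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += size
    if batch:
        yield batch

records = [{"Data": b"x" * 1024, "PartitionKey": str(i)} for i in range(1200)]
batches = list(batch_records(records))
print(len(batches))  # 3 batches: 500 + 500 + 200 records
```

Sending 3 HTTP requests instead of 1,200 is where the efficiency gain comes from.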
Kinesis Producer SDK - what happens if we go over the limits?
A ProvisionedThroughputExceeded exception is thrown
Managed AWS sources for Kinesis Data Streams
CloudWatch Logs, AWS IoT, Kinesis Data Analytics
To send data asynchronously to Kinesis via API, use the…
Kinesis Producer Library (KPL)
Where does the Kinesis Producer Library submit metrics?
CloudWatch, for monitoring
KPL batching introduces some delay, configured with…
RecordMaxBufferedTime (default 100 ms)
Define the features of the Kinesis Agent
- Monitors log files and sends them to Kinesis Data Streams
- Java-based agent
- Installed in Linux server environments
Data Collection Services
Amazon Kinesis
AWS IoT Core
AWS Snowball
SQS
DMS
Direct Connect
What is ProvisionedThroughputExceeded?
An exception that can occur in Kinesis when the application exceeds the provisioned throughput limit defined for the stream.
Causes of ProvisionedThroughputExceeded exceptions
- Exceeding the MB/s or TPS limit for any shard
- A hot shard (e.g., a bad partition key sends too much data to one partition)
Solutions for ProvisionedThroughputExceeded exceptions
- Retries with backoff
- Increase shards (scaling)
- Ensure your partition key is a good one
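The first solution, retries with backoff, can be sketched in Python. This is a minimal simulation: `put_with_retries` and `flaky_put` are hypothetical names, and in real code the inner call would be the Kinesis put operation and the exception would come from the AWS SDK.

```python
import random
import time

# Sketch: retries with exponential backoff for a throttled Kinesis call.
class ProvisionedThroughputExceeded(Exception):
    pass

def put_with_retries(put_fn, max_retries=5, base_delay=0.1, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        try:
            return put_fn()
        except ProvisionedThroughputExceeded:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter: ~0.1s, 0.2s, 0.4s, ...
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))

# Simulated call that is throttled twice, then succeeds.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ProvisionedThroughputExceeded()
    return "ok"

result = put_with_retries(flaky_put, sleep=lambda s: None)
print(result)  # -> ok, after two throttled attempts
```

Backoff spreads the retries out so the producer stops hammering a shard that is already at its limit.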
How does the Kinesis Producer Library (KPL) gain efficiency?
Batching, introducing some delay with RecordMaxBufferedTime (default 100 ms)
Kinesis Producer Library – When not to use it
- The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable)
- Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance
Kinesis Agent functions
- Monitors log files and sends them to Kinesis Data Streams
- Java-based agent, built on top of the KPL
- Installed in Linux-based server environments
Kinesis Agent Features
- Write from multiple directories and to multiple streams
- Routing feature based on directory / log file
- Pre-process data before sending to streams (single line, CSV to JSON, log to JSON…)
- The agent handles file rotation, checkpointing, and retry upon failures
- Emits metrics to CloudWatch for monitoring
Elements of Kinesis Consumers Classic
- Kinesis SDK
- Kinesis Client Library (KCL)
- Kinesis Connector Library
- 3rd-party libraries: Spark, Log4J Appenders, Flume, Kafka Connect…
- Kinesis Firehose
- AWS Lambda
Features Kinesis Consumer SDK - GetRecords
- Classic Kinesis - records are polled by consumers from a shard
- Each shard has 2 MB total aggregate throughput
- GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
- Maximum of 5 GetRecords API calls per shard per second = 200 ms latency
- If 5 consumer applications consume from the same shard, each consumer can poll once per second and receive less than 400 KB/s
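The arithmetic behind the last bullet can be sketched in Python (using decimal units, as the flashcard does; `per_consumer` is a hypothetical helper for illustration).

```python
# Sketch: classic (shared) consumer throughput per shard.
# A shard allows 2 MB/s of aggregate reads and 5 GetRecords calls
# per second, shared across all consumers reading that shard.

SHARD_READ_KBPS = 2000        # 2 MB/s aggregate read limit
GETRECORDS_CALLS_PER_SEC = 5

def per_consumer(n_consumers):
    """Per-consumer share of the shard's read limit and poll rate."""
    return (SHARD_READ_KBPS / n_consumers,
            GETRECORDS_CALLS_PER_SEC / n_consumers)

kbps, polls = per_consumer(5)
print(kbps, polls)  # 400.0 KB/s and 1.0 poll/s per consumer
```

This is exactly the limitation that Enhanced Fan-Out removes by giving each consumer its own 2 MB/s pipe.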
Kinesis Connector Library write data to:
- Amazon S3
- DynamoDB
- Redshift
- ElasticSearch
Throughput per consumer per shard in Kinesis Enhanced Fan-Out
2 MB/s
With Enhanced Fan-Out, 20 consumers on one shard means…
40 MB/s per shard (2 MB/s each)
Latency tolerated by Standard Consumers
~200 ms
Latency with Enhanced Fan-Out consumers
~70 ms
Kinesis Data Firehose destinations
S3, Redshift, Elasticsearch, Splunk
True or False: Spark / KCL read from KDF
False
You can stream CloudWatch Logs into…
- Kinesis Data Streams
- Kinesis Data Firehose
- AWS Lambda
Data Stream write capacity on-demand maximum
200 MiB/sec and 200,000 records/second
Data Stream read capacity on-demand maximum per consumer
400 MiB/second
Data Stream write capacity in provisioned mode
1 MiB/second and 1,000 records/second
Data Stream read capacity in provisioned mode
2 MiB/second
SQS Use cases
- Order processing
- Image processing
- Auto-scaling queues according to messages
- Buffer and batch messages for future processing
- Request offloading
Kinesis Data Streams use cases
- Fast log and event data collection and processing
- Real-time metrics and reports
- Mobile data capture
- Real-time data analytics
- Gaming data feed
- Complex stream processing
- Data feed from the "Internet of Things"
Kinesis Auto Scaling Features
- Not native to Kinesis
- The API call to change the number of shards is UpdateShardCount
- Auto-scaling can be implemented with Lambda
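The scaling decision such a Lambda would make can be sketched in Python. This is a minimal illustration under the provisioned-mode write limit of 1 MB/s per shard; `target_shards` and the 50% headroom factor are hypothetical choices, not AWS API names.

```python
import math

# Sketch: computing a target shard count before resizing a stream.
WRITE_MBPS_PER_SHARD = 1  # provisioned-mode write limit per shard

def target_shards(observed_write_mbps, headroom=1.5):
    """Shards needed to absorb the observed write rate with headroom."""
    return max(1, math.ceil(observed_write_mbps * headroom
                            / WRITE_MBPS_PER_SHARD))

shards = target_shards(4.0)
print(shards)  # 6 shards for 4 MB/s of writes with 50% headroom
# A Lambda would then call UpdateShardCount with TargetShardCount=6.
```

Keeping headroom above the observed rate avoids immediately re-triggering throttling after a traffic spike.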
IoT Overview
- We deploy IoT devices (‘Things’)
- We configure them and retrieve data from them
SQS Limit per message sent
256 KB
SQS how to send large messages
Use SQS Extended Client (Java Library)
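The pattern behind the SQS Extended Client can be sketched in Python: payloads over the 256 KB limit are stored externally (S3 in the real Java library) and only a pointer travels through the queue. The dict-based "bucket" and the `send` helper here are stand-ins for illustration, not the real client.

```python
import uuid

# Sketch: the payload-offloading pattern used by the SQS Extended Client.
SQS_MAX_BYTES = 256 * 1024
bucket = {}  # stand-in for an S3 bucket

def send(body: bytes):
    """Return the message actually placed on the queue."""
    if len(body) <= SQS_MAX_BYTES:
        return {"kind": "inline", "body": body.decode()}
    key = str(uuid.uuid4())
    bucket[key] = body                      # offload the large payload
    return {"kind": "pointer", "s3_key": key}

small = send(b"hello")
big = send(b"x" * (300 * 1024))
print(small["kind"], big["kind"])  # inline pointer
```

The consumer side does the inverse: on receiving a pointer message, it fetches the real payload from storage before processing.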
SQS use cases
- Decouple applications
- Buffer writes to a database
- Handle large loads of messages coming in
SQS can be integrated with…
- Auto Scaling through CloudWatch!
SQS max in-flight messages (standard queue)
120,000
SQS Message content format
XML, JSON, Unformatted text
SQS FIFO queues support a maximum of how many messages per second?
3,000 messages per second (using batching)
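Where that number comes from can be shown with a one-line calculation: FIFO queues support 300 send operations per second, and each operation can batch up to 10 messages.

```python
# Sketch: SQS FIFO throughput math.
FIFO_OPS_PER_SEC = 300       # send operations per second
MAX_MESSAGES_PER_BATCH = 10  # messages per SendMessageBatch call

max_messages_per_sec = FIFO_OPS_PER_SEC * MAX_MESSAGES_PER_BATCH
print(max_messages_per_sec)  # 3000
```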
SQS Pricing mode
- Pay per API Request
- Pay per network usage
SQS Types Security
- Encryption in flight using the HTTPS endpoint
- SSE (Server Side Encryption) using KMS
- IAM policy
- SQS queue access policy
IoT messages use which protocols?
MQTT, WebSockets, or HTTP 1.1
Database Migration Service (DMS)
Quickly and securely migrate databases to AWS; resilient and self-healing
DMS Sources
- On-premises and EC2 instance databases: Oracle, MS SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, SAP, DB2
- Azure: Azure SQL Database
- Amazon RDS: all, including Aurora
- Amazon S3
DMS Targets
- On-premises and EC2 instance databases: Oracle, MS SQL Server, MySQL, MariaDB, PostgreSQL, SAP
- Amazon RDS
- Amazon Redshift
- Amazon DynamoDB
- Amazon S3
- ElasticSearch Service
- Kinesis Data Streams
- DocumentDB
DMS tool to convert your database schema from one engine to another
Schema Conversion Tool (SCT)
Direct Connect (DX)
Provides a dedicated private connection from a remote network to your VPC
Use cases Direct Connect
- Increase bandwidth throughput - working with large data sets – lower cost
- More consistent network experience - applications using real-time data feeds
- Hybrid Environments (on prem + cloud)
Direct Connect Gateway
If you want to set up Direct Connect to one or more VPCs in many different regions (same account), you must use a Direct Connect Gateway
Direct Connect – Connection Types
Dedicated Connections
Hosted Connections
Services AWS Snow Family
Snowcone, Snowball Edge, Snowmobile
Data Migration Services Snow Family
Snowcone, Snowball Edge, Snowmobile
Edge Computing services
Snowcone, Snowball Edge
Snowball Edge Storage Optimized capacity
80 TB of HDD capacity
Snowball Edge Compute Optimized capacity
42 TB of HDD capacity
AWS Snowcone capacity
8 TB
Use cases of Edge Computing
- Preprocess data
- Machine learning at the edge
- Transcoding media streams
Snow Family – Edge Computing
Snowcone (smaller)
Snowball Edge – Compute Optimized
Snowball Edge – Storage Optimized
AWS OpsHub
Software you install on your computer / laptop to manage your Snow Family devices
Amazon MSK is:
Managed Streaming for Apache Kafka
MSK – Configurations
- Choose the number of AZs (3 – recommended, or 2)
- Choose the VPC & Subnets
- The broker instance type (ex: kafka.m5.large)
- The number of brokers per AZ (can add brokers later)
- Size of your EBS volumes (1 GB – 16 TB)
MSK – Security
- Encryption
- Network Security
- Authentication & Authorization
MSK Authentication & Authorization (important):
- Define who can read/write to which topics
- Mutual TLS (AuthN) + Kafka ACLs (AuthZ)
- SASL/SCRAM (AuthN) + Kafka ACLs (AuthZ)
- IAM Access Control (AuthN + AuthZ)
MSK – Monitoring
- CloudWatch Metrics
- Prometheus (Open-Source Monitoring)
- Broker Log Delivery
MSK Broker Log Delivery options
- Delivery to CloudWatch Logs
- Delivery to Amazon S3
- Delivery to Kinesis Data Streams
MSK Connect
- You can deploy any Kafka Connect connector to MSK Connect as a plugin
MSK Data is Stored on…
EBS volumes
Producers Examples MSK
Kinesis, IoT, RDS
Consumers examples MSK
EMR, S3, SageMaker, Kinesis, RDS
MSK: size of your EBS volumes
1 GB – 16 TB
Kinesis Producers components
- Kinesis SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- Libraries: Spark, Log4J Appenders, Flume, Kafka Connect, NiFi…
What is Kinesis Data Streams?
AWS streaming service that enables real-time data ingestion, processing, and analysis.
What are Kinesis Data Streams Producers?
The component responsible for real-time data ingestion.
What are Kinesis Data Streams Consumers?
The component responsible for processing and analyzing data; reads data from one or more shards.
What is Kinesis Data Analytics?
A service that lets you process and analyze real-time data using standard SQL queries.