01 - Analytics Flashcards
Amazon Kinesis - Data Streams
1) Used to collect, process, and analyse real-time streaming data
2) Data Producer
* Application that sends data records to a Kinesis data stream
* Assigns a partition key to each record; the partition key determines which shard ingests the record
* Sequence Number is the identifier of each data record, unique per partition key within its shard; Kinesis assigns it when the record is written
3) Data Consumer
* Distributed Kinesis application that retrieves data from all shards in a stream, usually the most recent records
4) Data Stream
* Logical grouping of shards
* No hard upper limit on the number of shards in a stream (a default account-level shard quota applies and can be raised on request)
5) Shards
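* Base throughput unit of a Kinesis data stream
* 1 Shard = 1,000 data records/s or 1 MB/s for writes, and 2 MB/s for reads
* Enhanced Fan-out (each consumer gets its own dedicated 2 MB/s of read throughput per shard; not shared with other consumers)
* Non Enhanced Fan-out (the 2 MB/s of read throughput per shard is shared amongst all consumers)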
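Worked example for the producer and shard cards above. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below assumes the shards split the hash space evenly (true for a freshly created stream, not necessarily after resharding), and the helper names shard_for_key and shards_needed are made up for illustration. The second helper only applies the per-shard write limits of 1,000 records/s and 1 MB/s.

```python
import hashlib
import math

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: shard_count shards that divide the hash space evenly.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # The last shard absorbs any remainder left by the integer division.
    return min(hash_value // range_size, shard_count - 1)


def shards_needed(records_per_sec: float, mb_per_sec: float) -> int:
    """Minimum shard count for a write load (1,000 records/s and 1 MB/s per shard)."""
    by_records = math.ceil(records_per_sec / 1000)
    by_bytes = math.ceil(mb_per_sec / 1)
    return max(1, by_records, by_bytes)


# The same partition key always lands on the same shard.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))  # prints True
print(shards_needed(2500, 1.5))  # prints 3
```

Records that share a partition key stay ordered within one shard, so a key with very high traffic can turn that shard into a hot spot.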
Amazon Athena
1) Serverless interactive query service
2) Point Athena at your data in S3 and define the schema
3) Start querying right away
4) Anyone with SQL ability can start querying large datasets (Uses SQL only)
5) Compatible with common data formats, including CSV, JSON, ORC, Avro, and Parquet
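Worked example for the Athena cards above. "Define schema" usually means running a CREATE EXTERNAL TABLE statement that points at an S3 prefix. The sketch below only builds that statement as a string. The helper name build_external_table_ddl, the table web_logs, its columns, and the bucket s3://example-bucket/ are all made up for illustration. It covers formats that take a plain STORED AS clause such as Parquet or ORC; CSV and JSON tables typically need a ROW FORMAT clause instead, which is not shown.

```python
def build_external_table_ddl(table, columns, s3_location, stored_as="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for data that already sits in S3.

    columns is a list of (column_name, sql_type) pairs.
    """
    column_lines = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table over Parquet files; names and bucket are placeholders.
ddl = build_external_table_ddl(
    table="web_logs",
    columns=[("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    s3_location="s3://example-bucket/web-logs/",
)

# Once the table exists, ordinary SQL works against it.
query = "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status ORDER BY hits DESC"

print(ddl)
print(query)
```

Both strings can be pasted into the Athena query editor or submitted through the StartQueryExecution API. No data is loaded or moved; the table is only metadata over the files in S3.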
Amazon Kinesis Cheat Sheet
1) Amazon Kinesis is the AWS solution for collecting, processing, and analysing streaming data in the cloud. When you need “real-time” think Kinesis
2) Kinesis Data Streams - Pay per running shard, data can persist within the stream, data is ordered and every consumer keeps its own position. Consumers have to be manually added (coded). Data persists for 24 hours (default) and retention can be extended up to 365 days (8,760 hours)
3) Kinesis Data Firehose - Pay only for the data ingested; data is not stored and disappears once processed. Consumer of choice is from a predefined set of services, such as S3, Redshift, Elasticsearch (now OpenSearch Service) or Splunk
4) Kinesis Data Analytics - allows you to perform queries in real-time. Needs a Kinesis data stream or a Firehose delivery stream as the input and output
5) Kinesis Video Streams - securely ingests and stores video and audio encoded data and makes it available to consumers such as SageMaker, Rekognition or other services to apply machine learning and video processing
6) KPL (Kinesis Producer Library) is a Java library to write data to a stream
7) You can write data to a stream using the AWS SDK, but KPL is more efficient because it batches and aggregates records before sending them
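Worked example for the KPL vs SDK cards above. Part of what makes the KPL efficient is that it sends many records per request instead of one call per record. The sketch below is not the KPL (which is a Java library); it is a rough Python illustration of the batching idea only, grouping records into lists sized for the PutRecords API, which accepts up to 500 records per request. The helper name to_put_records_batches and the sample device data are made up for illustration, and per-request byte limits are not checked.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORDS_PER_REQUEST = 500  # PutRecords limit on records per request


def to_put_records_batches(
    records: Iterable[Tuple[str, bytes]],
    max_records: int = MAX_RECORDS_PER_REQUEST,
) -> Iterator[List[dict]]:
    """Group (partition_key, data) pairs into PutRecords-shaped batches."""
    batch: List[dict] = []
    for partition_key, data in records:
        batch.append({"PartitionKey": partition_key, "Data": data})
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


# Hypothetical sample data: 1,200 readings spread over 10 partition keys.
events = [(f"device-{i % 10}", f"reading {i}".encode()) for i in range(1200)]
batches = list(to_put_records_batches(events))
print([len(b) for b in batches])  # prints [500, 500, 200]
```

Each batch has the shape expected by the Records parameter of PutRecords (for example boto3's put_records), so the 1,200 records above would need 3 requests rather than 1,200 single PutRecord calls. The KPL goes further by also aggregating several user records into one Kinesis record and retrying failures, which this sketch leaves out.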