Data Engineering - Streaming data for ML Flashcards
Name the 4 Kinesis services
Data Streams, Video Streams, Data Firehose and Data Analytics
Define a data producer
Produces streaming data as JSON objects or blobs as the data is generated
Give examples of a data producer
- IoT devices
- Manufacturing devices
- user interaction with a website or a video game
Describe the two methods data producers can write to Kinesis
- They can use Kinesis Producer Library (KPL) - a Java library for writing to Kinesis
- Use the Kinesis API
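A minimal sketch of the second method (the Kinesis API), assuming a boto3 Kinesis client is passed in. The stream name `clickstream` and the `user_id` field are hypothetical names used only for illustration.

```python
import json


def send_event(kinesis_client, stream_name, event):
    """Write one event to a Kinesis data stream with the PutRecord API call.

    kinesis_client is assumed to be a boto3 Kinesis client, for example
    boto3.client("kinesis"). The event is serialised as a JSON blob.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Hypothetical choice: route records by user so one user's
        # events land on the same shard.
        PartitionKey=str(event["user_id"]),
    )


# Possible usage (requires boto3 and AWS credentials, so left as a comment):
# import boto3
# send_event(boto3.client("kinesis"), "clickstream", {"user_id": 42, "action": "click"})
```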
Describe Kinesis Data Streams
A data stream receives records from producers, stores them in shards, and makes them available to consumers
Give examples of Data Consumers
Lambda, EC2, Kinesis Data Analytics and EMR
Can data be sent from KDS directly to a data repository?
No, it first needs to go to a consumer
Define a consumer
AWS service or distributed Kinesis application that retrieves data from KDS
Define a shard
The base throughput unit of KDS. Data consumers retrieve data from all the shards in a stream as the data is generated
Explain partition keys
Data producers assign partition keys to records. Partition keys determine which shard ingests the data record from the data stream
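A rough sketch of how a partition key ends up on a shard. Kinesis maps each partition key to a 128-bit integer with MD5 and picks the shard whose hash key range contains that value. The evenly sized ranges below are an assumption for illustration; a real stream reports its own ranges per shard.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def even_hash_ranges(shard_count):
    """Divide the hash key space into contiguous inclusive ranges (assumed even)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash ranges do not cover the key space")
```

The same partition key always hashes to the same value, so all records with that key go to the same shard for as long as the shard layout is unchanged.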
Define a data stream
A logical grouping of shards. Records are retained for 24 hours by default; retention can be extended to 7 days, and long-term retention allows up to 365 days
What do data consumers use to consume data from KDS?
The Kinesis Client Library (KCL), a Java library
How many records per second can each shard ingest?
1,000 (writes are also capped at 1 MB per second per shard)
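A back-of-the-envelope sketch for sizing a stream from the per-shard write limits: 1,000 records per second and 1 MB per second. The function name and the example workloads are hypothetical, and read-side limits are ignored.

```python
import math

WRITE_RECORDS_PER_SEC_PER_SHARD = 1000
WRITE_MB_PER_SEC_PER_SHARD = 1.0


def shards_needed(records_per_sec, mb_per_sec):
    """Estimate the shard count from whichever write limit is hit first."""
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    by_bytes = math.ceil(mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_records, by_bytes)
```

For example, 2,500 small records per second at 1.5 MB per second is bound by the record limit, while 500 large records per second at 4.2 MB per second is bound by the byte limit.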
How can an individual shard be identified?
Each shard holds a uniquely identified sequence of data records. Records are routed to a shard by their partition key, and each data record has a sequence number
What is re-sharding?
Changing the number of shards that the KDS has after start-up
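Resharding works by splitting one shard into two or merging two adjacent shards into one. The sketch below shows only the arithmetic of an even split of a shard's hash key range. The second half's starting value is the kind of number the SplitShard API call takes as `NewStartingHashKey`; the function name is hypothetical.

```python
def split_hash_range(start, end):
    """Split an inclusive hash key range into two adjacent halves."""
    if end <= start:
        raise ValueError("range must contain at least two hash keys")
    new_starting_hash_key = start + (end - start + 1) // 2
    first_half = (start, new_starting_hash_key - 1)
    second_half = (new_starting_hash_key, end)
    return first_half, second_half
```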
When do you use KDS?
- Transferring data into AWS to be processed by data consumers
- Data must be temporarily stored in case it needs to be reprocessed
- Data needs to be processed before it can be stored
- Real time analytics by data consumers
Describe Kinesis Video Streams
Processes video streams from connected devices. The data can be sent directly to data consumers for processing or to a data repository.
When do you use KVS?
- When you need to collect video streaming data for processing and real-time analysis
- Batch-process and store streaming video
- Feed streaming data into other AWS services
Describe Kinesis Firehose
For receiving massive volumes of streaming data and storing it in an AWS repository
When do you use Data Firehose?
- sending data directly to a data repository without processing
- when the final destination is S3
- when data retention is not important
Does data firehose have internal storage?
No, Firehose has no shards and so no data retention or storage
Which other Kinesis service can Firehose be combined with?
It can be used as a producer for Kinesis Analytics
Describe Kinesis Data Analytics
Used to query and analyse streaming data using SQL
When should you use KDA?
- When you need to take action in real time
- When you need to organise, enrich and transform
Where can KDA accept streams from?
Kinesis Firehose
Kinesis Data Streams
Describe Apache Kafka
A publish/subscribe messaging system with storage. The sender or producer sends the message to Kafka, which then stores the data for a specified amount of time. The receiver subscribes to a topic and then receives the data they want.
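A toy in-memory model of the idea: messages are appended to a topic, kept for a retention period, and read by subscribers from an offset they track themselves. This is not the Kafka client API; `ToyTopic` and its methods are hypothetical names.

```python
import time


class ToyTopic:
    """Stand-in for one topic: an append-only log with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._log = []  # entries are (offset, timestamp, message)
        self._next_offset = 0

    def publish(self, message, now=None):
        """Producer side: append a message to the end of the log."""
        now = time.time() if now is None else now
        self._log.append((self._next_offset, now, message))
        self._next_offset += 1

    def expire(self, now=None):
        """Drop messages older than the retention period."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self._log = [entry for entry in self._log if entry[1] >= cutoff]

    def read_from(self, offset):
        """Subscriber side: return retained messages at or after offset,
        plus the offset to resume from next time."""
        messages = [msg for (off, _, msg) in self._log if off >= offset]
        return messages, self._next_offset
```

Because the log is stored rather than handed off, a subscriber that restarts can re-read from an earlier offset as long as the messages are still within retention.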
How can Kafka be set up?
It can be installed on EC2 instances or run as a managed Amazon service (Amazon MSK)
When should you use Kafka?
- streaming ingest
- ETL
- CDC
- Big data ingest
Which kinesis service can store data internally?
KDS
Which Kinesis services cannot write directly to storage?
KDA and KDS
Which Kinesis services can write directly to storage?
KDF and KVS
Which Kinesis service can process data using Lambda?
KDF
Which Kinesis services can change data?
KDF and KDA
Which data repositories can KDF send data to directly?
- S3
- Redshift
- Elasticsearch
- Splunk
Which Kinesis services can perform ETL pre-processing on data?
- KDF using Lambda
- KDA in real time using SQL
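A minimal sketch of a Firehose transformation Lambda. Firehose passes a batch of base64-encoded records; the function returns each one with the same `recordId`, a `result` status and the re-encoded `data`. It assumes the incoming records are JSON, and the added `source` field is a hypothetical enrichment.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back in the expected shape."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "firehose"  # hypothetical enrichment step
        # Newline-delimit the records so they stay separable at the destination.
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # other statuses: "Dropped", "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```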
What types of data can KVS process?
radar, audio, video and images
What is the advantage of KPL over Kinesis API?
KPL provides many added features, such as built-in retries for failed transmissions. These need to be coded yourself when using the Kinesis API
What are the pros of KPL?
- Performance benefits
- Consumer-side ease of use via the KCL
- Producer monitoring via CloudWatch
- Asynchronous architecture, KPL has a buffer to store records whilst they are processed
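The KPL itself is a Java library, so the following is only a Python imitation of the buffering idea, not the KPL. It collects records and sends them in batches with the PutRecords API call, which accepts at most 500 records per request. The class name is hypothetical, a boto3 Kinesis client is assumed, and the retries and background sending that the KPL provides are left out.

```python
class BufferedProducer:
    """Collect records in memory and write them to Kinesis in batches."""

    MAX_BATCH = 500  # PutRecords accepts at most 500 records per request

    def __init__(self, kinesis_client, stream_name, batch_size=100):
        self.client = kinesis_client
        self.stream_name = stream_name
        self.batch_size = min(batch_size, self.MAX_BATCH)
        self._buffer = []

    def add(self, data, partition_key):
        """Buffer one record; send the batch once it is full."""
        self._buffer.append({"Data": data, "PartitionKey": partition_key})
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered. Returns the API response, or None if empty."""
        if not self._buffer:
            return None
        batch, self._buffer = self._buffer, []
        # A fuller version would inspect FailedRecordCount in the response
        # and resend the records that were rejected.
        return self.client.put_records(StreamName=self.stream_name, Records=batch)
```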