Data Engineering - Streaming data for ML Flashcards
Name the 4 Kinesis services
Kinesis Data Streams, Kinesis Video Streams, Kinesis Data Firehose and Kinesis Data Analytics
Define a data producer
A device or application that emits streaming data records, as JSON objects or binary blobs, as the data is generated
Give examples of a data producer
- IoT devices
- Manufacturing devices
- User interaction with a website or a video game
Describe the two ways data producers can write to Kinesis
- Use the Kinesis Producer Library (KPL), a Java library for writing to Kinesis
- Use the Kinesis API
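As a rough sketch of the second option, a producer can call the PutRecord operation of the Kinesis API, shown here through the `put_record` method of a boto3 Kinesis client. The helper `send_reading`, the stream name `sensor-stream` and the `device_id` field are hypothetical names for illustration; the client is passed in as an argument so the function can be tried without AWS credentials.

```python
import json


def send_reading(client, stream_name, reading):
    """Write one JSON record to a Kinesis data stream.

    `client` is assumed to be a boto3 Kinesis client, e.g.
    boto3.client("kinesis"); any object exposing a compatible
    put_record method also works.
    """
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),  # record payload as bytes
        PartitionKey=str(reading["device_id"]),    # used to pick the target shard
    )
```

With a real client, the response to a call such as `send_reading(client, "sensor-stream", {"device_id": 7, "temp_c": 21.5})` includes the `ShardId` and `SequenceNumber` assigned to the record.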
Describe Kinesis Data Streams
It ingests records from producers, stores them in shards, and makes them available to consumers
Give examples of data consumers
Lambda, EC2, Kinesis Data Analytics and EMR
Can data be sent from KDS directly to a data repository?
No, it first needs to go to a consumer
Define a consumer
An AWS service or distributed Kinesis application that retrieves data from KDS
Define a shard
The base throughput unit of KDS. Data consumers retrieve data from all the shards in a stream as the data is generated
Explain partition keys
Data producers assign a partition key to each record. The partition key determines which shard in the data stream ingests the record
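A minimal sketch of how a partition key can select a shard: Kinesis hashes each partition key with MD5 into a 128-bit integer, and each shard owns a contiguous range of those hash values. The function names, the even split of the key space and the big-endian reading of the digest are assumptions of this sketch, not the service's internal code.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1  # partition key hashes are 128-bit integers


def hash_partition_key(partition_key: str) -> int:
    """MD5-hash a partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def even_hash_ranges(shard_count: int):
    """Split the 128-bit key space into contiguous, roughly equal ranges."""
    size = (MAX_HASH_KEY + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = MAX_HASH_KEY if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges) -> int:
    """Return the index of the range (shard) that contains the key's hash."""
    h = hash_partition_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash ranges do not cover the key space")
```

Because the mapping depends only on the key, every record with the same partition key lands on the same shard, which is what preserves per-key ordering.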
Define a data stream
A logical grouping of shards. Records are retained for 24 hours by default, extendable to 7 days depending on retention settings (current AWS limits allow up to 365 days)
What do data consumers use to consume data from KDS?
The Kinesis Client Library (KCL), a Java library
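The KCL handles shard discovery, checkpointing and load balancing on top of the lower-level Kinesis API. As a rough illustration of what it wraps, the sketch below reads a few batches from one shard using the `get_shard_iterator` and `get_records` methods of a boto3 Kinesis client. The helper name `read_shard` and the `max_batches` parameter are hypothetical; the client is passed in as an argument.

```python
def read_shard(client, stream_name, shard_id, max_batches=3):
    """Read up to max_batches batches of records from a single shard.

    `client` is assumed to be a boto3 Kinesis client, e.g.
    boto3.client("kinesis").
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
    )["ShardIterator"]

    records = []
    for _ in range(max_batches):
        if iterator is None:  # no further iterator: the shard is closed and fully read
            break
        response = client.get_records(ShardIterator=iterator, Limit=100)
        records.extend(response["Records"])
        iterator = response.get("NextShardIterator")
    return records
```

A real consumer would also pause between calls and track its position per shard, which is the bookkeeping the KCL provides.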
How many records per second can each shard ingest?
1,000 records per second for writes (up to 1 MB per second)
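The per-shard write limit turns into simple arithmetic when sizing a stream. The sketch below estimates a shard count from an expected record rate and average record size, using the limits of 1,000 records per second and 1 MB per second per shard. The function name `shards_needed` is hypothetical, 1 MB is treated as 1,000 KB for simplicity, and read throughput is ignored.

```python
import math

WRITE_RECORDS_PER_SEC = 1000  # per-shard write limit, in records
WRITE_KB_PER_SEC = 1000       # per-shard write limit, 1 MB taken as 1,000 KB


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate the shards required to absorb a given write load."""
    by_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_size = math.ceil(records_per_sec * avg_record_kb / WRITE_KB_PER_SEC)
    # Whichever limit is hit first decides the shard count.
    return max(1, by_count, by_size)
```

For example, `shards_needed(500, 5)` returns 3: the record rate alone fits in one shard, but 2,500 KB per second needs three.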
How can an individual shard be identified?
Each shard has a unique shard ID and owns a range of partition key hash values, and each data record within it has a sequence number
What is re-sharding?
Changing the number of shards in a data stream after it has been created, by splitting or merging shards
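One way to picture re-sharding is as arithmetic on hash key ranges: a split divides one shard's range into two adjacent child ranges, and a merge joins two adjacent ranges back into one. The sketch below models only that arithmetic; the function names are hypothetical and splitting at the midpoint is an assumption. In the Kinesis API the corresponding operations are SplitShard and MergeShards.

```python
def split_range(start: int, end: int):
    """Split one shard's hash key range into two adjacent child ranges."""
    if end <= start:
        raise ValueError("range too small to split")
    new_start = (start + end) // 2 + 1  # first hash key of the second child
    return (start, new_start - 1), (new_start, end)


def merge_ranges(left, right):
    """Merge two adjacent hash key ranges into a single range."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (left[0], right[1])
```

Splitting the full key space `(0, 2**128 - 1)` this way yields `(0, 2**127 - 1)` and `(2**127, 2**128 - 1)`, and merging those two gives back the original range.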