Data Engineering - Streaming data for ML Flashcards

1
Q

Name the 4 Kinesis services

A

Data Streams, Video Streams, Data Firehose and Data Analytics

2
Q

Define a data producer

A

Produces streaming data as JSON objects or blobs as the data is generated

3
Q

Give examples of a data producer

A
  • IoT devices
  • Manufacturing devices
  • User interaction with a website or a video game
4
Q

Describe the two methods data producers can use to write to Kinesis

A
  1. They can use Kinesis Producer Library (KPL) - a Java library for writing to Kinesis
  2. Use the Kinesis API
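
The second option can be sketched in Python. This is a minimal illustration, assuming the producer calls the `put_record` operation of the Kinesis API through a boto3-style client; the stream name, the `device_id` field and the `FakeKinesisClient` stand-in are hypothetical and only there so the sketch runs without AWS credentials.

```python
import json


def send_event(client, stream_name, event, key_field="device_id"):
    """Write one event to a Kinesis data stream through the Kinesis API."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),  # payload is sent as bytes
        PartitionKey=str(event[key_field]),      # decides the target shard
    )


class FakeKinesisClient:
    """Hypothetical stand-in for boto3.client("kinesis"); records calls only."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"ShardId": "shardId-000000000000", "SequenceNumber": "1"}


fake = FakeKinesisClient()
response = send_event(fake, "sensor-stream", {"device_id": "pump-7", "temp": 71.3})
```

With real credentials, `boto3.client("kinesis")` could be passed in place of the fake client.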
5
Q

Describe Kinesis Data Streams

A

Ingests data from producers and holds it in shards, from which consumers read it

6
Q

Give examples of data consumers

A

Lambda, EC2, Kinesis Data Analytics and EMR

7
Q

Can data be sent from KDS directly to a data repository?

A

No, it first needs to go to a consumer

8
Q

Define a consumer

A

AWS service or distributed Kinesis application that retrieves data from KDS

9
Q

Define a shard

A

The base throughput unit of KDS. Data consumers retrieve data from all the shards in a stream as the data is generated

10
Q

Explain partition keys

A

Data producers assign partition keys to records. Partition keys determine which shard ingests the data record from the data stream
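
A rough sketch of how a partition key maps to a shard: Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The helper below is hypothetical and assumes the hash space is split evenly across shards, which is not guaranteed after re-sharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash range contains the key.

    Assumes shard i owns an equal slice of the 128-bit hash space.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * shard_count // HASH_SPACE


first = shard_for_key("pump-7", 4)
second = shard_for_key("pump-7", 4)
```

The same key always lands on the same shard, which is why records that share a partition key keep their relative order.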

11
Q

Define a data stream

A

A logical grouping of shards. Records are retained for between 24 hours (the default) and 7 days depending on retention settings (AWS has since raised the maximum to 365 days)

12
Q

What do data consumers use to consume data from KDS?

A

The Kinesis Client Library (KCL), a Java library

13
Q

How many records per second can each shard ingest?

A

1000 (writes are also capped at 1 MB per second per shard)
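
The per-shard limit can be turned into a rough sizing estimate. This is a back-of-the-envelope sketch, not an official formula: it uses the 1000 records per second figure from the card and additionally assumes a write cap of about 1 MB per second per shard (taken here as 1,000,000 bytes). The function name is hypothetical.

```python
import math

RECORDS_PER_SECOND_PER_SHARD = 1000
# Assumption: write cap of roughly 1 MB/s per shard, approximated as 10**6 bytes.
BYTES_PER_SECOND_PER_SHARD = 1_000_000


def shards_needed(records_per_second, avg_record_bytes):
    """Estimate the shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SECOND_PER_SHARD)
    by_bytes = math.ceil(
        records_per_second * avg_record_bytes / BYTES_PER_SECOND_PER_SHARD
    )
    return max(1, by_count, by_bytes)


small_records = shards_needed(2500, 200)   # limited by record count
large_records = shards_needed(500, 5000)   # limited by bytes per second
```

Whichever limit is hit first decides the shard count, so small frequent records and large infrequent records can need the same number of shards.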

14
Q

How can an individual shard be identified?

A

Each shard has a unique shard ID. Records are routed to it by partition key, and each data record in it has a sequence number

15
Q

What is re-sharding?

A

Changing the number of shards that the KDS has after start-up

16
Q

When do you use KDS?

A
  • Transferring data into AWS to be processed by data consumers
  • Data must be temporarily stored in case it needs to be reprocessed
  • Data needs to be processed before it can be stored
  • Real time analytics by data consumers
17
Q

Describe Kinesis Video Streams

A

Processes video streams from connected devices. The data can be sent directly to data consumers for processing or to a data repository.

18
Q

When do you use KVS?

A
  • When you need to collect video streaming data for processing and real-time analysis
  • Batch-process and store streaming video
  • Feed streaming data into other AWS services
19
Q

Describe Kinesis Firehose

A

For receiving massive volumes of streaming data and storing them in an AWS repository

20
Q

When do you use Data Firehose?

A
  • When sending data directly to a data repository without processing
  • When the final destination is S3
  • When data retention is not important
21
Q

Does data firehose have internal storage?

A

No, Firehose has no shards and so no data retention or storage

22
Q

Which other Kinesis service can Firehose be combined with?

A

It can be used as a producer for Kinesis Analytics

23
Q

Describe Kinesis Data Analytics

A

Used to query and analyse streaming data using SQL

24
Q

When should you use KDA?

A
  • When you need to take action in real time
  • When you need to organise, enrich and transform streaming data
25
Q

Where can KDA accept streams from?

A
  • Kinesis Data Firehose
  • Kinesis Data Streams
26
Q

Describe Apache Kafka

A

A publish/subscribe messaging system with storage. The sender (producer) sends a message to Kafka, which stores it for a specified amount of time. The receiver subscribes to a topic and then receives the data it wants.

27
Q

How can Kafka be set up?

A

It can be installed on EC2 instances or run as a managed Amazon service (Amazon MSK)

28
Q

When should you use Kafka?

A
  • Streaming ingest
  • ETL
  • CDC (change data capture)
  • Big data ingest
29
Q

Which Kinesis service can store data internally?

A

KDS

30
Q

Which Kinesis services cannot write directly to storage?

A

KDA and KDS

31
Q

Which Kinesis services can write directly to storage?

A

KDF and KVS

32
Q

Which Kinesis service can process data using Lambda?

A

KDF

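
As a hedged illustration of what that Lambda processing can look like: Firehose invokes the function with a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` status and re-encoded `data`. The transformation itself (upper-casing a `status` field) is a made-up example.

```python
import base64
import json


def lambda_handler(event, context):
    """Upper-case a hypothetical 'status' field on every Firehose record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["status"] = str(payload.get("status", "")).upper()
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}


# Local dry run with a hand-built event, no AWS required.
sample = {"records": [{
    "recordId": "r1",
    "data": base64.b64encode(json.dumps({"status": "ok"}).encode("utf-8")).decode("utf-8"),
}]}
result = lambda_handler(sample, None)
```

The trailing newline added to each payload is a common convention so that records stay line-delimited once Firehose concatenates them in S3.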
33
Q

Which Kinesis services can change data?

A

KDF and KDA

34
Q

Which data repositories can KDF send data to directly?

A
  • S3
  • Redshift
  • Elasticsearch
  • Splunk
35
Q

Which Kinesis services can perform ETL pre-processing on data?

A
  • KDF, using Lambda
  • KDA, in real time using SQL
36
Q

What types of data can KVS process?

A

Radar, audio, video and images

37
Q

What is the advantage of KPL over the Kinesis API?

A

KPL provides a lot of added features, such as built-in retries for failed transmissions. These need to be coded yourself when using the Kinesis API

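
To give a feel for what "coding it yourself" means, here is a minimal sketch of retrying failed records when calling `put_records` directly. It assumes a boto3-style client whose response carries `FailedRecordCount` and a per-record `ErrorCode` for failures; `put_with_retry` and `FlakyClient` are hypothetical names, and the flaky client only simulates one throttled record.

```python
import time


def put_with_retry(client, stream_name, records, max_attempts=3, base_delay=0.1):
    """Send records with put_records, resending only the ones that failed."""
    pending = list(records)
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Results come back in request order; failed entries carry an ErrorCode.
        pending = [rec for rec, res in zip(pending, response["Records"])
                   if "ErrorCode" in res]
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return pending  # records still failing after max_attempts


class FlakyClient:
    """Offline stand-in: rejects the first record on the first call only."""

    def __init__(self):
        self.calls = 0

    def put_records(self, StreamName, Records):
        self.calls += 1
        results = [{"SequenceNumber": "1", "ShardId": "shardId-000000000000"}
                   for _ in Records]
        if self.calls == 1:
            results[0] = {"ErrorCode": "ProvisionedThroughputExceededException",
                          "ErrorMessage": "throttled"}
        failed = sum("ErrorCode" in r for r in results)
        return {"FailedRecordCount": failed, "Records": results}


client = FlakyClient()
batch = [{"Data": b"a", "PartitionKey": "k1"}, {"Data": b"b", "PartitionKey": "k2"}]
leftover = put_with_retry(client, "sensor-stream", batch, base_delay=0.01)
```

The KPL handles this kind of bookkeeping internally, which is the convenience the card refers to.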
38
Q

What are the pros of KPL?

A
  • Performance benefits
  • Consumer-side ease of use via the KCL
  • Producer monitoring via CloudWatch
  • Asynchronous architecture: KPL has a buffer to store records whilst they are processed