Data Engineering Flashcards
What is Amazon Kinesis?
A cloud-based service that enables real-time processing of streaming data at scale.
Name the four main components of Amazon Kinesis.
Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, and Kinesis Video Streams
What is the primary purpose of Kinesis Data Streams?
To collect and process real-time streaming data for custom applications.
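A stream distributes records across shards by taking the MD5 hash of each record's partition key and matching it against the shards' hash-key ranges. A minimal sketch of that routing, assuming the shards split the 128-bit keyspace evenly (as CreateStream does by default):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard a record would land on.

    Kinesis hashes the partition key with MD5 (a 128-bit integer) and
    matches it against each shard's hash-key range. This sketch assumes
    evenly split shards; it is an illustration, not the AWS API.
    """
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(key_hash // range_size, num_shards - 1)

# The same partition key always routes to the same shard,
# which is what preserves per-key record ordering.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))  # True
```

This is why a skewed partition key (many records sharing one key) creates a hot shard: all of those records land on the same shard regardless of how many shards exist.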
What is the main use of Kinesis Data Firehose?
To load streaming data into destinations like Amazon S3, Redshift, or OpenSearch Service (formerly Elasticsearch).
How does Kinesis Data Analytics help in data processing?
It lets you analyze streaming data in real time using SQL queries, or with two built-in ML algorithms:
- Random Cut Forest: finds outliers in the data, scoring each point against recent history
- Hotspots: locates and returns information on dense regions in your data
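Random Cut Forest itself builds an ensemble of random trees; as a simplified stand-in for the same idea (score each new point against a window of recent observations), here is a rolling z-score detector in plain Python:

```python
from collections import deque
from statistics import mean, stdev

def anomaly_flags(values, window=10, threshold=3.0):
    """Flag values that deviate strongly from recent history.

    NOTE: this rolling z-score is a simplified stand-in for Random Cut
    Forest, only illustrating "use recent history to score outliers";
    it is not the algorithm Kinesis Data Analytics runs.
    """
    history = deque(maxlen=window)
    flags = []
    for v in values:
        if len(history) >= 2 and stdev(history) > 0:
            z = abs(v - mean(history)) / stdev(history)
            flags.append(z > threshold)
        else:
            flags.append(False)  # not enough history yet
        history.append(v)
    return flags

print(anomaly_flags([10, 11, 10, 12, 11, 10, 95, 11, 10]))
```

The spike of 95 is flagged because it sits far outside the window of surrounding values; once it enters the history it inflates the window's spread, so the following normal values are not flagged.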
Which Kinesis component would you use for video data streaming?
Kinesis Video Streams.
What integration capabilities does Kinesis Data Firehose have?
It integrates with AWS services such as Amazon S3, Redshift, and OpenSearch Service (formerly Elasticsearch), and with third-party tools like Splunk.
How does Kinesis ensure data durability?
It synchronously replicates data across three Availability Zones within a Region.
Explain the difference between Kinesis Data Streams and Kinesis Data Firehose.
Kinesis Data Streams is designed for custom, real-time data processing with low latency and more control. It requires users to manage shards and processing logic and can store data for up to 365 days.
Kinesis Data Firehose is a fully managed service for delivering streaming data to storage or analytics destinations with minimal setup, automatic scaling, and a slight delivery latency (typically 60 seconds or less); it does not store data.
Both support automatic scaling (Data Streams via on-demand capacity mode).
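Firehose's near-real-time latency comes from buffering: it batches records and delivers when either a buffer size or a buffer interval is reached. A toy model of that behavior (the size/interval values mirror common defaults but are assumptions, and this is not the AWS API):

```python
import time

class FirehoseBufferSketch:
    """Illustrative model of Firehose buffering, not the AWS API.

    Records accumulate until either max_bytes or max_seconds is hit,
    then the whole batch is "delivered" (returned) at once -- the
    source of Firehose's typical <=60 s delivery latency.
    """
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=60,
                 now=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.now = now
        self.buffer = []
        self.buffered_bytes = 0
        self.started = None

    def put_record(self, record: bytes):
        if self.started is None:
            self.started = self.now()
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= self.max_bytes
                or self.now() - self.started >= self.max_seconds):
            return self.flush()  # batch handed to S3/Redshift/etc.
        return None              # still buffering

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.buffered_bytes, self.started = 0, None
        return batch

# Tiny buffer so the size trigger fires immediately in this demo.
fh = FirehoseBufferSketch(max_bytes=10, max_seconds=60)
print(fh.put_record(b"12345"))   # None: still buffering
print(fh.put_record(b"67890"))   # 10 bytes reached: batch delivered
```

Contrast with Data Streams, where each record is available to consumers within milliseconds of the put, with no delivery batching in between.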
What is AWS Glue?
A serverless data integration service focused on ETL. Jobs run Apache Spark code written in Scala or Python (PySpark).
- Use Crawlers to scan data in S3 and 70+ other data sources and generate a metadata catalog (the Glue Data Catalog)
- Visually manage ETL jobs using Glue Studio
- Search and query data using Athena, EMR, Redshift Spectrum
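The core of what a Crawler does, inferring a table schema from sample records before writing the definition to the Data Catalog, can be sketched in plain Python (real crawlers use classifiers for many formats; this only handles flat JSON-like records):

```python
def infer_schema(records):
    """Infer a column -> type mapping from sample records.

    A rough sketch of what a Glue Crawler does before registering a
    table in the Data Catalog; type names follow Glue/Hive conventions,
    and conflicting types across records are widened to string.
    """
    type_names = {int: "bigint", float: "double", str: "string", bool: "boolean"}
    schema = {}
    for record in records:
        for column, value in record.items():
            t = type_names.get(type(value), "string")
            if schema.get(column, t) != t:
                t = "string"  # conflict between records: widen
            schema[column] = t
    return schema

rows = [{"id": 1, "price": 9.99},
        {"id": 2, "price": "n/a"}]   # price is dirty in one record
print(infer_schema(rows))
```

Once the schema lives in the Data Catalog, Athena, EMR, and Redshift Spectrum can all query the underlying S3 data against that one shared table definition.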
Glue ETL feature
Transform, Enrich data (before doing analysis)
- Generate ETL code in Python or Scala, Spark, Pyspark
- Targets can be S3, Redshift, RDS, or tables defined in the Glue Data Catalog
- Fully managed, pay-as-you-go
Glue Scheduler to schedule jobs
Glue Triggers to automate job runs based on “events”
Glue ETL - Transformations
Bundled Transformations:
- DropFields, DropNullFields
- Filter - use function to filter records
- Join - to Enrich Data
- Map - add or delete fields, perform external lookups
Machine Learning Transformations:
- FindMatches ML: identifies duplicate or matching records in a dataset, even when the records have no common unique identifier and no fields match exactly
Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
Can use any Apache Spark transformation (e.g., K-Means)
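The bundled transforms operate on DynamicFrames inside a Glue job; their behavior can be approximated on plain Python dicts. The function names below mirror the Glue transforms but this is not the Glue API:

```python
def drop_fields(records, fields):
    # DropFields: remove the named columns from every record.
    return [{k: v for k, v in r.items() if k not in fields} for r in records]

def filter_records(records, predicate):
    # Filter: keep only records for which the function returns True.
    return [r for r in records if predicate(r)]

def map_records(records, fn):
    # Map: apply a function to each record (add/delete fields, lookups).
    return [fn(r) for r in records]

rows = [{"id": 1, "tmp": "x", "amount": 50},
        {"id": 2, "tmp": "y", "amount": 500}]

rows = drop_fields(rows, {"tmp"})                               # scrub
rows = filter_records(rows, lambda r: r["amount"] > 100)        # select
rows = map_records(rows, lambda r: {**r, "usd": r["amount"] / 100})  # enrich
print(rows)
```

Chaining small transforms like this, clean, then filter, then enrich, before the data reaches the analysis layer is exactly the "Transform, Enrich" role the Glue ETL feature plays.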
What is Amazon Redshift?
A data warehousing solution that can handle large-scale datasets and database migrations. Built on an OLAP (Online Analytical Processing) architecture; data is organized in columns.
It differs from RDS in its ability to handle analytical workloads on relational databases. RDS/Aurora is based on OLTP (Online Transactional Processing) and is row-based.
Redshift can also access data stored outside its clusters (e.g., in S3) via Redshift Spectrum.
Redshift is optimized to store much more data than RDS, can store 128TB per node(or virtually limitless if data comes from external sources). RDS can reach 64TB for entire DB engine.
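Column orientation is the reason Redshift handles analytical scans well: an aggregate over one column reads only that column's storage, while a row store must read every full record. A toy illustration of the two layouts:

```python
# Row store (OLTP / RDS style): each record is kept together,
# which makes fetching one complete row cheap.
row_store = [
    {"order_id": 1, "customer": "a", "amount": 120},
    {"order_id": 2, "customer": "b", "amount": 80},
    {"order_id": 3, "customer": "a", "amount": 200},
]

# Column store (OLAP / Redshift style): each column is kept together,
# which makes scanning one column across all rows cheap.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [120, 80, 200],
}

# OLTP-style access: one full record -- the row layout wins.
order_2 = next(r for r in row_store if r["order_id"] == 2)

# OLAP-style access: SUM(amount) touches a single contiguous column;
# the customer and order_id columns never need to be read.
total = sum(column_store["amount"])
print(order_2, total)
```

Columnar layout also compresses far better (similar values sit next to each other), which is part of how a warehouse scales to volumes a row-based engine cannot.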
Amazon Athena
Query data instantly from S3 or other federated data sources with SQL
What are the relevant data stores used in ML?
Redshift - Data Warehousing based on OLAP, column-based
RDS/Aurora - Data Warehousing based on OLTP, row-based, not used directly for ML
DynamoDB - NoSQL data store, can be used to store model output/params
OpenSearch (previously Elasticsearch) - Indexing of data; Search amongst data points; Clickstream Analytics
ElastiCache - Caching of data for quick I/O
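The caching pattern ElastiCache serves, e.g., memoizing feature lookups in front of a slower store, can be sketched with an in-process TTL cache; Redis/Memcached provide the same idea as a shared network service (the key names below are made up for illustration):

```python
import time

class TTLCache:
    """Tiny cache-aside sketch: entries expire after ttl seconds.

    Stands in for a Redis/Memcached node fronting a slow data store;
    this is a local illustration, not an ElastiCache client.
    """
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None          # miss: caller falls back to the DB
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

cache = TTLCache(ttl_seconds=60)
cache.set("features:user-42", [0.1, 0.7])   # hypothetical feature vector
print(cache.get("features:user-42"))
print(cache.get("features:missing"))
```

The TTL matters in ML serving: stale cached features are tolerable for a bounded window, after which the next read falls through to the source of truth.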