Data Engineering Flashcards

1
Q

What is Amazon Kinesis?

A

A cloud-based service that enables real-time processing of streaming data at scale.

2
Q

Name the four main components of Amazon Kinesis.

A

Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, and Kinesis Video Streams

3
Q

What is the primary purpose of Kinesis Data Streams?

A

To collect and process real-time streaming data for custom applications.
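
As a rough illustration of the "collect" side, a producer can write a record to a stream with the boto3 `put_record` call. This is a minimal sketch, not a reference implementation: the stream name `example-stream`, the event fields, and both helper names are hypothetical, and the send step assumes boto3 is installed and AWS credentials are configured.

```python
import json


def build_put_record_args(stream_name, event, key_field):
    """Build the keyword arguments for the Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        # Data must be bytes; here the event is serialized as JSON.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(stream_name, event, key_field):
    """Send one event to the stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("kinesis")
    return client.put_record(**build_put_record_args(stream_name, event, key_field))


# Hypothetical stream name and event, for illustration only.
args = build_put_record_args(
    "example-stream", {"user_id": 42, "action": "click"}, "user_id"
)
```

Separating the argument builder from the network call keeps the record format easy to check without touching AWS.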

4
Q

What is the main use of Kinesis Data Firehose?

A

To load streaming data into destinations like Amazon S3, Amazon Redshift, or Amazon OpenSearch Service (formerly Elasticsearch).

5
Q

How does Kinesis Data Analytics help in data processing?

A

It allows you to analyze streaming data in real time using SQL queries, or two types of ML algorithms:

  • Random Cut Forest: finds outliers in the data, using recent history
  • Hotspots: locates and returns information about dense regions in your data

6
Q

Which Kinesis component would you use for video data streaming?

A

Kinesis Video Streams.

7
Q

What integration capabilities does Kinesis Data Firehose have?

A

It integrates with AWS services such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (formerly Elasticsearch), and with third-party tools like Splunk.

8
Q

How does Kinesis ensure data durability?

A

It automatically and synchronously replicates data across three Availability Zones (AZs) in an AWS Region.

9
Q

Explain the difference between Kinesis Data Streams and Kinesis Data Firehose.

A

Kinesis Data Streams is designed for custom, real-time data processing with low latency and more control. Users write the processing logic and, in provisioned capacity mode, manage shards themselves. It can store data for up to 365 days (the default is 24 hours).

Kinesis Data Firehose is a fully managed service for delivering streaming data to storage or analytics destinations with minimal setup and automatic scaling. Delivery is buffered, so it adds slight latency (typically 60 seconds or less), and it does not store data.

Both can scale automatically: Firehose always does, and Data Streams does when the stream uses on-demand capacity mode.
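
To make "managing shards" more concrete: Data Streams routes each record to a shard by hashing its partition key with MD5 into a 128-bit integer and picking the shard whose hash key range contains that value. The sketch below mimics that routing locally. It assumes the shards split the hash space evenly, and the function names are hypothetical, so treat it as a mental model rather than the service's actual code.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1  # largest value of a 128-bit hash key


def even_shard_ranges(shard_count):
    """Split the 128-bit hash key space into contiguous, near-equal ranges."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the space is fully covered.
        end = MAX_HASH_KEY if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key not covered by any shard range")


ranges = even_shard_ranges(4)
shard_index = shard_for_key("user-1", ranges)
```

The same partition key always maps to the same shard, which is why a skewed key distribution produces "hot" shards that provisioned-mode users have to split.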

10
Q

What is AWS Glue?

A

A serverless data integration service focused on ETL. It runs Apache Spark jobs written in Python (PySpark) or Scala.

  • Crawlers scan data sources (more than 70 are supported), such as data stored in S3, and generate a metadata catalog (the Glue Data Catalog)
  • Visually manage ETL jobs using Glue Studio
  • Search and query the cataloged data using Athena, EMR, or Redshift Spectrum
11
Q

Glue ETL feature

A

Transform and enrich data (before doing analysis)

  • Generate ETL code in Python (PySpark) or Scala, running on Spark
  • Targets can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog
  • Fully managed, pay-as-you-go

Glue Scheduler to schedule jobs

Glue Triggers to automate job runs based on “events”

12
Q

Glue ETL - Transformations

A

Bundled Transformations:

  • DropFields, DropNullFields
  • Filter - use function to filter records
  • Join - to Enrich Data
  • Map - add fields, delete fields, perform external lookups

Machine Learning Transformations:

  • FindMatches ML: identify duplicate or matching records in a dataset, even when the records have no common unique identifier and no fields match exactly

Format conversions: CSV, JSON, Avro, Parquet, ORC, XML

Can use any Apache Spark transformation (for example, K-Means)
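
The bundled transformations above might be chained in a Glue PySpark script roughly as follows. This is a sketch under stated assumptions: the database `example_db`, the table `orders`, the field names, and the S3 path are all hypothetical, and `run_job` only works inside a Glue Spark job, where the `awsglue` and `pyspark` modules are available. The two per-record functions are plain Python and can be tried anywhere.

```python
def is_valid_order(record):
    """Predicate for Filter: keep records with a positive quantity."""
    return record["quantity"] is not None and record["quantity"] > 0


def add_total(record):
    """Function for Map: add a derived field to each record."""
    record["total"] = record["quantity"] * record["unit_price"]
    return record


def run_job():
    """Runs only inside a Glue Spark job, where awsglue and pyspark exist."""
    from awsglue.context import GlueContext
    from awsglue.transforms import DropFields, Filter, Map
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    # Read a table that a crawler registered in the Glue Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="orders"
    )
    orders = Filter.apply(frame=orders, f=is_valid_order)
    orders = Map.apply(frame=orders, f=add_total)
    orders = DropFields.apply(frame=orders, paths=["internal_note"])
    # Format conversion: write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=orders,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/orders/"},
        format="parquet",
    )
```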

13
Q

What is Amazon Redshift?

A

A data warehousing solution that can handle large-scale datasets and database migrations. It is built on an OLAP (Online Analytical Processing) architecture, and data is organized in columns.

It differs from RDS in that it is designed for analytical workloads on relational data. RDS/Aurora is based on OLTP (Online Transaction Processing) and is row-based.

Redshift can also access data stored outside its clusters (in S3) via Redshift Spectrum.

Redshift is optimized to store much more data than RDS: it can store up to 128 TB per node (or virtually limitless amounts if the data comes from external sources), whereas RDS can reach 64 TB for an entire DB instance.
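
The row-based versus column-based distinction can be sketched in plain Python. The toy table and the `to_columns` helper below are hypothetical and say nothing about how Redshift or RDS store data on disk; they only show why an aggregate over one field touches less data in a columnar layout.

```python
# Row-oriented layout: each record is stored together (OLTP-friendly).
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each field is stored together (OLAP-friendly).
columns = to_columns(rows)

# OLTP-style access: fetch one whole record by key.
order_2 = next(r for r in rows if r["order_id"] == 2)

# OLAP-style access: the aggregate reads only the "amount" column.
total_amount = sum(columns["amount"])
```

Fetching `order_2` needs every field of a single row, while `total_amount` needs a single field of every row, which is the access pattern each layout is built for.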

14
Q

Amazon Athena

A

A serverless service for querying data with SQL directly in S3 or in other federated data sources, without loading it into a database first.
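
A query might be submitted from Python with the boto3 `start_query_execution` call, roughly as below. This is a sketch with assumptions: the database `example_db`, the table `orders`, the results bucket, and the helper names are hypothetical, and the submit step needs boto3 plus AWS credentials.

```python
def build_query_args(sql, database, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(sql, database, output_location):
    """Submit the query and return its execution ID (needs boto3)."""
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_query_args(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical query, database, and bucket, for illustration only.
query_args = build_query_args(
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "example_db",
    "s3://example-bucket/athena-results/",
)
```

The call is asynchronous: it returns an execution ID, and the results are read later once the query has finished.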

15
Q

What are the relevant data stores used in ML?

A

Redshift - Data warehousing based on OLAP, column-based
RDS/Aurora - Relational database based on OLTP, row-based; not used directly for ML
DynamoDB - NoSQL data store; can be used to store model output/params
OpenSearch (previously Elasticsearch) - Indexing of data; search among data points; clickstream analytics
ElastiCache - Caching of data for quick I/O

16
Q

AWS Data Pipeline

A

An ETL orchestrator
Manages task dependencies
Gives more control over the environment and compute resources
Can access the underlying resources (EC2, EMR) in your account