Data Engineering Flashcards
What is Amazon Kinesis?
A cloud-based service that enables real-time processing of streaming data at scale.
Name the four main components of Amazon Kinesis.
Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, and Kinesis Video Streams
What is the primary purpose of Kinesis Data Streams?
To collect and process real-time streaming data for custom applications.
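A stream distributes records across shards by taking the MD5 hash of each record's partition key and matching it against the shards' hash-key ranges. A minimal sketch of that routing, assuming the shards split the 128-bit keyspace evenly (as CreateStream does by default):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard a record would land on.

    Kinesis hashes the partition key with MD5 (a 128-bit integer) and
    matches it against each shard's hash-key range. This sketch assumes
    evenly split shards; it is an illustration, not the AWS API.
    """
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(key_hash // range_size, num_shards - 1)

# The same partition key always routes to the same shard,
# which is what preserves per-key record ordering.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))  # True
```

This is why a skewed partition key (many records sharing one key) creates a hot shard: all of those records land on the same shard regardless of how many shards exist.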
What is the main use of Kinesis Data Firehose?
To load streaming data into destinations like Amazon S3, Redshift, or OpenSearch Service (formerly Elasticsearch).
How does Kinesis Data Analytics help in data processing?
It lets you analyze streaming data in real time using SQL queries, or with two built-in ML algorithms:
- Random Cut Forest: finds outliers in the data, scoring each point against recent history
- Hotspots: locates and returns information on dense regions in your data
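Random Cut Forest itself builds an ensemble of random trees; as a simplified stand-in for the same idea (score each new point against a window of recent observations), here is a rolling z-score detector in plain Python:

```python
from collections import deque
from statistics import mean, stdev

def anomaly_flags(values, window=10, threshold=3.0):
    """Flag values that deviate strongly from recent history.

    NOTE: this rolling z-score is a simplified stand-in for Random Cut
    Forest, only illustrating "use recent history to score outliers";
    it is not the algorithm Kinesis Data Analytics runs.
    """
    history = deque(maxlen=window)
    flags = []
    for v in values:
        if len(history) >= 2 and stdev(history) > 0:
            z = abs(v - mean(history)) / stdev(history)
            flags.append(z > threshold)
        else:
            flags.append(False)  # not enough history yet
        history.append(v)
    return flags

print(anomaly_flags([10, 11, 10, 12, 11, 10, 95, 11, 10]))
```

The spike of 95 is flagged because it sits far outside the window of surrounding values; once it enters the history it inflates the window's spread, so the following normal values are not flagged.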
Which Kinesis component would you use for video data streaming?
Kinesis Video Streams.
What integration capabilities does Kinesis Data Firehose have?
It integrates with AWS services such as Amazon S3, Redshift, and OpenSearch Service (formerly Elasticsearch), and with third-party tools like Splunk.
How does Kinesis ensure data durability?
It synchronously replicates data across three Availability Zones within a Region.
Explain the difference between Kinesis Data Streams and Kinesis Data Firehose.
Kinesis Data Streams is designed for custom, real-time data processing with low latency and more control. It requires users to manage shards and processing logic and can store data for up to 365 days.
Kinesis Data Firehose is a fully managed service for delivering streaming data to storage or analytics destinations with minimal setup, automatic scaling, and a slight delivery latency (typically 60 seconds or less); it does not store data.
Both support automatic scaling (Data Streams via on-demand capacity mode).
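Firehose's near-real-time latency comes from buffering: it batches records and delivers when either a buffer size or a buffer interval is reached. A toy model of that behavior (the size/interval values mirror common defaults but are assumptions, and this is not the AWS API):

```python
import time

class FirehoseBufferSketch:
    """Illustrative model of Firehose buffering, not the AWS API.

    Records accumulate until either max_bytes or max_seconds is hit,
    then the whole batch is "delivered" (returned) at once -- the
    source of Firehose's typical <=60 s delivery latency.
    """
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=60,
                 now=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.now = now
        self.buffer = []
        self.buffered_bytes = 0
        self.started = None

    def put_record(self, record: bytes):
        if self.started is None:
            self.started = self.now()
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= self.max_bytes
                or self.now() - self.started >= self.max_seconds):
            return self.flush()  # batch handed to S3/Redshift/etc.
        return None              # still buffering

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.buffered_bytes, self.started = 0, None
        return batch

# Tiny buffer so the size trigger fires immediately in this demo.
fh = FirehoseBufferSketch(max_bytes=10, max_seconds=60)
print(fh.put_record(b"12345"))   # None: still buffering
print(fh.put_record(b"67890"))   # 10 bytes reached: batch delivered
```

Contrast with Data Streams, where each record is available to consumers within milliseconds of the put, with no delivery batching in between.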
What is AWS Glue?
A serverless data integration service focused on ETL. Jobs run Apache Spark code written in Scala or Python (PySpark).
- Use Crawlers to scan data in S3 and 70+ other data sources and generate a metadata catalog (the Glue Data Catalog)
- Visually manage ETL jobs using Glue Studio
- Search and query data using Athena, EMR, Redshift Spectrum
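The core of what a Crawler does, inferring a table schema from sample records before writing the definition to the Data Catalog, can be sketched in plain Python (real crawlers use classifiers for many formats; this only handles flat JSON-like records):

```python
def infer_schema(records):
    """Infer a column -> type mapping from sample records.

    A rough sketch of what a Glue Crawler does before registering a
    table in the Data Catalog; type names follow Glue/Hive conventions,
    and conflicting types across records are widened to string.
    """
    type_names = {int: "bigint", float: "double", str: "string", bool: "boolean"}
    schema = {}
    for record in records:
        for column, value in record.items():
            t = type_names.get(type(value), "string")
            if schema.get(column, t) != t:
                t = "string"  # conflict between records: widen
            schema[column] = t
    return schema

rows = [{"id": 1, "price": 9.99},
        {"id": 2, "price": "n/a"}]   # price is dirty in one record
print(infer_schema(rows))
```

Once the schema lives in the Data Catalog, Athena, EMR, and Redshift Spectrum can all query the underlying S3 data against that one shared table definition.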
Glue ETL feature
Transform, Enrich data (before doing analysis)
- Generate ETL code in Python or Scala, Spark, Pyspark
- Targets can be S3, Redshift, RDS, or tables defined in the Glue Data Catalog
- Fully managed, pay-as-you-go
Glue Scheduler to schedule jobs
Glue Triggers to automate job runs based on “events”
Glue ETL - Transformations
Bundled Transformations:
- DropFields, DropNullFields
- Filter - use function to filter records
- Join - to Enrich Data
- Map - add or delete fields, perform external lookups
Machine Learning Transformations:
- FindMatches ML: identifies duplicate or matching records in a dataset, even when the records have no common unique identifier and no fields match exactly
Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
Can use any Apache Spark transformation (e.g., K-Means)
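The bundled transforms operate on DynamicFrames inside a Glue job; their behavior can be approximated on plain Python dicts. The function names below mirror the Glue transforms but this is not the Glue API:

```python
def drop_fields(records, fields):
    # DropFields: remove the named columns from every record.
    return [{k: v for k, v in r.items() if k not in fields} for r in records]

def filter_records(records, predicate):
    # Filter: keep only records for which the function returns True.
    return [r for r in records if predicate(r)]

def map_records(records, fn):
    # Map: apply a function to each record (add/delete fields, lookups).
    return [fn(r) for r in records]

rows = [{"id": 1, "tmp": "x", "amount": 50},
        {"id": 2, "tmp": "y", "amount": 500}]

rows = drop_fields(rows, {"tmp"})                               # scrub
rows = filter_records(rows, lambda r: r["amount"] > 100)        # select
rows = map_records(rows, lambda r: {**r, "usd": r["amount"] / 100})  # enrich
print(rows)
```

Chaining small transforms like this, clean, then filter, then enrich, before the data reaches the analysis layer is exactly the "Transform, Enrich" role the Glue ETL feature plays.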
What is Amazon Redshift?
A data warehousing solution that can handle large-scale datasets and database migrations. Built on an OLAP (Online Analytical Processing) architecture; data is organized in columns.
It differs from RDS in its ability to handle analytical workloads on relational databases. RDS/Aurora is based on OLTP (Online Transactional Processing) and is row-based.
Redshift can also access data stored outside its clusters (e.g., in S3) via Redshift Spectrum.
Redshift is optimized to store much more data than RDS, can store 128TB per node(or virtually limitless if data comes from external sources). RDS can reach 64TB for entire DB engine.
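Column orientation is the reason Redshift handles analytical scans well: an aggregate over one column reads only that column's storage, while a row store must read every full record. A toy illustration of the two layouts:

```python
# Row store (OLTP / RDS style): each record is kept together,
# which makes fetching one complete row cheap.
row_store = [
    {"order_id": 1, "customer": "a", "amount": 120},
    {"order_id": 2, "customer": "b", "amount": 80},
    {"order_id": 3, "customer": "a", "amount": 200},
]

# Column store (OLAP / Redshift style): each column is kept together,
# which makes scanning one column across all rows cheap.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [120, 80, 200],
}

# OLTP-style access: one full record -- the row layout wins.
order_2 = next(r for r in row_store if r["order_id"] == 2)

# OLAP-style access: SUM(amount) touches a single contiguous column;
# the customer and order_id columns never need to be read.
total = sum(column_store["amount"])
print(order_2, total)
```

Columnar layout also compresses far better (similar values sit next to each other), which is part of how a warehouse scales to volumes a row-based engine cannot.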
Amazon Athena
Query data instantly from S3 or other federated data sources with SQL
What are the relevant data stores used in ML?
Redshift - Data Warehousing based on OLAP, column-based
RDS/Aurora - Data Warehousing based on OLTP, row-based, not used directly for ML
DynamoDB - NoSQL data store, can be used to store model output/params
OpenSearch (previously Elasticsearch) - Indexing of data; Search amongst data points; Clickstream Analytics
ElastiCache - Caching of data for quick I/O
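The caching pattern ElastiCache serves, e.g., memoizing feature lookups in front of a slower store, can be sketched with an in-process TTL cache; Redis/Memcached provide the same idea as a shared network service (the key names below are made up for illustration):

```python
import time

class TTLCache:
    """Tiny cache-aside sketch: entries expire after ttl seconds.

    Stands in for a Redis/Memcached node fronting a slow data store;
    this is a local illustration, not an ElastiCache client.
    """
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None          # miss: caller falls back to the DB
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

cache = TTLCache(ttl_seconds=60)
cache.set("features:user-42", [0.1, 0.7])   # hypothetical feature vector
print(cache.get("features:user-42"))
print(cache.get("features:missing"))
```

The TTL matters in ML serving: stale cached features are tolerable for a bounded window, after which the next read falls through to the source of truth.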