Data Engineering - Batch Processing for ML Flashcards
Define Batch Processing
Processing that usually runs on a specific schedule. Data often waits until the next batch is processed.
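To make the definition concrete, here is a minimal sketch of a batch run, assuming records simply accumulate in an in-memory list between runs. The function name `run_batch_job` and the upper-casing transform are hypothetical stand-ins, not part of any AWS service.

```python
def run_batch_job(pending):
    """Process every record that accumulated since the last run, then empty the queue."""
    processed = [record.upper() for record in pending]  # stand-in for a real transform
    pending.clear()  # nothing is left waiting once the batch has run
    return processed


# Records arrive between runs and wait in the queue until the next scheduled run.
pending = ["a", "b", "c"]
result = run_batch_job(pending)
```

In practice a scheduler (cron, for example) would trigger the run; the sketch only shows that data sits idle until the batch fires.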
Name a service that is commonly used for batch processing in ML for ETL
AWS Glue
What type of service is Glue?
An Extract, Transform, Load (ETL) service
Name the steps of AWS Glue
A Glue crawler scans the data source, then the schema it discovers is stored as databases and tables in the Glue Data Catalog
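The crawler step can be sketched as a request for the Glue CreateCrawler API. This is a minimal sketch under stated assumptions: the helper `build_crawler_request`, the crawler name, role ARN, database name, bucket path, and schedule are all hypothetical placeholders. The boto3 calls appear only in comments so the block runs without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue CreateCrawler API (hypothetical helper)."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # cron expression for scheduled crawls
    return request


# All values below are placeholders chosen for illustration.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```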
Name some built-in data classifiers Glue offers
Parquet, JSON, BSON, XML, CSV, PostgreSQL, MySQL
What if the data is not in a format that Glue has a built-in classifier for?
You can build a custom classifier using a grok pattern, an XML tag, a JSON path, or CSV settings
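To show what a grok pattern does, the sketch below expands `%{NAME:field}` tokens into named regular-expression groups. It only illustrates the idea and is not Glue's implementation. The three-entry `PATTERNS` table and the helper `grok_to_regex` are assumptions made for the example; real grok libraries ship many more named patterns.

```python
import re

# A tiny, hypothetical pattern library for illustration only.
PATTERNS = {
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}


def grok_to_regex(grok):
    """Expand %{NAME:field} tokens into named regex groups and compile the result."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, grok))


pattern = grok_to_regex("%{IP:client} %{WORD:method} %{INT:status}")
fields = pattern.match("10.0.0.7 GET 200").groupdict()
# fields maps each named field to the text it captured from the log line
```

A classifier built this way turns each matching line into named columns, which is what lets a crawler infer a table schema from an otherwise unstructured log.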
Describe AWS Database Migration Service (DMS) for data ingestion
Designed to transfer data between databases
Which source databases can AWS Database Migration Service ingest from?
Databases on Amazon RDS, on EC2 instances, and on premises
Why is AWS Database Migration Service so reliable?
It transfers data in transactions, so if a transfer fails it can roll back any records in transit. You can be confident that all data has been transferred.
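The all-or-nothing behaviour described above can be illustrated with a local SQLite database. This is a minimal sketch of transactional transfer in general and makes no claim about how the migration service is implemented. The `transfer` function, the `items` table, and the simulated failure are all hypothetical.

```python
import sqlite3


def transfer(source_rows, target, fail_on=None):
    """Copy rows into `target` inside one transaction; roll back on any failure."""
    try:
        with target:  # commits on success, rolls back if an exception escapes
            for row_id, value in source_rows:
                if row_id == fail_on:
                    raise RuntimeError(f"simulated failure on row {row_id}")
                target.execute(
                    "INSERT INTO items (id, value) VALUES (?, ?)", (row_id, value)
                )
        return True
    except RuntimeError:
        return False


target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
target.commit()
rows = [(1, "a"), (2, "b"), (3, "c")]

# First attempt fails part-way through, so the rows already inserted are rolled back.
failed = transfer(rows, target, fail_on=3)
count_after_failure = target.execute("SELECT COUNT(*) FROM items").fetchone()[0]

# Retrying without the failure commits every row.
succeeded = transfer(rows, target)
count_after_success = target.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Because the failed attempt leaves the target empty rather than half-filled, the retry can run from the start without creating duplicates.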
When can AWS Database Migration Service be used?
- One-off migration
- Configured to move data on a schedule
- Continuous data replication, where data is transferred from the source as soon as it is created.