Data Engineering - Data Transformation for Machine Learning Flashcards
What are the two types of data transformation?
- Changing the data structure
- Cleaning the data
What are the 6 issues that might appear in “dirty” data?
- Inconsistent schema - names and order of fields vary
- Extraneous text - additional unnecessary text in field
- Missing data - empty or N/A
- Redundant information - the same data is available in several fields
- Contextual errors - the data is valid but wrong in the real-world context
- Junk values - meaningless data in fields
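As an illustration of the issues listed above, here is a minimal cleaning sketch in plain Python. The field names, the alias table, the missing-value markers and the 0 to 120 age range are all hypothetical choices made for this example, and redundant information is not handled.

```python
import re

# Hypothetical mapping from field names seen in different sources to one schema.
FIELD_ALIASES = {"Name": "name", "full_name": "name", "Age": "age", "age_years": "age"}
# Hypothetical markers that should count as "missing".
MISSING_MARKERS = {"", "n/a", "na", "null", "none"}


def clean_record(raw):
    """Return a cleaned copy of one record (a dict of field name -> value)."""
    cleaned = {}
    for field, value in raw.items():
        # Inconsistent schema: map every known alias to a single field name.
        name = FIELD_ALIASES.get(field, field.lower())
        text = str(value).strip()
        # Missing data: represent empty or N/A values as None.
        cleaned[name] = None if text.lower() in MISSING_MARKERS else text

    age = cleaned.get("age")
    if age is not None:
        # Extraneous text: keep only the digits from values such as "34 years".
        match = re.search(r"\d+", age)
        # Junk values: text with no number in it is treated as missing.
        cleaned["age"] = int(match.group()) if match else None

    # Contextual errors: a valid integer can still be impossible as an age.
    if cleaned.get("age") is not None and not 0 <= cleaned["age"] <= 120:
        cleaned["age"] = None
    return cleaned
```

For example, `clean_record({"full_name": " Ada ", "age_years": "34 years"})` gives a record with `name` set to `"Ada"` and `age` set to the integer `34`.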
Describe Apache Spark
A data processing framework that can quickly perform processing tasks on very large data sets. It can run on Amazon EMR
Describe Amazon EMR
A managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark
Which languages does Apache Spark support?
Java, Scala, Python, R and SQL
How can Apache Spark be used with Amazon SageMaker?
When run with SageMaker, Spark is used for pre-processing the data and SageMaker is used for model training and hosting
Describe Amazon Athena
A serverless, interactive analytics service built on open-source frameworks that supports open table and file formats. It allows you to run standard SQL on datasets in S3. It requires a Data Catalogue to understand the structure of the data in S3.
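A minimal sketch of submitting a SQL query to Athena from Python, assuming boto3 is installed and AWS credentials are configured. The table name `sales`, the database `example_db` and the results bucket are hypothetical placeholders.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the boto3 Athena call start_query_execution."""
    return {
        "QueryString": sql,
        # The database is looked up in the Data Catalogue.
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_request(
    sql="SELECT * FROM sales LIMIT 10",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
```

Calling `run_query(request)` would start the query and return its execution ID; it is left uncalled here because it needs a real AWS account.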
Describe AWS Glue
A serverless ETL (extract, transform, load) service. It takes data from a raw S3 bucket, transforms it and writes it to a processed S3 bucket
What is a Glue crawler?
A configurable Glue item that can scan a data source, such as a database or an S3 bucket, and populate a Glue database with tables describing the data. It first identifies the format, then analyses the structure of the data, and then creates tables in the Glue database that describe that structure.
What is a Glue trigger?
A configurable Glue item that can make Glue crawlers and jobs start processing. Triggers can be configured to start on a schedule or because an event has been detected.
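The two kinds of trigger described above could be defined through boto3 roughly as follows. This is a sketch under assumptions: the helper functions and all resource names are hypothetical, and the dictionaries are meant to be passed to the Glue client's `create_trigger` call, which needs AWS credentials and existing crawlers and jobs.

```python
def scheduled_trigger(name, crawler_name, cron_expression):
    """Arguments for a trigger that starts a crawler on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Glue schedules use a six-field cron expression wrapped in cron(...).
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"CrawlerName": crawler_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, crawler_name, job_name):
    """Arguments for a trigger that starts a job once a crawler has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names: crawl the raw bucket at 02:00 UTC daily, then run the job.
nightly = scheduled_trigger("nightly-crawl", "raw-bucket-crawler", "0 2 * * ? *")
after_crawl = conditional_trigger("run-after-crawl", "raw-bucket-crawler", "clean-data-job")

# With credentials configured, each dictionary could be submitted like this:
#   import boto3
#   boto3.client("glue").create_trigger(**nightly)
```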
How does the Glue crawler recognise the format of the data?
A prioritised list of data classifiers is used to recognise the format
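The idea of a prioritised classifier list can be sketched as a toy model in plain Python. This is not Glue's implementation: the two classifiers and the 0.8 certainty score are invented for illustration, and the selection rule (stop at the first classifier that is fully certain, otherwise take the most certain one, otherwise report UNKNOWN) is only loosely modelled on how Glue's classifier behaviour is documented.

```python
import json


def json_classifier(sample):
    """Return 1.0 if the whole sample parses as JSON, else 0.0."""
    try:
        json.loads(sample)
    except ValueError:
        return 0.0
    return 1.0


def csv_classifier(sample):
    """Return a hypothetical certainty of 0.8 for consistently comma-separated lines."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comma_counts = {line.count(",") for line in lines}
    # Every line must have the same, non-zero number of commas.
    if len(comma_counts) == 1 and comma_counts != {0}:
        return 0.8
    return 0.0


# Classifiers are tried in priority order.
CLASSIFIERS = [("json", json_classifier), ("csv", csv_classifier)]


def classify(sample, classifiers=CLASSIFIERS):
    """Return the name of the format recognised for the sample text."""
    best_name, best_certainty = "UNKNOWN", 0.0
    for name, classifier in classifiers:
        certainty = classifier(sample)
        if certainty == 1.0:
            # A fully certain classifier wins immediately.
            return name
        if certainty > best_certainty:
            best_name, best_certainty = name, certainty
    return best_name
```

With these toy classifiers, a JSON object is recognised by the first classifier, a small comma-separated sample falls through to the second, and free text is reported as `UNKNOWN`.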
Describe the Glue Database
It comprises tables that have been created by the Glue crawler. The tables describe the data structure and are used to retrieve the data and manipulate it during data transformation
What type of store is the Glue Database?
An Apache Hive-compatible metastore
Where does the Glue Database sit?
In the Glue Data Catalogue. There is one Data Catalogue per region, per account, and it holds all the Glue Databases that have been created by Glue crawlers
Describe a Glue Job
A PySpark or Python programme that can access the source data through the Glue Databases. It makes use of an Apache Spark cluster to provide processing power. The Glue Job processes the data and performs data transformation and cleansing