Data Engineering - Data Transformation for Machine Learning Flashcards

Question 1

Q

What are the two types of data transformation?

Answer

A

Changing the data structure
Cleaning the data

Question 2

Q

What are the 6 issues that might appear in “dirty” data?

Answer

A

Inconsistent schema - names and order of fields vary
Extraneous text - additional unnecessary text in field
Missing data - empty or N/A
Redundant information - the same data is available in several fields
Contextual errors - the data is valid but wrong in the real-world context
Junk values - meaningless data in fields
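
A minimal sketch of how several of these issues might be handled in plain Python. The field names, the sample records and the rules (treating "N/A" as missing, rejecting an age above 120 as a contextual error) are assumptions made for illustration only.

```python
import re

# Hypothetical raw records showing several of the issues listed above.
RAW_RECORDS = [
    {"Name": "Alice", "age": "34 years", "city": "Leeds"},  # extraneous text
    {"age": "N/A", "name": "Bob", "city": "York"},          # inconsistent schema, missing data
    {"name": "Carol", "age": "250", "city": "??"},          # contextual error, junk value
]

# Assumed markers for "no value"; real data sets may use others.
MISSING_MARKERS = {"", "n/a", "na", "null", "none"}


def clean_record(record):
    """Return a cleaned copy of one record with the keys name, age and city."""
    # Inconsistent schema: normalise field names to lower case.
    row = {key.strip().lower(): str(value).strip() for key, value in record.items()}

    cleaned = {}
    for field in ("name", "age", "city"):
        value = row.get(field, "")
        # Missing data: map empty strings and N/A markers to None.
        cleaned[field] = None if value.lower() in MISSING_MARKERS else value

    # Extraneous text: keep only the first run of digits in the age field.
    if cleaned["age"] is not None:
        match = re.search(r"\d+", cleaned["age"])
        cleaned["age"] = int(match.group()) if match else None

    # Contextual error: a valid number that is implausible as a person's age.
    if cleaned["age"] is not None and not 0 <= cleaned["age"] <= 120:
        cleaned["age"] = None

    # Junk values: a city containing no letters carries no meaning.
    if cleaned["city"] is not None and not re.search(r"[A-Za-z]", cleaned["city"]):
        cleaned["city"] = None

    return cleaned


cleaned_records = [clean_record(record) for record in RAW_RECORDS]
```

Redundant information is not covered here; it is usually handled by dropping or merging the duplicated fields once the schema is consistent.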

Question 3

Q

Describe Apache Spark

Answer

A

A data processing framework that can quickly perform processing tasks on very large data sets. It can run on Amazon EMR

Question 4

Q

Describe Amazon EMR

Answer

A

A managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark

Question 5

Q

Which languages does Apache Spark support?

Answer

A

Java, Scala, Python, R and SQL

Question 6

Q

How can Apache Spark be used with Amazon SageMaker?

Answer

A

When run with SageMaker, Spark is used for pre-processing data and SageMaker is used for model training and hosting

Question 7

Q

Describe Amazon Athena

Answer

A

A serverless, interactive analytics service built on open-source frameworks that supports open table and file formats. It allows you to use standard SQL on datasets in S3. It requires a Data Catalogue to understand the structure of the data in S3.
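
As a rough sketch, a query could be submitted to Athena from Python through the boto3 `start_query_execution` call. The database name, table name, results bucket and region below are hypothetical, and `run_query` is not called here because it needs boto3 and AWS credentials.

```python
def build_athena_query_request(database, sql, output_location):
    """Build the keyword arguments for boto3's Athena start_query_execution."""
    return {
        "QueryString": sql,
        # The database is the Data Catalogue database that describes the S3 data.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, region_name="eu-west-1"):
    """Submit the query and return its execution id (requires AWS access)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    client = boto3.client("athena", region_name=region_name)
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, used only for illustration.
request = build_athena_query_request(
    database="sales_db",
    sql="SELECT region, COUNT(*) AS orders FROM raw_orders GROUP BY region",
    output_location="s3://example-athena-results/",
)
```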

Question 8

Q

Describe AWS Glue

Answer

A

An ETL (extract, transform, load) service. For example, it can take data from a raw S3 bucket, transform it and write the result to a processed S3 bucket

Question 9

Q

What is a Glue crawler?

Answer

A

A configurable Glue item that can scan a data source, such as a database or an S3 bucket, and populate a Glue database. It first identifies the format of the data, then analyses its structure, and then creates tables in the Glue database that describe the data.

Question 10

Q

What is a Glue trigger?

Answer

A

A configurable Glue item that can make Glue crawlers and jobs start processing. It can be configured to start on a schedule or when an event has been detected.
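
A scheduled trigger might be defined from Python roughly as follows, using the boto3 Glue `create_trigger` call. The trigger name, job name, schedule and region are hypothetical, and `create_trigger` below is not called because it needs boto3 and AWS credentials.

```python
def build_scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build the keyword arguments for boto3's Glue create_trigger."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # AWS cron expressions have six fields and are wrapped in cron(...).
        "Schedule": f"cron({cron_expression})",
        # The actions list names what the trigger starts.
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(request, region_name="eu-west-1"):
    """Create the trigger in AWS Glue (requires AWS access)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    client = boto3.client("glue", region_name=region_name)
    return client.create_trigger(**request)


# Hypothetical names: start the job "clean-raw-orders" every day at 02:00 UTC.
trigger_request = build_scheduled_trigger_request(
    trigger_name="nightly-clean",
    job_name="clean-raw-orders",
    cron_expression="0 2 * * ? *",
)
```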

Question 11

Q

How does the Glue crawler recognise the format of the data?

Answer

A

A prioritised list of data classifiers is used to recognise the format
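
The idea of a prioritised list can be sketched in plain Python: each classifier is tried in order and the first one that recognises the sample wins. This is a simplified illustration, not Glue's actual implementation; the two classifier functions and their matching rules are assumptions.

```python
import csv
import json


def looks_like_json(sample):
    """Recognise the sample if it parses as a JSON document."""
    try:
        json.loads(sample)
        return True
    except ValueError:
        return False


def looks_like_csv(sample):
    """Recognise the sample if it has 2+ rows with the same number of columns."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    rows = list(csv.reader(lines))
    width = len(rows[0])
    return width > 1 and all(len(row) == width for row in rows)


# Order matters: classifiers earlier in the list take priority.
CLASSIFIERS = [
    ("json", looks_like_json),
    ("csv", looks_like_csv),
]


def classify(sample, classifiers=CLASSIFIERS):
    """Return the name of the first classifier that recognises the sample."""
    for name, recognises in classifiers:
        if recognises(sample):
            return name
    return "UNKNOWN"


json_format = classify('{"id": 1, "name": "a"}')
csv_format = classify("id,name\n1,a\n2,b\n")
unknown_format = classify("hello world")
```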

Question 12

Q

Describe the Glue Database

Answer

A

Comprises tables that have been created by the Glue Crawler. The tables describe the data structure and are used to retrieve the data and manipulate it during data transformation

Question 13

Q

What type of store is the Glue Database?

Answer

A

An Apache Hive-compatible metastore

Question 14

Q

Where does the Glue Database sit?

Answer

A

In the Glue Data Catalogue. There is one Data Catalogue per region, per account, and it holds all the Glue Databases that have been created by Glue Crawlers

Question 15

Q

Describe a Glue Job

Answer

A

A PySpark or Python programme that can access the source data through the Glue Databases. It makes use of an Apache Spark cluster to provide processing power. The Glue Job processes the data and performs data transformation and cleansing