Domain 1: Data Engineering Flashcards
Data that has a well-defined schema and the metadata needed to interpret it, such as attribute names and data types.
Structured Data
Tabular data is an example of:
Structured Data
T/F: Depending on the column data type, you may have to perform different actions to prepare the data for machine learning.
True
An attribute in a tabular dataset is a _____, and a _____ corresponds to a data point or an observation.
Column/row
Data that does not have a schema or any well-defined structural properties.
Unstructured Data
What makes up the majority of the data most organizations have?
Unstructured Data
Whose job is it to convert the unstructured data into some form of structured data for machine learning or train an ML model directly on the unstructured data itself?
Data Scientist
Examples include images, videos, audio files, text documents, or application log files.
Unstructured Data
Data that can be in JSON format or XML data that you may have from a NoSQL database.
Semi-structured Data
T/F: You may need to parse this semi-structured data into structured data to make it useful for machine learning.
True
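Parsing semi-structured JSON into a fixed-schema table can be sketched with Python's standard library (the records and field names here are hypothetical):

```python
import json

# Hypothetical semi-structured records, e.g. exported from a NoSQL store;
# note the second record is missing the "tags" field
raw = '[{"id": 1, "name": "Ana", "tags": ["a", "b"]}, {"id": 2, "name": "Ben"}]'
records = json.loads(raw)

# Flatten into a fixed-schema (structured) table: every row has the same columns
columns = ["id", "name", "tags"]
rows = [[rec.get(col) for col in columns] for rec in records]
print(rows)
```

Missing fields become `None`, which downstream preparation steps would then impute or drop.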
Data that has a single or multiple target columns (dependent variables) or attributes.
Labeled Data
Data with no target attribute or label.
Unlabeled Data
A column in a tabular dataset besides the label column.
Feature
A row in a tabular dataset that consists of one or more features, which can also contain one or more labels.
Data Point
A collection of data points that you will use for model training and validation.
Dataset
A feature that can be represented by a continuous number or an integer but is unbounded in nature.
Numerical Feature
A feature that is discrete and qualitative, and can only take on a finite number of values.
Categorical Feature
In most machine learning problems, you need to convert _____ features into _____ features using different techniques.
Categorical/numerical
Images that are usually in different formats such as JPEG or PNG.
Image Data
An example of an _____ is a popular benchmark such as the MNIST handwritten digits dataset or ImageNet.
Image dataset
This data usually consists of audio files in MP3 or WAV formats and can arise from call recordings in call centers.
Audio Data
This data is commonly referred to as a corpus and can consist of collections of documents.
Text Data (Corpus)
_____ can be stored in many formats, such as raw PDF or TXT files, JSON, or CSV.
Text Data
Examples of ________ include the newsgroups dataset, Amazon reviews data, the WikiQA corpus, WordNet, and IMDB reviews.
Popular text corpora
This is data that consists of a value varying over time such as the sale price of a product, the price of a stock, the daily temperature or humidity, measurements or readings from a sensor or Internet of things (IoT) device, or the number of passengers who ride the New York City Metro daily.
Time Series Data
This is the dataset that is used to train the model.
Training Data
This is a portion of the dataset that is kept aside to validate your model performance during training.
Validation Data
This should be kept aside from the outset so that your model never sees it until it is trained. Once your model is trained and you are satisfied with the model performance on the training and validation datasets, only then should you test the model performance on this.
Test Data
T/F: The test dataset should mimic as closely as possible the data you expect your model to serve during production.
True
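The three-way split described in the cards above can be sketched with the standard library (the 70/15/15 ratio is illustrative, not prescribed):

```python
import random

random.seed(42)                  # reproducible split
data = list(range(100))          # stand-in for 100 data points
random.shuffle(data)             # shuffle before splitting to avoid ordering bias

# Illustrative 70/15/15 split; ratios vary by problem and dataset size
train = data[:70]
validation = data[70:85]
test = data[85:]
print(len(train), len(validation), len(test))
```

The test slice is held out entirely: the model never sees it until training and validation are complete.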
_____ is often used for use cases such as online transaction processing (OLTP), analytics, and reporting, and analysts use a language like _____to query this data.
Tabular data/SQL
_____ applications typically run on relational databases, and AWS offers a service called _____ to build and manage them.
OLTP / Amazon RDS (Relational Database Service)
_____ supports these underlying engines: Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL.
Amazon RDS
Relational databases typically use _____ and are suited for queries for specific rows, inserts, and updates.
Row-wise storage
For analytics and reporting workloads that are read heavy, consider a data warehouse solution like _____.
Amazon Redshift
Amazon Redshift uses _____ instead of _____ for fast retrieval of columns and is ideally suited for querying against very large datasets.
Columnar storage / row-wise storage
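The access-pattern difference between the two layouts can be illustrated in a few lines (a toy model, not how Redshift is actually implemented):

```python
# Row-wise layout: each record stored together, good for fetching whole rows
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 12.5, "qty": 1},
    {"id": 3, "price": 9.0, "qty": 5},
]

# Columnar layout: one contiguous list per column
columnar = {key: [row[key] for row in rows] for key in rows[0]}

# An analytics query like "average price" reads only the price column,
# skipping every other column entirely
avg_price = sum(columnar["price"]) / len(columnar["price"])
print(avg_price)
```

This is why columnar storage favors read-heavy analytics over wide tables, while row-wise storage favors single-row inserts, updates, and lookups.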
_____ is now integrated with Amazon SageMaker via SageMaker Data Wrangler.
Amazon Redshift
Both Redshift and RDS store _____.
Tabular data
If your data is semi-structured, you should consider a NoSQL database like _____.
DynamoDB
Stores data as key-value pairs and can be used to store data that does not have a specific schema.
DynamoDB
If your data currently lives in an open-source NoSQL store like MongoDB, you can move that data to _____ on AWS.
Amazon DocumentDB
T/F: Amazon recommends using purpose-built databases for specific applications rather than a one-size-fits-all approach.
True
_____ is a data lake solution that helps you centrally catalog your data and establish fine-grained controls on who can access the data.
AWS Lake Formation
Users can query the central catalog in Lake Formation and then run analytics or extract-transform-load (ETL) workstreams on the data using tools like _____.
Amazon Redshift or Amazon EMR
Once your data lands in AWS, you need to move the data to _____ in order to train ML models.
Amazon S3
What are the two ways of migrating data to AWS?
Batch and streaming
For batch migration, you _____ transfer data.
Bulk
For streaming migration, you have a streaming data source like ____ or ____ to stream data into S3.
Sensors / IoT devices
If your data is already on AWS, you can use _____ to move the data from other data sources such as Redshift, DynamoDB, or RDS to S3.
AWS Data Pipeline
An _____ is a pipeline component that tells Data Pipeline what job to perform.
Activity type
Data Pipeline has some prebuilt activity types that you can use, such as _____ to copy data from one Amazon S3 location to another, _____ to copy data to and from Redshift tables, and _____ to run a SQL query on a database and copy the output to S3.
CopyActivity / RedshiftCopyActivity / SqlActivity
What are 3 data sources you can use with AWS Data Pipeline to get data in S3?
Redshift, DynamoDB, and RDS
How do you migrate data from one database to another when your data is in relational format?
AWS Database Migration Service
What’s a migration that moves from, say, an Oracle database on premises or on EC2 to an Oracle database in Amazon RDS?
Homogeneous migration
What’s a migration that moves from a MySQL database to Amazon Aurora?
Heterogeneous migration
How do you convert the schema of a dataset?
Schema Conversion Tool
What can you use to land data from one relational database to Amazon S3?
DMS
Data Pipeline can be used with _____ such as Redshift and NoSQL databases such as DynamoDB, whereas DMS can only be used to migrate _____ such as databases on EC2, AzureSQL, and Oracle.
data warehouses / relational databases
_____ is a managed ETL service that allows you to run serverless extract-transform-load workloads without worrying about provisioning compute.
AWS Glue
You can take data from different data sources and use Glue crawlers to crawl the data and determine the underlying schema, which is stored in the _____.
Glue Data Catalog
_____ will try to infer the data schema and work with a number of data formats such as CSV, JSON, and Apache Avro.
Glue crawlers
_____ the process of combining data from multiple sources into a large, central repository called a data warehouse. This uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning.
Extract, transform, and load (ETL)
Once a schema is determined, how do you change the data format?
By running ETL scripts
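A minimal ETL script in pure Python illustrates the pattern (hypothetical columns; Glue ETL jobs normally run as PySpark or Python shell scripts):

```python
import csv
import io

# Extract: raw CSV as it might land from a source system (hypothetical data)
source = io.StringIO("name,amount\nana,10\nben,\ncara,7\n")
records = list(csv.DictReader(source))

# Transform: drop rows with missing amounts, cast types, normalize names
cleaned = [
    {"name": rec["name"].title(), "amount": int(rec["amount"])}
    for rec in records
    if rec["amount"]
]

# Load: write the cleaned records to the target (an in-memory buffer here;
# in practice this would be S3 or another data store)
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=["name", "amount"])
writer.writeheader()
writer.writerows(cleaned)
print(target.getvalue())
```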
_____ is a service that allows you to visually prepare and clean your data, normalize your data, and run a number of different feature transforms on the dataset without writing code.
Glue DataBrew
This is a powerful service with capabilities such as:
- Data visualization using Glue DataBrew
- Serverless ETL
- The ability to crawl and infer the schema of the data using data crawlers
- The ability to catalog your data into a data catalog using Glue Data Catalog
Glue
You use this to catalog data, convert data from one data format to another, run ETL jobs on the data, and land the data in another data source.
Glue
For many applications such as sensors and IoT devices, video or news feeds, and live social media streams, you may want to upload the data to AWS by _____.
Streaming
What word should you think of if the test mentions streaming, sensors, and IoT and concerns data collection?
Kinesis family of services
This provides a set of APIs, SDKs, and a user interface that you can use to store, update, version, and retrieve any amount of data from anywhere on the web.
Amazon Simple Storage Service (S3)
A _____ is where objects are stored in Amazon S3. Every object is contained in a _____ you own.
Bucket
An _____ that is stored in a bucket consists of the object data and object metadata. Metadata is a set of key-value pairs that describe the object, such as the date modified or standard HTTP metadata such as Content-Type.
object
A bucket is tied to the _____ it is created in. You can choose a _____ that optimizes latency or that satisfies regulatory requirements.
region
A single object in S3 can be up to _____ TB in size, and you can add up to _____ key-value pairs called S3 object tags to each object, which can be updated or deleted at a later time.
5 / 10
T/F: S3 storage is hierarchical
F: nonhierarchical
T/F: Object keys are not folder structures, they’re just a way to organize your data.
True
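Because S3 keys are flat strings, a "folder" is just a shared key prefix; a prefix-filtered listing, the way the ListObjectsV2 `Prefix` parameter works, can be emulated like this (hypothetical key names):

```python
# S3 keys are flat strings; there is no real directory tree
keys = [
    "raw/2024/01/data.csv",
    "raw/2024/02/data.csv",
    "processed/2024/01/data.parquet",
]

# Filtering by a shared prefix is what makes keys look like folders
prefix = "raw/2024/"
matches = [k for k in keys if k.startswith(prefix)]
print(matches)
```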
With S3 batch operations, you can copy large amounts of data between buckets, replace tags, or modify access controls _____
with a simple API or through the console.
How do you prevent accidental S3 bucket deletions?
Data versioning and MFA Delete
How do you copy objects to multiple locations automatically, in same or different regions?
S3 replication
How do you implement write-once, read-many (WORM) policy and retain an object version for a specific period of time?
S3 Object Lock
How do you query data without accessing any other analytics service using SQL statements?
S3 Select
What do you use for more involved SQL queries to query data directly on S3?
Amazon Athena or Redshift Spectrum