Domain 1: Data Engineering Flashcards
Data that has a well-defined schema and the metadata needed to interpret it, such as the attributes and data types.
Structured Data
Tabular data is an example of:
Structured Data
T/F: Depending on the column data type, you may have to perform different actions to prepare the data for machine learning.
True
An attribute in a tabular dataset is a _____, and a _____ corresponds to a data point or an observation.
Column/row
Data that does not have a schema or any well-defined structural properties.
Unstructured Data
What makes up the majority of the data most organizations have?
Unstructured Data
Whose job is it to convert the unstructured data into some form of structured data for machine learning or train an ML model directly on the unstructured data itself?
Data Scientist
Examples include images, videos, audio files, text documents, or application log files.
Unstructured Data
Data that can be in JSON format or XML data that you may have from a NoSQL database.
Semi-structured Data
T/F: You may need to parse this semi-structured data into structured data to make it useful for machine learning.
True
Data that has a single or multiple target columns (dependent variables) or attributes.
Labeled Data
Data with no target attribute or label.
Unlabeled Data
A column in a tabular dataset besides the label column.
Feature
A row in a tabular dataset that consists of one or more features, which can also contain one or more labels.
Data Point
A collection of data points that you will use for model training and validation.
Dataset
A feature that can be represented by a continuous number or an integer but is unbounded in nature.
Numerical Feature
A feature that is discrete and qualitative, and can only take on a finite number of values.
Categorical Feature
In most machine learning problems, you need to convert _____ features into _____ features using different techniques.
Categorical/numerical
Images that are usually in different formats such as JPEG or PNG.
Image Data
Examples of an _____ include the popular MNIST handwritten digits dataset and ImageNet.
Image dataset
This data usually consists of audio files in MP3 or WAV formats and can arise from call transcriptions in call centers.
Audio Data
This data is commonly referred to as a corpus and can consist of collections of documents.
Text Data (Corpus)
_____ can be stored in many formats, such as raw PDF or TXT files, JSON, or CSV.
Text Data
Examples of ________ include the newsgroups dataset, Amazon reviews data, the WikiQA corpus, WordNet, and IMDB reviews.
Popular text corpora
This is data that consists of a value varying over time such as the sale price of a product, the price of a stock, the daily temperature or humidity, measurements or readings from a sensor or Internet of things (IoT) device, or the number of passengers who ride the New York City Metro daily.
Time Series Data
This is the dataset that is used to train the model.
Training Data
This is a portion of the dataset that is kept aside to validate your model performance during training.
Validation Data
This should be kept aside from the outset so that your model never sees it until it is trained. Once your model is trained and you are satisfied with the model performance on the training and validation datasets, only then should you test the model performance on this.
Test Data
T/F: The test dataset should mimic as closely as possible the data you expect your model to serve during production.
True
_____ is often used for use cases such as online transaction processing (OLTP), analytics, and reporting, and analysts use a language like _____to query this data.
Tabular data/SQL
_____ applications typically run on relational databases, and AWS offers a service called _____ to build and manage this kind of database.
OLTP / Amazon RDS (Relational Database Service)
_____ supports these underlying engines: Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL.
Amazon RDS
Relational databases typically use _____ and are suited for queries for specific rows, inserts, and updates.
Row-wise storage
For analytics and reporting workloads that are read heavy, consider a data warehouse solution like _____.
Amazon Redshift
Amazon Redshift uses _____ instead of _____ for fast retrieval of columns and is ideally suited for querying against very large datasets.
Columnar storage/row-level storage
_____ is now integrated with Amazon SageMaker via SageMaker Data Wrangler.
Amazon Redshift
Both Redshift and RDS store _____.
Tabular data
If your data is semi-structured, you should consider a NoSQL database like _____.
DynamoDB
Stores data as key-value pairs and can be used to store data that does not have a specific schema.
DynamoDB
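For context, a minimal boto3 sketch of key-value storage in DynamoDB (table and attribute names are hypothetical):

```python
import boto3

# Hypothetical table whose partition key is "user_id".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")

# Store a schemaless item as key-value pairs; attributes can vary per item.
table.put_item(Item={"user_id": "u-123", "name": "Ana", "plan": "pro"})

# Retrieve the item by its key.
response = table.get_item(Key={"user_id": "u-123"})
print(response.get("Item"))
```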
If your data currently lives in an open-source NoSQL store like MongoDB, you can use _____ to migrate that data to AWS.
Amazon DocumentDB
T/F: Amazon recommends using purpose-built databases for specific applications rather than a one-size-fits-all approach.
True
_____ is a data lake solution that helps you centrally catalog your data and establish fine-grained controls on who can access the data.
AWS Lake Formation
Users can query the central catalog in Lake Formation and then run analytics or extract-transform-load (ETL) workstreams on the data using tools like _____.
Amazon Redshift or Amazon EMR
Once your data lands in AWS, you need to move the data to _____ in order to train ML models.
Amazon S3
What are the two ways of migrating data to AWS?
Batch and streaming
For batch migration, you transfer data in _____.
Bulk
For streaming migration, you have a streaming data source like ____ or ____ to stream data into S3.
Sensors / IoT devices
If your data is already on AWS, you can use _____ to move the data from other data sources such as Redshift, DynamoDB, or RDS to S3.
AWS Data Pipeline
An _____ is a pipeline component that tells Data Pipeline what job to perform.
Activity type
Data Pipeline has some prebuilt activity types that you can use, such as _____ to copy data from one Amazon S3 location to another, _____ to copy data to and from Redshift tables, and _____ to run a SQL query on a database and copy the output to S3.
CopyActivity / RedshiftCopyActivity / SqlActivity
What are 3 data sources you can use with AWS Data Pipeline to get data in S3?
Redshift, DynamoDB, and RDS
How do you migrate data from one database to another when your data is in relational format?
AWS Database Migration Service
What’s a migration that moves from, say, an Oracle database on premises or on EC2 to an Oracle database in Amazon RDS?
Homogeneous migration
What’s a migration that moves between different database engines, say, from an Oracle database to Amazon Aurora?
Heterogeneous migration
How do you convert a database schema from one engine to another?
AWS Schema Conversion Tool
What can you use to land data from one relational database to Amazon S3?
DMS
Data Pipeline can be used with _____ such as Redshift and NoSQL databases such as DynamoDB, whereas DMS can only be used to migrate _____ such as databases on EC2, AzureSQL, and Oracle.
data warehouses / relational databases
_____ is a managed ETL service that allows you to run serverless extract-transform-load workloads without worrying about provisioning compute.
AWS Glue
You can take data from different data sources, and use the _____ to crawl the data to determine the underlying schema.
Glue crawlers
_____ will try to infer the data schema and work with a number of data formats such as CSV, JSON, and Apache Avro.
Glue crawlers
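As a rough sketch of that workflow with boto3 (crawler, role, database, and S3 path are all hypothetical):

```python
import boto3

glue = boto3.client("glue")

# The crawler infers the schema of files under the S3 prefix (CSV, JSON,
# Avro, etc.) and records the table metadata in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")
```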
_____ is the process of combining data from multiple sources into a large, central repository called a data warehouse. It uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning.
Extract, transform, and load (ETL)
Once a schema is determined, how do you change the data format?
By running ETL scripts
_____ is a service that allows you to visually prepare and clean your data, normalize your data, and run a number of different feature transforms on the dataset without writing code.
AWS Glue DataBrew
This is a powerful service with capabilities such as:
- Data visualization using Glue DataBrew
- Serverless ETL
- The ability to crawl and infer the schema of the data using data crawlers
- The ability to catalog your data into a data catalog using Glue Data Catalog
Glue
You use this to catalog data, convert data from one data format to another, run ETL jobs on the data, and land the data in another data source.
Glue
For many applications such as sensors and IoT devices, video or news feeds, and live social media streams, you may want to upload the data to AWS by _____.
Streaming
What word should you think of if the test mentions streaming, sensors, and IoT and concerns data collection?
Kinesis family of services
This provides a set of APIs, SDKs, and a user interface that you can use to store, update, version, and retrieve any amount of data from anywhere on the web.
Amazon Simple Storage Service (S3)
A _____ is where objects are stored in Amazon S3. Every object is contained in a _____ you own.
Bucket
An _____ that is stored in a bucket consists of the object data and object metadata. Metadata is a set of key-value pairs that describe the object, like date modified, or standard HTTP metadata such as Content-Type.
object
A bucket is tied to the _____ it is created in. You can choose a _____ that optimizes latency or that satisfies regulatory requirements.
region
A single object in S3 can be up to _____ TB in size, and you can add up to _____ key-value pairs called S3 object tags to each object, which can be updated or deleted at a later time.
5 / 10
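A minimal boto3 sketch of writing an object with metadata and object tags (bucket, key, and values are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Metadata is stored as key-value pairs alongside the object; Tagging adds
# S3 object tags (up to 10 per object), which can be changed later.
with open("train.csv", "rb") as body:
    s3.put_object(
        Bucket="my-ml-datasets",
        Key="raw/train.csv",
        Body=body,
        ContentType="text/csv",
        Metadata={"source": "crm-export", "version": "2024-01"},
        Tagging="project=churn&stage=raw",
    )
```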
T/F: S3 storage is hierarchical
F: nonhierarchical
T/F: Object keys are not folder structures; they're just a way to organize your data.
True
With S3 batch operations, you can copy large amounts of data between buckets, replace tags, or modify access controls _____
with a simple API or through the console.
How do you prevent accidental S3 bucket deletions?
Data versioning and MFA Delete
How do you copy objects to multiple locations automatically, in same or different regions?
S3 replication
How do you implement write-once, read-many (WORM) policy and retain an object version for a specific period of time?
S3 Object Lock
How do you query data in S3 using SQL statements without accessing any other analytics service?
S3 Select
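A hedged boto3 sketch of S3 Select against a hypothetical CSV object with a header row:

```python
import boto3

s3 = boto3.client("s3")

# Run a SQL expression directly against a single object in S3.
resp = s3.select_object_content(
    Bucket="my-ml-datasets",
    Key="raw/train.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.label FROM s3object s WHERE s.label = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matching rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```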
What do you use for more involved SQL queries to query data directly on S3?
Amazon Athena or Redshift Spectrum
S3 Standard, S3 Intelligent-Tiering, S3 Standard-Infrequent Access, S3 One Zone-Infrequent Access, S3 Glacier, and S3 Glacier Deep Archive are all:
Storage classes
What are these:
- AWS Identity and Access Management (IAM) and access control lists (ACLs)
- Query string authentication
- AWS Trusted Advisor
- Server-side encryption (SSE-KMS, SSE-C, SSE-S3) and client-side encryption.
- VPC endpoints
AWS security features
_____ provides a fully managed, POSIX-compliant, elastic NFS filesystem that can be shared by multiple instances. It is built for petabyte scale, and it grows and shrinks automatically and seamlessly as you add and remove data.
Amazon Elastic File System (EFS)
What are these:
- Use the console or APIs to create a filesystem.
- Create mount targets for your filesystem.
- Create and configure security groups.
Steps to get started with EFS
How do you mount an EFS filesystem inside your VPC?
By creating a mount target in each availability zone so that all instances in the same availability zone can share the same mount target.
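A minimal boto3 sketch of creating one such mount target (all IDs are hypothetical; repeat per availability zone's subnet):

```python
import boto3

efs = boto3.client("efs")

# Instances in the mount target's availability zone can then mount the
# shared filesystem over NFS via this target.
efs.create_mount_target(
    FileSystemId="fs-0123456789abcdef0",
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
```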
_____ is a fully managed, high-performance filesystem that can be used for large-scale machine learning jobs and high-performance computing (HPC) use cases. It also provides two types of filesystems, one for Windows and one for Lustre.
Amazon FSx
This is based on the popular Lustre filesystem that is used for distributed computing workloads such as machine learning and HPC. It can support hundreds of petabytes of data storage and hundreds of gigabytes of aggregate throughput. The majority of the top 100 fastest supercomputers in the world use this.
Amazon FSx for Lustre
T/F: Version control tools like Git are meant for storing large training datasets or trained ML models.
False
How is code typically versioned on AWS?
CodeCommit
_____ is used to track, version, back up, and restore snapshots of datasets by using familiar tools and AWS back-end storage services like S3 and EFS
DVC
_____ uses local caches that can also be shared across users using services like EFS and can use S3 as a persistent store.
DVC
What might prevent you from incorporating DVC into your workflow?
Restrictions of your current stack or lack of appropriate training.
Why are versioning systems like DVC beneficial?
They allow you to branch, commit, merge, and use datasets using a structured approach.
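A small sketch using DVC's Python API (the repo URL, file path, and tag are hypothetical; the bytes come from whatever remote, such as S3, the repo's DVC config points at):

```python
import dvc.api

# Read a specific, pinned version of a DVC-tracked dataset.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.0",  # Git tag, branch, or commit
) as f:
    header = f.readline()
    print(header)
```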
What is the model in which sharing compute and storage resources helps reduce costs, but requires strong security measures to prevent cross-tenant data access?
Pool model
What is it called when each tenant has its own set of isolated resources?
Silo model
What do you use to collect and process large streams of data records in real time?
Amazon Kinesis Data Streams
What reads data from a data stream as data records?
Kinesis Data Streams applications
Kinesis Data Stream applications can use the _____ and run on _____.
Kinesis Client Library / EC2 instances
This can be used for rapid and continuous data intake and aggregation.
Kinesis Data Streams
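A minimal boto3 producer sketch (stream name and payload are hypothetical):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Each record carries a partition key, which determines the shard it lands on.
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps({"device_id": "d-42", "temp_c": 21.7}).encode(),
    PartitionKey="d-42",
)
```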
These are scenarios for using _____:
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
Kinesis Data Streams
This is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, Amazon OpenSearch Serverless, Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, Coralogix, and Elastic.
Amazon Data Firehose
This automatically delivers data to a specified destination when you configure your data producers to send data here:
Amazon Data Firehose
This is the underlying entity of Amazon Data Firehose. You use Amazon Data Firehose by creating this and then sending data to it.
Firehose stream
The data of interest that your data producer sends to a Firehose stream, which can be as large as 1,000 KB.
Record
These send records to Firehose streams.
Data producer
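A minimal boto3 sketch of a data producer sending a record to a hypothetical Firehose stream:

```python
import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers records by size/time and delivers them to the stream's
# configured destination (for example, an S3 bucket).
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": json.dumps({"page": "/home", "ms": 123}).encode() + b"\n"},
)
```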
T/F: Amazon Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations.
True
Buffer Size is in _____ and Buffer Interval is in _____.
MBs / seconds
With Firehose, for Amazon S3 destinations, streaming data is delivered to your _____
S3 bucket
With Firehose, for Amazon Redshift destinations, streaming data is delivered to your _____ first. Amazon Data Firehose then issues an _____ to load data from your S3 bucket to your Amazon Redshift cluster.
S3 bucket / Amazon Redshift COPY command
With Firehose, for OpenSearch Service destinations, streaming data is delivered to your _____, and it can optionally be backed up to your S3 bucket concurrently.
OpenSearch Service cluster
With Firehose, for Splunk destinations, streaming data is delivered to _____, and it can optionally be backed up to your S3 bucket concurrently.
Splunk
With _____ for SQL Applications, you can process and analyze streaming data using standard SQL.
Amazon Kinesis Data Analytics
This service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.
Amazon Kinesis Data Analytics
What application supports ingesting data from Amazon Kinesis Data Streams and Amazon Data Firehose streaming sources?
Kinesis Data Analytics
What are these steps for:
1. Create app
2. Author SQL code using interactive editor
3. Test code with live streaming data
Kinesis Data Analytics
This enables you to quickly author SQL code that continuously reads, processes, and stores data in near real time. Using standard SQL queries on the streaming data, you can construct applications that transform and provide insights into your data.
Kinesis Data Analytics
You can do the following with this:
- Generate time-series analytics
- Feed real-time dashboards
- Create real-time metrics
Kinesis Data Analytics
What benefit does Glue have over EMR?
It’s serverless.
How can you run your ETL scripts?
Using Python, PySpark, or Scala
T/F: Glue offers several built-in data transforms and can even build the processing script for you, or you can bring your custom scripts.
True
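A minimal custom-script sketch in Glue's PySpark API (database, table, field, and S3 path are hypothetical; this runs inside a Glue job, not locally):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a crawler previously registered in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales"
)

# Apply a simple transform, then land the result in S3 as Parquet.
dyf = dyf.drop_fields(["internal_note"])
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/clean/sales/"},
    format="parquet",
)
```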
What’s the first step in the Glue workflow?
Point your data sources to Glue and define a data crawler to crawl your data.
What’s the second step in the Glue workflow?
Populate the Glue Data Catalog with your table metadata.
What’s the third step in the Glue workflow?
Run a custom processing script on your data.
What happens when you point Glue to your data source and a custom or prebuilt script, and schedule a job or use an event trigger to trigger the Glue workflow?
The processed outputs will be stored in the specified destination, such as an S3 bucket or Redshift.
What data sources and destinations does Glue work with?
Ones that support JDBC connectivity, such as Amazon Redshift or Amazon RDS, in addition to Amazon S3.
What data lake service uses the Glue Data Catalog to centrally catalog your data?
AWS Lake Formation
T/F: You can run Athena or Redshift Spectrum queries on data in S3 using the Glue Data Catalog as the underlying metadata store.
True
What is the service that helps data scientists visually inspect their data, explore the data, define transformations, and engineer features?
AWS Glue DataBrew
What is a fully managed Hadoop cluster ecosystem that runs on EC2 and allows you to choose from a menu of open-source tools, such as Spark for ETL and SparkML for machine learning, Presto for SQL queries, Flink for stream processing, Pig and Hive to analyze and query data, and Jupyter-style notebooks with Zeppelin?
Amazon EMR
_____ is useful when you want to run data processing and ETL jobs over petabytes of data.
Amazon EMR
In addition to the Hadoop distributed filesystem for storage, _____ integrates directly with data in S3 using _____.
EMR / EMR File System (EMRFS)
For interactive analysis of Spark jobs, you can either use _____ or connect your EMR cluster to _____.
EMR notebooks / SageMaker notebook instances or SageMaker Studio
This allows you to run Spark-based workloads on EC2 instances.
EMR
This solution:
- Is ideally suited for extremely large-scale (petabyte-scale) data requirements
- Requires familiarity with the Hadoop ecosystem
- Runs on EC2 instances in your AWS account
- Is ideally suited for big data engineers
EMR
This is a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run applications built using open source big data frameworks such as Apache Spark, Hive, or Presto, without having to tune, operate, optimize, secure, or manage clusters.
Amazon EMR Serverless
What are the following use cases for:
- Perform big data analytics
- Build scalable data pipelines
- Process real-time data streams
- Accelerate data science and ML adoption
Amazon EMR
_____ is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
Apache Hadoop
Instead of using one large computer to store and process the data, _____allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop
A distributed file system that runs on standard or low-end hardware and provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
Hadoop Distributed File System (HDFS)
Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.
Yet Another Resource Negotiator (YARN)
A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.
MapReduce
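A toy, single-process Python illustration of the pattern (not Hadoop itself): map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group:

```python
from collections import defaultdict

docs = ["the cat sat", "the dog sat"]

def map_task(doc):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

# Shuffle: group mapped pairs by key.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_task(doc):
        grouped[key].append(value)

def reduce_task(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

print(dict(reduce_task(k, v) for k, v in grouped.items()))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```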
Provides common Java libraries that can be used across all modules.
Hadoop Common
_____ makes it easier to use all the storage and processing capacity in cluster servers, and to execute distributed processes against huge amounts of data.
Hadoop
What is this workflow for:
- Applications that collect data in various formats can place data into this cluster by using an API operation to connect to the NameNode.
- The NameNode tracks the file directory structure and placement of “chunks” for each file, replicated across DataNodes.
- To run a job to query the data, provide a MapReduce job made up of many map and reduce tasks that run against the data in HDFS spread across the DataNodes.
- Map tasks run on each node against the input files supplied, and reducers run to aggregate and organize the final output.
Hadoop
An open source, distributed processing system commonly used for big data workloads. This uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries.
Spark
Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, in addition to distributed and fault-tolerant data warehousing.
Hive
A programming model for processing big data sets with a parallel, distributed algorithm.
Hadoop MapReduce
What service does this describe:
- An open-source, distributed processing system used for big data workloads
- It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size.
- It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
Spark
This was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
Spark
T/F: Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset.
True
Data reuse is accomplished through the creation of _____, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory and reused in multiple Spark operations.
DataFrames
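A short PySpark sketch of that reuse (the S3 path and column are hypothetical): cache the DataFrame once, then run several operations against the in-memory copy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reuse-demo").getOrCreate()

# cache() keeps the parsed DataFrame in memory after the first action.
df = spark.read.csv("s3://my-data-lake/clean/sales/", header=True).cache()

print(df.count())                    # first action materializes the cache
df.groupBy("region").count().show()  # later actions reuse the cached data
```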
T/F: Spark is faster than MapReduce
True, dramatically
Hadoop is an open source framework that has _____ as storage, _____as a way of managing computing resources used by different applications, and an implementation of the _____ programming model as an execution engine.
the Hadoop Distributed File System (HDFS) / YARN / MapReduce
Does Spark have its own storage system?
No
Spark on Hadoop leverages _____ to share a common cluster and dataset with other Hadoop engines, ensuring consistent levels of service and response.
YARN
The Spark framework includes:
_____ as the foundation for the platform
_____ for interactive queries
_____ for real-time analytics
_____ for machine learning
_____ for graph processing
Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX
What does EMR stand for?
Elastic MapReduce
What do you use to store structured and unstructured data?
Data lake
_____ is your data lake solution, and _____ is the preferred storage option for data science processing on AWS
AWS Lake Formation / Amazon S3
How do you reduce the cost of data storage?
Amazon S3 storage classes
The storage option for active, frequently accessed data, with milliseconds access, and greater than or equal to 3 availability zones
S3 Standard
The storage class for data with changing access patterns, with milliseconds access, and greater than or equal to 3 availability zones
S3 INT
The storage class for infrequently accessed data, with milliseconds access, and greater than or equal to 3 availability zones
S3 S-IA
The storage class for re-creatable, less accessed data, milliseconds access, and 1 availability zone
S3 1Z-IA
The storage class for archive data, minutes/hours access, and greater than or equal to 3 availability zones
Glacier
With _____, you can build, train and deploy ML models at scale using tools like notebooks, debuggers, profilers, pipelines, MLOps, and more – all in one integrated development environment (IDE).
SageMaker
When your training data is already in Amazon S3 and you plan to run training jobs several times using different algorithms and parameters, consider using _____, a file system service that speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds.
Amazon FSx for Lustre
_____ has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times.
Amazon EFS
A data scientist can use a _____ to do initial cleansing on a training set, launch a training job from _____, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.
Jupyter notebook / Amazon SageMaker
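A hedged sketch of that loop using the SageMaker Python SDK (the container image, IAM role, and S3 paths are hypothetical):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-train:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-datasets/models/",
)

# Launch a training job against data already in S3; rerun with a modified
# dataset or different parameters and compare the resulting models.
estimator.fit({"train": "s3://my-ml-datasets/clean/train/"})
```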
How many images per second can S3 load?
less than 1
How many images per second can EFS load?
1
How many images per second can EBS load?
1.29
How many images per second can FSx load?
more than 1.6
This ingestion method periodically collects and groups source data in any logical order and is used when there is no need for real-time or near-real-time data.
Batch processing
What are three services that help with batch ingestions?
Glue, DMS, Step Functions
This service reads historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval.
AWS Database Migration Service (AWS DMS)
You can also automate various ETL tasks that involve complex workflows by using _____.
AWS Step Functions
This is a real-time data ingestion method that involves no grouping at all and in which data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer.
Stream processing
What are two uses for stream processing?
Real-time predictions and dashboards
What’s the platform for streaming data?
Kinesis
Use this to ingest and analyze video and audio data
Kinesis Video Streams
Use this to process and transform data streaming through Kinesis Data Streams or Kinesis Data Firehose using SQL, and gain real-time insights from the incremental stream before storing the data in S3.
Kinesis Data Analytics
Use this to batch and compress data to generate incremental views and execute custom transformation logic using Lambda before delivering the incremental view to S3.
Kinesis Data Firehose
This is an intermediary between your producer application code and the Kinesis Data Streams API.
Kinesis Producer Library (KPL)
Use the KPL or the Kinesis Data Streams API to write to a _____.
Kinesis Data Stream
Use this to build your own app to preprocess the streaming data as it arrives and emit the data for generating incremental views and downstream analysis.
Kinesis Client Library (KCL)
This is a distributed data store optimized for ingesting and processing streaming data in real-time that allows you to:
- Publish and subscribe to streams of records
- Effectively store streams of records in the order in which records were generated
- Process streams of records in real time
Apache Kafka
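For context, a minimal producer sketch using the third-party kafka-python package (broker address and topic are hypothetical):

```python
from kafka import KafkaProducer

# Publish a keyed record to a topic; records with the same key preserve
# their order within a partition, and subscribers consume the stream.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", key=b"d-42", value=b'{"temp_c": 21.7}')
producer.flush()
```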
Deduplication, incomplete data management, and attribute standardization are all ways you _____.
Transform and clean data
How should you change the data structure to facilitate easy querying of data?
Into an OLAP model
Technology for performing high-speed complex queries or multidimensional analysis on large volumes of data in a data warehouse, data lake or other data repository.
OLAP model
What provides a protocol for data processing and for node task distribution and management, and uses algorithms to split datasets into subsets and distribute them across nodes in a compute cluster?
MapReduce and Apache Spark
Using _____ on Amazon EMR provides a managed framework that can process massive quantities of data.
Apache Spark
_____ supports many instance types that have proportionally high CPU with increased network performance, which is well suited for HPC (high-performance computing) applications
Amazon EMR
What are ETL processing services?
Amazon Athena, AWS Glue, Amazon Redshift Spectrum
You can use _____ to provide metadata discovery and management features.
AWS Glue
Tabular data processing with _____ lets you manipulate your data files in Amazon S3 using SQL.
Athena
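A minimal boto3 sketch of running such a query (database, table, and output location are hypothetical; Athena writes the results to the given S3 path):

```python
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) FROM raw_sales GROUP BY region",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-ml-datasets/athena-results/"},
)
print(resp["QueryExecutionId"])
```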
If your datasets or computations are not optimally compatible with SQL, you can use _____ to seamlessly run Spark jobs (Scala and Python support) on data stored in your Amazon S3 buckets.
AWS Glue
Customers can store a single source of data in Amazon S3 and perform ad hoc analysis with _____, integrate with a data warehouse on _____, build a visual dashboard for metrics using _____, and build an ML model to predict readmissions using _____.
Athena / Amazon Redshift / Amazon QuickSight / Amazon SageMaker
Rather than develop artificial intelligence (AI) from scratch, data scientists use a _____ as a starting point to develop ML models that power new applications more quickly and cost-effectively.
Foundation model
Using this, you can streamline ML team collaboration, code efficiently using the AI-powered coding companion, tune and debug models, deploy and manage models in production, and automate workflows—all within a single, unified web-based interface.
SageMaker Studio