Domain 1: Data Engineering Flashcards
Data that has a well-defined schema and the metadata needed to interpret it, such as attribute names and data types.
Structured Data
Tabular data is an example of:
Structured Data
T/F: Depending on the column data type, you may have to perform different actions to prepare the data for machine learning.
True
An attribute in a tabular dataset is a _____, and a _____ corresponds to a data point or an observation.
Column/row
Data that does not have a schema or any well-defined structural properties.
Unstructured Data
What makes up the majority of the data most organizations have?
Unstructured Data
Whose job is it to convert the unstructured data into some form of structured data for machine learning or train an ML model directly on the unstructured data itself?
Data Scientist
Examples include images, videos, audio files, text documents, or application log files.
Unstructured Data
Data that can be in JSON format or XML data that you may have from a NoSQL database.
Semi-structured Data
T/F: You may need to parse this semi-structured data into structured data to make it useful for machine learning.
True
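Parsing semi-structured JSON into a fixed-schema table can be sketched with Python's standard library (the records and field names here are hypothetical):

```python
import json

# Hypothetical semi-structured records, e.g. exported from a NoSQL store;
# note the second record is missing the "tags" field
raw = '[{"id": 1, "name": "Ana", "tags": ["a", "b"]}, {"id": 2, "name": "Ben"}]'
records = json.loads(raw)

# Flatten into a fixed-schema (structured) table: every row has the same columns
columns = ["id", "name", "tags"]
rows = [[rec.get(col) for col in columns] for rec in records]
print(rows)
```

Missing fields become `None`, which downstream preparation steps would then impute or drop.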
Data that has a single or multiple target columns (dependent variables) or attributes.
Labeled Data
Data with no target attribute or label.
Unlabeled Data
A column in a tabular dataset besides the label column.
Feature
A row in a tabular dataset that consists of one or more features, which can also contain one or more labels.
Data Point
A collection of data points that you will use for model training and validation.
Dataset
A feature that can be represented by a continuous number or an integer but is unbounded in nature.
Numerical Feature
A feature that is discrete and qualitative, and can only take on a finite number of values.
Categorical Feature
In most machine learning problems, you need to convert _____ features into _____ features using different techniques.
Categorical/numerical
Images that are usually in different formats such as JPEG or PNG.
Image Data
An example of an _____ is a popular benchmark such as the MNIST handwritten digits dataset or ImageNet.
Image dataset
This data usually consists of audio files in MP3 or WAV formats and can arise from call recordings in call centers.
Audio Data
This data is commonly referred to as a corpus and can consist of collections of documents.
Text Data (Corpus)
_____ can be stored in many formats, such as raw PDF or TXT files, JSON, or CSV.
Text Data
Examples of ________ include the newsgroups dataset, Amazon reviews data, the WikiQA corpus, WordNet, and IMDB reviews.
Popular text corpora
This is data that consists of a value varying over time such as the sale price of a product, the price of a stock, the daily temperature or humidity, measurements or readings from a sensor or Internet of things (IoT) device, or the number of passengers who ride the New York City Metro daily.
Time Series Data
This is the dataset that is used to train the model.
Training Data
This is a portion of the dataset that is kept aside to validate your model performance during training.
Validation Data
This should be kept aside from the outset so that your model never sees it until it is trained. Once your model is trained and you are satisfied with the model performance on the training and validation datasets, only then should you test the model performance on this.
Test Data
T/F: The test dataset should mimic as closely as possible the data you expect your model to serve during production.
True
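The three-way split described in the cards above can be sketched with the standard library (the 70/15/15 ratio is illustrative, not prescribed):

```python
import random

random.seed(42)                  # reproducible split
data = list(range(100))          # stand-in for 100 data points
random.shuffle(data)             # shuffle before splitting to avoid ordering bias

# Illustrative 70/15/15 split; ratios vary by problem and dataset size
train = data[:70]
validation = data[70:85]
test = data[85:]
print(len(train), len(validation), len(test))
```

The test slice is held out entirely: the model never sees it until training and validation are complete.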
_____ is often used for use cases such as online transaction processing (OLTP), analytics, and reporting, and analysts use a language like _____to query this data.
Tabular data/SQL
_____ applications typically run on relational databases, and AWS offers a service called _____ to build and manage them.
OLTP / Amazon RDS (Relational Database Service)
_____ supports these underlying engines: Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL.
Amazon RDS
Relational databases typically use _____ and are suited for queries for specific rows, inserts, and updates.
Row-wise storage
For analytics and reporting workloads that are read heavy, consider a data warehouse solution like _____.
Amazon Redshift
Amazon Redshift uses _____ instead of _____ for fast retrieval of columns and is ideally suited for querying against very large datasets.
Columnar storage / row-wise storage
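The access-pattern difference between the two layouts can be illustrated in a few lines (a toy model, not how Redshift is actually implemented):

```python
# Row-wise layout: each record stored together, good for fetching whole rows
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 12.5, "qty": 1},
    {"id": 3, "price": 9.0, "qty": 5},
]

# Columnar layout: one contiguous list per column
columnar = {key: [row[key] for row in rows] for key in rows[0]}

# An analytics query like "average price" reads only the price column,
# skipping every other column entirely
avg_price = sum(columnar["price"]) / len(columnar["price"])
print(avg_price)
```

This is why columnar storage favors read-heavy analytics over wide tables, while row-wise storage favors single-row inserts, updates, and lookups.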
_____ is now integrated with Amazon SageMaker via SageMaker Data Wrangler.
Amazon Redshift
Both Redshift and RDS store _____.
Tabular data
If your data is semi-structured, you should consider a NoSQL database like _____.
DynamoDB
Stores data as key-value pairs and can be used to store data that does not have a specific schema.
DynamoDB
If your data currently lives in an open-source NoSQL store like MongoDB, you can move that data to _____ on AWS.
Amazon DocumentDB
T/F: Amazon recommends using purpose-built databases for specific applications rather than a one-size-fits-all approach.
True
_____ is a data lake solution that helps you centrally catalog your data and establish fine-grained controls on who can access the data.
AWS Lake Formation
Users can query the central catalog in Lake Formation and then run analytics or extract-transform-load (ETL) workstreams on the data using tools like _____.
Amazon Redshift or Amazon EMR
Once your data lands in AWS, you need to move the data to _____ in order to train ML models.
Amazon S3
What are the two ways of migrating data to AWS?
Batch and streaming
For batch migration, you _____ transfer data.
Bulk
For streaming migration, you have a streaming data source like ____ or ____ to stream data into S3.
Sensors / IoT devices
If your data is already on AWS, you can use _____ to move the data from other data sources such as Redshift, DynamoDB, or RDS to S3.
AWS Data Pipeline
An _____ is a pipeline component that tells Data Pipeline what job to perform.
Activity type
Data Pipeline has some prebuilt activity types that you can use, such as _____ to copy data from one Amazon S3 location to another, _____ to copy data to and from Redshift tables, and _____ to run a SQL query on a database and copy the output to S3.
CopyActivity / RedshiftCopyActivity / SqlActivity
What are 3 data sources you can use with AWS Data Pipeline to get data in S3?
Redshift, DynamoDB, and RDS
How do you migrate data from one database to another when your data is in relational format?
AWS Database Migration Service
What’s a migration that moves from, say, an Oracle database on premises or on EC2 to an Oracle database in Amazon RDS?
Homogeneous migration
What’s a migration that moves from a MySQL database to Amazon Aurora?
Heterogeneous migration
How do you convert the schema of a dataset?
Schema Conversion Tool
What can you use to land data from one relational database to Amazon S3?
DMS
Data Pipeline can be used with _____ such as Redshift and NoSQL databases such as DynamoDB, whereas DMS can only be used to migrate _____ such as databases on EC2, AzureSQL, and Oracle.
data warehouses / relational databases
_____ is a managed ETL service that allows you to run serverless extract-transform-load workloads without worrying about provisioning compute.
AWS Glue
You can take data from different data sources and use Glue crawlers to crawl the data and determine the underlying schema, which is stored in the _____.
Glue Data Catalog
_____ will try to infer the data schema and work with a number of data formats such as CSV, JSON, and Apache Avro.
Glue crawlers
_____ the process of combining data from multiple sources into a large, central repository called a data warehouse. This uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning.
Extract, transform, and load (ETL)
Once a schema is determined, how do you change the data format?
By running ETL scripts
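A minimal ETL script in pure Python illustrates the pattern (hypothetical columns; Glue ETL jobs normally run as PySpark or Python shell scripts):

```python
import csv
import io

# Extract: raw CSV as it might land from a source system (hypothetical data)
source = io.StringIO("name,amount\nana,10\nben,\ncara,7\n")
records = list(csv.DictReader(source))

# Transform: drop rows with missing amounts, cast types, normalize names
cleaned = [
    {"name": rec["name"].title(), "amount": int(rec["amount"])}
    for rec in records
    if rec["amount"]
]

# Load: write the cleaned records to the target (an in-memory buffer here;
# in practice this would be S3 or another data store)
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=["name", "amount"])
writer.writeheader()
writer.writerows(cleaned)
print(target.getvalue())
```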
_____ is a service that allows you to visually prepare and clean your data, normalize your data, and run a number of different feature transforms on the dataset without writing code.
Glue DataBrew
This is a powerful service with capabilities such as:
- Data visualization using Glue DataBrew
- Serverless ETL
- The ability to crawl and infer the schema of the data using data crawlers
- The ability to catalog your data into a data catalog using Glue Data Catalog
Glue
You use this to catalog data, convert data from one data format to another, run ETL jobs on the data, and land the data in another data source.
Glue
For many applications such as sensors and IoT devices, video or news feeds, and live social media streams, you may want to upload the data to AWS by _____.
Streaming
What word should you think of if the test mentions streaming, sensors, and IoT and concerns data collection?
Kinesis family of services
This provides a set of APIs, SDKs, and a user interface that you can use to store, update, version, and retrieve any amount of data from anywhere on the web.
Amazon Simple Storage Service (S3)
A _____ is where objects are stored in Amazon S3. Every object is contained in a _____ you own.
Bucket
An _____ that is stored in a bucket consists of the object data and object metadata. Metadata is a set of key-value pairs that describe the object, such as the date modified or standard HTTP metadata such as Content-Type.
object
A bucket is tied to the _____ it is created in. You can choose a _____ that optimizes latency or that satisfies regulatory requirements.
region
A single object in S3 can be up to _____ TB in size, and you can add up to _____ key-value pairs called S3 object tags to each object, which can be updated or deleted at a later time.
5 / 10
T/F: S3 storage is hierarchical
F: nonhierarchical
T/F: Object keys are not folder structures, they’re just a way to organize your data.
True
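Because S3 keys are flat strings, a "folder" is just a shared key prefix; a prefix-filtered listing, the way the ListObjectsV2 `Prefix` parameter works, can be emulated like this (hypothetical key names):

```python
# S3 keys are flat strings; there is no real directory tree
keys = [
    "raw/2024/01/data.csv",
    "raw/2024/02/data.csv",
    "processed/2024/01/data.parquet",
]

# Filtering by a shared prefix is what makes keys look like folders
prefix = "raw/2024/"
matches = [k for k in keys if k.startswith(prefix)]
print(matches)
```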
With S3 batch operations, you can copy large amounts of data between buckets, replace tags, or modify access controls _____
with a simple API or through the console.
How do you prevent accidental S3 bucket deletions?
Data versioning and MFA Delete
How do you copy objects to multiple locations automatically, in same or different regions?
S3 replication
How do you implement write-once, read-many (WORM) policy and retain an object version for a specific period of time?
S3 Object Lock
How do you query data without accessing any other analytics service using SQL statements?
S3 Select
What do you use for more involved SQL queries to query data directly on S3?
Amazon Athena or Redshift Spectrum