Domain 1: Data Engineering Flashcards

1
Q

Data that has a well-defined schema and metadata needed to interpret the data such as the attributes and the data types.

A

Structured Data

2
Q

Tabular data is an example of:

A

Structured Data

3
Q

T/F: Depending on the column data type, you may have to perform different actions to prepare the data for machine learning.

A

True

4
Q

An attribute in a tabular dataset is a _____, and a _____ corresponds to a data point or an observation.

A

Column/row

5
Q

Data that does not have a schema or any well-defined structural properties.

A

Unstructured Data

6
Q

What makes up the majority of the data most organizations have?

A

Unstructured Data

7
Q

Whose job is it to convert the unstructured data into some form of structured data for machine learning or train an ML model directly on the unstructured data itself?

A

Data Scientist

8
Q

Examples include images, videos, audio files, text documents, or application log files.

A

Unstructured Data

9
Q

Data that can be in JSON format or XML data that you may have from a NoSQL database.

A

Semi-structured Data

10
Q

T/F: You may need to parse this semi-structured data into structured data to make it useful for machine learning.

A

True
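
A minimal sketch (not from the source) of parsing semi-structured JSON records into a tabular pandas DataFrame; the record structure and field names are made up:

```python
import pandas as pd

# Hypothetical semi-structured records, e.g. exported from a NoSQL store
records = [
    {"id": 1, "user": {"name": "Ana", "age": 34}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Raj"}, "tags": []},
]

# Flatten nested fields into columns; missing values become NaN
df = pd.json_normalize(records)
print(df[["id", "user.name", "user.age"]])
```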

11
Q

Data that has a single or multiple target columns (dependent variables) or attributes.

A

Labeled Data

12
Q

Data with no target attribute or label.

A

Unlabeled Data

13
Q

A column in a tabular dataset besides the label column.

A

Feature

14
Q

A row in a tabular dataset that consists of one or more features, which can also contain one or more labels.

A

Data Point

15
Q

A collection of data points that you will use for model training and validation.

A

Dataset

16
Q

A feature that can be represented by a continuous number or an integer but is unbounded in nature.

A

Numerical Feature

17
Q

A feature that is discrete and qualitative, and can only take on a finite number of values.

A

Categorical Feature

18
Q

In most machine learning problems, you need to convert _____ features into _____ features using different techniques.

A

Categorical/numerical
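
A minimal sketch of one common conversion technique, one-hot encoding with pandas; the column names and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"], "price": [3.0, 4.5, 2.5]})

# One-hot encode the categorical column into numerical indicator columns
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```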

19
Q

Images that are usually in different formats such as JPEG or PNG.

A

Image Data

20
Q

Examples of an _____ include the popular handwritten digits dataset MNIST and the large-scale ImageNet dataset.

A

Image dataset

21
Q

This data usually consists of audio files in MP3 or WAV formats and can arise from call transcriptions in call centers.

A

Audio Data

22
Q

This data is commonly referred to as a corpus and can consist of collections of documents.

A

Text Data (Corpus)

23
Q

_____ can be stored in many formats, such as raw PDF or TXT files, JSON, or CSV.

A

Text Data

24
Q

Examples of ________ include the newsgroups dataset, Amazon reviews data, the WikiQA corpus, WordNet, and IMDB reviews.

A

Popular text corpora

25
Q

This is data that consists of a value varying over time such as the sale price of a product, the price of a stock, the daily temperature or humidity, measurements or readings from a sensor or Internet of things (IoT) device, or the number of passengers who ride the New York City Metro daily.

A

Time Series Data

26
Q

This is the dataset that is used to train the model.

A

Training Data

27
Q

This is a portion of the dataset that is kept aside to validate your model performance during training.

A

Validation Data

28
Q

This should be kept aside from the outset so that your model never sees it until it is trained. Once your model is trained and you are satisfied with the model performance on the training and validation datasets, only then should you test the model performance on this.

A

Test Data
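
One common way to carve out the three datasets, sketched with scikit-learn on a toy array; the split fractions are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100).reshape(50, 2)   # toy dataset of 50 data points

# Hold out the test set first so the model never sees it during training,
# then split the remainder into training and validation sets.
train_val, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test
```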

29
Q

T/F: The test dataset should mimic as closely as possible the data you expect your model to serve during production.

A

True

30
Q

_____ is often used for use cases such as online transaction processing (OLTP), analytics, and reporting, and analysts use a language like _____ to query this data.

A

Tabular data/SQL

31
Q

_____ applications typically run on relational databases, and AWS offers a service called _____ to build and manage this kind of data.

A

OLTP / Amazon RDS (Relational Database Service)

32
Q

_____ supports these underlying engines: Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL.

A

Amazon RDS

33
Q

Relational databases typically use _____ and are suited for queries for specific rows, inserts, and updates.

A

Row-wise storage

34
Q

For analytics and reporting workloads that are read heavy, consider a data warehouse solution like _____.

A

Amazon Redshift

35
Q

Amazon Redshift uses _____ instead of _____ for fast retrieval of columns and is ideally suited for querying against very large datasets.

A

Columnar storage/row-level storage

36
Q

_____ is now integrated with Amazon SageMaker via SageMaker Data Wrangler.

A

Amazon Redshift

37
Q

Both Redshift and RDS store _____.

A

Tabular data

38
Q

If your data is semi-structured, you should consider a NoSQL database like _____.

A

DynamoDB

39
Q

Stores data as key-value pairs and can be used to store data that does not have a specific schema.

A

DynamoDB
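
A small boto3 sketch of writing schemaless items as key-value pairs to a DynamoDB table; the table name and attributes are hypothetical, and the table is assumed to already exist with "pk" as its partition key:

```python
import boto3

table = boto3.resource("dynamodb").Table("example-table")  # hypothetical table

# Items in the same table can carry different attributes (no fixed schema)
table.put_item(Item={"pk": "user#1", "name": "Ana", "age": 34})
table.put_item(Item={"pk": "user#2", "email": "raj@example.com"})

item = table.get_item(Key={"pk": "user#1"})["Item"]
print(item)
```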

40
Q

If your data currently lives in an open-source NoSQL store like MongoDB, you can use _____ to migrate that data to AWS.

A

Amazon DocumentDB

41
Q

T/F: Amazon recommends using purpose-built databases for specific applications rather than a one-size-fits-all approach.

A

True

42
Q

_____ is a data lake solution that helps you centrally catalog your data and establish fine-grained controls on who can access the data.

A

AWS Lake Formation

43
Q

Users can query the central catalog in Lake Formation and then run analytics or extract-transform-load (ETL) workstreams on the data using tools like _____.

A

Amazon Redshift or Amazon EMR

44
Q

Once your data lands in AWS, you need to move the data to _____ in order to train ML models.

A

Amazon S3

45
Q

What are the two ways of migrating data to AWS?

A

Batch and streaming

46
Q

For batch migration, you _____ transfer data.

A

Bulk

47
Q

For streaming migration, you have a streaming data source like ____ or ____ to stream data into S3.

A

Sensors / IoT devices

48
Q

If your data is already on AWS, you can use _____ to move the data from other data sources such as Redshift, DynamoDB, or RDS to S3.

A

AWS Data Pipeline

49
Q

An _____ is a pipeline component that tells Data Pipeline what job to perform.

A

Activity type

50
Q

Data Pipeline has some prebuilt activity types that you can use, such as _____ to copy data from one Amazon S3 location to another, _____ to copy data to and from Redshift tables, and _____ to run a SQL query on a database and copy the output to S3.

A

CopyActivity / RedshiftCopyActivity / SqlActivity

51
Q

What are 3 data sources you can use with AWS Data Pipeline to get data in S3?

A

Redshift, DynamoDB, and RDS

52
Q

How do you migrate data from one database to another when your data is in relational format?

A

AWS Database Migration Service

53
Q

What's a migration that moves from, say, an Oracle database on premises or on EC2 to an Oracle database in Amazon RDS?

A

Homogeneous migration

54
Q

What's a migration that moves from, say, a MySQL database to Amazon Aurora?

A

Heterogeneous migration

55
Q

How do you convert a database schema from one engine to another for a heterogeneous migration?

A

AWS Schema Conversion Tool (AWS SCT)

56
Q

What can you use to land data from one relational database to Amazon S3?

A

DMS

57
Q

Data Pipeline can be used with _____ such as Redshift and NoSQL databases such as DynamoDB, whereas DMS can only be used to migrate _____ such as databases on EC2, Azure SQL, and Oracle.

A

data warehouses / relational databases

58
Q

_____ is a managed ETL service that allows you to run serverless extract-transform-load workloads without worrying about provisioning compute.

A

AWS Glue

59
Q

You can take data from different data sources, crawl the data to determine the underlying schema, and register that schema in the _____.

A

Glue Data Catalog

60
Q

_____ will try to infer the data schema and work with a number of data formats such as CSV, JSON, and Apache Avro.

A

Glue crawlers
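
A hedged boto3 sketch of defining and starting such a crawler over an S3 prefix; the crawler name, IAM role ARN, database, and path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes the inferred
# table schemas into a Glue Data Catalog database.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")
```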

61
Q

_____ is the process of combining data from multiple sources into a large, central repository called a data warehouse. It uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning.

A

Extract, transform, and load (ETL)

62
Q

Once a schema is determined, how do you change the data format?

A

By running ETL scripts

63
Q

_____ is a service that allows you to visually prepare and clean your data, normalize your data, and run a number of different feature transforms on the dataset without writing code.

A

Glue DataBrew

64
Q

This is a powerful service with capabilities such as:
- Data visualization using Glue DataBrew
- Serverless ETL
- The ability to crawl and infer the schema of the data using data crawlers
- The ability to catalog your data into a data catalog using Glue Data Catalog

A

Glue

65
Q

You use this to catalog data, convert data from one data format to another, run ETL jobs on the data, and land the data in another data source.

A

Glue

66
Q

For many applications such as sensors and IoT devices, video or news feeds, and live social media streams, you may want to upload the data to AWS by _____.

A

Streaming

67
Q

What should you think of if the exam mentions streaming, sensors, and IoT in the context of data collection?

A

Kinesis family of services

68
Q

This provides a set of APIs, SDKs, and a user interface that you can use to store, update, version, and retrieve any amount of data from anywhere on the web.

A

Amazon Simple Storage Service (S3)

69
Q

A _____ is where objects are stored in Amazon S3. Every object is contained in a _____ you own.

A

Bucket

70
Q

An _____ that is stored in a bucket consists of the object data and object metadata. Metadata is a set of key-value pairs that describe the object, such as the date modified or standard HTTP metadata such as Content-Type.

A

object

71
Q

A bucket is tied to the _____ it is created in. You can choose a _____ that optimizes latency or satisfies regulatory requirements.

A

region

72
Q

A single object in S3 can be up to _____ TB in size, and you can add up to _____ key-value pairs called S3 object tags to each object, which can be updated or deleted at a later time.

A

5 / 10
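
A small boto3 illustration of uploading an object with user-defined metadata and object tags; the bucket, key, and values are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Object data plus user-defined metadata and (up to 10) object tags
s3.put_object(
    Bucket="example-bucket",
    Key="datasets/train.csv",
    Body=b"feature1,feature2,label\n1,2,0\n",
    Metadata={"project": "churn-model"},
    Tagging="stage=raw&owner=data-team",
)
```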

73
Q

T/F: S3 storage is hierarchical

A

False: it is nonhierarchical (flat).

74
Q

T/F: Object keys are not folder structures; they're just a way to organize your data.

A

True

75
Q

With S3 batch operations, you can copy large amounts of data between buckets, replace tags, or modify access controls _____

A

with a simple API or through the console.

76
Q

How do you prevent accidental deletions of S3 objects?

A

Data versioning and MFA Delete

77
Q

How do you copy objects to multiple locations automatically, in same or different regions?

A

S3 replication

78
Q

How do you implement a write-once-read-many (WORM) policy and retain an object version for a specific period of time?

A

S3 Object Lock

79
Q

How do you query data without accessing any other analytics service using SQL statements?

A

S3 Select
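
A hedged boto3 sketch of S3 Select pulling matching rows out of a CSV object with a SQL expression; the bucket, key, and query are illustrative:

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="example-bucket",
    Key="datasets/train.csv",
    ExpressionType="SQL",
    Expression="SELECT s.label, s.feature1 FROM s3object s WHERE s.label = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matching rows
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```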

80
Q

What do you use for more involved SQL queries to query data directly on S3?

A

Amazon Athena or Redshift Spectrum

81
Q

S3 Standard, S3 Intelligent-Tiering, S3 Standard-Infrequent Access, S3 One Zone-Infrequent Access, S3 Glacier, and S3 Glacier Deep Archive are all:

A

Storage classes

82
Q

What are these:
- AWS Identity and Access Management (IAM) and access control lists (ACLs)
- Query string authentication
- AWS Trusted Advisor
- Server-side encryption (SSE-KMS, SSE-C, SSE-S3) and client-side encryption.
- VPC endpoints

A

Amazon S3 security features

83
Q

_____ provides a fully managed, POSIX-compliant, elastic NFS filesystem that can be shared by multiple instances. It is built for petabyte scale, and it grows and shrinks automatically and seamlessly as you add and remove data.

A

Amazon Elastic File System (EFS)

84
Q

What are these:
- Use the console or APIs to create a filesystem.
- Create mount targets for your filesystem.
- Create and configure security groups.

A

Steps to get started with EFS

85
Q

How do you mount an EFS filesystem inside your VPC?

A

By creating a mount target in each availability zone so that all instances in the same availability zone can share the same mount target.

86
Q

_____ is a fully managed, high-performance filesystem that can be used for large-scale machine learning jobs and high-performance computing (HPC) use cases. It provides two types of filesystems: one for Windows and one for Lustre.

A

Amazon FSx

87
Q

This is based on the popular Lustre filesystem that is used for distributed computing workloads such as machine learning and HPC. It can support hundreds of petabytes of data storage and hundreds of gigabytes of aggregate throughput. The majority of the top 100 fastest supercomputers in the world use this.

A

Amazon FSx for Lustre

88
Q

T/F: Version control tools like Git are meant for storing large training datasets or trained ML models.

A

False

89
Q

How is code typically versioned on AWS?

A

CodeCommit

90
Q

_____ is used to track, version, back up, and restore snapshots of datasets by using familiar tools and AWS back-end storage services like S3 and EFS

A

DVC

91
Q

_____ uses local caches that can also be shared across users using services like EFS and can use S3 as a persistent store.

A

DVC

92
Q

What might prevent you from incorporating DVC into your workflow?

A

Restrictions of your current stack or lack of appropriate training.

93
Q

Why are versioning systems like DVC beneficial?

A

They allow you to branch, commit, merge, and work with datasets using a structured approach.

94
Q

What model shares compute and storage resources across tenants to reduce costs but requires strong security measures to prevent cross-tenant data access?

A

Pool model

95
Q

What is it called when each tenant has its own set of isolated resources?

A

Silo model

96
Q

What do you use to collect and process large streams of data records in real time?

A

Amazon Kinesis Data Streams
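
A minimal boto3 producer sketch that puts a record onto a Kinesis data stream; the stream name and payload are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Each record needs a partition key, which determines the shard it lands on
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps({"device_id": "sensor-42", "temp_c": 21.7}).encode(),
    PartitionKey="sensor-42",
)
```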

97
Q

What reads data from a data stream as data records?

A

Kinesis Data Streams applications

98
Q

Kinesis Data Streams applications can use the _____ and run on _____.

A

Kinesis Client Library / EC2 instances

99
Q

This can be used for rapid and continuous data intake and aggregation.

A

Kinesis Data Streams

100
Q

These are scenarios for using _____:
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing

A

Kinesis Data Streams

101
Q

This is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, Amazon OpenSearch Serverless, Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, Coralogix, and Elastic.

A

Amazon Data Firehose

102
Q

This automatically delivers data to a specified destination when you configure your data producers to send data to it.

A

Amazon Data Firehose

103
Q

This is the underlying entity of Amazon Data Firehose. You use Amazon Data Firehose by creating this and then sending data to it.

A

Firehose stream

104
Q

The data of interest that your data producer sends to a Firehose stream, which can be as large as 1,000 KB.

A

Record

105
Q

These send records to Firehose streams.

A

Data producers
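
A hedged boto3 sketch of a data producer sending one record to a Firehose stream; the stream name and payload are made up:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers incoming records before delivering them to the destination
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",   # hypothetical Firehose stream
    Record={"Data": json.dumps({"page": "/home", "ms": 132}).encode() + b"\n"},
)
```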

106
Q

T/F: Amazon Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations.

A

True

107
Q

Buffer Size is in _____ and Buffer Interval is in _____.

A

MBs / seconds

108
Q

With Firehose, for Amazon S3 destinations, streaming data is delivered to your _____

A

S3 bucket

109
Q

With Firehose, for Amazon Redshift destinations, streaming data is delivered to your _____ first. Amazon Data Firehose then issues an _____ to load data from your S3 bucket to your Amazon Redshift cluster.

A

S3 bucket / Amazon Redshift COPY command

110
Q

With Firehose, for OpenSearch Service destinations, streaming data is delivered to your _____, and it can optionally be backed up to your S3 bucket concurrently.

A

OpenSearch Service cluster

111
Q

With Firehose, for Splunk destinations, streaming data is delivered to _____, and it can optionally be backed up to your S3 bucket concurrently.

A

Splunk

112
Q

With _____ for SQL Applications, you can process and analyze streaming data using standard SQL.

A

Amazon Kinesis Data Analytics

113
Q

This service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.

A

Amazon Kinesis Data Analytics

114
Q

What application supports ingesting data from Amazon Kinesis Data Streams and Amazon Data Firehose streaming sources?

A

Kinesis Data Analytics

115
Q

What are these steps for:
1. Create app
2. Author SQL code using interactive editor
3. Test code with live streaming data

A

Kinesis Data Analytics

116
Q

This enables you to quickly author SQL code that continuously reads, processes, and stores data in near real time. Using standard SQL queries on the streaming data, you can construct applications that transform and provide insights into your data.

A

Kinesis Data Analytics

117
Q

You can do the following with this:
- Generate time-series analytics
- Feed real-time dashboards
- Create real-time metrics

A

Kinesis Data Analytics

118
Q

What benefit does Glue have over EMR?

A

It’s serverless.

119
Q

How can you run your ETL scripts?

A

Using Python, PySpark, or Scala

120
Q

T/F: Glue offers several built-in data transforms and can even build the processing script for you, or you can bring your custom scripts.

A

True

121
Q

What’s the first step in the Glue workflow?

A

Point Glue to your data sources and define a crawler to crawl your data.

122
Q

What’s the second step in the Glue workflow?

A

Populate the Glue Data Catalog with your table metadata.

123
Q

What’s the third step in the Glue workflow?

A

Run a custom processing script on your data.
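
As a hedged illustration of such a processing script (not from the source), here is a skeleton of a Glue PySpark job that reads a cataloged table and writes Parquet output to S3; the database, table, and S3 path are placeholders, and the awsglue modules are only available inside a Glue job run:

```python
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a crawler registered in the Glue Data Catalog
frame = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales"
)

# Write the transformed output back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/"},
    format="parquet",
)
```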

124
Q

What happens when you point Glue to your data source and a custom or prebuilt script, and schedule a job or use an event trigger to trigger the Glue workflow?

A

The processed outputs will be stored in the specified destination, such as an S3 bucket or Redshift.

125
Q

What data sources and destinations does Glue work with?

A

Ones that support JDBC connectivity, such as Amazon Redshift or Amazon RDS, in addition to Amazon S3.

126
Q

What service uses the Glue Data Catalog to catalog your data?

A

AWS Lake Formation

127
Q

T/F: You can run Athena or Redshift Spectrum queries on data in S3 using the Glue Data Catalog as the underlying metadata store.

A

True

128
Q

What is the service that helps data scientists visually inspect their data, explore the data, define transformations, and engineer features?

A

Glue DataBrew

129
Q

What is a fully managed Hadoop cluster ecosystem that runs on EC2 and allows you to choose from a menu of open-source tools, such as Spark for ETL and SparkML for machine learning, Presto for SQL queries, Flink for stream processing, Pig and Hive to analyze and query data, and Jupyter-style notebooks with Zeppelin?

A

Amazon EMR

130
Q

_____ is useful when you want to run data processing and ETL jobs over petabytes of data.

A

Amazon EMR

131
Q

In addition to the Hadoop distributed filesystem for storage, _____ integrates directly with data in S3 using _____.

A

EMR / EMR File System (EMRFS)

132
Q

For interactive analysis of Spark jobs, you can either use _____ or connect your EMR cluster to _____.

A

EMR notebooks / SageMaker notebook instances or SageMaker Studio

133
Q

This allows you to run Spark-based workloads on EC2 instances.

A

EMR

134
Q

This solution is:
- Ideally suited for the extremely large-scale (petabyte-scale) data requirements
- Requires familiarity with the Hadoop ecosystem
- Runs on EC2 instances in your AWS account
- Is ideally suited for big data engineers

A

EMR

135
Q

This is a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run applications built using open source big data frameworks such as Apache Spark, Hive or Presto, without having to tune, operate, optimize, secure or manage clusters.

A

Amazon EMR Serverless

136
Q

What are the following use cases for:
- Perform big data analytics
- Build scalable data pipelines
- Process real-time data streams
- Accelerate data science and ML adoption

A

Amazon EMR

137
Q

_____ is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

A

Apache Hadoop

138
Q

Instead of using one large computer to store and process the data, _____ allows clustering multiple computers to analyze massive datasets in parallel more quickly.

A

Hadoop

139
Q

A distributed file system that runs on standard or low-end hardware and provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.

A

Hadoop Distributed File System (HDFS)

140
Q

Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.

A

Yet Another Resource Negotiator (YARN)

141
Q

A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.

A

MapReduce
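
To make the model concrete, here is a framework-free Python toy of the map/shuffle/reduce steps on a word count; this illustrates the programming model only and is not Hadoop code:

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran"]

# Map: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values into the final output
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```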

142
Q

Provides common Java libraries that can be used across all modules.

A

Hadoop Common

143
Q

_____ makes it easier to use all the storage and processing capacity in cluster servers, and to execute distributed processes against huge amounts of data.

A

Hadoop

144
Q

What is this workflow for:
- Applications that collect data in various formats can place data into this cluster by using an API operation to connect to the NameNode.
- The NameNode tracks the file directory structure and placement of “chunks” for each file, replicated across DataNodes.
- To run a job to query the data, provide a MapReduce job made up of many map and reduce tasks that run against the data in HDFS spread across the DataNodes.
- Map tasks run on each node against the input files supplied, and reducers run to aggregate and organize the final output.

A

Hadoop

145
Q

An open source, distributed processing system commonly used for big data workloads. This uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries.

A

Spark

146
Q

Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, in addition to distributed and fault-tolerant data warehousing.

A

Hive

147
Q

A programming model for processing big data sets with a parallel, distributed algorithm.

A

Hadoop MapReduce

148
Q

What service does this describe:
- An open-source, distributed processing system used for big data workloads
- It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size.
- It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

A

Spark

149
Q

This was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.

A

Spark

150
Q

T/F: Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset.

A

True

151
Q

Data reuse is accomplished through the creation of _____, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory and reused in multiple Spark operations.

A

DataFrames
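
A short PySpark sketch showing a DataFrame cached in memory and reused across operations; the S3 path and column names are placeholders, and it assumes a Spark runtime (for example on EMR or Glue) with S3 access configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reuse-example").getOrCreate()

df = spark.read.csv("s3://example-bucket/sales.csv", header=True, inferSchema=True)

# Cache the DataFrame so repeated computations reuse the in-memory copy
df.cache()
df.groupBy("region").count().show()
print(df.filter(df["amount"] > 100).count())
```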

152
Q

T/F: Spark is faster than MapReduce

A

True, dramatically

153
Q

Hadoop is an open source framework that has _____ as storage, _____ as a way of managing computing resources used by different applications, and an implementation of the _____ programming model as an execution engine.

A

the Hadoop Distributed File System (HDFS) / YARN / MapReduce

154
Q

Does Spark have its own storage system?

A

No

155
Q

Spark on Hadoop leverages _____ to share a common cluster and dataset with other Hadoop engines, ensuring consistent levels of service and response.

A

YARN

156
Q

The Spark framework includes:

_____ as the foundation for the platform
_____ for interactive queries
_____ for real-time analytics
_____ for machine learning
_____ for graph processing

A

Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX

157
Q

What does EMR stand for?

A

Elastic MapReduce

158
Q

What do you use to store structured and unstructured data?

A

Data lake

159
Q

_____ is your data lake solution, and _____ is the preferred storage option for data science processing on AWS.

A

AWS Lake Formation / Amazon S3

160
Q

How do you reduce the cost of data storage?

A

Amazon S3 storage classes

161
Q

The storage option for active, frequently accessed data, with milliseconds access, and greater than or equal to 3 availability zones

A

S3 Standard

162
Q

The storage class for data with changing access patterns, with milliseconds access, and greater than or equal to 3 availability zones

A

S3 Intelligent-Tiering (S3 INT)

163
Q

The storage class for infrequently accessed data, with milliseconds access, and greater than or equal to 3 availability zones

A

S3 Standard-IA (S3 S-IA)

164
Q

The storage class for re-creatable, less accessed data, milliseconds access, and 1 availability zone

A

S3 One Zone-IA (S3 1Z-IA)

165
Q

The storage class for archive data, minutes/hours access, and greater than or equal to 3 availability zones

A

Glacier

166
Q

With _____, you can build, train and deploy ML models at scale using tools like notebooks, debuggers, profilers, pipelines, MLOps, and more – all in one integrated development environment (IDE).

A

SageMaker

167
Q

When your training data is already in Amazon S3 and you plan to run training jobs several times using different algorithms and parameters, consider using _____, a file system service that speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds.

A

Amazon FSx for Lustre

168
Q

_____ has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times.

A

Amazon EFS

169
Q

A data scientist can use a _____ to do initial cleansing on a training set, launch a training job from _____, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

A

Jupyter notebook / Amazon SageMaker
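
A hedged sketch of that workflow using the SageMaker Python SDK; the container image URI, IAM role, and S3 paths are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",
    sagemaker_session=session,
)

# Training channels point at the prepared data in S3
estimator.fit({
    "train": "s3://example-bucket/train/",
    "validation": "s3://example-bucket/val/",
})
```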

170
Q

How many images per second can S3 load?

A

less than 1

171
Q

How many images per second can EFS load?

A

1

172
Q

How many images per second can EBS load?

A

1.29

173
Q

How many images per second can FSx load?

A

more than 1.6

174
Q

This ingestion method periodically collects and groups source data in any logical order, and it is used when there is no need for real-time or near-real-time data.

A

Batch processing

175
Q

What are three services that help with batch ingestion?

A

Glue, DMS, Step Functions

176
Q

This service reads historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval.

A

AWS Database Migration Service (AWS DMS)

177
Q

You can also automate various ETL tasks that involve complex workflows by using _____.

A

AWS Step Functions

178
Q

This is a real-time data ingestion method that involves no grouping at all and in which data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer.

A

Stream processing

179
Q

What are two uses for stream processing?

A

Real-time predictions and dashboards

180
Q

What’s the platform for streaming data?

A

Kinesis

181
Q

Use this to ingest and analyze video and audio data

A

Kinesis Video Streams

182
Q

Use this to process and transform data streaming through Kinesis Data Streams or Kinesis Data Firehose using SQL, and gain real-time insight from the incremental stream before storing it in S3.

A

Kinesis Data Analytics

183
Q

Use this to batch and compress data to generate incremental views and execute custom transformation logic using Lambda before delivering the incremental view to S3.

A

Kinesis Data Firehose

184
Q

This is an intermediary between your producer application code and the Kinesis Data Streams API.

A

Kinesis Producer Library (KPL)

185
Q

Use the KPL and the Kinesis Data Streams API to write data to a _____.

A

Kinesis Data Stream

186
Q

Use this to build your own app to preprocess the streaming data as it arrives and emit the data for generating incremental views and downstream analysis.

A

Kinesis Client Library (KCL)

187
Q

This is a distributed data store optimized for ingesting and processing streaming data in real-time that allows you to:
- Publish and subscribe to streams of records
- Effectively store streams of records in the order in which records were generated
- Process streams of records in real time

A

Apache Kafka

188
Q

Deduplication, incomplete data management, and attribute standardization are all ways you _____.

A

Transform and clean data

189
Q

How should you change the data structure to facilitate easy querying of data?

A

Into an OLAP model

190
Q

Technology for performing high-speed complex queries or multidimensional analysis on large volumes of data in a data warehouse, data lake or other data repository.

A

OLAP model

191
Q

What provides a protocol for data processing, node task distribution, and management, and uses algorithms to split datasets into subsets and distribute them across nodes in a compute cluster?

A

MapReduce and Apache Spark

192
Q

Using _____ on Amazon EMR provides a managed framework that can process massive quantities of data.

A

Apache Spark

193
Q

_____ supports many instance types that have proportionally high CPU with increased network performance, which is well suited for HPC (high-performance computing) applications

A

Amazon EMR

194
Q

What are ETL processing services?

A

Amazon Athena, AWS Glue, Amazon Redshift Spectrum

195
Q

You can use _____ to provide metadata discovery and management features.

A

AWS Glue

196
Q

Tabular data processing with _____ lets you manipulate your data files in Amazon S3 using SQL.

A

Athena
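
A hedged boto3 sketch of kicking off an Athena SQL query over files in S3; the database, table, and output location are placeholders:

```python
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) FROM sales_db.train_data GROUP BY label",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```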

197
Q

If your datasets or computations are not optimally compatible with SQL, you can use _____ to seamlessly run Spark jobs (Scala and Python support) on data stored in your Amazon S3 buckets.

A

AWS Glue

198
Q

Customers can store a single source of data in Amazon S3 and perform ad hoc analysis with _____, integrate with a data warehouse on _____, build a visual dashboard for metrics using _____, and build an ML model to predict readmissions using _____.

A

Athena / Amazon Redshift / Amazon QuickSight / Amazon SageMaker

199
Q

Rather than develop artificial intelligence (AI) from scratch, data scientists use a _____ as a starting point to develop ML models that power new applications more quickly and cost-effectively.

A

Foundation model

200
Q

Using this, you can streamline ML team collaboration, code efficiently using the AI-powered coding companion, tune and debug models, deploy and manage models in production, and automate workflows—all within a single, unified web-based interface.

A

SageMaker Studio