AWS Data Analytics Flashcards
In a single data dashboard, Amazon ___________ can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more.
QuickSight
CloudWatch detailed monitoring sends data from your EC2 instance to CloudWatch in ______ intervals.
1-minute
____________ is an ETL service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
Kinesis Data Firehose
When Kinesis Data Firehose is configured to send data to Redshift, behind the scenes it has to load the streaming data to _______ first and then issue a ______ command to move the data to Redshift.
S3… COPY…
Within Kinesis Data Analytics, using _________ __________ is a windowing method for analyzing time-based, overlapping groups of data that arrive at inconsistent times by aggregating the data.
stagger windows
What are the three windows you can use to process data in Kinesis Data Analytics?
- Stagger Windows
- Tumbling Windows
- Sliding Windows
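Of the three, tumbling windows are the easiest to picture: fixed-size, non-overlapping intervals in which each record lands in exactly one window. The sketch below illustrates the idea in plain Python; it is not Kinesis Data Analytics code, and the (timestamp, key) event shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) tuples (hypothetical shape).
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Integer division assigns each timestamp to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
result = tumbling_window_counts(events, 60)
```

With a 60-second window, the records at 0, 30, and 59 seconds share the first window, while the record at 60 seconds starts the next one.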
___________ includes a built-in ML algorithm that can easily provide reliable forecasts for your data.
Amazon QuickSight
_______ is a fast, open-source, distributed SQL query engine designed for interactive analytic queries over large datasets from multiple sources (built by Facebook).
Presto
AWS Glue ETL scripts can be coded in _________ or _________ .
Python… Scala…
Amazon Redshift automatically integrates with ________ but not with an ________ (for encryption keys).
AWS KMS… HSM…
With Amazon Redshift, you can’t migrate to an _______-encrypted cluster by modifying the cluster. This is only possible if you want to enable _______ encryption.
HSM… KMS…
To load data from S3 to Redshift, you can use a __________ _________ that lists out the specific S3 paths you want to be copied over.
manifest file
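A Redshift COPY manifest is a small JSON document with an "entries" list, where each entry names one S3 object. A minimal sketch of building one in Python follows; the bucket and key names are hypothetical, and the helper function is illustrative rather than part of any AWS SDK.

```python
import json


def build_copy_manifest(s3_urls, mandatory=True):
    """Build the JSON structure of a Redshift COPY manifest.

    Each entry lists one S3 object; "mandatory" controls whether COPY
    should fail when that object is missing.
    """
    return {"entries": [{"url": url, "mandatory": mandatory} for url in s3_urls]}


# Hypothetical S3 paths, used only for illustration.
manifest = build_copy_manifest([
    "s3://example-bucket/sales/part-0000.csv",
    "s3://example-bucket/sales/part-0001.csv",
])
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON would be uploaded to S3 and referenced by a COPY command that includes the MANIFEST option.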
Using the AWS Glue crawler for compressed files will cause the run time to ____________.
increase… It will take longer because the crawler has to download and decompress the file before reading it.
AWS Glue ___________ crawls only crawl folders that were added since the last crawler run, which can save significant time and cost.
incremental
To enable permissions between S3 and QuickSight, you would need to configure the permissions from the _________ console.
QuickSight
The _________ process re-sorts rows and reclaims space in either a specified table or all tables in the current database in Amazon Redshift.
VACUUM
If QuickSight connects to the data store by using a ________ ________, the data automatically refreshes when you open an associated dataset, analysis, or dashboard.
direct query
________ ______ is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
Amazon EMR
Can you use AWS Glue triggers to execute a job to run directly after a crawler completes?
No, but you can create an AWS Glue workflow with two triggers: one for the crawler and one for the job. This will achieve the same effect.
The capacity limits of an Amazon Kinesis data stream are defined by the ________ _____ ________ within the data stream.
number of shards
When creating an EMR cluster and you want to have the log files archived to Amazon S3, you must enable this feature __________ (while / after) launching the cluster.
while
Does Amazon SQS support real time streaming of data?
No.
What are the two Amazon EMR cluster types (regarding how long the cluster keeps running)?
(1) persistent / long-running
(2) transient
In Kinesis Data Streams, you can create up to _____ registered consumers per stream.
20
The two Kinesis Data Streams capacity modes are _________ and _________. These refer to whether the data stream shards are automatically or manually created.
on-demand… provisioned
To detect anomalies in your Kinesis data stream, you can use the ________________ function in Kinesis Data Analytics.
RANDOM_CUT_FOREST
Kinesis Data Analytics (KDA) supports _____________, _____________, and _____________ as destinations.
Kinesis Data Streams… Kinesis Data Firehose… Lambda
A common architecture using Kinesis Data Analytics (KDA) might look like this: ___________ –> Kinesis Data Analytics –> ___________ –> S3
Kinesis Data Stream –> Kinesis Data Analytics –> Kinesis Data Firehose –> S3
Apache _______ is a data warehousing system that uses SQL-like queries to analyze structured data stored in Hadoop Distributed File System (HDFS).
Hive
When creating an EMR cluster, what two configuration options can you choose from? The selected option is applied to each node type (primary, core, task) of the cluster.
- Instance Fleets
- Uniform Instance Groups (simpler, provides autoscaling)
Can Glue Data Catalog be used to store data, in a similar way to S3?
No, it is only used to store schema information on data gathered from the Glue crawler.
By default, Amazon Redshift clusters are created and situated in _______ AZ(s) within an AWS Region.
1… However, a multi-AZ deployment is also an option
If you have customized networking requirements for using Amazon Redshift, you will need to enable _________________ _______ _________________.
Enhanced VPC Routing
S3 Transfer Acceleration is enabled at the ________ level.
bucket
What are three common CLI commands for moving data to and from S3?
cp (copy)
mv (move)
sync (sync)
What are the 3 API calls for an S3 Multi Part upload?
CreateMultipartUpload
UploadPart
CompleteMultipartUpload
What is the max number of parts for an S3 Multi Part upload?
10,000
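The 10,000-part limit means the part size has to grow with the object size. The sketch below picks a part size that stays within the limit. The 5 MiB minimum part size is not stated in these cards and is included as an assumption about S3, and the 8 MiB preferred size is an arbitrary illustrative default.

```python
MAX_PARTS = 10_000               # limit from the card above
MIN_PART_SIZE = 5 * 1024 ** 2    # assumed S3 minimum (5 MiB) for every part except the last


def plan_multipart(object_size, preferred_part_size=8 * 1024 ** 2):
    """Return (part_size, part_count) in bytes/parts that respects MAX_PARTS."""
    # Smallest part size that still fits the object into MAX_PARTS parts (integer ceiling).
    size_needed_for_limit = (object_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(preferred_part_size, MIN_PART_SIZE, size_needed_for_limit)
    part_count = max(1, (object_size + part_size - 1) // part_size)
    return part_size, part_count


small = plan_multipart(100 * 1024 ** 2)   # 100 MiB object
large = plan_multipart(1024 ** 4)         # 1 TiB object
```

For the 100 MiB object the preferred 8 MiB part size is kept; for the 1 TiB object the part size is raised so the upload still fits in 10,000 parts.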
Does ElastiCache for Memcached support snapshots and replication?
No. Snapshots and replication are not supported for Memcached, just for Redis.
Which AWS database stores data as nodes connected with edges?
Neptune
In relational databases, row-based storage is ideal for OL__P and columnar storage is ideal for OL__P.
OLTP… OLAP
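The difference can be pictured with a toy dataset in plain Python. This is only an illustration of the two layouts, not how any database engine stores data; the table and column names are made up.

```python
# Row-based layout: each record is kept together (suits OLTP lookups and updates).
rows = [
    {"order_id": 1, "customer": "a", "amount": 20},
    {"order_id": 2, "customer": "b", "amount": 35},
    {"order_id": 3, "customer": "a", "amount": 45},
]

# Columnar layout: each column is kept together (suits OLAP scans and aggregates).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# OLTP-style access: fetch one whole record.
record = rows[1]

# OLAP-style access: aggregate a single column without touching the others.
total_amount = sum(columns["amount"])
```

Reading one record touches a single row in the first layout, while summing "amount" touches a single list in the second.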
Apache _______ is an analytics framework for processing large datasets. (hint: Databricks is built on top of this)
Spark
What are the 3 data storage options for Amazon EMR?
- HDFS
- EMRFS (uses S3)
- Local Storage (Instance Store / EBS)
An Amazon EMR cluster can have either ____ or ____ primary (aka master) nodes.
1 or 3
How many AZs are used for Amazon EMR clusters?
Only 1 AZ
What are the three node types in an Amazon EMR cluster?
- Primary/Master Node
- Core Node
- Task Node (optional)
What is the name of the API you can use to launch an Amazon EMR cluster?
RunJobFlow API
What is the name of the API you can use to terminate an Amazon EMR cluster?
TerminateJobFlows API
The default limit for Amazon EMR instances is _____. This can be increased upon request.
20 instances (across all your clusters)
When using Amazon EMR, can you SSH directly into a task node?
No, you must first SSH into the master node, and then SSH into the desired node.
Which Amazon EMR node type (primary/master, core, task) hosts data using Hadoop Distributed File System (HDFS) and also runs Hadoop tasks?
Core Node
What are the 5 deployment options for running Amazon EMR applications? (i.e. Amazon EMR on ______, Amazon EMR _____)
- Amazon EMR Serverless
- Amazon EMR on EC2
- Amazon EMR on AWS Outposts
- Amazon EMR on EKS
- Amazon EMR on Local Zones
For Amazon EMR billing, __________ rounds up the runtime duration to the nearest minute, whereas __________ tracks runtime duration to the nearest second.
BilledResourceUtilization… TotalResourceUtilization
Amazon EMR supports what two types of Hive clusters?
- interactive (customer can run Hive scripts directly on master node)
- batch (Hive script stored in S3 and referenced)
Amazon Redshift can automatically generate recommendations for managing your warehouse with the feature called _________ __________
Redshift Advisor
Does Redshift support native integration with Amazon SageMaker?
Yes
____________ is a feature of Amazon Redshift that lets you run queries against your data lake in Amazon S3, with no data loading or ETL required.
Redshift Spectrum
Using Amazon Redshift ______ nodes with managed storage allows you to pay separately for storage and compute.
RA3
For Amazon Redshift instances using Dense Compute (DC) and Dense Storage (DS2) clusters, where is the data stored?
On the compute nodes (as opposed to S3 for RA3 clusters and Redshift Serverless)
How is the data stored when using an Amazon Redshift RA3 instance?
Frequently processed data (hot data) is stored on high-performance SSDs, and cold data is stored in S3.
What service would a customer use to integrate (and/or aggregate) Amazon Redshift with their own on-premises data warehouse?
AWS Data Exchange
How are Redshift Multi-AZ and Redshift Relocation different, regarding RTO?
Redshift Relocation is free and has a 10-60 minute recovery time.
Redshift Multi-AZ is more expensive, but has an RTO measured in seconds.
_____________ allows SQL users to create, train, and deploy machine learning models using familiar SQL commands.
Redshift ML
The Amazon Redshift _______ _______ simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint
Data API
How long are Amazon Redshift automatic backups retained vs manual backups?
Automatic: 24 hours by default (configurable up to 35 days)
Manual: Indefinitely
How would you monitor the performance of your Amazon Redshift data warehouse cluster?
AWS Management Console, or
CloudWatch APIs
Is there a charge for using the Amazon Redshift Data API?
No
When you launch an Amazon Redshift cluster, what option determines the CPU, RAM, storage capacity, and storage drive type for each node?
The node type
For datasets under 1 TB (compressed), what is the recommended Redshift node type?
DC2 (Dense Compute node)
What are the two EC2 platforms used for launching an Amazon Redshift cluster?
- EC2-Classic
- EC2-VPC
In Amazon Redshift, you need to associate a __________ _________ with each cluster that you create in order to configure database settings such as query timeout and date style.
parameter group
The charges that you accrue for using Amazon Redshift are based on _______ nodes and billed at an _________ rate.
compute… hourly…
Which EC2 instance categories does Amazon EMR support (i.e. on-demand, etc.) ?
on-demand
spot
reserved
An Amazon Redshift cluster is a set of nodes, which consists of a ________ node and one or more ________ nodes.
leader… compute…
A QuickSight _________ is a user who can create and publish dashboards.
Author
A QuickSight _________ is a user who consumes interactive dashboards.
Reader
Amazon QuickSight __________ Edition offers enhanced functionality such as QuickSight Readers, Private VPC connectivity, and AD connectivity.
Enterprise
_________ _________ __________ __________ are of 30-minute duration each. Each session is charged at $0.30 with maximum charges of $5 per Reader in a month.
Amazon QuickSight Reader sessions
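The per-session price and the monthly cap combine into a simple calculation. A minimal sketch follows, using the $0.30 and $5 figures from the card above; amounts are kept in cents to avoid floating-point rounding, and the function name is illustrative.

```python
SESSION_PRICE_CENTS = 30     # $0.30 per 30-minute session
MONTHLY_CAP_CENTS = 500      # $5 maximum per Reader per month


def reader_monthly_charge_cents(session_count):
    """Return one Reader's monthly charge in cents, applying the cap."""
    return min(session_count * SESSION_PRICE_CENTS, MONTHLY_CAP_CENTS)


light_user = reader_monthly_charge_cents(10)   # below the cap
heavy_user = reader_monthly_charge_cents(40)   # capped
```

Under these figures, a Reader reaches the cap at the 17th session in a month.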
Will an Amazon QuickSight Reader be charged if QuickSight is open in a browser in a background tab?
No. A Reader is only charged when the user interacts with the page (for example, a page refresh, filtering, or clicking).
Can Amazon QuickSight “Authors” or “Readers” invite more users?
No. This can only be done with a QuickSight “Admin” account.
Does Amazon QuickSight connect to both Amazon EC2 and on-premises databases?
Yes
The “Augment with SageMaker” option for Amazon __________ allows your SageMaker ML models to run inferences on your data.
QuickSight
Does QuickSight leverage SageMaker models to perform inference on incremental data or the full data every time it runs?
Inference runs on the full data every time it refreshes.
Amazon QuickSight has an innovative technology called ________ that allows it to select the most appropriate visualizations based on the properties of the data.
AutoGraph
You can use AWS Glue _________ to visually clean up and normalize data without writing code.
DataBrew
How does AWS Glue relate to AWS Lake Formation?
AWS Lake Formation encompasses AWS Glue PLUS additional features.
With AWS Glue _______, data engineers can visually create, run, and monitor ETL workflows.
Studio
The metadata stored in the AWS Glue Data Catalog can be readily accessed from _________________, ______________, _____________, _________________, and third-party services.
AWS Glue ETL
Amazon Athena
Amazon EMR
Amazon Redshift Spectrum
The AWS Glue ________ ____________ is a new feature that allows you to centrally discover, control (i.e. enforce), and evolve data stream schemas.
Schema Registry
The AWS Glue ________ ____________ supports Apache Avro and JSON Schema data formats and Java client applications
Schema Registry
Does the AWS Glue Schema Registry provide encryption at rest and in transit?
Yes
After you define the flow of your data sources, transformations, and targets in the visual (no-code) interface, AWS Glue Studio will generate __________ __________ code on your behalf.
Apache Spark
Which programming languages does AWS Glue ETL support?
Python and Scala
When building an AWS Glue workflow, what are the two ways to trigger AWS Glue ETL jobs within your workflow?
AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event.
AWS Glue provides default retry behavior that will retry all failures _____ times before sending an error notification to CloudWatch.
three
AWS Glue supports ETL on streams from _______________, _____________, and _____________.
Amazon KDS
Apache Kafka
Amazon MSK
Do you have to use both the Data Catalog and AWS Glue ETL together for the service to work?
No, they can be used independently.
Both AWS Glue and Kinesis Data Analytics can be used to process streaming data.
____________ is recommended when your use cases are primarily ETL and you want to run jobs on a serverless Apache Spark-based platform.
____________ is recommended when your use cases are primarily analytics and you want to run jobs on a serverless Apache Flink-based platform.
AWS Glue
Kinesis Data Analytics
Apache Spark is primarily used for ______ processing, whereas Apache Flink is primarily used for ______ processing.
batch… stream…
Both AWS Glue and Kinesis Data Firehose can be used for streaming ETL.
___________ is recommended for complex ETL, including joining streams, and partitioning the output in Amazon S3 based on the data content.
___________ is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.
AWS Glue
Kinesis Data Firehose
The AWS Glue __________ ML Transform can solve record linkage and data deduplication problems.
FindMatches
AWS Glue _______ __________ is a feature of AWS Glue that automatically measures and monitors the quality of data in data lakes and pipelines.
Data Quality
For the following AWS Glue features:
Data __________ use DataBrew to transform data without writing any code.
Data __________ use the Data Catalog to manage metadata.
Data __________ use AWS Glue Studio to author scalable data integration pipelines.
analysts
engineers
engineers
Can Amazon Athena process unstructured, semi-structured, and structured datasets?
Yes, it can process all three
AWS strongly recommends using the ______ command to load large amounts of data into Redshift, as opposed to the _______ command.
COPY… INSERT…
To grant or revoke privilege to load data into a table using a Redshift COPY command, grant or revoke the __________ privilege.
INSERT
To load data from Amazon S3, the Redshift COPY command must have _______ access to the bucket and _______ access for the bucket objects.
LIST… GET…
For Redshift to obtain authorization to access a resource, your cluster must be authenticated using either __________ access control or __________ access control.
(________ access control is recommended by AWS)
role-based… key-based…
(role-based)
With ___________ access control, your Redshift cluster temporarily assumes an AWS Identity and Access Management (IAM) role on your behalf.
role-based
When loading data into Redshift, you can use a ___________ file to ensure that your COPY command loads only your specified files from Amazon S3.
manifest
When you load data into Redshift from S3 using a COPY command, what do you need to do differently when S3 server-side encryption is enabled?
Nothing. The process is the same whether S3 is encrypted or not.
When using the COPY command to load a table into Amazon Redshift, does the table to be loaded need to already exist in the Redshift database?
Yes
By default, when loading data from DynamoDB into Redshift, do these two services need to be in the same AWS Region?
Yes, but you can also specify a different region using the REGION parameter
When loading data from DynamoDB into Redshift, what happens when DynamoDB attributes do not match a column in the Amazon Redshift table?
These attributes are discarded. Additionally, they consume part of DynamoDB’s provisioned throughput since the attributes still have to be read.
After a Redshift load operation is complete, you can query the ______________ system table to verify that the expected files were loaded.
STL_LOAD_COMMITS
To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load the data into Redshift, you can use the __________ option with the COPY command.
NOLOAD
To apply automatic compression when loading data to Redshift, run the COPY command with the __________ option set to ON.
COMPUPDATE
When loading data files from Amazon S3 into Redshift, does the order of the columns matter?
Yes, the columns must be in the same order as the Redshift table
The category of SQL commands that manipulate data in a database (INSERT, UPDATE, DELETE) are referred to as _______ _____________ ____________ commands.
Data Manipulation Language (DML)
Does Amazon Redshift support a single merge (or upsert) command to update a table from a single data source?
Yes, Redshift now provides a MERGE command. Before it was available, you could essentially do the same thing with a combination of updates and inserts.
The category of SQL commands that can be used to define the database schema, such as CREATE, DROP, ALTER, are referred to as _______ _____________ ____________.
Data Definition Language (DDL)
In a Redshift cluster, each node is further broken down into ___________, which have their own compute and storage associated with each.
slices
AWS recommends creating your Redshift tables with __________ ______, which uses automatic table optimization to choose the sort key.
SORTKEY AUTO
When you create a Redshift table, you can optionally specify one column as the ____________ ______. When the table is loaded with data, the rows are distributed to the node slices according to this key.
distribution key
What are the two types of Redshift table sort keys, and which is preferred?
COMPOUND (preferred)
INTERLEAVED
With compression in Redshift, can the sort key column be compressed?
No, it must always be in its raw form so it is always available for Redshift to use.
Which type of Redshift sort key performs better when using lots of WHERE clauses?
INTERLEAVED
Which type of Redshift sort key performs better when using lots of ORDER BY clauses?
COMPOUND
AWS recommends which distribution style for your Redshift tables?
DISTSTYLE AUTO
When you create a Redshift table, you can designate one of four distribution styles. What are they?
AUTO
EVEN
KEY
ALL
When creating a Redshift table with a NOT NULL constraint on a column, does Redshift enforce this?
No, Redshift can still accept data into that column
Redshift Spectrum supports ________ and ________ operations.
It does NOT support ________ and ________ operations.
SELECT… INSERT…
UPDATE… DELETE…
When resizing a Redshift cluster, the source cluster goes into ____________ mode while the resized cluster is being created.
read-only
The two types of resize operations you can choose for resizing a Redshift cluster are __________ and __________.
classic resize… elastic resize.
The ______ resize operation for a Redshift cluster takes minutes, while a ______ resize operation can take hours to days.
elastic… classic…
When performing an elastic resize of a Redshift cluster, what are the two main constraints?
- Can’t be used from or to a single-node cluster
- Only available for clusters that use the EC2-VPC platform
For classic resize and elastic resize operations for Redshift clusters, can you cancel the resize operation after it has been started?
For classic resize, yes.
For elastic resize, no.
Are the Redshift pause/resume options supported for EC2-Classic clusters?
No, you can only pause/resume EC2-VPC clusters
Which type of Redshift cluster resize uses a snapshot for the operation?
elastic resize
What Redshift operation can sort rows and will only sort tables that are less than 95% sorted?
VACUUM SORT ONLY
What Redshift operation can reclaim disk space and will only run on tables that have more than 5% of the rows marked for deletion?
VACUUM DELETE ONLY
What Redshift VACUUM option will ensure that the operation is not interrupted by (i.e. resources are not diverted to) incoming queries?
BOOST
A faster alternative to performing a full vacuum operation on a Redshift cluster table could be to do a _______ _______. This can be beneficial when you have an extremely unsorted table.
Deep Copy
What AWS service can transfer data to and from AWS at a huge scale (i.e. up to 10 Gbps per agent, which is approximately 100 TB/day)?
AWS DataSync
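The daily figure can be checked with a quick unit conversion. The sketch below assumes the per-agent rate is 10 gigabits per second (not gigabytes) sustained for a full day, and uses decimal units (1 TB = 10^12 bytes); the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def terabytes_per_day(gigabits_per_second):
    """Convert a sustained link rate in Gbps to decimal terabytes per day."""
    bytes_per_second = gigabits_per_second * 1e9 / 8   # 8 bits per byte
    return bytes_per_second * SECONDS_PER_DAY / 1e12


daily_tb = terabytes_per_day(10)
```

At 10 Gbps this works out to about 108 TB per day, which matches the "approximately 100 TB/day" figure.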
What is an Amazon EMR cluster composed of?
A collection of EC2 instances (referred to as “nodes”)
Each EC2 instance in an Amazon EMR cluster is called a _______.
node
Every Amazon EMR cluster has a ___________ node, and it’s possible to create a single-node cluster with only this node.
primary
The following is an example process using four steps for which AWS service?
1. Submit an input dataset for processing.
2. Process the output of the first step by using a Pig program.
3. Process a second input dataset by using a Hive program.
4. Write an output dataset.
Amazon EMR
When you set up an Amazon EMR cluster in a private subnet, AWS recommends that you also set up _____________________. Otherwise, you will incur additional charges for NAT gateway as the traffic flow will not be contained within your VPC.
VPC endpoints for Amazon S3
Amazon EMR integrates with ___________ to log information about requests made by or on behalf of your AWS account. With this information, you can track who is accessing your cluster when, and the IP address from which they made the request.
CloudTrail
___________ ______ _________ is a web-based integrated development environment (IDE) for fully managed Jupyter notebooks that run on Amazon EMR clusters.
Amazon EMR Studio
What feature of Amazon EMR allows you to browse your data catalog, run SQL queries, and download results before you work with the data in a Studio notebook?
Amazon EMR Studio SQL Explorer
An Amazon EMR Studio is composed of one or more ___________.
Workspaces
___________ ______ _________ does not support EMR clusters with multiple primary nodes.
Amazon EMR Studio
The maximum number of Amazon EMR Studios you can have is _____ per AWS account.
10
To use SSH to log on to the master/primary node of an Amazon EMR cluster, you will need to associate an __________ ______ ______ ______ with the cluster.
Amazon EC2 key pair
What are two limitations of launching an Amazon EMR cluster with multiple primary nodes?
- Cannot use instance fleets configuration for the nodes
- If two of the three primary nodes fail simultaneously, then the cluster will fail
When launching an Amazon EMR cluster with multiple primary nodes, how many core nodes does AWS recommend launching?
At least 4
Amazon EMR on ____________ is ideal for low latency workloads that need to be run in close proximity to on-premises data and applications.
AWS Outposts
Are spot instances or reserved instances supported for Amazon EMR on AWS Outposts?
No, only on-demand instances are supported
By default, when you create an Amazon EMR cluster, what AMI is used?
Amazon Linux AMI
When launching an Amazon EMR cluster and choosing between instance fleets or uniform instance groups, which category of nodes does this decision apply to (primary, core, task) ?
All of them
When launching an Amazon EMR cluster with the uniform instance groups configuration, your cluster can include up to _____ instance groups:
_____ primary instance group(s)
_____ core instance group(s), and
up to _____ optional task instance groups.
50
1
1
48
Which EMR node type does not store data?
Task nodes
The DataNode daemons run on which Amazon EMR node type?
Core nodes
Which Amazon EMR storage option is ephemeral, distributed, and best suited for caching results between intermediate job flow steps?
HDFS
Which Amazon EMR storage option would you use to separate your compute and storage and persist data outside of the lifecycle of your cluster?
EMRFS (because it stores data to S3)
Is Kinesis Data Streams a fully managed and serverless AWS service?
Yes
Is Amazon EMR a fully managed and serverless AWS service?
No.
However, there is a new option you can use called Amazon EMR Serverless.
The __________ is a Java library that acts as an intermediary between your record processing logic and Kinesis Data Streams.
Kinesis Client Library (KCL)
Can multiple Kinesis Data Streams applications consume data from the same stream?
Yes
A __________ ______ ________ is a set of shards.
Kinesis data stream
A Kinesis data stream _________ contains a sequence of data records.
Each data record has a __________ __________, which is the unique identifier of each data record within a shard, but this number may overlap for a data record in a different shard.
shard
sequence number
Data records within a Kinesis data stream shard are composed of what three attributes?
- sequence number
- partition key
- data blob
A data blob is one of three attributes within a __________ within a ________ within a Kinesis data stream. The data blob can be up to ____ MB in size.
data record… shard… 1 MB
By default, the retention period of the data records within a Kinesis data stream is __________, and the max retention period is _________.
24 hours… 365 days
Each data record within a Kinesis data stream shard gets assigned a unique ___________ ____________.
sequence number
In Kinesis Data Streams, a ___________ ______ is used to logically separate sets of data. This is generally not a 1:1 ratio to the shards. Often, one shard will have 100+ of these.
partition key
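Kinesis Data Streams maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below illustrates that mapping under a simplifying assumption: the shards split the hash space into equal, contiguous ranges. The function name is hypothetical, and a real stream's ranges would come from its shard metadata.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128   # MD5 produces a 128-bit value


def shard_index_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so leftover values at the top of the space go to the last shard.
    return min(hash_value // range_size, shard_count - 1)


# Many partition keys share a small number of shards.
keys = [f"device-{n}" for n in range(100)]
keys_per_shard = Counter(shard_index_for_key(key, 4) for key in keys)
```

Because the mapping depends only on the key, every record with the same partition key goes to the same shard, which is why one shard typically serves many keys.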
Kinesis Data Streams uses ______________ for encryption.
AWS KMS master keys
In Kinesis Data Streams, to read from or write to an encrypted stream, producer and consumer applications must have permission to access the __________________.
KMS master key
In Kinesis Data Streams, does using server-side encryption incur AWS KMS costs?
Yes
In Kinesis Data Streams, by default, you can create up to _____ data streams with the on-demand capacity mode. This can be increased with a support ticket.
50
In Kinesis Data Streams, what is the limit for the number of streams per account, using KDS provisioned mode?
No limit
The Kinesis Data Streams GetRecords command can retrieve up to _____ MB of data per call from a single shard, and up to _________ records per call.
10… 10,000…
In Kinesis Data Streams, one read transaction is also referred to as one ________________ call. They are the same thing.
GetRecords
Each Kinesis Data Stream shard can support up to a maximum total data read rate of ____ MB per _________ via GetRecords
2 MB… second
In Kinesis Data Streams, can you switch the capacity mode of your stream? How often?
Yes.
You can switch 2x within 24 hours.
A Kinesis data stream in the on-demand mode accommodates up to ________ the peak write throughput observed in the previous 30 days.
double
The Kinesis Data Streams ___________ capacity mode is suited for predictable traffic with capacity requirements that are easy to forecast.
provisioned
In Kinesis Data Streams, can you enable server-side encryption after the stream has been created?
Yes
AWS recommends (for better Kinesis Data Stream scalability) that you migrate all of your producers and consumers that call the _____________ API to instead call the ______________ and _____________ APIs.
DescribeStream… DescribeStreamSummary… ListShards
What API would you use to reshard a KDS stream?
UpdateShardCount
In Kinesis Data Streams, what are the two types of resharding operations?
shard split
shard merge
When changing the data retention period for your KDS stream, how quickly does the change take effect?
Within minutes
You can assign your own metadata to streams you create in Amazon Kinesis Data Streams by using _______.
tags
In Kinesis Data Streams, the ___________________ provides a layer of abstraction specifically for ingesting data.
Kinesis Producer Library (KPL)
In Kinesis Data Streams, what is the preferred method for developing producers to add (put) data into a data stream?
The preferred method is to use the Kinesis Producer Library (KPL)
_______ _____________ is a complete solution that lets frontend web and mobile developers easily build, ship, and host full-stack web/mobile applications on AWS.
AWS Amplify
___________ ___________ offers marketers and developers one customizable tool to deliver customer communications across channels, segments, and campaigns at scale.
Amazon Pinpoint
In Kinesis Data Streams, when you add KPL user records using the KPL addUserRecord() operation, a record is given a time stamp and added to a buffer with a deadline set by the ________________ configuration parameter.
RecordMaxBufferedTime
This determines how long it takes for the record to be put into the data stream.
By default, KDS shards in a stream provide ____ MB/sec of read throughput per shard.
2 MB/sec
In Kinesis Data Streams, does the default 2 MB/sec of read throughput per shard get shared across consumers?
Yes, this limit is fixed.
In other words, you cannot have 5 consumers all reading 1 MB/sec each from a shard, because this sums up to 5 MB/sec, which exceeds the limit.
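The shared limit can be expressed as a simple check. A minimal sketch, using the 2 MB/sec per-shard figure from the card above; the function names are illustrative.

```python
SHARD_READ_LIMIT_MB_PER_SEC = 2.0   # shared by all consumers reading the shard


def fits_shared_read_limit(consumer_rates_mb_per_sec):
    """Return True if the consumers' combined read rate stays within the shard limit."""
    return sum(consumer_rates_mb_per_sec) <= SHARD_READ_LIMIT_MB_PER_SEC


def even_share_mb_per_sec(consumer_count):
    """Return the read rate each consumer gets if the limit is split evenly."""
    return SHARD_READ_LIMIT_MB_PER_SEC / consumer_count


five_at_one_mb = fits_shared_read_limit([1.0] * 5)   # 5 MB/sec requested in total
share_for_five = even_share_mb_per_sec(5)
```

Five consumers on one shard would each get roughly 0.4 MB/sec if the limit were divided evenly.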
You can use an Amazon Kinesis Data Analytics application to process and analyze data in a KDS stream using _________, _________, or _________ (languages).
SQL, Java, or Scala
In Kinesis Data Streams, the _________________ consumer applications are typically distributed, with one or more application instances running simultaneously, for failover and load-balancing.
Kinesis Client Library (KCL)
In Kinesis Data Streams, what is the term used in a KCL consumer application to describe how a consumer instance binds to (takes ownership of) processing a particular shard?
lease
In Kinesis Data Streams, each KCL consumer application stores its lease information in a DynamoDB _______ _________.
What are the implications of this for the KCL consumer application names?
lease table
Each KCL consumer application must have a unique name, because this name is used for the DynamoDB lease table.
Amazon Athena also allows you to run _________ __________ applications on Athena to query your data.
Apache Spark
AWS Glue Studio can use datasets that are defined in the _______ ________ ________ ___________.
AWS Glue Data Catalog
In Amazon Athena, when partitioning your S3 data, a common practice is to partition the data based on _________ or __________.
date or time
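Date-based partitions are commonly laid out in S3 as Hive-style key=value folders. The sketch below builds such a prefix; the bucket name, the year/month/day key names, and the function name are assumptions chosen for illustration.

```python
from datetime import date


def partition_prefix(table_prefix, day):
    """Build a Hive-style S3 prefix (year=/month=/day=) for one date."""
    return (
        f"{table_prefix}/year={day.year:04d}"
        f"/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical bucket and table location.
prefix = partition_prefix("s3://example-bucket/logs", date(2024, 1, 15))
```

A query that filters on these partition columns would only need to scan the objects under the matching prefixes.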
How can you speed up the time it takes to load large CSV files into Redshift?
Split the CSV files into smaller chunks. The Redshift COPY command uses Redshift's massively parallel processing architecture, so breaking apart the CSV files allows the load to be distributed across the cluster.
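Splitting can be as simple as dealing rows out into a fixed number of chunks. The sketch below does this in memory for illustration; the function name is hypothetical, and choosing a chunk count that is a multiple of the cluster's slice count is general guidance rather than something stated in these cards.

```python
def split_rows(rows, chunk_count):
    """Distribute rows round-robin into chunk_count roughly equal chunks."""
    chunks = [[] for _ in range(chunk_count)]
    for index, row in enumerate(rows):
        chunks[index % chunk_count].append(row)
    return chunks


# Ten data rows (header excluded) split into four chunks.
data_rows = [f"row-{n}" for n in range(10)]
chunks = split_rows(data_rows, 4)
chunk_sizes = [len(chunk) for chunk in chunks]
```

Each chunk would then be written out as its own file in S3 so that COPY can load the files in parallel.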
What does the Redshift UNLOAD command do?
Writes (unloads) the result of your Redshift query to files in Amazon S3
To ensure that only certain people can view certain dashboards in Amazon QuickSight, you can use a __________ _________ file to specify the permissions to the dataset.
dataset rules
Can you deliver streaming data directly from Kinesis Data Firehose to DynamoDB?
No, you would have to use a Lambda function as an intermediary step
What 3 AWS services can Kinesis Data Firehose deliver data to directly? Also, can KDF deliver to other third-party destinations?
- S3
- Redshift
- OpenSearch Service
Yes, KDF can also deliver data to third parties like Splunk, Logz.io, New Relic, MongoDB, etc.
_________________ is an AWS-managed deployment of the open source, distributed search and analytics suite derived from Elasticsearch.
Amazon OpenSearch Service
The AWS ________ command is an extension of the Apache Distcp tool that you can use to copy large amounts of data.
S3DistCp
A Kinesis Data Analytics application is composed of a ___________ source and an optional _______________ ________ source.
streaming… reference data…
In Kinesis Data Analytics, you have the option to use a reference data source. If used, where must this reference data be stored?
In Amazon S3, as a CSV or JSON file.
Kinesis Data Analytics can use __________ or ____________ as the input data stream.
Kinesis Data Stream… Kinesis Data Firehose…
__________ ________ ___________ automatically provides an in-application error stream for each application. If your application has issues while processing certain records (for example, because of a type mismatch or late arrival), that record is written to the error stream.
Kinesis Data Analytics
_________ __________ is a distributed streaming platform that was originally developed by LinkedIn and was made open source in 2011.
Apache Kafka
In Amazon MSK (Managed Streaming for Apache Kafka), the ________ nodes receive messages from producers (publishers) and store them for the consumers (subscribers) to view.
The ________ nodes coordinate and track the broker nodes and also manage the Kafka topics (categories).
broker… zookeeper…
After you have created an Amazon MSK cluster in your VPC, can you change which VPC your cluster is in?
No.
For Amazon MSK clusters, the _____ broker type has higher throughput than the ____ broker type, and it is recommended for production workloads.
M5… T3…
When using Amazon MSK, what is the maximum retention period for the data?
Unlimited
In a Kinesis Data Stream using provisioned mode, each shard can support up to ___ MB/sec of write throughput and ___ MB/sec of read throughput.
1MB/sec… 2MB/sec…
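As a rough sizing sketch based on the per-shard limits in the card above (1 MB/sec write, 2 MB/sec read), the shard count for a provisioned stream is driven by whichever limit is tighter. The function name `shards_needed` and the sample workload are illustrative assumptions, and other per-shard limits (such as record counts) are not modeled here.

```python
import math

WRITE_MB_PER_SHARD = 1  # MB/sec of write throughput per shard (from the card above)
READ_MB_PER_SHARD = 2   # MB/sec of read throughput per shard (from the card above)


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Return the smallest shard count that satisfies both throughput limits."""
    for_writes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)


# Hypothetical workload: 5 MB/sec ingested and 10 MB/sec read in total.
print(shards_needed(5, 10))
```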
For Kinesis Data Streams in provisioned mode, the default shard quota is _____ shards per AWS account for us-east-1, us-west-2, and eu-west-1. For all other regions, the default shard quota is _____ shards per AWS account.
500… 200…
The Kinesis __________ is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams.
Agent
What are 3 ways you can add data to a Kinesis Data Stream?
- PutRecord API
- Kinesis Producer Library (KPL)
- Kinesis Agent
What 3 compression formats can you use with Kinesis Data Firehose?
- GZIP
- ZIP
- SNAPPY
If you need to deliver data using Kinesis Data Firehose to Redshift as the final destination, which compression format do you need to use?
GZIP
For Kinesis Data Firehose delivery streams, a data record can be up to _____ in size.
1,000 KB
Kinesis Data Firehose uses either a _________ _______ or a _________ _______ which determines how long it takes before KDF delivers the data to the destination.
buffer size (in MB)… buffer time (in seconds)…
When delivering data to S3 using Kinesis Data Firehose, you can choose a buffer size of __-___ MiBs and a buffer interval of __-___ seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
1–128… 60–900…
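The "whichever condition is satisfied first" rule can be pictured with a toy buffer. This is only a sketch of the idea, not how Firehose is implemented: the class name `BufferSketch` is hypothetical, the limits are tiny for readability, and the time check runs only when a record arrives (a real service would also flush on a timer).

```python
class BufferSketch:
    """Toy model of size-or-interval buffering; not the Firehose implementation."""

    def __init__(self, size_limit_bytes: int, interval_seconds: float):
        self.size_limit_bytes = size_limit_bytes
        self.interval_seconds = interval_seconds
        self.records = []
        self.buffered_bytes = 0
        self.window_start = None  # arrival time of the first record in the batch

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if either limit is reached."""
        if self.window_start is None:
            self.window_start = now
        self.records.append(record)
        self.buffered_bytes += len(record)
        size_hit = self.buffered_bytes >= self.size_limit_bytes
        time_hit = now - self.window_start >= self.interval_seconds
        if size_hit or time_hit:
            return self.flush()
        return None

    def flush(self):
        """Hand back the buffered records and reset the buffer."""
        batch, self.records = self.records, []
        self.buffered_bytes = 0
        self.window_start = None
        return batch


buf = BufferSketch(size_limit_bytes=10, interval_seconds=60)
print(buf.add(b"aaaa", now=0))     # nothing delivered yet
print(buf.add(b"bbbbbbb", now=5))  # size limit reached first, batch delivered
print(buf.add(b"c", now=10))       # nothing delivered yet
print(buf.add(b"d", now=75))       # interval reached first, batch delivered
```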
Which AWS service would you use for REAL-TIME analytics on IoT sensor data?
Kinesis Data Analytics — Remember: If the question doesn’t specifically state that you must analyze or transform the data in real-time, then KDA is probably not the correct answer.
For the consumer applications in Kinesis Data Streams, you can choose between ______________ and ______________ consumer types to read data from a stream.
shared fan-out… enhanced fan-out…
_____ ___________ is an ideal compute service for serverless application scenarios that need to scale up rapidly, and scale down to zero when not in demand.
AWS Lambda
When using AWS Lambda, can you log in to compute instances or customize the operating system?
No, because Lambda is serverless. AWS manages these things.
The AWS service called _____ _____ _________ builds and deploys containerized web applications automatically.
AWS App Runner
AWS Lambda functions will need to assume an ___________ role when the function is invoked, in order to access other specified AWS services.
execution
In AWS Lambda, you deploy your function code to Lambda using a deployment package. There are two types:
A ______ file that contains your function code and its dependencies, OR
a ___________ _________.
.zip file
container image
An _______ _________ __________ is a Lambda resource that reads from an event source and invokes a Lambda function. You can use these to process items from a stream or queue in services that don’t invoke Lambda functions directly.
Two examples of AWS services that could be used for this are __________.
event source mapping
DynamoDB
Amazon SQS
An AWS Lambda ___________ provides a language-specific environment that runs in an execution environment.
Examples: python3.10, nodejs18.x
runtime
An AWS Lambda _________ provides a convenient way to package libraries and other dependencies that you can use with your Lambda functions. It is a .zip file that contains additional code. You can use ____ of these per function.
layer… 5…
AWS Lambda layers only apply to the _____ _________ Lambda deployment package type. Functions deployed as a __________ ________ do not use layers because everything is already bundled together.
.zip file… container image…
In AWS Lambda, __________ is the term used to define the number of requests that your function is serving at any given time.
concurrency
In AWS Lambda, are incoming requests processed in order?
No, they are often processed out of order.
An AWS Lambda ________ refers to the incoming data (in JSON format) that a Lambda function will process.
event
What is the function timeout limit for AWS Lambda?
15 minutes
For AWS Lambda, what are the request and response payload size limits for synchronous and asynchronous invocations?
synchronous: 6 MB
asynchronous: 256 KB
For AWS Lambda functions, the default concurrency execution limit is _______ and the default storage limit for your functions is ____ GB.
1,000… 75 GB…
What AWS service can schedule automated data movement and data processing throughout AWS? (Note: This AWS service is actively being phased out for better alternatives)
It allows you to define a chain of activities using data sources, destinations, and processing activities referred to as a _____________.
AWS Data Pipeline
pipeline
Which serverless AWS service provides a visual drag-and-drop editor for building workflows and orchestrating data processing pipelines, and integrates directly with over 250 AWS services?
AWS Step Functions
The two types of Amazon Redshift snapshots are _________ and _________.
________ snapshots are enabled by default. They take a snapshot every ____ hours or following every ___ GB per node of data changes.
automated… manual.
automated… 8… 5 GB…
Amazon Redshift Serverless measures data warehouse capacity in _________ __________ ________.
Redshift Processing Units (RPUs)
The default base capacity for Amazon Redshift Serverless is ____ RPUs and if you only want to run simple workloads, the minimum capacity is ____ RPUs.
Note: 1 RPU provides ___ GB of memory.
128 RPUs… 8 RPUs…
16 GB
AWS Lake Formation uses __________ ___________ functionality to provide temporary credentials for granting short-term access to S3 data.
credential vending
How would you allow your S3 data to be used in AWS Lake Formation?
You would have to “register” the S3 data with AWS Lake Formation.
Does AWS Lake Formation only use IAM policies for its permissions?
No, it has its own internal permissions system that augments IAM policies.
An AWS Lake Formation __________ is a container for a set of related AWS Glue jobs, crawlers, and triggers. You create this in AWS Lake Formation, and it executes in the AWS Glue service.
workflow
AWS Lake Formation workflows are created based on _________, which are predefined templates for ingesting data from a particular source (RDS, CloudTrail, etc.) into a data lake.
blueprints
AWS Lake Formation uses the _____ _______ _______ ___________ to store metadata about its data sources.
AWS Glue Data Catalog
What 5 AWS services can you layer on top of Lake Formation that all honor Lake Formation’s granular permissions?
- AWS Glue
- Amazon Athena
- Amazon Redshift Spectrum
- Amazon QuickSight (Enterprise Edition)
- Amazon EMR
How would you connect Amazon QuickSight to Amazon Redshift?
Create a security group for the Redshift cluster. Allow inbound access from the IP address range of the QuickSight servers.
Does Amazon QuickSight support Amazon Athena as a data source?
Yes
How many AWS Glue Data Catalogs does each AWS account have per Region?
1 Data Catalog per Region
An AWS Glue job is composed of a _________, ______ _________, and ______ _________.
script… data source… data target
With AWS Glue, you are charged an hourly rate based on the number of ____________________ used to run your ETL job.
Data Processing Units (DPUs)
In AWS Glue, a Data Processing Unit (DPU) is also referred to as a ___________.
worker
AWS Glue ___________ are used to organize metadata tables in AWS Glue. When you define a table in the AWS Glue Data Catalog, you add it to a ___________.
databases… database
An AWS Glue ____________ is a Data Catalog object that stores login credentials, URI strings, virtual private cloud (VPC) information, and more for a particular data store.
connection
In AWS Glue, you set up your crawler with an ordered set of ___________ that will read the data in a data store and return a certainty number between 0.0 and 1.0.
classifiers
In AWS Glue, you can collect metrics about your jobs to view in CloudWatch by enabling the _____ ___________ option within AWS Glue.
job metrics
An AWS Glue ________ policy can be used to control access to AWS Glue Data Catalog resources.
resource
What is one reason you would create an AWS Glue Data Catalog table manually rather than using the crawler?
If you want custom naming conventions for your tables.
Using the crawler will automatically name the tables.
In AWS Glue, you can use ______ ____________ to keep track of previously processed data. When a job runs, only new incremental data is processed since the last checkpoint.
job bookmarks
MQTT is a messaging protocol used for IoT devices. AWS supports this protocol with its ______ ______ ______ service.
AWS IoT Core
______ ______ __________ is an AWS service that filters, transforms, and enriches IoT data before storing it in a time-series data store for analysis.
AWS IoT Analytics
In Amazon Athena, when you partition your S3 data sometimes it doesn’t load into Athena. What command can you use to solve this? (this command only works with Hive-style partitions)
MSCK REPAIR TABLE
Can Amazon QuickSight connect to a Redshift cluster that is in a different region?
Yes
In Amazon Redshift, the term “predicate” simply refers to a ___________.
condition
For Amazon Redshift tables that are not frequently updated, what distribution style is most appropriate?
DISTSTYLE ALL
In Amazon Redshift, a best practice for choosing the right distribution key is to use a column with a ______ cardinality.
high
_____ ___________ _____ ___________ _______ is a feature that lets you send a stream of log events from CloudWatch Logs to other AWS services for custom processing.
AWS CloudWatch Logs Subscription Filters
With ML Insights, Amazon QuickSight provides what three major features?
- anomaly detection
- forecasting
- autonarratives
When using AWS Lambda to process data from a Kinesis data stream or a DynamoDB stream, what setting can you configure in AWS Lambda to process each shard with more than one simultaneous Lambda invocation?
ParallelizationFactor (can set this between 1 and 10)
When launching an EMR cluster using the RunJobFlow API, you can optionally set the __________ parameter to TRUE so that the cluster will transition to the WAITING state rather than shutting down after the steps have completed.
KeepJobFlowAliveWhenNoSteps
Amazon S3 Select works on objects stored in ____, _____, or Apache _______ format.
It also works with objects that are compressed with ______ or _______, and server-side encrypted objects.
CSV… JSON… Parquet
GZIP… BZIP2…
An Amazon OpenSearch Service _______ is the terminology used to refer to an OpenSearch cluster.
domain
An Amazon OpenSearch Serverless __________ refers to an auto-scaling OpenSearch cluster.
collection
Amazon OpenSearch Serverless collections are always ___________.
encrypted
What are the two primary use cases for Amazon OpenSearch Serverless?
- log analytics
- full-text search
OpenSearch Serverless has a decoupled architecture that separates the _________ (ingest) components from the ________ (query) components.
indexing… search…
What are the two primary collection types in OpenSearch Serverless?
- time-series
- search
OpenSearch Serverless compute capacity is measured in ___________ ____________ ______.
When you create your first collection, OpenSearch Serverless instantiates a total of _____ (____ each for indexing and search).
OpenSearch Capacity Units (OCUs)
4… 2…
In Amazon OpenSearch Serverless:
_________ collections use a combination of hot and warm caches.
_________ collections store all data in the hot cache.
time-series
search
In terms of encryption, what is the difference between OpenSearch Service (i.e. manually provisioned clusters) and OpenSearch Serverless?
Encryption is optional for OpenSearch Service.
Encryption is required for OpenSearch Serverless.
In OpenSearch Service, if the JVM memory pressure metric is too high, you can solve this by reducing the traffic to your cluster by ___________ (increasing/decreasing) the number of shards.
decreasing
When determining the DISTKEY in Amazon Redshift, does the DISTKEY column have to be the same between the dimension table and fact table?
Yes.
Use the dimension table’s primary key and the fact table’s corresponding foreign key.
In Amazon Redshift, can the distribution style be different between a dimension table and a fact table?
Yes.
You can have different distribution styles for each of your tables.
What feature of S3 is ideal when you need to transfer gigabytes to terabytes of data on a regular basis across continents?
S3 Transfer Acceleration
______ ___________ __________ is a secure transfer service that enables you to transfer files into and out of AWS storage services.
The two supported AWS storage services are _______ and _______.
AWS Transfer Family
S3… EFS…
Which AWS service allows you to quickly move large amounts of data between on-premises and AWS infrastructure and provides end-to-end security, including encryption and integrity validation?
AWS DataSync
In Amazon Redshift, you can use ____________ _____________ to create query queues to efficiently configure query traffic so short, fast-running queries won’t get stuck in queues behind long-running queries.
workload management (WLM)
Are AWS Glue streaming ETL jobs executed in real-time?
No, the data is processed in 100-second windows.
Can Amazon QuickSight use Amazon EMR as a data source?
No.
If you have 12 months of data stored in Amazon Redshift but your business needs only require querying the last 2 months worth of data, what can you do to reduce costs?
You can unload the oldest 10 months of data to S3, which is cheaper storage. Keeping rarely queried data in Redshift adds storage cost and can slow down queries.
When querying S3 with Amazon Athena, what are two things you can do to reduce the cost and also increase query performance?
- store your S3 data in a columnar format such as ORC or Parquet
- compress your S3 data
To improve query performance and reduce costs when using Amazon Athena, you can partition your data which restricts the amount of data scanned by each query.
A customer who has data coming in on a ______ basis would partition by year, month, date, and hour.
A customer who has data coming in on a ______ basis would partition by a data source identifier and date.
hourly
daily
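The two partitioning schemes above can be sketched as Hive-style key prefixes (`name=value` path segments). The bucket name, the `source` and `dt` partition names, and both helper functions are illustrative assumptions, not a required layout.

```python
from datetime import datetime, timezone


def hourly_partition_prefix(base: str, ts: datetime) -> str:
    """Build a Hive-style prefix partitioned by year, month, day, and hour."""
    return (
        f"{base}/year={ts.year:04d}/month={ts.month:02d}"
        f"/day={ts.day:02d}/hour={ts.hour:02d}/"
    )


def daily_partition_prefix(base: str, source_id: str, ts: datetime) -> str:
    """Build a Hive-style prefix partitioned by data source identifier and date."""
    return f"{base}/source={source_id}/dt={ts:%Y-%m-%d}/"


event_time = datetime(2024, 3, 7, 14, 30, tzinfo=timezone.utc)
print(hourly_partition_prefix("s3://example-bucket/logs", event_time))
print(daily_partition_prefix("s3://example-bucket/logs", "sensor-a", event_time))
```

A query that filters on these partition columns only needs to read the objects under the matching prefixes, which is what restricts the amount of data scanned.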
In Amazon Athena, you can use ____________ to isolate workloads, users, and teams into groups. This can help to control costs by tracking each user’s queries that they run and sending the query metrics to CloudWatch.
workgroups
When creating a data lake with AWS Lake Formation, would you register the S3 bucket name or the S3 bucket path with Lake Formation?
S3 bucket path
In Amazon EMR, if you attach an EBS volume to your cluster for increased storage and you want to encrypt it, what type of encryption is used?
LUKS
Note: The recommended way to encrypt EBS volumes on EMR is with “EBS encryption” … This applies to both the root volume and any attached volumes. In contrast, LUKS encryption only applies to attached volumes.
To query data in S3, you can use S3 Select, Athena, or Redshift Spectrum. S3 select is different because it only allows you to query a _________ of the data.
subset
With Amazon EMR, can you run analyses on data stored in DynamoDB ?
Yes, they are natively integrated.
In Amazon CloudWatch Events, a ________ is the terminology used that matches incoming events and routes them to targets for processing.
rule
In Amazon CloudWatch Events, a ________ is the terminology used for the thing that processes events.
target
__________ __________ _________ delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.
Amazon CloudWatch Events
In Amazon EMR, the _______ metric tracks whether a cluster is live, but not currently running tasks.
IsIdle
When data is moved or transitioned to the Amazon S3 _________ storage class, it is no longer readable or queryable by Athena.
GLACIER
Can Amazon Athena query S3 data in a different region?
Yes
The S3DistCp command is primarily used in copying data from Amazon S3 to ___________ and not to Redshift.
Amazon EMR
The DynamoDB object size limit is _______ per item.
400 KB
Would you use IAM policies for column-based permissions in Amazon Redshift? If not, how would you handle this?
No, these permissions are based on the table and are set by using the GRANT command
The _______________ configuration for an EMR cluster (which applies to all node categories: primary/core/task) offers the widest variety of provisioning options for EC2 instances.
instance fleets
When launching an Amazon EMR cluster, what configuration type will typically result in better price performance?
instance fleets
When using KMS server-side encryption in Kinesis Data Streams, how frequently does Kinesis make an API call to KMS to rotate the key?
approximately every 5 minutes
An ______ is a friendly name for an AWS KMS key (e.g. “test-key-1”)
alias
How quickly can you access your data when using S3 Glacier with expedited retrieval?
1-5 minutes
Before using the COPY command to move S3 data to Redshift, what can you do with your files to make sure you are taking maximum advantage of the MPP architecture of Redshift?
Make sure your quantity of S3 files is a multiple of the number of slices in your Redshift cluster.
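A minimal sketch of that rule: pick the smallest file count that is a multiple of the slice count while keeping each file under a chosen size cap. The function name and the `max_file_mb` cap are hypothetical knobs for illustration; the slice count would come from your own cluster.

```python
import math


def target_file_count(total_size_mb: float, slices: int, max_file_mb: float) -> int:
    """Smallest multiple of `slices` that keeps each file at or under max_file_mb."""
    min_files = max(1, math.ceil(total_size_mb / max_file_mb))
    return math.ceil(min_files / slices) * slices


# Hypothetical load: 10,000 MB of data, a 16-slice cluster, files capped at 1,000 MB.
print(target_file_count(total_size_mb=10_000, slices=16, max_file_mb=1_000))
```

With a file count that is a multiple of the slice count, every slice can be given the same number of files during COPY, so no slice sits idle while others finish.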
In Kinesis Data Analytics, do tumbling windows overlap?
No
In Kinesis Data Analytics, _______ windows have a fixed starting and ending time, whereas ________ windows don’t begin until the first event matching the partition key arrives.
tumbling… stagger…
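The difference between the two window types can be illustrated with a simplified model in plain Python. This is a sketch of the concept only, not the Kinesis Data Analytics SQL semantics: events are hypothetical `(partition_key, timestamp_in_seconds)` pairs and both functions simply count events per window.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Count events in fixed windows aligned to multiples of `size`."""
    counts = defaultdict(int)
    for key, ts in events:
        window_start = (ts // size) * size  # boundary does not depend on the data
        counts[(key, window_start)] += 1
    return dict(counts)


def stagger_windows(events, size):
    """Count events in per-key windows that open when the key's first event arrives."""
    counts = defaultdict(int)
    open_window = {}  # key -> start time of that key's currently open window
    for key, ts in sorted(events, key=lambda e: e[1]):
        start = open_window.get(key)
        if start is None or ts >= start + size:
            start = ts  # first event for the key, or the previous window has closed
            open_window[key] = start
        counts[(key, start)] += 1
    return dict(counts)


# Three related events that straddle a one-minute boundary.
events = [("AMZN", 58), ("AMZN", 61), ("AMZN", 63)]
print(tumbling_windows(events, size=60))  # split across two fixed windows
print(stagger_windows(events, size=60))   # kept together in one window
```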
Amazon QuickSight can create visualizations based on S3 data stored in _____ or _____ format, but not in _____ format.
csv or json… not in parquet
Does Amazon Athena support column names that include special characters?
No, only alphanumeric characters and underscores are supported.
Storing passwords in your AWS Glue ETL job script is not recommended. What does AWS recommend instead for using passwords in your script?
Use boto3 to retrieve your passwords from AWS Secrets Manager or AWS Glue Data Catalog.
Can you use the COPY command to copy data directly from an RDS database into Amazon Redshift?
No
In Amazon Redshift, you can enable _________ ____________ to track information about authentication attempts, connections, disconnections, changes to database user definitions, and queries run in the database.
audit logging
By default, Amazon EMR uses _________ as a cluster resource manager.
YARN (Yet Another Resource Negotiator)
For Amazon EMR, you can use automatic scaling with a custom policy for which EMR configuration type?
instance groups
Each EMR instance group in a cluster, except the instance group of the _________ node, can have its own auto-scaling policy, which consists of scale-out and scale-in rules.
Remember: EMR auto-scaling only applies for instance groups configuration.
primary
Regarding S3 data storage formats, ____________ is better for saving storage space, whereas __________ is better for efficient read-heavy operations.
compressed csv… Apache ORC…
AWS Database Migration Service (AWS DMS) is a web service that you can use to migrate data from a source data store to a target data store. However, what is the one requirement for using AWS DMS?
You can’t use AWS DMS to migrate from an on-premises database to another on-premises database. One of your endpoints (source or target) needs to be an AWS service.
In AWS DMS, you can use the ________ ____________ feature to collect data from your on-premises database and analytic servers, and build an inventory of servers, databases, and schemas that you can migrate to AWS.
Fleet Advisor
For “ad-hoc”, “infrequent”, “cost-effective” analysis requirements, would EMR be a good solution?
No, it would be overkill and too expensive. Use Athena instead.
If your application is having trouble reading all of the required S3 data in an efficient manner, how can you scale the S3 read performance?
Since S3 supports 5,500 GET/HEAD requests per second per prefix in a bucket, the best solution would be to add more prefixes (e.g. 10 prefixes = 55,000 GET requests per second).
If your application needs to scan millions or billions of objects in S3, what does AWS recommend doing to help parallelize read capacity and performance?
It is recommended to create a random string and add that to the beginning of the object prefixes.
In effect, this creates more unique prefixes, which dramatically increases read capacity, since each bucket prefix supports 5,500 GET/HEAD requests per second.
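One way to picture this is the sketch below. It swaps the random string for a short hash of the key, an assumption made here so that the prefix can be recomputed later from the key alone. The helper names and the two-character prefix length are illustrative.

```python
import hashlib

GETS_PER_PREFIX_PER_SEC = 5_500  # figure quoted in the cards above


def hashed_key(key: str, prefix_chars: int = 2) -> str:
    """Prepend a short deterministic hash so keys spread across many prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}/{key}"


def aggregate_get_capacity(prefix_count: int) -> int:
    """Theoretical GET/HEAD requests per second across `prefix_count` prefixes."""
    return prefix_count * GETS_PER_PREFIX_PER_SEC


keys = [f"logs/2024-03-07/file-{i}.json" for i in range(1000)]
distinct_prefixes = {hashed_key(k).split("/", 1)[0] for k in keys}
print(len(distinct_prefixes), "distinct prefixes")
print(aggregate_get_capacity(10), "GET/HEAD requests per second across 10 prefixes")
```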
In Kinesis Data Analytics, what query method would you use to monitor a chosen stock in real-time, identify when a threshold is reached, and then create a real-time notification for the customer?
continuous query
When you terminate an EMR cluster, Amazon EMR retains metadata about the cluster for ______ months at no charge.
two
In Amazon OpenSearch Service, __________ storage provides a cost-effective way to store large amounts of read-only data for your OpenSearch Service cluster.
These nodes use Amazon S3 and a sophisticated caching solution to improve performance.
UltraWarm
In Amazon OpenSearch Service, _________ storage takes the form of instance stores or Amazon EBS volumes attached to each node and provides the fastest possible performance for indexing and searching new data.
hot
In Amazon OpenSearch Service, _________ storage lets you store any amount of infrequently accessed or historical data on your Amazon OpenSearch Service domain and analyze it on demand, at a lower cost than other storage tiers.
cold
In Amazon OpenSearch Service, _________ ________ ____________ lets you define custom management policies that automate routine tasks. For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.
Index State Management (ISM)
In Amazon OpenSearch Service, you can use ________ ___________ to reduce storage costs by periodically rolling up old data into summarized indices. This allows you to store months or years of historical data at a fraction of the cost with the same query performance.
index rollups
In Amazon OpenSearch Service, you can create ________ ___________ jobs to visualize, analyze, and/or summarize your data in different ways.
index transform
When using the AWS Glue Schema Registry feature, a ___________ is a logical container of schemas.
registry
AWS Glue comes with ________ worker types to help you select the configuration that meets your job latency and cost requirements.
List them.
three
- standard
- G.1X
- G.2X
What are 4 data sources that Redshift can use with its COPY command?
- S3
- EMR cluster files
- EC2 Instance files
- DynamoDB
To meet compliance requirements and ensure Amazon EMR cluster data is not publicly accessible, make sure the ________ _________ _________ option is enabled in the console.
block public access
How quickly can the following visualization tools pull in the data? Which tool supports “near-real-time” and “time-sensitive” dashboards?
1. QuickSight
2. OpenSearch Dashboards (Kibana)
- minimum of 15 minutes
- near-real-time
To query multiple distributed datasets in-place with SQL, you could use ________ ___________ running on Amazon EMR.
Apache Presto
If a Kinesis Producer Library (KPL) producer “RecordMaxBufferedTime” property is resulting in a delay that the application cannot tolerate, you can optionally use the AWS SDK directly. For example, you can update your producer’s code to use the __________ API call.
PutRecord(s)
Can you create an Amazon QuickSight dashboard directly from S3 data stored in Apache Parquet format?
No, you would need to use Athena to query the S3 data in Apache Parquet format, and then use Athena as the data source for QuickSight.
In Amazon QuickSight, which query mode (direct query or SPICE) is most cost-effective when you have 1000 readers?
SPICE, because the data stored in SPICE can be reused without incurring additional costs. Direct Query mode would incur costs every time someone views the dashboard because the data is refreshed constantly.
Amazon Athena has a feature called ___________ __________ that lets you run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Federated Query
What are the two types of cost constraints when using Amazon Athena workgroups?
- per-query limit
- per-workgroup limit
By default, each AWS account has a _________ workgroup within Amazon Athena. This ________ (can / cannot) be deleted.
primary… cannot…
In Amazon Athena, you can set up _____________ settings that enforce constraints for all queries that run in a workgroup.
workgroup-wide settings
In Amazon Athena workgroups, how many “per-query” limits and “per-workgroup” limits can you create?
Only 1 “per-query” limit per workgroup.
Multiple “per-workgroup” limits.
In Amazon Athena, _________ _________ write new data to a specified location in Amazon S3, whereas “views” do not write any data.
CTAS queries
_____________ and _____________ are two complementary ways to reduce the amount of data Amazon Athena must scan when you run a query.
Partitioning and bucketing
In Amazon Athena, good candidates for partition keys are columns that have ______ cardinality.
low
What are two rules of thumb for deciding what data columns to use for Amazon Athena bucketing?
- high-cardinality columns
- evenly distributed values (i.e. you want every bucket to have approximately the same amount of data)
____ ________ ________ _______ helps you to easily deploy and enforce compliance controls for individual S3 Glacier vaults.
A _______ ________ policy can be locked to prevent future changes, which provides strong enforcement for your compliance controls.
S3 Glacier Vault Lock
Vault Lock
When using job bookmarks in AWS Glue, always call _________ at the beginning of the script and __________ at the end of the script.
job.init() … job.commit() …
The Kinesis Client Library (KCL) is only able to use __________ as the checkpointing table.
DynamoDB
AWS recommends using a Snowball when the data to be transferred is less than _____ and a Snowmobile when the data is >= _________
10PB… 10PB…
Can the Kinesis Agent send data to both KDS and KDF ?
Yes
When using the AWS Glue crawler, you can use _________ ___________ to ignore the filepaths that you do not need or that have already been crawled.
exclude patterns
What are the two primary reasons why you might see duplicate records when using Kinesis Data Streams?
- producer retries
- consumer retries
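Because either kind of retry can deliver the same record twice, consumers are often written to be idempotent. The sketch below assumes each record carries a unique `id` field set by the producer (a hypothetical convention) and uses an in-memory set; a real application would likely keep the processed IDs in a durable store.

```python
def process_once(records, seen_ids, handler):
    """Call handler for each record whose ID has not been processed before."""
    processed = 0
    for record in records:
        record_id = record["id"]  # hypothetical unique ID embedded by the producer
        if record_id in seen_ids:
            continue  # duplicate caused by a producer or consumer retry
        handler(record)
        seen_ids.add(record_id)
        processed += 1
    return processed


seen = set()
results = []
batch = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "a", "v": 1}]
print(process_once(batch, seen, results.append))  # the repeated "a" is skipped
print(process_once(batch, seen, results.append))  # a re-read batch is skipped entirely
```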
An _________ schema in Amazon Redshift is a logical grouping of tables that are not stored in Redshift, but are accessible to Redshift through a data catalog.
external
What is the message size limit in Amazon SQS?
256 KB
In Amazon SQS, a FIFO queue can support up to _______ messages per second with batching, or up to _______ messages per second without.
3,000… 300…
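The two figures are related by the batch size: the 300 limit applies to API calls, and one batch call can carry up to 10 messages. A small arithmetic sketch, with an illustrative function name:

```python
API_CALLS_PER_SEC = 300  # FIFO limit without batching, from the card above
MAX_BATCH_SIZE = 10      # a single SQS batch request carries at most 10 messages


def fifo_messages_per_sec(batch_size: int = 1) -> int:
    """Messages per second for a FIFO queue at the given batch size."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch_size must be between 1 and 10")
    return API_CALLS_PER_SEC * batch_size


print(fifo_messages_per_sec(1))   # without batching
print(fifo_messages_per_sec(10))  # with full batches
```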
When given the choice between different partitioning methods in S3, always lean towards partitioning by ________ unless you have a VERY strong reason not to.
date
If your AWS Glue job uses a JDBC connection and it is running slow and timing out, how can you solve this problem?
Read the JDBC dataset in parallel by using multiple JDBC connections instead of the default (which is only one connection).
A company needing a machine learning application with the most cost-effective solution should deploy the application using ____________ as opposed to ______________.
EC2 instances… SageMaker…
When setting up the federated queries feature of Amazon Athena to join together different data sources, is this easy to set up?
No, it requires a lot of development effort. If the end goal is to create visualizations, a better solution would be to connect QuickSight straight to the relevant data sources.
In Amazon Redshift, AWS recommends using _____________ access control, instead of _________, to manage access to sensitive columns within a table.
column-level… views…