AWS Data Analytics Flashcards
In a single data dashboard, Amazon ___________ can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more.
QuickSight
CloudWatch detailed monitoring sends data from your EC2 instance to CloudWatch in ______ intervals.
1-minute
____________ is an ETL service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
Kinesis Data Firehose
When Kinesis Data Firehose is configured to send data to Redshift, behind the scenes it has to load the streaming data to _______ first and then issue a ______ command to move the data to Redshift.
S3… COPY…
Within Kinesis Data Analytics, using _________ __________ is a windowing method for analyzing time-based, overlapping groups of data that arrive at inconsistent times by aggregating the data.
stagger windows
What are the three windows you can use to process data in Kinesis Data Analytics?
- Stagger Windows
- Tumbling Windows
- Sliding Windows
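As a rough illustration of how tumbling and sliding windows differ, the sketch below groups timestamped values in plain Python. It is not Kinesis Data Analytics code: the function names, the sample events, and the exact boundary handling are assumptions made for the example, and stagger windows are not modeled.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // size) * size  # each event belongs to exactly one window
        windows[window_start].append(value)
    return dict(windows)


def sliding_window_counts(events, size):
    """For each event, count the events seen in the `size` seconds up to and including it."""
    counts = []
    for ts, _ in events:
        in_window = [t for t, _ in events if ts - size < t <= ts]
        counts.append((ts, len(in_window)))
    return counts


# Hypothetical events: (timestamp in seconds, payload)
events = [(1, "a"), (4, "b"), (7, "c"), (12, "d")]
tumbling = tumbling_windows(events, size=5)
sliding = sliding_window_counts(events, size=5)
```

In the tumbling case every event lands in one bucket; in the sliding case the same event can contribute to several overlapping windows.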
___________ includes a built-in ML algorithm that can easily provide reliable forecasts for your data.
Amazon QuickSight
_______ is a fast, open-source, distributed SQL query engine designed for interactive analytic queries over large datasets from multiple sources (built by Facebook).
Presto
AWS Glue ETL scripts can be coded in _________ or _________ .
Python… Scala…
Amazon Redshift automatically integrates with ________ but not with an ________ (for encryption keys).
AWS KMS… HSM…
With Amazon Redshift, you can’t migrate to an _______-encrypted cluster by modifying the cluster. This is only possible if you want to enable _______ encryption.
HSM… KMS…
To load data from S3 to Redshift, you can use a __________ _________ that lists out the specific S3 paths you want to be copied over.
manifest file
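A minimal sketch of what such a manifest and the matching COPY statement might look like, generated from Python. The bucket, table, and role names are hypothetical placeholders, and the helper functions are illustrative rather than part of any AWS SDK.

```python
import json


def build_manifest(s3_urls):
    """Return manifest JSON text listing each S3 object to load."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_statement(table, manifest_url, iam_role_arn):
    """Return a COPY statement that reads its file list from a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )


# Hypothetical names used only for illustration.
manifest_text = build_manifest([
    "s3://example-bucket/data/part-0000.csv",
    "s3://example-bucket/data/part-0001.csv",
])
copy_sql = build_copy_statement(
    "sales",
    "s3://example-bucket/manifests/sales.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The manifest text would be uploaded to S3 first; the COPY statement then points at the manifest rather than at a key prefix.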
Using the AWS Glue crawler for compressed files will cause the run time to ____________.
increase… It will take longer because the crawler has to download and decompress the file before reading it.
AWS Glue ___________ crawls only crawl folders that were added since the last crawler run, which can save significant time and cost.
incremental
To enable permissions between S3 and QuickSight, you would need to configure the permissions from the _________ console.
QuickSight
The _________ process re-sorts rows and reclaims space in either a specified table or all tables in the current database in Amazon Redshift.
VACUUM
If QuickSight connects to the data store by using a ________ ________, the data automatically refreshes when you open an associated dataset, analysis, or dashboard.
direct query
________ ______ is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
Amazon EMR
Can you use AWS Glue triggers to execute a job to run directly after a crawler completes?
No, but you can create an AWS Glue workflow with two triggers: one for the crawler and one for the job. This will achieve the same effect.
The capacity limits of an Amazon Kinesis data stream are defined by the ________ _____ ________ within the data stream.
number of shards
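A back-of-the-envelope shard estimate, assuming the commonly cited per-shard limits for provisioned streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads). The function name and the sample workload are assumptions for the example; check current quotas before sizing a real stream.

```python
import math

# Assumed per-shard limits for a provisioned stream.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 4.5 MB/s in, 3,000 records/s, 6 MB/s out.
example = shards_needed(4.5, 3000, 6.0)
```

Whichever limit is hit first (here, write throughput) determines the shard count.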
When creating an EMR cluster and you want to have the log files archived to Amazon S3, you must enable this feature __________ (while / after) launching the cluster.
while
Does Amazon SQS support real time streaming of data?
No.
What are the two Amazon EMR cluster types (distinguished by how long each one runs)?
(1) persistent / long-running
(2) transient
In Kinesis Data Streams, you can create up to _____ registered consumers per stream.
20
The two Kinesis Data Streams capacity modes are _________ and _________. These refer to whether the data stream shards are automatically or manually created.
on-demand… provisioned
To detect anomalies in your Kinesis data stream using Kinesis Data Analytics, you can use the ________________ function.
RANDOM_CUT_FOREST
Kinesis Data Analytics (KDA) supports _____________, _____________, and _____________ as destinations.
Kinesis Data Streams… Kinesis Data Firehose… Lambda
A common architecture using Kinesis Data Analytics (KDA) might look like this: ___________ –> Kinesis Data Analytics –> ___________ –> S3
Kinesis Data Stream –> Kinesis Data Analytics –> Kinesis Data Firehose –> S3
Apache _______ is a data warehousing system that uses SQL-like queries to analyze structured data stored in Hadoop Distributed File System (HDFS).
Hive
When creating an EMR cluster, what two configuration options can you choose from? The selected option is applied to each node type (primary, core, task) of the cluster.
- Instance Fleets
- Uniform Instance Groups (simpler, provides autoscaling)
Can Glue Data Catalog be used to store data, in a similar way to S3?
No, it is only used to store schema information on data gathered from the Glue crawler.
By default, Amazon Redshift clusters are created and situated in _______ AZ(s) within an AWS Region.
1… However, a multi-AZ deployment is also an option
If you have customized networking requirements for using Amazon Redshift, you will need to enable _________________ _______ _________________.
Enhanced VPC Routing
S3 Transfer Acceleration is enabled at the ________ level.
bucket
What are three common CLI commands for moving data to and from S3?
cp (copy)
mv (move)
sync (sync)
What are the 3 API calls for an S3 Multi Part upload?
CreateMultipartUpload
UploadPart
CompleteMultipartUpload
What is the max number of parts for an S3 Multi Part upload?
10,000
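Because of the 10,000-part ceiling, large objects need larger parts. The sketch below picks a part size, assuming the usual 5 MiB minimum part size; the helper names are made up for the example.

```python
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * 1024**2  # assumed 5 MiB minimum for every part except the last


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def choose_part_size(object_size_bytes):
    """Smallest part size (bytes) that keeps the upload within MAX_PARTS parts."""
    return max(MIN_PART_SIZE, ceil_div(object_size_bytes, MAX_PARTS))


def part_count(object_size_bytes, part_size):
    """Number of parts needed for an object at the given part size."""
    return ceil_div(object_size_bytes, part_size)


one_tib = 1024**4
part_size = choose_part_size(one_tib)
parts = part_count(one_tib, part_size)
```

For small objects the minimum part size applies; for a 1 TiB object the part size has to grow to roughly 105 MiB to stay within the limit.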
Does ElastiCache for Memcached support snapshots and replication?
No. Snapshots and replication are not supported for Memcached, only for Redis.
Which AWS database stores data as nodes connected with edges?
Neptune
In relational databases, row-based storage is ideal for OL__P and columnar storage is ideal for OL__P.
OLTP… OLAP
Apache _______ is an analytics framework for processing large datasets. (hint: Databricks is built on top of this)
Spark
What are the 3 data storage options for Amazon EMR?
- HDFS
- EMRFS (uses S3)
- Local Storage (Instance Store / EBS)
An Amazon EMR cluster can have either ____ or ____ primary (aka master) nodes.
1 or 3
How many AZs are used for Amazon EMR clusters?
Only 1 AZ
What are the three node types in an Amazon EMR cluster?
- Primary/Master Node
- Core Node
- Task Node (optional)
What is the name of the API you can use to launch an Amazon EMR cluster?
RunJobFlow API
What is the name of the API you can use to terminate an Amazon EMR cluster?
TerminateJobFlows API
The default limit for Amazon EMR instances is _____. This can be increased upon request.
20 instances (across all your clusters)
When using Amazon EMR, can you SSH directly into a task node?
No, you must first SSH into the master node, and then SSH into the desired node.
Which Amazon EMR node type (primary/master, core, task) hosts data using Hadoop Distributed File System (HDFS) and also runs Hadoop tasks?
Core Node
What are 5 implementations of how you can run Amazon EMR applications? (i.e. Amazon EMR on ______, Amazon EMR _____)
- Amazon EMR Serverless
- Amazon EMR on EC2
- Amazon EMR on AWS Outposts
- Amazon EMR on EKS
- Amazon EMR on Local Zones
For Amazon EMR billing, __________ rounds up the runtime duration to the nearest minute, whereas __________ tracks runtime duration to the nearest second.
BilledResourceUtilization… TotalResourceUtilization
Amazon EMR supports what two types of Hive clusters?
- interactive (customer can run Hive scripts directly on master node)
- batch (Hive script stored in S3 and referenced)
Amazon Redshift can automatically generate recommendations for managing your warehouse with the feature called _________ __________.
Redshift Advisor
Does Redshift support native integration with Amazon SageMaker?
Yes
____________ is a feature of Amazon Redshift that lets you run queries against your data lake in Amazon S3, with no data loading or ETL required.
Redshift Spectrum
Using Amazon Redshift ______ nodes with managed storage allows you to pay separately for storage and compute.
RA3
For Amazon Redshift instances using Dense Compute (DC) and Dense Storage (DS2) clusters, where is the data stored?
On the compute nodes (as opposed to S3 for RA3 clusters and Redshift Serverless)
How is the data stored when using an Amazon Redshift RA3 instance?
Frequently processed data (hot data) is stored on high performance SSDs, and cold data stored in S3.
What service would a customer use to integrate (and/or aggregate) Amazon Redshift with their own on-premises data warehouse?
AWS Data Exchange
How are Redshift Multi-AZ and Redshift Relocation different, regarding RTO?
Redshift Relocation is free and has a 10-60 minute recovery time.
Redshift Multi-AZ is more expensive, but has an RTO measured in seconds.
_____________ allows SQL users to create, train, and deploy machine learning models using familiar SQL commands.
Redshift ML
The Amazon Redshift _______ _______ simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands against an Amazon Redshift cluster by calling a secured API endpoint.
Data API
How long are Amazon Redshift automatic backups retained vs manual backups?
Automatic: 1 day by default (the retention period is configurable, up to 35 days)
Manual: Indefinitely
How would you monitor the performance of your Amazon Redshift data warehouse cluster?
AWS Management Console, or
CloudWatch APIs
Is there a charge for using the Amazon Redshift Data API?
No
When you launch an Amazon Redshift cluster, what option determines the CPU, RAM, storage capacity, and storage drive type for each node?
The node type
For datasets under 1 TB (compressed), what is the recommended Redshift node type?
DC2 (Dense Compute node)
What are the two EC2 platforms used for launching an Amazon Redshift cluster?
- EC2-Classic
- EC2-VPC
In Amazon Redshift, you need to associate a __________ _________ with each cluster that you create in order to configure database settings such as query timeout and date style.
parameter group
The charges that you accrue for using Amazon Redshift are based on _______ nodes and billed at an _________ rate.
compute… hourly…
Which EC2 instance categories does Amazon EMR support (i.e. on-demand, etc.)?
on-demand
spot
reserved
An Amazon Redshift cluster is a set of nodes, which consists of a ________ node and one or more ________ nodes.
leader… compute…
A QuickSight _________ is a user who can create and publish dashboards.
Author
A QuickSight _________ is a user who consumes interactive dashboards.
Reader
Amazon QuickSight __________ Edition offers enhanced functionality such as QuickSight Readers, Private VPC connectivity, and AD connectivity.
Enterprise
_________ _________ __________ __________ are of 30-minute duration each. Each session is charged at $0.30 with maximum charges of $5 per Reader in a month.
Amazon QuickSight Reader sessions
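The session pricing above works out as a simple capped sum. A small sketch using the figures quoted on this card ($0.30 per session, $5 monthly cap per Reader); the function name is an assumption, and prices are held in cents to avoid floating-point drift.

```python
SESSION_PRICE_CENTS = 30   # $0.30 per 30-minute session, as quoted above
MONTHLY_CAP_CENTS = 500    # $5 maximum per Reader per month, as quoted above


def monthly_reader_charge(sessions):
    """Return one Reader's monthly charge in dollars for a given session count."""
    cents = min(sessions * SESSION_PRICE_CENTS, MONTHLY_CAP_CENTS)
    return cents / 100


light_user = monthly_reader_charge(10)   # below the cap
heavy_user = monthly_reader_charge(40)   # capped
```

The cap is reached at the 17th session in a month, since 17 sessions at $0.30 would otherwise cost $5.10.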
Will an Amazon QuickSight Reader be charged if QuickSight is open in a browser in a background tab?
No. A Reader is charged only when they interact with the page (page refresh, filtering, clicking, etc.).
Can Amazon QuickSight “Authors” or “Readers” invite more users?
No. This can only be done with a QuickSight “Admin” account.
Does Amazon QuickSight connect to both Amazon EC2 and on-premises databases?
Yes
The “Augment with SageMaker” option for Amazon __________ allows your SageMaker ML models to run inferences on your data.
QuickSight
Does QuickSight leverage SageMaker models to perform inference on incremental data or the full data every time it runs?
Inference runs on the full data every time it refreshes.
Amazon QuickSight has an innovative technology called ________ that allows it to select the most appropriate visualizations based on the properties of the data.
AutoGraph
You can use AWS Glue _________ to visually clean up and normalize data without writing code.
DataBrew
How does AWS Glue relate to AWS Lake Formation?
AWS Lake Formation encompasses AWS Glue PLUS additional features.
With AWS Glue _______, data engineers can visually create, run, and monitor ETL workflows.
Studio
The metadata stored in the AWS Glue Data Catalog can be readily accessed from _________________, ______________, _____________, _________________, and third-party services.
AWS Glue ETL
Amazon Athena
Amazon EMR
Amazon Redshift Spectrum
The AWS Glue ________ ____________ is a new feature that allows you to centrally discover, control (i.e. enforce), and evolve data stream schemas.
Schema Registry
The AWS Glue ________ ____________ supports Apache Avro and JSON Schema data formats and Java client applications.
Schema Registry
Does the AWS Glue Schema Registry provide encryption at rest and in transit?
Yes
After you define the flow of your data sources, transformations, and targets in the visual (no-code) interface, AWS Glue Studio will generate __________ __________ code on your behalf.
Apache Spark
Which programming languages does AWS Glue ETL support?
Python and Scala
When building an AWS Glue workflow, what are the two ways to trigger AWS Glue ETL jobs within your workflow?
AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event.
AWS Glue provides default retry behavior that will retry all failures _____ times before sending an error notification to CloudWatch.
three
AWS Glue supports ETL on streams from _______________, _____________, and _____________.
Amazon Kinesis Data Streams (KDS)
Apache Kafka
Amazon MSK
Do you have to use both the Data Catalog and AWS Glue ETL together for the service to work?
No, they can be used independently.
Both AWS Glue and Kinesis Data Analytics can be used to process streaming data.
____________ is recommended when your use cases are primarily ETL and you want to run jobs on a serverless Apache Spark-based platform.
____________ is recommended when your use cases are primarily analytics and you want to run jobs on a serverless Apache Flink-based platform.
AWS Glue
Kinesis Data Analytics
Apache Spark is primarily used for ______ processing, whereas Apache Flink is primarily used for ______ processing.
batch… stream…
Both AWS Glue and Kinesis Data Firehose can be used for streaming ETL.
___________ is recommended for complex ETL, including joining streams, and partitioning the output in Amazon S3 based on the data content.
___________ is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.
AWS Glue
Kinesis Data Firehose
The AWS Glue __________ ML Transform can solve record linkage and data deduplication problems.
FindMatches
AWS Glue _______ __________ is a feature of AWS Glue that automatically measures and monitors the quality of data in data lakes and pipelines.
Data Quality
For the following AWS Glue features:
Data __________ use DataBrew to transform data without writing any code.
Data __________ use the Data Catalog to manage metadata.
Data __________ use AWS Glue Studio to author scalable data integration pipelines.
analysts
engineers
engineers
Can Amazon Athena process unstructured, semi-structured, and structured datasets?
Yes, it can process all three
AWS strongly recommends using the ______ command to load large amounts of data into Redshift, as opposed to the _______ command.
COPY… INSERT…
To grant or revoke privilege to load data into a table using a Redshift COPY command, grant or revoke the __________ privilege.
INSERT
To load data from Amazon S3, the Redshift COPY command must have _______ access to the bucket and _______ access for the bucket objects.
LIST… GET…
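In IAM terms, LIST and GET access usually correspond to the `s3:ListBucket` action on the bucket and the `s3:GetObject` action on its objects. A minimal sketch of a policy document built in Python; the bucket name, prefix, and helper name are hypothetical.

```python
import json


def copy_read_policy(bucket, prefix=""):
    """Return an IAM policy dict granting LIST on a bucket and GET on its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # LIST access applies to the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GET access applies to the objects under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = copy_read_policy("example-bucket", "data/")
policy_json = json.dumps(policy, indent=2)
```

A policy like this would be attached to the IAM role that the cluster assumes for the COPY command.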
For Redshift to obtain authorization to access a resource, your cluster must be authenticated using either __________ access control or __________ access control.
(________ access control is recommended by AWS)
role-based… key-based…
(role-based)
With ___________ access control, your Redshift cluster temporarily assumes an AWS Identity and Access Management (IAM) role on your behalf.
role-based
When loading data into Redshift, you can use a ___________ file to ensure that your COPY command loads only your specified files from Amazon S3.
manifest
When you load data into Redshift from S3 using a COPY command, what do you need to do differently when S3 server-side encryption is enabled?
Nothing. The process is the same whether S3 is encrypted or not.
When using the COPY command to load a table into Amazon Redshift, does the table to be loaded need to already exist in the Redshift database?
Yes
By default, when loading data from DynamoDB into Redshift, do these two services need to be in the same AWS Region?
Yes, but you can also specify a different region using the REGION parameter
When loading data from DynamoDB into Redshift, what happens when DynamoDB attributes do not match a column in the Amazon Redshift table?
These attributes are discarded. Additionally, they consume part of DynamoDB’s provisioned throughput since the attributes still have to be read.
After a Redshift load operation is complete, you can query the ______________ system table to verify that the expected files were loaded.
STL_LOAD_COMMITS
To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load the data into Redshift, you can use the __________ option with the COPY command.
NOLOAD
To apply automatic compression when loading data to Redshift, run the COPY command with the __________ option set to ON.
COMPUPDATE
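The NOLOAD and COMPUPDATE options are appended to an otherwise ordinary COPY statement. A small sketch that assembles both variants as SQL text; the table, S3 path, role ARN, and helper function are hypothetical.

```python
def copy_statement(table, source, iam_role_arn, options=()):
    """Return a COPY statement with optional trailing parameters."""
    lines = [
        f"COPY {table}",
        f"FROM '{source}'",
        f"IAM_ROLE '{iam_role_arn}'",
    ]
    lines.extend(options)  # e.g. NOLOAD, COMPUPDATE ON
    return "\n".join(lines) + ";"


ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # placeholder ARN

# Dry run: validate the input files without loading any rows.
validate_sql = copy_statement("sales", "s3://example-bucket/data/", ROLE, ["NOLOAD"])

# Real load with automatic compression analysis turned on.
load_sql = copy_statement("sales", "s3://example-bucket/data/", ROLE, ["COMPUPDATE ON"])
```

Running the NOLOAD variant first surfaces format errors before any rows are written.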
When loading data files from Amazon S3 into Redshift, does the order of the columns matter?
Yes, the columns must be in the same order as the Redshift table
The category of SQL commands that manipulate data in a database (INSERT, UPDATE, DELETE) are referred to as _______ _____________ ____________ commands.
Data Manipulation Language (DML)
Does Amazon Redshift support a single merge (or upsert) command to update a table from a single data source?
Historically no; you could achieve the same effect with a combination of updates and inserts from a staging table. Newer Redshift releases add a MERGE command.
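The update-then-insert pattern can be sketched with a staging table. The example below uses Python's built-in `sqlite3` module as a stand-in for Redshift, so the SQL dialect differs slightly from what you would run on a cluster; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO target VALUES (1, 'old-a'), (2, 'old-b');
INSERT INTO staging VALUES (2, 'new-b'), (3, 'new-c');
""")

with conn:  # run both statements in one transaction
    # Step 1: update rows that already exist in the target.
    conn.execute("""
        UPDATE target
        SET name = (SELECT s.name FROM staging s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    # Step 2: insert rows that are new to the target.
    conn.execute("""
        INSERT INTO target (id, name)
        SELECT id, name FROM staging
        WHERE id NOT IN (SELECT id FROM target)
    """)

rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

Wrapping both statements in a single transaction keeps readers from seeing a half-applied merge.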
The category of SQL commands that can be used to define the database schema, such as CREATE, DROP, ALTER, are referred to as _______ _____________ ____________.
Data Definition Language (DDL)
In a Redshift cluster, each node is further broken down into ___________, which have their own compute and storage associated with each.
slices
AWS recommends creating your Redshift tables with __________ ______, which uses automatic table optimization to choose the sort key.
SORTKEY AUTO
When you create a Redshift table, you can optionally specify one column as the ____________ ______. When the table is loaded with data, the rows are distributed to the node slices according to this key.
distribution key
What are the two types of Redshift table sort keys, and which is preferred?
COMPOUND (preferred)
INTERLEAVED
With compression in Redshift, can the sort key column be compressed?
No, it must always be in its raw form so it is always available for Redshift to use.
Which type of Redshift sort key performs better when using lots of WHERE clauses?
INTERLEAVED
Which type of Redshift sort key performs better when using lots of ORDER BY clauses?
COMPOUND
AWS recommends which distribution style for your Redshift tables?
DISTSTYLE AUTO
When you create a Redshift table, you can designate one of four distribution styles. What are they?
AUTO
EVEN
KEY
ALL
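A toy simulation of how EVEN, KEY, and ALL place rows. This is only an illustration: the CRC32 hash is a stand-in for whatever hashing Redshift uses internally, ALL is modeled by copying rows to every slice (Redshift keeps one full copy per node), and AUTO is not modeled. All names and sample rows are assumptions.

```python
import zlib


def distribute(rows, style, num_slices, key=None):
    """Assign rows to slices according to a simplified distribution style."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin, regardless of row content.
            slices[i % num_slices].append(row)
        elif style == "KEY":
            # Rows with the same key value always land on the same slice.
            digest = zlib.crc32(str(row[key]).encode("utf-8"))
            slices[digest % num_slices].append(row)
        elif style == "ALL":
            # Every slice receives a full copy of the table.
            for s in slices:
                s.append(row)
        else:
            raise ValueError(f"unsupported style: {style}")
    return slices


rows = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 20},
    {"customer_id": 1, "amount": 30},
    {"customer_id": 3, "amount": 40},
]
by_even = distribute(rows, "EVEN", 2)
by_key = distribute(rows, "KEY", 2, key="customer_id")
by_all = distribute(rows, "ALL", 2)
```

KEY distribution co-locates matching join keys, which is why it tends to help large joins, while ALL trades storage for never having to move a small dimension table.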
When creating a Redshift table with a NOT NULL constraint on a column, does Redshift enforce this?
No, Redshift can still accept data into that column
Redshift Spectrum supports ________ and ________ operations.
It does NOT support ________ and ________ operations.
SELECT… INSERT…
UPDATE… DELETE…
When resizing a Redshift cluster, the source cluster goes into ____________ mode while the resized cluster is being created.
read-only
The two types of resize operations you can choose for resizing a Redshift cluster are __________ and __________.
classic resize… elastic resize.
The ______ resize operation for a Redshift cluster takes minutes, while a ______ resize operation can take hours to days.
elastic… classic…
When performing an elastic resize of a Redshift cluster, what are the two main constraints?
- Can’t be used from or to a single-node cluster
- Only available for clusters that use the EC2-VPC platform
For classic resize and elastic resize operations for Redshift clusters, can you cancel the resize operation after it has been started?
For classic resize, yes.
For elastic resize, no.
Are the Redshift pause/resume options supported for EC2-Classic clusters?
No, you can only pause/resume EC2-VPC clusters
Which type of Redshift cluster resize uses a snapshot for the operation?
elastic resize
What Redshift operation can sort rows and will only sort tables that are less than 95% sorted?
VACUUM SORT ONLY
What Redshift operation can reclaim disk space and will only run on tables that have more than 5% of the rows marked for deletion?
VACUUM DELETE ONLY
What Redshift VACUUM option will ensure that the operation is not interrupted by (i.e. resources are not diverted to) incoming queries?
BOOST
A faster alternative to performing a full vacuum operation on a Redshift cluster table could be to do a _______ _______. This can be beneficial when you have an extremely unsorted table.
Deep Copy
What AWS service can transfer data to and from AWS at a huge scale (i.e. 10 Gbps per agent, which is approximately 100 TB/day)?
AWS DataSync
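A quick arithmetic check of the daily figure, assuming the per-agent rate is 10 gigabits per second (Gbps) sustained for a full day and using decimal units (1 TB = 1,000 GB). The function name is made up for the example.

```python
def tb_per_day(gigabits_per_second):
    """Convert a sustained rate in Gbps to terabytes transferred per day."""
    gigabytes_per_second = gigabits_per_second / 8  # 8 bits per byte
    seconds_per_day = 24 * 60 * 60
    return gigabytes_per_second * seconds_per_day / 1000  # GB -> TB


daily_tb = tb_per_day(10)
```

At 10 Gbps this comes to 108 TB/day, in line with the roughly 100 TB/day figure on the card.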
What is an Amazon EMR cluster composed of?
A collection of EC2 instances (referred to as “nodes”)
Each EC2 instance in an Amazon EMR cluster is called a _______.
node
Every Amazon EMR cluster has a ___________ node, and it’s possible to create a single-node cluster with only this node.
primary
The following is an example process using four steps for which AWS service?
1. Submit an input dataset for processing.
2. Process the output of the first step by using a Pig program.
3. Process a second input dataset by using a Hive program.
4. Write an output dataset.
Amazon EMR
When you set up an Amazon EMR cluster in a private subnet, AWS recommends that you also set up _____________________. Otherwise, you will incur additional charges for NAT gateway as the traffic flow will not be contained within your VPC.
VPC endpoints for Amazon S3
Amazon EMR integrates with ___________ to log information about requests made by or on behalf of your AWS account. With this information, you can track who is accessing your cluster when, and the IP address from which they made the request.
CloudTrail
___________ ______ _________ is a web-based integrated development environment (IDE) for fully managed Jupyter notebooks that run on Amazon EMR clusters.
Amazon EMR Studio
What feature of Amazon EMR allows you to browse your data catalog, run SQL queries, and download results before you work with the data in a Studio notebook?
Amazon EMR Studio SQL Explorer
An Amazon EMR Studio is composed of one or more ___________.
Workspaces
___________ ______ _________ does not support EMR clusters with multiple primary nodes.
Amazon EMR Studio
The maximum number of Amazon EMR Studios you can have is _____ per AWS account.
10