AWS Data Engineer Flashcards Flashcards by David Blocher

How can you analyze data from multiple sources with EMR?

Install Presto

How well did you know this?

Not at all

Perfectly

Simplest way to improve load time from S3 to Redshift when loading large .csv files (2 steps)?

Break large objects into multiple chunks
Load objects to Redshift with the ‘COPY’ command

How well did you know this?

Not at all

Perfectly

What is JDBC? ODBC?

Java Database Connectivity
Open Database Connectivity

How well did you know this?

Not at all

Perfectly

What forms of connections can Amazon Athena use to connect to BI tools (2)?

JDBC (used by SQL workbench)
ODBC

How well did you know this?

Not at all

Perfectly

When splitting large files for upload to Redshift, you want to split them into a number of objects equal to a multiple of the number of __________ in your Redshift cluster?

slices

How well did you know this?

Not at all

Perfectly

(Redshift) When should you use classic resize vs elastic resize for increasing node type and/or number?

You should never use classic, it can take hours or days while elastic takes 10-15 mins of write downtime.

How well did you know this?

Not at all

Perfectly

How do you block public access to your EMR cluster?

Enable the ‘block public access’ setting

How well did you know this?

Not at all

Perfectly

What feature of Redshift can be used to custom prioritize workload/query completion?

Workload Management (WLM)

This allows short-running queries to resolve without having to wait on long-running queries, for example.

How well did you know this?

Not at all

Perfectly

What does the ‘VACUUM’ command do in Redshift?

Cleans up and sorts data rows after a large upload.

How well did you know this?

Not at all

Perfectly

Who can run ‘VACUUM’ on Redshift?

A superuser or table owner ONLY.

How well did you know this?

Not at all

Perfectly

What will cause ‘VACUUM’ on Redshift to run slowly? (3)

sort key interleaving
is not run often enough
Too many columns in data

How well did you know this?

Not at all

Perfectly

Most efficient way to transfer 5TB of objects from S3 to HDFS on EMR?

‘S3DistCP’ command

How well did you know this?

Not at all

Perfectly

What types of jobs/scripts can be run by AWS Glue? (3)

Apache Spark, Spark Streaming, Python

How well did you know this?

Not at all

Perfectly

How can you run a scheduled Apache Hive job? (2 steps)

Event -> Lambda -> provision an EMR cluster to run the Hive job using ‘RunJobFlow’ API.

Set ‘KeepJobFlowAliveWhenNoSteps’ to FALSE to make sure it shuts down on completion.

How well did you know this?

Not at all

Perfectly

(Athena) What should be done before querying a partitioned table created form a record set stored in S3?

run ‘MSCK REPAIR TABLE’ or ‘ALTER TABLE ADD PARTITION’ to get Athena to recognize the partition.

How well did you know this?

Not at all

Perfectly

What command should be run to get Athena to recognize this partition pattern (Hive-style):

/year=2021/month=01/day=01/myfile.csv

MSCK REPAIR TABLE

How well did you know this?

Not at all

Perfectly

How do you authorize Quicksight to have access to an S3 bucket?

From the QuickSight console

How well did you know this?

Not at all

Perfectly

What are the two limitations of transitioning a Redshift cluster to use HSM encryption?

You must create a new encrypted cluster and move your data
You must use client and server certificates to create a trusted connection between Redshift and HSM

How well did you know this?

Not at all

Perfectly

Preferred service to move 5TB data warehouse from on-prem to AWS over DX connection?

DMS
(datasync could work)

How well did you know this?

Not at all

Perfectly

What method to get the most up-to-date results when querying data from multiple sources over JDBC?

EMR with Apache Presto

How well did you know this?

Not at all

Perfectly

Service combo to view near-real-time dashboards of streamed transaction data?

Firehose & Opensearch

How well did you know this?

Not at all

Perfectly

When querying time-stamped data using Redshift Spectrum, what should you use as the partition key?

usually date

How well did you know this?

Not at all

Perfectly

Apache Hive on EMR - Where do you define permissions for EMR to access an S3 bucket?

On the service role attached to the EC2 instances that make up the cluster – NOT on EMR itself.

How well did you know this?

Not at all

Perfectly

Can QuickSight access data in regions other than the one the console is hosted in?

Yes

How well did you know this?

Not at all

Perfectly

How can you save the state of previously executed ETL jobs in AWS Glue?

using bookmarks

What can improve the performance of ETL'ing data into Redshift?

Load the data into a 'staging' table, which will automatically drop after transformation is complete.

How can you store activity logs for an EMR cluster?

Enable logging to S3 upon creation

What is the fastest way to detect a threshold breach from data being streamed to AWS via Kinesis?

With A Kinesis Data Analytics application

How can you configure Kinesis Data Streams to scale to meet fluctuating demand?

Use Application Auto Scaling to call the 'UpdateShardCount' API operation

Kinesis - How to improve throughput if you do not need high data affinity within each shard?

Use a random or unique partition key

How to decrease Kinesis read load when delivering streaming data to Lambda functions?

Use enhanced fan-out

How do you trigger a Glue Job after a Glue Crawler completes?

Using a Glue workflow

How do you grant AWS Glue Data Catalog access to IAM groups -- but only to particular tables?

This must be done with a resource policy -- cannot be this granular with identity-based policies.

Redshift - 'COPY' command of .csv files from S3 results in no data transfer. Why?

.csv's used carriage returns as line terminator, and the 'COPY' command has 'IGNOREHEADER' parameter by default

EMR - which of the following EC2 deployment configurations can you configure auto-scaling on: -Instance Fleets -Instance Groups

Instance Groups ONLY

How many Kinesis Data Streams are needed to ingest streaming data from six different sources?

One.

Athena - How do you query data stored in S3 buckets across multiple regions?

Create a glue crawler (in the same region you want to run Athena queries) to catalog the datasets you wish to query in all regions.

What is Amazon Keyspace?

A managed Apache Cassandra-comaptible database

What is MemoryDB for Redis used for?

In-memory transient database without need for underlying DB (i.e. it's not a cache, but a vastly scalable in-memory DB)

In addition to increasing the number of shards and using enhanced fan-out, what factor can increase the efficiency of reading from Kinesis using AWS Lambda?

Increase the parallelization factor to allow multiple lambdas to process each shard

Which 3 events can remove a message from an SQS queue?

1. 'DeleteMessage' API call 2. queue is purged 3. the 'maxRecieveCount' for the item has been reached.

SQS - what is a 'visibility timeout'?

a period after a message is received by SQS that it cannot be processed by a consumer

Which RDS database types allow for IAM authentication?

MariaDB, MySQL, PostgreSQL (NOT Microsoft SQL, best you can do is rotating secrets in Secrets Manager)

Where do you monitor AWS Glue metrics to make sure you have provisioned the correct DPUs, for example?

The 'job run monitoring' section of the Glue console

What Redshift feature can be used to view several rows stored in Redshift compared to similar rows of historical data stored in S3?

Redshift Materialized Views SQL query S3 to get a subset of data, store it in the materialized view. Then run SQL REFRESH and query the view.

Which service can be used to mask PII before sending it off to ML applications like Sagemaker?

Glue DataBrew

What solution allows for the storage of a metadata catalog that allows granular permissions to be defined at the DB, table, column, row, and cell levels.

AWS Lake Formation with Lake Formation data filters

What feature of SageMaker can be used to ensure that the data being used is accurate, complete, and trustworthy?

Implement SageMaker ML Lineage Tracking within a SageMaker workflow.

How can you configure a SaaS application to continuously send data for storage in Redshift?

Configure an AppFlow flow to run on event

(EMR) What is the main benefit of using instance fleets over groups?

Grater cost-saving measures by determining % of fleet using spot instances

How can you read, write, and manage petabytes of data using a SQL-like interface?

Hive (on EMR)

Kinesis is increasing number of shards and records are being processed out of order as a result. Why?

parent shard is not being fully processed before children shards start processing records.

What should you use when you only have to query a subset of data from csv in S3?

S3 select

EMR Querying same set of data from S3 repeatedly. What use to make data load more efficient?

S3 Select and Apache Spark Data-Frames

How do you share data in an S3 bucket with a data lake?

What format is data indexed in OpenSearch?

JSON

How do you encrypt newly attached EBS volumes on EMR clusters?

Linux Unified Key Setup (LUKS)

How can you use Apache Parquet from S3 in QuickSight?

Have Athena read the data from S3. Use Athena as the data source. (Parquet not supported as a direct source for QuickSight)

What is the scale used for classification certainty in AWS Glue?

0.0-1.0

Can you automatically rotate the KMS keys for server-side encryption on Kinesis Data Streams?

No, must do it manually.

What do you need to do to get DataSync to encrypt data in transit and at rest?

nothing, these are configured by default.

How do you use DataSync to transfer huge amounts of ongoing data from HDFS to S3, while making the transfer system resilient to hardware failures?

Use EC2 instances across AZs to host DataSync agents. Use on-prem HDFS to set up an HDFS transfer location.

AWS Glue - What job property parameter is adjusted to increase performance of Glue ETL job?

maximum capacity parameter

How should you determine the number of shards to provision in an OpenSearch cluster?

You should have about 10-50 GB of data per shard.

2 cost-effective ways to prevent 'java.lang.OutOfMemoryError: Java heap space' error on Glue jobs. (large number of small files)

don't cache the whole object list in-memory by setting 'useS3ListImplementation' on the DynamicFrame class to 'true'. Have tasks process objects in batches by setting the 'groupFiles' option of the DynamicFrame class to 'inPartition'

How do you move Redshift query results to S3

UNLOAD

If you can only store data for analysis on Opensearch due to cost restrictions, how can you reduce the cost of older, less-frequently accessed data?

Use Index State Management (ISM) to move older data to "UltraWarm" storage.

How do you make a centralized metadata layer for objects processed by EMR?

Configure an external metastore for Hive.

Opensearch how to exclude irrelevant fields from storage?

using the filter_path parameter

How do you prevent a periodic Glue job from creating duplicate rows in a Redshift cluster?

Set up a staging table in Glue and use 'DynamicFrameWriter' to replace duplicates.

AWS Data Engineer Flashcards

(71 cards)