AWS Data Engineer Flashcards

1
Q

How can you analyze data from multiple sources with EMR?

A

Install Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Simplest way to improve load time from S3 to Redshift when loading large .csv files (2 steps)?

A
  1. Break large objects into multiple chunks
  2. Load objects to Redshift with the ‘COPY’ command
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is JDBC? ODBC?

A

Java Database Connectivity
Open Database Connectivity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What forms of connections can Amazon Athena use to connect to BI tools (2)?

A
  1. JDBC (used by SQL workbench)
  2. ODBC
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

When splitting large files for upload to Redshift, you want to split them into a number of objects equal to a multiple of the number of __________ in your Redshift cluster?

A

slices

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

(Redshift) When should you use classic resize vs elastic resize for increasing node type and/or number?

A

You should never use classic, it can take hours or days while elastic takes 10-15 mins of write downtime.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How do you block public access to your EMR cluster?

A

Enable the ‘block public access’ setting

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What feature of Redshift can be used to custom prioritize workload/query completion?

A

Workload Management (WLM)

This allows short-running queries to resolve without having to wait on long-running queries, for example.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What does the ‘VACUUM’ command do in Redshift?

A

Cleans up and sorts data rows after a large upload.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Who can run ‘VACUUM’ on Redshift?

A

A superuser or table owner ONLY.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What will cause ‘VACUUM’ on Redshift to run slowly? (3)

A
  1. sort key interleaving
  2. is not run often enough
  3. Too many columns in data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Most efficient way to transfer 5TB of objects from S3 to HDFS on EMR?

A

‘S3DistCP’ command

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What types of jobs/scripts can be run by AWS Glue? (3)

A

Apache Spark, Spark Streaming, Python

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How can you run a scheduled Apache Hive job? (2 steps)

A

Event -> Lambda -> provision an EMR cluster to run the Hive job using ‘RunJobFlow’ API.

Set ‘KeepJobFlowAliveWhenNoSteps’ to FALSE to make sure it shuts down on completion.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

(Athena) What should be done before querying a partitioned table created form a record set stored in S3?

A

run ‘MSCK REPAIR TABLE’ or ‘ALTER TABLE ADD PARTITION’ to get Athena to recognize the partition.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What command should be run to get Athena to recognize this partition pattern (Hive-style):

/year=2021/month=01/day=01/myfile.csv

A

MSCK REPAIR TABLE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How do you authorize Quicksight to have access to an S3 bucket?

A

From the QuickSight console

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What are the two limitations of transitioning a Redshift cluster to use HSM encryption?

A
  1. You must create a new encrypted cluster and move your data
  2. You must use client and server certificates to create a trusted connection between Redshift and HSM
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Preferred service to move 5TB data warehouse from on-prem to AWS over DX connection?

A

DMS
(datasync could work)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What method to get the most up-to-date results when querying data from multiple sources over JDBC?

A

EMR with Apache Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Service combo to view near-real-time dashboards of streamed transaction data?

A

Firehose & Opensearch

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

When querying time-stamped data using Redshift Spectrum, what should you use as the partition key?

A

usually date

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Apache Hive on EMR - Where do you define permissions for EMR to access an S3 bucket?

A

On the service role attached to the EC2 instances that make up the cluster – NOT on EMR itself.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Can QuickSight access data in regions other than the one the console is hosted in?

A

Yes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

How can you save the state of previously executed ETL jobs in AWS Glue?

A

using bookmarks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What can improve the performance of ETL’ing data into Redshift?

A

Load the data into a ‘staging’ table, which will automatically drop after transformation is complete.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

How can you store activity logs for an EMR cluster?

A

Enable logging to S3 upon creation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is the fastest way to detect a threshold breach from data being streamed to AWS via Kinesis?

A

With A Kinesis Data Analytics application

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

How can you configure Kinesis Data Streams to scale to meet fluctuating demand?

A

Use Application Auto Scaling to call the ‘UpdateShardCount’ API operation

30
Q

Kinesis - How to improve throughput if you do not need high data affinity within each shard?

A

Use a random or unique partition key

31
Q

How to decrease Kinesis read load when delivering streaming data to Lambda functions?

A

Use enhanced fan-out

32
Q

How do you trigger a Glue Job after a Glue Crawler completes?

A

Using a Glue workflow

33
Q

How do you grant AWS Glue Data Catalog access to IAM groups – but only to particular tables?

A

This must be done with a resource policy – cannot be this granular with identity-based policies.

34
Q

Redshift - ‘COPY’ command of .csv files from S3 results in no data transfer. Why?

A

.csv’s used carriage returns as line terminator, and the ‘COPY’ command has ‘IGNOREHEADER’ parameter by default

35
Q

EMR - which of the following EC2 deployment configurations can you configure auto-scaling on:
-Instance Fleets
-Instance Groups

A

Instance Groups ONLY

36
Q

How many Kinesis Data Streams are needed to ingest streaming data from six different sources?

A

One.

37
Q

Athena - How do you query data stored in S3 buckets across multiple regions?

A

Create a glue crawler (in the same region you want to run Athena queries) to catalog the datasets you wish to query in all regions.

38
Q

What is Amazon Keyspace?

A

A managed Apache Cassandra-comaptible database

39
Q

What is MemoryDB for Redis used for?

A

In-memory transient database without need for underlying DB (i.e. it’s not a cache, but a vastly scalable in-memory DB)

40
Q

In addition to increasing the number of shards and using enhanced fan-out, what factor can increase the efficiency of reading from Kinesis using AWS Lambda?

A

Increase the parallelization factor to allow multiple lambdas to process each shard

41
Q

Which 3 events can remove a message from an SQS queue?

A
  1. ‘DeleteMessage’ API call
  2. queue is purged
  3. the ‘maxRecieveCount’ for the item has been reached.
42
Q

SQS - what is a ‘visibility timeout’?

A

a period after a message is received by SQS that it cannot be processed by a consumer

43
Q

Which RDS database types allow for IAM authentication?

A

MariaDB, MySQL, PostgreSQL

(NOT Microsoft SQL, best you can do is rotating secrets in Secrets Manager)

44
Q

Where do you monitor AWS Glue metrics to make sure you have provisioned the correct DPUs, for example?

A

The ‘job run monitoring’ section of the Glue console

45
Q

What Redshift feature can be used to view several rows stored in Redshift compared to similar rows of historical data stored in S3?

A

Redshift Materialized Views

SQL query S3 to get a subset of data, store it in the materialized view. Then run SQL REFRESH and query the view.

46
Q

Which service can be used to mask PII before sending it off to ML applications like Sagemaker?

A

Glue DataBrew

47
Q

What solution allows for the storage of a metadata catalog that allows granular permissions to be defined at the DB, table, column, row, and cell levels.

A

AWS Lake Formation with Lake Formation data filters

48
Q

What feature of SageMaker can be used to ensure that the data being used is accurate, complete, and trustworthy?

A

Implement SageMaker ML Lineage Tracking within a SageMaker workflow.

49
Q

How can you configure a SaaS application to continuously send data for storage in Redshift?

A

Configure an AppFlow flow to run on event

50
Q

(EMR) What is the main benefit of using instance fleets over groups?

A

Grater cost-saving measures by determining % of fleet using spot instances

51
Q

How can you read, write, and manage petabytes of data using a SQL-like interface?

A

Hive (on EMR)

52
Q

Kinesis is increasing number of shards and records are being processed out of order as a result. Why?

A

parent shard is not being fully processed before children shards start processing records.

53
Q

What should you use when you only have to query a subset of data from csv in S3?

A

S3 select

54
Q

EMR Querying same set of data from S3 repeatedly. What use to make data load more efficient?

A

S3 Select and Apache Spark Data-Frames

55
Q

How do you share data in an S3 bucket with a data lake?

A

Register the path location of the data on Lake Formation

56
Q

What format is data indexed in OpenSearch?

A

JSON

57
Q

How do you encrypt newly attached EBS volumes on EMR clusters?

A

Linux Unified Key Setup (LUKS)

58
Q

How can you use Apache Parquet from S3 in QuickSight?

A

Have Athena read the data from S3. Use Athena as the data source. (Parquet not supported as a direct source for QuickSight)

59
Q

What is the scale used for classification certainty in AWS Glue?

A

0.0-1.0

60
Q

Can you automatically rotate the KMS keys for server-side encryption on Kinesis Data Streams?

A

No, must do it manually.

61
Q

What do you need to do to get DataSync to encrypt data in transit and at rest?

A

nothing, these are configured by default.

62
Q

How do you use DataSync to transfer huge amounts of ongoing data from HDFS to S3, while making the transfer system resilient to hardware failures?

A

Use EC2 instances across AZs to host DataSync agents. Use on-prem HDFS to set up an HDFS transfer location.

63
Q

AWS Glue - What job property parameter is adjusted to increase performance of Glue ETL job?

A

maximum capacity parameter

64
Q

How should you determine the number of shards to provision in an OpenSearch cluster?

A

You should have about 10-50 GB of data per shard.

65
Q

2 cost-effective ways to prevent ‘java.lang.OutOfMemoryError: Java heap space’ error on Glue jobs. (large number of small files)

A

don’t cache the whole object list in-memory by setting ‘useS3ListImplementation’ on the DynamicFrame class to ‘true’.

Have tasks process objects in batches by setting the ‘groupFiles’ option of the DynamicFrame class to ‘inPartition’

66
Q

How do you move Redshift query results to S3

A

UNLOAD

67
Q

If you can only store data for analysis on Opensearch due to cost restrictions, how can you reduce the cost of older, less-frequently accessed data?

A

Use Index State Management (ISM) to move older data to “UltraWarm” storage.

68
Q

How do you make a centralized metadata layer for objects processed by EMR?

A

Configure an external metastore for Hive.

69
Q

Opensearch how to exclude irrelevant fields from storage?

A

using the filter_path parameter

70
Q

How do you prevent a periodic Glue job from creating duplicate rows in a Redshift cluster?

A

Set up a staging table in Glue and use ‘DynamicFrameWriter’ to replace duplicates.

71
Q
A