AWS Data Engineer Flashcards

1
Q

How can you analyze data from multiple sources with EMR?

A

Install Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Simplest way to improve load time from S3 to Redshift when loading large .csv files (2 steps)?

A
  1. Break large objects into multiple chunks
  2. Load objects to Redshift with the ‘COPY’ command
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is JDBC? ODBC?

A

Java Database Connectivity
Open Database Connectivity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What forms of connections can Amazon Athena use to connect to BI tools (2)?

A
  1. JDBC (used by SQL workbench)
  2. ODBC
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

When splitting large files for upload to Redshift, you want to split them into a number of objects equal to a multiple of the number of __________ in your Redshift cluster?

A

slices

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

(Redshift) When should you use classic resize vs elastic resize for increasing node type and/or number?

A

You should never use classic, it can take hours or days while elastic takes 10-15 mins of write downtime.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How do you block public access to your EMR cluster?

A

Enable the ‘block public access’ setting

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What feature of Redshift can be used to custom prioritize workload/query completion?

A

Workload Management (WLM)

This allows short-running queries to resolve without having to wait on long-running queries, for example.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What does the ‘VACUUM’ command do in Redshift?

A

Cleans up and sorts data rows after a large upload.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Who can run ‘VACUUM’ on Redshift?

A

A superuser or table owner ONLY.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What will cause ‘VACUUM’ on Redshift to run slowly? (3)

A
  1. sort key interleaving
  2. is not run often enough
  3. Too many columns in data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Most efficient way to transfer 5TB of objects from S3 to HDFS on EMR?

A

‘S3DistCP’ command

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What types of jobs/scripts can be run by AWS Glue? (3)

A

Apache Spark, Spark Streaming, Python

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How can you run a scheduled Apache Hive job? (2 steps)

A

Event -> Lambda -> provision an EMR cluster to run the Hive job using ‘RunJobFlow’ API.

Set ‘KeepJobFlowAliveWhenNoSteps’ to FALSE to make sure it shuts down on completion.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

(Athena) What should be done before querying a partitioned table created form a record set stored in S3?

A

run ‘MSCK REPAIR TABLE’ or ‘ALTER TABLE ADD PARTITION’ to get Athena to recognize the partition.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What command should be run to get Athena to recognize this partition pattern (Hive-style):

/year=2021/month=01/day=01/myfile.csv

A

MSCK REPAIR TABLE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How do you authorize Quicksight to have access to an S3 bucket?

A

From the QuickSight console

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What are the two limitations of transitioning a Redshift cluster to use HSM encryption?

A
  1. You must create a new encrypted cluster and move your data
  2. You must use client and server certificates to create a trusted connection between Redshift and HSM
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Preferred service to move 5TB data warehouse from on-prem to AWS over DX connection?

A

DMS
(datasync could work)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What method to get the most up-to-date results when querying data from multiple sources over JDBC?

A

EMR with Apache Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Service combo to view near-real-time dashboards of streamed transaction data?

A

Firehose & Opensearch

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

When querying time-stamped data using Redshift Spectrum, what should you use as the partition key?

A

usually date

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Apache Hive on EMR - Where do you define permissions for EMR to access an S3 bucket?

A

On the service role attached to the EC2 instances that make up the cluster – NOT on EMR itself.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Can QuickSight access data in regions other than the one the console is hosted in?

A

Yes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
How can you save the state of previously executed ETL jobs in AWS Glue?
using bookmarks
26
What can improve the performance of ETL'ing data into Redshift?
Load the data into a 'staging' table, which will automatically drop after transformation is complete.
27
How can you store activity logs for an EMR cluster?
Enable logging to S3 upon creation
28
What is the fastest way to detect a threshold breach from data being streamed to AWS via Kinesis?
With A Kinesis Data Analytics application
29
How can you configure Kinesis Data Streams to scale to meet fluctuating demand?
Use Application Auto Scaling to call the 'UpdateShardCount' API operation
30
Kinesis - How to improve throughput if you do not need high data affinity within each shard?
Use a random or unique partition key
31
How to decrease Kinesis read load when delivering streaming data to Lambda functions?
Use enhanced fan-out
32
How do you trigger a Glue Job after a Glue Crawler completes?
Using a Glue workflow
33
How do you grant AWS Glue Data Catalog access to IAM groups -- but only to particular tables?
This must be done with a resource policy -- cannot be this granular with identity-based policies.
34
Redshift - 'COPY' command of .csv files from S3 results in no data transfer. Why?
.csv's used carriage returns as line terminator, and the 'COPY' command has 'IGNOREHEADER' parameter by default
35
EMR - which of the following EC2 deployment configurations can you configure auto-scaling on: -Instance Fleets -Instance Groups
Instance Groups ONLY
36
How many Kinesis Data Streams are needed to ingest streaming data from six different sources?
One.
37
Athena - How do you query data stored in S3 buckets across multiple regions?
Create a glue crawler (in the same region you want to run Athena queries) to catalog the datasets you wish to query in all regions.
38
What is Amazon Keyspace?
A managed Apache Cassandra-comaptible database
39
What is MemoryDB for Redis used for?
In-memory transient database without need for underlying DB (i.e. it's not a cache, but a vastly scalable in-memory DB)
40
In addition to increasing the number of shards and using enhanced fan-out, what factor can increase the efficiency of reading from Kinesis using AWS Lambda?
Increase the parallelization factor to allow multiple lambdas to process each shard
41
Which 3 events can remove a message from an SQS queue?
1. 'DeleteMessage' API call 2. queue is purged 3. the 'maxRecieveCount' for the item has been reached.
42
SQS - what is a 'visibility timeout'?
a period after a message is received by SQS that it cannot be processed by a consumer
43
Which RDS database types allow for IAM authentication?
MariaDB, MySQL, PostgreSQL (NOT Microsoft SQL, best you can do is rotating secrets in Secrets Manager)
44
Where do you monitor AWS Glue metrics to make sure you have provisioned the correct DPUs, for example?
The 'job run monitoring' section of the Glue console
45
What Redshift feature can be used to view several rows stored in Redshift compared to similar rows of historical data stored in S3?
Redshift Materialized Views SQL query S3 to get a subset of data, store it in the materialized view. Then run SQL REFRESH and query the view.
46
Which service can be used to mask PII before sending it off to ML applications like Sagemaker?
Glue DataBrew
47
What solution allows for the storage of a metadata catalog that allows granular permissions to be defined at the DB, table, column, row, and cell levels.
AWS Lake Formation with Lake Formation data filters
48
What feature of SageMaker can be used to ensure that the data being used is accurate, complete, and trustworthy?
Implement SageMaker ML Lineage Tracking within a SageMaker workflow.
49
How can you configure a SaaS application to continuously send data for storage in Redshift?
Configure an AppFlow flow to run on event
50
(EMR) What is the main benefit of using instance fleets over groups?
Grater cost-saving measures by determining % of fleet using spot instances
51
How can you read, write, and manage petabytes of data using a SQL-like interface?
Hive (on EMR)
52
Kinesis is increasing number of shards and records are being processed out of order as a result. Why?
parent shard is not being fully processed before children shards start processing records.
53
What should you use when you only have to query a subset of data from csv in S3?
S3 select
54
EMR Querying same set of data from S3 repeatedly. What use to make data load more efficient?
S3 Select and Apache Spark Data-Frames
55
How do you share data in an S3 bucket with a data lake?
Register the path location of the data on Lake Formation
56
What format is data indexed in OpenSearch?
JSON
57
How do you encrypt newly attached EBS volumes on EMR clusters?
Linux Unified Key Setup (LUKS)
58
How can you use Apache Parquet from S3 in QuickSight?
Have Athena read the data from S3. Use Athena as the data source. (Parquet not supported as a direct source for QuickSight)
59
What is the scale used for classification certainty in AWS Glue?
0.0-1.0
60
Can you automatically rotate the KMS keys for server-side encryption on Kinesis Data Streams?
No, must do it manually.
61
What do you need to do to get DataSync to encrypt data in transit and at rest?
nothing, these are configured by default.
62
How do you use DataSync to transfer huge amounts of ongoing data from HDFS to S3, while making the transfer system resilient to hardware failures?
Use EC2 instances across AZs to host DataSync agents. Use on-prem HDFS to set up an HDFS transfer location.
63
AWS Glue - What job property parameter is adjusted to increase performance of Glue ETL job?
maximum capacity parameter
64
How should you determine the number of shards to provision in an OpenSearch cluster?
You should have about 10-50 GB of data per shard.
65
2 cost-effective ways to prevent 'java.lang.OutOfMemoryError: Java heap space' error on Glue jobs. (large number of small files)
don't cache the whole object list in-memory by setting 'useS3ListImplementation' on the DynamicFrame class to 'true'. Have tasks process objects in batches by setting the 'groupFiles' option of the DynamicFrame class to 'inPartition'
66
How do you move Redshift query results to S3
UNLOAD
67
If you can only store data for analysis on Opensearch due to cost restrictions, how can you reduce the cost of older, less-frequently accessed data?
Use Index State Management (ISM) to move older data to "UltraWarm" storage.
68
How do you make a centralized metadata layer for objects processed by EMR?
Configure an external metastore for Hive.
69
Opensearch how to exclude irrelevant fields from storage?
using the filter_path parameter
70
How do you prevent a periodic Glue job from creating duplicate rows in a Redshift cluster?
Set up a staging table in Glue and use 'DynamicFrameWriter' to replace duplicates.
71