AWS Data Engineer Flashcards
How can you analyze data from multiple sources with EMR?
Install Presto on the cluster
Simplest way to improve load time from S3 to Redshift when loading large .csv files (2 steps)?
- Break large objects into multiple chunks
- Load objects to Redshift with the ‘COPY’ command
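The chunking step can be sketched in a few lines of Python. This is a minimal illustration only, assuming the CSV has no header row and fits in memory; the function name `split_csv_rows` is hypothetical, and a real multi-gigabyte file would be streamed instead.

```python
import csv
import io


def split_csv_rows(csv_text: str, parts: int) -> list[str]:
    """Split CSV text into `parts` chunks of roughly equal row count.

    Assumes the input has no header row. Rows are dealt out round-robin,
    so chunk sizes differ by at most one row.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    rows = csv.reader(io.StringIO(csv_text))
    buffers = [io.StringIO() for _ in range(parts)]
    writers = [csv.writer(buf, lineterminator="\n") for buf in buffers]
    for index, row in enumerate(rows):
        writers[index % parts].writerow(row)
    return [buf.getvalue() for buf in buffers]
```

If the chunks are uploaded under a common S3 key prefix, a single ‘COPY’ that points at the prefix loads them in parallel.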
What is JDBC? ODBC?
Java Database Connectivity
Open Database Connectivity
What forms of connections can Amazon Athena use to connect to BI tools (2)?
- JDBC (used by SQL Workbench/J)
- ODBC
When splitting large files for upload to Redshift, you want to split them into a number of objects equal to a multiple of the number of __________ in your Redshift cluster?
slices
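A small helper can pick the file count. This is a sketch under assumptions: `target_file_count` and its `max_chunk_bytes` parameter are hypothetical names, and the slice count has to be looked up for your own cluster.

```python
def target_file_count(total_bytes: int, slices: int, max_chunk_bytes: int) -> int:
    """Return the smallest multiple of `slices` that keeps every chunk
    at or below `max_chunk_bytes` (assuming evenly sized chunks)."""
    if total_bytes < 0 or slices < 1 or max_chunk_bytes < 1:
        raise ValueError("total_bytes must be >= 0; slices and max_chunk_bytes >= 1")
    # Ceiling division without floats: chunks needed to respect the size cap.
    needed = max(1, (total_bytes + max_chunk_bytes - 1) // max_chunk_bytes)
    # Round that count up to the next multiple of the slice count.
    return ((needed + slices - 1) // slices) * slices
```

For example, a 10 GiB file on an 8-slice cluster with a 1 GiB cap needs 10 chunks, which rounds up to 16 files so that every slice gets the same number of objects.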
(Redshift) When should you use classic resize vs elastic resize for increasing node type and/or number?
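Prefer elastic resize whenever it is available: it usually takes 10-15 mins of write downtime, while classic resize can take hours or days. Use classic resize only when the target configuration is not supported by elastic resize.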
How do you block public access to your EMR cluster?
Enable the ‘block public access’ setting
What feature of Redshift can be used to custom prioritize workload/query completion?
Workload Management (WLM)
This allows short-running queries to resolve without having to wait on long-running queries, for example.
What does the ‘VACUUM’ command do in Redshift?
Reclaims space from deleted rows and re-sorts the remaining rows, typically after a large load, update, or delete.
Who can run ‘VACUUM’ on Redshift?
A superuser or table owner ONLY.
What will cause ‘VACUUM’ on Redshift to run slowly? (3)
- Interleaved sort keys
- VACUUM is not run often enough
- Too many columns in the table
Most efficient way to transfer 5TB of objects from S3 to HDFS on EMR?
‘S3DistCp’ command
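On a running cluster, S3DistCp is usually submitted as an EMR step through `command-runner.jar`. The sketch below only builds the step definition; the helper name `s3distcp_step`, the step name, and the example paths are assumptions for illustration.

```python
def s3distcp_step(src: str, dest: str, name: str = "Copy S3 to HDFS") -> dict:
    """Build an EMR step definition that runs s3-dist-cp from `src` to `dest`."""
    return {
        "Name": name,  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }
```

The resulting dict can be passed in the `Steps` list of boto3's `add_job_flow_steps` call, for example with `src="s3://example-bucket/input/"` and `dest="hdfs:///input/"`.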
What types of jobs/scripts can be run by AWS Glue? (3)
Apache Spark, Spark Streaming, Python shell
How can you run a scheduled Apache Hive job? (2 steps)
Event -> Lambda -> provision an EMR cluster to run the Hive job using ‘RunJobFlow’ API.
Set ‘KeepJobFlowAliveWhenNoSteps’ to FALSE to make sure it shuts down on completion.
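The two steps above might look roughly like this in a Lambda function. It is a sketch, not a reference implementation: the cluster name, instance types, instance count, bucket paths, and release label are all placeholder assumptions, and the role names are the EMR defaults, which may differ in your account.

```python
def build_hive_job_flow(script_uri: str, log_uri: str, release_label: str) -> dict:
    """Build keyword arguments for the EMR RunJobFlow API (boto3 `run_job_flow`)."""
    return {
        "Name": "scheduled-hive-job",  # hypothetical cluster name
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,                 # assumed cluster size
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-hive-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["hive-script", "--run-hive-script",
                             "--args", "-f", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default instance profile name
        "ServiceRole": "EMR_DefaultRole",      # default service role name
    }


def lambda_handler(event, context):
    """Lambda entry point: launch a transient cluster for the Hive job."""
    import boto3  # imported lazily so the builder can be tested without AWS

    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_hive_job_flow(
        script_uri="s3://example-bucket/scripts/job.q",  # placeholder path
        log_uri="s3://example-bucket/emr-logs/",         # placeholder path
        release_label="emr-6.15.0",                      # placeholder release
    ))
    return response["JobFlowId"]
```

Keeping the request in a plain function makes it easy to check that ‘KeepJobFlowAliveWhenNoSteps’ is FALSE before anything is launched.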
(Athena) What should be done before querying a partitioned table created from a record set stored in S3?
Run ‘MSCK REPAIR TABLE’ or ‘ALTER TABLE ADD PARTITION’ so that Athena recognizes the partitions.
What command should be run to get Athena to recognize this partition pattern (Hive-style):
/year=2021/month=01/day=01/myfile.csv
MSCK REPAIR TABLE
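For comparison, the ‘ALTER TABLE ADD PARTITION’ alternative needs the partition values spelled out, and they can be read straight from a Hive-style key. The sketch below assumes string-quoted partition values and trusted input (nothing is escaped); the table and bucket names in the usage note are hypothetical.

```python
def parse_hive_partitions(key: str) -> dict[str, str]:
    """Extract `name=value` path segments from a Hive-style object key."""
    partitions = {}
    for segment in key.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def add_partition_sql(table: str, bucket: str, key: str) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one object key."""
    partitions = parse_hive_partitions(key)
    if not partitions:
        raise ValueError("key contains no Hive-style partition segments")
    spec = ", ".join(f"{name} = '{value}'" for name, value in partitions.items())
    # The partition location is the key without its final file name.
    prefix = key.strip("/").rsplit("/", 1)[0]
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION 's3://{bucket}/{prefix}/'"
    )
```

Calling `add_partition_sql("sales", "my-bucket", "/year=2021/month=01/day=01/myfile.csv")` yields a statement that registers just that one partition, whereas ‘MSCK REPAIR TABLE’ scans the table location for every Hive-style partition.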
How do you authorize QuickSight to have access to an S3 bucket?
From the QuickSight console
What are the two limitations of transitioning a Redshift cluster to use HSM encryption?
- You must create a new encrypted cluster and move your data
- You must use client and server certificates to create a trusted connection between Redshift and HSM
Preferred service to move 5TB data warehouse from on-prem to AWS over DX connection?
DMS
(DataSync could work)
What is the best way to get the most up-to-date results when querying data from multiple sources over JDBC?
EMR with Presto
Service combo to view near-real-time dashboards of streamed transaction data?
Firehose & OpenSearch
When querying time-stamped data using Redshift Spectrum, what should you use as the partition key?
Usually the date
Apache Hive on EMR - Where do you define permissions for EMR to access an S3 bucket?
On the IAM role (instance profile) attached to the EC2 instances that make up the cluster, NOT on EMR itself.
Can QuickSight access data in regions other than the one the console is hosted in?
Yes
How can you save the state of previously executed ETL jobs in AWS Glue?
Using job bookmarks
What can improve the performance of ETL’ing data into Redshift?
Load the data into a temporary ‘staging’ table, which is dropped automatically at the end of the session once the transformation is complete.
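One way to picture the staging pattern is as a short sequence of SQL statements run in a single session. The helper below is a sketch only: `staging_load_statements` is a hypothetical name, the `SELECT *` stands in for whatever transformation is actually needed, and identifiers are assumed to be trusted (nothing is escaped).

```python
def staging_load_statements(target: str, source_uri: str, iam_role_arn: str) -> list[str]:
    """Return SQL statements that load `target` through a temporary staging table."""
    stage = f"{target}_stage"  # hypothetical staging table name
    return [
        # Temporary table with the same column definitions as the target.
        f"CREATE TEMP TABLE {stage} (LIKE {target});",
        # Bulk-load the raw files into the staging table.
        f"COPY {stage} FROM '{source_uri}' IAM_ROLE '{iam_role_arn}' FORMAT AS CSV;",
        # Placeholder transformation: replace SELECT * with the real logic.
        f"INSERT INTO {target} SELECT * FROM {stage};",
    ]
```

All statements have to run on the same connection, because a temporary table is visible only to the session that created it.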
How can you store activity logs for an EMR cluster?
Enable logging to S3 upon creation
What is the fastest way to detect a threshold breach from data being streamed to AWS via Kinesis?
With a Kinesis Data Analytics application