Analytics Flashcards
What does Glue create when it scans your unstructured data in S3?
It creates metadata (table definitions and schemas) in the Glue Data Catalog, which can then be used to query the data.
What is Hive?
It lets you run SQL-like queries (HiveQL) on EMR.
Can you import a Hive metastore into AWS Glue?
Yes. You can also import AWS Glue metadata into Hive.
How can you increase the performance of a Spark Job?
Provision additional DPUs (Data processing units).
How can you determine how many DPUs you will need for your job?
Enable job metrics to understand the maximum capacity in DPUs that you will need.
Where are Glue errors reported?
CloudWatch
How can you schedule Glue jobs?
Use the Glue Scheduler. This is the most straightforward approach.
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords.
What is a DynamicRecord in AWS Glue?
They are records that are self-describing and have a schema.
Using native AWS Glue functionality, how can you drop fields or null fields?
Use the DropFields or DropNullFields transformation.
Using AWS Glue, how can you select a subset of records during your ETL process?
Use the Filter transformation.
How can you enrich your data from another source in AWS Glue?
Use the join transformation
What does the Map transformation in AWS Glue do?
It allows you to add fields, delete fields, and perform external lookups.
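The DropFields, Filter, Join, and Map cards above can be pictured with plain Python lists of dicts standing in for a DynamicFrame. This is only a sketch of what each transformation does to the records. The helper names (drop_fields, filter_records, join_records, map_records) and the sample data are hypothetical and are not the AWS Glue API.

```python
# Plain-Python stand-ins for four Glue transformations (illustrative only,
# not the awsglue library).

def drop_fields(records, fields):
    """Return copies of the records without the named fields."""
    return [{k: v for k, v in r.items() if k not in fields} for r in records]

def filter_records(records, predicate):
    """Keep only the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]

def join_records(left, right, key):
    """Inner-join two record lists on a shared key to enrich the left side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def map_records(records, fn):
    """Apply fn to a copy of every record; fn may add or delete fields."""
    return [fn(dict(r)) for r in records]

def add_tax(record):
    """Example mapping function that adds a derived field."""
    record["amount_with_tax"] = round(record["amount"] * 1.1, 2)
    return record

orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0, "note": None},
    {"order_id": 2, "customer_id": 11, "amount": 5.0, "note": None},
]
customers = [{"customer_id": 10, "name": "Ana"}, {"customer_id": 11, "name": "Bo"}]

trimmed = drop_fields(orders, {"note"})                        # DropFields
large = filter_records(trimmed, lambda r: r["amount"] >= 10)   # Filter
enriched = join_records(large, customers, "customer_id")       # Join
result = map_records(enriched, add_tax)                        # Map
print(result)
```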
What does the FindMatches ML transformation do in AWS Glue?
It identifies duplicate or matching records in your dataset, even when the records do not have a common identifier.
What format conversions can AWS Glue support?
CSV, JSON, Avro, Parquet, ORC, XML
What does AWS Glue ResolveChoice do?
It deals with ambiguities in your DynamicFrame and returns a new one. An example is two fields with the same name.
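Another ambiguity this kind of step is used for is a field whose values arrive with more than one type. The sketch below imitates, in plain Python, two ways of resolving that: casting every value to one type, or splitting the field into one column per type. The function names and the field_type column naming are assumptions made for illustration, not the Glue API.

```python
def resolve_by_cast(records, field, to_type):
    """Cast every value of `field` to a single type."""
    return [{**r, field: to_type(r[field])} for r in records]

def resolve_by_columns(records, field):
    """Split `field` into one column per observed type, e.g. price_int."""
    out = []
    for r in records:
        rest = {k: v for k, v in r.items() if k != field}
        rest[f"{field}_{type(r[field]).__name__}"] = r[field]
        out.append(rest)
    return out

# "price" is an int in one record and a string in the other.
rows = [{"id": 1, "price": 10}, {"id": 2, "price": "12"}]
cast_rows = resolve_by_cast(rows, "price", int)
split_rows = resolve_by_columns(rows, "price")
print(cast_rows)
print(split_rows)
```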
How do you update your Glue Data Catalog?
You can re-run the crawler or have a script use enableUpdateCatalog / updateBehavior
What are AWS Glue development endpoints?
They allow you to use a notebook to develop your ETL script. They are launched in a VPC and can be used with a SageMaker notebook or Zeppelin.
What do AWS Glue job bookmarks do?
They keep track of where a job left off so that old data is not reprocessed. They work with S3 sources and relational databases. For databases, they pick up only new rows, not updated ones, and the primary key must be sequential.
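The database case above can be sketched in plain Python: the bookmark is just the highest primary key already processed, which is why new rows are picked up but in-place updates are missed. The function run_incremental and the sample table are hypothetical; this is not how Glue stores bookmark state internally.

```python
def run_incremental(rows, bookmark):
    """Process only rows whose primary key is past the bookmark.

    Returns the rows picked up by this run and the new bookmark value.
    """
    new_rows = [r for r in rows if r["id"] > bookmark]
    new_bookmark = max((r["id"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
first, bookmark = run_incremental(table, bookmark=0)
first_ids = [r["id"] for r in first]

# A new row is appended and an old row is updated in place.
table.append({"id": 3, "v": "c"})
table[0]["v"] = "a-updated"

# Only the new row is picked up; the update to id 1 is not seen.
second, bookmark = run_incremental(table, bookmark)
print(first_ids, second, bookmark)
```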
Can you start a step function from an AWS Glue event?
Yes
How are you billed using AWS Glue?
You are billed by the second.
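Per-second billing combined with DPU count makes a rough cost estimate simple arithmetic. The sketch below assumes a rate of 0.44 USD per DPU-hour purely for illustration and ignores any minimum billing duration; check current pricing before relying on the number.

```python
def glue_job_cost(dpus, runtime_seconds, rate_per_dpu_hour):
    """Estimate the cost of one run when usage is metered per second."""
    dpu_hours = dpus * runtime_seconds / 3600
    return dpu_hours * rate_per_dpu_hour

RATE = 0.44  # assumed USD per DPU-hour, not an authoritative price

# 10 DPUs for 15 minutes is 2.5 DPU-hours.
cost = glue_job_cost(dpus=10, runtime_seconds=900, rate_per_dpu_hour=RATE)
print(f"{cost:.2f}")
```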
How are AWS Glue development endpoints billed?
By the minute
If you want to use engines like Hive or Pig, what AWS service is the best fit?
EMR. Glue is based on Spark
Can AWS Glue process streaming data?
Yes. It can do this from Kinesis or Kafka.
Can AWS Glue clean and transform streaming data in-flight?
Yes
What is AWS Glue Studio?
It is a visual interface for ETL workflows
Where can you view the status of AWS Glue Jobs running?
In the Glue Studio Monitoring console
What is AWS Glue Data Quality?
It evaluates the data based on rules that you set. Custom rules are written in DQDL (Data Quality Definition Language).
What is AWS Glue DataBrew?
A visual data preparation tool for transforming data.
What are Glue DataBrew sources?
S3, data warehouse, database
Where does Glue DataBrew output data?
S3
What is a recipe in DataBrew?
It is a saved set of transformations that can be applied to any dataset.
Can you define data quality rules in DataBrew?
Yes
How do Redshift and Snowflake integrate with DataBrew?
You can use custom SQL to create datasets.
Does DataBrew integrate with KMS?
Yes, but only with KMS customer master keys.
Can you schedule a job in DataBrew?
Yes
How can you remove PII in DataBrew?
Substitution
Shuffling
Deterministic encryption (DETERMINISTIC_ENCRYPT)
Probabilistic encryption
Nulling out or deletion
Masking out
Hashing
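A few of the techniques listed above can be sketched with the Python standard library to show what each does to a value. The function names, the mask character, and the choice of SHA-256 are assumptions for illustration; they are not DataBrew's own transforms or defaults.

```python
import hashlib
import random

def mask_out(value, visible=4, mask_char="#"):
    """Replace all but the last `visible` characters with a mask character."""
    keep = value[-visible:] if visible else ""
    return mask_char * (len(value) - len(keep)) + keep

def hash_value(value):
    """One-way SHA-256 hash; equal inputs always give equal outputs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def shuffle_column(rows, column, seed=0):
    """Shuffle one column's values across rows so they no longer line up."""
    values = [r[column] for r in rows]
    random.Random(seed).shuffle(values)
    return [{**r, column: v} for r, v in zip(rows, values)]

def null_out(rows, column):
    """Replace every value in the column with None."""
    return [{**r, column: None} for r in rows]

people = [
    {"id": 1, "ssn": "123-45-6789"},
    {"id": 2, "ssn": "987-65-4321"},
    {"id": 3, "ssn": "555-55-5555"},
]
masked = [{**p, "ssn": mask_out(p["ssn"])} for p in people]
hashed = [{**p, "ssn": hash_value(p["ssn"])} for p in people]
shuffled = shuffle_column(people, "ssn")
nulled = null_out(people, "ssn")
print(masked[0]["ssn"])
```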
What are Amazon EventBridge batch conditions?
They make a trigger fire only once a specified number of events has been received or a specified number of seconds has passed, whichever comes first.
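The batching behaviour can be sketched as a small class that collects events and fires when either a count or a time window is reached. BatchTrigger and its parameters batch_size and batch_window are hypothetical names for this illustration, with the window assumed to start at the first pending event; this is not the EventBridge or Glue API.

```python
class BatchTrigger:
    """Fire once `batch_size` events arrive or `batch_window` seconds pass."""

    def __init__(self, batch_size, batch_window):
        self.batch_size = batch_size
        self.batch_window = batch_window
        self.pending = []
        self.window_start = None  # time of the first pending event

    def on_event(self, event, now):
        """Record an event at time `now` (seconds); return a batch if it fires."""
        if self.window_start is None:
            self.window_start = now
        self.pending.append(event)
        return self._maybe_fire(now)

    def on_tick(self, now):
        """Check the time window without a new event."""
        return self._maybe_fire(now)

    def _maybe_fire(self, now):
        if not self.pending:
            return None
        size_hit = len(self.pending) >= self.batch_size
        window_hit = now - self.window_start >= self.batch_window
        if size_hit or window_hit:
            batch, self.pending, self.window_start = self.pending, [], None
            return batch
        return None

trigger = BatchTrigger(batch_size=3, batch_window=60)
r1 = trigger.on_event("a", now=0)    # nothing fires yet
r2 = trigger.on_event("b", now=10)   # nothing fires yet
r3 = trigger.on_event("c", now=20)   # third event reaches the batch size
r4 = trigger.on_event("d", now=100)  # starts a new window
r5 = trigger.on_tick(now=130)        # only 30 seconds into the window
r6 = trigger.on_tick(now=160)        # 60 seconds elapsed, window fires
print(r3, r6)
```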
What is AWS Lake Formation?
It makes it easy to set up a secure data lake in days.
What can you do in Lake Formation?
Anything that you can do in Glue. It is built on Glue.
What AWS services can query Lake Formation?
Athena, Redshift, and EMR
Can you have multiple accounts accessing Lake Formation?
Yes. The recipient must be a data lake administrator. You can leverage AWS RAM for this as well.
Does Lake Formation support manifests?
No
What are AWS Lake Formation Governed Tables?
They support ACID transactions across multiple tables. This setting cannot be changed once enabled. They also work with Kinesis streaming data.
How does Lake Formation optimize storage performance?
Automatic compaction
How can you control access to Lake Formation data?
Granular row and column level access
Other than IAM, what else can Lake Formation tie into for permissions?
SAML or external AWS accounts
What are Lake Formation policy tags attached to?
Databases, tables, or columns
What are AWS Lake Formation Filters?
They provide column, row, or cell level security. This is done when granting SELECT permissions on tables.
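Row, column, and cell level security can be pictured as a row predicate combined with a list of allowed columns, applied before results reach the user. The function apply_data_filter and the sample table are hypothetical and only sketch the idea; they are not the Lake Formation API.

```python
def apply_data_filter(rows, row_predicate, allowed_columns):
    """Return only permitted rows, each reduced to the permitted columns."""
    return [
        {c: r[c] for c in allowed_columns if c in r}
        for r in rows
        if row_predicate(r)
    ]

sales = [
    {"region": "EU", "customer": "Ana", "amount": 120, "card": "4111"},
    {"region": "US", "customer": "Bo", "amount": 80, "card": "4222"},
]

# Row filter only: every column, EU rows.
eu_rows = apply_data_filter(sales, lambda r: r["region"] == "EU", sales[0].keys())

# Column filter only: every row, no card numbers.
no_cards = apply_data_filter(sales, lambda r: True, ["region", "customer", "amount"])

# Cell level: row and column restrictions combined.
eu_no_cards = apply_data_filter(
    sales, lambda r: r["region"] == "EU", ["region", "customer", "amount"]
)
print(eu_no_cards)
```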
What is Amazon Athena?
A query service for your data in S3. The data stays in S3.
Is Athena serverless?
Yes
What data formats are splittable for parallel processing in Athena?
ORC, Parquet, and Avro
What are Athena Workgroups?
They organize users, teams, and applications into groups. You can control access and track costs by workgroup. They integrate with IAM, CloudWatch, and SNS.
Can you set query limits in Athena by using workgroups?
Yes. You can limit how much data a query scans.
Are Athena canceled queries billable?
Yes. Only failed queries are not billable.
Are CREATE / ALTER / DROP queries billable in Athena?
No
What can you do to save money querying data with Athena?
Use a columnar format such as ORC or Parquet. You will scan less data.
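The saving comes from paying for bytes scanned rather than for rows returned. The arithmetic below assumes a price of 5 USD per TB scanned and ten equally sized columns, both purely for illustration, and ignores compression.

```python
def scan_cost(bytes_scanned, usd_per_tb):
    """Cost of a query that is charged by the amount of data scanned."""
    return bytes_scanned / 1024**4 * usd_per_tb

PRICE = 5.0             # assumed USD per TB scanned, not an authoritative price
table_bytes = 1024**4   # a 1 TB table
columns_total = 10      # assumed to be of equal size
columns_needed = 2

# A row-based format such as CSV has to be read in full.
row_format_cost = scan_cost(table_bytes, PRICE)

# A columnar format lets the engine read only the columns the query uses.
columnar_bytes = table_bytes * columns_needed / columns_total
columnar_cost = scan_cost(columnar_bytes, PRICE)

print(f"{row_format_cost:.2f} vs {columnar_cost:.2f}")
```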
Do large files perform better in Athena?
Yes. A small number of large files performs better than a large number of small files.
What should you run when you partition after the fact in Athena?
Run MSCK REPAIR TABLE
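The command looks for Hive-style partition folders (key=value) under the table location and registers the ones the catalog does not know about yet. The discovery step can be sketched in plain Python over a list of object keys. The function discover_partitions and the sample keys are hypothetical; the real command runs inside Athena against S3.

```python
def discover_partitions(keys, table_prefix):
    """Collect distinct Hive-style partitions (key=value folders) under a prefix."""
    partitions = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        # Drop the table prefix and the trailing file name.
        folders = key[len(table_prefix):].split("/")[:-1]
        parts = tuple(tuple(f.split("=", 1)) for f in folders if "=" in f)
        if parts:
            partitions.add(parts)
    return sorted(partitions)

keys = [
    "logs/year=2024/month=01/part-0.parquet",
    "logs/year=2024/month=01/part-1.parquet",
    "logs/year=2024/month=02/part-0.parquet",
    "other/readme.txt",
]
found = discover_partitions(keys, "logs/")
print(found)
```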
If you want to ensure your table is ACID compliant in Athena, what table type should you use?
ICEBERG
What are Athena time travel operations?
You can recover data recently deleted with a SELECT statement
What should you do if your ACID transactions in Athena are getting slower over time?
Run the OPTIMIZE command with bin packing to compact the table: OPTIMIZE table REWRITE DATA USING BIN_PACK WHERE catalog = N
How granular does Athena get with permissions?
Database and table level.
What can you use to query Spark directly?
Spark SQL
Does Spark have machine learning capabilities?
Yes, using MLlib
Can you process streaming data with Spark?
Yes. It integrates with Kinesis and Kafka
Can you change data formats in Athena?
Yes, using CTAS and the format attribute.
What is Spark Structured Streaming?
New data is continually appended to an unbounded table, and you query it over windows of time.
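The append-and-window idea can be sketched without Spark: events are appended to a growing list, and each query groups them into fixed-size time windows. The function window_counts and the sample events are hypothetical; this is plain Python, not the Spark Structured Streaming API.

```python
from collections import Counter

def window_counts(events, window_seconds):
    """Count events per tumbling window, keyed by the window's start time."""
    counts = Counter()
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# The "table" only ever grows as new (timestamp, payload) events arrive.
table = []
table.extend([(1, "a"), (4, "b"), (11, "c")])
first = window_counts(table, window_seconds=10)

table.extend([(12, "d"), (25, "e")])
second = window_counts(table, window_seconds=10)
print(first, second)
```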
Can Spark support Redshift?
Yes
Can you run a Jupyter notebook with Spark within the Athena console?
Yes
What is Amazon EMR?
A managed Hadoop framework that runs on EC2. Includes Spark, HBase, Presto, Flink, Hive, and more.
What are EMR notebooks?
Browser-based development in a notebook.
What are the EMR node types?
Master, Core, and Task nodes
Where does data persist in EMR?
On the core nodes in HDFS
Do EMR task nodes store data?
No
What is a good strategy to reduce EMR costs?
Use spot instances for task nodes since they do not persist data.
What is a transient EMR cluster?
One that terminates once all the steps are complete. Good for cost savings.
What is a long running cluster?
One that must be manually terminated. A good use of reserved instances for cost savings.
When do you configure frameworks and applications in EMR?
When the cluster is launched.
How can you run EMR jobs directly?
By connecting to the master node or submitting jobs via ordered steps in the console.