Analytics - AWS Glue Flashcards
Serverless discovery and definition of table definitions and schema.
Serves as a central metadata repository for your data lake.
Publishes table definitions for Athena, EMR, or Redshift.
a) AWS GLUE
b) RDS
a) AWS Glue
Runs custom ETL jobs, either trigger-driven, scheduled, or on demand.
a) AWS GLUE
b) RDS
AWS Glue
With AWS Glue you have the power of an Apache ________ cluster, without having to manage the cluster.
a) AWS GLUE
b) RDS
c) Spark
Spark.
One piece of it is the Glue Crawler and the Data Catalog, which the crawler populates.
______________ scans data in S3 to infer schemas. Can be run periodically.
a) AWS GLUE
b) RDS
c) Spark
d) Glue Crawler
Glue Crawler
Glue provides the "glue" between an unstructured data lake and relational database interfaces.
Stores table definitions; the original data stays in S3.
Once catalogued, you can treat your unstructured data as if it were structured, with:
Redshift,
Athena,
EMR,
QuickSight.
Glue Crawler will extract ___________ of your data based on how your S3 data is organised.
a) AWS GLUE partitions
b) RDS
c) Spark
d) Glue Crawler
a) AWS Glue partitions
When thinking about AWS Glue partitions: how would you organise them if a device is sending in data every hour?
Your goal is to query by device rather than by hour.
a) Do you query primarily by time ranges?
b) Do you query primarily by device?
b) This way your primary partition would be the device ID, followed by the year, month, and day.
You want to make sure that the device is the top-level prefix. That is the optimal way to organise and access your data.
(If you query primarily by time ranges instead, as in a, organise your buckets by year, month, and day, then by the device that generated the data, so you can efficiently access data per date.)
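That device-first layout can be sketched in plain Python (hypothetical device ID, file format, and field names, purely to illustrate the prefix order):

```python
from datetime import datetime

def partition_key(device_id: str, ts: datetime) -> str:
    """Build an S3 object key with the device as the top-level partition,
    so a query filtered by device only scans that device's prefix.
    Illustrative layout only; the names here are made up."""
    return (f"device={device_id}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{ts.hour:02d}.json")

key = partition_key("sensor-42", datetime(2023, 7, 4, 15))
# -> "device=sensor-42/year=2023/month=07/day=04/15.json"
```

Querying primarily by time instead would simply flip the order: year/month/day first, then the device.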
The Glue Data Catalog can provide metadata to _______ running on EMR. Similar to how Glue provides the glue between your unstructured data on S3 and AWS services like Redshift and Athena, it can do the same for services running on your EMR cluster.
a) Hive
b) RDS
Hive
SQL-like queries on your EMR cluster.
a) HiveQL
b) Parquet
a) HiveQL
_______________ can automatically generate code for transforming your data, in either Scala or Python.
a) Glue ETL
b) RDS
c) Apache Spark cluster
d) EMR
a) Glue ETL.
Makes use of an Apache Spark cluster underneath.
- Encryption: server-side (at rest) or SSL (in transit), both handled for you.
Event-driven: can run as soon as new data is seen by Glue.
a) Glue ETL
b) RDS
c) Apache Spark cluster
d) EMR
a) Glue ETL
To increase the performance of the underlying AWS Glue Spark jobs, you would provision additional ___________
a) DTUs
b) vCores
c) DPUs
d) WPUs
c) DPUs
To increase the performance of an Apache Spark job when using the AWS Glue ETL process, how do you figure out how many DPUs to configure?
a) Set up another RDS instance and show all of it in a database.
b) Set up an email alert for when new data is entered into S3.
c) Enable job metrics for your job, to study the maximum capacity needed for the DPUs in your job.
c) You would study the metrics set up for the job to help come up with an acceptable maximum capacity.
To be notified of errors encountered along the AWS Glue ETL pipeline, _____________ can be set up with SNS to automatically send a text or otherwise notify you if your ETL process runs into trouble.
a) DTUs
b) CloudWatch
c) DPUs
d) WPUs
b) CloudWatch
________________ is a system that lets you automatically process and transform your data via a graphical interface, which lets you define how you want that transformation to work.
a) EMR
b) Apache Spark
c) Glue ETL
c) Glue ETL
____________ is the AWS service that allows you to transform data, clean data, and enrich data (before doing analysis).
a) EMR
b) Apache Spark
c) Glue ETL
c) Glue ETL
_____________ is the AWS service where you would generate ETL code in Python or Scala, which you can modify with your own code.
The target would be S3, JDBC (RDS, Redshift), or the Glue Data Catalog.
a) EMR
b) Apache Spark
c) Glue ETL
Glue ETL
In Glue ETL, the ____________________ is a collection of DynamicRecords, which are self-describing and have a schema.
a) DynamicFrame
b) Apache Spark
c) RDS metrics
a) DynamicFrame
for example:
val pushdownEvents = glueContext.getCatalogSource(
    database = "githubarchive_month", tableName = "data").getDynamicFrame()
val projectedEvents = pushdownEvents.applyMapping(Seq(
    ("id", "string", "id", "long"),
    ("type", "string", "type", "string"),
    ("actor.login", "string", "actor", "string"),
    ("repo.name", "string", "repo", "string"),
    ("payload.action", "string", "action", "string"),
    ("org.login", "string", "org", "string"),
    ("year", "string", "year", "int"),
    ("month", "string", "month", "int"),
    ("day", "string", "day", "int")))
With Glue ETL, what type of ________________ transformation (that comes out of the box) would you use to remove empty data? A common operation in pre-processing.
a) Bundled
b) Informative
c) Machine Learning
a) Bundled
DropNullFields can drop null fields to remove empty data from your incoming data for you.
In Glue ETL there are certain _______________ transformations that come out of the box.
These include DropFields, DropNullFields, Filter (specify a function to filter records), Join (to enrich data), and Map (add fields, delete fields, perform external lookups; it transforms data one row at a time).
a) Bundled
b) Informative
c) Machine Learning
a) Bundled
Like using Filter to extract only a subset of the data to analyse.
For example, Map can be used if there are a bunch of columns coming into your data: use Map to extract the fields you actually care about, and delete the ones you don't.
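These bundled transformations can be mimicked on plain Python dicts to see what each does to a row (this is not the real awsglue API, just a sketch of the behaviour):

```python
# Toy row-like records; field names are made up for illustration.
records = [
    {"id": 1, "type": "PushEvent", "actor": "alice", "debug": None},
    {"id": 2, "type": "ForkEvent", "actor": None,    "debug": "x"},
]

def drop_null_fields(rec):          # like DropNullFields
    return {k: v for k, v in rec.items() if v is not None}

def drop_fields(rec, fields):       # like DropFields
    return {k: v for k, v in rec.items() if k not in fields}

# Filter: keep only the records we care about.
pushes = [r for r in records if r["type"] == "PushEvent"]

# Map: transform one row at a time.
cleaned = [drop_fields(drop_null_fields(r), {"debug"}) for r in pushes]
# cleaned == [{"id": 1, "type": "PushEvent", "actor": "alice"}]
```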
In Glue ETL, the FindMatches __________________ transformation's purpose is to identify duplicate or matching records in your data set, even if those records do not have a common unique identifier and no fields match exactly.
a) Bundled
b) Informative
c) Machine Learning
c) Machine Learning
The FindMatches ML transformation can learn what a duplicate record looks like, even if it is not an exact duplicate.
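As a toy illustration of the idea (not the real FindMatches algorithm, which is trained from labelled examples), a string-similarity score with a hand-picked threshold can flag likely duplicates even when no field matches exactly:

```python
from difflib import SequenceMatcher

def similarity(a: dict, b: dict) -> float:
    """Crude record similarity: compare the concatenated field values.
    Stand-in for a learned matching model, for illustration only."""
    sa = " ".join(str(v).lower() for v in a.values())
    sb = " ".join(str(v).lower() for v in b.values())
    return SequenceMatcher(None, sa, sb).ratio()

r1 = {"name": "Jon Smith",  "city": "New York"}
r2 = {"name": "John Smith", "city": "NYC"}
r3 = {"name": "Ana Gomez",  "city": "Madrid"}

# r1 and r2 share no identical field, yet score as a likely match;
# r1 and r3 do not. 0.6 is an arbitrary threshold for this toy data.
likely_dup = similarity(r1, r2) > 0.6 and similarity(r1, r3) < 0.6
```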