Analytics - AWS Glue Flashcards
Serverless discovery and definition of table definitions and schemas.
Serves as a central metadata repository for our data lake.
Publishes table definitions for Athena, EMR, or Redshift.
a) AWS GLUE
b) RDS
AWS GLUE.
Has custom ETL jobs that are either trigger-driven, scheduled, or run on demand.
a) AWS GLUE
b) RDS
AWS Glue
With AWS Glue you have the power of an Apache ________ cluster, without having to manage the cluster yourself.
a) AWS GLUE
b) RDS
c) Spark
Spark.
One piece of it is the Glue Crawler and the Data Catalog, which the crawler populates.
______________ scans data in S3 to create schemas. Can be run periodically.
a) AWS GLUE
b) RDS
c) Spark
d) Glue Crawler
Glue Crawler
Uses Glue to provide the "glue" between your unstructured data lake and relational database interfaces.
Stores table definitions; the original data stays in S3.
Once catalogued, you can treat your unstructured data as if it were structured, with:
Redshift,
Athena,
EMR,
QuickSight.
The Glue Crawler will extract ___________ of your data based on how your S3 data is organised.
a) AWS GLUE partitions
b) RDS
c) Spark
d) Glue Crawler
Glue and S3 partitions
When thinking about AWS Glue partitions, how would you organise them if a device is sending in data every hour?
Your goal is to query by device rather than by hour.
a) Do you query primarily by time ranges?
b) Do you query primarily by device?
b) This way your primary partition would be the device ID, followed by the year, month, and day.
You want to make sure that the device is the top-level prefix in the bucket; that is the optimal way to organise and extract your data. (See the example layout below.)
(If you had answered a, partitioning based on time ranges, you would organise your buckets by year, month, and date, then by the device that generated the data, so you can efficiently access data per date.)
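For example, a hypothetical key layout (the bucket and field names are illustrative, not from the course):
Query primarily by device: s3://my-bucket/device_id=123/year=2024/month=01/day=15/readings.json
Query primarily by time:   s3://my-bucket/year=2024/month=01/day=15/device_id=123/readings.json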
The Glue Data Catalog can provide metadata to _______ running on EMR. Similar to how it provides the glue between your unstructured data on S3 and AWS services like Redshift and Athena, it can do the same for services running on your EMR cluster.
a) Hive
b) RDS
Hive
________ gives you SQL-like queries on your EMR cluster.
a) HiveQL
b) Parquet
a) HiveQL
_______________ automatically generates code for transforming your data, in either Scala or Python.
a) Glue ETL
b) RDS
c) Apache Spark cluster
d) EMR
a) Glue ETL.
It makes use of the Apache Spark cluster underneath.
- Encryption: server-side (at rest) or SSL (in transit), both handled for you.
____________ can be event-driven, running as soon as new data is seen by Glue.
a) Glue ETL
b) RDS
c) Apache Spark cluster
d) EMR
a) Glue ETL
To increase the performance of the underlying AWS Glue Spark jobs, you would provision additional ___________.
a) DTUs
b) vCores
c) DPUs
d) WPUs
c) DPUs
To increase the performance of an Apache Spark job when using the AWS Glue ETL process, how do you figure out how many DPUs to configure?
a) Set up another RDS instance and show all of it in a database.
b) Set up an email alert when new data is entered into S3.
c) Enable job metrics for your job and study them to work out the maximum capacity (DPUs) needed.
c) You would study the metrics set up for the job to help come up with an acceptable maximum capacity.
To be notified of errors encountered along the AWS Glue ETL pipeline, _____________ can be set up with SNS to automatically send a text or other notification if your ETL process runs into trouble.
a) DTUs
b) CloudWatch
c) DPUs
d) WPUs
b) CloudWatch
________________ is a system that lets you automatically process and transform your data via a graphical interface, which lets you define how you want the transformation to work.
a) EMR
b) Apache Spark
c) Glue ETL
c) GLUE ETL
The AWS service ____________ allows you to transform, clean, and enrich data (before doing analysis).
a) EMR
b) Apache Spark
c) Glue ETL
GLUE ETL
_____________ is the AWS service where you generate ETL code in Python or Scala, which you can then modify with your own code.
The target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog.
a) EMR
b) Apache Spark
c) Glue ETL
Glue ETL
In Glue ETL, the ____________________ is a collection of DynamicRecords, which are self-describing and have a schema.
a) DynamicFrame
b) Apache Spark
c) RDS metrics
a) DynamicFrame
for example:
val pushdownEvents = glueContext.getCatalogSource(
  database = "githubarchive_month", tableName = "data").getDynamicFrame()

// Map each source field (name, type) to a target field (name, type)
val projectedEvents = pushdownEvents.applyMapping(Seq(
  ("id", "string", "id", "long"),
  ("type", "string", "type", "string"),
  ("actor.login", "string", "actor", "string"),
  ("repo.name", "string", "repo", "string"),
  ("payload.action", "string", "action", "string"),
  ("org.login", "string", "org", "string"),
  ("year", "string", "year", "int"),
  ("month", "string", "month", "int"),
  ("day", "string", "day", "int")))
With Glue ETL, what type of ________________ transformation (that comes out of the box) would you use to remove empty data? A common operation in pre-processing.
a) Bundled
b) Informative
c) Machine Learning
a) Bundled
It can drop null fields to remove empty data from your incoming data for you.
In Glue ETL there are certain _______________ transformations that come out of the box.
These include DropFields, DropNullFields, Filter (specify a function to filter records), Join (to enrich data), and Map (add fields, delete fields, perform external lookups; i.e. transform data one row at a time).
a) Bundled
b) Informative
c) Machine Learning
a) Bundled
For example, use Filter to extract only a subset of data to analyse.
Or, if there are many columns coming into your data, use Map to extract the fields you actually care about and delete the ones you don't. (See the sketch after this card.)
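A minimal sketch of a couple of bundled transformations in a Glue Python (PySpark) script; the database, table, and device_id field are hypothetical, not from the course:

from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load a DynamicFrame from the Glue Data Catalog
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# Bundled transformations: drop empty fields, then keep only one device's records
cleaned = DropNullFields.apply(frame=events)
device_only = Filter.apply(frame=cleaned, f=lambda row: row["device_id"] == "123")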
In Glue ETL, the FindMatches __________________ transformation's purpose is to identify duplicate or matching records in your data set, even if those records do not have a common unique identifier and no fields match exactly.
a) Bundled
b) Informative
c) Machine Learning
Machine Learning
The FindMatches ML transformation can learn what a duplicate record looks like, even if the records don't match exactly.
In Glue ETL, _______________________ can be used to automatically convert between CSV, JSON, Avro, Parquet, ORC, and XML.
a) Bundled
b) Informative
c) Machine Learning
d) Format conversions
Format Conversions.
Say you have data coming in as one format and you need to transform it into another. (See the sketch below.)
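A minimal sketch of a format conversion, reusing the glue_context and cleaned frame from the sketch above; the output path is hypothetical:

# Write the DynamicFrame back out to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet")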
______________________ deals with ambiguities in a DynamicFrame and returns a new one. (See the sketch after this card.)
For example, two fields with the same name.
a) Bundled
b) Resolve Choice
c) Machine Learning
d) Format conversions
b) Resolve Choice
make_cols: creates a new column for each type.
cast: casts values to a specified type.
make_struct: creates a structure that contains each data type.
project: projects every type to a given type, for example project: string.
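A minimal sketch of ResolveChoice, assuming a hypothetical field "price" that arrives as both string and double:

from awsglue.transforms import ResolveChoice

# cast: force the ambiguous field to a single type
resolved = ResolveChoice.apply(frame=events, specs=[("price", "cast:double")])
# make_cols: alternatively, split it into separate columns per type (price_string, price_double)
split = ResolveChoice.apply(frame=events, specs=[("price", "make_cols")])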
When modifying your _______ __________, there are alternatives to just re-running the crawler.
Such as:
ETL scripts can update your schema and partitions if necessary.
Adding new partitions: re-run the crawler, or have the script use the enableUpdateCatalog and partitionKeys options.
Updating table schema: re-run the crawler, or use enableUpdateCatalog / updateBehavior from the script. (See the sketch after this card.)
Creating new tables: use enableUpdateCatalog / updateBehavior with setCatalogInfo.
a) Glue Catalog
b) RDS
c) S3
a) Glue Catalog
Restrictions: S3 only; JSON, CSV, Avro, Parquet only; Parquet requires special code; nested schemas are not supported.
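A minimal sketch of having the script update the Data Catalog on write, using the getSink pattern; the path, database, table, and partition keys are hypothetical:

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"])
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_table")
sink.setFormat("glueparquet")   # Parquet needs this special format name
sink.writeFrame(cleaned)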
_____________ allows you to keep track of where you left off, so you can prevent reprocessing of old data and process only the new data when you rerun on a schedule. (See the sketch after this card.)
_______________ works with relational databases via JDBC if your primary keys are in sequential order.
_______________ can fire off CloudWatch Events using Lambda functions or SNS notifications.
You can invoke an EC2 run to do further processing, send the event on to Kinesis, activate a Step Function, or whatever else you might want to do.
a) Glue Catalog
b) RDS
c) S3
d) Job bookmark
d) Job bookmark
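A minimal sketch of a job that participates in bookmarks (bookmarks themselves are switched on with the job parameter --job-bookmark-option job-bookmark-enable); the names are hypothetical:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the handle the bookmark uses to remember where this source left off
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table",
    transformation_ctx="events")

job.commit()  # persists the bookmark state for the next run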
If you want to use engines other than Glue ETL (which is based on Spark), what AWS service would you use? ____________
a) Pig
b) Hive
c) S3
d) EMR
d) EMR
For example, legacy code that won't work with Apache Spark.
____________________ supports serverless streaming ETL, where it consumes from Kinesis or Kafka and stores results into S3 or other data stores.
a) Glue Catalog
b) RDS
c) S3
d) Glue ETL
d) Glue ETL
Runs on Apache Spark Structured Streaming. As data gets added, it can be transformed using the same kind of code you would use for batch processing. So if you have a Spark script on Glue ETL built for processing batch data, you can adapt it for streaming data as well. (See the sketch below.)
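A minimal sketch of streaming ETL reading from a Kinesis-backed catalog table; the table name, window size, and paths are hypothetical:

# Read the stream as a Spark Structured Streaming DataFrame
stream_df = glue_context.create_data_frame.from_catalog(
    database="my_database", table_name="my_kinesis_table",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})

def process_batch(data_frame, batch_id):
    # Reuse the same batch-style transformation code on each micro-batch
    data_frame.write.mode("append").parquet("s3://my-bucket/streaming-output/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={"windowSize": "100 seconds",
             "checkpointLocation": "s3://my-bucket/checkpoints/"})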
With AWS Glue Studio, __________________________ can be added as a step injected into your Glue jobs to automatically evaluate the quality of the data coming in.
If the data violates certain parameters or rules that you set up, it can automatically fail the job or just log something in CloudWatch for you.
a) AWS Glue Data Quality
b) RDS
c) S3
AWS Glue Data Quality
___________________ rules may be created manually or recommended automatically.
For example, you can tell _____________________ to look at source data and infer rules from it.
You can then integrate them into your Glue jobs.
a) Glue Data Quality
b) RDS
c) S3
Glue Data Quality
____________ uses the Data Quality Definition Language (DQDL). (See the example ruleset after this card.)
a) Glue Data Quality
b) RDS
c) S3
Glue Data Quality
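For example, a small hypothetical DQDL ruleset (the column names are illustrative, not from the course):

Rules = [
    IsComplete "order_id",
    ColumnValues "quantity" > 0,
    RowCount > 1000
]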
In Glue Data Quality for an ETL job, you can use _________________ to report failures, which helps with better handling of false positives.
a) Hive
b) CloudWatch
c) S3
b) CloudWatch
There you can interpret them and act on them as you see fit.
___________________ is a visual data preparation tool with a UI for pre-processing large data sets. Those data sets can come from S3, a data warehouse, or a database.
The output then goes to S3 once transformed by ____________.
a) Glue Data Quality
b) Glue ETL
c) S3
d) Glue DataBrew
Glue DataBrew
_________________________ has over 250 ready-made transformations that can be applied visually to your data.
a) Glue Data Quality
b) Glue ETL
c) S3
d) Glue DataBrew
DataBrew
______________________ can have recipes of transformations that can be saved as jobs within a larger project.
Those recipes are made up of recipe actions.
a) Glue Data Quality
b) Glue ETL
c) S3
d) Glue DataBrew
DataBrew
With _________________________ you can create datasets with custom SQL from Redshift and Snowflake.
a) Glue Data Quality
b) Glue ETL
c) S3
d) Glue DataBrew
Glue DataBrew
You can also integrate with KMS, SSL in transit, IAM, CloudWatch & CloudTrail.
___________________ is a possible alternative to Glue ETL
a) Glue Data Quality
b) Glue ETL
c) S3
d) Glue DataBrew
Glue DataBrew
DataBrew PII transformation techniques: _______________ (Replace with Random).
a) Glue Data Quality
b) Glue ETL
c) S3
d) Substitution.
d) Substitution
DataBrew PII transformation techniques: _______________ (SHUFFLE_ROWS).
a) Glue Data Quality
b) Glue ETL
c) S3
d) Shuffling.
d) Shuffling
DataBrew PII transformation techniques: _______________ (Deterministic encrypt).
a) Deterministic Encryption
b) Decryption
c) Masking
d) Substitution.
a) Deterministic Encryption
DataBrew PII transformation technique ____________________ (MASK_CUSTOM, _DATE, _DELIMITER, _RANGE).
a) Deterministic Encryption
b) Decryption
c) Masking out
d) Substitution.
c) Masking Out
DataBrew PII transformation technique ____________ (CRYPTOGRAPHIC_HASH).
a) Deterministic Encryption
b) Decryption
c) Masking
d) Substitution.
e) Hashing
Hashing
________________________ is a way of organising larger workflows; one of many orchestration tools.
You can design multi-job, multi-crawler ETL processes that run together.
a)RDS
b) S3
d) Glue Workflows
Glue Workflows
For Glue Workflows, ______________ within workflows start jobs or crawlers.
a) Triggers
b) EMR cluster
c) Elastic Job
d) CloudWatch SNS notifications
a) Triggers
They can be fired when jobs or crawlers complete, scheduled based on a cron expression, or started on demand. (See the sketch below.)
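A minimal sketch (boto3) of creating a scheduled trigger inside a workflow; the workflow, trigger, and job names are hypothetical:

import boto3

glue = boto3.client("glue")
glue.create_trigger(
    Name="nightly-trigger",
    WorkflowName="my-etl-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",        # every day at 02:00 UTC
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True)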
In Glue Workflow triggers, _______________ can start a trigger on single events or batches of events. Optional batch conditions: batch size (number of events) and batch window (within X seconds; default is 15 minutes).
a) Triggers
b) EMR cluster
c) Elastic Job
d) EventBridge
d) EventBridge
________________ makes it easy to set up a secure data lake in days.
Loading data & monitoring data flows, setting up partitions, encryption & managing keys, defining transformation jobs & monitoring them, access control, auditing. Built on top of Glue.
a) Triggers
b) EMR cluster
c) LAke Formation
d) Event bridge
Lake Formation
Built on top of Glue.
Loads data from external data sources, either in S3 or on-premises.
Lake Formation sits alongside that S3 data lake, helping you create it and maintain it going forward.
Data can come from external databases like RDBMS, NoSQL, or S3. You can set up all the crawlers, ETL jobs, data catalogs, security and access control, and any data cleaning or transformations, including conversion to Parquet or ORC.
It can then integrate with Athena, Redshift or Redshift Spectrum, and EMR (which can talk directly to Lake Formation as well).
You need to create a _____________ bucket for the lake.
a) Lake Formation
b) EMR cluster
c) Elastic Job
d) S3
S3
When setting up Lake Formation to build a data lake, you need to ______________ the S3 path in Lake Formation and grant permissions.
a) Triggers
b) EMR cluster
c) Elastic Job
d) Register
Register
Then create the database in Lake Formation for the Data Catalog.
Then use a blueprint for a workflow.
Run the workflow.
Grant SELECT permissions to whoever needs to read it. (See the sketch below.)
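A minimal sketch (boto3) of two of the steps above, registering the S3 path and granting SELECT; the bucket, database, table, and principal ARN are hypothetical:

import boto3

lf = boto3.client("lakeformation")

# Register the S3 path with Lake Formation
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",
    UseServiceLinkedRole=True)

# Grant SELECT on a catalog table to whoever needs to read it
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "my_lake_db", "Name": "my_table"}},
    Permissions=["SELECT"])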
In AWS Lake Formation, ________________________ permissions are set up when you need people across different accounts accessing your data lake and Lake Formation; you need to make sure the recipient is set up as a data lake administrator.
a) cross account
b) iam permissions
c) Lake formation manifests
cross account
When looking into AWS Lake Formation permissions, _____________________ can be used for accounts external to your organisation.
a) cross account
b) iam permissions
c) Lake formation manifests
d) AWS Resource Access Manager
AWS Resource Access Manager
When looking at AWS Lake Formation access, _________________________ on the KMS encryption key are needed for encrypted data catalogs in Lake Formation.
a) cross account
b) iam permissions
c) Lake formation manifests
d) AWS Resource Access Manager
IAM Permissions.
AWS Lake Formation supports ____________________, which support ACID transactions across multiple tables.
a) Governed Tables
b) Transactional Tables
a) Governed Tables
A new type of S3 table.
Can't change the choice of governed afterwards.
Works with streaming data too (Kinesis).
Can query with Athena.
Granular access control with row- and cell-level security, both for governed tables and S3 tables (may incur additional charges).