[FAQs] Data Ecosystem Flashcards
What ML capabilities does QuickSight have?
- Discover anomalies
- Forecasting
- Auto Narratives with natural language
What platform does Glue use for ETL?
Apache Spark
What are the key components of Glue?
- Data Catalog
- ETL engine using Python or Scala
- Scheduling engine
What data sources does Glue support?
For ETL jobs: S3, Redshift, and most databases running on RDS or EC2
What platform does the Glue Data Catalog use?
It is an Apache Hive Metastore-compatible catalog
Can you edit a Glue Data Catalog manually?
Yes, using the console, the API, or by importing another Hive Metastore
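A minimal sketch of the API route, assuming boto3 and its Glue `create_table` call. The helper names `build_table_input` and `register_table`, plus the table, bucket and column names, are hypothetical and only illustrate the shape of the request:

```python
def build_table_input(name, location, columns):
    """Build a minimal TableInput structure for the Glue Data Catalog.

    columns is a list of (column_name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
        },
    }


def register_table(glue_client, database, table_input):
    """Create the table in an existing catalog database."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical table definition, for illustration only.
table_input = build_table_input(
    name="orders",
    location="s3://example-bucket/orders/",
    columns=[("order_id", "string"), ("amount", "double")],
)
```

With credentials configured, `register_table(boto3.client("glue"), "sales_db", table_input)` would add the table to a database called `sales_db` (also a hypothetical name).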
Can you include custom libraries in Glue ETL jobs?
Yes, you can import custom Python libraries and Jar files
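As a rough sketch, custom libraries are attached through job arguments. `--extra-py-files` and `--extra-jars` are Glue's special job parameters; the helper name `library_arguments` and the S3 paths below are hypothetical:

```python
def library_arguments(py_files, jars):
    """Build Glue job arguments that attach custom libraries.

    Each parameter takes a comma-separated list of S3 paths.
    """
    args = {}
    if py_files:
        args["--extra-py-files"] = ",".join(py_files)
    if jars:
        args["--extra-jars"] = ",".join(jars)
    return args


# Hypothetical S3 locations, for illustration only.
example_args = library_arguments(
    py_files=["s3://example-bucket/libs/helpers.zip"],
    jars=["s3://example-bucket/jars/custom-serde.jar"],
)
print(example_args)
```

The resulting dictionary can be supplied as the job's default arguments when the job is defined, or as per-run arguments when a run is started.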
How can Glue jobs be triggered?
Manually, on a schedule, when another job finishes, or from an external caller such as Lambda
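A minimal sketch of the Lambda route, assuming boto3's Glue `start_job_run` call. The job name `example-etl-job`, the `--source_key` argument and the optional `glue_client` parameter (added so the handler can be exercised without AWS access) are hypothetical:

```python
def lambda_handler(event, context, glue_client=None):
    """Start a Glue job run in response to a Lambda invocation."""
    if glue_client is None:
        import boto3  # bundled with the Lambda Python runtime

        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(
        JobName="example-etl-job",  # hypothetical job name
        # Hypothetical job argument, forwarded from the triggering event.
        Arguments={"--source_key": event.get("key", "")},
    )
    return {"JobRunId": response["JobRunId"]}
```

Lambda only ever passes `event` and `context`, so the third parameter falls back to a real Glue client in production.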
Can Glue ETL be used for streaming data?
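Originally no - Kinesis Data Firehose / Analytics was used as an intermediary. Glue now also offers streaming ETL jobs that read from Kinesis Data Streams and Apache Kafka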
What ML capabilities does Glue have?
The FindMatches transform performs deduplication of records
How are ML transforms managed?
You create them for your dataset - you must provide labelled ground truth data
Why might you use EMR over Glue?
EMR gives you direct access to the Hadoop environment, so you have greater flexibility
How is the power of Glue ETL jobs specified?
In Data Processing Units (DPUs)
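Because capacity is expressed in DPUs, a job's usage can be reasoned about in DPU-hours. The sketch below is illustrative arithmetic only: the function name is hypothetical, the rate of 0.44 per DPU-hour is an assumed figure rather than a quoted price, and minimum billing durations are ignored:

```python
def estimate_job_cost(dpus, runtime_seconds, rate_per_dpu_hour):
    """Estimate cost as DPUs x hours x rate (no minimum duration applied)."""
    dpu_hours = dpus * runtime_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


# Assumed figures: 10 DPUs for 15 minutes at an assumed rate of 0.44 per DPU-hour.
cost = estimate_job_cost(dpus=10, runtime_seconds=900, rate_per_dpu_hour=0.44)
print(f"{cost:.2f}")  # 2.5 DPU-hours at the assumed rate
```

Check the current regional pricing before relying on any such estimate.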
Broadly speaking, what is EMR?
A hosted Hadoop service running on EC2 and S3
How can ad hoc analysis be done with EMR?
Using EMR Notebooks, which are a managed environment based on Jupyter