[FAQs] Data Ecosystem Flashcards
What ML capabilities does QuickSight have?
- Discover anomalies
- Forecasting
- Auto Narratives with natural language
What platform does Glue use for ETL?
Apache Spark
What are the key components of Glue?
- Data Catalog
- ETL engine using Python or Scala
- Scheduling engine
What data sources does Glue support?
ETL jobs can read from S3, Redshift and most databases running on RDS or EC2
What platform does the Glue Data Catalog use?
It is compatible with the Apache Hive Metastore
Can you edit a Glue Data Catalog manually?
Yes, using the console or API, or by importing another Hive Metastore
Can you include custom libraries in Glue ETL jobs?
Yes, you can import custom Python libraries and Jar files
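As a rough sketch of how custom libraries are attached: Glue jobs accept the special parameters `--extra-py-files` and `--extra-jars`, each a comma-separated list of S3 paths. The helper name and the S3 paths below are hypothetical and only illustrate the shape of the arguments.

```python
def build_library_arguments(py_files=(), jars=()):
    """Build Glue job arguments that attach custom libraries.

    Both Glue parameters take a comma-separated list of S3 paths.
    """
    arguments = {}
    if py_files:
        arguments["--extra-py-files"] = ",".join(py_files)
    if jars:
        arguments["--extra-jars"] = ",".join(jars)
    return arguments


# Hypothetical bucket and file names, for illustration only.
example_arguments = build_library_arguments(
    py_files=["s3://example-bucket/libs/helpers.zip"],
    jars=["s3://example-bucket/libs/connector.jar"],
)
```

The resulting dictionary could be supplied as the job's default arguments when the job is defined, or as per-run arguments when a run is started.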
How can Glue jobs be triggered?
Manually, on a schedule, when another job finishes, or from an external service such as Lambda
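A minimal sketch of the Lambda route, assuming the function is invoked by an S3 event notification and starts a job through the boto3 Glue client's `start_job_run` call. The job name and the `--input_path` argument are hypothetical; the optional `glue_client` parameter exists only so the handler can be exercised without AWS credentials.

```python
from urllib.parse import unquote_plus

JOB_NAME = "example-etl-job"  # hypothetical Glue job name


def lambda_handler(event, context, glue_client=None):
    """Start a Glue job run for the object named in an S3 event."""
    if glue_client is None:
        import boto3  # available in the Lambda Python runtime

        glue_client = boto3.client("glue")

    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(s3_info["object"]["key"])

    response = glue_client.start_job_run(
        JobName=JOB_NAME,
        # "--input_path" is a user-defined job argument (an assumption here).
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": response["JobRunId"]}
```

Lambda itself calls the handler with only `event` and `context`, so the real boto3 client is created in that case.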
Can Glue ETL be used for streaming data?
Yes - Glue streaming ETL jobs can read from Kinesis Data Streams and Apache Kafka; alternatively, use Kinesis Data Firehose / Analytics as an intermediary
What ML capabilities does Glue have?
The FindMatches transform performs deduplication of records
How are ML transforms managed?
You create them for your dataset - you must provide labelled ground truth data
Why might you use EMR over Glue?
EMR gives you direct access to the Hadoop environment, so you have greater flexibility
How is the power of Glue ETL jobs specified?
In Data Processing Units (DPUs)
Broadly speaking, what is EMR?
A hosted Hadoop service running on EC2 and S3
How can ad hoc analysis be done with EMR?
Using EMR Notebooks, which are a managed environment based on Jupyter
Can EMR be run on Outposts?
Yes
Can EMR work with streaming data?
Yes - the EMR Connector to Kinesis allows EMR to directly read and query streaming data from Kinesis
For how long can records be pulled from a Kinesis Data Stream?
24 hours by default, but retention can be extended up to 365 days (the limit was 7 days before long-term retention was introduced)
What is the maximum size of a record in a Kinesis Data Stream?
1 MB for the data payload, measured before Base64 encoding
What destinations does Firehose support?
S3, Redshift, Amazon Elasticsearch Service and Splunk
How can Firehose data be transformed?
A Lambda function can be used to transform it in real time before it is loaded to its destination
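A minimal sketch of such a transformation function, assuming the incoming records are JSON objects. Firehose passes each record Base64-encoded and expects every `recordId` back with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. The `processed` field added here is purely illustrative.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records before delivery."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Return the record unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        payload["processed"] = True  # illustrative transformation
        # A trailing newline keeps delivered records line-delimited.
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

Records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed rather than being delivered as transformed data.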