[FAQs] Data Ecosystem Flashcards

1
Q

What ML capabilities does QuickSight have?

A
  • Discover anomalies
  • Forecasting
  • Auto Narratives with natural language
2
Q

What platform does Glue use for ETL?

A

Apache Spark

3
Q

What are the key components of Glue?

A
  • Data Catalog
  • ETL engine using Python or Scala
  • Scheduling engine
4
Q

What data sources does Glue support?

A

ETL jobs can read from S3, Redshift and most databases running on RDS or EC2

5
Q

What platform does the Glue Data Catalog use?

A

It is an Apache Hive Metastore-compatible catalog

6
Q

Can you edit a Glue Data Catalog manually?

A

Yes, using the console or the API, or by importing another Hive Metastore

7
Q

Can you include custom libraries in Glue ETL jobs?

A

Yes, you can import custom Python libraries and JAR files

8
Q

How can Glue jobs be triggered?

A

Manually, on a schedule, when another job finishes, or from another service such as Lambda
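
A minimal sketch of how these trigger types might look with boto3. The two builder functions only assemble the request bodies that `create_trigger` accepts, so they run without AWS access; the job name `nightly-etl` and the function names are hypothetical, chosen for illustration.

```python
def scheduled_trigger(name, job_name, cron_expr):
    """Request body for a trigger that starts a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expr})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Request body for a trigger that starts a job when another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def lambda_handler(event, context, glue=None):
    """Lambda entry point that starts a Glue job run directly."""
    if glue is None:
        import boto3  # imported lazily so the builders above need no AWS SDK

        glue = boto3.client("glue")
    # "nightly-etl" is a hypothetical job name.
    run = glue.start_job_run(JobName="nightly-etl")
    return run["JobRunId"]
```

To register a trigger, the body would be passed as keyword arguments, for example `boto3.client("glue").create_trigger(**scheduled_trigger("nightly", "nightly-etl", "0 2 * * ? *"))`.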

9
Q

Can Glue ETL be used for streaming data?

A

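Yes - Glue streaming ETL jobs (built on Spark Structured Streaming) can read from Kinesis Data Streams and Apache Kafka. Older guidance was to use Kinesis Data Firehose / Analytics as an intermediary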

10
Q

What ML capabilities does Glue have?

A

The FindMatches transform performs deduplication of records

11
Q

How are ML transforms managed?

A

You create them for your dataset - you must provide labelled ground truth data

12
Q

Why might you use EMR over Glue?

A

EMR gives you direct access to the Hadoop environment, so you have greater flexibility

13
Q

How is the power of Glue ETL jobs specified?

A

In Data Processing Units (DPUs)

14
Q

Broadly speaking, what is EMR?

A

A hosted Hadoop service running on EC2 and S3

15
Q

How can ad hoc analysis be done with EMR?

A

Using EMR Notebooks, which are a managed environment based on Jupyter

16
Q

Can EMR be run on Outposts?

A

Yes

17
Q

Can EMR work with streaming data?

A

Yes - the EMR Connector to Kinesis allows EMR to directly read and query streaming data from Kinesis

18
Q

For how long can records be pulled from a Kinesis Data Stream?

A

24 hours by default; retention can be extended to 7 days, or up to 365 days with long-term retention
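
A small sketch of what the retention window means for a consumer: a record can be read only while its age is less than the stream's retention period. The function name and timestamps are hypothetical; on a real stream the period would be changed with the boto3 Kinesis call `increase_stream_retention_period(StreamName=..., RetentionPeriodHours=...)`.

```python
from datetime import datetime, timedelta, timezone


def is_still_readable(arrival, now, retention_hours=24):
    """Return True if a record that arrived at `arrival` is still inside
    the retention window at time `now`."""
    return now - arrival < timedelta(hours=retention_hours)


arrival = datetime(2024, 1, 1, tzinfo=timezone.utc)
thirty_hours_later = arrival + timedelta(hours=30)

# With the default 24 hours the record has expired; with 7 days it has not.
expired_by_default = not is_still_readable(arrival, thirty_hours_later)
kept_for_a_week = is_still_readable(arrival, thirty_hours_later, retention_hours=168)
```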

19
Q

What is the maximum size of a record in a Kinesis Data Stream?

A

1 MB, measured on the data payload before Base64 encoding
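
A quick sketch of why it matters where the limit is measured: Base64 encoding inflates a payload by roughly a third. The check below assumes the limit is 1 MiB and applies to the raw payload, as the Kinesis API documentation describes; it ignores the partition key, and the names are hypothetical.

```python
import base64

# Assumption: the limit is 1 MiB, measured on the raw payload.
MAX_RECORD_BYTES = 1024 * 1024


def fits_in_record(payload: bytes) -> bool:
    """Check the raw (pre-encoding) payload size against the limit."""
    return len(payload) <= MAX_RECORD_BYTES


def encoded_size(payload: bytes) -> int:
    """Size of the payload once Base64-encoded for transport."""
    return len(base64.b64encode(payload))


payload = b"x" * MAX_RECORD_BYTES
print(fits_in_record(payload))         # raw payload is exactly at the limit
print(encoded_size(payload))           # encoded form is about 33% larger
print(fits_in_record(payload + b"x"))  # one byte over the limit
```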

20
Q

What destinations does Firehose support?

A

S3, Redshift, Amazon Elasticsearch Service (now Amazon OpenSearch Service) and Splunk

21
Q

How can Firehose data be transformed?

A

A Lambda function can be used to transform it in real time before it is loaded to its destination
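
A minimal sketch of such a transformation function. Firehose invokes the Lambda with a batch of Base64-encoded records and expects each one back with the same `recordId`, a `result` (`Ok`, `Dropped` or `ProcessingFailed`) and the re-encoded `data`. The transformation itself (adding a `processed` flag to JSON objects) is hypothetical.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical transformation
            line = json.dumps(payload) + "\n"  # newline-delimit records
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "Ok",
                    "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
                }
            )
        except (ValueError, TypeError):
            # Invalid Base64 or JSON, or a payload that is not a JSON object:
            # hand the record back unchanged and mark it as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed, so a bad record does not block the rest of the batch.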