Untitled Deck Flashcards

Question 1

Q

How do you profile and debug slow-running Spark jobs?

Answer

A

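Start in the Spark UI: compare stage and task durations, shuffle read/write sizes, spill, and GC time to spot skewed partitions, straggler tasks, and expensive shuffles. Then check driver and executor logs and cluster metrics (CPU, memory, I/O) to confirm the cause.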
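
One concrete check is to compare task durations within a single stage: a few tasks that run far longer than the median usually point to data skew. A minimal sketch of that comparison, assuming the durations were copied out of the Spark UI by hand. The function name `find_stragglers` and the 3x threshold are hypothetical choices, not part of Spark.

```python
from statistics import median

def find_stragglers(task_seconds, factor=3.0):
    """Return (index, duration) for tasks slower than factor x the stage median."""
    if not task_seconds:
        return []
    cutoff = factor * median(task_seconds)
    return [(i, t) for i, t in enumerate(task_seconds) if t > cutoff]

# Hypothetical durations (seconds) for the tasks of one stage.
durations = [12.0, 11.5, 13.2, 12.8, 95.0, 12.1]
print(find_stragglers(durations))
```

A flagged task is only a hint. The partition it processed still has to be inspected to confirm skew.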

Question 2

Q

What are the common bottlenecks in distributed systems?

Answer

A

Network latency, data serialization overhead, and contention for shared resources such as CPU, memory, and disk I/O.

Question 3

Q

How does caching improve data processing performance?

Answer

A

Caching keeps computed or frequently read data in fast storage (usually memory), so later accesses skip the recomputation or slow I/O.
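
The effect can be illustrated on a single machine with Python's standard `functools.lru_cache`. This is only an analogy for what a cluster cache does, and `expensive_lookup` is a hypothetical stand-in for a slow computation.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function really runs

@lru_cache(maxsize=128)
def expensive_lookup(key):
    """Stand-in for a slow computation or remote read."""
    global calls
    calls += 1
    return key * key

first = expensive_lookup(12)   # computed
second = expensive_lookup(12)  # served from the cache
print(first, second, calls)
```

The second call returns the stored result, so the function body runs only once.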

Question 4

Q

What is the difference between vertical scaling and horizontal scaling?

Answer

A

Vertical scaling adds resources to a single node, while horizontal scaling adds more nodes.

Question 5

Q

How do you monitor the performance of ETL pipelines?

Answer

A

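Track execution time per step, data quality (row counts, null rates, duplicates, schema changes), and resource usage (CPU, memory, I/O), and alert when a metric crosses a threshold.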
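
A minimal sketch of per-step tracking, assuming metrics are collected in a plain in-memory list. The names `track_step` and `metrics` are hypothetical, and a real pipeline would send these records to a monitoring system instead.

```python
import time
from contextlib import contextmanager

metrics = []  # hypothetical sink; stands in for a monitoring backend

@contextmanager
def track_step(name):
    """Record wall-clock duration and caller-supplied stats for one step."""
    record = {"step": name, "rows": 0}
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Runs even if the step raises, so failed steps are still timed.
        record["seconds"] = time.perf_counter() - start
        metrics.append(record)

raw = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}, {"id": 3, "amount": 7}]

with track_step("drop_null_amounts") as step:
    clean = [row for row in raw if row["amount"] is not None]
    step["rows"] = len(clean)
    step["null_rate"] = 1 - len(clean) / len(raw)

print(metrics)
```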

Question 6

Q

How do you handle high-cardinality data in distributed systems?

Answer

A

Partition the data (for example, by hashing the key) so values spread evenly across nodes, and use indexes so lookups avoid full scans.
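
Hash partitioning is one way to do the partitioning step. The sketch below assumes string keys and eight partitions; `partition_for` is a hypothetical helper, and SHA-256 is used only because it gives the same result on every machine, unlike Python's built-in `hash()` for strings.

```python
import hashlib
from collections import Counter

def partition_for(key, num_partitions):
    """Map a key to a partition using a hash that is stable across processes."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Hypothetical high-cardinality keys, e.g. user IDs.
keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(partition_for(k, 8) for k in keys)
print(sorted(counts.items()))
```

With many distinct keys the partitions end up roughly equal in size. A single very frequent key would still land in one partition, so hashing alone does not fix skew.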

Question 7

Q

What techniques do you use to minimize latency in data pipelines?

Answer

A

Process records in batches to amortize per-record overhead, run independent work in parallel, and use a compact, efficient serialization format.
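
A minimal sketch of batching combined with parallel processing, using only the standard library. `process_batch` is a hypothetical stand-in for one I/O call that handles a whole batch, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    """Stand-in for one I/O call that handles a whole batch."""
    return [x * 2 for x in batch]

def run_pipeline(items, batch_size=100, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though batches run concurrently.
        results = pool.map(process_batch, batched(items, batch_size))
        return [y for batch in results for y in batch]

print(run_pipeline(range(10), batch_size=3))
```

Threads suit I/O-bound steps. CPU-bound work in Python would typically need processes instead.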

Question 8

Q

Explain the role of job checkpointing in streaming data systems.

Answer

A

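Checkpointing periodically saves a job's state (such as processed offsets and operator state) to durable storage, so after a failure the job resumes from the last checkpoint instead of starting over.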
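
The idea can be sketched with a running sum over a list of events, checkpointed to a local JSON file. Everything here is an assumption for illustration: the file layout, the `every` interval, and the `fail_at` switch that simulates a crash. Streaming engines implement this internally with their own storage formats.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"offset": 0, "total": 0}

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run(events, path, every=2, fail_at=None):
    state = load_checkpoint(path)
    for i in range(state["offset"], len(events)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += events[i]
        state["offset"] = i + 1
        if state["offset"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]

events = [1, 2, 3, 4, 5]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run(events, path, fail_at=3)  # crashes; last checkpoint is at offset 2
except RuntimeError:
    pass
print(run(events, path))          # resumes from offset 2 and finishes
```

Because the offset and the running total are saved together, the event processed after the last checkpoint is replayed on restart without being counted twice.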

Question 9

Q

How do you optimize resource allocation in tools like EMR or Dataproc?

Answer

A

Enable auto-scaling, choose instance types that match the workload (memory-heavy or compute-heavy), and tune configuration such as executor memory and cores.
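
For the tuning part, executor sizing is simple arithmetic. The sketch below assumes a rule-of-thumb starting point: one core and 1 GB reserved per node for the OS, five cores per executor, and 10% memory overhead on top of the heap. These defaults and the helper `size_executors` are assumptions for illustration, not EMR or Dataproc guidance.

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=5,
                   reserved_cores=1, reserved_mem_gb=1, overhead_fraction=0.10):
    """Return (executors_per_node, heap_gb_per_executor) for one worker node."""
    usable_cores = node_cores - reserved_cores
    executors = max(1, usable_cores // cores_per_executor)
    mem_per_executor = (node_mem_gb - reserved_mem_gb) / executors
    # The heap plus its overhead must fit in the executor's share of memory.
    heap_gb = int(mem_per_executor / (1 + overhead_fraction))
    return executors, heap_gb

executors, heap_gb = size_executors(node_cores=16, node_mem_gb=64)
settings = {
    "spark.executor.cores": "5",
    "spark.executor.memory": f"{heap_gb}g",
}
print(executors, settings)
```

The result is a first guess to refine against observed memory and CPU usage.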

Question 10

Q

What are some strategies for reducing costs when processing large-scale data on the cloud?

Answer

A

Use spot instances for fault-tolerant workloads, optimize data storage (compression, columnar formats, cheaper storage tiers), and schedule jobs during off-peak hours.

(10 cards)