Untitled Deck Flashcards
How do you profile and debug slow-running Spark jobs?
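Use the Spark UI to find slow stages and skewed or straggling tasks, check driver and executor logs for errors and long garbage-collection pauses, and review metrics such as shuffle read/write size and spill to pinpoint the bottleneck.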
What are the common bottlenecks in distributed systems?
Network latency, data serialization, and resource contention.
How does caching improve data processing performance?
Caching keeps frequently used results in memory or on fast storage, so repeated reads avoid recomputation and repeated I/O.
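A minimal Python sketch of the idea, using the standard library's `functools.lru_cache`. The function `expensive_lookup` and the `call_count` counter are hypothetical names, used only to show that the second call is served from the cache rather than recomputed.

```python
from functools import lru_cache

call_count = 0  # counts how many times the real computation runs


@lru_cache(maxsize=128)
def expensive_lookup(key: int) -> int:
    """Hypothetical stand-in for a slow computation or remote read."""
    global call_count
    call_count += 1
    return key * key


first = expensive_lookup(12)   # computed on the first call
second = expensive_lookup(12)  # served from the cache, no recomputation
```

The same trade-off applies at larger scale: a cache spends memory to save compute and I/O, so it pays off mainly for data that is read more than once.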
What is the difference between vertical scaling and horizontal scaling?
Vertical scaling adds resources to a single node, while horizontal scaling adds more nodes.
How do you monitor the performance of ETL pipelines?
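Use monitoring tools to track per-run execution time, record counts and data-quality checks, and resource usage (CPU, memory, I/O), and alert when any of them drifts from its expected range.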
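One way to picture this, as a sketch rather than a real monitoring setup: a small context manager that records wall-clock time and row counts per step. The names `track_step`, `metrics`, and `run_pipeline`, and the "drop rows with a missing id" rule, are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager

metrics = {}  # step name -> recorded measurements


@contextmanager
def track_step(name):
    """Record wall-clock duration and counters for one pipeline step."""
    record = {"rows": 0}
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Runs even if the step raises, so failed steps are still timed.
        record["seconds"] = time.perf_counter() - start
        metrics[name] = record


def run_pipeline(rows):
    with track_step("transform") as rec:
        # Hypothetical quality rule: drop rows with a missing "id".
        cleaned = [r for r in rows if r.get("id") is not None]
        rec["rows"] = len(cleaned)
        rec["dropped"] = len(rows) - len(cleaned)
    return cleaned


result = run_pipeline([{"id": 1}, {"id": None}, {"id": 3}])
```

In practice the `metrics` dictionary would be shipped to whatever monitoring system the pipeline already uses instead of being kept in memory.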
How do you handle high-cardinality data in distributed systems?
Partition the data by a hash of the key so that values spread evenly across nodes, and use indexing so lookups avoid full scans.
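A rough sketch of hash partitioning in Python, assuming string keys and a fixed partition count. `partition_for` and the `user-N` keys are hypothetical. CRC32 is used because Python's built-in `hash()` for strings is randomized per process, which would send the same key to different partitions on different workers.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical high-cardinality keys: 10,000 distinct user IDs.
keys = [f"user-{i}" for i in range(10_000)]

# Count how many keys land in each of 8 partitions.
counts = Counter(partition_for(k, 8) for k in keys)
```

Inspecting `counts` shows how evenly the keys are distributed; a heavily uneven result would suggest choosing a different key or hash.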
What techniques do you use to minimize latency in data pipelines?
Implement batching, parallel processing, and efficient data serialization.
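As an illustration of the first two techniques, the sketch below groups items into fixed-size batches and processes the batches on a small thread pool. The helper name `batched`, the batch size, and the use of `sum` as the per-batch work are assumptions chosen to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))

# Process batches in parallel; pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = list(pool.map(sum, batches))
```

Batching amortizes per-call overhead (network round trips, serialization setup) across many records, while the pool overlaps the work on independent batches.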
Explain the role of job checkpointing in streaming data systems.
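Checkpointing periodically saves a job's state and progress (such as stream offsets) to durable storage so that, after a failure, the job resumes from the last checkpoint instead of starting over.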
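A toy Python sketch of the mechanism, not how any particular streaming engine implements it. The state is a running total plus the offset of the next event, written to a local JSON file. The names `save_checkpoint`, `load_checkpoint`, `process`, and the `fail_at` parameter are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"offset": 0, "total": 0}


def process(events, path, fail_at=None):
    """Sum events, resuming from the last checkpointed offset."""
    state = load_checkpoint(path)
    for i in range(state["offset"], len(events)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += events[i]
        state["offset"] = i + 1
        # Checkpointing after every event is for illustration only;
        # real jobs usually checkpoint at intervals.
        save_checkpoint(path, state)
    return state["total"]


events = [1, 2, 3, 4, 5]
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    process(events, ckpt, fail_at=3)  # crashes before the fourth event
except RuntimeError:
    pass

total = process(events, ckpt)  # resumes at offset 3; no event counted twice
```

Because the offset and the running total are saved together, the restarted run neither skips events nor double-counts the ones handled before the crash.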
How do you optimize resource allocation in tools like EMR or Dataproc?
Enable auto-scaling, choose instance types that match the workload (for example memory-optimized or compute-optimized), and tune configuration such as executor memory and cores.
What are some strategies for reducing costs when processing large-scale data on the cloud?
Use spot instances, optimize data storage, and schedule jobs during off-peak hours.