6. Programming (Python, Scala, Java) Flashcards
How do you read and process large CSV or JSON files using Python?
Read CSV files in chunks with Pandas (`pd.read_csv(..., chunksize=...)`) so only a slice is in memory at a time; for JSON, prefer newline-delimited JSON parsed line by line with the built-in json module, or a streaming parser such as ijson for large nested documents.
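A minimal sketch of both approaches; the file paths, column names, and chunk size here are illustrative placeholders:

```python
import json
import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it whole.
def count_rows_chunked(path, chunksize=100_000):
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)  # each chunk is an ordinary DataFrame
    return total

# For newline-delimited JSON (one record per line), parse lazily line by line.
def read_jsonl(path):
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)
```

The chunked loop keeps memory bounded by `chunksize` rows, and the JSONL reader yields one record at a time, so neither ever holds the full file in memory.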
What are Python libraries for data processing, like Pandas and PySpark?
Pandas provides in-memory DataFrames for manipulation and analysis on a single machine; PySpark exposes a similar DataFrame API but distributes the work across a cluster, so it scales to datasets that do not fit in one machine's memory.
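A tiny Pandas example of a typical manipulation (filter, then aggregate); PySpark offers the same operations over distributed data with near-identical syntax. The column names and values are made up for illustration:

```python
import pandas as pd

# Filter to one region, then aggregate its sales.
df = pd.DataFrame({"region": ["eu", "eu", "us"], "sales": [10, 20, 30]})
eu_total = df[df["region"] == "eu"]["sales"].sum()
print(eu_total)  # 30
```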
Explain how you would handle schema evolution in JSON data.
Parse defensively: read fields with defaults so records written under an older schema still load, carry a schema-version field in each record so consumers can branch on it, and prefer serialization formats with built-in schema-evolution support (such as Avro) where possible.
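A sketch of tolerant parsing, assuming a hypothetical event record where a `country` field was added in schema version 2; `dict.get` supplies defaults for fields that older records lack:

```python
import json

# Tolerant parser: defaults for later-added fields, explicit version marker.
def parse_event(raw: str) -> dict:
    record = json.loads(raw)
    return {
        "schema_version": record.get("schema_version", 1),
        "user_id": record.get("user_id"),             # present in all versions
        "country": record.get("country", "unknown"),  # added in v2, defaulted for v1
    }

old = parse_event('{"user_id": 7}')
new = parse_event('{"schema_version": 2, "user_id": 8, "country": "de"}')
```

Both records parse into the same shape, so downstream code never has to know which schema version produced a row.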
How would you implement a sliding window algorithm in Python for streaming data?
Use collections.deque with the maxlen argument: appending to a full deque automatically evicts the oldest element, giving a fixed-size window in O(1) per update.
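A minimal sliding-window sketch computing a rolling mean over a stream; the stream and window size are illustrative:

```python
from collections import deque

# deque(maxlen=n) evicts the oldest element automatically on append,
# so the window always holds the n most recent values.
def rolling_means(stream, n=3):
    window = deque(maxlen=n)
    means = []
    for value in stream:
        window.append(value)
        means.append(sum(window) / len(window))
    return means

print(rolling_means([1, 2, 3, 4], n=2))  # [1.0, 1.5, 2.5, 3.5]
```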
How do you manage memory optimization in PySpark?
Tune Spark's memory-related settings (executor memory, spark.memory.fraction, shuffle partitions), cache only what is reused and with an appropriate storage level, use Kryo serialization, and store data in columnar formats such as Parquet so Spark reads only the columns it needs.
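A sketch of the relevant spark-defaults.conf entries; the values shown are illustrative starting points, not recommendations, since the right numbers depend on cluster size and workload:

```properties
# Kryo is faster and more compact than Java serialization.
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Number of partitions produced by shuffles; size to your data volume.
spark.sql.shuffle.partitions     200
# Fraction of heap used for execution and storage memory.
spark.memory.fraction            0.6
# Per-executor heap; illustrative value.
spark.executor.memory            4g
```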
What are Python decorators, and when would you use them in ETL scripts?
Decorators are functions that wrap other functions to modify or extend their behavior without changing their code. In ETL scripts they are handy for cross-cutting concerns such as logging, timing, and retry logic.
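A minimal timing decorator as one would use in an ETL script; the `transform` step and `print` call are placeholders (a real pipeline would use the logging module):

```python
import functools
import time

# Wraps any function and reports how long each call took.
def timed(func):
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.3f}s")
        return result
    return wrapper

@timed
def transform(rows):
    return [r * 2 for r in rows]
```

Because the concern lives in the decorator, every step can be instrumented with a single `@timed` line instead of repeated timing boilerplate.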
How would you parallelize data processing in Python?
Use the multiprocessing module (or concurrent.futures) to spread CPU-bound work across processes, since the GIL limits thread-based parallelism; for larger-than-memory or distributed workloads, libraries like Dask provide parallel DataFrame and array APIs.
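A sketch with multiprocessing.Pool; the `clean` worker and its inputs are illustrative. The `__main__` guard is required on platforms that spawn fresh interpreters for workers (Windows, macOS):

```python
from multiprocessing import Pool

# Worker function: must be importable at module level for child processes.
def clean(value: str) -> str:
    return value.strip().lower()

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map splits the iterable across worker processes.
        cleaned = pool.map(clean, ["  A ", " b", "C  "])
    print(cleaned)  # ['a', 'b', 'c']
```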
How does Scala handle immutability and parallel processing in Spark?
Scala favors immutable values (val) and immutable collections, and Spark's RDDs and DataFrames are themselves immutable: transformations return new datasets rather than mutating existing ones, which makes parallel execution safe because no task observes another's writes.
Explain the purpose of Python generators in handling large datasets.
Generators produce values lazily with yield, one at a time, so you can iterate over datasets far larger than memory; they also compose into pipelines where each record flows through every stage before the next is read.
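A sketch of a lazy generator pipeline; the stages and input are illustrative. Nothing is computed until the final consumer pulls values:

```python
# Stage 1: parse raw strings into integers, one at a time.
def parse(lines):
    for line in lines:
        yield int(line)

# Stage 2: keep only even numbers.
def only_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            yield n

# Chained generators: each record flows through both stages lazily.
result = list(only_even(parse(["1", "2", "3", "4"])))
print(result)  # [2, 4]
```

Replacing the input list with a file handle gives the same pipeline over an arbitrarily large file with constant memory use.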
How do you handle exceptions in Python for robust ETL workflows?
Wrap risky steps in try-except blocks that catch specific exceptions, log the failure, and decide per error whether to retry, skip the record (e.g. into a dead-letter queue), or abort; a finally clause ensures connections and file handles are released.
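A sketch of per-record error handling with a dead-letter list; the record format and function name are illustrative:

```python
import json
import logging

logger = logging.getLogger("etl")

# Catch a specific exception per record so one bad row does not abort the run;
# failed rows are quarantined for later inspection instead of being lost.
def load_records(raw_lines):
    good, dead_letter = [], []
    for line in raw_lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError as exc:
            logger.warning("skipping bad record: %s", exc)
            dead_letter.append(line)
    return good, dead_letter
```

Catching json.JSONDecodeError rather than bare Exception means genuinely unexpected failures (e.g. a lost database connection) still surface loudly instead of being silently swallowed.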