6. Programming (Python, Scala, Java) Flashcards
How do you read and process large CSV or JSON files using Python?
Read CSV files in chunks with Pandas (`pd.read_csv(..., chunksize=...)`) so only a slice is in memory at a time; for JSON, prefer newline-delimited JSON parsed line by line with the built-in json module, or a streaming parser such as ijson for large nested documents.
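A minimal sketch of both approaches; the file paths, column names, and chunk size here are illustrative placeholders:

```python
import json
import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it whole.
def count_rows_chunked(path, chunksize=100_000):
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)  # each chunk is an ordinary DataFrame
    return total

# For newline-delimited JSON (one record per line), parse lazily line by line.
def read_jsonl(path):
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)
```

The chunked loop keeps memory bounded by `chunksize` rows, and the JSONL reader yields one record at a time, so neither ever holds the full file in memory.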
What are Python libraries for data processing, like Pandas and PySpark?
Pandas provides in-memory DataFrames for manipulation and analysis on a single machine; PySpark exposes a similar DataFrame API but distributes the work across a cluster, so it scales to datasets that do not fit in one machine's memory.
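A tiny Pandas example of a typical manipulation (filter, then aggregate); PySpark offers the same operations over distributed data with near-identical syntax. The column names and values are made up for illustration:

```python
import pandas as pd

# Filter to one region, then aggregate its sales.
df = pd.DataFrame({"region": ["eu", "eu", "us"], "sales": [10, 20, 30]})
eu_total = df[df["region"] == "eu"]["sales"].sum()
print(eu_total)  # 30
```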
Explain how you would handle schema evolution in JSON data.
Parse defensively: read fields with defaults so records written under an older schema still load, carry a schema-version field in each record so consumers can branch on it, and prefer serialization formats with built-in schema-evolution support (such as Avro) where possible.
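A sketch of tolerant parsing, assuming a hypothetical event record where a `country` field was added in schema version 2; `dict.get` supplies defaults for fields that older records lack:

```python
import json

# Tolerant parser: defaults for later-added fields, explicit version marker.
def parse_event(raw: str) -> dict:
    record = json.loads(raw)
    return {
        "schema_version": record.get("schema_version", 1),
        "user_id": record.get("user_id"),             # present in all versions
        "country": record.get("country", "unknown"),  # added in v2, defaulted for v1
    }

old = parse_event('{"user_id": 7}')
new = parse_event('{"schema_version": 2, "user_id": 8, "country": "de"}')
```

Both records parse into the same shape, so downstream code never has to know which schema version produced a row.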
How would you implement a sliding window algorithm in Python for streaming data?
Use collections.deque with the maxlen argument: appending to a full deque automatically evicts the oldest element, giving a fixed-size window in O(1) per update.
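A minimal sliding-window sketch computing a rolling mean over a stream; the stream and window size are illustrative:

```python
from collections import deque

# deque(maxlen=n) evicts the oldest element automatically on append,
# so the window always holds the n most recent values.
def rolling_means(stream, n=3):
    window = deque(maxlen=n)
    means = []
    for value in stream:
        window.append(value)
        means.append(sum(window) / len(window))
    return means

print(rolling_means([1, 2, 3, 4], n=2))  # [1.0, 1.5, 2.5, 3.5]
```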
How do you manage memory optimization in PySpark?
Tune Spark's memory-related settings (executor memory, spark.memory.fraction, shuffle partitions), cache only what is reused and with an appropriate storage level, use Kryo serialization, and store data in columnar formats such as Parquet so Spark reads only the columns it needs.
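A sketch of the relevant spark-defaults.conf entries; the values shown are illustrative starting points, not recommendations, since the right numbers depend on cluster size and workload:

```properties
# Kryo is faster and more compact than Java serialization.
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Number of partitions produced by shuffles; size to your data volume.
spark.sql.shuffle.partitions     200
# Fraction of heap used for execution and storage memory.
spark.memory.fraction            0.6
# Per-executor heap; illustrative value.
spark.executor.memory            4g
```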
What are Python decorators, and when would you use them in ETL scripts?
Decorators are functions that wrap other functions to modify or extend their behavior without changing their code. In ETL scripts they are handy for cross-cutting concerns such as logging, timing, and retry logic.
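A minimal timing decorator as one would use in an ETL script; the `transform` step and `print` call are placeholders (a real pipeline would use the logging module):

```python
import functools
import time

# Wraps any function and reports how long each call took.
def timed(func):
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.3f}s")
        return result
    return wrapper

@timed
def transform(rows):
    return [r * 2 for r in rows]
```

Because the concern lives in the decorator, every step can be instrumented with a single `@timed` line instead of repeated timing boilerplate.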
How would you parallelize data processing in Python?
Use the multiprocessing module (or concurrent.futures) to spread CPU-bound work across processes, since the GIL limits thread-based parallelism; for larger-than-memory or distributed workloads, libraries like Dask provide parallel DataFrame and array APIs.
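A sketch with multiprocessing.Pool; the `clean` worker and its inputs are illustrative. The `__main__` guard is required on platforms that spawn fresh interpreters for workers (Windows, macOS):

```python
from multiprocessing import Pool

# Worker function: must be importable at module level for child processes.
def clean(value: str) -> str:
    return value.strip().lower()

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map splits the iterable across worker processes.
        cleaned = pool.map(clean, ["  A ", " b", "C  "])
    print(cleaned)  # ['a', 'b', 'c']
```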
How does Scala handle immutability and parallel processing in Spark?
Scala favors immutable values (val) and immutable collections, and Spark's RDDs and DataFrames are themselves immutable: transformations return new datasets rather than mutating existing ones, which makes parallel execution safe because no task observes another's writes.
Explain the purpose of Python generators in handling large datasets.
Generators produce values lazily with yield, one at a time, so you can iterate over datasets far larger than memory; they also compose into pipelines where each record flows through every stage before the next is read.
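A sketch of a lazy generator pipeline; the stages and input are illustrative. Nothing is computed until the final consumer pulls values:

```python
# Stage 1: parse raw strings into integers, one at a time.
def parse(lines):
    for line in lines:
        yield int(line)

# Stage 2: keep only even numbers.
def only_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            yield n

# Chained generators: each record flows through both stages lazily.
result = list(only_even(parse(["1", "2", "3", "4"])))
print(result)  # [2, 4]
```

Replacing the input list with a file handle gives the same pipeline over an arbitrarily large file with constant memory use.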
How do you handle exceptions in Python for robust ETL workflows?
Wrap risky steps in try-except blocks that catch specific exceptions, log the failure, and decide per error whether to retry, skip the record (e.g. into a dead-letter queue), or abort; a finally clause ensures connections and file handles are released.
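A sketch of per-record error handling with a dead-letter list; the record format and function name are illustrative:

```python
import json
import logging

logger = logging.getLogger("etl")

# Catch a specific exception per record so one bad row does not abort the run;
# failed rows are quarantined for later inspection instead of being lost.
def load_records(raw_lines):
    good, dead_letter = [], []
    for line in raw_lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError as exc:
            logger.warning("skipping bad record: %s", exc)
            dead_letter.append(line)
    return good, dead_letter
```

Catching json.JSONDecodeError rather than bare Exception means genuinely unexpected failures (e.g. a lost database connection) still surface loudly instead of being silently swallowed.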