Batch Processing Flashcards
You are designing a Data Factory pipeline and need to trigger a webhook only when a pipeline activity runs successfully. Which activity dependency condition is the best choice?
Success
Success triggers the dependent activity only upon the successful completion of the previous activity.
You need to configure a pipeline trigger to run at fixed intervals starting from a date last week. Which is the best option?
Tumbling window
Tumbling window is the best option. The key is the historical start date: tumbling window triggers fire at fixed intervals and can also backfill windows from a start time in the past.
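To make the backfill behavior concrete, here is a minimal sketch (not the Data Factory trigger API; `tumbling_windows` is a hypothetical helper) showing how a past start date is cut into fixed, contiguous, non-overlapping windows, with one run scheduled per window:

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval):
    """Split [start, end) into fixed, contiguous, non-overlapping windows,
    mirroring how a tumbling window trigger schedules one run per window."""
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, min(cursor + interval, end)))
        cursor += interval
    return windows

# A start date in the past yields one backfill run per historical window.
runs = tumbling_windows(datetime(2024, 1, 1), datetime(2024, 1, 4), timedelta(days=1))
```

Because the windows never overlap and cover the whole range, every historical interval is processed exactly once.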
You need to quickly update a database by examining and loading the difference. What’s the best solution?
Incremental data loading with watermark
This would load only the difference: rows that changed since the last recorded watermark value, rather than reloading the whole table.
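The watermark pattern can be sketched with the standard-library `sqlite3` module (the table and column names here are illustrative assumptions, not part of any Data Factory API): read the stored high-water mark, pull only rows modified after it, load them, then advance the mark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, value TEXT, modified_at INTEGER);
    CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE watermark (last_modified INTEGER);
    INSERT INTO watermark VALUES (0);
    INSERT INTO source VALUES (1, 'a', 10), (2, 'b', 20);
""")

def incremental_load(conn):
    # 1. Read the stored watermark (high-water mark of the last load).
    (wm,) = conn.execute("SELECT last_modified FROM watermark").fetchone()
    # 2. Pull only rows changed since then -- the "difference".
    rows = conn.execute(
        "SELECT id, value, modified_at FROM source WHERE modified_at > ?", (wm,)
    ).fetchall()
    for rid, value, modified in rows:
        conn.execute("INSERT OR REPLACE INTO target VALUES (?, ?)", (rid, value))
        wm = max(wm, modified)
    # 3. Advance the watermark so the next run skips already-loaded rows.
    conn.execute("UPDATE watermark SET last_modified = ?", (wm,))
    return len(rows)

first = incremental_load(conn)    # loads both existing rows
conn.execute("INSERT INTO source VALUES (3, 'c', 30)")
second = incremental_load(conn)   # loads only the newly added row
```

The second run touches one row instead of three, which is what makes the update quick.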
You need to create a pipeline step to insert select rows into a database table if they don’t exist, or update them if they do. Which data loading type would be the best solution?
Upserting
Upserting (update + insert) would be the best data loading type: it updates a row if the key already exists and inserts it if it does not.
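As an illustration, SQLite (version 3.24 or later) expresses an upsert with `INSERT ... ON CONFLICT ... DO UPDATE`; the `accounts` table here is a made-up example, and other databases use their own syntax (for example `MERGE` in SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

def upsert(conn, rid, balance):
    # Insert the row if the key is new; update it if the key already exists.
    conn.execute(
        """
        INSERT INTO accounts (id, balance) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET balance = excluded.balance
        """,
        (rid, balance),
    )

upsert(conn, 1, 100)   # key 1 is new, so the row is inserted
upsert(conn, 1, 250)   # key 1 exists, so the row is updated instead of failing
```

After both calls the table holds a single row for id 1 with the latest balance.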
You are working in Databricks and need to change your programming language from Python to SQL for a single cell. Is this possible? And if so, how would you complete this task?
Yes. You would start the cell with the %sql magic command.
Starting the cell with %sql overrides the notebook's default language for that single cell, so you can run SQL without changing the notebook's language setting.
Which scenario would NOT lend itself to batch processing?
A) When you have complex transformation requirements and need low cost solutions
B) When working with Data Factory
C) When data is not required immediately
D) When you need data in real-time
D) When you need data in real-time
For real-time data, you would need to implement a streaming solution.
Why is handling schema drift important?
If not handled, it can lead to a complete pipeline breakdown.
Unhandled schema drift, such as a source system adding, renaming, or dropping a column, can cause downstream transformations to fail and break the entire pipeline.
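One defensive pattern (a minimal sketch, not Data Factory's built-in schema drift handling; the schema and `conform` helper are hypothetical) is to project each incoming record onto the expected schema and set drifted columns aside instead of letting them break the load:

```python
# Columns the pipeline was built to expect (illustrative schema).
EXPECTED = ["id", "name", "email"]

def conform(record, expected=EXPECTED):
    """Project an incoming record onto the expected schema.

    Missing columns become None, and unexpected (drifted) columns are
    returned separately instead of failing the pipeline.
    """
    row = {col: record.get(col) for col in expected}
    drifted = {k: v for k, v in record.items() if k not in expected}
    return row, drifted

# The source has dropped "email" and added a new "signup_source" column.
row, drifted = conform({"id": 1, "name": "Ada", "signup_source": "ad"})
```

The load continues with a well-formed row, and the drifted columns can be logged or routed for review.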
What would be a good reason to use data flows?
You need a visual, no-code solution to implement transformational logic in Azure Data Factory.
This is an excellent reason: mapping data flows let you design transformation logic visually, and Data Factory executes it for you without hand-written code.
You have been experiencing issues with code-breaking deployments in production. What are two advantages of implementing GitHub in Data Factory?
Source Control
Increased collaboration
Both are definite advantages. Source control gives you versioned pipeline definitions that can be reviewed and rolled back before code-breaking changes reach production, and a shared repository increases collaboration across the team.