Instructor's Method - 6/15/2021 Flashcards
PolyBase
A technology for loading data from a data lake into a dedicated SQL pool
Different distribution methods
Round Robin - rows are divided (almost) equally among the distributions
Hash Distributed - rows are co-located according to the value of the hash column
Round Robin
No need to decide or analyze which distribution each row should be stored in
But reads tend to be slow, because related rows are scattered across distributions and must be moved to be joined
Hash Distribution
The dedicated SQL pool takes the value of the hash column, hashes it, and stores the row in the resulting distribution
Writes are slower, because the dedicated SQL pool has to apply the hash logic to decide which distribution each row goes to
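A minimal Python sketch of the two methods, for comparison. It is only an illustration: the MD5-based hash is a stand-in (the engine's real hash function is not described here), and the `order_id` / `customer_id` columns are made-up sample data.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def round_robin(rows):
    """Assign rows to distributions in turn, ignoring their content."""
    return [i % NUM_DISTRIBUTIONS for i, _ in enumerate(rows)]


def hash_distributed(rows, key):
    """Assign each row by hashing one column, so equal keys share a distribution."""
    def bucket(value):
        # Stand-in hash for illustration only, not the engine's actual function.
        digest = hashlib.md5(str(value).encode()).hexdigest()
        return int(digest, 16) % NUM_DISTRIBUTIONS
    return [bucket(row[key]) for row in rows]


def distributions_per_key(rows, assignment, key):
    """Map each key value to the set of distributions holding its rows."""
    placements = {}
    for row, dist in zip(rows, assignment):
        placements.setdefault(row[key], set()).add(dist)
    return placements


# Hypothetical sample data: 6000 orders spread over 500 customers.
rows = [{"order_id": i, "customer_id": i % 500} for i in range(6000)]

rr = round_robin(rows)
hd = hash_distributed(rows, "customer_id")

rr_counts = Counter(rr)
rr_placements = distributions_per_key(rows, rr, "customer_id")
hd_placements = distributions_per_key(rows, hd, "customer_id")

# Round robin: perfectly even row counts, but one customer's rows are scattered.
print(min(rr_counts.values()), max(rr_counts.values()))
print(len(rr_placements[0]) > 1)

# Hash: every customer's rows land in exactly one distribution.
print(all(len(d) == 1 for d in hd_placements.values()))
```

Round robin gives even row counts but scatters each customer's orders. Hashing on `customer_id` keeps each customer's orders together, at the cost of computing a hash for every row written.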
Best Practice
Make sure the hash column has at least 60 distinct values (a dedicated SQL pool has 60 distributions)
The column you choose should spread the data as evenly as possible
Choose a column that is frequently used in WHERE, JOIN, GROUP BY, etc.
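The first two checks can be sketched in Python as a quick sanity test on a candidate column. This is an assumption-laden illustration, not a tool the service provides: `evaluate_hash_column` and the sample columns are hypothetical.

```python
from collections import Counter

NUM_DISTRIBUTIONS = 60  # number of distributions in a dedicated SQL pool


def evaluate_hash_column(values):
    """Report distinct-value count and the share taken by the most common value."""
    counts = Counter(values)
    distinct = len(counts)
    largest_share = max(counts.values()) / len(values)
    return {
        "distinct": distinct,
        "enough_distinct": distinct >= NUM_DISTRIBUTIONS,
        "largest_share": largest_share,
    }


# Hypothetical candidates: a two-value country column vs. a customer id column.
country = ["US"] * 900 + ["CA"] * 100
customer_id = list(range(1000))

country_report = evaluate_hash_column(country)
customer_report = evaluate_hash_column(customer_id)

print(country_report)
print(customer_report)
```

The country column fails both checks: it has only 2 distinct values, and one of them covers 90% of the rows, so most data would pile into one distribution. The customer id column has 1000 distinct values, each covering 0.1% of the rows.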
CTAS
CREATE TABLE AS SELECT: creates a new table from the result of a SELECT statement
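To show the shape of a CTAS statement, here is a small Python helper that renders one as text. The helper `build_ctas` and the table and column names are hypothetical. It does no escaping or validation, so treat it as a sketch of the statement layout rather than something to run against untrusted input.

```python
def build_ctas(new_table, source_table, hash_column=None):
    """Render a CTAS statement; hash-distribute if a column is given, else round robin."""
    if hash_column:
        distribution = f"HASH([{hash_column}])"
    else:
        distribution = "ROUND_ROBIN"
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {source_table};"
    )


# Hypothetical table names, for illustration only.
hash_sql = build_ctas("dbo.FactSales_Hash", "dbo.FactSales_Staging", "CustomerKey")
rr_sql = build_ctas("dbo.FactSales_RR", "dbo.FactSales_Staging")

print(hash_sql)
print(rr_sql)
```

Because CTAS creates the table and loads it in one step, it is a common way to re-create an existing table with a different distribution method.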
Concurrency slots
A set of compute resources that a query reserves in order to run
Resource Classes
Define the number of concurrency slots allocated to a query
Data Warehouse Unit (DWU)
Determines the number of concurrency slots you get
Sets the performance level of the pool
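How the three ideas fit together can be sketched with simple arithmetic: the DWU level fixes the total slots, the resource class fixes the slots per query, and the quotient limits how many such queries run at once. The numbers below are made up for illustration and are not the service's actual slot tables. The sketch also ignores any other limits the service applies.

```python
def max_concurrent_queries(total_slots, slots_per_query):
    """How many queries of one resource class fit into the available slots."""
    if slots_per_query <= 0:
        raise ValueError("slots_per_query must be positive")
    return total_slots // slots_per_query


# Hypothetical figures: a pool whose DWU level provides 40 concurrency slots.
TOTAL_SLOTS = 40

small_class = max_concurrent_queries(TOTAL_SLOTS, slots_per_query=1)
large_class = max_concurrent_queries(TOTAL_SLOTS, slots_per_query=8)

print(small_class)  # 40
print(large_class)  # 5
```

A larger resource class gives each query more resources but lets fewer queries run at the same time. Raising the DWU level increases the total number of slots.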
Synapse Pipelines
Shares code base with Azure Data Factory