Data engineering Flashcards
Which 2 file formats store data in column-based and row-based layouts?
Parquet = column-based (columnar) file format
Avro = row-based file format
What does this SQL command do?
DBCC PDW_SHOWSPACEUSED('[schema].[table]')
This command is specific to Microsoft's SQL Server Parallel Data Warehouse (the engine behind dedicated SQL pools).
DBCC stands for Database Console Commands.
PDW_SHOWSPACEUSED shows information about the storage used by the table. Among other things, it returns the row count per distribution, which can be summed to get the table's total row count.
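As a sketch, running the command against a hypothetical fact table (the name [dbo].[FactSales] is an assumption, not from the deck):

```sql
-- Hypothetical table name; replace [dbo].[FactSales] with your own.
-- Returns one row per distribution (60 on a dedicated SQL pool);
-- summing the ROWS column across all distributions gives the total row count.
DBCC PDW_SHOWSPACEUSED('[dbo].[FactSales]');
```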
There is a Retry option for activities in Azure Data Factory. What does this option specify?
How many times an activity should be retried after a failure before the activity fails, which in turn fails the pipeline run.
There are 3 types of dimensions in dim/fact modelling. What are the characteristics of types 0, 1 and 2?
Type 0 = Static: data is entered once and never changes.
Type 1 = Overwrite: when a record should be updated, the original row is simply overwritten in place, so no history is kept.
Type 2 = Slowly changing dimensions: these often hold multiple records for the same instance to track historical data. They typically carry a start date, an end date and an isCurrent attribute to help identify the current record. Soft deletion can also be handled here.
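The Type 2 pattern above can be sketched in T-SQL; the table and column names (DimCustomer, CustomerId, City, StartDate, EndDate, IsCurrent) are hypothetical:

```sql
-- Sketch of a Type 2 change for customer 42, assuming a hypothetical DimCustomer table.
-- 1) Close out the current record for the changed customer.
UPDATE DimCustomer
SET EndDate = GETDATE(), IsCurrent = 0
WHERE CustomerId = 42 AND IsCurrent = 1;

-- 2) Insert a new record that becomes the current one.
INSERT INTO DimCustomer (CustomerId, City, StartDate, EndDate, IsCurrent)
VALUES (42, 'Copenhagen', GETDATE(), NULL, 1);
```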
In an Azure Stream Analytics job, you see a high number of backlogged input events. How can you solve this problem?
Increase the number of streaming units assigned to the job.
How many distributions does a default dedicated SQL pool in Azure Synapse have?
60
What are the 3 kinds of distribution that you can use in a dedicated SQL pool?
Round-robin (default)
Hash
Replicated
How do these 3 distributions work?
Round-robin: the same concept as in load balancing; the first row is placed on the first distribution, the second row on the second distribution, and so on.
Hash: just like ordinary hashing, it takes an input x (one or more column values), calculates a hash and places the row on the corresponding distribution; rows with the same hash value always land on the same distribution. This can, however, create data skew.
Replicated: simply replicates the table across all compute nodes.
Assuming a data warehouse under normal conditions (star schema, typical table sizes, typical joins and so on), where would the 3 kinds of distribution be used?
Hash = large fact tables
Replicated = smaller dimension tables
Round-robin = the staging layer, where rows just need to be landed somewhere before being transformed.
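These placements can be sketched with dedicated SQL pool DDL; the table and column names are hypothetical:

```sql
-- Hypothetical tables illustrating the three distribution options.
CREATE TABLE FactSales (SaleKey BIGINT, ProductKey INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH(ProductKey));   -- large fact table

CREATE TABLE DimProduct (ProductKey INT, ProductName NVARCHAR(100))
WITH (DISTRIBUTION = REPLICATE);          -- small dimension table

CREATE TABLE StageSales (SaleKey BIGINT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = ROUND_ROBIN);        -- staging table
```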
There are several Azure Stream Analytics jobs in place. We want to ensure that the data stream is segmented into distinct time segments and that events don't overlap. Which windowing function should we use?
Tumbling window.
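A minimal sketch of a tumbling-window query in the Stream Analytics query language; the input/output names and the 30-second window size are assumptions:

```sql
-- Counts events per device in fixed, non-overlapping 30-second windows.
-- SensorInput / PowerBIOutput are hypothetical input and output aliases.
SELECT DeviceId, COUNT(*) AS EventCount
INTO PowerBIOutput
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
```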
For a normal General Purpose v2 storage account, what needs to be enabled for it to behave like a Data Lake Storage Gen2 account?
Hierarchical namespace.
There are several Azure Stream Analytics jobs in place. We want to output events only for points in time when the content of the window actually changes. Which windowing function should we use?
Sliding window.
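A minimal sketch using a sliding window; the names, the 60-second window and the alert threshold are assumptions:

```sql
-- Emits output only when the window content changes, e.g. to alert
-- when a device produces more than 3 events within any 60-second span.
SELECT DeviceId, COUNT(*) AS EventCount
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, SlidingWindow(second, 60)
HAVING COUNT(*) > 3
```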
What is data skew?
When hash distribution is used and a hashed value appears very frequently, data accumulates on certain distributions while others hold barely any. This causes sub-optimal performance, because the heavily loaded distributions become a bottleneck.
Which service could you use to host your data-warehouse?
Azure Synapse Analytics
Which kinds of SQL pools are able to work with EXTERNAL TABLES?
Both serverless and dedicated SQL pools can use EXTERNAL TABLES.
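A hedged sketch of an external table over Parquet files in a data lake; the location, data source and file format names are assumptions and must be created beforehand with CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL FILE FORMAT:

```sql
-- Hypothetical external table reading Parquet files from a data lake path.
CREATE EXTERNAL TABLE dbo.ExtSales (
    SaleKey BIGINT,
    Amount  DECIMAL(18,2)
)
WITH (
    LOCATION = '/sales/2024/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
);
```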