Data engineering Flashcards
Which 2 file formats store data in column-based and row-based layouts?
Parquet = column-based (columnar) file format
Avro = row-based file format
What does this SQL command do?
DBCC PDW_SHOWSPACEUSED('[schema].[table]')
This command is specific to Microsoft's SQL Server Parallel Data Warehouse (the engine behind dedicated SQL pools).
DBCC stands for Database Console Commands.
PDW_SHOWSPACEUSED shows information about the storage used by the table. Among other things, it returns the row count per distribution, which can be summed to get the table's total row count.
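As a sketch, running the command against a hypothetical fact table (the name [dbo].[FactSales] is an assumption, not from the deck):

```sql
-- Hypothetical table name; replace [dbo].[FactSales] with your own.
-- Returns one row per distribution (60 on a dedicated SQL pool);
-- summing the ROWS column across all distributions gives the total row count.
DBCC PDW_SHOWSPACEUSED('[dbo].[FactSales]');
```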
There is a Retry option for activities in Azure Data Factory. What does this option specify?
How many times an activity should be retried after a failure before the activity fails, which in turn fails the pipeline run.
There are 3 types of dimensions in dim/fact modelling. What are the characteristics of types 0, 1 and 2?
Type 0 = Static: data is entered once and never changes.
Type 1 = Overwrite: when a record should be updated, the original row is simply overwritten in place, so no history is kept.
Type 2 = Slowly changing dimensions: these often hold multiple records for the same instance to track historical data. They typically carry a start date, an end date and an isCurrent attribute to help identify the current record. Soft deletion can also be handled here.
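The Type 2 pattern above can be sketched in T-SQL; the table and column names (DimCustomer, CustomerId, City, StartDate, EndDate, IsCurrent) are hypothetical:

```sql
-- Sketch of a Type 2 change for customer 42, assuming a hypothetical DimCustomer table.
-- 1) Close out the current record for the changed customer.
UPDATE DimCustomer
SET EndDate = GETDATE(), IsCurrent = 0
WHERE CustomerId = 42 AND IsCurrent = 1;

-- 2) Insert a new record that becomes the current one.
INSERT INTO DimCustomer (CustomerId, City, StartDate, EndDate, IsCurrent)
VALUES (42, 'Copenhagen', GETDATE(), NULL, 1);
```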
In an Azure Stream Analytics job, you see a high number of backlogged input events. How can you solve this problem?
Increase the number of streaming units assigned to the job.
How many distributions does a default dedicated SQL pool in Azure Synapse have?
60
What are the 3 kinds of distribution that you can use in a dedicated SQL pool?
Round-robin (default)
Hash
Replicated
How do these 3 distributions work?
Round-robin: the same concept as in load balancing; the first row is placed on the first distribution, the second row on the second distribution, and so on.
Hash: just like ordinary hashing, it takes an input x (one or more column values), calculates a hash and places the row on the corresponding distribution; rows with the same hash value always land on the same distribution. This can, however, create data skew.
Replicated: simply replicates the table across all compute nodes.
Assuming a data warehouse under normal conditions (star schema, typical table sizes, typical joins and so on), where would the 3 kinds of distribution be used?
Hash = large fact tables
Replicated = smaller dimension tables
Round-robin = the staging layer, where rows just need to be landed somewhere before being transformed.
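These placements can be sketched with dedicated SQL pool DDL; the table and column names are hypothetical:

```sql
-- Hypothetical tables illustrating the three distribution options.
CREATE TABLE FactSales (SaleKey BIGINT, ProductKey INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH(ProductKey));   -- large fact table

CREATE TABLE DimProduct (ProductKey INT, ProductName NVARCHAR(100))
WITH (DISTRIBUTION = REPLICATE);          -- small dimension table

CREATE TABLE StageSales (SaleKey BIGINT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = ROUND_ROBIN);        -- staging table
```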
There are several Azure Stream Analytics jobs in place. We want to ensure that the data stream is segmented into distinct time segments and that events don't overlap. Which windowing function should we use?
Tumbling window.
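A minimal sketch of a tumbling-window query in the Stream Analytics query language; the input/output names and the 30-second window size are assumptions:

```sql
-- Counts events per device in fixed, non-overlapping 30-second windows.
-- SensorInput / PowerBIOutput are hypothetical input and output aliases.
SELECT DeviceId, COUNT(*) AS EventCount
INTO PowerBIOutput
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
```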
For a normal General Purpose v2 storage account, what needs to be enabled for it to behave like a Data Lake Storage Gen2 account?
Hierarchical namespace.
There are several Azure Stream Analytics jobs in place. We want to output events only for points in time when the content of the window actually changes. Which windowing function should we use?
Sliding window.
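A minimal sketch using a sliding window; the names, the 60-second window and the alert threshold are assumptions:

```sql
-- Emits output only when the window content changes, e.g. to alert
-- when a device produces more than 3 events within any 60-second span.
SELECT DeviceId, COUNT(*) AS EventCount
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, SlidingWindow(second, 60)
HAVING COUNT(*) > 3
```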
What is data skew?
When hash distribution is used and a hashed value appears very frequently, data accumulates on certain distributions while others hold barely any. This causes sub-optimal performance, because the heavily loaded distributions become a bottleneck.
Which service could you use to host your data-warehouse?
Azure Synapse Analytics
Which kinds of SQL pools are able to work with EXTERNAL TABLES?
Both serverless and dedicated SQL pools can use EXTERNAL TABLES.
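A hedged sketch of an external table over Parquet files in a data lake; the location, data source and file format names are assumptions and must be created beforehand with CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL FILE FORMAT:

```sql
-- Hypothetical external table reading Parquet files from a data lake path.
CREATE EXTERNAL TABLE dbo.ExtSales (
    SaleKey BIGINT,
    Amount  DECIMAL(18,2)
)
WITH (
    LOCATION = '/sales/2024/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
);
```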