examtopics_1 Flashcards

1
Q
A

C. [ManagerEmployeeKey] [int] NULL

2
Q
A

B. an error (dbo schema error)

3
Q
A

The answer should be D –> A –> C.

Step 1:
Create an empty table SalesFact_Work with same schema as SalesFact.

Step 2:
Switch the partition (to be removed) from SalesFact to SalesFact_Work. The syntax is:
ALTER TABLE <source_table> SWITCH PARTITION <partition_number> TO <target_table> [PARTITION <partition_number>]

Step 3:
Delete the SalesFact_Work table.
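
A minimal T-SQL sketch of the three steps in a dedicated SQL pool (the distribution column, partition boundaries, and partition number below are placeholders, not taken from the question):

-- Step 1: empty work table with the same schema, distribution, and partition scheme as SalesFact
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH ([CustomerKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([TransactionDate] RANGE RIGHT FOR VALUES ('2020-01-01', '2020-02-01'))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the partition to be removed
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table together with the switched-out rows
DROP TABLE dbo.SalesFact_Work;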

4
Q
A

B. File1.csv and File4.csv only

5
Q
A

1: Parquet - column-oriented binary file format
2: Avro - row-based format, and has the logical type timestamp

6
Q
A

D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}_{DD}.csv

7
Q
A

1: PARQUET
Because Parquet is a columnar file format.

2: AVRO
Because Avro is a row-based file format (like JSON) and supports the logical type timestamp.

8
Q
A
  1. Merge files
  2. Parquet
9
Q
A

All the Dim tables –> Replicated
Fact Tables –> Hash Distributed

10
Q
A
  1. Cool –> accessed infrequently, but the data must be available within a short time when you do need it
  2. Archive –> rarely or never accessed; you need a data archiving solution, so the blobs must be retained and not deleted
11
Q
A

DISTRIBUTION = HASH ([ID])
PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))
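
For context, a minimal CREATE TABLE sketch using these two clauses in a dedicated SQL pool (the table name and the columns other than ID are placeholders):

CREATE TABLE dbo.FactSales
(
    [ID] int NOT NULL,
    [Amount] decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH ([ID]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);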

12
Q
A

D. as a Type 2 slowly changing dimension (SCD) table

13
Q
A

F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.

14
Q
A
  1. 0
  2. Value stored in database
15
Q
A

Answer is C. Drop the external table and recreate it.

16
Q
A
  1. Binary
  2. PreserveHierarchy
17
Q
A

B. read-access geo-redundant storage (RA-GRS)

18
Q
A

ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different data centers; in our scenario this meets the requirements)
-> D is right
GRS/GZRS: like LRS/ZRS but with the data centers in different Azure regions. This would also work but is more expensive than ZRS, so ZRS is the right answer.

19
Q
A

Round-robin - this is the simplest distribution model, not great for querying but fast to load
Heap - a no-brainer when creating staging tables
No partitions - this is a staging table that is truncated daily, so why add the effort of partitioning?

20
Q
A

B. hash-distributed on PurchaseKey. (Hash-distributed tables improve query performance on large fact tables. The PurchaseKey has many unique values, does not have NULLs and is not a date column.)

21
Q
A

EventCategory -> dimEvent
channelGrouping -> dimChannel
TotalEvents -> factEvent

22
Q
A

The answer is A.
Compression not only helps to reduce the size a file occupies in storage, it also increases the speed of file movement during transfer.

23
Q
A

Answer is No; use a heap (no clustered index) instead.

24
Q
A

No; rows need to be smaller than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.

25
Q
A

Create materialized views that store the results of the complex SELECT queries. Materialized views are precomputed views stored as tables, and they can significantly reduce query times by avoiding the need to recompute the results every time the query is executed.
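
A minimal sketch of such a materialized view in a dedicated SQL pool (the underlying table, columns, and aggregation are placeholders):

CREATE MATERIALIZED VIEW dbo.mvSalesByRegion
WITH (DISTRIBUTION = HASH ([RegionKey]))
AS
SELECT [RegionKey],
       COUNT_BIG(*)             AS [OrderCount],
       SUM(ISNULL([Amount], 0)) AS [TotalAmount]
FROM dbo.FactSales
GROUP BY [RegionKey];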

26
Q
A

D. Parquet

27
Q
A

D: Databricks with Java

28
Q
A

A. Convert the files to JSON => makes no sense
B. Convert the files to Avro => my understanding is that the CSV format of the files is a given, so no
C. Compress the files => for batch processing it’s a win, and an option you can assume is true given the available information
D. Merge the files => this can be true, but not knowing how many files there are is a big issue

I would take D; it is the most intuitive.

29
Q
A
  1. Move to Cool Tier
  2. Container1/contoso.csv
30
Q
A

Select D because analysts will most commonly analyze transactions for a given month

31
Q
A

Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

Azure Blob storage lifecycle management rules

32
Q
A

B. Parquet

33
Q
A

C. Switch the first partition from stg.Sales to dbo.Sales

ALTER TABLE stg.Sales
SWITCH PARTITION 1
TO dbo.Sales
PARTITION 1;

34
Q
A

A. surrogate primary key
B. effective start date
E. effective end date

35
Q
A

denormalizing and IDENTITY

36
Q
A
37
Q
A

A. 40

The number of records for the period stated = 2.4 billion
Number of underlying (“automatic”) distributions: 60

2.4 billion / 60 distributions = 40 million rows
40 million / 40 partitions = 1 million rows

As stated, 1 million rows per distribution (per partition) are optimal for compression and performance. Divide the 40 million rows by the other partitioning options and you get too few rows per distribution -> suboptimal.

38
Q
A

A: Type 1
B: a surrogate Key

39
Q
A

Is there any “official” answer to this?
A. Replicated: Replicated tables have copies of the entire table on each distribution. While this option can eliminate data movement, it may not be the most efficient choice for very large tables with frequent updates.
B. Hash-Distributed on PurchaseKey: Hash distribution on “PurchaseKey” may lead to data skew if “PurchaseKey” doesn’t have a wide range of unique values. Additionally, it doesn’t align with the primary filtering condition on “DateKey.”
C. Round-Robin: Round-robin distribution ensures even data distribution, but it doesn’t take advantage of data locality for specific types of queries.
D. Hash-Distributed on DateKey: Distributing on “DateKey” aligns with your primary filtering condition, but it’s a date column. This could lead to clustering by date, especially if many users filter on the same date.

None of the answers seem to fit. D could be the best guess but it’s a date column.

40
Q
A

A. Use Snappy compression for the files.

Snappy compression can reduce the size of Parquet files by up to 70%. This can save you a significant amount of money on storage costs.

41
Q
A
  1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
  2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
  3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
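
A minimal sketch of the three statements for Parquet files in a dedicated SQL pool (the credential, storage URL, and all object names are placeholders):

CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (LOCATION = 'abfss://files@mystorageaccount.dfs.core.windows.net',
      CREDENTIAL = MyStorageCredential,
      TYPE = HADOOP);

CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.ExtSales
(
    [SaleID] int,
    [Amount] decimal(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = MyDataSource,
      FILE_FORMAT = MyParquetFormat);
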
42
Q
A

C. a dimension table for Employee
E. a fact table for Transaction

43
Q
A

C. Type 2

44
Q
A

Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
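
A rough sketch of the FIRST_ROW option and the reject options involved (all object names, the location, and the thresholds are placeholders); the reject options go in the external table's WITH clause:

CREATE EXTERNAL FILE FORMAT CsvWithHeader
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

CREATE EXTERNAL TABLE dbo.ExtCustomers
(
    [CustomerID]   int,
    [CustomerName] nvarchar(100)
)
WITH (LOCATION = '/customers/',
      DATA_SOURCE = MyAdlsSource,
      FILE_FORMAT = CsvWithHeader,
      REJECT_TYPE = PERCENTAGE,
      REJECT_VALUE = 5,
      REJECT_SAMPLE_VALUE = 1000);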

45
Q
A

Box 1: PARTITION -
RANGE RIGHT FOR VALUES is used with PARTITION.
Part 2: [TransactionDateID]
Partition on the date column.
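
A sketch of how the two boxes combine in the partition clause (the boundary values are placeholders); with RANGE RIGHT each boundary value belongs to the partition on its right, which is the usual choice for date keys:

PARTITION ([TransactionDateID] RANGE RIGHT FOR VALUES (20230101, 20230201, 20230301))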

46
Q
A

D. Only CSV files that have file names beginning with “tripdata_2020”.

47
Q
A

select, explode
alias

48
Q
A

First week: Hot
After one month: Cool
After one year: Cool

49
Q
A

D. Load the data by using PySpark.
(Best when the structure and data types vary by file)

50
Q
A

D. Create a pool in workspace1.

51
Q
A

Path pattern: {date}/product.csv
Date format YYYY-MM-DD

52
Q
A

False, True, False.
https://learn.microsoft.com/en-us/azure/stream-analytics/repartition
The first is False, because this:
“The following example query joins two streams of repartitioned data.”
It’s extracted from the link above, and it’s pointing to our query! Repartitioned and not partitioned.
Second is True; it’s explicitly written:
The output scheme should match the stream scheme key and count so that each substream can be flushed independently.
Third is False:
“In general, six SUs are needed for each partition.”
In the example we have 10 partitions for step 1 and 10 for step 2, so it should be 120 and not 60.

To be discussed

53
Q
A
  1. CREATE EXTERNAL TABLE
  2. OPENROWSET
54
Q
A
  1. dfs
  2. TYPE = HADOOP
55
Q
A

B. Parquet

56
Q
A
  1. Hash -> Fact Table
  2. DateKey -> for Partition
57
Q
A

Data over five years old: cool storage
Data over seven years old: move to archive storage (delete isn’t a data archiving solution)

58
Q
A

ZRS and Failover initiated by Microsoft. (To be discussed)

59
Q
A

B. and E.

60
Q
A

Replicated (because it’s a dimension table)
Hash (fact table with a high volume of data)
Round-robin (staging table)

61
Q
A

Box 1: CLUSTERED COLUMNSTORE INDEX
Box 2: HASH([ProductKey])

62
Q
A

A partition is divided into distribution units.

A) one per month => 30 million / 60 = 500 K rows per distribution per partition
B) one per year => 360 million / 60 = 6 million rows per distribution per partition
C) one per day => 1 million / 60 ≈ 17 K rows per distribution per partition
D) one per week => ~7.5 million / 60 = 125 K rows per distribution per partition

But since I already have 1 billion rows distributed, each distribution would hold about 16.7 million rows, which is well above the minimum. Because I want a reasonable number of partitions to speed up queries, I would choose one per month; one per year would create very few partitions.

Therefore the answer is A)

63
Q
A

MERGE Operation:
The MERGE operation in Apache Spark SQL (and specifically in Delta Lake) combines the capabilities of both INSERT and UPDATE. It allows you to upsert data (insert new records or update existing records) based on a specified condition, which is what you need for a slowly changing dimension (SCD) table like Table1 where historical data must be maintained.
Therefore, the correct answer is D. MERGE
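
A minimal Spark SQL sketch of such an upsert with MERGE on a Delta table (the table and column names are placeholders; a full SCD Type 2 load would add logic to close out the previous version of each row):

MERGE INTO dim_customer AS target
USING customer_updates AS source
ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
  UPDATE SET target.Email = source.Email, target.LastUpdated = source.LastUpdated
WHEN NOT MATCHED THEN
  INSERT (CustomerID, Email, LastUpdated)
  VALUES (source.CustomerID, source.Email, source.LastUpdated);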

64
Q
A

D: Parquet
Parquet has columnar format, best for reading a few columns

65
Q
A

Given the large size of the table, I will use PolyBase for the data transfer. Additionally, considering PolyBase’s constraint that it cannot load rows exceeding 1 MB in size, I will compress rows to stay within that limit, making PolyBase the optimal choice for the transfer. => Option A: yes

66
Q
A

A: replicated
Replicated because
- Dimension table
- Less than 2 GB ( less than 1 GB in this case)

67
Q
A

C. an IDENTITY column

68
Q
A

Yes, No, No. /.Folder 3 will be ignored because it is a hidden folder, and _mydata4.csv will also be ignored because it is a hidden file.

69
Q
A

B: month

  • Show order counts by week.
  • Calculate sales totals by region.
  • Calculate sales totals by product.

For these, GROUP BY is required while querying, so they cannot drive the partition. But the fourth requirement uses a WHERE clause, so month is ideal for the partition.
70
Q
A

A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet

The recommended folder structure that best meets the requirements is option A. It separates data by data source, year and week. It allows for easy filtering of data by year or week, which aligns with the usage pattern where most queries include a filter on the current year or week.

71
Q
A

D. ALTER INDEX ALL on table1 REBUILD

72
Q
A

E. [EffectiveStartDate]
B. [EffectiveEndDate]

73
Q
A
  1. Hash(CustomerID)
  2. Replicate
74
Q
A

A. Use a native external table and authenticate by using a shared access signature (SAS)

75
Q
A
  1. join DimGeography and DimCustomer (join Product and Product line also makes sense but is not an option)
  2. 5 tables.
76
Q
A

Auto Loader
Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS, dbfs:/).

77
Q
A

A. the WITH clause
There is no header row, so you should define the columns by using the WITH clause.

78
Q
A

D. 60 million
A dedicated SQL pool spreads each table across 60 distributions by default, and to achieve the best columnstore compression we need at least 1 million rows per distribution in each partition, hence option D: 60 million (1 M per distribution).

79
Q
A

C. Insert an extra row
D. Update one column of an existing row.
1) Insert an extra row with the updated last name and the current date as the StartDate.
2) Update one column of an existing row: set the EndDate of the previous row for that salesperson to the current date.
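
A minimal T-SQL sketch of the Type 2 change (the table, columns, and literal values are placeholders); closing the old row before inserting the new one keeps the EndDate IS NULL filter unambiguous:

-- Close out the current row for the salesperson
UPDATE dbo.DimSalesPerson
SET EndDate = GETDATE()
WHERE SalesPersonID = 123
  AND EndDate IS NULL;

-- Insert a new row carrying the updated last name
INSERT INTO dbo.DimSalesPerson (SalesPersonID, LastName, StartDate, EndDate)
VALUES (123, 'New-LastName', GETDATE(), NULL);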

80
Q
A

1: For Storage redundancy, you should select ZRS (Zone-redundant storage). This will maintain read and write access to data even if an availability zone becomes unavailable.

2: For data deletion, you should select A lifecycle management policy. This will allow you to automatically delete data that was last modified more than two years ago

81
Q
A

HR: archive storage after one day and delete after 2555 days
Operations: Cool storage after 180 days

82
Q
A
  1. Derived column
  2. Alter row
83
Q
A
  1. OPENROWSET
  2. BULK
84
Q
A

Parquet. CSV and Delta are also possible but are not options here.

85
Q
A

A and D: query acceleration supports CSV and JSON formatted data as input to each request.

86
Q
A

D. CSV; OPENROWSET doesn’t have a JSON option.

87
Q
A
  1. Yes:
    Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore.
  2. Yes:
    Currently only User-Password authentication is supported.
  3. No:
    Currently only User-Password authentication is supported. Storage Blob Contributor is an Azure RBAC (role-based access control) role, so it is not compatible (User-Password authentication only).
88
Q
A

Product name -> Type 0
Color -> Type 3
Size -> Type 2

89
Q
A
  1. Round_Robin
  2. Heap

The ROUND_ROBIN distribution distributes the data evenly across all distribution nodes in the SQL pool. This distribution type is suitable for loading data quickly into the staging tables because it minimizes the data movement during the loading process.

Use a HEAP table: Instead of creating a clustered index on the staging table, it is recommended to create a HEAP table. A HEAP table does not have a clustered index, which eliminates the need for maintaining the index and improves the data loading performance. It allows for faster insert operations
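
A minimal sketch of such a staging table in a dedicated SQL pool (the schema, table, and columns are placeholders):

CREATE TABLE stg.SalesLoad
(
    [SaleID] int NOT NULL,
    [Amount] decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);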

90
Q
A

A, C, D
Enable Azure Synapse Link for Cosmos1: this allows seamless integration between Azure Synapse Analytics and Cosmos DB, enabling efficient querying of data stored in Cosmos DB containers.
Create a linked service in ws1 that references Cosmos1: by creating a linked service, you establish a connection to your Cosmos DB account, allowing the serverless SQL pool to access data from Cosmos1.
“Make sure that you have prepared Analytical store” is a prerequisite.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-cosmos-db-analytical-store?tabs=openrowset-key#prerequisites

91
Q
A

No, Yes, No
Because names that begin with an underscore (_) or a period (.) are ignored.
In serverless SQL pools, /** must be specified at the end of the location path; in dedicated SQL pools the folders are always scanned recursively.

92
Q
A
  1. Execute
  2. Read and Execute
    Execute is needed to traverse (navigate) folders.
93
Q
A

B. tumbling
Retry capability is not supported on Schedule trigger

94
Q
A

Yes, No, Yes

95
Q
A

B. Cache hit percentage

96
Q
A
  1. BULK
  2. (Col1 int, Col2 varchar(20))

Serverless SQL pool uses BULK
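
A minimal serverless SQL pool sketch combining the two answers (the storage URL is a placeholder):

SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/files/data/file1.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
     )
WITH (Col1 int, Col2 varchar(20)) AS rows;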

97
Q
A

D. a partitioned view in a serverless SQL pool

Maybe A (a partitioned table in a dedicated SQL pool), because only a partitioned table can support partition elimination; but this scenario relies on CSV/Parquet files.

98
Q
A

A. 1 million

60 million rows per month / 60 distributions = 1 million rows per distribution per month.

99
Q
A

A. relationships
E. table definitions

100
Q
A

C, E, F

101
Q
A
  1. Partition by date with one partition per month
  2. Switch the oldest partition to another table and then drop it

A dedicated SQL pool divides each table across 60 distributions, and you should aim for at least 1 M rows per distribution per partition (although not mentioned in the question, clustered columnstore compression becomes efficient at this scale). Therefore, partitioning by day would result in partitions that are too small.

Having partitioned by month, we can use SWITCH to move the oldest partition using metadata operations only, which makes the operation extremely efficient.

102
Q
A

A. in a file in a data lake

Serverless SQL pools do not store data in tables; the data stays in files in the data lake.

103
Q
A

B. business area

104
Q
A

B. Create a pull request to merge the changes into the main branch.

105
Q
A

B. No

Debatable: the Save button being unavailable means that Git mode is not enabled. If the goal is to save the logic to the repository, enabling Git integration can help :P

106
Q
A

A. Yes
Horrible question, but yes you technically can store it as JSON to keep the logic and your work.

107
Q
A

B. No
You can’t export an ARM Template with validation errors.

108
Q
A

Add to a Delta table => format is delta
Add the rows to an existing table => mode is append

109
Q
A

Analytical store cannot be turned off after it has been enabled, so a new container is needed to transfer the data.

So:
1. Create a new container named Container2 and copy the contents of Container1 to Container2.
2. Delete Container1.
3. Create another container named Container1 and copy the contents of Container2 to Container1.
4. Delete Container2.

110
Q
A
  1. Create a new table named Table1v2 by running CTAS
  2. Rename Table1 as Table1_old.
  3. Rename Table1v2 as Table1
  4. Drop Table1_old
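
A minimal T-SQL sketch of this pattern in a dedicated SQL pool (the distribution option and column are placeholders):

CREATE TABLE dbo.Table1v2
WITH (DISTRIBUTION = HASH ([CustomerKey]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.Table1;

RENAME OBJECT dbo.Table1 TO Table1_old;
RENAME OBJECT dbo.Table1v2 TO Table1;

DROP TABLE dbo.Table1_old;
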
111
Q
A

B. No

112
Q
A

A. Update the firewall rules to allow Azure services to access sql1.
C. From the Access control (IAM) settings, assign the Contributor role to the system-assigned managed identity of workspace1.

113
Q
A

C - With a full fidelity schema, the analytical store will track both the data and their types accurately. This means different types for the same field will be stored in separate columns. Specifically:

There will be one column for customerID as an integer.
There will be another column for customerID as a string.
There will be one column for customer as a string.

114
Q
A
  1. so.filepath(1) = '2023'
  2. so.filepath(2) IN ('1', '2')
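
A sketch of how these filters are typically applied with OPENROWSET over a year/month folder layout (the storage URL and column list are placeholders; the alias so matches the answer):

SELECT so.*
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/files/orders/*/*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS so
WHERE so.filepath(1) = '2023'
  AND so.filepath(2) IN ('1', '2');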