examtopics_1 Flashcards

1
Q
A

C. [ManagerEmployeeKey] [int] NULL

2
Q
A

B. an error (dbo schema error)

3
Q
A

The answer should be D –> A –> C.

Step 1:
Create an empty table SalesFact_Work with same schema as SalesFact.

Step 2:
Switch the partition (to be removed) from SalesFact to SalesFact_Work. The syntax is:
ALTER TABLE <source_table> SWITCH PARTITION <partition_number> TO <target_table> [PARTITION <partition_number>]

Step 3:
Delete the SalesFact_Work table.
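
A minimal T-SQL sketch of the three steps in a dedicated SQL pool (the distribution column, partition boundaries, and partition number below are placeholders, not taken from the question):

-- Step 1: empty work table with the same schema, distribution, and partition scheme as SalesFact
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH ([CustomerKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([TransactionDate] RANGE RIGHT FOR VALUES ('2020-01-01', '2020-02-01'))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the partition to be removed
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table together with the switched-out rows
DROP TABLE dbo.SalesFact_Work;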

4
Q
A

B. File1.csv and File4.csv only

5
Q
A

1: Parquet - column-oriented binary file format
2: Avro - row-based format, and has the logical type timestamp

6
Q
A

D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}_{DD}.csv

7
Q
A

1: PARQUET
Because Parquet is a columnar file format.

2: AVRO
Because Avro is a row-based file format (like JSON) and supports the logical type timestamp.

8
Q
A
  1. Merge files
  2. Parquet
9
Q
A

All the Dim tables –> Replicated
Fact Tables –> Hash Distributed

10
Q
A
  1. Cool –> accessed infrequently, but the data must be available within a short time when you do need it
  2. Archive –> rarely or never accessed; you need a data archiving solution, so the blobs must be retained and not deleted
11
Q
A

DISTRIBUTION = HASH ([ID])
PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))
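
For context, a minimal CREATE TABLE sketch using these two clauses in a dedicated SQL pool (the table name and the columns other than ID are placeholders):

CREATE TABLE dbo.FactSales
(
    [ID] int NOT NULL,
    [Amount] decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH ([ID]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);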

12
Q
A

D. as a Type 2 slowly changing dimension (SCD) table

13
Q
A

F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.

14
Q
A
  1. 0
  2. Value stored in database
15
Q
A

Answer is C. Drop the external table and recreate it.

16
Q
A
  1. Binary
  2. PreserveHierarchy
17
Q
A

B. read-access geo-redundant storage (RA-GRS)

18
Q
A

ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different data centers; in our scenario this meets the requirements)
-> D is right
GRS/GZRS: like LRS/ZRS but with the data centers in different Azure regions. This would also work but is more expensive than ZRS, so ZRS is the right answer.

19
Q
A

Round-robin - this is the simplest distribution model, not great for querying but fast to load
Heap - a no-brainer when creating staging tables
No partitions - this is a staging table that is truncated daily, so why add the effort of partitioning?

20
Q
A

B. hash-distributed on PurchaseKey. (Hash-distributed tables improve query performance on large fact tables. The PurchaseKey has many unique values, does not have NULLs and is not a date column.)

21
Q
A

EventCategory -> dimEvent
channelGrouping -> dimChannel
TotalEvents -> factEvent

22
Q
A

The answer is A.
Compression not only helps to reduce the size a file occupies in storage, it also increases the speed of file movement during transfer.

23
Q
A

Answer is No; use a heap (no clustered index) instead.

24
Q
A

No; rows need to be smaller than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.

25
Q
A

Create materialized views that store the results of the complex SELECT queries. Materialized views are precomputed views stored as tables, and they can significantly reduce query times by avoiding the need to recompute the results every time the query is executed.
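
A minimal sketch of such a materialized view in a dedicated SQL pool (the underlying table, columns, and aggregation are placeholders):

CREATE MATERIALIZED VIEW dbo.mvSalesByRegion
WITH (DISTRIBUTION = HASH ([RegionKey]))
AS
SELECT [RegionKey],
       COUNT_BIG(*)             AS [OrderCount],
       SUM(ISNULL([Amount], 0)) AS [TotalAmount]
FROM dbo.FactSales
GROUP BY [RegionKey];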

26
Q
A

D. Parquet

27
Q
A

D: Databricks with Java

28
Q
A

A. Convert the files to JSON => makes no sense
B. Convert the files to Avro => my understanding is that the CSV format of the files is a given, so no
C. Compress the files => for batch processing it’s a win, and an option you can assume is true given the available information
D. Merge the files => this can be true, but not knowing how many files there are is a big issue

I would take D; it is the most intuitive.

29
Q
A
  1. Move to Cool Tier
  2. Container1/contoso.csv
30
Q
A

Select D because analysts will most commonly analyze transactions for a given month

31
Q
A

Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

Azure Blob storage lifecycle management rules

32
Q
A

B. Parquet

33
Q
A

C. Switch the first partition from stg.Sales to dbo.Sales

ALTER TABLE stg.Sales
SWITCH PARTITION 1
TO dbo.Sales
PARTITION 1;

34
Q
A

A. surrogate primary key
B. effective start date
E. effective end date

35
Q
A

denormalizing and IDENTITY

36
Q
A
37
Q
A

A. 40

The number of records for the period stated = 2.4 billion
Number of underlying (“automatic”) distributions: 60

2.4 billion / 60 distributions = 40 million rows
40 million / 40 partitions = 1 million rows

As stated, 1 million rows per distribution (per partition) are optimal for compression and performance. Divide the 40 million rows by the other partitioning options and you get too few rows per distribution -> suboptimal.

38
Q
A

A: Type 1
B: a surrogate Key

39
Q
A

Is there any “official” answer to this?
A. Replicated: Replicated tables have copies of the entire table on each distribution. While this option can eliminate data movement, it may not be the most efficient choice for very large tables with frequent updates.
B. Hash-Distributed on PurchaseKey: Hash distribution on “PurchaseKey” may lead to data skew if “PurchaseKey” doesn’t have a wide range of unique values. Additionally, it doesn’t align with the primary filtering condition on “DateKey.”
C. Round-Robin: Round-robin distribution ensures even data distribution, but it doesn’t take advantage of data locality for specific types of queries.
D. Hash-Distributed on DateKey: Distributing on “DateKey” aligns with your primary filtering condition, but it’s a date column. This could lead to clustering by date, especially if many users filter on the same date.

None of the answers seem to fit. D could be the best guess but it’s a date column.

40
Q
A

A. Use Snappy compression for the files.

Snappy compression can reduce the size of Parquet files by up to 70%. This can save you a significant amount of money on storage costs.

41
Q
A
  1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
  2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
  3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
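
A minimal sketch of the three statements for Parquet files in a dedicated SQL pool (the credential, storage URL, and all object names are placeholders):

CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (LOCATION = 'abfss://files@mystorageaccount.dfs.core.windows.net',
      CREDENTIAL = MyStorageCredential,
      TYPE = HADOOP);

CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.ExtSales
(
    [SaleID] int,
    [Amount] decimal(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = MyDataSource,
      FILE_FORMAT = MyParquetFormat);
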
42
Q
A

C. a dimension table for Employee
E. a fact table for Transaction

43
Q
A

C. Type 2

44
Q
A

Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
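
A rough sketch of the FIRST_ROW option and the reject options involved (all object names, the location, and the thresholds are placeholders); the reject options go in the external table's WITH clause:

CREATE EXTERNAL FILE FORMAT CsvWithHeader
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

CREATE EXTERNAL TABLE dbo.ExtCustomers
(
    [CustomerID]   int,
    [CustomerName] nvarchar(100)
)
WITH (LOCATION = '/customers/',
      DATA_SOURCE = MyAdlsSource,
      FILE_FORMAT = CsvWithHeader,
      REJECT_TYPE = PERCENTAGE,
      REJECT_VALUE = 5,
      REJECT_SAMPLE_VALUE = 1000);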

45
Q
A

Box 1: PARTITION -
RANGE RIGHT FOR VALUES is used with PARTITION.
Part 2: [TransactionDateID]
Partition on the date column.
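
A sketch of how the two boxes combine in the partition clause (the boundary values are placeholders); with RANGE RIGHT each boundary value belongs to the partition on its right, which is the usual choice for date keys:

PARTITION ([TransactionDateID] RANGE RIGHT FOR VALUES (20230101, 20230201, 20230301))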

46
Q
A

D. Only CSV files that have file names beginning with “tripdata_2020”.

47
Q
A

select, explode
alias

48
Q
A

First week: Hot
After one month: Cool
After one year: Cool

49
Q
A

D. Load the data by using PySpark.
(Best when the structure and data types vary by file)

50
Q
A

D. Create a pool in workspace1.

51
Q
A

Path pattern: {date}/product.csv
Date format YYYY-MM-DD

52
Q
A

False, True, False.
https://learn.microsoft.com/en-us/azure/stream-analytics/repartition
The first is False, because this:
“The following example query joins two streams of repartitioned data.”
It’s extracted from the link above, and it’s pointing to our query! Repartitioned and not partitioned.
Second is True; it’s explicitly written:
The output scheme should match the stream scheme key and count so that each substream can be flushed independently.
Third is False:
“In general, six SUs are needed for each partition.”
In the example we have 10 partitions for step 1 and 10 for step 2, so it should be 120 and not 60.

To be discussed

53
Q
A
  1. CREATE EXTERNAL TABLE
  2. OPENROWSET
54
Q
A
  1. dfs
  2. TYPE = HADOOP
55
Q
A

B. Parquet

56
Q
A
  1. Hash -> Fact Table
  2. DateKey -> for Partition
57
Q
A

Data over five years old: cool storage
Data over seven years old: move to archive storage (delete isn’t a data archiving solution)

58
Q
A

ZRS and Failover initiated by Microsoft. (To be discussed)

59
Q
A

B. and E.

60
Q
A

Replicated (because it’s a dimension table)
Hash (fact table with a high volume of data)
Round-robin (staging table)

61
Q
A

Box 1: CLUSTERED COLUMNSTORE INDEX
Box 2: HASH([ProductKey])

62
Q
A

A partition is divided into distribution units.

A) one per month => 30 million / 60 = 500 K rows per distribution per partition
B) one per year => 360 million / 60 = 6 million rows per distribution per partition
C) one per day => 1 million / 60 ≈ 17 K rows per distribution per partition
D) one per week => ~7.5 million / 60 = 125 K rows per distribution per partition

But since I already have 1 billion rows distributed, each distribution would hold about 16.7 million rows, which is well above the minimum. Because I want a reasonable number of partitions to speed up queries, I would choose one per month; one per year would create very few partitions.

Therefore the answer is A)

63
Q
A

MERGE Operation:
The MERGE operation in Apache Spark SQL (and specifically in Delta Lake) combines the capabilities of both INSERT and UPDATE. It allows you to upsert data (insert new records or update existing records) based on a specified condition, which is what you need for a slowly changing dimension (SCD) table like Table1 where historical data must be maintained.
Therefore, the correct answer is D. MERGE
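
A minimal Spark SQL sketch of such an upsert with MERGE on a Delta table (the table and column names are placeholders; a full SCD Type 2 load would add logic to close out the previous version of each row):

MERGE INTO dim_customer AS target
USING customer_updates AS source
ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
  UPDATE SET target.Email = source.Email, target.LastUpdated = source.LastUpdated
WHEN NOT MATCHED THEN
  INSERT (CustomerID, Email, LastUpdated)
  VALUES (source.CustomerID, source.Email, source.LastUpdated);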

64
Q
A

D: Parquet
Parquet has columnar format, best for reading a few columns

65
Q
A

Given the large size of the table, I will use PolyBase for the data transfer. Additionally, considering PolyBase’s constraint that it cannot load rows exceeding 1 MB in size, I will compress rows to stay within that limit, making PolyBase the optimal choice for the transfer. => Option A: yes

66
Q
A

A: replicated
Replicated because
- Dimension table
- Less than 2 GB ( less than 1 GB in this case)

67
Q
A

C. an IDENTITY column

68
Q
A

Yes, No, No. /.Folder 3 will be ignored because it is a hidden folder, and _mydata4.csv will also be ignored because it is a hidden file.

69
Q
A

B: month

  • Show order counts by week.
  • Calculate sales totals by region.
  • Calculate sales totals by product.

For these, GROUP BY is required while querying, so they cannot drive the partition. But the fourth requirement uses a WHERE clause, so month is ideal for the partition.
70
Q
A

A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet

The recommended folder structure that best meets the requirements is option A. It separates data by data source, year and week. It allows for easy filtering of data by year or week, which aligns with the usage pattern where most queries include a filter on the current year or week.

71
Q
A

D. ALTER INDEX ALL on table1 REBUILD

72
Q
A

E. [EffectiveStartDate]
B. [EffectiveEndDate]

73
Q
A
  1. Hash(CustomerID)
  2. Replicate
74
Q
A

A. Use a native external table and authenticate by using a shared access signature (SAS)

75
Q
A
  1. join DimGeography and DimCustomer (join Product and Product line also makes sense but is not an option)
  2. 5 tables.
76
Q
A

Auto Loader
Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google Cloud Storage (GCS, gs://), Azure Blob Storage (wasbs://), ADLS Gen1 (adl://), and Databricks File System (DBFS, dbfs:/).

77
Q
A

A. the WITH clause
There is no header row, so you should define the columns by using the WITH clause.

78
Q
A

D. 60 million
A dedicated SQL pool spreads each table across 60 distributions by default, and to achieve the best columnstore compression we need at least 1 million rows per distribution in each partition, hence option D: 60 million (1 M per distribution).

79
Q
A

C. Insert an extra row
D. Update one column of an existing row.
1) Insert an extra row with the updated last name and the current date as the StartDate.
2) Update one column of an existing row: set the EndDate of the previous row for that salesperson to the current date.
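
A minimal T-SQL sketch of the Type 2 change (the table, columns, and literal values are placeholders); closing the old row before inserting the new one keeps the EndDate IS NULL filter unambiguous:

-- Close out the current row for the salesperson
UPDATE dbo.DimSalesPerson
SET EndDate = GETDATE()
WHERE SalesPersonID = 123
  AND EndDate IS NULL;

-- Insert a new row carrying the updated last name
INSERT INTO dbo.DimSalesPerson (SalesPersonID, LastName, StartDate, EndDate)
VALUES (123, 'New-LastName', GETDATE(), NULL);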

80
Q
A

1: For Storage redundancy, you should select ZRS (Zone-redundant storage). This will maintain read and write access to data even if an availability zone becomes unavailable.

2: For data deletion, you should select A lifecycle management policy. This will allow you to automatically delete data that was last modified more than two years ago

81
Q
A

HR: archive storage after one day and delete after 2555 days
Operations: Cool storage after 180 days

82
Q
A
  1. Derived column
  2. Alter row
83
Q
A
  1. OPENROWSET
  2. BULK
84
Q
A

Parquet. CSV and Delta are also possible but are not options here.

85
Q
A

A and D: query acceleration supports CSV and JSON formatted data as input to each request.

86
Q
A

D. CSV; OPENROWSET doesn’t have a JSON option.

87
Q
A
  1. Yes:
    Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore.
  2. Yes:
    Currently only User-Password authentication is supported.
  3. No:
    Currently only User-Password authentication is supported. Storage Blob Contributor is an Azure RBAC (role-based access control) role, so it is not compatible (User-Password authentication only).
88
Q
A

Product name -> Type 0
Color -> Type 3
Size -> Type 2

89
Q
A
  1. Round_Robin
  2. Heap

The ROUND_ROBIN distribution distributes the data evenly across all distribution nodes in the SQL pool. This distribution type is suitable for loading data quickly into the staging tables because it minimizes the data movement during the loading process.

Use a HEAP table: Instead of creating a clustered index on the staging table, it is recommended to create a HEAP table. A HEAP table does not have a clustered index, which eliminates the need for maintaining the index and improves the data loading performance. It allows for faster insert operations
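
A minimal sketch of such a staging table in a dedicated SQL pool (the schema, table, and columns are placeholders):

CREATE TABLE stg.SalesLoad
(
    [SaleID] int NOT NULL,
    [Amount] decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);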

90
Q
A

A, C, D
Enable Azure Synapse Link for Cosmos1: this allows seamless integration between Azure Synapse Analytics and Cosmos DB, enabling efficient querying of data stored in Cosmos DB containers.
Create a linked service in ws1 that references Cosmos1: by creating a linked service, you establish a connection to your Cosmos DB account, allowing the serverless SQL pool to access data from Cosmos1.
“Make sure that you have prepared Analytical store” is a prerequisite.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-cosmos-db-analytical-store?tabs=openrowset-key#prerequisites

91
Q
A

No, Yes, No
Because names that begin with an underscore (_) or a period (.) are ignored.
In serverless SQL pools, /** must be specified at the end of the location path; in dedicated SQL pools the folders are always scanned recursively.

92
Q
A
  1. Execute
  2. Read and Execute
    Execute is needed to traverse (navigate) folders.
93
Q
A

B. tumbling
Retry capability is not supported on Schedule trigger

94
Q
A

Yes, No, Yes

95
Q
A

B. Cache hit percentage

96
Q
A
  1. BULK
  2. (Col1 int, Col2 varchar(20))

Serverless SQL pool uses BULK
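
A minimal serverless SQL pool sketch combining the two answers (the storage URL is a placeholder):

SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/files/data/file1.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
     )
WITH (Col1 int, Col2 varchar(20)) AS rows;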

97
Q
A

D. a partitioned view in a serverless SQL pool

Maybe A (a partitioned table in a dedicated SQL pool), because only a partitioned table can support partition elimination; but this scenario relies on CSV/Parquet files.

98
Q
A

A. 1 million

60 million rows per month / 60 distributions = 1 million rows per distribution per month.

99
Q
A

A. relationships
E. table definitions

100
Q
A

C, E, F

101
Q
A
  1. Partition by date with one partition per month
  2. Switch the oldest partition to another table and then drop it

A dedicated SQL pool divides each table across 60 distributions, and you should aim for at least 1 M rows per distribution per partition (although not mentioned in the question, clustered columnstore compression becomes efficient at this scale). Therefore, partitioning by day would result in partitions that are too small.

Having partitioned by month, we can use SWITCH to move the oldest partition using metadata operations only, which makes the operation extremely efficient.

102
Q
A

A. in a file in a data lake

Serverless SQL pools do not store data in tables; the data stays in files in the data lake.

103
Q
A

B. business area

104
Q
A

B. Create a pull request to merge the changes into the main branch.

105
Q
A

B. No

Debatable: the Save button being unavailable means that Git mode is not enabled. If the goal is to save the logic to the repository, enabling Git integration can help :P

106
Q
A

A. Yes
Horrible question, but yes you technically can store it as JSON to keep the logic and your work.

107
Q
A

B. No
You can’t export an ARM Template with validation errors.

108
Q
A

Add to a Delta table => format is delta
Add the rows to an existing table => mode is append

109
Q
A

Analytical store cannot be turned off after it has been enabled, so a new container is needed to transfer the data.

So:
1. Create a new container named Container2 and copy the contents of Container1 to Container2.
2. Delete Container1.
3. Create another container named Container1 and copy the contents of Container2 to Container1.
4. Delete Container2.

110
Q
A
  1. Create a new table named Table1v2 by running CTAS
  2. Rename Table1 as Table1_old.
  3. Rename Table1v2 as Table1
  4. Drop Table1_old
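
A minimal T-SQL sketch of this pattern in a dedicated SQL pool (the distribution option and column are placeholders):

CREATE TABLE dbo.Table1v2
WITH (DISTRIBUTION = HASH ([CustomerKey]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.Table1;

RENAME OBJECT dbo.Table1 TO Table1_old;
RENAME OBJECT dbo.Table1v2 TO Table1;

DROP TABLE dbo.Table1_old;
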
111
Q
A

B. No

112
Q
A

A. Update the firewall rules to allow Azure services to access sql1.
C. From the Access control (IAM) settings, assign the Contributor role to the system-assigned managed identity of workspace1.

113
Q
A

C - With a full fidelity schema, the analytical store will track both the data and their types accurately. This means different types for the same field will be stored in separate columns. Specifically:

There will be one column for customerID as an integer.
There will be another column for customerID as a string.
There will be one column for customer as a string.

114
Q
A
  1. so.filepath(1) = '2023'
  2. so.filepath(2) IN ('1', '2')
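
A sketch of how these filters are typically applied with OPENROWSET over a year/month folder layout (the storage URL and column list are placeholders; the alias so matches the answer):

SELECT so.*
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/files/orders/*/*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS so
WHERE so.filepath(1) = '2023'
  AND so.filepath(2) IN ('1', '2');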