examtopics_1 Flashcards
C. [ManagerEmployeeKey] [int] NULL
B. an error (dbo schema error)
The answer should be D –> A –> C.
Step 1:
Create an empty table SalesFact_Work with same schema as SalesFact.
Step 2:
Switch the partition (to be removed) from SalesFact to SalesFact_Work. The syntax is:
ALTER TABLE <source_table> SWITCH PARTITION <partition_number> TO <destination_table>
Step 3:
Delete the SalesFact_Work table.
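A minimal T-SQL sketch of the three steps above, assuming partition 1 is the one to remove; the distribution, index, and partition boundaries are illustrative and would have to match SalesFact:

-- Step 1: empty work table with the same schema as SalesFact
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20230101, 20230201))
)
AS
SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the partition to be removed
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table, discarding the switched-out rows
DROP TABLE dbo.SalesFact_Work;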
B. File1.csv and File4.csv only
1: Parquet - column-oriented binary file format
2: AVRO - Row-based format that has a logical timestamp type
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}_{DD}.csv
1: PARQUET
Because Parquet is a columnar file format.
2: AVRO
Because Avro is a row-based file format (like JSON) and supports a logical timestamp type
- Merge files
- Parquet
All the Dim tables –> Replicated
Fact Tables –> Hash Distributed
- Cool –> Accessed infrequently, but the data must be available quickly when you do need it
- Archive –> You will almost never access them, but you need a data-archiving solution, so the blobs must always be retained and never deleted
DISTRIBUTION = HASH(id)
PARTITION (id RANGE LEFT FOR VALUES (1, 1000000, 2000000))
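A minimal sketch of where these two clauses sit in a dedicated SQL pool CREATE TABLE (table and column names are illustrative):

CREATE TABLE dbo.FactSales
(
    id INT NOT NULL,
    amount DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (id RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);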
D. as a Type 2 slowly changing dimension (SCD) table
F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
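A minimal sketch of step B, assuming the load runs as a COPY INTO statement in a dedicated SQL pool (storage URL and table name are illustrative):

-- The managed identity must already have been granted access to the storage (steps F and A)
COPY INTO dbo.SalesStaging
FROM 'https://contosolake.dfs.core.windows.net/sales/2024/*.parquet'
WITH
(
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);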
- 0
- Value stored in database
Answer is C. Drop the external table and recreate it.
- Binary
- PreserveHierarchy
B. read-access geo-redundant storage (RA-GRS)
ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different Data Centers. In our scenario this would meet the requirements)
-> D is right
GRS/GZRS: like LRS/ZRS but with the data centers in different Azure regions. This works too but is more expensive than ZRS, so ZRS is the right answer.
Round-robin - this is the simplest distribution model, not great for querying but fast to process
Heap - a no-brainer when creating staging tables
No partitions - this is a staging table; why add the effort of partitioning when it is truncated daily?
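A minimal sketch of such a staging table (names are illustrative):

CREATE TABLE stg.DailySalesLoad
(
    OrderId INT,
    OrderDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);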
B. hash-distributed on PurchaseKey. (Hash-distributed tables improve query performance on large fact tables. The PurchaseKey has many unique values, does not have NULLs and is not a date column.)
EventCategory -> dimEvent
channelGrouping -> dimChannel
TotalEvents -> factEvent
The answer is A
Compression not only helps reduce the size or space a file occupies in storage but also increases the speed of file movement during transfer
Answer is no; use a heap instead
No, rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.
Create materialized views that store the results of the complex SELECT queries. Materialized views are precomputed views stored as tables, and they can significantly reduce query times by avoiding the need to recompute the results every time the query is executed.
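A minimal sketch of such a materialized view in a dedicated SQL pool (view, table, and column names are illustrative):

CREATE MATERIALIZED VIEW dbo.mvSalesByRegion
WITH (DISTRIBUTION = HASH(RegionKey))
AS
SELECT
    RegionKey,
    COUNT_BIG(*) AS OrderCount,
    SUM(ISNULL(Amount, 0)) AS TotalAmount
FROM dbo.FactSales
GROUP BY RegionKey;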
D. Parquet
D: Databricks with Java
A. Convert the files to JSON => makes no sense
B. Convert the files to Avro => my understanding is that the CSV format of the files is a given, so no
C. Compress the files => for batch processing it's a win, and this is the option you can assume is correct given the available information
D. Merge the files => this could be true, but not knowing how many files there are is a big issue
I would take D; it is the most intuitive.
- Move to Cool Tier
- Container1/contoso.csv
Select D because analysts will most commonly analyze transactions for a given month
Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
Azure Blob storage lifecycle management rules
B. Parquet
C. Switch the first partition from stg.Sales to dbo.Sales
ALTER TABLE stg.Sales
SWITCH PARTITION 1
TO dbo.Sales
PARTITION 1;
A. surrogate primary key
B. effective start date
E. effective end date
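A minimal sketch of a Type 2 SCD dimension carrying those three columns (table and column names are illustrative):

CREATE TABLE dbo.DimCustomer
(
    CustomerSK INT IDENTITY(1, 1) NOT NULL,   -- A. surrogate primary key
    CustomerBK NVARCHAR(20) NOT NULL,         -- business (natural) key
    CustomerName NVARCHAR(100),
    EffectiveStartDate DATETIME2 NOT NULL,    -- B. effective start date
    EffectiveEndDate DATETIME2 NULL           -- E. effective end date (NULL while current)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);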
denormalizing and IDENTITY
A. 40
The number of records for the period stated = 2.4 billion
Number of underlying (“automatic”) distributions: 60
2.4 billion / 60 distributions = 40 million rows
40 million / 40 partitions = 1 million rows
As stated, 1 million rows per distribution are optimal for compression and performance. Divide the 40 million rows by the other partitioning options and you get too few rows per distribution -> suboptimal.
A: Type 1
B: a surrogate Key
Is there any “official” answer to this?
A. Replicated: Replicated tables have copies of the entire table on each distribution. While this option can eliminate data movement, it may not be the most efficient choice for very large tables with frequent updates.
B. Hash-Distributed on PurchaseKey: Hash distribution on “PurchaseKey” may lead to data skew if “PurchaseKey” doesn’t have a wide range of unique values. Additionally, it doesn’t align with the primary filtering condition on “DateKey.”
C. Round-Robin: Round-robin distribution ensures even data distribution, but it doesn’t take advantage of data locality for specific types of queries.
D. Hash-Distributed on DateKey: Distributing on “DateKey” aligns with your primary filtering condition, but it’s a date column. This could lead to clustering by date, especially if many users filter on the same date.
None of the answers seem to fit. D could be the best guess but it’s a date column.
A. Use Snappy compression for the files.
Snappy compression can reduce the size of Parquet files by up to 70%. This can save you a significant amount of money on storage costs.
- CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
- CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
- CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
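A minimal sketch of the three statements in order, assuming a dedicated SQL pool, PolyBase external tables, and a managed-identity credential (all names, locations, and columns are illustrative):

-- assumes a database master key already exists
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity';

CREATE EXTERNAL DATA SOURCE SalesLake
WITH
(
    TYPE = HADOOP,
    LOCATION = 'abfss://sales@contosolake.dfs.core.windows.net',
    CREDENTIAL = msi_cred
);

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE ext.Sales
(
    OrderId INT,
    OrderDate DATE,
    Amount DECIMAL(18, 2)
)
WITH (LOCATION = '/2024/', DATA_SOURCE = SalesLake, FILE_FORMAT = ParquetFormat);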
C. a dimension table for Employee
E. a fact table for Transaction
C. Type 2
Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
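A minimal sketch of the First_Row option from step 2 and the reject options from step 3, shown here on a CREATE EXTERNAL TABLE definition in a dedicated SQL pool (data source name, thresholds, and columns are illustrative):

CREATE EXTERNAL FILE FORMAT CsvSkipHeader
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)   -- skip the header row
);

CREATE EXTERNAL TABLE ext.RawSales
(
    OrderId INT,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/raw/sales/',
    DATA_SOURCE = SalesLake,            -- external data source from step 1
    FILE_FORMAT = CsvSkipHeader,
    REJECT_TYPE = PERCENTAGE,           -- reject by percentage of failed rows
    REJECT_VALUE = 5,
    REJECT_SAMPLE_VALUE = 1000
);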
Box 1: PARTITION
RANGE RIGHT FOR VALUES is used with PARTITION.
Box 2: [TransactionDateID]
Partition on the date column.
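A minimal sketch of that partition clause on a fact table (boundary values, distribution, and other columns are illustrative):

CREATE TABLE dbo.FactTransaction
(
    TransactionKey INT NOT NULL,
    TransactionDateID INT NOT NULL,
    Amount DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (TransactionDateID RANGE RIGHT FOR VALUES (20230101, 20230201, 20230301))
);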