DP-600 Part 1 Flashcards
How to use the on-premises data gateway
- Install the data gateway on the on-premises server
- In Fabric, create a new on-premises data gateway connection
- Use the gateway in a dataflow or data pipeline to get data into Fabric
Fabric Admin Portal - Tenant Settings
- Allows users to create Fabric items
- Allows users to create workspaces
- Whole host of security features
- Allow service principal access to Fabric APIs
- Allow Git integration
- Allow Copilot
Some settings can be enabled for the entire organization, for specific security groups, or for everyone except certain security groups. Other settings are simply enabled or disabled tenant-wide
Fabric Admin portal - Capacity Settings
- Create new capacities
- Delete capacities
- Manage capacity permissions
- Change the size of the capacity
Where to increase the SKU of the capacity
Go to Admin portal > Capacity settings and click through to Azure to update your capacity
Structure of a Fabric implementation and where admin happens
- Tenant: Fabric Admin Portal (Tenant Settings)
- Capacity: Azure Portal and Fabric Admin Portal (Capacity Settings)
- Workspace: Workspace settings (in the workspace)
- Item level: data warehouse, lakehouse, semantic model
- Object level: individual objects such as tables and views (e.g. dbo.customer)
Capacity administration tasks in Fabric capacity settings
- Enable disaster recovery
- View capacity usage report
- Define who can create workspaces
- Define who is a capacity administrator
- Update Power BI connection settings from/to this capacity
- Permit workspace admins to size their own custom Spark pools based on workspace compute requirements
- Assign workspaces to the Capacity
Workspace administrator settings
- Edit the license for the workspace (Pro, PPU, Fabric, Trial, etc.)
- Configure Azure connections
- Configure Azure DevOps connection (Git)
- Setup workspace identity
- Power BI settings
- Spark settings
What is the XMLA endpoint
The XMLA endpoint is essentially a gateway that lets external tools communicate with the data stored in Microsoft Fabric. This feature is particularly useful for those who need more control or prefer working with tools they are already comfortable with, like SSMS or Excel.
Difference between XMLA and other Fabric loading options like mirroring, Copy data activity, etc.
XMLA: ETL (transform data using tools you’re familiar with outside of Microsoft Fabric and then load it into the lakehouse)
Others: ELT
load raw data into Fabric and then transform it using tools within the platform
Dataflow vs Data Pipeline
- Dataflows are for straightforward ETL processes and data preparation, with a focus on user-friendly transformation of data for analytics.
- Data Pipelines are for more complex orchestration and management of data workflows, handling multiple steps and dependencies, often in an ELT scenario. They are more suited for technical users who need to automate and manage comprehensive data workflows.
Shortcut vs mirroring in Fabric
Shortcut: a reference to data that already exists in OneLake or in supported external storage; no copy of the data is made
Mirroring: process that creates a replicated copy of data in OneLake. Useful when you need a local copy of data for performance reasons, redundancy, or to ensure that your operations are not impacted by the availability or performance of the original data source.
What is Azure Blob Storage
cloud-based storage service provided by Microsoft Azure, designed to store large amounts of unstructured data. “Blob” stands for Binary Large Object and refers to data types like images, videos, documents, and other types of unstructured data.
What determines the number of capacities required
- Compliance with data residency regulations (e.g. if some data must reside in the EU and other data in the US, separate capacities in each region are required)
- Billing preference within the organization
- Segregating by workload type (e.g. data engineering vs. business intelligence)
- Segregating by department
What determines the required sizing of a capacity
- Intensity of expected workloads (high volume of data ingestion)
- Heavy data transformation (i.e. Spark)
- The higher the SKU, the more expensive (budget)
- Can you afford to wait? (smaller SKUs queue or throttle sooner under heavy load)
- Whether you need access to F64+ features (e.g. Copilot requires F64 or above)
Options of data ingestion
- shortcut: ADLS Gen 2, Amazon S3, Google Cloud Storage or Dataverse
- database mirroring: Azure SQL, Azure Cosmos DB, Snowflake
- ETL - dataflow: on-premises SQL Server
- ETL - data pipeline: on-premises SQL Server
- ETL - notebook
- Eventstream: real-time events
Other: ETL by dataflow, data pipeline or notebook
*The above shows the preferred options and the possibilities that are open
Data Ingestion requirement
Location of the data
- On-premises data gateway: if the data lives in an on-premises SQL Server (or other on-premises source)
- VNet data gateway: if the data lives in an Azure virtual network or behind a private endpoint
- Fast copy
Volume of the data
- low (megabytes per day): standard ingestion is sufficient; fast copy and staging are not needed
- medium (gigabytes per day): fast copy and staging
- high (many GB or terabytes per day): fast copy and staging
Difference between Virtual network data gateway and On-premises data gateway
Virtual network data gateway: used when your data is stored within an Azure Virtual Network (VNet). Enables secure connections between cloud services (like Power BI) and data sources that are inside an Azure VNet.
On-premises data gateway: used when data is stored outside of Azure, such as on your local network or in another cloud provider's environment (AWS, Google Cloud), provided you have direct network connectivity (VPN or ExpressRoute) to those environments. Enables secure connections between cloud services (like Power BI) and data sources that are not within Azure
Data Storage Options
- Lakehouse
- Warehouse
- KQL database
Deciding factors for data storage
Data type:
- Lakehouse: structured, semi-structured, and/or unstructured
- Relational/structured: lakehouse or warehouse
- Real-time/streaming: KQL database
Skills exist in the team:
- T-SQL: data warehouse
- Spark: lakehouse
- KQL: KQL database
The admin portal can only be accessed by
Someone with a Fabric license and one of the following roles:
- Global admin
- Power Platform admin
- Fabric admin
Toby creates a new workspace with some Fabric items to be used by Data Analysts. Toby creates a new security group called Data analyst. He includes himself as a member of this security group. Toby gives the Data analyst security group the Viewer role in the workspace. What workspace role does Toby have?
Admin. Since he is the creator of the workspace, his admin role supersedes the viewer role
Toby wants to delegate some of the management responsibilities in the workspace. He wants to give this person the ability to share content within the workspace and invite new Contributors to the workspace, but not add new Admins to the workspace. Which role should Toby give this person?
Member
You have the Admin role in a workspace. Sheila is a data engineer in your team. Currently she has no access to the workspace. Sheila needs to update a data transformation script in a PySpark notebook. The script gets data from a Lakehouse table, cleans it and then writes it to a table in the same Lakehouse. You want to adhere to the principle of least privilege. What actions should you take to enable this?
Share the lakehouse data with ReadAll Spark Data permission and share the Notebook with Edit permission
You have admin role in a workspace. You want to pre-install some useful Python packages to be used across all notebooks in your workspace. How do you achieve this?
Create an environment, install the packages in the environment and then go to workspace settings > spark settings and set the default environment.
About Domains
It is a way of logically grouping together all the data in an organization that is relevant to a particular area or field.
To group data into domains, workspaces are associated with domains!!! When a workspace is associated with a domain, all the items in the workspace are also associated with the domain and they receive a domain attribute as part of their metadata.
Domain Roles
- Fabric admin (or higher): fabric admins can create and edit domains, specify domain admins and domain contributors, and associate workspaces with domains. Fabric admins see all the defined domains on the Domains tab in the admin portal and they can edit and delete domains
- Domain admin: can only see and edit the domains they're admins of
- domain contributor: are workspace admins whom a domain or fabric admin has authorized to assign the workspaces they’re the admins of to a domain, or to change the current domain assignment.
When you define a domain for specified users and/or security groups, the following happens
- The system scans the organization’s workspaces. When it finds a workspace whose admin is a specified user or member of a specified security group: if the workspace already has a domain assignment, it is preserved. The default domain doesn’t override the current assignment. If the workspace is unassigned, it is assigned to the default domain.
- After this, whenever a specified user or member of a specified security group creates a new workspace, it is assigned to the default domain.
The specified users and/or members of the specified security groups generally automatically become domain contributors of workspaces that are assigned in this manner
Delta vs Parquet
Parquet: a way to store data in a very organized and efficient manner. Great for big data tools like Apache Spark and Hadoop because it helps save space and speeds up data retrieval. However, once you write a parquet file, you can't change it. If you need to update it, you have to create a whole new file.
Delta is the upgraded version of parquet. It allows you to make changes to your data without creating new files every time. So if you need to update or delete something, you can do it easily. Handy for real-time applications
Relating to microsoft fabric:
- Using Parquet files: getting the benefits of efficient storage and fast queries
- Using Delta files: gain the ability to handle changes in your data more flexibly, which is great for applications that need to adapt quickly to new information.
According to the Spark definition, a managed table is a
table that is stored in the Fabric Tables section, where both the data and the metadata are managed by Spark
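A minimal sketch of creating a managed table from a Fabric notebook, assuming the default lakehouse is attached; the table name and sample rows are hypothetical:

```python
# Sketch: create a managed Delta table in the attached lakehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created as `spark` in Fabric notebooks

df = spark.createDataFrame(
    [(1, "Widget", 9.99), (2, "Gadget", 19.99)],
    ["id", "product", "price"],
)

# saveAsTable registers the table under the lakehouse Tables section.
# Spark manages both the metadata and the Delta data files, so
# DROP TABLE removes the data as well as the metadata.
df.write.format("delta").mode("overwrite").saveAsTable("products")
```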
Deployment rules can be implemented to change things like
The default lakehouse (for a notebook) at different stages
There are other ways to manage deployment besides deployment pipelines with development, test/staging, and production stages
- managed through branching
- managed through Azure DevOps Pipelines (YAML templates)
- for semantic models, you can do it using the XMLA endpoint
In an Azure DevOps repo, the main branch is 'protected' (needs approval before any changes are merged into it). The repo contains one PBIP. You have to update the title in the report and merge these changes into the main branch. In which order should you carry out the following tasks?
- Clone the repository to your local machine
- Checkout a new feature branch from the MAIN branch
- Make the required changes to the report
- Commit and push the feature branch
- Open a pull request in Azure Repos
- Wait for approval, then merge into the main
You want to deploy a semantic model using the XMLA endpoint. Where can you go to find the XMLA endpoint to set up a connection with a third-party tool?
Go to the workspace settings for the workspace you want to deploy your model to
Different types of power bi files
- pbix: standard
- pbit: template
- pbip: track changes in Git for version control
What is the fabric capacity metrics app
Observe capacity utilisation trends to determine what processes are consuming CUs and whether any throttling is occurring
What is a DMV (dynamic management view)
Retrieves information about the current state of the data warehouse
sys.dm_exec_connections
Returns info about data warehouse connections
sys.dm_exec_sessions
Returns information about authenticated sessions
sys.dm_exec_requests
Returns information about active requests and provide details about SQL commands running in the data warehouse
Which DMVs do the Admin, Member, Contributor, and Viewer roles have permission to query?
Admin: all three
Member, Contributor, Viewer: all except sys.dm_exec_connections
3 Query Insights Views:
- queryinsights.exec_requests_history: details of each completed SQL query
- queryinsights.long_running_queries: details of queries with long execution times
- queryinsights.frequently_run_queries: details of frequently run queries
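A hedged sketch of querying the DMVs and Query Insights views above over the warehouse SQL connection string with pyodbc; the server and database values are placeholders, and the same statements can simply be run in the Fabric SQL query editor instead:

```python
# Sketch only: query warehouse DMVs / Query Insights views from Python.
# Requires the ODBC Driver for SQL Server on the client; connection values are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-warehouse.datawarehouse.fabric.microsoft.com;"  # placeholder SQL endpoint
    "Database=Warehouse1;"                                       # placeholder warehouse name
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Active requests currently running in the warehouse (DMV)
cursor.execute("SELECT session_id, status, command, start_time FROM sys.dm_exec_requests;")
print(cursor.fetchall())

# History of completed SQL queries (Query Insights view)
cursor.execute("SELECT TOP 10 * FROM queryinsights.exec_requests_history;")
print(cursor.fetchall())
```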
In Power BI Desktop, you enable incremental refresh for the table and load only one week's worth of data. You publish the semantic model to Workspace 1.
Due to the size of the semantic model, you need to bootstrap the initial full load.
What can you use to create the partitions in the Power BI service without processing them?
Use Tabular Editor to run the Apply Refresh Policy command on a table that has an incremental refresh policy defined in Power BI Desktop. This will create the partitions based on the policy but does not process them. This method is useful when working with very large datasets where the initial full load can take many hours
You are developing a large Microsoft Power BI semantic model that will contain a fact table. The table will contain 400 million rows.
You plan to leverage user-defined aggregations to speed up the performance of the most frequently run queries.You need to confirm that the queries are mapped to aggregated data in the tables.
Which two tools should you use? Each correct answer presents part of the solution.
SQL Server Profiler and DAX Studio: both can detect whether queries were answered from the in-memory cache by the storage engine or pushed to the data source by DirectQuery
You have Azure Databricks tables and a Fabric lakehouse.
You need to create a new Fabric artifact to combine data from both architectures. The solution must use data pipelines for the Azure Databricks data and shortcuts for the existing Fabric lakehouse.
What Fabric artifact should you create?
A lakehouse. Only a lakehouse can create shortcuts to other lakehouses; a Fabric data warehouse can use data pipelines but cannot use shortcuts!!!
You have a Fabric workspace that contains a lakehouse named Lakehouse1.
You need to create a data pipeline and ingest data into Lakehouse1 by using the Copy data activity.
Which properties on the General tab are mandatory for the activity?
Name only
You have a Fabric workspace that contains a set of Dataflow Gen2 queries.
You plan to use the native Dataflow Gen2 refresh scheduler to configure the queries to refresh as often as possible.
What is the fastest refresh interval that can be configured?
The native refresh scheduler for dataflows, just like semantic models, can be scheduled every 30 minutes.
You have a Fabric warehouse named Warehouse 1.
You discover a SQL query that performs poorly, and you notice that table statistics are out of date.
You need to manually update the statistics to improve query performance for the table.
Which column statistics should you update?
When manually creating or updating statistics to optimise query performance, you should focus on columns used in JOIN, ORDER BY, and GROUP BY clauses
How can you bring data into fabric
- Data ingestion: Dataflow, data pipeline, notebook, eventstream
- Shortcuts: external (amazon s3, adls) and internal (lakehouse, warehouse, kql tables)
- Database mirroring: Snowflake, Azure Cosmos DB, Azure SQL
Ingest data with a dataflow
Pros: no-code/low-code, performs ETL, can access on-premises data, can get multiple datasets at once (but better to space them out for data validation), can upload raw files or static files
Cons: struggles with large datasets, difficult to implement data validation, cannot pass in external parameters
Ingest data with data pipeline
Pros: able to ingest large datasets, import cloud data, when you need control-flow logic, trigger a wide variety of actions in Fabric (and outside of Fabric) like dataflows, notebooks, stored procs, KQL scripts, webhooks, Azure Functions, Azure ML, Azure Databricks
Cons: cannot transform data natively but can embed notebooks or dataflows, no ability to upload local files, does not currently work cross-workspace
Ingest data with notebook
Pros: extraction from APIs (using the Python requests library or similar), ability to use client libraries, good for code reuse, good for data validation and data quality testing of incoming data, fastest in terms of data ingestion (and most efficient for CU spend)
Cons: when you don't have Python capability in your organisation, and when you want to write to a data warehouse
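A hedged sketch of notebook-based ingestion along these lines; the API URL, response shape, and table name are all hypothetical:

```python
# Sketch: pull data from a REST API and land it as a Delta table in the lakehouse.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created as `spark` in Fabric notebooks

response = requests.get("https://api.example.com/v1/orders")  # placeholder endpoint
response.raise_for_status()
records = response.json()  # assumes the API returns a JSON array of objects

df = spark.createDataFrame(records)

# Simple data-quality gate before writing (one of the strengths of notebook ingestion)
assert df.count() > 0, "API returned no rows"

df.write.format("delta").mode("append").saveAsTable("orders_bronze")  # placeholder table
```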
What is monitoring hub used for
to track fabric items like semantic model
What is the capacity metrics app used for
Compute: capacity unit (CU) spend, time, overages, and a breakdown by workspace and Fabric item.
Storage: GB storage by workspace
What is performance monitoring and which data ingestion is it used for?
- Used for dataflows
- Provides a high level summary of a dataflow run which is available in the monitoring hub
- Lower-level data and specific understanding of a dataflow is available in the refresh history of a particular dataflow. (Inspect error messages and get a breakdown of the different sub-activities being performed by the dataflow)
Performance Optimization / Features
- Staging: when enabled, it's useful when you have huge data volumes and require lots of transformations. Not useful for small datasets
- Fast Copy: similar to data pipeline, used when struggling with performance
- DMV: receive live SQL Query lifecycle insights
Problem with delta tables and what are the ways to solve delta table problems
Context: Fabric is built on Delta/parquet files. While Delta is a great format because updates are tracked against the same table, frequent writes can leave many small files behind, which leads to poor performance and bloated storage size.
Solution: V-Order, OPTIMIZE, VACUUM, coalesce, and repartition are some of the options that help reduce these problems with Delta tables
How do I get a summary of a Delta table's files and how it is partitioned?
%%sql
-- my_table is a placeholder; returns file count, size in bytes, and partition columns
DESCRIBE DETAIL my_table
V-Order purpose
reduce file size and speed up reads by the Fabric engines (optimised sorting and compression applied when parquet files are written)
- to check if it's enabled: spark.conf.get('spark.sql.parquet.vorder.enabled')
- to enable/disable: spark.conf.set('spark.sql.parquet.vorder.enabled', 'true' / 'false')
Optimise purpose and features
It is idempotent (meaning it won't re-optimise files that have already been optimised)
- performs bin compaction by joining small files into larger files
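A minimal sketch, assuming a notebook attached to the lakehouse and a placeholder table name; the VORDER keyword is the Fabric-specific variant:

```python
# `spark` is the session pre-created in a Fabric notebook; "sales" is a placeholder table.
spark.sql("OPTIMIZE sales")         # bin compaction: merges small files into larger ones
spark.sql("OPTIMIZE sales VORDER")  # Fabric-specific: also rewrites the files with V-Order applied
```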
VACUUM purpose
removal of files that are no longer referenced by a delta table
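A one-line sketch with a placeholder table name; the retention window shown is the 7-day default expressed in hours:

```python
# Remove data files no longer referenced by the Delta log and older than the retention window.
spark.sql("VACUUM sales RETAIN 168 HOURS")  # 168 hours = 7 days
```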
COALESCE
Spark method to reduce the number of partitions in a DataFrame without a full shuffle. Let's say you have 100 partitions; you can coalesce them into 10 partitions
Repartition
Involves breaking up existing partitions to create new ones, which can be more or fewer than the original. Repartitioning is an expensive operation because it involves a full shuffle (unlike coalesce)
- always rewrites the files regardless of whether they have been optimised previously or not
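A short sketch contrasting the two on a placeholder DataFrame:

```python
# "sales" is a placeholder table; `spark` is the notebook session.
df = spark.read.table("sales")

# coalesce: merges existing partitions without a full shuffle; can only reduce the count
df_fewer = df.coalesce(10)

# repartition: full shuffle; the target count can be higher or lower than the original
df_reshuffled = df.repartition(50)

df_fewer.write.format("delta").mode("overwrite").saveAsTable("sales_compacted")  # placeholder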
V-Order increases the ___________ time but reduces the ___________ time
increases the write time but reduces the read time (i.e. improves read performance)
What level of access do you need to run Query Insights?
Contributor level (or higher). Viewer is not possible
Benefits of Direct Lake
- reads the parquet files directly (no import or duplication of data)
- for near-real-time needs, data stored in one Fabric data store, and large dataset sizes
- the data must be in a lakehouse or warehouse
What is composite model used for
for many-to-many relationships without the need for bridge tables. For example: for fact tables use direct query and for dim tables use import mode
Implicit vs Explicit measure
- Implicit measures in Power BI are measures that are automatically created by Power BI based on the data model. Power BI generates implicit measures based on the aggregation used in the visualizations. For example, if you create a bar chart and drag a field to the Value area, Power BI automatically creates a sum aggregation for that field, which becomes an implicit measure. Implicit measures are also created when you use the Quick Measure feature in Power BI.
- Explicit measures in Power BI are measures that are created by the user using DAX formulas. Explicit measures are highly customizable and can be used to create more complex calculations.
You have a Fabric tenant that contains a semantic model. You need to prevent report creators from populating visuals by using implicit measures. What are two tools that you can use to achieve the goal?
-Microsoft Power BI Desktop: allows you to control and manage how measures are used within your reports. By carefully defining and using explicit measures within your data model, you can ensure that report creators use only these predefined measures instead of creating implicit measures automatically
-Tabular Editor: powerful tool for managing and editing Power BI and Analysis service tabular models. It allows you to enforce best practices, such as disabling implicit measures, by modifying model’s properties and ensuring that only explicit measures are available for use in reports.
Transformations that support query folding
- removing or renaming columns
- merging foldable queries that are based on the SAME source
- appending foldable queries based on the SAME source
- numeric calculations
- joins
- pivot and unpivot
Transformations that do not support query folding
- merging or appending queries that are based on different sources
- using some functions while adding custom columns that do not have a counterpart in sql
- adding columns with complex logic. these refer to functions that do not have equivalent functions in the data source.
importance of query folding
- increased efficiency
- optimisation of cpu usage
- improved data security
What is the visualisation command in a Microsoft Fabric notebook to see a summary of the data
- display()
- display(df, summary=True) to check the statistics summary of a given Apache Spark DataFrame. The summary includes the column name, column type, unique values, and missing values for each column.
- Microsoft Fabric also supports displayHTML() option
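For example, in a notebook cell (the table name is a placeholder):

```python
df = spark.read.table("sales")  # placeholder table

display(df)                # interactive table/chart view of the DataFrame
display(df, summary=True)  # adds per-column statistics: type, unique values, missing values
```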
How to Embed a Power BI report in a notebook
- import powerbiclient
Capabilities:
- you can render an existing Power BI report (still in preview)
- create report visuals from a pandas Dataframe
- create report visuals from a spark dataframe
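A hedged sketch of the quick-visualize flow with the powerbiclient package (import names as commonly documented; check the package docs for your version):

```python
# Sketch: auto-generate Power BI visuals from a DataFrame inside a Fabric notebook.
from powerbiclient import QuickVisualize, get_dataset_config

pdf = df.toPandas()  # powerbiclient works with pandas DataFrames

qv = QuickVisualize(get_dataset_config(pdf))
qv  # rendering the object in the cell output shows the generated report visuals
```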
What does setting the XMLA Endpoint to read write do?
Allows users not only to read the data from the Power BI service but also to write back to it. Necessary for users to create and publish custom Direct Lake semantic models. Without write access, users would be unable to publish or update their models, which is a critical part of managing semantic models.
What does allowing XMLA Endpoint and Analyze in Excel with on-premises datasets to Enabled in the Tenant Settings do ?
We need to enable XMLA endpoints from the tenant settings to ensure that external tools can interact with the data models as and when necessary within the tenant's workspaces. The Analyze in Excel setting is just complementary to this setting in Fabric and is irrelevant here
RLS only applies to queries in _____ or _____ in Fabric. Power BI queries on a warehouse in Direct Lake mode will fall back to _____ to abide by RLS
- Applies to queries on warehouse or sql analytics endpoint
- will fall back to direct query mode
Query the $System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS dynamic management view (DMV). What does this do?
Provides information about the column segments held in memory for the tables in a semantic model. Helps identify the frequently used columns that are loaded into memory
Clone table in microsoft fabric
Microsoft fabric offers the capability to create near-instantaneous zero-copy clones with minimal storage costs.
- table clones facilitate development and testing processes by creating copies of tables in lower environments
- provides consistent reporting and zero-copy duplication of data for analytical workloads and ML modeling and testing
- provide the capability of data recovery in the event of a failed release or data corruption by retaining the previous state of data.
- help create historical reports that reflect the state of data as it existed as of a specific point-in-time in the past.
What is zero-copy clone
creates a replica of the table by copying the metadata, while still referencing the same data files in OneLake. The metadata is copied while the underlying data of the table stored as parquet files is not copied. The creation of a clone is similar to creating a table within a Warehouse in Microsoft Fabric
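A sketch of the clone statement run over the warehouse SQL endpoint with pyodbc; connection values are placeholders, and the same T-SQL can be run directly in the warehouse query editor:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-warehouse.datawarehouse.fabric.microsoft.com;"  # placeholder
    "Database=Warehouse1;"                                       # placeholder
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Zero-copy clone: only metadata is copied; the parquet files in OneLake are shared
cursor.execute("CREATE TABLE dbo.DimCustomer_test AS CLONE OF dbo.DimCustomer;")
conn.commit()
```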
Permissions to create a table clone
- users with the Admin, Member, or Contributor workspace roles can clone the tables within the workspace. The Viewer workspace role cannot create a clone
- SELECT permission on all the rows and columns of the source of the table clone is required
- users must have CREATE TABLE permission in the schema where the table clone will be created
Table clone inheritance
- inherits OLS from the source table of the clone
- inherits RLS and dynamic data masking
- all attributes that exist at the source, whether in the same schema or not
- inherits primary and unique key
Where can you do delta lake table optimisation and V-order?
ONLY IN THE LAKEHOUSE
You have a Fabric tenant that contains a warehouse.
You use a dataflow to load a new dataset from OneLake to the warehouse.
You need to add a PowerQuery step to identify the maximum values for the numeric columns.
Which function should you include in the step?
Use Table.Profile instead of Table.Max because Table.Max returns the row in a table that contains the maximum value for a specified column, rather than providing the maximum values for all numeric columns.
Where to enable XMLA read-write
By default, Premium capacity or Premium Per User semantic model workloads have the XMLA endpoint property setting enabled for read-only. This means applications can only query a semantic model. For applications to perform write operations, the XMLA Endpoint property must be enabled for read-write.
- Go to Settings > Admin Portal > Capacity Settings > Power BI premium > capacity name
- Expand workloads. In the XMLA endpoint setting, select read write.
TRUNCATE TABLE
removes all rows from a table, but the table structure and its columns, constraints, indexes, and so on, remain.
Query folding is not applicable to
CSV files (and other flat files), because there is no source query engine to push the transformations to
Are parquet files compressed?
By default, yes; there is no need to enable it. When writing parquet files, you can specify the desired compression codec to further optimise storage and performance
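A sketch of choosing a codec explicitly when writing from a notebook; the output path under the lakehouse Files area is a placeholder:

```python
df = spark.read.table("sales")  # placeholder table

# snappy is the usual default; gzip/zstd trade more CPU for smaller files
df.write.mode("overwrite").parquet("Files/exports/sales_snappy", compression="snappy")
```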