DP-600 Part 1 Flashcards
How to use the on-premises data gateway
- Install the data gateway on the on-premises server
- In Fabric, create a new on-premises data gateway connection
- Use the gateway in a dataflow or data pipeline to get data into Fabric
Fabric Admin Portal - Tenant Settings
- Allows users to create Fabric items
- Allows users to create workspaces
- A whole host of security features
- Allow service principal access to Fabric APIs (see the sketch after this list)
- Allow Git integration
- Allow Copilot
Some settings can be enabled for the entire organization, for specific security groups, or for everyone except certain security groups. Other settings are simply enabled or disabled.
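To illustrate the service principal setting above: a minimal sketch of calling the Fabric REST API as a service principal, assuming that tenant setting is enabled. The app registration IDs and secret below are placeholders.

```python
# Acquire a token for a service principal and list workspaces via the
# Fabric REST API. Requires the tenant setting that allows service
# principals to use Fabric APIs. All IDs below are placeholders.
import msal
import requests

app = msal.ConfidentialClientApplication(
    client_id="<app-client-id>",
    client_credential="<client-secret>",
    authority="https://login.microsoftonline.com/<tenant-id>",
)
token = app.acquire_token_for_client(
    scopes=["https://api.fabric.microsoft.com/.default"]
)

resp = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",
    headers={"Authorization": f"Bearer {token['access_token']}"},
)
print(resp.json())
```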
Fabric Admin Portal - Capacity Settings
- Create new capacities
- Delete capacities
- Manage capacity permissions
- Change the size of the capacity
Where to increase the SKU of the capacity
Go to Admin portal > Capacity settings and click through to Azure to update your capacity
Structure of a Fabric implementation and where admin happens
- Tenant: Fabric Admin Portal (Tenant Settings)
- Capacity: Azure Portal and Fabric Admin Portal (Capacity Settings)
- Workspace: Workspace settings (in the workspace)
- Item level: data warehouse, lakehouse, semantic model
- Object level: dbo.customer (a table), vw.MySQLView (a view)
Capacity administration tasks in Fabric capacity settings
- Enable disaster recovery
- View the capacity usage report
- Define who can create workspaces
- Define who is a capacity administrator
- Update Power BI connection settings from/to this capacity
- Permit workspace admins to size their own custom Spark pools based on workspace compute requirements
- Assign workspaces to the capacity
Workspace administrator settings
- Edit the license for the workspace (Pro, PPU, Fabric, Trial, etc.)
- Configure Azure connections
- Configure Azure DevOps connection (Git)
- Set up workspace identity
- Power BI settings
- Spark settings
What is the XMLA endpoint
The XMLA endpoint is essentially a gateway that lets external tools communicate with data stored in Microsoft Fabric (specifically, semantic models). This is particularly useful for those who need more control or prefer working with tools they are already comfortable with, like SSMS or Excel.
How XMLA differs from Fabric loading options like mirroring, Copy data activity, etc.
XMLA: ETL (transform data using tools you're familiar with outside of Microsoft Fabric and then load it into the lakehouse)
Others: ELT (load raw data into Fabric and then transform it using tools within the platform)
Dataflow vs Data Pipeline
- Dataflows are for straightforward ETL processes and data preparation, with a focus on user-friendly transformation of data for analytics.
- Data Pipelines are for more complex orchestration and management of data workflows, handling multiple steps and dependencies, often in an ELT scenario. They are more suited for technical users who need to automate and manage comprehensive data workflows.
Shortcut vs mirroring in Fabric
Shortcut: a reference to data that lives elsewhere (another OneLake location or external storage), so it appears in OneLake without being copied.
Mirroring: process that creates a replicated copy of data in OneLake. Useful when you need a local copy of data for performance reasons, redundancy, or to ensure that your operations are not impacted by the availability or performance of the original data source.
What is Azure Blob Storage
cloud-based storage service provided by Microsoft Azure, designed to store large amounts of unstructured data. “Blob” stands for Binary Large Object and refers to data types like images, videos, documents, and other types of unstructured data.
What determines the number of capacities required
- Compliance with data residency regulations (e.g. if some data must reside in the EU and other data in the US, you need separate capacities in different regions)
- Billing preference within the organization
- Segregating by workload type (e.g. Data Engineering, Business Intelligence)
- Segregating by department
What determines the required sizing of a capacity
- Intensity of expected workloads (e.g. high volume of data ingestion)
- Heavy data transformation (e.g. Spark)
- Budget: the higher the SKU, the more expensive
- Tolerance for waiting: on a smaller SKU, jobs may queue or run more slowly
- Whether F64+ features are needed (e.g. Copilot requires F64 or above)
Options for data ingestion
- Shortcut: ADLS Gen2, Amazon S3, Google Cloud Storage, or Dataverse
- Database mirroring: Azure SQL Database, Azure Cosmos DB, Snowflake
- ETL - dataflow: on-premises SQL
- ETL - data pipeline: on-premises SQL
- ETL - notebook
- Eventstream: real-time events
Other sources: ETL by dataflow, data pipeline, or notebook
*the above shows the preferred option for each source; other combinations remain possible
Data ingestion requirements
Location of the data
- On-premises data gateway: if data lives in an on-premises SQL Server
- VNet data gateway: if data lives in an Azure virtual network or behind a private endpoint
- Fast copy
Volume of the data
- Low (megabytes per day): no special measures needed
- Medium (gigabytes per day): fast copy and staging
- High (many GB or terabytes per day): fast copy and staging
Difference between Virtual network data gateway and On-premises data gateway
Virtual network data gateway: used when all your data is stored within an Azure Virtual Network (VNet). Enables secure connections between Azure services (like Power BI) and data sources that are inside an Azure VNet.
On-premises data gateway: used when data is stored outside of Azure, e.g. on your local network or in another cloud provider's environment (AWS, Google Cloud), provided you have direct network connectivity (VPN or ExpressRoute) to that environment. Enables secure connections between cloud services (like Power BI) and data sources that are not within Azure.
Data Storage Options
- Lakehouse
- Warehouse
- KQL database
Deciding factors for data storage
Data type:
- Lakehouse: structured, semi-structured, and/or unstructured
- Relational/structured: lakehouse or warehouse
- Real-time/streaming: KQL database
Skills exist in the team:
- T-SQL: data warehouse
- Spark: lakehouse
- KQL: KQL database
The admin portal can only be accessed by
Someone with a Fabric license who also holds one of the following roles:
- Global admin
- Power Platform admin
- Fabric admin
Toby creates a new workspace with some Fabric items to be used by data analysts. Toby creates a new security group called Data Analysts and includes himself as a member. Toby gives the Data Analysts security group the Viewer role in the workspace. What workspace role does Toby have?
Admin. Since he created the workspace, his Admin role supersedes the Viewer role.
Toby wants to delegate some of the management responsibilities in the workspace. He wants to give this person the ability to share content within the workspace and invite new Contributors to the workspace, but not add new Admins. Which role should Toby give this person?
Member
You have the Admin role in a workspace. Sheila is a data engineer on your team. Currently she has no access to the workspace. Sheila needs to update a data transformation script in a PySpark notebook. The script gets data from a Lakehouse table, cleans it, and then writes it to a table in the same Lakehouse. You want to adhere to the principle of least privilege. What actions should you take to enable this?
Share the lakehouse with the "Read all Apache Spark" (ReadAll) permission and share the notebook with the Edit permission.
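For context, a minimal sketch of the kind of script Sheila would edit; table names are hypothetical, and `spark` is the session a Fabric notebook provides:

```python
# Read from a lakehouse table (what the ReadAll Spark permission allows),
# clean the data, and write it back to a table in the same lakehouse.
df = spark.read.table("customers_raw")

cleaned = (
    df.dropDuplicates(["customer_id"])  # remove duplicate customers
      .na.drop(subset=["email"])        # drop rows with missing emails
)

cleaned.write.format("delta").mode("overwrite").saveAsTable("customers_clean")
```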
You have the Admin role in a workspace. You want to pre-install some useful Python packages to be used across all notebooks in your workspace. How do you achieve this?
Create an environment, install the packages in the environment, and then go to Workspace settings > Spark settings and set it as the default environment.
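For one-off needs there is also an in-line, session-scoped alternative in the notebook itself; unlike a default environment, it only affects the current session (the package here is just an example):

```python
# Installs only for this Spark session; a workspace default environment
# is what makes packages available across all notebooks.
%pip install beautifulsoup4

from bs4 import BeautifulSoup  # usable for the rest of this session
```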
About Domains
It is a way of logically grouping together all the data in an organization that is relevant to a particular area or field.
To group data into domains, workspaces are associated with domains. When a workspace is associated with a domain, all the items in the workspace are also associated with it and receive a domain attribute as part of their metadata.
Domain Roles
- Fabric admin (or higher): can create and edit domains, specify domain admins and domain contributors, and associate workspaces with domains. Fabric admins see all the defined domains on the Domains tab in the admin portal, and they can edit and delete domains.
- Domain admin: can only see and edit the domains they're admins of
- Domain contributor: workspace admins whom a domain or Fabric admin has authorized to assign the workspaces they're the admins of to a domain, or to change the current domain assignment
When you define a domain for specified users and/or security groups, the following happens
- The system scans the organization's workspaces. When it finds a workspace whose admin is a specified user or a member of a specified security group: if the workspace already has a domain assignment, it is preserved (the default domain doesn't override the current assignment); if the workspace is unassigned, it is assigned to the default domain.
- After this, whenever a specified user or member of a specified security group creates a new workspace, it is assigned to the default domain.
The specified users and/or members of the specified security groups generally automatically become domain contributors of workspaces that are assigned in this manner.
Delta vs Parquet
Parquet: a way to store data in a very organized and efficient manner. Great for big data tools like Apache Spark and Hadoop because it helps save space and speeds up data retrieval. However, once you write a Parquet file, you can't change it. If you need to update it, you have to create a whole new file.
Delta is the upgraded version of Parquet. It allows you to make changes to your data without creating new files by hand every time. So if you need to update or delete something, you can do it easily. Handy for real-time applications.
Relating to microsoft fabric:
- Using Parquet files: you get the benefits of efficient storage and fast queries
- Using Delta files: you gain the ability to handle changes in your data more flexibly, which is great for applications that need to adapt quickly to new information
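A minimal PySpark sketch of the difference, assuming a Fabric notebook with a default lakehouse attached; paths and values are illustrative:

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

df = spark.createDataFrame([(1, "widget", 9.99)], ["id", "product", "price"])

# Parquet: efficient columnar storage, but files are immutable --
# an "update" means rewriting the data yourself.
df.write.mode("overwrite").parquet("Files/sales_parquet")

# Delta: Parquet files plus a transaction log, so in-place updates
# and deletes are supported.
df.write.format("delta").mode("overwrite").save("Tables/sales_delta")

tbl = DeltaTable.forPath(spark, "Tables/sales_delta")
tbl.update(condition=F.col("id") == 1, set={"price": F.lit(8.99)})
tbl.delete(F.col("price") < 0)
```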
According to the Spark definition, a managed table is a
table whose data and metadata are both managed by Spark; in Fabric, it is stored in the lakehouse's Tables section
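A short sketch of creating a managed table from a Fabric notebook; the table name is hypothetical:

```python
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])

# saveAsTable creates a managed table: Spark manages both the data and the
# metadata, and in Fabric it appears in the lakehouse's Tables section.
df.write.format("delta").saveAsTable("customers")

# Dropping a managed table removes the underlying data as well as the metadata.
spark.sql("DROP TABLE customers")
```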
Deployment rules can be implemented to change things like
The default lakehouse (for a notebook) at different stages
There are ways to manage deployment other than deployment pipelines with development, test/staging, and production stages
- managed through branching
- managed through Azure DevOps Pipelines (YAML templates)
- for semantic models, you can do it using the XMLA endpoint
In an Azure DevOps repo, the main branch is 'protected' (needs approval before any changes are merged into it). The repo contains one PBIP. You have to update the title in the report and merge these changes into the main branch. In which order should you carry out the following tasks to achieve this?
- Clone the repository to your local machine
- Check out a new feature branch from the main branch
- Make the required changes to the report
- Commit and push the feature branch
- Open a pull request in Azure Repos
- Wait for approval, then merge into main
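A rough sketch of the same flow in git commands; the org/project/repo names and the branch name are placeholders:

```sh
git clone https://dev.azure.com/<org>/<project>/_git/<repo>
cd <repo>
git checkout -b feature/update-report-title   # branch off main
# ...edit the report title in the PBIP files...
git add .
git commit -m "Update report title"
git push -u origin feature/update-report-title
# open a pull request in Azure Repos; after approval, merge into main
```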
You want to deploy a semantic model using the XMLA endpoint. Where do you go to find the XMLA endpoint to set up a connection with a third-party tool?
Go to the workspace settings for the workspace you want to deploy your model to
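For reference, the workspace connection (XMLA endpoint) you'll find there follows this general format; the workspace name is a placeholder:

```
powerbi://api.powerbi.com/v1.0/myorg/<Workspace Name>
```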
Different types of Power BI files
- pbix: standard report file
- pbit: template
- pbip: Power BI project format; stores the report and model as text files so changes can be tracked in Git for version control
What is the Fabric Capacity Metrics app
Used to observe capacity utilization trends, determine which processes are consuming CUs, and see whether any throttling is occurring