Lakehouse Flashcards
What is a lakehouse?
A lakehouse presents as a database and is built on top of a data lake using Delta format tables.
What capabilities do lakehouses combine?
The SQL-based analytical capabilities of a relational data warehouse and the flexibility and scalability of a data lake.
What types of data formats can lakehouses store?
All data formats, including structured, semi-structured, and unstructured data.
What is the advantage of lakehouses being cloud-based?
They can scale automatically and provide high availability and disaster recovery.
What processing engines do lakehouses use?
Spark and SQL engines.
What is the schema-on-read format?
A format in which the schema is defined as the data is read, rather than being predefined before the data is stored.
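A minimal schema-on-read sketch for a Fabric notebook (the CSV path is hypothetical, and `spark` is the session the notebook provides):

```python
# The schema is inferred at read time (schema-on-read) rather than
# being defined before the file was stored.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales.csv")   # hypothetical path in the Files area

df.printSchema()   # shows the schema that was inferred on read
```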
What does ACID stand for in the context of lakehouses?
Atomicity, Consistency, Isolation, Durability.
What are the roles of different users in a lakehouse?
Data engineers ingest and transform the data, data scientists explore it and train machine learning models, and data analysts query it and build reports.
What is the ETL process?
Extract, Transform, Load.
What types of data sources can be ingested into a lakehouse?
Local files, databases, or APIs.
What are Fabric shortcuts?
Links to data stored elsewhere, such as in Azure Data Lake Storage Gen2 or in other OneLake locations.
What tools can be used to transform ingested data?
Apache Spark with notebooks or Dataflows Gen2.
What is the purpose of Data Factory pipelines?
To orchestrate different ETL activities and land prepared data into the lakehouse.
What familiar tool do Dataflows Gen2 utilize?
Power Query.
How can you analyze data in a lakehouse?
By querying its tables with SQL, through the SQL analytics endpoint.
What can be developed in Power BI using a lakehouse?
Reports.
How is lakehouse access managed?
Through workspace roles or item-level sharing.
What are sensitivity labels used for in lakehouses?
Classifying and protecting data, as part of Fabric's data governance features.
True or False: Item-level sharing is best for granting access for read-only needs.
True.
Fill in the blank: Lakehouses support _______ transactions through Delta Lake formatted tables.
ACID
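As an illustration only, writing a DataFrame in Delta format from a notebook produces a table with ACID behavior; the table name and sample rows below are made up:

```python
# Two tiny DataFrames standing in for ingested data (illustrative only).
df = spark.createDataFrame([(1, "laptop"), (2, "monitor")], ["id", "item"])
new_rows = spark.createDataFrame([(3, "keyboard")], ["id", "item"])

# Saving in Delta format creates an ACID-compliant table: each write is
# an atomic transaction recorded in the table's transaction log.
df.write.format("delta").mode("overwrite").saveAsTable("sales_orders")

# Appends either fully succeed or fully fail, and concurrent readers
# continue to see a consistent snapshot of the table.
new_rows.write.format("delta").mode("append").saveAsTable("sales_orders")
```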
What is a key benefit of using a lakehouse for analytics?
A scalable analytics solution that maintains data consistency.
What three items are automatically created in your workspace when you create a new lakehouse?
The lakehouse, the Semantic model (default), and the SQL analytics endpoint.
The lakehouse itself serves as a central hub for data management and contains shortcuts, folders, files, and tables.
What does the Semantic model (default) provide for Power BI report developers?
An easy data source for building Power BI reports over the lakehouse data.
The Semantic model simplifies data representation for reporting.
What is the purpose of the SQL analytics endpoint in a lakehouse?
Allows read-only access to query data with SQL.
This endpoint enables SQL-based interaction with the lakehouse data.
In what two modes can you work with data in the lakehouse?
Lakehouse mode and SQL analytics endpoint mode.
Each mode offers different capabilities for managing and querying data.
What is the first step in the ETL process for a lakehouse?
Ingesting data into your lakehouse.
This step is crucial for preparing data for analysis.
List the methods to ingest data into a lakehouse.
- Upload local files
- Dataflows Gen2
- Notebooks
- Data Factory pipelines
Each method has its own use case and benefits.
What should you consider when ingesting data to determine your loading pattern?
Whether to load all raw data as files or use staging tables.
This decision impacts performance and data processing efficiency.
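One possible pattern, sketched under the assumption that raw files land in the Files area and are then copied into a staging Delta table (all paths and names are hypothetical):

```python
# Land raw data as files first, then load it into a staging table
# before any transformation is applied.
raw = spark.read.option("header", "true").csv("Files/landing/orders.csv")
raw.write.format("delta").mode("overwrite").saveAsTable("stg_orders")

# Downstream steps read from the staging table rather than the raw files.
staged = spark.read.table("stg_orders")
```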
What can Spark job definitions be used for in a lakehouse?
To submit batch/streaming jobs to Spark clusters.
This allows for processing large volumes of data efficiently.
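Unlike a notebook, a Spark job definition runs a standalone script; a minimal batch-style sketch (paths and table names are assumptions) could look like this:

```python
# main.py - a minimal batch script that could be attached to a
# Spark job definition (paths and names are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-load").getOrCreate()

# Read raw files, keep only valid rows, and write the result to a Delta table.
orders = spark.read.option("header", "true").csv("Files/landing/orders/")
valid = orders.filter(orders["order_id"].isNotNull())
valid.write.format("delta").mode("append").saveAsTable("orders_clean")

spark.stop()
```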
What is the purpose of shortcuts in a lakehouse?
To integrate data while keeping it stored in external storage.
Shortcuts enhance data accessibility across different storage solutions.
How are source data permissions and credentials managed when using shortcuts?
They are managed by OneLake.
This central management simplifies access control across data sources.
What is required for a user to access data through a shortcut to another OneLake location?
The user must have permissions in the target location to read the data.
This ensures secure and authorized access to the data.
Where can shortcuts be created?
In both lakehouses and KQL databases.
This versatility allows for broader data integration options.
True or False: Shortcuts appear as a folder in the lake.
True.
This structure allows for organized data management within the lakehouse.
What is the main role of data transformations in the data loading process?
Most data requires transformation before it can be loaded into tables.
What tools can be used to transform and load data?
The same tools used to ingest data can also transform and load data.
What is a Delta table?
A table stored in the Delta Lake format, which adds ACID transaction support; transformed data can be loaded into the lakehouse either as files or as Delta tables.
Who favors notebooks for data engineering tasks?
Data engineers familiar with different programming languages including PySpark, SQL, and Scala.
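A short notebook-style PySpark transformation; the column names and target table are hypothetical:

```python
from pyspark.sql import functions as F

# Typical notebook transformation: derive a column, filter bad rows,
# then load the result into a Delta table for downstream use.
orders = spark.read.table("stg_orders")        # hypothetical staging table
transformed = (
    orders
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
    .filter(F.col("order_total") > 0)
)
transformed.write.format("delta").mode("overwrite").saveAsTable("orders_gold")
```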
What interface do Dataflows Gen2 use?
The Power Query interface.
What do pipelines provide in the ETL process?
A visual interface to perform and orchestrate ETL processes.
How complex can pipelines be?
Pipelines can be as simple or as complex as needed.
What is required for data to be used after ingestion?
Data must be transformed and loaded.
What do Fabric items provide for organizations?
The flexibility to choose the ingestion and transformation tools (pipelines, notebooks, or Dataflows Gen2) that best fit each organization's needs.
What tools can data scientists use for exploring and training machine learning models?
Notebooks or Data Wrangler.
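For example, a quick exploratory step in a notebook might pull a small sample into pandas (the table name is hypothetical), which is the kind of step Data Wrangler assists with from its UI:

```python
# Pull a small sample of a lakehouse table into pandas for quick exploration.
sample = spark.read.table("orders_gold").limit(1000).toPandas()
print(sample.describe())
print(sample.isna().sum())   # quick look at missing values
```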
What can report developers create using the semantic model?
Power BI reports.
What can analysts use the SQL analytics endpoint for?
To query, filter, aggregate, and explore data in lakehouse tables.
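As one hedged sketch, the SQL analytics endpoint can be queried like a SQL Server endpoint, for example from Python with pyodbc; the server, database, and table names below are placeholders:

```python
import pyodbc

# Connect to the lakehouse SQL analytics endpoint (read-only); the
# connection string values are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-analytics-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<lakehouse-name>;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Query, filter, and aggregate lakehouse tables with T-SQL.
cursor = conn.cursor()
cursor.execute(
    "SELECT item, SUM(quantity) AS total_qty "
    "FROM orders_clean GROUP BY item ORDER BY total_qty DESC"
)
for row in cursor.fetchall():
    print(row.item, row.total_qty)
```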
What is the benefit of combining Power BI with a data lakehouse?
You can implement an end-to-end analytics solution on a single platform.
Fill in the blank: After data is ingested, transformed, and loaded, it’s ready for _______.
others to use.
True or False: Dataflows Gen2 are excellent for developers familiar with SQL only.
False. Dataflows Gen2 use the Power Query interface, so they best suit users familiar with Power Query rather than SQL.