Databricks Fundamentals Flashcards
Open Data Lake
Also known as a data lakehouse. The Databricks Data Intelligence Platform is built on this. The first part (foundation) of the pyramid.
- Data ingestion and storage
- Data processing and support for continuous data engineering
- Data access and consumption
- Data governance: discoverability, security, and compliance
- Infrastructure and operations
- All raw data (logs, text, audio, video, and images)
Delta Lake
- Unified data storage for reliability and sharing
- A file-based, open-source storage format with ACID transaction guarantees
- 2nd piece of the Data Intelligence Platform funnel/pyramid (after the Open Data Lake)
- Data layout is automatically optimized based on usage patterns
- Key capabilities: ACID transaction guarantees, scalable data and metadata handling, audit history and time travel, unified streaming and batch processing, schema enforcement and schema evolution (see the sketch below)
- Features: Predictive I/O, Predictive Optimization, Liquid Clustering
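A minimal PySpark sketch of Delta's transactional writes and time travel; the path is a hypothetical example, and it assumes a Spark session with Delta Lake available (automatic on Databricks):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with Delta Lake configured (automatic on Databricks).
spark = SparkSession.builder.getOrCreate()

# Write a DataFrame as a Delta table; each write is an ACID transaction.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Schema enforcement: appending mismatched columns raises an error
# unless schema evolution is explicitly enabled (e.g., mergeSchema).

# Time travel: read an earlier version of the table by version number.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users")
old.show()
```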
Unity Catalog
Unified security, governance, and cataloging
- Context-aware search, auto-described tables and columns, automated lineage, end-to-end observability and monitoring, sharing AI models (see the sketch below)
- 3rd piece of the Data Intelligence Platform funnel/pyramid (after Delta Lake)
- Securely get insights in natural language
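A minimal sketch of Unity Catalog's SQL-based governance, run through PySpark; the catalog, schema, and group names are hypothetical examples:

```python
# Assumes a Databricks workspace with Unity Catalog enabled and a
# `spark` session in scope (as in a Databricks notebook).
# The catalog/schema/principal names below are hypothetical.
spark.sql("CREATE CATALOG IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales.analytics")
spark.sql("GRANT SELECT ON SCHEMA sales.analytics TO `data-analysts`")
# Unity Catalog records lineage and audit events for these objects automatically.
```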
Data Intelligence Engine
Uses generative AI to understand the semantics of your data. Built on:
- Delta Lake
- Unity Catalog
ACID Transaction
- Atomicity: A transaction is treated as a single atomic unit. All steps that make up the transaction must succeed or the entire transaction rolls back. If they all succeed, the changes made by the transaction are permanently committed to the managing system. Consider the transfer transaction example. For the transaction to be committed to the database, the $200 must be successfully deducted from the savings account and added to the checking account. The funds in both accounts must be verified to ensure their accuracy. If any of these tasks fail, all changes roll back and none are committed.
- Consistency: A transaction must preserve the consistency of the underlying data. The transaction should make no changes that violate the rules or constraints placed on the data. For instance, a database supporting banking transactions might include a rule stating that a customer’s account balance can never be negative. If a transaction attempts to withdraw more money from an account than is available, the transaction will fail, and any changes made to the data will roll back.
- Isolation: A transaction is isolated from all other transactions. Transactions can run concurrently only if they don’t interfere with each other. Returning to the transfer transaction example, if another transaction were to attempt to withdraw funds from the same savings account, isolation would block the second transaction until the first completes. Without isolation, the second transaction might withdraw more funds than would remain in the account once the first transaction completed.
- Durability: A transaction that is committed is guaranteed to remain committed – that is, all changes are made permanent and will not be lost if an event such as a power failure should occur. This typically means persisting the changes to nonvolatile storage. If durability were not guaranteed, it would be possible for some or all changes to be lost, affecting the data’s reliability.
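The transfer example can be made concrete in a few lines of Python; this sketch uses SQLite (standard library, not Databricks) purely to illustrate atomic commit-or-rollback plus a consistency constraint:

```python
import sqlite3

# Illustration of atomicity with SQLite: the $200 transfer from the text
# either fully commits or fully rolls back.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: never negative
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("savings", 500), ("checking", 100)])
conn.commit()

try:
    with conn:  # commits on success, rolls back every step on error
        conn.execute("UPDATE accounts SET balance = balance - 200 "
                     "WHERE name = 'savings'")
        conn.execute("UPDATE accounts SET balance = balance + 200 "
                     "WHERE name = 'checking'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint rejected a negative balance; nothing committed

print(dict(conn.execute("SELECT name, balance FROM accounts")))
```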
Elements of Data Governance
- Data cataloging
- Data classification
- Auditing data entitlements and access
- Data discovery
- Data sharing and collaboration
- Data lineage
- Data security
- Data quality
Databricks Data Governance
Unity Catalog: unified governance and security
Delta Sharing: sharing between organizations; share live data without copying it, open cross-platform sharing, centralized administration and governance (see the sketch below)
Databricks Marketplace: commercialization of data assets
Databricks Clean Rooms: private, secure computing
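A sketch of consuming a Delta Share with the open-source delta-sharing Python client; the profile file and share/schema/table names are hypothetical placeholders supplied by the data provider:

```python
import delta_sharing

# The ".share" profile file holds the credentials the provider sent you.
profile = "config.share"

# Discover which tables have been shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table directly into pandas; the live data is never copied
# out of the provider's storage ahead of time.
table_url = profile + "#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```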
Databricks Security Architecture
- Control plane
- Data Plane
Data Plane
- One of the two planes in the Databricks security architecture
- Handles the movement of data packets within and between cloud environments
- Where data is processed by clusters of compute resources
Control plane
- One of the two planes in the Databricks security architecture
- Hosts the backend services that Databricks manages in its own cloud account, such as the web application, REST APIs, and cluster management
Photon
- Vectorized query engine that speeds up ETL and ingestion on the data lake; compatible with Apache Spark APIs
- Use cases: loading data into Delta and Parquet, IoT workloads, SQL-based workloads
Data Warehousing
- Databricks SQL
- Text-to-SQL
- AI-driven queries
- AI-driven serverless computing that scales for cost efficiency and peak performance
- AI-driven debugging and remediation (see the sketch below)
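A sketch of querying a Databricks SQL warehouse from Python with the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders for your workspace's values:

```python
from databricks import sql

# Connect to a SQL warehouse; fill in your workspace's connection details.
with sql.connect(
    server_hostname="dbc-example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi-...",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_date() AS today")
        for row in cursor.fetchall():
            print(row)
```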
Delta Live Tables (DLT)
ETL & Real-Time Analytics
- Automated and scalable streaming ingestion and transformation
- Workload-specific autoscaling
- Intelligent orchestration, error handling, and optimization (see the sketch below)
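A minimal DLT pipeline sketch in Python; it only runs inside a Delta Live Tables pipeline on Databricks (where `dlt` and `spark` are provided), and the paths and table names are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested continuously with Auto Loader.")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/example/landing/events")  # hypothetical landing path
    )

@dlt.table(comment="Cleaned events; DLT handles orchestration and retries.")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # data-quality expectation
def clean_events():
    return dlt.read_stream("raw_events").where(col("event_type") == "click")
```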
Orchestration
- Workflows
- Intelligent ETL processing, AI-driven debugging and remediation, end-to-end observability and monitoring, broad ecosystem integration (see the sketch below)
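A sketch of defining a Workflows job with the Databricks SDK for Python; the job name, notebook path, and cluster ID are hypothetical placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Reads workspace credentials from the environment (e.g., DATABRICKS_HOST).
w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-etl",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="ingest",
            existing_cluster_id="0123-456789-abcde",  # placeholder cluster
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/etl/ingest"  # placeholder notebook
            ),
        )
    ],
)
print(job.job_id)
```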
Gen AI
- Custom Models
- Model Serving
- RAG
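A sketch of calling a Model Serving endpoint (e.g., a RAG chatbot) over REST; the workspace URL, endpoint name, token, and request payload shape are hypothetical placeholders, since the payload depends on the served model:

```python
import requests

# Placeholders for your workspace URL and serving endpoint name.
WORKSPACE = "https://dbc-example.cloud.databricks.com"
ENDPOINT = "rag-chatbot"

response = requests.post(
    f"{WORKSPACE}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": "Bearer dapi-..."},  # placeholder token
    json={"messages": [{"role": "user", "content": "Summarize Q3 sales."}]},
)
print(response.json())
```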
End-to-End AI
- MLOps (MLflow)
- AutoML
- Monitoring
- Governance
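A minimal MLflow tracking sketch; on Databricks the tracking server is built in, and the run, parameter, and metric names below are arbitrary examples:

```python
import mlflow

# Log a training run; parameters and metrics become searchable in the UI.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    # mlflow.sklearn.log_model(model, "model")  # log/register a trained model
```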