Analytics Platforms Flashcards
Azure Data Lake Storage (Gen2)
Azure Data Lake Storage (ADLS) Gen2 is a cloud-based data storage solution designed for big data analytics. It combines the scalability and cost-effectiveness of Azure Blob Storage with hierarchical namespace capabilities for efficient data organization.
Key Features
* Cost and Performance: Low-cost, tiered, fully managed storage.
* Built on Blobs: Supports lifecycle policies and access tiers.
* Folder Hierarchy (Trees): Organize data into directories and files with multiple levels.
* Hadoop-Compatible: Work with Hadoop and HDFS-based frameworks.
* Massive Scale: Store petabytes of data at gigabit-per-second throughput.
* Fine-Grain ACLs: Supports POSIX-style access control lists (ACLs) for security.
Architecture - Implementation
* Storage Account: Supports General Purpose v2 or Premium Block Blob storage.
* Hierarchical Namespace: The storage account must have Hierarchical Namespace enabled.
* Container/File System: A container within ADLS acts as the file system where all data is stored.
Additional Elements
* Hadoop Distributed File System (HDFS) integration.
* Azure Blob File System (ABFS) compatibility.
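Tools that speak ABFS address ADLS Gen2 data with URIs of the form `abfss://<filesystem>@<account>.dfs.core.windows.net/<path>`. A minimal sketch of building such a URI (the helper name `abfs_uri` and the account/filesystem values are our own placeholders):

```python
# Sketch: composing the ABFS(S) URIs that Hadoop-compatible tools use to
# address ADLS Gen2 paths. The scheme/host pattern follows the documented
# convention; account, filesystem, and path values are placeholders.
def abfs_uri(account: str, filesystem: str, path: str = "", secure: bool = True) -> str:
    scheme = "abfss" if secure else "abfs"  # abfss = TLS-secured variant
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

print(abfs_uri("mylake", "raw", "/sales/2024/01.parquet"))
# abfss://raw@mylake.dfs.core.windows.net/sales/2024/01.parquet
```

The same URI works from Spark, Synapse, and other HDFS-compatible engines, which is what makes the "Hadoop-Compatible" feature above practical.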
Azure Synapse Analytics
Azure Synapse Analytics is an integrated analytics service that combines big data and data warehousing capabilities, enabling large-scale data analysis and business intelligence.
Key Features
Enterprise Data Warehouse (EDW)
* Supports massively parallel processing (MPP) for high-performance querying.
* Stores structured data optimized for analytics.
Big Data Integration
* Natively integrates with Azure Data Lake Storage (ADLS).
* Supports querying structured and unstructured data with Synapse SQL.
Serverless and Dedicated Pools
* Serverless SQL Pool: Pay-per-query model, great for ad-hoc analysis.
* Dedicated SQL Pool: Provisioned compute resources for high-performance workloads.
Integrated Apache Spark
* Built-in support for Spark for big data and AI/ML workloads.
Data Integration with Pipelines
* Enables ETL/ELT processing using Synapse Pipelines (built on Azure Data Factory).
* Connects to various Azure and on-premises data sources.
Security and Compliance
* Data encryption at rest and in transit.
* Row-level and column-level security.
* Dynamic Data Masking to protect sensitive data.
Synapse Studio
* Unified workspace for SQL, Spark, Power BI, and Data Integration.
* Allows interactive development, debugging, and monitoring.
Architecture
Azure Data Lake Storage (ADLS)
* An ADLS Gen2 account is required as the workspace's primary storage for ingesting, storing, and processing data.
Synapse Workspace
* Acts as a parent container for managing security, networking, identity, and configuration.
Analytics Pools
* Supports SQL and Spark pools.
* Pools can be dedicated or serverless; dedicated SQL pools use massively parallel processing (MPP).
Networking
* Default Mode: Uses shared Azure networking; VNet isolation must be configured manually.
* Managed Mode: Deploys a managed virtual network with private endpoints for isolated network security.
* Synapse Link: Provides near-real-time access to operational data (for example, Azure Cosmos DB) without ETL.
Azure Synapse Analytics - Pools
SQL (Dedicated)
* Purpose: Large-scale data warehousing and analytics using MPP (Massively Parallel Processing) engine.
* Data Formats: Native columnstore storage; external tables (PolyBase) can read CSV, Parquet, and ORC.
* Data Storage: Data is stored in Azure Storage managed by the pool, sharded across distributions using hash, round-robin, or replicated distribution, with clustered columnstore indexes by default.
* Queries: Uses T-SQL and stored procedures.
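A hash-distributed table in a dedicated SQL pool assigns each row to one of 60 fixed distributions by hashing the distribution column, so rows with the same key always land on the same distribution. An illustrative sketch (Python's `zlib.crc32` stands in for the engine's internal hash function):

```python
# Illustrative sketch of hash distribution in a dedicated SQL pool: each row
# is assigned to one of 60 distributions by hashing its distribution column.
# zlib.crc32 is only a stand-in for the engine's real hash function.
import zlib

NUM_DISTRIBUTIONS = 60  # dedicated SQL pools shard tables across 60 distributions

def distribution_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_DISTRIBUTIONS

rows = ["cust-001", "cust-002", "cust-003"]
placement = {k: distribution_for(k) for k in rows}
print(placement)  # each key maps deterministically to a distribution 0-59
```

Deterministic placement is what lets the MPP engine join two tables distributed on the same key without shuffling data between nodes.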
SQL (Serverless)
* Purpose: On-demand engine for querying data from data lakes.
* Data Formats: CSV, Parquet, ORC, JSON, Avro, and Delta Lake (read-only).
* Data Storage: No storage of its own; queries data in place in ADLS via external tables and OPENROWSET.
* Queries: Supports T-SQL, including the OPENROWSET function.
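The OPENROWSET pattern above lets a serverless pool query files directly from the lake. Running such a query requires a Synapse workspace, so this sketch only composes the T-SQL statement; the storage URL and helper name are placeholders:

```python
# Sketch: composing the OPENROWSET T-SQL a serverless SQL pool uses to query
# Parquet files in a data lake in place. We only build the statement here;
# executing it requires a Synapse workspace. URL is a placeholder.
def openrowset_parquet(url: str, alias: str = "rows") -> str:
    return (
        "SELECT TOP 10 *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        f") AS {alias};"
    )

sql = openrowset_parquet("https://mylake.dfs.core.windows.net/raw/sales/*.parquet")
print(sql)
```

This pay-per-query style fits the ad-hoc exploration use case the card describes: no tables to load, no compute to provision.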
Spark (Dedicated)
* Purpose: Dedicated scalable Spark cluster for big data processing and machine learning (ML).
* Data Formats: CSV, Parquet, ORC, JSON, Avro, and Delta Lake (read/write).
* Data Storage: Local disk cache and external tables for accessing intermediate data.
* Queries: Supports Spark SQL queries, PySpark, Scala, and .NET Spark APIs.
Spark (Serverless)
* Purpose: On-demand Spark cluster for quick/interactive data exploration.
* Data Formats: CSV, Parquet, ORC, JSON, Avro, and Delta Lake (read/write).
* Data Storage: Local disk cache and external tables for accessing intermediate data.
* Queries: Supports Spark SQL queries, PySpark, Scala, and .NET Spark APIs.
Key Takeaways
* SQL (Dedicated) is best for enterprise-level data warehousing.
* SQL (Serverless) is for ad-hoc queries on data lakes.
* Spark (Dedicated) is for ML workloads and big data processing.
* Spark (Serverless) is for quick exploratory data analysis.
Azure Databricks
Azure Databricks is a cloud-based data analytics and machine learning platform that provides a unified environment for big data processing, AI, and collaborative data science. It is built on Apache Spark and integrates seamlessly with other Azure services.
Key Features
Optimized Apache Spark
* Fully managed Apache Spark environment for big data analytics and machine learning.
* Supports Python, Scala, SQL, Java, and R.
Collaborative Notebooks
* Interactive notebooks for data science, ML, and engineering.
* Supports multiple users editing simultaneously.
Auto-scaling & Performance Optimization
* Scales automatically based on workload.
* Built-in Photon Engine for performance improvements.
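Autoscaling is declared when a cluster is defined. A minimal sketch of a cluster spec in the shape used by the Databricks Clusters API (field names follow the public API; the cluster name, runtime label, and VM size are example placeholders):

```python
# Illustrative sketch of a Databricks cluster spec with autoscaling, in the
# shape used by the Clusters REST API. All concrete values are placeholders.
cluster_spec = {
    "cluster_name": "etl-autoscale",         # placeholder name
    "spark_version": "13.3.x-scala2.12",     # example runtime label
    "node_type_id": "Standard_DS3_v2",       # example Azure VM size
    "autoscale": {                           # workers scale within this range
        "min_workers": 2,
        "max_workers": 8,
    },
}
```

With `autoscale` set, Databricks grows the cluster toward `max_workers` under load and shrinks it back toward `min_workers` when idle, which is the cost-efficiency behavior described above.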
Architecture - Implementation
Azure Databricks Account: Parent resource that defines pricing, networking, identity, and security settings.
Databricks Workspace: A dedicated environment to create, manage, and collaborate on Databricks assets and services.
Databricks Services - A collection of tools for:
* SQL – Querying structured data
* Data Engineering – Building data pipelines
* Machine Learning – Training and deploying ML models
Azure Databricks - Components
Clusters
* Managed Spark clusters for distributed computing.
* Supports autoscaling for cost-efficiency.
Workspaces
* Organizes notebooks, jobs, clusters, and models.
* Enables collaboration among data engineers and scientists.
Jobs
* Automated workflows for scheduled ETL and ML pipelines.
Databricks Runtime
* Custom Spark environment optimized for Azure.
* Includes MLflow for experiment tracking.
Delta Lake
* Open-source storage layer that brings ACID transactions to data lakes.
* Provides schema enforcement & versioning.
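A toy illustration (not real Delta Lake) of the two ideas above: every write appends a new immutable version, and writes that don't match the declared schema are rejected:

```python
# Toy model (not real Delta Lake) of versioned writes with schema
# enforcement: each write appends a new version, old versions stay readable
# ("time travel"), and rows that break the schema are rejected.
class ToyVersionedTable:
    def __init__(self, schema: set):
        self.schema = schema
        self.versions = []                      # full version history

    def write(self, rows: list) -> int:
        for row in rows:                        # schema enforcement
            if set(row) != self.schema:
                raise ValueError(f"row {row} does not match schema {self.schema}")
        self.versions.append(rows)              # new version; old ones kept
        return len(self.versions) - 1           # version number of this write

t = ToyVersionedTable({"id", "amount"})
v0 = t.write([{"id": 1, "amount": 9.5}])
v1 = t.write([{"id": 2, "amount": 3.0}])
print(t.versions[v0])                           # old version still readable
```

Real Delta Lake implements the same ideas with a transaction log of Parquet files, which is what makes concurrent readers and writers safe.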
Azure Data Factory
Azure Data Factory (ADF) is a fully managed cloud-based data integration service provided by Microsoft Azure. It enables organizations to build, manage, and automate data movement and transformation workflows across various data sources, both on-premises and in the cloud. ADF is primarily designed for Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes, facilitating seamless data integration for analytics, reporting, and machine learning workloads.
Key Features
* ETL & ELT Orchestration – Orchestrates extract-transform-load (ETL) and extract-load-transform (ELT) workflows in a fully managed cloud service.
* Diverse Connectivity – Supports 90+ connectors to integrate with on-premises and cloud data sources such as Azure SQL Database, Amazon S3, Google BigQuery, Snowflake, and SAP.
* Low-Code & No-Code Data Pipelines – Provides an intuitive visual designer to create data workflows without extensive coding.
* Data Flow Transformations – Enables complex data transformations using mapping data flows with Spark-based execution.
* Data Movement Across Hybrid Environments – Allows on-premises to cloud and cloud-to-cloud data movement using a self-hosted integration runtime.
* Built-in Monitoring & Debugging – Offers real-time monitoring, logging, and debugging tools within the Azure portal.
* CI/CD Integration – Supports DevOps practices with Git integration, ARM templates, and Azure DevOps for automated deployments.
* Flexible Triggers – Supports schedule-based, tumbling-window, and event-driven triggers (e.g., storage events) to automate data workflows.
Architecture
- Data Factory: Parent resource. Define pricing, networking, identity, etc.
- Pipeline: Activities, Data Flows, and Queries working with data (source/sink).
- Integration Runtime: The execution environment (either in Azure or outside of Azure).
Azure Data Factory - Components
1. Data Pipelines: A pipeline is a logical group of activities that define a workflow for data movement and transformation.
2. Activities: Individual steps in a pipeline such as data movement, transformation, or external service execution (e.g., calling an Azure Function).
3. Datasets: Represent data structures within data stores (tables, files, or collections).
4. Linked Services: Connection strings or endpoints that define data sources and destinations in Azure, on-premises, or third-party services.
5. Integration Runtimes
* Azure Integration Runtime: For cloud-based activities. Microsoft managed compute. Connects to any publicly accessible source.
* Self-Hosted Integration Runtime: For on-premises data movement and hybrid scenarios. Self-managed compute. Connects to any privately accessible source.
* Azure SSIS Integration Runtime: Microsoft managed cluster of VMs dedicated for running SQL Server Integration Services (SSIS) packages in the cloud.
6. Triggers: Automate pipeline execution using schedule-based, event-driven, or manual triggers.
7. Data Flow: A visual-based ETL transformation tool that enables complex data transformation without coding.
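ADF defines pipelines as JSON: a pipeline names a list of activities, and each activity references datasets, which in turn bind to linked services. A minimal sketch of that shape for a single Copy activity (the pipeline and dataset names are placeholders):

```python
import json

# Sketch of the JSON shape ADF uses to define a pipeline containing one Copy
# activity wired to dataset references. The structure follows the public
# pipeline schema; all names (pipeline, datasets) are placeholders.
pipeline = {
    "name": "CopySalesToLake",
    "properties": {
        "activities": [
            {
                "name": "CopySales",
                "type": "Copy",  # built-in data-movement activity
                "inputs": [
                    {"referenceName": "SalesSqlTable", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SalesParquet", "type": "DatasetReference"}
                ],
            }
        ]
    },
}
print(json.dumps(pipeline, indent=2))
```

This layering is why the components above compose cleanly: triggers start pipelines, pipelines run activities, activities read and write datasets, and datasets connect through linked services on an integration runtime.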