Describe common elements of large-scale analytics
Data Warehousing Architectures
1. Data ingestion and processing – data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse. The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis.
2. Analytical data store – data stores for large-scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases).
3. Analytical data model – while data analysts and data scientists can work with the data directly in the analytical data store, it's common to create one or more data models that pre-aggregate the data to make it easier to produce reports, dashboards, and interactive visualizations.
4. Data visualization – data analysts consume data from analytical models, and directly from analytical stores, to create reports, dashboards, and other visualizations.
Considerations for data ingestion and processing
-Understand the types and formats of data you’ll be ingesting
-Design your ingestion and processing pipeline to scale horizontally to handle growing volumes of data.
-Define the latency requirements for your data
-Ensure that your data ingestion and processing system is reliable and fault-tolerant
-Plan for data transformation and enrichment during the ingestion process
-Implement robust security measures to protect sensitive data during ingestion and processing.
-Set up comprehensive monitoring and logging to track the health and performance of your data pipeline.
-Implement data quality checks and validation during the ingestion process to identify and address issues early
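The last consideration above can be made concrete with a small sketch. This is a minimal, hypothetical example (the field names and rules are invented for illustration; real pipelines would usually rely on a framework's built-in validation) showing how records can be split into clean rows and a reject queue during ingestion:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    amount: float
    currency: str

# Hypothetical validation rules applied as each record is ingested.
def validate(record: dict) -> list[str]:
    errors = []
    if not record.get("order_id"):
        errors.append("missing order_id")
    if not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0:
        errors.append("invalid amount")
    if record.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append("unknown currency")
    return errors

def ingest(records: list[dict]) -> tuple[list[Order], list[tuple[dict, list[str]]]]:
    """Split incoming records into clean rows and a reject queue."""
    clean, rejected = [], []
    for r in records:
        errs = validate(r)
        if errs:
            rejected.append((r, errs))  # route to a quarantine area for review
        else:
            clean.append(Order(r["order_id"], float(r["amount"]), r["currency"]))
    return clean, rejected
```

Rejecting bad records at ingestion time keeps quality issues out of the analytical store, where they are much harder to trace and correct.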
Data Warehouse
A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads. A data warehouse is a great choice when you have transactional data that can be organized into a structured schema of tables, and you want to use SQL to query them.
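The "structured schema optimized for analytics" typically means fact and dimension tables (a star schema). As a sketch, here is a miniature star schema queried with SQL, using Python's sqlite3 purely as a stand-in for a real warehouse engine (the tables and data are invented for illustration):

```python
import sqlite3

# A miniature star schema: one fact table of sales joined to a
# product dimension table. sqlite3 stands in for a warehouse engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, quantity INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO fact_sales VALUES (1, 2, 1200.0), (1, 1, 600.0), (2, 3, 90.0);
""")

# Analytical queries aggregate fact rows grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY total_revenue DESC
""").fetchall()
```

The same query pattern scales from this toy example up to warehouse engines that distribute fact tables across many nodes.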
Data Lakehouses
A data lake is a file store, usually on a distributed file system for high performance data access. Technologies like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and analytics.
Data lakes are great for supporting a mix of structured, semi-structured, and even unstructured data that you want to analyze without the need for schema enforcement when the data is written to the store.
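The "no schema enforcement when the data is written" idea is often called schema-on-read. A minimal sketch (the file layout and field names are hypothetical) of writing raw, differently-shaped records to a simulated lake folder and imposing a schema only at query time:

```python
import json
import pathlib
import tempfile

# Simulate a data lake folder: raw JSON records with differing shapes
# are written as-is, with no schema enforced at write time.
lake = pathlib.Path(tempfile.mkdtemp())
(lake / "events-1.json").write_text(json.dumps({"user": "ana", "clicks": 3}))
(lake / "events-2.json").write_text(
    json.dumps({"user": "ben", "clicks": 5, "referrer": "ad"})
)

# The schema is imposed only when the data is read for analysis.
def total_clicks(folder: pathlib.Path) -> int:
    total = 0
    for f in sorted(folder.glob("*.json")):
        record = json.loads(f.read_text())
        total += int(record.get("clicks", 0))  # tolerate missing fields
    return total
```

In practice Spark or Hadoop would perform this read-time interpretation across a distributed file system rather than a local folder.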
Hybrid Approach
You can use a hybrid approach that combines features of data lakes and data warehouses in a lake database or data lakehouse. The raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL.
Azure Synapse Analytics
Azure Synapse Analytics is a comprehensive, unified Platform-as-a-Service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities, including:
-Pipelines - based on the same technology as Azure Data Factory.
-SQL - a highly scalable SQL database engine, optimized for data warehouse workloads.
-Apache Spark - an open-source distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.
-Azure Synapse Data Explorer - a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL).
-Data engineers can use Azure Synapse Analytics to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service.
-Data analysts can use SQL and Spark pools through interactive notebooks to explore and analyze data, and take advantage of integration with services such as Azure Machine Learning and Microsoft Power BI to create data models and extract insights from the data.
Azure Databricks
Azure Databricks is an Azure implementation of the popular Databricks platform. Databricks is a comprehensive data analytics solution built on Apache Spark, and offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science.
Databricks provides an interactive user interface through which the system can be managed and data can be explored in interactive notebooks. Because Databricks is commonly used across multiple cloud platforms, you might consider Azure Databricks as your analytical store if you want to apply existing expertise with the platform, or if you need to operate in a multi-cloud environment or support a cloud-portable solution.
-Data engineers can use existing Databricks and Spark skills to create analytical data stores in Azure Databricks.
-Data analysts can use the native notebook support in Azure Databricks to query and visualize data in an easy-to-use web-based interface.
Azure HDInsight
Azure HDInsight is an Azure service that supports multiple open-source data analytics cluster types. Although not as user-friendly as Azure Synapse Analytics and Azure Databricks, it can be a suitable option if your analytics solution relies on multiple open-source frameworks or if you need to migrate an existing on-premises Hadoop-based solution to the cloud.
-Apache Spark - a distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.
-Apache Hadoop - a distributed system that uses MapReduce jobs to process large volumes of data efficiently across multiple cluster nodes. MapReduce jobs can be written in Java or abstracted by interfaces such as Apache Hive - a SQL-based API that runs on Hadoop.
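To make the MapReduce model concrete, here is a single-process sketch in plain Python (Hadoop distributes these phases across cluster nodes; the word-count task and function names are illustrative only): map emits key–value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict
from itertools import chain

# Map phase: emit a (word, 1) pair for every word in a line.
def map_phase(line: str):
    for word in line.lower().split():
        yield (word, 1)

# Shuffle: group emitted values by key, as the framework would
# between the map and reduce phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big jobs"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
```

Interfaces such as Hive let you express the same aggregation as a SQL `GROUP BY` instead of hand-written map and reduce functions.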
-Apache HBase - an open-source system for large-scale NoSQL data storage and querying.
-Apache Kafka - a message broker for data stream processing.
Azure Data Factory
Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data. You can integrate your pipelines with other Azure services, enabling you to ingest data from cloud data stores, process the data using cloud-based compute, and persist the results in another data store.
Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization.
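As a conceptual sketch of the ETL flow such pipelines implement (Azure Data Factory expresses these stages as declarative pipeline activities; plain Python and invented tables are used here only to illustrate extract → transform → load):

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stands in for a transactional store
warehouse = sqlite3.connect(":memory:")  # stands in for the analytical store

source.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 'ana', 120.0, 'shipped'),
        (2, 'ben', 80.0, 'cancelled'),
        (3, 'ana', 40.0, 'shipped');
""")
warehouse.execute("CREATE TABLE customer_sales (customer TEXT, revenue REAL)")

# Extract: pull raw rows from the transactional system.
rows = source.execute("SELECT customer, total, status FROM orders").fetchall()

# Transform: filter out cancelled orders and aggregate per customer.
totals = {}
for customer, total, status in rows:
    if status == "shipped":
        totals[customer] = totals.get(customer, 0.0) + total

# Load: persist the cleaned, aggregated result into the analytical store.
warehouse.executemany("INSERT INTO customer_sales VALUES (?, ?)", totals.items())
result = dict(warehouse.execute("SELECT customer, revenue FROM customer_sales"))
```

In a real pipeline, each stage would be a pipeline activity running on cloud-based compute rather than in-process Python.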
Microsoft Fabric
Microsoft Fabric is a unified Software-as-a-Service (SaaS) analytics platform built on an open, governed lakehouse that includes functionality to support:
-Data ingestion and ETL
-Data lakehouse analytics
-Data warehouse analytics
-Data science and machine learning
-Real-time analytics
-Data visualization
-Data governance and management