Skills Measured Flashcards

1
Q

Describe features of structured data

A
  • adheres to a fixed schema, so all of the data has the same fields and properties
  • most commonly, the schema for structured data entities is tabular
  • often stored in a database in which multiple tables can reference one another by using key values in a relational model
2
Q

Describe features of semi-structured data

A
  • information that has some structure, but which allows for some variation between entity instances
  • for example, while most customers may have an email address, some might have multiple email addresses, and some might have none at all
  • one common format for semi-structured data is JavaScript Object Notation (JSON)
3
Q

Describe features of unstructured data

A
  • data without any specific structure
  • for example, documents, images, audio and video data, and binary files
4
Q

Describe common formats for data files

A
  • delimited text files: stored in plain text format with specific field delimiters and row terminators. The most common format is comma-separated values (CSV); other examples include tab-separated values (TSV) and space-delimited text. Delimited text is a good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format (a short sketch of the delimited and JSON formats follows this list).
  • JavaScript Object Notation (JSON): a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
  • Extensible Markup Language (XML): a human-readable data format that was popular in the 1990s and 2000s. It’s largely been superseded by the less verbose JSON format, but there are still some systems that use XML to represent data.
  • Binary Large Object (BLOB): Ultimately, all files are stored as binary data (1’s and 0’s), but in the human-readable formats discussed above, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode). Some file formats, however, particularly for unstructured data, store the data as raw binary that must be interpreted by applications and rendered. Common types of data stored as binary include images, video, audio, and application-specific documents.
  • Optimized File Formats: While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed: Avro (row-based format), ORC (Optimized Row Columnar format), Parquet (another columnar data format)
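  • For illustration, here’s a minimal Python sketch (hypothetical product records, using only the standard library) that writes the same data as delimited text and as JSON:

    import csv, json, io

    products = [
        {"id": 1, "name": "Widget", "price": 2.50},
        {"id": 2, "name": "Gadget", "price": 9.99},
    ]

    # Delimited text: one row per record, fields separated by a delimiter (commas here)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "name", "price"])
    writer.writeheader()
    writer.writerows(products)
    print(buffer.getvalue())            # id,name,price / 1,Widget,2.5 / 2,Gadget,9.99

    # JSON: a hierarchical document format that can also represent nested attributes
    print(json.dumps(products, indent=2))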
5
Q

Describe types of databases

A
  • Relational databases: commonly used to store and query structured data. The data is stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it; and these keys are used to reference the entity instance in other tables. This use of keys to reference data entities enables a relational database to be normalized; which in part means the elimination of duplicate data values. The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems.
  • Non-relational databases: data management systems that don’t apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language. Four common types:
    • Key-value databases: each record consists of a unique key and an associated value, which can be in any format.
    • Document databases: a specific form of key-value database in which the value is a JSON document, which the system is optimized to parse and query.
    • Column family databases: store tabular data comprising rows and columns, but you can divide the columns into groups known as column families. Each column family holds a set of columns that are logically related.
    • Graph databases: store entities as nodes with links to define relationships between them.
6
Q

Describe features of transactional workloads

A
  • A transactional data processing system is what most people consider the primary function of business computing. A transactional system records transactions that encapsulate specific events that the organization wants to track. A transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete unit of work.
  • Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).
  • OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). These operations are applied transactionally, in a way that ensures the integrity of the data stored in the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics: Atomicity, Consistency, Isolation, Durability
  • OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications
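  • A minimal sketch of transactional behavior, using Python’s built-in sqlite3 module with hypothetical account data (any relational OLTP store would behave similarly): the two UPDATE statements are applied atomically, so a failure rolls back both.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("checking", 100.0), ("savings", 50.0)])
    conn.commit()

    try:
        # the transfer is one transaction: both updates succeed together or not at all
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'checking'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'savings'")
        conn.commit()
    except Exception:
        conn.rollback()                  # on failure, the partial change is undone

    print(conn.execute("SELECT * FROM accounts").fetchall())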
7
Q

Describe features of analytical workloads

A
  • Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics. Analytics can be based on a snapshot of the data at a given point in time, or a series of snapshots.
  • The specific details for an analytical processing system can vary between solutions, but a common architecture for enterprise-scale analytics looks like this:
    1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
    2. Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
    3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
    4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
  • Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.
  • Data warehouses are an established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization. Data Lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse. The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).
  • An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data aggregations are across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for example to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.
  • Different types of users might perform data analytical work at different stages of the overall architecture. For example:
    • Data scientists might work directly with data files in a data lake to explore and model data.
    • Data analysts might query tables directly in the data warehouse to produce complex reports and visualizations.
    • Business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards.
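  • As a rough illustration of the aggregation described above, here’s a minimal pandas sketch (hypothetical fact and dimension data; assumes pandas is installed) that totals a measure by two dimensions:

    import pandas as pd

    fact_sales = pd.DataFrame({
        "product_key": [1, 1, 2, 2],
        "month":       ["2024-01", "2024-02", "2024-01", "2024-02"],
        "revenue":     [100.0, 150.0, 80.0, 120.0],
    })
    dim_product = pd.DataFrame({"product_key": [1, 2], "product": ["Bike", "Helmet"]})

    # total revenue (the measure) by product and month (the dimensions)
    totals = (fact_sales.merge(dim_product, on="product_key")
              .groupby(["product", "month"])["revenue"].sum())
    print(totals)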
8
Q

Describe responsibilities for database administrators

A
  • A database administrator is responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database systems. They’re responsible for the overall availability and consistent performance and optimizations of databases. They work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error.
  • The database administrator is also responsible for managing the security of the data in the database, granting privileges over the data, granting or denying access to users as appropriate.
9
Q

Describe responsibilities for data engineers

A
  • A data engineer collaborates with stakeholders to design and implement data-related workloads, including data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data streams.
  • They’re also responsible for ensuring that data privacy is maintained, both within the cloud and for data that spans on-premises and cloud data stores. They own the management and monitoring of data pipelines to ensure that data loads perform as expected.
10
Q

Describe responsibilities for data analysts

A
  • A data analyst enables businesses to maximize the value of their data assets. They’re responsible for exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced analytics capabilities through reports and visualizations.
  • A data analyst processes raw data into relevant insights based on identified business requirements.
11
Q

Identify features of relational data

A
  • In a relational database, you model collections of entities from the real world as tables. An entity can be anything for which you want to record information, typically important objects and events. A table contains rows, and each row represents a single instance of an entity.
  • Relational tables are a format for structured data, and each row in a table has the same columns; though in some cases, not all columns need to have a value (they can be NULL).
  • Each column stores data of a specific datatype. For example, an Email column in a Customer table would likely be defined to store character-based (text) data (which might be fixed or variable in length), a Price column in a Product table might be defined to store decimal numeric data, while a Quantity column in an Order table might be constrained to integer numeric values; and an OrderDate column in the same Order table would be defined to store date/time values. The available datatypes that you can use when defining a table depend on the database system you are using; though there are standard datatypes defined by the American National Standards Institute (ANSI) that are supported by most database systems.
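  • A minimal sketch of per-column datatypes using Python’s sqlite3 module (SQLite’s typing is looser than most database systems, and the table here is hypothetical, so this is only illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Product (
            ProductID INTEGER PRIMARY KEY,   -- integer numeric values
            Name      TEXT NOT NULL,         -- variable-length character data
            Price     REAL,                  -- decimal numeric data
            Added     TEXT                   -- date/time, stored as ISO-8601 text in SQLite
        )
    """)
    conn.execute("INSERT INTO Product VALUES (1, 'Mountain Bike', 499.99, '2024-05-01')")
    print(conn.execute("SELECT * FROM Product").fetchall())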
12
Q

Describe normalization and why it is used

A
  • Normalization is a term used by database professionals for a schema design process that minimizes data duplication and enforces data integrity (data is correct, consistent, and dependable).
  • While there are many complex rules that define the process of refactoring data into various levels (or forms) of normalization, a simple definition for practical purposes is:
    1. Separate each entity into its own table.
    2. Separate each discrete attribute into its own column.
    3. Uniquely identify each entity instance (row) using a primary key.
    4. Use foreign key columns to link related entities.
  • Recording each instance of an entity as a row in an entity-specific table removes duplication of data. For example, to change a customer’s address, you need only modify the value in a single row.
  • The decomposition of attributes into individual columns ensures that each value is constrained to an appropriate data type. Additionally, the creation of individual columns provides a useful level of granularity in the data for querying - for example, you can easily filter customers to those who live in a specific city.
  • Instances of each entity are uniquely identified by an ID or other key value, known as a primary key; and when one entity references another (for example, an order has an associated customer), the primary key of the related entity is stored as a foreign key. You can look up the address of the customer (which is stored only once) for each record in the Order table by referencing the corresponding record in the Customer table. Typically, a relational database management system (RDBMS) can enforce referential integrity to ensure that a value entered into a foreign key field has an existing corresponding primary key in the related table – for example, preventing orders for non-existent customers.
  • In some cases, a key (primary or foreign) can be defined as a composite key based on a unique combination of multiple columns. For example, a LineItem table might use a unique combination of OrderNo and ItemNo to identify a line item within an individual order.
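  • A minimal sketch of normalized tables with primary and foreign keys, using Python’s sqlite3 module (the tables and values are hypothetical; referential integrity rejects the order for a non-existent customer):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")        # enable referential-integrity checks
    conn.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT, City TEXT)")
    conn.execute("""
        CREATE TABLE "Order" (
            OrderNo    INTEGER PRIMARY KEY,
            CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID)
        )
    """)
    conn.execute("INSERT INTO Customer VALUES (1, 'Joe Jones', 'Seattle')")
    conn.execute('INSERT INTO "Order" VALUES (1000, 1)')       # valid: customer 1 exists

    try:
        conn.execute('INSERT INTO "Order" VALUES (1001, 99)')  # fails: no customer 99
    except sqlite3.IntegrityError as e:
        print("Rejected:", e)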
13
Q

Identify common structured query language (SQL) statements

A
  • SQL stands for Structured Query Language, and is used to communicate with a relational database. It’s the standard language for relational database management systems. Some common relational database management systems that use SQL include Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle.
  • Although these SQL statements are part of the SQL standard, many database management systems also have their own additional proprietary extensions to handle the specifics of that database management system. Some popular dialects of SQL include: Transact-SQL (T-SQL), used by Microsoft SQL Server and Azure SQL services; pgSQL, used by PostgreSQL; and PL/SQL (Procedural Language/SQL), used by Oracle.
  • SQL statements are grouped into three main logical groups:
    • Data Definition Language (DDL): used to create, modify, and remove tables and other objects in a database (tables, stored procedures, views, and so on). Most common statements: CREATE, ALTER, DROP, RENAME
    • Data Control Language (DCL): Database administrators generally use DCL statements to manage access to objects in a database by granting, denying, or revoking permissions to specific users or groups. Three main statements: GRANT, DENY, REVOKE
    • Data Manipulation Language (DML): used to manipulate the rows in tables. These statements enable you to retrieve (query) data, insert new rows, or modify existing rows. You can also delete rows if you don’t need them anymore. Four main statements are: SELECT, INSERT, UPDATE, DELETE
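  • A minimal sketch of DDL and DML statements run through Python’s sqlite3 module (the Employee table is hypothetical; DCL statements are omitted because SQLite has no user permissions):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # DDL: create (and later ALTER or DROP) objects in the database
    conn.execute("CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT, Dept TEXT)")

    # DML: insert, query, update, and delete rows
    conn.execute("INSERT INTO Employee VALUES (1, 'Rosie Reeves', 'Sales')")
    conn.execute("UPDATE Employee SET Dept = 'Marketing' WHERE EmpID = 1")
    print(conn.execute("SELECT Name, Dept FROM Employee").fetchall())
    conn.execute("DELETE FROM Employee WHERE EmpID = 1")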
14
Q

Identify common database objects

A
  • In addition to tables, a relational database can contain other structures that help to optimize data organization, encapsulate programmatic actions, and improve the speed of access. Three of the most common are views, stored procedures, and indexes.
  • Views: a virtual table based on the results of a SELECT query. You can think of a view as a window on specified rows in one or more underlying tables.
  • Stored procedure: defines SQL statements that can be run on command. Stored procedures are used to encapsulate programmatic logic in a database for actions that applications need to perform when working with data. You can define a stored procedure with parameters to create a flexible solution for common actions that might need to be applied to data based on a specific key or criteria.
  • Index: helps you search for data in a table, much like an index at the back of a book: you look up a reference in the sorted index and use the page numbers to go directly to the correct pages, instead of reading through the entire book.
    • When you create an index in a database, you specify a column from the table. The index contains a copy of this data in sorted order, with pointers to the corresponding rows in the table. When a query specifies this column in the WHERE clause, the database management system can use the index (a tree-based structure that the query optimizer navigates) to fetch the data more quickly than scanning the entire table row by row.
    • For a table containing few rows, using the index is probably no more efficient than simply reading the entire table (in which case the query optimizer will ignore it). However, when a table has many rows, indexes can dramatically improve the performance of queries.
    • Indexes aren’t free: you can create many indexes on a table, but each index consumes storage space and must be maintained each time you insert, update, or delete data, which can slow down those operations. You must strike a balance between having indexes that speed up your queries and the cost of performing other operations.
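  • A minimal sketch of a view and an index using Python’s sqlite3 module (the data is hypothetical; stored procedures are omitted because SQLite doesn’t support them):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (OrderNo INTEGER PRIMARY KEY, City TEXT, Total REAL)")
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                     [(1, 'Seattle', 25.0), (2, 'Portland', 40.0), (3, 'Seattle', 15.0)])

    # a view: a virtual table defined by a SELECT query
    conn.execute("CREATE VIEW SeattleOrders AS "
                 "SELECT OrderNo, Total FROM Orders WHERE City = 'Seattle'")
    print(conn.execute("SELECT * FROM SeattleOrders").fetchall())

    # an index: a sorted structure over a column, used to locate rows quickly
    conn.execute("CREATE INDEX idx_orders_city ON Orders(City)")
    print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM Orders WHERE City = 'Portland'").fetchall())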
15
Q

Describe the Azure SQL family of products including Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure Virtual Machines

A

Azure SQL is a collective term for a family of Microsoft SQL Server based database services in Azure. Specific Azure SQL services include:
- SQL Server on Azure Virtual Machines (VMs) - A virtual machine running in Azure with an installation of SQL Server. The use of a VM makes this option an infrastructure-as-a-service (IaaS) solution that virtualizes hardware infrastructure for compute, storage, and networking in Azure; making it a great option for “lift and shift” migration of existing on-premises SQL Server installations to the cloud. Use this option when you need to migrate or extend an on-premises SQL Server solution and retain full control over all aspects of server and database configuration.
- Azure SQL Managed Instance - A platform-as-a-service (PaaS) option that provides near-100% compatibility with on-premises SQL Server instances while abstracting the underlying hardware and operating system. The service includes automated software update management, backups, and other maintenance tasks, reducing the administrative burden of supporting a database server instance. Use this option for most cloud migration scenarios, particularly when you need minimal changes to existing applications.
- Azure SQL Database - A fully managed, highly scalable PaaS database service that is designed for the cloud. This service includes the core database-level capabilities of on-premises SQL Server, and is a good option when you need to create a new application in the cloud. Use this option for new cloud solutions, or to migrate applications that have minimal instance-level dependencies.
- Azure SQL Edge - A SQL engine that is optimized for Internet-of-things (IoT) scenarios that need to work with streaming time-series data.

16
Q

Identify Azure database services for open-source database systems

A
  • In addition to Azure SQL services, Azure data services are available for other popular relational database systems, including MySQL, MariaDB, and PostgreSQL.
  • MySQL started life as a simple-to-use open-source database management system. It’s the leading open source relational database for Linux, Apache, MySQL, and PHP (LAMP) stack apps. It’s available in several editions; Community, Standard, and Enterprise. The Community edition is available free-of-charge, and has historically been popular as a database management system for web applications, running under Linux. Versions are also available for Windows. Standard edition offers higher performance, and uses a different technology for storing data. Enterprise edition provides a comprehensive set of tools and features, including enhanced security, availability, and scalability. The Standard and Enterprise editions are the versions most frequently used by commercial organizations, although these versions of the software aren’t free.
  • MariaDB is a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system). One notable feature of MariaDB is its built-in support for temporal data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.
  • PostgreSQL is a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties. The database management system is extensible; you can add code modules to the database, which can be run by queries. Another key feature is the ability to store and manipulate geometric data, such as lines, circles, and polygons. PostgreSQL has its own query language called pgsql. This language is a variant of the standard relational query language, SQL, with features that enable you to write stored procedures that run inside the database.
17
Q

Describe Azure File storage

A
  • Azure Files is essentially a way to create cloud-based network shares, such as you typically find in on-premises organizations to make documents and other files available to multiple users. By hosting file shares in Azure, organizations can eliminate hardware costs and maintenance overhead, and benefit from high availability and scalable cloud storage for files.
  • After you’ve created a storage account, you can upload files to Azure File Storage using the Azure portal, or tools such as the AzCopy utility. You can also use the Azure File Sync service to synchronize locally cached copies of shared files with the data in Azure File Storage.
  • Azure File Storage offers two performance tiers. The Standard tier uses hard disk-based hardware in a datacenter, and the Premium tier uses solid-state disks. The Premium tier offers greater throughput, but is charged at a higher rate.
  • Azure Files supports two common network file sharing protocols:
    • Server Message Block (SMB) file sharing is commonly used across multiple operating systems (Windows, Linux, macOS).
    • Network File System (NFS) shares are used by some Linux and macOS versions. To create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled.
18
Q

Describe Azure Blob storage

A
  • Azure Blob Storage is a service that enables you to store massive amounts of unstructured data as binary large objects, or blobs, in the cloud. Blobs are an efficient way to store data files in a format that is optimized for cloud-based storage, and applications can read and write them by using the Azure blob storage API.
  • In an Azure storage account, you store blobs in containers. A container provides a convenient way of grouping related blobs together. Within a container, you can organize blobs in a hierarchy of virtual folders, similar to files in a file system on disk. The folders are purely virtual.
  • Azure Blob Storage supports three different types of blob: Block blobs (best used to store discrete, large, binary objects that change infrequently.), Page blobs (optimized to support random read and write operations), Append blobs (optimized to support append operations)
  • Blob storage provides three access tiers, which help to balance access latency and storage cost: Hot (use this tier for blobs that are accessed frequently), Cool (for data that is accessed infrequently), and Archive (for historical data that mustn’t be lost, but is required only rarely)
19
Q

Describe Azure Table storage

A
  • Azure Table Storage is a NoSQL storage solution that makes use of tables containing key/value data items. Each item is represented by a row that contains columns for the data fields that need to be stored.
  • Don’t be misled into thinking that an Azure Table Storage table is like a table in a relational database. An Azure Table enables you to store semi-structured data. All rows in a table must have a unique key (composed of a partition key and a row key), and when you modify data in a table, a timestamp column records the date and time the modification was made; but other than that, the columns in each row can vary. Azure Table Storage tables have no concept of foreign keys, relationships, stored procedures, views, or other objects you might find in a relational database. Data in Azure Table Storage is usually denormalized, with each row holding the entire data for a logical entity.
  • To help ensure fast access, Azure Table Storage splits a table into partitions. Partitioning is a mechanism for grouping related rows, based on a common property or partition key. Rows that share the same partition key will be stored together. Partitioning not only helps to organize data, it can also improve scalability and performance in the following ways:
    • Partitions are independent from each other, and can grow or shrink as rows are added to, or removed from, a partition. A table can contain any number of partitions.
    • When you search for data, you can include the partition key in the search criteria. This helps to narrow down the volume of data to be examined, and improves performance by reducing the amount of I/O (input and output operations, or reads and writes) needed to locate the data.
  • The key in an Azure Table Storage table comprises two elements; the partition key that identifies the partition containing the row, and a row key that is unique to each row in the same partition. Items in the same partition are stored in row key order. If an application adds a new row to a table, Azure ensures that the row is placed in the correct position in the table. This scheme enables an application to quickly perform point queries that identify a single row, and range queries that fetch a contiguous block of rows in a partition.
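  • A minimal illustration in plain Python (not the Azure SDK; the entities are hypothetical) of how rows are keyed and how a point query uses both keys:

    entities = [
        {"PartitionKey": "Seattle",  "RowKey": "C001", "Name": "Joe Jones",   "Email": "joe@contoso.com"},
        {"PartitionKey": "Seattle",  "RowKey": "C002", "Name": "Samir Nadoy", "Phone": "555-1234"},
        {"PartitionKey": "Portland", "RowKey": "C001", "Name": "Rosie Reeves"},   # columns can vary per row
    ]

    # a point query supplies both the partition key and the row key,
    # so only a single partition (and a single row) needs to be examined
    match = next(e for e in entities
                 if e["PartitionKey"] == "Seattle" and e["RowKey"] == "C002")
    print(match)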
20
Q

Identify use cases for Azure Cosmos DB

A
  • Azure Cosmos DB is a highly scalable cloud database service for NoSQL data. Cosmos DB uses indexes and partitioning to provide fast read and write performance and can scale to massive volumes of data. You can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that globally distributed users can each work with data in their local replica. Cosmos DB is a highly scalable database management system. Cosmos DB automatically allocates space in a container for your partitions, and each partition can grow up to 10 GB in size. Indexes are created and maintained automatically. There’s virtually no administrative overhead.
  • Cosmos DB is highly suitable for the following scenarios:
    • IoT and telematics. These systems typically ingest large amounts of data in frequent bursts of activity. Cosmos DB can accept and store this information quickly. The data can then be used by analytics services, such as Azure Machine Learning, Azure HDInsight, and Power BI. Additionally, you can process the data in real-time using Azure Functions that are triggered as data arrives in the database.
    • Retail and marketing. Microsoft uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live. It’s also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.
    • Gaming. The database tier is a crucial component of gaming applications. Modern games perform graphical processing on mobile/console clients, but rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience. A game database needs to be fast and be able to handle massive spikes in request rates during new game launches and feature updates.
    • Web and mobile applications. Azure Cosmos DB is commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and for building rich personalized experiences. The Cosmos DB SDKs can be used to build rich iOS and Android applications using the popular Xamarin framework.
21
Q

Describe Azure Cosmos DB APIs

A
  • When you provision a new Cosmos DB instance, you select the database engine that you want to use. The choice of engine depends on many factors including the type of data to be stored, the need to support existing applications, and the skills of the developers who will work with the data store.
  • Azure Cosmos DB for NoSQL: Microsoft’s native non-relational service for working with the document data model. It manages data in JSON document format, and despite being a NoSQL data storage solution, uses SQL syntax to work with the data.
  • Azure Cosmos DB for MongoDB: MongoDB is a popular open source database in which data is stored in Binary JSON (BSON) format. Azure Cosmos DB for MongoDB enables developers to use MongoDB client libraries and code to work with data in Azure Cosmos DB.
  • Azure Cosmos DB for PostgreSQL: a native PostgreSQL, globally distributed relational database that automatically shards data to help you build highly scalable apps.
  • Azure Cosmos DB for Table: used to work with data in key-value tables, similar to Azure Table Storage. It offers greater scalability and performance than Azure Table Storage.
  • Azure Cosmos DB for Apache Cassandra: is compatible with Apache Cassandra, which is a popular open source database that uses a column-family storage structure. Column families are tables, similar to those in a relational database, with the exception that it’s not mandatory for every row to have the same columns.
  • Azure Cosmos DB for Apache Gremlin: used with data in a graph structure, in which entities are defined as vertices that form nodes in a connected graph. Nodes are connected by edges that represent relationships.
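  • For the NoSQL API specifically, items are JSON documents; here’s a minimal illustrative example (hypothetical item, shown with Python’s json module) and the kind of SQL-like query syntax the API accepts:

    import json

    item = {
        "id": "1",
        "category": "bikes",
        "name": "Mountain Bike",
        "price": 499.99,
    }
    print(json.dumps(item, indent=2))

    # example query syntax for the NoSQL API (illustrative):
    # SELECT c.name, c.price FROM c WHERE c.category = 'bikes'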
22
Q

Describe considerations for data ingestion and processing

A
  • data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse. The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis. In ETL processes, the data is transformed before being loaded into an analytical store, while in an ELT process the data is copied to the store and then transformed. Either way, the resulting data structure is optimized for analytical queries. The data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters. Data ingestion includes both batch processing of static data and real-time processing of streaming data.
  • On Azure, large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes. You can create and run pipelines using Azure Data Factory, or you can use a similar pipeline engine in Azure Synapse Analytics or Microsoft Fabric if you want to manage all of the components of your data analytics solution in a unified workspace. In either case, pipelines consist of one or more activities that operate on data. An input dataset provides the source data, and activities can be defined as a data flow that incrementally manipulates the data until an output dataset is produced. Pipelines can connect to external data sources to integrate with a wide variety of data services.
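  • A minimal sketch of the ETL pattern in plain Python (hypothetical source rows; a real solution would use pipeline activities in Azure Data Factory, Azure Synapse Analytics, or Microsoft Fabric):

    def extract():
        # e.g. read from a transactional store, files, or a stream
        return [{"order": 1, "amount": "25.0", "city": " seattle "},
                {"order": 2, "amount": "40.0", "city": "Portland"}]

    def transform(rows):
        # clean, filter, and restructure the data for analysis
        return [{"order": r["order"],
                 "amount": float(r["amount"]),
                 "city": r["city"].strip().title()} for r in rows]

    def load(rows):
        # e.g. write to a data lake or data warehouse; here we just print
        for r in rows:
            print(r)

    load(transform(extract()))          # ETL: transform before loading; ELT would load first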
23
Q

Describe options for analytical data stores

A
  • There are two common types of analytical data store: Data warehouses and Data lakehouses
  • Data warehouses: A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads. Commonly, the data from a transactional store is transformed into a schema in which numeric values are stored in central fact tables, which are related to one or more dimension tables that represent entities by which the data can be aggregated. For example a fact table might contain sales order data, which can be aggregated by customer, product, store, and time dimensions (enabling you, for example, to easily find monthly total sales revenue by product for each store). This kind of fact and dimension table schema is called a star schema; though it’s often extended into a snowflake schema by adding additional tables related to the dimension tables to represent dimensional hierarchies (for example, product might be related to product categories). A data warehouse is a great choice when you have transactional data that can be organized into a structured schema of tables, and you want to use SQL to query them.
  • Data lakehouses: A data lake is a file store, usually on a distributed file system for high performance data access. Technologies like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and analytics. These systems often apply a schema-on-read approach to define tabular schemas on semi-structured data files at the point where the data is read for analysis, without applying constraints when it’s stored. Data lakes are great for supporting a mix of structured, semi-structured, and even unstructured data that you want to analyze without the need for schema enforcement when the data is written to the store.
    • You can use a hybrid approach that combines features of data lakes and data warehouses in a lake database or data lakehouse. The raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL.
    • SQL pools in Azure Synapse Analytics include PolyBase, which enables you to define external tables based on files in a data lake (and other sources) and query them using SQL. Synapse Analytics also supports a Lake Database approach in which you can use database templates to define the relational schema of your data warehouse, while storing the underlying data in data lake storage – separating the storage and compute for your data warehousing solution.
    • Data lakehouses are a relatively new approach in Spark-based systems, enabled through technologies like Delta Lake, which adds relational storage capabilities to Spark so you can define tables that enforce schemas and transactional consistency, support batch-loaded and streaming data sources, and provide a SQL API for querying.
24
Q

Describe Azure services for data warehousing, including Azure Synapse Analytics, Azure Databricks, Microsoft Fabric, Azure HDInsight, and Azure Data Factory

A
  • On Azure, there are three main platform-as-a-service (PaaS) services that you can use to implement a large-scale analytical store:
    • Azure Synapse Analytics: a unified, end-to-end solution for large scale data analytics. It brings together multiple technologies and capabilities, enabling you to combine the data integrity and reliability of a scalable, high-performance SQL Server based relational data warehouse with the flexibility of a data lake and open-source Apache Spark. It also includes native support for log and telemetry analytics with Azure Synapse Data Explorer pools, as well as built in data pipelines for data ingestion and transformation. All Azure Synapse Analytics services can be managed through a single, interactive user interface called Azure Synapse Studio, which includes the ability to create interactive notebooks in which Spark code and markdown content can be combined. Synapse Analytics is a great choice when you want to create a single, unified analytics solution on Azure.
    • Azure Databricks: an Azure implementation of the popular Databricks platform. Databricks is a comprehensive data analytics solution built on Apache Spark, and offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science. Databricks provides an interactive user interface through which the system can be managed and data can be explored in interactive notebooks. Due to its common use on multiple cloud platforms, you might want to consider using Azure Databricks as your analytical store if you want to use existing expertise with the platform or if you need to operate in a multicloud environment or support a cloud-portable solution.
    • Azure HDInsight: an Azure service that supports multiple open-source data analytics cluster types. Although not as user-friendly as Azure Synapse Analytics and Azure Databricks, it can be a suitable option if your analytics solution relies on multiple open-source frameworks or if you need to migrate an existing on-premises Hadoop-based solution to the cloud.
  • Microsoft Fabric: Scalable analytics with PaaS services can be complex, fragmented, and expensive. With Microsoft Fabric, you don’t have to spend all of your time combining various services and implementing interfaces through which business users can access them. Instead, you can use a single product that is easy to understand, set up, create, and manage. Fabric is a unified software-as-a-service (SaaS) offering, with all your data stored in a single open format in OneLake. OneLake is Fabric’s lake-centric architecture that provides a single, integrated environment for data professionals and the business to collaborate on data projects. Think of it like OneDrive for data; OneLake combines storage locations across different regions and clouds into a single logical lake, without moving or duplicating data. Data can be stored in any file format in OneLake and can be structured or unstructured. For tabular data, the analytical engines in Fabric will write data in delta format when writing to OneLake. All engines will know how to read this format and treat delta files as tables no matter which engine writes it.
  • Azure Data Factory is a cloud-based data integration service that orchestrates data movement and transformation between diverse data sources and cloud compute resources at scale.
25
Q

Describe the difference between batch and streaming data

A
  • Data processing is simply the conversion of raw data to meaningful information through a process. There are two general ways to process data:
    • Batch processing, in which multiple data records are collected and stored before being processed together in a single operation.
    • Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur.
  • Apart from the way in which batch processing and streaming processing handle data, there are other differences:
    • Data scope: Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).
    • Data size: Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.
    • Performance: Latency is the time taken for the data to be received and processed. The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds.
    • Analysis: You typically use batch processing to perform complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
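  • A minimal sketch contrasting the two approaches in plain Python (hypothetical sensor readings; the loop stands in for events arriving over time):

    from collections import deque

    readings = [3, 5, 4, 7, 6, 8]

    # batch: collect everything first, then process the whole dataset in one operation
    print("batch total:", sum(readings))

    # stream: process each event as it arrives, keeping only a rolling window
    window = deque(maxlen=3)            # e.g. "the last 3 readings"
    for value in readings:
        window.append(value)
        print("rolling average:", sum(window) / len(window))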
26
Q

Identify Microsoft cloud services for real-time analytics

A

Real-time analytics in Azure
- Microsoft Azure supports multiple technologies that you can use to implement real-time analytics of streaming data, including:
- Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.
- Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
- Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.

Sources for stream processing
- The following services are commonly used to ingest data for stream processing on Azure:
- Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once.
- Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data from Internet-of-things (IoT) devices.
- Azure Data Lake Store Gen 2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.
- Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure HDInsight to create a Kafka cluster.

Sinks for stream processing
- The output from stream processing is often sent to the following services:
- Azure Event Hubs: Used to queue the processed data for further downstream processing.
- Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.
- Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
- Microsoft Power BI: Used to generate real time data visualizations in reports and dashboards.

27
Q

Identify capabilities of Power BI

A
  • There are many data visualization tools that data analysts can use to explore data and summarize insights visually; including chart support in productivity tools like Microsoft Excel and built-in data visualization widgets in notebooks used to explore data in services such as Azure Synapse Analytics and Azure Databricks. However, for enterprise-scale business analytics, an integrated solution that can support complex data modeling, interactive reporting, and secure sharing is often required.
  • Microsoft Power BI is a suite of tools and services that data analysts can use to build interactive data visualizations for business users to consume.
  • A typical workflow for creating a data visualization solution starts with Power BI Desktop, a Microsoft Windows application in which you can import data from a wide range of data sources, combine and organize the data from these sources in an analytics data model, and create reports that contain interactive visualizations of the data.
  • After you’ve created data models and reports, you can publish them to the Power BI service; a cloud service in which reports can be published and interacted with by business users. You can also do some basic data modeling and report editing directly in the service using a web browser, but the functionality for this is limited compared to the Power BI Desktop tool. You can use the service to schedule refreshes of the data sources on which your reports are based, and to share reports with other users. You can also define dashboards and apps that combine related reports in a single, easy to consume location.
  • Users can consume reports, dashboards, and apps in the Power BI service through a web browser, or on mobile devices by using the Power BI phone app.
28
Q

Describe features of data models in Power BI

A
  • Analytical models enable you to structure data to support analysis. Models are based on related tables of data and define the numeric values that you want to analyze or report (known as measures) and the entities by which you want to aggregate them (known as dimensions). For example, a model might include a table containing numeric measures for sales (such as revenue or quantity) and dimensions for products, customers, and time. This would enable you to aggregate sales measures across one or more dimensions (for example, to identify total revenue by customer, or total items sold by product per month). Conceptually, the model forms a multidimensional structure, which is commonly referred to as a cube, in which any point where the dimensions intersect represents an aggregated measure for those dimensions.
  • You can use Power BI to define an analytical model from tables of data, which can be imported from one or more data sources. You can then use the data modeling interface on the Model tab of Power BI Desktop to define your analytical model by creating relationships between fact and dimension tables, defining hierarchies, setting data types and display formats for fields in the tables, and managing other properties of your data that help define a rich model for analysis.
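  • A minimal sketch of the idea behind such a model (using pandas with hypothetical data, not Power BI itself): a numeric measure aggregated at the intersections of two dimensions, like one slice of a cube:

    import pandas as pd

    sales = pd.DataFrame({
        "product": ["Bike", "Bike", "Helmet", "Helmet"],
        "month":   ["Jan", "Feb", "Jan", "Feb"],
        "revenue": [100.0, 150.0, 80.0, 120.0],     # the measure
    })

    # each cell is the aggregated measure for one intersection of the dimensions
    cube = sales.pivot_table(values="revenue", index="product", columns="month", aggfunc="sum")
    print(cube)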
29
Q

Identify appropriate visualizations for data

A
  • After you’ve created a model, you can use it to generate data visualizations that can be included in a report:
    • Tables and text: Tables and text are often the simplest way to communicate data. Tables are useful when numerous related values must be displayed, and individual text values in cards can be a useful way to show important figures or metrics.
    • Bar and column charts: Bar and column charts are a good way to visually compare numeric values for discrete categories.
    • Line charts: Line charts can also be used to compare categorized values and are useful when you need to examine trends, often over time.
    • Pie charts: Pie charts are often used in business reports to visually compare categorized values as proportions of a total.
    • Scatter plots: Scatter plots are useful when you want to compare two numeric measures and identify a relationship or correlation between them.
    • Maps: Maps are a great way to visually compare values for different geographic areas or locations.
    • Interactive reports in Power BI: In Power BI, the visual elements for related data in a report are automatically linked to one another and provide interactivity. For example, selecting an individual category in one visualization will automatically filter and highlight that category in other related visualizations in the report.