DP-900 Flashcards
How can we classify data?
structured, semi-structured, or unstructured
What is structured data?
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties.
What is semi-structured data?
Semi-structured data is information that has some structure, but which allows for some variation between entity instances.
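The idea above can be sketched with JSON, a common semi-structured format: records share a core structure but individual entity instances may carry different fields. The names and values below are made-up illustrative data.

```python
import json

# Two customer records share a core structure, but the second carries an
# extra "loyalty_tier" field and no email -- the schema varies per instance.
records = [
    '{"id": 1, "name": "Ava", "email": "ava@example.com"}',
    '{"id": 2, "name": "Ben", "phone": "555-0100", "loyalty_tier": "gold"}',
]

docs = [json.loads(r) for r in records]
# Code reading semi-structured data must tolerate missing fields.
emails = [d.get("email", "none on record") for d in docs]
```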
What are some common formats of semi-structured data?
JSON (one of the most common formats); XML is another example
What is unstructured data?
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure. This kind of data is referred to as unstructured data.
What are some examples of unstructured data?
documents, images, audio and video data, and binary files
What two categories of data store do we have?
File stores and databases
What should one consider when choosing between a file store and a database?
- The type of data being stored (structured, semi-structured, or unstructured).
- The applications and services that will need to read, write, and process the data.
- The need for the data files to be readable by humans, or optimized for efficient storage and processing.
What are delimited text files?
Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV) in which fields are separated by commas, and rows are terminated by a carriage return / new line.
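A minimal sketch of reading the CSV format described above, using Python's standard-library `csv` module; the sample rows are made up. Note that every value arrives as text, since a plain-text format carries no type information beyond the header row.

```python
import csv
import io

# A small CSV payload: comma-delimited fields, newline row terminators.
raw = "id,product,price\n1,Widget,2.50\n2,Gadget,9.99\n"

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# The consumer applies types itself -- prices are parsed as floats here.
total = sum(float(r["price"]) for r in rows)
```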
What type of files can one store in a file store?
- Delimited text files
- JSON
- XML
- BLOB
What are some popular optimized file formats?
- Avro
- ORC
- Parquet
What is Avro?
Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.
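The row-based idea can be illustrated with a stdlib-only sketch: a JSON header describes the record layout, and each record is then stored as packed binary. This is NOT the actual Avro wire format (real applications would use a library such as `avro` or `fastavro`); it only mimics the header-plus-binary-rows concept.

```python
import json
import struct

# Illustrative only -- not real Avro. A JSON "header" describes the record
# layout; each record is then packed as binary after a null separator.
header = {"fields": [{"name": "id", "type": "int"},
                     {"name": "score", "type": "double"}]}
fmt = "<id"  # little-endian: one int32, one float64, matching the header

payload = json.dumps(header).encode() + b"\x00"
for record in [(1, 98.5), (2, 77.0)]:
    payload += struct.pack(fmt, *record)

# A reader parses the header first, then uses it to decode each binary row.
head, _, body = payload.partition(b"\x00")
schema = json.loads(head)
rec_size = struct.calcsize(fmt)
rows = [struct.unpack(fmt, body[i:i + rec_size])
        for i in range(0, len(body), rec_size)]
```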
What is ORC?
ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
What is Parquet?
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
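The row-based versus columnar distinction behind ORC and Parquet can be sketched with plain Python structures (the sales data is made up). In a columnar layout, a query that touches one column reads only that column's values, instead of scanning every field of every record.

```python
# Row-based layout (Avro-style): each record's fields sit together.
row_store = [
    {"date": "2024-01-01", "product": "Widget", "revenue": 120.0},
    {"date": "2024-01-02", "product": "Gadget", "revenue": 80.0},
    {"date": "2024-01-03", "product": "Widget", "revenue": 60.0},
]

# Columnar layout (ORC/Parquet-style): each column's values sit together.
column_store = {
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "product": ["Widget", "Gadget", "Widget"],
    "revenue": [120.0, 80.0, 60.0],
}

# Summing revenue scans a single contiguous column, not whole records.
total_revenue = sum(column_store["revenue"])
```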
What optimized file format should we use to compress data and minimize storage and network bandwidth requirements?
Avro
What optimized file format should we use to optimize read and write operations in Apache Hive?
ORC
What optimized file format specializes in storing and processing nested data types efficiently?
Parquet
What is normalization of data?
The process of organizing data to eliminate duplicate values, typically by splitting it into multiple related tables
What types of non-relational databases do we have?
- Key-value databases
- Document databases
- Column family databases
- Graph databases
What is a key-value database?
A non-relational database in which each record consists of a unique key and an associated value, which can be in any format.
What is a document database?
A specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).
What is a column family database?
A database that stores tabular data comprising rows and columns, but lets you divide the columns into groups known as column families. Each column family holds a set of columns that are logically related.
What is a graph database?
A database that stores entities as nodes, with links (edges) to define relationships between them.
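The four non-relational models above can be sketched with plain Python structures (all keys and values here are made-up illustrative data):

```python
# Key-value: an opaque value looked up by a unique key; the store does not
# interpret the value's format.
kv = {"user:42": "any payload; format is opaque to the store"}

# Document: the value is a structured (JSON-like) document the store can
# parse and query.
doc_db = {"user:42": {"name": "Ava", "orders": [1, 2]}}

# Column family: rows whose columns are grouped into logically related
# families (here, "identity" and "contact").
col_family = {
    "row1": {"identity": {"name": "Ava"},
             "contact": {"email": "ava@example.com"}},
}

# Graph: entities as nodes, plus edges that define relationships.
nodes = {"ava": {"type": "person"}, "contoso": {"type": "company"}}
edges = [("ava", "works_at", "contoso")]
```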
What is Online Transactional Processing (OLTP)?
OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations).
How does an OLTP system accomplish its goal?
ACID semantics
What does ACID stand for?
Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely. For example, a transaction that involved debiting funds from one account and crediting the same amount to another account must complete both actions. If either action can’t be completed, then the other action must fail.
Consistency – transactions can only take the data in the database from one valid state to another. To continue the debit and credit example above, the completed state of the transaction must reflect the transfer of funds from one account to the other.
Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state. For example, while the transaction to transfer funds from one account to another is in-process, another transaction that checks the balance of these accounts must return consistent results - the balance-checking transaction can’t retrieve a value for one account that reflects the balance before the transfer, and a value for the other account that reflects the balance after the transfer.
Durability – when a transaction has been committed, it will remain committed. After the account transfer transaction has completed, the revised account balances are persisted so that even if the database system were to be switched off, the committed transaction would be reflected when it is switched on again.
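The atomicity of the debit-and-credit example can be demonstrated with Python's standard-library `sqlite3` module, whose connection context manager commits a transaction on success and rolls it back on error. The account names and amounts are made up.

```python
import sqlite3

# An in-memory ledger with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 100.0), ("savings", 0.0)])
conn.commit()

def transfer(amount):
    """Debit and credit as a single unit: both succeed or both roll back."""
    try:
        with conn:  # opens a transaction; commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'checking'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'savings'", (amount,))
            # Business rule: overdrafts abort the whole transaction.
            (bal,) = conn.execute("SELECT balance FROM accounts "
                                  "WHERE name = 'checking'").fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the partial debit was rolled back; balances are unchanged

transfer(60.0)   # succeeds: checking 40, savings 60
transfer(100.0)  # fails: rolled back, balances stay 40 / 60
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```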
When are OLTP systems typically used?
OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications.
What is LOB?
Line of business
What is a common architecture of an enterprise-scale analytics?
- Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
- Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
- Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
- The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
What is a data lake?
A data lake is a repository for large volumes of file-based data. Data lakes are common in large-scale data analytical processing scenarios, where that data must be collected and analyzed.
What is a data warehouse?
Data warehouses are an established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization.
What is a data lakehouse?
A data lakehouse is a more recent innovation that combines the flexible, scalable storage of a data lake with the relational querying semantics of a data warehouse. The table schema may require some denormalization of data from an OLTP data source (introducing some duplication to make queries perform faster).
What is an OLAP model?
An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data aggregations are across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for example to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.
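The pre-aggregation idea can be sketched with a small group-by over made-up fact rows: the revenue measure is totaled at different dimension levels, which is what drilling up (region) and down (region and city) amounts to.

```python
from collections import defaultdict

# Fact rows: a revenue measure at the intersection of region/city dimensions.
facts = [
    {"region": "West", "city": "Bergen", "revenue": 100.0},
    {"region": "West", "city": "Bergen", "revenue": 50.0},
    {"region": "West", "city": "Stavanger", "revenue": 25.0},
    {"region": "East", "city": "Oslo", "revenue": 200.0},
]

def aggregate(rows, *dims):
    """Pre-aggregate the revenue measure at a chosen dimension level."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["revenue"]
    return dict(totals)

by_city = aggregate(facts, "region", "city")  # drilled down
by_region = aggregate(facts, "region")        # drilled up
```

Because these summaries are computed ahead of time, a query for "total sales by region" is a lookup rather than a scan of the raw facts.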
Which data model should we use if we want to drill up/down?
OLAP
What are three key job roles that deal with data in most organizations, and what do they do?
- Database administrators manage databases, assigning permissions to users, storing backup copies of data and restore data in the event of a failure.
- Data engineers manage infrastructure and processes for data integration across the organization, applying data cleaning routines, identifying data governance rules, and implementing pipelines to transfer and transform data between systems.
- Data analysts explore and analyze data to create visualizations and charts that enable organizations to make informed decisions.
What types of Azure SQL do we have?
- Azure SQL Database
- Azure SQL Managed Instance
- Azure SQL VM
What is Azure SQL Database?
Azure SQL Database – a fully managed platform-as-a-service (PaaS) database hosted in Azure
What is Azure SQL Managed Instance?
Azure SQL Managed Instance – a hosted instance of SQL Server with automated maintenance, which allows more flexible configuration than Azure SQL DB but with more administrative responsibility for the owner.
What is Azure SQL VM?
Azure SQL VM – a virtual machine with an installation of SQL Server, allowing maximum configurability with full management responsibility.
What open-source relational databases does Azure offer managed services for?
- Azure Database for MySQL
- Azure Database for MariaDB
- Azure Database for PostgreSQL
What is Azure Database for MySQL and when is it used?
Azure Database for MySQL - a simple-to-use open-source database management system that is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps.
What is Azure Database for MariaDB and when is it used?
Azure Database for MariaDB - a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system).
What is Azure Database for PostgreSQL and when is it used?
Azure Database for PostgreSQL - a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties.
What types of storage does an Azure Storage account support?
- File shares
- Tables
- Blob containers
What is Azure Data Factory?
Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data. You can integrate your pipelines with other Azure services, enabling you to ingest data from cloud data stores, process the data using cloud-based compute, and persist the results in another data store.
Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization.
What is Azure Synapse Analytics?
Data engineers can use Azure Synapse Analytics to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service.
Azure Synapse Analytics is a comprehensive, unified platform-as-a-service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities, including:
- Pipelines - based on the same technology as Azure Data Factory.
- SQL - a highly scalable SQL database engine, optimized for data warehouse workloads.
- Apache Spark - an open-source distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.
- Azure Synapse Data Explorer - a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL).
What is Azure Databricks?
Azure Databricks is a fully managed first-party service that enables an open data lakehouse in Azure. Azure Databricks is an Azure-integrated version of the popular Databricks platform, which combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics.
What is Azure HDInsight?
Azure HDInsight is a full-spectrum, managed cluster platform that simplifies running big data frameworks at large volume and velocity using Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Hadoop, and more in your Azure environment.