Skills Measured Flashcards
Describe features of structured data
- adheres to a fixed schema, so all of the data has the same fields and properties
- most commonly, the schema for structured data entities is tabular
- often stored in a database in which multiple tables can reference one another by using key values in a relational model
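A minimal sketch of what this looks like in a relational database, assuming hypothetical Customer and Orders tables: every row has the same columns, and one table references the other by key value.

```sql
-- Fixed schema: every Customer row has exactly these fields.
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL,
    Email      VARCHAR(100)
);

-- Orders rows reference Customer rows by key value (relational model).
CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customer (CustomerID),
    OrderDate  DATE NOT NULL
);
```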
Describe features of semi-structured data
- information that has some structure, but which allows for some variation between entity instances
- for example, while most customers may have an email address, some might have multiple email addresses, and some might have none at all
- one common format for semi-structured data is JavaScript Object Notation (JSON)
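For example, three customer entities in the same JSON collection can vary in shape (hypothetical names and addresses): the first has one email address, the second has two, and the third has none.

```json
[
  { "name": "Joe Jones",   "email": "joe@litware.com" },
  { "name": "Samir Nadoy", "email": ["samir@northwind.com", "samir.n@fabrikam.com"] },
  { "name": "Ben Chan" }
]
```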
Describe features of unstructured data
- data without any specific structure
- for example, documents, images, audio and video data, and binary files
Describe common formats for data files
- delimited text files: stored in plain text format with specific field delimiters and row terminators. The most common format is comma-separated values (CSV); other examples include tab-separated values (TSV) and space-delimited formats (see the sample after this list). Delimited text is a good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format.
- JavaScript Object Notation (JSON): a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
- Extensible Markup Language (XML): a human-readable data format that was popular in the 1990s and 2000s. It’s largely been superseded by the less verbose JSON format, but there are still some systems that use XML to represent data.
- Binary Large Object (BLOB): Ultimately, all files are stored as binary data (1’s and 0’s), but in the human-readable formats discussed above, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode). Some file formats, however, particularly for unstructured data, store the data as raw binary that must be interpreted by applications and rendered. Common types of data stored as binary include images, video, audio, and application-specific documents.
- Optimized File Formats: While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed: Avro (row-based format), ORC (Optimized Row Columnar format), Parquet (another columnar data format)
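The delimited text sample referenced above: a small CSV file (made-up values) with a header row, comma field delimiters, and line breaks as row terminators.

```csv
CustomerID,Name,Email
1,Joe Jones,joe@litware.com
2,Samir Nadoy,samir@northwind.com
```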
Describe types of databases
- Relational databases: commonly used to store and query structured data. The data is stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables. This use of keys to reference data entities enables a relational database to be normalized, which in part means the elimination of duplicate data values. The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems.
- Non-relational databases: data management systems that don’t apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language. There are four common types:
- Key-value databases: each record consists of a unique key and an associated value, which can be in any format.
- Document databases: a specific form of key-value database in which the value is a JSON document, which the system is optimized to parse and query (see the sample document below).
- Column family databases: store tabular data comprising rows and columns, but you can divide the columns into groups known as column families. Each column family holds a set of columns that are logically related.
- Graph databases: store entities as nodes with links that define relationships between them.
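The sample document referenced above: a single record in a document database pairs a unique key with a JSON value (a hypothetical order document; real systems add their own system properties).

```json
{
  "id": "order-1234",
  "customer": { "name": "Joe Jones", "email": "joe@litware.com" },
  "items": [
    { "product": "Hiking Boots", "quantity": 1, "price": 89.99 },
    { "product": "Wool Socks",   "quantity": 2, "price": 7.50 }
  ]
}
```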
Describe features of transactional workloads
- A transactional data processing system is what most people consider the primary function of business computing. A transactional system records transactions that encapsulate specific events that the organization wants to track. A transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete unit of work.
- Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).
- OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). These operations are applied transactionally, in a way that ensures the integrity of the data stored in the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics: Atomicity (each transaction is treated as a single unit that succeeds or fails completely), Consistency (transactions can only take the data from one valid state to another), Isolation (concurrent transactions can’t interfere with one another), and Durability (once committed, a transaction’s changes are permanent). See the transaction sketch after this list.
- OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications
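The transaction sketch referenced above: a minimal T-SQL illustration of ACID behavior for the classic bank-transfer example (hypothetical Accounts table). Both updates succeed together, or a failure rolls both back.

```sql
BEGIN TRY
    BEGIN TRANSACTION;
    -- Debit one account and credit another as a single unit of work.
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRANSACTION;          -- durability: committed changes persist
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;    -- atomicity: a failure undoes all of the work
END CATCH;
```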
Describe features of analytical workloads
- Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics. Analytics can be based on a snapshot of the data at a given point in time, or a series of snapshots.
- The specific details for an analytical processing system can vary between solutions, but a common architecture for enterprise-scale analytics looks like this:
1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
2. Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
- Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.
- Data warehouses are an established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization. Data lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse. The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).
- An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data is aggregated across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for example, to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly (see the query sketch after this list).
- Different types of user might perform data analytical work at different stages of the overall architecture. For example:
- Data scientists might work directly with data files in a data lake to explore and model data.
- Data analysts might query tables directly in the data warehouse to produce complex reports and visualizations.
- Business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards.
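The query sketch referenced above: a typical warehouse aggregation that totals a measure from a fact table across attributes of dimension tables (hypothetical star-schema names).

```sql
-- Total sales revenue by year and product category.
SELECT d.CalendarYear,
       p.Category,
       SUM(f.SalesAmount) AS TotalRevenue
FROM FactSales AS f
JOIN DimDate    AS d ON f.DateKey = d.DateKey
JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category
ORDER BY d.CalendarYear, p.Category;
```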
Describe responsibilities for database administrators
- A database administrator is responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database systems. They’re responsible for the overall availability, consistent performance, and optimization of databases. They work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error.
- The database administrator is also responsible for managing the security of the data in the database, granting privileges over the data, granting or denying access to users as appropriate.
Describe responsibilities for data engineers
- A data engineer collaborates with stakeholders to design and implement data-related workloads, including data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data streams.
- They’re also responsible for ensuring that data privacy is maintained, both within the cloud and across on-premises and cloud data stores. They own the management and monitoring of data pipelines to ensure that data loads perform as expected.
Describe responsibilities for data analysts
- A data analyst enables businesses to maximize the value of their data assets. They’re responsible for exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced analytics capabilities through reports and visualizations.
- A data analyst processes raw data into relevant insights based on identified business requirements.
Identify features of relational data
- In a relational database, you model collections of entities from the real world as tables. An entity can be anything for which you want to record information, typically important objects and events. A table contains rows, and each row represents a single instance of an entity.
- Relational tables are a format for structured data, and each row in a table has the same columns; though in some cases, a column might be allowed to contain no value (NULL).
- Each column stores data of a specific datatype. For example, an Email column in a Customer table would likely be defined to store character-based (text) data (which might be fixed or variable in length), a Price column in a Product table might be defined to store decimal numeric data, while a Quantity column in an Order table might be constrained to integer numeric values; and an OrderDate column in the same Order table would be defined to store date/time values. The available datatypes that you can use when defining a table depend on the database system you are using; though there are standard datatypes defined by the American National Standards Institute (ANSI) that are supported by most database systems.
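A sketch of column definitions matching those examples (T-SQL-style type names; the exact datatypes available vary by database system):

```sql
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Email      VARCHAR(100)       -- variable-length character data
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    Quantity   INT NOT NULL,      -- constrained to integer values
    OrderDate  DATETIME NOT NULL  -- date/time values
);

CREATE TABLE Product (
    ProductID  INT PRIMARY KEY,
    Price      DECIMAL(8,2)       -- decimal numeric data
);
```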
Describe normalization and why it is used
- Normalization is a term used by database professionals for a schema design process that minimizes data duplication and enforces data integrity (data is correct, consistent, and dependable).
- While there are many complex rules that define the process of refactoring data into various levels (or forms) of normalization, a simple definition for practical purposes is:
1. Separate each entity into its own table.
2. Separate each discrete attribute into its own column.
3. Uniquely identify each entity instance (row) using a primary key.
4. Use foreign key columns to link related entities.
- Recording each instance of an entity as a row in an entity-specific table removes duplication of data. For example, to change a customer’s address, you need only modify the value in a single row.
- The decomposition of attributes into individual columns ensures that each value is constrained to an appropriate data type. Additionally, the creation of individual columns provides a useful level of granularity in the data for querying - for example, you can easily filter customers to those who live in a specific city.
- Instances of each entity are uniquely identified by an ID or other key value, known as a primary key; and when one entity references another (for example, an order has an associated customer), the primary key of the related entity is stored as a foreign key. You can look up the address of the customer (which is stored only once) for each record in the Order table by referencing the corresponding record in the Customer table. Typically, a relational database management system (RDBMS) can enforce referential integrity to ensure that a value entered into a foreign key field has an existing corresponding primary key in the related table – for example, preventing orders for non-existent customers.
- In some cases, a key (primary or foreign) can be defined as a composite key based on a unique combination of multiple columns. For example, a LineItem table might use a unique combination of OrderNo and ItemNo to identify a line item within an individual order (see the sketch below).
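A sketch of that normalized design, assuming hypothetical Customer, Orders, and LineItem tables:

```sql
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,              -- primary key
    Address    VARCHAR(200)                  -- stored once per customer
);

CREATE TABLE Orders (
    OrderNo    INT PRIMARY KEY,
    CustomerID INT NOT NULL
        REFERENCES Customer (CustomerID)     -- foreign key: referential integrity
);

CREATE TABLE LineItem (
    OrderNo INT NOT NULL REFERENCES Orders (OrderNo),
    ItemNo  INT NOT NULL,
    PRIMARY KEY (OrderNo, ItemNo)            -- composite primary key
);
```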
Identify common structured query language (SQL) statements
- SQL stands for Structured Query Language, and is used to communicate with a relational database. It’s the standard language for relational database management systems. Some common relational database management systems that use SQL include Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle.
- Although these SQL statements are part of the SQL standard, many database management systems also have their own additional proprietary extensions to handle the specifics of that database management system. Some popular dialects of SQL include: Transact-SQL (T-SQL), used by Microsoft SQL Server and Azure SQL services; PL/pgSQL, used by PostgreSQL; and PL/SQL, used by Oracle (PL/SQL stands for Procedural Language/SQL).
- SQL statements are grouped into three main logical groups:
- Data Definition Language (DDL): used to create, modify, and remove tables and other objects in a database (tables, stored procedures, views, and so on). Most common statements: CREATE, ALTER, DROP, RENAME
- Data Control Language (DCL): Database administrators generally use DCL statements to manage access to objects in a database by granting, denying, or revoking permissions to specific users or groups. Three main statements: GRANT, DENY, REVOKE
- Data Manipulation Language (DML): used to manipulate the rows in tables. These statements enable you to retrieve (query) data, insert new rows, or modify existing rows. You can also delete rows if you don’t need them anymore. Four main statements are: SELECT, INSERT, UPDATE, DELETE
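A minimal example from each group (hypothetical table and user names):

```sql
-- DDL: define objects
CREATE TABLE Product (ProductID INT PRIMARY KEY, Name VARCHAR(50));

-- DCL: manage permissions
GRANT SELECT, INSERT ON Product TO user1;

-- DML: query and modify rows
INSERT INTO Product (ProductID, Name) VALUES (1, 'Hiking Boots');
UPDATE Product SET Name = 'Trail Boots' WHERE ProductID = 1;
SELECT ProductID, Name FROM Product;
DELETE FROM Product WHERE ProductID = 1;
```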
Identify common database objects
- In addition to tables, a relational database can contain other structures that help to optimize data organization, encapsulate programmatic actions, and improve the speed of access. Three of these structures: views, stored procedures, and indexes.
- Views: a virtual table based on the results of a SELECT query. You can think of a view as a window on specified rows in one or more underlying tables.
- Stored procedure: defines SQL statements that can be run on command. Stored procedures are used to encapsulate programmatic logic in a database for actions that applications need to perform when working with data. You can define a stored procedure with parameters to create a flexible solution for common actions that might need to be applied to data based on a specific key or criteria.
- Index: helps you search for data in a table. Think of an index over a table like an index at the back of a book: a sorted set of references, with the pages on which each reference occurs. When you want to find a reference to an item, you look it up through the index and use the page numbers to go directly to the correct pages. Without an index, you might have to read through the entire book to find the references you’re looking for.
- When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in sorted order, with pointers to the corresponding rows in the table. When the user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row. The index creates a tree-based structure that the database system’s query optimizer can use to quickly find rows. For a table containing few rows, using the index is probably not any more efficient than simply reading the entire table (in which case the query optimizer will ignore the index); when a table has many rows, however, indexes can dramatically improve the performance of queries.
- You can create many indexes on a table, but indexes aren’t free. An index consumes storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations, so you must strike a balance between indexes that speed up your queries and the cost of performing other operations.
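Minimal T-SQL sketches of all three objects (hypothetical table and column names; in T-SQL, CREATE VIEW and CREATE PROCEDURE must each start their own batch, hence the GO separators):

```sql
-- View: a virtual table based on a SELECT query.
CREATE VIEW USCustomers AS
SELECT CustomerID, Name FROM Customer WHERE Country = 'USA';
GO

-- Stored procedure: encapsulated logic, run on command with a parameter.
CREATE PROCEDURE RenameProduct @ProductID INT, @NewName VARCHAR(50)
AS
UPDATE Product SET Name = @NewName WHERE ProductID = @ProductID;
GO
-- Run it with: EXEC RenameProduct @ProductID = 201, @NewName = 'Trail Boots';

-- Index: a sorted copy of the Name column with pointers back to the table rows.
CREATE INDEX idx_product_name ON Product (Name);
```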
Describe the Azure SQL family of products including Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure Virtual Machines
Azure SQL is a collective term for a family of Microsoft SQL Server based database services in Azure. Specific Azure SQL services include:
- SQL Server on Azure Virtual Machines (VMs) - A virtual machine running in Azure with an installation of SQL Server. The use of a VM makes this option an infrastructure-as-a-service (IaaS) solution that virtualizes hardware infrastructure for compute, storage, and networking in Azure; making it a great option for “lift and shift” migration of existing on-premises SQL Server installations to the cloud. Use this option when you need to migrate or extend an on-premises SQL Server solution and retain full control over all aspects of server and database configuration.
- Azure SQL Managed Instance - A platform-as-a-service (PaaS) option that provides near-100% compatibility with on-premises SQL Server instances while abstracting the underlying hardware and operating system. The service includes automated software update management, backups, and other maintenance tasks, reducing the administrative burden of supporting a database server instance. Use this option for most cloud migration scenarios, particularly when you need minimal changes to existing applications.
- Azure SQL Database - A fully managed, highly scalable PaaS database service that is designed for the cloud. This service includes the core database-level capabilities of on-premises SQL Server, and is a good option when you need to create a new application in the cloud. Use this option for new cloud solutions, or to migrate applications that have minimal instance-level dependencies.
- Azure SQL Edge - A SQL engine that is optimized for Internet of Things (IoT) scenarios that need to work with streaming time-series data.