Azure Data Fundamentals Flashcards
Storage Account
Azure Storage is a core Azure service that enables you to store data in:
Blob containers - scalable, cost-effective storage for binary files.
File shares - network file shares such as you typically find in corporate networks.
Tables - key-value storage for applications that need to read and write data values quickly.
Blob Storage
Tables
Files
Entity
The data structures in which data is organized often represent entities. Each entity typically has one or more attributes, or characteristics.
Structured Data
Data with a fixed schema; tabular; often stored in databases.
Semi Structured Data
Information that has some structure but allows for some variation between entity instances. One common format for semi-structured data is JavaScript Object Notation (JSON).
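As a small illustration of that variation, the sketch below parses two JSON entities of the same kind that carry different fields. The customer records and field names are made up for the example.

```python
import json

# Two hypothetical customer entities; the second has a field the first lacks,
# and the first has a nested list the second lacks.
raw = """
[
  {"firstName": "Joe", "lastName": "Jones",
   "contact": [{"type": "home", "number": "555 123-1234"}]},
  {"firstName": "Samir", "lastName": "Nadoy", "email": "samir@example.com"}
]
"""

customers = json.loads(raw)

# Each entity becomes a dict; the set of keys can differ from one entity to the next.
field_sets = [sorted(customer.keys()) for customer in customers]
for fields in field_sets:
    print(fields)
```

Both records are valid customers even though no fixed schema forces them to share the same attributes.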
Unstructured Data
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure.
Data Stores
There are two broad categories: file stores and databases.
BLOB
Binary large objects, such as video, audio, images, and application-specific documents, stored as raw binary files.
Optimized File Formats
Avro, ORC, Parquet
While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed.
Avro
Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.
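The idea of a JSON header that tells a reader how to parse binary records can be sketched with the Python standard library. This is a toy layout invented for illustration, not the actual Avro encoding: the schema dictionary, the struct format codes, and the 4-byte length prefix are all assumptions.

```python
import json
import struct

# Hypothetical schema: field names paired with struct format codes (not Avro types).
schema = {"fields": [{"name": "id", "fmt": "i"}, {"name": "price", "fmt": "d"}]}


def encode(rows):
    """Write a JSON header followed by fixed-width binary rows."""
    header = json.dumps(schema).encode("utf-8")
    row_fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    body = b"".join(
        struct.pack(row_fmt, *(row[f["name"]] for f in schema["fields"]))
        for row in rows
    )
    # Layout: 4-byte little-endian header length, JSON header, binary rows.
    return struct.pack("<I", len(header)) + header + body


def decode(blob):
    """Read the header first, then use it to parse the binary rows."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [f["name"] for f in header["fields"]]
    row_fmt = "<" + "".join(f["fmt"] for f in header["fields"])
    body = blob[4 + header_len:]
    return [dict(zip(names, values)) for values in struct.iter_unpack(row_fmt, body)]


rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.25}]
blob = encode(rows)
decoded = decode(blob)
```

The reader needs no prior knowledge of the fields: everything required to interpret the binary part travels in the header.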
ORC (Optimized Row Columnar)
ORC organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
Parquet
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
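A rough sketch of why columnar layouts with per-group metadata help: values are stored column by column inside each row group, and a reader can skip whole groups by looking at metadata alone. This is plain in-memory Python, not the Parquet or ORC file layout; the sample rows, column names, and the min/max metadata are assumptions for the example.

```python
# Hypothetical rows: (order_id, region, amount)
rows = [
    (1, "east", 10.0),
    (2, "west", 25.0),
    (3, "east", 7.5),
    (4, "north", 40.0),
]
columns = ("order_id", "region", "amount")


def to_row_groups(rows, group_size):
    """Split rows into groups; inside each group store values column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        data = {name: [r[i] for r in chunk] for i, name in enumerate(columns)}
        # Per-group metadata lets a reader skip groups that cannot match.
        meta = {
            "min_order_id": min(data["order_id"]),
            "max_order_id": max(data["order_id"]),
        }
        groups.append({"meta": meta, "data": data})
    return groups


def read_amounts(groups, lo, hi):
    """Read only the 'amount' column for order_ids in the range [lo, hi]."""
    out = []
    for group in groups:
        meta = group["meta"]
        if meta["max_order_id"] < lo or meta["min_order_id"] > hi:
            continue  # skipped using metadata alone; column data is never touched
        for order_id, amount in zip(group["data"]["order_id"], group["data"]["amount"]):
            if lo <= order_id <= hi:
                out.append(amount)
    return out


groups = to_row_groups(rows, group_size=2)
result = read_amounts(groups, 3, 4)
print(result)
```

The query never reads the region column, and the first row group is skipped entirely because its metadata rules it out.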
Data Documents
Collective forms in which data exists. Common types:
Datasets, Databases, Datastores, Data Warehouses, Notebooks
Dataset
logical grouping of data
Database
Structured or semi-structured data that can be quickly accessed and searched.
Datastores
Unstructured or semi-structured data (for example, Azure Data Lake).
Non-Relational Databases
Often referred to as NoSQL databases. There are four common types: key-value; document (similar to key-value, where the value is a JSON document); column family; and graph.
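The four models can be pictured with ordinary Python data structures. This is only a mental model with made-up keys and values, not how any particular database stores data.

```python
# Key-value: each value is looked up by a unique key.
key_value = {"product:1": "Widget", "product:2": "Gadget"}

# Document: the value is a JSON-like document whose fields can be queried.
documents = {
    "customer:1": {"name": "Joe", "orders": [{"item": "Widget", "qty": 2}]},
    "customer:2": {"name": "Samir", "email": "samir@example.com"},
}

# Column family: columns are grouped into named families under each row key.
column_family = {
    "customer:1": {
        "identity": {"first": "Joe", "last": "Jones"},
        "contact": {"phone": "555 123-1234"},
    },
}

# Graph: nodes plus edges that describe relationships between nodes.
nodes = {"joe": {"type": "employee"}, "sales": {"type": "department"}}
edges = [("joe", "works_in", "sales")]

# One example lookup against each model.
product = key_value["product:1"]
customers_with_email = [key for key, doc in documents.items() if "email" in doc]
phone = column_family["customer:1"]["contact"]["phone"]
joe_departments = [dst for src, rel, dst in edges if src == "joe" and rel == "works_in"]
```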
Data Lake Storage (Gen2)
Hierarchical data storage for analytical data lakes; works with structured, semi-structured, and unstructured data. To create an Azure Data Lake Storage Gen2 file system, you must enable the Hierarchical Namespace option of an Azure Storage account.
Azure File Storage
Many on-premises systems comprising a network of in-house computers make use of file shares. A file share enables you to store a file on one computer, and grant access to that file to users and applications running on other computers. This strategy can work well for computers in the same local area network, but doesn’t scale well as the number of users increases, or if users are located at different sites.
Azure Files enables you to share up to 100 TB of data in a single storage account.
Offers Standard and Premium tiers
Azure Table Storage
A NoSQL key-value storage solution. The key in an Azure Table Storage table comprises two elements: the partition key, which identifies the partition containing the row, and a row key, which is unique to each row in the same partition.
Azure Table Storage allows you to store key-value data at the cheapest per-GB rate. Cosmos DB also has a key-value option, but it is not as cost-effective.
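The two-part key (partition key plus row key) can be sketched as a nested dictionary. ToyTable and its methods are hypothetical names for an in-memory stand-in; this is not the Azure Table Storage API, and the partition and row key values are invented.

```python
class ToyTable:
    """In-memory stand-in for a table keyed by (partition key, row key)."""

    def __init__(self):
        self._partitions = {}

    def upsert(self, partition_key, row_key, properties):
        # The row key only has to be unique inside its own partition.
        self._partitions.setdefault(partition_key, {})[row_key] = dict(properties)

    def get(self, partition_key, row_key):
        # Point lookup: both parts of the key are required.
        return self._partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        # All rows of one partition, sorted by row key here for a predictable result.
        partition = self._partitions.get(partition_key, {})
        return [(row_key, partition[row_key]) for row_key in sorted(partition)]


table = ToyTable()
table.upsert("customers-uk", "0002", {"name": "Amy"})
table.upsert("customers-uk", "0001", {"name": "Joe"})
table.upsert("customers-us", "0001", {"name": "Samir"})  # same row key, other partition

uk_rows = table.query_partition("customers-uk")
us_customer = table.get("customers-us", "0001")
```

Row key "0001" appears twice without conflict because each occurrence lives in a different partition.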
Storage Account
Must be created first, before other storage services are added.
Azure Cosmos DB
Azure Cosmos DB is a highly scalable cloud database service for NoSQL data. It uses indexes and partitioning to provide fast read and write performance and can scale to massive volumes of data. You can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that globally distributed users can each work with data in their local replica. It includes APIs for MongoDB, Table (key-value tables), Cassandra (column-family storage), and Gremlin (graph structures).