Azure Data Fundamentals Flashcards
Storage Account
Azure Storage is a core Azure service that enables you to store data in:
Blob containers - scalable, cost-effective storage for binary files.
File shares - network file shares such as you typically find in corporate networks.
Tables - key-value storage for applications that need to read and write data values quickly.
Blob Storage
Tables
Files
Entity
The data structures in which data is organized often represent entities. Each entity typically has one or more attributes, or characteristics.
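As a minimal sketch of an entity and its attributes, the example below models a hypothetical Customer entity as a Python dataclass. The class name, field names, and sample values are illustrative assumptions, not part of the original flashcards.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    # Each field is one attribute (characteristic) of the Customer entity.
    customer_id: int
    name: str
    email: str


# One entity instance with a value for every attribute.
customer = Customer(customer_id=1, name="Ada Lopez", email="ada@example.com")

# The attribute names can be read back from the entity definition.
attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)
```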
Structured Data
Data that follows a fixed schema and is tabular; often stored in databases.
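A fixed schema can be illustrated with an in-memory SQLite table from the Python standard library. This is only a sketch: the table name, columns, and rows are made up for illustration. Every row must supply the declared columns, and a row that does not fit is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared once, up front, and applies to every row.
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "Lisbon"), (2, "Ben", "Oslo")],
)
rows = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()

# A row that omits a required column violates the schema and is rejected.
try:
    conn.execute("INSERT INTO customer (id, name) VALUES (3, 'Cy')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```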
Semi-Structured Data
Information that has some structure but allows for some variation between entity instances. One common format for semi-structured data is JavaScript Object Notation (JSON).
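The variation between entity instances can be seen in a small JSON document parsed with Python's `json` module. The two customer records below are invented for illustration: both describe the same kind of entity, but they do not share the same set of fields.

```python
import json

raw = """
[
  {"name": "Ada", "email": "ada@example.com"},
  {"name": "Ben", "phones": ["555-0100", "555-0101"], "address": {"city": "Oslo"}}
]
"""

customers = json.loads(raw)

# Each instance carries its own set of fields.
field_sets = [sorted(c.keys()) for c in customers]

# A reader has to tolerate fields that are missing from some instances.
emails = [c.get("email") for c in customers]

print(field_sets)
print(emails)
```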
Unstructured Data
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure.
Data Stores
There are two broad categories: file stores and databases.
BLOB
Binary Large Objects, such as video, audio, images, and application-specific documents, stored as raw binary files.
Optimized File Formats
Avro, ORC, Parquet
While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed.
Avro
Avro is a row-based format created by Apache. An Avro file starts with a header that describes the structure of the records it contains. This header is stored as JSON; the records themselves are stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.
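The idea of a JSON header that describes the data, followed by binary rows, can be sketched with the standard library alone. This is a toy layout and not the actual Avro encoding: the length prefix, the `STRUCT_CODES` mapping, the `encode` and `decode` helpers, and the sensor schema are all assumptions made for illustration.

```python
import json
import struct

# Hypothetical schema, written in a JSON shape loosely inspired by Avro.
schema = {
    "name": "Reading",
    "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "temperature", "type": "double"},
    ],
}

# Toy mapping from schema types to struct format codes (an assumption).
STRUCT_CODES = {"int": "i", "double": "d"}


def row_format(schema):
    # Little-endian, no padding, one code per field in schema order.
    return "<" + "".join(STRUCT_CODES[f["type"]] for f in schema["fields"])


def encode(schema, records):
    header = json.dumps(schema).encode("utf-8")
    names = [f["name"] for f in schema["fields"]]
    fmt = row_format(schema)
    body = b"".join(struct.pack(fmt, *(rec[n] for n in names)) for rec in records)
    # Layout: 4-byte header length, JSON header, then binary rows.
    return struct.pack("<I", len(header)) + header + body


def decode(blob):
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [f["name"] for f in schema["fields"]]
    fmt = row_format(schema)
    row_size = struct.calcsize(fmt)
    rows = []
    # The header tells the reader how to slice the binary body into fields.
    for offset in range(4 + header_len, len(blob), row_size):
        rows.append(dict(zip(names, struct.unpack_from(fmt, blob, offset))))
    return rows


records = [
    {"sensor_id": 1, "temperature": 21.5},
    {"sensor_id": 2, "temperature": 19.0},
]
blob = encode(schema, records)
decoded = decode(blob)
print(decoded)
```

The reader needs no prior knowledge of the fields; everything required to interpret the binary rows travels in the header.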
ORC (Optimized Row Columnar)
ORC organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds a group of rows, with the values stored column by column. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
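The difference between row and column layout, and the kind of per-column statistics mentioned above, can be sketched in plain Python. This does not read or write real ORC files; the sample rows and the `column_stats` helper are hypothetical and only illustrate the idea.

```python
rows = [
    {"product": "apple", "quantity": 3, "price": 0.5},
    {"product": "pear", "quantity": 1, "price": 0.75},
    {"product": "fig", "quantity": 7, "price": 2.0},
]

# Pivot the row-oriented records into one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def column_stats(values):
    # Count, min, and max apply to any comparable type; sum only to numbers.
    stats = {"count": len(values), "min": min(values), "max": max(values)}
    if all(isinstance(v, (int, float)) for v in values):
        stats["sum"] = sum(values)
    return stats


stats = {name: column_stats(values) for name, values in columns.items()}

# An aggregate such as total quantity can be answered from the statistics
# alone, without touching the other columns.
total_quantity = stats["quantity"]["sum"]
print(columns["quantity"], total_quantity)
```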
Parquet
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
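How metadata lets a reader locate the right chunk can be sketched with a toy in-memory model. This is not the Parquet file format itself: the `build_row_groups` and `read_amounts` helpers, the min/max metadata shape, and the order data are assumptions made for illustration.

```python
# Nine hypothetical order rows with ids 1..9.
rows = [{"order_id": i, "amount": i * 10.0} for i in range(1, 10)]


def build_row_groups(rows, group_size):
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a row group, values are stored together per column.
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        # Metadata records the value range each column covers in this group.
        metadata = {
            name: {"min": min(vals), "max": max(vals)}
            for name, vals in columns.items()
        }
        groups.append({"columns": columns, "metadata": metadata})
    return groups


def read_amounts(groups, low, high):
    amounts, scanned = [], 0
    for group in groups:
        meta = group["metadata"]["order_id"]
        # Skip groups whose id range cannot contain a matching row.
        if meta["max"] < low or meta["min"] > high:
            continue
        scanned += 1
        ids = group["columns"]["order_id"]
        amts = group["columns"]["amount"]
        amounts.extend(a for i, a in zip(ids, amts) if low <= i <= high)
    return amounts, scanned


groups = build_row_groups(rows, group_size=3)
amounts, scanned = read_amounts(groups, low=4, high=5)
print(amounts, scanned)
```

Only one of the three row groups is scanned for this query, and only the two columns the query needs are touched.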
Data Documents
The collective forms in which data exists; common types:
Datasets, Databases, Datastores, Data Warehouses, Notebooks