Core Data Concepts Flashcards
Structured Data
Data that adheres to a fixed schema (all data has the same fields/properties).
Most commonly, the schema is tabular.
Semi-structured Data
Data has a structure, but not all data points share the same fields/properties.
A common format is JavaScript Object Notation (JSON).
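A minimal sketch of what "not all data points share the same fields" can look like in practice. The customer records and field names are hypothetical, chosen only for illustration:

```python
import json

# Hypothetical semi-structured records: each one is a JSON object,
# but the set of fields differs from record to record.
raw = """
[
  {"name": "Joe", "email": "joe@example.com"},
  {"name": "Jane", "phone": "555-0100", "address": {"city": "Seattle"}}
]
"""

records = json.loads(raw)

# Field names that each record actually carries.
fields_per_record = [sorted(record.keys()) for record in records]

# Fields present in every record vs. fields present in only some.
common_fields = set.intersection(*(set(r) for r in records))
all_fields = set.union(*(set(r) for r in records))
optional_fields = all_fields - common_fields

print(fields_per_record)
print(sorted(common_fields), sorted(optional_fields))
```

With structured data, optional_fields would be empty because every record has the same fields.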
Optimized file formats: Avro
Avro is a row-based format.
Each Avro file carries a JSON header (the schema) that describes the structure of the records it contains.
The data is stored as binary information.
Header example:
{
  "type": "record",
  "name": "thecodebuzz_schema",
  "namespace": "thecodebuzz.avro",
  "fields": [
    { "name": "username", "type": "string", "doc": "Name of the user account on Thecodebuzz.com" },
    { "name": "email", "type": "string", "doc": "The email of the user logging message on the blog" },
    { "name": "timestamp", "type": "long", "doc": "time in seconds" }
  ],
  "doc": "A basic schema for storing thecodebuzz blogs messages"
}
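Because the header is plain JSON, it can be inspected with any JSON parser. The sketch below uses only Python's standard library and reads the schema, not the binary record data; a real Avro library would be needed for that. The "doc" entries are omitted for brevity:

```python
import json

# The schema from the header example, without the "doc" entries.
schema_json = """
{
  "type": "record",
  "name": "thecodebuzz_schema",
  "namespace": "thecodebuzz.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

schema = json.loads(schema_json)

# Map each field name to its declared type.
field_types = {field["name"]: field["type"] for field in schema["fields"]}

# The namespace and name together identify the record type.
full_name = f'{schema["namespace"]}.{schema["name"]}'

print(full_name)
print(field_types)
```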
Unstructured data
Data with no predefined structure.
Documents, images, audio and video data, and binary files might not have a specific structure.
Common file formats for data: Delimited text files
Data is stored in plain text format with specific field delimiters and row terminators.
Comma-separated values (CSV) and tab-separated values (TSV) are two common formats.
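As a rough illustration, the same rows can be written as CSV or TSV and read back with Python's standard csv module. Only the delimiter changes. The sample data and the read_delimited helper are assumptions made for this sketch:

```python
import csv
import io

# The same hypothetical rows, once comma-separated and once tab-separated.
csv_text = "id,name,price\n123,Hammer,2.99\n456,Screwdriver,3.49\n"
tsv_text = "id\tname\tprice\n123\tHammer\t2.99\n456\tScrewdriver\t3.49\n"

def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

# Both formats carry the same data; every value is read as a string.
print(csv_rows == tsv_rows)
print(csv_rows[0])
```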
Common file formats for data: JavaScript Object Notation (JSON)
A hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
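A short sketch of the hierarchy described here: one attribute is itself an object, and another is a collection of objects. The document content is hypothetical:

```python
import json

# Hypothetical JSON document: "customer" is an object, and its
# "contacts" attribute is a collection of objects.
doc = """
{
  "customer": {
    "name": "Jane",
    "contacts": [
      {"type": "email", "value": "jane@example.com"},
      {"type": "phone", "value": "555-0100"}
    ]
  }
}
"""

data = json.loads(doc)
customer = data["customer"]

# Walk the nested collection to pull one attribute from each object.
contact_types = [contact["type"] for contact in customer["contacts"]]

print(customer["name"], contact_types)
```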
Common file formats for data: Extensible Markup Language (XML)
XML uses tags enclosed in angle brackets (<../>) to define elements and attributes.
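To make the element/attribute distinction concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The customers document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: <customer> elements carry "id" and "name" attributes,
# and may contain a nested <email> element.
xml_text = """
<customers>
  <customer id="1" name="Joe">
    <email>joe@example.com</email>
  </customer>
  <customer id="2" name="Jane"/>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Attributes are read with .get(); child element text with .findtext().
names = [customer.get("name") for customer in root.findall("customer")]
emails = [customer.findtext("email") for customer in root.findall("customer")]

print(names)
print(emails)  # a customer with no <email> child yields None
```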
Optimized file formats: ORC
ORC (Optimized Row Columnar format) organizes data into columns rather than rows.
An ORC file contains:
- groups of row data called stripes
- a file footer
- a postscript
A stripe is 250 MB by default.
The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains the column-level aggregates count, min, max, and sum.
The postscript holds compression parameters and the size of the compressed footer.
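The layout can be pictured with a toy in-memory model. This is not the real ORC binary format: build_orc_like, its dictionary keys, and the sample rows are all hypothetical, stripes are sized by row count rather than bytes, and the postscript is left out:

```python
# Toy in-memory model of the ORC layout on this card. This is NOT the real
# ORC binary format; build_orc_like and its dict keys are hypothetical.
rows = [
    {"name": "Pete", "age": 41},
    {"name": "Dave", "age": 20},
    {"name": "Sue", "age": 33},
    {"name": "Kate", "age": 23},
    {"name": "May", "age": 53},
]

def build_orc_like(rows, rows_per_stripe):
    """Group rows into columnar stripes and summarize them in a footer."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        # Inside a stripe, values are laid out column by column.
        stripes.append({
            "name": [r["name"] for r in chunk],
            "age": [r["age"] for r in chunk],
        })
    ages = [r["age"] for r in rows]
    footer = {
        "rows_per_stripe": [len(s["age"]) for s in stripes],
        "column_types": {"name": "string", "age": "int"},
        "age_stats": {"count": len(ages), "min": min(ages),
                      "max": max(ages), "sum": sum(ages)},
    }
    return {"stripes": stripes, "footer": footer}

orc_like = build_orc_like(rows, rows_per_stripe=2)

# An aggregate such as the maximum age can be read from the footer alone,
# without scanning any stripe.
oldest = orc_like["footer"]["age_stats"]["max"]
print(orc_like["footer"]["rows_per_stripe"], oldest)
```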
Optimized file formats: Parquet
Parquet is another columnar data format. It was created by Cloudera and Twitter.
A Parquet file contains row groups.
Data for each column is stored together in the same row group.
Each row group contains one or more chunks of data.
A Parquet file includes metadata that describes the set of rows found in each chunk.
RowGroup:
- ColumnChunk: Pete, Dave, Sue
- ColumnChunk: M, M, F
- ColumnChunk: 41, 20, 33
RowGroup:
- ColumnChunk: Kate, May, Ari
- ColumnChunk: F, F, M
- ColumnChunk: 23, 53, 19
Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
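The row group diagram above can be reproduced with a few lines of plain Python. This sketch only mimics the logical layout; it does not write a real Parquet file, and to_row_groups is a hypothetical helper:

```python
# The six rows behind the diagram on this card: (name, sex, age).
rows = [
    ("Pete", "M", 41), ("Dave", "M", 20), ("Sue", "F", 33),
    ("Kate", "F", 23), ("May", "F", 53), ("Ari", "M", 19),
]

def to_row_groups(rows, rows_per_group):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        # zip(*group_rows) transposes the rows into one sequence per column.
        column_chunks = [list(column) for column in zip(*group_rows)]
        groups.append(column_chunks)
    return groups

row_groups = to_row_groups(rows, rows_per_group=3)

for group in row_groups:
    print("RowGroup:")
    for chunk in group:
        print("- ColumnChunk:", ", ".join(str(value) for value in chunk))
```

Keeping each column's values together is what makes column-wise compression and encoding effective, since neighboring values share a type and often a range.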
Relational databases
Data is stored in tables that represent entities, such as customers, products, or sales orders.
Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables.
The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems.
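A minimal sketch of primary keys and cross-table references, using Python's built-in sqlite3 module with an in-memory database. The table names, columns, and sample rows are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table represents an entity; "id" is the primary key.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE sales_order (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total REAL NOT NULL
    )
""")

conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)",
                 [(1, "Joe"), (2, "Jane")])
conn.executemany("INSERT INTO sales_order (id, customer_id, total) VALUES (?, ?, ?)",
                 [(10, 1, 2.99), (11, 1, 3.49), (12, 2, 4.25)])

# The customer's primary key is what sales_order uses to reference it.
order_counts = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customer AS c
    JOIN sales_order AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()
conn.close()

print(order_counts)
```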
Non-relational databases
Non-relational databases are data management systems that don’t apply a relational schema to the data.
Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.
Non-relational databases: Key-value databases
Key-value databases store records that each consist of a unique key and an associated value, which can be in any format.
Key, Value
123, “Hammer ($2.99)”
456, “Screwdriver ($3.49)”
789, “Wrench ($4.25)”
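The table above maps naturally onto a dictionary. The KeyValueStore class below is a hypothetical in-memory sketch, not the API of any particular product. Note that it never inspects the value; lookups are by key only:

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store: values are opaque to it."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # A key is unique, so writing to an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put(123, "Hammer ($2.99)")
store.put(456, "Screwdriver ($3.49)")
store.put(789, "Wrench ($4.25)")

hammer = store.get(123)
missing = store.get(999)  # unknown key, so the default (None) comes back

print(hammer, missing)
```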
Non-relational databases: Document databases
Document databases are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).
Key, Document
1, {“name”: “joe”, “sex”: “male”}
2, {“name”: “jane”, “sex”: “female”}
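What separates this from a plain key-value store is that the system can look inside the value. A rough sketch under that assumption, with the two documents above stored as JSON strings and a hypothetical find helper that queries by field:

```python
import json

# The two documents from this card, stored as JSON text under their keys.
documents = {
    1: '{"name": "joe", "sex": "male"}',
    2: '{"name": "jane", "sex": "female"}',
}

def find(documents, field, value):
    """Return the keys of documents whose top-level field equals value."""
    matches = []
    for key, raw in documents.items():
        doc = json.loads(raw)  # the store parses the value to query it
        if doc.get(field) == value:
            matches.append(key)
    return matches

female_keys = find(documents, "sex", "female")
print(female_keys)
```

A real document database would index fields rather than parse every document on each query; the linear scan here is only for clarity.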
Non-relational databases: Column family databases
Column family databases store tabular data comprising rows and columns.
They can be considered a two-dimensional key-value store, where a row key and a column key are used to access data.
“Employees”: {
row1: { “ID”: 1, “Last”: “Cooper”, “First”: “James”, “Age”: 32 },
row2: { “ID”: 2, “Last”: “Bell”, “First”: “Lisa”, “Age”: 57 },
…
}
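The "two-dimensional key-value store" idea can be sketched with nested dictionaries: the outer key is the row key and the inner key is the column key. The get_cell helper is hypothetical:

```python
# The Employees example as nested dicts: row key -> column key -> value.
employees = {
    "row1": {"ID": 1, "Last": "Cooper", "First": "James", "Age": 32},
    "row2": {"ID": 2, "Last": "Bell", "First": "Lisa", "Age": 57},
}

def get_cell(table, row_key, column_key):
    """Look up one value by row key and column key (None if absent)."""
    return table.get(row_key, {}).get(column_key)

last_name = get_cell(employees, "row2", "Last")
unknown = get_cell(employees, "row9", "Last")

print(last_name, unknown)
```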
Non-relational databases: Graph databases
Graph databases store entities as nodes, with links that define the relationships between them.
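A minimal sketch of nodes and links using plain Python structures. The node names, relationship labels, and the neighbors helper are all hypothetical:

```python
# Nodes are entities; each link is (source node, relationship, target node).
nodes = {
    "joe": {"type": "person"},
    "jane": {"type": "person"},
    "acme": {"type": "company"},
}
links = [
    ("joe", "WORKS_FOR", "acme"),
    ("jane", "WORKS_FOR", "acme"),
    ("joe", "KNOWS", "jane"),
]

def neighbors(links, source, relationship):
    """Follow links of one relationship type outward from a node."""
    return [target for src, rel, target in links
            if src == source and rel == relationship]

# Queries follow relationships rather than joining tables.
joe_knows = neighbors(links, "joe", "KNOWS")
acme_staff = sorted(src for src, rel, target in links
                    if target == "acme" and rel == "WORKS_FOR")

print(joe_knows, acme_staff)
```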