Core Data Concepts Flashcards
What is Data?
Data is a collection of facts such as numbers, descriptions, and observations used to record information.
Structured Data
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties. The data is represented in one or more tables that consist of rows to represent each instance of a data entity, and columns to represent attributes of the entity.
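As a minimal sketch of the same idea in Python (the entity and field names here are just illustrative), every row of a structured table carries exactly the same set of fields:

from collections import namedtuple

# A fixed schema: every record has exactly the same fields (columns).
Customer = namedtuple("Customer", ["customer_id", "name", "email"])

# Each instance of the Customer entity is one row in the table.
customers = [
    Customer(1, "Joe Jones", "joe@litware.com"),
    Customer(2, "Samir Nadoy", "samir@northwind.com"),
]

for row in customers:
    print(row.customer_id, row.name, row.email)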
Semi-structured data
Semi-structured data is information that has some structure, but which allows for some variation between entity instances. For example, while most customers may have an email address, some might have multiple email addresses, and some might have none at all.
Unstructured data
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure. This kind of data is referred to as unstructured data.
Data stores
There are two broad categories of data store in common use:
- File stores
- Databases
Delimited text files
Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV), in which fields are separated by commas and rows are terminated by a carriage return / new line. Another popular format is tab-separated values (TSV).
Delimited text is a good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format.
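As an illustration (the file content and field names below are made up), Python's built-in csv module can read delimited text:

import csv
import io

# A small CSV document: the first row holds the column names, each
# following row is one record, and fields are separated by commas.
csv_text = "customer_id,name,email\n1,Joe Jones,joe@litware.com\n2,Samir Nadoy,samir@northwind.com\n"

for record in csv.DictReader(io.StringIO(csv_text)):
    print(record["customer_id"], record["name"])

# Tab-separated values (TSV) only change the delimiter character.
tsv_rows = list(csv.reader(io.StringIO("1\tJoe Jones\n"), delimiter="\t"))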
JavaScript Object Notation (JSON)
JSON is a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
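For instance (the document below is invented for illustration), Python's standard json module can load a semi-structured document in which customers have different sets of attributes:

import json

# Both objects describe a customer, but the second one has a list of
# email addresses and an extra attribute that the first one lacks.
document = '''
[
  {"id": 1, "name": "Joe Jones", "email": "joe@litware.com"},
  {"id": 2, "name": "Samir Nadoy",
   "email": ["samir@northwind.com", "samir.n@fabrikam.com"],
   "phone": "555-0100"}
]
'''

customers = json.loads(document)
for customer in customers:
    # Not every instance has the same fields, so read them defensively.
    print(customer["name"], customer.get("phone", "no phone on record"))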
Extensible Markup Language (XML)
XML is a human-readable data format that was popular in the 1990s and 2000s. It’s largely been superseded by the less verbose JSON format, but there are still some systems that use XML to represent data.
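As a small illustration (the element names are invented), the standard xml.etree.ElementTree module can parse the same kind of data expressed as XML, which is noticeably more verbose than the JSON equivalent:

import xml.etree.ElementTree as ET

# Every value is wrapped in opening and closing tags, which is what
# makes XML more verbose than JSON.
xml_text = '''
<Customers>
  <Customer id="1">
    <Name>Joe Jones</Name>
    <Email>joe@litware.com</Email>
  </Customer>
</Customers>
'''

root = ET.fromstring(xml_text)
for customer in root.findall("Customer"):
    print(customer.get("id"), customer.findtext("Name"))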
Binary Large Object (BLOB)
Ultimately, all files are stored as binary data (1’s and 0’s), but in the human-readable formats discussed above, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode).
Common types of data stored as binary include images, video, audio, and application-specific documents.
When working with data like this, data professionals often refer to the data files as BLOBs (Binary Large Objects).
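A minimal sketch of working with a BLOB in Python (the file name is hypothetical): the file is opened in binary mode and read as raw bytes rather than decoded through a character encoding.

# Read the first bytes of a binary file; "rb" means no text decoding.
with open("photo.png", "rb") as blob:
    header = blob.read(8)

# PNG images always start with this fixed 8-byte signature.
if header == b"\x89PNG\r\n\x1a\n":
    print("Looks like a PNG image")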
Optimized file formats
Because human-readable formats are not optimized for storage and processing, specialized file formats have been developed that enable compression, indexing, and efficient storage and processing.
Avro – A row-based format created by Apache. Each record contains a header that describes the structure of the record; the header is stored as JSON and the data is stored as binary. Avro is a good file format for compressing data and minimizing storage and network bandwidth requirements.
ORC (Optimized Row Columnar format) – Organizes data into columns rather than rows. It was developed by Hortonworks to optimize read and write operations in Apache Hive (a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns, and contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
Parquet – Another columnar data format, created by Cloudera and Twitter. A Parquet file contains row groups; data for each column is stored together in the same row group, and each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk, which an application can use to quickly locate the correct chunk for a given set of rows and retrieve the data in the specified columns for those rows. Parquet specializes in storing and processing nested data types efficiently, and it supports very efficient compression and encoding schemes.
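As a hedged sketch of why columnar storage pays off, the following assumes the pyarrow library is installed (the file and column names are illustrative); a reader can fetch just the columns it needs instead of scanning every field of every row:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it out as a Parquet file.
table = pa.table({
    "customer_id": [1, 2, 3],
    "name": ["Joe", "Samir", "Kim"],
    "total_spend": [120.50, 89.99, 240.00],
})
pq.write_table(table, "customers.parquet")

# Columnar storage lets the reader load only the requested columns.
spend = pq.read_table("customers.parquet", columns=["customer_id", "total_spend"])
print(spend.to_pydict())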
Relational databases
Relational databases are commonly used to store and query structured data. The data is stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables. This use of keys to reference data entities enables a relational database to be normalized, which in part means the elimination of duplicate data values so that, for example, the details of an individual customer are stored only once, not for each sales order the customer places.
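A minimal sketch using Python's built-in sqlite3 module (the table and column names are invented) shows the normalization idea: customer details live in one table, and sales orders reference them by key rather than repeating them:

import sqlite3

conn = sqlite3.connect(":memory:")

# Customer details are stored once; SalesOrder rows reference the
# customer through its primary key instead of duplicating the details.
conn.executescript("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name TEXT NOT NULL
    );
    CREATE TABLE SalesOrder (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID),
        Amount REAL NOT NULL
    );
""")
conn.execute("INSERT INTO Customer VALUES (1, 'Joe Jones')")
conn.executemany("INSERT INTO SalesOrder VALUES (?, ?, ?)",
                 [(1001, 1, 49.99), (1002, 1, 12.50)])

# A join follows the key back to the single copy of the customer row.
for row in conn.execute("""
    SELECT c.Name, o.OrderID, o.Amount
    FROM SalesOrder AS o JOIN Customer AS c ON c.CustomerID = o.CustomerID
"""):
    print(row)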
Non-relational databases
Non-relational databases are data management systems that don’t apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.
There are four common types of non-relational database in use (a minimal sketch follows the list).
- Key-value databases in which each record consists of a unique key and an associated value, which can be in any format.
- Document databases, which are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).
- Column family databases, which store tabular data comprising rows and columns, but you can divide the columns into groups known as column families. Each column family holds a set of columns that are logically related.
- Graph databases, which store entities as nodes with links to define relationships between them.
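Purely as an illustration of the first two models (real systems such as Redis, MongoDB, Cassandra, or Neo4j have their own APIs), plain Python dictionaries can stand in for a key-value store and a document store:

# Key-value: each record is a unique key with a value in any format.
key_value_store = {
    "customer:1": b"\x01\x9c\x00\x7f",   # opaque bytes; only the app knows the format
}

# Document: the value is a JSON-like document the store can query into.
document_store = {
    "customer:1": {"name": "Joe Jones",
                   "orders": [{"id": 1001, "amount": 49.99}]},
}

print(document_store["customer:1"]["orders"][0]["amount"])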
Transactional data workloads
A transactional data processing system is what most people consider the primary function of business computing. A transactional system records transactions that encapsulate specific events that the organization wants to track. A transaction could be financial, such as the movement of money between accounts in a banking system.
Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).
Online Transactional Processing (OLTP)
OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). These operations are applied transactionally, in a way that ensures the integrity of the data stored in the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics:
Atomicity
Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely. For example, a transaction that involves debiting funds from one account and crediting the same amount to another account must complete both actions. If either action can’t be completed, then the other action must fail.
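As a hedged sketch of atomicity using Python's built-in sqlite3 module (the account numbers and amounts are invented), the debit and the credit are committed together or rolled back together:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (AccountID INTEGER PRIMARY KEY, Balance REAL)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    # Both actions belong to one transaction: debit one account...
    conn.execute("UPDATE Account SET Balance = Balance - 30 WHERE AccountID = 1")
    # ...and credit the same amount to the other.
    conn.execute("UPDATE Account SET Balance = Balance + 30 WHERE AccountID = 2")
    conn.commit()        # the transaction succeeds completely...
except sqlite3.Error:
    conn.rollback()      # ...or fails completely, leaving both balances untouched

print(list(conn.execute("SELECT * FROM Account")))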
Consistency
Consistency – transactions can only take the data in the database from one valid state to another. To continue the debit and credit example above, the completed state of the transaction must reflect the transfer of funds from one account to the other.
Isolation
Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state. For example, while the transaction to transfer funds from one account to another is in-process, another transaction that checks the balance of these accounts must return consistent results - the balance-checking transaction can’t retrieve a value for one account that reflects the balance before the transfer, and a value for the other account that reflects the balance after the transfer.