Core Data Concepts Flashcards

1
Q

Structured Data

A

Data that adheres to a fixed schema (every record has the same fields/properties).

Most commonly, the schema is tabular.

2
Q

Semi-structured Data

A

Data that has some structure, but in which not all records share the same fields/properties.

A common format is JavaScript Object Notation (JSON).

3
Q

Optimized file formats: Avro

A

Avro is a row-based format.

Each Avro data file includes a JSON header (the schema) that describes the structure of the records it contains.

The data is stored as binary information.

Header example:

{
  "type": "record",
  "name": "thecodebuzz_schema",
  "namespace": "thecodebuzz.avro",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "Name of the user account on Thecodebuzz.com"
    },
    {
      "name": "email",
      "type": "string",
      "doc": "The email of the user logging message on the blog"
    },
    {
      "name": "timestamp",
      "type": "long",
      "doc": "time in seconds"
    }
  ],
  "doc": "A basic schema for storing thecodebuzz blogs messages"
}
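
To illustrate "the data is stored as binary information", here is a minimal sketch that encodes one record for the schema above by hand, following the binary encoding rules in the Avro specification (zigzag variable-length integers for long, length-prefixed UTF-8 for string). It does not use the Avro library; the function names and sample values are hypothetical.

```python
def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding of a signed 64-bit integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}

# Field order and types copied from the schema above.
FIELDS = [("username", "string"), ("email", "string"), ("timestamp", "long")]


def encode_record(record: dict) -> bytes:
    """A record is the concatenation of its fields in schema order (row-based)."""
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in FIELDS)


# Hypothetical sample record.
encoded = encode_record({"username": "foo", "email": "a@b.c", "timestamp": 1})
```

Field names never appear in the encoded bytes; a reader needs the schema from the header to interpret them.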
4
Q

Unstructured data

A

Data with no predefined structure.

Examples: documents, images, audio and video data, and binary files.

5
Q

Common file formats for data: Delimited text files

A

Data is stored in plain text format with specific field delimiters and row terminators.

Comma-separated values (CSV) and tab-separated values (TSV) are two common formats.
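
A small sketch of reading both formats with Python's standard csv module. The sample data and the read_delimited helper are made up for illustration; the only difference between the two calls is the field delimiter.

```python
import csv
import io

# Hypothetical sample data: same records, different field delimiters.
csv_text = "id,name,price\r\n1,Hammer,2.99\r\n2,Wrench,4.25\r\n"
tsv_text = "id\tname\tprice\r\n1\tHammer\t2.99\r\n2\tWrench\t4.25\r\n"


def read_delimited(text: str, delimiter: str) -> list:
    """Parse delimited text into a list of dicts keyed by the header row."""
    stream = io.StringIO(text, newline="")  # let the csv module handle row terminators
    return list(csv.DictReader(stream, delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
```

Every value comes back as a string, because delimited text carries no type information.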

6
Q

Common file formats for data: JavaScript Object Notation (JSON)

A

A hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
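
A short sketch using Python's standard json module. The two customer objects are invented examples; they show an attribute that is itself an object, an attribute that is a collection of objects, and two entities that do not share the same fields.

```python
import json

# Hypothetical document: the two objects have different attributes.
doc = """
[
  {"name": "Joe", "contact": {"email": "joe@example.com"}},
  {"name": "Jane", "phones": [{"type": "mobile", "number": "555-0100"}]}
]
"""

customers = json.loads(doc)

# Nested object: only present on some entities, so check before reading it.
emails = [c["contact"]["email"] for c in customers if "contact" in c]

# Collection of objects: default to an empty list when the attribute is absent.
phone_types = [p["type"] for c in customers for p in c.get("phones", [])]
```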

7
Q

Common file formats for data: Extensible Markup Language (XML)

A

XML uses tags enclosed in angle brackets (<../>) to define elements and attributes.
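
A minimal sketch of elements and attributes, parsed with Python's standard xml.etree.ElementTree module. The Customer document is an invented example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <Customer> and <Contact> are elements,
# name/type/number are attributes.
xml_text = """
<Customer name="Joe">
  <ContactDetails>
    <Contact type="home" number="555 123-1234"/>
    <Contact type="work" number="555 123-5678"/>
  </ContactDetails>
</Customer>
"""

root = ET.fromstring(xml_text)

customer_name = root.attrib["name"]
contacts = {c.get("type"): c.get("number") for c in root.iter("Contact")}
```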

8
Q

Optimized file formats: ORC

A

ORC (Optimized Row Columnar format) organizes data into columns rather than rows.

An ORC file contains:

  • groups of row data called stripes
  • a file footer
  • a postscript

A stripe is 250 MB by default.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains the column-level aggregates count, min, max, and sum.

The postscript holds compression parameters and the size of the compressed footer.
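
A toy sketch of the kind of information the file footer holds. This is not the ORC file layout or an ORC library; the stripes are plain Python dicts, the data is invented, and build_footer is a hypothetical helper that computes rows per stripe plus the count, min, max, and sum for one numeric column.

```python
# Hypothetical stripes: each holds its rows organized by column.
stripes = [
    {"age": [41, 20, 33]},
    {"age": [23, 53, 19]},
]


def build_footer(stripes: list, column: str) -> dict:
    """Collect rows-per-stripe and column-level aggregates for one column."""
    values = [v for stripe in stripes for v in stripe[column]]
    return {
        "rows_per_stripe": [len(stripe[column]) for stripe in stripes],
        "stats": {
            column: {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "sum": sum(values),
            }
        },
    }


footer = build_footer(stripes, "age")
```

With aggregates like these available, a question such as "is anyone older than 60?" can be answered from the footer alone, without reading any stripe.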

9
Q

Optimized file formats: Parquet

A

Parquet is another columnar data format. It was created by Cloudera and Twitter.

A Parquet file contains row groups.
Data for each column is stored together in the same row group.
Each row group contains one or more chunks of data.
A Parquet file includes metadata that describes the set of rows found in each chunk.

RowGroup:

  • ColumnChunk: Pete, Dave, Sue
  • ColumnChunk: M, M, F
  • ColumnChunk: 41, 20, 33

RowGroup:

  • ColumnChunk: Kate, May, Ari
  • ColumnChunk: F, F, M
  • ColumnChunk: 23, 53, 19

Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
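
The row group layout listed above can be sketched in plain Python. This only illustrates the grouping idea and is not the Parquet file format; to_row_groups and the group size of 3 are assumptions chosen to reproduce the example.

```python
# The six rows behind the example above: (name, sex, age).
rows = [
    ("Pete", "M", 41), ("Dave", "M", 20), ("Sue", "F", 33),
    ("Kate", "F", 23), ("May", "F", 53), ("Ari", "M", 19),
]


def to_row_groups(rows: list, rows_per_group: int) -> list:
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        # zip(*batch) transposes the batch: one tuple per column.
        groups.append([list(column) for column in zip(*batch)])
    return groups


row_groups = to_row_groups(rows, rows_per_group=3)
```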

10
Q

Relational databases

A

Data is stored in tables that represent entities, such as customers, products, or sales orders.

Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables.

The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems.
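
A minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names are invented; the point is the primary key on each table and the customer_id column that references a customer from the sales_order table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,          -- uniquely identifies each customer
    name TEXT NOT NULL
);
CREATE TABLE sales_order (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),  -- key used in another table
    total_cents INTEGER NOT NULL
);
INSERT INTO customer VALUES (1, 'Joe'), (2, 'Jane');
INSERT INTO sales_order VALUES (10, 1, 299), (11, 1, 425), (12, 2, 349);
""")

# Join the two tables through the key and summarize orders per customer.
order_summary = conn.execute("""
SELECT c.name, COUNT(o.id), SUM(o.total_cents)
FROM customer AS c
JOIN sales_order AS o ON o.customer_id = c.id
GROUP BY c.id
ORDER BY c.id
""").fetchall()
```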

11
Q

Non-relational databases

A

Non-relational databases are data management systems that don’t apply a relational schema to the data.

Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.

12
Q

Non-relational databases: Key-value databases

A

Key-value databases store records that each consist of a unique key and an associated value, which can be in any format.

Key, Value
123, “Hammer ($2.99)”
456, “Screwdriver ($3.49)”
789, “Wrench ($4.25)”
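
A toy in-memory sketch of the idea, assuming a plain dict as the storage engine. KeyValueStore is a hypothetical class, not any product's API; note that the store never looks inside the value.

```python
class KeyValueStore:
    """Toy key-value store: unique keys, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # overwrites any existing value for the key

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put(123, "Hammer ($2.99)")
store.put(456, "Screwdriver ($3.49)")
store.put("logo", b"\x89PNG")  # values can be in any format, including raw bytes
```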

13
Q

Non-relational databases: Document databases

A

Document databases are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).

Key, Document
1, {"name": "joe", "sex": "male"}
2, {"name": "jane", "sex": "female"}
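
A toy sketch of what "optimized to parse and query" adds over a plain key-value store: the system can look inside the value. The find_keys helper is hypothetical and simply scans every document; a real document database would use indexes.

```python
import json

# Keys map to JSON documents, as in the table above.
documents = {
    1: '{"name": "joe", "sex": "male"}',
    2: '{"name": "jane", "sex": "female"}',
}


def find_keys(field: str, value) -> list:
    """Return the keys of documents whose `field` equals `value`."""
    return [
        key
        for key, raw in documents.items()
        if json.loads(raw).get(field) == value
    ]


female_keys = find_keys("sex", "female")
```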

14
Q

Non-relational databases: Column family databases

A

Column family databases store tabular data comprising rows and columns.

They can be considered a two-dimensional key-value store, where a row key and a column key are used to access data.

"Employees" : {
row1 : { "ID":1, "Last":"Cooper", "First":"James", "Age":32},
row2 : { "ID":2, "Last":"Bell", "First":"Lisa", "Age":57},

}
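
The "two-dimensional key-value store" idea can be sketched with nested dicts, using the two Employees rows shown above. get_cell is a hypothetical helper; real column family databases add column families, storage layout, and distribution on top of this access pattern.

```python
# Outer key: row key. Inner key: column key.
employees = {
    "row1": {"ID": 1, "Last": "Cooper", "First": "James", "Age": 32},
    "row2": {"ID": 2, "Last": "Bell", "First": "Lisa", "Age": 57},
}


def get_cell(table: dict, row_key: str, column_key: str):
    """Look up one value by row key and column key; None if either is missing."""
    return table.get(row_key, {}).get(column_key)


last_name = get_cell(employees, "row2", "Last")
```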

15
Q

Non-relational databases: Graph databases

A

Graph databases store entities as nodes with links to define relationships between them.
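
A minimal sketch of nodes and links, assuming a plain list of (source, relationship, target) triples. The names and relationship labels are invented, and the helpers scan the list rather than using a graph engine.

```python
# Hypothetical graph: each tuple is a link between two nodes.
edges = [
    ("Joe", "works_in", "Sales"),
    ("Jane", "works_in", "Sales"),
    ("Sue", "works_in", "Finance"),
    ("Jane", "manages", "Joe"),
]


def neighbors(node: str, relationship: str) -> list:
    """Nodes reachable from `node` by following one link of the given type."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]


def colleagues(person: str) -> list:
    """Other people linked to the same department node(s) as `person`."""
    departments = neighbors(person, "works_in")
    return sorted(
        {src for src, rel, dst in edges
         if rel == "works_in" and dst in departments and src != person}
    )


joes_colleagues = colleagues("Joe")
```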

16
Q

A transactional data processing system records …

A

transactions that encapsulate specific events (each a small, discrete unit of work) that the organization wants to track.

17
Q

OLTP

A

The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).

18
Q

CRUD

A

Create
Read
Update
Delete

19
Q

OLTP systems enforce transactions that support ACID semantics:

A

Atomicity - each transaction is treated as a single unit, which either succeeds completely or fails completely.
Consistency - transactions can only take the data from one valid state to another.
Isolation - concurrent transactions cannot interfere with one another.
Durability - when a transaction has been committed, it remains committed.
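
A small sketch of atomicity and consistency using Python's built-in sqlite3 module. The account table, the CHECK constraint, and the transfer helper are invented for illustration; the transfer deliberately credits before debiting so that a failed debit has something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # valid states only
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src: int, dst: int, amount: int) -> bool:
    """Move `amount` between accounts as one unit: both updates or neither."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the credit applied above has been rolled back


def balances(conn) -> dict:
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
final = balances(conn)
```

After the failed transfer, account 2 does not keep the 500 credit: the whole transaction is undone.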

20
Q

Data warehouse

A

A relational database in which the schema is optimized for read operations, primarily queries that support reporting and data visualization.

21
Q

Data lake

A

Data lakes are common in large-scale analytical data processing scenarios, where a large volume of file-based data must be collected and analyzed.

22
Q

An OLAP model is …

A

an aggregated type of data storage that is optimized for analytical workloads.

Data is aggregated across dimensions at different levels, enabling you to drill up or down to view aggregations at multiple hierarchical levels; for example, to find total sales by region, by city, or for an individual address.

Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.
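
A toy sketch of pre-aggregation along a region > city > address hierarchy. The sales figures and build_cube are invented; the totals are computed once up front, so each later lookup is a single dictionary access.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, city, address, amount).
sales = [
    ("West", "Seattle", "1 Main St", 100),
    ("West", "Seattle", "2 Pine St", 50),
    ("West", "Portland", "9 Oak Ave", 70),
    ("East", "Boston", "5 Elm St", 30),
]


def build_cube(sales: list) -> dict:
    """Pre-aggregate total sales at every level of the hierarchy."""
    cube = defaultdict(int)
    for region, city, address, amount in sales:
        cube[(region,)] += amount                # by region
        cube[(region, city)] += amount           # by city
        cube[(region, city, address)] += amount  # by individual address
    return dict(cube)


cube = build_cube(sales)

west_total = cube[("West",)]              # drill up
seattle_total = cube[("West", "Seattle")]  # drill down one level
```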

23
Q

Row vs Columnar data

A

Row:
1,Marc,Johns,41,M
2,Mary,Johns,37,F

Columnar:
ID: 1,2
First Name: Marc,Mary
Last Name: Johns,Johns
Age: 41,37
Sex: M,F
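
The same two records in both layouts, as a minimal Python sketch. Converting between them is a transpose; the average age calculation is an added example of a query that only needs one column.

```python
column_names = ["ID", "First Name", "Last Name", "Age", "Sex"]

# Row layout: all fields of one record are stored together.
rows = [
    (1, "Marc", "Johns", 41, "M"),
    (2, "Mary", "Johns", 37, "F"),
]

# Columnar layout: all values of one field are stored together.
columns = {
    name: list(values) for name, values in zip(column_names, zip(*rows))
}

# An aggregate over one field touches a single list in the columnar layout.
average_age = sum(columns["Age"]) / len(columns["Age"])
```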