Core Data Concepts Flashcards

Question 1

Q

Structured Data

Answer

A

Data that adheres to a fixed schema (all data has the same fields/properties).

Most commonly, the schema is tabular.

Question 2

Q

Semi-structured Data

Answer

A

Data has a structure but not all data points share the same fields/properties.

A common format is JavaScript Object Notation (JSON).

Question 3

Q

Optimized file formats: Avro

Answer

A

Avro is a row-based format.

An Avro file carries a JSON header (the schema) that describes the structure of the records it contains.

The data is stored as binary information.

Header example:

{
  "type": "record",
  "name": "thecodebuzz_schema",
  "namespace": "thecodebuzz.avro",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "Name of the user account on Thecodebuzz.com"
    },
    {
      "name": "email",
      "type": "string",
      "doc": "The email of the user logging message on the blog"
    },
    {
      "name": "timestamp",
      "type": "long",
      "doc": "time in seconds"
    }
  ],
  "doc": "A basic schema for storing thecodebuzz blogs messages"
}

Question 4

Q

Unstructured data

Answer

A

Data with no specific structure.

Documents, images, audio and video data, and binary files might not have a specific structure.

Question 5

Q

common file formats for data: Delimited text files

Answer

A

Data is stored in plain text format with specific field delimiters and row terminators.

Comma-separated values (CSV) and tab-separated values (TSV) are two common formats.
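
As a minimal sketch of the idea, the snippet below reads the same records from a CSV string and a TSV string using Python's standard csv module. The column names and values are made-up sample data, and read_delimited is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample data: the same two records, once comma-separated
# and once tab-separated. "\n" is the row terminator in both.
csv_text = "id,name,price\n1,Hammer,2.99\n2,Wrench,4.25\n"
tsv_text = "id\tname\tprice\n1\tHammer\t2.99\n2\tWrench\t4.25\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

# Only the field delimiter differs; the parsed records are identical.
print(csv_rows == tsv_rows)
```

Note that every parsed value is a string: delimited text carries no type information, so numeric conversion is left to the reader.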

Question 6

Q

common file formats for data: JavaScript Object Notation (JSON)

Answer

A

A hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
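
A small sketch of this flexibility, using Python's standard json module. The customer documents are hypothetical: the first has a nested "contact" object with a collection inside it, the second omits that attribute entirely.

```python
import json

# Hypothetical documents: the attributes differ from one object to the next.
text = """
[
  {"name": "Joe", "contact": {"email": "joe@example.com", "phones": ["555-0100"]}},
  {"name": "Jane"}
]
"""

customers = json.loads(text)

# Attributes can be objects or collections, and they may be missing,
# so lookups fall back to a default instead of assuming a fixed schema.
emails = [c.get("contact", {}).get("email") for c in customers]
print(emails)
```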

Question 7

Q

common file formats for data: Extensible Markup Language (XML)

Answer

A

XML uses tags enclosed in angle brackets (<../>) to define elements and attributes.
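
For illustration, the sketch below parses a tiny XML fragment with Python's standard xml.etree.ElementTree module. The customer and order element names are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a <customer> element with a "name" attribute
# and two nested <order> elements, each with an "id" attribute.
xml_text = """
<customer name="Joe">
  <order id="1001" />
  <order id="1002" />
</customer>
""".strip()

root = ET.fromstring(xml_text)

# Elements are reached by tag name; attributes are read with .get().
order_ids = [order.get("id") for order in root.findall("order")]
print(root.tag, root.get("name"), order_ids)
```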

Question 8

Q

Optimized file formats: ORC

Answer

A

ORC (Optimized Row Columnar format) organizes data into columns rather than rows.

An ORC file contains:

groups of row data called stripes
a file footer
a postscript

A stripe is 250 MB by default.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains the column-level aggregates count, min, max, and sum.

The postscript holds compression parameters and the size of the compressed footer.
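
To make the role of the footer concrete, here is a rough sketch in plain Python. It does not use the real ORC format or any ORC library; the stripes, the single integer column, and the build_footer helper are all hypothetical stand-ins.

```python
# Hypothetical data: one integer column split across two stripes.
ages_by_stripe = [[41, 20, 33], [23, 53, 19]]


def build_footer(stripes):
    """Summarize stripes the way the card describes the ORC file footer."""
    all_values = [value for stripe in stripes for value in stripe]
    return {
        "rows_per_stripe": [len(stripe) for stripe in stripes],
        "column_stats": {
            "count": len(all_values),
            "min": min(all_values),
            "max": max(all_values),
            "sum": sum(all_values),
        },
    }


footer = build_footer(ages_by_stripe)

# A query such as "what is the maximum age?" can be answered from the
# footer alone, without reading any stripe data.
print(footer["column_stats"]["max"])
```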

Question 9

Q

Optimized file formats: Parquet

Answer

A

Parquet is another columnar data format. It was created by Cloudera and Twitter.

A Parquet file contains row groups.
Data for each column is stored together in the same row group.
Each row group contains one or more chunks of data.
A Parquet file includes metadata that describes the set of rows found in each chunk.

RowGroup:

ColumnChunk: Pete, Dave, Sue
ColumnChunk: M, M, F
ColumnChunk: 41, 20, 33

RowGroup:

ColumnChunk: Kate, May, Ari
ColumnChunk: F, F, M
ColumnChunk: 23, 53, 19

Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
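
The row group layout shown above can be imitated in a few lines of plain Python. This is only a sketch of the layout idea, not the Parquet format itself: real files also carry encodings, compression, and metadata, and row groups are sized by bytes rather than by a fixed row count. The to_row_groups helper is a hypothetical name.

```python
# The six records from the card, as (name, sex, age) rows.
rows = [
    ("Pete", "M", 41), ("Dave", "M", 20), ("Sue", "F", 33),
    ("Kate", "F", 23), ("May", "F", 53), ("Ari", "M", 19),
]


def to_row_groups(rows, group_size):
    """Split rows into groups, storing each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        # zip(*group) transposes the rows: one chunk per column.
        groups.append([list(chunk) for chunk in zip(*group)])
    return groups


row_groups = to_row_groups(rows, 3)
for group in row_groups:
    print("RowGroup:")
    for chunk in group:
        print("  ColumnChunk:", chunk)
```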

Question 10

Q

Relational databases

Answer

A

Data is stored in tables that represent entities, such as customers, products, or sales orders.

Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables.

The tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems.
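
A minimal sketch of these ideas using Python's built-in sqlite3 module with an in-memory database. The customer and sales_order tables, their columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each table represents an entity; id is its primary key.
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- customer_id references the primary key of another table.
    CREATE TABLE sales_order (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total       REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Joe'), (2, 'Jane');
    INSERT INTO sales_order VALUES (10, 1, 2.99), (11, 1, 4.25), (12, 2, 3.49);
""")

# Count the orders per customer by joining on the key relationship.
order_counts = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customer AS c
    JOIN sales_order AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()
print(order_counts)
```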

Question 11

Q

Non-relational databases

Answer

A

Non-relational databases are data management systems that don’t apply a relational schema to the data.

Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.

Question 12

Q

Non-relational databases: Key-value databases

Answer

A

Key-value databases are databases in which each record consists of a unique key and an associated value, which can be in any format.

Key, Value
123, “Hammer ($2.99)”
456, “Screwdriver ($3.49)”
789, “Wrench ($4.25)”

Question 13

Q

Non-relational databases: Document databases

Answer

A

Document databases are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).

key,document
1, {"name": "joe", "sex": "male"}
2, {"name": "jane", "sex": "female"}
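
The difference from a plain key-value store is that the system can look inside the value. A rough sketch in plain Python, with a dict standing in for the database and a hypothetical find_keys helper standing in for a query engine:

```python
import json

# The two records from the card: key -> JSON document stored as text.
documents = {
    1: '{"name": "joe", "sex": "male"}',
    2: '{"name": "jane", "sex": "female"}',
}


def find_keys(store, field, value):
    """Return the keys of documents whose field equals value."""
    return [
        key
        for key, raw in store.items()
        if json.loads(raw).get(field) == value
    ]


# Query by a field inside the document rather than by key.
matches = find_keys(documents, "sex", "female")
print(matches)
```

A real document database would index fields rather than parse every document per query; the linear scan here is only for illustration.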

Question 14

Q

Non-relational databases: Column family databases

Answer

A

Column family databases store tabular data comprising rows and columns.

They can be considered a two-dimensional key-value store, where a row key and a column key are used to access data.

"Employees" : {
row1 : { "ID":1, "Last":"Cooper", "First":"James", "Age":32},
row2 : { "ID":2, "Last":"Bell", "First":"Lisa", "Age":57},
…
}
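
The two-dimensional lookup can be sketched with nested Python dicts. This only models the access pattern, not how a column family database stores data on disk, and get_cell is a hypothetical helper name.

```python
# The two rows shown on the card, keyed by row key and then column key.
employees = {
    "row1": {"ID": 1, "Last": "Cooper", "First": "James", "Age": 32},
    "row2": {"ID": 2, "Last": "Bell", "First": "Lisa", "Age": 57},
}


def get_cell(table, row_key, column_key, default=None):
    """Look up one value by row key and column key."""
    return table.get(row_key, {}).get(column_key, default)


print(get_cell(employees, "row2", "Last"))
```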

Question 15

Q

Non-relational databases: Graph databases

Answer

A

Graph databases store entities as nodes with links to define relationships between them.
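
As a small sketch, nodes and links can be modeled as (source, relationship, target) triples in plain Python. The people, the department, and the relationship names are hypothetical sample data.

```python
# Hypothetical graph: each triple is a link between two nodes.
edges = [
    ("Alice", "MANAGES", "Bob"),
    ("Bob", "WORKS_IN", "Sales"),
    ("Alice", "WORKS_IN", "Sales"),
]


def targets(edges, node, relationship):
    """Follow links of one type outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]


def sources(edges, node, relationship):
    """Follow links of one type backward into a node."""
    return [src for src, rel, dst in edges if dst == node and rel == relationship]


print(targets(edges, "Alice", "MANAGES"))
print(sources(edges, "Sales", "WORKS_IN"))
```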

Question 16

Q

A transactional data processing system records …

Answer

A

transactions that encapsulate specific events (i.e., small, discrete units of work) that the organization wants to track.

Question 17

Q

OLTP

Answer

A

The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).

Question 18

Q

CRUD

Answer

A

Create
Read
Update
Delete

Question 19

Q

OLTP systems enforce transactions that support ACID semantics:

Answer

A

Atomicity - each transaction is treated as a single unit, which succeeds completely or fails completely.
Consistency - transactions can only take the data from one valid state to another.
Isolation - concurrent transactions cannot interfere with one another.
Durability - when a transaction has been committed, it will remain committed.
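
Atomicity and consistency can be observed with Python's built-in sqlite3 module. In this sketch the account table, the balances, and the transfer function are hypothetical; the CHECK constraint stands in for a business rule that defines a valid state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount between accounts as a single unit of work."""
    try:
        # "with conn" commits if the block succeeds and rolls back
        # every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # valid: both updates are committed
failed = transfer(conn, 1, 2, 500)  # debit breaks the CHECK: credit is undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits account 2 before the debit fails, yet the final balances show no trace of that credit, which is the all-or-nothing behavior atomicity describes.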

Question 20

Q

Data warehouse

Answer

A

A relational database in which the schema is optimized for read operations, primarily queries to support reporting and data visualization.

Question 21

Q

Data lake

Answer

A

Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.

Question 22

Q

An OLAP model is ….

Answer

A

an aggregated type of data storage that is optimized for analytical workloads.

Data is aggregated across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for example, to find total sales by region, by city, or for an individual address.

Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.
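
A rough sketch of pre-aggregation in plain Python, following the region, city, address hierarchy from the example above. The sales figures and the pre_aggregate helper are hypothetical, and a real OLAP engine would be far more elaborate.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, city, address, amount).
sales = [
    ("West", "Seattle", "1 Main St", 100),
    ("West", "Seattle", "9 Pine St", 50),
    ("West", "Portland", "3 Oak Ave", 70),
    ("East", "Boston", "5 Elm St", 30),
]


def pre_aggregate(sales):
    """Compute totals once for every level of the hierarchy."""
    totals = defaultdict(int)
    for region, city, address, amount in sales:
        totals[(region,)] += amount
        totals[(region, city)] += amount
        totals[(region, city, address)] += amount
    return dict(totals)


cube = pre_aggregate(sales)

# Drilling down is now a lookup rather than a scan of the raw facts.
print(cube[("West",)])
print(cube[("West", "Seattle")])
print(cube[("West", "Seattle", "1 Main St")])
```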

Question 23

Q

Row vs Columnar data

Answer

A

Row:
1,Marc,Johns,41,M
2,Mary,Johns,37,F

Columnar:
ID: 1,2
First Name: Marc,Mary
Last Name: Johns,Johns
Age: 41,37
Sex: M,F
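
The conversion between the two layouts is a transposition. A minimal Python sketch using the same two records (the variable names are illustrative):

```python
# Row layout: one tuple per record.
rows = [
    (1, "Marc", "Johns", 41, "M"),
    (2, "Mary", "Johns", 37, "F"),
]
names = ["ID", "First Name", "Last Name", "Age", "Sex"]

# Columnar layout: one list per field, built by transposing the rows.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

# An aggregate over one field touches only that field's list.
average_age = sum(columns["Age"]) / len(columns["Age"])
print(columns["Age"], average_age)
```

This is why columnar formats suit analytical queries: computing the average age reads one column instead of every full record.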

(23 cards)