Big Data Flashcards

Question 1

Q

Big Data

Answer

A

Big Data is a term that refers to data that can’t be processed or analysed using traditional methods

There are 3 features that make data Big Data:

Volume; the amount of data to be processed is too big to fit on a single server - the data has to be analysed as a single set, typically terabytes or petabytes in size, to be classed as Big Data
Velocity; the data is generated very quickly and must be processed very quickly. Data at rest can be batch processed, but data in motion (streaming systems) must be processed in real time
Variety; the data takes many forms, including structured, semi-structured and unstructured data such as text and multimedia
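
Example (a minimal sketch, not a real streaming system): the difference between batch processing data at rest and processing data in motion can be shown with a simple average. The function names and the readings are made up for illustration.

```python
from typing import Iterable, Iterator


def batch_mean(readings: list[float]) -> float:
    """Batch processing: the whole dataset is at rest, so it is processed in one go."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Stream processing: the result is updated as each new value arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after every reading


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings

batch_result = batch_mean(readings)
stream_results = list(streaming_mean(iter(readings)))

print(batch_result)
print(stream_results)
```

The batch version needs all the data before it can start; the streaming version never holds more than a running total, which is why it suits data that keeps arriving.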

Question 2

Q

Structured, semi-structured and unstructured data

Answer

A

Structured data = data that can be represented in table form because it has a clear, identifiable structure

Semi-structured data = data such as XML or JSON formatted files. It has no fixed, formal structure, but tags or keys identify its elements and the structure can vary from record to record

Unstructured data = data that is so variable it can’t be modelled in advance: it can’t be fitted into the table structure required by relational database modelling, or its elements are not identifiable with tags
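
Example (an illustrative sketch; all of the sample data is invented): the same kind of information held as structured, semi-structured and unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, so every record fits straight into a table.
structured_csv = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: keys label each element, but records need not share the same fields.
semi_structured = """
[
  {"id": 1, "name": "Ada", "languages": ["Python"]},
  {"id": 2, "name": "Alan", "email": "alan@example.com"}
]
"""
records = json.loads(semi_structured)
field_sets = [sorted(record.keys()) for record in records]

# Unstructured: free text with no columns or tags to identify its elements.
unstructured = "Met Ada on Tuesday; she mentioned the new server is much faster."

print(rows[0]["name"])  # looked up by column name
print(field_sets)       # the two records have different fields
```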

Question 3

Q

Distributed Processing

Answer

A

In systems involved in the processing of big data, the data has to be distributed across multiple servers because there’s too much data to fit on one server

The program written to process Big Data must be able to execute on more than one machine at a time - this is called distributed code
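
Example (a rough sketch only): here the "servers" are simulated with worker threads on one machine, and the data is split into three made-up partitions. A real Big Data system would run the same function on separate machines, but the idea is the same: each worker processes only its own partition, then the partial results are combined.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition: list[str]) -> Counter:
    """Work done independently on one partition of the data."""
    return Counter(word for line in partition for word in line.split())


def distributed_word_count(partitions: list[list[str]]) -> Counter:
    # Each worker thread stands in for one server holding one partition.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(count_words, partitions))

    # Combine the partial results into a single answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical dataset, already split across three "servers".
partitions = [
    ["big data big"],
    ["data is big"],
    ["data data"],
]

totals = distributed_word_count(partitions)
print(totals)
```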

Question 4

Q

Functional programming and Big Data

Answer

A

Functional languages are a solution to Big Data problems as:

Functional languages have immutable data structures - an immutable object is one whose state cannot be changed after it’s been created
Functional programs are stateless; the program’s behaviour doesn’t depend on how often a function is called or in what order different functions are called
Functional languages support higher order functions; these are functions that take at least one function as a parameter or return a function as a result or both

Higher order functions can be easily parallelised so that many processors can work on different parts of the data at the same time without affecting each other
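
Example (a small sketch in Python, which is not a purely functional language but supports the same ideas): map, filter and reduce are higher order functions, and a tuple is an immutable data structure. The values and the helper names are invented for illustration.

```python
from functools import reduce

# A tuple is immutable: it cannot be changed after it has been created.
readings = (3, 8, 15, 4, 23)


def square(x: int) -> int:
    return x * x


def is_even(x: int) -> bool:
    return x % 2 == 0


# map and filter take a function as a parameter and build new data,
# leaving the original tuple untouched.
squares = tuple(map(square, readings))
even_squares = tuple(filter(is_even, squares))

# reduce combines the values into a single result.
total = reduce(lambda acc, x: acc + x, even_squares, 0)


def make_scaler(factor: int):
    """A higher order function that returns a function as its result."""
    return lambda x: x * factor


double = make_scaler(2)

print(squares, even_squares, total, double(21))
```

Because square gives the same answer for the same input and changes nothing else, each element could be handed to a different processor without the results interfering.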

Question 5

Q

Fact based models

Answer

A

Each fact in a fact-based model captures a single piece of information

The data in a fact-based model is immutable and cannot be altered, except to delete any data that has been entered incorrectly

When a change in circumstance is to be recorded it’s recorded as a new fact rather than an update - this means the dataset grows continuously with the addition of time-stamped immutable data

Each fact is:
- Atomic; stores a single piece of info
- Time-stamped
- Kept immutable with timestamps
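
Example (a minimal sketch of the idea, not a real fact store): the Fact class, the record and current_value helpers and the sample data are all hypothetical. A change of circumstance is appended as a new time-stamped fact, and the latest fact gives the current value.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes each fact immutable
class Fact:
    entity: str
    attribute: str
    value: str           # atomic: one piece of information per fact
    timestamp: datetime  # when the fact was recorded


facts: list[Fact] = []


def record(entity: str, attribute: str, value: str, timestamp: datetime) -> None:
    """Record a change as a new fact; existing facts are never updated."""
    facts.append(Fact(entity, attribute, value, timestamp))


def current_value(entity: str, attribute: str) -> Optional[str]:
    """The current value is the one carried by the most recent matching fact."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value


record("user:1", "city", "Leeds", datetime(2020, 1, 5))
record("user:1", "city", "York", datetime(2023, 6, 1))  # a move is a new fact

print(current_value("user:1", "city"))
print(len(facts))  # both facts are kept, so the dataset only grows
```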

Question 6

Q

Graph schema

Answer

A

A graph schema captures the structure of a dataset stored using the fact-based model

It shows entities in the dataset, properties of the entities and relationships between entities
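
Example (one possible way to write a graph schema down in code; the Person and Company entities and their properties are invented): entities are the nodes, relationships are the edges, and each entity lists its properties.

```python
# Hypothetical graph schema for a small dataset.
schema = {
    "entities": {
        "Person": ["name", "date_of_birth"],  # entity -> its properties
        "Company": ["name"],
    },
    "relationships": [
        # (from entity, relationship, to entity)
        ("Person", "works_for", "Company"),
        ("Person", "friend_of", "Person"),
    ],
}


def relationships_of(entity: str) -> list[tuple[str, str, str]]:
    """Return every relationship (edge) that involves the given entity."""
    return [rel for rel in schema["relationships"] if entity in (rel[0], rel[2])]


def is_consistent() -> bool:
    """Check that every relationship connects entities defined in the schema."""
    return all(
        source in schema["entities"] and target in schema["entities"]
        for source, _, target in schema["relationships"]
    )


print(relationships_of("Company"))
print(is_consistent())
```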

(6 cards)