Big Data Flashcards
Big Data
Big Data is a term that refers to data that can’t be processed or analysed using traditional methods
There are 3 features that make data Big Data:
- Volume; the amount of data to be processed is too big to fit on a single server - the dataset being analysed is typically measured in terabytes or petabytes
- Velocity; the data is generated very quickly and must be processed very quickly. If the data is at rest it can be batch processed, but if the data is in motion (streaming systems) it must be processed in real time
- Variety; the data is in many forms, including unstructured, semi-structured, structured, text and multimedia
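The difference between batch processing (data at rest) and stream processing (data in motion) can be sketched in a few lines. This is only an illustration: the function names and the sample readings are made up for the example, and a real streaming system would receive values over a network rather than from a list.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(readings: Sequence[float]) -> float:
    """Batch: the whole dataset is already stored, so process it in one go."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result each time a new value arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after every value


stored = [10.0, 20.0, 30.0]  # hypothetical sensor readings

batch_result = batch_average(stored)
stream_results = list(streaming_average(iter(stored)))
```

The batch version gives one answer once all the data is available; the streaming version gives a running answer as each value arrives and never needs the full dataset in memory.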
Structured, semi-structured and unstructured data
Structured data = data that can be represented in table form because it has a clear, identifiable structure
Semi-structured data = data such as XML or JSON files. It doesn’t fit a fixed table structure but does carry some structure (tags or keys), which can vary from record to record
Unstructured data = data whose content is so variable that it can’t be modelled in advance: it can’t be fitted into the table structure required by relational database modelling, and its elements can’t be identified with tags
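As a rough illustration of the three kinds of data, the sketch below holds one example of each. The customer records and field names are invented for the example.

```python
import json

# Structured: every record has the same fields in the same order,
# so it fits a row of a table (customer_id, name, age).
structured_row = ("C001", "Ada", 36)

# Semi-structured: JSON records carry keys, but the keys can differ
# from one record to the next.
first_record = json.loads('{"id": "C001", "name": "Ada", "tags": ["vip"]}')
second_record = json.loads('{"id": "C002", "email": "bob@example.com"}')

# Unstructured: free text with no tags or columns to identify its parts.
unstructured = "Spoke to Ada today, she might renew next month."

# Only some keys are common to both semi-structured records.
shared_keys = set(first_record) & set(second_record)
```

Here the two JSON records share only the "id" key, which is why they can't simply be dropped into one fixed table.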
Distributed Processing
In systems involved in the processing of big data, the data has to be distributed across multiple servers because there’s too much data to fit on one server
The program written to process Big Data must be able to execute on more than one machine at a time - this is called distributed code
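The idea of distributed code can be sketched on one machine by splitting the data into chunks and letting each worker process only its own chunk. This is a simplified stand-in, not a real cluster: the threads below play the role of servers, and the helper names (partition, count_words) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts chunks, one per (simulated) server."""
    return [data[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Work done by one server: count the words in its own chunk only."""
    return sum(len(line.split()) for line in lines)


lines = [
    "big data is big",
    "data in motion",
    "data at rest",
    "fact based model",
]

chunks = partition(lines, 2)

# Each worker runs the same code on a different chunk at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results are combined into the final answer.
total_words = sum(partial_counts)
```

The same count_words function runs unchanged on every chunk, which is the property a program needs in order to execute on more than one machine at a time.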
Functional programming and Big Data
Functional languages are a solution to Big Data problems as:
- Functional languages have immutable data structures - an immutable object is one whose state cannot be changed after it’s been created
- Functional programs are stateless; a function’s result doesn’t depend on how often it has been called or in what order different functions are called
- Functional languages support higher order functions; they are functions that take at least one function as a parameter or return a function as a result or both
Higher order functions such as map can be easily parallelised: many processors can each work on a different part of the data at the same time without affecting the other parts
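These three ideas can be shown in a short Python sketch. Python is not a purely functional language, so this only imitates the style: a tuple stands in for an immutable data structure, and the values and function names are made up for the example.

```python
from functools import reduce

# Immutable data: a tuple cannot be changed after it has been created.
readings = (3, 8, 5, 12)


def square(x):
    """Stateless: the result depends only on the argument."""
    return x * x


# map, filter and reduce are higher order functions:
# each one takes another function as a parameter.
squared = tuple(map(square, readings))
large = tuple(filter(lambda x: x > 20, squared))
total = reduce(lambda acc, x: acc + x, large, 0)


def make_scaler(factor):
    """A higher order function that returns a function as its result."""
    return lambda x: x * factor


double = make_scaler(2)
```

Because square never changes readings or any other shared state, each element could be squared on a different processor without the results interfering with one another.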
Fact based models
Each fact in a fact-based model captures a single piece of information
The data in a fact-based model is immutable and cannot be altered except to delete any data that has been entered incorrectly
When a change in circumstance is to be recorded it’s recorded as a new fact rather than an update - this means the dataset grows continuously with the addition of time-stamped immutable data
Each fact is:
- Atomic; stores a single piece of info
- Time-stamped
- Immutable; never updated in place, because a change is recorded as a new time-stamped fact
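A minimal sketch of a fact-based dataset is given below, assuming each fact is stored as an (entity, attribute, value, timestamp) record. The Fact class, the current_value helper and the sample data are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime


@dataclass(frozen=True)  # frozen=True makes each fact immutable
class Fact:
    entity: str          # what the fact is about
    attribute: str       # the single piece of information stored (atomic)
    value: str
    timestamp: datetime  # when the fact was recorded


# The dataset starts with one fact.
facts = (Fact("user:1", "city", "Leeds", datetime(2020, 1, 5)),)

# A change of circumstance is recorded as a NEW fact, not an update,
# so the dataset only ever grows.
facts = facts + (Fact("user:1", "city", "York", datetime(2021, 3, 9)),)


def current_value(facts, entity, attribute):
    """Return the value of the most recent matching fact."""
    matching = [
        f for f in facts if f.entity == entity and f.attribute == attribute
    ]
    return max(matching, key=lambda f: f.timestamp).value


current_city = current_value(facts, "user:1", "city")
```

Both facts stay in the dataset, so the full history is kept and the current state is worked out from the timestamps. Trying to assign to a field of an existing Fact raises FrozenInstanceError.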
Graph schema
A graph schema captures the structure of a dataset stored using the fact-based model
It shows entities in the dataset, properties of the entities and relationships between entities
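One way to picture a graph schema is as a small data structure listing the entities, their properties and the relationships between them. The sketch below uses an invented shop example (Person, Product, purchased); the layout of the dictionary is an assumption made for illustration, not a standard format.

```python
# A graph schema described as plain data (hypothetical example).
schema = {
    # Entities are the nodes of the graph.
    "entities": {"Person", "Product"},
    # Properties belong to an entity.
    "properties": {
        "Person": ["name", "date_of_birth"],
        "Product": ["title", "price"],
    },
    # Relationships are the edges: (from entity, label, to entity).
    "relationships": [("Person", "purchased", "Product")],
}


def relationships_of(schema, entity):
    """List every relationship that the given entity takes part in."""
    return [
        rel for rel in schema["relationships"] if entity in (rel[0], rel[2])
    ]


person_links = relationships_of(schema, "Person")
```

The schema describes the shape of the dataset only; the individual time-stamped facts are stored separately.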