4.11.1 Big Data Flashcards
Define big data.
A catch-all term for data that won’t fit in the usual containers.
What do the three v’s do?
Describe big data.
What are the 3 v’s?
Volume.
Velocity.
Variety.
Define volume.
Too much data for it all to fit on a conventional hard drive or server. Data has to be stored over multiple servers, each of which is composed of many hard drives.
In terms of volume, why must data be stored over multiple servers?
As relational databases don’t scale well over multiple machines.
Define velocity.
Data on the servers is created and modified rapidly. Servers must respond to frequently changing data within a matter of milliseconds.
Define variety.
Data held on the servers consists of many different data types - from binary files to multimedia files.
In terms of big data, what is the biggest problem?
Its unstructured nature makes the data difficult to analyse. Conventional databases are not suited to storing big data because they require data to conform to a row and column structure, and they do not scale well over multiple servers.
What needs to happen when storing big data over multiple servers?
The processing associated with using the data must be split amongst multiple machines.
Why is storing big data over multiple machines incredibly difficult with conventional programming paradigms?
As all machines would have to be synchronised so no data is overwritten or damaged.
Why is functional programming used with big data?
Solves the problem of programming over multiple machines.
Stateless - no side effects.
Uses immutable data structures.
Supports higher order functions.
These attributes make it easier to write correct and efficient distributed code than with procedural programming.
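The attributes listed above can be illustrated with a minimal Python sketch. It is not taken from any particular big data framework: the `chunks` data and the `chunk_total` function are made up for illustration, and each inner tuple only stands in for the data held on one server.

```python
from functools import reduce

# Hypothetical readings, split into chunks as if each chunk lived on a different server.
# Tuples are immutable, so no function can overwrite the data in place.
chunks = (
    (3, 5, 8),
    (13, 21),
    (34, 55, 89),
)


def chunk_total(chunk):
    """Pure function: the result depends only on the argument, with no side effects."""
    return sum(chunk)


# map is a higher-order function: it takes chunk_total as an argument.
# Each chunk could be processed on a separate machine, in any order.
partial_totals = tuple(map(chunk_total, chunks))

# reduce combines the partial results into a single answer.
grand_total = reduce(lambda a, b: a + b, partial_totals, 0)

print(partial_totals)  # (16, 34, 178)
print(grand_total)     # 228
```

Because `chunk_total` never changes shared state, the chunks need no synchronisation between machines while they are being processed.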
How can we represent data that doesn’t conform to the typical column and row format?
With the fact-based model.
In terms of the fact-based model (FBM), how is data stored?
As a fact.
(FBM) What are the benefits of facts?
Facts are immutable and cannot be overwritten, reducing the risk of losing data due to human error.
(FBM) What is stored with each fact?
A timestamp, indicating the date and time each piece of information was recorded.
Why are timestamps used?
Multiple different values could be held for the same attribute; the timestamps let the computer discern the most recent value.
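As a rough sketch of how timestamped facts might work in practice, the Python below stores several facts about one entity and picks the most recent value for an attribute. The `Fact` type, the sample data and the `current_value` helper are all hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One immutable fact. NamedTuple fields cannot be reassigned."""
    entity: str
    attribute: str
    value: str
    timestamp: str  # ISO 8601 text, so alphabetical order matches time order


# Hypothetical facts: new information is appended, old facts are never overwritten.
facts = (
    Fact("user42", "city", "Leeds", "2021-03-01T09:00:00"),
    Fact("user42", "city", "York", "2023-07-15T14:30:00"),
    Fact("user42", "email", "a@example.com", "2020-01-10T08:00:00"),
)


def current_value(facts, entity, attribute):
    """Return the value from the most recently recorded matching fact, or None."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value


print(current_value(facts, "user42", "city"))  # York
```

Both city facts are kept, so the older value is still available if the newer one turns out to have been entered by mistake.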
Define a graph schema (Big Data and Graphs (BDG)).
Uses nodes and edges to graphically represent the structure of a dataset.
Define an edge.
A relationship between two entities, labelled with a brief description of that relationship.
Where are the properties? (BDG)
Listed within the entities.
How often are timestamps used, and why?
Rarely, as it is assumed that most nodes contain the most recent information available.
What are the alternative representations of properties? (BDG)
Inside rectangles joined to their entity by a dashed line. The dashed line does not represent a relationship; it only shows which properties belong to that entity.
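A graph schema is normally drawn as a diagram, but the same structure can be sketched in Python. This is only one possible encoding: the entity names, properties, edge labels and the `related` helper are invented for illustration.

```python
# Nodes: each entity lists its own properties inside it.
nodes = {
    "Alice": {"type": "Person", "age": 17},
    "Bob": {"type": "Person", "age": 18},
    "Maths": {"type": "Course", "level": "A-level"},
}

# Edges: (source entity, brief description of the relationship, target entity).
edges = (
    ("Alice", "studies", "Maths"),
    ("Bob", "studies", "Maths"),
    ("Alice", "friend of", "Bob"),
)


def related(edges, source, label):
    """Return the entities reached from source by edges with the given label."""
    return [dst for src, lbl, dst in edges if src == source and lbl == label]


print(related(edges, "Alice", "studies"))    # ['Maths']
print(related(edges, "Alice", "friend of"))  # ['Bob']
```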
(Functional programming and Big Data (FPBD)) When do we use functional programming in big data?
When working with data which needs to be distributed over multiple servers (volume).
(FPBD) Does functional programming have side effects?
No, it will not change any values or affect the program elsewhere.
(FPBD) What is statelessness?
When a function’s result depends only on its arguments: it does not rely on variables from other functions or on the order in which functions are called.
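The difference can be seen in a small Python sketch. Both functions are hypothetical examples: `stateful_next` depends on a shared variable, while `stateless_next` depends only on its argument.

```python
counter = 0  # shared state that lives outside the functions


def stateful_next():
    """Not stateless: the result depends on how many times it was called before."""
    global counter
    counter += 1  # side effect: changes a variable outside the function
    return counter


def stateless_next(value):
    """Stateless: the same argument always gives the same result."""
    return value + 1


print(stateless_next(5))  # 6
print(stateless_next(5))  # 6
```

Calling `stateful_next()` twice gives two different results, so machines sharing `counter` would need to be synchronised. `stateless_next` can safely run on any machine in any order.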