Big Data and Functional Programming Flashcards
What is Big Data?
Big Data is data collected on such a large scale that it cannot easily be stored or analysed using traditional methods.
Healthcare – patients, medical records, clinical data.
Google – 24 petabytes of data per day (24 x 10^15 bytes).
Amazon – vast amounts of sales data.
Describe volume as a characteristic of Big Data.
Big Data is too big to be handled by one server.
In the UK, there were 3.5 billion card purchases in the first quarter of 2016.
Google processes more than three billion search queries every day and saves every single one.
Tesco collects 70 million refrigerator-related data points from its units and analyses them to monitor performance, to gauge when machines might need servicing, and to cut down on energy use.
Describe velocity as a characteristic of Big Data.
Data arrives in continuous streams collected in real time, and may require a response within milliseconds.
Smartphones, sensor networks, and CCTV all create large volumes of data, continuously.
Describe variety as a characteristic of Big Data.
Big data comes in many forms: structured, unstructured, text, audio, and image.
60 million new photos are uploaded to Instagram every day.
48 hours of video are uploaded to YouTube every minute.
There are 3 billion views of YouTube videos every day.
Spotify has 100 million users.
Why are relational databases not suited to Big Data?
Volume – they don’t scale well to large datasets.
Variety – they don’t handle unstructured data well (links between data items are too complex for traditional database relationships to represent).
Velocity – relational databases are designed for steady data retention, rather than rapid growth.
Why is functional programming suitable for Big Data?
Processing Big Data often needs to be distributed across multiple servers.
Useful properties of functional programming for Big Data:
- Immutable data structures, which cannot be accidentally altered in a function.
- Statelessness, so the program’s behaviour does not depend on the order in which functions are called.
- Higher-order functions, such as map and fold, which take other functions as arguments.
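A minimal Python sketch of these three properties. The data and the names `chunks`, `chunk_total` and `combine` are hypothetical, chosen only for illustration; each chunk stands in for data held on a different server.

```python
from functools import reduce

# Hypothetical data, split into chunks as if each chunk were held on a different server.
# Tuples are immutable, so a function cannot alter them by accident.
chunks = ((3, 5, 8), (13, 21), (34,))

def chunk_total(chunk):
    """Stateless: the result depends only on the argument, and nothing is modified."""
    return sum(chunk)

def combine(a, b):
    """Combine two partial results into one."""
    return a + b

# map and reduce are higher-order functions: each takes another function as an argument.
partials = tuple(map(chunk_total, chunks))
total = reduce(combine, partials, 0)

# Processing the chunks in a different order gives the same total,
# because chunk_total keeps no state between calls.
total_reversed = reduce(combine, map(chunk_total, reversed(chunks)), 0)
```

Here `total` and `total_reversed` are equal, which is what allows the work to be handed to servers in any order.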
What does the higher-order function ‘map’ do?
Applies a function to each element of a list and returns a new list.
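For example, in Python (a small sketch; the function `double` and the list `numbers` are made up for illustration):

```python
def double(n):
    """Return twice the input."""
    return n * 2

numbers = [1, 2, 3, 4]

# map applies double to every element. In Python 3 it returns a lazy iterator,
# so list() is used here to build the new list.
doubled = list(map(double, numbers))
```

After this runs, `doubled` is `[2, 4, 6, 8]` and `numbers` is unchanged.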
What does the higher-order function ‘fold’/‘reduce’ do?
Reduces a list to a single value by recursively applying a combining function to a running result and each element of the list in turn.
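For example, in Python, fold is available as `functools.reduce`. The hand-written `fold` below is a hypothetical recursive version, included only to show the idea:

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Library version: starts from 0 and adds each element to the running total.
total = reduce(lambda acc, x: acc + x, numbers, 0)

def fold(function, accumulator, items):
    """Recursive left fold: combine the accumulator with the head, then fold the tail."""
    if not items:
        return accumulator
    return fold(function, function(accumulator, items[0]), items[1:])

product = fold(lambda acc, x: acc * x, 1, numbers)
```

Here `total` is 10 and `product` is 24.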
Why are traditional programming paradigms unsuitable for Big Data?
When data is stored over multiple servers, as is the case with big data, the processing associated with using the data must also be split across multiple machines. This would be incredibly difficult with conventional programming paradigms as the machines would all have to be synchronised to ensure that no data is overwritten or otherwise damaged.
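By contrast, a function with no shared state can run on several workers without any synchronisation. The sketch below uses Python threads on one machine as a stand-in for separate servers, which is a simplifying assumption; the text chunks and the name `count_words` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text_chunk):
    """Reads only its argument and writes to no shared variable, so no locks are needed."""
    return len(text_chunk.split())

# Hypothetical chunks of text, one per worker.
chunks = ["big data is big", "functional programs scale", "map then reduce"]

# Each worker processes its own chunk independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results are combined only once every worker has finished.
total_words = sum(partial_counts)
```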
Why is the fact-based model suitable for representing Big Data?
Because big data doesn’t conform to the row and column format typically used to represent data, it must be represented differently. One way of representing big data is with the fact-based model.
In the fact-based model, each piece of information is stored as a fact. Facts are immutable and can’t be overwritten.
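One way to picture an immutable fact in Python is a frozen dataclass. This is only a sketch: the field names and the patient example are assumptions, not part of any particular fact-based system.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Fact:
    """A single piece of information; frozen=True makes instances read-only."""
    entity: str
    attribute: str
    value: str

fact = Fact("patient-17", "ward", "A3")

# Trying to overwrite the fact raises an error instead of changing it.
try:
    fact.value = "B1"
    overwritten = True
except FrozenInstanceError:
    overwritten = False
```

After this runs, `overwritten` is `False` and `fact.value` is still `"A3"`.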
How are timestamps used in fact-based models?
Stored with each fact is a timestamp, indicating the date and time at which the piece of information was recorded. Because facts are never deleted or overwritten, several different values may be held for the same attribute. Timestamps allow a computer to determine which value is the most recent.
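A short Python sketch of picking the most recent value by timestamp. The record layout, the sample facts and the helper `latest_value` are all hypothetical:

```python
from datetime import datetime
from typing import NamedTuple

class TimestampedFact(NamedTuple):
    """An immutable fact together with the time it was recorded."""
    entity: str
    attribute: str
    value: str
    recorded_at: datetime

# Two values are held for the same attribute (ward); neither is deleted.
facts = (
    TimestampedFact("patient-17", "ward", "A3", datetime(2016, 3, 1, 9, 0)),
    TimestampedFact("patient-17", "ward", "B1", datetime(2016, 3, 4, 14, 30)),
    TimestampedFact("patient-17", "blood_type", "O+", datetime(2016, 3, 1, 9, 5)),
)

def latest_value(facts, entity, attribute):
    """Return the value of the matching fact with the newest timestamp, or None."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.recorded_at).value

current_ward = latest_value(facts, "patient-17", "ward")
```

Here `current_ward` is `"B1"`, while the older value `"A3"` remains in the dataset.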
What are the benefits of facts in fact-based models being immutable?
Because facts are immutable (and so cannot be overwritten), using the fact-based model for storing big data reduces the risk of accidentally losing data due to human error.
Moreover, the model does away with an index for the data and instead simply appends new data to the dataset as it is created.
How is graph schema used to represent Big Data?
Graph schema uses graphs consisting of nodes and edges to graphically represent the structure of a dataset. Nodes in a graph represent entities and can contain the properties of the entity.
Edges are used to represent relationships between entities and are labelled with a brief description of the relationship.
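A graph schema can also be written down as plain data. In the Python sketch below, the entities, properties and relationship labels are invented for illustration:

```python
# Nodes: each entity and the properties it contains.
nodes = {
    "customer": ["name", "email"],
    "order": ["order_date", "total"],
    "product": ["title", "price"],
}

# Edges: (source entity, relationship label, target entity).
edges = [
    ("customer", "places", "order"),
    ("order", "contains", "product"),
]

def relationships_from(entity):
    """List the labelled relationships that start at the given entity."""
    return [(label, target) for source, label, target in edges if source == entity]

customer_links = relationships_from("customer")
```

Here `customer_links` is `[("places", "order")]`.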
Are timestamps included in graph schema?
Timestamps are rarely included in graph schema diagrams; instead, you should assume that each node contains the most recent information available.
What are functional languages in programming?
Functional languages are declarative, which means they are concerned with what needs to be performed rather than how it should be performed, as is the case with procedural (imperative) languages.
Functional languages rely on recursion (a function calling itself) rather than iteration.
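The difference can be sketched in Python, which supports both styles. The function names are hypothetical, and Python is used only for illustration (it is not a purely functional language and limits recursion depth):

```python
def total_iterative(items):
    """Iteration: a loop updates a variable step by step."""
    total = 0
    for item in items:
        total += item
    return total

def total_recursive(items):
    """Recursion: the function calls itself on a smaller list until the list is empty."""
    if not items:
        return 0
    return items[0] + total_recursive(items[1:])
```

Both functions return 10 for the list `[1, 2, 3, 4]`; the recursive version changes no variables along the way.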
Functional programs are often shorter than equivalent code written in procedural languages. Fewer lines of source code mean fewer opportunities to introduce errors, so the code is likely to contain fewer of them.