Big Data Flashcards
What is big data?
Big data is an ‘all-encompassing’ term for data that won’t fit the usual data constructs or containers. Calling it ‘big’ is misleading, as it implies that volume is the only factor that determines whether data should be classed as big data, but there are other factors.
What are the three features to consider?
Volume, variety, velocity
Why is volume a factor?
If so much data has been collected that it can no longer be stored on a single server, then it cannot be processed without first being broken down.
Why is variety a factor?
The data is so varied, arriving in so many different formats and data types, that it becomes difficult to structure.
Why is velocity a factor?
It is the speed at which data needs to be accessed. This is a problem because people aren’t satisfied with slow data connections and want data as quickly and flawlessly as possible, so technology has had to be developed to enable this.
How do we fix the issues?
When data banks become so large that they will no longer fit onto a single server, the task of processing the data must be distributed across a bank of computers. Using more than one computer to perform tasks on a data bank requires specialist programming that is complex and expensive to produce, as it needs to be made to order for the data bank being processed. However, the functional programming paradigm makes it easier to create and maintain code that distributes the workload efficiently, because it supports:
- Immutable data structures, statelessness, higher-order functions
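A minimal sketch of how a workload might be split up and recombined, assuming a made-up word-counting task. The names split_into_chunks, count_words and total_word_count are hypothetical, and everything runs on one machine here; in a real system each chunk would be sent to a different computer.

```python
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data set into fixed-size, immutable chunks (tuples)."""
    return tuple(
        tuple(records[i:i + chunk_size])
        for i in range(0, len(records), chunk_size)
    )


def count_words(chunk):
    """Stateless: the result depends only on the chunk passed in."""
    return sum(len(line.split()) for line in chunk)


def total_word_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # map is a higher-order function; each call to count_words is
    # independent, so in principle each could run on a different machine.
    partial_counts = map(count_words, chunks)
    # Combine the partial results into one answer.
    return reduce(lambda a, b: a + b, partial_counts, 0)


lines = ("big data is big", "volume variety velocity", "graph databases", "facts")
print(total_word_count(lines))  # 10
```

Because count_words never changes its input or relies on anything outside it, the answer is the same however the data is chunked.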
Why immutable data structures?
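They are data structures that are unchanging: once created, their contents cannot be altered. Any operation that appears to modify one produces a new structure instead, so there are strict rules as to how data can be used and manipulated. This means the same data can be shared across many computers without one of them changing what another is reading.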
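A small illustration of immutability, assuming a hypothetical Reading record. Python is not a purely functional language, so this uses a frozen dataclass to imitate the behaviour.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Reading:
    """A hypothetical sensor reading that cannot change after creation."""
    sensor_id: str
    value: float


original = Reading("s1", 20.5)

# "Updating" builds a new object; the original is left untouched.
updated = replace(original, value=21.0)

# Trying to change the original in place is rejected.
try:
    original.value = 99.0
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

print(original.value, updated.value, mutation_blocked)  # 20.5 21.0 True
```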
Why statelessness?
Statelessness is inherent in functional programming: the paradigm doesn’t support the concept of state. A function doesn’t remember the results or states of any events before the current instruction, so its output depends only on its inputs. Shared state can be restricting, because work that depends on it is hard to split across several computers.
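A brief contrast between a stateful and a stateless function. Both functions are made up for illustration; the stateful one is the style that functional programming avoids.

```python
running_total = 0


def add_stateful(x):
    """Depends on, and changes, a value remembered between calls."""
    global running_total
    running_total += x
    return running_total


def add_stateless(total, x):
    """Depends only on its arguments; nothing is remembered."""
    return total + x


# The same call gives different answers because of the hidden state.
first = add_stateful(3)
second = add_stateful(3)
print(first, second)  # 3 6

# The same call always gives the same answer.
print(add_stateless(0, 3), add_stateless(0, 3))  # 3 3
```

Since add_stateless gives the same result wherever and whenever it runs, calls to it can be handed to any computer in any order.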
Why higher-order functions?
Higher-order functions are functions that can accept a function as an argument, return a function as a result, or both. This allows the language to be highly adaptable to whatever the requirements may be.
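A short sketch of both halves of that definition, using made-up helper names. compose takes functions as arguments and returns a new function, and the built-in map takes a function as an argument.

```python
def compose(f, g):
    """Accepts two functions and returns a new function: x -> f(g(x))."""
    def composed(x):
        return f(g(x))
    return composed


def double(x):
    return x * 2


def increment(x):
    return x + 1


# Build a new behaviour out of existing functions.
double_then_increment = compose(increment, double)

# map applies the supplied function to every item.
results = list(map(double_then_increment, [1, 2, 3]))
print(results)  # [3, 5, 7]
```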
What are the possible solutions?
Fact-based model, Graph databases
What are graph databases?
Graph databases offer the same functionality as a standard database, but instead of there being an index for a field or an entity, the graph database uses a pointer to reference the next entity. The graphs are an abstraction of data items that are related or linked to each other, and are represented using nodes, edges and properties; these three elements are defined in the database’s schema. A schema is a blueprint of how the database is structured, how the data is stored and what constraints there are on the structures and the way in which data can be stored.
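A toy, in-memory sketch of nodes, edges and properties, assuming a hypothetical Node class and made-up people and relationships. It is not how any particular graph database product is implemented; it only shows each node holding direct references to the nodes it is linked to.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with its own properties and a list of outgoing edges."""
    name: str
    properties: dict = field(default_factory=dict)
    # Each edge is (relationship, target node, edge properties).
    edges: list = field(default_factory=list)

    def connect(self, relationship, target, **properties):
        self.edges.append((relationship, target, properties))

    def follow(self, relationship):
        """Follow the references for one relationship type."""
        return [target for rel, target, _props in self.edges if rel == relationship]


alice = Node("alice", {"age": 30})
bob = Node("bob", {"age": 25})
acme = Node("acme", {"sector": "retail"})

alice.connect("KNOWS", bob, since=2019)
alice.connect("WORKS_AT", acme, role="engineer")
bob.connect("WORKS_AT", acme, role="analyst")

# Where do Alice's friends work? Walk from node to node along the edges.
friends_employers = [
    employer.name
    for friend in alice.follow("KNOWS")
    for employer in friend.follow("WORKS_AT")
]
print(friends_employers)  # ['acme']
```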
What is a fact-based model?
Generating a fact-based system is much like modelling a system with the Unified Modelling Language (a set of approaches for representing the users, data flow and functionality of a system in a way that can convey large systems in a relatively straightforward and comprehensible manner). Similarly, fact-based models are not concerned with the data itself but with how the dataset is structured, how it is linked and how it can be used. This allows a system to handle vast quantities of data without needing to be concerned with how the data will be used, because the system will only perform operations within the constraints of the facts that have been generated.
Fact-based vs graph databases