Big Data Flashcards
What is Big Data?
The term Big Data is described as data that can’t be processed or analysed using traditional processes or tools because it falls into one or more of the following categories:
- too big to fit into a single server
- too heterogeneous (diverse in character or content) - structured, semi-structured or totally unstructured
- its production can occur at very high rates
The three defining features of big data can be remembered as “the three Vs”:
Velocity, Volume and Variety
Big Data Velocity
- Data on the servers is created and modified rapidly.
- The servers must respond to frequently changing data within a matter of milliseconds.
Define Data in motion (velocity)
- Data in motion is data that is streamed (received or sent) continuously, often at a high rate (e.g. 1,000 events per second).
- In the Big Data scenario, this data is likely to arrive at a high rate and from multiple sources simultaneously, e.g. Twitter streams.
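As a rough illustration of data in motion, the sketch below simulates a continuous stream of events and handles each one as it arrives. The event fields, rate and alert threshold are invented for the example:

```python
import random
import time

def event_stream(rate_per_second=1000, duration_s=1.0):
    """Simulate data in motion: yield events continuously at a target rate."""
    interval = 1.0 / rate_per_second
    end = time.time() + duration_s
    while time.time() < end:
        yield {"ts": time.time(), "value": random.random()}
        time.sleep(interval)

# Each event is handled as it arrives, rather than stored and batch processed.
for event in event_stream(rate_per_second=100, duration_s=0.1):
    if event["value"] > 0.9:          # e.g. flag unusually large readings
        print("alert:", event)
```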
Define Data at rest (velocity) and its batch processing
- Data at rest is data that has been stored on some permanent data storage device.
- The data may be processed at any time.
- Big Data at rest is usually batch processed.
- In batch processing, a job, once started, is carried out to completion without user interaction; this is typically where machine learning comes in.
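A minimal sketch of batch processing over data at rest is shown below; the file name and `amount` column are hypothetical, the point being that the job reads stored data and runs to completion with no user interaction:

```python
import csv
import statistics
from pathlib import Path

def batch_job(path):
    """Process data at rest: read the whole stored dataset and summarise it."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            values.append(float(row["amount"]))
    return {"count": len(values), "mean": statistics.mean(values)}

# Write a small stored dataset, then run the batch job over it.
Path("transactions.csv").write_text("amount\n10.0\n25.5\n7.25\n")
print(batch_job("transactions.csv"))
```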
Big Data Volume
Volume in Big Data refers to the size of the data to be processed. Large volumes of data fall into the Big Data category if that data must be analysed as a single dataset.
What is a distributed file system?
A distributed file system is one in which the blocks of individual files are spread across more than one server.
e.g. Google’s distributed file system is GFS. Yahoo, Facebook, and Twitter use HDFS, the Hadoop Distributed File System.
Both systems use racks of servers, with network switches interconnecting the servers within a rack and connecting racks to one another.
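The toy sketch below illustrates the block-placement idea behind a distributed file system: a file is split into fixed-size blocks and each block (plus replicas) is spread across servers. The block size, server names and round-robin placement are simplifications for illustration, not how GFS or HDFS actually place blocks:

```python
BLOCK_SIZE = 4          # toy block size; real HDFS blocks are typically 128 MB
REPLICATION = 2         # each block is stored on more than one server
SERVERS = ["server-a", "server-b", "server-c"]

def place_blocks(data: bytes):
    """Split a file into blocks and spread them (with replicas) across servers."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for n, block in enumerate(blocks):
        # Round-robin placement; real systems also consider rack locality.
        replicas = [SERVERS[(n + r) % len(SERVERS)] for r in range(REPLICATION)]
        placement[f"block-{n}"] = {"bytes": block, "servers": replicas}
    return placement

for block_id, info in place_blocks(b"hello distributed file systems").items():
    print(block_id, info["servers"])
```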
Big Data Variety
- Data can appear in many forms from structured through semi-structured to unstructured.
- Big data’s unstructured nature makes it difficult to analyse the data.
- Conventional databases are not suited to storing big data because they require the data to conform to a row and column structure, and do not scale well across multiple servers.
Examples of big data?
Twitter, continuously monitored banking interactions, data from surveillance systems
How does Machine Learning benefit Big Data?
- Machine learning techniques are needed to discern patterns in data and to extract useful information.
- This can take the form of a predictive model that can then be used in the algorithm that processes streaming data to extract the value from the data in the stream.
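A minimal illustration of this workflow: a simple predictive rule is "trained" on historical data at rest and then applied to each value as it streams in. The data and threshold are invented for the example:

```python
import statistics

# "Train" a simple model on data at rest (historical values).
historical = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0]
mean = statistics.mean(historical)
std = statistics.stdev(historical)

def is_anomaly(value, threshold=3.0):
    """Predictive rule learned offline, applied to each streamed value."""
    return abs(value - mean) > threshold * std

# Apply the model to data in motion as it arrives.
stream = [10.1, 10.4, 14.9, 9.7]
for value in stream:
    print(value, "anomaly" if is_anomaly(value) else "normal")
```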
What are the principles of fact based modelling?
- Raw data is stored as atomic facts (the smallest, single pieces of information).
- Each fact is uniquely identifiable, so querying can identify duplicates.
- Each fact captures one single piece of information.
- Facts are immutable and eternally true, because each carries a timestamp.
Big Data can be stored this way.
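As a sketch of the idea, the snippet below stores information as immutable, timestamped facts that are only ever appended; the entity and attribute names are hypothetical:

```python
from dataclasses import dataclass
import time

@dataclass(frozen=True)          # frozen=True makes each fact immutable
class Fact:
    entity: str                  # who or what the fact is about
    attribute: str               # the single property being recorded
    value: str
    timestamp: float             # when the fact became true

facts = []                       # the ever-growing, append-only dataset

# New information is never updated in place; it is appended as a new fact.
facts.append(Fact("user-42", "city", "Leeds", time.time()))
facts.append(Fact("user-42", "city", "York", time.time()))
```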
What are the advantages of fact based modelling?
- Simplicity: no indexing is needed.
- New items are simply appended to the growing dataset.
- Data is true forever.
- Because facts are immutable, errors are easy to correct by reverting to the last good facts.
- Historical queries are easy to run
- Reduces risk of losing data due to human error
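The sketch below, using invented facts and timestamps, shows how an append-only fact store supports historical queries and lets errors be corrected by appending newer facts rather than editing old ones:

```python
# Append-only list of (entity, attribute, value, timestamp) facts.
facts = [
    ("user-42", "city", "Leeds", 100.0),
    ("user-42", "city", "York",  200.0),   # newer fact supersedes, never overwrites
]

def value_at(facts, entity, attribute, as_of):
    """Historical query: the latest fact about (entity, attribute) at time `as_of`."""
    matching = [f for f in facts
                if f[0] == entity and f[1] == attribute and f[3] <= as_of]
    return max(matching, key=lambda f: f[3])[2] if matching else None

print(value_at(facts, "user-42", "city", as_of=150.0))   # -> "Leeds"
print(value_at(facts, "user-42", "city", as_of=250.0))   # -> "York"
```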
Graph schema (what does each component represent)
- Graph schema uses graphs consisting of nodes and edges to graphically represent the structure of a dataset.
- Nodes in a graph represent entities and can contain the properties of the entity.
- Edges are used to represent relationships between entities and are labelled with a brief description of the relationship.
- Timestamps are rarely included in graph schema diagrams; instead, assume that each node contains the most recent information available.
- An entity's properties are listed inside rectangles joined to the entity with a dashed line.
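A rough way to picture a graph schema in code: nodes as a mapping from entity identifiers to their properties, and edges as labelled (source, relationship, target) triples. The entity names and properties here are made up for illustration:

```python
# Nodes represent entities and hold their properties.
nodes = {
    "person:alice": {"type": "Person",  "name": "Alice", "age": 30},
    "company:acme": {"type": "Company", "name": "Acme Ltd"},
}

# Edges represent relationships and carry a brief descriptive label.
edges = [
    ("person:alice", "works_for", "company:acme"),
]

# A simple traversal: who does Alice work for?
for source, label, target in edges:
    if source == "person:alice" and label == "works_for":
        print(nodes[target]["name"])
```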
Functional Programming
- Functional programming is a solution to the problem of processing data over multiple machines.
- Functional programs are stateless (meaning that they have no side effects) and make use of immutable data structures.
- Furthermore, the functional programming paradigm supports higher-order functions.
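As an illustration of these ideas (not any particular framework's API), the sketch below uses pure functions, immutable input tuples and the higher-order functions map and reduce to count words across two partitions that could live on different machines:

```python
from functools import reduce

# Immutable input: a tuple of words from each of two "machines".
partition_a = ("big", "data", "big")
partition_b = ("data", "variety")

def count_words(words):
    """Pure function: same input always gives the same output, no side effects."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge(c1, c2):
    """Pure, associative merge, so partial results can be combined in any order."""
    return {k: c1.get(k, 0) + c2.get(k, 0) for k in set(c1) | set(c2)}

# map applies the pure function to each partition (potentially on different
# machines); reduce combines the partial results. Both are higher-order functions.
partials = map(count_words, (partition_a, partition_b))
print(reduce(merge, partials))   # {'big': 2, 'data': 2, 'variety': 1}
```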