4.11 Big Data Flashcards
What is big data?
Big data is the term for data that does not fit the usual containers.
It encompasses data that is too large or complex to be handled by conventional data-processing software.
What are the three defining features of big data?
Volume, velocity, variety
What does ‘volume’ refer to in big data?
There is too much data to fit on a single conventional hard drive or server.
This requires data to be stored over multiple servers, each composed of many hard drives.
What is meant by ‘velocity’ in the context of big data?
Data on the servers are created and modified rapidly.
Servers must respond to frequently changing data in a matter of milliseconds.
What does ‘variety’ mean when discussing big data?
Data held on the servers consist of many different types.
Examples include binary files, photos and videos.
Why is big data difficult to analyze?
The lack of structure makes it difficult to analyze the data.
Why don’t conventional databases scale well for big data?
Conventional (relational) databases require data to fit into a row-and-column format, which unstructured data does not naturally do.
What techniques must be used to extract useful information from big data?
Machine learning techniques.
These techniques help to discern patterns in the data.
Give examples of big data sources.
Data from networked sensors, smartphones, video surveillance and mouse clicks.
These are continuously streamed data sources.
What is a challenge when processing data stored across multiple servers?
Data processing must be split across multiple machines.
This is difficult with conventional programming paradigms as machines must be synchronized.
How does functional programming help with big data processing?
It makes it easier to write correct and efficient distributed code.
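As a rough illustration of why this helps, the sketch below splits a word count across chunks of text in a map-then-reduce style. The function names and sample chunks are hypothetical, and everything runs on one machine; the point is only that each chunk is processed by a pure function, so the chunks could in principle be handled by different machines in any order.

```python
from functools import reduce

def count_words(chunk: str) -> int:
    """Pure function: the result depends only on the chunk passed in."""
    return len(chunk.split())

def total_words(chunks: list[str]) -> int:
    # map: each chunk is processed independently of the others
    partial_counts = map(count_words, chunks)
    # reduce: combine the partial results into a single answer
    return reduce(lambda a, b: a + b, partial_counts, 0)

# Hypothetical sample data standing in for data held on three servers
chunks = ["big data is big", "data on many servers", "needs splitting"]
print(total_words(chunks))  # 10
```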
What does it mean for functional programs to be stateless?
They have no side effects.
This characteristic contributes to their reliability in distributed computing.
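A minimal sketch of the difference, using hypothetical function names: the first function has a side effect because it changes a variable outside itself, while the second depends only on its arguments.

```python
running_total = 0  # shared state, used only by the impure version

def add_to_total(x: int) -> int:
    """Impure: reads and changes a variable outside the function."""
    global running_total
    running_total += x
    return running_total

def add(total: int, x: int) -> int:
    """Pure: no side effects; the same inputs always give the same output."""
    return total + x

print(add(2, 3), add(2, 3))              # 5 5
print(add_to_total(3), add_to_total(3))  # 3 6
```

Calling the impure version twice with the same argument gives different answers, which is the kind of behaviour that becomes hard to reason about when many machines are involved.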
What type of data structures do functional programs use?
Immutable data structures.
This means that data cannot be changed once created.
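One way to sketch immutability in Python is a frozen dataclass. The `Reading` class and its fields are hypothetical, chosen only for illustration: an existing value cannot be modified, so an "update" produces a new object instead.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float

original = Reading("s1", 20.5)

try:
    original.value = 21.0  # attempting to change existing data
except FrozenInstanceError:
    print("cannot modify an existing Reading")

# Instead of changing the original, create a new object with the new value
updated = replace(original, value=21.0)
print(original.value, updated.value)  # 20.5 21.0
```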
What is a fact-based model in data storage?
Each individual piece of data is stored as a fact, which is immutable and can’t be overwritten.
Each fact also includes a timestamp to indicate when the information was stored.
What happens when multiple facts for the same item are retrieved?
Timestamps are compared, and the most recent fact is returned.
Because facts are never overwritten, earlier values remain available, which reduces the risk of accidentally losing data due to human error.
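The fact-based model can be sketched as an append-only collection. The `Fact` class, the helper names and the sample data below are all hypothetical, and the timestamp is a plain integer for simplicity: new facts are added alongside old ones, and retrieval returns the fact with the latest timestamp.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    item: str
    value: str
    timestamp: int  # assumption: a simple integer clock

def add_fact(facts: tuple, fact: Fact) -> tuple:
    """Append-only: returns a new tuple; existing facts are never overwritten."""
    return facts + (fact,)

def latest(facts: tuple, item: str) -> Optional[Fact]:
    """Compare timestamps and return the most recent fact for the item."""
    matching = [f for f in facts if f.item == item]
    return max(matching, key=lambda f: f.timestamp, default=None)

facts = ()
facts = add_fact(facts, Fact("alice/address", "1 High St", 100))
facts = add_fact(facts, Fact("alice/address", "9 Low Rd", 250))

print(latest(facts, "alice/address").value)  # 9 Low Rd
print(len(facts))                            # 2
```

Both facts are still stored after the second one is added, so a mistaken entry can be corrected by adding a newer fact without losing the history.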
What does a graph schema represent?
It uses graphs consisting of nodes and edges to graphically represent the structure of a dataset.
Nodes represent entities and contain properties, while edges represent relationships between entities.
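A small sketch of the same idea in code, with made-up entities and relationship names: nodes are stored with their properties, and each edge records a relationship between two nodes.

```python
# Nodes: entities with properties (all sample data is hypothetical)
nodes = {
    "alice": {"label": "Person", "name": "Alice", "age": 30},
    "bob": {"label": "Person", "name": "Bob", "age": 28},
    "acme": {"label": "Company", "name": "Acme"},
}

# Edges: (source node, relationship, target node)
edges = [
    ("alice", "FRIEND_OF", "bob"),
    ("alice", "WORKS_AT", "acme"),
]

def related(node_id: str, relationship: str) -> list[str]:
    """Return the nodes reached from node_id by the given relationship."""
    return [target for source, rel, target in edges
            if source == node_id and rel == relationship]

print(related("alice", "WORKS_AT"))  # ['acme']
```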
Are timestamps included in graph schemas?
Timestamps are rarely included.
It is assumed that each node contains the most recent information available.