11 Big Data Flashcards

Question 1

Q

Define ‘Big Data’.

Answer

A

Big Data is a generic term given to datasets that are so large or complicated that they are difficult to store, manipulate and analyse.
The term covers all data that cannot be handled using traditional processing methods and systems.

Question 2

Q

Describe Big Data in terms of the three V’s.

Answer

A

Volume: The capacity required to store the data exceeds that of a single server.
Velocity: The data is produced and/or processed at very high speed.
Variety: The data is very diverse; data can appear in different types (e.g. text, video, images) and forms (e.g. structured, unstructured, semi-structured)

Question 3

Q

Where is Big Data used?

Answer

A

Big Data is used to record factual data, such as bank transactions, and is increasingly used to analyse trends and make predictions based on relationships and correlations within the data. Areas of use include:

scientific research
retail
banking
government
mobile networks
security
real-time applications
the Internet.

Question 4

Q

What are the issues with Big Data?

Answer

A

Datasets are so large that they are difficult to store and analyse.
Unstructured data is difficult to analyse in an automated way.
Specialist software is needed to manage the data and extract information from it.
Massive storage capacity and processing power are needed.
The data is constantly changing, so it is difficult to track every change.
It is possible to infer the wrong conclusion from the data.
Concurrency must be managed when several users are working on the data at the same time.

Question 5

Q

Describe the fact-based model.

Answer

A

The fact-based model is used to represent, model, and query data sets at the scale of Big Data.
A fact is a piece of data that cannot be deconstructed further and is timestamped. In this way, each fact is ‘eternally true’ meaning that:

It doesn’t include redundant information
It is specific to a particular point in time
It is immutable: it can’t be changed or deleted
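
The properties above can be sketched in Python. This is only an illustration under stated assumptions: the `Fact` record, its field names and the `location` example are hypothetical and do not come from the cards. Each fact is atomic, timestamped and frozen, and a change is recorded by adding a new fact rather than editing an old one.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen=True makes instances immutable
class Fact:
    entity: str          # hypothetical: who the fact is about
    attribute: str       # hypothetical: what is being recorded
    value: str
    timestamp: datetime  # the moment at which the fact was true


def current_value(facts, entity, attribute):
    """Return the value of the most recent matching fact, or None."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value


# An append-only store: a change of location is a new fact, not an edit.
facts = (
    Fact("alice", "location", "Leeds", datetime(2020, 1, 1, tzinfo=timezone.utc)),
    Fact("alice", "location", "York", datetime(2021, 6, 1, tzinfo=timezone.utc)),
)

try:
    facts[0].value = "Hull"  # attempting to change an existing fact
    mutated = True
except FrozenInstanceError:
    mutated = False          # the assignment is rejected
```

Because old facts are never overwritten, the "current" state is derived by querying for the latest timestamp, and the full history stays available.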

Question 6

Q

Outline the graph schema.

Answer

A

Graph schemas are graphs that depict the structure of a data set that is stored using the fact-based model.
A graph schema includes the types of facts that the data set contains and the relationships between these facts.

They are made up of:

Nodes represent the core entities in the data set; they are depicted using ovals.
Edges represent the relationships between the nodes; they are depicted using solid lines. Edges can be directed (to specify a hierarchical relationship) or undirected.
Properties capture information about the nodes; they are depicted using rectangular boxes.

Advantages:

Graph schemas are easy to expand, so they can adapt to capture the complexity of an evolving system and can adequately represent data that is diverse and unpredictable.
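
A rough Python sketch of a graph schema and of how it can be expanded. All node, edge and property names here (`Person`, `Account`, `Branch`, `owns`, `held_at`) and the helper functions `add_node` and `add_edge` are hypothetical, chosen only for illustration.

```python
# Hypothetical schema: the node, edge and property names are illustrative only.
schema = {
    "nodes": {"Person", "Account"},
    # Each edge is (from_node, relationship, to_node).
    "edges": {("Person", "owns", "Account")},
    # Properties are attached to nodes.
    "properties": {
        "Person": {"name", "date_of_birth"},
        "Account": {"balance"},
    },
}


def add_node(schema, node, properties=()):
    """Return a new schema with an extra node; the original is left untouched."""
    return {
        "nodes": schema["nodes"] | {node},
        "edges": set(schema["edges"]),
        "properties": {**schema["properties"], node: set(properties)},
    }


def add_edge(schema, source, relationship, target):
    """Return a new schema with an extra edge between two existing nodes."""
    if source not in schema["nodes"] or target not in schema["nodes"]:
        raise ValueError("both ends of an edge must be nodes in the schema")
    return {**schema, "edges": schema["edges"] | {(source, relationship, target)}}


# Expanding the schema: a new kind of entity and a new relationship.
extended = add_node(schema, "Branch", {"postcode"})
extended = add_edge(extended, "Account", "held_at", "Branch")
```

Adding the `Branch` node needs no change to the existing nodes or edges, which is the sense in which the schema is easy to expand.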

Question 7

Q

Why is functional programming used with Big Data?

Answer

A

Big Data is often so big that it cannot all be stored on a single machine or analysed quickly enough by one.
Work is therefore spread across many servers or workstations on a network, distributing the processing between the processors of each.
Big Data has given rise to the re-emergence of functional programming, a programming paradigm that for a long time was considered niche.
Functional programming lends itself to producing code that can be proved correct and can be distributed across more than one machine without fear of unexpected results.
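
A small Python sketch of why this works. As an assumption for illustration, threads on one machine stand in for separate servers, and `count_words`, `total_words` and the sample `chunks` are hypothetical. Because `count_words` depends only on its argument, the chunks can be processed in any order, or at the same time, and the combined answer is the same.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Pure function: the result depends only on the chunk passed in."""
    return len(chunk.split())


def total_words(chunks, workers=4):
    """Map count_words over the chunks in parallel, then combine the counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(count_words, chunks)  # results come back in input order
        return reduce(lambda a, b: a + b, counts, 0)


# Hypothetical data set, already split into chunks (one per machine, say).
chunks = ("big data is big", "facts are immutable", "map then reduce")
```

Calling `total_words(chunks)` gives the same total whether one worker or several are used, since no worker changes data that another worker reads.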

Question 8

Q

Outline the features of functional programming.

Answer

A

1. Immutable data structures

An immutable data structure cannot be changed during program execution.
This means data in the data structure cannot be added, removed or replaced.
Immutable data structures eliminate errors caused when data is overwritten by mistake, which is essential for parallel processing.

2. Statelessness

State refers to the variables and data used in the program at any point during execution.
Functional programs are stateless because their data structures are immutable.
This makes code easier to write and reason about: the output of a function depends only on its inputs, so the same inputs always give the same outputs.

3. Higher-order functions

Higher-order functions take other functions as parameters, return functions as results, or both.
For example, the ‘map’ higher-order function applies a given function to each element of a list, so the programmer only needs to specify the function to be applied, not how the mapping is achieved.
Higher-order functions can run safely on parallel systems because each processor can carry out its computation without disturbing other parts of the data set.
As a result, problems are easier to solve and debug, and programs can be executed across more than one server.
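
The three features can be seen together in a short Python sketch. Python is not a purely functional language, so this only approximates the ideas; the names `readings`, `square` and `scale_by` are hypothetical.

```python
from functools import reduce

# 1. Immutable data: a tuple cannot be changed after it is created.
readings = (3, 1, 4, 1, 5)


# 2. Statelessness: the function reads nothing but its argument.
def square(x):
    return x * x


# 3. Higher-order functions: map takes a function as a parameter...
squares = tuple(map(square, readings))


# ...and a function can also return a function.
def scale_by(factor):
    def scale(x):
        return x * factor
    return scale


double = scale_by(2)
doubled = tuple(map(double, readings))
total = reduce(lambda a, b: a + b, squares, 0)
```

Neither `map` call alters `readings`; each produces a new tuple, so the original data can be shared safely between computations.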


(8 cards)