11 Big Data Flashcards

1
Q

Define ‘Big Data’.

A

Big Data is a generic term for datasets that are so large or complex that they are difficult to store, manipulate and analyse.
The term covers all data that cannot be handled using traditional processing methods and systems.

2
Q

Describe Big Data in terms of the three V’s.

A
  1. Volume: The data is too large to be stored on a single server
  2. Velocity: The data is produced and/or processed at very high speed
  3. Variety: The data is very diverse; it can appear in different types (e.g. text, video, images) and forms (e.g. structured, unstructured, semi-structured)
3
Q

Where is Big Data used?

A

Big Data is used to record factual data such as bank transactions, and it is increasingly used to analyse trends and to make predictions based on relationships and correlations within the data. Areas of use include:

  • scientific research
  • retail
  • banking
  • government
  • mobile networks
  • security
  • real-time applications
  • the Internet.
4
Q

What are the issues with Big Data?

A
  1. Datasets are so large that they are difficult to store and analyse
  2. Unstructured data is difficult to analyse in an automated way
  3. Specialist software is needed to manage the data and extract information from it
  4. Massive storage and processing power are needed
  5. The data is constantly changing, so it is difficult to track every change
  6. It is possible to infer the wrong conclusion from the data
  7. Concurrency problems can arise when several users work on the data at the same time
5
Q

Describe the fact-based model.

A
  • The fact-based model is used to represent, model and query data sets at the scale of Big Data.
  • A fact is a timestamped piece of data that cannot be broken down further. Each fact is therefore ‘eternally true’, meaning that:
  1. It doesn’t include redundant information
  2. It is specific to a particular point in time
  3. It is immutable: it can’t be changed or deleted
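
The properties above can be illustrated with a minimal Python sketch. The `Fact` class, its field names and the sample values are hypothetical and chosen only for illustration; a real Big Data store would hold facts across many machines rather than in a tuple.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)           # frozen=True makes each fact immutable
class Fact:
    entity: str                   # hypothetical identifier, e.g. a user ID
    attribute: str                # the single thing this fact describes
    value: str
    timestamp: int                # the point in time the fact refers to

# The store is append-only: existing facts are never edited or deleted.
facts = (
    Fact("user1", "city", "Leeds", 100),
    Fact("user1", "city", "York", 200),
)

def add_fact(store, fact):
    """Return a new store with the fact appended; the old store is untouched."""
    return store + (fact,)

def latest(store, entity, attribute):
    """Return the most recent value recorded for an entity's attribute."""
    matching = [f for f in store
                if f.entity == entity and f.attribute == attribute]
    return max(matching, key=lambda f: f.timestamp).value if matching else None

# Attempting to change an existing fact is rejected.
change_rejected = False
try:
    facts[0].value = "Bath"
except FrozenInstanceError:
    change_rejected = True

newer_facts = add_fact(facts, Fact("user1", "city", "Hull", 300))
```

Here a change of city is recorded as a new fact with a later timestamp, so the earlier facts stay true for their own points in time and `latest` simply picks the newest one.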
6
Q

Outline the graph schema.

A
  • A graph schema is a graph that depicts the structure of a data set stored using the fact-based model.
  • It shows the types of facts that the data set contains and the relationships between them.

A graph schema is made up of:

  • Nodes, which represent the core entities in the data set. They are depicted using ovals.
  • Edges, which represent the relationships between the nodes. They are depicted using solid lines. Edges can be directed (to specify a hierarchical relationship) or undirected.
  • Properties, which capture information about the nodes. They are depicted using rectangular boxes.

Advantages:

  • Graph schemas are easy to expand, so they can adapt to capture the complexity of an evolving system and can adequately represent data that is diverse and unpredictable.
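
As a rough illustration of nodes, edges and properties, the sketch below models a graph schema in Python. The `GraphSchema` class and the example entities (Person, Company, City) are assumptions made for this example, not part of any particular database product.

```python
from dataclasses import dataclass, field

@dataclass
class GraphSchema:
    nodes: set = field(default_factory=set)          # core entities (ovals)
    edges: set = field(default_factory=set)          # (source, label, target), directed
    properties: dict = field(default_factory=dict)   # node -> property names (boxes)

    def add_node(self, name, *props):
        """Add an entity type together with the properties that describe it."""
        self.nodes.add(name)
        self.properties.setdefault(name, set()).update(props)

    def add_edge(self, source, label, target):
        """Add a directed relationship between two existing nodes."""
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("both endpoints must be existing nodes")
        self.edges.add((source, label, target))

    def related_to(self, node):
        """Return the nodes that the given node points to, in sorted order."""
        return sorted(t for s, _, t in self.edges if s == node)

schema = GraphSchema()
schema.add_node("Person", "name", "date_of_birth")
schema.add_node("Company", "name")
schema.add_edge("Person", "works_for", "Company")

# Expanding the schema later only adds entries; nothing existing is altered.
schema.add_node("City", "name")
schema.add_edge("Person", "lives_in", "City")
```

Adding the City node and the `lives_in` edge leaves the existing Person and Company entries as they were, which is the sense in which a graph schema is easy to expand.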
7
Q

Why is functional programming used with Big Data?

A
  • Big Data is often too big to store on a single machine, or to analyse quickly enough on one.
  • Work is therefore spread across many servers or workstations on a network, with the processing distributed between their processors.
  • Big Data has led to the re-emergence of functional programming, a programming paradigm that for a long time was considered niche.
  • Functional programming lends itself to producing code that can be proved correct and can be distributed across more than one machine without fear of unexpected results.
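
The idea of splitting work between machines can be sketched on a small scale in Python. In this example, which is only an approximation, worker threads stand in for separate servers, and the function names and sample text are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(chunk):
    """Pure function: the result depends only on the chunk passed in."""
    return sum(len(line.split()) for line in chunk)

def split_into_chunks(data, n):
    """Split data into at most n roughly equal pieces."""
    size = max(1, -(-len(data) // n))   # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

lines = (
    "big data is big",
    "functional code is easy to distribute",
    "facts are immutable",
)
chunks = split_into_chunks(lines, 2)

# Each worker processes its own chunk and shares no state with the others.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results are then combined into a single answer.
total = reduce(lambda a, b: a + b, partial_counts, 0)
```

Because `count_words` reads only its own input and changes nothing else, the chunks can be processed in any order, or at the same time, and the combined total matches what a single machine would compute.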
8
Q

Describe the features of functional programming.

A

1. Immutable data structures

  • An immutable data structure cannot be changed during program execution.
  • This means that data cannot be added to, removed from or replaced in the data structure.
  • Immutable data structures eliminate errors caused when data is overwritten by mistake, which is essential for parallel processing.
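
A small Python sketch of this behaviour, using the built-in tuple as the immutable structure (the variable names and readings are made up for the example):

```python
# A tuple is an immutable sequence: its contents are fixed once created.
readings = (12.5, 13.1, 11.8)

# Trying to overwrite an element is rejected instead of silently succeeding.
try:
    readings[0] = 99.0
    mutation_blocked = False
except TypeError:
    mutation_blocked = True

# "Adding" a value builds a new tuple and leaves the original unchanged.
updated = readings + (14.2,)
```

Any other part of the program that still holds `readings` continues to see the original three values, so it cannot be caught out by an accidental overwrite.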

2. Statelessness

  • State refers to the variables and data used in the program at any point during execution.
  • Functional programs are stateless because data structures are immutable.
  • This makes code easier to write and reason about: the same inputs always give the same outputs, because the result doesn’t depend on anything else.
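
The difference can be seen by comparing a function that relies on state with one that does not. Both functions below are hypothetical examples written for this comparison.

```python
# Stateful: the result depends on a variable outside the function.
running_total = 0

def add_to_total(x):
    global running_total
    running_total += x
    return running_total

first_call = add_to_total(5)     # 5
second_call = add_to_total(5)    # 10: same input, different output

# Stateless: everything the function needs is passed in as an argument.
def add(total, x):
    return total + x

first_pure = add(0, 5)           # 5
second_pure = add(0, 5)          # 5: same input, same output
```

The stateless version can be called on any machine, in any order, and still return the same answer, which is what makes it safe to distribute.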

3. Higher-order functions

  • A higher-order function takes other functions as parameters, returns a function as its result, or both.
  • For example, the ‘map’ higher-order function applies a given function to each element of a list, so the programmer only needs to specify the function to be applied, not how the mapping is achieved.
  • Higher-order functions can run safely on parallel systems because each processor can carry out its computation without disturbing other parts of the data set.
  • As a result, problems are easier to solve and debug, and programs can be executed across more than one server.
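
Both kinds of higher-order function can be shown briefly in Python. `map` is a Python built-in; `make_scaler` and the sample values are hypothetical names introduced for this sketch.

```python
def make_scaler(factor):
    """Higher-order function that returns a new function."""
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)          # a function produced by another function

values = (1, 2, 3, 4)

# map takes a function as a parameter and applies it to every element.
# The caller says what to apply, not how the iteration is carried out.
doubled = tuple(map(double, values))
```

Each element is processed independently of the others, so in principle the elements could be shared out between processors without changing the result.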