Big Data Flashcards

1
Q

What is Big Data?

A

Big Data is data that cannot be processed or analysed using traditional processes or tools because it falls into one or more of the following categories:
- too big to fit on a single server
- too heterogeneous (diverse in character or content): structured, semi-structured or totally unstructured
- produced at very high rates
These three defining features of big data can be remembered as “the three Vs”:
Velocity, Volume and Variety.

2
Q

Big Data Velocity

A
  • Data on the servers is created and modified rapidly.
  • The servers must respond to frequently changing data within milliseconds.
3
Q

Define data in motion (velocity)

A
  • Data in motion is data that is streamed (sent or received) continuously at a high frequency, for example at a rate of 1000 events per second.
  • In a Big Data scenario, this data is likely to arrive at a high rate and from multiple sources simultaneously, e.g. Twitter streams.
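As a minimal sketch of data arriving from several sources at once, the Python below interleaves two time-ordered event sources into one stream and counts events per one-second window. The source lists, the event tuple layout and the function names are all hypothetical, chosen only for illustration.

```python
import heapq
from collections import Counter

# Hypothetical sources: each holds (timestamp_seconds, source_name, payload)
# events that are already ordered by timestamp.
source_a = [(0.10, "a", "x"), (0.70, "a", "y"), (1.20, "a", "z")]
source_b = [(0.30, "b", "p"), (1.10, "b", "q")]

def merged_stream(*sources):
    """Interleave several time-ordered sources into one time-ordered stream."""
    return heapq.merge(*sources)  # lazy: events are consumed as they are needed

def events_per_second(stream):
    """Count events in each one-second window, keyed by the window's start."""
    counts = Counter()
    for timestamp, _source, _payload in stream:
        counts[int(timestamp)] += 1
    return dict(counts)

rates = events_per_second(merged_stream(source_a, source_b))
```

Here `rates` maps each second to the number of events seen in it. A real stream would never end, so the counts would be emitted window by window rather than returned at the end.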
4
Q

Define data at rest (velocity) and its batch processing

A
  • Data at rest is data that has been stored on some permanent data storage device.
  • The data may be processed at any time.
  • Big Data at rest is usually batch processed.
  • In batch processing, a job, once started, runs to completion without user interaction. This is where machine learning is typically applied.
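A batch job over data at rest can be sketched as a single function that reads every stored record and runs to completion with no user input. The records, field names and `run_batch` function below are hypothetical; an in-memory list stands in for data on a permanent storage device.

```python
from collections import defaultdict

# Hypothetical stored records (a stand-in for data at rest on disk).
stored_records = [
    {"account": "A", "amount": 20.0},
    {"account": "B", "amount": 5.0},
    {"account": "A", "amount": 7.5},
]

def run_batch(records):
    """Process every record in one uninterrupted pass and return the totals."""
    totals = defaultdict(float)
    for record in records:
        totals[record["account"]] += record["amount"]
    return dict(totals)

totals = run_batch(stored_records)
```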
5
Q

Big Data Volume

A

Volume in Big Data refers to the size of the data to be processed. Large volumes of data fall into the Big Data category if that data must be analysed as a single dataset.

6
Q

What is a distributed file system?

A

A distributed file system is one in which the blocks of an individual file are spread across more than one server.
For example, Google’s distributed file system is GFS, while Yahoo, Facebook, and Twitter use HDFS, the Hadoop Distributed File System.
Both systems use racks of servers, with network switches connecting the servers within a rack and linking them to servers in other racks.
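The idea of spreading a file's blocks across servers can be sketched as follows. This is not how GFS or HDFS place blocks: the tiny block size, the server names and the round-robin placement are assumptions made for illustration, and real systems also replicate each block, which this sketch omits.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into consecutive fixed-size blocks (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, servers):
    """Assign block indexes to servers in round-robin order (hypothetical policy)."""
    placement = {server: [] for server in servers}
    for index, _block in enumerate(blocks):
        placement[servers[index % len(servers)]].append(index)
    return placement

blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(blocks, ["server-1", "server-2"])
```

With two servers, the three blocks of this ten-byte file end up on different machines, so no single server holds the whole file.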

7
Q

Big Data Variety

A
  • Data can appear in many forms, from structured through semi-structured to unstructured.
  • The unstructured nature of much big data makes it difficult to analyse.
  • Conventional databases are not suited to storing big data because they require the data to conform to a row and column structure, and do not scale well across multiple servers.
8
Q

Examples of big data?

A

Twitter, continuously monitored banking interactions, and data from surveillance systems.

9
Q

How does Machine Learning benefit Big Data?

A
  • Machine learning techniques are needed to discern patterns in data and to extract useful information.
  • This can take the form of a predictive model, which the algorithm processing a data stream then uses to extract value from that stream.
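The pattern of learning from stored data and then applying the result to a stream can be sketched with a deliberately simple stand-in for a predictive model: a mean and standard deviation learned from historical values, used to flag unusual values as they arrive. The data, the threshold and the function names are hypothetical, and a real system would use a proper machine learning model.

```python
import statistics

def fit(history):
    """Learn a mean and standard deviation from historical values."""
    return statistics.mean(history), statistics.pstdev(history)

def flag_unusual(stream, model, threshold=3.0):
    """Yield each streamed value more than `threshold` deviations from the mean."""
    mean, stdev = model
    for value in stream:
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            yield value

# Train once on data at rest, then apply the model to data in motion.
model = fit([10, 12, 11, 9, 10, 11, 10, 9])
unusual = list(flag_unusual([10, 11, 50, 9], model))
```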
10
Q

What are the principles of fact-based modelling?

A
  • Raw data is stored as atomic facts (the smallest single units of information).
  • Each fact is uniquely identifiable, so queries can detect duplicates.
  • Each fact captures a single piece of information.
  • Facts are immutable and, because each carries a timestamp, remain true forever.

Big data can be stored this way.
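These principles can be sketched with an immutable record type. The `Fact` class, its field names and the sample values are hypothetical; the point is only that each fact is atomic, timestamped, unchangeable after creation and comparable, so duplicates are easy to detect.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Fact:
    entity_id: str
    attribute: str
    value: str
    timestamp: int  # hypothetical unit: seconds since some epoch

facts = [
    Fact("user-1", "city", "Leeds", 100),
    Fact("user-1", "city", "York", 200),
    Fact("user-1", "city", "Leeds", 100),  # exact duplicate of the first fact
]

# Identical facts compare equal and hash alike, so a set removes duplicates.
unique_facts = sorted(set(facts), key=lambda f: f.timestamp)
```

Note that the move to York does not overwrite the Leeds fact: both remain true statements about the time at which they were recorded.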

11
Q

What are the advantages of fact-based modelling?

A
  • Simplicity, as no indexing is needed.
  • New items are simply appended to the growing dataset.
  • Data remains true forever.
  • Because facts are immutable, errors are easy to correct by returning to known good facts.
  • Historical queries are easy to run.
  • The risk of losing data through human error is reduced.
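A minimal sketch of the append-only store and a historical query, assuming facts are kept as `(entity_id, attribute, value, timestamp)` tuples. The `log`, `record` and `value_as_of` names are hypothetical.

```python
log = []  # append-only list of (entity_id, attribute, value, timestamp)

def record(entity_id, attribute, value, timestamp):
    """Add a new fact; existing facts are never modified or deleted."""
    log.append((entity_id, attribute, value, timestamp))

def value_as_of(entity_id, attribute, when):
    """Return the value held at time `when`, or None if no fact existed yet."""
    matching = [
        f for f in log
        if f[0] == entity_id and f[1] == attribute and f[3] <= when
    ]
    return max(matching, key=lambda f: f[3])[2] if matching else None

record("user-1", "city", "Leeds", 100)
record("user-1", "city", "York", 200)
```

Calling `value_as_of("user-1", "city", 150)` looks back to the state before the second fact was recorded, which is possible only because the earlier fact was never overwritten.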
12
Q

Graph schema (what does each component represent?)

A
  • Graph schema uses graphs consisting of nodes and edges to graphically represent the structure of a dataset.
  • Nodes in a graph represent entities and can contain the properties of the entity.
  • Edges are used to represent relationships between entities and are labelled with a brief description of the relationship.
  • Timestamps are rarely included in graph schema diagrams. Instead, you should assume that each node contains the most recent information available.
  • An entity’s properties can be listed inside rectangles joined to the entity with a dashed line.
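The components of a graph schema can be sketched as two small record types, one for nodes and one for edges. The `Node` and `Edge` classes and the Person/Company example are hypothetical and only illustrate which piece of the diagram holds which information.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An entity, together with the names of its properties."""
    name: str
    properties: list = field(default_factory=list)

@dataclass
class Edge:
    """A labelled relationship from one entity to another."""
    source: str
    target: str
    label: str

nodes = {
    "Person": Node("Person", ["name", "age"]),
    "Company": Node("Company", ["name"]),
}
edges = [Edge("Person", "Company", "works for")]

def describe(edge):
    """Render an edge the way it would be read off the diagram."""
    return f"{edge.source} --[{edge.label}]--> {edge.target}"
```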
13
Q

Functional Programming

A
  • Functional programming is a solution to the problem of
    processing data over multiple machines.
  • Functional programs are stateless (meaning that they have no side
    effects) and make use of immutable data structures.
  • Furthermore, the functional programming paradigm supports
    higher-order functions.
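A small sketch of these three ideas in Python: an immutable tuple as the data, functions with no side effects, and the higher-order functions `map` and `functools.reduce`, which take other functions as arguments. The data and the `double` and `add` helpers are hypothetical.

```python
from functools import reduce

readings = (3, 8, 1, 9)  # a tuple is immutable: it cannot be changed in place

def double(x):
    """Pure function: the result depends only on the argument."""
    return x * 2

def add(a, b):
    """Pure function used to combine results."""
    return a + b

# map and reduce are higher-order: each takes a function as an argument.
doubled = tuple(map(double, readings))
total = reduce(add, doubled, 0)
```

Because `double` touches no shared state, each element could in principle be processed on a different machine and the results combined afterwards with `add`.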