Big Data Overview Flashcards by Sarah Mokrani

Big Data

refers to non-conventional strategies and
innovative technologies used by businesses and
organizations to capture, manage, process, and make
sense of a large volume of data

Challenges of big data

* Capturing - transporting and moving the data

* Managing - the data, the hardware involved, and the software

* Processing - to provide insight

* Storing - safeguarding and securing

Conventional BI & DWH architecture

Components:
App servers
Network switches
Database servers
SAN switch
Storage array
Properties:
SQL based
High availability
Enterprise database
Right design for structured data

Analytics Architecture

Components:
Edge node
Network switches
Data nodes
Properties:
Not only SQL (NoSQL) based
High scalability, availability, and flexibility
Compute and storage in the same box to reduce network latency
Right design for semi-structured and unstructured data
Data and application are on the same machine (data nodes)

The Vs of Big Data

Volume
Variety
Velocity (the speed at which vast amounts of data are being generated, collected, and analyzed)
Veracity (the quality or trustworthiness of the data)
Value

Volume

how much data is there?

Variety

how many different types of sources are there?

Velocity

how quickly is the data being created, moved, or
accessed?

Veracity

can we trust the data?

Validity

is the data accurate and correct?

Viability

is the data relevant to the use case at hand?

Volatility

how often does the data change?

Vulnerability

can we keep the data secure?

Visualization

how can the data be presented to the user?

Value

can this data produce a meaningful return on
investment?

Types of Big Data

Structured, semi-structured, unstructured

Structured

Data that can be stored and processed in a fixed format, also known as a schema

Semi-structured

Data that does not have the formal structure of a data model (e.g. a table
definition in a relational DBMS), but nevertheless has some
organizational properties, such as tags and other markers that separate semantic
elements and make it easier to analyze, e.g. XML or JSON

Unstructured

Data that has an unknown form, does not fit the rows and columns of an RDBMS, and
generally cannot be analyzed until it is transformed into a structured format
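
The three types above can be illustrated with a minimal Python sketch. All sample values, column names, and variable names here are hypothetical and chosen only for illustration: a CSV with fixed columns stands in for structured data, a JSON document for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (hypothetical columns).
structured_csv = "id,name,age\n1,Ada,36\n2,Linus,54\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: no table definition, but keys act as tags that
# separate the semantic elements, and fields may be nested or optional.
semi_structured = '{"id": 3, "name": "Grace", "skills": ["COBOL", "math"]}'
record = json.loads(semi_structured)

# Unstructured: free text with no predefined fields. It has to be
# transformed (here, naively tokenized) before it can be analyzed.
unstructured = "Ada, aged 36, joined the team on Monday."
tokens = unstructured.lower().replace(",", "").replace(".", "").split()

print(rows[1]["name"], record["skills"], tokens[:3])
```

The structured rows can be queried by column name right away, the JSON record needs only parsing, and the sentence needs a transformation step before any analysis is possible.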

The 5 Vs and data: Volume, Velocity, Variety, Veracity, Value

Volume: data at rest (not in use)
Velocity: data in motion (analyzing data on the fly)
Variety: data in many forms
Veracity: data in doubt
Value: data into money

Hadoop

Apache open-source software framework for reliable,
scalable, distributed computing over massive amounts of data

What Hadoop is good for

Massive amounts of data through
parallelism

A variety of data (structured, unstructured,
semi-structured)

Inexpensive commodity hardware

Hadoop is not good for

Not good for processing transactions (random access)

Not good when work cannot be parallelized

Not good for low latency data access

Not good for processing lots of small files

Not good for intensive calculations with little data
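
The idea of handling large data through parallelism can be sketched as a split-then-combine word count. This is not Hadoop's API: the function names are hypothetical, and a local thread pool merely stands in for the nodes of a cluster. It only shows why work that can be divided into independent chunks suits this style of processing.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Count the words in one chunk independently of all other chunks."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=4):
    """Process the chunks concurrently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


# Hypothetical input, already split into independent chunks.
chunks = ["big data big", "data lake", "big lake"]
totals = parallel_word_count(chunks)
print(totals)
```

If each chunk depended on the result of the previous one, the chunks could not be processed independently, which is the "work cannot be parallelized" case listed above.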

Data Lake

a large storage repository and processing engine

Data munging/Data wrangling

is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for analytics
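
As a minimal sketch of what such a transformation can look like, the following Python snippet maps hypothetical raw records (inconsistent casing, day/month/year dates, numbers with thousands separators) into a cleaner form. The field names, formats, and the `wrangle` helper are assumptions made for illustration only.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"Name": "  ada lovelace ", "Joined": "10/12/2015", "Spend": "1,200.50"},
    {"Name": "LINUS TORVALDS", "Joined": "03/01/2020", "Spend": ""},
]


def wrangle(record):
    """Map one raw record to a normalized record that is ready for analysis."""
    spend = record["Spend"].replace(",", "")
    return {
        # Trim whitespace and normalize casing.
        "name": record["Name"].strip().title(),
        # Assumes day/month/year input; emits an ISO 8601 date string.
        "joined": datetime.strptime(record["Joined"], "%d/%m/%Y").date().isoformat(),
        # Missing values become None instead of an empty string.
        "spend": float(spend) if spend else None,
    }


clean = [wrangle(r) for r in raw_records]
print(clean)
```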

Oceans of data

data at rest

Streams of data

data in motion
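
The difference between data at rest and data in motion can be sketched with two ways of computing an average. The function names and sample readings are hypothetical: the batch version needs the whole stored data set up front, while the streaming version updates its answer as each value arrives.

```python
def batch_average(readings):
    """Data at rest: the complete data set is stored and read in one pass."""
    return sum(readings) / len(readings)


def running_average(stream):
    """Data in motion: emit an updated average after every incoming value."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


stored = [10, 20, 30, 40]
at_rest = batch_average(stored)
in_motion = list(running_average(iter(stored)))
print(at_rest, in_motion)
```

Both approaches agree once all values have been seen, but only the streaming version has an answer available while the data is still arriving.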

The main categories of data are

* Structured
* Unstructured
* Natural language
* Machine-generated
* Graph-based
* Audio, video, and image
* Streaming

The six design principles in Industry 4.0

Interoperability, Virtualization, Decentralization, Real-time Capability, Service Orientation, Modularity