Big Data Overview Flashcards
Big Data
refers to non-conventional strategies and
innovative technologies used by businesses and
organizations to capture, manage, process, and make
sense of a large volume of data
challenges of big data
*Capturing, transporting, and moving the data
*Managing - the data, the hardware involved, and the software
*Processing - to provide insight
*Storing - safeguarding and securing
conventional BI & DWH architecture
App Servers
Network Switches
Database Servers
SAN Switch
Storage Array
Properties: SQL based
High availability
Enterprise database
Right design for structured data
Analytics Architecture
Edge node
Network switches
Data nodes
Properties: Not only SQL based
High scalability, availability, and flexibility
Compute and storage in the same box for reducing network latency
Right design for semi-structured and unstructured data
Data and application are on the same machine (data nodes)
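As a rough illustration of why keeping compute and storage in the same box reduces network traffic, the sketch below assigns each task to a node that already stores the block it needs. The node names, the block layout, and the `assign_tasks` helper are all hypothetical and are not taken from any real scheduler.

```python
# Hypothetical block layout: which data nodes hold a copy of each block.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}


def assign_tasks(block_locations, free_nodes):
    """Assign each block's task to a free node that already stores the block.

    Returns a dict mapping block -> (node, is_local). If no free node holds
    the block, any free node is used and the task is marked non-local,
    meaning the block would have to travel over the network.
    """
    assignments = {}
    available = list(free_nodes)
    for block, holders in block_locations.items():
        local = [node for node in available if node in holders]
        if local:
            node, is_local = local[0], True
        elif available:
            node, is_local = available[0], False
        else:
            break  # no free nodes left
        available.remove(node)
        assignments[block] = (node, is_local)
    return assignments


plan = assign_tasks(BLOCK_LOCATIONS, ["node-a", "node-b", "node-c"])
local_count = sum(1 for _, is_local in plan.values() if is_local)
print(plan, local_count)
```

With this layout every task can run where its data already lives, so no block has to cross the network switches.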
The Vs of Big Data
Volume, Variety, Velocity (the speed at which vast amounts of data are
being generated, collected, and analyzed), Veracity (the quality or trustworthiness of the data), Value
Volume
How much data is there?
Variety
How many different types of sources are there?
Velocity
How quickly is the data being created, moved, or accessed?
Veracity
Can we trust the data?
Validity
Is the data accurate and correct?
Viability
Is the data relevant to the use case at hand?
Volatility
How often does the data change?
Vulnerability
Can we keep the data secure?
Visualization
How can the data be presented to the user?
Value
Can this data produce a meaningful return on investment?
Types of Big Data
Structured, semi-structured, unstructured
Structured
Data that can be stored and processed in a fixed format, i.e. a schema
Semi-structured
Data that does not follow a formal data model (e.g. a table
definition in a relational DBMS) but still has some organizational
properties, such as tags and other markers that separate semantic
elements and make it easier to analyze, e.g. XML or JSON
Unstructured
Data that has no known form or schema; it does not fit naturally into an RDBMS and
generally has to be transformed into a structured format before it can be analyzed
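A minimal sketch of the three types, using only the Python standard library. The sample records are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns (a schema), so every row parses the same way.
structured = "id,name,amount\n1,Alice,30.5\n2,Bob,12.0\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no table definition, but keys and nesting mark the elements.
semi_structured = '{"id": 3, "name": "Carol", "tags": ["vip"], "address": {"city": "Oslo"}}'
record = json.loads(semi_structured)

# Unstructured: free text; structure has to be extracted before analysis.
unstructured = "Carol called on Monday and asked about her invoice."
word_count = len(unstructured.split())

print(rows[0]["name"], record["address"]["city"], word_count)
```

The CSV rows can be queried by column name immediately, the JSON record can be navigated by key even though no schema was declared, and the free text yields nothing more than a token count until something further is extracted from it.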
The 5 Vs and data
Volume: data at rest (stored, not in use)
Velocity: data in motion (analyzing data on the fly)
Variety: data in many forms
Veracity: data in doubt
Value: data into money
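The difference between data at rest and data in motion can be sketched with a simple average. The readings below are hypothetical, and the two helper functions are illustrative only: one scans a stored data set, the other updates its answer as each value arrives.

```python
def batch_mean(stored_values):
    """Data at rest: the full data set is already stored, so scan it once."""
    return sum(stored_values) / len(stored_values)


def running_mean(stream):
    """Data in motion: update the mean as each value arrives, keeping no history."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
at_rest = batch_mean(readings)
in_motion = list(running_mean(iter(readings)))
print(at_rest, in_motion)
```

Both approaches end at the same mean, but the streaming version has an answer available after every reading rather than only once all the data has been stored.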
Hadoop
An Apache open-source software framework for reliable,
scalable, distributed computing on massive amounts of data
What Hadoop is good for
Processing massive amounts of data through parallelism
Handling a variety of data (structured, unstructured,
semi-structured)
Running on inexpensive commodity hardware
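The parallelism Hadoop relies on is usually described in terms of the MapReduce pattern. The sketch below imitates that pattern in plain Python with a thread pool; it does not use any Hadoop API, and the input chunks and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


# Hypothetical input, already split into independent chunks.
chunks = ["big data big insight", "data at rest", "data in motion"]

# Each chunk is mapped independently, which is what makes the work parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_phase, chunks))

word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because no map call depends on another chunk, adding more chunks and more workers scales the same code. This is also why work that cannot be split this way is a poor fit.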
Hadoop is not good for
Processing transactions (random access)
Work that cannot be parallelized
Low-latency data access
Processing lots of small files
Intensive calculations on little data
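One commonly cited reason small files are a problem is that a distributed file system keeps a metadata entry for every file and every block, so many small files inflate the bookkeeping without adding data. The arithmetic below is a sketch under stated assumptions: a 128 MiB block size, one entry per file plus one per block, and a hypothetical `tracked_objects` helper.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)


def tracked_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata entries under the assumption of one per file plus one per block."""
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    return len(file_sizes) + blocks


one_gib = 1024 * 1024 * 1024

# Roughly the same volume of data, stored two ways.
one_large_file = tracked_objects([one_gib])                 # 1 file + 8 blocks
many_small_files = tracked_objects([100 * 1024] * 10_000)   # 10,000 files + 10,000 blocks
print(one_large_file, many_small_files)
```

Under these assumptions, about 1 GiB stored as 10,000 files of 100 KiB each needs over two thousand times as many entries as the same data stored in one file.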
Data Lake
a large storage repository and processing engine