L6 Flashcards
def Big Data
data sets that are too large or complex to be dealt with by traditional data-processing application software
- size or type
- data is coming from multiple sources (sensors, devices, web, log files)
- includes structured semi-structured and unstructured data
5 V’s of Big Data
volume
velocity
variety
veracity
value
(variability is sometimes added as a sixth V)
Variety in BD
- various formats
- structured, semi-structured, unstructured
Structured data
transactional data in a traditional DBMS, organized into rows and columns with a fixed schema
Semi-structured data
a mix of structured and unstructured elements: tagged or self-describing, but without a rigid schema
eg) email, CSV files, log files, NoSQL databases
Unstructured data
text, images, videos, social media
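The three data shapes above can be sketched with stdlib Python; this is a toy illustration (the field names and sample text are made up), not a real pipeline:

```python
import csv
import io
import json

# Structured: rows and columns with a fixed schema, as in a traditional DBMS
structured = io.StringIO("id,name,age\n1,Ana,34\n2,Bo,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged/self-describing fields, but no rigid schema (e.g. a JSON log entry)
semi = json.loads('{"user": "Ana", "event": "login", "meta": {"ip": "10.0.0.1"}}')

# Unstructured: free text; any structure must be inferred (here, a naive word count)
unstructured = "Big data includes text, images and videos from social media."
word_count = len(unstructured.split())

print(rows[0]["name"], semi["event"], word_count)  # → Ana login 10
```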
Variability in Big Data
- data patterns, meanings, and loads change constantly, so systems must handle variation
- inconsistencies and outliers need to be detected
- the same data can be used in many different ways, for different purposes
4 BD Technologies
- Data storage: Apache (Hadoop, Cassandra)
- Data mining: RapidMiner
- Data analytics: Apache Spark
- Data visualization: Tableau
Apache (Software Foundation)
- develops open-source software
- American nonprofit formed in 1999
Apache Hadoop
- written in Java
- supports distributed processing of large data sets
- can store large volumes of all kinds of data (structured, semi-structured, unstructured) in a distributed file system
- parallel processing (MapReduce)
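Hadoop's parallel-processing model (MapReduce) can be sketched in plain Python; real Hadoop runs the map and reduce phases on separate machines and is written in Java, so this is only a single-process illustration of the idea:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" turns its chunk of input into (word, 1) pairs
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group pairs by key, as Hadoop does between map and reduce
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

chunks = ["big data big", "data lake data"]           # input split across "nodes"
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(shuffle(mapped))
print(result)  # → {'big': 2, 'data': 3, 'lake': 1}
```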
Cassandra
- apache NoSQL database management system
- handles large amounts of data across many commodity servers
- high availability with no single point of failure
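Why there is no single point of failure: each row is replicated on several nodes. A toy sketch of the idea (not Cassandra's actual partitioner; node names and replication factor are made up):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # commodity servers
REPLICATION_FACTOR = 3                            # each row is stored on 3 nodes

def replicas(key: str) -> list[str]:
    # Hash the key to a starting node, then walk the ring for the remaining
    # copies -- roughly how Cassandra places replicas around its token ring.
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

owners = replicas("user:42")
print(len(set(owners)))  # → 3 distinct nodes hold this row; any one can fail
```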
Commodity servers
use of many inexpensive, off-the-shelf computing components for parallel computing
get large amounts of computation at low cost
Hadoop vs Cassandra
cassandra is NoSQL and good for
- high-speed, online transactional data
- online web-mobile applications
eg) e-commerce and inventory management, personalization, recommendation, IoT and edge computing
hadoop is good for
- big data analytics
- warehousing
- data lake uses
- cold and historical data
eg) retail analytics, financial risk analysis, trading, forecasting, social media with very high volumes
-> can complement each other
Apache Spark
- unified analytics engine for large-scale data processing
- leading platform for large-scale SQL, batch processing, stream processing and ML
Hadoop vs Spark
spark:
- in-memory processing (much faster)
- both batch processing and real-time
- fewer lines of code
- easier authentication
- fault tolerance
- better analytics
- more flexibility
hadoop:
- slower, disk-based data processing
- only batch processing
- more lines of code
- written in Java; jobs take longer to execute
- authentication is harder to manage -> Kerberos
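Spark's speed advantage comes largely from keeping intermediate results in memory and chaining lazy transformations, instead of writing each step to disk as MapReduce does. A rough stdlib analogy using Python generators (this is not Spark's actual API):

```python
# Each generator is like a lazy Spark transformation: nothing runs yet.
data = range(1, 1_000_001)
filtered = (x for x in data if x % 2 == 0)   # "filter" transformation
squared = (x * x for x in filtered)          # "map" transformation

# The terminal action pulls data through the whole pipeline once, in memory,
# with no intermediate results written to disk between the steps.
total = sum(squared)
print(total)
```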