L6 Flashcards
(23 cards)
def Big Data
data sets that are too large or complex to be dealt with by traditional data-processing application software
- challenging because of its size or type
- data comes from multiple sources (sensors, devices, web, log files)
- includes structured, semi-structured, and unstructured data
5 V’s of Big Data
volume
variety
veracity
value
velocity
(variability is often counted as a sixth V)
Variety in BD
- various formats
- structured, semi-structured, unstructured
Structured data
transactional data in a traditional DBMS, organized into rows and columns
Semi-structured data
combination of structured and unstructured data
eg) email, CSV files, log files, NoSQL databases
Unstructured data
text, images, videos, social media
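A small Python sketch contrasting the three data types; the field names and values are invented for illustration.

```python
import json

# Structured: fixed schema, rows and columns, as in a traditional DBMS table
structured_row = {"order_id": 1001, "customer_id": 42, "amount": 19.99}

# Semi-structured: self-describing but flexible, e.g. JSON from a log or NoSQL store
semi_structured = json.loads(
    '{"user": "alice", "events": [{"type": "click", "ts": "2024-01-01T10:00:00"}]}'
)

# Unstructured: no predefined schema (free text, images, video, social media posts)
unstructured = "Loved the product, will definitely order again!"

print(structured_row["amount"])               # direct column-style access
print(semi_structured["events"][0]["type"])   # nested, optional fields
print(len(unstructured.split()))              # needs parsing/NLP to extract meaning
```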
Variability in Big Data
- must adapt to ever-changing data patterns
- inconsistencies and outliers need to be detected
- data can be used in many different ways, for different purposes
4 BD Technologies
- Data storage: Apache (Hadoop, Cassandra)
- Data mining: RapidMiner
- Data analytics: Apache Spark
- Data visualization: Tableau
Apache
- open source
- American nonprofit (Apache Software Foundation) formed in 1999
Apache Hadoop
- written in Java
- supports processing of large data sets
- can store large volumes of all kinds of data in its distributed file system (HDFS)
- parallel processing
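A minimal sketch of Hadoop-style parallel processing using Hadoop Streaming, which lets any stdin/stdout program act as mapper and reducer; the input/output paths and exact jar location are placeholders.

```python
# Word count with Hadoop Streaming: mapper and reducer are plain Python scripts.
# Run roughly as: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#   -mapper mapper.py -reducer reducer.py -input <hdfs-in> -output <hdfs-out>
import sys

# --- mapper.py: emit "<word>\t1" for every word in the input split ---
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# --- reducer.py: sum counts per word (Hadoop sorts mapper output by key) ---
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```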
Cassandra
- apache NoSQL database management system
- handles large amounts of data across many commodity servers
- high availability with no single point of failure
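A minimal sketch using the DataStax cassandra-driver package from Python; the contact point, keyspace, and table are made up for illustration (a production cluster would use a replication factor of 3 across several nodes).

```python
from uuid import uuid4
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])   # contact point: any node works, no single point of failure
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        order_id uuid PRIMARY KEY, customer text, amount double
    )
""")

session.execute(
    "INSERT INTO shop.orders (order_id, customer, amount) VALUES (%s, %s, %s)",
    (uuid4(), "alice", 19.99),
)
for row in session.execute("SELECT customer, amount FROM shop.orders LIMIT 5"):
    print(row.customer, row.amount)

cluster.shutdown()
```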
Commodity servers
use of many inexpensive, readily available servers/components for parallel computing
get a large amount of computation at low cost
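Toy single-machine sketch of the same idea: split the work into chunks and process them in parallel on cheap, ordinary cores; in a real cluster each worker would be a whole commodity server.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk):
    """Count words in one chunk (the work done by one cheap worker)."""
    return Counter(chunk.split())

if __name__ == "__main__":
    words = ("big data needs clusters of small cheap machines " * 2000).split()
    # Split the workload into word-aligned chunks, one per task
    chunks = [" ".join(words[i:i + 4000]) for i in range(0, len(words), 4000)]

    with ProcessPoolExecutor() as pool:          # one process per available core
        partial_counts = pool.map(count_words, chunks)

    total = Counter()
    for partial in partial_counts:               # merge partial results ("reduce")
        total.update(partial)
    print(total.most_common(3))
```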
Hadoop vs Cassandra
Cassandra is NoSQL and good for
- high-speed, online transactional data
- online web/mobile applications
eg) e-commerce and inventory management, personalization, recommendations, IoT and edge computing
Hadoop is good for
- big data analytics
- warehousing
- data lake uses
- cold and historical data
eg) retail analytics, financial risk analysis, trading, forecasting, very high-volume social media data
-> can complement each other
Apache Spark
- unified analytics engine for large-scale data processing
- leading platform for large scale SQL, batch processing, stream processing and ML
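A minimal PySpark sketch (pip install pyspark) showing SQL-style batch processing with in-memory caching; the file name and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Batch: read a CSV, keep it in memory, aggregate with SQL-style operations
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.cache()                                        # in-memory processing across actions
summary = df.groupBy("region").agg(F.sum("amount").alias("total"))
summary.show()

spark.stop()
```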
Hadoop vs Spark
Spark:
- in-memory processing (much faster)
- both batch processing and real-time
- fewer lines of code
- easier authentication
- fault tolerance
- better analytics
- more flexibility
Hadoop:
- slower, disk-based data processing
- only batch processing
- more lines of code
- written in Java, takes longer to execute
- difficult to manage authentication -> requires Kerberos
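To make "fewer lines of code" concrete: the same word count that took two scripts in the Hadoop Streaming sketch above fits in a few lines of PySpark (the input path is a placeholder).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
counts = (spark.sparkContext.textFile("input.txt")   # distributed read
          .flatMap(lambda line: line.split())        # split lines into words
          .map(lambda word: (word, 1))               # (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))          # sum per word
print(counts.take(10))
spark.stop()
```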
RapidMiner
comprehensive data science platform with visual workflow design and full automation
Tableau
data visualization and BI tool used for reporting and analyzing vast volumes of data
Big Data Life Cycle (5)
- collection
- preprocessing
- storage
- analysis
- interpretation
BD preprocessing (4)
- integration: combine sources into a uniform view
- cleaning: handle outliers, noise, missing values
- reduction: reduce dimensions/volume
- transformation: convert into a suitable format
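A compressed pandas sketch of the four preprocessing steps; the tables, columns, and thresholds are invented for illustration.

```python
import pandas as pd

# Integration: merge data from multiple sources into one uniform table
sensors = pd.DataFrame({"device_id": [1, 2, 3], "temp_c": [21.5, 22.1, 95.0]})
devices = pd.DataFrame({"device_id": [1, 2, 3], "site": ["A", "A", "B"]})
df = sensors.merge(devices, on="device_id")

# Cleaning: drop readings outside a plausible range (outliers)
df = df[df["temp_c"].between(-20, 60)]

# Reduction: keep only the dimensions the analysis needs
df = df[["site", "temp_c"]]

# Transformation: convert into the format the downstream tool expects
df["temp_f"] = df["temp_c"] * 9 / 5 + 32
print(df)
```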
def Data Science
interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
Data Science Life Cycle (7)
- business understanding
- data collection
- data preparation
- exploratory data analysis
- modelling
- model evaluation
- model deployment
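A compressed scikit-learn sketch of the preparation -> modelling -> evaluation stages on a built-in toy dataset (business understanding, EDA, and deployment are left out).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                  # data collection
X_train, X_test, y_train, y_test = train_test_split(        # data preparation
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)                       # scale features
model = LogisticRegression(max_iter=5000).fit(               # modelling
    scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_test))              # model evaluation
print("accuracy:", accuracy_score(y_test, preds))
```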
Big Data Challenges
- lack of understanding and expertise
- poor data quality and data silos
- issues in scaling
- variety of BD technologies
- incorrect integration
- expensive
- real-time big data problems
- data verification
- organizational resistance
- security and privacy
Applications
any industry:
banking, media, insurance, healthcare, transportation, energy, education, manufacturing