Lecture 8 Flashcards
Big Data
Data that cannot be stored and processed on a single device.
2 aspects of big data
- Distributed storage (Distributed File Systems / Sharding)
- Distributed processing (and handling derived data)
Processing on Big-Data
Not as easy as writing an SQL Query and expecting fast results
- Exploration
- Analytics
- Processing
- Publishing
Database architecture
Used by management to monitor business performance.
- Dashboards are built once in software based on information needs.
- View
(Rest API / SDK) - Controller
(Database API / SQL) - Model / DB
(Database API / SQL) - Power BI, etc
Data warehousing
Collecting data for reporting purposes.
- Make static snapshots to send to a central data warehouse.
- Extract, transform, load (ETL)
- Staging - preparing data for reporting and integration.
- Takes load off operational systems.
- Enriches information by combining systems.
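The extract, transform, load steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes two in-memory SQLite databases standing in for the operational system and the warehouse, and the table names (`orders`, `customer_totals`) and sample rows are hypothetical.

```python
import sqlite3


def extract(conn):
    """Read a static snapshot of rows from the (hypothetical) operational table."""
    return conn.execute("SELECT customer, amount_cents FROM orders").fetchall()


def transform(rows):
    """Staging step: sum order amounts per customer and convert cents to units."""
    totals = {}
    for customer, amount_cents in rows:
        totals[customer] = totals.get(customer, 0) + amount_cents
    return sorted((customer, cents / 100) for customer, cents in totals.items())


def load(warehouse, rows):
    """Write the prepared rows into the warehouse's reporting table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT, total REAL)"
    )
    warehouse.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)
    warehouse.commit()


def run_etl():
    # Hypothetical operational system with a few sample orders.
    operational = sqlite3.connect(":memory:")
    operational.execute("CREATE TABLE orders (customer TEXT, amount_cents INTEGER)")
    operational.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("ann", 1250), ("bob", 400), ("ann", 750)],
    )
    # Separate database standing in for the central data warehouse.
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(operational)))
    return warehouse.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
```

Reports then query `customer_totals` in the warehouse, so the operational `orders` table is only read once per snapshot.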
Database Architecture
- Buy bigger machines
- Effectiveness of upgrading hardware is limited and expensive
- Single point of failure
- Buy more machines
- Create replicas for instances
Data Processing
ETL to Big Data
- Relational databases
- ETL
- Big Data
- Cloud Solutions
HDFS
Hadoop Distributed File System
- Storage layer for Hadoop BigData System
- Based on the Google File System (GFS)
- Fault tolerant distributed file system
- Designed to turn a computing cluster (a large collection of loosely connected compute nodes) into a massively scalable pool of storage.
- Provides redundant storage for massive amounts of data
Properties of HDFS
- Made to be resilient and fail-proof: when a data node writes its in-memory data to disk as data blocks, it also replicates that data to another server.
- Data nodes can be made rack aware, since redundancy does not work when you write data to two disk drives in the same rack.
- The name node tells the data nodes where to write data.
- The name node also tells your application which data nodes hold the file.
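The roles described above can be illustrated with a toy model. This is a simplified sketch and not HDFS's actual placement policy or API: the `NameNode` class, the `place_replicas` helper, the replication factor of 2, and the node and rack names are all assumptions made for illustration.

```python
def place_replicas(data_nodes, replication=2):
    """Pick data nodes for one block so that each replica is on a different rack.

    data_nodes: list of (node_name, rack_name) tuples (hypothetical names).
    """
    chosen, used_racks = [], set()
    for node, rack in data_nodes:
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct racks for the requested replication")


class NameNode:
    """Toy name node: decides where data is written and remembers where it is."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}  # file name -> data nodes holding its replicas

    def allocate(self, filename):
        """Tell the writer which data nodes should store the file."""
        self.block_map[filename] = place_replicas(self.data_nodes)
        return self.block_map[filename]

    def locate(self, filename):
        """Tell an application which data nodes hold the file."""
        return self.block_map[filename]
```

With nodes `dn1` and `dn2` on one rack and `dn3` on another, the sketch skips `dn2` so that losing a single rack cannot remove both copies.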
HBASE Column Family
Column families give way to optimal sharding and compression.
Streaming Data
Imagine having to process incoming messages from:
An MMORPG where players are moving around, finding gold and loot.
Uber drivers all over a country moving around.
We need real-time processing of information.
Apache Kafka
Functions like a distributed publish-subscribe messaging system.
Apache Kafka features
- Durability
- Scalability
- High availability
- High throughput (scalable messaging system)
- Distributed, reliable publish-subscribe system
- Designed as a message queue and implemented as a distributed log service.
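The "message queue on top of a log" idea can be sketched without any Kafka client. The `TopicLog` class below is a hypothetical in-memory toy, not the Kafka API: it has no partitions, replication, or durability, and only shows that messages are appended to a log while each consumer tracks its own read offset.

```python
class TopicLog:
    """Toy append-only log: messages are kept, consumers track their own offsets."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer name -> index of the next message to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self.messages.append(message)
        return len(self.messages) - 1

    def poll(self, consumer):
        """Return every message this consumer has not read yet."""
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:]
        self.offsets[consumer] = len(self.messages)
        return batch
```

Because reading does not delete anything, a second subscriber that joins later can still replay the log from the start, which is the main difference from a classic queue.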
Batch processing
Processing of blocks of data that have already been stored over a period of time.
- Often on disk.
- Hadoop and MapReduce
e.g. processing transactions that have been performed by a financial firm in a week.
Stream
Process data in real-time as they arrive and detect conditions within a small period of time from the point of receiving the data.
- Often in memory.
- Multiple publishers.
- Concurrency
- Kafka and Spark Streaming
e.g. fraud detection, social media sentiment analysis, log monitoring, analysing customer behaviour.
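The contrast between batch and stream processing can be sketched in plain Python. The function names, the sample amounts, and the fixed threshold are assumptions for illustration; real systems would use Hadoop/MapReduce or Kafka with Spark Streaming rather than these toy functions.

```python
def batch_total(stored_transactions):
    """Batch: process a complete, already-stored block of data in one pass."""
    return sum(stored_transactions)


def stream_alerts(incoming, threshold):
    """Stream: inspect each event as it arrives and flag it immediately."""
    for amount in incoming:
        if amount > threshold:
            yield amount
```

`batch_total` needs the whole week of transactions before it can answer, while `stream_alerts` is a generator that can flag a suspicious amount as soon as that single event is received.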