Lecture 8 Flashcards
Big Data
Data that cannot be stored and processed on a single device.
2 aspects of big data
- Distributed storage (Distributed File Systems / Sharding)
- Distributed processing (and handling derived data)
Processing on Big-Data
Not as easy as writing an SQL query and expecting fast results.
- Exploration
- Analytics
- Processing
- Publishing
Database architecture
Used by management to monitor business performance.
- Dashboards are built once in software based on information needs.
- View
(REST API / SDK) - Controller
(Database API / SQL) - Model / DB
(Database API / SQL) - Power BI, etc
Data warehousing
Collecting data for reporting purposes.
- Make static snapshots to send to a central data warehouse.
- Extract, transform, load (ETL)
- Staging - preparing data for reporting and integration.
- Takes load off operational systems.
- Enriches information by combining systems.
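The ETL flow above can be sketched in a few lines of plain Python. This is a toy illustration with hypothetical data (the source systems, keys, and field names are made up), showing how extracted rows from two systems are transformed into one schema and loaded into a central store:

```python
# Hypothetical source systems: a CRM and a sales system (extract step).
crm_rows = [{"customer": "Ada", "region": "EU"}]
sales_rows = [{"customer": "Ada", "amount": 120.0}]

def etl(crm, sales):
    # Transform/staging: join the two systems on the customer key.
    by_customer = {r["customer"]: dict(r) for r in crm}
    warehouse = []
    for s in sales:
        row = dict(by_customer.get(s["customer"], {}))
        row.update(s)           # enrich CRM data with sales figures
        warehouse.append(row)   # load into the central warehouse
    return warehouse

print(etl(crm_rows, sales_rows))
# -> [{'customer': 'Ada', 'region': 'EU', 'amount': 120.0}]
```

Because the warehouse holds a static snapshot, this join runs off the operational systems and does not add load to them.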
Scaling database architecture
- Buy bigger machines (scale up)
  - Effectiveness of upgrading hardware is limited, and it is expensive.
  - The machine remains a single point of failure.
- Buy more machines (scale out)
  - Create replicas of instances.
Data Processing
ETL to Big Data
- Relational databases
- ETL
- Big Data
- Cloud Solutions
HDFS
Hadoop Distributed File System
- Storage layer of the Hadoop big data system
- Based on the Google File System (GFS)
- Fault tolerant distributed file system
- Designed to turn a computing cluster (a large collection of loosely connected compute nodes) into a massively scalable pool of storage.
- Provides redundant storage for massive amounts of data
Properties of HDFS
- Made to be resilient and fail-proof: when a data node writes a data block to disk, it also replicates that block to another data node.
- Data nodes can be made rack-aware, since redundancy does not help when both replicas are written to disk drives in the same rack.
- The name node tells the data nodes where to write data.
- The name node also tells your application which data nodes hold the file.
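The name node / data node roles can be illustrated with a toy model. This is not the HDFS API, just a sketch of the idea: a name node assigns each block to several distinct data nodes (round-robin here; real HDFS placement is also rack-aware) and remembers the mapping so clients can locate blocks later. All names are illustrative:

```python
import itertools

class NameNode:
    """Toy name node: tracks which data nodes hold which block."""

    def __init__(self, data_nodes, replication=3):
        self.replication = replication
        self.block_map = {}                    # block id -> list of data nodes
        self._cycle = itertools.cycle(data_nodes)

    def write_block(self, block_id):
        # Pick `replication` data nodes for this block (round-robin sketch).
        targets = [next(self._cycle) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

    def locate(self, block_id):
        # Tell the application which data nodes hold the block.
        return self.block_map[block_id]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.write_block("blk_001")
print(nn.locate("blk_001"))   # -> ['dn1', 'dn2', 'dn3']
```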
HBASE Column Family
Column families enable efficient sharding and compression, because related columns are stored together.
Streaming Data
Imagine having to process incoming messages from:
- An MMORPG where players are moving around, finding gold and loot.
- Uber drivers all over a country moving around.
We need real-time processing of information.
Apache Kafka
Functions like a distributed publish-subscribe messaging system.
Apache Kafka features
- Durability
- Scalability
- High availability
- High throughput (scalable messaging system)
- Distributed, reliable publish-subscribe system
- Designed as a message queue and implemented as a distributed log service.
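The "message queue on top of a distributed log" idea can be sketched in plain Python. This is not the Kafka API, just an in-memory model of its core concept: a topic is an append-only log, and each consumer tracks its own offset, so many subscribers can read the same messages independently:

```python
from collections import defaultdict

class Topic:
    """Toy publish-subscribe topic backed by an append-only log."""

    def __init__(self):
        self.log = []                      # the distributed log (here: a list)
        self.offsets = defaultdict(int)    # consumer id -> next unread offset

    def publish(self, message):
        self.log.append(message)           # messages are only ever appended

    def poll(self, consumer_id):
        # Return every message this consumer has not yet seen,
        # then advance its offset.
        start = self.offsets[consumer_id]
        messages = self.log[start:]
        self.offsets[consumer_id] = len(self.log)
        return messages

t = Topic()
t.publish("player moved")
t.publish("gold found")
print(t.poll("dashboard"))   # -> ['player moved', 'gold found']
print(t.poll("dashboard"))   # -> []  (offset already advanced)
```

Durability in real Kafka comes from persisting and replicating this log across brokers; the per-consumer offsets are what make the same log serve many independent subscribers.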
Batch processing
Processing of blocks of data that have already been stored over a period of time.
- Often on disk.
- Hadoop and MapReduce
e.g. processing all transactions performed by a financial firm in a week.
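The batch model is what MapReduce formalises. A minimal pure-Python sketch of the classic word-count job (not Hadoop itself; the phase names follow the MapReduce pattern): map emits `(key, 1)` pairs, a shuffle/sort groups them by key, and reduce aggregates each group:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the stored batch.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: bring all pairs with the same key together.
    grouped = sorted(pairs, key=itemgetter(0))
    # Reduce: sum the counts for each key.
    return {word: sum(c for _, c in group)
            for word, group in groupby(grouped, key=itemgetter(0))}

batch = ["buy sell", "sell sell"]
print(reduce_phase(map_phase(batch)))   # -> {'buy': 1, 'sell': 3}
```

In Hadoop the map and reduce tasks run on many nodes in parallel, and the shuffle moves intermediate pairs between them; the logic per record is the same.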
Stream processing
Process data in real-time as they arrive and detect conditions within a small period of time from the point of receiving the data.
- Often in memory.
- Multiple publishers.
- Concurrency
- Kafka and Spark Streaming
e.g. fraud detection, social media sentiment analysis, log monitoring, analysing customer behaviour.
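A fraud-detection-flavoured sketch of stream processing (pure Python, not Kafka or Spark Streaming; the window size and threshold are made-up parameters): events are handled one by one as they arrive, only a small sliding window is kept in memory, and a condition is flagged the moment it holds:

```python
from collections import deque

class SlidingWindow:
    """Toy stream processor: flag when recent spend exceeds a threshold."""

    def __init__(self, size=3, threshold=1000):
        self.window = deque(maxlen=size)   # only recent events stay in memory
        self.threshold = threshold

    def on_event(self, amount):
        self.window.append(amount)
        # Detect the condition within the window, at arrival time.
        return sum(self.window) > self.threshold

w = SlidingWindow(size=3, threshold=1000)
print([w.on_event(a) for a in [200, 300, 400, 900]])
# -> [False, False, False, True]
```

Contrast with the batch example: nothing is stored for later, and the answer is available per event rather than per job run.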
Data Exploration
- Data exploration is about describing the data by means of statistical and visualisation techniques.
- We explore in order to understand the features and bring important features to our models.
Data Exploration with big data
- We cannot load all the data into memory.
- Some operations take too much time to run on a single machine.
Exploring using Pandas
- Pandas is an implementation of the DataFrame data structure.
- PySpark offers the same DataFrame abstraction to distribute computation across a cluster.
- Dask provides another distributed DataFrame alternative.
When to use DataFrame?
The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields. They are similar to SQL tables or the spreadsheets you work with in Excel or Calc.
Why use DataFrame?
A DataFrame can store heterogeneous data (columns of different types), while a Series stores homogeneous data (a single type).
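This difference is easy to see in Pandas itself (a short sketch assuming pandas is installed; the column names and values are made up): the DataFrame mixes types across columns, while each individual column is a homogeneous Series with one dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Bob"],      # strings
    "age": [36, 41],             # integers
    "score": [9.5, 7.2],         # floats
})

print(df.dtypes)        # object, int64, float64: a heterogeneous table
print(df["age"].dtype)  # int64: one column is a homogeneous Series
```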
Distributed DataFrame
- Original DataFrame
- Split (the data frame into in-memory manageable chunks)
- Apply (the transformation to each chunk independently)
- Combine (each chunk back into a data frame)
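The split-apply-combine steps above can be sketched with Pandas on a single machine (assuming pandas is installed; the data and the per-chunk transformation are illustrative). PySpark and Dask distribute exactly this pattern, running the apply step on chunks spread across a cluster:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 5]})

# Split: break the frame into manageable chunks, one per key.
chunks = [group for _, group in df.groupby("city")]

# Apply: transform each chunk independently (here: attach the chunk total).
totals = [g.assign(total=g["sales"].sum()) for g in chunks]

# Combine: stitch the chunks back into one data frame.
combined = pd.concat(totals).sort_index()
print(combined)
```

Because each chunk is transformed independently, the apply step needs no coordination between workers, which is what makes the pattern scale.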