Big Data Flashcards
Big data
A term coined for the masses of data generated through the internet and the practice of deriving business insight from them.
Where does all this data come from?
Web browsing patterns
RFID
Internet of things
Social media
Smartphones
Biomedical devices
Scaling up vs scaling out
Scaling up: migrating to a larger system with better CPUs and more storage space (increases costs)
Scaling out: spreading the workload across several servers
Clustering
A group of low-cost servers sharing the workload
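A minimal sketch of how a workload might be shared across a cluster when scaling out. The hash-modulo assignment, the function names and the `user-N` keys are all assumptions made for illustration; real clusters use a variety of partitioning schemes.

```python
import zlib
from collections import defaultdict


def assign_server(key: str, num_servers: int) -> int:
    """Pick a server index for a key using a stable hash (illustrative scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def scale_out(keys, num_servers):
    """Group keys by the server that would handle them."""
    placement = defaultdict(list)
    for key in keys:
        placement[assign_server(key, num_servers)].append(key)
    return dict(placement)


# Hypothetical workload: 1000 keys spread over 4 low-cost servers.
keys = [f"user-{i}" for i in range(1000)]
placement = scale_out(keys, num_servers=4)
for server, assigned in sorted(placement.items()):
    print(f"server {server}: {len(assigned)} keys")
```

Each server ends up with only part of the keys, so capacity grows by adding servers rather than by buying a bigger one.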
Velocity
The rate at which new data enters the system and the rate at which it must be processed
Variety
Big data captures data in the form that it naturally exists in:
Structured, unstructured, semi-structured data
Structured data
Organised to fit into a predefined data model
Unstructured data
Not organised to fit into a predefined data model, e.g. videos, images, text…
Semi-structured data
Has elements of both structured and unstructured data
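The three forms can be contrasted with a small sketch. The sample records, field names and the `shared_fields` helper are invented for illustration only.

```python
import json

# Structured: every row follows the same predefined schema (id, name, age).
structured_rows = [
    (1, "Ada", 36),
    (2, "Linus", 28),
]

# Semi-structured: fields are self-describing, but records need not share them.
semi_structured = [
    json.loads('{"id": 1, "name": "Ada", "tags": ["admin"]}'),
    json.loads('{"id": 2, "name": "Linus", "address": {"city": "Helsinki"}}'),
]

# Unstructured: free text with no data model at all.
unstructured = "Ada emailed Linus about the server outage on Tuesday."


def shared_fields(records):
    """Return the field names that appear in every record."""
    return set.intersection(*(set(r) for r in records))


print(shared_fields(semi_structured))
```

The JSON records share some fields (the structured part) while each also carries fields the other lacks (the unstructured part).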
HDFS and its assumptions
Hadoop Distributed File System
Assumes:
• files will be very large (so it divides them into blocks)
• write once, read many (simplifies concurrency issues and improves overall throughput)
• streaming access
• fault tolerance (replication means processing can continue even if one replica fails)
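The block and replication ideas above can be sketched as a toy model. The 128 MB block size and replication factor of 3 are assumed values for this example, the node names are hypothetical, and the round-robin placement is a simplification: it is not HDFS's actual placement policy, which is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size for this sketch
REPLICATION = 3      # assumed number of replicas per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
num_blocks = split_into_blocks(1000)  # a 1000 MB file
placement = place_replicas(num_blocks, nodes)
print(num_blocks, readable_after_failure(placement, "node-b"))
```

Because every block lives on several nodes, losing any single node still leaves at least one copy of each block to read from.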