Infrastructure & Architecture Flashcards
scale-up (vertical scaling)
upgrade the existing machine with more CPU, RAM, or storage (SMP, MPP)
scale-out (horizontal scaling)
adding more machines to the network; capacity grows with the number of machines, in theory without limit
Symmetric Multi-processing (SMP)
traditional PC, multiple processors share same RAM and storage
Massively Parallel Processing (MPP)
Each processor has its own dedicated RAM and storage (vendor lock-in)
Cluster Architecture
many computers connected to work as a single system.
how are nodes connected in cluster
usually gigabit Ethernet, 8-64 nodes per rack
Only pro of MPP over cluster
faster message passing between nodes. Ideal for single, vertical solutions like data warehousing
commodity hardware
standardized, market priced hardware
distributed computing
tasks split into smaller units processed simultaneously across machines
challenges of distributed computing (4) (ARFA)
- Assigning split tasks
- Resource allocation
- Fault tolerance
- Aggregating results
solution to challenges of distributed computing
Use a framework to hide complexity of distributed computing from developers
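Sketch (not from the source deck): a toy "framework" in Python that handles the four challenges so the developer only writes `count_words`. Threads stand in for machines here, and `split`, `run_job`, and `count_words` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """The only code the developer writes: count words in one chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def split(lines, n_chunks):
    """Assigning: cut the input into roughly equal chunks, one per task."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_job(lines, workers=4):
    chunks = split(lines, workers)
    partials = []
    # Resource allocation: the pool decides which worker runs which chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(count_words, chunk) for chunk in chunks]
        for chunk, future in zip(chunks, futures):
            try:
                partials.append(future.result())
            except Exception:
                # Fault tolerance (simplified): rerun the failed chunk once.
                partials.append(count_words(chunk))
    # Aggregating: combine the partial results into one answer.
    return sum(partials)
```

Usage: `run_job(["a b c", "d e", "f"], workers=2)` splits the three lines into two chunks and sums the per-chunk counts.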
Grid computing
- Collect computer resources from multiple locations.
- Each node performs a different task
- Multi-purpose
What is HPC?
high-performance computing (compute-intensive, often GPU-accelerated)
What is NIST reference architecture for?
How a Big Data system should be designed
5 Components of NIST Reference Architecture
- Big Data Framework Provider
  - processing
  - storage
  - networking
- Data Provider
- Application Provider
- Data Consumer
- System Orchestrator
  - integrate components
  - meet goals
What happens in application component?
Programs are written to process data from collection to visualization and user access
What happens in Data Provider component besides just data collection? (4) (Suck My Ass!)
- Scrub sensitive info
- Create Metadata
  - data provenance
  - access rights
  - usage policies
- Enforce access and authorizations
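Sketch (not from the source deck): the three Data Provider duties as a few lines of Python. The `Dataset` class, the function names, and the list of sensitive keys are assumptions for illustration, not part of the NIST document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Dataset:
    records: list
    metadata: dict = field(default_factory=dict)


def scrub(records, sensitive_keys=("ssn", "email")):
    """Scrub: drop sensitive fields before the data leaves the provider."""
    return [{k: v for k, v in r.items() if k not in sensitive_keys}
            for r in records]


def publish(records, source, allowed_roles, usage_policy):
    """Create metadata: provenance, access rights, and usage policy."""
    return Dataset(
        records=scrub(records),
        metadata={
            "provenance": {
                "source": source,
                "collected_at": datetime.now(timezone.utc).isoformat(),
            },
            "access_rights": set(allowed_roles),
            "usage_policy": usage_policy,
        },
    )


def read(dataset, role):
    """Enforce: only roles listed in the metadata may read the records."""
    if role not in dataset.metadata["access_rights"]:
        raise PermissionError(f"role {role!r} may not read this dataset")
    return dataset.records
```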
4 components professor adds to NIST
- Analytical data store
- Analysis and reporting
- Real-time message ingestion (and buffer)
- Stream processing
Batch vs Stream
Running analytical algorithms on:
- Batch: large amounts of stored data
- Stream: a continuously arriving, potentially infinite amount of data, as soon as it's collected
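Sketch (not from the source deck): the same statistic, a mean, computed both ways in Python. `batch_mean` and `stream_means` are hypothetical names; the point is that the batch version needs all the data up front, while the stream version emits an updated result per event and never needs the input to end.

```python
def batch_mean(stored):
    """Batch: the full data set is already stored, so process it in one go."""
    return sum(stored) / len(stored)


def stream_means(source):
    """Stream: update the result as each value arrives; source may be infinite."""
    count, total = 0, 0.0
    for value in source:
        count += 1
        total += value
        yield total / count  # a fresh answer after every event
```

Usage: `batch_mean([2, 4, 6])` gives one answer at the end, while `stream_means(iter([2, 4, 6]))` yields a running mean after each value.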
Lambda Architecture
Two pipelines:
1. Cold path (batch)
2. Hot path (real-time)
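Sketch (not from the source deck): a toy event counter in Python with both paths. The cold path periodically recomputes a view from all stored events, the hot path covers events that arrived since the last batch run, and a query merges the two. `LambdaCounter` and its method names are assumptions for illustration.

```python
from collections import Counter


class LambdaCounter:
    def __init__(self):
        self.master = []             # append-only store of every event
        self.batch_view = Counter()  # cold path output
        self.speed_view = Counter()  # hot path: events since last batch run

    def ingest(self, key):
        """Every event goes to both paths."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Cold path: recompute the view from all stored data."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()  # the batch view now covers these events

    def query(self, key):
        """Merge the batch result with the real-time result."""
        return self.batch_view[key] + self.speed_view[key]
```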
Kappa Architecture
One pipeline: a hot path (real-time) only; all data is handled as a stream
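Sketch (not from the source deck): the same toy counter with a single path, in Python. There is no separate batch code; to recompute, the event log is replayed through the same stream logic. `KappaCounter` is a hypothetical name, and the in-memory list stands in for a durable event log.

```python
from collections import Counter


class KappaCounter:
    def __init__(self):
        self.log = []          # append-only event log
        self.view = Counter()  # result maintained by the stream processor

    @staticmethod
    def _process(view, key):
        """The one and only processing logic, applied per event."""
        view[key] += 1

    def ingest(self, key):
        self.log.append(key)
        self._process(self.view, key)

    def reprocess(self):
        """Rebuild the view by replaying the log through the same code."""
        fresh = Counter()
        for key in self.log:
            self._process(fresh, key)
        self.view = fresh
```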