Big Data Management Flashcards
Characterization (3Vs)
Variety: Different forms of data
Volume: Petabytes of data
Velocity: Real-time data
Big Data Analysis Pipeline
- Data acquisition: Select important data to be stored
- Information extraction & cleaning: Pull out required information from underlying sources
- Data integration & aggregation & representation: Full integration is not always possible. Problems: the origin of derived data can no longer be tracked, and choosing the storage is complex
- Modeling & analysis: Big data reveals hidden patterns and knowledge. Simple models can show the big picture
- Interpretation: Annotate base data and discuss interpretation of metadata
Data Lake requirements
- Secure
- Scalable
- Reliable
- Throughput
- Low Latency
- Store details
- Store native format
- All sources
Advantages Cloud
- Cost
- Extensibility
- Reliability
- Workload
- Sharing
Disadvantages Cloud
- Custom software
- Networking
- Maintenance
- Security
- Parallelization not always possible
Three-Tier Server
Presentation ➔ Logic ➔ Data
Design Cloud
- Transparent
- Flexible
- Reliable
- Performant
- Scalable
Fallacies of cloud
- Network is reliable
- Latency is zero
- Bandwidth is infinite
- Network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Cloud characteristics
- Dynamic
- Massively scalable
- Multi-tenant
- Self-service
- Usage-based pricing model
- IP-based architecture
Google File System
Store files as chunks across chunk servers, replicate each chunk, master node manages metadata and controls access
Map Reduce
- Split data and perform mapping in parallel
- Map: extract data as key/value pairs
- Group by key
- Reduce each group
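The steps above can be sketched as a small word count. This is a single-machine illustration, not a real MapReduce framework: the function names (`map_chunk`, `shuffle`, `reduce_groups`, `word_count`) and the `n_chunks` parameter are made up for this example, and a thread pool stands in for the cluster of workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map: emit one (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group by key: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce: collapse each group of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_chunks=2):
    # Split the input into roughly equal chunks (ceiling division).
    size = max(1, -(-len(lines) // n_chunks))
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Run the map step on all chunks in parallel (threads as stand-in workers).
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big", "data lake"]))
```

Only the map step runs in parallel here; in a real system the reduce step is usually distributed by key as well.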
ACID
- Atomicity
- Consistency
- Isolation
- Durability
CAP
- Consistency
- Availability
- Partition-tolerance
➔ Not all three possible at the same time
BASE
- Basically Available
- Soft state
- Eventual Consistency
Types of NoSQL storage
- Key/Value
- Wide-column
- Document database
- Graph database
Steps of machine learning
Data ➔ Preprocessing ➔ Feature extraction ➔ Learning ➔ Testing ➔ Analysis
Decision tree
- Built greedily, top-down
- At each node, choose the attribute with the highest information value (e.g. information gain)
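One way to make the attribute choice concrete is entropy-based information gain. The sketch below assumes that measure (the notes only say "information value"), and the helper names (`entropy`, `information_gain`, `best_attribute`) and the toy data are hypothetical. It shows only the greedy choice for a single node, not full tree construction.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Greedy step: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


if __name__ == "__main__":
    rows = [
        {"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rain", "windy": False},
        {"outlook": "rain", "windy": True},
    ]
    labels = ["no", "no", "yes", "yes"]
    print(best_attribute(rows, labels, ["outlook", "windy"]))
```

A full tree would apply `best_attribute` recursively to each partition until the labels in a node are pure or no attributes remain.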
K-means clustering
- Goal: minimal variance within clusters
- Choose random start points (centroids), assign each data point to the nearest centroid, recalculate the centroids, iterate until the assignments no longer change
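The loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of numbers, distance is squared Euclidean, and the names `kmeans`, `sq_dist`, `mean`, the `seed` parameter, and the iteration cap are choices made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Random start points: pick k distinct data points as initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recalculate centroids; an empty cluster keeps its old centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged, nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    centers, groups = kmeans(data, k=2)
    print(sorted(centers))
```

The result depends on the random start points; on less clearly separated data, different seeds can converge to different local minima of the within-cluster variance.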