Big Data, Lambda Architecture, & SQL Flashcards
Lambda architecture
A data processing architecture designed to handle massive volumes of data, commonly associated with big data applications. Introduced by Nathan Marz, it addresses the challenge of supporting both real-time and batch processing in a single big data system by combining the two into one comprehensive, scalable solution for processing large datasets.
Purposes of Lambda Architecture
Real-Time Data Processing
Batch Data Processing
Scalability
Fault Tolerance
Consistency of Results
Fault Tolerance in Big Data
Big data systems are distributed and complex, making them prone to failures. Lambda architecture’s fault tolerance ensures that the system can recover from failures and maintain data consistency.
Consistency of Results
Lambda architecture guarantees that the data processed by both real-time and batch layers eventually converges, ensuring consistent results across the entire system.
Scalability in Big Data Systems
Big data systems need to scale horizontally to accommodate the increasing volume of data and processing requirements. Lambda architecture’s design allows for horizontal scaling of both the real-time and batch processing components.
Batch Data Processing
In addition to real-time data, big data systems often deal with historical data and large datasets that require batch processing. Lambda architecture includes a batch processing layer to handle these vast amounts of data efficiently.
Real-Time Data Processing
Big data systems often receive continuous streams of data from various sources, such as sensors, social media, or clickstreams. Lambda architecture incorporates a real-time processing layer to handle these streams of data and provide low-latency processing and analytics.
Three Layers of Lambda Architecture
- Batch Layer
- Speed Layer
- Serving Layer
Batch Layer for Large-Scale Data Processing
Big data applications deal with massive volumes of data that cannot be processed in real time due to computational limitations. The Batch Layer is designed to handle these large-scale datasets efficiently by breaking them into manageable batches and processing them in parallel.
Batch Layer for Historical Data Processing
The Batch Layer is well-suited for processing historical data, which accumulates over time. It enables the system to process and analyze the entire historical dataset to produce accurate and comprehensive batch views.
Batch Layer for Precomputing Results
The Batch Layer precomputes batch views by running computationally intensive algorithms and data processing tasks on the entire dataset. This precomputation provides consistent and reliable results for later queries.
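The idea of precomputing a batch view can be sketched in a few lines. This is a toy stand-in, not a real implementation: a plain Python dict plays the role of the batch view store, and the `user`/`events` names are illustrative assumptions. In practice this computation would run on a framework such as Hadoop MapReduce or Spark over the full historical dataset.

```python
from collections import defaultdict

def compute_batch_view(events):
    """Precompute a batch view (total event count per user) by
    scanning the entire dataset, as the Batch Layer does."""
    view = defaultdict(int)
    for event in events:
        view[event["user"]] += 1
    return dict(view)

# Hypothetical historical dataset
events = [
    {"user": "alice"}, {"user": "bob"}, {"user": "alice"},
]
batch_view = compute_batch_view(events)
# batch_view == {"alice": 2, "bob": 1}
```

The key property is that the view is recomputed from scratch over all data, which is what makes batch results complete and consistent.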
Batch Layer for Fault Tolerance
The Batch Layer’s batch processing is typically executed on distributed data processing frameworks, such as Apache Hadoop MapReduce or Apache Spark. These frameworks provide fault tolerance by handling failures and ensuring that the batch processing completes even in the presence of node failures.
Batch Layer for Scalability
The Batch Layer can scale horizontally by distributing data and computation across multiple nodes in a cluster. As the dataset grows, additional nodes can be added to handle the increased workload, making it suitable for big data scenarios.
Batch Layer for Consistent Results
By processing the entire dataset in batches, the Batch Layer ensures that the results are consistent and complete. It avoids the issues of partial or incomplete data views that may arise in real-time processing.
The Batch Layer
The Batch Layer is one of the three main layers designed to handle big data processing. It is responsible for processing large volumes of historical data in batches. The Batch Layer’s primary function is to compute batch views or batch-processing results from the entire dataset.
The Batch Layer’s primary goal is to provide a comprehensive and accurate view of historical data, which complements the real-time processing provided by the Speed Layer in the Lambda Architecture. The Batch Layer’s precomputed batch views are stored and updated periodically, enabling low-latency query processing and efficient retrieval of historical data.
The Speed Layer
The Speed Layer is one of the three main layers designed to handle real-time data processing in big data applications. The Speed Layer is responsible for processing and analyzing continuous streams of data in near real-time, providing low-latency results and insights.
The Speed Layer’s primary focus is to process and analyze real-time data streams, ensuring that the system can respond promptly to incoming events and provide real-time insights and analytics. By working in conjunction with the Batch Layer in the Lambda Architecture, the Speed Layer enables big data systems to handle both real-time and historical data efficiently, providing a complete and up-to-date view of the data for various use cases.
Speed Layer for Real-Time Data Processing
Big data applications often deal with continuous streams of data from various sources, such as sensor data, social media feeds, logs, or clickstreams. The Speed Layer is designed to handle these streams of data in real-time, ensuring that data is processed and analyzed as it arrives.
Speed Layer for Low-Latency Processing
The Speed Layer aims to provide low-latency results to support real-time decision-making and provide timely insights. It is essential for applications where immediate actions or responses are required based on incoming data.
Speed Layer for Event-Driven Architecture
The Speed Layer is based on an event-driven architecture, where it continuously processes events as they occur. It responds to events as they arrive, making it well-suited for time-sensitive and dynamic data scenarios.
Speed Layer for Stream Processing
The Speed Layer utilizes stream processing technologies, such as Apache Storm or Apache Flink, to process and analyze data streams efficiently. These technologies enable parallel processing and support fault tolerance in distributed environments.
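As a rough illustration of what stream processors like Storm or Flink do at scale, here is a minimal tumbling-window count in plain Python. The generator consumes one event at a time and emits a result per fixed-size window; the `clicks` data and window size are made-up examples, and real engines add parallelism, state management, and fault tolerance on top of this idea.

```python
from collections import Counter

def tumbling_window_counts(stream, window_size):
    """Count events per key over fixed-size (tumbling) windows,
    emitting one result dict per completed window."""
    window = Counter()
    for i, key in enumerate(stream, start=1):
        window[key] += 1
        if i % window_size == 0:
            yield dict(window)
            window.clear()
    if window:  # flush a final partial window, if any
        yield dict(window)

clicks = ["home", "home", "cart", "home", "cart", "cart"]
results = list(tumbling_window_counts(clicks, 3))
# results == [{"home": 2, "cart": 1}, {"home": 1, "cart": 2}]
```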
How the Speed Layer Complements the Batch Layer
While the Batch Layer handles historical data processing, the Speed Layer complements it by processing real-time data. Both layers work together to provide a comprehensive view of the data, including both historical and up-to-date information.
Speed Layer for Data Integration
The Speed Layer integrates with various data sources to ingest real-time data streams. It can process and aggregate the data, enrich it with contextual information, and make it available for real-time analytics.
Speed Layer for Incremental Updates
Unlike the Batch Layer, which processes the entire dataset in batches, the Speed Layer performs incremental updates on the data as new events arrive. This enables it to provide real-time insights and responses.
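The contrast with the Batch Layer can be made concrete with a small sketch (the class and field names here are illustrative assumptions): instead of rescanning all data, the real-time view is updated in place as each event arrives.

```python
class RealtimeView:
    """Speed Layer sketch: incrementally update a per-user count
    on each arriving event, rather than recomputing over all data."""
    def __init__(self):
        self.counts = {}

    def update(self, event):
        user = event["user"]
        self.counts[user] = self.counts.get(user, 0) + 1

view = RealtimeView()
for e in [{"user": "alice"}, {"user": "alice"}, {"user": "bob"}]:
    view.update(e)
# view.counts == {"alice": 2, "bob": 1}
```

Each update costs O(1) regardless of how much data has been seen, which is what makes low-latency processing possible.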
Speed Layer for Complex Event Processing
The Speed Layer can handle complex event processing tasks, identifying patterns, correlations, and anomalies in real-time data streams.
The Serving Layer in Lambda Architecture
The Serving Layer is responsible for combining and serving the results from both the Batch Layer and the Speed Layer to provide a unified and up-to-date view of the data. It plays a crucial role in Lambda Architecture by ensuring that big data applications can efficiently serve both historical and real-time data, enabling users to make informed decisions based on a comprehensive, current data view.
Serving Layer for Unified Data Views
The Serving Layer combines the precomputed batch views from the Batch Layer and the real-time views from the Speed Layer to create a comprehensive and unified data set. This unified view ensures that queries can access both historical and real-time data.
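A common way to merge the two views at query time can be sketched as follows; this is a simplified assumption-laden example (the counts and the `serve_query` helper are invented for illustration), where the batch view holds totals up to the last batch run and the real-time view holds only the events since then.

```python
def serve_query(user, batch_view, realtime_view):
    """Serving Layer sketch: answer a query by merging the
    precomputed batch view with the incremental real-time view."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

batch_view = {"alice": 100, "bob": 40}  # from the last Batch Layer run
realtime_view = {"alice": 3}            # events seen since that run
total = serve_query("alice", batch_view, realtime_view)
# total == 103
```

When the next batch run completes and absorbs those recent events, the corresponding real-time entries are discarded, which is how the two layers eventually converge on the same answer.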
Serving Layer for Low-Latency Query Processing
The Serving Layer provides low-latency query processing by leveraging the real-time views generated by the Speed Layer. This allows users or applications to receive up-to-date data insights without significant delays.
Serving Layer for Consistency of Results
The Serving Layer ensures that the results from both the Batch Layer and the Speed Layer eventually converge. This convergence guarantees that the data presented to users or applications is consistent and reflects the latest available information.
Serving Layer for Scalability
The Serving Layer needs to handle query requests efficiently, even in the face of high data volumes and complex queries. It should be designed for horizontal scalability to accommodate increasing user demand.
Serving Layer for User-Facing Interface
The Serving Layer provides an interface for users and applications to interact with the data. It offers APIs, web services, or other methods for querying and accessing the processed data.
Serving Layer for Data Visualization and Reporting
The Serving Layer may also include tools and components for data visualization and reporting, allowing users to gain insights and analyze data in a user-friendly manner.
Serving Layer for Load Balancing
To maintain low-latency query response times, the Serving Layer may use load balancing techniques to distribute incoming queries across multiple servers or nodes.
Four Techniques for Load Balancing
- Round-Robin Load Balancing: Requests are distributed sequentially across available resources in a circular manner.
- Weighted Load Balancing: Each resource is assigned a weight based on its capacity or capabilities, and requests are distributed proportionally.
- Least Connections Load Balancing: Requests are sent to the resource with the fewest active connections to evenly distribute the workload.
- Dynamic Load Balancing: Load balancers continuously monitor resource utilization and dynamically adjust the distribution of workloads based on real-time data.
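Two of these techniques, round-robin and least-connections, are simple enough to sketch directly. The class and server names below are illustrative only; production load balancers (hardware or software) add health checks, weights, and concurrency handling on top of this core logic.

```python
import itertools

class RoundRobinBalancer:
    """Distribute requests sequentially across servers in a cycle."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request to `server` completes."""
        self.active[server] -= 1

rr = RoundRobinBalancer(["a", "b", "c"])
order = [rr.pick() for _ in range(4)]
# order == ["a", "b", "c", "a"]
```

Weighted and dynamic balancing follow the same pattern, but bias the `pick` decision by capacity weights or by live utilization metrics respectively.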
Six Purposes of Load Balancing
- Optimizing Resource Utilization
- Reducing Processing Bottlenecks
- Minimizing Response Times
- Fault Tolerance and High Availability
- Scaling and Elasticity
- Data Distribution in Distributed Storage
Distributed Storage System Technologies
- Apache Cassandra
- Apache HBase
- Amazon S3 (Simple Storage Service)
- Google Cloud Storage
- Microsoft Azure Blob Storage
- Apache GlusterFS
- Ceph
- IBM Spectrum Scale (GPFS)
- OpenStack Swift
- Red Hat Gluster Storage
Load Balancing
Load balancing is a technique used in big data and distributed computing systems to distribute processing workloads evenly across multiple computing resources (e.g., servers, nodes, or clusters). The goal of load balancing is to optimize resource utilization, enhance performance, and prevent overloading specific components, ensuring that the system operates efficiently and reliably.
Load balancing can be implemented using dedicated load balancer hardware or software, as well as through software-defined load balancers in cloud environments. It is a fundamental technique in building scalable, fault-tolerant, and high-performance big data systems that efficiently handle the challenges posed by large data volumes and processing requirements.
Technologies used in the Batch Layer
- Apache Oozie
- Cascading
- Apache Kylin
- Apache Beam
- Apache Crunch
- Apache Pig
- Apache Hive
- Apache Flink
- Apache Spark
- Apache Hadoop MapReduce
Technologies used in the Speed Layer
- Apache Storm
- Apache Flink
- Apache Kafka
- Apache Samza
- Amazon Kinesis
- Google Cloud Dataflow
- Microsoft Azure Stream Analytics
- Spark Streaming
- NATS
- RabbitMQ