Big Data, Lambda Architecture, & SQL Flashcards
Lambda architecture
A data processing architecture designed to handle massive volumes of data, commonly associated with big data applications. Introduced by Nathan Marz, it addresses the challenge of supporting both real-time and batch processing in a single big data system by combining the two into one comprehensive, scalable solution for processing large datasets.
Purposes of Lambda Architecture
Real-Time Data Processing
Batch Data Processing
Scalability
Fault Tolerance
Consistency of Results
Fault Tolerance in Big Data
Big data systems are distributed and complex, making them prone to failures. Lambda architecture’s fault tolerance ensures that the system can recover from failures and maintain data consistency.
Consistency of Results
Lambda architecture guarantees that the data processed by both real-time and batch layers eventually converges, ensuring consistent results across the entire system.
Scalability in Big Data Systems
Big data systems need to scale horizontally to accommodate the increasing volume of data and processing requirements. Lambda architecture’s design allows for horizontal scaling of both the real-time and batch processing components.
Batch Data Processing
In addition to real-time data, big data systems often deal with historical data and large datasets that require batch processing. Lambda architecture includes a batch processing layer to handle these vast amounts of data efficiently.
Real-Time Data Processing
Big data systems often receive continuous streams of data from various sources, such as sensors, social media, or clickstreams. Lambda architecture incorporates a real-time processing layer to handle these streams of data and provide low-latency processing and analytics.
Three Layers of Lambda Architecture
- Batch Layer
- Speed Layer
- Serving Layer
Batch Layer for Large-Scale Data Processing
Big data applications deal with massive volumes of data that cannot be processed in real time due to computational limitations. The Batch Layer is designed to handle these large-scale datasets efficiently by breaking them into manageable batches and processing them in parallel.
Batch Layer for Historical Data Processing
The Batch Layer is well-suited for processing historical data, which accumulates over time. It enables the system to process and analyze the entire historical dataset to produce accurate and comprehensive batch views.
Batch Layer for Precomputing Results
The Batch Layer precomputes batch views by running computationally intensive algorithms and data processing tasks on the entire dataset. This precomputation provides consistent and reliable results for later queries.
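The idea of precomputing a batch view can be sketched in a few lines. This is a toy stand-in, not a real implementation: a plain Python dict plays the role of the batch view store, and the `user`/`events` names are illustrative assumptions. In practice this computation would run on a framework such as Hadoop MapReduce or Spark over the full historical dataset.

```python
from collections import defaultdict

def compute_batch_view(events):
    """Precompute a batch view (total event count per user) by
    scanning the entire dataset, as the Batch Layer does."""
    view = defaultdict(int)
    for event in events:
        view[event["user"]] += 1
    return dict(view)

# Hypothetical historical dataset
events = [
    {"user": "alice"}, {"user": "bob"}, {"user": "alice"},
]
batch_view = compute_batch_view(events)
# batch_view == {"alice": 2, "bob": 1}
```

The key property is that the view is recomputed from scratch over all data, which is what makes batch results complete and consistent.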
Batch Layer for Fault Tolerance
The Batch Layer’s batch processing is typically executed on distributed data processing frameworks, such as Apache Hadoop MapReduce or Apache Spark. These frameworks provide fault tolerance by handling failures and ensuring that the batch processing completes even in the presence of node failures.
Batch Layer for Scalability
The Batch Layer can scale horizontally by distributing data and computation across multiple nodes in a cluster. As the dataset grows, additional nodes can be added to handle the increased workload, making it suitable for big data scenarios.
Batch Layer for Consistent Results
By processing the entire dataset in batches, the Batch Layer ensures that the results are consistent and complete. It avoids the issues of partial or incomplete data views that may arise in real-time processing.
The Batch Layer
The Batch Layer is one of the three main layers designed to handle big data processing. It is responsible for processing large volumes of historical data in batches. The Batch Layer’s primary function is to compute batch views or batch-processing results from the entire dataset.
The Batch Layer’s primary goal is to provide a comprehensive and accurate view of historical data, which complements the real-time processing provided by the Speed Layer in the Lambda Architecture. The Batch Layer’s precomputed batch views are stored and updated periodically, enabling low-latency query processing and efficient retrieval of historical data.
The Speed Layer
The Speed Layer is one of the three main layers designed to handle real-time data processing in big data applications. The Speed Layer is responsible for processing and analyzing continuous streams of data in near real-time, providing low-latency results and insights.
The Speed Layer’s primary focus is to process and analyze real-time data streams, ensuring that the system can respond promptly to incoming events and provide real-time insights and analytics. By working in conjunction with the Batch Layer in the Lambda Architecture, the Speed Layer enables big data systems to handle both real-time and historical data efficiently, providing a complete and up-to-date view of the data for various use cases.
Speed Layer for Real-Time Data Processing
Big data applications often deal with continuous streams of data from various sources, such as sensor data, social media feeds, logs, or clickstreams. The Speed Layer is designed to handle these streams of data in real-time, ensuring that data is processed and analyzed as it arrives.
Speed Layer for Low-Latency Processing
The Speed Layer aims to provide low-latency results to support real-time decision-making and provide timely insights. It is essential for applications where immediate actions or responses are required based on incoming data.
Speed Layer for Event-Driven Architecture
The Speed Layer is based on an event-driven architecture, where it continuously processes events as they occur. It responds to events as they arrive, making it well-suited for time-sensitive and dynamic data scenarios.
Speed Layer for Stream Processing
The Speed Layer utilizes stream processing technologies, such as Apache Storm or Apache Flink, to process and analyze data streams efficiently. These technologies enable parallel processing and support fault tolerance in distributed environments.
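As a rough illustration of what stream processors like Storm or Flink do at scale, here is a minimal tumbling-window count in plain Python. The generator consumes one event at a time and emits a result per fixed-size window; the `clicks` data and window size are made-up examples, and real engines add parallelism, state management, and fault tolerance on top of this idea.

```python
from collections import Counter

def tumbling_window_counts(stream, window_size):
    """Count events per key over fixed-size (tumbling) windows,
    emitting one result dict per completed window."""
    window = Counter()
    for i, key in enumerate(stream, start=1):
        window[key] += 1
        if i % window_size == 0:
            yield dict(window)
            window.clear()
    if window:  # flush a final partial window, if any
        yield dict(window)

clicks = ["home", "home", "cart", "home", "cart", "cart"]
results = list(tumbling_window_counts(clicks, 3))
# results == [{"home": 2, "cart": 1}, {"home": 1, "cart": 2}]
```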
How the Speed Layer Complements the Batch Layer
While the Batch Layer handles historical data processing, the Speed Layer complements it by processing real-time data. Both layers work together to provide a comprehensive view of the data, including both historical and up-to-date information.
Speed Layer for Data Integration
The Speed Layer integrates with various data sources to ingest real-time data streams. It can process and aggregate the data, enrich it with contextual information, and make it available for real-time analytics.
Speed Layer for Incremental Updates
Unlike the Batch Layer, which processes the entire dataset in batches, the Speed Layer performs incremental updates on the data as new events arrive. This enables it to provide real-time insights and responses.
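The contrast with the Batch Layer can be made concrete with a small sketch (the class and field names here are illustrative assumptions): instead of rescanning all data, the real-time view is updated in place as each event arrives.

```python
class RealtimeView:
    """Speed Layer sketch: incrementally update a per-user count
    on each arriving event, rather than recomputing over all data."""
    def __init__(self):
        self.counts = {}

    def update(self, event):
        user = event["user"]
        self.counts[user] = self.counts.get(user, 0) + 1

view = RealtimeView()
for e in [{"user": "alice"}, {"user": "alice"}, {"user": "bob"}]:
    view.update(e)
# view.counts == {"alice": 2, "bob": 1}
```

Each update costs O(1) regardless of how much data has been seen, which is what makes low-latency processing possible.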
Speed Layer for Complex Event Processing
The Speed Layer can handle complex event processing tasks, identifying patterns, correlations, and anomalies in real-time data streams.
The Serving Layer in Lambda Architecture
The Serving Layer is responsible for combining and serving the results from both the Batch Layer and the Speed Layer to provide a unified and up-to-date view of the data. It plays a crucial role in Lambda Architecture by ensuring that big data applications can efficiently serve both historical and real-time data, enabling users to make informed decisions based on a comprehensive, current data view.
Serving Layer for Unified Data Views
The Serving Layer combines the precomputed batch views from the Batch Layer and the real-time views from the Speed Layer to create a comprehensive and unified data set. This unified view ensures that queries can access both historical and real-time data.
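A common way to merge the two views at query time can be sketched as follows; this is a simplified assumption-laden example (the counts and the `serve_query` helper are invented for illustration), where the batch view holds totals up to the last batch run and the real-time view holds only the events since then.

```python
def serve_query(user, batch_view, realtime_view):
    """Serving Layer sketch: answer a query by merging the
    precomputed batch view with the incremental real-time view."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

batch_view = {"alice": 100, "bob": 40}  # from the last Batch Layer run
realtime_view = {"alice": 3}            # events seen since that run
total = serve_query("alice", batch_view, realtime_view)
# total == 103
```

When the next batch run completes and absorbs those recent events, the corresponding real-time entries are discarded, which is how the two layers eventually converge on the same answer.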
Serving Layer for Low-Latency Query Processing
The Serving Layer provides low-latency query processing by leveraging the real-time views generated by the Speed Layer. This allows users or applications to receive up-to-date data insights without significant delays.
Serving Layer for Consistency of Results
The Serving Layer ensures that the results from both the Batch Layer and the Speed Layer eventually converge. This convergence guarantees that the data presented to users or applications is consistent and reflects the latest available information.
Serving Layer for Scalability
The Serving Layer needs to handle query requests efficiently, even in the face of high data volumes and complex queries. It should be designed for horizontal scalability to accommodate increasing user demand.
Serving Layer for User-Facing Interface
The Serving Layer provides an interface for users and applications to interact with the data. It offers APIs, web services, or other methods for querying and accessing the processed data.
Serving Layer for Data Visualization and Reporting
The Serving Layer may also include tools and components for data visualization and reporting, allowing users to gain insights and analyze data in a user-friendly manner.
Serving Layer for Load Balancing
To maintain low-latency query response times, the Serving Layer may use load balancing techniques to distribute incoming queries across multiple servers or nodes.
Four Techniques for Load Balancing
- Round-Robin Load Balancing: Requests are distributed sequentially across available resources in a circular manner.
- Weighted Load Balancing: Each resource is assigned a weight based on its capacity or capabilities, and requests are distributed proportionally.
- Least Connections Load Balancing: Requests are sent to the resource with the fewest active connections to evenly distribute the workload.
- Dynamic Load Balancing: Load balancers continuously monitor resource utilization and dynamically adjust the distribution of workloads based on real-time data.
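Two of these techniques, round-robin and least-connections, are simple enough to sketch directly. The class and server names below are illustrative only; production load balancers (hardware or software) add health checks, weights, and concurrency handling on top of this core logic.

```python
import itertools

class RoundRobinBalancer:
    """Distribute requests sequentially across servers in a cycle."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request to `server` completes."""
        self.active[server] -= 1

rr = RoundRobinBalancer(["a", "b", "c"])
order = [rr.pick() for _ in range(4)]
# order == ["a", "b", "c", "a"]
```

Weighted and dynamic balancing follow the same pattern, but bias the `pick` decision by capacity weights or by live utilization metrics respectively.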
Six Purposes of Load Balancing
- Optimizing Resource Utilization
- Reducing Processing Bottlenecks
- Minimizing Response Times
- Fault Tolerance and High Availability
- Scaling and Elasticity
- Data Distribution in Distributed Storage
Distributed Storage System Technologies
- Apache Cassandra
- Apache HBase
- Amazon S3 (Simple Storage Service)
- Google Cloud Storage
- Microsoft Azure Blob Storage
- Apache GlusterFS
- Ceph
- IBM Spectrum Scale (GPFS)
- OpenStack Swift
- Red Hat Gluster Storage
Load Balancing
Load balancing is a technique used in big data and distributed computing systems to distribute processing workloads evenly across multiple computing resources (e.g., servers, nodes, or clusters). The goal of load balancing is to optimize resource utilization, enhance performance, and prevent overloading specific components, ensuring that the system operates efficiently and reliably.
Load balancing can be implemented using dedicated load balancer hardware or software, as well as through software-defined load balancers in cloud environments. It is a fundamental technique in building scalable, fault-tolerant, and high-performance big data systems that efficiently handle the challenges posed by large data volumes and processing requirements.
Technologies used in the Batch Layer
- Apache Oozie
- Cascading
- Apache Kylin
- Apache Beam
- Apache Crunch
- Apache Pig
- Apache Hive
- Apache Flink
- Apache Spark
- Apache Hadoop MapReduce
Technologies used in the Speed Layer
- Apache Storm
- Apache Flink
- Apache Kafka
- Apache Samza
- Amazon Kinesis
- Google Cloud Dataflow
- Microsoft Azure Stream Analytics
- Spark Streaming
- NATS
- RabbitMQ