Big Data, Lambda Architecture, & SQL Flashcards

1
Q

Lambda architecture

A

A data processing architecture designed to handle massive volumes of data, often associated with big data applications. It was introduced by Nathan Marz to address the challenges of real-time data processing and batch processing in big data systems. Lambda architecture combines both batch and real-time processing to provide a comprehensive and scalable solution for processing large datasets.

2
Q

Purposes of Lambda Architecture

A

Real-Time Data Processing
Batch Data Processing
Scalability
Fault Tolerance
Consistency of Results

3
Q

Fault Tolerance in Big Data

A

Big data systems are distributed and complex, making them prone to failures. Lambda architecture’s fault tolerance ensures that the system can recover from failures and maintain data consistency.

4
Q

“Consistency of Results”

A

Lambda architecture guarantees that the data processed by both real-time and batch layers eventually converges, ensuring consistent results across the entire system.

5
Q

Scalability in Big Data Systems

A

Big data systems need to scale horizontally to accommodate the increasing volume of data and processing requirements. Lambda architecture’s design allows for horizontal scaling of both the real-time and batch processing components.

6
Q

Batch Data Processing

A

In addition to real-time data, big data systems often deal with historical data and large datasets that require batch processing. Lambda architecture includes a batch processing layer to handle these vast amounts of data efficiently.

7
Q

Real-Time Data Processing

A

Big data systems often receive continuous streams of data from various sources, such as sensors, social media, or clickstreams. Lambda architecture incorporates a real-time processing layer to handle these streams of data and provide low-latency processing and analytics.

8
Q

Three Layers of Lambda Architecture

A
  1. Batch Layer
  2. Speed Layer
  3. Serving Layer
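A minimal Python sketch of how the three layers fit together, assuming a simple page-view counting use case (function and field names such as compute_batch_view and serve_query are illustrative, not from any specific framework):
    from collections import defaultdict

    # Batch Layer: periodically recompute a view over the full master dataset.
    def compute_batch_view(master_dataset):
        view = defaultdict(int)
        for event in master_dataset:          # every historical page-view event
            view[event["page"]] += 1
        return dict(view)

    # Speed Layer: incrementally update a real-time view as new events arrive.
    def update_realtime_view(realtime_view, event):
        realtime_view[event["page"]] = realtime_view.get(event["page"], 0) + 1

    # Serving Layer: merge the batch view and the real-time view to answer queries.
    def serve_query(page, batch_view, realtime_view):
        return batch_view.get(page, 0) + realtime_view.get(page, 0)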
9
Q

Batch Layer for Large-Scale Data Processing

A

Big data applications deal with massive volumes of data that cannot be processed in real-time due to computational limitations. The Batch Layer is designed to handle these large-scale datasets efficiently by breaking them into manageable batches and processing them in parallel.

10
Q

Batch Layer for Historical Data Processing

A

The Batch Layer is well-suited for processing historical data, which accumulates over time. It enables the system to process and analyze the entire historical dataset to produce accurate and comprehensive batch views.

11
Q

Batch Layer for Precomputing Results

A

The Batch Layer precomputes batch views by running computationally intensive algorithms and data processing tasks on the entire dataset. This precomputation provides consistent and reliable results for later queries.
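As a hedged illustration, a batch job that precomputes such a view might look like the following sketch (the sales_events.jsonl input file and its fields are assumptions):
    import json
    from collections import defaultdict

    # Hypothetical batch job: scan the entire dataset and precompute total sales per product.
    def precompute_sales_view(path="sales_events.jsonl"):
        totals = defaultdict(float)
        with open(path) as f:
            for line in f:                         # every historical record is processed
                event = json.loads(line)
                totals[event["product_id"]] += event["amount"]
        return dict(totals)                        # stored as the batch view for fast later queries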

12
Q

Batch Layer for Fault Tolerance

A

The Batch Layer’s batch processing is typically executed on distributed data processing frameworks, such as Apache Hadoop MapReduce or Apache Spark. These frameworks provide fault tolerance by handling failures and ensuring that the batch processing completes even in the presence of node failures.

13
Q

Batch Layer for Scalability

A

The Batch Layer can scale horizontally by distributing data and computation across multiple nodes in a cluster. As the dataset grows, additional nodes can be added to handle the increased workload, making it suitable for big data scenarios.

14
Q

Batch Layer for Consistent Results

A

By processing the entire dataset in batches, the Batch Layer ensures that the results are consistent and complete. It avoids the issues of partial or incomplete data views that may arise in real-time processing.

15
Q

The Batch Layer

A

The Batch Layer is one of the three main layers designed to handle big data processing. It is responsible for processing large volumes of historical data in batches. The Batch Layer’s primary function is to compute batch views or batch-processing results from the entire dataset.
The Batch Layer’s primary goal is to provide a comprehensive and accurate view of historical data, which complements the real-time processing provided by the Speed Layer in the Lambda Architecture. The Batch Layer’s precomputed batch views are stored and updated periodically, enabling low-latency query processing and efficient retrieval of historical data.

16
Q

The Speed Layer

A

The Speed Layer is one of the three main layers designed to handle real-time data processing in big data applications. The Speed Layer is responsible for processing and analyzing continuous streams of data in near real-time, providing low-latency results and insights.
The Speed Layer’s primary focus is to process and analyze real-time data streams, ensuring that the system can respond promptly to incoming events and provide real-time insights and analytics. By working in conjunction with the Batch Layer in the Lambda Architecture, the Speed Layer enables big data systems to handle both real-time and historical data efficiently, providing a complete and up-to-date view of the data for various use cases.

17
Q

Speed Layer for Real-Time Data Processing

A

Big data applications often deal with continuous streams of data from various sources, such as sensor data, social media feeds, logs, or clickstreams. The Speed Layer is designed to handle these streams of data in real-time, ensuring that data is processed and analyzed as it arrives.

18
Q

Speed Layer for Low-Latency Processing

A

The Speed Layer aims to provide low-latency results to support real-time decision-making and provide timely insights. It is essential for applications where immediate actions or responses are required based on incoming data.

19
Q

Speed Layer for Event-Driven Architecture

A

The Speed Layer is based on an event-driven architecture, where it continuously processes events as they occur. It responds to events as they arrive, making it well-suited for time-sensitive and dynamic data scenarios.

20
Q

Speed Layer for Stream Processing

A

The Speed Layer utilizes stream processing technologies, such as Apache Storm or Apache Flink, to process and analyze data streams efficiently. These technologies enable parallel processing and support fault tolerance in distributed environments.
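Engines like Storm and Flink handle this at scale; purely to illustrate the idea, here is a pure-Python sketch of tumbling-window aggregation over a stream (the 10-second window and event fields are assumptions):
    from collections import defaultdict

    # Illustrative tumbling-window aggregation: count events per key in fixed windows.
    def windowed_counts(event_stream, window_seconds=10):
        window_start, counts = None, defaultdict(int)
        for event in event_stream:                 # event = {"ts": epoch_seconds, "key": ...}
            if window_start is None:
                window_start = event["ts"]
            if event["ts"] - window_start >= window_seconds:
                yield window_start, dict(counts)   # emit the finished window downstream
                window_start, counts = event["ts"], defaultdict(int)
            counts[event["key"]] += 1
        if counts:
            yield window_start, dict(counts)       # flush the final partial window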

21
Q

How the Speed Layer Complements the Batch Layer

A

While the Batch Layer handles historical data processing, the Speed Layer complements it by processing real-time data. Both layers work together to provide a comprehensive view of the data, including both historical and up-to-date information.

22
Q

Speed Layer for Data Integration

A

The Speed Layer integrates with various data sources to ingest real-time data streams. It can process and aggregate the data, enrich it with contextual information, and make it available for real-time analytics.

23
Q

Speed Layer for Incremental Updates

A

Unlike the Batch Layer, which processes the entire dataset in batches, the Speed Layer performs incremental updates on the data as new events arrive. This enables it to provide real-time insights and responses.
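A minimal sketch of the incremental-update idea, assuming a simple click-counting use case (the class and field names are invented for illustration):
    # State is adjusted per event instead of reprocessing the whole dataset.
    class RunningClickCounts:
        def __init__(self):
            self.counts = {}

        def on_event(self, event):
            url = event["url"]
            self.counts[url] = self.counts.get(url, 0) + 1
            return self.counts[url]            # low-latency, always-current answer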

24
Q

Speed Layer for Complex Event Processing

A

The Speed Layer can handle complex event processing tasks, identifying patterns, correlations, and anomalies in real-time data streams.

25
Q

The Serving Layer in Lambda Architecture

A

The Serving Layer is responsible for combining and serving the results from both the Batch Layer and the Speed Layer to provide a unified and up-to-date view of the data. By merging precomputed batch views with real-time views and presenting them to users and applications, it ensures that big data systems can efficiently serve both historical and real-time data, enabling decisions based on a comprehensive, current view.

26
Q

Serving Layer for Unified Data Views

A

The Serving Layer combines the precomputed batch views from the Batch Layer and the real-time views from the Speed Layer to create a comprehensive and unified data set. This unified view ensures that queries can access both historical and real-time data.

27
Q

Serving Layer for Low-Latency Query Processing

A

The Serving Layer provides low-latency query processing by leveraging the real-time views generated by the Speed Layer. This allows users or applications to receive up-to-date data insights without significant delays.

28
Q

Serving Layer for Consistency of Results

A

The Serving Layer ensures that the results from both the Batch Layer and the Speed Layer eventually converge. This convergence guarantees that the data presented to users or applications is consistent and reflects the latest available information.

29
Q

Serving Layer for Scalability

A

The Serving Layer needs to handle query requests efficiently, even in the face of high data volumes and complex queries. It should be designed for horizontal scalability to accommodate increasing user demand.

30
Q

Serving Layer for User-Facing Interface

A

The Serving Layer provides an interface for users and applications to interact with the data. It offers APIs, web services, or other methods for querying and accessing the processed data.

31
Q

Serving Layer for Data Visualization and Reporting

A

The Serving Layer may also include tools and components for data visualization and reporting, allowing users to gain insights and analyze data in a user-friendly manner.

32
Q

Serving Layer for Load Balancing

A

To maintain low-latency query response times, the Serving Layer may use load balancing techniques to distribute incoming queries across multiple servers or nodes.

33
Q

Four Techniques for Load Balancing

A
  1. Round-Robin Load Balancing: Requests are distributed sequentially across available resources in a circular manner.
  2. Weighted Load Balancing: Each resource is assigned a weight based on its capacity or capabilities, and requests are distributed proportionally.
  3. Least Connections Load Balancing: Requests are sent to the resource with the fewest active connections to evenly distribute the workload.
  4. Dynamic Load Balancing: Load balancers continuously monitor resource utilization and dynamically adjust the distribution of workloads based on real-time data.
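As a rough illustration of the first two techniques, a balancer's selection logic might look like this sketch (the node names and weights are invented; a least-connections sketch appears with its own card further on):
    import itertools
    import random

    servers = ["node-a", "node-b", "node-c"]          # hypothetical backend pool

    # 1. Round-robin: hand requests to the pool in circular order.
    _next_server = itertools.cycle(servers)
    def pick_round_robin():
        return next(_next_server)

    # 2. Weighted: pick nodes in proportion to their assumed capacity.
    weights = {"node-a": 3, "node-b": 1, "node-c": 1}
    def pick_weighted():
        return random.choices(list(weights), weights=list(weights.values()))[0]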
34
Q

Six Purposes of Load-Balancing

A
  1. Optimizing Resource Utilization
  2. Reducing Processing Bottlenecks
  3. Minimizing Response Times
  4. Fault Tolerance and High Availability
  5. Scaling and Elasticity
  6. Data Distribution in Distributed Storage
35
Q

Distributed Storage System Technologies

A
  1. Apache Cassandra
  2. Apache HBase
  3. Amazon S3 (Simple Storage Service)
  4. Google Cloud Storage
  5. Microsoft Azure Blob Storage
  6. GlusterFS
  7. Ceph
  8. IBM Spectrum Scale (GPFS)
  9. OpenStack Swift
  10. Red Hat Gluster Storage
36
Q

Load Balancing

A

Load balancing is a technique used in big data and distributed computing systems to distribute processing workloads evenly across multiple computing resources (e.g., servers, nodes, or clusters). The goal of load balancing is to optimize resource utilization, enhance performance, and prevent overloading specific components, ensuring that the system operates efficiently and reliably.
Load balancing can be implemented using dedicated load balancer hardware or software, as well as through software-defined load balancers in cloud environments. It is a fundamental technique in building scalable, fault-tolerant, and high-performance big data systems that efficiently handle the challenges posed by large data volumes and processing requirements.

37
Q

Technologies used in the Batch Layer

A
  1. Apache Oozie
  2. Cascading
  3. Apache Kylin
  4. Apache Beam
  5. Apache Crunch
  6. Apache Pig
  7. Apache Hive
  8. Apache Flink
  9. Apache Spark
  10. Apache Hadoop MapReduce
38
Q

Technologies used in the Speed Layer

A
  1. Apache Storm
  2. Apache Flink
  3. Apache Kafka
  4. Apache Samza
  5. Amazon Kinesis
  6. Google Cloud Dataflow
  7. Microsoft Azure Stream Analytics
  8. Spark Streaming
  9. NATS
  10. RabbitMQ
39
Q

Technologies used in the Serving Layer

A
  1. Apache HBase
  2. Apache Cassandra
  3. Apache Druid
  4. Elasticsearch
  5. Amazon DynamoDB
  6. Google Cloud Bigtable
  7. Apache Ignite
  8. Redis
  9. MongoDB
  10. Apache Solr
40
Q

Least Connections Load Balancing

A

Least Connections Load Balancing is a load balancing technique used in distributed computing and networking to distribute incoming requests or connections to a group of resources (such as servers or nodes) based on the current number of active connections on each resource. The basic principle is to direct new requests to the resource with the fewest active connections, aiming to evenly distribute the workload and prevent overloading any specific resource.
Least Connections Load Balancing is a load distribution strategy that plays a significant role in big data systems, contributing to their scalability, fault tolerance, and efficient utilization of computing resources. It helps ensure that big data applications can efficiently process vast amounts of data and deliver timely insights to users and applications.
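A minimal sketch of least-connections selection (the balancer class, node names, and acquire/release methods are illustrative assumptions):
    class LeastConnectionsBalancer:
        def __init__(self, nodes):
            self.active = {node: 0 for node in nodes}

        def acquire(self):
            node = min(self.active, key=self.active.get)   # fewest active connections wins
            self.active[node] += 1
            return node

        def release(self, node):
            self.active[node] -= 1                         # call when the request completes

    # Usage: new requests drain toward whichever node is currently least loaded.
    lb = LeastConnectionsBalancer(["node-a", "node-b", "node-c"])
    first = lb.acquire()     # all nodes tied, so the first is chosen
    second = lb.acquire()    # routed to a different, less-loaded node
    lb.release(first)        # freeing a connection makes that node attractive again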

41
Q

Batch Processing

A

Batch processing is a data processing method in which data is collected, processed, and analyzed in fixed-size, discrete batches. In this approach, data is accumulated over a certain period or until a specific volume is reached before it is processed as a group. The processing occurs offline, away from real-time interactions, and is typically done during non-peak hours or when the system load is low. Batch processing is a fundamental concept in big data systems, allowing them to efficiently manage and process vast amounts of data. It complements real-time processing, such as stream processing, in the Lambda Architecture, providing a comprehensive and efficient solution for handling big data workloads.

42
Q

Benefits of Batch Processing for Big Data

A
  1. Handling Large Volumes of Data
  2. Scalability
  3. Resource Optimization
  4. Reduced Latency Sensitivity
  5. Consistency and Determinism
  6. Complex Computations
  7. Error Handling
  8. Reduced Overhead
43
Q

Consistency and Determinism in Batch Processing

A

Batch processing provides consistent and deterministic results. By processing an entire batch of data at once, the system ensures that all data points within the batch are processed using the same set of rules and algorithms, avoiding potential discrepancies due to real-time fluctuations.

44
Q

Resource Optimization

A

Resource optimization in the context of big data refers to the efficient and effective utilization of computing resources, storage, and other infrastructure elements to handle the large volumes of data and processing requirements of big data applications. It involves maximizing the performance, scalability, and cost-effectiveness of the resources while minimizing waste and inefficiencies.
Resource optimization in big data involves various strategies, including load balancing, efficient data partitioning, smart scheduling of processing tasks, storage optimization, and leveraging specialized hardware or cloud-based services. It requires a careful understanding of the application’s requirements, data characteristics, and performance goals.
Overall, resource optimization is a fundamental aspect of big data management and infrastructure design. It plays a key role in ensuring that big data applications can handle large-scale data processing efficiently, meet performance expectations, and achieve cost-effectiveness, contributing to the success and sustainability of big data projects.

45
Q

Best ways to reduce costs in big data

A
  1. Cloud Adoption
  2. Serverless Architectures
  3. Data Compression (Storage Optimization)
  4. Data Lifecycle Management
  5. Efficient Data Processing
  6. “Spot and Preemptible Instances”
  7. Auto-Scaling
  8. Data Archiving and Backup
  9. Open Source Technologies
  10. Resource Optimization
  11. Data Governance and Quality
  12. Data Security and Compliance
  13. Data Visualization and Reporting
46
Q

Strategies for making big data more eco-friendly

A
  1. Cloud Computing and Virtualization
  2. Data Center Efficiency
  3. Renewable Energy Sources to power big data operations
  4. Consolidation and Data Sharing
  5. Data Compression and Deduplication
  6. Energy-Efficient Hardware
  7. Server and Resource Optimization
  8. Data Lifecycle Management
  9. Smart Data Replication
  10. Energy-Aware Algorithms
  11. Data Caching and Preprocessing
  12. Real-time Data Pruning
  13. Data Center Location
  14. Power Management Policies
  15. Monitoring and Reporting
47
Q

The 12 Layers of Distributed Computing Architecture for Big Data Systems

A
  1. Data Sources
  2. Data Ingestion Layer
  3. Storage Layer
  4. Data Processing Layer
  5. Batch Processing
  6. Speed Processing
  7. Serving Layer
  8. Query and Analytics
  9. Data Visualization and Reporting
  10. Security and Governance
  11. Monitoring and Management
  12. Scaling and High Availability
48
Q

The Foundation of all Big Data Systems

A

Data is at the core of big data systems. The entire ecosystem is built around collecting, storing, processing, and analyzing vast amounts of data from various sources.

49
Q

5Vs of Data

A

The 5Vs are fundamental characteristics of big data. They define the key challenges and opportunities posed by big data systems.
1. Volume
2. Variety
3. Velocity
4. Veracity
5. Value
These 5Vs collectively define the complexity and potential of big data and highlight the importance of efficient data management, processing, and analysis in modern data-driven environments. Big data systems are designed to address the challenges posed by these 5Vs while leveraging the opportunities they present for data-driven innovation and business growth.

50
Q

Data Volume

A

Volume refers to the vast amount of data generated and collected in big data environments. It involves dealing with massive datasets that may range from terabytes to petabytes or even exabytes in size. The ability to handle and process such large volumes of data is a defining characteristic of big data systems.

51
Q

Data Velocity

A

Velocity represents the speed at which data is generated, collected, and processed. In the era of real-time data streams, big data systems must be capable of ingesting and analyzing data as it is generated to provide timely and up-to-date insights. Velocity is crucial for time-sensitive applications and real-time analytics.

52
Q

Data Variety

A

Variety refers to the diversity of data types and sources in big data environments. Data can be structured, semi-structured, or unstructured, and it comes from various sources such as social media, sensors, log files, videos, images, and more. The challenge lies in managing and analyzing this diverse data effectively.

53
Q

Data Veracity

A

Veracity represents the quality, reliability, and trustworthiness of data. Big data often involves dealing with data from various sources, and the veracity of this data can vary. Ensuring data quality, cleaning, and validation are critical to making accurate and reliable decisions based on big data insights.

54
Q

Data Value

A

Value refers to the insights and meaningful information that can be extracted from big data. The ultimate goal of big data is to derive actionable insights that lead to valuable outcomes, such as data-driven decision-making, business optimization, and improved customer experiences.

55
Q

The 16 Basic Processing Paradigms

A
  1. Batch Processing
  2. Real Time Processing
  3. Micro-Batch Processing
  4. Stream Processing
  5. Interactive Processing
  6. Near-line Processing
  7. Asynchronous Processing
  8. Graph Processing
  9. In-Memory Processing
  10. Predictive Processing
  11. Interactive Batch Processing
  12. Adaptive Processing
  13. Parallel Processing
  14. Complex Event Processing (CEP)
  15. Edge Processing
  16. Machine Learning Inference
56
Q

Distributed Computing

A

A computing model in which a group of interconnected computers work together to solve a computational problem or perform a task. In this model, processing tasks are divided into smaller sub-tasks and distributed across multiple nodes (computers) within a network. Each node processes its assigned sub-task independently, and the results are combined to obtain the final result. The goal of distributed computing is to achieve higher performance, scalability, fault tolerance, and resource efficiency compared to traditional centralized computing models.
In the context of big data, distributed computing plays a crucial role in handling the massive volume of data generated and processed in big data systems. Big data often exceeds the capacity of a single computer or server to handle, making distributed computing essential for managing and analyzing large datasets efficiently. The two main aspects of distributed computing relevant to big data are:
1. Data Storage
2. Data Processing
The combination of distributed data storage and distributed data processing enables big data systems to address the challenges posed by the 5Vs of big data (Volume, Velocity, Variety, Veracity, and Value)
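As a toy stand-in for a real cluster, a single machine's process pool can illustrate the divide-process-combine pattern behind distributed data processing (the word-count task and sample lines are assumptions):
    from collections import Counter
    from multiprocessing import Pool

    # Each "node" independently counts words in its own chunk of the data...
    def count_chunk(lines):
        return Counter(word for line in lines for word in line.split())

    # ...and the partial results are combined into the final answer.
    def distributed_word_count(all_lines, workers=4):
        chunk_size = max(1, len(all_lines) // workers)
        chunks = [all_lines[i:i + chunk_size] for i in range(0, len(all_lines), chunk_size)]
        with Pool(workers) as pool:
            partials = pool.map(count_chunk, chunks)
        return sum(partials, Counter())

    if __name__ == "__main__":
        print(distributed_word_count(["big data", "big clusters", "data everywhere"]))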

57
Q

Distributed Computing for Data Storage

A

Distributed storage systems, such as Hadoop Distributed File System (HDFS) and distributed NoSQL databases like Apache Cassandra and MongoDB, are used to store and manage vast amounts of data across multiple nodes. Data is partitioned and replicated across nodes to ensure data availability, fault tolerance, and data locality.

58
Q

Distributed Computing for Data Processing

A

Distributed computing frameworks like Apache Hadoop, Apache Spark, and Apache Flink are employed for distributed data processing. These frameworks divide data processing tasks into smaller chunks, and each node processes its portion of the data independently. The results are then combined to form the final output. This parallel processing allows big data systems to handle massive datasets efficiently.

59
Q

Horizontal Scalability v. Vertical Scalability

A

Horizontal scalability (aka scale-out): the ability of a system to handle increasing workloads and growing demands by adding more nodes or resources to the system. As data volume or processing requirements increase, new nodes can be added to the existing infrastructure, distributing the workload across them. This allows the system to maintain or improve performance, throughput, and responsiveness as demand grows.
Vertical scalability (aka scale-up): involves increasing the capacity of individual nodes (e.g., adding more memory or CPU to a single server) to handle increased workloads. While vertical scalability can be effective up to a point, it eventually faces hardware limitations and becomes more expensive and challenging to scale further.

60
Q

Benefits of Horizontal Scalability

A
  1. Handling Massive Data Volumes
  2. Performance and Throughput
  3. Fault Tolerance/High Availability
  4. Cost-Effectiveness
  5. Flexibility and Elasticity
  6. Future-Proofing
61
Q

Flexibility in Big Data

A

Flexibility refers to the system’s ability to adapt and accommodate varying workloads, data sources, and processing requirements. A flexible big data system can handle different data types (structured, semi-structured, unstructured) from diverse sources, allowing for easy integration and processing of data with varying formats and characteristics.

Key aspects of flexibility in big data include:
1. Data Variety: A flexible big data system can ingest, process, and analyze data in different formats, such as text, images, videos, log files, social media feeds, and sensor data.
2. Data Schema Evolution: A flexible system can accommodate changes in data schema over time without disrupting existing data processing pipelines or applications.
3. Support for Various Processing Paradigms: A flexible big data system can support batch processing, real-time processing, interactive processing, and streaming data processing, depending on the specific use case and requirements.
4. Data Exploration and Ad-Hoc Queries: Flexibility allows data scientists and analysts to explore data and perform ad-hoc queries efficiently, enabling data discovery and deeper insights.

62
Q

Elasticity in Big Data

A

Refers to the system’s ability to automatically and dynamically scale its computing resources up or down in response to changing workloads or demands. An elastic big data system can add or remove resources (e.g., nodes or computing instances) on-the-fly to handle varying data processing needs, ensuring optimal performance and resource utilization.
Key aspects of elasticity in big data include:
1. Automatic Scaling: An elastic system can automatically scale its resources based on predefined rules or thresholds. For example, when data processing demand increases, more nodes are added to distribute the workload.
2. Resource Optimization: An elastic system optimizes resource allocation to match the current workload, avoiding overprovisioning or underutilization of resources.
3. Cost Efficiency: Elasticity allows for cost optimization by scaling resources based on demand. During periods of low demand, unnecessary resources can be removed, reducing costs.
4. Resilience: Elasticity enhances system resilience by enabling it to adapt to sudden spikes in data processing demands without sacrificing performance or availability.
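A hedged sketch of a threshold-based automatic-scaling rule (the 75%/25% thresholds and node limits are arbitrary assumptions, not defaults of any platform):
    # Illustrative autoscaling decision: add nodes under heavy load, remove them when idle.
    def desired_node_count(current_nodes, avg_cpu_utilization,
                           scale_up_at=0.75, scale_down_at=0.25,
                           min_nodes=2, max_nodes=20):
        if avg_cpu_utilization > scale_up_at:
            return min(current_nodes + 1, max_nodes)     # scale out
        if avg_cpu_utilization < scale_down_at:
            return max(current_nodes - 1, min_nodes)     # scale in to save cost
        return current_nodes

    # e.g. desired_node_count(4, 0.82) -> 5; desired_node_count(4, 0.10) -> 3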

63
Q

Ad-Hoc Queries

A

Refer to on-the-fly, user-initiated queries performed in an exploratory and interactive manner to gain immediate insights and answers from large and diverse datasets. Ad-hoc queries are typically unplanned, meaning they are not predefined or part of a regular processing workflow; instead, they are formulated based on specific analytical needs or questions raised by data analysts, scientists, or business users.

64
Q

Processing Paradigms

A

Approaches for handling data in big data computing systems. Processing paradigms are employed to analyze and extract insights from large and diverse datasets. Each processing paradigm has its characteristics and use cases.

65
Q

Extract, Transform, Load (ETL)

A

Extracting data from various sources, transforming it into a suitable format, and loading it into a data repository for analysis.
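A toy ETL sketch using only the Python standard library (the orders.csv file, its columns, and the warehouse table are assumptions):
    import csv
    import sqlite3

    def run_etl(csv_path="orders.csv", db_path="warehouse.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount_usd REAL)")
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):                            # Extract
                amount = round(float(row["amount_cents"]) / 100, 2)  # Transform
                conn.execute("INSERT INTO orders VALUES (?, ?)",     # Load
                             (int(row["order_id"]), amount))
        conn.commit()
        conn.close()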

66
Q

Big Data Analytics

A

Analyzing and deriving insights from massive datasets to identify patterns, trends, and correlations. This includes tasks like data mining, predictive modeling, and machine learning on large datasets.

67
Q

Real-Time Data Processing

A

Processing and analyzing data as it arrives in real-time, such as streaming data from IoT devices, social media feeds, and financial markets.

68
Q

Large-Scale Data Warehousing

A

Storing and managing vast amounts of structured and unstructured data in a data warehouse for historical analysis and reporting.

69
Q

Search and Information Retrieval

A

Conducting searches and retrieval operations on large databases or search engines

70
Q

Recommendation Systems

A

Generating personalized recommendations for users based on their preferences and historical data

71
Q

Social Network Analysis

A

Analyzing relationships and connections between entities in social networks

72
Q

Log Analysis

A

Processing and analyzing log data from various systems to detect anomalies, monitor system health, and identify potential issues

73
Q

Genomic Data Analysis

A

Analyzing genetic data to study genetic variations, gene expression, and disease-related factors.

74
Q

Image and Video Processing

A

Analyzing and processing large volumes of image and video data, such as in computer vision applications and video surveillance.

75
Q

Natural Language Processing (NLP)

A

Analyzing and processing natural language text data for tasks like sentiment analysis, language translation, and information extraction.

76
Q

Financial Data Analysis

A

Analyzing financial transactions, market data, and economic indicators to support financial decision-making and risk analysis.

77
Q

Scientific Data Analysis

A

Analyzing data from scientific experiments, simulations, and observations in fields like astronomy, climate research, and bioinformatics.

78
Q

Geospatial Data Analysis

A

Analyzing geographic and spatial data for applications in geographic information systems (GIS) and location-based services

79
Q

In-memory computing

A

In-memory computing is a computing technique that involves storing and processing data directly in the main memory (RAM) of a computer, as opposed to storing data on disk or other slower storage devices. In this approach, data is accessed and manipulated at a much faster speed since RAM has significantly lower access times compared to disk storage. This allows for real-time data processing and analysis, making it ideal for handling data-intensive and time-sensitive tasks.
In-memory computing is widely used in various domains and applications, including big data analytics, real-time data processing, financial trading systems, online transaction processing (OLTP) systems, and high-performance computing (HPC). It is often utilized in conjunction with distributed computing frameworks, enabling large-scale data processing and analytics with high performance and efficiency.

80
Q

The 5 V’s of Big Data

A

Volume, Velocity, Variety, Veracity, and Value

81
Q

Relational Databases in Big Data

A

Organizes data into tables, where each table consists of rows and columns. The rows represent individual records or entries, and the columns represent attributes or characteristics of those records.

82
Q

Rows/Tuples in a datatable

A

Each row in a table represents a single record or data entry containing specific values for each column.

83
Q

Columns/Attributes in a datatable

A

Columns represent the specific characteristics or attributes of the data being stored in the table.

84
Q

Keys in datatables

A

Primary Key: A unique identifier for each row in a table, ensuring each row is distinct.
Foreign Key: A field in a table that refers to the primary key of another table. It establishes relationships between tables.
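A small illustration using Python's built-in sqlite3 module (the customers/orders tables and their columns are invented for the example): customer_id in orders is a foreign key referencing the primary key of customers.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,   -- unique identifier for each row
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            total       REAL,
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        );
    """)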

85
Q

Database Normalization

A

The process of organizing data to minimize redundancy and dependency by dividing large tables into smaller tables and defining relationships between them.

86
Q

Structured Query Language (SQL)

A

The language used to interact with RDBMSs. SQL allows for the creation, manipulation, and retrieval of data from databases
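A few representative SQL statements, run here through Python's built-in sqlite3 module (the people table and its rows are invented for the example):
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")  # creation
    conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)",
                     [("Ada", 36), ("Grace", 45)])                                        # manipulation
    rows = conn.execute("SELECT name FROM people WHERE age > 40").fetchall()              # retrieval
    print(rows)   # [('Grace',)]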

87
Q

Candidate Keys for a data table

A

A subset of attributes that can be used to uniquely identify a record within a table (since no two records will ever contain the same values for all those attributes).
These are called candidate keys because one of them will be chosen as the primary key; each table will likely have more than one candidate key.

88
Q

Primary Key for a data table

A

One of the Candidate keys: the one specifically chosen to be the unique identifier

89
Q

Foreign Keys in RDBs

A

Used to enforce relationships between two tables. Each Foreign key corresponds to an existing primary key in another table in the RDBMS

90
Q

Relational Databases (RDBs)

A

A type of database that organizes and stores data in a structured format using rows and columns within tables. It is based on the principles of the relational model of data

91
Q

Security Vulnerabilities of a database

A

1) Aggregation: If an attacker collects enough individually non-sensitive data, they can combine it, or sell it to someone who knows how to exploit it, to derive something sensitive (can be mitigated by zero trust/least privilege).

2) Inference: If an attacker obtains aggregated non-sensitive data, they can use it to draw conclusions that are actually sensitive (can be mitigated by blurring data and database partitioning).

3) Database ransomware: Attackers can lock you out of your database so that you can no longer use the data.

92
Q

Reasons to avoid putting too much data in one database

A

1) Performance Degradation: As the amount of data stored in a database grows, the performance can degrade. Retrieving, updating, or deleting data becomes slower, affecting response times for applications and end-users.

2) Scalability Issues: Large databases can become difficult to scale. Scaling vertically (increasing the resources of a single machine) has limits, and scaling horizontally (adding more servers) can be complex and costly, especially if the database was not designed for distributed architecture.

3) Increased Backup and Recovery Times: Backing up and recovering a large database takes more time and resources. The longer the recovery time, the higher the potential impact on business continuity and data availability during downtime.

4) Higher Maintenance Complexity: Managing and maintaining a large database is more complex. Tasks like indexing, optimizing queries, and performing maintenance activities (e.g., reorganization, updates) become time-consuming and resource-intensive.

5) Difficulty in Data Management: As the database grows, it becomes challenging to manage and organize the data efficiently. Data may become fragmented, leading to suboptimal data retrieval and storage processes.

6) Security Risks: Larger databases present a more attractive target for cyber threats. A breach can expose a substantial amount of sensitive data, increasing the potential impact on individuals and the organization. Proper security measures become even more critical.

7) Compliance Challenges: Compliance with regulatory requirements becomes more complex as the amount of data increases. Ensuring adherence to data privacy and protection laws, as well as auditing and tracking data access, becomes more challenging.

8) Resource Contention: A large database can consume a significant portion of system resources, causing contention with other applications or services running on the same server. This can lead to overall system instability and affect the performance of other applications.

9) Reduced Flexibility and Agility: It becomes harder to adapt and make changes to the database structure, schema, or applications when the database is excessively large. Changes may require significant planning and testing to avoid disruptions.

93
Q

Primary Storage in RDBMS

A

Represents the memory directly accessible by the processor (usually volatile RAM)

94
Q

Secondary Storage in RDBMS

A

Inexpensive non-volatile resources available for long-term use

95
Q

Virtual Memory

A

Allows a system to simulate additional primary memory through the use of secondary memory

96
Q

Volatile Storage

A

Loses its contents when the system powers down (RAM is the most common example)

97
Q

Nonvolatile Storage

A

Does not depend on the presence of power to maintain its contents (magnetic/optical media and nonvolatile RAM)