Infrastructure & Architecture Flashcards
What is scaling in the context of big data infrastructure?
Scaling refers to increasing the capacity of a system to handle more data or higher loads. It can be done through vertical scaling (adding resources to a single machine) or horizontal scaling (adding more machines).
What is vertical scaling (scale-up)?
Vertical scaling involves adding more resources like processors, RAM, and disks to a single machine or upgrading to a more powerful server.
What is horizontal scaling (scale-out)?
Horizontal scaling involves adding more machines to a system to increase capacity. This approach supports distributed computing and, in principle, near-unbounded scalability, though performance is constrained by network speed.
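A minimal Python sketch of the idea behind scale-out: work is partitioned across machines by hashing keys, so adding nodes shrinks each node's share of the load. The node names and the `route`/`distribute` helpers are purely illustrative, not part of any specific system.

```python
import hashlib
from collections import defaultdict

def route(key, nodes):
    # Deterministic hash routing: a given key always maps to the same node
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

def distribute(keys, nodes):
    placement = defaultdict(list)
    for key in keys:
        placement[route(key, nodes)].append(key)
    return placement

keys = [f"record-{i}" for i in range(1000)]
two_nodes = distribute(keys, ["node-a", "node-b"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])
```

Doubling the node count roughly halves the largest per-node share, which is the essence of scaling out; in a real cluster, rebalancing and network transfer costs temper this.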
What is Symmetric MultiProcessing (SMP) architecture?
SMP is an architecture in which multiple processors share the same RAM, I/O bus, and disks. It is common in traditional workstations but has physical and speed limitations.
What is Massively Parallel Processing (MPP) architecture?
MPP architecture involves a shared-nothing system where each module has its own RAM and disks. It is used for tasks split into independent processes and is common in data warehousing.
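A small Python sketch of the shared-nothing pattern, assuming a simple sum query: each "node" aggregates only its own partition, and a coordinator combines the partial results. The function names are illustrative.

```python
def node_local_sum(partition):
    # Shared-nothing: a node touches only its own RAM and disk (its partition)
    return sum(partition)

def mpp_total(partitions):
    # Coordinator step: combine the independent partial aggregates
    return sum(node_local_sum(p) for p in partitions)

data = list(range(100))
# Round-robin distribution of rows across 4 nodes
partitions = [data[i::4] for i in range(4)]
result = mpp_total(partitions)  # same answer as a single-machine sum
```

Because each partial aggregate is independent, the per-node work runs in parallel with no shared state, which is why this pattern dominates data warehousing.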
What is a cluster architecture in big data?
A cluster is a group of connected computers (nodes) working together to perform as a single system. Nodes are typically connected over a fast LAN; clusters offer scalability and avoid vendor lock-in.
What are the pros of using MPP architecture?
MPP architecture offers high-speed message passing, specialized hardware and software, and high reliability; it is well suited to single-purpose, vertically integrated solutions such as data warehousing.
What are the pros of using cluster architecture?
Cluster architecture scales out almost without limit, is cheaper to set up, uses commodity hardware, avoids vendor lock-in, and suits varied applications.
What is grid computing?
Grid computing involves using distributed computer resources from multiple locations for a common goal. It differs from clusters in that nodes perform different tasks and are geographically dispersed.
What is a data lake?
A data lake is a central repository for storing raw data in its original format, which is processed only when needed. It supports various data formats, including structured, semi-structured, and unstructured.
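A minimal sketch of the data lake's schema-on-read idea, using a local directory to stand in for the lake: data is ingested verbatim with no schema enforced, and structure is applied only when a query reads it. The file layout and `read_clicks` helper are hypothetical.

```python
import json
import pathlib
import tempfile

lake = pathlib.Path(tempfile.mkdtemp())

# Ingest: store records in their original raw form; nothing is validated
raw_lines = ['{"user": "a", "clicks": 3}', '{"user": "b"}', 'not json at all']
(lake / "events.jsonl").write_text("\n".join(raw_lines))

def read_clicks(path):
    # Schema-on-read: interpret (and tolerate) the raw data only at query time
    total = 0
    for line in path.read_text().splitlines():
        try:
            total += json.loads(line).get("clicks", 0)
        except json.JSONDecodeError:
            pass  # malformed raw records surface only when read, not at ingest
    return total

clicks = read_clicks(lake / "events.jsonl")
```

This contrasts with a warehouse, where the malformed third record would have been rejected at load time.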
What is the NIST’s reference architecture for big data?
NIST’s reference architecture outlines the ecosystem of tools and hardware needed for big data operations, defining roles such as Big Data Framework Provider, Big Data Application Provider, and System Orchestrator.
What is the role of a Big Data Framework Provider?
A Big Data Framework Provider supplies the general resources or services for creating big data applications, including infrastructure, data management, and processing frameworks.
What is the role of a System Orchestrator in big data architecture?
A System Orchestrator integrates application activities into a system, configures resources, manages workloads, and ensures quality requirements are met.
What is the Lambda architecture?
The Lambda architecture processes incoming data through two paths: a hot path for real-time processing and a cold path for more accurate but delayed batch processing.
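A toy Python sketch of the two Lambda paths, assuming a simple event-counting workload: the hot path updates a view incrementally as events arrive, while the cold path periodically recomputes an accurate view from the full log. All names here are illustrative.

```python
events = []     # immutable master dataset: input to the cold (batch) path
hot_view = {}   # speed-layer view, updated per event for low latency

def ingest(event):
    events.append(event)                          # cold path: append to the log
    hot_view[event] = hot_view.get(event, 0) + 1  # hot path: incremental update

def batch_view():
    # Cold path: full, accurate recomputation over all stored data
    view = {}
    for e in events:
        view[e] = view.get(e, 0) + 1
    return view

for e in ["page_a", "page_b", "page_a"]:
    ingest(e)
```

In a real deployment the serving layer merges the two views: batch results for older data, speed-layer results for events since the last batch run.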
What is the Kappa architecture?
The Kappa architecture processes all data using a single stream processing system, simplifying architecture by eliminating the need for separate batch processing.
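A toy Python sketch of the Kappa idea: one processing function handles both live events and historical reprocessing, because "batch" is just replaying the same immutable log through the same code. The names are illustrative.

```python
log = ["a", "b", "a"]  # immutable event log (e.g., a Kafka-style topic)

def process(stream):
    # The single processing path: used for live data and for replays alike
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts

live_view = process(log)

# Reprocessing after a logic change is simply replaying the log
# through the (updated) processor; no separate batch system exists.
replayed_view = process(iter(log))
```

This is what eliminates Lambda's dual-codebase problem: there is only one implementation to keep correct.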