(F) Big Data Flashcards
What is MapReduce?
MapReduce is a way to process large amounts of data by splitting the work across many computers.
MapReduce requires you to break the job into three stages:
Map: processes each unit of input data and transforms it into a set of intermediate key-value pairs. Each piece of input data is processed independently by a mapper function, which extracts the relevant information.
Shuffle: gathers and organizes the intermediate key-value pairs generated by the Map phase. It groups together key-value pairs with the same key from different mappers and sorts them by key.
Reduce: takes the sorted intermediate key-value pairs and applies a reducer function to combine or aggregate values associated with the same key. This phase produces the final output, typically aggregating results or performing further processing on the data.
In simple terms, MapReduce divides a large dataset into smaller chunks, processes them in parallel, and then combines the results to produce the final output.
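To make the three stages concrete, here is a minimal single-process word-count sketch in Python that imitates map, shuffle, and reduce (a conceptual illustration only, not Hadoop's actual API):

```python
from collections import defaultdict

# Map: emit intermediate (word, 1) pairs from each input line independently
def mapper(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle: group intermediate pairs by key (Hadoop also sorts the keys here)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

# Reduce: aggregate all values that share the same key
def reducer(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
result = dict(reducer(k, v) for k, v in grouped.items())
print(result)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```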
What are the 3 components of Hadoop that were specifically designed to work on big data?
- Storage Unit - HDFS (Hadoop Distributed File System) splits big data into multiple blocks and stores copies of each block across multiple machines (replication). If one machine crashes, the data is still available elsewhere, making HDFS fault-tolerant (a conceptual sketch follows this list).
- MapReduce - traditionally, data gets processed on a single machine with a single processor, which is time-consuming and inefficient. MapReduce splits the data into parts, processes each part separately on data nodes, and eventually aggregates all the results into one output.
- YARN - YARN (Yet Another Resource Negotiator) manages and allocates resources such as CPU and memory across applications running on a cluster, ensuring efficient resource utilization for various data processing tasks.
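The conceptual sketch referenced above: a toy Python simulation of block splitting and replica placement, showing why losing one node does not lose data. The block size, node names, and round-robin placement are illustrative assumptions (real HDFS defaults are 128 MB blocks and a replication factor of 3).

```python
import itertools

BLOCK_SIZE = 4          # illustrative; real HDFS uses 128 MB blocks by default
REPLICATION_FACTOR = 3  # HDFS default replication factor
NODES = ["node-1", "node-2", "node-3", "node-4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split the input into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simple round-robin placement)."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(node_cycle) for _ in range(replication)]
    return placement

blocks = split_into_blocks("ABCDEFGHIJKL")
placement = place_replicas(blocks, NODES)
print(blocks)      # ['ABCD', 'EFGH', 'IJKL']
print(placement)   # each block lives on 3 of the 4 nodes, so one node failure loses nothing
```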
What is the Information Technology Infrastructure Library (ITIL) and why is it good for us?
ITIL is a set of best practices designed to align IT services with business needs, helping improve efficiency and service quality. For IT professionals, it provides essential tools for enhancing service management, fostering professional growth, and ensuring effective compliance and performance in their organizations.
What are the roles, responsibilities, and standard procedures in the Information Technology Infrastructure Library (ITIL)?
Roles: ITIL defines specific roles within IT service management to align IT services with business strategies and goals.
Responsibilities: ITIL designates responsibilities to enhance the execution and compliance of IT processes and service delivery.
Standard Procedures: ITIL offers a standardized framework to help organizations efficiently plan, implement, and evaluate IT services, promoting continual improvement.
What is Data Lake?
Data lake architecture: A data lake is a centralized repository that allows for the storage of vast amounts of raw data in its native format until it is needed for analysis or processing. This architecture supports broad accessibility and integration, making it suitable for diverse business needs.
Sidenote:
A data warehouse is a centralized repository that stores structured, integrated, and historical data from various sources within an organization. It is designed for querying, analyzing, and reporting, providing a consolidated view of the organization’s data for business intelligence and decision-making.
Advantages and Disadvantages of Data Lake?
Advantages:
1. Flexibility: Data lakes store raw data without predefined schemas, making them highly adaptable to changing data types and structures; users can shape the data for their own specific needs.
2. Multi-workload data processing: The data lake supports different types of data processing workloads, including interactive queries, batch processing, and real-time data streaming.
Disadvantages:
1. Complexity in Management: Without strict data structures, data lakes can easily become disorganized and difficult to manage, often termed “data swamps”.
Data Lake vs Data Warehouse
Data Lake:
- Shorter development process
- Schema-on-Read: Data is stored in its raw form and only structured when it is read, providing flexibility in handling various data types
- Multi-workload processing: Supports multiple processing methods
===================
Data Warehouse:
- Long development process
- Schema-on-Write: Data must be structured and formatted at the time of entry (contrast with schema-on-read in the sketch after this comparison)
- OLAP workloads: optimized for Online Analytical Processing
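The sketch referenced in the comparison: a small Python illustration of schema-on-write versus schema-on-read. The record fields and the validation applied are invented for the example.

```python
import json

raw_events = ['{"user": "ana", "amount": "19.99"}',
              '{"user": "bo", "amount": 5, "coupon": "SAVE10"}']

# Schema-on-write (data warehouse style): validate and structure before storing.
def write_to_warehouse(event_json):
    event = json.loads(event_json)
    return {"user": str(event["user"]), "amount": float(event["amount"])}

warehouse_rows = [write_to_warehouse(e) for e in raw_events]   # structured at load time

# Schema-on-read (data lake style): store raw text as-is, apply structure only when queried.
lake_storage = list(raw_events)

def read_from_lake(query_fields):
    for raw in lake_storage:
        event = json.loads(raw)                                # structure applied at read time
        yield {field: event.get(field) for field in query_fields}

print(warehouse_rows)
print(list(read_from_lake(["user", "coupon"])))  # a new field like "coupon" is queryable without reloading
```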
What are Hadoop and Spark?
- Hadoop is a system that allows for storing large amounts of data across many computers using a component called HDFS. It scales efficiently to thousands of nodes as HDFS stores large data sets. It’s good for handling huge data volumes and is foundational for big data tasks.
- Spark is a tool that works with Hadoop to process data very quickly. It does this by keeping data in memory instead of on disk, which speeds up both real-time and batch processing tasks. Spark also makes it easier to write applications with its user-friendly programming interfaces in languages like Java, Scala, Python, and R.
Together, Hadoop and Spark provide a powerful combination for storing and rapidly processing big data.
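A minimal PySpark sketch of how the two fit together, assuming a local Spark installation and a hypothetical HDFS path (hdfs:///data/logs/*.txt is made up): HDFS provides the distributed storage, Spark keeps the working data in memory.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (on a real cluster this would run under YARN).
spark = SparkSession.builder.appName("hadoop-plus-spark-sketch").getOrCreate()

# Hypothetical HDFS path: Hadoop's HDFS handles the distributed storage...
logs = spark.read.text("hdfs:///data/logs/*.txt")

# ...while Spark processes the data; cache() keeps the filtered rows in memory for reuse.
errors = logs.filter(logs.value.contains("ERROR")).cache()

print(errors.count())   # first action reads from HDFS and caches the result
print(errors.count())   # second pass is served from memory, not from disk
spark.stop()
```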
What is Physical Storage?
Physical storage in databases organizes data on disk using structures like B+ trees, hash buckets, and ISAM to optimize retrieval and performance. These structures determine how data and indexes are arranged, affecting the speed and efficiency of data access and database operations.
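As a toy illustration of one of these structures, here is a hedged Python sketch of hash-bucket lookup; a real database hashes keys to disk pages, and the bucket count and records below are made up.

```python
NUM_BUCKETS = 4  # illustrative; a real system sizes this to its disk pages

# Each bucket stands in for a disk page holding (key, row) entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, row):
    """Hash the key to pick a bucket, then append the record there."""
    buckets[hash(key) % NUM_BUCKETS].append((key, row))

def lookup(key):
    """Only one bucket has to be scanned, instead of the whole table."""
    return [row for k, row in buckets[hash(key) % NUM_BUCKETS] if k == key]

insert(101, {"name": "Ada"})
insert(102, {"name": "Grace"})
insert(101, {"name": "Ada L."})
print(lookup(101))  # both rows with key 101, found by scanning a single bucket
```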
What is Query processing & optimization?
Query processing and optimization in databases improve performance by efficiently managing how queries are handled. This involves breaking down the query, creating detailed plans for execution, and then carrying out those plans effectively.
- Query Processor Components: The query processor has a compiler and an execution engine, with the compiler turning SQL queries into detailed plans for how to run them.
- Query Compilation: This process breaks down the SQL query, checks it for errors, and fine-tunes it so it runs as efficiently as possible.
- Logical and Physical Query Plans: Logical plans are turned into physical plans that outline how and in what order to execute the query.
- Optimization Techniques: Certain rules are applied to rearrange query operations so they use fewer resources and less time (see the sketch after this list).
- Execution Engine: This part of the processor does the actual work of running the query, working with the database’s storage and management systems to do so efficiently.
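The sketch referenced above: a toy Python example of one rewrite rule, pushing a filter below a projection in a logical plan. The plan representation is invented for illustration; real optimizers work on much richer plan trees.

```python
# A logical plan as a list of (operation, argument) steps, applied top to bottom.
logical_plan = [
    ("SCAN", "orders"),
    ("PROJECT", ["customer", "total"]),
    ("FILTER", "total > 100"),
]

def push_filter_down(plan):
    """Rewrite rule: run FILTER before PROJECT so fewer rows reach the projection.
    Assumes the filter only references columns kept by the projection."""
    optimized = list(plan)
    for i in range(len(optimized) - 1):
        if optimized[i][0] == "PROJECT" and optimized[i + 1][0] == "FILTER":
            optimized[i], optimized[i + 1] = optimized[i + 1], optimized[i]
    return optimized

print(push_filter_down(logical_plan))
# [('SCAN', 'orders'), ('FILTER', 'total > 100'), ('PROJECT', ['customer', 'total'])]
```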