4. External Sorting Flashcards
What are the key considerations when sorting externally (on-disk)?
- Challenges with Slow I/O: External sorting deals with the slower I/O speeds of disk storage, making efficiency a key concern.
- Block Size Relevance: The size of blocks or pages being sorted is a critical factor, as it impacts how data is read from and written to the disk.
- Minimizing I/O: The primary goal is to reduce the number of I/O operations, which are the main bottleneck in external sorting.
- Type of Access: The method of access during sorting is significant. Random access is expensive due to higher I/O costs, whereas sequential I/O is much cheaper and more efficient for sorting large datasets.
What is a heap in a database, and how can it be searched given no prior information about the order of the records?
- Heap Structure: In a database, a heap (heap file) is the simplest file organization: a file consisting of a sequence of pages or blocks, with records stored within those pages in no particular order.
- Searching Method: To search in a heap, since there is no prior information about the order of records, the entire file must be scanned sequentially.
Example Scenario:
Assume each block is 16KB, and each record is 160B, fitting 100 records per block.
- If a relation has 10 million tuples, it occupies 100,000 blocks.
- Searching for a specific record (e.g., authors named “Jane”) requires accessing each block, which takes 15 milliseconds per block.
- The total time for this exhaustive search is 100,000 blocks × 0.015 seconds/block = 1,500 seconds, or 25 minutes.
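The arithmetic above can be checked with a short script. This is a sketch; the "100 records per block" figure implies the example treats 16 KB as 16,000 bytes, which is assumed here:

```python
# Sketch: verifying the heap-scan arithmetic from the example above
# (treats 16 KB as 16,000 B, as "100 records per block" implies).
BLOCK_SIZE = 16_000       # bytes per block
RECORD_SIZE = 160         # bytes per record
NUM_TUPLES = 10_000_000
IO_TIME_MS = 15           # milliseconds to read one block

records_per_block = BLOCK_SIZE // RECORD_SIZE        # 100
num_blocks = NUM_TUPLES // records_per_block         # 100,000
scan_minutes = num_blocks * IO_TIME_MS / 1000 / 60   # 25.0
print(records_per_block, num_blocks, scan_minutes)   # 100 100000 25.0
```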
What benefits does a sorted file give?
- Binary Search: Once the file is sorted, a binary search can be conducted, significantly reducing the number of I/O operations to logarithmic complexity.
- Ordered Presentation: Presenting results (e.g., order by name) becomes faster as no additional sorting is required.
- Efficient Aggregation: Computing aggregates (e.g., group by queries) is more efficient. In GROUP BY queries, sorting ensures that all records belonging to the same group are adjacent, which allows for single-pass aggregation, reducing the need for multiple scans or complex grouping algorithms.
- Duplicate Elimination: Duplicates can be easily identified and eliminated during a single scan, as identical items will be adjacent.
- Optimized Relation Algebra Operations: Operations like joins and projections are faster with a sorted file. Sorted files facilitate efficient join algorithms like merge join. In a merge join, two sorted files can be joined with a single pass through each file, comparing only the records with matching keys. For projections that involve removing duplicates, the sorted nature of the file means that duplicate values are consecutive, allowing for efficient duplicate removal.
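To make the merge-join idea above concrete, here is a minimal sketch in Python; the function name and the in-memory lists are illustrative stand-ins for sorted files on disk:

```python
# Sketch of a merge join over two inputs already sorted on the join key.
def merge_join(left, right, key=lambda r: r[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Gather the full group of equal keys on each side,
            # then emit every pairing within the group.
            i2, j2 = i, j
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            while j2 < len(right) and key(right[j2]) == kl:
                j2 += 1
            out.extend((a, b) for a in left[i:i2] for b in right[j:j2])
            i, j = i2, j2
    return out

# Each sorted input is scanned exactly once.
pairs = merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")])
print(pairs)   # [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]
```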
Explain 2-way External merge sort by providing the initial set-up and the steps taken by this process.
Initial Setup:
- Available Buffers: 3 buffer frames in main memory.
- File Size: Contains 2^k pages.
Process Steps:
- Pass 0 (Initial Sort Pass):
  - Operation: Read each page into memory, sort it, and write it back out.
  - Result: Creates 2^k sorted runs, each 1 page long.
- Subsequent Passes (Merging Runs):
  - Operation: In each pass, read pairs of runs, merge each pair, and write the result back as a single sorted run of twice the length.
  - Result by pass:
    - Pass 1: 2^(k−1) runs of 2 pages.
    - Pass 2: 2^(k−2) runs of 4 pages.
    - Continue until the k-th pass.
- Final Pass:
  - Operation: Pass k merges the last two runs.
  - Final Result: One sorted run of 2^k pages, fully sorting the file.
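The passes above can be sketched in Python, with a list of equal-sized pages standing in for the file on disk (the function and variable names are illustrative, and the in-memory lists only simulate the disk-resident runs):

```python
# Sketch of 2-way external merge sort over a list of equal-sized "pages".
import heapq

def two_way_external_sort(pages):
    # Pass 0: sort each page individually, giving one-page runs.
    runs = [[sorted(p)] for p in pages]          # a run is a list of pages
    page_size = len(pages[0])                    # all pages assumed equal size
    # Merge passes: combine pairs of runs until a single run remains.
    while len(runs) > 1:
        merged = []
        for a, b in zip(runs[::2], runs[1::2]):
            flat = list(heapq.merge((x for p in a for x in p),
                                    (x for p in b for x in p)))
            merged.append([flat[i:i + page_size]
                           for i in range(0, len(flat), page_size)])
        if len(runs) % 2:                        # odd run out: carry it forward
            merged.append(runs[-1])
        runs = merged
    return [x for page in runs[0] for x in page]
```

For example, sorting four 2-element pages takes pass 0 plus two merge passes, matching k = 2 for 2^k = 4 pages.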
Explain 3 key characteristics of External Merge Sort.
- Efficiency: The algorithm is efficient for large datasets because it breaks down the sorting task into smaller, manageable chunks that fit into the limited main memory.
- Minimizing I/O: By organizing the sorting process into runs and merging them, the algorithm minimizes the number of times data needs to be read from and written to disk.
- Sequential Access: The algorithm primarily utilizes sequential access of disk pages, which is faster and more efficient than random access in the context of disk I/O.
How are I/O operations calculated per page, per pass in a 2-way external merge sort, and what is the total number of I/O operations for the entire sort process?
I/Os per Page, per Pass:
- In each pass of the merge sort, every page undergoes 2 I/O operations: one for reading and one for writing after sorting or merging.
Total I/Os:
- The total number of I/O operations is 2B(⌈log₂ B⌉ + 1), where:
- B: Number of blocks (pages) in the file.
- ⌈log₂ B⌉: Number of merge passes needed (each pass doubles the run length).
- “+1”: Accounts for the initial pass that creates the individual sorted runs.
- So, each page incurs 2 I/Os (one read, one write) in each of the ⌈log₂ B⌉ + 1 passes.
Example of Efficiency Concern:
If a full scan takes 25 minutes, each pass takes 50 minutes, since every page is both read and written. For 100,000 blocks, ⌈log₂ 100,000⌉ + 1 = 18 passes are needed, so the total time is 18 × 50 = 900 minutes, or 15 hours. This extended duration highlights the inefficiency of the process for large datasets, underscoring the need to optimize I/O operations in database sorting tasks.
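This arithmetic can be verified with a quick sketch (variable names are illustrative):

```python
# Sketch: total I/O and running time for 2-way external merge sort on
# the running example (B = 100,000 pages, 15 ms per page I/O).
import math

B = 100_000
passes = math.ceil(math.log2(B)) + 1   # 17 merge passes + 1 initial = 18
total_io = 2 * B * passes              # 2 I/Os per page per pass
hours = total_io * 15 / 1000 / 3600    # 15 ms each, converted to hours
print(passes, total_io, hours)         # 18 3600000 15.0
```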
How does an increase in memory buffers affect External Merge Sort?
With N buffers in main memory instead of 3, an (N−1)-way merge sort can be performed, where N−1 buffers are used for reading input runs and 1 buffer is dedicated to writing the sorted output.
- Drastic Improvement: Changing the base of the logarithm from 2 to N−1 drastically reduces the number of passes needed for sorting.
- Reduced I/O: This reduction in passes translates directly into fewer I/O operations, making the sort much more efficient, especially for large datasets.
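The resulting pass count can be sketched as follows, assuming pass 0 uses all N buffers to produce ⌈B/N⌉ initial runs of N pages each (the function name is illustrative, not from the lecture):

```python
# Sketch: pass count for (N-1)-way external merge sort, assuming pass 0
# creates ceil(B/N) initial sorted runs of N pages each.
import math

def num_passes(B, N):
    initial_runs = math.ceil(B / N)
    return 1 + math.ceil(math.log(initial_runs, N - 1))

# 128-way merge over the running 100,000-page example:
print(num_passes(100_000, 129))   # 3
```

With 129 buffers, the 100,000-page file sorts in 3 passes instead of the 18 needed by the naive 2-way scheme above, cutting total I/O proportionally.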
Note: make sure to practice the example at the end of the lecture for N-1 way merge sort