Databases Flashcards
Databases
What is an efficient method to jointly index multiple attributes in a database, and how does it improve query performance?
An efficient method is composite indexing, where a single index is created over multiple columns, e.g. (Department, Salary). It improves performance by reducing the query search space, so the query does not have to scan irrelevant rows. Placing the most frequently queried attribute first in the index speeds queries up further, and multiple attributes can now be accessed together.
The trade-off: a composite index increases storage space and introduces redundancy, so updates and insertions become slower because the data must be changed in multiple places.
What indexing strategy can be used to jointly index attributes A1, A2, …, An for efficient queries involving selection criteria on multiple attributes?
Use a composite index that combines attributes A1, A2, …, An into a single index. Custom indexes can also be built for the most frequently queried subsets, e.g. (A1, A2, A3). The index is most effective when the most queried attribute comes first, or when a query involves several of the indexed attributes. This gives quicker search times, but uses more storage, and updates become slower because multiple index entries must be maintained.
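A small sketch of a composite index in SQLite (the Employee table, its columns, and the index name are made up for illustration):

```python
import sqlite3

# In-memory database with a hypothetical Employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("Ann", "Sales", 40000), ("Bob", "Sales", 55000), ("Cat", "IT", 60000)],
)

# Composite index: the more frequently queried attribute (department) first.
conn.execute("CREATE INDEX idx_dept_salary ON Employee (department, salary)")

# EXPLAIN QUERY PLAN typically shows SQLite using the composite index
# for a query that filters on both indexed columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employee "
    "WHERE department = 'Sales' AND salary > 50000"
).fetchall()
print(plan)
```

Because department comes first in the index, a query on department alone can also use it, while a query on salary alone cannot.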
In hashing what is a collision?
A collision occurs when hashing two different keys results in the same index in the hash table.
Why can the simple collision resolution policy (CRP) result in poor search performance?
Collisions can occur due to a badly designed hash function.
Hash collisions increase the time it takes to search, insert, and delete.
This is because when a collision occurs, the algorithm has to search for the next available slot, which adds additional steps.
Collisions also create clusters of data that degrade performance: when a key hashes into one of these clusters, the search must scan past the cluster to insert.
Hash collisions create an uneven distribution of the data.
- What is Linear Probing?
And why do buckets need to be marked as tombstones when items are deleted from them?
Linear probing is a collision resolution strategy used in hashing. When two keys hash to the same index, linear probing resolves the collision by searching for the next available slot (bucket) in the hash table, wrapping around to the start when it reaches the end of the table. It is prone to clustering: when data groups together, it takes longer to find a free slot, which increases insertion time.
Buckets need to be marked as tombstones when items are deleted so that a later search does not hit the emptied slot and stop early: the value being searched for might lie after the tombstone bucket.
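The probing and tombstone rules can be sketched as a minimal hash table (illustrative, fixed size, keys only):

```python
# Minimal linear-probing hash table with tombstone deletion.
EMPTY, TOMBSTONE = object(), object()

class LinearProbeTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Start at the hash index and wrap around the table.
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = key
                return

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:       # truly empty slot ends the search
                return False
            if self.slots[i] == key:         # tombstones are skipped, not stopped at
                return True
        return False

    def delete(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] == key:
                self.slots[i] = TOMBSTONE    # mark, don't empty: later keys probed past here
                return

t = LinearProbeTable()
t.insert(8)
t.insert(16)   # collides with 8 (both hash to slot 0), probes on to slot 1
t.delete(8)    # slot 0 becomes a tombstone, so 16 stays reachable
```

If delete simply emptied slot 0, a search for 16 would stop at the empty slot and wrongly report it missing.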
- What is Dynamic About Extendible and Linear Hashing?
Dynamic hashing techniques like extendible hashing and linear hashing adapt the hash table's size dynamically to handle a growing number of keys. Unlike static hashing, where the hash table size is fixed, these methods grow or shrink the table as needed.
How do extendible and linear hashing differ?
Extendible hashing:
uses a directory to map hash keys to buckets.
The directory doubles in size when a bucket overflows.
Only the overflowing bucket splits.
Requires more memory for the directory, but allows for fast access.
Linear hashing:
does not use a directory.
Expands the hash table incrementally, splitting buckets sequentially when a bucket exceeds its capacity.
Offers a more gradual growth process, consuming less memory during reorganization.
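A sketch of linear hashing's incremental growth, assuming an initial table of two buckets with capacity two (parameters are illustrative; buckets that overflow before their turn to split simply hold extra keys):

```python
# Sketch of linear hashing: the table grows one bucket at a time.
class LinearHash:
    def __init__(self, n0=2, capacity=2):
        self.n0, self.level, self.next = n0, 0, 0
        self.capacity = capacity
        self.buckets = [[] for _ in range(n0)]

    def _addr(self, key):
        n = self.n0 * (2 ** self.level)
        i = key % n
        if i < self.next:            # bucket already split: use the finer hash
            i = key % (2 * n)
        return i

    def insert(self, key):
        i = self._addr(key)
        self.buckets[i].append(key)
        if len(self.buckets[i]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket at the split pointer (not necessarily the full one).
        n = self.n0 * (2 ** self.level)
        s = self.next
        self.buckets.append([])
        self.next += 1
        old, self.buckets[s] = self.buckets[s], []
        for k in old:                # redistribute with the finer hash function
            self.buckets[k % (2 * n)].append(k)
        if self.next == n:           # every bucket at this level has split
            self.level += 1
            self.next = 0

    def lookup(self, key):
        return key in self.buckets[self._addr(key)]

lh = LinearHash()
for k in range(10):
    lh.insert(k)
```

Note how growth is one bucket per split, unlike extendible hashing's directory doubling.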
Outline an algorithm for deleting an item from a linearly hashed file.
Deleting an item from a linearly hashed file requires careful handling to maintain the table's integrity and the sequence of the buckets for correct lookups.
Steps to delete
use the hash function to locate the bucket.
search for the item in that bucket.
if found, remove the item and mark its slot as a tombstone so that later searches still work.
Describe the structure of a B+ tree.
A B+ tree is a self-balancing tree extended from the B-tree. All values are stored in leaf nodes, while internal nodes are used for indexing and traversal.
Root Node:
Contains pointers to child nodes
Internal nodes:
Contains pointers to child nodes
serve to shrink the search space.
Leaf Nodes:
Store the actual data.
Linked together in a doubly linked list to allow traversal in both directions.
The keys in the leaf nodes are in sorted order
Properties of a B+ tree:
Balanced structure
Efficient range queries
Search efficiency
In relation to a B+ tree, write high-level pseudocode for returning all values of Ai in some defined range.
- Start at the root of the B+ tree.
- Traverse down the tree:
- Follow the correct child pointers in internal nodes based on the lower bound of the range (low).
- Stop when a leaf node containing keys near “low” is reached.
- Scan the leaf nodes:
- Collect all keys in the current leaf that fall within the range [low, high].
- If the current leaf contains a key greater than “high,” stop.
- Move to the next leaf node if needed:
- Use the linked list of leaf nodes to continue scanning for more keys within the range.
- Stop when all keys up to “high” have been collected.
- Return the list of keys found within the range.
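The pseudocode above can be sketched over a simplified leaf level (the Leaf class and chain are illustrative; a real B+ tree would first descend from the root to the leaf holding the lower bound):

```python
# Sketch of the leaf-level scan: leaves hold sorted keys and link to the next leaf.
class Leaf:
    def __init__(self, keys, nxt=None):
        self.keys = keys      # sorted keys in this leaf
        self.nxt = nxt        # pointer to the next leaf in key order

def range_query(first_leaf, low, high):
    """Collect all keys in [low, high] by following the leaf chain."""
    result, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result     # keys are sorted, so we can stop early
            if k >= low:
                result.append(k)
        leaf = leaf.nxt
    return result

# Three chained leaves standing in for a tree's leaf level.
l3 = Leaf([50, 60])
l2 = Leaf([30, 40], l3)
l1 = Leaf([10, 20], l2)
print(range_query(l1, 15, 45))
```

The linked leaves are what make this scan cheap: no repeated root-to-leaf descents are needed.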
What are the motivations for using a dynamic hashing approach?
Also name any approach to hashing a dynamic file
Dynamic hashing adapts the hash table size at runtime, handling growing datasets, as opposed to the fixed table sizes of static hashing.
Space efficiency
Adaptability
Minimized overhead, since it reduces the manual intervention needed to resize or rehash.
Types of dynamic hashing:
Linear hashing and extendible hashing are types of dynamic hashing.
Explain what the B+ tree insertion algorithm does.
First, locate the leaf node:
Start at the root and traverse down to the leaf node where the key should be inserted,
using the internal nodes to decide which child pointer to follow.
Insert into the leaf node:
Insert the key in sorted order within the leaf node if there is space.
If there is no space, the node overflows:
split the node into two nodes,
move the middle key up to the parent.
If the parent overflows, recursively split and propagate the middle key upwards.
Explain what is meant by the lost update problem and show with an example how a problem could arise
A lost update problem occurs when two or more transactions modify a single piece of shared data without proper synchronization. It typically happens when concurrency control is lacking,
and multiple transactions read a value and then update it based on the old value.
T1 reads 100
T2 reads 100
T1 adds 85 and writes 185
T2 adds 50 and writes 150
Instead of both updates going through, one goes through and the other is lost (the final value should be 235).
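The interleaving above can be simulated deterministically (a sketch, not a real database):

```python
# Deterministic simulation of the lost-update interleaving.
balance = 100

# Both transactions read before either writes.
t1_read = balance
t2_read = balance

balance = t1_read + 85   # T1 writes 185
balance = t2_read + 50   # T2 overwrites with 150: T1's update is lost

print(balance)           # 150, not the correct 235
```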
What is Two Phase Locking (2PL). And with an example of transactions show how it would proceed under 2PL.
2PL is a concurrency control protocol.
2 Phases
Growing phase:
The transaction acquires locks on the data items it needs;
once it releases any lock, it cannot acquire more.
Shrinking phase:
The transaction releases locks as it finishes using the items;
during this phase no new lock can be acquired.
Transactions must acquire shared or exclusive locks on data items in order to read or write them. The locks are then released, and other transactions can access the items for reading and writing.
T1 starts: Acquires an exclusive lock on the value (100).
Reads the value (100).
Adds 85 → New value is 185.
Writes 185.
Releases the lock.
T2 starts (only after T1 finishes):
Acquires an exclusive lock on the value (now 185).
Reads the value (185).
Adds 50 → New value is 235.
Writes 235.
Releases the lock.
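A sketch of the same two transactions serialized by an exclusive lock (Python threading stands in for a lock manager; under 2PL the lock is held across the whole read-modify-write):

```python
import threading

# Each transaction holds an exclusive lock across its read-modify-write,
# so the two updates serialize and neither is lost.
balance = 100
lock = threading.Lock()

def transaction(amount):
    global balance
    with lock:                 # growing phase: acquire before reading
        current = balance
        balance = current + amount
    # lock released here: shrinking phase

t1 = threading.Thread(target=transaction, args=(85,))
t2 = threading.Thread(target=transaction, args=(50,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)                 # 235 regardless of thread scheduling
```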
Prove that Two Phase Locking guarantees Conflict-serializability.
In 2PL there are two phases.
Growing phase:
the transaction acquires all required locks without releasing any.
Shrinking phase: once the transaction releases a lock, it cannot acquire any new locks.
This two-phase rule imposes a partial order on transactions based on the order in which locks are acquired.
The partial order prevents cycles in the precedence graph (conflict graph). Being acyclic (containing no cycles) is exactly the condition for conflict-serializability.
Concretely, the locking creates a dependency: T2 can only proceed on an item once T1 has released its lock on it.
This yields a dependency graph in which cycles are not possible.
Explain database recovery including system log, commit points, checkpoints.
Database recovery is returning the database to a correct, consistent state after a failure such as a power cut or crash.
Recovery ensures all committed transactions are preserved (REDO) and all incomplete transactions are removed (UNDO).
General Approach to recovery
1 System Log:
It records
When a transaction starts
What data is being changed (before value and after value)
What operations have been done
When a transaction finishes
System log helps the database fix errors if it crashes by undoing or redoing changes
2 Commit point:
This is the moment when a transaction is officially finished
Once a transaction reaches its commit point, all its changes are permanent and even if there is a system crash the changes are safe.
3 Checkpoint:
A save point for the database
It helps speed up recovery after a crash by reducing how much of the system log needs to be checked
Steps to create checkpoint:
Pause all current transactions
save all unfinished changes to disk
write a checkpoint marker in the system log
Resume the paused transactions.
After a crash, the database can start recovery from the last checkpoint instead of replaying the entire log.
Giving an example, explain what is meant by a temporary update problem.
Happens when one transaction updates a value and another reads that value before it is committed. If the first transaction is rolled back, the second transaction is working with a wrong value that it read from T1.
Example
T1 reads 100 and adds 50 (writes 150)
T2 reads 150 and adds 50
T1 rolls back (the value should revert to 100)
T2 outputs 200
This is wrong, and is known as a dirty read.
Define the term minimal cover set and outline an algorithm for finding a minimal cover set of a set of functional dependencies.
Hence find the minimal cover set of A->B, C -> B, D -> ABC, AC ->D.
If F is the original set of functional dependencies, a minimal cover is a simplified, equivalent set of dependencies satisfying these criteria:
1. Every dependency in the set has a single attribute on the right-hand side.
2. No left-hand side contains a redundant (extraneous) attribute.
3. No functional dependency is itself redundant (derivable from the others), e.g. A->C is redundant given A->B and B->C.
For the example: splitting D->ABC gives D->A, D->B, D->C; neither attribute of AC->D is extraneous (neither A+ nor C+ reaches D); D->B is redundant, since D->A and A->B imply it. Minimal cover: {A->B, C->B, D->A, D->C, AC->D}.
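The algorithm can be sketched with an attribute-closure helper, applied to the card's example F = {A->B, C->B, D->ABC, AC->D} (single-letter attributes; a sketch, not a full implementation of extraneous-attribute removal):

```python
# Closure-based minimal cover for the card's example FDs.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

# Step 1: every RHS is a single attribute (D->ABC split into D->A, D->B, D->C).
F = [({"A"}, "B"), ({"C"}, "B"),
     ({"D"}, "A"), ({"D"}, "B"), ({"D"}, "C"),
     ({"A", "C"}, "D")]

# Step 2: neither A nor C is extraneous in AC->D, since neither
# closure({A}) nor closure({C}) reaches D, so no left side shrinks.

# Step 3: drop any FD derivable from the remaining ones.
G = list(F)
for fd in list(F):
    others = [f for f in G if f is not fd]
    if fd[1] in closure(fd[0], others):
        G = others   # redundant (here only D->B, implied by D->A and A->B)

print(G)   # minimal cover: A->B, C->B, D->A, D->C, AC->D
```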
Decomposing into BCNF does not guarantee dependency-preservation. Explain this term.
Dependency-preservation means that all functional dependencies of the original relation are preserved in the decomposed relations,
meaning for every X -> Y in the original relation, you can enforce it on one of the decomposed relations without needing to compute a join.
BCNF requires that every left-hand side is a superkey;
during the decomposition, some of the functional dependencies may become unenforceable without computing a join.
Explain what denormalization is. Outline an example. Discuss the tradeoffs and challenges. Talk about data integrity efficiency and storage costs.
Denormalization is the process of combining normalized database tables to improve query performance by reducing the number of joins.
Examples
Normalized example:
Orders(OrderID, CustomerID)
Customers(CustomerID, Name, Address)
Denormalized table:
Orders(OrderID, CustomerID, Name, Address)
Advantages:
Faster query processing
Simplified queries
Disadvantages:
Introduces data redundancy
Difficult to update/maintain a consistent database when updates must occur in multiple tables
Increased storage, since the same attribute is stored in multiple tables
In relation to functional dependencies what is a “KEY” and how can it be found for a set of relations.
A key is the minimal set of attributes in a relation that can determine all the other attributes in the relation.
How to find the key:
Check the closure of a set of attributes to see whether it determines all attributes in the relation.
Test for minimality by removing attributes and checking whether the closure is still the whole attribute set.
What are the rules for 2NF 3NF and BCNF?
2NF: An attribute that is not part of the candidate key must be functionally dependent on the WHOLE candidate key.
3NF: No transitive dependencies.
BCNF: Every functional dependency has the left hand side as a super-key.
So if A->B and B->C hold,
A is a super-key but B is not, so we need to decompose:
into R1(A, B) and R2(B, C).
What is the incorrect summary problem in relation to concurrency control?
The incorrect summary problem is when a transaction reads and calculates the summary of a dataset while another transaction is concurrently modifying the data.
Can result in an incorrect summary of the data.
Register example:
T1 reads register 1 as $100
T2 updates register 1: $100 - $50 = $50
T2 updates register 2: $200 + $50 = $250
T1 reads register 2 as $250
Now T1 has an incorrect summary of the two registers at the end of the day ($350 instead of $300).
What is timestamping, and how would a simple temporary update problem progress with timestamping?
Timestamping is a concurrency control method that ensures transactions occur in a consistent, logical order. Every transaction gets a timestamp when it starts; the timestamp decides its priority (older transactions, with smaller timestamps, go first).
A transaction cannot update data that a newer transaction has already updated; if it tries, it is aborted.
Temporary update example
T1 $100 +$50 = $150
T2 Reads the temp value of $150
T1 Rolls back
T2 writes a value based on $150, which is incorrect.
Here timestamping would not allow T2 to use that value, since its timestamp is newer than T1's.
Therefore T2 is aborted and restarted.
How it Works:
Older transactions go first:
Transactions with smaller timestamps (older ones) have priority.
Rules for Conflicts:
If a newer transaction (with a larger timestamp) updates data, older transactions can’t touch that data anymore.
If an older transaction tries to update something that a newer one has already changed, the older transaction is aborted and restarted.
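These rules can be sketched as basic timestamp ordering on a single data item (read/write timestamps per item; Thomas's write rule and rollback handling are omitted for brevity):

```python
# Sketch of basic timestamp ordering for one data item.
class Item:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0     # largest timestamp that has read the item
        self.write_ts = 0    # largest timestamp that has written the item

def read(item, ts):
    if ts < item.write_ts:   # a younger transaction already wrote: abort
        raise RuntimeError(f"T{ts} aborted on read")
    item.read_ts = max(item.read_ts, ts)
    return item.value

def write(item, ts, value):
    if ts < item.read_ts or ts < item.write_ts:
        raise RuntimeError(f"T{ts} aborted on write")
    item.write_ts = ts
    item.value = value

x = Item(100)
val = read(x, 2)             # the younger T2 reads first
try:
    write(x, 1, 150)         # older T1 now tries to write: out of order
    aborted = False
except RuntimeError:
    aborted = True           # T1 is aborted and would be restarted
```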
What are the issues with two-phase locking and timestamping in relation to concurrency control?
Issues with 2-Phase Locking (2PL):
Deadlocks: Transactions may wait for each other indefinitely.
Blocking: Transactions are delayed waiting for locks.
Cascading Rollbacks: Rollbacks propagate to dependent transactions.
Reduced Parallelism: Locks limit simultaneous access to data.
Starvation: Low-priority transactions may never get locks.
Issues with Timestamping:
Frequent Restarts: Transactions are often aborted and restarted.
Starvation: Newer transactions are frequently delayed.
High Overhead: Managing timestamps for every operation is costly.
Resource Wastage: Restarted transactions waste system resources.
Issues with Long Transactions: Long-running transactions block newer updates
Explain one approach/algorithm to database recovery.
Find Last Checkpoint.
Scan Log for Committed Transactions that weren’t written to Disk, and “re-apply” them.
Scan Log for Uncommitted Transactions, and roll them back.
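The redo/undo scan can be sketched over a toy log of (txn, op, before, after) records (for brevity there is no checkpoint marker here, so the whole log is scanned; a real recovery would start from the last checkpoint):

```python
# Sketch of log-based recovery: REDO committed writes, UNDO uncommitted ones.
log = [
    ("T1", "start", None, None),
    ("T1", "write:X", 100, 185),
    ("T1", "commit", None, None),
    ("T2", "start", None, None),
    ("T2", "write:X", 185, 250),
    # crash here, before T2 commits
]

committed = {txn for txn, op, *_ in log if op == "commit"}

db = {}
for txn, op, before, after in log:
    if op.startswith("write:"):
        item = op.split(":")[1]
        # Re-apply the after-image for committed transactions,
        # restore the before-image for uncommitted ones.
        db[item] = after if txn in committed else before

print(db)   # {'X': 185}: T1's committed update survives, T2's is rolled back
```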
What are the 3 types of partitioning in parallel databases, and what are their positives and negatives? Mention point queries, range queries, and batch processing.
Types of partitioning:
- Round Robin
  - Sequentially assigns data to different partitions in a cyclic manner
  - Ensures even distribution of the data across all partitions
  - Good for batch processing as data is spread out
  - Poor performance for point and range queries as data is spread across multiple partitions
  - No logical grouping of related data
- Hash partitioning
  - Uses a hash function on an attribute to determine the partition for a data tuple
  - Good for point queries
  - Distributes data evenly if the hash function is chosen appropriately
  - Poor for range queries as the hash function does not preserve order
- Range partitioning
  - Divides data into defined ranges (e.g. 0-10)
  - Efficient range queries
  - Can lead to skewed data if the ranges are not chosen correctly
  - Less suited to point queries unless the queried key falls within a known range
Batch: operates on an entire dataset or large portions of it, e.g. summing all values.
Range: Retrieves all records that fall within a specific range of values.
Point queries: Retrieves a single record or a small number of records based on specific criteria.
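The three schemes can be sketched over a small integer dataset and four partitions (the range boundaries are chosen just for this demo):

```python
# Sketch of round-robin, hash, and range partitioning across 4 partitions.
N = 4
data = list(range(20, 40))

round_robin = {i: [] for i in range(N)}
for pos, value in enumerate(data):
    round_robin[pos % N].append(value)        # cyclic assignment

hash_part = {i: [] for i in range(N)}
for value in data:
    hash_part[hash(value) % N].append(value)  # partition chosen by hash

bounds = [25, 30, 35]                         # range boundaries for the demo
range_part = {i: [] for i in range(N)}
for value in data:
    part = sum(value >= b for b in bounds)    # index of the range it falls in
    range_part[part].append(value)

# A range query like 26..33 only touches range partitions 1 and 2,
# but must scan every round-robin partition.
print(range_part)
```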
What are the 3 types of parallelism?
Inter-query Parallelism:
Runs multiple queries on different processors. Speeds up transaction throughput, but does not speed up an individual query.
Intra-query Parallelism:
splits up a single query into parts and runs those parts in parallel. This speeds up execution time of complex queries.
Intra-operation Parallelism:
This breaks up the individual operations of a query and runs those in parallel. Joins and sorts are broken up and done simultaneously.
Why is parallel query processing important? Also discuss how parallelism can be achieved in query execution.
Why its important:
provides faster execution by splitting large queries into smaller tasks, processed simultaneously.
Can handle much bigger datasets efficiently by distributing the work.
Improves throughput and system performance.
3 Methods to achieve this
1. Inter-query Parallelism:
2. Intra-query Parallelism:
3. Intra-operation Parallelism:
In distributed databases, the database items are distributed across a number of sites with some items duplicated across a number of sites.
How would you ensure correct concurrency control in a distributed database?
3 ways to ensure concurrency control over multiple sites:
- Distributed two-phase locking: transactions acquire all required locks (across sites) before releasing any, as in the 2PL protocol described earlier (can cause deadlocks).
- Timestamp ordering: transactions are ordered by globally unique timestamps, and operations arriving out of timestamp order are aborted, as in the timestamping cards above.
- Distributed commit protocols (e.g. two-phase commit):
Phase 1: the coordinator site sends a prepare message; each site checks whether it can commit the transaction and replies yes or no.
Phase 2:
if all say yes, the transaction goes ahead;
if any site says no, the transaction is aborted and its changes are rolled back.
Ensures consistency across the databases (at the cost of high communication overhead for the system).
What is the difference between a B-Tree and a B+ Tree, including details about how their leaf nodes differ?
Data Storage:
B-Tree: Keys and data (values) are stored together in both internal and leaf nodes.
B+ Tree: Keys are stored in internal nodes for navigation, and all data (values) are stored exclusively in the leaf nodes.
Leaf Nodes:
B-Tree: Leaf nodes are not necessarily linked, making sequential access less efficient.
B+ Tree: Leaf nodes are linked, forming a linked list, which makes range queries and sequential access much faster.
Traversal:
B-Tree: Search may terminate at any node (internal or leaf) that contains the target key.
B+ Tree: Search always traverses to the leaf nodes, even if the key is found in an internal node, ensuring uniform data retrieval.
Redundancy:
B-Tree: Keys appear only once in the tree.
B+ Tree: Keys in internal nodes are repeated in the leaf nodes.
Use Cases:
B-Tree: Better suited for in-memory data structures with point queries.
B+ Tree: Ideal for databases and file systems due to efficient range queries and sequential access.
Outline an efficient approach to implementing sorting of data to satisfy, for example, the ORDER BY command. You may assume that the quantity of data to sort is too large to fit in main memory. Illustrate your approach with a small example. Comment on the efficiency of the approach.
External (parallelized) merge sort
Step 1: Split the dataset into smaller chunks that fit in memory.
Sort each chunk using an in-memory algorithm (e.g., quicksort).
Save the sorted chunks to disk.
Step 2: Merge the sorted chunks in multiple passes until one fully sorted file is produced.
Both the sorting and the merging can be parallelized.
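Both phases can be sketched with tiny chunk sizes standing in for memory limits (illustrative: a real system sorts disk pages and merges in bounded memory rather than reading whole runs back at once):

```python
import heapq
import os
import random
import tempfile

# Two-phase external merge sort sketch.
def external_sort(values, chunk_size=4):
    # Phase 1: sort chunks in memory and write each run to its own temp file.
    run_files = []
    for i in range(0, len(values), chunk_size):
        run = sorted(values[i:i + chunk_size])   # in-memory sort of one chunk
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, run)))
        f.seek(0)
        run_files.append(f)

    # Phase 2: k-way merge of the sorted runs into one sorted output.
    runs = [map(int, f.read().split()) for f in run_files]
    merged = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

data = random.sample(range(100), 20)
print(external_sort(data))
```

Efficiency: each element is read and written once per pass, so with a k-way merge the I/O cost is O(n log_k(n/M)) passes over the data for memory size M.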
Describe how
range partitioning operates and discuss any advantages or limitations that
exist with this approach.
How Range Partitioning Works:
Divide Data into Ranges:
Split data into partitions based on predefined value ranges (e.g., Partition 1: 1–100, Partition 2: 101–200).
Assign Records to Partitions:
Each record is placed in the partition corresponding to its key’s range.
Parallel Processing:
Each partition can be processed by a separate processor or node.
Query Optimization:
Range queries (e.g., key BETWEEN 50 AND 150) target only the relevant partitions, skipping others.
Advantages:
Efficient Range Queries:
Directly access partitions relevant to the query, improving performance.
Data Locality:
Data in the same range is stored together, reducing disk I/O.
Load Balancing:
When data is evenly distributed, workload is shared across multiple processors.
Simpler Merge Operations:
Sorted partitions are easy to combine since ranges don’t overlap.
Disadvantages:
Skewed Data Distribution:
Uneven ranges can cause some partitions to be overloaded (hotspots), leading to imbalanced workloads.
Static Boundaries:
Poorly chosen range boundaries can become inefficient if data distribution changes over time.
Query Limitations:
Point queries or queries outside defined ranges may require scanning multiple partitions.
General Sorting of a Large Dataset
Focus: Sorting a single large dataset that does not fit in main memory.
Approach: Use External Merge Sort, which:
Divides data into chunks that fit in memory.
Sorts each chunk in memory and writes sorted chunks to disk.
Merges the sorted chunks into one final sorted dataset.
Key Example: Demonstrate sorting chunks and merging them.
Sorting Partitioned Data in Parallel
Focus: Sorting tuples from a dataset that is already partitioned across multiple processors in a parallel database.
Approach: Use Parallel Sort-Merge, which:
Sorts data locally within each partition using an in-memory algorithm.
Merges the sorted partitions globally to produce a fully sorted dataset.
Key Example: Show how local sorting and global merging work across partitions.
Efficiency: Discuss advantages of parallelism and potential overhead from merging.
What is the Deferred Update Protocol
A Protocol where changes made by Transactions are not written to the disk until they are committed.
What is the Immediate Update Protocol?
A protocol where changes made by transactions are written to disk immediately, applying changes before they are committed.
What is grid file indexing? advantages disadvantages?
Definition: A multi-dimensional indexing technique that divides data space into grid cells for efficient querying of multi-attribute data.
How It Works:
Divides each attribute (dimension) into intervals to form a grid.
Maps data points to grid cells.
Uses a directory to store references to non-empty cells.
Advantages:
Efficient for multi-dimensional queries (e.g., range or point searches).
Compact storage (only stores non-empty cells).
Dynamically adjustable grid.
Disadvantages:
Poor performance with skewed data distributions.
High maintenance for dynamic adjustments.
Limited for complex or high-dimensional queries.
Use Cases:
GIS: Spatial data (e.g., mapping coordinates).
Multi-key Queries: Searching based on multiple attributes (e.g., age and salary).
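A sketch of a grid file over two attributes, age and salary (the interval boundaries and records are made up; real grid files also split cells dynamically as data arrives):

```python
from collections import defaultdict

# Fixed intervals per dimension: age <30, 30-49, >=50; salary <40k, 40-79k, >=80k.
AGE_SPLITS = [30, 50]
SALARY_SPLITS = [40000, 80000]

def cell(age, salary):
    # A point's cell is the pair of interval indices it falls into.
    return (sum(age >= s for s in AGE_SPLITS),
            sum(salary >= s for s in SALARY_SPLITS))

directory = defaultdict(list)   # only non-empty cells get an entry
records = [("Ann", 25, 35000), ("Bob", 42, 90000), ("Cat", 55, 45000)]
for name, age, salary in records:
    directory[cell(age, salary)].append(name)

# Multi-attribute point query: one cell lookup instead of a full scan.
print(directory[cell(42, 90000)])   # ['Bob']
```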
How can the following Operations be expressed in a Deductive Database:
SELECT * FROM employees WHERE age > 30;
SELECT name FROM employees;
SELECT * FROM employees e, departments d WHERE e.dept_id = d.dept_id;
Result(Name, Age, Dept) :- employees(Name, Age, Dept), Age > 30.
Result(Name) :- employees(Name, _, _).
Result(Name, Age, Dept, DeptName) :- employees(Name, Age, Dept), departments(Dept, DeptName).
What is the Star Property?
A constraint in BLP that prevents a subject (user or process) with a given security clearance level from writing data to a lower level (no write down).
What is the Simple Security Property?
A Constraint in BLP that prevents a subject (user or process) with a given security clearance level from reading data from a higher level.
What is the Bell-LaPadula Model?
A model for enforcing data confidentiality in multilevel systems, defining rules to prevent unauthorized information access.
What are Spurious Tuples?
Inaccurate tuples formed by incorrectly joining two decomposed tables.