Databases Flashcards
Databases
What is an efficient method to jointly index multiple attributes in a database, and how does it improve query performance?
An efficient method is composite indexing, where a single index is created over multiple columns, e.g. (Department, Salary). It improves performance by reducing the query search space, so the query does not have to scan irrelevant rows. Placing the most frequently queried attribute first in the index speeds queries up further, and multiple attributes can now be accessed together.
The trade-off: a composite index increases storage space and introduces redundancy, so updates and insertions become slower because the data must be changed in multiple places.
What indexing strategy can be used to jointly index attributes A1, A2, …, An for efficient queries involving selection criteria on multiple attributes?
Use a composite index that combines attributes A1, A2, …, An into a single index. Custom indexes can also be built for the most frequently queried subsets, e.g. (A1, A2, A3). The index is most effective when the most queried attribute comes first, or when a query involves several of the indexed attributes. This gives quicker search times, but uses more storage, and updates become slower because multiple index entries must be maintained.
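A small sketch of a composite index in SQLite (the Employee table, its columns, and the index name are made up for illustration):

```python
import sqlite3

# In-memory database with a hypothetical Employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("Ann", "Sales", 40000), ("Bob", "Sales", 55000), ("Cat", "IT", 60000)],
)

# Composite index: the more frequently queried attribute (department) first.
conn.execute("CREATE INDEX idx_dept_salary ON Employee (department, salary)")

# EXPLAIN QUERY PLAN typically shows SQLite using the composite index
# for a query that filters on both indexed columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employee "
    "WHERE department = 'Sales' AND salary > 50000"
).fetchall()
print(plan)
```

Because department comes first in the index, a query on department alone can also use it, while a query on salary alone cannot.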
In hashing what is a collision?
A collision occurs when hashing two different keys results in the same index in the hash table.
Why can the simple collision resolution policy (CRP) result in poor search performance?
Collisions can occur due to a badly designed hash function.
Hash collisions increase the time it takes to search, insert, and delete.
This is because when a collision occurs, the algorithm has to search for the next available slot, which adds additional steps.
Collisions also create clusters of data that degrade performance: when a key hashes into one of these clusters, the search must scan past the cluster to insert.
Hash collisions create an uneven distribution of the data.
- What is Linear Probing?
And why do buckets need to be marked as tombstones when items are deleted from them?
Linear probing is a collision resolution strategy used in hashing. When two keys hash to the same index, linear probing resolves the collision by searching for the next available slot (bucket) in the hash table, wrapping around to the start when it reaches the end of the table. It is prone to clustering: when data groups together, it takes longer to find a free slot, which increases insertion time.
Buckets need to be marked as tombstones when items are deleted so that a later search does not hit the emptied slot and stop early: the value being searched for might lie after the tombstone bucket.
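The probing and tombstone rules can be sketched as a minimal hash table (illustrative, fixed size, keys only):

```python
# Minimal linear-probing hash table with tombstone deletion.
EMPTY, TOMBSTONE = object(), object()

class LinearProbeTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Start at the hash index and wrap around the table.
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = key
                return

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:       # truly empty slot ends the search
                return False
            if self.slots[i] == key:         # tombstones are skipped, not stopped at
                return True
        return False

    def delete(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] == key:
                self.slots[i] = TOMBSTONE    # mark, don't empty: later keys probed past here
                return

t = LinearProbeTable()
t.insert(8)
t.insert(16)   # collides with 8 (both hash to slot 0), probes on to slot 1
t.delete(8)    # slot 0 becomes a tombstone, so 16 stays reachable
```

If delete simply emptied slot 0, a search for 16 would stop at the empty slot and wrongly report it missing.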
- What is Dynamic About Extendible and Linear Hashing?
Dynamic hashing techniques like extendible hashing and linear hashing adapt the hash table's size dynamically to handle a growing number of keys. Unlike static hashing, where the hash table size is fixed, these methods grow or shrink the table as needed.
How do extendible and linear hashing differ?
Extendible hashing:
uses a directory to map hash keys to buckets.
The directory doubles in size when a bucket overflows.
Only the overflowing bucket splits.
Requires more memory for the directory, but allows for fast access.
Linear hashing:
does not use a directory.
Expands the hash table incrementally, splitting buckets sequentially when a bucket exceeds its capacity.
Offers a more gradual growth process, consuming less memory during reorganization.
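A sketch of linear hashing's incremental growth, assuming an initial table of two buckets with capacity two (parameters are illustrative; buckets that overflow before their turn to split simply hold extra keys):

```python
# Sketch of linear hashing: the table grows one bucket at a time.
class LinearHash:
    def __init__(self, n0=2, capacity=2):
        self.n0, self.level, self.next = n0, 0, 0
        self.capacity = capacity
        self.buckets = [[] for _ in range(n0)]

    def _addr(self, key):
        n = self.n0 * (2 ** self.level)
        i = key % n
        if i < self.next:            # bucket already split: use the finer hash
            i = key % (2 * n)
        return i

    def insert(self, key):
        i = self._addr(key)
        self.buckets[i].append(key)
        if len(self.buckets[i]) > self.capacity:
            self._split()

    def _split(self):
        # Split the bucket at the split pointer (not necessarily the full one).
        n = self.n0 * (2 ** self.level)
        s = self.next
        self.buckets.append([])
        self.next += 1
        old, self.buckets[s] = self.buckets[s], []
        for k in old:                # redistribute with the finer hash function
            self.buckets[k % (2 * n)].append(k)
        if self.next == n:           # every bucket at this level has split
            self.level += 1
            self.next = 0

    def lookup(self, key):
        return key in self.buckets[self._addr(key)]

lh = LinearHash()
for k in range(10):
    lh.insert(k)
```

Note how growth is one bucket per split, unlike extendible hashing's directory doubling.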
Outline an algorithm for deleting an item from a linearly hashed file.
Deleting an item from a linearly hashed file requires careful handling to maintain the table's integrity and the sequence of the buckets for correct lookups.
Steps to delete
use the hash function to locate the bucket.
search for the item in that bucket.
if found, remove the item and mark its slot as a tombstone so that later searches still work.
Describe the structure of a B+ tree.
A B+ tree is a self-balancing tree extended from the B-tree. All values are stored in leaf nodes, while internal nodes are used for indexing and traversal.
Root Node:
Contains pointers to child nodes
Internal nodes:
Contains pointers to child nodes
serve to shrink the search space.
Leaf Nodes:
Store the actual data.
Linked together in a doubly linked list to allow traversal in both directions.
The keys in the leaf nodes are in sorted order
Properties of a B+ tree:
Balanced structure
Efficient range queries
Search efficiency
In relation to a B+ tree, write high-level pseudocode for returning all values of Ai in some defined range.
- Start at the root of the B+ tree.
- Traverse down the tree:
- Follow the correct child pointers in internal nodes based on the lower bound of the range (low).
- Stop when a leaf node containing keys near “low” is reached.
- Scan the leaf nodes:
- Collect all keys in the current leaf that fall within the range [low, high].
- If the current leaf contains a key greater than “high,” stop.
- Move to the next leaf node if needed:
- Use the linked list of leaf nodes to continue scanning for more keys within the range.
- Stop when all keys up to “high” have been collected.
- Return the list of keys found within the range.
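The pseudocode above can be sketched over a simplified leaf level (the Leaf class and chain are illustrative; a real B+ tree would first descend from the root to the leaf holding the lower bound):

```python
# Sketch of the leaf-level scan: leaves hold sorted keys and link to the next leaf.
class Leaf:
    def __init__(self, keys, nxt=None):
        self.keys = keys      # sorted keys in this leaf
        self.nxt = nxt        # pointer to the next leaf in key order

def range_query(first_leaf, low, high):
    """Collect all keys in [low, high] by following the leaf chain."""
    result, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result     # keys are sorted, so we can stop early
            if k >= low:
                result.append(k)
        leaf = leaf.nxt
    return result

# Three chained leaves standing in for a tree's leaf level.
l3 = Leaf([50, 60])
l2 = Leaf([30, 40], l3)
l1 = Leaf([10, 20], l2)
print(range_query(l1, 15, 45))
```

The linked leaves are what make this scan cheap: no repeated root-to-leaf descents are needed.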
What are the motivations for using a dynamic hashing approach?
Also name any approach to hashing a dynamic file
Dynamic hashing adapts the hash table size at runtime, handling growing datasets, as opposed to the fixed table sizes of static hashing.
Space efficiency
Adaptability
Minimized overhead, since it reduces the manual intervention needed to resize or rehash.
Types of dynamic hashing:
Linear hashing and extendible hashing are types of dynamic hashing.
Explain what the B+ tree insertion algorithm does.
First, locate the leaf node:
Start at the root and traverse down to the leaf node where the key should be inserted,
using the internal nodes to decide which child pointer to follow.
Insert into the leaf node:
Insert the key in sorted order within the leaf node if there is space.
If there is no space, the node overflows:
split the node into two nodes,
move the middle key up to the parent.
If the parent overflows, recursively split and propagate the middle key upwards.
Explain what is meant by the lost update problem and show with an example how a problem could arise
A lost update problem occurs when two or more transactions modify a single piece of shared data without proper synchronization. It typically happens when concurrency control is lacking,
and multiple transactions read a value and then update it based on the old value.
T1 reads 100
T2 reads 100
T1 adds 85 and writes 185
T2 adds 50 and writes 150
Instead of both updates going through, one goes through and the other is lost (the final value should be 235).
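The interleaving above can be simulated deterministically (a sketch, not a real database):

```python
# Deterministic simulation of the lost-update interleaving.
balance = 100

# Both transactions read before either writes.
t1_read = balance
t2_read = balance

balance = t1_read + 85   # T1 writes 185
balance = t2_read + 50   # T2 overwrites with 150: T1's update is lost

print(balance)           # 150, not the correct 235
```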
What is Two Phase Locking (2PL). And with an example of transactions show how it would proceed under 2PL.
2PL is a concurrency control protocol.
2 Phases
Growing phase:
The transaction acquires locks on the data items it needs;
once it releases any lock, it cannot acquire more.
Shrinking phase:
The transaction releases locks as it finishes using the items;
during this phase no new lock can be acquired.
Transactions must acquire shared or exclusive locks on data items in order to read or write them. The locks are then released, and other transactions can access the items for reading and writing.
T1 starts: Acquires an exclusive lock on the value (100).
Reads the value (100).
Adds 85 → New value is 185.
Writes 185.
Releases the lock.
T2 starts (only after T1 finishes):
Acquires an exclusive lock on the value (now 185).
Reads the value (185).
Adds 50 → New value is 235.
Writes 235.
Releases the lock.
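A sketch of the same two transactions serialized by an exclusive lock (Python threading stands in for a lock manager; under 2PL the lock is held across the whole read-modify-write):

```python
import threading

# Each transaction holds an exclusive lock across its read-modify-write,
# so the two updates serialize and neither is lost.
balance = 100
lock = threading.Lock()

def transaction(amount):
    global balance
    with lock:                 # growing phase: acquire before reading
        current = balance
        balance = current + amount
    # lock released here: shrinking phase

t1 = threading.Thread(target=transaction, args=(85,))
t2 = threading.Thread(target=transaction, args=(50,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)                 # 235 regardless of thread scheduling
```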
Prove that Two Phase Locking guarantees Conflict-serializability.
In 2PL there are two phases.
Growing phase:
the transaction acquires all required locks without releasing any.
Shrinking phase: once the transaction releases a lock, it cannot acquire any new locks.
This two-phase rule imposes a partial order on transactions based on the order in which locks are acquired.
The partial order prevents cycles in the precedence graph (conflict graph). Being acyclic (containing no cycles) is exactly the condition for conflict-serializability.
Concretely, the locking creates a dependency: T2 can only proceed on an item once T1 has released its lock on it.
This yields a dependency graph in which cycles are not possible.
Explain database recovery including system log, commit points, checkpoints.
Database recovery is returning the database to a correct, consistent state after a failure such as a power cut or crash.
Recovery ensures all committed transactions are preserved (REDO) and all incomplete transactions are removed (UNDO).
General Approach to recovery
1 System Log:
It records
When a transaction starts
What data is being changed (before value and after value)
What operations have been done
When a transaction finishes
System log helps the database fix errors if it crashes by undoing or redoing changes
2 Commit point:
This is the moment when a transaction is officially finished
Once a transaction reaches its commit point, all its changes are permanent and even if there is a system crash the changes are safe.
3 Checkpoint:
A save point for the database
It helps speed up recovery after a crash by reducing how much of the system log needs to be checked
Steps to create checkpoint:
Pause all current transactions
save all unfinished changes to disk
write a checkpoint marker in the system log
Resume the paused transactions.
After a crash, the database can start recovery from the last checkpoint instead of replaying the entire log.
Giving an example, explain what is meant by a temporary update problem.
Happens when one transaction updates a value and another reads that value before it is committed. If the first transaction is rolled back, the second transaction is working with a wrong value that it read from T1.
Example
T1 reads 100 and adds 50 (writes 150)
T2 reads 150 and adds 50
T1 rolls back (the value should revert to 100)
T2 outputs 200
This is wrong, and is known as a dirty read.
Define the term minimal cover set and outline an algorithm for finding a minimal cover set of a set of functional dependencies.
Hence find the minimal cover set of A->B, C -> B, D -> ABC, AC ->D.
If F is the original set of functional dependencies, a minimal cover is a simplified, equivalent set of dependencies satisfying these criteria:
1. Every dependency in the set has a single attribute on the right-hand side.
2. No left-hand side contains a redundant (extraneous) attribute.
3. No functional dependency is itself redundant (derivable from the others), e.g. A->C is redundant given A->B and B->C.
For the example: splitting D->ABC gives D->A, D->B, D->C; neither attribute of AC->D is extraneous (neither A+ nor C+ reaches D); D->B is redundant, since D->A and A->B imply it. Minimal cover: {A->B, C->B, D->A, D->C, AC->D}.
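The algorithm can be sketched with an attribute-closure helper, applied to the card's example F = {A->B, C->B, D->ABC, AC->D} (single-letter attributes; a sketch, not a full implementation of extraneous-attribute removal):

```python
# Closure-based minimal cover for the card's example FDs.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

# Step 1: every RHS is a single attribute (D->ABC split into D->A, D->B, D->C).
F = [({"A"}, "B"), ({"C"}, "B"),
     ({"D"}, "A"), ({"D"}, "B"), ({"D"}, "C"),
     ({"A", "C"}, "D")]

# Step 2: neither A nor C is extraneous in AC->D, since neither
# closure({A}) nor closure({C}) reaches D, so no left side shrinks.

# Step 3: drop any FD derivable from the remaining ones.
G = list(F)
for fd in list(F):
    others = [f for f in G if f is not fd]
    if fd[1] in closure(fd[0], others):
        G = others   # redundant (here only D->B, implied by D->A and A->B)

print(G)   # minimal cover: A->B, C->B, D->A, D->C, AC->D
```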
Decomposing into BCNF does not guarantee dependency-preservation. Explain this term.
Dependency-preservation means that all functional dependencies of the original relation are preserved in the decomposed relations,
meaning for every X -> Y in the original relation, you can enforce it on one of the decomposed relations without needing to compute a join.
BCNF requires that every left-hand side is a superkey;
during the decomposition, some of the functional dependencies may become unenforceable without computing a join.
Explain what denormalization is. Outline an example. Discuss the tradeoffs and challenges. Talk about data integrity efficiency and storage costs.
Denormalization is the process of combining normalized database tables to improve query performance by reducing the number of joins.
Examples
Normalized example:
Orders(OrderID, CustomerID)
Customers(CustomerID, Name, Address)
Denormalized table:
Orders(OrderID, CustomerID, Name, Address)
Advantages:
Faster query processing
Simplified queries
Disadvantages:
Introduces data redundancy
Difficult to update/maintain a consistent database when updates must occur in multiple tables
Increased storage, since the same attribute is stored in multiple tables
In relation to functional dependencies what is a “KEY” and how can it be found for a set of relations.
A key is the minimal set of attributes in a relation that can determine all the other attributes in the relation.
How to find the key:
Check the closure of a set of attributes to see whether it determines all attributes in the relation.
Test for minimality by removing attributes and checking whether the closure is still the whole attribute set.
What are the rules for 2NF 3NF and BCNF?
2NF: An attribute that is not part of the candidate key must be functionally dependent on the WHOLE candidate key.
3NF: No transitive dependencies.
BCNF: Every functional dependency has the left hand side as a super-key.
So if A->B and B->C hold,
A is a super-key but B is not, so we need to decompose:
into R1(A, B) and R2(B, C).
What is the incorrect summary problem in relation to concurrency control?
The incorrect summary problem is when a transaction reads and calculates the summary of a dataset while another transaction is concurrently modifying the data.
Can result in an incorrect summary of the data.
Register example:
T1 reads register 1 as $100
T2 updates register 1: $100 - $50 = $50
T2 updates register 2: $200 + $50 = $250
T1 reads register 2 as $250
Now T1 has an incorrect summary of the two registers at the end of the day ($350 instead of $300).
What is timestamping, and how would a simple temporary update problem progress with timestamping?
Timestamping is a concurrency control method that ensures transactions occur in a consistent, logical order. Every transaction gets a timestamp when it starts; the timestamp decides its priority (older transactions, with smaller timestamps, go first).
A transaction cannot update data that a newer transaction has already updated; if it tries, it is aborted.
Temporary update example
T1 $100 +$50 = $150
T2 Reads the temp value of $150
T1 Rolls back
T2 writes a value based on $150, which is incorrect.
Here timestamping would not allow T2 to use that value, since its timestamp is newer than T1's.
Therefore T2 is aborted and restarted.
How it Works:
Older transactions go first:
Transactions with smaller timestamps (older ones) have priority.
Rules for Conflicts:
If a newer transaction (with a larger timestamp) updates data, older transactions can’t touch that data anymore.
If an older transaction tries to update something that a newer one has already changed, the older transaction is aborted and restarted.
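These rules can be sketched as basic timestamp ordering on a single data item (read/write timestamps per item; Thomas's write rule and rollback handling are omitted for brevity):

```python
# Sketch of basic timestamp ordering for one data item.
class Item:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0     # largest timestamp that has read the item
        self.write_ts = 0    # largest timestamp that has written the item

def read(item, ts):
    if ts < item.write_ts:   # a younger transaction already wrote: abort
        raise RuntimeError(f"T{ts} aborted on read")
    item.read_ts = max(item.read_ts, ts)
    return item.value

def write(item, ts, value):
    if ts < item.read_ts or ts < item.write_ts:
        raise RuntimeError(f"T{ts} aborted on write")
    item.write_ts = ts
    item.value = value

x = Item(100)
val = read(x, 2)             # the younger T2 reads first
try:
    write(x, 1, 150)         # older T1 now tries to write: out of order
    aborted = False
except RuntimeError:
    aborted = True           # T1 is aborted and would be restarted
```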
What are the issues with two-phase locking and timestamping in relation to concurrency control?
Issues with 2-Phase Locking (2PL):
Deadlocks: Transactions may wait for each other indefinitely.
Blocking: Transactions are delayed waiting for locks.
Cascading Rollbacks: Rollbacks propagate to dependent transactions.
Reduced Parallelism: Locks limit simultaneous access to data.
Starvation: Low-priority transactions may never get locks.
Issues with Timestamping:
Frequent Restarts: Transactions are often aborted and restarted.
Starvation: Newer transactions are frequently delayed.
High Overhead: Managing timestamps for every operation is costly.
Resource Wastage: Restarted transactions waste system resources.
Issues with Long Transactions: Long-running transactions block newer updates
Explain one approach/algorithm to database recovery.
Find Last Checkpoint.
Scan Log for Committed Transactions that weren’t written to Disk, and “re-apply” them.
Scan Log for Uncommitted Transactions, and roll them back.
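The redo/undo scan can be sketched over a toy log of (txn, op, before, after) records (for brevity there is no checkpoint marker here, so the whole log is scanned; a real recovery would start from the last checkpoint):

```python
# Sketch of log-based recovery: REDO committed writes, UNDO uncommitted ones.
log = [
    ("T1", "start", None, None),
    ("T1", "write:X", 100, 185),
    ("T1", "commit", None, None),
    ("T2", "start", None, None),
    ("T2", "write:X", 185, 250),
    # crash here, before T2 commits
]

committed = {txn for txn, op, *_ in log if op == "commit"}

db = {}
for txn, op, before, after in log:
    if op.startswith("write:"):
        item = op.split(":")[1]
        # Re-apply the after-image for committed transactions,
        # restore the before-image for uncommitted ones.
        db[item] = after if txn in committed else before

print(db)   # {'X': 185}: T1's committed update survives, T2's is rolled back
```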
What are the 3 types of partitioning in parallel databases, and what are their positives and negatives? Mention point queries, range queries, and batch processing.
Types of partitioning:
- Round Robin
  - Sequentially assigns data to different partitions in a cyclic manner
  - Ensures even distribution of the data across all partitions
  - Good for batch processing as data is spread out
  - Poor performance for point and range queries as data is spread across multiple partitions
  - No logical grouping of related data
- Hash partitioning
  - Uses a hash function on an attribute to determine the partition for a data tuple
  - Good for point queries
  - Distributes data evenly if the hash function is chosen appropriately
  - Poor for range queries as the hash function does not preserve order
- Range partitioning
  - Divides data into defined ranges (e.g. 0-10)
  - Efficient range queries
  - Can lead to skewed data if the ranges are not chosen correctly
  - Less suited to point queries unless the queried key falls within a known range
Batch: operates on an entire dataset or large portions of it, e.g. summing all values.
Range: Retrieves all records that fall within a specific range of values.
Point queries: Retrieves a single record or a small number of records based on specific criteria.
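The three schemes can be sketched over a small integer dataset and four partitions (the range boundaries are chosen just for this demo):

```python
# Sketch of round-robin, hash, and range partitioning across 4 partitions.
N = 4
data = list(range(20, 40))

round_robin = {i: [] for i in range(N)}
for pos, value in enumerate(data):
    round_robin[pos % N].append(value)        # cyclic assignment

hash_part = {i: [] for i in range(N)}
for value in data:
    hash_part[hash(value) % N].append(value)  # partition chosen by hash

bounds = [25, 30, 35]                         # range boundaries for the demo
range_part = {i: [] for i in range(N)}
for value in data:
    part = sum(value >= b for b in bounds)    # index of the range it falls in
    range_part[part].append(value)

# A range query like 26..33 only touches range partitions 1 and 2,
# but must scan every round-robin partition.
print(range_part)
```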
What are the 3 types of parallelism?
Inter-query Parallelism:
Runs multiple queries on different processors. Speeds up transaction throughput, but does not speed up an individual query.
Intra-query Parallelism:
splits up a single query into parts and runs those parts in parallel. This speeds up execution time of complex queries.
Intra-operation Parallelism:
This breaks up the individual operations of a query and runs those in parallel. Joins and sorts are broken up and done simultaneously.
Why is parallel query processing important? Also discuss how parallelism can be achieved in query execution.
Why its important:
provides faster execution by splitting large queries into smaller tasks, processed simultaneously.
Can handle much bigger datasets efficiently by distributing the work.
Improves throughput and system performance.
3 Methods to achieve this
1. Inter-query Parallelism:
2. Intra-query Parallelism:
3. Intra-operation Parallelism:
In distributed databases, the database items are distributed across a number of sites with some items duplicated across a number of sites.
How would you ensure correct concurrency control in a distributed database?
3 ways to ensure concurrency control over multiple sites:
- Distributed two-phase locking: transactions acquire all required locks (across sites) before releasing any, as in the 2PL protocol described earlier (can cause deadlocks).
- Timestamp ordering: transactions are ordered by globally unique timestamps, and operations arriving out of timestamp order are aborted, as in the timestamping cards above.
- Distributed commit protocols (e.g. two-phase commit):
Phase 1: the coordinator site sends a prepare message; each site checks whether it can commit the transaction and replies yes or no.
Phase 2:
if all say yes, the transaction goes ahead;
if any site says no, the transaction is aborted and its changes are rolled back.
Ensures consistency across the databases (at the cost of high communication overhead for the system).
What is the difference between a B-Tree and a B+ Tree, including details about how their leaf nodes differ?
Data Storage:
B-Tree: Keys and data (values) are stored together in both internal and leaf nodes.
B+ Tree: Keys are stored in internal nodes for navigation, and all data (values) are stored exclusively in the leaf nodes.
Leaf Nodes:
B-Tree: Leaf nodes are not necessarily linked, making sequential access less efficient.
B+ Tree: Leaf nodes are linked, forming a linked list, which makes range queries and sequential access much faster.
Traversal:
B-Tree: Search may terminate at any node (internal or leaf) that contains the target key.
B+ Tree: Search always traverses to the leaf nodes, even if the key is found in an internal node, ensuring uniform data retrieval.
Redundancy:
B-Tree: Keys appear only once in the tree.
B+ Tree: Keys in internal nodes are repeated in the leaf nodes.
Use Cases:
B-Tree: Better suited for in-memory data structures with point queries.
B+ Tree: Ideal for databases and file systems due to efficient range queries and sequential access.
Outline an efficient approach to implementing sorting of data to satisfy, for example, the ORDER BY command. You may assume that the quantity of data to sort is too large to fit in main memory. Illustrate your approach with a small example. Comment on the efficiency of the approach.
External (parallelized) merge sort
Step 1: Split the dataset into smaller chunks that fit in memory.
Sort each chunk using an in-memory algorithm (e.g., quicksort).
Save the sorted chunks to disk.
Step 2: Merge the sorted chunks in multiple passes until one fully sorted file is produced.
Both the sorting and the merging can be parallelized.
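Both phases can be sketched with tiny chunk sizes standing in for memory limits (illustrative: a real system sorts disk pages and merges in bounded memory rather than reading whole runs back at once):

```python
import heapq
import os
import random
import tempfile

# Two-phase external merge sort sketch.
def external_sort(values, chunk_size=4):
    # Phase 1: sort chunks in memory and write each run to its own temp file.
    run_files = []
    for i in range(0, len(values), chunk_size):
        run = sorted(values[i:i + chunk_size])   # in-memory sort of one chunk
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, run)))
        f.seek(0)
        run_files.append(f)

    # Phase 2: k-way merge of the sorted runs into one sorted output.
    runs = [map(int, f.read().split()) for f in run_files]
    merged = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

data = random.sample(range(100), 20)
print(external_sort(data))
```

Efficiency: each element is read and written once per pass, so with a k-way merge the I/O cost is O(n log_k(n/M)) passes over the data for memory size M.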
Describe how
range partitioning operates and discuss any advantages or limitations that
exist with this approach.
How Range Partitioning Works:
Divide Data into Ranges:
Split data into partitions based on predefined value ranges (e.g., Partition 1: 1–100, Partition 2: 101–200).
Assign Records to Partitions:
Each record is placed in the partition corresponding to its key’s range.
Parallel Processing:
Each partition can be processed by a separate processor or node.
Query Optimization:
Range queries (e.g., key BETWEEN 50 AND 150) target only the relevant partitions, skipping others.
Advantages:
Efficient Range Queries:
Directly access partitions relevant to the query, improving performance.
Data Locality:
Data in the same range is stored together, reducing disk I/O.
Load Balancing:
When data is evenly distributed, workload is shared across multiple processors.
Simpler Merge Operations:
Sorted partitions are easy to combine since ranges don’t overlap.
Disadvantages:
Skewed Data Distribution:
Uneven ranges can cause some partitions to be overloaded (hotspots), leading to imbalanced workloads.
Static Boundaries:
Poorly chosen range boundaries can become inefficient if data distribution changes over time.
Query Limitations:
Point queries or queries outside defined ranges may require scanning multiple partitions.
General Sorting of a Large Dataset
Focus: Sorting a single large dataset that does not fit in main memory.
Approach: Use External Merge Sort, which:
Divides data into chunks that fit in memory.
Sorts each chunk in memory and writes sorted chunks to disk.
Merges the sorted chunks into one final sorted dataset.
Key Example: Demonstrate sorting chunks and merging them.
Sorting Partitioned Data in Parallel
Focus: Sorting tuples from a dataset that is already partitioned across multiple processors in a parallel database.
Approach: Use Parallel Sort-Merge, which:
Sorts data locally within each partition using an in-memory algorithm.
Merges the sorted partitions globally to produce a fully sorted dataset.
Key Example: Show how local sorting and global merging work across partitions.
Efficiency: Discuss advantages of parallelism and potential overhead from merging.
What is the Deferred Update Protocol
A Protocol where changes made by Transactions are not written to the disk until they are committed.
What is the Immediate Update Protocol?
A protocol where changes made by transactions are written to disk immediately, applying changes before they are committed.
What is grid file indexing? advantages disadvantages?
Definition: A multi-dimensional indexing technique that divides data space into grid cells for efficient querying of multi-attribute data.
How It Works:
Divides each attribute (dimension) into intervals to form a grid.
Maps data points to grid cells.
Uses a directory to store references to non-empty cells.
Advantages:
Efficient for multi-dimensional queries (e.g., range or point searches).
Compact storage (only stores non-empty cells).
Dynamically adjustable grid.
Disadvantages:
Poor performance with skewed data distributions.
High maintenance for dynamic adjustments.
Limited for complex or high-dimensional queries.
Use Cases:
GIS: Spatial data (e.g., mapping coordinates).
Multi-key Queries: Searching based on multiple attributes (e.g., age and salary).
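A sketch of a grid file over two attributes, age and salary (the interval boundaries and records are made up; real grid files also split cells dynamically as data arrives):

```python
from collections import defaultdict

# Fixed intervals per dimension: age <30, 30-49, >=50; salary <40k, 40-79k, >=80k.
AGE_SPLITS = [30, 50]
SALARY_SPLITS = [40000, 80000]

def cell(age, salary):
    # A point's cell is the pair of interval indices it falls into.
    return (sum(age >= s for s in AGE_SPLITS),
            sum(salary >= s for s in SALARY_SPLITS))

directory = defaultdict(list)   # only non-empty cells get an entry
records = [("Ann", 25, 35000), ("Bob", 42, 90000), ("Cat", 55, 45000)]
for name, age, salary in records:
    directory[cell(age, salary)].append(name)

# Multi-attribute point query: one cell lookup instead of a full scan.
print(directory[cell(42, 90000)])   # ['Bob']
```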
How can the following Operations be expressed in a Deductive Database:
SELECT * FROM employees WHERE age > 30;
SELECT name FROM employees;
SELECT * FROM employees e, departments d WHERE e.dept_id = d.dept_id;
Result(Name, Age, Dept) :- employees(Name, Age, Dept), Age > 30.
Result(Name) :- employees(Name, _, _).
Result(Name, Age, Dept, DeptName) :- employees(Name, Age, Dept), departments(Dept, DeptName).
What is the Star Property?
A constraint in BLP that prevents a subject (user or process) with a given security clearance level from writing data to a lower level (no write down).
What is the Simple Security Property?
A Constraint in BLP that prevents a subject (user or process) with a given security clearance level from reading data from a higher level.
What is the Bell-LaPadula Model?
A model for enforcing data confidentiality in multilevel systems, defining rules to prevent unauthorized information access.
What are Spurious Tuples?
Inaccurate tuples formed by incorrectly joining two decomposed tables.