Week 7 Flashcards

1
Q

Tree structures - a reminder

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is a Search tree?

A

Special type of tree used to guide search for a record, given a value of one of a field’s records (aka the search field)

  • Consider search tree, order p
  • Node contains at most p - 1 search values (K) and p pointers (P)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Search tree Example

A

To search for value X we follow the appropriate pointer P, according to the constraints

  • Tree here is of order p=3
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Search tree characteristics

A
  1. Can be used as a mechanism to search for records stored in a disk file
    a. Values in tree are values of search fields
    for file (e.g. an indexing field)
    i. Each value is associated with a pointer to
    a record or a disk block containing record
  2. Search tree can itself be stored on disk
    a. Algorithms required for inserting and deleting values whilst maintaining constraints
    b. Such algorithms typically do not ensure tree is balanced
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Unbalanced tree

A
  1. Does not have all leaf nodes at the same level of the tree
  2. Depth of tree is not minimised for a given set of keys
  3. As records updated and deleted, tree may become skewed, with some branches going much deeper than others
    a. May result in wasted storage space
    b. Unpredictable search times as some keys
    may require far more tree traversal than
    others
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a B-tree?

A
  1. Specialised case of search tree with additional constraints to:
    a. Ensure tree always remains balanced
    b. Minimise space wasted by deletion
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the properties that need to be satisfied for a B-tree of order m

A
  1. Every node has at most m children
  2. Every internal node has at least m/2 children
  3. Every non-leaf node has at least 2 children
  4. All leaves on same level
  5. Non-leaf node with k children contains k-1 keys
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Is B-tree a binary tree?

A

B-tree is NOT a binary tree
- Binary tree is a special case of tree structure with at most 2 child nodes per node
- B stands for balanced.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

B-tree Formal Definition

A

B-tree of order p, when used as access structure on a key field to search for records can be defined as:
- Each internal node is of the form:
a. <P1, <K1, Pr1 >, P2, <K2, Pr2>, … , <Kq-1, Prq-1>, Pq>
- Where q <= p
- P1 is a tree pointer to another node in the tree
- Pri is a data pointer to record with search key value = Ki (or to data block containing record)
- Within each node K1 < K2 < … < Kq-1
- For all search key values X in subtree Pi (the ith subtree), we have:
a. Ki-1 < X < Ki for 1 < i < q; X < Ki for I = 1; and Ki-1 < X for i=q
- Each node has at most p tree pointers
-

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Node in a B-tree with q - 1 search values:

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

B-tree, order p = 3

A

Compare with original search tree
1. Tree is balanced all leave nodes on level 1
2. All tree pointers at level 1 are NULL

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Root node insertion

A
  1. B-tree starts with a single root node (which is also a leaf)
  2. When the root is full with p-1 search key values and insertion attempted
    a. Root splits into two nodes at level 1
    b. Middle value is kept in root node
    c. Rest of values split evenly between other two ndoes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Non-root node insertion

A

When non-root node is full and new entry inserted:
1. Node split into two nodes at the same level
2. Middle entry is moved to parent node along with two pointers to new split nodes
3. If parent node is full it is also split
4. Splitting can thus propagate all the way to root node creating a new level if root is split

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Deletion on B-tree

A

If deleting values causes node to be less than half full
a. It is combined with neighbouring nodes
b. Combination of nodes can also propagate to root
c. Node deletion can reduce number of tree levels
d. This allows more efficient usage of storage space

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a B+trees

A
  1. Variation of the B-tree
  2. Commonly used for implementing dynamic, multi-level indexes
  3. In B-tree, every value of search field appears at some level in tree along with data pointer
  4. In B+tree, data pointers only stored at leaf nodes of tree along with a value of search field ∴
    a. Structure of leaf nodes differs from structure of internal nodes
  5. Leaf nodes usually linked to provide ordered access on search fields to records
  6. Thus leaf nodes similar to base level of an index
  7. Internal nodes of B+ tree correspond to other levels of multi-level index
  8. Some search field values from leave nodes may be repeated in internal nodes to guide search
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

B+ trees internal nodes

A
  1. Each internal node has at most P pointers
  2. Each internal node (excluding root) has at leas [p/2] tree pointers. Root node has at least two tree pointers if it is an internal node
  3. An internal node with q tree pointers q<=p, has q-1
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Internal node structure

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

B+ trees leaf nodes.

A
  1. No more P pointers in node structure pointing at subtrees (it is a leaf after all)
  2. Each leaf node has at least [p/2] values
  3. All leaf nodes at same level
  4. A Pprevious pointer can also be included
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Leaf node structure

A
20
Q

In a B+ tree, internal nodes contain search values and tree pointers without any data pointers, implies that?

A

More entries can be packed into an internal node of a B+ tree than for a similar B-tree

  • For the same block size, order P will be larger for a B+ tree
    1. Fewer levels
      a. Improved search time
    2. Order P can be different for internal nodes and leaf nodes as they have different structures
21
Q

B-trees and B+trees

A
  1. Analysis and simulation have shown that after m any random insertions and deletions on B-trees and B+trees
    a. Nodes approximately 69% full when no. of values in tree stabilises
    b. Node splitting and combination occurs rarely, hence insertion and deletion becomes efficient
    c. If number of values grows, tree will expand without problem
    i. Splitting of nodes may occur, so some insertions may take more time
22
Q

Ordered files and B-trees

A
  • These structures are very good for search
    * Binary search allows for very efficiently
    finding records in very large datasets
  • But it’s not the only or necessarily always the fastest way to find something
    * Searching for something v. just ‘knowing’
    where it is…
    * A hash table and hashing function gives
    us a mechanism for obtaining the
    location of a data item directly (either a
    memory location or disk block in
    which the record is stored)
23
Q

What is a hash?

A
  • Hash function is any function that maps a data item (of arbitrary size) to a fixed value
  • Hash table (also called a hash map) is a data structure that implements an
    associative array or dictionary.
    * maps many keys to values
  • Hash function (h) is applied to hash field value to yield the address of the disk block in which record is stored.
  • For most records located in this way, only need a single block access to retrieve the record.
    * Very very fast under certain circumstances
24
Q

Internal hashing

A
  • Hashing is also used as an internal search structure within a program
    * Can be used whenever a group of records
    is accessed exclusively by using the value
    of one field called the hash field
    * If hash field is also key field it is
    referred to as the hash key
      * Implemented as a hash table using an 
         array of records
    
      * Imagine an array with an index range of 0 to M-1
             * We have M available slots whose 
                addresses correspond to the array indexes
             * Choose a hashing function that 
               transforms hash field value x into 
               integer such that:
                      * 0 <= X <= M-1
25
Q

Example on Hashing functions

A
  • One common hash function is: h(K) = K mod M
    * Returns remainder of integer of hash
    field value K after division by M
    * Remember M is no of slots in hash table
    so operation ensures h(K) returns result
    between 0 and M-1
    * Value, h(K), is used as the address for the
    record
    * Requires an integer value to calculate
    * All binary data can be represented as
    integer, e.g. character strings can use
    ascii codes of characters
26
Q

More hashing functions

A

Folding:
* Apply
* arithmetic functions such as addition
or
* Logical functions such as exclusive OR
* To different portions of hash field values to calculate the hash address

  • Or simply pick some digits of hash field value e.g. 2nd, 4th and 6th digit to form a hash address
  • The hash address should be computationally cheap to calculate
27
Q

Using 2nd, 4th, 6th digit

A
28
Q

Hash collisions

A
  • Occur when multiple value of hashing field result in same output from the hashing function
  • More likely to occur when number of possible slots M is small compared to the possible number of hash field values
  • Ideal hashing function will minimise collisions whilst still being cheap to calculate
  • Strategy needed to handle (resolve) conflicts
29
Q

Collision resolution

A
  • Open addressing
    • Proceed from already occupied position
      specified by hash
      * Check subsequent positions in order
      * Until an unused position is found
    • Logically simple to implement
    • But
      * If no of collisions is large, hash table
      order begins to degrade
  • Multiple hashing
    • Apply a second hash function if first results
      in collision
    • If another collision occurs, use open
      addressing or apply third hash then use
      open addressing
30
Q

Collision resolution (2)

A
  • Chaining
    • In addition to our main locations 0 … M-1
    • Extend array with a number of overflow
      positions
    • Additionally, pointer field added to each
      record location
    • Collision is resolved by placing new record
      in unused overflow location and setting
      occupied hash address pointer to address
      of overflow location
    • Thus a linked list of overflow locations is
      maintained
31
Q

What is the advantage of internal hashing

A
  • Gives us a very rapid lookup mechanism
    * Where we are using equality for search
    * Doesn’t work for SELECT * WHERE X > Value
32
Q

What is External hashing?

A
  • This is hashing for disk files
  • Target address space is made of ‘buckets’
    * Bucket is one disk block or a cluster of
    contiguous disk blocks
    * Each bucket holds multiple records
  • Hashing function maps a key into a relative bucket
    * It does not assign an absolute block
    address
    * A bucket table maintained in the file
    header converts the bucket number into
    corresponding disk block address
33
Q

Collision on External hashing

A
  • Less severe due to use of buckets
    * As many records as will fit in the bucket
    can all hash to the same bucket
    * It is a possibility that a bucket will fill to
    capacity and a new record will hash to a
    full bucket

Collision resolution
- Use a variation of chaining

34
Q

Record deletion on External hashing

A
  • Remove record from bucket
  • If bucket has an overflow chain, can move one of overflow records into main bucket
  • If record to be deleted already in overflow bucket, simply remove it from linked list
  • Requires keeping track of empty overflow locations
35
Q

Problems of external hashing

A
  • Searching for record using non-hash field is as expensive as for an unordered file
  • Modifying hash field
    * May require moving record to different
    bucket
    * This requires deletion of old record plus
    the creation of modified record
36
Q

Static hashing

A
  • External hash scheme described so far is static
  • Fixed number of buckets M allocated
    * M = No of buckets for address space
    * n = Maximum number of records that can
    fit in one bucket
  • Maximum records possible in allocated space
    * M * n
    * Assuming records allocate equally across
    all buckets
37
Q

In static hashing, what if number of records &laquo_space;M * n

A

Left with a lot of unused space

38
Q

If number of records&raquo_space; M * n

A

There will be a large number of collision (link linked lists for overflow)

39
Q

What happens when using M in the hash scheme

A

It ties us to fixed and static size of M buckets

  1. Doesn’t cope well with dynamic files
    a. Not a problem if we know in advance, with
    some confidence how many records we
    will need to store
    b. Many (most?) real world scenarios we do not know in advance
  2. We may need to change M
    a. Implies we need a new hashing function
    b. Implies record redistribution
    c. THIS IS EXPENSIVE!
40
Q

What is Dynamic Hashing

A
  • Easier to expand or shrink a file (compared with static hashing)
  • Takes advantage of the fact:
    * The result of applying a hash function is always a non-negative integer
  • Can be represented as a binary number
41
Q

What is Extendible Hashing

A
  • Directory consisting of array of 2d bucket addresses is maintained
  • Does not require a distinct bucket for each of the 2d directory entries
    • Several directory entries with the same first
      d’ bits may contain the same bucket
      address
    • Records that hash to these location fit in a
      single bucket
42
Q

What does d in 2d directory entries

A
  • d is called the global depth of the directory
    • Integer value corresponding to the first
      (high-order) d bits of the hash value is used
      as index into the array to determine the
      directory entry
    • Address in the entry determines in which
      bucket corresponding records are stored
43
Q

What does d’ mean in the extendible hashing

A
  • Called the local depth
    • Stored with each bucket
    • Number of bits on which bucket contents
      are based
44
Q

Explain the Expansion and shrinkage in Extendible hashing

A
  • Value of d can be increased or decreased by 1 at a time
  • In this event, number of entries in directory array is doubled or halved
  • Doubling needed if bucket with d’ = d overflows
  • Halving is needed if (possibly after some deletions) d > d’ for all buckets
45
Q

What are the advantages of extendible hashing

A
  1. Performance does not degrade as file grows
    a. In static hashing, collisions increase and overflow chains become large
  2. No space allocated for future growth
    a. It’s all handled dynamically for additional buckets
  3. Space overhead for directory is negligible
    a. Maximum directory size is 2b - where b is
    number of bits in hash value
  4. Splitting causes minor reorganisation in most cases
    a. Only records in one bucket are
    redistributed to two new buckets
46
Q

What are the disadvantages of Extendible hashing?

A
  1. Reorg expensive when directory needs to be doubled or halved
  2. two block accesses required
    a. Directory must be searched before accessing buckets

Overall performance penalty is minor (desirable choice for dynamic files)

47
Q

EXTENDIBLE HASHING EXAMPLE

A

WEEK 7 LECTURE 2 SLIDE 42-62