Indexing Flashcards

1
Q

true or false: each index is associated with a search key

A

true

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

true or false: search keys can consist of a set rather than a single value

A

true

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

true or false: multiple records cannot have the same search key value

A

false

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

true or false: you can have multiple indexes on a file, each with its own search key

A

true

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

an index will have index entries, data entries and data records

what are the differences

A

the index entries will filter and provide data entries that meet the index’s search key. You will be given a collection of data entries that hold/points to the data records depending on which of the 3 alternatives you have for keeping data entries

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Data entries have 3 alternative structures, what are they and describe them?

A
  1. <k, data record> (by value)
    - no need to follow pointers
    - you have the full data stored in the data entry
    - at most one index on a given collection of data records can use this method, otherwise data records are duplicated and there will be redundant storage and potential inconsistencies
  2. <k, rid> (by reference)
    - data entries are typically smaller than data records, therefore better than alt1 with large records
  3. <k, list of rids> (by list of references)
    - more compact than alt2 but leads to variable sized data entries even if search keys are fixed length
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

indexed files can be sorted or hashed, how does that affect the way k* (data entries) are stored

A

k* are sorted by k for sorted indexed files and hashed by k for hashed indexed files

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

what does it mean to be a unique index

A

Search key contains a candidate key (a key that can uniquely identify a record)

ie. we only return one record with this index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

sorted index files: what is a clustered vs unclustered index, define the differences

A

For sorted index, data entries k* are always sorted by search key k

If data records are also sorted by k it is a clustered index, otherwise, unclustered
index

Note
- A file has at most one clustered index (a table can have only one because the physical order of the data can only be in one way)
- Alternative 1 leads to clustered index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

between clustered and unclustered indexes, which will have the lowest I/O

A

clustered, because data records having same k are likely found on the same page

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

what is the difference between dense and sparce

A

dense: there is at least one data entry for every record

sparce: every record does not have its own data entry. One index per page

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

why does sparce indexed always need to be clustered

A

because you don’t have a data entry point to everything, you need to be able to find records without pointers, so their position needs to be implied, which can only be done if its sorted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

true or false: alternative 1 leads to a dense index

A

true

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

can a file have more than one sparce index, why or why not

A

no, because you need to be sorted for a sparce indexes to work, but you can’t be sorted in 2 ways

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

what is a composite search key

A

a search key that contains more than one field

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

for a sorted indexed file why does binary search not work when doing a range search

A

because we don’t have sequential storage

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

how do you do a range search with a sorted indexed file, what is the point of the values and pointers

A

create an index file to search indexes,there will be values and pointers, depending on if you are greater or less than the value you wil follow a specifc pointer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

in a ISAM tree index file, what is an overflow page

A

if all leafs are full you will add new info in the overflow pages

these pages will not be sorted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

true or false: in a ISAM tree, even after a deletion, the index file page will remain unchanged

A

true. the file structure will never change, but if you delete an overflow page the overflow page will be deallocated

20
Q

true or false: overflow pages in a ISAM tree, keep their allocated space after all their data has been deleted

A

false, space gets deallocated

21
Q

in an index file, where is the data entries contained

A

in the leaf pages

22
Q

in a ISAM tree, why do we not need next-leaf pointers?

A

leaf pages are allocated sequentially

23
Q

are ISAM trees height balanced?

A

no

24
Q

How is
1. file creation
2. Searching
3. Inserting
4. Deleting

done in an ISAM tree structure

A

file creation: leaf pages allocates sequentially. sorted by search key, then index pages allocated, then space of over flow pages

search: start at root, use key comparisons to go to leaf

insert: find lead data entry belongs to, put it in there, use overflow pages if needed

delete: fins and remove from leaf, if empty overflow page –> deallocate the space

25
Q

is an ISAM a dynamic or static tree structure?

A

static, insert and delete will only affect over flow pages, the tree itself will never change

26
Q

are overflow pages ordered or unordered in a ISAM tree

A

unordered

27
Q

are overflow pages in an isam tree sequentially allocated

A

no

28
Q

is B+ tree a dynamic or static tree structure

A

dynamic, it will change and respond to changes in data size

29
Q

what are the properties of a B+ tree

A

the tree must be
- height balanced
- maintain 50% occupancy in each node except for the root

30
Q

true or false: B+ trees can support both equality and range-searches efficiently

A

true

31
Q

what is d referring to in a b+ tree

A

d is the order of the tree. Use d to determine the minimum occupancy allowed.
Minimum occupancy is : d <= m <= 2d

32
Q

why do we aim to keep 50% occupancy for every node in a B+ tree

A

to not waste space

33
Q

In a B+ tree what does the number of I/O depend on?

A

the height of the tree and whether or not you are alt 1 or alt 2/3 and if you are clustered or unclustered

34
Q

in a B+ tree are they typically taller than they are fat or fatter than they are tall

A

trees are fatter than they are tall

35
Q

describe the process of inserting a data entry into a B+ tree

A
  1. find correct leaf
  2. put data into L
    2a. if L has enough room we are done
    2b. not enough room in leaf, must split the leaf into L and L2.
    - redistribute the entries between L and L2 and copy the middle key to the parent
    -check i parents also need to be split and follow the same logic
36
Q

what is the difference between splitting leaf nodes and splitting parent nodes after an insert in a B+ tree

A

in parent redistribute entries evenly but push up the middle key

in leaf, redistribute entries evenly but copy up the middle key

37
Q

describe the process of deleting a data entry form a B+ tree

A
  1. start at the room and find the leaf L where the entry belongs
  2. remove the entry
    - if L is still half full, done
  3. redistribute by borrowing entries from your sibling node
  4. if redistribution fails, merge L and it’s siblings
  5. if merged, must delete the entry pointing to L or sibling from the parent
38
Q

what is bulk loading of a B+ tree

A

when you create a tree, loading one record one at a time takes a while and will not give you sequential storage for your leaves. Instead you bulk load where you can efficiently create a B+ tree.

When initializing you need to, sort all data entries, insert a pointer to the first leaf page in the root page and continue to insert into the right most index page. when you fill up the page, split along the right most path

39
Q

index pages can typically hold many more entries than leaf pages, why?

A

index entries don’t have *, remember that * tells you that we are looking at a data entry. This data entry will contain a pointer to the data and a rid which is comprised of page id and slot number. Files with * just need to hold more information

40
Q

if data entries are alternative 1, what could possibly happen to rids

A

they could possibly change
rids = page number, slot number
if you insert/delete/merge/split, you could be changing the page numbers

41
Q

the pages in an index file will contain values and pointers, what is the relationship between the number of values in a page and the fanout (number of branches)?

A

fanout is equal to number of values + 1.
ex. if you have 9 values, you have 10 pointers

42
Q

describe the static hashing method in hash-based indexes for hashed files.

include how insertion works for this static hashing

A
  • # of primary pages is fixed
  • allocated sequentially
  • primary pages are never deallocated
  • will be overflow pages if needed
  • buckets are identified with
    h(k) mod N = bucket
  • h(k) is a hash function that works on the search key k
  • possible to develop long overflow chains which will degrade performance
  • if we are looking for an entry that has a long overflow chain, we will need to look at all the overflow pages to find the entry
43
Q

describe the dynamic hashing method, extendible hashing

include how insertion and deletion works for this method

A

EH uses a directory of pointers to point to buckets. We will split and rehash a bucket that overflowed, this may or may not lead to directory doubling

good b/c
- cheaper to double only the directory
- only one page is split and rehash
- no overflow pages

need to keep track of global and local dept, we will need to double the directory when an insert causes local depth > global depth after an insert

for delete
- if we delete and it leaves an empty bucket, the bucket is merged with its ‘split image’ (pairs that only differ in the left most bit)

44
Q

when will we face issues with extendible hashing

A

if we have skewed distribution, where many entries have the same hash value it can lead to an overflow chain because the directory can grow large and data entries remain unsplit

45
Q

describe linear hashing

A

we splits the bucket in round-robin fashion, without using directory. we have a collection of hash functions that we will swap in depending on how many bits we need to represent a difference in buckets.

there will be a next pointer that will point to the next bucket in line to be split. regardless of which bucket overflowed, we will split only the bucket indicated.

will degrade in performance if distribution skewed

46
Q

yes or no, in LH if a new overflow is created while we are handling a current overflow, will this new overflow trigger another split?

A

no