string and text algorithm Flashcards

Question 1

Q

heap and heapsort pseudocode

Answer

A

A heap is a binary tree that has two properties :

complete binary tree - all level are full expect possibly the bottom level with any nodes on the bottom level as far left as possible

value of node is greater or equal to value of its decedents

concept 
build array into heap //O(n)
for int k to n -1 //O(n) 
{
// invariant: item 0 to n-k-1 form heap
// invariant: item n-k to n-1 are sorted
find the largest unsorted item (the root) //O(1)
swap it into position n-1-k (the most bottom right) //O(1)
reduce the size of heap by 1 // O(1)
impose the heap property on root // O(log n)
}
restore size to original value

build -
for(each non leaf node in bottom to top right to left order)
impose heap property on that node

impose -
while(badvalue not in leaf && badvalue < largerchild)
swap badvalue with largerchild

Question 2

Q

explain LZW in detail

Answer

A

LZW uses dictionary D of strings which is initialized to contain all strings of length 1 over the underlying alphabet and grows by the insertion of additional strings as compression proceeds. Associated with each string in D is a unique integer codeword in the range 0 to n-1 where n is the number of strings in D. at any given moment there is a current codeword length k which is initialized so that 2^k is at least equal to the current dictionary size. The current position in the text is initialized to the first character. at each step, the longest string s at the current position which exists in D is identified. the k digit bit codeword of s is output and new string s + c where c is the next character is inserted into D with the next unused codeword. if the number of used codeword becomes greater than 2^k and the number of codewords are in the acceptable range, the the value of k is incremented.

Question 3

Q

radix sort explanation

Answer

A

each item has bit position labelled 0 ,1 …. m-1 with bit 0 being the least significant( the right most)
the algorithm uses m/b iterations where b = number of bits in the bucket. for example 111001 can be divided into “111” and “001” -> 2 iterations ( b = 3, m = 6)
each iteration contains 2^b buckets labelled 0…0 to 1…1
For each iteration the item is placed according to the corresponding bucket label, for example:

1100 ,1011 ,1111 ,1000

let b = 2 , m/b = 2 iterations
1st iteration  00 : 1100,1000
                     01:  empty
                     10: empty
                     11: 1011,1111
buckets are then concatenated to form the new sequence 1100,1000,1011,1111
2nd iteration  00: empty
                      01: empty
                      10: 1000,1011
                      11: 1100,1111
concat bucket gives :
1000,1011,1100,1111 <- sorted

the number of iteration = m/b = Constant
the number of insertion to buckets per iteration = O(n)
concat buckets = 2^b = constant
hence complexity is O(n)

Question 4

Q

Huffman tree explanation

Answer

A

First get frequency of char used in the text
Add leaves of Huffman tree - char with their frequencies label the leaf nodes
Next while there is more than one parentless node:
- add new parent to nodes of smallest weight
- weight of new node = sum of children weight

Pseudocode
//set up leaves
for( each distinct char c in text) 
{
   make new parent node n;
   int f = frequency for c;
   n.setWeight(f); // weight = frequency
   n.setChar(c); // set char value
   // leaf has no child
   n.setLeftchild(null);
   n.setRightchild(null);
}
//construct the branch nodes and links
while(no. of parentless nodes > 1)
{
   make a new parentless node z; 
   x ,y = 2 min weight nodes // children
   z.setLeftchild(x);
   z.setRightchild(y);
   int w = x.getWeight() + y.getWeight(); //calculate weight
   z.setWeight(w);
}
// the final z is root of tree

Huffman tree has min weighted path length over all binary tree with the given leaf weights

complexity
let n = text length , m = distinct char in text
O(n) to find frequency
O(m log m) time to construct tree - constant
So O(n) in total

Question 5

Q

KMP algorithm

Answer

A

it is an inline algorithm (does not require text to be backed up)

involves pre processing the string to build a border table( array b with b[j] for each position j of the string)

b[j] denotes the longest border of string s[0….j-1] where border is a substring that is both a prefix and a suffix

if there is a mismatch at position j in the string check border table for the corresponding position to find out which string character is next to be compared.

char array t //text
char array s //string
int array b // border table
int i // position in text
int j // position in string
int m // string length
int n // text length
setup(b) // set up border table
while(i != m) //not EOF
{ if(t[i] ==s[j]){ 
  i++
  j++
 }
  else{
  if( b[j]!=0) //corresponding border table not 0 
  { j=b[j]} //shift string pointer accordingly 
  else{ i ++; j=0;} //resume brute force
  } 
  if( j==m)  //all char in string match
  return i-j; //return first match char in text 
}

return -1 // not found

Question 6

Q

BM algorithm

Answer

A

t[i] = current text compared s[j] = current char in string

string is scanned right to left
text char involved in a mismatch is used to decide next comparison (t[i])
pre processing the string to get the position of the last occurrence of every char used in text in string (alphabet used must be fixed before search)
if the char not exists in string the value = -1
if there is a mismatch align last position in s of char t[i] with t[i]. if this moves s in the wrong direction, move s one position right, if t[i] does not appear in string , slide entire string past t[i]

Brainscape's Knowledge GenomeTM

string and text algorithm Flashcards

Brainscape's Knowledge Genome^TM