string and text algorithm Flashcards
heap and heapsort pseudocode
A heap is a binary tree that has two properties :
complete binary tree - all level are full expect possibly the bottom level with any nodes on the bottom level as far left as possible
value of node is greater or equal to value of its decedents
concept build array into heap //O(n) for int k to n -1 //O(n) { // invariant: item 0 to n-k-1 form heap // invariant: item n-k to n-1 are sorted find the largest unsorted item (the root) //O(1) swap it into position n-1-k (the most bottom right) //O(1) reduce the size of heap by 1 // O(1) impose the heap property on root // O(log n) } restore size to original value
build -
for(each non leaf node in bottom to top right to left order)
impose heap property on that node
impose -
while(badvalue not in leaf && badvalue < largerchild)
swap badvalue with largerchild
explain LZW in detail
LZW uses dictionary D of strings which is initialized to contain all strings of length 1 over the underlying alphabet and grows by the insertion of additional strings as compression proceeds. Associated with each string in D is a unique integer codeword in the range 0 to n-1 where n is the number of strings in D. at any given moment there is a current codeword length k which is initialized so that 2^k is at least equal to the current dictionary size. The current position in the text is initialized to the first character. at each step, the longest string s at the current position which exists in D is identified. the k digit bit codeword of s is output and new string s + c where c is the next character is inserted into D with the next unused codeword. if the number of used codeword becomes greater than 2^k and the number of codewords are in the acceptable range, the the value of k is incremented.
radix sort explanation
each item has bit position labelled 0 ,1 …. m-1 with bit 0 being the least significant( the right most)
the algorithm uses m/b iterations where b = number of bits in the bucket. for example 111001 can be divided into “111” and “001” -> 2 iterations ( b = 3, m = 6)
each iteration contains 2^b buckets labelled 0…0 to 1…1
For each iteration the item is placed according to the corresponding bucket label, for example:
1100 ,1011 ,1111 ,1000
let b = 2 , m/b = 2 iterations 1st iteration 00 : 1100,1000 01: empty 10: empty 11: 1011,1111 buckets are then concatenated to form the new sequence 1100,1000,1011,1111 2nd iteration 00: empty 01: empty 10: 1000,1011 11: 1100,1111 concat bucket gives : 1000,1011,1100,1111 <- sorted
the number of iteration = m/b = Constant
the number of insertion to buckets per iteration = O(n)
concat buckets = 2^b = constant
hence complexity is O(n)
Huffman tree explanation
First get frequency of char used in the text
Add leaves of Huffman tree - char with their frequencies label the leaf nodes
Next while there is more than one parentless node:
- add new parent to nodes of smallest weight
- weight of new node = sum of children weight
Pseudocode //set up leaves for( each distinct char c in text) { make new parent node n; int f = frequency for c; n.setWeight(f); // weight = frequency n.setChar(c); // set char value // leaf has no child n.setLeftchild(null); n.setRightchild(null); } //construct the branch nodes and links while(no. of parentless nodes > 1) { make a new parentless node z; x ,y = 2 min weight nodes // children z.setLeftchild(x); z.setRightchild(y); int w = x.getWeight() + y.getWeight(); //calculate weight z.setWeight(w); } // the final z is root of tree
Huffman tree has min weighted path length over all binary tree with the given leaf weights
complexity
let n = text length , m = distinct char in text
O(n) to find frequency
O(m log m) time to construct tree - constant
So O(n) in total
KMP algorithm
it is an inline algorithm (does not require text to be backed up)
involves pre processing the string to build a border table( array b with b[j] for each position j of the string)
b[j] denotes the longest border of string s[0….j-1] where border is a substring that is both a prefix and a suffix
if there is a mismatch at position j in the string check border table for the corresponding position to find out which string character is next to be compared.
char array t //text char array s //string int array b // border table int i // position in text int j // position in string int m // string length int n // text length setup(b) // set up border table while(i != m) //not EOF { if(t[i] ==s[j]){ i++ j++ } else{ if( b[j]!=0) //corresponding border table not 0 { j=b[j]} //shift string pointer accordingly else{ i ++; j=0;} //resume brute force } if( j==m) //all char in string match return i-j; //return first match char in text }
return -1 // not found
BM algorithm
t[i] = current text compared s[j] = current char in string
- string is scanned right to left
- text char involved in a mismatch is used to decide next comparison (t[i])
- pre processing the string to get the position of the last occurrence of every char used in text in string (alphabet used must be fixed before search)
- if the char not exists in string the value = -1
- if there is a mismatch align last position in s of char t[i] with t[i]. if this moves s in the wrong direction, move s one position right, if t[i] does not appear in string , slide entire string past t[i]