String Searching: Aho-Corasick Automaton, Suffix Array, BW Transform Flashcards
What are the expectations of the Aho-Corasick Automaton?
All elements are static and memory is allocated up front. It favors the find operation over insert and remove.
What is the time complexity if a MultiWay Trie is used to build and implement this database?
What are the contingencies?
O(nk)
Assuming the trie has already been constructed: there are O(n) starting positions, each requiring up to O(k) traversal steps, giving a total of O(nk).
While better, it still requires restarting the traversal from scratch at each of the potential O(n) start positions (see the sketch below).
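A minimal sketch of this naive approach, assuming a dict-based multiway trie (the names build_trie and naive_trie_search are illustrative, not from the notes): the short sequences go into the trie, and the long query is rescanned from every start position.

```python
# Naive O(nk) matching: insert the short sequences into a multiway trie, then
# restart a fresh traversal at every start position of the long query.

def build_trie(patterns):
    root = {}                      # each node: dict mapping character -> child node
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$end"] = p           # mark that a pattern ends at this node
    return root

def naive_trie_search(text, root):
    hits = []
    for start in range(len(text)):         # O(n) start positions...
        node = root
        for i in range(start, len(text)):  # ...each walked for up to O(k) steps
            ch = text[i]
            if ch not in node:
                break                      # dead end: restart from scratch at the next position
            node = node[ch]
            if "$end" in node:
                hits.append((start, node["$end"]))
    return hits

print(naive_trie_search("panamabananas", build_trie(["ana", "nana", "pan"])))
# [(0, 'pan'), (1, 'ana'), (7, 'ana'), (8, 'nana'), (9, 'ana')]
```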
One large motivation/weakness of the MultiWay Trie is the need to restart the traversal through the TRIE at EVERY position of the query, which creates REDUNDANT operations. How is this resolved with Aho-Corasick?
If a node can hold some information about whether a suffix of the path FROM the root to the current node also exists as a prefix of other paths starting at the root, then we can keep track of that connection to avoid searching the same substring again.
Instead, if we traverse to a dead end and fail, rather than starting again at the root, we continue where we left off in the query but from another “related” node.
Failure Links: An edge/connection that can be traversed when the current traversal fails because the required child node does not exist.
We create a failure link from a node u to a node v such that the path from the root to node v is equal to the longest proper suffix of the path from the root to node u.
In the worst case, if no such suffix exists in the trie, the failure link simply points to the root.
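A sketch of how these failure links can be computed with the usual breadth-first construction (the Node and build_automaton names are illustrative, not from the notes); the scan that uses them appears in the Aho-Corasick section later.

```python
from collections import deque

# Sketch: trie nodes carry an explicit failure link, filled in by BFS so that each
# link points to the node whose root path is the longest proper suffix of this
# node's root path that still exists in the trie.

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.fail = None     # failure link
        self.output = []     # patterns ending at this node (or at nodes reachable via failure links)

def build_automaton(patterns):
    root = Node()
    for p in patterns:                        # 1) ordinary trie insertion
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.output.append(p)

    root.fail = root
    queue = deque()
    for child in root.children.values():      # 2) depth-1 nodes fail back to the root
        child.fail = root
        queue.append(child)

    while queue:                               # 3) BFS: deeper nodes reuse their parent's links
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not root and ch not in f.children:
                f = f.fail                     # fall back until ch can be read (or we reach the root)
            child.fail = f.children[ch] if ch in f.children else root
            child.output += child.fail.output  # also report patterns ending at the linked node
            queue.append(child)
    return root
```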
Previously, our collection of short sequences was our database, and our single long sequence was our query. (solved by Aho-Corasick)
Now, our collection of short sequences is our query, and our single long sequence is our database.
The solution is to PREprocess the database (front-end effort) to shorten query time, since the queries change frequently and multiple queries are performed at the same time.
“Does this segment appear anywhere in the entire dictionary”
Notice how:
Sorting the suffixes in the first stage groups together suffixes/substrings that share the same leading characters.
THUS (recall we also want to know where a substring exists in the database), we can check all the suffixes with the same initial character(s). This narrows the range that needs to be searched, and once the next suffix no longer matches those characters, we know we can stop.
This eventually gives O(k log n) worst-case query time (a binary search over the n suffixes, with up to k letter comparisons per step).
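A rough sketch of that query, assuming a naively sorted suffix array (real constructions run in O(n log n) or O(n)); the binary search is the part this card is about, and the names are illustrative.

```python
# Sketch: build a suffix array for database D, then binary-search it for a query
# substring w in roughly O(k log n) character comparisons.

def suffix_array(D):
    # Sort suffix start positions by suffix text. Naive (O(n^2 log n)) but clear.
    return sorted(range(len(D)), key=lambda i: D[i:])

def find_occurrences(D, sa, w):
    # All suffixes starting with w are grouped together in sorted order, so two
    # binary searches recover the boundaries of that block.
    k = len(w)

    def boundary(strictly_greater):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = D[sa[mid]:sa[mid] + k]          # up to k letter comparisons
            if prefix < w or (strictly_greater and prefix == w):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = boundary(False), boundary(True)
    return sorted(sa[start:end])                     # starting positions of w in D

D = "banana"
sa = suffix_array(D)
print(find_occurrences(D, sa, "ana"))   # -> [1, 3]
```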
Burrows-Wheeler Transform
Addresses read mapping problem
We have a single long database string D of length n (e.g. the human genome), and we have a query Q containing m short strings of length k, where n ≫ k. Our computational problem is the following: For every string w in Q, does w appear as a substring of D, and if so, at what position(s) of D? We can assume D doesn’t change frequently, meaning the time it takes to preprocess is negligible in the long term, but Q is expected to change frequently.
We last ended with the Suffix Array at a runtime of O(k log n) per query string; with a query of m short strings this becomes O(mk log n).
BWT Algorithm
1. Add an end-of-string character; in examples this is the “$” character. The end-of-string character should be unique and must not occur in the original string/alphabet (domain).
2. Construct the Burrows-Wheeler matrix: every rotation of the string (with the end-of-string symbol), and index them. For BANANA$ the rows include:
   BANANA$
   $BANANA
   A$BANAN
   .....
   ANANA$B
3. Sort the matrix rows in alphabetical order and record the order of the indexed list/matrix.
4. The last column (end characters) of the sorted matrix is the BURROWS-WHEELER Transform.
Instead of recording the entire string of each row, the index is recorded to condense the representation of the string.
- Using this index number as a “starting index”, the actual rotation can be produced on the fly, since the original string is essentially now a circular string.
With this algorithm we can transform a given string into its BWT.
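A compact sketch of those four steps, using naive rotation sorting for clarity (the index-only representation of each row described above is omitted here):

```python
# Sketch: compute the BWT by literally building and sorting all rotations.
# Assumes "$" does not occur in the input string.

def bwt(s):
    s = s + "$"                                        # 1) append the unique end-of-string marker
    rotations = [s[i:] + s[:i] for i in range(len(s))] # 2) rows of the Burrows-Wheeler matrix
    rotations.sort()                                   # 3) sort the rows alphabetically ("$" sorts first)
    return "".join(row[-1] for row in rotations)       # 4) last column is the BWT

print(bwt("BANANA"))   # -> "ANNB$AA"
```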
The remainder of the problem is to uncompress or transform the BWT to its original string.
(RE)Construction Algorithm of BWT
- With the BWT (column) we can confidently construct the FIRST column since the first column is all the characters in the BWT but in alphabetical order.
Remember that each row of our Burrows-Wheeler Matrix is simply a rotation of our original string. That means that, for any given row, the letter in the first column comes immediately after the letter in the last column in the original string. This helps, but recurring characters pose a problem. However, we can leverage the $ (end-of-string character) to start this decompression.
- Reconstruct the original string (the top row of the sorted matrix, of the form “$XXXXX”, i.e. $ followed by the original string). We will need to index all recurring characters, indexing them in both the first and last columns.
Move down the BWT (last column) until you find $, and read the character (call it @) in the first column of that row. This “@” is adjacent to $ in the top row / original string, so it fills in the first unknown position. Now find this same occurrence of @ in the BWT (using the indexing from the previous step) and read the corresponding character in the first column of that row; this is the next character in the top row. Continue this process to rebuild just the top row / original string.
- With the top row reconstructed, the other rows can be reconstructed as well, since they are just rotations of the top row.
Reconstructing the top row takes O(n), since it depends on how many characters there are to match. The alphabetical sort needed to recover the first column can also be done in O(n) by using RADIX SORT (the sketch below just calls a generic sort for clarity).
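A sketch of that reconstruction, with “index recurring characters on both ends” made explicit as (character, occurrence-number) pairs; the function names are illustrative.

```python
# Sketch: invert a BWT by pairing each character's i-th occurrence in the last
# column with its i-th occurrence in the (sorted) first column, then walking
# forward from the row whose last character is "$", exactly as described above.

def inverse_bwt(last):
    first = sorted(last)                       # first column = the BWT, sorted alphabetically

    def ranked(column):                        # tag every character with its occurrence number
        seen, out = {}, []
        for ch in column:
            seen[ch] = seen.get(ch, 0) + 1
            out.append((ch, seen[ch]))         # e.g. ('A', 1), ('A', 2), ...
        return out

    first_r, last_r = ranked(first), ranked(last)
    row_of = {pair: row for row, pair in enumerate(last_r)}   # (char, occurrence) -> row in last column

    # The first-column character of a row follows that row's last-column character
    # in the original string, so start at the row whose last character is "$".
    row = row_of[("$", 1)]
    out = []
    for _ in range(len(last) - 1):             # recover everything except the trailing "$"
        ch, rank = first_r[row]
        out.append(ch)
        row = row_of[(ch, rank)]               # jump to where this same occurrence sits in the last column
    return "".join(out)

print(inverse_bwt("ANNB$AA"))   # -> "BANANA"
```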
L2F Column
Recall that we went from Original String -> BWT Matrix (rotations) -> Sorting the Matrix/rows -> BWT (column from sorted rows). That is “compression.” Then we went from the BWT -> recreating the first column (alphabetically sorting the BWT, with indices) -> matching, one letter at a time, the corresponding characters starting with $ to recover the sequence of the original string.
Now, with the original string constructed from just the BWT, we create the L2F “Column.” This is essentially a new column that formalizes what we already did, giving a mapping reference we can use to reconstruct the rest of the matrix.
Essentially, L2F tells you, for each character you see in the BWT (last column), which row of the first column to go to.
$ will always map to 0-th row
Then, in the 0th row, take the BWT character there, find its corresponding/matching occurrence in the left (first) column, and that row number fills in the L2F entry for this 0th index of the BWT; keep repeating for the remaining rows to fill in the whole L2F column.
Given the BWT, we can search for a substring w in a backwards fashion (from the end of w to the beginning of w).
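A sketch of that backward search, using plain counting where a real implementation would precompute rank/occurrence tables; backward_search is an illustrative name.

```python
# Sketch: backward search over the BWT ("last" column). Maintains a range of rows
# [lo, hi) of the sorted rotation matrix whose rotations start with the suffix of w
# processed so far, working from the end of w toward its beginning.

def backward_search(last, w):
    first = sorted(last)

    C = {}                               # C[c] = first row whose first-column character is c
    for row, ch in enumerate(first):
        C.setdefault(ch, row)

    def occ(c, i):                       # occurrences of c in last[:i]; naive count for clarity
        return last[:i].count(c)

    lo, hi = 0, len(last)                # current block of rows: initially all of them
    for ch in reversed(w):               # extend the match by one character to the left
        if ch not in C:
            return 0
        lo = C[ch] + occ(ch, lo)
        hi = C[ch] + occ(ch, hi)
        if lo >= hi:
            return 0                     # no rotation starts with ch + the suffix matched so far
    return hi - lo                       # number of occurrences of w in the original string

print(backward_search("ANNB$AA", "ANA"))   # -> 2  ("ANA" occurs twice in "BANANA")
```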
Aho-Corasick Automata
Fast data structure for string matching
Problem: Given a string T and k nonempty strings P_1, P_2, …, P_k, find all occurrences of P_1, …, P_k in T
T is called the text string, and P_1, …, P_k are called the pattern strings
Problem was originally studied in the context of compiling indexes, but has found applications in computer security and computational genomics
Parameters:
m = |T|, the length of the string to be searched
n = |P_1| + |P_2| + … + |P_k|, the total length of all pattern strings
L_max = the length of the longest pattern
Assumption: the strings are drawn from an alphabet of constant size
Parallel Searching
Searching for all the query strings at the same time, rather than serially, one after another.
This cuts down on the amount of re-scanning.
Implementing this requires a TRIE (reTRIEval tree).
Each node stores pointers to children. One pointer per child
Suffix Links/Failure Links
An edge from the trie node corresponding to a string \alpha to the trie node corresponding to a string \omega (w) such that w is the longest proper suffix of \alpha that is still in the trie
Can improve performance DRASTICALLY
Intuition: when we hit a part of the string where we cannot continue to read characters, we fall back by following suffix links to try to preserve as much context as possible.
Every node in the tree (except the root, which corresponds to the empty string), will have a suffix/failure link associated with it.
What failure/suffix links do:
At each step we either…
- advance the end pointer forward on the trie,
- or advance the start pointer forward
Each pointer can advance forward at most O(m) times
This reduces the amount of time scanning characters from O(mL_max) down to O(m)
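A sketch of the scan those two pointers describe, reusing the Node/build_automaton sketch from the failure-link card earlier (so it assumes those definitions are in scope). The current node only ever steps forward along trie edges or back along suffix/failure links, which is why the whole pass over T is O(m) moves plus the time to report matches.

```python
# Sketch: scan the text once, following failure links on mismatches instead of
# restarting from the root. Assumes Node/build_automaton from the earlier sketch.

def search(text, root):
    matches = []
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail                 # fall back while preserving as much context as possible
        if ch in node.children:
            node = node.children[ch]         # advance the "end pointer" forward on the trie
        for p in node.output:                # report every pattern ending at position i
            matches.append((i - len(p) + 1, p))
    return matches

root = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", root))                # [(1, 'she'), (2, 'he'), (2, 'hers')]
```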
This is only useful if we can compute suffix links quickly