String Searching: Aho-Corasick Automaton, Suffix Array, Burrows-Wheeler Transform Flashcards

1
Q

What are the expectations/assumptions of the Aho-Corasick Automaton?

A

All elements are static and memory is allocated up front. It favors the find operation over insert and remove.

2
Q

What is the time complexity if a MultiWay Trie is used to build and implement this database?

What are the contingencies?

A

O(nk)

Assuming the trie has already been constructed: O(n) start positions, each with up to O(k) traversal steps, giving a total of O(nk).

While an improvement, it requires restarting the traversal from scratch at each of the potential O(n) start positions.
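
A minimal sketch of this naive scan (not from the cards), using a nested-dict trie; the END sentinel and the function names are illustrative assumptions:

```python
END = object()  # sentinel key marking "a pattern ends at this node" (assumed convention)

def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for c in p:
            node = node.setdefault(c, {})
        node[END] = p
    return root

def naive_trie_scan(text, root):
    """Restart a trie walk at every position of `text`: O(n) starts x O(k) steps each."""
    hits = []
    for start in range(len(text)):           # O(n) start positions
        node = root
        for end in range(start, len(text)):  # at most O(k) steps before falling off the trie
            node = node.get(text[end])
            if node is None:
                break
            if END in node:
                hits.append((start, node[END]))
    return hits

# e.g. naive_trie_scan("banana", build_trie(["ana", "nan", "ban"]))
# -> [(0, 'ban'), (1, 'ana'), (2, 'nan'), (3, 'ana')]
```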

3
Q

One large motivation/weakness of the MultiWay Trie is the need to restart the traversal through the TRIE at EVERY position of the query, which creates REDUNDANT operations. How is this resolved with Aho-Corasick?

A

If a node can hold some information about the path FROM the root to the current node, specifically whether a suffix of it also exists as a path starting at the root, then we can keep track of that connection to avoid searching the same substring again.

Then, if we traverse to an end node and fail, instead of starting again at the root, we continue where we left off in the query but at another “related” node.

4
Q

Failure Links: an edge/connection that can be traversed when the current traversal fails at a non-existent terminal/child node.

We create a failure link from a node u to a node v such that the path from the root to node v is the longest proper suffix of the path from the root to node u.

A

In the worst case, if no such node can be found, we have a failure link pointing to the root.

5
Q

Previously, our collection of short sequences was our database, and our single long sequence was our query (solved by Aho-Corasick).

Now, our collection of short sequences is our query, and our single long sequence is our database.

The solution is to PREprocess the database (front-end effort) to shorten the query time, since the queries will be changing and multiple queries may be performed at the same time.

A

“Does this segment appear anywhere in the entire dictionary?”

6
Q

Notice how:

Sorting the suffixes in the first stage groups suffixes/substrings that share the same leading characters together.

THUS (recall we also want to know where a substring exists in the database), we can check all suffixes with the same initial character(s). This narrows the range that needs to be searched, and once the next suffix no longer matches those character(s), we know we can stop.

A

This eventually gives O(k log n) worst-case query time (a binary search over the n suffixes, with potentially k character comparisons per step).
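
A minimal sketch of this lookup, assuming a naively built suffix array (real implementations construct it in O(n log n) or better); the function names are illustrative:

```python
def suffix_array(db):
    # Naive construction, for illustration only; sorts suffix start positions.
    return sorted(range(len(db)), key=lambda i: db[i:])

def find_occurrences(db, sa, query):
    """Return all start positions of `query` in `db` via binary search over
    the sorted suffixes: O(log n) steps of up to k character comparisons."""
    k, n = len(query), len(sa)
    lo, hi = 0, n
    while lo < hi:                                 # left edge of the matching block
        mid = (lo + hi) // 2
        if db[sa[mid]:sa[mid] + k] < query:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, n
    while lo < hi:                                 # right edge of the matching block
        mid = (lo + hi) // 2
        if db[sa[mid]:sa[mid] + k] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

# e.g. find_occurrences("BANANA", suffix_array("BANANA"), "ANA") -> [1, 3]
```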

7
Q

Burrows-Wheeler Transform

Addresses read mapping problem

We have a single long database string D of length n (e.g. the human genome), and we have a query Q containing m short strings of length k, where n >> k. Our computational problem is the following: For every string w in Q, does w appear as a substring of D, and if so, at what position(s) of D? We can assume D doesn’t change frequently, meaning the time it takes to preprocess is negligible in the long term, but Q is expected to change frequently.

A

We last ended with the Suffix Array, which had a runtime of O(k log n); with a query of m short strings this becomes O(mk log n).

8
Q

BWT Algorithm

1. Add an end-of-string character, in examples the “$” character. The end-of-string character should be unique and must not occur in the original string/alphabet (domain).
2. Construct the Burrows-Wheeler matrix. Its rows are the rotations of the string (with the end-of-string symbol), indexed by their starting position, e.g.:
BANANA$
$BANANA
A$BANAN
.....
ANANA$B
3. Sort the rows of the matrix in alphabetical order and record the order of the indexed rows.
4. The last column (end characters) of the sorted matrix is the BURROWS-WHEELER Transform.

Instead of recording the entire string of each row, the index is recorded to condense the representation of the string.

5. This index number can be seen as a “starting index” and can be used to produce the actual rotation on the fly, since the original string is essentially now a circular string.
A

With this algorithm we can transform a given string into its BWT.

The remainder of the problem is to uncompress or transform the BWT to its original string.
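
A minimal sketch of the forward transform following these steps (all rotations are sorted explicitly; the function name and the “$” default are illustrative):

```python
def bwt(s, end="$"):
    """Burrows-Wheeler Transform of s, assuming `end` does not occur in s."""
    s = s + end
    n = len(s)
    # Each rotation is identified by its starting index in the (circular) string.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The BWT is the last character of each rotation, read in sorted order.
    return "".join(s[(i - 1) % n] for i in order)

# e.g. bwt("BANANA") -> "ANNB$AA"
```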

9
Q

(RE)Construction Algorithm of BWT

1. From the BWT (last column) we can confidently construct the FIRST column, since the first column is all the characters of the BWT in alphabetical order.

Remember that each row of our Burrows-Wheeler matrix is simply a rotation of our original string. That means that, for any given row, in the original string the letter in the first column comes immediately after the letter in the last column. This helps, but recurring characters pose a problem. However, we can leverage the $ (end-of-string character) to start this decompression.

2. Reconstruct the original string (the top row, which is $ followed by the original string). We will need to index all recurring characters on both ends (first and last columns).

Move down the BWT column to find $, then read that row’s character (call it @) in the first column. This “@” is adjacent to $ in the top row / original string. Now find this same @ in the BWT column and read the corresponding character in the first column; this is the next character of the top row. Continue this process to rebuild the top row / original string.

3. With the top row reconstructed, the other rows can be reconstructed, as they are just rotations of the top row.
A

The reconstruction of the top row will be O(n), as it depends on how many characters there are to match. This O(n) can be achieved by using RADIX SORT.
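
A minimal sketch of this reconstruction, assuming a unique “$” terminator; the stable sort plays the role of the rank pairing described above, and the names are illustrative:

```python
def inverse_bwt(last, end="$"):
    """Recover the original string from its BWT column."""
    n = len(last)
    first = sorted(last)                     # first column of the sorted matrix
    # A stable sort pairs the j-th occurrence of each character in the last
    # column with its j-th occurrence in the first column; nxt[r] is then the
    # row that starts one text position after row r.
    nxt = sorted(range(n), key=lambda i: last[i])
    out, row = [], 0                         # row 0 is "$<original string>"
    for _ in range(n - 1):                   # n-1 steps; the trailing '$' is dropped
        row = nxt[row]
        out.append(first[row])
    return "".join(out)

# e.g. inverse_bwt("ANNB$AA") -> "BANANA"
```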

10
Q

L2F Column

Recall that we went from Original String -> BWT matrix (rotations) -> sorting the matrix/rows -> BWT (last column of the sorted rows). That is “compression.” Then we went from the BWT -> recreating the first column (alphabetically sorting the BWT, with indices) -> matching the corresponding characters one letter at a time, starting with $, to rebuild the original string.

Now, with the original string constructed from just the BWT, we create the L2F “column.” This is essentially a new column that formalizes what we already did, as a mapping reference for reconstructing the rest of the matrix.

A

Essentially, L2F tells you which row to go to when you see a character in the BWT.

$ will always map to the 0-th row.
Then, take the BWT character of the 0-th row, find its corresponding/matching character in the left (first) column, and that row number fills in the L2F entry of the previous row (the 0-th index of the BWT); keep repeating.

Given the BWT, we can search for a substring w in a backwards fashion (from the end of w to the beginning of w).
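
A minimal sketch of this backward search using only the BWT; the occurrence counts are recomputed naively here for clarity (a real index precomputes them), and the names are illustrative:

```python
from bisect import bisect_left

def backward_search(bwt, pattern):
    """Count the rows of the sorted matrix whose rotation starts with `pattern`."""
    first = sorted(bwt)                  # first column of the sorted matrix
    def C(c):                            # first row whose first-column character is c
        return bisect_left(first, c)
    def occ(c, i):                       # occurrences of c in bwt[:i] (naive O(i))
        return bwt[:i].count(c)
    lo, hi = 0, len(bwt)                 # current range of matching rows
    for c in reversed(pattern):          # extend the match from the end of the pattern
        lo, hi = C(c) + occ(c, lo), C(c) + occ(c, hi)
        if lo >= hi:
            return 0                     # the pattern does not occur
    return hi - lo

# e.g. backward_search("ANNB$AA", "ANA") -> 2   ("ANNB$AA" is the BWT of "BANANA$")
```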

11
Q

Aho-Corasick Automata

Fast data structure for string matching

Problem: Given a string T and k nonempty strings P_1, P_2, …, P_k, find all occurrences of P_1, …, P_k in T

A

T is called the text string, and P_1, …, P_k are called pattern strings

Problem was originally studied in the context of compiling indexes, but has found applications in computer security and computational genomics

Parameters:
m = |T|, length of the string to be searched
n = |P_1| + |P_2| + … + |P_k|, total length of all pattern strings
L_max = length of the longest pattern
Assumption: strings are drawn from an alphabet of constant size

12
Q

Parallel Searching

Searching for all the query strings at the same time, rather than serially, one at a time, one after another.

This will cut down on the amount of re-scanning.

A

Implementing this requires a TRIE (reTRIEval).

Each node stores pointers to its children, one pointer per child.
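
A minimal sketch of such a node under the constant-alphabet assumption (one child pointer per alphabet character; the DNA alphabet here is only an illustrative choice):

```python
ALPHABET = "ACGT"                  # assumed constant-size alphabet (illustrative)

class TrieNode:
    def __init__(self):
        # One pointer per possible child, indexed by character position.
        self.children = [None] * len(ALPHABET)
        self.is_pattern = False    # marks that a pattern ends at this node

def insert(root, pattern):
    node = root
    for c in pattern:
        i = ALPHABET.index(c)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.is_pattern = True
```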

13
Q

Suffix Links / Failure Links

An edge from the trie node corresponding to a string \alpha to the trie node corresponding to a string \omega (w), such that w is the longest proper suffix of \alpha that is still in the trie.

Can improve performance DRASTICALLY.

A

Intuition: when we hit a part of the string where we cannot continue to read characters, we fall back by following suffix links to try to preserve as much context as possible.

Every node in the trie (except the root, which corresponds to the empty string) will have a suffix/failure link associated with it.

14
Q

What failure/suffix links do:

At each step we either…

1. advance the end pointer forward on the trie, or
2. advance the start pointer forward.

Each pointer can advance forward at most O(m) times.

A

This reduces the amount of time scanning characters from O(mL_max) down to O(m)

This is only useful if we can compute suffix links quickly

15
Q

Challenges to Failure/Suffix link implementation:

The optimized string searcher no longer starts searching from each position in the string. This leads to a failure to output matches in certain cases.

How do we correct this?

A

For example: if we were searching for ‘sting’, we might find ‘sting’ but miss ‘tin’ and ‘in’, because each is a proper suffix of ‘stin’.

This leads to output links.

16
Q

Output Links

Precomputing our end points, we can read off any extra patterns to emit at this point.

The output link (dictionary links) are linked lists

A

For example: if we had ‘tin’ and the output link of its ‘n’ node pointed to the ‘n’ node of ‘in’, then ‘tin’ would need to be printed as well as ‘in’.

17
Q

Aho Corasick Algorithm

● Start at the root node in the trie.
● For each character c in the string:
  ● While there is no edge labeled c:
    – If you’re at the root, break out of this loop.
    – Otherwise, follow a suffix link.
  ● If there is an edge labeled c, follow it.
  ● If the current node corresponds to a pattern, output that pattern.
  ● Output all words in the chain of output links originating at this node.

A

Runtime:

In the worst case we print out all the matches of the string, but this is inherent in any search algorithm.

Let z denote the number of matches reported by the algorithm. Then the runtime of the match phase is Θ(m + z).

This is an output-sensitive algorithm.
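
A minimal sketch of this matching phase, assuming the trie with suffix links and output links has already been built; the Node layout and field names are illustrative assumptions, not a fixed API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:                                   # hypothetical node layout
    children: dict = field(default_factory=dict)
    suffix_link: Optional["Node"] = None
    output_link: Optional["Node"] = None
    pattern: Optional[str] = None             # set if a pattern ends at this node

def match(root, text):
    node, found = root, []
    for i, c in enumerate(text):
        # Fall back along suffix links until c can be read (or we reach the root).
        while node is not root and c not in node.children:
            node = node.suffix_link
        node = node.children.get(c, root)
        if node.pattern is not None:          # the current node ends a pattern
            found.append((i, node.pattern))
        out = node.output_link                # emit everything on the output-link chain
        while out is not None:
            found.append((i, out.pattern))
            out = out.output_link
    return found                              # (end index, pattern) pairs: Θ(m + z) overall
```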

18
Q

To construct the Aho-Corasick Automaton:

  1. construct the TRIE
  2. construct suffix links
  3. construct output links
A

The naive approach is to construct each link with a doubly nested loop, which is very inefficient.

19
Q

Constructing Failure/Suffix Links:

(Following the Stepik/Niema guide:) look at the entire string and see if another path contains truncated versions of it, removing the initial characters one at a time until the final character; if nothing matches, point back to the root.

Failure links point from a longer string to a shorter string.

If we precompute suffix links for nodes in ascending order of string length, all of the information needed for the above approach will be available at the time we need it.

A

Do a breadth-first search of the trie, performing the following operations:
● If the node is the root, it has no suffix link.
● If the node is one hop away from the root, its suffix link points to the root.
● Otherwise, the node corresponds to some string wa.
● Let x be the node pointed at by w’s suffix link. Then, do the following:
  – If the node xa exists, wa’s suffix link points to xa.
  – Otherwise, if x is the root node, wa’s suffix link points to the root.
  – Otherwise, set x to the node pointed at by x’s suffix link and repeat.
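
A minimal sketch of this BFS construction (trie build, suffix links, and the output links from the dictionary-link card), reusing the hypothetical node layout from the matching sketch above:

```python
from collections import deque

class Node:                          # hypothetical node layout, as in the matching sketch
    def __init__(self):
        self.children = {}           # one pointer per child character
        self.suffix_link = None
        self.output_link = None
        self.pattern = None          # set if a pattern ends at this node

def build_automaton(patterns):
    root = Node()
    for p in patterns:               # step 1: build the trie
        node = root
        for c in p:
            node = node.children.setdefault(c, Node())
        node.pattern = p

    # Steps 2-3: BFS so nodes are handled in ascending order of string length.
    queue = deque()
    for child in root.children.values():
        child.suffix_link = root     # one hop from the root: suffix link points to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for c, child in node.children.items():
            # Walk suffix links from `node` until a node with a c-child (or the root) is found.
            x = node.suffix_link
            while x is not root and c not in x.children:
                x = x.suffix_link
            child.suffix_link = x.children.get(c, root)
            # Output (dictionary) link: nearest suffix-link ancestor that ends a pattern.
            s = child.suffix_link
            child.output_link = s if s.pattern is not None else s.output_link
            queue.append(child)
    return root

# e.g. build_automaton(["sting", "tin", "in"]) can be passed to the match() sketch above.
```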

20
Q

Analyzing Efficiency

How much time does it take to actually build all the failure/suffix links?

When filling in any individual failure/suffix link, we need to keep walking backwards in the trie, following failure/suffix links repeatedly while searching for a place to extend.

It might seem quadratic.

A

Claim: the algorithm for computing failure links takes time O(n).

Intuition: focus on any one word in the trie. As you add failure/suffix links along it, keep track of the depth of the node pointed at by the current node’s failure/suffix link. Each new character can raise that depth by at most one, while every fallback step lowers it by at least one, so the total number of fallback steps along the word is bounded by its length.

In actuality, the total time required to construct all failure/suffix links is O(n).

21
Q

Dictionary Link (Output Link)

The dictionary link at a node corresponding to a string w points to the node corresponding to the longest proper suffix of w that is a pattern, or NULL if no such suffix exists.

A

By always pointing to the node corresponding to the longest such word, we ensure that we chain together all the patterns using output links.

22
Q

Total Net Complexity

Preprocessing time is:

  • Θ(n) to build the TRIE,
  • O(n) to fill in failure/suffix links, and
  • O(n) work to fill in output links.

Θ(n) is the total preprocessing time.

A

This algorithm is good when there is a fixed set of patterns and a variable string to search.

This problem of finding all pattern occurrences is solved with a data structure built in O(n) preprocessing time and queried in O(m + z) time.