10 - Strings and Text 1 Flashcards
What is the terminology for this topic?
an alphabet = {A, C, G, T} (DNA)
= ASCII, Unicode, etc
a string = ACAGTCCGGGACT
all strings indexed from position 1
Let S,T be two strings
|S| is length of S
S(i) is indexing S S(i .. j) denotes S(i)S(i+1) ..
S(j) ST is concatenation of S and T
What is a substring?
ACAGTCCGGGACTGACG
a string contained within a string a substring of S is S(i .. j) (i
What is a common substring?
TCCACTGACTGCTGC
ACAGTCCGGGACTGACG
GACTG is a common substring U is a common substring of S and T if U is a substring of both S and T
What is a subsequence?
ACAGTCCGGGACTGACG
obtained by deleting 0 or more characters from the string
What is a common subsequence?
ACAGTCCGGGACTGACG
TCCACTGACTGCTGC
U is a common subsequence of S and T if U is a subsequence of both S and T
What is a prefix?
ACAGTCCGGGACTGACG
S(1 . . k) for some k (1 k n), where n = |S|
What is a suffix?
ACAGTCCGGGACTGACG
S(k . . n) for some k (1 k n), where n = |S|
What is star / *?
* = {,A,C,G,T,AA,AC,AG,AT,CA,…}
set of all strings composed of symbols from the alphabet
What is the leaf terminology?
a leaf node: no children
a branch node: 1 or more children
a unary node: exactly 1 child
a binary node: exactly 2 children
What is the naive solution to finding longest repeated substring?
- String S of length n, build n x n array A with A(i, j) = 1 if S(i) = S(j) and A(i, j) = 0 otherwise
- repeated substrings are represnted by sequences of 1’s on a diagonal
- scan all diagonals to find longest repeat
- O(n2) time and space
- repeated substrings may overlap
What is naive solution to common substrings?
- Find longest common substring of two strings
- two strings S, T of lengths m, n
- build similar m x n array of 0’s and 1’s
- gives longest common substring of T and S in O(mn) time and space
Using trees for suffixes
- Let S be string of length n
- kth suffix of S is the suffix S(k..n)
- S has n suffixes, including S itself
- Data structures to store the n suffixes of S
- Suffix Trie - O(n2) space
- Suffix Tree - O(n) space
What problems use a suffix tree to solve them?
- Longest repeat of S
- O(n) time/spacespace
- Longest common substring
- O(m+n) time/space for 2 strings S,T where m = |S| and n = |T|
- Multiple string searching
- O(n+r) time/space, for long peice of text T (n = |T|) and strings of total length r
What is a Suffix Trie?
- A trie is a multiway branching tree T
- used to store strings C over an alphabet Σ
What are the properties of a Suffix Trie?
- suffix trie T for string S is a trie used to store all suffixes of S
- T is rooted at some vertex v
- Each edge of T is labelled with some σ∈Σ
- No 2 children of a node have same edge label
- Each node w corresponds to a string S∈Σ* (concatentation of edge labels on the path from v to w) - S is the path label of w
- Each node is marked according to whether ir corresponds to a string S∈C
- each suffix of S represented by unique leaf node of T
- if some suffix of S is a prefix of another suffix
- then a suffix trie for S may not exist, e.g. consider S = queue
- can ensure the suffix trie exists by appending to s a character $ not appearing in S, before constructing T