10 - Strings and Text 1 Flashcards

1
Q

What is the terminology for this topic?

A

an alphabet = {A, C, G, T} (DNA)

= ASCII, Unicode, etc

a string = ACAGTCCGGGACT

all strings indexed from position 1

Let S,T be two strings

|S| is length of S

S(i) is indexing S S(i .. j) denotes S(i)S(i+1) ..

S(j) ST is concatenation of S and T

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is a substring?

A

ACAGTCCGGGACTGACG

a string contained within a string a substring of S is S(i .. j) (i

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a common substring?

A

TCCACTGACTGCTGC

ACAGTCCGGGACTGACG

GACTG is a common substring U is a common substring of S and T if U is a substring of both S and T

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a subsequence?

A

ACAGTCCGGGACTGACG

obtained by deleting 0 or more characters from the string

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a common subsequence?

A

ACAGTCCGGGACTGACG

TCCACTGACTGCTGC

U is a common subsequence of S and T if U is a subsequence of both S and T

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a prefix?

A

ACAGTCCGGGACTGACG

S(1 . . k) for some k (1 k n), where n = |S|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is a suffix?

A

ACAGTCCGGGACTGACG

S(k . . n) for some k (1 k n), where n = |S|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is star / *?

A

* = {,A,C,G,T,AA,AC,AG,AT,CA,…}

set of all strings composed of symbols from the alphabet

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the leaf terminology?

A

a leaf node: no children

a branch node: 1 or more children

a unary node: exactly 1 child

a binary node: exactly 2 children

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is the naive solution to finding longest repeated substring?

A
  • String S of length n, build n x n array A with A(i, j) = 1 if S(i) = S(j) and A(i, j) = 0 otherwise
  • repeated substrings are represnted by sequences of 1’s on a diagonal
  • scan all diagonals to find longest repeat
  • O(n2) time and space
  • repeated substrings may overlap
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is naive solution to common substrings?

A
  • Find longest common substring of two strings
  • two strings S, T of lengths m, n
    • build similar m x n array of 0’s and 1’s
    • gives longest common substring of T and S in O(mn) time and space
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Using trees for suffixes

A
  • Let S be string of length n
    • kth suffix of S is the suffix S(k..n)
    • S has n suffixes, including S itself
  • Data structures to store the n suffixes of S
    • Suffix Trie - O(n2) space
    • Suffix Tree - O(n) space
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What problems use a suffix tree to solve them?

A
  • Longest repeat of S
    • O(n) time/spacespace
  • Longest common substring
    • O(m+n) time/space for 2 strings S,T where m = |S| and n = |T|
  • Multiple string searching
    • O(n+r) time/space, for long peice of text T (n = |T|) and strings of total length r
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is a Suffix Trie?

A
  • A trie is a multiway branching tree T
    • used to store strings C over an alphabet Σ
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the properties of a Suffix Trie?

A
  • suffix trie T for string S is a trie used to store all suffixes of S
  • T is rooted at some vertex v
  • Each edge of T is labelled with some σ∈Σ
  • No 2 children of a node have same edge label
  • Each node w corresponds to a string S∈Σ* (concatentation of edge labels on the path from v to w) - S is the path label of w
  • Each node is marked according to whether ir corresponds to a string S∈C
  • each suffix of S represented by unique leaf node of T
  • if some suffix of S is a prefix of another suffix
    • then a suffix trie for S may not exist, e.g. consider S = queue
  • can ensure the suffix trie exists by appending to s a character $ not appearing in S, before constructing T
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Example of suffix trie

A

see slides

lecture 10 slide 11/14

17
Q

Building a suffix trie

A
  • Inserting ith suffix takes O(n-i) steps
    • so requires O(n2) time overall
  • How many nodes?
    • text length n, typically proportional to n2
    • space needed is quadratic in n
  • Searching for string length m is O(m) after O(n2) preprocessing time
  • Finding longest repeat is O(n2)
  • In both cases O(n2) space needed
  • Can we improve?
    • Yes - suffix tree