Lectures Flashcards
What is Bioinformatics?
Bioinformatics is the computational analysis of
biological data in order to solve biological
questions
• Interdisciplinary discipline at the intersection of
Biology and computer science (and occasionally
chemistry, physics,mathematics)
• often used interchangeably with computational
biology
- strongly data driven
- development of algorithms and tools
Topics in Bioinformatics?
- Sequence analysis
- Genome annotation
- Structural Bioinformatics
- Phylogenetics
- Gene expression analysis
- Genetics and Population Analysis
- Databases and Ontologies
- Data and Text Mining
- Systems Biology
Data sources
• Sequences
Gene sequences, genomes, personal genomes,
metagenomes
• Structures
High resolution structures from X-ray crystallography,
NMR Low resolution from cryo-EM, EPR, SAX
• Gene expression data
microarrays and transcriptome sequencing
• Protein abundances
Mass spectrometry
• In all areas exponential growth of data
• Rapid development of high-throughput technologies
• Analysis of most experiments is impossible without
efficient algorithms
Revolution in Sequencing Costs
- Sequencing costs fall exponentially
- Since 2008 faster than Moore’s law
- Latest technology < $1000 per human genome
Sequence Analysis –> Foundation of bioinformatics?
• Sequence determines structure and ultimately
function
• Sequences are easier to obtain than structures
• Homologous sequences are likely to have same or
similar function
• But, sequences will evolve faster than structures and
function
• We need to be able to detect remote homolgies
Homology vs. Similarity ?
• Two sequences are homologous if they are derived
from acommon ancestor
• Similarity ̸= Homology
• Homologous sequences are likely to exhibit some
similarity
• Short sequences might be similar because of
convergent evolution
• Significant similarity implies homology
Measuring Sequence Similarity
Distance measures:
Simplest distance between strings: Hamming distance, number of differences
- Works only for sequences of equal length
- Example for a simple edit distance
- Considers point mutations only
AGUUCGAUGG
AGUCCGGUCG
δ(a, b) = { 1 if a ̸= b, 0 else
dH (AGUUCGAUGG, AGUCCGGUCG) = 3
Edit Distances
An edit distance between two objects x, y is the minimum number (or cost) of operations needed to convert x into y
Edit distances almost automatically satisfy axioms of a metric:
d(x, y) = 0 ⇐⇒ x = y
d(x, y) = d(y, x) symmetry
d(x, y) ≤ d(x, z) + d(z, y) triangle equation
The Levenshtein distance
• Simplest (earliest) sequence alignment
algorithm(1965)
• Computes an edit distance between two strings x, y
using insertion, deletion or substitution of a single
character
• Result can be displayed as an alignment
Example:
Levenshtein distance between “computation” and
“complication” is 3
Needs 2 substitutions, 1 insertion
Corresponding alignment:
comput-ation
complication
Computable in O(n · m) time
Computing the Levenshtein distance
Can be computed efficiently via dynamic programming
Given two sequences x = x1 . . . xn, y = y1 . . . ym,
let d[i,j] be the Levenshtein distance of the prefixes
x1 . . . xi and y1 . . . yj, then
d[i, 0] = i
d[0, j] = j
d[i, j] = min d[i − 1, j] + 1
d[i, j − 1] + 1
d[i − 1, j − 1] + [xi ̸= yj]
for i, j > 0
What’s an Alignment?
Given two sequence x = x1x2 . . . xn, y = y1y2 . . . ym, with xi , yi ∈ A.
An Alignment of x and y is obtained by inserting gap characters (’–’) so that both resulting strings are equal length.
An alignment encodes a sequence of edit operations to convert a sequence x into y (or vice versa) using substitutions, insertions and deletions.
It can be interpreted as an hypothesis of the evolutionary history of the two sequences.
Example:
Y E − S T E R D A Y
− E A S T E R − − −
means (del(x, 1), ins(x, 2, A), del(x, 7), del(x, 8), del(x, 9))
Since we cannot distinguish between an insertion in x and and deletion in y, we usually talk about “indels”.
What’s an Alignment?
An alignment (also called alignment trace) is a “matching” between two sequences.
Matching positions are written above each other, unmatched positions are written with a gap (’–’) on
the opposite side.
This suggest a functional interpretation of the alignment:
matching positions should fulfill the equivalent functions in the two biomolecules
Counting Alignments
Alternatively, count the number of traces, i.e. possible
matchings.
AU-C
A-GC and
A-UC
AG-C
have the same matches, but different alignment.
The number of traces for two sequences of length n and m is
n+m
n
What is the best Alignment?
The biologically correct alignment:
• The Alignment reflecting the true evolutionary history
• The alignment matching functionally equivalent
positions
The optimal Alignment:
• The alignment fulfilling an optimization criterion such
as Levenshtein distance
optimal ̸= correct. But good optimization criteria can help.
The biologically correct alignment:
• The Alignment reflecting the true evolutionary history
• The alignment matching functionally equivalent
positions
The optimal Alignment:
• The alignment fulfilling an optimization criterion such
as Levenshtein distance
Similarity Alignments
Making alignments more realistic
- Use different scores for different substitutions
- Gaps should be more costly than mismatches
• Instead of minimizing a distance we can maximize a
similarity
• Easier to derive realistic similarity scores.
The Needleman Wunsch Algorithm
Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel).
Let Si,j be the score of the optimal alignment of x1 . . . xi
with y1 . . . yj.
Si,j can be computed using the recursion
Si−1,j−1+ s(xi, yj) Si,j = max Si,j−1+ g Si−1,j+ g
with initialization S0,j= g · j and Si,0= g · i
The Needleman Wunsch Algorithm
Example:
AATAC aligned with GATTAC
g = −3 s(a, a) = +2 s(a, b) = −1
- G A T T A C - 0 -3 -6 -9 -12 -15 -18 A -3 -1 -1 -4 -7 -10 -13 A -6 -4 1 -2 -5 -5 -8 T -9 -7 -2 3 0 -3 -6 A -12 -10 -5 0 2 2 -1 C -15 -13 -8 -3 -1 1 4
The optimal alignment score 4 can be found in the lower right corner.
Backtracing
• Filling the table S calculates the score of the optimal
alignment, but not the alignment itself.
• To obtain the alignment we backtrace, i.e. we ask how
each entry was obtained from the recursion.
• In the example, the final entry S5,6 = 4 was obtained
as S4,5 + s(C, C) = 2 + 2 = 4 Thus the alignment has to
end with a CC match and we continue backtracing
from S4,5.
• Backtracing yields a path through the DP matrix from
the lower right to the upper left corner
• There is a one-to-one correspondence between paths
through the matrix and alignment
• Often there will be more than one optimal alignment
Variations of alignment
So far we considered global alignments, appropriate if both sequences are complete and homologous over the full length.
Often were this is not ideal because:
• Some of our sequences may be incomplete Leads to
lots of spurious gaps at beginning or end
• We may want to compare short with very long
sequences e.g. short sequence vs. a complete
genome
• We’re searching for local similarities
In the first two cases we need a semi-global alignment
In the last one a local alignment
Semi-global or glocal Alignment
a. Aligning a short sequence (fragment) against a long
one
• Ignore gaps at the beginning and end of the short sequence these do not represent insertion or deletions • Two changes to algorithm: Change initialization S0,j = 0 (ignores leading gaps in first seq) Start backtracing at highest value in bottom row (ignores trailing gaps)
b. Aligning two possibly incomplete sequences
• Solved by aligning with free end gaps • Changes to algorithm: Change initialization S0,j = Si,0 = 0 (ignore all leading gaps) Start backtracing at highest value in last row and column (ignores trailing gaps)
Since we deal with incomplete sequences most of the time. Semi-global or “free endgap” alignment is often the default.
Local Alignments
Often we are interested in only local similarity
• Sequences might be too diverged for meaningful
global alignment
• Otherwise unrelated sequences might share short
functional elements
• Helps focus on strongly conserved functionally
important regions
- Useful for aligning very long sequences
- local similarity is easily lost in global alignments
Computing Local Alignments
At first glance local alignment seem very hard:
• Should compare every pair of subsequences
xi 1. . . xi 2 and yj 1. . . yj 2
• that would entail O(n^4) space and O(n^6) time
Smith Waterman solution:
• Idea:
• We’re not interested in poor alignments (score < 0)
• Compute only 1 local alignment for every endpoint
xi and yj
• Same recursion as Needleman-Wunsch, except:
1 Initialize with 0 as in free end gap alignment
2 Replace negative scores with 0, marks start of a
new alignment
Smith Waterman Alignment
Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel). Let Si,j be the score of the best local alignment ending at position i in x and j in y.
Same effort as Needleman Wunsch O(n · m) in memory and time!
Smith Waterman Backtracing
- Find the highest entry in the table
- Begin normal backtracing from highest entry
- Stop as soon as you encouter a 0
Smith Waterman Backtracing
What if we want to find additional (suboptimal)
alignments?
Smith Waterman Deblocking
Smith Waterman Deblocking
• remove cells used in the first alignment by setting
them 0
• “repair” table by recomputing cells that might have
changed
• backtrace in new table
• repeat for as many alignments as desired
Global alignment:
- Input: two potentially equivalent sequences
- Goal: identify conserved regions and differences
- Algorithm: Needleman-Wunsch dynamic programming
- Applications: comparing two related genes/proteins
Semi-global alignment:
- Input: two sequences, one short and one long
- Goal: is the short one a part of the long one?
- Algorithm: free end gap modification of NW DP
- Applications: sequence assembly, overlap detection
Local alignment:
• Input: two sequences that may or may not be related
• Goal: find regions of similarity between the sequences
• Algorithm: Smith-Waterman dynamic programming
• Applications: searching for local similarities in large
sequences, identification of conserved domains/motifs
Better Substitution Scores
For DNA/RNA mostly ad-hoc score models used
• positive scores for matches
• negative for mismatches
• sometimes better score for transitions (G-A, T-C) than
for transversions (C-G, C-A, T-G, T-A)
• Negative scores are essential for local alignments
• More precisely, the expected score for random
sequences should be negative
Better Substitution Scores
For DNA/RNA mostly ad-hoc score models used
• positive scores for matches
• negative for mismatches
• sometimes better score for transitions (G-A, T-C) than
for transversions (C-G, C-A, T-G, T-A)
• Negative scores are essential for local alignments
• More precisely, the expected score for random
sequences should be negative