Sequence analysis Flashcards
What are the primary DNA sequence databases?
- GenBank in the USA, at the National Center for Biotechnology Information (NCBI), Bethesda, Maryland (near Washington DC)
- ENA: European Nucleotide Archive, often called the EMBL database (EMBL = European Molecular Biology Laboratory), run from EMBL-EBI (European Bioinformatics Institute, Hinxton, Cambridge)
* Includes annotation
- DDBJ: DNA Data Bank of Japan
Sequences come from DNA submitted worldwide
- Initial DNA deposition translated into protein sequences
* GENBANK to GENPEPT
* EMBL to TrEMBL
- In parallel SWISSPROT (Amos Bairoch) is a high quality source of annotation for some sequences
What is UniProtKB?
UniProtKB = UniProt KnowledgeBase, European-based (230M sequences)
Species distribution in TrEMBL
What are possible problems and errors in databases?
- organisation of databases changes rapidly
- names very variable
- errors very slow to correct
- sometimes errors will not be corrected as organisation will not change submission without action by submitter
What is metagenomics?
Study of genetic material recovered directly from environmental and clinical settings through sequencing
Work pioneered by Craig Venter to obtain sequences in batch from microorganisms in exotic locations such as the middle of the ocean or the human gut. Many sequences are of poor quality, but they give insight into biodiversity.
What are orthologues and paralogues?
What is the result of gene duplication?
Gene duplication: a gene is duplicated within a genome. The two resulting proteins are paralogues.
What can happen when you get gene duplication?
Can result in change of function – only 1 copy required to provide original protein, so second gene/protein can evolve a new function.
What is the result of speciation?
Speciation: a new species is created. As a result the two species have a single copy of the same gene – the two proteins are orthologues.
What happens to the function of proteins during speciation?
Both species only have a single copy so their function is less likely to change.
What are the requirements of a pairwise protein sequence alignment?
Scoring scheme of similarity of amino acid residues
Algorithm to establish the alignment
Aim that the combined use of the algorithm with the scoring scheme generates the best alignment in terms of the biology
Potential to be extended to database searching
Scoring scheme- identity
Simplest is to score 1 for identical amino acids, 0 for different ones
Similarly identical bases can be scored
For proteins, evolution imposes constraints on types of amino acid changes that generally occur to modify, but not destroy protein function
Residues tend to keep their chemical property, e.g. the tendency to be buried (i.e. non-polar or hydrophobic character)
Maintenance of chemical property called conservative substitution
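The identity scoring scheme above can be sketched in a few lines. This is only an illustration: the function names are made up for this card, and the two inputs are assumed to be already aligned, of equal length, with "-" marking gaps.

```python
def identity_score(aligned_a: str, aligned_b: str) -> int:
    """Score 1 for each identical aligned pair, 0 otherwise (gaps never match)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identity as a percentage of the columns where neither sequence has a gap."""
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)
```

Identity scoring treats a Leu to Ile change exactly like a Leu to Asp change, which is why the substitution matrices on the following cards are preferred for proteins.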
Scoring scheme – Dayhoff (PAM)
Based on counting number of times residue types changed in aligned sequences of closely homologous sequences
Extended to detect more distant relationships by assuming matrix can be multiplied by itself.
PAM 250 developed to model sequences with 20% identity.
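The "multiply the matrix by itself" idea can be illustrated with a toy example. The 3-letter alphabet and the probabilities below are invented for illustration (a real PAM 1 matrix is 20 x 20 and is derived from observed data), and the published PAM 250 scores are log-odds values computed from such probabilities, a step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM 1-like" matrix: each row gives the probability that a
# residue stays the same (diagonal) or changes to another type.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(toy_pam1, 250)
```

After 250 multiplications each row still sums to 1, but the diagonal (probability of no change) is much smaller, which is how the extrapolated matrix models distant relationships.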
What is PAM?
Point accepted mutation
The PAM 250 matrix
Scoring scheme – BLOSUM62
Derived by Henikoff & Henikoff in early 90s
Based on aligned segments of protein families called BLOCKS, hence BLOcks SUbstitution Matrix.
BLOSUM62 is derived from BLOCKS in which sequences with pairwise identity of 62% or more are clustered together, so that closely related sequences are not over-counted
Currently the most widely used matrix, and included in the BLAST / PSI-BLAST family of database searching algorithms
BLOSUM62
How do we score gaps?
Penalise gaps (insertions/deletions, collectively known as indels)
Penalty = o + e × l
o = gap opening constant
e = gap extension constant
l = length of gap extension (number of residues in gap - 1)
o > e, as the evolutionary event is making the gap, and we often see long gaps
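A worked version of the gap penalty formula, as a small sketch. The default values of o = 10 and e = 1 are assumptions chosen for illustration, not values from the card.

```python
def gap_penalty(n_residues: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty = o + e * l, where l = (number of residues in the gap) - 1."""
    if n_residues < 1:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * (n_residues - 1)
```

With these values one gap of four residues costs 13, while two separate gaps of two residues cost 22, so the scheme favours a few long gaps over many short ones.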
Alignment of protein domains
Often a protein sequence is formed from parts known as domains, where each domain is a different homologous family
Domains are the evolutionary unit
Local vs global alignment
Needleman-Wunsch Algorithm
General algorithm for sequence comparison
Maximise a similarity score, to give ‘maximum match’
Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible insertions/deletions (indels).
N-W involves an iterative matrix method of calculation
All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array
All possible alignments (comparisons) are represented by pathways through this array
Does NW algorithm give you global or local alignment?
Finds the best GLOBAL alignment of any two sequences
What are the steps in the Needleman-Wunsch algorithm?
1. Assign similarity values
2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
Describe the similarity values in the Needleman-Wunsch algorithm
Describe construct alignment in Needleman-Wunsch algorithm
The alignment score is cumulative, obtained by adding along a path through the array
The best alignment has the highest score i.e. the maximum match
Maximum match = largest number resulting from summing the cell values of every pathway
The maximum match will ALWAYS be somewhere in the outer row or column shown
The alignment is constructed by working backwards from the maximum match
Needleman-Wunsch Algorithm (Gaps)
A gap penalty can be introduced
Score of next step is the best of:
- Just continue alignment
- Add gap in vertical sequence
- Add gap in horizontal sequence
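The three-way choice above is the core of the usual dynamic-programming form of Needleman-Wunsch. The sketch below assumes identity scoring (match = 1, mismatch = 0) and a simple per-residue gap penalty of -1. These values and the function name are illustrative assumptions, and the traceback starts from the final cell because end gaps are penalised in this variant.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # just continue alignment
                score[i - 1][j] + gap,            # gap in sequence b
                score[i][j - 1] + gap,            # gap in sequence a
            )

    # Work backwards from the final cell to recover one best pathway.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` with `A-GT` for a score of 2 (three matches, one gap).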
Smith-Waterman Algorithm
Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure.
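A minimal sketch of the local alignment idea, under stated assumptions: match = 2, mismatch = -1 and a per-residue gap penalty of -2 are illustrative values only. Local alignment needs negative scores for mismatches, because a cell is reset to 0 whenever the running score would drop below zero, and the traceback starts from the highest-scoring cell anywhere in the array.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment; returns (score, aligned_segment_a, aligned_segment_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                # start a new local alignment
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("CCCGATTACACCC", "TTGATTACATT")` picks out only the shared segment `GATTACA` and ignores the unrelated flanking residues.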
Fast Pairwise Search Algorithms
Single query aligned independently to any (similar) database entry
Must perform local search
Smith-Waterman guaranteed to find mathematically optimal solution BUT too slow for searching except on specialist parallel processing computers
Various fast methods developed based on finding short local matches and then building up alignment
Methods good but not guaranteed to find mathematically optimal solution
FASTA – popular method developed in 1985 but no longer widely used
BLAST - Basic Local Alignment Search Tool
This family of programs is the major sequence search tool in protein and DNA bioinformatics
BLAST
A highly sophisticated approach developed by Altschul in 1990
Very fast local search program (50 x speed of Smith-Waterman)
First finds short segments or seeds (known as words) in query that have matches in database using BLOSUM62 score.
Then extends suitable seeds to form HSPs (high-scoring segment pairs) using ungapped and gapped alignments
Significance of HSP match of given length evaluated by precise statistics
BLAST also used for DNA / DNA and Protein / 6 frame DNA translation
(PSI-BLAST also developed that uses multiple sequences – see later)
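The seeding step can be illustrated with a deliberately simplified sketch. It only finds exact matches of short words, whereas protein BLAST also accepts similar words that score above a threshold with the substitution matrix. The function name and the word size of 3 are assumptions for illustration.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, word_size: int = 3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the subject once and look each word up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds
```

Each seed would then be extended in both directions to see whether it grows into a high-scoring alignment. The speed comes from never aligning regions that contain no seed.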
What is P(S)?
P(S) is the probability of achieving a score S or a better score by chance (i.e. P is a cumulative probability).
N.B. P is a probability 0<P≤1
What is the E-value?
A related measure is the expected number of errors in a database scan (E-value)
E-value is the expected number of matches that are errors if you searched and took all matches up to (and including) S
E-value = Estimated number of false positives found using S as the cut off
What is E(S)?
E(S) is the expected number of chance occurrences of scores equal to or better than S
P-values and E-values
Most search programs return one or both of these values
Values do consider the size of the database searched and the score of the match
Should also consider the length of the match as short matches are easier to find (BLAST does this)
For matches < 20 residues, one must be very cautious in suggesting true homology. Also one CANNOT infer that short matches will have similar 3D structure.
Confident if P or E < 10^-3, but these are estimated values and may well be wrong. You need experience of the current version of the program to identify the best cut-off values.
Note P is a probability and P ≤ 1
E can be greater than 1
For low values (< 10^-3) P and E are virtually the same
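Why P and E agree at low values can be seen from the standard relationship P = 1 - exp(-E), which assumes the number of chance hits follows a Poisson distribution. That assumption and the function names below are not from the card, so treat this as a sketch.

```python
import math


def p_from_e(e_value: float) -> float:
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """Inverse relationship, E = -ln(1 - P); only defined for P < 1."""
    return -math.log1p(-p_value)
```

For E = 0.001 this gives P of roughly 0.0009995, almost identical, while for E = 10 it gives P just under 1, showing that E can exceed 1 but P cannot.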
Significance of match
A variety of approaches are used to estimate P- and E-values but the implementation often changes faster than the actual algorithm so read the manual.
Take randomised sequences and obtain the distribution of scores
* But actual sequences are not random
Use observed distribution of scores from one query against database and generate distribution of random scores (e.g. extreme value distribution)
Use theoretical model for distribution of scores
How to read BLAST output?
Understanding alignment
Key functional residues
- Strong conservation of particular residues can be due to a required structural role
- Another reason for observing strong conservation of particular residues is if they play a key functional role
- For instance, in our serine protease example there is a triad of SER, ASP and HIS residues in the active site responsible for catalysis
- These are conserved to preserve function
Algorithms for multiple alignment
Multiple alignment is a much more difficult problem than pairwise alignment due to the time required
Solve by heuristic methods
Heuristic is an “educated guess”
Is not guaranteed to get the best solution
But usually finds a reasonable solution in a reasonable time
A widely-used method is the CLUSTAL family of programs - Early version now explained
CLUSTAL Step 1 – Build guide tree
Perform all pairwise sequence alignments to obtain scores of each sequence against the others
Construct a tree where close sequences (e.g. A and B) are neighbouring branches
This tree is the guide tree for the order of pairwise alignments
CLUSTAL Step 2 – Progressive alignment
Align the sequences progressively
Start by aligning the most closely related pair or pairs. Then add in the next most closely related sequences by aligning them with the existing alignment.
Other multiple alignment programs
Clustal omega – latest version
T-Coffee
More advanced but slower, suitable for smaller alignments
Can align one multiple alignment with another
Can use one or more structures to guide (and improve) alignment
MUSCLE
Very fast algorithm, particularly good for proteins
Initially estimates sequence similarity using short sequence words of n residues
PROSITE patterns
General meaning
[AGT] means A, G or T
{AF} means anything but A or F
x(n) means a run of n amino acids of any type
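The three rules above map directly onto regular expressions. The converter below is a sketch that also assumes the usual PROSITE conventions that elements are separated by "-" and that a range can be written x(n,m). The function name is made up, and anchors such as "<" and ">" are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # anything but these residues
        # [AGT] and single letters are already valid regex fragments.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)
```

For example, the P-loop-style pattern `[AG]-x(4)-G-K-[ST]` becomes `[AG].{4}GK[ST]`, which can be passed to `re.search` to scan a sequence.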
PROSITE database
Created manually from multiple alignments and expert knowledge (Amos Bairoch)
Extensively annotated and linked with SWISSPROT database
Known false positives and negatives listed
PROSITE problems
Very useful, but…
Does not describe whole sequences only small sections
Can lead to several false negatives and false positives
Strict rules of matching are inflexible
Cannot describe statistical properties of a family
Describing whole domain sequences: Two main ways
Sequence profiles (now used in PROSITE as an alternative to patterns)
Hidden Markov Models (HMMs)
Problems with frequency scores
Fails to consider a new sequence may differ from the PSSM derived from existing set
Fails to include our evolutionary knowledge that certain residue changes often occur (conservative changes such as Leu to Ile)
PSSM = (substitution matrix) ** (observed frequencies) where ** indicates some method of combining scores
Hidden Markov Models - HMM
- Scoring matrices such as BLOSUM, are used to find sequence homology at the amino acid level regardless of location
- Profiles (PSSM) extend this by including the position within the protein structure when scoring an alignment (e.g. PSIBLAST)
- Hidden Markov Models (HMMs) take this one stage further
- They include similarity and position but also gaps (insertions and deletions)
- They also take into account what comes one residue before the position in addition to that position
- They are looking at more of the overall pattern
Producing an HMM
HMM
An HMM for a protein family is built by aligning known sequences in an MSA (multiple sequence alignment)
The HMM is then built by traversing the alignment and calculating the probability for each possible transition between alignment positions
Sequence comparisons are generated from the HMM by starting at the beginning then traversing the appropriate path for the sequence being searched
Each possible transition has a probability, and the overall quality of the match of an unknown sequence to the HMM is calculated by multiplying together the probabilities along its path
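Multiplying probabilities along a path can be sketched as below. The state names (M = match, I = insert) and all the numbers are hypothetical, chosen only to show the arithmetic. A real profile HMM also multiplies in a residue emission probability at each state, and logs are summed instead of multiplying raw probabilities so that long sequences do not underflow.

```python
import math

# Hypothetical transition probabilities between HMM states; made-up values.
TOY_TRANSITIONS = {
    ("M1", "M2"): 0.90,
    ("M2", "M3"): 0.90,
    ("M2", "I2"): 0.05,
    ("I2", "M3"): 0.60,
}


def path_log_probability(path, transitions) -> float:
    """Sum of log transition probabilities along a path of states."""
    return sum(math.log(transitions[(a, b)]) for a, b in zip(path, path[1:]))


def path_probability(path, transitions) -> float:
    """Product of the transition probabilities along the path."""
    return math.exp(path_log_probability(path, transitions))
```

In this toy model the path M1, M2, M3 has probability 0.9 × 0.9 = 0.81, while the path with an insertion (M1, M2, I2, M3) has 0.9 × 0.05 × 0.6 = 0.027, so the insertion is possible but penalised at that position.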
Advantages of HMM
Full probabilistic model
leads to rigorous interpretation
Takes into account the residue before the position you are scoring
Insertion and deletion have probability that is position specific
expect some parts of the sequence to be more susceptible to change
Pfam: Domain analysis
A sequence can be matched against a set of HMMs and the highest scoring is the most likely family
PFAM (Sanger Centre) is such a database of protein domain family HMMs
All members of a PFAM family are homologues
SCOP superfamily can be subdivided into several PFAM families if each PFAM family has distinct function
Key levels of output from InterPro Q99895
Family - a group of proteins that share a common evolutionary origin reflected by their related functions, sequence homology or similarities in their structure.
Domain - a distinct functional, structural or sequence unit often found associated with other types of domains.
Homologous Superfamily - a group of proteins that share a common evolutionary origin, reflected by similarity in their structure, even if sequence similarity is low. This entry type contains signatures from the CATH-Gene3D and SUPERFAMILY member databases exclusively
Site - a short sequence containing one or more conserved residues, including: active sites, binding sites, conserved sites and sites of post-translational modification.
Repeat - A short sequence (usually <50 amino acids) typically repeated many times within a protein.
Unintegrated - member database signatures that might not yet be curated in InterPro but may still provide useful information.
General sequence analysis
Not all known sequences are in libraries in InterPro
PSIBLAST is a general method to take a sequence, generate a multiple alignment and thereby find remote homologues.
PSI-BLAST – Position Specific Iterated BLAST
Iterative database searches
Start with an ordinary BLAST search
Take the significant hits and form a sequence profile
Next iteration is to search with this profile
Add further significant hits to profile and repeat until no more significant hits can be found
PSI-BLAST
Types of sequences
Low complexity regions
Sections with repeats of a few residues, often a coil (GEPGEPGEP)
Must be masked out as they lead to spurious matches
Program SEG is included in BLAST
Coiled-coils - local extended helices that intertwine
Tend to have periodic hydrophobic residues and, if suspected, use specific programs to identify
Transmembrane regions - parts of protein buried within membrane formed primarily from hydrophobic residues
Need to be found first (see later lecture)
Signal peptide - a short region (c. 15-25 residues) near the start of the protein chain (within the first 40 residues) that directs the correct location of a protein in the cell
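The idea of masking low complexity regions can be sketched with a sliding window and Shannon entropy. This is a simplified stand-in, not the SEG algorithm itself, and the window length of 9 and threshold of 1.8 bits are assumed values for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Entropy in bits of the residue composition of a segment."""
    total = len(segment)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(segment).values())


def mask_low_complexity(sequence: str, window: int = 9, threshold: float = 1.8) -> str:
    """Replace residues in low-entropy windows with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)
```

The repeat GEPGEPGEP uses only three residue types (entropy about 1.58 bits), so it is masked, while a window of nine different residues (about 3.17 bits) is left untouched.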
Using PSI-BLAST
Mask out low complexity, coiled-coil, transmembrane and signal sequences.
PSI-BLAST can drift
Feature - Sequences confidently found in an early run disappear later on
Reason - A rogue match is found that brings in its homologues, which pollute the scoring matrix
Cure – Watch the output and either stop earlier or be more stringent with E-value to include sequences in next steps.
PSI-BLAST not symmetric
Searching with sequence A finds B, but searching with B does not find A.
Accuracy of Alignment
Pairwise alignment
Below about 60% identity, some regions will probably be incorrectly aligned
Multiple alignment
PSIBLAST and CLUSTALW give more accurate results, but below 40% pairwise identity one should still expect some errors
Recognition of homology
Use Pfam superfamilies to identify gold standard of homologues and non-homologues
Take one sequence as query and evaluate at a chosen E-value:
how many true homologues (true positives) it finds
how many errors (false positives) it finds
how many true homologues it misses (false negatives)
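The three counts above can be computed from two sets of identifiers. This is a sketch only: the function names and the example identifiers in the usage note are hypothetical.

```python
def evaluate_hits(hits, true_homologues) -> dict:
    """Count true positives, false positives and false negatives."""
    found = set(hits)
    truth = set(true_homologues)
    return {
        "true_positives": len(found & truth),    # homologues that were found
        "false_positives": len(found - truth),   # hits that are not homologues
        "false_negatives": len(truth - found),   # homologues that were missed
    }


def sensitivity(counts: dict) -> float:
    """Fraction of true homologues that the search recovered."""
    total = counts["true_positives"] + counts["false_negatives"]
    return counts["true_positives"] / total if total else 0.0
```

For example, if a search returns P1, P2 and X9 and the gold standard is P1 to P4, there are 2 true positives, 1 false positive and 2 false negatives, a sensitivity of 0.5. Repeating this at several E-value cut-offs shows the trade-off between missed homologues and errors.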