Sequence analysis Flashcards
What are the primary DNA sequence databases?
- GenBank in the USA, at the National Center for Biotechnology Information (NCBI), Bethesda, Maryland
- ENA (European Nucleotide Archive), often called the EMBL database (EMBL = European Molecular Biology Laboratory), run from EMBL-EBI (European Bioinformatics Institute), Hinxton, Cambridge
* Includes annotation
- DDBJ (DNA Data Bank of Japan)
Sequences of DNA from around the world
- Initial DNA deposition translated into protein sequences
* GenBank to GenPept
* EMBL to TrEMBL
- In parallel, Swiss-Prot (Amos Bairoch) is a high-quality source of annotation for some sequences
What is UniProtKB?
UniProtKB = UniProt Knowledgebase. European-based (230M sequences)
Species distribution in TrEMBL
What are possible problems and errors in databases?
- organisation of databases changes rapidly
- names very variable
- errors very slow to correct
- sometimes errors are never corrected, because the database organisation will not change a submission without action by the submitter
What is metagenomics?
The study of genetic material recovered directly from environmental and clinical settings through sequencing
Work pioneered by Craig Venter to obtain sequences in batch from microorganisms in exotic locations such as the middle of the ocean or the human gut. Many sequences are of poor quality, but they give insight into biodiversity.
What are orthologues and paralogues?
What is the result of gene duplication?
Gene duplication: a gene is duplicated within a genome. The two proteins are paralogues.
What can happen when you get gene duplication?
Can result in change of function – only 1 copy required to provide original protein, so second gene/protein can evolve a new function.
What is the result of speciation?
Speciation: a new species is created. As a result the two species have a single copy of the same gene – the two proteins are orthologues.
What happens to the function of proteins during speciation?
Both species only have a single copy so their function is less likely to change.
What are the requirements of a pairwise protein sequence alignment?
Scoring scheme of similarity of amino acid residues
Algorithm to establish the alignment
Aim that the combined use of the algorithm with the scoring scheme generates the best alignment in terms of the biology
Potential to be extended to database searching
Scoring scheme – identity
Simplest is to score 1 for identical amino acids, 0 for different ones
Similarly identical bases can be scored
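The identity scheme can be sketched in a few lines of Python. This is only an illustration: the function names `identity_score` and `score_ungapped` are hypothetical, and the sketch assumes the two sequences are already aligned, have equal length and contain no gaps.

```python
def identity_score(a: str, b: str) -> int:
    """Identity scheme: 1 for identical residues (or bases), 0 otherwise."""
    return 1 if a == b else 0


def score_ungapped(seq1: str, seq2: str) -> int:
    """Sum identity scores position by position.

    Assumes seq1 and seq2 are already aligned, equal in length
    and contain no gap characters.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(identity_score(a, b) for a, b in zip(seq1, seq2))


# Works the same way for amino acids and for bases.
protein_score = score_ungapped("HEAGAWGHEE", "HEAGAWGHEE")
dna_score = score_ungapped("ACGT", "ACGA")
```

The same function scores proteins and DNA because the identity scheme ignores the chemistry of the residues, which is exactly its weakness for proteins.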
For proteins, evolution imposes constraints on types of amino acid changes that generally occur to modify, but not destroy protein function
Residues tend to keep their chemical property, e.g. the tendency to be buried (i.e. non-polar or hydrophobic character)
Maintenance of chemical property is called a conservative substitution
Scoring scheme – Dayhoff (PAM)
Based on counting number of times residue types changed in aligned sequences of closely homologous sequences
Extended to detect more distant relationships by assuming matrix can be multiplied by itself.
PAM 250 developed to model sequences with 20% identity.
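The idea of multiplying the matrix by itself can be illustrated with a toy example. The sketch below is not the Dayhoff data: `TOY_PAM1` is an invented two-residue transition matrix (99% chance of staying the same per step), and `mat_mul` and `mat_power` are hypothetical helper names. It only shows that repeated multiplication models longer evolutionary distances, with the probability of seeing the original residue falling as the number of steps grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, steps):
    """Multiply matrix m by itself `steps` times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Invented toy values for an alphabet of two residue types (NOT real PAM data):
# each row gives the probability of staying the same or changing in one step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
```

In the toy result the diagonal entries are much smaller than 0.99, so the two residue types are far harder to tell apart after 250 steps than after one.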
What is PAM?
Point accepted mutation
The PAM 250 matrix
Scoring scheme – BLOSUM62
Derived by Henikoff & Henikoff in early 90s
Based on aligned segments of protein families called BLOCKS, hence BLOcks SUbstitution Matrix.
BLOSUM62 is built after clustering sequences in BLOCKS where pairwise identity > 62%, so that each cluster is weighted as a single sequence
Currently the most widely used matrix, and included in the BLAST / PSI-BLAST family of database searching algorithms
BLOSUM62
How do we score gaps?
Penalise gaps (insertions/deletions, collectively known as indels)
Penalty = o + el
o = gap opening constant
e = gap extension constant
l = length of gap extension (number of residues in gap - 1)
o > e, as the evolutionary event is making the gap, and we often see long gaps
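The penalty formula above translates directly into code. In this minimal sketch the function name `gap_penalty` and the default values `opening=10.0` and `extension=1.0` are assumptions chosen for illustration, not values taken from these notes.

```python
def gap_penalty(n_residues: int, opening: float = 10.0, extension: float = 1.0) -> float:
    """Gap penalty = o + e * l.

    n_residues: number of residues in the gap (must be at least 1)
    opening:    o, the gap opening constant (illustrative default)
    extension:  e, the gap extension constant (illustrative default)
    """
    if n_residues < 1:
        raise ValueError("a gap must contain at least one residue")
    l = n_residues - 1  # length of the gap extension
    return opening + extension * l


single = gap_penalty(1)   # opening cost only
longer = gap_penalty(4)   # opening cost plus three extensions
```

With o larger than e, one gap of four residues costs less than four separate one-residue gaps, which reflects the point that creating the gap is the rare event.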
Alignment of protein domains
Often a protein sequence is formed from parts known as domains, where each domain belongs to a different homologous family
Domains are the evolutionary unit
Local vs global alignment
Needleman-Wunsch Algorithm
General algorithm for sequence comparison
Maximise a similarity score, to give ‘maximum match’
Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible insertions/deletions (indels).
N-W involves an iterative matrix method of calculation
All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array
All possible alignments (comparisons) are represented by pathways through this array
Does the NW algorithm give you a global or local alignment?
Finds the best GLOBAL alignment of any two sequences
What are the steps in the Needleman-Wunsch algorithm?
1. Assign similarity values
2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
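The three steps can be sketched as a small dynamic-programming routine. This is a simplified illustration rather than the original published procedure. It makes three assumptions:

- Scoring uses the identity scheme (`match=1`, `mismatch=0`).
- Gaps use a simple linear penalty (`gap=-1` per gapped position) instead of the o + el scheme.
- The traceback starts from the final (bottom-right) cell, which in this formulation holds the best score for aligning both sequences over their full length.

The function name `needleman_wunsch` and all parameter defaults are chosen for the example.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Global alignment sketch; returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)

    def sim(a, b):
        # Step 1: similarity value for a pair of residues (identity scheme).
        return match if a == b else mismatch

    # score[i][j] = best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: each cell takes the maximum over the pathways leading into it.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1]),  # pair residues
                score[i - 1][j] + gap,                                # gap in seq2
                score[i][j - 1] + gap,                                # gap in seq1
            )

    # Step 3: trace a pathway back from the final cell to build the alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1])):
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


best_score, top, bottom = needleman_wunsch("ACGT", "AGT")
```

Swapping `sim` for a lookup in a substitution matrix such as BLOSUM62 would give a protein-aware version without changing the rest of the routine.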