Sequence analysis Flashcards

1
Q

What are the primary DNA sequence databases?

A

- GenBank, in the USA at the National Center for Biotechnology Information (NCBI), Bethesda, Maryland
- ENA, the European Nucleotide Archive, often called the EMBL database (EMBL = European Molecular Biology Laboratory); run from EMBL-EBI, the European Bioinformatics Institute (Hinxton, Cambridge)
* Includes annotation
- DDBJ, the DNA Data Bank of Japan
- Between them, these hold DNA sequences from worldwide
- Initial DNA depositions are translated into protein sequences
* GenBank to GenPept
* EMBL to TrEMBL
- In parallel, Swiss-Prot (Amos Bairoch) is a high-quality source of annotation for some sequences

2
Q

What is UniProtKB?

A

UniProtKB = UniProt KnowledgeBase; European-based (230M sequences)

3
Q

Species distribution in TrEMBL

A
4
Q

What are possible problems and errors in databases?

A

- organisation of databases changes rapidly
- names are very variable
- errors are very slow to correct
- sometimes errors will not be corrected, as the organisation will not change a submission without action by the submitter

5
Q

What is metagenomics?

A

The study of genetic material recovered directly from environmental and clinical samples through sequencing.

Work pioneered by Craig Venter to obtain sequences in batch from microorganisms in exotic locations such as the middle of the ocean or the human gut. Many sequences are of poor quality, but they give insight into biodiversity.

6
Q

What are orthologues and paralogues?

A
7
Q

What is the result of gene duplication?

A

Gene duplication: a gene is duplicated within a genome; the two resulting proteins are paralogues.

8
Q

What can happen when you get gene duplication?

A

Can result in change of function – only 1 copy required to provide original protein, so second gene/protein can evolve a new function.

9
Q

What is the result of speciation?

A

Speciation: a new species is created. As a result the two species have a single copy of the same gene – the two proteins are orthologues.

10
Q

What happens to the function of proteins during speciation?

A

Both species only have a single copy so their function is less likely to change.

11
Q

What are the requirements of a pairwise protein sequence alignment?

A

- A scoring scheme for the similarity of amino acid residues
- An algorithm to establish the alignment
- The aim is that the combined use of the algorithm and the scoring scheme generates the best alignment in terms of the biology
- Potential to be extended to database searching

12
Q

Scoring scheme – identity

A

- Simplest is to score 1 for identical amino acids and 0 for different ones
- Identical bases can be scored in the same way
- For proteins, evolution imposes constraints on the types of amino acid change that generally occur, so that protein function is modified but not destroyed
- Residues tend to keep their chemical properties, e.g. the tendency to be buried (i.e. non-polar or hydrophobic character)
- Maintenance of chemical properties is called conservative substitution

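The identity scheme for card 12 can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function name `identity_score` and the example peptides are hypothetical.

```python
def identity_score(seq_a: str, seq_b: str) -> int:
    """Score 1 for each identical aligned pair of residues, 0 otherwise."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# Hypothetical aligned peptides: Leu vs Ile is a conservative substitution,
# but the identity scheme scores it 0, just like any other mismatch.
score = identity_score("MKVLA", "MKVIA")
print(score, "identical positions out of 5")
```

The Leu/Ile position shows why identity scoring is crude: it cannot tell a conservative change from a drastic one, which is what substitution matrices such as PAM and BLOSUM address.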
13
Q

Scoring scheme – Dayhoff (PAM)

A

- Based on counting the number of times residue types changed in alignments of closely homologous sequences
- Extended to detect more distant relationships by assuming the matrix can be multiplied by itself
- PAM 250 was developed to model sequences with about 20% identity

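The "multiply the matrix by itself" step for card 13 can be illustrated with a toy example. This is only a sketch: `TOY_PAM1` is a made-up 3-letter mutation probability matrix (the real PAM1 is a 20 x 20 matrix derived from Dayhoff's counts), and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times (m to the power n)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Made-up mutation probabilities for a 3-letter alphabet (each row sums to 1).
# Entry [i][j] is the chance that residue i is replaced by residue j in one step.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal (residue unchanged) is much smaller and the off-diagonal entries much larger, which is how a matrix built from close homologues is extrapolated to distant ones.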
14
Q

What is PAM?

A

Point accepted mutation

15
Q

The PAM 250 matrix

A
16
Q

Scoring scheme – BLOSUM62

A

- Derived by Henikoff & Henikoff in the early 1990s
- Based on aligned segments of protein families called BLOCKS, hence BLOcks SUbstitution Matrix
- BLOSUM62 is built from BLOCKS in which sequences with more than 62% pairwise identity are clustered together, so that closely related sequences do not dominate the counts
- Currently a widely used matrix, included in the BLAST / PSI-BLAST family of database-searching algorithms

17
Q

BLOSUM62

A
18
Q

How do we score gaps?

A

Penalise gaps (insertions/deletions, collectively known as indels)
Penalty = o + el
o = gap opening constant
e = gap extension constant
l = length of the gap extension (number of residues in the gap - 1)
o > e, because the evolutionary event is the creation of the gap, and long gaps are often seen
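The penalty formula can be written as a small function. This is a sketch only: the default values for `opening` and `extension` are illustrative assumptions, not values taken from any particular program.

```python
def gap_penalty(gap_residues: int, opening: float = 10.0, extension: float = 0.5) -> float:
    """Affine gap penalty o + e*l, where l = residues in the gap minus 1."""
    if gap_residues < 1:
        return 0.0  # no gap, no penalty
    return opening + extension * (gap_residues - 1)


for n in (1, 2, 10):
    print(n, gap_penalty(n))
```

With o > e, one gap of 10 residues costs far less than ten separate single-residue gaps, which matches the idea that opening the gap is the rare event.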

19
Q

Alignment of protein domains

A

- Often a protein sequence is formed from parts known as domains, where each domain belongs to a different homologous family
- Domains are the evolutionary unit

20
Q

Local vs global alignment

A
21
Q

Needleman-Wunsch Algorithm

A

- General algorithm for sequence comparison
- Maximises a similarity score, to give the ‘maximum match’
- Maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible insertions/deletions (indels)
- N-W involves an iterative matrix method of calculation
- All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array
- All possible alignments (comparisons) are represented by pathways through this array

22
Q

Does NW algorithm give you global or local alignment?

A

Finds the best GLOBAL alignment of any two sequences

23
Q

What are the steps in the Needleman-Wunsch algorithm?

A

1. Assign similarity values
2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
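The three steps can be sketched as follows. This is a minimal illustration, not the original 1970 formulation: it fills the array forwards, uses a simple match/mismatch score and a linear gap penalty (all parameter names and defaults are assumptions), and penalises end gaps, so the traceback starts from the final corner cell.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = 0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] - gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] - gap

    # Steps 1 and 2: each cell takes the best score of any pathway reaching it.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch  # similarity value
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align the two residues
                score[i - 1][j] - gap,    # gap in b
                score[i][j - 1] - gap,    # gap in a
            )

    # Step 3: work backwards from the final cell to recover one best pathway.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("GATTACA", "GATACA", match=1, mismatch=-1, gap=2))
```

With the defaults (match 1, mismatch 0, gap 0) the score is simply the number of matched residues, i.e. the "maximum match".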

24
Q

Describe the similarity values in the Needleman-Wunsch algorithm

A
25
Q

Describe constructing the alignment in the Needleman-Wunsch algorithm

A

The alignment score is cumulative, obtained by adding along a path through the array
The best alignment has the highest score, i.e. the maximum match
Maximum match = largest number obtained by summing the cell values along any pathway
The maximum match will ALWAYS be somewhere in the outer row or column shown
The alignment is constructed by working backwards from the maximum match

26
Q

Needleman-Wunsch Algorithm (Gaps)

A

A gap penalty can be introduced
The score of the next step is the best of:
- Just continue the alignment
- Add a gap in the vertical sequence
- Add a gap in the horizontal sequence

27
Q

Smith-Waterman Algorithm

A

Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure.
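The local-alignment idea can be sketched as below. This is a minimal illustration with an assumed match/mismatch score and linear gap penalty (the parameter names and defaults are hypothetical); the key difference from a global alignment is that a cell score is never allowed to fall below zero, and the traceback starts at the best-scoring cell anywhere in the array.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 2):
    """Local alignment; returns (best_score, aligned_segment_a, aligned_segment_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a fresh local alignment here
                score[i - 1][j - 1] + s,  # align the two residues
                score[i - 1][j] - gap,    # gap in b
                score[i][j - 1] - gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("TTTTGATTACATTTT", "CCGATTACACC")
print(result)
```

Only the shared segment is reported; the unrelated flanking residues are ignored rather than forced into the alignment.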

28
Q

Fast Pairwise Search Algorithms

A

- Single query aligned independently to any (similar) database entry
- Must perform a local search
- Smith-Waterman is guaranteed to find the mathematically optimal solution BUT is too slow for searching except on specialist parallel processing computers
- Various fast methods have been developed, based on finding short local matches and then building up the alignment
- These methods are good but not guaranteed to find the mathematically optimal solution
- FASTA: a popular method developed in 1985 but no longer widely used
- BLAST: Basic Local Alignment Search Tool
- This family of programs is the major sequence search tool in protein and DNA bioinformatics

29
Q

BLAST

A

- A highly sophisticated approach developed by Altschul and colleagues in 1990
- Very fast local search program (50 x the speed of Smith-Waterman)
- First finds short segments or seeds (known as words) in the query that have matches in the database, using the BLOSUM62 score
- Then extends suitable seeds to form HSPs (high-scoring segment pairs) using ungapped and gapped alignments
- The significance of an HSP match of given length is evaluated by precise statistics
- BLAST is also used for DNA / DNA and protein / 6-frame DNA translation searches
- (PSI-BLAST was also developed, which uses multiple sequences; see later)

30
Q

What is P(S)?

A

P(S) is the probability of achieving a score S or a better score by chance (i.e. P is a cumulative probability).
N.B. P is a probability, so 0 < P ≤ 1

31
Q

What is the E-value?

A

A related measure is also used: the expected number of errors in a database scan (E-value).
The E-value is the expected number of matches that are errors if you searched and took all matches up to and including S.
E-value = estimated number of false positives found using S as the cut-off

32
Q

What is E(S)?

A

E(S) is the expected number of chance occurrences of scores equal to or better than S
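The link between E(S) and P(S) can be illustrated numerically. This sketch assumes that the number of chance hits scoring S or better follows a Poisson distribution with mean E, which gives P = 1 - exp(-E); the function name `p_from_e` is hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given an expected count of E."""
    # 1 - exp(-E), written with expm1 so that very small E stays accurate.
    return -math.expm1(-e_value)


for e in (1e-3, 0.1, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_from_e(e):.6g}")
```

For small values P and E are almost identical, while for large E the probability saturates towards 1 even though E keeps growing.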

33
Q

P-values and E-values

A

- Most search programs return one or both of these values
- The values do consider the size of the database searched and the score of the match
- They should also consider the length of the match, as short matches are easier to find (BLAST does this)
- For matches < 20 residues, be very cautious in suggesting true homology. Also, one CANNOT infer that short matches will have similar 3D structure.
- One can be confident if P or E < 10^-3, but these are estimated values and may well be wrong. You need experience of the current version of a program to identify the best cut-off values.
- Note P is a probability, so P <= 1
- E can be greater than 1
- For low values (< 10^-3) P and E are virtually the same

34
Q

Significance of match

A

A variety of approaches are used to estimate P- and E-values, but the implementation often changes faster than the actual algorithm, so read the manual.
- Take randomised sequences and obtain the distribution of scores
* But actual sequences are not random
- Use the observed distribution of scores from one query against the database to fit a distribution of random scores (e.g. extreme value distribution)
- Use a theoretical model for the distribution of scores

35
Q

How to read BLAST output?

A
36
Q

Understanding alignment

A
37
Q

Key functional residues

A
  • Strong conservation can be due to a required structural role
  • Another reason for observing strong conservation of particular residues is if they play a key functional role
    - For instance, in our serine protease example there is a triad of SER, ASP and HIS residues in the active site responsible for catalysis
  • These are conserved to preserve function
38
Q

Algorithms for multiple alignment

A

- Multiple alignment is a much more difficult problem than pairwise alignment because of the time required
- Solved by heuristic methods
* A heuristic is an “educated guess”
* It is not guaranteed to get the best solution
* But it usually finds a reasonable solution in a reasonable time
- A widely used method is the CLUSTAL family of programs; an early version is explained in the following cards

39
Q

CLUSTAL Step 1 – Build guide tree

A

Perform all pairwise sequence alignments to obtain scores of each sequence against the others
Construct a tree in which closely related sequences (e.g. A and B) are neighbouring branches
This tree is the guide tree for the order of pairwise alignments
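How a guide tree fixes the order of alignment can be sketched as below. This is an illustration under stated assumptions: the pairwise distances are made-up numbers standing in for the pairwise alignment scores, and a simple average-linkage clustering stands in for the tree-building step (the function name `guide_tree_order` is hypothetical; this is not CLUSTAL's actual code).

```python
def guide_tree_order(names, dist):
    """Greedy average-linkage clustering; returns the merges in the order made."""
    clusters = {name: [name] for name in names}  # cluster label -> member sequences
    merges = []
    while len(clusters) > 1:
        labels = sorted(clusters)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Average distance between all members of the two clusters.
                total = sum(dist[frozenset((x, y))]
                            for x in clusters[a] for y in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((a, b))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Made-up pairwise distances (e.g. 1 - fractional identity) for four sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.10,
    frozenset(("A", "C")): 0.40,
    frozenset(("A", "D")): 0.50,
    frozenset(("B", "C")): 0.45,
    frozenset(("B", "D")): 0.55,
    frozenset(("C", "D")): 0.20,
}
merges = guide_tree_order(names, dist)
print(merges)
```

The list of merges is the order in which the progressive alignment would proceed: the closest pair first, then the next closest, and finally the two sub-alignments together.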

40
Q

CLUSTAL Step 2 – Progressive alignment

A

Align the sequences progressively:
- Start by aligning the most closely related pair or pairs
- Add in the next most closely related sequences by aligning them with the existing alignment

41
Q

Other multiple alignment programs

A

- Clustal Omega: latest version
- T-Coffee
* More advanced but slower, suitable for smaller alignments
* Can align one multiple alignment with another
* Can use one or more structures to guide (and improve) the alignment
- MUSCLE
* Very fast algorithm, particularly good for proteins
* Initially estimates sequence similarity using short sequence words of n residues

42
Q

PROSITE patterns

A

General meaning
- [AGT] means A, G or T
- {AF} means anything but A or F
- x(n) means a run of n amino acids of any type
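These three rules map almost directly onto regular expressions. The sketch below is a minimal translator covering only the elements listed above, plus `-` as the separator between positions; the function name `prosite_to_regex` and the example pattern are hypothetical, and other PROSITE syntax such as the `<` and `>` anchors is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {AF} -> [^AF]
        element = element.replace("x", ".")                     # x    -> any residue
        element = element.replace("(", "{").replace(")", "}")   # (n)  -> {n}
        parts.append(element)
    return "".join(parts)


# Made-up pattern: A/G/T, then anything but A or F, then any 2 residues, then C.
regex = prosite_to_regex("[AGT]-{AF}-x(2)-C")
print(regex)
print(re.search(regex, "MMGKLLCMM"))
```

Because the match is all-or-nothing, a single unexpected residue is enough to miss a true family member, which is the inflexibility that profiles and HMMs were designed to avoid.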

43
Q

PROSITE database

A

- Created manually from multiple alignments and expert knowledge (Amos Bairoch)
- Extensively annotated and linked with the Swiss-Prot database
- Known false positives and negatives are listed

44
Q

PROSITE problems

A

Very useful, but…
- Does not describe whole sequences, only small sections
- Can lead to several false negatives and false positives
- Strict rules of matching are inflexible
- Cannot describe the statistical properties of a family

45
Q

Describing whole domain sequences: Two main ways

A

- Sequence profiles (now used in PROSITE as an alternative to patterns)
- Hidden Markov Models (HMMs)

46
Q

Problems with frequency scores

A

- Fails to consider that a new sequence may differ from the PSSM derived from the existing set
- Fails to include our evolutionary knowledge that certain residue changes often occur (conservative changes such as Leu to Ile)
- PSSM = (substitution matrix) ** (observed frequencies), where ** indicates some method of combining scores

47
Q

Hidden Markov Models - HMM

A
  • Scoring matrices such as BLOSUM are used to find sequence homology at the amino acid level regardless of location
  • Profiles (PSSMs) extend this by including the position within the protein when scoring an alignment (e.g. PSI-BLAST)
  • Hidden Markov Models (HMMs) take this one stage further
  • They include similarity and position but also gaps (insertions and deletions)
  • They also take into account what comes one residue before the position in addition to that position
  • They are looking at more of the overall pattern
48
Q

Producing an HMM

A
49
Q

HMM

A

An HMM for a protein family is built by aligning known sequences in a multiple sequence alignment (MSA).
The HMM is then built by traversing the alignment and calculating the probability of each possible transition between alignment positions.
Sequence comparisons are generated from the HMM by starting at the beginning and then traversing the appropriate path for the sequence being searched.
Each transition possibility has a probability score, and the overall quality of the match of an unknown sequence to the HMM is calculated by multiplying together the scores.
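The "multiply together the scores" step can be sketched with a toy model. Everything here is hypothetical: the three-position model, its emission and transition probabilities, and the function name `path_log_probability`. Only the gap-free path through the match states is scored, and logarithms are summed instead of multiplying raw probabilities, a common way to avoid numerical underflow.

```python
import math

# Made-up model with three match states.
# emissions[k][residue] = probability of seeing that residue at position k.
emissions = [
    {"A": 0.7, "G": 0.2, "S": 0.1},
    {"C": 0.9, "S": 0.1},
    {"D": 0.5, "E": 0.5},
]
# transitions[k] = probability of moving from match state k to match state k + 1.
transitions = [0.9, 0.8]


def path_log_probability(sequence: str) -> float:
    """Log-probability of the sequence along the gap-free (all-match) path."""
    if len(sequence) != len(emissions):
        raise ValueError("this sketch only scores the gap-free path")
    log_p = 0.0
    for k, residue in enumerate(sequence):
        p = emissions[k].get(residue, 0.0)
        if p == 0.0:
            return float("-inf")  # residue never observed at this position
        log_p += math.log(p)
        if k < len(transitions):
            log_p += math.log(transitions[k])
    return log_p


for seq in ("ACD", "GSE"):
    print(seq, round(math.exp(path_log_probability(seq)), 4))
```

A sequence made of the commonly observed residues ("ACD") scores far higher than one made of the rare ones ("GSE"); a full profile HMM would also include insert and delete states with their own position-specific probabilities.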

50
Q

Advantages of HMMs

A

- Full probabilistic model
* Leads to rigorous interpretation
- Takes into account the residue before the position you are scoring
- Insertions and deletions have probabilities that are position specific
* Some parts of the sequence are expected to be more susceptible to change

51
Q

Pfam: Domain analysis

A

- A sequence can be matched against a set of HMMs, and the highest-scoring one is the most likely family
- Pfam (Sanger Centre) is such a database of protein domain family HMMs
- Members of a Pfam family are all homologues
- A SCOP superfamily can be subdivided into several Pfam families if each Pfam family has a distinct function

52
Q

Key levels of output from InterPro (Q99895)

A

- Family: a group of proteins that share a common evolutionary origin reflected by their related functions, sequence homology or similarities in their structure.
- Domain: a distinct functional, structural or sequence unit often found associated with other types of domains.
- Homologous Superfamily: a group of proteins that share a common evolutionary origin, reflected by similarity in their structure, even if sequence similarity is low. This entry type contains signatures from the CATH-Gene3D and SUPERFAMILY member databases exclusively.
- Site: a short sequence containing one or more conserved residues, including active sites, binding sites, conserved sites and sites of post-translational modification.
- Repeat: a short sequence (usually < 50 amino acids) typically repeated many times within a protein.
- Unintegrated: member database signatures that might not yet be curated in InterPro but may still provide useful information.

53
Q

General sequence analysis

A

- Not all known sequences are covered by the libraries in InterPro
- PSI-BLAST is a general method that takes a sequence, generates a multiple alignment and thereby finds remote homologues

54
Q

PSI-BLAST – Iterative database searches

A

- Position-Specific Iterated BLAST
- Start with an ordinary BLAST search
- Take the significant hits and form a sequence profile
- The next iteration searches with this profile
- Add further significant hits to the profile and repeat until no more significant hits can be found

55
Q

PSI-BLAST

A
56
Q

Types of sequences

A

- Low complexity regions
* Sections with repeats of a few residues, often a coil (GEPGEPGEP)
* Must be masked out, as they lead to spurious matches
* The SEG program is included in BLAST
- Coiled-coils: local extended helices that intertwine
* Tend to have periodic hydrophobic residues; if suspected, use specific programs to identify them
- Transmembrane regions: parts of a protein buried within the membrane, formed primarily from hydrophobic residues
* Need to be found first (see later lecture)
- Signal peptide: a short region (c. 15 - 25 residues) near the start of the protein chain (< 40 residues) that directs the correct location of a protein in the cell

57
Q

Using PSI-BLAST

A

- Mask out low complexity, coiled-coil, transmembrane and signal sequences
- PSI-BLAST can drift
* Feature: sequences confidently found in an early run disappear later on
* Reason: a rogue match is found that brings in its homologues, which pollute the scoring matrix
* Cure: watch the output and either stop earlier or use a more stringent E-value for including sequences in the next steps
- PSI-BLAST is not symmetric
* Searching with sequence A finds B, but searching with B does not necessarily find A

58
Q

Accuracy of Alignment

A

- Pairwise alignment
* Below about 60% identity, some regions will probably be incorrectly aligned
- Multiple alignment
* PSI-BLAST and CLUSTALW give more accurate results, but below 40% pairwise identity one should still expect some errors

59
Q

Recognition of homology

A

Use Pfam superfamilies to identify a gold standard of homologues and non-homologues
- Take one sequence as the query and evaluate, at a chosen E-value:
* how many true homologues (true positives) it finds
* how many errors (false positives) it finds
* how many true homologues it misses (false negatives)
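The three counts can be computed with simple set operations. This is a sketch with made-up identifiers; the function name `evaluate_search` is hypothetical, and the sensitivity figure (fraction of true homologues found) is a standard derived measure added here for illustration rather than something stated on the card.

```python
def evaluate_search(hits: set, true_homologues: set) -> dict:
    """Compare the hits from one query against a gold-standard set of homologues."""
    tp = len(hits & true_homologues)   # true homologues that were found
    fp = len(hits - true_homologues)   # hits that are not homologues (errors)
    fn = len(true_homologues - hits)   # true homologues that were missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up example: hits returned at some chosen E-value cut-off.
hits = {"P1", "P2", "P3", "X9"}
gold_standard = {"P1", "P2", "P3", "P4", "P5"}
summary = evaluate_search(hits, gold_standard)
print(summary)
```

Repeating this at several E-value cut-offs shows the trade-off: a looser cut-off finds more true homologues but also lets in more false positives.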