Sequence analysis Flashcards

Question

Describe the similarity values in the Needleman-Wunsch algorithm

Answer 1

- The alignment score is cumulative: it is built up by adding cell values along a path through the array.
- The best alignment has the highest score, i.e. the maximum match.
- Maximum match = the largest number obtained by summing the cell values along any pathway.
- The maximum match will ALWAYS be somewhere in the outer row or column of the array.
- The alignment is constructed by working backwards from the maximum match.

Answer 2

A gap penalty can be introduced. The score of the next step is the best of:
- just continue the alignment
- add a gap in the vertical sequence
- add a gap in the horizontal sequence
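The "best of three" rule can be sketched in Python. This is a minimal illustration, not the original formulation: the match, mismatch and gap values are assumed, the gap penalty is linear, and end gaps are penalised, so the optimal score is read from the final cell instead of being searched for in the outer row or column.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scores)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # continue alignment
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Work backwards from the final cell to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` over `A-GT`: three matches and one gap.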

Answer 3

Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure.
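A local alignment of this kind is usually computed with the Smith-Waterman recurrence, which is the same as the global one except that a cell may never fall below zero. The sketch below returns only the best local score; the scoring values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (illustrative scores)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere, which makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because the best cell can be anywhere in the array, the traceback (omitted here) would start from that cell and stop at the first zero.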

Answer 4

- A single query is aligned independently to any (similar) database entry.
- Must perform a local search.
- Smith-Waterman is guaranteed to find the mathematically optimal solution BUT is too slow for searching except on specialist parallel processing computers.
- Various fast methods have been developed, based on finding short local matches and then building up the alignment.
- These methods are good but not guaranteed to find the mathematically optimal solution.
- FASTA: popular method developed in 1985 but no longer widely used.
- BLAST: Basic Local Alignment Search Tool. This family of programs is the major sequence search tool in protein and DNA bioinformatics.

Answer 5

- A highly sophisticated approach developed by Altschul and colleagues in 1990.
- Very fast local search program (about 50 x the speed of Smith-Waterman).
- First finds short segments or seeds (known as words) in the query that have matches in the database, scored with BLOSUM62.
- Then extends suitable seeds to form HSPs (high-scoring segment pairs) using ungapped and gapped alignments.
- Significance of an HSP match of given length is evaluated by precise statistics.
- BLAST is also used for DNA / DNA and protein / 6-frame DNA translation searches.
- (PSI-BLAST was also developed, which uses multiple sequences; see later.)
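The seed-and-extend idea can be illustrated with a toy sketch. It is a simplification under stated assumptions and not BLAST itself: seeds are exact word matches (BLAST also accepts similar words scoring above a threshold with BLOSUM62), extension is rightwards only and ungapped, and the names `find_seeds`, `extend_ungapped` and the `x_drop` value are hypothetical.

```python
from collections import defaultdict

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-letter word matches."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds

def extend_ungapped(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed rightwards without gaps; stop once the running score
    has fallen x_drop below the best score seen so far."""
    score = best = k * match
    best_len = step = k
    while i + step < len(query) and j + step < len(subject):
        score += match if query[i + step] == subject[j + step] else mismatch
        step += 1
        if score > best:
            best, best_len = score, step
        elif best - score >= x_drop:
            break
    return best, query[i:i + best_len]
```

With `query = "MKVLAAGIT"` and `subject = "QQVLAAGWW"`, the seed at `(2, 2)` extends to the segment `VLAAG`.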

Answer 6

P(S) is the probability of achieving a score S or a better score by chance (i.e. P is a cumulative probability). N.B. P is a probability, so 0 <= P <= 1.

Answer 7

A related measure is also used: the expectation of an error in a database scan (E-value).
- E-value is the expected number of matches that are errors if you searched and took all matches up to (and including) S.
- E-value = estimated number of false positives found using S as the cut-off.

Answer 8

E(S) is the expected number of chance occurrences of scores equal to or better than S

Answer 9

- Most search programs return one or both of these values.
- The values do consider the size of the database searched and the score of the match.
- They should also consider the length of the match, as short matches are easier to find (BLAST does this).
- For matches < 20 residues one must be very cautious in suggesting true homology. Also one CANNOT infer that short matches will have similar 3D structure.
- Confident if P or E < 10^-3, but these are estimated values and may well be wrong. You need experience of the current version of the program to identify the best cut-off values.
- Note P is a probability, so P <= 1.
- E can be greater than 1.
- For low values (< 10^-3) P and E are virtually the same.
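The last three points follow if chance hits are assumed to occur as a Poisson process, in which case P = 1 - exp(-E). The short sketch below uses that assumption; the function name `p_from_e` is illustrative.

```python
import math

def p_from_e(e_value):
    """P-value implied by an E-value, assuming chance hits are Poisson distributed."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (1e-3, 0.1, 1.0, 10.0):
    print(f"E = {e:<6g} P = {p_from_e(e):.6f}")
```

Under this assumption P approaches E for small values and saturates just below 1 however large E becomes.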

Answer 10

A variety of approaches are used to estimate P- and E-values, but the implementation often changes faster than the actual algorithm, so read the manual.
- Take randomised sequences and obtain the distribution of scores.
  - But actual sequences are not random.
- Use the observed distribution of scores from one query against the database and generate a distribution of random scores (e.g. extreme value distribution).
- Use a theoretical model for the distribution of scores.
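The first approach (randomised sequences) can be sketched as a shuffle test. Everything here is an assumption made for illustration: the toy score `best_ungapped_identities`, the number of shuffles, and the add-one correction that keeps the estimate above zero. Real search programs use more refined statistics.

```python
import random

def best_ungapped_identities(a, b):
    """Toy similarity score: most identical positions over all ungapped offsets."""
    best = 0
    for offset in range(-len(b) + 1, len(a)):
        count = sum(
            1 for i in range(len(a))
            if 0 <= i - offset < len(b) and a[i] == b[i - offset]
        )
        best = max(best, count)
    return best

def empirical_p(query, subject, score_fn, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed = score_fn(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)               # keeps composition, destroys order
        if score_fn(query, "".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)   # add-one correction
```

Shuffling preserves composition but not the non-random structure of real sequences, which is exactly the caveat noted in the list above.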

Answer 11

- Strong conservation can be due to a required structural role.
- Another reason for observing strong conservation of particular residues is if they play a key functional role.
- For instance, in our serine protease example there is a triad of SER, ASP and HIS residues in the active site responsible for catalysis.
- These are conserved to preserve function.

Answer 12

- Multiple alignment is a much more difficult problem than pairwise alignment due to the time required.
- Solve by heuristic methods.
  - A heuristic is an “educated guess”.
  - It is not guaranteed to get the best solution.
  - But it usually finds a reasonable solution in a reasonable time.
- A widely-used method is the CLUSTAL family of programs. Early version now explained.

Answer 13

- Perform all pairwise sequence alignments to obtain scores of each sequence against the others.
- Construct a tree where close sequences (e.g. A and B) are neighbouring branches.
- This tree is the guide tree for the order of pairwise alignments.

Answer 14

Align the sequences progressively:
- Start by aligning the most closely related pair or pairs.
- Add in the next most closely related sequences by aligning them with the existing alignment.
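The order in which sequences are added can be sketched with a simple clustering of pairwise scores. This is a toy illustration under stated assumptions: the similarity is the fraction of identical positions in equal-length sequences (standing in for real pairwise alignment scores), and the tree is built by greedy average-linkage clustering, which is not necessarily the tree method a given CLUSTAL version uses.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions (toy score; assumes equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def guide_order(seqs):
    """Return the order in which clusters are merged, closest first."""
    clusters = {name: (name,) for name in seqs}
    merges = []
    while len(clusters) > 1:
        def avg_similarity(pair):
            c1, c2 = clusters[pair[0]], clusters[pair[1]]
            total = sum(identity(seqs[x], seqs[y]) for x in c1 for y in c2)
            return total / (len(c1) * len(c2))
        k1, k2 = max(combinations(sorted(clusters), 2), key=avg_similarity)
        merges.append((k1, k2))
        clusters[k1 + "+" + k2] = clusters.pop(k1) + clusters.pop(k2)
    return merges

seqs = {"A": "ACDEFG", "B": "ACDEFH", "C": "ACDEQQ", "D": "WWWWQQ"}
print(guide_order(seqs))
```

Each merge corresponds to one alignment step: first two sequences, then a sequence against an existing alignment, and so on.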

Answer 15

- Clustal Omega: latest version.
- T-Coffee
  - More advanced but slower, suitable for smaller alignments.
  - Can align one multiple alignment with another.
  - Can use one or more structures to guide (and improve) the alignment.
- MUSCLE
  - Very fast algorithm, particularly good for proteins.
  - Initially estimates sequence similarity using short sequence words of n residues.

Answer 16

General meaning:
- [AGT] means A, G or T
- {AF} means anything but A or F
- x(n) means a run of n amino acids of any type
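These rules map closely onto regular expressions, so a pattern can be translated and searched with Python's `re` module. The converter below is a minimal sketch covering only the elements listed above plus the `-` separator and the `<` / `>` terminal anchors; the example pattern is made up for illustration and is not a real PROSITE entry.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                            # optional trailing period
    regex = regex.replace("-", "")                         # element separators
    regex = regex.replace("{", "[^").replace("}", "]")     # {AF} -> [^AF]
    regex = regex.replace("(", "{").replace(")", "}")      # x(2) -> x{2}
    regex = regex.replace("x", ".")                        # x -> any residue
    regex = regex.replace("<", "^").replace(">", "$")      # N- and C-terminal anchors
    return regex

regex = prosite_to_regex("[AGT]-x(2)-{AF}-C")
print(regex)
print(re.search(regex, "MMGQQLCMM"))
```

The strictness discussed in later cards is visible here: a single forbidden residue, as in `MMGQQACMM`, is enough to lose the match entirely.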

Answer 17

- Created manually from multiple alignments and expert knowledge (Amos Bairoch).
- Extensively annotated and linked with the SWISSPROT database.
- Known false positives and negatives are listed.

Answer 18

Very useful, but...
- Does not describe whole sequences, only small sections.
- Can lead to several false negatives and false positives.
- Strict rules of matching are inflexible.
- Cannot describe statistical properties of a family.

Answer 19

- Sequence profiles (now used in PROSITE as an alternative to patterns)
- Hidden Markov Models (HMMs)

Answer 20

- Fails to consider that a new sequence may differ from the PSSM derived from the existing set.
- Fails to include our evolutionary knowledge that certain residue changes often occur (conservative changes such as Leu to Ile).
- PSSM = (substitution matrix) ** (observed frequencies), where ** indicates some method of combining scores.
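A position-specific scoring matrix can be sketched as log-odds scores per alignment column. Note the assumption: this toy version smooths the observed frequencies with a flat pseudocount, which addresses unseen residues but does not include substitution-matrix knowledge, so it is simpler than the combination described above. The function names and the four-letter alphabet in the example are illustrative.

```python
import math
from collections import Counter

def build_pssm(aligned, alphabet, background=None, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    background = background or {aa: 1 / len(alphabet) for aa in alphabet}
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in background)   # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in alphabet
        })
    return pssm

def score_segment(pssm, segment):
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

pssm = build_pssm(["ACD", "ACE", "ACD"], alphabet="ACDE")
print(score_segment(pssm, "ACD"), score_segment(pssm, "EEE"))
```

The pseudocount keeps unobserved residues from scoring minus infinity, which is one simple way of allowing for a new sequence that differs from the existing set.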

Answer 21

* Scoring matrices, such as BLOSUM, are used to find sequence homology at the amino acid level regardless of location.
* Profiles (PSSMs) extend this by including the position within the protein structure when scoring an alignment (e.g. PSI-BLAST).
* Hidden Markov Models (HMMs) take this one stage further.
* They include similarity and position but also gaps (insertions and deletions).
* They also take into account what comes one residue before the position, in addition to that position.
* They are looking at more of the overall pattern.

Answer 22

An HMM for a protein family is built by aligning known sequences in an MSA. The HMM is then built by traversing the alignment and calculating the probability of each possible transition between alignment positions. Sequence comparisons are generated from the HMM by starting at the beginning and then traversing the appropriate path for the sequence being searched. Each possible transition has a probability score, and the overall quality of the match of an unknown sequence to the HMM is calculated by multiplying together the scores.
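Multiplying the probabilities along a path can be sketched as below. This is a toy model under stated assumptions: the state names, transition table and emission table are invented, every state on the path emits one residue (silent delete states are left out), and the path is given in advance, whereas real tools search over paths. Logarithms are summed instead of multiplying raw probabilities to avoid numerical underflow on long sequences.

```python
import math

def path_log_prob(path, sequence, transitions, emissions):
    """Log-probability of emitting `sequence` along `path` (one state per residue)."""
    log_p = 0.0
    previous = "BEGIN"
    for state, residue in zip(path, sequence):
        log_p += math.log(transitions[(previous, state)])   # move to the state
        log_p += math.log(emissions[state][residue])        # emit the residue
        previous = state
    return log_p + math.log(transitions[(previous, "END")])

# Hypothetical two-column model with one insert state.
transitions = {
    ("BEGIN", "M1"): 1.0, ("M1", "M2"): 0.9, ("M1", "I1"): 0.1,
    ("I1", "M2"): 1.0, ("M2", "END"): 1.0,
}
emissions = {
    "M1": {"A": 0.8, "G": 0.2},
    "M2": {"C": 0.7, "T": 0.3},
    "I1": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
print(math.exp(path_log_prob(["M1", "M2"], "AC", transitions, emissions)))
```

The path through the insert state (`["M1", "I1", "M2"]` for `AGC`) scores lower because the position-specific insertion probability is only 0.1.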

Answer 23

- Full probabilistic model
  - leads to rigorous interpretation
- Takes into account the residue before the position you are scoring.
- Insertion and deletion have probabilities that are position specific
  - expect some parts of the sequence to be more susceptible to change

Answer 24

- A sequence can be matched against a set of HMMs, and the highest scoring is the most likely family.
- PFAM (Sanger Centre) is such a database of protein domain family HMMs.
- PFAM family: all homologues.
- A SCOP superfamily can be subdivided into several PFAM families if each PFAM family has a distinct function.

Answer 25

- Family: a group of proteins that share a common evolutionary origin reflected by their related functions, sequence homology or similarities in their structure.
- Domain: a distinct functional, structural or sequence unit often found associated with other types of domains.
- Homologous Superfamily: a group of proteins that share a common evolutionary origin, reflected by similarity in their structure, even if sequence similarity is low. This entry type contains signatures from the CATH-Gene3D and SUPERFAMILY member databases exclusively.
- Site: a short sequence containing one or more conserved residues, including active sites, binding sites, conserved sites and sites of post-translational modification.
- Repeat: a short sequence (usually < 50 amino acids) typically repeated many times within a protein.
- Unintegrated: member database signatures that might not yet be curated in InterPro but may still provide useful information.

Answer 26

- Not all known sequences are in the libraries in InterPro.
- PSI-BLAST is a general method to take a sequence, generate a multiple alignment and thereby find remote homologues.

Answer 27

- Position Specific Iterated BLAST
- Start with an ordinary BLAST search.
- Take the significant hits and form a sequence profile.
- The next iteration is to search with this profile.
- Add further significant hits to the profile and repeat until no more significant hits can be found.

Answer 28

- Low complexity regions
  - Sections with repeats of a few residues, often a coil (GEPGEPGEP).
  - Must be masked out, as they lead to spurious matches.
  - The program SEG is included in BLAST.
- Coiled-coils: local extended helices that intertwine
  - Tend to have periodic hydrophobic residues; if suspected, use specific programs to identify them.
- Transmembrane regions: parts of the protein buried within the membrane, formed primarily from hydrophobic residues
  - Need to be found first (see later lecture).
- Signal peptide: a short region (c. 15 - 25 residues) near the start of the protein chain (< 40 residues) that directs the correct location of a protein in the cell
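Masking of low complexity regions can be illustrated with a sliding-window entropy filter. This is a rough sketch and not the SEG algorithm: the window length, the entropy threshold and the function names are assumptions chosen so that a GEP repeat is masked while ordinary sequence is left alone.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.0):
    """Replace every residue covered by a low-entropy window with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)

print(mask_low_complexity("MKTAYIAKQRQISFVKSHFSRQ" + "GEP" * 5 + "LEERLGLIEVQAPILSRVGDGT"))
```

A window made only of G, E and P has an entropy of about 1.58 bits, well below the roughly 3 bits typical of a varied 12-residue window, so the repeat is replaced by X and excluded from seeding.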

Answer 29

- Mask out low complexity, coiled-coil, transmembrane and signal sequences.
- PSI-BLAST can drift
  - Feature: sequences confidently found in an early run disappear later on.
  - Reason: a rogue match is found that brings in its homologues, which pollute the scoring matrix.
  - Cure: watch the output and either stop earlier or be more stringent with the E-value for including sequences in the next steps.
- PSI-BLAST is not symmetric
  - Searching with sequence A finds B, but searching with B does not find A.

Answer 30

- Pairwise alignment
  - Below about 60% identity, some regions probably will be incorrectly aligned.
- Multiple alignment
  - PSI-BLAST and CLUSTALW give more accurate results, but below 40% pairwise identity one should still expect some errors.

Answer 31

Use Pfam superfamilies to identify a gold standard of homologues and non-homologues. Take one sequence as the query and evaluate, at a chosen E-value:
- how many true homologues (true positives) it finds
- how many errors (false positives) it finds
- how many true homologues it misses (false negatives)
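The three counts can be computed directly from a hit list and the gold standard. The sketch below assumes the search output has been reduced to a mapping from sequence name to E-value; the names and values are invented for illustration.

```python
def evaluate_hits(hits, true_homologues, e_cutoff):
    """Return (true positives, false positives, false negatives) at an E-value cut-off."""
    accepted = {name for name, e_value in hits.items() if e_value <= e_cutoff}
    true_pos = len(accepted & true_homologues)
    false_pos = len(accepted - true_homologues)
    false_neg = len(true_homologues - accepted)   # includes homologues never reported
    return true_pos, false_pos, false_neg

# Hypothetical search output and gold standard.
hits = {"a": 1e-10, "b": 1e-4, "c": 0.5, "d": 2e-5}
gold = {"a", "b", "e"}
print(evaluate_hits(hits, gold, e_cutoff=1e-3))
```

Repeating the evaluation over a range of cut-offs shows the trade-off: a looser E-value recovers more true homologues but admits more errors.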

(60 cards)