Lecture 5&6: Protein Sequence Alignment Flashcards
What are Orthologs?
Homologous genes in different species that diverged from a common ancestral gene through a speciation event.
What are Paralogs?
Copies resulting from a gene duplication event within the same organism.
What is the formula for percentage sequence identity?
(Number of identical residues / number of residues in the shorter protein) * 100
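As a minimal sketch of this formula, the function below assumes the two sequences have already been aligned and padded to equal length, with '-' marking gaps. The function name and the gap character are illustrative choices, not part of the lecture material.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage identity of two pre-aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both sequences carry the same residue (gaps never count).
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    # The denominator is the ungapped length of the shorter protein.
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shorter


print(percent_identity("MKT-AY", "MKTLAF"))
```

Here 4 of the 5 residues in the shorter sequence are matched to an identical residue, so the identity is 80%.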
What are the general steps to determine the function of a protein?
- Do fast scans using approximate methods (BLAST).
- Align proteins using a dynamic programming method (Needleman & Wunsch, Smith & Waterman)
- Scan against sequence profiles or HMMs in secondary databases (Pfam, InterPro)
- Align sequence against family relatives using ClustalW, Jalview
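What is the difference between the Needleman & Wunsch and Smith & Waterman algorithms?
Needleman & Wunsch performs global alignment: the two sequences are aligned over their full length.
Smith & Waterman performs local alignment: only the best-matching subregions are aligned.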
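The contrast between the two dynamic programming methods can be sketched in a few lines. This is a simplified illustration only: it assumes a match/mismatch score and a linear gap penalty (values chosen arbitrarily) instead of a real substitution matrix, and it returns only the optimal score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman & Wunsch: best score for aligning all of a with all of b."""
    # First row: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully consumed


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith & Waterman: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # the best cell may lie anywhere in the table
        prev = curr
    return best


a, b = "TTTTHEAGAWTTTT", "CCCCHEAGAWCCCC"
print(global_score(a, b), local_score(a, b))
```

The two differences are the zero floor in each cell and where the final score is read from. For the example pair, the local score rewards the shared HEAGAW core, while the global score is dragged down by the mismatched flanks.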
What are some rules of sequence homology?
Protein pairs longer than 150 residues are likely to be homologs if their sequence identity is > 25%.
For shorter proteins or fragments, about 30% sequence identity is required.
Structure within families tends to be much more conserved compared to sequence.
Inheriting functional properties from a homolog requires around 60% sequence identity.
What are the different matrices used when comparing two proteins?
Identity Matrix (Binary)
Physicochemical properties matrix (range)
Evolutionary matrices (Dayhoff, BLOSUM matrices)
What is the Dayhoff matrix?
It is an evolutionary matrix.
It measures evolutionary distance as the number of point accepted mutations (PAM), where 1 PAM = 1 accepted point mutation per 100 residues.
A distance of more than 100 PAM means that multiple substitutions have occurred at the same site.
What is the BLOSUM matrix?
It is an evolutionary matrix.
It is derived from analyzing substitution patterns in more distant relatives.
What is the difference between a p-value and an e-value?
The p-value is the probability that a match scoring at least this well was obtained by chance. It is converted to an E-value, the number of such matches expected by chance, which takes the size of the database into account.
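One common approximation for this conversion, sketched below, multiplies the single-comparison p-value by the number of sequences searched. This is an assumption made for illustration: individual search tools compute their statistics in their own way, and the function names here are hypothetical.

```python
import math


def e_value(p_single, db_size):
    """Expected number of chance hits when db_size sequences are searched
    (assumes independent comparisons, each with probability p_single)."""
    return p_single * db_size


def p_at_least_one(e):
    """Probability of seeing at least one chance hit, treating the number
    of chance hits as Poisson-distributed with mean e."""
    return 1.0 - math.exp(-e)


e = e_value(1e-6, 500_000)
print(e, p_at_least_one(e))
```

The same p-value therefore becomes less convincing as the database grows: a one-in-a-million match is expected about 0.5 times by chance in a database of 500,000 sequences.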
What types of residues are most conserved?
Catalytic residues are the most highly conserved. Other conserved residues can include those in the binding pocket or on the surface of the protein.
Highly conserved residues are usually associated with the function.
What is progressive alignment?
It is a heuristic approach that uses the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree.
What are the features of the Clustal W algorithm?
It has position specific gap opening and extension penalties (higher within strands and helices, lower between them).
It uses two different amino acid substitution matrices: one for close relatives, one for distant.
What are some alternatives to Clustal W?
MAFFT
T-Coffee
MUSCLE
Jalview (an alignment viewer and editor that can run aligners such as these)
How can conservation be measured?
While there are various methods to measure the magnitude of conservation, common ones use the frequency of a residue at a particular site.
Entropy scores are generated from these frequencies. A lower entropy score indicates a more conserved position; a higher score indicates a more variable one.
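A frequency-based entropy score can be sketched as follows, using Shannon entropy in bits per alignment column. Ignoring gap characters and leaving the score unnormalised are simplifying assumptions; the function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # assumption: gaps are ignored
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    # Sum of p * log2(1/p) over the residue types observed in the column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*alignment)]


print(alignment_entropies(["MKA", "MKL", "MRL", "MRA"]))
```

A column containing a single residue type scores 0 bits, and a column split evenly between two residue types scores 1 bit.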