Structure bioinformatics - Jens lectures Flashcards
Where can we find protein structures?
In the Protein Data Bank (PDB).
Why do we classify proteins based on the structure?
To understand biological function.
Structural and functional similarity cannot always be understood at sequence level.
What is a domain?
Domains are distinct functional units within a protein, each with its own topology and fold.
What are CATH and SCOP?
CATH and SCOP are databases that classify protein domains based on the organization of secondary structure elements. They identify structurally similar proteins regardless of function.
The protein structures are organized into hierarchical levels, whereas the PDB stores all structures without such a classification.
The goal is to understand the evolutionary relationships between proteins.
Which is more conserved, sequence or fold?
The structure/fold is more conserved than the sequence: different sequences can have the same fold, and 20% of the folds account for 80% of the proteins.
Two different proteins can have a very low sequence identity but can be structurally very similar and have similar function.
Sequence identity over 40% indicates a similar fold.
Does a similar structure/fold indicate an evolutionary relationship and similar function?
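A minimal sketch of how sequence identity could be computed from a pairwise alignment. It assumes the two sequences are already aligned with "-" as the gap character and that identity is counted only over columns without gaps; other definitions of the denominator exist, and the function name is made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment, not real proteins: 3 of 4 compared columns are identical.
print(percent_identity("ACDE", "ACDF"))
```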
Similar structure often indicates an evolutionary relationship and similar function between proteins but it is not always the case.
Proteins with very different folds can have the same function and vice versa.
Sequence and structural similarity together (common ancestor) is a strong indication of similar function.
Explain the database SCOP
Protein structures in the PDB are organized according to their structural and evolutionary relationships:
Class = proteins with similar secondary structure element composition.
Fold = proteins with similar secondary structure elements in a specific topology.
Superfamily = Proteins with low sequence identity but with common structural elements. Assumed common origin.
Family = Proteins with established common ancestor and conserved sequence.
Explain the database CATH.
Alternative to SCOP.
Classifies protein structures in the PDB to identify structural and evolutionary relationships.
Class = proteins with similar secondary structure element composition (mainly alpha, mainly beta, mixed).
Architecture = overall domain shape, given by the orientation of secondary structure elements.
Topology = orientation of secondary structure elements and the connections between them.
Homologous superfamily = evolutionarily related domains.
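CATH identifiers are written as four dot-separated numbers, one per level listed above. The sketch below splits such an identifier into its levels and counts how many leading levels two domains share. The helper names are hypothetical, the class table adds CATH's fourth class ("few secondary structures"), and the example codes are used purely as illustrations.

```python
from typing import NamedTuple

# Top-level class numbers used by CATH.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(code: str) -> CathId:
    """Split a 'C.A.T.H' string into its four hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return CathId(*(int(p) for p in parts))


def shared_levels(a: CathId, b: CathId) -> int:
    """Number of leading levels (0 to 4) that two identifiers have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


first = parse_cath_id("3.40.50.300")
second = parse_cath_id("3.40.50.720")
print(CATH_CLASSES[first.cath_class], shared_levels(first, second))
```

Two domains that share the first three levels have the same topology (fold) but belong to different homologous superfamilies.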
What are motifs?
If the similarity between an unknown sequence and a sequence of known function is limited to a few residues, standard sequence alignment will not detect it, even though proteins with very low sequence identity can still share the same fold and function.
We can instead search for short conserved patterns called sequence motifs, which can be valuable for predicting the function of a protein. The residues in a motif do not need to be close together in sequence, only in the 3D structure.
An example of a motif is an enzyme's active site.
Give examples of databases for motif/pattern searching.
PROSITE
Pfam
InterPro
What can you search for in PROSITE?
Search with a sequence of unknown function to find functional motifs and make predictions of function based on these.
Pattern searches like this can give many false positives because the patterns are short.
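To make the idea of pattern searching concrete, here is a rough sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only a subset of the syntax (x, [..], {..}, repeat counts, and the < and > anchors); the function names are hypothetical and the test sequence is invented. The pattern N-{P}-[ST]-{P} is the classic N-glycosylation site pattern, which also shows why such short patterns match so often by chance.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):  # repeat count: (n) or (n,m)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":  # any residue
            core = "."
        elif element.startswith("{"):  # any residue except the listed ones
            core = "[^" + element[1:-1] + "]"
        else:  # a single residue or an [ABC] choice
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Start positions (0-based) of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "AANGSAA"))
```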
What can you search for in Pfam?
Functional motifs. Pfam gives a more complex description than PROSITE.
A large collection of multiple sequence alignments of protein domains or conserved regions.
Hidden Markov Model (HMM) based profiles are used to represent protein families or domains.
By searching a protein sequence against the Pfam library of profile HMMs, you can determine which domains it carries.
As common patterns are recognized, protein function can be assigned even for sequences with low sequence similarity.
What is DALI?
A program that compares protein structures in 3D to reveal similarities that can indicate similar functions.
It does this by comparing the distances between the alpha carbons (Cα) of each protein's backbone and searching the PDB for structurally related proteins.
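The basic ingredient described above, a matrix of distances between alpha carbons, can be sketched as follows. This is not the DALI algorithm itself, which has to find matching sub-matrices between proteins of different lengths; the sketch assumes two equally long, already matched sets of coordinates, and the coordinates and function names are invented for illustration.

```python
import math


def distance_matrix(ca_coords):
    """All pairwise distances between alpha-carbon coordinates (x, y, z)."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def mean_abs_difference(matrix_a, matrix_b):
    """Average absolute difference between two equally sized distance matrices."""
    n = len(matrix_a)
    if n == 0 or n != len(matrix_b):
        raise ValueError("matrices must be non-empty and of equal size")
    total = sum(
        abs(matrix_a[i][j] - matrix_b[i][j]) for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Made-up coordinates; the second set is the first one shifted in space.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
shifted = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in coords]
print(mean_abs_difference(distance_matrix(coords), distance_matrix(shifted)))
```

Because only internal distances are compared, the result does not depend on how the two structures are positioned or rotated in space, so no superposition is needed first.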
Why do we need protein structure prediction?
- We can understand function from structure.
- It is difficult to determine structures experimentally.
- The gap between the size of the sequence data and structure data is large and increasing.
Give examples of how we can define secondary structures.
With phi/psi angles in a Ramachandran plot.
With hydrogen bonding patterns.
How can we assess a secondary structure prediction?
By looking at the percentage of residues whose secondary structure state (helix, strand or coil) is predicted correctly (Q3).
Note that a random prediction gives Q3 = 33%.
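A small sketch of the Q3 calculation, assuming the predicted and observed structures are given as equally long strings over the three states H (helix), E (strand) and C (coil). The example strings are invented.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state assignment is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: 2 of 4 residues are predicted correctly.
print(q3("HHHH", "HHEE"))
```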
What is the Chou-Fasman method?
A method for predicting secondary structures.
Each amino acid has a certain probability of occurring in helices or sheets. The Chou-Fasman method uses these probabilities without considering context.
The amino acids are defined as helix/beta “formers”, “indifferent” and “breakers”.
Ex. Chou-Fasman would say that an alanine gives a high probability of the structure being a helix, without looking at the context (what is around that alanine?). Secondary structure is context dependent and changes with the environment and the surrounding amino acids.
Accuracy of about 50-60%.
What is the GOR method?
The Garnier, Osguthorpe and Robson method is an improvement of the Chou-Fasman method for prediction of secondary structures.
It incorporates effects of local interactions (context).
The assignment of secondary structure to residue j depends on the neighbouring residues j-8 to j+8. This means that the method does not look at one aa at a time but at a window of 17 aa, and predicts the state of the central residue from that window.
Q3 = around 65%.
GOR5 with improvements has Q3 = around 70%.
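The 17-residue window can be sketched as below. Only the window mechanics follow the card; the scoring table `info`, keyed by (state, offset, amino acid), is a hypothetical stand-in for the statistics a real GOR implementation derives from known structures, and the toy values are invented.

```python
def windows(sequence, half_width=8, pad="-"):
    """Yield one window of 2 * half_width + 1 residues centred on each position."""
    padded = pad * half_width + sequence + pad * half_width
    for j in range(len(sequence)):
        yield padded[j : j + 2 * half_width + 1]


def predict_states(sequence, info, half_width=8, states="CHE"):
    """Assign each residue the state with the highest summed window score.

    `info` maps (state, offset, amino_acid) to a score; missing entries count
    as 0. Ties go to the first state listed in `states` (coil by default).
    """
    prediction = []
    offsets = range(-half_width, half_width + 1)
    for window in windows(sequence, half_width):
        scores = {
            state: sum(info.get((state, k, aa), 0.0) for k, aa in zip(offsets, window))
            for state in states
        }
        prediction.append(max(states, key=scores.get))
    return "".join(prediction)


# Invented scores, only to show how neighbours contribute to the central residue.
toy_info = {("H", 0, "A"): 1.0, ("E", 0, "V"): 1.0, ("H", -1, "A"): 0.5}
print(predict_states("AV", toy_info, half_width=1))
```

Each residue is scored independently from its own window, which matches the limitation noted on a later card: the predictions made for neighbouring residues are never fed back in.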
Does GOR perform better for helices or beta strands?
GOR performs better for helices than for beta strands.
What does the GOR method not take into account?
It does not take into account the predictions for neighboring residues.
What is the PHD method?
A more advanced method for predicting secondary structures.
It is machine learning trained on proteins of known secondary structure.
According to the Chou-Fasman method, how do you decide whether an amino acid sequence forms a helix or a beta strand?
For every aa in the sequence, check whether it is a former, breaker or indifferent for helices and for strands, and assign it to the structure for which it is the strongest former.
Then count how many aa indicate helix and how many indicate strand, and choose the majority.
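The counting procedure on this card could look roughly like the sketch below. It is a simplification: the full Chou-Fasman method also uses nucleation and extension rules, which are left out here. The table holds helix and strand propensities for only a few residues, with values approximating commonly quoted ones; treat them as illustrative and check a published table before relying on them. Values above 1.0 are read as "former" (an assumption of this sketch).

```python
# (helix propensity, strand propensity); approximate, illustrative values.
TOY_PROPENSITIES = {
    "A": (1.42, 0.83),
    "E": (1.51, 0.37),
    "M": (1.45, 1.05),
    "K": (1.14, 0.74),
    "V": (1.06, 1.70),
    "I": (1.08, 1.60),
    "Y": (0.69, 1.47),
    "G": (0.57, 0.75),
    "P": (0.57, 0.55),
}


def vote_segment(segment, propensities=TOY_PROPENSITIES):
    """Majority vote over residues: 'H' (helix), 'E' (strand) or 'C' (undecided)."""
    votes = {"H": 0, "E": 0}
    for aa in segment:
        p_helix, p_strand = propensities[aa]
        if max(p_helix, p_strand) <= 1.0:
            continue  # not a former for either structure, so no vote
        votes["H" if p_helix > p_strand else "E"] += 1
    if votes["H"] == votes["E"]:
        return "C"
    return "H" if votes["H"] > votes["E"] else "E"


print(vote_segment("AEEMAK"), vote_segment("VIYVI"), vote_segment("GPGP"))
```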
What is the PHD method?
- What are the different steps in the method?
An advanced 3rd generation method for secondary structure prediction of protein sequences based on machine learning and neural networks. It takes evolutionary information into account.
The first step is a profile generation of the sequence based on an MSA.
There is an input layer, hidden layer and an output layer.
The hidden layer consists of two neural networks that make predictions about the secondary structure, followed by a jury decision that takes the outputs of the neural networks as input and selects the prediction that was made the highest number of times.
The first neural network takes the profiles of the sequence and gives a prediction of structure.
The second neural network takes the output from the first as input and makes a second prediction based on that (correlates predictions of neighbouring residues).
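Two of the steps above are easy to sketch without a neural network: building a profile from an MSA and taking a jury decision. The profile here is simply the amino acid frequency in each alignment column, and the jury is a per-residue majority vote as described on this card. Both are simplified assumptions rather than the actual PHD implementation, and the alignment and predictions are invented.

```python
from collections import Counter


def column_profile(msa):
    """Per-column amino acid frequencies of an alignment (gaps '-' ignored)."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        profile.append({aa: n / len(residues) for aa, n in counts.items()})
    return profile


def jury_decision(predictions):
    """Per-residue majority vote over several equally long predictions."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*predictions))


print(column_profile(["ACD", "ACE", "A-E"])[2])
print(jury_decision(["HHEC", "HHCC", "HEEC"]))
```

A conserved column gives a frequency of 1.0 for a single residue, which is the evolutionary information the first network receives instead of a single sequence.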
What makes the PHD method better than, for example, Chou-Fasman and GOR?
It is a more complex model that also incorporates evolutionary information, which gives better results such as:
- Higher accuracy (Q3 above 70%)
- Better beta-strand predictions
- Better predictions for larger sequence alignments.
What is the difference between 1st, 2nd and 3rd generation methods for protein secondary structure prediction?
1st:
Based on single-residue propensities (no context).
Chou-Fasman, Q3 = 50-60%.
2nd:
Context dependent.
GOR, Q3 = 60-70%.
3rd:
Evolutionary information, neural networks.
PHD and PSIPRED, Q3 > 70%.
Name the different membrane protein structures.
Helices
Helix bundles
Beta barrels
Describe the characteristics of a helix bundle.
Helices are 15-30 residues long
Tightly packed helices with coiled-coil structure (multiple helices coiled together).
The helices are predominantly apolar/hydrophobic.
Positive residues such as K and R are more common on the cytosolic side.
Aromatic residues such as Y and W are more common at the ends of helices.
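One way the hydrophobic character of transmembrane helices could be turned into a simple scan is a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which are commonly quoted settings but are assumptions here, not something these cards specify. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i : i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def has_candidate_tm_segment(sequence, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to be a candidate helix."""
    return any(avg >= threshold for avg in hydropathy_profile(sequence, window))


# Artificial sequences: a leucine stretch versus a lysine stretch.
print(has_candidate_tm_segment("L" * 20), has_candidate_tm_segment("K" * 20))
```

A fuller predictor would also use the other features on this card, such as the excess of K and R on the cytosolic side, to decide the orientation of each helix.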
Describe the characteristics of a beta barrel.
The strands in the barrel have hydrophobic residues at every second position.
Beta-strands flanked by aromatic residues.
They are more difficult to predict than transmembrane alpha-helices.
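The alternating pattern in barrel strands can be checked with a very small sketch: compute the fraction of hydrophobic residues at even and at odd positions of a candidate strand. The set of residues counted as hydrophobic is an assumption, and the example strand is invented.

```python
# Residues treated as hydrophobic in this sketch (an assumption).
HYDROPHOBIC = set("AVILMFWYC")


def alternation_fractions(strand):
    """Fraction of hydrophobic residues at even and at odd positions (0-based)."""
    even = strand[0::2]
    odd = strand[1::2]

    def fraction(residues):
        if not residues:
            return 0.0
        return sum(aa in HYDROPHOBIC for aa in residues) / len(residues)

    return fraction(even), fraction(odd)


# Invented strand with hydrophobic residues at every second position.
print(alternation_fractions("VSLTVNIT"))
```

A large difference between the two fractions suggests the alternating pattern. Because such strands are short and only half of their residues are hydrophobic, the signal is weaker than for a transmembrane helix, which fits the point that barrels are harder to predict.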