Structural bioinformatics Flashcards
Why do we need protein-protein docking?
PDB contains ~4,000 hetero-dimers (including duplicate copies for same complex)
Compare to number of entries ~170,000
Can one predict the structure of a complex starting with unbound components?
Unbound components can be experimental or high quality predicted structures
Need to be able to model limited conformational change.
Describe typical protein protein docking
Walk through protein docking step by step
- Global search to find good overlap of the protein surfaces
- Residue-residue interactions - score with an empirical residue pair potential. You need to check whether a given pair is present in proteins more frequently than expected by chance
- Search for clusters of similar complex geometry with low energy. There are many more ways to be wrong than correct, so the correct solution will be found much more often than any individual wrong solution.
- Refinement - search for the optimal combination of side-chain rotamers by energy calculation.
- Functional residue information - the function can give you info about the structure
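The pair-potential step above can be sketched as a log-odds calculation: how often a residue pair is seen in contact compared with how often it would be expected by chance from the residue frequencies alone. This is a minimal illustration, not the scoring function of any particular docking program; the function names and the toy contact list are hypothetical, and real potentials are derived from large sets of known interfaces. Here a positive score means the pair is seen more often than chance (an energy-like potential would flip the sign).

```python
import math
from collections import Counter


def pair_potential(contacts):
    """Log-odds score per residue pair from a list of observed contacts.

    contacts: list of (res_a, res_b) one-letter codes, one tuple per contact.
    Hypothetical helper for illustration only.
    """
    # Treat (K, E) and (E, K) as the same unordered pair.
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter(res for pair in contacts for res in pair)
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), observed in pair_counts.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # Chance expectation for an unordered pair of independent residues.
        expected = freq_a * freq_b if a == b else 2 * freq_a * freq_b
        scores[(a, b)] = math.log((observed / total_pairs) / expected)
    return scores


def score_interface(contacts, scores):
    """Sum the pair scores over the contacts of one docked pose."""
    return sum(scores.get(tuple(sorted(pair)), 0.0) for pair in contacts)


# Toy data (made up): contacts observed across some set of interfaces.
observed = [("L", "L"), ("L", "L"), ("K", "E"), ("K", "E"), ("L", "K")]
potential = pair_potential(observed)
```

Poses from the global search would each get a `score_interface` value, and the low-energy (here: high-scoring) poses are the ones passed on to clustering.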
Template based prediction from homologous structure of a complex
X-ray structure of protein A’ complexed with protein B’
A’ is a homologue of A
B’ is a homologue of B
If the A/B interface is evaluated as favourable in 3D, then predict that A interacts with B
Template based modelling - sequence search
Start with sequence protein A and protein B
Based on sequence similarity, search the library of complexes in the PDB for a complex A’ / B’ where A is homologous to A’ and B is homologous to B’
Align sequence A to A’ and B to B’
Sequence search via BLAST, PSIBLAST or a more advanced statistical model known as a hidden Markov model (HMM)
Template based modelling - 2 model construction
On 3D structure of complex change sequence from A’ to A and B’ to B
Adjust any loops where there is an insertion or deletion
Refine complex
Steps template-based modelling 3 – alternate model selection
Sometimes there can be several suitable templates as several have similar sequence identity
Construct several models
Score models (similar to ab initio docking)
Choose best model
NB this is one approach but several variations in template-based modelling
Coevolution and protein interactions
Concept of correlated mutations extends to homo and hetero complexes
AlphaFold multimer and Colab can consider complexes
Active area of research – results very encouraging
What is the gene ontology?
A controlled vocabulary that can be applied to all organisms
Used to describe gene products - proteins and RNA - in any organism
All descriptions are supported by some level of evidence
How does GO work?
It captures information about 3 important features of function:
What does the gene product do?
Why does it perform these activities?
Where does it act?
Describe the 3 gene ontologies
Molecular Function = biochemical function
the tasks performed by individual gene products; examples are carbohydrate binding and GTPase activity
Biological Process = biological goal or objective (higher level function)
broad biological goals, such as mitosis or purine metabolism, that are accomplished by combinations of individual molecular functions.
Cellular Component = active location
subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
Can you compare gene ontologies?
No, each branch is independent so you can’t really compare
Where do annotations come from?
Annotation source is important
It enables you to assess how confident the annotation is
GO associates annotations with an evidence code that indicates its source.
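As a small illustration of why the evidence code matters, the sketch below separates annotations backed by experimental evidence codes from everything else (for example IEA, inferred from electronic annotation). The protein names are hypothetical and the tuple layout is an assumption made for this example; it is not the format of any GO download.

```python
# GO evidence codes for experimentally supported annotations.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(annotations):
    """Split (protein, go_id, evidence_code) tuples into experimental and other."""
    experimental, other = [], []
    for protein, go_id, code in annotations:
        target = experimental if code in EXPERIMENTAL_CODES else other
        target.append((protein, go_id, code))
    return experimental, other


# Hypothetical annotations: protein names are made up for illustration.
annotations = [
    ("protX", "GO:0003924", "IDA"),  # GTPase activity, direct assay
    ("protX", "GO:0005634", "IEA"),  # nucleus, electronic annotation
    ("protY", "GO:0030246", "ISS"),  # carbohydrate binding, sequence similarity
]
experimental, other = split_by_evidence(annotations)
```

Annotations in `other` are not wrong by default, but they deserve more caution before being used to transfer function.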
Uses of GO
Enhanced predictors of protein function return prediction of GO terms
Common features in a set of over-expressed / under-expressed genes can be reported as belonging to a common GO group
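One common way to decide whether a GO term really is a shared feature of a gene set is an over-representation test. The sketch below uses a one-sided hypergeometric test, which is an assumption of this example rather than something stated in the cards; the function name and the numbers are hypothetical, and real analyses also correct for testing many GO terms at once.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(at least `hits` annotated genes in the sample) under the hypergeometric model.

    population: genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the over-/under-expressed set
    hits:       genes in that set carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 background genes, 3 carry the term,
# 3 genes are over-expressed and all 3 carry the term.
p_all_three = enrichment_p_value(population=10, annotated=3, sample=3, hits=3)
```

A small p-value suggests the term occurs in the gene set more often than expected by chance.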
Why do you want to computationally predict protein function?
About 240M sequences but time taken experimentally to determine function can be several years
Approaches to function prediction
General homology search
BLAST (single sequence information)
PSIBLAST (multiple sequence information in profile)
HMMs scan (multiple sequence information in hidden Markov model)
Accuracy of database searching
Orthologues and paralogues and their EC numbers
Orthologues: same EC and high %id
Paralogues: Different EC and lower % id
Relationship %id and functional transfer
To some extent these concepts about EC numbers apply to all types of function but no clear cut rules
Function challenging to quantify
Start with a query sequence
Perform BLASTP search
What can you predict based on the %id between two proteins?
If you find a protein from another species with >~50% identity, it could easily be an orthologue with the same function
If you find a protein from another species with >~30% identity, it could be a paralogue and have a related function
If you find a protein from the same species with >30% identity, it could be a paralogue and have related functions
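The rules of thumb above can be written down as a short sketch. Percent identity is computed here over aligned columns where neither sequence has a gap, which is one of several conventions and is an assumption of this example; the function names and returned labels are hypothetical, and the thresholds are only the approximate values quoted above, not hard rules.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def rule_of_thumb(pid, same_species):
    """Map %id to the rough interpretation given in the card above."""
    if not same_species and pid > 50:
        return "possible orthologue, same function"
    if pid > 30:
        return "possible paralogue, related function"
    return "no confident functional transfer"


# Toy pairwise alignment (made up); '-' marks a gap.
pid = percent_identity("ACDE-F", "ACDQGF")
```

In practice the alignment would come from the BLASTP hit itself, and the domain complications described in the next card still apply.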
Further complications - Domains
- Confident match by a search program
- Search programs identify local similarities. Two sequences may share a region/domain of similarity but also other domains that are different.
- But not all domains are shared – function probably different
NetGo – Using different sources (Details not important)
Structural searching
Use structural searching to identify a similar fold (e.g. with DALI or in CATH)
Similar 3D structure may suggest related function
If the functional site in the match is known, examine whether similar residues are present in your newly-determined protein
Can finding a similar fold tell you anything about function?
No, to find about function you need to dig a little deeper
Convergence of serine protease active site
Trypsin and subtilisin have quite different 3D folds
Trypsin and subtilisin have totally different sequences
Same three active site residues (Ser, His, Asp) located in very similar positions
Order along chain of Ser, His, Asp different
Infer convergence of function
Favourable arrangement of residues to perform a function that arose via evolution independently
Principles for non polar proteins
Apart from the ends, the protein chain in the membrane is in a non-polar (i.e. hydrophobic) environment.
Side-chains tend to be hydrophobic
Main-chain cannot have NH and CO groups not forming hydrogen bonds
Single spanning membrane regions one α-helix
Most (but not all) membrane bound proteins formed from α-helical segments
Transmembrane helices
Hydrophobic section of membrane about 35 – 50 Å wide
Section not translocated (i.e. not passed through the membrane) that abuts the membrane often contains +ve charges
Each residue in an α-helix advances the structure by about 1.5 Å
Transmembrane helices tend to be between 20-30 residues long
Simple methods to identify transmembrane regions
Early methods searched for runs of very non-polar (hydrophobic) residues along the sequence
Use a scale of how hydrophobic each residue is
Hopp & Woods scale
Kyte & Doolittle scale
Calculate average over a window of several residues (typically 11)
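The sliding-window idea can be sketched as follows using the Kyte & Doolittle hydropathy values. The window of 11 comes from the card above; the cutoff of 1.6 and the function names are assumptions chosen for illustration, and the test sequence is made up.

```python
# Kyte & Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=11):
    """Average hydropathy for every window of `window` residues along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if len(values) < window:
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(sequence, window=11, threshold=1.6):
    """0-based start positions of windows whose average reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


# Made-up sequence: charged ends flanking a run of 20 leucines.
toy = "D" * 11 + "L" * 20 + "K" * 11
profile = hydropathy_profile(toy)
starts = candidate_tm_windows(toy)
```

Windows that lie inside the leucine run score highest, while windows over the charged ends score strongly negative. A signal peptide would also produce a hydrophobic peak, which is why this simple approach cannot separate the two on its own.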
Signal peptides
Signal peptides refer to the sequence at start of proteins ranging from 15 – 60 residues that directs the protein to the correct cellular location.
Generally signal peptides cleaved off
Often has a hydrophobic region followed by a pattern typical of the cleavage site
In predictions need to distinguish transmembrane regions from signal peptides
Current methods for transmembrane predictions
Until recently, prediction methods used HMMs (hidden Markov models) based on aligned sequences
DeepTMHMM – a deep learning approach that predicts transmembrane structures and signal peptides. The algorithm predicts how the sequence maps onto the different potential states from the N- to the C-terminus
Low complexity regions
Regions with composition biased strongly to a small number of amino acids
e.g. RRRRRRR, RSSRSRSSRSR, GGSSSSDDS
Occur in the sequences of a significant number of proteins
Distort statistical significance scores of alignments
The program SEG (by Wootton) is often used
Replaces low complexity regions with lower case in BLAST searches at NCBI
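A simplified way to see what "low complexity" means is to measure the Shannon entropy of the amino acid composition in a sliding window and lower-case any window that falls below a cutoff. This is only a sketch of the idea and not the SEG algorithm itself; the window length, cutoff and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Lower-case every residue covered by a window with entropy <= threshold."""
    sequence = sequence.upper()
    masked = [False] * len(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(aa.lower() if m else aa for aa, m in zip(sequence, masked))


# Made-up sequence with a serine run in the middle.
example = mask_low_complexity("ACDEFGHIKLMN" + "S" * 12 + "PQRTVWYACDEF")
```

A window made of a single residue type has entropy 0, while a window of 12 different residues has entropy log2(12), about 3.58 bits, so the biased region is masked and a fully diverse region is left in upper case.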
Coiled-coils
Two or three intertwined α-helices
Can be a short segment (20 residues) or far longer
Identified using COILS written by Lupas
Disordered proteins
Some globular proteins have small regions which are disordered and cannot be identified by crystallography or NMR – often N- or C- terminus or a long loop.
Some proteins do not adopt a single structure but are highly flexible
Often protein becomes structured when protein binds to another protein
Prediction of disordered regions
Based on principles – tend to have a lower fraction of hydrophobic residues than a folded protein with a hydrophobic core.
Machine learning based- Most programs now based on neural networks, support vector machines
PONDR-FIT (Dunker)
DISOPRED2 (Jones, UCL)
IUPRED
(see Expasy page – list of programs)
Use AlphaFold prediction where confidence given by pLDDT < 70 (see Piovesan, et al Protein Science, 31(11), e4466.)
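The pLDDT rule can be sketched as a scan for stretches of consecutive residues below the cutoff. The function name, the `min_length` parameter and the scores are hypothetical; the input is assumed to be a plain list of per-residue pLDDT values (AlphaFold model files store these in the B-factor column, and reading them is left out here).

```python
def low_confidence_regions(plddt, cutoff=70.0, min_length=1):
    """Return (start, end) residue ranges, 1-based inclusive, with pLDDT < cutoff."""
    regions = []
    start = None
    for index, score in enumerate(plddt, start=1):
        if score < cutoff:
            if start is None:
                start = index  # a low-confidence stretch begins here
        elif start is not None:
            if index - start >= min_length:
                regions.append((start, index - 1))
            start = None
    # Close a stretch that runs to the C-terminus.
    if start is not None and len(plddt) - start + 1 >= min_length:
        regions.append((start, len(plddt)))
    return regions


# Made-up per-residue scores for a 10-residue chain.
scores = [40, 55, 90, 95, 88, 60, 65, 50, 92, 30]
all_regions = low_confidence_regions(scores)
long_regions = low_confidence_regions(scores, min_length=3)
```

Setting `min_length` filters out isolated low-scoring residues so that only longer stretches are reported as candidate disordered regions.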