Structural bioinformatics Flashcards

1
Q

Why do we need protein-protein docking?

A

The PDB contains only ~4,000 hetero-dimers (including duplicate copies of the same complex)
  • Compare this with the total number of entries, ~170,000
  • Can one predict the structure of a complex starting from the unbound components?
  • Unbound components can be experimental or high-quality predicted structures
  • Need to be able to model limited conformational change

2
Q

Describe typical protein-protein docking

A
3
Q

Walk through protein docking step by step

A
  1. Global search to find good overlap of the surfaces of the proteins
  2. Residue-residue interactions - score with an empirical residue pair potential. You need to check whether a given pair is present in proteins more frequently than expected by chance
  3. Search for clusters of similar complex geometry with low energy. There are many more ways to be wrong than correct, so the correct solution will be found much more often than any individual wrong solution.
  4. Refinement - search for the optimal combination of side-chain rotamers by energy calculation.
  5. Functional residue information - the function can give you information about the structure
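
Step 2 can be illustrated with a toy log-odds calculation. This is a minimal sketch, not the potential used by any particular docking program: the function name `pair_log_odds`, the contact list and the residue counts are all hypothetical, and the "expected by chance" term simply assumes residues pair independently according to their overall frequencies.

```python
import math
from collections import Counter

def pair_log_odds(observed_contacts, residue_counts):
    """Score each residue pair as log(observed frequency / expected frequency).

    observed_contacts: list of (res_a, res_b) pairs seen in contact (toy data)
    residue_counts:    Counter of how often each residue type occurs overall
    Positive score = pair seen more often than chance; negative = less often.
    """
    # Treat (K, L) and (L, K) as the same unordered pair.
    pair_counts = Counter(tuple(sorted(p)) for p in observed_contacts)
    total_pairs = sum(pair_counts.values())
    total_res = sum(residue_counts.values())

    scores = {}
    for (a, b), n_obs in pair_counts.items():
        fa = residue_counts[a] / total_res
        fb = residue_counts[b] / total_res
        # Chance expectation under independent pairing (assumption).
        expected = fa * fb if a == b else 2 * fa * fb
        scores[(a, b)] = math.log((n_obs / total_pairs) / expected)
    return scores

# Hypothetical toy data, for illustration only.
contacts = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "D")] * 2
counts = Counter({"L": 10, "K": 5, "D": 5})
scores = pair_log_odds(contacts, counts)
```

In this toy set the hydrophobic L-L pair and the oppositely charged D-K pair score above zero, while K-L scores below zero.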
4
Q

Template based prediction from homologous structure of a complex

A

X-ray structure of protein A’ complexed with protein B’
A’ is a homologue of A
B’ is a homologue of B
If the A/B interface is evaluated as favourable in 3D, then predict that A interacts with B

5
Q

Template-based modelling 1 - sequence search

A

Start with the sequences of protein A and protein B
  • Based on sequence similarity, search the library of complexes in the PDB for a complex A’/B’ where A is homologous to A’ and B is homologous to B’
  • Align sequence A to A’ and B to B’
  • Sequence search via BLAST, PSI-BLAST or a more advanced statistical model known as a hidden Markov model (HMM)

6
Q

Template-based modelling 2 - model construction

A

On the 3D structure of the complex, change the sequence from A’ to A and B’ to B
  • Adjust any loops where there is an insertion or deletion
  • Refine the complex

7
Q

Template-based modelling 3 - alternative model selection

A

Sometimes there are several suitable templates because several have similar sequence identity
  • Construct several models
  • Score the models (similar to ab initio docking)
  • Choose the best model
  • NB this is one approach, but there are several variations in template-based modelling

8
Q

Coevolution and protein interactions

A

  • The concept of correlated mutations extends to homo- and hetero-complexes
  • AlphaFold-Multimer and Colab can consider complexes
  • Active area of research – results very encouraging

9
Q

What is the gene ontology?

A

A controlled vocabulary that can be applied to all organisms
  • Used to describe gene products - proteins and RNA - in any organism
  • All descriptions are supported by some level of evidence

10
Q

How does GO work?

A

It captures information about 3 important features of function:
  • What does the gene product do?
  • Why does it perform these activities?
  • Where does it act?

11
Q

Describe the 3 gene ontologies

A

  • Molecular Function = biochemical function
      • the tasks performed by individual gene products; examples are carbohydrate binding and GTPase activity
  • Biological Process = biological goal or objective (higher-level function)
      • broad biological goals, such as mitosis or purine metabolism, that are accomplished by combinations of individual molecular functions
  • Cellular Component = active location
      • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

12
Q

Can you compare gene ontologies?

A

No, each branch is independent so you can’t really compare

13
Q

Where do annotations come from?

A

Annotation source is important
  • It enables you to assess how much confidence to place in the annotation
  • GO associates each annotation with an evidence code that indicates its source
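
As a rough illustration of using evidence codes to judge confidence, the sketch below sorts annotations into broad tiers. The evidence codes are standard GO codes, but the tier labels, the function name `confidence_tier` and the example gene names are assumptions made for illustration.

```python
# Standard GO evidence codes for experimental and electronic annotations.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC = {"IEA"}  # inferred from electronic annotation, not manually reviewed

def confidence_tier(evidence_code):
    """Map an evidence code to a broad, simplified confidence tier (assumed labels)."""
    if evidence_code in EXPERIMENTAL:
        return "experimental"
    if evidence_code in ELECTRONIC:
        return "electronic"
    return "other"  # e.g. sequence-similarity based or author statements

# Hypothetical annotations: (gene product, GO term ID, evidence code).
annotations = [
    ("geneA", "GO:0003924", "IDA"),  # GTPase activity, direct assay
    ("geneB", "GO:0003924", "IEA"),  # same term, electronic annotation
    ("geneC", "GO:0005634", "ISS"),  # nucleus, sequence similarity
]

# Keep only annotations backed by experimental evidence.
experimental_only = [a for a in annotations if a[2] in EXPERIMENTAL]
```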

14
Q

Uses of GO

A

  • Enhanced predictors of protein function return predictions of GO terms
  • Common features in a set of over-expressed / under-expressed genes can be reported as belonging to a common GO group
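
One common way to decide whether a set of genes shares a GO group more than expected is an over-representation test. The cards do not name a specific test; the sketch below assumes a hypergeometric tail probability, and the function name and counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits): chance of seeing at least `hits` genes with the GO term
    when `sample` genes are drawn at random from `population` genes,
    of which `annotated` carry the term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes in total, 5 carry the term,
# 5 genes are over-expressed and 3 of those carry the term.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=3)
```

A small p-value suggests the term is over-represented in the gene set; real analyses also correct for testing many GO terms at once.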

15
Q

Why do you want to computationally predict protein function?

A

There are about 240M sequences, but the time taken to determine a function experimentally can be several years

16
Q

Approaches to function prediction

A

General homology search
  • BLAST (single sequence information)
  • PSI-BLAST (multiple sequence information in a profile)
  • HMM scan (multiple sequence information in a hidden Markov model)

17
Q

Accuracy of database searching

A
18
Q

Orthologues and paralogues and their EC numbers

A

Orthologues: same EC number and high %id
Paralogues: different EC number and lower %id

19
Q

Relationship between %id and functional transfer

A

To some extent these concepts about EC numbers apply to all types of function, but there are no clear-cut rules
  • Function is challenging to quantify
  • Start with a query sequence
  • Perform a BLASTP search

20
Q

What can you predict based on the %id between two proteins?

A

  • If you find a protein from another species with >~50% identity, it could easily be an orthologue with the same function
  • If you find a protein from another species with >~30% identity, it could be a paralogue and have a related function
  • If you find a protein from the same species with >30% identity, it could be a paralogue and have related functions
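
These rules of thumb can be sketched as code. This is only an illustration of the thresholds above, not a reliable classifier. Two details are assumptions: `percent_identity` counts identical residues over aligned columns without gaps (other definitions of %id exist), and the return strings of the hypothetical `rule_of_thumb` are made up for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned, ungapped columns ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def rule_of_thumb(pid, same_species):
    """Apply the approximate %id thresholds; suggestions, not firm rules."""
    if not same_species and pid > 50:
        return "could be an orthologue with the same function"
    if pid > 30:
        return "could be a paralogue with a related function"
    return "no inference from %id alone"

# Toy alignment: 4 identical residues out of 5 ungapped columns.
pid = percent_identity("ACDE-G", "ACDQWG")
suggestion = rule_of_thumb(pid, same_species=False)
```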

21
Q

Further complications - Domains

A
  • Confident match by a search program
  • Search programs identify local similarities. Two sequences may share a region/domain of similarity but also other domains that are different.
  • But not all domains are shared – function probably different
22
Q

NetGo – Using different sources (Details not important)

A
23
Q

Structural searching

A

  • Use structural searching to identify a similar fold (e.g. DALI, or in CATH)
  • Similar 3D structure may suggest related function
  • If the functional site in the match is known, examine whether similar residues are present in your newly determined protein

24
Q

Can finding a similar fold tell you anything about function?

A

No, to find out about function you need to dig a little deeper

25
Q

Convergence of serine protease active site

A

Trypsin and subtilisin have quite different 3D folds
  • Trypsin and subtilisin have totally different sequences
  • Same three active site residues (Ser, His, Asp) located in very similar positions
  • Order along the chain of Ser, His, Asp is different
  • Infer convergence of function
  • A favourable arrangement of residues to perform a function that arose independently via evolution

26
Q

Principles for membrane proteins (non-polar environment)

A

  • Apart from the ends, the protein chain in the membrane is in a non-polar (i.e. hydrophobic) environment
  • Side-chains tend to be hydrophobic
  • Main-chain NH and CO groups cannot be left without hydrogen bonds
  • A single membrane-spanning region is one α-helix
  • Most (but not all) membrane-bound proteins are formed from α-helical segments

27
Q

Transmembrane helices

A

  • Hydrophobic section of membrane about 35 – 50 Å wide
  • The section that is not translocated (i.e. does not pass through the membrane) and abuts the membrane often contains +ve charges
  • Each residue in an α-helix advances the structure by 1.5 Å
  • Transmembrane helices tend to be between 20 and 30 residues long

28
Q

Simple methods to identify transmembrane regions

A

  • Early methods searched for runs of very non-polar (hydrophobic) residues along the sequence
  • Use a scale of how hydrophobic each residue is
      • Hopp & Woods scale
      • Kyte & Doolittle scale
  • Calculate the average over a window of several residues (typically 11)
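
The windowed average can be sketched as below, using the Kyte & Doolittle hydropathy values. The function name `hydropathy_profile` and the test sequence are hypothetical, and the window of 11 simply follows the card; real predictors add thresholds and further rules.

```python
# Kyte & Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=11):
    """Average hydropathy for every window of `window` consecutive residues.

    Entry i of the result covers residues i .. i + window - 1 (0-based).
    Returns an empty list if the sequence is shorter than the window.
    """
    values = [KD[res] for res in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical sequence: a hydrophobic run followed by a charged run.
profile = hydropathy_profile("L" * 11 + "K" * 11)
most_hydrophobic_start = profile.index(max(profile))
```

Peaks in the profile mark candidate transmembrane segments; here the peak is the first window, which lies entirely in the leucine run.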

29
Q

Signal peptides

A

Signal peptides are sequences at the start of proteins, ranging from 15 – 60 residues, that direct the protein to the correct cellular location.
  • Generally signal peptides are cleaved off
  • Often have a hydrophobic region followed by a pattern typical of the cleavage site
  • In predictions, need to distinguish transmembrane regions from signal peptides

30
Q

Current methods for transmembrane predictions

A

  • Until recently, prediction methods used HMMs (hidden Markov models) based on aligned sequences
  • DeepTMHMM – a deep learning approach that predicts transmembrane structures and signal peptides. The algorithm predicts how the sequence maps onto the different potential states from the N- to the C-terminus

31
Q

Low complexity regions

A

  • Regions with composition biased strongly towards a small number of amino acids
      • e.g. RRRRRRR, RSSRSRSSRSR, GGSSSSDDS
  • Occur in the sequences of a significant number of proteins
  • Distort statistical significance scores of alignments
  • The program SEG (by Wootton) is often used
      • Replaces low complexity regions with lower case in BLAST searches at NCBI
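
The idea of flagging compositionally biased regions can be sketched with a sliding-window Shannon entropy. This is a simplified stand-in and not the SEG algorithm itself; the function names, window length and threshold are illustrative assumptions.

```python
import math
from collections import Counter

def window_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Lower-case every residue lying in a window whose entropy <= threshold.

    `window` and `threshold` are illustrative values (assumptions).
    """
    seq = seq.upper()
    low = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            for j in range(i, i + window):
                low[j] = True
    return "".join(ch.lower() if flag else ch for ch, flag in zip(seq, low))

# Hypothetical sequence: a diverse stretch followed by a poly-R run.
masked = mask_low_complexity("ACDEFGHIKLMN" + "R" * 12)
```

Windows dominated by one or two residue types have low entropy and are written in lower case, mirroring how masked regions are displayed.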

32
Q

Coiled-coils

A

Two or three intertwined α-helices
  • Can be a short segment (20 residues) or far longer
  • Identified using COILS, written by Lupas

33
Q

Disordered proteins

A

  • Some globular proteins have small regions which are disordered and cannot be resolved by crystallography or NMR – often the N- or C-terminus or a long loop
  • Some proteins do not adopt a single structure but are highly flexible
  • Often the protein becomes structured when it binds to another protein

34
Q

Prediction of disordered regions

A

  • Based on principles – disordered regions tend to have a lower fraction of hydrophobic residues than a folded protein with a hydrophobic core
  • Machine learning based – most programs are now based on neural networks or support vector machines
      • PONDR-FIT (Dunker)
      • DISOPRED2 (Jones, UCL)
      • IUPred
      • (see Expasy page – list of programs)
  • Use AlphaFold prediction, where low confidence is given by pLDDT < 70 (see Piovesan et al., Protein Science, 31(11), e4466)
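
The pLDDT-based approach can be sketched as a scan for runs of residues below the cutoff. This assumes the per-residue pLDDT values (0 to 100) have already been extracted from the model; the function name, the `min_length` parameter and the example scores are hypothetical.

```python
def low_confidence_regions(plddt, cutoff=70.0, min_length=1):
    """Return (start, end) residue ranges, 1-based and inclusive,
    where pLDDT stays below `cutoff` for at least `min_length` residues."""
    regions = []
    start = None  # 0-based index where the current low-confidence run began
    for i, score in enumerate(plddt):
        if score < cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(plddt) - start >= min_length:
        regions.append((start + 1, len(plddt)))
    return regions

# Hypothetical per-residue scores for a 10-residue model.
scores = [45, 50, 60, 85, 90, 92, 65, 88, 40, 30]
candidate_disorder = low_confidence_regions(scores)
longer_only = low_confidence_regions(scores, min_length=2)
```

Low pLDDT is only a proxy for disorder, so these ranges are candidates to check against dedicated predictors.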