Structure bioinformatics - Jens lectures Flashcards

1
Q

Where can we find protein structures?

A

In the PDB database.

2
Q

Why do we classify proteins based on the structure?

A

To understand biological function.

Structural and functional similarity cannot always be understood at sequence level.

3
Q

What is a domain?

A

Domains are distinct structural and functional units of a protein; each domain has its own topology and fold.

4
Q

What are CATH and SCOP?

A

CATH and SCOP are databases that classify protein domains based on the organization of secondary structure elements. They identify structurally similar proteins regardless of function.

The data (protein structures) are organized on different hierarchical levels, whereas the PDB stores all structures without such a classification.

The goal is to understand the evolutionary relationships between proteins.

5
Q

Which is more conserved, sequence or fold?

A

The structure/fold is more conserved than the sequence: different sequences can have the same fold, and 20% of the folds account for 80% of the proteins.

Two different proteins can have a very low sequence identity but can be structurally very similar and have similar function.

A sequence identity over 40% indicates a similar fold.
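
As a rough illustration of the 40% rule of thumb, the sketch below computes percent identity from two already aligned sequences. The function name and the example sequences are hypothetical, and the choice to ignore gap columns in the denominator is an assumption (other conventions exist).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are skipped (assumed convention)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned pair: 6 identical residues out of 8 compared columns.
print(percent_identity("MKT-AYIAK", "MRTLAYLAK"))  # 75.0
```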

6
Q

Does a similar structure/fold indicate an evolutionary relationship and similar function?

A

Similar structure often indicates an evolutionary relationship and similar function between proteins but it is not always the case.

Proteins with very different folds can have the same function and vice versa.

Sequence and structural similarity together (common ancestor) is a strong indication of similar function.

7
Q

Explain the database SCOP

A

Protein structures in the PDB are organized according to their structural
and evolutionary relationships:

Class = proteins with similar secondary structure element composition.

Fold = proteins with similar secondary structure elements in a specific topology.

Superfamily = Proteins with low sequence identity but with common structural elements. Assumed common origin.

Family = Proteins with established common ancestor and conserved sequence.

8
Q

Explain the database CATH.

A

Alternative to SCOP.
Classifies protein structures in the PDB to identify structural and evolutionary relationships.

Class = proteins with similar secondary structure element composition (mainly alpha, mainly beta, mixed).

Architecture = overall domain shape, given by the orientation of secondary structure elements.

Topology = orientation of secondary structure elements and the connections between them.

Homologous superfamily = evolutionarily related proteins.

9
Q

What are motifs?

A

If the similarity between an unknown sequence and a sequence of known function is limited to a few residues, standard sequence alignment may fail to detect it, even though proteins with very low sequence identity can still have the same fold and function.

We can instead search for short patterns called sequence motifs, which can be valuable for predicting the function of a protein. The residues in a motif do not need to be close together in sequence, only in the 3D structure.

An example of a motif is an enzyme’s active site.

10
Q

Give examples of databases for motif/pattern searching.

A

PROSITE
Pfam
InterPro

11
Q

What can you search for in PROSITE?

A

Search with a sequence of unknown function to find functional motifs and make predictions of function based on these.

These databases can give many false positives since they are based on short sequences.

12
Q

What can you search for in Pfam?

A

Functional motifs. Pfam gives a more complex description than PROSITE.

Large collection of multiple sequence alignments of protein
domains or conserved regions.

Hidden Markov Model (HMM) based profiles are used to
represent protein families or domains.

By searching a protein sequence against the Pfam library of
profile HMMs, you can determine which domains it carries.

As common patterns are recognized, protein function can be
assigned even for sequences with low homology.

13
Q

What is DALI?

A

A program that compares protein structures in 3D to reveal similarities that can indicate similar functions.

It does this by comparing the alpha-carbon backbone coordinates of proteins (the distances between the alpha carbons within each structure) and searching for structurally related proteins in the PDB.

14
Q

Why do we need protein structure prediction?

A
  • We can understand function from structure.
  • It is difficult to determine structures experimentally.
  • The gap between the size of the sequence data and structure data is large and increasing.
15
Q

Give examples of how we can define secondary structures.

A

With phi/psi angles in a Ramachandran plot.

With hydrogen bonding patterns.

16
Q

How can we assess a secondary structure prediction?

A

By looking at the percentage of correctly predicted secondary structures (Q3).

Note that random Q3 = 33%.
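
A minimal sketch of the Q3 calculation, assuming the predicted and observed structures are given as strings over the three states H (helix), E (strand) and C (coil). The function name and the example strings are hypothetical.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical example: 8 of 10 residues are assigned the correct state.
print(q3("HHHHCCEEEC", "HHHCCCEEEE"))  # 80.0
```

With three equally likely states, guessing at random gives a Q3 of about 33%, which is why that value serves as the baseline.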

17
Q

What is the Chou-Fasman method?

A

A method for predicting secondary structures.

There are certain probabilities of certain amino acids to occur in helices/sheets. The Chou-Fasman method looks at these probabilities without context.

The amino acids are defined as helix/beta “formers”, “indifferent” and “breakers”.

Ex. Chou-Fasman would say that the occurrence of alanine gives a high probability of the structure being a helix, without looking at the context (what is around that alanine?). In reality, the secondary structure is context dependent and changes with the environment and the surrounding amino acids.

Accuracy of about 50-60%.

18
Q

What is the GOR method?

A

The Garnier, Osguthorpe and Robson method is an improvement of the Chou-Fasman method for prediction of secondary structures.

It incorporates effects of local interactions (context).

The assignment of secondary structure to residue j depends on the neighbouring residues j-8 to j+8. This means that the method does not look at one amino acid at a time but at a window of 17 amino acids; as the window slides along the sequence, each residue gets a prediction based on its own window.

Q3 = around 65%.
GOR5 with improvements has Q3 = around 70%.
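
The window idea can be sketched as follows. This only shows how the 17-residue windows (j-8 to j+8) are cut out; it does not implement the GOR scoring itself. The function name and the padding character are assumptions made for illustration.

```python
def residue_windows(sequence: str, half_width: int = 8, pad: str = "-") -> list[str]:
    """Return one window of 2*half_width + 1 residues per position j.

    The sequence ends are padded so that terminal residues also get a full window.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[j:j + size] for j in range(len(sequence))]


sequence = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical 20-residue sequence
windows = residue_windows(sequence)
print(len(windows[0]))  # 17
print(windows[0])       # the window centered on the first residue
```

The residue being predicted always sits at the center of its window (index 8 when half_width is 8).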

19
Q

Does GOR perform better for helices or beta strands?

A

GOR performs better for helices than for beta strands.

20
Q

What does the GOR method not take into account?

A

It does not take into account the predictions for neighboring residues.

21
Q

What is the PHD method?

A

A more advanced method for predicting secondary structures.

It is a machine learning method trained on proteins of known secondary structure.

22
Q

How do you know if an amino acid sequence has the structure of a helix or beta strand considering the Chou-Fasman method?

A

For every amino acid in the sequence, check whether it is a former, breaker or intermediate for helices and for strands, and assign it to the structure for which it is the strongest former.

Then add up how many amino acids indicate helices and how many indicate strands, and choose the majority.
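
The counting procedure above can be sketched like this. It is a simplified illustration, not the full Chou-Fasman algorithm. The propensity table covers only a few residues, and its values are approximate and included for illustration only; check them against a published table before relying on them. The function name and the cutoff of 1.0 for "former" are assumptions.

```python
# Approximate (helix, strand) propensities for a few residues, illustration only.
PROPENSITY = {
    "A": (1.42, 0.83), "E": (1.51, 0.37), "L": (1.21, 1.30),
    "M": (1.45, 1.05), "V": (1.06, 1.70), "I": (1.08, 1.60),
    "Y": (0.69, 1.47), "G": (0.57, 0.75), "P": (0.57, 0.55),
}


def majority_structure(segment: str) -> str:
    """Let each residue vote for the structure it forms most strongly, then take the majority."""
    helix_votes = 0
    strand_votes = 0
    for aa in segment:
        p_helix, p_strand = PROPENSITY[aa]  # raises KeyError for residues not in the table
        if max(p_helix, p_strand) <= 1.0:
            continue  # not a former for either structure: no vote
        if p_helix > p_strand:
            helix_votes += 1
        elif p_strand > p_helix:
            strand_votes += 1
    if helix_votes > strand_votes:
        return "helix"
    if strand_votes > helix_votes:
        return "strand"
    return "undecided"


print(majority_structure("AEMLA"))  # helix
print(majority_structure("VIYVG"))  # strand
```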

23
Q

What is the PHD method?
- What are the different steps in the method?

A

An advanced 3rd generation method for secondary structure prediction of protein sequences based on machine learning and neural networks. It takes evolutionary information into account.

The first step is a profile generation of the sequence based on an MSA.

There is an input layer, hidden layer and an output layer.

The hidden layer consists of two neural networks that make predictions about the secondary structure, followed by a jury decision that takes the outputs of the neural networks as input and selects the prediction that was made the highest number of times.

The first neural network takes the profiles of the sequence and gives a prediction of structure.

The second neural network takes the output from the first as input and makes a second prediction based on that (correlates predictions of neighbouring
residues).
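
The profile generation step can be illustrated with a simple column-wise frequency profile of an MSA. This is only a sketch under the assumption that a profile is a per-column amino acid frequency table; the actual input encoding of the method may differ. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_profile(msa: list[str]) -> list[dict[str, float]]:
    """Per-column amino acid frequencies of an MSA, ignoring gap characters."""
    if not msa:
        raise ValueError("the alignment must contain at least one sequence")
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have equal length")
    profile = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical alignment of four short sequences.
msa = ["MKVL", "MRVL", "MKIL", "M-VL"]
for position, frequencies in enumerate(column_profile(msa)):
    print(position, frequencies)
```

A conserved column (here the first and last) has a single residue with frequency 1.0, while variable columns spread the frequency over several residues.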

24
Q

What makes the PHD method better than, for example, Chou-Fasman and GOR?

A

It is a more complex model that also incorporates evolutionary information which gives better results such as:

  • Higher accuracy (Q3 above 70%)
  • Better beta-strand predictions
  • Better predictions for larger sequence alignments.
25
Q

What is the difference between 1st, 2nd and 3rd generation methods for secondary structure prediction?

A

1st:
Based on single-residue propensities (no context).
Chou-Fasman, Q3 around 50%.

2nd:
Context dependent.
GOR, Q3 around 60-70%.

3rd:
Evolutionary information, neural networks.
PHD and PSIPRED, Q3 > 70%.

26
Q

Name the different membrane protein structures.

A

Helices
Helix bundles
Beta barrels

27
Q

Describe the characteristics of a helix bundle.

A

Helices are 15-30 residues long.

Tightly packed helices with coiled-coil structure (multiple helices coiled together).

The helices are predominantly apolar/hydrophobic.

Positive residues such as K and R are more common on the cytosolic side.

Aromatic residues such as Y and W are more common at the ends of helices.
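
Because membrane-spanning helices are predominantly hydrophobic, a sliding-window hydropathy scan is one simple way to spot candidate segments. The sketch below uses the Kyte-Doolittle hydropathy scale, which is not mentioned in these cards and is brought in here only for illustration. The window of 19 residues, the function name and the example sequence are likewise assumptions.

```python
# Kyte-Doolittle hydropathy values (higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> list[float]:
    """Mean hydropathy of every full window along the sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical sequence: a hydrophobic stretch flanked by lysines.
sequence = "K" * 10 + "L" * 19 + "K" * 10
profile = hydropathy_profile(sequence)
best_start = max(range(len(profile)), key=lambda i: profile[i])
print(best_start)  # start index of the most hydrophobic window
```

Windows with a high mean hydropathy are candidates for transmembrane helices; a scan like this ignores the other features listed above, such as the distribution of K/R and aromatic residues.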

28
Q

Describe the characteristics of a beta barrel.

A

The strands in the barrel have hydrophobic residues at every second position.

Beta-strands flanked by aromatic residues.

They are more difficult to predict than transmembrane alpha-helices.

29
Q

What is the TMHMM method?

A

A method to predict transmembrane helices in a protein using a hidden Markov model (HMM).

HMMs are trained on known examples of transmembrane and non-transmembrane regions in protein sequences. The model learns the characteristics of these regions and uses this knowledge to predict the likelihood of transmembrane helices in a given protein sequence.

30
Q

In fold recognition (threading), what is the compatibility score based on?

A

In the GenTHREADER approach we look at local and global measures for scoring.

Local measures:
secondary structure preferences - specific aa are more probable in secondary structures.

Global measures:
If there are clashes in the structure.

If there is a hydrophobic core.

If there are reasonable bonds, etc.

30
Q

What are the different categories for methods for structure prediction?

A

de novo
fold recognition
Homology modeling

30
Q

What is the difference between homology modeling and fold recognition (threading)?

A

Unlike homology modeling, threading is not based on sequence identity but is based on the fact that there is a limited number of folds that a protein can have.

In threading you simply thread the target sequence onto different fold possibilities and look at compatibility scores to find the fold that best fits the sequence.

31
Q

Explain homology modeling.

A

Identify candidate structure templates (proteins that have structures in the PDB), then align the target to those templates to find the ones with the highest sequence identity. For the parts that have high sequence identity, transfer the backbone coordinates of the template to the target. Then evaluate the model.

32
Q

When we evaluate the quality of a homology model, what do we look at?

A

We look at how protein-like the structure is.

Stereochemistry:
- bond length and angles
- peptide bond planarity
- torsion angles
- clashes
- Ramachandran plots

Spatial features:
- hydrophobic core
- solvent accessibility
- distribution of charged groups

If we get big insertions in secondary structures then something is off.

There are servers that do these evaluations (PROCHECK, MolProbity).

33
Q

If your sequence identity is in the twilight zone, what can you do to predict the structure?

A

Being in the twilight zone means that the sequence identity is lower than around 30%. In these situations you cannot reliably do homology modeling, and you should use fold recognition instead.

34
Q

In homology modeling, what do you do for the side chains if they have overlapping atoms vs non-overlapping atoms?

A

If the side chain is conserved and if atoms are overlapping, coordinates are copied.

For non-overlapping atoms in the side-chains use a rotamer library and choose the rotamer that does not clash with a residue.

35
Q

Explain how the hidden Markov model works in the TMHMM for prediction of transmembrane helices.

A

Every amino acid in the query sequence is assigned to one of the following states, which lie outside the cell, inside the cell or in the membrane:

  • Core TM helix
  • Helical tails
  • loop on cytoplasmic side
  • short loop
  • long loops outside of the cell
  • globular-domain inside loop.

Each state has a probability distribution over the amino acids, and there are probabilities for moving between states. Moving through the query sequence, the HMM chooses the state path that has the highest probability, which gives the most probable topology.
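
Finding the most probable state path in an HMM is typically done with the Viterbi algorithm. The sketch below is a toy model, not TMHMM: it has only two states ("loop" and "membrane"), it emits residue classes (hydrophobic or polar) instead of individual amino acids, and every probability in it is a made-up value for illustration.

```python
import math

STATES = ("loop", "membrane")
START = {"loop": 0.8, "membrane": 0.2}            # hypothetical start probabilities
TRANSITION = {                                    # hypothetical transition probabilities
    "loop": {"loop": 0.9, "membrane": 0.1},
    "membrane": {"loop": 0.1, "membrane": 0.9},
}
HYDROPHOBIC = set("AVLIMFWC")


def emission(state: str, aa: str) -> float:
    """Hypothetical emission probability of the residue class (hydrophobic or polar)."""
    hydrophobic = aa in HYDROPHOBIC
    if state == "membrane":
        return 0.8 if hydrophobic else 0.2
    return 0.3 if hydrophobic else 0.7


def viterbi(sequence: str) -> list[str]:
    """Most probable state path for the sequence, computed in log space."""
    if not sequence:
        return []
    score = {s: math.log(START[s]) + math.log(emission(s, sequence[0])) for s in STATES}
    backpointers = []
    for aa in sequence[1:]:
        new_score = {}
        pointers = {}
        for s in STATES:
            # Best previous state to arrive from when entering state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANSITION[p][s]))
            new_score[s] = score[prev] + math.log(TRANSITION[prev][s]) + math.log(emission(s, aa))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("KKDEK" + "L" * 10 + "KDEKK")
print("".join("M" if s == "membrane" else "-" for s in path))
```

The real model has more states (helix core, helix caps, short and long loops, globular regions) and per-amino-acid emission distributions, but the path search follows the same principle.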

36
Q

What can you do to improve the accuracy of secondary structure prediction methods like GOR?

A

Use several of the methods and take the consensus prediction, which combines the results of all the methods for each residue, as the result.
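
One possible reading of "consensus" is a per-residue majority vote over the predictions of the individual methods. The sketch below assumes exactly that; the function name, the tie rule (ties fall back to coil) and the example predictions are hypothetical.

```python
from collections import Counter


def consensus(predictions: list[str]) -> str:
    """Per-residue majority vote over several H/E/C prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        result.append("C" if tied else top_state)  # assumption: ties become coil
    return "".join(result)


# Hypothetical output of three different methods for the same five residues.
print(consensus(["HHHEC", "HHCEC", "HCCEE"]))  # HHCEC
```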

37
Q

What is the difference between using methods like GOR and Jpred?

A

Methods like GOR look only at the query sequence and make a prediction based on the structural preferences of individual amino acids or of windows of several amino acids. Some of these models include context, but they do not include evolutionary information.

Methods like Jpred also look at homologous sequences and make predictions about the target based on information from the homologous proteins.

This is good because secondary and tertiary structures are conserved between homologous sequences. Insertions and deletions can also be found; these are less likely to be in secondary structure elements and can be predicted to be in coils.

38
Q

Describe GPCRs

A

GPCRs are all transmembrane proteins, and they consist of seven alpha-helices passing through the membrane, with each helix spanning the bilayer roughly perpendicular to the membrane plane.

39
Q

What does homology modeling rely on?

A

The fact that structure is more conserved than sequence in evolution.

40
Q

When should you use fold recognition (threading) for structure prediction?

A

Fold recognition is used if you cannot find any templates with high sequence identity to your target and you want to predict the overall structure of a protein, meaning that you are not very interested in the atomic details.

41
Q

What is the assumption that fold recognition makes?

A

It assumes that a protein with a similar fold has been observed before. This is very likely.

42
Q

If your sequence identity is in the twilight zone, what methods for structure prediction are good options?

A

Ab initio modeling
Fold recognition
AlphaFold

43
Q

What is energy minimization in homology modeling?

A

A technique applied to the predicted structure to find the most energetically favorable conformation of the molecule. The aim is to refine and relax the structure and make it more physically plausible.

It is often performed using force fields and molecular dynamics simulations.

44
Q

Describe the fold recognition process.

A
  • Construct a template library (CATH and SCOP can be used to select a representative set of folds from the PDB).
  • Sequence-to-structure alignment (the threading).
  • Design a scoring function.
  • Template selection and model construction.
45
Q

Explain how the threading can be done in fold recognition. Which method is the most efficient?

A

You can thread your target to all fold templates and score all of them and find the optimal one - time consuming and challenging.

You can score only a few of the threadings and then improve the alignments using sequence profiles for the templates - faster and works relatively well.

46
Q

What should you consider when doing a fold recognition of a protein that contains several domains?

A

You should do separate searches for each domain.

47
Q

What is a decoy?

A

A molecule that is expected not to bind, with properties similar [2] (logP, molecular weight, number of rotatable bonds, ...) to those of a ligand, but with a different topology (molecular connectivity).

48
Q

What is decoy and ligand enrichment in molecular docking?

A

Does the algorithm favor ligands or decoys? It is an analysis to test if the algorithm can tell the difference between actives and non-actives and in a way test the rate of false positives (chooses decoys).

Ligand enrichment will be viewed as a curve far to the left of the random curve in a graph with decoys on x and ligands on y.

Decoy enrichment will be viewed as a curve far to the right.
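
An enrichment curve like the one described can be computed from a ranked list of docking scores. The sketch below assumes that a lower score means a better predicted binder; the function names and the example scores are hypothetical. The area under the curve is 0.5 for a random ranking, close to 1 for strong ligand enrichment and close to 0 for decoy enrichment.

```python
def enrichment_curve(scored: list[tuple[float, bool]]) -> list[tuple[float, float]]:
    """Points (fraction of decoys found, fraction of ligands found) down the ranked list.

    scored holds (docking score, is_ligand) pairs; lower scores rank first (assumption).
    """
    ranked = sorted(scored, key=lambda item: item[0])
    n_ligands = sum(1 for _, is_ligand in ranked if is_ligand)
    n_decoys = len(ranked) - n_ligands
    if n_ligands == 0 or n_decoys == 0:
        raise ValueError("need at least one ligand and one decoy")
    points = [(0.0, 0.0)]
    ligands_found = decoys_found = 0
    for _, is_ligand in ranked:
        if is_ligand:
            ligands_found += 1
        else:
            decoys_found += 1
        points.append((decoys_found / n_decoys, ligands_found / n_ligands))
    return points


def area_under_curve(points: list[tuple[float, float]]) -> float:
    """Trapezoidal area under the enrichment curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical screen in which both ligands score better than both decoys.
screen = [(-10.0, True), (-9.0, True), (-5.0, False), (-4.0, False)]
print(area_under_curve(enrichment_curve(screen)))  # 1.0
```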

49
Q

What is the docking score?

A

The sum of the estimated interaction energies between ligand and protein (for example electrostatic interactions). A low energy indicates favorable docking.

50
Q

Why do we need decoys for molecular docking?

A

Decoys are molecules that are very similar to the ligands but they are not active. We allow the algorithm to dock these and learn that they get low docking scores and should not be docked in the future.

51
Q

Why is ligand recognition important? What drives the binding between ligand and protein?

A

That a ligand recognizes the active site of a protein is the key for protein function and drug design.

  • Negative free energy
    (ΔG = ΔH - TΔS < 0)
  • Shape complementarity
52
Q

How does free energy drive ligand-protein binding?

How is this important for drug design?

A

If the free energy from the interactions between ligand and protein is negative it means that it is energetically favorable for them to form a complex.

This means that we can predict which ligands will interact with proteins and improve their binding by altering the chemical structures of the ligand.
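
A small worked example of the free energy relation ΔG = ΔH - TΔS. The numbers are made up for illustration, and the units (kJ/mol for ΔH, kJ/(mol·K) for ΔS, kelvin for T) are an assumption of this sketch.

```python
def delta_g(delta_h: float, temperature: float, delta_s: float) -> float:
    """Gibbs free energy change: delta_h in kJ/mol, temperature in K, delta_s in kJ/(mol*K)."""
    return delta_h - temperature * delta_s


# Hypothetical binding event: favorable enthalpy, unfavorable entropy.
dg = delta_g(delta_h=-40.0, temperature=298.0, delta_s=-0.05)
print(round(dg, 1))                                  # -25.1
print("binding favorable" if dg < 0 else "binding unfavorable")
```

In this example the enthalpy gain outweighs the entropy loss at 298 K, so ΔG is negative and complex formation is favorable.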

53
Q

What is molecular docking? What is it replacing and why is it better?

A

Molecular docking is computational structure-based drug discovery.

We want to rapidly predict protein-ligand complexes by looking at the protein structure.

This is more time and cost efficient than high-throughput screening for a ligand, where the hit rates are also very low and there are many false positives.

54
Q

What are the two steps of the
docking algorithm?

A
  1. Sampling:
    Generation of orientations of molecule in the receptor.
  2. Scoring:
    Determine which orientation has the lowest energy.

Scoring is more difficult than sampling.

55
Q

How can we enable fast sampling of orientations in molecular docking?

A

By making approximations like:

  • Freeze the protein to only look at the crystal structure of the binding pocket without any flexibility.
  • Do not include water molecules explicitly. Important water molecules can be treated as part of the protein.
  • Constrain search to only one binding site on the protein (If you know where you want your ligand to bind).
  • Only treat the ligand as flexible. Sample different orientations in the binding site.
56
Q

What are the different types of scoring functions in molecular docking?

A
  • Force field based:
    Score is calculated from molecular mechanics force fields.
  • Empirical:
    Empirical scoring functions assign a score to a ligand pose based on empirical rules and parameters derived from experimental data.

    We can look at, for example, hydrogen bonds and hydrophobic interactions and fit their contributions to experimental binding affinities.

  • Knowledge-based:
    Collect statistics on which interactions are favorable and score each interaction based on its probability.
57
Q

How do the different scoring functions in molecular docking treat a hydrogen bond in the prediction?

A

The force field based function calculates the energy of the bond with Coulomb’s law.

The empirical ones look at the distance and angle of the bond and score it based on how these parameters look in experimental data. If the atoms are within 2.8 Å and have a good angle, the bond gets a favorable score.

The knowledge-based method looks at the probability of this bond happening based on statistics from PDB. High probability gives high score.
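
The force field view of a hydrogen bond can be illustrated with Coulomb’s law between two partial charges. This is a sketch only: the conversion constant of about 332.06 kcal·Å/(mol·e²) is the value commonly used with charges in units of e and distances in Å, while the partial charges, the distance and the function name are hypothetical.

```python
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), for charges in e and distances in Angstrom


def coulomb_energy(q1: float, q2: float, distance: float, dielectric: float = 1.0) -> float:
    """Electrostatic energy in kcal/mol between two point charges."""
    if distance <= 0 or dielectric <= 0:
        raise ValueError("distance and dielectric must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * distance)


# Hypothetical donor hydrogen (+0.4 e) and acceptor oxygen (-0.5 e), 1.9 Angstrom apart.
energy = coulomb_energy(0.4, -0.5, 1.9)
print(energy < 0)  # True: opposite charges attract, so the energy is negative
```

Real force fields add further terms (for example van der Waals interactions) and often a distance-dependent dielectric, which this sketch leaves out.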

58
Q

Describe the empirical scoring function in molecular docking.

A

The empirical scoring function is a way to see how well an orientation of a ligand forms a complex with the protein in terms of binding affinity.

The function looks at the energy change that happens when the ligand binds to the protein in the following parameters and scores based on how they look in experimental data:

  • Solvation: changes in interactions with the solvent.
  • Conformation.
  • Interaction: energy that comes from interactions, based on distance, angle, etc.
  • Rotatable bonds: reduction of conformational freedom of the ligand’s
    rotatable bonds.
  • trans/rot: loss of translational and rotational freedom for ligand.
  • Vibrations: changes in bond and angle vibrations (often ignored)
59
Q

What is redocking?

A

Redocking is a way to see if a molecular docking algorithm works and finds what it should find.

For this we collect experimentally determined complexes and take them apart to see if the algorithm will predict the complex.

To measure success we look at the similarity between experimental ligand pose and the one that the algorithm decides on (RMSD).

The algorithms can find ligand poses very similar to the experimental one (low RMSD), but not the exact one.
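
The RMSD between the experimental pose and the docked pose can be computed directly from matching atom coordinates. The sketch assumes both poses list the same atoms in the same order and share the receptor’s coordinate frame, so no superposition is done. The function name and the coordinates are hypothetical.

```python
import math

Coordinate = tuple[float, float, float]


def pose_rmsd(pose_a: list[Coordinate], pose_b: list[Coordinate]) -> float:
    """Root-mean-square deviation (same unit as the input) between matching atoms."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


# Hypothetical three-atom ligand; the docked pose is shifted by (3, 4, 0) Angstrom.
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked = [(x + 3.0, y + 4.0, z) for x, y, z in experimental]
print(pose_rmsd(experimental, docked))  # 5.0
```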

60
Q

What is ligand enrichment?

A

A way to test if a molecular docking algorithm works by seeing if it can find ligands among decoys.

Performance depends on the target and the software.

61
Q

What is relative affinity prediction?

A

A way to test the molecular docking algorithm to see if it can predict the complex with highest binding affinity out of multiple closely related ligands with known affinity to the protein.
This prediction is generally very weak.

62
Q

What is the workflow for molecular docking when you do not have a ligand in mind?

A

If you do not have a ligand in mind then you can do a virtual screen for suitable structures from chemical libraries and do a docking screening to the protein and choose the ones that seem to fit. Then run these through a docking algorithm.

This is much more time and cost efficient than high-throughput screening since it is computational.

63
Q

What approximations do we make in molecular docking that lower the accuracy of the algorithm? Why do we make them?

A

We make approximations like:

  • We do not look at the explicit water molecules around the protein.
  • We only treat the ligand as flexible, not the binding pocket of the protein.
  • We only look at one active site when sampling for compounds.

We make these approximations to keep the algorithm fast, since it is supposed to be able to screen a large number of possible compounds.

64
Q

What tests do we do to see how well a molecular docking algorithm performs?

A

Ligand enrichment - can the algorithm find active ligands out of closely related non-active decoys?

Redocking - can the algorithm predict complexes similar to experimental complexes that we have taken apart?

Relative affinity prediction - can the algorithm find the complex with the highest binding affinity out of multiple closely related ligands?

65
Q

How does the PHD method improve on the Chou-Fasman method?

A

GOR and Chou-Fasman are purely sequence based, while the PHD method incorporates evolutionary information from multiple sequence alignments.

66
Q

Where do we find the different structures in a Ramachandran plot?

What can two outliers in the plot imply?

A

Top left - beta sheet
Lower left - right-handed helix
To the right - left-handed helix

If we only have two outliers, they could be the termini of the peptide, because these vary a lot and could, for example, be loops.

If we have many outliers then that would instead indicate that the structure is not optimal.
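
The three regions can be sketched as a crude lookup on the phi/psi angles. The boundaries used here (phi at 0 degrees, psi at 45 degrees) are rough assumptions for illustration, not the contours used by validation tools, and the sketch ignores that the beta region wraps around at psi = ±180 degrees. The function name is hypothetical.

```python
def ramachandran_region(phi: float, psi: float) -> str:
    """Very rough region lookup for a (phi, psi) pair given in degrees."""
    if not (-180.0 <= phi <= 180.0 and -180.0 <= psi <= 180.0):
        raise ValueError("phi and psi must lie in [-180, 180] degrees")
    if phi >= 0.0:
        return "left-handed helix region"   # right side of the plot
    if psi >= 45.0:
        return "beta-sheet region"          # top left of the plot
    return "right-handed helix region"      # lower left of the plot


print(ramachandran_region(-60.0, -45.0))   # right-handed helix region
print(ramachandran_region(-120.0, 130.0))  # beta-sheet region
print(ramachandran_region(60.0, 45.0))     # left-handed helix region
```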

67
Q

What are the pros/cons of the different experimental methods for structure determination?

A

X-ray crystallography:
High resolution.
Proteins must form crystals.
Flexible regions may lack clear electron density, so we cannot always see the entire protein in the density map.

NMR:
Good for small, soluble proteins.

3D electron microscopy (cryo-EM):
Good for larger complexes.

68
Q

Could you use x-ray crystallography to determine the structure of a disordered protein sequence?

A

No, because disordered proteins lack a stable, ordered structure and therefore usually do not form crystals.