Bioinformatics & homology modelling Flashcards

Question 1

Q

What is a homologue? How can they be identified?

What is an orthologue?

A paralogue?

What is a databank?

A database?

What is a primary databank?

Secondary?

Answer

A

A gene or protein that shares a common ancestor with another. Rule of thumb for identifying them: >35% sequence identity over 100 AA (but sequences with lower identity can also be homologues)
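The identity rule of thumb above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, the two sequences are assumed to be already aligned (same length, "-" for gaps), and identity is taken over gap-free columns, which is one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the gap-free columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    print(percent_identity("ACDE", "ACDF"))
```

A result above 35 over roughly 100 aligned residues would meet the rule of thumb; a lower value does not rule out homology.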

Homologues that arose from a speciation event (e.g. haemoglobin in mice vs humans)

Homologues that arose from gene duplication within a species (e.g. haemoglobin vs myoglobin)

Collection of data without a fixed query tool

Structured collection of data with a fixed search tool

Raw sequence data (maybe with detailed annotations)

Derived data, highly annotated (e.g. sequence profiles)

Question 2

Q

What do primary databanks typically contain?

What is wrong with data from primary databanks?

What do secondary databanks contain?

How is a dotplot measured and scored for sequence alignment?

How does the Dayhoff matrix score amino acid substitutions in a sequence alignment?

What’s the difference between global and local sequence alignment?

Answer

A

Sequence data, feature information, translations & predictions from genomic data

Much of it is prediction, so it remains hypothetical until experiments are done to verify it

Derived information, patterns that characterise a protein family & detailed annotations

Compares 2 sequences, one along each axis of a grid, to see where they align. Each cell is scored on amino acid similarity: 1 for a match, 0 for a mismatch, so aligned regions show up as diagonal runs
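A minimal sketch of the 1/0 dotplot scoring described above. The function names are hypothetical and the scoring is plain identity (no windowing or similarity matrix).

```python
def dotplot(seq_a, seq_b):
    """Return a grid with 1 where residues match and 0 where they differ.

    Rows follow seq_a, columns follow seq_b.
    """
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    # Two identical sequences give an unbroken main diagonal
    print(render(dotplot("HEAGAWGHEE", "HEAGAWGHEE")))
```

An insertion or deletion in one sequence shifts the diagonal sideways, which is how indels show up in the plot.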

Scores are log-odds values: a highly positive score means the substitution is very likely, a very negative score means it has a low chance, and 0 = substitution at the expected base rate
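The sign convention above follows directly from taking a logarithm of observed over expected frequency. A small sketch, assuming a base-10 log and made-up frequencies (neither the base, the scaling, nor the numbers come from the cards):

```python
import math


def log_odds(observed_freq, expected_freq):
    """Log of how much more (or less) often a substitution is seen than expected by chance."""
    return math.log10(observed_freq / expected_freq)


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only
    print(log_odds(0.04, 0.01))    # seen more often than chance: positive
    print(log_odds(0.01, 0.01))    # seen at the expected rate: zero
    print(log_odds(0.0025, 0.01))  # seen less often than chance: negative
```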

Global = aligns the sequences over their whole length
Local = aligns only the best-matching subregions of the sequences
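The difference can be seen in a small dynamic-programming sketch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration; the global branch follows the Needleman-Wunsch recurrence and the local branch the Smith-Waterman recurrence, and only the score is returned, not the alignment itself.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score: global (whole length) or local (best subregion)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the ends of the sequences
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align the two residues
                       score[i - 1][j] + gap,       # gap in seq_b
                       score[i][j - 1] + gap)       # gap in seq_a
            if local:
                # Local alignment may restart anywhere, so scores never go below 0
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("XXXACDYYY", "ACD"))              # global
    print(alignment_score("XXXACDYYY", "ACD", local=True))  # local
```

With these sequences the local score rewards the shared ACD region only, while the global score is dragged down by the gaps needed to span the whole of the longer sequence.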

Question 3

Q

Which alignment method is used at each level of percentage identity?
40%+
25%+
10%+
0%+

Given E = pN, what is the threshold for a homologue?

Which secondary databank uses patterns such as [FY]-C-x(7,8)-{P}, and what do the brackets mean?

What do secondary database annotations include?

What is wrong with the annotation in the genome world?

Answer

A

40%+: automatic pairwise alignment
25%+: consensus- conserved patterns (expected AA at position)
10%+: profile- frequency of each amino acid at a position
0%+: structure prediction- predict 3D & functional similarity

E ≤ 0.1
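A quick worked version of the threshold check. The cards do not define the symbols, so this sketch assumes the usual reading: p is the probability of a match this good arising by chance and N is the number of sequences searched. The function names and example numbers are hypothetical.

```python
def expect_value(p_value, database_size):
    """E = pN: expected number of chance hits in a database of N sequences."""
    return p_value * database_size


def passes_threshold(p_value, database_size, threshold=0.1):
    """True when the E-value is at or below the homologue threshold."""
    return expect_value(p_value, database_size) <= threshold


if __name__ == "__main__":
    print(expect_value(1e-7, 500_000), passes_threshold(1e-7, 500_000))
    print(expect_value(1e-5, 500_000), passes_threshold(1e-5, 500_000))
```

The same p gives a larger E in a larger database, which is why the E-value is a more useful cut-off than the raw probability.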

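PROSITE. [ ] = any one of the listed residues (or), { } = any residue except those listed (not), ( ) = repeat count or range, x = any residue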
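One way to see what the brackets mean is to translate the pattern into a regular expression. This is a simplified sketch: the function name is hypothetical and it handles only the elements in the example pattern (residues, [ ], { }, x and repeat counts), not anchors or other PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional repeat such as (7,8)
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                      # x = any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {P} = anything except P
        # [FY] and single residues are already valid regex
        if repeat:
            core += "{" + repeat + "}"      # (7,8) = 7 to 8 repeats
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[FY]-C-x(7,8)-{P}")
    print(regex)
    print(bool(re.search(regex, "FCAAAAAAAG")))
```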

References, methods, cross links to other databases, feature tables, descriptions, authors

It is based on computer analysis, and getting it right is expensive. Poor annotation negates the benefit of having genome sequence data altogether

Question 4

Q

What is comparative modelling?

What is a better way of identifying a homologue than sequence identity?

Why do comparative modelling? (3)

What is required? (5)

What are the 8 steps?

Answer

A

Building a 3D model of a protein based on the known structure of a homologous protein (>35% identity over 100 AA, but can have lower seq identity)

E = pN (E ≤ 0.1)

Volume of sequence data > volume of structural data
X-ray crystallography = time-consuming, expensive, protein must be crystallised
NMR = expensive, protein must be purified at high conc

Protein sequence & related sequences of known structure
alignment between sequences
method for performing modelling
target (sequence/structure to model)
parent/template sequence/structure for basis

identify parents/templates
align target sequence with parents
find structurally conserved regions & variable regions
inherit SCR (structurally conserved) from the parents
build SVRs
build sidechains
refine the model
evaluate the errors in your model

Question 5

Q

identifying the parents/template

How can you find it?

What do you need to find?

Answer

A

Search the Protein Data Bank with your target sequence using FASTA or BLAST

The sequence with the closest identity (closest homologue) to the target- don’t want distant homologues

Question 6

Q

alignment

What is the most correct alignment? What is the problem with this?

What RMSD is needed for modelling?

How can you align the target with the parent/template?

Answer

A

Structural alignment- the sequence alignment you would get by reading it off the superimposed structures. Problem: you don’t have the 3D structure of the target, so you can’t use this method.

< 2 Å

Dynamic programming for the initial sequence alignment, then correct it by hand- look at it in the context of the structure and where indels occur, aiming to minimise disruption to the parent structure

Question 7

Q

identifying SCRs & SVRs

How can you get the most reliable SCRs/SVRs?

What do you assume with only one parent?

Inheriting SCRs

What do you do with single parent?

Multiple parents? (4)

Answer

A

With multiple parents

All regions are structurally conserved except where indels occur (mostly in loop regions, which we assume are SVRs)- need to look at the context of the known structure to identify these (not directly related to sequence identity)

Copy SCRs directly

Fit the structures in 3 dimensions, then select SCRs on: local sequence similarity to target, local RMSD (structural similarity), low temperature factors, length/sequence of adjacent SVRs

Question 8

Q

Build the SVRs

Where are they most likely?

How does their accuracy compare to the rest of the structure?

What are the 3 methods?

Which methods are better? how do they build the SVRs?

Answer

A

Loop regions

Lower accuracy

By hand (modify parent with molecular graphics)
Knowledge-based (database of loops from other proteins)
ab initio (search all possible conformations)

Knowledge-based & ab initio
- generate lots of potential conformations and screen them on energy, compactness & accessibility
e.g. bury hydrophobics, avoid protrusions & clashes
- rank on lower energies & fewer interferences

Question 9

Q

Build the sidechains

What’s the method?

What is a rotamer library?

What does SCWRL do?

Answer

A

Maximum overlap protocol (MOP)- inherit torsions from parent & build atoms from a standard conformation

A library of preferred side-chain conformations (rotamers). Preference for staggered conformations- but specific ones (atoms offset by 60 degrees from front to back)
- preference for tighter-looking conformations

Puts side chains in their preferred positions, checks for clashes & refines them; outperforms MOP

Question 10

Q

refining the model

Which 2 ideas help with this?

How do they both refine the model?

Answer

A

Molecular dynamics & energy minimisation

MD = assign an initial velocity to each atom from a Boltzmann distribution & calculate their accelerations- more likely to find the global energy minimum because energy is put in
- moves each atom to a new position, then updates its acceleration

EM = move atoms to an energy minimum, but only minor movements are made, so it can get stuck in a local minimum/small pocket
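The local-minimum problem with EM can be illustrated on a toy one-dimensional energy curve. Everything here is an assumption made for illustration: the energy function, the step size and the function names are invented, and plain steepest descent stands in for whatever minimiser a real modelling package uses. MD is not shown.

```python
def energy(x):
    """Toy energy curve: shallow minimum near x = 0.93, deeper one near x = -1.06."""
    return x**4 - 2 * x**2 + 0.5 * x


def gradient(x):
    """Derivative of the toy energy curve."""
    return 4 * x**3 - 4 * x + 0.5


def minimise(x, step=0.01, iterations=1000):
    """Steepest descent: repeatedly take a small step downhill."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


if __name__ == "__main__":
    x_end = minimise(1.5)
    # Starting on the right-hand side, descent settles in the shallow minimum
    print(round(x_end, 2), energy(x_end) > energy(-1.06))
```

Because each step only goes downhill, the minimiser cannot cross the barrier to the deeper minimum; adding kinetic energy, as MD does, is what makes crossing possible.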

Question 11

Q

assessing the model

What approach can be used to see how "happy" a protein is (how well each amino acid suits its environment)?

What could you compare your model to? When can you calculate this?

What impacts model quality? (4)

How does sequence identity relate to model quality?

What is the blind prediction experiment that happens every 2 years?

Answer

A

Pseudo-energy calculation, e.g. ANOLEA

With the true target structure (quality): RMSD = sqrt(sum of squared distances between equivalent atoms / number of atoms). Can only be calculated when the correct answer is known
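The RMSD formula above as a short sketch. It assumes the two structures are already superimposed and that atoms are listed in the same order in both; the function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between equivalent atoms
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    target = [(0.0, 0.5, 0.0), (1.5, 0.5, 0.0)]
    print(rmsd(model, target))
```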

sequence identity with parent, number/size indels, quality of alignment, amount of change necessary to parent

> 70% identity: high quality, RMSD ~0.5 Å; > 55%: high, RMSD ~2 Å; rest: not good

Critical Assessment of Structure Prediction (CASP)

(11 cards)