Bioinformatics & homology modelling Flashcards
What is a homologue? How can they be identified?
What is an orthologue?
A paralogue?
What is a databank?
A database?
What is a primary databank?
Secondary?
Genes or proteins that share a common ancestor. Identified by sequence similarity: a sequence of 100 AA with >35% identity (but homologues can also have lower sequence identity)
Homologues arising from a speciation event (e.g. haemoglobin in mice vs humans)
Homologues arising from gene duplication within a species (e.g. haemoglobin vs myoglobin)
Collection of data without a fixed query tool
Structured collection of data with a fixed search tool
Raw sequence data (maybe with detailed annotations)
Derived data, highly annotated (e.g. sequence profiles)
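The ">35% identity over 100 AA" rule of thumb for homologues can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names are hypothetical, the input is assumed to be two already-aligned sequences with "-" for gaps, and identity is counted over gap-free columns only (other definitions of percentage identity exist).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of gap-free alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def passes_rule_of_thumb(aligned_a: str, aligned_b: str,
                         min_length: int = 100,
                         min_identity: float = 35.0) -> bool:
    """Apply the flashcard rule: at least 100 aligned AA with >35% identity."""
    aligned_columns = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    )
    return (aligned_columns >= min_length
            and percent_identity(aligned_a, aligned_b) > min_identity)


print(percent_identity("ACDE-F", "ACDQGF"))  # 4 matches over 5 gap-free columns
```

Failing this check does not rule out homology, since homologues can have lower sequence identity.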
What do primary databanks typically contain?
What is wrong with data from primary databanks?
What do secondary databanks contain?
How is a dotplot measured and scored for sequence alignment?
How does Dayhoff matrix score sequence alignment/similarities in amino acids?
What’s the difference between global and local sequence alignment?
Sequence data, feature information, translations & predictions from genomic data
Much of it is prediction, so it is hypothetical until experiments are done to verify it
Derived information, patterns that characterise a protein family & detailed annotations
Plots two sequences against each other in a grid to see the correct alignment: each cell is scored on amino acid identity, 1 for a match and 0 for a mismatch, so aligned regions show up as diagonal runs
Scores are log-odds values: highly positive means a very likely substitution, highly negative means an unlikely one, and 0 means the substitution occurs at the expected background rate
Global = aligns the whole sequences. Local = aligns local subregions of the sequences
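The dotplot scoring described above (1 for a match, 0 for a mismatch) can be sketched as a small Python function. The names `dotplot` and `render` are made up for this illustration, and no window filtering or substitution matrix is applied.

```python
def dotplot(seq_a: str, seq_b: str) -> list[list[int]]:
    """Score every residue pair: 1 for an identical residue, 0 otherwise."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def render(matrix: list[list[int]]) -> str:
    """Draw the matrix as text, with '*' for a match and '.' for a mismatch."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Aligned regions appear as diagonal runs of '*'.
print(render(dotplot("ACDEFG", "ACDQFG")))
```

Swapping the 1/0 score for a lookup in a substitution matrix (such as a Dayhoff matrix) would score similar residues as well as identical ones.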
What are the different alignment methods based on percentage identity? (40%+, 25%+, 10%+, 0%+)
For E = pN, what is the threshold for a homologue?
Which secondary databank uses patterns such as [FY]-C-x(7,8)-{P}, and what do the brackets mean?
What do secondary database annotations include?
What is wrong with the annotation in the genome world?
40%+: automatic pairwise alignment
25%+: consensus, conserved patterns (expected AA at each position)
10%+: profile, frequency of each amino acid at a position
0%+: structure prediction, predict 3D & functional similarity
E ≤ 0.1
PROSITE. [ ] = any one of the listed residues, { } = any residue except those listed, ( ) = number or range of repeats
References, methods, cross links to other databases, feature tables, descriptions, authors
Annotation is based on computer analysis, because getting it right is expensive. Poor annotation negates the benefit of having genome sequence data altogether
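The PROSITE-style pattern [FY]-C-x(7,8)-{P} from the cards above can be translated into a regular expression to see what the brackets mean in practice. This is a rough sketch: `prosite_to_regex` is a hypothetical helper, and it only handles the elements used in this pattern (residue letters, x, [ ], { } and repeat counts), not the full PROSITE syntax such as the `<` and `>` anchors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an optional repeat count such as (7,8) from the residue part.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, repeat = match.group(1), match.group(2)
        if residue == "x":
            regex = "."                          # x = any residue
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # {P} = anything except P
        else:
            regex = residue                      # single letter or [FY]
        if repeat:
            regex += "{" + repeat + "}"          # (7,8) = 7 to 8 repeats
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[FY]-C-x(7,8)-{P}")
print(regex)
print(bool(re.search(regex, "AAFCGGGGGGGAKK")))  # F, C, seven residues, then not P
```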
What is comparative modelling?
What is a better way of identifying a homologue than sequence identity?
Why do comparative modelling? (3)
What is required? (5)
What are the 8 steps?
Building a 3D model of a protein based on the known structure of a homologous protein (100 AA with >35% identity, but can have lower sequence identity)
E = pN (<0.1)
Volume of sequence data > structural data
X-ray crystallography = time consuming, expensive, protein must crystallise
NMR = expensive, needs purified protein at high concentration
Protein sequence & related sequences of known structure
alignment between sequences
method for performing modelling
target (sequence/structure to model)
parent/template sequence/structure for basis
- identify parents/templates
- align target sequence with parents
- find structurally conserved regions & variable regions
- inherit SCR (structurally conserved) from the parents
- build SVRs
- build sidechains
- refine the model
- evaluate the errors in your model
- identifying the parents/template
How can you find it?
What do you need to find?
Search the target sequence against the sequences in the Protein Data Bank with FASTA & BLAST
The sequence with the closest identity (closest homologue) to the target. Don’t want distant homologues
- alignment
What is the most correct alignment? What is the problem with this?
What RMSD is needed for modelling?
How can you align the target with the parent?
Structural alignment: the alignment you would get by reading it off the structures. The problem is that you don’t have the 3D structure of the target, so you can’t use this method.
< 2 Å
Dynamic programming for the initial sequence alignment, then correct it by hand: look at it in the context of the structure, check where indels occur, and minimise disruption to the parent structure
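The dynamic programming step for the initial alignment can be sketched as a global (Needleman-Wunsch style) alignment. This is a minimal illustration: the scores (+1 match, -1 mismatch, -1 per gap) are assumptions chosen for simplicity, whereas real modelling work would use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment by dynamic programming; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill each cell with the best of: pair the residues, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


aligned_target, aligned_parent, total = needleman_wunsch("ACDEF", "ACEF")
print(aligned_target)
print(aligned_parent)
print(total)
```

The result is only a starting point: hand correction would then move indels out of secondary structure elements and into loops, using the parent structure as a guide.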
- identifying SCRs & SVRs
How can you get the most reliable SCRs/SVRs?
What do you assume with only one parent?
- Inheriting SCRs
What do you do with single parent?
Multiple parents? (4)
With multiple parents
Assume all regions are structurally conserved except where indels occur (mostly in loop regions, which we assume are SVRs). Need to look at the context of the known structure to identify these (not directly related to sequence identity)
Copy SCRs directly
Fit the structures in 3 dimensions, then select SCRs on local sequence similarity to the target, local RMSD (structural similarity), low temperature factors, and length/sequence of adjacent SVRs
- Build the SVRs
Where are they most likely?
How does their accuracy compare to the rest of the structure?
What are the 3 methods?
Which methods are better? How do they build the SVRs?
Loop regions
Lower accuracy
By hand (modify parent with molecular graphics), knowledge-based (database of loops from other proteins), ab initio (search all possible conformations)
Knowledge-based & ab initio
- generate lots of potential conformations and assess them on energy, compactness & accessibility
e.g. bury hydrophobics, avoid protrusion & clashes
rank on lower energies & fewer interferences
- Build the sidechains
What’s the method?
What is a rotamer library?
What does SCWRL do?
Maximum overlap protocol- inherit torsions from parent & build atoms from a standard conformation
A library of preferred side-chain conformations: preference for staggered conformations, but specific ones (atoms offset by 60 degrees from front to back)
- preference for tighter looking conformations
Puts side chains in their preferred positions, checks for clashes & refines them; outperforms MOP (maximum overlap protocol)
- refining the model
Which 2 ideas help with this?
How do they both refine the model?
Molecular dynamics & energy minimisation
MD = assign an initial velocity to each atom from a Boltzmann distribution & calculate their accelerations. Putting energy in lets the structure cross energy barriers, so it is more likely to find the global energy minimum
- moves each atom to a new position, then updates the acceleration
EM = move atoms towards an energy minimum, but only makes minor movements so can get stuck in a local minimum/small pocket
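The difference between EM and MD can be illustrated on a toy one-dimensional energy landscape. Everything here is an assumption made for illustration: the double-well function, the step sizes, and a single fixed starting velocity standing in for a Boltzmann draw. It is not a protein force field.

```python
def energy(x: float) -> float:
    """Toy landscape: global minimum near x = -1.06, local minimum near x = 0.93."""
    return x**4 - 2 * x**2 + 0.5 * x


def force(x: float) -> float:
    """Force is the negative gradient of the energy."""
    return -(4 * x**3 - 4 * x + 0.5)


def minimise(x: float, step: float = 0.01, iterations: int = 2000) -> float:
    """Energy minimisation: small downhill moves only, so it stops in the nearest well."""
    for _ in range(iterations):
        x += step * force(x)
    return x


def dynamics(x: float, velocity: float, dt: float = 0.01,
             steps: int = 2000, mass: float = 1.0) -> list[float]:
    """Velocity Verlet: move to a new position, then update the acceleration."""
    trajectory = [x]
    acceleration = force(x) / mass
    for _ in range(steps):
        x += velocity * dt + 0.5 * acceleration * dt * dt
        new_acceleration = force(x) / mass
        velocity += 0.5 * (acceleration + new_acceleration) * dt
        acceleration = new_acceleration
        trajectory.append(x)
    return trajectory


stuck = minimise(1.5)                     # EM from the right ends in the local well
best = minimise(-1.5)                     # EM from the left finds the global well
path = dynamics(stuck, velocity=-1.5)     # kinetic energy carries MD over the barrier
print(round(stuck, 2), round(best, 2), min(path) < 0)
```

In this sketch the dynamics run only samples the landscape. A typical refinement would follow it with energy minimisation to settle into the deeper well.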
- assessing the model
What approach can be used to see how happy a protein is? (with amino acids in their environments)
What could you compare your model to? When can you calculate this?
What impacts model quality? (4)
How does sequence identity relate to model quality?
What is the blind prediction experiment that happens every 2 years?
pseudo-energy - ANOLEA
With the true target structure (quality): RMSD = square root of (sum of squared distances between equivalent atoms / number of atoms). Can only be calculated when the correct answer is known
Sequence identity with parent, number/size of indels, quality of alignment, amount of change necessary to parent
>70% identity: high quality, RMSD ~0.5 Å; >55%: high, RMSD ~2 Å; rest not good
CASP (Critical Assessment of Structure Prediction)
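The RMSD formula from the cards above can be written out as a short Python function. This is a minimal sketch that assumes the two structures are already superimposed and that their atoms are listed in the same order, so that atom i in the model corresponds to atom i in the true structure. The name `rmsd` and the coordinates are made up for the example.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between equivalent atoms.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model, truth))  # sqrt((0 + 4) / 2), in the same units as the coordinates
```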