TB2-5 - Bioinformatics of Sequence Data Flashcards
What is meant by sequence homology? How does this relate to sequence similarity?
Sequence homology is a statement about the common evolutionary ancestry of two sequences that takes into account their sequence similarity (the degree of likeness between two sequences).
Two sequences are said to be homologous if they are both derived from a common ancestral sequence.
If two sequences are homologous, does this mean they have identical sequences?
No
They may have largely similar sequences which indicate a common ancestral sequence but they are unlikely to be fully identical
What is bioinformatics at the basic level?
Computer storage and analysis of complex biological data such as sequence, structure, proteomics and metabolomics (essentially the ‘omics’ data in general)
What used to be considered the central activity of bioinformatics (and is still common today)?
annotation of genomes
(i.e. what do the sequences do)
Why can imaging data now be considered a part of bioinformatics?
Imaging with new microscopes produces large data files, and bioinformatics helps to analyse these files, e.g. accessing and labelling them
Name three central activities of bioinformatics
Annotation of genomes
protein structure prediction
Computer simulations of proteins
What types of activities would protein structure prediction entail?
Secondary structure prediction
Identifying folds
Homology modelling
What vague activities would computer simulations of proteins entail?
Molecular dynamics
Simulations of folding and of function
Structure refinement
What is structure refinement?
The process of achieving agreement between the structural model and the experimental data
What is molecular dynamics?
A computer simulation method that is used to calculate motion of individual atoms or molecules
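As a rough illustration of what "calculating the motion of atoms" involves, the sketch below integrates Newton's equations for one particle on a spring using the velocity Verlet scheme, a common integrator in molecular dynamics. The spring constant, mass, time step and all function names are made up for illustration; real MD packages apply full force fields to many thousands of atoms.

```python
def harmonic_force(x, k=1.0):
    """Force from a hypothetical harmonic spring centred at x = 0."""
    return -k * x


def velocity_verlet(x, v, mass=1.0, dt=0.01, steps=1000, k=1.0):
    """Advance position x and velocity v through `steps` time steps."""
    f = harmonic_force(x, k)
    trajectory = [x]
    for _ in range(steps):
        # Half-step velocity update using the current force
        v += 0.5 * dt * f / mass
        # Full-step position update using the half-step velocity
        x += dt * v
        # Recompute the force at the new position
        f = harmonic_force(x, k)
        # Second half-step velocity update using the new force
        v += 0.5 * dt * f / mass
        trajectory.append(x)
    return x, v, trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


x_end, v_end, traj = velocity_verlet(x=1.0, v=0.0)
print(f"Energy at start: {total_energy(1.0, 0.0):.4f}")
print(f"Energy at end:   {total_energy(x_end, v_end):.4f}")
```

If the integration is behaving, the total energy at the end should stay close to its starting value, which is a basic sanity check for any MD run.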
Why do we need to carry out bioinformatics? 4 broad areas
Sequence/structure and sequence/function deficits
(sequence to structure to function annotation)
Pattern recognition of sequences or structures
Prediction of structure, function and regulation
Explain the sequence/structure deficit? give numerical values for comparison
(As of 2022) There are 2.8 billion entries in the ENA database (i.e. this many sequences have been identified with their bases). Compare this to the much lower 568,363 (roughly 570,000) sequences in the UniProtKB/Swiss-Prot database, which have been curated by humans to define a structure.
I.e. a MUCH smaller proportion of sequences have a known structure than an unknown one.
What is the significance of the difference in sequence entries for the ENA database and Swiss-Prot? Why is there this difference? (use numerical values in your answer)
The ENA database contains a very large number of sequences (2.8 billion) that are automatically translated into the database.
Swiss-Prot, by contrast, contains a much smaller number of sequences (roughly 570,000), all of which have been curated by humans (i.e. checked that the sequence is the genuine protein it is believed to encode etc.)
This indicates a lag between what is known automatically about sequences and what we trust actually is a protein sequence from human curation.
This difference occurs because human curation (checking the protein) takes a long time.
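The size of the gap described in the two cards above can be checked with quick arithmetic. The sketch below uses only the two figures quoted in the cards; the variable names are illustrative.

```python
# Figures quoted in the cards above (2022 snapshot)
ena_entries = 2_800_000_000      # sequence entries in ENA
swissprot_entries = 568_363      # human-curated UniProtKB/Swiss-Prot entries

# Fraction of ENA-scale sequence data matched by curated entries
curated_fraction = swissprot_entries / ena_entries
# How many ENA entries exist for each curated entry
entries_per_curated = ena_entries / swissprot_entries

print(f"Curated fraction: {curated_fraction:.4%}")
print(f"About one curated entry per {entries_per_curated:,.0f} ENA entries")
```

On these numbers the curated set is roughly 0.02% of the size of ENA, i.e. about one curated entry for every 5,000 automatically deposited ones.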
What issue arose due to high throughput automated translation of sequences? How was this issue resolved?
Lots of bacteria were sequenced, but many are very similar and have highly redundant proteomes (they weren’t very interesting)
In 2015, about 47 million sequence entries belonging to these redundant proteomes were removed.
How many structures are there roughly in the PDB?
Just under 200,000
196,565
How many structures of proteins have been modelled computationally by AlphaFold?
About 1 million
Roughly how many folds are there (according to SCOP)?
1562
What is significant about the difference in the number of folds to the number of potential structures of proteins?
There are very few different folds given the number of possible structures, i.e. there are many different amino acid sequences but many folds are very common (this can be seen when overlaying the fold backbones).
Roughly how many bacterial genomes are uploaded to Ensembl? Why has this number not really changed from last year?
Over 31,000
Only nonredundant bacterial genomes tend to be reported
Wormbase.org is a database dedicated to the sequence information of which organism?
Nematodes (C. elegans)
Name a database that contains bacterial genomes?
ensembl.org
Name a database that has sequencing data solely for nematodes?
wormbase.org
What’s 1KP?
The 1000 Plant Transcriptomes Initiative (Genome Project) was an international multi-disciplinary consortium that generated large-scale gene sequencing data for over 1000 phylodiverse plants.
Do sequences in genome projects ever get updated?
Yes, there is continuous revision and updating of the most studied genomes, e.g. mouse, human
When did the 100,000 Genomes Project start and when did it sequence its last genome?
Start August 2015 - end December 2018
The genome sequencing project has sequenced all 100,000 genomes it set out to sequence. Why would this be considered “easy” in relation to the wider aim of the project?
The data now needs to be analysed to find patterns of genes that drive e.g. cancer, Parkinson’s and rare diseases (these are the project’s main focus). This analysis is the harder part.
Why is sequence alignment considered a useful approach when investigating a protein sequence with an unknown function? (Hint: what would you align the unknown sequence to?)
We can scan a database with the new (query) protein sequence to see if there are any homologous sequences. If we know the function of the homologous protein we can describe/annotate the new protein with a function
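The idea of scanning a database with a query and borrowing the annotation of the best match can be sketched as below. This is a toy under stated assumptions: the three database entries, their functions and the +1/-1/-1 scoring are invented for illustration, and the score is a plain Needleman-Wunsch global alignment score, whereas real searches use tools such as BLAST with substitution matrices and significance statistics.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            up = previous[j] + gap        # a[i-1] aligned to a gap in b
            left = current[j - 1] + gap   # b[j-1] aligned to a gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


# Hypothetical mini-database: name -> (sequence, annotated function)
toy_database = {
    "protA": ("MKTAYIAKQR", "kinase"),
    "protB": ("GGSLLPWTNE", "transporter"),
    "protC": ("MKTAYLAKQR", "kinase"),
}


def best_hit(query, database):
    """Return (name, score, function) of the highest-scoring entry."""
    scored = [
        (global_alignment_score(query, seq), name, function)
        for name, (seq, function) in database.items()
    ]
    score, name, function = max(scored)
    return name, score, function


hit = best_hit("MKTAYIAKQK", toy_database)
print(hit)
```

A high score alone does not prove homology; in practice the hit's statistical significance is checked before its function is transferred to the query.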
What is an orthologue?
(from ‘ortho’ = exact)
Genes that are found in different species that form proteins with the same function
i.e. homologs across different species
Do orthologues always evolve due to convergent evolution?
Usually but not necessarily
If you go back far enough in time, there could be a species divergence point after which the proteins continued to perform the same function
What is a paralogue?
(from ‘para’ = beside/next to)
Describes a set of closely related genes within one species
i.e. homologs that are in the same species but have different functions (although these functions will likely be related)
Compare the key differences between orthologues and paralogues?
Paralogues occur within the same species whereas orthologues occur in different species
Paralogues have different but related functions whereas orthologues have the same function
What is the likely cause of paralogues evolving?
Diverged from a duplication event to have slightly different functions
What is the likely cause of orthologue evolution?
Genes diverged after a speciation event to maintain the same function
Sequence alignment can tell us information about the relationship between different proteins. Why might this be useful and what does this information help us construct?
Important for phylogeny and understanding evolution
Can use the information to construct phylogenetic trees
Is it easier to detect homology in a sequence alignment with DNA sequences or with amino acid (protein) sequences? Explain your reasoning?
Easier with amino acids, because there are 20 “letters” as opposed to 4. This means that the significance of an alignment is higher, because it is less likely that a match between two “letters” is due to chance: roughly a 1/4 chance of a match if DNA bases are used vs roughly a 1/20 chance if amino acids are used (assuming all letters are equally likely).
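The chance argument in this card can be made concrete with a short calculation. The sketch assumes every letter is equally likely and that positions are independent, which real sequences only approximate; the function names are illustrative.

```python
def chance_match_probability(alphabet_size):
    """Probability that two random, equally likely letters are identical."""
    return 1 / alphabet_size


def chance_run_probability(alphabet_size, length):
    """Probability that `length` aligned positions all match by chance."""
    return chance_match_probability(alphabet_size) ** length


for name, size in [("DNA", 4), ("protein", 20)]:
    single = chance_match_probability(size)
    run = chance_run_probability(size, 10)
    print(f"{name}: single match {single:.2f}, 10 matches in a row {run:.2e}")
```

The gap between the two alphabets widens quickly with length, which is why a stretch of matching amino acids is much stronger evidence than the same number of matching bases.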
Define identity in relation to sequence comparison?
The extent to which two sequences (nucleotide or amino acid) are invariant (identical)
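A minimal sketch of how percent identity could be computed from two already-aligned sequences. The choice of denominator (the full alignment length, gaps included) is an assumption made here; tools differ on this, so reported values are not always directly comparable.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows hold the same non-gap character
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100 * identical / len(aligned_a)


# Toy alignment: 7 of the 9 columns are identical
pid = percent_identity("MKT-AYIAK", "MKTLAYLAK")
print(f"{pid:.1f}% identity")
```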
Define similarity in relation to sequence comparison? How does this relate to identity?
The extent to which sequences (nucleotide or amino acid) are related. The extent of similarity between two sequences can be based on percent sequence identity/conservation.
N.B. identity here implies being homologous
What score would be given in BLAST if there was similarity?
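Positive (protein BLAST reports aligned residue pairs with a positive substitution-matrix score as “Positives”)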
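To show how similarity differs from identity in a protein alignment, the sketch below counts identical columns and columns with a positive substitution score. The score table is a small hand-entered subset intended to follow BLOSUM62; the identity score of 4 and the default of -1 for unlisted pairs are placeholders, so treat every value as illustrative and check it against the real matrix.

```python
# Hand-entered subset of substitution scores (illustrative only)
TOY_SCORES = {
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("F", "Y"): 3,
    ("L", "D"): -4,
    ("G", "W"): -2,
}
IDENTITY_SCORE = 4   # placeholder: identical residues always score positive
MISSING_SCORE = -1   # assumption: pairs not listed are treated as dissimilar


def pair_score(x, y):
    """Look up a symmetric substitution score for residues x and y."""
    if x == y:
        return IDENTITY_SCORE
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x), MISSING_SCORE))


def identities_and_positives(aligned_a, aligned_b, gap="-"):
    """Count identical columns and columns with a positive score."""
    identities = positives = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns count towards neither total
        if x == y:
            identities += 1
        if pair_score(x, y) > 0:
            positives += 1  # identical columns are included here too
    return identities, positives


result = identities_and_positives("IDKFLG", "VEKYDW")
print(result)
```

In this toy alignment only one column is identical, but four columns score positively, so the two sequences are more similar than the identity count alone suggests.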