TB2-5 - Bioinformatics of Sequence Data Flashcards

1
Q

What is meant by sequence homology? How does this relate to sequence similarity?

A

Sequence homology is a statement about the common evolutionary ancestry of two sequences, inferred from their sequence similarity (the degree of likeness between the two sequences).
Two sequences are said to be homologous if they are both derived from a common ancestral sequence.

2
Q

If two sequences are homologous, does this mean they have identical sequences?

A

No.
They may have largely similar sequences, which indicates a common ancestral sequence, but they are unlikely to be fully identical.

3
Q

What is bioinformatics at the basic level?

A

Computer storage and analysis of complex biological data such as sequence, structure, proteomics and metabolomics data (essentially the ‘omics’ data in general)

4
Q

What used to be considered the central activity of bioinformatics (and is still common today)?

A

annotation of genomes
(i.e. what do the sequences do)

5
Q

Why can imaging data now be considered a part of bioinformatics?

A

Imaging data from new microscopes produce large data files, and bioinformatics helps to analyse these files, e.g. accessing and labelling them

6
Q

Name three central activities of bioinformatics

A

Annotation of genomes
Protein structure prediction
Computer simulations of proteins

7
Q

What types of activities would protein structure prediction entail?

A

Secondary structure prediction
Identifying folds
Homology modelling

8
Q

Broadly, what activities would computer simulations of proteins entail?

A

Molecular dynamics
Simulations of folding and of function
Structure refinement

9
Q

What is structure refinement?

A

The process of achieving agreement between the structural model and the experimental data

10
Q

What is molecular dynamics?

A

A computer simulation method that is used to calculate motion of individual atoms or molecules

11
Q

Why do we need to carry out bioinformatics? 4 broad areas

A

Sequence/structure and sequence/function deficits
(Sequence to structure to function annotation)
Pattern recognition of sequences or structures
Prediction of structure, function and regulation

12
Q

Explain the sequence/structure deficit. Give numerical values for comparison.

A

(As of 2022) There are 2.8 billion entries in the ENA database (i.e. this many sequences have been identified with their bases). Compare this to the much lower 568,363 (roughly 570,000) sequences in the UniProtKB/Swiss-Prot database, which have been curated by humans, and to the just under 200,000 structures in the PDB.
I.e. there is a MUCH smaller proportion of sequences with a known structure than with an unknown one

13
Q

What is the significance of the difference in the number of sequence entries between the ENA database and Swiss-Prot? Why is there this difference? (use numerical values in your answer)

A

The ENA database contains a very large number of sequences (2.8 billion) that are automatically translated into the database.
Swiss-Prot, by contrast, contains a much smaller number of sequences (roughly 570,000), all of which have been curated by humans (i.e. checked that the sequence is the genuine protein it is believed to encode etc.)
This indicates a lag between what is known automatically about sequences and what we trust to be a real protein sequence following human curation.
This difference occurs because human curation (checking the protein) takes a long time.

14
Q

What issue arose due to high throughput automated translation of sequences? How was this issue resolved?

A

Lots of bacteria were sequenced, but many are very similar and have highly redundant proteomes (they weren’t very interesting)
In 2015, around 47 million entries belonging to these redundant proteomes were removed from UniProtKB.

15
Q

How many structures are there roughly in the PDB?

A

Just under 200,000
196,565

16
Q

How many structures of proteins have been modelled computationally by AlphaFold?

A

About 1 million

17
Q

Roughly how many folds are there (according to SCOP)?

A

1562

18
Q

What is significant about the difference in the number of folds to the number of potential structures of proteins?

A

There are very few different folds given the number of possible/potential structures. i.e. there are many different amino acid sequences but lots of folds are very common (can be seen when overlaying the fold backbone)

19
Q

Roughly how many bacterial genomes are uploaded to Ensembl? Why has this number not really changed from last year?

A

Over 31,000
Only nonredundant bacterial genomes tend to be reported

20
Q

Wormbase.org is a database dedicated to the sequence information of which organism?

A

Nematodes (e.g. C. elegans)

21
Q

Name a database that contains bacterial genomes?

A

ensembl.org

22
Q

Name a database that holds sequence data solely for nematodes

A

wormbase.org

23
Q

What’s 1KP?

A

The 1000 Plant Transcriptomes Initiative (Genome Project) was an international multi-disciplinary consortium that generated large-scale gene sequencing data for over 1000 phylodiverse plants.

24
Q

Do sequences in genome projects ever get updated?

A

Yes, the most studied genomes, e.g. mouse and human, are continuously revised and updated

25
Q

When did the 100,000 Genomes Project start and when did it sequence its last genome?

A

Start August 2015 - end December 2018

26
Q

The 100,000 Genomes Project has sequenced all 100,000 genomes it set out to sequence. Why would this be considered “easy” in relation to the wider aim of the project?

A

The data now need to be analysed to try to find patterns of genes that drive e.g. cancer, Parkinson’s and rare diseases (these are the project’s main focus). This analysis is the hard part.

27
Q

Why is sequence alignment considered a useful approach when investigating a protein sequence with an unknown function? (Hint: what would you align the unknown sequence to?)

A

We can scan a database with the new (query) protein sequence to see if there are any homologous sequences. If we know the function of the homologous protein we can describe/annotate the new protein with a function

28
Q

What is an orthologue?

A

(from ‘ortho’ = straight/correct)
Homologous genes found in different species that diverged after a speciation event and usually encode proteins with the same function
i.e. homologues across different species

29
Q

Do orthologues always evolve due to convergent evolution?

A

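No
Orthologues are homologues, so they arise by divergence from a common ancestral gene: the species diverged at a speciation event and the protein continued to perform the same function in each lineage. Convergent evolution (similar function without common ancestry) gives analogues, not orthologues.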

30
Q

What is a paralogue?

A

(from ‘para’ = beside/next to)
Describes a set of closely related genes within one species
i.e. homologues that are in the same species but have different functions (although these functions will likely be related)

31
Q

What are the key differences between orthologues and paralogues?

A

Paralogues occur within the same species whereas orthologues occur in different species
Paralogues have different but related functions whereas orthologues have the same function

32
Q

What is the likely cause of paralogues evolving?

A

Diverged from a duplication event to have slightly different functions

33
Q

What is the likely cause of orthologue evolution?

A

Genes diverged after a speciation event to maintain the same function

34
Q

Sequence alignment can tell us information about the relationship between different proteins. Why might this be useful and what does this information help us construct?

A

Important for phylogeny and understanding evolution
Can use the information to construct phylogenetic trees

35
Q

Is it easier to detect homology in a sequence alignment with DNA sequences or with amino acid (protein) sequences? Explain your reasoning.

A

Easier with amino acids because there are 20 “letters” as opposed to 4. This means that the significance of the alignment is higher, because it is less likely that a match between two “letters” is due to chance, i.e. roughly a 1/4 chance of being the same if DNA bases are used vs a 1/20 chance if amino acids are used
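
As a rough check of the 1/4 versus 1/20 argument, the sketch below counts how many identical columns would be expected purely by chance when two unrelated sequences are aligned. It assumes every letter is equally likely, which real sequences are not, and the function name is made up for illustration.

```python
def expected_chance_identities(length: int, alphabet_size: int) -> float:
    """Expected number of matching columns between two unrelated sequences,
    assuming each letter of the alphabet is equally likely (a simplification)."""
    return length / alphabet_size


dna_matches = expected_chance_identities(100, 4)       # 4 nucleotide letters
protein_matches = expected_chance_identities(100, 20)  # 20 amino acid letters
print(dna_matches, protein_matches)
```

Under that assumption a 100-column DNA alignment shows several times more chance identities than a 100-column protein alignment, so the same observed identity is more convincing for proteins.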

36
Q

Define identity in relation to sequence comparison?

A

The extent to which two sequences (nucleotide or amino acid) are invariant (identical)
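
A minimal sketch of how percent identity might be computed from two already-aligned sequences. It assumes gaps are written as "-" and that gapped columns are left out of the denominator; other conventions exist (for example dividing by the full alignment length), so this is one choice rather than the definition.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of ungapped alignment columns holding identical letters."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


identity = percent_identity("ACDEFGHIK", "ACDQF-HIK")
print(identity)
```

In this example 7 of the 8 ungapped columns are identical.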

37
Q

Define similarity in relation to sequence comparison? How does this relate to identity?

A

The extent to which sequences (nucleotide or amino acid) are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation.
N.B. high similarity is evidence for homology, but similarity is a measured quantity whereas homology is an inference about ancestry

38
Q

What score would be given in BLAST if there was similarity?

A

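A positive score: aligned residues that are not identical but score above zero in the substitution matrix are counted as “Positives” (marked “+” in the BLAST alignment)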

39
Q

Using an identity scale (100% being identical and 0% being completely dissimilar sequences) for what range would it be acceptable to use automatic methods to produce a sequence alignment?

A

50-100%

40
Q

Using an identity scale (100% being identical and 0% being completely dissimilar sequences) for what range would it be acceptable to use consensus methods to produce a sequence alignment?

A

30-50%

41
Q

Using an identity scale (100% being identical and 0% being completely dissimilar sequences) for what range would profile methods need to be used to generate a sequence alignment?

A

15-30%

42
Q

What range of percentage sequence identity would be considered the twilight zone in sequence alignment?

A

Below 30%

43
Q

What are pairwise alignments used to identify?

A

Regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid sequences)

44
Q

In sequence alignments, are the sequences protein or nucleic acid sequences?

A

yes… they can be either
(trick question)

45
Q

Why should we try and avoid putting in gaps/insertions to a sequence alignment? What implication does this have on our alignment methodology?

A

We could theoretically put in as many gaps as we need in order to gain a perfect alignment every time, which would make the alignment meaningless
Therefore the scoring applies a penalty for putting gaps into an alignment
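
To see why the penalty matters, the sketch below scores the same pair of sequences aligned with and without gaps. The match, mismatch and gap values are arbitrary assumptions chosen for illustration, not values from the lecture.

```python
def score_alignment(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum column scores for two aligned strings; '-' marks a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap          # any column containing a gap is penalised
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Same two sequences (ACGTA and ACTTA), aligned two different ways.
gapped_free = score_alignment("ACGT-A", "AC-TTA", gap=0)     # gaps cost nothing
gapped_penalised = score_alignment("ACGT-A", "AC-TTA")       # gaps cost -2 each
ungapped = score_alignment("ACGTA", "ACTTA")
print(gapped_free, gapped_penalised, ungapped)
```

With free gaps the gapped alignment wins by hiding the mismatch; once gaps are penalised the simpler ungapped alignment scores higher.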

46
Q

Can you think of an example where having a gap in a sequence alignment would be logical?

A

In order to maintain secondary structural elements.
e.g. both sequences have two helices connected by a loop but the loop size is different. Insert a gap into the shorter loop so that the helices can align

47
Q

Why does it make sense to try and predict where the secondary structural elements are in a sequence for sequence alignment before putting in gaps/insertions?

A

We want to try and avoid gaps/inserts in secondary structural elements because these are key to the structure and therefore function of the protein

48
Q

Why is it not useful to simply align each vertical position so that the residues have the same identity for sequence alignment?

A

There are very few proteins that actually have high identity

49
Q

What would aligning by sequence similarity entail?

A

Aligning amino acids vertically that have similar chemical properties e.g. Val and Leu aligned because they both have hydrophobic side chains

50
Q

Using sequence similarity is not the best option for alignment. What is the best method of alignment?

A

Domain-by-domain alignment

51
Q

What is a local alignment?

A

Alignment by domain
Identifies REGIONS of similarity within long sequences that are often widely divergent overall

52
Q

What is global alignment?

A

A form of global optimization that forces the alignment to cover the entire length of all the query sequences
tries to find the best overall match for all the sequences
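
A minimal sketch of a global alignment score for two sequences. The cards do not name an algorithm here; this uses Needleman-Wunsch dynamic programming as an assumed example, with illustrative scoring values and a simple linear gap penalty.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score for aligning ALL of a against ALL of b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: the alignment must span both sequences.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal,
                              table[i - 1][j] + gap,   # gap in b
                              table[i][j - 1] + gap)   # gap in a
    return table[-1][-1]  # bottom-right cell covers both full sequences


print(needleman_wunsch_score("ACGT", "AGT"))
```

The score is read from the bottom-right cell, which is what forces the alignment to cover the entire length of both sequences.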

53
Q

What’s the difference between local and global alignment?

A

Global forces the alignment to cover the entire length of all the query sequences to find the best overall match whereas local alignments find regions of similarity within the long sequences

54
Q

Which method/mechanism of sequence alignment guarantees an optimal answer?

A

(local alignment) Smith and Waterman dynamic programming
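
A minimal sketch of the Smith and Waterman scoring step, assuming a linear gap penalty and illustrative match/mismatch values (real searches would normally score with a substitution matrix). Only the best local score is returned; the traceback that recovers the aligned region is left out.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between any region of a and any region of b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a new local alignment start anywhere.
            table[i][j] = max(0, diagonal,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


# Only the shared region ACGT contributes; the unrelated flanks are ignored.
print(smith_waterman_score("TTTACGTAAA", "GGACGTCC"))
```

Because every cell of the table is filled, the best-scoring local alignment under the chosen scoring scheme is guaranteed to be found, at the cost of time proportional to the product of the two lengths.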

55
Q

Name 3 methods/mechanisms used for sequence alignment?

A

(all local alignments)
Smith and Waterman (dynamic programming)
BLAST
FastA

56
Q

Which two methods/mechanisms of sequence alignment do not guarantee an optimal answer?

A

BLAST and FastA

57
Q

What value is used to show if a sequence alignment is significant (or due to chance)?

A

The E (expectation) value

58
Q

What is the difference between an E (expectation) value and a P value?

A

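The E value takes into account the size of the database: it is the number of alignments with at least this score expected by chance in a database of that size (there is a finite number of sequences, not an infinite probability distribution). A P value is a probability, so it lies between 0 and 1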
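
A small sketch of how the two values relate, assuming the Karlin-Altschul form E = K * m * n * exp(-lambda * S) used for local alignment statistics and the conversion P = 1 - exp(-E). The K and lambda defaults below are placeholder numbers for illustration, not parameters of any real scoring system.

```python
import math


def e_value(score: float, query_len: int, database_len: int,
            k: float = 0.1, lam: float = 0.3) -> float:
    """Expected number of chance hits scoring at least `score`.
    k and lam are hypothetical placeholder parameters."""
    return k * query_len * database_len * math.exp(-lam * score)


def p_value_from_e(e: float) -> float:
    """Probability of seeing at least one chance hit, given E expected hits."""
    return -math.expm1(-e)  # same as 1 - exp(-e), but accurate for tiny e


small_db = e_value(50, 100, 1_000_000)
large_db = e_value(50, 100, 2_000_000)
print(small_db, large_db, p_value_from_e(small_db))
```

Doubling the database doubles E for the same score, which is the database-size dependence the card describes; for very small values E and P are almost equal.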

59
Q

What method is often used to score sequence alignments?

A

Substitution matrices

60
Q

What algorithm is used to find the sequence alignment with the best score?

A

Dynamic programming

61
Q

What do the numbers in a substitution matrix for sequence alignment correspond to?

A

The score for swapping one amino acid for another
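
A toy illustration of looking scores up in a substitution matrix. The three-letter matrix below is hypothetical, made up for this sketch rather than copied from a published PAM or BLOSUM table; it only mimics the pattern of high scores for identities, small positive scores for chemically similar pairs and negative scores for dissimilar pairs.

```python
# Hypothetical scores for leucine (L), valine (V) and aspartate (D).
TOY_MATRIX = {
    ("L", "L"): 4, ("V", "V"): 4, ("D", "D"): 6,
    ("L", "V"): 1,                  # both hydrophobic: mildly favourable
    ("L", "D"): -4, ("V", "D"): -3, # hydrophobic vs charged: unfavourable
}


def substitution_score(x: str, y: str, matrix=TOY_MATRIX) -> int:
    """The matrix is symmetric, so look the pair up in either order."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def score_ungapped(a: str, b: str, matrix=TOY_MATRIX) -> int:
    """Sum the substitution scores column by column (no gaps allowed)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(x, y, matrix) for x, y in zip(a, b))


print(score_ungapped("LVD", "LVD"), score_ungapped("LVD", "VLD"),
      score_ungapped("LVD", "DDD"))
```

Swapping the two hydrophobic residues lowers the score a little, while replacing them with a charged residue makes it negative.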

62
Q

What is a substitution matrix?

A

A collection of scores for aligning nucleotides or amino acids with one another.

63
Q

What do the scores in a substitution matrix represent?

A

the relative ease with which a nucleotide or amino acid can mutate or substitute to another

64
Q

What is a substitution matrix used to measure?

A

Measures the similarity in sequence alignments

65
Q

What are names of the two commonly used substitution matrices?

A

PAM (point accepted mutation) e.g. PAM250
BLOSUM (blocks substitution matrix) e.g. BLOSUM62

66
Q

Which substitution matrix was developed to look at more divergent sequences?

A

BLOSUM

67
Q

What does the number in a BLOSUM substitution matrix mean?

A

Sequences sharing more than x% pairwise identity were clustered (counted as one sequence) when deriving the matrix, e.g. BLOSUM62 uses a 62% identity threshold

68
Q

What does the number in PAMx e.g. PAM250 correspond to?

A

The matrix is the one expected after an evolutionary period long enough for 250 accepted point mutations to occur per 100 amino acids

69
Q

What difference is there in the numbers of a PAM and BLOSUM matrix?

A

Higher PAM number => larger evolutionary distance
Higher BLOSUM number => smaller evolutionary distance

70
Q

What is the equation for the number of possible alignments of two sequences of length n?

A

(2n)!/(n!)^2
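
The formula is the binomial coefficient "2n choose n", so it can be evaluated directly. The short sketch below (function name invented for illustration) shows how quickly the count grows.

```python
import math


def possible_alignments(n: int) -> int:
    """(2n)! / (n!)^2, computed exactly as the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


for n in (1, 2, 10, 100):
    print(n, possible_alignments(n))
```

Even for two sequences of only 100 residues the count is about 9 x 10^58, which is why every alignment cannot simply be enumerated and scored.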

71
Q

What is FASTA used for?

A

Sequence alignment

72
Q

How does FASTA optimise sequence alignment?

A

Via dynamic programming
Eliminates alignments in areas that would involve lots of gaps/insertions.

73
Q

Describe the principle of how FASTA is carried out

A
  • Find runs of identities
  • Re-score using PAM matrix
    -Keep the top scoring segments
  • Apply “joining threshold” to eliminate segments that are unlikely to be part of the highest scoring alignment
  • Use dynamic programming to optimise the alignment in a narrow band that encompasses the top scoring segments.
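
The first step above, finding runs of identities, can be sketched as a word lookup that records which diagonal of the comparison matrix each shared word falls on. This is a simplified illustration, not the FASTA program's own code: the word length k = 2 and the function name are assumptions, and the PAM re-scoring, joining and banded dynamic programming steps are left out.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, subject: str, k: int = 2) -> Counter:
    """Count shared k-letter words on each diagonal (subject pos - query pos)."""
    # Index every k-letter word in the subject by its start positions.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # A word shared at query position i and subject position j lies on diagonal j - i.
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[j - i] += 1
    return diagonals


hits = diagonal_hits("ACDEFG", "XXACDEFGXX")
best_diagonal, word_count = hits.most_common(1)[0]
print(best_diagonal, word_count)
```

A diagonal carrying many word hits marks a run of identities, and those are the regions that go forward to re-scoring.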
74
Q

FASTA is used to optimise sequence alignment, using runs of identities, which areas are taken out (sequences running horizontal and vertical from the top left corner) and what do these areas represent?

A

Taken out top right and bottom left because they represent alignments with lots of gaps and insertions

75
Q

What does BLAST stand for?

A

Basic Local Alignment Search Tool

76
Q

Using BLAST for sequence alignment is a compromise between what, when compared to using the dynamic programming approach?

A

Speed and accuracy
For guaranteed accuracy, the full dynamic programming approach would have to be used

77
Q

What is the range of values that the E value can take?

A

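From 0 upwards. The E value is an expected number of chance hits rather than a probability, so it can exceed 1; values close to 0 indicate a significant match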

78
Q

How long is a queried “word” in BLAST?

A

3 amino acid letters

79
Q

What would an E-value of 1 correspond to in BLAST?

A

About one alignment scoring this well would be expected by pure chance in a database of this size, so the match is not significant

80
Q

What is the significance level of the E value in BLAST?

A

<10^-4 is significant

81
Q

In BLAST, what happens after a query “word” is found in the sequence? What is the name given to what is identified/produced?

A

Try to expand out from the query word with matches either side
High-scoring segment pair (HSP)
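
A simplified sketch of the extension step: starting from an exact word hit, the match is grown in both directions without gaps and stops once the running score falls too far below the best seen. The +1/-1 scoring, the x_drop cut-off and the function names are assumptions for illustration; BLAST itself scores with a substitution matrix and also seeds from similar, not only identical, words.

```python
def extend_one_way(query, subject, i, j, step, start_score, x_drop,
                   match=1, mismatch=-1):
    """Walk from (i, j) in one direction; return (best score, columns kept)."""
    best = running = start_score
    kept = walked = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        running += match if query[i] == subject[j] else mismatch
        walked += 1
        if running > best:
            best, kept = running, walked
        elif best - running > x_drop:
            break  # score has dropped too far below the best seen so far
        i += step
        j += step
    return best, kept


def extend_seed(query, subject, q_pos, s_pos, word_len=3, x_drop=2):
    """Grow an exact word hit into an ungapped high-scoring segment."""
    seed_score = word_len  # exact word: word_len matches at +1 each
    best, right = extend_one_way(query, subject, q_pos + word_len,
                                 s_pos + word_len, +1, seed_score, x_drop)
    best, left = extend_one_way(query, subject, q_pos - 1, s_pos - 1,
                                -1, best, x_drop)
    return best, query[q_pos - left:q_pos + word_len + right]


score, segment = extend_seed("TTACDEFGTT", "PPACDEFGPP", q_pos=3, s_pos=3)
print(score, segment)
```

Here the three-letter word CDE is extended to the segment ACDEFG, and the mismatching flanks are trimmed off because they never raise the score.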

82
Q

What is a high scoring segment pair (HSP)?

A

The matching section of a pair of sequences that BLAST returns
A section of a pair of sequences (nucleotide or amino acid) that share a high level of similarity

83
Q

Why are multiple sequence alignments considered very difficult and expensive?

A

They use a lot of computational power and require manual intervention in order to be accurate

84
Q

Why is the use of manual intervention important for multiple sequence alignments?

A

It allows alignments to be biased using additional experimental data, e.g. if we know that a specific residue has a significant function then it is important to make sure these residues line up.

85
Q

Name a common program used for multiple sequence alignments

A

Jalview

86
Q

What is the formula for the computational cost associated with multiple sequence alignments?

A

O (m^n)
m = length of sequences
n = number of sequences
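
A quick arithmetic sketch of what O(m^n) means in practice, treating the cost as the number of cells in the dynamic programming matrix. The function name is invented, and constant factors and the extra boundary row in each dimension are ignored.

```python
def dp_cells(sequence_length: int, n_sequences: int) -> int:
    """Approximate number of matrix cells: one dimension per sequence."""
    return sequence_length ** n_sequences


pairwise = dp_cells(100, 2)
five_way = dp_cells(100, 5)
print(pairwise, five_way)
```

Going from two to five sequences of length 100 multiplies the work by a factor of a million, which is why exact multiple alignment quickly becomes impractical.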

87
Q

What is a consensus sequence?

A

A sequence (DNA, RNA, protein) that represents aligned and related sequences.
A theoretical representative sequence in which each nucleotide/amino acid is the one that occurs most frequently at that site (when comparing to the other sequences)

88
Q

In a consensus sequence, what would a capital letter represent?

A

All sequences in the multiple alignment have the same amino acid residue at that site

89
Q

In a consensus sequence, what does a lower case letter represent?

A

The amino acid that occurred most frequently at that site across the sequences (i.e. it had the majority, but other amino acids do exist at that location)

90
Q

In a consensus motif, how would a site with an equal number of two amino acids across the sequences in the alignment be represented?

A

slash
e.g. if equal X and Y
X/Y
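
The three notation rules (capital letter, lower case letter, slash) can be sketched as a column-by-column count. This assumes the sequences are already aligned to equal length; the function name and the space-separated output are choices made for this example.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Build a consensus string: capital = all agree, lower case = majority,
    X/Y = tie between the most frequent residues."""
    symbols = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        top = max(counts.values())
        winners = sorted(res for res, n in counts.items() if n == top)
        if len(winners) > 1:
            symbols.append("/".join(winners))      # equal numbers: slash
        elif top == len(column):
            symbols.append(winners[0].upper())     # fully conserved site
        else:
            symbols.append(winners[0].lower())     # majority, but not unanimous
    return " ".join(symbols)


alignment = ["ACDE", "ACDF", "AGEF", "AGDE"]
print(consensus(alignment))
```

For this toy alignment the first site is fully conserved, the third has a majority residue, and the second and fourth are ties.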

91
Q

What change is there to the dynamic programming process of sequence alignment when using the method for multiple sequence alignments?

A

The matrix has an increased number of dimensions (i.e. it is no longer a 2D matrix)
Number of dimensions = number of sequences used.
n sequences => n-dimensional matrix
As this is impractical to do by hand, computers must be used.

92
Q

What method/process would be used when searching for distant (evolutionary) relationships between alignment sequences?

A

PSI-BLAST
Position-Specific Iterated BLAST

93
Q

Describe the basic principles of carrying out PSI-BLAST.

A
  • initial database search of sequences using BLAST
  • generate a consensus from the hits, which is used to build a position-specific score matrix (PSSM)
  • use this matrix to carry out the search again
  • form a new position-specific score matrix from the new hits
    -repeat 3 or 4 times
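
The matrix-building step above can be sketched as log-odds scores computed per alignment column. This is a simplified illustration rather than PSI-BLAST's actual procedure: it assumes a uniform background frequency and a flat pseudocount, and the function names are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all residues equally common
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, sequence):
    """Score a sequence position by position against the matrix."""
    return sum(column[res] for column, res in zip(pssm, sequence))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_with_pssm(pssm, "ACD"), score_with_pssm(pssm, "WWW"))
```

Residues seen often at a position score positively there and unseen residues score negatively, so the next search round is tuned to the family found so far.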
94
Q

Why is it important that PSI-BLAST is only repeated for around 3 or 4 iterations?

A

There is a danger of dilution. i.e. the sequence coming back has no relation to the original sequence

95
Q

What are two methods other than PSI-BLAST used to search for distant relationships between sequences?

A

Hidden Markov Models (HMM)
Fingerprints

96
Q

Up to end of lecture slide 34

A