Sequence alignments and databases Flashcards

1
Q

What’s the difference between global and local alignment?

A

A global alignment forces the entire sequences to align, while a local alignment only aligns the best matching subparts of the sequences.

2
Q

When is it better to do a global vs. local alignment?

A

If you are interested in stating something about the genetic similarity between two distantly related species, you can do a global alignment of the mitochondrial DNA from both species to get the overall similarity and the relationship between them.

If you want to find which parts of the DNA are similar, you could do a local alignment of the mitochondrial DNA. You would end up with mostly the genes, since those tend to have a low evolutionary rate.

3
Q

What do the following terms tell you?
Alignment score
Alignment length
Identity
Similarity

A

Alignment score: Tells you how well the sequences align based on what scoring matrix you have chosen.

Alignment length: Tells you how long the aligned sequence is.

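Identity: Tells you how many perfect matches there are in the alignment.

Similarity: Tells you how many aligned positions are either identical or mismatches between residues that are close to each other (conservative substitutions that score positively in the scoring matrix). Regions with high similarity and identity could indicate that the region is conserved or evolutionarily important.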
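
As a small illustration of alignment length and identity, here is a minimal Python sketch that counts them from two already-aligned strings. The function name `alignment_stats` and the use of "-" as the gap symbol are assumptions made for this example. Similarity is left out because it additionally needs a scoring matrix.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Return (alignment length, identities, percent identity)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    # An identity is a column where both rows hold the same residue (not a gap).
    identities = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    percent = 100.0 * identities / length if length else 0.0
    return length, identities, percent


length, identities, percent = alignment_stats("GATTACA-", "GAT-ACAT")
print(length, identities, percent)
```

Note that percent identity is computed here over the full alignment length including gap columns; other tools may use a different denominator.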

4
Q

Why are the scoring matrices and gap penalties important for the result of an alignment?

A

Because the scoring matrix and whether or not you penalize gaps decide the score of the alignment.

Different matrices use different methods of scoring, and if you choose not to penalize gaps you will get a much higher score than if you had penalized them.

5
Q

You perform an alignment of two sequences. You then shuffle one of them and perform the alignment again. What outcome would you expect if the similarity from the alignment was true?

A

If the similarity from the alignment was true then the alignment from the shuffled sequence should show nothing.

If you were to see a result with the shuffled sequence that is similar to the original alignment, then you could expect that the similarity is simply due to chance, meaning that the E-value is high.
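
The shuffle experiment can be sketched in Python as below. This is only an illustration under assumptions: the scoring (+1 match, -1 mismatch, -1 gap), the helper names `local_score` and `shuffle_test`, and the example peptide are all made up for this card, and a simple Smith-Waterman score stands in for whatever alignment tool you actually use.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b."""
    rng = random.Random(seed)
    real = local_score(a, b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(b)
        rng.shuffle(letters)  # same composition, order destroyed
        shuffled_scores.append(local_score(a, "".join(letters)))
    # Fraction of shuffles that score at least as well as the real alignment.
    p_empirical = sum(s >= real for s in shuffled_scores) / n_shuffles
    return real, shuffled_scores, p_empirical


seq = "MKTAYIAKQRQISFVKSHFSRQ"
real, shuffled, p = shuffle_test(seq, seq)
print(real, max(shuffled), p)
```

If the real score is far above every shuffled score, the similarity is unlikely to be due to chance. If the shuffled scores are about as high, the original similarity probably was due to chance.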

6
Q

If we performed an alignment where we allowed a lot of gaps, what would the alignment scores be? Could you trust an alignment like this?

A

When you allow a lot of gaps, you also allow many consecutive gaps. Even if you penalize the gap openings and ends, the score is very likely to increase, since the alignment gets longer and can include more matches.

A very big part of the alignment however will just be gaps so we can’t really trust the alignment since the matches in-between the gaps could still be by chance.

7
Q

What is a database structure?

A

The organization of the data

8
Q

What is a database management system?

A

Software to control organization, storage and retrieval of data

9
Q

What is the interface of a database?

A

how to access the data (e.g. website, GUI, command line)

10
Q

What types of biological databases are there?

A

Repositories
Curated

Primary
Derivative

11
Q

What is a repository database?

A

Open submission

Archiving

Submitters responsible for data quality

Often redundant

12
Q

What is a curated database?

A

Closed submission

Actively maintained

Database admin responsible for data quality

Often non-redundant

13
Q

What is a primary database?

A

Original submissions by experimentalists

Content controlled by the submitters

14
Q

What is a derivative database?

A

Built from primary data

Content controlled by database admin

15
Q

What is an alignment?

A

Arranging two or more character strings to identify similar segments without changing the order.

16
Q

What is the goal of an alignment?

A

to predict function

to predict protein structure

find related sequences in a database

reconstruct phylogenetic relationships

17
Q

What is homology?

A

Sequences are homologous if they evolved from a common ancestor.

18
Q

What is a pairwise alignment?

A

compare two sequences or search databases for similar sequences

19
Q

What is a multiple alignment?

A

identify homologous sites in sequences from many taxa (e.g. hemoglobin from different species) for phylogenetic/historical analysis

20
Q

How do we decide if an alignment is of good quality?

A

Scores measure the overall similarity and rank the alignments. You can choose what scoring matrix to use, which will affect your result.

21
Q

What is the Needleman-Wunsch algorithm?

A

An alignment algorithm for global pairwise alignments. It uses a scoring matrix, e.g. BLOSUM or PAM.

22
Q

What is the Smith-Waterman algorithm?

A

A modified version of the Needleman-Wunsch algorithm for local alignments.
Its scoring matrix contains no negative scores: any value that would be negative is set to 0.

23
Q

Why would we want to search for similarities between sequences?

A

Characterize unknown sequences

Similarity is a predictor of homology

Homology is a computational predictor of function

Homology is essential to discover evolutionary relationships

24
Q

What does BLAST stand for and what is it?

A

Basic Local Alignment Search Tool. It is an algorithm to search for similar sequences in databases.

25
Q

What assumptions does BLAST make?

A

Good alignments contain short stretches of exact matches

Short matches can be extended to longer alignments

26
Q

What are the three steps to find high scoring pairs in BLAST?

A

Seeding
Extension
Evaluation

27
Q

What is the basic idea for BLAST?

A

The idea is to search only a fraction of the possible search space while trying to include the good parts (try to find high scoring pairs between two sequences).

28
Q

Explain seeding in BLAST

A

We generate a list of words for the query and scan the database.

Ex. we want to search for RQCS with word size 2. The words will be RQ, QC, CS.

We then generate all neighbouring words with similarity > T, where T is a threshold that we have set beforehand. We use BLOSUM62 to score all possible words of length two against each query word and keep the ones that get a score > T.
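
The seeding step could be sketched in Python roughly as follows. This is an illustration only: `toy_score` is a made-up stand-in for BLOSUM62 (the real matrix has different values), and the reduced alphabet, the function names and the threshold of 5 are all assumptions chosen to keep the example small.

```python
from itertools import product


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighbourhood(word, alphabet, score, threshold):
    """Candidate words whose summed score against `word` exceeds the threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            hits.append("".join(candidate))
    return hits


def toy_score(a, b):
    """Hypothetical scores: +5 identical, +1 same group, -2 otherwise."""
    groups = [set("RK"), set("QE"), set("ST")]
    if a == b:
        return 5
    if any(a in g and b in g for g in groups):
        return 1
    return -2


words = query_words("RQCS", 2)
neighbours = neighbourhood("RQ", "RKQECST", toy_score, threshold=5)
print(words)
print(neighbours)
```

With these toy scores, the word RQ itself and the two words that differ by one similar residue pass the threshold, and these are the words that would be looked up in the database.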

29
Q

Explain extension in BLAST

A

Extend word to both sides and calculate the score.

30
Q

How does BLAST know when to stop the alignment?

A

We keep track of the current score for our alignment and how much it has dropped since the last maximum in the extension. If the drop since the last max is greater than X, the alignment is trimmed back to the last max. A low X-value means that you may miss the next max, but the search will be faster. A high X-value gives a longer running time.
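
The X-drop rule can be illustrated with a minimal Python sketch of an ungapped extension to the right of a seed. The function name `extend_right`, the +1/-1 scoring and the example sequences are assumptions for this card. Real BLAST also extends to the left and uses a substitution matrix.

```python
def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 match=1, mismatch=-1):
    """Ungapped rightward extension of a seed hit with an X-drop stop rule.

    Returns (best score, alignment length at the last maximum).
    """
    # Score of the seed word itself.
    score = sum(
        match if query[q_start + k] == subject[s_start + k] else mismatch
        for k in range(word_len)
    )
    best_score, best_len = score, word_len
    k = word_len
    while q_start + k < len(query) and s_start + k < len(subject):
        score += match if query[q_start + k] == subject[s_start + k] else mismatch
        k += 1
        if score > best_score:
            best_score, best_len = score, k  # new maximum
        elif best_score - score > x_drop:
            break  # dropped more than X below the last maximum
    # Returning best_len trims the alignment back to the last maximum.
    return best_score, best_len


low_x = extend_right("AAAAXXAAAA", "AAAAYYAAAA", 0, 0, 2, x_drop=1)
high_x = extend_right("AAAAXXAAAA", "AAAAYYAAAA", 0, 0, 2, x_drop=3)
print(low_x, high_x)
```

In this example the low X-value stops at the two mismatches and misses the higher maximum behind them, while the higher X-value continues past them at the cost of more steps.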

31
Q

Explain evaluation in BLAST

A

Rank and identify the highest quality alignments.

32
Q

How do we identify the best alignment when BLAST will obtain many alignments?

A

Low E-values (below 0.01) indicate that the sequences in the alignment are probably homologous. These are good sequences to set up hypotheses for.

33
Q

What is the T-value and X-value in BLAST?

A

In BLAST you have to set a threshold value (T) for what score you view as important or conserved enough. In the seeding only the words with scores above this value are kept.

The X-value is how BLAST knows when to stop the extension. BLAST keeps track of the maximum score and how much the score has dropped since the last max. If the drop is greater than a pre-set value of X, the alignment is trimmed back to the last max and BLAST stops extending this hit. A low X-value would mean that you might miss a higher max score, but a high X-value would take a longer time.

34
Q

Why do we want to search for similarities between sequences?

A

characterize unknown sequences

similarity is a predictor of homology

homology is a computational predictor of function

homology is essential to discover evolutionary relationships

35
Q

What does BLAST stand for and what is it?

A

Basic Local Alignment Search Tool. It is an algorithm to search for similar sequences in databases.

36
Q

What assumptions does BLAST make?

A

Good alignments contain short stretches of exact matches

Short matches can be extended to longer alignments

37
Q

What is the basic idea of BLAST?

A

The idea is to search only a fraction of the possible search space while trying to include the good parts. Find high scoring pairs between two sequences (protein or nucleotide).

38
Q

What are the three steps in BLAST to find high scoring pairs?

A

Seeding
Extension
Evaluation

39
Q

Explain seeding in BLAST

A

Generate a list of words for query and scan the database.

Ex. we want to search for RQCS with word size 2. The words will be RQ, QC, CS. We then generate all neighbouring words with similarity > T, where T is a threshold that we have set beforehand. We use BLOSUM62 to score all possible words of length two against each query word and keep the ones that get a score > T.

40
Q

Explain extension in BLAST

A

Extend alignments. Extend word to both sides and calculate the score.

41
Q

How does BLAST know when to stop the extension?

A

We keep track of the current score for our alignment and how much it has dropped since the last maximum. If the drop since the last max is greater than X, the alignment is trimmed back to the last max. A low X-value means that you may miss the next max, but the search will be faster. A high X-value gives a longer running time.

42
Q

Explain evaluation in BLAST

A

Rank and identify the highest quality alignments by the E-value

43
Q

How do we identify the best alignment when BLAST will obtain many alignments?

A

Low E-values indicate that the sequences in the alignment are probably homologous. These are good sequences to set up hypotheses for.

44
Q

What sequences should we look at when comparing sequences from distant vs closely related species?

A

DNA evolves faster than proteins, since proteins conserve function, so use the protein level for distant homology between genes/proteins. If you are comparing closely related sequences, for example between humans, DNA is the better choice.

45
Q

What does the E-value in BLAST tell us?

A

The expect value (E) is a parameter that tells us the number of hits one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score of the alignment increases. The E-value essentially describes the background noise of the search.
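
The relationship between score, database size and E-value can be sketched with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. In the Python sketch below the values of `lam` and `k` are placeholder assumptions, not the parameters BLAST uses for any particular matrix.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters, chosen only for illustration.
LAM, K = 0.3, 0.1

e_small_db = expected_hits(50, 100, 1_000_000, LAM, K)
e_large_db = expected_hits(50, 100, 2_000_000, LAM, K)
e_high_score = expected_hits(80, 100, 1_000_000, LAM, K)
print(e_small_db, e_large_db, e_high_score)
```

Doubling the database size doubles E, while a higher score lowers E exponentially.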

46
Q

What is the Needleman-Wunsch algorithm?

A

An algorithm to obtain the best global alignment based on cost/scoring functions.

47
Q

Describe the Needleman-Wunsch algorithm.

A

The algorithm works by creating a two-dimensional matrix with the sequences on x and y and then proceeds to find the alignment that gives the maximum score by the following steps:

initialization
matrix filling
trace back

48
Q

Describe the initialization of the Needleman-Wunsch algorithm

A

In the initialization the algorithm fills the first row and the first column of the matrix based on the gap penalties.

49
Q

Describe the matrix filling in the Needleman-Wunsch algorithm

A

After the initialization the algorithm fills the matrix using the chosen scoring scheme, for example +1 for a match and -1 for mismatches and gaps.

When deciding on the value of a cell, the algorithm can come from above, from the left, or from the upper-left diagonal, and it chooses the path that gives the highest possible score. With this scoring scheme, the only way to increase the score is to come from the diagonal when there is a match between the sequences.

The algorithm then does this row by row for the entire matrix.

50
Q

Describe the trace back in the Needleman-Wunsch algorithm

A

The traceback is when we find the optimal alignment. It goes from the bottom right corner to the top left corner. If there’s a match we move diagonally and if there is not a match we move up or left depending on where the highest score is.

A diagonal step means match/mismatch

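A step up means that a residue of the y sequence is aligned with a gap in the x sequence (with x along the top and y down the side).

A step to the left means that a residue of the x sequence is aligned with a gap in the y sequence.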
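
The three steps (initialization, matrix filling, trace back) can be put together in a short Python sketch. It assumes the simple scoring from the cards above (+1 match, -1 mismatch, -1 per gap position) and places x along the columns and y along the rows. The function name and the tie-breaking order in the trace back (diagonal first, then up, then left) are choices made for this example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x (columns) and y (rows), linear gap penalty."""
    rows, cols = len(y) + 1, len(x) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: first row and first column hold cumulative gap penalties.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap

    def sub(i, j):
        return match if x[j - 1] == y[i - 1] else mismatch

    # Matrix filling: best of diagonal, up and left, row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[j - 1])  # diagonal: residue against residue
            ay.append(y[i - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append("-")       # step up: only a y residue is consumed
            ay.append(y[i - 1])
            i -= 1
        else:
            ax.append(x[j - 1])  # step left: only an x residue is consumed
            ay.append("-")
            j -= 1
    return score[-1][-1], "".join(reversed(ax)), "".join(reversed(ay))


total, aligned_x, aligned_y = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(aligned_x)
print(aligned_y)
```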

51
Q

What is the Smith-Waterman algorithm?

A

An algorithm to find the best local alignments. It is a modified version of the Needleman-Wunsch algorithm.

52
Q

What are the differences between the Needleman-Wunsch and Smith-Waterman algorithms?

A

In the initialization, Smith-Waterman fills the first row and the first column with 0 instead of negative scores. The Smith-Waterman matrix has no negative scores because we are trying to find the best subsequence, even if the alignment does not start at the beginning of the sequence.

During matrix filling, Smith-Waterman sets the value to 0 wherever there would have been a negative value in Needleman-Wunsch, in order to ignore a possible negative alignment score.

In the trace back, Smith-Waterman starts wherever the highest value is and stops at 0, whilst Needleman-Wunsch goes from corner to corner of the matrix to force the entire sequences to align.
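
The three differences can be seen in a minimal Python sketch of Smith-Waterman. As with any example on these cards it is a sketch under assumptions: the scoring is +1 match, -1 mismatch, -1 per gap position, the function name is made up, and only one best local alignment is returned.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment of x (columns) and y (rows), linear gap penalty."""
    rows, cols = len(y) + 1, len(x) + 1
    # Difference 1: first row and column are initialized with 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if x[j - 1] == y[i - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Difference 2: a cell can never go below 0.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Difference 3: trace back from the highest cell and stop at 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[j - 1])
            ay.append(y[i - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            ax.append("-")
            ay.append(y[i - 1])
            i -= 1
        else:
            ax.append(x[j - 1])
            ay.append("-")
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


result = smith_waterman("CCCGATTACACCC", "TTGATTACATT")
print(result)
```

Only the shared GATTACA core is reported here, and the unrelated flanking residues are left out of the alignment.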

53
Q

What is the aim of BLAST?

A

To rapidly search through a large database of sequences to find matches between the query sequence and the sequences that the database contains.

54
Q

What are k-mers in BLAST?

A

k-mers are “perfect matching words” of length k in our alignment. The longer the k-mer, the more conserved that region of a protein is.

Alignments with k-mers are thought of as better alignments than alignments with dispersed matches, even when the number of matches is the same, since k-mers indicate conserved regions. Dispersed matches could be due to chance.

When we are seeding in BLAST we set a word size that determines how long the k-mers should be.

55
Q

What is BLOSUM62?

A

BLOSUM62 is the default scoring matrix in BLAST.

BLOSUM62 is a matrix that scores how similar or different a protein will be if you exchange one amino acid for another, based on the evolutionary frequencies of the changes (and the probability of that change happening). High scores indicate little change in the protein, which means that the probability of that change happening is relatively high. Low scores indicate that the change in the protein is large and that the probability of it happening is low.

56
Q

Why do we penalize gap openings more than gap extensions?

A

You usually penalize gap openings more than gap extensions so that alignments that alternate between gaps and matches won’t get a high score, since longer k-mers usually indicate more conserved and important regions than matches dispersed over the sequence.
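
A small Python sketch shows why an affine penalty favours one long gap over many short ones. The penalty values (-5 to open, -1 to extend) and the convention that the opening penalty covers the first gap position are assumptions made for this example, and tools differ on this convention.

```python
def affine_gap_cost(length, gap_open=-5, gap_extend=-1):
    """Cost of one run of `length` gap positions under an affine penalty."""
    if length <= 0:
        return 0
    # The first position pays the opening penalty, the rest pay the extension.
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_cost(4)          # a single gap of length 4
four_short_gaps = 4 * affine_gap_cost(1)   # four separate gaps of length 1
print(one_long_gap, four_short_gaps)
```

Both cases contain four gap positions, but the four scattered gaps cost much more than the single long gap.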

57
Q

How can we see the significance of a BLAST alignment?

A

With a big database there is always a chance of finding an alignment that fits your criteria purely by chance. We can take our raw score and look at the fraction of chance alignments that got that score or higher. This gives the probability of seeing such a score by chance: the lower it is, the more significant the alignment.

58
Q

What databases are curated/non-redundant?

A

SwissProt
RefSeq
UniProt

59
Q

What databases are uncurated/redundant/repository?

A

Genbank
ENA
EMBL

60
Q

What databases are derivative?

A

UniProt
Ensembl
SwissProt

61
Q

What databases are primary?

A

Genbank
ENA
TrEMBL

62
Q

What is SwissProt?

A

A sub-database from UniProt that is manually annotated.

63
Q

What is TrEMBL?

A

A sub-database from UniProt that is computationally annotated.

64
Q

When the BLOSUM62 matrix is constructed, what two models are compared, and what do positive vs negative values in the matrix mean?

A

The BLOSUM62 matrix contains scores that tell you how likely a certain amino acid substitution is compared with chance. The matrix is constructed based on two models:

The random model states that substitutions happen completely randomly, with the only parameter being how often the amino acids occur in the sequence. If the sequence is 20 characters long and A appears 2 times and I appears once, the probability of their substitution is based on nothing more than their frequencies multiplied (0.1 * 0.05 = 0.005).

The observation/expectation model states that the substitutions are not random but due to other factors, like the properties of the amino acids. This model is based on the observed frequency of substitutions in alignment data.

If the observed probability is higher than the random one, it means that the substitution occurs more often than expected by chance. This suggests that the chemical properties of the amino acids are similar, so the substitution is more likely to be accepted since it won’t change the function of the protein. This gives a positive number in the BLOSUM matrix. Conversely, if the substitution is observed less often than expected by chance, the number is negative.

65
Q

In what type of alignment would you lower the gap cost?

A

When you are aligning sequences that are very distant from each other.

66
Q

When would you choose to use a lower BLOSUM matrix than 62?

A

When you want to align two sequences that are very distant. Lower-numbered BLOSUM matrices (for example BLOSUM45) are built from more divergent sequences, so they are less strict about substitutions. The BLOSUM number concerns substitution scores only. Gap costs are set separately.

67
Q

What is the difference between blastn, blastp, tblastn, tblastx and blastx?

A

blastn compares a nucleotide query against a nucleotide database, blastp compares a protein query against a protein database, and blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

tblastn = protein query against a translated nucleotide database
tblastx = nucleotide query against a nucleotide database, but compared protein to protein because BLAST translates both.
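
The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated with a short Python sketch. It uses the standard genetic code and assumes an uppercase or lowercase DNA string containing only A, C, G and T. The function names are made up for this example and this is not how BLAST itself is implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three frames on the forward strand, then three on the reverse strand."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


frames = six_frames("ATGGCCTAA")
print(frames)
```

Each of the six protein strings would then be compared against protein sequences, which is why these programs can find protein-level similarity starting from nucleotide data.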

68
Q

We estimate that 10% of all amino acids are Alanine and 5% are Isoleucine. You have observed that 1% of all substitutions is A-I from sequence alignment data.

Calculate the expected proportion of A-I substitutions under a random model. Is this more or less than the observed A-I in the alignments, what does this mean for the amino acids in terms of chemical properties and BLOSUM62 scores?

A

To calculate the expected proportion under a random model: pA-I = pA * pI.

0.1 * 0.05 = 0.005 = 0.5%, which is lower than the observed 1%.

This means that A-I occurs more often than expected at random, so the amino acids probably have similar properties and the substitution is more likely to be accepted, which gives a positive BLOSUM62 score.
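
The calculation can be checked with a few lines of Python. The sketch follows the card's simplified random model (expected = pA * pI) and turns the observed/expected ratio into a log-odds score in half-bit units, which is the scale BLOSUM matrices use before rounding. The numbers are the hypothetical ones from the question, not real BLOSUM62 frequencies.

```python
import math


def log_odds_score(observed, p_a, p_b, scale=2.0):
    """Return (expected frequency, scale * log2(observed / expected))."""
    expected = p_a * p_b
    return expected, scale * math.log2(observed / expected)


expected, score = log_odds_score(observed=0.01, p_a=0.10, p_b=0.05)
print(expected, score)
```

An observed frequency above the expected one gives a ratio above 1 and therefore a positive score. A ratio below 1 would give a negative score.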

69
Q

How do we score an MSA?

A

We calculate the sum-of-pairs score:

score(seq1-seq2) + score(seq1-seq3) + …
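
A minimal Python sketch of the sum-of-pairs score is shown below. The scoring values (+1 match, -1 mismatch, -1 gap) and the choice to score gap-against-gap columns as 0 are assumptions for this example, as are the function names.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score two aligned rows column by column; gap-gap columns score 0."""
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            continue
        if ca == "-" or cb == "-":
            total += gap
        elif ca == cb:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(msa):
    """Add up the pairwise scores of every pair of rows in the MSA."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))


msa = ["GATTACA", "GAT-ACA", "GATTAGA"]
print(sum_of_pairs(msa))
```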

70
Q

What is a progressive alignment?

A

A progressive alignment builds up to a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related.

This is a good alternative to Needleman-Wunsch if you want to align a large number of sequences, since it would not be efficient to extend Needleman-Wunsch to all of the sequences at once.

71
Q

What does it mean if you increase your word size in BLAST to a higher number?

A

You start looking for more conserved regions and you will most likely get fewer hits. If your word size is the same length as the sequence, you’re just doing a local alignment.

BLAST is faster than a full alignment because seeding with short words means that the entire search space is not explored.

72
Q

Why is it better to do a progressive alignment rather than Needleman-Wunsch when you want to compare 100 sequences of length 100?

A

Needleman-Wunsch extended to 100 sequences would create a matrix with one dimension per sequence (100 dimensions, on the order of 100^100 cells), while the progressive alignment does the pairwise alignments in just two dimensions, which is much faster.
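
The size difference can be made concrete with some arithmetic in Python. The sketch counts matrix cells only, assuming each axis has length + 1 positions, and for the progressive approach it counts just the initial all-against-all pairwise stage. The later merging steps add work but stay two-dimensional.

```python
n_sequences = 100
length = 100

# Simultaneous alignment: one axis per sequence, (length + 1) positions each.
cells_full = (length + 1) ** n_sequences

# Progressive alignment: one two-dimensional matrix per pair of sequences.
n_pairs = n_sequences * (n_sequences - 1) // 2
cells_pairwise = n_pairs * (length + 1) ** 2

print("digits in full matrix size:", len(str(cells_full)))
print("cells for all pairwise matrices:", cells_pairwise)
```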