Sequence alignments and databases Flashcards

Question

What assumptions does BLAST make?

Answer 1

Good alignments contain short stretches of exact matches. Short matches can be extended to longer alignments.

Answer 2

Seeding, extension, evaluation.

Answer 3

The idea is to search only a fraction of the possible search space while still including the good parts (that is, to find high-scoring pairs between two sequences).

Answer 4

We generate a list of words for the query and scan the database. Example: we want to search for RQCS with word size 2. The words will be RQ, QC, CS. We then generate all neighbouring words with similarity > T. To do this we use BLOSUM62 to score every possible two-letter word against each query word and keep the ones whose score is above the threshold T, which we have set beforehand.
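The seeding step can be sketched in a few lines of Python. This is only an illustration: `toy_score` is a made-up scoring scheme (identical residues +4, everything else -1) standing in for BLOSUM62, and the function names are hypothetical, not part of BLAST.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    # Assumption: a toy stand-in for BLOSUM62, not the real matrix.
    return 4 if a == b else -1


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighbourhood(word, threshold, score=toy_score):
    """Every word of the same length whose summed score against `word` is > threshold."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s > threshold:
            hits.append("".join(cand))
    return hits


words = query_words("RQCS", 2)
print(words)                          # ['RQ', 'QC', 'CS']
print(len(neighbourhood("RQ", 2)))    # words sharing at least one position with RQ
```

With a real substitution matrix the neighbourhood would also contain words with no identical residue but chemically similar ones.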

Answer 5

Extend word to both sides and calculate the score.

Answer 6

We keep track of the current score of our alignment and how much it has dropped since the last maximum in the extension. If the drop since the last maximum is greater than X, the extension stops and the alignment is trimmed back to the last maximum. A low X-value means that we may miss the next maximum but the search is faster. A high X-value gives a longer running time.
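A minimal sketch of the X-drop rule for an ungapped extension to the right of a seed. The +1/-1 scoring and the function name `extend_right` are assumptions made for illustration; real BLAST scores residue pairs with a substitution matrix.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend to the right without gaps; return (best score, length kept)."""
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            # New maximum: remember how far the alignment reaches.
            best, best_len = score, i
        elif best - score > x_drop:
            # Dropped more than X below the last maximum: stop and trim back.
            break
    return best, best_len


print(extend_right("AAAAXXAAAA", "AAAAYYAAAA", 0, 0, x_drop=1))  # (4, 4)
print(extend_right("AAAAXXAAAA", "AAAAYYAAAA", 0, 0, x_drop=5))  # (6, 10)
```

The low X-value gives up at the two mismatches and misses the higher maximum further on; the high X-value keeps going and finds it.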

Answer 7

Rank and identify the highest quality alignments.

Answer 8

Low E-values (below 0.01) indicate that the sequences in the alignment are probably homologous. These are good sequences to build hypotheses on.

Answer 9

In BLAST you have to set a threshold value (T) for what word score you view as important or conserved enough. In the seeding only the words scoring above this value are kept. The X-value is how BLAST knows when to stop extending. BLAST keeps track of the maximum score and how much the score has dropped since the last maximum. If the drop is greater than the pre-set value of X, the extension stops and the alignment is trimmed back to the last maximum. A low X-value means that you might miss a higher maximum score, but a high X-value takes longer.

Answer 10

To characterize unknown sequences. Similarity is a predictor of homology. Homology is a computational predictor of function. Homology is essential for discovering evolutionary relationships.

Answer 11

The Basic Local Alignment Search Tool (BLAST) is an algorithm for searching databases for sequences similar to a query.

Answer 12

Good alignments contain short stretches of exact matches. Short matches can be extended to longer alignments.

Answer 13

The idea is to search only a fraction of the possible search space while still including the good parts. Find high-scoring pairs between two sequences (protein or nucleotide).

Answer 14

Seeding, extension, evaluation.

Answer 15

Generate a list of words for the query and scan the database. Example: we want to search for RQCS with word size 2. The words will be RQ, QC, CS. We then generate all neighbouring words with similarity > T. To do this we use BLOSUM62 to score every possible two-letter word against each query word and keep the ones whose score is above the threshold T, which we have set beforehand.

Answer 16

Extend alignments. Extend word to both sides and calculate the score.

Answer 17

We keep track of the current score of our alignment and how much it has dropped since the last maximum. If the drop since the last maximum is greater than X, the extension stops and the alignment is trimmed back to the last maximum. A low X-value means that we may miss the next maximum but the search is faster. A high X-value gives a longer running time.

Answer 18

Rank and identify the highest quality alignments by the E-value

Answer 19

Low E-values indicate that the sequences in the alignment are probably homologous. These are good sequences to build hypotheses on.

Answer 20

DNA evolves faster than protein sequences, since proteins are constrained by their function. Use the protein level to detect distant homology between genes/proteins. If you are comparing closely related sequences, for example between humans, DNA is the better choice.

Answer 21

The expected value (E) is a parameter that tells us the number of hits one can expect to see by chance when searching a database of a particular size. It decreases as the score of the alignment increases. The E-value essentially describes the background noise of the search.
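The dependence on score and database size can be made concrete with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The sketch below is illustrative only: the default `lam` and `k` values are assumptions chosen for the example, and in practice they depend on the scoring matrix and gap costs.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are illustrative assumptions."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# A higher score gives a lower E-value; a larger database gives a higher one.
print(expected_hits(40, 100, 1_000_000))
print(expected_hits(60, 100, 1_000_000))
print(expected_hits(60, 100, 2_000_000))
```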

Answer 22

An algorithm to obtain the best global alignment based on cost/scoring functions.

Answer 23

The algorithm works by creating a two-dimensional matrix with the sequences on x and y and then finds the alignment that gives the maximum score in three steps: initialization, matrix filling, and traceback.

Answer 24

In the initialization the algorithm fills the first row and the first column of the matrix with cumulative gap penalties.

Answer 25

After the initialization the algorithm fills the matrix, giving a match +1 and mismatches and gaps -1. The value of a cell can be reached from above, from the left, or diagonally from the upper left, and the algorithm chooses the option that gives the highest score. The only way to gain score is to come from the diagonal when there is a match between the sequences. The algorithm does this row by row for the entire matrix.

Answer 26

The traceback is when we find the optimal alignment. It goes from the bottom right corner to the top left corner. At each cell we move to the neighbour (diagonal, up or left) that the cell's score was derived from. A diagonal step means a match or mismatch. With the x sequence along the top and the y sequence down the side, a step up consumes only a y residue, so it places a gap in the x sequence, and a step to the left places a gap in the y sequence.
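The three steps (initialization, matrix filling, traceback) can be put together in a short sketch. It assumes the +1/-1/-1 scoring from the cards, places `x` along the columns and `y` along the rows, and returns just one optimal alignment when several exist.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    # Initialization: first row and column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap
    # Matrix filling: best of diagonal, up and left for every cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback: from bottom right to top left, following the source of each score.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[j - 1] == y[i - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append("-"); ay.append(y[i - 1]); i -= 1   # step up: gap in x
        else:
            ax.append(x[j - 1]); ay.append("-"); j -= 1   # step left: gap in y
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```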

Answer 27

An algorithm to find the best local alignment. It is a modified version of the Needleman-Wunsch algorithm.

Answer 28

In the initialization Smith-Waterman fills the first row and column with 0 instead of negative scores. The Smith-Waterman matrix has no negative scores because we are looking for the best subsequence, and the alignment does not have to start at the beginning of the sequences. During matrix filling Smith-Waterman sets a cell to 0 wherever Needleman-Wunsch would have given a negative value, so that a negative-scoring prefix is ignored. In the traceback Smith-Waterman starts at the highest value in the matrix and stops at 0, whilst Needleman-Wunsch goes from corner to corner of the matrix to force the entire sequences to align.
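A minimal sketch of these differences, again assuming +1/-1/-1 scoring with `x` along the columns and `y` along the rows. It reports only the first best-scoring local alignment it finds.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            # Negative values are replaced by 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback: start at the highest value, stop at the first 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[j - 1] == y[i - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
        else:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (3, 'ACG', 'ACG')
```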

Answer 29

To rapidly search a large database of sequences for matches between the query sequence and the sequences the database contains.

Answer 30

k-mers are "perfect matching words" in our alignment. The longer the k-mer, the more conserved that region of the protein is. An alignment with k-mers is thought of as a better alignment than one with dispersed matches, even when the number of matches is the same, since k-mers indicate conserved regions. Dispersed matches could occur by chance. When we are seeding in BLAST we set a word size to decide how long the k-mers should be.

Answer 31

BLOSUM62 is the default scoring matrix in BLAST. BLOSUM62 is a matrix that scores how similar or different a protein will be if you exchange one amino acid for another, based on how frequently that change has been observed in evolution (and so on the probability of that change happening). High scores indicate that the change alters the protein only a little, which means that the probability of that change happening is relatively high. Low scores indicate that the change alters the protein a lot and that the probability of it happening is low.

Answer 32

You usually penalize gap openings more than gap extensions, so that alignments that alternate between single gaps and matches won't get a high score. Longer k-mers usually indicate more conserved and important regions than matches dispersed over the sequence.
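The effect of penalizing openings more than extensions can be seen by scoring the gaps of two aligned strings. This is a sketch under assumed penalties (-5 to open, -1 per further position); the exact convention and values differ between tools, and the function names are hypothetical.

```python
from itertools import groupby


def gap_runs(aligned):
    """Lengths of the consecutive runs of '-' in an aligned sequence."""
    return [len(list(group)) for ch, group in groupby(aligned) if ch == "-"]


def total_gap_penalty(aligned, gap_open=-5, gap_extend=-1):
    """Pay gap_open for the first position of each run and gap_extend for each further one."""
    return sum(gap_open + gap_extend * (n - 1) for n in gap_runs(aligned))


print(total_gap_penalty("AC----GT"))  # one long gap: -8
print(total_gap_penalty("A-C-G-T-"))  # four scattered gaps: -20
```

Both strings contain four gap positions, but the scattered version is penalized far more heavily.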

Answer 33

If we have a big database, however, there is always the chance of finding an alignment that fits our criteria just by chance. We can take our raw score and look at the percentage of alignments that got that score or higher, which estimates the probability of seeing such a score by chance and so tells us whether the alignment is significant.

Answer 34

Swiss-Prot, RefSeq, UniProt

Answer 35

GenBank, ENA, EMBL

Answer 36

UniProt, Ensembl, Swiss-Prot

Answer 37

GenBank, ENA, TrEMBL

Answer 38

A sub-database from UniProt that is manually annotated.

Answer 39

A sub-database from UniProt that is computationally annotated.

Answer 40

The BLOSUM62 matrix contains scores that reflect how likely a certain amino acid substitution is to occur. The scores compare two models. The random model states that substitutions happen completely at random, with the only parameter being how often each amino acid occurs. If the sequence is 20 characters long and A appears 2 times and I appears once, the probability of their substitution is simply the product of their frequencies (0.1 * 0.05 = 0.005). The observed model states that substitutions are not random but depend on other factors such as the properties of the amino acids. This model is based on the observed frequency of substitutions in alignment data. If the observed frequency is higher than the random expectation, the substitution occurs more often than by chance. This suggests that the chemical properties of the two amino acids are similar, so the substitution is more likely to be kept because it is less likely to change the function of the protein. This gives a positive number in the BLOSUM matrix.

Answer 41

When you are aligning sequences that are very distant from each other.

Answer 42

When you want to be less strict with the gap costs. For example if you want to align two sequences that are very distant since you are expecting a lot of gaps in an alignment like that.

Answer 43

blastn is for nucleotides, blastp is for proteins, and blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. tblastn compares a protein query against a translated nucleotide database. tblastx compares a nucleotide query against a nucleotide database, but the comparison is protein to protein because BLAST translates both.

Answer 44

To calculate the expected proportion under a random model: pA-I = pA * pI = 0.1 * 0.05 = 0.005 (0.5%), which is lower than the observed proportion. This means that A-I occurs more often than by chance, so the amino acids probably have similar properties and the substitution is more likely to happen, which gives a positive BLOSUM62 score.
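The comparison of observed and expected proportions can be turned into a score as a log-odds ratio. The sketch below follows the card's simplified expectation pA * pI and uses half-bit units (2 * log2 of the ratio, rounded). The observed proportions passed in are made-up values for illustration, not real alignment data.

```python
import math


def log_odds_score(observed, p_a, p_b):
    """Rounded 2 * log2(observed / expected), with expected = p_a * p_b."""
    expected = p_a * p_b
    return round(2 * math.log2(observed / expected))


# Hypothetical observed proportions, compared with the expected 0.1 * 0.05 = 0.005.
print(log_odds_score(0.01, 0.1, 0.05))    # observed twice the expectation: 2
print(log_odds_score(0.0025, 0.1, 0.05))  # observed half the expectation: -2
```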

Answer 45

We calculate the sum-of-pairs score: score(seq1-seq2) + score(seq1-seq3) + ...
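As a small illustration, the sum over all pairs can be computed for a multiple alignment given as equal-length strings. The +1/-1/-1 pair scoring, and scoring a gap against a gap as 0, are assumptions made for this sketch.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one column of a pairwise comparison."""
    if a == "-" and b == "-":
        return 0          # assumption: gap against gap is neutral
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(msa):
    """Add up the pairwise scores of every pair of sequences in the alignment."""
    total = 0
    for s1, s2 in combinations(msa, 2):
        total += sum(pair_score(a, b) for a, b in zip(s1, s2))
    return total


print(sum_of_pairs(["ACGT", "A-GT", "ACGA"]))  # 2 + 2 + 0 = 4
```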

Answer 46

A progressive alignment builds up to a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. This is a good alternative to Needleman-Wunsch if you want to align a large number of genes, since it would not be efficient to run Needleman-Wunsch on all of the sequences at once.

Answer 47

You start looking for more conserved regions and you will most likely get fewer hits. If your word size is the same length as the sequence you're just doing a local alignment. BLAST is faster because the entire search space is not explored when you seed with short words.

Answer 48

Needleman-Wunsch would create a matrix with many dimensions (one per sequence), while the progressive alignment does the pairwise alignments in just two dimensions, which is much faster.

(72 cards)