Sequence alignments and databases Flashcards
What’s the difference between global and local alignment?
A global alignment forces the sequences to align over their entire length, while a local alignment only aligns the best-matching subregions of the sequences.
When is it better to do a global vs. local alignment?
If you want to say something about the overall genetic similarity between two distantly related species, you can do a global alignment of the mitochondrial DNA from both species to get the overall similarity and the relationship between them.
If you want to find which parts of the DNA are similar, you could do a local alignment of the mitochondrial DNA. You would end up with mostly the genes, since those tend to have a low evolutionary rate.
What do the following terms tell you?
Alignment score
Alignment length
Identity
Similarity
Alignment score: Tells you how well the sequences align based on what scoring matrix you have chosen.
Alignment length: Tells you how many positions (columns) the alignment spans, including gaps.
Identity: Tells you how many perfect matches are in the alignment.
Similarity: Tells you how many aligned positions are either identical or mismatches between residues with similar properties (a positive score in the scoring matrix). Regions with high similarity and identity could indicate that the region is conserved or evolutionarily important.
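As a rough illustration of these terms, the sketch below counts length, identity and similarity for two already-aligned sequences. The grouping of "similar" residues in SIMILAR_GROUPS is a simplified assumption made for this example, not a real scoring matrix, and similarity is counted here as identical plus similar positions (one common convention).

```python
# Hypothetical groups of residues treated as "similar" for this example only.
SIMILAR_GROUPS = ["KR", "DE", "ILV", "ST"]

def alignment_stats(a, b):
    """Count length, identity and similarity for two aligned strings ('-' = gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap columns count towards length only
        if x == y:
            identical += 1
            similar += 1
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    return {"length": len(a), "identity": identical, "similarity": similar}

stats = alignment_stats("MKLSD-E", "MRITDAE")
print(stats)
```

Dividing identity or similarity by the alignment length gives the percentages that alignment tools usually report.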
Why are the scoring matrices and gap penalties important for the result of an alignment?
Because the scoring matrix and whether or not you penalize gaps decide the score of the alignment.
Different matrices use different methods of scoring, and if you choose not to penalize gaps you get a much higher score than if you had penalized them.
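The effect of the gap penalty can be sketched by scoring the same alignment twice. The match, mismatch and gap values below are arbitrary assumptions chosen for illustration, standing in for a real scoring matrix.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Sum column scores of two aligned strings ('-' = gap), linear gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

top = "GATTACA"
bottom = "GA-TA-A"
print(score_alignment(top, bottom, gap=0))   # gaps are free
print(score_alignment(top, bottom, gap=-2))  # gaps are penalized
```

With five matches and two gaps, the unpenalized score is 5 and the penalized score is 1, so the same alignment looks much better when gaps cost nothing.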
You perform an alignment of two sequences. You then shuffle one of them and perform the alignment again. What outcome would you expect if the similarity from the alignment was true?
If the similarity from the alignment was true, then the alignment with the shuffled sequence should score much lower and show no significant similarity.
If the shuffled sequence gives a result similar to the original alignment, then the similarity is probably just due to chance, meaning that the E-value is high.
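The shuffling test can be sketched as follows. The scoring values, the helper names local_score and shuffle_test, and the number of shuffles are assumptions for this example; the fraction it returns is an empirical estimate and not the E-value a tool like BLAST would report.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style, score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def shuffle_test(a, b, n_shuffles=100, seed=0):
    """Compare the real score with scores against shuffled copies of b."""
    rng = random.Random(seed)
    real = local_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(local_score(a, "".join(letters)))
    # Fraction of shuffles that score at least as well as the real alignment.
    fraction = sum(s >= real for s in shuffled_scores) / n_shuffles
    return real, shuffled_scores, fraction

real, scores, fraction = shuffle_test("HEAGAWGHEE", "PAWHEAE")
print(real, max(scores), fraction)
```

If the fraction is close to zero, the original similarity is unlikely to be a chance effect of sequence composition alone.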
If we performed an alignment where we allowed a lot of gaps, what would the alignment scores be? Could you trust an alignment like this?
When you allow a lot of gaps you also allow many consecutive gaps. Even if you penalize gap openings and ends, the score is very likely to increase since the alignment gets longer.
A very large part of the alignment, however, will just be gaps, so we can’t really trust it: the matches in between the gaps could still be due to chance.
What is a database structure?
The organization of the data
What is a database management system?
Software to control the organization, storage and retrieval of data
What is the interface of a database?
how to access the data (e.g. website, GUI, command line)
What types of biological databases are there?
Repositories
Curated
Primary
Derivative
What is a repository database?
Open submission
Archiving
Submitters responsible for data quality
Often redundant
What is a curated database?
Closed submission
Actively maintained
Database admin responsible for data quality
Often non-redundant
What is a primary database?
Original submissions by experimentalists
Content controlled by the submitters
What is a derivative database?
Built from primary data
Content controlled by database admin
What is an alignment?
Arranging two or more character strings to identify similar segments without changing the order.
What is the goal of an alignment?
to predict function
to predict protein structure
find related sequences in a database
reconstruct phylogenetic relationships
What is homology?
Sequences are homologous if they evolved from a common ancestor.
What is a pairwise alignment?
compare two sequences or search databases for similar sequences
What is a multiple alignment?
identify homologous sites in sequences from many taxa (e.g. hemoglobin from different species) for phylogenetic/historical analysis
How do we decide if an alignment is of good quality?
Scores measure the overall similarity and rank the alignments. You can choose which scoring matrix to use, which will affect your result.
What is the Needleman-Wunsch algorithm?
A dynamic programming algorithm for global pairwise alignments. It uses a scoring matrix, e.g. BLOSUM or PAM.
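A minimal sketch of a Needleman-Wunsch style global alignment is shown below. To keep it short it uses simple match, mismatch and linear gap values (assumptions for this example) instead of a BLOSUM or PAM matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # The first row and column hold the cost of leading gaps.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Because the traceback starts in the last cell and runs to the first, every character of both sequences ends up in the alignment, which is what makes it global.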
What is the Smith-Waterman algorithm?
A modified version of the Needleman-Wunsch algorithm for local alignments.
It gives no negative scores: any cell that would be negative is set to zero.
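A minimal sketch of a Smith-Waterman style local alignment follows. The match, mismatch and gap values are assumptions for illustration; the zero floor in the recurrence and the traceback from the highest cell are what make the alignment local.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option means a poor prefix is dropped instead of carried along.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("TTACGTGG", "CCACGTAA"))  # (8, 'ACGT', 'ACGT')
```

Only the shared ACGT core is reported; the unrelated flanking residues are left out of the alignment.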
Why would we want to search for similarities between sequences?
Characterize unknown sequences
Similarity is a predictor of homology
Homology is a computational predictor of function
Homology is essential to discover evolutionary relationships
What does BLAST stand for and what is it?
Basic Local Alignment Search Tool. It is an algorithm to search for similar sequences in databases.
What assumptions does BLAST make?
Good alignments contain short stretches of exact matches
Short matches can be extended to longer alignments
What are the three steps to find high scoring pairs in BLAST?
Seeding
Extension
Evaluation
What is the basic idea for BLAST?
The idea is to search only a fraction of the possible search space while still including the good parts (trying to find high-scoring pairs between two sequences).
Explain seeding in BLAST
We generate a list of words for the query and scan the database for them.
E.g. we want to search for RQCS with word size 2. The words will be RQ, QC, CS.
We then generate all neighbouring words with a similarity score > T, a threshold we have set beforehand. To do this we use BLOSUM62 to score every possible two-letter word against each query word and keep the ones that score above T.
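The seeding step for the RQCS example can be sketched as below. To keep the block short, the BLOSUM62 scores are restricted to six residues (R, Q, C, S, K, E), so the neighbourhoods are smaller than with the full 20-letter alphabet. The threshold T = 6 and the function names are assumptions for this example.

```python
from itertools import product

# BLOSUM62 scores for a reduced six-letter alphabet (subset chosen for brevity).
ALPHABET = "RQCSKE"
_PAIRS = {
    "RR": 5, "QQ": 5, "CC": 9, "SS": 4, "KK": 5, "EE": 5,
    "RQ": 1, "RC": -3, "RS": -1, "RK": 2, "RE": 0,
    "QC": -3, "QS": 0, "QK": 1, "QE": 2,
    "CS": -1, "CK": -3, "CE": -4,
    "SK": 0, "SE": 0, "KE": 1,
}

def pair_score(x, y):
    """Look up a symmetric residue pair score."""
    return _PAIRS[x + y] if x + y in _PAIRS else _PAIRS[y + x]

def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def word_score(u, v):
    """Score two equal-length words position by position."""
    return sum(pair_score(x, y) for x, y in zip(u, v))

def neighbourhood(word, threshold):
    """All words over ALPHABET scoring above the threshold against word."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) > threshold)

T = 6  # hypothetical threshold
for word in query_words("RQCS", 2):
    print(word, neighbourhood(word, T))
```

For RQ this yields KQ, RE and RQ itself. The word CS keeps every word starting with C, because the C/C score of 9 alone almost clears the threshold.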