4.1 Sequence Alignment Flashcards
What is the objective of a global alignment?
optimal alignment that includes all characters from each sequence (ex: Clustal generates global alignments)
What is the objective of a local alignment?
optimal alignment that includes only the most similar local region(s)
ex: BLAST generates local alignments
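A minimal sketch of the global vs local distinction, assuming simple match/mismatch/gap scores (the values 1, -1, -2 and the name `align_score` are illustrative, not taken from any particular tool). Global scoring follows the Needleman-Wunsch scheme and local scoring follows the Smith-Waterman scheme:

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    local=False: global score, every character of both sequences is aligned.
    local=True:  local score, only the best-matching region counts.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            dp[i][j] = cell
    return best if local else dp[-1][-1]
```

With `a = "TTTTACGTTTTT"` and `b = "GGACGTGG"`, the local score rewards only the shared `ACGT` region, while the global score is dragged down by the flanking mismatches and gaps.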
What are the two main groups aligners can be classified into?
- Sequence similarity searching with ranked solutions (one sequence is compared against many and ranked solutions are returned)
- Sequence similarity searching returning only the optimal solution (comparing many sequences to one)
Why are statistics critical for sequence similarity searching?
Used to discriminate between real and artifactual matches, which is done using an estimate of the probability that the match occurred by chance
Statistics allow us to rank-order matches and find the optimal solution
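One common way to estimate chance matches is the Karlin-Altschul expect value, E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw score. A hedged sketch follows; the defaults for `lam` and `k` are placeholder values chosen for illustration only, since the real values depend on the scoring matrix and gap penalties:

```python
import math

def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least raw_score.

    lam and k are illustrative placeholders, not values for a real scoring system.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Higher scores are less likely by chance, so they rank first (smaller E).
hits = {"hit_a": 30, "hit_b": 50, "hit_c": 42}  # hypothetical raw scores
ranked = sorted(hits, key=lambda name: expect_value(hits[name], 100, 1_000_000))
```

Because the database length appears in the formula, the same raw score gives a different expect value when the search space changes.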
Why are references important in bioinformatic analysis?
- A change in reference/database changes your search space and expect score; changing the coordinate structure that sequences are aligned to changes the results
- Can’t move results b/w references
- references impact alignment statistics
Why doesn’t BLAST scale to NGS requirements?
- BLAST uses a 3-character word seed that is extended in both directions and generates a score + expect value
- alignment of reads from single human genome re-sequencing experiment would take years
What are some examples where sequence alignment strategies are reference specific?
Ex 1: Use BWA for genomic alignments
Ex 2: Use the STAR aligner for RNA alignments
Strategy used depends on the research question being asked
What is the objective of short read alignment?
To align 100s of millions of short reads against a known reference
Note: repeats longer than read length are problematic
What are heuristics in computational bio?
There is a trade off b/w computational efficiency/resources and accuracy/precision
The goal of a heuristic is to produce a good-enough solution in reasonable time
Make a choice between completeness & speed
What are some (4) examples of heuristics?
- Optimality
- Completeness
- Accuracy and Precision
- Compute time and resources
How is indexing used?
to significantly increase alignment speed by converting the genome &/or reads into an index table of short “words”
T/F Indexing the reference genome is 0 based
T
What does position refer to in an index lookup table?
The location in the genome that the sequence occurs
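A small sketch of an index lookup table, assuming a plain dictionary that maps each k-length word to the 0-based positions where it occurs (`build_index` is a hypothetical helper; real aligners use far more compact index structures):

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to its list of 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

index = build_index("ACGTACGT", 4)
# index["ACGT"] holds both places the word occurs in the reference.
```

Looking up a word is then a single dictionary access instead of a scan of the whole genome, which is where the speed-up comes from.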
How are BLAST and NGS aligners fundamentally different?
BWA extracts a seed from the 5’ end of the read (which has higher quality). The 5’ end serves as the search space anchor
BLAST takes a 3-character word and searches for it all across the read
What are the 3 steps in indexed alignment?
- Deciding on a seed
- Aligning the seed
- Extending the seed
What is the purpose of seed extension?
To attempt to resolve the optimal alignment (since the seed that matches part of the read may occur in more than one place)
The read is extended past the seed and the corresponding adjacent sequences are compared. This shows which of the original seed hits the read aligns to
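A simplified seed-and-extend sketch, assuming ungapped extension and a plain mismatch count (the names `seed_and_extend`, `seed_len`, and `max_mismatches` are illustrative; real aligners use indexed lookups and gapped extension):

```python
def seed_and_extend(read, reference, seed_len=4, max_mismatches=1):
    """Find every seed hit, extend across the full read, keep acceptable hits.

    Returns (position, mismatches) tuples, best (fewest mismatches) first.
    """
    seed = read[:seed_len]  # seed taken from the 5' end of the read
    hits = []
    start = reference.find(seed)
    while start != -1:
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            mismatches = sum(1 for r, g in zip(read, window) if r != g)
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return sorted(hits, key=lambda hit: hit[1])

reference = "ACGTTTGAACGTACCA"
read = "ACGTAC"
best_hits = seed_and_extend(read, reference)
```

The seed `ACGT` occurs at positions 0 and 8, but only the hit at position 8 survives extension, which is how extension resolves an ambiguous seed.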
How does the aligner decide where a read aligns to (in BWA)?
- Takes into account base qualities
- Threshold based on the sum of mismatched base qualities; if the sum exceeds the threshold, the read stops extending and is hard or soft clipped
Is it better to have higher or lower phred scores for mismatched bases during alignment?
Higher quality bases are penalized more in the extension b/c there is higher confidence that the mismatch is real, compared to a base with a lower phred score
What effect does altering the phred score threshold have in aligning?
Decreasing the threshold allows fewer mismatches to be tolerated, so alignment is more specific
Use a higher threshold to allow more mismatches
Note that there will always be mismatches to the reference due to, ex: polymorphisms
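The quality-aware extension rule from the last few cards can be sketched as follows. This is an assumption-laden toy: the function name, the default threshold of 60, and the return shape are hypothetical and are not BWA's actual parameters or code.

```python
def extend_with_quality(read, quals, ref_window, threshold=60):
    """Extend base by base, summing phred qualities of mismatched bases.

    Returns (aligned_length, penalty). Extension stops at the first mismatch
    that would push the penalty over the threshold; bases from that point on
    would be clipped. Assumes read, quals, and ref_window have equal length.
    """
    penalty = 0
    for i, (base, ref_base, q) in enumerate(zip(read, ref_window, quals)):
        if base != ref_base:
            if penalty + q > threshold:
                return i, penalty
            penalty += q
    return len(read), penalty

read = "ACGTACGT"
ref_window = "ACGAACGA"  # mismatches at 0-based positions 3 and 7
high_q = extend_with_quality(read, [40] * 8, ref_window)
low_q = extend_with_quality(read, [10] * 8, ref_window)
```

With high-quality bases the second mismatch exceeds the threshold and the read is cut at position 7; with low-quality bases the same two mismatches are tolerated and the whole read aligns.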
How does BWA align reads to the reference?
1. Reference converted to an indexed 32mer table
2a. 1st (5’) 32 bases of the sequence read are extracted and matched to the table
2b. Up to 2 mismatches allowed in the seed
3. Read is extended from the seed
4. Read assigned a mapping quality
What is a mapping quality?
- assigned to read to indicate confidence with alignment
- quantify the probability that a read is misplaced
- derived from base qualities and # + frequency of mismatches for best alignment vs all possible alignments
- reported as phred scores
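A toy model of mapping quality, assuming each candidate location is weighted by 10^(-penalty/10), where the penalty is the sum of mismatched base qualities at that location. The function name, the cap of 60, and the tie rule are illustrative simplifications, not the exact formula any aligner uses:

```python
import math

def mapping_quality(penalties, cap=60):
    """Phred-scaled probability that the best candidate location is wrong.

    penalties: one mismatch-quality sum per candidate location (lower is better).
    """
    weights = [10 ** (-p / 10) for p in penalties]
    best = max(weights)
    if weights.count(best) > 1:
        return 0  # equally good locations: no confidence in the placement
    p_wrong = 1 - best / sum(weights)
    if p_wrong <= 0:
        return cap  # a single candidate: report the maximum value
    return min(cap, int(-10 * math.log10(p_wrong)))
```

A unique, clean hit gets the capped maximum, a best hit with a much worse runner-up gets a high score, and a tie gets 0.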
Why are indels problematic in aligning?
- insertion in seed shifts everything over
- problems in both seed and extension
- Can use shorter seeds to decrease the chance of including an indel in the seed
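A small illustration of why an indel inside the seed is a problem, assuming exact-match seed lookup (the helper `seed_hits` and the sequences are made up for the example):

```python
def seed_hits(read, reference, seed_len):
    """Return every 0-based position where the read's 5' seed matches exactly."""
    seed = read[:seed_len]
    hits = []
    start = reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)
    return hits

reference = "GATTACAGGCTTAACCGT"
# reference[0:12] with an extra T inserted after the first six bases
read = "GATTACTAGGCTT"

long_seed = seed_hits(read, reference, 8)   # seed spans the insertion
short_seed = seed_hits(read, reference, 6)  # seed ends before the insertion
```

The 8-base seed contains the inserted base and finds nothing, while the 6-base seed stops short of the insertion and still anchors the read at position 0.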
What happens when a read aligns to more than 1 place in the reference genome?
- assigned MQ score = 0 because we aren’t confident where the read aligned, since it aligns to more than 1 position equally well
- usually ignored downstream, but sometimes randomly assigned to 1 position on the genome w/ MQ=0
What are the 2 main modules in BWA and how do they differ?
- aln (reads < 75 nt); 32mer seed
- mem (reads > 75 nt); 22mer seed
Differ in seeds and how gaps are handled in alignment