4.1 Sequence Alignment Flashcards by Melanie Law

What is the objective of a global alignment?

An optimal alignment that includes all characters from each sequence (ex: Clustal generates global alignments)
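A minimal sketch of how a global alignment score can be computed, assuming the classic Needleman-Wunsch dynamic program. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the cards:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: every character of both sequences is used."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The bottom-right cell covers both sequences end to end.
    return score[-1][-1]
```

Because the score is read from the bottom-right cell, unmatched ends are always paid for with gaps, which is what makes the alignment global.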

What is the objective of a local alignment?

optimal alignment that includes only the most similar local region(s)

Ex: BLAST generates local alignments
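A minimal sketch of a local alignment score, assuming the Smith-Waterman dynamic program (BLAST itself uses a faster heuristic, so this is not BLAST's method). The function name and scoring values are illustrative assumptions:

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere, so poorly
            # matching flanks are simply left out of the alignment.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

The only changes from a global alignment are the floor at 0 and taking the best cell anywhere in the table rather than the final cell.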

What are the two main groups aligners can be classified into?

Sequence similarity searching with ranked solutions (one sequence is compared against many and ranked solutions are returned)
Sequence similarity searching returning only the optimal solution (comparing many sequences to one)

Why are statistics critical for sequence similarity searching?

Used to discriminate between real and artifactual matches, which is done using an estimate of the probability that the match occurred by chance

Statistics allow us to rank solutions and find the optimal one

Why are references important in bioinformatic analysis?

A change in reference/database changes your search space and expect score; a change in the coordinate structure that sequences are aligned to changes the results
Can’t move between references
References impact alignment statistics
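A minimal sketch of why the expect score depends on the search space, assuming the Karlin-Altschul form E = K * m * n * exp(-lambda * S) used for BLAST statistics. The function name and the default values of `k` and `lam` are illustrative assumptions, not parameters of any real scoring matrix:

```python
import math

def expect_value(raw_score, query_length, database_length, k=0.1, lam=0.3):
    """Expected number of chance hits scoring at least raw_score.

    k and lam are placeholder statistical parameters (assumptions).
    """
    return k * query_length * database_length * math.exp(-lam * raw_score)

# The same hit against a reference twice as large is twice as likely by chance.
small_db = expect_value(raw_score=50, query_length=100, database_length=1_000_000)
large_db = expect_value(raw_score=50, query_length=100, database_length=2_000_000)
```

The score of the hit is unchanged between the two calls; only the size of the reference differs, and the expect value scales with it.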

Why doesn’t BLAST scale to NGS requirements?

BLAST scores a 3-character word match, extends it in both directions, and generates an alignment score and expect value
Aligning the reads from a single human genome re-sequencing experiment would take years

What are some examples where sequence alignment strategies are reference-specific?

Ex 1: Use BWA for genomic alignments

Ex 2: Use the STAR aligner for RNA alignments

Strategy used depends on the research question being asked

What is the objective of short read alignment?

To align 100s of millions of short reads against a known reference

Note: repeats longer than read length are problematic

What are heuristics in computational biology?

There is a trade-off between computational efficiency/resources and accuracy/precision

The goal of a heuristic is to produce a good-enough solution in a reasonable time

Make a choice between completeness and speed

What are some (4) examples of heuristics?

Optimality
Completeness
Accuracy and Precision
Compute time and resources

How is indexing used?

To significantly increase alignment speed by converting the genome and/or reads into an index table of short “words”
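A minimal sketch of such an index table, assuming a plain dictionary that maps every k-letter word to the positions where it occurs. The function name is hypothetical, and positions are 0-based here only because Python strings are 0-based:

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each k-letter word to the list of positions where it starts."""
    index = defaultdict(list)
    for position in range(len(genome) - k + 1):
        index[genome[position:position + k]].append(position)
    return dict(index)

index = build_index("ACGTACGT", 4)
# A word lookup is now a single dictionary access instead of a genome scan.
hits = index["ACGT"]
```

Real aligners use far more compact index structures, but the lookup idea is the same: the word is the key and the genome positions are the value.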

T/F: Indexing the reference genome is 0-based

What does position refer to in an index lookup table?

The location in the genome where the sequence occurs

How are BLAST and NGS aligners (ex: BWA) fundamentally different?

BWA extracts a seed from the 5’ end of the read (which has higher quality). The 5’ end serves as the search space anchor

BLAST takes a 3-character word and searches for it all across the read

What are the 3 steps in indexing?

Deciding on a seed
Aligning the seed
Extending the seed

What is the purpose of seed extension?

To attempt to resolve the optimal alignment (since the seed that matches part of the read may occur in more than one place)

The read is extended past the seed and compared with the corresponding adjacent reference sequence. This shows which of the original seed locations the read aligns to
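A toy sketch of seed-and-extend, assuming exact seed matching, no gaps, and a simple mismatch count to choose between candidate locations. All names are hypothetical and this is not BWA's actual algorithm:

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each k-letter word to the positions where it starts."""
    index = defaultdict(list)
    for position in range(len(genome) - k + 1):
        index[genome[position:position + k]].append(position)
    return index

def seed_and_extend(read, genome, index, k):
    """Return (position, mismatches) for the best candidate, or None."""
    seed = read[:k]  # seed taken from the 5' end of the read
    best = None
    for position in index.get(seed, []):
        window = genome[position:position + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the genome
        # Extend past the seed and compare the adjacent bases.
        mismatches = sum(1 for r, g in zip(read[k:], window[k:]) if r != g)
        if best is None or mismatches < best[1]:
            best = (position, mismatches)
    return best

genome = "ACGTTTTTGGACGTAACC"
index = build_index(genome, 4)
placement = seed_and_extend("ACGTAACC", genome, index, 4)
```

The seed ACGT occurs twice in this genome, and only the extension tells the two candidate locations apart.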

How does the aligner decide where a read aligns to (in BWA)?

Takes into account base qualities
The threshold is based on the sum of mismatched base qualities; if the sum exceeds the threshold, the read stops extending and is hard- or soft-clipped
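A toy sketch of the quality-sum threshold described on this card. The function name, the default threshold of 70, and the clipping behavior are illustrative assumptions, not BWA's actual implementation:

```python
def aligned_length(read, qualities, reference, max_mismatch_quality=70):
    """Return how many leading bases of the read stay aligned.

    Bases from the returned index onward would be clipped.
    """
    penalty = 0
    for i, (base, quality, ref_base) in enumerate(zip(read, qualities, reference)):
        if base != ref_base:
            penalty += quality  # a confident mismatch costs more
            if penalty > max_mismatch_quality:
                return i  # stop extending here
    return len(read)

read = "ACGTACGT"
reference = "ACGTTCGA"  # mismatches at index 4 and index 7
high_quality = aligned_length(read, [30] * 8, reference, max_mismatch_quality=40)
low_quality = aligned_length(read, [10] * 8, reference, max_mismatch_quality=40)
```

With the same two mismatches, the high-quality read exceeds the threshold and is clipped at the second mismatch, while the low-quality read is tolerated in full. Raising the threshold tolerates more mismatches.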

Is it better to have higher or lower phred scores for mismatched bases during alignment?

Lower. Higher-quality mismatched bases are penalized more in the extension because there is higher confidence that the mismatch is real than for a base with a lower Phred score

What effect does altering the Phred score threshold have in aligning?

Decreasing the threshold allows fewer mismatches to be tolerated, so the alignment is more specific

Use a higher threshold to allow more mismatches

Note that there will always be mismatches to the reference due to, for example, polymorphisms

How does BWA align reads to the reference?

1. Reference is converted to an indexed 32mer table
2a. The first (5’) 32 bases of the sequence read are extracted and matched to the table
2b. Up to 2 mismatches are allowed in the seed
3. Read is extended from the seed
4. Read is assigned a mapping quality

What is a mapping quality?

Assigned to a read to indicate confidence in its alignment
Quantifies the probability that a read is misplaced
Derived from base qualities and the number and frequency of mismatches for the best alignment vs. all possible alignments
Reported as a Phred score
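A minimal sketch of the Phred scaling, Q = -10 * log10(P), applied to the probability that a read is misplaced. The function names and the cap of 60 are illustrative assumptions; how an aligner estimates the probability itself is not shown:

```python
import math

def phred_from_probability(p_misplaced, cap=60):
    """Convert a misplacement probability to a Phred-scaled quality."""
    if p_misplaced <= 0:
        return cap  # avoid log10(0); treat as maximally confident
    return min(cap, round(-10 * math.log10(p_misplaced)))

def probability_from_phred(quality):
    """Convert a Phred-scaled quality back to a probability."""
    return 10 ** (-quality / 10)
```

On this scale every 10 points is a tenfold drop in the chance of misplacement, and a quality of 0 means the placement carries no confidence at all.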

Why are indels problematic in aligning?

An insertion in the seed shifts everything over
- causes problems in both the seed and the extension
Can use shorter seeds to decrease the chance of including an indel in the seed
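A small sketch of the shift, assuming an ungapped, position-by-position comparison. The sequences and the function name are made up for illustration:

```python
def ungapped_mismatches(read, reference):
    """Count position-by-position mismatches with no gaps allowed."""
    return sum(1 for r, g in zip(read, reference) if r != g)

reference = "ACGTTGCAAGCT"
read_with_substitution = "ACATTGCAAGCT"  # one base changed at index 2
read_with_insertion = "ACTGTTGCAAGC"     # extra T inserted at index 2, same length

substitution_cost = ungapped_mismatches(read_with_substitution, reference)
insertion_cost = ungapped_mismatches(read_with_insertion, reference)
```

A single substitution costs one mismatch, while a single inserted base puts every downstream base out of register, so most of them mismatch unless the aligner opens a gap.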

What happens when a read aligns to more than 1 place in the reference genome?

Assigned MQ = 0 because we aren’t confident where the read aligned, since it aligns equally well to more than 1 position
Usually ignored downstream, but sometimes randomly assigned to 1 position on the genome with MQ=0

What are the 2 main modules in BWA and how do they differ?

aln (reads < 75 nt); 32mer seed
mem (reads > 75 nt); 22mer seed

Differ in seeds and how gaps are handled in alignment

What are the main steps in BWA?

1. Reference is indexed
2. Aligner is called to align reads
3. Reads are paired together
4. Pairs are output in a SAM file

What are 2 databases where reference genomes can be obtained?

NCBI
UCSC Genome Browser

What is an md5sum and what is it used for?

- calculates and verifies 128-bit MD5 hashes of a file
- used to confirm the identity and completeness of any file
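A minimal sketch of the same check in Python, using the standard `hashlib` module. The function name and chunk size are illustrative choices. Reading in chunks keeps memory use low for large reference files:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned digest with the checksum published alongside a download confirms that the file arrived complete and unchanged.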

How are read pairs used to rescue sequences aligned to repetitive genomic regions?

During the pairing process, the aligner looks for reads with MQ=0. If read 1 has MQ=0, read 2 maps to a unique position, and one of read 1’s candidate positions lies within some distance of the median insert size from read 2, the location of read 1 can be determined
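A toy sketch of this rescue step, assuming positions on a single chromosome and ignoring strand and orientation. The function name, parameters, and tolerance rule are illustrative assumptions, not BWA's implementation:

```python
def rescue_read(candidate_positions, mate_position, median_insert, tolerance):
    """Pick the one candidate position consistent with the uniquely mapped mate.

    Returns None when zero or several candidates fit the insert size.
    """
    plausible = [
        position
        for position in candidate_positions
        if abs(abs(position - mate_position) - median_insert) <= tolerance
    ]
    return plausible[0] if len(plausible) == 1 else None

# Read 1 hits three copies of a repeat; read 2 maps uniquely at 50300.
rescued = rescue_read([1000, 50000, 900000], mate_position=50300,
                      median_insert=300, tolerance=50)
```

Only the candidate at 50000 sits about one insert size away from the mate, so it is chosen. If two candidates fit equally well, the read stays ambiguous.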

What is a SAM file?

- A mapping file standard
- Stores large nucleotide sequence alignments
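A minimal sketch of reading one SAM alignment line, assuming only the 11 mandatory tab-separated fields and ignoring optional tags. The class name, function name, and example record are made up for illustration:

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags (paired, reverse strand, ...)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment operations, e.g. 8M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)

def parse_sam_line(line):
    """Parse the mandatory fields of one alignment line (not a header line)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])

# Hypothetical record, written by hand for this example.
record = parse_sam_line("read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\n")
is_paired = bool(record.flag & 0x1)
```

Header lines start with `@` and would need to be skipped before calling the parser. In practice a dedicated library is a safer choice than hand parsing.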

What does the SAM format allow?

1. Flexible storage of all alignment info generated across platforms
2. Simple for alignment programs to generate or to convert between formats
3. Allows most alignment operations without loading the whole alignment into memory
4. Allows files to be indexed by genomic position to efficiently retrieve all reads aligning to a locus

What is a BAM file and how does it differ from SAM?

BAM is a binary format of SAM
BAM and SAM contain the same info in different formats

4.1 Sequence Alignment Flashcards

(31 cards)