4.1 Sequence Alignment Flashcards

1
Q

What is the objective of a global alignment?

A

An optimal alignment that includes all characters from each sequence (ex: Clustal generates global alignments)

2
Q

What is the objective of a local alignment?

A

An optimal alignment that includes only the most similar local region(s)

Ex: BLAST generates local alignments
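
The difference between the global and local objectives can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not what Clustal or BLAST actually run: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name `alignment_score` are assumptions chosen for the example. The global mode follows the Needleman-Wunsch idea and the local mode the Smith-Waterman idea.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score between strings a and b.

    local=False: global objective, every character of both sequences is used.
    local=True:  local objective, only the best-matching region is scored.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


short, long_ = "ACGT", "GGGGACGTGGGG"
print(alignment_score(short, long_))              # global: all 12 characters must be placed
print(alignment_score(short, long_, local=True))  # local: only the shared ACGT region counts
```

With these assumed scores the global result is pulled down by the eight unavoidable gaps, while the local result reflects only the matching region.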

3
Q

What are the two main groups aligners can be classified into?

A
  1. Sequence similarity searching with ranked solutions (one sequence is compared against many and ranked solutions are returned)
  2. Sequence similarity searching returning only the optimal solution (comparing many sequences to one)
4
Q

Why are statistics critical for sequence similarity searching?

A

Statistics are used to discriminate between real and artifactual matches, using an estimate of the probability that the match occurred by chance

Statistics allow us to rank-order matches and find the optimal solution

5
Q

Why are references important in bioinformatic analysis?

A
  • A change in reference/database changes your search space and expect score; it also changes the coordinate structure that sequences are aligned to, so it changes results
  • Results can’t be moved b/w references
  • References impact alignment statistics
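
How the search space feeds into the expect score can be sketched with the Karlin-Altschul form of the E-value, E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the raw alignment score. The values used for `k` and `lam` below are placeholders chosen for illustration only; real values depend on the scoring system.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.1, lam=0.3):
    """Expected number of chance hits scoring at least raw_score.

    k and lam are hypothetical placeholder parameters, not values
    taken from any particular BLAST configuration.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


small_db = expect_value(raw_score=50, query_len=100, db_len=1_000_000)
large_db = expect_value(raw_score=50, query_len=100, db_len=2_000_000)
print(large_db / small_db)  # same hit, larger database, larger expect score
```

The same alignment score is less significant against a larger database, which is one reason statistics from different references cannot be compared directly.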
6
Q

Why doesn’t BLAST scale to NGS requirements?

A
  • BLAST seeds with a 3-character word that is extended in both directions, and generates a score + expect value for each hit
  • Alignment of the reads from a single human genome re-sequencing experiment would take years
7
Q

What are some examples where sequence alignment strategies are reference-specific?

A

Ex 1: Use BWA for genomic alignments

Ex 2: Use the STAR aligner for RNA alignments

Strategy used depends on the research question being asked

8
Q

What is the objective of short read alignment?

A

To align hundreds of millions of short reads against a known reference

Note: repeats longer than read length are problematic
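
Why repeats longer than the read are a problem can be shown with a toy reference. The sequences and the helper `find_all` are made up for this sketch: the 8-base motif GATTACAG appears twice, so a 6-base read that falls entirely inside it has two equally good placements, while a longer read that reaches into the unique flanking sequence has only one.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference: unique flank + repeat + unique spacer + repeat + unique flank
reference = "CCTG" + "GATTACAG" + "TTCA" + "GATTACAG" + "AGGC"

print(find_all(reference, "ATTACA"))        # read shorter than the repeat: ambiguous
print(find_all(reference, "TGGATTACAGTT"))  # read spanning the repeat boundary: unique
```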

9
Q

What are heuristics in computational bio?

A

There is a trade-off b/w computational efficiency/resources and accuracy/precision

The goal of a heuristic is to produce a good-enough solution in a reasonable time

You make a choice between completeness & speed

10
Q

What are some (4) trade-off criteria to weigh when using heuristics?

A
  1. Optimality
  2. Completeness
  3. Accuracy and Precision
  4. Compute time and resources
11
Q

How is indexing used?

A

To significantly increase alignment speed by converting the genome and/or reads into an index table of short “words”
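
A minimal sketch of such an index table, assuming a plain dictionary from each k-letter word to the list of 0-based positions where it occurs. Real aligners use far more compact structures; the function name `build_kmer_index` and the toy sequence are illustrative only.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-letter word in reference to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


index = build_kmer_index("ACGTACGTGA", k=4)
print(index["ACGT"])           # word that occurs twice
print(index["GTGA"])           # word that occurs once
print(index.get("TTTT", []))   # word absent from the reference
```

Looking up a word is a single dictionary access, which is what makes the index faster than scanning the whole genome for every read.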

12
Q

T/F: Indexing of the reference genome is 0-based

A

T

13
Q

What does position refer to in an index lookup table?

A

The location in the genome where the sequence occurs

14
Q

How are BLAST and NGS aligners (e.g., BWA) fundamentally different?

A

BWA extracts a seed from the 5’ end of the read (which has higher quality). The 5’ end serves as the search space anchor

BLAST takes a 3-character word and searches for it all across the read

15
Q

What are the 3 steps in index-based alignment?

A
  1. Deciding on a seed
  2. Aligning the seed
  3. Extending the seed
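
The three steps can be sketched as a toy seed-and-extend routine. This is an illustration under simplifying assumptions, not BWA's algorithm: the seed is the first `seed_len` bases of the read, the seed must match exactly, extension is ungapped, and candidates are ranked by mismatch count alone. The function name and sequences are made up.

```python
def seed_and_extend(read, reference, seed_len):
    """Return (mismatches, position) for each seed hit, best candidate first."""
    # Step 1: decide on a seed (here, the 5' end of the read).
    seed = read[:seed_len]

    # Step 2: align the seed (exact matches only in this sketch).
    hits = []
    start = reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)

    # Step 3: extend each hit across the full read and count mismatches.
    candidates = []
    for pos in hits:
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        candidates.append((mismatches, pos))
    return sorted(candidates)


reference = "ACGTTGCAACGTAGGC"
print(seed_and_extend("ACGTAGG", reference, seed_len=4))
```

The seed ACGT hits two places; only extension shows that the placement at position 8 fits the whole read.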
16
Q

What is the purpose of seed extension?

A

To attempt to resolve the optimal alignment (since the seed that matches part of the read may occur in more than one place)

The read is extended past the seed and compared with the corresponding adjacent reference sequence. This shows which of the original seed hits the read aligns to

17
Q

How does the aligner decide where a read aligns to (in BWA)?

A
  1. Takes base qualities into account
  2. Uses a threshold on the sum of the mismatched base qualities; if the sum exceeds the threshold, the read stops extending and is hard- or soft-clipped
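
The threshold rule described above can be sketched as follows. This is a simplified reading of the card, not BWA's actual implementation: the function name, the threshold value, and the choice to clip everything from the offending base onward are all assumptions made for illustration.

```python
def extend_with_quality(read, quals, ref_window, max_mismatch_qual_sum):
    """Extend base by base, summing the Phred qualities of mismatched bases.

    Returns (aligned_length, clipped_length). Extension stops at the
    mismatch that pushes the running sum over the threshold, and the
    rest of the read is treated as clipped.
    """
    total = 0
    for i, (base, qual, ref_base) in enumerate(zip(read, quals, ref_window)):
        if base != ref_base:
            total += qual
            if total > max_mismatch_qual_sum:
                return i, len(read) - i
    return len(read), 0


read, ref_window = "ACGTACGT", "ACGTTCGA"   # mismatches at offsets 4 and 7

# Two high-quality mismatches (30 + 30) exceed a threshold of 50.
print(extend_with_quality(read, [30] * 8, ref_window, 50))

# The same mismatches at low quality (10 + 10) stay under the threshold.
print(extend_with_quality(read, [30, 30, 30, 30, 10, 30, 30, 10], ref_window, 50))
```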
18
Q

Is it better to have higher or lower phred scores for mismatched bases during alignment?

A

Lower. Higher-quality mismatched bases are penalized more in the extension b/c there is higher confidence that the mismatch is real, compared to a base with a lower phred score
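
The confidence behind a Phred score follows from its definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A small conversion helper (the function name is chosen for this example) makes the comparison concrete:

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


for q in (10, 20, 40):
    print(q, phred_to_error_prob(q))
```

A mismatch at Q40 is far less likely to be a sequencing error than one at Q10, so it counts more heavily against the alignment.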

19
Q

What effect does altering the phred score threshold have in aligning?

A

Decreasing the threshold allows fewer mismatches to be tolerated, so alignment is more specific

Use a higher threshold to allow more mismatches

Note that there will always be some mismatches to the reference, e.g., due to polymorphisms

20
Q

How does BWA align reads to the reference?

A
  1. Reference is converted to an indexed 32-mer table
    2a. The 1st (5’) 32 bases of the sequence read are extracted and matched to the table
    2b. Up to 2 mismatches are allowed in the seed
  3. Read is extended from the seed
  4. Read is assigned a mapping quality
21
Q

What is a mapping quality?

A
  • Assigned to each read to indicate confidence in its alignment
  • Quantifies the probability that a read is misplaced
  • Derived from base qualities and the number + frequency of mismatches for the best alignment vs. all possible alignments
  • Reported as phred-scaled scores
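
One textbook-style way to turn "best alignment vs. all possible alignments" into a Phred-scaled number is sketched below. It assumes each candidate placement already has a relative likelihood, and takes the probability of misplacement as one minus the best placement's share of the total. The function name and the cap of 60 are assumptions for this example. Actual aligners use their own approximations (the cards here say BWA gives MQ = 0 to a read with several equally good placements), so the numbers are illustrative only.

```python
import math


def mapping_quality(placement_likelihoods, cap=60):
    """Phred-scaled confidence that the best placement is the right one."""
    total = sum(placement_likelihoods)
    p_misplaced = 1.0 - max(placement_likelihoods) / total
    if p_misplaced <= 0.0:
        return cap  # only one candidate: report the maximum score
    return min(cap, round(-10 * math.log10(p_misplaced)))


print(mapping_quality([1.0]))          # single candidate placement
print(mapping_quality([0.99, 0.01]))   # one dominant placement
print(mapping_quality([0.5, 0.5]))     # two equally good placements
```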
22
Q

Why are indels problematic in aligning?

A
  1. An insertion in the seed shifts everything after it over
    - causes problems in both the seed and the extension
  2. Can use shorter seeds to decrease the chance of including an indel in the seed
23
Q

What happens when a read aligns to more than 1 place in the reference genome?

A
  1. Assigned MQ = 0 because we aren’t confident where the read belongs, since it aligns to more than 1 position with the same score
  2. Usually ignored downstream, but sometimes randomly assigned to 1 position on the genome w/ MQ=0
24
Q

What are the 2 main modules in BWA and how do they differ?

A
  1. aln (reads < 75 nt); 32mer seed
  2. mem (reads > 75 nt); 22mer seed

They differ in seed length and in how gaps are handled in alignment

25
Q

What are the main steps in BWA?

A
  1. Reference is indexed
  2. Aligner is called to align the reads
  3. Reads are paired together
  4. Pairs are written to a SAM file
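
The four steps map onto the classic `bwa index`, `bwa aln`, and `bwa sampe` subcommands. The sketch below only assembles the command lines; the file names are hypothetical, and the helper names `bwa_aln_pipeline` and `run_pipeline` are made up for this example. With `bwa mem`, alignment and pairing happen in one command (`bwa mem ref.fa r1.fq r2.fq > out.sam`).

```python
import subprocess


def bwa_aln_pipeline(ref, fq1, fq2, out_sam):
    """Return (argv, stdout_path) pairs for a paired-end bwa aln workflow."""
    sai1, sai2 = fq1 + ".sai", fq2 + ".sai"
    return [
        (["bwa", "index", ref], None),                           # 1. index the reference
        (["bwa", "aln", ref, fq1], sai1),                        # 2. align read 1
        (["bwa", "aln", ref, fq2], sai2),                        #    align read 2
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], out_sam),  # 3-4. pair, write SAM
    ]


def run_pipeline(steps):
    """Run each step, sending stdout to the given file when one is named."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


for argv, stdout_path in bwa_aln_pipeline("ref.fa", "r1.fq", "r2.fq", "out.sam"):
    print(" ".join(argv), ">" + stdout_path if stdout_path else "")
```

Calling `run_pipeline` requires bwa to be installed and the input files to exist; printing the commands first is a cheap way to check them.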
26
Q

What are 2 databases from which reference genomes can be obtained?

A

NCBI

UCSC Genome Browser

27
Q

What is an md5sum and what is it used for?

A
  • calculates and verifies 128-bit MD5 hashes of a file
  • used to confirm the identity and completeness of any file
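
The same check can be done from Python with the standard-library `hashlib` module. This is a minimal sketch; the function name and chunk size are arbitrary choices for the example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading a reference, compare `md5_of_file("genome.fa.gz")` (a hypothetical file name) with the checksum published alongside it; a mismatch means the file is incomplete or altered.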

28
Q

How are read pairs used to rescue sequences aligned to repetitive genomic regions?

A

During the pairing process the aligner looks for reads with MQ=0
If read 1 has MQ=0, read 2 maps to a unique position, & one of read 1’s candidate positions lies within some distance of the median insert size from read 2, the location of read 1 can be determined
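
The rescue logic can be sketched as a filter over read 1's candidate positions. The function name, the tolerance parameter, and the coordinates are hypothetical, and strand and orientation checks are left out of this simplified version.

```python
def rescue_position(read1_candidates, read2_pos, median_insert, tolerance):
    """Pick the read 1 candidate whose distance to read 2 fits the insert size.

    Returns the rescued position, or None when zero or several
    candidates are consistent with the expected insert size.
    """
    consistent = [
        pos for pos in read1_candidates
        if abs(abs(read2_pos - pos) - median_insert) <= tolerance
    ]
    return consistent[0] if len(consistent) == 1 else None


# Read 1 hits three copies of a repeat; read 2 maps uniquely near one of them.
print(rescue_position([1000, 50000, 90000], read2_pos=50300,
                      median_insert=300, tolerance=50))
```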

29
Q

What is a SAM file?

A
  • A mapping file standard
  • Stores large nt sequence alignments

30
Q

What does the SAM format allow?

A
  1. Flexible storage of all alignment info generated across platforms
  2. Simple to generate by alignment programs or to convert b/w formats
  3. Allows most alignment operations without loading the whole alignment into memory
  4. Allows files to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
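
Because each alignment is one tab-separated line with 11 mandatory fields, a SAM file can be processed one record at a time. A minimal parser sketch follows; the example record is invented, and the `TAGS` key is a naming choice for this example rather than part of the format. Note that POS in SAM is 1-based.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[11:]   # optional TAG:TYPE:VALUE fields, left unparsed
    return record


line = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["MAPQ"], record["CIGAR"])
```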
31
Q

What is a BAM file and how does it differ from SAM?

A

BAM is the binary version of the SAM format

BAM & SAM contain the same info in different formats