6. NGS: read mapping Flashcards

1
Q

What is the similarity/difference between assembly and read mapping?

A

Assembly: first genome of a species
- de novo assembly (high quality assembly: expensive/difficult)
- genome structure, gene inventory, etc

Read mapping: next genome(s) of that species (or a close relative)
- de novo assembly serves as reference: read mapping, re-sequencing, alignment
- cheaper: lower coverage required
- different questions

2
Q

What is read mapping also known as?

A

AKA
mapping, alignment, re-sequencing

3
Q

What is reference bias?

What is a possible solution?

A

Mapping bias / reference bias

  • reads containing non-reference alleles
    • might not be mapped or incorrectly mapped
    • receive lower mapping scores (mismatches!)
  • effect increases with genetic distance from the reference genome
  • affects downstream analysis (e.g., population genomics)

solution
* map to multiple reference genomes

4
Q

NGS read mapping - approaches?

A

1: (pre-process &) identify short matches (seeds)
* hash-table based (Sanger, NGS)
* BWT based (NGS)

2: extend seeds to longer alignments (e.g., DP)

5
Q

What is a pan-genome?

A

collection of the common (core) and unique (accessory) genes or sequences found across the genomes of a given species.

combines genetic information of all genomes sampled –> large and diverse range of genetic material

6
Q

What is a suffix tree?

A

compressed trie containing all suffixes of the given text as their keys and positions in the text as their values.

7
Q

define genotype

A

In a broad sense, the term “genotype” refers to the genetic makeup of an organism; in other words, it describes an organism’s complete set of genes. In a more narrow sense, the term can be used to refer to the alleles, or variant forms of a gene, that are carried by an organism.

8
Q

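What is a hash table?

A

data structure that maps keys to values via a hash function, so a lookup by key takes (on average) constant time
e.g. a dictionary in Python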

9
Q

How is hashing used for read mapping?

A
  • create hash table with substrings of length k called k-mers extracted from reference genome as keys, and their positions on the reference as values
  • Then, some k-mers from each read are looked up in the hash table to find candidate locations of the read on the reference (hits)
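
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular mapper is implemented; the function names `build_kmer_index` and `candidate_positions`, the toy reference and the choice of k = 4 are all assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash table: k-mer -> list of start positions on the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_positions(read, index, k):
    """Look up every k-mer of the read; each hit votes for the read start it implies."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return dict(votes)

# Toy data (hypothetical): the read is an exact copy of reference[8:16].
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
hits = candidate_positions("TAGCCGAT", index, k=4)
best_start = max(hits, key=hits.get)
print(best_start, hits[best_start])
```

Candidate positions with many votes would then be passed on to the extension step (e.g., dynamic programming).
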
10
Q

In hash-table based read mapping, what are the drawbacks of short k-mers? Of long k-mers?

What is the solution?

A

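shorter templates (substrings, k-mers)
- allow mapping in the presence of more errors/variation
- are not very effective seeds (match too many regions)

longer templates
- are more specific (match fewer regions)
- are more likely to overlap an error/variant, so fewer reads find a seed hit

dilemma: sensitivity vs. specificity

solution: spaced seeds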

11
Q

What are spaced seeds?

A

pattern (template) of relevant and irrelevant positions in a biosequence; matching with such a template is a form of approximate string matching that allows for substitutions.

templates are optimized for a given read length and expected error/mismatch rate

usually represented as a sequence of zeroes and ones (1 = position must match, 0 = position is ignored)
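
A small sketch of how such a template could be applied. The mask `1101011`, the toy sequences and the helper names are assumptions chosen for illustration; real mappers use optimized templates and hash the masked k-mers rather than scanning the reference.

```python
def apply_mask(seq, mask):
    """Keep only the bases that sit under a '1' in the template."""
    return "".join(base for base, m in zip(seq, mask) if m == "1")

def spaced_seed_hits(read, reference, mask):
    """Reference positions whose masked window equals the masked start of the read."""
    w = len(mask)
    key = apply_mask(read[:w], mask)
    return [pos for pos in range(len(reference) - w + 1)
            if apply_mask(reference[pos:pos + w], mask) == key]

mask = "1101011"           # hypothetical template: 5 relevant, 2 ignored positions
reference = "GGACGTTCAGG"
read = "ACTTTCA"           # reference[2:9] with a substitution at an ignored position

print(spaced_seed_hits(read, reference, mask))   # the spaced seed still finds the locus
print(read in reference)                         # a contiguous 7-mer seed would not
```
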

12
Q

consecutive seeds vs. spaced seeds

A

consecutive seeds: one template, needs contiguous matches

spaced seed: template allows mismatches at its ignored positions

13
Q

What is a BW matrix?

A

BW matrix:
rows are sorted cyclical rotations of string$ - equivalent to sorting the suffixes
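
As a minimal sketch, the matrix can be built naively by sorting all rotations; the BWT is then its last column. This assumes the text does not contain `$` and that `$` sorts before every other character (true for uppercase letters in ASCII). The quadratic-memory construction is for illustration only; real indices are built via suffix arrays.

```python
def bw_matrix(text):
    """All cyclic rotations of text + '$', sorted lexicographically."""
    s = text + "$"
    return sorted(s[i:] + s[:i] for i in range(len(s)))

def bwt(text):
    """Last column of the BW matrix."""
    return "".join(row[-1] for row in bw_matrix(text))

for row in bw_matrix("banana"):
    print(row)
print(bwt("banana"))
```
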

14
Q

What is BWT?

A

Burrows-Wheeler Transform
- last column of the BW matrix (a reversible permutation of the text)
- full-text indexing approach
- suffix array-like index structure

15
Q

What are the advantages of BWT-based approaches for read-mapping?

A
  • space efficient, can be efficiently compressed
  • original string can be recovered
  • can fit in RAM of most computers
  • pre-computed indices can be stored & shared
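
That the original string can be recovered is easy to demonstrate with the textbook (and deliberately inefficient) inversion: repeatedly prepend the BWT column to the table and re-sort. The function name `inverse_bwt` is an assumption for this sketch, which expects a BWT built over a text terminated by a unique `$`.

```python
def inverse_bwt(last_column):
    """Rebuild the sorted rotation table column by column, then read off the text."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepending the last column and sorting adds one more correct column.
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]   # drop the terminator

print(inverse_bwt("annb$aa"))
```
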
16
Q

What are the most widely used read mapper tools?

A

BWA, Bowtie2

17
Q

What problem does mapping of long reads present and how can this be addressed?

A
problem:
  • higher number of sequencing errors (and different types of errors)

how it is addressed:
  • hash-based and BWT-based approaches are still used
  • longer reads: many short seeds per read
  • higher error rates –> use chaining (join collinear seed hits into one candidate alignment)
  • use representative seeds only (minimizers)
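
A sketch of the minimizer idea: from every window of w consecutive k-mers, keep only the smallest one. Lexicographic ordering and the parameters k = 3, w = 4 are simplifying assumptions for this example; real long-read mappers typically order k-mers by a hash value and also consider the reverse strand.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer of each window of w k-mers."""
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        kmer, pos = min(kmers[start:start + w])   # ties broken by leftmost position
        selected.add((pos, kmer))
    return sorted(selected)

seq = "ACGTTGCATGAC"
result = minimizers(seq, k=3, w=4)
print(result)
print(len(result), "of", len(seq) - 3 + 1, "k-mers kept as seeds")
```

Because neighbouring windows often share their minimizer, far fewer seeds have to be stored and looked up than with all k-mers.
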
18
Q

What is SAM and BAM?

A

Alignment output
* compact file size
* analysis without loading the alignment into memory
* can be used by alignment viewers (IGV, samtools)
* SAM (Sequence Alignment/Map) format - human readable (sort of), can be parsed easily
* BAM format - same information but in binary format

19
Q

In SAM format, what is the flag code?

A

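a bitwise integer whose individual bits tell us whether the read is paired, mapped or unmapped, on the forward or reverse strand, first or second in its pair, etc.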
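
A minimal sketch of decoding the flag. The bit values are those defined in the SAM specification; the short label strings and the function name `decode_flag` are made up for this example.

```python
# Bit values from the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """List the properties whose bit is set in the flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(99))
print(decode_flag(4))
```
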

20
Q

What is mapping quality?

When is this relevant?

Origin? Usage?

A

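mapping quality (MAPQ) = Phred-scaled probability that the read is aligned to the wrong position: MAPQ = -10 · log10(P(wrong)); higher values mean higher confidence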

Relevant with repetitive genome regions, SNPs, sequencing errors
➡ reads map equally well to multiple positions
➡ reduced probability of assigning a read to the correct genomic location
➡ measure the confidence that a read actually comes from the position it is aligned to

first introduced with MAQ mapping software (2008)
different programs compute this differently! (and some don’t compute it at all)
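
The Phred scaling used for mapping quality in the SAM format can be sketched as follows. The cap of 60 and the function names are assumptions for illustration; as noted above, each mapper derives the underlying probability (and its maximum value) in its own way.

```python
import math

def mapq_from_error_probability(p_wrong, cap=60):
    """MAPQ = -10 * log10(P(wrong position)), rounded and capped (cap is an assumption)."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

def error_probability_from_mapq(mapq):
    """Inverse conversion: probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)

print(mapq_from_error_probability(0.001))   # 1 in 1000 chance of a wrong position
print(mapq_from_error_probability(0.5))     # two equally good positions
print(error_probability_from_mapq(20))
```
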

21
Q

What is a single bp variant? What is an SNP?

A

SNPs
* a single bp change with a frequency of at least 1% within a population
* bi-allelic

others: rare, just called single bp variants

22
Q

What can cause single bp variants?

What do we need to consider in order to resolve this?

What approaches?

A

Cause:
errors, paralogs, diploidy, polyploidy, …

considerations:
read depth, read quality, alignment quality, frequency and ratio of SNPs, number of equally good alignments per read, …

approaches
- allele counting
- probabilistic methods
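
A toy sketch of the allele-counting approach for a diploid sample. The thresholds `min_depth=8` and `min_fraction=0.2` and the function name are hypothetical choices for this example, not values taken from any tool.

```python
from collections import Counter

def call_by_allele_counting(bases, min_depth=8, min_fraction=0.2):
    """Call a diploid genotype from the bases observed at one position.

    Returns e.g. 'AA' or 'AG', or None when no confident call is possible.
    The thresholds are illustrative assumptions.
    """
    if len(bases) < min_depth:
        return None                       # too little read depth
    counts = Counter(bases)
    alleles = sorted(b for b, n in counts.items() if n / len(bases) >= min_fraction)
    if len(alleles) == 1:
        return alleles[0] * 2             # homozygous
    if len(alleles) == 2:
        return alleles[0] + alleles[1]    # heterozygous
    return None                           # more than two alleles: paralogs, errors, ...

print(call_by_allele_counting("AAAAAAAAAG"))   # single G treated as an error
print(call_by_allele_counting("AAAAAGGGGG"))
print(call_by_allele_counting("AAG"))
```

Fixed cut-offs like these ignore base and mapping qualities, which is the motivation for probabilistic methods.
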

23
Q

What is variant calling?

Name one package and the tools to do this

A

Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data

bcftools

  1. bcftools mpileup
    * for every position, considers data from all reads covering the position
    * calculates genotype likelihoods
    - based on mapping qualities of the reads, base qualities, the probability of local misalignment, per-base alignment quality (BAQ)
  2. bcftools call
    * calls the most likely genotype under the assumption of Hardy-Weinberg equilibrium, using allele frequencies estimated from the data or provided explicitly by the user
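
To make "genotype likelihoods" concrete, here is a deliberately simplified sketch for one bi-allelic, diploid site. It is NOT the model bcftools uses: it considers base qualities only (no mapping quality, no BAQ, no allele-frequency prior), and the function name and input layout are assumptions for this example.

```python
import math

def genotype_log_likelihoods(observations):
    """log10 likelihood of genotypes 0/0, 0/1, 1/1 at one site.

    observations: list of (is_alt, phred_base_quality) tuples, quality > 0.
    Simplified illustrative model, not bcftools' implementation.
    """
    result = {}
    for n_alt, name in enumerate(("0/0", "0/1", "1/1")):
        alt_fraction = n_alt / 2
        total = 0.0
        for is_alt, qual in observations:
            err = 10 ** (-qual / 10)      # probability that the base call is wrong
            p_alt = alt_fraction * (1 - err) + (1 - alt_fraction) * err
            total += math.log10(p_alt if is_alt else 1 - p_alt)
        result[name] = total
    return result

# Five reads support the alternative allele, five the reference, all at Q30.
obs = [(True, 30)] * 5 + [(False, 30)] * 5
likelihoods = genotype_log_likelihoods(obs)
print(max(likelihoods, key=likelihoods.get))
```
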
24
Q

What is VCF?

A

Variant Call Format
* text based
* meta-information, genome variation by position
* can include info about structural variation

25
Q

What are structural variants? Size? Relevant for? etc.

A

- variation generally >50 bp
- ~2% of the human genome
- relevant for medicine, molecular biology, evolution
- least studied type of variation, because they are very difficult to detect with short reads
26
Q

What are different types of structural variants?

A

- insertions, deletions, duplications
- inversions, rearrangements
- copy number variations
27
Q

Structural variants with:
- short reads
- long reads
- genome alignment

A

short reads
- need to take into consideration distance and orientation of paired reads; split reads
- some specialized mappers exist
- none of them can identify all SV types
- limit inherent in the short reads

long reads
- in theory advantageous, span SV regions
- in practice: not very many methods yet available
- often specific for sequencing technology

genome alignment
- required: de novo assembly, spanning the SV

➡ useful & important! methods are still evolving!
28
Q

Describe FASTA format

A

- text-based format
- represents either nucleotides or amino acids using single-letter codes
- each record begins with a single-line description (starting with ">")
- followed by lines of sequence data
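
A minimal sketch of reading this format. The function name `parse_fasta` and the example records are assumptions; the first word of each description line is used as the record name.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):              # description line starts a new record
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:                # sequence may span several lines
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

example = ">seq1 example nucleotide record\nACGT\nACGT\n>seq2 example protein record\nMKV\n"
records = parse_fasta(example)
print(records)
```
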
29
Q

Describe FASTQ format

A

- text-based format
- stores a biological sequence and its corresponding quality scores
- each base and each quality score is encoded as a single ASCII character
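
A sketch of decoding one four-line FASTQ record. It assumes the common Phred+33 encoding (quality = ASCII code minus 33); older files may use a different offset. The record and the function name are invented for the example.

```python
def parse_fastq_record(lines, offset=33):
    """Decode one 4-line FASTQ record; offset=33 assumes Phred+33 encoding."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(char) - offset for char in quality]   # one ASCII char per base
    return header[1:], sequence, scores

name, sequence, scores = parse_fastq_record(["@read1", "ACGT", "+", "II5!"])
print(name, sequence, scores)
```
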
30
Q

EXAM QUESTION
Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)

A

structural variants (choose two):
- insertions
- deletions
- duplications
- inversions, rearrangements
- copy number variations

detection:
- long reads are split into sub-reads and mapped
- analysis of where and how the sub-reads map can then determine whether one of the SVs above is present