6. NGS: read mapping Flashcards

1
Q

What is the similarity/difference between assembly and read mapping?

A

Assembly: first genome of a species
- de novo assembly (high quality assembly: expensive/difficult)
- genome structure, gene inventory, etc

Read mapping: next genome(s) of that species (or a close relative)
- de novo assembly serves as reference: read mapping, re-sequencing, alignment
- cheaper: lower coverage required
- different questions

2
Q

What is read mapping also known as?

A

AKA
mapping, alignment, re-sequencing

3
Q

What is reference bias?

What is a possible solution?

A

Mapping bias / reference bias

  • reads containing non-reference alleles
    • might not be mapped or incorrectly mapped
    • receive lower mapping scores (mismatches!)
  • effect increases with genetic distance from the reference genome
  • affects downstream analysis (e.g., population genomics)

solution
* map to multiple reference genomes

4
Q

NGS read mapping - approaches?

A

1: (pre-process &) identify short matches (seeds)
* hash-table based (Sanger, NGS)
* BWT based (NGS)

2: extend seeds to longer alignments (e.g., DP)

5
Q

What is a pan-genome?

A

collection of the common (core) and unique (accessory) genes or sequences found across the genomes of a given species.

combines genetic information of all genomes sampled –> large and diverse range of genetic material

6
Q

What is a suffix tree?

A

compressed trie containing all suffixes of the given text as their keys and positions in the text as their values.

7
Q

define genotype

A

In a broad sense, the term “genotype” refers to the genetic makeup of an organism; in other words, it describes an organism’s complete set of genes. In a more narrow sense, the term can be used to refer to the alleles, or variant forms of a gene, that are carried by an organism.

8
Q

What is a hash table

A

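data structure that maps keys to values via a hash function, so a key can be looked up in (on average) constant time

e.g. a dictionary (dict) in Python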

9
Q

How is hashing used for read mapping?

A
  • create hash table with substrings of length k called k-mers extracted from reference genome as keys, and their positions on the reference as values
  • Then, some k-mers from each read are looked up in the hash table to find candidate locations of the read on the reference (hits)
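
Sketch (not from the lecture): a minimal Python version of the two steps above, assuming a plain dict as the hash table and exact k-mer matches only. The function names build_kmer_index and candidate_positions and the toy sequences are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Look up each k-mer of the read in the index.

    A hit at reference position p for the k-mer at read offset i
    suggests that the read starts at reference position p - i.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)


reference = "ACGTACGTTAGC"   # toy reference
index = build_kmer_index(reference, k=4)
hits = candidate_positions("CGTTAG", index, k=4)
print(hits)
```

Each hit votes for an implied read start; positions with many votes are the candidate locations that would be handed to the extension step.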
10
Q

In hash-table based read mapping, what are the drawbacks of short k-mers? Of long k-mers?

What is the solution?

A

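smaller templates (substrings, k-mers)
- allow mapping in the presence of more errors/variation
- are not very effective seeds (match too many regions)

longer templates
- are more specific seeds (match fewer regions)
- tolerate fewer errors/variation: a single mismatch inside the k-mer destroys the hit

dilemma: sensitivity vs. specificity!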

solution: spaced seeds

11
Q

What are spaced seeds?

A

templates: optimized for a given read length and
expected error/mismatch rate

pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions.

usually represented as a sequence of zeroes and ones (1 = position must match, 0 = position is ignored)
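
Sketch (illustration only): applying a spaced-seed template in Python. The template 1101011 and the helper apply_seed are invented for this example; real templates are optimized for read length and error rate, as stated above.

```python
def apply_seed(sequence, start, template):
    """Return the characters of the window that sit under a '1' in the template.

    Positions under a '0' are ignored, so a substitution there
    does not change the resulting key.
    """
    window = sequence[start:start + len(template)]
    if len(window) < len(template):
        return None  # window runs past the end of the sequence
    return "".join(base for base, bit in zip(window, template) if bit == "1")


template = "1101011"                               # hypothetical template
ref_key = apply_seed("ACGTACGT", 0, template)
read_key = apply_seed("ACTTACGT", 0, template)     # substitution at index 2 (a '0' position)
contiguous_match = "ACGTACG" == "ACTTACG"          # plain 7-mer comparison of the same windows

print(ref_key, read_key, contiguous_match)
```

The two keys are identical although the windows differ, whereas the contiguous 7-mer comparison fails.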

12
Q

consecutive seeds vs. spaced seeds

A

consecutive seeds: one template, needs contiguous matches

spaced seed: template with ignored positions, so mismatches at those positions are allowed

13
Q

What is a BW matrix?

A

BW matrix:
rows are sorted cyclical rotations of string$ - equivalent to sorting the suffixes
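
Sketch (naive, for small strings only): building the BW matrix in Python by sorting all rotations, and checking the equivalence to sorting the suffixes. It assumes the text does not already contain "$" and that "$" sorts before every other character (true for ASCII letters). Function names are made up for illustration.

```python
def bw_matrix(text):
    """Sorted cyclic rotations of text + '$'."""
    text = text + "$"
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text):
    """The BWT is the last column of the BW matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


def suffix_array(text):
    """Start positions of the suffixes of text + '$' in sorted order."""
    text = text + "$"
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text):
    """Same BWT, read off the suffix array: the character before each suffix."""
    terminated = text + "$"
    return "".join(terminated[i - 1] for i in suffix_array(text))


print(bw_matrix("banana"))
print(bwt("banana"))
```

Real indexes are not built this way (sorting full rotations is far too slow and memory-hungry for a genome); the point is only to show what the matrix contains.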

14
Q

What is BWT?

A

Burrows-Wheeler Transform
- full-text indexing approach
- suffix array-like index structure

15
Q

What are the advantages of BWT-based approaches for read-mapping?

A
  • space efficient, can be efficiently compressed
  • original string can be recovered
  • can fit in RAM of most computers
  • pre-computed indices can be stored & shared
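
Sketch for "original string can be recovered": a naive inverse BWT in Python. It assumes the transform was built from text + "$" with "$" as a unique end marker; the function name is made up, and real tools use a much faster method than re-sorting a table.

```python
def inverse_bwt(last_column):
    """Rebuild the original text from its BWT (naive version).

    Repeatedly prepend the BWT column to the table and re-sort;
    after len(last_column) rounds the table holds all sorted rotations.
    """
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(char + row for char, row in zip(last_column, table))
    # The rotation that ends with the end marker is the original text + '$'.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


print(inverse_bwt("annb$aa"))
```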
16
Q

What are the most widely used read mapper tools?

A

BWA, Bowtie2

17
Q

What problem does mapping of long reads present and how can this be addressed?

A
  • higher number of sequencing errors (and different types of errors)
  • hash-based and BWT-based approaches
  • longer reads: many short seeds per read
  • higher error rates –> use chaining
  • use representative seeds only (minimizers)
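
Sketch of the minimizer idea (illustration only): in every window of w consecutive k-mers, keep just the smallest one as the representative seed. This version orders k-mers lexicographically, which is an assumption made for readability; mappers usually order by a hash of the k-mer instead. The function name and toy sequence are made up.

```python
def minimizers(sequence, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties go to the leftmost one.
        best = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + best, window[best]))
    return sorted(selected)


seeds = minimizers("ACGTTGCA", k=3, w=3)
print(seeds)
```

Neighbouring windows often pick the same k-mer, so fewer seeds than k-mers are stored and looked up.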
18
Q

What is SAM and BAM?

A

Alignment output
* compact file size
* analysis without loading the alignment into memory
* can be used by alignment viewers (IGV, samtools)
* SAM (Sequence Alignment/Map) format - human readable (sort of), can be parsed easily
* BAM format - same information but in binary format

19
Q

In SAM format, what is the flag code?

A

bitwise integer (FLAG column) that tells us whether the read is mapped, paired, on the reverse strand, etc.
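
Sketch: decoding a FLAG value in Python. The bit values follow the SAM specification; the dictionary name, the function name, and the short descriptions are my own wording.

```python
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bit is set in the FLAG integer."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
print(decode_flag(4))
```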

20
Q

What is mapping quality?

When is this relevant?

Origin? Usage?

A

mapping quality (MAPQ) = Phred-scaled probability that the read is aligned to the incorrect position: MAPQ = -10 * log10(P(position is wrong))

Relevant with repetitive genome regions, SNPs, sequencing errors
➡ reads map equally well to multiple positions
➡ reduced probability of assigning a read to the correct genomic location
➡ measure the confidence that a read actually comes from the position it is aligned to

first introduced with MAQ mapping software (2008)
different programs compute this differently! (and some don’t compute it at all)
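
Sketch: converting between an error probability and a Phred-scaled mapping quality, as MAPQ is defined in the SAM specification (-10 * log10 of the probability that the mapping position is wrong, rounded to an integer). Function names are made up, and how a given mapper estimates the probability itself is not shown.

```python
import math


def mapq_from_error_prob(p_wrong):
    """Phred-scale a probability (0 < p_wrong <= 1) that the position is wrong."""
    return round(-10 * math.log10(p_wrong))


def error_prob_from_mapq(mapq):
    """Inverse conversion: MAPQ back to an error probability."""
    return 10 ** (-mapq / 10)


mapq_example = mapq_from_error_prob(0.001)
prob_example = error_prob_from_mapq(20)
print(mapq_example, prob_example)
```

So a MAPQ of 20 corresponds to a 1 in 100 chance that the read is placed wrongly, and a read that fits two positions equally well (P(wrong) = 0.5) gets a MAPQ of about 3.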

21
Q

What is a single bp variant? What is an SNP?

A

SNPs
* a single bp change with a frequency of at least 1% within a population
* (usually) bi-allelic

others: rare, just called single bp variants

22
Q

What can cause single bp variants?

What do we need to consider in order to resolve this?

What approaches?

A

Cause:
errors, paralogs, diploidy, polyploidy, …

considerations:
read depth, read quality, alignment quality, frequency and ratio of SNPs, number of equally good alignments per read, …

approaches
- allele counting
- probabilistic methods
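
Sketch of the allele-counting approach (illustration only): count the bases observed at one position and call a diploid genotype from the fraction of non-reference reads. The thresholds (min_depth, het_low, het_high) and the function name are arbitrary assumptions for this example, not values from the lecture; probabilistic callers replace such hard cut-offs with genotype likelihoods.

```python
from collections import Counter


def call_genotype(pileup_bases, ref_base, min_depth=8, het_low=0.2, het_high=0.8):
    """Call a genotype at one position from the bases of the covering reads."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return "no call"                      # too little read depth
    counts = Counter(pileup_bases)
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return "hom-ref"
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    fraction = alt_n / depth
    if fraction < het_low:
        return "hom-ref"                      # few alt reads: likely errors
    if fraction > het_high:
        return f"hom-alt ({alt_base})"
    return f"het ({ref_base}/{alt_base})"


print(call_genotype("AAAAAGGGGG", "A"))
print(call_genotype("AAAAAAAAAG", "A"))
```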

23
Q

What is variant calling?

Name one package and the tools to do this

A

Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data

bcftools

  1. bcftools mpileup
    * for every position, considers data from all reads covering the position
    * calculates genotype likelihoods
    * based on mapping qualities of the reads, base qualities, and the probability of local misalignment (per-base alignment quality, BAQ)
  2. bcftools call
    * calls the most likely genotype under the assumption of Hardy-Weinberg equilibrium, using allele frequencies estimated from the data or provided explicitly by the user
24
Q

What is VCF?

A

Variant Call Format
* text based
* meta-information, genome variation by position
* can include info about structural variation
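
Sketch: reading one VCF data line in Python. The eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) come from the VCF specification; the function name and the example line are made up, and header lines (starting with "#") and per-sample genotype columns are not handled.

```python
def parse_vcf_line(line):
    """Split a tab-separated VCF data line into its fixed fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry == ".":
            continue                          # '.' means no INFO given
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True   # flags carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info_dict,
    }


# Invented example record
variant = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20;AF=0.5;DB")
print(variant)
```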

25
Q

What are structural variants?

size?

relevant for?

etc

A

variation generally >50bp

~2% of the human genome

relevant for medicine, molecular biology, evolution

least studied type of variation, because they are very difficult to detect with short reads

26
Q

What are different types of structural variants?

A
  • insertions, deletions, duplications
  • inversions, rearrangements
  • copy number variations
27
Q

Structural variants with:

short reads
long reads
genome alignment

A

short reads
- need to take into consideration distance and orientation of
paired reads; split reads
- some specialized mappers exist
- none of them can identify all SV types
- limit inherent in the short reads

long reads
- in theory advantageous, span SV regions
- in practice: not very many methods yet available
- often specific for sequencing technology

genome alignment
- required: de novo assembly, spanning the SV
➡ useful & important! methods are still evolving!

28
Q

Describe fasta format

A
  • text-based format
  • represents either nucleotide or amino acid sequences using single-letter codes
  • each record begins with a single-line description starting with ">"
  • followed by lines of sequence data
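
Sketch: a minimal FASTA reader in Python, assuming the whole file fits in a string and that the first word after ">" is the record name. The function name and the example records are made up.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Description line: keep the first word as the name.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)        # sequence may span several lines
    return {rec: "".join(parts) for rec, parts in records.items()}


fasta_text = ">seq1 toy nucleotide record\nACGT\nTTGA\n>seq2\nMKV\n"
sequences = parse_fasta(fasta_text)
print(sequences)
```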
29
Q

Describe fastq format

A
  • text-based format
  • for storing biological sequence + corresponding quality scores.
  • each base and its quality score are each encoded with a single ASCII character
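
Sketch: decoding one FASTQ record in Python. It assumes the common four-line layout (@name, sequence, +, qualities) and the Phred+33 encoding (quality = ASCII code minus 33); older files may use a different offset. The function name and the example record are made up.

```python
def parse_fastq_record(lines):
    """Return (name, sequence, list of Phred scores) for one 4-line record."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(char) - 33 for char in quality]   # Phred+33 assumed
    return header[1:], sequence, scores


record = ["@read1", "ACGT", "+", "II5!"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
```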
30
Q

EXAM QUESTION

Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)

A
  • insertions
  • deletions
  • duplications
  • inversions, rearrangements
  • copy number variations

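long reads that span an SV cannot be aligned in one piece: they are split into sub-reads (segments) that are mapped separately. Analysis of where and in which orientation the segments map can then determine which of the SVs above is present (e.g., a gap between segments on the reference indicates a deletion; a segment on the opposite strand indicates an inversion)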
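
Sketch (heavily simplified, not a real SV caller): classifying two mapped segments of the same long read in Python. It assumes both segments lie on the same chromosome, are given in read order, and that same-strand pairs are on the forward strand. Segment, classify_split, and the 50 bp cut-off parameter are names and defaults chosen for this example.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    read_start: int   # coordinates on the read
    read_end: int
    ref_start: int    # coordinates on the reference
    ref_end: int
    strand: str       # "+" or "-"


def classify_split(first, second, min_size=50):
    """Compare the gap between two segments on the read and on the reference."""
    if first.strand != second.strand:
        return "inversion candidate"
    read_gap = second.read_start - first.read_end
    ref_gap = second.ref_start - first.ref_end
    diff = ref_gap - read_gap
    if diff >= min_size:
        return f"deletion (~{diff} bp)"       # reference has sequence the read lacks
    if diff <= -min_size:
        return f"insertion (~{-diff} bp)"     # read has sequence the reference lacks
    return "no SV"


deletion = classify_split(Segment(0, 1000, 5000, 6000, "+"),
                          Segment(1000, 2000, 6300, 7300, "+"))
insertion = classify_split(Segment(0, 1000, 5000, 6000, "+"),
                           Segment(1200, 2200, 6000, 7000, "+"))
print(deletion, insertion)
```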