6. NGS: read mapping Flashcards by Stevie Davies

What is the similarity/difference between assembly and read mapping?

Assembly: first genome of a species
- de novo assembly (high quality assembly: expensive/difficult)
- genome structure, gene inventory, etc

Read mapping: next genome(s) of that species (or a close relative)
- de novo assembly serves as reference: read mapping, re-sequencing, alignment
- cheaper: lower coverage required
- different questions

What is read mapping also known as?

Also known as:
mapping, alignment, re-sequencing

What is reference bias?

What is a possible solution?

Mapping bias / reference bias

reads containing non-reference alleles
- might not be mapped or incorrectly mapped
- receive lower mapping scores (mismatches!)
effect increases with genetic distance from the reference genome
affects downstream analysis (e.g., population genomics)

solution
* map to multiple reference genomes

NGS read mapping - approaches?

1: (pre-process &) identify short matches (seeds)
* hash-table based (Sanger, NGS)
* BWT based (NGS)

2: extend seeds to longer alignments (e.g., DP)
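
A minimal sketch of the extension step (step 2), assuming a plain Smith-Waterman local alignment as the DP. The function name and the scoring values (match=2, mismatch=-1, gap=-2) are hypothetical choices for illustration, not what any particular mapper uses:

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets the alignment restart anywhere (local alignment)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))  # identical sequences
print(local_alignment_score("ACGT", "AGGT"))  # one mismatch
```

In a seed-and-extend mapper this DP would only be run on the reference region around each seed hit, not on the whole genome.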

What is a pan-genome?

collection of the common and unique genes (sequences) present across the genomes of a given species.

combines the genetic information of all genomes sampled ➡ large and diverse range of genetic material

What is a suffix tree?

compressed trie containing all suffixes of the given text as their keys and positions in the text as their values.

define genotype

In a broad sense, the term “genotype” refers to the genetic makeup of an organism; in other words, it describes an organism’s complete set of genes. In a more narrow sense, the term can be used to refer to the alleles, or variant forms of a gene, that are carried by an organism.

What is a hash table?

a data structure that maps keys to values with fast (on average constant-time) lookup, e.g., a dictionary in Python

How is hashing used for read mapping?

Create a hash table with substrings of length k (k-mers) extracted from the reference genome as keys and their positions on the reference as values.
Then, some k-mers from each read are looked up in the hash table to find candidate locations of the read on the reference (hits).
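
A small sketch of the idea above, assuming a Python dictionary as the hash table. The function names and the toy reference are hypothetical; real mappers index far larger genomes and sample only some k-mers per read:

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Look up the read's k-mers; count hits per implied read start on the reference."""
    hits = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            hits[ref_pos - offset] += 1  # where the read would start
    return dict(hits)


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(candidate_locations("CGTTAG", index, k=4))
```

Candidate start positions supported by several k-mers are the ones worth passing on to the extension step.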

In hash-table based read mapping, what are the drawbacks of short k-mers? Of long k-mers?

What is the solution?

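shorter templates (substrings, k-mers)
- allow mapping in the presence of more errors/variation
- are not very effective seeds (match too many regions)

longer templates
- are more specific seeds (match fewer regions)
- tolerate fewer errors/variation (a single mismatch destroys the match)

➡ dilemma!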

solution: spaced seeds

What are spaced seeds?

templates: optimized for a given read length and expected error/mismatch rate

a pattern of relevant and irrelevant positions in a biosequence, used for approximate string matching that allows for substitutions

usually represented as a sequence of zeroes and ones (1 = position must match, 0 = position is ignored)
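
To illustrate, a sketch that applies a template to a sequence window. The template "1101011" and the function name are made up for this example; real templates are optimized as described above:

```python
def spaced_key(seq, start, template):
    """Characters of seq at the '1' positions of the template, or None if the window does not fit."""
    if start + len(template) > len(seq):
        return None
    return "".join(seq[start + i] for i, bit in enumerate(template) if bit == "1")


template = "1101011"      # hypothetical pattern: positions 2 and 4 are ignored
ref_window = "ACGTACG"
read_window = "ACTTGCG"   # differs from the reference at positions 2 and 4

# The contiguous 7-mers differ, but the spaced keys are identical.
print(ref_window == read_window)
print(spaced_key(ref_window, 0, template) == spaced_key(read_window, 0, template))
```

The spaced key can be used as the hash-table key in place of a contiguous k-mer, so substitutions at ignored positions no longer prevent a seed hit.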

consecutive seeds vs. spaced seeds

consecutive seeds: one template, needs contiguous matches

spaced seeds: template allows mismatches at its ignored ("don't care") positions

What is a BW matrix?

BW matrix:
rows are sorted cyclical rotations of string$ - equivalent to sorting the suffixes
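
A naive sketch of the BW matrix for a short string, assuming "$" sorts before every other character (true for ASCII letters). Function names are hypothetical, and real indexes never build the full matrix:

```python
def bw_matrix(text):
    """Sorted cyclic rotations of text + '$'."""
    s = text + "$"
    return sorted(s[i:] + s[:i] for i in range(len(s)))


def bwt(text):
    """Last column of the BW matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


def suffix_array(text):
    """Start positions of the suffixes of text + '$' in sorted order."""
    s = text + "$"
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "ACAACG"
s = text + "$"
print(bwt(text))
# Same result from the suffix array: take the character preceding each sorted suffix.
print("".join(s[i - 1] for i in suffix_array(text)))
```

Both print statements give the same string, which is the sense in which sorting rotations is equivalent to sorting suffixes.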

What is BWT?

Burrows-Wheeler Transform
- last column of the BW matrix (a reversible permutation of the text)
- full-text indexing approach
- suffix array-like index structure

What are the advantages of BWT-based approaches for read-mapping?

space efficient, can be efficiently compressed
original string can be recovered
can fit in RAM of most computers
pre-computed indices can be stored & shared
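
A sketch of why the original string can be recovered, assuming the transformed string contains exactly one "$" that sorts before all other characters. The function name is hypothetical; the k-th occurrence of a character in the last column is the same text character as its k-th occurrence in the sorted first column:

```python
def inverse_bwt(last):
    """Recover the original text (without '$') from the last column of the BW matrix."""
    # Stable sort of last-column indices: order[j] is the row whose last character
    # is the first character of row j.
    order = sorted(range(len(last)), key=lambda i: last[i])
    j = order[0]          # row 0 starts with '$'; order[0] is the row holding text + '$'
    out = []
    for _ in range(len(last) - 1):
        j = order[j]
        out.append(last[j])
    return "".join(out)


print(inverse_bwt("GC$AAAC"))
print(inverse_bwt("annb$aa"))
```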

What are the most widely used read mapper tools?

BWA, Bowtie2

What problem does mapping of long reads present and how can this be addressed?

higher number of sequencing errors (and different types of errors)
hash-based and BWT-based approaches
longer reads: many short seeds per read
higher error rates –> use chaining
use representative seeds only (minimizers)
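
A sketch of minimizer selection, assuming the simplest definition: the lexicographically smallest k-mer in every window of w consecutive k-mers. The function name and parameters are hypothetical, and real tools typically order k-mers by a hash value instead of alphabetically:

```python
def minimizers(seq, k, w):
    """Sorted list of (position, k-mer) minimizers over all windows of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # smallest k-mer in the window; ties are broken by the leftmost position
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


print(minimizers("ACGTTGCA", k=3, w=3))
```

Neighbouring windows often share the same minimizer, so fewer seeds need to be stored and looked up than when every k-mer is used.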

What is SAM and BAM?

Alignment output
* compact file size
* analysis without loading the alignment into memory
* can be used by alignment viewers (IGV, samtools)
* SAM (Sequence Alignment/Map) format - human readable (sort of), can be parsed easily
* BAM format - same information but in binary format

In SAM format, what is the flag code?

tells us whether the read is paired, mapped, on the reverse strand, first or second in the pair, etc. (bitwise FLAG field)
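
A sketch of decoding the flag, using the bit values defined in the SAM specification. The dictionary labels and the function name are my own wording:

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bit is set in the FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))   # 99 = 0x1 + 0x2 + 0x20 + 0x40
print(decode_flag(4))
```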

What is mapping quality?

When is this relevant?

Origin? Usage?

mapping quality (MAPQ) = Phred-scaled probability that the read is aligned to the wrong position: MAPQ = -10 * log10(P(mapping position is wrong))

Relevant with repetitive genome regions, SNPs, sequencing errors
➡ reads map equally well to multiple positions
➡ reduced probability of assigning a read to the correct genomic location
➡ measure the confidence that a read actually comes from the position it is aligned to

first introduced with MAQ mapping software (2008)
different programs compute this differently! (and some don’t compute it at all)
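
A short sketch of the Phred scaling used for mapping quality in the SAM format (MAPQ = -10 * log10 of the probability that the mapping position is wrong). Function names are hypothetical, and how a mapper estimates that probability in the first place differs between programs:

```python
import math


def mapq_to_error_prob(mapq):
    """Probability that the mapping position is wrong, given a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong):
    """Phred-scaled MAPQ for a given error probability, rounded to the nearest integer."""
    return round(-10 * math.log10(p_wrong))


for q in (10, 20, 30):
    print(q, mapq_to_error_prob(q))
```

A read that maps equally well to two positions has an error probability of about 0.5, which corresponds to a MAPQ of about 3.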

What is a single bp variant? What is an SNP?

SNPs
* a single bp change with a frequency of at least 1% within a population
* bi-allelic

others: rare, just called single bp variants

What can cause single bp variants?

What do we need to consider in order to resolve this?

What approaches?

Cause:
errors, paralogs, diploidy, polyploidy, …

considerations:
read depth, read quality, alignment quality, frequency and ratio of SNPs, number of equally good alignments per read, …

approaches
- allele counting
- probabilistic methods
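
A sketch of the allele-counting approach for a diploid site, assuming simple cut-offs. The function name and the thresholds (min_depth=8, min_alt_fraction=0.2) are hypothetical values for illustration only:

```python
from collections import Counter


def call_by_allele_counting(pileup_bases, ref_base, min_depth=8, min_alt_fraction=0.2):
    """Threshold-based genotype call from the bases observed at one position."""
    counts = Counter(base.upper() for base in pileup_bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return "no call"                      # too few reads to decide
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return "hom-ref"
    alt_n = max(alt_counts.values())          # most frequent non-reference base
    fraction = alt_n / depth
    if fraction < min_alt_fraction:
        return "hom-ref"                      # rare mismatches treated as errors
    if fraction > 1 - min_alt_fraction:
        return "hom-alt"
    return "het"


print(call_by_allele_counting("AAAAAGGGGG", "A"))
print(call_by_allele_counting("AAAAAAAAAG", "A"))
```

Fixed cut-offs like these ignore base and mapping qualities, which is the main reason probabilistic methods are preferred at low coverage.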

What is variant calling?

Name one package and the tools to do this

Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data

bcftools

bcftools mpileup
* for every position, considers data from all reads covering the position
* calculates genotype likelihoods
- based on mapping qualities of the reads, base qualities, the probability of local misalignment, per-base alignment quality (BAQ)
bcftools call
* calls the most likely genotype under the assumption of Hardy-Weinberg equilibrium, using allele frequencies estimated from the data or provided explicitly by the user
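
A toy sketch of the idea behind calling the most likely genotype with a Hardy-Weinberg prior. This is not the bcftools algorithm: it assumes a bi-allelic diploid site, one fixed per-base error rate, and an alternative allele frequency strictly between 0 and 1; all names are hypothetical:

```python
import math


def call_genotype(n_ref, n_alt, alt_freq, error_rate=0.01):
    """Most probable genotype (0/0, 0/1 or 1/1) given ref/alt read counts."""
    # Hardy-Weinberg genotype priors for alt allele frequency f: (1-f)^2, 2f(1-f), f^2
    priors = {
        "0/0": (1 - alt_freq) ** 2,
        "0/1": 2 * alt_freq * (1 - alt_freq),
        "1/1": alt_freq ** 2,
    }
    alt_copies = {"0/0": 0, "0/1": 1, "1/1": 2}
    log_posterior = {}
    for genotype, prior in priors.items():
        share = alt_copies[genotype] / 2
        # probability that a single read shows the alt base under this genotype
        p_alt = share * (1 - error_rate) + (1 - share) * error_rate
        log_likelihood = n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
        log_posterior[genotype] = math.log(prior) + log_likelihood
    return max(log_posterior, key=log_posterior.get)


print(call_genotype(n_ref=5, n_alt=5, alt_freq=0.1))
```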

What is VCF?

Variant Call Format
* text based
* meta-information, genome variation by position
* can include info about structural variation
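
A sketch of reading one VCF data line, using the eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example line is invented and the function name is hypothetical:

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict of its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")          # several alt alleles are comma-separated
    info = {}
    for entry in record["INFO"].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True          # keys without '=' are flags
    record["INFO"] = info
    return record


example = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20;DB"   # made-up record
print(parse_vcf_line(example))
```

Lines starting with "#" carry the meta-information and the column header, so they would be skipped before calling this function.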

What are structural variants? size? relevant for? etc

- variation generally >50 bp
- ~2% of the human genome
- relevant for medicine, molecular biology, evolution
- least studied type of variation, because they are very difficult to detect with short reads

What are different types of structural variants?

- insertions, deletions, duplications
- inversions, rearrangements
- copy number variations

How are structural variants detected with short reads, long reads, and genome alignment?

short reads
- need to take into consideration distance and orientation of paired reads; split reads
- some specialized mappers exist
- none of them can identify all SV types
- limit inherent in the short reads

long reads
- in theory advantageous, span SV regions
- in practice: not very many methods yet available
- often specific for sequencing technology

genome alignment
- required: de novo assembly, spanning the SV

➡ useful & important! methods are still evolving!

Describe fasta format

- text-based format
- represents either nucleotides or amino acids using single-letter codes
- begins with a single-line description (starting with ">")
- followed by lines of sequence data
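
A minimal sketch of reading this format, assuming description lines start with ">" and a sequence may be wrapped over several lines. The function name and the example records are made up:

```python
def parse_fasta(text):
    """Return {description: sequence} for FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]              # description line without the '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)   # sequence lines are concatenated later
    return {n: "".join(parts) for n, parts in records.items()}


example = ">seq1 toy record\nACGT\nTTGA\n>seq2\nMKV\n"
print(parse_fasta(example))
```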

Describe fastq format

- text-based format
- for storing a biological sequence + corresponding quality scores
- each base and each quality score is encoded with a single ASCII character
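
A sketch of decoding one four-line record, assuming the common Phred+33 quality encoding (quality = ASCII code minus 33). The function name and the example read are hypothetical:

```python
def parse_fastq_record(lines):
    """Return (read name, sequence, list of Phred qualities) for one 4-line FASTQ record."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Phred+33 assumption: '!' (ASCII 33) is quality 0, 'I' (ASCII 73) is quality 40
    return header[1:], seq, [ord(c) - 33 for c in qual]


record = ["@read1", "ACGT", "+", "II5!"]
print(parse_fastq_record(record))
```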

EXAM QUESTION Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)

structural variant types:
- insertions
- deletions
- duplications
- inversions, rearrangements
- copy number variations

detection: long reads are split into sub-reads and mapped; analysis can then determine whether they show one of the SVs above

6. NGS: read mapping Flashcards

(30 cards)