Quality of NGS Reads, Reference mapping and SNP analyses Flashcards

1
Q

What are Quality control (QC) sequencing reads?

A

Short DNA sequences provided by the sequencing instruments. The length may vary between 50 and 40,000 bases.

2
Q

What formats do QC sequencing reads come in?

A

Binary standard flowgram format (SFF format): Roche 454

FASTQ format: a text-based format; example: Illumina reads

3
Q

What format are Illumina reads in?

A

FASTQ format: a text-based format that stores each read's bases (A, T, C, G) together with a quality score for every base.
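To make the layout concrete, here is a minimal sketch of reading one four-line FASTQ record. It assumes Phred+33 quality encoding; the function name `parse_fastq_record` and the example read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (read_id, bases, scores)."""
    header, bases, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so subtract 33 from each ASCII code.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], bases, scores


# Hypothetical record: ID line, bases, separator, quality string.
record = ["@read1", "ATCGN", "+", "II5+!"]
read_id, bases, scores = parse_fastq_record(record)
print(read_id, bases, scores)
```

Each character of the fourth line encodes the quality of the base at the same position in the second line.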

4
Q

What does a quality score/Phred score of 10 compared to 60 mean?

A

QC=10 means a 1 in 10 chance that the base has been called incorrectly = 90% certainty
QC=60 means a 1 in 10^6 (1 in 1 million) chance that the base has been called incorrectly = 99.9999% certainty
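The relationship behind these numbers is P(error) = 10^(-Q/10), where Q is the Phred score. A small sketch of the conversion (the function names are illustrative, not from any library):

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Probability that a base call with Phred score q is correct."""
    return 1 - phred_to_error_probability(q)


for q in (10, 20, 30, 60):
    print(q, phred_to_error_probability(q), phred_to_accuracy(q))
```

Every increase of 10 in the score makes an error ten times less likely.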

5
Q

What does an N (instead of A, T, C or G) represent in a sequencing read?

A

The base at that position could not be called with confidence (no Phred score could be calculated), therefore N was placed instead of a base.

6
Q

Why do we accept all bases with a QC higher than 20?

A

A QC of 20 means 99% certainty in the base call, so anything above 20 is more than 99% certain, whereas a QC of 10 is only 90% accurate.

7
Q

What could cause a poor distribution when looking at per base sequence quality (QC reads)? How could this be avoided? (QC page with red, yellow and green)

A

Degradation of quality over the duration of long runs.

Quality trimming could be helpful in such cases.
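As a rough sketch of quality trimming, the function below cuts low-quality bases off the 3' end of a read. The threshold of 20 and the name `trim_3prime` are assumptions for illustration; real trimming tools typically use more robust rules, such as sliding windows.

```python
def trim_3prime(bases, scores, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    end = len(bases)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return bases[:end], scores[:end]


# Hypothetical read whose quality decays towards the end.
trimmed_bases, trimmed_scores = trim_3prime(
    "ATCGATCG", [38, 37, 36, 30, 25, 18, 12, 5]
)
print(trimmed_bases, trimmed_scores)
```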

8
Q

How does per tile sequence quality work?

A

Records the average Phred score for each tile in a flow cell.
Each flow cell contains 8 lanes, each lane has 2 columns, and each column has 50/60 tiles.
Each tile is imaged 4 times, once per base, and is given a colour from blue to red.
Red = poor quality
Blue = at or above average

9
Q

What does per sequence QC tell you and why might this be low?

A

Displays one peak showing the average quality per sequence; this needs to be 20+.
A low value could be caused by poor imaging or other technical problems during the run.
Low-quality reads must be removed to avoid problems, such as wrongly called bases, which could introduce bias in the downstream analysis.

10
Q

What does Per Sequence GC Content tell you in QC? What does deviation suggest?

A

It measures the GC content across the whole length of each sequence in a file and compares the result to a modelled normal distribution of GC content. Deviation from that distribution (jagged rather than smooth lines) may suggest a specific contaminant, for example adapter dimers.
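The per-read value being plotted can be sketched as follows. The function name `gc_fraction` is illustrative, and ignoring N bases is an assumption of this sketch rather than a description of any particular QC tool.

```python
def gc_fraction(read):
    """Fraction of called bases (A, C, G, T) in a read that are G or C."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return 0.0  # read consists only of uncalled bases
    return sum(base in "GC" for base in called) / len(called)


# Hypothetical reads; a QC tool would plot the distribution of these values.
reads = ["GGCCAATT", "ATATATAT", "GCGCGCGC"]
print([gc_fraction(read) for read in reads])
```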

11
Q

What is adapter sequence removal and why do we carry it out in QC?

A

Adapter sequences that were used for the sequencing reaction need to be removed from the reads before downstream analysis. Adapter removal is the process of identifying these sequences so they can be trimmed from the reads.
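A minimal sketch of the idea, assuming exact matching only: cut the read where the adapter starts, including the case where only the beginning of the adapter fits at the 3' end. The name `trim_adapter`, the `min_overlap` parameter and the adapter string are illustrative values; dedicated trimming tools also tolerate mismatches.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (or a 3'-end prefix of it) from a read, exact match only."""
    # Case 1: the whole adapter occurs inside the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: only the start of the adapter fits at the end of the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))
print(trim_adapter("ACGTACGTAGATC", ADAPTER))
```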

12
Q

What does the Per Base Sequence Content tell you, and what are normal levels?

A

The proportion of each base (A, T, C, G) at each position in the reads, and therefore the GC content.
The proportions should be equal or similar from one position to the next. Overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias.
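The per-position proportions can be sketched like this. The function name `per_base_content` and the toy reads are assumptions; reads are compared only up to the length of the shortest one.

```python
from collections import Counter


def per_base_content(reads):
    """Proportion of A, C, G and T at each position across a set of reads."""
    length = min(len(read) for read in reads)
    table = []
    for i in range(length):
        counts = Counter(read[i] for read in reads)
        total = sum(counts.values())
        table.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return table


# Hypothetical reads: position 1 is C in every read, a sign of bias.
table = per_base_content(["ACGT", "ACGA", "TCGT", "GCGT"])
for position, proportions in enumerate(table):
    print(position, proportions)
```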

13
Q

What does low/ high Sequence duplication tell you?

A

A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (eg PCR over amplification).
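One simple way to put a number on duplication is the fraction of reads that are exact copies of a read already seen. This is a simplified sketch with an illustrative function name; QC tools estimate duplication levels in their own, more involved way.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that are exact repeats of another read in the set."""
    distinct = len(Counter(reads))
    return 1 - distinct / len(reads)


# Hypothetical reads: one of the four is a repeat.
print(duplication_rate(["AAA", "AAA", "CCC", "GGG"]))
```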

14
Q

Explain the process of De novo assembly?

A

De novo assembly is used when we have no a priori knowledge of the correct sequence.
1. Reads are taken from genomic information. "Short reads" typically range in size from 35 to 1,000 bp; "long reads" typically range in size from 1,000 to 500,000 bp.

2. Contig creation - Contigs are a set of overlapping oriented reads. A single contig is constructed from two or more overlapping and oriented reads.

3. Scaffold formation - Scaffolds are a set of joined-oriented contigs. A single scaffold is constructed from two or more joined and oriented contigs.

4. Assembled genome/chromosome - A single chromosome is constructed from two or more joined and oriented scaffolds.
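The contig creation step can be illustrated with a toy sketch that joins two reads when the end of one exactly matches the start of the other. The function names and the `min_overlap` value are assumptions; real assemblers work on overlap graphs or de Bruijn graphs over millions of reads and tolerate sequencing errors.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two overlapping reads into one contig, or return None if no overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    return left + right[n:]


# Hypothetical reads sharing the 4-base overlap GATC.
print(merge_reads("ATCGATC", "GATCACG"))
```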
15
Q

Why do we have gaps in de novo assembly and how to avoid?

A

Gaps arise in regions of DNA in which the same sequence of bases is repeated, or in sections of the genome where DNA is densely packed.

Paired-end DNA sequencing reads provide high-quality alignment across DNA regions containing repetitive sequences, and produce long contigs for de novo sequencing by filling gaps in the consensus sequence.

16
Q

What are paired end reads in de-novo assembly?

A

Paired-end reads are produced when the fragment used in the sequencing process is much longer than the read length (typically 250-500 bp long) and both ends of the fragment are read in towards the middle.

17
Q

What are K-mers and how do they work?

A

A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence).
Essentially you need to get the first k characters, then move just a single character for the start of the next k-mer and so on.

k-mers can also be used to detect genome mis-assembly by identifying k-mers that are overrepresented, which suggests the presence of repeated DNA sequences that have been combined.

18
Q

What are the 3-mers of ATCGATCAC?

A

3-mer #0: ATC
3-mer #1: TCG
3-mer #2: CGA
3-mer #3: GAT
3-mer #4: ATC
3-mer #5: TCA
3-mer #6: CAC
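The sliding-window procedure that produces this list can be sketched in a few lines (the function name `kmers` is illustrative). Counting the results also shows how a repeated k-mer, here ATC, stands out.

```python
from collections import Counter


def kmers(sequence, k):
    """All k-mers of a sequence, moving the window one character at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


three_mers = kmers("ATCGATCAC", 3)
for index, kmer in enumerate(three_mers):
    print(f"3-mer #{index}: {kmer}")

# k-mers seen more than once hint at repeated sequence.
counts = Counter(three_mers)
print([kmer for kmer, count in counts.items() if count > 1])
```

A sequence of length L contains L - k + 1 k-mers, which is 9 - 3 + 1 = 7 here.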

19
Q

What is Reference Mapping and what is it useful for ?

A

Reference mapping is aligning sequencing reads to the genome of a similar strain in order to determine single nucleotide polymorphisms (SNPs), insertions and deletions (indels).
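Once a read has been placed on the reference, finding SNPs comes down to comparing bases position by position. The sketch below assumes the mapping position is already known and that there are no indels; the function name `find_snps` and the sequences are made up, and real pipelines rely on a mapper plus a variant caller.

```python
def find_snps(reference, read, offset):
    """List (position, reference_base, read_base) for each mismatch.

    offset is the 0-based position in the reference where the read starts.
    """
    snps = []
    for i, read_base in enumerate(read):
        reference_base = reference[offset + i]
        if read_base != reference_base:
            snps.append((offset + i, reference_base, read_base))
    return snps


# Hypothetical read mapped at position 3 of a toy reference.
print(find_snps("ATCGATCAC", "GTTC", 3))
```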

20
Q

Which programs are useful for reference mapping?

A

BWA
Bowtie

21
Q

What is a Single-nucleotide polymorphism (SNP)?

A

SNP is a single nucleotide variation in the corresponding DNA region between two or more organisms

22
Q

What are the types of Single-nucleotide polymorphism (SNP)?

A
  1. Synonymous SNPs: No change in the amino acid sequence
  2. Non-synonymous SNPs: change in the amino acid sequence of protein
    a. missense: results in a codon that codes for a different amino acid
    b. nonsense: results in a premature stop codon
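These categories can be checked by translating the codon before and after the change. The sketch below uses the standard genetic code; the function name `classify_snp` is illustrative, and as a simplification a change that removes a stop codon is simply reported as missense.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snp(ref_codon, alt_codon):
    """Label a codon change as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # premature stop codon
    return "missense"  # simplification: stop-loss changes also land here


print(classify_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_snp("GAA", "GTA"))  # Glu -> Val
print(classify_snp("TAC", "TAA"))  # Tyr -> stop
```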
23
Q

What is the function of the TRAMS program?

A

TRAMS is a program for functional annotation of genomic SNPs as synonymous, nonsynonymous or nonsense.
It annotates nonsynonymous SNPs in start and stop codons separately, as non-start and non-stop SNPs.
Multiple nucleotide polymorphisms (MNPs) within a codon are combined before annotation.