Quality of NGS Reads, Reference mapping and SNP analyses Flashcards

1
Q

What are sequencing reads in quality control (QC)?

A

Short DNA sequences produced by the sequencing instrument. Read length may vary between 50 and 40,000 bases.

2
Q

What formats do QC sequencing reads come in?

A

Binary standard flowgram format (SFF format): Roche 454

FASTQ format: a text-based format; example: Illumina reads

3
Q

What format are Illumina reads in?

A

FASTQ format: a text-based format that stores each read's base calls (A, T, C, G) together with a quality score for every base.
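As an illustration of the format, a FASTQ record spans four lines: a header starting with `@`, the sequence, a separator starting with `+`, and a quality string of the same length as the sequence. The sketch below reads such records from a string; the function name `parse_fastq` and the example read are made up for illustration, and real files are better handled with an established parser.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) tuples from FASTQ text."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Each record starts with '@' and has a '+' separator on line three.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # The quality string carries one character per base.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


# Hypothetical single-read example.
example = "@read1\nATCG\n+\nIIII\n"
records = list(parse_fastq(example))
```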

4
Q

What is a quality score?

A

A value assigned to each base by the base-calling software, using information received from the instrument. It reflects the confidence that the base has been called correctly.

5
Q

What does a quality score (Phred score) of 10 mean compared with a score of 60?

A

Phred score 10 means a 1 in 10 chance that the base has been called incorrectly (90% accuracy)
Phred score 60 means a 1 in 10^6 (1 in a million) chance of an incorrect call, i.e. 99.9999% accuracy
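The relationship behind these numbers is Q = -10 x log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion follows; the function names are illustrative, and the default offset of 33 assumes the common Phred+33 encoding of FASTQ quality characters.

```python
def phred_to_error_probability(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Turn FASTQ quality characters into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


for q in (10, 20, 60):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.4f}%")
```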

6
Q

Why do we accept all bases with a quality score higher than 20?

A

A Phred score of 20 corresponds to 99% accuracy (1 error in 100 base calls), which is much better than the 90% accuracy (1 error in 10) given by a Phred score of 10.

7
Q

What could cause a poor distribution in the per base sequence quality plot (the QC plot with red, yellow and green zones)? How could this be dealt with?

A

Degradation of quality over the duration of long runs.

Quality trimming could be helpful in such cases.
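One simple form of quality trimming removes low-quality bases from the 3' end of a read, where quality tends to drop. The sketch below is a deliberately simplified illustration; the function name `trim_3prime` and the threshold of 20 are assumptions, and dedicated trimming tools typically use more robust schemes such as sliding windows.

```python
def trim_3prime(seq, quals, threshold=20):
    """Drop trailing bases whose Phred score is below the threshold."""
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical read whose last two bases fall below Q20.
trimmed_seq, trimmed_quals = trim_3prime("ATCGAT", [38, 37, 30, 25, 12, 8])
```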

8
Q

How does the Per Tile Sequence Quality check work?

A

It records the average Phred score for each tile in a flow cell.
Each flow cell contains 8 lanes, each lane has 2 columns, and each column has 50 or 60 tiles.
Each tile is imaged 4 times, once per base, and is given a colour on a scale from blue to red.
Red = poor quality
Blue = at or above average quality

9
Q

What does Per Sequence GC Content tell you in QC? What does a deviation from the expected distribution suggest?

A

It measures the GC content across the whole length of each sequence in a file and compares the result to a modelled normal distribution of GC content. Deviations, such as jagged lines or sharp peaks on an otherwise smooth curve, may suggest a specific contaminant, for example adaptor dimers.
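The per-read quantity being plotted is simply the percentage of G and C bases in each read. A minimal sketch, with an illustrative function name and made-up reads, might look like this:

```python
def gc_content(seq):
    """Return the percentage of G and C bases in one read."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical reads; a QC tool would histogram these values over all reads.
reads = ["GCAT", "GGCC", "ATAT"]
gc_values = [gc_content(r) for r in reads]
```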

10
Q

What is adaptor sequence removal and why do we carry it out in QC?

A

The primary reads from the sequencer also carry the adaptor sequences that were used for the sequencing reaction, and these need to be removed before processing. They interfere with downstream analyses.
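The idea can be sketched as follows: look for the adaptor inside the read and cut everything from that point on, or cut a partial adaptor that runs off the 3' end. This is a simplified exact-match illustration; the function name, the `min_overlap` parameter and the adaptor string are assumptions chosen for the example, and real trimming tools also tolerate mismatches.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adaptor (full or partial) from a read using exact matching."""
    # Case 1: the whole adaptor occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only the start of the adaptor is present at the end of the read.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Example adaptor string for illustration only; use the one for your library kit.
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGT" + ADAPTER + "TTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)
```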

11
Q

What does the Per Base Sequence Content tell you, and what are normal levels?

A

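The proportion of each of the four bases (A, T, C, G) called at each position in the reads, and therefore the GC content at each position.

In an unbiased library these proportions should be equal or similar from one position to the next.

Overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias.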
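To make the measurement concrete, the sketch below tabulates the fraction of A, T, C and G at each read position. The function name and the four example reads are hypothetical, and this is an illustration of the idea rather than how any particular QC tool computes it.

```python
from collections import Counter


def per_base_content(reads):
    """Return one dict per position giving the fraction of A, T, C and G."""
    length = max((len(r) for r in reads), default=0)
    table = []
    for pos in range(length):
        # Count the bases seen at this position across all reads long enough.
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts.values())
        table.append({base: counts.get(base, 0) / total for base in "ATCG"})
    return table


table = per_base_content(["ATCG", "ATCG", "TTCA", "ATGG"])
```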

12
Q

What does low or high sequence duplication tell you?

A

A low level of duplication may indicate a very high level of coverage of the target sequence.

A high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).
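A rough way to picture a duplication level is the percentage of reads that are exact copies of a read already seen. The sketch below uses that simple definition as an assumption; QC tools may sample reads or define the metric differently, and the function name is illustrative.

```python
from collections import Counter


def duplication_level(reads):
    """Percentage of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    return 100.0 * (len(reads) - distinct) / len(reads)


level = duplication_level(["AAA", "AAA", "CCC", "GGG"])
```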

13
Q

Explain the process of de novo assembly.

A
  1. Reads are generated from the genomic DNA. "Short reads" typically range in size from 35 to 1,000 bp; "long reads" typically range from 1,000 to 500,000 bp.
  2. Contig creation - A contig is a set of overlapping, oriented reads, constructed from two or more such reads. Reads must overlap by a minimum number of base pairs or k-mers before they are joined.
  3. Scaffold formation - A scaffold is a set of joined, oriented contigs, constructed from two or more such contigs.
  4. Assembled genome/chromosome - A single chromosome is constructed from two or more joined and oriented scaffolds.
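The contig-creation step can be illustrated with a toy example that joins two reads when the end of one matches the start of the other by at least a minimum number of bases. The function names, the `min_overlap` value and the reads are assumptions for illustration; real assemblers handle errors, orientation and many reads at once.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one contig, or return None if they do not overlap enough."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


# The reads share the 4-base overlap GATC.
contig = merge_reads("ATCGATC", "GATCAC")
```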
14
Q

What are the 3-mers of ATCGATCAC?

A

3-mer #0: ATC
3-mer #1: TCG
3-mer #2: CGA
3-mer #3: GAT
3-mer #4: ATC
3-mer #5: TCA
3-mer #6: CAC
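The list above can be reproduced by sliding a window of length k along the sequence one base at a time, which gives (length - k + 1) k-mers. A minimal sketch, with an illustrative function name:

```python
def kmers(seq, k):
    """Return every substring of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


for index, kmer in enumerate(kmers("ATCGATCAC", 3)):
    print(f"3-mer #{index}: {kmer}")
```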

15
Q

What is reference mapping and what is it useful for?

A

Reference mapping is aligning sequencing reads to the genome of a similar strain (the reference) in order to determine single nucleotide polymorphisms (SNPs), insertions and deletions (indels).

16
Q

Which programs are useful for reference mapping?

A

BWA
Bowtie

17
Q

What is a Single-nucleotide polymorphism (SNP)?

A

SNP is a single nucleotide variation in the corresponding DNA region between two or more organisms
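As a toy illustration of this definition, the sketch below compares two already-aligned sequences of equal length and reports the positions where a single base differs. The function name and sequences are made up, and it ignores indels and read-level evidence that real variant callers use.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


snps = find_snps("ATCGATCAC", "ATCGTTCAC")
```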

18
Q

What are the types of Single-nucleotide polymorphism (SNP)?

A

Synonymous SNPs: no change in the amino acid sequence
Non-synonymous SNPs: change in the amino acid sequence of the protein
a. missense: results in a codon that codes for a different amino acid
b. nonsense: results in a premature stop codon
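These categories can be illustrated by translating the reference codon and the altered codon with the standard genetic code and comparing the results. The sketch below is a simplified illustration with hypothetical names; it assumes the standard codon table and does not treat start codons or changes to an existing stop codon specially.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(ref_codon, alt_codon):
    """Classify a coding SNP as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(classify_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_snp("GAG", "GTG"))  # Glu -> Val
print(classify_snp("TAC", "TAA"))  # Tyr -> stop
```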

19
Q

What is the function of the TRAMS program?

A

TRAMS is a program for the functional annotation of genomic SNPs as synonymous, non-synonymous or nonsense.
It annotates non-synonymous SNPs in start and stop codons separately, as non-start and non-stop SNPs.
Multiple nucleotide polymorphisms (MNPs) within a codon are combined before annotation.