Genomics - NGS reads, reference mapping and SNP analysis Flashcards

Question 1

Q

Name 2 examples of read formats you may get for QC analysis

Answer

A

Binary standard flowgram format (SFF) - Roche 454
FASTQ Format - text based format - Illumina

Question 2

Q

What do Illumina Reads show?

Answer

A

Label
Sequence
Q-Score
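
As a rough illustration of these three parts: a FASTQ record spans four lines, namely the label (prefixed with `@`), the sequence, a `+` separator and the quality string. The sketch below assumes the common Phred+33 encoding; the function name `parse_fastq_record` and the sample read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into label, sequence and Q-scores."""
    label_line, sequence, separator, quality = [ln.strip() for ln in lines]
    if not label_line.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding (ASCII code minus 33 gives the Q-score).
    q_scores = [ord(ch) - 33 for ch in quality]
    return label_line[1:], sequence, q_scores


# Hypothetical record, not taken from a real run.
record = ["@read1", "ACGT", "+", "II5+"]
label, sequence, q_scores = parse_fastq_record(record)
print(label, sequence, q_scores)  # read1 ACGT [40, 40, 20, 10]
```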

Question 3

Q

What does a Phred score of 10, 20, 30, 40, 50 and 60 mean?

Answer

A

10 - Error 1/10 - Accuracy - 90%
20 - 1/100 - 99%
30 - 1/1000 - 99.9%
40 - 1/10000 - 99.99%
50 - 1/100000 - 99.999%
60 - 1/1000000 - 99.9999%
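
These values follow from the Phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion (the function names are illustrative):

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy_percent(q):
    """Base-call accuracy, in percent, for Phred score q."""
    return 100 * (1 - phred_to_error_probability(q))


for q in (10, 20, 30, 40, 50, 60):
    print(q, phred_to_error_probability(q), phred_to_accuracy_percent(q))
```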

Question 4

Q

What Phred score is the cut-off for QC analysis?

Question 5

Q

Name 2 QC software tools

Answer

A

CLC Genomics Workbench
FastQC

Question 6

Q

Describe Per Base Quality Distribution

Answer

A

Shows the range of quality values across all bases at each position
A poor distribution could be due to general degradation of quality over the duration of long runs; quality trimming can be useful in such cases
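
A minimal sketch of how such a per-position summary could be computed, assuming each read's Q-scores are already decoded into a list of integers. The function name `per_base_quality` and the example values are hypothetical; QC tools typically report more statistics (e.g. quartiles) than this.

```python
from statistics import median


def per_base_quality(reads_q_scores):
    """Return (min, median, max) of the Q-scores at each read position."""
    longest = max(len(q) for q in reads_q_scores)
    summary = []
    for pos in range(longest):
        # Reads shorter than this position simply do not contribute.
        column = [q[pos] for q in reads_q_scores if pos < len(q)]
        summary.append((min(column), median(column), max(column)))
    return summary


# Made-up scores showing quality dropping toward the end of the reads.
example = [[40, 38, 30, 12], [39, 37, 25, 8], [40, 36, 20]]
summary = per_base_quality(example)
for pos, stats in enumerate(summary, start=1):
    print(pos, stats)
```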

Question 7

Q

Describe per tile sequence quality

Answer

A

Shows the average Phred score for each tile on the flow cell
If everything is perfect, all tiles will be blue
Only available for Illumina, where reads retain the original sequence identifiers that record the flow cell tile from which each read originated
The colour scale runs from blue to red
Blue is for positions with quality at or above the average, and red is for poorer qualities

Question 8

Q

Define quality scores and describe their benefits

Answer

A

A small fraction of reads have overall poor quality values due to poor imaging or other technical problems during the run
The per sequence quality score helps identify the subset of sequence with low quality values
Low quality reads must be removed from downstream processing and analyses due to problems, such as wrongly called bases, which could introduce bias in downstream analysis.
Reads are generally filtered by their quality scores using the in-built tools of the QC software
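
A minimal sketch of filtering reads by their mean quality score. The threshold `min_mean_q=20` is an arbitrary illustrative parameter, not a value given in these cards, and the names `mean_quality` and `filter_reads` are hypothetical.

```python
def mean_quality(q_scores):
    """Average Q-score of one read."""
    return sum(q_scores) / len(q_scores)


def filter_reads(reads, min_mean_q=20):
    """Keep (label, sequence, q_scores) reads whose mean Q-score passes.

    min_mean_q is an assumed, adjustable threshold.
    """
    return [r for r in reads if r[2] and mean_quality(r[2]) >= min_mean_q]


# Made-up reads: one of good quality, one of poor quality.
reads = [
    ("good", "ACGT", [40, 38, 36, 30]),
    ("bad", "ACGT", [12, 10, 8, 2]),
]
kept = filter_reads(reads)
print([label for label, _, _ in kept])  # ['good']
```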

Question 9

Q

Describe per sequence and per base GC content

Answer

A

Per sequence GC content measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content
Sharp peaks on an otherwise smooth distribution normally result from a specific contaminant
For example, adapter dimers, which could also be picked up by the overrepresented sequences module
Per base GC content should be uniform across the read
A normal library contains a diverse set of sequences, and overrepresentation of a single sequence might indicate that the library is contaminated
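
A minimal sketch of the per sequence GC calculation. The function name `gc_content` and the example reads are made up; comparing the resulting values against a modelled normal distribution is left out here.

```python
def gc_content(sequence):
    """Percentage of G and C bases in one sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical reads; a real check would plot the distribution of these values.
reads = ["ACGT", "GGCC", "ATAT"]
gc_values = [gc_content(r) for r in reads]
print(gc_values)  # [50.0, 100.0, 0.0]
```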

Question 10

Q

Describe Adapter sequence removal

Answer

A

The primary reads from the sequencer also carry the adapter sequences that were used for the sequencing reaction; these need to be removed before processing
Adapter removal is generally available in the QC programs
There should be NO adapters left
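
A minimal sketch of 3' adapter trimming by exact matching. Real trimming tools also tolerate sequencing errors in the adapter, which this sketch does not. The function name `trim_adapter`, the `min_overlap` parameter and the adapter string are assumptions for illustration; substitute the adapter for your own library kit.

```python
def trim_adapter(sequence, adapter, min_overlap=3):
    """Remove an exact adapter match, or a partial adapter at the 3' end."""
    # Case 1: the whole adapter occurs inside the read.
    idx = sequence.find(adapter)
    if idx != -1:
        return sequence[:idx]
    # Case 2: the read ends part-way through the adapter.
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            return sequence[:-overlap]
    return sequence


ADAPTER = "AGATCGGAAGAGC"  # example adapter string (assumption)
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", ADAPTER))            # ACGTACGT
print(trim_adapter("ACGTACGT", ADAPTER))                 # ACGTACGT
```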

Question 11

Q

Describe per base sequence content

Answer

A

Shows the proportion of each base (and the GC content) at each position in the file; little or no difference between the bases is expected across a sequence run
However, overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias
If there are fluctuations, the affected positions must be trimmed
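
A minimal sketch of the per-position base proportions, assuming plain sequence strings as input. The function name `per_base_content` and the example reads are hypothetical.

```python
from collections import Counter


def per_base_content(sequences):
    """Percentage of A, C, G and T at each read position."""
    longest = max(len(s) for s in sequences)
    rows = []
    for pos in range(longest):
        counts = Counter(s[pos].upper() for s in sequences if pos < len(s))
        total = sum(counts.values())
        # Counter returns 0 for bases that never occur at this position.
        rows.append({base: 100 * counts[base] / total for base in "ACGT"})
    return rows


# Made-up reads: position 1 is mixed, position 2 is all C.
content = per_base_content(["ACGT", "ACGA", "TCGT", "GCGT"])
print(content[0])  # {'A': 50.0, 'C': 0.0, 'G': 25.0, 'T': 25.0}
```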

Question 12

Q

Describe sequence duplication

Answer

A

In a diverse library most sequences will occur only once in the final set
A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level is more likely to indicate some kind of enrichment bias (PCR overamplification)
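
A minimal sketch of measuring duplication by exact sequence identity. QC tools may subsample or truncate reads before counting, which is not modelled here; the function names are hypothetical.

```python
from collections import Counter


def duplication_levels(sequences):
    """Map copy number -> how many distinct sequences occur that many times."""
    copies = Counter(sequences)          # sequence -> number of copies
    return dict(Counter(copies.values()))


def percent_remaining_after_dedup(sequences):
    """Share of reads left if every sequence were kept only once."""
    return 100 * len(set(sequences)) / len(sequences)


# Made-up library: AAA seen 3 times, GGG twice, CCC once.
library = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]
print(duplication_levels(library))
print(percent_remaining_after_dedup(library))  # 50.0
```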

Question 13

Q

Describe per base N count

Answer

A

Bases with a poor quality score are normally left ambiguous and are substituted with an N by the program
There should be no Ns
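
A minimal sketch of the per-position N percentage, assuming plain sequence strings; the function name `per_base_n_percent` and the reads are made up.

```python
def per_base_n_percent(sequences):
    """Percentage of reads with an N at each position."""
    longest = max(len(s) for s in sequences)
    percents = []
    for pos in range(longest):
        column = [s[pos] for s in sequences if pos < len(s)]
        n_calls = sum(base in ("N", "n") for base in column)
        percents.append(100 * n_calls / len(column))
    return percents


n_percent = per_base_n_percent(["ACGT", "ANGT", "ACGN", "ACGT"])
print(n_percent)  # [0.0, 25.0, 0.0, 25.0]
```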

Question 14

Q

What are features of a bad Illumina run

Answer

A

Low Q score per base
Fluctuating per base sequence content
Sharp peaks per sequence GC content
Peaks in sequence duplication levels

Question 15

Q

Describe de novo assembly

Answer

A

Genomic DNA is fragmented into reads for sequencing; de novo assembly is the attempt to put those reads back together without a reference
This can leave gaps
Paired-end reads can help fill some of these gaps
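
A toy sketch of the idea, assuming error-free reads from one strand: repeatedly merge the pair of sequences with the longest suffix-prefix overlap. Real assemblers use overlap or de Bruijn graphs and handle errors, both strands and paired-end information; the names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge best-overlapping pairs until no overlaps remain."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # anything left unmerged is separated by a gap
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
print(greedy_assemble(["AAAA", "CCCC"]))                   # ['AAAA', 'CCCC']
```

The second call shows a gap: with no overlapping read between them, the two sequences stay as separate contigs.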

Question 16

Q

Describe repetitive sequences

Answer

A

Most common source of assembly errors
Reads longer than the repeat are needed to minimise the impact
If R1 and R2 are copies of a repetitive region, the software is unlikely to tell that R1 and R2 are separate sequences, so it will assemble just one of them. This leaves orphans in the sequence.

Question 17

Q

Describe reference mapping

Answer

A

Next generation sequencing reads are mapped onto a well-assembled reference genome to identify SNPs and indels between similar strains
Reference mapping is especially useful for resequencing projects on organisms where genomic rearrangements are limited
Programs: BWA and Bowtie
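
A brute-force sketch of the idea, assuming a short reference string and substitutions only (no indels). Mappers such as BWA and Bowtie use indexed data structures instead of scanning every position; the function name `map_read` and the `max_mismatches` parameter are hypothetical.

```python
def map_read(reference, read, max_mismatches=1):
    """Return (start, mismatches) for the best placement, or None.

    Each mismatch is (reference position, reference base, read base),
    i.e. a candidate SNP.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) <= max_mismatches and (
            best is None or len(mismatches) < len(best[1])
        ):
            best = (start, mismatches)
    return best


# Made-up reference and read; the read differs at reference position 9.
reference = "ATGGCGTACCTGAAC"
hit = map_read(reference, "GTACGTG")
print(hit)  # (5, [(9, 'C', 'G')])
```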

Question 18

Q

What are the mapping outputs?

Answer

A

Reference-guided assemblies, produced by mapping short reads to a reference genome
A list of potential indels
A list of genome-wide SNPs
Some genes face the forward direction and others the reverse direction
The order of genes can't be predicted from the mapped sequence

Question 19

Q

Describe the different types of SNPs

Answer

A

Synonymous SNPs - no change in AA sequence
Non synonymous - missense - results in a codon that encodes a different AA
Non synonymous - nonsense - results in a premature stop codon
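
A minimal sketch of classifying a SNP by comparing the reference and altered codons, using the standard genetic code. The function name `classify_snp` is hypothetical, and special cases such as changes in existing start or stop codons are not treated separately here.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(ref_codon, alt_codon):
    """Label a codon change as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":  # '*' marks a stop codon
        return "nonsense"
    return "missense"


print(classify_snp("CTT", "CTC"))  # synonymous (Leu -> Leu)
print(classify_snp("GAA", "GTA"))  # missense (Glu -> Val)
print(classify_snp("TGG", "TGA"))  # nonsense (Trp -> stop)
```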

Question 20

Q

What is TRAMS - Tool for Rapid Annotation of Microbial SNPs?

Answer

A

TRAMS is a program for functional annotation of genomic SNPs as synonymous, nonsynonymous or nonsense. It separates nonsynonymous SNPs in start and stop codons as non-start and non-stop SNPs, respectively. MNPs within a codon are combined before annotation.
