Genomics - NGS reads, reference mapping and SNP analysis Flashcards

1
Q

Name 2 examples of read formats you may get for QC analysis

A
  1. Binary standard flowgram format (SFF) - Roche 454
  2. FASTQ Format - text based format - Illumina
2
Q

What do Illumina Reads show?

A

Label
Sequence
Q-Score
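
The three parts of a read can be pulled out of a FASTQ record with a few lines of Python. This is only a sketch: it assumes the common four-line record layout and Phred+33 quality encoding (the `offset` parameter), and `FastqRecord` and `parse_fastq_record` are names made up for illustration.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    label: str
    sequence: str
    qualities: list[int]


def parse_fastq_record(lines: list[str], offset: int = 33) -> FastqRecord:
    """Parse one four-line FASTQ record (assumes Phred+33 encoding by default)."""
    header, sequence, _plus, quality = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header line must start with '@'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one Phred score as an ASCII code minus the offset.
    return FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


record = parse_fastq_record(["@read1", "ACGT", "+", "II5+"])
print(record.label, record.sequence, record.qualities)
```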

3
Q

What does a Phred score of 10, 20, 30, 40, 50 and 60 mean?

A

10 - error 1/10 - accuracy 90%
20 - error 1/100 - accuracy 99%
30 - error 1/1000 - accuracy 99.9%
40 - error 1/10000 - accuracy 99.99%
50 - error 1/100000 - accuracy 99.999%
60 - error 1/1000000 - accuracy 99.9999%
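
These values follow from the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error {p:.6f}, accuracy {100 * (1 - p):.4f}%")
```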

4
Q

What Phred score is the cut-off for QC analysis?

A

20

5
Q

Name 2 QC software tools

A

CLC Genomics Workbench
FastQC

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Describe Per Base Quality Distribution

A
  • Shows the range of all quality values across all bases at each position
  • A poor distribution could be due to general degradation of quality over the duration of long runs; quality trimming can be useful in such cases
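
Quality trimming can be illustrated with a deliberately simple rule: drop bases from the 3' end until one reaches the threshold. This is a sketch only. Real trimmers typically use more robust methods such as sliding windows or running sums, and `trim_3prime` and its Q20 default are assumptions for illustration.

```python
def trim_3prime(sequence: str, qualities: list[int], threshold: int = 20):
    """Remove trailing bases whose Phred score is below the threshold."""
    end = len(sequence)
    # Walk back from the 3' end while the last kept base is low quality.
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return sequence[:end], qualities[:end]


trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [30, 32, 28, 25, 12, 8])
print(trimmed_seq, trimmed_quals)
```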
7
Q

Describe per tile sequence quality

A
  • Average Phred score for each tile on the flow cell
  • If everything is perfect - all tiles will be blue
  • Only for Illumina - reads retain the original sequence identifiers for the flow cell tile from which each read originated
  • The colour scale runs from blue to red
  • Blue for positions with quality at or above the average and red for poorer qualities
8
Q

Define per sequence quality scores and describe their benefits

A
  • A small fraction of reads have overall poor quality values due to poor imaging or other technical problems during the run
  • The per sequence quality score helps identify the subset of sequences with low quality values
  • Low quality reads must be removed from downstream processing and analyses due to problems, such as wrongly called bases, which could introduce bias in downstream analysis.
  • Generally filtered by their quality scores using in-built software
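
Filtering by per sequence quality can be sketched as follows. The reads are assumed to be `(sequence, qualities)` pairs, the mean Phred score is used as the per-read summary, and the cut-off of 20 matches the QC threshold used elsewhere in this deck. The function names are illustrative, not those of any particular program.

```python
def mean_quality(qualities: list[int]) -> float:
    """Arithmetic mean of the Phred scores of one read."""
    return sum(qualities) / len(qualities)


def filter_reads(reads, min_mean_quality: float = 20):
    """Keep only reads whose mean Phred score reaches the cut-off."""
    return [read for read in reads if mean_quality(read[1]) >= min_mean_quality]


reads = [
    ("ACGT", [35, 36, 30, 31]),  # good read, mean 33
    ("TTGA", [12, 10, 8, 14]),   # poor read, mean 11
]
kept = filter_reads(reads)
print([seq for seq, _ in kept])
```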
9
Q

Describe per sequence and per base GC content

A
  • Measures GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content
  • Sharp peaks on an otherwise smooth distribution normally result from a specific contaminant,
    for example adapter dimers, which could be picked up by the overrepresented sequences module
  • You should have a uniform GC content across the read
  • A normal library contains a diverse set of sequences and overrepresentation of a single sequence might indicate the library is contaminated
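
The per sequence GC value is simply the fraction of G and C bases in each read. A minimal sketch (the helper name `gc_fraction` is illustrative):

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases in the read that are G or C (0.0 for an empty read)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


reads = ["GGCCAATT", "GCGCGCGC", "ATATATAT"]
print([gc_fraction(read) for read in reads])
```

Plotting a histogram of these values over all reads gives the distribution that is compared against the modelled normal curve.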
10
Q

Describe Adapter sequence removal

A
  • The primary reads from the sequencer also carry the adapter sequences that were used for the sequencing reaction, which need to be removed before processing
  • This feature is generally available in QC programs
  • There should be NO adapters left
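
A simplified view of adapter removal: cut the read where the adapter starts, including the case where only the first few adapter bases made it onto the 3' end. This sketch uses exact matching only. Real tools also tolerate mismatches. The adapter string and the `min_overlap` parameter are illustrative assumptions.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter (full or partial) using exact matching only."""
    index = read.find(adapter)
    if index != -1:
        return read[:index]  # full adapter found: keep what precedes it
    # Otherwise look for the longest adapter prefix sitting at the read's 3' end.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, for illustration only
print(trim_adapter("ACGTACGT" + ADAPTER, ADAPTER))
print(trim_adapter("ACGTACGTAGATC", ADAPTER))
```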
11
Q

Describe per base sequence content

A
  • Shows the proportion of each base (and the GC content) at each position in the file; little or no difference is expected between positions across a sequencing run
  • However, overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias
  • If there are fluctuations, the affected positions must be trimmed
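
The per base sequence content can be computed by counting bases column by column. A minimal sketch, assuming the reads are plain strings (the function name is illustrative):

```python
from collections import Counter


def per_base_content(reads: list[str]) -> list[dict[str, float]]:
    """For each read position, return the fraction of A, C, G and T."""
    length = max(len(read) for read in reads)
    content = []
    for pos in range(length):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos] for read in reads if pos < len(read))
        total = sum(counts.values())
        content.append({base: counts[base] / total for base in "ACGT"})
    return content


content = per_base_content(["ACGT", "ACGA", "TCGT", "ACCT"])
print(content[0])
```

In an unbiased library the four fractions stay roughly constant from one position to the next.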
12
Q

Describe sequence duplication

A
  • In a diverse library most sequences will occur only once in the final set
  • A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level is more likely to indicate some kind of enrichment bias (PCR overamplification)
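
A rough duplication estimate counts how many reads are exact copies of a read already seen. This sketch assumes exact string identity defines a duplicate, which is a simplification of what QC programs report, and the function name is illustrative.

```python
from collections import Counter


def duplication_summary(reads: list[str]) -> dict[str, float]:
    """Count total and distinct reads and the fraction that are duplicates."""
    counts = Counter(reads)
    total = len(reads)
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "duplicate_fraction": 1 - distinct / total,
    }


summary = duplication_summary(["AAA", "AAA", "CCC", "GGG"])
print(summary)
```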
13
Q

Describe per base N count

A

  • Bases with a poor quality score are normally left ambiguous and are substituted with an N by the program
  • There should be no Ns
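
The per base N count is the fraction of reads with an N at each position. A minimal sketch (the function name is illustrative):

```python
def n_fraction_per_position(reads: list[str]) -> list[float]:
    """Fraction of reads carrying an ambiguous base (N) at each position."""
    length = max(len(read) for read in reads)
    fractions = []
    for pos in range(length):
        bases = [read[pos] for read in reads if pos < len(read)]
        fractions.append(bases.count("N") / len(bases))
    return fractions


fractions = n_fraction_per_position(["ACGT", "ANGT", "ACGN", "ANGT"])
print(fractions)
```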

14
Q

What are the features of a bad Illumina run?

A
  • Low Q score per base
  • Fluctuating per base sequence content
  • Sharp peaks per sequence GC content
  • Peaks in sequence duplication levels
15
Q

Describe de novo assembly

A
  • Genomic DNA is fragmented for sequencing into reads; de novo assembly is the attempt to put those reads back together without a reference
  • This can leave gaps
  • Paired end reads can be helpful filling some of these gaps
16
Q

Describe repetitive sequences

A
  • Most common source of assembly errors
  • Reads longer than the repeat minimise the impact
  • If R1 and R2 are copies of a repetitive region, the software is unlikely to tell that they are separate sequences, so it will assemble only one of them. This leaves orphans in the sequence.
17
Q

Describe reference mapping

A
  • Next generation sequencing reads are mapped onto a well assembled reference genome to identify SNPs and indels between similar strains
  • Reference mapping is really useful for resequencing projects for organisms where genomic rearrangements are limited
  • Programs = BWA and Bowtie
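
The idea of mapping a read and reading off SNPs can be shown with a toy brute-force search. This is not how BWA or Bowtie work internally: they use indexed search based on the Burrows-Wheeler transform and handle gaps. The function names and the example sequences are made up, and positions are 0-based.

```python
def best_ungapped_hit(reference: str, read: str) -> tuple[int, int]:
    """Slide the read along the reference; return (position, mismatches) of the best fit."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def candidate_snps(reference: str, read: str, pos: int):
    """List (reference position, reference base, read base) for each mismatch."""
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if reference[pos + i] != base
    ]


reference = "ACGTTGCATGCCGATAGC"
read = "TGCATGACGAT"
pos, mismatches = best_ungapped_hit(reference, read)
snps = candidate_snps(reference, read, pos)
print(pos, mismatches, snps)
```

A single mismatching read is only a candidate: in practice a SNP is called when many overlapping reads support the same change.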
18
Q

What are the mapping outputs?

A
  • Reference guided assemblies by mapping short reads to a reference genome
  • A list of potential indels
  • A list of genome wide SNPs
  • Some genes are facing the forward and reverse directions
  • The order of genes can't be predicted in the mapped sequence
19
Q

Describe the different types of SNPs

A
  • Synonymous SNPs - no change in AA sequence
  • Non synonymous - missense - results in a codon that encodes for a different AA
  • Non synonymous - nonsense - results in a premature stop codon
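
The three classes can be told apart by translating the codon before and after the change. A minimal sketch, assuming the standard genetic code and a single substitution inside one codon. `classify_snp` is an illustrative name, and codons that are already stop codons are left out of scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a substitution at a 0-based position within a reference codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if ref_aa == "*":
        raise ValueError("loss of a stop codon is outside this sketch")
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_snp("GAA", 0, "A"))  # GAA -> AAA, glutamate to lysine
print(classify_snp("GAA", 0, "T"))  # GAA -> TAA, a stop codon
```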
20
Q

What is TRAMS (Tool for Rapid Annotation of Microbial SNPs)?

A
  • TRAMS is a program for functional annotation of genomic SNPs as synonymous, nonsynonymous or nonsense. It separates nonsynonymous SNPs in start and stop codons as non-start and non-stop SNPs, respectively. MNPs within a codon are combined before annotation