Genomics - NGS reads, reference mapping and SNP analysis Flashcards
Name 2 examples of read formats you may get for QC analysis
- Binary standard flowgram format (SFF) - Roche 454
- FASTQ Format - text based format - Illumina
What do Illumina Reads show?
Label
Sequence
Q-Score
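A FASTQ record stores these three fields in four lines: the label (starting with @), the sequence, a separator line starting with +, and the quality string. A minimal sketch of reading one record, assuming the common Phred+33 encoding; the function name `parse_fastq_record` and the example read are made up for illustration:

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into label, sequence and Q-scores."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so Q = ASCII code of the character - 33.
    q_scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, q_scores


# Hypothetical four-line record.
record = ["@read1", "ACGT", "+", "II5+"]
label, seq, quals = parse_fastq_record(record)
print(label, seq, quals)
```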
What does a Phred score of 10, 20, 30, 40, 50 and 60 mean?
10 - Error 1/10 - Accuracy 90%
20 - Error 1/100 - Accuracy 99%
30 - Error 1/1000 - Accuracy 99.9%
40 - Error 1/10000 - Accuracy 99.99%
50 - Error 1/100000 - Accuracy 99.999%
60 - Error 1/1000000 - Accuracy 99.9999%
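These values follow from the Phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. A short sketch that reproduces the table (the function names are illustrative):

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error(q)
    print(f"Q{q}: error 1 in {round(1 / p)}, accuracy {100 * (1 - p):.4f}%")
```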
What Phred score is the cut-off for QC analysis?
20
Name 2 QC software tools
CLC Genomics Workbench
FastQC
Describe Per Base Quality Distribution
- Shows the range of all quality values across all bases at each position
- Poor distribution could be due to general degradation of quality over the duration of long runs; quality trimming can be useful in such cases
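As a rough illustration, the range of quality values at each position can be summarised like this. It is a simplified sketch, not how any particular QC program computes its plot, and the quality values are invented:

```python
from statistics import mean, median


def per_base_quality(quality_lists):
    """Return (position, min, median, mean, max) for each read position (1-based)."""
    longest = max(len(q) for q in quality_lists)
    summary = []
    for pos in range(longest):
        # Shorter reads simply do not contribute to later positions.
        column = [q[pos] for q in quality_lists if len(q) > pos]
        summary.append((pos + 1, min(column), median(column), mean(column), max(column)))
    return summary


# Hypothetical Phred scores for three reads; quality drops towards the 3' end.
reads = [[38, 37, 30, 12], [40, 36, 28, 10], [39, 35, 25]]
for row in per_base_quality(reads):
    print(row)
```

A falling median towards the end of the reads is the pattern that suggests quality trimming.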
Describe per tile sequence quality
- Average Phred score for each tile on the flow cell
- If everything is perfect - all tiles will be blue
- Only for Illumina - reads retain the original sequence identifiers, which record the flow cell tile each read came from
- The colour scale is a blue - red scale
- Blue for positions with quality at or above the average and red for poorer qualities
Define per sequence quality scores and describe their benefits
- A small fraction of reads have overall poor quality values due to poor imaging or other technical problems during the run
- The per sequence quality score helps identify the subset of sequence with low quality values
- Low quality reads must be removed from downstream processing and analyses due to problems, such as wrongly called bases, which could introduce bias in downstream analysis.
- Generally filtered by their quality scores using in-built software
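A minimal sketch of such a filter, assuming reads are held as (label, sequence, Q-scores) tuples and using the Q20 cut-off from this set. Averaging Phred scores per read is a simplification chosen here for illustration; real tools offer several filtering criteria:

```python
def mean_quality(q_scores):
    """Arithmetic mean of the Phred scores of one read."""
    return sum(q_scores) / len(q_scores)


def filter_reads(reads, min_mean_q=20):
    """Keep reads whose mean Phred score is at or above the threshold."""
    return [(label, seq, q) for label, seq, q in reads if mean_quality(q) >= min_mean_q]


# Hypothetical reads: one good, one poor.
reads = [
    ("good", "ACGT", [38, 37, 35, 30]),
    ("poor", "ACGT", [12, 10, 8, 5]),
]
kept = filter_reads(reads)
print([label for label, _, _ in kept])
```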
Describe per sequence and per base GC content
- Per sequence: measures GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content
- Sharp peaks on an otherwise smooth distribution normally result from a specific contaminant, for example adapter dimers, which could also be picked up by the overrepresented sequences module
- Per base: you should have a uniform GC content across the read
- A normal library contains a diverse set of sequences and overrepresentation of a single sequence might indicate the library is contaminated
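The per sequence GC value is just the percentage of G and C bases in each read. A small sketch with invented reads:

```python
def gc_content(seq):
    """Percentage of G and C bases in one sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical reads; a QC tool would plot the distribution of these values.
reads = ["ATGCATGC", "GGCCGGCC", "ATATATGC"]
print([gc_content(r) for r in reads])
```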
Describe Adapter sequence removal
- The raw reads from the sequencer also carry the adapter sequences that were used for the sequencing reaction; these need to be removed before processing
- This feature is generally available in QC programs
- There should be NO adapters left
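A simplified sketch of 3' adapter removal using exact matching only. Dedicated trimming tools also tolerate mismatches, which this does not. The adapter string below is only an example, so use the sequence from your own library prep kit:

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter: a full copy anywhere, or a partial copy at the read end."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adapter prefix that the read ends with.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter, an assumption for this sketch
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))  # full adapter inside the read
print(trim_adapter("ACGTACGTAGATC", ADAPTER))             # partial adapter at the end
```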
Describe per base sequence content
- Shows the proportion of each base at each position in the file; little or no difference between the bases is expected across a sequencing run
- However - overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias
- If there are fluctuations, the affected positions must be trimmed
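The proportions can be tallied position by position, as in this sketch with invented reads. Any N calls count towards the total but are not reported:

```python
from collections import Counter


def per_base_content(seqs):
    """Percentage of A, C, G and T at each read position."""
    longest = max(len(s) for s in seqs)
    table = []
    for pos in range(longest):
        counts = Counter(s[pos] for s in seqs if len(s) > pos)
        total = sum(counts.values())
        table.append({base: 100 * counts[base] / total for base in "ACGT"})
    return table


# Hypothetical reads; position 1 is biased towards A, as an adapter dimer might cause.
reads = ["ACGT", "AGGT", "ATGA", "ACGC"]
for pos, row in enumerate(per_base_content(reads), start=1):
    print(pos, row)
```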
Describe sequence duplication
- In a diverse library most sequences will occur only once in the final set
- A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level is more likely to indicate some kind of enrichment bias (PCR overamplification)
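A simple way to gauge duplication is to count exact duplicate sequences. This sketch is an assumption about how one might do it and does not reproduce any specific tool's method:

```python
from collections import Counter


def duplication_summary(seqs):
    """Return (distinct sequences, total sequences, % left after deduplication)."""
    counts = Counter(seqs)
    total = len(seqs)
    distinct = len(counts)
    return distinct, total, 100 * distinct / total


# Hypothetical reads: "AAA" occurs twice.
reads = ["AAA", "AAA", "CCC", "GGG"]
print(duplication_summary(reads))
```

A percentage close to 100 means most sequences occur only once, as expected for a diverse library.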
Describe per base N count
- Bases with poor quality scores are normally left ambiguous and are substituted with an N by the program
- There should be no Ns
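The N count per position can be checked with a few lines. The reads are invented for illustration:

```python
def per_base_n_percent(seqs):
    """Percentage of N calls at each read position."""
    longest = max(len(s) for s in seqs)
    result = []
    for pos in range(longest):
        column = [s[pos] for s in seqs if len(s) > pos]
        result.append(100 * sum(base in "Nn" for base in column) / len(column))
    return result


# Hypothetical reads with ambiguous calls towards the 3' end.
reads = ["ACGN", "ACNN", "ACGT", "ACGT"]
print(per_base_n_percent(reads))
```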
What are the features of a bad Illumina run?
- Low Q score per base
- Fluctuating per base sequence content
- Sharp peaks per sequence GC content
- Peaks in sequence duplication levels
Describe de novo assembly
- Genomic DNA is fragmented into reads for sequencing; de novo assembly attempts to put the reads back together without a reference genome
- This can leave gaps
- Paired end reads can be helpful filling some of these gaps
Describe repetitive sequences
- Most common source of assembly errors
- Reads longer than the repeat minimise the impact
- If R1 and R2 are copies of the same repeat, the software is unlikely to tell that they are separate sequences, so it will collapse them into one. This leaves orphans in the sequence.
Describe reference mapping
- Next generation sequencing reads are mapped onto a well assembled reference genome to identify SNPs and indels between similar strains
- Reference mapping is particularly useful for resequencing projects for organisms where genomic rearrangements are limited
- Programs = BWA and Bowtie
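To make the idea concrete, here is a toy sketch that places a read at the ungapped position with the fewest mismatches and reports the mismatches as candidate SNPs. This brute-force search is an illustrative stand-in only: BWA and Bowtie use Burrows-Wheeler transform indexes, and real mappers also handle indels and base qualities. The sequences and function names are made up, and the read is assumed to be no longer than the reference:

```python
def best_ungapped_hit(read, reference):
    """Return (offset, mismatches) of the placement with the fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def candidate_snps(read, reference):
    """List (0-based reference position, reference base, read base) for each mismatch."""
    offset, _ = best_ungapped_hit(read, reference)
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if base != reference[offset + i]
    ]


reference = "ACGTTGCAAGGCTA"  # hypothetical reference
read = "TGCATGGC"             # hypothetical read carrying one substitution
print(best_ungapped_hit(read, reference))
print(candidate_snps(read, reference))
```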
What are the mapping outputs?
- Reference guided assemblies by mapping short reads to a reference genome
- A list of potential indels
- A list of genome wide SNPs
- Some genes are facing the forward and reverse directions
- The order of genes cannot be predicted in the mapped sequence
Describe the different types of SNPs
- Synonymous SNPs - no change in AA sequence
- Non synonymous - missense - results in a codon that encodes a different AA
- Non synonymous - nonsense - results in a premature stop codon
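The three categories can be told apart by translating the codon before and after the change. A minimal sketch, assuming the standard genetic code and a 0-based position within the codon. The function name is illustrative, and changes inside start or stop codons are not treated specially here:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, position, alt_base):
    """Classify a single-base change at `position` (0, 1 or 2) of a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(classify_snp("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_snp("GAA", 0, "A"))  # GAA (Glu) -> AAA (Lys)
print(classify_snp("GAA", 0, "T"))  # GAA (Glu) -> TAA (stop)
```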
What is TRAMS - Tool for Rapid Annotation of Microbial SNPs
- TRAMS is a program for functional annotation of genomic SNPs as synonymous, nonsynonymous or nonsense. It separates nonsynonymous SNPs in start and stop codons as non-start and non-stop SNPs, respectively. MNPs within a codon are combined before annotation