NGS File Formats Flashcards
What is FASTQ?
FASTQ is a text-based format for storing both a DNA sequence and its per-base quality scores.
For paired-end Illumina sequencing there is one FASTQ file per read direction, so you get two FASTQ files per sample (R1 and R2, forward and reverse).
What file format is sequencing data stored in after demultiplexing?
FASTQ
In a FASTQ file what does the quality value Q indicate?
Q is an integer mapping of p, the probability that the corresponding base call is incorrect: Q = -10 * log10(p). For example, Q30 means a 1 in 1000 chance that the base is wrong.
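A minimal sketch of the Phred relationship, assuming the standard definition Q = -10 * log10(p) where p is the probability that the base call is wrong. The function names are illustrative and not taken from any library.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the error probability p."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability p (0 < p <= 1) to a Phred score Q."""
    return -10 * math.log10(p)

# Higher Q means a lower chance that the base call is wrong.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```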
What number is added to the Q score so it can be stored as a printable ASCII character in FASTQ and SAM files?
33
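As a sketch of how the offset works, assuming Phred+33 encoding: each quality character's ASCII code minus 33 gives the Q score. The helper names below are hypothetical.

```python
def decode_phred33(quality):
    """Turn a Phred+33 quality string into a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality]

def encode_phred33(scores):
    """Turn a list of integer Q scores into a Phred+33 quality string."""
    return "".join(chr(q + 33) for q in scores)

print(decode_phred33("!5I"))        # [0, 20, 40]
print(encode_phred33([0, 20, 40]))  # !5I
```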
What is a SAM file?
A SAM (Sequence Alignment/Map) file is a generic, tab-delimited text format for storing large nucleotide sequence alignments.
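To make the layout concrete, here is a rough sketch that splits one SAM alignment line into its 11 mandatory tab-separated fields. The example read is invented for illustration, and real files are usually better handled with a dedicated library.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record

# Made-up example read aligned to chr1 at position 100.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
print(parse_sam_line(example))
```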
Why is a SAM used to store the alignment?
- It is flexible enough to store alignments generated by various alignment tools
- It is simple enough to be easily generated by alignment programmes or converted from existing alignment formats
- It is compact in file size
- It allows most operations to be streamed without loading the whole alignment into memory, thereby saving memory
- It allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
What is a CIGAR string?
A CIGAR string is found in a SAM file and is a sequence of lengths, each followed by an operation code. It indicates which bases align with the reference (as a match or mismatch), which bases are deleted from the reference and which are insertions not present in the reference.
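A small sketch of reading a CIGAR string, assuming the standard operation codes (M, I, D, N, S, H, P, =, X). The function names are illustrative only.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]

def reference_length(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)

def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)

# 3 aligned bases, 1 inserted base, 3 aligned, 1 deleted from the read, 5 aligned.
print(parse_cigar("3M1I3M1D5M"))
print(reference_length("3M1I3M1D5M"), query_length("3M1I3M1D5M"))  # 12 12
```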
What is a BAM file?
A BAM file is a binary compressed version of a SAM file. BAM files use a modified version of gzip compression called BGZF which can be applied to any file format to provide compression with efficient random access.
What is a VCF file?
A VCF (Variant Call Format) file is a text file format (usually stored compressed). It contains meta-information lines (starting with ##), a header line (starting with #) and then data lines, each containing information about a position in the genome.
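The three kinds of line can be told apart by their prefix. The sketch below assumes an uncompressed VCF held in a string; the example content is invented and the function name is hypothetical.

```python
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    '##FILTER=<ID=LowQual,Description="Low quality call">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\n"
    "chr1\t22222\trs1\tC\tT\t8\tLowQual\tDP=4\n"
)

def read_vcf(text):
    """Separate meta-information lines, the header line and the data lines."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")
        else:
            if header is None:
                raise ValueError("data line found before the header line")
            # Pair each column value with its column name from the header.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records

meta, header, records = read_vcf(EXAMPLE_VCF)
print(len(meta), header)
print(records[0])
```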
What is an INFO field in a VCF?
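An INFO field holds extra annotations about a variant call as semicolon-separated key=value pairs (e.g. DP=30). Each key is an abbreviation that is defined in an ##INFO meta-information line at the top of the file.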
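A rough sketch of unpacking an INFO column, assuming the usual layout of semicolon-separated entries that are either key=value pairs or bare flags. Values are left as strings here, and the function name is illustrative.

```python
def parse_info(info):
    """Turn an INFO column such as 'DP=30;AF=0.5;DB' into a dictionary."""
    if info == ".":  # '.' means no INFO annotations are present
        return {}
    result = {}
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value
        else:
            result[entry] = True  # flag-type key with no value
    return result

print(parse_info("DP=30;AF=0.5;DB"))
```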
What is a FILTER field in a VCF?
A FILTER field holds flags which may be added to calls to indicate they should be filtered out, e.g. LowQual. Calls that pass all filters are marked PASS.
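As a sketch, a FILTER column can be interpreted like this, assuming the usual conventions: PASS for a call that passed every filter, "." when no filtering was applied, and otherwise a semicolon-separated list of failed filters. The function name is hypothetical.

```python
def failed_filters(filter_field):
    """Return the list of filters a call failed (empty if none)."""
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")

print(failed_filters("PASS"))     # []
print(failed_filters("LowQual"))  # ['LowQual']
```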
What is a FASTA file and what is stored in it?
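A FASTA file is a text-based format representing nucleotide (or protein) sequences. Each record has a header line starting with > followed by one or more lines of sequence.
Typically the reference genome used for alignment is stored in FASTA format, as most aligners need it in this format.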
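A minimal sketch of reading FASTA records, assuming each record is a header line starting with > followed by sequence lines. The example sequences are invented and the function name is illustrative.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its full sequence."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the sequence name.
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}

EXAMPLE_FASTA = ">chr1 example contig\nACGT\nACGT\n>chr2\nGGCC\n"
print(parse_fasta(EXAMPLE_FASTA))
```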
What are BCL files?
BCL files hold the base calls per cycle for Illumina sequencing. They contain the base call and quality for each cycle.
Illumina's bcl2fastq software converts BCL to FASTQ.
Different sequencers have different file types.
How are the VCF quality scores calculated?
Using the Phred scale: QUAL = -10 * log10(probability that the variant call is wrong)
What do the four lines of a FASTQ detail?
Line 1 - Identifier line (begins with @)
Line 2 - Sequence
Line 3 - + (separator line, optionally followed by the identifier again)
Line 4 - Quality values in ASCII (+33 for current Illumina data), one character per base
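The four-line layout can be sketched as a small parser, assuming each record occupies exactly four lines and qualities are Phred+33 encoded. The reads below are invented and the function name is hypothetical.

```python
def parse_fastq(text):
    """Yield one dict per four-line FASTQ record."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield {
            "id": header[1:],
            "sequence": sequence,
            # Subtract the ASCII offset of 33 to recover the Q scores.
            "quality": [ord(ch) - 33 for ch in quality],
        }

EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5I#\n"
for record in parse_fastq(EXAMPLE_FASTQ):
    print(record)
```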