NGS file formats Flashcards
What information is contained in the BCL format?
Binary Base Call. Contains the base calls and quality scores for each tile in each cycle.
What is bcl2fastq?
Illumina proprietary software to convert BCL files to FASTQ files and demultiplex samples (using sample sheets generated by the user)
What is the FASTQ format used for?
Text based file for storing biological sequences and Phred qualities in a single file
What is the Illumina file naming convention?
SampleName_S1_L001_R1_001.fastq.gz
Contains:
- Sample name
- Sample number (order on sample sheet)
- Lane number
- Read number (R1 = forward read, R2 = reverse read for paired end)
- Last segment always 001
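Example (a minimal sketch, not Illumina tooling): the naming convention above can be split into its parts with a regular expression. The function name `parse_fastq_name` and the second file name are hypothetical, and the pattern only covers R1/R2 read files.

```python
import re

# Pattern for the convention above: SampleName_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>R[12])_001\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Split an Illumina-style FASTQ file name into its components."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an Illumina-style FASTQ name: {filename}")
    fields = match.groupdict()
    # Sample number and lane are zero-padded integers in the file name
    fields["sample_number"] = int(fields["sample_number"])
    fields["lane"] = int(fields["lane"])
    return fields

print(parse_fastq_name("SampleName_S1_L001_R1_001.fastq.gz"))
# Sample names may themselves contain underscores (hypothetical example)
print(parse_fastq_name("My_Sample_S12_L003_R2_001.fastq.gz"))
```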
What is contained in the 4 lines of the FASTQ format?
- Sequence identifier- instrument, run number, flowcell ID, lane, tile, X-pos, Y-pos ….
- Sequence
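- Separator line (starts with +, optionally followed by the sequence identifier again)
- Quality scores- one ASCII character per base, encoding the Phred score (probability of the base being called incorrectly)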
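Example (a minimal sketch): reading one 4-line record and turning the quality string into Phred scores. It assumes Phred+33 encoding (standard for current Illumina output); the read name and sequence are made up, and `parse_fastq_record` and `error_probability` are hypothetical helper names.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (identifier, sequence, Phred scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("malformed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33, so Q = ASCII code of the character minus 33
    phred = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, phred

def error_probability(q):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

record = [
    "@read1\n",    # made-up identifier
    "GATTACA\n",
    "+\n",
    "IIII5+!\n",
]
name, seq, quals = parse_fastq_record(record)
print(name, seq, quals)
print([error_probability(q) for q in quals])
```

A score of Q20 corresponds to a 1 in 100 chance of a wrong call, Q30 to 1 in 1000.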
What is the SAM/BAM format used for?
Storing read alignments against a reference. SAM is a tab-delimited text format consisting of an optional header section and an alignment section with 11 mandatory fields for essential alignment information.
What is included in the header of SAM format?
- Header line- all tags start with @ e.g.
a. @HD- header line… VN format version
b. @SQ reference sequence line… SN reference name
c. @RG- read group line… ID read group identifier
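Example (a minimal sketch): header lines are tab-separated TAG:VALUE pairs after the record type. The reference name, read group and sample values below are made up, and `parse_sam_header_line` is a hypothetical helper name.

```python
def parse_sam_header_line(line):
    """Split one SAM header line into its record type and TAG:VALUE pairs."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    if not record_type.startswith("@"):
        raise ValueError("header lines must start with @")
    if record_type == "@CO":
        # Comment lines hold free text rather than TAG:VALUE pairs
        return record_type, {"comment": "\t".join(fields[1:])}
    # Split on the first colon only, since values may contain colons
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags

header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:rg1\tSM:sample1",   # made-up read group
]
for line in header:
    print(parse_sam_header_line(line))
```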
What are the 11 mandatory fields of the alignment section?
- QNAME- read name
- FLAG- Bitwise flag contains info about the alignment
- RNAME- Reference sequence name of the alignment
- POS- 1-based leftmost mapping position of the read
- MAPQ- Mapping quality (Phred-scaled) of the read alignment
- CIGAR- Describes the exact alignment to the reference e.g. 100M
- RNEXT- Reference sequence name of mate in pair
- PNEXT- Position of mate in pair
- TLEN- Observed template length (insert size spanned by the read pair on the reference)
- SEQ- the read sequence
- QUAL- ASCII-encoded (Phred+33) base quality of each base in the read
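Example (a minimal sketch): splitting one alignment line into the 11 fields, decoding a few FLAG bits and breaking up a CIGAR string. The alignment line is made up, the helper names are hypothetical, and only a subset of the FLAG bits is listed.

```python
import re
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

# Subset of the bitwise FLAG values defined for SAM
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

def parse_sam_line(line):
    """Parse the 11 mandatory fields; optional tags after them are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("alignment line has fewer than 11 fields")
    (qname, flag, rname, pos, mapq, cigar,
     rnext, pnext, tlen, seq, qual) = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)

def decode_flag(flag):
    """List the names of the FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Turn e.g. '100M' into [(100, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Made-up alignment line
line = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII"
record = parse_sam_line(line)
print(record)
print(decode_flag(record.flag))
print(parse_cigar("3S90M2I5M"))
```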
What is the BAM format?
Binary representation of a SAM file. BGZF compressed.
Why would you index a BAM?
Indexing allows fast retrieval of the alignments overlapping a given region without reading the whole file. The BAM must be coordinate-sorted before being indexed.
What is the VCF format?
Variant Call Format. A text file used in bioinformatics for storing sequence variants.
How many sections are in a VCF?
3
- Meta information
- Header lines
- Data lines
What is included within the Meta info line (VCF)?
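Key-value list (lines starting with ##) describing the information contained in the data lines. INFO (describes content of the INFO field), FILTER (filters applied by the variant caller) and FORMAT (genotype info) entries are recommended. INFO and FORMAT entries require ID, Number, Type and Description; FILTER entries require ID and Description.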
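Example (a minimal sketch): pulling the key-value pairs out of a structured meta information line. The regular expressions are a simplification (no escaped quotes inside Description), and `parse_meta_line` is a hypothetical helper name.

```python
import re

# Structured meta lines look like ##KEY=<ID=...,Description="...">
META_LINE = re.compile(r"^##(?P<key>\w+)=<(?P<body>.*)>$")
# A value is either a double-quoted string (may contain commas) or runs up to the next comma
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')

def parse_meta_line(line):
    """Return (key, attributes) for a structured meta line, else None."""
    match = META_LINE.match(line.strip())
    if match is None:
        return None  # unstructured line such as ##fileformat=VCFv4.2
    attrs = {k: v.strip('"') for k, v in PAIR.findall(match.group("body"))}
    return match.group("key"), attrs

print(parse_meta_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'))
print(parse_meta_line('##FILTER=<ID=q10,Description="Quality below 10">'))
print(parse_meta_line("##fileformat=VCFv4.2"))
```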
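What are the 8 mandatory columns of the header line (VCF)?
- CHROM- Chromosome
- POS- Position in reference (1-based)
- ID- Variant identifier, e.g. dbSNP rsID
- REF- Base(s) in reference
- ALT- Base(s) identified- can be multiple, comma separated
- QUAL- Phred-scaled quality score for the assertion made in ALT
- FILTER- PASS, or the names of the filters the variant failed
- INFO- Additional info about the variant e.g. AF (allele frequency), DP (depth across variant)
- FORMAT- Optional 9th column, present when genotype data is included- describes the genotype fields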
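Genotype numbers index the alleles: 0 = REF, 1 = first ALT, 2 = second ALT
if ‘/’ separated == unphased (don’t know which chromosome the variant is on)
if ‘|’ separated == phased
0 | 0 - Homozygous reference
0 | 1 - Heterozygous
1 | 1 - Homozygous variant
1 | 2 - Heterozygous for two different ALT alleles (ALT1 and ALT2)
1 (single allele, haploid call) - Hemizygous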
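Example (a minimal sketch): parsing one data line and interpreting the GT value. It assumes the eight fixed columns of the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) followed by optional FORMAT and sample columns. The variant shown is made up and the helper names are hypothetical.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma separated
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # flags carry no value
    record["INFO"] = info
    if len(fields) > 9:
        # FORMAT names the colon-separated keys of every sample column
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record

def describe_genotype(gt):
    """Return (description, phased) for a GT string such as '0|1' or '1/1'."""
    phased = "|" in gt
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        call = "missing"
    elif len(alleles) == 1:
        call = "hemizygous or haploid"
    elif alleles[0] == alleles[1]:
        call = "homozygous reference" if alleles[0] == "0" else "homozygous variant"
    elif "0" in alleles:
        call = "heterozygous"
    else:
        call = "heterozygous, two different ALT alleles"
    return call, phased

# Made-up variant with two ALT alleles and one sample column
line = "chr1\t12345\trs123\tA\tG,T\t50\tPASS\tDP=100;AF=0.5,0.1;DB\tGT:DP\t1|2:100"
record = parse_vcf_line(line)
print(record)
print(describe_genotype(record["samples"][0]["GT"]))
```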
What information is stored in a BED file?
Browser Extensible Data. Tab-delimited file that defines a feature track (chromosomal regions). Used for defining panel regions and regions of interest (ROI).
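Example (a minimal sketch): reading a BED region and checking whether a 1-based position (such as a VCF POS) falls inside it. BED starts are 0-based and ends are exclusive, so the comparison shifts by one. The region and its name are made up, and `parse_bed_line` and `contains` are hypothetical helper names.

```python
def parse_bed_line(line):
    """Parse the three required BED columns plus the optional name column."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name

def region_length(region):
    """Length in bases; with a 0-based start and exclusive end this is end - start."""
    _, start, end, _ = region
    return end - start

def contains(region, chrom, pos_1based):
    """True if a 1-based position lies inside the 0-based, half-open region."""
    r_chrom, start, end, _ = region
    return r_chrom == chrom and start < pos_1based <= end

roi = parse_bed_line("chr1\t100\t200\texon1")  # made-up region of interest
print(roi, region_length(roi))
print(contains(roi, "chr1", 100), contains(roi, "chr1", 101))
```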