NGS File Formats Flashcards

1
Q

What is FASTQ?

A

FASTQ is a text-based format for storing both a DNA sequence and its per-base quality scores.

There is one FASTQ file per read direction, so in paired-end Illumina sequencing you get two FASTQ files per sample (R1 and R2, forward and reverse).

2
Q

In what file format is sequencing data stored after demultiplexing?

A

FASTQ

3
Q

In a FASTQ file what does the quality value Q indicate?

A

Q is an integer mapping of p, the probability that the corresponding base call is incorrect: Q = −10 log10(p). For example, Q20 means a 1 in 100 chance of error and Q30 a 1 in 1,000 chance.
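
The relationship can be sketched in a few lines of Python. This is a minimal illustration assuming the standard Phred definition, where p is the probability that the base call is wrong; the function names are made up for this example.

```python
import math


def phred_to_error_prob(q: int) -> float:
    """Return the probability that a base call is wrong for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> int:
    """Return the Phred score for error probability p, rounded to an integer."""
    return round(-10 * math.log10(p))


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(error_prob_to_phred(0.0001))
```

Higher Q means a lower chance of error, so each step of 10 in Q is a tenfold drop in the error probability.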

4
Q

What number is added to the Q score to turn it into a printable ASCII character, as stored in FASTQ and SAM files?

A

33
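
As a rough sketch of the offset in practice, the Python below converts between integer Q scores and the characters stored in a quality string. It assumes the Phred+33 encoding; the helper names are hypothetical.

```python
PHRED_OFFSET = 33  # Phred+33 encoding, assumed throughout this sketch


def encode_quals(scores):
    """Turn a list of integer Q scores into an ASCII quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)


def decode_quals(qual_string):
    """Turn an ASCII quality string back into integer Q scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual_string]


if __name__ == "__main__":
    print(encode_quals([40, 40, 20, 0]))  # prints II5!
    print(decode_quals("II5!"))           # prints [40, 40, 20, 0]
```

Adding 33 moves Q = 0 onto "!", the first printable non-space ASCII character, so every score is stored as a single visible character.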

5
Q

What is a SAM file?

A

A SAM (Sequence Alignment/Map) file is a generic, tab-delimited text format for storing large nucleotide sequence alignments.

6
Q

Why is SAM used to store alignments?

A
  • It is flexible enough to store alignments generated by various alignment tools
  • It is simple enough to be easily generated by alignment programs or converted from existing alignment formats
  • It is compact in file size
  • It allows most operations on the alignment to be streamed without loading the whole file, thereby saving memory
  • It allows the file to be indexed by genomic position, so all reads aligning to a locus can be retrieved efficiently
7
Q

What is a CIGAR string?

A

A CIGAR string is found in a SAM file and is a sequence of lengths (in bases), each followed by an operation code. It indicates which bases of the read align to the reference (as a match or mismatch, M), which bases are deleted from the reference (D) and which are insertions not present in the reference (I).
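
A small Python sketch of how a CIGAR string breaks down into (length, operation) pairs. The function names are hypothetical, and the operation letters are the ones defined in the SAM specification.

```python
import re

# One CIGAR element: a run length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_WHOLE = re.compile(r"(\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":  # "*" means the CIGAR is not available
        return []
    if not CIGAR_WHOLE.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def reference_span(ops):
    """Count reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in ops if op in "MDN=X")


if __name__ == "__main__":
    ops = parse_cigar("8M2I4M1D3M")
    print(ops)
    print(reference_span(ops))  # prints 16
```

In `8M2I4M1D3M`, the two inserted bases are part of the read but not the reference, while the one deleted base is part of the reference but not the read.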

8
Q

What is a BAM file?

A

A BAM file is the compressed binary version of a SAM file. BAM files use BGZF, a block-based variant of gzip compression that can be applied to any file format to provide compression with efficient random access.

9
Q

What is a VCF file?

A

A VCF (Variant Call Format) file is a text file format (most likely stored in a compressed manner). It contains meta-information lines (starting with ##), a header line (starting with #CHROM) and then data lines, each containing information about a position in the genome.
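
The three kinds of line can be told apart by their prefix, as in this Python sketch. The example records are invented for illustration, and `parse_vcf_text` is a hypothetical helper, not part of any VCF library.

```python
# Invented example content; real VCF files are usually much larger.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FILTER=<ID=LowQual,Description="Low quality">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "12345", ".", "A", "G", "50", "PASS", "DP=31"]),
])


def parse_vcf_text(text):
    """Split VCF text into meta-information lines, header columns and records."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):      # meta-information line
            meta.append(line)
        elif line.startswith("#"):     # single header line naming the columns
            header = line[1:].split("\t")
        elif line:                     # data line: one position in the genome
            if header is None:
                raise ValueError("data line found before the header line")
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


if __name__ == "__main__":
    meta, header, records = parse_vcf_text(EXAMPLE_VCF)
    print(len(meta), header)
    print(records[0]["CHROM"], records[0]["POS"], records[0]["FILTER"])
```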

10
Q

What is an INFO field in a VCF?

A

An INFO line in the header (##INFO=...) is a ‘key’ that defines an abbreviation used in the INFO column of the variant calls, e.g. DP for read depth.
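
As an illustration, the INFO column of a data line is a semicolon-separated list of these abbreviations, either as key=value pairs or as bare flags. The Python below is a sketch with a hypothetical helper name; values are left as strings because their types are declared in the header.

```python
def parse_info(info_column):
    """Turn an INFO column such as 'DP=31;AF=0.5;DB' into a dictionary."""
    if info_column == ".":  # "." means no INFO entries are present
        return {}
    info = {}
    for entry in info_column.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        else:
            info[entry] = True  # a flag has no value; its presence is the data
    return info


if __name__ == "__main__":
    print(parse_info("DP=31;AF=0.5;DB"))
```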

11
Q

What is a FILTER field in a VCF?

A

A FILTER line in the header describes a flag that may be added to calls to indicate they should be filtered out, e.g. LowQual. Calls that pass all filters are marked PASS in the FILTER column.

12
Q

What is a FASTA file and what is stored in it?

A

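A FASTA file is a text-based format representing nucleotide (or protein) sequences. Each record is a header line starting with > followed by one or more lines of sequence.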

Typically the reference genome used for alignment is stored in a FASTA format as most aligners need it in this format.
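
A minimal Python sketch of reading FASTA text, assuming the usual layout of a `>` header line followed by sequence lines. The function name and the example sequences are made up for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its full sequence."""
    parts = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is a description.
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)  # sequences may wrap over several lines
    return {n: "".join(chunks) for n, chunks in parts.items()}


if __name__ == "__main__":
    example = ">chr1 example contig\nGATTACA\nGATTACA\n>chr2\nACGT\n"
    print(parse_fasta(example))
```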

13
Q

What are BCL files?

A

BCL files hold the base calls per cycle for Illumina sequencing. They contain the base call and its quality for each cycle.

Illumina's bcl2fastq software converts BCL to FASTQ.

Different sequencers produce different file types.

14
Q

How are the VCF quality scores calculated?

A

Using the Phred scale: QUAL = −10 log10 Pr(the call in ALT is wrong), so a higher QUAL means a more confident call.

15
Q

What do the four lines of a FASTQ detail?

A

Line 1 - Identifier line (starts with @)
Line 2 - Sequence
Line 3 - + (quality score identifier line)
Line 4 - Quality values in ASCII (+33 for Illumina), one character per base
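
The four-line layout can be read with a short Python sketch like the one below. It assumes unwrapped four-line records and Phred+33 qualities; the function name and the example read are invented for illustration.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality scores) for each 4-line record."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Subtract 33 to recover the integer Q score of each base.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    example = "@read1\nGATTACA\n+\nIIIII55\n"
    for name, seq, quals in parse_fastq(example):
        print(name, seq, quals)
```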

16
Q

Where do you find the Mapping Quality Score and the QUAL score in a SAM file?

A

Column 5 - MAPQ (mapping quality): −10 log10 Pr(mapping position is wrong), rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available.

Column 11 - QUAL: base quality scores encoded as in FASTQ (ASCII of Phred-scaled base quality + 33). * is stored when the qualities are not available.
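
To make the column positions concrete, here is a Python sketch that splits one SAM alignment line into its 11 mandatory fields. The example line and the helper name are invented; the column names follow the SAM specification.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one alignment line into named fields (columns 1 to 11)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[key] = int(rec[key])
    if rec["MAPQ"] == 255:          # 255 means mapping quality not available
        rec["MAPQ"] = None
    # "*" means base qualities are not stored; otherwise decode Phred+33.
    rec["BASE_QUALS"] = (None if rec["QUAL"] == "*"
                         else [ord(ch) - 33 for ch in rec["QUAL"]])
    rec["TAGS"] = cols[11:]         # optional fields, left unparsed here
    return rec


if __name__ == "__main__":
    example = "\t".join(["read1", "99", "chr1", "100", "60", "7M",
                         "=", "300", "207", "GATTACA", "IIIII55"])
    rec = parse_sam_line(example)
    print(rec["MAPQ"], rec["CIGAR"], rec["BASE_QUALS"])
```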

17
Q

In which column of a SAM file is the CIGAR string found?

A

Column 6

18
Q

What is the FLAG in a SAM file and in which column is it found?

A

Column 2 - bitwise FLAG. It gives information on the mapping of the read; for example, a flag can indicate that the read does not pass quality filters (0x200) or that it is a PCR or optical duplicate (0x400). Each flag is a single bit with its own value, and the values of all flags set for a read are added together to give the overall FLAG value.
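
A short Python sketch of decoding a FLAG value back into its individual bits. The bit values are those listed in the SAM specification; the short labels and the function name are chosen for this example only.

```python
# Bit values from the SAM specification; labels are shortened for this sketch.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "last_in_pair",
    0x100: "secondary",
    0x200: "fails_filters",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    print(decode_flag(99))    # 99 = 1 + 2 + 32 + 64
    print(decode_flag(1024))  # prints ['duplicate']
```

Because each property occupies its own bit, any combination of properties adds up to a unique FLAG value that can be tested with a bitwise AND.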