NGS file formats Flashcards

1
Q

What information is contained in the BCL format?

A

Binary Base Call. Contains the base calls and quality scores for each tile in each cycle.

2
Q

What is bcl2fastq?

A

Illumina proprietary software that converts BCL files to FASTQ files and demultiplexes samples (using a sample sheet provided by the user).

3
Q

What is the FASTQ format used for?

A

Text-based format for storing biological sequences and their Phred quality scores together in a single file.

4
Q

What is the Illumina file naming convention?

A

SampleName_S1_L001_R1_001.fastq.gz

Contains:

  1. Sample name
  2. Sample number (order on sample sheet)
  3. Lane number
  4. Read number: R1 or R2 (forward or reverse read of a paired-end run)
  5. Last segment always 001
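
As a rough sketch, the five parts of the file name above can be pulled apart with a regular expression. The pattern and the names `FASTQ_NAME` and `parse_fastq_name` are illustrative assumptions, not part of any Illumina tool, and the pattern only covers R1/R2 files.

```python
import re

# Hypothetical pattern for names like SampleName_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>R[12])_001\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Split an Illumina-style FASTQ file name into its parts."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    fields = match.groupdict()
    # Sample number and lane are numeric; convert them for convenience.
    fields["sample_number"] = int(fields["sample_number"])
    fields["lane"] = int(fields["lane"])
    return fields

info = parse_fastq_name("SampleName_S1_L001_R1_001.fastq.gz")
print(info)
```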
5
Q

What is contained in the 4 lines of the FASTQ format?

A
  1. Sequence identifier (starts with @)- instrument, run number, flowcell ID, lane, tile, X-pos, Y-pos ….
  2. Sequence
  3. Separator line (starts with +, usually just +)
  4. Quality scores- one ASCII character per base, encoding the Phred score (derived from the probability of the base being called incorrectly)
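
A minimal sketch of reading one 4-line record and decoding its quality string. It assumes the Phred+33 ASCII offset; the record contents and the function name `parse_fastq_record` are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record and decode its quality string."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so Q = ASCII code - 33
    phred = [ord(char) - 33 for char in quality]
    # Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)
    error_prob = [10 ** (-q / 10) for q in phred]
    return {
        "id": header[1:],
        "sequence": sequence,
        "phred": phred,
        "error_prob": error_prob,
    }

# Made-up record for illustration
record = ["@read1\n", "GATTACA\n", "+\n", "IIIII#5\n"]
parsed = parse_fastq_record(record)
print(parsed["phred"])
```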
6
Q

What is the SAM/BAM format used for?

A

Stores read alignments against a reference. SAM is a TAB-delimited text format consisting of an optional header section and an alignment section with 11 mandatory fields for essential alignment information.

7
Q

What is included in the header of SAM format?

A
  1. Header lines- each starts with @ followed by a two-letter record type, then TAG:VALUE pairs, e.g.
    a. @HD- header line, e.g. VN (format version)
    b. @SQ- reference sequence line, e.g. SN (reference name)
    c. @RG- read group line, e.g. ID (read group identifier)
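
A small sketch of splitting header lines like these into a record type and its tags. The header values and the function name `parse_sam_header_line` are illustrative; @CO comment lines do not use TAG:VALUE pairs and are not handled here.

```python
def parse_sam_header_line(line):
    """Return (record_type, tags) for one tab-delimited SAM header line."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    if not record_type.startswith("@"):
        raise ValueError("header lines must start with @")
    # Each remaining field is TAG:VALUE; split on the first colon only.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags

# Illustrative header lines
header_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:rg1\tSM:sample1",
]
parsed_header = [parse_sam_header_line(line) for line in header_lines]
for record_type, tags in parsed_header:
    print(record_type, tags)
```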
8
Q

What are the 11 mandatory fields of the alignment section?

A
  1. QNAME- read name
  2. FLAG- Bitwise flag contains info about the alignment
  3. RNAME- Reference sequence name of the alignment
  4. POS- 1-based leftmost mapping position of the read on the reference
  5. MAPQ- Mapping Q score of the read alignment
  6. CIGAR- Describes the exact alignment to the reference e.g. 100M
  7. RNEXT- Reference sequence name of mate in pair
  8. PNEXT- Position of mate in pair
  9. TLEN- Observed template length (the span covered by the mapped read pair on the reference)
  10. SEQ- the read sequence
  11. QUAL- ASCII-encoded Phred base quality (Phred+33) for each base in the read
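
The 11 fields can be sketched as a tab split plus a bitwise decode of FLAG. The alignment line is made up, and `parse_alignment`, `decode_flag` and the bit labels in `FLAG_BITS` are illustrative names rather than any library's API.

```python
from collections import namedtuple

Alignment = namedtuple(
    "Alignment",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

# FLAG bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

def parse_alignment(line):
    """Parse the 11 mandatory fields; optional tags after them are ignored."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = cols[:11]
    return Alignment(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)

def decode_flag(flag):
    """List the names of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# Made-up alignment line for illustration
aln = parse_alignment("read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII")
print(aln.rname, aln.pos, decode_flag(aln.flag))
```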
9
Q

What is the BAM format?

A

Binary representation of a SAM file. BGZF compressed.

10
Q

Why would you index a BAM?

A

Indexing allows fast retrieval of alignments overlapping a given region without reading the whole file. The BAM must be coordinate-sorted before it is indexed.

11
Q

What is the VCF format?

A

Variant Call Format. A text file format used in bioinformatics for storing sequence variants relative to a reference.

12
Q

How many sections are in a VCF?

A

3

  1. Meta-information lines (start with ##)
  2. Header line (starts with #)
  3. Data lines (one per variant)
13
Q

What is included within the Meta info line (VCF)?

A

Key=value lines (prefixed with ##) describing the information contained in the data lines. For INFO and FORMAT entries, ID, Number, Type and Description are required. INFO (describes the content of the INFO field), FILTER (filters applied by the variant caller) and FORMAT (genotype info) lines are recommended.
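
A hedged sketch of reading one structured meta-information line into its key and fields. The example line and the names `parse_meta_line`, `META_LINE` and `PAIR` are illustrative; the regular expression only handles simple quoted values without escaped quotes.

```python
import re

# Matches lines of the form ##KEY=<field=value,...>
META_LINE = re.compile(r"^##(?P<key>\w+)=<(?P<body>.*)>$")
# A value is either a quoted string (may contain commas) or runs to the next comma.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]+)')

def parse_meta_line(line):
    """Return (key, fields) for a structured VCF meta-information line."""
    match = META_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    fields = {name: value.strip('"') for name, value in PAIR.findall(match.group("body"))}
    return match.group("key"), fields

# Made-up example line
example = '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth, all samples">'
key, fields = parse_meta_line(example)
print(key, fields)
```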

14
Q

What are the 8 mandatory columns of the header line (VCF)?

A
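  1. CHROM- Chromosome
  2. POS- Position in reference (1-based)
  3. ID- Variant identifier, e.g. dbSNP rsID
  4. REF- Reference allele
  5. ALT- Alternate allele(s) identified- can be multiple, comma separated
  6. QUAL- Phred-scaled quality score for the assertion made in ALT
  7. FILTER- PASS, or the filters the call failed
  8. INFO- additional info about the variant e.g. AF, DP (depth across variant)
  FORMAT- optional 9th column, present when genotype data is included; describes the per-sample genotype fields, e.g. GT:
    if ‘/’ separated == unphased (don’t know which chromosome variant is on)
    if ‘|’ separated == phased
    0 | 0 - Homozygous reference
    0 | 1 - Heterozygous
    1 | 1 - Homozygous variant
    1 | 2 - Heterozygous with two different ALT alleles (ALT1 and ALT2)
    1 (single allele) - Hemizygous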
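
The genotype rules above can be sketched as a small classifier for GT strings written without spaces (e.g. 0/1 or 1|2). The function name `describe_genotype` and the wording of the returned labels are illustrative assumptions; a single allele is treated as a hemizygous/haploid call and "." as missing.

```python
def describe_genotype(gt):
    """Return (phased, description) for a GT string such as '0/1' or '1|2'."""
    phased = "|" in gt
    # Allele indices: 0 = REF, 1 = first ALT, 2 = second ALT, '.' = missing
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        description = "missing"
    elif len(alleles) == 1:
        description = "hemizygous"
    elif len(set(alleles)) == 1:
        if alleles[0] == "0":
            description = "homozygous reference"
        else:
            description = "homozygous variant"
    elif "0" in alleles:
        description = "heterozygous"
    else:
        description = "heterozygous, two different ALT alleles"
    return phased, description

for gt in ["0/0", "0|1", "1/1", "1|2", "1", "./."]:
    print(gt, describe_genotype(gt))
```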
15
Q

What information is stored in a BED file?

A

Browser Extensible Data. Tab-delimited file that defines a feature track (chromosomal regions). Used for defining panel regions and regions of interest (ROI).

16
Q

Name 3 required and 2 optional fields of the BED format?

A

Required= Chrom, ChromStart, ChromEnd

Optional= name (label for the line), strand
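
A minimal sketch of reading one BED line. It relies on BED coordinates being 0-based with an exclusive end, so the region length is chromEnd minus chromStart; in the full column order, name is column 4 and strand is column 6. The example line and the function name `parse_bed_line` are made up.

```python
def parse_bed_line(line):
    """Parse the required BED fields plus optional name and strand."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    region = {"chrom": cols[0], "start": int(cols[1]), "end": int(cols[2])}
    if len(cols) > 3:
        region["name"] = cols[3]
    if len(cols) > 5:
        region["strand"] = cols[5]  # column 5 (score) is skipped here
    # 0-based start, exclusive end: length is a plain difference
    region["length"] = region["end"] - region["start"]
    return region

# Made-up region for illustration
roi = parse_bed_line("chr1\t999\t2000\tregion1\t0\t+")
print(roi)
```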

17
Q

What is the definition of a template?

A

A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences

18
Q

What is the definition of a read?

A

A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.

19
Q

What is the definition of insert?

A

DNA between two adapter sequences

20
Q

What is the definition of a fragment?

A

Insert + adapters

21
Q

What is read through?

A

A read that has extended through the insert and into the adapter sequence (occurs when the insert is shorter than the read length)

22
Q

What is a chimeric alignment?

A

When one sequencing read aligns to two distinct portions of the genome with no overlap

Chimeric reads can be indicative of structural variation.

23
Q

What is a read group?

A

Set of reads generated from a single run of a sequencing instrument.

When multiplexing is involved, each subset of reads originating from a separate library run on that lane constitutes a separate read group.