File formats Flashcards

Question 1

Q

BWA

Answer

A

Aligner program that takes as input:
- a FASTQ file (the reads coming from the patient sample)
- a FASTA file (representing the reference genome)
It outputs a SAM file containing the alignment information for each read (where it is mapped).
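
As a rough sketch of how this looks in practice, the snippet below builds the two BWA command lines (`bwa index` on the reference, then `bwa mem` on the reads) and redirects the aligner's standard output into a SAM file. It assumes `bwa` is installed and on the PATH; the file names and the helper functions `bwa_commands` and `run_alignment` are hypothetical, and `bwa mem` is only one of the algorithms BWA offers.

```python
import subprocess
from typing import List


def bwa_commands(reference_fasta: str, reads_fastq: str) -> List[List[str]]:
    """Build the index and alignment command lines without running them."""
    return [
        ["bwa", "index", reference_fasta],            # index the reference once
        ["bwa", "mem", reference_fasta, reads_fastq],  # align reads, SAM on stdout
    ]


def run_alignment(reference_fasta: str, reads_fastq: str, output_sam: str) -> None:
    """Run both steps; requires the bwa executable to be available."""
    index_cmd, mem_cmd = bwa_commands(reference_fasta, reads_fastq)
    subprocess.run(index_cmd, check=True)
    with open(output_sam, "w") as sam:
        # bwa mem writes the alignments to standard output.
        subprocess.run(mem_cmd, check=True, stdout=sam)


# Hypothetical file names, for illustration only.
commands = bwa_commands("reference.fa", "sample.fq")
```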

Question 2

Q

SAM

Answer

A

!!!IMPORTANT!!!
+------------------+
| 1-based indexing |
+------------------+

A SAM file contains the alignment information for each read (where it is mapped).
It can be compressed into a BAM file, its binary equivalent (this can reduce the size down to about 1/10).
It is produced by aligners.
It may contain header lines, which start with '@'.
There is one line per alignment, with tab-delimited columns.

Important columns:

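1st: name of the read (also present in the .fa/.fq of the sample)
2nd: FLAG, a bitwise flag that encodes properties of the alignment in compact form
3rd: name of the reference sequence the read maps to (e.g. the chromosome)
4th: leftmost mapping position (1-based)
5th: mapping quality (MAPQ) of the alignment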
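
The columns above can be read with a few lines of Python. This is a minimal sketch, not a full SAM parser: the `SamRecord` type, the `parse_sam_line` helper and the example alignment line are all made up for illustration, and only the first five columns are kept.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str  # column 1: read name
    flag: int   # column 2: bitwise flag
    rname: str  # column 3: reference sequence (chromosome) name
    pos: int    # column 4: 1-based leftmost mapping position
    mapq: int   # column 5: mapping quality


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Return the first five columns of an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]))


# Hypothetical alignment line, made up for illustration.
example = "read_id_0\t0\t1\t11870\t60\t6M\t*\t0\t0\tGGTATG\tIIIIII"
record = parse_sam_line(example)

# SAM positions are 1-based; Python string indexing is 0-based.
zero_based_start = record.pos - 1
```

Subtracting 1 from the position is the step that is easy to forget when slicing the reference sequence in a 0-based language.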

Question 3

Q

FASTA (.fa)

Answer

A

A FASTA file is used to represent reads from the sample (e.g. patient).

Question 4

Q

FASTA (.fa)

Answer

A

A FASTA file is used to represent reads from the sample (e.g. patient).
Each record has a header line starting with '>' followed by the sequence.
Example:
>read_id_0
GGTATGCTTCTGGGGCGGCAGTCGATAGGGCTAGACTCAGGTCCCGTGGC
>read_id_1
CACTGTGGCCCTCTTGGGGGGTGTCCACACGCCGCCCGTCGGCCCCCTCC
>read_id_2
GTTCTGTGGGTACCTCGCGGTTATGGTGTCGGGGGTATCCAAGGCACCCC
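
A minimal sketch of reading such a file, assuming the usual layout in which the '>' header sits on its own line and the sequence follows on one or more lines. The `parse_fasta` helper is hypothetical; real projects often rely on an existing library instead.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID to its sequence, joining sequences wrapped over several lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after '>'; anything after it is a description.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


fasta_text = """>read_id_0
GGTATGCTTCTGGGGCGGCAGTCGATAGGGCTAGACTCAGGTCCCGTGGC
>read_id_1
CACTGTGGCCCTCTTGGGGGGTGTCCACACGCCGCCCGTCGGCCCCCTCC
"""
reads = parse_fasta(fasta_text.splitlines())
```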

Question 5

Q

FASTQ (.fq)

Answer

A

FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores.
@ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
TTTTTCTAGACGGCAGGTCAGGTCCACCACTGACACGTTGGCAGTGGGGACACGGAAGGCCATGCCAGTGAGCTTCCCGTTCAGCTCAGGGATGACCTTGC
+ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
BBBFFFBFFBFFFFF0BFF0BFFBFBFFBFB77BBBBBBBFFBBBBBFIIB0BB
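
A FASTQ record spans four lines: the '@' header, the sequence, a '+' separator, and one quality character per base. The sketch below reads such records and decodes the quality characters, assuming the common Phred+33 encoding (score = ASCII code minus 33); other encodings exist. The `parse_fastq` helper and the toy record are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("a FASTQ file must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        if len(quality) != len(sequence):
            raise ValueError(f"quality/sequence length mismatch in {header}")
        read_id = header[1:].split()[0]
        # Assumption: Phred+33 encoding.
        scores = [ord(ch) - 33 for ch in quality]
        yield read_id, sequence, scores


# Toy record, made up for illustration.
fastq_text = "@toy_read_1 made-up example\nACGT\n+\nII5!\n"
records = list(parse_fastq(fastq_text))
```

The length check reflects the rule that every base must have exactly one quality character.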

Question 6

Q

GTF files

Answer

A

Annotation file that accompanies the .fa file of the reference genome.

Each line can contain information about:

genes
transcripts
exons
CDS (coding sequence): the portion that is translated directly into a protein.

Important columns:
1st: chromosome number
3rd: feature
4th: start position of the feature (1-based, inclusive)
5th: end position of the feature (1-based, inclusive)
The line also contains attributes such as gene_id, gene_name and gene_biotype.

Example:
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
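
A line like the gene entry above can be split into its columns and attributes as follows. This is only a sketch: it assumes tab-delimited columns and straight double quotes around attribute values, and the `GtfFeature` type and `parse_gtf_line` helper are hypothetical names.

```python
import re
from typing import Dict, NamedTuple


class GtfFeature(NamedTuple):
    seqname: str                # column 1: chromosome
    feature: str                # column 3: gene, transcript, exon, CDS, ...
    start: int                  # column 4: 1-based, inclusive
    end: int                    # column 5: 1-based, inclusive
    attributes: Dict[str, str]  # column 9: key "value"; pairs


_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line: str) -> GtfFeature:
    fields = line.rstrip("\n").split("\t")
    # If a key is repeated (e.g. several "tag" entries), only the last value is kept.
    attributes = dict(_ATTRIBUTE.findall(fields[8]))
    return GtfFeature(fields[0], fields[2], int(fields[3]), int(fields[4]), attributes)


gene_line = "\t".join([
    "1", "havana", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; '
    'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";',
])
gene = parse_gtf_line(gene_line)

# Both ends are inclusive, so the length needs a +1.
gene_length = gene.end - gene.start + 1
```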

Question 7

Q

BWA aligner scheme

Answer

A

.fq/.fa ------>  +-----+
(reads)          | BWA | ---> SAM ---> VCF
ref. --------->  +-----+     (BAM)
(.fa)

Question 8

Q

Representation of the reference genome

Answer

A

The sequence is stored in a .fa file; its annotation (genes, transcripts, exons, CDS) is stored in a .GTF file.

Question 9

Q

Gene, transcripts, coding sequences

Answer

A

Given a reference genome and a gene inside it:

The gene can contain multiple exons: each exon can contain one or more coding sequences.
The gene can contain multiple transcripts: each transcript contains one or more exons.
A coding sequence is a portion of an exon.

Example:
Gene1:
- transcript1:
  - Exon1 -> CDS1
  - Exon2 -> CDS2
- transcript2
  - Exon1 -> CDS1
  - Exon3 -> CDS3, CDS4
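
The hierarchy in the example can be written as a nested dictionary, which makes it easy to see that transcripts of the same gene may share exons. The structure and the two helper functions below are an illustrative sketch, not a standard representation.

```python
from typing import Dict, List

# transcript -> exon -> coding sequences, copied from the example above.
Gene = Dict[str, Dict[str, List[str]]]

gene1: Gene = {
    "transcript1": {"Exon1": ["CDS1"], "Exon2": ["CDS2"]},
    "transcript2": {"Exon1": ["CDS1"], "Exon3": ["CDS3", "CDS4"]},
}


def exons_of_gene(gene: Gene) -> List[str]:
    """All distinct exons used by any transcript of the gene."""
    return sorted({exon for exons in gene.values() for exon in exons})


def transcripts_sharing(gene: Gene, exon: str) -> List[str]:
    """Transcripts that include the given exon."""
    return sorted(name for name, exons in gene.items() if exon in exons)


all_exons = exons_of_gene(gene1)
shared = transcripts_sharing(gene1, "Exon1")
```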

Question 10

Q

VCF file

Answer

A

+---------+
| 1-BASED |
+---------+
Closely related to the SAM file. Contains all the variations found in the samples with respect to the reference.

Important columns:

1st: chromosome #
2nd: position of the specific alteration
4th: bases in the reference
5th: bases in the sample

Together, the 4th and 5th columns describe the difference between the reference and the sample.
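
A minimal sketch of reading these columns and classifying the difference by comparing the lengths of the reference and sample alleles. The `Variant` type, the helpers and the example line are made up for illustration; symbolic alleles such as `<DEL>` or `*` are not handled.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str       # column 1: chromosome
    pos: int         # column 2: 1-based position
    ref: str         # column 4: bases in the reference
    alts: List[str]  # column 5: bases in the sample, comma-separated if several


def parse_vcf_line(line: str) -> Optional[Variant]:
    """Return the variant on a data line, or None for a '#' header line."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    return Variant(fields[0], int(fields[1]), fields[3], fields[4].split(","))


def classify(ref: str, alt: str) -> str:
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "substitution"


# Hypothetical data line, made up for illustration.
variant = parse_vcf_line("1\t11870\t.\tA\tG,AT\t50\tPASS\t.")
kinds = [classify(variant.ref, alt) for alt in variant.alts]
```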

(10 cards)