File formats Flashcards
BWA
Aligner program that takes as input:
- a FASTQ file (coming from the patient, sample)
- a FASTA file (representing the reference genome; a GTF file only annotates the reference and is not an input to BWA)
Outputs a SAM file, containing the alignment information for each read (where it is mapped).
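As a rough sketch of the step above, BWA can be driven from Python by building its command lines. The `bwa index` and `bwa mem` subcommands belong to BWA itself. The helper names `bwa_commands` and `run_bwa` and the file names are made up for illustration, and `run_bwa` assumes the `bwa` executable is installed and on the PATH.

```python
import subprocess
from typing import List


def bwa_commands(reference_fa: str, reads_fq: str) -> List[List[str]]:
    """Build the two BWA invocations: index the reference, then align the reads."""
    return [
        ["bwa", "index", reference_fa],
        ["bwa", "mem", reference_fa, reads_fq],
    ]


def run_bwa(reference_fa: str, reads_fq: str, out_sam: str) -> None:
    """Run both steps; requires the bwa executable on PATH."""
    index_cmd, mem_cmd = bwa_commands(reference_fa, reads_fq)
    subprocess.run(index_cmd, check=True)
    with open(out_sam, "w") as sam:
        # bwa mem writes the SAM output to standard output, so redirect it to a file.
        subprocess.run(mem_cmd, stdout=sam, check=True)


# Hypothetical file names, for illustration only.
commands = bwa_commands("reference.fa", "sample.fq")
```

Only `bwa_commands` is evaluated here; calling `run_bwa("reference.fa", "sample.fq", "sample.sam")` would launch the aligner.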
SAM
!!!IMPORTANT!!! Positions use 1-based indexing.
A SAM file contains the alignment information for each read (where it is mapped).
A SAM file can be stored in compressed binary form as a BAM file (this can reduce the size to roughly 1/10).
It is produced by aligners.
May contain header lines that start with ‘@’.
One line for each alignment, tab delimited columns.
Important columns:
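1st: name of the read (also present in the .fa/.fq file of the sample)
2nd: FLAG, an integer that encodes alignment properties as bits (e.g. read unmapped, read on the reverse strand).
3rd: name of the reference sequence the read is mapped to (e.g. the chromosome number).
4th: position (1-based, leftmost mapped base on the reference)
5th: mapping quality score of the alignment (MAPQ)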
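A minimal sketch of reading the five columns above from one alignment line, assuming tab-delimited fields. The names `SamRecord` and `parse_sam_line` and the sample line are made up for illustration. The flag bits 0x4 (unmapped) and 0x10 (reverse strand) come from the SAM specification.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    read_name: str  # column 1
    flag: int       # column 2, bitwise flag
    ref_name: str   # column 3, e.g. chromosome
    pos: int        # column 4, 1-based
    mapq: int       # column 5


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Return the first five columns of an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]))


# Hypothetical alignment line, for illustration only.
example = "read_id_0\t16\t1\t11870\t60\t50M\t*\t0\t0\t*\t*"
record = parse_sam_line(example)

is_unmapped = bool(record.flag & 0x4)   # bit 0x4: read is unmapped
is_reverse = bool(record.flag & 0x10)   # bit 0x10: read maps to the reverse strand
zero_based_start = record.pos - 1       # convert the 1-based POS to a Python index
```

Subtracting 1 from the position is the usual way to move from SAM's 1-based coordinates to Python's 0-based string indexing.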
FASTA (.fa)
A FASTA file is used to represent reads from the sample (e.g. patient). Each record has a header line starting with '>' followed by the sequence.
Example:
>read_id_0
GGTATGCTTCTGGGGCGGCAGTCGATAGGGCTAGACTCAGGTCCCGTGGC
>read_id_1
CACTGTGGCCCTCTTGGGGGGTGTCCACACGCCGCCCGTCGGCCCCCTCC
>read_id_2
GTTCTGTGGGTACCTCGCGGTTATGGTGTCGGGGGTATCCAAGGCACCCC
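A minimal sketch of reading records like these, assuming the usual FASTA layout: a header line starting with '>' followed by one or more sequence lines. The function name `read_fasta` is made up for illustration.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record id (header text up to the first whitespace) to its sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            current_id = header.split()[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # A sequence may be wrapped over several lines; join them.
            records[current_id] += line
    return records


example = [
    ">read_id_0",
    "GGTATGCTTCTGGGGCGGCAGTCGATAGGGCTAGACTCAGGTCCCGTGGC",
    ">read_id_1",
    "CACTGTGGCCCTCTTGGGGGGTGTCCACACGCCGCCCGTCGGCCCCCTCC",
]
reads = read_fasta(example)
```

The same function works on an open file, e.g. `read_fasta(open(path))`, because a file object iterates over its lines.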
FASTQ (.fq)
FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores.
@ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
TTTTTCTAGACGGCAGGTCAGGTCCACCACTGACACGTTGGCAGTGGGGACACGGAAGGCCATGCCAGTGAGCTTCCCGTTCAGCTCAGGGATGACCTTGC
+ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
BBBFFFBFFBFFFFF0BFF0BFFBFBFFBFB77BBBBBBBFFBBBBBFIIB0BB
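A minimal sketch of reading FASTQ records, assuming the common four-line layout ('@' header, sequence, '+' separator, quality string with one character per base) and assuming Phred+33 quality encoding, which the notes do not state. The function name `read_fastq` and the short record are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"malformed FASTQ record: {header}")
        # Assumption: Phred+33, i.e. score = ASCII code - 33.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:].split()[0], seq, scores


# Hypothetical record, for illustration only.
example = ["@read_1 length=4", "ACGT", "+", "II5!"]
records = list(read_fastq(example))
```

The parser rejects a record whose quality string does not have the same length as its sequence.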
GTF files
Annotation file for the reference genome; its coordinates refer to the corresponding reference .fa file.
Each line can contain information about:
- genes
- transcripts
- exons
- CDS (coding sequence): the part that is directly translated into a protein.
Important columns:
1st: chromosome number
3rd: feature
4th: start position of the feature
5th: end position of the feature
The line also contains gene_id, gene_name and gene_biotype information.
Example:
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
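A minimal sketch of parsing one GTF line, assuming the nine columns are tab-separated in the actual file (they appear space-separated in the example above) and that attribute values are enclosed in straight double quotes. The names `GtfFeature` and `parse_gtf_line` are made up for illustration, and the sample line is a shortened copy of the gene line above.

```python
import re
from typing import Dict, NamedTuple


class GtfFeature(NamedTuple):
    chrom: str                   # column 1
    feature: str                 # column 3: gene, transcript, exon, CDS, ...
    start: int                   # column 4, 1-based
    end: int                     # column 5, 1-based and inclusive
    attributes: Dict[str, str]   # column 9: key "value"; pairs


# Matches one attribute pair such as: gene_id "ENSG00000223972"
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line: str) -> GtfFeature:
    cols = line.rstrip("\n").split("\t")
    # Note: a key that repeats (e.g. several "tag" entries) keeps only its last value.
    attributes = dict(_ATTRIBUTE.findall(cols[8]))
    return GtfFeature(cols[0], cols[2], int(cols[3]), int(cols[4]), attributes)


example = ('1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";')
gene = parse_gtf_line(example)

# Start and end are both inclusive, so the feature length needs a +1.
gene_length = gene.end - gene.start + 1
```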
BWA aligner scheme
reads (.fq/.fa) -----+
                     |----> BWA ----> SAM (BAM) ----> VCF
reference (.fa) -----+
Representation of the reference genome
The reference genome is represented by a .fa file (the sequence) together with a .GTF file (the annotation).
Gene, transcripts, coding sequences
Given a reference genome, and given a gene inside of it:
- It can contain multiple exons: each exon can contain one or more coding sequences.
- It can contain multiple transcripts: each transcript contains one or more exons.
- A coding sequence is a portion of an exon.
Example:
Gene1:
- transcript1:
  - Exon1 -> CDS1
  - Exon2 -> CDS2
- transcript2:
  - Exon1 -> CDS1
  - Exon3 -> CDS3, CDS4
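The hierarchy in this example can be sketched as nested data structures. This is only one possible modelling; the class names `Exon`, `Transcript` and `Gene` and the method `exon_names` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exon:
    name: str
    cds: List[str] = field(default_factory=list)


@dataclass
class Transcript:
    name: str
    exons: List[Exon] = field(default_factory=list)


@dataclass
class Gene:
    name: str
    transcripts: List[Transcript] = field(default_factory=list)

    def exon_names(self) -> List[str]:
        """Distinct exon names across all transcripts, sorted alphabetically."""
        return sorted({e.name for t in self.transcripts for e in t.exons})


# Exon1 is shared by both transcripts, as in the example above.
exon1 = Exon("Exon1", ["CDS1"])
gene1 = Gene("Gene1", [
    Transcript("transcript1", [exon1, Exon("Exon2", ["CDS2"])]),
    Transcript("transcript2", [exon1, Exon("Exon3", ["CDS3", "CDS4"])]),
])
```

Sharing the `exon1` object between the two transcripts reflects that the same exon can belong to more than one transcript of a gene.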
VCF file
!!!IMPORTANT!!! Positions use 1-based indexing.
Strictly related to the SAM file. Contains all variations found in the samples with respect to the reference.
Important columns:
1st: chromosome #
2nd: position of the specific alteration
4th: bases in the reference
5th: bases in the sample
Together, the 4th and 5th columns describe the difference between the reference and the sample.
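A minimal sketch of reading the four columns above from one VCF data line, assuming tab-separated fields and a single alternate allele in the 5th column. The names `Variant`, `parse_vcf_line`, `variant_kind` and `matches_reference`, the sample line and the toy reference sequence are made up for illustration.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str  # column 1
    pos: int    # column 2, 1-based
    ref: str    # column 4, bases in the reference
    alt: str    # column 5, bases in the sample


def parse_vcf_line(line: str) -> Optional[Variant]:
    """Return the variant on a data line, or None for a '#' header line."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    return Variant(cols[0], int(cols[1]), cols[3], cols[4])


def variant_kind(v: Variant) -> str:
    """Classify the difference by comparing the lengths of ref and alt."""
    if len(v.ref) == 1 and len(v.alt) == 1:
        return "SNV"
    if len(v.ref) < len(v.alt):
        return "insertion"
    if len(v.ref) > len(v.alt):
        return "deletion"
    return "other"


def matches_reference(v: Variant, chrom_seq: str) -> bool:
    """Check that the ref bases agree with the reference at the 1-based position."""
    start = v.pos - 1  # convert to a 0-based Python index
    return chrom_seq[start:start + len(v.ref)] == v.ref


# Hypothetical data line and toy reference, for illustration only.
variant = parse_vcf_line("1\t5\t.\tG\tA\t.\t.\t.")
kind = variant_kind(variant)
consistent = matches_reference(variant, "ACGTGCA")
```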