File formats Flashcards

1
Q

BWA

A

Aligner program that takes as input:
- a FASTQ file (reads coming from the patient sample)
- a FASTA file (the reference genome; a GTF annotation is not an input to BWA)
Outputs a SAM file containing the alignment information for each read (where it is mapped on the reference).
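
As a rough sketch of how such an alignment is typically launched, the snippet below builds a single-end `bwa mem` command line from Python. The file names (`ref.fa`, `sample.fq`, `sample.sam`) and the helper names are placeholders for illustration. It assumes `bwa` is installed and the reference was indexed beforehand with `bwa index`. Nothing is executed unless `run_alignment` is called.

```python
import subprocess

def build_bwa_command(reference_fasta, reads_fastq):
    """Return the argument list for a single-end `bwa mem` run."""
    return ["bwa", "mem", reference_fasta, reads_fastq]

def run_alignment(reference_fasta, reads_fastq, output_sam):
    """Run BWA and write its standard output (SAM text) to output_sam.

    Assumes `bwa` is on the PATH and the reference is already indexed.
    """
    with open(output_sam, "w") as sam_handle:
        subprocess.run(
            build_bwa_command(reference_fasta, reads_fastq),
            stdout=sam_handle,
            check=True,
        )

# Placeholder file names, shown only to display the resulting command.
print(" ".join(build_bwa_command("ref.fa", "sample.fq")))
```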

2
Q

SAM

A
IMPORTANT: 1-based indexing.

A SAM file contains the alignment information for each read (where it is mapped).
It can be stored in compressed binary form as BAM, which can reduce its size to roughly 1/10.
It is produced by aligners.
It may contain header lines, which start with '@'.
After the header there is one line per alignment, with tab-delimited columns.

Important columns:

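1st: name of the read (also present in the .fa/.fq file of the sample)
2nd: flag, a bitwise-encoded summary of the alignment (e.g. unmapped, reverse strand)
3rd: name of the reference sequence (typically the chromosome)
4th: 1-based leftmost position of the alignment on the reference
5th: mapping quality score of the alignment (MAPQ)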
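
To make the column layout concrete, here is a minimal sketch that reads the first five columns of a SAM alignment line and decodes two bits of the flag. The record in `example` is made up, and the function names are illustrative rather than part of any library.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into its first five mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_name": fields[0],
        "flag": flag,
        "is_unmapped": bool(flag & 0x4),    # bit 0x4: read is unmapped
        "is_reverse": bool(flag & 0x10),    # bit 0x10: read maps to the reverse strand
        "reference": fields[2],
        "position": int(fields[3]),         # 1-based leftmost position
        "mapq": int(fields[4]),
    }

def alignment_lines(sam_text):
    """Yield alignment lines, skipping header lines that start with '@'."""
    for line in sam_text.splitlines():
        if line and not line.startswith("@"):
            yield line

# Made-up header and record, for illustration only.
example = "@SQ\tSN:chr1\tLN:1000\nread_id_0\t16\tchr1\t101\t60\t50M\t*\t0\t0\t*\t*\n"
for line in alignment_lines(example):
    record = parse_sam_line(line)
    print(record["read_name"], record["reference"], record["position"], record["is_reverse"])
```

Because SAM positions are 1-based, subtract 1 from `position` before using it as a 0-based index into a Python string.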

3
Q

FASTA (.fa)

A

A FASTA file is used to represent reads from the sample (patient).

4
Q

FASTA (.fa)

A

A FASTA file is used to represent sequences, such as reads from the sample (e.g. patient). Each record is a header line starting with '>' followed by the sequence on the next line(s).
Example:
>read_id_0
GGTATGCTTCTGGGGCGGCAGTCGATAGGGCTAGACTCAGGTCCCGTGGC
>read_id_1
CACTGTGGCCCTCTTGGGGGGTGTCCACACGCCGCCCGTCGGCCCCCTCC
>read_id_2
GTTCTGTGGGTACCTCGCGGTTATGGTGTCGGGGGTATCCAAGGCACCCC
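
A minimal sketch of reading such records in Python follows. It assumes the standard FASTA layout, in which the header line starting with '>' is followed by the sequence on one or more separate lines. The function name and the shortened sequences are illustrative.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence id to its full sequence."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first word after '>'; the rest is a free-text description.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # A sequence may be wrapped over several lines.
            sequences[current_id] += line
    return sequences

# Shortened, made-up sequences for illustration.
example = ">read_id_0\nGGTATGCTTC\nTGGGGCGGCA\n>read_id_1\nCACTGTGGCC\n"
for read_id, sequence in parse_fasta(example).items():
    print(read_id, len(sequence))
```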

5
Q

FASTQ (.fq)

A

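FASTQ is a text-based format for storing both a biological sequence and its corresponding quality scores. Each record has four lines: a header starting with '@', the sequence, a separator line starting with '+', and the quality string.
Example: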
@ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
TTTTTCTAGACGGCAGGTCAGGTCCACCACTGACACGTTGGCAGTGGGGACACGGAAGGCCATGCCAGTGAGCTTCCCGTTCAGCTCAGGGATGACCTTGC
+ERX288614.1 HWI-ST1362:33:D1J0JACXX:6:1101:1687:2354 length=101
BBBFFFBFFBFFFFF0BFF0BFFBFBFFBFB77BBBBBBBFFBBBBBFIIB0BB
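
The four-line record structure can be sketched in Python as below. This assumes the common Phred+33 encoding (quality score = ASCII code minus 33) and that each quality string has the same length as its sequence. The record in `example` is made up, and the function name is illustrative.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for start in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header)
        read_id = header[1:].split(" ")[0]
        # Assumption: Phred+33 encoding, so '!' is 0 and 'I' is 40.
        scores = [ord(symbol) - 33 for symbol in quality]
        yield read_id, sequence, scores

# Made-up record for illustration.
example = "@read_id_0 demo\nACGT\n+\nII5!\n"
for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)
```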

6
Q

GTF files

A

Annotation file associated with the .fa file of the reference genome.

Each line can contain information about:

  • genes
  • transcripts
  • exons
  • CDS (coding sequence): the portion that is directly translated into a protein.

Important columns:

1st: chromosome name
3rd: feature type (gene, transcript, exon, CDS, ...)
4th: start position of the feature (1-based)
5th: end position of the feature (inclusive)
9th: attributes, including gene_id, gene_name and gene_biotype

Example:
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
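
A line like the ones above can be taken apart with a short Python sketch. It assumes the real file is tab-delimited with nine columns, and it uses a simplified attribute parser that splits on ';' (so it would break on values containing a semicolon, and repeated keys such as `tag` keep only the last value). The function name is illustrative.

```python
def parse_gtf_line(line):
    """Parse one GTF line into its main columns plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    attributes = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "chromosome": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),   # 1-based
        "end": int(fields[4]),     # inclusive
        "strand": fields[6],
        "attributes": attributes,
    }

# Shortened version of the gene line from the example, with explicit tabs.
example = (
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";'
)
record = parse_gtf_line(example)
# Both ends are included, so the length is end - start + 1.
length = record["end"] - record["start"] + 1
print(record["feature"], record["attributes"]["gene_name"], length)
```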

7
Q

BWA aligner scheme

A

reads (.fq/.fa) ---+
                   +---> [ BWA ] ---> SAM (BAM) ---> [ variant calling ] ---> VCF
reference (.fa) ---+

8
Q

Representation of the reference genome

A

The sequence itself is stored in a .fa (FASTA) file; its annotation (genes, transcripts, exons, CDS) is stored in a .gtf file.

9
Q

Gene, transcripts, coding sequences

A

Given a reference genome, and given a gene inside of it:

  • It can contain multiple exons: each exon can contain one or more coding sequences.
  • It can contain multiple transcripts: each transcript contains one or more exons.
  • A coding sequence is a portion of an exon.
Example:
Gene1:
- transcript1:
  - Exon1 -> CDS1
  - Exon2 -> CDS2
- transcript2
  - Exon1 -> CDS1
  - Exon3 -> CDS3, CDS4
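
The hierarchy in this example can be written down as a nested Python dictionary, which makes it easy to ask which exons a gene has or which transcripts share an exon. This is only an illustrative data layout with hypothetical helper names, not a format used by any particular tool.

```python
# Gene1 from the example: transcript -> exon -> list of coding sequences.
gene1 = {
    "transcript1": {"Exon1": ["CDS1"], "Exon2": ["CDS2"]},
    "transcript2": {"Exon1": ["CDS1"], "Exon3": ["CDS3", "CDS4"]},
}

def exons_of(gene):
    """Return the sorted names of all exons used by any transcript."""
    return sorted({exon for exons in gene.values() for exon in exons})

def transcripts_using(gene, exon_name):
    """Return the sorted names of the transcripts that include exon_name."""
    return sorted(name for name, exons in gene.items() if exon_name in exons)

print(exons_of(gene1))
print(transcripts_using(gene1, "Exon1"))
```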
10
Q

VCF file

A

IMPORTANT: 1-based indexing.

Closely related to the SAM file, from which it is derived. Contains all variations found in the sample with respect to the reference.

Important columns:

1st: chromosome name
2nd: position of the specific alteration (1-based)
4th: bases in the reference (REF)
5th: bases in the sample (ALT)

The 4th and 5th columns together describe the difference between the reference and the sample.
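
As an illustration of how the reference and sample columns describe a difference, the sketch below parses VCF data lines and labels each variant by comparing allele lengths. The records are made up, and the classification is a simplification that ignores symbolic alleles such as `<DEL>` or `*`. Function names are illustrative.

```python
def parse_vcf_line(line):
    """Extract chromosome, position, REF and ALT alleles from a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chromosome": fields[0],
        "position": int(fields[1]),     # 1-based
        "ref": fields[3],
        "alts": fields[4].split(","),   # several ALT alleles are comma-separated
    }

def classify_variant(ref, alt):
    """Label a variant by comparing the lengths of REF and ALT."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "multi-base substitution"

# Made-up records for illustration.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t11900\t.\tA\tG\t50\tPASS\t.\n"
    "1\t12000\t.\tAT\tA\t50\tPASS\t.\n"
)
for line in example.splitlines():
    if line.startswith("#"):
        continue  # skip meta-information and column header lines
    variant = parse_vcf_line(line)
    for alt in variant["alts"]:
        print(variant["chromosome"], variant["position"], classify_variant(variant["ref"], alt))
```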
