HC 4.2 Omics and Gene Expression: Transcript Level Analysis Flashcards
Lecture 4
Transcription involves which omes?
genome and transcriptome
Types of genes
-Protein coding
-Non coding
The principal step of RNA seq is selecting RNA molecules. Which selections are possible?
-Size selection
-Type selection with ribodepletion and poly(A) selection
The details of the RNA seq analysis depend on …
the experimental context and RNA molecules measured
Data analysis workflow for gene expression
-Selection
-Fragmentation and reverse transcription
-Sequencing and mapping
-Quantification
Goals of mRNAseq
Capturing transcriptome complexity by analysing the isoforms of genes (transcripts), or working with a non-model organism that has a poorly characterized genome
4 options for mRNAseq analysis via transcriptome
- De novo assembly of transcriptome
- Well characterized genome: reference-based transcriptome assembly
- Combined reference based and de novo assembly
- model organism: download transcriptome from ENSEMBL or NCBI and use those for mapping
Principle of assembly
Reconstructing long sequences from overlapping sequence fragments.
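A minimal sketch of this principle, assuming three made-up toy fragments that are already in the right order. The helper names `overlap_length` and `merge_by_overlap` are hypothetical, and real assemblers do not compare reads one by one like this.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_by_overlap(left, right, min_overlap=3):
    """Join two fragments on their overlap; return None if they do not overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


# Toy fragments, made up for illustration and assumed to be in order.
fragments = ["ATGGCGT", "GCGTACC", "ACCTTGA"]
sequence = fragments[0]
for fragment in fragments[1:]:
    sequence = merge_by_overlap(sequence, fragment)

print(sequence)  # ATGGCGTACCTTGA
```

With millions of reads, testing every pair for an overlap like this becomes too slow, which is the challenge the next card refers to.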
Big challenge with de novo assembly
How to find the overlaps with millions of reads generated
How are De Bruijn graphs made?
-Split the sequence reads into k-mers (nucleotide subsequences of length k taken from the reads)
-Order k-mers based on the overlaps > graph with arrows (de Bruijn)
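The two steps above can be sketched as follows. This is an illustration under assumptions: the reads are toy data, the function names are hypothetical, and the k-mers themselves are used as nodes (many assemblers instead use (k-1)-mers as nodes and k-mers as arrows).

```python
def kmers(read, k):
    """All overlapping subsequences of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_graph(reads, k):
    """Arrow from k-mer a to k-mer b when the last k-1 bases of a
    equal the first k-1 bases of b."""
    nodes = sorted({kmer for read in reads for kmer in kmers(read, k)})
    return {a: [b for b in nodes if a[1:] == b[:-1]] for a in nodes}


def walk_unbranched(graph, start):
    """Follow arrows while there is exactly one way forward, building a contig."""
    contig, node, visited = start, start, {start}
    while len(graph[node]) == 1 and graph[node][0] not in visited:
        node = graph[node][0]
        visited.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


reads = ["ATGGC", "GGCGT", "GCGTA"]  # toy reads, made up for illustration
graph = kmer_graph(reads, k=3)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTA
```

The walk stops as soon as a node has more than one outgoing arrow, which is exactly the situation that isoforms create.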
Problem with de novo assembly and isoforms
Several k-mers can overlap enough to connect to the previous one, so the graph branches and multiple isoforms are assembled; the actual transcriptome is therefore not completely reconstructed.
What are long assembled sequences called?
Contigs
Purpose De Bruijn graph
Method to construct long sequences from short sequences
What is the result of an assembly?
A contig
Question with complexity of isoforms in gene expression quantitation
Is it an isoform or an assembly (next different piece)?
Which regions should be searched for in contigs for more biological relevance?
ORFs
How to identify ORFs (open reading frames) from contigs
-Identify ORFs by searching for start codons (ATG, methionine) and stop codons in each reading frame
-Longest potential ORFs could be candidates for a protein > more biological relevance
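A rough sketch of this search, assuming a made-up contig and the standard stop codons TAA, TAG and TGA. The function name `find_orfs` is hypothetical, and only the three forward frames are scanned; a full search would also scan the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (frame, start, end, orf) for the three forward reading frames.

    An ORF here runs from an ATG to the first in-frame stop codon (inclusive).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3, sequence[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


contig = "GATGGCCAAGTTTTGACATGTAGC"  # toy contig, made up for illustration
orfs = find_orfs(contig)
longest = max(orfs, key=lambda orf: len(orf[3]))
print(longest[3])  # ATGGCCAAGTTTTGA
```

The toy contig contains two ORFs; the longer one (five codons) would be the better protein candidate.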
Gene prediction models
-Ab initio: based on gene signals like intron splice sites, TF binding sites and codon structure
-Homology: significant matches of the query with known genes
-Probabilistic: Markov models: translate AA sequence to probable location/function
Methods for transcriptome functional annotation
- searching for homologs based on sequence similarities and identifying assembled sequences
- domain and other sequence feature identification (sequence feature annotation)
- assigning standardized descriptions for sequence biological properties (GO terms)
mRNAseq analysis 2. Reference-based transcriptome assembly
-Reads are first splice-aware mapped against reference genome
> connectivity or splice graph is constructed to represent all possible splicing events at a locus
> alternative paths through the graph are followed to join compatible reads together to isoforms
> biological reference
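To illustrate what "alternative paths through the graph" means, here is a small sketch that lists every path through a hypothetical four-exon splice graph. The exon names and the function `enumerate_isoforms` are assumptions for illustration; real tools also use read support to decide which paths to keep, which this sketch leaves out.

```python
def enumerate_isoforms(splice_graph, start, end):
    """List every path from `start` to `end` in an acyclic splice graph."""
    if start == end:
        return [[end]]
    paths = []
    for next_exon in splice_graph.get(start, []):
        for tail in enumerate_isoforms(splice_graph, next_exon, end):
            paths.append([start] + tail)
    return paths


# Hypothetical locus: each arrow is a splice junction seen in the mapped reads.
splice_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3", "exon4"],
    "exon3": ["exon4"],
}

isoforms = enumerate_isoforms(splice_graph, "exon1", "exon4")
for isoform in isoforms:
    print(" - ".join(isoform))
```

For this toy locus there are three candidate isoforms: the full transcript, one that skips exon3 and one that skips exon2.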
mRNAseq analysis 3. Combined reference-based and de novo assembly
First: de novo assembly, then alignment
> check whether the contigs are found on the reference genome
> scaffold contigs are mapped
> unassembled reads are mapped and the scaffold contigs which were mapped are extended
Or: First alignment and then assembly
> de novo assembly of unmapped reads
> Reference-based assembly of aligned reads
mRNAseq analysis 4. Model organisms: download the transcriptome from ENSEMBL/NCBI for mapping
-Download FASTA files online
-Alignment of reads to the transcripts to calculate expression levels
After alignment of reads to downloaded transcriptome: the gene expression =
The isoform expression
Disadvantage of downloading the transcriptome
You cannot discover new transcripts
> you do no characterization yourself; you rely on what the community has already done
Why is splice-aware mapping not needed for download-based mapping?
The transcriptome is downloaded, which does not contain introns
> introns are removed from the isoforms
Issues with download-based mapping
-Isoforms are often very similar, so many reads do not align uniquely
> gene expression levels still have to be calculated correctly when reads multi-map (map to two or more transcripts), so multi-mapping has to be taken into account
-The number of reads depends on transcript length
> longer transcripts generate more reads: bias
> Correction is needed
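One common correction for this length bias is TPM (transcripts per million); the card does not name a specific method, so this is only an example. The read counts and lengths are made up, and the function name `tpm` is hypothetical.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million: divide counts by length, then scale to 1e6."""
    # Reads per kilobase removes the "longer transcript, more reads" effect.
    rpk = {t: read_counts[t] / (lengths_bp[t] / 1000) for t in read_counts}
    scale = sum(rpk.values()) / 1_000_000
    return {t: value / scale for t, value in rpk.items()}


# Toy data: the long transcript has 4x the reads but is also 4x as long.
counts = {"short": 100, "long": 400}
lengths = {"short": 500, "long": 2000}

values = tpm(counts, lengths)
print(values)
```

After the correction both transcripts get the same value, because their reads per kilobase are identical.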
Approaches for multi-mapping reads
-Ignore the reads: remove them from quantification
-Count once per alignment: count those reads for both alignments
-Split them equally: divide multi-mapping reads and split among the transcripts
-Rescue based on uniquely mapped reads
-And more; the essence: take multi-mapping reads into account
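The first four approaches can be compared on toy data. This is a sketch under assumptions: the reads, transcript names and the function `count_reads` are hypothetical, and "rescue" is implemented here as dividing each multi-mapping read in proportion to the uniquely mapped reads of its transcripts.

```python
from collections import Counter

STRATEGIES = {"ignore", "per_alignment", "split", "rescue"}


def count_reads(alignments, strategy):
    """Count reads per transcript; `alignments` maps read -> list of transcripts."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    unique = Counter(t[0] for t in alignments.values() if len(t) == 1)
    counts = Counter(unique)  # uniquely mapped reads always count once
    for targets in alignments.values():
        if len(targets) == 1 or strategy == "ignore":
            continue
        if strategy == "per_alignment":
            weights = [1] * len(targets)
        elif strategy == "split":
            weights = [1 / len(targets)] * len(targets)
        else:  # rescue: weight by uniquely mapped reads, equal split as fallback
            total = sum(unique[t] for t in targets)
            weights = [unique[t] / total if total else 1 / len(targets)
                       for t in targets]
        for target, weight in zip(targets, weights):
            counts[target] += weight
    return dict(counts)


# Toy data: read3 maps to both transcripts.
alignments = {
    "read1": ["txA"],
    "read2": ["txA"],
    "read3": ["txA", "txB"],
    "read4": ["txB"],
}

for strategy in ["ignore", "per_alignment", "split", "rescue"]:
    print(strategy, count_reads(alignments, strategy))
```

On this toy data txA gets 2, 3, 2.5 or about 2.67 reads depending on the approach, which shows how much the choice can change the expression estimate.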
Differential gene expression analysis
-Differences between control cells and treated cells
-Combine various approaches
How can mRNAseq reads be used to characterize and quantify the transcriptome?
Characterize 4 options:
-De novo assembly
-Reference-based assembly
-Combined reference-based and de novo assembly
-Model organism download
Quantification by mapping reads on the transcripts