RNA seq Flashcards

Question

What are pseudo aligners?

Answer 1

- FM indexing illustrates one of the challenges of mapping reads to the entire genome
- Pseudo-aligners have been developed to provide faster and more efficient read mapping
- They use k-mers rather than aligning full reads
- Examples include Salmon and Kallisto
- Kallisto “can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build”

Answer 2

- For each read, a pseudo-aligner determines which transcripts it is from rather than where it aligns
- Therefore, a full alignment of the reads to the genome is not necessary
- Raw sequence reads are compared directly to transcript sequences and then used to quantify transcript abundance
- The comparison of the sequencing reads to the transcripts is done using a transcriptome de Bruijn graph (T-DBG)
- The T-DBG is constructed from the k-mers present in an input transcriptome, as opposed to the k-mers in the reads, which is what is normally done for genome/transcriptome assembly
- Transcripts are converted into a T-DBG
- Each node/vertex in the T-DBG is a k-mer and is associated with one or more transcripts: its k-compatibility class
- In the example figure, the leftmost node has a k-compatibility class of all 3 transcripts
- Once the T-DBG is built, kallisto stores a hash table mapping each k-mer to the linear stretch (e.g. the first 3 nodes) it is contained in, as well as its position
- This is called the “kallisto index”
- Reads are also split into k-mers and matched to transcripts using the hash table
- In the figure, the black nodes represent the k-mers of the read where they match transcript k-mers
- To identify the transcript(s) a read is from, identify all associated k-compatibility classes
- The k-compatibility classes of all black nodes for this read are the blue and pink transcripts
- This can be extended to paired-end reads using all the k-compatibility classes along both reads
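The k-mer matching described above can be sketched in a few lines of Python. This is a toy illustration, not kallisto's implementation: the transcript sequences, the names `build_kmer_index` and `pseudoalign`, and the choice of k = 5 are all made up for the example, and the read is assigned by intersecting the k-compatibility classes of its matching k-mers.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it
    (its k-compatibility class). Toy stand-in for a real index."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every matching k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        kmer_class = index.get(read[i:i + k])
        if kmer_class is None:
            continue  # k-mer not in the transcriptome; ignored in this sketch
        if compatible is None:
            compatible = set(kmer_class)
        else:
            compatible &= kmer_class
    return compatible or set()


# Hypothetical transcripts sharing a common prefix
transcripts = {
    "t1": "ACGTACGGTCA",
    "t2": "ACGTACGTTGA",
    "t3": "ACGTACCCCTT",
}
index = build_kmer_index(transcripts, k=5)

print(pseudoalign("ACGTAC", index, k=5))   # read from the shared prefix
print(pseudoalign("GTACGGT", index, k=5))  # read overlapping a t1-specific region
```

No base-by-base alignment is performed: the read is only ever reduced to the set of transcripts it could have come from.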

Answer 3

only works with a good transcriptome

Answer 4

- Kallisto improves efficiency by exploiting redundancy
- The 3 leftmost nodes have the same k-compatibility class, i.e. the same equivalence class
- When a read k-mer is hashed, the k-compatibility class of its node is identified and kallisto jumps to the node after the last one in the same equivalence class
- Once the leftmost k-mer of the read is hashed, kallisto skips the next 2 nodes as they are redundant and hashes only the 4th k-mer of the read
- For most reads, kallisto performs a hash lookup for only two k-mers
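The skipping idea can be illustrated with a simplified sketch. It assumes the read follows a single linear path of the graph without mismatches, which a real tool has to verify; the path, the class labels and the function names `build_skip_index` and `classes_with_skipping` are hypothetical.

```python
def build_skip_index(path):
    """path: list of (kmer, k_compatibility_class) along one linear stretch.
    Returns kmer -> (class, number of following nodes sharing that class)."""
    index = {}
    n = len(path)
    for i, (kmer, cls) in enumerate(path):
        j = i + 1
        while j < n and path[j][1] == cls:
            j += 1
        index[kmer] = (cls, j - i - 1)
    return index


def classes_with_skipping(read_kmers, index):
    """Collect the classes seen by a read, skipping redundant k-mers.
    Returns (list of classes, number of hash lookups performed)."""
    found, lookups, i = [], 0, 0
    while i < len(read_kmers):
        entry = index.get(read_kmers[i])
        lookups += 1
        if entry is None:
            i += 1  # unknown k-mer: move on to the next one
            continue
        cls, skip = entry
        found.append(cls)
        i += skip + 1  # jump past nodes with the same class
    return found, lookups


ABC = frozenset({"t1", "t2", "t3"})
AB = frozenset({"t1", "t2"})
# First 3 nodes share one class, as in the card's example
path = [("AAAC", ABC), ("AACG", ABC), ("ACGT", ABC), ("CGTT", AB), ("GTTA", AB)]
index = build_skip_index(path)

read, k = "AAACGTT", 4
read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
found, lookups = classes_with_skipping(read_kmers, index)
print(found, lookups)
```

Here the read contains four k-mers, but only the 1st and 4th are looked up, matching the "two lookups per read" behaviour described above.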

Answer 5

- Pseudo-aligners provide a highly efficient method for mapping reads to transcripts
- Benefits include speed and the low computational resources required
- The disadvantage is that reads can only be mapped to known transcripts
- They are unable to identify and quantify unknown or novel transcripts
- Quantification is at the transcript level, not the gene level (a slight disadvantage, as we usually work with genomes, not transcripts)
- An R library, tximport, is available to convert to gene-level quantification for input to DESeq2 etc.
- Kallisto has a companion R library, sleuth, for transcript differential expression analysis

Answer 6

Normalisation adjusts the read counts to allow comparison within a sample

Answer 7

- Several units normalize counts by feature length to allow comparison of features WITHIN a sample: RPKM, FPKM, TPM
- They also normalize by total read count, but this is generally NOT sufficient for comparison between samples (see later slide)

Answer 8

- TPM is very similar to RPKM and FPKM; the difference is the order of operations (TPM normalizes by feature length first and by sequencing depth second)
- FPKM and TPM (transcripts per million) are both measures of the **relative** abundance of a transcript in your pool of transcripts.
- TPM is now generally preferred over FPKM (as the proportionality constant for FPKM is experiment specific).
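A small sketch makes the order-of-operations difference concrete. The counts and lengths are invented, and the function names `fpkm` and `tpm` are just labels for this example rather than any library's API.

```python
def fpkm(counts, lengths_bp):
    """Depth first, then length: count / (total reads in millions) / (length in kb)."""
    total_millions = sum(counts) / 1e6
    return [c / total_millions / (l / 1e3) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Length first, then depth: per-kb rates rescaled so they sum to one million."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


# Made-up counts for three features of different lengths
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print(sum(tpm_values))   # always one million, whatever the sample
print(sum(fpkm_values))  # depends on the sample's counts and lengths
```

Because TPM values always sum to one million, a given TPM means the same proportion of the transcript pool in every sample, whereas the FPKM total varies from experiment to experiment.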

Answer 9

- FPKM and TPM both normalize for total read count, but this is generally not sufficient to make comparisons between samples.
- They can only tell us the relative proportion of transcripts in a sample.
- If we make further assumptions, we can develop suitable methods to compare gene abundances between samples.
- Ballgown processes the values further, e.g. allowing for total read count per sample.
- YOU CAN’T COMPARE SAMPLES WITH THIS METHOD, ONLY WITHIN SAMPLE OPTIMISATION
- Example: assume 4 genes in the genome, each equally expressed in condition 1. Genes C and D are down-regulated in condition 2, but the other 2 have the same expression. Both conditions have 8000 reads mapped.
- As the number of reads is the same in both conditions, the RPKM for genes A and B would be higher in condition 2 than in condition 1, suggesting they are up-regulated
- This is the RNA composition effect
- Tools that use this method, such as Ballgown, also maintain extra information about samples (e.g. the total number of reads)
- This allows for the proper comparison of these normalized measures across samples
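The four-gene example can be worked through numerically. The sketch below adds two assumptions that are not in the card: every gene is 1000 bp long, and "down-regulated" means the true abundance of C and D is halved.

```python
def expected_reads(abundance, total_reads):
    """Share a fixed number of reads between genes in proportion to abundance."""
    total_abundance = sum(abundance.values())
    return {g: total_reads * a / total_abundance for g, a in abundance.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads."""
    total_millions = sum(counts.values()) / 1e6
    return {g: counts[g] / total_millions / (lengths_bp[g] / 1e3) for g in counts}


lengths = {g: 1000 for g in "ABCD"}  # assumed equal gene lengths
total_reads = 8000                   # same depth in both conditions

abundance_1 = {"A": 2, "B": 2, "C": 2, "D": 2}  # all equally expressed
abundance_2 = {"A": 2, "B": 2, "C": 1, "D": 1}  # C and D halved (assumption)

rpkm_1 = rpkm(expected_reads(abundance_1, total_reads), lengths)
rpkm_2 = rpkm(expected_reads(abundance_2, total_reads), lengths)

for gene in "ABCD":
    print(gene, round(rpkm_1[gene]), round(rpkm_2[gene]))
```

Genes A and B have identical true abundance in both conditions, yet their RPKM rises in condition 2 because the reads freed up by C and D are redistributed to them.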

Answer 10

- Highly and differentially expressed (DE) genes can distort gene abundance measures, so total read count is not an accurate normalization factor.
- Instead, we want to find a ‘control’ set of transcripts that are not DE and use these to estimate ‘size factors’ that enable meaningful comparisons between samples.
- Several methods exist (all assume most genes are not DE), including TMM and Median of Ratios

Answer 11

- Uses a trimmed weighted mean. Excludes genes with large log-fold ratios between samples, and those with extreme abundance values, before calculating a weighted mean of log-fold changes.
- Anything extreme is excluded
- Used by edgeR
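The trimming idea can be sketched as follows. This is a deliberately simplified version and not edgeR's TMM: it trims only on log-fold ratios (no trimming on abundance) and takes an unweighted mean. The function name `trimmed_mean_factor`, the 30% trim and the counts are assumptions for illustration.

```python
import math


def trimmed_mean_factor(sample, reference, trim=0.3):
    """Simplified TMM-like factor: trimmed, unweighted mean of log2 ratios
    of library-size-scaled counts (sample vs reference)."""
    n_sample, n_reference = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with a zero count give undefined ratios
    )
    cut = int(len(m_values) * trim)
    kept = m_values[cut:len(m_values) - cut]  # drop the extremes at both ends
    return 2 ** (sum(kept) / len(kept))


# Ten genes; only the last is strongly up in the sample (made-up counts)
reference = [100] * 10
sample = [100] * 9 + [1000]

factor = trimmed_mean_factor(sample, reference)
print(factor, sum(sample) * factor)
```

In this toy case the raw library sizes differ (1900 vs 1000 reads) only because of one highly expressed gene; after trimming, the effective library size `sum(sample) * factor` comes back to roughly 1000, so the nine unchanged genes are treated as unchanged.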

Answer 12

- Uses the median of ratios of observed counts. For each gene, calculate the (geometric) mean of its expression across all samples and treat this as a pseudo-reference. For each sample, calculate the ratios of observed gene counts to these pseudo-references, and take the median value as the size factor.
- Used by DESeq2
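The median-of-ratios calculation can be written out directly. This is a minimal pure-Python sketch of the steps described above, not DESeq2's code; the function name `size_factors`, the sample names and the counts are invented, and genes with a zero count in any sample are skipped because their geometric mean would be zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of sample name -> list of gene counts (same gene order).
    Returns one median-of-ratios size factor per sample."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean per gene: the pseudo-reference
    log_reference = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_reference.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_reference.append(None)  # gene skipped

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - ref
            for g, ref in enumerate(log_reference)
            if ref is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Sample B is sequenced twice as deeply; the last gene is differentially expressed
counts = {
    "A": [100, 200, 300, 400],
    "B": [200, 400, 600, 100],
}
factors = size_factors(counts)
print(factors)
```

The DE gene does not move the median, so the size factors differ by the factor of two expected from sequencing depth alone; dividing each sample's counts by its size factor puts the first three genes on the same scale.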

Answer 13

- As well as normalization, we may need to correct for batch effects.
- These are confounding factors that cause unwanted variation in gene abundances between samples, due to technical factors that differ across batches (samples that are processed in parallel).
- For example, differences in reagents, equipment, or date of library preparation or sequencing may cause batch effects.
- Various tools (e.g. ComBat, svaseq) enable batch correction, assuming a suitable experimental design in which technical and experimental factors are not confounded.

Answer 14

- More replicates increase our power to detect DE genes.
- A minimum of 3 biological replicates is recommended.
- RNA-seq experiments often have few replicates, so specialized statistical methods are required

Answer 15

- Testing for a statistically significant change in expression
- Many individual statistical hypothesis tests are performed (one for each of 100s-1000s of genes), so p-values need to be corrected for multiple testing
- These methods generally require **unnormalized data**, as they perform an integrated normalization step
- Use raw or estimated read counts

Answer 16

- Calculated p-values need to be adjusted when many statistical tests are performed, in order to control the false discovery rate (FDR)
- This applies to differential expression calculations, where multiple genes are being compared
- DESeq2 implements Benjamini-Hochberg multiple test correction and reports a q-value along with the p-value
- The q-value is the adjusted p-value and is the significance value that should be used
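The Benjamini-Hochberg adjustment itself is short enough to write out. The sketch below is a plain-Python illustration with made-up p-values; the function name `benjamini_hochberg` is hypothetical and this is not DESeq2's code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # one p-value per gene (made up)
qvalues = benjamini_hochberg(pvalues)
print(qvalues)
```

Each p-value is multiplied by the number of tests and divided by its rank, then capped so that adjusted values never decrease as the raw p-values increase. Genes whose adjusted value falls below the chosen FDR threshold are called significant.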

Answer 17

- You can use RNA-seq to narrow down the pool of genes and then do qPCR on that subset of genes
- RNA-seq is a starting point
- If you see anything interesting, you can then take those genes forward for further analysis

Answer 18

- RNA-seq has applications beyond quantifying gene expression.
- One gene may give rise to several different mRNAs (and protein isoforms) due to alternative splicing
- RNA-seq allows us to study changes in isoform expression.
- Identification of mutations
- Mutations may affect RNA cis-regulatory elements, spliceosomal components, or trans-acting regulatory factors.

Answer 19

- Chromosomal rearrangements can lead to fused transcripts.
- RNA-seq allows us to detect these fusion events.
- Gene fusions are commonly reported in many types of cancer, and may be used for diagnosis and prognosis.
- A fusion junction is a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene
- It might occur as a result of a translocation, deletion or chromosomal inversion
- Example: the PML-RAR protein associated with Acute Promyelocytic Leukaemia
- These types of structural rearrangements can also be identified by direct sequencing using paired-end reads


(43 cards)