5.2 RNA-seq Flashcards

1
Q

Describe how genes in prokaryotes are transcribed

A

Genes in an operon are transcribed together as a single polycistronic mRNA that may encode multiple genes. Each coding region of this mRNA is translated into an individual protein

2
Q

Describe transcription in eukaryotes

A
  1. A gene is comprised of coding & non-coding regions (exons & introns) with 5’ UTR at the TSS, and 3’ UTR
  2. pre-mRNA gets spliced, capped, and polyadenylated into mature mRNA
  3. mature mRNA exported into the cytoplasm for translation
3
Q

Define the transcriptome

A

All RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA produced in one or a population of cells

4
Q

Is there a molecular or transcriptional technique that can measure the entire transcriptome of a cell? Why or why not?

A

No, there is not.
Each RNA species requires its own customized experimental workflow and analytical pipeline
ex: miRNAs are small and single-stranded, so they require ligation to adaptors

5
Q

Name 5 challenges of RNA sequencing compared to DNA sequencing

A
  1. Sample purity/quality/qty: RNA inherently labile; hard to get good samples
  2. RNAs consist of small exons that may be separated by large introns in the genome: hard to map with a seed & extension strategy since exon-exon junctions don’t exist in the genome
  3. Relative abundance of RNAs vary wildly (amount of genomic DNA consistent, but RNA expression not across cell types)
  4. RNAs come in wide range of sizes
  5. RNA easily degradable
6
Q

How is RNA quality assessed?

A
  1. Using a bioanalyzer, a small microchip gel is injected with a polymer and RNA (w dye and markers)
  2. A fluorescence unit diagram & an RNA Integrity Number (RIN) value are output
7
Q

What is a RIN value?

A

The ratio of expected intact RNA species (the 18S and 28S rRNA peaks) over smaller RNA fragments

8
Q

What is a good and bad RIN score?

A

Good RIN: 10
Threshold to proceed in RNA experiment: 7
Bad RIN: 0

9
Q

What are some challenges in using poor quality RNA?

A
  1. The RNA degrades in a non-predictable way, creating smaller species than expected
  2. Smaller species get captured during library construction in a non-stochastic way, which affects gene counts
10
Q

What form should RNA be in when creating an RNA-seq library?

A

RNA should be fully intact

11
Q

In mammalian RNA-seq what is (typically) the first step that is done?

A

mRNA is separated from other RNA types

12
Q

In mammalian cells, what is the most abundant type of RNA and why is it not desirable to analyze for gene expression?

A

~80% of total RNA is rRNA

rRNA is abundant, highly repetitive and not useful for understanding differential gene expression

13
Q

How can eukaryotic mRNA be separated from other RNA types? For prokaryotic mRNA?

A
  1. Poly-A tail selection (note: not all mRNAs have one) -> poly-dT beads bind to poly-A tails
  2. Ribodepletion: depletion of rRNA
    Prokaryotes: mRNAs have no poly-A tail, so only ribodepletion can be used
14
Q

Describe the 7 Steps used to construct an mRNA RNA-seq library

A
  1. target mRNA is enriched using poly-dT beads or ribodepletion
  2. RNA fragmented and primed
  3. First strand of cDNA generated
  4. Second strand of cDNA generated
  5. 3’ end adenylated & 5’ ends repaired
  6. Adaptors containing barcodes added to both ends
  7. ligated fragments PCR amplified
15
Q

How are different adaptors added to the ends of inserts?

A

Forked adaptors:

  • first 14 nts are complementary
  • Afterwards, no longer complementary and diverge
  • not an issue in PCR

Prevents loss of ~ 50% product as in Ion Torrent

16
Q

Why is it important to know which strand a transcript originated from?

A

To be able to distinguish the expression levels of 2 different genes/exons on different strands that may overlap

Strand information can be retained when shearing so the orientation of the insert is known when sequencing

17
Q

Describe the unstranded protocol

A
  • synthesis of randomly primed ds-cDNA + addition of adaptors for sequencing
  • info on which strand the original mRNA template came from is lost
  • can’t determine gene expression of overlapping genes that are transcribed from different strands
18
Q

Describe the stranded protocol

A
  • dUTP is used instead of dTTP in synthesis of the 2nd cDNA strand (can be the 1st strand, usually the 2nd)
  • before PCR, the strand containing uracils is degraded using uracil-N-glycosylase. The remaining 1st strand is complementary to the original mRNA transcript, so the strand of origin is preserved

Uracil acts as a molecular tag on the 2nd strand for removal prior to sequencing

19
Q

What are the common analysis goals of RNA-seq? (6)

A
  • gene expression and differential expression
  • transcript discovery and annotation
  • allele-specific expression (& in relation to SNPs or mutations)
  • Mutation discovery
  • fusion detection
  • RNA editing
20
Q

Should PCR duplicates be removed?

A

Depends:
RNA-seq: typically keep duplicates
ChIP-seq: best practice to remove
Whole genome analysis: always remove because duplicates are not representative of true biological replicates

To decide, assess library complexity
If removing, assess duplicates at the paired-end read level and not the single-end read level
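
The difference between the paired-end and single-end levels can be illustrated with a minimal Python sketch. The fragment tuples and the function name are hypothetical, and real duplicate-marking tools apply more elaborate rules; the only point here is that using both mate positions as the key flags fewer fragments than using mate 1 alone.

```python
from collections import Counter

def duplicate_fraction(fragments, paired=True):
    """Return the fraction of fragments flagged as duplicates.

    Each fragment is a hypothetical tuple:
    (chrom, strand, mate1_start, mate2_start).
    """
    seen = Counter()
    for chrom, strand, start1, start2 in fragments:
        # Paired-end key uses both mate positions; single-end key uses mate 1 only.
        key = (chrom, strand, start1, start2) if paired else (chrom, strand, start1)
        seen[key] += 1
    # Every copy beyond the first one for a given key counts as a duplicate.
    duplicates = sum(n - 1 for n in seen.values())
    return duplicates / len(fragments)

# Toy data, invented for illustration only.
fragments = [
    ("chr1", "+", 100, 350),
    ("chr1", "+", 100, 350),  # both mates identical: likely a PCR duplicate
    ("chr1", "+", 100, 420),  # same mate-1 start, but a different fragment end
    ("chr2", "-", 900, 700),
]
paired_rate = duplicate_fraction(fragments, paired=True)
single_rate = duplicate_fraction(fragments, paired=False)
print(paired_rate, single_rate)
```

In this toy set the single-end key would discard a fragment that the paired-end key recognizes as distinct, which is why the paired-end level is the safer choice.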

21
Q

What are some concerns about PCR duplicates in RNA-seq?

A
  • may be due to biased PCR amplification of certain fragments (same with ChIP-seq)
  • duplicates with no PCR bias are expected in highly expressed short genes (over-representation actually reflective of biology)
  • removing duplicates for short or highly expressed genes compresses the top end of their expression (reduces dynamic range of experiment)
22
Q

What is an appropriate depth for RNA-sequencing?

A

Depends on:

  1. research question
  2. Tissue type, RNA preparation, quality of input RNA, library construction method
  3. sequencing type: read length, paired vs unpaired
  4. computational approach and resources
  5. depths used in similar publications
  6. results of pilot experiments
23
Q

What is the standard read depth for reference mapping?

A

200 million reads

24
Q

Describe the RNA-seq workflow when a reference genome is available

A
  • Reads are mapped to the reference genome using a gapped aligner that can deal with exon/intron boundaries
  • Novel transcript discovery and quantification can proceed without an annotation file (the annotation file gives coordinates of genes with respect to the reference genome and should usually be used)
25
Q

Describe the RNA-seq workflow when a reference transcriptome is available

A
  • Reads aligned to reference transcriptome using an ungapped aligner since transcriptome already has introns spliced out
  • transcript identification and quantification can occur simultaneously since read counts are now directly associated with the transcripts
26
Q

Describe the RNA-seq workflow when no reference genome is available

A
  • reads need to be assembled into contigs or transcripts (software can stitch reads back together)
  • assembled contigs are contained in FASTA file
  • for quantification reads are mapped/aligned to assembled contigs using ungapped aligner
27
Q

Describe the basic workflow in RNA-seq

A
  1. RNA library generated
  2. library mapped to reference genome (STAR)
  3. Transcript read count table generated (HTSeq)
28
Q

Why are exon-exon junctions challenging in RNA-seq?

A

Reads that cross an exon-exon junction span regions where introns exist in the reference genome, so aligning them is a computational challenge

29
Q

What are the two approaches to RNAseq exon-exon alignment?

A
  1. exon-first approach
  2. seed-extend approach

30
Q

Describe how exon-first approach works

A

reads first aligned to exons and then any reads that didn’t align would be aligned to a reference of exon-exon junctions

31
Q

Describe how the seed-extend approach works

A

The read is split into multiple seeds, which allows the aligner to place different seeds from the same read at different locations in the genome.

This makes it possible to align reads that span exon-exon junctions

32
Q

What is a pro and con of STAR?

A

high accuracy and speed compared to other aligners but very memory intensive (requires a lot of RAM)

33
Q

What does STAR stand for?

A

Spliced Transcript Alignment to a Reference

34
Q

What are the 2 steps in STAR alignment?

A
  1. seed searching: start with 5’ seed, extend, clip, repeat
    5’ seed is extended looking for an exact match, until there is no longer a match, and clips the read
  2. Clustering, stitching, and scoring
35
Q

Describe how seed searching works in STAR

A

For each read, STAR searches for the longest sequence that exactly matches one or more locations in the reference genome. Once the sequence no longer matches, the read is clipped at that point and a new seed is started
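
The search described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses a naive string scan instead of the suffix-array index STAR builds, and the toy genome, read, and function names are invented for the example.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, positions) of the longest prefix of `read`
    that occurs exactly in `genome` (naive scan, for illustration only)."""
    best_len, positions = 0, []
    for length in range(1, len(read) + 1):
        prefix = read[:length]
        hits = [i for i in range(len(genome) - length + 1)
                if genome.startswith(prefix, i)]
        if not hits:
            break  # the prefix can no longer be extended
        best_len, positions = length, hits
    return best_len, positions

def seed_search(read, genome):
    """Split a read into consecutive seeds, each a maximal exact match."""
    seeds, offset = [], 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            offset += 1  # skip a base that matches nowhere
            continue
        seeds.append((offset, length, positions))
        offset += length  # start the next seed where the previous one stopped
    return seeds

# Toy genome: exon 1 (9 bp) + intron (12 bp) + exon 2 (9 bp).
genome = "GATTACAGC" + "GTAAGTTTTCAG" + "CCTGAGGCA"
# Toy read: last 5 bases of exon 1 joined to the first 5 bases of exon 2.
read = "ACAGC" + "CCTGA"
seeds = seed_search(read, genome)
print(seeds)
```

The first seed stops where the read runs into the intron, and the second seed picks up in the next exon. The gap between the two genomic positions corresponds to the intron.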

36
Q

What are MMPs?

A

Maximal mappable prefixes

The longest matching sequence in a read to the reference genome

37
Q

What happens if an exact match is not found for a seed due to indel/mutation/variant? (STAR)

A

The seed gets clipped sooner. Another MMP is then generated, and new MMPs continue to be generated to deal with any mismatches

38
Q

Describe how clustering, stitching and scoring works in STAR

A

Seeds are stitched together to make a complete read alignment

  1. seeds are clustered together based on how close they are to a set of anchor seeds or uniquely mapped seeds (MMPs stitched together based on location next to each other in the genome)
  2. Seeds are stitched together based on the best alignment score for the stitched read (scored on mismatches, indels, and gaps); depends on the parameters set for the acceptable gap length
39
Q

What happens if the seed extension exceeds the alignment mismatch threshold?

A

The mismatched sequence (such as a poor-quality stretch of the read or an adapter sequence) will be soft clipped.
The read is still saved in the BAM file, but the CIGAR string will indicate that the read has been clipped
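
As a small illustration of how clipping shows up in a CIGAR string, the following Python sketch parses the string and counts soft-clipped bases. The operation letters follow the SAM specification (`S` is a soft clip, `N` is a skipped region such as an intron); the function names and the example string are made up for this card.

```python
import re

# One CIGAR operation is a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

def soft_clipped_bases(cigar):
    """Count read bases that were soft clipped (kept in the record, not aligned)."""
    return sum(length for length, op in parse_cigar(cigar) if op == "S")

def spliced_gaps(cigar):
    """Return the lengths of skipped regions (N), i.e. putative introns."""
    return [length for length, op in parse_cigar(cigar) if op == "N"]

# Hypothetical alignment: 5 clipped bases, 40 aligned, a 1200 bp intron, 55 aligned.
example = "5S40M1200N55M"
print(parse_cigar(example))
print(soft_clipped_bases(example), spliced_gaps(example))
```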

40
Q

What are the 2 steps to align reads using STAR?

A
  1. create a genome index (includes annotation info used in seed stitching step)
  2. align fastq files to indexed genome
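
A hedged sketch of the two steps, written as Python functions that only build the STAR command lines (nothing is executed). The file names, thread count, and read length are placeholder assumptions; the options shown are standard STAR parameters, but check the STAR manual for the values appropriate to a given data set.

```python
import shlex

def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Step 1: argument list for generating the genome index."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # Commonly set to read length minus 1.
        "--sjdbOverhang", str(read_length - 1),
    ]

def star_align_command(genome_dir, fastq_files, prefix, threads=4):
    """Step 2: argument list for aligning FASTQ files to the indexed genome."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]

# Placeholder paths, assumed for illustration.
index_cmd = star_index_command("star_index", "genome.fa", "genes.gtf")
align_cmd = star_align_command(
    "star_index", ["sample_R1.fastq", "sample_R2.fastq"], "sample_"
)
print(shlex.join(index_cmd))
print(shlex.join(align_cmd))
```

Either list could be passed to `subprocess.run` on a machine where STAR is installed; building the list first makes the command easy to inspect and log.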
41
Q

What are the 2 files required to generate a STAR genome index file?

A
  1. Reference genome FASTA file
  2. Gene annotation file in GTF or GFF format

42
Q

What are gene annotations?

A

the plotting of genes onto genome assemblies, and indexing their genomic coordinates

43
Q

What are coordinates?

A

All the start and end positions for all the exons/transcripts in the reference build

44
Q

Which reference build is most suitable for RNA-seq alignment?

A

The primary assembly

45
Q

What are repeat masks?

A

Regions of the genome that are repetitive have been masked (stripped out of the sequence and replaced with Ns)

46
Q

What is the purpose of repeat masks and soft masks?

A

For heuristics. Masking helps improve computational speed and makes it easier to map.

When you have a region of interest in a transcript, masking makes sure it doesn’t align to other pseudogene regions where the repeat exists, so it can be quantified more accurately

47
Q

What is soft masking?

A

Sequence is changed from upper case to lower case
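
The two kinds of masking can be compared with a short Python sketch. The interval convention (0-based, end-exclusive), the function names, and the toy sequence are assumptions made for the example.

```python
def soft_mask(sequence, repeats):
    """Lower-case the bases inside each repeat interval (0-based, end-exclusive)."""
    bases = list(sequence.upper())
    for start, end in repeats:
        bases[start:end] = [b.lower() for b in bases[start:end]]
    return "".join(bases)

def hard_mask(sequence, repeats):
    """Replace the bases inside each repeat interval with N."""
    bases = list(sequence.upper())
    for start, end in repeats:
        bases[start:end] = ["N"] * len(bases[start:end])
    return "".join(bases)

# Toy sequence with a dinucleotide repeat at positions 4..11.
sequence = "ACGTACACACACGGTT"
repeats = [(4, 12)]
soft = soft_mask(sequence, repeats)
hard = hard_mask(sequence, repeats)
print(soft)
print(hard)
```

Soft masking keeps the sequence information (a tool may choose to ignore or use the lower-case bases), while hard masking discards it.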

48
Q

What does the top level reference genome selection contain?

A
  • contains all sequence regions flagged as top-level in an Ensembl schema
  • includes chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions
49
Q

What are haplotypes?

A

regions where there is divergence from the reference

50
Q

What does the primary assembly reference genome selection contain?

A
  • all the top-level sequence regions except haplotypes and patches
51
Q

What does it mean if the primary assembly file is not present?

A

there are no haplotypes/patch regions in the reference and the primary and top level are equivalent

52
Q

What is the primary assembly best used for?

A

Performing sequence similarity searches where patch and haplotype sequences could confuse the analysis

53
Q

Name 3 gene annotation sets

A

Ensembl
Consensus CDS (CCDS)
RefSeq

54
Q

What format are annotation files in?

A

GTF (Gene Transfer Format) or GFF (General Feature Format)

55
Q

What do GTF and GFF files contain?

A

Consist of one line per feature, each containing 9 columns of data, plus optional track definition lines

Contain genomic coordinates and a description of the gene
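
To make the 9-column layout concrete, here is a minimal Python sketch that parses a single GTF line. The example line, gene and transcript IDs, and helper names are hypothetical. GTF coordinates are 1-based with both start and end included, so a feature's length is end - start + 1.

```python
import re
from typing import NamedTuple

class GtfFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: dict

# Attributes look like: key "value"; key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_line(line):
    """Parse one tab-separated GTF feature line into its 9 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GtfFeature(seqname, source, feature, int(start), int(end),
                      score, strand, frame, dict(ATTRIBUTE.findall(attrs)))

def feature_length(feature):
    """Length of a feature whose start and end are both inclusive."""
    return feature.end - feature.start + 1

# Hypothetical exon record, invented for illustration.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";'
exon = parse_gtf_line(line)
print(exon.feature, exon.start, exon.end, feature_length(exon))
print(exon.attributes)
```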

56
Q

What does fully closed mean?

A

Both the start and end positions are included

57
Q

What happens if you run genome indexing without the GTF file?

A

Clustering and stitching of reads will not be informed based on positions of exons

58
Q

What are the outputs of STAR?

A
  1. Aligned.sortedByCoord.out.bam (aligned reads in standard BAM format, sorted by coordinates)
  2. Log.out (main log file with detailed info used for troubleshooting)
  3. Log.final.out (summary mapping statistics for quality control)
  4. Log.progress.out (job progress per minute)
  5. SJ.out.tab (high-confidence collapsed splice junctions)
59
Q

Where can the Log.final.out file be visualized & what percentage of uniquely mapped reads is expected for mouse/human?

A
  • visualized in MultiQC
  • ~60-90%

60
Q

Where can the STAR BAM file be visualized?

A

IGV;

  • need GTF or GFF
  • can see split read alignments, coverage values of the expression of each exon
61
Q

What are common features used to assess the quality of read alignment?

A
  • 3’ and 5’ bias (using only poly-dT to enrich introduces a 3’ bias)
  • nucleotide content
  • base/read quality
  • sequencing depth
  • base distribution
  • insert size distribution
62
Q

How can post-alignment QC be performed?

A

Quality of RNA-seq toolset (QoRTs) or other packages

63
Q

What does HTSeq do?

A

Takes the aligned read coordinates from the coordinate-sorted BAM file (STAR output) and generates read counts for downstream processing

64
Q

How does HTSeq work?

A
  • Takes a BAM file and list of gene locations (GTF file) & counts how many reads map to each gene (gene = union of all its exons)
  • multimapping reads & ambiguous reads removed
  • 3 modes to handle reads that overlap several genes (union, intersection-strict, intersection-nonempty)
65
Q

For HTSeq, why do you need to know if library prep was done with a stranded or unstranded protocol?

A

HTSeq takes this information as a parameter, and if this input is wrong, the read counts will be wrong
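
A rough Python sketch of why the setting matters. It assumes the three values accepted by htseq-count's stranded option (`yes`, `no`, `reverse`) and considers only single-end reads (or read 1 of a pair); the function name is hypothetical. With a dUTP second-strand protocol, read 1 aligns to the strand opposite the gene, which corresponds to `reverse`.

```python
def inferred_gene_strand(read_strand, stranded):
    """Return the strand a read is counted against, or None if strand is ignored.

    read_strand: "+" or "-" (strand the read aligned to)
    stranded:    "yes", "no", or "reverse"
    """
    if read_strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {read_strand}")
    if stranded == "no":
        return None  # unstranded library: genes on either strand can match
    if stranded == "yes":
        return read_strand  # read has the same orientation as the transcript
    if stranded == "reverse":
        return "-" if read_strand == "+" else "+"  # read is the reverse complement
    raise ValueError(f"unknown stranded setting: {stranded}")

for setting in ("yes", "no", "reverse"):
    print(setting, inferred_gene_strand("+", setting))
```

If a reverse-stranded library is counted with `yes`, reads are matched against the wrong strand and most genes receive few or no counts.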

66
Q

What are ambiguous reads?

A

Reads that align to a region where 2 genes on different strands overlap. Because the read is unstranded we don’t know which gene it came from

67
Q

What are multimapping reads?

A

reads that align to 2 different genes with the same quality (MQ=0)
Doesn’t have anything to do with strandedness

68
Q

How do HTSeq count modes affect the reported gene?

A
  1. if the read is all within gene A, all modes report gene A
  2. Overhang (read extends past the end of gene A): union & intersection-nonempty will report gene A; intersection-strict reports no_feature
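
The two cases above can be reproduced with a simplified Python sketch of the three overlap modes. It follows the set-based description in the HTSeq documentation (one set of overlapping genes per read position), but it is not HTSeq's actual code; the gene coordinates and function names are assumptions.

```python
def position_sets(read_start, read_end, genes):
    """For each base of the read, collect the genes overlapping that base.

    `genes` maps a gene name to an inclusive (start, end) interval.
    """
    return [
        {name for name, (start, end) in genes.items() if start <= pos <= end}
        for pos in range(read_start, read_end + 1)
    ]

def assign_read(sets, mode="union"):
    """Combine the per-position gene sets according to the chosen mode."""
    if mode == "union":
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        candidates = set.intersection(*sets)
    elif mode == "intersection-nonempty":
        nonempty = [s for s in sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not candidates:
        return "__no_feature"
    if len(candidates) > 1:
        return "__ambiguous"
    return next(iter(candidates))

# Hypothetical gene spanning positions 100..200.
genes = {"gene_A": (100, 200)}
modes = ["union", "intersection-strict", "intersection-nonempty"]
inside = {m: assign_read(position_sets(120, 150, genes), m) for m in modes}
overhang = {m: assign_read(position_sets(190, 220, genes), m) for m in modes}
print(inside)
print(overhang)
```

The strict mode is the only one that penalizes the overhanging read, because some of its positions overlap no gene at all.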

69
Q

What is contained in an HTSeq output?

A

A list of ENSG (gene) IDs and, for each gene, a sum of the counts for all the isoforms that are expressed for that gene

70
Q

What is the relationship between ENSGs and ENSTs?

A

ENSGs are the sum of all ENSTs (the sum of all isoforms for that particular gene)

71
Q

What are 3 common normalization strategies

A
  1. RPKM: reads per kilobase of transcript per million mapped reads (for single-end RNA-seq)
  2. FPKM: fragments per kilobase of transcript per million mapped fragments (for paired-end RNA-seq, where a read pair counts as one fragment)
  3. TPM: transcripts per million
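
A worked Python sketch of these calculations, using invented counts and gene lengths. It assumes the total number of mapped reads equals the sum of the counts in the toy table. FPKM uses the same formula as RPKM with fragment counts in place of read counts.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    Passing fragment counts instead of read counts gives FPKM.
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rates.values()) / 1e6
    return {g: rate / scale for g, rate in rates.items()}

# Toy data, invented for illustration only.
counts = {"gene_A": 100, "gene_B": 300, "gene_C": 600}
lengths_bp = {"gene_A": 1000, "gene_B": 3000, "gene_C": 2000}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values)
```

gene_A and gene_B end up with the same normalized value even though gene_B has three times the reads, because gene_B is three times longer. TPM values always sum to one million within a sample, which makes proportions easier to compare between samples than RPKM or FPKM.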