5.2 RNA seq Flashcards

Question

Describe the RNA-seq workflow when a reference transcriptome is available

Answer 1

- Reads are aligned to the reference transcriptome using an ungapped aligner, since the transcriptome already has introns spliced out
- Transcript identification and quantification can occur simultaneously, since each aligned read is directly associated with a transcript

Answer 2

- Reads need to be assembled into contigs or transcripts (software stitches the reads back together)
- Assembled contigs are stored in a FASTA file
- For quantification, reads are mapped/aligned to the assembled contigs using an ungapped aligner

Answer 3

1. RNA library generated
2. Library mapped to reference genome (STAR)
3. Transcript read count table generated (HTSeq)

Answer 4

Reads span regions where introns exist in the reference genome, so a single read may have to be split across distant exons, which makes alignment computationally challenging

Answer 5

1. exon-first approach | 2. seed-extend approach

Answer 6

Reads are first aligned to exons; any reads that do not align are then aligned to a reference of exon-exon junctions

Answer 7

The read is split into multiple seeds, which the aligner maps independently along the read. Seeds from one read can land in different exons, so the aligner can produce alignments that span exon-exon junctions

Answer 8

high accuracy and speed compared to other aligners but very memory intensive (requires a lot of RAM)

Answer 9

Spliced Transcripts Alignment to a Reference

Answer 10

1. Seed searching: start with a seed at the 5' end of the read and extend it while it matches the reference exactly; when the match ends, clip the read there and repeat with a new seed on the remainder
2. Clustering, stitching, and scoring

Answer 11

For each read, STAR searches for the longest sequence that exactly matches one or more locations in the reference genome. Once the sequence no longer matches, the read is clipped and a new seed is started

Answer 12

Maximal Mappable Prefix (MMP): the longest sequence in a read that exactly matches the reference genome
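
The seed search can be illustrated with a toy sketch. This is not STAR's implementation (STAR uses a suffix-array index; the naive scan below is an assumption made for readability), and the function names `maximal_mappable_prefix` and `seed_search` as well as the sequences are hypothetical.

```python
def maximal_mappable_prefix(read, genome, start=0):
    """Return (length, positions) of the longest prefix of read[start:]
    that occurs exactly somewhere in genome (naive scan)."""
    best_len, best_pos = 0, []
    length = 1
    while start + length <= len(read):
        seed = read[start:start + length]
        positions = [i for i in range(len(genome) - length + 1)
                     if genome.startswith(seed, i)]
        if not positions:
            break  # the seed no longer matches: clip here
        best_len, best_pos = length, positions
        length += 1
    return best_len, best_pos


def seed_search(read, genome):
    """Split a read into consecutive seeds as (read_offset, length, positions)."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read, genome, offset)
        if length == 0:
            offset += 1  # base absent from the genome: skip it in this toy version
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Toy genome: exon 1, a 10 bp "intron", exon 2, with two flanking bases per side.
exon1, intron, exon2 = "ACGGTCAT", "GTAAAAAAAG", "CCTGGATC"
genome = "TT" + exon1 + intron + exon2 + "TT"
read = exon1 + exon2  # a spliced read spanning the exon-exon junction
seeds = seed_search(read, genome)
```

Here the read yields two seeds, one per exon, and the distance between their genomic positions corresponds to the intron.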

Answer 13

The seed gets clipped sooner. Another MMP is generated, and MMPs continue to be generated to deal with any mismatches

Answer 14

Seeds are stitched together to make a complete read
1. Seeds are clustered based on how close they are to a set of anchor seeds, i.e. uniquely mapped seeds (MMPs are grouped by their proximity in the genome)
2. Seeds are stitched together based on the best alignment score for the stitched read (mismatches, indels, and gaps); this depends on the parameters set for the gap length allowed between seeds

Answer 15

The mismatched sequence (for example, a poor-quality stretch or an adapter sequence) will be soft clipped. The read is still saved in the BAM file, but the CIGAR string will indicate that the read has been clipped
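
A minimal sketch of how soft clipping shows up in a CIGAR string, where `S` marks soft-clipped bases and `N` a skipped region such as an intron. The helper names and the example CIGAR are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn a CIGAR string into a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def soft_clipped_bases(cigar):
    """Count bases that remain in the read sequence but are excluded from the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op == "S")


# Hypothetical spliced read with 5 soft-clipped bases at each end.
example = "5S40M1000N50M5S"
ops = parse_cigar(example)
clipped = soft_clipped_bases(example)
```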

Answer 16

1. Create a genome index (includes annotation info used in the seed stitching step)
2. Align FASTQ files to the indexed genome
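
The two steps can be sketched as STAR command lines assembled in Python. The STAR options shown are real flags, but the wrapper functions, thread count, and all paths are hypothetical, and a real run would likely need further options (for example for compressed FASTQ input).

```python
def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Build the argument list for step 1: generating the genome index."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,                    # annotation used for splice junctions
        "--sjdbOverhang", str(read_length - 1),  # read length minus 1
    ]


def star_align_command(genome_dir, fastq, prefix, threads=4):
    """Build the argument list for step 2: aligning reads to the indexed genome."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]


# Hypothetical paths; pass either list to subprocess.run(cmd, check=True) to execute it.
index_cmd = star_index_command("star_index", "genome.fa", "genes.gtf", read_length=100)
align_cmd = star_align_command("star_index", "sample.fastq", "sample_")
```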

Answer 17

1. Reference genome FASTA file | 2. Gene annotation file in GTF or GFF format

Answer 18

The mapping of genes onto genome assemblies and the indexing of their genomic coordinates

Answer 19

All the start and end positions for all the exons/transcripts in the reference build

Answer 20

The primary assembly

Answer 21

Repetitive regions of the genome have been masked (stripped out of the sequence and replaced with Ns)

Answer 22

For heuristics. Masking improves computational speed and makes mapping easier. When a transcript contains a region of interest, masking ensures its reads do not align to other regions, such as pseudogenes, where the same repeat exists, so the transcript can be quantified more accurately

Answer 23

Sequence is changed from upper case to lower case
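
A small sketch of the relationship between the two masking styles, assuming a soft-masked sequence (repeats in lower case) as input. The function names and the sequence are hypothetical.

```python
def hard_mask(soft_masked_seq):
    """Convert a soft-masked sequence (repeats in lower case) into a
    hard-masked one (repeats replaced with N)."""
    return "".join("N" if base.islower() else base for base in soft_masked_seq)


def masked_fraction(soft_masked_seq):
    """Fraction of bases flagged as repetitive in a soft-masked sequence."""
    if not soft_masked_seq:
        return 0.0
    return sum(base.islower() for base in soft_masked_seq) / len(soft_masked_seq)


soft = "ACGTacgtACGT"      # hypothetical sequence with a 4-base repeat
hard = hard_mask(soft)     # the repeat becomes NNNN
```

Soft masking keeps the original bases recoverable (upper-casing restores them), whereas hard masking discards them.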

Answer 24

- Contains all sequence regions flagged as top-level in an Ensembl schema
- Includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions

Answer 25

regions where there is divergence from the reference

Answer 26

- all the top level sequence regions except haplotypes, and patches

Answer 27

there are no haplotypes/patch regions in the reference and the primary and top level are equivalent

Answer 28

Performing sequence similarity searches, where patch and haplotype sequences could confuse the analysis

Answer 29

Ensembl | Consensus CDS (CCDS) | RefSeq

Answer 30

GTF (Gene Transfer Format) or GFF (General Feature Format)

Answer 31

Consists of one line per feature, each containing 9 columns of data, plus optional track definition lines; contains the genomic coordinates and a description of the gene

Answer 32

Both the start and end positions are included
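
Because both the start and end positions are included, a feature's length is end - start + 1. The sketch below parses one 9-column GTF line; the function names and the example line (including its IDs and coordinates) are hypothetical.

```python
def parse_gtf_line(line):
    """Parse one tab-separated GTF feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    attrs = {}
    # The attribute column holds 'key "value";' pairs.
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "frame": frame,
        "attributes": attrs,
    }


def feature_length(record):
    """Length in bases; start and end are both inclusive."""
    return record["end"] - record["start"] + 1


# Hypothetical exon record.
line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";'
record = parse_gtf_line(line)
length = feature_length(record)
```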

Answer 33

Clustering and stitching of reads will not be informed by the known positions of exons

Answer 34

1. Aligned.sortedByCoord.out.bam (aligned reads in standard BAM format, sorted by coordinates)
2. Log.out (main log file with detailed info used for troubleshooting)
3. Log.final.out (summary mapping statistics for quality control)
4. Log.progress.out (job progress per minute)
5. SJ.out.tab (highly confident collapsed splice junctions)
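
The summary statistics file can also be read programmatically. The sketch below assumes the `key | value` line layout used by STAR's Log.final.out; the function names are hypothetical and the sample text is illustrative, not output from a real run.

```python
def parse_star_log(text):
    """Collect 'key | value' lines from a STAR Log.final.out style report."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


def uniquely_mapped_percent(stats):
    """Return the uniquely mapped percentage as a float."""
    return float(stats["Uniquely mapped reads %"].rstrip("%"))


# Illustrative excerpt with made-up numbers.
sample_log = (
    "                          Number of input reads |\t1000\n"
    "                                  UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t850\n"
    "                        Uniquely mapped reads % |\t85.00%\n"
)
stats = parse_star_log(sample_log)
unique_pct = uniquely_mapped_percent(stats)
```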

Answer 35

- Visualized in MultiQC | - ~60-90%

Answer 36

IGV
- Needs a GTF or GFF file
- Shows split-read alignments and per-exon coverage values (the expression of each exon)

Answer 37

- 3' and 5' bias (enriching with poly-T alone introduces a 3' bias)
- Nucleotide content
- Base/read quality
- Sequencing depth
- Base distribution
- Insert size distribution

Answer 38

Quality of RNA-seq toolset (QoRTs) or other packages

Answer 39

takes the coordinates from the sorted.bam file (STAR output) and generates read counts for downstream processing

Answer 40

- Takes a BAM file and a list of gene locations (GTF file) and counts how many reads map to each gene (gene = union of all its exons)
- Multimapping reads and ambiguous reads are removed
- 3 modes to handle reads that overlap several genes (union, intersection-strict, intersection-nonempty)

Answer 41

HTSeq takes this information as a parameter; if the setting is wrong, the output will be wrong

Answer 42

Where a read aligns to a region where 2 genes on different strands overlap. Because the read is unstranded, we don't know which gene it came from

Answer 43

Reads that align to 2 different genes with the same quality (MQ=0). This has nothing to do with strandedness

Answer 44

1. If the read lies entirely within gene A, all modes report gene A | 2. Overhang (the read extends past the end of gene A): union and intersection_nonempty will report gene A; intersection_strict reports no_feature
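
The three overlap modes can be sketched as set operations over the features covering each aligned base of a read. This is a simplified model, not HTSeq's code; the function name `assign_read` and the inputs are hypothetical.

```python
def assign_read(position_features, mode="union"):
    """Assign a read to a feature.

    position_features: one set per aligned base, holding the names of the
    features that overlap that base (an empty set means no feature there).
    """
    if mode == "union":
        candidates = set().union(*position_features)
    elif mode == "intersection_strict":
        candidates = set.intersection(*position_features) if position_features else set()
    elif mode == "intersection_nonempty":
        nonempty = [s for s in position_features if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError("unknown mode: %s" % mode)

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


MODES = ("union", "intersection_strict", "intersection_nonempty")

within = [{"A"}] * 5                  # read entirely inside gene A
overhang = [{"A"}] * 3 + [set()] * 2  # read runs past the end of gene A
both = [{"A", "B"}] * 5               # read inside two overlapping genes

results = {
    name: [assign_read(case, mode) for mode in MODES]
    for name, case in (("within", within), ("overhang", overhang), ("both", both))
}
```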

Answer 45

A list of ENSG (gene) IDs, each with the sum of all the isoforms that are expressed for that gene

Answer 46

ENSGs are the sum of all ENSTs (the sum of all isoforms for that particular gene)

Answer 47

1. RPKM: reads per kilobase of transcript per million mapped reads (for single-end RNA-seq)
2. FPKM: fragments per kilobase of transcript per million mapped reads (for paired-end RNA-seq, where the two reads of a pair count as one fragment)
3. TPM: transcripts per million
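
A minimal sketch of the two calculations, assuming that the total number of mapped reads is the sum of the counts supplied (in practice it would come from the alignment statistics). The function names, gene names, counts, and lengths are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    Assumes the total mapped reads equal the sum of the supplied counts.
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}


# Hypothetical data: gene B is 3x longer and has 3x the reads of gene A.
counts = {"A": 100, "B": 300}
lengths_bp = {"A": 1000, "B": 3000}
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

In this example both genes get the same RPKM and the same TPM, since the extra reads for gene B are explained by its length. TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM or FPKM.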

(71 cards)