Bioinformatics Exam Flashcards
AlphaFold 3
Predict the joint structure of complexes including proteins, nucleic acids, small molecules, ions, etc.
HOMER2
Show that the effect of transcription factor binding on transcription initiation is position dependent
How do we acquire our DNA sample for DNA sequencing?
- Start with a bacterial culture to produce the product of interest
– Biotechnology frequently uses massive E. coli cultures for production
- Separate cells from media
– Centrifuge and separate cells and media
– Keep the component of interest (DNA)
– Break open the cells by lysing them (chemical lysis destabilizes the lipid bilayer and denatures proteins)
- Isolate and purify our DNA
– phenol-chloroform extraction (liquid-liquid separation)
– Aqueous DNA/RNA on top
– Lipids/large molecules on the bottom
Surfactants VS Phospholipids
- both contain a hydrophilic head and hydrophobic tail
– surfactants have only one hydrophobic tail, which allows them to penetrate molecular structures further than phospholipids with 2 tails
– they break the phospholipid barrier more readily and destabilize proteins (used for chemical lysis)
260 nm DNA sample absorbance
- absorbance at 260 nm is correlated to the DNA concentration of the sample
— looks for impurities in the sample solution
— can assume we have purified DNA sample after this step
— based on the absorbance of UV irradiation (Beer-Lambert law)
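A minimal Python sketch of this conversion, using the standard convention that an A260 of 1.0 corresponds to ~50 ng/µL of dsDNA (function names and example values are illustrative):

```python
# Minimal sketch: estimating dsDNA concentration and purity from UV
# absorbance, with the Beer-Lambert extinction coefficient folded into
# the standard ~50 ng/uL-per-A260-unit conversion for dsDNA.

def dsdna_concentration_ng_per_ul(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration (ng/uL) from absorbance at 260 nm."""
    return a260 * 50.0 * dilution_factor

def purity_ratio(a260: float, a280: float) -> float:
    """A260/A280 ratio; ~1.8 is conventionally taken as 'pure' DNA."""
    return a260 / a280

print(dsdna_concentration_ng_per_ul(0.75))   # 37.5 ng/uL
print(round(purity_ratio(0.75, 0.42), 2))    # ~1.79 -> acceptably pure
```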
Main purpose of Sanger Sequencing
— determine the precise ordering of nucleotides
DNA elongation
- occurs rapidly and continuously
- use DNA polymerase and excess nucleotides to make copies of DNA
- requires 3’ OH to add another nucleotide to the chain
Di-deoxynucleotides (ddNTP)
- ddNTPs stop replication
- do not have a 3’ OH for continued elongation
- usually a 1:100 ddNTP:dNTP ratio
*** left with DNA strands of variable length
Sanger sequencing process
- sort DNA fragments by length to see what the last nucleotide is
– lower ddNTP concentration results in longer strands
– higher concentration of ddNTP results in shorter strands
*** by sorting fragments by length, we can see what the last nucleotide was (line up 5’ nucleotide)
— get the template strand
Original Sanger Sequencing SetUp
- split DNA sample into 4 beakers
- Add a ddNTP into each beaker (A,T,C,G)
- Add some radioactive ddNTP into a single beaker
- Add Taq and run PCR
** separate by length in gel electrophoresis
(larger fragments do not travel as far)
– order from farthest traveled (shortest) to least traveled (longest)
***** need SEPARATE beakers bc you cannot differentiate between radioactive nucleotides
Sanger Sequencing Now
- now use fluorescent tags to distinguish ddNTPs
- only need one beaker for PCR
- also automate fragment separation
capillary gel electrophoresis
- can accelerate fragment length sorting and detection
- separates molecules by size based on their charge-to-mass ratio
- Smaller molecules move more freely/faster through the gel than larger molecules
- molecules must be charged through tagging with a charged molecule
- DNA and RNA are charged bc each nucleotide has a charge
SanSeq Chromatogram
- unique fluorescence signal per ddNTP produces a chromatogram
ideal SanSeq Chromatogram
- variation in peak height is less than 3-fold
- peaks are evenly distributed
- peaks contain only 1 color
- absent baseline noise
- interpreted nucleotide sequence is 5’ to 3’
Nonideal SanSeq Chromatogram
- significant noise in the first ~20 bp reflects unreliable migration
- dye blobs from unused ddNTPs
- fewer longer fragments so signal is weaker
SanSeq VS Illumina Sequencing
- Sanger sequencing is very accurate but slow compared to Illumina
Illumina Sequencing
- sequencing by synthesis
- uses a polymerase/ligase enzyme to incorporate nucleotides with a fluorescent tag (fluorescently labeled reversible terminator)
- tags are then identified to determine the DNA sequence
Illumina Sequencing Process
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
– bridge amplification creates double-stranded bridges
– double-stranded clonal bridges are denatured with cleaved reverse strands
***clusters will give off a stronger signal compared to a single fragment
We repeatedly →
1. Add nucleotide
2. Capture signal
3. Cleave fluorophore
5 step Illumina sequencing process
1. Add labeled dNTPs into the flow cell
2. Incorporate a complementary nucleotide
3. Remove unincorporated fluorescent nucleotides
4. Capture fluorescent signal & image clusters
5. Remove the fluorophores and the protecting group
paired-end sequencing
- enables both ends of the DNA fragment to be sequenced
– Because the distance between each paired read is known, alignment algorithms can use this information to map the reads over repetitive regions more precisely.
***Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
Nanopore Sequencing Technology
- nanopore and polymer membrane respond to electrical perturbations
*** gives us much longer reads which is important for assembling reads into a genome
** type of third-generation sequencing (TGS)
- can give long reads with no amplification
- Direct detection of epigenetic modifications on native DNA.
- sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms.
- Uniform coverage of the genome; not as sensitive to GC content as short-read platforms.
genome assembly
- process of combining the short, overlapping sequencing reads into continuous DNA sequence
– having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming from experimentation
contigs
continuous stretches of DNA sequence from overlapping sequencing reads
ambiguous assembly
connecting contigs in an unknown order
- accounts for differences in scaffolds
- assemble using reference genome
scaffolds
multiple overlapping contigs with estimated gaps put together in a known order
Assembly quality metrics
- sort contigs from longest to shortest
- Find point when you have ~50% of genome
- then annotate our genome with exons and introns
L50
- number of longest contigs whose combined length is at least 50% of the genome length
*** Lower is better for L50 value
- longer contigs = more confidence that genome is right (higher quality assembly)
N50
- length of the shortest contig among the set of longest contigs that together cover 50% of the total genome length
*** Higher is better for N50 value [median contig size = reliability factor]
- N50 is the length of shortest contig in L50 assembly
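A minimal Python sketch of the N50/L50 calculation described in these cards (toy contig lengths):

```python
# Minimal sketch: N50 = length of the shortest contig such that contigs
# of that length or longer cover >= 50% of the total assembly;
# L50 = how many contigs that takes (higher N50 / lower L50 = better).

def n50_l50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)  # longest to shortest
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i  # (N50, L50)

contigs = [1200, 900, 700, 400, 300, 100]
print(n50_l50(contigs))  # (900, 2): two contigs reach 50% of the 3600 bp total
```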
why clean sequencing reads?
- improves assembly
- Garbage in = garbage out
FASTA files
- store sequences
- One line starts with a “>” and a sequence ID code
— It is optionally followed by a description of the sequence
- One or more lines containing the sequence itself.
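A minimal Python sketch of a parser for this layout (the file name is hypothetical):

```python
# Minimal sketch of a FASTA parser: header lines start with ">" (ID,
# then an optional description); the sequence may wrap over many lines.

def read_fasta(path):
    records = {}
    seq_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                seq_id = line[1:].split()[0]  # ID is the first token
                records[seq_id] = []
            elif seq_id is not None:
                records[seq_id].append(line)  # collect wrapped sequence lines
    return {sid: "".join(parts) for sid, parts in records.items()}

# seqs = read_fasta("contigs.fasta")  # hypothetical file
```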
However…base calling is NOT perfect
lagging synthesis
- caused by failure to remove the blocking fluorophore
- synthesis is behind by 1 because the blocking fluorophore was not removed
leading synthesis
- caused by addition of a dNTP instead of a ddNTP
- synthesis is ahead by 1 nucleotide because 2 were added at once
signal cross talk
- degrades the quality of assembly
- clean = clear
- noisy = blurry
- ML models and algorithms compute the probability of error → i.e. quality
— not confident that what it is seeing is purely blue, green, etc.
FASTQ files
- store sequence and quality
- quality scores measure the probability that a base is called incorrectly
ASCII-encoded probabilities
- avoid storing a float per nucleotide
- ASCII characters require ~¼ the memory of a float, and we already have to store nucleotides as characters
- each ASCII character has an associated integer
(phred quality (Q))
Phred quality P(Q)
the integer associated with the ASCII symbol
- indicates the probability that an error has occurred
– smallest value is 33 because lower ASCII codes cannot be rendered on screen
– ! = probability of error = 1 (very bad quality)
*** The further down the ASCII chart you go, the higher the quality of the read; less likely for there to be errors in the base call
calculating phred quality
P(Q) = 10^(-Q/10), where Q = ASCII value of the quality character − 33
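A minimal Python sketch of Phred+33 decoding using the formula above:

```python
# Minimal sketch: decoding Phred+33 quality characters from a FASTQ file.
# Q = ASCII value - 33, and the error probability is P = 10**(-Q / 10).
# '!' decodes to Q = 0, i.e., P(error) = 1 (the worst possible call).

def phred_quality(char: str) -> int:
    return ord(char) - 33

def error_probability(char: str) -> float:
    return 10 ** (-phred_quality(char) / 10)

for c in "!5I":
    print(c, phred_quality(c), error_probability(c))
# ! -> Q=0,  P=1.0
# 5 -> Q=20, P=0.01
# I -> Q=40, P=0.0001
```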
Where do FASTQ file entries go?
- NIH databases
- GenBank for genomic sequences
- Sequence read archive (SRA) for sequencing data
- RefSeq for reference genomes
- BioProject for curated resources for a specific project
quality issues in sequencing data
- errors are introduced due to the technical limitations of sequencing platforms
- adapters may be present if reads are longer than the fragments sequenced
— trimming adapters may improve the number of reads mapped
** quality control is an essential first step in any analysis
Per base sequence quality
- box and whisker plot of base-call accuracy
- green = excellent
- yellow = good
- red = poor
Per sequence GC content
- strong deviation from normal distribution could indicate contamination
- shows curves of GC count per read
- and theoretical distribution curve
**Compare the similarity of the two to indicate purity/quality of the sample
trimming and filtering
- trimming of problematic bases at the ends of reads must be done in order to reduce bias in future analyses
Trimming:
1. low quality score regions
2. beginning/end of sequence
3. remove adapters
Filtering:
1. reads with a low mean quality score
2. reads that are too short
3. reads with too many ambiguous (N) bases
Ex/ CutAdapt, Trimmomatic, FastP
adapters for trimming
- adapters are unique to DNA prep protocol and technology
- note which specified adapter sequences are used for trimming
automatically cleaning data for quality
processing data many times consumes many resources SO combine tool features into runs
Instead of trimming adapters in one run and quality in another, we can simultaneously remove base calls with low accuracy.
— Phred ≤ 20 → poor
— Length required = 20
- automatically removes low accuracy and short reads needed to assemble reads into quality contigs/scaffolds
resequencing
align reads to reference genome and identify variants
de novo assembly
construct genome sequence from overlaps between reads
*** done 99% of the time
- repeats/high coverage are the main challenges
Why do we want the shortest superstring?
- overlap maximization
- reduces redundancy
- maximizes confidence with highest overlaps
- repeat resolution resolves repeats by favoring collapsed arrangements
- evolutionary pressure: most genomes are under selective pressure to be efficient
greedy algorithm
- merge strings by highest overlap
Procedure →
1. Merge strings one at a time, keeping 5’ and 3’ orientation consistent
2. Always merge the largest overlap (greedy), regardless of fragment size
3. Repeat
*** Being greedy makes genome assembly tractable
*** not used in practice but helps us to understand the problem
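A toy Python sketch of this greedy procedure, assuming exact suffix/prefix overlaps and first-encountered tie-breaking (illustrative only, not a real assembler):

```python
# Toy sketch of greedy assembly: repeatedly merge the pair of reads with
# the largest exact suffix/prefix overlap (5'->3' orientation kept
# consistent) until no overlaps remain.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:      # no overlaps left: stop (would yield multiple contigs)
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGC", "GGCAT", "CATTA"]))  # ['ATGGCATTA']
```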
What happens if we have a tie for the greedy algorithm?
- Choose randomly (first encountered, first merged)
- Choose highest quality base call (use sequence with highest quality)
- Choose highest coverage (whichever results in more coverage)
- Look ahead (do both and evaluate consequence)
- Exclude (don’t merge at all)
repeats ruined our assembly
- missing strings can result from the greedy assembly process
- get the correct string back by increasing our k-mer size k
de Bruijn graph
- a graph is a data structure for representing relationships between items
- node = a single entity [(k-1)-mer]
- edge = a connection between entities (can have direction) [k-mer]
directed multigraphs for genome assembly
- genome assembly uses directed edges to specify overlap and concatenation
Building a directed multigraph
- each unique k-mer is a node. (k-mer = substring of length k)
- Add directed edges for each overlap and concatenation
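A minimal Python sketch of this construction, with nodes as (k-1)-mers and one directed edge per k-mer occurrence:

```python
# Minimal sketch: building a de Bruijn multigraph from reads. Each k-mer
# contributes one directed edge from its (k-1)-mer prefix to its
# (k-1)-mer suffix, so repeated k-mers add parallel edges.

from collections import defaultdict

def de_bruijn(reads, k):
    edges = defaultdict(list)  # node -> list of successor nodes (multigraph)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges

graph = de_bruijn(["ATGGCGT", "GGCGTGC"], k=3)
for node, succs in sorted(graph.items()):
    print(node, "->", ", ".join(succs))
# GG -> GC, GC and GC -> CG, CG: parallel edges from the repeated k-mers
```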
node is balanced if
indegree equals outdegree
cyclical sequence
- Circular genomes are not Eulerian
- Contains an extra edge
Why is this not Eulerian?
- more than two semi-balanced nodes
- cannot walk along each edge once
– if there was no overlap, then we would have some unconnected graphs
de Bruijn graphs and errors
- errors dramatically increase the number of edges and unconnected graphs
- errors affect k-mer counts
- Error correction should remove most tips and islands; rest can be removed here, leveraging graph structure
Graph traversal algorithms are used to extract contigs (procedure)
- Select a start node
- Walk along the graph until a dead end or previously visited node is reached
- Backtrack and explore alternative paths
- Repeat for remaining unvisited nodes
*** walking along the graph produces strings
how do we select a starting node?
using hubs with in and out degrees
high coverage
suggests that the node is likely a true sequence rather than an error
– confidence in that overlap is good and that node is a good starting point
How do you choose a walk
- Start a chosen vertex (node).
- Mark the current vertex as visited.
- Explore an adjacent unvisited vertex.
- If no unvisited adjacent vertices exist, backtrack to the last vertex with unvisited adjacent vertices.
How do we choose the “best” path for our contig?
- Long paths are desired but not always reliable due to potential repeats
- High, consistent read coverage
- Unique, non-branching paths
SPAdes
prokaryotic genome assembler
– based on DeBruijn graphs with numerous improvements
Error correction with BayesHammer
- Build Hamming graphs for k-mers
— Undirected edges for Hamming distances of n nucleotide differences
- Identify strong k-mers based on clustering (i.e. high similarity)
— Estimate read error based on base qualities
multisized graphs and SPAdes
- building multisized graphs with different k’s
- using multiple graphs with different sizes of K’s allows for handling of variable coverages
large K SPAdes graphs
- leads to fragmented graphs
- good for high-coverage
small K SPAdes graphs
- leads to collapsed, tangled graphs
- great for low-coverage regions (not too picky)
potential bulge in SPAdes graph
small, alternative path in the graph that diverges and then merges back into the main path
– due to sequencing errors, repetitive sequences, or small variations (indels)
** bulges must be removed, since they quickly deteriorate the graph, but naive removal loses read info
– so P’s info/coverage is projected onto Q
– P’s edges are removed in the process
potential tip in SPAdes graph
a short, dead-end path in the graph that does not connect back to the main sequence or structure
– result of sequencing errors, such as incomplete reads, low coverage, or random noise, which generate k-mers that don’t correctly align with the rest of the sequence
– Removes P (shortest) and projects information onto Q
Paired-end reads do not always cover our whole insert
- If our insert (i.e. DNA sample) is longer than reads, then we don’t sequence the inner distance.
- We want to maximize this inner distance.
- A gap between paired reads gives us insight into repeated regions.
SPAdes estimates…
- …estimates the gap length between 2 reads via de Bruijn graphs
- the gap does not always have to span a repeating sequence; estimation is more useful for gaps than for unique sequences
assembler graphs
- assemblers provide contigs and scaffolds
- island contains 1 or more contigs
- solid lines are called nodes and represent a contig
- each connection suggests how these contigs connect to form a scaffold
Ex/ Bandage
gene annotation
- identifying the genetic elements and function in our contigs
- results in sequences that likely encode for proteins
- 2 types: structural and functional
Ex/ Prokka (several outputs)
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
functional annotation
predicts the function of genetic elements
- normally based on protein database search
eukaryotic VS prokaryotic annotation
- Eukaryote annotation is significantly more challenging than prokaryote annotation
- Introns and alternative splicing complicate eukaryote annotation
P (prokaryotes): probabilistic models to identify open reading frames
E (eukaryotes): accuracy demands supporting evidence like mRNA sequencing
Identifying open reading frames (ORFs)
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
— TAA, TAG, TGA for bacteria, archaea, and plant plastids
***then score the potential ORFs
ribosomal binding site motif score
- RBS score computed from dataset fitting
– Search for the RBS motif once a start codon is found; choose whichever has the lowest bin number
– take training data from different annotated genomes to compute the frequency of each RBS motif bin across the entire sequence (baseline) and the RBS frequency
- start codon score given by a similar RBS framework
upstream score
- Upstream score based on base analysis
– By analyzing base frequency in specific upstream region, their annotation results improved
**essentially looking for promoters
coding score
- computed based on gene enrichment parameters
- computes the frequency of nucleotide hexamers, called “words”
– probability of observing word within single genes [G(w)]
– probability of observing word within the whole genome/entire DNA sequence [B(w)]
why is sequence alignment important for bioinformatics?
- Biological sequences reveal evolutionary relationships
- Sequences play a large role in the central dogma of DNA
hox genes
- highly conserved genes
- Play a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis
***So how do we know that it is highly conserved?
– By aligning sequences!!
– infrequent changes (high similarity) indicate evolutionarily conserved sequences
pairwise alignment
– reveals relationships between biological sequences
– Multiple Sequence Alignment (MSA) extends pairwise
Multiple Sequence Alignment (MSA)
- the process of aligning 3 or more biological sequences simultaneously
– Identifies conserved regions across multiple species
– Reveals patterns not visible in pairwise comparisons
Aligning sequences can provide more insight than just conservation
- Functional annotation (database searches)
- RNA and protein structure (ex/ alphafold)
- Disease-associated mutations
- Vaccine design
Importance of scoring in alignment selection
Alignment scores guide the selection of meaningful alignments
- objectivity
- optimization
- significance
objectivity importance for alignment selection
provides a quantitative measure for comparison
optimization importance for alignment selection
allows algorithms to find the best alignment
significance importance for alignment selection
helps distinguish real homology from random similarity
match
- identical characters in aligned positions
- Represents conserved regions or no change
alignment elements reflect…
evolutionary events in sequences
- matches, mismatches, gap
mismatch
- different characters in aligned positions
- Indicates substitutions or mutations
gap
- dash(-) inserted to improve alignment
- Represents insertions and deletions (indels)
linear gap penalty
- fixed cost for each gap
Ex/ -2 for each gap, regardless of length
affine gap penalty
- different costs for opening and extending gaps
Ex/ gap open= -4, gap extend= -1
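A minimal Python sketch contrasting the two gap models, using the example costs from these cards (here affine is taken as open + extend × (length − 1); conventions vary):

```python
# Minimal sketch: linear vs. affine gap penalties with the card's costs
# (-2 per gap position linear; -4 open / -1 extend affine).

def linear_gap(length: int, per_gap: int = -2) -> int:
    return per_gap * length

def affine_gap(length: int, gap_open: int = -4, gap_extend: int = -1) -> int:
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for L in (1, 5, 10):
    print(L, linear_gap(L), affine_gap(L))
# 1  -2  -4   affine charges more to open a gap,
# 5  -10 -8   but long gaps become cheaper than linear,
# 10 -20 -13  modeling a single multi-base indel event
```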
gap penalties
** reflect biological assumptions and impact alignment outcomes
Implications of Gap Penalty Types
1.) Linear penalties:
– Simpler to implement
– May over-penalize long gaps
2.) Affine penalties:
– Better handling of long indels
– More biologically realistic
3.) Biological rationale:
– Single mutation event often causes multi-base indel
– Affine penalties better model this biological reality
Sophisticated scoring approaches (gap penalties)
***Advanced scoring methods enhance alignment accuracy
1.) Position- specific gap penalties:
– Reduce penalties in variable regions
– Increase penalties in conserved regions
2.) Residue-specific gap penalties:
– Adjust penalties based on amino acid properties
3.) Terminal gap penalties:
– Often reduced to allow end gaps in local alignments
Protein alignments that require sophisticated scoring systems
***Simple match/mismatch scoring is insufficient bc:
- Some amino acid substitutions are more likely than others
- Chemically similar amino acids often substitute without affecting function
- Evolutionary relationships between amino acids are complex
substitution matrices
*quantify amino acid replacement probabilities
- probability that amino acid i mutates into amino acid j, for all pairs of amino acids
- Constructed by assembling a large and diverse sample of verified amino acid alignments
- Reflect the true probabilities of mutations occurring through a period of evolution
global alignment
- compares sequences in their entirety aka from START to END
***Needleman-Wunsch
Key Characteristics of Global Alignment
- Attempts to align every residue in both sequences
- Introduces gaps as necessary to maintain end-to-end alignment
- Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch
- guarantees optimal global sequence alignment
- Final number = final alignment score
- traceback to find the best alignment
***Look at every possible move you can make to get into that cell
– Diagonal = mismatch/match
– Side/up/down = gap
– MATCH ⇒ diagonal
*** There can be multiple optimal alignments
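A minimal Python sketch of the Needleman-Wunsch fill step with a linear gap penalty (scores are illustrative); the bottom-right cell holds the optimal global score:

```python
# Minimal sketch of Needleman-Wunsch: first row/column are all gaps,
# each cell takes the max over diagonal (match/mismatch), up, and left
# (gap) moves, and the final cell is the optimal global alignment score.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: all gaps
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: all gaps
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            F[i][j] = max(diag,               # match/mismatch
                          F[i-1][j] + gap,    # gap in b
                          F[i][j-1] + gap)    # gap in a
    return F[-1][-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # optimal global score
```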
Advantages of global alignment (needleman-wunsch)
- Provides a complete picture of sequence similarity
- Ideal for detecting overall conservation patterns
- Useful for phylogenetic analysis of related sequences
Limitations of Global Alignment
- May force alignment of unrelated regions in divergent sequence
- Less effective for sequences of very different lengths
- Can be computationally intensive for long sequences
local alignment
- identifies best matching subsequences
- focuses on finding regions of high similarity within sequences
- Does not require aligning entire sequences end-to-end
- Allows for identification of conserved regions or domains
Key Characteristics of Local Alignment (Smith-Waterman)
- Aligns subsections of sequences
- Ignores poorly matching regions
- Can find multiple areas of similarity in a single comparison
Smith-Waterman
- zero is the lowest score
- Start alignment at the highest-scoring cell
- Stop aligning when you encounter a zero
Needleman-Wunsch VS Smith-Waterman
1.) Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros
2.) Scoring system:
NW: allows negative scores
SW: negative scores are set to zero
3.) Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix
Protein motif identification
- exemplifies local alignment utility
- can identify functional regions:
– protein domains
– active sites
– binding motifs
– signal sequences
– post-translational modification sites
Multiple Sequence Alignment
- Compares three or more sequences simultaneously
Definition of MSA: arranges 3 or more biological sequences (DNA, RNA, or protein) to identify regions of similarity
- Aims to infer structural, functional, or evolutionary relationships among the sequences
Ex/ Clustal Omega, MAFFT, and MUSCLE
Key Characteristics of MSA
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence
Transcriptomics
- A real-time microscope
- Allows us to see exactly what genes are active at a given moment
- Can see gene expression changes over time
transcriptomics process
- study the complete set of RNA transcripts
- including mRNA, rRNA, tRNA, non-coding RNA
mRNA
instructions for protein synthesis
rRNA
forms part of the ribosome structure
tRNA
helps translate the genetic code into proteins
non-coding RNA
- play regulatory roles in the cell
genome VS transcriptome
G: relatively static
T: constantly changing and captures the cell’s response to its environment and internal signals
*** The dynamic nature of the transcriptome reflects the functional state of the cell
transcriptome can reflect:
- cell type
- developmental stage
- environmental conditions
*** allows us to see which annotated genes are actually being used
cell type
a neuron will have a different gene expression profile than a liver cell
developmental stage
The genes active in an embryo differ from those in an adult
environmental conditions
cells respond to stress, nutrients, or pathogens by changing gene expression
transcriptomics is versatile
- developmental biology
- disease research
- drug discovery
- ecology
developmental biology
- understanding cell differentiation
Which genes are expressed in a specific cell type or condition?
disease research
- identifying pathological gene expression patterns
What are the differences in gene expression between healthy and diseased states?
Drug discovery
- revealing mechanisms of action and side effects
How does gene expression change over time or in response to stimuli?
ecology
- studying organism-environment interactions
How do environmental factors influence gene expression?
isoforms
A single gene can produce multiple mRNA transcripts (isoforms)
transcriptomics reveals…
- reveals alternative splicing and isoforms
- One of the main ways organisms can increase protein diversity without increasing the number of genes
single-cell transcriptomics
- revolutionizes resolution
- Captures gene expression in individual cells (overall purpose)
- Reveals cellular heterogeneity within tissues
- While powerful, data is sparse and noisy
– Not very reproducible bc there is very little RNA in each cell
– Often paired with bulk RNA analysis
*** most beneficial for rare cell types within complex tissues
spatial transcriptomics
- maps gene expression to location
- Preserves spatial information of transcripts within tissue sections
- Reveals how cellular neighborhoods influence gene expression
functional insights from genomics
- Identifies potential functional elements
- Predicts disease risk
functional insights from transcriptomics
- Reveals which elements are active
- Shows diseases state
temporal insights from genomics
- Requires one-time sampling
- Reveals evolutionary history
temporal insights from transcriptomics
- captures real-time cellular responses
RNA integrity number
- Assesses RNA integrity
– rRNA makes up a large fraction (~85%) of our RNA
** Based on the ratio of 28S and 18S rRNA vs. all RNA
– the 18S region can contain partially degraded 28S RNA
– 28S is the largest peak (furthest to the right of the graph)
mRNA enrichment focus
- focuses sequencing on protein-coding transcripts
Enrichment method affects:
– Gene expression measurements
– Detection of non-coding RNAs
– Identification of immature transcripts
How could we filter our sample for only mRNA?
- Poly A tail primer will allow for amplification of only mRNA
- Poly(A) selection captures mature mRNAs
Reverse transcription introduces unique challenge
- RNA is converted to cDNA using reverse transcriptase
- Random or oligo(dT) primers influence transcript representation
- Second-strand synthesis method can preserve strand information
microarrays
*detect gene expression
- cell sample is cultured
- mRNA is isolated
- reverse transcription to cDNA
- hybridize cDNA probes to oligo sequences on microarray
- no longer in practice
*** require previous knowledge/info input to reference
Caveats of Microarrays
- Limited to known sequences: can only detect pre-defined sequences
- Cross-hybridization: similar sequences may cause false positives
- Limited dynamic range: may miss very low or high abundance transcripts
- Normalization challenges: complex process, potential for bias
The primary advantage of RNA-sequencing over microarray technology?
Does not require prior knowledge/information
RNA-seq
- Now we just use the cDNA
- RNA-seq doesn’t require prior knowledge of sequences
– Enables discovery of novel transcripts and isoforms
Computational Pipeline for RNA-seq data analysis Outline
- Read Alignment: Mapping Transcripts to the Genome
- Quantification: Measuring Gene Expression Levels
- Differential Expression Analysis: Identifying Key Genes
- Dimensionality Reduction: Visualizing Complex Data
Read Alignment: Mapping Transcripts to the Genome
- Consideration of splice junctions and gene isoforms
- Needs to account for known and novel splice sites
- Requires specialized alignment algorithms (e.g. STAR, HISAT2)
Quantification: Measuring Gene Expression Levels
- Counting aligned reads with HTSeq or featureCounts
- Transcript-level quantification with Salmon or Kallisto
- Normalization methods: ex/ TPM (transcripts per million)
- Distinguishing between different isoforms of the same gene
Differential Expression Analysis: Identifying Key Genes
- Compares gene expression levels
- Statistical testing with DESeq2 or edgeR
- Visualization of results (volcano plots)
- Clustering of differentially expressed genes
- Results in list of up- and down-regulated genes
Dimensionality Reduction: Visualizing Complex Data
- Reduces high-dimensional data to 2D or 3D for visualization
- Reveals patterns and clustering in the data
- Techniques include PCA, t-SNE, and UMAP
- This practice is widely used, but extreme caution is needed and it is not generally recommended
** generally not used for analysis because the reduced data is not accurate; never analyze from reduced dimensions
challenge of aligning short reads to large reference genomes
- dealing with enormous data sets
- millions of base pairs
- hundreds of GB (most computers hold 8-12 GB of RAM)
3 different alignment algorithm strategies
- Hash tables
- Suffix arrays/trees
- Burrows-Wheeler transforms
Hash table
- Hash tables link a key to a value
– Keys represent a “label” we can use to get information
– A “hash function” determines where to find the associated value
- converts labels to table indices
*** connects information to data in memory via a hash function
Ex/ like a phone book (look up a name to find its number)
hash table for k-mer location
- hashing our reference genome seeds our hash table with k-mer locations
- provides quick lookups of our reference genome
- query a k-mer read to get indices of our possible reference genome locations
Seed-and-extend in hash-based alignment
- determine k-mer strings (seeds) from the read
- use the hash table for rapid lookup of potential matches
***multiple seeds increase the chance of finding the correct location
- extend by starting from the seed match and growing in both directions along the reference genome
** check to see if we can align to the reference
*Always extend forward, but check backward if the hit is not at the start of the sequence
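A minimal Python sketch of the seed step: hash the reference k-mers, then look up a read's k-mers to get candidate alignment starts (toy sequences, illustrative only):

```python
# Minimal sketch of hash-based seeding: index every k-mer position in
# the reference, then look up a read's k-mers to get candidate start
# locations to extend and verify against the reference.

from collections import defaultdict

def build_kmer_index(reference: str, k: int):
    index = defaultdict(list)           # k-mer -> positions in reference
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_hits(read: str, index, k: int):
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            hits.append(pos - offset)   # implied read start on the reference
    return hits

ref = "ACGTACGTTAGC"
index = build_kmer_index(ref, k=4)
print(seed_hits("TACGTT", index, k=4))
# [3, -1, 3, 3]: start position 3 is supported by three seeds
```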
Hash-Based Alignment: Divide and Conquer
A “DNA dictionary” with a quick lookup and direct access to potential matches
pros/cons of hash-based alignment
Pros:
- Easily parallelizable
- Flexible for allowing mismatches
- Conceptually simple
Cons:
- Large memory footprint for index
- Can be slower for very large genomes
suffix trees
- represent all suffixes of a given string
- used to find starting index of suffix
suffix arrays
- memory efficient alternatives to trees
- requires less memory but is also less powerful
- create all suffixes
- sort lexicographically
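A minimal Python sketch of suffix array construction by brute-force sorting (real tools use far more efficient construction algorithms):

```python
# Minimal sketch: a suffix array is the list of starting indices of all
# suffixes, sorted lexicographically. This O(n^2 log n) version is for
# teaching only.

def suffix_array(s: str):
    return sorted(range(len(s)), key=lambda i: s[i:])

text = "banana$"
for i in suffix_array(text):
    print(i, text[i:])
# 6 $ / 5 a$ / 3 ana$ / 1 anana$ / 0 banana$ / 4 na$ / 2 nana$
```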
Burrows-Wheeler Transform (BWT) purpose
Compression reduces the amount of data we have to store
- sorts the string without losing the original data; sorting rotations lexicographically groups repeats together, aiding compression
BWT workflow
- Append a unique end-of-string (EOS) marker to the input string.
- Generate all rotations of the string.
- Sort these rotations lexicographically
- Extract the last column of the sorted matrix as the BWT output.
** First column is more compressible but we lose context and reversibility
BWT reverse
- Write BWT output vertically
- Sort the output lexicographically.
- Prepend the BWT output to the front of the sorted strings.
- Repeat the sort and prepend steps
- until the row length equals the output length
- The string that ends with EOS marker is the original string.
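A minimal Python sketch of both directions of the transform as described in these two cards (toy input, illustrative only):

```python
# Minimal sketch of the BWT workflow: append an end-of-string marker,
# sort all rotations, take the last column; the inverse rebuilds the
# rotation table by repeated prepend-and-sort.

def bwt(s: str, eos: str = "$") -> str:
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "$") -> str:
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eos))[:-1]

transformed = bwt("GATTACA")
print(transformed)               # ACTGA$TA: repeats get grouped
print(inverse_bwt(transformed))  # GATTACA: the transform is lossless
```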
Backward Search Algorithm for BWT
- efficiently finds occurrences of a pattern in a text using the LF-mapping
- number the F (first) and L (last) columns
- find F rows that have last letter of search string
- note which rows have the next letter in the L-column
- repeat until first letter
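A minimal Python sketch of backward search over a BWT, using naive counting where an FM-index would use precomputed rank tables:

```python
# Minimal sketch of backward search: extend the pattern right-to-left,
# maintaining the band [lo, hi) of sorted-rotation rows that match,
# via the LF-mapping (first-occurrence offsets + rank counts).

from collections import Counter

def bwt(s: str, eos: str = "$") -> str:
    s += eos
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def backward_search(last: str, pattern: str) -> int:
    counts = Counter(last)
    first_index, total = {}, 0            # first_index[c] = # of chars < c
    for c in sorted(counts):
        first_index[c] = total
        total += counts[c]
    lo, hi = 0, len(last)                 # current band of matching rows
    for c in reversed(pattern):           # extend the match right-to-left
        if c not in first_index:
            return 0
        lo = first_index[c] + last[:lo].count(c)   # LF-mapping, naive rank
        hi = first_index[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo                        # number of occurrences

print(backward_search(bwt("GATTACA"), "TA"))  # 1: "TA" occurs once
print(backward_search(bwt("GATTACA"), "A"))   # 3: "A" occurs three times
```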
suppose we have isolated a normal and cancerous cell. We want to identify possible drug targets based on overexpressed genes
use transcriptomics
normalizing the transcriptome
- must normalize before making comparisons between transcriptomes (libraries differ in size)
- compute the ratio of normal to cancerous cell expression across transcriptomes
Scaling data to “parts per million”
- Transcripts and ratios are substantially smaller
- Small floats require high precision and thus memory
- This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers
reads per kilobase (RPK)
- corrects experimental biases where longer transcripts will have more reads
- corrects through normalization of gene length (more exons)
RPK = (read counts for gene) / (gene length in kilobases)
Reads per kilobase of transcript per million reads mapped (RPKM)
RPKM = (reads mapped to transcript × 10^9) / (total reads × transcript length in bases)
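A minimal Python sketch of both normalizations with toy counts (gene names and numbers are made up):

```python
# Minimal sketch: RPK divides by gene length in kb; RPKM additionally
# scales by total mapped reads, matching the 10^9 form of the formula.

def rpk(read_count: int, length_bp: int) -> float:
    return read_count / (length_bp / 1_000)

def rpkm(read_count: int, length_bp: int, total_reads: int) -> float:
    return read_count * 1e9 / (total_reads * length_bp)

counts = {"geneA": (500, 2_000), "geneB": (500, 10_000)}  # (reads, length bp)
total = 1_000_000
for gene, (reads, length) in counts.items():
    print(gene, rpk(reads, length), rpkm(reads, length, total))
# geneA: 250.0, geneB: 50.0 -- same reads, longer gene => lower value
# (RPK and RPKM coincide here only because total is exactly one million)
```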
traditional quantification and read mapping
- assigns a read to single transcript using read mapping algorithms
- once aligned we can count the number of mapped reads to each transcript
Bowtie 2
Uses BWT to map and quantify reads
Spliced Transcripts Alignment to a Reference (STAR)
- Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments
- finds the prefix that perfectly matches reference then repeats for unmatched regions
- automatically detects junctions instead of relying on databases
Alignment-based methods are computationally expensive
SO
Alignment-based methods need to determine the read’s exact position in the transcript
Pseudoalignment
- finds which transcript, but not where
– Identifies which transcripts are compatible with the read, skipping the precise location step
** skips the full alignment process
- Instead of mapping each read to a specific position, pseudoalignment identifies which transcripts are compatible with a given read
alignment VS pseudoalignment
Alignment → specifies where exactly in the transcript this read came from (at position ___)
Pseudoalignment → specifies that it came somewhere from this transcript (compatible)
Pros and cons of pseudoalignment
Pros: faster and less resource-intensive than alignment-based methods
Cons: It may lack certain details, such as the position and orientation of reads, which are useful for correcting technical biases
generative model
- statistical model that explains how the observed data are generated from the underlying system
- Defines a computational framework that produces sequencing reads from a population of transcripts
Salmon
- mathematically defines a transcriptome by its individual transcripts and their counts
- computes nucleotide fractions, taking into account the effective length of each transcript
- tells us how much of the total RNA pool comes from each transcript
- tries to identify distributions of reads amongst the transcripts
- Matrix computationally assigns fragments to transcripts
- Salmon looks for parameters with the lowest errors (generative model)
Converting Salmon estimates to relative abundances
- The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i
*** normalizes the nucleotide fraction by the effective length (Ti)
- adjusts for the longer transcripts generating more reads
transcript fraction
proportion of total RNA molecules in the sample that come from a certain transcript (i)
Transcript-Fragment Assignment Matrix
- Z is binary matrix where all values are 0 or 1
- M transcripts (rows)
- N fragments (columns)
** shows if fragment is assigned to transcript
conditional probability notation
P(a|b)
- what is the probability of a occurring if b is true
** we want to optimize values to get the highest probability
Fragment Probabilities P(fj|ti)
Conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases
**SALMON quasi-mapping: probability is approximated based on transcript compatibility rather than exact positions
positional bias in Salmon
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads
- A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
Overcoming GC content bias in Salmon
- Undersample GC-rich regions
- Oversample AT-rich regions
2-Phase Inference in Salmon
- Online phase: makes fast, initial estimates of transcript abundances
- Offline phase: refines these initial estimates using more complex optimization techniques
This 2-phase approach balances speed (in the online phase) with accuracy (offline phase)
quasi-mapping
Quasi-mapping is a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
- early stopping of read mapping
- alignment is expensive, so Quasi-mapping stops after identifying seeds
n_t = (# of fragments mapping to transcript t) / (total # of fragments)
Iteratively update parameters based on mini batches
- Offline phase fine tunes transcript abundance
- After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm
** ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
Expectation-Maximization (EM) algorithm
ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
likelihood of data for Salmon
- central to the inference process in Salmon
- probability of observing the entire set of fragments, given the transcriptome and nucleotide fractions
- optimize the estimates of alpha, a vector of the estimated number of reads originating from each transcript
*** goal is to maximize this likelihood to infer the most likely values of alpha, which correspond to the relative abundances of transcripts
Maximum Likelihood Estimation (MLE)
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
Why the EM Algorithm Maximizes the Likelihood:
EM algorithm breaks down a difficult problem into 2 simpler problems:
– E-step: estimate the missing information(the assignment of fragments to transcripts) using the current transcript abundance estimates
– M-step: use the estimated assignments to update the transcript abundances, improving the likelihood
EM algorithm and likelihood
For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
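A toy Python sketch of this E/M loop for transcript abundances, assuming equal-length transcripts and known fragment-to-transcript compatibility sets (a simplification of what Salmon actually optimizes):

```python
# Toy EM sketch: the E-step softly splits each fragment across its
# compatible transcripts using the current abundances; the M-step
# re-estimates abundances from those soft assignments. Each iteration
# increases the data likelihood until it converges.

def em_abundances(fragments, transcripts, n_iter=50):
    theta = {t: 1 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        for compat in fragments:                  # E-step
            norm = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / norm    # soft assignment
        total = sum(expected.values())            # M-step
        theta = {t: expected[t] / total for t in transcripts}
    return theta

# 3 fragments unique to t1, 1 unique to t2, 2 ambiguous between t1/t2
frags = [{"t1"}] * 3 + [{"t2"}] + [{"t1", "t2"}] * 2
print(em_abundances(frags, ["t1", "t2"]))  # converges to ~{t1: 0.75, t2: 0.25}
```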
What is Differential Gene Expression?
the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions
DGE workflow
- Sample collection: gather samples from different conditions (e.g. healthy or diseased)
- RNA sequencing (RNA-seq): Quantify gene expression level using high-throughput sequencing technologies
- Read Mapping and Quantification: align RNA-seq reads to a reference genome and quantify expression (e.g, using Salmon)
- Statistical Analysis: Identify genes with significant expression differences between conditions.
Case Study: Breast Cancer
(DGE)
Objective:
– Identify genes differentially expressed between triple-negative breast cancer(TNBC) and hormone receptor-positive breast cancer
Findings:
– TNBC shows upregulation of genes involved in cell proliferation and metastasis
Implication:
– Targets for specific therapies
– Improved classification and prognosis of breast cancer subtypes
***DGE provides statistical tools to identify changes between samples
statistical model
A mathematical tool that describes how data is generated
***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance
what do statistical models help to answer?
It helps us answer:
– Is there an apparent difference in gene expression between 2 conditions?
– If so, is it real, or could it have happened by random chance or experimental flaws?
hypothesis testing
- perform hypothesis testing to see if the difference in expression between conditions is statistically significant.
2 types of hypothesis
null (H0)
alternative (H1)
Null Hypothesis (H0)
There is no difference in gene expression between the 2 conditions.
Alternative Hypothesis (H1)
There is a significant difference in gene expression between the conditions.
when do you reject the null hypothesis?
We reject the null hypothesis when our statistical test shows that the observed difference, if any, is unlikely to have happened by random chance.
p-value
- the probability of observing a difference at least as extreme as the one seen, assuming the null hypothesis is true
i.e., what is the probability that the apparent difference arose by random chance (“getting lucky”) when there is really no difference?
higher VS lower p-value
- The higher the p-value, the more our model supports the null hypothesis
- The lower the p-value, the more our model supports the alternative hypothesis
Differential gene expression uses statistical models for hypothesis testing
Ensures that we are not biasing our data or our interpretation
count data
RNA-seq generates count data: the number of RNA fragments that map to each gene
discrete data
- data that can only take specific values (ex/ only whole numbers)
** In RNA-seq, we measure the number of reads mapped to a gene, so the data are count-based
- requires special statistical tools
- cannot use normal distribution bc it requires continuous data
Binomial distribution
models the number of successes in a fixed number of independent trials, where each trial has the same probability of success
** simple model for discrete counts
Limitations of Binomial distributions for RNA-seq
- MAIN limitation → assumes that the probability of success is constant between samples
- Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth.
- Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not at all.
Poisson distribution
a statistical tool used to model the number of events(or counts) that happen in a fixed period of time or space, where:
– The events are independent of each other
– Each event has a constant average rate
- A baseline for modeling discrete counts
*** simplifies computation and allows for varying probabilities
*provides an accurate distribution of counts if your mean and variance are approximately equal
parity plots with mean and variance
- show deviations with Poisson distributions
Mean = variance line
***Higher counts typically have larger variance
Overdispersion in RNA-Seq
when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution)
** may reflect biological variability between samples not captured by the experimental conditions
biological variability between samples not captured by the experimental conditions
Differences in RNA quality
Sequencing depth
Biological factors like different cell types within the same tissue
Expected variance for Poisson-distributed data
- equals the mean: Variance = μ
*** Variance is often larger than the mean for RNA-Seq: Variance > μ
Negative Binomial distribution accounts for high dispersion
- models overdispersed count data (variance exceeds mean): Variance = μ + αμ²
- If alpha = 0, the Negative Binomial distribution reduces to the Poisson distribution (which models the number of events)
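A short numpy sketch of this overdispersion: a gamma-Poisson mixture is exactly a Negative Binomial with Variance = μ + αμ², while plain Poisson draws have variance ≈ mean (parameter values are made up):

```python
# Sketch: simulate Negative Binomial counts as a gamma-Poisson mixture.
# With shape 1/alpha and scale mu*alpha, the mixed rates have mean mu
# and the resulting counts have Variance = mu + alpha * mu**2.

import numpy as np

rng = np.random.default_rng(0)
mu, alpha, n = 100.0, 0.5, 200_000

lam = rng.gamma(shape=1 / alpha, scale=mu * alpha, size=n)  # per-sample rates
counts = rng.poisson(lam)                                   # NB-distributed

print(counts.mean())              # ~100 (the mean mu)
print(counts.var())               # ~100 + 0.5 * 100**2 = 5100 >> mu
print(rng.poisson(mu, n).var())   # ~100: Poisson variance equals the mean
```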
The Challenge of zeros in RNA-seq data
- RNA-seq data frequently contains zero counts for some genes because not all genes are expressed under all conditions.
- Most statistical models account for variance, but not that 0’s can dominate counts
Ex/ even with a high expected mean under a Poisson distribution, we can still have zeros or very low counts (zero = gene turned off)
In these circumstances, we have to use zero-inflated models.
Why are statistical models important in RNA-seq?
RNA-seq data = messy: counts vary, lots of zeros, and data has no simple patterns
We need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way
maximum likelihood estimation (MLE) for optimization algorithms
- used to estimate the parameters μ (mean) and α (dispersion) for each gene
** MLE tries to find the model parameters that make the observed counts most likely
Adjusts the model until the predicted counts match the actual counts as closely as possible (i.e. minimize the error)
Wald test
statistical test that helps us to determine whether estimated log fold change between 2 conditions is significantly different from zero.
Log Fold Change (β1) = 0
- means that the gene is expressed at the same level in both conditions.
- null = log fold change between conditions is 0. no difference in expression
- alternative = log fold change between conditions is not zero (there is a difference in expression).
Estimate Parameters from the Negative Binomial Model
- gives us an estimated log fold change (β1) for each gene
- Also gives us standard error (SE) for this estimate, which tells us how uncertain we are about the estimate of log fold change [ SE(β1) ]
Wald Statistic
tells us how many standard deviations the estimated log fold change is away from zero (no difference = 0): W = β1 / SE(β1)
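A minimal Python sketch with made-up numbers (scipy's normal distribution supplies the two-sided p-value):

```python
# Minimal sketch: the Wald statistic is the estimated log fold change
# divided by its standard error, compared against a standard normal.

from scipy.stats import norm

beta1, se = 1.2, 0.4               # estimated log fold change and its SE
wald = beta1 / se                  # 3.0 standard errors from zero
p_value = 2 * norm.sf(abs(wald))   # two-sided p-value

print(wald, p_value)               # 3.0, ~0.0027 -> likely significant
```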
likelihood ratio test (LRT)
Idea is to compare the likelihood of data under:
– The null model (same expression in both conditions)
– The alternative model (different expression levels in each condition)
volcano plot
displays the relationship between each gene’s statistical significance (p-value) and the magnitude of change (fold change)
volcano plot interpretation
Top corners: genes with high significance and large fold changes (both upregulated and downregulated)
Center: genes with little to no change or low significance
MA Plots
visualizes the relationship between the average expression (A) and the log fold change (M) for each gene
Usage: identifying trends or biases in expression data, such as mean-dependent variance
MA Plots interpretation
Center Line (M=0): No change in expression.
Spread: indicates variability in fold changes across different expression levels
heat map
- displays the expression levels of multiple genes across different samples using color gradients
Rows: Genes
Columns: Samples
Color Intensity: represents expression level (e.g. red for upregulation, blue for downregulation)
heat map interpretation
Identifying clusters of co-expressed genes and sample groupings based on expression profiles
Principal Component Analysis (PCA) Plots
- PCA transforms high-dimensional gene expression data into principal components that capture the most variance
Axes: Principal components representing the most significant sources of variation
Usage: assessing batch effects, overall data structure, and sample quality
PCA interpretation
Sample clustering: samples from similar conditions cluster together
Outliers: samples that do not group with others may indicate technical or biological variability
What is the main limitation of Sanger sequencing?
It has a high cost and low throughput.
How can Sanger sequencing be used to help with next-generation sequencing (NGS) technologies?
to confirm sequencing results and fill gaps
Which principle does Illumina sequencing rely on?
Sequencing by synthesis using reversible terminator nucleotides.
What is a significant challenge particular to de novo genome assembly?
Handling repetitive DNA sequences.
What does a directed edge represent in de Bruijn graphs?
The overlap between k-mer sequences
What computational complexity is characteristic of de novo assembly?
The similarity of k-mers within the reads
What is the primary benefit of using both short and long reads in de novo genome assembly?
Long reads can span repetitive regions, while short reads improve coverage.
Which algorithm is commonly used for local pairwise sequence alignment?
Smith-Waterman
BLAST
What is the main difference between global and local sequence alignment?
Global alignment matches sequences in full; local alignment focuses on best-matching parts
Local alignment identifies conserved regions; global for full sequence comparison
What is the significance of the numerical value in the bottom-right of the Needleman-Wunsch alignment matrix?
The numerical value in the bottom-right corner of the Needleman-Wunsch alignment matrix represents the optimal score of the global alignment between two sequences.
What are the inherent limitations of greedy algorithms for genome assembly? In particular, how might these limitations affect the assembly outcome when dealing with complex eukaryotic genomes?
Greedy algorithms for sequence assembly can be efficient for simple genomes but struggle with complex ones due to issues with repetitive sequences, limited global perspective, and handling of genetic
variation. These limitations can lead to misassemblies, especially in genomes with high complexity,
such as those with repetitive regions or structural variations.
You are analyzing bulk RNA-seq data from a study comparing gene expression in diseased versus
healthy tissue samples. You notice that a specific gene has a high fold change (i.e., it is up-regulated),
but the large p-value indicates that it is insignificant. What could be the most likely reason for this
observation?
The gene expression varies significantly within the sample groups
When comparing RNA-seq data from two different developmental stages of an organism, you find
many genes with altered expression. Which factor should be considered before attributing these
changes to developmental processes?
- Batch effects or variations in sequencing depth between the samples.
- Exclusive reliance on fold-change values.
- Attributing all changes in gene expression to transcriptional regulation.
Given the challenge of genomic variability and sequencing errors in read mapping, which approach
is most effective in distinguishing true splice junctions from artifacts?
Employing statistical models that account for sequencing error rates and genomic variability
The computational inference of splice junctions from RNA-Seq data involves aligning short reads that
may span exon-exon junctions. This process is complicated by the vast diversity of potential splicing events and the need for high accuracy in distinguishing actual splice junctions from sequencing
errors or genomic variations. Given these challenges, which strategy is most effective for improving
the accuracy of splice junction identification?
Use a hybrid approach that combines alignment to a reference genome with de novo assembly
of reads
Aligning short reads to a reference genome presents significant challenges, especially in the context
of repetitive sequences or highly variable regions. Which approach offers the best potential to enhance read mapping accuracy in these complex genomic landscapes?
Initially align reads to a simplified model of the genome and gradually integrate more complex
regions
What property of the Burrows-Wheeler transform is most crucial for improving the efficiency of pattern matching in biological sequences?
The rearrangement of characters to bring similar characters together.
Why is the Burrows-Wheeler transform significant for bioinformatics applications in the context of
FM-indexes?
It allows for efficient backward search, reducing the time complexity of finding patterns.
When applying the Burrows-Wheeler transform to a sequence, what is the importance of the last column?
It is for reconstructing the original sequence.
Which of the following best describes the role of the Burrows-Wheeler transform (BWT) in the FM
index?
BWT is the first step in FM-index construction
The true transcriptome of a sample is defined as:
The complete set of RNA molecules, including all isoforms present in the sample.
The concept of effective length in RNA-seq data analysis accounts for:
The adjustment for the empirical distribution of fragment lengths obtained during sequencing.
What is the primary goal of using maximum likelihood estimation in Salmon for RNA-Seq data analysis?
To maximize the probability of the observed RNA sequencing data
Why is the quality of sequencing data typically lower at the end of a read in Sanger sequencing?
- primarily attributed to the decreasing population of longer DNA fragments.
- probability of ddNTP incorporation
- concentration ratio of dNTPs to ddNTPs
- mass and mobility differences
- signal-to-noise ratio
NOT DUE TO NUCLEOTIDE DEPLETION
- mixture contains an excess of both dNTPs and ddNTPs.
- concentrations of these nucleotides are not depleted during sequencing.
- concentrations of dNTPs and ddNTPs remain constant throughout
What is the purpose of adding adapters to DNA fragments in Illumina sequencing?
Adapters are short oligonucleotide sequences added to DNA fragments during Illumina sequencing library preparation.
Purpose: Adapters contain sequences complementary to oligonucleotides on the Illumina flow cell surface. This allows DNA fragments to bind to the flow cell and form clusters.
If the concentration of ddNTPs is too high:
- short fragments
- loss of long reads
- reduced overall signal
If the concentration of ddNTPs is too low:
- long fragments
- loss of short reads
- weak signal for short fragments
ratio of ddNTPs to dNTPs in Sanger sequencing
- critical for generating a balanced distribution of fragment lengths
– fragment distribution
– read lengths
smaller k-mer sizes
- more likely to find overlaps between reads because they require fewer matching bases. This increases sensitivity, helping to connect reads in regions with low coverage or sequencing errors.
larger k-mer sizes
- larger k-mers are more specific, reducing the chance of erroneous overlaps but requiring higher-quality data
- more likely to span unique regions
- reduced overlap detection
- more memory used