Bioinformatics Exam Flashcards
AlphaFold 3
Predict the joint structure of complexes including proteins, nucleic acids, small molecules, ions, etc.
HOMER2
Show that the effect of transcription factor binding on transcription initiation is position dependent
How do we acquire our DNA sample for DNA sequencing?
- Start with a bacterial culture to produce the product of interest
– Biotechnology frequently uses massive E. coli cultures for production
- Separate cells from media
– Centrifuge and separate cells and media
– Keep the component of interest (DNA)
– Break open the cells by lysing them (chemical lysis destabilizes the lipid bilayer and denatures proteins)
- Isolate and purify our DNA
– phenol-chloroform extraction (liquid-liquid separation)
– Aqueous DNA/RNA on top
– Lipids/large molecules on the bottom
Surfactants VS Phospholipids
- both contain a hydrophilic head and hydrophobic tail
– surfactants have only one hydrophobic tail, which allows them to penetrate molecular structures further than phospholipids with 2 tails
– they break the phospholipid barrier more readily and destabilize proteins (used for chemical lysis)
260 nm DNA sample absorbance
- absorbance at 260 nm is correlated to the DNA concentration of the sample
— looks for impurities in the sample solution
— can assume we have purified DNA sample after this step
— based on the absorbance of UV irradiation (Beer-Lambert law)
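A minimal Python sketch of this conversion, using the standard convention that an A260 of 1.0 corresponds to ~50 ng/µL of dsDNA (function names and example values are illustrative):

```python
# Minimal sketch: estimating dsDNA concentration and purity from UV
# absorbance, with the Beer-Lambert extinction coefficient folded into
# the standard ~50 ng/uL-per-A260-unit conversion for dsDNA.

def dsdna_concentration_ng_per_ul(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration (ng/uL) from absorbance at 260 nm."""
    return a260 * 50.0 * dilution_factor

def purity_ratio(a260: float, a280: float) -> float:
    """A260/A280 ratio; ~1.8 is conventionally taken as 'pure' DNA."""
    return a260 / a280

print(dsdna_concentration_ng_per_ul(0.75))   # 37.5 ng/uL
print(round(purity_ratio(0.75, 0.42), 2))    # ~1.79 -> acceptably pure
```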
Main purpose of Sanger Sequencing
— determine the precise ordering of nucleotides
DNA elongation
- occurs rapidly and continuously
- use DNA polymerase and excess nucleotides to make copies of DNA
- requires 3’ OH to add another nucleotide to the chain
Di-deoxynucleotides (ddNTP)
- ddNTPs stop replication
- do not have a 3’ OH for continued elongation
- usually a 1:100 ddNTP:dNTP ratio
*** left with DNA strands of variable length
Sanger sequencing process
- sort DNA fragments by length to see what the last nucleotide is
– lower ddNTP concentration results in longer strands
– higher concentration of ddNTP results in shorter strands
*** by sorting fragments by length, we can see what the last nucleotide was (line up 5’ nucleotide)
— get the template strand
Original Sanger Sequencing SetUp
- split DNA sample into 4 beakers
- Add a ddNTP into each beaker (A,T,C,G)
- Add some radioactive ddNTP into a single beaker
- Add Taq and run PCR
** separate by length in gel electrophoresis
(larger fragments do not travel as far)
– order from farthest traveled (shortest) to least traveled (longest)
***** need SEPARATE beakers bc you cannot differentiate between radioactive nucleotides
Sanger Sequencing Now
- now use fluorescent tags to distinguish ddNTPs
- only need one beaker for PCR
- also automate fragment separation
capillary gel electrophoresis
- can accelerate fragment length sorting and detection
- separates molecules by size based on their charge-to-mass ratio
- Smaller molecules move more freely/faster through the gel than larger molecules
- molecules must be charged through tagging with a charged molecule
- DNA and RNA are charged bc each nucleotide has a charge
SanSeq Chromatogram
- unique fluorescence signal per ddNTP produces a chromatogram
ideal SanSeq Chromatogram
- variation in peak height is less than 3-fold
- peaks are evenly distributed
- peaks contain only 1 color
- absent baseline noise
- interpreted nucleotide sequence is 5’ to 3’
Nonideal SanSeq Chromatogram
- significant noise in the first ~20 bp reflects unreliable migration
- dye blobs from unused ddNTPs
- fewer longer fragments so signal is weaker
SanSeq VS Illumina Sequencing
- Sanger sequencing is very accurate but slow compared to Illumina
Illumina Sequencing
- sequencing by synthesis
- uses a polymerase/ligase enzyme to incorporate nucleotides with a fluorescent tag (fluorescently labeled reversible terminator)
- tags are then identified to determine the DNA sequence
Illumina Sequencing Process
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
– bridge amplification creates double-stranded bridges
– double-stranded clonal bridges are denatured with cleaved reverse strands
***clusters will give off a stronger signal compared to a single fragment
We repeatedly →
1. Add nucleotide
2. Capture signal
3. Cleave fluorophore
5 step Illumina sequencing process
1. Add labeled dNTPs into the flow cell
2. Incorporate a complementary nucleotide
3. Remove unincorporated fluorescent nucleotides
4. Capture fluorescent signal & image clusters
5. Remove the fluorophores and the protecting group
paired-end sequencing
- enables both ends of the DNA fragment to be sequenced
– Because the distance between each paired read is known, alignment algorithms can use this information to map the reads over repetitive regions more precisely.
***Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
Nanopore Sequencing Technology
- nanopore and polymer membrane respond to electrical perturbations
*** gives us much longer reads which is important for assembling reads into a genome
** type of third-generation sequencing (TGS)
- can give long reads with no amplification
- Direct detection of epigenetic modifications on native DNA.
- sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms.
- Uniform coverage of the genome; not as sensitive to GC content as short-read platforms.
genome assembly
- process of combining the short, overlapping sequencing reads into continuous DNA sequence
– having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming from experimentation
contigs
continuous stretches of DNA sequence from overlapping sequencing reads
ambiguous assembly
connecting contigs in an unknown order
- accounts for differences in scaffolds
- assemble using reference genome
scaffolds
multiple overlapping contigs with estimated gaps put together in a known order
Assembly quality metrics
- sort contigs from longest to shortest
- Find point when you have ~50% of genome
- then annotate our genome with exons and introns
L50
- number of longest contigs whose combined length is at least 50% of the genome length
*** Lower is better for L50 value
- longer contigs = more confidence that genome is right (higher quality assembly)
N50
- length of the shortest contig among the set of longest contigs that together cover 50% of the total genome length
*** Higher is better for N50 value [median contig size = reliability factor]
- N50 is the length of shortest contig in L50 assembly
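A minimal Python sketch of the N50/L50 calculation described in these cards (toy contig lengths):

```python
# Minimal sketch: N50 = length of the shortest contig such that contigs
# of that length or longer cover >= 50% of the total assembly;
# L50 = how many contigs that takes (higher N50 / lower L50 = better).

def n50_l50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)  # longest to shortest
    half = sum(lengths) / 2
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, i  # (N50, L50)

contigs = [1200, 900, 700, 400, 300, 100]
print(n50_l50(contigs))  # (900, 2): two contigs reach 50% of the 3600 bp total
```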
why clean sequencing reads?
- improves assembly
- Garbage in = garbage out
FASTA files
- store sequences
- One line starts with a “>” and a sequence ID code
— It is optionally followed by a description of the sequence
- One or more lines containing the sequence itself.
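A minimal Python sketch of a parser for this layout (the file name is hypothetical):

```python
# Minimal sketch of a FASTA parser: header lines start with ">" (ID,
# then an optional description); the sequence may wrap over many lines.

def read_fasta(path):
    records = {}
    seq_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                seq_id = line[1:].split()[0]  # ID is the first token
                records[seq_id] = []
            elif seq_id is not None:
                records[seq_id].append(line)  # collect wrapped sequence lines
    return {sid: "".join(parts) for sid, parts in records.items()}

# seqs = read_fasta("contigs.fasta")  # hypothetical file
```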
However…base calling is NOT perfect
lagging synthesis
- caused by failure to remove the blocking fluorophore
- synthesis is behind by 1 because the blocking fluorophore was not removed
leading synthesis
- caused by addition of a dNTP instead of a ddNTP
- synthesis is ahead by 1 nucleotide because 2 were added at once
signal cross talk
- degrades the quality of assembly
- clean = clear
- noisy = blurry
- ML models and algorithms compute the probability of error → i.e. quality
— not confident that what it is seeing is purely blue, green, etc.
FASTQ files
- store sequence and quality
- quality scores measure the probability that a base is called incorrectly
ASCII-encoded probabilities
- avoid storing a float per nucleotide
- ASCII characters require ~¼ the memory of a float, and we already have to store nucleotides as characters
- each ASCII character has an associated integer
(phred quality (Q))
Phred quality P(Q)
the integer associated with the ASCII symbol
- indicates the probability that an error has occurred
– smallest value is 33 because lower ASCII codes cannot be rendered on screen
– ! = probability of error = 1 (very bad quality)
*** The further down the ASCII chart you go, the higher the quality of the read; less likely for there to be errors in the base call
calculating phred quality
P(Q) = 10^(-Q/10), where Q = ASCII value of the quality character − 33
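A minimal Python sketch of Phred+33 decoding using the formula above:

```python
# Minimal sketch: decoding Phred+33 quality characters from a FASTQ file.
# Q = ASCII value - 33, and the error probability is P = 10**(-Q / 10).
# '!' decodes to Q = 0, i.e., P(error) = 1 (the worst possible call).

def phred_quality(char: str) -> int:
    return ord(char) - 33

def error_probability(char: str) -> float:
    return 10 ** (-phred_quality(char) / 10)

for c in "!5I":
    print(c, phred_quality(c), error_probability(c))
# ! -> Q=0,  P=1.0
# 5 -> Q=20, P=0.01
# I -> Q=40, P=0.0001
```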
Where do FASTQ file entries go?
- NIH databases
- GenBank for genomic sequences
- Sequence read archive (SRA) for sequencing data
- RefSeq for reference genomes
- BioProject for curated resources for a specific project
quality issues in sequencing data
- errors are introduced due to the technical limitations of sequencing platforms
- adapters may be present if reads are longer than the fragments sequenced
— trimming adapters may improve the number of reads mapped
** quality control is an essential first step in any analysis
Per base sequence quality
- box and whisker plot of base-call accuracy
- green = excellent
- yellow = good
- red = poor
Per sequence GC content
- strong deviation from normal distribution could indicate contamination
- shows curves of GC count per read
- and theoretical distribution curve
**Compare the similarity of the two to indicate purity/quality of the sample
trimming and filtering
- trimming of problematic bases at the ends of reads must be done in order to reduce bias in future analyses
Trimming:
1. low quality score regions
2. beginning/end of sequence
3. remove adapters
Filtering:
1. reads with a low mean quality score
2. reads that are too short
3. reads with too many ambiguous (N) bases
Ex/ CutAdapt, Trimmomatic, FastP
adapters for trimming
- adapters are unique to DNA prep protocol and technology
- note which specified adapter sequences are used for trimming
automatically cleaning data for quality
processing data many times consumes many resources SO combine tool features into runs
Instead of trimming adapters in one run and quality in another, we can simultaneously remove base calls with low accuracy.
— Phred ≤ 20 → poor
— Length required = 20
- automatically removes low accuracy and short reads needed to assemble reads into quality contigs/scaffolds
resequencing
align reads to reference genome and identify variants
de novo assembly
construct genome sequence from overlaps between reads
*** done 99% of the time
- repeats/high coverage are the main challenges
Why do we want the shortest superstring?
- overlap maximization
- reduces redundancy
- maximizes confidence with highest overlaps
- repeat resolution resolves repeats by favoring collapsed arrangements
- evolutionary pressure: most genomes are under selective pressure to be efficient
greedy algorithm
- merge strings by highest overlap
Procedure →
1. Merge strings one at a time, keeping 5’ and 3’ orientation consistent
2. Always merge the largest overlap (greedy), regardless of fragment size
3. Repeat
*** Being greedy makes genome assembly tractable
*** not used in practice but helps us to understand the problem
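A toy Python sketch of this greedy procedure, assuming exact suffix/prefix overlaps and first-encountered tie-breaking (illustrative only, not a real assembler):

```python
# Toy sketch of greedy assembly: repeatedly merge the pair of reads with
# the largest exact suffix/prefix overlap (5'->3' orientation kept
# consistent) until no overlaps remain.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:      # no overlaps left: stop (would yield multiple contigs)
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGC", "GGCAT", "CATTA"]))  # ['ATGGCATTA']
```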
What happens if we have a tie for the greedy algorithm?
- Choose randomly (first encountered, first merged)
- Choose highest quality base call (use sequence with highest quality)
- Choose highest coverage (whichever results in more coverage)
- Look ahead (do both and evaluate consequence)
- Exclude (don’t merge at all)
repeats ruined our assembly
- missing strings can result from the greedy assembly process
- get the correct string back by increasing our k-mer size k
de Bruijn graph
- a graph is a data structure for representing relationships between items
- node = a single entity [(k-1)-mer]
- edge = a connection between entities (can have direction) [k-mer]
directed multigraphs for genome assembly
- genome assembly uses directed edges to specify overlap and concatenation
Building a directed multigraph
- each unique k-mer is a node. (k-mer = substring of length k)
- Add directed edges for each overlap and concatenation
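A minimal Python sketch of this construction, with nodes as (k-1)-mers and one directed edge per k-mer occurrence:

```python
# Minimal sketch: building a de Bruijn multigraph from reads. Each k-mer
# contributes one directed edge from its (k-1)-mer prefix to its
# (k-1)-mer suffix, so repeated k-mers add parallel edges.

from collections import defaultdict

def de_bruijn(reads, k):
    edges = defaultdict(list)  # node -> list of successor nodes (multigraph)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges

graph = de_bruijn(["ATGGCGT", "GGCGTGC"], k=3)
for node, succs in sorted(graph.items()):
    print(node, "->", ", ".join(succs))
# GG -> GC, GC and GC -> CG, CG: parallel edges from the repeated k-mers
```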
node is balanced if
indegree equals outdegree
cyclical sequence
- Circular genomes are not Eulerian
- Contains an extra edge
Why is this not Eulerian?
- more than two semi-balanced nodes
- cannot walk along each edge once
– if there was no overlap, then we would have some unconnected graphs
de Bruijn graphs and errors
- errors dramatically increase the number of edges and unconnected graphs
- errors affect k-mer counts
- Error correction should remove most tips and islands; rest can be removed here, leveraging graph structure
Graph traversal algorithms are used to extract contigs (procedure)
- Select a start node
- Walk along the graph until a dead end or previously visited node is reached
- Backtrack and explore alternative paths
- Repeat for remaining unvisited nodes
*** walking along the graph produces strings
how do we select a starting node?
using hubs with in and out degrees
high coverage
suggests that the node is likely a true sequence rather than an error
– confidence in that overlap is good and that node is a good starting point
How do you choose a walk
- Start a chosen vertex (node).
- Mark the current vertex as visited.
- Explore an adjacent unvisited vertex.
- If no unvisited adjacent vertices exist, backtrack to the last vertex with unvisited adjacent vertices.
How do we choose the “best” path for our contig?
- Long paths are desired but not always reliable due to potential repeats
- High, consistent read coverage
- Unique, non-branching paths
SPAdes
prokaryotic genome assembler
– based on DeBruijn graphs with numerous improvements
Error correction with BayesHammer
- Build Hamming graphs for k-mers
— Undirected edges for Hamming distances of n nucleotide differences
- Identify strong k-mers based on clustering (i.e. high similarity)
— Estimate read error based on base qualities
multisized graphs and SPAdes
- building multisized graphs with different k’s
- using multiple graphs with different sizes of K’s allows for handling of variable coverages
large K SPAdes graphs
- leads to fragmented graphs
- good for high-coverage
small K SPAdes graphs
- leads to collapsed, tangled graphs
- great for low-coverage regions (not too picky)
potential bulge in SPAdes graph
small, alternative path in the graph that diverges and then merges back into the main path
– due to sequencing errors, repetitive sequences, or small variations (indels)
** bulges must be removed, since they quickly deteriorate the graph, but naive removal loses read info
– so P’s info/coverage is projected onto Q
– P’s edges are removed in the process
potential tip in SPAdes graph
a short, dead-end path in the graph that does not connect back to the main sequence or structure
– result of sequencing errors, such as incomplete reads, low coverage, or random noise, which generate k-mers that don’t correctly align with the rest of the sequence
– Removes P (shortest) and projects information onto Q
Paired-end reads do not always cover our whole insert
- If our insert (i.e. DNA sample) is longer than reads, then we don’t sequence the inner distance.
- We want to maximize this inner distance.
- A gap between paired reads gives us insight into repeated regions.
SPAdes estimates…
- …estimates the gap length between 2 reads via de Bruijn graphs
- the gap does not always have to span a repeating sequence; estimation is more useful for gaps than for unique sequences
assembler graphs
- assemblers provide contigs and scaffolds
- island contains 1 or more contigs
- solid lines are called nodes and represent a contig
- each connection suggests how these contigs connect to form a scaffold
Ex/ Bandage
gene annotation
- identifying the genetic elements and function in our contigs
- results in sequences that likely encode for proteins
- 2 types: structural and functional
Ex/ Prokka (several outputs)
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
functional annotation
predicts the function of genetic elements
- normally based on protein database search
eukaryotic VS prokaryotic annotation
- Eukaryote annotation is significantly more challenging than prokaryote annotation
- Introns and alternative splicing complicate eukaryote annotation
P (prokaryotes): probabilistic models to identify open reading frames
E (eukaryotes): accuracy demands supporting evidence like mRNA sequencing
Identifying open reading frames (ORFs)
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
— TAA, TAG, TGA for bacteria, archaea, and plant plastids
***then score the potential ORFs
ribosomal binding site motif score
- RBS score computed from dataset fitting
– Search for the RBS motif once a start codon is found; choose whichever has the lowest bin number
– take training data from different annotated genomes to compute the frequency of each RBS motif bin across the entire sequence (baseline) and the RBS frequency
- start codon score given by a similar RBS framework
upstream score
- Upstream score based on base analysis
– By analyzing base frequency in specific upstream region, their annotation results improved
**essentially looking for promoters
coding score
- computed based on gene enrichment parameters
- computes the frequency of nucleotide hexamers, called “words”
– probability of observing word within single genes [G(w)]
– probability of observing word within the whole genome/entire DNA sequence [B(w)]
why is sequence alignment important for bioinformatics?
- Biological sequences reveal evolutionary relationships
- Sequences play a large role in the central dogma of DNA
hox genes
- highly conserved genes
- Play a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis
***So how do we know that it is highly conserved?
– By aligning sequences!!
– infrequent changes (high similarity) indicate evolutionarily conserved sequences
pairwise alignment
– reveals relationships between biological sequences
– Multiple Sequence Alignment (MSA) extends pairwise
Multiple Sequence Alignment (MSA)
- the process of aligning 3 or more biological sequences simultaneously
– Identifies conserved regions across multiple species
– Reveals patterns not visible in pairwise comparisons
Aligning sequences can provide more insight than just conservation
- Functional annotation (database searches)
- RNA and protein structure (ex/ alphafold)
- Disease-associated mutations
- Vaccine design
Importance of scoring in alignment selection
Alignment scores guide the selection of meaningful alignments
- objectivity
- optimization
- significance
objectivity importance for alignment selection
provides a quantitative measure for comparison
optimization importance for alignment selection
allows algorithms to find the best alignment
significance importance for alignment selection
helps distinguish real homology from random similarity
match
- identical characters in aligned positions
- Represents conserved regions or no change
alignment elements reflect…
evolutionary events in sequences
- matches, mismatches, gap
mismatch
- different characters in aligned positions
- Indicates substitutions or mutations
gap
- dash(-) inserted to improve alignment
- Represents insertions and deletions (indels)
linear gap penalty
- fixed cost for each gap
Ex/ -2 for each gap, regardless of length
affine gap penalty
- different costs for opening and extending gaps
Ex/ gap open= -4, gap extend= -1
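A minimal Python sketch contrasting the two gap models, using the example costs from these cards (here affine is taken as open + extend × (length − 1); conventions vary):

```python
# Minimal sketch: linear vs. affine gap penalties with the card's costs
# (-2 per gap position linear; -4 open / -1 extend affine).

def linear_gap(length: int, per_gap: int = -2) -> int:
    return per_gap * length

def affine_gap(length: int, gap_open: int = -4, gap_extend: int = -1) -> int:
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for L in (1, 5, 10):
    print(L, linear_gap(L), affine_gap(L))
# 1  -2  -4   affine charges more to open a gap,
# 5  -10 -8   but long gaps become cheaper than linear,
# 10 -20 -13  modeling a single multi-base indel event
```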
gap penalties
** reflect biological assumptions and impact alignment outcomes
Implications of Gap Penalty Types
1.) Linear penalties:
– Simpler to implement
– May over-penalize long gaps
2.) Affine penalties:
– Better handling of long indels
– More biologically realistic
3.) Biological rationale:
– Single mutation event often causes multi-base indel
– Affine penalties better model this biological reality
Sophisticated scoring approaches (gap penalties)
***Advanced scoring methods enhance alignment accuracy
1.) Position- specific gap penalties:
– Reduce penalties in variable regions
– Increase penalties in conserved regions
2.) Residue-specific gap penalties:
– Adjust penalties based on amino acid properties
3.) Terminal gap penalties:
– Often reduced to allow end gaps in local alignments
Protein alignments that require sophisticated scoring systems
***Simple match/mismatch scoring is insufficient bc:
- Some amino acid substitutions are more likely than others
- Chemically similar amino acids often substitute without affecting function
- Evolutionary relationships between amino acids are complex
substitution matrices
*quantify amino acid replacement probabilities
- probability that amino acid i mutates into amino acid j, for all pairs of amino acids
- Constructed by assembling a large and diverse sample of verified amino acid alignments
- Reflect the true probabilities of mutations occurring through a period of evolution
global alignment
- compares sequences in their entirety aka from START to END
***Needleman-Wunsch
Key Characteristics of Global Alignment
- Attempts to align every residue in both sequences
- Introduces gaps as necessary to maintain end-to-end alignment
- Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch
- guarantees optimal global sequence alignment
- Final number = final alignment score
- traceback to find the best alignment
***Look at every possible move you can make to get into that cell
– Diagonal = mismatch/match
– Side/up/down = gap
– MATCH ⇒ diagonal
*** There can be multiple optimal alignments
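A minimal Python sketch of the Needleman-Wunsch fill step with a linear gap penalty (scores are illustrative); the bottom-right cell holds the optimal global score:

```python
# Minimal sketch of Needleman-Wunsch: first row/column are all gaps,
# each cell takes the max over diagonal (match/mismatch), up, and left
# (gap) moves, and the final cell is the optimal global alignment score.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: all gaps
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: all gaps
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            F[i][j] = max(diag,               # match/mismatch
                          F[i-1][j] + gap,    # gap in b
                          F[i][j-1] + gap)    # gap in a
    return F[-1][-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # optimal global score
```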
Advantages of global alignment (needleman-wunsch)
- Provides a complete picture of sequence similarity
- Ideal for detecting overall conservation patterns
- Useful for phylogenetic analysis of related sequences
Limitations of Global Alignment
- May force alignment of unrelated regions in divergent sequence
- Less effective for sequences of very different lengths
- Can be computationally intensive for long sequences
local alignment
- identifies best matching subsequences
- focuses on finding regions of high similarity within sequences
- Does not require aligning entire sequences end-to-end
- Allows for identification of conserved regions or domains
Key Characteristics of Local Alignment (Smith-Waterman)
- Aligns subsections of sequences
- Ignores poorly matching regions
- Can find multiple areas of similarity in a single comparison
Smith-Waterman
- zero is the lowest score
- Start alignment at the highest-scoring cell
- Stop aligning when you encounter a zero
Needleman-Wunsch VS Smith-Waterman
1.) Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros
2.) Scoring system:
NW: allows negative scores
SW: negative scores are set to zero
3.) Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix
Protein motif identification
- exemplifies local alignment utility
- can identify functional regions:
– protein domains
– active sites
– binding motifs
– signal sequences
– post-translational modification sites
Multiple Sequence Alignment
- Compares three or more sequences simultaneously
Definition of MSA: arranges 3 or more biological sequences (DNA, RNA, or protein) to identify regions of similarity
- Aims to infer structural, functional, or evolutionary relationships among the sequences
Ex/ Clustal Omega, MAFFT, and MUSCLE
Key Characteristics of MSA
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence
Transcriptomics
- A real-time microscope
- Allows us to see exactly what genes are active at a given moment
- Can see gene expression changes over time
transcriptomics process
- study the complete set of RNA transcripts
- including mRNA, rRNA, tRNA, non-coding RNA
mRNA
instructions for protein synthesis
rRNA
forms part of the ribosome structure
tRNA
helps translate the genetic code into proteins
non-coding RNA
- play regulatory roles in the cell
genome VS transcriptome
G: relatively static
T: constantly changing and captures the cell’s response to its environment and internal signals
*** The dynamic nature of the transcriptome reflects the functional state of the cell
transcriptome can reflect:
- cell type
- developmental stage
- environmental conditions
*** allows us to see which annotated genes are actually being used
cell type
a neuron will have a different gene expression profile than a liver cell
developmental stage
The genes active in an embryo differ from those in an adult
environmental conditions
cells respond to stress, nutrients, or pathogens by changing gene expression
transcriptomics is versatile
- developmental biology
- disease research
- drug discovery
- ecology
developmental biology
- understanding cell differentiation
Which genes are expressed in a specific cell type or condition?
disease research
- identifying pathological gene expression patterns
What are the differences in gene expression between healthy and diseased states?
Drug discovery
- revealing mechanisms of action and side effects
How does gene expression change over time or in response to stimuli?
ecology
- studying organism-environment interactions
How do environmental factors influence gene expression?
isoforms
A single gene can produce multiple mRNA transcripts (isoforms)
transcriptomics reveals…
- reveals alternative splicing and isoforms
- One of the main ways organisms can increase protein diversity without increasing the number of genes
single-cell transcriptomics
- revolutionizes resolution
- Captures gene expression in individual cells (overall purpose)
- Reveals cellular heterogeneity within tissues
- While powerful, data is sparse and noisy
– Not very reproducible bc there is very little RNA in each cell
– Often paired with bulk RNA analysis
*** most beneficial for rare cell types within complex tissues
spatial transcriptomics
- maps gene expression to location
- Preserves spatial information of transcripts within tissue sections
- Reveals how cellular neighborhoods influence gene expression
functional insights from genomics
- Identifies potential functional elements
- Predicts disease risk
functional insights from transcriptomics
- Reveals which elements are active
- Shows diseases state
temporal insights from genomics
- Requires one-time sampling
- Reveals evolutionary history
temporal insights from transcriptomics
- captures real-time cellular responses
RNA integrity number
- Assesses RNA integrity
– rRNA makes up a large fraction (~85%) of our RNA
** Based on the ratio of 28S and 18S rRNA vs. all RNA
– the 18S region can contain partially degraded 28S RNA
– 28S is the largest peak (furthest to the right of the graph)
mRNA enrichment focus
- focuses sequencing on protein-coding transcripts
Enrichment method affects:
– Gene expression measurements
– Detection of non-coding RNAs
– Identification of immature transcripts
How could we filter our sample for only mRNA?
- Poly A tail primer will allow for amplification of only mRNA
- Poly(A) selection captures mature mRNAs
Reverse transcription introduces unique challenge
- RNA is converted to cDNA using reverse transcriptase
- Random or oligo(dT) primers influence transcript representation
- Second-strand synthesis method can preserve strand information
microarrays
*detect gene expression
- cell sample is cultured
- mRNA is isolated
- reverse transcription to cDNA
- hybridize cDNA probes to oligo sequences on microarray
- no longer in practice
*** require previous knowledge/info input to reference
Caveats of Microarrays
- Limited to known sequences: can only detect pre-defined sequences
- Cross-hybridization: similar sequences may cause false positives
- Limited dynamic range: may miss very low or high abundance transcripts
- Normalization challenges: complex process, potential for bias
The primary advantage of RNA-sequencing over microarray technology?
Does not require prior knowledge/information
RNA-seq
- Now we just use the cDNA
- RNA-seq doesn’t require prior knowledge of sequences
– Enables discovery of novel transcripts and isoforms
Computational Pipeline for RNA-seq data analysis Outline
- Read Alignment: Mapping Transcripts to the Genome
- Quantification: Measuring Gene Expression Levels
- Differential Expression Analysis: Identifying Key Genes
- Dimensionality Reduction: Visualizing Complex Data
Read Alignment: Mapping Transcripts to the Genome
- Consideration of splice junctions and gene isoforms
- Needs to account for known and novel splice sites
- Requires specialized alignment algorithms (e.g. STAR, HISAT2)
Quantification: Measuring Gene Expression Levels
- Counting aligned reads with HTSeq or featureCounts
- Transcript-level quantification with Salmon or Kallisto
- Normalization methods: ex/ TPM (transcripts per million)
- Distinguishing between different isoforms of the same gene
Differential Expression Analysis: Identifying Key Genes
- Compares gene expression levels
- Statistical testing with DESeq2 or edgeR
- Visualization of results (volcano plots)
- Clustering of differentially expressed genes
- Results in list of up- and down-regulated genes
Dimensionality Reduction: Visualizing Complex Data
- Reduces high-dimensional data to 2D or 3D for visualization
- Reveals patterns and clustering in the data
- Techniques include PCA, t-SNE, and UMAP
- This practice is widely used, but extreme caution is needed and it is not generally recommended
** generally not used for analysis because the reduced data is not accurate; never analyze from reduced dimensions
challenge of aligning short reads to large reference genomes
- dealing with enormous data sets
- millions of base pairs
- hundreds of GB (most computers hold 8-12 GB of RAM)
3 different alignment algorithm strategies
- Hash tables
- Suffix arrays/trees
- Burrows-Wheeler transforms
Hash table
- Hash tables link a key to a value
– Keys represent a “label” we can use to get information
– A “hash function” determines where to find the associated value
- converts labels to table indices
*** connects information to data in memory via a hash function
Ex/ like a phone book (look up a name to find its number)
hash table for k-mer location
- hashing our reference genome seeds our hash table with k-mer locations
- provides quick lookups of our reference genome
- query a k-mer read to get indices of our possible reference genome locations
Seed-and-extend in hash-based alignment
- determine k-mer strings (seeds) from the read
- use the hash table for rapid lookup of potential matches
***multiple seeds increase the chance of finding the correct location
- extend by starting from the seed match and growing in both directions along the reference genome
** check to see if we can align to the reference
*Always extend forward, but check backward if the hit is not at the start of the sequence
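A minimal Python sketch of the seed step: hash the reference k-mers, then look up a read's k-mers to get candidate alignment starts (toy sequences, illustrative only):

```python
# Minimal sketch of hash-based seeding: index every k-mer position in
# the reference, then look up a read's k-mers to get candidate start
# locations to extend and verify against the reference.

from collections import defaultdict

def build_kmer_index(reference: str, k: int):
    index = defaultdict(list)           # k-mer -> positions in reference
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_hits(read: str, index, k: int):
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            hits.append(pos - offset)   # implied read start on the reference
    return hits

ref = "ACGTACGTTAGC"
index = build_kmer_index(ref, k=4)
print(seed_hits("TACGTT", index, k=4))
# [3, -1, 3, 3]: start position 3 is supported by three seeds
```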
Hash-Based Alignment: Divide and Conquer
A “DNA dictionary” with a quick lookup and direct access to potential matches
pros/cons of hash-based alignment
Pros:
- Easily parallelizable
- Flexible for allowing mismatches
- Conceptually simple
Cons:
- Large memory footprint for index
- Can be slower for very large genomes
suffix trees
- represent all suffixes of a given string
- used to find starting index of suffix
suffix arrays
- memory efficient alternatives to trees
- requires less memory but is also less powerful
- create all suffixes
- sort lexicographically
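A minimal Python sketch of suffix array construction by brute-force sorting (real tools use far more efficient construction algorithms):

```python
# Minimal sketch: a suffix array is the list of starting indices of all
# suffixes, sorted lexicographically. This O(n^2 log n) version is for
# teaching only.

def suffix_array(s: str):
    return sorted(range(len(s)), key=lambda i: s[i:])

text = "banana$"
for i in suffix_array(text):
    print(i, text[i:])
# 6 $ / 5 a$ / 3 ana$ / 1 anana$ / 0 banana$ / 4 na$ / 2 nana$
```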
Burrows-Wheeler Transform (BWT) purpose
Compression reduces the amount of data we have to store
- sorts the string without losing the original data; sorting rotations lexicographically groups repeats together, aiding compression
BWT workflow
- Append a unique end-of-string (EOS) marker to the input string.
- Generate all rotations of the string.
- Sort these rotations lexicographically
- Extract the last column of the sorted matrix as the BWT output.
** First column is more compressible but we lose context and reversibility
BWT reverse
- Write BWT output vertically
- Sort the output lexicographically.
- Prepend the BWT output to the front of the sorted strings.
- Repeat the sort and prepend steps
- until the row length equals the output length
- The string that ends with EOS marker is the original string.
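A minimal Python sketch of both directions of the transform as described in these two cards (toy input, illustrative only):

```python
# Minimal sketch of the BWT workflow: append an end-of-string marker,
# sort all rotations, take the last column; the inverse rebuilds the
# rotation table by repeated prepend-and-sort.

def bwt(s: str, eos: str = "$") -> str:
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "$") -> str:
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eos))[:-1]

transformed = bwt("GATTACA")
print(transformed)               # ACTGA$TA: repeats get grouped
print(inverse_bwt(transformed))  # GATTACA: the transform is lossless
```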
Backward Search Algorithm for BWT
- efficiently finds occurrences of a pattern in a text using the LF-mapping
- number the F (first) and L (last) columns
- find F rows that have last letter of search string
- note which rows have the next letter in the L-column
- repeat until first letter
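A minimal Python sketch of backward search over a BWT, using naive counting where an FM-index would use precomputed rank tables:

```python
# Minimal sketch of backward search: extend the pattern right-to-left,
# maintaining the band [lo, hi) of sorted-rotation rows that match,
# via the LF-mapping (first-occurrence offsets + rank counts).

from collections import Counter

def bwt(s: str, eos: str = "$") -> str:
    s += eos
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def backward_search(last: str, pattern: str) -> int:
    counts = Counter(last)
    first_index, total = {}, 0            # first_index[c] = # of chars < c
    for c in sorted(counts):
        first_index[c] = total
        total += counts[c]
    lo, hi = 0, len(last)                 # current band of matching rows
    for c in reversed(pattern):           # extend the match right-to-left
        if c not in first_index:
            return 0
        lo = first_index[c] + last[:lo].count(c)   # LF-mapping, naive rank
        hi = first_index[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo                        # number of occurrences

print(backward_search(bwt("GATTACA"), "TA"))  # 1: "TA" occurs once
print(backward_search(bwt("GATTACA"), "A"))   # 3: "A" occurs three times
```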
suppose we have isolated a normal and cancerous cell. We want to identify possible drug targets based on overexpressed genes
use transcriptomics
normalizing the transcriptome
- must normalize before making comparisons between transcriptomes (libraries differ in size)
- compute the ratio of normal to cancerous cell expression across transcriptomes
Scaling data to “parts per million”
- Transcripts and ratios are substantially smaller
- Small floats require high precision and thus memory
- This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers
reads per kilobase (RPK)
- corrects experimental biases where longer transcripts will have more reads
- corrects through normalization of gene length (more exons)
RPK = (read counts for gene) / (gene length in kilobases)
Reads per kilobase of transcript per million reads mapped (RPKM)
RPKM = (reads mapped to transcript × 10^9) / (total reads × transcript length in bases)
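A minimal Python sketch of both normalizations with toy counts (gene names and numbers are made up):

```python
# Minimal sketch: RPK divides by gene length in kb; RPKM additionally
# scales by total mapped reads, matching the 10^9 form of the formula.

def rpk(read_count: int, length_bp: int) -> float:
    return read_count / (length_bp / 1_000)

def rpkm(read_count: int, length_bp: int, total_reads: int) -> float:
    return read_count * 1e9 / (total_reads * length_bp)

counts = {"geneA": (500, 2_000), "geneB": (500, 10_000)}  # (reads, length bp)
total = 1_000_000
for gene, (reads, length) in counts.items():
    print(gene, rpk(reads, length), rpkm(reads, length, total))
# geneA: 250.0, geneB: 50.0 -- same reads, longer gene => lower value
# (RPK and RPKM coincide here only because total is exactly one million)
```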
traditional quantification and read mapping
- assigns a read to single transcript using read mapping algorithms
- once aligned we can count the number of mapped reads to each transcript
Bowtie 2
Uses BWT to map and quantify reads
Spliced Transcripts Alignment to a Reference (STAR)
- Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments
- finds the prefix that perfectly matches reference then repeats for unmatched regions
- automatically detects junctions instead of relying on databases
Alignment-based methods are computationally expensive
SO
Alignment-based methods need to determine the read’s exact position in the transcript
Pseudoalignment
- finds which transcript, but not where
– Identifies which transcripts are compatible with the read, skipping the precise location step
** skips the full alignment process
- Instead of mapping each read to a specific position, pseudoalignment identifies which transcripts are compatible with a given read
alignment VS pseudoalignment
Alignment → specifies where exactly in the transcript this read came from (at position ___)
Pseudoalignment → specifies that it came somewhere from this transcript (compatible)
Pros and cons of pseudoalignment
Pros: faster and less resource-intensive than alignment-based methods
Cons: It may lack certain details, such as the position and orientation of reads, which are useful for correcting technical biases
generative model
- statistical model that explains how the observed data are generated from the underlying system
- Defines a computational framework that produces sequencing reads from a population of transcripts
Salmon
- mathematically defines a transcriptome by its individual transcripts and their counts
- computes nucleotide fractions, taking into account the effective length of each transcript
- tells us how much of the total RNA pool comes from each transcript
- tries to identify distributions of reads amongst the transcripts
- Matrix computationally assigns fragments to transcripts
- Salmon looks for parameters with the lowest errors (generative model)
Converting Salmon estimates to relative abundances
- The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i
*** normalizes the nucleotide fraction by the effective length (Ti)
- adjusts for the longer transcripts generating more reads
transcript fraction
proportion of total RNA molecules in the sample that come from a certain transcript (i)
Transcript-Fragment Assignment Matrix
- Z is binary matrix where all values are 0 or 1
- M transcripts (rows)
- N fragments (columns)
** shows if fragment is assigned to transcript
conditional probability notation
P(a|b)
- what is the probability of a occurring if b is true
** we want to optimize values to get the highest probability
Fragment Probabilities P(fj|ti)
Conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases
**SALMON quasi-mapping: probability is approximated based on transcript compatibility rather than exact positions
positional bias in Salmon
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads
- A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
Overcoming GC content bias in Salmon
- Undersample GC-rich regions
- Oversample AT-rich regions
2-Phase Inference in Salmon
- Online phase: makes fast, initial estimates of transcript abundances
- Offline phase: refines these initial estimates using more complex optimization techniques
This 2-phase approach balances speed (in the online phase) with accuracy (offline phase)
quasi-mapping
Quasi-mapping is a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
- early stopping of read mapping
- alignment is expensive, so Quasi-mapping stops after identifying seeds
n_t = (# of fragments mapping to transcript t) / (total # of fragments)
Iteratively update parameters based on mini batches
- Offline phase fine tunes transcript abundance
- After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm
** ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
Expectation-Maximization (EM) algorithm
ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
likelihood of data for Salmon
- central to the inference process in Salmon
- probability of observing the entire set of fragments, given the transcriptome and nucleotide fractions
- optimize the estimates of alpha, a vector of the estimated number of reads originating from each transcript
*** goal is to maximize this likelihood to infer the most likely values of alpha, which correspond to the relative abundances of transcripts
Maximum Likelihood Estimation (MLE)
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
Why the EM Algorithm Maximizes the Likelihood:
EM algorithm breaks down a difficult problem into 2 simpler problems:
– E-step: estimate the missing information(the assignment of fragments to transcripts) using the current transcript abundance estimates
– M-step: use the estimated assignments to update the transcript abundances, improving the likelihood
EM algorithm and likelihood
For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
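A toy Python sketch of this E/M loop for transcript abundances, assuming equal-length transcripts and known fragment-to-transcript compatibility sets (a simplification of what Salmon actually optimizes):

```python
# Toy EM sketch: the E-step softly splits each fragment across its
# compatible transcripts using the current abundances; the M-step
# re-estimates abundances from those soft assignments. Each iteration
# increases the data likelihood until it converges.

def em_abundances(fragments, transcripts, n_iter=50):
    theta = {t: 1 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        for compat in fragments:                  # E-step
            norm = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / norm    # soft assignment
        total = sum(expected.values())            # M-step
        theta = {t: expected[t] / total for t in transcripts}
    return theta

# 3 fragments unique to t1, 1 unique to t2, 2 ambiguous between t1/t2
frags = [{"t1"}] * 3 + [{"t2"}] + [{"t1", "t2"}] * 2
print(em_abundances(frags, ["t1", "t2"]))  # converges to ~{t1: 0.75, t2: 0.25}
```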
What is Differential Gene Expression?
the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions
DGE workflow
- Sample collection: gather samples from different conditions (e.g. healthy or diseased)
- RNA sequencing (RNA-seq): Quantify gene expression level using high-throughput sequencing technologies
- Read Mapping and Quantification: align RNA-seq reads to a reference genome and quantify expression (e.g, using Salmon)
- Statistical Analysis: Identify genes with significant expression differences between conditions.
Case Study: Breast Cancer
(DGE)
Objective:
– Identify genes differentially expressed between triple-negative breast cancer(TNBC) and hormone receptor-positive breast cancer
Findings:
– TNBC shows upregulation of genes involved in cell proliferation and metastasis
Implication:
– Targets for specific therapies
– Improved classification and prognosis of breast cancer subtypes
***DGE provides statistical tools to identify changes between samples
statistical model
A mathematical tool that describes how data is generated
***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance
what do statistical models help to answer?
It helps us answer:
– Is there an apparent difference in gene expression between 2 conditions?
– If so, is it real, or could it have happened by random chance or experimental flaws?
hypothesis testing
- perform hypothesis testing to see if the difference in expression between conditions is statistically significant.
2 types of hypothesis
null (H0)
alternative (H1)
Null Hypothesis (H0)
There is no difference in gene expression between the 2 conditions.
Alternative Hypothesis (H1)
There is a significant difference in gene expression between the conditions.
when do you reject the null hypothesis?
We reject the null hypothesis when our statistical test shows that the observed difference, if any, is unlikely to have happened by random chance.
p-value
- the probability of observing a difference at least as extreme as the one seen, assuming the null hypothesis is true
i.e., what is the probability that the apparent difference arose by random chance (“getting lucky”) when there is really no difference?
higher VS lower p-value
- The higher the p-value, the more our model supports the null hypothesis
- The lower the p-value, the more our model supports the alternative hypothesis
Differential gene expression uses statistical models for hypothesis testing
Ensures that we are not biasing our data or our interpretation
count data
RNA-seq generates count data: the number of RNA fragments that map to each gene
discrete data
- data that can only take specific values (ex/ only whole numbers)
** In RNA-seq, we measure the number of reads mapped to a gene, so the data are count-based
- requires special statistical tools
- cannot use normal distribution bc it requires continuous data
Binomial distribution
models the number of successes in a fixed number of independent trials, where each trial has the same probability of success
** simple model for discrete counts
Limitations of Binomial distributions for RNA-seq
- MAIN limitation → assumes that the probability of success is constant between samples
- Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth.
- Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not at all.
Poisson distribution
a statistical tool used to model the number of events(or counts) that happen in a fixed period of time or space, where:
– The events are independent of each other
– Each event has a constant average rate
- A baseline for modeling discrete counts
*** simplifies computation and allows for varying probabilities
*provides an accurate distribution of counts if your mean and variance are approximately equal
parity plots with mean and variance
- show deviations with Poisson distributions
Mean = variance line
***Higher counts typically have larger variance
Overdispersion in RNA-Seq
when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution)
** may reflect biological variability between samples not captured by the experimental conditions
biological variability between samples not captured by the experimental conditions
Differences in RNA quality
Sequencing depth
Biological factors like different cell types within the same tissue
Expected variance for Poisson-distributed data
- equals the mean: Variance = μ
*** Variance is often larger than the mean for RNA-Seq: Variance > μ
Negative Binomial distribution accounts for high dispersion
- models overdispersed count data (variance exceeds mean): Variance = μ + αμ²
- If alpha = 0, the Negative Binomial distribution reduces to the Poisson distribution (which models the number of events)
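A short numpy sketch of this overdispersion: a gamma-Poisson mixture is exactly a Negative Binomial with Variance = μ + αμ², while plain Poisson draws have variance ≈ mean (parameter values are made up):

```python
# Sketch: simulate Negative Binomial counts as a gamma-Poisson mixture.
# With shape 1/alpha and scale mu*alpha, the mixed rates have mean mu
# and the resulting counts have Variance = mu + alpha * mu**2.

import numpy as np

rng = np.random.default_rng(0)
mu, alpha, n = 100.0, 0.5, 200_000

lam = rng.gamma(shape=1 / alpha, scale=mu * alpha, size=n)  # per-sample rates
counts = rng.poisson(lam)                                   # NB-distributed

print(counts.mean())              # ~100 (the mean mu)
print(counts.var())               # ~100 + 0.5 * 100**2 = 5100 >> mu
print(rng.poisson(mu, n).var())   # ~100: Poisson variance equals the mean
```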
The Challenge of zeros in RNA-seq data
- RNA-seq data frequently contains zero counts for some genes because not all genes are expressed under all conditions.
- Most statistical models account for variance, but not that 0’s can dominate counts
Ex/ even with a high expected mean under a Poisson distribution, we can still have zeros or very low counts (zero = gene turned off)
In these circumstances, we have to use zero-inflated models.
Why are statistical models important in RNA-seq?
RNA-seq data = messy: counts vary, lots of zeros, and data has no simple patterns
We need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way
maximum likelihood estimation (MLE) for optimization algorithms
- used to estimate the parameters μ (mean) and α (dispersion) for each gene
** MLE tries to find the model parameters that make the observed counts most likely
Adjusts the model until the predicted counts match the actual counts as closely as possible (i.e. minimize the error)
Wald test
statistical test that helps us to determine whether estimated log fold change between 2 conditions is significantly different from zero.
Log Fold Change (β1) = 0
- means that the gene is expressed at the same level in both conditions.
- null = log fold change between conditions is 0. no difference in expression
- alternative = log fold change between conditions is not zero (there is a difference in expression).
Estimate Parameters from the Negative Binomial Model
- gives us an estimated log fold change (β1) for each gene
- Also gives us standard error (SE) for this estimate, which tells us how uncertain we are about the estimate of log fold change [ SE(β1) ]
Wald Statistic
tells us how many standard deviations the estimated log fold change is away from zero (no difference = 0): W = β1 / SE(β1)
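A minimal Python sketch with made-up numbers (scipy's normal distribution supplies the two-sided p-value):

```python
# Minimal sketch: the Wald statistic is the estimated log fold change
# divided by its standard error, compared against a standard normal.

from scipy.stats import norm

beta1, se = 1.2, 0.4               # estimated log fold change and its SE
wald = beta1 / se                  # 3.0 standard errors from zero
p_value = 2 * norm.sf(abs(wald))   # two-sided p-value

print(wald, p_value)               # 3.0, ~0.0027 -> likely significant
```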
likelihood ratio test (LRT)
Idea is to compare the likelihood of data under:
– The null model (same expression in both conditions)
– The alternative model (different expression levels in each condition)
volcano plot
displays the relationship between each gene’s statistical significance (p-value) and the magnitude of change (fold change)
volcano plot interpretation
Top corners: genes with high significance and large fold changes (both upregulated and downregulated)
Center: genes with little to no change or low significance
MA Plots
visualizes the relationship between the average expression (A) and the log fold change (M) for each gene
Usage: identifying trends or biases in expression data, such as mean-dependent variance
MA Plots interpretation
Center Line (M=0): No change in expression.
Spread: indicates variability in fold changes across different expression levels
heat map
- displays the expression levels of multiple genes across different samples using color gradients
Rows: Genes
Columns: Samples
Color Intensity: represents expression level (e.g. red for upregulation, blue for downregulation)
heat map interpretation
Identifying clusters of co-expressed genes and sample groupings based on expression profiles
Principal Component Analysis (PCA) Plots
- PCA transforms high-dimensional gene expression data into principal components that capture the most variance
Axes: Principal components representing the most significant sources of variation
Usage: assessing batch effects, overall data structure, and sample quality
PCA interpretation
Sample clustering: samples from similar conditions cluster together
Outliers: samples that do not group with others may indicate technical or biological variability
What is the main limitation of Sanger sequencing?
It has a high cost and low throughput.
How can Sanger sequencing be used to help with next-generation sequencing (NGS) technologies?
to confirm sequencing results and fill gaps
Which principle does Illumina sequencing rely on?
Sequencing by synthesis using reversible terminator nucleotides.
What is a significant challenge particular to de novo genome assembly?
Handling repetitive DNA sequences.
What does a directed edge represent in de Bruijn graphs?
The overlap between k-mer sequences
What computational complexity is characteristic of de novo assembly?
The similarity of k-mers within the reads
What is the primary benefit of using both short and long reads in de novo genome assembly?
Long reads can span repetitive regions, while short reads improve coverage.
Which algorithm is commonly used for local pairwise sequence alignment?
Smith-Waterman
BLAST
What is the main difference between global and local sequence alignment?
Global alignment matches sequences in full; local alignment focuses on best-matching parts
Local alignment identifies conserved regions; global for full sequence comparison
What is the significance of the numerical value in the bottom-right of the Needleman-Wunsch alignment matrix?
The numerical value in the bottom-right corner of the Needleman-Wunsch alignment matrix represents the optimal score of the global alignment between two sequences.
What are the inherent limitations of greedy algorithms for genome assembly? In particular, how might these limitations affect the assembly outcome when dealing with complex eukaryotic genomes?
Greedy algorithms for sequence assembly can be efficient for simple genomes but struggle with complex ones due to issues with repetitive sequences, limited global perspective, and handling of genetic
variation. These limitations can lead to misassemblies, especially in genomes with high complexity,
such as those with repetitive regions or structural variations.
You are analyzing bulk RNA-seq data from a study comparing gene expression in diseased versus
healthy tissue samples. You notice that a specific gene has a high fold change (i.e., it is up-regulated),
but the large p-value indicates that it is insignificant. What could be the most likely reason for this
observation?
The gene expression varies significantly within the sample groups
When comparing RNA-seq data from two different developmental stages of an organism, you find
many genes with altered expression. Which factor should be considered before attributing these
changes to developmental processes?
- Batch effects or variations in sequencing depth between the samples.
- Exclusive reliance on fold-change values.
- Attributing all changes in gene expression to transcriptional regulation.
Given the challenge of genomic variability and sequencing errors in read mapping, which approach
is most effective in distinguishing true splice junctions from artifacts?
Employing statistical models that account for sequencing error rates and genomic variability
The computational inference of splice junctions from RNA-Seq data involves aligning short reads that
may span exon-exon junctions. This process is complicated by the vast diversity of potential splicing events and the need for high accuracy in distinguishing actual splice junctions from sequencing
errors or genomic variations. Given these challenges, which strategy is most effective for improving
the accuracy of splice junction identification?
Use a hybrid approach that combines alignment to a reference genome with de novo assembly
of reads
Aligning short reads to a reference genome presents significant challenges, especially in the context
of repetitive sequences or highly variable regions. Which approach offers the best potential to enhance read mapping accuracy in these complex genomic landscapes?
Initially align reads to a simplified model of the genome and gradually integrate more complex
regions
What property of the Burrows-Wheeler transform is most crucial for improving the efficiency of pattern matching in biological sequences?
The rearrangement of characters to bring similar characters together.
Why is the Burrows-Wheeler transform significant for bioinformatics applications in the context of
FM-indexes?
It allows for efficient backward search, reducing the time complexity of finding patterns.
When applying the Burrows-Wheeler transform to a sequence, what is the importance of the last column?
It is for reconstructing the original sequence.
Which of the following best describes the role of the Burrows-Wheeler transform (BWT) in the FM
index?
BWT is the first step in FM-index construction
The true transcriptome of a sample is defined as:
The complete set of RNA molecules, including all isoforms present in the sample.
The concept of effective length in RNA-seq data analysis accounts for:
The adjustment for the empirical distribution of fragment lengths obtained during sequencing.
What is the primary goal of using maximum likelihood estimation in Salmon for RNA-Seq data analysis?
To maximize the probability of the observed RNA sequencing data
Why is the quality of sequencing data typically lower at the end of a read in Sanger sequencing?
- primarily attributed to the decreasing population of longer DNA fragments.
- probability of ddNTP incorporation
- concentration ratio of dNTPs to ddNTPs
- mass and mobility differences
- signal-to-noise ratio
NOT DUE TO NUCLEOTIDE DEPLETION
- mixture contains an excess of both dNTPs and ddNTPs.
- concentrations of these nucleotides are not depleted during sequencing.
- concentrations of dNTPs and ddNTPs remain constant throughout
What is the purpose of adding adapters to DNA fragments in Illumina sequencing?
Adapters are short oligonucleotide sequences added to DNA fragments during Illumina sequencing library preparation.
Purpose: Adapters contain sequences complementary to oligonucleotides on the Illumina flow cell surface. This allows DNA fragments to bind to the flow cell and form clusters.
If the concentration of ddNTPs is too high:
- short fragments
- loss of long reads
- reduced overall signal
If the concentration of ddNTPs is too low:
- long fragments
- loss of short reads
- weak signal for short fragments
ratio of ddNTPs to dNTPs in Sanger sequencing
- critical for generating a balanced distribution of fragment lengths
– fragment distribution
– read lengths
smaller k-mer sizes
- more likely to find overlaps between reads because they require fewer matching bases. This increases sensitivity, helping to connect reads in regions with low coverage or sequencing errors.
larger k-mer sizes
- larger k-mers are more specific, reducing the chance of erroneous overlaps but requiring higher-quality data
- more likely to span unique regions
- reduced overlap detection
- more memory used