Bioinformatics Exam Flashcards
AlphaFold 3
Predicts the joint structure of complexes including proteins, nucleic acids, small molecules, ions, etc.
HOMER2
Show that the effect of transcription factor binding on transcription initiation is position dependent
How do we acquire our DNA sample for DNA sequencing?
- Start with bacterial culture to produce the product of interest
– Biotechnology frequently uses massive E. coli cultures for production.
- Separate cells from media
– Centrifuge and separate cells and media
– Keep the component of interest (DNA)
– Break open the cells by lysing them (chemical lysis destabilizes the lipid bilayer and denatures proteins)
- Isolate and purify our DNA
– phenol-chloroform extraction (liquid-liquid separation)
– Aqueous phase (DNA/RNA) on top
– Organic phase (lipids/large molecules) on the bottom
Surfactants VS Phospholipids
- both contain a hydrophilic head and hydrophobic tail
– surfactants have only one hydrophobic tail, which allows them to penetrate molecular structures further than phospholipids, which have 2 tails
– break phospholipid barrier more and destabilize proteins (used for chemical lysis)
260 nm DNA sample absorbance
- absorbance at 260 nm is correlated to the DNA concentration of the sample
— looks for impurities in the sample solution
— can assume we have purified DNA sample after this step
– based on the absorbance of UV light (Beer-Lambert law)
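A rough sketch of turning an A260 reading into a concentration. It assumes the common rule of thumb that an A260 of 1.0 over a 1 cm path corresponds to about 50 ng/µL of double-stranded DNA; the function name and the dilution parameter are illustrative, not from the notes.

```python
def dsdna_conc_ng_per_ul(a260, dilution_factor=1.0, path_cm=1.0):
    """Estimate dsDNA concentration from absorbance at 260 nm."""
    # Beer-Lambert: absorbance is proportional to concentration * path length.
    # Assumption: ~50 ng/uL of dsDNA gives A260 = 1.0 over a 1 cm path.
    return a260 / path_cm * 50.0 * dilution_factor


print(dsdna_conc_ng_per_ul(0.5))  # 25.0
```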
Main purpose of Sanger Sequencing
— determine the precise ordering of nucleotides
DNA elongation
- occurs rapidly and continuously
- use DNA polymerase and excess nucleotides to make copies of DNA
- requires 3’ OH to add another nucleotide to the chain
Di-deoxynucleotides (ddNTP)
- ddNTPs stop replication
- do not have a 3’ OH for continued elongation
- usually a 1:100 ratio of ddNTPs to dNTPs
*** left with DNA strands of variable length
Sanger sequencing process
- sort DNA fragments by length to see what the last nucleotide is
– lower concentration of ddNTP results in longer strands
– higher concentration of ddNTP results in shorter strands
*** by sorting fragments by length, we can see what the last nucleotide was (line up 5’ nucleotide)
— get the template strand
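A toy sketch of the read-out idea: if each terminated fragment is known by its length and its last (ddNTP) base, sorting by length spells out the synthesized strand 5’ to 3’. The tuple layout and function name are assumptions made for illustration.

```python
def read_sanger(fragments):
    """fragments: list of (length, terminal_base) for ddNTP-terminated strands."""
    # Shortest fragment first: its terminal base is the first base read,
    # the next-shortest gives the second base, and so on.
    return "".join(base for _, base in sorted(fragments))


frags = [(3, "G"), (1, "A"), (4, "T"), (2, "C")]
print(read_sanger(frags))  # ACGT
```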
Original Sanger Sequencing SetUp
- split DNA sample into 4 beakers
- Add one type of ddNTP to each beaker (A, T, C, or G)
- Add some radioactive ddNTP into a single beaker
- Add Taq and run PCR
** separate by length in gel electrophoresis
(larger fragments do not travel as far)
– order from farthest traveled (shortest) to least traveled (longest)
***** need SEPARATE beakers bc you cannot differentiate between radioactive nucleotides
Sanger Sequencing Now
- now use fluorescent tags to distinguish ddNTPs
- only need one beaker for PCR
- also automate fragment separation
capillary gel electrophoresis
- can accelerate fragment length sorting and detection
- separates molecules by size based on their charge-to-mass ratio
- Smaller molecules move more freely/faster through the gel than larger molecules
- molecules must be charged through tagging with a charged molecule
- DNA and RNA are charged bc each nucleotide has a charge
SanSeq Chromatogram
- unique fluorescence signal per ddNTP produces a chromatogram
ideal SanSeq Chromatogram
- variation in peak height is less than 3-fold
- peaks are evenly distributed
- peaks contain only 1 color
- absent baseline noise
- interpreted nucleotide sequence is 5’ to 3’
Nonideal SanSeq Chromatogram
- significant noise in the first ~20 bp because transport of very short fragments is unreliable
- dye blobs from unused ddNTPs
- fewer long fragments, so the signal is weaker toward the end of the read
SanSeq VS Illumina Sequencing
- Sanger sequencing is very accurate but slow compared to Illumina
Illumina Sequencing
- sequencing by synthesis
- uses polymerase/ligase enzyme to incorporate nucleotides with fluorescent tag (fluorescently labeled reversible terminator)
- tags are then identified to determine the DNA sequence
Illumina Sequencing Process
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
– bridge amplification creates double-stranded bridges
– double-stranded clonal bridges are denatured and the reverse strands are cleaved
***clusters will give off a stronger signal compared to a single fragment
We repeatedly →
1. Add nucleotide
2. Capture signal
3. Cleave fluorophore
5-step Illumina sequencing process
1. Add labeled dNTPs into flow cells
2. Incorporate a complementary nucleotide
3. Remove unincorporated fluorescent nucleotides
4. Capture fluorescent signal & image clusters
5. Remove the fluorophores and the protecting group
paired-end sequencing
- enables both ends of the DNA fragment to be sequenced
– Because the distance between each paired read is known, alignment algorithms can use this information to map the reads over repetitive regions more precisely.
***Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
Nanopore Sequencing Technology
- a nanopore sits in a polymer membrane; DNA passing through the pore perturbs the electrical current, and that signal is read as sequence
*** gives us much longer reads which is important for assembling reads into a genome
** type of third-generation sequencing (TGS)
- can give long reads with no amplification
- Direct detection of epigenetic modifications on native DNA.
- sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms.
- Uniform coverage of the genome; not as sensitive to GC content as short-read platforms.
genome assembly
- process of combining the short, overlapping sequencing reads into continuous DNA sequence
– having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming from experimentation
contigs
continuous stretches of DNA sequence from overlapping sequencing reads
ambiguous assembly
connecting contigs in an unknown order
- accounts for differences in scaffolds
- assemble using reference genome
scaffolds
multiple overlapping contigs with estimated gaps put together in a known order
Assembly quality metrics
- sort contigs from longest to shortest
- Find point when you have ~50% of genome
- then annotate our genome with exons and introns
L50
- smallest number of contigs whose combined length is at least 50% of the total assembly length
*** Lower is better for L50 value
- longer contigs = more confidence that genome is right (higher quality assembly)
N50
- length of the shortest contig among the longest contigs that together cover 50% of the total genome length
*** Higher is better for N50 value [length-weighted median contig size = reliability factor]
- N50 is the length of shortest contig in L50 assembly
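A minimal sketch of both metrics, assuming the total assembly length (sum of contig lengths) stands in for the genome size. The function name is illustrative.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest to shortest
    half = sum(lengths) / 2                         # 50% of total length
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            # length = shortest contig needed to reach 50% (N50)
            # count  = how many contigs it took (L50)
            return length, count
    raise ValueError("no contigs given")


print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```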
why clean sequencing reads?
- improves assembly
- Garbage in = garbage out
FASTA files
- store sequences
- One line starts with a “>” and a sequence ID code
– It is optionally followed by a description of the sequence
- One or more lines containing the sequence itself.
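A small sketch of reading that layout, assuming the ID is everything between ">" and the first whitespace. The function name is made up for illustration; real projects usually rely on an established parser.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: ID first, optional description after whitespace.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            records[seq_id] = []
        elif seq_id is not None:
            # Sequence may span several lines; collect them all.
            records[seq_id].append(line)
    return {k: "".join(v) for k, v in records.items()}


example = ">seq1 test description\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))  # {'seq1': 'ACGTTTGA', 'seq2': 'GGCC'}
```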
However…base calling is NOT perfect
lagging synthesis
- caused by failure to remove the blocking fluorophore
- synthesis is behind by 1 because the blocking fluorophore was not removed
leading synthesis
- caused by incorporating a nucleotide that lacks a working terminator block (behaves like a plain dNTP)
- synthesis is ahead by 1 nucleotide because 2 were added at once
signal cross talk
- degrades the quality of assembly
- clean = clear
- noisy = blurry
- ML models and algorithms compute the probability of error → i.e. quality
— not confident that what it is seeing is purely blue, green, etc.
FASTQ files
- store sequence and quality
- quality scores measure the probability that a base is called incorrectly
ASCII-encoded probabilities
- allow storing one error probability per nucleotide without storing floats
- ASCII characters require ~¼ the memory of a float and we already have to store nucleotides
- ASCII characters have an associated integer
(phred quality (Q))
Phred quality P(Q)
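the integer associated with the ASCII symbol (ASCII code minus an offset of 33)
- indicates the probability that an error has occurred
– smallest ASCII value used is 33 (!) because lower ASCII codes are control characters or whitespace that cannot be rendered on screen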
– ! = probability of error =1 (very bad quality)
*** Lower you go down the chart = higher quality of read; less likely for there to be errors within the text file
calculating phred quality
P(Q) = 10^(-Q/10), where Q = (ASCII code) - 33
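A short sketch of decoding a FASTQ quality string, assuming the Phred+33 encoding described in these cards (Q = ASCII code - 33, P = 10^(-Q/10)). The function name is illustrative.

```python
def phred_error_probs(quality_string, offset=33):
    """Convert ASCII-encoded qualities into error probabilities."""
    # ord(ch) gives the ASCII integer; subtracting the offset gives Q.
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]


# "!" is Q0 (P = 1), "+" is Q10, "5" is Q20, "I" is Q40
print(phred_error_probs("!+5I"))
```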
Where do FASTQ file entries go?
- NIH databases
- GenBank for genomic sequences
- Sequence read archive (SRA) for sequencing data
- RefSeq for reference genomes
- BioProject for curated resources for a specific project
quality issues in sequencing data
- errors are introduced due to the technical limitations of sequencing platforms
- adapters may be present if reads are longer than the fragments sequenced
— trimming adapters may improve the number of reads mapped
** quality control is an essential first step in any analysis
Per base sequence quality
- box and whisker plot of base-call accuracy
- green = excellent
- yellow = good
- red = poor
Per sequence GC content
- strong deviation from normal distribution could indicate contamination
- shows curves of GC count per read
- and theoretical distribution curve
**Compare the similarity of the 2 curves to indicate purity/quality of sample
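A minimal sketch of the per-read GC calculation behind that plot. Ignoring ambiguous (N) bases is an assumption of this sketch, and the function name is illustrative.

```python
def gc_fraction(read):
    """Fraction of called bases (A/C/G/T) in a read that are G or C."""
    called = [b for b in read.upper() if b in "ACGT"]
    if not called:
        return 0.0  # nothing but ambiguous bases
    return sum(b in "GC" for b in called) / len(called)


reads = ["ATGC", "GGCC", "ATAT"]
print([gc_fraction(r) for r in reads])  # [0.5, 1.0, 0.0]
```

Collecting these values over all reads gives the observed curve that is compared against the theoretical distribution.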
trimming and filtering
- trimming of problematic bases at the ends of reads must be done in order to reduce bias in future analyses
Trimming:
1. low quality score regions
2. beginning/end of sequence
3. remove adapters
Filtering:
1. reads with low mean quality score
2. reads that are too short
3. reads with too many ambiguous (N) bases
Ex/ Cutadapt, Trimmomatic, fastp
adapters for trimming
- adapters are unique to DNA prep protocol and technology
- note which specified adapter sequences are used for trimming
automatically cleaning data for quality
processing data many times consumes many resources SO combine tool features into a single run
Instead of trimming adapters in one run and quality in another, we can simultaneously remove base calls with low accuracy.
– Phred ≤ 20 → poor
– Length required = 20
- automatically removes low-accuracy and short reads, which is needed to assemble reads into quality contigs/scaffolds
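A simplified sketch of quality trimming and length filtering in one pass, using the two thresholds above (Phred ≤ 20 is poor, required length 20). It only trims the 3’ end and assumes Phred+33 encoding; this is not how any particular tool implements it, and the function name is hypothetical.

```python
def trim_and_filter(seq, qual, min_q=20, min_len=20, offset=33):
    """Trim poor-quality bases off the 3' end, then drop short reads."""
    scores = [ord(c) - offset for c in qual]
    end = len(seq)
    # Walk back from the 3' end while the base call is poor (Q <= min_q).
    while end > 0 and scores[end - 1] <= min_q:
        end -= 1
    if end < min_len:
        return None  # read is too short after trimming: filter it out
    return seq[:end], qual[:end]


seq = "A" * 25 + "C" * 5
qual = "I" * 25 + "#" * 5  # "I" = Q40, "#" = Q2
print(trim_and_filter(seq, qual))
```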
resequencing
align reads to reference genome and identify variants
de novo assembly
construct genome sequence from overlaps between reads
*** done 99% of the time
- repeats/high coverage are the main challenges
Why do we want the shortest superstring?
- overlap maximization
- reduces redundancy
- maximizes confidence with highest overlaps
- repeat resolution resolves repeats by favoring collapsed arrangements
- evolutionary pressure: most genomes are under selective pressure to be efficient
greedy algorithm
- merge strings by highest overlap
Procedure →
1. Merge strings one at a time, keeping consistent with 5’ and 3’ orientation
2. Always merge the largest overlap (greedy), not necessarily the size of fragment
3. Repeat
*** Being greedy makes genome assembly tractable
*** not used in practice but helps us to understand the problem
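A toy sketch of the greedy procedure, assuming error-free reads that are all on the same strand. Ties go to the first pair encountered; function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of strings with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:  # strict ">" means first tie wins
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining strings stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGCG", "GCGTA", "GTACC"]))  # ['ATGCGTACC']
```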
What happens if we have a tie for the greedy algorithm?
- Choose randomly (first encountered, first merged)
- Choose highest quality base call (use sequence with highest quality)
- Choose highest coverage (whichever results in more coverage)
- Look ahead (do both and evaluate consequence)
- Exclude (don’t merge at all)
repeats ruined our assembly
- missing strings can result from the greedy assembly process
- get the correct string back by increasing our K
de Bruijn graph
- a graph is a data structure for representing relationships between items
- node = single entity [(k-1)-mer]
- edge = represents a connection between entities (can have direction) [k-mer]
directed multigraphs for genome assembly
- genome assembly uses directed edges to specify overlap and concatenation
Building a directed multigraph
- each unique k-mer is a node. (k-mer = substring of length k)
- Add directed edges for each overlap and concatenation
node is balanced if
indegree equals outdegree
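A small sketch of building the graph and checking balance, assuming the convention from the de Bruijn graph card: nodes are (k-1)-mers and every k-mer contributes one directed edge. Function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """One directed edge (prefix -> suffix) per k-mer occurrence."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))  # (k-1)-mer to (k-1)-mer
    return edges


def degrees(edges):
    """Return {node: (indegree, outdegree)}."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for left, right in edges:
        outdeg[left] += 1
        indeg[right] += 1
    nodes = set(indeg) | set(outdeg)
    return {n: (indeg[n], outdeg[n]) for n in nodes}


def is_balanced(node, deg):
    indegree, outdegree = deg[node]
    return indegree == outdegree


edges = de_bruijn_edges(["ATGGC"], k=3)
deg = degrees(edges)
print(edges)  # [('AT', 'TG'), ('TG', 'GG'), ('GG', 'GC')]
print(sorted(n for n in deg if is_balanced(n, deg)))  # ['GG', 'TG']
```

Here AT and GC are the two semi-balanced nodes, the start and end of the walk.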
cyclical sequence
- Circular genomes are not Eulerian
- Contains an extra edge
Why is this not Eulerian?
- more than two semi-balanced nodes
- cannot walk along each edge once
– if there was no overlap, then we would have some unconnected graphs
de Bruijn graphs and errors
- errors dramatically increase the number of edges and unconnected graphs
- errors affect k-mer counts
- Error correction should remove most tips and islands; rest can be removed here, leveraging graph structure
Graph traversal algorithms are used to extract contigs (procedure)
- Select a start node
- Walk along the graph until a dead end or previously visited node is reached
- Backtrack and explore alternative paths
- Repeat for remaining unvisited nodes
*** walking along the graph produces strings
how do we select a starting node?
using hubs with in and out degrees
high coverage
suggests that the node is likely a true sequence rather than an error
– confidence in that overlap is good and that node is a good starting point
How do you choose a walk
- Start at a chosen vertex (node).
- Mark the current vertex as visited.
- Explore an adjacent unvisited vertex.
- If no unvisited adjacent vertices exist, backtrack to the last vertex with unvisited adjacent vertices.
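The four steps above describe a depth-first walk. A minimal sketch, assuming the graph is a dict mapping each node to a list of its successors (the names and the example graph are made up):

```python
def dfs_order(graph, start):
    """Return nodes in the order a depth-first walk first visits them."""
    visited = [start]  # mark the start vertex as visited
    stack = [start]    # current path, used for backtracking
    while stack:
        node = stack[-1]
        unvisited = [n for n in graph.get(node, []) if n not in visited]
        if unvisited:
            nxt = unvisited[0]  # explore an adjacent unvisited vertex
            visited.append(nxt)
            stack.append(nxt)
        else:
            stack.pop()         # dead end: backtrack
    return visited


graph = {"AT": ["TG"], "TG": ["GG", "GA"], "GG": ["GC"], "GA": [], "GC": []}
print(dfs_order(graph, "AT"))  # ['AT', 'TG', 'GG', 'GC', 'GA']
```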
How do we choose the “best” path for our contig?
- Long paths are desired but not always reliable due to potential repeats
- High, consistent read coverage
- Unique, non-branching paths
SPAdes
prokaryotic genome assembler
– based on de Bruijn graphs with numerous improvements
Error correction with BayesHammer
- Build hamming graphs for k-mers
– Undirected edges for Hamming distances of n nucleotide differences
- Identify strong k-mers based on clustering (i.e. high similarity)
— Estimate read error based on base qualities
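A toy sketch of the Hamming graph idea only: k-mers are connected by an undirected edge when they differ in at most a chosen number of positions. This is not the SPAdes implementation; the names and the max_dist parameter are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_dist=1):
    """Undirected edges between k-mers within max_dist mismatches."""
    kmers = sorted(set(kmers))
    return [
        (a, b)
        for i, a in enumerate(kmers)
        for b in kmers[i + 1:]
        if hamming(a, b) <= max_dist
    ]


print(hamming_graph(["ACGT", "ACGA", "TTTT"]))  # [('ACGA', 'ACGT')]
```

Connected k-mers form clusters; a rare k-mer sitting next to a very frequent one is a likely sequencing error.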
multisized graphs and SPAdes
- building multisized graphs with different k’s
- using multiple graphs with different sizes of K’s allows for handling of variable coverages
large K SPAdes graphs
- leads to fragmented graphs
- good for high-coverage
small K SPAdes graphs
- leads to collapsed, tangled graphs
- great for low-coverage regions (not too picky)
potential bulge in SPAdes graph
small, alternative path in the graph that diverges and then merges back into the main path
– due to sequencing errors, repetitive sequences, or small variations (indels)
** must remove bulge, but removing it naively will quickly deteriorate the graph and lose read info
– must project the info/coverage into Q
– P’s edges are removed in the process
potential tip in SPAdes graph
a short, dead-end path in the graph that does not connect back to the main sequence or structure
– result of sequencing errors, such as incomplete reads, low coverage, or random noise, which generate k-mers that don’t correctly align with the rest of the sequence
– Removes P (shortest) and projects information onto Q
Paired-end reads do not always cover our whole insert
- If our insert (i.e. DNA sample) is longer than reads, then we don’t sequence the inner distance.
- We want to maximize this inner distance.
- A gap between paired reads gives us insight into repeated regions.
SPAdes estimates…
- …gap length between 2 reads via de Bruijn graphs
- does not always have to be a repeating sequence; better for gaps than unique sequences
assembler graphs
- assemblers provide contigs and scaffolds
- island contains 1 or more contigs
- solid lines are called nodes and represent a contig
- each connection suggests how these contigs connect to form a scaffold
Ex/ Bandage
gene annotation
- identifying the genetic elements and function in our contigs
- results in sequences that likely encode for proteins
- 2 types: structural and functional
Ex/ Prokka (several outputs)
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
functional annotation
predicts the function of genetic elements
- normally based on protein database search
eukaryotic VS prokaryotic annotation
- Eukaryote annotation is significantly more challenging than prokaryotes
- Introns and alternative splicing complicate eukaryote annotation
P: probabilistic models to identify open reading frames
E: accuracy demands supporting evidence like mRNA sequencing
Identifying open reading frames (ORFs)
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
— TAA, TAG, TGA for bacteria, archaea, and plant plastids
***then score the potential ORFs
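A minimal sketch of the ORF scan, using the start and stop codons listed above. It only looks at the three forward reading frames (a real annotator also scans the reverse complement), and the function name and min_codons parameter are illustrative.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, orf_sequence) tuples on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # [(2, 11, 'ATGAAATAG')]
```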
ribosomal binding site motif score
- RBS score computed from dataset fitting
– Search for RBS motif upstream of the start codon; choose whichever has the lowest bin number
– take the training data from different annotated genomes to get the computed frequency of each RBS motif bin in the entire sequence (baseline) and the RBS frequency
- start codon score given by similar RBS framework
upstream score
- Upstream score based on base analysis
– By analyzing base frequency in specific upstream region, their annotation results improved
**essentially looking for promoters
coding score
- computed based on gene enrichment parameters
- computed frequency of nucleotide hexamers, called words
– probability of observing word within single genes [G(w)]
– probability of observing word within the whole genome/entire DNA sequence [B(w)]
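A hedged sketch of how G(w) and B(w) could be combined. Summing log(G(w)/B(w)) over the hexamers of a candidate ORF is an assumption of this sketch (the cards only define the two probabilities), as are the function names and the small floor used for unseen words.

```python
import math
from collections import Counter


def word_freqs(seq, k=6):
    """Relative frequency of every k-length word in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def coding_score(orf, gene_freqs, background_freqs, k=6, floor=1e-6):
    """Sum of log(G(w) / B(w)) over the words of a candidate ORF."""
    score = 0.0
    for i in range(len(orf) - k + 1):
        w = orf[i:i + k]
        g = gene_freqs.get(w, floor)        # G(w): frequency within genes
        b = background_freqs.get(w, floor)  # B(w): frequency genome-wide
        score += math.log(g / b)
    return score


G = {"ATGAAA": 0.5}
B = {"ATGAAA": 0.25}
print(coding_score("ATGAAA", G, B))  # log(2), about 0.693
```

A positive score means the ORF's words are enriched in known genes relative to the background.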
why is sequence alignment important for bioinformatics?
- Biological sequences reveal evolutionary relationships
- Sequences play a large role in the central dogma of molecular biology
hox genes
- highly conserved genes
- Play a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis
***So how do we know that it is highly conserved?
– By aligning sequences!!
– infrequent changes (high similarity) indicate evolutionarily conserved sequences
pairwise alignment
– reveals relationships between biological sequences
– Multiple Sequence Alignment (MSA) extends pairwise
Multiple Sequence Alignment (MSA)
- the process of aligning 3 or more biological sequences simultaneously
– Identifies conserved regions across multiple species
– Reveals patterns not visible in pairwise comparisons
Aligning sequences can provide more insight than just conservation
- Functional annotation (like a Google search through databases)
- RNA and protein structure (ex/ AlphaFold)
- Disease-associated mutations
- Vaccine design
Importance of scoring in alignment selection
Alignment scores guide the selection of meaningful alignments
- objectivity
- optimization
- significance
objectivity importance for alignment selection
provides a quantitative measure for comparison
optimization importance for alignment selection
allows algorithms to find the best alignment
significance importance for alignment selection
helps distinguish real homology from random similarity
match
- identical characters in aligned positions
- Represents conserved regions or no change
alignment elements reflect…
evolutionary events in sequences
- matches, mismatches, gap
mismatch
- different characters in aligned positions
- Indicates substitutions or mutations
gap
- dash(-) inserted to improve alignment
- Represents insertions and deletions (indels)
linear gap penalty
- fixed cost for each gap position
Ex/ -2 for each gap position, so a gap of length n costs -2n
affine gap penalty
- different costs for opening and extending gaps
Ex/ gap open= -4, gap extend= -1
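A quick sketch comparing the two schemes with the example values above. It assumes the opening cost covers the first gap position and each further position pays the extension cost; other conventions exist. Function names are illustrative.

```python
def linear_gap(length, per_position=-2):
    """Linear penalty: the same cost for every gap position."""
    return per_position * length


def affine_gap(length, gap_open=-4, gap_extend=-1):
    """Affine penalty: pay once to open, then a smaller cost to extend."""
    if length == 0:
        return 0
    # Assumption: gap_open covers the first position of the gap.
    return gap_open + gap_extend * (length - 1)


for n in (1, 5):
    print(n, linear_gap(n), affine_gap(n))
# 1 -2 -4
# 5 -10 -8
```

A short gap is cheaper under the linear scheme, while a long indel is cheaper under the affine scheme.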
gap penalties
** reflect biological assumptions and impact alignment outcomes
Implications of Gap Penalty Types
1.) Linear penalties:
– Simpler to implement
– May over-penalize long gaps
2.) Affine penalties:
– Better handling of long indels
– More biologically realistic
3.) Biological rationale:
– Single mutation event often causes multi-base indel
– Affine penalties better model this biological reality
Sophisticated scoring approaches (gap penalties)
***Advanced scoring methods enhance alignment accuracy
1.) Position- specific gap penalties:
– Reduce penalties in variable regions
– Increase penalties in conserved regions
2.) Residue-specific gap penalties:
– Adjust penalties based on amino acid properties
3.) Terminal gap penalties:
– Often reduced to allow end gaps in local alignments
Protein alignments that require sophisticated scoring systems
***Simple match/mismatch scoring is insufficient bc:
- Some amino acid substitutions are more likely than others
- Chemically similar amino acids often substitute without affecting function
- Evolutionary relationships between amino acids are complex
substitution matrices
*quantify amino acid replacement probabilities
- probability that a.acid i mutates into a.acid j for all pairs of a.acids
- Constructed by assembling a large and diverse sample of verified amino acid alignments
- Reflect the true probabilities of mutations occurring through a period of evolution
global alignment
- compares sequences in their entirety aka from START to END
***Needleman-Wunsch
Key Characteristics of Global Alignment
- Attempts to align every residue in both sequences
- Introduces gaps as necessary to maintain end-to-end alignment
- Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch
- guarantees optimal global sequence alignment
- Final number = final alignment score
- traceback to find the best alignment
***Look at every possible move you can make to get into that cell
– Diagonal = mismatch/match
– Side/up/down = gap
– MATCH ⇒ diagonal
*** There can be multiple optimal alignments
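A compact sketch of Needleman-Wunsch with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are assumptions, and the traceback returns just one of the possibly many optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap  # leading gaps in a

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell takes the best of the three moves into it.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: match/mismatch
                score[i - 1][j] + gap,            # up: gap in b
                score[i][j - 1] + gap,            # left: gap in a
            )

    # Traceback from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```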