Final Exam Flashcards
DNA sequencing set-up
- Start with bacterial culture for product of interest
- Separate cells from media via centrifuge
- Keep DNA by breaking open cells via lysing
- Isolate and purify DNA using liquid-liquid extraction (aq layer has DNA)
chemical lysis
destabilizes the lipid bilayer and denatures proteins
surfactants
one hydrophobic tail, which allows them to further penetrate molecular structures as compared to phospholipids with 2 tails
Similar to phospholipids, but break through barrier and destabilize proteins better
Main problem of determining the order of nucleotides
DNA elongation happens rapidly and continually
Uses DNA polymerase and excess of nucleotides to make copies of DNA
3’ OH is required for DNA elongation
Di-deoxynucleotides stop replication bc it lacks 3’ OH so polymerase cannot add another nucleotide to it
sanger sequencing
- accurate, long reads, but resource consuming
- use one beaker and fluorescence to distinguish between the ddNTPs
– Fragment separation can be automated via capillary gel electrophoresis
– Separates molecules by size based on their charge-to-mass ratio
Smaller molecules move more freely through the gel and migrate faster than larger molecules
molecules must be charged through tagging
– Unique signal per ddNTP products chromatogram
Building strand from fragments
Sort DNA fragments by length to see what the last nucleotide was
Line up the last 5’ nucleotide; gradually builds the 3’ end up to get strand
Original Set up →
- Split sample into 4 beakers
- Add all 4 ddNTPS into each beaker & radioactive ddNTP
Need separate beakers bc cannot differentiate between them - Add Taq polymerase
- Separate by length using gel electro.
Shortest lengths travel the farthest; associate them with a beaker
Good vs Bad chromatogram
Good:
- Variation in peak high is less than 3-fold.
- Peaks are evenly distributed and one color
- Baseline noise is absent
Interpreted nucleotide sequence is 5’ → 3”
Bad:
- Significant noise up to ~20 bps in (unreliable transport properties)
- Dye blobs occur from unused ddNTPs
- Fewer longer fragments so signal is weaker
Illumina
short reads, but high throughput
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- Primers are not complementary, so they do not base pair
- Fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
- Bridge amplification creates double-stranded bridges
- Double-stranded clonal bridges are denatured with cleaved reverse strands
- uses pair-end sequencing
***clusters will give off a stronger signal compared to a single fragment
Illumina stepwise
- Add labeled dNTPs into flow cells
- Incorporate a complementary nucleotide
- Remove unincorporated fluorescent nucleotides
- Capture fluorescent signal & image clusters
- Cleave the fluorophores and the protecting group
Pair-end sequencing
generated from both ends of a DNA fragment with known insert size
enables both ends of the DNA fragment to be sequenced
Distance between each paired read is known, alignment algorithms can use this info to map the reads over repetitive regions more precisely.
Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
** more expensive but ideal for genome assembly
Nanopore
Longer reads, more accurate for assembling reads into genome
Very expensive, low throughput
single-end reads
- generated from only one end of a DNA fragment
- Simpler, fast, more cost-effective
- Limited context for structural variations or duplications
- Used for small genomes and RNA seq where contiguity is less critical
Genome assembly
- process of combining our sequencing reads into a continuous DNA sequence
(Sequencing provides short, overlapping reads of DNA)
Having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming the experiments
Contigs
continuous stretches of DNA seq from overlapping seq reads
Ambiguous assembly
contigs put together in an unknown order
Accounts for differences in scaffolds; Assemble using reference genome
Scaffold
contigs put together overlapping with estimated gaps in a known order
main challenges for deNovo genome reconstruction
Repeats: create ambiguity and can cause assemblies; inflate genome size
High Coverage: sequencing the genome multiple times, resulting in a greater number of reads that overlap any given region of the genome
greedy overlap
deNovo genome reconstruction
Goal is to assemble the strings (reads) into a continuous, single string (contig)
Want the shortest possible superstring
- Overlap maximization
– Reduces redundancy, maximizes confidence with highest overlap - Repeat resolution
– Resolves repeats by favoring collapsed arrangements - Evolutionary pressure
– Most genomes have selective pressure to be efficient
how to do a greedy assembly?
merge by highest overlap!!
Repeats ruin assembly ⇒ can cause missing reads
Increase K to overcome repeats
de Brujin graphs
- help for to visualize relationships/overlaps between the strings
- Node = single entity [k-1]
- Edge = represents a connection between entities (can have direction) [k]
- uses direct edges to specify overlap and concatenation
- Each unique k-mer is a node. (K-mer = substring of length k)
- A node is balanced if indegree equals outdegree
multiple reads for DB graphs
not Eulerian bc cannot walk along each edge once; 2 semi-balanced nodes
edges on walk extend the contig in multiple directions
errors in assembly effect on DB graphs
Errors affect:
1) k-mer counts, 2) increase # of edges and unconnected graphs
- No overlap would lead to unconnected graphs; weights can be added to arrows (#)
Error correction should remove most tips, islands, bulges (splits and reconnects)
high coverage for deBrujin
High coverage suggests that a node is likely a true sequence rather than an error
How do we choose the “best” path for our contig?
Long paths are desired but not always reliable due to potential repeats
High, consistent read coverage
Unique, non-branching paths
SPAdes
- prokaryotic genome assembler based on DB graphs
- Estimates gaps between reads using DB graphs
- Builds multisized graphs with different k’s.
- Using multiple graphs allows for a better handling of variable coverage.
- Assemblers provide contigs and scaffolds (connections how contigs form scaffolds)
Large VS Small K
Large K ⇒ fragmented graphs; helps reduce repeat collapsing
Small K ⇒ collapsed/tangled graph good for low-coverage regions
L50
NUMBER of contigs whose combined length is at least 50%
Lower is better for L50 value
**longer contigs = more confidence that genome is right
N50
LENGTH of the shortest contig at 50% of the total genome length
Higher is better for N50 value [median contig size = reliability factor]
***N50 is the length of shortest contig in L50 assembly
leading synthesis
- by addition of dNTP instead of ddNTP
- synthesis is ahead by 1 nucleotide bc 2 were added at once
lagging synthesis
by failure to remove blocking fluorophore
signal cross talk
degrades quality of assemblies
phred quality score
assess the accuracy of nucleotide base calls in DNA sequencing (prob that base call is incorrect)
ASCII encoded probability store phred quality scores in FASTQ file
Per sequence GC content
Deviation from normal distribution indicates contamination (reads)
Trimming/Filtering
reduces bias of bad base calls normally at the ends of reads
Trimming/cutting/masking sequences
– From low quality score regions
– Beginning and end of sequence
– Remove adapters
Filtering of sequences
– With low mean quality score
– Too short
– With too many ambiguous (N) bases
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
Functional annotation
- predicts the function of genetic elements
Reading ORFS →
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
TAA, TAG, TGA for bacteria, archaea, and plant plastids
Typical elements of a gene that are annotated
Promoter, start site, 5’ UTR, exons, introns, start codon, CDS, stop codon, 3’ UTR
MSA
the process of aligning three or more biological sequences simultaneously
Identifies conserved regions across multiple species
Reveals patterns not visible in pairwise comparisons (evol. relationships)
Key characteristics:
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence
Important elements of scoring in alignment selection
Objectivity: provides a quantitative measure for comparison
Optimization: allows algorithms to find the best alignment
Significance: helps distinguish real homology from random similarity
Alignment elements reflect …
evolutionary events in sequences
(match, gap, mismatch)
match
identical characters in aligned positions
Represents conserved regions or no change
mismatch
different characters in aligned positions
Indicates substitutions or mutations
gap
dash(-) inserted to improve alignment
Represents insertions and deletions (indels)
global alignment
compares sequences in their entirety (start to end)
Key characteristics:
– Attempts to align every residue in both sequences
– Introduces gaps as necessary to maintain end-to-end alignment
– Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch: guarantees optimal global alignment
Advantages of global alignment
Provides a complete picture of sequence similarity
Ideal for detecting overall conservation patterns
Useful for phylogentic analysis of related sequences
limitations of global alignment
May force alignment of unrelated regions in divergent sequence
Less effective for sequences of very different lengths
Can be computationally intensive for long sequences
local alignment
identifies best matching subsequences; focus on regions of high similarity
Key characteristics:
– Does not require aligning entire sequences end-to-end
– Allows for identification of conserved regions or domains
– Ignores poorly matching regions
– Can find multiple areas of similarity in a single comparison
– Aligns subsections of sequences
– Protein motif identification exemplifies local alignment utility (identifies functional regions)
Smith-Waterman
Needleman Wunsch
- start with 0 in top corner
- add gap penalty down the first row
- move across to get the highest possible score while including penalities
- score is in bottom row
Smith Waterman
- 0 zero is the lowest score
- if negative, make it 0
- enter 0’s in starting rows
- Start alignment at the highest cell
- Stop aligning when you encounter a zero
Smith-Waterman differs from Needleman-Wunsch in key aspects →
Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros
Scoring system:
NW: allows negative scores
SW: negative scores are set to zero
Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix
linear gap penalty
fixed cost for each gap
Similar to implement, over-penalizes long gaps
affine gap penalty
different costs for opening and extending gaps
Better for long indels, more biologically realistic (Single mutation event often causes multi-base indel)
position-specific gap penalties
Reduced in variable regions; increase in conserved regions
residue-specific gap penalties
Adjust penalties based on amino acid properties
terminal gap penalties
Often reduced to allow end gaps in local alignments
transcriptomics
allows us to see exactly what genes are active within a given moment
Allows us to see changes in gene expression overtime (picture of gene exp.)
Works with a complete set of RNA transcripts (mRNA, rRNA, tRNA, non-coding RNA)
Captures the dynamic nature of the transcriptome to reflect the functional state of the cell; captures cell’s response to environment and signals
*** what annotated genes are actually being used
isoforms
a single gene can produce multiple mRNA transcripts
Way for org. to increase protein diversity without increasing the number of genes
reveals alternative splicing and reforms (cell type, envt, developmental state)
genomics VS transcriptomics
Functional insights →
- Identifies potential functional elements
- Reveals which elements are active
- Predicts disease risk
- Shows diseases state
Temporal insights →
- Requires one-time sampling
- Captures real-time cellular responses
- Reveals evolutionary history
single-cell transcriptomics
- revolutionizes resolution
- best for rare cells with complex tissue types
- captures gene expression in an individual cell
- reveals cellular heterogeneity within the tissues
***very powerful data but can be very sparse and noisy
***Not very reproducible bc there is so little RNA in a cell; typically paired with bulk RNA analysis
spatial transcriptomics
- maps gene expression to location
- Preserves spatial information of transcripts within tissue sections
- Reveals how cellular neighborhoods influence gene expression
RNA integrity number
- rRNA makes up a large percentage of our RNA
- lower numbers are degreaded sample (28S is degraded to 18S rRNA)
filter for mRNA only
poly A tail primer allows for amplification of only mRNA
microarrays
- convert mRNA to cDNA
- no longer in practice
- LIMITED to known sequences
- similar sequences may cause false positives
- limited dynamic range
- normalization challenges
- potential for bias
RNA-seq
- doesn’t require prior knowledge of sequences; allows for discovery of novel trancripts/isoforms (primary advantages over microarray technology)
computational pipeline for RNA-seq data analysis
- Read alignment: mapping transcripts to the genome
- Quantification: measuring gene expression levels
- Differential expression analysis: identifying key genes
- Dimensionality reduction: visualizing complex data (not in practice)
Hash table
- link a key to a value
- keys represent a label we can use to get info
- hash function used to determine where to find their number
- DNA dictionary with quick lookup and direct access to potential matches
(large memory and slow for large genomes)
** way for reads to be mapped to reference genomes
suffix arrays/trees
- represent all suffixes of a given string
- used to find the starting index of a suffix
- arrays are a memory-efficient alternative to trees
— require less memory, but are less powerful
*** create all suffixes; fix with end-of-string identifier; then sort lexicographically
we LOSE the original data
Burrows-Wheeler transforms (BWT)
- compresses the amount of data that we have to store without losing the original data
— allows for reversibility of data
Basic concept of BWT:
– Append a unique end-of-string (EOS) marker to the input string.
– Generate all rotations of the string.
– Sort these rotations lexicographically
– Extract the last column of the sorted matrix as the BWT output.
– 1st column is more compressible but lose context/reversibility
backwards search for BWT
- backwards search efficiently finds occurrences of a pattern in the text using L-F mapping
- reversibility of BWT is better than suffix arrays bc we do not lose data
alignment
- specifies where exactly in the transcript this read came from
– (at position ___) - specifies where exactly in the transcript this read came from
*** need to determine the read’s exact position in the transcript but they are SOOO EXPENSIVE $$$$
pseudoalignment
- specifies that it came somewhere from this transcript (compatible)
- Finds which transcript, but not where
- Identifies which transcripts are compatible with the read, skipping the precise location step
- Faster and less resource intensive than alignment based methods
- Lacks certain details (position and orientation of reads) which are useful for correcting technical biases
quantifying gene expression levels
- Must scale data for higher precision, less memory
– Read per kilobase(RPK): corrects this experimental bias through normalization by gene length
– Parts per million (ppm)
generative model
- a statistical model that explains how the observed data are generated from the underlying system
- Defines a computational framework that produces sequencing reads from a population of transcripts
- get reads from the transcript though we don’t know how much transcript is there bc it is bias
— go backwards to calculate transcript abundance from the read distribution
transcript fraction
- tells us the proportion of total RNA molecules in the sample that come from a certain transcript
- adjusts for the fact that longer transcripts generate more reads
- normalizes length VS nucleotide to transcript proportions
fragment probabilities
conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical bias
- SALMON approximates
Positional bias:
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads
GC content:
- Undersample GC regions
- Make good stop codons
- Oversample AT rich regions
expectation-maximization algorithm
E) estimate missing info (assignment of fragments to transcripts) using the current transcript abundance estimates
M) use the estimated assignments to update the transcript abundances (improves likelihood)
For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
– ensures the accuracy of abundance estimates by correcting bias learned during the estimation (online) phasee
transcript effective length
adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
maximum likelihood estimation (MLE) goal
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
2 Phase interference in Salmon
- online phase: fast, initial estimates of transcript abundances
- offline phase: refines initial estimates using more complex optimization techniques
** balances speed (online) with accuracy (offline)
quasi-mapping
- a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
*** often used for the initial estimates of the online phase in SALMON
Expensive so stops after identifying seeds !!!
SALMON transcript-fragment assignment matrix
- uses matrix to identify distributions of reads amongst the transcripts
— computationally assigns fragments to transcripts
*** maps RNA-seq reads (fragments) to transcripts, enabling accurate quantification of transcript levels
— decides how many fragments are assigned to a specific transcript (higher expression = more fragment abundance in a transcript)
statistical model
mathematical tool that describes how data is generated
***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance
hypothesis testing
- see if the difference in expression between conditions statistically significant
Null (Ho) = There is no difference in gene expression between the 2 conditions.
Alternative (H1) = There is a significant difference in gene expression between the conditions.
Reject the null hypothesis when there is a difference that could not have happened by random chance
p-value
- the probability of the null hypothesis being true
– The higher the p-value, the more our model supports the null hypothesis
– The lower the p-value, the more our model supports the alternative hypothesis
differential gene expression
the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions
RNA-seq pipeline
- read alignment: mapping transcripts to the genome
- quantification: measuring gene expression levels
- differential expression analysis: identifying key genes and comparing gene expression levels
- dimensionality reduction: visualizing complex data
*** quantifying gene expression levels using high-throughput sequencing technologies
generates count data
count data
the number of RNA fragments that map to each gene
** generated for RNA-seq
binomial distribution
models the number of successes in a fixed number of independent trials, where each trial has the same probability of success
binomial distribution limitations
MAIN limitation → assumes that the probability of success is constant between samples
Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth.
Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not at all.
Poisson distribution
- a baseline for modeling discrete counts
- statistical tool used to model the number of events(or counts) that happen in a fixed period of time or space, where:
– The events are independent of each other
– Each event has a constant average rate
***gives an accurate distribution of counts if mean and variance are equal
(AKA probability of observing the sequenced fragments)
parity plots
show deviations in Poisson distribution (mean = variance line)
Higher counts typically have larger variance
overdispersion
when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution)
– Expected variance for Poisson-distributed data equals the mean
Reflect biological variability between samples not captured by the experimental conditions
– Differences in RNA quality
– Sequencing depth
– Biological factors like different cell types within the same tissue
Negative Binomial distribution accounts for high dispersion
zeros in RNA-seq data
- RNA-seq data contains zero counts for some genes because not all genes are expressed under all conditions.
- Most statistical models account for variance, but not that 0’s can dominate counts
— zero-inflated models
RNA-seq data is SO MESSY
Need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way
negative binomial
- gives an estimation of a log fold change
- also gives standard error of how uncertain
structural biology
SB determines the 3D shapes of biological macromolecules and how these shapes relate to function
Primary Goal: to understand how molecular machines in cells work by deciphering their atomic arrangements
structural biology challenges
- technical limitations
- biological complexity
- resource constraints
covalent bonds
- the framework of biomolecules
- Formed when atoms share pairs of electrons that hold molecules together
Relevant characteristics of covalent bonds →
Strength and stability: covalent bonds provide the necessary stability for complex biological structures
Directionality: covalent bonds limit the specific angles and orientations leading to the 3D shapes of biomolecules
Non-covalent
- dynamic glue
- weaker than the covalent bonds and involve electostatics
single vs double/triple bonds
– Single bonds: allow rotation, contributing to molecular flexibility
– Double/Triples bonds: restrict rotation, affecting rigidity/molecule function
Noncovalent interactions drive most of biology →
Molecular recognition →
– Enzyme-substrate binding
– Antigen-Antibody interactions
Macromolecular structure →
– Membrane formation
– Protein-protein interactions
– Base pairing in DNA and RNA
– Protein folding
primary structure
linear sequence of amino acids, held together by peptide bonds
– dictates how the protein will fold into higher-order structures
– does not reveal the protein’s functional form or activity alone
– may also depend on cellular factors (e.g. chaperones)
Secondary structure →
local conformations of the polypeptide chain
*** stabilized primarily by hydrogen bonds
– Structural motifs are critical for certain functions. (alpha helixes/B-pleated sheets)
– undergo local fluctuations adding to functional flexibility (unwind/twist)
Tertiary structure →
Refers to the complete 3D shape of a single polypeptide chain
– Predicting how a sequence folds into its tertiary structure is complex
– Reveal active sites/binding pocket
X-ray Crystallography →
- electrons mix into different molecular orbitals at characteristic energy levels
- products an e- density distribution that is unique to that structure
- Probe: photon (carrier of electromagnetic radiation)
- Basic Principle: photons scatter when they interact with atoms
— The scattered X-rays form a diffraction pattern unique to the crystal
— Constructive interference is needed to amplify signal for detectors
diffraction patterns
- The spots on the detector represent the reflections of the scattered X-rays
- Intensity of the spots reflects the electron density in the crystal
- Position and angle of the spots correspond to the geometry
*** The diffraction pattern does not directly show the atomic positions but provides the data needed to infer the electron density
Building the electron density map →
- Reveals the distribution of e- in the crystal, indicating where atoms are located
- The electron density map is interpreted by fitting atomic models (e.g. amino acids for proteins) into density
- Low-resolution data make it difficult to assign atomic positions precisely, leading to uncertainty in the model
why crystals?
- Crystals have the same repeating unit cell, which amplifies our signals
- If in solution, particles would be:
- Too sparse to diffract
- Moving and diffraction pattern would constantly change
Challenges in X-ray Crystallography →
- Flexible or disordered regions do not pack into crystals well, often leading to failure in obtaining high-quality crystals
- flexible/disordered regions do not show up clearly in the electron density map
- Crystals capture a single conformation of the molecule, often ignoring the flexibility or dynamic range
Cryo-Electron Microscopy
a beam of high-energy electrons is used instead of photons
– Electrons have a shorter wavelength than photons
– Scatter light more effectively than x-rays
No crystals: The sample is sample is rapidly frozen in vitreous ice to preserve its native structure
– imaged in their native hydrated state.
– can capture multiple conformations
uses SPA
Single particle analysis (SPA)
- main Cryo-EM technique used to determine the 3d structures of individual macromolecules
- Millions of image of individual particles are collected from a thin layer
- Particles are computationally aligned and classified into different orientations
—- 2D imaging, particle alignment and averaging, compete 3D structure from 2D projections
Cryo-EM advantages and challenges
its ability to capture multiple conformational states of a molecule, providing insights into flexibility and structural heterogeneity
highly flexible or disordered molecules may appear as fuzzy or low-resolution regions in the final structure
IDPs
lack a stable 3D structure under physiological conditions but are still functional, often gaining structure upon binding to partners
- structural techniques often require ordered/stable configurations
- May appear fuzzy or have low-resolution in these regions
Fit force fields to experimental data of structured proteins BUT there is not a lot of data of IDPs
Levinthal’s Paradox
- Proteins can adopt a large number of possible conformations.
Levinthal’s Paradox: a protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly.
challenges of protein structure prediction
- large conformational space
- complex energy landscapes
- flexibility and dynamics
- environmental effects
- PTMs
- data driven methods
potential energy surface (PES)
- represents the energy of a system as a function of the positions of its atoms.
- Understands how the system’s E changes upon reactions or movements
- Proteins fold to lowest free-E state, but this landscape is highly rugged.
- Energy calculations are computationally $$$$ /depend on accurate force fields.
– Lots of potential E minima (so many conformations needs to be tired)
– Multiple minima may be similar but can be far apart in a conformational space
homology modeling
- predicts protein structures based on evolutionary relationships
- the main principle is that proteins with similar sequences tend to fold into similar structures
*** most accurate when sequence identity to other proteins is high (>30%) – across full protein length
threading
- In cases where sequence similarity to known structures is low (<30%), homology modeling becomes unreliable.
- Matches sequences to known structural folds based on structural rather than sequence similarity
- When remote homologs exist but their evolutionary relationship cannot be detected by sequence
comparison alone.
Hidden Markov Models (HMMs)
- statistical models representing sequences using probabilities for matches/indels
- HMMs model protein sequences as a series of probabilistic states
Hidden states: the underlying biological events that are not directly observable
Match states: conserved positions in the sequence
Insertion states: positions where extra residues are added
Deletion states: positions where residues are missing
contact map
- 2D representation of which residues are in close proximity (residue interactions in proteins)
- represent spatial proximity, not sequence order
coevolutionary analysis
- Coevolving residues mutate in a correlated manner.
- Mutations in one residue often result in compensatory mutations in its interacting partner
- observed across species through analysis of homologous protein sequences
- Correlated mutations indicate functionally significant residue pairs
*** helps to predict which residues are close in the 3D structure (useful when there is no experimental structure available)
how is coevolution detected?
- Coevolution is detected using large MSAs from homologous proteins.
- The more diverse the sequences in the MSA, the better the resolution of coevolving residues.
- Evolutionary info from MSAs guides predictions for residue-residue contacts.
co-evolution signals
- Co-evolution signals can be noisy →
- Noise from data can come from random mutations or insufficient evolutionary diversity.
- Large and diverse sequence data sets are needed for reliable coevolution predictions.
How does alpha fold work
- ML predicts 3D structures only from sequenced data
- AlphaFold predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets
— Trains neutral networks on large amounts of protein sequence and structural data
— Neural networks analyzes pattern and learn to recognize them from coevol data
Machine learning leverages coevolution for high-accuracy predictions.
— incorporate evolutionary info along with structural features
*** Struggle with disordered proteins
Molecular dynamics (MD) simulations
- Protein structure determination and prediction provide fixed snapshots
- Understanding the motions of the proteins is important
- MD simulations →
Provide time-resolved insights into protein behavior
1) 3D coordinates of atoms
2) Atoms exert forces
3) Use Newtons to predict movement
time steps
Smaller time steps ⇒ more accurate, but more calculations
The time step must be smaller than the shortest vibrational period to accurately capture atomic
motions.
Simulation of Atomic Movement
- computes trajectories of atoms over time scales of femtoseconds to microseconds.
- capture both small-scale vibrations and large-scale conformational changes.
- Provides detailed information on atomic interactions and energy changes.
- Enables the study of mechanisms at an atomic level
***CLASSICAL MECHANICS
why are MD simulations beneficial?
- More realistic analysis of proteins (dynamic vs static)
- Refines predicted structures
- Minimizes E
- Accounts for environmental effects for improved accuracy
- Studies IDPs
- Captures the flexible nature of disordered regions
- Aids in understanding functions that rely on disorder
- Identifies folding intermediates and misfolding mechanisms-
classical mechanics
- Describes the motion of macroscopic objects
- Assumes particles have well-defined positions and velocities
- Governed by Newton’s Laws of Motion
quantum mechanics
- Necessary for describing behavior at atomic and subatomic scales
- Accounts for wave-particle duality, uncertainty principle, proton tunneling
- E- exhibit quantum behavior that can’t be captured classically
electrons in MD simulations
- neglect quantum effects
- Electrons are not simulated in MD
- Effect is included implicitly through potential E functions (force fields)
suitable systems for MD simulations
- Biological macromolecules (protein, nucleic acids, lipids)
- Materials where electronic excitations are not critical.
- Processes where bond breaking/forming does not occur.
limitations of MD simulations
- cannot accurately simulate chemical reactions w/ electronic transitions.
- Quantum stuff like tunneling and zero-point energy are not captured.
why classical particles?
- It is very expensive to simulate large systems of atoms using quantum mechanics
- Detailed electronic structure is not as important (faster, cheaper calculations)
*** As a result, no electronic interactions/movements can be captured with MD
newton’s 2nd law
the acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass.
forces
- Forces are negative gradients of potential energy
- Thus energy gradients can be used to determine the acceleration and therefore motion of the atoms (velocity)
predicting the spatial position of each atom
- Time evolution of a system: computed by integrating equations of motion
- Determine force, move forward in time, repeat
-Provides cont. Equations of motion using the time steps
molecular dynamics are a combination of:
Molecule dynamics are a combination of: bond lengths, angles, and dihedral angles
– Bonds behave like springs (resists changes in distances between 2 atoms)
– Angles behave like harmonic oscillators (hinge with central atom)
– Dihedral angle (angle between 2 planes formed by 4 bonded atoms; describes the rotations of the bond between 2 atoms) - not like springs
Approximate bond vibrations and angles as harmonic oscillators
Spring constants are determined by bond order (s, d, t)
Bonds and Angles
govern local geometry (bond lengths and bond angles) using quadratic (harmonic) potentials that favor specific distances and angles
Dihedrals
- govern torsional or rotational flexibility around bonds, typically using periodic and multi-well potentials to allow for multiple stable conformations.
- Dihedral potentials must capture arbitrary functions with rotational symmetry.
Fourier series
- Modeling dihedral potentials → Fourier series
– Approximate functions as sum of sine and cosine waves
— More sine/cosine terms improves approximations
– Can approximate (any) symmetrical rotational energy function.
*** equation to go from structure to energy (instead of using QM)
force fields
- compute energies and atomic forces
- Setting the force field depends on system type, accuracy/speed, and compatibility
- Bonded and non-bonded interactions make up the complete force field
topology files
- tell the program which force field parameters to use and where)
- define the molecular structure and interactions in the simulations
- contains info on atom types, bonds, angles, dihedrals, and non-bonded interactions based on the chosen force field
force field parameterization stepwise
Quantum Mechanical Calculations: obtain high-accuracy data for smell molecules and representative fragments
Empirical Data Integration: incorporate experimental measurements to validate and refine parameters
Parameter Optimization: adjust force field parameters through iterative simulations and comparisons
Advanced Techniques: utilize machine learning, multi-scale modeling, and automated pipelines to enhance parameters accuracy and efficiency
scenarios in which quantum mechanical effects cannot be ignored
Electronic interactions involving electrons (smaller molecules better)
– Wave-particles, particle tunneling
– Electron transfer involving reactions
importance of noncovalent interactions in molecular recognition
- Crucial for simulating multiple molecules in MD
- Facilitate organization of molecules into structures
- Determine macroscopic props (MP, BP, solubility)
- Govern biological functions (enzyme binding, protein folding)
*** covalent interactions define primary structure while noncovalent interactions dictate how molecules interact
Dual nature of non-covalent interactions →
- Dispersion forces: weak, attractive forces arising from instantaneous dipoles
Stabilize molecular assemblies by promoting close packing - Repulsion forces: strong, short-range forces due to overlapping e- clouds
Prevents atoms from collapsing into each other; maintains atom integrity
criteria for selecting high-quality experimental structures for simulations.
Need a good structure before starting any molecular simulation (equilibrium)
Avoid low quality, high-energy conformations with missing/wrong co-factors
- Resolution: how well the atomic positions are determined
Resolution below 2.0 A is generally preferred for high-quality simulations - Completeness: Flexible loops/disordered regions are often missing from the structure
No missing residudes - Functional State: Proteins can exist in different functional conformations: active vs inactive state, bound to ligands or unboard
- B-factors: Higher B-factors suggest more uncertainty in atom positions, which might make that part of the structure less reliable
strategies for adding missing residues or atoms in protein models prior to conducting MD simulations.
- Missing atoms or residues can be added using modeling software like Modelleer
- Homology modeling, structure prediction software, MD simulations
Unwanted components like ligands or non-essential ions should be removed
ligands, ions, or crystallization agents that are not physiologically relevant
Distorts protein’s behavior in a simulated biological environment if not removed
Describe the steps involved in protein preparation for simulations.
- Add missing residues
- Remove unwanted ligands, non-essential ions that distort protein behavior
- Correct protonation state (pH-sensitive residues)
- Energy minimization for sterics (adjust structure to remove unfavorable atom position and steric clashes that cause instability)
- Assign force field parameters
- Stabilize temp and pressure for MD
Evaluating the suitability of a protein structure for simulations
Completeness: Check for missing residues, loops, or side chains; incomplete regions may need modeling to avoid simulation artifacts.
Functional State: Ensure the protein’s conformational state (e.g., active, inactive) aligns with the simulation goals; incorrect states could yield irrelevant results.
Clash Scores: Assess clash scores (using tools like MolProbity) to identify steric issues; high clash scores indicate steric conflicts that should be resolved with energy minimization or correction.
Resolution and R-Factors: Higher resolution (<2.0 Å) and low R-factors indicate greater structural accuracy, while poor values suggest potential inaccuracies.
periodic boundary conditions (PBC) in molecular simulations
*** Realistic systems do not have walls
- For simulations, need to apply force to keep molecules in the box
– H20 and proteins would bounce off thse walls in an unphysical manner (edge effects)
PBC simulates infinite systems from a finite box
— Virtually place exact copies of system in all directions
— Atoms that cross the box edge reappear on the other side (no edge effects) → like pacman
- uses Minimum image convention (MIC)
Minimum image convention (MIC)
ensures that an atom in the primary box only interacts with the closest image of another atom
Images atoms in adjacent boxes are used to calculate interactions across the boundaries (ensures correct interactions)
Macrostate
specifies the temp, pressure, volume, and number of particles of molecular systems
*** infinite number of macrostates
- Large-scale system that defines properties of molecular system
(temp, pressure, vol) → changing these changes macrostate - Encompasses all microstates that share the same properties
ensemble
- the collection of all possible microstates of a single macrostate
- Perfect/accurate ensemble averages require sampling every possible configuration
- Macrostate observables are ensemble averages
Ensemble examples
Microcanonical Ensemble (NVE) →
- Fixed number of particles (N)
- Volume (V)
- Energy (E)
Canonical Ensemble (NVT) →
- Fixed number of particles (N)
- Volume (V)
- Temperature (T)
Isothermal-Isobaric Ensemble (NPT) → most common
- Fixed number of particles (N)
- Pressure (P)
- Temperature (T)
energy in ensembles
- The instantaneous temperature of microstates will fluctuate, but the ensemble average should be constant
There should be no net flow of energy!!!
microstate
s a unique configuration defined by the positions and velocities of all particles
specific, detailed configuration of a system at the molecular level
- Indicates exact position and energy of a particle
- Multiple microstates can have the same distance (use mean of them)
importance of adequately sampling microstates in MD simulations
- needed to compute reliable ensemble averages
- Longer simulation provide better sampling of microstates and their probabilities
1) Statistical accuracy, 2) Covering all conformations shifts, 3) Thermodynamic quantities
thermostats
adjust the velocities of particles to increase or decrease the system’s kinetic energy → thereby controlling the temperature
Berendsen thermostat
adjusts the velocities of all particles uniformly based on the current temperature and target temperature
- Scales current velocity based on temp deviation
- Prevents abrupt changes that could destabilize the simulation
- Simple velocity scaling does not generate a true canonical (NVT) ensemble; it cannot reproduce realistic temperature fluctuations
^^^inaccurately models thermal energy transfer via particle collisions
Nose-Hoover thermostat
connect particle momenta to fictitious heat bath
- Momenta scaling provides realistic kinetic energy and thus temperature control
- Heat bath allows thermal energy to flow in and out of our simulation
Q ⇒ a “mass” coupling parameter controls thermostat responsiveness
barostats
maintain desired pressure during simulations
- Adjusts volume of simulation box to achieve and maintain target pressure
- Pressure is proportional to density and temperature
- Scales box volume based on pressure difference to target
*** all help to keep a consistent macrostate!
Ensemble averages improve …
improve with more simulation time by sampling more microstates
– Many short simulations is better than 1 long one!!!
Random initial velocities provide better change of sampling different microstates
– Initial velocities are sent in a direction on the potential E surface simulation; there is a change it never samples a certain minima (multiple simulations with random velocities reduces chance)
*** discard initial relaxation as it is not our desired microstate
Equilibration (Relaxation) Phase
Purpose: To allow the system to relax and reach a stable, equilibrated state after initial setup.
Process: Temperature, pressure, and density are gradually stabilized, with constraints often applied to avoid abrupt movements.
Goal: Achieve a realistic starting conformation that reflects the desired ensemble
Production (Data Collection) Phase
Purpose: To collect data on the system’s behavior for analysis of properties like energy, structure, and dynamics.
Process: Constraints are usually removed, and the system is allowed to evolve naturally.
Goal: Gather accurate ensemble averages and insights into properties for the equilibrated system over time.
Root Mean Square Deviation (RMSD)
measures deviation in structure over time (what conditions allow conformations to change faster)
- Overall change in structure during simulation, tracks deviations from starting conformations (global conformational changes)
***Low is good: close to reference structure
***High indicates significant deviation and large structure changes over time
Root Mean Square Fluctuation (RMSF)
How much does amino acid position change; Identifies regions of flexibility in the protein by calculating fluctuation of each atom (tracks local flexibility) about a mean position
– How does it change around its mean, not relative to ref structure
***Low: atom is fixed in place (well-ordered)
***High: fluctuates a lot (flexibility)
relationship between energy and probability in molecular simulations
Minima ⇒ concave part in energy diagram
- Must overcome energy barrier to make the conformations (TSs can be higher)
– Most preferred would be the lowest energy state
***The probability curve will be exactly OPPOSITE of the energy curve
- How much energy is needed between conformation or energy of binding .
*** without minimization, high-energy configurations may lead to bad results in MD simulations (removes unfavorable atom positions / sterics)
stages of the drug discovery pipeline
Discovery and Preclinical Research (***Computation is most helpful with this stage)
– Potential drugs are identified and tested in non-human studies
Clinical Trials
– Testing in human subjects to assess safety and efficacy
Regulatory Approval
– Evaluation by agencies like the FDA before the drug can be marketed
Post-Marketing Surveillance
– Ongoing monitoring after the drug is available to the public
key factors in selecting protein targets for drug design
Disease Relevance: the protein plays a critical role in the disease mechanism
Druggability: target has a structure that allows it to bind with drug-like molecules
Specificity: Targeting the protein minimizes effects on healthy cells, reducing side effects
virtual screening
narrows down potential compounds from large chemical libraries during drug discovery.
- tests (HTS) compounds against the target protein
- Experimental assays are still expensive, and limited to commercially available compounds
– use computational methods to predict which compounds we should experimental validate
*** SO MUCH FASTER TO NARROW DOWN SEARCH SPACE
gibbs free energy
Energy is released/uptaken during binding → spontaneity of binding
– Combines enthapy and entropy
***Simulations capture free energy directions instead of treating enthalpy and entropy separately
BINDING STRENGTH AND STABILITY AND AFFINITY
enthalpy
energetic interactions (sum of contributions provide ensemble avg)
- Accounts for non covalent interactions (electrostatics, H-bonds, dipoles, pi-pi)
- Ensemble differences in non covalent interactions provide binding enthalpy
Electrostatic forces role in binding
(strongest force)
Long range interaction, anchor points (~5 to 20 kcal/mol per inter)
Charged molecules have a net imbalance between
net electrostatic attractions or repulsions between different atoms or molecules
h bond role in binding
Specificity/orientation, stabilization, dynamic, strongest when hydrogen, donor, and acceptor atoms are collinear
Attraction between a (donor) hydrogen atom covalently bonded to an electronegative atom and another (acceptor) electronegative atom with a lone pair
Uneven e- distribution role in binding
Directional binding for proper ligand alignment, flexibility (weak)
creates partial charges and dipoles
Electronegativity differences lead to unequal distribution of electron density
Consistent electron density spatial variation results in permanent dipoles
VDW role in binding
Maximizes surface contact (complementary fit); flexibility
Dispersion: Electrons in molecules are constantly moving, leading to temporary uneven distributions that induce dipoles in neighboring molecules
Induction: The electric field of a polar molecule distorts the electron cloud of a nonpolar molecule, creating a temporary dipole
Pi-pi interactions role in binding
Involve stacking of aromatic rings
Orientation of aromatics, selectivity
Noncovalent interactions between aromatic rings due to overlap of pi-electron clouds
entropy
how much conformational flexibility changes
(Energy dispersion)
*** Higher entropy ⇒ greater microstate diversity for a given macrostate
- Accounts for microstate diversity of a single macrostate
Can increase/decrease/remain the same depending on ligand concentration
purpose of alchemical free energy simulations
Slowly disappear interactions to see the lowest energy conformations to get insight into binding affinities
Gets energy difference between conformations
how alchemical simulations work
Compute the binding free energy somewhere in solution and then bind to protein
Slowly disappear the ligand (1 = normal interaction, 0 = no interactions)
– More relevant conformational sampling
– Run independent simulations in parallel
– Focuses on taking difference w smaller numbers
integrate over these small free energy changes (turn interaction on and off)
– Use this to relatively calculate the free energy difference between bound and unbound states
why alchemical simulations not ideal
VERY EXPENSIVE; use docking first to screen molecule, then computes energies
- captures all atomistic forces
- wide range of conformation sampling
- specific parameters
docking
Avoid sampling all microstates and determine one “optimal” protein-ligand structure (rigid structure)
– using this bound structure, predict a “score” that is correlated to binding affinity
– Simplifies binding free energy prediction to enhance speed
–Ligand is not guaranteed to fit perfectly
the importance of choosing an appropriate protein conformation
Protein-ligand interactions are highly-dependent on the protein’s 3D structure
Using an inappropriate protein conformation can lead to inaccurate docking results
**ONLY PICKING ONE STRUCTURE
– Challenge bc proteins are dynamic
challenge for docking
- Conformational Flexibility (proteins are not rigid structure and experience movement side to side)
- Binding sites can change
- Limited experimental structures (not all relevant states may be covered)
Experimental Structure Selection Criteria → Docking
Resolution and Quality
Ligand-Bound vs. Unbound Structures
Relevance to Target Ligand
water molecules in docking
Role in binding: structured water molecules can mediates interactions between the protein and ligand
Inclusion Criteria: retain water molecules that are conserved across multiple crystal structures
Handling water in docking →
Some docking programs allow explicit water molecules in the binding site
Alternatively, consider their effect implicitly in scoring functions
convex vs concave regions
Convex Regions: Typically inaccessible to ligands.
Concave Regions (Cavities): Potential binding sites.
grid-based binding pocket detection
Grid-based: puts a protein on a grid, looks for protein vs no protein within grid
If there is no protein in a space, there is likely a pocket there (concave space)
Macrostate remains constant
alpha shape theory
alpha spheres touch certain about of atoms (3 atoms only); cannot put any spheres on the outside in protein land
– Shows pockets based on how many spheres it is touches (group spheres placed in open spaces and indicate it as a pocket)
– uses Delaunay triangulation and alpha complexes to define cavities
3 binding pocket classifications
- Orthosteric: primary active site where ligands bind
- Allosteric: secondary sites that modulate protein function upon ligand binding
- Cryptic: binding pockets not apparent in the unboard protein structure but form upon ligand binding or conformational change
cryptic pockets
binding pockets not apparent in the unboard protein structure but form upon ligand binding or conformational change
– hard for MD simulations to detect
– must use enhanced MD methods, and apply pocket detection to multiple conformations
process of pose optimization in docking to optimize ligand positions within the binding site for accurate binding affinity prediction
*** Accurate docking depends on optimized ligand poses (binding affinity)
- Initial ligand placement
- Scoring function evaluation of binding affinity
- Pose adjustment/optimization (move around ligand and protein residues to achieve best fit) + energy minimization
- Rescore and rank in terms of best pose
- Final pose choice
Search strategies for docking →
systematic and stochastic
systematic search
numerically iterate over all possible conformations
– Only possible for very small molecules, not used very often
– ID important degrees of freedom
– Remove structures with high strain
*****ALMOST NEVER DO
stochastic searches
random sampling (Monte Carlo)
– Provide better balance of sampling and cost
– See if energy change of new conformation is less than random
Steps →
- Generate conformation
- Compete energy change
- If energy change less than a random sample: make move
- Repeat
***Allows us to sample efficiently!
Scoring function
are parameterized models to estimate binding affinity after docking
– Physics-based methods using force-field like methods
– Machine learning (graphing neural networks) have been gaining traction recently
Structure-Based Drug Design VS Ligand-Based Drug Design
Structure-Based Drug Design:
- Requires 3D structure of the target protein.
- Uses the binding site structure to model potential interactions.
- Often employs docking and molecular simulations.
Ligand-Based Drug Design:
- Requires no structural information of the target.
- Uses the chemical structure and activity of known ligands as guides.
- Relies on molecular similarity rather than direct binding predictions.
Molecular weight
indicates the overall size of the molecule
– Impacts drug distribution and elimination rates in the body
LogP
measures lipophilicity (chemical compound’s ability to dissolve in lipids, fats, oils, and non-polar solvents)
– Influences a molecule’s ability to cross cell membranes and affects absorption and bioavailability
molar refractivity
relates to polarizability and electron cloud distribution
– Affecting intermolecular interactions and binding affinity
TPSA
estimates the molecule’s ability to form hydrogen bonds
–impacting solubility and permeability across biological membranes
number of rotatable bonds
reflects molecular flexibility
– influences binding affinity and oral bioavailability
Extended connectivity fingerprints (ECFPs)
encode structural features into numerical representations
– Hash functions are used to encode chemical information
– Can be encoded into a bit array
Tanimoto similarity
compares ECFPs between 2 molecules based on …
Molecular similarity: the concept that similar molecules often show similar biological effects.
Formula measures the ratio of the shared features to the total number of unique features between 2 molecules
how molecular fingerprints are generated by hashing atom-specific properties
- Numerical representation of the properties of the molecules
- Chose a function, not a number → function turns into number
- Generate a number consistently based on whatever input given for keeping track of molecules in the system
hash function iterations
used to encode chemical info
For each iteration:
- incorporate hashes of atoms that are n bonds away
- Then encode atom IDs that exactly one bond away
- Repeat while hashing n-1 IDs
Each iteration encodes local chemical info into each atom’s ID
Similar features will share atom IDs until our iterations starts incorporating new features (encodes multiple levels of info)
Bit arrays
fixed-length collections of ones and zeros
Allow for efficient operations
Encoded into bit array to store a collection of atom IDs
QSAR models
link chemical structure with biological activity
Predict the biological activity of molecules based on their structure to reduce the need for experimental screen (quick and cost-effective)
challenges of efficiently exploring chemical space to find active compounds similar to known bioactive molecules
So many molecules to find and finite amount of time
– Chemical space is vast
– Diverse properties
– Exhausts computational resources
– Reliability of predictive models