Computational Structural Biology Exam Flashcards
GFP
- green fluorescent protein
- its β-barrel keeps the chromophore planar and facilitates the excited-state proton transfer that produces the green fluorescence
2 types of atomistic interactions
covalent (the framework of biomolecules)
non-covalent (dynamic glue)
covalent
- the framework of biomolecules
- forms when atoms share pairs of electrons that hold molecules together
ex/ peptide, phosphodiester, glycosidic bonds
peptide bonds
covalently link amino acids into polypeptide chains
phosphodiester bonds
form the sugar-phosphate backbone of DNA and RNA
covalent
glycosidic bonds
join monosaccharides to form complex sugars
covalent
characteristics of covalent bonds
- strength/stability for complex structures
- directionality: covalent bonds limit the specific angles and orientations leading to the 3D shapes of biomolecules
– single bonds allow rotation
– double/triple bonds restrict rotation
directionality of covalent bonds
covalent bonds limit the specific angles and orientations leading to the 3D shapes of biomolecules
– Single bonds: allow rotation, contributing to molecular flexibility
– Double/triple bonds: restrict rotation, affecting the rigidity and function of molecules
non-covalent bonds
- the dynamic glue
- weaker than covalent bonds and electrostatic in nature (charges, dipoles, van der Waals)
- drive most of biology
— molecular recognition
— macromolecular structure
types of non-covalent electrostatic interactions
- charge-charge
- charge-dipole
- dipole-dipole
- charge-induced dipole
- dipole-induced dipole
- dispersion (van der Waals)
molecular recognition
Enzyme-substrate binding
Antigen-Antibody interactions
macromolecular structure
Membrane formation
Protein-protein interactions
Base pairing in DNA and RNA
Protein folding
structural biology
- determines the 3D shapes of biological macromolecules and how these shapes relate to functions
why study structural biology?
- Proteins and nucleic acids adopt specific shapes crucial for their biological roles
- Primary Goal: to understand how molecular machines in cells work by deciphering their atomic arrangements
primary structure of a protein
- The linear sequence of amino acids, held together by covalent peptide bonds
- dictates how the protein will fold into higher-order structures
- does not reveal protein’s functional form/activity
- its folding process may depend on cellular factors/chaperones
secondary structure of a protein
- local conformations of the polypeptide chain, stabilized primarily by hydrogen bonds
- structural motifs are critical for certain functions
— pleated sheets (beta-sheets), alpha helices, 3₁₀ helices
— undergo local fluctuations: alpha helices can unwind and beta-sheets can twist, adding to functional flexibility
tertiary structure of a protein
- complete 3D shape of a single polypeptide chain
- reveals active sites or binding pockets where catalysis or molecular interactions occur
- predicting how a sequence folds into its tertiary structure is complex, even with knowledge of secondary structures
particle behaviors
- determined by quantum numbers (principal, orbital angular momentum, magnetic)
— based on electrons' specific energy levels and characteristics
— electrons mix into molecular orbitals based on their specific energy level
*** molecular orbitals are what determine behavior as particles interact with orbitals
*** changing positions changes orbitals
RESULTS in e- density distribution unique to that structure
what causes different e- density distributions?
particles interacting with molecular orbitals and energy levels differently based on positions of e- within structure
3 types of experimental techniques based on probes interacting with molecule’s e- density
- x-ray crystallography
- NMR spectroscopy
- cryo-electron microscopy
x-ray crystallography
- uses how a crystal of molecules diffracts X-rays
Basic Principle: photons scatter when they interact with atoms
Probe: photon (carrier of electromagnetic radiation)
The scattered X-rays form a diffraction pattern unique to the crystal (elastic scattering by e-)
elastic scattering for x-ray crystallography
- Incident photon induces an oscillating dipole by distorting the electron density (Rayleigh)
- An oscillating dipole acts as an electromagnetic source and re-emits photons at the same wavelength in all directions
constructive interference
- needed to amplify the signals of the e- for the detectors of the diffraction pattern
- wavelengths are similar and in phase –> constructively interfere
- waves are out of phase –> destructively interfere
diffraction pattern
- spots on the detector represent the reflections of the scattered X-rays
– Intensity of the spots reflects the electron density in the crystal
– Position and angle of the spots corresponds to the geometry
*** does NOT directly show the atomic positions but provides the data needed to infer e- density
building an e- density map
- reveals the distribution of electrons in the crystal, indicating where atoms are located
- interpreted by fitting atomic models (e.g. amino acids for proteins) into density
- Low-resolution data make it difficult to assign atomic positions precisely, leading to uncertainty in the model
Why do we need crystals?
- Crystals have the same repeating unit cell, which amplifies our signals
If in solution, particles would be:
– Too sparse to diffract
– Moving and diffraction pattern would constantly change
NMR spectroscopy
How atomic nuclei interact with magnetic fields and radiofrequency pulses
Cryo-Electron Microscopy
- how molecules scatter electron beams
- beam of high-energy electrons used instead of photons
- no crystals used: the sample is rapidly frozen in vitreous ice to preserve its native structure
— By freezing sample, the biological molecules are imaged in their native hydrated state.
UniProt
- protein information database
- Comprehensive database to access curated data about protein structures, functions, sequences, and annotation
- Reviewed (Swiss-Prot): experts manually curated and verified these entries, ensuring high accuracy
- Unreviewed (TrEMBL): these entries are automatically generated and have not been manually reviewed
- entry IDs are unique identifiers for the proteins
- Protein Data Bank contains structures (PDB)
why are electrons used for Cryo-EM?
- Electrons have a much shorter wavelength (~0.02 Å at 300 keV) than X-ray photons (~1 Å)
- Light elements scatter electrons more effectively than they scatter X-rays
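The ~0.02 Å figure can be checked with the relativistic de Broglie relation. A minimal sketch, assuming CODATA constant values; the function name is only illustrative.

```python
import math

PLANCK = 6.62607015e-34              # J*s
ELECTRON_MASS = 9.1093837015e-31     # kg
ELEMENTARY_CHARGE = 1.602176634e-19  # C
LIGHT_SPEED = 299792458.0            # m/s

def electron_wavelength_angstrom(voltage):
    """Relativistic de Broglie wavelength (in angstroms) of an electron
    accelerated through `voltage` volts."""
    energy = ELEMENTARY_CHARGE * voltage          # kinetic energy in joules
    rest_energy = ELECTRON_MASS * LIGHT_SPEED ** 2
    # p = sqrt(2 m E (1 + E / (2 m c^2))) includes the relativistic correction
    momentum = math.sqrt(2 * ELECTRON_MASS * energy * (1 + energy / (2 * rest_energy)))
    return PLANCK / momentum * 1e10               # metres to angstroms

wavelength_300kv = electron_wavelength_angstrom(300e3)
print(f"{wavelength_300kv:.4f} angstrom at 300 kV")
```

Leaving out the relativistic correction gives a noticeably longer wavelength at 300 kV, which is why it is kept here.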
Single Particle Analysis (SPA)
- main Cryo-EM technique used to determine the 3d structures of individual macromolecules
- Millions of images of individual particles are collected from a thin layer
- Particles are computationally aligned and classified into different orientations
5 Challenges of disorder in molecules
- flexibility and disorder
- x-ray crystallography
- Cryo-EM and conformational flexibility
- Intrinsically Disordered Proteins (IDPs)
- Conformational Heterogeneity and Biological Function
Challenge of flexibility and disorder in biomolecules
- Molecules are not static
- Proteins often exhibit flexibility, disordered regions, and multiple conformations
Why it matters: structural techniques often require ordered/stable configurations
Challenges in X-ray Crystallography
- Flexible or disordered regions do not pack into crystals well, often leading to failure in obtaining high-quality crystals
- In cases where crystallization is successful, flexible or disordered regions do not show up clearly in e- density map
- Crystals capture a single conformation of the molecule, often ignoring the flexibility or dynamic range
Challenges in Cryo-EM and Conformational Flexibility
- strength of Cryo-EM is its ability to capture multiple conformational states of a molecule, providing insights into flexibility and structural heterogeneity
- Challenge: highly flexible or disordered molecules may appear as fuzzy or low-resolution regions in the final structure
- Advanced computational techniques are required to sort out different conformations present in Cryo-EM data
Intrinsically Disordered Proteins (IDPs)
lack a stable 3D structure under physiological conditions but are still functional, often gaining structure upon binding to partners
Challenge of Conformational Heterogeneity and Biological Function
Many proteins function by switching between different conformations, which is essential for their activity (e.g. enzymes, transporters, and receptors)
ex/ G-protein coupled receptors that adopt different conformations when bound to different ligands, triggering different cellular responses.
G-protein coupled receptors (GPCRs)
adopt different conformations when bound to different ligands, triggering different cellular responses.
Challenges in Experimental Structural Biology
Technical Limitations:
– Difficulty in capturing dynamic and flexible regions.
– Incomplete structures due to unresolved disordered regions.
Biological Complexity:
– Dynamic conformational ensembles not represented in static snapshots
Resource Constraints:
– Time-consuming and costly experiments
Why predict protein structure?
- Protein structure dictates interactions, signaling, and biochemical roles.
- Experimental methods (x-ray, Cryo-EM) provide high-resolution structures but are resource-intensive and time-consuming
Structural insights can accelerate…
- Drug discovery: designing small-molecule inhibitors or antibodies that target specific protein conformations.
- Biotechnology: engineering proteins for industrial and therapeutic applications
- Disease research: mutations causing structural defects linked to diseases like Alzheimer’s and cystic fibrosis.
why is prediction critical for the future of biology?
- Advances in predictive accuracy are opening new frontiers in biology
- integrating predictive models with experimental data is the way forward
- Structure prediction complements genomics/transcriptomics to create a holistic understanding of biological function
6 things make structure prediction hard
- conformational space
- complex energy landscapes
- flexibility and dynamics
- environmental effects
- post-translational modifications (PTMs)
- methods are data-driven
Conformational space
- Proteins can adopt a large number of possible conformations.
- Levinthal’s Paradox: a protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly.
– Ex/ A protein with 100 amino acids, each able to adopt about 3 backbone conformations, has ~3^100 (about 5 × 10^47) possible conformations.
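The scale of Levinthal's paradox is easy to work out. A rough sketch; the sampling rate of 10^13 conformations per second is an assumed, deliberately generous figure and not a measured one.

```python
n_residues = 100
states_per_residue = 3

# Size of the conformational space in the flashcard example: 3^100
n_conformations = states_per_residue ** n_residues

# Assumption: the chain tries 1e13 conformations per second
sampling_rate = 1e13
seconds_needed = n_conformations / sampling_rate
years_needed = seconds_needed / (3600 * 24 * 365.25)

print(f"{n_conformations:.2e} conformations")
print(f"{years_needed:.2e} years for an exhaustive search")
```

Even at that rate an exhaustive search takes vastly longer than the age of the universe, so folding cannot be a random search.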
complex energy landscape
- A potential energy surface (PES) represents the energy of a system as a function of the positions of its atoms.
– Describes how the system's energy changes upon reactions or movements
– Proteins fold to the lowest free-energy state, but this landscape is highly rugged.
- Energy calculations are computationally intensive and depend on accurate force fields.
flexibility and dynamics
- Proteins are not static; they adopt multiple conformations (flexibility) based on their environment and interactions with other molecules
- Some proteins/regions do not adopt a fixed 3D structure but remain disordered or flexible under physiological conditions.
environmental effects
- Proteins fold differently in different environments
- Predictions need to capture interactions with solvent molecules, ions, and cofactors
Post-translational modifications (PTMs)
PTMs such as phosphorylation, glycosylation, and methylation can alter protein folding and function
Ex/
– eIF4E is a eukaryotic translation initiation factor involved in directing ribosomes to the cap structure of mRNAs
– Ser209 is phosphorylated by MNK1
– AlphaFold3 accurately predicts changes when they’re already known.
methods are data driven
Our predictions rely on similarity to known structures, but novel sequences or folds (for which no homologous structures exist) are difficult to predict accurately.
– Ex/ AlphaFold has made strides, but predicting de novo structures remains challenging, especially for proteins with no templates.
homology modeling
- predicts protein structures based on evolutionary relationships
*** The main principle is that proteins with similar sequences tend to fold into similar structures.
Common tools for homology modeling: MODELLER, SWISS-MODEL, Phyre2
– most accurate when sequence identity to other proteins is high (>30%)
Hidden Markov Models (HMMs)
HMMs: statistical models representing sequences using probabilities for matches/indels (probabilistic states)
- capture evolutionary patterns in proteins
- predict outcomes based on transition probabilities
- captures more robust alignments
- include info on hidden states
HMM stepwise
- start with a multiple sequence alignment
- indels can be modeled
- occupancy and amino acid frequency at each position in the alignment are encoded
- profile created
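The "occupancy and amino acid frequency at each position" step can be sketched as below. This is only the counting stage of a profile, on a made-up toy alignment; a real profile HMM (as built by HMMER, for example) also estimates transition probabilities between match, insertion, and deletion states.

```python
from collections import Counter

def column_profile(msa):
    """Per-column occupancy and residue frequencies for an aligned MSA.

    `msa` is a list of equal-length strings with '-' marking gaps.
    """
    profile = []
    for col in range(len(msa[0])):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        occupancy = len(residues) / len(msa)   # fraction of sequences without a gap
        counts = Counter(residues)
        freqs = {aa: n / len(residues) for aa, n in counts.items()} if residues else {}
        profile.append({"occupancy": occupancy, "freqs": freqs})
    return profile

# Hypothetical toy alignment (not real sequences)
toy_msa = ["MKV-L", "MRV-L", "MKVAL", "MKI-L"]
profile = column_profile(toy_msa)
for position, column in enumerate(profile, start=1):
    print(position, column)
```

Columns with high occupancy behave like match states, while the sparsely occupied fourth column is the kind of position an insertion state would model.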
HMMs model protein sequences as a series of probabilistic states (4)
- hidden states
- match states
- insertion states
- deletion states
hidden states
represent the underlying biological events that are not directly observable
match states
conserved positions in the sequence
insertion and deletion states
- Insertion states: positions where extra residues are added
- Deletion states: positions where residues are missing
HMMER
a tool that uses HMMs to search databases for sequences that match a given profile HMM (homology)
– Used to find homologous sequences, identifying evolutionary relationships across protein families
SWISS-MODEL
automated protein structure homology-modelling platform for generating 3D models of a protein using a comparative approach.
*** novel proteins are very challenging
when to use threading instead of homology modeling
- In cases where sequence similarity to known structures is low (<30%), homology modeling becomes unreliable.
- Threading matches sequences to known structural folds based on structural rather than sequence similarity
*** Phyre2, RaptorX, MUSTER, and I-TASSER are commonly used for threading, which takes much longer than homology modeling.
identifying the right fold stepwise
- sequence
- LOMETS threading → templates
- template fragments → structure assembly
- clustering → cluster centroid
- structure re-assembly → lowest-energy structure (final model)
- TM-align search of the PDB library → structural analogs
- function prediction
contact maps
- A contact map is a 2D representation of which residues are in close proximity
- allow for visualization of residue interactions in proteins
contact maps and spatial proximity
- determined by spatial proximity, not sequence order, typically within a certain distance threshold
- Residues on the diagonal are adjacent in sequence (and spatially)
- residues far apart in the sequence can still be close in the 3D structure, reflected in contact map
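A contact map is just a thresholded distance matrix. A minimal sketch, assuming one representative atom per residue and an 8 Å cutoff (a common but not universal choice); the coordinates are an invented hairpin, not a real structure.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry [i][j] is 1 when residues i and j lie
    within `cutoff` angstroms of each other."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy hairpin: residues 0-3 run one way, residues 4-7 run back alongside them
toy_ca = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
cmap = contact_map(toy_ca)
for row in cmap:
    print(" ".join(str(value) for value in row))
```

Residues 0 and 7 are far apart in sequence but in contact, which shows up as an off-diagonal entry, whereas residues 0 and 3 on the same strand are not in contact.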
The Rise of Machine Learning in Structural Biology
- Traditional methods like homology modeling and threading rely on templates and known structures
- ML predicts 3D structures from sequence data alone
- AlphaFold (DeepMind) and RoseTTAFold (Baker Lab) lead the charge in this area.
AlphaFold
- Developed by DeepMind
*** predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets
Breakthroughs:
- AlphaFold 2 achieved near-experimental level accuracy in the 2020 CASP14 competition (critical assessment of protein structure prediction)
- AlphaFold 3 (2024) predicts proteins, DNA, RNA, ligands, and post-translational modifications.
Coevolving residues mutate in a correlated manner
- Mutations in one residue often result in compensatory mutations in its interacting partner
- This is observed across species through analysis of homologous protein sequences
- Correlated mutations indicate functionally significant residue pairs
coevolution analysis
- helps predict which residues are close in the 3D structure
- Residues showing correlated mutations are likely to be spatially close in the folded protein
- This is particularly useful when no experimental structure is available.
coevolution detection
- using large multiple sequence alignments (MSAs) from homologous proteins.
- The more diverse the sequences in the MSA, the better the resolution of coevolving residues.
- Evolutionary info from MSAs guides predictions for residue-residue contacts.
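One simple way to score correlated mutations between two MSA columns is mutual information. This is a sketch of that one statistic on a made-up alignment; it is not the method AlphaFold uses, and practical pipelines add corrections for phylogenetic bias and indirect couplings.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    freq_i = Counter(seq[i] for seq in msa)
    freq_j = Counter(seq[j] for seq in msa)
    freq_ij = Counter((seq[i], seq[j]) for seq in msa)
    score = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        score += p_ab * math.log2(p_ab / (p_a * p_b))
    return score

# Hypothetical alignment: columns 0 and 1 change together, column 3 varies independently
toy_msa = ["AKLE", "AKLD", "GELE", "GELD"]
mi_coupled = mutual_information(toy_msa, 0, 1)
mi_independent = mutual_information(toy_msa, 0, 3)
print(mi_coupled, mi_independent)
```

The coupled pair scores one bit and the independent pair scores zero, which is the signal that suggests two residues may be in contact.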
Coevolution example: DHFR
- Residues with a high score (i.e. coevolve) are near each other in the protein’s structure (i.e. small distance)
Coevolutionary signals can be noisy.
- Not all correlated mutations are due to direct physical interactions; some may be indirect.
- Noise from data can come from random mutations or insufficient evolutionary diversity.
- Large and diverse sequence data sets are needed for reliable coevolution predictions.
Machine learning leverages coevolution for high-accuracy predictions.
- AlphaFold and RoseTTAFold utilize coevolutionary data from MSAs to predict residue interactions.
- incorporate evolutionary info along with structural features, leading to highly accurate predictions.
alphafold pipeline (evoformer)
input sequence and MSA –> ML models ==> prediction of atomistic structure
- Using MSAs and contact maps, DeepMind trained a model to predict protein structures
– Contact maps are converted into dihedral angles
What is new in AlphaFold 3?
Biggest change is the use of a diffusion model
Diffusion models essentially learn to unscramble atoms into a structure.
- supercharged for any biomolecule
** breakthrough but not a final solution
– caveat is that proteins are dynamic
alphafold and disordered proteins
- At least 40% of proteins have disordered regions
- AlphaFold (like all other methods) struggles with disordered regions.
LARP1
protein movements
- proteins undergo movements like folding, unfolding, and domain motions.
– essential for binding, catalysis, and signal transduction.
– Understanding dynamics is crucial for drug design, protein design, biotech, etc.
Protein structure determination and prediction provide fixed snapshots
***DO NOT capture the full range of functional conformations
molecular dynamics (MD)
- provide time-resolved insights into protein behavior
- more realistic analysis of proteins
- atoms are treated as classical particles (atoms treated as hard spheres)
– involves:
1. simulation of atomic movement
2. visualization and analysis
Simulation of Atomic Movement
- MD computes trajectories of atoms over time scales of femtoseconds to microseconds.
- It can capture both small-scale vibrations and large-scale conformational changes.
Visualization and Analysis
- Provides detailed information on atomic interactions and energy changes.
- Enables the study of mechanisms at an atomic level
MD simulations provide more realistic analysis of proteins through..
- refinement of predicted structures
- Studying Intrinsically Disordered Proteins
- Folding and Misfolding Pathways
Refinement of Predicted Structures (MD)
- MD helps minimize energy and relax structures obtained from modeling.
- Improves accuracy by accounting for environmental effects
Studying Intrinsically Disordered Proteins (MD)
- MD captures the flexible nature of disordered regions.
- Aids in understanding functions that depend on disorder
Folding and Misfolding Pathways (MD)
- Simulates the folding process to identify intermediates.
- Investigates misfolding mechanisms relevant to diseases.
classical mechanics
- Describes the motion of macroscopic objects
- Assumes particles have well-defined positions and velocities
- Governed by Newton’s Laws of Motion
** atoms are treated as hard spheres
Quantum Mechanics
- Necessary for describing behavior at atomic and subatomic scales
- Accounts for wave-particle duality, uncertainty principle, proton tunneling
- Electrons exhibit quantum behavior that cannot be captured classically
Classical approximation impacts…
Nuclei →
- Nuclei (protons and neutrons) are much heavier than electrons.
- Their de Broglie wavelengths are very small, making quantum effects less significant
- At RT, thermal energies dominate over quantum zero-point energies.
Electrons →
- not explicitly simulated in classical MD.
- Their effects are included implicitly through potential energy functions (force fields).
- The electronic structure is assumed to remain in the ground state during simulation.
suitable systems vs limitations of classical approximations
Suitable Systems:
- Biological macromolecules (proteins, nucleic acids, lipids)
- Materials where electronic excitations are not critical.
- Processes where bond breaking/forming does not occur.
Limitations:
- cannot accurately simulate chemical reactions involving electronic transitions.
- Quantum phenomena like tunneling and zero-point energy are not captured.
Newton’s Second Law
The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass
– given atomic forces, we can calculate atomic movements
(F = ma)
forces (NSLoM)
- the negative gradients of potential energy
– potential energy is dependent on positions of all atoms
– determines accelerations and thus motion of atoms
Time evolution of the system
- computed by integrating equations of motion
- Continuous motion approximated using discrete time steps
– Determine forces
– Move a small amount forward in time
– Repeat
- Time step length determines how “smooth” the animation/trajectory is
stepwise of molecular simulations computing an atomistic trajectory
- 3d coordinates of atoms in the system
- atoms exert forces on each other
- using Newton’s equation of motion, we can predict their movement
integration algorithms
- Numerical Solution:
– Approximate the continuous equations of motion using discrete time steps
- Update Positions and Velocities:
– Calculate the new positions and velocities of particles based on current forces.
Challenges Addressed by Integration Algorithms
- Stability: prevent numerical errors from accumulating over many time steps
- Accuracy: ensure that the trajectories closely follow the true physical behavior.
- Efficiency: balance computational speed with the precision of the simulation.
Common Integration Algorithms
- Verlet: uses current and previous positions to calculate the next position.
- Velocity Verlet: an extension of the Verlet algorithm that explicitly calculates velocities.
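The determine forces, step forward, repeat loop can be sketched with velocity Verlet on a single particle attached to a spring. This is a toy in arbitrary units (mass, spring constant, and time step are all assumed values), not a production MD integrator.

```python
import math

def force(x, k=1.0):
    """Force as the negative gradient of U(x) = 0.5 * k * x**2."""
    return -k * x

def velocity_verlet(x, v, dt, n_steps, mass=1.0, k=1.0):
    """Integrate Newton's second law (F = ma) with the velocity Verlet scheme."""
    trajectory = [x]
    f = force(x, k)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x, k)                    # forces at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append(x)
    return trajectory, v

def total_energy(x, v, mass=1.0, k=1.0):
    return 0.5 * mass * v ** 2 + 0.5 * k * x ** 2

trajectory, v_final = velocity_verlet(x=1.0, v=0.0, dt=0.01, n_steps=1000)
x_final = trajectory[-1]
print(x_final, math.cos(10.0))            # numerical vs analytical position at t = 10
print(total_energy(x_final, v_final))     # should stay close to the initial 0.5
```

Increasing dt makes the run cheaper but degrades both the agreement with the analytical solution and the energy conservation, which is the stability, accuracy, and efficiency trade-off listed above.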
time step length
- determines how smooth the trajectory is
- smaller time steps lead to more calculations to simulate same amount of time
force fields
- used to compute energies and atomic forces
- sets of equations that describe the potential energy of a molecule based on atomic positions
- based on dynamics of bond lengths, bond angles, and dihedral angles
chemical bonds
- behave like springs
- Two spheres (atoms) connected by a single spring
- The spring resists changes in the distance between the two atoms
- bond vibrations are seen as harmonic oscillators
Spring constants
- are determined by bond order and atom types
- the spring constant k (in kcal/mol/Å²) increases as bond length decreases, so shorter bonds are stiffer
— bond length: single > double > triple
Bond angles behave like…harmonic oscillators
- Three balls connected by 2 springs forming an angle, with a “hinge” at the central atom.
— We also have separate spring constants for bond angles.
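The spring picture for bonds and angles can be written as two harmonic terms. A minimal sketch: the parameter values are invented for illustration and come from no published force field, and some force fields fold the factor of 1/2 into the constant.

```python
import math

def bond_energy(r, k_bond, r0):
    """Harmonic bond stretch: 0.5 * k * (r - r0)^2, in kcal/mol."""
    return 0.5 * k_bond * (r - r0) ** 2

def angle_energy(theta_deg, k_angle, theta0_deg):
    """Harmonic angle bend, with the deviation converted to radians."""
    delta = math.radians(theta_deg - theta0_deg)
    return 0.5 * k_angle * delta ** 2

# Hypothetical parameters: k in kcal/mol/A^2 (bond) and kcal/mol/rad^2 (angle)
K_BOND, R0 = 600.0, 1.53
K_ANGLE, THETA0 = 100.0, 109.5

stretched = bond_energy(1.63, K_BOND, R0)        # bond stretched by 0.1 A
bent = angle_energy(119.5, K_ANGLE, THETA0)      # angle opened by 10 degrees
print(stretched, bent)
```

Both terms are zero at the equilibrium value and rise quadratically on either side, which is why they keep local geometry close to r0 and theta0.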
dihedral angle
- the angle between two planes formed by four sequentially bonded atoms (A-B-C-D)
- the angle between these two planes.
- describes the rotation around the bond between atoms B and C.
*** do not behave like springs
Dihedrals VS Bonds and Angles
Bonds and Angles:
- govern local geometry (bond lengths/angles) using quadratic (harmonic) potentials that favor specific distances and angles
Dihedrals:
- govern torsional or rotational flexibility around bonds, typically using periodic and multi-well potentials to allow for multiple stable conformations.
dihedral potentials
- capture arbitrary functions with rotational symmetry.
ex/ periodic energy functions with varying minima
- can be modeled using custom fourier series
fourier series
- approximate functions as a sum of sine and cosine waves
- approximate (any) symmetrical rotational energy function.
adding more sine and cosine terms for fourier series
improves the approximation
- allows the series to closely match the original complex function
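A truncated cosine series is one common way to write a dihedral term. The sketch below assumes the form sum of (V_n / 2) * (1 + cos(n * phi - gamma_n)); the barrier heights are made-up numbers chosen only to show how extra terms reshape the wells.

```python
import math

def dihedral_energy(phi_deg, terms):
    """Torsional energy from a list of (V_n, n, gamma_deg) cosine terms."""
    phi = math.radians(phi_deg)
    return sum(
        0.5 * v_n * (1.0 + math.cos(n * phi - math.radians(gamma_deg)))
        for v_n, n, gamma_deg in terms
    )

# Hypothetical parameters (kcal/mol): a three-fold term alone, then plus a one-fold term
three_fold = [(2.0, 3, 0.0)]
two_terms = [(2.0, 3, 0.0), (1.0, 1, 0.0)]

for angle in (0, 60, 120, 180):
    print(angle, round(dihedral_energy(angle, three_fold), 3),
          round(dihedral_energy(angle, two_terms), 3))
```

With the three-fold term alone the wells at 60 and 180 degrees are equally deep; adding the one-fold term raises the 60 degree well relative to 180, giving multiple minima of different depth.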
Noncovalent Interactions Role in Molecular Assembly
- Facilitate the organization of molecules into complex structures
- Determine the macroscopic properties of materials (e.g. solubility, melting points)
Noncovalent Interactions Importance in Biological Systems
- Govern essential processes like enzyme-substrate binding, protein folding, and membrane formation
- Critical for understanding biochemical pathways and drug design
Why are Noncovalent Interactions crucial for MD?
While covalent bonds define the primary structure of molecules
— noncovalent interactions are pivotal for dictating how molecules interact.
Dispersion Forces
Nature:
- weak, attractive forces arising from instantaneous dipoles in molecules
Role:
- stabilize molecular assemblies by promoting close packing
C6 = dispersion coefficient
Repulsion Forces
Nature:
- Strong, short-range forces due to overlapping electron clouds.
Role:
- Prevent atoms from collapsing into each other, maintaining molecular integrity
C12 = repulsion coefficient
Combined van der Waals Potential
Van der Waals forces are modeled using the Lennard-Jones potential
— captures both the attractive and repulsive aspects of noncovalent interactions.
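In terms of the C6 and C12 coefficients from the previous cards, the Lennard-Jones energy is C12/r^12 - C6/r^6. A small sketch with invented coefficients (not taken from any force field):

```python
def lennard_jones(r, c6, c12):
    """Lennard-Jones energy: repulsion (C12 / r^12) minus dispersion (C6 / r^6)."""
    return c12 / r ** 12 - c6 / r ** 6

# Hypothetical coefficients, in kcal/mol * A^6 and kcal/mol * A^12
C6, C12 = 600.0, 500000.0

# Setting dE/dr = 0 gives the position and depth of the energy minimum
r_min = (2.0 * C12 / C6) ** (1.0 / 6.0)
well_depth = lennard_jones(r_min, C6, C12)

print(f"r_min = {r_min:.3f} A, well depth = {well_depth:.3f} kcal/mol")
print(f"E(2.0 A) = {lennard_jones(2.0, C6, C12):.2f} kcal/mol (repulsive wall)")
```

Closer than r_min the r^-12 repulsion dominates and the energy climbs steeply; farther out the weaker r^-6 attraction takes over and fades toward zero.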
Electrostatic forces decay
- decay as 1/r, making them significant over longer distances compared to van der Waals forces
*** Electrostatic Interactions Drive Charged and Polar Molecule Behavior
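The difference in range shows up when the distance is doubled. A sketch assuming unit point charges in vacuum and the usual conversion factor of about 332.06 kcal·Å/(mol·e²); the dispersion coefficient is a made-up number.

```python
COULOMB_CONSTANT = 332.0637  # kcal * A / (mol * e^2)

def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Coulomb energy (kcal/mol) between point charges q1 and q2 (in e) at r angstroms."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

def dispersion_energy(r, c6=600.0):
    """Attractive r^-6 dispersion term with a hypothetical C6 coefficient."""
    return -c6 / r ** 6

# How much of each interaction survives when the distance doubles from 5 A to 10 A
coulomb_ratio = coulomb_energy(1.0, -1.0, 10.0) / coulomb_energy(1.0, -1.0, 5.0)
dispersion_ratio = dispersion_energy(10.0) / dispersion_energy(5.0)
print(coulomb_ratio, dispersion_ratio)
```

Half of the electrostatic energy remains after doubling the distance, but only 1/64 of the dispersion energy, which is why electrostatics need special long-range treatment in simulations.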
what makes up the complete force field?
bonded and non-bonded interactions
parameterizing force fields starts with
Begins with Quantum Mechanical Data for Small Molecules
- QM calculations
- data utilization
- small molecule focus for simplicity and accuracy
Role of Quantum Mechanics in Parameterizing Force Fields
QM Calculations:
- provides high-accuracy data on molecular geometries, energetics, and electronic distributions
Data Utilization:
- QM data inform the selection and tuning of force field parameters to ensure they reflect true molecular behavior.
Small Molecule Focus for Parameterizing Force Fields
Simplicity:
- Smaller molecules have fewer atoms and simpler interactions, making QM calculations more manageable.
Accuracy:
- QM methods (e.g. Density Functional Theory, Hartree-Fock) yield precise information essential for initial parameterization.
Complexity of Proteins in Force Field Parameterization
Size & Structure:
- protein consists of hundreds to thousands of atoms with intricate 3D structures.
Diverse Interactions:
- include a variety of noncovalent interactions, such as hydrogen bonds, ionic bonds, hydrophobic interactions, and van der Waals forces.
Limitations of QM for Large Systems for Force Field Parameterization
Computational Cost:
- QM calculations become computationally prohibitive for large biomolecules like proteins.
Alternative Strategies:
- Utilize QM data from representative small segments or use empirical and semiempirical methods.
Types of Experimental Data –>(Experimental Data is crucial for Refining Force Field Parameters)
- Spectroscopic Data:
– Infrared (IR), Nuclear Magnetic Resonance (NMR), and Raman spectroscopy provide insights into bond vibrations and molecular geometries.
- Crystallography:
– X-ray crystallography offers precise information on atomic positions and molecular conformations.
- Thermodynamic Measurements:
– Data on melting points, boiling points, and solvation energies inform interaction strengths.
Parameters Optimization
Fitting Process:
- adjusts force field parameters to minimize discrepancies between simulation results and experimental observations.
Validation Metrics:
- use root-mean-square deviations (RMSD), binding affinities, and structural stability as benchmarks.
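RMSD between two sets of matched atomic coordinates can be computed as below. A minimal sketch that assumes the structures are already superimposed; real comparisons usually perform an optimal alignment first. The coordinates are invented.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy example: the second structure is the first shifted by 1 A along x
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(reference, reference), rmsd(reference, shifted))
```

A value near zero means the simulated structure stays close to the reference; larger values flag drift or a conformational change.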
Fitting Force Field Parameters to Experimental Data…
Ensures Realistic Simulations
– uses parameter adjustment
Parameter Adjustment for Fitting Force Field Parameters
Process:
- fine-tune force field parameters to minimize discrepancies between simulation outcomes and experimental observations.
Techniques:
- use of optimization algorithms and statistical methods to achieve best-fit parameters
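As a toy illustration of fitting, the sketch below recovers a bond's spring constant and equilibrium length from a reference energy scan by brute-force grid search. The reference data are synthetic (generated from assumed values of k and r0), and real parameterization relies on far more capable optimizers and many more target properties.

```python
def bond_energy(r, k, r0):
    """Harmonic bond energy, 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2

# Synthetic reference scan standing in for QM or experimental data (assumed values)
TRUE_K, TRUE_R0 = 600.0, 1.53
scan_points = [1.43 + 0.02 * i for i in range(11)]
reference = [(r, bond_energy(r, TRUE_K, TRUE_R0)) for r in scan_points]

def sum_squared_error(k, r0):
    """Discrepancy between the model and the reference energies."""
    return sum((bond_energy(r, k, r0) - energy) ** 2 for r, energy in reference)

# Grid search: try every (k, r0) pair and keep the one with the smallest error
best = None
for k_int in range(400, 801, 10):
    for j in range(21):
        r0 = 1.40 + 0.01 * j
        error = sum_squared_error(float(k_int), r0)
        if best is None or error < best[0]:
            best = (error, float(k_int), r0)

best_error, best_k, best_r0 = best
print(best_k, round(best_r0, 2), best_error)
```

The search returns the parameters used to generate the data, which is the basic idea behind minimizing discrepancies between model and reference.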
Challenges in Parameterizing Force Fields for Proteins
- High Dimensionality
- Diverse Chemical Environments
- Dynamic Conformational Changes
- Long-Range Electrostatic Interactions
high dimensionality challenge
Issue:
- proteins possess numerous degrees of freedom, making comprehensive parameterization computationally intensive.
Solution:
- utilize advanced optimization techniques and high-performance computing resources.
Diverse Chemical Environments Challenge
Issue:
- Different regions of a protein (e.g. active sites, hydrophobic cores) experience varied chemical environments
Solution:
- Develop region-specific parameters or use adaptive force fields that can account for environmental variations.
Dynamic Conformational Changes challenge
Issue:
- proteins frequently undergo conformational shifts that must be accurately captured by the force field.
Solution:
- Incorporate flexible dihedral terms and ensure that parameters support a wide range of conformational states.
Long-Range Electrostatic Interactions Challenge
Issue:
- Accurate modeling of electrostatics in large, charged systems is computationally demanding
Solution:
- Implement efficient algorithms like Particle Mesh Ewald (PME) and use approximations where appropriate.
Summary of Force Field Parameterization Process →
Step-by-Step Process
- Quantum Mechanical Calculations:
– obtain high-accuracy data for small molecules and representative fragments
- Empirical Data Integration:
– incorporate experimental measurements to validate and refine parameters
- Parameter Optimization:
– adjust force field parameters through iterative simulations and comparisons
- Advanced Techniques:
– utilize machine learning, multi-scale modeling, and automated pipelines to enhance parameter accuracy and efficiency.
Common Force Fields
*** Different force fields are tailored for specific types of molecules and applications
AMBER, CHARMM, OPLS
AMBER
optimized for proteins and nucleic acids
– optimized for biomolecular interactions
CHARMM
- versatile, used for a wide range of biomolecules
- known for its extensive parameter set, suitable for complex systems including proteins, lipids, and membranes
OPLS
- focuses on liquids and organic molecules
- optimized for small molecules, organic compounds, and polymers, with emphasis on accurate non-bonded interactions
Selection Criteria for Force Fields
- Compatibility with the system being studied
- Availability of parameters for the molecules of interest
5,6,7,8-tetrahydrofolate (THF)
- crucial for cell growth
- Producing red blood cells
- Synthesizing purines
- Interconverting amino acids
- Methylating tRNA
- Generating and using formate
Disrupting THF production
- has a cascading effect on essential cellular processes, primarily affecting DNA and RNA synthesis and amino acid metabolism
***This is a useful process for drug design.
DHFR
- Dihydrofolate reductase (DHFR) is a crucial enzyme that produces THF from dihydrofolate (DHF)
DHF + NADPH + H(+) → THF + NADP(+)
DHFR uses
studied as an antibiotic (e.g. trimethoprim) and cancer (e.g. methotrexate) target
DHFR conservation
- complicates drug design
- a patient with a bacterial infection who is prescribed a drug that loosely targets DHFR
—- may suffer deleterious side effects, because the drug can also hit human DHFR
***Both proteins have high structural similarity, even around the active site
- Bacterial and human DHFR have similar structures, but their dynamics are different
— must ensure drugs only bind to bacterial proteins by exploiting dynamics insights
Simulating DHFR
- provides insight into druggable conformations
- explore various low-energy conformations that are, hopefully, similar to reality
- Knowing conformations unique to bacteria allows us to design a small molecule that competitively inhibits bacterial DHFR
before starting any molecular simulation…
- need a starting structure
- If our starting structure is very far away from our desired equilibrium, our simulations will take longer
- NO static structure for experiment
Things that could go wrong with using static structure from experiment
- Low-quality experimental structures
- Inaccurate computational predictions
- High-energy conformations
- Missing or incorrect cofactors
** wait for the protein to fold to study its dynamics
experimental structures
- Experimental structures offer the best option for their accuracy
- PDB contains experimentally determined structures for thousands of proteins (not all equally suitable for simulations)
– General resolution preference: X-ray, then Cryo-EM, then NMR
factors for choosing the best experimental structures
- resolution
- completeness
- functional state
- B-factors
Resolution
- refers to how well the atomic positions are determined
– Resolution below 2.0 Å is generally preferred for high-quality simulations
– High R-factors indicate lower structural accuracy
Completeness
Flexible loops or disordered regions are often missing from the structure
Functional State
Proteins can exist in different functional conformations: active vs inactive state, bound to ligands or unbound
B-factors
Higher B-factors suggest more uncertainty in atom positions, which might make that part of the structure less reliable
Simulations cannot have missing…
…residues (specific amino acid in protein)
- It’s essential to fix chain breaks and missing loops before simulation
— dashed lines indicate unknown and missing info
how to add missing residues
Missing atoms or residues can be added using modeling software like MODELLER
(protein model prediction programs)
removing some components from PDB structures
- components like ligands or non-essential ions should be removed
- ligands, ions, or crystallization agents that are not physiologically relevant
***Distorts protein’s behavior in a simulated biological environment if not removed
Correct protonation states
- are essential for accurate simulations
- Experimental structures often cannot resolve hydrogens, so we need to add them ourselves
pH-sensitive residues
Protonation states of amino acids affect the charge distribution, which influences electrostatic interactions during the simulation
Histidine (His, H)
- pKa ~6.0
- Protonation switching around pH 6-7
Cysteine (Cys, C)
- pKa ~8.3
- Could form disulfide bonds in oxidizing environments
Aspartic Acid (Asp, D)
- pKa ~3.9
- Affects interactions like salt bridges and hydrogen bonds
Lysine (Lys, K)
- pKa ~10.5
- Can form ionic bonds with negatively charged residues
Glutamic Acid (Glu, E)
- pKa ~4.2
- Glu’s protonation state affects electrostatic interactions
Tyrosine (Tyr, Y)
- pKa ~10.1
- Hydrogen bonding and in enzyme active sites
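Sketch: the pKa values on the cards above can be turned into a protonated fraction with the Henderson-Hasselbalch relation. This assumes the approximate model pKa values listed above; pKa values inside a real protein can shift with the local environment, so treat the numbers as rough guides only. The function name is illustrative.

```python
def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of a titratable group in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Approximate model pKa values taken from the cards above (assumed, not protein-specific)
PKA = {"His": 6.0, "Cys": 8.3, "Asp": 3.9, "Lys": 10.5, "Glu": 4.2, "Tyr": 10.1}

for name, pka in PKA.items():
    print(f"{name}: {fraction_protonated(pka, 7.4):.3f} protonated at pH 7.4")
```

At pH = pKa the group is half protonated, which is why histidine (pKa near 6) is the residue that most often needs a case-by-case decision.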
DHFR is localized in the cytoplasm, which contains a multitude of chemical species
ions, molecules, proteins, organelles, cytoskeleton, membranes
how to balance computational feasibility with biological realism
- Protein of interest (already prepared)
- Water molecules at the appropriate temperature (310 K) and pressure (1 atm)
- Cations (Na+ and K+) and anions (Cl-) at an ionic strength of 150 millimolar
- Any cofactors (e.g. NADPH and folate for DHFR)
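Sketch: how many ion pairs does 150 millimolar correspond to in a simulation box? A rough estimate, assuming a rectangular box with edge lengths in nm and ignoring the volume taken up by the protein and any extra counter-ions needed to neutralize the system. The function name and the 8 nm box are illustrative.

```python
AVOGADRO = 6.02214076e23  # particles per mole

def ion_pairs_for_box(conc_molar: float, box_nm: tuple) -> int:
    """Approximate number of cation/anion pairs for a target salt concentration."""
    lx, ly, lz = box_nm
    volume_liters = lx * ly * lz * 1e-24  # 1 nm^3 = 1e-24 L
    return round(conc_molar * AVOGADRO * volume_liters)

# Hypothetical 8 x 8 x 8 nm box at 150 mM
print(ion_pairs_for_box(0.150, (8.0, 8.0, 8.0)))
```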
realistic systems do not have…
walls
- solved with periodic boundary conditions (PBC)
why do we have PBCs for MD simulations?
- a protein in vivo will have lots of room to move around
– could make the box very large, but that is very costly
– for a finite box with walls, we would have to apply a force to keep molecules in the box
- water molecules and proteins would bounce off these walls in an unphysical manner (edge effects)
- PBC simulate infinite systems from a finite box
periodic boundary conditions (PBC)
- PBC simulate infinite systems from a finite box
– We (virtually) place exact copies of our system in all directions
Atoms that cross the box edge reappear on the other side; thus, do not have edge effects
— think PacMan game
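Sketch: the "reappear on the other side" rule can be written as a modulo operation on each coordinate. This assumes a rectangular box with its origin at (0, 0, 0); simulation packages handle other box shapes and conventions differently.

```python
def wrap(position, box):
    """Map a position back into the primary box [0, L) along each axis."""
    return tuple(x % length for x, length in zip(position, box))

# An atom that left through the +x face and the -y face re-enters on the opposite faces
print(wrap((10.5, -0.2, 3.0), (10.0, 10.0, 10.0)))
```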
why are force fields parameterized?
to reproduce quantum chemical and experimental data
minimum image convention (MIC)
- ensures that an atom in the primary box only interacts with the closest image of another atom
- Image atoms in adjacent boxes are used to calculate interactions across the boundaries
(ensures correct interactions)
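Sketch: the minimum image convention for a distance calculation. This assumes a rectangular (orthorhombic) box; the function name is illustrative. Each displacement component is shifted by a whole number of box lengths so that the nearest periodic image is used.

```python
import math

def minimum_image_distance(a, b, box):
    """Distance between atom a and the closest periodic image of atom b."""
    d2 = 0.0
    for xa, xb, length in zip(a, b, box):
        d = xb - xa
        d -= length * round(d / length)  # shift to the nearest image along this axis
        d2 += d * d
    return math.sqrt(d2)

# Two atoms near opposite faces of a 10 x 10 x 10 box are actually close neighbors
print(minimum_image_distance((0.5, 5.0, 5.0), (9.5, 5.0, 5.0), (10.0, 10.0, 10.0)))
```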
Force field parameterization steps overall
- Generate structures and use quantum chemistry to compute energy and forces
- Optimize force field parameters until they reproduce the quantum chemistry data set
- Run MD simulations and predict experimental data (e.g. NMR, Raman spec, solvation energies, etc)
- Continue to optimize force field parameters to minimize quantum chemistry and simulation prediction errors
Force fields are dependent on…
- Force fields are dependent on fitting data and simulation set up
– Force fields are not inherently compatible with each other (causes simulations to be unreliable)
- Ex/ protein force fields and DNA force fields are fit to different molecule types (proteins vs. DNA/RNA)
** therefore, combine only force fields that are compatible by design or validated against experimental data
Key factors for selecting a force field
- System type: different force fields are optimized for specific systems
- Accuracy VS speed: high accuracy force fields may require more computational resources
- Compatibility: choose a force field based on compatibility with available topology generators and the type of molecules in your simulations
Topology files
- define the molecular structure and interactions in the simulations
- contains info on atom types, bonds, angles, dihedrals, and non-bonded interactions based on the chosen force field
*** essentially tells the program which force field parameters to use and where
when is additional parameterization required?
- Complex molecules and ligands require parameterization and careful integration
- Non-standard residues or ligands are not always included in standard force field parameter sets
—- require additional parameterization to ensure proper interactions in the simulation
Energy minimization
- necessary before running molecular dynamics simulations
- adjust the initial structure to remove unfavorable atom positions and steric clashes that could cause instability during simulations
** Without minimization, high-energy configurations may lead to unrealistic results or early failures in the molecular dynamics simulations
energy minimization and steric clashes
- removes steric clashes and optimizes the initial geometry
— Steric clashes occur when atoms are too close, resulting in excessively high energy
– Energy minimization gently adjusts the structure to lower the system’s energy
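Sketch: a toy energy minimization for a single steric clash. This assumes two atoms interacting through a Lennard-Jones potential in reduced units (epsilon = sigma = 1) and uses steepest descent with a capped step size; real minimizers act on all atoms with the full force field. The function names and step sizes are illustrative choices.

```python
def lj_energy(r, eps=1.0, sigma=1.0):
    """Lennard-Jones pair energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def lj_force(r, eps=1.0, sigma=1.0):
    """-dU/dr; positive values push the two atoms apart."""
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

def minimize_distance(r, gain=0.01, max_step=0.01, n_steps=5000, tol=1e-8):
    """Steepest descent: move along the force, capping each step so a clash cannot explode."""
    for _ in range(n_steps):
        f = lj_force(r)
        if abs(f) < tol:
            break
        r += max(-max_step, min(max_step, gain * f))
    return r

r_clash = 0.8  # atoms start too close: very high energy
r_min = minimize_distance(r_clash)
print(lj_energy(r_clash), lj_energy(r_min), r_min)
```

The capped step is the "gentle" part: the huge force at the clash only sets the direction, not the size, of the move.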
Statistical physics at the molecular level involves 3 concepts
- Number of particles:
– biological systems contain billions of atoms interacting simultaneously
- Thermal motion:
– atoms and molecules are in constant motion due to thermal energy
- Uncertainty and variability:
– exact positions and velocities of particles are inherently uncertain
Observable properties
averages of atomistic behaviors on macroscopic and microscopic levels
Microscopic VS Macroscopic levels
Microscopic level:
- individual atoms and molecules
Macroscopic level:
- bulk properties from collective behavior
Atomistic system
stochastic (randomly determined), measurable properties are computed as averages.
Statistical mechanics
uses statistical methods to relate microscopic properties to macroscopic observables
Macrostate
- specifies the temp, pressure, volume, and number of particles of molecular systems
– Large-scale description that defines the properties of a molecular system
– changing the values of temp, pressure, volume, etc. changes the macrostate
*** essentially infinite number of macrostates
ensemble
the collection of all possible microstates of a single macrostate
microstate
a unique configuration defined by the positions and velocities of all particles
— a specific configuration of a system by knowing positions and velocities of all particles
Accurate ensemble averages require…
- require sampling every possible configuration
- Longer simulations provide better sampling of microstates and their probabilities
More accurate hydrogen bond distance estimate!
multiple microstates (i.e. configurations) can have the same…
distance
– measure the weighted mean of the microstates
— used to compute expected value of ensemble
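Sketch: a weighted mean over microstates. This assumes each microstate has a known energy and is weighted by its Boltzmann factor; the distances and energies below are made-up illustration values. In an MD trajectory the frames are already visited with roughly these probabilities, so a plain time average is normally used instead.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_average(values, energies, temperature=310.0):
    """Expected value of an observable over microstates weighted by exp(-E / kT)."""
    beta = 1.0 / (KB * temperature)
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

# Hypothetical hydrogen bond distances (angstrom) and microstate energies (kcal/mol)
print(boltzmann_average([2.8, 3.0, 3.4], [0.0, 0.5, 2.0]))
```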
Microcanonical Ensemble (NVE) →
- Fixed number of particles (N)
- Volume (V)
- Energy (E)
Canonical Ensemble (NVT) →
- Fixed number of particles (N)
- Volume (V)
- Temperature (T)
Isothermal-Isobaric Ensemble (NPT) →
- Fixed number of particles (N)
- Pressure (P)
- Temperature (T)
*** most common
What does constant temperature mean?
Remember: macrostate observables are ensemble averages
— The instantaneous temperature of microstates will fluctuate, but the ensemble average should be constant
*** There should be no net flow of energy!!!
*** Kinetic energy determines temperature
Kinetic energy
- determines temperature
- Particle velocities determine kinetic energy
— every particle does not have same velocity; they generally follow the Maxwell-Boltzmann distribution
Most Probable Velocity
the velocity at which the peak of the distribution occurs
Average Velocity
the mean velocity of all particles
Temperature Dependence
higher temperatures shift the distribution toward higher velocities
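Sketch: how velocities map to temperature. The instantaneous temperature follows from equipartition, T = 2 KE / (N_dof k_B); this assumes 3 degrees of freedom per particle (no constraints, no removal of center-of-mass motion). The most probable and mean speeds are the standard Maxwell-Boltzmann results. SI units throughout; function names are illustrative.

```python
import math

KB = 1.380649e-23         # Boltzmann constant, J/K
AMU = 1.66053906660e-27   # atomic mass unit, kg

def instantaneous_temperature(masses_kg, velocities_ms):
    """Temperature from total kinetic energy, assuming 3 degrees of freedom per particle."""
    ke = sum(0.5 * m * (vx * vx + vy * vy + vz * vz)
             for m, (vx, vy, vz) in zip(masses_kg, velocities_ms))
    return 2.0 * ke / (3 * len(masses_kg) * KB)

def most_probable_speed(mass_kg, temperature):
    """Speed at the peak of the Maxwell-Boltzmann distribution."""
    return math.sqrt(2.0 * KB * temperature / mass_kg)

def mean_speed(mass_kg, temperature):
    """Average speed of the Maxwell-Boltzmann distribution."""
    return math.sqrt(8.0 * KB * temperature / (math.pi * mass_kg))

water = 18.015 * AMU
for t in (280.0, 310.0, 340.0):
    print(t, most_probable_speed(water, t), mean_speed(water, t))
```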
Thermostats
adjust the velocities of particles to increase or decrease the system’s kinetic energy → thereby controlling the temperature
Berendsen thermostat
adjusts the velocities of all particles uniformly based on the current temperature and target temperature
– indicated by velocity scaling factor
—– Velocity scaling factor is computed by slowly/carefully scaling the current velocity based on the temperature deviation
velocity scaling factor
- computed by slowly/carefully scaling the current velocity based on the temperature deviation
- prevents abrupt changes that could destabilize the simulation
— Simple velocity scaling does not generate a true canonical (NVT) ensemble; it cannot reproduce realistic temperature fluctuations
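Sketch: the Berendsen velocity scaling factor in its usual weak-coupling form, lambda = sqrt(1 + (dt / tau) * (T_target / T_current - 1)). The coupling time tau and the numbers below are assumptions for illustration. Because kinetic energy scales with velocity squared, one scaling step multiplies the temperature by lambda squared.

```python
import math

def berendsen_lambda(t_current, t_target, dt, tau):
    """Factor applied to every velocity; tau sets how gently the temperature is corrected."""
    return math.sqrt(1.0 + (dt / tau) * (t_target / t_current - 1.0))

# Toy relaxation: only the scaling is applied, no dynamics
temperature = 250.0
for step in range(1000):
    lam = berendsen_lambda(temperature, 310.0, dt=0.002, tau=0.1)
    temperature *= lam ** 2  # velocities scaled by lam, so T scaled by lam^2
print(temperature)
```

A larger tau gives a factor closer to 1 per step, which is the "slowly/carefully" part of the card.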
particle collisions are…
mass dependent
Berendsen thermostats VS Nose-Hoover Thermostat
- Berendsen thermostats inaccurately model thermal energy transfer via particle collisions
- Nose-Hoover thermostat uses momenta scaling, which provides realistic kinetic energy and thus temperature control
Nose-Hoover thermostat
- connect particle momenta to fictitious heat bath
- Heat bath allows thermal energy to flow in and out of our simulation
- Momenta scaling provides realistic kinetic energy and thus temperature control
- dependent on Q ⇒ a “mass” coupling parameter that controls thermostat responsiveness
Barostats and pressure
- Barostats maintain desired pressure during simulations
- Adjusts the volume of the simulation box to achieve and maintain target pressure
pressure
directly proportional to density and temp
NkBT
- represents thermal energy of ideal gas
- assumes non-interacting particles and elastic collisions
{W}
- virial corrections to real gas
- corrects for intermolecular forces in pressure equation
Berendsen Barostat
- Gentle Pressure Stabilization
- Same concept as Berendsen thermostat: Scale box volume based on pressure difference to target
- atomic positions get scaled with box size
- velocities do not get affected
- using barostats, we can keep a consistent macrostate!!!
With thermostats and barostats, we can…
keep a consistent macrostate!!!
Initial configurations
- are not in true thermodynamic equilibrium
- starting structures often come from experiments not relevant for our simulations
- After minimization, we run a short simulation to let the system adjust to the desired macrostate
Why discard the initial relaxation/configurations?
- We discard the initial relaxation as it is not our desired macrostate
- Once macrostate variable(s) reach steady state, we are now sampling valid microstates
Production simulation sampling
- sample microstates from our desired macrostate
- Ensemble averages improve with more simulation time by sampling more microstates
*** “Replicates” do not exist as they do in experimental biology and chemistry
Production simulation sampling timeline
NVT
- short simulation to relax to temperature of interest
NPT
- short simulation to relax to density of interest
NPT
- long simulation process
Multiple shorter simulations or one long one?
multiple short simulations provide better sampling of microstates
Random initial velocities
*** provide a better chance of sampling different microstates
- Simulation starts here on my potential energy surface (PES)
- Initial velocities send it in this direction
- There is a chance that it never samples this minimum
- Multiple simulations with random velocities reduce this chance
Root Mean Square Deviation (RMSD)
- measures the overall change in the structure during a simulation, tracking deviations from the starting conformation
— monitors global conformational changes
- The difference between the coordinates represents the displacement of atom i from its reference position at time t
Low VS High RMSD
- Low RMSD → the structure is very similar to the reference structure (e.g., stable conformation)
- High RMSD → indicates significant deviation, suggesting large structural changes or flexibility over time
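Sketch: RMSD as the square root of the mean squared displacement of each atom from its reference position. This assumes the two structures are already superimposed (aligned) and have the same atom order; analysis tools usually perform the alignment first.

```python
import math

def rmsd(coords, reference):
    """Root mean square deviation between two equally sized lists of (x, y, z) positions."""
    total = sum((x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
                for (x, y, z), (xr, yr, zr) in zip(coords, reference))
    return math.sqrt(total / len(coords))

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
frame = [(0.0, 0.2, 0.0), (1.5, -0.2, 0.0), (3.0, 0.2, 0.0)]
print(rmsd(frame, reference))
```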
Root Mean Square Fluctuation (RMSF)
- identifies regions of flexibility in the protein by calculating the fluctuation of each atom or residue
– Tracking Local Flexibility
- This measures how much the atom is fluctuating around its mean, not relative to a reference structure
High VS Low RMSF
High RMSF → value for an atom means that it fluctuates a lot, indicating flexibility (often seen in loops or solvent-exposed regions)
Low RMSF → atom remains relatively fixed in place, suggesting rigidity (common in well-ordered regions like helices or beta-sheets)
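Sketch: RMSF per atom, measured around each atom's own mean position over the trajectory. This assumes the frames are already aligned and that the trajectory is a list of frames, each a list of (x, y, z) tuples; the layout is an illustrative choice.

```python
import math

def rmsf(trajectory):
    """Per-atom root mean square fluctuation about the time-averaged position."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        msf = sum(sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
                  for frame in trajectory) / n_frames
        result.append(math.sqrt(msf))
    return result

# Atom 0 is rigid; atom 1 swings back and forth (e.g., a loop residue)
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsf(trajectory))
```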
Potential of Mean Force (PMF)
- effective potential that governs the behavior of a system along a collective variable
- A collective variable defines the progress of an interaction or molecular reaction
— common collective variables include distances between atoms, bond angles, or dihedral angles.
1D potential energy surface
- This shows you the average energy with respect to h
- Bond length is a particular angstroms apart
- Important: This is not a covalent bond, so it will not look like our spring model
*** Nature prefers to spend time in low-energy conformations
Probability and energy
- Probability and energy are intricately linked [ W(x) vs P(x) ]
— display as opposite curve plots
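Sketch: the link between probability and energy along a collective variable is commonly written W(x) = -k_B T ln P(x), so the most populated bin has the lowest W. This assumes a histogram of the collective variable from a well-sampled simulation; the counts below are made up, and the curve is shifted so its minimum is zero.

```python
import math

KB = 0.0019872041  # kcal/(mol*K)

def pmf_from_counts(counts, temperature=310.0):
    """Potential of mean force (kcal/mol) per histogram bin, shifted so the minimum is 0."""
    total = sum(counts)
    kt = KB * temperature
    w = [-kt * math.log(c / total) if c > 0 else math.inf for c in counts]
    w_min = min(w)
    return [x - w_min for x in w]

# Hypothetical histogram of a distance: the middle bin is visited 10x more often
print(pmf_from_counts([10, 100, 10]))
```

Bins that were never visited get an infinite W here, which is a reminder that unsampled regions carry no information.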
drug development
a complex, multi-stage process requiring significant time and resources
Many years and millions of dollars
drug discovery pipeline
- Discovery and Preclinical Research
– Potential drugs are identified and tested in non-human studies
***Computation is most helpful with the drug discovery stage
- Clinical Trials
– Testing in human subjects to assess safety and efficacy
- Regulatory Approval
– Evaluation by agencies like the FDA before the drug can be marketed
- Post-Marketing Surveillance
– Ongoing monitoring after the drug is available to the public
why is identifying the right protein target crucial for drug development?
- crucial for developing effective and safe drugs
- Proteins regulate nearly all cellular processes, and drugs can inhibit or activate proteins to correct disease states
*** Target identification is accelerated with bioinformatics
Criteria for selecting a protein target:
- Disease Relevance: the protein plays a critical role in the disease mechanism
- Druggability: target has a structure that allows it to bind with drug-like molecules
- Specificity: Targeting the protein minimizes effects on healthy cells, reducing side effects
importance of chemical space in drug discovery
- Chemical space contains an astronomical number of possible compounds to explore
- Effective drugs must bind to the target protein with sufficient affinity and specificity
***Estimated to be between 10^60 and 10^200 possible small organic molecules
We need methods to navigate chemical space and identify promising leads accurately and efficiently
High-throughput screening (HTS)
allows testing of thousands of compounds against the target protein
High-throughput screening (HTS) stepwise
- Library Preparation:
– Collection of diverse compounds
- Assay Development:
– Design of biological assays to measure compound activity against the target
- Screening:
– Compounds are tested in miniaturized assays
- Data Analysis:
– Identification of “hits” that show desired activity
Virtual screening
- evaluates vast libraries to identify potential leads efficiently
- Experimental assays are still expensive, and limited to commercially available compounds
*** Instead, we can use computational methods to predict which compounds we should experimentally validate
— virtual screening allows for screening of millions/billions of compounds allowing for expansion of the search space
selective binding
- binding to a protein is governed by thermodynamics (and kinetics)
- Binding occurs when a compound/ligand interacts specifically with a protein
** reversible
binding affinity and energy
- determined by the Gibbs free energy change
- the change in free energy when a ligand binds to a protein determines the binding process spontaneity
gibbs free energy
- Gibbs free energy combines enthalpy and entropy
Enthalpy (delta H) ⇒ accounts for energetic interactions
Entropy (delta S) ⇒ how much conformational flexibility changes
***Simulations capture free energy directly instead of treating enthalpy and entropy separately
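Sketch: the Gibbs relation delta G = delta H - T * delta S, plus the standard conversion of a binding free energy to a dissociation constant, Kd = exp(delta G / RT), which assumes a 1 M standard state. The enthalpy and entropy values below are made-up illustration numbers, and the function names are illustrative.

```python
import math

R = 0.0019872041  # gas constant in kcal/(mol*K)

def gibbs_free_energy(delta_h, delta_s, temperature=310.0):
    """delta G (kcal/mol) from delta H (kcal/mol) and delta S (kcal/(mol*K))."""
    return delta_h - temperature * delta_s

def dissociation_constant(delta_g, temperature=310.0):
    """Kd in molar for a binding free energy, assuming a 1 M standard state."""
    return math.exp(delta_g / (R * temperature))

# Favorable enthalpy partly offset by lost flexibility (negative entropy change)
dg = gibbs_free_energy(-12.0, -0.010, 300.0)
print(dg, dissociation_constant(dg, 300.0))
```

A negative delta G means binding is spontaneous; the more negative it is, the smaller Kd and the tighter the binding.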
enthalpy
Enthalpy accounts for non covalent interactions
—- electrostatics, h-bonds, dipoles, pi-pi stacking
- Ensemble differences in non covalent interactions provide binding enthalpy
chemical interactions and e- densities
- Molecular interactions are governed by their electron densities (Hohenberg-Kohn theorem)
** For a quantum system, if you know electron densities, then you know everything about that system
This is rather difficult, so we often use conceptual frameworks to explain trends (e.g., hybridization and resonance)
Every noncovalent interaction can be described with this framework → (4)
- Coulomb’s law describes the interactions between charges
- Molecular geometry uniquely specifies an e- density
- Regions of increased electron density are associated with higher partial negative charges
- Electrons are mobile and can be perturbed by external interactions/other electrons
electrostatic forces
- govern interactions between charged and polar regions
- Charged molecules have a net imbalance between
(+) charges in nuclei & (-) charges from electrons
*** leads to net electrostatic attractions or repulsions between different atoms
electrostatic forces role in binding
- Long-range interaction: can attract ligands to the binding site from a distance
- Anchor points:
– often serve as key anchoring interactions in the binding site
~5 to 20 kcal/mol per interaction
hydrogen bonds
Attraction between a (donor) hydrogen atom covalently bonded to an electronegative atom and another (acceptor) electronegative atom with a lone pair
h-bonding role in binding
- Specificity: Precise orientation of the ligand
- Stabilization: Moderately strong interactions
- Dynamic: Allows for adaptability of ligands
*** strongest when the hydrogen, donor, and acceptor atoms are collinear
~2 to 7 kcal/mol per hydrogen bond
Uneven electron distribution
- creates partial charges and dipoles
- lead to unequal distribution of electron density
- results in regions of partial positive or partial negative charges
- Consistent electron density spatial variation results in permanent dipoles
uneven electron distribution role in binding
- Directional binding: Highly directional, ensuring that the ligand aligns correctly
- Flexibility: Can accommodate slight conformational changes
~0.01-1 kcal/mol per interaction
Van der Waals forces
- weak, non-directional interactions
- Dispersion: Electrons in molecules are constantly moving, leading to temporary uneven distributions that induce dipoles in neighboring molecules
- Induction: The electric field of a polar molecule distorts the electron cloud of a nonpolar molecule, creating a temporary dipole
Van der Waals forces role in binding
- Complementary fit: Maximizes surface contact
- Flexibility: Allows small conformational changes
~0.4 - 4 kcal/mol per interaction
pi-pi interactions
- involve stacking of aromatic rings
- Noncovalent interactions between aromatic rings due to overlap of pi-electron clouds
pi-pi interactions role in binding
Orientation: proper positioning of aromatics
Selectivity: recognition of ligands
~1 to 15 kcal/mol per interaction
summing up all enthalpic contributions during a simulation…
provides our ensemble average
Entropy
- accounts for microstate diversity of a single macrostate
- defined as S = k_B ln Ω
– where Ω = total # of microstates available to the system without changing the system state
***Entropy is “energy dispersion”
– Higher entropy implies greater microstate diversity for a given macrostate
system state
can be arbitrarily defined and compared as
– Unbound ligand vs. bound ligand
– Unfolded protein vs. folded protein
– Liquid water at 300 K vs. 500 K
Grid-based protein-ligand binding →
- My macrostate (number of particles, temp, and pressure) remains constant
– rearranges the ligands without binding to the receptor
- N choose L grid sites
- Number of ways to choose L grid sites out of N is the binomial coefficient
*** A smaller grid (with the same site size) has decreased entropy
How does entropy change?
- Depends (increase, no change, decrease) on ligand concentration!!!
- How to interpret this: Pick a number of ligands and move to the right (L - 1), does entropy go up or down?
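Sketch: entropy of the grid model as S / k_B = ln C(N, L), where C(N, L) is the binomial coefficient. This assumes indistinguishable ligands and at most one ligand per grid site; the grid sizes are made-up illustration values. Removing one ligand (L to L - 1) multiplies the microstate count by L / (N - L + 1), so the direction of the entropy change depends on how crowded the grid is.

```python
import math

def grid_entropy(n_sites, n_ligands):
    """S / k_B for placing n_ligands indistinguishable ligands on n_sites grid sites."""
    return math.log(math.comb(n_sites, n_ligands))

# Smaller grid, same number of ligands: fewer arrangements
print(grid_entropy(100, 10), grid_entropy(50, 10))

# Removing one ligand: entropy drops when dilute, rises when crowded
print(grid_entropy(100, 9) - grid_entropy(100, 10))
print(grid_entropy(100, 89) - grid_entropy(100, 90))
```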
for protein ligand binding, we must account for…
*** For protein-ligand binding, we need to account for how the number of accessible microstates/configurations changes for the protein and ligand
- after that point, can run molecular simulations of different states
partition function (Z)
- Partition functions of protein, ligand, and complex are vastly different
- Z is related to the number of all accessible microstates
*** many practical limitations to sampling all microstates
What if we slowly disappear the ligand? (for sampling all microstates)
- This has several advantages:
– More relevant conformational sampling
– Can run independent simulations in parallel
– Focuses on taking differences with smaller numbers
***This technique is generally called alchemical simulations
alchemical parameter
- controls our protein-ligand interactions
- 1 = interactions are normal
- 0 = no intermolecular interactions are on
– Intramolecular interactions are left alone
Alchemical simulations limitation
- VERY expensive
*** Use “docking” to more efficiently screen molecules before (if ever) doing alchemical simulations
Alchemical simulations precision
- Compute energy changes by gradually transforming one molecule into another
– highly precise, offering detailed insights into binding affinities for drug design
Why are alchemical simulations computationally expensive?
- Atomistic forces:
– computes forces for all atoms in proteins, ligands, cofactors, ions, solvents for millions of structures
- Detailed sampling:
– captures a wide range of conformations, which adds more dimensions to the calculation
- Alchemical parameters:
– simulations must be performed at various alchemical parameters
*** ~ 10,000 CPU hours
(417 days on 1 core)
docking
- Avoid sampling all microstates and determine one “optimal” protein-ligand structure ⇒ using this bound structure, predict a “score” that is correlated to binding affinity
- simplifies the binding free energy prediction problem to enhance speed
- efficient by avoiding sampling all microstates and determining one “optimal” protein-ligand structure
Significance of Protein Conformation in Docking
- Protein-ligand interactions are highly-dependent on the protein’s 3D structure
- Using an inappropriate protein conformation can lead to inaccurate docking results
challenges of docking
- Conformational Flexibility:
– Proteins are not rigid structures; they exhibit movements ranging from side-chain rotations to large domain motions
- Impact on Binding Sites:
– The shape and properties of the binding site can change, affecting ligand binding affinity and specificity.
- Limited Experimental Structures:
– Crystallography and NMR provide snapshots of protein conformations but may not capture all relevant states.
Sources of Protein Conformational Data
Experimental Methods:
- X-ray Crystallography:
Provides high-resolution structures but may miss dynamic conformations.
- NMR Spectroscopy: Captures ensembles of conformations but is limited to smaller proteins.
Computational Techniques:
- Molecular Dynamics (MD) Simulations: Explore the conformational space over time.
- Normal Mode Analysis (NMA): Identifies collective motions in proteins.
- Ensemble Generation Methods: Generate multiple protein conformations for docking.
Experimental Structure Selection Criteria →
Resolution and Quality
– Prefer structures with higher resolution (e.g., <2.5 Å).
– Assess reliability using R-factors and validation reports.
Ligand-Bound vs. Apo Structures
– Ligand-Bound (Holo) Structures: Provide direct insight into binding site conformation.
– Apo Structures: May reveal binding site flexibility in the absence of ligands.
Relevance to Target Ligand
– Choose structures co-crystallized with ligands similar to those of interest.
Molecular Dynamic Simulations for Conformational Sampling
- Extract representative structures using clustering algorithms
- Identify conformations with open or closed binding sites
Importance of Water Molecules
- Role in binding: structured water molecules can mediate interactions between the protein and ligand
- Inclusion Criteria: retain water molecules that are conserved across multiple crystal structures
handling water in docking
- Some docking programs allow explicit water molecules in the binding site
- Alternatively, consider their effect implicitly in scoring functions
binding pocket detection for docking
- The binding pocket is the specific region where a ligand interacts with a protein
** Accurate identification of binding pockets is essential for successful docking and virtual screening.
Binding pocket
a cavity that can accommodate a ligand
Protein Surface Characteristics →
- Convex Regions: Typically inaccessible to ligands.
- Concave Regions (Cavities): Potential binding sites.
Classification of Binding Pockets →
- Orthosteric Sites
- Allosteric Sites
- Cryptic Sites
Orthosteric Sites
The primary active site where endogenous ligands bind.
Allosteric Sites
Secondary sites that modulate protein function upon ligand binding.
Cryptic Sites
Binding pockets not apparent in the unbound protein structure but form upon ligand binding or conformational change.
Geometry-Based Pocket Detection Technique
alpha shape theory
– uses Delaunay triangulation and alpha complexes to define cavities
alpha shape theory
alpha spheres touch a certain number of atoms (3 atoms only) and contain no atoms inside; no spheres can be placed on the outside of the protein
Pockets are shown by grouping the spheres placed in open spaces and indicating each group as a pocket
Grid-Based Pocket Detection
Methodology
1. Overlay a 3D grid on the protein structure.
2. Classify grid points as inside, outside, or on the surface.
Pocket Identification
– Clusters of surface grid points forming concave regions indicate potential pockets.
Detecting Cryptic Binding Sites
- Cryptic sites are hidden in the unbound structure and require conformational changes to become apparent
Strategies →
– Use enhanced sampling MD methods like metadynamics
– Apply pocket detection to multiple conformations
ligand poses importance
- Precise ligand poses are crucial for reliable predictions of binding affinity and activity.
- Incorrect poses can lead to false negatives or positives, misguiding drug development efforts.
*** aka accurate docking
ligand pose
The specific orientation and conformation of a ligand within the binding site of a target protein.
Ligand Pose Optimization
Optimization Goal →
– Identify the energetically most favorable pose that closely represents the true binding mode.
Key Components →
1. Orientation: Position and alignment within the binding pocket.
2. Conformation: Internal geometry, including bond angles, lengths, and torsions.
4 types of search strategies for docking
Systematic, stochastic, empirical, machine learning
systematic searches
- numerically iterate over all possible conformations
– Identify important degrees of freedom
– Scan along each angle with a step size of N degrees
– Remove structures with high strain
*** only possible for very small molecules = not used often!
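Sketch: why systematic searches only work for very small molecules. Scanning every rotatable bond with a fixed step produces (360 / step)^n conformations, which grows exponentially with the number of rotatable bonds n. The function names are illustrative, and the strain filter from the card is left out.

```python
import itertools

def torsion_grid(n_rotatable, step_deg):
    """Lazily enumerate every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_rotatable)

def n_conformers(n_rotatable, step_deg):
    """Number of grid conformations before removing strained structures."""
    return len(range(0, 360, step_deg)) ** n_rotatable

for n in (2, 5, 10):
    print(n, n_conformers(n, 30))
```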
Stochastic searches
- random sampling (Monte Carlo)
- provide better balance of sampling and cost
- can utilize conformer libraries (pre-generated)
Steps:
1. Generate conformation
2. Compute energy change
3. If the energy change passes a comparison against a random sample (e.g., Metropolis criterion): make move
4. Repeat
***Allows us to sample efficiently!
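Sketch: the steps above as a Monte Carlo loop. The acceptance test is assumed to be the Metropolis criterion (downhill moves are always accepted; uphill moves are accepted with probability exp(-dE / kT)), which is a common choice but not necessarily the one used by any given docking program. The 1D "pose energy" is a made-up toy function, not a real scoring function.

```python
import math
import random

def metropolis_accept(delta_e, kt, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kt)

def toy_energy(x):
    """Hypothetical 1D pose energy with its minimum at x = 2."""
    return (x - 2.0) ** 2

def monte_carlo_search(n_steps=2000, kt=0.6, seed=0):
    rng = random.Random(seed)
    x = -3.0
    e = toy_energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):                        # 4. repeat
        x_new = x + rng.uniform(-0.5, 0.5)          # 1. generate a conformation
        e_new = toy_energy(x_new)                   # 2. compute the energy change
        if metropolis_accept(e_new - e, kt, rng):   # 3. random acceptance test
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

print(monte_carlo_search())
```

Accepting some uphill moves is what lets the search escape local minima instead of getting stuck in the first one it finds.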
scoring functions
- parameterized models to estimate binding affinity after docking
- Physics-based methods using force-field like methods
- Machine learning (graph neural networks) has been gaining traction recently
Phenotypic drug screening
- involves testing compounds on an organism level to identify potential leads
ex/ drug screening on an antibiotic-resistant bacterial strain to identify potential new leads
Ligand-based drug design (LBDD)
- relies on the properties of known bioactive compounds to guide drug discovery
- Does not require the structure of the target protein, making it useful when this is unknown
motivations and assumptions of LBDD
- Motivation: If we find compounds with little bioactivity, we can use LBDD to find compounds with similar chemical features to improve specific outcomes
- Assumption: Similar structures can lead to similar—hopefully improved—biological effects
structure-based VS ligand-based drug design
Structure-Based Drug Design:
1. Requires 3D structure of the target protein.
2. Uses the binding site structure to model potential interactions.
3. Often employs docking and molecular simulations.
Ligand-Based Drug Design:
1. Requires no structural information of the target.
2. Uses the chemical structure and activity of known ligands as guides.
3. Relies on molecular similarity rather than direct binding predictions.
molecular descriptors
- used to numerically encode chemical properties
- molecular weight
- LogP
- molar refractivity
- TPSA
- # of rotatable bonds
molecular weight
- indicates the overall size of the molecule
- Impacts drug distribution and elimination rates in the body
LogP
- measures lipophilicity (chemical compound’s ability to dissolve in lipids, fats, oils, and non-polar solvents)
- Influences a molecule’s ability to cross cell membranes and affects absorption and bioavailability
Molar Refractivity
- relates to polarizability and electron cloud distribution
*Affecting intermolecular interactions and binding affinity
TPSA
- estimates the molecule’s ability to form hydrogen bonds
*impacting solubility and permeability across biological membranes
Number of rotatable bonds
- reflects molecular flexibility
*influences binding affinity and oral bioavailability
Phenylephrine
a synthetic compound that acts as a vasoconstrictor by stimulating alpha-adrenergic receptors
**Molecules can have similar properties, with slight structural differences causing widely different functions
Dopamine
a naturally occurring neurotransmitter in the brain and interacts with dopamine receptors
**Molecules can have similar properties, with slight structural differences causing widely different functions
Extended connectivity fingerprints (ECFPs)
- encode structural features into numerical representations
- utilize hash functions to encode chemical information (transform info into a numerical format for computers)
hashing for molecular fingerprints stepwise
- Hash functions are used to encode chemical information
- Start by giving each atom an initial ID that encodes only the atom itself
- Then update each atom's ID by hashing it together with the IDs of atoms that are exactly one bond away
- Repeat for all atoms, always hashing the IDs from the previous iteration (n-1)
- Each additional iteration n therefore incorporates information from connected atoms that are n bonds away
- Each iteration encodes local chemical info into each atom’s ID
– Repeat the process for larger n, which captures more chemical info at a (small) computational cost
atom ids for molecular fingerprints
- We keep track of atom IDs at each iteration to encode multiple “levels” of chemical information
*** Similar structural features will share atom IDs until our iteration starts incorporating different structural features
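The iterative ID update can be sketched in plain Python. This is a simplified, ECFP-like illustration and not the actual ECFP algorithm: molecules are given as heavy-atom element lists plus bond pairs, bond orders and duplicate-feature removal are ignored, and `zlib.crc32` stands in for the real hash function. The names `stable_hash` and `atom_ids_per_iteration` are hypothetical.

```python
import zlib

def stable_hash(obj):
    """Deterministic 32-bit hash of an object's repr (CRC32 as a stand-in hash)."""
    return zlib.crc32(repr(obj).encode("utf-8"))

def atom_ids_per_iteration(elements, bonds, n_iterations=2):
    """Return a list of ID lists: atom IDs after iteration 0, 1, ..., n_iterations."""
    neighbors = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Iteration 0: each ID encodes only the atom itself (element, heavy-atom degree).
    ids = [stable_hash((elements[i], len(neighbors[i]))) for i in range(len(elements))]
    history = [ids]

    for n in range(1, n_iterations + 1):
        prev = history[-1]
        # Hash each atom's previous ID together with the sorted previous IDs
        # of the atoms exactly one bond away.
        new_ids = [
            stable_hash((n, prev[i], sorted(prev[j] for j in neighbors[i])))
            for i in range(len(elements))
        ]
        history.append(new_ids)
    return history

# Heavy atoms only: ethanol C-C-O and 1-propanol C-C-C-O.
ethanol = atom_ids_per_iteration(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = atom_ids_per_iteration(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

oxygen_ethanol = [level[2] for level in ethanol]
oxygen_propanol = [level[3] for level in propanol]
for n, (a, b) in enumerate(zip(oxygen_ethanol, oxygen_propanol)):
    print(f"iteration {n}: oxygen IDs match = {a == b}")
```

The oxygen atoms share IDs at iterations 0 and 1 because their surroundings within one bond are identical. At iteration 2 the extra carbon in propanol comes into range and the IDs diverge.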
bit arrays
- fixed-length collections of ones and zeros
** allow for efficient operations
- A molecule's collection of atom IDs is stored by encoding the IDs into a bit array
Converting atom IDs to bit arrays →
- Decide on the length of the bit array (for example, 1024) and fill it with zeros
- Divide each atom ID by the length of the array and take the remainder (atom ID mod length)
- Set the value of the bit array at that index to 1
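The three steps above can be sketched as follows. The function name `fold_ids_to_bits` and the example IDs are made up for illustration.

```python
def fold_ids_to_bits(atom_ids, n_bits=1024):
    """Fold integer atom IDs into a fixed-length bit array using the remainder."""
    bits = [0] * n_bits               # step 1: array of zeros
    for atom_id in atom_ids:
        index = atom_id % n_bits      # step 2: remainder after dividing by the length
        bits[index] = 1               # step 3: set that position to 1
    return bits

example_ids = [2048, 1030, 5, 1029]   # arbitrary illustrative IDs
bits = fold_ids_to_bits(example_ids)
set_positions = [i for i, bit in enumerate(bits) if bit == 1]
print(set_positions)  # [0, 5, 6]
```

Note that 5 and 1029 land on the same index (1029 mod 1024 = 5). Such collisions are the price of a fixed-length array; a longer array makes them less frequent.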
Tanimoto similarity
- compares the ECFPs between two molecules
- formula measures the ratio of the shared features to the total number of unique features between two molecules.
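TS = c / (a + b - c)
(a = number of bits set in molecule A's fingerprint, b = number of bits set in molecule B's fingerprint, c = number of bits set in both)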
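A minimal sketch of the Tanimoto ratio on two equal-length bit arrays, where the shared set bits are divided by the total number of distinct set bits. The function name `tanimoto` and the 8-bit example fingerprints are illustrative, and returning 0.0 when neither fingerprint has any bit set is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length lists of 0/1 bits."""
    a = sum(fp_a)                                # bits set in the first fingerprint
    b = sum(fp_b)                                # bits set in the second fingerprint
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # bits set in both
    union = a + b - c
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return c / union

fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1, 0, 1, 0]
similarity = tanimoto(fp1, fp2)
print(similarity)  # 3 / (4 + 4 - 3) = 0.6
```

Identical fingerprints give 1.0 and fingerprints with no shared bits give 0.0.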
Molecular similarity
The concept that similar molecules often show similar biological effects.
(quantified with, e.g., Tanimoto similarity)
QSAR models
- link chemical structure with biological activity
Purpose: To predict the biological activity of molecules based on their structure.
Motivation:
- Reduces the need for experimental screening.
- Helps identify potential drugs quickly and cost-effectively.
2 types:
- linear and nonlinear
Types of QSAR Models
- Linear Models: Simple, interpretable, e.g., linear regression.
- Nonlinear Models: Capture complex relationships, e.g., neural networks.
QSAR model systematic steps
- Data Collection: Gather biological activity and molecular data.
- Descriptor Calculation: Calculate numerical descriptors for each molecule.
- Model Selection and Training: Use machine learning to correlate descriptors with activity.
- Model Validation: Test model accuracy with independent datasets.
- Interpretation and Application: Use the model for predicting new molecules.
Linear regression models
- Linear regression models are simple but effective for QSAR analysis
– Fits a linear relationship between descriptors and output
pros/cons of linear regression models
Advantages: Easy to interpret.
Limitations: Limited to linear relationships; struggles with complex datasets
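A one-descriptor linear QSAR fit can be sketched with ordinary least squares. The descriptor and activity values below are invented toy numbers, not real measurements, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy training data (made up): one descriptor value and one activity value per molecule.
logp = [1.0, 2.0, 3.0, 4.0]
activity = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_line(logp, activity)
predicted = slope * 2.5 + intercept   # prediction for a new molecule with LogP = 2.5
print(f"activity = {slope:.2f} * LogP + {intercept:.2f}")
print(f"predicted activity at LogP 2.5: {predicted:.2f}")
```

The fitted slope is directly readable as "activity change per unit of descriptor", which is the interpretability advantage named above. A curved descriptor-activity relationship would be fit poorly by the same line.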
Nonlinear models
- capture complex relationships in QSAR data
Examples =
1. Neural Networks: Capture complex, nonlinear patterns in large datasets.
2. Random Forests: Effective for high-dimensional data, robust against overfitting.
pharmacophore
- the 3D arrangement of molecular features required for biological activity
- defines the essential molecular features needed for biological activity
– Looks at H-bond acceptors/donors, cationic, anionic, hydrophobic, aromatic
Building a pharmacophore model
- requires multiple active compounds
Step 1:
- Align active molecules
- Identify common structural features
- Determine spatial relationships
- Consider multiple conformations
Step 2:
- Define feature locations
- Mark shared pharmacophoric points
- Establish distance constraints
- Set tolerance spheres
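Once feature locations and tolerance spheres are set, screening a candidate reduces to a distance check. The sketch below assumes the candidate has already been aligned to the model's coordinate frame; the feature types, coordinates (in angstroms), radii, and the name `matches_pharmacophore` are all hypothetical.

```python
import math

# Hypothetical model: (feature type, sphere center xyz, tolerance radius).
MODEL = [
    ("donor", (0.0, 0.0, 0.0), 1.0),
    ("aromatic", (5.0, 0.0, 0.0), 1.5),
]

def matches_pharmacophore(model, candidate_features):
    """True if every model feature has a same-type candidate feature inside its sphere."""
    for feature_type, center, radius in model:
        found = any(
            cand_type == feature_type and math.dist(center, xyz) <= radius
            for cand_type, xyz in candidate_features
        )
        if not found:
            return False
    return True

# Candidate molecules as (feature type, xyz) pairs, assumed pre-aligned to the model.
hit = [("donor", (0.3, 0.2, 0.0)), ("aromatic", (5.5, 0.5, 0.0))]
miss = [("donor", (0.3, 0.2, 0.0)), ("aromatic", (8.0, 0.0, 0.0))]

hit_result = matches_pharmacophore(MODEL, hit)
miss_result = matches_pharmacophore(MODEL, miss)
print(hit_result, miss_result)  # True False
```

In the second candidate the aromatic feature sits 3.0 angstroms from the sphere center, outside the 1.5 angstrom tolerance, so it fails the model.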