Final Exam Flashcards
DNA sequencing set-up
- Start with bacterial culture for product of interest
- Separate cells from media via centrifuge
- Release DNA by breaking open cells via lysis
- Isolate and purify DNA using liquid-liquid extraction (aq layer has DNA)
chemical lysis
destabilizes the lipid bilayer and denatures proteins
surfactants
have one hydrophobic tail, which allows them to penetrate molecular structures further than phospholipids with 2 tails
Similar to phospholipids, but better at breaking through the barrier and destabilizing proteins
Main problem of determining the order of nucleotides
DNA elongation happens rapidly and continually
Uses DNA polymerase and excess of nucleotides to make copies of DNA
3’ OH is required for DNA elongation
Di-deoxynucleotides (ddNTPs) stop replication because they lack a 3’ OH, so polymerase cannot add another nucleotide after them
sanger sequencing
- accurate, long reads, but resource consuming
- use one beaker and fluorescence to distinguish between the ddNTPs
– Fragment separation can be automated via capillary gel electrophoresis
– Separates molecules by size; DNA’s charge-to-mass ratio is roughly constant, so the gel matrix sieves fragments by length
Smaller molecules move more freely through the gel and migrate faster than larger molecules
molecules must be charged through tagging
– Unique signal per ddNTP products chromatogram
Building strand from fragments
Sort DNA fragments by length to see what the last nucleotide was
Line up the fragments shortest to longest and read each terminal nucleotide; this gradually builds the strand from the 5’ end toward the 3’ end
Original Set up →
- Split sample into 4 beakers
- Add all 4 dNTPs into each beaker, plus one radioactive ddNTP (a different ddNTP per beaker)
- Need separate beakers because the ddNTPs cannot be differentiated from one another
- Add Taq polymerase
- Separate by length using gel electro.
Shortest lengths travel the farthest; associate each band with its beaker
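The fragment-sorting idea above can be sketched in Python (the fragment list is hypothetical toy data; the gel sorts terminated fragments by length, and each fragment's last base extends the read):

```python
# Toy Sanger read reconstruction: each fragment is a synthesized strand
# that ended in a ddNTP. Sorting by length (what the gel does) and taking
# each fragment's final base yields the new strand, read 5' -> 3'.
def read_sanger(fragments):
    ordered = sorted(fragments, key=len)  # shortest ran farthest on the gel
    return "".join(frag[-1] for frag in ordered)

fragments = ["A", "AC", "ACG", "ACGT", "ACGTT"]  # hypothetical fragments
print(read_sanger(fragments))  # -> ACGTT
```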
Good vs Bad chromatogram
Good:
- Variation in peak height is less than 3-fold
- Peaks are evenly distributed, one color each
- Baseline noise is absent
Interpreted nucleotide sequence is 5’ → 3’
Bad:
- Significant noise up to ~20 bp in (unreliable transport properties for short fragments)
- Dye blobs occur from unused ddNTPs
- Fewer longer fragments so signal is weaker
Illumina
short reads, but high throughput
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- Primers are not complementary, so they do not base pair
- Fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
- Bridge amplification creates double-stranded bridges
- Double-stranded clonal bridges are denatured, and the reverse strands are cleaved and washed away
- uses pair-end sequencing
***clusters will give off a stronger signal compared to a single fragment
Illumina stepwise
- Add labeled dNTPs into flow cells
- Incorporate a complementary nucleotide
- Remove unincorporated fluorescent nucleotides
- Capture fluorescent signal & image clusters
- Cleave the fluorophores and the protecting group
Pair-end sequencing
generated from both ends of a DNA fragment with known insert size
enables both ends of the DNA fragment to be sequenced
Distance between each paired read is known, alignment algorithms can use this info to map the reads over repetitive regions more precisely.
Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
** more expensive but ideal for genome assembly
Nanopore
Longer reads, more accurate for assembling reads into genome
Very expensive, low throughput
single-end reads
- generated from only one end of a DNA fragment
- Simpler, fast, more cost-effective
- Limited context for structural variations or duplications
- Used for small genomes and RNA seq where contiguity is less critical
Genome assembly
- process of combining our sequencing reads into a continuous DNA sequence
(Sequencing provides short, overlapping reads of DNA)
Having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming from the experiments
Contigs
continuous stretches of DNA seq from overlapping seq reads
Ambiguous assembly
contigs put together in an unknown order
Resolved by ordering contigs into scaffolds or by assembling against a reference genome
Scaffold
contigs put together overlapping with estimated gaps in a known order
main challenges for de novo genome reconstruction
Repeats: create ambiguity and can cause misassemblies; inflate genome size
High coverage: sequencing the genome multiple times, resulting in a greater number of reads that overlap any given region of the genome
greedy overlap
de novo genome reconstruction
Goal is to assemble the strings (reads) into a continuous, single string (contig)
Want the shortest possible superstring
- Overlap maximization
– Reduces redundancy, maximizes confidence with highest overlap
- Repeat resolution
– Resolves repeats by favoring collapsed arrangements
- Evolutionary pressure
– Most genomes have selective pressure to be efficient
how to do a greedy assembly?
merge by highest overlap!!
Repeats ruin assembly ⇒ can cause missing reads
Increase K to overcome repeats
de Bruijn graphs
- help to visualize relationships/overlaps between the strings
- Node = single entity [k-1]
- Edge = represents a connection between entities (can have direction) [k]
- uses direct edges to specify overlap and concatenation
- Each unique k-mer is a node. (K-mer = substring of length k)
- A node is balanced if indegree equals outdegree
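The node/edge construction above can be sketched in Python (the read is a hypothetical toy example; nodes are (k-1)-mers, one directed edge per k-mer):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds a directed edge from its
    left (k-1)-mer to its right (k-1)-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # node -> node edge
    return graph

g = de_bruijn(["ACGTACG"], 3)
# the repeated k-mer ACG gives node "AC" two parallel edges to "CG"
```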
multiple reads for DB graphs
not Eulerian if we cannot walk along each edge exactly once; an Eulerian walk needs at most 2 semi-balanced nodes
edges on walk extend the contig in multiple directions
errors in assembly effect on DB graphs
Errors affect:
1) k-mer counts, 2) increase # of edges and unconnected graphs
- No overlap would lead to unconnected graphs; weights can be added to arrows (#)
Error correction should remove most tips, islands, bulges (splits and reconnects)
high coverage for de Bruijn graphs
High coverage suggests that a node is likely a true sequence rather than an error
How do we choose the “best” path for our contig?
Long paths are desired but not always reliable due to potential repeats
High, consistent read coverage
Unique, non-branching paths
SPAdes
- prokaryotic genome assembler based on DB graphs
- Estimates gaps between reads using DB graphs
- Builds multisized graphs with different k’s.
- Using multiple graphs allows for a better handling of variable coverage.
- Assemblers provide contigs and scaffolds (connections how contigs form scaffolds)
Large VS Small K
Large K ⇒ fragmented graphs; helps reduce repeat collapsing
Small K ⇒ collapsed/tangled graph good for low-coverage regions
L50
NUMBER of contigs whose combined length is at least 50% of the total assembly length
Lower is better for L50 value
**longer contigs = more confidence that genome is right
N50
LENGTH of the shortest contig among the largest contigs that together cover 50% of the total assembly length
Higher is better for N50 value [median contig size = reliability factor]
***N50 is the length of shortest contig in L50 assembly
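Both metrics fall out of one pass over the sorted contig lengths; a minimal sketch (toy contig lengths, assumed in bases):

```python
def n50_l50(contig_lengths):
    """Return (N50, L50): walk contigs longest-first until the running
    total reaches half the assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count  # N50 = this contig's length, L50 = # used

n50, l50 = n50_l50([100, 80, 50, 30, 20])  # -> N50 = 80, L50 = 2
```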
leading synthesis
- caused by addition of a dNTP instead of a ddNTP
- synthesis runs ahead by 1 nucleotide because 2 were added in one cycle
lagging synthesis
caused by failure to remove the blocking fluorophore, so synthesis falls behind by 1 nucleotide
signal cross talk
degrades quality of assemblies
phred quality score
assess the accuracy of nucleotide base calls in DNA sequencing (prob that base call is incorrect)
Phred quality scores are stored as ASCII-encoded characters (encoding the error probability) in FASTQ files
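The score-to-probability relationship (Q = -10·log10(P), stored as Phred+33 ASCII in FASTQ) can be decoded like so:

```python
def phred_from_ascii(ch, offset=33):
    """Decode one FASTQ quality character (Phred+33 by default) into
    (Q score, probability the base call is incorrect)."""
    q = ord(ch) - offset
    return q, 10 ** (-q / 10)

q, p = phred_from_ascii("I")  # 'I' is ASCII 73 -> Q40, error prob 1 in 10,000
```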
Per sequence GC content
Deviation from normal distribution indicates contamination (reads)
Trimming/Filtering
reduces bias of bad base calls normally at the ends of reads
Trimming/cutting/masking sequences
– From low quality score regions
– Beginning and end of sequence
– Remove adapters
Filtering of sequences
– With low mean quality score
– Too short
– With too many ambiguous (N) bases
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
Functional annotation
- predicts the function of genetic elements
Reading ORFS →
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
TAA, TAG, TGA for bacteria, archaea, and plant plastids
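The ORF-reading rule above can be sketched as a forward-strand scan (a simplified sketch: it checks all three frames using the start/stop codon sets from these notes, and ignores the reverse strand):

```python
STARTS = {"ATG", "GTG", "TTG"}          # standard bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}           # stops (bacteria/archaea/plastids)

def find_orfs(seq):
    """Return (start, end) pairs for ORFs on the forward strand.
    end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] in STARTS:
                # scan the same frame for the first in-frame stop codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        orfs.append((i, j + 3))
                        break
            i += 3
    return orfs
```

For example, `find_orfs("ATGAAATAG")` finds the single ORF spanning the whole string: start ATG, one codon AAA, stop TAG.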
Typical elements of a gene that are annotated
Promoter, start site, 5’ UTR, exons, introns, start codon, CDS, stop codon, 3’ UTR
MSA
the process of aligning three or more biological sequences simultaneously
Identifies conserved regions across multiple species
Reveals patterns not visible in pairwise comparisons (evol. relationships)
Key characteristics:
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence
Important elements of scoring in alignment selection
Objectivity: provides a quantitative measure for comparison
Optimization: allows algorithms to find the best alignment
Significance: helps distinguish real homology from random similarity
Alignment elements reflect …
evolutionary events in sequences
(match, gap, mismatch)
match
identical characters in aligned positions
Represents conserved regions or no change
mismatch
different characters in aligned positions
Indicates substitutions or mutations
gap
dash(-) inserted to improve alignment
Represents insertions and deletions (indels)
global alignment
compares sequences in their entirety (start to end)
Key characteristics:
– Attempts to align every residue in both sequences
– Introduces gaps as necessary to maintain end-to-end alignment
– Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch: guarantees optimal global alignment
Advantages of global alignment
Provides a complete picture of sequence similarity
Ideal for detecting overall conservation patterns
Useful for phylogenetic analysis of related sequences
limitations of global alignment
May force alignment of unrelated regions in divergent sequence
Less effective for sequences of very different lengths
Can be computationally intensive for long sequences
local alignment
identifies best matching subsequences; focus on regions of high similarity
Key characteristics:
– Does not require aligning entire sequences end-to-end
– Allows for identification of conserved regions or domains
– Ignores poorly matching regions
– Can find multiple areas of similarity in a single comparison
– Aligns subsections of sequences
– Protein motif identification exemplifies local alignment utility (identifies functional regions)
Smith-Waterman
Needleman Wunsch
- start with 0 in the top-left corner
- add the gap penalty along the first row and first column
- fill each cell with the highest possible score, including penalties
- final score is in the bottom-right cell
Smith Waterman
- zero is the lowest score
- if a score would be negative, make it 0
- enter 0’s in the starting row and column
- Start alignment at the highest cell
- Stop aligning when you encounter a zero
Smith-Waterman differs from Needleman-Wunsch in key aspects →
Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros
Scoring system:
NW: allows negative scores
SW: negative scores are set to zero
Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix
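The three differences above can be captured in one matrix-fill sketch (toy scoring values match=1, mismatch=-1, gap=-2 are assumptions; a `local` flag switches Needleman-Wunsch behavior to Smith-Waterman, and traceback is omitted for brevity):

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the DP matrix and return the alignment score:
    NW (global) -> bottom-right cell; SW (local) -> highest cell."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]
    if not local:  # NW: first row/column hold cumulative gap penalties
        for i in range(rows):
            M[i][0] = i * gap
        for j in range(cols):
            M[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = M[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, M[i - 1][j] + gap, M[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # SW: negative scores become zero
            M[i][j] = score
            best = max(best, score)
    return best if local else M[-1][-1]
```

For example, a perfect global match of "GATTACA" against itself scores 7, while a local alignment of "ACG" against "TACGT" scores 3 by ignoring the poorly matching flanks.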
linear gap penalty
fixed cost for each gap
Simple to implement, but over-penalizes long gaps
affine gap penalty
different costs for opening and extending gaps
Better for long indels, more biologically realistic (Single mutation event often causes multi-base indel)
position-specific gap penalties
Reduced in variable regions; increased in conserved regions
residue-specific gap penalties
Adjust penalties based on amino acid properties
terminal gap penalties
Often reduced to allow end gaps in local alignments
transcriptomics
allows us to see exactly what genes are active within a given moment
Allows us to see changes in gene expression over time (picture of gene expression)
Works with a complete set of RNA transcripts (mRNA, rRNA, tRNA, non-coding RNA)
Captures the dynamic nature of the transcriptome to reflect the functional state of the cell; captures cell’s response to environment and signals
*** what annotated genes are actually being used
isoforms
a single gene can produce multiple mRNA transcripts
Way for org. to increase protein diversity without increasing the number of genes
reveals alternative splicing and isoforms (cell type, environment, developmental state)
genomics VS transcriptomics
Functional insights →
- Identifies potential functional elements (genomics)
- Reveals which elements are active (transcriptomics)
- Predicts disease risk (genomics)
- Shows disease state (transcriptomics)
Temporal insights →
- Requires only one-time sampling (genomics)
- Captures real-time cellular responses (transcriptomics)
- Reveals evolutionary history (genomics)
single-cell transcriptomics
- revolutionizes resolution
- best for rare cells with complex tissue types
- captures gene expression in an individual cell
- reveals cellular heterogeneity within the tissues
***very powerful data but can be very sparse and noisy
***Not very reproducible bc there is so little RNA in a cell; typically paired with bulk RNA analysis
spatial transcriptomics
- maps gene expression to location
- Preserves spatial information of transcripts within tissue sections
- Reveals how cellular neighborhoods influence gene expression
RNA integrity number
- rRNA makes up a large percentage of our RNA
- lower numbers indicate a degraded sample (28S is degraded relative to 18S rRNA)
filter for mRNA only
a poly(A)-tail primer allows amplification of mRNA only
microarrays
- convert mRNA to cDNA
- no longer in practice
- LIMITED to known sequences
- similar sequences may cause false positives
- limited dynamic range
- normalization challenges
- potential for bias
RNA-seq
- doesn’t require prior knowledge of sequences; allows discovery of novel transcripts/isoforms (primary advantage over microarray technology)
computational pipeline for RNA-seq data analysis
- Read alignment: mapping transcripts to the genome
- Quantification: measuring gene expression levels
- Differential expression analysis: identifying key genes
- Dimensionality reduction: visualizing complex data (not in practice)
Hash table
- link a key to a value
- keys represent a label we can use to get info
- hash function used to determine where to find their number
- DNA dictionary with quick lookup and direct access to potential matches
(large memory and slow for large genomes)
** way for reads to be mapped to reference genomes
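A k-mer hash index for read mapping can be sketched with a plain dictionary (toy reference and k are hypothetical; real mappers use far more compact structures):

```python
from collections import defaultdict

def kmer_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(index, read, k):
    """Direct-access lookup: candidate mapping spots via the read's first k-mer."""
    return list(index.get(read[:k], []))

idx = kmer_index("ACGTACGT", 4)  # "ACGT" occurs at positions 0 and 4
```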
suffix arrays/trees
- represent all suffixes of a given string
- used to find the starting index of a suffix
- arrays are a memory-efficient alternative to trees
— require less memory, but are less powerful
*** create all suffixes; affix an end-of-string identifier; then sort lexicographically
storing only the sorted suffixes means we LOSE the original data
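The suffix-array construction described above fits in a few lines (a naive O(n² log n) sketch; production tools use linear-time constructions):

```python
def suffix_array(s):
    """Append an end-of-string marker, then return suffix start positions
    sorted by the lexicographic order of the suffixes."""
    s = s + "$"  # '$' sorts before letters and marks end-of-string
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")  # -> [6, 5, 3, 1, 0, 4, 2]
```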
Burrows-Wheeler transforms (BWT)
- compresses the amount of data that we have to store without losing the original data
— allows for reversibility of data
Basic concept of BWT:
– Append a unique end-of-string (EOS) marker to the input string.
– Generate all rotations of the string.
– Sort these rotations lexicographically
– Extract the last column of the sorted matrix as the BWT output.
– the 1st column is more compressible, but loses context/reversibility (so the last column is used)
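The rotation-sort-extract steps, plus the reversibility that makes BWT useful, can be sketched directly (naive versions; real implementations avoid materializing all rotations):

```python
def bwt(s):
    """Append EOS marker, sort all rotations, return the last column."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Rebuild the original string: repeatedly prepend the last column
    to the sorted table and re-sort, until full rotations reappear."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]

encoded = bwt("banana")        # -> "annb$aa" (runs of equal letters compress well)
original = inverse_bwt(encoded)  # -> "banana", nothing was lost
```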
backwards search for BWT
- backwards search efficiently finds occurrences of a pattern in the text using L-F mapping
- reversibility of BWT is better than suffix arrays bc we do not lose data
alignment
- specifies where exactly in the transcript this read came from (at position ___)
*** need to determine the read’s exact position in the transcript, but this is SOOO EXPENSIVE $$$$
pseudoalignment
- specifies that it came somewhere from this transcript (compatible)
- Finds which transcript, but not where
- Identifies which transcripts are compatible with the read, skipping the precise location step
- Faster and less resource intensive than alignment based methods
- Lacks certain details (position and orientation of reads) which are useful for correcting technical biases
quantifying gene expression levels
- Must scale data for higher precision, less memory
– Reads per kilobase (RPK): corrects for gene-length bias through normalization
– Per-million scaling: corrects for sequencing depth (RPK scaled per million gives TPM)
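The two normalizations combine into TPM-style values; a minimal sketch, assuming toy counts and gene lengths in kilobases (function and variable names are illustrative):

```python
def tpm(counts, lengths_kb):
    """Reads-per-kilobase for each gene, then scale so all values
    sum to one million (transcripts per million)."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # length normalization
    per_million = sum(rpk) / 1e6                       # depth normalization
    return [r / per_million for r in rpk]

# A 2 kb gene with 20 reads expresses the same as a 1 kb gene with 10 reads:
values = tpm([10, 20], [1.0, 2.0])  # both genes get 500,000 TPM
```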
generative model
- a statistical model that explains how the observed data are generated from the underlying system
- Defines a computational framework that produces sequencing reads from a population of transcripts
- the model generates reads from transcripts, but we don’t know how much of each transcript is present because sampling is biased
— so we go backwards: calculate transcript abundance from the observed read distribution
transcript fraction
- tells us the proportion of total RNA molecules in the sample that come from a certain transcript
- adjusts for the fact that longer transcripts generate more reads
- converts nucleotide-level fractions to transcript-level proportions by normalizing for length
fragment probabilities
conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical bias
- SALMON approximates these probabilities
Positional bias:
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads
GC content:
- GC-rich regions are undersampled
- AT-rich regions are oversampled (AT-rich triplets make good stop codons)
expectation-maximization algorithm
E) estimate missing info (assignment of fragments to transcripts) using the current transcript abundance estimates
M) use the estimated assignments to update the transcript abundances (improves likelihood)
For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
– ensures the accuracy of abundance estimates by correcting biases learned during the online estimation phase
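The E/M loop above can be sketched on toy data (this is a didactic sketch, not Salmon's actual implementation; `compat` is a hypothetical list giving, for each fragment, the transcripts it is compatible with):

```python
def em_abundance(compat, n_iter=100):
    """Toy EM for transcript abundance.
    E-step: split each fragment among its compatible transcripts in
    proportion to current abundances. M-step: renormalize the expected
    counts into new abundances. Likelihood never decreases."""
    n_tx = max(t for row in compat for t in row) + 1
    theta = [1.0 / n_tx] * n_tx  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in compat:                       # E-step
            total = sum(theta[t] for t in row)
            for t in row:
                counts[t] += theta[t] / total
        theta = [c / len(compat) for c in counts]  # M-step
    return theta

# One fragment unique to each transcript, one ambiguous fragment:
theta = em_abundance([[0], [0, 1], [1]])  # converges to [0.5, 0.5]
```

Note how the ambiguous fragment is split according to the evidence from the unique fragments; with two unique fragments for transcript 0 and none for transcript 1, EM drives all the ambiguous mass toward transcript 0.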
transcript effective length
adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
maximum likelihood estimation (MLE) goal
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
2-Phase inference in Salmon
- online phase: fast, initial estimates of transcript abundances
- offline phase: refines initial estimates using more complex optimization techniques
** balances speed (online) with accuracy (offline)
quasi-mapping
- a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
*** often used for the initial estimates of the online phase in SALMON
Full alignment is expensive, so quasi-mapping stops after identifying seeds!!
SALMON transcript-fragment assignment matrix
- uses matrix to identify distributions of reads amongst the transcripts
— computationally assigns fragments to transcripts
*** maps RNA-seq reads (fragments) to transcripts, enabling accurate quantification of transcript levels
— decides how many fragments are assigned to a specific transcript (higher expression = more fragment abundance in a transcript)
statistical model
mathematical tool that describes how data is generated
***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance