Final Exam Flashcards

1
Q

DNA sequencing set-up

A
  1. Start with bacterial culture for product of interest
  2. Separate cells from media via centrifuge
  3. Keep DNA by breaking open cells via lysing
  4. Isolate and purify DNA using liquid-liquid extraction (aq layer has DNA)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

chemical lysis

A

destabilizes the lipid bilayer and denatures proteins

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

surfactants

A

one hydrophobic tail, which allows them to further penetrate molecular structures as compared to phospholipids with 2 tails
Similar to phospholipids, but break through barrier and destabilize proteins better

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Main problem of determining the order of nucleotides

A

DNA elongation happens rapidly and continually
Uses DNA polymerase and excess of nucleotides to make copies of DNA
3’ OH is required for DNA elongation
Di-deoxynucleotides stop replication bc it lacks 3’ OH so polymerase cannot add another nucleotide to it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

sanger sequencing

A
  • accurate, long reads, but resource consuming
  • use one beaker and fluorescence to distinguish between the ddNTPs
    – Fragment separation can be automated via capillary gel electrophoresis
    – Separates molecules by size based on their charge-to-mass ratio

Smaller molecules move more freely through the gel and migrate faster than larger molecules
molecules must be charged through tagging
– Unique signal per ddNTP products chromatogram

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Building strand from fragments

A

Sort DNA fragments by length to see what the last nucleotide was

Line up the last 5’ nucleotide; gradually builds the 3’ end up to get strand

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Original Set up →

A
  1. Split sample into 4 beakers
  2. Add all 4 ddNTPS into each beaker & radioactive ddNTP
    Need separate beakers bc cannot differentiate between them
  3. Add Taq polymerase
  4. Separate by length using gel electro.
    Shortest lengths travel the farthest; associate them with a beaker
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Good vs Bad chromatogram

A

Good:
- Variation in peak high is less than 3-fold.
- Peaks are evenly distributed and one color
- Baseline noise is absent
Interpreted nucleotide sequence is 5’ → 3”
Bad:
- Significant noise up to ~20 bps in (unreliable transport properties)
- Dye blobs occur from unused ddNTPs
- Fewer longer fragments so signal is weaker

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Illumina

A

short reads, but high throughput

  • Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
  • Primers are not complementary, so they do not base pair
  • Fragments become bound somewhere in the flow cell
  • locally amplify bound DNA fragments to get clusters of the same sequence
  • Bridge amplification creates double-stranded bridges
  • Double-stranded clonal bridges are denatured with cleaved reverse strands
  • uses pair-end sequencing

***clusters will give off a stronger signal compared to a single fragment

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Illumina stepwise

A
  1. Add labeled dNTPs into flow cells
  2. Incorporate a complementary nucleotide
  3. Remove unincorporated fluorescent nucleotides
  4. Capture fluorescent signal & image clusters
  5. Cleave the fluorophores and the protecting group
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Pair-end sequencing

A

generated from both ends of a DNA fragment with known insert size

enables both ends of the DNA fragment to be sequenced

Distance between each paired read is known, alignment algorithms can use this info to map the reads over repetitive regions more precisely.

Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome

** more expensive but ideal for genome assembly

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Nanopore

A

Longer reads, more accurate for assembling reads into genome

Very expensive, low throughput

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

single-end reads

A
  • generated from only one end of a DNA fragment
  • Simpler, fast, more cost-effective
  • Limited context for structural variations or duplications
  • Used for small genomes and RNA seq where contiguity is less critical
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Genome assembly

A
  • process of combining our sequencing reads into a continuous DNA sequence

(Sequencing provides short, overlapping reads of DNA)

Having multiple fragments that contain the same portion of the sequence improves our coverage

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

reads

A

raw sequences coming the experiments

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Contigs

A

continuous stretches of DNA seq from overlapping seq reads

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Ambiguous assembly

A

contigs put together in an unknown order

Accounts for differences in scaffolds; Assemble using reference genome

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Scaffold

A

contigs put together overlapping with estimated gaps in a known order

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

main challenges for deNovo genome reconstruction

A

Repeats: create ambiguity and can cause assemblies; inflate genome size

High Coverage: sequencing the genome multiple times, resulting in a greater number of reads that overlap any given region of the genome

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

greedy overlap

A

deNovo genome reconstruction

Goal is to assemble the strings (reads) into a continuous, single string (contig)

Want the shortest possible superstring

  1. Overlap maximization
    – Reduces redundancy, maximizes confidence with highest overlap
  2. Repeat resolution
    – Resolves repeats by favoring collapsed arrangements
  3. Evolutionary pressure
    – Most genomes have selective pressure to be efficient
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

how to do a greedy assembly?

A

merge by highest overlap!!

Repeats ruin assembly ⇒ can cause missing reads

Increase K to overcome repeats

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

de Brujin graphs

A
  • help for to visualize relationships/overlaps between the strings
  • Node = single entity [k-1]
  • Edge = represents a connection between entities (can have direction) [k]
  • uses direct edges to specify overlap and concatenation
  • Each unique k-mer is a node. (K-mer = substring of length k)
  • A node is balanced if indegree equals outdegree
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

multiple reads for DB graphs

A

not Eulerian bc cannot walk along each edge once; 2 semi-balanced nodes

edges on walk extend the contig in multiple directions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

errors in assembly effect on DB graphs

A

Errors affect:
1) k-mer counts, 2) increase # of edges and unconnected graphs

  • No overlap would lead to unconnected graphs; weights can be added to arrows (#)

Error correction should remove most tips, islands, bulges (splits and reconnects)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
high coverage for deBrujin
High coverage suggests that a node is likely a true sequence rather than an error
26
How do we choose the “best” path for our contig?
Long paths are desired but not always reliable due to potential repeats High, consistent read coverage Unique, non-branching paths
27
SPAdes
- prokaryotic genome assembler based on DB graphs - Estimates gaps between reads using DB graphs - Builds multisized graphs with different k’s. - Using multiple graphs allows for a better handling of variable coverage. - Assemblers provide contigs and scaffolds (connections how contigs form scaffolds)
28
Large VS Small K
Large K ⇒ fragmented graphs; helps reduce repeat collapsing Small K ⇒ collapsed/tangled graph good for low-coverage regions
29
L50
NUMBER of contigs whose combined length is at least 50% Lower is better for L50 value ****longer contigs = more confidence that genome is right
30
N50
LENGTH of the shortest contig at 50% of the total genome length Higher is better for N50 value [median contig size = reliability factor] ***N50 is the length of shortest contig in L50 assembly
31
leading synthesis
- by addition of dNTP instead of ddNTP - synthesis is ahead by 1 nucleotide bc 2 were added at once
32
33
lagging synthesis
by failure to remove blocking fluorophore
34
signal cross talk
degrades quality of assemblies
35
phred quality score
assess the accuracy of nucleotide base calls in DNA sequencing (prob that base call is incorrect) ASCII encoded probability store phred quality scores in FASTQ file
36
Per sequence GC content
Deviation from normal distribution indicates contamination (reads)
37
Trimming/Filtering
reduces bias of bad base calls normally at the ends of reads Trimming/cutting/masking sequences -- From low quality score regions -- Beginning and end of sequence -- Remove adapters Filtering of sequences -- With low mean quality score -- Too short -- With too many ambiguous (N) bases
38
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
39
Functional annotation
- predicts the function of genetic elements
40
Reading ORFS →
1. Seek the standard start codons: ATG, GTG, or TTG 2. Seek the stop codons based on the translation table TAA, TAG, TGA for bacteria, archaea, and plant plastids
41
Typical elements of a gene that are annotated
Promoter, start site, 5’ UTR, exons, introns, start codon, CDS, stop codon, 3’ UTR
42
MSA
the process of aligning three or more biological sequences simultaneously Identifies conserved regions across multiple species Reveals patterns not visible in pairwise comparisons (evol. relationships) Key characteristics: - Aligns multiple sequences in a single analysis - Introduces gaps to maximize alignment of similar characters - Preserves the order of characters in each sequence
43
Important elements of scoring in alignment selection
Objectivity: provides a quantitative measure for comparison Optimization: allows algorithms to find the best alignment Significance: helps distinguish real homology from random similarity
44
Alignment elements reflect ...
evolutionary events in sequences (match, gap, mismatch)
45
match
identical characters in aligned positions Represents conserved regions or no change
46
mismatch
different characters in aligned positions Indicates substitutions or mutations
47
gap
dash(-) inserted to improve alignment Represents insertions and deletions (indels)
48
global alignment
compares sequences in their entirety (start to end) Key characteristics: -- Attempts to align every residue in both sequences -- Introduces gaps as necessary to maintain end-to-end alignment -- Optimizes the overall alignment score for the entire sequences Needleman-Wunsch: guarantees optimal global alignment
49
Advantages of global alignment
Provides a complete picture of sequence similarity Ideal for detecting overall conservation patterns Useful for phylogentic analysis of related sequences
50
limitations of global alignment
May force alignment of unrelated regions in divergent sequence Less effective for sequences of very different lengths Can be computationally intensive for long sequences
51
local alignment
identifies best matching subsequences; focus on regions of high similarity Key characteristics: -- Does not require aligning entire sequences end-to-end -- Allows for identification of conserved regions or domains -- Ignores poorly matching regions -- Can find multiple areas of similarity in a single comparison -- Aligns subsections of sequences -- Protein motif identification exemplifies local alignment utility (identifies functional regions) Smith-Waterman
52
Needleman Wunsch
- start with 0 in top corner - add gap penalty down the first row - move across to get the highest possible score while including penalities - score is in bottom row
53
Smith Waterman
- 0 zero is the lowest score - if negative, make it 0 - enter 0's in starting rows - Start alignment at the highest cell - Stop aligning when you encounter a zero
54
Smith-Waterman differs from Needleman-Wunsch in key aspects →
Matrix initialization: NW: the first row and column are filled with gap penalties SW: first row and column filled with zeros Scoring system: NW: allows negative scores SW: negative scores are set to zero Traceback: NW: starts from the bottom-right cell SW: starts from the highest scoring cell in the matrix
55
linear gap penalty
fixed cost for each gap Similar to implement, over-penalizes long gaps
56
affine gap penalty
different costs for opening and extending gaps Better for long indels, more biologically realistic (Single mutation event often causes multi-base indel)
57
position-specific gap penalties
Reduced in variable regions; increase in conserved regions
58
residue-specific gap penalties
Adjust penalties based on amino acid properties
59
terminal gap penalties
Often reduced to allow end gaps in local alignments
60
transcriptomics
allows us to see exactly what genes are active within a given moment Allows us to see changes in gene expression overtime (picture of gene exp.) Works with a complete set of RNA transcripts (mRNA, rRNA, tRNA, non-coding RNA) Captures the dynamic nature of the transcriptome to reflect the functional state of the cell; captures cell’s response to environment and signals *** what annotated genes are actually being used
61
isoforms
a single gene can produce multiple mRNA transcripts Way for org. to increase protein diversity without increasing the number of genes reveals alternative splicing and reforms (cell type, envt, developmental state)
62
genomics VS transcriptomics
Functional insights → - Identifies potential functional elements - Reveals which elements are active - Predicts disease risk - Shows diseases state Temporal insights → - Requires one-time sampling - Captures real-time cellular responses - Reveals evolutionary history
63
single-cell transcriptomics
- revolutionizes resolution - best for rare cells with complex tissue types - captures gene expression in an individual cell - reveals cellular heterogeneity within the tissues ***very powerful data but can be very sparse and noisy ***Not very reproducible bc there is so little RNA in a cell; typically paired with bulk RNA analysis
64
spatial transcriptomics
- maps gene expression to location - Preserves spatial information of transcripts within tissue sections - Reveals how cellular neighborhoods influence gene expression
65
RNA integrity number
- rRNA makes up a large percentage of our RNA - lower numbers are degreaded sample (28S is degraded to 18S rRNA)
66
filter for mRNA only
poly A tail primer allows for amplification of only mRNA
67
microarrays
- convert mRNA to cDNA - no longer in practice - LIMITED to known sequences - similar sequences may cause false positives - limited dynamic range - normalization challenges - potential for bias
68
RNA-seq
- doesn't require prior knowledge of sequences; allows for discovery of novel trancripts/isoforms (primary advantages over microarray technology)
69
computational pipeline for RNA-seq data analysis
1. Read alignment: mapping transcripts to the genome 2. Quantification: measuring gene expression levels 3. Differential expression analysis: identifying key genes 4. Dimensionality reduction: visualizing complex data (not in practice)
70
Hash table
- link a key to a value - keys represent a label we can use to get info - hash function used to determine where to find their number - DNA dictionary with quick lookup and direct access to potential matches (large memory and slow for large genomes) ** way for reads to be mapped to reference genomes
71
suffix arrays/trees
- represent all suffixes of a given string - used to find the starting index of a suffix - arrays are a memory-efficient alternative to trees --- require less memory, but are less powerful *** create all suffixes; fix with end-of-string identifier; then sort lexicographically we LOSE the original data
72
Burrows-Wheeler transforms (BWT)
- compresses the amount of data that we have to store without losing the original data --- allows for reversibility of data Basic concept of BWT: -- Append a unique end-of-string (EOS) marker to the input string. -- Generate all rotations of the string. -- Sort these rotations lexicographically -- Extract the last column of the sorted matrix as the BWT output. -- 1st column is more compressible but lose context/reversibility
73
backwards search for BWT
- backwards search efficiently finds occurrences of a pattern in the text using L-F mapping - reversibility of BWT is better than suffix arrays bc we do not lose data
74
alignment
- specifies where exactly in the transcript this read came from -- (at position ___) - specifies where exactly in the transcript this read came from *** need to determine the read's exact position in the transcript but they are SOOO EXPENSIVE $$$$
75
pseudoalignment
- specifies that it came somewhere from this transcript (compatible) - Finds which transcript, but not where - Identifies which transcripts are compatible with the read, skipping the precise location step - Faster and less resource intensive than alignment based methods - Lacks certain details (position and orientation of reads) which are useful for correcting technical biases
76
quantifying gene expression levels
- Must scale data for higher precision, less memory -- Read per kilobase(RPK): corrects this experimental bias through normalization by gene length -- Parts per million (ppm)
77
generative model
- a statistical model that explains how the observed data are generated from the underlying system - Defines a computational framework that produces sequencing reads from a population of transcripts - get reads from the transcript though we don't know how much transcript is there bc it is bias --- go backwards to calculate transcript abundance from the read distribution
78
transcript fraction
- tells us the proportion of total RNA molecules in the sample that come from a certain transcript - adjusts for the fact that longer transcripts generate more reads - normalizes length VS nucleotide to transcript proportions
79
fragment probabilities
conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical bias - SALMON approximates Positional bias: - Fragments that include transcript ends might be too short - Fragments from central regions are more likely to be of optimal length for sequencing reads GC content: - Undersample GC regions - Make good stop codons - Oversample AT rich regions
80
expectation-maximization algorithm
E) estimate missing info (assignment of fragments to transcripts) using the current transcript abundance estimates M) use the estimated assignments to update the transcript abundances (improves likelihood) For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum -- ensures the accuracy of abundance estimates by correcting bias learned during the estimation (online) phasee
80
transcript effective length
adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
81
maximum likelihood estimation (MLE) goal
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
82
2 Phase interference in Salmon
1. online phase: fast, initial estimates of transcript abundances 2. offline phase: refines initial estimates using more complex optimization techniques ** balances speed (online) with accuracy (offline)
83
quasi-mapping
- a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts *** often used for the initial estimates of the online phase in SALMON Expensive so stops after identifying seeds !!!
84
SALMON transcript-fragment assignment matrix
- uses matrix to identify distributions of reads amongst the transcripts --- computationally assigns fragments to transcripts *** maps RNA-seq reads (fragments) to transcripts, enabling accurate quantification of transcript levels --- decides how many fragments are assigned to a specific transcript (higher expression = more fragment abundance in a transcript)
85
statistical model
mathematical tool that describes how data is generated ***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance
86
hypothesis testing
- see if the difference in expression between conditions statistically significant Null (Ho) = There is no difference in gene expression between the 2 conditions. Alternative (H1) = There is a significant difference in gene expression between the conditions. Reject the null hypothesis when there is a difference that could not have happened by random chance
87
p-value
- the probability of the null hypothesis being true -- The higher the p-value, the more our model supports the null hypothesis -- The lower the p-value, the more our model supports the alternative hypothesis
88
differential gene expression
the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions
89
RNA-seq pipeline
1. read alignment: mapping transcripts to the genome 2. quantification: measuring gene expression levels 3. differential expression analysis: identifying key genes and comparing gene expression levels 4. dimensionality reduction: visualizing complex data *** quantifying gene expression levels using high-throughput sequencing technologies generates count data
90
count data
the number of RNA fragments that map to each gene ** generated for RNA-seq
91
binomial distribution
models the number of successes in a fixed number of independent trials, where each trial has the same probability of success
92
binomial distribution limitations
MAIN limitation → assumes that the probability of success is constant between samples Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth. Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not at all.
93
Poisson distribution
- a baseline for modeling discrete counts - statistical tool used to model the number of events(or counts) that happen in a fixed period of time or space, where: -- The events are independent of each other -- Each event has a constant average rate ***gives an accurate distribution of counts if mean and variance are equal (AKA probability of observing the sequenced fragments)
94
parity plots
show deviations in Poisson distribution (mean = variance line) Higher counts typically have larger variance
95
overdispersion
when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution) -- Expected variance for Poisson-distributed data equals the mean Reflect biological variability between samples not captured by the experimental conditions -- Differences in RNA quality -- Sequencing depth -- Biological factors like different cell types within the same tissue ***Negative Binomial distribution accounts for high dispersion***
96
zeros in RNA-seq data
- RNA-seq data contains zero counts for some genes because not all genes are expressed under all conditions. - Most statistical models account for variance, but not that 0’s can dominate counts --- zero-inflated models
97
RNA-seq data is SO MESSY
Need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way
98
negative binomial
- gives an estimation of a log fold change - also gives standard error of how uncertain
99
structural biology
SB determines the 3D shapes of biological macromolecules and how these shapes relate to function Primary Goal: to understand how molecular machines in cells work by deciphering their atomic arrangements
100
structural biology challenges
- technical limitations - biological complexity - resource constraints
101
covalent bonds
- the framework of biomolecules - Formed when atoms share pairs of electrons that hold molecules together
102
Relevant characteristics of covalent bonds →
Strength and stability: covalent bonds provide the necessary stability for complex biological structures Directionality: covalent bonds limit the specific angles and orientations leading to the 3D shapes of biomolecules
103
Non-covalent
- dynamic glue - weaker than the covalent bonds and involve electostatics
104
single vs double/triple bonds
-- Single bonds: allow rotation, contributing to molecular flexibility -- Double/Triples bonds: restrict rotation, affecting rigidity/molecule function
105
Noncovalent interactions drive most of biology →
Molecular recognition → -- Enzyme-substrate binding -- Antigen-Antibody interactions Macromolecular structure → -- Membrane formation -- Protein-protein interactions -- Base pairing in DNA and RNA -- Protein folding
106
primary structure
linear sequence of amino acids, held together by peptide bonds -- dictates how the protein will fold into higher-order structures -- does not reveal the protein’s functional form or activity alone -- may also depend on cellular factors (e.g. chaperones)
107
Secondary structure →
local conformations of the polypeptide chain *** stabilized primarily by hydrogen bonds -- Structural motifs are critical for certain functions. (alpha helixes/B-pleated sheets) -- undergo local fluctuations adding to functional flexibility (unwind/twist)
108
Tertiary structure →
Refers to the complete 3D shape of a single polypeptide chain -- Predicting how a sequence folds into its tertiary structure is complex -- Reveal active sites/binding pocket
109
X-ray Crystallography →
- electrons mix into different molecular orbitals at characteristic energy levels - products an e- density distribution that is unique to that structure - Probe: photon (carrier of electromagnetic radiation) - Basic Principle: photons scatter when they interact with atoms --- The scattered X-rays form a diffraction pattern unique to the crystal --- Constructive interference is needed to amplify signal for detectors
110
diffraction patterns
- The spots on the detector represent the reflections of the scattered X-rays - Intensity of the spots reflects the electron density in the crystal - Position and angle of the spots correspond to the geometry *** The diffraction pattern does not directly show the atomic positions but provides the data needed to infer the electron density
111
Building the electron density map →
- Reveals the distribution of e- in the crystal, indicating where atoms are located - The electron density map is interpreted by fitting atomic models (e.g. amino acids for proteins) into density * Low-resolution data make it difficult to assign atomic positions precisely, leading to uncertainty in the model
112
why crystals?
- Crystals have the same repeating unit cell, which amplifies our signals - If in solution, particles would be: - Too sparse to diffract - Moving and diffraction pattern would constantly change
113
Challenges in X-ray Crystallography →
- Flexible or disordered regions do not pack into crystals well, often leading to failure in obtaining high-quality crystals - flexible/disordered regions do not show up clearly in the electron density map - Crystals capture a single conformation of the molecule, often ignoring the flexibility or dynamic range
114
Cryo-Electron Microscopy
a beam of high-energy electrons is used instead of photons -- Electrons have a shorter wavelength than photons -- Scatter light more effectively than x-rays No crystals: The sample is sample is rapidly frozen in vitreous ice to preserve its native structure -- imaged in their native hydrated state. -- can capture multiple conformations uses SPA
115
Single particle analysis (SPA)
- main Cryo-EM technique used to determine the 3d structures of individual macromolecules - Millions of image of individual particles are collected from a thin layer - Particles are computationally aligned and classified into different orientations ---- 2D imaging, particle alignment and averaging, compete 3D structure from 2D projections
116
Cryo-EM advantages and challenges
its ability to capture multiple conformational states of a molecule, providing insights into flexibility and structural heterogeneity highly flexible or disordered molecules may appear as fuzzy or low-resolution regions in the final structure
117
IDPs
lack a stable 3D structure under physiological conditions but are still functional, often gaining structure upon binding to partners - structural techniques often require ordered/stable configurations - May appear fuzzy or have low-resolution in these regions Fit force fields to experimental data of structured proteins BUT there is not a lot of data of IDPs
118
Levinthal’s Paradox
- Proteins can adopt a large number of possible conformations. Levinthal’s Paradox: a protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly.
119
challenges of protein structure prediction
1. large conformational space 2. complex energy landscapes 3. flexibility and dynamics 4. environmental effects 5. PTMs 6. data driven methods
120
potential energy surface (PES)
- represents the energy of a system as a function of the positions of its atoms. - Understands how the system’s E changes upon reactions or movements - Proteins fold to lowest free-E state, but this landscape is highly rugged. - Energy calculations are computationally $$$$ /depend on accurate force fields. -- Lots of potential E minima (so many conformations needs to be tired) -- Multiple minima may be similar but can be far apart in a conformational space
121
homology modeling
- predicts protein structures based on evolutionary relationships - the main principle is that proteins with similar sequences tend to fold into similar structures *** most accurate when sequence identity to other proteins is high (>30%) -- across full protein length
122
threading
- In cases where sequence similarity to known structures is low (<30%), homology modeling becomes unreliable. - Matches sequences to known structural folds based on structural rather than sequence similarity - When remote homologs exist but their evolutionary relationship cannot be detected by sequence comparison alone.
123
Hidden Markov Models (HMMs)
- statistical models representing sequences using probabilities for matches/indels - HMMs model protein sequences as a series of probabilistic states Hidden states: the underlying biological events that are not directly observable Match states: conserved positions in the sequence Insertion states: positions where extra residues are added Deletion states: positions where residues are missing
124
contact map
- 2D representation of which residues are in close proximity (residue interactions in proteins) - represent spatial proximity, not sequence order
125
coevolutionary analysis
- Coevolving residues mutate in a correlated manner. - Mutations in one residue often result in compensatory mutations in its interacting partner - observed across species through analysis of homologous protein sequences - Correlated mutations indicate functionally significant residue pairs *** helps to predict which residues are close in the 3D structure (useful when there is no experimental structure available)
126
how is coevolution detected?
- Coevolution is detected using large MSAs from homologous proteins. - The more diverse the sequences in the MSA, the better the resolution of coevolving residues. - Evolutionary info from MSAs guides predictions for residue-residue contacts.
127
co-evolution signals
- Co-evolution signals can be noisy → - Noise from data can come from random mutations or insufficient evolutionary diversity. - Large and diverse sequence data sets are needed for reliable coevolution predictions.
128
How does alpha fold work
- ML predicts 3D structures only from sequenced data - AlphaFold predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets --- Trains neutral networks on large amounts of protein sequence and structural data --- Neural networks analyzes pattern and learn to recognize them from coevol data Machine learning leverages coevolution for high-accuracy predictions. --- incorporate evolutionary info along with structural features *** Struggle with disordered proteins
129
Molecular dynamics (MD) simulations
- Protein structure determination and prediction provide fixed snapshots - Understanding the motions of the proteins is important - MD simulations → Provide time-resolved insights into protein behavior 1) 3D coordinates of atoms 2) Atoms exert forces 3) Use Newtons to predict movement
130
time steps
Smaller time steps ⇒ more accurate, but more calculations The time step must be smaller than the shortest vibrational period to accurately capture atomic motions.
131
Simulation of Atomic Movement
- computes trajectories of atoms over time scales of femtoseconds to microseconds. - capture both small-scale vibrations and large-scale conformational changes. - Provides detailed information on atomic interactions and energy changes. - Enables the study of mechanisms at an atomic level ***CLASSICAL MECHANICS
132
why are MD simulations beneficial?
- More realistic analysis of proteins (dynamic vs static) - Refines predicted structures - Minimizes E - Accounts for environmental effects for improved accuracy - Studies IDPs - Captures the flexible nature of disordered regions - Aids in understanding functions that rely on disorder - Identifies folding intermediates and misfolding mechanisms-
133
classical mechanics
- Describes the motion of macroscopic objects - Assumes particles have well-defined positions and velocities - Governed by Newton’s Laws of Motion
134
quantum mechanics
- Necessary for describing behavior at atomic and subatomic scales - Accounts for wave-particle duality, uncertainty principle, proton tunneling - E- exhibit quantum behavior that can’t be captured classically
135
electrons in MD simulations
- neglect quantum effects - Electrons are not simulated in MD - Effect is included implicitly through potential E functions (force fields)
136
suitable systems for MD simulations
- Biological macromolecules (protein, nucleic acids, lipids) - Materials where electronic excitations are not critical. - Processes where bond breaking/forming does not occur.
137
limitations of MD simulations
- cannot accurately simulate chemical reactions w/ electronic transitions. - Quantum stuff like tunneling and zero-point energy are not captured.
138
why classical particles?
- It is very expensive to simulate large systems of atoms using quantum mechanics - Detailed electronic structure is not as important (faster, cheaper calculations) *** As a result, no electronic interactions/movements can be captured with MD
139
newton's 2nd law
the acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass.
140
forces
- Forces are negative gradients of potential energy - Thus energy gradients can be used to determine the acceleration and therefore motion of the atoms (velocity)
141
predicting the spatial position of each atom
- Time evolution of a system: computed by integrating equations of motion - Determine force, move forward in time, repeat -Provides cont. Equations of motion using the time steps
142
molecular dynamics are a combination of:
Molecule dynamics are a combination of: bond lengths, angles, and dihedral angles -- Bonds behave like springs (resists changes in distances between 2 atoms) -- Angles behave like harmonic oscillators (hinge with central atom) -- Dihedral angle (angle between 2 planes formed by 4 bonded atoms; describes the rotations of the bond between 2 atoms) - not like springs Approximate bond vibrations and angles as harmonic oscillators Spring constants are determined by bond order (s, d, t)
143
Bonds and Angles
govern local geometry (bond lengths and bond angles) using quadratic (harmonic) potentials that favor specific distances and angles
144
Dihedrals
- govern torsional or rotational flexibility around bonds, typically using periodic and multi-well potentials to allow for multiple stable conformations. - Dihedral potentials must capture arbitrary functions with rotational symmetry.
145
Fourier series
- Modeling dihedral potentials → Fourier series -- Approximate functions as sum of sine and cosine waves --- More sine/cosine terms improves approximations -- Can approximate (any) symmetrical rotational energy function. *** equation to go from structure to energy (instead of using QM)
146
force fields
- compute energies and atomic forces - Setting the force field depends on system type, accuracy/speed, and compatibility - Bonded and non-bonded interactions make up the complete force field
147
topology files
- tell the program which force field parameters to use and where) - define the molecular structure and interactions in the simulations - contains info on atom types, bonds, angles, dihedrals, and non-bonded interactions based on the chosen force field
148
force field parameterization stepwise
Quantum Mechanical Calculations: obtain high-accuracy data for smell molecules and representative fragments Empirical Data Integration: incorporate experimental measurements to validate and refine parameters Parameter Optimization: adjust force field parameters through iterative simulations and comparisons Advanced Techniques: utilize machine learning, multi-scale modeling, and automated pipelines to enhance parameters accuracy and efficiency
149
scenarios in which quantum mechanical effects cannot be ignored
Electronic interactions involving electrons (smaller molecules better) -- Wave-particles, particle tunneling -- Electron transfer involving reactions
150
importance of noncovalent interactions in molecular recognition
- Crucial for simulating multiple molecules in MD - Facilitate organization of molecules into structures - Determine macroscopic props (MP, BP, solubility) - Govern biological functions (enzyme binding, protein folding) *** covalent interactions define primary structure while noncovalent interactions dictate how molecules interact
151
Dual nature of non-covalent interactions →
1. Dispersion forces: weak, attractive forces arising from instantaneous dipoles Stabilize molecular assemblies by promoting close packing 2. Repulsion forces: strong, short-range forces due to overlapping e- clouds Prevents atoms from collapsing into each other; maintains atom integrity
152
criteria for selecting high-quality experimental structures for simulations.
Need a good structure before starting any molecular simulation (equilibrium) Avoid low quality, high-energy conformations with missing/wrong co-factors 1. Resolution: how well the atomic positions are determined Resolution below 2.0 A is generally preferred for high-quality simulations 2. Completeness: Flexible loops/disordered regions are often missing from the structure No missing residudes 3. Functional State: Proteins can exist in different functional conformations: active vs inactive state, bound to ligands or unboard 4. B-factors: Higher B-factors suggest more uncertainty in atom positions, which might make that part of the structure less reliable
153
strategies for adding missing residues or atoms in protein models prior to conducting MD simulations.
- Missing atoms or residues can be added using modeling software like Modelleer - Homology modeling, structure prediction software, MD simulations Unwanted components like ligands or non-essential ions should be removed ligands, ions, or crystallization agents that are not physiologically relevant Distorts protein’s behavior in a simulated biological environment if not removed
154
Describe the steps involved in protein preparation for simulations.
1. Add missing residues 2. Remove unwanted ligands, non-essential ions that distort protein behavior 3. Correct protonation state (pH-sensitive residues) 4. Energy minimization for sterics (adjust structure to remove unfavorable atom position and steric clashes that cause instability) 5. Assign force field parameters 6. Stabilize temp and pressure for MD
155
Evaluating the suitability of a protein structure for simulations
Completeness: Check for missing residues, loops, or side chains; incomplete regions may need modeling to avoid simulation artifacts. Functional State: Ensure the protein’s conformational state (e.g., active, inactive) aligns with the simulation goals; incorrect states could yield irrelevant results. Clash Scores: Assess clash scores (using tools like MolProbity) to identify steric issues; high clash scores indicate steric conflicts that should be resolved with energy minimization or correction. Resolution and R-Factors: Higher resolution (<2.0 Å) and low R-factors indicate greater structural accuracy, while poor values suggest potential inaccuracies.
156
periodic boundary conditions (PBC) in molecular simulations
*** Realistic systems do not have walls - For simulations, need to apply force to keep molecules in the box -- H20 and proteins would bounce off thse walls in an unphysical manner (edge effects) PBC simulates infinite systems from a finite box --- Virtually place exact copies of system in all directions --- Atoms that cross the box edge reappear on the other side (no edge effects) → like pacman - uses Minimum image convention (MIC)
157
Minimum image convention (MIC)
ensures that an atom in the primary box only interacts with the closest image of another atom Images atoms in adjacent boxes are used to calculate interactions across the boundaries (ensures correct interactions)
158
Macrostate
specifies the temp, pressure, volume, and number of particles of molecular systems *** infinite number of macrostates - Large-scale system that defines properties of molecular system (temp, pressure, vol) → changing these changes macrostate - Encompasses all microstates that share the same properties
159
ensemble
- the collection of all possible microstates of a single macrostate - Perfect/accurate ensemble averages require sampling every possible configuration - Macrostate observables are ensemble averages
160
Ensemble examples
Microcanonical Ensemble (NVE) → - Fixed number of particles (N) - Volume (V) - Energy (E) Canonical Ensemble (NVT) → - Fixed number of particles (N) - Volume (V) - Temperature (T) Isothermal-Isobaric Ensemble (NPT) → most common - Fixed number of particles (N) - Pressure (P) - Temperature (T)
161
energy in ensembles
- The instantaneous temperature of microstates will fluctuate, but the ensemble average should be constant There should be no net flow of energy!!!
162
microstate
s a unique configuration defined by the positions and velocities of all particles specific, detailed configuration of a system at the molecular level - Indicates exact position and energy of a particle - Multiple microstates can have the same distance (use mean of them)
163
importance of adequately sampling microstates in MD simulations
- needed to compute reliable ensemble averages - Longer simulation provide better sampling of microstates and their probabilities 1) Statistical accuracy, 2) Covering all conformations shifts, 3) Thermodynamic quantities
164
thermostats
adjust the velocities of particles to increase or decrease the system's kinetic energy → thereby controlling the temperature
165
Berendsen thermostat
adjusts the velocities of all particles uniformly based on the current temperature and target temperature - Scales current velocity based on temp deviation - Prevents abrupt changes that could destabilize the simulation - Simple velocity scaling does not generate a true canonical (NVT) ensemble; it cannot reproduce realistic temperature fluctuations ^^^inaccurately models thermal energy transfer via particle collisions
166
Nose-Hoover thermostat
connect particle momenta to fictitious heat bath - Momenta scaling provides realistic kinetic energy and thus temperature control - Heat bath allows thermal energy to flow in and out of our simulation Q ⇒ a “mass” coupling parameter controls thermostat responsiveness
167
barostats
maintain desired pressure during simulations - Adjusts volume of simulation box to achieve and maintain target pressure - Pressure is proportional to density and temperature - Scales box volume based on pressure difference to target *** all help to keep a consistent macrostate!
168
Ensemble averages improve ...
improve with more simulation time by sampling more microstates -- Many short simulations is better than 1 long one!!! Random initial velocities provide better change of sampling different microstates -- Initial velocities are sent in a direction on the potential E surface simulation; there is a change it never samples a certain minima (multiple simulations with random velocities reduces chance) *** discard initial relaxation as it is not our desired microstate
169
Equilibration (Relaxation) Phase
Purpose: To allow the system to relax and reach a stable, equilibrated state after initial setup. Process: Temperature, pressure, and density are gradually stabilized, with constraints often applied to avoid abrupt movements. Goal: Achieve a realistic starting conformation that reflects the desired ensemble
170
Production (Data Collection) Phase
Purpose: To collect data on the system’s behavior for analysis of properties like energy, structure, and dynamics. Process: Constraints are usually removed, and the system is allowed to evolve naturally. Goal: Gather accurate ensemble averages and insights into properties for the equilibrated system over time.
171
Root Mean Square Deviation (RMSD)
measures deviation in structure over time (what conditions allow conformations to change faster) - Overall change in structure during simulation, tracks deviations from starting conformations (global conformational changes) ***Low is good: close to reference structure ***High indicates significant deviation and large structure changes over time
172
Root Mean Square Fluctuation (RMSF)
How much does amino acid position change; Identifies regions of flexibility in the protein by calculating fluctuation of each atom (tracks local flexibility) about a mean position -- How does it change around its mean, not relative to ref structure ***Low: atom is fixed in place (well-ordered) ***High: fluctuates a lot (flexibility)
173
relationship between energy and probability in molecular simulations
Minima ⇒ concave part in energy diagram - Must overcome energy barrier to make the conformations (TSs can be higher) -- Most preferred would be the lowest energy state ***The probability curve will be exactly OPPOSITE of the energy curve - How much energy is needed between conformation or energy of binding . *** without minimization, high-energy configurations may lead to bad results in MD simulations (removes unfavorable atom positions / sterics)
174
stages of the drug discovery pipeline
Discovery and Preclinical Research (***Computation is most helpful with this stage) -- Potential drugs are identified and tested in non-human studies Clinical Trials -- Testing in human subjects to assess safety and efficacy Regulatory Approval -- Evaluation by agencies like the FDA before the drug can be marketed Post-Marketing Surveillance -- Ongoing monitoring after the drug is available to the public
175
key factors in selecting protein targets for drug design
Disease Relevance: the protein plays a critical role in the disease mechanism Druggability: target has a structure that allows it to bind with drug-like molecules Specificity: Targeting the protein minimizes effects on healthy cells, reducing side effects
176
virtual screening
narrows down potential compounds from large chemical libraries during drug discovery. - tests (HTS) compounds against the target protein - Experimental assays are still expensive, and limited to commercially available compounds -- use computational methods to predict which compounds we should experimental validate *** SO MUCH FASTER TO NARROW DOWN SEARCH SPACE
177
gibbs free energy
Energy is released/uptaken during binding → spontaneity of binding -- Combines enthapy and entropy ***Simulations capture free energy directions instead of treating enthalpy and entropy separately BINDING STRENGTH AND STABILITY AND AFFINITY
178
enthalpy
energetic interactions (sum of contributions provide ensemble avg) - Accounts for non covalent interactions (electrostatics, H-bonds, dipoles, pi-pi) - Ensemble differences in non covalent interactions provide binding enthalpy
179
Electrostatic forces role in binding
(strongest force) Long range interaction, anchor points (~5 to 20 kcal/mol per inter) Charged molecules have a net imbalance between net electrostatic attractions or repulsions between different atoms or molecules
180
h bond role in binding
Specificity/orientation, stabilization, dynamic, strongest when hydrogen, donor, and acceptor atoms are collinear Attraction between a (donor) hydrogen atom covalently bonded to an electronegative atom and another (acceptor) electronegative atom with a lone pair
181
Uneven e- distribution role in binding
Directional binding for proper ligand alignment, flexibility (weak) creates partial charges and dipoles Electronegativity differences lead to unequal distribution of electron density Consistent electron density spatial variation results in permanent dipoles
182
VDW role in binding
Maximizes surface contact (complementary fit); flexibility Dispersion: Electrons in molecules are constantly moving, leading to temporary uneven distributions that induce dipoles in neighboring molecules Induction: The electric field of a polar molecule distorts the electron cloud of a nonpolar molecule, creating a temporary dipole
183
Pi-pi interactions role in binding
Involve stacking of aromatic rings Orientation of aromatics, selectivity Noncovalent interactions between aromatic rings due to overlap of pi-electron clouds
184
entropy
how much conformational flexibility changes (Energy dispersion) *** Higher entropy ⇒ greater microstate diversity for a given macrostate - Accounts for microstate diversity of a single macrostate Can increase/decrease/remain the same depending on ligand concentration
185
purpose of alchemical free energy simulations
Slowly disappear interactions to see the lowest energy conformations to get insight into binding affinities Gets energy difference between conformations
186
how alchemical simulations work
Compute the binding free energy somewhere in solution and then bind to protein Slowly disappear the ligand (1 = normal interaction, 0 = no interactions) -- More relevant conformational sampling -- Run independent simulations in parallel -- Focuses on taking difference w smaller numbers integrate over these small free energy changes (turn interaction on and off) -- Use this to relatively calculate the free energy difference between bound and unbound states
187
why alchemical simulations not ideal
VERY EXPENSIVE; use docking first to screen molecule, then computes energies - captures all atomistic forces - wide range of conformation sampling - specific parameters
188
docking
Avoid sampling all microstates and determine one “optimal” protein-ligand structure (rigid structure) -- using this bound structure, predict a “score” that is correlated to binding affinity -- Simplifies binding free energy prediction to enhance speed --Ligand is not guaranteed to fit perfectly
189
the importance of choosing an appropriate protein conformation
Protein-ligand interactions are highly-dependent on the protein’s 3D structure Using an inappropriate protein conformation can lead to inaccurate docking results ****ONLY PICKING ONE STRUCTURE -- Challenge bc proteins are dynamic
190
challenge for docking
1. Conformational Flexibility (proteins are not rigid structure and experience movement side to side) 2. Binding sites can change 3. Limited experimental structures (not all relevant states may be covered)
191
Experimental Structure Selection Criteria → Docking
Resolution and Quality Ligand-Bound vs. Unbound Structures Relevance to Target Ligand
192
water molecules in docking
Role in binding: structured water molecules can mediates interactions between the protein and ligand Inclusion Criteria: retain water molecules that are conserved across multiple crystal structures Handling water in docking → Some docking programs allow explicit water molecules in the binding site Alternatively, consider their effect implicitly in scoring functions
193
convex vs concave regions
Convex Regions: Typically inaccessible to ligands. Concave Regions (Cavities): Potential binding sites.
194
grid-based binding pocket detection
Grid-based: puts a protein on a grid, looks for protein vs no protein within grid If there is no protein in a space, there is likely a pocket there (concave space) Macrostate remains constant
195
alpha shape theory
alpha spheres touch certain about of atoms (3 atoms only); cannot put any spheres on the outside in protein land -- Shows pockets based on how many spheres it is touches (group spheres placed in open spaces and indicate it as a pocket) -- uses Delaunay triangulation and alpha complexes to define cavities
196
3 binding pocket classifications
1. Orthosteric: primary active site where ligands bind 2. Allosteric: secondary sites that modulate protein function upon ligand binding 3. Cryptic: binding pockets not apparent in the unboard protein structure but form upon ligand binding or conformational change
197
cryptic pockets
binding pockets not apparent in the unboard protein structure but form upon ligand binding or conformational change -- hard for MD simulations to detect -- must use enhanced MD methods, and apply pocket detection to multiple conformations
198
process of pose optimization in docking to optimize ligand positions within the binding site for accurate binding affinity prediction
*** Accurate docking depends on optimized ligand poses (binding affinity) 1. Initial ligand placement 2. Scoring function evaluation of binding affinity 3. Pose adjustment/optimization (move around ligand and protein residues to achieve best fit) + energy minimization 4. Rescore and rank in terms of best pose 5. Final pose choice
199
Search strategies for docking →
systematic and stochastic
200
systematic search
numerically iterate over all possible conformations -- Only possible for very small molecules, not used very often -- ID important degrees of freedom -- Remove structures with high strain *****ALMOST NEVER DO
201
stochastic searches
random sampling (Monte Carlo) -- Provide better balance of sampling and cost -- See if energy change of new conformation is less than random Steps → - Generate conformation - Compete energy change - If energy change less than a random sample: make move - Repeat ***Allows us to sample efficiently!
202
Scoring function
are parameterized models to estimate binding affinity after docking -- Physics-based methods using force-field like methods -- Machine learning (graphing neural networks) have been gaining traction recently
203
Structure-Based Drug Design VS Ligand-Based Drug Design
Structure-Based Drug Design: - Requires 3D structure of the target protein. - Uses the binding site structure to model potential interactions. - Often employs docking and molecular simulations. Ligand-Based Drug Design: - Requires no structural information of the target. - Uses the chemical structure and activity of known ligands as guides. - Relies on molecular similarity rather than direct binding predictions.
204
Molecular weight
indicates the overall size of the molecule -- Impacts drug distribution and elimination rates in the body
205
LogP
measures lipophilicity (chemical compound's ability to dissolve in lipids, fats, oils, and non-polar solvents) -- Influences a molecule's ability to cross cell membranes and affects absorption and bioavailability
206
molar refractivity
relates to polarizability and electron cloud distribution -- Affecting intermolecular interactions and binding affinity
207
TPSA
estimates the molecule’s ability to form hydrogen bonds --impacting solubility and permeability across biological membranes
208
number of rotatable bonds
reflects molecular flexibility -- influences binding affinity and oral bioavailability
209
Extended connectivity fingerprints (ECFPs)
encode structural features into numerical representations -- Hash functions are used to encode chemical information -- Can be encoded into a bit array
210
Tanimoto similarity
compares ECFPs between 2 molecules based on ... Molecular similarity: the concept that similar molecules often show similar biological effects. Formula measures the ratio of the shared features to the total number of unique features between 2 molecules
211
how molecular fingerprints are generated by hashing atom-specific properties
- Numerical representation of the properties of the molecules - Chose a function, not a number → function turns into number - Generate a number consistently based on whatever input given for keeping track of molecules in the system
212
hash function iterations
used to encode chemical info For each iteration: - incorporate hashes of atoms that are n bonds away - Then encode atom IDs that exactly one bond away - Repeat while hashing n-1 IDs Each iteration encodes local chemical info into each atom’s ID Similar features will share atom IDs until our iterations starts incorporating new features (encodes multiple levels of info)
213
Bit arrays
fixed-length collections of ones and zeros Allow for efficient operations Encoded into bit array to store a collection of atom IDs
214
QSAR models
link chemical structure with biological activity Predict the biological activity of molecules based on their structure to reduce the need for experimental screen (quick and cost-effective)
215
challenges of efficiently exploring chemical space to find active compounds similar to known bioactive molecules
So many molecules to find and finite amount of time -- Chemical space is vast -- Diverse properties -- Exhausts computational resources -- Reliability of predictive models