Final Exam Flashcards

1
Q

DNA sequencing set-up

A
  1. Start with bacterial culture for product of interest
  2. Separate cells from media via centrifuge
  3. Keep DNA by breaking open cells via lysing
  4. Isolate and purify DNA using liquid-liquid extraction (aq layer has DNA)
2
Q

chemical lysis

A

destabilizes the lipid bilayer and denatures proteins

3
Q

surfactants

A

one hydrophobic tail, which allows them to further penetrate molecular structures as compared to phospholipids with 2 tails
Similar to phospholipids, but break through barrier and destabilize proteins better

4
Q

Main problem of determining the order of nucleotides

A

DNA elongation happens rapidly and continually
Uses DNA polymerase and excess of nucleotides to make copies of DNA
3’ OH is required for DNA elongation
Dideoxynucleotides (ddNTPs) stop replication because they lack the 3’ OH, so polymerase cannot add another nucleotide after them

5
Q

sanger sequencing

A
  • accurate, long reads, but resource consuming
  • use one beaker and fluorescence to distinguish between the ddNTPs
    – Fragment separation can be automated via capillary gel electrophoresis
    – Separates molecules by size based on their charge-to-mass ratio

Smaller molecules move more freely through the gel and migrate faster than larger molecules
molecules must be charged through tagging
– Unique signal per ddNTP produces the chromatogram

6
Q

Building strand from fragments

A

Sort DNA fragments by length to see what the last nucleotide was

Line up the last 5’ nucleotide; gradually builds the 3’ end up to get strand
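
A minimal sketch of this read-out, assuming each terminated fragment is summarized as a (length, terminal base) pair. The function name `read_sequence` and the example fragments are made up for illustration.

```python
def read_sequence(fragments):
    """fragments: list of (length, terminal_base) pairs from chain termination."""
    # The shortest fragment terminated closest to the primer, so sorting by
    # length yields the bases of the synthesized strand in 5' -> 3' order.
    return "".join(base for _, base in sorted(fragments))


# Hypothetical fragments, listed out of order
fragments = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_sequence(fragments))
```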

7
Q

Original Set up →

A
  1. Split sample into 4 beakers
  2. Add all 4 dNTPs to each beaker, plus one radioactive ddNTP (a different one per beaker)
    Need separate beakers bc the radioactive labels cannot be told apart
  3. Add Taq polymerase
  4. Separate by length using gel electro.
    Shortest lengths travel the farthest; associate them with a beaker
8
Q

Good vs Bad chromatogram

A

Good:
- Variation in peak height is less than 3-fold.
- Peaks are evenly distributed and one color
- Baseline noise is absent
- Interpreted nucleotide sequence is 5’ → 3’
Bad:
- Significant noise up to ~20 bps in (unreliable transport properties)
- Dye blobs occur from unused ddNTPs
- Fewer longer fragments so signal is weaker

9
Q

Illumina

A

short reads, but high throughput

  • Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
  • Primers are not complementary, so they do not base pair
  • Fragments become bound somewhere in the flow cell
  • locally amplify bound DNA fragments to get clusters of the same sequence
  • Bridge amplification creates double-stranded bridges
  • Double-stranded clonal bridges are denatured with cleaved reverse strands
  • uses paired-end sequencing

***clusters will give off a stronger signal compared to a single fragment

10
Q

Illumina stepwise

A
  1. Add labeled dNTPs into flow cells
  2. Incorporate a complementary nucleotide
  3. Remove unincorporated fluorescent nucleotides
  4. Capture fluorescent signal & image clusters
  5. Cleave the fluorophores and the protecting group
11
Q

Paired-end sequencing

A

generated from both ends of a DNA fragment with known insert size

enables both ends of the DNA fragment to be sequenced

Because the distance between the paired reads is known, alignment algorithms can use this info to map the reads over repetitive regions more precisely.

Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome

** more expensive but ideal for genome assembly

12
Q

Nanopore

A

Longer reads, more accurate for assembling reads into genome

Very expensive, low throughput

13
Q

single-end reads

A
  • generated from only one end of a DNA fragment
  • Simpler, fast, more cost-effective
  • Limited context for structural variations or duplications
  • Used for small genomes and RNA seq where contiguity is less critical
14
Q

Genome assembly

A
  • process of combining our sequencing reads into a continuous DNA sequence

(Sequencing provides short, overlapping reads of DNA)

Having multiple fragments that contain the same portion of the sequence improves our coverage

15
Q

reads

A

raw sequences coming from the experiments

16
Q

Contigs

A

continuous stretches of DNA seq from overlapping seq reads

17
Q

Ambiguous assembly

A

contigs put together in an unknown order

Accounts for differences in scaffolds; Assemble using reference genome

18
Q

Scaffold

A

contigs placed in a known order, with estimated gaps between them

19
Q

main challenges for de novo genome reconstruction

A

Repeats: create ambiguity and can cause misassemblies; inflate genome size

High Coverage: sequencing the genome multiple times, resulting in a greater number of reads that overlap any given region of the genome

20
Q

greedy overlap

A

de novo genome reconstruction

Goal is to assemble the strings (reads) into a continuous, single string (contig)

Want the shortest possible superstring

  1. Overlap maximization
    – Reduces redundancy, maximizes confidence with highest overlap
  2. Repeat resolution
    – Resolves repeats by favoring collapsed arrangements
  3. Evolutionary pressure
    – Most genomes have selective pressure to be efficient
21
Q

how to do a greedy assembly?

A

merge by highest overlap!!

Repeats ruin assembly ⇒ can cause missing reads

Increase K to overcome repeats
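
A small sketch of greedy merging, under the assumption that reads are error-free strings and that only suffix-prefix overlaps matter. The names `overlap` and `greedy_assemble` are illustrative, not taken from any assembler.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        # Find the ordered pair of reads with the highest overlap.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGGC", "GGCTA", "CTAAC"]))
```

Because the merge always takes the single best overlap, a repeat longer than the overlap can pull two unrelated reads together, which is the failure mode the card warns about.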

22
Q

de Bruijn graphs

A
  • help to visualize relationships/overlaps between the strings
  • Node = single entity [(k-1)-mer]
  • Edge = represents a connection between entities (can have direction) [k-mer]
  • uses directed edges to specify overlap and concatenation
  • Each unique (k-1)-mer is a node; each k-mer is an edge (k-mer = substring of length k)
  • A node is balanced if indegree equals outdegree
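
A minimal sketch of building such a graph, assuming the convention where each k-mer is an edge between its (k-1)-mer prefix and suffix. The function names `de_bruijn` and `is_balanced` are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edges are k-mers; nodes are their (k-1)-mer prefix and suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # directed edge prefix -> suffix
    return graph


def is_balanced(graph, node):
    """A node is balanced if its indegree equals its outdegree."""
    indegree = sum(targets.count(node) for targets in graph.values())
    outdegree = len(graph.get(node, []))
    return indegree == outdegree


g = de_bruijn(["ATGGC"], k=3)
print(dict(g))
```

For a single read the first and last nodes are the two semi-balanced ones, so the walk through every edge starts at one and ends at the other.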
23
Q

multiple reads for DB graphs

A

not Eulerian bc cannot walk along each edge once; 2 semi-balanced nodes

edges on walk extend the contig in multiple directions

24
Q

errors in assembly effect on DB graphs

A

Errors affect:
1) k-mer counts, 2) increase # of edges and unconnected graphs

  • No overlap would lead to unconnected graphs; weights can be added to arrows (#)

Error correction should remove most tips, islands, bulges (splits and reconnects)

25
Q

high coverage for de Bruijn

A

High coverage suggests that a node is likely a true sequence rather than an error

26
Q

How do we choose the “best” path for our contig?

A

Long paths are desired but not always reliable due to potential repeats
High, consistent read coverage
Unique, non-branching paths

27
Q

SPAdes

A
  • prokaryotic genome assembler based on DB graphs
  • Estimates gaps between reads using DB graphs
  • Builds multisized graphs with different k’s.
  • Using multiple graphs allows for a better handling of variable coverage.
  • Assemblers provide contigs and scaffolds (connections how contigs form scaffolds)
28
Q

Large VS Small K

A

Large K ⇒ fragmented graphs; helps reduce repeat collapsing
Small K ⇒ collapsed/tangled graph good for low-coverage regions

29
Q

L50

A

Smallest NUMBER of contigs (taken longest first) whose combined length is at least 50% of the total assembly length

Lower is better for L50 value

**longer contigs = more confidence that genome is right

30
Q

N50

A

LENGTH of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length

Higher is better for N50 value [length-weighted median contig size = reliability factor]

***N50 is the length of the shortest contig in the L50 set
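
A short sketch of how the two metrics relate, assuming they are computed against the total assembly length (the sum of contig lengths) rather than a known genome size. The function name `n50_l50` and the example lengths are illustrative.

```python
def n50_l50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)  # longest first
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count  # (N50, L50)
    raise ValueError("need at least one contig")


# Hypothetical contig lengths; total is 300, so the 50% mark is 150
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```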

31
Q

leading synthesis

A
  • by addition of dNTP instead of ddNTP
  • synthesis is ahead by 1 nucleotide bc 2 were added at once
32
Q
A
33
Q

lagging synthesis

A

by failure to remove blocking fluorophore

34
Q

signal cross talk

A

degrades quality of assemblies

35
Q

phred quality score

A

assess the accuracy of nucleotide base calls in DNA sequencing (prob that base call is incorrect)

Phred quality scores are stored as ASCII-encoded characters in the FASTQ file
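
A minimal sketch of the encoding, assuming the common Phred+33 offset and the standard relation Q = -10 · log10(P), where P is the probability that the base call is wrong. The function names are illustrative.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality line into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual_string]


scores = decode_quality("I5!")
print(scores, [phred_to_error_prob(q) for q in scores])
```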

36
Q

Per sequence GC content

A

Deviation from normal distribution indicates contamination (reads)

37
Q

Trimming/Filtering

A

reduces bias of bad base calls normally at the ends of reads

Trimming/cutting/masking sequences
– From low quality score regions
– Beginning and end of sequence
– Remove adapters

Filtering of sequences
– With low mean quality score
– Too short
– With too many ambiguous (N) bases

38
Q

structural annotation

A

identifies critical genetic elements such as genes, promoters, and regulatory elements

39
Q

Functional annotation

A
  • predicts the function of genetic elements
40
Q

Reading ORFS →

A
  1. Seek the standard start codons: ATG, GTG, or TTG
  2. Seek the stop codons based on the translation table
    TAA, TAG, TGA for bacteria, archaea, and plant plastids
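
The two steps can be sketched as a simple scan. This version is a simplification that only searches the three forward-strand reading frames (the reverse complement is omitted), and the name `find_orfs` is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, orf) tuples for the three forward reading frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # min_codons counts codons before the stop, start included
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```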
41
Q

Typical elements of a gene that are annotated

A

Promoter, start site, 5’ UTR, exons, introns, start codon, CDS, stop codon, 3’ UTR

42
Q

MSA

A

the process of aligning three or more biological sequences simultaneously

Identifies conserved regions across multiple species

Reveals patterns not visible in pairwise comparisons (evol. relationships)

Key characteristics:
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence

43
Q

Important elements of scoring in alignment selection

A

Objectivity: provides a quantitative measure for comparison

Optimization: allows algorithms to find the best alignment

Significance: helps distinguish real homology from random similarity

44
Q

Alignment elements reflect …

A

evolutionary events in sequences

(match, gap, mismatch)

45
Q

match

A

identical characters in aligned positions

Represents conserved regions or no change

46
Q

mismatch

A

different characters in aligned positions

Indicates substitutions or mutations

47
Q

gap

A

dash(-) inserted to improve alignment

Represents insertions and deletions (indels)

48
Q

global alignment

A

compares sequences in their entirety (start to end)

Key characteristics:
– Attempts to align every residue in both sequences
– Introduces gaps as necessary to maintain end-to-end alignment
– Optimizes the overall alignment score for the entire sequences

Needleman-Wunsch: guarantees optimal global alignment

49
Q

Advantages of global alignment

A

Provides a complete picture of sequence similarity

Ideal for detecting overall conservation patterns

Useful for phylogenetic analysis of related sequences

50
Q

limitations of global alignment

A

May force alignment of unrelated regions in divergent sequences

Less effective for sequences of very different lengths

Can be computationally intensive for long sequences

51
Q

local alignment

A

identifies best matching subsequences; focus on regions of high similarity

Key characteristics:
– Does not require aligning entire sequences end-to-end
– Allows for identification of conserved regions or domains
– Ignores poorly matching regions
– Can find multiple areas of similarity in a single comparison
– Aligns subsections of sequences
– Protein motif identification exemplifies local alignment utility (identifies functional regions)

Smith-Waterman

52
Q

Needleman Wunsch

A
  • start with 0 in the top-left corner
  • add the gap penalty cumulatively along the first row and first column
  • fill each cell with the highest possible score from its neighbors, including penalties
  • final score is in the bottom-right cell
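
A compact sketch of the matrix fill for global alignment, assuming a linear gap penalty. The scoring values (match = 1, mismatch = -1, gap = -2) are arbitrary choices for illustration, and only the final score is returned (no traceback).

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row accumulate gap penalties.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]  # bottom-right cell holds the global score


print(needleman_wunsch_score("AC", "AGC"))
```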
53
Q

Smith Waterman

A
  • 0 is the lowest possible score
  • if negative, make it 0
  • enter 0’s in the first row and column
  • Start traceback at the highest-scoring cell
  • Stop aligning when you encounter a zero
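
These rules can be sketched as follows, assuming a linear gap penalty and illustrative scores (match = 2, mismatch = -1, gap = -2). The function name `smith_waterman` is just a label for this sketch.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative scores are clamped to 0.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback starts at the highest cell and stops at the first zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```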
54
Q

Smith-Waterman differs from Needleman-Wunsch in key aspects →

A

Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros

Scoring system:
NW: allows negative scores
SW: negative scores are set to zero

Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix

55
Q

linear gap penalty

A

fixed cost for each gap

Simple to implement, but over-penalizes long gaps

56
Q

affine gap penalty

A

different costs for opening and extending gaps

Better for long indels, more biologically realistic (Single mutation event often causes multi-base indel)
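
A small numeric comparison of the two schemes. The costs are arbitrary illustrative values, and the affine form used here (opening cost for the first gap position, extension cost for each further position) is one of several conventions in use.

```python
def linear_gap_cost(length, per_gap=2):
    """Every gap position costs the same."""
    return per_gap * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Opening a gap is expensive; extending it is cheap."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One 10-base indel versus ten separate 1-base indels
print(linear_gap_cost(10), 10 * linear_gap_cost(1))
print(affine_gap_cost(10), 10 * affine_gap_cost(1))
```

Under the linear scheme both scenarios cost the same, while the affine scheme charges far less for the single long indel, matching the idea that one mutation event often inserts or deletes several bases at once.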

57
Q

position-specific gap penalties

A

Reduced in variable regions; increased in conserved regions

58
Q

residue-specific gap penalties

A

Adjust penalties based on amino acid properties

59
Q

terminal gap penalties

A

Often reduced to allow end gaps in local alignments

60
Q

transcriptomics

A

allows us to see exactly what genes are active within a given moment

Allows us to see changes in gene expression over time (picture of gene exp.)

Works with a complete set of RNA transcripts (mRNA, rRNA, tRNA, non-coding RNA)

Captures the dynamic nature of the transcriptome to reflect the functional state of the cell; captures cell’s response to environment and signals

*** what annotated genes are actually being used

61
Q

isoforms

A

a single gene can produce multiple mRNA transcripts

Way for org. to increase protein diversity without increasing the number of genes

reveals alternative splicing and isoform usage that varies with cell type, environment, and developmental state

62
Q

genomics VS transcriptomics

A

Functional insights →
- Genomics: identifies potential functional elements; predicts disease risk
- Transcriptomics: reveals which elements are active; shows disease state

Temporal insights →
- Genomics: requires one-time sampling; reveals evolutionary history
- Transcriptomics: captures real-time cellular responses

63
Q

single-cell transcriptomics

A
  • revolutionizes resolution
  • best for rare cells with complex tissue types
  • captures gene expression in an individual cell
  • reveals cellular heterogeneity within the tissues

***very powerful data but can be very sparse and noisy

***Not very reproducible bc there is so little RNA in a cell; typically paired with bulk RNA analysis

64
Q

spatial transcriptomics

A
  • maps gene expression to location
  • Preserves spatial information of transcripts within tissue sections
  • Reveals how cellular neighborhoods influence gene expression
65
Q

RNA integrity number

A
  • rRNA makes up a large percentage of our RNA
  • lower numbers indicate a degraded sample (the 28S rRNA peak shrinks relative to 18S)
66
Q

filter for mRNA only

A

poly A tail primer allows for amplification of only mRNA

67
Q

microarrays

A
  • convert mRNA to cDNA
  • no longer in practice
  • LIMITED to known sequences
  • similar sequences may cause false positives
  • limited dynamic range
  • normalization challenges
  • potential for bias
68
Q

RNA-seq

A
  • doesn’t require prior knowledge of sequences; allows for discovery of novel transcripts/isoforms (primary advantages over microarray technology)
69
Q

computational pipeline for RNA-seq data analysis

A
  1. Read alignment: mapping reads to the genome
  2. Quantification: measuring gene expression levels
  3. Differential expression analysis: identifying key genes
  4. Dimensionality reduction: visualizing complex data (not in practice)
70
Q

Hash table

A
  • link a key to a value
  • keys represent a label we can use to get info
  • hash function maps each key to the location where its value is stored
  • DNA dictionary with quick lookup and direct access to potential matches
    (large memory and slow for large genomes)

** way for reads to be mapped to reference genomes
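
A minimal sketch of the "DNA dictionary" idea, using a Python dict as the hash table: keys are k-mers and values are the positions where they occur in the reference. The names `build_kmer_index` and `candidate_positions` and the toy reference are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Look up the read's first k-mer (the seed) to get candidate mapping positions."""
    return index.get(read[:k], [])


index = build_kmer_index("ACGTACGTGA", k=4)
print(candidate_positions("ACGTG", index, k=4))
```

Each candidate still has to be checked against the rest of the read; here only position 4 extends to the full read "ACGTG".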

71
Q

suffix arrays/trees

A
  • represent all suffixes of a given string
  • used to find the starting index of a suffix
  • arrays are a memory-efficient alternative to trees
    — require less memory, but are less powerful

*** create all suffixes; fix with end-of-string identifier; then sort lexicographically

we LOSE the original data
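
The construction steps can be sketched naively as below. Sorting full suffix strings is fine for a toy example but far too slow and memory-hungry for real genomes; the function name `suffix_array` is illustrative.

```python
def suffix_array(text):
    text += "$"  # end-of-string marker; "$" sorts before A, C, G, T and a-z
    # Sort suffix start positions by the suffix each one begins.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array("banana"))
```

The result holds only starting indices, so the text itself has to be kept alongside it.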

72
Q

Burrows-Wheeler transforms (BWT)

A
  • compresses the amount of data that we have to store without losing the original data
    — allows for reversibility of data

Basic concept of BWT:
– Append a unique end-of-string (EOS) marker to the input string.
– Generate all rotations of the string.
– Sort these rotations lexicographically
– Extract the last column of the sorted matrix as the BWT output.
– 1st column is more compressible but lose context/reversibility
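
A naive sketch of the transform and its inverse, following the steps listed above. It builds every rotation explicitly, which is only practical for short strings, and the function names are illustrative.

```python
def bwt(text, eos="$"):
    text += eos  # unique end-of-string marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)  # last column


def inverse_bwt(last_column, eos="$"):
    # Repeatedly prepend the last column and re-sort to rebuild all rotations.
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]  # drop the marker


encoded = bwt("banana")
print(encoded, inverse_bwt(encoded))
```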

73
Q

backwards search for BWT

A
  • backwards search efficiently finds occurrences of a pattern in the text using L-F mapping
  • reversibility of BWT is better than suffix arrays bc we do not lose data
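
A simplified sketch of backward search that counts pattern occurrences from the BWT string alone. Real indexes precompute the rank tables instead of calling `count` on slices, so this shows the logic rather than the performance; the function name and the "banana" example are illustrative.

```python
def backward_search(bwt_string, pattern):
    """Count occurrences of pattern in the original text via LF mapping."""
    first_column = sorted(bwt_string)
    # first_row[c]: index of the first sorted rotation starting with c
    first_row = {}
    for idx, ch in enumerate(first_column):
        first_row.setdefault(ch, idx)
    lo, hi = 0, len(bwt_string)  # half-open range of matching rows
    for ch in reversed(pattern):  # process the pattern right to left
        if ch not in first_row:
            return 0
        lo = first_row[ch] + bwt_string[:lo].count(ch)
        hi = first_row[ch] + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


# "annb$aa" is the BWT of "banana" with "$" as the end marker
print(backward_search("annb$aa", "ana"))
```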
74
Q

alignment

A
  • specifies where exactly in the transcript this read came from
    – (at position ___)

*** alignment methods determine the read’s exact position in the transcript, but they are computationally EXPENSIVE

75
Q

pseudoalignment

A
  • specifies that it came somewhere from this transcript (compatible)
  • Finds which transcript, but not where
  • Identifies which transcripts are compatible with the read, skipping the precise location step
  • Faster and less resource intensive than alignment based methods
  • Lacks certain details (position and orientation of reads) which are useful for correcting technical biases
76
Q

quantifying gene expression levels

A
  • Must scale data for higher precision, less memory
    – Reads per kilobase (RPK): corrects for gene-length bias (longer genes yield more reads) through normalization by gene length
    – Parts per million (ppm)
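
One way to combine the two scalings is the widely used transcripts-per-million (TPM) measure. Reading the card this way is an assumption; the gene names, counts, and lengths below are invented for illustration.

```python
def tpm(counts, lengths_bp):
    """Length-normalize counts to RPK, then rescale so the values sum to one million."""
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical genes with equal read counts but different lengths
counts = {"geneA": 100, "geneB": 100}
lengths_bp = {"geneA": 1000, "geneB": 2000}
print(tpm(counts, lengths_bp))
```

With equal raw counts, the shorter gene ends up with twice the normalized expression, which is the length correction the card describes.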
77
Q

generative model

A
  • a statistical model that explains how the observed data are generated from the underlying system
  • Defines a computational framework that produces sequencing reads from a population of transcripts
  • we observe reads from the transcripts, but we don’t know how much of each transcript is there because sampling is biased
    — go backwards to calculate transcript abundance from the read distribution
78
Q

transcript fraction

A
  • tells us the proportion of total RNA molecules in the sample that come from a certain transcript
  • adjusts for the fact that longer transcripts generate more reads
  • converts length-weighted nucleotide proportions into transcript proportions by normalizing for transcript length
79
Q

fragment probabilities

A

conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical bias

  • SALMON approximates

Positional bias:
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads

GC content:
- Undersample GC regions
- Make good stop codons
- Oversample AT rich regions

80
Q

expectation-maximization algorithm

A

E) estimate missing info (assignment of fragments to transcripts) using the current transcript abundance estimates

M) use the estimated assignments to update the transcript abundances (improves likelihood)

For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
– ensures the accuracy of abundance estimates by correcting bias learned during the estimation (online) phase
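
A toy sketch of the E and M steps. It assumes all transcripts have the same effective length and ignores fragment probabilities and bias terms, so it is not how Salmon itself is implemented; the function name `em_abundances` and the example data are made up.

```python
def em_abundances(compat, n_transcripts, n_iter=100):
    """compat: for each fragment, the list of transcript indices it is compatible with."""
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting guess
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        # E-step: split each fragment across its compatible transcripts
        # in proportion to the current abundance estimates.
        for transcripts in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                assigned[t] += theta[t] / total
        # M-step: new abundances are the normalized expected fragment counts.
        theta = [a / len(compat) for a in assigned]
    return theta


# Three fragments unique to transcript 0, one unique to transcript 1, one ambiguous
compat = [[0], [0], [0], [0, 1], [1]]
print(em_abundances(compat, n_transcripts=2))
```

The ambiguous fragment ends up mostly credited to transcript 0 because the unique fragments already suggest it is more abundant.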

80
Q

transcript effective length

A

adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled

81
Q

maximum likelihood estimation (MLE) goal

A

The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)

82
Q

2-phase inference in Salmon

A
  1. online phase: fast, initial estimates of transcript abundances
  2. offline phase: refines initial estimates using more complex optimization techniques

** balances speed (online) with accuracy (offline)

83
Q

quasi-mapping

A
  • a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
    *** often used for the initial estimates of the online phase in SALMON

Full alignment is expensive, so quasi-mapping stops after identifying seeds !!!

84
Q

SALMON transcript-fragment assignment matrix

A
  • uses matrix to identify distributions of reads amongst the transcripts
    — computationally assigns fragments to transcripts

*** maps RNA-seq reads (fragments) to transcripts, enabling accurate quantification of transcript levels
— decides how many fragments are assigned to a specific transcript (higher expression = more fragment abundance in a transcript)

85
Q

statistical model

A

mathematical tool that describes how data is generated

***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance

86
Q

hypothesis testing

A
  • see if the difference in expression between conditions is statistically significant

Null (Ho) = There is no difference in gene expression between the 2 conditions.

Alternative (H1) = There is a significant difference in gene expression between the conditions.

Reject the null hypothesis when there is a difference that could not have happened by random chance

87
Q

p-value

A
  • the probability of observing data at least as extreme as ours, assuming the null hypothesis is true

– The higher the p-value, the more consistent the data are with the null hypothesis
– The lower the p-value, the stronger the evidence against the null hypothesis (in favor of the alternative)

88
Q

differential gene expression

A

the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions

89
Q

RNA-seq pipeline

A
  1. read alignment: mapping reads to the genome
  2. quantification: measuring gene expression levels
  3. differential expression analysis: identifying key genes and comparing gene expression levels
  4. dimensionality reduction: visualizing complex data

*** quantifying gene expression levels using high-throughput sequencing technologies

generates count data

90
Q

count data

A

the number of RNA fragments that map to each gene

** generated for RNA-seq

91
Q

binomial distribution

A

models the number of successes in a fixed number of independent trials, where each trial has the same probability of success

92
Q

binomial distribution limitations

A

MAIN limitation → assumes that the probability of success is constant between samples

Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth.

Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not at all.

93
Q

Poisson distribution

A
  • a baseline for modeling discrete counts
  • statistical tool used to model the number of events(or counts) that happen in a fixed period of time or space, where:
    – The events are independent of each other
    – Each event has a constant average rate

***gives an accurate distribution of counts if mean and variance are equal
(AKA probability of observing the sequenced fragments)

94
Q

parity plots

A

show deviations in Poisson distribution (mean = variance line)

Higher counts typically have larger variance

95
Q

overdispersion

A

when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution)

– Expected variance for Poisson-distributed data equals the mean

Reflect biological variability between samples not captured by the experimental conditions
– Differences in RNA quality
– Sequencing depth
– Biological factors like different cell types within the same tissue

Negative Binomial distribution accounts for high dispersion
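
A rough sketch of checking for overdispersion, assuming the common negative binomial parameterization Var = μ + α·μ², where α is the dispersion. The method-of-moments estimate below is a simplification (real differential-expression tools estimate dispersion more carefully), and the replicate counts are invented.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments alpha in Var = mu + alpha * mu**2 (0 if not overdispersed)."""
    mu, var = mean(counts), variance(counts)
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical counts for one gene across four replicates
counts = [10, 20, 30, 40]
print(mean(counts), variance(counts), dispersion_estimate(counts))
```

For Poisson-distributed counts the variance would be close to the mean (25 here); a sample variance well above that gives α > 0, which is the overdispersion the negative binomial is meant to absorb.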

96
Q

zeros in RNA-seq data

A
  • RNA-seq data contains zero counts for some genes because not all genes are expressed under all conditions.
  • Most statistical models account for variance, but not that 0’s can dominate counts
    — zero-inflated models
97
Q

RNA-seq data is SO MESSY

A

Need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way

98
Q

negative binomial

A
  • gives an estimate of the log fold change
  • also gives a standard error that quantifies how uncertain that estimate is
99
Q

structural biology

A

SB determines the 3D shapes of biological macromolecules and how these shapes relate to function

Primary Goal: to understand how molecular machines in cells work by deciphering their atomic arrangements

100
Q

structural biology challenges

A
  • technical limitations
  • biological complexity
  • resource constraints
101
Q

covalent bonds

A
  • the framework of biomolecules
  • Formed when atoms share pairs of electrons that hold molecules together
102
Q

Relevant characteristics of covalent bonds →

A

Strength and stability: covalent bonds provide the necessary stability for complex biological structures

Directionality: covalent bonds limit the specific angles and orientations leading to the 3D shapes of biomolecules

103
Q

Non-covalent

A
  • dynamic glue
  • weaker than covalent bonds and involve electrostatics
104
Q

single vs double/triple bonds

A

– Single bonds: allow rotation, contributing to molecular flexibility
– Double/triple bonds: restrict rotation, affecting rigidity/molecule function

105
Q

Noncovalent interactions drive most of biology →

A

Molecular recognition →
– Enzyme-substrate binding
– Antigen-Antibody interactions
Macromolecular structure →
– Membrane formation
– Protein-protein interactions
– Base pairing in DNA and RNA
– Protein folding

106
Q

primary structure

A

linear sequence of amino acids, held together by peptide bonds
– dictates how the protein will fold into higher-order structures
– does not reveal the protein’s functional form or activity alone
– may also depend on cellular factors (e.g. chaperones)

107
Q

Secondary structure →

A

local conformations of the polypeptide chain

*** stabilized primarily by hydrogen bonds
– Structural motifs are critical for certain functions (alpha helices/β-pleated sheets)
– undergo local fluctuations adding to functional flexibility (unwind/twist)

108
Q

Tertiary structure →

A

Refers to the complete 3D shape of a single polypeptide chain
– Predicting how a sequence folds into its tertiary structure is complex
– Reveal active sites/binding pocket

109
Q

X-ray Crystallography →

A
  • electrons mix into different molecular orbitals at characteristic energy levels
  • produces an e- density distribution that is unique to that structure
  • Probe: photon (carrier of electromagnetic radiation)
  • Basic Principle: photons scatter when they interact with atoms
    — The scattered X-rays form a diffraction pattern unique to the crystal
    — Constructive interference is needed to amplify signal for detectors
110
Q

diffraction patterns

A
  • The spots on the detector represent the reflections of the scattered X-rays
  • Intensity of the spots reflects the electron density in the crystal
  • Position and angle of the spots correspond to the geometry

*** The diffraction pattern does not directly show the atomic positions but provides the data needed to infer the electron density

111
Q

Building the electron density map →

A
  • Reveals the distribution of e- in the crystal, indicating where atoms are located
  • The electron density map is interpreted by fitting atomic models (e.g. amino acids for proteins) into density
  • Low-resolution data make it difficult to assign atomic positions precisely, leading to uncertainty in the model
112
Q

why crystals?

A
  • Crystals have the same repeating unit cell, which amplifies our signals
  • If in solution, particles would be:
  • Too sparse to diffract
  • Moving and diffraction pattern would constantly change
113
Q

Challenges in X-ray Crystallography →

A
  • Flexible or disordered regions do not pack into crystals well, often leading to failure in obtaining high-quality crystals
  • flexible/disordered regions do not show up clearly in the electron density map
  • Crystals capture a single conformation of the molecule, often ignoring the flexibility or dynamic range
114
Q

Cryo-Electron Microscopy

A

a beam of high-energy electrons is used instead of photons
– High-energy electrons have a much shorter wavelength than X-ray photons
– Electrons scatter off the sample more strongly than X-rays

No crystals: the sample is rapidly frozen in vitreous ice to preserve its native structure
– imaged in their native hydrated state.
– can capture multiple conformations

uses SPA

115
Q

Single particle analysis (SPA)

A
  • main Cryo-EM technique used to determine the 3d structures of individual macromolecules
  • Millions of images of individual particles are collected from a thin layer
  • Particles are computationally aligned and classified into different orientations
    – 2D imaging, particle alignment and averaging, then computing the 3D structure from 2D projections
116
Q

Cryo-EM advantages and challenges

A

Advantage: ability to capture multiple conformational states of a molecule, providing insights into flexibility and structural heterogeneity

Challenge: highly flexible or disordered molecules may appear as fuzzy or low-resolution regions in the final structure

117
Q

IDPs

A

lack a stable 3D structure under physiological conditions but are still functional, often gaining structure upon binding to partners

  • structural techniques often require ordered/stable configurations
  • May appear fuzzy or have low-resolution in these regions

Fit force fields to experimental data of structured proteins BUT there is not a lot of data of IDPs

118
Q

Levinthal’s Paradox

A
  • Proteins can adopt a large number of possible conformations.

Levinthal’s Paradox: a protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly.
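
A rough worked estimate makes the paradox concrete. The chain length, number of states per residue, and sampling rate below are illustrative assumptions, not values from the course.

```python
# Back-of-the-envelope Levinthal estimate (all inputs are assumed for illustration).
n_residues = 100            # assumed chain length
states_per_residue = 3      # assumed number of conformations per residue
samples_per_second = 1e13   # assumed sampling rate (one conformation per 0.1 ps)

n_conformations = states_per_residue ** n_residues
seconds_needed = n_conformations / samples_per_second
years_needed = seconds_needed / (3600 * 24 * 365)

print(f"conformations: {n_conformations:.2e}")
print(f"years to sample them all: {years_needed:.2e}")
```

Even with these generous assumptions the exhaustive search takes on the order of 10^27 years, while real proteins typically fold in milliseconds to seconds, so folding cannot be a random search.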

119
Q

challenges of protein structure prediction

A
  1. large conformational space
  2. complex energy landscapes
  3. flexibility and dynamics
  4. environmental effects
  5. PTMs
  6. data driven methods
120
Q

potential energy surface (PES)

A
  • represents the energy of a system as a function of the positions of its atoms.
  • Describes how the system’s E changes upon reactions or movements
  • Proteins fold to lowest free-E state, but this landscape is highly rugged.
  • Energy calculations are computationally expensive and depend on accurate force fields.
    – Lots of potential E minima (so many conformations need to be tried)
    – Multiple minima may be similar but can be far apart in a conformational space
121
Q

homology modeling

A
  • predicts protein structures based on evolutionary relationships
  • the main principle is that proteins with similar sequences tend to fold into similar structures

*** most accurate when sequence identity to other proteins is high (>30%) – across full protein length
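
A minimal sketch of how sequence identity can be computed from a pairwise alignment. The two aligned sequences are made up, and the choice to skip gapped columns is one convention among several (an assumption here).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical target/template alignment.
target = "MKT-AYIAKQR"
template = "MKSLAYLAK-R"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity")
```

Here 7 of the 9 ungapped columns match, so this toy pair sits well above the 30% threshold on the card.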

122
Q

threading

A
  • In cases where sequence similarity to known structures is low (<30%), homology modeling becomes unreliable.
  • Matches sequences to known structural folds based on structural rather than sequence similarity
  • Used when remote homologs exist but their evolutionary relationship cannot be detected by sequence comparison alone.
123
Q

Hidden Markov Models (HMMs)

A
  • statistical models representing sequences using probabilities for matches/indels
  • HMMs model protein sequences as a series of probabilistic states

Hidden states: the underlying biological events that are not directly observable
Match states: conserved positions in the sequence
Insertion states: positions where extra residues are added
Deletion states: positions where residues are missing

124
Q

contact map

A
  • 2D representation of which residues are in close proximity (residue interactions in proteins)
  • represent spatial proximity, not sequence order
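
A minimal sketch of building a contact map from coordinates. The coordinates are invented, and using one point per residue with an 8 Å cutoff is a common convention assumed here, not something the card specifies.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return an N x N matrix with 1 where two residues lie within `cutoff` angstroms."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical per-residue coordinates (angstroms); residue 4 folds back toward residue 0.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (3.8, 5.0, 0.0),
]
cmap = contact_map(ca_coords)
for row in cmap:
    print(row)
```

Residues 0 and 4 are far apart in sequence but close in space, so they show up as a contact, while residues 0 and 3 do not.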
125
Q

coevolutionary analysis

A
  • Coevolving residues mutate in a correlated manner.
  • Mutations in one residue often result in compensatory mutations in its interacting partner
  • observed across species through analysis of homologous protein sequences
  • Correlated mutations indicate functionally significant residue pairs

*** helps to predict which residues are close in the 3D structure (useful when there is no experimental structure available)

126
Q

how is coevolution detected?

A
  • Coevolution is detected using large MSAs from homologous proteins.
  • The more diverse the sequences in the MSA, the better the resolution of coevolving residues.
  • Evolutionary info from MSAs guides predictions for residue-residue contacts.
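
One simple way to score correlated mutations between two MSA columns is mutual information. That choice of measure and the four-sequence alignment below are assumptions for illustration; practical methods use far larger alignments and corrections for noise.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical MSA: columns 0 and 1 change together, column 3 varies independently.
msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
mi_coupled = mutual_information(msa, 0, 1)
mi_independent = mutual_information(msa, 0, 3)
print(mi_coupled, mi_independent)
```

The coupled pair scores 1 bit and the independent pair scores 0, which is the kind of signal used to guess residue-residue contacts.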
127
Q

co-evolution signals

A
  • Co-evolution signals can be noisy →
  • Noise from data can come from random mutations or insufficient evolutionary diversity.
  • Large and diverse sequence data sets are needed for reliable coevolution predictions.
128
Q

How does alpha fold work

A
  • ML predicts 3D structures from sequence data alone
  • AlphaFold predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets
    – Trains neural networks on large amounts of protein sequence and structural data
    – Neural networks analyze patterns in coevolution data and learn to recognize them

Machine learning leverages coevolution for high-accuracy predictions.
— incorporate evolutionary info along with structural features

*** Struggle with disordered proteins

129
Q

Molecular dynamics (MD) simulations

A
  • Protein structure determination and prediction provide fixed snapshots
  • Understanding the motions of the proteins is important
  • MD simulations →
    Provide time-resolved insights into protein behavior
    1) 3D coordinates of atoms
    2) Atoms exert forces
    3) Use Newton’s laws to predict movement
130
Q

time steps

A

Smaller time steps ⇒ more accurate, but more calculations

The time step must be smaller than the shortest vibrational period to accurately capture atomic motions.
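
A small worked example of where a typical time step comes from. The ~3000 cm^-1 wavenumber for a C-H stretch and the "about 10 steps per period" rule of thumb are assumptions added here.

```python
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10

def vibrational_period_fs(wavenumber_per_cm: float) -> float:
    """Period of a vibration (in femtoseconds) from its wavenumber in cm^-1."""
    frequency_hz = SPEED_OF_LIGHT_CM_PER_S * wavenumber_per_cm
    return 1e15 / frequency_hz

# Assumed fastest motion: a C-H bond stretch near 3000 cm^-1.
ch_stretch_period = vibrational_period_fs(3000.0)
# Assumed rule of thumb: resolve the fastest period with roughly 10 steps.
suggested_step = ch_stretch_period / 10

print(f"period: {ch_stretch_period:.1f} fs, suggested step: {suggested_step:.1f} fs")
```

Under these assumptions the fastest period is roughly 11 fs, which puts the time step near 1 fs.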

131
Q

Simulation of Atomic Movement

A
  • computes trajectories of atoms over time scales of femtoseconds to microseconds.
  • capture both small-scale vibrations and large-scale conformational changes.
  • Provides detailed information on atomic interactions and energy changes.
  • Enables the study of mechanisms at an atomic level

***CLASSICAL MECHANICS

132
Q

why are MD simulations beneficial?

A
  • More realistic analysis of proteins (dynamic vs static)
  • Refines predicted structures
  • Minimizes E
  • Accounts for environmental effects for improved accuracy
  • Studies IDPs
  • Captures the flexible nature of disordered regions
  • Aids in understanding functions that rely on disorder
  • Identifies folding intermediates and misfolding mechanisms
133
Q

classical mechanics

A
  • Describes the motion of macroscopic objects
  • Assumes particles have well-defined positions and velocities
  • Governed by Newton’s Laws of Motion
134
Q

quantum mechanics

A
  • Necessary for describing behavior at atomic and subatomic scales
  • Accounts for wave-particle duality, uncertainty principle, proton tunneling
  • E- exhibit quantum behavior that can’t be captured classically
135
Q

electrons in MD simulations

A
  • neglect quantum effects
  • Electrons are not simulated in MD
  • Effect is included implicitly through potential E functions (force fields)
136
Q

suitable systems for MD simulations

A
  • Biological macromolecules (protein, nucleic acids, lipids)
  • Materials where electronic excitations are not critical.
  • Processes where bond breaking/forming does not occur.
137
Q

limitations of MD simulations

A
  • cannot accurately simulate chemical reactions w/ electronic transitions.
  • Quantum effects such as tunneling and zero-point energy are not captured.
138
Q

why classical particles?

A
  • It is very expensive to simulate large systems of atoms using quantum mechanics
  • Detailed electronic structure is not as important (faster, cheaper calculations)

*** As a result, no electronic interactions/movements can be captured with MD

139
Q

newton’s 2nd law

A

the acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass.

140
Q

forces

A
  • Forces are negative gradients of potential energy
  • Thus energy gradients can be used to determine the acceleration and therefore motion of the atoms (velocity)
141
Q

predicting the spatial position of each atom

A
  • Time evolution of a system: computed by integrating equations of motion
  • Determine force, move forward in time, repeat

– Approximates the continuous equations of motion using discrete time steps
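
The "determine force, move forward in time, repeat" loop can be sketched with the velocity Verlet integrator. The integrator choice, the 1D spring potential, and all parameter values are assumptions for illustration.

```python
def potential(x, k=1.0):
    """Toy 1D spring potential U(x) = 0.5 * k * x^2 (assumed for illustration)."""
    return 0.5 * k * x * x

def force(x, k=1.0):
    """Force is the negative gradient of the potential: F = -dU/dx = -k * x."""
    return -k * x

def velocity_verlet(x, v, mass, dt, n_steps):
    """Integrate Newton's equations of motion and return the (x, v) trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity from current force
        x = x + dt * v_half                # move forward in time
        f = force(x)                       # recompute force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append((x, v))
    return trajectory

def total_energy(x, v, mass=1.0):
    return potential(x) + 0.5 * mass * v * v

trajectory = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
print(f"energy drift: {e_end - e_start:.2e}")
```

With a small enough time step the total energy stays nearly constant, which is a quick sanity check on any integrator.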

142
Q

molecular dynamics are a combination of:

A

Molecular dynamics are a combination of: bond lengths, angles, and dihedral angles
– Bonds behave like springs (resists changes in distances between 2 atoms)
– Angles behave like harmonic oscillators (hinge with central atom)
– Dihedral angle (angle between 2 planes formed by 4 bonded atoms; describes the rotations of the bond between 2 atoms) - not like springs

Approximate bond vibrations and angles as harmonic oscillators

Spring constants are determined by bond order (single, double, triple)

143
Q

Bonds and Angles

A

govern local geometry (bond lengths and bond angles) using quadratic (harmonic) potentials that favor specific distances and angles
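
A minimal sketch of a harmonic bond term. The equilibrium length and spring constant are hypothetical values, and the factor of 1/2 is a convention that differs between force fields.

```python
def harmonic_energy(r, r0, k):
    """Quadratic bond energy: 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2

def harmonic_force(r, r0, k):
    """Restoring force along the bond: -dE/dr = -k * (r - r0)."""
    return -k * (r - r0)

# Hypothetical parameters: r0 in angstroms, k in kcal/(mol * angstrom^2).
r0, k = 1.53, 600.0
for r in (1.43, 1.53, 1.63):
    print(r, harmonic_energy(r, r0, k), harmonic_force(r, r0, k))
```

The energy is zero at the preferred distance and rises symmetrically on either side; the force always points back toward r0, like a spring. The same form applies to angles with an equilibrium angle in place of r0.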

144
Q

Dihedrals

A
  • govern torsional or rotational flexibility around bonds, typically using periodic and multi-well potentials to allow for multiple stable conformations.
  • Dihedral potentials must capture arbitrary functions with rotational symmetry.
145
Q

Fourier series

A
  • Modeling dihedral potentials → Fourier series
    – Approximate functions as sum of sine and cosine waves
    — More sine/cosine terms improves approximations
    – Can approximate (any) symmetrical rotational energy function.
    *** equation to go from structure to energy (instead of using QM)
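
A minimal sketch of a dihedral potential written as a short cosine series. The functional form, (V_n / 2) * (1 + cos(n * phi - gamma)), is a common one in biomolecular force fields; the single threefold term used below is a hypothetical parameter set.

```python
import math

def dihedral_energy(phi_deg, terms):
    """Sum of cosine terms; each term is (barrier V_n, periodicity n, phase gamma in degrees)."""
    phi = math.radians(phi_deg)
    return sum(
        0.5 * v_n * (1.0 + math.cos(n * phi - math.radians(gamma_deg)))
        for v_n, n, gamma_deg in terms
    )

# Hypothetical threefold term (three equivalent wells per full rotation).
terms = [(2.0, 3, 0.0)]
for phi in (0, 60, 120, 180):
    print(phi, round(dihedral_energy(phi, terms), 6))
```

The energy repeats every 120 degrees, giving the multiple stable conformations that a single harmonic well cannot describe. Adding more terms to the list reshapes the curve.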
146
Q

force fields

A
  • compute energies and atomic forces
  • Setting the force field depends on system type, accuracy/speed, and compatibility
  • Bonded and non-bonded interactions make up the complete force field
147
Q

topology files

A
  • tell the program which force field parameters to use and where
  • define the molecular structure and interactions in the simulations
  • contains info on atom types, bonds, angles, dihedrals, and non-bonded interactions based on the chosen force field
148
Q

force field parameterization stepwise

A

Quantum Mechanical Calculations: obtain high-accuracy data for small molecules and representative fragments

Empirical Data Integration: incorporate experimental measurements to validate and refine parameters

Parameter Optimization: adjust force field parameters through iterative simulations and comparisons

Advanced Techniques: utilize machine learning, multi-scale modeling, and automated pipelines to enhance parameter accuracy and efficiency

149
Q

scenarios in which quantum mechanical effects cannot be ignored

A

Interactions that depend explicitly on electrons (more tractable for smaller molecules)
– Wave-particle duality, particle tunneling
– Reactions involving electron transfer

150
Q

importance of noncovalent interactions in molecular recognition

A
  • Crucial for simulating multiple molecules in MD
  • Facilitate organization of molecules into structures
  • Determine macroscopic props (MP, BP, solubility)
  • Govern biological functions (enzyme binding, protein folding)
    *** covalent interactions define primary structure while noncovalent interactions dictate how molecules interact
151
Q

Dual nature of non-covalent interactions →

A
  1. Dispersion forces: weak, attractive forces arising from instantaneous dipoles
    Stabilize molecular assemblies by promoting close packing
  2. Repulsion forces: strong, short-range forces due to overlapping e- clouds
    Prevents atoms from collapsing into each other; maintains atom integrity
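
One common way to combine both terms is the Lennard-Jones 12-6 potential. The card does not name a functional form, so this choice and the parameter values are assumptions.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 potential: the r^-12 term models repulsion, the r^-6 term models dispersion."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# Hypothetical parameters: well depth in kcal/mol, sigma in angstroms.
epsilon, sigma = 0.2, 3.4
r_min = 2 ** (1 / 6) * sigma  # separation at the bottom of the well

for r in (0.8 * sigma, sigma, r_min, 2.0 * sigma):
    print(f"r = {r:.2f}  E = {lennard_jones(r, epsilon, sigma):.4f}")
```

Below sigma the energy climbs steeply (repulsion), at r_min it reaches -epsilon (best packing), and at long range it decays toward zero (weak attraction).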
152
Q

criteria for selecting high-quality experimental structures for simulations.

A

Need a good structure before starting any molecular simulation (equilibrium)

Avoid low quality, high-energy conformations with missing/wrong co-factors

  1. Resolution: how well the atomic positions are determined
    Resolution below 2.0 Å is generally preferred for high-quality simulations
  2. Completeness: Flexible loops/disordered regions are often missing from the structure
    No missing residues
  3. Functional State: Proteins can exist in different functional conformations: active vs inactive state, bound to ligands or unbound
  4. B-factors: Higher B-factors suggest more uncertainty in atom positions, which might make that part of the structure less reliable
153
Q

strategies for adding missing residues or atoms in protein models prior to conducting MD simulations.

A
  • Missing atoms or residues can be added using modeling software like MODELLER
  • Homology modeling, structure prediction software, MD simulations

Unwanted components like ligands or non-essential ions should be removed
ligands, ions, or crystallization agents that are not physiologically relevant

Distorts protein’s behavior in a simulated biological environment if not removed

154
Q

Describe the steps involved in protein preparation for simulations.

A
  1. Add missing residues
  2. Remove unwanted ligands, non-essential ions that distort protein behavior
  3. Correct protonation state (pH-sensitive residues)
  4. Energy minimization for sterics (adjust structure to remove unfavorable atom position and steric clashes that cause instability)
  5. Assign force field parameters
  6. Stabilize temp and pressure for MD
155
Q

Evaluating the suitability of a protein structure for simulations

A

Completeness: Check for missing residues, loops, or side chains; incomplete regions may need modeling to avoid simulation artifacts.

Functional State: Ensure the protein’s conformational state (e.g., active, inactive) aligns with the simulation goals; incorrect states could yield irrelevant results.

Clash Scores: Assess clash scores (using tools like MolProbity) to identify steric issues; high clash scores indicate steric conflicts that should be resolved with energy minimization or correction.

Resolution and R-Factors: Higher resolution (<2.0 Å) and low R-factors indicate greater structural accuracy, while poor values suggest potential inaccuracies.

156
Q

periodic boundary conditions (PBC) in molecular simulations

A

*** Realistic systems do not have walls

  • For simulations, need to apply force to keep molecules in the box
    – H2O and proteins would bounce off these walls in an unphysical manner (edge effects)

PBC simulates infinite systems from a finite box
— Virtually place exact copies of system in all directions
— Atoms that cross the box edge reappear on the other side (no edge effects) → like pacman

  • uses Minimum image convention (MIC)
157
Q

Minimum image convention (MIC)

A

ensures that an atom in the primary box only interacts with the closest image of another atom

Image atoms in adjacent boxes are used to calculate interactions across the boundaries (ensures correct interactions)
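
A minimal sketch of wrapping coordinates and applying the minimum image convention. It assumes a rectangular (orthorhombic) box; the box size and atom positions are made up.

```python
import math

def wrap(x, box_length):
    """Put a coordinate back into [0, box_length), like Pac-Man crossing the screen edge."""
    return x % box_length

def minimum_image_delta(xi, xj, box_length):
    """Displacement from i to the closest periodic image of j along one axis."""
    delta = xj - xi
    return delta - box_length * round(delta / box_length)

def minimum_image_distance(a, b, box):
    """Distance between a and the nearest image of b in a rectangular box."""
    return math.sqrt(
        sum(minimum_image_delta(ai, bi, length) ** 2 for ai, bi, length in zip(a, b, box))
    )

box = (10.0, 10.0, 10.0)
atom_a = (0.5, 5.0, 5.0)
atom_b = (9.5, 5.0, 5.0)
distance = minimum_image_distance(atom_a, atom_b, box)
print(distance)
```

The two atoms are 9 apart inside the box but only 1 apart through the boundary, and the convention picks the shorter separation.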

158
Q

Macrostate

A

specifies the temp, pressure, volume, and number of particles of molecular systems

*** infinite number of macrostates

  • Large-scale system that defines properties of molecular system
    (temp, pressure, vol) → changing these changes macrostate
  • Encompasses all microstates that share the same properties
159
Q

ensemble

A
  • the collection of all possible microstates of a single macrostate
  • Perfect/accurate ensemble averages require sampling every possible configuration
  • Macrostate observables are ensemble averages
160
Q

Ensemble examples

A

Microcanonical Ensemble (NVE) →
- Fixed number of particles (N)
- Volume (V)
- Energy (E)
Canonical Ensemble (NVT) →
- Fixed number of particles (N)
- Volume (V)
- Temperature (T)
Isothermal-Isobaric Ensemble (NPT) → most common
- Fixed number of particles (N)
- Pressure (P)
- Temperature (T)

161
Q

energy in ensembles

A
  • The instantaneous temperature of microstates will fluctuate, but the ensemble average should be constant

There should be no net flow of energy!!!

162
Q

microstate

A

a unique configuration defined by the positions and velocities of all particles
specific, detailed configuration of a system at the molecular level

  • Indicates exact position and energy of a particle
  • Multiple microstates can have the same distance (use mean of them)
163
Q

importance of adequately sampling microstates in MD simulations

A
  • needed to compute reliable ensemble averages
  • Longer simulations provide better sampling of microstates and their probabilities

1) Statistical accuracy, 2) Covering all conformational shifts, 3) Thermodynamic quantities

164
Q

thermostats

A

adjust the velocities of particles to increase or decrease the system’s kinetic energy → thereby controlling the temperature

165
Q

Berendsen thermostat

A

adjusts the velocities of all particles uniformly based on the current temperature and target temperature

  • Scales current velocity based on temp deviation
  • Prevents abrupt changes that could destabilize the simulation
  • Simple velocity scaling does not generate a true canonical (NVT) ensemble; it cannot reproduce realistic temperature fluctuations

^^^inaccurately models thermal energy transfer via particle collisions
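
A minimal sketch of one Berendsen scaling step, using the usual scaling factor sqrt(1 + (dt / tau) * (T_target / T - 1)). The 1D toy system, reduced units (k_B = 1), and parameter values are assumptions.

```python
import math

def temperature(velocities, masses, k_b=1.0):
    """Instantaneous temperature of a 1D toy system in reduced units."""
    twice_kinetic = sum(m * v * v for m, v in zip(masses, velocities))
    return twice_kinetic / (len(velocities) * k_b)

def berendsen_scale(t_current, t_target, dt, tau):
    """Velocity scaling factor; tau is the coupling time constant."""
    return math.sqrt(1.0 + (dt / tau) * (t_target / t_current - 1.0))

velocities = [1.0, -2.0, 0.5, 1.5]
masses = [1.0, 1.0, 1.0, 1.0]
t_target, dt, tau = 1.0, 0.002, 0.1

t_before = temperature(velocities, masses)
scale = berendsen_scale(t_before, t_target, dt, tau)
velocities = [scale * v for v in velocities]  # every particle is scaled by the same factor
t_after = temperature(velocities, masses)
print(t_before, t_after)
```

Each step closes a fraction dt / tau of the gap to the target, so the temperature relaxes smoothly instead of jumping. A larger tau means weaker coupling.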

166
Q

Nose-Hoover thermostat

A

connect particle momenta to fictitious heat bath

  • Momenta scaling provides realistic kinetic energy and thus temperature control
  • Heat bath allows thermal energy to flow in and out of our simulation

Q ⇒ a “mass” coupling parameter controls thermostat responsiveness

167
Q

barostats

A

maintain desired pressure during simulations

  • Adjusts volume of simulation box to achieve and maintain target pressure
  • Pressure is proportional to density and temperature
  • Scales box volume based on pressure difference to target
    *** all help to keep a consistent macrostate!
168
Q

Ensemble averages improve …

A

improve with more simulation time by sampling more microstates

– Many short simulations are better than 1 long one!!!
Random initial velocities provide a better chance of sampling different microstates
– Initial velocities send the simulation in one direction on the potential E surface; there is a chance it never samples a certain minimum (multiple simulations with random velocities reduce that chance)

*** discard initial relaxation as it is not our desired microstate

169
Q

Equilibration (Relaxation) Phase

A

Purpose: To allow the system to relax and reach a stable, equilibrated state after initial setup.

Process: Temperature, pressure, and density are gradually stabilized, with constraints often applied to avoid abrupt movements.

Goal: Achieve a realistic starting conformation that reflects the desired ensemble

170
Q

Production (Data Collection) Phase

A

Purpose: To collect data on the system’s behavior for analysis of properties like energy, structure, and dynamics.

Process: Constraints are usually removed, and the system is allowed to evolve naturally.

Goal: Gather accurate ensemble averages and insights into properties for the equilibrated system over time.

171
Q

Root Mean Square Deviation (RMSD)

A

measures deviation in structure over time (what conditions allow conformations to change faster)

  • Overall change in structure during simulation, tracks deviations from starting conformations (global conformational changes)

***Low is good: close to reference structure

***High indicates significant deviation and large structure changes over time
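
A minimal sketch of an RMSD calculation. It assumes the frame has already been superimposed on the reference (no alignment step is done here), and the coordinates are invented.

```python
import math

def rmsd(coords, reference):
    """Root mean square deviation between two equally sized coordinate lists."""
    if len(coords) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        (x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
        for (x, y, z), (xr, yr, zr) in zip(coords, reference)
    )
    return math.sqrt(squared / len(coords))

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 2.0, 0.0)]
print(rmsd(frame, reference))
```

Computing this value for every frame against the starting structure gives the RMSD-versus-time curve described on the card.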

172
Q

Root Mean Square Fluctuation (RMSF)

A

How much does amino acid position change; Identifies regions of flexibility in the protein by calculating fluctuation of each atom (tracks local flexibility) about a mean position

– How does it change around its mean, not relative to ref structure

***Low: atom is fixed in place (well-ordered)

***High: fluctuates a lot (flexibility)
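
A minimal sketch of a per-atom RMSF calculation. The two-frame trajectory is invented, and frames are assumed to be already aligned.

```python
import math

def rmsf(trajectory):
    """Per-atom fluctuation about each atom's own mean position.

    `trajectory` is a list of frames; each frame is a list of (x, y, z) tuples.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for atom in range(n_atoms):
        positions = [frame[atom] for frame in trajectory]
        mean = [sum(p[d] for p in positions) / n_frames for d in range(3)]
        mean_sq = sum(
            sum((p[d] - mean[d]) ** 2 for d in range(3)) for p in positions
        ) / n_frames
        result.append(math.sqrt(mean_sq))
    return result

# Hypothetical trajectory: atom 0 stays fixed, atom 1 swings between x = 4 and x = 6.
trajectory = [
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
]
fluctuations = rmsf(trajectory)
print(fluctuations)
```

Unlike RMSD, the reference here is each atom's mean position, so the output is one value per atom rather than one value per frame.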

173
Q

relationship between energy and probability in molecular simulations

A

Minima ⇒ concave part in energy diagram
- Must overcome energy barrier to make the conformations (TSs can be higher)
– Most preferred would be the lowest energy state

***The probability curve mirrors the energy curve: low-energy states are the most probable, high-energy states the least probable

  • How much energy is needed to go between conformations, or the energy of binding.

*** without minimization, high-energy configurations may lead to bad results in MD simulations (removes unfavorable atom positions / sterics)
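
The link between energy and probability is usually written with Boltzmann weights, p_i proportional to exp(-E_i / kT). The three state energies below are hypothetical values.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_probabilities(energies, temperature=300.0):
    """Normalized Boltzmann probabilities for a list of state energies (kcal/mol)."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]

# Hypothetical conformations with relative energies in kcal/mol.
energies = [0.0, 1.0, 3.0]
probabilities = boltzmann_probabilities(energies)
for e, p in zip(energies, probabilities):
    print(f"E = {e:.1f}  p = {p:.3f}")
```

Because kT is only about 0.6 kcal/mol at 300 K, a few kcal/mol of extra energy makes a state rare, which is why the lowest minimum dominates.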

174
Q

stages of the drug discovery pipeline

A

Discovery and Preclinical Research (***Computation is most helpful with this stage)
– Potential drugs are identified and tested in non-human studies

Clinical Trials
– Testing in human subjects to assess safety and efficacy

Regulatory Approval
– Evaluation by agencies like the FDA before the drug can be marketed

Post-Marketing Surveillance
– Ongoing monitoring after the drug is available to the public

175
Q

key factors in selecting protein targets for drug design

A

Disease Relevance: the protein plays a critical role in the disease mechanism

Druggability: target has a structure that allows it to bind with drug-like molecules

Specificity: Targeting the protein minimizes effects on healthy cells, reducing side effects

176
Q

virtual screening

A

narrows down potential compounds from large chemical libraries during drug discovery.

  • High-throughput screening (HTS) tests compounds against the target protein
  • Experimental assays are still expensive, and limited to commercially available compounds
    – use computational methods to predict which compounds we should experimentally validate

*** SO MUCH FASTER TO NARROW DOWN SEARCH SPACE

177
Q

gibbs free energy

A

Energy is released/taken up during binding → spontaneity of binding
– Combines enthalpy and entropy

***Simulations capture free energy directly instead of treating enthalpy and entropy separately

BINDING STRENGTH AND STABILITY AND AFFINITY

178
Q

enthalpy

A

energetic interactions (sum of contributions provide ensemble avg)

  • Accounts for non covalent interactions (electrostatics, H-bonds, dipoles, pi-pi)
  • Ensemble differences in non covalent interactions provide binding enthalpy
179
Q

Electrostatic forces role in binding

A

(strongest force)
Long range interaction, anchor points (~5 to 20 kcal/mol per interaction)

Charged molecules have a net imbalance between

net electrostatic attractions or repulsions between different atoms or molecules

180
Q

h bond role in binding

A

Specificity/orientation, stabilization, dynamic, strongest when hydrogen, donor, and acceptor atoms are collinear

Attraction between a (donor) hydrogen atom covalently bonded to an electronegative atom and another (acceptor) electronegative atom with a lone pair

181
Q

Uneven e- distribution role in binding

A

Directional binding for proper ligand alignment, flexibility (weak)

creates partial charges and dipoles

Electronegativity differences lead to unequal distribution of electron density

Consistent electron density spatial variation results in permanent dipoles

182
Q

VDW role in binding

A

Maximizes surface contact (complementary fit); flexibility

Dispersion: Electrons in molecules are constantly moving, leading to temporary uneven distributions that induce dipoles in neighboring molecules

Induction: The electric field of a polar molecule distorts the electron cloud of a nonpolar molecule, creating a temporary dipole

183
Q

Pi-pi interactions role in binding

A

Involve stacking of aromatic rings

Orientation of aromatics, selectivity

Noncovalent interactions between aromatic rings due to overlap of pi-electron clouds

184
Q

entropy

A

how much conformational flexibility changes
(Energy dispersion)

*** Higher entropy ⇒ greater microstate diversity for a given macrostate

  • Accounts for microstate diversity of a single macrostate

Can increase/decrease/remain the same depending on ligand concentration

185
Q

purpose of alchemical free energy simulations

A

Slowly disappear interactions to see the lowest energy conformations to get insight into binding affinities

Gets energy difference between conformations

186
Q

how alchemical simulations work

A

Compute the binding free energy somewhere in solution and then bind to protein

Slowly disappear the ligand (1 = normal interaction, 0 = no interactions)
– More relevant conformational sampling
– Run independent simulations in parallel
– Focuses on taking difference w smaller numbers

integrate over these small free energy changes (turn interaction on and off)
– Use this to relatively calculate the free energy difference between bound and unbound states
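
The "integrate over small free energy changes" step can be sketched with thermodynamic integration, one common estimator (the card does not name one, so this is an assumption). The two linear dU/dlambda profiles stand in for averages that would come from real simulations at each lambda window; they are hypothetical, and restraint and standard-state corrections are ignored.

```python
def trapezoid(xs, ys):
    """Trapezoidal-rule integral of y over x."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)
    )

def du_dlambda_complex(lam):
    """Hypothetical <dU/dlambda> for the ligand in the protein (kcal/mol)."""
    return -14.0 + 4.0 * lam

def du_dlambda_solvent(lam):
    """Hypothetical <dU/dlambda> for the ligand alone in solution (kcal/mol)."""
    return -10.0 + 4.0 * lam

# lambda = 0: no ligand interactions, lambda = 1: normal interactions.
lambdas = [i / 10 for i in range(11)]
dg_complex = trapezoid(lambdas, [du_dlambda_complex(lam) for lam in lambdas])
dg_solvent = trapezoid(lambdas, [du_dlambda_solvent(lam) for lam in lambdas])
dg_bind = dg_complex - dg_solvent
print(dg_complex, dg_solvent, dg_bind)
```

Each lambda window is an independent simulation, which is why the windows can run in parallel, and the final answer is a difference between two moderate numbers rather than two huge total energies.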

187
Q

why alchemical simulations not ideal

A

VERY EXPENSIVE; use docking first to screen molecules, then compute energies

  • captures all atomistic forces
  • wide range of conformation sampling
  • specific parameters
188
Q

docking

A

Avoid sampling all microstates and determine one “optimal” protein-ligand structure (rigid structure)

– using this bound structure, predict a “score” that is correlated to binding affinity
– Simplifies binding free energy prediction to enhance speed
–Ligand is not guaranteed to fit perfectly

189
Q

the importance of choosing an appropriate protein conformation

A

Protein-ligand interactions are highly-dependent on the protein’s 3D structure

Using an inappropriate protein conformation can lead to inaccurate docking results

**ONLY PICKING ONE STRUCTURE
– Challenge bc proteins are dynamic

190
Q

challenge for docking

A
  1. Conformational Flexibility (proteins are not rigid structures and experience movement side to side)
  2. Binding sites can change
  3. Limited experimental structures (not all relevant states may be covered)
191
Q

Experimental Structure Selection Criteria → Docking

A

Resolution and Quality
Ligand-Bound vs. Unbound Structures
Relevance to Target Ligand

192
Q

water molecules in docking

A

Role in binding: structured water molecules can mediate interactions between the protein and ligand

Inclusion Criteria: retain water molecules that are conserved across multiple crystal structures

Handling water in docking →
Some docking programs allow explicit water molecules in the binding site
Alternatively, consider their effect implicitly in scoring functions

193
Q

convex vs concave regions

A

Convex Regions: Typically inaccessible to ligands.
Concave Regions (Cavities): Potential binding sites.

194
Q

grid-based binding pocket detection

A

Grid-based: puts a protein on a grid, looks for protein vs no protein within grid

If there is no protein in a space, there is likely a pocket there (concave space)

Macrostate remains constant

195
Q

alpha shape theory

A

alpha spheres touch a certain number of atoms (3 atoms only); no spheres can be placed outside the protein

– Shows pockets based on how many spheres touch (spheres placed in open spaces are grouped and indicated as a pocket)
– uses Delaunay triangulation and alpha complexes to define cavities

196
Q

3 binding pocket classifications

A
  1. Orthosteric: primary active site where ligands bind
  2. Allosteric: secondary sites that modulate protein function upon ligand binding
  3. Cryptic: binding pockets not apparent in the unbound protein structure but form upon ligand binding or conformational change
197
Q

cryptic pockets

A

binding pockets not apparent in the unbound protein structure but form upon ligand binding or conformational change
– hard for MD simulations to detect
– must use enhanced MD methods, and apply pocket detection to multiple conformations

198
Q

process of pose optimization in docking to optimize ligand positions within the binding site for accurate binding affinity prediction

A

*** Accurate docking depends on optimized ligand poses (binding affinity)

  1. Initial ligand placement
  2. Scoring function evaluation of binding affinity
  3. Pose adjustment/optimization (move around ligand and protein residues to achieve best fit) + energy minimization
  4. Rescore and rank in terms of best pose
  5. Final pose choice
199
Q

Search strategies for docking →

A

systematic and stochastic

200
Q

systematic search

A

numerically iterate over all possible conformations
– Only possible for very small molecules, not used very often
– ID important degrees of freedom
– Remove structures with high strain
*****ALMOST NEVER DO

201
Q

stochastic searches

A

random sampling (Monte Carlo)
– Provide better balance of sampling and cost
– See if energy change of new conformation is less than random

Steps →
- Generate conformation
- Compute energy change
- If the energy change is less than a random sample: make the move
- Repeat
***Allows us to sample efficiently!
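
A minimal sketch of the stochastic search loop using the Metropolis acceptance rule: downhill moves are always accepted, uphill moves are accepted with probability exp(-dE / kT). The 1D "pose" energy function and all parameters are hypothetical stand-ins for a real docking score.

```python
import math
import random

def energy(x):
    """Hypothetical 1D energy landscape with two wells (stand-in for a pose score)."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def metropolis_search(n_steps=5000, step_size=0.5, kt=0.2, seed=0):
    """Random-walk search that returns the best (x, energy) seen."""
    rng = random.Random(seed)
    x = 2.0
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)   # generate conformation
        e_new = energy(x_new)                            # compute energy change
        delta = e_new - e
        if delta <= 0 or rng.random() < math.exp(-delta / kt):
            x, e = x_new, e_new                          # make the move
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

best_x, best_e = metropolis_search()
print(best_x, best_e)
```

Occasionally accepting uphill moves is what lets the search climb out of a local minimum instead of getting stuck in the first well it finds.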

202
Q

Scoring function

A

are parameterized models to estimate binding affinity after docking
– Physics-based methods using force-field like methods
– Machine learning methods (graph neural networks) have been gaining traction recently

203
Q

Structure-Based Drug Design VS Ligand-Based Drug Design

A

Structure-Based Drug Design:
- Requires 3D structure of the target protein.
- Uses the binding site structure to model potential interactions.
- Often employs docking and molecular simulations.

Ligand-Based Drug Design:
- Requires no structural information of the target.
- Uses the chemical structure and activity of known ligands as guides.
- Relies on molecular similarity rather than direct binding predictions.

204
Q

Molecular weight

A

indicates the overall size of the molecule
– Impacts drug distribution and elimination rates in the body

205
Q

LogP

A

measures lipophilicity (chemical compound’s ability to dissolve in lipids, fats, oils, and non-polar solvents)
– Influences a molecule’s ability to cross cell membranes and affects absorption and bioavailability

206
Q

molar refractivity

A

relates to polarizability and electron cloud distribution
– Affecting intermolecular interactions and binding affinity

207
Q

TPSA

A

estimates the molecule’s ability to form hydrogen bonds
–impacting solubility and permeability across biological membranes

208
Q

number of rotatable bonds

A

reflects molecular flexibility
– influences binding affinity and oral bioavailability

209
Q

Extended connectivity fingerprints (ECFPs)

A

encode structural features into numerical representations
– Hash functions are used to encode chemical information
– Can be encoded into a bit array

210
Q

Tanimoto similarity

A

compares ECFPs between 2 molecules based on …

Molecular similarity: the concept that similar molecules often show similar biological effects.

Formula measures the ratio of the shared features to the total number of unique features between 2 molecules
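
The ratio described above can be sketched with sets of feature IDs. The two fingerprints are made-up examples, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Shared features divided by all unique features across both molecules."""
    union = features_a | features_b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(features_a & features_b) / len(union)

# Hypothetical feature IDs for two molecules.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
similarity = tanimoto(fp_a, fp_b)
print(similarity)
```

Two features are shared out of five unique ones, so the similarity is 0.4; identical fingerprints score 1.0 and disjoint ones score 0.0.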

211
Q

how molecular fingerprints are generated by hashing atom-specific properties

A
  • Numerical representation of the properties of the molecules
  • Choose a function, not a number → function turns into number
  • Generate a number consistently based on whatever input given for keeping track of molecules in the system
212
Q

hash function iterations

A

used to encode chemical info
For each iteration:
- incorporate hashes of atoms that are n bonds away
- Then encode atom IDs that exactly one bond away
- Repeat while hashing n-1 IDs

Each iteration encodes local chemical info into each atom’s ID

Similar features will share atom IDs until our iterations start incorporating new features (encodes multiple levels of info)
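
A toy version of the iterative hashing idea. Real ECFP implementations use richer atom properties and bond information; here each atom starts from only its element and number of neighbors, and the hash function, molecule encoding, and names are all assumptions for illustration.

```python
import hashlib

def stable_hash(obj) -> int:
    """Deterministic 32-bit hash of any printable object."""
    return int.from_bytes(hashlib.sha256(repr(obj).encode()).digest()[:4], "big")

def circular_fingerprint(elements, neighbors, n_iterations=2):
    """Collect atom IDs from every iteration into one feature set."""
    ids = [stable_hash((el, len(neighbors[i]))) for i, el in enumerate(elements)]
    features = set(ids)
    for _ in range(n_iterations):
        # New ID = hash of the atom's own ID plus its neighbors' IDs (one bond away).
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(elements))
        ]
        features.update(ids)
    return features

# Heavy-atom graphs: ethanol (C-C-O) and propane (C-C-C).
chain = {0: [1], 1: [0, 2], 2: [1]}
fp_ethanol = circular_fingerprint(["C", "C", "O"], chain)
fp_propane = circular_fingerprint(["C", "C", "C"], chain)
shared = fp_ethanol & fp_propane
print(len(fp_ethanol), len(fp_propane), len(shared))
```

The terminal CH3 end looks identical in both molecules for the first iterations, so those IDs are shared; once the growing neighborhood reaches the oxygen, the IDs diverge.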

213
Q

Bit arrays

A

fixed-length collections of ones and zeros

Allow for efficient operations

Encoded into bit array to store a collection of atom IDs
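
A minimal sketch of folding atom IDs into a fixed-length bit array and comparing two arrays with bitwise operations. The 16-bit length keeps the example readable (real fingerprints are much longer), and the feature IDs are made up.

```python
def fold_to_bits(feature_ids, n_bits=16):
    """Set bit (id mod n_bits) for every feature ID; different IDs can collide."""
    bits = [0] * n_bits
    for feature_id in feature_ids:
        bits[feature_id % n_bits] = 1
    return bits

def pack(bits):
    """Pack a list of 0/1 values into a single integer for fast bitwise operations."""
    return sum(bit << position for position, bit in enumerate(bits))

def tanimoto_packed(a, b):
    """Tanimoto similarity from two packed bit arrays."""
    shared = bin(a & b).count("1")
    total = bin(a | b).count("1")
    return shared / total if total else 0.0

bits_a = fold_to_bits({3, 19, 40, 1025})   # 3 and 19 collide on bit 3
bits_b = fold_to_bits({3, 40, 77})
similarity = tanimoto_packed(pack(bits_a), pack(bits_b))
print(bits_a, bits_b, similarity)
```

Fixed length makes every molecule comparable with a couple of AND/OR operations, at the cost of occasional collisions where two different features land on the same bit.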

214
Q

QSAR models

A

link chemical structure with biological activity

Predict the biological activity of molecules based on their structure to reduce the need for experimental screening (quick and cost-effective)

215
Q

challenges of efficiently exploring chemical space to find active compounds similar to known bioactive molecules

A

So many molecules to find and finite amount of time
– Chemical space is vast
– Diverse properties
– Exhausts computational resources
– Reliability of predictive models