Final Exam Flashcards
DNA sequencing set-up
- Start with bacterial culture for product of interest
- Separate cells from media via centrifuge
- Release DNA by breaking open cells via lysis
- Isolate and purify DNA using liquid-liquid extraction (aq layer has DNA)
chemical lysis
destabilizes the lipid bilayer and denatures proteins
surfactants
have one hydrophobic tail, which allows them to penetrate molecular structures further than phospholipids with 2 tails
Similar to phospholipids, but better at breaking through the barrier and destabilizing proteins
Main problem of determining the order of nucleotides
DNA elongation happens rapidly and continually
Uses DNA polymerase and excess of nucleotides to make copies of DNA
3’ OH is required for DNA elongation
Di-deoxynucleotides (ddNTPs) stop replication because they lack a 3’ OH, so polymerase cannot add another nucleotide after them
sanger sequencing
- accurate, long reads, but resource consuming
- use one beaker and fluorescence to distinguish between the ddNTPs
– Fragment separation can be automated via capillary gel electrophoresis
– Separates molecules by size; DNA’s charge-to-mass ratio is roughly constant, so the gel matrix sieves fragments by length
Smaller molecules move more freely through the gel and migrate faster than larger molecules
molecules must be charged through tagging
– Unique signal per ddNTP products chromatogram
Building strand from fragments
Sort DNA fragments by length to see what the last nucleotide was
Line up the fragments shortest to longest and read each terminal nucleotide; this gradually builds the strand from the 5’ end toward the 3’ end
Original Set up →
- Split sample into 4 beakers
- Add all 4 dNTPs into each beaker, plus one radioactive ddNTP (a different ddNTP per beaker)
- Need separate beakers because the ddNTPs cannot be differentiated from one another
- Add Taq polymerase
- Separate by length using gel electro.
Shortest lengths travel the farthest; associate each band with its beaker
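The fragment-sorting idea above can be sketched in Python (the fragment list is hypothetical toy data; the gel sorts terminated fragments by length, and each fragment's last base extends the read):

```python
# Toy Sanger read reconstruction: each fragment is a synthesized strand
# that ended in a ddNTP. Sorting by length (what the gel does) and taking
# each fragment's final base yields the new strand, read 5' -> 3'.
def read_sanger(fragments):
    ordered = sorted(fragments, key=len)  # shortest ran farthest on the gel
    return "".join(frag[-1] for frag in ordered)

fragments = ["A", "AC", "ACG", "ACGT", "ACGTT"]  # hypothetical fragments
print(read_sanger(fragments))  # -> ACGTT
```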
Good vs Bad chromatogram
Good:
- Variation in peak height is less than 3-fold
- Peaks are evenly distributed, one color each
- Baseline noise is absent
Interpreted nucleotide sequence is 5’ → 3’
Bad:
- Significant noise up to ~20 bp in (unreliable transport properties for short fragments)
- Dye blobs occur from unused ddNTPs
- Fewer longer fragments so signal is weaker
Illumina
short reads, but high throughput
- Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
- Primers are not complementary, so they do not base pair
- Fragments become bound somewhere in the flow cell
- locally amplify bound DNA fragments to get clusters of the same sequence
- Bridge amplification creates double-stranded bridges
- Double-stranded clonal bridges are denatured, and the reverse strands are cleaved and washed away
- uses pair-end sequencing
***clusters will give off a stronger signal compared to a single fragment
Illumina stepwise
- Add labeled dNTPs into flow cells
- Incorporate a complementary nucleotide
- Remove unincorporated fluorescent nucleotides
- Capture fluorescent signal & image clusters
- Cleave the fluorophores and the protecting group
Pair-end sequencing
generated from both ends of a DNA fragment with known insert size
enables both ends of the DNA fragment to be sequenced
Distance between each paired read is known, alignment algorithms can use this info to map the reads over repetitive regions more precisely.
Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome
** more expensive but ideal for genome assembly
Nanopore
Longer reads, more accurate for assembling reads into genome
Very expensive, low throughput
single-end reads
- generated from only one end of a DNA fragment
- Simpler, fast, more cost-effective
- Limited context for structural variations or duplications
- Used for small genomes and RNA seq where contiguity is less critical
Genome assembly
- process of combining our sequencing reads into a continuous DNA sequence
(Sequencing provides short, overlapping reads of DNA)
Having multiple fragments that contain the same portion of the sequence improves our coverage
reads
raw sequences coming from the experiments
Contigs
continuous stretches of DNA seq from overlapping seq reads
Ambiguous assembly
contigs put together in an unknown order
Resolved by ordering contigs into scaffolds or by assembling against a reference genome
Scaffold
contigs put together overlapping with estimated gaps in a known order
main challenges for de novo genome reconstruction
Repeats: create ambiguity and can cause misassemblies; inflate genome size
High coverage: sequencing the genome multiple times, resulting in a greater number of reads that overlap any given region of the genome
greedy overlap
de novo genome reconstruction
Goal is to assemble the strings (reads) into a continuous, single string (contig)
Want the shortest possible superstring
- Overlap maximization
– Reduces redundancy, maximizes confidence with highest overlap
- Repeat resolution
– Resolves repeats by favoring collapsed arrangements
- Evolutionary pressure
– Most genomes have selective pressure to be efficient
how to do a greedy assembly?
merge by highest overlap!!
Repeats ruin assembly ⇒ can cause missing reads
Increase K to overcome repeats
de Bruijn graphs
- help to visualize relationships/overlaps between the strings
- Node = single entity [k-1]
- Edge = represents a connection between entities (can have direction) [k]
- uses direct edges to specify overlap and concatenation
- Each unique k-mer is a node. (K-mer = substring of length k)
- A node is balanced if indegree equals outdegree
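The node/edge construction above can be sketched in Python (the read is a hypothetical toy example; nodes are (k-1)-mers, one directed edge per k-mer):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds a directed edge from its
    left (k-1)-mer to its right (k-1)-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # node -> node edge
    return graph

g = de_bruijn(["ACGTACG"], 3)
# the repeated k-mer ACG gives node "AC" two parallel edges to "CG"
```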
multiple reads for DB graphs
not Eulerian if we cannot walk along each edge exactly once; an Eulerian walk needs at most 2 semi-balanced nodes
edges on walk extend the contig in multiple directions
errors in assembly effect on DB graphs
Errors affect:
1) k-mer counts, 2) increase # of edges and unconnected graphs
- No overlap would lead to unconnected graphs; weights can be added to arrows (#)
Error correction should remove most tips, islands, bulges (splits and reconnects)
high coverage for de Bruijn graphs
High coverage suggests that a node is likely a true sequence rather than an error
How do we choose the “best” path for our contig?
Long paths are desired but not always reliable due to potential repeats
High, consistent read coverage
Unique, non-branching paths
SPAdes
- prokaryotic genome assembler based on DB graphs
- Estimates gaps between reads using DB graphs
- Builds multisized graphs with different k’s.
- Using multiple graphs allows for a better handling of variable coverage.
- Assemblers provide contigs and scaffolds (connections how contigs form scaffolds)
Large VS Small K
Large K ⇒ fragmented graphs; helps reduce repeat collapsing
Small K ⇒ collapsed/tangled graph good for low-coverage regions
L50
NUMBER of contigs whose combined length is at least 50% of the total assembly length
Lower is better for L50 value
**longer contigs = more confidence that genome is right
N50
LENGTH of the shortest contig among the largest contigs that together cover 50% of the total assembly length
Higher is better for N50 value [median contig size = reliability factor]
***N50 is the length of shortest contig in L50 assembly
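Both metrics fall out of one pass over the sorted contig lengths; a minimal sketch (toy contig lengths, assumed in bases):

```python
def n50_l50(contig_lengths):
    """Return (N50, L50): walk contigs longest-first until the running
    total reaches half the assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count  # N50 = this contig's length, L50 = # used

n50, l50 = n50_l50([100, 80, 50, 30, 20])  # -> N50 = 80, L50 = 2
```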
leading synthesis
- caused by addition of a dNTP instead of a ddNTP
- synthesis runs ahead by 1 nucleotide because 2 were added in one cycle
lagging synthesis
caused by failure to remove the blocking fluorophore, so synthesis falls behind by 1 nucleotide
signal cross talk
degrades quality of assemblies
phred quality score
assess the accuracy of nucleotide base calls in DNA sequencing (prob that base call is incorrect)
Phred quality scores are stored as ASCII-encoded characters (encoding the error probability) in FASTQ files
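The score-to-probability relationship (Q = -10·log10(P), stored as Phred+33 ASCII in FASTQ) can be decoded like so:

```python
def phred_from_ascii(ch, offset=33):
    """Decode one FASTQ quality character (Phred+33 by default) into
    (Q score, probability the base call is incorrect)."""
    q = ord(ch) - offset
    return q, 10 ** (-q / 10)

q, p = phred_from_ascii("I")  # 'I' is ASCII 73 -> Q40, error prob 1 in 10,000
```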
Per sequence GC content
Deviation from normal distribution indicates contamination (reads)
Trimming/Filtering
reduces bias of bad base calls normally at the ends of reads
Trimming/cutting/masking sequences
– From low quality score regions
– Beginning and end of sequence
– Remove adapters
Filtering of sequences
– With low mean quality score
– Too short
– With too many ambiguous (N) bases
structural annotation
identifies critical genetic elements such as genes, promoters, and regulatory elements
Functional annotation
- predicts the function of genetic elements
Reading ORFS →
- Seek the standard start codons: ATG, GTG, or TTG
- Seek the stop codons based on the translation table
TAA, TAG, TGA for bacteria, archaea, and plant plastids
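The ORF-reading rule above can be sketched as a forward-strand scan (a simplified sketch: it checks all three frames using the start/stop codon sets from these notes, and ignores the reverse strand):

```python
STARTS = {"ATG", "GTG", "TTG"}          # standard bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}           # stops (bacteria/archaea/plastids)

def find_orfs(seq):
    """Return (start, end) pairs for ORFs on the forward strand.
    end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] in STARTS:
                # scan the same frame for the first in-frame stop codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        orfs.append((i, j + 3))
                        break
            i += 3
    return orfs
```

For example, `find_orfs("ATGAAATAG")` finds the single ORF spanning the whole string: start ATG, one codon AAA, stop TAG.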
Typical elements of a gene that are annotated
Promoter, start site, 5’ UTR, exons, introns, start codon, CDS, stop codon, 3’ UTR
MSA
the process of aligning three or more biological sequences simultaneously
Identifies conserved regions across multiple species
Reveals patterns not visible in pairwise comparisons (evol. relationships)
Key characteristics:
- Aligns multiple sequences in a single analysis
- Introduces gaps to maximize alignment of similar characters
- Preserves the order of characters in each sequence
Important elements of scoring in alignment selection
Objectivity: provides a quantitative measure for comparison
Optimization: allows algorithms to find the best alignment
Significance: helps distinguish real homology from random similarity
Alignment elements reflect …
evolutionary events in sequences
(match, gap, mismatch)
match
identical characters in aligned positions
Represents conserved regions or no change
mismatch
different characters in aligned positions
Indicates substitutions or mutations
gap
dash(-) inserted to improve alignment
Represents insertions and deletions (indels)
global alignment
compares sequences in their entirety (start to end)
Key characteristics:
– Attempts to align every residue in both sequences
– Introduces gaps as necessary to maintain end-to-end alignment
– Optimizes the overall alignment score for the entire sequences
Needleman-Wunsch: guarantees optimal global alignment
Advantages of global alignment
Provides a complete picture of sequence similarity
Ideal for detecting overall conservation patterns
Useful for phylogenetic analysis of related sequences
limitations of global alignment
May force alignment of unrelated regions in divergent sequence
Less effective for sequences of very different lengths
Can be computationally intensive for long sequences
local alignment
identifies best matching subsequences; focus on regions of high similarity
Key characteristics:
– Does not require aligning entire sequences end-to-end
– Allows for identification of conserved regions or domains
– Ignores poorly matching regions
– Can find multiple areas of similarity in a single comparison
– Aligns subsections of sequences
– Protein motif identification exemplifies local alignment utility (identifies functional regions)
Smith-Waterman
Needleman Wunsch
- start with 0 in the top-left corner
- add the gap penalty along the first row and first column
- fill each cell with the highest possible score, including penalties
- final score is in the bottom-right cell
Smith Waterman
- zero is the lowest score
- if a score would be negative, make it 0
- enter 0’s in the starting row and column
- Start alignment at the highest cell
- Stop aligning when you encounter a zero
Smith-Waterman differs from Needleman-Wunsch in key aspects →
Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros
Scoring system:
NW: allows negative scores
SW: negative scores are set to zero
Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix
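The three differences above can be captured in one matrix-fill sketch (toy scoring values match=1, mismatch=-1, gap=-2 are assumptions; a `local` flag switches Needleman-Wunsch behavior to Smith-Waterman, and traceback is omitted for brevity):

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the DP matrix and return the alignment score:
    NW (global) -> bottom-right cell; SW (local) -> highest cell."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]
    if not local:  # NW: first row/column hold cumulative gap penalties
        for i in range(rows):
            M[i][0] = i * gap
        for j in range(cols):
            M[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = M[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, M[i - 1][j] + gap, M[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # SW: negative scores become zero
            M[i][j] = score
            best = max(best, score)
    return best if local else M[-1][-1]
```

For example, a perfect global match of "GATTACA" against itself scores 7, while a local alignment of "ACG" against "TACGT" scores 3 by ignoring the poorly matching flanks.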
linear gap penalty
fixed cost for each gap
Simple to implement, but over-penalizes long gaps
affine gap penalty
different costs for opening and extending gaps
Better for long indels, more biologically realistic (Single mutation event often causes multi-base indel)
position-specific gap penalties
Reduced in variable regions; increased in conserved regions
residue-specific gap penalties
Adjust penalties based on amino acid properties
terminal gap penalties
Often reduced to allow end gaps in local alignments
transcriptomics
allows us to see exactly what genes are active within a given moment
Allows us to see changes in gene expression over time (picture of gene expression)
Works with a complete set of RNA transcripts (mRNA, rRNA, tRNA, non-coding RNA)
Captures the dynamic nature of the transcriptome to reflect the functional state of the cell; captures cell’s response to environment and signals
*** what annotated genes are actually being used
isoforms
a single gene can produce multiple mRNA transcripts
Way for org. to increase protein diversity without increasing the number of genes
reveals alternative splicing and isoforms (cell type, environment, developmental state)
genomics VS transcriptomics
Functional insights →
- Identifies potential functional elements (genomics)
- Reveals which elements are active (transcriptomics)
- Predicts disease risk (genomics)
- Shows disease state (transcriptomics)
Temporal insights →
- Requires only one-time sampling (genomics)
- Captures real-time cellular responses (transcriptomics)
- Reveals evolutionary history (genomics)
single-cell transcriptomics
- revolutionizes resolution
- best for rare cells with complex tissue types
- captures gene expression in an individual cell
- reveals cellular heterogeneity within the tissues
***very powerful data but can be very sparse and noisy
***Not very reproducible bc there is so little RNA in a cell; typically paired with bulk RNA analysis
spatial transcriptomics
- maps gene expression to location
- Preserves spatial information of transcripts within tissue sections
- Reveals how cellular neighborhoods influence gene expression
RNA integrity number
- rRNA makes up a large percentage of our RNA
- lower numbers indicate a degraded sample (28S is degraded relative to 18S rRNA)
filter for mRNA only
a poly(A)-tail primer allows amplification of mRNA only
microarrays
- convert mRNA to cDNA
- no longer in practice
- LIMITED to known sequences
- similar sequences may cause false positives
- limited dynamic range
- normalization challenges
- potential for bias
RNA-seq
- doesn’t require prior knowledge of sequences; allows discovery of novel transcripts/isoforms (primary advantage over microarray technology)
computational pipeline for RNA-seq data analysis
- Read alignment: mapping transcripts to the genome
- Quantification: measuring gene expression levels
- Differential expression analysis: identifying key genes
- Dimensionality reduction: visualizing complex data (not in practice)
Hash table
- link a key to a value
- keys represent a label we can use to get info
- hash function used to determine where to find their number
- DNA dictionary with quick lookup and direct access to potential matches
(large memory and slow for large genomes)
** way for reads to be mapped to reference genomes
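A k-mer hash index for read mapping can be sketched with a plain dictionary (toy reference and k are hypothetical; real mappers use far more compact structures):

```python
from collections import defaultdict

def kmer_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(index, read, k):
    """Direct-access lookup: candidate mapping spots via the read's first k-mer."""
    return list(index.get(read[:k], []))

idx = kmer_index("ACGTACGT", 4)  # "ACGT" occurs at positions 0 and 4
```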
suffix arrays/trees
- represent all suffixes of a given string
- used to find the starting index of a suffix
- arrays are a memory-efficient alternative to trees
— require less memory, but are less powerful
*** create all suffixes; affix an end-of-string identifier; then sort lexicographically
storing only the sorted suffixes means we LOSE the original data
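The suffix-array construction described above fits in a few lines (a naive O(n² log n) sketch; production tools use linear-time constructions):

```python
def suffix_array(s):
    """Append an end-of-string marker, then return suffix start positions
    sorted by the lexicographic order of the suffixes."""
    s = s + "$"  # '$' sorts before letters and marks end-of-string
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")  # -> [6, 5, 3, 1, 0, 4, 2]
```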
Burrows-Wheeler transforms (BWT)
- compresses the amount of data that we have to store without losing the original data
— allows for reversibility of data
Basic concept of BWT:
– Append a unique end-of-string (EOS) marker to the input string.
– Generate all rotations of the string.
– Sort these rotations lexicographically
– Extract the last column of the sorted matrix as the BWT output.
– the 1st column is more compressible, but loses context/reversibility (so the last column is used)
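The rotation-sort-extract steps, plus the reversibility that makes BWT useful, can be sketched directly (naive versions; real implementations avoid materializing all rotations):

```python
def bwt(s):
    """Append EOS marker, sort all rotations, return the last column."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Rebuild the original string: repeatedly prepend the last column
    to the sorted table and re-sort, until full rotations reappear."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]

encoded = bwt("banana")        # -> "annb$aa" (runs of equal letters compress well)
original = inverse_bwt(encoded)  # -> "banana", nothing was lost
```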
backwards search for BWT
- backwards search efficiently finds occurrences of a pattern in the text using L-F mapping
- reversibility of BWT is better than suffix arrays bc we do not lose data
alignment
- specifies where exactly in the transcript this read came from (at position ___)
*** need to determine the read’s exact position in the transcript, but this is SOOO EXPENSIVE $$$$
pseudoalignment
- specifies that it came somewhere from this transcript (compatible)
- Finds which transcript, but not where
- Identifies which transcripts are compatible with the read, skipping the precise location step
- Faster and less resource intensive than alignment based methods
- Lacks certain details (position and orientation of reads) which are useful for correcting technical biases
quantifying gene expression levels
- Must scale data for higher precision, less memory
– Reads per kilobase (RPK): corrects for gene-length bias through normalization
– Per-million scaling: corrects for sequencing depth (RPK scaled per million gives TPM)
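The two normalizations combine into TPM-style values; a minimal sketch, assuming toy counts and gene lengths in kilobases (function and variable names are illustrative):

```python
def tpm(counts, lengths_kb):
    """Reads-per-kilobase for each gene, then scale so all values
    sum to one million (transcripts per million)."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # length normalization
    per_million = sum(rpk) / 1e6                       # depth normalization
    return [r / per_million for r in rpk]

# A 2 kb gene with 20 reads expresses the same as a 1 kb gene with 10 reads:
values = tpm([10, 20], [1.0, 2.0])  # both genes get 500,000 TPM
```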
generative model
- a statistical model that explains how the observed data are generated from the underlying system
- Defines a computational framework that produces sequencing reads from a population of transcripts
- the model generates reads from transcripts, but we don’t know how much of each transcript is present because sampling is biased
— so we go backwards: calculate transcript abundance from the observed read distribution
transcript fraction
- tells us the proportion of total RNA molecules in the sample that come from a certain transcript
- adjusts for the fact that longer transcripts generate more reads
- converts nucleotide-level fractions to transcript-level proportions by normalizing for length
fragment probabilities
conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical bias
- SALMON approximates these probabilities
Positional bias:
- Fragments that include transcript ends might be too short
- Fragments from central regions are more likely to be of optimal length for sequencing reads
GC content:
- GC-rich regions are undersampled
- AT-rich regions are oversampled (AT-rich triplets make good stop codons)
expectation-maximization algorithm
E) estimate missing info (assignment of fragments to transcripts) using the current transcript abundance estimates
M) use the estimated assignments to update the transcript abundances (improves likelihood)
For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum
– ensures the accuracy of abundance estimates by correcting biases learned during the online estimation phase
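The E/M loop above can be sketched on toy data (this is a didactic sketch, not Salmon's actual implementation; `compat` is a hypothetical list giving, for each fragment, the transcripts it is compatible with):

```python
def em_abundance(compat, n_iter=100):
    """Toy EM for transcript abundance.
    E-step: split each fragment among its compatible transcripts in
    proportion to current abundances. M-step: renormalize the expected
    counts into new abundances. Likelihood never decreases."""
    n_tx = max(t for row in compat for t in row) + 1
    theta = [1.0 / n_tx] * n_tx  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in compat:                       # E-step
            total = sum(theta[t] for t in row)
            for t in row:
                counts[t] += theta[t] / total
        theta = [c / len(compat) for c in counts]  # M-step
    return theta

# One fragment unique to each transcript, one ambiguous fragment:
theta = em_abundance([[0], [0, 1], [1]])  # converges to [0.5, 0.5]
```

Note how the ambiguous fragment is split according to the evidence from the unique fragments; with two unique fragments for transcript 0 and none for transcript 1, EM drives all the ambiguous mass toward transcript 0.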
transcript effective length
adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
maximum likelihood estimation (MLE) goal
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
2-Phase inference in Salmon
- online phase: fast, initial estimates of transcript abundances
- offline phase: refines initial estimates using more complex optimization techniques
** balances speed (online) with accuracy (offline)
quasi-mapping
- a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
*** often used for the initial estimates of the online phase in SALMON
Full alignment is expensive, so quasi-mapping stops after identifying seeds!!
SALMON transcript-fragment assignment matrix
- uses matrix to identify distributions of reads amongst the transcripts
— computationally assigns fragments to transcripts
*** maps RNA-seq reads (fragments) to transcripts, enabling accurate quantification of transcript levels
— decides how many fragments are assigned to a specific transcript (higher expression = more fragment abundance in a transcript)
statistical model
mathematical tool that describes how data is generated
***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance