Bioinformatics Exam Flashcards

1
Q

AlphaFold 3

A

Predicts the joint structure of complexes including proteins, nucleic acids, small molecules, ions, etc.

2
Q

HOMER2

A

Show that the effect of transcription factor binding on transcription initiation is position dependent

3
Q

How do we acquire our DNA sample for DNA sequencing?

A
  1. Start with bacterial culture to produce the product of interest
    – Biotechnology frequently uses massive E. coli cultures to produce such products.
  2. Separate cells from media
    – Centrifuge and separate cells and media
    – Keep the component of interest (DNA)
    – Break open the cells by lysing them (chemical lysis destabilizes the lipid bilayer and denatures proteins)
  3. Isolate and purify our DNA
    – phenol-chloroform extraction (liquid-liquid separation)
    – Aqueous phase (DNA/RNA) on top
    – Lipids/large molecules on the bottom
4
Q

Surfactants VS Phospholipids

A
  • both contain a hydrophilic head and hydrophobic tail

– surfactants have only one hydrophobic tail, which lets them penetrate molecular structures further than phospholipids, which have 2 tails

– break phospholipid barrier more and destabilize proteins (used for chemical lysis)

5
Q

260 nm DNA sample absorbance

A
  • absorbance at 260 nm is correlated to the DNA concentration of the sample

— looks for impurities in the sample solution

— can assume we have purified DNA sample after this step

– based on the absorbance of UV irradiation (Beer–Lambert law)
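As a rough illustration of how an absorbance reading maps to a concentration, the sketch below uses the common lab shorthand for the Beer–Lambert relation. The conversion factor (A260 of 1.0 is about 50 µg/mL of double-stranded DNA at a 1 cm path length) is a standard rule of thumb that is not stated on the card, and the function name is hypothetical.

```python
def dsdna_concentration_ug_per_ml(a260, dilution_factor=1.0, path_length_cm=1.0):
    """Estimate dsDNA concentration (ug/mL) from absorbance at 260 nm.

    Assumption (not from the card): A260 = 1.0 corresponds to roughly
    50 ug/mL of double-stranded DNA at a 1 cm path length.
    """
    ug_per_ml_per_a260 = 50.0  # assumed conversion factor for dsDNA
    return a260 / path_length_cm * ug_per_ml_per_a260 * dilution_factor


# An undiluted sample reading 0.5 absorbance units
print(dsdna_concentration_ug_per_ml(0.5))
```

Absorbance scales linearly with concentration, so a diluted sample only needs its reading multiplied back by the dilution factor.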

6
Q

Main purpose of Sanger Sequencing

A

— determine the precise ordering of nucleotides

7
Q

DNA elongation

A
  • occurs rapidly and continuously
  • use DNA polymerase and excess nucleotides to make copies of DNA
  • requires 3’ OH to add another nucleotide to the chain
8
Q

Di-deoxynucleotides (ddNTP)

A
  • ddNTPs stop replication
  • do not have a 3’ OH for continued elongation
  • usually used at a ~1:100 ratio of ddNTP to dNTP

*** left with DNA strands of variable length

9
Q

Sanger sequencing process

A
  • sort DNA fragments by length to see what the last nucleotide is
    – a lower ddNTP concentration results in longer strands
    – a higher ddNTP concentration results in shorter strands

*** by sorting fragments by length, we can see what the last nucleotide of each fragment was (line up the 5’ ends)
– from this we get the sequence of the template strand (as its complement)

10
Q

Original Sanger Sequencing SetUp

A
  1. split DNA sample into 4 beakers
  2. Add a ddNTP into each beaker (A,T,C,G)
  3. Add some radioactive ddNTP into a single beaker
  4. Add Taq and run PCR

** separate by length in gel electrophoresis
(larger fragments do not travel as far)
– order from farthest traveled (shortest) to least traveled (longest)

***** need SEPARATE beakers bc you cannot differentiate between radioactive nucleotides

11
Q

Sanger Sequencing Now

A
  • now use fluorescent tags to distinguish ddNTPs
  • only need one beaker for PCR
  • also automate fragment separation
12
Q

capillary gel electrophoresis

A
  • can accelerate fragment length sorting and detection
  • separates molecules by size based on their charge-to-mass ratio
  • Smaller molecules move more freely/faster through the gel than larger molecules
  • molecules must be charged through tagging with a charged molecule
  • DNA and RNA are charged bc each nucleotide has a charge
13
Q

SanSeq Chromatogram

A
  • unique fluorescence signal per ddNTP produces a chromatogram
14
Q

ideal SanSeq Chromatogram

A
  • variation in peak height is less than 3-fold
  • peaks are evenly distributed
  • peaks contain only 1 color
  • absent baseline noise
  • interpreted nucleotide sequence is 5’ to 3’
15
Q

Nonideal SanSeq Chromatogram

A
  • significant noise in the first ~20 bp (unreliable transport)
  • dye blobs from unused ddNTPs
  • fewer long fragments, so the signal is weaker at longer read lengths
16
Q

SanSeq VS Illumina Sequencing

A
  • Sanger sequencing is very accurate but slow compared to Illumina
17
Q

Illumina Sequencing

A
  • sequencing by synthesis
  • uses a polymerase/ligase enzyme to incorporate nucleotides with a fluorescent tag (fluorescently labeled reversible terminator)
  • tags are then identified to determine the DNA sequence
18
Q

Illumina Sequencing Process

A
  • Adapter ligations attach P5 and P7 oligos to facilitate binding to flow cell
  • fragments become bound somewhere in the flow cell
  • locally amplify bound DNA fragments to get clusters of the same sequence
    – bridge amplification creates double-stranded bridges
    – double-stranded clonal bridges are denatured and the reverse strands are cleaved

***clusters will give off a stronger signal compared to a single fragment

We repeatedly →
1. Add nucleotide
2. Capture signal
3. Cleave fluorophore

19
Q

5-step Illumina sequencing process

A
  1. Add labeled dNTPs into flow cells
  2. Incorporate a complementary nucleotide
  3. Remove unincorporated fluorescent nucleotides
  4. Capture fluorescent signal & image clusters
  5. Remove the fluorophores and the protecting group
20
Q

paired-end sequencing

A
  • enables both ends of the DNA fragment to be sequenced

– Because the distance between each paired read is known, alignment algorithms can use this information to map the reads over repetitive regions more precisely.

***Results in much better alignment of the reads, especially across difficult-to-sequence, repetitive regions of the genome

21
Q

Nanopore Sequencing Technology

A
  • nanopore and polymer membrane respond to electrical perturbations

*** gives us much longer reads which is important for assembling reads into a genome

** type of third-generation sequencing (TGS)

  • can give long reads with no amplification
  • Direct detection of epigenetic modifications on native DNA.
  • sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms.
  • Uniform coverage of the genome; not as sensitive to GC content as short-read platforms.
22
Q

genome assembly

A
  • process of combining the short, overlapping sequencing reads into continuous DNA sequence

– having multiple fragments that contain the same portion of the sequence improves our coverage

23
Q

reads

A

raw sequences coming from experimentation

24
Q

contigs

A

continuous stretches of DNA sequence from overlapping sequencing reads

25
Q

ambiguous assembly

A

connecting contigs in an unknown order
- accounts for differences in scaffolds
- assemble using reference genome

26
Q

scaffolds

A

multiple overlapping contigs with estimated gaps put together in a known order

27
Q

Assembly quality metrics

A
  1. sort contigs from longest to shortest
  2. Find point when you have ~50% of genome
  3. then annotate our genome with exons and introns
28
Q

L50

A
  • smallest number of contigs whose combined length is at least 50% of the total assembly length

*** Lower is better for L50 value

  • longer contigs = more confidence that genome is right (higher quality assembly)
29
Q

N50

A
  • length of the shortest contig among the longest contigs that together cover 50% of the total genome length

*** Higher is better for N50 value [median contig size = reliability factor]

  • N50 is the length of the shortest contig in the L50 set
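The sort-and-accumulate procedure behind N50 and L50 can be sketched in a few lines. This is a minimal illustration that assumes the "total genome length" is approximated by the sum of the contig lengths; the function name is hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Assumption: total length is the sum of all contig lengths.
    """
    lengths = sorted(contig_lengths, reverse=True)  # longest first
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig needed to reach 50%,
            # `count` is how many contigs it took to get there
            return length, count
    raise ValueError("need at least one contig")


print(n50_l50([100, 80, 60, 40, 20]))  # (80, 2)
```

Here the two longest contigs (100 + 80) already exceed half of 300, so L50 is 2 and N50 is 80.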
30
Q

why clean sequencing reads?

A
  • improves assembly
  • Garbage in = garbage out
31
Q

FASTA files

A
  • store sequences
  • One line starts with a “>” and a sequence ID code
    — It is optionally followed by a description of the sequence
  • One or more lines containing the sequence itself.

However…base calling is NOT perfect
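To make the FASTA layout concrete, here is a small parser sketch. It assumes every header line carries a sequence ID after the ">" and that the text is already in memory; the function name and the toy records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {sequence_id: (description, sequence)}."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: ID first, optional description after whitespace
            header = line[1:].split(maxsplit=1)
            if not header:
                raise ValueError("header line without a sequence ID")
            seq_id = header[0]
            description = header[1] if len(header) > 1 else ""
            records[seq_id] = (description, "")
        elif seq_id is not None:
            # Sequence lines are concatenated until the next header
            description, sequence = records[seq_id]
            records[seq_id] = (description, sequence + line)
    return records


example = ">seq1 toy example\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))
```

Note that nothing in the format records how confident the base caller was, which is the gap FASTQ fills.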

32
Q

lagging synthesis

A
  • caused by failure to remove the blocking fluorophore
  • synthesis is behind by 1 nucleotide because the blocking fluorophore was not removed
33
Q

leading synthesis

A
  • caused by addition of dNTP instead of ddNTP
  • synthesis is ahead by 1 nucleotide because 2 were added at once
34
Q

signal cross talk

A
  • degrades the quality of assembly
  • clean = clear
  • noisy = blurry
  • ML models and algorithms compute the probability of error → i.e. quality
    — not confident that what it is seeing is purely blue, green, etc.
35
Q

FASTQ files

A
  • store sequence and quality
  • quality scores measure the probability that a base is called incorrectly
36
Q

ASCII-encoded probabilities

A
  • allow a quality value to be stored for every nucleotide without storing a float each time
  • ASCII characters require ~¼ the memory and we already have to store nucleotides
  • Each ASCII character has an associated integer
    (phred quality (Q))
37
Q

Phred quality P(Q)

A

the integer associated with the ASCII symbol

  • indicates the probability that an error has occurred

– smallest value is 33 because lower ASCII codes cannot be rendered on screen

– ! = probability of error = 1 (very bad quality)

*** The lower you go down the chart, the higher the quality of the read; errors within the text file become less likely

38
Q

calculating phred quality

A

P(error) = 10 ^ ( -Q / 10 ), where Q = (ASCII code of the symbol) - 33
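A short sketch of the calculation, assuming the offset of 33 mentioned on the Phred quality card (the smallest printable value). The function name is hypothetical.

```python
def phred_error_probability(symbol, offset=33):
    """Convert one FASTQ quality character into an error probability.

    Q is the character's ASCII code minus the offset, and the
    probability of a wrong base call is 10 ** (-Q / 10).
    """
    q = ord(symbol) - offset
    return 10 ** (-q / 10)


for symbol in "!5I":
    print(symbol, ord(symbol) - 33, phred_error_probability(symbol))
```

"!" has ASCII code 33, so Q = 0 and the error probability is 1; "5" gives Q = 20 (1 in 100) and "I" gives Q = 40 (1 in 10,000).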

39
Q

Where do FASTQ file entries go?

A
  • NIH databases
  • GenBank for genomic sequences
  • Sequence read archive (SRA) for sequencing data
  • RefSeq for reference genomes
  • BioProject for curated resources for a specific project
40
Q

quality issues in sequencing data

A
  • errors are introduced due to the technical limitations of sequencing platforms
  • adapters may be present if reads are longer than the fragments sequenced
    — trimming adapters may improve the number of reads mapped

** quality control is an essential first step in any analysis

41
Q

Per base sequence quality

A
  • box and whisker plot of base-call accuracy
  • green = excellent
  • yellow = good
  • red = poor
42
Q

Per sequence GC content

A
  • strong deviation from normal distribution could indicate contamination
  • shows curves of GC count per read
  • and theoretical distribution curve

**Compare the similarity of the 2 curves to gauge the purity/quality of the sample

43
Q

trimming and filtering

A
  • trimming of problematic bases at the ends of reads must be done in order to reduce bias in future analyses

Trimming:
1. low quality score regions
2. beginning/end of sequence
3. remove adapters

Filtering:
1. with low mean quality score
2. too short
3. too many ambiguous (N) bases

Ex/ Cutadapt, Trimmomatic, fastp

44
Q

adapters for trimming

A
  • adapters are unique to DNA prep protocol and technology
  • note which specified adapter sequences are used for trimming
45
Q

automatically cleaning data for quality

A

Processing data many times consumes many resources, so combine tool features into a single run.

Instead of trimming adapters in one run and filtering on quality in another, we can do both in one pass and simultaneously remove base calls with low accuracy.
– Phred ≤ 20 → poor
– Length required = 20

  • automatically removes low-accuracy and short reads, which is needed to assemble reads into quality contigs/scaffolds
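The two thresholds above can be combined in a single pass over each read. The sketch below is a deliberately simplified stand-in, not how any particular trimming tool works: it trims only from the 3' end, assumes Phred+33 encoding, and uses hypothetical names.

```python
def clean_read(sequence, quality, min_phred=20, min_length=20, offset=33):
    """Trim poor bases from the 3' end, then drop reads that are too short.

    Bases with Phred <= min_phred count as poor (as on the card).
    Returns (sequence, quality) after trimming, or None if the read
    ends up shorter than min_length.
    """
    scores = [ord(ch) - offset for ch in quality]
    end = len(scores)
    # Walk back from the 3' end while the base call is poor
    while end > 0 and scores[end - 1] <= min_phred:
        end -= 1
    if end < min_length:
        return None  # filtered out: too short after trimming
    return sequence[:end], quality[:end]


read = "A" * 25
qual = "I" * 22 + "#" * 3  # 'I' is Q40, '#' is Q2
print(clean_read(read, qual))
```

Doing the trim and the length filter together means each read is touched once instead of once per tool.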
46
Q

resequencing

A

align reads to reference genome and identify variants

47
Q

de novo assembly

A

construct genome sequence from overlaps between reads
*** done 99% of the time

  • repeats/high coverage are the main challenges
48
Q

Why do we want the shortest superstring?

A
  • overlap maximization
  • reduces redundancy
  • maximizes confidence with highest overlaps
  • repeat resolution resolves repeats by favoring collapsed arrangements
  • evolutionary pressure: most genomes are under selective pressure to be efficient
49
Q

greedy algorithm

A
  • merge strings by highest overlap

Procedure →
1. Merge strings one at a time, keeping the 5’ and 3’ orientation consistent
2. Always merge the largest overlap (greedy), not necessarily the size of fragment
3. Repeat

*** Being greedy makes genome assembly tractable

*** not used in practice but helps us to understand the problem
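The procedure above can be sketched as a toy greedy shortest-common-superstring routine. Assumptions: reads are only merged in the orientation given (no reverse complements), ties go to the first pair encountered, and the function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=1):
    """Repeatedly merge the pair of strings with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            olen = overlap(a, b, min_length)
            if olen > best_len:  # strict '>' means first-found wins ties
                best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            break  # nothing overlaps any more; leave separate strings
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads


print(greedy_assemble(["ATTAGAC", "GACCTGC", "CTGCCGT"]))
```

In this example the 4-base overlap (CTGC) is merged first, then the 3-base overlap (GAC), leaving the single string ATTAGACCTGCCGT. Every round compares all pairs, which is part of why the approach serves better as a teaching tool than as a practical assembler.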

50
Q

What happens if we have a tie for the greedy algorithm?

A
  • Choose randomly (first encountered, first merged)
  • Choose highest quality base call (use sequence with highest quality)
  • Choose highest coverage (whichever results in more coverage)
  • Look ahead (do both and evaluate consequence)
  • Exclude (don’t merge at all)
51
Q

repeats ruined our assembly

A
  • missing strings can result from the greedy assembly process
  • get the correct string back by increasing our K
52
Q

de Bruijn graph

A
  • a graph is a data structure for representing relationships between items
  • node = single entity [(k-1)-mer]
  • edge = represents a connection between entities (can have direction) [k-mer]
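A minimal sketch of building such a graph, assuming nodes are (k-1)-mers and every k-mer found in a read contributes one directed edge from its prefix to its suffix. The function name is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a directed multigraph as {node: [successor nodes]}.

    Each k-mer adds one edge from its first k-1 bases to its last
    k-1 bases, so repeated k-mers produce repeated (multi-)edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(build_de_bruijn(["ACGTC"], 3))
```

With k = 3 the read ACGTC yields the k-mers ACG, CGT and GTC, giving the chain AC → CG → GT → TC.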
53
Q

directed multigraphs for genome assembly

A
  • genome assembly uses directed edges to specify overlap and concatenation
54
Q

Building a directed multigraph

A
  1. each unique k-mer is a node. (k-mer = substring of length k)
  2. Add directed edges for each overlap and concatenation
55
Q

node is balanced if

A

indegree equals outdegree
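The balance check is easy to compute from an edge list. The sketch below also labels nodes whose indegree and outdegree differ by exactly 1 as semi-balanced, which is the usual definition but is not spelled out on the card; the Eulerian test additionally assumes the graph is connected. Function names are hypothetical.

```python
from collections import Counter


def classify_nodes(edges):
    """Label each node of a directed multigraph by its degree balance."""
    indegree, outdegree = Counter(), Counter()
    for source, target in edges:
        outdegree[source] += 1
        indegree[target] += 1
    labels = {}
    for node in set(indegree) | set(outdegree):
        diff = abs(indegree[node] - outdegree[node])
        if diff == 0:
            labels[node] = "balanced"
        elif diff == 1:
            labels[node] = "semi-balanced"
        else:
            labels[node] = "unbalanced"
    return labels


def has_eulerian_walk(edges):
    """True if degrees allow a walk using every edge exactly once.

    Assumes the graph is connected; only the degree condition is checked.
    """
    labels = list(classify_nodes(edges).values())
    return "unbalanced" not in labels and labels.count("semi-balanced") in (0, 2)


chain = [("AC", "CG"), ("CG", "GT"), ("GT", "TC")]
print(classify_nodes(chain))
print(has_eulerian_walk(chain))
```

In the chain, the two end nodes are semi-balanced (the start and end of the walk) and the inner nodes are balanced.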

56
Q

cyclical sequence

A
  • Circular genomes are not Eulerian
  • Contains an extra edge
57
Q

Why is this not Eulerian?

A
  • more than two semi-balanced nodes
  • cannot walk along each edge once

– if there was no overlap, then we would have some unconnected graphs

58
Q

de Bruijn graphs and errors

A
  • errors dramatically increase the number of edges and unconnected graphs
  • errors affect k-mer counts
  • Error correction should remove most tips and islands; rest can be removed here, leveraging graph structure
59
Q

Graph traversal algorithms are used to extract contigs (procedure)

A
  1. Select a start node
  2. Walk along the graph until a dead end or previously visited node is reached
  3. Backtrack and explore alternative paths
  4. Repeat for remaining unvisited nodes

*** walking along the graph produces strings

60
Q

how do we select a starting node?

A

using hubs, identified by their in- and out-degrees

61
Q

high coverage

A

suggests that the node is likely a true sequence rather than an error

– confidence in that overlap is good and that node is a good starting point

62
Q

How do you choose a walk

A
  1. Start at a chosen vertex (node).
  2. Mark the current vertex as visited.
  3. Explore an adjacent unvisited vertex.
  4. If no unvisited adjacent vertices exist, backtrack to the last vertex with unvisited adjacent vertices.
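These four steps describe a depth-first search. A minimal iterative sketch, assuming the graph is stored as a dictionary mapping each node to a list of neighbors (the names are hypothetical):

```python
def depth_first_walk(graph, start):
    """Return nodes in the order a depth-first search visits them."""
    visited_order = []
    seen = set()
    stack = [start]  # the stack remembers where to backtrack to
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)  # mark the current vertex as visited
        visited_order.append(node)
        # Push neighbors in reverse so the first-listed one is explored first
        for neighbor in reversed(graph.get(node, [])):
            if neighbor not in seen:
                stack.append(neighbor)
    return visited_order


toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(depth_first_walk(toy_graph, "A"))  # ['A', 'B', 'D', 'C']
```

Popping from the stack once a node has no unvisited neighbors is the backtracking step: the walk goes A, B, D, hits a dead end, and returns to explore C.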
63
Q

How do we choose the “best” path for our contig?

A
  • Long paths are desired but not always reliable due to potential repeats
  • High, consistent read coverage
  • Unique, non-branching paths
64
Q

SPAdes

A

prokaryotic genome assembler
– based on de Bruijn graphs with numerous improvements

65
Q

Error correction with BayesHammer

A
  1. Build Hamming graphs for k-mers
    – Undirected edges for Hamming distances of n nucleotide differences
  2. Identify strong k-mers based on clustering (i.e. high similarity)
    – Estimate read error based on base qualities
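To illustrate step 1 only, here is a toy Hamming graph over a handful of k-mers. It is not the tool's actual implementation: the all-pairs comparison and the `max_distance` parameter are assumptions made for the example.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_distance=1):
    """Undirected edges between k-mers within max_distance mismatches."""
    return [
        (a, b)
        for a, b in combinations(kmers, 2)
        if hamming_distance(a, b) <= max_distance
    ]


print(hamming_graph(["ACGT", "ACGA", "TTTT"]))  # [('ACGT', 'ACGA')]
```

K-mers that end up connected form clusters of near-identical sequences, which is where a likely true k-mer and its erroneous variants would sit together.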
66
Q

multisized graphs and SPAdes

A
  • building multisized graphs with different k’s
  • using multiple graphs with different sizes of K’s allows for handling of variable coverages
67
Q

large K SPAdes graphs

A
  • leads to fragmented graphs
  • good for high-coverage
68
Q

small K SPAdes graphs

A
  • leads to collapsed, tangled graphs
  • great for low-coverage regions (not too picky)
69
Q

potential bulge in SPAdes graph

A

small, alternative path in the graph that diverges and then merges back into the main path
– due to sequencing errors, repetitive sequences, or small variations (indels)

** bulges must be removed, but simply deleting them quickly deteriorates the graph and loses read info

– must project the info/coverage into Q
– P’s edges are removed in the process

70
Q

potential tip in SPAdes graph

A

a short, dead-end path in the graph that does not connect back to the main sequence or structure
– result of sequencing errors, such as incomplete reads, low coverage, or random noise, which generate k-mers that don’t correctly align with the rest of the sequence

– Removes P (shortest) and projects information onto Q

71
Q

Paired-end reads do not always cover our whole insert

A
  • If our insert (i.e. DNA sample) is longer than reads, then we don’t sequence the inner distance.
  • We want to maximize this inner distance.
  • A gap between paired reads gives us insight into repeated regions.
72
Q

SPAdes estimates…

A
  • …the gap length between 2 paired reads via de Bruijn graphs
  • does not always have to be a repeating sequence; better for gaps than unique sequences
73
Q

assembler graphs

A
  • assemblers provide contigs and scaffolds
  • island contains 1 or more contigs
  • solid lines are called nodes and represent a contig
  • each connection suggests how these contigs connect to form a scaffold

Ex/ Bandage

74
Q

gene annotation

A
  • identifying the genetic elements and function in our contigs
  • results in sequences that likely encode for proteins
  • 2 types: structural and functional

Ex/ Prokka (several outputs)

75
Q

structural annotation

A

identifies critical genetic elements such as genes, promoters, and regulatory elements

76
Q

functional annotation

A

predicts the function of genetic elements

  • normally based on protein database search
77
Q

eukaryotic VS prokaryotic annotation

A
  • Eukaryote annotation is significantly more challenging than prokaryotes
  • Introns and alternative splicing complicate eukaryote annotation

P: probabilistic models to identify open reading frames
E: accuracy demands supporting evidence like mRNA sequencing

78
Q

Identifying open reading frames (ORFs)

A
  1. Seek the standard start codons: ATG, GTG, or TTG
  2. Seek the stop codons based on the translation table
    — TAA, TAG, TGA for bacteria, archaea, and plant plastids

***then score the potential ORFs
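The start/stop scan can be sketched as follows. This toy version looks only at the forward strand, reports the first start codon in each frame before a stop, and does no scoring; the `min_codons` parameter and function name are assumptions for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=1):
    """Find candidate ORFs in the three forward reading frames.

    Returns (start, end, orf) tuples; `end` is exclusive and the
    returned ORF includes its stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                # Codons before the stop, counting the start codon itself
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, sequence[start:i + 3]))
                start = None  # close it and look for the next start
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

A real gene finder would also scan the reverse complement and then rank the candidates using the scores described on the following cards.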

79
Q

ribosomal binding site motif score

A
  • RBS score computed from dataset fitting
    – Search for RBS motif after start codon; choose whichever has the lowest bin number
    – take the training data from different annotated genomes to get computed frequency of RBS motif bin in the entire sequence (baseline) and RBS frequency
  • start codon score given by similar RBS framework
80
Q

upstream score

A
  • Upstream score based on base analysis
    – By analyzing base frequency in specific upstream region, their annotation results improved
    **essentially looking for promoters
81
Q

coding score

A
  • computed based on gene enrichment parameters
  • computed from the frequency of nucleotide hexamers, called words
    – probability of observing word within single genes [G(w)]
    – probability of observing word within the whole genome/entire DNA sequence [B(w)]
82
Q

why is sequence alignment important for bioinformatics?

A
  • Biological sequences reveal evolutionary relationships
  • Sequences play a large role in the central dogma of molecular biology
83
Q

hox genes

A
  • highly conserved genes
  • Play a crucial role in embryonic development, particularly in determining the body plan and specifying the anterior-posterior axis

***So how do we know that it is highly conserved?
– By aligning sequences!!
– infrequent changes (high similarity) indicate evolutionarily conserved sequences

84
Q

pairwise alignment

A

– reveals relationships between biological sequences
– Multiple Sequence Alignment (MSA) extends pairwise

85
Q

Multiple Sequence Alignment (MSA)

A
  • the process of aligning 3 or more biological sequences simultaneously

– Identifies conserved regions across multiple species
– Reveals patterns not visible in pairwise comparisons

86
Q

Aligning sequences can provide more insight than just conservation

A
  • Functional annotation (like a Google search through databases)
  • RNA and protein structure (ex/ AlphaFold)
  • Disease-associated mutations
  • Vaccine design
87
Q

Importance of scoring in alignment selection

A

Alignment scores guide the selection of meaningful alignments

  1. objectivity
  2. optimization
  3. significance
88
Q

objectivity importance for alignment selection

A

provides a quantitative measure for comparison

89
Q

optimization importance for alignment selection

A

allows algorithms to find the best alignment

90
Q

significance importance for alignment selection

A

helps distinguish real homology from random similarity

91
Q

match

A
  • identical characters in aligned positions
  • Represents conserved regions or no change
92
Q

alignment elements reflect…

A

evolutionary events in sequences
- matches, mismatches, gaps

93
Q

mismatch

A
  • different characters in aligned positions
  • Indicates substitutions or mutations
94
Q

gap

A
  • dash(-) inserted to improve alignment
  • Represents insertions and deletions (indels)
95
Q

linear gap penalty

A
  • fixed cost for each gap position, so the total penalty grows linearly with gap length

Ex/ -2 for each gap position, regardless of whether it opens or extends a gap

96
Q

affine gap penalty

A
  • different costs for opening and extending gaps

Ex/ gap open= -4, gap extend= -1
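A quick numeric comparison of the two schemes, using the example values from the cards (-2 per position for linear; open -4, extend -1 for affine). Two conventions are assumed here: the linear penalty is charged per gap position, and the affine opening penalty covers the first gap position with each additional position costing the extension penalty. Other conventions exist.

```python
def linear_gap_cost(length, per_position=-2):
    """Linear scheme: every gap position costs the same."""
    return per_position * length


def affine_gap_cost(length, gap_open=-4, gap_extend=-1):
    """Affine scheme: opening is expensive, extending is cheap.

    Assumed convention: gap_open pays for the first position and each
    further position adds gap_extend.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 5, 10):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
```

A single-base gap is cheaper under the linear scheme (-2 vs -4), but a 10-base indel costs -20 under linear and only -13 under affine, so affine scoring favors one long gap over many scattered short ones.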

97
Q

gap penalties

A

** reflect biological assumptions and impact alignment outcomes

98
Q

Implications of Gap Penalty Types

A

1.) Linear penalties:
– Simpler to implement
– May over-penalize long gaps
2.) Affine penalties:
– Better handling of long indels
– More biologically realistic
3.) Biological rationale:
– Single mutation event often causes multi-base indel
– Affine penalties better model this biological reality

99
Q

Sophisticated scoring approaches (gap penalties)

A

***Advanced scoring methods enhance alignment accuracy

1.) Position- specific gap penalties:
– Reduce penalties in variable regions
– Increase penalties in conserved regions

2.) Residue-specific gap penalties:
– Adjust penalties based on amino acid properties

3.) Terminal gap penalties:
– Often reduced to allow end gaps in local alignments

100
Q

Protein alignments that require sophisticated scoring systems

A

***Simple match/mismatch scoring is insufficient bc:

  1. Some amino acid substitutions are more likely than others
  2. Chemically similar amino acids often substitute without affecting function
  3. Evolutionary relationships between amino acids are complex
101
Q

substitution matrices

A

*quantify amino acid replacement probabilities

  • probability that amino acid i mutates into amino acid j, for all pairs of amino acids
  • Constructed by assembling a large and diverse sample of verified amino acid alignments
  • Reflect the true probabilities of mutations occurring through a period of evolution
102
Q

global alignment

A
  • compares sequences in their entirety aka from START to END

***Needleman-Wunsch

103
Q

Key Characteristics of Global Alignment

A
  • Attempts to align every residue in both sequences
  • Introduces gaps as necessary to maintain end-to-end alignment
  • Optimizes the overall alignment score for the entire sequences
104
Q

Needleman-Wunsch

A
  • guarantees optimal global sequence alignment
  • Final number = final alignment score
  • traceback to find the best alignment

***Look at every possible move you can make to get into that cell
– Diagonal = mismatch/match
– Horizontal/vertical = gap
– MATCH ⇒ diagonal

*** There can be multiple optimal alignments
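The fill-then-traceback idea can be sketched compactly. The scoring values (match +1, mismatch -1, linear gap -2) are arbitrary choices for the example, and when several optimal alignments exist this traceback simply returns one of them, preferring the diagonal move.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column are filled with accumulated gap penalties
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # diagonal
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    # Traceback from the bottom-right cell to the top-left cell
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Three matches (+3) and one gap (-2) give the final score of 1 in the bottom-right cell.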

105
Q

Advantages of global alignment (needleman-wunsch)

A
  • Provides a complete picture of sequence similarity
  • Ideal for detecting overall conservation patterns
  • Useful for phylogenetic analysis of related sequences
106
Q

Limitations of Global Alignment

A
  • May force alignment of unrelated regions in divergent sequences
  • Less effective for sequences of very different lengths
  • Can be computationally intensive for long sequences
107
Q

local alignment

A
  • identifies best matching subsequences
  • focuses on finding regions of high similarity within sequences
  • Does not require aligning entire sequences end-to-end
  • Allows for identification of conserved regions or domains
108
Q

Key Characteristics of Local Alignment (Smith-Waterman)

A
  • Aligns subsections of sequences
  • Ignores poorly matching regions
  • Can find multiple areas of similarity in a single comparison
109
Q

Smith-Waterman

A
  • zero is the lowest possible score
  • Start alignment at the highest cell
  • Stop aligning when you encounter a zero
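The three rules above translate into a short sketch. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary example choices, and only the first highest-scoring cell is traced back.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,  # scores never drop below zero
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback starts at the highest cell and stops at the first zero
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

Only the shared ACGT core is reported; the poorly matching flanks are ignored because their cells reset to zero.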
110
Q

Needleman-Wunsch VS Smith-Waterman

A

1.) Matrix initialization:
NW: the first row and column are filled with gap penalties
SW: first row and column filled with zeros

2.) Scoring system:
NW: allows negative scores
SW: negative scores are set to zero

3.) Traceback:
NW: starts from the bottom-right cell
SW: starts from the highest scoring cell in the matrix

111
Q

Protein motif identification

A
  • exemplifies local alignment utility
  • can identify functional regions:
    – protein domains
    – active sites
    – binding motifs
    – signal sequences
    – post-translational modification sites
112
Q

Multiple Sequence Alignment

A
  • Compares three or more sequences simultaneously

Definition of MSA: arranges 3 or more biological sequences (DNA, RNA, or protein) to identify regions of similarities

  • Aims to infer structural, functional, or evolutionary relationships among the sequences

Ex/ Clustal Omega, MAFFT, and MUSCLE

113
Q

Key Characteristics of MSA

A
  • Aligns multiple sequences in a single analysis
  • Introduces gaps to maximize alignment of similar characters
  • Preserves the order of characters in each sequence
114
Q

Transcriptomics

A
  • A real-time microscope
  • Allows us to see exactly what genes are active at a given moment
  • Can see gene expression changes over time
115
Q

transcriptomics process

A
  • studies the complete set of RNA transcripts
  • including mRNA, rRNA, tRNA, non-coding RNA
116
Q

mRNA

A

instructions for protein synthesis

117
Q

rRNA

A

forms part of the ribosome structure

118
Q

tRNA

A

helps translate the genetic code into proteins

119
Q

non-coding RNA

A
  • play regulatory roles in the cell
120
Q

genome VS transcriptome

A

G: relatively static

T: constantly changing and captures the cell’s response to its environment and internal signals

*** The dynamic nature of the transcriptome reflects the functional state of the cell

121
Q

transcriptome can reflect:

A
  • cell type
  • developmental stage
  • environmental conditions

*** allows us to see which annotated genes are actually being used

122
Q

cell type

A

a neuron will have a different gene expression profile than a liver cell

123
Q

developmental stage

A

The genes active in an embryo differ from those in an adult

124
Q

environmental conditions

A

cells respond to stress, nutrients, or pathogens by changing gene expression

125
Q

transcriptomics is versatile

A
  • developmental biology
  • disease research
  • drug discovery
  • ecology
126
Q

developmental biology

A
  • understanding cell differentiation

Which genes are expressed in a specific cell type or condition?

127
Q

disease research

A
  • identifying pathological gene expression patterns

What are the differences in gene expression between healthy and diseased states?

128
Q

Drug discovery

A
  • revealing mechanisms of action and side effects

How does gene expression change over time or in response to stimuli?

129
Q

ecology

A
  • studying organism-environment interactions

How do environmental factors influence gene expression?

130
Q

isoforms

A

A single gene can produce multiple mRNA transcripts (isoforms)

131
Q

transcriptomics reveals…

A
  • reveals alternative splicing and isoforms
  • One of the main ways organisms can increase protein diversity without increasing the number of genes
132
Q

single-cell transcriptomics

A
  • revolutionizes resolution
  • Captures gene expression in individual cells (overall purpose)
  • Reveals cellular heterogeneity within tissues
  • While powerful, data is sparse and noisy
    – Not very reproducible bc there is very little RNA in cell
    – Often paired with bulk RNA analysis

*** most beneficial for rare cell types within complex tissues

133
Q

spatial transcriptomics

A
  • maps gene expression to location
  • Preserves spatial information of transcripts within tissue sections
  • Reveals how cellular neighborhoods influence gene expression
134
Q

functional insights from genomics

A
  • Identifies potential functional elements
  • Predicts disease risk
135
Q

functional insights from transcriptomics

A
  • Reveals which elements are active
  • Shows disease state
136
Q

temporal insights from genomics

A
  • Requires one-time sampling
  • Reveals evolutionary history
137
Q

temporal insights from transcriptomics

A
  • captures real-time cellular responses
138
Q

RNA integrity number

A
  • Assess RNA integrity
    – rRNA makes up a large share (~85%) of our RNA

** Based on the ratio of 28S and 18S rRNA vs. all RNA

18S is the small-subunit rRNA peak (left of 28S); partially degraded 28S RNA shifts signal toward smaller sizes
28S is largest peak (furthest to the right of graph)

139
Q

mRNA enrichment focus

A
  • focuses sequencing on protein-coding transcripts

Enrichment method affects:
– Gene expression measurements
– Detection of non-coding RNAs
– Identification of immature transcripts

140
Q

How could we filter our sample for only mRNA?

A
  • An oligo(dT) primer that binds the poly(A) tail allows amplification of only mRNA
  • Poly(A) selection captures mature mRNAs
141
Q

Reverse transcription introduces unique challenge

A
  • RNA is converted to cDNA using reverse transcriptase
  • Random or oligo(dT) primers influence transcript representation
  • Second-strand synthesis method can preserve strand information
142
Q

microarrays

A

*detect gene expression
- cell sample is cultured
- mRNA is isolated
- reverse transcription to cDNA
- hybridize cDNA probes to oligo sequences on microarray

  • no longer in practice
    *** require previous knowledge/info input to reference
143
Q

Caveats of Microarrays

A
  • Limited to known sequences: can only detect pre-defined sequences
  • Cross-hybridization: similar sequences may cause false positives
  • Limited dynamic range: may miss very low or high abundance transcripts
  • Normalization challenges: complex process, potential for bias
144
Q

The primary advantage of RNA-sequencing over microarray technology?

A

Does not require prior knowledge/information

145
Q

RNA-seq

A
  • Now we just use the cDNA
  • RNA-seq doesn’t require prior knowledge of sequences
    – Enables discovery of novel transcripts and isoforms
146
Q

Computational Pipeline for RNA-seq data analysis Outline

A
  1. Read Alignment: Mapping Transcripts to the Genome
  2. Quantification: Measuring Gene Expression Levels
  3. Differential Expression Analysis: Identifying Key Genes
  4. Dimensionality Reduction: Visualizing Complex Data
147
Q

Read Alignment: Mapping Transcripts to the Genome

A
  • Consideration of splice junctions and gene isoforms
  • Needs to account for known and novel splice sites
  • Requires specialized alignment algorithms (e.g. STAR, HISAT2)
148
Q

Quantification: Measuring Gene Expression Levels

A
  • Counting aligned reads with HTSeq or featureCounts
  • Transcript-level quantification with Salmon or Kallisto
  • Normalization methods: ex/ TPM (transcripts per million)
  • Distinguishing between different isoforms of the same gene
149
Q

Differential Expression Analysis: Identifying Key Genes

A
  • Compares gene expression levels
  • Statistical testing with DESeq2 or edgeR
  • Visualization of results (volcano plots)
  • Clustering of differentially expressed genes
  • Results in list of up- and down-regulated genes
150
Q

Dimensionality Reduction: Visualizing Complex Data

A
  • Reduces high-dimensional data to 2D or 3D for visualization
  • Reveals patterns and clustering in the data
  • Techniques include PCA, t-SNE, and UMAP
  • Widely used for visualization, but requires extreme caution and is not generally recommended for analysis

** reduced dimensions do not represent the data accurately; never analyze from reduced dimensions

151
Q

challenge of aligning short reads to large reference genomes

A
  • dealing with enormous data sets
  • millions of base pairs
  • hundreds of GB of data (most computers have 8-12 GB of memory)
152
Q

3 different alignment algorithm strategies

A
  • Hash tables
  • Suffix arrays/trees
  • Burrows-Wheeler transforms
153
Q

Hash table

A
  • Hash tables link a key to a value
    – Keys represent a “label” we can use to get information
  • A “hash function” determines where to find the value stored for a key
  • converts labels to table indices

*** connects information to data in memory via hash function

ex/ like a phone book

154
Q

hash table for k-mer location

A
  • hashing our reference genome seeds our hash table with k-mer locations
  • provides quick lookups of our reference genome
  • query a k-mer read to get indices of our possible reference genome locations
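
A minimal sketch of the k-mer lookup idea, using a Python dict as the hash table. The function name `build_kmer_index` and the toy reference string are assumptions for illustration only.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

# Toy reference (hypothetical); real genomes are billions of bases long.
reference = "ACGTACGTGA"
index = build_kmer_index(reference, k=3)

# Querying a k-mer returns every candidate location in the reference.
print(index.get("ACG", []))  # positions where "ACG" starts
print(index.get("TTT", []))  # k-mer absent from the reference
```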
155
Q

Seed-and-extend in hash-based alignment

A
  • determine k-mer strings
  • use hash table for rapid lookup of potential matches
    – *** multiple seeds increase chance of finding correct location
  • extend by starting from seed match and grow in both directions with reference genome

** check to see if we can align to reference

*Always go forward, but have to check backward if hit is not at the start of the sequence
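
The seed-and-extend steps can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the seed length `k`, and the mismatch limit are hypothetical, and the "extend" step is a plain position-by-position comparison (no gaps) rather than a real dynamic-programming extension.

```python
from collections import defaultdict

def seed_and_extend(read, reference, k=4, max_mismatches=1):
    """Return reference start positions where the read aligns with few mismatches."""
    # Hash the reference: k-mer -> list of start positions
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)

    hits = set()
    # Take several non-overlapping seeds from the read
    for offset in range(0, len(read) - k + 1, k):
        for ref_pos in index.get(read[offset:offset + k], []):
            # A seed taken at `offset` implies the read starts `offset` bases earlier
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend in both directions by comparing the whole read to the reference window
            window = reference[start:start + len(read)]
            mismatches = sum(r != w for r, w in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

In the toy case of a read with one error in its first seed, the second seed still finds the location, which is why multiple seeds help.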

156
Q

Hash-Based Alignment: Divide and Conquer

A

A “DNA dictionary” with a quick lookup and direct access to potential matches

157
Q

pros/cons of hash-based alignment

A

Pros:
- Easily parallelizable
- Flexible for allowing mismatches
- Conceptually simple
Cons:
- Large memory footprint for index
- Can be slower for very large genomes

158
Q

suffix trees

A
  • represent all suffixes of a given string
  • used to find starting index of suffix
159
Q

suffix arrays

A
  • memory efficient alternatives to trees
  • requires less memory but is also less powerful
  1. create all suffixes
  2. sort lexicographically
  3. store the starting index of each sorted suffix
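
A minimal sketch of a suffix array and a pattern lookup on it. The naive construction (sorting whole suffixes) is an assumption made for readability; practical tools use faster construction algorithms. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Sort all suffixes lexicographically and keep only their starting indices."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for those that start with the pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])
```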
160
Q

Burrows-Wheeler Transform (BWT) purpose

A

Compression reduces the amount of data we have to store

  • groups repeated characters together (like a lexicographic sort) without losing the original data; a plain sort would create the same repeats but lose the original string
161
Q

BWT workflow

A
  1. Append a unique end-of-string (EOS) marker to the input string.
  2. Generate all rotations of the string.
  3. Sort these rotations lexicographically
  4. Extract the last column of the sorted matrix as the BWT output.

** First column is more compressible but we lose context and reversibility
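
The four workflow steps map directly onto a few lines of code. A minimal sketch, assuming `$` as the EOS marker and a naive rotation sort (fine for short strings, too slow for genomes):

```python
def bwt(text, eos="$"):
    """Burrows-Wheeler transform via sorted rotations."""
    if eos in text:
        raise ValueError("EOS marker must not occur in the input")
    s = text + eos                                              # 1. append EOS marker
    rotations = [s[i:] + s[:i] for i in range(len(s))]          # 2. all rotations
    rotations.sort()                                            # 3. sort lexicographically
    return "".join(row[-1] for row in rotations)                # 4. last column

print(bwt("banana"))
```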

162
Q

BWT reverse

A
  1. Write BWT output vertically
  2. Sort output lexicographically.
  3. Prepend the BWT output to the front of the sorted rows.
  4. Repeat sort and prepend
  5. Repeat until the length of each row equals the length of the output
  6. The string that ends with EOS marker is the original string.
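
A minimal sketch of the reversal, assuming `$` as the EOS marker. It prepends the BWT column to the rows and re-sorts, once per character, which is the naive version of the procedure above.

```python
def inverse_bwt(transformed, eos="$"):
    """Rebuild the original string from its BWT by repeated prepend-and-sort."""
    rows = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column to every row, then sort the rows
        rows = sorted(ch + row for ch, row in zip(transformed, rows))
    # The row that ends with the EOS marker is the original string
    original = next(row for row in rows if row.endswith(eos))
    return original[:-1]

print(inverse_bwt("annb$aa"))
```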
163
Q

Backward Search Algorithm for BWT

A
  • efficiently finds occurrences of a pattern in a text using the LF-mapping
  • number the F (first) and L (last) columns
  • find F rows that have last letter of search string
  • note which rows have the next letter in the L-column
  • repeat until first letter
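
A minimal sketch of backward search that counts pattern occurrences. Assumptions: `$` is the EOS marker, the BWT is built naively from sorted rotations, and the occurrence counts are recomputed by scanning the L column (a real FM-index precomputes them). The function name is hypothetical.

```python
def backward_search_count(text, pattern, eos="$"):
    """Count occurrences of pattern in text using the F and L columns."""
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rotations)   # F column
    last = "".join(row[-1] for row in rotations)   # L column (the BWT)

    # Row where each character's block begins in the F column
    first_row = {}
    for i, c in enumerate(first):
        first_row.setdefault(c, i)

    def occ(c, i):
        # Number of times c appears in L[:i]
        return last[:i].count(c)

    lo, hi = 0, len(s)  # half-open range of matching rows
    for c in reversed(pattern):  # start from the last letter of the pattern
        if c not in first_row:
            return 0
        # LF-mapping: jump from rows with c in L to the matching rows in F
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```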
164
Q

suppose we have isolated a normal and cancerous cell. We want to identify possible drug targets based on overexpressed genes

A

use transcriptomics

165
Q

normalizing the transcriptome

A
  • must normalize before making comparisons between transcriptomes (samples differ in total size)
  • take the ratio of normal to cancerous cell expression for each transcript
166
Q

Scaling data to “parts per million”

A
  • Transcript fractions and ratios are very small numbers
  • Small floats require high precision and thus memory
  • This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers
167
Q

reads per kilobase (RPK)

A
  • corrects experimental biases where longer transcripts will have more reads
  • corrects through normalization of gene length (more exons)

RPK = (read counts for gene) / (gene length in kilobases)

168
Q

Reads per kilobase of transcript per million reads mapped (RPKM)

A

RPKM = 10^9 * (reads mapped to transcript) / (total mapped reads * transcript length in bp)
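
The RPK and RPKM formulas, plus TPM built from RPK, can be checked with a small sketch. The function names and the example counts are assumptions for illustration; lengths are in base pairs.

```python
def rpk(read_count, length_bp):
    """Reads per kilobase: read count divided by length in kilobases."""
    return read_count / (length_bp / 1_000)

def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (total_mapped_reads * length_bp)

def tpm(counts, lengths_bp):
    """Transcripts per million: RPK values rescaled so they sum to one million."""
    rpks = [rpk(c, l) for c, l in zip(counts, lengths_bp)]
    scale = sum(rpks) / 1e6
    return [value / scale for value in rpks]

# Hypothetical example: 200 reads on a 2 kb transcript, 1 million mapped reads in total
print(rpk(200, 2_000))
print(rpkm(200, 2_000, 1_000_000))
```

In the TPM sketch, three transcripts whose read counts scale exactly with their lengths end up with equal TPM values, which is the length correction at work.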

169
Q

traditional quantification and read mapping

A
  • assigns a read to single transcript using read mapping algorithms
  • once aligned we can count the number of mapped reads to each transcript
170
Q

BOWTIE 2

A

Uses BWT to map reads to a reference; the mapped reads can then be counted for quantification

171
Q

Spliced Transcripts Alignment to a Reference (STAR)

A
  • Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments
  • finds the prefix that perfectly matches reference then repeats for unmatched regions
  • automatically detects junctions instead of relying on databases
172
Q

Alignment-based methods are computationally expensive
SO

A

Alignment-based methods need to determine the read’s exact position in the transcript

173
Q

Pseudoalignment

A
  • finds which transcript, but not where
    – Identifies which transcripts are compatible with the read, skipping the precise location step

** skips the full alignment process

  • Instead of mapping each read to a specific position, pseudoalignment identifies which transcripts are compatible with a given read
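
A toy sketch of the "which transcript, but not where" idea: intersect the sets of transcripts that contain each k-mer of the read. This is only an illustration of compatibility, assuming a hypothetical function name, toy sequences, and exact k-mer matching; it is not how any particular tool is implemented.

```python
def pseudoalign(read, transcripts, k=5):
    """Return the set of transcript names compatible with every k-mer of the read."""
    if len(read) < k:
        return set()
    # Map each k-mer to the set of transcripts that contain it
    kmer_to_transcripts = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            kmer_to_transcripts.setdefault(seq[i:i + k], set()).add(name)
    # Intersect the transcript sets of all k-mers in the read (no positions kept)
    compatible = set(transcripts)
    for i in range(len(read) - k + 1):
        compatible &= kmer_to_transcripts.get(read[i:i + k], set())
    return compatible

# Two hypothetical isoforms that share their first nine bases
isoforms = {"t1": "ATGGCGTACGTTAGC", "t2": "ATGGCGTACCCGATA"}
print(pseudoalign("GGCGTAC", isoforms))  # read from the shared region
print(pseudoalign("ACGTTAG", isoforms))  # read from the t1-specific region
```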
174
Q

alignment VS pseudoalignment

A

Alignment → specifies where exactly in the transcript this read came from (at position ___)

Pseudoalignment → specifies that it came somewhere from this transcript (compatible)

175
Q

Pros and cons of pseudoalignment

A

Pros: faster and less resource-intensive than alignment-based methods

Cons: It may lack certain details, such as the position and orientation of reads, which are useful for correcting technical biases

176
Q

generative model

A
  • statistical model that explains how the observed data are generated from the underlying system
  • Defines a computational framework that produces sequencing reads from a population of transcripts
177
Q

Salmon

A
  • mathematically defines a transcriptome by its individual transcripts and their counts
  • take nucleotide fractions by taking into account the effective length of each transcript
  • tells us how much of the total RNA pool comes from each transcript
  • tries to identify distributions of reads amongst the transcripts
  • Matrix computationally assigns fragments to transcripts
  • Salmon looks for parameters with the lowest errors (generative model)
178
Q

Converting salmon to relative abundances

A
  • The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i

*** normalizes the nucleotide fraction by the effective length (Ti)

  • adjusts for the longer transcripts generating more reads
179
Q

transcript fraction

A

proportion of total RNA molecules in the sample that come from a certain transcript (i)

180
Q

Transcript-Fragment Assignment Matrix

A
  • Z is binary matrix where all values are 0 or 1
  • M transcripts (rows)
  • N fragments (columns)

** shows if fragment is assigned to transcript

181
Q

conditional probability notation

A

P(a|b)
- what is the probability of a occurring if b is true

** we want to optimize values to get the highest probability

182
Q

Fragment Probabilities P(fj|ti)

A

Conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases

**SALMON quasi-mapping: probability is approximated based on transcript compatibility rather than exact positions

183
Q

positional bias in Salmon

A
  • Fragments that include transcript ends might be too short
  • Fragments from central regions are more likely to be of optimal length for sequencing reads
  • A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
184
Q

Overcoming GC content bias in Salmon

A
  • Undersample GC regions
  • Make good stop codons
  • Oversample AT rich regions
185
Q

2 Phase Inference in Salmon

A
  1. Online phase: makes fast, initial estimates of transcript abundances
  2. Offline phase: refines these initial estimates using more complex optimization techniques

This 2-phase approach balances speed (in the online phase) with accuracy (in the offline phase)

186
Q

quasi-mapping

A

Quasi-mapping is a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts

  • early stopping of read mapping
  • alignment is expensive, so Quasi-mapping stops after identifying seeds

n_t = (# of fragments mapping to t) / (total # of fragments)

187
Q

Iteratively update parameters based on mini batches

A
  • based on mini batches
  • Offline phase fine tunes transcript abundance
  • After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm

** ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase

188
Q

Expectation-Maximization (EM) algorithm

A

ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase

189
Q

likelihood of data for Salmon

A
  • central to the inference process in Salmon
  • probability of observing the entire set of Fragments, given the transcriptome and nucleotide fractions
  • optimize the estimates of alpha, a vector of the estimated number of reads originating from each transcript

*** goal is to maximize this likelihood to infer the most likely values of n, which correspond to the relative abundances of transcripts

190
Q

Maximum Likelihood Estimation (MLE)

A

The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)

191
Q

Why the EM Algorithm Maximizes the Likelihood:

A

EM algorithm breaks down a difficult problem into 2 simpler problems:
– E-step: estimate the missing information (the assignment of fragments to transcripts) using the current transcript abundance estimates

– M-step: use the estimated assignments to update the transcript abundances, improving the likelihood
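
The E-step/M-step loop can be sketched on toy data. This is a deliberately simplified illustration, not Salmon's actual implementation: it assumes each fragment comes with a list of compatible transcript indices, and it ignores effective length and bias terms. The function name and input layout are hypothetical.

```python
def em_abundances(compatibility, n_transcripts, n_iter=100):
    """Estimate the fraction of fragments originating from each transcript.

    compatibility: one list per fragment holding the indices of the
    transcripts that fragment is compatible with.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from uniform abundances
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each fragment among its compatible transcripts,
        # in proportion to the current abundance estimates
        for compatible in compatibility:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                expected[t] += theta[t] / total
        # M-step: update abundances from the expected fragment assignments
        theta = [e / len(compatibility) for e in expected]
    return theta

# Hypothetical data: 6 fragments unique to transcript 0, 2 unique to transcript 1,
# and 4 ambiguous fragments compatible with both
fragments = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
print(em_abundances(fragments, n_transcripts=2))
```

The ambiguous fragments end up shared in the same 3:1 proportion as the uniquely assigned ones.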

192
Q

EM algorithm and likelihood

A

For each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimate until it reaches a maximum

193
Q

What is Differential Gene Expression?

A

the process of identifying and quantifying changes in gene expression levels between different sample groups or conditions

194
Q

DGE workflow

A
  1. Sample collection: gather samples from different conditions (e.g. healthy or diseased)
  2. RNA sequencing (RNA-seq): Quantify gene expression level using high-throughput sequencing technologies
  3. Read Mapping and Quantification: align RNA-seq reads to a reference genome and quantify expression (e.g, using Salmon)
  4. Statistical Analysis: Identify genes with significant expression differences between conditions.
195
Q

Case Study: Breast Cancer
(DGE)

A

Objective:
– Identify genes differentially expressed between triple-negative breast cancer (TNBC) and hormone receptor-positive breast cancer

Findings:
– TNBC shows upregulation of genes involved in cell proliferation and metastasis

Implication:
– Targets for specific therapies
– Improved classification and prognosis of breast cancer subtypes

*** DGE provides statistical tools to identify changes between samples

196
Q

statistical model

A

A mathematical tool that describes how data is generated

***help us to make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance

197
Q

what do statistical models help to answer?

A

It helps us answer:
– Is there an apparent difference in gene expression between 2 conditions?
– If so, is it real, or could it have happened by random chance or experimental flaws?

198
Q

hypothesis testing

A
  • perform hypothesis testing to see if the difference in expression between conditions is statistically significant.
199
Q

2 types of hypotheses

A

null (H0)
alternative (H1)

200
Q

Null Hypothesis (H0)

A

There is no difference in gene expression between the 2 conditions.

201
Q

Alternative Hypothesis (H1)

A

There is a significant difference in gene expression between the conditions.

202
Q

when do you reject the null hypothesis?

A

We reject the null hypothesis when our statistical test shows that the observed difference, if any, is unlikely to have happened by random chance.

203
Q

p-value

A
  • the probability of seeing a difference at least as large as the one observed if the null hypothesis were true

How likely is it that a difference this large arises from random chance alone (i.e. “getting lucky”)?

204
Q

higher VS lower p-value

A
  1. The higher the p-value, the more consistent our data are with the null hypothesis
  2. The lower the p-value, the stronger the evidence for the alternative hypothesis
205
Q

Differential gene expression uses statistical models for hypothesis testing

A

Ensures that we are not biasing our data or our interpretation

206
Q

count data

A

RNA-seq generates count data: the number of RNA fragments that map to each gene

207
Q

discrete data

A
  • data that can only take specific values (ex/ only whole numbers)

** In RNA-seq, we measure the number of reads mapped to a gene, so the data are count-based

  • requires special statistical tools
  • cannot use normal distribution bc it requires continuous data
208
Q

Binomial distribution

A

models the number of successes in a fixed number of independent trials, where each trial has the same probability of success

** simple model for discrete counts

209
Q

Limitations of Binomial distributions for RNA-seq

A
  1. MAIN limitation → assumes that the probability of success is constant between samples
  2. Smaller limitation 1 → The number of possible trials can be very large, especially when sequencing at a high depth.
  3. Smaller limitation 2 → The probability of expression is very small for many genes because they are either lowly expressed or not expressed at all.
209
Q

Poisson distribution

A

a statistical tool used to model the number of events (or counts) that happen in a fixed period of time or space, where:
– The events are independent of each other
– Each event has a constant average rate

  • A baseline for modeling discrete counts
    *** simplifies computation and allows for varying probabilities
    *provides an accurate distribution of counts if your mean and variance are approximately equal
210
Q

parity plots with mean and variance

A
  • show deviations from the Poisson distribution
    Mean = variance line
    ***Higher counts typically have larger variance
211
Q

Overdispersion in RNA-Seq

A

when the variance in the data is larger than what is predicted by simpler models (e.g. Poisson distribution)

** may reflect biological variability between samples not captured by the experimental conditions

212
Q

biological variability between samples not captured by the experimental conditions

A

Differences in RNA quality
Sequencing depth
Biological factors like different cell types within the same tissue

213
Q

Expected variance for Poisson-distributed data

A
  • equals the mean: Variance = μ

*** Variance is often larger than the mean for RNA-Seq: Variance > μ

214
Q

Negative Binomial distribution accounts for high dispersion

A

If alpha = 0, the Negative Binomial distribution reduces to the Poisson distribution.
- negative binomial distribution models overdispersed count data (variance exceeds mean)

  • if the event rate is low, it simplifies to a Poisson distribution (which models the number of successes)
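
A small sketch for checking overdispersion in one gene's counts. It assumes the common parameterization Variance = μ + alpha * μ^2 (so alpha = 0 gives the Poisson case) and estimates alpha by a simple method of moments; real tools use more careful estimators. The function name and the counts are hypothetical.

```python
from statistics import mean, variance

def estimate_dispersion(counts):
    """Return (mean, sample variance, method-of-moments dispersion alpha)."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    if mu == 0:
        return mu, var, 0.0
    # Solve var = mu + alpha * mu**2 for alpha; clip at zero when var <= mu
    alpha = max(0.0, (var - mu) / mu ** 2)
    return mu, var, alpha

# Hypothetical replicate counts for one gene
print(estimate_dispersion([10, 30, 50, 20, 40]))  # variance well above the mean
print(estimate_dispersion([9, 10, 11, 13, 7]))    # variance not above the mean
```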
215
Q

The Challenge of zeros in RNA-seq data

A
  • RNA-seq data frequently contains zero counts for some genes because not all genes are expressed under all conditions.
  • Most statistical models account for variance, but not that 0’s can dominate counts

Ex/ even with a high expected mean under a Poisson distribution, we can still observe zeros or very low counts (zero = gene turned off).

In these circumstances, we have to use zero-inflated models.

216
Q

Why are statistical models important in RNA-seq?

A

RNA-seq data = messy: counts vary, lots of zeros, and data has no simple patterns

We need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way

217
Q

maximum likelihood estimation (MLE) for optimization algorithms

A
  • used to estimate the parameters μ (mean) and α (dispersion) for each gene

** MLE tries to find the model parameters that make the observed counts most likely
Adjusts the model until the predicted counts match the actual counts as closely as possible (i.e. minimize the error)

218
Q

Wald test

A

statistical test that helps us to determine whether estimated log fold change between 2 conditions is significantly different from zero.

219
Q

Log Fold Change (β1) = 0

A
  • means that the gene is expressed at the same level in both conditions.
  • null = log fold change between conditions is 0. no difference in expression
  • alternative = log fold change between conditions is not zero (there is a difference in expression).
220
Q

Estimate Parameters from the Negative Binomial Model

A
  • gives us an estimated log fold change (β1) for each gene
  • Also gives us standard error (SE) for this estimate, which tells us how uncertain we are about the estimate of log fold change [ SE(β1) ]
221
Q

Wald Statistic

A

tells us how many standard errors the estimated log fold change is away from zero (no difference = 0)
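
A minimal sketch of the Wald statistic and its two-sided p-value, assuming the statistic is compared against a standard normal distribution. The function name and the example numbers are hypothetical.

```python
import math

def wald_test(log_fold_change, standard_error):
    """Return the Wald statistic and a two-sided p-value under a standard normal."""
    z = log_fold_change / standard_error
    # Two-sided tail probability: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical gene: estimated log fold change 2.0 with standard error 0.5
z, p = wald_test(2.0, 0.5)
print(z, p)
```

The same log fold change with a larger standard error gives a smaller statistic and a larger p-value, which is how a big fold change can still be insignificant.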

222
Q

likelihood ratio test (LRT)

A

Idea is to compare the likelihood of data under:
– The null model (same expression in both conditions)
– The alternative model (different expression levels in each condition)
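
A minimal sketch of the comparison, assuming the two models differ by one parameter so the statistic is compared to a chi-square distribution with 1 degree of freedom. The function name and the log-likelihood values are hypothetical.

```python
import math

def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return the LRT statistic and its p-value for 1 degree of freedom."""
    # Statistic: twice the gain in log-likelihood from the alternative model
    statistic = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Chi-square (df = 1) tail probability: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical fitted log-likelihoods for one gene
stat, p = likelihood_ratio_test(loglik_null=-105.0, loglik_alt=-100.0)
print(stat, p)
```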

223
Q

volcano plot

A

displays the relationship between each gene’s statistical significance (p-value) and the magnitude of change (fold change)

224
Q

volcano plot interpretation

A

Top corners: genes with high significance and large fold changes (both upregulated and downregulated)

Center: genes with little to no change or low significance

225
Q

MA Plots

A

visualizes the relationship between the average expression (A) and the log fold change (M) for each gene
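
The two axes can be computed per gene as below. A minimal sketch, assuming expression values are positive (e.g. normalized counts with a pseudocount already added) and that M is taken as treated over control; the function name is hypothetical.

```python
import math

def ma_values(control, treated):
    """Return (M, A): log2 fold change and average log2 expression."""
    m = math.log2(treated) - math.log2(control)          # M: log fold change
    a = 0.5 * (math.log2(treated) + math.log2(control))  # A: average expression
    return m, a

# Hypothetical gene: expression 8 in control, 32 in treated
print(ma_values(8, 32))
```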

Usage: identifying trends or biases in expression data, such as mean-dependent variance

226
Q

MA Plots interpretation

A

Center Line (M=0): No change in expression.

Spread: indicates variability in fold changes across different expression levels

227
Q

heat map

A
  • displays the expression levels of multiple genes across different samples using color gradients

Rows: Genes
Columns: Samples
Color Intensity: represents expression level (e.g. red for upregulation, blue for downregulation)

228
Q

heat map interpretation

A

Identifying clusters of co-expressed genes and sample groupings based on expression profiles

229
Q

Principal Component Analysis (PCA) Plots

A
  • PCA transforms high-dimensional gene expression data into principal components that capture the most variance

Axes: Principal components representing the most significant sources of variation

Usage: assessing batch effects, overall data structure, and sample quality

230
Q

PCA interpretation

A

Sample clustering: samples from similar conditions cluster together
Outliers: samples that do not group with others may indicate technical or biological variability

231
Q

What is the main limitation of Sanger sequencing?

A

It has a high cost and low throughput.

232
Q

How can Sanger sequencing be used to help with next-generation sequencing (NGS) technologies?

A

to confirm sequencing results and fill gaps

233
Q

Which principle does Illumina sequencing rely on?

A

Sequencing by synthesis using reversible terminator nucleotides.

234
Q

What is a significant challenge particular to de novo genome assembly?

A

Handling repetitive DNA sequences.

235
Q

What does a directed edge represent in de Bruijn graphs?

A

The overlap between k-mer sequences
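
A toy sketch of that idea, with k-mers as nodes and a directed edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another. The function name and reads are assumptions for illustration; assemblers often use the equivalent formulation where nodes are (k-1)-mers and edges are k-mers.

```python
def kmer_overlap_graph(reads, k):
    """Build a directed graph: k-mer -> list of k-mers it overlaps by k-1 bases."""
    kmers = sorted({read[i:i + k] for read in reads for i in range(len(read) - k + 1)})
    # Edge u -> v when the suffix of u (k-1 bases) equals the prefix of v
    return {u: [v for v in kmers if u[1:] == v[:-1]] for u in kmers}

# Two hypothetical overlapping reads
graph = kmer_overlap_graph(["ACGT", "CGTA"], k=3)
print(graph)
```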

236
Q

What computational complexity is characteristic of de novo assembly?

A

The similarity of k-mers within the reads

237
Q

What is the primary benefit of using both short and long reads in de novo genome assembly?

A

Long reads can span repetitive regions, while short reads improve coverage.

238
Q

Which algorithm is commonly used for local pairwise sequence alignment?

A

Smith-Waterman
BLAST

239
Q

What is the main difference between global and local sequence alignment?

A

Global alignment matches sequences in full; local alignment focuses on best-matching parts

Local alignment identifies conserved regions; global for full sequence comparison

240
Q

What is the significance of the numerical value in the bottom-right of the Needleman-Wunsch alignment matrix?

A

The numerical value in the bottom-right corner of the Needleman-Wunsch alignment matrix represents the optimal score of the global alignment between two sequences.

241
Q

What are the inherent limitations of greedy algorithms for genome assembly? In particular, how might these limitations affect the assembly outcome when dealing with complex eukaryotic genomes?

A

Greedy algorithms for sequence assembly can be efficient for simple genomes but struggle with complex ones due to issues with repetitive sequences, limited global perspective, and handling of genetic
variation. These limitations can lead to misassemblies, especially in genomes with high complexity,
such as those with repetitive regions or structural variations.

242
Q

You are analyzing bulk RNA-seq data from a study comparing gene expression in diseased versus
healthy tissue samples. You notice that a specific gene has a high fold change (i.e., it is up-regulated),
but the large p-value indicates that it is insignificant. What could be the most likely reason for this
observation?

A

The gene expression varies significantly within the sample groups

243
Q

When comparing RNA-seq data from two different developmental stages of an organism, you find
many genes with altered expression. Which factor should be considered before attributing these
changes to developmental processes?

A
  • Batch effects or variations in sequencing depth between the samples.
  • Exclusive reliance on fold-change values.
  • Attributing all changes in gene expression to transcriptional regulation.
244
Q

Given the challenge of genomic variability and sequencing errors in read mapping, which approach
is most effective in distinguishing true splice junctions from artifacts?

A

Employing statistical models that account for sequencing error rates and genomic variability

245
Q

The computational inference of splice junctions from RNA-Seq data involves aligning short reads that
may span exon-exon junctions. This process is complicated by the vast diversity of potential splicing events and the need for high accuracy in distinguishing actual splice junctions from sequencing
errors or genomic variations. Given these challenges, which strategy is most effective for improving
the accuracy of splice junction identification?

A

Use a hybrid approach that combines alignment to a reference genome with de novo assembly
of reads

246
Q

Aligning short reads to a reference genome presents significant challenges, especially in the context
of repetitive sequences or highly variable regions. Which approach offers the best potential to enhance read mapping accuracy in these complex genomic landscapes?

A

Initially align reads to a simplified model of the genome and gradually integrate more complex
regions

247
Q

What property of the Burrows-Wheeler transform is most crucial for improving the efficiency of pattern matching in biological sequences?

A

The rearrangement of characters to bring similar characters together.

248
Q

Why is the Burrows-Wheeler transform significant for bioinformatics applications in the context of
FM-indexes?

A

It allows for efficient backward search, reducing the time complexity of finding patterns.

249
Q

When applying the Burrows-Wheeler transform to a sequence, what is the importance of the last column?

A

It is used for reconstructing the original sequence.

250
Q

Which of the following best describes the role of the Burrows-Wheeler transform (BWT) in the FM
index?

A

BWT is the first step in FM-index construction

250
Q

The true transcriptome of a sample is defined as:

A

The complete set of RNA molecules, including all isoforms present in the sample.

251
Q

The concept of effective length in RNA-seq data analysis accounts for:

A

The adjustment for the empirical distribution of fragment lengths obtained during sequencing.

252
Q

What is the primary goal of using maximum likelihood estimation in Salmon for RNA-Seq data analysis?

A

To maximize the probability of the observed RNA sequencing data

253
Q

Why is the quality of sequencing data typically lower at the end of a read in Sanger sequencing?

A
  • primarily attributed to the decreasing population of longer DNA fragments.
  1. probability of ddNTP incorporation
  2. concentration ratio of dNTPs to ddNTPs
  3. mass and mobility differences
  4. signal-to-noise ratio

NOT DUE TO QUALITY OF READS
- mixture contains an excess of both dNTPs and ddNTPs.
- concentrations of these nucleotides are not depleted during sequencing.
- concentrations of dNTPs and ddNTPs remain constant throughout

254
Q

What is the purpose of adding adapters to DNA fragments in Illumina sequencing?

A

Adapters are short oligonucleotide sequences added to DNA fragments during Illumina sequencing library preparation.

Purpose: Adapters contain sequences complementary to oligonucleotides on the Illumina flow cell surface. This allows DNA fragments to bind to the flow cell and form clusters.

255
Q

If the concentration of ddNTPs is too high:

A
  • short fragments
  • loss of long reads
  • reduced overall signal
256
Q

If the concentration of ddNTPs is too low:

A
  • long fragments
  • loss of short reads
  • weak signal for short fragments
257
Q

ratio of ddNTPs to dNTPs in Sanger sequencing

A
  • critical for generating a balanced distribution of fragment lengths
    – fragment distribution
    – read lengths
258
Q

smaller k-mer sizes

A
  • more likely to find overlaps between reads because they require fewer matching bases. This increases sensitivity, helping to connect reads in regions with low coverage or sequencing errors.
259
Q

larger k-mer sizes

A

- Larger k-mers are more specific, reducing the chance of erroneous overlaps but requiring higher-quality data.
- more likely to span unique regions
- reduced overlap detection
- more memory used