Old exam questions Flashcards
EXAM QUESTION
what is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons (2022)
what is it
- portion of genome sequence reconstructed from end-sequenced whole-genome shotgun clones.
- composed of contigs and gaps.
- Since lengths of fragments are roughly known, the number of bases between contigs can be estimated.
Why
* GC-rich regions often yield missing or low-quality sequence reads, leaving gaps between contigs
* Large repetitive elements that are longer than the reads cannot be resolved, so contigs break there
* infer order, orientation, distance of contigs
* link contigs with Ns into scaffolds
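A minimal sketch of the last two points, assuming the order and orientation of the contigs are already known and that one clone with a roughly known insert size spans each gap. The function names and the `min_gap` placeholder are illustrative, not taken from any particular assembler.

```python
def estimate_gap(insert_size, dist_to_end_a, dist_to_start_b):
    """Estimate the gap between contig A and contig B from one spanning clone.

    insert_size     -- approximate length of the end-sequenced clone/fragment
    dist_to_end_a   -- bases from the start of the first end read to the end of contig A
    dist_to_start_b -- bases from the start of contig B to the far end of the mate read
    """
    return insert_size - dist_to_end_a - dist_to_start_b


def build_scaffold(contigs, gaps, min_gap=10):
    """Join ordered, oriented contigs with runs of N sized by the gap estimates."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate per adjacent contig pair")
    scaffold = contigs[0]
    for contig, gap in zip(contigs[1:], gaps):
        # Assumption: negative or tiny estimates are clamped to a placeholder gap.
        scaffold += "N" * max(gap, min_gap) + contig
    return scaffold
```

For example, a 3000 bp clone with 1200 bp inside contig A and 1500 bp inside contig B suggests a gap of about 300 bp, which would be written as 300 Ns.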
EXAM QUESTION
How to use Hamiltonian path in sequence assembly, two problems (2020)
DBG
- break reads into k-mers (e.g., 4-mers) –> these become the edges
- break these into k-1-mers –> these become the nodes
- use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
- target sequence: path that visits each edge once…at least in theory!
= Eulerian path or cycle, not Hamiltonian! Nodes can be reused.
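The DBG steps above can be sketched as follows. This is a toy version that assumes error-free reads, ignores reverse complements and k-mer frequency filtering, and assumes an Eulerian path exists; the function names are made up for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; every distinct k-mer contributes one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result reproducible
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: use every edge exactly once (nodes may repeat)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in graph if len(graph[n]) - in_deg.get(n, 0) == 1),
                 next(iter(graph)))
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])
```

With the reads `["ACGCG", "GCGT"]` and `k=3`, the node `CG` is visited twice while each edge is used once, which is exactly why the path is Eulerian rather than Hamiltonian.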
MAKER pipeline method for gene prediction: steps
EXAM QUESTION
steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)
- initial gene predictions (extrinsic)
- extraction of species-specific content statistics (intrinsic)
- generation of species-specific HMMs (intrinsic)
- (refined) gene predictions (intrinsic); repeat steps 2-4 about 2 times
- final gene predictions
EXAM QUESTION
Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)
What does intrinsic and extrinsic mean in gene prediction, explain/describe methods and name disadvantages (2020)
Intrinsic: just use query data (genome/s), build statistical model eg HMM
Extrinsic: comparative - use other data eg RNA sequences, or proteins from other species/lineages
Disadvantage
intrinsic: need a lot of training data, not possible for non-model organisms
extrinsic: difficult to predict new genes
EXAM QUESTION
Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)
- insertions
- deletions
- duplications
- inversions, rearrangements
- copy number variations
Long reads are split into sub-reads and mapped; analysis of how the pieces map can then determine which of the SVs above is present.
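A rough sketch of how two mapped pieces of the same long read could hint at an SV type. It assumes both pieces map to the same chromosome, are ordered by their position in the read, and use 0-based half-open coordinates. The `Segment` type, the `min_size` cutoff and the decision rules are simplified assumptions, not the logic of any specific SV caller.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    read_start: int  # position of the piece within the read
    read_end: int
    ref_start: int   # position of the piece on the reference
    ref_end: int
    strand: str      # "+" or "-"


def classify_split(first, second, min_size=50):
    """Guess the SV type from two adjacent pieces of one read."""
    if first.strand != second.strand:
        return "inversion"
    read_gap = second.read_start - first.read_end  # unaligned bases in the read
    ref_gap = second.ref_start - first.ref_end     # skipped bases on the reference
    if ref_gap <= -min_size:
        # The second piece maps back onto reference already covered: extra copy.
        return "duplication"
    if ref_gap - read_gap >= min_size:
        # Reference bases with no counterpart in the read.
        return "deletion"
    if read_gap - ref_gap >= min_size:
        # Read bases with no counterpart on the reference.
        return "insertion"
    return "no SV call"
```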
EXAM QUESTION
Sequence A and B have a length of 1000aa. Seq A has N-terminal region (front), with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)
First 250 amino acids are tandem duplicated in middle of B (2020)
Dotplot 2 sequences (2020)
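To check a hand-drawn dotplot, the dots can be computed with a small script. This sketch places a dot wherever a window of `window` residues matches exactly. Real dotplot tools usually score windows with a substitution matrix, so exact matching is a simplifying assumption.

```python
from collections import defaultdict


def dotplot(seq_a, seq_b, window=5):
    """Return (i, j) pairs where seq_a[i:i+window] equals seq_b[j:j+window]."""
    positions_in_b = defaultdict(list)
    for j in range(len(seq_b) - window + 1):
        positions_in_b[seq_b[j:j + window]].append(j)
    return [(i, j)
            for i in range(len(seq_a) - window + 1)
            for j in positions_in_b.get(seq_a[i:i + window], [])]
```

For the exam scenario (A on the x-axis, B on the y-axis), the N-terminal region of A produces two short parallel diagonal lines, one above the other, at the two tandem copies in the middle of B. The rest of the plot stays empty.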
EXAM QUESTION
4 parameter to increase sensitivity of BLAST (2022)
Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)
4 BLAST parameters, how to change to increase sensitivity (or specificity) (2020)
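Word size: decrease it (shorter seed words give more initial hits)
T value (neighborhood word score threshold): lower it
Gap penalties: lower them so that more gapped alignments are allowed
Substitution matrix: choose one suited to more divergent sequences (e.g. BLOSUM45 instead of BLOSUM62)
E-value threshold: raise it so that weaker hits are reported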
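The effect of word size on seeding can be illustrated with a toy count of exact word matches between a query and a subject. This is only a sketch of the idea. Protein BLAST also accepts similar (neighborhood) words scoring at least T, which is not modelled here.

```python
def count_seed_hits(query, subject, word_size):
    """Count exact word matches that could seed an alignment."""
    subject_words = {}
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        subject_words[word] = subject_words.get(word, 0) + 1
    return sum(subject_words.get(query[i:i + word_size], 0)
               for i in range(len(query) - word_size + 1))
```

With the query `MKTAYIAKQR` and a subject that differs by one residue (`MKTAYLAKQR`), a word size of 3 still finds several seeds, while a word size of 6 finds none, so the larger word size would miss the hit entirely.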
EXAM QUESTION
In the lectures, we went over the concept of guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)
2 topics where guide tree was mentioned and how (2020)
First time: Guide trees for MSAs
Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree which tells us which sequence to align next
Guide tree is not a phylogeny! guides the order in which sequences are being aligned
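A small sketch of how a distance matrix turns into an alignment order. It assumes the sequences already have equal length so that a simple p-distance can stand in for pairwise alignment scores, and it uses average-linkage clustering as one possible way to build the guide tree. Both choices are simplifications for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_order(seqs):
    """Return the merge order: the closest clusters are joined (aligned) first."""
    clusters = [(name,) for name in seqs]
    merges = []

    def average_distance(c1, c2):
        total = sum(p_distance(seqs[a], seqs[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges
```

The returned list only tells the aligner which sequences or sub-alignments to combine next; it makes no claim about evolutionary history.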
Second time:
Distance matrices for phylogenies, e.g. NJ?
EXAM QUESTION
List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such improvement. (2019)
Main steps basic progressive alignment + disadvantage, how to overcome (2020)
- compute pairwise distance matrix
- alignment scores –> compute guide tree
- align closely related seqs, progressively add more distantly related seqs
major weakness:
alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.
solution: repair errors during post-processing = iterative refinement
Iterative methods work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA.
implemented in most programs (eg MAFFT)
EXAM QUESTION
Define (p?)HMM, covariance model, 1 similarity and 1 difference (2022)
EXAM QUESTION
what are hidden states, states, transition and emission probabilities of HMM for prokaryotic sequences (2022)
depends on whether for gene prediction or for gene family prediction?
EXAM QUESTION
describe process of NJ and ML and name and describe one method to test validity of found clades (2022)
NJ
- distance-based
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree
ML
- character-based
- calculate likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution matrix and a tree)
- evaluate many different trees and pick the optimal one
test validity: bootstrapping
- create pseudosamples of the same length as the original MSA by resampling its columns with replacement (some columns appear several times, others not at all)
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions
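The resampling and counting steps can be sketched like this. Tree inference itself is left out: `bootstrap_support` assumes each replicate tree has already been reduced to a set of clades (as frozensets of taxon names), which is a representation chosen here for illustration.

```python
import random


def bootstrap_alignment(msa, rng=None):
    """Build one pseudosample: draw columns with replacement, same length as the MSA."""
    rng = rng or random.Random()
    lengths = {len(seq) for seq in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so columns stay intact.
    return {name: "".join(seq[c] for c in columns) for name, seq in msa.items()}


def bootstrap_support(clade, replicate_trees):
    """Proportion of replicate trees (sets of clades) that contain the clade."""
    return sum(clade in tree for tree in replicate_trees) / len(replicate_trees)
```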
EXAM QUESTION
For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)
Which phylogeny reconstruction method does not correct for multiple substitutions and how does it present a problem (2020)
Maximum Parsimony (MP)
MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions
divergent sequences = more multiple substitutions -> MP doesn’t work here –> doesn’t make sense with today’s data
lit:
- inherent assumption of slow rate of evolution (so slow that multiple hits are negligible).
- molecular phylogenetics moving towards resolving deep phylogenies (typically involve sequences with multiple subs at the same site)
- –> MP has gradually faded away, like an old soldier.
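To see why multiple substitutions matter, compare the observed proportion of differing sites with a model-corrected distance. The sketch below uses the Jukes-Cantor (JC69) correction as the simplest example of what distance-based and ML methods can do and MP cannot. JC69 is chosen here for illustration and is not named in the notes above.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance (substitutions per site) from the observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)
```

For an observed difference of 0.30, the corrected distance is about 0.38 substitutions per site, and the discrepancy grows quickly as sequences become more divergent.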
EXAM (2019, 2020)
List three biological reasons for which we may get incongruences in gene trees. Explain one of them, and how it is reflected in the tree.
- incomplete lineage sorting / deep coalescence
- hybridization or introgression
- horizontal gene transfer (HGT)
- differential duplication and loss
- natural selection
- ILS: alleles coalesce first with alleles from more distantly related species
- Introgression: gene flow between closely related species (lineages)
- HGT - exchange of genetic info between different species
–> gene tree does not match species tree
EXAM (2020)
Given a multi-FASTA file of homologous sequences, how to extract orthologs and paralogs
(probably won’t be asked this as we had 2 lectures reduced to 1)
All against all comparisons
- based on score & length criteria
–> homologs (candidate pairs)
Formation of stable pairs
- analysis within and between genomes
- pairwise & multiple sequence comparisons
- ML evolutionary distances
- protein similarity graph, clustering
–> putative orthologs (stable pairs)
Verification of stable pairs
- compare with third genome:
check for hidden paralogs,
differential loss
- use species tree information
- graph theoretic approaches
–> Orthologs (verified pairs)
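A heavily simplified sketch of the "stable pairs" idea, using reciprocal best hits between two genomes as a stand-in. It assumes the all-against-all scores are already available as dictionaries keyed by (query, subject). The verification step with a third genome is not covered, and the function names are illustrative.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Genes whose best hit is not reciprocated (for example a second copy that also matches best to the same partner) are left out as candidate paralogs.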