Old exam questions Flashcards

Question 1

Q

EXAM QUESTION

What is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons. (2022)

Answer

A

What it is
- a portion of the genome sequence reconstructed from end-sequenced whole-genome shotgun clones.
- composed of contigs and gaps.
- since the fragment lengths are roughly known, the number of bases between contigs can be estimated.

Why
* GC-rich regions, which often produce missing or low-quality sequence reads
* Large repetitive elements
* infer order, orientation, distance of contigs
* link contigs with Ns into scaffolds
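
The linking step can be sketched in a few lines of Python. This is only an illustration: the function name `build_scaffold`, the `min_gap` parameter and the example contigs are made up here, and the contigs are assumed to be already ordered and oriented.

```python
def build_scaffold(contigs, gap_estimates, min_gap=1):
    """Join ordered, oriented contigs with runs of N sized by the estimated gaps."""
    if len(gap_estimates) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_estimates, contigs[1:]):
        # Every gap is written as at least min_gap Ns so the join stays visible.
        parts.append("N" * max(gap, min_gap))
        parts.append(contig)
    return "".join(parts)


# Hypothetical contigs with estimated gaps of 4 and 2 bases between them.
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [4, 2])
print(scaffold)  # ACGTACNNNNGGCTANNTTAGC
```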

Question 2

Q

EXAM QUESTION

How to use a Hamiltonian path in sequence assembly; two problems (2020)

Answer

A

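De Bruijn graph (DBG)

break reads into k-mers (e.g., 4-mers) –> these become the edges
break these into (k-1)-mers –> these become the nodes
use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
target sequence: a path that visits each edge exactly once (at least in theory)
= Eulerian path or cycle, not Hamiltonian! Nodes can be reused.
(A Hamiltonian path visits each node exactly once; finding one is NP-complete, which is why DBG assemblers use the Eulerian formulation.)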
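
A minimal sketch of these steps, assuming error-free reads and a graph with a single Eulerian path. The function names and the example reads are invented for illustration; real assemblers also handle coverage, errors and repeats.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]  # hypothetical overlapping reads
print(spell(eulerian_path(de_bruijn(reads, k=4))))  # ATGGCGTGCA
```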

Question 3

Q

Maker pipeline method for gene prediction: steps

EXAM QUESTION

steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)

Answer

A

1. initial gene predictions (extrinsic)
2. extraction of species-specific content statistics (intrinsic)
3. generation of species-specific HMMs (intrinsic)
4. (refined) gene predictions (intrinsic); repeat steps 2-4 ca. 2 times
5. final gene predictions

Question 4

Q

EXAM QUESTION

Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)

What do intrinsic and extrinsic mean in gene prediction? Explain/describe the methods and name disadvantages. (2020)

Answer

A

Intrinsic: use only the query data (genome/s) and build a statistical model, e.g. an HMM
Extrinsic: comparative - use other data, e.g. RNA sequences or proteins from other species/lineages

Disadvantage
intrinsic: need a lot of training data, not possible for non-model organisms
extrinsic: difficult to predict new genes

Question 5

Q

EXAM QUESTION

Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)

Answer

A

insertions
deletions
duplications
inversions, rearrangements
copy number variations

long reads are split into sub-reads and mapped - analysis of the mapped pieces can then determine whether they show one of the SVs above
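
As a rough illustration of that last step, the sketch below classifies two consecutive mapped pieces of one read. It is a toy heuristic with invented names (`Segment`, `classify_split`, `min_size`) and coordinates, not how any particular SV caller works; real tools combine many reads and more signals.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    read_start: int  # coordinates on the read
    read_end: int
    ref_start: int   # coordinates on the reference
    ref_end: int
    strand: str      # "+" or "-"


def classify_split(a, b, min_size=50):
    """Guess the SV type implied by two consecutive pieces (a, then b) of one read."""
    if a.strand != b.strand:
        return "inversion"
    read_gap = b.read_start - a.read_end  # read bases between the two pieces
    ref_gap = b.ref_start - a.ref_end     # reference bases between the two pieces
    if ref_gap <= -min_size:
        return "duplication"  # second piece maps back onto reference already covered
    if ref_gap - read_gap >= min_size:
        return "deletion"     # reference bases missing from the read
    if read_gap - ref_gap >= min_size:
        return "insertion"    # read bases missing from the reference
    return "none"


first = Segment(0, 1000, 5000, 6000, "+")
print(classify_split(first, Segment(1000, 2000, 6500, 7500, "+")))  # deletion
print(classify_split(first, Segment(1300, 2300, 6000, 7000, "+")))  # insertion
print(classify_split(first, Segment(1000, 2000, 5600, 6600, "+")))  # duplication
print(classify_split(first, Segment(1000, 2000, 6000, 7000, "-")))  # inversion
```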

Question 6

Q

EXAM QUESTION

Sequence A and B have a length of 1000aa. Seq A has N-terminal region (front), with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)

First 250 amino acids are tandem duplicated in middle of B (2020)

Dotplot 2 sequences (2020)

Question 7

Q

EXAM QUESTION

4 parameters to increase sensitivity of BLAST (2022)

Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)

4 BLAST parameters, how to change them to increase sensitivity (or specificity) (2020)

Answer

A

Word size

T-value

Gap penalties

Different substitution matrix

E-value threshold

Question 8

Q

EXAM QUESTION

In the lectures, we went over the concept of guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)

2 topics where guide tree was mentioned and how (2020)

Answer

A

First time: Guide trees for MSAs

Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree which tells us which sequence to align next
Guide tree is not a phylogeny! guides the order in which sequences are being aligned
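
A small sketch of how a guide tree fixes the alignment order. Assumptions made here for illustration: the distance is an alignment-free k-mer distance rather than a pairwise alignment score, the clustering is average linkage (UPGMA-style), and the sequences are invented.

```python
from itertools import combinations


def kmer_distance(a, b, k=3):
    """1 - Jaccard similarity of the two k-mer sets (a crude, alignment-free distance)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1 - len(ka & kb) / len(ka | kb)


def guide_order(seqs):
    """Return the merge order: closest sequences first, distant ones later."""
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges


seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "TTGACCATGA"}  # hypothetical
for left, right in guide_order(seqs):
    print("align", left, "with", right)
```

The merge list only says what to align next; it makes no claim about evolutionary history, which is the point of "guide tree is not a phylogeny".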

Second time:
Distance matrices for phylogenies, e.g. NJ?

Question 9

Q

EXAM QUESTION

List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such improvement. (2019)

Main steps basic progressive alignment + disadvantage, how to overcome (2020)

Answer

A

compute pairwise distance matrix
alignment scores –> compute guide tree
align closely related seqs, progressively add more distantly related seqs

major weakness:
alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.

solution: repair errors during post-processing = iterative refinement
work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA.

implemented in most programs (eg MAFFT)

Question 10

Q

EXAM QUESTION

Define (p?)HMM, covariance model, 1 similarity and 1 difference (2022)

Question 11

Q

EXAM QUESTION

what are hidden states, states, transition and emission probabilities of HMM for prokaryotic sequences (2022)

Answer

A

depends on whether for gene prediction or for gene family prediction?

Question 12

Q

EXAM QUESTION

describe process of NJ and ML and name and describe one method to test validity of found clades (2022)

Answer

A

NJ
- distance-based
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree

ML
- character-based
- calculate likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution matrix and a tree)
- evaluate many different trees and pick the optimal one

test validity: bootstrapping
- create pseudo-samples of the same length as the original MSA by resampling its columns with replacement (not just shuffling them)
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions
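
The bootstrap loop can be sketched as follows. The tree-building step is replaced by a deliberately trivial stand-in (`closest_pair`, which reports only the closest pair of sequences as a "clade"); in practice this would be a full NJ or ML run. All names and the example MSA are made up.

```python
import random
from itertools import combinations


def resample_columns(msa, rng):
    """One pseudo-sample: same length as the MSA, columns drawn with replacement."""
    n_cols = len(next(iter(msa.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in msa.items()}


def closest_pair(msa):
    """Toy stand-in for tree building: the pair with the fewest differences."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(msa[a], msa[b]))
    return frozenset(min(combinations(sorted(msa), 2), key=lambda pair: hamming(*pair)))


def bootstrap_support(msa, clade, replicates=200, seed=1):
    """Fraction of pseudo-samples in which the clade is recovered."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(msa, rng)) == clade for _ in range(replicates))
    return hits / replicates


msa = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "ACCTAGGTTA", "D": "TCCTAGGATA"}
print(bootstrap_support(msa, frozenset({"A", "B"})))
```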

Question 13

Q

EXAM QUESTION

For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)

Which phylogeny reconstruction method does not correct for multiple substitutions and how does it present a problem (2020)

Answer

A

Maximum Parsimony (MP)

MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions

divergent sequences = more multiple substitutions –> MP doesn’t work here –> doesn’t make sense with today’s data

lit:
- inherent assumption of slow rate of evolution (so slow that multiple hits are negligible).
- molecular phylogenetics moving towards resolving deep phylogenies (typically involve sequences with multiple subs at the same site)
- –> MP has gradually faded away, like an old soldier.
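
For contrast, model-based methods do correct for multiple hits. The sketch below uses the Jukes-Cantor (JC69) correction as an example; JC69 is chosen here for illustration and is not named in these notes. It shows how the corrected distance grows faster than the observed proportion of differences as sequences diverge.

```python
import math


def p_distance(a, b):
    """Observed proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """JC69 distance: expected substitutions per site, corrected for multiple hits."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under JC69")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6):
    print(f"observed {p:.2f} -> corrected {jukes_cantor(p):.3f}")
```

The gap between observed and corrected values is the signal that a method without such a correction misses in divergent data.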

Question 14

Q

EXAM (2019, 2020)

List three biological reasons for which we may get incongruences in gene trees. Explain one of them, and how it is reflected in the tree.

Answer

A

incomplete lineage sorting / deep coalescence
hybridization or introgression
horizontal gene transfer (HGT)
differential duplication and loss
natural selection
ILS: alleles coalesce first with alleles from more distantly related species
Introgression: gene flow between closely related species (lineages)
HGT: exchange of genetic info between different species

–> gene tree does not match species tree

Question 15

Q

EXAM (2020)

Given a multi-FASTA file of homologous sequences, how to extract orthologs and paralogs

(probably won’t be asked this as we had 2 lectures reduced to 1)

Answer

A

All against all comparisons
- based on score & length criteria
–> homologs (candidate pairs)

Formation of stable pairs
- analysis within and between genomes
- pairwise & multiple sequence comparisons
- ML evolutionary distances
- protein similarity graph, clustering
–> putative orthologs (stable pairs)

Verification of stable pairs
- compare with third genome:
check for hidden paralogs,
differential loss
- use species tree information
- graph theoretic approaches
–> Orthologs (verified pairs)
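
A much simplified stand-in for the pairing step is reciprocal best hits between two genomes. The sketch assumes all-against-all scores are already available as dictionaries; the gene names and scores are invented, and it does not cover the verification against a third genome.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical alignment scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b1"): 80, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 80, ("b2", "a1"): 150, ("b2", "a2"): 90}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

Here a2 and b2 are not paired because b2 scores higher against a1, the kind of case the verification step is meant to examine.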

Question 16

Q

BLAST - What does changing word size do

Answer

A

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger (fewer seed hits to extend)
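
A toy illustration of the effect, assuming exact word matches as seeds (as in nucleotide BLAST) and two invented sequences: a smaller word size finds seeds where a larger one finds none.

```python
def shared_words(query, subject, w):
    """Words of length w from the query that occur exactly in the subject."""
    subject_words = {subject[i:i + w] for i in range(len(subject) - w + 1)}
    return [
        query[i:i + w]
        for i in range(len(query) - w + 1)
        if query[i:i + w] in subject_words
    ]


query = "ACGTTGCAACGT"    # hypothetical sequences differing at two positions
subject = "ACGTAGCAACGA"
for w in (4, 8):
    print(w, len(shared_words(query, subject, w)))
# 4 5
# 8 0
```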

Question 17

Q

BLAST - What does changing T-value do

Answer

A

T-value
lower–> increase sensitivity
higher –> increase specificity, speed
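
The T-value is the score threshold a neighborhood word must reach to count as a seed in protein BLAST. The sketch below counts neighborhood words for one query word at different thresholds. It uses a toy scoring scheme (match 4, mismatch -1) instead of a real substitution matrix such as BLOSUM62, so the numbers are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy scoring, NOT a real substitution matrix: match 4, mismatch -1."""
    return 4 if a == b else -1


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return [
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(toy_score(x, y) for x, y in zip(word, candidate)) >= threshold
    ]


for t in (12, 7, 2):
    print(t, len(neighborhood("PQG", t)))
# 12 1
# 7 58
# 2 1141
```

A lower T admits more neighborhood words, so more seeds are found (more sensitive) at the cost of more extensions to evaluate (slower).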


(17 cards)