Old exam questions Flashcards

1
Q

EXAM QUESTION

what is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons (2022)

A

What is it
- portion of genome sequence reconstructed from end-sequenced whole-genome shotgun clones.
- composed of contigs and gaps.
- since the lengths of the fragments are roughly known, the number of bases between contigs can be estimated.

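Why
* GC-rich regions often result in missing or low-quality sequence reads, which leaves gaps
* Large repetitive elements cannot be resolved, so contigs break there
* paired-end information still lets us infer order, orientation and distance of contigs
* contigs are then linked with Ns into scaffolds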
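
A minimal sketch of the last step, assuming the order and orientation of the contigs are already known and each gap length has been estimated from the fragment sizes. The function name `build_scaffold` and the toy contigs are made up for illustration.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered, oriented contigs with runs of N for the estimated gaps."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need one gap estimate between each pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases, estimated count
        parts.append(contig)
    return "".join(parts)


# Toy contigs and gap estimates (hypothetical values).
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [5, 3])
print(scaffold)
```

The Ns only record how many bases are probably missing, not which bases they are.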

2
Q

EXAM QUESTION

How to use Hamiltonian path in sequence assembly, two problems (2020)

A

DBG (de Bruijn graph)

  • break reads into k-mers (e.g., 4-mers) –> these become the edges
  • break these into k-1-mers –> these become the nodes
  • use these to construct a DBG (drop k-mers that are too infrequent, likely sequencing errors, or too frequent, likely repeats)
  • target sequence: path that visits each edge once… at least in theory!
    = Eulerian path or cycle, not Hamiltonian! Nodes can be reused.
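
The steps above can be sketched as follows. This is a toy version under strong assumptions: error-free reads, each distinct k-mer used once (multiplicity is ignored), and a graph that actually contains an Eulerian path. The function names are made up for illustration, and the path is found with Hierholzer's algorithm, which the card does not name.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once (nodes may repeat)."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one; otherwise any node with an edge.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(graph),
    )
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence: first node, then the last base of every following node."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]  # toy overlapping reads
print(assemble(reads, 4))
```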
3
Q

MAKER pipeline method for gene prediction: steps

EXAM QUESTION

steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)

A
  1. initial gene predictions (extrinsic)
  2. extraction of species-specific content statistics (intrinsic)
  3. generation of species-specific HMMs (intrinsic)
  4. (refined) gene predictions (intrinsic).. repeat steps 2-4 ca 2 times
  5. final gene predictions
4
Q

EXAM QUESTION

Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)

What do intrinsic and extrinsic mean in gene prediction, explain/describe methods and name disadvantages (2020)

A

Intrinsic: use only the query data (genome/s), build a statistical model, e.g. an HMM
Extrinsic: comparative - use other data, e.g. RNA sequences or proteins from other species/lineages

Disadvantage
intrinsic: needs a lot of training data, which is often not available for non-model organisms
extrinsic: difficult to predict new genes (no matching evidence exists for them)

5
Q

EXAM QUESTION

Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)

A
  • insertions
  • deletions
  • duplications
  • inversions, rearrangements
  • copy number variations

long reads are split into sub-reads and mapped - analysis of how the pieces map relative to each other can then determine whether one of the SVs above is present
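
A rough sketch of how two pieces of one split read could hint at an SV type. Everything here is an assumption for illustration: the `Segment` fields, the `min_size` cutoff, and the rules themselves, which are a simplified heuristic and not the method of any particular SV caller. Both pieces are assumed to map to the same chromosome, with coordinates given in read order.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One mapped piece of a split long read (hypothetical layout)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int
    strand: str


def classify_split(a, b, min_size=50):
    """Compare the jump on the read with the jump on the reference."""
    read_gap = b.read_start - a.read_end  # read bases between the two pieces
    ref_gap = b.ref_start - a.ref_end     # reference bases between the two pieces
    if a.strand != b.strand:
        return "inversion"                # second piece maps in the other orientation
    if ref_gap - read_gap >= min_size:
        return "deletion"                 # reference bases missing from the read
    if read_gap - ref_gap >= min_size:
        # Extra read sequence: either novel (insertion) or mapping back onto
        # reference that the first piece already covered (duplication).
        return "duplication" if ref_gap <= -min_size else "insertion"
    return "no SV"


first = Segment(0, 400, 100, 500, "+")
print(classify_split(first, Segment(400, 800, 1500, 1900, "+")))
print(classify_split(first, Segment(700, 1000, 500, 800, "+")))
```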

6
Q

EXAM QUESTION

Sequence A and B have a length of 1000aa. Seq A has N-terminal region (front), with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)

First 250 amino acids are tandem duplicated in middle of B (2020)

Dotplot 2 sequences (2020)

A
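
With A on the x-axis and B on the y-axis, the expected picture is two short parallel diagonals: A 1-250 against B 251-500 and A 1-250 against B 501-750 (taking the 2020 wording and placing the two copies exactly in the middle of B). The sketch below checks this on made-up random sequences with exact copies of the region; the word size of 5 and all names are assumptions for illustration.

```python
import random
from collections import Counter


def word_matches(seq_a, seq_b, word=5):
    """Return (i, j) for every identical word at position i in A and j in B."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    return [
        (i, j)
        for i in range(len(seq_a) - word + 1)
        for j in index.get(seq_a[i:i + word], [])
    ]


rng = random.Random(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(n):
    return "".join(rng.choices(AMINO_ACIDS, k=n))


domain = random_protein(250)
seq_a = domain + random_protein(750)                               # region at the N-terminus of A
seq_b = random_protein(250) + domain + domain + random_protein(250)  # tandem copies in the middle of B

dots = word_matches(seq_a, seq_b)
# Dots on one diagonal share the same offset j - i.
offsets = Counter(j - i for i, j in dots)
print(offsets.most_common(2))
```

The two dominant offsets are the two diagonals; plotting `dots` as a scatter plot would give the drawing the question asks for.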
7
Q

EXAM QUESTION

4 parameters to increase sensitivity of BLAST (2022)

Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)

4 BLAST parameters, how to change to increase sensitivity (or specificity) (2020)

A

Word size

T-value

Gap penalties

Different substitution matrix

E-value threshold

8
Q

EXAM QUESTION

In the lectures, we went over the concept of guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)

2 topics where guide tree was mentioned and how (2020)

A

First time: Guide trees for MSAs

Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree which tells us which sequence to align next
Guide tree is not a phylogeny! guides the order in which sequences are being aligned
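
A toy sketch of the two steps above. It assumes equal-length sequences so that a simple mismatch fraction can stand in for pairwise alignment scores, and it joins clusters by average distance (UPGMA-style); real aligners differ in both respects. Function names and sequences are made up.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions (stand-in for a pairwise alignment score)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Repeatedly join the two closest clusters; returns nested tuples of names."""
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def average_distance(pair):
        (_, members_1), (_, members_2) = pair
        total = sum(dist[frozenset((x, y))] for x in members_1 for y in members_2)
        return total / (len(members_1) * len(members_2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=average_distance)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
print(guide_tree(seqs))
```

The nesting gives the alignment order: the innermost pairs are aligned first, then the resulting sub-alignments are joined.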

Second time:
Distance matrices for phylogenies, e.g. NJ?

9
Q

EXAM QUESTION

List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such improvement. (2019)

Main steps basic progressive alignment + disadvantage, how to overcome (2020)

A
  1. compute pairwise distance matrix
  2. alignment scores –> compute guide tree
  3. align closely related seqs, progressively add more distantly related seqs

major weakness:
alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.

solution: repair errors during post-processing = iterative refinement
works similarly to progressive methods but repeatedly realigns the initial sequences as well as adding new sequences to the growing MSA.

implemented in most programs (eg MAFFT)

10
Q

EXAM QUESTION

Define (p?)HMM, covariance model, 1 similarity and 1 difference (2022)

A
11
Q

EXAM QUESTION

what are hidden states, states, transition and emission probabilities of HMM for prokaryotic sequences (2022)

A

depends on whether for gene prediction or for gene family prediction?

12
Q

EXAM QUESTION

describe process of NJ and ML and name and describe one method to test validity of found clades (2022)

A

NJ
- distance-based
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree

ML
- character-based
- calculate likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution matrix and a tree)
- evaluate many different trees and pick the optimal one

test validity: bootstrapping
- create pseudosamples of the same length as the original MSA by sampling its columns with replacement (some columns appear several times, others not at all)
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions
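
A small sketch of the resampling and counting. Tree inference is replaced by a deliberately crude stand-in, `closest_pair`, which just reports the two most similar sequences as a "clade"; a real analysis would infer a full tree (NJ, ML, ...) for every pseudosample. All names and the toy alignment are assumptions.

```python
import random
from itertools import combinations


def resample_columns(msa, rng):
    """Pseudosample: draw alignment columns with replacement, same length as the original."""
    n_cols = len(next(iter(msa.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in msa.items()}


def closest_pair(msa):
    """Stand-in for tree inference: the pair of sequences with the fewest differences."""
    def hamming(pair):
        return sum(x != y for x, y in zip(msa[pair[0]], msa[pair[1]]))
    return frozenset(min(combinations(sorted(msa), 2), key=hamming))


def bootstrap_support(msa, n_reps=200, seed=1):
    """Fraction of pseudosamples that recover the grouping found in the original MSA."""
    rng = random.Random(seed)
    original = closest_pair(msa)
    hits = sum(closest_pair(resample_columns(msa, rng)) == original for _ in range(n_reps))
    return original, hits / n_reps


msa = {"A": "ACGTACGTAC", "B": "ACGTACGAAC", "C": "ACGAACTAAC", "D": "TCGAACTTAG"}
group, support = bootstrap_support(msa)
print(sorted(group), support)
```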

13
Q

EXAM QUESTION

For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)

Which phylogeny reconstruction method does not correct for multiple substitutions and how does it present a problem (2020)

A

Maximum Parsimony (MP)

MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions

divergent sequences = more multiple substitutions –> MP underestimates the amount of change here –> poorly suited to today’s data

lit:
- inherent assumption of slow rate of evolution (so slow that multiple hits are negligible).
- molecular phylogenetics moving towards resolving deep phylogenies (typically involve sequences with multiple subs at the same site)
- –> MP has gradually faded away, like an old soldier.
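
To see what "correcting for multiple substitutions" means in practice, here is the Jukes-Cantor (JC69) correction used by distance and likelihood methods. JC69 is not named on this card; it is used here only as the simplest illustration. `p` is the observed fraction of differing sites.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.30, 0.60):
    print(p, round(jc69_distance(p), 3))
```

The corrected distance is always at least as large as the observed one, and the difference grows with divergence. That hidden change is what a method without correction misses.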

14
Q

EXAM (2019, 2020)

List three biological reasons for which we may get incongruences in gene trees. Explain one of them, and how it is reflected in the tree.

A
  • incomplete lineage sorting / deep coalescence
  • hybridization or introgression
  • horizontal gene transfer (HGT)
  • differential duplication and loss
  • natural selection
  • ILS: alleles coalesce first with alleles from more distantly related species
  • Introgression: gene flow between closely related species (lineages)
  • HGT - exchange of genetic info between different species

–> gene tree does not match species tree

15
Q

EXAM (2020)

Given a multi-FASTA file of homolog sequences, how to extract orthologs and paralogs

(probably won’t be asked this as we had 2 lectures reduced to 1)

A

All against all comparisons
- based on score & length criteria
–> homologs (candidate pairs)

Formation of stable pairs
- analysis within and between genomes
- pairwise & multiple sequence comparisons
- ML evolutionary distances
- protein similarity graph, clustering
–> putative orthologs (stable pairs)

Verification of stable pairs
- compare with third genome:
check for hidden paralogs,
differential loss
- use species tree information
- graph theoretic approaches
–> Orthologs (verified pairs)

16
Q

BLAST - What does changing word size do

A

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger (fewer seed hits to extend)
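
A toy illustration of the trade-off, counting exact word matches between two sequences that differ at one position. This only mimics the seeding step with exact matches (as in nucleotide BLAST); protein BLAST additionally scores neighbourhood words against the T threshold. The function name and sequences are made up.

```python
def seed_hits(query, subject, word_size):
    """Count positions where a query word occurs exactly in the subject."""
    words = {}
    for j in range(len(subject) - word_size + 1):
        words.setdefault(subject[j:j + word_size], []).append(j)
    return sum(
        len(words.get(query[i:i + word_size], []))
        for i in range(len(query) - word_size + 1)
    )


query = "ACGTTGCAAC"
subject = "ACGTAGCAAC"  # differs from the query at one position
for w in (3, 5, 6):
    print(w, seed_hits(query, subject, w))
```

With the largest word size no seed survives the single mismatch, so the match would never be extended. Smaller words find it but also produce more seeds to process.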

17
Q

BLAST - What does changing T-value do

A

T-value
lower –> increase sensitivity
higher –> increase specificity, speed