Weeks 5-6: Genome Assembly; BLAST/FASTA Flashcards

1
Q

The IHGSC used a _____ approach to sequence the genome

A

hierarchical

2
Q

Celera used ______ to sequence the genome

A

whole genome shotgun

3
Q

Pseudogene

A

DNA sequence resembling a gene but mutated into an inactive form over the course of evolution.
Often lacks introns and other essential DNA sequences necessary for function. Pseudogenes do not result in functional proteins, but may have regulatory effects

4
Q

True or false:

98% of the genome doesn’t encode proteins

A

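True. Only about 2% of the genome encodes proteins.

The other ~98% is non-coding. It includes introns, regulatory elements and repeats, and some of it is transcribed into non-coding RNAs (including small RNAs) that regulate gene expression.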

5
Q

Output of Sanger sequencing

A

Single sequence ranging from 500-1000 bp

6
Q

Output of Next-Gen Sequencing

A

Groups of sequences (many reads per run), each ranging from 25-500 bp

7
Q

De novo assembly

A

Reconstruction of contiguous sequences without making use of any reference sequence.
Reads are partitioned into k-mers (substrings of the read sequence of length k). The k-mers form the nodes of a graph (network), and two nodes are linked when they share a (k-1)-mer.
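The k-mer graph described above can be sketched in a few lines. This is a minimal illustration, not how any particular assembler works: the function names (`kmers`, `build_kmer_graph`, `walk`) are made up for this card, reads are assumed error-free, and a link is taken to mean that the last k-1 bases of one k-mer equal the first k-1 bases of the next.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_kmer_graph(reads, k):
    """Nodes are k-mers; a -> b when a's last k-1 bases equal b's first k-1."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Group nodes by their (k-1)-mer prefix so successors can be looked up fast.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


def walk(graph, start):
    """Follow unambiguous links from start and spell out the contig."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1 and graph[node][0] not in seen:
        node = graph[node][0]
        seen.add(node)
        contig += node[-1]  # each step adds exactly one new base
    return contig


graph = build_kmer_graph(["ATGGC", "GGCGT"], k=3)
contig = walk(graph, "ATG")
```

Here the two reads overlap on GGC, so walking the graph from ATG joins them into one contiguous sequence without consulting a reference.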

8
Q

Genome annotation

A

Computationally expensive process attaching biologically relevant information to genome sequence data

9
Q

Pre-Assembly Steps

A

FastQC: check the quality of the sequencing data, overall GC content, repeat abundance, and the proportion of duplicated reads.
Trim sequences: adapter trimming (cutadapt), then trim reads based on quality (sickle).
Remove contaminant sequences such as DNA from the PhiX phage: use a short read aligner (such as Burrows-Wheeler Aligner).
Demultiplex reads (e.g. in Galaxy).
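To make the quality-trimming step concrete, here is a rough sketch of trimming low-quality bases from the 3' end of one read. It is not sickle's actual algorithm (sickle uses a sliding window); the function names, the Phred+33 offset and the cutoff of 20 are assumptions chosen for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(seq, quality_string, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below the cutoff."""
    scores = phred_scores(quality_string)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], quality_string[:end]


# 'I' decodes to Phred 40 and '#' to Phred 2 under the Phred+33 assumption.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
```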

10
Q

True or false:

Gel electrophoresis is required for NGS

A

False. No gel electrophoresis needed.

11
Q

Smith-Waterman Search

A

Perform dynamic programming between query and each sequence in the collection.
Accurate - guaranteed to report the highest scoring alignments
Slow - searching a 52,000,000,000 basepair collection (entire GenBank database) takes around 3 days on a modern workstation
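A minimal sketch of the idea: run the dynamic programming recurrence between the query and every sequence in the collection, then rank by score. The scoring values (match 2, mismatch -1, linear gap -2) and the names `smith_waterman_score` and `search` are assumptions for illustration. Real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def search(query, collection):
    """Score the query against every sequence; highest score first."""
    scored = [(smith_waterman_score(query, seq), name)
              for name, seq in collection.items()]
    return sorted(scored, reverse=True)


ranked = search("ACGT", {"s1": "ACGT", "s2": "TTTT"})
```

Every cell of every query-by-subject matrix is filled, which is why the method is exhaustive and also why it is slow on large collections.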

12
Q

FASTA

A

First heuristic search algorithm
~5x faster than Smith-Waterman
Four stage search process - first stage based on algorithm of Wilbur and Lipman for finding exact matches of length n between query and collection sequences

13
Q

Wilbur-Lipman Approach

A
  • Ignore indel events
  • Extract intervals (fixed-length overlapping subsequences of length n) from the first sequence
  • Store the intervals in a fast search structure (a hash table)
  • For each interval in the second sequence, look it up in the hash table
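The bullets above can be sketched as follows. Because indels are ignored, matching intervals from the same ungapped alignment fall on the same diagonal (position in the first sequence minus position in the second), so the sketch counts hits per diagonal. The function names and the use of a Python dict as the hash table are assumptions for illustration.

```python
from collections import Counter, defaultdict


def index_intervals(seq, n):
    """Hash table mapping each length-n interval to its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - n + 1):
        table[seq[i:i + n]].append(i)
    return table


def diagonal_hits(first, second, n):
    """Count exact interval matches on each diagonal (i - j)."""
    table = index_intervals(first, n)
    diagonals = Counter()
    for j in range(len(second) - n + 1):
        for i in table.get(second[j:j + n], ()):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("GATTACA", "TTAC", n=2)
```

A diagonal with many hits marks a region where the two sequences are likely to align without gaps.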
14
Q

FASTA steps

A

Step 1: Identify regions shared by the two sequences of length n = 1 (using the Wilbur-Lipman method)
Step 2: Rescan the top-ten regions, and rescore using a scoring matrix (protein only)
Step 3: Check to see if initial regions can be joined to form rough alignment with gaps
Step 4: Perform banded Smith-Waterman local alignment centred around all regions that score greater than a threshold

15
Q

BLAST

A

~50x faster than Smith-Waterman, 10x faster than FASTA but not 100% accurate
Stage 1: BLAST searches for hits (matches of length W between query and subject). Location of each hit is passed to stage 2.
Stage 2: BLAST performs an ungapped alignment of region surrounding each hit. High-scoring ungapped alignments (where score > T) are passed to stage 3
Stages 3 and 4: BLAST performs a gapped alignment of region surrounding each high-scoring ungapped alignment
High-scoring alignments are displayed to the user
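Stages 1 and 2 can be sketched roughly as below. This is a simplified illustration, not the NCBI implementation: hits are exact word matches of length w, the extension stops once the running score falls more than `x_drop` below the best score seen (a drop-off rule assumed here), and the scoring values and function names are made up for the example.

```python
def find_hits(query, subject, w):
    """Stage 1: exact matches of length w as (query_pos, subject_pos) pairs."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def ungapped_score(query, subject, hit, w, match=1, mismatch=-3, x_drop=5):
    """Stage 2: extend a hit along its diagonal in both directions, no gaps."""
    i, j = hit
    best = run = w * match
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject) and best - run <= x_drop:
        run += match if query[qi] == subject[sj] else mismatch
        best = max(best, run)
        qi, sj = qi + 1, sj + 1
    run = best  # continue leftwards from the best right-hand extension
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0 and best - run <= x_drop:
        run += match if query[qi] == subject[sj] else mismatch
        best = max(best, run)
        qi, sj = qi - 1, sj - 1
    return best


query, subject = "ACGTACGT", "TTACGTACGTTT"
hits = find_hits(query, subject, w=4)
score = ungapped_score(query, subject, (0, 2), w=4)
```

Only hits whose ungapped score exceeds the threshold T would be passed on to the gapped alignment stage.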

16
Q

Difference between BLAST protein and nucleotide searches

A

BLAST protein search: two hits on the same diagonal (instead of just one) are required to trigger an ungapped alignment. In a nucleotide search, a single hit is enough.

17
Q

Why are index-based approaches not suitable for searching large collections?

A

Because the index ends up being much larger than the data itself

18
Q

True or false:

Two proteins that are related in recent evolutionary terms will usually share sequence and structural similarity

A

True

19
Q

PSI-BLAST

A
Used to detect distantly related homologues not detected by BLASTP.
Several rounds (iterations) of BLAST are run. Between each round, a Position-Specific Scoring Matrix (PSSM) is constructed, used for the subsequent iteration.
20
Q

PSI-BLAST

A
Used to detect distantly related homologues not detected by BLASTP.
Several rounds (iterations) of BLAST are run. Between each round, a Position-Specific Scoring Matrix (PSSM) is constructed and used for the subsequent iteration. If new matches are found, another matrix is constructed. If no new matches are found, hits with an E-value < 1x10^-6 are recorded.
21
Q

Significance threshold for PSI-BLAST

A

Experimental tests of PSI-BLAST using default parameters have determined that proteins identified in the first 20 iterations with expect values (E-values) < 1x10^-6 are most likely real

22
Q

Define: Domain

A

A contiguous stretch of amino acids “that look as though they should have independent stability”.

23
Q

Decreasing the e-value threshold reduces the likelihood of _______ but decreases ________.

A

false positives; sensitivity

24
Q

False positives in PSI-BLAST searches are:

A

High-scoring alignments that are not in fact related to the query

25
Q

What is required to trigger an ungapped alignment in BLASTn?

A

A single hit (no second hit on the same diagonal is required)

26
Q

What is required to trigger an ungapped alignment in BLASTP?

A

Two hits on the same diagonal

27
Q

Stages of BLAST

A
  1. BLAST searches for hits
  2. BLAST performs ungapped alignment. High scoring ungapped alignments are passed to stage 3.
  3. BLAST performs gapped alignment of region surrounding each high-scoring alignment.
  4. High scoring alignments are presented to the user.
28
Q

What types of reads are generated using shotgun sequencing?

A

whole-genome shotgun reads

29
Q

What types of reads are generated using hierarchical sequencing?

A

BAC shotgun reads

30
Q

Advantages of MSA over pairwise alignments

A

More information than pairwise alignment
Can create phylogenetic trees
Can identify conserved regions

31
Q

ClustalW steps

A
  1. Begins with pairwise alignment of all pairs of sequences and scores each pair
  2. Builds a guide tree (a phylogenetic tree) from the pairwise scores
  3. The most closely related sequences are aligned using dynamic programming and form a consensus. The next most closely related sequences are then aligned and form a consensus, and so forth.
32
Q

Iterative search

A
  1. Search database with query sequence
  2. Construct multiple alignment from high-scoring aligned sequences
  3. Construct a profile using the multiple alignment
  4. Search database with profile. Repeat.
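Step 3 can be illustrated with a simple log-odds profile built column by column from an ungapped multiple alignment. This is only a sketch: the uniform background frequencies, the pseudocount of 1 and the names `build_pssm` and `score` are assumptions, and PSI-BLAST's real matrix construction (sequence weighting, data-dependent pseudocounts) is more involved.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """One dict of log2 odds scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score(pssm, segment):
    """Score a segment of the same length as the profile."""
    return sum(column[c] for column, c in zip(pssm, segment))


pssm = build_pssm(["ACG", "ACG", "ATG"], alphabet="ACGT")
```

Residues that are conserved in a column score positively and rare ones negatively, so a search with the profile rewards matches at conserved positions more than a general-purpose matrix would.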
33
Q

E-value

A

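The expected number of alignments scoring at least as well as the observed one that would arise purely by chance in a database of that size. It is not itself a probability, although for very small values the two are nearly equal.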
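For reference, BLAST-style E-values are usually derived from Karlin-Altschul statistics, E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below assumes that formula. The values passed for `lam` and `k` are placeholders for illustration only, since the real values depend on the scoring system.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments with score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance alignment, given E."""
    return 1.0 - math.exp(-e)


# lam and k here are illustrative placeholders, not values from the course.
small_db = e_value(raw_score=50, query_len=100, db_len=1000, lam=0.3, k=0.1)
large_db = e_value(raw_score=50, query_len=100, db_len=2000, lam=0.3, k=0.1)
```

The same score gives twice the E-value in a database twice as large, which is why E-values from different databases cannot be compared directly.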

34
Q

Which profile does PSI-BLAST use?

A

Position-specific score matrices (PSSMs)

35
Q

What profile does SAM use?

A

Hidden Markov Models (HMMs)

36
Q

Examples of iterative searches

A

PSI-BLAST and SAM