Weeks 5-6: Genome Assembly; BLAST/FASTA Flashcards

Question 1

Q

The IHGSC used a _____ approach to sequence the genome

Answer

A

hierarchical

Question 2

Q

Celera used ______ to sequence the genome

Answer

A

whole genome shotgun

Question 3

Q

Pseudogene

Answer

A

A DNA sequence that resembles a gene but has been mutated into an inactive form over the course of evolution.
It often lacks introns and other DNA sequences essential for function. Pseudogenes do not produce functional proteins, but may have regulatory effects.

Question 4

Q

True or false:

98% of the genome doesn’t encode proteins

Answer

A

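True. Only about 2% of the genome encodes proteins.

The other 98% is non-coding. It includes introns, repetitive DNA, regulatory elements, and genes for non-coding RNAs, some of which (such as small RNAs) regulate gene expression.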

Question 5

Q

Output of Sanger sequencing

Answer

A

Single sequence ranging from 500-1000 bp

Question 6

Q

Output of Next-Gen Sequencing

Answer

A

Many short sequences (reads) produced in parallel, each ranging from 25-500 bp

Question 7

Q

De novo assembly

Answer

A

Reconstruction of contiguous sequences (contigs) without making use of any reference sequence.
Reads are partitioned into k-mers (substrings of the read sequence of length k). The k-mers form the nodes of a graph (network), and two nodes are linked when they share a (k-1)-mer.
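
The k-mer graph described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular assembler works: the function names, the toy reads, and the simple "follow the only successor" walk are all assumptions made for the example, and real data would also need error correction and repeat handling.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_kmer_graph(reads, k):
    """Nodes are k-mers; a -> b when the last k-1 bases of a equal the first k-1 bases of b."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Index nodes by their (k-1)-mer prefix so successors are found by lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)
    return {node: sorted(by_prefix[node[1:]]) for node in nodes}


def walk_unambiguous_path(graph, start):
    """Spell out a contig by following edges while there is exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph[node]) == 1:
        nxt = graph[node][0]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping reads (hypothetical data).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_kmer_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATGG")
```

The walk stops as soon as a node has zero or several successors, which is where a real assembler would have to resolve a branch or end the contig.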

Question 8

Q

Genome annotation

Answer

A

Computationally expensive process attaching biologically relevant information to genome sequence data

Question 9

Q

Pre-Assembly Steps

Answer

A

FastQC: To check the quality of the sequencing data, overall GC content, repeat abundance, the proportion of duplicated reads.
Trim sequences: adapter trimming (cutadapt), trim reads based on quality (sickle).
Remove contaminant sequences such as DNA from the PhiX phage: use a short read aligner (such as Burrows-Wheeler Aligner)
Demultiplex reads: Galaxy
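
To make the quality checks concrete, here is a rough Python sketch of GC content and window-based 3' quality trimming. It is only in the spirit of what FastQC and sickle report or do; the function names, the Phred+33 encoding, the threshold of 20, and the window of 4 are assumptions chosen for illustration, not those tools' actual defaults or algorithms.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, quality_string, threshold=20, window=4):
    """Cut the read at the first window whose mean quality falls below threshold."""
    quals = phred_scores(quality_string)
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return seq[:start], quality_string[:start]
    return seq, quality_string


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Hypothetical read: six high-quality bases followed by four poor ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIII####")
```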

Question 10

Q

True or false:

Gel electrophoresis is required for NGS

Answer

A

False. No gel electrophoresis needed.

Question 11

Q

Smith-Waterman Search

Answer

A

Perform dynamic programming between query and each sequence in the collection.
Accurate - guaranteed to report the highest scoring alignments
Slow - searching a 52,000,000,000 basepair collection (entire GenBank database) takes around 3 days on a modern workstation
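
A minimal Python sketch of this exhaustive search follows. The scoring values (match 2, mismatch -1, linear gap -2) and the function names are assumptions for illustration; real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_collection(query, collection):
    """One full dynamic-programming pass per collection sequence, best score first."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in collection.items()]
    return sorted(scored, reverse=True)


# Hypothetical two-sequence collection.
results = search_collection("ACGT", {"s1": "TTACGTTT", "s2": "GGGG"})
```

Each comparison fills a table of size len(query) x len(sequence), which is why the cost grows with the total size of the collection.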

Question 12

Q

FASTA

Answer

A

First heuristic search algorithm
~5x faster than Smith-Waterman
Four stage search process - first stage based on algorithm of Wilbur and Lipman for finding exact matches of length n between query and collection sequences

Question 13

Q

Wilbur-Lipman Approach

Answer

A

Ignore indel events
Extract intervals (fixed-length, overlapping subsequences of length n) from the first sequence
Store the intervals in a fast lookup structure (a hash table)
For each interval in the second sequence, look it up in the hash table
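
The lookup steps above can be sketched in Python as follows. Counting matches per diagonal (position in the first sequence minus position in the second) is one common way to use the fact that indels are ignored; the function names and the toy sequences are assumptions for this example.

```python
from collections import Counter, defaultdict


def index_intervals(seq, n):
    """Hash table mapping each length-n interval of seq to its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - n + 1):
        table[seq[i:i + n]].append(i)
    return table


def diagonal_hits(first, second, n):
    """Count exact length-n matches per diagonal (first position - second position)."""
    table = index_intervals(first, n)
    diagonals = Counter()
    for j in range(len(second) - n + 1):
        for i in table.get(second[j:j + n], []):
            diagonals[i - j] += 1
    return diagonals


# All three shared 2-mers (TT, TA, AC) fall on the same diagonal.
hits = diagonal_hits("GATTACA", "TTAC", n=2)
```

With no indels, matching intervals from one shared region keep the same offset, so a diagonal with many hits marks a candidate region.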

Question 14

Q

FASTA steps

Answer

A

Step 1: Identify regions shared by the two sequences of length n = 1 (using the Wilbur-Lipman method)
Step 2: Rescan the top-ten regions, and rescore using a scoring matrix (protein only)
Step 3: Check to see if initial regions can be joined to form rough alignment with gaps
Step 4: Perform banded Smith-Waterman local alignment centred around all regions that score greater than a threshold

Question 15

Q

BLAST

Answer

A

~50x faster than Smith-Waterman, 10x faster than FASTA but not 100% accurate
Stage 1: BLAST searches for hits (matches of length W between query and subject). Location of each hit is passed to stage 2.
Stage 2: BLAST performs an ungapped alignment of region surrounding each hit. High-scoring ungapped alignments (where score > T) are passed to stage 3
Stages 3 and 4: BLAST performs a gapped alignment of region surrounding each high-scoring ungapped alignment
High-scoring alignments are displayed to the user
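
Stages 1 and 2 can be sketched in Python as below. This is a simplified illustration rather than BLAST's implementation: the match/mismatch scores and the x_drop stopping rule (give up once the running score falls too far below the best seen) are assumptions, as are the function names and the toy sequences.

```python
def find_hits(query, subject, w):
    """Stage 1: (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def _extend_one_way(query, subject, qi, sj, step, start, match, mismatch, x_drop):
    """Extend without gaps in one direction; return the best score reached."""
    best = running = start
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        running += match if query[qi] == subject[sj] else mismatch
        if running > best:
            best = running
        elif best - running > x_drop:
            break
        qi += step
        sj += step
    return best


def ungapped_extend(query, subject, i, j, w, match=1, mismatch=-3, x_drop=5):
    """Stage 2: score of the ungapped alignment around the hit at (i, j)."""
    seed = w * match
    right = _extend_one_way(query, subject, i + w, j + w, 1, seed, match, mismatch, x_drop)
    return _extend_one_way(query, subject, i - 1, j - 1, -1, right, match, mismatch, x_drop)


query, subject = "AAACGTACGGG", "CCACGTACGTT"
hits = find_hits(query, subject, w=4)
T = 5  # hypothetical threshold for passing a hit on to stage 3
passed = [(i, j) for i, j in hits if ungapped_extend(query, subject, i, j, 4) > T]
```

Only the hits whose ungapped score exceeds T would go on to the more expensive gapped alignment.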

Question 16

Q

Difference between BLAST protein and nucleotide searches

Answer

A

BLAST protein search: two hits on the same diagonal (instead of just one, as in a nucleotide search) are required to trigger an ungapped alignment.

Question 17

Q

Why are index-based approaches not suitable for searching large collections?

Answer

A

Because the index ends up being much larger than the data itself

Question 18

Q

True or false:

Two proteins that are related in recent evolutionary terms will usually share sequence and structural similarity

Answer

A

True

Question 19

Q

PSI-BLAST

Answer

A

Used to detect distantly related homologues not detected by BLASTP.
Several rounds (iterations) of BLAST are run. Between each round, a Position-Specific Scoring Matrix (PSSM) is constructed, used for the subsequent iteration.

Question 20

Q

PSI-BLAST

Answer

A

Used to detect distantly related homologues not detected by BLASTP.
Several rounds (iterations) of BLAST are run. Between each round, a Position-Specific Scoring Matrix (PSSM) is constructed and used for the subsequent iteration. If new matches are found, another matrix is constructed. If no new matches are found, hits with an E-value < 1x10^-6 are recorded.
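
As a rough illustration of what a PSSM holds, the Python sketch below turns a gap-free alignment into per-column log-odds scores. It is a simplification and not PSI-BLAST's actual procedure: the uniform background frequencies, the fixed pseudocount, the four-letter toy alphabet, and the function names are all assumptions.

```python
import math


def build_pssm(aligned, alphabet, pseudocount=1.0):
    """One dict per alignment column: log2 odds of each residue vs a uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the column-specific scores for a segment as long as the PSSM."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical alignment of three hits from a previous round.
pssm = build_pssm(["ACD", "ACE", "ACD"], alphabet="ACDE")
```

Unlike a general substitution matrix, the score for a residue depends on the column it falls in, which is what lets later iterations pick up more distant relatives.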

Question 21

Q

Significance threshold for PSI-BLAST

Answer

A

Experimental tests of PSI-BLAST using default parameters have determined that proteins identified in the first 20 iterations with E-values < 1x10^-6 are most likely true homologues

Question 22

Q

Define: Domain

Answer

A

contiguous stretch of amino acids “that look as though they should have independent stability”.

Question 23

Q

Decreasing the e-value threshold reduces the likelihood of _______ but decreases ________.

Answer

A

false positives; sensitivity

Question 24

Q

False positives in PSI-BLAST searches are:

Answer

A

High-scoring alignments that are not in fact related to the query

Question 25

Q

What is required to trigger an ungapped alignment in BLASTn?

Answer

A

A single hit (one word match of length W)

Question 26

Q

What is required to trigger an ungapped alignment in BLASTP?

Answer

A

Two hits on the same diagonal
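
A small Python sketch of the two-hit rule, for illustration only. Hits are (query position, subject position) pairs and the diagonal is their difference; the requirement that the two hits do not overlap and lie within a fixed window, the window size of 40, and the function name are assumptions of this example.

```python
def two_hit_triggers(hits, w, window=40):
    """Return hits that have an earlier, non-overlapping hit on the same diagonal within `window`."""
    last_on_diagonal = {}
    triggers = []
    for q_pos, s_pos in sorted(hits, key=lambda h: h[1]):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s_pos - previous
            if distance < w:
                continue  # overlaps the earlier hit, so keep the earlier one
            if distance <= window:
                triggers.append((q_pos, s_pos))
        last_on_diagonal[diagonal] = s_pos
    return triggers


# Hypothetical hits: only the first two share a diagonal (10).
triggers = two_hit_triggers([(0, 10), (5, 15), (3, 50), (20, 22)], w=3)
```

Under the one-hit rule every entry in the list would trigger an ungapped alignment; here only one does, which is where the saving in work comes from.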

Question 27

Q

Stages of BLAST

Answer

A

BLAST searches for hits
BLAST performs ungapped alignment. High scoring ungapped alignments are passed to stage 3.
BLAST performs gapped alignment of region surrounding each high-scoring alignment.
High scoring alignments are presented to the user.

Question 28

Q

What types of reads are generated using shotgun sequencing?

Answer

A

whole-genome shotgun reads

Question 29

Q

What types of reads are generated using hierarchical sequencing?

Answer

A

BAC shotgun reads

Question 30

Q

Advantages of MSA over pairwise alignments

Answer

A

More information than pairwise alignment
Can create phylogenetic trees
Can identify conserved regions

Question 31

Q

ClustalW steps

Answer

A

Begins by aligning every pair of sequences and scoring each pair
Builds a guide tree (a rough phylogenetic tree) from the pairwise scores
The most closely related sequences are aligned first using dynamic programming and form a consensus. The next most closely related sequences are then aligned to form a consensus, and so forth, following the tree.
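
The order in which sequences get combined can be sketched in Python as below. This is a stand-in, not ClustalW itself: it uses mismatch fraction on equal-length toy sequences instead of scores from real pairwise alignments, and a simple average-linkage clustering instead of ClustalW's tree-building method. All names and data are hypothetical.

```python
from itertools import combinations


def identity_distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def join_order(seqs):
    """Greedy average-linkage clustering; returns merges in the order they would be aligned."""
    clusters = {name: [name] for name in seqs}

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(identity_distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(sorted(clusters), 2), key=lambda p: cluster_distance(*p))
        merges.append((c1, c2))
        clusters[c1 + "+" + c2] = clusters.pop(c1) + clusters.pop(c2)
    return merges


seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGA", "d": "TTTTACCG"}
merges = join_order(seqs)
```

Each merge corresponds to one alignment step: the closest pair first, then progressively more distant groups.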

Question 32

Q

Iterative search

Answer

A

Search database with query sequence
Construct multiple alignment from high-scoring aligned sequences
Construct a profile using the multiple alignment
Search database with profile. Repeat.

Question 33

Q

E-value

Answer

A

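The expected number of alignments scoring at least as high as the observed one that would occur purely by chance in a database of that size. It is not a probability: it can exceed 1, and lower values indicate more significant matches.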
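
For reference, the E-value for a raw score S is usually computed with the Karlin-Altschul formula E = K * m * n * e^(-lambda * S), where m is the query length and n the database length, and the chance of seeing at least one such alignment is P = 1 - e^(-E). The Python sketch below applies those two formulas; the default lambda and K are illustrative placeholders (the real values depend on the scoring matrix and gap penalties), and the function names are made up for the example.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul E-value: K * m * n * exp(-lambda * S). lam and k are placeholders."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance alignment scoring this high, given E."""
    return 1.0 - math.exp(-e)


# Hypothetical search: a 100-residue query against a 1,000,000-residue database.
weak = expect_value(10, 100, 1_000_000)
strong = expect_value(100, 100, 1_000_000)
```

A higher score gives a smaller E-value, a larger database gives a proportionally larger one, and E and P are nearly equal only when E is small.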

Question 34

Q

Which profile does PSI-BLAST use?

Answer

A

Position-specific score matrices (PSSMs)

Question 35

Q

What profile does SAM use?

Answer

A

Hidden Markov Models (HMMs)

Question 36

Q

Examples of iterative searches

Answer

A

PSI-BLAST and SAM