Weeks 5-6: Genome Assembly; BLAST/FASTA Flashcards
The IHGSC used a _____ approach to sequence the genome
hierarchical
Celera used ______ to sequence the genome
whole genome shotgun
Pseudogene
DNA sequence resembling a gene but mutated into an inactive form over the course of evolution.
Often lacks introns and other essential DNA sequences necessary for function. Pseudogenes do not result in functional proteins, but may have regulatory effects
True or false:
98% of the genome doesn’t encode proteins
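True. Only about 2% of the genome encodes proteins.
The other ~98% is non-coding. Part of it encodes RNAs (including small RNAs) that regulate gene expression; the rest includes introns, repeats and regulatory sequences.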
Output of Sanger sequencing
Single sequence ranging from 500-1000 bp
Output of Next-Gen Sequencing
Groups of short sequences (reads) ranging from 25-500 bp
De novo assembly
Reconstruction of contiguous sequences without making use of any reference sequence.
Reads are partitioned into k-mers (substrings of the read sequence of length k). The k-mers form the nodes of a graph (network), and two k-mers are linked when they share a (k-1)-mer.
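Worked example (a minimal sketch, not any particular assembler's implementation): the k-mer graph described above can be built in a few lines of Python. The function names `kmers` and `build_kmer_graph` and the toy reads are hypothetical, and the sketch assumes error-free reads from a single strand. An edge a -> b is drawn when the last k-1 bases of a equal the first k-1 bases of b.

```python
from collections import defaultdict

def kmers(read, k):
    """Return every overlapping substring of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_kmer_graph(reads, k):
    """Nodes are k-mers; a -> b when a's (k-1)-suffix equals b's (k-1)-prefix."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Index nodes by their (k-1)-mer prefix so successors are a dictionary lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix[node[1:]]) for node in nodes}

# Two toy reads that overlap on "GGC".
graph = build_kmer_graph(["ATGGC", "GGCGT"], k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Following the edges from ATG through TGG, GGC and GCG to CGT spells out the contig ATGGCGT, which is how overlapping reads are merged without a reference sequence.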
Genome annotation
Computationally expensive process of attaching biologically relevant information to genome sequence data
Pre-Assembly Steps
FastQC: check the quality of the sequencing data, overall GC content, repeat abundance, and the proportion of duplicated reads.
Trim sequences: adapter trimming (cutadapt), trim reads based on quality (sickle).
Remove contaminant sequences such as DNA from the PhiX phage: use a short read aligner (such as Burrows-Wheeler Aligner)
Demultiplex reads: Galaxy
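Worked example (a simplified sketch, not the actual algorithm used by FastQC, cutadapt or sickle): two of the checks above, GC content and quality-based trimming, reduce to simple operations on a read and its quality string. The function names and the threshold `min_q=20` are assumptions for illustration; the quality string is assumed to use the Phred+33 encoding of standard FASTQ files.

```python
def gc_content(seq):
    """Fraction of bases in the read that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]

def trim_low_quality_tail(seq, quality, min_q=20):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]

print(gc_content("ATGC"))
print(trim_low_quality_tail("ACGT", "II5#"))
```

Real trimmers typically use a sliding window over the quality scores rather than a single-base cutoff, so treat this only as an illustration of the idea.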
True or false:
Gel electrophoresis is required for NGS
False. No gel electrophoresis needed.
Smith-Waterman Search
Perform dynamic programming between query and each sequence in the collection.
Accurate - guaranteed to report the highest scoring alignments
Slow - searching a 52,000,000,000 base-pair collection (the entire GenBank database) takes around 3 days on a modern workstation
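Worked example (a minimal sketch): the dynamic programming step of Smith-Waterman fills a matrix in which each cell holds the best local alignment score ending at that pair of positions. The scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions chosen for illustration, and the sketch returns only the best score and its end position, without the traceback.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (row, col) where it ends)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere (local, not global).
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

print(smith_waterman("TTACGTT", "GGACGGG"))
```

Every cell of the len(a) x len(b) matrix is computed for every sequence in the collection, which is why the search is accurate but slow.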
FASTA
First heuristic search algorithm
~5x faster than Smith-Waterman
Four stage search process - first stage based on algorithm of Wilbur and Lipman for finding exact matches of length n between query and collection sequences
Wilbur-Lipman Approach
- Ignore indel events
- Extract intervals (fixed-length, overlapping subsequences of length n) from the first sequence
- Store the intervals in a fast search structure (a hash table)
- For each interval in the second sequence, search for it in the hash table
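Worked example (a minimal sketch of the steps above): a Python dictionary serves as the hash table. The function names `index_intervals` and `find_exact_matches` and the toy sequences are hypothetical. Each hit is reported as a pair (position in first sequence, position in second sequence).

```python
from collections import defaultdict

def index_intervals(seq, n):
    """Map each length-n interval of seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - n + 1):
        table[seq[i:i + n]].append(i)
    return table

def find_exact_matches(first, second, n):
    """Return (i, j) pairs where first[i:i+n] == second[j:j+n]."""
    table = index_intervals(first, n)
    hits = []
    for j in range(len(second) - n + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss.
        for i in table.get(second[j:j + n], []):
            hits.append((i, j))
    return hits

print(find_exact_matches("ACGTAC", "TACG", 2))
# -> [(3, 0), (0, 1), (4, 1), (1, 2)]
```

Because indels are ignored, each lookup is a constant-time exact match, which is what makes this stage so much faster than dynamic programming.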
FASTA steps
Step 1: Identify regions shared by the two sequences by finding exact matches of length n (here n = 1) using the Wilbur-Lipman method
Step 2: Rescan the top-ten regions, and rescore using a scoring matrix (protein only)
Step 3: Check to see if initial regions can be joined to form rough alignment with gaps
Step 4: Perform banded Smith-Waterman local alignment centred around all regions that score greater than a threshold
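Worked example (a rough sketch of Step 1 only, under stated assumptions): one common way to turn exact matches into candidate regions is to group them by diagonal, i.e. by the offset i - j between the query position i and the subject position j, since hits on the same diagonal can be joined without gaps. The card above does not spell out this grouping, so treat the diagonal counting, the function names and the default `limit=10` (mirroring the "top-ten regions" of Step 2) as illustrative assumptions.

```python
from collections import Counter

def diagonal_hit_counts(query, subject, n):
    """Count exact length-n matches on each diagonal (query pos - subject pos)."""
    positions = {}
    for i in range(len(query) - n + 1):
        positions.setdefault(query[i:i + n], []).append(i)
    counts = Counter()
    for j in range(len(subject) - n + 1):
        for i in positions.get(subject[j:j + n], []):
            counts[i - j] += 1
    return counts

def top_diagonals(query, subject, n, limit=10):
    """Return up to `limit` (diagonal, hit count) pairs, densest first."""
    return diagonal_hit_counts(query, subject, n).most_common(limit)

print(top_diagonals("ACGTAC", "TACG", 2))
```

The densest diagonals are the ones worth rescoring with a scoring matrix and, later, aligning with the banded Smith-Waterman of Step 4.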
BLAST
~50x faster than Smith-Waterman, 10x faster than FASTA but not 100% accurate
Stage 1: BLAST searches for hits (matches of length W between query and subject). Location of each hit is passed to stage 2.
Stage 2: BLAST performs an ungapped alignment of region surrounding each hit. High-scoring ungapped alignments (where score > T) are passed to stage 3
Stages 3 and 4: BLAST performs a gapped alignment of region surrounding each high-scoring ungapped alignment
High-scoring alignments are displayed to the user
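Worked example (a minimal sketch of stages 1 and 2 only): find hits of length W, then extend each hit without gaps and keep those scoring above T. The scoring values (match = 1, mismatch = -1), the drop-off rule `x_drop` used to stop the extension, and all function names are assumptions for illustration and are not taken from the cards above; the gapped alignment of stages 3 and 4 is omitted.

```python
def find_hits(query, subject, w):
    """Stage 1: (i, j) pairs where query[i:i+w] == subject[j:j+w]."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]

def extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Best score gained by extending in one direction (step = +1 or -1)."""
    score = best = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best = score
        elif best - score > x_drop:
            break  # the score has fallen too far below the best seen so far
        qi += step
        sj += step
    return best

def ungapped_score(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Stage 2: score of the hit plus its best ungapped extension both ways."""
    right = extend(query, subject, i + w, j + w, 1, match, mismatch, x_drop)
    left = extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    return w * match + left + right

def high_scoring_hits(query, subject, w, t):
    """Keep hits whose ungapped alignment score exceeds the threshold t."""
    results = []
    for i, j in find_hits(query, subject, w):
        score = ungapped_score(query, subject, i, j, w)
        if score > t:
            results.append((i, j, score))
    return results

print(high_scoring_hits("AACGTAC", "GACGTAG", w=4, t=4))
# -> [(1, 1, 5), (2, 2, 5)]
```

Only the regions that survive this filter would go on to the slower gapped alignment, which is where most of the speed-up over Smith-Waterman comes from.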