Genome Sequencing Methods Flashcards
Physical mapping
BACs= bacterial artificial chromosomes
cloning vectors for larger pieces of DNA
physical mapping: provides a path of minimally overlapping clones (a tiling path) along a chromosome
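A toy sketch of picking a small path of overlapping clones, assuming each clone's start and end position on the chromosome is already known. The function name, the tuple layout, and the greedy strategy are illustrative assumptions, not how real mapping software works.

```python
def tiling_path(clones, region_start, region_end):
    """Pick a small set of clones covering [region_start, region_end).

    clones: list of (name, start, end) tuples, half-open coordinates
    (hypothetical layout for this sketch).
    Returns the chosen clone names in order; raises ValueError on a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"no clone covers position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


bacs = [("A", 0, 100), ("B", 50, 120), ("C", 90, 200),
        ("D", 150, 300), ("E", 190, 310)]
print(tiling_path(bacs, 0, 300))
```

Clones B and D are skipped because a neighbouring clone already covers their span and reaches further.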
BAC
physical mapping
bacterial artificial chromosomes
BAC ends sequenced to provide landmarks in genome
BAC clones can be sequenced individually (clone by clone) – now outdated and largely replaced by the whole genome shotgun method
Shotgun sequencing
whole genome sequencing
- fragment whole genome into 1-3kb pieces
- clone fragments into vectors (make a library)
- sequence clones from both ends (paired-end sequencing)
- align using computational methods
- individual reads aligned into contigs; contigs linked into scaffolds
- many gaps often remain
- can be problematic for large genomes with many similar TEs and other repetitive DNA
- followed by genome annotation, where repeats are masked
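The read-to-contig step above can be sketched with a toy greedy assembler that repeatedly merges the two sequences with the longest suffix/prefix overlap. This is a minimal illustration assuming error-free reads from one strand; the function names and the `min_len` cutoff are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CCCCCCC"]
print(greedy_assemble(reads))
```

The three overlapping reads collapse into one contig; the read with no overlap stays separate, which is how gaps arise. With repeats, several merges look equally good, which is where misassembly comes from.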
Contigs
contiguous sequences built from individual reads that align (overlap) with each other
- contigs are linked together to form scaffolds
scaffolds
contigs connected together in the correct order and orientation
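A minimal sketch of what a scaffold looks like as a sequence, assuming the contigs are already ordered and oriented and the gap sizes between them have been estimated. Filling gaps with runs of N is the usual convention; the function name and inputs are hypothetical.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs, writing each estimated gap as a run of N."""
    if not contigs or len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between neighbouring contigs
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 5]))
```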
Issues with shotgun sequencing
many gaps remain
large genomes can have many TEs that are similar to each other, plus other repetitive DNA = misalignment etc
Genome annotation
After shotgun sequencing
gene finding programs (FGENESH and GENSCAN)
- repeat masking done to hide repeated regions
- exon/intron structure predicted by programs
- transcriptome comparisons to find expressed genes and intron/exon junctions
- putative functions: found by comparisons to related species
- detection and annotation of non-coding RNAs and TEs
Next generation sequencing
“2nd generation”
Ultra high throughput
-genome sequencing much faster
- 454, Illumina, Ion Torrent
- millions of sequence reads obtained per run, instead of the 96 or 384 of conventional Sanger sequencing
- 454: sequencing by synthesis; no longer used
- Illumina sequencing: higher throughput than 454
– short reads of 100-150 bp, often paired-end
– used for re-sequencing to look for polymorphisms
– now used (though not often) for de novo sequencing, but assembling short reads is challenging, so it needs to be paired with other technologies
- Ion Torrent: intermediate between 454 and Illumina; reads comparable to 454
Single molecule sequencing
3rd generation
- using DNA polymerase
- does not require amplification of template
- PacBio SMRT sequencing and Oxford Nanopore’s MinION
– long reads (2400-4000 bp)
- moderate throughput, similar to 454
- good for de novo sequencing: long reads = easier assembly
- sometimes paired with other technologies such as Illumina to correct for errors
Resequencing of genomes
multiple lines, cultivars, accessions, or ecotypes of a species with a sequenced reference genome can be sequenced and compared to the reference
- used to find variants etc
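A toy sketch of variant finding in resequencing, assuming the sample sequence has already been aligned to the reference with no insertions or deletions. The function name is hypothetical; real variant callers work from many overlapping reads and their quality scores.

```python
def find_snps(reference, sample):
    """Return (position, ref_base, sample_base) for each mismatch.

    Positions are 1-based. Assumes equal-length, gap-free, pre-aligned
    sequences; positions with an unknown base (N) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, r, s)
        for pos, (r, s) in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if r != s and "N" not in (r, s)
    ]


print(find_snps("ACGTACGT", "ACGAACGT"))
```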
Genotyping
use of DNA data to analyze relationships in or among populations
- GBS (genotyping by sequencing) and RAD sequencing
- using Illumina sequencing of restriction-digested DNA, with barcoding to sequence many samples in one lane
- used in population and evolutionary genetic studies
- previously AFLPs and microsatellites were used, but they are not as efficient
- data allow for analysis of SNPs among individuals
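The barcoding step can be sketched as sorting pooled reads back into samples by the barcode at the start of each read. This is a minimal illustration assuming exact barcode matches and barcodes that are not prefixes of one another; the function name, barcodes, and sample names are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample using the barcode at the start of each read.

    barcodes: dict mapping barcode sequence -> sample name.
    The barcode is trimmed off; reads with no matching barcode go to
    the "unassigned" group.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


lane = ["ACGTGGGCCC", "TGCAAATTT", "GGGGAAAA", "ACGTTTTT"]
print(demultiplex(lane, {"ACGT": "plant1", "TGCA": "plant2"}))
```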
Main purposes for RNA-seq
transcriptome sequencing
- reference transcriptome RNA-seq to obtain a set of reference transcripts
Expression profiling
- compare gene expression levels in 2 or more samples from RNA-seq data
Illumina
millions of 100-150bp reads
sometimes paired end
useful for expression profiling: high depth sequencing
can be used for de novo reference transcriptome sequencing, but assembly is challenging with short reads
- read counts are normalized to the length of the gene; after normalization, more reads = higher mRNA expression
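Length normalization can be sketched with TPM (transcripts per million), one common way to do it: divide each gene's read count by its length in kb, then rescale so the values sum to one million. The function name and example numbers are made up, and the notes above do not say which normalization the course uses.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths.

    counts: dict gene -> number of reads mapped to the gene.
    lengths_bp: dict gene -> gene (transcript) length in base pairs.
    """
    # Reads per kilobase: corrects for longer genes collecting more reads.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads mapped to any gene")
    # Rescale so all genes in the sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


values = tpm({"geneA": 100, "geneB": 100}, {"geneA": 1000, "geneB": 2000})
print(values)
```

Both genes have 100 reads, but geneB is twice as long, so geneA ends up with twice the normalized expression.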
454 and Ion Torrent
Longer reads than Illumina, easier to align together or to reference genome
very useful for reference transcriptome sequencing
not as widely used for expression profiling studies as Illumina (read depth lower)
PacBio SMRT and Oxford Nanopore’s MinION
long reads can sequence entire transcripts
very useful for reference transcriptome sequencing
allows sequencing of isoforms
RNA-seq applications
new gene discovery
profiling of tissue/organ types, diseased vs wt, mutant vs wt, effects of stress/pathogens, etc.
-discovery of alternative splice sites
-expression profiling and discovery of miRNA, and siRNAs
- analyzing TF binding sites with chromatin immunoprecipitation followed by sequencing (ChIP-seq), which sequences the bound DNA rather than RNA
Bioinformatics
study of biological information using concepts and methods from computer science and stats
- algorithm and program development
- genome database development
- computational analysis of high throughput DNA sequence and expression data to answer biological questions
blast searching
BLASTn = nucleotide query to nucleotide database
BLASTx = nucleotide query (translated in all reading frames) to protein database
BLASTp = protein query to protein database
tBLASTn = protein query to nucleotide database translated into all possible reading frames
tBLASTx = translated nucleotide query to nucleotide database translated into all possible reading frames; slowest
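"Translated into all possible reading frames" means six frames: three offsets on each strand. A minimal sketch using the standard genetic code; the function names are hypothetical, and this shows only the translation step, not the BLAST search itself.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

tBLASTx does this for both the query and the database, which is why it is the slowest.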
E-value
expectation value
- lower/closer to 0 is best; higher values (e.g. 0.1 and above) mean weaker, less reliable hits
- represents significance of each hit
- defined as number of hits one can expect to find by chance when searching a database of particular size
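The link between database size and E-value can be sketched with the standard relation E = m × n × 2^(−S'), where m is the query length, n the total database length, and S' the bit score of the hit. BLAST itself uses effective (corrected) lengths, so reported values differ somewhat; the function name and numbers are illustrative.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit (bit score 50, 300-residue query) against a small and a large database.
small_db = expected_hits(50, 300, 1_000_000)
large_db = expected_hits(50, 300, 1_000_000_000)
print(small_db, large_db)
```

The same alignment is less significant in the larger database, and each extra bit of score halves the E-value.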
a few ways in which genome size affects an organism
- nucleus size, cell size
- duration of cell cycle
- cell differentiation rate
- metabolic rate
- embryonic developmental rate
- life history strategy
- invasiveness
- extinction rate
4 ways TE insertions can negatively impact host
- energetic costs of replication, transcription, translation
- disrupts cellular processes by TE proteins
- susceptibility to harmful gain-of-function (GOF) mutations
- deleterious rearrangements caused by ectopic recombination
RepeatMasker
program that detects and filters out (masks) repeated sequences in genomes using sequence similarity to a known set (library) of repetitive sequences
- only as good as the reference repeat library
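A toy sketch of library-based masking: every exact copy of a known repeat is written in lower case (soft masking). Exact matching is a stand-in assumption here, since the real program finds similar, not only identical, copies; the function name and sequences are made up.

```python
def mask_known_repeats(seq, repeat_library):
    """Lower-case every exact occurrence of a library repeat in seq."""
    seq = seq.upper()
    masked = list(seq)
    for repeat in repeat_library:
        repeat = repeat.upper()
        start = seq.find(repeat)
        while repeat and start != -1:
            for i in range(start, start + len(repeat)):
                masked[i] = masked[i].lower()
            # Step by one base so overlapping copies are also found.
            start = seq.find(repeat, start + 1)
    return "".join(masked)


print(mask_known_repeats("GGATATATCCATATGG", ["ATAT"]))
print(mask_known_repeats("GGATATATCCATATGG", ["CGCG"]))
```

The second call leaves the sequence untouched because its repeat is not in the library, which is the "only as good as the library" point.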
why is homozygosity important for genome sequencing?
facilitates the assembly of the genome: only 1 copy of each locus has to be assembled, as there are no allelic variants to deal with
- heterozygous regions complicate putting the genome together
when assembling a genome, why is the percentage of genes captured higher than the percentage of the total genome that is assembled?
reads overlap because several types of sequencing are combined, which can lead to overestimation of genes
- genes are easier to find
- repetitive regions are hard to assemble