The Woogie 1 Flashcards
Cons of Sanger sequencing?
- expensive
- error prone
- prone to bias
- Low throughput (only a small number of reads can be sequenced at one time)
Alignment techniques using reference genome
- Align reads against reference genome
- Mapping: recover the position of a sequence read in the genome
- Alignment: recover the position of each base of a sequence read in the genome
- Alignment uses sequence matches to locate where a read belongs in the genome
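A minimal sketch of locating a read in a reference by sequence matches. It assumes ungapped matching with a small mismatch allowance; the names `hamming`, `map_read` and `max_mismatches` are illustrative and not taken from any real aligner.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped match, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        distance = hamming(read, window)
        # Keep the first window with the fewest mismatches.
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (pos, distance)
    return best


reference = "ACGTTAGCCGATAGGCTA"
print(map_read("GCCGAT", reference))  # (6, 0): exact match
print(map_read("GCCTAT", reference))  # (6, 1): one mismatch
```

Real aligners use indexes rather than scanning every position, and they also handle gaps; this only shows the idea of placing a read by matching sequence.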
Genome Project Challenges
1. Sequencing technologies are not perfect
2. How hard DNA is to sequence varies between samples
3. Reference genome could have multiple organisms which means bigger reads.
4. Sample sequences contain errors, mismatches, and gaps.
2nd Generation Sequencing methods
- Library prep: extract DNA from cells
- Adapters are attached to the DNA fragments
- Anchoring to sequencer
- DNA is melted into single strands and added to the flow cell
- Flow cell is saturated with oligonucleotides complementary to the adapters
- Adapter sequences attach the fragments to the bottom of the flow cell
First generation: Sanger sequencing- sequencing by synthesis
- Fluorescent bases added by polymerase
- Laser identifies base type
- Fluorescent label is then removed
- Terminal nucleotide can then accept another base
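The add, read, remove cycle listed above can be sketched as a loop. This is a toy model only: the colour assigned to each base is a made-up assumption, strand direction is ignored, and `sequence_by_synthesis` is an illustrative name.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Hypothetical colour per labelled base, for illustration only.
COLOUR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOUR = {colour: base for base, colour in COLOUR.items()}


def sequence_by_synthesis(template, cycles):
    """Return the bases called over the given number of cycles."""
    calls = []
    for template_base in template[:cycles]:
        added = COMPLEMENT[template_base]      # polymerase adds one labelled base
        signal = COLOUR[added]                 # laser excites the label; colour is recorded
        calls.append(BASE_FOR_COLOUR[signal])  # colour identifies the base type
        # The label is then removed, so the next cycle can add another base.
    return "".join(calls)


print(sequence_by_synthesis("ACGT", 4))  # TGCA
```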
Genome Size Abundance
1. Complex organisms generally have larger genomes; less complex organisms generally have smaller ones.
What is in a genome?
1. Genes
2. Centromeres and telomeres
3. Binding sites, repeat regions, transposable elements
4. Epigenetic marks: methyl markers on histones
Sanger sequencing using dNTP and ddNTP
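- Unwind DNA helix into single strands
- Polymerase makes a new strand from dNTPs
- Incorporating a ddNTP halts polymerase action
- Fragments are run on a gel (electrophoresis) to separate them by length
- Laser shines onto the labelled ddNTPs to identify the terminal base of each fragment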
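A minimal sketch of why chain termination reveals the sequence: if a ddNTP can stop synthesis at any position, sorting the fragments by length and reading each terminal base spells out the new strand. The function names are illustrative, and the model assumes exactly one fragment per position.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_fragments(template):
    """Return one (length, terminal_base) pair per possible stop position."""
    new_strand = "".join(COMPLEMENT[base] for base in template)
    # A ddNTP incorporated at position i halts the polymerase there.
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]


def read_gel(fragments):
    """Sort fragments by length (as on a gel) and read the terminal bases."""
    return "".join(base for _, base in sorted(fragments))


fragments = sanger_fragments("ATGC")
mixed = list(reversed(fragments))  # fragments start out mixed together
print(read_gel(mixed))  # TACG
```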
Pros of 2nd generation sequencing
1. Fast
2. Cheap
3. High throughput
3rd Generation Sequencing Pros
- Fast
- Cheap
- High throughput
- Can read single strands
- Can produce long reads
What is sequencing throughput?
The number of sequences a sequencer can read at one time.
Cons of 2nd generation sequencing
1. Prone to error (repeat regions)
2. Prone to amplification biases
3. Fragment length restricted
A genome consists of?
1. Chromosomes
2. mtDNA
3. Chloroplast DNA
4. Plasmids
3rd Generation Sequencing Cons
- Some sequencing can be expensive
- Nanopore device is cheap, but flow cells are expensive
- Prone to error (only 80% accuracy)
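N50 contig length
Contig length at which contigs of this size or larger cover 50% of the assembly
1. Add up all contig lengths
2. Sort contigs by decreasing length
3. Start with the largest contig
- If it covers 50% of the assembly length, its length is the N50
- If not, add the second largest, and so on, until 50% is covered; the length of the last contig added is the N50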
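The N50 steps above can be sketched as a short function. The name `n50` and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(contig_lengths)                  # 1. add up all contig lengths
    running = 0
    for length in sorted(contig_lengths, reverse=True):  # 2. largest first
        running += length                        # 3. accumulate until half is covered
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


lengths = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
print(n50(lengths))  # 70, because 80 + 70 = 150 >= 145
```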
Genome assembly metrics depend on:
- Number of contigs
- Length of assembly
- Number of genes
- Assembly accuracy
String graph assembly
- Uses overlaps between reads to build the graph
- Reads become nodes and overlaps become edges
- Need to specify a minimum overlap length for an overlap to count as true, so that the graph is meaningful
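A minimal sketch of the overlap step: reads are nodes, and an edge is kept only when the suffix of one read matches the prefix of another over at least a minimum length. This all-against-all comparison is a simplification, and `overlap`, `overlap_graph` and `min_length` are illustrative names (`min_length` is assumed to be at least 1).

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_length=3):
    """Map each (read_a, read_b) edge to its overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_length)
                if n:
                    edges[(a, b)] = n
    return edges


reads = ["ATGCGT", "GCGTAC", "GTACCA"]
print(overlap_graph(reads, min_length=3))
```

Lowering `min_length` to 2 adds an extra edge from "ATGCGT" to "GTACCA" based on only two shared letters, which shows why a minimum overlap is needed.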
De Bruijn Graph
- Split reads into short strings of k letters (k-mers)
- Use k-mer overlap to build the graph
- De Bruijn graph uses the short strings as nodes and their overlaps as edges
- Repeat regions can make this complicated
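A minimal sketch of building a graph from k-mer overlaps. It assumes the formulation where k-mers are nodes and an edge joins two k-mers that overlap by k-1 letters; other formulations put (k-1)-mers on the nodes instead. The names `kmers` and `de_bruijn` are illustrative.

```python
def kmers(read, k):
    """Return every substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each k-mer to the k-mers that overlap it by k-1 letters."""
    nodes = sorted({kmer for read in reads for kmer in kmers(read, k)})
    return {a: [b for b in nodes if a[1:] == b[:-1]] for a in nodes}


graph = de_bruijn(["ATGGCG", "GGCGTG"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

In this example both "ATG" and "GTG" lead to "TGG" because "TG" occurs twice, so the graph contains a loop. That is a small instance of how repeats complicate the graph.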
Presence of necessary biological genes
These are genes that 90% of species have as a single copy
Challenges in genome sequencing
- Short reads (around 300 bp)
- Similar sequences found multiple times in the genome
- Repetitive regions increase the read length needed to resolve them
- Multiple gene copies, sequence errors and uneven genomic coverage
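A small sketch of why repeats are hard for short reads: a read that sits inside a repeat matches several places, while a longer read that spans the repeat matches only one. The genome string and the name `occurrences` are made up for illustration.

```python
def occurrences(read, genome):
    """Return every start position where the read matches exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


genome = "GGATCACACACATTG"  # contains a short CA repeat
print(occurrences("CACA", genome))        # [4, 6, 8]: ambiguous
print(occurrences("TCACACACAT", genome))  # [3]: spans the repeat, unique
```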
Assembly vs alignment
- Assembly: assembles reads into a genome without a reference genome
- Alignment: aligns reads against a given genome
Genome assembly
- Aim is to join reads into one piece (hard to achieve)
- Reads can be assembled into contigs (contiguous sequences)
- Contigs joined by gaps of unknown sequence (NNN) are scaffolds
- Overall, assemblies exist as contigs, scaffolds and chromosomes
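A minimal sketch of how ordered contigs become a scaffold, with runs of N standing in for the unknown sequence between them. It assumes contig order and gap sizes are already known; `scaffold` and `gap_sizes` are illustrative names.

```python
def scaffold(contigs, gap_sizes):
    """Join ordered contigs, filling each gap with that many N characters."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


print(scaffold(["ACGT", "GGA"], [3]))  # ACGTNNNGGA
```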
Gene Annotation
1. Assemblies become useful once genomic features are identified (gene locations, promoters)
2. First annotation level: find start and stop codons. Second level: use genes from other organisms to confirm the annotation
3. Use RNA sequences from other species to align against the assembly
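A minimal sketch of the first annotation level, scanning for start and stop codons. It assumes the standard start codon ATG and stop codons TAA, TAG and TGA, looks at the forward strand only, and uses the illustrative names `find_orfs` and `min_codons`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) pairs; end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGC"
print(find_orfs(sequence))  # [(2, 14)]
print(sequence[2:14])       # ATGAAATTTTAG
```

Candidates found this way would still need the second level of checking against known genes or RNA evidence.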