W8L1 Genomics: genome assembly and genetic markers Flashcards
Some problems with initial genome assembly
- Shotgun sequencing alone does not define a genome: the reads still require assembly and annotation.
- The genome sequence does not provide all the information: chromatin conformation and epigenetic states are important too.
- A single genome does not represent the whole species.
Why is genome information important
Genome information is highly valuable for defining a species and for gaining information on the structure and function of the organism.
Why are genetic markers used as landmarks
- to assess genetic diversity
- to gain positional information about the genetic basis of traits
The 1000 Genomes Project
From one genome to 1000 genomes: an increasing appreciation of the importance of variation.
Mostly re-sequencing: shotgun reads from many individuals are aligned against the reference genome.
Sequencing the genome
Shotgun sequencing using high-throughput short-read technology
Short random fragments of DNA are sequenced across the genome to a given depth of coverage.
Fragments can consist of
- Single reads (typically 50–1000 bp)
- Paired-end reads of varying insert size (note that paired-end reads can overlap).
- Mate-pair libraries spanning larger genomic regions (~2–20 kb inserts), with reads generally facing outwards
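The "depth of coverage" mentioned above is usually summarised as C = N × L / G (N reads of length L over a genome of size G). A minimal sketch, assuming reads land uniformly at random across the genome (an idealisation), so that the expected fraction of bases with no read is roughly e^(-C). The function names and the example numbers are illustrative and not taken from the lecture.

```python
import math

def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage C = N * L / G."""
    return n_reads * read_length / genome_size

def fraction_uncovered(depth: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson-distributed depth."""
    return math.exp(-depth)

if __name__ == "__main__":
    # Hypothetical run: 200 million reads of 150 bp over a 3 Gb genome.
    c = expected_depth(200_000_000, 150, 3_000_000_000)
    print(f"mean depth: {c:.1f}x")
    print(f"expected uncovered fraction: {fraction_uncovered(c):.2e}")
```

Real libraries are not uniform (GC bias, repeats), so the uncovered fraction is typically higher than this estimate.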
Assembling a genome
First, define contigs based on sequence overlap between reads, represented in a de Bruijn graph.
Second, scaffold the contigs using large-insert libraries (e.g. mate pairs) or long-read technology.
Third, integrate the scaffolds into maps, using either physical maps (such as BAC maps) or genetic maps (linkage maps).
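To illustrate the contig step above, here is a toy de Bruijn graph: every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by following unambiguous edges. This is a sketch under strong assumptions (error-free reads, one strand only, no check on in-degree), not how a production assembler works; the function names and reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles, which repeats would create
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

if __name__ == "__main__":
    # Three overlapping toy reads.
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
    graph = de_bruijn_graph(reads, k=4)
    print(walk_unambiguous_path(graph, "ATG"))
```

A node with more than one successor (a branch) ends the contig, which is why repeats fragment assemblies and scaffolding is needed afterwards.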
Annotating a genome
- Using gene prediction models, expression data, and identification of homologous proteins (from databases)
- The final model combines multiple sources of evidence
- Comparing the annotated gene set against biological expectations
Some problems with gene annotation
With increasing biological complexity, genome size becomes a poor proxy for gene number.
Genome complexity in terms of repetitive elements
The reassociation kinetics of E. coli and bovine DNA fragments are a function of the initial DNA concentration multiplied by the incubation time (C0t).
- The E. coli DNA reassociates at a uniform rate, consistent with each fragment of DNA being represented once.
- The bovine DNA fragments exhibit two distinct steps in their reassociation: a fast step for repetitive sequences and a slow step for single-copy sequences.
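The one-step versus two-step behaviour above can be reproduced with the ideal second-order reassociation equation, in which the fraction still single-stranded is 1 / (1 + k × C0t) for each kinetic class. A minimal sketch: the weights and rate constants below are made-up illustrative values, not measurements for E. coli or bovine DNA.

```python
def fraction_single_stranded(cot, components):
    """Ideal second-order reassociation: sum of w / (1 + k * C0t) over kinetic classes.

    components: list of (weight, rate_constant) pairs whose weights sum to 1.
    """
    return sum(w / (1.0 + k * cot) for w, k in components)

# Illustrative parameters only (assumptions, not measured values).
SINGLE_COPY_GENOME = [(1.0, 1.0)]      # one kinetic class, E. coli-like
REPEAT_RICH_GENOME = [(0.4, 100.0),    # fast class: repetitive fraction
                      (0.6, 0.001)]    # slow class: single-copy fraction

if __name__ == "__main__":
    for exponent in range(-4, 5):
        cot = 10.0 ** exponent
        a = fraction_single_stranded(cot, SINGLE_COPY_GENOME)
        b = fraction_single_stranded(cot, REPEAT_RICH_GENOME)
        print(f"C0t=1e{exponent:+d}  single-copy={a:.3f}  repeat-rich={b:.3f}")
```

With these values the repeat-rich curve drops quickly, plateaus near 0.6, then drops again at high C0t, which is the two-step shape described for bovine DNA.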
Coupling chemical or enzymatic treatment with sequencing
Bisulfite conversion
- Non-methylated cytosine is converted to uracil.
- Comparing treated vs untreated DNA samples identifies the methylated cytosines, which remain unconverted.
Chromatin immunoprecipitation (ChIP)
- An antibody specific to a protein of interest (e.g. a histone) is used to isolate the DNA subfraction interacting with that protein.
Assay for Transposase-Accessible Chromatin (ATAC)
- A modified Tn5 transposase coupled with sequencing adapters targets the accessible, open chromatin segments of a DNA sample.
High-throughput Chromosome Conformation Capture (Hi-C)
- Captures fragments that are in close spatial proximity in a DNA sample by cross-linking them.
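For the bisulfite conversion step above, the comparison of treated and untreated sequence comes down to a per-cytosine rule: a reference C still read as C was protected (methylated), while a reference C read as T was converted (uracil is read as thymine after amplification). A minimal sketch, assuming a gap-free alignment on the same strand as the reference; the function name and sequences are hypothetical.

```python
def call_methylation(reference: str, bisulfite_read: str):
    """Classify each reference cytosine from an aligned, same-strand bisulfite read.

    Returns (position, status) pairs with status 'methylated' (C stays C),
    'unmethylated' (C read as T) or 'other' (any other base, e.g. an error or SNP).
    """
    if len(reference) != len(bisulfite_read):
        raise ValueError("sequences must be aligned and of equal length")
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base != "C":
            continue  # only cytosines are informative
        if read_base == "C":
            calls.append((pos, "methylated"))
        elif read_base == "T":
            calls.append((pos, "unmethylated"))
        else:
            calls.append((pos, "other"))
    return calls

if __name__ == "__main__":
    # Toy example: the C at position 1 is protected, those at 4 and 5 are converted.
    print(call_methylation("ACGTCCGA", "ACGTTTGA"))
```

In practice a methylation level is estimated per site from many reads, since conversion is not perfectly efficient.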
What is a genetic marker
Genetic markers are simpler genetic landmarks used as proxies for more complex and less accessible causal sources of variation.
This explicitly relies on linkage disequilibrium (LD) between the marker and the causal variant.
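Linkage disequilibrium between a marker and a second locus is commonly quantified as D = p(AB) - p(A) × p(B) and r² = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))). A minimal sketch computing both from phased two-locus haplotypes; the two-character string encoding (such as "AB" or "ab") and the function name are assumptions made for illustration.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes, allele_a="A", allele_b="B"):
    """Return (D, r2) for two biallelic loci from phased two-locus haplotypes.

    Each haplotype is a 2-character string: allele at locus 1, then allele at locus 2.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for h, c in counts.items() if h[0] == allele_a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == allele_b) / n
    p_ab = counts[allele_a + allele_b] / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined if a locus is monomorphic
    return d, r2

if __name__ == "__main__":
    # Complete association: the marker allele perfectly predicts the other locus.
    print(linkage_disequilibrium(["AB"] * 5 + ["ab"] * 5))
    # Independent loci: the marker carries no information about the other locus.
    print(linkage_disequilibrium(["AB", "Ab", "aB", "ab"]))
```

An r² near 1 means the marker is a good proxy for the causal variant; an r² near 0 means it is uninformative.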
What is a good marker
- Polymorphic and Abundant
- Unambiguous/Repeatable
- Neutral (Not causal)
- Co-dominant (an added advantage)
Why are SNPs used as reference markers
- Most abundant
- Easy to genotype, using either microarrays or direct sequencing
Genetic marker coverage: depth vs breadth
There are two dimensions to genomic data:
- breadth of coverage (completeness)
- depth of coverage (accuracy)
Genome sizes and complexities for non-model species often limit whole-genome sequencing approaches, forcing the use of:
- Reduced-representation approaches
- which may be targeted or not
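The two dimensions above can be read directly from per-position read depths: breadth is the fraction of positions covered at all, and depth is how many reads support each covered position. A minimal sketch with toy depth vectors (hypothetical values) that spend the same total sequencing effort either across the whole region or on a reduced subset of it.

```python
def coverage_summary(per_base_depth, min_depth=1):
    """Return (breadth, mean depth at covered positions) for a list of per-position depths."""
    if not per_base_depth:
        raise ValueError("no positions given")
    covered = [d for d in per_base_depth if d >= min_depth]
    breadth = len(covered) / len(per_base_depth)
    depth_at_covered = sum(covered) / len(covered) if covered else 0.0
    return breadth, depth_at_covered

if __name__ == "__main__":
    # Same total of 40 sequenced bases over a 20-position region.
    whole_region = [2] * 20                   # broad but shallow
    reduced_subset = [20, 20] + [0] * 18      # narrow but deep
    print("whole region  :", coverage_summary(whole_region))
    print("reduced subset:", coverage_summary(reduced_subset))
```

Reduced representation trades breadth for depth, which is what makes genotype calls at the retained sites more reliable for a fixed budget.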
Genetic markers for population genetics: absence of prior knowledge
For population-level information:
Use of Restriction site-Associated DNA sequencing (RAD-seq)
- Increases the depth of coverage for a set of neutral markers
- Allows accurate estimation of allele frequencies
- Gives random coverage of the genome (no linkage with specific loci)
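The reduced representation obtained by sequencing next to restriction sites can be previewed with an in-silico digest: locate every occurrence of the enzyme's recognition site, then estimate how much of the sequence the flanking reads would sample. A minimal sketch, assuming the EcoRI site GAATTC and one read of fixed length on each side of every site; the function name, read length and toy sequence are illustrative choices, not part of the lecture.

```python
import re

def rad_loci(sequence: str, site: str = "GAATTC", read_length: int = 100):
    """Return (site start positions, upper bound on the fraction of the sequence sampled).

    Assumes one read of `read_length` bp on each side of every site; overlaps between
    neighbouring tags are ignored, so the fraction is an upper bound.
    """
    if not sequence:
        raise ValueError("empty sequence")
    # Lookahead so that overlapping occurrences of the site are all reported.
    positions = [m.start() for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    sampled = min(len(sequence), 2 * read_length * len(positions))
    return positions, sampled / len(sequence)

if __name__ == "__main__":
    # Toy 1 kb sequence with a single EcoRI site in the middle.
    toy = "A" * 500 + "GAATTC" + "C" * 494
    positions, fraction = rad_loci(toy)
    print(positions, fraction)
```

Because only a small, repeatable fraction of the genome is sequenced, the same reads give much deeper coverage per locus across many individuals.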