Human Genome Sequencing Flashcards
Shotgun Sequencing
DNA is randomly fragmented and the short fragments are sequenced individually; overlapping fragments are then reassembled to reconstitute the genome
depends upon coverage
first was Haemophilus influenzae
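Worked example for "depends upon coverage": a minimal sketch using the standard fold-coverage formula c = N x L / G and the Lander-Waterman (Poisson) approximation for the fraction of bases left unsequenced. The read count, read length and genome size are illustrative assumptions, not figures from these notes.

```python
import math


def fold_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average fold coverage c = N * L / G."""
    return n_reads * read_len / genome_len


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson (Lander-Waterman) estimate: P(base not covered) = e^-c."""
    return math.exp(-coverage)


# Hypothetical project: 36,000 reads of 500 bp over a 1.8 Mb genome
c = fold_coverage(n_reads=36_000, read_len=500, genome_len=1_800_000)
print(c)  # 10.0
print(expected_uncovered_fraction(c))  # small but non-zero: gaps remain even at 10x
```

Even at high coverage the uncovered fraction never reaches zero, which is one reason gaps persist in shotgun assemblies.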
Celera Genomics 1998 (3 years)
Drosophila genome (~180Mb)
- 120Mb of euchromatic genome
- a lot of errors; supported by BAC map
Human genome (Feb 2001) - 2,696Mb
- used IHGSC data
- Chr16 smaller
- under-represents repeats
- not good for long alignments (1-4kb)
- effectiveness: number, type, size
- difficult assembling highly similar repeats (recent origin)
- oversimplifies duplicated regions
- 300 million
Hierarchical Shotgun (BAC library)
Decompose the genome into overlapping BAC clones; shotgun sequence and reassemble each clone, then merge it with the sequences of adjacent clones (clone contig map)
IHGSC (1990)
Human BAC libraries
- 20x coverage
- 2865Mb
- RPCI-11 male library (543,797 clones, 32.2x)
3 billion
BAC fingerprinting - digest with 1 or 2 restriction enzymes (RE) - separate fragments by electrophoresis - identify BACs with common bands (complete and partial overlaps)
clone end sequencing (map as you go): sequence the ends of all BAC clones and a seed BAC entirely, query the database of end sequences using the seed BAC sequence, identify overlapping BACs and sequence them next
build 34
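Worked example for BAC fingerprinting: a minimal sketch of finding BACs with common bands. Each BAC is represented by its list of restriction fragment sizes, and two BACs are candidate neighbours if enough bands match within a size tolerance. The clone names, band sizes, `tolerance` and `min_shared` are all hypothetical; real fingerprint software uses statistical scoring rather than a fixed cut-off.

```python
from itertools import combinations


def shared_bands(a, b, tolerance=2):
    """Count bands in `a` that have a band in `b` within `tolerance` bp."""
    return sum(any(abs(x - y) <= tolerance for y in b) for x in a)


def candidate_overlaps(fingerprints, min_shared=3):
    """Return (clone1, clone2, shared) for every pair sharing enough bands."""
    pairs = []
    for (n1, f1), (n2, f2) in combinations(sorted(fingerprints.items()), 2):
        s = shared_bands(f1, f2)
        if s >= min_shared:
            pairs.append((n1, n2, s))
    return pairs


# Hypothetical fragment sizes (bp) read off a gel
fingerprints = {
    "BAC_A": [1200, 2500, 3100, 4800, 7600],
    "BAC_B": [2501, 3099, 4800, 9000, 11000],
    "BAC_C": [500, 6400, 13000],
}
print(candidate_overlaps(fingerprints))  # [('BAC_A', 'BAC_B', 3)]
```

BAC_A and BAC_B share three bands, so they would be placed next to each other in the clone contig map; BAC_C shares none.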
algorithm approach, problem and solution
overlap: compare all sequence reads pairwise and find overlaps (graph)
layout: determine the shortest path through graph
consensus: where overlaps differ in sequence use consensus
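Worked example for overlap / layout: a toy sketch, assuming error-free reads. Greedy merging of the best-overlapping pair stands in for the layout step (it is a common simplification, not a true shortest-path search), and because overlaps must match exactly there is no real consensus step. The reads and `min_len` are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        # Overlap step: compare all reads pairwise
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        # Layout step (greedy): join the best pair into one longer sequence
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```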
Problem - repeats cause many identical overlaps (misassembly, gaps, missing data)
- tandem repeat: collapsed to only one copy of the repeat
- 2 genome-wide repeats w/ sequence in the middle: sequence between them is lost
Solution: sequence both ends of large inserts (2, 10, 50kb) and use them to make sure nothing is being lost (expected distance, corresponding ends)
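Worked example for the paired-end solution: a minimal sketch that checks whether the two end reads of each insert sit at the expected distance in the assembly. The clone names, coordinates and the 20% `tolerance` are hypothetical; the point is only that a pair lying much closer together than the insert size hints at collapsed or lost sequence between the ends.

```python
def check_mate_pairs(pairs, insert_size, tolerance=0.2):
    """Flag pairs whose assembled distance deviates from the insert size.

    pairs: list of (name, left_end_pos, right_end_pos) in assembly coordinates.
    """
    lo = insert_size * (1 - tolerance)
    hi = insert_size * (1 + tolerance)
    flagged = []
    for name, left, right in pairs:
        dist = right - left
        if not lo <= dist <= hi:
            flagged.append((name, dist))
    return flagged


# Hypothetical 10 kb inserts placed on an assembled contig
pairs = [
    ("clone1", 1000, 11200),   # 10,200 bp apart: consistent
    ("clone2", 5000, 8100),    # 3,100 bp apart: sequence probably missing
    ("clone3", 20000, 29500),  # 9,500 bp apart: consistent
]
print(check_mate_pairs(pairs, insert_size=10_000))  # [('clone2', 3100)]
```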
2001 assemblies and 3 problems
both missing 10% of the euchromatic map and 30% of the overall genome
many gaps, especially Celera
misassembled regions
pseudogenes were actually sequencing errors
1- HEXA (chr 15): misassembly - exons 6,7,8 also present at chr3
2- IL12RB2: inversion of exons 9-15 and duplication of exon 15
3- ITGB3: pseudogene - frameshift, missing terminal exon 15
IHGSC revision build 35
Oct 2004
2.85 billion nt
gaps: 341
~99% euchromatic
error rate: ~1 event per 100,000 bases
protein-coding genes: 20,000 (LOW)
nowadays
problem: gaps in coverage, low fold coverage, unstable/unclonable sequences
solution: higher fold coverage of those regions, specialised cloning strategies
Human Genome (mar 2019)
- 3,272,116,950 bp
- gaps: 349
- accuracy: <1 error/10,000 bp
- anomalies: PAR of Y is a copy of X; centromeres are modelled, not gaps
Apr 2022 - T2T: most complete assembly, gap-free, derived from a fertilised egg (hydatidiform mole cell line CHM13), good representation of repetitive sequences
1000 Genomes (2008-2015): 1000 individuals around the world, 454/Illumina
100,000 Genomes - UK (2019) - 62,000 genomic analyses, NHS patients and families, 1 in 5 rare disease patients receive a diagnosis, ~50% of cancer cases yield additional data