Lecture 4 Flashcards
Give the route of traditional gene discovery
- Determine mode of inheritance
- Recombination mapping using markers
- Haplotype analysis of recombinants
- Rapid screening of candidate genes for mutation
- Identify mutation using Sanger sequencing
Give the state-of-the-art route of gene discovery
- Whole exome next generation sequencing
- Lots of polymorphisms
- Filter polymorphisms to get candidate genes
- Confirm using Sanger sequencing
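The filtering step can be sketched as below. This is a minimal illustration only: the variant records, the 1% frequency cutoff and the set of damaging effects are assumptions, not values from the lecture.

```python
# Hypothetical variant records from an exome: gene, population allele
# frequency, and predicted effect (all made up for illustration).
variants = [
    {"gene": "GENE_A", "pop_freq": 0.25, "effect": "missense"},
    {"gene": "GENE_B", "pop_freq": 0.0001, "effect": "nonsense"},
    {"gene": "GENE_C", "pop_freq": 0.0, "effect": "synonymous"},
    {"gene": "GENE_D", "pop_freq": 0.0005, "effect": "missense"},
]

MAX_POP_FREQ = 0.01  # assumed cutoff: discard common polymorphisms
DAMAGING = {"missense", "nonsense", "frameshift"}  # assumed effects to keep


def candidate_genes(records):
    """Return genes carrying rare, potentially damaging variants."""
    return sorted(
        {r["gene"] for r in records
         if r["pop_freq"] < MAX_POP_FREQ and r["effect"] in DAMAGING}
    )


print(candidate_genes(variants))
```

Candidates that survive the filter would then be confirmed by Sanger sequencing.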
Explain the human genome project
Launched in 1990 and completed in 2003
Generated representative human genome of ~3,000,000,000 bases
Performed using Sanger sequencing
Covered 92% of human genome sequence
Used enough gel to fill a lecture theatre
What are flow cells?
Glass slides on which next generation sequencing reactions take place
Can sequence 2 human genomes in 2-3 days
What has happened to the cost of human genomes over time?
Decreased from almost 100 million dollars in 2002 to around 1,000 dollars or less in recent years (NHGRI estimates)
What caused the decrease in human genome cost?
Next generation sequencing technology, which also helped fill in the remaining 8% gap
Long read NGS helps resolve duplication and repetitive regions
Complete human genome was described in 2022 by the Telomere-to-Telomere (T2T) consortium
What are reference genomes
- Forms foundation of medical, functional and diversity studies
- Provides common point of reference for genomic loci:
Gives genes addresses
Variants are reported relative to the reference genome
- Provides template to guide assembly of new genomes and enables assay design/data analysis
How can genetic variations be characterised against reference genomes?
- Single nucleotide polymorphisms
- Structural variants e.g. deletions, insertions, duplications, inversions, translocations, copy number variation
What are the most investigated variant types in genomes?
Single nucleotide polymorphisms:
- Ease of analysis
- Single nucleotide substitutions
- Present at >1% of population
- 4-5 million SNPs in every individual (roughly one every 1000 bp)
- Over 600 million reported
- Single nucleotide variations are similar to SNPs but don’t require >1% in population
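The SNP/SNV distinction comes down to a frequency threshold. A minimal sketch, where the function name and the example frequencies are hypothetical:

```python
SNP_MIN_FREQ = 0.01  # present at >1% of the population, as on the card above


def classify_substitution(ref, alt, pop_freq):
    """Label a single-base change as 'SNP' or 'SNV' by population frequency."""
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        raise ValueError("not a single nucleotide substitution")
    return "SNP" if pop_freq > SNP_MIN_FREQ else "SNV"


print(classify_substitution("A", "G", 0.20))   # common change
print(classify_substitution("C", "T", 0.001))  # rare change
```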
What factors need to be considered when exploring genetic variations
Cost - experimental, analysis, other
Time - Sample prep, run time, analysis time, sample transport
Information capture - Accuracy, feature length, complex variant detection
Appropriate tools should be selected
What can be used as an input sample?
DNA (easy to manipulate)
RNA - can be useful for characterising disease subtypes as RNA shows which genes cells are actively using
Selected DNA/RNA
Protein not typically used as can’t be easily manipulated
How much information is required
3 billion bases in human genome
BRCA1/BRCA2:
110kb/85kb in length (intron and exon)
Together ~0.006% of genome
7.8kb/10.2kb in length (exon only)
Together ~0.0006% of genome
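The percentages can be checked with simple arithmetic. This sketch assumes that the two figures on each length line are BRCA1 and BRCA2 respectively and that the genome is exactly 3 billion bases, so the results are approximate.

```python
GENOME_BP = 3_000_000_000  # ~3 billion bases, as on the card above

# Lengths in base pairs as quoted above (assumed order: BRCA1, then BRCA2).
gene_bp = 110_000 + 85_000  # introns and exons
exon_bp = 7_800 + 10_200    # exons only


def percent_of_genome(bp, genome_bp=GENOME_BP):
    """Return the share of the genome covered by bp bases, as a percentage."""
    return 100 * bp / genome_bp


print(f"Intron and exon: {percent_of_genome(gene_bp):.4f}% of genome")
print(f"Exon only: {percent_of_genome(exon_bp):.4f}% of genome")
```

Either way the two genes are a tiny fraction of the genome, which is the argument for targeted approaches.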
Is all genome info required?
- Microarray
- Enrichment/amplicon -> gene panel sequencing
- Enrichment/amplicon -> exome sequencing
Examples of target enrichment and amplicon
Hybridisation capture:
- Section of genome fragmented
- Adapter and DNA bound to gel to form gene library
- Washing and elution of DNA
Amplification:
1. Library hybridisation
2. PCR amplification
3. Washing and elution
Amplicon sequencing:
- Genomic DNA undergoes multiplex PCR with specific primers
- Adaptors are then ligated to form a barcoded library
Explain in more detail amplicon sequencing
PCR primers designed to target gene of interest
Amplify region using PCR
Regions sequenced
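How a primer pair defines the amplified region can be sketched as an in-silico PCR. The template and primer sequences are made up, and the sketch assumes exact primer matches only, which real PCR does not require.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def amplicon(template, forward_primer, reverse_primer):
    """Return the region a primer pair would amplify, or None if a primer has no exact site."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer binds the opposite strand, so look for its
    # reverse complement downstream of the forward primer.
    rev_site = reverse_complement(reverse_primer)
    end = template.find(rev_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


template = "TTGACCATGGCTAAGCTTGGATCCAA"  # hypothetical genomic DNA
print(amplicon(template, "ACCATG", "TCCAAG"))
```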
Explain target enrichment methods
- Allow more targets to be enriched at once compared to amplicon sequencing
Exons account for 2% of genome, but 85% of known disease variants. Exome sequencing is therefore cheaper than whole genome sequencing, as it only sequences exons while still capturing most known disease variants
Selected inputs and sequencing
DNA -> Whole genome sequencing
DNA -> PCR -> Amplicon sequencing
DNA -> hybridisation capture -> target enrichment sequencing
Read, map, and depth/coverage
Read - sequence corresponding to DNA fragment
Map - Determining where reads originated in genome
Depth/coverage - Depth is the number of times sequencing reads cover a region of the genome
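Depth can be illustrated by counting reads over each base of a small region. The read positions below are made up, and reads are assumed to be already mapped.

```python
def per_base_depth(region_length, reads):
    """Count how many reads cover each base of a region.

    reads: list of (start, length) tuples with 0-based start positions.
    """
    depth = [0] * region_length
    for start, length in reads:
        for pos in range(max(start, 0), min(start + length, region_length)):
            depth[pos] += 1
    return depth


reads = [(0, 5), (2, 5), (4, 5)]  # hypothetical mapped reads
depth = per_base_depth(10, reads)
print(depth)                    # depth at each base
print(sum(depth) / len(depth))  # mean depth across the region
```

Mean depth is the same as total sequenced bases divided by region length (here 15 bases over a 10 bp region).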
How are sequences of DNA fragments mapped onto reference genome
DNA fragmented by chemical, physical or enzymatic means
Individual fragments are sequenced
Map reads to reference genome to identify variants
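The mapping and variant identification steps can be sketched with a naive approach: slide each read along the reference, keep the placement with the fewest mismatches, then report the mismatches. The sequences are made up, and real aligners use indexed, gap-aware algorithms rather than this brute-force search.

```python
def map_read(reference, read):
    """Return (position, mismatches) of the best ungapped placement of a read."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(window, read))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def call_variants(reference, reads):
    """List (position, ref_base, read_base) mismatches found in mapped reads."""
    variants = set()
    for read in reads:
        pos, _ = map_read(reference, read)
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                variants.add((pos + offset, reference[pos + offset], base))
    return sorted(variants)


reference = "ACGTTAGCCGATAGGCTA"                # hypothetical reference
reads = ["GTTAGCC", "GCCGTTAG", "GTTAGGCT"]     # hypothetical reads
print(call_variants(reference, reads))
```

Two of the three reads carry the same A to T change at position 10 (0-based), so it is reported once.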
What are the three different sequencing methods?
Sanger sequencing: Validation and disease diagnosis of known genomic regions
800-1000bp read length
Fast, cost-effective, high accuracy
Low sensitivity, more sample required, and might miss new variants
Short read NGS: genetic contribution of diseases, GWAS, gene panel testing
20-500bp
High sensitivity, low sample input, can discover new variants
Less cost effective and poor resolution for repetitive sequences
Long read NGS: identify difficult to detect de novo mutations
10,000-100,000bp
Low sample input and resolves structural variants
High error rates and more expensive
Explain microarrays
Hybridisation approach (not sequencing)
Array of spots, each containing a small DNA sequence complementary to the sequence of interest:
- Array-based comparative genomic hybridisation (aCGH) - detects copy number variation
- Single-nucleotide polymorphism (SNP) array for GWAS
- Transcriptomics
Easier to analyse than sequencing
Process of microarray example with tumour and normal samples
Normal and tumour RNA undergoes RT-PCR to make cDNA, which is labelled with different fluorescent dyes
Combine equal amounts of cDNA and hybridize probe to microarray
Scan
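After scanning, the two dye signals at each spot are commonly compared as a log2 ratio. A minimal sketch: the intensities are made up and the cutoff of 1 (a two-fold change) is an assumption.

```python
import math

# Hypothetical scanned intensities per spot: (tumour signal, normal signal).
spots = {
    "GENE_A": (800, 200),
    "GENE_B": (150, 150),
    "GENE_C": (100, 400),
}

CUTOFF = 1.0  # assumed threshold: at least a two-fold difference


def log2_ratio(tumour_signal, normal_signal):
    """Return log2(tumour / normal); positive means higher in tumour."""
    return math.log2(tumour_signal / normal_signal)


def call_spot(tumour_signal, normal_signal):
    """Classify a spot as higher, lower or similar in tumour versus normal."""
    ratio = log2_ratio(tumour_signal, normal_signal)
    if ratio >= CUTOFF:
        return "higher in tumour"
    if ratio <= -CUTOFF:
        return "lower in tumour"
    return "similar"


for gene, (tumour, normal) in spots.items():
    print(gene, log2_ratio(tumour, normal), call_spot(tumour, normal))
```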
What might genetic tests be comparing
Cancer vs non-cancer samples
de novo mutations in children by testing mothers, fathers and children
Common genetic traits
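The trio comparison for de novo mutations can be sketched as a set difference: variants in the child that appear in neither parent. The variant tuples are made up, and the sketch ignores sequencing error, which a real analysis would have to rule out.

```python
def de_novo_variants(child, mother, father):
    """Return variants seen in the child but in neither parent."""
    return sorted(set(child) - set(mother) - set(father))


# Hypothetical variants as (chromosome, position, ref, alt).
mother = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
father = [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")]
child = [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A"), ("chr7", 900, "T", "C")]

print(de_novo_variants(child, mother, father))
```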
The HapMap project in 2007
269 genomes from geographically diverse cohort e.g. Japanese, Han Chinese, European, African etc
Described chromosome regions with sets of strongly associated SNPs
1000 Genomes Project
Whole genomes sequenced in 2500 people
100,000 Genomes Project in 2018
UK Biobank - genomic and clinical information
23andMe
Microarray based (600,000 SNPs)
- 12 million people have paid for service to get information on their genetics
- A few FDA-approved clinically relevant genetic tests
Genotype-Tissue Expression (GTEx) project
- Genomic and transcriptomic data from 54 tissue types
- 948 deceased donors
TCGA project
Genomic, epigenomic, transcriptomic, proteomic data from 33 cancer types
11,000 cancer patients
What is a major issue regarding data collection in variant discovery?
Ethnicity:
Reference genome is of European ancestry
HapMap, 1000 genome - efforts to characterise variations across different ethnicities
T2T - constructed a Chinese genome - highlighted unique genes and exclusive sequences/variants compared to European genome
Poorer ability to identify contributing genetic variants in non-European populations
- Participation bias
Pangenome reference
Improved variant detection using genomes from multiple ethnicities
First draft in May 2023
Reduced small variant discovery error by 34%
Increased structural variants detected per haplotype by 104%