L4, Genomics and Health I Flashcards
Human genome project: Overview
- 1990 to 2003
- Generated a representative human genome sequence of around 3 billion bases
- Performed using Sanger sequencing
- 92% of genome covered
- Used 1 reference genome (20 volunteers)
NGS: Features
- Long Read released 2022
- aka: Massively parallel sequencing
- Many fragments sequenced simultaneously -> reduces cost
- Helps resolve duplicated and repetitive regions that the Human Genome Project did not resolve
Why are reference genomes useful?
- Provide a common reference point for genomic loci -> gives gene ‘addresses’, reported variants are relative to the reference genome
- Provides a template (Guiding assembly of new genomes, enables assay design and data analysis)
SNPs: What are they? Prevalence? Why are they so well investigated?
- Single nucleotide polymorphism -> more than 1% of population has this variation
- ~4-5 million SNPs per individual, over 600 million reported
- Ease of analysis
Common structural variants in DNA (x6):
- Deletion
- Insertion
- Duplication
- Inversion
- Translocation
- Copy number variation (e.g. microsatellites)
How are SNVs different to SNPs?
- Don’t have the 1% of population requirement
- V = variant (single nucleotide variant)
Haplotypes: What are they?
Stats included
- Haplotype = a set of closely linked genetic markers or DNA variations on a chromosome that tend to be inherited together
- SNPs within a block (~5kb) can stay associated for many generations e.g. disease susceptibility allele and marker SNP in same block
- 4-6 alternative haplotypes for each block, with around 20 SNPs per block
- Consider humans as haplotype mosaic
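Worked arithmetic for the block statistics above, as a rough sketch. It assumes every SNP in a block is biallelic (two possible alleles), which the cards do not state. Under that assumption, 20 SNPs could in principle combine in 2^20 ways, so seeing only 4-6 haplotypes per block shows how strongly the SNPs stay associated.

```python
# Assumption (not stated in the cards): every SNP in the block is biallelic.
snps_per_block = 20            # figure quoted in the cards
observed_haplotypes = (4, 6)   # range quoted in the cards

# If alleles combined freely, each SNP would double the number of haplotypes.
possible_haplotypes = 2 ** snps_per_block

# Fraction of the theoretical combinations actually seen (upper end of range).
fraction_observed = observed_haplotypes[1] / possible_haplotypes

print(possible_haplotypes)
print(f"{fraction_observed:.2e}")
```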
Factors to consider when choosing a sequencing technology:
- Cost (experimental, analysis, logistics)
- Time (sample preparation, run time, analysis time, sample transport)
- Information capture (accuracy, feature length, complex variant detection)
Key methods for high-throughput exploration of genetic information:
- Sanger sequencing, short-read NGS or long-read NGS
Inputs:
- Whole inputs (either genome or transcriptome)
- Amplicon (PCR used to amplify particular gene -> cheaper, more specific)
- Enrichment/depletion (slightly wider net than amplicon e.g. sequencing exons only, but still narrowing things down)
- Other: Microarray (high-throughput)
Define read, assemble and map:
- Read: The sequence corresponding to a DNA fragment
- Assemble: Aligning and merging reads to reconstruct the original DNA sequence
- Map: Determining where the reads originated from in a genome
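A toy illustration of "map", assuming exact string matching only. Real read mappers tolerate mismatches and use indexed references; the function name `map_read` and the sequences are made up for this sketch. Note how a read drawn from a repeated motif maps to more than one position.

```python
def map_read(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further along to catch later matches.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference and reads, chosen only for illustration.
reference = "ACGTTAGCCGATAGCTA"
reads = ["TAGC", "CCGA", "GGGG"]

hits = {read: map_read(reference, read) for read in reads}
for read, positions in hits.items():
    print(read, positions)
```

"TAGC" occurs twice in this reference, so its origin is ambiguous; "GGGG" does not map at all.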
Define depth and coverage:
- Depth: Number of times sequencing reads cover a specific region of the genome
- Coverage: Context dependent; similar to depth when discussing how much sequencing is done (e.g. 4 fold, 4X, shallow or deep), whereas when discussing alignment it usually means % covered by reads
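A small sketch of the two meanings, assuming reads are already mapped and described as (0-based start, length) pairs. The names `depth_profile`, `mean_depth` and `breadth` are illustrative, not taken from any particular tool.

```python
def depth_profile(genome_length, alignments):
    """Count how many mapped reads overlap each position of the genome."""
    depth = [0] * genome_length
    for start, length in alignments:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


# Hypothetical 10-base genome with three mapped reads of 4 bases each.
genome_length = 10
alignments = [(0, 4), (2, 4), (3, 4)]

depth = depth_profile(genome_length, alignments)

# "How much sequencing": average depth over the genome (the "X" figure).
mean_depth = sum(depth) / genome_length

# "% covered by reads": fraction of positions with at least one read.
breadth = sum(1 for d in depth if d > 0) / genome_length

print(depth)
print(mean_depth, breadth)
```

Here the mean depth is 1.2X while only 70% of positions are covered, which is why the two uses of "coverage" should not be confused.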
How are samples typically prepared for sequencing:
- Long strands of DNA are fragmented (physical, chemical or enzymatic methods) -> small pieces
- Sequence the individual reads
- Either assemble the reads together through overlaps or map the reads to reference genome
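Assembly through overlaps can be sketched as a greedy merge: repeatedly join the two reads with the longest suffix-prefix overlap. This is a simplified illustration that assumes error-free reads from one strand; production assemblers use more robust graph-based methods. The function names and the `min_len` threshold are assumptions made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Hypothetical fragments of the sequence ACGTTAGCCGATA.
reads = ["ACGTTAG", "TTAGCCG", "GCCGATA"]
contigs = greedy_assemble(reads)
print(contigs)
```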
Sanger sequencing: Read length and benefits
- 800-1000 bps
- Fast and cost effective
Sanger sequencing: Challenges and use in medical genetics
- Low sensitivity, requires higher sample input
- Often used in diagnosis of diseases through sequencing of known genomic regions (e.g. BRCA)
Short read NGS: Read length and benefits
- 20-500bp
- High sensitivity, ability to discover new variants, low sample input
Short-read NGS: Challenges and use in medical genetics
- Analysis requires specialist expertise.
- Less cost effective for low number of targets
- Poor at resolving repetitive sequences
- Useful in studying genetic contribution of rare or common disease in populations
Long read NGS: Read length and benefits
- 10,000-100,000 (up to 2 million) bp
- Able to resolve structural variants
- Very low sample input
- Methylation information
Long read NGS: Challenges and use in medical genetics
- Higher error rates than short-read NGS
- More expensive than short-read NGS
- Useful for studying repetitive regions, structural variants and methylation in genetic disorders
When is target enrichment useful?
- Allows for more targets to be enriched at once compared to amplicon sequencing
- Exome sequencing is a common example; the exome is <2% of the human genome, thus greatly reducing the cost (but still able to study ~85% of known disease-related variants)
Two methods for target enrichment/depletion in transcriptomics:
- Poly-A selection (using poly-A tail in mRNA to select for protein coding transcripts)
- Ribodepletion (removing the majority of rRNA in the sample, 90% of total RNA) -> also retains tRNA, lncRNA etc
How might target enrichment be carried out (x3)?
- Microarray hybridisation
- In solution hybridisation
- Molecular inversion probes
Applications for microarray:
- Microarray: A hybridisation technique, not a sequencing method
- Array-based comparative genomic hybridisation (aCGH) for detecting copy number variation
- Single-nucleotide polymorphism (SNP) genotyping
- Transcriptomics
Examples of controls to compare genomes to:
- Cancer sample vs unaffected sample
- Paediatric disorder from de novo mutation; familial genetic disorders
- Common genetic traits on a population level (control doesn’t have disorder/phenotype)
HapMap project:
- 2007 project, using 279 genomes from a geographically diverse cohort
- Described chromosome regions with sets of strongly associated SNPs (Human haplotype map)
- This project facilitates large-scale association-mapping studies and positional cloning studies by cataloguing LD across the genome in many populations
1000 Genomes Project:
- 2500 individuals of diverse genetic background
- 2015
- Whole genome sequence
UK Biobank:
Purpose and features
- Ongoing project
- Genomic and other clinical information
- Investigating respective contributions of genetic predisposition and environmental exposure to the development of disease (GxE)
23andMe:
- Consumer based DNA testing service
- Microarray based (~600,000 SNPs)
- Individuals pay for the service to get information on their genetics
- A few FDA-approved clinically relevant genetic markers
- 12 million customers
Two examples of large scale multi-omic projects linking genetic variation to gene or protein expression:
- GTEx project: Genomic and transcriptomic data from 53 tissue types, from post-mortem donors
- TCGA project: Epigenetic, transcriptomic, proteomic data from 33 cancer types (11,000 cancer patients)
Key challenges when evaluating genomic projects:
- Biased data collection (majority of historical GWAS used primarily European ancestry) -> participation bias (skewing towards certain demographics: ethnicity, age, gender, socioeconomic status, education)
- Researchers may need to implement matching on these bases when conducting research based on large genome projects
- Sample size limiting statistical power
What is the pangenome? Outcomes and aims of this approach:
- Reference genome that includes multiple ethnicities (graph-based representation)
- Aiming to reduce collection bias
- Reduces small variant discovery error by 34%
- Increases number of structural variants detected per haplotype by 104%
- Still early stages; analysis is extremely complex
What is linkage disequilibrium?
- Particular alleles at nearby sites can co-occur on the same haplotype more often than is expected by chance
- Can occur in regions close or very far apart
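"More often than is expected by chance" can be quantified. The sketch below uses the standard measures D = p(AB) - p(A)p(B) and r^2 = D^2 / (p(A)(1-p(A))p(B)(1-p(B))), which are not given in the cards. It assumes two biallelic sites with made-up allele labels (A/a and B/b) and a made-up sample of 10 haplotypes.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r^2) for two biallelic sites from a list of (site1, site2) alleles."""
    n = len(haplotypes)
    p_a = sum(1 for a, _ in haplotypes if a == "A") / n          # frequency of allele A
    p_b = sum(1 for _, b in haplotypes if b == "B") / n          # frequency of allele B
    p_ab = sum(1 for h in haplotypes if h == ("A", "B")) / n     # frequency of haplotype AB
    d = p_ab - p_a * p_b   # zero when the alleles co-occur only as often as chance predicts
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical sample: A travels with B (and a with b) in 8 of 10 haplotypes.
sample = [("A", "B")] * 4 + [("a", "b")] * 4 + [("A", "b")] + [("a", "B")]

d, r2 = linkage_disequilibrium(sample)
print(round(d, 3), round(r2, 3))
```

With independent sites AB would be expected at frequency 0.25; it is observed at 0.40, giving D = 0.15 and r^2 = 0.36.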