L4, Genomics and Health I Flashcards

1
Q

Human genome project: Overview

A
  • 1990 to 2003
  • Generated a representative human genome sequence of around 3 billion bases
  • Performed using Sanger sequencing
  • 92% of genome covered
  • Produced 1 reference genome (assembled from 20 volunteers)
2
Q

NGS: Features

A
  • Long Read released 2022
  • aka: Massively parallel sequencing
  • Many fragments sequenced simultaneously -> reduces cost
  • Helps resolve duplications and repetitive regions left unresolved by the Human Genome Project
3
Q

Why are reference genomes useful?

A
  • Provide a common reference point for genomic loci -> gives gene ‘addresses’, reported variants are relative to the reference genome
  • Provides a template (Guiding assembly of new genomes, enables assay design and data analysis)
4
Q

SNPs: What are they? Prevalence? Why are they so well investigated?

A
  • Single nucleotide polymorphism -> more than 1% of population has this variation
  • ~4-5 million SNPs per individual, over 600 million reported
  • Ease of analysis
5
Q

Common structural variants in DNA (x6):

A
  • Deletion
  • Insertion
  • Duplication
  • Inversion
  • Translocation
  • Copy number variation (e.g. microsatellites)
6
Q

How are SNVs different to SNPs?

A
  • Don’t have the 1% of population requirement
  • V = variations
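
The 1% rule can be sketched as a simple check. This is a minimal illustration that assumes the threshold is applied to allele frequency (allele count divided by chromosomes sampled); the function name and the example counts are hypothetical.

```python
# Assumption: the 1% threshold is applied to the allele frequency.
SNP_FREQUENCY_THRESHOLD = 0.01


def classify_variant(allele_count, chromosomes_sampled):
    """Label a single-nucleotide change as a SNP or as an SNV only.

    Every SNP is also an SNV; only SNPs must pass the 1% threshold.
    """
    frequency = allele_count / chromosomes_sampled
    label = "SNP" if frequency > SNP_FREQUENCY_THRESHOLD else "SNV only"
    return frequency, label


# Hypothetical counts from 1,000 diploid individuals (2,000 chromosomes).
common_frequency, common_label = classify_variant(30, 2000)
rare_frequency, rare_label = classify_variant(3, 2000)
```

Here the first variant (1.5%) is labelled a SNP, while the second (0.15%) is an SNV that does not meet the SNP requirement.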
7
Q

Haplotypes : What are they?

Stats included

A
  • Haplotype = a set of closely linked genetic markers or DNA variations on a chromosome that tend to be inherited together
  • SNPs within a block (~5kb) can stay associated for many generations e.g. disease susceptibility allele and marker SNP in same block
  • 4-6 alternative haplotypes for each block, with around 20 SNPs per block
  • Consider humans as haplotype mosaic
8
Q

Factors to consider when choosing a sequencing technology:

A
  • Cost (experimental, analysis, logistics)
  • Time (sample preparation, run time, analysis time, sample transport)
  • Information capture (accuracy, feature length, complex variant detection)
9
Q

Key methods for high-throughput exploration of genetic information:

A
  • Sanger sequencing, Short-read NGS or Long-read NGS will be used

Inputs:

  • Whole inputs (either genome or transcriptome)
  • Amplicon (PCR used to amplify particular gene -> cheaper, more specific)
  • Enrichment/depletion (slightly wider net than amplicon e.g. sequencing exons only, but still narrowing things down)
  • Other: Microarray (high-throughput)
10
Q

Define read, assemble and map:

A
  • Read: The sequence corresponding to a DNA fragment
  • Assemble: Aligning and merging reads to reconstruct the original DNA sequence
  • Map: Determining where the reads originated from in a genome
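
Mapping can be illustrated with a toy exact-match search. This is only a sketch: real aligners tolerate mismatches and use indexed references, and the function name and sequences here are made up for illustration.

```python
def map_read(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Keep searching one base further along to catch overlapping matches.
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTACGTTAGC"  # hypothetical reference sequence

unique_hit = map_read(reference, "TTAGC")  # [7]: one clear origin
repeat_hits = map_read(reference, "ACGT")  # [0, 4]: ambiguous, the sequence is repeated
no_hit = map_read(reference, "GGG")        # []: read does not map
```

The second read shows why short reads struggle in repetitive regions: a read that fits several places cannot be assigned to a single origin.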
11
Q

Define depth and coverage:

A
  • Depth: Number of times sequencing reads cover a specific region of the genome
  • Coverage: Context dependent; similar to depth when discussing how much sequencing is done (e.g. 4 fold, 4X, shallow or deep), whereas when discussing alignment it usually means % covered by reads
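
The two meanings of coverage can be made concrete with a small calculation. This is a minimal sketch, assuming reads are already mapped and given as hypothetical (start, end) intervals on a 10-base toy genome; "breadth" is used here as a name for the % covered meaning.

```python
def depth_per_base(genome_length, reads):
    """Count how many reads overlap each position.

    reads: list of (start, end) intervals, 0-based, end exclusive.
    """
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average depth, the 'how much sequencing' sense of coverage (e.g. 4X)."""
    return sum(depth) / len(depth)


def breadth_of_coverage(depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


toy_reads = [(0, 4), (2, 6), (2, 8)]   # hypothetical mapped reads
depth = depth_per_base(10, toy_reads)  # [1, 1, 3, 3, 2, 2, 1, 1, 0, 0]
average = mean_depth(depth)            # 1.4, i.e. 1.4X
breadth = breadth_of_coverage(depth)   # 0.8, i.e. 80% of bases covered
```

The same data therefore gives "1.4X coverage" in the sequencing-effort sense and "80% coverage" in the alignment sense.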
12
Q

How are samples typically prepared for sequencing:

A
  • Long strands of DNA are fragmented (physical, chemical or enzymatic methods) -> small pieces
  • Sequence the individual reads
  • Either assemble the reads together through overlaps or map the reads to reference genome
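
Assembly through overlaps can be sketched for two reads. This is a toy illustration, assuming error-free reads and a hypothetical minimum overlap of 3 bases; real assemblers handle sequencing errors and millions of reads.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads through their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


merged = merge_reads("ACGTAC", "TACGGA")  # overlap "TAC" -> "ACGTACGGA"
unmerged = merge_reads("AAAA", "CCCC")    # no shared sequence -> None
```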
13
Q

Sanger sequencing: Read length and benefits

A
  • 800-1000 bp
  • Fast and cost effective
14
Q

Sanger sequencing: Challenges and use in medical genetics

A
  • Low sensitivity, requires higher sample input
  • Often used in diagnosis of diseases through sequencing of known genomic regions (e.g. BRCA)
15
Q

Short read NGS: Read length and benefits

A
  • 20-500 bp
  • High sensitivity, ability to discover new variants, low sample input
16
Q

Short-read NGS: Challenges and use in medical genetics

A
  • Analysis requires specialist expertise.
  • Less cost effective for low number of targets
  • Poor at resolving repetitive sequences
  • Useful in studying genetic contribution of rare or common disease in populations
17
Q

Long read NGS: Read length and benefits

A
  • 10,000-100,000 (up to 2 million) bp
  • Able to resolve structural variants
  • Very low sample input
  • Methylation information
18
Q

Long read NGS: Challenges and use in medical genetics

A
  • Higher error rates than short-read NGS
  • More expensive than short-read NGS
  • Useful for studying repetitive regions, structural variants and methylation in genetic disorders
19
Q

When is target enrichment useful?

A
  • Allows for more targets to be enriched at once compared to amplicon sequencing
  • Exome sequencing is a common example; the exome is <2% of the human genome, which greatly reduces the cost (while still covering ~85% of known disease-related variants)
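
A rough sense of the saving comes from comparing the number of bases that must be sequenced. This is a back-of-envelope sketch: it assumes sequencing effort scales with target size times mean depth, uses 2% as an upper bound for the exome, and uses a hypothetical depth of 30X for both approaches. Real costs also include capture reagents, and exomes are often sequenced deeper.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bases
EXOME_PERCENT = 2               # upper bound from "<2% of the human genome"
MEAN_DEPTH = 30                 # assumption: same hypothetical depth for both

exome_size_bp = GENOME_SIZE_BP * EXOME_PERCENT // 100

# Bases to sequence = target size x mean depth.
whole_genome_bases = GENOME_SIZE_BP * MEAN_DEPTH
exome_bases = exome_size_bp * MEAN_DEPTH

# How many times less sequencing the exome needs at the same depth.
fold_reduction = whole_genome_bases // exome_bases
```

Under these assumptions the exome needs about 50 times fewer sequenced bases than the whole genome.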
20
Q

Two methods for target enrichment/depletion in transcriptomics:

A
  • Poly-A selection (using poly-A tail in mRNA to select for protein coding transcripts)
  • Ribodepletion (removing the majority of rRNA, which makes up ~90% of total RNA) -> also retains tRNA, lncRNA etc
21
Q

How might target enrichment be carried out (x3)?

A
  • Microarray hybridisation
  • In solution hybridisation
  • Molecular inversion probes
22
Q

Applications for microarray:

A
  • Microarray: Hybridisation technique not sequencing method
  • Array-based comparative genomic hybridisation (aCGH) for detecting copy number variation
  • Single-nucleotide polymorphism (SNP) genotyping
  • Transcriptomics
23
Q

Examples of controls to compare genomes to:

A
  • Cancer sample vs unaffected sample
  • Paediatric disorder from de novo mutation; familial genetic disorders
  • Common genetic traits on a population level (control doesn’t have disorder/phenotype)
24
Q

HapMap project:

A
  • 2007 project, genotyping 270 individuals from a geographically diverse cohort
  • Described chromosome regions with sets of strongly associated SNPs (Human haplotype map)

+ This project facilitates large-scale association-mapping studies and positional cloning studies by cataloguing LD across the genome in many populations

25
Q

1000 Genomes Project:

A
  • 2500 individuals of diverse genetic background
  • 2015
  • Whole genome sequence
26
Q

UK biobank:

Purpose and features

A
  • Ongoing project
  • Genomic and other clinical information
  • Investigating respective contributions of genetic predisposition and environmental exposure to the development of disease (GxE)
27
Q

23andMe:

A
  • Consumer based DNA testing service
  • Microarray based (~600,000 SNPs)
  • Individuals pay for the service to get information on their genetics
  • A few FDA-approved clinically relevant genetic markers
  • 12 million customers
28
Q

Two examples of large scale multi-omic projects linking genetic variation to gene or protein expression:

A
  • GTEx project: Genomic and transcriptomic data from 53 tissue types, from post-mortem donors
  • TCGA project: Epigenetic, transcriptomic, proteomic data from 33 cancer types (11,000 cancer patients)
29
Q

Key challenges when evaluating genomic projects:

A
  • Biased data collection (the majority of historical GWAS used primarily European-ancestry participants) -> participation bias, skewing towards certain demographics (ethnicity, age, gender, socioeconomic status, education)
  • Researchers may need to implement matching on these bases when conducting research based on large genome projects
  • Sample size limiting statistical power
30
Q

What is the pangenome? Outcomes and aims of this approach:

A
  • Reference genome that includes multiple ethnicities (graph-based)
  • Aiming to reduce collection bias
  • Reduces small variant discovery error by 34%
  • Increases number of structural variants detected per haplotype by 104%
  • Still early stages; analysis is extremely complex
31
Q

+ What is linkage disequilibrium?

A
  • Particular alleles at nearby sites can co-occur on the same haplotype more often than is expected by chance
  • Can occur in regions close or very far apart
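
"More often than expected by chance" can be quantified. The sketch below uses two standard summary measures, D and r², computed from hypothetical two-site haplotypes; the function name, the allele labels and the example data are all illustrative assumptions.

```python
def linkage_disequilibrium(haplotypes):
    """Compute D and r^2 for two biallelic sites.

    haplotypes: list of 2-character strings; the first character is the
    allele at site 1 ('A' or 'a'), the second the allele at site 2 ('B' or 'b').
    """
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == "A") / n
    p_b = sum(1 for h in haplotypes if h[1] == "B") / n
    p_ab = sum(1 for h in haplotypes if h == "AB") / n

    # D: observed AB frequency minus the frequency expected under independence.
    d = p_ab - p_a * p_b

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("Both sites must be polymorphic to compute r^2")
    return d, d**2 / denominator


# Alleles always travel together: complete LD.
linked_d, linked_r2 = linkage_disequilibrium(["AB"] * 4 + ["ab"] * 4)

# All four combinations equally common: no LD.
free_d, free_r2 = linkage_disequilibrium(["AB", "Ab", "aB", "ab"] * 2)
```

In the first example D is 0.25 and r² is 1; in the second both are 0, meaning the alleles co-occur exactly as often as chance predicts.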