Module 6.3 Genomic Variants Flashcards
genomic variation
- DNA sequence differences among individuals
- any two individuals’ genomes are ~99.6% identical (differences include single- and multiple-nucleotide changes)
reference genome
features (5)
- most recent version = hg38 (GRCh38)
- pieced together from multiple people
- has one assigned nucleotide for every position across entire genome
- only represents sequence of one copy from each chromosome
- can introduce bias because it doesn’t reflect diploid genome or genomic diversity of human population
pangenome
- collective genome sequences of multiple individuals that better represents breadth of genomic diversity of human population
- based on 47 ethnically diverse genomes
genomic variant types
3
- single nucleotide variant (SNV)
- Insertions and deletions (indels)
- structural variants (SV)
single nucleotide variant
(SNV)
- smallest and most common genomic variant
- one nucleotide change at specific location in genome
- includes SNPs and rare single nucleotide differences
- may have SNP on one or both homologous chromosomes
single nucleotide polymorphism
(SNP)
SNV that’s present in at least 1% of human population
insertions and deletions
(indels)
- extra or missing DNA nucleotides in a genome
- typically < 50 nucleotides
- sometimes have larger impact on health and disease
- common type are tandem repeats
indels
tandem repeats
- aka microsatellite
- short stretches of nucleotides repeated multiple times
- highly variable in size (repeat unit present anywhere from 2-3 times to ~100 times)
- number of repeated units unique to each person and can be used for personal / relationship ID
genetic fingerprinting
technique of analyzing microsatellite lengths for individual identification
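A minimal sketch of the idea behind microsatellite-based identification: count how many back-to-back copies of a repeat unit each allele carries. The sequences, the `GATA` unit, and the function name `longest_repeat_run` are made up for illustration; real fingerprinting measures fragment lengths at many loci.

```python
import re

def longest_repeat_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)

# Hypothetical alleles at one microsatellite locus (invented sequences).
allele_1 = "TTGC" + "GATA" * 7 + "CCAT"
allele_2 = "TTGC" + "GATA" * 11 + "CCAT"

genotype = (longest_repeat_run(allele_1, "GATA"), longest_repeat_run(allele_2, "GATA"))
print(genotype)  # (7, 11)
```

The pair of repeat counts at a locus acts as that person's genotype; comparing genotypes across several loci is what makes the profile distinctive.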
phased variants
Variants on same chromosome are linked together and inherited from one parent
genomic variation
phasing
process of separating maternal and paternal inherited copy of each chromosome during sequencing
structural variants
(SV)
types (5)
- large tandem repeats (repeated unit >15 bp)
- copy number variants (CNV)
- inversions
- insertions
- translocations
structural variants
copy number variants
- difference in total number of times segment of nucleotides appears in genome
- deletion: missing segment within chromosome
- duplication: duplicated segment within chromosome
structural variants
inversion
segment is inverted within chromosome
structural variant
insertion
segment deleted from one chromosome and added to different chromosome
translocation
segments that transfer (swap) between different chromosomes
synonymous mutation
- mutation that doesn’t change original protein
- can be base change in intron or non-coding region
- may affect transcription, splicing, RNA transportation, and translation and alter resulting phenotype
non-synonymous mutations
types (3)
- missense
- nonsense
- nonstop / readthrough
missense mutation
- mutation in a single nucleotide codes for different amino acid
- conservative: replaced by similar amino acid
- non-conservative: replaced by amino acid with different properties
nonsense mutation
mutation that changes original amino acid to stop codon = incomplete protein
nonstop / readthrough mutation
stop codon is exchanged for amino acid codon, causing protein to be too long
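The synonymous / missense / nonsense / nonstop categories can be sketched as a comparison of the amino acids encoded before and after a change. This is an illustrative sketch that assumes the standard genetic code and DNA codons read in frame; the function name `classify_codon_change` is hypothetical, and conservative vs. non-conservative missense changes are not distinguished.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a coding change by comparing the encoded amino acids."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "nonstop"    # stop codon becomes an amino acid codon
    return "missense"

print(classify_codon_change("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_codon_change("GAG", "GTG"))  # missense   (Glu -> Val)
print(classify_codon_change("TGG", "TGA"))  # nonsense   (Trp -> stop)
print(classify_codon_change("TAA", "CAA"))  # nonstop    (stop -> Gln)
```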
BRCA 1 / BRCA 2
features (5)
- tumor suppressor genes involved in DNA repair and cell cycle regulation
- BRCA 1 = 50% of all inherited / 5% of all breast cancer
- BRCA 2 = 30-40% of all inherited breast cancer
- slightly increases male risk of developing breast, prostate, or other cancers
germline mutation
- variation within germ cell (sperm and egg)
- only mutations that can be passed on to offspring
- can be caused by several factors and can occur throughout zygote development
constitutional mutation
- germline mutation present in all cells of offspring even if mosaic in parent
somatic mutation
- any mutation that occurs in cell of human body
- not usually transmitted to offspring
- will be present in all descendant cells of that cell (mutant clone)
- Many cancers result of accumulated somatic mutations
allele frequency
- relative frequency of allele at a particular locus in a population
- used to describe amount of variation at a particular locus or across multiple loci in population
- calculated as number of chromosomes in population carrying specified allele divided by total number of chromosomes in population (or sample)
- e.g. MAF of A = (2 x AA + 1 x Aa) / (15 individuals x 2 chromosomes), where AA and Aa are the counts of individuals with each genotype
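The allele frequency formula can be written as a small function. The genotype counts below (4 AA, 5 Aa among 15 individuals) are hypothetical values chosen only to show the arithmetic, and the function name is an assumption.

```python
def allele_frequency(n_homozygous: int, n_heterozygous: int, n_individuals: int) -> float:
    """Frequency of an allele in a diploid sample.

    Homozygous carriers contribute 2 copies, heterozygous carriers 1 copy,
    and every individual contributes 2 chromosomes to the denominator.
    """
    allele_copies = 2 * n_homozygous + n_heterozygous
    total_chromosomes = 2 * n_individuals
    return allele_copies / total_chromosomes

# Hypothetical sample: 15 individuals, 4 are AA and 5 are Aa.
freq_A = allele_frequency(n_homozygous=4, n_heterozygous=5, n_individuals=15)
print(round(freq_A, 3))  # 0.433  (13 copies of A out of 30 chromosomes)
```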
variant allele frequency
(VAF)
- measures proportion of variant alleles at a genomic locus within cell population of a human sample
- NGS experiment: VAF = (mutant reads that cover the position) / (all reads that cover the position)
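A minimal sketch of the read-based VAF calculation. The read counts are invented for illustration, and returning 0.0 for an uncovered position is an assumption of this sketch, not a convention from the source.

```python
def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """VAF = reads supporting the variant / all reads covering the position."""
    if total_reads == 0:
        return 0.0  # assumption: treat uncovered positions as VAF 0
    return variant_reads / total_reads

# Hypothetical position covered by 48 reads, 12 of which carry the variant.
print(variant_allele_frequency(12, 48))  # 0.25
```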
Variant allele frequency
germline variant
- homozygous: VAF = 100%
- heterozygous: VAF = 50%
Variant allele frequency
somatic variant
depends on where and how biological sample is collected and how many mutant somatic cells are within sample population
(n = number of mutant cells, N = number of normal cells in sample)
homozygous: VAF = 2n / (2 x (N + n))
- e.g. 3 mutant cells of 20 = (2 x 3) / (2 x (17 + 3)) = 15%
heterozygous: VAF = n / (2 x (N + n))
- e.g. 3 mutant cells of 20 = 3 / (2 x (17 + 3)) = 7.5%
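The two somatic VAF formulas differ only in how many variant alleles each mutant cell carries. A short sketch, assuming diploid cells and equal DNA contribution from every cell; the function name is hypothetical.

```python
def expected_somatic_vaf(mutant_cells: int, normal_cells: int, homozygous: bool) -> float:
    """Expected VAF in a mixed sample of diploid cells.

    Each mutant cell carries 2 variant alleles if homozygous, 1 if heterozygous;
    every cell (mutant or normal) contributes 2 alleles to the total.
    """
    alleles_per_mutant_cell = 2 if homozygous else 1
    variant_alleles = alleles_per_mutant_cell * mutant_cells
    total_alleles = 2 * (mutant_cells + normal_cells)
    return variant_alleles / total_alleles

# The example from the card: 3 mutant cells out of 20 (17 normal).
print(expected_somatic_vaf(3, 17, homozygous=True))   # 0.15
print(expected_somatic_vaf(3, 17, homozygous=False))  # 0.075
```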
Variant detection methods
3
- PCR-based (known variants)
- Microarray (known variants)
- Genome sequencing
- whole genome
- targeted
whole genome sequencing
- human genome project (reference genome)
- currently focuses on individual genomes
- genetic variations among individuals
- study genetic basis of diseases
- facilitate personalized medicine
targeted sequencing
features (4)
- focuses on specific regions of genome
- allows deep sequencing to look for variants present at very low allele frequencies
- widely used, efficient, and cost effective
- Enrichment (selection of region of interest) is critical step
targeted sequencing
Target enrichment strategies
- hybrid capture
- amplification
targeted sequencing
hybrid capture
process
- convert DNA samples into sequencing libraries (fragmentation and adapter ligation)
- do low-cycle PCR to amplify libraries to ensure all molecules (target or non-target) have adapters
- Hybridization: denature all library DNA molecules
- mix library molecules with blocking oligos (to prevent nonspecific interactions between molecules) and biotinylated probes
- incubate mixture in hybridization buffer optimized for oligo and probe binding
- Blocking oligos bind to adapter sequences
- Probes bind to region of interest
- After hybridization, add streptavidin-coated magnetic beads to separate target region from rest of genome
- remove probes
- Take enriched DNA through 2nd round of PCR and NGS
hybrid capture
probes
- synthetic DNA or RNA single stranded oligos specific to region of interest.
- 100-120 bp long with biotin attached to one end of probe
- collection of probes = panels
hybrid capture
benefits and drawbacks
Benefits
- can easily target hundreds to millions of bases in genome
- easier to scale for sequencing more complex and bigger target regions
- allow you to target more genes and support more comprehensive profiling
Drawbacks
- takes longer to complete experiment
- more laborious
Whole exome panel
includes all exons and upstream regulatory regions
Targeted Enrichment
Amplification
features
- can use whole genomic DNA or fragmented DNA
- regions of interest amplified using sequence specific primers
- singleplex or multiplex
- many different PCR versions
Targeted DNA enrichment
Amplification
process - P5/P7 adapter ligation
- sequence specific primers have a part of the sequencing adapter (Read 1 and Read 2) attached to 5’ end
- First-round PCR product amplified using 2nd set of primers complementary to the adapter sequence (eg. P5 or index + P7) to add full sequencing adapter to amplicons
Targeted DNA enrichment
Amplification
process - dual indexed amplicon library
- multiplex primers that only contain sequences targeting regions of interest
- After PCR, amplification products ligated with adapters to create full library molecules for sequencing
Targeted DNA enrichment
Amplification
benefits and drawbacks
Benefits
- simpler workflow
- smaller amounts of DNA required
- faster turnaround workflow
Drawbacks
- multiplex PCR for big regions very challenging and often requires longer development time
Sequencing Metrics
(library prep and experimental design)
5
- Depth of coverage
- On-target rate
- GC bias
- Uniformity
- Duplication Rate
Sequencing QC
Depth of coverage
- number of times base within target region is represented in sequencing data
- expressed as a multiple (eg. 5X)
Avg coverage of a position = (read count x read length) / total target size
Avg coverage of a variant base = VAF x average coverage of a position
- 50% VAF x 10 reads = 5 variant reads (5 wild type)
- 10% VAF x 10 reads = 1 variant read (9 wild type)
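Both coverage formulas are simple products and ratios, sketched below. The run size (1,000,000 reads of 150 bases over a 3,000,000-base target) is a hypothetical example; the function names are assumptions.

```python
def mean_coverage(read_count: int, read_length: int, target_size: int) -> float:
    """Average coverage of a position = (read count x read length) / target size."""
    return read_count * read_length / target_size

def expected_variant_reads(vaf: float, depth: float) -> float:
    """Average coverage of a variant base = VAF x coverage at that position."""
    return vaf * depth

# Hypothetical run: 1,000,000 reads of 150 bases over a 3,000,000-base target.
print(mean_coverage(1_000_000, 150, 3_000_000))  # 50.0, i.e. 50X

# The examples from the card, at 10X coverage.
print(expected_variant_reads(0.50, 10))  # 5.0 variant reads (5 wild type)
print(expected_variant_reads(0.10, 10))  # 1.0 variant read  (9 wild type)
```

At 10X, a 10% VAF variant is expected in only one read, which is why low-frequency variants need much deeper sequencing to be called reliably.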
Sequencing QC
Required Depth of coverage factors
4
- quality and amount of input sample
- number and type of variants
- variant’s expected frequencies
- coverage depths typically reported for similar studies
Sequencing QC
On-Target Rate
measures specificity of your target enrichment method
% On-Target Bases:
- percentage of sequenced bases that map to the target region
% Reads On-Target:
- percentage of sequencing reads that overlap with target region by at least one base
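The two on-target metrics can be sketched with simple interval arithmetic. This sketch assumes reads and targets are half-open `(start, end)` intervals on a single chromosome and that targets do not overlap each other; the coordinates are invented, and real pipelines compute these from aligned BAM files.

```python
def overlap_bases(read, target):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(read[1], target[1]) - max(read[0], target[0]))

def on_target_metrics(reads, targets):
    """Return (% reads on-target, % bases on-target)."""
    reads_on = 0
    bases_on = 0
    total_bases = 0
    for read in reads:
        shared = sum(overlap_bases(read, t) for t in targets)
        total_bases += read[1] - read[0]
        bases_on += shared
        if shared >= 1:  # a read counts if it overlaps a target by at least one base
            reads_on += 1
    return 100 * reads_on / len(reads), 100 * bases_on / total_bases

# Hypothetical 100-base reads and two target regions.
reads = [(100, 200), (190, 290), (400, 500), (950, 1050)]
targets = [(150, 250), (1000, 1100)]
print(on_target_metrics(reads, targets))  # (75.0, 40.0)
```

The example shows why the two numbers differ: three of four reads touch a target, but only 160 of the 400 sequenced bases actually fall inside one.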
Sequencing QC
On-Target Rate
Causes of low rates (3)
- suboptimal probe design
- poorly optimized protocols
- problem during the library preparation or enrichment process
Sequencing QC
GC Bias
- uneven coverage of AT and GC rich regions (GC content) during sequencing
- use GC bias distribution plots
- green dots: GC normalized experimental coverage (skewed in bad run)
- blue bars: % of GC in 100-base windows of reference genome
- can be from bad library preparation
- can help determine if more sequencing required for desired sequencing depths across all target regions
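The reference side of a GC bias plot (% GC in 100-base windows) can be sketched as below. The sequence is invented, and the function name and non-overlapping window scheme are assumptions of this sketch; plotting and coverage normalization are left out.

```python
def gc_percent_by_window(sequence: str, window: int = 100) -> list:
    """Percent G+C in consecutive, non-overlapping windows of `window` bases.

    A trailing partial window is ignored.
    """
    percents = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        gc_count = chunk.count("G") + chunk.count("C")
        percents.append(100 * gc_count / window)
    return percents

# Hypothetical 300-base sequence: AT-rich, GC-rich, then balanced.
sequence = "AT" * 50 + "GC" * 50 + "ATGC" * 25
print(gc_percent_by_window(sequence))  # [0.0, 100.0, 50.0]
```

Comparing coverage in each window against its GC percentage is what reveals whether AT-rich or GC-rich regions are under-sequenced.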
Sequencing QC
Uniformity
reveals uneven coverage of sequencing regions
Fold-80 base penalty score
- describes how much more sequencing is required to bring 80% of target bases to mean coverage.
- Fold-80 =1: uniform coverage
- Fold-80 >1: uneven coverage
- Fold-80 =2: require 2x as much sequencing for 80% of target bases to reach mean coverage
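One common way to compute a Fold-80 score is mean coverage divided by the coverage that 80% of target bases meet or exceed (the 20th percentile). The sketch below assumes that convention and a simple index-based percentile; the per-base depths are invented and the function name is hypothetical.

```python
def fold_80_penalty(depths):
    """Mean depth divided by the depth reached by at least 80% of target bases."""
    if not depths:
        raise ValueError("need at least one per-base depth")
    ordered = sorted(depths)
    mean_depth = sum(ordered) / len(ordered)
    # 20th percentile: 80% of bases have depth >= this value.
    p20_depth = ordered[int(0.2 * len(ordered))]
    if p20_depth == 0:
        return float("inf")  # assumption: no amount of sequencing rescues zero-depth bases
    return mean_depth / p20_depth

# Hypothetical per-base depths across a 10-base target.
uniform = [30] * 10
uneven = [10, 10, 20, 20, 20, 20, 30, 30, 30, 30]
print(fold_80_penalty(uniform))           # 1.0
print(round(fold_80_penalty(uneven), 2))  # 1.1
```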
Sequencing QC
Duplication Rate
features
- fraction of mapped reads marked as duplicated reads in dataset
- duplication causes inflation of coverage in certain regions
- may overrepresent SNPs or produce false variant calls and inflate allele frequency calculations
- deduplication: removal of duplicate reads from sequencing data during bioinformatics process
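A minimal sketch of duplicate marking and the resulting duplication rate. It assumes reads are summarized as `(chromosome, start, end, strand)` tuples and that reads with identical coordinates are duplicates; the reads are invented, and real deduplication tools use more information (mate positions, UMIs, base qualities).

```python
from collections import Counter

def duplication_rate(reads) -> float:
    """Fraction of mapped reads that are extra copies of an already-seen read."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    # For each distinct read, every copy beyond the first is a duplicate.
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(reads)

def deduplicate(reads):
    """Keep the first occurrence of each read, preserving order."""
    return list(dict.fromkeys(reads))

# Hypothetical mapped reads; the first two share identical coordinates.
reads = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 120, 270, "+"),
    ("chr1", 300, 450, "-"),
    ("chr2", 100, 250, "+"),
]
print(duplication_rate(reads))  # 0.2
print(len(deduplicate(reads)))  # 4
```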
Duplication Rate
contributing factors to high rates
5
- Optical duplicates
- ExAmp clustering
- True biological duplication
- library prep
- PCR amplification
Duplication Factors
Optical Duplicates
- due to instrument system error
- 1 large cluster called as 2 clusters
- Illumina (non-patterned flow cell)
- Complete Genomics (large cluster or DNA ball)
Duplication factors
ExAmp clustering
- caused by underclustering on patterned flow cell (Illumina)
- library molecules from 1st cluster are free to go back into solution and overflow to neighboring empty nanowells to create 2nd cluster
Duplication Factors
True Biological Duplication
- happen to have both strands (sisters) of one double strand molecule in library
- sister strands create two clusters with identical start and end positions that appear as duplicated reads
- All Illumina platforms
Duplication factors
Library Prep
- non-random fragmentation method = higher chance of getting two molecules with same ends
- appear as duplicated reads but are from two different molecules
- try to use larger DNA input samples and paired-end sequencing
- All Illumina platforms
Duplication Factors
PCR amplification
- PCR copies original molecules, may get duplicates
- try to reduce cycle number when possible
- All Illumina platforms