Module 6.3 Genomic Variants Flashcards

1
Q

genomic variation

A
  • DNA sequence differences among individuals
  • any two individuals’ genome ~ 99.6% identical (single and multiple nucleotide differences)
2
Q

reference genome

features (5)

A
  • most recent version = GRCh38 (hg38)
  • pieced together from multiple people
  • has one assigned nucleotide for every position across entire genome
  • only represents sequence of one copy from each chromosome
  • can introduce bias bc doesn’t reflect diploid genome or genomic diversity of human population
3
Q

pangenome

A
  • collective genome sequences of multiple individuals that better represents breadth of genomic diversity of human population
  • based on 47 ethnically diverse genomes
4
Q

genomic variant types

3

A
  1. single nucleotide variant (SNV)
  2. Insertions and deletions (indels)
  3. structural variants (SV)
5
Q

single nucleotide variant
(SNV)

A
  • smallest and most common genomic variant
  • one nucleotide change at specific location in genome
  • includes SNPs and rare single nucleotide differences
  • may have SNP on one or both homologous chromosomes
6
Q

single nucleotide polymorphism
(SNP)

A

SNV that’s present in at least 1% of human population

7
Q

insertions and deletions
(indels)

A
  • extra or missing DNA nucleotides in a genome
  • typically < 50 nucleotides
  • sometimes have larger impact on health and disease
  • common type are tandem repeats
8
Q

indels

tandem repeats

A
  • aka microsatellite
  • short stretches of nucleotides repeated multiple times
  • highly variable in size (2-3x to 100x)
  • number of repeated units unique to each person and can be used for personal / relationship ID
9
Q

genetic fingerprinting

A

technique of analyzing microsatellite lengths for individual identification

10
Q

phased variants

A

Variants on same chromosome are linked together and inherited from one parent

11
Q

genomic variation

phasing

A

process of separating maternally and paternally inherited copies of each chromosome during sequencing

12
Q

structural variants
(SV)

types (5)

A
  1. large tandem repeats (repeated unit >15 bp)
  2. copy number variants (CNV)
  3. inversions
  4. insertions
  5. translocations
13
Q

structural variants

copy number variants

A
  • difference in total number of times segment of nucleotides appears in genome
  • deletion: missing segment within chromosome
  • duplication: duplicated segment within chromosome
14
Q

structural variants

inversion

A

segment is inverted within chromosome

15
Q

structural variant

insertion

A

segment deleted from one chromosome and added to different chromosome

16
Q

translocation

A

segments that transfer (swap) between different chromosomes

17
Q

synonymous mutation

A
  • mutation that doesn’t change the amino acid sequence of the original protein
  • can be base change in intron or non-coding region
  • may affect transcription, splicing, RNA transportation, and translation and alter resulting phenotype
18
Q

non-synonymous mutations

types (3)

A
  1. missense
  2. nonsense
  3. nonstop / readthrough
19
Q

missense mutation

A
  • single-nucleotide change that makes codon code for a different amino acid
  • conservative: replaced by similar amino acid
  • non-conservative: replaced by amino acid with different properties
20
Q

nonsense mutation

A

mutation that changes an amino acid codon to a stop codon = truncated (incomplete) protein

21
Q

nonstop / readthrough mutation

A

stop codon is exchanged for amino acid codon, causing protein to be too long
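
The mutation categories on cards 17 to 21 can be illustrated with a small classifier. This is a minimal sketch, assuming a single codon change inside a coding sequence and the standard genetic code; the function name `classify_codon_change` is hypothetical, and the sketch does not cover intronic or other non-coding changes.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change ('*' in the table marks a stop codon)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop) before and after
    if alt_aa == "*":
        return "nonsense"     # amino acid codon became a stop codon
    if ref_aa == "*":
        return "nonstop"      # stop codon became an amino acid codon
    return "missense"         # one amino acid replaced by another


print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

Telling conservative from non-conservative missense changes would need an extra table of amino acid properties, which is left out here.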

22
Q

BRCA 1 / BRCA 2

features (5)

A
  • tumor suppressor genes involved in DNA repair and cell cycle regulation
  • BRCA 1 = 50% of all inherited / 5% of all breast cancer
  • BRCA 2 = 30-40% of all inherited breast cancer
  • slightly increases male risk of developing breast, prostate, or other cancers
23
Q

germline mutation

A
  • variation within germ cells (sperm or egg)
  • only mutations that can be passed on to offspring
  • can be caused by several factors and can occur throughout zygote development
24
Q

constitutional mutation

A
  • germline mutation present in all cells of offspring even if mosaic in parent
25
Q

somatic mutation

A
  • any mutation that occurs in a non-germline (somatic) cell of human body
  • not usually transmitted to offspring
  • will be present in all descendant cells of that cell (mutant clone)
  • Many cancers result of accumulated somatic mutations
26
Q

allele frequency

A
  • relative frequency of allele at a particular locus in a population
  • used to describe amount of variation at a particular locus or across multiple loci in population

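allele frequency = number of chromosomes in population carrying specified allele / total number of chromosomes in population (or sample)
- eg. MAF (minor allele frequency) of A = (2 x AA + 1 x Aa) / (15 individuals x 2 chromosomes), where AA and Aa are the numbers of individuals with each genotype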
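
The allele frequency formula can be written as a short function. This is a minimal sketch assuming a diploid population; the function name and the genotype counts in the example (2 AA and 5 Aa individuals out of 15) are hypothetical and only illustrate the arithmetic.

```python
def allele_frequency(n_homozygous: int, n_heterozygous: int, n_individuals: int) -> float:
    """Frequency of allele A in a diploid sample.

    Each AA individual carries 2 copies of A, each Aa individual carries 1,
    and every individual contributes 2 chromosomes to the denominator.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    if n_homozygous + n_heterozygous > n_individuals:
        raise ValueError("genotype counts exceed the number of individuals")
    return (2 * n_homozygous + n_heterozygous) / (2 * n_individuals)


# Hypothetical sample: 15 individuals, 2 are AA and 5 are Aa.
print(allele_frequency(2, 5, 15))  # (2*2 + 5) / 30
```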

27
Q

variant allele frequency
(VAF)

A
  • measures proportion of variant alleles at a genomic locus within cell population of a human sample
  • NGS experiment: VAF = (mutant reads that cover the position) / (all reads that cover the position)
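
The read-based VAF calculation can be sketched as below. The sketch assumes the base calls of all reads covering one position have already been collected into a string (a simplified stand-in for a pileup); the function name `vaf_from_base_calls` is hypothetical.

```python
from collections import Counter


def vaf_from_base_calls(base_calls: str, variant_base: str) -> float:
    """VAF = reads showing the variant base / all reads covering the position."""
    calls = base_calls.upper()
    if not calls:
        raise ValueError("no reads cover this position")
    return Counter(calls)[variant_base.upper()] / len(calls)


# Ten reads cover the position; five show the variant base G.
print(vaf_from_base_calls("AAAAAGGGGG", "G"))
```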
28
Q

Variant allele frequency

germline variant

A
  • homozygous: VAF = 100%
  • heterozygous: VAF = 50%
29
Q

Variant allele frequency

somatic variant

A

depends on where and how biological sample is collected and how many mutant somatic cells are within sample population

homozygous: VAF = 2n / (2 x (N + n)), where n = mutant cells and N = normal cells in sample
- eg. 3 mutant cells of 20 = (2 x 3) / (2 x (17 + 3)) = 15%

heterozygous: VAF = n / (2 x (N + n))
- eg. 3 mutant cells of 20 = 3 / (2 x (17 + 3)) = 7.5%
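
The two somatic formulas differ only in how many variant copies each mutant cell carries. A minimal sketch, assuming every cell is diploid at the locus (no copy number changes) and using a hypothetical function name:

```python
def somatic_vaf(mutant_cells: int, normal_cells: int, homozygous: bool = False) -> float:
    """Expected VAF for a somatic variant in a mixed cell population.

    Each cell contributes 2 chromosome copies; a mutant cell carries the
    variant on 2 copies if homozygous, otherwise on 1.
    """
    total_cells = mutant_cells + normal_cells
    if total_cells <= 0:
        raise ValueError("sample must contain at least one cell")
    variant_copies = (2 if homozygous else 1) * mutant_cells
    return variant_copies / (2 * total_cells)


# Card example: 3 mutant cells among 20 cells in total.
print(somatic_vaf(3, 17, homozygous=True))   # 6 / 40
print(somatic_vaf(3, 17, homozygous=False))  # 3 / 40
```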

30
Q

Variant detection methods

3

A
  1. PCR-based (known variants)
  2. Microarray (known variants)
  3. Genome sequencing
    - whole genome
    - targeted
31
Q

whole genome sequencing

A
  • human genome project (reference genome)
  • currently focuses on individual genomes
  • genetic variations among individuals
  • study genetic basis of diseases
  • facilitate personalized medicine
32
Q

targeted sequencing

features (4)

A
  • focuses on specific regions of genome
  • allows deep sequencing to look for variants present at very low allele frequencies
  • widely used, efficient, and cost effective
  • Enrichment (selection of region of interest) is critical step
33
Q

targeted sequencing

Target enrichment strategies

A
  1. hybrid capture
  2. amplification
34
Q

targeted sequencing

hybrid capture

process

A
  1. convert DNA samples into sequencing libraries (fragmentation and adapter ligation)
  2. do low-cycle PCR to amplify libraries to ensure all molecules (target or non-target) have adapters
    Hybridization
  3. denature all library DNA molecules
  4. mix library molecules with blocking oligos (to prevent non-specific interactions between library molecules) and biotinylated probes
  5. incubate mixture in hybridization buffer optimized for oligo and probe binding
    - Blocking oligos bind to adapter sequences
    - Probes bind to region of interest
  6. After hybridization, add streptavidin-coated magnetic beads to separate target region from rest of genome
  7. remove probes
  8. Take enriched DNA through 2nd round of PCR and NGS
35
Q

hybrid capture

probes

A
  • synthetic DNA or RNA single stranded oligos specific to region of interest.
  • 100-120 bp long with biotin attached to one end of probe
  • collection of probes = panels
36
Q

hybrid capture

benefits and drawbacks

A

Benefits
- can easily target hundreds to millions of bases in genome
- easier to scale for sequencing more complex and bigger target regions
- allow you to target more genes and support more comprehensive profiling

Drawbacks
- takes longer to complete experiment
- more laborious

37
Q

Whole exome panel

A

includes all exons and upstream regulatory regions

38
Q

Targeted Enrichment

Amplification

features

A
  • can use whole genomic DNA or fragmented DNA
  • regions of interest amplified using sequence specific primers
  • singleplex or multiplex
  • many different PCR versions
39
Q

Targeted DNA enrichment

Amplification

process- P5/P7 adapter ligation

A
  1. sequence specific primers have a part of sequencing adapter (Read 1 and Read 2) attached to 5’ end
  2. First-round PCR product amplified using 2nd set of primers complementary to the adapter sequence (eg. P5 or index + P7) to add full sequencing adapter to amplicons
40
Q

Targeted DNA enrichment

Amplification

process- dual indexed amplicon library

A
  1. multiplex primers that only contain sequences targeting regions of interest
  2. After PCR, amplification products ligated with adapters to create full library molecules for sequencing
41
Q

Targeted DNA enrichment

Amplification

benefits and drawbacks

A

Benefits
- simpler workflow
- smaller amounts DNA required
- faster turnaround workflow

Drawbacks
- multiplex PCR for big regions very challenging and often requires longer development time

42
Q

Sequencing Metrics
(library prep and experimental design)

5

A
  1. Depth of coverage
  2. On-target rate
  3. GC bias
  4. Uniformity
  5. Duplication Rate
43
Q

Sequencing QC

Depth of coverage

A
  • number of times base within target region is represented in sequencing data
  • expressed as a multiple (eg. 5X)

Avg coverage of a position = (read count x read length) / total target size
Avg coverage of a variant base = VAF x average coverage of a position
- 50% VAF x 10 reads = 5 variant reads (5 wild type)
- 10% VAF x 10 reads = 1 variant read (9 wild type)
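
The coverage formulas can be checked with a few lines of code. This is a minimal sketch with hypothetical function names; the run size in the first example (1,000,000 reads of 150 bases over a 3,000,000-base target) is made up for illustration.

```python
def average_coverage(read_count: int, read_length: int, target_size: int) -> float:
    """Average coverage of a position = (read count x read length) / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return read_count * read_length / target_size


def expected_variant_reads(vaf: float, coverage: float) -> float:
    """Average coverage of a variant base = VAF x average coverage of a position."""
    return vaf * coverage


# Hypothetical run: 1,000,000 reads x 150 bases over a 3,000,000-base target.
print(average_coverage(1_000_000, 150, 3_000_000))

# Card examples at 10 reads of coverage.
print(expected_variant_reads(0.50, 10))
print(expected_variant_reads(0.10, 10))
```

The second card example shows why low-frequency variants need deep sequencing: at 10X a 10% variant is expected in only one read, which is hard to tell apart from a sequencing error.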

44
Q

Sequencing QC

Required Depth of coverage factors

4

A
  1. quality and amount of input sample
  2. number and type of variants
  3. variant’s expected frequencies
  4. coverage depths typically reported for similar studies
45
Q

Sequencing QC

On-Target Rate

A

measures specificity of your target enrichment method

% On-Target Bases:
- percentage of sequenced bases that map to the target region

% Reads On-Target:
- percentage of sequencing reads that overlap with target region by at least one base
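
Both on-target metrics can be computed from read and target coordinates. The sketch below is a simplification: it assumes a single target interval on one chromosome, half-open `(start, end)` coordinates, and reads that are already aligned; the function name and the example coordinates are hypothetical.

```python
def on_target_metrics(reads, target):
    """Return (% on-target bases, % reads on-target) for one target interval.

    reads:  list of (start, end) aligned read intervals, end exclusive
    target: (start, end) interval of the target region, end exclusive
    """
    if not reads:
        raise ValueError("no reads supplied")
    t_start, t_end = target
    total_bases = on_target_bases = on_target_reads = 0
    for start, end in reads:
        total_bases += end - start
        # Number of bases of this read that fall inside the target.
        overlap = max(0, min(end, t_end) - max(start, t_start))
        on_target_bases += overlap
        if overlap >= 1:  # a read counts if it overlaps by at least one base
            on_target_reads += 1
    return 100 * on_target_bases / total_bases, 100 * on_target_reads / len(reads)


reads = [(50, 150), (120, 170), (300, 350), (190, 240)]
pct_bases, pct_reads = on_target_metrics(reads, target=(100, 200))
print(pct_bases, pct_reads)
```

The two numbers differ because a read that only partly overlaps the target counts fully toward reads on-target but contributes only its overlapping bases to on-target bases.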

46
Q

Sequencing QC

On-Target Rate

Causes of low rates (3)

A
  • suboptimal probe design
  • poorly optimized protocols
  • problem during the library preparation or enrichment process
47
Q

Sequencing QC

GC Bias

A
  • uneven coverage of AT and GC rich regions (GC content) during sequencing
  • use GC bias distribution plots
  • green dots: GC normalized experimental coverage (skewed in bad run)
  • blue bars: % of GC in 100-base windows of reference genome
  • can be from bad library preparation
  • can help determine if more sequencing required for desired sequencing depths across all target regions
48
Q

Sequencing QC

Uniformity

A

reveals uneven coverage of sequencing regions

Fold-80 base penalty score
- describes how much more sequencing is required to bring 80% of target bases to mean coverage.
- Fold-80 =1: uniform coverage
- Fold-80 >1: uneven coverage
- Fold-80 =2: require 2x as much sequencing for 80% of target bases to reach mean coverage
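
One common way to compute this score is to divide the mean coverage by the depth that 80% of target bases reach or exceed (the 20th percentile of per-base depth). The sketch below assumes that definition and a simple nearest-rank percentile; the function name and the depth values are hypothetical.

```python
def fold_80_base_penalty(depths):
    """Mean coverage divided by the depth reached by 80% of target bases."""
    if not depths:
        raise ValueError("no per-base depths supplied")
    ordered = sorted(depths)
    n = len(ordered)
    # Smallest number of bases that makes up at least 80% of the target.
    bases_needed = (8 * n + 9) // 10
    # Depth that those bases all reach or exceed.
    depth_80 = ordered[n - bases_needed]
    if depth_80 == 0:
        raise ValueError("fewer than 80% of target bases are covered")
    mean_depth = sum(ordered) / n
    return mean_depth / depth_80


uniform = [100] * 10
uneven = [50, 50, 50, 100, 100, 100, 100, 150, 150, 150]
print(fold_80_base_penalty(uniform))  # perfectly even coverage
print(fold_80_base_penalty(uneven))   # same mean depth, uneven coverage
```

Both examples have a mean depth of 100, but in the uneven one 80% of bases only reach a depth of 50, so about twice as much sequencing would be needed to lift them to the mean.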

49
Q

Sequencing QC

Duplication Rate

features

A
  • fraction of mapped reads marked as duplicated reads in dataset
  • duplication causes inflation of coverage in certain regions
  • may overrepresent SNPs or false variant calls and inflate allele frequency calculation
  • deduplication: removal of duplicate reads from sequencing data during bioinformatics process
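
Duplication rate and deduplication can be sketched with a coordinate-based rule. This is a simplified illustration, assuming reads are represented as `(chromosome, start, end, strand)` tuples and that reads with identical coordinates and strand are duplicates; production tools typically also use mate positions, base qualities, or molecular barcodes. The function name is hypothetical.

```python
def deduplicate(reads):
    """Return (unique reads, duplication rate) for a list of mapped reads.

    A read is marked as a duplicate when an earlier read has the same
    chromosome, start, end, and strand; the first copy is kept.
    """
    if not reads:
        raise ValueError("no reads supplied")
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    duplication_rate = (len(reads) - len(unique)) / len(reads)
    return unique, duplication_rate


reads = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),  # duplicate of the first read
    ("chr1", 100, 250, "+"),  # duplicate of the first read
    ("chr1", 100, 250, "-"),  # same coordinates, opposite strand: kept
    ("chr2", 500, 650, "+"),
]
unique, rate = deduplicate(reads)
print(len(unique), rate)
```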
50
Q

Duplication Rate

contributing factors to high rates

5

A
  1. Optical duplicates
  2. ExAmp clustering
  3. True biological duplication
  4. library prep
  5. PCR amplification
51
Q

Duplication Factors

Optical Duplicates

A
  • due to instrument system error
  • 1 large cluster called as 2 clusters
  • Illumina (non-patterned flow cell)
  • Complete Genomics (large cluster or DNA ball)
52
Q

Duplication factors

ExAmp clustering

A
  • caused by underclustering on patterned flow cell (Illumina)
  • library molecules from 1st cluster are free to go back into solution and overflow to neighboring empty nanowells to create 2nd cluster
53
Q

Duplication Factors

True Biological Duplication

A
  • happen to have both strands (sisters) of one double strand molecule in library
  • sister strands create two clusters with identical start and end positions that appear as duplicated reads
  • All Illumina platforms
54
Q

Duplication factors

Library Prep

A
  • non-random fragmentation method = higher chance of getting two molecules with same ends
  • appear as duplicated reads but are from two different molecules
  • try to use larger DNA input samples and paired-end sequencing
  • All Illumina platforms
55
Q

Duplication Factors

PCR amplification

A
  • PCR copies original molecules, may get duplicates
  • try to reduce cycle number when possible
  • All Illumina platforms