Chapter 3 - Exome Sequencing Flashcards
Reader Ch.3
Limiting factors of traditional gene-discovery strategies (linkage mapping and candidate gene resequencing)
-Availability of small number of cases
-Reduced penetrance
-Locus heterogeneity
-Substantially diminished reproductive fitness
-Responsible mutation may be de novo
Mendelian disorders
Inherited single-gene disorders such as cystic fibrosis and sickle cell anaemia
Coding variation analysis > massively parallel DNA sequencing >
Exome sequencing
Limitation of exome sequencing
It does not assess the impact of non-coding alleles, but it is well suited to discovering rare alleles underlying Mendelian phenotypes and complex traits
Why is exome sequencing effective for detecting rare alleles in Mendelian disorders?
Positional cloning studies are successful for monogenic disorders
> most alleles underlying Mendelian disorders are protein coding
> large fraction of the rare protein altering variants are predicted to have functional consequences
> splice acceptor and donor sites are enriched for highly functional variation (targeted in exome sequencing)
How is the exome defined?
By the capture target: the entire RefSeq collection and a large number of hypothetical proteins (this definition has limitations)
Limitations of defining the exome
-incomplete knowledge of all protein-coding exons
-variation in the efficiency of capture probes
-not all templates are sequenced efficiently
-not all sequences can be uniquely aligned to the reference genome
Wet-lab workflow for exome sequencing
- genomic DNA is sheared and used to construct an in vitro shotgun library
- library fragments are flanked by adapters
- enrichment for sequences corresponding to exons > aqueous-phase hybridization capture
- recovery of hybridized fragments by biotin-streptavidin pulldown and washing
- amplification and massively parallel sequencing
- Mapping > calling of candidate causal variants
Bioinformatics steps in exome sequencing
- Probe design
- Quality control
- Map reads
- Determine variants
- Annotate variants
- Filter known variants
- exome comparison
- validation of candidate genes
Probe design
Designing probes for capturing exon fragments > unique and efficient probes
Quality control
High base quality and equal nucleotide frequencies across the sequence
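The two quality-control checks above can be sketched in a few lines. This is a minimal illustration, not a replacement for a dedicated QC tool: the function name `per_position_stats`, the toy reads, and the assumption of Phred+33 quality encoding with equal-length reads are all hypothetical choices made for the example.

```python
from collections import Counter


def per_position_stats(reads):
    """Return mean Phred quality and base fractions for each read position.

    reads: list of (sequence, quality_string) pairs, all of equal length.
    Assumption: qualities are Phred+33 encoded.
    """
    length = len(reads[0][0])
    stats = []
    for i in range(length):
        # Decode the quality character at position i of every read.
        quals = [ord(qual[i]) - 33 for _, qual in reads]
        # Count how often each nucleotide occurs at position i.
        bases = Counter(seq[i] for seq, _ in reads)
        total = sum(bases.values())
        stats.append({
            "mean_quality": sum(quals) / len(quals),
            "base_fraction": {b: n / total for b, n in bases.items()},
        })
    return stats


# Toy reads (made up for illustration only).
toy_reads = [("ACGT", "IIII"), ("ACGA", "II5I"), ("CCGT", "IIII"), ("ACGT", "I#II")]
toy_stats = per_position_stats(toy_reads)
```

A position with a low mean quality, or base fractions that shift sharply along the read, would be a reason to inspect or trim the data before mapping.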
Mapping the reads (BWA)
Reads are aligned to the reference genome by a mapping algorithm
> unmapped reads and reads that do not map uniquely are discarded. Reads mapped with low confidence may cause problems
Determine variants (VarScan)
Detect differences from the reference genome: each difference is either a potential variant or a sequencing error.
VarScan criteria
- At least N reads cover the position of the variant (default 8)
- Of those N reads, at least K reads carry the variant (default 2)
- Average base quality at the position of the variant is at least Q (default 15)
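The three criteria combine as a simple AND. The sketch below only illustrates that logic with the defaults listed above; the function and parameter names are hypothetical and are not VarScan's own interface.

```python
def passes_thresholds(depth, variant_reads, avg_base_quality,
                      min_coverage=8, min_variant_reads=2, min_avg_quality=15):
    """Return True if a candidate variant meets all three criteria.

    depth:            number of reads covering the position (N)
    variant_reads:    number of those reads that carry the variant (K)
    avg_base_quality: average base quality at the position (Q)
    """
    return (
        depth >= min_coverage
        and variant_reads >= min_variant_reads
        and avg_base_quality >= min_avg_quality
    )


# Toy calls (made-up numbers):
ok = passes_thresholds(depth=20, variant_reads=9, avg_base_quality=30)
too_shallow = passes_thresholds(depth=7, variant_reads=5, avg_base_quality=30)
```

A difference from the reference that fails any one of the criteria is treated as a likely sequencing error rather than a variant.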
Annotate variants
Each variant is assigned various properties; gene name, region, nucleotide position, type of mutation, number of reads, quality etc.
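One way to picture an annotated variant is as a record with one field per property. The sketch below is only an illustration: the class name, field names, and example values are hypothetical, and real annotation tools use their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedVariant:
    gene: str                # gene name
    region: str              # e.g. exonic, splice site
    chrom: str               # chromosome
    position: int            # nucleotide position
    ref: str                 # reference allele
    alt: str                 # variant allele
    mutation_type: str       # e.g. missense, nonsense, synonymous
    read_depth: int          # number of reads covering the position
    variant_reads: int       # number of reads carrying the variant
    avg_base_quality: float  # average base quality at the position

    @property
    def variant_fraction(self) -> float:
        """Fraction of covering reads that carry the variant."""
        return self.variant_reads / self.read_depth


# Made-up example record.
example = AnnotatedVariant(
    gene="GENE_A", region="exonic", chrom="chr1", position=12345,
    ref="C", alt="T", mutation_type="missense",
    read_depth=40, variant_reads=18, avg_base_quality=32.5,
)
```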
Filter the known variants
Remove synonymous variants and variants which are present in public SNP databases or an in-house reference database because they are unlikely to cause the disorder
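The filter can be sketched as a set lookup. This is a minimal example under stated assumptions: variants are plain dictionaries with hypothetical keys, `known_variants` stands in for the public and in-house databases, and all coordinates are made up.

```python
def filter_known(variants, known_variants):
    """Keep variants that are neither synonymous nor already known.

    variants:       list of dicts with keys chrom, pos, ref, alt, effect
    known_variants: set of (chrom, pos, ref, alt) tuples from reference databases
    """
    kept = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        if v["effect"] == "synonymous":
            continue  # unlikely to change the protein
        if key in known_variants:
            continue  # already seen in a database, assumed not causative
        kept.append(v)
    return kept


# Toy data (made up for illustration only).
known = {("chr1", 100, "A", "G")}
calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "effect": "missense"},
    {"chrom": "chr2", "pos": 200, "ref": "C", "alt": "T", "effect": "synonymous"},
    {"chrom": "chr3", "pos": 300, "ref": "G", "alt": "A", "effect": "nonsense"},
]
novel = filter_known(calls, known)
```

Only the chr3 nonsense variant survives in this toy example: the first call is in the known set and the second is synonymous.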
Exome comparison
Compare the exomes of different patients to find one or more genes that are affected in every patient (the same variant is not required)
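Because the comparison is made per gene rather than per variant, it reduces to a set intersection. A minimal sketch, assuming each patient has already been reduced to the set of genes that carry at least one remaining candidate variant; the function name and gene labels are hypothetical.

```python
def shared_candidate_genes(genes_per_patient):
    """Return the genes that carry a candidate variant in every patient.

    genes_per_patient: dict mapping patient ID to an iterable of gene names.
    """
    gene_sets = [set(genes) for genes in genes_per_patient.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Toy data (made-up patients and gene labels).
patients = {
    "patient_1": ["GENE_A", "GENE_B", "GENE_C"],
    "patient_2": ["GENE_B", "GENE_C", "GENE_D"],
    "patient_3": ["GENE_C", "GENE_E"],
}
shared = shared_candidate_genes(patients)
```

In the toy data only `GENE_C` is hit in all three patients, even though each patient may carry a different variant in it.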
Validation of candidate genes
Wet-lab validation, for example with Sanger sequencing, or comparison with sets of exomes and genomes
Factors that determine the strategy for identifying causal alleles > they affect the sample size needed for adequate power
-mode of inheritance (exome sequencing is more efficient for recessive disorders > fewer genes with two novel protein-altering alleles)
-pedigree or population structure
-phenotype arising de novo or inherited > screening family
-extent of locus heterogeneity for a trait
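The mode-of-inheritance point can be illustrated with a small filter for a recessive model: keep only genes with at least two novel protein-altering alleles in one patient. This is a sketch under simplifying assumptions: a homozygous variant counts as two alleles, two heterozygous variants in the same gene are treated as a possible compound heterozygote without checking phase, and the names and data are hypothetical.

```python
from collections import defaultdict


def recessive_candidates(variants):
    """Return genes with at least two novel protein-altering alleles.

    variants: list of (gene, genotype) pairs for one patient, where
              genotype is "het" (one allele) or "hom" (two alleles).
    """
    allele_count = defaultdict(int)
    for gene, genotype in variants:
        allele_count[gene] += 2 if genotype == "hom" else 1
    return {gene for gene, n in allele_count.items() if n >= 2}


# Toy data (made up for illustration only).
patient_variants = [
    ("GENE_A", "het"),
    ("GENE_B", "hom"),
    ("GENE_C", "het"),
    ("GENE_C", "het"),
]
candidates = recessive_candidates(patient_variants)
```

Here `GENE_A` is dropped because it has only one novel allele, which is why far fewer genes remain under a recessive model than under a dominant one.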
Data filtering steps
- discrete filtering: by comparing variants among individuals and against public databases/controls
- Stratification of variants
Novelty of allele assessment
-Set of polymorphisms from public databases such as dbSNP and the 1000 Genomes Project
> from unaffected individuals
Filtering
Eliminating candidate genes by assuming any allele found in the filter set cannot be causative
Assumption for filtering from dbSNP, and problems with the assumption
The filter set (controls) contains no alleles from individuals with the disease phenotype
-Problems
>dbSNP is contaminated with a small number of pathogenic alleles
>some pathogenic alleles have a higher minor allele frequency: the pathogenic variant also occurs in control exomes > risk of eliminating truly pathogenic alleles
Stratification of candidates (name the groups)
-by mutation type > predicted impact/deleteriousness
-by segmental duplications > variants found in segmental duplications are discarded
-by pseudogenes: dysfunctional relatives of genes that have lost their protein-coding ability
-by function: predicted role of the protein product
-by functional impact: for non-synonymous alleles > predicted impact on the phenotype
(Technical) reasons for failure of exome sequencing
-part or all of the causative gene is not in the target definition
-inadequate coverage of the region which contains the causal variant
-the causal variant is covered but not accurately called
-false variants in a gene are called because of mismapped reads or alignment errors
Failure with discrete filtering
Reduced power due to genetic heterogeneity or false-positive calls (from processed pseudogenes or segmental duplications)