Genome variation Flashcards
What are the genomic variations in humans?
Genetic variations are differences in the DNA sequence of individuals
* Approximately 99.9% of the DNA of two unrelated individuals is the same.
* Genetic variations can be described at the level of:
- DNA
- RNA
- Protein
Types of changes at DNA level
Substitution
Insertion
Deletion
Indel
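The four DNA-level change types above can be told apart from the reference and alternate alleles. This is a minimal sketch, assuming VCF-style alleles in which insertions and deletions share their leading base(s); the function name `classify_dna_change` is hypothetical.

```python
def classify_dna_change(ref: str, alt: str) -> str:
    """Classify a DNA-level change from its reference and alternate alleles.

    Assumes VCF-style alleles: an insertion keeps the reference as a prefix,
    a deletion keeps the alternate allele as a prefix.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    if alt.startswith(ref):
        return "insertion"   # bases were added after the shared prefix
    if ref.startswith(alt):
        return "deletion"    # bases were removed after the shared prefix
    return "indel"           # bases were both removed and added


print(classify_dna_change("A", "G"))     # single-base change
print(classify_dna_change("A", "ACG"))   # two bases added
print(classify_dna_change("ACG", "A"))   # two bases removed
print(classify_dna_change("ACG", "TT"))  # removed and replaced
```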
What are silent changes?
Silent changes or synonymous changes:
the nucleotide change does not affect the amino acid
What is a missense variation?
the amino acid is replaced with a different amino acid e.g. p.Arg70Cys
What is a nonsense variation?
The codon for an amino acid is changed to a stop codon, e.g. Arg70 is changed to a stop codon: p.Arg70*
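The silent, missense and nonsense categories above can be illustrated by translating the wild type and variant codons with the standard genetic code. This is a minimal sketch; the function name `classify_codon_change` and the example codons are illustrative assumptions, not taken from a specific gene.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # silent: same amino acid
    if alt_aa == "*":
        return "nonsense"     # amino acid replaced by a stop codon
    if ref_aa == "*":
        return "stop-loss"    # stop codon replaced by an amino acid
    return "missense"         # amino acid replaced by a different one


print(classify_codon_change("CGT", "CGC"))  # Arg -> Arg
print(classify_codon_change("CGT", "TGT"))  # Arg -> Cys
print(classify_codon_change("CGA", "TGA"))  # Arg -> stop
```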
How can variations affect the phenotype?
* Contribute to normal variation in traits, e.g. hair or eye color, height
* Increase or decrease susceptibility towards a condition:
- In combination with multiple other variants of small effect and/or with the environment
- Are usually common in the population
- Common diseases, e.g. diabetes, hypertension and asthma
* Directly cause a condition:
- Cause a major phenotypic effect
- Are usually rare in the population
- Typically cause Mendelian disorders, e.g. familial hypercholesterolaemia, sickle cell disease, cystic fibrosis
* Increase or decrease response to a drug
What can understanding variants give us?
Understanding the molecular basis of disease helps to identify the correct treatment and to design new drugs
What were the goals of the 1000 Genomes Project?
* The discovery of single nucleotide variants with frequencies ≥ 1%
* The discovery of single nucleotide variants with frequencies of 0.1–0.5% in gene regions
* The discovery of structural variants, such as copy-number variants, other insertions and deletions, and inversions
* The estimation of the frequencies of variant alleles
What is the distribution of rare variants in the world?
Although most common variants are shared across the world, rarer variants are typically restricted to closely related populations
What should you consider when assessing the effect of a genetic variant?
- has it been seen in someone with the phenotype I am studying?
- has it been seen before?
- how common is it in the general population?
What is dbSNP?
dbSNP is a public archive of all short sequence variations.
* It was established in 1998.
* It is hosted by the National Center for Biotechnology Information (NCBI).
* It includes data from several organisms, not just humans
* It includes single nucleotide substitutions, short insertions/deletions, and multi-base deletions or insertions.
* Variations > 50 nucleotides in length are annotated in the Database of Genomic Structural Variation (dbVar), not dbSNP.
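The length rule above can be written as a one-line check. This is only a sketch of the 50-nucleotide cut-off stated in this card; the function name `archive_for_variant` is hypothetical and not part of any NCBI tool.

```python
DBVAR_MIN_LENGTH = 50  # variations longer than this go to dbVar (per the card above)


def archive_for_variant(length_nt: int) -> str:
    """Return the NCBI archive that would hold a variation of the given length."""
    return "dbVar" if length_nt > DBVAR_MIN_LENGTH else "dbSNP"


print(archive_for_variant(1))    # single nucleotide substitution
print(archive_for_variant(120))  # larger structural variation
```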
Why are we interested in studying the exome?
- The exome includes all exons of protein coding genes
- The exome covers ~2% of our genome
- Whole exome sequencing helps to identify novel disease-causing variants in patients with rare diseases
What is the ExAC database?
Established in 2014
* ExAC database reports exome sequences of >60,000 unrelated individuals from different populations (African, American, Non-Finnish Europeans, Finnish Europeans, East Asians, South Asians) that were sequenced as part of several disease-specific and population genetic studies.
* Exomes are from individuals with adult-onset diseases
* No homozygous variants causing childhood-onset Mendelian diseases are present in the database
* It provides for each genetic variant:
- Global allele frequency
- Population-specific allele frequency
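Global and population-specific allele frequencies are both the number of observed variant alleles divided by the number of alleles genotyped. This is a minimal sketch of that arithmetic; the population labels and counts are made-up illustrative values, not data from ExAC, and the function names are hypothetical.

```python
# Hypothetical per-population counts: (variant allele count, total alleles genotyped).
counts = {
    "Population A": (4, 10_000),
    "Population B": (0, 8_000),
    "Population C": (36, 12_000),
}


def population_frequencies(counts):
    """Allele frequency within each population (None if nothing was genotyped)."""
    return {
        pop: (ac / an if an else None)
        for pop, (ac, an) in counts.items()
    }


def global_frequency(counts):
    """Allele frequency pooled over all populations."""
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    return total_ac / total_an if total_an else None


for pop, freq in population_frequencies(counts).items():
    print(f"{pop}: {freq:.4%}")
print(f"Global: {global_frequency(counts):.4%}")
```

A variant can be rare globally yet noticeably more frequent in one population, which is why both figures are worth checking.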
gnomAD
Genome Aggregation Database (gnomAD) aggregates exome and genome sequence data from several large-scale sequencing projects
* It provides data for 125,748 exomes and 15,708 whole-genome sequences
* It provides data for >240 million human variants
* Data are from unrelated individuals
What is ClinVar?
“ClinVar aggregates information about genomic variation and its relation to human health.”
- It includes germline and somatic variants
- The clinical significance of the variant (e.g. benign, damaging, unclassified) is reported directly from the submitters
- Clinical significance is calculated from all records submitted for the same variant. The presence of a consensus or conflict is indicated
- A clinical interpretation is present for >200,000 variants
What is Online Mendelian Inheritance in Man (OMIM)?
Online Catalog of Human Genes and Genetic Disorders
§ Freely available
§ It contains information on all known Mendelian disorders
- specific for monogenic disorders
Why should we do bioinformatics analysis?
You need to do bioinformatics before an experiment because you cannot realistically check all variants in the wet lab
Catalogue Of Somatic Mutations In Cancer
- Catalogue Of Somatic Mutations In Cancer (COSMIC)
- Freely available
- More than 4 million coding mutations reported
- It combines genome-wide sequencing results from > 28,000 tumors
What should be your next steps when you identify a variant?
- is the variant causing any changes to the protein?
- you need to look at the function of the residue
- is the residue conserved? is it important for the protein function?
What should you know about the function of the wild type residue to assess the variant better?
* Is the position conserved? If the wt amino acid is evolutionarily conserved, it is highly likely to be important for the protein function/structure
* What are the physicochemical properties of the wt residue?
* Is the wild type residue in a protein domain? And if so, what is the function of the domain?
* Is the wild type residue involved in a protein-protein or protein-ligand interaction?
What does it tell us when a residue is conserved?
Functionally or structurally important residues are conserved across homologues
Why should we study domains when assessing protein variants?
Knowing the function of the domain can help to understand the function of the protein and the disruption caused by the genetic variant
What is Pfam?
Pfam is a comprehensive database of protein families
* Members of the same Pfam family are evolutionarily related and are identified using Hidden Markov Model (HMM) profiles
* In UniProtKB, 77.0% of the amino acid sequences have at least one domain annotated in Pfam
* 53.2% of residues in UniProtKB belong to a Pfam domain
* There are still many proteins and amino acid regions without a functional annotation
What is InterPro?
InterPro is a consortium of protein domain databases
The main advantage and disadvantage of InterPro:
- Advantage: it integrates data from multiple resources
- A disadvantage is the difficulty in keeping up-to-date with the individual resources
What can changes in amino acids cause?
* Substitution of structurally important amino acids:
- loss of cysteine (disulfide) bonds
- loss of hydrogen bonds
* Change in amino acid size:
- steric clashes
- introduction of cavities
* Substitution to proline can cause a bending of the alpha helix
* Change of polarity:
- e.g. hydrophobic to hydrophilic substitution in a core residue
Can we predict the effect of genomic variations?
Only about 2% of human DNA codes for proteins. All the protein coding sequences in the genome collectively make up the exome.
* The non-protein coding areas of DNA between genes can have several functions, e.g.:
- regulatory elements associated with gene expression,
- DNA elements which regulate chromosome structure,
- sequences which are involved in gene regulation and protein translation.
Genomic variations occurring in coding regions
* Genetic variants occurring in the coding regions account for a large proportion of the known genetic variants responsible for human inherited disease
* They are attractive candidates for disease since they can affect protein function or structure
* Genetic variants resulting in amino acid substitutions have been widely studied
* Several in silico tools have been developed over the last 10-20 years to predict the benign or damaging effect of genetic variants causing amino acid substitutions.
Predicting the effect of missense variants: in silico prediction methods
Sequence-based algorithms:
They use alignment of homologous sequences to calculate the conservation of the wild type residue
Evolutionarily conserved residues are unlikely to tolerate substitutions
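The idea of measuring conservation of the wild type residue from an alignment can be sketched as follows. This is a deliberately simple illustration and not the scoring used by any particular predictor: it assumes the query is the first sequence in the alignment, that the query has a residue (not a gap) at the column of interest, and that conservation is just the fraction of non-gap sequences matching it. The toy alignment is made up.

```python
def wild_type_conservation(alignment, column):
    """Fraction of aligned sequences that share the query residue at a column.

    alignment: list of equal-length aligned sequences, query first ("-" = gap).
    column:    0-based alignment column.
    """
    residues = [seq[column] for seq in alignment]
    wild_type = residues[0]                      # residue in the query sequence
    observed = [r for r in residues if r != "-"]  # ignore gapped sequences
    return observed.count(wild_type) / len(observed)


# Toy alignment of four hypothetical homologues (query first).
alignment = [
    "MKRC",
    "MKRC",
    "MQRC",
    "MKHC",
]
for col in range(4):
    print(col, alignment[0][col], wild_type_conservation(alignment, col))
```

Columns with a value close to 1 are the ones where a substitution is least likely to be tolerated.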
Structure-based algorithms. These can be divided into:
* algorithms that calculate the difference in free energy (∆∆G) between the wild type and variant structures, e.g. FoldX
* algorithms that use structural features but without providing a ∆∆G
Structure-based algorithms that do not provide a ∆∆G can be further divided as follows:
* In most cases, the 3D structural features of the residue harbouring the variant, such as surface accessibility, hydrophobicity etc., are combined with sequence-based features, e.g. PolyPhen-2.
- These predictors generally calculate the probability of a variant being damaging but do not return information on the mechanism by which the variant affects the phenotype.
* Some methods use 3D structure coordinates to perform an in-depth atom-based study of the effect of a missense variant, e.g. Missense3D and SAAPdap/SAAPpred.
- These predictors provide information on the structural damage, e.g. breakage of a cysteine bond or a steric clash, thus providing the user with information on the mechanism by which a variant may disrupt protein folding and/or function.
- In the case of SAAPdap/SAAPpred, information on sequence conservation is also included in the variant analysis.
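For the free-energy-based algorithms mentioned above, the quantity being calculated can be written out explicitly. The sign convention below (variant minus wild type, with ∆G taken as the folding free energy) is an assumption, since tools differ in how they report it:

```latex
\Delta\Delta G = \Delta G_{\text{variant}} - \Delta G_{\text{wild type}}
```

Under this assumed convention, a positive ∆∆G suggests that the variant structure is less stable than the wild type, and a value near zero suggests little effect on stability.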
Sorting Tolerant From Intolerant (SIFT) algorithm
* It uses a sequence homology-based approach to predict the deleterious effect of an amino acid substitution.
* A group of proteins homologous to the query protein is automatically identified with PSI-BLAST and used to build the sequence alignment
* SIFT calculates the probabilities for all 20 amino acids at a specific position.
* The output score is a probability for each of the 19 possible amino acid substitutions at each position in the aligned target protein
* A score of 0.05 or less is considered indicative of a deleterious substitution.
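The 0.05 cut-off above translates into a simple rule for interpreting scores. This is a minimal sketch of applying the threshold only, not of how SIFT computes its scores; the function name and the example scores are hypothetical.

```python
SIFT_THRESHOLD = 0.05  # scores at or below this are considered deleterious


def sift_call(score: float) -> str:
    """Interpret a SIFT score using the 0.05 threshold."""
    return "deleterious" if score <= SIFT_THRESHOLD else "tolerated"


# Hypothetical scores for three substitutions at the same position.
scores = {"p.Arg70Cys": 0.01, "p.Arg70Lys": 0.42, "p.Arg70His": 0.05}
for variant, score in scores.items():
    print(variant, score, sift_call(score))
```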
Polymorphism Phenotyping v2 (PolyPhen-2)
* It estimates the probability of an amino acid substitution being damaging based on a combination of sequence and structure-based features.
* It also assesses the substitution qualitatively (benign, possibly damaging or probably damaging)
* The properties of the wild type allele are compared to the properties of the mutant allele
* Sequence conservation is evaluated by automatically selecting and aligning a set of homologous sequences
What can you predict with PolyPhen-2?
* a single amino acid substitution
* a large number of amino acid substitutions (batch mode)
REVEL (Rare Exome Variant Ensemble Learner)
* REVEL is an ensemble method for predicting the pathogenicity of amino acid substitutions
* It combines predictions from several tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons
* A set of variants (training set) was used to train a Random Forest
* Training set: ~6,000 disease variants and ~120,000 rare putative neutral exome variants (allele frequency 0.001-0.01)
Variant Effect Predictor (VEP)
- It can be used to examine the effect of variants identified in human and non-human species
- It can be used to analyze variants in coding and non-coding regions
What does VEP provide?
- annotations for the effect of the variant on transcript, protein, and regulatory region (e.g. promoter)
- allele frequencies of the query variant
- disease and/or phenotype information
- in silico predictions from tools such as SIFT and PolyPhen-2
In silico prediction methods: SAAP
- Single Amino Acid Polymorphism data analysis pipeline (SAAPdap) and predictor (SAAPpred)
- It analyses the likely structural effect of an amino acid substitution on the 3D protein structure
- It also considers residue conservation
- It requires the availability of an experimental 3D structure
- The PDB structure does not need to be provided
What is the accuracy of variant prediction from models compared to experimental structures?
The accuracy of variant prediction obtained using 3D models is similar to that obtained using 3D experimental structures
Why use 3D coordinates for variant prediction?
* 3D protein structure data enables one to investigate the effect of genetic variants at the atomic level
* Best Practice Guidelines for Variant Classification 2020: “3D structure data can be used to upgrade evidence for pathogenicity in the variant decision making process” (PM1 criterion)