Duncan - variant nomenclature and analysis Flashcards
how much variation do we see in the average human genome?
compared to a reference human genome, a person’s ~6 billion-nucleotide genome sequence will have:
5,000,000 Single Nucleotide Variants (SNVs) that involve ~5,000,000 nucleotides
600,000 insertion/deletion variants (2+ nucleotides) that involve ~2,000,000 nucleotides
25,000 structural variants (such as CNVs) that involve >20,000,000 nucleotides
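To put those numbers in proportion, here is a quick arithmetic sketch. It only uses the figures from this card; the variable and function names are made up for illustration, and the structural variant figure is a lower bound, so the result is a minimum.

```python
# Figures from the card above. The structural-variant figure is a lower
# bound (">20,000,000"), so the fraction computed here is a minimum.
GENOME_SIZE = 6_000_000_000

affected_nucleotides = {
    "SNVs": 5_000_000,
    "indels": 2_000_000,
    "structural variants": 20_000_000,
}


def fraction_affected(affected_by_type, genome_size=GENOME_SIZE):
    """Return the fraction of the genome covered by the listed variants."""
    return sum(affected_by_type.values()) / genome_size


total_fraction = fraction_affected(affected_nucleotides)
print(f"at least {total_fraction:.2%} of nucleotides differ from the reference")
```

So roughly 27,000,000 of 6,000,000,000 nucleotides differ, i.e. a little under half a percent.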
what is the basic structure of a gene?
Start codon - ATG for amino acid methionine, sets the reading frame and initiates translation
Exons - codes for the protein, contributes to the final mRNA molecule that determines order of amino acids
Introns - non-coding, don’t contribute to final mRNA molecule, removed by splicing
Stop codon - several options, e.g. UGA, UAG, UAA (in the mRNA)
one possible type of variant is a nonsense variant -
what is this?
how does the cell deal with it/the resultant mRNA?
is it likely to be disease causing?
Alter the amino acid code, resulting in a stop codon, so the protein ends prematurely.
These mRNAs are then targeted by NMD - nonsense mediated decay - a system that prevents production of faulty proteins (quality control, protects the cell from ‘aberrantly’ functioning proteins). Likelihood of NMD should be considered when assessing variants that result in shortened proteins
mRNA that escapes NMD may produce a shortened protein that retains some functionality, so it is potentially not disease causing. However, if the mRNA doesn’t escape NMD it is degraded, so you’re essentially losing that protein (it’s not being expressed), which is likely to be disease causing
how can you, generally, identify whether or not a nonsense mutation will result in mRNA that manages to escape NMD?
General rules to identify those aberrant mRNAs that may escape NMD:
if the DNA variant is present in last exon
if the DNA variant is located in last 50 nucleotides of the penultimate exon
(then the mRNA may escape NMD)
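The two rules above can be sketched as a small check. This is only an illustration of the rule of thumb on this card, not a validated NMD predictor: the function name, the exon-length input, and the idea of giving the stop position as a 1-based coordinate on the spliced mRNA are all assumptions made for the example.

```python
def may_escape_nmd(exon_lengths, stop_pos, window=50):
    """Apply the flashcard rule of thumb for NMD escape.

    exon_lengths: exon lengths in transcript order (nucleotides).
    stop_pos: 1-based position of the premature stop on the spliced mRNA.
    window: how many nucleotides at the end of the penultimate exon count.
    """
    total = sum(exon_lengths)
    if not 1 <= stop_pos <= total:
        raise ValueError("stop_pos lies outside the transcript")

    # Rule 1: variant in the last exon (a single-exon transcript always qualifies).
    last_exon_start = total - exon_lengths[-1] + 1
    if stop_pos >= last_exon_start:
        return True

    # Rule 2: variant in the last `window` nucleotides of the penultimate exon.
    penultimate_end = last_exon_start - 1
    penultimate_start = penultimate_end - exon_lengths[-2] + 1
    window_start = max(penultimate_start, penultimate_end - window + 1)
    return stop_pos >= window_start


# Hypothetical transcript with exons of 100, 200 and 150 nucleotides.
exons = [100, 200, 150]
for pos in (50, 250, 251, 301):
    print(pos, may_escape_nmd(exons, pos))
```

In this made-up transcript the last exon starts at position 301 and the penultimate exon ends at 300, so positions 251 to 300 fall in the 50-nucleotide window.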
name 6 types of variants, and the consequences they can have
Stop and start variants -
Occur in the stop or start codons
In start codon: translation not initiated, no protein product, probably disease causing
In stop codon: translation continues into the normally untranslated sequence 3’ of the stop codon, resulting in a protein with additional amino acids that are likely to interfere with structure + function and cause disease
Missense variants -
Most common, it’s just a substitution: a single nucleotide is swapped AND this changes the amino acid coded for
May or may not be pathogenic
Nonsense variants -
Alter the amino acid code, resulting in a stop codon, so the protein ends prematurely.
These mRNAs are then targeted by NMD - nonsense mediated decay (more later)
Deletion variants -
Result in a frameshift (unless the number of nucleotides deleted is a multiple of 3) and are therefore almost always disease-causing
Can be just 1 nucleotide or an entire gene (entire gene = CNV)
After the frameshift the sequence codes for different amino acids than WT, so you’ve got a novel protein with new or lost functions…
But the shifted reading frame often gives a new stop codon within the first 200 codons, so you get a truncated protein whose mRNA may be targeted by NMD
Duplications -
Addition of nucleotides = frame shift = altered amino acid sequence = likely to be disease causing
Also often gives a premature stop codon and truncated protein. Same as deletions in terms of consequences
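A minimal sketch of why deletions and duplications tend to give a frameshift and an early stop. The helper names and the short coding sequence are invented for illustration; the only facts used are that codons are 3 nucleotides long and that TAA, TAG and TGA are the DNA-level stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_frameshift(ref_allele, alt_allele):
    """True if the length change is not a multiple of 3."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


def first_stop_codon(cds):
    """Return the 1-based codon number of the first in-frame stop, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3 + 1
    return None


# Made-up coding sequence: ATG CTG AAA GGG TAA (stop is codon 5).
wild_type = "ATGCTGAAAGGGTAA"
# Delete the C at c.4, which shifts the frame: ATG TGA AAG GGT AA
deleted = wild_type[:3] + wild_type[4:]

print(is_frameshift("C", ""))        # one nucleotide lost, so the frame shifts
print(first_stop_codon(wild_type))   # stop codon position in the WT frame
print(first_stop_codon(deleted))     # premature stop in the shifted frame
```

Deleting 3 nucleotides instead would remove one codon without shifting the frame, which is why `is_frameshift` checks the length change modulo 3.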
RNA splicing - what are donor and acceptor sites?
what are donor and acceptor variants like/are they likely to cause disease?
Donor = the exonic and intronic sequences flanking the 5’ end of an intron, typically GT
Acceptor = the exonic and intronic sequences flanking the 3’ end of an intron, typically AG
Donor splice site variants -
Change in donor splice site = not recognised by splice machinery so removal of the intron is not initiated, the intronic sequence gets included in the mRNA, altering protein function and structure
The retained sequence can also cause a frameshift so you get an early stop codon and a truncated protein
Often disease causing
Acceptor splice site variants -
Results in exclusion of exon. Donor site is recognised and removal of intron initiated but acceptor site is never reached, so the exon is removed too
Very likely to be disease causing as the exon may encode vital parts of the protein (active sites/binding sites etc…)
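As a rough illustration of the "typically GT ... AG" rule, the sketch below checks whether an intron sequence starts and ends with the canonical dinucleotides. The function name and the example intron are made up, and real splice site recognition depends on more of the flanking sequence than these two bases.

```python
def canonical_splice_sites(intron):
    """Check the first two and last two bases of an intron (DNA letters)."""
    intron = intron.upper()
    return {
        "donor_ok": intron.startswith("GT"),     # 5' end of the intron
        "acceptor_ok": intron.endswith("AG"),    # 3' end of the intron
    }


# Hypothetical intron, then the same intron with a G>A change at the donor site.
print(canonical_splice_sites("GTAAGTCCTTTCAG"))
print(canonical_splice_sites("ATAAGTCCTTTCAG"))
```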
how does the spliceosome work?
objective = removal of introns from pre mRNA
small nuclear RNA (snRNA) molecules bind to specific proteins, forming a sn-ribonucleoprotein complex (snRNP)
this combines with other snRNPs forming the spliceosome. snRNPs recognise and bind to the acceptor and donor sites, the intron is looped out and excised
are donor/acceptor splice site variants likely to be pathogenic?
Donor/acceptor sites = 15% of recorded pathogenic variants, as can lead to aberrant splicing, excluding exons or including intronic sequences
when naming a variant, what are the three components involved (and where does the second one come from)?
- gene name
- reference sequence (represents normal WT)
- Variant description
a human genome reference sequence is used as a WT reference. While the human genome was first sequenced in 2003, it had gaps and errors, so it is constantly updated; the latest version is GRCh38, ‘Genome Reference Consortium human build 38’
there are multiple versions of the human genome sequence.
These are combined to form consensus sequences and are updated as more data is gathered, so different reference will differ slightly, which is why you must include it
why must you know which gene reference sequence you are using?
The DNA sequence of genes is predicted from human genome sequences and sometimes confirmed via assay.
But as sequence data and knowledge increases, the gene DNA sequences are regularly updated.
These can include new exons, longer introns, additional nucleotides etc
Therefore for accurate variant naming you must know which gene reference sequence you are using
We normally use references supported by sequencing of the corresponding mRNA transcripts as this provides good knowledge of intron exon boundaries
in terms of the reference used in variant naming, what are the three options?
NM_xxx = based on mRNA transcripts, so exons only (no introns)
NG_xxx = genomic sequence of a gene, includes introns
NP_xxx = protein sequence based on NM_xxx sequence
how is a gene’s reference sequence annotated (as in each base is given an annotation to tell you where it is in the sequence, how does this system work)?
the c. sequence
each nucleotide has a c. number, with c.1 being the A of the ATG start codon
nucleotides in the exons are then just numbered in order (c.1, c.2, c.3 etc…)
nucleotides in introns are numbered based on how far they are from the nearest coding nucleotide, so if the first exon in a gene ends at c.5, the first nucleotide of the intron is c.5+1
the last intronic nucleotide would be c.6-1
you’d include the base too, so c.5A or c.5+3T etc…
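The intron numbering can be sketched as below. The function name is invented, and the choice to give the middle nucleotide of an odd-length intron a "+" label is my understanding of the HGVS convention rather than something stated on this card, so treat it as an assumption.

```python
def intron_labels(last_exon_pos, intron_length):
    """Label each intronic nucleotide relative to the nearest coding nucleotide.

    last_exon_pos: c. number of the last nucleotide of the upstream exon.
    intron_length: number of nucleotides in the intron.
    Assumption: for an odd-length intron the central nucleotide gets a '+' label.
    """
    n_plus = (intron_length + 1) // 2
    n_minus = intron_length - n_plus
    labels = [f"c.{last_exon_pos}+{i}" for i in range(1, n_plus + 1)]
    labels += [f"c.{last_exon_pos + 1}-{i}" for i in range(n_minus, 0, -1)]
    return labels


# Exon ends at c.5 and is followed by a (very short, hypothetical) 6-nucleotide intron.
print(intron_labels(5, 6))
```

For the card's example this gives c.5+1, c.5+2, c.5+3 counting in from the upstream exon, then c.6-3, c.6-2, c.6-1 counting back from the next exon.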
how are amino acids labelled?
Named by position, start codon being 1, then followed by single letter or three letter code e.g. p.A23 or p.Ala23 = an alanine at position 23
what would c.11G>A mean?
say this change makes the OG glycine become an aspartic acid, how would you write this at the protein level?
Nucleotide 11 (coding nucleotide) was a G and has been changed to an A (>means change/substitution)
at protein level:
11th nucleotide is in the fourth codon, so fourth amino acid = p.4
p.Gly4Asp same as above you just don’t use the arrow to indicate change (>)
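The same working can be sketched in code: find the codon that contains the c. position, swap the base, and translate both codons. The function name and the example coding sequence are invented for illustration (the sequence is simply one where codon 4 is GGT); the codon table is the standard genetic code written in DNA letters.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def protein_change(cds, c_pos, ref, alt):
    """Turn a coding substitution (e.g. c.11G>A) into a protein-level name."""
    index = c_pos - 1                      # c. numbering is 1-based
    if cds[index] != ref:
        raise ValueError("reference base does not match the sequence")
    codon_number = index // 3 + 1          # c.11 falls in codon 4
    start = (codon_number - 1) * 3
    offset = index - start
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = THREE_LETTER[CODON_TABLE[ref_codon]]
    alt_aa = THREE_LETTER[CODON_TABLE[alt_codon]]
    if ref_aa == alt_aa:
        return f"p.{ref_aa}{codon_number}="   # amino acid unchanged
    return f"p.{ref_aa}{codon_number}{alt_aa}"


# Hypothetical coding sequence: ATG GCT AAA GGT CTG TAA
print(protein_change("ATGGCTAAAGGTCTGTAA", 11, "G", "A"))
```

Here c.11 is the middle base of codon 4 (GGT, glycine), and G>A turns it into GAT (aspartic acid).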
in ‘p.(Gly4Asp)’ what do the brackets indicate?
Brackets indicate this protein change is a prediction based on the sequence and has not been experimentally confirmed
for a gene we have two alleles.
how would I indicate:
1. two WT alleles in terms of DNA sequence?
2. two WT alleles in terms of protein/Aa sequence?
3. one WT one variant?
1. c.[=];[=]
the ‘=’ is for WT, the ‘[ ]’ enclose each allele, the ‘;’ separates the two alleles
2. p.[(=)];[(=)]
the extra round brackets for the protein = only DNA sequencing has been done, so the protein result is a prediction
3. If there is a variant, just put its description in the square brackets in place of the ‘=’, e.g. c.[11G>A];[=]. If it’s on both alleles, replace both ‘=’
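A small sketch of how the two-allele notation is put together. The function and its parameters are made up for illustration; it only assembles the string pattern described on this card and does not validate the variant descriptions themselves.

```python
def genotype(allele1=None, allele2=None, level="c", predicted=False):
    """Build a two-allele description such as c.[11G>A];[=].

    allele1, allele2: variant description without the prefix, or None for WT.
    level: "c" for the DNA level or "p" for the protein level.
    predicted: wrap each allele in round brackets (protein inferred from DNA only).
    """
    def fmt(variant):
        text = "=" if variant is None else variant
        return f"[({text})]" if predicted else f"[{text}]"

    return f"{level}.{fmt(allele1)};{fmt(allele2)}"


print(genotype())                                          # two WT alleles, DNA level
print(genotype(level="p", predicted=True))                 # two WT alleles, protein level
print(genotype("11G>A"))                                   # one variant, one WT
print(genotype("Gly4Asp", "Gly4Asp", "p", predicted=True)) # variant on both alleles
```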