10 Standardisation - Genomes, Genes and Nomenclature Flashcards

1
Q

What was the specific aim of the human genome project?

A

To sequence the euchromatic human genome: this is the lightly packed chromatin that is enriched in genes. It’s 92% of the genome.

2
Q

How much of the genome is considered protein coding?

A

25%

3
Q

How much of the genome is actually protein coding because it’s in exons?

A

1%

4
Q

What surprised people about the content of the genome?

A

There’s more segmentally duplicated DNA than expected

5
Q

What makes up 15% of the genome?

A

Short interspersed nuclear elements (SINEs), primarily Alu elements

6
Q

Who contributed DNA to the human genome project?

A

13 or 30 people… From Buffalo NY. 66% is from one male donor

7
Q

In 2001 how much of the genome had been sequenced? What was the error rate? And how many gaps were there?

A

90% coverage, high error rate of 1 in 1000, and there were 15,000 gaps in the euchromatic genome

8
Q

How much of the genome had been sequenced in 2003 with reference hg16 (NCBI34)? What was the error rate, and how many gaps were there?

A

99% coverage, 1 in 10,000 error rate, with 400 gaps in the euchromatic genome

9
Q

In 2009 GRCh37 was published. How many gaps were there, and how many genes still had sequencing errors?

A

300 gaps in euchromatic genome in build 37, and 550 genes with sequencing errors

10
Q

In 2013 what was published?

A

GRCh38

11
Q

How many gaps were in GRCh38?

A

as few as 89 gaps

12
Q

When was the Telomere to Telomere (T2T) genome published?

A

2022

13
Q

Why is it hard to transition from GRCh38 to T2T?

A

Because the MANE project is still based on GRCh38. The MANE project aims to build consensus on a defined set of transcripts representative of the expression of the whole genome.

14
Q

What are patches?

A

Additional sequences with their own identifier, adding info to the reference genome without disrupting it

15
Q

What are the two types of genome patch?

A

Fix patches and Novel patches

16
Q

What is a fix patch?

A

A patch that corrects a gap or a sequence error in the reference

17
Q

Why are fix patches needed?

A

Because changing the reference genome by incorporating the patch would change downstream position numbers, so it can’t be done directly. A fix patch adds info without altering the number of bases.
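
A minimal sketch of the coordinate problem described above. The toy sequence, the position of interest and the inserted bases are all invented for illustration; it only shows that editing a reference in place renumbers every downstream base.

```python
# Toy reference; the sequence and positions are invented for illustration.
reference = "ACGTACGTAC"
feature_pos = 8  # 1-based position of some downstream base of interest


def base_at(seq, pos):
    """Return the base at a 1-based position."""
    return seq[pos - 1]


# Editing the reference in place: inserting 3 bases after position 4
# shifts every downstream coordinate by 3.
edited = reference[:4] + "TTT" + reference[4:]
shifted_pos = feature_pos + 3

print(base_at(reference, feature_pos))  # the original base at position 8
print(base_at(edited, feature_pos))     # position 8 now points at a different base
print(base_at(edited, shifted_pos))     # the original base has moved to position 11
```

Keeping the correction in a separate patch scaffold with its own identifier avoids this renumbering.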

18
Q

What are novel patches?

A

They are designed to provide an alternate structure for a chromosomal region, such as for the CYP2D6 duplication

19
Q

What happens to fix patches and novel patches when a new genome is released?

A

Fix patches are incorporated, but novel patch scaffolds remain as alt loci, representing variation

20
Q

What patches does GRCh38 have?

A

Some missing exons (e.g. SHANK3), some missing genes in patches, and indel errors in a few hundred genes like ABO, a few of which are clinically relevant

21
Q

Where can the reference genome be downloaded from?

A

UCSC, EBI (Ensembl), or NCBI (RefSeq)

22
Q

Why is NCBI RefSeq recommended for looking at the reference genome?

A

It has a more conventional numbering system that is used in analyses e.g. Chr1.
Files are in an unambiguous format.
IDs are uniquely labelled with an accession and version number.

23
Q

What can you do if you have a reference position in UCSC or Ensembl format and you want it in RefSeq format?

A

The RefSeq site has a table for rapid conversion of IDs.

24
Q

What makes a Transcript Reference Sequence Record different from a Genomic Reference Sequence Record?

A

A Transcript Reference Sequence Record needs to contain functional information, e.g. exon boundaries, transcription start and stop sites etc.

25
Q

How is functional information stored in a transcript reference sequence record?

A

In Metadata: Sequence ID, CDS start/end, Translation ID/sequence, publications, features

26
Q

When considering genes from an informatics perspective, what are we really referring to?

A

A collection of transcript reference sequences, and sometimes functional reference sequences, and how they map onto a chromosomal reference sequence

27
Q

What are General Feature Format files?

A

Simple tab-delimited text files that describe how transcript reference sequences are mapped onto a genomic reference sequence, with a single feature per line.

28
Q

Why might you find identical transcripts aligning differently to a given genome build between RS providers?

A

They may have different alignment algorithms

29
Q

What formats are there for files detailing how transcript RS’s map to genomic RS’s? Which is widely used?

A

GTF and GFF are both suitable file types, with different formats. GFF3 is widely used for representing genomic and transcriptomic features, including gene annotations, alignments and other features.

30
Q

What do GFF3 files start with?

A

A header with metadata about the file. Header lines begin with #, and the first line is the ##gff-version 3 directive.

31
Q

What are some of the columns of feature lines in a GFF3 file?

A

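Sequence ID - the sequence (e.g. chromosome) on which the feature is located
Source - the database or software tool that produced the feature
Type - e.g. gene, exon, transcript
Start - start position (1-based)
End - end position (inclusive)
Score - a quality or confidence score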

32
Q

How is the strand denoted in GFF3 files?

A

+ or -

33
Q

What is the ‘Phase’ in GFF3 files?

A

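It applies to coding sequence (CDS) features and is 0, 1 or 2: the number of bases to skip from the start of the feature to reach the first base of the next complete codon, i.e. it tracks the reading frame.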

34
Q

What are some attribute tags in a GFF3 file?

A

ID; Name; Alias; Parent (e.g. A gene is the parent of an exon)
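
The GFF3 cards above (columns, strand, phase, attributes) can be tied together with a minimal parsing sketch. The function name `parse_gff3_line` and the example line are hypothetical (the IDs and coordinates are placeholders, not real annotation); the nine column names follow the GFF3 layout.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 tab-separated columns")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # inclusive
    # Attributes are semicolon-separated tag=value pairs (values may be URL-escaped).
    attrs = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attrs[tag] = unquote(value)
    record["attributes"] = attrs
    return record


# Invented example line: placeholder IDs and coordinates.
example = "chr1\texample_source\tCDS\t1300\t1500\t.\t+\t0\tID=cds1;Parent=mrna1;Name=demo"
rec = parse_gff3_line(example)
print(rec["type"], rec["start"], rec["end"], rec["strand"], rec["phase"])
print(rec["attributes"]["Parent"])
```

The score, strand and phase columns are left as strings here because "." is a legal placeholder in each of them.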

35
Q

What does LRG stand for?

A

Locus Reference Genomic

36
Q

What are LRGs created for?

A

LRGs are created for clinical reporting, in order to create a stable gene sequence and annotations. They are used for variant reporting and annotation.

37
Q

How are LRGs made?

A

They’re curated by experts and they have the minimal set of transcripts, ideally 1

38
Q

Before GRCh37 the genome sequence was too poor for clinical reporting, what was used instead?

A

Gene reference sequences, against which transcripts were annotated.

39
Q

How do LRGs provide stability to genomic references?

A

In an ideal world we need unique, stable identifiers that are not versioned. LRGs were introduced to establish a coordinate system that is independent of upgrades to the reference genome assembly and that maps to present and past assemblies.

40
Q

What do we need to happen to have LRGs be fully useful?

A

Faster convergence between RefSeq and Ensembl, which needs multidirectional data exchange between them.

41
Q

LRGs are well supported. What is the evidence for this?

A

They were integrated into different genome browsers for visualisation in genomic context with other existing annotations. And they are compatible with HGVS nomenclature.

42
Q

What is now obsolete in the GRCh38 era?

A

LRGs and GENE reference sequences

43
Q

What are the issues with LRGs?

A

Their gene coverage was too low, and they are now too old to be aligned to GRCh38

44
Q

What are the takeaways from LRGs?

A

They were designed as static reference sequences for standardised clinical practice, from before we had a good reference genome. They are difficult to incorporate new evidence into because they are static, and can’t be updated without creating a new LRG.

45
Q

What does MANE stand for?

A

Matched Annotation from NCBI and EMBL-EBI.

46
Q

MANE was built on what?

A

The LRG project

47
Q

What was MANE made for?

A

To ensure each gene had at least one identical transcript from end to end, including exon break points, in both the ENSEMBL and RefSeq datasets.

48
Q

What coverage of genes does MANE have compared to LRGs?

A

Much better. 98% compared to 15%. But LRGs did help to inform MANE.

49
Q

MANE ensures transcript IDs are what?

A

Interchangeable between RefSeq and ENSEMBL.

50
Q

LRGs tried to use just one transcript for each gene ideally. But MANE intends to include what?

A

MANE intends to include a comprehensive set of clinical transcripts based on the most up to date scientific and clinical evidence

51
Q

Why does HGVS stress that we must use unique and stable identifiers?

A

Otherwise variant reporting can easily lose track of the reference sequence against which we are describing variants.

52
Q

HGVS numbering states the most ____ position of the variation?

A

The most 3’ position. By convention, this is the nucleotide that is described as changed.
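
A minimal sketch of the 3’ rule for a deletion inside a repeat. The function name `shift_deletion_3prime` and the toy sequences are invented for illustration; real tools handle more cases (insertions, duplications, strand and transcript orientation).

```python
def shift_deletion_3prime(seq, start, end):
    """Shift a deletion (1-based, inclusive) to its most 3' equivalent position."""
    # If the base just after the deleted range equals the first deleted base,
    # sliding the range one base 3' produces the same resulting sequence.
    while end < len(seq) and seq[end] == seq[start - 1]:
        start += 1
        end += 1
    return start, end


# Deleting any one T from the TTT run gives the same result,
# so the deletion is described at the most 3' T.
print(shift_deletion_3prime("GATTTC", 3, 3))

# The same idea for a deleted TC unit inside a TCTC repeat.
print(shift_deletion_3prime("GATCTCTG", 3, 4))
```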

53
Q

NM_007298.3 refers to what?

A

A single stable transcript sequence that will never change over time.

54
Q

What would you do if NM_007298.3 needed to have a polyA tail removed?

A

The ID must also change, to NM_007298.4
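
A minimal sketch of how an accession.version identifier can be split and compared. The helper name `split_accession` is hypothetical; the identifiers are the ones used in the cards above.

```python
def split_accession(identifier):
    """Split 'NM_007298.3' into the accession 'NM_007298' and the integer version 3."""
    accession, _, version = identifier.partition(".")
    if not version.isdigit():
        raise ValueError(f"{identifier!r} has no numeric version suffix")
    return accession, int(version)


old = split_accession("NM_007298.3")
new = split_accession("NM_007298.4")

# Same accession, different version: two distinct, individually stable sequences.
print(old[0] == new[0], old[1] != new[1])
```

Comparing the full pair, not just the accession, is what keeps a variant description tied to the exact sequence it was written against.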

55
Q

What’s the definition of a variant?

A

A sequence-level difference between a query sequence and the aligned position in a reference sequence

56
Q

For correct variant naming, you need the reference sequence to be what?

A

Widely available.
Well annotated (TSS, UTRs, Exon boundaries)
Maintained.
Uniquely identifiable.
With the appropriate metadata (accession numbers and versions, definitions, organism, related articles, annotations, revision history).

57
Q

What are two terms similar to variant that are now outdated?

A

Mutation (any DNA change in sequence), and Polymorphism (any mutation with a frequency >1%).

58
Q

Who came up with the SNV interpretation guidelines?

A

ACMG and AMP

59
Q

What does VCF stand for?

A

Variant call format.

60
Q

What does a VCF file do?

A

It describes genome-level variation using a genome-coordinate-based system. Variants may be point mutations, indels, deletions, copy gains or translocations.

61
Q

What is the HGVS Nomenclature for a substitution?

A

g.1322G>T

62
Q

What is the HGVS Nomenclature for a deletion?

A

g.3601_3627del

63
Q

What is the HGVS Nomenclature for an inversion?

A

g.495_499inv

64
Q

What is the HGVS Nomenclature for a duplication?

A

g.3661_3702dup

65
Q

What is the HGVS Nomenclature for a conversion? (Also what is a conversion?)

A

g.333_590con1844_2101
A conversion is where a range of nucleotides is replaced by a copy of a sequence from elsewhere in the genome (a non-reciprocal transfer).

66
Q

What is the HGVS Nomenclature for an insertion?

A

g.7339_7340insTAGG

67
Q

What is the HGVS Nomenclature for a deletion-insertion (delins)?

A

g.112_117delinsTG
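
The simple g. forms from the cards above can be recognised with a few regular expressions. This is a minimal sketch, assuming only DNA bases and the basic patterns shown in this deck; the `classify` helper is hypothetical and is not a full HGVS parser.

```python
import re

# One anchored pattern per variant type; only the simple g. forms are covered.
PATTERNS = [
    ("substitution", re.compile(r"^g\.\d+[ACGT]>[ACGT]$")),
    ("deletion-insertion", re.compile(r"^g\.\d+(_\d+)?delins[ACGT]+$")),
    ("deletion", re.compile(r"^g\.\d+(_\d+)?del$")),
    ("duplication", re.compile(r"^g\.\d+(_\d+)?dup$")),
    ("inversion", re.compile(r"^g\.\d+_\d+inv$")),
    ("insertion", re.compile(r"^g\.\d+_\d+ins[ACGT]+$")),
]


def classify(description):
    """Return the variant type for a simple g. description, or None if unrecognised."""
    for name, pattern in PATTERNS:
        if pattern.match(description):
            return name
    return None


for text in ["g.1322G>T", "g.3601_3627del", "g.495_499inv",
             "g.3661_3702dup", "g.7339_7340insTAGG", "g.112_117delinsTG"]:
    print(text, "->", classify(text))
```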

68
Q

Variant names should be simple. But to add additional context what can you add?

A

Descriptive terms from Sequence Ontology. These can relate to biological features such as binding_site and exon

69
Q

Sequence Ontology terms can be mutable (can change), so what must you provide when using them with variant names?

A

An SO accession number e.g. transcript_ablation - SO:0001893

70
Q

What is the standard format for describing variants, what’s the name for it?

A

The HUGO HGVS format. It’s due an update in 2024 as they do evolve.

71
Q

What was the first molecular disease identified? And what was the cause?

A

Sickle cell anemia. An abnormal beta globin chain due to Glu>Val / A>T change.

72
Q

If we have seen a genomic change, what do we need to do when we write it as a protein change?

A

Annotate that it is predicted unless it has been confirmed by protein sequencing or RNA sequencing.

73
Q

When annotating/naming a protein amino acid change, what is important about the numbering of the amino acid?

A

The methionine at position 1 is usually cleaved off during translation, but we still include the methionine in the sequence numbering. So position 7 will actually only be the 6th amino acid in the final protein.

74
Q

Sequences start with N and then another letter. What do these represent?
NC
NG
NM
NP

A

Chromosome level
Gene level
mRNA/transcript level
protein level

75
Q

What letters come after the sequence/transcript reference for these?
NC_000011.3:
NG_003345.2:
NM_000651.3:
NP_000815.7:

A

NC_000011.3:g.
NG_003345.2:g.
NM_000651.3:c.
NP_000815.7:p.
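
A minimal sketch that checks whether a description uses the coordinate type matching its reference prefix, based only on the four prefixes in the cards above. The helper name `is_consistent` and the variant parts of the example strings are placeholders invented for illustration.

```python
# Reference-sequence prefix -> expected coordinate type, as listed in the cards above.
EXPECTED_TYPE = {"NC": "g", "NG": "g", "NM": "c", "NP": "p"}


def is_consistent(description):
    """Check that e.g. an NM_ reference is followed by a c. description."""
    reference, _, variant = description.partition(":")
    prefix = reference.split("_")[0]
    expected = EXPECTED_TYPE.get(prefix)
    return expected is not None and variant.startswith(expected + ".")


# The variant parts below are placeholders, not real variants.
print(is_consistent("NM_000651.3:c.100A>G"))
print(is_consistent("NM_000651.3:g.100A>G"))
```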

76
Q

Where does numbering start for transcripts? i.e. what is c.1?

A

c.1 is the A of the ATG initiation codon, so positions in the 5’ UTR are negative: c.-1 is the base immediately upstream of the A, and there is no c.0.
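
A minimal sketch of c. numbering relative to the start codon. The function name `to_c_position` and the CDS start of 101 are assumptions for illustration; positions after the stop codon use the separate c.* notation and are not handled here.

```python
def to_c_position(transcript_index, cds_start_index):
    """Convert a 1-based transcript position to c. numbering (5' UTR and CDS only)."""
    offset = transcript_index - cds_start_index
    # c.1 is the A of the ATG; upstream bases are c.-1, c.-2, ... (there is no c.0).
    return offset + 1 if offset >= 0 else offset


cds_start = 101  # hypothetical: the A of the ATG is the 101st base of the transcript
for index in (1, 100, 101, 103):
    print(index, "-> c.%d" % to_c_position(index, cds_start))
```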

77
Q

What are MANE Select Plus Clinical Transcripts?

A

They are being created as a standardised set of additional clinically relevant reference transcripts for genes that are known to use alternatively spliced transcript isoforms.

78
Q

What is the cause of 90% of cases of the autosomal dominant (AD) disease classic Ehlers-Danlos Syndrome (cEDS)?

A

Pathogenic variants in either COL5A1 (typically haploinsufficiency) or COL5A2 (dominant negative - mutant protein interrupts function of wild type protein)

79
Q

Some classic Ehlers-Danlos Syndrome (cEDS) patients show a recessive inheritance pattern from a pathogenic variant in COL5A1. Why is this?

A

COL5A1 encodes two alpha1 chains that along with the third alpha2 chain from COL5A2, form the mature collagen V heterotrimer. COL5A1 produces two alternatively spliced transcript variants using mutually exclusive exons, leading to isoform A (~70% expression) with exon 64a, and isoform B (~30% of expression) with exon 64b. It’s something to do with this, but I don’t actually have an explanation…

80
Q

MANE Select Plus clinical transcripts capture all the what?

A

They are able to capture all pathogenic variant information