10 Standardisation - Genomes, Genes and Nomenclature Flashcards
What was the specific aim of the human genome project?
To sequence the euchromatic human genome: this is the lightly packed chromatin that is enriched in genes. It’s 92% of the genome.
How much of the genome is taken up by protein-coding genes (introns included)?
25%
How much of the genome is actually protein coding because it's in exons?
1%
What surprised people about the content of the genome?
There’s more segmentally duplicated DNA than expected
What makes up 15% of the genome?
Short interspersed nuclear elements (SINEs), primarily Alu elements
Who contributed DNA to the human genome project?
13 or 30 people… From Buffalo NY. 66% is from one male donor
In 2001 how much of the genome had been sequenced? What was the error rate? And how many gaps were there?
90% coverage, high error rate of 1 in 1000, and there were 15,000 gaps in the euchromatic genome
How much of the genome had been sequenced in 2003 with reference hg16 (NCBI34)? What was the error rate, and how many gaps were there?
99% coverage, 1 in 10,000 error rate, with 400 gaps in the euchromatic genome
In 2009 GRCh37 was published. How many gaps were there, and how many genes still had sequencing errors?
300 gaps in euchromatic genome in build 37, and 550 genes with sequencing errors
In 2013 what was published?
GRCh38
How many gaps were in GRCh38?
As few as 89 gaps
When was the Telomere to Telomere (T2T) genome published?
2022
Why is it hard to transition from GRCh38 to T2T?
Because the MANE (Matched Annotation from NCBI and EMBL-EBI) project is still based on GRCh38. MANE aims to build consensus on a defined set of transcripts that is representative of expression across the whole genome.
What are patches?
Additional sequences with their own identifier, adding info to the reference genome without disrupting it
What are the two types of genome patch?
Fix patches and Novel patches
What is a fix patch?
They correct gaps or sequence errors
Why are fix patches needed?
Because changing the reference genome by incorporating the patch would change downstream position numbers, so it can’t be done directly. A fix patch adds info without altering the number of bases.
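The coordinate problem can be shown with a toy sketch in Python. Everything here is invented for illustration (the sequence, the annotated position, the inserted bases and the patch identifier); real patches are separate scaffolds with their own accessions, not dictionary entries.

```python
# Toy reference sequence. Positions are 1-based, as in genome annotation.
reference = "ACGTACGTAC"
variant_position = 8  # hypothetical annotated position of a 'T'


def base_at(sequence: str, position: int) -> str:
    """Return the base at a 1-based position."""
    return sequence[position - 1]


# Option 1: edit the reference in place by filling a 3-base gap after
# position 4. Every downstream coordinate now shifts by 3.
edited_reference = reference[:4] + "GGG" + reference[4:]

# Option 2: leave the reference untouched and store the correction in a
# separate record with its own (made-up) identifier.
patches = {
    "TOY_FIX_PATCH_1": {"start": 3, "end": 6, "sequence": "GTGGGAC"},
}

original_base = base_at(reference, variant_position)
shifted_base = base_at(edited_reference, variant_position)
print(original_base, shifted_base)
```

With the in-place edit, position 8 no longer points at the annotated base, so every existing annotation downstream of the gap would need renumbering. With the patch, old coordinates still resolve to the same bases.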
What are novel patches?
They provide an alternate structure for a chromosomal region, for example the CYP2D6 duplication
What happens to fix patches and novel patches when a new genome is released?
Fix patches are incorporated into the new assembly, but novel patch scaffolds remain as alt loci, representing variation
What known problems does GRCh38 have that patches address?
Some missing exons (e.g. SHANK3), some missing genes in patches, and indel errors in a few hundred genes like ABO, a few of which are clinically relevant
Where can the reference genome be downloaded from?
UCSC, EBI (Ensembl), or NCBI (RefSeq)
Why is NCBI RefSeq recommended for looking at the reference genome?
It has a more conventional numbering system that is used in analyses e.g. Chr1.
Files are in an unambiguous format.
IDs are uniquely labelled with an accession and version number.
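As a minimal sketch of what "accession and version number" means in practice, the Python below splits a RefSeq-style identifier at its final dot. The class and function names are hypothetical; the two identifiers are the RefSeq records for chromosome 1 in GRCh37 and GRCh38.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_versioned_accession(identifier: str) -> VersionedAccession:
    """Split an identifier such as 'NC_000001.11' into accession and version."""
    accession, separator, version = identifier.rpartition(".")
    if not separator or not accession or not version.isdigit():
        raise ValueError(f"not a versioned accession: {identifier!r}")
    return VersionedAccession(accession, int(version))


# Chromosome 1 keeps the same accession across builds; only the version changes.
grch37_chr1 = parse_versioned_accession("NC_000001.10")
grch38_chr1 = parse_versioned_accession("NC_000001.11")
print(grch37_chr1, grch38_chr1)
```

Because the version is part of the ID, a position quoted against NC_000001.10 cannot be mistaken for one quoted against NC_000001.11.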
What can you do if you have a reference position in UCSC or Ensembl format and you want it in RefSeq format?
The RefSeq site has a table for rapid conversion of IDs.
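A conversion like this amounts to a lookup. The sketch below hard-codes three GRCh38 rows (UCSC name, Ensembl name, RefSeq accession) purely as an illustration; the table and function names are hypothetical, and in practice the full table would be downloaded from the provider rather than typed by hand.

```python
# (UCSC name, Ensembl name, RefSeq accession) for a few GRCh38 sequences.
GRCH38_NAMES = [
    ("chr1", "1", "NC_000001.11"),
    ("chrX", "X", "NC_000023.11"),
    ("chrM", "MT", "NC_012920.1"),
]


def to_refseq(name: str) -> str:
    """Return the RefSeq accession for a UCSC, Ensembl or RefSeq name."""
    for ucsc, ensembl, refseq in GRCH38_NAMES:
        if name in (ucsc, ensembl, refseq):
            return refseq
    raise KeyError(f"no RefSeq accession known for {name!r}")


print(to_refseq("chr1"))
print(to_refseq("MT"))
```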
What makes a Transcript Reference Sequence Record different from a Genomic Reference Sequence Record?
A Transcript Reference Sequence Record needs to contain functional information, e.g. exon boundaries, transcription start and stop sites etc.
How is functional information stored in a transcript reference sequence record?
In metadata: sequence ID, CDS start/end, translation ID/sequence, publications, features
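One way to picture a transcript record is as a sequence plus a few metadata fields. The Python sketch below is an assumption-laden toy: the class, its field names, the IDs and the 14-base sequence are all made up, and real records carry far more metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptRecord:
    sequence_id: str
    sequence: str
    cds_start: int  # 1-based, inclusive
    cds_end: int    # 1-based, inclusive
    translation_id: str
    publications: List[str] = field(default_factory=list)

    def cds(self) -> str:
        """Return the coding sequence described by the CDS metadata."""
        return self.sequence[self.cds_start - 1 : self.cds_end]


# Made-up transcript: 3 untranslated bases, a 9-base CDS, 2 untranslated bases.
record = TranscriptRecord(
    sequence_id="TOY_TRANSCRIPT.1",
    sequence="GGCATGGCCTAAGG",
    cds_start=4,
    cds_end=12,
    translation_id="TOY_PROTEIN.1",
)
print(record.cds())
```

Without the CDS start and end, the same string of bases would give no indication of where translation begins or ends, which is the functional information a genomic record does not need to carry.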
When considering genes from an informatics perspective, what are we really referring to?
A collection of transcript reference sequences, and sometimes functional reference sequences, and how they map onto a chromosomal reference sequence
What are General Feature Format files?
Simple tab-delimited text files that describe how transcript reference sequences are mapped onto a genomic reference sequence, with a single feature per line.
Why might you find identical transcripts aligning differently to a given genome build between reference sequence (RS) providers?
They may have different alignment algorithms
What formats are there for files detailing how transcript RSs map to genomic RSs? Which is widely used?
GTF and GFF are both suitable, but they are different file formats. GFF3 is widely used for representing genomic and transcriptomic features, including gene annotations and alignments.
What do GFF3 files start with?
A header with metadata about the file, on lines beginning with #. The first line is the version directive, ##gff-version 3.
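A short Python sketch of separating the header from the features. The file content, the sequence name `toy_chr1` and the function name are invented for illustration; the `##` directive and `#` comment conventions are those of GFF3.

```python
from typing import Dict, List

# A tiny, made-up GFF3 file held in a string.
gff3_text = (
    "##gff-version 3\n"
    "##sequence-region toy_chr1 1 1000\n"
    "# a free-text comment\n"
    "toy_chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene1\n"
)


def split_gff3(text: str) -> Dict[str, List[str]]:
    """Sort the lines of a GFF3 file into directives, comments and features."""
    parts: Dict[str, List[str]] = {"directives": [], "comments": [], "features": []}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            parts["directives"].append(line)
        elif line.startswith("#"):
            parts["comments"].append(line)
        else:
            parts["features"].append(line)
    return parts


parts = split_gff3(gff3_text)
print(parts["directives"][0])
```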
What are some of the columns of feature lines in a GFF3 file?
Sequence ID - where the feature is located.
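Source - the database or software tool that generated the feature
Type - e.g. gene, exon, transcript
Start - first position of the feature
End - last position of the feature
Score - a quality score for the feature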
How is the strand denoted in GFF3 files?
+ or -
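The columns can be made concrete with a minimal parser for one feature line, sketched in Python. The class name, function name and example line are hypothetical. Beyond the columns listed in these cards, GFF3 defines two more, phase (column 8) and attributes (column 9), and also permits `.` and `?` in the strand column for unstranded or unknown features.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Gff3Feature:
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_feature_line(line: str) -> Gff3Feature:
    """Parse one tab-delimited GFF3 feature line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    seqid, source, feature_type, start, end, score, strand, phase, attrs = columns
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    # Column 9 holds tag=value pairs separated by semicolons.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return Gff3Feature(
        seqid, source, feature_type, int(start), int(end),
        None if score == "." else float(score),  # '.' means no score given
        strand, phase, attributes,
    )


# Made-up exon on the minus strand of a made-up sequence.
feature = parse_feature_line(
    "toy_chr1\texample\texon\t100\t250\t.\t-\t.\tID=exon1;Parent=transcript1"
)
print(feature.type, feature.start, feature.end, feature.strand)
```

Since start and end are both inclusive, the length of this feature is `end - start + 1`.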