Big picture concepts Flashcards
In a sentence, describe the central dogma of molecular biology
• DNA is transcribed into mRNA, which is translated into proteins that, after protein modifications, have function
What -omic is used to analyse genes and how?
o Genes are analysed through genomics, involving DNA sequencing
What -omic is used to analyse mRNA and how?
o mRNA is analysed through transcriptomics, involving microarrays and next-gen sequencing
What -omic is used to analyse proteins and how?
o Proteins are analysed through proteomics, involving electrophoresis, chromatography and mass spectrometry
What -omic is used to analyse function and how?
o Function is analysed through metabolomics and lipidomics, involving mass spectrometry
What does transcriptional regulation involve?
o Transcriptional regulation involves alternative splicing, cell type specific expression…
What does translational regulation involve?
o Translational regulation involves masking, mRNA stability…
What does post-translation regulation involve?
o Post-translational regulation involves modification by O-GlcNAc, phosphate, ubiquitin…
What percentage of the human genome codes for protein coding regions?
• Around 1% of the human genome is protein-coding
Describe the first estimate of the number of genes in the human genome and how it compared to reality
• The first draft (2001) of the human genome estimated 30,000-40,000 genes, but by 2007 the estimate had fallen to about 20,500 genes
How are we similar to E. coli and yeast?
o Metabolically, we are similar to E. coli and yeast
Are the genes in the human genome unique?
o Most of our genes are shared with close and some with distant relatives
How many more genes do we have than unicellular organisms?
o We have 4-5x more genes than unicellular organisms
How many genes do dogs have? Do they have more or fewer genes than humans?
o We have more genes than dogs (19,000)
How many genes does the worm have? Does it have more or fewer genes than humans?
o We have fewer genes than the worm (25,000)
How many genes does Arabidopsis have? Does it have more or fewer genes than humans?
o We have fewer genes than Arabidopsis (28,000)
How many genes does rice have? Does it have more or fewer genes than humans?
o We have fewer genes than rice (75,000)
What percentage are we identical to chimps?
o We are 96% identical to chimps
Do humans know the function of all their genes?
• Almost half the genes have an unknown function
Which is more complex, the genome or the proteome? Why?
• Complexity resides in the proteome
o Whilst the genome is static, the proteome can exhibit temporal and spatial differences
• The proteome is constantly changing as cells respond to environmental conditions
o DNA is chemically homogeneous whilst proteins are heterogeneous
• The proteome may be as complex as a whole organism, a tissue or a single cell type
o Proteins are cellular effectors
What is the proteome?
• Proteome- the proteins expressed by the genome at any one time
What is the functional proteome?
o Functional proteome- the part of the proteome that is expressed at a given point in time
What is the theoretical proteome?
o Theoretical proteome- the genetic basis of the proteome
What is proteomics?
• Proteomics is the study of the proteome
What are metabolites?
• Metabolites- small molecules that are chemically transformed during metabolism and that, as such, provide a functional readout of cellular states.
Why are metabolites easier to correlate with phenotype compared to genes and proteins?
o Unlike genes and proteins, the functions of which are subject to epigenetic regulation and post-translational modifications, respectively, metabolites serve as direct signatures of biochemical activity and are therefore easier to correlate with phenotype
What is metabolic targeting?
• Metabolic targeting- quantification of a specific metabolite
What is profiling?
• Profiling- quantification of a group of related compounds or those found in a single biochemical pathway
What are the definitions of systems biology and why are there so many?
o Systems biology- study of living systems/ecosystems (e.g. gut microflora)
o Systems biology- using a global systematic approach studying a living system
• Systems biology is defined by Leroy Hood as:
o Hypothesis-driven
o Requires global/big data acquisition
o Need to integrate different types of data
o Need to delineate biological network dynamics
Network has spatial and temporal aspects that need to be understood
o Know how every single element in the network influences all other elements-allows for deeper understanding of the system
o Formulate models that are predictive and actionable- hypothesis generating
• But there is no concise definition of systems biology that all systems biologists agree upon
What are the two main philosophies towards systems biology?
- The reductionist approach towards systems biology
- The expansionist approach towards systems biology
What is the reductionist approach towards systems biology?
• The reductionist approach towards systems biology
o Systems biology is molecular biology, which is a continuation of mechanistic Darwinism, at a larger scale
o Reductionism-the practice of analysing and describing a complex phenomenon in terms of its simple or fundamental constituents, especially when this is said to provide a sufficient explanation.
What is the expansionist approach towards systems biology?
• The expansionist approach towards systems biology
o Emergence- complex systems have emergent properties which can’t be deduced from a reductionist approach
Individual components in a living system interact with each other
o If components have to interact with each other, there cannot be an understanding of the living system by only looking at individual parts
What are Koch’s postulates?
o Koch’s postulates
The microorganism must be found in abundance in all organisms suffering from the disease, but should not be found in healthy organisms
The microorganism must be isolated from a diseased organism and grown in pure culture
The cultured microorganism should cause disease when introduced into a healthy organism
The microorganism must be reisolated from the inoculated, diseased experimental host and identified as being identical to the original specific causative agent
Describe Falkow’s (1988) molecular Koch’s postulates
- The phenotype (sign or symptom of disease) should be associated only with pathogenic strains of a species
- Inactivation of the suspected gene(s) associated with pathogenicity should result in a measurable loss of pathogenicity
- Reversion of the inactive gene should restore the disease phenotype
Are the molecular Koch’s postulates reductionist or expansionist?
Inherently reductionist-relies on a single gene being responsible for a complete phenotype
Does not (until recently) consider all the off-target effects of knocking out the single gene
Many genes have multiple protein functions-no elucidation of the specific protein function affecting the disease
What -omics are primarily used in systems biology?
- Genomics
- Transcriptomics
- Proteomics
- Metabolomics
What is the difference between genomics and genetics?
o Organism-scale rather than single-gene (genomics vs genetics)
o Genetics and molecular biology are reductionist
o Genomics is expansionist (how all parts work together)
What are genome wide association studies, their procedure and their purpose?
o Large scale SNP and mutation analysis (e.g. GWAS) provide associations
Genome-wide association studies
• Aims to identify genetic component of multifactorial diseases
• Hypothesis-free or unbiased testing of the genome for association with disease or observable traits
• Using DNA samples from many people
o Disease cases vs matched controls
Matched controls- people of the same ethnic background
• Rapid scanning of genetic markers (SNPs)
o Across DNA subsets or whole genomes
o DNA microarrays
o Next-generation DNA sequencing
• Searching for variation associated with disease
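The case-versus-control comparison above can be sketched for a single SNP as a chi-square test on allele counts. This is a minimal illustration only: the function name and the counts are hypothetical, and real GWAS pipelines also correct for multiple testing and population structure.

```python
def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    observed = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical allele counts for one SNP (cases vs matched controls)
chi2 = allele_chi_square(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
print(chi2)  # 8.0, above the 1-df critical value of 3.84 (p = 0.05)
```

In a genome-wide scan the same test would be repeated for every marker, which is why a much stricter significance threshold than p = 0.05 is needed in practice.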
What is genomics enabled by?
o Enabled by high-throughput sequencing technology
What was the 1000 Genomes Project, what did it aim to do, what did it achieve and what did it cost?
Launched 2008, published in 2012
Spent about $30-50 million for 1092 genomes (roughly $27,000-$46,000 per genome)
Aimed to identify >98% of genetic variants which have a frequency of >1%
Achieved by light sequencing of the whole genome and heavy (high replicates) sequencing of the exome
Aims:
• To characterise the geographic and functional spectrum and to understand genetic contributions to disease by comparing these 1000s of genomes to each other
• Can tell us about evolution and sequence diversity
What was the 10,000 genome project, what did it do and what did it find?
Over 10,000 genomes with 30x-40x exome coverage
Presented the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome
Each new sequenced genome contributed an average of 8579 novel variants
Found that single nucleotide variants (SNVs) are generally rare in transcription factors (due to their essentiality) and occur more frequently in non-protein coding regions and outside of transmembrane receptors
What is Miller’s syndrome?
Miller’s syndrome is a rare inheritable disease that causes facial and limb abnormalities
How was genomics essential for elucidating genetic associations in Miller’s syndrome? Give an example
Genome sequencing has been essential for elucidating genetic associations in Miller’s syndrome
Roach et al. 2010
• Sequenced 4 genomes (both parents and two affected offspring) at 99.999% accuracy
o Removes noise as nucleotide variants are accounted for due to familial relationships
• 3.6M single nucleotide polymorphisms within the group
• Clustered the single nucleotide polymorphisms to identify 4 candidate genes that may be responsible for Miller’s syndrome
o Those genes code for proteins
o The major gene associated with Miller’s syndrome is:
Dihydroorotate dehydrogenase (DHODH)
What is dihydroorotate dehydrogenase (DHODH) and what is its purpose?
• DHODH is a major enzyme in the de novo pyrimidine biosynthesis pathway
• DHODH is essential for pyrimidine synthesis
o De novo pyrimidine biosynthesis is the major mechanism by which the cell generates pyrimidine nucleotides
Describe how dihydroorotate dehydrogenase dysfunction affects Miller’s syndrome
- DHODH is essential for pyrimidine synthesis
- However, there is also a salvage pathway (via catabolic products) that recycles old nucleotide products
- In Miller’s syndrome, mitochondrial DHODH is dysfunctional (mutations may lead to incorrect localization, incorrect folding, degradation, lack of efficient catalysis etc.)
- Survival is based on the ‘severity’ of the mutation, salvage and use of orotate from other sources
- This is highly inefficient from development onwards
When faced with a disease that is genetically inherited and influenced, what should be done to determine the cause of the disease and what are the benefits of this approach?
Genetic mutations cause disease because of the functional consequences on the encoded proteins
• Hence it is important to design experiments based on genome sequences to find a genetic association with disease, and then to look at every step of the central dogma (genome, transcriptome, proteome and metabolome) to thoroughly understand the cause
• This can lead to exploitation of the disease cause to treat other conditions
What is a disease example as to why it is important to employ an integrated -omics approach when looking at genetically influenced diseases?
o For example, CFTR in cystic fibrosis shows how multiple genetic mutations lead to different effects at the structure/function level and can affect the severity of the disease
Most common mutation is ΔPhe508 resulting in defective ER trafficking/processing, incorrect folding and proteolytic degradation
What is an example of how elucidating the biological cause of a disease that is genetically influenced can lead to exploitation of the disease cause to treat other conditions?
o For example, inhibition of DHODH (which is responsible for Miller’s syndrome) can be used to treat myeloid malignancies (cancer therapy)
What is transcriptomics and why is it useful?
o DNA encodes for RNA
o If the genome sequence is known, can monitor gene expression
o Transcriptomics is the systematic analysis of gene expression
o Can be studied at a cellular, tissue or whole organism level
o The coding RNA molecules need to be translated
What is the aim of proteomics and why is it useful?
o Allows us to see the global consequences of changes in protein abundance, including changes to other proteins, compensatory effects and spatial effects
o Proteins are synthesised from RNA by translation
o Aim to identify and quantify all the proteins
o Protein abundance is controlled by many factors
o The proteome is spatially and temporally dynamic
Why is metabolomics useful?
• Metabolomics
o Allows us to see the functional consequences of changes in protein abundance, including depletion of the product and accumulation of the substrate
What kind of data is handled in systems biology?
- Resequencing projects
- De novo genome sequencing
- SNPs, GWAS
- Transcriptome sequencing/profiling
- Metagenomics
- ChIP-seq and RNA-seq
What is a crucial tenet in systems biology, especially when applying systems biology to look at disease causes and treatments?
• Integrating large-scale data is extremely important
o Cannot rely on a single step of the central dogma for evidence- need a global view
For example, simply performing genomics is not enough
o It is crucial to determine how such mutations can cause disease
• Genomics, proteomics, metabolomics and functional assays should be used together to understand disease and provide strategies for interventions
o Miller syndrome is such an example
What does personalised medicine mean?
• Medical treatment could be tailored to the individual
o Therapies should be tailored to an individual
Why is personalised medicine needed? Give examples
o One size fits all no longer works in medicine
o Alignment of health and disease is inherently personal
There is a need to understand health before an understanding of disease can be reached
o The characterisation of tumours could facilitate the selection of appropriate therapies and increase success rate within the population
o The characterisation of an individual’s genome could lead to prophylactic (preventative) therapies
Better prediction of disease offers hope for earlier intervention
o Understanding health requires determination of the baseline molecular profile of an individual
Currently, we set arbitrary values for diagnostics to provide a yes/no answer- irrelevant to individual baseline and based on population mean
E.g. prediction of prostate cancer based on prostate-specific antigen (PSA) (proteome as a diagnostic tool) is largely inaccurate as the threshold is fixed at the population level- instead, the baseline PSA of an individual should be known before setting the risk threshold
• Can lead to harmful effects- prostate biopsy is an invasive procedure that many do not need
o Molecular profiling would assist in providing the right therapy, giving higher success rates than where many patients are given a generic therapy that may not work for them
Based on predictors
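The contrast between a fixed population threshold and an individual baseline can be sketched as below. This is a minimal illustration under stated assumptions: the threshold value, the readings and the function name are all hypothetical, and a z-score against the person's own history is only one possible way to define "deviation from baseline".

```python
from statistics import mean, stdev


def flag_against_baseline(history, new_value, z_cutoff=2.0):
    """Return True when new_value deviates from the person's own baseline
    by more than z_cutoff standard deviations."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return new_value != baseline
    return abs(new_value - baseline) / spread > z_cutoff


POPULATION_THRESHOLD = 4.0  # hypothetical fixed cut-off, for illustration only

person_a = [0.8, 0.9, 0.85, 0.9, 0.8]  # low personal baseline
person_b = [4.2, 4.4, 4.3, 4.1, 4.3]   # high but stable personal baseline

# Person A rises to 2.5: missed by the population threshold, flagged by the baseline
print(2.5 > POPULATION_THRESHOLD, flag_against_baseline(person_a, 2.5))  # False True
# Person B reads 4.3: flagged by the population threshold, normal for this person
print(4.3 > POPULATION_THRESHOLD, flag_against_baseline(person_b, 4.3))  # True False
```

The second case corresponds to the unnecessary-biopsy scenario described above: a value that is high relative to the population mean but unremarkable for that individual.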
Compare current medicine to future/personalised medicine in terms of:
- Focus
- Action type
- Measurements
- Frequency
- Target group
Current medicine-
Focused on illness and disease without reference to baseline health
Reactive
Measures very few things and often with poor accuracy
• High error margins
• Thresholds are insensitive and don’t take into account your baseline -omics
Infrequent
Population-based
• Mean based on population
Personalised/future medicine
Focused on health and deviations from health
Predictive of disease based on baseline and current -omics data
• Allows for early treatment and hence higher chance of treatment success
Measures many things with high accuracy
Frequent/consistent
• Profile the -omes of a person over time by providing sample every time a doctor is visited
Individual-based
• Mean based on the individual
What health data points can currently be obtained on an individual?
TeleHealth, Phenome, Social media, Epigenome, Transcriptome, iPS cells, Single cell, Transactional, Proteome, Genome
What is the benefit of wearables in generating personalized day-to-day data and what can this data be used for?
• Generation of personalized data
o There are many levels at which personalized data are currently generated
o Common wearable technologies (such as iPhones, Fitbits…) collect personalised data that can be used in many ways, including monitoring health
These technologies form the basis for research grade and clinical grade wearables
o All the data, if collected appropriately, can be used as a predictive tool regarding an individual’s health
If an individual’s baseline is known, then perturbations can immediately be seen
Where can wearables monitoring health be used?
o Wearables are already being used to assist disease diagnosis in the following locations:
• In hospitals
• In-home
• In remote/rural areas
• In low-resource areas
What sources can wearables monitoring health send data to?
Wearables can send data over the internet/other sources to
• Healthcare practitioners
• Telehealth
• Artificial intelligence
What personalized data can wearables monitoring health collect?
Data can regard
• Metabolic, cardiovascular and gastrointestinal health
• Sleep, neurology, mental health, and movement disorders
• Maternal, pre- and neonatal care
• Pulmonary health and environmental exposures
What challenges and limitations do wearables monitoring personalised health face?
Challenges and limitations include:
• Accuracy
• Privacy and security
• Oversight
• Scientific peer-reviewed evidence for safety and efficacy in healthcare
• Accessibility
• Cost
• Compatibility
• Acceptability
• Interpretation
• Technological form factor
• Lack of standards
What wearable device sensors are used for Lyme disease and what does each measure?
- Resting heart rate (photoplethysmography)
- Skin temperature (thermopile)
- SpO2 (pulse oximeter)
What ML algorithm is used for Lyme disease to analyse the output of wearable device sensors?
Peak detection; logistic regression
What wearable device sensors are used for respiratory viral infection and what does each measure?
- Resting heart rate (photoplethysmography)
- Skin temperature (thermopile)
What ML algorithm is used for respiratory viral infection to analyse the output of wearable device sensors?
Sliding window peak detection; logistic regression
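A sliding-window peak detector of the kind named above can be sketched as follows. This is an illustrative assumption, not the published algorithm: the window length, the cut-off `k`, the function name and the heart-rate values are all hypothetical.

```python
from statistics import mean, stdev


def sliding_window_peaks(values, window=7, k=3.0):
    """Return indices whose value exceeds mean + k*SD of the preceding `window` values."""
    peaks = []
    for i in range(window, len(values)):
        past = values[i - window:i]
        threshold = mean(past) + k * stdev(past)
        if values[i] > threshold:
            peaks.append(i)
    return peaks


# Hypothetical daily resting heart rate (beats per minute) with one elevated day
resting_hr = [60, 61, 59, 60, 62, 60, 61, 60, 59, 72, 61, 60]
print(sliding_window_peaks(resting_hr))  # [9]
```

Because each day is compared with that person's own recent history, this is another instance of judging a measurement against an individual baseline. The detected peaks could then serve as input features to a classifier such as logistic regression.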
What wearable device sensors are used for insulin resistance and what does each measure?
- Diurnal heart rate difference (photoplethysmography)
- Physical activity (accelerometer)
What ML algorithm is used for insulin resistance to analyse the output of wearable device sensors?
Multiple regression
What wearable device sensors are used for atrial fibrillation and what does each measure?
- Heart rate (AliveCor ECG)
- Heart rate (photoplethysmography)
What ML algorithm is used for atrial fibrillation to analyse the output of wearable device sensors?
Deep neural network
Which molecular approaches and -omics contribute to integrative personal omics profiles? Describe what sample each -ome comes from.
o Cells/tissues: Genome, Epigenome, Transcriptome, Proteome, Metabolome
o Body fluids (saliva, serum, plasma, urine): Proteome, Metabolome, Autoantibodyome, Microbiome, Envirome/exposome
o Body surface and waste (nasal cavity, skin, feces): Microbiome, Envirome/exposome
Describe the impact of the microbiome on human health
- The microbiome can alter human health almost completely without reference to the genome
- Diet influences the microbiome-if microbiome is pushed in a certain direction due to an individual’s diet that puts a selective pressure on certain bacteria and hence certain waste products, it can predispose that individual to certain diseases
What is the envirome/exposome and how is it measured?
• Use of environmental detectors (wearables)
o Detects the air and your exposure to the environment
What categories are relevant exposures for inclusion in exposome studies? Describe them
External
Meteorology-Climate change, temperature, humidity, wind, atmospheric pressure
Outdoor exposures-NO2, SO2, CO, O3, volatile organic compounds, particulate matter, radiation, UV, traffic, pollen
Built environment-Population density, building density, facilities, green space, walkability, neighbourhood safety, accessibility to resources, noise
Home environment-Volatile organic compounds, particulate matter, NO2, CO, aldehydes, metals, plasticizers, dust, pets, pests, allergens, mold, fungi, microbes, endotoxin
Personal behaviour-Diets, physical activity, tobacco smoke, alcohol, drugs, sleep, sex, cosmetics
Social economic factors-Social factors, education, economy, psychological and mental stress
Food and water contaminants-Fertilizers, metals, pesticides, plasticizers, water disinfection by-products, polychlorinated biphenyl, flame retardants
Medications-Medicines, surgeries
Occupational exposures-Chemicals, dust, metals, virus, animal proteins, plants, heat/cold stress
Internal
Primary external exposures and associated metabolites, epigenetic (e.g. methylations, histone modifications), microbiome/metabolome/proteome/transcriptome/genome changes etc.
Is an individual’s integrative personal omics profile spatially and temporally static? Why/why not? Give examples
• The integrative personal omics profile has both spatial and temporal aspects
o Spatial- each different component/level contains different data
o Temporal- each -omics profile can evolve over time. Health and disease are temporal. Need to know what people look like in health to be able to define what they look like in disease
But the genome is static and predictive
The transcriptome, proteome and metabolome are dynamic and reflective
Are health and disease states static or temporal?
Health and disease are temporal. Need to know what people look like in health to be able to define what they look like in disease
Why do -omics profiles rarely correlate with each other? What is the importance of this?
• Important to test across time (in health and when symptomatic) to predict disease earlier
-Omics profiles rarely correlate when a snapshot is taken (one point measurement) due to the time displacement between each step of -omics generation (e.g. the time between gene expression (transcriptomics) and production of a protein profile (proteomics) in the body)
• -Omics are also affected temporally. EXTREMELY IMPORTANT TO CONSIDER -OMICS PROFILES AS TEMPORALLY DEPENDENT
If tests are performed when symptoms are apparent, it is often too late. However, if tests are performed across time, can often find susceptibilities which can be treated with higher success
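The effect of time displacement on snapshot correlations can be illustrated with a toy example. The sketch below assumes a hypothetical transcript series and a protein series that simply follows it one time point later; the function names and values are invented for illustration, and real -omics layers are related far less cleanly.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(upstream, downstream, lag):
    """Correlate upstream[t] with downstream[t + lag]."""
    if lag == 0:
        return pearson(upstream, downstream)
    return pearson(upstream[:-lag], downstream[lag:])


# Hypothetical abundances: the protein level mirrors the transcript one step later
transcript = [1, 2, 8, 3, 1, 6, 9, 2, 4, 7]
protein = [0, 1, 2, 8, 3, 1, 6, 9, 2, 4]

same_time = lagged_correlation(transcript, protein, lag=0)
one_step_later = lagged_correlation(transcript, protein, lag=1)
print(round(same_time, 2), round(one_step_later, 2))
```

Measured at the same time point, the two series look almost unrelated, while accounting for the delay recovers a near-perfect relationship. This is only visible when profiles are sampled repeatedly over time.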
What is the Snyderome and how was it achieved? What was found from this Snyderome and what does this demonstrate?
o The Snyderome (Professor Snyder)
Monitored himself over 726 days with a variety of standard and -omics based assays
• Took urine and blood every day
Took serum and peripheral blood mononuclear cells (for DNA sequencing and RNA transcript analysis) which, through analyses, all resulted in integrated personal omics
Undertook a variety of phenotypic assays and lifestyle changes (increased exercise…)
Found a high number of loss of function SNPs in his genome, which are very important
• Found that he had genetic variants that predisposed him to hyperglycaemia and diabetes (and was later diagnosed as a type II diabetic)
o His risk factors were temporally affected- his risk factor of diseases increased during the two years
• Would look at his transcriptome and know that he would get a cold BEFORE he displayed symptoms
–Important to study the -omics over time as they are temporally affected
Describe what -ome techniques can be performed from peripheral blood mononuclear cells (from the Snyderome)
o Whole genome sequencing
o Whole transcriptome sequencing (mRNA and miRNA)
o Proteome profiling
Describe what -ome techniques can be performed from serum (from the Snyderome)
o Untargeted proteome profiling
o Targeted proteome profiling (cytokines)
o Metabolome profiling
o Autoantibodyome profiling
o Medical/lab tests
What data can be pulled from whole genome sequencing?
Variant calling/phasing
What data can be pulled from whole transcriptome sequencing?
Variant calling/phasing
Heteroallelic and variant expression
RNA editing
Quantitative differential expression and dynamics
Variant confirmation in RNA and protein
What data can be pulled from proteome profiling?
Quantitative differential expression and dynamics
Variant confirmation in RNA and protein
What data can be pulled from untargeted proteome profiling?
Quantitative differential expression and dynamics
What data can be pulled from proteome profiling (cytokines)?
Quantitative expression
What data can be pulled from metabolome profiling?
Dynamics
What data can be pulled from autoantibodyome profiling?
Differential reactivity
What data can be pulled from medical/lab tests?
Glucose, HbA1c, CRP, telomere length
Are all proteins produced by genomic ORFs of known function? Are they constant within a species?
• There are many coding regions (ORFs) in genomes that produce proteins with an unknown function (50% +)
o Some genes are specific to particular species or types within a species (e.g. in bacteria)
What are the two main potential issues when discovering new genetic ORFs that code for a protein?
o Genome sequencing reveals new proteins of unknown function with no homologs in other species
Genome sequencing reveals more homologs of proteins with unknown function (conservation of proteins with no known function suggests importance)
DUFs- Domains of Unknown Function
o Genome sequencing reveals protein homologous to known proteins in other species but without any context
Orphan enzymes- enzymes that are usually part of a larger pathway but are found in isolation
• Indicates that the pathway these enzymes are part of is carried out differently in different organisms
What do orphan genes in genomics indicate?
• Standard classical biochemical cycles are not necessarily the same in each organism
o Some organisms miss enzymes at crucial steps within supposed universal pathways
o Functions have evolved independently in many cases- as classical biochemical pathways are not conserved
What do protein sequences dictate and what can be inferred from this information?
- Protein sequences can tell us varying amounts of information about protein function- very few entries in the databases have been functionally proven
- Protein sequences dictate tertiary structure and protein functions
- The sequence of an unknown protein can be used to interrogate databases for similar proteins of known function (obtained via experimental, not computational, means)
- Sequence information does not always imply structural equality
How is it decided which proteins of unknown function should be prioritized for further study?
• Can apply:
o Expression tools (transcriptomics, proteomics)
Interactions of the protein with other known proteins/levels at the central dogma
o Computational tools (predictions regarding the unknown protein functions)
o Structural tools (predictions regarding the unknown protein tertiary structure)
o Phenotypic and molecular tools
o Genomic evidence: the genes of the unknown protein
Gene clustering
Gene fusion
Shared regulatory sites
Phylogenetic occurrence
o Post-genomic evidence: the transcript and protein of the unknown protein
Co-expression
Protein-Protein interactions
Organelle proteomes
Essentiality and other phenome data
Structures
What are computational approaches/databases to search in order to elucidate more information about an unknown protein?
o BLAST searches
o PROSITE-Finding shared motifs
o Structural predictions/genomics
o GRAVY (Grand Average Hydropathy)
o Kyte-Doolittle Plots (protein topology and hydrophobicity)
o PSORT-Protein Subcellular Location Predictor
Describe why structural predictions/genomics is a useful analysis to perform on an unknown protein
Many of the predicted ORFs have unknown functions
Structure can give insight to function
Structural genomics aims to systematically solve the structure of every protein in a genome
What is the use of modelling protein structures?
The use of modelling protein structures
• Use protein sequence to predict relationships to structures already present within the database
• Predict function for the many thousands of FUN (function unknown) proteins within genome sequence databases
• Clue to structure if impossible to obtain experimentally
• Predict effects of point mutations on known structures
• Model drug/protein interactions for many thousands of compounds- virtual screening
o Drug docking experiments
What is the GRAVY score and what information does it correlate with?
o GRAVY (Grand Average of Hydropathy)- provides a value reflecting the overall hydropathy (and hence likely solubility) of a predicted protein sequence
Can combine with transmembrane regions (TMR) analysis
• Plot GRAVY values (ProtParam) and predicted TMR (TmPred) for predicted protein sequences:
o Hydrophobicity (GRAVY; X-axis) tends to rise with increased TMR
o Majority are hydrophilic as they are intracellular
o There are analytical challenges with hydrophobic (insoluble) proteins
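As a minimal sketch of how a GRAVY value is obtained, the code below averages the Kyte-Doolittle hydropathy values over all residues, and reuses the same scale for a sliding-window profile of the kind shown in a Kyte-Doolittle plot. The function names, window size and example sequences are illustrative assumptions, not the ProtParam implementation.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic, negative = hydrophilic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of each window along the sequence (Kyte-Doolittle plot data)."""
    seq = sequence.upper()
    return [gravy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


# Hypothetical sequences: a hydrophobic stretch and a charged stretch
print(gravy("LLLL"))  # 3.8, strongly hydrophobic
print(gravy("DEKR"))  # negative, strongly hydrophilic
```

A positive GRAVY value suggests an overall hydrophobic protein, while sustained high-scoring windows in the profile are the kind of feature that transmembrane region predictors look for.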
What does BLAST stand for?
o BLAST- Basic Local Alignment Search Tool
What is a BLAST search, how does it function and what is its limitation?
o Searches unknown amino acid sequences (generally translated from DNA) against protein sequence databases (many entries)
Takes query sequence and matches it against an entry in the database
o Scores ranked based on number and order of identical and similar sequences
o Be aware that although there might be an identity match, the function of the protein needs to be determined EXPERIMENTALLY (it is also quite likely that the protein matched in the database has not had its function determined experimentally either)
What protein parameters are used to compare protein sequences to each other
• Protein parameters for searches (shared physicochemical properties of amino acids):
o Hydrophobicity
o Positively charged
o Negatively charged
o Polar
o Charged
o Small
o Tiny
o Aromatic
o Aliphatic
o Van der Waals volume
Define the term identity when used to compare amino acid sequences. Give an example of how it would be written
• Identity between amino acid sequences- a calculation based on the number or amount of conserved (i.e. identical) residues shared between two (or more) amino acid sequences
o Often expressed as protein X and protein Y share 62.5% sequence identity
Define the term similarity when used to compare amino acid sequences.
• Similarity- a calculation based on the number of not only identical amino acids, but also similar amino acids (e.g. small, polar, non-polar substitutions)
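The two calculations can be sketched for a pair of already aligned sequences as follows. The similarity groups, function name and sequences are hypothetical choices made for illustration; real tools score similarity with substitution matrices, and conventions differ on which alignment length is used as the denominator.

```python
# Hypothetical grouping of residues treated as "similar" for this sketch
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("STNQ"), set("AG")]


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over ungapped columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    return 100 * identical / compared, 100 * similar / compared


# Hypothetical aligned sequences: 5 of 8 positions identical, 2 more similar
identity, similarity = percent_identity_and_similarity("MKTLLVAG", "MKSLIVAD")
print(identity, similarity)  # 62.5 87.5
```

Similarity is always at least as high as identity, because every identical position also counts as similar.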
Define the term homology when used to compare amino acid sequences.
• Homology- at the sequence level, 2 proteins with a high degree of sequence similarity/identity are said to be highly homologous- strictly, homologous proteins descend from a common ancestor
What is a motif?
o A motif is a short (5-80 amino acid residues) stretch of sequence that is characteristic of a particular function or signal
What information can a motif give about a protein?
• Domains and motifs give us clues about broad functional capability
o A motif might signify several sites, which might indicate function
A protease recognition site
An enzyme activity
A binding site for a substrate or ligand
A site for post-translational modification
o Likely discoveries of sequence motifs
A clue to structure
A clue to function
• Enzyme catalytic sites
• Family relationships
• Prosthetic group attachment sites (e.g. biotin, heme, etc.)
• Binding of metal ions (e.g. Ca2+)
• Sites for disulfide bond formation (cysteine)
• Sites for protein-protein, protein-nucleic acid or protein-ligand binding
o Computational analysis of potential motifs must always be verified experimentally as there are a lot of exceptions to motifs depending on the motif
What is the PROSITE database and what information does it contain?
o Most widely used prediction tool for motifs
o Release August 2020 contains:
1860 documentation entries
1311 different conserved patterns/rules
o Motifs of lengths 4-80 are present, most common between 10-14 amino acids in length
o SWISS-PROT/UniProtKB: currently contains 563,082 manually annotated proteins and more than 188 million unreviewed entries
Identify PROSITE syntax, including: - character X {...} [...] (n) (n,m)
Each position in the motif is separated by a hyphen
One character denotes a specified residue that is required at that position
X is wild card- match any amino acid
{…} denotes a set of disallowed residues (these residues break the rule)
[…] denotes a set of allowed residues
(n) denotes a repeat of n
(n,m) denotes a repeat of between n and m inclusive
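The syntax rules above map almost directly onto regular expressions. The sketch below is a hypothetical helper that converts the elements listed here into a Python regex; it is not part of any PROSITE tool and handles only the elements described above. The example pattern N-{P}-[ST]-{P} is the commonly cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (elements described above) into a regex string."""
    pieces = []
    for element in pattern.strip().split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, low, high = match.groups()
        if core in ("x", "X"):
            piece = "."                      # wild card: any amino acid
        elif core.startswith("[") and core.endswith("]"):
            piece = "[" + core[1:-1] + "]"   # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"  # disallowed residues
        elif len(core) == 1 and core.isalpha():
            piece = core                     # one required residue
        else:
            raise ValueError("unsupported element: " + element)
        if low and high:
            piece += "{" + low + "," + high + "}"  # repeat between n and m
        elif low:
            piece += "{" + low + "}"               # repeat exactly n
        pieces.append(piece)
    return "".join(pieces)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)                                    # N[^P][ST][^P]
print(re.search(regex, "AANGSAA").group())      # NGSA
print(re.search(regex, "ANPSA"))                # None (proline breaks the rule)
print(prosite_to_regex("C-x(2,4)-C-x(3)-H"))    # C.{2,4}C.{3}H
```

A regex match only shows that the sequence fits the pattern. As with any motif hit, the predicted site still needs experimental confirmation.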
What are limitations of using motifs for analysis of proteins of unknown function?
o Additional constraints that a motif match might need to satisfy
Some motifs may not be viable in particular positions due to folding or other constraints (e.g. the glycosaminoglycan attachment site (S-G-x-G) additionally requires at least two acidic amino acids (Glu or Asp) from -2 to -4 relative to the serine)
High frequency (especially short) motifs have low confidence- need to characterise them functionally through experiments
Motif components may not be order dependent
Proof is never absolute until there is functional characterisation
How can a motif be manually and computationally recognised?
o Pattern extraction- recognising motif
Recognise identical amino acids shared between the sequences- these are most likely the motif
• Short sequences will be too general, but long sequences may be too specific
Large-scale data analysis enables pattern extraction
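Pattern extraction can be sketched by scanning the columns of a set of pre-aligned sequences and writing out a PROSITE-style pattern. This is a deliberately simple illustration: the function name, the `max_alternatives` cut-off and the sequences are hypothetical, and real motif discovery works on much larger data sets with statistical weighting.

```python
def extract_pattern(aligned_sequences, max_alternatives=2):
    """Build a PROSITE-style pattern from equal-length, pre-aligned sequences."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    elements = []
    for column in zip(*aligned_sequences):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                   # fully conserved position
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # a few allowed residues
        else:
            elements.append("x")                           # too variable: wild card
    return "-".join(elements)


# Hypothetical aligned fragments sharing a short motif
sequences = ["ANGSA", "KNATL", "ANVSG"]
print(extract_pattern(sequences))  # [AK]-N-x-[ST]-x
```

With only three short sequences the flanking [AK] position is probably coincidental, which illustrates why short or sparsely supported patterns are too general and need larger data sets and experimental follow-up.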