Proteomics Flashcards
What is the proteome?
Total protein profile/complement present in a biological sample at a given time
What is proteomics?
(clinical and general)
In a clinical view: Scientific area with the emphasis to analyze the whole proteome.
In a general view: A subset of methodologies that are needed in the analysis of proteins/proteomes
Why is it that Proteome ≠ Transcriptome?
!
Proteins are the dynamic and structural elements responsible for function, shape, etc. One gene can code for many gene products:
- Different promoters (DNA → RNA, e.g. p53)
- Alternative splicing (mRNA: introns / exons, e.g. p53):
- Alternative splicing plays an important role in protein diversity without significantly increasing genome size
- Common in cancer
- Decoration of proteins with modifications (phosphate, acetylation, e.g. p53)
- Different stability of mRNAs, influenced by cellular signals & feedback mechanisms
Example: In Cancer ~10% Protein to mRNA Correlation
What are the principles of proteome organization?
!
Protein-Protein Interactions –> Modularity
- The majority of a cellular proteome is organized in stable and transient complexes that form functional dynamic networks.
- Cell signaling: transient complex formation in time
- Regulation by PTM switches (time frame sec. to hrs)
- Protein complexes represent key functional units for the control of biochemical processes
–> Interactome: >2×10^5 PTMs
What are the goals of proteomics research?
- Establishment of parts list:
- What is there?
- Confirmation of predicted genes
- Characterization of PTMs –> Signaling networks
- Secretome analysis –> no mRNA available!
- Differential proteomics:
- Definition of proteins that make the difference (control vs disease)
- Biomarker discovery
- Changes in PTM patterns
- Quantitative protein assays:
- Multiplexed, mass spectrometry based “ELISA”
- Interactome:
- Definition of proteins building a functional complex (proteinaceous machines, PPI)
–> This often leads to hypothesis generation
What are the main 5 proteome characteristics?
What are the technical challenges associated with them?
!
- No amplification of proteins possible –> High sensitivity of analytical method
- Differential solubility –> Generality of analysis
- Proteome is dynamic –> Point of sample collection
- Protein modifications, protein processing, protein degradation –> Detect and identify differentially modified forms
- Dynamic range (a lot of proteins, some in really low concentrations) –> Suppression effects
What is the main technical approach for proteomics research?
Slice the Iceberg
Since analyzing a full sample is too difficult, we need to slice the sample in multiple smaller samples –> separation technologies –> LC-MS
How should sample preparation be done to access a proteome?
!
- Sample preparation should be kept as simple as possible. Each step is prone to introduce modifications, loss of proteins, variance!
- No single method can deal with all proteins
- -> often compromises have to be made
- For metabolomics similar considerations apply
- Sample collection and preservation influence sample quality
- -> loss of proteins by freeze-thaw cycles, degradation of proteins and PTM’s by enzymatic activities …
- Highly abundant proteins obscure large parts of a proteome
- -> pre-fractionation/depletion steps have to be considered
- At the end, sample preparation must be compatible with mass spectrometry
- -> Avoid too much salt, no detergents/polymers, …
- Intact protein (top-down) or peptide (bottom-up) based approaches
- -> solubility of intact proteins
What is affinity based enrichment?
!
Sample Fractionation
- Per se carried out under native conditions –> protein-protein interactions remain
- Bait fishing by use of antibody, or by a tag approach
- General pitfalls:
- Buffer conditions influence protein-protein interactions (ionic strength, detergents, pH, …)
- Low affinity between bait and target (especially with antibodies)
- Non-specific cross-reactions
- Transient interactions with fast on/off rates (low yields)
- Overexpression of genetically engineered Tag-protein & co-expression of native protein (interferences!)
What is Substrate-based Affinity Isolation?
!
Sample fractionation: Functional/Chemical Proteomics
Add a substrate that will attach to the protein of interest. Since we know what substrate we used, we can detect the shift in the MS
What sample fractionation methods are there?
(Separation methods for proteins & peptides)
- Affinity based enrichment
- Substrate-based Affinity Isolation
- Centrifugation
- Subcellular Fractionation
- Electrophoresis
- Isoelectric Focusing
- Macromolecule Electrophoresis
- Liquid Chromatography
What is the Bottom-up LC-MS/MS based Workflow (shotgun proteomics)?
!
LC-MS/MS:
- Eluted peptides are ionized and sent through MS -> m/z of precursor peptide ions
- Precursor ions are selected for fragmentation -> MS/MS of fragments, spectrum file (mz-Int)
- This information is sufficient to identify peptides
Why do we need protein quantification?
- mRNA levels correlate only weakly with protein concentrations.
- Proteins often have to be processed to be functional.
- Proteins are degraded or secreted.
- PTMs regulate many processes and are only present on protein level.
- Proteins often work in complexes, which is not encoded in the DNA or RNA sequence.
- Information about the stoichiometry of proteins in protein complexes tells more about the function of a complex
- …
What different types of quantitative proteomics are there?
Quantitative proteomics
Quantitative modification proteomics
Comparative proteomics
Biomarker studies
What is the aim of Quantitative proteomics?
measuring protein concentrations of a few dozens (complexes) to thousands (organelles, cells, tissue) of proteins in order to shed light on function and pathways
What is the aim of Quantitative modification proteomics?
measuring the levels of PTMs on proteins, i.e. biological activity or functional regulation
What is Comparative proteomics?
compares the expression levels of proteins between different biological conditions in order to find proteins whose concentration significantly differs between conditions.
These proteins can either be up or down regulated
What is the aim of Biomarker studies?
try to find proteins in body fluids (blood, urine, tears, saliva, ..) whose expression changes as a function of a disease state. They may then be used for early detection of a disease, for disease monitoring or measuring the effectiveness of treatment
What is the fold change?
Measuring Change
- Fold change = value1/value2
- Many biological systems react linearly to the fold change, therefore the fold change is often a natural measure
- Often expressed as log2(fold change) = log2(value1) - log2(value2)
- p-value:
- Is the outcome of a statistical test evaluating the significance of a fold change
- Takes into account the statistical variation of the values
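A minimal sketch of the two fold-change formulas above. The function names and the example intensities are illustrative assumptions, not values from the lecture.

```python
import math

def fold_change(value1: float, value2: float) -> float:
    """Plain ratio value1 / value2 (both values must be > 0)."""
    return value1 / value2

def log2_fold_change(value1: float, value2: float) -> float:
    """log2(value1) - log2(value2), i.e. log2 of the fold change."""
    return math.log2(value1) - math.log2(value2)

# Hypothetical abundances of one protein in disease vs control.
fc_up = fold_change(400.0, 100.0)         # 4-fold up
fc_down = fold_change(100.0, 400.0)       # 4-fold down, ratio 0.25
lfc_up = log2_fold_change(400.0, 100.0)   # about +2
lfc_down = log2_fold_change(100.0, 400.0) # about -2
```

On the log2 scale, up- and down-regulation of the same magnitude are symmetric around 0, which is one reason the log2 form is often preferred.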
What is absolute and relative quantification?
Absolute quantification:
aims at obtaining the concentration (counts/ml or g/ml or counts per cell, cpc) of a protein. This value has a physical meaning and can be compared to different experiments and used for chemical modeling (systems biology).
Relative quantification:
the quantitative values do not need to have a physical meaning but are values which are a monotonically increasing function of the actual concentrations. In the linear dynamic range these values are proportional to the actual concentrations and a ratio of two values is equal to the ratio of the two concentrations
What is the Dynamic Range of Quantitative measurements?
Important concept in quantification :
In the linear dynamic range, the response values are proportional to the actual concentrations and a ratio of two values is equal to the ratio of the two concentrations
What is spectral counting?
Count number of MS/MS spectra that match to a given protein
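A minimal sketch of spectral counting, assuming the search results are already available as (spectrum id, protein accession) pairs. The PSM list and the accessions are made up for illustration.

```python
from collections import Counter

# Hypothetical identified MS/MS spectra: (spectrum id, matched protein).
psms = [
    ("scan_0001", "PROT_A"),
    ("scan_0002", "PROT_A"),
    ("scan_0003", "PROT_B"),
    ("scan_0004", "PROT_A"),
    ("scan_0005", "PROT_C"),
]

def spectral_counts(psm_list):
    """Count how many MS/MS spectra were matched to each protein."""
    return Counter(protein for _, protein in psm_list)

counts = spectral_counts(psms)
```

Here PROT_A gets a count of 3, PROT_B and PROT_C a count of 1 each. The raw count is used as a rough, relative abundance indicator.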
What is label-free LC-MS1 quantification?
- Based on LC-MS or LC-MS/MS experiments.
- Use peak heights/area of LC-MS1 peaks as indicator for peptide abundance.
- No additional sample prep (no labeling) needed. Therefore fairly fast and cheap. But careful, reproducible sample preparation is needed!
- Software has to detect peaks, calculate peak volumes and align LC-MS1 peaks over different runs (retention time shifts between runs).
- Continuous alignment according to alignment tree
- Match features over different LC-MS runs (same RT and m/z)
- Transfer of MS2 identifications, where missing
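A minimal sketch of how peak height and peak area could be obtained from one extracted LC-MS1 peak. It assumes the peak has already been detected and is given as retention times with matching intensities. The trapezoidal rule is an illustrative choice here, not necessarily what a given software package uses.

```python
def peak_height(intensities):
    """Peak height: the maximum intensity across the elution profile."""
    return max(intensities)

def peak_area(retention_times, intensities):
    """Peak area by trapezoidal integration over the elution profile."""
    area = 0.0
    for i in range(len(retention_times) - 1):
        width = retention_times[i + 1] - retention_times[i]
        area += width * (intensities[i] + intensities[i + 1]) / 2.0
    return area

# Hypothetical elution profile of one peptide (RT in seconds).
rt = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [0.0, 10.0, 20.0, 10.0, 0.0]

height = peak_height(signal)
area = peak_area(rt, signal)
```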
What is Retention Time Alignment?
Removing ions that elute over long time and produce a row of signals (streaks)
Iterative strategy: calculate regression, reassign features, remove outliers, and repeat steps
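A minimal sketch of the iterative strategy, under simplifying assumptions: features are already paired between two runs, the retention time shift is modelled by a straight line, and outliers are pairs whose residual exceeds `n_sigma` standard deviations. All names and thresholds are illustrative. The "reassign features" step is omitted.

```python
from statistics import mean, pstdev

def fit_line(pairs):
    """Ordinary least-squares fit y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

def align_retention_times(pairs, n_sigma=2.0, max_iter=10, min_sd=1e-6):
    """Fit, drop outliers, refit; stop when nothing more is removed."""
    kept = list(pairs)
    for _ in range(max_iter):
        slope, intercept = fit_line(kept)
        residuals = [y - (slope * x + intercept) for x, y in kept]
        sd = pstdev(residuals)
        if sd < min_sd:  # fit is already essentially perfect
            break
        inliers = [p for p, r in zip(kept, residuals) if abs(r) <= n_sigma * sd]
        if len(inliers) == len(kept) or len(inliers) < 3:
            break
        kept = inliers
    slope, intercept = fit_line(kept)  # final fit on the retained pairs
    return slope, intercept, kept

# Hypothetical matched features: (RT in run A, RT in run B), in minutes.
# Run B is shifted by +0.5 min; (50.0, 65.0) is a wrongly matched pair.
matched = [(10.0, 10.5), (20.0, 20.5), (30.0, 30.5), (40.0, 40.5),
           (50.0, 65.0), (60.0, 60.5), (70.0, 70.5), (80.0, 80.5),
           (90.0, 90.5)]
slope, intercept, kept = align_retention_times(matched)
```

Real LC runs often show non-linear shifts, so alignment tools typically use smoother, non-linear regression models instead of a single line.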
What is Peptide/Protein Quantification by MS1?
- Signal intensities of LC-MS1 signals are calculated and can be used for relative quantification.
- Peptide identifications can be mapped to the MS1 signals by means of their accurate mass and retention time.
- Using the identified peptides, protein abundance is inferred.
What is isotopic distribution?
Definition: Atomic composition of a peptide (C_a H_b N_c O_d S_e)
The mass spectral peak representing the monoisotopic mass is not always the most abundant isotopic peak in a spectrum – although it stems from the most abundant isotope of each atom type.
This is due to the fact that as the number of atoms in a molecule increases, the probability of having at least one heavy isotope increases. For example, if there are 100 carbon atoms in a molecule, each of which has a 1% chance of being a heavy 13C isotope, then the whole molecule is more likely than not to contain at least one heavy isotope (1 - 0.99^100 ≈ 63%).
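A minimal sketch of the carbon-only example, treating each atom as an independent draw with the card's rounded 1% heavy-isotope probability (a binomial model). Other elements are ignored, which is a simplifying assumption.

```python
from math import comb

def prob_k_heavy(n_atoms: int, k: int, p_heavy: float = 0.01) -> float:
    """Binomial probability that exactly k of n_atoms are the heavy isotope."""
    return comb(n_atoms, k) * p_heavy ** k * (1.0 - p_heavy) ** (n_atoms - k)

# Molecule with 100 carbon atoms.
p_mono = prob_k_heavy(100, 0)   # no 13C at all: the monoisotopic peak
p_one = prob_k_heavy(100, 1)    # exactly one 13C: the M+1 peak
p_at_least_one = 1.0 - p_mono   # one or more 13C atoms
```

With 100 carbons, `p_one` is slightly larger than `p_mono`, so under this model the M+1 peak is already a little higher than the monoisotopic peak.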
What is Peptide Peak Intensity?
- Peak intensity:
- Peak height of highest isotope
- Summed intensity of all isotopes (or first three)
- Peak intensities in mass spectrometry reflect the number of ions present in the instrument.
- The number of ions in turn is proportional to the number of molecules in the sample.
- However, different molecules ionize with different efficiency!
What is normalization of LC-MS/MS data?
Why is it necessary?
LC-MS/MS data is prone to many sources of technical variance, or bias:
- variation in ionization efficiencies
- differences in LC column (temperature, fluctuation in flow rate…)
- sample preparation issues (pipetting errors…)
Normalization refers to the process of removing such biases by:
- global adjustment: forcing log intensity values around a central value, e.g. median -> corrects differences in loaded amounts, but not non-linear or intensity-dependent biases
- non-linear adjustment techniques: robust scatter plot smoothing (minus vs average plots, needs a reference sample -> add. bias?), lowess regression (local linear fits), EigenMS (singular value decomposition)
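A minimal sketch of the global (median) adjustment described in this card, assuming each run is a list of positive peptide intensities. Run names and values are made up. The target is taken as the median of the run medians, which is one of several possible choices.

```python
import math
from statistics import median

def median_normalize(runs):
    """Shift log2 intensities of each run so that all runs share one median."""
    log_runs = {name: [math.log2(v) for v in values]
                for name, values in runs.items()}
    run_medians = {name: median(values) for name, values in log_runs.items()}
    target = median(run_medians.values())
    return {name: [x - run_medians[name] + target for x in values]
            for name, values in log_runs.items()}

# Hypothetical runs: run_b was loaded with twice as much sample as run_a.
raw = {
    "run_a": [100.0, 200.0, 400.0],
    "run_b": [200.0, 400.0, 800.0],
}
normalized = median_normalize(raw)
```

After the shift both runs have the same median log2 intensity, so the twofold loading difference is removed. Intensity-dependent or non-linear biases would remain.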
What is isotope labelling?
Idea: tag peptides/proteins with chemically equivalent molecules of different mass using isotopes. Each sample is labeled with a different mass tag.
==> Quantification possibility on MS1 and/or MS2 level!
Can help with distinguishing reasons why a peptide signal is missing.
What chemical isotope labelling techniques are there?
- ICAT: Isotope Coded Affinity Tag
- Label proteins then digest (MS1)
- Only cysteine containing peptides can be measured
- Only 2 labels (heavy and light), which makes it more complicated to analyze multiple samples (e.g. time series)
- Slight retention time shift between heavy and light
- Confusion of double labels with oxidation
- Cleavable ICAT produces lighter tag, which results in better MS/MS spectra.
- Rarely used today.
- 18O: Enzymatic Labeling
- Digest and label peptides (MS1)
- 18O is a technique where proteolysis and stable isotope incorporation (labeling) occurs simultaneously during digestion.
- Samples are digested, usually with trypsin, in the presence of H2(18O), resulting in a 2 or 4 Da mass shift from the incorporation of one or two 18O atoms at the C-terminus of each peptide.
- The presence of the label on the carboxyl termini of peptides is advantageous because it facilitates the assignment of ‘y’ ions in MS/MS spectra.
- 18O labeling is not ‘pure’, i.e. the number of 18O atoms incorporated in a peptide is not fixed and a deconvolution software has to be used for accurate quantification.
- iTRAQ: Isobaric Tag for Relative and Absolute Quantitation
- Digest and label peptides (MS/MS)
- Every peptide can be labeled -> multiple peptides per protein increase statistical power
- Introduces variability after protein digestion
- Different chemical reaction kinetics for different peptides
- Expensive (commercialized by Applied Biosystems)
- Limited dynamic range
- MS instrument must be sensitive in low mass range (e.g. QTOF, but not ion trap)
- mTRAQ labeling for SRM method and relative quantification (no reporter ions)
- Signal interference from noise or unknown fragments distorts quantification.
- TMT: Tandem Mass Tag
- Same advantages and disadvantages as iTRAQ
- Commercialized by Thermo Fisher
- TMT with 16 channels (16-plex) has been introduced
What is the Concept of relative Quantification by heavy-labeled Standards?
Idea:
Produce heavy-labeled proteins by metabolic labeling, or standard peptides by chemical synthesis. Each proteome is then spiked with heavy-labeled “standard”.
What is Absolute Quantification of Proteins (AQUA)?
Spike with synthetic heavy labeled peptides:
Add synthetic, heavy-labeled AQUA peptides after digestion and use MS1 or, better, MS/MS signals for absolute protein quantification
What is Protein Standard Absolute Quantification (PSAQ)?
Spike with heavy labeled artificial Protein:
Add synthetic, heavy-labeled protein composed of peptides representing target proteins to the protein mixture, then digest and use MS1 or, better, MS/MS signals for protein standard absolute quantification
Can potentially compensate for problems during proteolysis!
What is Metabolic Isotope-Labeling?
Grow cells/tissue in media with heavy-labeled amino acids:
Mix non-labeled and labeled lysates at defined ratio, then digest and use MS1 signal for quantification
- Around 2000, 15N labeling was introduced to proteomics. Yeast was grown in 96% 15N medium and analyzed by LC-MS. 15N labeling works well for organisms (e.g. fungi, plants) that are able to synthesize all amino acids and that can be grown in 15N medium.
- In mammals, Lysine cannot be synthesized and needs to be taken up from nutrients. SILAC (Stable Isotope Labeling by Amino Acids in Cell Cultures, Ong et al. MCP, 2002) is used preferably with 12C and 13C labeled L-lysine. Arg, Leu or Met are also used frequently.
- Possible to label entire organisms (fly, rat) -> expensive
- Limited to situations where the cells are metabolically active
What is Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)?
Metabolic Isotope-Labeling
When both samples are combined, the ratio of peak intensities in the mass spectrum reflects the relative protein abundance. In this example, the labelled protein has the same abundance in both samples (ratio 1).
SILAC is preferably performed with 12C- and 13C-labeled L-lysine. Arg, Leu or Met are also used frequently.
What fragmentation Techniques are there?
CID (Collision Induced Dissociation):
- Fragmentation of positively charged peptides by collision with inert gas.
- Fragmentation is mainly induced by ‘mobile’ protons that render amide bonds less stable.
- Produces predominantly y and b ions, including neutral losses (-H2O, -NH3).
- CID spectra produced in an ion trap miss the low-mass region (1/3 of precursor m/z rule!), as these ions can’t be trapped.
ETD (Electron Transfer Dissociation):
- Negatively charged anions transfer electron to positively charged peptides.
- Induction of backbone fragmentation along the N-Cα bond.
- Produces predominantly c and z ions.
- Good for highly charged peptides/proteins, leaves labile modifications intact
HCD (High-energy Collisional Dissociation):
- HCD is a beam-type CID, similar to the iontrap CID, but performed in a quadrupole.
- HCD uses higher energy and shorter activation times (~0.1 ms) than linear ion trap CID (~30 ms).
- HCD generated spectra are not subject to the same low-mass limitation as ion trap CID
- -> lower mass ions of type a2, b2, y1, y2, immonium, and internal cleavage, or reporter ions from isobaric tags (see quantitative proteomics) are recorded.
What are the CID/HCD MS/MS Fragmentation Rules?
- Mobile proton model
- Fragment ion intensity depends on:
- Position within peptide
- Amino acids flanking fragmentation site
- Position of other charges
- Fragments are correlated (e.g. a and b ions) and often occur in series.
What main fragment ion types are there in MS?
Immonium ions:
An amino acid internal fragment with just a single side chain formed by a combination of a type and y type cleavage is called an immonium ion
Internal cleavage ions:
Double backbone cleavage gives rise to internal fragments. Usually, these are formed by a combination of b type and y type cleavage to produce the illustrated structure, an amino-acylium ion. Sometimes, internal cleavage ions can be formed by a combination of a type and y type cleavage, an amino-immonium ion.
Neutral loss ions:
A non-charged molecule is released from a charged ion upon fragmentation.
How are the ions of a peptide labelled/called?
Fragment ions that contain the N-terminus of the peptide are called a, b, c ions, and those that contain the C-terminus are called x, y, z ions.
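A minimal sketch of how the m/z values of the two most common series (b from the N-terminus, y from the C-terminus) can be calculated. It assumes singly charged fragments, an unmodified peptide and standard monoisotopic residue masses. The a, c, x and z series are omitted. The function name is illustrative.

```python
# Monoisotopic residue masses in Da (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565

def b_y_ions(peptide):
    """Return singly charged b and y ion m/z lists for a peptide sequence."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, running = [], 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), built from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), built from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions

b_series, y_series = b_y_ions("PEPTIDE")
```

For the peptide PEPTIDE this gives b2 (residues PE) at about m/z 227.10 and y1 (residue E) at about m/z 148.06.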
How was MS/MS Spectrum Interpretation originally done?
PepSeq was a first software for automated MS/MS spectrum interpretation:
- Create all possible amino acid combinations whose mass corresponds to measured peptide mass
- Build hypothetical spectra
- Find best match
BUT:
- There is a combinatorial explosion with larger peptides making this approach very calculation intensive
- Very high false positive rate because of so many options with identical amino acid composition to consider
- Becomes a heroic task when thousands of spectra have to be interpreted
What is the Aim of a Good Scoring Algorithm?
!
- High statistical power, i.e. the score should be able to distinguish good matches from bad or random matches
- It should be robust with regards to variation in the data, i.e. work well for various sample processing and MS methods
- It should be fast
What is PSM?
The best scoring match for a peptide fragment spectrum interpretation is called peptide spectrum match, PSM.
What are the problems with protein identification from peptides?
!
How can they be solved?
- Link between peptides and proteins is lost with bottom-up proteomics.
- A peptide can be identified several times in different forms.
- A peptide sequence can belong to multiple proteins.
- A protein can contain several identified peptides.
- Proteins can be very similar (homologous) and cannot be distinguished by the identified peptides.
- Many (~ 50% in large datasets) of the proteins with only one PSM are false positives. Their peptide is often shared with other proteins.
- Simplest solution: Discard all single hit proteins. But this might be too restrictive, especially for small proteins.
- The parsimony (Occam's razor) approach, which finds the minimal set of proteins that explains the maximum number of peptides, is a good way to reduce false positives
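A minimal sketch of the parsimony idea as a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. Greedy selection is an approximation chosen here for brevity and does not always give the true minimal set. Protein and peptide names are made up.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily select proteins until every identified peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic (alphabetical).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected

# Hypothetical mapping of candidate proteins to their identified peptides.
candidates = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep2", "pep3"},   # all peptides shared with A
    "C": {"pep4"},
    "D": {"pep3", "pep4"},
}
protein_set = parsimonious_proteins(candidates)
```

Here A and C together explain all four peptides, so B and D are not reported. Note that C and D explain the remaining peptide equally well. The tie is broken arbitrarily, which illustrates why indistinguishable proteins are often reported as groups.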
How can peptides be identified via database search?
- Pick an appropriate protein database (SwissProt/TrEMBL/predicted exon…, Taxonomy…)
- An algorithm (SEQUEST, Andromeda, X!Tandem…) will compare the experimental spectra with theoretical spectra from the database; it will take into account known digestion and fragmentation rules, possible peptide modifications etc. Each algorithm has its own way of establishing a score for a possible peptide match; the highest score decides the peptide identification, the so-called peptide spectrum match (PSM).
- Which database is best suited for my MS/MS search?
- Complete or small
- Human and model organisms: UniProtKB/SwissProt is a good starting point, since it is ‘complete’ for human and more and more for model organisms, too.
- Should I consider canonical sequences or also mutations and variants annotated in UniProtKB/SwissProt?
- With the availability of fast genome sequencing, sequence databases can be built from scratch for organisms not well represented in UniProtKB. However, these databases may contain a large portion of erroneous or missing sequences due to problems with the definition of ORFs (see Proteogenomics lecture).
- Fast genome sequencing makes it possible to build customized protein sequence databases for individual organisms (“Proteogenomics”).
What are problems associated with the bottom up proteomics workflow?
- Spectra are noisy, some fragments will be missed, fragmentations of unknown chemistry occur, unknown PTM, etc -> matches between theoretical and real spectra are only partial and may occur by chance!
- Comparison of results generated on different instruments, by alternative sample processing, interpreted with different search algorithms etc.
- How to go from statistically validated peptides to statistically validated proteins?
What are possible strategies for peptide identification?
- By de-novo
- Using spectral libraries
- By database search
Each strategy comes with a different type of search space, of different sizes:
De-novo biggest, spectral libraries smallest
What has to be considered when doing database searches?
For database searches, the website uniprot.org/uniprot typically lets you download all the known protein AA sequences by organism, and you may opt to include known isoforms, as well as automatically (non-manually) reviewed sequences.
–> Choice of these factors affects the starting search base
Not just the number of proteins in the DB is important: the number of actual possible sequences to check against a spectrum also depends upon the number of missed cleavages and the post-translational modifications (PTMs) we search for.
–> explosion of the number of candidate sequences
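A minimal sketch of how allowing missed cleavages enlarges the candidate list for a single protein. It assumes a simple trypsin rule (cleave after K or R, but not before P) and a made-up sequence. PTMs, which multiply the candidates further, are not modelled.

```python
def cleavage_fragments(sequence):
    """Split a protein after K or R unless the next residue is P."""
    fragments, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments

def tryptic_peptides(sequence, missed_cleavages=0):
    """All peptides spanning up to `missed_cleavages` skipped cleavage sites."""
    fragments = cleavage_fragments(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides

protein = "MAGKLLRPEEKTTR"  # hypothetical sequence
n_mc0 = len(tryptic_peptides(protein, 0))
n_mc1 = len(tryptic_peptides(protein, 1))
n_mc2 = len(tryptic_peptides(protein, 2))
```

For this short sequence the candidate count grows from 3 to 5 to 6 peptides. Over a whole database, and combined with variable modifications, the growth is much steeper.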
What are the advantages and disadvantages of a large and a small search space?
- Large search space
- Whole genome, TrEMBL, all variants, many potential modifications, all possible subsequences, relaxed protease cleavage specificity
- Slow
- Many false positives
- Complete
- Small search space
- SwissProt, few PTM’s, stringent protease cleavage specificity …
- Fast
- Few false positives
- Missing entries
- The ideal search space would contain only the peptides present in the sample. However, this set of peptides is generally unknown
What do we need to test for in a single spectrum peptide validation?
We need to test for :
- Random matches
- Statistical scores:
- e-value
- p-value
- DeltaScore
What is a random match in a single spectrum?
- A spectrum usually has non-assignable peaks, missing peaks and peaks of unpredictable intensity (reasons include unexpected fragmentation patterns or modifications)…
- Moreover: different peptides may have fragments in common …
- The peptide may not even be in the database…
- …so there are many incorrect hits
- Those incorrect hits are random matches
Keep in mind that a match with a high XCorr score could also be a random match.