Proteomics Flashcards

1
Q

What is the proteome?

A

Total protein profile/complement present in a biological sample at a given time

2
Q

What is proteomics?

(clinical and general)

A

In a clinical view: Scientific area with the emphasis to analyze the whole proteome.

In a general view: A subset of methodologies that are needed in the analysis of proteins/proteomes

3
Q

Why is it that Proteome ≠ Transcriptome?

!

A

Proteins are the dynamic and structural elements responsible for function, shape, etc. One gene can code for many gene products:

  • Different promoters (DNA → RNA, e.g. p53)
  • Alternative splicing (mRNA: introns / exons, e.g. p53):
    • Alternative splicing plays an important role in protein diversity without significantly increasing genome size
    • Common in cancer
  • Decoration of proteins with modifications (phosphorylation, acetylation, e.g. p53)
  • Different stability of mRNAs, influenced by cellular signals & feedback mechanisms

Example: In cancer, the protein-to-mRNA correlation is only ~10%

4
Q

What are the principles of proteome organization?

!

A

Protein-Protein Interactions –> Modularity

  • The majority of a cellular proteome is organized in stable and transient complexes that form functional dynamic networks.
  • Cell signaling: transient complex formation in time
  • Regulation by PTM switches (time frame sec. to hrs)
  • Protein complexes represent key functional units for the control of biochemical processes

–> Interactome: >2×10^5 PTMs

5
Q

What are the goals of proteomics research?

A
  • Establishment of parts list:
    • What is there?
    • Confirmation of predicted genes
    • Characterization of PTMs -> signaling networks
    • Secretome analysis -> no mRNA available!
  • Differential proteomics:
    • Definition of proteins that make the difference (control vs disease)
    • Biomarker discovery
    • Changes in PTM patterns
  • Quantitative protein assays:
    • Multiplexed, mass spectrometry based “ELISA”
  • Interactome:
    • Definition of proteins building a functional complex (proteinaceous machines, PPI)

–> This often leads to hypothesis generation

6
Q

What are the main 5 proteome characteristics?

What are the technical challenges associated with them?

!

A
  1. No amplification of proteins possible –> High sensitivity of analytical method
  2. Differential solubility –> Generality of analysis
  3. Proteome is dynamic –> Point of sample collection
  4. Protein modifications, protein processing, protein degradation –> Detect and identify differentially modified forms
  5. Dynamic range (a lot of proteins, some in really low concentrations) –> Suppression effects
7
Q

What is the main technical approach for proteomics research?

A

Slice the Iceberg

Since analyzing a full sample at once is too difficult, we need to slice the sample into multiple smaller fractions –> separation technologies –> LC-MS

8
Q

How should sample preparation be done to access a proteome?

!

A
  • Sample preparation should be kept as simple as possible. Each step is prone to introducing modifications, loss of proteins, and variance!
  • No single method can deal with all proteins
    • -> compromises often have to be made
  • Similar considerations apply to metabolomics
  • Sample collection and preservation influence sample quality
    • -> loss of proteins by freeze-thaw cycles, degradation of proteins and PTMs by enzymatic activities …
  • Highly abundant proteins obscure large parts of a proteome
    • -> pre-fractionation/depletion steps have to be considered
  • In the end, sample preparation must be compatible with mass spectrometry
    • -> avoid too much salt, no detergents/polymers, …
  • Intact protein (top-down) or peptide (bottom-up) based approaches
    • -> solubility of intact proteins
9
Q

What is affinity based enrichment?

!

A

Sample Fractionation

  • Per se carried out under native conditions –> protein-protein interactions remain
  • Bait fishing by use of antibody, or by a tag approach
  • General pitfalls:
    • Buffer conditions influence protein-protein interactions (ionic strength, detergents, pH, …)
    • Low affinity between bait and target (especially with antibodies)
    • Non-specific cross-reactions
    • Transient interactions with fast on/off rates (low yields)
    • Overexpression of genetically engineered Tag-protein & co-expression of native protein (interferences!)
10
Q

What is Substrate-based Affinity Isolation?

!

A

Sample fractionation: Functional/Chemical Proteomics

Add a substrate that will attach to the protein of interest. Since we know what substrate we used, we can detect the shift in the MS

11
Q

What sample fractionation methods are there?

(Separation methods for proteins & peptides)

A
  • Affinity based enrichment
  • Substrate-based Affinity Isolation
  • Centrifugation
  • Subcellular Fractionation
  • Electrophoresis
    • Isoelectric Focusing
    • Macromolecule Electrophoresis
  • Liquid Chromatography
12
Q

What is the Bottom-up LC-MS/MS based Workflow (shotgun proteomics)?

!

A

LC-MS/MS:

  • Eluted peptides are ionized and sent through MS -> m/z of precursor peptide ions
  • Precursor ions are selected for fragmentation -> MS/MS of fragments, spectrum file (mz-Int)
  • This information is sufficient to identify peptides
13
Q

Why do we need protein quantification?

A
  • mRNA levels correlate only weakly with protein concentrations.
  • Proteins often have to be processed to be functional.
  • Proteins are degraded or secreted.
  • PTMs regulate many processes and are only present on protein level.
  • Proteins often work in complexes, which is not encoded in the DNA or RNA sequence.
  • Information about the stoichiometry of proteins in protein complexes tells more about the function of a complex
14
Q

What different types of quantitative proteomics are there?

A

Quantitative proteomics

Quantitative modification proteomics

Comparative proteomics

Biomarker studies

15
Q

What is the aim of Quantitative proteomics?

A

Measuring the concentrations of a few dozen (complexes) to thousands (organelles, cells, tissue) of proteins in order to shed light on function and pathways

16
Q

What is the aim of Quantitative modification proteomics?

A

measuring the levels of PTMs on proteins, i.e. biological activity or functional regulation

17
Q

What is Comparative proteomics?

A

compares the expression levels of proteins between different biological conditions in order to find proteins whose concentration significantly differs between conditions.

These proteins can either be up or down regulated

18
Q

What is the aim of Biomarker studies?

A

try to find proteins in body fluids (blood, urine, tears, saliva, ..) whose expression changes as a function of a disease state. They may then be used for early detection of a disease, for disease monitoring or measuring the effectiveness of treatment

19
Q

What is the fold change?

A

Measuring Change

  • Fold change = value1/value2
  • Many biological systems react linearly to the fold change, therefore the fold change is often a natural measure
  • Often expressed as log2(fold change) = log2(value1) - log2(value2)
  • p-value:
    • Is the outcome of a statistical test evaluating the significance of a fold change
    • Takes into account the statistical variation of the values
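
As a minimal sketch of the two formulas above (the intensities are made-up values and the function name is only illustrative), the log2 fold change can be computed as a difference of logarithms:

```python
import math

def log2_fold_change(value1: float, value2: float) -> float:
    """Return log2(value1 / value2), computed as log2(value1) - log2(value2)."""
    return math.log2(value1) - math.log2(value2)

# Hypothetical intensities of one protein in two conditions.
treated, control = 8000.0, 2000.0
print(treated / control)                    # plain fold change
print(log2_fold_change(treated, control))   # log2 fold change
print(log2_fold_change(control, treated))   # same magnitude, opposite sign
```

On the log2 scale, up- and down-regulation of the same magnitude are symmetric around 0, which is one reason this representation is popular.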
20
Q

What is absolute and relative quantification?

A

Absolute quantification:
aims at obtaining the concentration (counts/ml or g/ml or counts per cell, cpc) of a protein. This value has a physical meaning and can be compared to different experiments and used for chemical modeling (systems biology).

Relative quantification:
the quantitative values do not need to have a physical meaning, but they are a monotonically increasing function of the actual concentrations. In the linear dynamic range these values are proportional to the actual concentrations, and a ratio of two values is equal to the ratio of the two concentrations

21
Q

What is the Dynamic Range of Quantitative measurements?

A

Important concept in quantification:
In the linear dynamic range, the response values are proportional to the actual concentrations and a ratio of two values is equal to the ratio of the two concentrations

22
Q

What is spectral counting?

A

Count number of MS/MS spectra that match to a given protein
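
A minimal sketch of the counting step, assuming the PSMs are available as (spectrum id, protein id) pairs; the identifiers below are hypothetical:

```python
from collections import Counter

def spectral_counts(psms):
    """Count the MS/MS spectra matched to each protein.

    psms: iterable of (spectrum_id, protein_id) pairs.
    """
    return Counter(protein for _spectrum, protein in psms)

# Hypothetical PSM list: four spectra, two proteins.
psms = [("scan1", "P1"), ("scan2", "P1"), ("scan3", "P2"), ("scan4", "P1")]
for protein, count in spectral_counts(psms).most_common():
    print(protein, count)
```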

23
Q

What is label-free LC-MS1 quantification?

A
  • Based on LC-MS or LC-MS/MS experiments.
  • Use peak heights/area of LC-MS1 peaks as indicator for peptide abundance.
  • No additional sample preparation (labeling) is needed, so it is relatively fast and cheap.
    But the sample preparation itself must be meticulous and reproducible!
  • Software has to detect peaks, calculate peak volumes and align LC-MS1 peaks over different runs (retention time shifts between runs).
  • Continuous alignment according to alignment tree
  • Match features over different LC-MS runs (same RT and m/z)
  • Transfer of MS2 identifications, where missing
24
Q

What is Retention Time Alignment?

A

Removing ions that elute over long time and produce a row of signals (streaks)

Iterative strategy: calculate regression, reassign features, remove outliers, and repeat steps
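
The iterative strategy could look roughly like the sketch below. It assumes a simple linear relationship between the retention times of matched features in two runs and a 2-sigma outlier cutoff; both are illustrative choices, and the "reassign features" step is left out for brevity.

```python
from statistics import mean, pstdev

def fit_line(pairs):
    """Ordinary least-squares fit y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    return slope, my - slope * mx

def align_retention_times(pairs, n_sigma=2.0, max_rounds=10):
    """Fit, drop outliers beyond n_sigma residual spreads, and refit until stable.

    pairs: (rt_in_run_a, rt_in_run_b) for features matched between two runs.
    """
    kept = list(pairs)
    slope, intercept = fit_line(kept)
    for _ in range(max_rounds):
        residuals = [y - (slope * x + intercept) for x, y in kept]
        tolerance = max(n_sigma * pstdev(residuals), 1e-9)
        inliers = [p for p, r in zip(kept, residuals) if abs(r) <= tolerance]
        if len(inliers) == len(kept) or len(inliers) < 3:
            break  # nothing left to remove, or too few points to refit
        kept = inliers
        slope, intercept = fit_line(kept)
    return slope, intercept

# Hypothetical matched features: run B is a stretched, shifted copy of run A,
# plus one mismatched feature acting as an outlier.
pairs = [(x, 1.1 * x + 5.0) for x in range(10, 101, 10)] + [(55.0, 200.0)]
print(align_retention_times(pairs))
```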

25
Q

What is Peptide/Protein Quantification by MS1?

A
  • Signal intensities of LC-MS1 signals are calculated and can be used for relative quantification.
  • Peptide identifications can be mapped to the MS1 signals by means of their accurate mass and retention time.
  • Using the identified peptides, protein abundance is inferred.
26
Q

What is isotopic distribution?

A

Definition: Atomic composition of a peptide (C_a H_b N_c O_d S_e)

The mass spectral peak representing the monoisotopic mass is not always the most abundant isotopic peak in a spectrum – although it stems from the most abundant isotope of each atom type.

This is due to the fact that, as the number of atoms in a molecule increases, the probability of having at least one heavy isotope increases. For example, if there are 100 carbon atoms in a molecule, each of which has a 1% chance of being a heavy 13C isotope, then the whole molecule is quite likely to contain at least one heavy isotope.
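
The 100-carbon example can be checked with a small binomial calculation. This sketch considers only carbon and uses the rounded 1% abundance from the text; the function name is illustrative.

```python
from math import comb

def isotope_peak_probabilities(n_carbons, p_heavy=0.01, max_heavy=3):
    """Binomial probability of exactly k 13C atoms among n carbons, for k = 0..max_heavy."""
    return [
        comb(n_carbons, k) * p_heavy**k * (1 - p_heavy) ** (n_carbons - k)
        for k in range(max_heavy + 1)
    ]

for n in (10, 100):
    probs = isotope_peak_probabilities(n)
    print(n, [round(p, 3) for p in probs], "P(at least one 13C) =", round(1 - probs[0], 3))
```

Under these assumptions the monoisotopic peak (k = 0) dominates for 10 carbons, while for 100 carbons the peak with one 13C is already slightly more likely than the monoisotopic one.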

27
Q

What is Peptide Peak Intensity?

A
  • Peak intensity:
    • Peak height of highest isotope
    • Summed intensity of all isotopes (or first three)
  • Peak intensities in mass spectrometry reflect the number of ions present in the instrument.
  • The number of ions in turn is proportional to the number of molecules in the sample.
  • However, different molecules ionize with different efficiency!
28
Q

What is normalization of LC-MS/MS data?

Why is it necessary?

A

Normalization refers to the process of removing technical biases (sources listed below) by:

  • global adjustment: forcing log intensity values around a central value, e.g. the median -> corrects for differences in loaded amounts, but not for non-linear or intensity-dependent biases
  • non-linear adjustment techniques: robust scatter plot smoothing (minus vs. average plots; needs a reference sample -> additional bias?), lowess regression (local linear fits), EigenMS (singular value decomposition)

LC-MS/MS data is prone to many sources of technical variance, or bias:

  • variation in ionization efficiencies
  • differences in LC column (temperature, fluctuation in flow rate…)
  • sample preparation issues (pipetting errors…)
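
A minimal sketch of the global (median) adjustment, assuming each run is given as a list of raw peptide intensities; the run names and values are hypothetical:

```python
import math
from statistics import median

def median_normalize(runs):
    """Shift the log2 intensities of each run so that its median becomes 0.

    runs: dict mapping run name -> list of raw (positive) intensities.
    """
    normalized = {}
    for name, intensities in runs.items():
        logs = [math.log2(x) for x in intensities]
        center = median(logs)
        normalized[name] = [value - center for value in logs]
    return normalized

# Hypothetical example: run_b was loaded with twice as much material as run_a.
runs = {"run_a": [100.0, 200.0, 400.0], "run_b": [200.0, 400.0, 800.0]}
for name, values in median_normalize(runs).items():
    print(name, [round(v, 3) for v in values])
```

After the shift both runs have the same log2 profile, which illustrates that a global adjustment corrects a constant loading difference but nothing intensity-dependent.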
29
Q

What is isotope labelling?

A

Idea: tag peptides/proteins with chemically equivalent molecules of different mass using isotopes. Each sample is labeled with a different mass tag.

==> Quantification possibility on MS1 and/or MS2 level!
Can help with distinguishing reasons why a peptide signal is missing.

30
Q

What chemical isotope labelling techniques are there?

A
  • ICAT: Isotope Coded Affinity Tag
    • Label proteins then digest (MS1)
    • Only cysteine containing peptides can be measured
    • Only 2 labels (heavy and light), which makes it more complicated to analyze multiple samples (e.g. time series)
    • Slight retention time shift between heavy and light
    • Confusion of double labels with oxidation
    • Cleavable ICAT produces lighter tag, which results in better MS/MS spectra.
    • Rarely used today.
  • 18O: Enzymatic Labeling
    • Digest and label peptides (MS1)
    • 18O is a technique where proteolysis and stable isotope incorporation (labeling) occurs simultaneously during digestion.
    • Samples are digested, usually with trypsin, in the presence of H₂¹⁸O, resulting in a 2 or 4 Da mass shift from the incorporation of one or two 18O atoms at the C-terminus of each peptide.
    • The presence of the label on the carboxyl termini of peptides is advantageous because it facilitates the assignment of ‘y’ ions in MS/MS spectra.
    • 18O labeling is not ‘pure’, i.e. the number of 18O atoms incorporated in a peptide is not fixed and a deconvolution software has to be used for accurate quantification.
  • iTRAQ: Isobaric Tag for Relative and Absolute Quantitation
    • Digest and label peptides (MS/MS)
    • Every peptide can be labeled -> multiple peptides per protein increase statistical power
    • Introduces variability after protein digestion
    • Different chemical reaction kinetics for different peptides
    • Expensive (commercialized by Applied Biosystems)
    • Limited dynamic range
    • MS instrument must be sensitive in low mass range (e.g. QTOF, but not ion trap)
    • mTRAQ labeling for SRM method and relative quantification (no reporter ions)
    • Signal interference from noise or unknown fragments distorts quantification.
  • TMT: Tandem Mass Tag
    • Same advantages and disadvantages as iTRAQ
    • Commercialized by Thermo Fisher
    • TMT with 16 channels (16-plex) has been introduced
31
Q

What is the Concept of relative Quantification by heavy-labeled Standards?

A

Idea:
Produce heavy-labeled proteins by metabolic labeling, or standard peptides by chemical synthesis. Each proteome is then spiked with heavy-labeled “standard”.

32
Q

What is Absolute Quantification of Proteins (AQUA)?

A

Spike with synthetic heavy labeled peptides:
Add synthetic, heavy-labeled AQUA peptides after digestion and use MS1 or, better, MS/MS signals for absolute protein quantification
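
The underlying arithmetic can be sketched as follows, assuming that the light (endogenous) and heavy (spiked) peptide respond identically and that both signals lie in the linear dynamic range; names and numbers are hypothetical:

```python
def absolute_amount(light_intensity, heavy_intensity, spiked_amount_fmol):
    """Estimate the endogenous peptide amount from the light/heavy signal ratio."""
    return light_intensity / heavy_intensity * spiked_amount_fmol

# Hypothetical signals: the endogenous peptide gives half the heavy signal,
# and 50 fmol of heavy standard were spiked in.
print(absolute_amount(light_intensity=2.0e6, heavy_intensity=4.0e6, spiked_amount_fmol=50.0))
```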

33
Q

What is Protein Standard Absolute Quantification (PSAQ)?

A

Spike with heavy labeled artificial Protein:
Add a synthetic, heavy-labeled protein composed of peptides representing the target proteins to the protein mixture, then digest and use MS1 or, better, MS/MS signals for protein standard absolute quantification

Can potentially compensate for problems during proteolysis!

34
Q

What is Metabolic Isotope-Labeling?

A

Grow cells/tissue in media with heavy-labeled amino acids:
Mix non-labeled and labeled lysates at defined ratio, then digest and use MS1 signal for quantification

  • Around 2000, 15N labeling was introduced to proteomics. Yeast was grown in 96% 15N medium and analyzed by LC-MS. 15N labeling works well for organisms (e.g. fungi, plants) that are able to synthesize all amino acids and that can be grown in 15N medium.
  • In mammals, Lysine cannot be synthesized and needs to be taken up from nutrients. SILAC (Stable Isotope Labeling by Amino Acids in Cell Cultures, Ong et al. MCP, 2002) is used preferably with 12C and 13C labeled L-lysine. Arg, Leu or Met are also used frequently.
  • Possible to label entire organisms (fly, rat) -> expensive
  • Limited to situations where the cells are metabolically active
35
Q

What is Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)?

A

Metabolic Isotope-Labeling

When both samples are combined, the ratio of peak intensities in the mass spectrum reflects the relative protein abundance. In this example, the labelled protein has the same abundance in both samples (ratio 1).

preferably with 12C and 13C labeled L-lysine. Arg, Leu or Met are also used frequently.

36
Q

What fragmentation Techniques are there?

A

CID (Collision Induced Dissociation):

  • Fragmentation of positively charged peptides by collision with inert gas.
  • Fragmentation is mainly induced by ‘mobile’ protons that render amide bonds less stable.
  • Produces predominantly y and b ions, including neutral losses (-H2O, -NH3).
  • CID spectra produced in an ion trap miss the low-mass region (1/3 of precursor m/z rule!), as these ions can’t be trapped.

ETD (Electron Transfer Dissociation):

  • Negatively charged anions transfer electron to positively charged peptides.
  • Induction of backbone fragmentation along the N-Cα bond.
  • Produces predominantly c and z ions.
  • Good for highly charged peptides/proteins, leaves labile modifications intact

HCD (High-energy Collisional Dissociation):

  • HCD is a beam-type CID, similar to the ion trap CID, but performed in a quadrupole.
  • HCD uses higher energy and shorter activation times (~0.1 ms) than linear ion trap CID (~30 ms).
  • HCD generated spectra are not subject to the same low-mass limitation as ion trap CID
  • -> lower mass ions of type a2, b2, y1, y2, immonium, and internal cleavage, or reporter ions from isobaric tags (see quantitative proteomics) are recorded.
37
Q

What are the CID/HCD MS/MS Fragmentation Rules?

A
  • Mobile proton model
  • Fragment ion intensity depends on:
    • Position within peptide
    • Amino acids flanking fragmentation site
    • Position of other charges
  • Fragments are correlated (e.g. a and b ions) and often occur in series.
38
Q

What main fragment ion types are there in MS?

A

Immonium ions:
An amino acid internal fragment with just a single side chain formed by a combination of a type and y type cleavage is called an immonium ion

Internal cleavage ions:
Double backbone cleavage gives rise to internal fragments. Usually, these are formed by a combination of b type and y type cleavage to produce the illustrated structure, an amino-acylium ion. Sometimes, internal cleavage ions can be formed by a combination of a type and y type cleavage, an amino-immonium ion.

Neutral loss ions:
A non-charged molecule is released from a charged ion upon fragmentation.

39
Q

How are the ions of a peptide labelled/called?

A

Fragment ions that contain the N-terminus of the peptide are called a, b, c ions, and those that contain the C-terminus are called x, y, z ions.
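
As an illustration of the naming, the sketch below computes singly charged b and y ion m/z values for the peptide VWNLANCK used elsewhere in these cards. It assumes unmodified residues (e.g. no carbamidomethylation of cysteine) and standard monoisotopic masses; the function and constant names are illustrative.

```python
# Monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565

def b_and_y_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] contains the first i+1 residues (N-terminal fragment);
    y_ions[i] contains the last i+1 residues plus water (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for mass in reversed(masses[1:]):
        suffix += mass
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions

b_ions, y_ions = b_and_y_ions("VWNLANCK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:.4f}   y{i} = {y:.4f}")
```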

40
Q

How was MS/MS Spectrum Interpretation originally done?

A

PepSeq was one of the first software tools for automated MS/MS spectrum interpretation:

  1. Create all possible amino acid combinations whose mass corresponds to measured peptide mass
  2. Build hypothetical spectra
  3. Find best match

BUT:

  • There is a combinatorial explosion with larger peptides making this approach very calculation intensive
  • Very high false positive rate because of so many options with identical amino acid composition to consider
  • Becomes a heroic task when thousands of spectra have to be interpreted
41
Q

What is the Aim of a Good Scoring Algorithm?

!

A
  • High statistical power, i.e. the score should be able to distinguish good matches from bad or random matches
  • It should be robust with regards to variation in the data, i.e. work well for various sample processing and MS methods
  • It should be fast
42
Q

What is PSM?

A

The best scoring match for a peptide fragment spectrum interpretation is called peptide spectrum match, PSM.

43
Q

What are the problems with protein identification from peptides?

!

How can they be solved?

A
  • Link between peptides and proteins is lost with bottom-up proteomics.
  • A peptide can be identified several times in different forms.
  • A peptide sequence can belong to multiple proteins.
  • A protein can contain several identified peptides.
  • Proteins can be very similar (homologue) and cannot be distinguished by the identified peptides.
  • Many (~ 50% in large datasets) of the proteins with only one PSM are false positives. Their peptide is often shared with other proteins.
  • Simplest solution: Discard all single hit proteins. But this might be too restrictive, especially for small proteins.
  • The parsimony or Occam’s razor approach, which finds the minimal set of proteins that explains the maximum number of peptides, is a good way to reduce false positives
44
Q

How can peptides be identified via database search?

A
  • Pick an appropriate protein database (SwissProt/TrEMBL/predicted exon…, Taxonomy…)
  • An algorithm (SEQUEST, Andromeda, X!Tandem…) will compare the experimental spectra with theoretical spectra from the database; it will take into account known digestion and fragmentation rules, possible peptide modifications etc. Each algorithm has its own way of establishing a score for a possible peptide match; the highest score decides the peptide identification, the so-called peptide spectrum match (PSM).
  • Which database is best suited for my MS/MS search?
    • Complete or small
  • Human and model organisms: UniProtKB/SwissProt is a good starting point, since it is ‘complete’ for human and more and more for model organisms, too.
    • Should I consider canonical sequences or also mutations and variants annotated in UniProtKB/SwissProt?
  • With the availability of fast genome sequencing, sequence databases can be built from scratch for organisms not well represented in UniProtKB. However, these databases may contain a large proportion of erroneous or missing sequences due to problems with the definition of ORFs (see Proteogenomics lecture).
  • Fast genome sequencing makes it possible to build customized protein sequence databases for individual organisms (“Proteogenomics”).
45
Q

What are problems associated with the bottom up proteomics workflow?

A
  • Spectra are noisy, some fragments will be missed, fragmentations of unknown chemistry occur, unknown PTM, etc -> matches between theoretical and real spectra are only partial and may occur by chance!
  • Comparison of results generated on different instruments, by alternative sample processing, interpreted with different search algorithms etc.
  • How to go from statistically validated peptides to statistically validated proteins?
46
Q

What are possible strategies for peptide identification?

A
  • By de-novo
  • Using spectral libraries
  • By database search

Each strategy comes with a different type of search space, of different sizes:
De-novo biggest, spectral libraries smallest

47
Q

What has to be considered when doing database searches?

A

For database searches, the website uniprot.org/uniprot typically lets you download all the known protein AA sequences by organism, and you may opt to include known isoforms, as well as automatically (non-manually) reviewed sequences.
–> Choice of these factors affects the starting search base

Not just the number of proteins in the DB is important: the number of actual possible sequences to check against a spectrum also depends on the number of missed cleavages and the post-translational modifications (PTMs) we search for.
–> explosion of the number of candidate sequences
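
How the number of candidates grows with missed cleavages can be sketched with a simplified in-silico trypsin digestion (cleave after K or R, but not before P). The toy sequence and function name are hypothetical, and PTMs, which multiply the candidates further, are not modeled.

```python
def tryptic_peptides(protein, missed_cleavages=0):
    """Return the set of tryptic peptides with up to `missed_cleavages` missed sites."""
    # Cleavage positions: after K or R unless the next residue is P.
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = set()
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(sites):
                peptides.add(protein[sites[start]:sites[end]])
    return peptides

toy_protein = "AKRPGKCDRE"  # hypothetical sequence
for allowed in range(3):
    print(allowed, "missed cleavages:", len(tryptic_peptides(toy_protein, allowed)), "peptides")
```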

48
Q

What are the advantages and disadvantages of a large and a small search space?

A
  • Large search space
    • Whole genome, TrEMBL, all variants, many potential modifications, all possible subsequences, relaxed protease cleavage specificity
    • Slow
    • Many false positives
    • Complete
  • Small search space
    • SwissProt, few PTMs, stringent protease cleavage specificity …
    • Fast
    • Few false positives
    • Missing entries
  • The ideal search space would contain only the peptides present in the sample. However, this set of peptides is generally unknown
49
Q

What do we need to test for in a single spectrum peptide validation?

A

We need to test for:

  • Random matches
  • Statistical scores:
    • e-value
    • p-value
  • DeltaScore
50
Q

What is a random match in a single spectrum?

A
  • A spectrum usually has non-assignable peaks, missing peaks and peaks of unpredictable intensity (reasons include unexpected fragmentation patterns or modifications)…
  • Moreover: different peptides may have fragments in common …
  • The peptide may not even be in the database…
  • …so there are many incorrect hits
  • Those incorrect hits are random matches

Keep in mind that a high xcorr score can also come from a random match.

51
Q

What is the e-value for a single spectrum?

A

We assume that all matches, except the best scoring candidate, are purely random, and are false identifications of the real peptide.

The extrapolation of the distribution of random matches at the best score gives the expected number of random matches for this score.

We can interpret this number as the E-value, i.e. the number of peptides that are expected to score as well as the best score, by chance.

  • The E-value is a statistical measure that can standardize the reporting of confidence of peptide identifications, as it depends only on the distribution of scores (of one spectrum) and not on the scores themselves.
  • E-values are relative to the number of guesses, which means that they are to some extent taking into account the number of tests performed: N candidate peptides provide N tests.
  • E-values are not probabilities (E-values are not restricted to [0,1])
52
Q

What is the p-value for a single spectrum?

A

One could, instead of the E-value, look for a p-value, which does return a probability. Taking the image above, we estimate the p-value under the null hypothesis that VWNLANCK is a random match.

p-value= probability, if VWNLANCK is a random match, that its score is equal or higher than the reported one (~2 here):
p-value = E-value /N
where N is the number of candidate peptides.

The p-value is neither the probability that VWNLANCK is random, nor the probability that it is an invalid match. A predefined statistical threshold α merely allows us to reject the null hypothesis if p-value ≤ α

ATTENTION the p-value will have to be corrected when performing multiple tests!
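
A minimal sketch of the conversion p-value = E-value / N, followed by one possible multiple-testing correction. The Bonferroni correction is used here purely as an illustration (the cards do not say which correction is meant), and all numbers are hypothetical.

```python
def p_value_from_e_value(e_value, n_candidates):
    """p-value = E-value / N, where N is the number of candidate peptides."""
    return e_value / n_candidates

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value (illustrative choice of correction), capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical numbers: E-value of 0.5 with 1000 candidate peptides,
# in an experiment with 200 searched spectra.
p = p_value_from_e_value(e_value=0.5, n_candidates=1000)
print(p, bonferroni(p, n_tests=200))
```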

53
Q

What is the deltaScore for a single spectrum?

What is deltaCn?

A

The deltaScore is defined as the difference between the top score (xcorr) and the score of the next best match. It is indicative of how well separated the best candidate is from random matches.

deltaCn is the normalised version of the deltaScore
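
A small sketch of both quantities for one spectrum. Dividing by the top score is an assumption about the normalisation (it follows the common SEQUEST-style convention), and the candidate scores are hypothetical.

```python
def delta_scores(scores):
    """Return (deltaScore, deltaCn) for the candidate scores of one spectrum.

    deltaCn is computed here as deltaScore / top score (assumed normalisation).
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate scores")
    top, second = ranked[0], ranked[1]
    delta = top - second
    return delta, delta / top

# Hypothetical xcorr values for the candidates of one spectrum.
print(delta_scores([2.0, 3.5, 1.0]))
```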

54
Q

How can a multiple spectra peptide validation be done?

A

There are two ways for dealing with multiple spectra:

  • Decoy database search:
    • placing decoys in the database
  • Semi-supervised calculation of false discovery rate (FDR)
    • model the distribution of wrong matches and that of right matches

If we were to collect the top score from all spectra in a sample, we would get a bimodal distribution:

  • Many low scores for the random matches and higher scores for the true matches.
  • The random and true score distribution will overlap, if the dataset is large enough.
  • The question is: what is the number of random and true matches at a given score x?
  • We could know if we deconvoluted the distribution.
55
Q

What is a decoy database search?

A

Statistical Validation of Score across multiple spectra

Approach 4: p-value with the decoy distribution as the null distribution and multiple testing correction

p-value = for a given score x, the percentage of decoy PSMs that have a score equal or higher than x.

  • Random score distribution in a MS/MS search can be estimated by searching a database of realistic peptides, which are not present in the sample.
  • For example peptides of a sufficiently different species could be used as decoy database. Database entries that are too similar (homologues) may have to be removed.
  • A decoy database can also be calculated in silico by shuffling or reversing the original target database. Again, entries too similar to the original entries have to be removed/changed. Peptides < 8 AAs have a high chance to be reproduced during the shuffling procedure.
  • Note! The decoy database should be of the same size (see search space issue).
  • Incorrect decoy hits should be similar to random matches derived from target sequences in terms of search engine-assigned scores, i.e. the decoys should be representative of incorrect hits.
  • Decoys make it possible to distinguish false positives (the search has identified a decoy peptide) from true positives (the search has identified a peptide from the target database).
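
A minimal sketch of two of the ideas above: building a reversed decoy database of the same size as the target database, and the decoy-based p-value for a score x. The "DECOY_" prefix, the sequences and the scores are hypothetical, and the removal of decoys that are too similar to targets is not shown.

```python
def reversed_decoys(targets):
    """Create one reversed decoy per target, so both databases have the same size.

    targets: dict mapping protein id -> amino acid sequence.
    """
    return {f"DECOY_{pid}": seq[::-1] for pid, seq in targets.items()}

def decoy_p_value(score, decoy_scores):
    """Fraction of decoy PSMs with a score equal to or higher than `score`."""
    return sum(s >= score for s in decoy_scores) / len(decoy_scores)

targets = {"P1": "MKTAYIAK", "P2": "GSHMLEDR"}  # hypothetical sequences
print(reversed_decoys(targets))
print(decoy_p_value(3.0, [0.5, 1.2, 1.9, 3.4]))
```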
56
Q

What is FDR?

A

False Discovery Rate

Statistical Validation of Score across multiple spectra

The expected proportion of incorrect matches among the accepted assignments that have a score higher than the threshold score x.

FDR measures the error rate associated to the whole data set. Setting FDR determines the percentage of incorrect PSMs that one is willing to accept

  • FDR can be calculated on PSM level -> FDR = (#wrong PSMs)/(#total PSMs)
  • FDR can be calculated on peptide level, i.e. each peptide sequence is only counted once independent of number of PSMs.
  • FDR can be calculated on protein level
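
At the PSM level, the number of wrong PSMs is not known directly. A common target-decoy convention, assumed here rather than taken from the cards, estimates it by the number of decoy hits above the threshold and divides by the number of accepted target hits:

```python
def estimate_fdr(psms, threshold):
    """Estimate the PSM-level FDR as (#decoy PSMs) / (#target PSMs) with score >= threshold.

    psms: list of (score, is_decoy) tuples from a search against targets plus
    an equally sized decoy database.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Hypothetical PSM scores.
psms = [(5.0, False), (4.5, False), (4.0, True), (3.5, False), (3.0, False), (2.0, True)]
print(estimate_fdr(psms, threshold=3.0))
print(estimate_fdr(psms, threshold=4.2))
```

Raising the threshold lowers the estimated FDR at the cost of accepting fewer PSMs.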
57
Q

What is worth noting about statistical scores?

A
  • Statistical scores do not depend on the details of the peptide scoring function.
  • The underlying scoring function can even be multidimensional, i.e. include several scores.
  • Statistical scores have a unified probabilistic interpretation, i.e. they correspond to frequencies and counts.
  • The statistical scores allow the comparison of different search engines with each other.
58
Q

What does Occam’s principle mean in the context of shotgun protein identification?

A

Occam’s principle: The simplest solution is often the correct solution

In the context of protein inference from peptides, where the same peptide can be shared by different proteins, this means that one should aim at finding the smallest list of proteins that explains all (or most of) the peptides.

59
Q

What is the Apportionment of Shared Peptides?

A

Shared = degenerate peptides

A model that aims at deriving the smallest list of proteins sufficient to explain all the observed peptides (Occam’s razor, see Wiki for definition).
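
One way to approximate the smallest explaining protein list is a greedy set cover. This is a sketch under that assumption, not the method of any particular tool, and the greedy choice is a heuristic that does not always give the true minimum. Protein and peptide names are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides: dict[str, set[str]]) -> list[str]:
    """Greedily pick proteins until every observed peptide is explained."""
    unexplained = set().union(*protein_to_peptides.values())
    selected = []
    while unexplained:
        # Pick the protein explaining most still-unexplained peptides;
        # ties go to the alphabetically first protein.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
        )
        selected.append(best)
        unexplained -= protein_to_peptides[best]
    return selected


observed = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},  # all peptides shared with A
    "C": {"p4"},
    "D": {"p3", "p4"},  # shared peptides only
}

print(parsimonious_proteins(observed))  # ['A', 'C']
```

Note that C and D explain p4 equally well here, so the choice between them is arbitrary. This is the ambiguity that shared (degenerate) peptides introduce.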

60
Q

What are problems with protein assignment in large shotgun datasets?

!

A

Correct peptide assignments tend to correspond to “multi-hit” proteins, those to which other correctly assigned peptides exist.

Incorrect peptide assignments (false positives) tend to correspond to “single-hit” proteins to which no other correctly assigned peptide corresponds.

–> False positive identification error rate on the protein level is higher than on the peptide level.

–> Hard to distinguish single-hit correct proteins from the incorrect ones!

61
Q

What are PTMs?

A

posttranslational modifications (PTM)

  • After transcription various functionally relevant modifications are introduced first on the mRNA level (splicing, mRNA editing) and then on the protein level (posttranslational modifications (PTM), co-translational modifications)
  • Modifications change the function of proteins by changing their interactions with other molecules, their activity, stability or localization.
  • There are ~ 200 known in vivo modifications in mammals (& ~ 200 chemically induced), and a total of 691 listed on uniprot.org.
  • Modifications are ubiquitous across all domains of life.
  • Almost all proteins can be modified (~70% phosphorylated, similar numbers for ubiquitination & acetylation)
62
Q

What are proteoforms?

!

A

The huge number of potential combinations of the various modifications for each protein
–> introduces a new level of complexity to the proteome

63
Q

How can PTMs be detected?

A

MS/MS and antibodies can be used to detect modifications. MS/MS is the more versatile approach that has the capacity to find and monitor many modifications in an entire proteome.

64
Q

What are problems when mapping the landscape of PTMs?

A

Mapping the entire landscape of modifications remains difficult, since

  • many modifications show up only in very specific conditions
  • they may be of low stoichiometry or concentration
  • they can be ambiguous because they have the same mass shift as other modifications.

Enzyme specificity is not always very high and non-functional modifications occur. Modifications may also be introduced non-enzymatically (e.g. oxidation, acetylation, methylation).

65
Q

What does the PTM phosphorylation do?

A
  • Commonly affected amino acids: Ser, Thr, Tyr
  • Change in peptide mass (Da): +79.966331
  • Enzyme mediated & ATP dependent (Kinase)
  • Examples for biological function:
    • Signal transduction
    • Regulation of enzyme activity
    • Protein-protein and protein-ligand interactions
  • Most prominent PTM involved in most cellular processes. Uses highly abundant ATP as source.
  • Many kinase genes exist (~500+ in human), which all have different roles and motifs. Many phosphatase genes (~200+ in human).
  • The number of phosphorylation sites increases with organism complexity.
  • Tyrosine phosphorylation:
    • Often at beginning of a signaling cascade
    • Tightly negatively regulated by reactive tyrosine phosphatases
    • Phosphotyrosine binds more tightly to recognition domains (SH2 domain)
    • Rarely plays a structural role
    • Low percentage of all phosphorylations occur on tyrosine
  • Serine/Threonine phosphorylation
    • Transmit, amplify and clean signals.
    • Less tightly regulated, less accurate binding
    • Can be constitutively on -> structural role
66
Q

What does the PTM acetylation do?

A
  • Commonly affected amino acids: protein N-terminus, ε-amino group on Lys
  • Change in peptide mass (Da): +42.010565
  • Examples for biological functions:
    • Metabolism (extensive acetylation of mitochondrial proteins)
    • About 85% of all human proteins and 68% in yeast are acetylated at their Nα-terminus (important role in the synthesis, stability and localization of proteins)
    • Acetylation of histones regulates gene transcription (chromatin de-condensation)
67
Q

What does the PTM glycosylation do?

A
  • Commonly affected amino acids: Asn (N-linked) , Ser or Thr (O-linked)
  • Change in peptide mass (Da): > 800 (complex carbohydrates or oligosaccharides)
  • Enzyme-mediated & ATP-dependent
  • Examples for biological function:
    • A variety of structural and functional roles in membrane and secreted proteins
    • Protein folding and stability
  • Tricky to study using MS
    • Need to sequence peptide and glycan
    • Require sophisticated:
      • Enrichment protocols
      • Fragmentation methods
      • Software (Byonic, Sweetheart)
      • Manual validation
  • Glycoproteomics: “The last frontier of proteomics”
68
Q

What does the PTM ubiquitinylation do?

A
  • Commonly Affected Amino Acids: ε-amino group on Lys
  • Change in peptide mass (Da): > 1000
  • 76-amino acid protein (Mono/poly-ubiquitination)
  • Observed as +114.042929 (GG) or +383.228103 (LRGG) Da mass tag after trypsin digestion
  • Enzyme mediated (ubiquitin-activating E1, conjugating E2, and ligating E3; removal by DUBs)
  • Examples for biological function:
69
Q

What are common chemical artefacts (PTMs)?

A
  • Oxidation
    • Commonly Affected Amino Acids: Met, Trp
    • Change in peptide mass (Da): +15.99491
    • Example for biological function:
      • Regulation of protein activity and/or stability
  • Deamidation
    • Commonly Affected Amino Acids: Asn, Gln
    • Change in peptide mass (Da): +0.98402
    • Example of Biological function:
      • Associated with ageing
      • Regulation of protein activity?
  • Pyro-Glu (Da): -17.02655 (Q), -18.01056 (E)
  • Carbamylation (Da): +43.00581
  • Formylation (Da): +27.99491
  • etc.
70
Q

What PTMs are reversible and which irreversible?

A
  • Many important modifications can be swiftly (seconds) added and removed from their substrates by specific enzymes. This allows the cell to react to a rapidly changing environment.
    • Phosphorylation: Kinase (+) – Phosphatase (-)
    • Histone lysine acetylation: histone acetyl-transferase (HAT) (+) – histone de-acetylase (HDAC) (-)
    • Histone methylation: Histone methyl-transferases (HMT) (+) – histone de-methylase (-)
    • Ubiquitination: ubiquitin-activating enzymes (E1s), ubiquitin-conjugating enzymes (E2s) and ubiquitin ligases (E3s) (+) – De-ubiquitinating enzymes (DUBs) (-)…
  • Other PTMs are ‘irreversible’
    • Proteolytic cleavage
    • Myristoylation
    • N-terminal acetylation…
71
Q

What are dynamic roles of PTMs?

A
  • Sensor for availability of metabolic resources.
  • Signal integration/coordination: only if several PTM signals are attached to a protein, the protein performs its function.
  • Signal Oscillator: system with multiple modifications can behave like an oscillator
  • Signal transmitter (wave): cascades of kinase/phosphatase reactions can travel from the plasma membrane to the nucleus
  • Signal amplifier: Negative feedback amplifier for MAPK pathway
72
Q

What are challenges in LC-MS2 based PTM analysis?

A
  • Low abundance:
    • Biologically relevant PTMs may affect <5% of a protein species (sub-stoichiometric).
    • Enrichment required (large amounts of starting material needed)
      • Efficient Enrichment methods not available for all PTMs
  • Complex fragmentation (specialized fragmentation methods needed)
    • labile modifications (neutral loss on precursor ions).
    • Glycosylation (glycan fragmentation obscures peptide fragments, hydrophobicity -> too low to bind on RP column)
  • Ion suppression
    • Phosphorylation
  • Interference with protease cleavage, e.g. acetylation or methylation of K/R inhibits trypsin cleavage
  • Proteoforms
    • Digestion destroys information whether modifications on different peptides coexist or not.
  • Informatics
    • Search space explosion due to combinatorics
    • Pin-pointing site of modification with labile PTM, e.g. phosphorylation
    • Mass shift of PTM can be ambiguous
73
Q

What is variable modification search?

What is the problem with it?

!

A

MS2 interpretation tools generate theoretical spectra of all PTM variants of the database sequences using a list of suspected PTMs provided by the user.

BUT there is a problem: the number of candidates rapidly becomes enormous as the number and types of modifications increase

A common database search strategy identifies <50% of all available fragment spectra.

74
Q

What is Open Modification Search (OMS)?

A

Definitions & Principles

  • No prior selection of modifications
  • Finds unknown or unexpected PTMs
  • Finds unknown or unexpected SNPs
  • Find combinations of modifications
  • Deals with search space explosion by prior reduction of candidate peptides or proteins
  • Many false positives and ambiguous results

OMS searches work in 2 steps:

  1. Rapid restricted search: specify no or very limited number of modifications
  2. Exhaustive search for modifications exploring a protein, peptide, or spectrum database compiled from the results returned in step 1.
75
Q

How can the position of a PTM within a peptide be determined?

A
  • A common problem with OMS is ‘ambiguous’ PSMs.
  • Position of modifications on residues consistent with mass shift (e.g. +80 (phospho) on S,T, Y; +16 on M,W; point mutations replacing one amino acid with another one; …)
  • Prefer more likely scenario
  • Use specialized methods to position different modifications
  • What QuickMod and most other search tools don’t tell you is how confident they are about a modification site assignment.
  • The peptide level false discovery rate refers to identification of peptide + delta mass of PTM.
  • It is likely that the FDR at a set score threshold differs between the entire set of PSMs and the modified PSMs
  • What is the PTM site probability or PTM false localization rate (FLR)?
  • Note:
    • PTM site scoring algorithms work for one modification/PSM!
    • They disfavor peptides with adjacent modification sites

MDscore: Mascot delta score

A-Score: Ambiguity score
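
The first step of site assignment, listing the residues consistent with an observed mass shift, can be sketched as follows. The lookup table holds only the two shifts mentioned on this card, and the function name and tolerance are hypothetical. Scoring the candidates (MDscore, A-Score) is not shown.

```python
# Mass shift (Da) -> residues that can carry it (two examples from the card).
SHIFT_TO_RESIDUES = {79.966331: "STY", 15.99491: "MW"}


def candidate_sites(peptide: str, delta_mass: float, tol: float = 0.01):
    """Return (position, residue) pairs consistent with the mass shift."""
    for shift, residues in SHIFT_TO_RESIDUES.items():
        if abs(delta_mass - shift) <= tol:
            return [(i + 1, aa) for i, aa in enumerate(peptide) if aa in residues]
    return []


print(candidate_sites("AGSPTMYK", 79.9663))  # [(3, 'S'), (5, 'T'), (7, 'Y')]
print(candidate_sites("AGSPTMYK", 15.995))   # [(6, 'M')]
```

Three candidate sites for one phospho group is exactly the kind of ambiguous PSM that needs a site probability or false localization rate.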

76
Q

What is the workflow when processing data with MaxQuant and Perseus?

!

A
77
Q

What does MaxQuant do?

!

What are the key features?

A

Processes MS raw data:
MaxQuant performs feature detection and quantification of high accuracy MS data.

  • Feature detection of MS1 spectra (mass and intensity of peptide peaks)
  • Feature quantification
  • Peptide and protein identification (with Andromeda search algorithm)
  • Protein quantification

Key features of MaxQuant:

  • 3D peak reconstruction for quantification
  • Non-linear mass correction for accurate peptide masses.
78
Q

What is MaxLFQ?

!

A

Label-free quantification:
Peptide intensity-based MS1 quantification accounting for varying detection of quantifiable peptides from sample to sample.

79
Q

What does Perseus do?

!

A

Perseus uses tab delimited txt files as input. It is a software suite that is used to (graphically) analyze quantitative omics data.

Data analysis of data from MaxQuant

  • Exploratory analysis & normalization
  • Statistical analysis:
    • protein interactomes
  • Data integration
  • Visualization:
    • expression proteomics
  • Specialized tools
80
Q

What is Andromeda?

A

Andromeda is the MaxQuant embedded search engine that is used for peptide identification.

Decoy (forward/reverse) databases are used to determine peptide and protein FDRs.

81
Q

What are the two meanings of Normalisation?

A
  1. Transformation of the data so that the resulting distribution is normal-like (Gaussian-like).
    Log-transformation of the data is one of the basic principles in data analysis. It:
  • Reduces the intensity range
  • Allows parametric test statistics to be applied
  • Stabilises the variance

Often a log2 transform is used in biology

However, transforming data can lead to the creation of outliers

  2. Removing technological bias so that different samples can be compared.
    This removes systematic shifts: typically abundance variation between samples (random shifts: instrument variations, contamination of only some samples…)
    However, removing bias can only be done if:
  • The samples contain essentially the same proteins
  • A large enough number of proteins have unchanged abundance between samples
82
Q

What normalisation techniques that remove a bias are there?

A
  • Normalisation by the global median
  • Quantile normalisation
  • LOcally WEighted regression and Smoothing Scatterplot (LOWESS)
  • Variance Stabilizing Normalisation (VSN) (Combination of the above)
83
Q

What is normalisation by the global median?

A

Multiply the log2 intensities of a sample by a number, so that the median of the sample is the global median of all samples
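
A minimal sketch of median normalisation. Two assumptions are made here: a constant factor on raw intensities becomes an additive offset on the log2 scale, so the sketch shifts the log2 values, and the global median is taken over the pooled values of all samples. The function and sample names are hypothetical.

```python
from statistics import median


def median_normalise(samples: dict[str, list[float]]) -> dict[str, list[float]]:
    """Shift each sample's log2 intensities onto the global median."""
    pooled = [v for values in samples.values() for v in values]
    global_median = median(pooled)
    return {
        name: [v - median(values) + global_median for v in values]
        for name, values in samples.items()
    }


log2_intensities = {"A": [10.0, 11.0, 12.0], "B": [12.0, 13.0, 14.0]}
normalised = median_normalise(log2_intensities)

print(normalised["A"])  # [11.0, 12.0, 13.0]
print(normalised["B"])  # [11.0, 12.0, 13.0]
```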

84
Q

What is quantile normalisation?

A

non-linear transformation that replaces each feature value (row) with the mean of the features across all the samples with the same rank or quantile.

This:

  • forces all samples onto a similar distribution
  • however, preserves the abundance order within each sample
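
Quantile normalisation can be sketched in a few lines. This version assumes complete data (no missing values) and does not treat ties specially. The function name and the numbers are hypothetical.

```python
def quantile_normalise(samples: list[list[float]]) -> list[list[float]]:
    """Replace each value by the mean of all values sharing its rank.

    samples: one list of intensities per sample, all of equal length.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean across samples of the values at each rank.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]
    result = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # feature indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result


print(quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Both samples end up with the same set of values (1.5, 3.5, 5.5), while the order of features within each sample is unchanged.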
85
Q

What is LOcally WEighted regression and Smoothing Scatterplot (LOWESS)?

A

Main idea:

  1. Use a type of sliding window to divide the data into smaller blobs
  2. At each data point, use a type of least squares to fit a line

This:

  • corrects for non-linear bias
  • may introduce additional bias
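
The two steps of the main idea can be sketched as a locally weighted straight-line fit. This is a simplified version: it uses tricube weights and leaves out the robustness iterations of full LOWESS. The function name and the `frac` window parameter are hypothetical.

```python
def lowess(xs: list[float], ys: list[float], frac: float = 0.5) -> list[float]:
    """Fit a weighted least-squares line around every data point."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of points in each window
    fitted = []
    for x0 in xs:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at the window edge.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
fitted = lowess(xs, ys)
print(max(abs(f - y) for f, y in zip(fitted, ys)) < 1e-9)  # True: a straight line is recovered
```

For normalisation, the fitted trend (for example log ratio against mean log intensity) would then be subtracted from the observed values.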
86
Q

What is Variance Stabilizing Normalisation (VSN)?

A

Allows all the other normalisation techniques to be applied to remove bias
–> a transformation that removes bias and shifts

  1. For each sample, allow one shift and one multiplication factor
  2. Apply transformation
  3. Assume that those peptides that do not have differential abundance follow random fluctuations around an average
  4. Apply Maximum-Likelihood Estimation
87
Q

What is imputation?

A

In statistics, imputation is the process of replacing missing data with substituted values. There are three main problems that missing data causes: missing data can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and create reductions in efficiency.

  • Optimal imputation approach may differ depending on whether imputation is performed at peptide level or at protein level
  • If done at protein level, there will be a hidden peptide level imputation (missing peptides have implicit zero intensity)
  • Targeted imputation strategies have been developed when nature of missing values is known (MCAR using maximum likelihood or MNAR using MinDet).
  • On the other hand, the best generic approach when nature of missing value cannot be determined is still being investigated.
88
Q

What is MinDet of the R package MSnbase?

A

Targeted imputation strategy

For each sample, the missing entries are replaced with a minimal value observed in that sample. The minimal value observed is estimated as being the q-th quantile (default q = 0.01) of the observed values in that sample.
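
A pure-Python sketch of the MinDet idea, not the MSnbase implementation. Missing values are represented as `None`, and the quantile uses linear interpolation between sorted values, which is an assumption about how the q-th quantile is estimated.

```python
def quantile(values: list[float], q: float) -> float:
    """q-th quantile with linear interpolation between sorted values."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def impute_min_det(sample: list, q: float = 0.01) -> list[float]:
    """Replace missing entries (None) by the sample's q-th quantile."""
    observed = [v for v in sample if v is not None]
    fill = quantile(observed, q)
    return [fill if v is None else v for v in sample]


sample = [20.0, None, 22.0, 24.0, None]
print(impute_min_det(sample))  # both missing entries become roughly 20.04
```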

89
Q

What is MinProb of the R package MSnbase?

A

Performs the imputation of left-censored missing data by random draws from a Gaussian distribution centred to a minimal value. Considering an expression data matrix with n samples and p features, for each sample, the mean value of the Gaussian distribution is set to a minimal observed value in that sample. The minimal value observed is estimated as being the q-th quantile (default q = 0.01) of the observed values in that sample
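
A pure-Python sketch of the MinProb idea, not the MSnbase implementation. The standard deviation `sd`, the `seed` parameter and the linear-interpolation quantile are assumptions made for illustration, and missing values are represented as `None`.

```python
import random


def quantile(values: list[float], q: float) -> float:
    """q-th quantile with linear interpolation between sorted values."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def impute_min_prob(sample: list, q: float = 0.01, sd: float = 0.3, seed: int = 0) -> list[float]:
    """Replace missing entries by Gaussian draws centred on the q-th quantile."""
    rng = random.Random(seed)  # seeded so the draws are reproducible
    observed = [v for v in sample if v is not None]
    centre = quantile(observed, q)
    return [rng.gauss(centre, sd) if v is None else v for v in sample]


sample = [20.0, None, 22.0, 24.0, None]
print(impute_min_prob(sample))
```

Unlike MinDet, the two missing entries receive different values, which avoids creating an artificial spike at a single minimal value.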

90
Q

What is knn of the R package MSnbase?

A

For each gene with missing values, we find the k nearest neighbors using a Euclidean metric, confined to the columns for which that gene is NOT missing. Each candidate neighbor might be missing some of the coordinates used to calculate the distance. In this case we average the distance from the non-missing coordinates. Having found the k nearest neighbors for a gene, we impute the missing elements by averaging those (non-missing) elements of its neighbors. This can fail if ALL the neighbors are missing in a particular element. In this case we use the overall column mean for that block of genes
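
The procedure above can be sketched for a small matrix (rows are genes/proteins, columns are samples). This is not the MSnbase implementation: the distance is the root mean squared difference over the coordinates both rows have, and the column-mean fallback assumes every column has at least one observed value. Missing values are `None`.

```python
from math import sqrt


def knn_impute(rows: list[list], k: int = 2) -> list[list]:
    """Impute missing entries (None) from the k nearest rows."""

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return None  # no coordinate in common, cannot compare
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        if None not in row:
            continue
        scored = []
        for j, other in enumerate(rows):
            d = distance(row, other) if j != i else None
            if d is not None:
                scored.append((d, j))
        neighbours = [rows[j] for _, j in sorted(scored)[:k]]
        for col, value in enumerate(row):
            if value is not None:
                continue
            donors = [n[col] for n in neighbours if n[col] is not None]
            if not donors:  # all neighbours missing here: use the column mean
                donors = [r[col] for r in rows if r[col] is not None]
            imputed[i][col] = sum(donors) / len(donors)
    return imputed


matrix = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
]
print(knn_impute(matrix)[0])  # [1.0, 2.0, 4.0]
```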

91
Q

What is maximum likelihood estimation (MLE)?

A

Maximum Likelihood Estimation:

All missing values are filled in with simulated values drawn from their predictive distribution, given the observed data and the specified parameters

92
Q

What does differentially expressed mean?

How can differential expression be tested?

A

Differentially expressed:

if the difference between the sample means is statistically significant

Can be tested through:

  • Parametric statistics:
    we know the form of the underlying distribution, typically the normal distribution, and we test the parameter(s) of the distribution -> Student’s or Welch’s t-test
  • Non parametric statistics:
    no parent distribution is assumed, instead rely usually on ranks to compare distributions (Wilcoxon-Mann-Whitney).
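
As an illustration of the parametric case, the sketch below computes Welch’s t statistic and its degrees of freedom for two groups of log2 intensities. Turning the statistic into a p-value needs the t-distribution, which is left out here. The function name and data are hypothetical.

```python
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


control = [10.0, 11.0, 12.0]
treated = [13.0, 14.0, 15.0]
t, df = welch_t(control, treated)
print(round(t, 3), round(df, 3))  # -3.674 4.0
```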
93
Q

What is a Type I and a type II error?

A

Type I error: rejected H0 when H0 was true
Type II error: failed to reject H0 when H0 was false

94
Q

What is the FDR?

How can it be controlled?

A

(global) false discovery rate: FDR= false positives / positives

The original Benjamini-Hochberg procedure has been adapted over the years and has evolved into a method that produces FDR-controlled adjusted p-values.

An FDR-controlled adjusted p-value threshold of 0.01 means that 1% of the significant DE peptides are expected to be false positives
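
BH-adjusted p-values can be computed with a short loop: each p-value is scaled by (number of tests / rank), and a running minimum from the largest p-value downwards keeps the adjusted values monotone. The function name and the p-values are hypothetical.

```python
def bh_adjust(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print([round(p, 4) for p in bh_adjust([0.01, 0.04, 0.03, 0.5])])
# [0.04, 0.0533, 0.0533, 0.5]
```

Calling everything with an adjusted p-value below a chosen cut-off significant then keeps the expected FDR at that cut-off.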

95
Q

What differential expression tests are there?

A
  • Student’s (Welch’s) two-sample t-test:
    • parametric statistic
  • EBayes: moderated t-statistics
    • parametric statistic
  • Wilcoxon–Mann–Whitney test:
    • Non-parametric
  • Benjamini-Hochberg (BH) procedure for multiple testing
96
Q

What does log2 transform of raw proteomics data do?

A
  • reduces the intensity range
  • allows application of parametric test statistics (make data look Gaussian-like)
  • is variance stabilising
97
Q

What sample preparation has to be done for the Bottom-up Proteomics Workflow before LC-MS/MS?

A

Since native proteins are difficult to digest, we need to denature them. To do so we first add DTT, which reduces the disulfide bonds to free SH groups. Then we add IAA (iodoacetamide), which alkylates the SH groups and keeps them reduced.

For some experiments, we need to add some chemical PTMs, such as:

  • Oxidation (M, H, W)
  • Carbamoylation (-NH2)
  • Carbamidomethylation (e–phil)
  • Deamidation (N, Q)

We then need to digest the protein with a serine endopeptidase (usually trypsin), which cuts at every Arg/Lys site.

All these modifications have an impact on the mass of the protein, which has to be considered later on
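
The digestion step can be mimicked in silico. This sketch uses the simplest rule (cut after every K or R) and ignores missed cleavages and the commonly applied exception for a following proline. The function name and the sequence are hypothetical.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after every Lys (K) or Arg (R)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # C-terminal peptide without K/R at its end
        peptides.append(protein[start:])
    return peptides


print(tryptic_digest("MKWVTFISLLRLAGKC"))
# ['MK', 'WVTFISLLR', 'LAGK', 'C']
```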

98
Q

What are the problems when identifying proteins by database search?

A
  • Only 30-50% of spectra are positively identified. What is the rest?
    • Splice variants?
    • Unknown PTMs?
  • Are there other algorithms for MS2 spectra interpretation?
99
Q

What is a peptide sequence tag?

A
  • Collision-induced dissociation (CID) spectra usually contain a short, easily identifiable series of sequence ions yielding a partial sequence.
  • Short partial sequence is not very specific in a database search, but a partial sequence + related ‘tag masses’ is very specific.
  • The sequence tag qualifier consists of the observed mass of the first peak of an identified sequence ladder, a stretch of interpreted amino acid sequence, and the observed mass of the final peak of the ladder
  • The tag “m1 - m2 - m3” has to be mapped to a protein sequence database.
  • Since it is not known whether the tag is a b-ion tag (left to right) or y-ion tag (right to left), both the sequence and the reversed sequence have to be searched.
  • The lower mass has to match the N-terminal peptide sequence up to the tag (m1), peptide M minus the higher mass has to match the C-terminal sequence (and vice versa for the reversed sequence).
  • To allow for a modification or mutation, only one of the masses m1 or m3 needs to match in a modification tolerant search
100
Q

What is needed for de novo peptide sequencing?

A
  • Algorithms to read peptide sequence from MS/MS spectrum.
  • Algorithm = Search of potential sequence + scoring of potential sequence.
  • These methods require high quality spectra
101
Q

What can cause problems/difficulties with de novo sequencing?

A
  • the presence of both y and b ions (sequence is read in both directions).
  • mass ambiguities (I - L, Q - K, ..):
    • some can be solved with high resolution MS/MS
  • cleavages do not occur on every peptide bond:
    • poor quality spectrum (some fragment ions are below noise level).
    • the C-terminal side of proline is often resistant to cleavage.
    • absence of mobile protons.
    • peptides with free N-termini often lack fragmentation between the first and second amino acid
  • De Novo sequencing has a huge search space:
    • In De Novo sequencing all possible amino acid sequences are candidates.
    • E.g. all sequences of length 10: 20^10 ≈ 10^13 sequences!
    • About 10^4 fully tryptic peptide sequences of length 10 in human UniProt database.
  • De Novo is prone to false positives. True sequence is often not the highest scoring one
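
The search-space comparison is simple arithmetic: 20 possible amino acids at each of 10 positions. The database figure is assumed here to be of the order of 10^4.

```python
n_amino_acids = 20
length = 10

all_sequences = n_amino_acids ** length  # every possible 10-mer
tryptic_in_database = 10 ** 4            # assumed order of magnitude for human UniProt

print(all_sequences)                         # 10240000000000, about 10^13
print(all_sequences // tryptic_in_database)  # 1024000000, about 10^9 times more candidates
```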
102
Q

What are some applications of de novo sequencing?

A
  • Genome is unknown
  • Errors in database
  • Splice variants and mutations
  • Post translational modifications (PTMs)

Example: Immunopeptidomics

103
Q

What de novo sequencing methods are there?

A
  • Spectrum graph
  • Dynamic programming on spectrum
  • Mass array
  • Hidden Markov Model
  • Sequence optimization by Genetic Algorithm
  • Decision tree
  • Deep learning
104
Q

What is a spectrum graph?

A

De novo sequencing method

  • Convert the various ion types into a graph containing nodes that represent a single ion type.
  • Find pathways through connecting the nodes to determine sequence candidates.
  • Searching the graph could be done by breadth-first or depth-first algorithms. However, the optimal solution should only use an original peak once, which makes the problem more complicated
  • A very simple de novo algorithm is based on a data structure called mass arrays, which allows original peaks to be used more than once
105
Q

What do we do once we got the de novo peptide sequence?

A
  • De novo peptide sequence has to be mapped against a sequence database in order to retrieve the corresponding protein.
  • If the sequence is not complete, the order of the sequence can be N-term -> C-term or C-term -> N-term, and both orders have to be matched.
  • Different sequences can produce the same fragmentation spectrum
106
Q

What is the difference between discovery vs. targeted proteomics?

A

Discovery or shotgun proteomics:

  • Data Dependent Acquisition (DDA)
  • Peptide Identification/Quantification on MS1 level with DIA
  • Shortcomings of DDA:
    • Limited sensitivity, biased towards more abundant proteins
    • Reproducibility
    • missing data points
    • Identification of non-relevant proteins

Targeted proteomics:

  • Selected Reaction Monitoring (SRM) and its family
  • Peptide Quantification on MS2 level with DIA
  • Potential benefits of targeted approach:
    • Reduced noise
    • Very sensitive
    • Reproducible
    • Focus is exclusively on proteins of interest
107
Q

What are Advantages/Disadvantages of Selected Reaction Monitoring (SRM)?

A

QQQ instruments are very efficient at reducing noise and are extremely sensitive (and fast). However, they have a low resolution, and one needs to know in advance what one is looking for. If the target protein is known, its best peptides also have to be known:

  • Peptides should not be shared with other proteins → proteotypic peptides
  • Peptides should give a clear signal in LC-MS.
  • Libraries containing proteotypic peptides should also be used to define suitable fragment ions
108
Q

What is Multiple Reaction Monitoring (MRM)?

A

Targeted LC-MS/MS - Acquisition method

multiplexed SRM, usually done with QQQ

5 transitions from precursor to fragment

109
Q

What is DIA (SWATH)?

A

Data-Independent Acquisition

Usually done with QQ-TOF or Q-Orbitrap

DIA is a method of molecular structure determination in which all ions within a selected m/z range are fragmented and analyzed in a second stage of tandem mass spectrometry. The problem with DIA is that it needs a larger selection window, which goes from 2 m/z in DDA to 20 m/z in DIA.

This means that more peptides will co-fragment, the MS/MS spectra will be more complex and we will have more interferences

110
Q

What are proteotypic peptides?

A

selection of best peptide representing unique proteins.

–> Not all peptides of a given protein do have the same ionization efficiency.

Important: find the best “flying” peptides to achieve maximal sensitivity!

Criteria:

  • Most observations in discovery experiments (e.g. spectral counts)
  • Peptide length (8-25 amino acids)
  • Peptide hydrophobicity (avoid too high and too low)
111
Q

What is GO?

A

controlled vocabulary and repository of the logical structure of the biological functions (‘terms’) and their relationships to one another

  • The GO terms themselves are arranged in directed graphs
  • GO is made up of three root nodes (aspects):
    • biological process
    • cellular component
    • molecular function

Gives functional information: functional descriptions that aggregate genes into common protein complexes, biological pathways, network modules, and other gene sets consisting of genes playing similar roles

112
Q

What are pathway databases?

A

contain information on how and where genes interact

Gives topological information: regulatory relationships that exist among genes, protein complexes, biological pathways, and biological network modules

e.g. WikiPathways

Limitations:

  • Coverage is limited
  • No defined relationships between pathways
113
Q

What are problems when using knowledge databases to analyse results?

A
  • Gene ontologies and annotations and pathways are subject to research bias
  • data is heterogeneous: well researched areas lead to more terms for specific functions
  • data is incomplete: absence of evidence of function does not imply absence of function
  • -> dangerous data aggregation
  • Both gene annotations and gene ontology, as well as pathways, are updated continuously: always check that the tool you are using uses the latest annotations
114
Q

What is an over-representation analysis?

What are its limitations?

A

unranked Analysis

  • Takes an unranked subset of the identified proteins and identifies which gene sets (GO terms, or pathways) are over represented in the subset, either with respect to the detected genes/proteins, or with respect to the whole genome –> background
  • Typical methods for the statistical evaluation of over-representation are Fisher’s exact test, hypergeometric test, binomial or χ2 test

Limitation of the over-representation analysis method:

  • The result (i.e., which GO terms are statistically significant), depends on the arbitrary cut-off for differential expression
  • All interesting proteins/genes are considered equal
  • Proteins/genes are assumed independent from one another
    • potential for false positives
    • proteins of small individual effect but working together may be missed
  • Biological significance is not necessarily correlated to statistical significance
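
The hypergeometric test named above can be written with the standard library alone. It gives the probability of seeing at least the observed overlap between the selected subset and a gene set by chance. The function name, parameter names and counts are hypothetical.

```python
from math import comb


def hypergeom_p(background: int, in_set: int, selected: int, overlap: int) -> float:
    """P(X >= overlap) when drawing `selected` items from `background`.

    background: size of the background list (detected proteins or genome)
    in_set:     background members annotated with the gene set
    selected:   size of the subset of interest
    overlap:    members of the subset annotated with the gene set
    """
    upper = min(in_set, selected)
    favourable = sum(
        comb(in_set, i) * comb(background - in_set, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, selected)


# 20 detected proteins, 5 in the gene set, 4 selected, 3 of them in the set.
print(round(hypergeom_p(20, 5, 4, 3), 4))  # 0.032
```

The choice of background (detected proteins versus whole genome) changes the first argument and therefore the p-value.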
115
Q

What is Gene Set Enrichment Analysis (GSEA)?

A

ranked analysis

All genes/proteins are taken into account!

  • Takes a ranked list of identified proteins and determines which gene sets (GO terms, or pathways) deviate from a random distribution across the list
  • Statistical significance of enrichment score can be tested by checking the supremum of the running sum against permutations of gene labels (Subramanian), or compared to the cumulative distribution of all gene sets (Panther).
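
The running sum behind the enrichment score can be sketched in its unweighted form: walk down the ranked list, step up at every member of the gene set and down otherwise, and report the largest deviation from zero. Weighting by the ranking metric and the permutation test are not shown. The function name and the gene names are hypothetical.

```python
def enrichment_score(ranked: list[str], gene_set: set[str]) -> float:
    """Unweighted running-sum enrichment score for a ranked list."""
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, of the list")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):  # keep the largest deviation from zero
            best = running
    return best


ranked = ["A", "B", "C", "D", "E", "F"]
print(enrichment_score(ranked, {"A", "B"}))  # 1.0, set sits at the top of the list
print(enrichment_score(ranked, {"E", "F"}))  # -1.0, set sits at the bottom
```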
116
Q

What are the 2 main types of protein-protein interaction (PPI) databases?

Examples?

A

Experimental:

  • in vitro reconstitution
  • genetic: synthetic lethality, Yeast Two Hybrid
  • co-purification methods:
    • AP-MS
    • CoIP-WB
    • biotin proximity ligation,
    • protein correlation profiling

Computational:

  • prediction using structural information
  • inference from orthologous interactions
  • co-expression
  • Text mining (co-occurrence of protein names in texts)

Examples:

  • string-db.org
  • IntAct
  • thebiogrid.org
117
Q

What are limitations of public PPI databases?

A
  • PPI databases contain data from a broad range of experimental sources and false discovery rates are not annotated
  • Bias towards frequently studied proteins results in a biased network topology
  • Temporal regulation not represented in the public PPI databases
  • Binary PPI databases do not provide information on actual protein complexes
118
Q

What is proteogenomics?

A

MS/MS searches are usually done on standard protein-coding sequences from Ensembl, GENCODE, RefSeq or UniProtKB (SwissProt/TrEMBL).

However, recent research showed the unexpected transcription and translation of so-called non-coding genes such as:

  • Long non-coding RNA (lncRNA)
  • Endogenous retroviral (ERE) or transposable elements (TE)
  • Splice variants, skipped introns, and 3’ or 5’-UTR sequences
  • Germline or somatic mutations (SAAV, InDels, frameshifts)
  • Circular RNA

Proteogenomic research investigates how to search for these non-canonical peptides by MS/MS

The term was introduced by Jaffe et al. Proteomics, 2004, using MS/MS searches to refine gene models, to validate gene expression at protein level, and to improve protein sequence databases.

Recent advances in sequencing technologies (WGS, RNA-seq, ..) made it possible to investigate unsequenced, partially sequenced genomes or personalized genomes.

RNA-seq data allows building sequence databases based on mRNA, which are expressed in a sample

119
Q

What protein sequence databases are there?

A
  • Six-frame translation of genome or WGS:
    • Very large, mostly non-expressed sequences, no exon-exon junctions
  • Ensembl: automated gene annotation system (gene prediction, conservation, translation evidence, …)
    • Biotypes: “protein-coding gene,” “long noncoding RNA (lncRNA) gene” and “pseudogene.”
  • GENCODE: gene annotations by integrating Ensembl automated predictions and the Human and Vertebrate Genome Analysis and Annotation (HAVANA) manual annotations.
  • RefSeq: manual curation of a collection of publicly available data for many organisms (NCBI)
  • UniProtKB (UniProt consortium): Expressed protein sequences with extensive functional annotation, GO, literature, and links to other databases
    • 5 levels of expression evidence (protein, mRNA, homology, predicted, uncertain)
    • SwissProt: manually annotated and reviewed records
    • TrEMBL: automatic annotation waiting for manual validation (redundant)
  • Repositories of identified peptides/proteins:
    • PeptideAtlas.org: compilation of many public datasets, spectrum libs
    • ProteomicsDB.org: nicely organized repository for human, mouse, Arabidopsis. Tissue specific, spectrum libs
120
Q

What are non-coding transcripts?

A
  • Transcripts can belong to many classes:
    • lncRNA, intronic sequences or alternative splicing, alternative start or stop sites, frameshifts, mutations, pseudogenes, retroviral elements, micro RNA, small interfering RNA, circular RNA, viral RNA, bacterial RNA, …
    • Proteasomal spliced peptides
  • Despite their name some non-coding transcripts can still encode non-canonical proteins (noncProt)
  • noncProt can be functional or result from aberrant translation and/or the aberrant transcription of non-coding genomic regions, which often yields unstable proteins not expressed in healthy cells
  • Often not present in standard databases such as UniProt
121
Q

What is long non-coding RNA (lncRNA)?

A
  • lncRNAs are transcripts >200 nt that are typically not translated into functional proteins
  • ~300,000 lncRNAs known
  • Structural conservation of lncRNA (but not expression conservation)
  • Human lncRNA sequences in FASTA format can be downloaded from LNCipedia (127,802 entries), Ensembl (50,745 entries), NONCODE (173,112 entries), GENCODE (261,478 entries), …
122
Q

What are Endogenous Retroelements (ERE) in humans?

A
  • > 3 × 10⁶ EREs, which make up 42% of the human genome
  • EREs are mainly expressed in embryonic stem cells, germline cells and medullary thymic epithelial cells (mTEC)
  • Important role in evolution, immunity and brain development
  • Implicated in autoimmunity and cancer
  • Long terminal repeats (LTR): 8% of the human genome, viral origin
  • Long and short interspersed nuclear elements (LINE and SINE) (non-LTR), 21% and 13% of genome, respectively
  • Transposition is tightly controlled. Only 80-100 LINE1 elements are active per individual
  • Somatic LINE1 insertions in 50% of cancers
  • ERE sequences can be downloaded in FASTA format, for example from RepBase
123
Q

How do single amino acid mutations affect proteogenomics?

A
  • AA mutations or variants
    • germline or somatic
    • Heterozygous (1 allele) or homozygous (2 alleles)
  • AA mutations play an important role in many diseases
  • They can affect the function of a protein by changing its expression level, or by directly changing the protein's active sites or PTMs
  • Detection of AA variants on the protein level is important for many aspects
  • dbSNP (NCBI): contains SNP, InDels & annotation
  • COSMIC (Sanger): Catalogue Of Somatic Mutations In Cancer
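To detect an AA variant at the protein level, the variant peptide has to be present in the search database. The following Python sketch shows one way to derive such peptides, assuming a simple trypsin rule (cleavage after K or R, not before P, no missed cleavages). The function names and the example sequence are hypothetical. For a heterozygous variant, both the reference and the variant peptide can be expected in the sample, so the reference entry would be kept as well.

```python
import re


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def apply_substitution(protein, position, ref, alt):
    """Apply a single amino acid substitution; position is 1-based."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return protein[:position - 1] + alt + protein[position:]


def variant_specific_peptides(protein, position, ref, alt):
    """Return tryptic peptides that exist only in the variant protein."""
    reference = set(tryptic_peptides(protein))
    mutated = tryptic_peptides(apply_substitution(protein, position, ref, alt))
    return [p for p in mutated if p not in reference]


# Hypothetical sequence, for illustration only.
protein = "MKTAYIAKQRPLEVGSDR"
simple_variant = variant_specific_peptides(protein, 4, "A", "V")
new_cleavage_site = variant_specific_peptides(protein, 15, "G", "R")
```

The second call illustrates that a substitution to K or R creates a new cleavage site, so a single variant can give rise to more than one new peptide.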
124
Q

What needs to be considered when Searching Proteogenomics Sequence Databases?

A
  • The role of non-canonical proteins is increasingly recognized. In many projects researchers try to find non-canonical proteins or peptides next to canonical ones.
  • For the MS/MS searches, a large database is used which consists of a canonical and non-canonical part.
  • A non-canonical PSM is accepted if it scores higher than the canonical candidates for the same spectrum and passes the FDR threshold.
  • However, non-canonical proteins are typically of low abundance, so non-canonical hits are expected to be rare.
  • The error has to be carefully controlled in these searches in order to avoid many false positives.
    • Extraordinary claims require extraordinary evidence
  • Large sequence databases decrease sensitivity
  • Non-canonical databases are prone to false positives
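The last points can be illustrated with a small Python sketch. It assumes that the FDR is estimated with a target-decoy search (decoys / targets above a score cut-off) and compares a pooled FDR filter with a filter applied separately to the canonical and non-canonical PSMs. The separate filter is only one possible way to control the error more strictly and is an assumption of this sketch. All scores and function names are invented for illustration.

```python
def filter_by_fdr(psms, threshold):
    """Return target PSMs above the lowest score cut-off at which the
    estimated FDR (decoys / targets at or above the cut-off) is <= threshold."""
    ranked = sorted(psms, key=lambda p: p["score"], reverse=True)
    targets = decoys = cut = 0
    for rank, psm in enumerate(ranked, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= threshold:
            cut = rank
    return [p for p in ranked[:cut] if not p["is_decoy"]]


def filter_by_class(psms, threshold):
    """Apply the FDR filter separately to canonical and non-canonical PSMs."""
    accepted = []
    for canonical in (True, False):
        subset = [p for p in psms if p["is_canonical"] == canonical]
        accepted.extend(filter_by_fdr(subset, threshold))
    return accepted


def make_psm(score, is_decoy, is_canonical):
    """Build one PSM record (hypothetical minimal structure)."""
    return {"score": score, "is_decoy": is_decoy, "is_canonical": is_canonical}


# Invented scores: many good canonical hits, few and weak non-canonical ones.
example = (
    [make_psm(s, False, True) for s in (90, 85, 80, 75, 70)]
    + [make_psm(40, True, True)]
    + [make_psm(s, False, False) for s in (60, 35)]
    + [make_psm(s, True, False) for s in (65, 30)]
)

pooled = filter_by_fdr(example, 0.25)
separate = filter_by_class(example, 0.25)
```

In this toy data set the pooled filter accepts a non-canonical PSM because the many high-scoring canonical hits dominate the estimate, whereas the separate filter rejects it because a non-canonical decoy scores higher.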
125
Q

What is HCD (High-energy Collisional Dissociation)?

!

A

Fragmentation Technique

  • HCD is a beam-type CID, similar to ion trap CID, but performed in a quadrupole.
  • HCD uses higher energy and shorter activation times (~0.1 ms) than linear ion trap CID (~30 ms).
  • HCD-generated spectra are not subject to the same low-mass cutoff as ion trap CID spectra
  • -> low-mass ions of type a2, b2, y1, y2, immonium ions, internal cleavage fragments, and reporter ions from isobaric tags (see quantitative proteomics) are recorded.
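A rough Python sketch of why this matters, assuming singly charged fragments, monoisotopic residue masses, and the approximate "one-third rule" for the ion trap CID low-mass cutoff (cutoff at about one third of the precursor m/z). The rule is an assumption of this sketch and is not stated in the cards. The function names and the example peptide are hypothetical.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
CO = 27.994915


def residue_sum(sequence):
    """Sum of monoisotopic residue masses."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def low_mass_ions(peptide):
    """m/z of singly charged a2, b2, y1 and y2 fragment ions."""
    b2 = residue_sum(peptide[:2]) + PROTON
    return {
        "a2": b2 - CO,
        "b2": b2,
        "y1": residue_sum(peptide[-1:]) + WATER + PROTON,
        "y2": residue_sum(peptide[-2:]) + WATER + PROTON,
    }


def immonium_mz(amino_acid):
    """m/z of the immonium ion of one amino acid."""
    return RESIDUE_MASS[amino_acid] - CO + PROTON


def precursor_mz(peptide, charge):
    """m/z of the intact peptide at the given charge state."""
    return (residue_sum(peptide) + WATER + charge * PROTON) / charge


def ion_trap_cutoff(mz):
    """Approximate low-mass cutoff of ion trap CID (one-third rule, assumed)."""
    return mz / 3


peptide = "SAMPLER"  # hypothetical tryptic peptide
ions = low_mass_ions(peptide)
cutoff = ion_trap_cutoff(precursor_mz(peptide, 2))
below_cutoff = [name for name, mz in ions.items() if mz < cutoff]
```

Fragments listed in `below_cutoff` would be lost under the assumed ion trap cutoff but remain observable with HCD. For longer peptides with a higher precursor m/z, the cutoff rises and more of the low-mass ions are affected.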