The workflow described above was applied to an MS dataset available on the PRIDE repository38,39. The original study developed a method (iMixPro) that uses stable isotope labeling of amino acids in cell culture (SILAC) to eliminate false positives from affinity-purification MS (AP-MS) experiments38. In brief, an AP-MS experiment uses bead-bound antibodies to capture a protein of interest (bait) and its interactors (preys). The collected proteins are then digested and prepared for MS. The sample preparation method and the instrument settings are described in the original study and on the PRIDE repository (PXD004246). A challenge in such experiments is the abundance of false positives, notably from proteins binding to the beads but not the bait. In the original study, SILAC was used to generate different isotope ratios between true preys and false positives38: three control samples (no bait) cultured in light medium, one sample expressing the bait cultured in light medium, and one sample expressing the bait cultured in heavy medium are processed with the beads and then analyzed by mass spectrometry. With such a design, non-specific proteins binding to the beads will have a heavy-to-light ratio of 1:4, whereas true preys will have a ratio of 1:1.
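The expected ratios follow from counting which samples contribute to each signal. The short Python sketch below reproduces that arithmetic under a simplifying assumption that is not stated in the original study: every sample contributes an equal amount of a given protein. The sample names and the `expected_heavy_to_light` helper are illustrative only.

```python
from fractions import Fraction

# Experimental design as described in the text:
# (hypothetical sample name, culture medium, bait expressed?)
SAMPLES = [
    ("control_1", "light", False),
    ("control_2", "light", False),
    ("control_3", "light", False),
    ("bait_light", "light", True),
    ("bait_heavy", "heavy", True),
]


def expected_heavy_to_light(requires_bait: bool) -> Fraction:
    """Expected heavy-to-light ratio, assuming equal recovery from each sample.

    requires_bait=True models a true prey (captured only when the bait is
    expressed); requires_bait=False models a protein that binds the beads
    in every sample.
    """
    contributing = [s for s in SAMPLES if s[2] or not requires_bait]
    heavy = sum(1 for s in contributing if s[1] == "heavy")
    light = sum(1 for s in contributing if s[1] == "light")
    return Fraction(heavy, light)


background_ratio = expected_heavy_to_light(requires_bait=False)
prey_ratio = expected_heavy_to_light(requires_bait=True)
print("non-specific binder H:L =", background_ratio)
print("true prey H:L =", prey_ratio)
```

A bead binder is present in all five samples (one heavy, four light), whereas a true prey is recovered only from the two bait-expressing samples (one heavy, one light).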
We re-analyzed their AP-MS data using the OpenProt database; the baits included three endogenous proteins (PTPN14, JIP3 and IQGAP1), and two over-expressed proteins (RAF1 and RNF41). Since the experiments used SILAC, the Galaxy workflow for protein quantification was used (Supplementary Material S3, Figure 2). The workflow was run using the whole OpenProt database (OpenProt_all) or a restricted OpenProt database (OpenProt_2pep, including only proteins previously detected with a minimum of two unique peptides).
Protein identification and quantification were good and reproducible across the databases used. As shown in Figure 3, most proteins identified in the original paper were also identified using either the OpenProt_2pep or OpenProt_all database (a detailed list is available in Supplementary Material S5). This result shows that the pipeline described here and the OpenProt databases produce protein identifications and quantifications comparable to those of current procedures based on the UniProtKB databases40. However, the use of OpenProt databases has the unique advantage of allowing detection of novel and previously undetectable proteins, as demonstrated in this case study.
Eleven well-supported proteins (1 isoform and 10 AltProts) that are not currently annotated in databases were identified across all datasets, with confident peptides, using the OpenProt_2pep database (all protein accessions, along with the number of supporting peptides, are available in Supplementary Material S5). This database allows the use of a traditional 1% FDR because the increase in search space remains moderate. These 11 proteins were not identified in the original study because they were absent from the database used.
Twenty-nine novel proteins (16 isoforms and 13 AltProts) were discovered across all datasets, with confident peptides, using the OpenProt_all database (all protein accessions, along with the number of supporting peptides, are available in Supplementary Material S6). As shown in Figure 3, the recommended stringent FDR (0.001%) did not affect the most confident protein identifications, although it did decrease the total number of identified proteins. Compared with the OpenProt_2pep database, a higher number of novel proteins can be confidently identified. All of these novel proteins are absent from the OpenProt_2pep database. This highlights the crucial role of the chosen database for MS-based proteomics.
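To illustrate why a larger search space calls for a stricter FDR, the following Python sketch shows a generic target-decoy estimate: among the matches accepted above a score cutoff, the FDR is approximated by the number of decoy hits divided by the number of target hits. This is not the implementation used by the tools in the Galaxy workflow; the function name, the toy scores, and the handling of ties are assumptions made for illustration.

```python
def score_threshold_for_fdr(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR stays <= max_fdr.

    psms: iterable of (score, is_decoy) pairs, where a higher score means a
    better match. The FDR at a cutoff is estimated as decoys / targets among
    the accepted matches. Returns None if no cutoff satisfies max_fdr.
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score
    return best_cutoff


# Made-up scores for illustration only (True marks a decoy hit).
toy_psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, False),
    (7.5, True), (7.1, False), (6.8, True), (6.0, False),
]
strict_cutoff = score_threshold_for_fdr(toy_psms, max_fdr=0.01)
loose_cutoff = score_threshold_for_fdr(toy_psms, max_fdr=0.25)
print("cutoff at 1% FDR:", strict_cutoff)
print("cutoff at 25% FDR:", loose_cutoff)
```

With a stricter FDR the cutoff moves up the ranked list, so fewer matches are accepted, but the top-scoring ones remain. This mirrors the observation that the stringent FDR lowers the total number of identifications without affecting the most confident ones.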
One novel protein (IP_637643) was discovered as an interactor of the RAF1 protein. According to the OpenProt website, this protein had not previously been detected by either MS or ribosome profiling (OpenProt v1.3). The protein is 46 amino acids long and yields only two unique peptides upon tryptic digestion. The peptide detected in the RAF1 AP-MS dataset (fraction 18) had a good-quality spectrum, as shown in Figure 4, and displayed a heavy-to-light ratio of 1.09. The protein is encoded in the NANOGNBP1 gene, which is a pseudogene of NANOGNB. The transcript (ENST00000448444), currently annotated as non-coding, was detected across several tissues according to the GTEx portal40. The protein contains a predicted functional domain associated with DNA binding (Gene Ontology GO:0003677)41.
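As a rough check on the observed ratio, the Python sketch below assigns a measured heavy-to-light ratio to whichever expected ratio of the design (1:1 for true preys, 1:4 for bead binders) is closer on a log2 scale. The nearest-ratio rule and the names used are illustrative assumptions. The sketch does not reproduce the scoring used in the original study or in the quantification workflow.

```python
import math

# Expected heavy-to-light ratios from the experimental design (1:1 and 1:4).
EXPECTED_RATIOS = {
    "true prey": 1.0,
    "non-specific binder": 0.25,
}


def closest_expected_class(observed_ratio: float) -> str:
    """Return the label whose expected ratio is nearest on a log2 scale."""
    if observed_ratio <= 0:
        raise ValueError("ratio must be positive")
    observed_log = math.log2(observed_ratio)
    return min(
        EXPECTED_RATIOS,
        key=lambda label: abs(observed_log - math.log2(EXPECTED_RATIOS[label])),
    )


label = closest_expected_class(1.09)  # ratio reported for the IP_637643 peptide
print(label)
```

Ratios are compared on a log scale so that a two-fold deviation above and below an expected value counts equally.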

Figure 1: Database choice for proteomics analyses chart. Analyses of MS data, notably the database choice, depend on the research objectives. Three common objectives are outlined in blue (classic proteomic pipeline), green (exhaustive proteomic search) and orange (proteomic discovery). Each objective depends on an appropriate database and pipeline. A single identification tool may be used for the exhaustive and classic proteomics pipelines. For the proteomic discovery pipeline, we strongly recommend using multiple identification engines. Recommended FDRs are indicated in red, and protein database sizes are indicated in grey boxes.

Figure 2: Graphical representation of the Galaxy workflow used. Step-by-step representation of the proteomic analysis workflow used for re-analysis of the Eyckerman et al. data38. Input files, peptide search, and protein quantification are indicated by orange boxes. Blue boxes correspond to the tools used and grey boxes correspond to the output files generated. The different search engines (MS-GF+ and X!Tandem) are indicated by different colors (red and purple, respectively), as are the arrows indicating their necessary inputs and outputs. The green box highlights the tool generating a list of protein identifications. When multiple outputs are generated, the one used for downstream steps is the one closest to the arrow. This workflow is freely available in Supplementary Material S2. The X!Tandem default parameters configuration file is available in Supplementary Material S4.

Figure 3: Comparison of interactor identification per bait using different databases. Venn diagrams of protein identifications using the most confident OpenProt database (in orange, supporting evidence of a minimum of 2 unique peptides, OpenProt_2pep) with a 1% FDR, or the whole OpenProt database (in blue, OpenProt_all) with a 0.001% FDR, or as reported in the original paper (in grey)38. Each diagram corresponds to identified interactors for the mentioned bait: RAF1, RNF41, PTPN14, JIP3 and IQGAP1.

Figure 4: MS/MS spectrum of the identified MDNLWAK(13C6) peptide from the novel protein IP_637643. Intensity is relative (0 to 100%). Selected peaks are indicated in red, y-ion annotations are in dark red, and b-ion annotations are in green. Extracted from the TOPPView software34. Precursor error = 2.70 ppm, PEP score = 0.12.
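For readers who wish to check a precursor error themselves, the Python sketch below computes the theoretical monoisotopic mass of MDNLWAK carrying a 13C6 label and defines the usual ppm error formula. The residue masses are standard monoisotopic values. The observed m/z and the charge state are not given in this legend, so the charge of 2 used below is an assumption for illustration and the function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the residues in MDNLWAK.
RESIDUE_MASS = {
    "M": 131.040485,
    "D": 115.026943,
    "N": 114.042927,
    "L": 113.084064,
    "W": 186.079313,
    "A": 71.037114,
    "K": 128.094963,
}
WATER = 18.010565        # H2O added for the peptide termini
PROTON = 1.007276        # mass of a proton
C13_SHIFT = 1.0033548    # mass difference between 13C and 12C


def peptide_mass(sequence: str, heavy_carbons: int = 0) -> float:
    """Neutral monoisotopic mass, with heavy_carbons 12C atoms replaced by 13C."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + heavy_carbons * C13_SHIFT


def mass_to_charge(mass: float, charge: int) -> float:
    """m/z of the protonated peptide at the given positive charge."""
    return (mass + charge * PROTON) / charge


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


heavy_mass = peptide_mass("MDNLWAK", heavy_carbons=6)
theoretical_mz = mass_to_charge(heavy_mass, charge=2)  # charge 2 is assumed
print(f"heavy peptide mass: {heavy_mass:.4f} Da")
print(f"theoretical m/z (2+): {theoretical_mz:.4f}")
```

Given an observed precursor m/z from the raw data, `ppm_error(observed_mz, theoretical_mz)` returns the deviation in the same unit as the precursor error reported above.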
| Term | Definition | Reference |
| --- | --- | --- |
| Alternative ORF (AltORF) | non-canonical ORF currently not annotated in genome annotations, but annotated in OpenProt. | 15 |
| Reference ORF (RefORF) | canonical ORF annotated in genome annotations and OpenProt. | 15 |
| Alternative protein (AltProt) | novel protein coded by an AltORF, with no significant similarity with a RefProt. Accession prefix: IP_. | 15 |
| Reference protein (RefProt) | protein currently annotated in protein sequence databases such as UniProtKB, Ensembl or NCBI RefSeq, and also in OpenProt. | 15 |
| Novel Isoform | novel protein coded by an AltORF, with a significant similarity with a RefProt. Accession prefix: II_. | 15 |
| OpenProt_2pep database | contains the sequence of all RefProts and novel proteins predicted by OpenProt, already detected with a minimum of 2 unique peptides. | 15 |
| OpenProt_1pep database | contains the sequence of all RefProts and novel proteins predicted by OpenProt, already detected with a minimum of 1 unique peptide. | 15 |
| OpenProt_all database | contains the sequence of all RefProts and novel proteins predicted by OpenProt. | 15 |
Table 1: Definition of terms used in OpenProt and throughout the protocol
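The accession prefixes defined in Table 1 make it straightforward to sort a list of identified proteins by category. The Python sketch below is a minimal illustration: it assumes that any accession without an IP_ or II_ prefix is a RefProt, which is a simplification, and all accessions other than IP_637643 are hypothetical placeholders.

```python
from collections import Counter


def classify_accession(accession: str) -> str:
    """Map an OpenProt accession to a protein category using its prefix."""
    if accession.startswith("IP_"):
        return "AltProt"
    if accession.startswith("II_"):
        return "Novel isoform"
    # Simplifying assumption: everything else is treated as a RefProt.
    return "RefProt"


# IP_637643 is taken from the text; the other accessions are placeholders.
accessions = ["IP_637643", "II_EXAMPLE1", "REF_EXAMPLE1", "REF_EXAMPLE2"]
counts = Counter(classify_accession(acc) for acc in accessions)
for category, n in sorted(counts.items()):
    print(category, n)
```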
Supplementary Material S1: Galaxy workflow for database handling. This will append the CRAPome and decoy sequences (reverse) to the input database. Output is a Fasta file.
Supplementary Material S2: Galaxy workflow for protein identification. This will identify proteins from a mass spectrometry data file using two search engines (MS-GF+ and X!Tandem). Each parameter can be tuned as desired before running the workflow.
Supplementary Material S3: Galaxy workflow for protein quantification using stable isotope labeling (SIL). This will identify and quantify proteins from a mass spectrometry data file using two search engines (MS-GF+ and X!Tandem). Each parameter can be tuned as desired before running the workflow.
Supplementary Material S4: X!Tandem default parameters configuration file. This XML file is necessary for running the X!TandemAdapter tool on the Galaxy platform.
Supplementary Material S5: Quantified proteins from iMixPro datasets. Data files from Eyckerman et al. (2016)38 were processed using OpenProt databases and quantified proteins are listed for each condition. Baits are PTPN14, JIP3, IQGAP1, RAF1 and RNF41. Gene names indicated in green correspond to proteins also identified in the original paper38. Gene names indicated in orange correspond to known interactors according to BioGRID that were not reported in the original paper. Gene names indicated in light blue correspond to novel proteins identified as interactors (the corresponding protein accession number is indicated in brackets). Gene names indicated in light grey and italics correspond to likely contaminants (keratin proteins).
Supplementary Material S6: Identified novel proteins from iMixPro datasets. Data files from Eyckerman et al. (2016)38 were processed using OpenProt databases and novel identified proteins are listed for each condition. Baits are PTPN14, JIP3, IQGAP1, RAF1 and RNF41. Protein accession numbers are listed, starting with II_ for novel isoforms of a known protein, and with IP_ for novel proteins from an alternative ORF (AltProt). The number of supporting peptides is indicated in brackets.